15:00:36 #startmeeting third-party
15:00:37 Meeting started Mon Oct 5 15:00:36 2015 UTC and is due to finish in 60 minutes. The chair is anteaya. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:38 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:41 The meeting name has been set to 'third_party'
15:00:48 hello
15:00:56 o/
15:01:01 hello rfolco
15:01:32 rfolco: I'm not sure if I know you, what ci account is yours?
15:01:36 hi anteaya
15:01:46 rfolco, pkvm ci
15:01:54 mmedvede: hello, how are you today?
15:02:14 anteaya, pkvm ci :)
15:02:16 anteaya: I am well, thank you
15:02:21 o/
15:02:34 anteaya: rfolco and I are on the same team
15:02:38 hi asselin__
15:02:44 mmedvede: glad to hear it
15:02:55 hi everyone
15:03:00 mmedvede: oh, I was just going to say I don't see pkvm ci listed: https://wiki.openstack.org/wiki/ThirdPartySystems
15:03:27 anteaya: IBMPowerKVMCI
15:03:33 oh
15:03:39 abbreviations :)
15:03:48 yeah I wouldn't remember that abbreviation
15:03:53 so thanks
15:04:02 what shall we talk about today?
15:04:28 does anyone have anything they wish to discuss?
15:04:54 hi everyone
15:05:01 hello aysyd
15:05:30 anteaya, aysyd is in my team too
15:05:35 wonderful
15:05:43 your team has a great turnout today
15:05:48 lol
15:05:57 how is your ci operating?
15:06:18 in terms of?
15:06:29 is it working as expected?
15:06:30 it is stable, but has some pypi problems at the moment
15:06:36 can't get packages
15:06:39 glad it is stable
15:06:43 that is a problem
15:06:49 is it a new problem?
15:07:02 started today
15:07:06 interesting
15:07:15 but upstream is also failing lots
15:07:23 upstream what?
15:07:37 upstream jenkins
15:07:56 do you have some urls of patches with jobs failing due to jenkins?
15:09:03 anteaya: I do not believe it is due to jenkins. I meant that some of the jenkins jobs have also started failing, along with the other third-party CI systems.
15:09:13 interesting
15:09:26 have you any logs that might point to what the issue is
15:09:47 if there is a problem that is causing jobs to fail I would be interested in finding out what that may be
15:10:09 anteaya: looking at xenproject ci http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz: [Errno 104] Connection reset by peer
15:10:16 during python package install
15:10:28 I don't know the specifics for today's problem, but in general pip packages and dependencies are changed without much criteria so they break CI jobs
15:10:36 same happens for us, e.g. http://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/nova/86/230186/4/check/check-ibm-tempest-dsvm-full/14e85a6/devstacklog.txt.gz
15:11:08 to me it looks like a load capacity problem
15:11:28 asselin__: is your CI working fine?
15:12:18 * asselin_ checks
15:12:30 o/
15:12:50 hi hogepodge, welcome
15:13:02 we are just looking at some pypi timeout issues
15:13:20 seems ok, a few random failures I need to check the details, but nothing major
15:14:08 Hi
15:14:12 hi welcome
15:14:36 so dstufft this meeting is for operators of ci systems that aren't openstack's but report to it
15:14:38 https://wiki.openstack.org/wiki/ThirdPartySystems
15:15:13 and everyone this is dstufft he works with a lot of python packaging issues and may have some ability to evaluate if pypi is experiencing load issues
15:15:22 mmedvede: can you share those log links again?
15:15:54 sure
15:15:56 thanks
15:16:11 #link IBM PowerKVM CI pypi timeout http://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/nova/86/230186/4/check/check-ibm-tempest-dsvm-full/14e85a6/devstacklog.txt.gz
15:16:33 #link citrix-xenserver ci http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz
15:17:14 I might have mislabeled citrix
15:17:17 dstufft: have you enough context that your presence in this meeting makes sense to you yet?
15:17:21 #undo
15:17:22 Removing item from minutes:
15:17:23 yea
15:17:27 dstufft: thanks
15:17:31 mmedvede: try again
15:18:39 Connection Reset by Peer would be coming from Fastly
15:18:47 Fastly's our CDN
15:18:54 there was an issue like this a while ago...
15:18:54 sec
15:18:56 mmedvede: I removed the last link from the meeting minutes, you can try the last link again
15:18:59 dstufft: thank you
15:19:32 #link XenProject CI check pypi timeout http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz
15:19:38 mmedvede: thank you
15:20:27 https://github.com/travis-ci/travis-ci/issues/2389
15:20:52 #link https://github.com/travis-ci/travis-ci/issues/2389
15:21:24 that says the issue was closed March 9th
15:21:35 Right, that was Travis-CI having a similar issue
15:21:44 they seemed to have resolved it by disabling ECN on their systems
15:22:19 * anteaya looks up ECN
15:22:35 From my memory, this problem seemed specific to a particular setup (e.g. it wasn't a widespread problem, but the people having the problem had it regularly)
15:22:52 and Fastly investigated it for a while and couldn't find anything on their end that seemed to be causing it
15:23:00 hmmmm
15:23:12 well it seems to be affecting at least two different operators
15:23:24 https://github.com/travis-ci/travis-ci/issues/2389#issuecomment-75292931 is the post where someone suggested the ECN option
15:23:41 and given the amount of folks running systems vs the amount of folks who talk to us I would multiply that by at least 10
15:24:24 dstufft: is this what you mean by ecn? https://en.wikipedia.org/wiki/Explicit_Congestion_Notification
15:24:31 yea
15:24:32 anteaya: looking at our internal scoreboard status, lots of third-parties are failing. Strangely, upstream jobs look fine, but I need to query logstash to make sure
15:24:52 #link ecn https://en.wikipedia.org/wiki/Explicit_Congestion_Notification
15:24:55 dstufft: thank you
15:25:03 mmedvede: hmmmm
15:25:03 I think it was particularly this bit "Rather than responding properly or ignoring the bits, some outdated or faulty network equipment has historically dropped or mangled packets that have ECN bits set."
15:25:23 does anyone know offhand (or can you look?) if you are using ecn?
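[Editor's note: for operators who want to check or try the Travis-style ECN workaround mentioned above, Linux exposes ECN through a sysctl. This is a configuration sketch only, not a confirmed fix for this incident, and the file name under /etc/sysctl.d/ is an illustrative choice.]

```
# Read the current ECN setting. Per the kernel's ip-sysctl documentation:
#   0 = ECN disabled, 1 = request and accept ECN, 2 = accept only (common default)
sysctl net.ipv4.tcp_ecn

# Apply the workaround Travis CI reported, for the running kernel (root required):
sysctl -w net.ipv4.tcp_ecn=0

# Persist it across reboots (illustrative file name):
echo 'net.ipv4.tcp_ecn = 0' | sudo tee /etc/sysctl.d/99-disable-ecn.conf
```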
15:25:46 Going by memory, I think the guess was that some hardware switch in between Travis and the Fastly POP was doing something bad with the ECN bits
15:26:05 mmedvede: yes, if you could query logstash with what you are seeing that would be great, thank you
15:26:17 dstufft: fair enough
15:26:31 I'm feeling that this connection reset by peer situation is new
15:26:35 It was never confirmed though, so it might be the problem
15:26:37 er
15:26:40 might not be the problem
15:26:41 mmedvede reported it started today
15:26:46 I'm happy to raise an issue with Fastly though
15:26:54 and I do think they have been running their ci for about a year
15:27:07 dstufft: that would be great, thank you
15:27:24 if they could take a look at what they are seeing from their end at the very least
15:28:21 are you able to curl -I pypi from that box and see what POP you're getting
15:28:22 so dstufft will speak with Fastly and mmedvede will look at logstash to see if upstream is seeing any of the same issues
15:28:32 mmedvede: can you do so?
15:28:38 it'll be in a header
15:28:39 curl -I pypi?
15:28:42 Served-By or so
15:28:46 curl -I https://pypi.python.org/
15:28:59 X-Served-By: cache-iad2138-IAD
15:29:02 like that
15:29:04 perhaps put it in paste?
15:29:11 unless it is one line
15:30:13 all the headers will be multiple lines, just that header will be one
15:30:55 does anyone have time to try that now?
15:30:57 FWIW https://github.com/pypa/pip/issues/2426 would make this not fail (or at least, not fail as much) I just haven't done it yet
15:31:58 dstufft: how can I help you have time to do that?
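[Editor's note: the curl check being discussed can be scripted. The sample header values below are copied from output pasted later in this meeting; the real values depend on which Fastly POP serves your node.]

```shell
# On a failing node, fetch only the response headers (-I sends a HEAD
# request, -s suppresses the progress meter):
#   curl -sI https://pypi.python.org/
# Then pull out the Fastly POP from X-Served-By. Parsing a captured
# response here, with sample values from this meeting:
headers='HTTP/1.1 200 OK
X-Served-By: cache-iad2142-IAD, cache-dfw1834-DFW
Content-Type: text/html'

printf '%s\n' "$headers" \
  | awk -F': ' 'tolower($1) == "x-served-by" { print $2 }'
# -> cache-iad2142-IAD, cache-dfw1834-DFW
```

The trailing airport codes (IAD, DFW) name the Fastly POPs, matching the data-center naming convention dstufft describes below.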
15:32:52 I've been working on Warehouse so I haven't been touching pip much, I can probably switch around to that sometime soon though
15:33:00 and it doesn't sound like any of our operators have time right now to look at their headers
15:33:04 X-Served-By: cache-iad2142-IAD, cache-dfw1834-DFW
15:33:11 asselin__: ah thank you
15:33:16 other option to mitigate is to run your own PyPI mirror nearby too
15:33:40 but we use our own pypi mirror now
15:33:44 dstufft: true but many of our smaller ci operators probably won't do that
15:34:00 oh wait
15:34:00 asselin__: ah okay would make sense then you are isolated from this issue
15:34:04 * anteaya waits
15:34:12 is that log you sent me from your own mirror?
15:34:24 asselin__ didn't send any logs
15:34:29 mmedvede sent logs
15:34:35 oh
15:34:36 durr
15:34:39 no, that's the output from the curl command you posted above
15:34:40 mmedvede: are you using a pypi mirror?
15:34:49 stupid Textual made their usernames the same color
15:34:52 dstufft: no it is a good question
15:34:57 dstufft: ah yeah
15:35:15 okay so header information to you, elbow you some time to work on pip
15:35:28 anything else we should discuss on this topic?
15:35:32 right now?
15:35:38 the curl info would be most useful from the boxes that are getting failures
15:36:01 Made a kibana query with "error: [Errno 104] Connection reset by peer" last 7 days, shows spike today.
15:36:05 dstufft: agreed, I'll work on getting that to you post meeting if it doesn't come up before the end of the meeting
15:36:10 mmedvede: okay thanks
15:36:17 mmedvede: can you curl for the pypi headers?
15:36:32 curl -I https://pypi.python.org/
15:36:40 anteaya: I probably can. It is not a 100% rate of failure though
15:36:47 the X-served-by header
15:36:51 understood
15:37:02 anteaya: not sure about mirrors, need to check
15:37:18 X-Served-By: cache-iad2145-IAD, cache-atl6226-ATL
15:37:40 I did see this issue today with our mirror. Never saw it before...perhaps it was being updated at the time? http://15.126.198.151/98/229998/2/check/lefthand-iscsi-driver-master-client-pip-vsa673-dsvm/c2a1270/logs/devstacklog.txt.gz#_2015-10-05_11_02_13_117
15:39:32 asselin__: you mean perhaps the pbr wheel was being updated at the time?
15:39:39 rfolco: thank you
15:40:08 anteaya, ATL means atlanta and DFW Dallas ? just curious
15:40:12 #link Kibana search for "connection reset" http://imgur.com/Y46OqHl
15:40:25 rfolco: yea
15:40:25 anteaya, yes
15:40:27 rfolco: I'd guess that too, I don't know for sure
15:40:39 asselin__: okay thank you, and I agree it is possible
15:40:44 https://www.fastly.com/network <- Fastly Locations
15:41:03 generally they use airport codes in their DC names
15:41:11 mmedvede: filter on build_status:failure as well, not all of those hits are failures
15:41:57 dstufft: so that header is the fastly location header, that is being used by the box to hit pypi
15:42:33 anteaya: Yea, pypi.python.org is a GeoDNS name that routes to your closest Fastly POP, Fastly is running Varnish which connects back to the PyPI servers and caches the result
15:42:52 anteaya: thank you. Added the filter. The same picture, only now 21 hits (probably 2 new just happened)
15:43:03 the Connection Reset By Peer is coming from between your computer and Fastly, if it was between Fastly and PyPI you'd get a 503 error instead
15:43:11 mmedvede: great, thanks
15:43:28 dstufft: good to know
15:43:52 okay so we will be interested to hear your response from fastly
15:44:03 if you could post to the infra mailing list that would be great
15:44:11 does that sound fair to everyone?
15:44:42 mmedvede: thanks for bringing this up
15:45:14 okay if I give hogepodge some airtime now?
15:45:30 +1
15:45:34 thank you
15:45:58 #topic openstack foundation trademark usage program for third party operators
15:46:16 hogepodge: care to share your thoughts on the current status of your work?
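[Editor's note: the logstash filter discussed above, combining mmedvede's search term with the build_status:failure filter, can be written as a single Kibana query string. The field names (`message`, `build_status`) are an assumption based on the fields mentioned in the meeting.]

```
message:"[Errno 104] Connection reset by peer" AND build_status:"failure"
```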
15:46:48 We're starting out with cinder drivers, since the third-party testing is pretty solid for that project.
15:46:49 he might be afk at the moment
15:46:53 ah here we are
15:46:57 great
15:47:09 go cinder!
15:47:25 Reaching out to companies that have existing drivers passing cinder-ci to get them started on the license program or update their current licenses.
15:47:33 wonderful
15:47:51 does anyone present have any questions for hogepodge?
15:48:35 If we make good progress (I have a bit of a backlog of pre-summit work, so I'll know more later this afternoon) we're going to require all storage drivers be passing cinder ci to carry the OpenStack Compatible mark.
15:48:49 okay lovely
15:48:50 what does that mean "passing cinder ci"
15:48:57 great question
15:49:14 Passing the set of tests that cinder requires to demonstrate driver-facing and user-facing apis
15:49:28 especially in the face of patches that may or may not work and intermittent failures
15:49:53 such as pypi, devstack changes, etc.
15:51:35 We don't yank a license because of a failed test. At renewal we would if the driver has not been passing ci for some time. Same thing for initial license. We want commitment to quality and community standards.
15:52:09 So we have discretion. Meaning if something breaks upstream we can be patient and understanding and work with both upstream and downstream devs to get things right.
15:53:33 also, which tests need to be required? Can any be skipped due to legitimate bugs external to the driver?
15:53:42 After the summit we want to work with the neutron team and vendors to come up with ci for network plugins, with the same idea of using community testing standards to drive the trademark program for network drivers
15:53:48 (Sorry to interject, Fastly got back to me. Asked if we could get an mtr from the failing machine to pypi.python.org)
15:53:54 I know there are quite a few encrypted volume bugs that affect drivers differently
15:54:01 dstufft: what is an mtr?
15:54:34 http://www.bitwizard.nl/mtr/
15:54:47 should be packaged in most distros
15:54:55 #link http://www.bitwizard.nl/mtr/
15:55:12 dstufft: thanks will work on getting this back to you post meeting
15:55:18 sorry hogepodge and asselin__
15:55:21 asselin__: we'd work with vendors and the dev team on problems like that.
15:56:03 asselin__: hogepodge: my understanding was that it was basically up to cinder to decide whether a driver was in sufficient shape for the mark (e.g. deciding what tests it needed to run, et cetera). is that accurate?
15:56:49 that puts even more pressure on the projects
15:57:05 fungi: more or less, yes. We feel like the community knows best as to what makes a compatible driver. Cinder has a fairly well-defined set of apis that all need to be implemented, so that makes the job easier in a lot of ways
15:57:13 and the projects are under considerable strain from the drivers as it is
15:57:39 I would argue neutron buckled from the pressure, which is why they revoked mandatory testing
15:58:16 not strain from the operators that attend meetings and comply
15:58:22 but from the ones who don't
15:58:29 and then are upset about it
15:58:34 agreed that does put additional responsibility on the projects. on the other hand having some body disconnected from the project deciding what should be tested (possibly in disagreement with what the project wants to see tested) could lead to different and arguably worse sorts of strain
15:58:57 oh as to what should be tested, the projects do know best
15:59:09 sounds like we should have a tool to help check
15:59:14 as to monitoring to say whether a given ci is running those tests?
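[Editor's note: the mtr report Fastly asked for can be produced non-interactively; flags per the mtr man page, run from the failing CI node. A command sketch, not verified against this incident:]

```
# mtr combines traceroute and ping: probe each hop 10 times and print a
# plain-text report suitable for attaching to the Fastly ticket.
mtr --report --report-cycles 10 pypi.python.org
```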
15:59:15 this seems like something projects should be able to opt into though, for sure
15:59:24 that is considerably more work
15:59:50 well if the foundation makes the testing mandatory to receive the mark, I don't see how anyone can opt either in or out
15:59:55 true, though last cycle cinder did that work admirably, ripping out untested drivers right and left
16:00:02 all of the conversations with neutron are preliminary. many members have expressed their concerns, but have also expressed that mark pressure could help bring the program back.
16:00:04 very much so
16:00:09 if neutron said "nope", we'd respect that
16:00:31 I'm saying that cinder took one path and neutron another
16:00:42 anteaya: i expect it's more that the mark won't exist if there's no interest from a project in helping shape and police it
16:00:45 and I fully understand why both projects made the decision they made
16:01:14 no interest has different definitions
16:01:26 interest in having things for marketing? very high interest
16:01:42 interest in doing the leg work so those things actually reflect value? very low
16:01:55 anyway
16:02:03 paths can be different too
16:02:06 we won't reach consensus today
16:02:08 not one-size-fits-all
16:02:11 hogepodge: that is good to know
16:02:20 that might work best then
16:02:24 rakhmerov: Error: Can't start another meeting, one is in progress. Use #endmeeting first.
16:02:35 as I don't think what works for cinder will work for neutron
16:02:46 rakhmerov: in a meeting just finishing up
16:02:49 excuse me, we're supposed to have a meeting
16:02:53 yep, thnx
16:03:10 thanks all for your kind attendance and participation
16:03:14 see you next week
16:03:17 #endmeeting