Tuesday, 2016-03-15

*** mordred has quit IRC00:31
*** mordred has joined #openstack-infra-incident00:33
*** ajmiller has joined #openstack-infra-incident04:08
*** ajmiller_ has joined #openstack-infra-incident04:09
*** ajmiller has quit IRC04:13
*** ajmiller_ has quit IRC04:21
-openstackstatus- NOTICE: Gerrit is going to be restarted11:13
-openstackstatus- NOTICE: Gerrit had to be restarted because it was not responsive. As a consequence, some of the test results have been lost, from 08:30 UTC to 10:30 UTC approximately. Please recheck any jobs affected by this problem.11:33
-openstackstatus- NOTICE: Gerrit had to be restarted because it was not responsive. As a consequence, some of the test results have been lost, from 09:30 UTC to 11:30 UTC approximately. Please recheck any jobs affected by this problem.11:35
*** ig0r__ has joined #openstack-infra-incident13:19
*** ig0r_ has quit IRC13:22
*** greghaynes has quit IRC13:22
*** greghaynes has joined #openstack-infra-incident13:38
*** ajmiller has joined #openstack-infra-incident14:17
-openstackstatus- NOTICE: Launchpad OpenID SSO is currently experiencing issues preventing login. The Launchpad team is working on the issue14:57
*** ChanServ changes topic to "Launchpad OpenID SSO is currently experiencing issues preventing login. The Launchpad team is working on the issue"14:57
*** jeblair has joined #openstack-infra-incident15:02
*** AJaeger has joined #openstack-infra-incident15:02
mordredhey jeblair15:03
jeblairfungi, mordred, yolanda: over here, i'd like to focus on 1) why the check job missed the failure (and why its output was SO LONG: http://logs.openstack.org/87/292187/3/gate/gate-project-config-layout/6641a40/console.html )15:03
*** clarkb has joined #openstack-infra-incident15:03
jeblairand 2) if something else is wrong with zuul enqueuing or reporting changes....15:04
fungi7.4m15:04
fungiyeah15:04
AJaegerWith 1000+ projects, that file is really long ;(15:04
jeblairso... someone saw a traceback on validating the zuul layout....15:04
jeblairwas that in production zuul or only locally?15:05
yolandayes, there was that traceback that fungi spotted on logs. Also on my local tests, zuul-server -t was failing15:05
yolandaremoving the kolla change made the tests pass locally15:05
jeblairso, do we think that the kolla change never ended up in the running zuul config because of the validation failure in production?15:06
yolandawe think that the kolla change caused the failure15:06
jeblairor is it possible it did end up in the running config, and that caused the false pipeline requirement failures?15:06
yolandathat one ^15:06
AJaegerAlso, why does tools/layout-checks.py not find the broken regex?15:06
yolandahowever, it has been forced, and i applied puppet and i still don't see any success. But i think that the reconfig has not been triggered15:07
jeblairyeah, that's question 1 :)15:07
fungii wonder if zuul is running with an incomplete config15:07
fungior was at the time15:07
jeblairfungi: that's a possibility worth considering as well15:07
fungithough it looks like this was raised during validation phase so shouldn't have been actually used?15:08
jeblairfungi: i _don't_ think it was during validation15:09
clarkbya reconfigure doesn't validate iirc15:09
jeblairi'm digging in to see where it would have bailed15:09
fungioh, this was actually loading. for some reason i keep thinking it does a validation pass prior to loading15:10
fungii'm checking to see what's going on with that job passing. first making sure i can replicate this failure locally with tox -e zuul15:13
yolandaalso i haven't seen any reload, although i launched puppet with the ansible play on that host15:14
clarkbloading that yaml I don't see any obvious quoting or escape errors15:15
clarkb[{'skip-if': [{'project': '^openstack/kolla.*$', 'all-files-match-any': ['^.*\\.rst$', '^doc/.*']}], 'name': '^gate-kolla(.*)?-(?!(docs$)).*$'}]15:15
fungiyeah, though it seems to be barfing on the name regex15:16
fungii think15:16
fungifrom the traceback15:16
clarkbyes I think so too, it is possible that this is maybe a raw string vs a string string problem? /me uses that string as loaded by yaml to test15:16
*** SamYaple has joined #openstack-infra-incident15:17
SamYapleo/ sorry for the trouble!15:17
*** nibalizer has joined #openstack-infra-incident15:17
nibalizerohai15:17
*** igorbelikov has joined #openstack-infra-incident15:17
jeblairSamYaple: no worries, this is supposed to be bulletproof15:18
clarkbthat compiles cleanly15:18
SamYaplerelevant paste shows the regex *should* be valid http://paste.openstack.org/show/490527/15:18
fungioh, i need the pillow build deps to be able to tox this15:18
clarkbSamYaple: yup my testing agrees with you15:18
jeblairfungi: ?15:18
SamYaplei do something special, which i think is breaking it, (.*)?15:18
clarkbSamYaple: even when loading it in from yaml and all the associated escapes and possible type conversions15:18
SamYapleyea clarkb i just did the same15:18
SamYaplebut my guess is the (.*)? is breaking it because it basically says match everything but everything may not exist15:19
SamYapleits valid, clearly, but i can see it breaking somewhere15:19
fungijeblair: to be able to tox -e zuul in project-config it pip installs pillow15:19
jeblairfungi: i don't understand why it would do that15:20
jeblairfungi: or to be more clear, i don't understand why that should be necessary15:20
fungijeblair: depends on what "that" is15:20
jeblairclearly...15:21
fungijeblair: are you saying you don't see why trying to reproduce this with tox -e zuul in project-config would be necessary?15:21
jeblairfungi: installing pillow15:21
fungii don't know what "that" is15:21
fungioh15:21
clarkbiirc sphinx depends on it15:22
yolandaok zuul was reconfigured now15:22
jeblairfungi: (to completely level reset: the task you are working on is of the utmost importance and i am saddened that it is being made difficult by seemingly unnecessary complication)15:22
yolandaremoval of that job stopped the error, and i can see a successful Reconfiguration complete now15:23
yolandaand changes being added to gate finally!15:24
fungijeblair: Collecting Pillow (from blockdiag>=1.5.0->sphinxcontrib-blockdiag>=1.1.0->-r /home/fungi/work/openstack/openstack-infra/project-config/.test/zuul/test-requirements.txt (line 4))15:24
clarkbSamYaple: the internets do say python has had bugs with nested repeating modifiers15:24
SamYapleclarkb: this was my thought as well15:24
clarkbSamYaple: with (.*)? the * and ? are nested and that could indeed be the trouble15:24
fungiso it's an indirect test requirement of zuul, which i guess we install in the virtualenv15:24
clarkbSamYaple: if you rewrite it to be just .* that should work15:24
SamYapleclarkb: i wanted to see the compiled pattern because im not sure if my pattern is used directly or not15:25
jeblairfungi: well, it should be a test requirement of zuul for building zuul docs...15:25
SamYapleclarkb: it should, but now im scared15:25
jeblairfungi: not sure how it ended up in the project config test...15:25
clarkbis zuul still precise?15:25
clarkbto the puppetboard fact list15:26
SamYapleohhh clarkb good question15:26
mordredyes15:26
clarkbit is15:26
SamYaplelet me check old python15:26
clarkbso it will have an older version of python15:26
SamYapledocker away!15:26
mordredmordred@zuul:~$ python --version15:26
mordredPython 2.7.315:26
clarkbya 2.7.6 is when it was supposedly fixed which is what trusty has15:27
clarkbthis is of course all on stackoverflow so take with a grain of salt15:27
nibalizerclarkb: ya 12.0415:27
SamYaplecoolest thing ever btw -- `docker run --rm -it ubuntu:12.04 bash`15:27
jeblairi'm guessing we're running that job on trusty now?15:27
SamYaplewill confirm15:27
jeblair2016-03-15 08:15:08.339 | Building remotely on ubuntu-trusty-osic-cloud1-8764915 (ubuntu-trusty) in workspace /home/jenkins/workspace/gate-project-config-layout15:27
SamYaplehttp://paste.openstack.org/show/490531/15:28
SamYapleconfirmed clarkb ^15:28
SamYaplelet me test a new regex15:28
jeblairreport from zuul internals: a) zuul does validate the config as part of the loading process in basically the same way it does in test config; this failure is past yaml validation and into "try to build the data structure" part of the reload / test config (it should fail in the same way both ways)15:28
clarkbSamYaple: just rewrite it to .* and that should be equivalent15:28
jeblairb) the connections changes have made zuul slightly more fragile if reloads break15:28
clarkbthe (.*)?15:28
SamYapleyea im testing it now clarkb15:28
SamYapleim trying to match mesos (and future projects too)15:29
jeblairc) that fragility, in this case, should only affect the timer trigger (basically, a reload failure at this point will break the timer trigger, probably until a successful reload)15:29
fungijeblair: looks like it was running on bare-trusty before https://review.openstack.org/285722 switched it to ubuntu-trusty, so it's been testing on trusty for a while15:29
yolandaso ... looking at the zuul layout job for the change of SamYaple in http://logs.openstack.org/87/292187/3/check/gate-project-config-layout/d3cd3a1/console.html15:29
yolandai see an <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'now'15:29
yolandaan exception on apscheduler15:30
yolandacan this cause a false positive?15:30
SamYapleclarkb: yea with use .* it doesnt match mesos. ill go the explicit route15:30
jeblairfungi: yeah, but aiui, we haven't tried to use this regex before now; so this was probably lurking since we changed the job to trusty15:30
fungiagreed15:30
clarkbSamYaple: is it because there is a -mesos?15:30
*** ChanServ changes topic to "situation normal"15:30
-openstackstatus- NOTICE: Launchpad SSO is back to normal - happy hacking15:30
SamYapleclarkb: yea. enough being fancy. ill just have to update this for any new projects we have15:31
clarkbSamYaple: the trailing .* should match the -somethings15:31
fungialso tox -e zuul is succeeding for me on a recent platform. gonna hold a bare-precise and try there15:31
SamYapleclarkb: im sure i can do this fancy (like i did before) but i _know_ i can do it explicitly, so i may as well15:31
clarkbSamYaple: re.match(r'^gate-kolla-(?!(docs$)).*$', 'gate-kolla-mesos') that works15:31
jeblairyolanda: it's a bug but should not cause any of the behaviors we are seeing15:32
SamYapleclarkb: gate-kolla-docs, gate-kolla-mesos-docs15:32
SamYaplethose are the excludes15:32
SamYaplepotentially in the future including gate-kolla-ansible-docs, i was trying to save some future commits15:33
clarkbhrm that makes it trickier15:33
jeblairfungi, yolanda: so i think we've answered question 1; i haven't found a connection between question 1 and question 2 (failed pipeline requirements checking) yet...15:35
jeblairzuul should have been running with the old config, aside from the fact that the timer triggers won't activate15:35
fungii'm going to try reenqueuing a change which was failing to enter the gate pipeline earlier just to confirm there wasn't something else going wrong15:37
jeblairfungi: cool, let me know which one15:37
jeblairfungi: preferably before you do it15:37
fungijeblair: openstack/releases 292539,315:38
fungiit's the one i quoted from the zuul debug log earlier15:38
fungiback in #openstack-infra15:38
jeblairfungi: cool go for it15:38
clarkbSamYaple: re.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-mesos') may be close15:38
fungifired the enqueue15:38
jeblairfungi: oh, i think 'zuul enqueue' may bypass pipeline requirements :)15:38
fungioh, maybe some of them at least. i know it still performs the mergeable check for gerrit but that may not have been an issue15:39
fungijeblair: Exception: Gerrit error executing gerrit query --format json --all-approvals --c15:40
fungiomments --commit-message --current-patch-set --dependencies --files --patch-sets15:40
fungi --submit-records 29253915:40
fungier, that's a bunch of stray newlines, sorry15:40
fungii'll get the full traceback into a paste15:40
SamYapleclarkb: yea ive been playing with it, but that doesnt work either :/15:41
SamYapletried a few variants15:41
clarkbSamYaple: that regex works on gate-kolla-docs, gate-kolla-mesos-docs, and gate-kolla-mesos15:41
clarkb(I am not super familiar with all the combinations here)15:42
fungijeblair: full enqueue traceback http://paste.openstack.org/show/490535/15:43
SamYapleclarkb: it does not work.15:43
SamYaplere.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-mesos-docs')15:43
SamYaplethat _shouldnt_ match, but it does15:43
SamYaplere.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-docs')15:44
clarkbSamYaple: I thought we don't want that to match15:44
clarkbwhich is why we exclude docs in the first place15:44
fungias for why the job is not catching this, i've confirmed it seems to be the difference between re.compile() behavior on the ubuntu 12.04 python 2.7 and the ubuntu 14.04 python 2.7 stdlib15:44
fungias suspected15:44
SamYapleclarkb: gate-kolla-mesos-docs and gate-kolla-docs _shouldnt_ match that regex, but gate-kolla-mesos-docs _does_ match.15:45
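The behavior SamYaple describes can be reproduced in a few lines. This is a minimal sketch, assuming a current Python 3 interpreter; the helper name `matches` is hypothetical and is not part of zuul or project-config. The optional group `(.*-)?` is allowed to match nothing, so the engine can skip it, evaluate the negative lookahead at `mesos-docs` (which does not start with `docs`), and accept the name.

```python
import re

# Candidate pattern quoted in the discussion above.
CANDIDATE = r'^gate-kolla-(.*-)?(?!(docs.*)).*$'


def matches(pattern, job_name):
    """Return True if job_name matches pattern, anchored at the start."""
    return re.match(pattern, job_name) is not None


# Print which job names the candidate pattern selects.
for name in ('gate-kolla-docs', 'gate-kolla-mesos-docs', 'gate-kolla-mesos'):
    print(name, matches(CANDIDATE, name))
```

Only `gate-kolla-docs` is rejected here, because there is no hyphenated segment for the optional group to skip over before the lookahead sees `docs`.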
fungiat least i do get the same failure/traceback from the layout check under tox on precise as our zuul server threw when reconfiguring15:45
fungibut not with newer platforms (ubuntu trusty, debian jessie)15:45
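A compile check of this kind can be sketched as follows. This is only an illustration, not the code in zuul or in tools/layout-checks.py; the function name `find_bad_patterns` and the deliberately broken second pattern are assumptions made for the example. As fungi's comparison suggests, the outcome depends on the interpreter, so such a check only protects production if it runs on the same Python version as the zuul server.

```python
import re
import sys


def find_bad_patterns(patterns):
    """Try to compile each pattern; return a list of (pattern, error message)."""
    failures = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            failures.append((pattern, str(exc)))
    return failures


# The first pattern is the job name regex quoted earlier in this log; the
# second is deliberately broken (unbalanced parenthesis) to show a failure.
patterns = [r'^gate-kolla(.*)?-(?!(docs$)).*$', r'gate-kolla-(']

# Report the interpreter version alongside any pattern that fails to compile.
print(sys.version_info[:3])
for pattern, message in find_bad_patterns(patterns):
    print('cannot compile %r: %s' % (pattern, message))
```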
clarkbSamYaple: re.match(r'^gate-kolla-(.*-)?(?!(docs))$', 'gate-kolla-mesos-docs') returns None15:45
jeblairfungi: i don't understand that error :/15:45
clarkboh derp thats without the trailing .*15:45
clarkbgah regexes15:46
jeblairfungi: i don't see anything corresponding in the gerrit log, and that command works for me manually...15:46
SamYapleclarkb: da15:46
fungijeblair: well, it's also ugly because i apparently copied a bunch of stray newlines into the paste15:48
fungihard to tell the difference between where i've pasted in line breaks and where the input form for paste.o.o is wrapping long lines15:49
jeblairfungi: the same exception is in the debug log, but also does not make sense (and it's not because of the newlines)15:49
fungiahh, yeah i didn't start pulling up the zuul debug log again for that error15:49
clarkbjeblair: is gerrit executing the command but returning a non zero return code?15:50
jeblairclarkb: not when i try it manually15:50
jeblairwe can update logging with a reconfigure....15:51
jeblairso i think we should turn on zuul gerrit debug log, reconfigure zuul, then try fungi's enqueue again15:51
jeblairyolanda: you did not disable puppet, correct?15:51
fungiand yeah, that gerrit query does work for me with my account, at least15:52
yolandano, is enabled15:52
jeblairok, i am disabling puppet on zuul now, and will manually update the log config and request reconfiguration15:52
fungisounds good15:54
jeblairfungi: okay, go for it15:54
fungirerunning now15:54
fungiand... no traceback15:54
fungiheisenbug?15:54
jeblairof course not15:54
fungile sigh15:55
fungizuul also updated the change with a "starting gate jobs" comment that time, so seems to have worked15:56
jeblairfungi: do you have another change handy that failed requirements earlier where we can workflow it in gerrit?15:56
SamYapleclarkb: i ended up with "^gate-kolla.*(?<!docs)$"15:58
SamYaplegate-kolla prefix, anything not ending in docs15:58
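The pattern SamYaple settled on can be checked against the job names mentioned in this log. This is a minimal sketch, assuming a current Python 3 interpreter; the helper name `is_selected` is hypothetical. The negative lookbehind `(?<!docs)` is evaluated at the end of the string, so any name whose last four characters are `docs` is rejected without needing an optional group.

```python
import re

# Pattern from the message above: gate-kolla prefix, anything not ending in docs.
FINAL = r'^gate-kolla.*(?<!docs)$'


def is_selected(job_name):
    """Return True if the pattern selects job_name."""
    return re.match(FINAL, job_name) is not None


# Job names taken from the discussion; the -ansible-docs name was only
# mentioned as a possible future job.
for name in ('gate-kolla-docs', 'gate-kolla-mesos-docs',
             'gate-kolla-ansible-docs', 'gate-kolla-mesos'):
    print(name, is_selected(name))
```

Of these four names, only `gate-kolla-mesos` is selected.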
fungijeblair: that was the only example i was working from but there were several other users reporting similar behavior in channel around the same time16:01
fungijeblair: i'm digging some up now16:01
fungiSamYaple: fwiw, something like https://review.openstack.org/293015 would have prevented the original issue from merging16:02
fungibased on some manual tests16:02
jeblairAJaeger: hi, can you not leave recheck comments on those? :)16:03
SamYaplefungi: yes it would have16:03
fungijeblair: 292074 291762 292521,1 are a few more i see mentioned in scrollback16:04
jeblairfungi: rechecked, merged, merged :(16:05
fungioh16:05
jeblairi need one where i can actually reproduce what people were claiming :(16:05
fungimorgan mentioned 292653 which may fit the profile16:08
fungiit's not enqueued but was approved at 10:26 utc today16:08
fungi292653,116:08
fungithough that may have been around the time yolanda and jhesketh were trying to fix things by restarting gerrit16:09
SamYaplefyi -- https://review.openstack.org/293021 the pastes there show its validated and shouldnt cause this problem again16:09
yolandaon gerrit restart, several changes were lost16:09
SamYaplethanks everyone, im headed out now. ping me if there are more issues16:10
fungilooks like 292653,1 was prior to the gerrit restart even16:10
fungiSamYaple: thanks!16:10
jeblair2016-03-15 00:20:19,240 DEBUG zuul.DependentPipelineManager: Change <Change 0x7f4fc785fd10 292653,1> can not merge, ignoring16:10
jeblair2016-03-15 00:20:19,240 DEBUG zuul.DependentPipelineManager: Change <Change 0x7f4fc785fd10 292653,1> is not ready to be enqueued, ignoring16:10
fungijeblair: those log entries are too old i think. we're looking for when it handled the workflow +1 at 10:26 utc16:11
jeblairfungi: ah, then it's probably a missed event16:12
jeblaircause that's the last16:12
fungiso we likely need to find one which was approved after the gerrit restart solved whatever was going on with the event queue (which presumably was a distinct issue from the zuul config problem)16:14
fungigerrit start time seems to have been 11:13 utc16:17
fungii need to take a quick break to do morning stuff which i didn't get to because i saw everything was on fire when i woke up and now it's afternoon16:23
fungiwill brb16:23
fungiCOFFEE16:23
jeblairi'm still looking for a patch that exhibits the reported problem; i have not found one16:24
fungijeblair: morgan mentioned in the infra channel another odd inconsistency in gerrit which could be query-related/impacting16:25
jeblairokay i give up16:26
fungispecifically, the resulting list at https://review.openstack.org/#/q/project:openstack/stackalytics+status:open is not showing the code review and workflow votes for his "Change company affiliation" patch which can be seen in the details at https://review.openstack.org/#/c/292653/16:26
jeblairi've gone through *a lot* of changes16:26
jeblairand so far, the only ones that *might* have reproduced the issue16:26
jeblairhave been rechecked by AJaeger16:26
jeblairi'm going to put zuul's logging configuration back16:27
fungioh, nevermind, this one may be some sort of startup race16:27
fungilooks like the approval was added the same moment gerrit was being started16:27
jeblairAJaeger: the next time there is a problem with zuul, would you please consult with people who are working on debugging it before leaving 'recheck' comments?  they destroy any chance we have to try to understand the problem and fix it16:28
jeblairi don't think there's anything left to do here, so i'm going back to -infra16:29
fungii retract my earlier statement now that i have some coffee in me16:52
fungii was comparing the approval time on that change to itself because i pasted the wrong time i was looking at earlier. the start time on gerrit is 11:13 utc, so that change was approved a solid 45 minutes before gerrit was restarted, implying we were indeed missing events16:53
fungibut i agree, we can move this back to #openstack-infra16:54
AJaegerjeblair: understood, sorry about those rechecks18:34
*** ig0r_ has joined #openstack-infra-incident20:04
*** ig0r__ has quit IRC20:06
*** AJaeger has left #openstack-infra-incident20:52
*** SamYaple is now known as NotSamYaple20:52
*** NotSamYaple is now known as SamYaple20:53
*** dhellmann has joined #openstack-infra-incident23:07
