Saturday, 2020-04-25

<openstackgerrit> Merged opendev/system-config master: Add nodepool node key
<openstackgerrit> Merged opendev/system-config master: Remove an extra backslash
00:26 <clarkb> 723046 collected the zuul.conf and it looks good but the job failed
00:27 <clarkb> it's trying to run zuul executor under python2
00:28 <clarkb> mordred: ^ fyi I'll see if it is obvious why after dinner
00:42 *** DSpider has quit IRC
01:03 *** mlavalle has quit IRC
01:06 *** elod has quit IRC
<openstackgerrit> Clark Boylan proposed opendev/system-config master: The package is libjemalloc1
<openstackgerrit> Clark Boylan proposed opendev/system-config master: Fix zuul.conf jinja2 template
<openstackgerrit> Clark Boylan proposed opendev/system-config master: Use pip3 to install zuul on executors
01:15 <clarkb> mordred: ^ I think that will fix it
01:45 <clarkb> infra-root we are still seeing merge failures, but these appear to be due to git problems
01:46 <clarkb> also the http gerrit driver retries after receiving HTTP 409 errors when submitting votes. I believe that is happening because we were trying to vote on merged changes; we can probably avoid retrying when HTTP 409 is the return code
01:47 <clarkb> I don't understand the git problems. permissions look fine. Is it related to the thing corvus said we are now doing to curate those repos, possibly?
01:51 <clarkb> hrm, the thing tobiash did for gc'ing only sets git gc options
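The exact gc options that change sets aren't quoted anywhere in this log, but the mechanism clarkb is describing can be sketched: per-repo gc tuning is written to the repository's local config. The option values below are illustrative, not taken from the actual patch.

```shell
# Illustrative only: the specific gc options set by the change are not
# shown in this log. Per-repo gc tuning is written into .git/config, so
# it persists in the repo even after the code that set it is reverted.
git init -q /tmp/gc-demo
git -C /tmp/gc-demo config gc.auto 512          # gc after 512 loose objects
git -C /tmp/gc-demo config gc.autoDetach false  # run gc in the foreground
git -C /tmp/gc-demo config --local --get-regexp '^gc\.'
```

That persistence matters later in this log: reverting the code alone does not undo settings already written into each cached repo.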
01:53 <fungi> rootfs on ze01 is close to full, but not quite, so enospc probably isn't it
01:54 <fungi> oh, this is saying SHA could not be resolved, git returned: ''
01:55 <clarkb> fungi: I think ze03 was doing this before we did any updates to zuul
01:55 <clarkb> fungi: so it's possible this was happening previously and we are just noticing now because merge failures are a red flag right now
01:55 <clarkb> however it seems to definitely be doing it more in the fallout of the work
01:56 <clarkb> and it doesn't seem like we've done it in the last half hour or so?
01:56 <clarkb> the stack ending at is green now after my fix
01:56 <clarkb> (which is why I returned to the computer, so maybe I should call it a day for now since the merge issues seem to have settled out?)
01:57 <clarkb> er, caps the stack
01:57 <fungi> where does zuul keep original copies of git repos to clone into workspaces? /var/lib/zuul/git seems empty
01:57 <clarkb> fungi: /var/lib/zuul/executor-git I think
01:57 <fungi> oh yep, it's /var/lib/zuul/executor-git/
01:58 <fungi> too bad that error doesn't say what sha it wanted to reset its state to
01:58 <clarkb> fungi: it's possible that is logged earlier if you grep on the build or event id
01:59 <fungi> yeah, likely
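A sketch of the kind of grep clarkb suggests. The log path, event id, and line format below are made up for illustration: executor debug logs tag lines with a build/event id, so filtering on it recovers the whole sequence, including earlier lines that name the ref being reset.

```shell
# Hypothetical log file, event id, and line format, for illustration only.
LOG=/tmp/executor-debug.log
cat > "$LOG" <<'EOF'
2020-04-25 01:54:01,000 DEBUG zuul.Repo: [e: abc123] Resetting repository /var/lib/zuul/executor-git/example/project
2020-04-25 01:54:02,000 ERROR zuul.Merger: [e: abc123] SHA could not be resolved, git returned: ''
2020-04-25 01:54:03,000 DEBUG zuul.Merger: [e: def456] some unrelated event
EOF
grep 'abc123' "$LOG"   # pulls out every line for that event, in order
```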
02:00 <clarkb> fungi: what might be interesting is to recheck that change
02:00 <clarkb> to see if that is a persistent issue with that repo
02:00 <clarkb> but I need to pop out again. likely until tomorrow
03:13 *** rchurch has quit IRC
05:09 <AJaeger> infra-root, 722540 still shows merger-failures, I just rechecked ;(
06:44 *** sgw has quit IRC
06:47 *** DSpider has joined #opendev
07:27 <AJaeger> we have new config-errors for openstack-tenant - how could those get introduced: "bad escape \l at position 3"
07:54 <AJaeger> seems like a new Zuul check - I'm fixing...
08:21 <AJaeger> infra-root, the daily periodic job has failures
08:21 <AJaeger> and accessbot gets skipped
08:21 <AJaeger> do we need these jobs daily and hourly?
<openstackgerrit> Andreas Jaeger proposed openstack/project-config master: Disable ovn pypi jobs temporarily
<openstackgerrit> Andreas Jaeger proposed openstack/project-config master: Revert "Disable ovn pypi jobs temporarily"
09:36 *** tosky has joined #opendev
09:59 *** dpawlik has joined #opendev
10:16 *** avass has quit IRC
10:28 *** elod has joined #opendev
10:36 *** dpawlik has quit IRC
12:54 <AJaeger> config-errors are fixed, horizon merged the changes quickly...
<AJaeger> infra-root, now zuul tenant has new errors:
<openstackgerrit> Andreas Jaeger proposed openstack/project-config master: Fix some config errors in zuul-tenant
13:01 <AJaeger> proposed fix ^
13:15 <mnaser> I'm getting some non-deterministic behaviour with zuul saying changes have merge conflicts
13:16 <frickler> mnaser: there was an issue with that yesterday evening, but I thought that should be fixed now
<mnaser> and another example on the first patch set:
13:16 <mnaser> frickler: yeah, this is happening now
13:16 <frickler> humm, that looks fresh :( infra-root ^^
13:17 <mnaser> and the parents of those patches are very much the tip of the branch
13:17 <mnaser> so it wouldn't make sense (especially for the second patch, revision 1) to say that...
13:18 <mnaser> oh I just saw a MERGER_FAIL in zuul status for the VEXXHOST tenant
13:18 <AJaeger> frickler: 722540 is also recent, see back scroll
13:18 <mnaser> I'm also seeing a bunch of retries. Is it possible that a specific executor is unhappy?
13:18 <AJaeger> mnaser, so, probably fallout from yesterday and needs more work - best to wait for others to fix
13:19 <AJaeger> mnaser: I guess ;)
13:19 <mnaser> cool. just wanted to make sure there's eyes on it
13:19 <mnaser> I've been running into this issue in my own installation
13:19 <mnaser> I wonder if it is a recent zuul bug
13:21 <AJaeger> mnaser: might be - I think Zuul was restarted as well
13:22 <AJaeger> time for some cake - bbl
13:22 <mnaser> yeah, it hadn't been restarted for a while before this and I've been dealing with this issue too.
13:22 <mnaser> AJaeger: yum!  Have fun!
13:25 <fungi> (see scrollback) clarkb spotted one of those in the log last night when he was signing off:
13:25 <fungi> ValueError: SHA could not be resolved, git returned: b''
13:26 <fungi> i can't easily troubleshoot at the moment, though i looked at the executor last night and didn't see anything wrong. i was also having trouble working out from the log what ref it was trying to reset to
13:27 <fungi> at the time we thought it could just be a broken repo, as some of the executors had been logging this before the switch to containers/ansiblification
13:30 <fungi> but if similar errors are being raised for other repos then maybe it's more widespread
13:31 <fungi> mnaser: do you have corresponding tracebacks in your logs? maybe this is a recent regression which came in with our restarts
13:32 <fungi> which would make sense... clarkb said he saw ze03 logging problems like this before yesterday's maintenance... and ze03 was rebooted unexpectedly by the provider a few days ago because of a hypervisor host problem
13:37 <frickler> this would be the latest change to the merger code that might have an issue
13:37 <frickler> at least it heavily touches the code in clarkb's backtrace
13:49 <fungi> merged 2020-04-15 12:46:09
13:50 <fungi> possible ze03 was the first merger restarted after that
13:51 <fungi> mnaser: if that's what you're also finding in logs for your deployment, how does your timeline match up?
13:53 <mnaser> fungi: let me check logs!
13:59 <mnaser> fungi: i don't seem to have logs going back that far, but in terms of behaviour it's certainly similar to what i was seeing
<mnaser> fungi: perhaps we should issue a notice -- looks like it's happening (often) enough:
14:14 <AJaeger> like: status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear.
14:15 <AJaeger> infra-root ^
14:15 <mordred> AJaeger: ++
14:16 <AJaeger> #status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear.
14:16 <openstackstatus> AJaeger: sending notice
14:16 <AJaeger> morning, mordred !
14:16 -openstackstatus- NOTICE: Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear.
14:16 <mordred> mnaser, frickler: tobiash has been running that merger patch for a few months I think - but it also seems reasonable to imagine it might be involved
14:16 <mnaser> infra-root: oh also it seems to be hitting some repos harder than others. it may be useful to grab a backup of an affected repo from a merger so we have something to inspect, in order to be able to build tests off of it
14:16 <mordred> morning AJaeger !
<AJaeger> mordred, fungi, could either of you review and
14:17 <mnaser> just in case it's somehow transient and it disappears and comes back later
14:17 <mordred> do we know if we've seen it from both mergers and executors or only from executors? asking because different python versions are involved
14:18 <corvus> oh look an alarm
14:18 <mnaser> don't think anyone dug that deep yet mordred -- but the only captured traceback earlier was from ze01
14:18 <mordred> corvus: morning!
14:19 <openstackstatus> AJaeger: finished sending notice
14:21 <AJaeger> corvus: sorry, didn't want to wake you up ;)
14:21 <AJaeger> morning, corvus !
14:22 <frickler> mnaser: do you have some list of affected repos? my current guess is it affects only repos that have just a master branch
14:23 <mnaser> frickler: vexxhost/openstack-operator is one of mine that's being hit, openstack/openstack-helm-images was in the traceback that fungi provided earlier from zm01 -- the link i posted earlier with jobs that hit MERGER_FAILURE lists the bunch
14:24 <mordred> kayobe has branches
14:24 <corvus> and some of those failures are on non-master branches
14:24 <AJaeger> keystone has branches
<openstackgerrit> Merged openstack/project-config master: Fix some config errors in zuul-tenant
14:26 <mnaser> fwiw i think this is a zuul issue so maybe we can discuss in #zuul -- looks like there was a revert proposal so perhaps it was observed in another environment too
<openstackgerrit> Merged openstack/project-config master: Disable ovn pypi jobs temporarily
14:27 <corvus> well, we need to decide what opendev should do
14:28 <corvus> should we attempt more interactive debugging, or restart with a revert of the gc change (then, if it still happens, revert the other merger change)?
14:29 <mnaser> corvus: i suggested capturing a copy of the git repo on-disk from some of the mergers that are posting tracebacks so we can actually use those later to possibly create tests and simplify investigation
14:29 <mnaser> and i guess, depends on how much work people want to do on a saturday :)
14:30 <corvus> has anyone looked into this enough to know if we have enough debugging info to reconstruct the series of git operations that got to this point?
14:30 <mordred> corvus: restarting with the revert is fairly easy to do on a saturday morning - although - if that patch was the cause of the issue it could have done something to the repo so that we'll still see the issue on those repos after a restart - but might not see more repos be impacted
14:30 <corvus> i only see a traceback pasted, no merger debug log lines
14:30 <mordred> corvus: I do not believe we have gotten that far, no
14:31 <corvus> is this only happening on executors or mergers as well?
14:31 <mnaser> unknown as far as reported here, mordred asked the same earlier too
14:32 <mordred> I do think capturing a copy of a couple of repos showing the issue wouldn't be a bad idea
14:32 <tristanC> it would be good to know what the merger logged before the traceback, there should be `Reset` and `Delete` messages that could help understand the issue
14:33 <mordred> I have grabbed a copy of openstack-helm-images on ze01 (which is the one referenced in the traceback) - just in case
<corvus> tristanC: yeah, i just grabbed that
14:35 <tristanC> corvus: arg, it seems like it's deleting empty dirs too aggressively
14:38 <mordred> we are running git 2.7.4 on the executor
14:38 <mordred> which is < 2.13 - so would cause the "cleanup empty dirs" logic to run
14:38 <mordred> I don't see any exceptions in the debug log on zm01
14:39 <mordred> mergers have git 2.20.1
14:39 <mordred> in case those data points are useful
14:39 <mnaser> (we seem to be on to something from those logs in #zuul :)
14:39 <corvus> mordred: yes, all helpful
14:43 <mordred> corvus, tristanC: I checked - no Exception in any merger logs
14:44 <mordred> (which seems to jibe with the #zuul analysis)
14:44 <corvus> mordred: looking at the code, i think this is unlikely to trip on the mergers because they generally don't see remote refs deleted
14:45 <corvus> i also think that means that the executor git repo cache is probably fine
<AJaeger> mordred: infra-prod-remote-puppet-else failed, see
14:45 <corvus> these errors are probably only happening on the transient build repo
14:49 <mordred> corvus: that's good news
14:49 <mordred> AJaeger: thanks - will investigate
14:50 <mnaser> so given zuul is in emergency, we'll probably have to manually roll this out across the mergers once this lands?
14:53 <corvus> yeah, though i think i just approved the outstanding changes.  but i guess they could run into this error
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Use python3 for ansible
14:54 <corvus> if the zuul change lands first, we can use ansible to run "git pull --ff-only" and "pip3 install ."
14:54 <mordred> once the zuul patch lands, we can manually run the service and then do a restart of the executors
14:55 <mordred> corvus: fwiw - I noticed the warnings while doing ad-hoc ansible to the mergers to check for exceptions in the logs ... made that ^^
14:57 <mordred> mnaser: while I've got you here, could I get you to +A ? we landed the corresponding system-config change yesterday
14:59 <AJaeger> mordred: want to merge and as well? in case mnaser wants to review more....
14:59 <mnaser> mordred: i am not fully understanding how that job authenticates against eavesdrop
14:59 <mordred> AJaeger: yeah - both of those would be good as well - although I'm most concerned about 721099 getting in so we're not in an inconsistent state
15:00 <mnaser> there's no secret, the job has no parent
15:00 <mordred> mnaser: currently there is a "jenkins" user on eavesdrop and we connect to it with the private ssh key - in the new system, we're using the zuul per-project ssh key - we've added the corresponding public key from project-config to a zuul user on eavesdrop
<mordred> mnaser: here, - zuul-user role, and here:
15:02 <mnaser> mordred: maybe this is new for me but zuul has an ssh key per project? :p
15:02 <mnaser> if so, really cool and TIL
15:02 <mnaser> oh it does
<mordred> and here:
15:03 <mordred> mnaser: yes it does!
15:03 <mnaser> that's really super cool
15:03 <mnaser> ok so this explains a lot then!
15:03 <mordred> mnaser: so you do an add_host in a playbook, and zuul will try to ssh to those hosts by default with a per-project key
15:03 <mordred> mnaser: yeah - we're using it for the new zuul-driven-cd we're doing
15:04 <mnaser> mordred: cool, i have a few uses for that too then :)
15:04 <mordred> we do add_host with (our bastion host) - but on bridge we've put in the public keys for the system-config and project-config repos - so only jobs triggered by those repos can ssh in
15:04 <mordred> it's pretty amazeballs
15:06 *** ildikov has quit IRC
15:06 *** jroll has quit IRC
15:06 *** ildikov has joined #opendev
15:07 *** jroll has joined #opendev
15:12 <mnaser> oh and only added in post-review
15:12 <mnaser> even nicer.
15:13 <corvus> there's a tenant ssh key in the design, but we haven't implemented it yet
15:15 <mordred> AJaeger: the puppet error you reported is because we replaced a dir with a symlink and rsync is unhappy about that
<openstackgerrit> Merged openstack/project-config master: Use zuul deployment keys for yaml2ical
15:17 <mordred> corvus, fungi: I think to fix that I need to run an ad-hoc ansible to remove the old dir and then re-run the puppet
15:19 <mordred> infra-root: I'd like to run that
15:20 <mordred> then the next normal runs should be able to rsync state appropriately
15:21 <mordred> (this is also the error in service-nodepool atm fwiw)
<openstackgerrit> Merged opendev/system-config master: Use pip3 to install zuul on executors
<openstackgerrit> Merged opendev/system-config master: The package is libjemalloc1
<openstackgerrit> Merged opendev/system-config master: Fix zuul.conf jinja2 template
15:30 <mordred> corvus: ^^ I think that's all of the things
15:30 <fungi> okay, back from the wilds of my yard-jungle for a breather, before i have to head back into battle
15:31 <mordred> corvus: want me to remove from emergency and re-run the zuul service playbook?
15:31 <fungi> caught up on scrollback, and sounds like you do think it's a recent zuul regression?
15:32 <mordred> fungi: if you have a sec, I'd like to run to fix an issue with a dir changing to a symlink and making rsync sad
15:32 <corvus> mordred: sure
15:32 <mordred> fungi: and yes - we've approved a presumptive fix, but are still trying to make a reproducing test
<openstackgerrit> Merged openstack/project-config master: Run accessbot script on accessbot channels update
15:32 <fungi> mordred: yep, i concur, that sounds sensible for the rsync symlink problem
<openstackgerrit> Merged openstack/project-config master: Start building focal images
15:32 <fungi> and 723104 looks like the probable fix we'll try for the executors?
15:33 <mordred> corvus: the commits are on bridge
15:33 <fungi> and the outstanding fixes from yesterday's unplanned maintenance all just merged
15:33 <corvus> mordred can run the playbook now before it lands and we can confirm everything we fixed yesterday is still fixed
15:33 <corvus> mordred: what commits?
15:33 <mordred> I have removed zuul from the emergency
15:33 <mordred> the ones we just landed
15:33 <mordred> the system-config ones
15:33 <corvus> mordred: ah you mean they have propagated, great thx
15:33 <fungi> 723058, 723023 and 723046 for system-config
15:34 <mordred> running service-zuul
15:35 <fungi> not from the screen session i guess
15:35 <mordred> oh - no, that would have been more smarter
15:36 <fungi> no worries, i'll just use my imagination ;)
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Cron module wants strings
15:38 <mordred> fungi, corvus: everything has run successfully
15:38 <fungi> and that cron patch is something you just noticed when running i guess?
15:38 <mordred> zuul.conf on zm01 looks correct
15:39 <mordred> as is its parent
15:39 <mordred> so - non-urgent - just responding to nagging from ansible
15:39 <mnaser> fyi zuul api is replying with 500s
15:39 <mordred> that's non-awesome
15:40 <mnaser> we'll probably have a trace in zuul-web somewhere
15:40 <mordred> gear.GearmanError: Unable to submit job to any connected servers
15:41 <mnaser> ok so either gearman has pooped out or maybe ca/crt/keys are not properly setup/added into the zuul-web container?
15:42 <corvus> the scheduler is also not connected to gearman
15:42 <corvus> zuul     22260  1.1  0.0      0     0 ?        Z    Apr24  11:50 [zuul-scheduler] <defunct>
15:42 <corvus> it's probably that process
15:42 <fungi> was just about to mention that
15:42 <fungi> yeah, probably the geard fork
15:43 <corvus> this is likely an unrecoverable error and we will need a full restart and to restore from a queue backup
15:43 <corvus> i'm really curious what could have caused it
15:43 <mordred> yeah. me too
15:43 <fungi> /var/log/zuul/gearman-server.log is empty
15:44 <fungi> the rotated log has some tracebacks related to statsd though
15:44 <mordred> corvus: we have a "reload zuul scheduler" handler in the ansible that fires when main.yaml changes
15:44 <fungi> AttributeError: 'Server' object has no attribute 'statsd'
15:44 <yoctozepto> I see you already know what I was about to say :-)
15:44 <mordred> corvus: it runs "docker-compose kill -s HUP scheduler"
15:45 <fungi> but those stop at 2020-04-24 22:42:48,486 which i think was when we restarted
15:45 <mordred> corvus: perhaps that is not doing the right thing?
15:45 <corvus> mordred: yeah, for example, if that sends a HUP to all processes in the container, that's bad
15:45 <mordred> I betcha it does
15:45 <fungi> it's a good bet
15:45 <mordred> we probably want to do an exec scheduler kill
15:45 <corvus> the old version of this only HUP'd the actual scheduler process (via the pidfile)
15:46 <mordred> to run the kill inside the container
15:46 <corvus> how about
15:46 <corvus> we run zuul-scheduler smart-reconfigure
15:46 <mordred> ooh - that's a great idea
15:46 <corvus> or whatever reconfigure we intend to run :)
15:46 <fungi> far safer than process signals anyway
15:46 <corvus> i think smart-reconfigure is probably what we want
15:46 <mordred> smart-reconfigure seems like a great choice
15:47 <mordred> patch coming
15:47 <mnaser> i guess for now we'll have to backup/restore to restart the scheduler
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP
15:47 <fungi> we should have queue backups, unless the cronjob fix wasn't already in place
15:47 <corvus> it was definitely in place :)
15:47 <mordred> corvus: I should probably put the scheduler back in emergency until that lands, yes?
15:48 <corvus> whether it works i dunno
15:48 <corvus> mordred: would anything else trigger it?  i think we can probably just proceed assuming that's the next thing that will land and therefore be okay
15:48 <mordred> -rw-r--r-- 1 root root    825 Apr 25 15:48 openstack_status_1587829681.json
15:48 <mordred> corvus: kk
15:48 <mordred> looks like we have a backup
15:48 <fungi> 15:48 is probably too new?
15:48 <corvus> mordred: on second thought, i guess any old project-config change could trigger this, so maybe emergency is a good idea
15:49 <mordred> looks like we have a 500
15:49 <mordred> corvus: kk. doing
15:49 <mnaser> file size will probably be different
15:49 <corvus> mordred: 107 needs a fix
15:49 <mordred> -rw-r--r-- 1 root root 376412 Apr 25 15:37 openstack_status_1587829021.json
15:49 <mordred> looks better
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP
15:51 <corvus> i'm not sure how we're supposed to use those json files now
15:51 <corvus> cause the changes script does api calls
15:51 <corvus> i think we'd have to modify the script
15:52 <mordred> corvus: they're mounted in the container in case that's helpful
15:52 <corvus> i'm just saying if we try to run python /opt/zuul/tools/ file:///var/lib/zuul/backup/openstack_status_1587829021.json
15:53 <corvus> we get urllib2.URLError: <urlopen error [Errno 20] Not a directory: '/var/lib/zuul/backup/openstack_status_1587829021.json/api/info'>
15:53 <corvus> so there's hacking to be done if we want to try to re-enqueue from those files
15:53 <mordred> oh - duh. I get what you're saying
corvusbut that has to be run with every tenant+pipeline combo15:54
corvushow about i do that for openstack check and gate and stop there15:55
corvusi'll restart the scheduler now15:55
mordredcorvus: ++15:56
corvuserm, the statsd errors are showing up on a scheduler start15:56
corvusoh, no, maybe i'm wrong about that and those are zombie docker error messages15:57
mnaserthats reassuring :)15:57
mordredthe debug log says statsd enabled15:57
corvusyeah, i dunno where those came from, but i don't see them now15:58
fungithere were definitely some in the gearman-server.log from before yesterday's last restart16:00
corvusfungi: yeah, i think they are side effects of broken connections16:01
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Use failed_when: false instead of ignore_errors: true
openstackgerritMonty Taylor proposed opendev/system-config master: Use python3 for ansible
openstackgerritMonty Taylor proposed opendev/system-config master: Cron module wants strings
17:01 <mnaser> has landed, do we want to restart executors with it?
17:24 <corvus> mnaser, mordred, tristanC: yeah -- looks like that's installed on the executors now, so i will run the restart mergers and executors playbook
17:24 <mnaser> corvus: ok cool, i have a patch that somehow always manages to fail miserably
17:25 <corvus> mnaser: excellent! :)
17:27 <mnaser> ftr is the really sad patch
17:28 <corvus> grr, the stop playbook doesn't wait for them to stop (
17:28 <corvus> i should not have used those
17:33 <corvus> just waiting for 2 stragglers to stop
17:37 <corvus> okay, all stopped.  starting now.
17:37 <corvus> mnaser: they're up if you want to recheck
17:37 <mnaser> corvus: ok cool let me try it out
17:38 <mnaser> just rechecked 723083 and will watch status
17:38 <mnaser> ah dang
17:38 <mnaser> opendev-buildset-registry (2. attempt)
17:39 <mnaser> opendev-buildset-registry (3. attempt)
17:39 <mnaser> only 2 jobs managed to start
17:39 <mnaser> sorry, 3: 2 are in merger_failure and 1 is still waiting
17:40 <corvus> ok i'll check the logs on opendev-buildset-registry for that and make sure it's the same error
17:43 <mnaser> wonder if ze01 is the culprit, that's two traces from it (and eventually things seem to work)
17:44 <corvus> let me grep for "Unable to merge" across the executors
17:44 <mnaser> oh hey that's another error tho, that's useful!
17:44 <corvus> yeah, i don't know if that helps us triangulate the error, or if we've popped one problem off the stack and this is the next one :|
17:45 <mnaser> yeah.  this seems a lot more useful of an error message anyhow
17:46 <corvus> mnaser: the only time i see 'not something we can merge' is for vexxhost/openstack-operator
17:46 <mnaser> corvus: could it be because mine is the first recheck to be done since that fix we implemented
17:46 <corvus> mnaser: did we confirm your earlier error on openstack-operator was the same one that was pasted earlier?
17:46 <corvus> mnaser: could be
17:46 <corvus> but i'm also wondering if that repo just happens to have something else wrong with it
17:46 <mnaser> corvus: we did not confirm my error to be the same as the helm one, which was the traceback provided
17:47 <corvus> mnaser: ok.  maybe we should recheck that helm change
17:47 <corvus> i do see the 'not something we can merge' error across many executors
<mnaser> it does seem to be a pattern of hitting specific repos --
17:47 <mnaser> i think it was openstack-helm-images so i see a change there with merger failures so ill recheck that
17:47 <mnaser> specifically
17:48 <corvus> mnaser: great, thanks; let me know if/when it hits and i'll search again; meanwhile, i'll continue digging
17:48 <mnaser> corvus: also, something to provide signal -- 722769,7 has a lot of retries in the check pipeline, an openstack/openstack-helm change
17:49 <mnaser> it could be a bad job though, because it's at 4 attempts, so i wonder if they override retries and are just failing a lot in pre :(
17:52 <mnaser> openstack-helm-images have all started except for 1 which is still queued, but signaling a much better overall situation
17:52 <corvus> okay, i've turned on keep on ze01 and will direct-enqueue 723083 and hope a job lands there
17:57 <corvus> got build b2f966145388493ab63aaa9187ba1317 on ze01 with that error
17:58 <corvus> oops, not that one, that's too early
17:59 <mnaser> i wonder if the on-disk copy of the repo has somehow become unclean
17:59 <mnaser> we got a few more chances of landing there
18:01 <corvus> git fsck on a copy of the cached repo is clean
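A sketch of the check corvus ran (the path below is illustrative, not the actual copy of the cached repo): a clean `git fsck` rules out on-disk object corruption, but it cannot catch the failure mode found later, where a stale FETCH_HEAD names an object the repo never received.

```shell
# Illustrative: fsck a known-good repo. Exit status 0 with no error
# output means the object database and refs are internally consistent.
git init -q /tmp/fsck-demo
git -C /tmp/fsck-demo -c -c z@localhost \
    commit -q --allow-empty -m 'base'
git -C /tmp/fsck-demo fsck --full
```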
18:02 <mnaser> corvus: we lose :( all jobs queued up, unless the build/functional jobs end up failing, but that'll be in a little bit till the registry goes up / images built / etc
18:02 <corvus> i'll dequeue and re-enqueue
18:05 <corvus> on the bright side, i haven't seen "SHA could not be" since before 1700 utc
18:05 <mnaser> that's good!
18:05 <mnaser> we got a merger_failure! hopefully ze1?
18:05 <corvus> doesn't look like it; i may need to increase the odds
18:07 <corvus> i'll wait for that linters job, then i'll set keep everywhere if it doesn't hit ze01
18:12 <corvus> okay let's try 6011e23339bf4938b1d5f12d0334087a on ze07
18:17 <corvus> the git object in FETCH_HEAD definitely isn't there; if i manually fetch it again it works
18:20 <mnaser> corvus: this is beyond my git and zuul internals, but what puts the object there in the first place?
18:21 <mnaser> i assume once a change is queued, zuul clones from (or fetches that head) and once it's sure that it's there, it plops it down in /var/lib/zuul/builds/foo/
<corvus> narrowing down on a reproducer -- i've switched back to a copy of the cached git repo (ignoring the build repo for now), i get this:
18:22 <mnaser> sounds like gc could be involved
18:25 <corvus> i'm somewhat inclined to just delete this repo from all the executors and see if it happens again
18:26 <mnaser> corvus: i wouldn't be opposed.  it seems like the cached repo may have hit some weird state
18:26 <corvus> i have low confidence that will fix it, but since it's the only one so far, i'd like to eliminate that as a possibility
18:26 <corvus> aaaand there's another one:
18:26 <corvus>   stderr: 'fatal: not something we can merge in .git/FETCH_HEAD: a57bca5811d4bb4b0bf7d4b8e7f46cd922eb4430'refs/changes/31/723031/3' of ssh://'
18:27 <corvus> so on second thought, i think we should revert the gc change, then remove all the cached repos, because they're all now configured with gc settings that aren't going to be reverted with the revert of the change.
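corvus's point about settings outliving the revert can be seen directly: the gc options live in each cached repo's `.git/config`, so reverting the code that wrote them changes nothing on disk. The path and option value below are illustrative.

```shell
# Illustrative: simulate a cached repo that the since-reverted code had
# already written a gc setting into.
git init -q /tmp/cached-repo
git -C /tmp/cached-repo config gc.auto 512
git -C /tmp/cached-repo config --local --get-regexp '^gc\.'
# a code revert does not touch .git/config; either unset explicitly...
git -C /tmp/cached-repo config --local --unset gc.auto
# ...or delete the cached repo wholesale, which is what was done here.
```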
18:28 <mnaser> corvus: i'm in agreement
18:28 <mnaser> i think the gc is def involved in this
18:29 <corvus> i've got a copy of that repo on ze07 and ze01 for us to experiment with
18:30 <mnaser> corvus: i suspect as more repos gc on fetches, we'll probably see this start popping up more; the relative quietness of the weekend may have masked it
<AJaeger> shall I propose reverting tobiash's gc change?
18:32 <AJaeger> corvus: I see your suggestion above, let me propose
18:32 <corvus> already did
18:33 <AJaeger> corvus: ok, I'll abandon...
18:35 <mnaser> corvus: i don't know how much time you have right now :) but maybe we'll have to figure out if we're going to wipe/rotate executor-by-executor to avoid a big thundering herd at gerrit
18:35 <mnaser> i'm not sure if opendev has done this much in the past
18:36 <corvus> mnaser: it's not that bad even when we're busy;
18:36 <mnaser> oh right, because specifically for the executors we clone ad-hoc
18:36 <corvus> (it's basically we just have to clone nova and neutron 12 times :)
18:36 <corvus> yeah, that helps a lot
18:37 * mnaser is going to go find something to distract him for the next 40-50 minutes as that change lands
18:38 <corvus> infra-root: i am about to start a multi-hour gap; if anyone else is able to ensure that lands, all executors and mergers are stopped, all repos in /var/lib/zuul/executor-git and wherever they are on the mergers are deleted, and then executors and mergers started, that would be great :)
18:38 <AJaeger> looking at we've lost a couple mergers and executors and got them back after the restart
18:38 <corvus> i probably won't be able to return until after midnight utc
18:39 <mnaser> corvus: thanks for everything and the summary, i'll try and keep eyes on 723110 :)
18:45 <mordred> corvus: I'm in and out over the next bit - but I do have coverage before midnight - so i'll keep an eye on it too
19:31 <fungi> okay, i'm in for the day, got as much yardwork in as i could before the weather changed. catching back up and can help if stuff's still wacky
19:34 <fungi> and looks like we're presently trying to merge 723110
19:35 <fungi> and it's about done in the gate
19:38 <fungi> and it's merged
19:39 <fungi> now for the deploy
19:39 <fungi> oh, promote goes first
19:40 <fungi> though i suppose our executors don't need the images
19:41 <mnaser> fungi: yep, executors don't use containers right now, and promote is done so we should be ready for the follow-up steps afaik
19:42 <fungi> well, the zuul source code still needs to end up on the executors and get installed
19:42 <fungi> at which point we can stop, clean and start them all
19:51 <mnaser> fungi: ah, right!  but i suspect that might not happen because zuul is in emergency
19:51 <mnaser> i *think*
19:51 <mnaser> so we might need to git pull it in there
19:53 <fungi> it's not, i double-checked
19:54 <fungi> so ansible will in theory update it on its usual schedule
19:55 * AJaeger is a bit puzzled - we have not seen any further MERGER_FAILURE for some time now, have we?
19:57 <fungi> the last infra-prod-service-zuul started at 2020-04-25T19:10:35
19:57 <fungi> so the next one should be pretty soon
19:58 * AJaeger had expected more problems after the restart and the debugging earlier, so I'm not completely convinced yet that the gc patch is the problem. But corvus's debugging was conclusive...
19:58 <AJaeger> But it's late here, I might be confused
19:59 <fungi> so probably in another ~10 minutes from now
19:59 <AJaeger> great, fungi!
19:59 * AJaeger signs off again...
19:59 <fungi> have a good night, AJaeger!
19:59 <AJaeger> have a nice weekend
20:00 <AJaeger> thanks, fungi - you as well in a few hours ;)
fungilooks like it's just started running the build20:14
fungithe revert commit is starting to show up in the local clone of zuul on the executors now20:15
fungiand done, double-checking it got installed20:19
fungizuul==3.18.1.dev104  # git sha 9b300bc20:20
fungithat's the right commit20:20
fungiokay, stopping the executors20:20
fungiwaiting for them to finish shutting down20:24
* tristanC crosses finger20:25
fungiokay, no remaining zuul-owned processes on any of the twelve other than leaked ssh-agent daemons20:32
fungitime to start deleting20:32
fungioh, need to stop all mergers too, i almost missed that20:32
fungithis should be faster20:32
tristanCfungi: i was about to ask... perhaps this can be done in a second step?20:33
tristanCfungi: e.g. remove executor-git cache, restart executor, then repeat for merger20:33
fungimaybe, though i don't know that it makes any difference20:35
fungii just need to take a moment to refresh my memory on how to use docker-compose to stop these since they're in containers20:35
fungilooks like it should be `sudo docker-compose down`20:36
fungiexcept i think it wants a very specific pwd20:37
fungicd /etc/zuul-merger/ && sudo docker-compose down20:41
fungithat seems to do the trick20:41
fungiand they're all stopped now20:42
fungion to git repo deletions20:42
fungilooks like it's still /var/lib/zuul/git/* on the mergers20:43
fungidoing those first20:43
fungiokay, i have those and also /var/lib/zuul/executor-git/* on the executors all recursively deleting in parallel20:50
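[editor's note: the parallel deletion fungi describes could be sketched as below — a hedged sketch only; the helper name is invented, and the two paths are the ones named in the log (/var/lib/zuul/git on mergers, /var/lib/zuul/executor-git on executors).]

```shell
# Wipe the contents of each git cache directory, fanning the
# removals out as background jobs and waiting for all of them.
# Run on each host only after its zuul services are stopped.
clear_git_caches() {
    for cache in "$@"; do
        rm -rf "$cache"/* &   # one background rm per cache dir
    done
    wait                      # block until every rm has finished
}

# e.g. on a merger:    clear_git_caches /var/lib/zuul/git
# e.g. on an executor: clear_git_caches /var/lib/zuul/executor-git
```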
fungimergers have all finished, some executors are still deleting20:52
fungigoing ahead and bringing the mergers back up20:53
fungihuh... does docker-compose up not close its copies of parent fds?20:54
fungiseems remotely running it via an ssh command line hangs indefinitely after "Creating zuul-merger_merger_1 ... done"20:54
fungii guess i'll just do them in interactive sessions20:55
fungihuh, even running docker-compose up from an interactive shell never disassociates20:58
fungiand sigint stops the container again20:58
fungithis is not good20:58
tristanCfungi: there may be a cli arg to make docker-compose detach20:58
fungiyeah, i'm doing some reading20:58
fungimanpage says: -d Detached mode. Run container in the background, print  new  container name.21:00
fungiand it's `docker-compose up -d` not `docker-compose -d up` in case anyone was wondering. the latter just prints usage instructions and not even an informative error21:02
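[editor's note: the working sequence fungi arrived at, wrapped as a small helper — a sketch, not quoted from the log; the function name and the default path are mine, with /etc/zuul-merger being the directory shown above. docker-compose reads its compose file from the current directory, and the -d flag must follow `up`.]

```shell
# Stop and recreate a docker-compose-managed service, detached so
# the command returns instead of tailing container logs forever.
restart_compose_service() {
    cd "${1:-/etc/zuul-merger}" || return 1
    docker-compose down      # stop and remove the container
    docker-compose up -d     # recreate it in detached mode
}
```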
fungiokay, looks like all the mergers are started again21:05
fungithe executors also finished their deletions whilst i was fumbling around with docker-compose so starting them back up as well21:06
fungiand they're all running again too now21:08
fungi#status log restarted all zuul executors and mergers on 9b300bc21:09
openstackstatusfungi: finished logging21:09
clarkbfungi: on the mergers note that you may need to docker-compose pull before doing the up in order to run the new thing. I expect ansible did that for you21:43
clarkbfungi: I think you can do a docker ps -a and cross check the image hash against what is on dockerhub to double check though21:44
clarkbor docker exec /usr/bin/python3 /usr/local/bin/pbr freeze | grep zuul ?21:44
mordredclarkb: the ansible should have done a pull - but it's good to double-check22:13
mnaser moment of truth22:13
mnaserim running a recheck on the super broken patch of mine22:13
fungiError: No such container: /usr/bin/python322:13
fungii'll need to read up on how that works22:13
mordreddocker-compose exec executor pbr freeze | grep zuul22:14
mordredor docker-compose exec merger pbr freeze | grep zuul22:14
mordredif you use docker-compose to do it it'll do it for you22:14
fungiahh, yeah22:15
mnaserspeaking of which22:15
mordredroot@zm01:/etc/zuul-merger# docker-compose exec merger pbr freeze | grep zuul22:15
mnaserdid we merge the smart-reconfigure patch22:15
mnaserbefore we nuke zuul-scheduler again22:15
mordredno - we should22:15
mordredscheduler is in the emergency file22:16
fungizuul==3.18.1.dev104  # git sha 9b300bc822:16
fungiso basically what the executors say22:16
fungiexcept for some reason this pbr gives an 8-character abbreviation of the sha instead of 722:16
mnaserfailed with: No container found for scheduler_122:17
fungimaybe pbr changed that recently22:17
mordredfungi: different git versions22:17
mordredfungi: mergers are running debian buster userspace - git 2.20 or something22:17
mordredexecutors are on xenial - git 2.7 or something old22:17
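[editor's note: this explains the 7- vs 8-character shas fungi noticed — since roughly git 2.11, `--short` picks the abbreviation length automatically, scaling it up with the repository's object count so abbreviations stay unique, while git 2.7-era defaults to 7. A length can also be pinned explicitly; a self-contained sketch:]

```shell
# Build a throwaway one-commit repo and compare git's automatic
# short-sha length with an explicitly pinned one.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m 'initial commit'
auto=$(git rev-parse --short HEAD)     # length chosen by git
fixed=$(git rev-parse --short=7 HEAD)  # pinned to 7 (no ambiguity here)
echo "${#auto} ${#fixed}"
```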
mordred(once we're back to being solid - I want to rolling-replace all the executors with focal)22:17
mordredbut - you know - not today :)22:18
mnaseri don't know why your patch failed mordred -- seems like zuul01, which i believe is the scheduler, doesn't have a scheduler container?22:18
mnaser"No container found for scheduler_1"22:18
mnasermordred: oh, you somehow dont run docker-compose up after the pull22:19
mnaserafter the zuul-scheduler image pull, it goes right into zuul-web22:19
mordredoh - we need to tell it to start the scheduler in the gate22:19
mordredyeah- one sec, I can push up a fix22:19
mnaseras an aside, my patches are happily going through check/gate without merger failures, so it looks like the git gc patch was the culprit and the revert fixed it22:20
openstackgerritMonty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP
mordredmnaser, clarkb, fungi ^^22:23
mordredwe need that before we can take the scheduler out of emergency22:23
mordredmnaser: woot!22:23
mordreddid we delete the git caches on the executors and mergers?22:23
mordredyup - I see fungi said that in scrollback22:24
mordredI mean - not awesome, since we still can't reproduce it in test22:24
mordredand sadly it seems to be an issue that's only on xenial - so it's going to be a bunch of work to fix it for a platform that soon we wont' be running on :(22:24
mordred(I mean, obviously we currently support xenial, so it's important - but you know - still annoying)22:25
mnasermordred: but also i think corvus left a copy of the repos on the executors that we could inspect22:25
mordredmaybe tomorrow we will not wake up to any fire drills22:26
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Use failed_when: false and ignore_errors: true
mordredfungi, clarkb: if you have any more interest - here are cleanup patches from earlier ansible runs22:30
fungimnaser: well, we assume the revert solved it, but it could also technically have been the repository removal (though what got the repos into a sad state if not the gc tuning patch, i don't know)22:42
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx
* corvus reads scrollback23:03
corvusmordred: yeah, i don't think the gc thing is tied to xenial -- lemme summarize23:06
corvus(there were 2 merger related patches)23:06
corvusthe "fix leaked refs" patch and the "gc" patch.  the "fix leaked refs" patch had the flaw that it would delete .git/refs, but it only activates on old git.  that's the one that's tied to xenial; we fixed that and rolled forward, and then we stopped seeing widespread failures.23:08
corvusthe "gc" patch applies to all git versions.  we saw some errors on a very few number of repos which seem very much like they were caused by aggressive gc.  the small number of affected repos could be only because those were the first to hit some gc edge condition, and perhaps others would join them over time.  or they could just be in a weird state.23:10
corvussince we reverted and deleted the repos on disk, i think either of those is still a possibility.23:11
corvusfor the record, i just investigated the possibility that the second error was caused by the first -- however i don't think that's the case.  on ze12 the first errors all happened to repos other than openstack-operator, and the second errors were all for openstack-operator23:22
*** tosky has quit IRC23:32
fungimakes sense23:35
*** DSpider has quit IRC23:41
clarkbnewer git is happier when dirs in .git are missing I guess?23:44

Generated by irclog2html 2.15.3 by Marius Gedminas