Monday, 2021-10-18

ianwi'll get back to the debian-stable removal and maybe we can free up space there00:00
prometheanfirecan periodic jobs email the core reviewer teams (or be otherwise configurable for notifications)?01:30
ianwyes they can; iirc zuul-jobs might be an example01:39
ianwhttps://review.opendev.org/c/zuul/zuul-jobs/+/74868201:41
fungiianw: prometheanfire: i think the only inbuilt mailing feature in zuul is the smtp reporter, and the recipient address for that is configured on a per-pipeline basis01:58
fungihttps://zuul-ci.org/docs/zuul/reference/drivers/smtp.html01:59
* prometheanfire does like zuul02:11
fungiit's not got enough insight into gerrit's data structures to work out core reviewer addresses or anything, nor can it configure notification addresses on a per-job or per-project basis02:19
fungithe periodic-stable pipeline, for example, is configured to send failure reports to the stable-maint ml: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L293-L29702:21
fungisimilarly, failures for the release-post pipeline are reported to the release-job-failures ml:         from: zuul@openstack.org02:22
fungier, meant to paste https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L262-L26602:22
fungialso the pre-release, release, and tag pipelines get failure reports sent there02:23
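For reference, a per-pipeline SMTP reporter in Zuul looks roughly like the sketch below, modeled on the pipelines.yaml sections linked above; the pipeline name, addresses, and subject here are illustrative, and most pipeline settings (triggers, manager options) are omitted.

```yaml
# Illustrative sketch of a pipeline that mails failure reports to a list;
# the values are examples, not the exact production configuration.
- pipeline:
    name: periodic-stable
    manager: independent
    # ... triggers and other settings omitted ...
    failure:
      smtp:
        from: zuul@openstack.org
        to: openstack-stable-maint@lists.openstack.org
        subject: 'Stable check of {change.project} for {change.branch} failed'
```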
prometheanfirewell, for zuul, could be a useful feature request03:01
*** bhagyashris__ is now known as bhagyashris04:11
opendevreviewIan Wienand proposed opendev/base-jobs master: Remove debian-stable nodeset  https://review.opendev.org/c/opendev/base-jobs/+/80263904:25
ianwfungi: ^ i think with that list of dependencies that is finally ready ...05:27
*** gibi is now known as gibi_back_15UTC06:37
*** ysandeep is now known as ysandeep|trng06:50
*** jpena|off is now known as jpena07:31
fricklerthose deps all seem to fail on broken c7 jobs. more skeletons in the closet ...07:38
*** ykarel is now known as ykarel|lunch08:41
*** ykarel|lunch is now known as ykarel09:02
opendevreviewThierry Carrez proposed openstack/project-config master: Add ttx as OP to #openinfra-events  https://review.opendev.org/c/openstack/project-config/+/81438109:07
ttxZuul's release-approval queue is blocked since 2021-10-15 15:42:4209:44
frickleropenstack-zuul-jobs-linters is also failing with pyyaml6, should be an easy fix I hope, but lunch is first10:00
opendevreviewdaniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role  https://review.opendev.org/c/opendev/system-config/+/81439110:01
*** mnasiadka_ is now known as mnasiadka10:10
fricklerfor the release-approval, iiuc that's what fungi was looking at earlier, but no resolution yet?10:18
fricklerzuul is also complaining about a lot of config errors, a significant amount seems to be rename-related10:19
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load()  https://review.opendev.org/c/openstack/project-config/+/81440110:26
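For background, PyYAML 6 made the Loader argument to yaml.load() mandatory, which is the kind of breakage the linters job hit; a minimal sketch of the usual fix follows (the file name here is just an example, the real call sites are in the change above):

```python
import yaml

# PyYAML >= 6.0 raises TypeError for yaml.load(stream) without a Loader.
# Old, now broken:
#     data = yaml.load(stream)
with open("zuul.d/projects.yaml") as stream:
    data = yaml.safe_load(stream)          # shorthand for Loader=yaml.SafeLoader
    # or, if non-safe tags are really needed:
    # data = yaml.load(stream, Loader=yaml.FullLoader)
```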
fricklerERROR: Project openinfra/ansible-role-refstack-client has non existing acl_config line11:01
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load()  https://review.opendev.org/c/openstack/project-config/+/81440111:10
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Fix the renaming of ansible-role-refstack-client  https://review.opendev.org/c/openstack/project-config/+/81440911:10
*** dviroel is now known as dviroel|rover11:10
* frickler hopes that this order will work, otherwise will merge them11:11
fricklerof course murphy strikes again11:21
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Fix project-config testing  https://review.opendev.org/c/openstack/project-config/+/81440111:23
*** jpena is now known as jpena|lunch11:31
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Add ttx as OP to #openinfra-events  https://review.opendev.org/c/openstack/project-config/+/81438111:36
opendevreviewMerged openstack/project-config master: Fix project-config testing  https://review.opendev.org/c/openstack/project-config/+/81440111:48
opendevreviewMerged openstack/project-config master: Add ttx as OP to #openinfra-events  https://review.opendev.org/c/openstack/project-config/+/81438112:08
*** jpena|lunch is now known as jpena12:23
*** ysandeep|trng is now known as ysandeep12:57
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:06
*** hjensas is now known as hjensas|afk13:19
ttxinfra-root: Not super urgent, but Zuul's release-approval queue (the one used to trigger the PTL-approval test on release changes) seems to be stuck since Friday (478 changes added up)13:20
fungittx: yep, known since saturday, i'm hoping to get some more eyes on the traceback/exception i found related to the top change there, figured if it stayed stuck into today that wouldn't be the end of the world, so i didn't aggressively ping folks on their weekend13:23
fungibrought it up in the zuul matrix channel though for input13:23
ttxyeah, it's not super critical, I'm more concerned that it might add up to the point of slowing down other things13:25
fungisame here, so hopefully we can decide whether we need to keep it in that state much longer or can try to clear it13:25
fungiif it were the check or gate pipeline i'd have just collected as much data as i could and worked on getting it unstuck (probably a scheduler restart)13:26
Clark[m]I suspect a dequeue may be sufficient to get this type of thing moving again. However the cache entry for the specific change might be stale/bad and even a restart won't fix that13:32
*** akahat is now known as akahat|afk13:57
clarkbfungi: also I haven't forgotten that we need to do a gear release and looks like maybe we should consider a bindep release? I can push tags for those after school run if we want to do that13:59
clarkbThen I'd also like to land the gerrit 3.3.7 change and the gerrit.config cleanup change with plans to restart gerrit on those today/tomorrow14:00
fungiyeah, probably need to add release notes for bindep14:00
*** gibi_back_15UTC is now known as gibi14:06
clarkbfungi: is there a change up to default gitea_always_update to true? then we can consider if we want to just do that.14:11
corvusi can look at the zuul queue shortly14:13
clarkbcorvus: thanks!14:14
opendevreviewClark Boylan proposed opendev/bindep master: Add release note for rocky and manjaro  https://review.opendev.org/c/opendev/bindep/+/81443114:20
clarkbfungi: ^ bindep release note14:20
clarkbhttps://review.opendev.org/c/opendev/system-config/+/814048 and https://review.opendev.org/c/opendev/system-config/+/813716 are the two gerrit-related changes I mentioned above. I'll approve the second one after the school run since it has the necessary +2's. Landing the first to update to 3.3.7 would be nice too though14:23
clarkboh I guess https://review.opendev.org/c/opendev/system-config/+/813675 is still out there too. Should probably land that before a gerrit restart too14:28
clarkbit's hard to know if we are using an ansible group anywhere though. Maybe reviewers have a better way of checking for that than my bad grepping14:29
*** timburke_ is now known as timburke14:39
*** akahat|afk is now known as akahat14:48
fungiclarkb: thanks for the bindep reno, was that all you spotted since the last tag?14:59
opendevreviewdaniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role  https://review.opendev.org/c/opendev/system-config/+/81439115:00
fungiclarkb: yep, those look like it from `git log 2.9.0..origin/master`15:04
fungii guess we'll want to make it 2.10.0, probably can skip making release candidates given the low impact of the other changes (most iffy one is where we stop using distutils.version.LooseVersion)15:05
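Without having looked at the exact bindep change, the usual replacement for the deprecated distutils.version.LooseVersion is the packaging library; a rough sketch of that kind of swap (function name and values are illustrative):

```python
# distutils (and its LooseVersion) is deprecated, so version comparisons
# typically move to packaging.version instead.
from packaging import version

def at_least(installed: str, wanted: str) -> bool:
    # parse() handles PEP 440 versions; truly arbitrary LooseVersion-style
    # strings need more care, which is what makes this the "iffy" change.
    return version.parse(installed) >= version.parse(wanted)

assert at_least("2.10.0", "2.9.0")
```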
Clark[m]++15:07
*** ykarel is now known as ykarel|away15:08
clarkband for gear I think we're looking at 0.16.0 due to the tls changes, randomized server connection and modified connection timeout process15:17
clarkbfungi: ^ if that sounds right to you I can start with the gear release nowish since that doesn't use reno. Then do the bindep 2.10.0 once the release notes land15:17
clarkbI've approved the gerrit config cleanup just now as well15:18
fungiyeah, pretty sure 0.16.0 is what we talked about previously, and the pin zuul added was for gear<0.1615:22
fungidouble-checking now15:22
fungigear>=0.13.0,<0.16.0,!=0.15.015:22
clarkbfungi: ok I'll load up a gpg key and remember how to push a tag to gerrit :)15:23
fungii'm happy to if we want to wait until after 17:00z15:24
clarkbnah I've done it. Good to exercise this memory. Does commit aa21a0c61b1b665714f5b6e55ec202db9ddc22f1 (HEAD -> master, tag: 0.16.0, origin/master, origin/HEAD) look right to you?15:27
fungiclarkb: yep, that looks like current origin/master and the new version we discussed15:28
clarkbpushed15:29
fungiwe should probably send a release announcement to service-announce about it as well15:29
clarkbhrm I don't know that it ran any jobs15:30
clarkbI wonder if that wasn't ported when we moved it to the opendev tenant15:30
clarkbno it lists a couple of release jobs15:31
clarkbProject opendev/gear not in pipeline <Pipeline release> for change <Tag 0x7f9c346d9d30 opendev/gear creates refs/tags/0.16.0 on fac493c11ec7319a724ed4b29ff2766e1862f643>15:36
fungiumm15:37
clarkboh wait it is there I think it redrew the status page and moved the "card"15:41
clarkbit's just too early in the morning for me to process that the release queue moved from the right side of the screen to the left side ...15:41
clarkbya it is on pypi now and it is building its docker image currently. Ok all is well. I'll send an email once the docker image is done getting processed15:41
clarkband the message I pasted above must be from the openstack release pipeline, not the opendev release pipeline15:42
clarkband bindep is waiting for one of those fedora images that refuse to boot in most clouds :/15:42
fungiif you want to crib a release announcement, http://lists.opendev.org/pipermail/service-announce/2021-April/000018.html was my most recent one for git-review15:46
*** marios is now known as marios|out15:48
clarkbthanks email sent15:55
*** frenzy_friday is now known as frenzyfriday|pto15:55
fungithanks for tackling that!15:56
fungii've been sucked into ptg sessions since 13:00 and still have another hour to go15:56
opendevreviewClark Boylan proposed opendev/system-config master: Always update gitea repo meta data  https://review.opendev.org/c/opendev/system-config/+/81444315:58
clarkbthere's a change to discuss simply updating all the projects all the time. Testing should double-check the cost in runtime for us too15:59
*** ysandeep is now known as ysandeep|dinner16:06
reedFYI https://github.com/MetaMask/eth-phishing-detect/issues/564316:08
clarkbreed: their interactive checker says we aren't blocked either16:16
clarkbI wonder if there is some bug on their end16:16
reedcould be anything 🙂 I don't understand how half of this stuff works16:17
opendevreviewMerged opendev/system-config master: Clean up our gerrit config  https://review.opendev.org/c/opendev/system-config/+/81371616:18
clarkbI've approved https://review.opendev.org/c/opendev/system-config/+/814048 since it is largely mechanical (it has fungi's +2)16:32
*** jpena is now known as jpena|off16:34
clarkbLooks like the mina update to 2.7.0 isn't a drop-in update with gerrit16:55
fungi:/17:08
*** ysandeep|dinner is now known as ysandeep17:14
corvusclarkb, fungi, ttx: i've identified the zuul bug with the release-approval queue.  i think a dequeue command should get things moving again, so i'll go ahead and issue that now.17:42
*** ysandeep is now known as ysandeep|out17:42
corvus(details on the bug in #zuul)17:42
opendevreviewMerged opendev/system-config master: Build Gerrit 3.3.7 images  https://review.opendev.org/c/opendev/system-config/+/81404817:43
clarkbcorvus: thanks17:44
fungiappreciated! i'll work on the dequeue now17:47
corvusfungi: i'm on it17:47
clarkbthis mina thing is fine. I've fixed basically all the issues except for a place where we have to define a new abstract method and ... an slf4j logger import that doesn't work because I don't understand bazel17:48
clarkbs/fine/fun/17:48
fungioh, thanks corvus!17:48
fungisorry, i missed where you said "i'll go ahead and issue that now"17:49
corvusclarkb: naturally i just assumed you were referencing the "This is fine." meme17:49
fungimina [breaks ssh for everyone]: "this is fine"17:49
clarkbwhat is really curious about the logger import is that the code path that hits this doesn't appear to have changed between mina 2.4.0 and 2.7.017:50
clarkbaha I think maybe I need to update the version of slf4j too17:52
corvusthe number of items in release-approval is decreasing.  it may take a while to zero out.17:52
clarkbok it wasn't the version, but I've updated that anyway. Turns out you have to both depend on a thing and ensure its visibility makes it usable from where you declare the dependency18:00
fungiclarkb: you're trying to port mina-sshd's negotiation implementation to the embedded copy in gerrit?18:01
fungiis that one slightly forked then, not just as simple as pulling in a new version i guess...18:02
clarkbfungi: no I'm trying to make gerrit 3.3 build against mina 2.7.0 expecting that the effort to build against 2.7.1 when it becomes available will be minimal18:03
fungiahh, okay18:03
opendevreviewMerged opendev/bindep master: Add release note for rocky and manjaro  https://review.opendev.org/c/opendev/bindep/+/81443118:10
corvusrelease-approval is at 0 now18:12
opendevreviewClark Boylan proposed opendev/system-config master: Push a patch to test MINA 2.7 with Gerrit  https://review.opendev.org/c/opendev/system-config/+/81423018:26
clarkbThat patch builds locally for me. I've put a hold on it so that we can interact with it after zuul does its thing and see if ssh is generally working18:28
clarkbfungi: fwiw I think backporting the kex handler stuff to our use of mina on top of mina 2.4 is possible. But I suspect that the best thing here is simply to get to an up to date version instead18:29
clarkbinfra-root I'm thinking that https://review.opendev.org/c/opendev/system-config/+/813675 is probably the riskiest change of the bunch that I've written (just because it is hard to tell if the gerrit group is used somewhere unexpected)18:40
clarkbFor that reason I'm thinking maybe I'll restart gerrit today with the 3.3.7 update and the config cleanup to make sure that is all happy. Then we can do the gerrit group cleanup and another restart in the future (to help isolate any potential fallout)18:40
clarkbIf that seems reasonable I'll plan to do the restart after lunch today18:40
clarkbfungi: for bindep does tagging 36e28c76fa1d9370e967d08f4edf18a023c2aff7 as 2.10.0 look good to you? Do you want to make that release or should I?18:42
fungiclarkb: that matches my origin/master ref and the version we discussed, feel free to tag and push, or i can get around to it in a bit18:54
clarkbI can do it18:54
fungithanks!18:55
clarkbpushed18:56
fungiawesome18:58
clarkbhttps://pypi.org/project/bindep/19:04
clarkbShould I send a service-announce email for this one too? I guess so19:05
fungii have in the past, yeah19:05
clarkbsent19:10
fungiperfect, thanks again!19:12
clarkbwe've got ~59 nodepool nodes that are locked and in-use for 4 and a half days19:13
clarkbone is in a locked but used state and the last one is a held node19:14
clarkbone of the jobs associated with these nodes is still trying to finish on ze02 (99c5b5eff43c404f8e2d11221944cd65 is the job uuid)19:20
clarkbcorvus: ^ fyi that zuul seems to actually be holding those locks and failing to process the job finish19:20
fungiclarkb: yeah, i looked at that over the weekend too, seems to coincide with the same zk disconnect late last week which led to the scheduler getting restarted19:21
fungithey're basically all from ~3 hours before the scheduler restart19:21
clarkbya I guess we didn't restart executors too which would've killed the ephemeral znodes19:22
fungirefrained from trying to manually clean them up since we weren't pressed for quota and thought they might be useful for identifying the problem19:22
clarkbbut zuul should handle this regardless, unless we updated the executor in the process and it was no longer compatible with the executors?19:22
clarkbthe queue indicated by the "Finishing job" log entries on ze02 is growing it seems19:23
clarkbNode 0026928358 in nodepool was assigned to build 99c5b5eff43c404f8e2d11221944cd65 which ran on ze02 and has yet to successfully finish and unlock since ~Friday?19:24
fungii wonder if those were all for builds which finished in the problem timeframe prior to the scheduler restart19:25
fricklerthat sounds plausible to me. we also still need to clean up the held nodes that zuul lost track of during the earlier upgrade, I think?19:29
clarkbfrickler: ya if any of those are still held on the nodepool side but not recorded on the zuul side we should clean them up in nodepool when we are done with them19:36
clarkblooking at the code I think we only try once to call finishJob19:53
clarkband if that doesn't succeed then the job worker remains present forever?19:53
clarkband grepping for Finishing Job: 99c5b5eff43c404f8e2d11221944cd65 returns no results on ze02 implying we never called that method?19:55
clarkbThe last thing we seem to have done is pause the job19:57
clarkb`zgrep 99c5b5eff43c404f8e2d11221944cd65 /var/log/zuul/executor-debug.log.4.gz | grep -v 'Finishing Job' | less` on ze02 shows this. I'll double check the other files really quickly too19:58
clarkbya pausing is the last thing logged if we exclude the Finishing Job logging19:59
clarkbI think the next step is to do a thread dump to see if the thread is still running (I don't think it will be) or if we've just got a reference to the build in job_workers because the thread died before calling finishJob20:01
clarkbbut I need to eat lunch now. Back in a bit20:01
clarkbalso note that a graceful stop won't work on these executors because those jobs will never go out of the job_workers dict and that is what graceful stop waits for: an empty dict20:02
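A conceptual sketch (not Zuul's actual code) of why that is: if a paused build's entry never leaves the worker map, a graceful stop that waits for the map to empty can never complete.

```python
import time

job_workers = {}   # build uuid -> worker, conceptually; entries are removed in finish_job()

def finish_job(uuid):
    # Called when a build completes (or is resumed and then completes);
    # removing the entry is what lets graceful_stop() make progress.
    job_workers.pop(uuid, None)

def graceful_stop(poll_interval=1.0):
    # Stop accepting new work elsewhere, then wait for running builds to drain.
    # A build stuck in pause() never reaches finish_job(), so this never returns.
    while job_workers:
        time.sleep(poll_interval)
```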
*** dviroel|rover is now known as dviroel|rover|afk21:01
clarkbhttps://paste.opendev.org/show/b1fOfN46k9aW0DsKYkz6/ I don't think that is related to the issue we have here but potentially interesting21:08
clarkbdo we know approximately when the zk issue occurred? I'm not seeing it in the ze02 logs yet21:12
clarkbcorvus: ^ fyi this is a fun one too.21:12
clarkbThat exception in the paste is the only one that I think could be related. The others are git updates that failed and log streamers closing connections21:15
corvusdoes casting to a list help avoid that?21:16
corvusclarkb: i don't think that would cause a significant error; i think that's the loop that gets the next job, so failing there means "just start over and try again"21:16
clarkbcorvus: ya I didn't think it is related to the issue we're seeing with execute never calling finishJob() but I can't see anything else that might cause that21:17
clarkbcorvus: I think you are meant to make a copy and iterate over that rather than iterate the live data to avoid that error21:17
clarkbI'm almost ready to do the sigusr2 on ze02 and see if the threads have completely gone away or if they are still present21:17
corvusclarkb: you mean copy the dict first?21:18
clarkbcorvus: or the list of values/keys/items that is being iterated over (I haven't looked at the exact code yet as i'm still trying to make sense of the locked nodes belonging to unfinished jobs issue)21:20
corvusclarkb: yeah, just wondering if you do list(dict.values()) if the list call is atomic enough to avoid that issue or whether it is subject to the same thing...21:23
corvusi guess it probably holds the GIL for the duration of list(), so it's effectively mutexed...21:23
corvusso i think it's probably sufficient.  :)21:23
clarkbya I think that is the case21:24
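A minimal reproduction of the failure mode and the fix discussed above: iterating a dict while another thread grows it can raise RuntimeError, while list(d.values()) builds its snapshot in a single C-level call and so, as noted, is effectively atomic under the GIL in CPython.

```python
import threading

d = {i: i for i in range(100_000)}

def mutator():
    for i in range(100_000, 200_000):
        d[i] = i                      # grows the dict while the main thread iterates

t = threading.Thread(target=mutator)
t.start()
try:
    total = 0
    for v in d.values():              # interleaves with mutator(); may raise
        total += v
except RuntimeError as exc:
    print(exc)                        # "dictionary changed size during iteration"

# Fix: snapshot first, then iterate the private copy.
total = sum(list(d.values()))
t.join()
```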
clarkbThread: 139991145465600 build-99c5b5eff43c404f8e2d11221944cd65 d: False <- that thread still exists21:24
clarkbFile "/usr/local/lib/python3.8/site-packages/zuul/executor/server.py", line 999, in pause\n  self._resume_event.wait()21:24
clarkbso it's waiting for the unpause condition to occur but that isn't going to happen?21:25
clarkbthat gets triggered by a JobRequestEvent.RESUMED event21:27
corvusclarkb: did it log this?  "Received %s event for build %s"21:28
corvuslooking for that to get a deleted event21:28
corvusclarkb: because the build request delete should cause that, which should cause the jobworker stop() method to be called which should resume the job then stop it21:29
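For reference, that signalling relies on ZooKeeper watches on the build request's children; a rough kazoo sketch of the pattern follows (the host, path, and child names are illustrative, not Zuul's actual layout), which also shows why losing the session's watches leaves a paused build waiting forever.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk01.opendev.org:2181")   # illustrative host
zk.start()

request = "/zuul/build-requests/99c5b5eff43c404f8e2d11221944cd65"  # illustrative path

@zk.ChildrenWatch(request)
def on_children(children):
    # The scheduler signals "resume" (and similar) by creating child znodes.
    # If the watch is lost, this callback never fires and the build stays paused.
    if "resume" in children:
        print("resume requested")

@zk.DataWatch(request)
def on_data(data, stat):
    # Deletion of the request itself (data is None) should trigger stop().
    if data is None:
        print("request deleted")
```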
corvusclarkb: are we talking about the zk issue where we lost contact with the server, and it appeared that we also lost all the watches?21:30
corvusbecause that depends on watches too.21:30
clarkbcorvus: yes the incident that frickler restarted the scheduler for. And ya I'm seeing that we do a child watch for delete and resume znodes21:30
corvusin which case, we may actually have more debugging information than i realized, though we don't have a repl on the executor to help with it21:31
corvusbut i would like to examine zk and see what sessions it thinks are active21:31
corvus(because so far we have no idea how it's possible to resume a zk session and lose the watches; that's not supposed to happen and i can't reproduce it locally)21:32
clarkbcorvus: ze02 is the one where 99c5b5eff43c404f8e2d11221944cd65 is still running in a paused state if you need specifics to focus on21:32
corvusthx21:32
clarkbI hit sigusr2 twice so yappi isn't running anymore but there may be some info in the logs from when it collected brief data? I'll check for that log message21:32
clarkbit almost seems like the scheduler restart caused it to forget about all these builds even though they should still be in zk?21:33
clarkband that caused us to not send the resume event21:33
corvusclarkb: occam's razor says to me that if the scheduler was stuck because it lost the watches on its zk session, then it's very likely that the executor was in the same situation.21:34
corvusi mean, there could be something else going on, but that seems like a perfectly satisfactory explanation that i think we'd need to falsify first21:34
clarkb`zgrep 'Received .* event for build' /var/log/zuul/executor-debug.log.4.gz | grep 99c5b5eff43c404f8e2d11221944cd65` returns no results21:34
clarkbI guess the next step is to look at the zk state and see if there is a resume or delete event sitting in the child znode listing that a watch would've seen?21:35
corvusthx.  that's consistent with the lost watches hypothesis21:35
corvusclarkb: the delete event should literally be the znode deleted; so yeah, we can check for the 99c5b5eff43c404f8e2d11221944cd65 build request and it should not be there21:36
clarkbcorvus: will you do that or should I?21:36
corvusi will, i have a zk shell already21:36
corvusget 99c5b5eff43c404f8e2d11221944cd6521:37
corvus{"uuid": "99c5b5eff43c404f8e2d11221944cd65", "state": "paused", "precedence": 300, "resultpath": null, "zone": null, "buildsetuuid": "e2a6a56c35d94d78bcb378deb11d9696", "jobname": "tripleo-ci-centos-8-content-provider-ussuri", "tenantname": "openstack", "pipelinename": "periodic", "eventid": "724e0983166c4458a789c520989e86a8", "workerinfo": {"hostname": "ze02.opendev.org", "log_port": 7900}}21:37
corvuswell, that's that theory falsified :)21:37
corvusthe scheduler should have deleted that build request after restarting21:38
clarkbis there a resume child znode that we might have missed?21:38
corvusno children21:38
clarkbinteresting so as far as zuul is concerned the job has been running for 4 days? :)21:38
corvuswell, is it in the pipeline?21:39
corvus(and it has been running for 4 days, right?)21:39
clarkbno it's not in the status.json rendered pipeline at least21:39
clarkbbut the job_worker thread is present and the job's last activity was to pause21:40
clarkbI guess the scheduler forgot about it though21:40
clarkbwhich is weird because the build request is there showing a paused state21:40
corvuswas it before the restart?21:40
clarkbthe pause? I'm not sure, since I don't know exactly when the restart happened.21:40
clarkb2021-10-14 07:54:49,630 INFO zuul.AnsibleJob: [e: 724e0983166c4458a789c520989e86a8] [build: 99c5b5eff43c404f8e2d11221944cd65] Pausing job tripleo-ci-centos-8-content-provider-ussuri for ref refs/heads/master (change https://opendev.org/openstack/tripleo-ci/commit/None)21:41
clarkbthat is when the pause for the job occurred21:41
corvusokay, i thought for some reason we were debugging fallout from the scheduler restart21:41
clarkbcorvus: well fungi mentioned he thought it was related, but I've not yet tracked down when exactly the restart happened21:42
corvus 2021-10-14 10:03:50 UTC zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued21:42
corvusmaybe that?21:42
clarkb10:01:02*         frickler | #status notice zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued21:43
clarkbya so the job was paused before the restart, but those tripleo jobs that run after the pause can run for a significant amount of time so it is likely the job was still paused when that happened21:43
corvusthen the question is why didn't the scheduler delete the build request21:44
corvushrm.  we may simply not do that.21:46
corvuswe clean up requests from executors that crash while the scheduler is up, but i think we may not do the other way around.21:47
corvusso, short-term fix: if you restart the scheduler, do the whole system.21:48
corvuslong-term fix: pipeline state in zk21:48
corvusgiven that this started being an issue like 3 months ago, and we're 90% of the way to it not being an issue, i think i'd lean toward not doing a medium-term fix and just sticking with the short-term fix for now.  we could send an announcement to that effect...21:49
corvusi relayed that to #zuul since that's a zuul-project discussion21:52
clarkbya I think if we communicate "restart executors and mergers when restarting the scheduler" that is probably reasonable for now21:52
corvusinfra-root: ^ for the time being, if you restart the zuul scheduler, go ahead and restart the mergers and executors too (the zuul_restart playbook), because of that bug in zuul21:57
clarkbcorvus: I'm thinking maybe we wait until your fix for the other issue is ready then restart everything at that point?21:57
corvussgtm21:57
corvuss/ready/merged -- i'm pretty sure the fix is ready  (i ran tests locally)21:57
clarkbya I'm starting my review of it now21:58
fungioof, that's a lot of scrollback for just wandering off for dinner22:05
fungiokay, luckily i already read the tl;dr in zuul matrix, so i think skimming it was good enough ;)22:07
clarkbshould we hold off on a gerrit restart to do that when we restart zuul?22:13
clarkb(I want to restart gerrit for the 3.3.7 image update and to ensure the config file cleanup doesn't cause problems)22:13
clarkbneat my MINA 2.7.0 held node seems to be working with stuff like ls-projects and show-queue22:19
fungii can help with a combined zuul+gerrit restart when we're ready for that22:21
fungii guess the idea is to restart all of zuul on 814493?22:22
clarkbya22:24
funginot sure if i can legitimately claim to be reviewing that, but i'm staring at it really, really hard22:25
fungibased on the earlier discussions in matrix, i think i follow it22:26
clarkbhttps://172.99.67.89/c/x/test-project/+/21 I was able to push that via local ssh as well. I think that generally means newer mina is working. The other big mina thing is replication though22:29
clarkbtesting that is a bit more of an involved process so probably won't get into that until a 2.7.1 shows up and we can argue for doing the update22:29
clarkbfungi: are you able to review 814493? I'll ping tristanC about it too since he expressed interest22:30
fungiclarkb: i just did, and didn't approve specifically because tristan had wanted to review it22:31
clarkbthanks22:32
fungibut yeah, i think i followed the solution and it looked okay22:33
clarkbFinally remembered to approve the prometheus spec. I got distracted last week by the rename improvement stuff22:45
fungigood call22:46
clarkbfungi: frickler: what was the story with the zuul complaints about old venus secrets in zk?22:48
clarkbThe rename process should've renamed them to openstack/venus paths in zk then deleted the venus/venus paths22:48
fungithe backup seemed to want to back up the old paths22:50
fungiand then complained when it couldn't find them in zk22:50
clarkbthats interesting since I thought it only operated off of what it saw in zk22:51
clarkbcorvus: ^ fyi22:51
clarkbfungi: was it root email where it complained?22:51
fungithat's a great question, i hadn't had time to look, just repeating what frickler said22:51
clarkbya seems to have occurred 23 hours ago, and daily before that ever since the renames22:52
fungiyep, root cronspam22:52
fungijust found a copy in my cronspam inbox22:52
clarkbok adding to the meeting agenda. I think the export is supposed to be greedy and export as much as it can so this likely isn't a fatal issue. But something we should cleanup in our rename process or the zuul export-keys command22:52
fungiERROR:zuul.KeyStorage:Unable to load keys at /keystorage/gerrit/osf/osf%2Fopenstackid22:53
fungiand so on22:53
opendevreviewMerged opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement  https://review.opendev.org/c/opendev/infra-specs/+/80412222:53
fungimaybe we don't correctly remove the parent znode?22:54
clarkbfungi: oh that could be22:54
fungiand so the tool is finding an empty one there22:54
fungibecause, yeah, i thought it was effectively stateless22:54
clarkbfungi: ya looking directly at the db I think this is exactly the issue23:02
clarkband ya checking the backups I think we're still dumping everything else23:03
clarkbI'm working on a fix23:14
clarkbremote:   https://review.opendev.org/c/zuul/zuul/+/814504 Cleanup empty secrets dirs when deleting secrets23:35
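A rough sketch of the cleanup idea (this is not the actual patch above; it assumes a kazoo client and the /keystorage/<connection>/<org>/<org>%2F<project> layout seen in the errors): after deleting a project's keys, prune parent znodes that are now empty so a later export doesn't stumble over them.

```python
# Hypothetical sketch, not the real Zuul change: prune empty keystorage
# parents after deleting a project's keys, using a kazoo client `zk`.
KEYSTORAGE_ROOT = "/keystorage"

def delete_keys_and_prune(zk, path):
    zk.delete(path, recursive=True)
    # Walk back toward the root, removing parents (e.g. an old org dir left
    # behind by a rename) that no longer have any children.
    parent = path.rsplit("/", 1)[0]
    while parent != KEYSTORAGE_ROOT and not zk.get_children(parent):
        zk.delete(parent)
        parent = parent.rsplit("/", 1)[0]
```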
clarkbI'm going to need to do the zuul restart and gerrit restart tomorrow if I'm helping. Have dinner to help with now. Happy for others to do it this evening if they have time, though I don't think we are in a rush; tomorrow should be fine23:57
