Thursday, 2023-09-07

05:53 <frickler> I think I missed something, why does zuul need a downtime?
07:09 *** ralonsoh_away is now known as ralonsoh
11:56 <fungi> frickler: the default branches are cached in zk. they'll get refreshed in time if configs for those repos are updated, but to clear the cache in zk and force it sooner we'd need both schedulers offline first
12:03 <frickler> fungi: that answers one question and raises the next one: what is changing about default branches? and sorry if I missed that, sometimes I skip things when there is too much backlog in the morning
12:43 <fungi> frickler: the patch in zuul for the bug you pointed out with refs/heads getting prefixed
12:44 <fungi> frickler: clearing the cache so that it gets repopulated with the https://review.opendev.org/893925 fix in place
12:45 <frickler> ah, that error is cached, o.k.
12:48 <fungi> yes, the cache will correct itself over time, but for projects that don't get frequent updates it could take a while
13:03 <frickler> ack, then the restart does make sense I guess
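For context, a minimal sketch of where that refs/heads prefix comes from (real Gerrit REST endpoint, but this only illustrates the symptom, not the actual 893925 fix): Gerrit reports a project's default branch as a fully qualified ref, so the value cached in zk needs normalizing to a short branch name.

    # Gerrit returns the default branch as a full ref (plus its usual XSSI guard line)
    $ curl -s https://review.opendev.org/projects/openstack%2Fhorizon/HEAD
    )]}'
    "refs/heads/master"

    # normalizing it to a short name, e.g. in shell
    $ head='refs/heads/master'
    $ echo "${head#refs/heads/}"
    master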
15:39 <TheJulia> hey guys, you can clear out the autohold I have. It shed some light on the issue, but I'm still sort of chasing it, just deferring for the moment.
15:41 <fungi> TheJulia: thanks for letting us know, i've cleaned it up now. happy hunting, and let us know if you need more help
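For anyone following along, cleaning up an autohold is roughly the following with zuul-client (the request id shown here is made up):

    # list outstanding autohold requests for the tenant, then delete one by id
    zuul-client --zuul-url https://zuul.opendev.org autohold-list --tenant openstack
    zuul-client --zuul-url https://zuul.opendev.org autohold-delete --tenant openstack 0000000123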
16:08 <opendevreview> Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects  https://review.opendev.org/c/zuul/zuul-jobs/+/887917
16:46 <clarkb> looks like min-ready: 0 for fedora and our ready node timeout has resulted in no fedora nodes in nodepool
16:47 <clarkb> I think that puts us in a good spot for Monday to merge the removal changes
17:02 <fungi> agreed
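A quick way to double-check that the last fedora nodes really have aged out (assuming the standard nodepool CLI on a launcher; empty output means none are left):

    # list any remaining fedora nodes across providers
    nodepool list | grep -i fedora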
17:51 <fungi> mm3 migration notifications have been sent to the airship-discuss and kata-dev mailing lists
17:53 <clarkb> fungi: I think both are small enough that we don't have to worry about copying significant amounts of data right? It's only openstack that will pose a problem for that
17:54 <fungi> correct. if you look at the todo list at the bottom of https://etherpad.opendev.org/p/mm3migration i've approximated a 4-hour migration window for openstack (the migration script itself takes around 2.5 hours to complete for the site in my test runs)
17:55 <fungi> i'll still make a warm rsync copy immediately prior to the window so that we spend as little time as possible copying data during the outage
17:55 <fungi> for all of them
17:56 <clarkb> ++
17:56 <fungi> because it's just running, a couple of times before the maintenance, the same command that i'll also run during it, so it's not any extra work and can shave minutes off the maintenance window
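The warm-copy approach described above is roughly the following; hosts and paths here are placeholders, not the actual migration commands:

    # ahead of the window: pre-seed the data so later passes are incremental
    rsync -av --delete lists.openstack.org:/var/lib/mailman/ /srv/mailman-staging/
    # during the outage, with the old services stopped: a much faster final pass
    rsync -av --delete lists.openstack.org:/var/lib/mailman/ /srv/mailman-staging/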
18:16 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects  https://review.opendev.org/c/zuul/zuul-jobs/+/887917
18:50 <frickler> corvus: the horizon stable/stein job was deleted on August 16, but periodic jobs are still running and there still is a stable/stein tab on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon
18:51 <frickler> other branches detected at that date do not seem affected https://paste.opendev.org/show/bKk2BIRN1I4sNFpFlA4N/
18:51 <frickler> *deleted
18:52 <fungi> branch was deleted
18:53 <frickler> periodic-stable pipeline to be exact, two jobs still listed in the "View Job Graph" https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon?branch=stable%2Fstein&pipeline=periodic-stable
18:54 <frickler> I guess if you do the zuul maint on saturday, it will do a full-reconfigure and clean this up anyway?
18:54 <frickler> so maybe we can either ignore now and see if it goes away then, or use the time to possibly do some debugging
18:56 <frickler> ah, branch deleted, not job, thx fungi, I should delete myself for today, too, I guess ;)
19:31 <corvus> frickler: horizon issue due to gerrit disconnect at time of event: https://paste.opendev.org/show/bHIpyhR39QxPz9KZG087/  (possibly gerrit had a bunch of work going on at the time and wasn't very responsive?)
19:32 <corvus> that will be corrected on the next branch change, or we could force it ahead of time with a full-reconfigure, or it sounds like it's probably just fine to let it be corrected during the restart
19:40 <fungi> aha, yeah i wondered if the event had simply gone missing
19:41 <fungi> the new event bus work in gerrit ought to solve this sort of case longer term
19:47 <corvus> in this case the event was processed, it's just gerrit chose shortly after that moment to go out to lunch which interrupted processing.  we discard events in that case since we're already halfway through processing.  so i don't think the pubsub stuff would have changed that.
19:48 <corvus> (arguably we could push the event back on the stack, but it's not a simple decision -- there could be negative ramifications from that)
19:49 <fungi> aha
19:51 <frickler> corvus: thx for digging in the logs. I think waiting for the restart is fine in this case
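If we did want to force it rather than wait for the restart, the full-reconfigure corvus mentions is a scheduler command; something like the following (the container name here is a guess for our deployment):

    # tell a scheduler to rebuild its entire configuration, including cached branch lists
    docker exec zuul-scheduler zuul-scheduler full-reconfigure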
19:52 <frickler> elodilles: do you run those branch deletions directly next to each other? I wonder if some sleep in between might be helpful
19:55 <frickler> also if you could add timestamps to your log that might help possible debugging in the future
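Purely as an illustration of the suggestion (not the script that is actually used for these deletions), spacing them out and logging timestamps might look like:

    # hypothetical sketch: delete EOL branches one at a time, pausing between them
    for branch in stable/stein stable/train; do
        echo "$(date -Is) deleting ${branch}"
        git push gerrit --delete "$branch"   # 'gerrit' remote name is an assumption
        sleep 30
    done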
19:57 <fungi> i suppose a barrage of branch deletions could have fired events that caused pipelines in zuul to be triggered, leading to mergers fetching refs from gerrit and unintentionally knocking it offline briefly?
19:58 <clarkb> though we have a limited number of mergers which should mitigate that but maybe not limited enough
19:59 <corvus> should actually mostly be the schedulers doing this op (unusually)
19:59 <fungi> just speculating. the cause could have been just about anything, and was just as likely unrelated to anything going on for branch deletion
19:59 <corvus> i think the branch lookup is currently a git op; we may be able to save some cpu cycles by making it a gerrit api call.  honestly haven't benchmarked them to figure out which is faster
20:00 <corvus> (that code predates gerrit having an http api :)
20:00 <corvus> one of those lines may make it all the way back to zuul v0
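The http alternative being weighed there would be something like the following (a real Gerrit REST endpoint; whether it is actually cheaper than the git operation is the unbenchmarked part):

    # list a project's branches via the Gerrit REST API instead of a git operation
    curl -s https://review.opendev.org/projects/openstack%2Fhorizon/branches/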
20:00 <fungi> also i wonder if the gerrit driver could special-case ref-updated events resulting from branch deletion. if memory serves, the newrev in them is 0x0 so should be pretty identifiable
20:01 <fungi> at least in our case, i don't think we currently have any reason to want to enqueue those into pipelines (and they've resulted in some confusion in the past)
20:01 <corvus> oh it's definitely special cased, it knows it's a branch deletion.  the special case is: branches changed, see what they are now.  that way it's self-healing.
20:02 <fungi> oh, got it
20:03 <fungi> i know at one point we were seeing something like zuul enqueuing git ref 0x0 into the post pipeline and then running builds which ultimately failed, but maybe that hasn't happened for a while now
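The marker fungi is recalling is the all-zeros newrev on a ref-updated event; filtering those out of the event stream looks roughly like this (assumes jq and stream-events access):

    # show only ref-updated events that correspond to branch deletions
    ssh -p 29418 review.opendev.org gerrit stream-events \
      | jq 'select(.type == "ref-updated" and .refUpdate.newRev == "0000000000000000000000000000000000000000")'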
20:17 <frickler> ah, that's why any branch operation would fix it. so creating the 2023.2 branch would also solve the issue
