Wednesday, 2021-09-22

clarkbok fedora-34 cleanup complete. Basically had to do the same thing for a number of those records and eventually it stopped having errors and looks normal again00:00
fungiany indication as to how old they were?00:03
fungii suppose we could guess from their serials00:03
clarkbfungi: ya you can in theory look those up. The first one was gra1 the others were all for inmotion. I wonder if some combo of unhappy cloud and maybe restarting the builder when it was cleaning up records contributed to that00:04
fungiso probably fairly recent anyway00:04
ianwlooks like i've still got a few graph stats wrong00:42
*** ykarel|away is now known as ykarel04:09
opendevreviewIan Wienand proposed openstack/project-config master: grafana: further path fixes  https://review.opendev.org/c/openstack/project-config/+/81033804:55
*** ysandeep|out is now known as ysandeep05:02
*** ykarel is now known as ykarel|afk05:03
opendevreviewMerged openstack/project-config master: Update grafana to reflect dvr-ha job is now voting  https://review.opendev.org/c/openstack/project-config/+/80559405:58
*** jpena|off is now known as jpena06:52
*** rpittau|afk is now known as rpittau07:28
*** ysandeep is now known as ysandeep|lunch08:14
priteauGood morning. Is there some known slowness in Zuul today? I am seeing delays between patch submission and jobs showing in the queue at https://zuul.opendev.org/t/openstack/status08:36
*** ysandeep|lunch is now known as ysandeep09:49
*** ykarel is now known as ykarel|away10:25
*** ysandeep is now known as ysandeep|afk10:47
*** dviroel|out is now known as dviroel11:18
*** jpena is now known as jpena|lunch11:27
*** ysandeep|afk is now known as ysandeep11:55
*** jpena|lunch is now known as jpena12:24
fungipriteau: how long of a delay generally? the flow is that when you push a patch to gerrit it emits an event on its event stream, zuul listens to the stream and adds each event to an inbound event processing queue, when it has time, it takes the oldest event from the queue and decides whether it should act on it, which may include scheduling builds in some pipeline, then it asks the mergers to12:35
fungiprovide it with an appropriately constructed merge commit to provide context, then once it has that merge commit it re-examines the configuration to determine which pipelines should get what builds, at which point it asks nodepool to assign the requisite number of nodes for each build, and after that it asks the executors to start running the various builds it's decided should happen12:35
fungiif the delay is in the change showing up in pipelines, then it may be slowness on the gerrit side or a backup in the event processing queue (the status page also shows a count of items in that queue to aid in identifying the latter case)12:36
fungiif the change shows up in one or more pipelines but doesn't have a list of builds associated with it, zuul is probably waiting on mergers to provide it the appropriate git contexts12:36
fungiif the change is in a pipeline but some or all of its builds just show a queued state then it's probably waiting for nodepool to provide nodes for those builds12:37
fungialso some dependent pipelines have a queue "window" which dynamically decides the number of changes it will test in parallel for a given queue in order to minimize resource waste if a change near the end of the queue fails and forces all the changes behind it to be retested12:39
fungiin that case, you'll see changes toward the bottom of the queue in a waiting state12:39
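A minimal sketch of the first step fungi describes above: reading Gerrit's SSH event stream, the same feed Zuul's Gerrit driver turns into trigger events. The SSH account name here is a placeholder; the host and port are Gerrit's standard SSH API.

    # Tail Gerrit's SSH event stream and print the event types Zuul would
    # turn into trigger events. "someuser" is a placeholder SSH account.
    import json
    import subprocess

    cmd = ["ssh", "-p", "29418", "someuser@review.opendev.org",
           "gerrit", "stream-events"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            event = json.loads(line)
            # e.g. patchset-created, comment-added, change-merged
            print(event.get("type"), event.get("change", {}).get("number"))
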
*** arxcruz is now known as arxcruz|ruck13:14
TheJuliais... gerrit and zuul... umm. okay this morning?14:25
TheJuliawell, Gerrit just seems to get a sporadic hang on page load. Looks like 3rd party CI just posted a reply to a recheck already14:26
fungipriteau reported observing zuul taking some time to enqueue changes, or at least that's what it sounded like14:26
TheJuliabut zuul doesn't see the change nor the original change so I'm kind of wondering if things are not being queued up14:26
TheJuliaare we talking like 45 minutes?14:26
fungihe didn't specify, i've mostly been head-down on other maintenance tasks so i haven't had an opportunity to observe similar behaviors but sounds like maybe you have14:28
fungithe status page for the openstack tenant indicates a glance change entered the check pipeline just a couple minutes ago, i'll see if i can track how long that took14:29
Clark[m]I'm not really here yet. But it always helps to have specifics like change numbers/ID etc14:29
fungipatchset uploaded at 13:51 utc, entered the check pipeline a few minutes ago14:29
fungiso yeah, does seem like a longer delay than i would have expected14:30
fungii'll check resource utilization on relevant systems and see if it's obvious whether something's overloaded14:30
TheJuliaokay14:31
TheJuliaso I'm not going crazy and things are still "working"14:31
TheJuliacool!14:31
fungiworking on island time14:31
fungibut yeah, i just watched a tripleo-quickstart change finish testing and immediately merge, so it seems like time to enqueue is impacted but maybe not actual results processing14:33
fungino obvious resource pressure on any of the systems, i guess it's to the zuul scheduler log next14:39
*** ysandeep is now known as ysandeep|away14:43
fungilooking at the scheduler debug log, there certainly seems to be some substantial delay in event processing. 809634,12 for airship/porthole was uploaded at 13:59z and the scheduler logged seeing the event from gerrit at that time (13:59:21,605) and submits it to the event queue immediately thereafter, but doesn't seem to start processing it until much later (14:42:19,278)14:50
fungii suppose there could be a massive event backlog and the "0 trigger events" on the status page is inaccurate14:51
fungii'll check grafana to see if that tells a different story14:51
fungithe event processing time graph only seems to indicate at worst a few minutes and most of the time just a few seconds14:55
fungibut we don't seem to graph the event queue length14:55
Clark[m]fungi: I think you can check the queue directly in zk now15:11
fungiahh, like with zk shell, or we report it to graphite?15:11
Clark[m]With zk shell15:13
Clark[m]Not sure that is the best method but may work if nothing else does15:13
fungisure, i'll see what i can work out15:15
Clark[m]Looking at grafana I don't see anything that clearly looks wrong on the zuul status page15:23
fungiyeah, same15:23
Clark[m]One thought was maybe merges are slow since that is the next step after getting the event from Gerrit and converting it to an internal event iirc15:23
Clark[m]But I think the change shows up in the status pipelines while waiting for merge to happen, it doesn't have any jobs associated to it at that point15:24
Clark[m]In this case we don't see the changes ending up there so slowness in requesting the merge or getting to that point?15:24
fungii guess we don't have zk-shell installed on the scheduler, where have you typically been running it from? we also have to authenticate with it now, right?15:25
Clark[m]I have it in a venv on zk04 because you can do local connections without auth15:27
Clark[m]Greatly simplifies things15:27
fungioh, yep i didn't consider that, thanks15:30
clarkbI'm getting situated at the desk finally. I've got to pop out in a couple of hours for a doctor visit so will let others drive this one15:31
fungino worries, i'm digging15:35
fungithe /zuul/events tree in zk is indeed massive15:39
fungiin particular, /zuul/events/connection/gerrit/events/queue has 5103 znodes15:40
fungii'll check it again in a few minutes and see whether that's rising or falling15:40
fungiseems to rise and fall, i saw it go up to 5128 and then drop to 496015:45
*** rpittau is now known as rpittau|afk15:45
clarkbbut definitely not trending all the way to zero? it's possible we've reached a steady state backlog after the NA morning rush?15:45
fungidropped further to 4936 now15:46
clarkbfungi: does zk shell have a dirent count function? or are you having to pass that off to unix shell commands like wc -l?15:48
fungithe latter15:48
clarkbfungi: you might also want to try a get on the oldest looking one and see what it contains as well as check that it is being processed in a fifo fashion (oldest out first)15:49
clarkb?15:49
fungiyep, working on analyzing it15:49
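A minimal sketch of the same inspection done with kazoo instead of zk-shell, assuming local, unauthenticated access to the plaintext client port on the ZooKeeper host as Clark describes; the znode path is the one fungi found below.

    # Count the queued Gerrit events directly in ZooKeeper and peek at the
    # oldest znode to confirm FIFO draining. Assumes local unauthenticated
    # access on port 2181.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()

    path = "/zuul/events/connection/gerrit/events/queue"
    children = sorted(zk.get_children(path))
    print("queued events:", len(children))

    if children:
        data, stat = zk.get(f"{path}/{children[0]}")
        print("oldest:", children[0], "created at:", stat.created)

    zk.stop()
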
clarkblooks like https://review.opendev.org/c/opendev/system-config/+/809512 and child didn't get approved yet, but probably best to focus on zuul behavior for now15:53
clarkbNo leaked replication tasks currently15:56
fungiso zk-shell tree doesn't seem to output the znodes in any particular order, but if i sort them i can see the lowest-numbered znodes disappearing on subsequent runs15:56
fungialso the queue size is now down to 489215:56
fungiso it's possible we had a massive spike in events hours ago and it's been steadily eating away at that, hard to know without historical trending15:57
clarkbfungi: I think ls printed them out in order for me yesterday when I was dealing with the nodepool fedora-34 thing15:57
*** jpena is now known as jpena|off15:59
fungiaha, i missed that ls was an available command, looks like earlier when i ran help i truncated the output unintentionally16:00
fungiyeah, ls does return them in order, tree did not16:02
fungilength was 4885 moments ago and now it's jumped back up to 493616:03
fungitwo steps forward, one step back16:03
clarkbcertainly seems like we might just be barely keeping up with the current input demands.16:04
* TheJulia wonders if everyone just needs to go make a sandwich....16:04
fungior three16:04
clarkbI wonder if there may be lock contention making that worse. Do we lock to append to the queue and maybe that prevents us from popping the head?16:04
TheJuliaor a sandwich and a movie16:04
TheJulia"Please, go watch LOTR after having a sandwich"16:05
clarkbnote the change to move event handling to zk was part of the most recent restart16:05
fungidirector's cut if possible16:05
fungiqueue length is now 5030 so i think everyone just finished sandwiches16:06
clarkbzuul.GerritEventConnector <- is the logger name for the bit that moves the events around I think16:07
clarkber sorry it was the change cache being in zookeeper not in process memory that changed in the connection drivers16:10
clarkbLooking at 809955 because it recently ended up in check we can see the recheck comment is placed on the zk connection queue at 2021-09-22 15:05:41,654 But then we basically do nothing else for that change until 2021-09-22 16:12:44,16016:17
clarkbCertainly seems like we're starving the queue processing somewhere16:17
fungiyes, the debug log just goes silent about the event from moments after it stuffs it into the queue until moments before it's evaluated16:20
mnaserinfra-root: i see two backup servers in the montreal datacenter, is that on purpose or is backup01  a forgotten relic? :)16:25
clarkbmnaser: I think the older one may be the bup backup server and the newer one does borg16:26
clarkbmnaser: we should cross check with ianw but it is possible that bup has been gone for long enough now that we feel comfortable removing its backups16:27
mnaserok cool, no problem16:27
mnaserfwiw i hope we're backing up review02 to sjc1 :p16:27
clarkbmnaser: we're backing it up to wherever the vexxhost backup server is and to the rax backup server16:27
mnaserok cool so there is some offsite-ness to it16:27
clarkbI can't remember if it is in sjc1 or ca-ymq-1 but we do two different sites16:27
clarkbyup16:28
mnaserbackup servers seem to be in montreal so just wanted to make sure we're not backing up gerrit to the same place \o/16:28
clarkbyup there are two target sites and we backup to each with a 12 hour offset doing dailies16:28
fungiright, we back up to two locations in separate providers. earlier this year we switched backup solutions but were keeping the old backup servers around for a while in case we needed something from longer ago than when we switched16:28
fungii have a feeling we can look at cleaning up one or both of the old servers soon, if they're eating up a lot of resources16:29
clarkbfungi: corvus: looking at the zuul gerrit driver I suspect one of two things may be happening. Either we are not reliably winning elections or doing so takes significant time so each pass through the event queue is interrupted by that. Or iterating through the event queue is itself slow16:29
mnaserthat's fine, it's not that much resources i think, i'm just getting ready to kick them off to the new site soon (like we did for review02)16:29
clarkbUnfortunately we don't seem to have a ton of logging around this so inferring state from logs isn't easy16:30
fungicool, thanks mnaser! happy to help coordinate reboots there too16:30
fungias long as we don't have a backup in progress at the time, it should be entirely non-impacting16:30
fungifor the new server that is. for the old one, any time is fine16:30
mnaserand im going to guess old is 01 and new is 02 :)16:31
clarkbbackup02.ca-ymq-1.vexxhost.opendev.org is the current one according to our inventory file16:33
clarkbso ya 01 should be the old one16:33
clarkbhttps://review.opendev.org/c/zuul/zuul/+/810467 is the sort of logging that I think we are missing.16:45
clarkbAlso I think another option would be to run the yappi profiling. This will make things slower for a period of time16:45
clarkbbut maybe that will show us spinning in the election methods or spending a lot of time in zk get or something16:45
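For reference, a generic sketch of the yappi profiling clarkb suggests: wrap a suspect code path, then dump per-function totals. run_suspect_code_path is a placeholder, and however the profiler actually gets toggled in a running scheduler, the output it produces looks like this.

    # Profile a suspect code path with yappi and print per-function totals.
    # run_suspect_code_path() is a placeholder for whatever is being measured.
    import yappi

    yappi.set_clock_type("wall")  # event handling is mostly I/O-bound
    yappi.start()

    run_suspect_code_path()

    yappi.stop()
    stats = yappi.get_func_stats()
    stats.sort("ttot")  # total time, including ZK gets and election calls
    stats.print_all()
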
corvuscan we maybe start an etherpad or something about the event queue stuff?  i didn't realize there were reports of an issue16:52
clarkbthat seems like a good idea16:52
corvushttps://etherpad.opendev.org/p/gaK1SsRMpe3hh4ASOCLi16:52
fungiyeah, didn't want to pester you about it if there was a reasonable explanation, but this does seem possibly introduced in the latest restart16:55
corvusthe "zuul event processing time" graph doesn't show an increase, which is what it's designed to do in this case.  so we should pay attention to what that's measuring; it may give us information about where the slowdown is or isn't16:56
fungiright, i noticed the same, so the delay in this case is something not measured by that value16:57
corvusif the start time for that begins after the gerrit driver internal processing, then it seems likely that we're getting events from gerrit then spending too long updating the change cache before passing the info on to the scheduler queue.16:58
corvusclarkb: ^ we may be able to infer the answer to your questions in the etherpad without new changes16:59
clarkbcorvus: ya I figured we might be able to, but also thought a change pointing out areas where we want to measure might help others understand the bit that is likely slow17:00
clarkbI'm definitely going to have to defer to others as I have an appointment in an hour and need to head out sometime before then17:00
corvuscpu use has increased17:03
corvusthe other thing to watch is the files change17:04
clarkbit didn't look too bad on the zk side17:04
clarkbcpu use I mean. I didn't check the scheduler side17:04
corvusdid that merge between our 2 most recent restarts?17:04
clarkb2021-09-21 23:00:30 UTC restarted all of zuul on commit 0c26b1570bdd3a4d4479fb8c88a8dca0e9e38b7f and 2021-09-12 22:23:57 UTC restarted all of zuul on commit 9a27c447c159cd657735df66c87b6617c39169f6 are our recorded restarts17:05
corvusyes, that is between those17:05
clarkbso ya I suppose we could be slow due to gerrit being slow to return that info17:07
clarkbdidn't the changes explicitly handle merge commits though?17:07
clarkbmost of opendev's changes are not merge commits17:07
corvusno, it hits that endpoint in all cases17:08
clarkbah17:08
*** mgoddard- is now known as mgoddard17:13
corvusyeah, the graph is only measuring the scheduler processing time, so we don't have a metric from gerrit event received -> trigger event queue17:14
clarkbgotcha so everything before the addEvent() call is not measured17:16
clarkber not addEvent17:16
clarkbself.connection.sched.addTriggerEvent()17:16
clarkbcorvus: I think gerrit's apache logs will log time for those zuul files requests if we want to measure them17:20
clarkbthat might help us rule them in or out17:20
corvuswe do log them individually in zuul too, but i haven't checked to see what we can infer about that yet17:21
fungiunless the logging of those doesn't include the event id or item info, it's not that, because basically the logging for an event seems to just go silent after it's added to the queue17:22
fungiand the first reference after the lengthy lag is about updating the change17:23
fungiwhich i guess is the gerrit querying?17:23
corvusit does not include the event id17:24
corvuslooking at logs from yesterday, the time between "Updating <Change" and "Scheduling event" was typically <1s, and in the etherpad example, it's 16 seconds.17:26
corvusi think that means we're looking at the right place; both files and change cache happen in that window.17:27
corvussome events end up querying gerrit for multiple changes (due to dependencies).  in these cases, we emit more log lines for the other queries.  we can put an upper bound on the files api call with these17:29
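A rough sketch of the log analysis corvus describes: measure the gap between an "Updating <Change" line and the next "Scheduling event" line in the scheduler debug log. Because those lines carry no event id, pairing adjacent lines is only an approximation; the timestamp prefix format is the standard one shown in the log excerpts elsewhere in this log.

    # Report gaps over one second between "Updating <Change" and the next
    # "Scheduling event" line in the scheduler debug log.
    from datetime import datetime

    def ts(line):
        # lines start with "YYYY-MM-DD HH:MM:SS,mmm"
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

    pending = None
    with open("debug.log") as f:
        for line in f:
            if "Updating <Change" in line:
                pending = ts(line)
            elif "Scheduling event" in line and pending is not None:
                delta = (ts(line) - pending).total_seconds()
                if delta > 1:
                    print(f"{delta:6.1f}s  {line.strip()}")
                pending = None
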
fungioh, and i guess event processing is necessarily serial, so a slowdown on that can cause a significant backlog17:30
corvusyep17:30
fungiin that case the lengthy silence in the log makes sense, we've got a funnel with events falling into the top as fast as they want but only coming out the other side as quickly as zuul can process them17:31
corvusi don't think the files endpoint is the problem; i see plenty of cases where there are log entries immediately after them.17:32
clarkbsince the gerrit server upgrade it has definitely been a lot quicker. Not surprised it is handling that fine17:33
corvushowever, i note that the commit message queries may be slow17:33
fungi...which we need to get depends-on17:33
corvuswait, i can't confirm that... let me look some more17:33
corvus(we have 3 gerrit connections and the logs are interleaved :/)17:34
fungioh, ouch17:35
fungiyeah that'll make for a confusing analysis17:36
corvusokay, that query is the last one it runs, so i can't tell if it's slow or the change cache.  but i don't think we've changed anything about gerrit that might make that slow, so it's probably the change cache17:36
fungialso cacti at least didn't indicate any obvious change in gerrit resource consumption on the underlying os (granted that doesn't tell us much about what's going on within the jvm)17:38
corvusi think at this point we should pin zuul to latest release and restart; i can run a scheduler locally and subscribe to the event stream to continue debugging17:38
corvushappily, we shouldn't actually lose many events :)17:39
fungiahh, right, we lose the cache but not the events as they're persisted through the restart?17:41
corvusyeah, that long queue you saw will still be there17:41
corvusfungi, clarkb (if still around); i put some commands in the etherpad17:45
corvusif those look good, i'll run them and then do the usual restart procedure17:45
fungicorvus: i think i understand what those commands do, and am good with the plan17:52
fungiyou're basically adding a local "latest" tag to fool the docker-compose file into using the 4.9.0 release images?17:53
corvusyeah, it won't pull unless we ask it to (and we don't do that in the restart playbook)17:53
fungirather than changing the version in all the docker-compose files17:53
corvusyep.  next step is to do enough local debugging to decide if we want to roll forward or back17:53
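A sketch of the re-tag trick being discussed, using the docker Python library rather than the CLI: tag the 4.9.0 release images as "latest" locally so the unchanged docker-compose files start them on restart without pulling. The image names here are assumptions, not the exact commands from the etherpad.

    # Tag the 4.9.0 release images as "latest" locally so docker-compose
    # starts them without pulling. Image names are assumed, not exhaustive.
    import docker

    client = docker.from_env()
    for name in ("zuul/zuul-scheduler", "zuul/zuul-web"):
        image = client.images.get(f"{name}:4.9.0")
        image.tag(name, tag="latest")
        print("tagged", f"{name}:4.9.0", "as latest")
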
corvusokay, i'll restart now17:54
fungithanks!17:54
fungiif this works like we expect, we should see the event queue burn down fairly quickly once everything's running again17:55
corvusyeah, i wasn't paying attention to precision timing (just <1sec), so it would take around an hour at most, but possibly only minutes.17:56
corvushrm, i don't see a scheduler process running18:00
fungiagreed, just web and finger-gw18:02
corvusoh i mis-tagged18:02
fungidocker-compose logs has nothing18:02
fungioh, the zuul-web tag is wrong18:03
corvusthere we go... my commands had a typo and yep that18:03
corvusso it was trying to run a web process instead18:03
corvusit's coming up now18:03
fungiyes, docker-compose has logs for it now18:04
Clark[m]Sorry had to step out18:06
corvusno prob18:06
fungithe event queue is empty18:09
fungialso lots of warnings in the log about tenants not being loaded18:09
fungiis it possible it tried to process all the events while it was waiting for configs?18:10
fungiif so, we probably need to warn folks that any changes pushed or approved in the past ~90 minutes will need to be revisited if they're not showing up on the status page18:12
corvusit moved them all from the gerrit event queue to the scheduler global queue.  once initial config is complete, the scheduler should move them into the pipeline queues.18:14
fungioh! nice18:14
fungiokay, so it processed them *very* quickly18:14
corvusyeah, it was looking like on the order of 10/s18:14
fungipewpewlasers18:15
corvusit's possible that there still might be a gap that they slipped through18:15
corvuscause now that reloading is done, i don't see any pipeline specific events18:16
corvusso it looks like we did lose them.  that's a bummer.  it'd be nice to fix that, though i'm not sure how much effort it's worth considering how close we are to not needing that.18:16
opendevreviewarkady kanevsky proposed opendev/irc-meetings master: Changed Interop WG meetings time and date.  https://review.opendev.org/c/opendev/irc-meetings/+/80990318:17
corvusfungi: so yeah, i think an announcement is warranted18:17
corvusi'm re-enqueuing the pipeline contents now18:17
fungithanks, once the re-enqueue is complete i'll do a status notice recommending changes acted on between... 17:00 and 18:30 utc? be revisited18:24
fungiwe were running somewhere around an hour lag in the queue based on some of the last log analyses18:24
fungistatus notice Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued on the Zuul status page18:27
fungisomething like that18:27
corvusfungi: lgtm; re-enqueue complete18:35
fungi#status notice Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page18:35
opendevstatusfungi: sending notice18:35
-opendevstatus- NOTICE: Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page18:35
fungithanks corvus!18:35
fungifor now, the event queue remains small enough i can even fit it on *my* screen (around 5-10 entries)19:09
fungiand often 019:10
clarkbwas 4.9.0 tagged off of our last restart commit?19:16
fungii believe so19:16
fungiessentially rollback to our pre-yesterday's-restart state19:16
clarkbya, so we have to worry about the stuck queue problem but we have a known workaround for that we can employ if necessary19:17
*** slaweq_ is now known as slaweq19:26
clarkbianw fungi thoughts on approving https://review.opendev.org/c/opendev/system-config/+/809512 ? or do we want to keep potential for problems low after this morning?19:44
clarkb(also I'm unlikely to be around when restarting is convenient for that if approved now)19:44
clarkbIn good news gerrit replication continues to be clean19:45
clarkbmnaser: ^ re the networking updates that you made would you have expected them to impact ssh connections from ymq to sjc?19:45
clarkbwondering if it is coincidence that it appears happier now or if that was a likely fixup19:45
mnaserit is a likely fix up, i think19:45
clarkbcool well evidence on our end continues to be good then :)19:46
fungii'm also not going to be around as much for the rest of today, barring emergencies19:47
fungiso maybe we save 809512 for tomorrow unless ianw wants to keep an eye on it19:48
clarkbwfm (though I'm out tomorrow if all goes according to plan. Still trying to confirm that fishing schedules haven't changed for some reason)19:48
fungiyeah, i expect to be more available assuming no new fires ablaze20:02
corvusthere's no change 807969.  just a fun fact i thought i'd share.20:40
priteaufungi: sorry for not replying earlier, I was off for half of the day. When I reported the Zuul issue earlier today, I think it was around 6 minutes between changeset submission and job visible in queue20:46
fungipriteau: it reached roughly an hour before we rolled back20:47
fungicorvus: i've seen that before when a large batch of changes gets pushed and one or more of them error for some reason20:47
fungithere was a span of nonexistent change numbers last week where someone seemed to have pushed a large batch of new changes20:48
ianwsorry running a bit late, catching up22:15
ianwi'm mostly out tomorrow as well (public holiday here)22:20
clarkbianw: cool I hope you're able to do something fun too :)22:21
clarkbfwiw I'm about to pull up the zuul code to see if anything stands out in the bit that corvus identified in the zuul channel as a likely cause of the slowdown22:22
clarkbOtherwise it seems like the network update in vexxhost may have made things happier so the restart for timeouts is less urgent22:22
ianwheh not really until we reach 80% vaccination rates, sometime in october22:22
ianwcorvus: i saw a similar odd thing with changes @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-09-17.log.html#t2021-09-17T00:44:3722:24
ianwi wonder if there is something going on we should look at22:24
fungicheck for backtraces in the gerrit error log between the upload times for the existing changes to see if there's an error from a push which ate a change number?22:27
clarkbI think that notedb stores the next change id value and they bump it and don't roll it back if anything goes wrong22:28
clarkblike I wonder if pushing without a valid cla is a case where you can trip that22:29
ianw$ less error_log | grep patchset-created | grep 'error status: 2' | wc -l22:35
ianw51622:35
ianwit looks like the patchset-created hook is unhappy a lot22:35
clarkbI don't think the hook failing should impact change creation though (gerrit logs it and moves on)22:36
ianwthe other one that is regular is22:37
ianw"Could not process project event. No its-project associated with project airship/airshipctl. Did you forget to configure the ITS project association in project.config?"22:37
clarkbfungi: ^ might understand that one better22:37
clarkbI think it has to do with the storyboard plugin somehow. Not sure if we are supposed to toggle an explicit flag to enable that per project22:38
fungiit was supposed to be on globally22:39
fungimaybe that changed with the 3.2 upgrade22:40
ianwfungi: here is a list of the errors and count https://paste.opendev.org/show/809518/22:43
ianwit seems like ... everything22:43
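A small sketch of the kind of tally behind ianw's paste: group Gerrit error_log lines by message and count the repeats. The "[timestamp] [thread] LEVEL message" layout is an assumption about the log format and may need adjusting.

    # Tally repeated Gerrit error_log messages; parsing assumes a
    # "[timestamp] [thread] LEVEL message" layout.
    from collections import Counter

    counts = Counter()
    with open("error_log") as f:
        for line in f:
            for level in ("ERROR", "WARN"):
                marker = f"] {level} "
                if marker in line:
                    message = line.split(marker, 1)[1].strip()
                    counts[message[:120]] += 1
                    break

    for message, count in counts.most_common(15):
        print(f"{count:6d}  {message}")
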
corvusClark, fungi: anyone know how to find that tripleo change with all the deps?22:44
fungicorvus: 808059 according to a grep of the channel log22:45
ianwhttps://gerrit.googlesource.com/plugins/its-base/+/refs/heads/master/src/main/java/com/googlesource/gerrit/plugins/its/base/workflow/ActionController.java#90 is where it emits22:45
clarkbyup 808059 looks like the one to me22:46
fungiianw: i'm fading, but it's probably worth us revisiting the its-base docs and double-checking our tracking id configuration against that22:47
ianwfungi: np, i might poke a bit because it would help us see real issues if we can clear these regularly repeating ones22:47
ianwok, looking at meta/config for airship/docs, i don't see a [plugin "its-storyboard"]22:58
ianwhrm, maybe "project event" is the key here23:03
ianw"To be able to make use of actions acting at the ITS project level, you must associate a Gerrit project to its ITS project counterpart."23:04
ianwmaybe we don't want "actions at the ... project level"?23:05
clarkbis the default change level? seems like those are mapping to projects in storyboard somehow23:06
corvuswhat do folks think about a zuul restart to gather more data about the issue from earlier?23:12
corvusmaybe a restart now, and then check back in several hours or so and then a rollback before the delay becomes too severe23:13
ianwcorvus: not too up on details, but i'm happy to help, if i can :)23:13
ianwfungi / clarkb : is it possible ~gerrit2/review_site/etc/its/actions.config is not under config management?23:14
clarkbcorvus: the biggest thing is the openstack release. queues look quiet right now though it is hard to verify every single change queued isn't important for that. I suspect it is ok23:15
corvusyeah, and i'd obviously save/restore queues too.23:15
clarkbianw: It certainly looks that way? the gerrit.config and secure.config have its-storyboard sections23:16
clarkbianw: is it possible that gerrit is writing that config file out itself?23:16
ianwi don't think so, it has a timestamp from 201823:17
corvusokay, i'll go ahead and do a restart now, then afk for a bit, then check back throughout the evening and keep ianw up to date :)23:17
clarkbsounds good.23:18
ianw++23:18
clarkbAs a reminder I'm not really around tomorrow and will be popping out soon to have dinner too23:18
corvusianw: https://etherpad.opendev.org/p/gaK1SsRMpe3hh4ASOCLi has a bunch of info23:18
corvusincluding the rollback procedure :)23:19
corvus#status log restarted all of zuul on master for change cache bug data collection 23:22
opendevstatuscorvus: finished logging23:22
ianwpossibly the rules are defined in all projects, and this on-disk file is old?23:24
clarkbianw: maybe?23:25
ianwnope, i just checked All-Projects / project.config and its rules aren't defined there23:26
corvusi'm starting to get a bit worried about the relative silence in the zuul logs :/23:34
yuriysSomething funky is happening, I noticed all our cloud in-use instances went to 0, and zuul.opendev.org is just spinning.23:36
clarkbyuriys: zuul is being restarted which causes it to free its instances23:36
clarkbcorvus: did it not start a process again?23:36
corvusit's running, it's just that the only thing being logged right now are gerrit events23:37
corvusit should be doing something like getting branch lists or keys... but i don't know what23:39
corvusi think we need a lot more log entries :(23:41
corvusokay, there is no main thread running23:46
corvusthat's a new experience23:46
corvuswe never even made it to "Starting scheduler"23:46
corvuscause there's a bunch of stuff that's outside an exception handler(!) in startup23:46
clarkbexciting (for all the wrong reasons)23:47
corvusokay, looking at the docker log, i see a potentially fatal error related to github installations23:50
corvusi wonder if we had a github installation removed very recently?23:50
clarkbI'm not aware of one. Don't we have the single github.com connection?23:51
corvusyes, but that corresponds to several "installations" which is a github api concept23:51
corvuswe iterate over all the projects we think we're attached to, and get our installation key for that project23:52
clarkbah23:53
corvusunfortunately, we don't log that, so all i know is installation 121061 returns 40323:53
clarkbcorvus: is it possible the remote side removed the app access?23:53
clarkbrather than us changing anything23:53
corvusyeah, that's the question i'm trying to ask23:53
clarkbI'm not sure that is something we get notification of unfortunately. The remote can update their settings and we just have to hope they don't do so inappropriately23:54
corvuswe may get that notification and just not act on it23:54
corvus2021-09-22 23:06:52,552 DEBUG zuul.GithubRequest: POST https://api.github.com/app/installations/121061/access_tokens result: 201, size: 275, duration: 19623:54
corvus2021-09-22 23:21:51,523 DEBUG zuul.GithubRequest: POST https://api.github.com/app/installations/121061/access_tokens result: 403, size: 163, duration: 6423:54
corvusbased on proximity of log entries, i'm going to guess that it's 'https://api.github.com', 'repos', 'sigmavirus24', 'github3.py',23:56
corvusi think we need to pull that out of the tenant config if we want zuul to start23:56
clarkbI'll push that up momentarily if you want to force merge it?23:57
corvusclarkb: yes pls23:58
corvusi have done that manually on the host in the interim23:58
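A sketch of how the installation guess above could be confirmed: list the GitHub App's installations with an app JWT and see which account installation 121061 belongs to. The app id and key path are placeholders; the endpoints are part of the same GitHub App API the 403 log lines show Zuul calling.

    # List the GitHub App's installations to see which account an
    # installation id belongs to. APP_ID and the key path are placeholders.
    import time

    import jwt  # PyJWT
    import requests

    APP_ID = "12345"
    with open("github-app-key.pem") as f:
        key = f.read()

    now = int(time.time())
    token = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": APP_ID},
        key, algorithm="RS256")

    resp = requests.get(
        "https://api.github.com/app/installations",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"})
    for inst in resp.json():
        print(inst["id"], inst["account"]["login"], inst.get("suspended_at"))
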
opendevreviewClark Boylan proposed openstack/project-config master: Remove github3.py from our zuul config  https://review.opendev.org/c/openstack/project-config/+/81053023:59
clarkbcorvus: ^ something like that I think23:59
