Wednesday, 2021-09-22

clarkbok fedora-34 cleanup complete. Basically had to do the same thing for a number of those records and eventually it stopped having errors and looks normal again00:00
fungiany indication as to how old they were?00:03
fungii suppose we could guess from their serials00:03
clarkbfungi: ya you can in theory look those up. The first one was gra1 the others were all for inmotion. I wonder if some combo of unhappy cloud and maybe restarting the builder when it was cleaning up records contributed to that00:04
fungiso probably fairly recent anyway00:04
ianwlooks like i've still got a few graph stats wrong00:42
*** ykarel|away is now known as ykarel04:09
opendevreviewIan Wienand proposed openstack/project-config master: grafana: further path fixes  https://review.opendev.org/c/openstack/project-config/+/81033804:55
*** ysandeep|out is now known as ysandeep05:02
*** ykarel is now known as ykarel|afk05:03
opendevreviewMerged openstack/project-config master: Update grafana to reflect dvr-ha job is now voting  https://review.opendev.org/c/openstack/project-config/+/80559405:58
*** jpena|off is now known as jpena06:52
*** rpittau|afk is now known as rpittau07:28
*** ysandeep is now known as ysandeep|lunch08:14
priteauGood morning. Is there some known slowness in Zuul today? I am seeing delays between patch submission and jobs showing in the queue at https://zuul.opendev.org/t/openstack/status08:36
*** ysandeep|lunch is now known as ysandeep09:49
*** ykarel is now known as ykarel|away10:25
*** ysandeep is now known as ysandeep|afk10:47
*** dviroel|out is now known as dviroel11:18
*** jpena is now known as jpena|lunch11:27
*** ysandeep|afk is now known as ysandeep11:55
*** jpena|lunch is now known as jpena12:24
fungipriteau: how long of a delay generally? the flow is that when you push a patch to gerrit it emits an event on its event stream, zuul listens to the stream and adds each event to an inbound event processing queue, when it has time, it takes the oldest event from the queue and decides whether it should act on it, which may include scheduling builds in some pipeline, then it asks the mergers to12:35
fungiprovide it with an appropriately constructed merge commit to provide context, then once it has that merge commit it re-examines the configuration to determine which pipelines should get what builds, at which point it asks nodepool to assign the requisite number of nodes for each build, and after that it asks the executors to start running the various builds it's decided should happen12:35
fungiif the delay is in the change showing up in pipelines, then it may be slowness on the gerrit side or a backup in the event processing queue (the status page also shows a count of items in that queue to aid in identifying the latter case)12:36
fungiif the change shows up in one or more pipelines but doesn't have a list of builds associated with it, zuul is probably waiting on mergers to provide it the appropriate git contexts12:36
fungiif the change is in a pipeline but some or all of its builds just show a queued state then it's probably waiting for nodepool to provide nodes for those builds12:37
fungialso some dependent pipelines have a queue "window" which dynamically decides the number of changes it will test in parallel for a given queue in order to minimize resource waste if a change near the end of the queue fails and forces all the changes behind it to be retested12:39
fungiin that case, you'll see changes toward the bottom of the queue in a waiting state12:39
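A minimal sketch of the first step fungi describes above: reading Gerrit's SSH event stream, the same feed Zuul's Gerrit driver turns into trigger events. The SSH account name here is a placeholder; the host and port are Gerrit's standard SSH API.

    # Tail Gerrit's SSH event stream and print the event types Zuul would
    # turn into trigger events. "someuser" is a placeholder SSH account.
    import json
    import subprocess

    cmd = ["ssh", "-p", "29418", "someuser@review.opendev.org",
           "gerrit", "stream-events"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            event = json.loads(line)
            # e.g. patchset-created, comment-added, change-merged
            print(event.get("type"), event.get("change", {}).get("number"))
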
*** arxcruz is now known as arxcruz|ruck13:14
TheJuliais... gerrit and zuul... umm. okay this morning?14:25
TheJuliawell, Gerrit just seems to get a sporadic hang on page load. Looks like 3rd party CI just posted a reply to a recheck already14:26
fungipriteau reported observing zuul taking some time to enqueue changes, or at least that's what it sounded like14:26
TheJuliabut zuul doesn't see the change nor the original change so I'm kind of wondering if things are not being queued up14:26
TheJuliaare we talking like 45 minutes?14:26
fungihe didn't specify, i've mostly been head-down on other maintenance tasks so i haven't had an opportunity to observe similar behaviors but sounds like maybe you have14:28
fungithe status page for the openstack tenant indicates a glance change entered the check pipeline just a couple minutes ago, i'll see if i can track how long that took14:29
Clark[m]I'm not really here yet. But it always helps to have specifics like change numbers/ID etc14:29
fungipatchset uploaded at 13:51 utc, entered the check pipeline a few minutes ago14:29
fungiso yeah, does seem like a longer delay than i would have expected14:30
fungii'll check resource utilization on relevant systems and see if it's obvious whether something's overloaded14:30
TheJuliaokay14:31
TheJuliaso I'm not going crazy and things are still "working"14:31
TheJuliacool!14:31
fungiworking on island time14:31
fungibut yeah, i just watched a tripleo-quickstart change finish testing and immediately merge, so it seems like time to enqueue is impacted but maybe not actual results processing14:33
fungino obvious resource pressure on any of the systems, i guess it's to the zuul scheduler log next14:39
*** ysandeep is now known as ysandeep|away14:43
fungilooking at the scheduler debug log, there certainly seems to be some substantial delay in event processing. 809634,12 for airship/porthole was uploaded at 13:59z and the scheduler logged seeing the event from gerrit at that time (13:59:21,605) and submits it to the event queue immediately thereafter, but doesn't seem to start processing it until much later (14:42:19,278)14:50
fungii suppose there could be a massive event backlog and the "0 trigger events" on the status page is inaccurate14:51
fungii'll check grafana to see if that tells a different story14:51
fungithe event processing time graph only seems to indicate at worst a few minutes and most of the time just a few seconds14:55
fungibut we don't seem to graph the event queue length14:55
Clark[m]fungi: I think you can check the queue directly in zk now15:11
fungiahh, like with zk shell, or we report it to graphite?15:11
Clark[m]With zk shell15:13
Clark[m]Not sure that is the best method but may work if nothing else does15:13
fungisure, i'll see what i can work out15:15
Clark[m]Looking at grafana I don't see anything that clearly looks wrong on the zuul status page15:23
fungiyeah, same15:23
Clark[m]One thought was maybe merges are slow since that is the next step after getting the event from Gerrit and converting it to an internal event iirc15:23
Clark[m]But I think the change shows up in the status pipelines while waiting for merge to happen, it doesn't have any jobs associated to it at that point15:24
Clark[m]In this case we don't see the changes ending up there so slowness in requesting the merge or getting to that point?15:24
fungii guess we don't have zk-shell installed on the scheduler, where have you typically been running it from? we also have to authenticate with it now, right?15:25
Clark[m]I have it in a venv on zk04 because you can do local connections without auth15:27
Clark[m]Greatly simplifies things15:27
fungioh, yep i didn't consider that, thanks15:30
clarkbI'm getting situated at the desk finally. I've got to pop out in a couple of hours for a doctor visit so will let others drive this one15:31
fungino worries, i'm digging15:35
fungithe /zuul/events tree in zk is indeed massive15:39
fungiin particular, /zuul/events/connection/gerrit/events/queue has 5103 znodes15:40
fungii'll check it again in a few minutes and see whether that's rising or falling15:40
fungiseems to rise and fall, i saw it go up to 5128 and then drop to 496015:45
*** rpittau is now known as rpittau|afk15:45
clarkbbut definitely not trending all the way to zero? it's possible we've reached a steady state backlog after the NA morning rush?15:45
fungidropped further to 4936 now15:46
clarkbfungi: does zk shell have a dirent count function? or are you having to pass that off to unix shell commands like wc -l?15:48
fungithe latter15:48
clarkbfungi: you might also want to try a get on the oldest looking one and see what it contains as well as check that it is being processed in a fifo fashion (oldest out first)15:49
clarkb?15:49
fungiyep, working on analyzing it15:49
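A minimal sketch of the same inspection done with kazoo instead of zk-shell, assuming local, unauthenticated access to the plaintext client port on the ZooKeeper host as Clark describes; the znode path is the one fungi found below.

    # Count the queued Gerrit events directly in ZooKeeper and peek at the
    # oldest znode to confirm FIFO draining. Assumes local unauthenticated
    # access on port 2181.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()

    path = "/zuul/events/connection/gerrit/events/queue"
    children = sorted(zk.get_children(path))
    print("queued events:", len(children))

    if children:
        data, stat = zk.get(f"{path}/{children[0]}")
        print("oldest:", children[0], "created at:", stat.created)

    zk.stop()
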
clarkblooks like https://review.opendev.org/c/opendev/system-config/+/809512 and child didn't get approved yet, but probably best to focus on zuul behavior for now15:53
clarkbNo leaked replication tasks currently15:56
fungiso zk-shell tree doesn't seem to output the znodes in any particular order, but if i sort them i can see the lowest-numbered znodes disappearing on subsequent runs15:56
fungialso the queue size is now down to 489215:56
fungiso it's possible we had a massive spike in events hours ago and it's been steadily eating away at that, hard to know without historical trending15:57
clarkbfungi: I think ls printed them out in order for me yesterday when I was dealing with the nodepool fedora-34 thing15:57
*** jpena is now known as jpena|off15:59
fungiaha, i missed that ls was an available command, looks like earlier when i ran help i truncated the output unintentionally16:00
fungiyeah, ls does return them in order, tree did not16:02
fungilength was 4885 moments ago and now it's jumped back up to 493616:03
fungitwo steps forward, one step back16:03
clarkbcertainly seems like we might just be barely keeping up with the current input demands.16:04
* TheJulia wonders if everyone just needs to go make a sandwich....16:04
fungior three16:04
clarkbI wonder if there may be lock contention making that worse. Do we lock to append to the queue and maybe that prevents us from popping the head?16:04
TheJuliaor a sandwich and a movie16:04
TheJulia"Please, go watch LOTR after having a sandwich"16:05
clarkbnote the change to move event handling to zk was part of the most recent restart16:05
fungidirector's cut if possible16:05
fungiqueue length is now 5030 so i think everyone just finished sandwiches16:06
clarkbzuul.GerritEventConnector <- is the logger name for the bit that moves the events around I think16:07
clarkber sorry it was the change cache being in zookeeper not in process memory that changed in the connection drivers16:10
clarkbLooking at 809955 because it recently ended up in check we can see the recheck comment is placed on the zk connection queue at 2021-09-22 15:05:41,654 But then we basically do nothing else for that change until 2021-09-22 16:12:44,16016:17
clarkbCertainly seems like we're starving the queue processing somewhere16:17
fungiyes, the debug log just goes silent about the event from moments after it stuffs it into the queue until moments before it's evaluated16:20
mnaserinfra-root: i see two backup servers in the montreal datacenter, is that on purpose or is backup01  a forgotten relic? :)16:25
clarkbmnaser: I think the older one may be the bup backup server and the newer one does borg16:26
clarkbmnaser: we should cross check with ianw but it is possible that bup has been gone for long enough now that we feel comfortable removing its backups16:27
mnaserok cool, no problem16:27
mnaserfwiw i hope we're backing up review02 to sjc1 :p16:27
clarkbmnaser: we're backing it up to wherever the vexxhost backup server is and to the rax backup server16:27
mnaserok cool so there is some offsite-ness to it16:27
clarkbI can't remember if it is in sjc1 or ca-ymq-1 but we do two different sites16:27
clarkbyup16:28
mnaserbackup servers seem to be in montreal so just wanted to make sure we're not backing up gerrit to the same place \o/16:28
clarkbyup there are two target sites and we backup to each with a 12 hour offset doing dailies16:28
fungiright, we back up to two locations in separate providers. earlier this year we switched backup solutions but were keeping the old backup servers around for a while in case we needed something from longer ago than when we switched16:28
fungii have a feeling we can look at cleaning up one or both of the old servers soon, if they're eating up a lot of resources16:29
clarkbfungi: corvus: looking at the zuul gerrit driver I suspect one of two things may be happening. Either we are not reliably winning elections or doing so takes significant time so each pass through the event queue is interrupted by that. Or iterating through the event queue is itself slow16:29
mnaserthat's fine, it's not that much resources i think, i'm just getting ready to kick them off to the new site soon (like we did for review02)16:29
clarkbUnfortunately we don't seem to have a ton of logging around this so inferring state from logs isn't easy16:30
fungicool, thanks mnaser! happy to help coordinate reboots there too16:30
fungias long as we don't have a backup in progress at the time, it should be entirely non-impacting16:30
fungifor the new server that is. for the old one, any time is fine16:30
mnaserand im going to guess old is 01 and new is 02 :)16:31
clarkbbackup02.ca-ymq-1.vexxhost.opendev.org is the current one according to our inventory file16:33
clarkbso ya 01 should be the old one16:33
clarkbhttps://review.opendev.org/c/zuul/zuul/+/810467 is the sort of logging that I think we are missing.16:45
clarkbAlso I think another option would be to run the yappi profiling. This will make things slower for a period of time16:45
clarkbbut maybe that will show us spinning in the election methods or spending a lot of time in zk get or something16:45
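For reference, a generic sketch of the yappi profiling clarkb suggests: wrap a suspect code path, then dump per-function totals. run_suspect_code_path is a placeholder, and however the profiler actually gets toggled in a running scheduler, the output it produces looks like this.

    # Profile a suspect code path with yappi and print per-function totals.
    # run_suspect_code_path() is a placeholder for whatever is being measured.
    import yappi

    yappi.set_clock_type("wall")  # event handling is mostly I/O-bound
    yappi.start()

    run_suspect_code_path()

    yappi.stop()
    stats = yappi.get_func_stats()
    stats.sort("ttot")  # total time, including ZK gets and election calls
    stats.print_all()
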
corvuscan we maybe start an etherpad or something about the event queue stuff?  i didn't realize there were reports of an issue16:52
clarkbthat seems like a good idea16:52
corvushttps://etherpad.opendev.org/p/gaK1SsRMpe3hh4ASOCLi16:52
fungiyeah, didn't want to pester you about it if there was a reasonable explanation, but this does seem possibly introduced in the latest restart16:55
corvusthe "zuul event processing time" graph doesn't show an increase, which is what it's designed to do in this case.  so we should pay attention to what that's measuring; it may give us information about where the slowdown is or isn't16:56
fungiright, i noticed the same, so the delay in this case is something not measured by that value16:57
corvusif the start time for that begins after the gerrit driver internal processing, then it seems likely that we're getting events from gerrit then spending too long updating the change cache before passing the info on to the scheduler queue.16:58
corvusclarkb: ^ we may be able to infer the answer to your questions in the etherpad without new changes16:59
clarkbcorvus: ya I figured we might be able to, but also thought a change pointing out areas where we want to measure might help others understand the bit that is likely slow17:00
clarkbI'm definitely going to have to defer to others as I have an appointment in an hour and need to head out sometime before then17:00
corvuscpu use has increased17:03
corvusthe other thing to watch is the files change17:04
clarkbit didn't look too bad on the zk side17:04
clarkbcpu use I mean. I didn't check the scheduler side17:04
corvusdid that merge between our 2 most recent restarts?17:04
clarkb2021-09-21 23:00:30 UTC restarted all of zuul on commit 0c26b1570bdd3a4d4479fb8c88a8dca0e9e38b7f and 2021-09-12 22:23:57 UTC restarted all of zuul on commit 9a27c447c159cd657735df66c87b6617c39169f6 are our recorded restarts17:05
corvusyes, that is between those17:05
clarkbso ya I suppose we could be slow due to gerrit being slow to return that info17:07
clarkbdidn't the changes explicitly handle merge commits though?17:07
clarkbmost of opendev's changes are not merge commits17:07
corvusno, it hits that endpoint in all cases17:08
clarkbah17:08
*** mgoddard- is now known as mgoddard17:13
corvusyeah, the graph is only measuring the scheduler processing time, so we don't have a metric from gerrit event received -> trigger event queue17:14
clarkbgotcha so everything before the addEvent() call is not measured17:16
clarkber not addEvent17:16
clarkbself.connection.sched.addTriggerEvent()17:16
clarkbcorvus: I think gerrit's apache logs will log time for those zuul files requests if we want to measure them17:20
clarkbthat might help us rule them in or out17:20
corvuswe do log them individually in zuul too, but i haven't checked to see what we can infer about that yet17:21
fungiunless the logging of those doesn't include the event id or item info, it's not that, because basically the logging for an event seems to just go silent after it's added to the queue17:22
fungiand the first reference after the lengthy lag is about updating the change17:23
fungiwhich i guess is the gerrit querying?17:23
corvusit does not include the event id17:24
corvuslooking at logs from yesterday, the time between "Updating <Change" and "Scheduling event" was typically <1s, and in the etherpad example, it's 16 seconds.17:26
corvusi think that means we're looking at the right place; both files and change cache happen in that window.17:27
corvussome events end up querying gerrit for multiple changes (due to dependencies).  in these cases, we emit more log lines for the other queries.  we can put an upper bound on the files api call with these17:29
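A rough sketch of the log analysis corvus describes: measure the gap between an "Updating <Change" line and the next "Scheduling event" line in the scheduler debug log. Because those lines carry no event id, pairing adjacent lines is only an approximation; the timestamp prefix format is the standard one shown in the log excerpts elsewhere in this log.

    # Report gaps over one second between "Updating <Change" and the next
    # "Scheduling event" line in the scheduler debug log.
    from datetime import datetime

    def ts(line):
        # lines start with "YYYY-MM-DD HH:MM:SS,mmm"
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

    pending = None
    with open("debug.log") as f:
        for line in f:
            if "Updating <Change" in line:
                pending = ts(line)
            elif "Scheduling event" in line and pending is not None:
                delta = (ts(line) - pending).total_seconds()
                if delta > 1:
                    print(f"{delta:6.1f}s  {line.strip()}")
                pending = None
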
fungioh, and i guess event processing is necessarily serial, so a slowdown on that can cause a significant backlog17:30
corvusyep17:30
fungiin that case the lengthy silence in the log makes sense, we've got a funnel with events falling into the top as fast as they want but only coming out the other side as quickly as zuul can process them17:31
corvusi don't think the files endpoint is the problem; i see plenty of cases where there are log entries immediately after them.17:32
clarkbsince the gerrit server upgrade it has definitely been a lot quicker. Not surprised it is handling that fine17:33
corvushowever, i note that the commit message queries may be slow17:33
fungi...which we need to get depends-on17:33
corvuswait, i can't confirm that... let me look some more17:33
corvus(we have 3 gerrit connections and the logs are interleaved :/)17:34
fungioh, ouch17:35
fungiyeah that'll make for a confusing analysis17:36
corvusokay, that query is the last one it runs, so i can't tell if it's slow or the change cache.  but i don't think we've changed anything about gerrit that might make that slow, so it's probably the change cache17:36
fungialso cacti at least didn't indicate any obvious change in gerrit resource consumption on the underlying os (granted that doesn't tell us much about what's going on within the jvm)17:38
corvusi think at this point we should pin zuul to latest release and restart; i can run a scheduler locally and subscribe to the event stream to continue debugging17:38
corvushappily, we shouldn't actually lose many events :)17:39
fungiahh, right, we lose the cache but not the events as they're persisted through the restart?17:41
corvusyeah, that long queue you saw will still be there17:41
corvusfungi, clarkb (if still around); i put some commands in the etherpad17:45
corvusif those look good, i'll run them and then do the usual restart procedure17:45
fungicorvus: i think i understand what those commands do, and am good with the plan17:52
fungiyou're basically adding a local "latest" tag to fool the docker-compose file into using the 4.9.0 release images?17:53
corvusyeah, it won't pull unless we ask it to (and we don't do that in the restart playbook)17:53
fungirather than changing the version in all the docker-compose files17:53
corvusyep.  next step is to do enough local debugging to decide if we want to roll forward or back17:53
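A sketch of the re-tag trick being discussed, using the docker Python library rather than the CLI: tag the 4.9.0 release images as "latest" locally so the unchanged docker-compose files start them on restart without pulling. The image names here are assumptions, not the exact commands from the etherpad.

    # Tag the 4.9.0 release images as "latest" locally so docker-compose
    # starts them without pulling. Image names are assumed, not exhaustive.
    import docker

    client = docker.from_env()
    for name in ("zuul/zuul-scheduler", "zuul/zuul-web"):
        image = client.images.get(f"{name}:4.9.0")
        image.tag(name, tag="latest")
        print("tagged", f"{name}:4.9.0", "as latest")
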
corvusokay, i'll restart now17:54
fungithanks!17:54
fungiif this works like we expect, we should see the event queue burn down fairly quickly once everything's running again17:55
corvusyeah, i wasn't paying attention to precision timing (just <1sec), so it would take around an hour at most, but possibly only minutes.17:56
corvushrm, i don't see a scheduler process running18:00
fungiagreed, just web and finger-gw18:02
corvusoh i mis-tagged18:02
fungidocker-compose logs has nothing18:02
fungioh, the zuul-web tag is wrong18:03
corvusthere we go... my commands had a typo and yep that18:03
corvusso it was trying to run a web process instead18:03
corvusit's coming up now18:03
fungiyes, docker-compose has logs for it now18:04
Clark[m]Sorry had to step out18:06
corvusno prob18:06
fungithe event queue is empty18:09
fungialso lots of warnings in the log about tenants not being loaded18:09
fungiis it possible it tried to process all the events while it was waiting for configs?18:10
fungiif so, we probably need to warn folks that any changes pushed or approved in the past ~90 minutes will need to be revisited if they're not showing up on the status page18:12
corvusit moved them all from the gerrit event queue to the scheduler global queue.  once initial config is complete, the scheduler should move them into the pipeline queues.18:14
fungioh! nice18:14
fungiokay, so it processed them *very* quickly18:14
corvusyeah, it was looking like on the order of 10/s18:14
fungipewpewlasers18:15
corvusit's possible that there still might be a gap that they slipped through18:15
corvuscause now that reloading is done, i don't see any pipeline specific events18:16
corvusso it looks like we did lose them.  that's a bummer.  it'd be nice to fix that, though i'm not sure how much effort it's worth considering how close we are to not needing that.18:16
opendevreviewarkady kanevsky proposed opendev/irc-meetings master: Changed Interop WG meetings time and date.  https://review.opendev.org/c/opendev/irc-meetings/+/80990318:17
corvusfungi: so yeah, i think an announcement is warranted18:17
corvusi'm re-enqueuing the pipeline contents now18:17
fungithanks, once the re-enqueue is complete i'll do a status notice recommending changes acted on between... 17:00 and 18:30 utc? be revisited18:24
fungiwe were running somewhere around an hour lag in the queue based on some of the last log analyses18:24
fungistatus notice Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued on the Zuul status page18:27
fungisomething like that18:27
corvusfungi: lgtm; re-enqueue complete18:35
fungi#status notice Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page18:35
opendevstatusfungi: sending notice18:35
-opendevstatus- NOTICE: Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page18:35
fungithanks corvus!18:35
fungifor now, the event queue remains small enough i can even fit it on *my* screen (around 5-10 entries)19:09
fungiand often 019:10
clarkbwas 4.9.0 tagged off of our last restart commit?19:16
fungii believe so19:16
fungiessentially rollback to our pre-yesterday's-restart state19:16
clarkbya, so we have to worry about the stuck queue problem but we have a known workaround for that we can employ if necessary19:17
*** slaweq_ is now known as slaweq19:26
clarkbianw fungi thoughts on approving https://review.opendev.org/c/opendev/system-config/+/809512 ? or do we want to keep potential for problems low after this morning?19:44
clarkb(also I'm unlikely to be around when restarting is convenient for that if approved now)19:44
clarkbIn good news gerrit replication continues to be clean19:45
clarkbmnaser: ^ re the networking updates that you made would you have expected them to impact ssh connections from ymq to sjc?19:45
clarkbwondering if it is coincidence that it appears happier now or if that was a likely fixup19:45
mnaserit is a likely fix up, i think19:45
clarkbcool well evidence on our end continues to be good then :)19:46
fungii'm also not going to be around as much for the rest of today, barring emergencies19:47
fungiso maybe we save 809512 for tomorrow unless ianw wants to keep an eye on it19:48
clarkbwfm (though I'm out tomorrow if all goes according to plan. Still trying to confirm that fishing schedules haven't changed for some reason)19:48
fungiyeah, i expect to be more available assuming no new fires ablaze20:02
corvusthere's no change 807969.  just a fun fact i thought i'd share.20:40
priteaufungi: sorry for not replying earlier, I was off for half of the day. When I reported the Zuul issue earlier today, I think it was around 6 minutes between changeset submission and job visible in queue20:46
fungipriteau: it reached roughly an hour before we rolled back20:47
fungicorvus: i've seen that before when a large batch of changes gets pushed and one or more of them error for some reason20:47
fungithere was a span of nonexistent change numbers last week where someone seemed to have pushed a large batch of new changes20:48
ianwsorry running a bit late, catching up22:15
ianwi'm mostly out tomorrow as well (public holiday here)22:20
clarkbianw: cool I hope you're able to do something fun too :)22:21
clarkbfwiw I'm about to pull up the zuul code to see if anything stands out in the bit that corvus identified in the zuul channel as a likely cause of the slowdown22:22
clarkbOtherwise it seems like the network update in vexxhost may have made things happier so the restart for timeouts is less urgent22:22
ianwheh not really until we reach 80% vaccination rates, sometime in october22:22
ianwcorvus: i saw a similar odd thing with changes @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-09-17.log.html#t2021-09-17T00:44:3722:24
ianwi wonder if there is something going on we should look at22:24
fungicheck for backtraces in the gerrit error log between the upload times for the existing changes to see if there's an error from a push which ate a change number?22:27
clarkbI think that notedb stores the next change id value and they bump it and don't roll it back if anything goes wrong22:28
clarkblike I wonder if pushing without a valid cla is a case where you can trip that22:29
ianw$ less error_log | grep patchset-created | grep 'error status: 2' | wc -l22:35
ianw51622:35
ianwit looks like the patchset-created hook is unhappy a lot22:35
clarkbI don't think the hook failing should impact change creation though (gerrit logs it and moves on)22:36
ianwthe other one that is regular is22:37
ianw"Could not process project event. No its-project associated with project airship/airshipctl. Did you forget to configure the ITS project association in project.config?"22:37
clarkbfungi: ^ might understand that one better22:37
clarkbI think it has to do with the storyboard plugin somehow. Not sure if we are supposed to toggle an explicit flag to enable that per project22:38
fungiit was supposed to be on globally22:39
fungimaybe that changed with the 3.2 upgrade22:40
ianwfungi: here is a list of the errors and count https://paste.opendev.org/show/809518/22:43
ianwit seems like ... everything22:43
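A small sketch of the kind of tally behind ianw's paste: group Gerrit error_log lines by message and count the repeats. The "[timestamp] [thread] LEVEL message" layout is an assumption about the log format and may need adjusting.

    # Tally repeated Gerrit error_log messages; parsing assumes a
    # "[timestamp] [thread] LEVEL message" layout.
    from collections import Counter

    counts = Counter()
    with open("error_log") as f:
        for line in f:
            for level in ("ERROR", "WARN"):
                marker = f"] {level} "
                if marker in line:
                    message = line.split(marker, 1)[1].strip()
                    counts[message[:120]] += 1
                    break

    for message, count in counts.most_common(15):
        print(f"{count:6d}  {message}")
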
corvusClark, fungi: anyone know how to find that tripleo change with all the deps?22:44
fungicorvus: 808059 according to a grep of the channel log22:45
ianwhttps://gerrit.googlesource.com/plugins/its-base/+/refs/heads/master/src/main/java/com/googlesource/gerrit/plugins/its/base/workflow/ActionController.java#90 is where it emits22:45
clarkbyup 808059 looks like the one to me22:46
fungiianw: i'm fading, but it's probably worth us revisiting the its-base docs and double-checking our tracking id configuration against that22:47
ianwfungi: np, i might poke a bit because it would help us see real issues if we can clear these regularly repeating ones22:47
ianwok, looking at meta/config for airship/docs, i don't see a [plugin "its-storyboard"]22:58
ianwhrm, maybe "project event" is the key here23:03
ianw"To be able to make use of actions acting at the ITS project level, you must associate a Gerrit project to its ITS project counterpart."23:04
ianwmaybe we don't want "actions at the ... project level"?23:05
clarkbis the default change level? seems like those are mapping to projects in storyboard somehow23:06
corvuswhat do folks think about a zuul restart to gather more data about the issue from earlier?23:12
corvusmaybe a restart now, and then check back in several hours or so and then a rollback before the delay becomes too severe23:13
ianwcorvus: not too up on details, but i'm happy to help, if i can :)23:13
ianwfungi / clarkb : is it possible ~gerrit2/review_site/etc/its/actions.config is not under config management?23:14
clarkbcorvus: the biggest thing is the openstack release. queues look quiet right now though it is hard to verify every single change queued isn't important for that. I suspect it is ok23:15
corvusyeah, and i'd obviously save/restore queues too.23:15
clarkbianw: It certainly looks that way? the gerrit.config and secure.config have its-storyboard sections23:16
clarkbianw: is it possible that gerrit is writing that config file out itself?23:16
ianwi don't think so, it has a timestamp from 201823:17
corvusokay, i'll go ahead and do a restart now, then afk for a bit, then check back throughout the evening and keep ianw up to date :)23:17
clarkbsounds good.23:18
ianw++23:18
clarkbAs a reminder I'm not really around tomorrow and will be popping out soon to have dinner too23:18
corvusianw: https://etherpad.opendev.org/p/gaK1SsRMpe3hh4ASOCLi has a bunch of info23:18
corvusincluding the rollback procedure :)23:19
corvus#status log restarted all of zuul on master for change cache bug data collection 23:22
opendevstatuscorvus: finished logging23:22
ianwpossibly the rules are defined in all projects, and this on-disk file is old?23:24
clarkbianw: maybe?23:25
ianwnope, i just checked All-Projects / project.config and its rules aren't defined there23:26
corvusi'm starting to get a bit worried about the relative silence in the zuul logs :/23:34
yuriysSomething funky is happening, I noticed all our cloud in-use instances went to 0, and zuul.opendev.org is just spinning.23:36
clarkbyuriys: zuul is being restarted which causes it to free its instances23:36
clarkbcorvus: did it not start a process again?23:36
corvusit's running, it's just that the only thing being logged right now are gerrit events23:37
corvusit should be doing something like getting branch lists or keys... but i don't know what23:39
corvusi think we need a lot more log entries :(23:41
corvusokay, there is no main thread running23:46
corvusthat's a new experience23:46
corvuswe never even made it to "Starting scheduler"23:46
corvuscause there's a bunch of stuff that's outside an exception handler(!) in startup23:46
clarkbexciting (for all the wrong reasons)23:47
corvusokay, looking at the docker log, i see a potentially fatal error related to github installations23:50
corvusi wonder if we had a github installation removed very recently?23:50
clarkbI'm not aware of one. Don't we have the single github.com connection?23:51
corvusyes, but that corresponds to several "installations" which is a github api concept23:51
corvuswe iterate over all the projects we think we're attached to, and get our installation key for that project23:52
clarkbah23:53
corvusunfortunately, we don't log that, so all i know is installation 121061 returns 40323:53
clarkbcorvus: is it possible the remote side removed the app access?23:53
clarkbrather than us changing anything23:53
corvusyeah, that's the question i'm trying to ask23:53
clarkbI'm not sure that is something we get notification of unfortunately. The remote can update their settings and we just have to hope they don't do so inappropriately23:54
corvuswe may get that notification and just not act on it23:54
corvus2021-09-22 23:06:52,552 DEBUG zuul.GithubRequest: POST https://api.github.com/app/installations/121061/access_tokens result: 201, size: 275, duration: 19623:54
corvus2021-09-22 23:21:51,523 DEBUG zuul.GithubRequest: POST https://api.github.com/app/installations/121061/access_tokens result: 403, size: 163, duration: 6423:54
corvusbased on proximity of log entries, i'm going to guess that it's 'https://api.github.com', 'repos', 'sigmavirus24', 'github3.py',23:56
corvusi think we need to pull that out of the tenant config if we want zuul to start23:56
clarkbI'll push that up momentarily if you want to force merge it?23:57
corvusclarkb: yes pls23:58
corvusi have done that manually on the host in the interim23:58
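A sketch of how the installation guess above could be confirmed: list the GitHub App's installations with an app JWT and see which account installation 121061 belongs to. The app id and key path are placeholders; the endpoints are part of the same GitHub App API the 403 log lines show Zuul calling.

    # List the GitHub App's installations to see which account an
    # installation id belongs to. APP_ID and the key path are placeholders.
    import time

    import jwt  # PyJWT
    import requests

    APP_ID = "12345"
    with open("github-app-key.pem") as f:
        key = f.read()

    now = int(time.time())
    token = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": APP_ID},
        key, algorithm="RS256")

    resp = requests.get(
        "https://api.github.com/app/installations",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"})
    for inst in resp.json():
        print(inst["id"], inst["account"]["login"], inst.get("suspended_at"))
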
opendevreviewClark Boylan proposed openstack/project-config master: Remove github3.py from our zuul config  https://review.opendev.org/c/openstack/project-config/+/81053023:59
clarkbcorvus: ^ something like that I think23:59
