Thursday, 2021-09-02

clarkbthe zuul key backup cron failed as expected and sent email saying that was the case00:11
clarkbcorvus: ^ this has me thinking. The command doesn't write any secret data to stdout or stderr right? as we don't want that getting emailed if the cronjob fails00:12
clarkbpretty sure it doesn't and we're good00:12
clarkbit will output if a key couldn't be backed up for some reason iirc but not share any of the key data00:12
corvusclarkb: correct00:16
corvusby design00:16
opendevreviewMerged opendev/base-jobs master: buildset-registry: add flag to make job fail  https://review.opendev.org/c/opendev/base-jobs/+/80681801:22
opendevreviewIan Wienand proposed opendev/system-config master: gitea: use assets bundle  https://review.opendev.org/c/opendev/system-config/+/80593301:30
ianw:/ that didn't appear to work02:06
opendevreviewIan Wienand proposed opendev/system-config master: gitea: use assets bundle  https://review.opendev.org/c/opendev/system-config/+/80593302:16
ianwworks better when you spell things right02:39
*** ysandeep|out is now known as ysandeep06:16
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703106:46
*** jpena|off is now known as jpena07:39
*** ykarel is now known as ykarel|lunch08:13
*** ysandeep is now known as ysandeep|lunch08:16
*** ysandeep|lunch is now known as ysandeep09:12
*** ykarel|lunch is now known as ykarel09:56
*** ysandeep is now known as ysandeep|afk11:15
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules  https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:18
*** dviroel|out is now known as dviroel|ruck11:19
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules  https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:22
*** jpena is now known as jpena|lunch11:38
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: Drop deprecated url param of upload_logs_XXX modules  https://review.opendev.org/c/zuul/zuul-jobs/+/80711811:39
*** ysandeep|afk is now known as ysandeep12:13
*** jpena|lunch is now known as jpena12:40
*** frenzy_friday is now known as anbanerj|ruck12:46
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/80713212:47
*** dviroel|ruck is now known as dviroel12:48
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: [DNM] Test dropping delegation in the upload_logs_s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/80713213:05
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/80713213:22
*** diablo_rojo_phone is now known as Guest607513:28
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703114:23
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703114:42
clarkbinfra-root it looks like the inap provider use has largely fallen off. There is one in-use server, perhaps a held node? I'll take a look shortly, but I suspect if that is the case we can delete it, then push a change to clean up the images in that cloud, then remove it entirely?14:52
clarkbmgagne: ^ fyi14:52
fungiclarkb: two nodes, one is held with "mnaser debug multi-arch containers" and the other is an almost-day-old centos-8-stream node which will likely get recycled in about 45 minutes14:54
clarkbah thanks you have your ssh keys loaded :)14:54
clarkbmnaser: ^ do you still need that hold? if not it will help us clean up a nodepool provider that got renamed14:55
mnaserclarkb, fungi: mnaser debug multi-arch containers sounds like a really old hold that i don't need 14:55
clarkbthanks14:56
fungier, actually the centos-8-stream node is "in-use" for 00:23:13:02 so maybe leaked? or there's a stuck job...14:56
fungimnaser: thanks, i'll take out that autohold14:56
mnaserclarkb, fungi: inap is now iweb?14:57
fungimnaser: yep14:57
mnaseriweb used to be private, then became publicly traded, then private, then acquired by inap, then now i guess acquired by leaseweb (according to their site) to become an independent iweb again i guess :P14:58
fungiyeah, i lost track about halfway through the history there ;)14:58
mnaser>In 2021 iWeb Technologies was acquired by Leaseweb.14:58
mnaserah yep14:58
opendevreviewGage Hugo proposed opendev/irc-meetings master: Update openstack-helm irc meeting channel  https://review.opendev.org/c/opendev/irc-meetings/+/80509414:59
opendevreviewArtem Goncharov proposed zuul/zuul-jobs master: [DNM] Test upload_logs_s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/80713215:12
mgagnemnaser: this is correct regarding ownership.15:15
clarkbI have run the gerrit user audit across the 33 remaining conflicts and the yaml file is in the typical location15:29
clarkbI don't expect I'll get to reviewing those today, but hopefully soon and will start sending out emails and writing changes in an All-Users checkout15:30
clarkbfungi: corvus that last inap instance is the paused job for 806901,2 in openstack's check queue15:36
clarkbit appears we have ~4 changes that have leaked in the check queue there based on age15:37
clarkbI'm not in a good spot to look at that right this moment as I've got a meeting in a few minutes, but I wonder if we've got a stale launcher that needs restarting or a similar issue to before where we forget requests?15:37
clarkb4e03748821a74fe6b755bc2155838125 is the zuul event id for the 806901,2 enqueue15:38
fungii'll try to dig into stuck jobs in a few, but also in meetings15:38
clarkbwhat if it is stuck that way because we set max-servers to 0?15:56
clarkbthe job that is paused started on inap so all the jobs behind it want to run in inap too but cannot15:56
clarkbThat doesn't explain the other leaked changes in the queue though, but could explain this one15:56
opendevreviewAnanya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80717616:04
fungiit's also possible we haven't completely squashed the hung/stale node requests problem, i haven't dug into it yet16:16
*** jpena is now known as jpena|off16:47
*** ysandeep is now known as ysandeep|out16:47
*** sshnaidm is now known as sshnaidm|off17:01
clarkb299-0015272873 is one of the node requests that was submitted after the job paused17:04
clarkb/etc/nodepool/nodepool.yaml on nl03 updated at 15:50 UTC yesterday. The above node request was submitted at 16:49 yesterday17:05
clarkbI think it is very likely that we cannot provision those nodes because max-servers is 0 on the only provider that is capable of running those jobs17:05
opendevreviewAnanya proposed opendev/elastic-recheck master: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80717617:06
clarkbthat doesn't explain the other three though17:06
clarkbcorvus: ^ is there a better way to remove a provider in nodepool? maybe we just accept this risk and stop then reenqueue the change in these cases?17:06
clarkbinfra-root ^ I'll do that for 806901,2 shortly if there is no objection. The other three probably deserve looking at as I doubt they have this particular issue17:07
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:08
fungioh, was it trying to provide nodes for the child job from the same provider? yeah i guess that's intentional?17:08
clarkbyes I believe zuul enforces that behavior17:09
corvusyeah, i think that is an unexpected corner case (those 2 features were designed at different times)17:09
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:10
clarkbfor the airship change that is stuck it has a node request (300-0015270257) for https://opendev.org/airship/airshipctl/src/branch/master/zuul.d/nodesets.yaml#L39 that doesn't seem to be getting fulfilled17:10
corvusi don't have a good solution to that.  i think we just have to muddle through.  dequeue/enqueue sounds good.17:11
clarkbthat label for the airship job is currently only able to be provided by the airship cloud provider17:11
clarkbcorvus: ok I'll dequeue enqueue for that one.17:11
clarkbthen I'll keep looking at this airship node request I guess17:12
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:14
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:14
clarkbfor the airship one it failed to boot three times in the airship cloud. Every other cloud should decline the request which should result in a node failure17:15
clarkbnot sure why that hasn't happened yet17:15
clarkbthe new iweb provider did decline the request when it came online according to the logs on nl0317:18
clarkbif launchers.issubset(set(self.request.declined_by)): <- is the condition we fail on17:21
clarkbI'll have to check the zk db it looks like17:21
corvusclarkb: if you want to see the declined list, you can run nodepool list with --detail17:22
corvusrequest-list17:22
clarkbcorvus: thanks, but I need to see what the set of registered launchers is and I think that is only available from the db?17:22
clarkbbut then I can compare that against the decline by list17:22
corvusyeah i think so17:23
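For reference, a minimal sketch of the comparison being described here: list the launcher registrations in ZooKeeper and diff them against a node request's declined_by list. This is illustrative only; the znode paths, the JSON field names, and the plain unauthenticated connection are assumptions about nodepool's ZooKeeper layout, not verbatim production details.

    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk01.opendev.org:2181")
    zk.start()

    # Ephemeral registrations, one child znode per active pool worker.
    launchers = set(zk.get_children("/nodepool/launchers"))

    # The stuck airship request discussed above.
    data, _ = zk.get("/nodepool/requests/300-0015270257")
    request = json.loads(data.decode("utf-8"))
    declined_by = set(request.get("declined_by", []))

    # A request is only marked FAILED once every registered launcher has
    # declined it, so leaked registrations keep that from ever triggering.
    print("still waiting on:", launchers - declined_by)

    zk.stop()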
clarkbzk-shell appears to pull in twitter packages now?17:24
corvusalways has17:24
clarkbhuh17:24
clarkbcorvus: I think we have two extra launchers registered: nl01.opendev.org-PoolWorker.rax-ord-main-06e2062bc61f4957bb40e1de543722ca and nl03.opendev.org-PoolWorker.iweb-mtl01-main-1a2f7610ca4941e4a6af286f5758da8b17:31
clarkbhowever, those registrations are ephemeral so now I'm extra confused about how they exist17:32
clarkbMaybe the inap one is related to the held node or the paused node and will get cleaned up automatically post dequeue enqueue?17:33
clarkbmaybe I can do a thread dump and match up the launcher id to a thread?17:34
clarkbthere is a single pool worker for inap main and yet we had two registered launchers for it17:38
clarkbI didn't think we shared the zk client between launchers but that might explain it if the current launcher has kept track of it17:39
clarkboh sorry the iweb one is valid I think17:43
clarkbnl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce is the invalid one17:43
clarkbnl03.opendev.org-PoolWorker.inap-mtl01-main-b88877414cc0474caa11e7c383b17a07 shows up in the logs as declining requests but nl03.opendev.org-PoolWorker.inap-mtl01-main-2d4c222d1678460c90f88c158e41b2ce hasn't logged anything in the current logfile. It definitely seems like we haven't deregistered properly17:45
clarkblooking back a few days I don't see 2d4c222d1678460c90f88c158e41b2ce in the logs. That implies it wasn't the shift from 120 to 0 max-servers that leaked it?17:46
corvuswe probably do share the zk client, so it's up to the launcher code to de-register the pool provider properly17:48
clarkbbased on the mtime on the znode it was last modified on July 30 ish UTC17:49
clarkba bit more confident that this isn't related to the provider update given that; it's just a coincidence that one of the two leaked launchers is also one we are renaming17:50
corvusdo we expect frequent updates of that znode?17:50
corvusi guess the question is really: did something in nodepool's config change to cause that pool-provider to reconfigure in the past month17:51
clarkblooks like no, many of the other launchers have ctime == mtime17:52
corvusif no, then this event may still be the first to trigger that code path17:52
clarkbyes we updated its max-servers value yesterday17:52
clarkbnote there is a leaked rax-ord launcher as well and we haven't updated that recently as far as I know17:52
clarkblooking at the code we seem to only call deregisterLauncher() if stop() on the thread is called. Looking at the thread dump it seems there is only one PoolWorker thread for inap currently which would have to be for the non leaked one that is logging declined requests. Is it possible the thread crashed hard for some reason and stop() was never called?17:53
clarkbthe run loop does catch all exceptions though17:54
clarkboh wait no it doesn't if the exception can happen early in the run loop17:54
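A self-contained sketch of the failure mode being discussed, assuming simplified method names (registerLauncher, deregisterLauncher, work) rather than the actual nodepool code: exceptions raised inside the per-iteration try/except are logged and survived, but if cleanup is tied only to an explicit stop(), any other way out of run() leaves the ephemeral launcher registration behind; a try/finally around the whole loop is one way to guarantee deregistration.

    import logging
    import threading
    import time

    log = logging.getLogger("sketch")


    class PoolWorker(threading.Thread):
        # Illustrative only; names loosely mirror the real code.

        def __init__(self):
            super().__init__()
            self.running = True

        def registerLauncher(self):
            log.info("creating ephemeral launcher registration znode")

        def deregisterLauncher(self):
            log.info("removing ephemeral launcher registration znode")

        def stop(self):
            self.running = False

        def work(self):
            time.sleep(1)  # stand-in for assigning request handlers

        def run(self):
            self.registerLauncher()
            try:
                while self.running:
                    try:
                        self.work()
                    except Exception:
                        # Exceptions inside the loop are logged and survived...
                        log.exception("Error in pool worker:")
            finally:
                # ...but without this finally, an exception that escapes the
                # loop, or any exit that never goes through stop()-driven
                # cleanup, leaves the launcher registration leaked in ZooKeeper.
                self.deregisterLauncher()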
clarkbgrepping for exceptions in the last two days I don't see any that seem related to zookeeper breaking when trying to deregister or a thread being killed due to an uncaught exception. However I'm filtering stuff that I guess could be related?18:01
clarkbcorvus: fungi: any objection to me restarting nl03 and nl01's launchers to clear the ephemeral nodes? I'm not sure we'll get much more debugging out of the launcher that isn't in the logs already?18:02
clarkbno ZooKeeper suspended messages either18:03
corvusclarkb: wfm18:06
fungisounds fine to me18:07
clarkbalright nl03 is done. will do nl01 when nl03 shows it is happy restarting18:08
fungithanks18:08
clarkbcorvus: the other thing I notice is that in nodepool.launcher.NodePool.run() we seem to check for stale pools based on provider name + pool name, which are the same for these leaked pools as for the good pool18:09
clarkbcorvus: however, the thread dump says the thread isn't there at all so that shouldn't be an issue18:09
*** Guest6075 is now known as diablo_rojo18:10
*** diablo_rojo is now known as Guest608918:11
*** Guest6089 is now known as diablo_rojo_phone18:12
clarkbI think we only check if we need to fail a node request when we decline a node request18:17
clarkboh maybe not seems the change finally got evicted18:17
fungithe scheduler eventually times out the node request and issues a new one, right?18:18
clarkbhttps://review.opendev.org/c/airship/airshipctl/+/772890/ <- has a node failure reported to it18:18
clarkbfungi: no I think this is entirely driven by nodepool18:18
fungiaah18:19
clarkbIt appears that restarting nl01's launcher caused it to check all open requests, which caused it to decline the request again, which caused it to be failed18:22
clarkbThat is good, it means it does what it can to reach a consistent state. Just need to figure out how we are leaking these launcher registrations in the first place18:22
clarkbthose restarts appear to have unstuck one of the neutron changes stuck in check as it is running the job it wanted now18:23
clarkb803582,1 being the last remaining stuck entry18:23
clarkbI'm going to try and write some changes really quickly before I look at that one to capture some of what I've learned though18:24
clarkbit looks like August 18 was when we added the previous current inap launcher to zk18:35
clarkband our logs don't go back that far18:35
clarkbBut I've found a bit of code that I think may be the cause. I'll get a change up shortly18:36
clarkbcorvus: it looks like we don't create a new pool worker thread when the config updates?18:39
clarkbPretty sure we did in the past but now it appears we only create new provider worker threads when a new provider shows up or if the thread has died for some reason18:40
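A rough, self-contained sketch of the behaviour described above, with invented helper names: pool workers are only started for provider/pool keys that have no thread yet or whose thread has died, so a config update to an existing pool (for example max-servers dropping to 0) does not restart its worker or touch its launcher registration.

    import threading


    def start_pool_worker(provider_name, pool_name):
        # Stand-in for a real PoolWorker: a daemon thread that just stays alive.
        t = threading.Thread(target=threading.Event().wait,
                             name="%s-%s" % (provider_name, pool_name),
                             daemon=True)
        t.start()
        return t


    def update_workers(config, pool_workers):
        # config maps provider name -> list of pool names.
        for provider_name, pools in config.items():
            for pool_name in pools:
                key = (provider_name, pool_name)
                worker = pool_workers.get(key)
                if worker is None or not worker.is_alive():
                    # Only a brand-new provider/pool, or one whose thread has
                    # died, gets a fresh worker (and a fresh registration).
                    pool_workers[key] = start_pool_worker(provider_name, pool_name)
                # Otherwise the existing thread keeps running: a config change
                # to that pool does not restart it.


    workers = {}
    update_workers({"inap-mtl01": ["main"]}, workers)
    update_workers({"inap-mtl01": ["main"], "iweb-mtl01": ["main"]}, workers)
    print(sorted(workers))  # the second call only started the new iweb worker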
fungiinfra-root: proposed announcement for the planned listserv upgrade maintenance: https://etherpad.opendev.org/p/lists-maint-2021-09-1218:50
fungii can send that shortly if nobody has any edits18:51
fungijust dawned on me i meant to write that up tuesday evening and got sidetracked18:51
clarkbfungi: ya I'll take a look as I think I convinced myself that my nodepool changes are likely to be correct. Trying to sort out the relationship between provider managers, pool workers, and request handler threads is something I seem to have to do every time I look at nodepool :/18:55
clarkbfungi: you caught the freenode thing. That was going to be my only comment. lgtm18:56
fungihah18:56
fungiyeah, some unfortunate finger memory18:56
fungii had to step away and then come back and re-read it to spot that18:56
opendevreviewClark Boylan proposed openstack/project-config master: Set empty nodepool resource lists on inap  https://review.opendev.org/c/openstack/project-config/+/80720419:01
opendevreviewClark Boylan proposed openstack/project-config master: Remove the inap provider from nodepool  https://review.opendev.org/c/openstack/project-config/+/80720519:01
clarkbinfra-root ^ that first change should be mergeable now I think. The second should only merge after the first has run in production long enough for nodepool to clean up after itself19:01
opendevreviewLuciano Lo Giudice proposed openstack/project-config master: Add the cinder-lvm charm to Openstack charms  https://review.opendev.org/c/openstack/project-config/+/80720619:01
clarkbmgagne: ^ fyi moving things along there19:01
mgagneclarkb: thanks for the follow up. I think you all know how to deal with it better than me =)19:05
clarkbcorvus: the last change stuck in check is probably most interesting to you. 2021-08-26 08:15:40,530 INFO zuul.ExecutorClient: [e: b850d6c6a02b4d779cec37b56d5b610e] Execute job neutron-functional (uuid: 06426d3af17643a5a05e6b42928af8a0) on nodes <NodeSet devstack-single-node [<Node 0026112614 ('primary',):ubuntu-xenial>]> that node is currently unlocked and ready in nodepool19:06
clarkbcorvus: /var/log/zuul/debug.log.7.gz contains the zuul logs for that line19:06
clarkbcorvus: 2021-08-26 08:15:40,533 DEBUG zuul.ExecutorQueue: [e: b850d6c6a02b4d779cec37b56d5b610e] Submitting job request to ZooKeeper <BuildRequest 06426d3af17643a5a05e6b42928af8a0, state=requested, path=/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0 zone=None> that seems to be the last zuul did with it. Meaning we've lost the build request in zk?19:07
clarkbcorvus: I need to grab lunch now but I won't touch that change in the queue because I think that is potentially interesting to the build requests in zk work19:07
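For completeness, a minimal kazoo sketch of checking whether that build request znode still exists, using the path from the executor log line quoted above; the ZooKeeper host and plain connection are placeholders.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk01.opendev.org:2181")
    zk.start()

    # Path taken verbatim from the ze09 debug log entry above.
    path = "/zuul/executor/unzoned/requests/06426d3af17643a5a05e6b42928af8a0"
    stat = zk.exists(path)
    print("build request still present:", stat is not None)

    zk.stop()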
corvusclarkb: i'm back from lunch, will take a look20:34
corvusclarkb: thanks for spotting that. there was a zk connection loss on the executor.  we have some good tracebacks (on ze09, log .7).  i'll look into handling that.20:42
clarkbcorvus: cool, should we dequeue and enqueue or wait for a bit?20:43
corvusi believe restarting ze09 (and i think we wanted to restart the whole system anyway) will correct that.  dequeue/enqueue might work too.20:43
clarkbok I've got to pop out shortly to deal with some start of school year stuff but will be back later this afternoon and can do that if others don't beat me to it20:46
*** dviroel is now known as dviroel|out21:01
clarkbI'm going to dequeue and enqueue and see if that is sufficient since that is the last impactful change22:14
clarkbthat seemed to make it happy fwiw22:45
opendevreviewmelanie witt proposed openstack/project-config master: Set launchpad bug Fix Released after adding comment  https://review.opendev.org/c/openstack/project-config/+/80137622:45
funginotifications about the lists upgrade maintenance for the 12th have been sent to the most active mailing lists for each of our 5 current mailman domains23:28
