Friday, 2023-04-21

02:26 *** iurygregory is now known as iurygregory|holiday
02:47 <opendevreview> Merged opendev/zone-opendev.org master: Cleanup etherpad DNS records  https://review.opendev.org/c/opendev/zone-opendev.org/+/880169
02:54 <opendevreview> Ching Kuo proposed opendev/system-config master: [dnm] Test Hound Build  https://review.opendev.org/c/opendev/system-config/+/880912
03:06 <opendevreview> Ian Wienand proposed opendev/zone-opendev.org master: Add Jammy refresh NS records  https://review.opendev.org/c/opendev/zone-opendev.org/+/880577
03:06 <opendevreview> Ian Wienand proposed opendev/zone-opendev.org master: Remove old nameservers  https://review.opendev.org/c/opendev/zone-opendev.org/+/880709
03:32 <ianw> genekuo: ^ i noticed you sent that back in and i quickly pulled it up in my browser
03:34 <ianw> it looks broken to me in both firefox and chrome
03:34 <ianw> https://104.130.253.48/
03:35 <ianw> so it's not an artifact of selenium or, it would seem, python
03:48 <ianw> https://github.com/hound-search/hound/issues/453 same issue
03:49 <genekuo> ah, ok, I was only searching with my browser last time and didn't open the advanced panel.
03:49 <genekuo> In this case shall we postpone the update?
03:50 <ianw> well we definitely don't want to pull in this version of hound
03:54 <opendevreview> Ching Kuo proposed opendev/system-config master: Update accessbot to Use Python 3.11 Base Images  https://review.opendev.org/c/opendev/system-config/+/881161
03:57 <ianw> genekuo: ok, here's something weird
03:57 <ianw> i still have the window up from the test run
03:59 <ianw> <div data-reactid=".0.0.1.0" style="height: 0px; padding: 0px;" id="adv">
03:59 <ianw> note that's id="adv" -- not "advanced" as changed in https://github.com/hound-search/hound/commit/d25b221872426b03d9be8cf6924327e5eab6c314
03:59 <ianw> if i change that id to "advanced" in the inspector, it looks right
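(Context: the breakage above can be checked from a script rather than by eye. A minimal sketch, assuming selenium with headless Firefox and a hound instance on its default port 6080; the element ids come from the discussion above, everything else here is an assumption:)

```python
# Sketch: report which id the advanced-search panel was rendered with.
# HOUND_URL is an assumption; point it at the instance under test.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

HOUND_URL = "http://localhost:6080/"

options = Options()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get(HOUND_URL)
    # Fixed hound JS toggles a panel with id="advanced"; the broken build
    # rendered id="adv", so the collapse styling was never applied.
    print("id=advanced present:", bool(driver.find_elements(By.ID, "advanced")))
    print("id=adv present (regression):", bool(driver.find_elements(By.ID, "adv")))
finally:
    driver.quit()
```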
04:05 <genekuo> seems like the run has ended
04:05 <genekuo> Let me try to run hound locally and see if I can fix it
04:26 <ianw> i'm 99.99% sure it's that adv change upstream
04:52 <opendevreview> Ian Wienand proposed opendev/system-config master: [wip] build houndd directly  https://review.opendev.org/c/opendev/system-config/+/881163
04:53 <ianw> genekuo: ^ it's really weird ... when i build it locally it seems to look alright?  i wonder if "go get" is messing something up?
05:04 <ianw> https://review.opendev.org/c/opendev/system-config/+/880580?tab=change-view-tab-header-zuul-results-summary has now hit -2 on verified twice with what seems like nodes going away
05:04 <ianw> huh, it looks like that ?tab= is also new along with the ?usp=
06:08 *** amoralej|off is now known as amoralej
07:00 <genekuo> ianw: the local build doesn't work for me, the advanced tab is open
07:11 <ianw> genekuo: i think it's something to do with "go get" -- https://review.opendev.org/c/opendev/system-config/+/881163 looks good.  perhaps it is grabbing some sort of cached component?
07:11 <ianw> we should build in a separate layer, feel free to take over 881163 if you're interested
07:18 <genekuo> got it, not sure why the local build doesn't work for me though
ianwdid you "go get" ... I just typed "make".  may also be a go version thing?07:23
opendevreviewChing Kuo proposed opendev/system-config master: [wip] build houndd directly  https://review.opendev.org/c/opendev/system-config/+/88116307:26
genekuoianw: I ran go build directly instead of using make07:27
opendevreviewIan Wienand proposed openstack/project-config master: Indent Gerrit ACL options  https://review.opendev.org/c/openstack/project-config/+/87990607:29
opendevreviewIan Wienand proposed openstack/project-config master: tools/normalize_acl.py: Add some human readable output  https://review.opendev.org/c/openstack/project-config/+/88089807:29
ianwgenekuo: weird ... make does do some stuff to update ui/bindata.go which is where all the bits are stuffed as gzipped strings ... maybe that has something to do with it?07:42
opendevreviewMerged opendev/system-config master: dns: abstract names  https://review.opendev.org/c/opendev/system-config/+/88058007:43
07:45 <genekuo> ianw: seems like that's the issue; I did one build without running "make ui/bindata.go" and one after running it, and the issue is resolved in the later build
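(Context: hound's make target regenerates ui/bindata.go, which embeds the gzipped UI assets into the binary, so a bare "go build" links whatever stale generated file is lying around. A rough Python analogy of that embedding step, purely illustrative and not hound's actual tooling:)

```python
# Illustrative analogy of go-bindata style asset embedding: the UI files are
# gzipped and written into a generated source file, so skipping the
# generation step ships whatever assets were embedded last time.
import gzip
from pathlib import Path

def embed_assets(asset_dir: str, out_file: str) -> None:
    """Write a generated module mapping asset paths to gzipped contents."""
    lines = ["# generated file - do not edit\n", "ASSETS = {\n"]
    for path in sorted(Path(asset_dir).rglob("*")):
        if path.is_file():
            data = gzip.compress(path.read_bytes())
            lines.append(f"    {str(path)!r}: {data!r},\n")
    lines.append("}\n")
    Path(out_file).write_text("".join(lines))
```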
07:48 <opendevreview> Ching Kuo proposed opendev/system-config master: [wip] build houndd directly  https://review.opendev.org/c/opendev/system-config/+/881163
07:52 <frickler> still more interesting zuul behavior on https://review.opendev.org/c/openstack/releases/+/878864 even after the gerrit side of things seems to have gotten resolved earlier
07:53 <ianw> genekuo: i filed https://github.com/hound-search/hound/pull/456 ... the project isn't very active so i don't hold out a lot of hope, but we'll see if anyone says anything.  i definitely do get a diff in the hound.js binary bit
07:57 <genekuo> If it gets merged in a few days, I'd prefer to download the built binaries instead of building them ourselves.
07:59 <ianw> i don't think we were ever using upstream binaries as such with "go get", which i think is also deprecated?  honestly building it like this seems ok to me, it's not a lot of maintenance overhead
08:01 <genekuo> I see
08:01 <genekuo> go get is deprecated in golang 1.18
09:20 <frickler> infra-root: looks like we may have two stuck jobs again? one of them is at the head of the gate queue, effectively blocking integrated merges https://review.opendev.org/881142
11:21 *** dhill is now known as Guest11773
11:23 <fungi> frickler: on 878864 i wonder if zuul cached the result, and since the votes hadn't changed (only the rules for what those votes mean), maybe it didn't re-evaluate submittability until the next activity after the cache aged out
11:27 <fungi> looks like the swift-tox-func-encryption-py38 build for 881142 is waiting for a node assignment, i'll see if i can find where it's ended up
11:35 <fungi> node 0033802341 was assigned for it at 23:39:12z
11:41 <fungi> no, i misread the log, that was for a related change
11:42 <fungi> nr 200-0021026179 (a single ubuntu-focal node) was issued for it at 22:51:43z
11:44 <fungi> 2023-04-20 23:01:59,804 DEBUG nodepool.driver.NodeRequestHandler[nl01.opendev.org-PoolWorker.rax-ord-main-48b69e922e3f4d33a1c2ea0aa9544520]: [e: f09fe23c7aee43f0bae5a32c33f9bdac] [node_request: 200-0021026179] Accepting node request
11:46 <fungi> that was the second launcher to attempt to build it. first one was ovh-gra1 on nl04 (accepted at 22:51:46)
11:48 <fungi> i see now what corvus meant by a nodepool change which increases log chattiness
12:01 <fungi> the last node i see it trying to build to satisfy that nr was 0033802172, and i can see in the nl01 debug logs where it decides to delete it, but i can't find any corresponding launch failure logged
12:03 <fungi> 2023-04-20 23:09:13,590 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0033802172 (state: building, allocated_to: 200-0021026179)
12:04 <fungi> i also don't see any further mention of 200-0021026179 in the log after it deleted that node
12:15 <fungi> i need to have some coffee and attend to deferred morning activities, maybe someone with better eyes can spot what i'm missing in the log
12:16 <fungi> otherwise i suppose i can trigger a thread dump on nl01 and try to see if there's a hung thread for this, but beyond that we're probably at restarting the launcher to free up the lock on that nr?
12:43 *** amoralej is now known as amoralej|lunch
12:50 <opendevreview> Ching Kuo proposed opendev/system-config master: [wip] build houndd directly  https://review.opendev.org/c/opendev/system-config/+/881163
13:07 <opendevreview> Ching Kuo proposed opendev/system-config master: [wip] build houndd directly  https://review.opendev.org/c/opendev/system-config/+/881163
13:49 *** amoralej|lunch is now known as amoralej
14:05 <opendevreview> Ching Kuo proposed opendev/system-config master: Build houndd Directly  https://review.opendev.org/c/opendev/system-config/+/881163
14:13 <opendevreview> Merged openstack/project-config master: Stop using Storyboard for ovn-bgp-agent  https://review.opendev.org/c/openstack/project-config/+/880938
14:50 <clarkb> fungi: corvus: looking at 200-0021026179 and 0033802172, it looks like there was a building node, then nodepool was restarted to deploy the update to wait limits, which switched the node to deleting. As far as I can tell it does manage to delete the node from the cloud and zookeeper, but then it never seems to make another launch attempt for that node request.
14:50 <clarkb> fungi: did you do a thread dump yet?
14:50 <clarkb> I think we want to see where the request handler is for that request
14:51 <clarkb> I think this is different than the previous restart-induced lockup, because that occurred in the node deletion process, which appears to have completed in this instance
14:56 <clarkb> the bugfix for suddenly non-public gitea apis is in their merge queue. I don't think there is a fix yet for basic auth needing to be forced. However, I think that bug was preexisting, as we have a bunch of force stuff in ansible tasks already.
14:57 <fungi> clarkb: i didn't do a thread dump yet, but the restart is a key insight i missed
14:57 <fungi> i'll trigger one now
14:58 <fungi> i got distracted by an unrelated openstack release job oddity
14:59 <fungi> running this on nl01 now: `sudo kill -USR2 2807717;sleep 60;sudo kill -USR2 2807717`
14:59 <corvus> i think the next question is whether that request is locked
14:59 <fungi> so we should have two thread dumps 60s apart in the debug log momentarily
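(Context: the mechanism in use here is a signal handler that writes every thread's stack to the debug log; sending the signal twice gives two snapshots to diff for threads stuck in the same frame. A generic sketch of that pattern, not nodepool's actual handler:)

```python
# Generic SIGUSR2 stack-dump handler, similar in spirit to what
# nodepool/zuul register; not their actual implementation.
import signal
import sys
import threading
import traceback

def dump_stacks(signum, frame):
    # Map thread ids to names so a hung worker thread is identifiable.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        print(f"--- thread {names.get(ident, '?')} ({ident}) ---")
        traceback.print_stack(stack)

signal.signal(signal.SIGUSR2, dump_stacks)
# Then: kill -USR2 <pid>; sleep 60; kill -USR2 <pid> -- a thread showing
# the same frame in both dumps (e.g. blocked in Lock.acquire) is a
# deadlock candidate.
```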
15:00 <corvus> i'll try to figure that out
15:00 <fungi> thanks!
15:02 <clarkb> I suspect the reason we've seen restarts induce these problems is that we'll have a fairly large number of building nodes that suddenly all go to delete, and this probably creates a thundering herd effect of zookeeper contention?
15:03 <clarkb> just makes it far more likely we'll catch existing races when we do that compared to normal operations, I bet
15:03 <corvus> it is locked by nl01
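(Context: one way to answer "who holds the lock" from outside the launcher. A hedged sketch using kazoo and nodepool's conventional ZooKeeper layout, where request locks live under /nodepool/requests-lock/<request-id>; the host below is a hypothetical placeholder to adjust for the real deployment:)

```python
# Hedged sketch: list which launcher holds the lock on a node request.
from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock

REQUEST_ID = "200-0021026179"

zk = KazooClient(hosts="zk01.example.org:2181")  # hypothetical host
zk.start()
try:
    lock = Lock(zk, f"/nodepool/requests-lock/{REQUEST_ID}")
    # contenders() returns the identifiers registered by lock holders and
    # waiters, which for nodepool includes the launcher hostname.
    print("lock contenders:", lock.contenders())
finally:
    zk.stop()
```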
15:04 <fungi> first stack dump starts at 2023-04-21 14:59:36,245 in the log, second at 15:00:36,295
15:08 <corvus> there is no log message for the second lock, and looking at the code, the way for that to happen is via the cleanupLostRequests method
15:08 <corvus> which is what we want to happen here
15:10 <corvus> wow, that thread dump for cleanupLostRequests looks exactly like the thread deadlock we just fixed
15:11 <fungi> cleaned up stack dump output is now in nl01:~fungi/stack_dump.2023-04-21 in case it's helpful
15:30 <corvus> i see the problem
15:31 <corvus> https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/statemachine.py#L341-L342
15:31 <corvus> we never unlock a node when we delete it, because deleting it unlocks it from zk's pov.  but that means we never release the local thread lock.
15:36 <fungi> aha
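(Context: the failure corvus describes, reduced to a toy sketch: a node object guarded by both a ZooKeeper lock and a process-local thread lock. Deleting the znode implicitly releases the ZK side but leaks the local lock, so a later cleanup pass blocks forever. Illustrative only, not nodepool's actual code:)

```python
# Toy reduction of the deadlock; not nodepool's actual code.
import threading

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self._local_lock = threading.Lock()  # per-process guard
        self.deleted = False

    def lock(self):
        self._local_lock.acquire()
        # ... the real code would also take the ZooKeeper lock here ...

    def unlock(self):
        # ... and release the ZooKeeper lock here ...
        self._local_lock.release()

    def delete(self):
        # Deleting the znode releases the ZK lock implicitly, so the buggy
        # path never calls unlock() -- the local thread lock is leaked.
        self.deleted = True

node = Node("0033802172")
node.lock()
node.delete()   # bug: no node.unlock() afterwards
# A later cleanup pass (cleanupLostRequests in the real code) would now
# block forever on node.lock(); the fix is to unlock after deleting.
```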
15:39 <corvus> fix proposed in https://review.opendev.org/881237 and i also wrote https://review.opendev.org/881238 in response to the log deluge
15:40 <clarkb> both lgtm, thanks!
15:40 <corvus> resolution of the immediate issue should be the same as last time: we can restart the launcher at any point now
15:40 <clarkb> though that may cause us to trip over the same thing?
15:40 <clarkb> but it would at least get that change moving, presumably
15:40 <corvus> yes, with a new set of deleted nodes
15:41 <corvus> if we can wait an hour or two, restarting with those 2 changes might be nice
15:41 <corvus> if we're in more of a hurry, we could set max_servers to 0 on those regions before shutting down, then we will have no nodes to delete on startup.
15:42 <clarkb> and the reason this wasn't an issue until we restarted is that for whatever reason we don't have multiple threads trying to grab that node lock under normal circumstances (I guess that makes sense, since the launch thread normally has the lock and does all of the processing, but then with bulk deletion like that we trigger cleanups?)
15:42 <clarkb> I think we can probably wait, if fungi is able to review the changes nowish
15:42 <fungi> i already approved the first one and am looking over the second now
15:43 <clarkb> excellent
15:43 <corvus> clarkb: yeah, it's only if we try to lock a recently deleted node; so it could happen in other cases, but the restart increases the probability
15:43 <fungi> yep, both lgtm
15:56 *** amoralej is now known as amoralej|off
17:20 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Move containerfile setting in container build  https://review.opendev.org/c/zuul/zuul-jobs/+/881252
17:31 <clarkb> once corvus' changes to move zuul, nodepool, zuul-registry, zuul-preview etc to quay.io land, we'll want to update where we pull them from in opendev
17:54 <corvus> the nodepool changes from earlier have merged and promoted, so nl can be pulled and restarted at will
17:55 <corvus> i need to afk for a bit now so don't plan on doing that myself
17:59 <fungi> our hourly jobs do that anyway, right?
17:59 <fungi> so are in theory kicking off in mere seconds
18:36 <clarkb> the first change actually deployed a while ago and got things moving again.
18:36 <clarkb> The second one should auto deploy and may have already ... yup, it has
20:16 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/881277
20:24 <clarkb> infra-root: I've completed an initial pass of container image syncs from docker hub to quay.io
20:25 <clarkb> I think we have two big next steps: switching over an image or a few at a time to the new image build jobs and updating our prod consumption location, and updating the new image build jobs to create the container repo if it doesn't exist (not necessary for those I just synced)
20:25 <clarkb> Zuul is also working through some of this right now, so I'll focus on helping there since what we fix there should be applicable to us
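(Context: the kind of one-off docker hub to quay.io sync described above can be scripted with skopeo. A hedged sketch; the image list and orgs are hypothetical examples, not the exact commands that were run:)

```python
# Hedged sketch of a docker hub -> quay.io image sync using skopeo.
import subprocess

IMAGES = ["gerrit:3.6", "etherpad:latest"]  # hypothetical examples
SRC = "docker://docker.io/example-org"
DST = "docker://quay.io/example-org"

for image in IMAGES:
    # "skopeo copy --all" carries every architecture in a multi-arch manifest.
    subprocess.run(
        ["skopeo", "copy", "--all", f"{SRC}/{image}", f"{DST}/{image}"],
        check=True,
    )
```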
20:40 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Pin virtualenv in tox environments  https://review.opendev.org/c/zuul/zuul-jobs/+/881279
20:43 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Pin virtualenv in tox environments  https://review.opendev.org/c/zuul/zuul-jobs/+/881279
20:57 <opendevreview> Merged zuul/zuul-jobs master: Pin virtualenv in tox environments  https://review.opendev.org/c/zuul/zuul-jobs/+/881279
21:31 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role  https://review.opendev.org/c/zuul/zuul-jobs/+/877834
21:34 <opendevreview> Merged zuul/zuul-jobs master: Move containerfile setting in container build  https://review.opendev.org/c/zuul/zuul-jobs/+/881252
22:04 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role  https://review.opendev.org/c/zuul/zuul-jobs/+/877834
22:08 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role  https://review.opendev.org/c/zuul/zuul-jobs/+/877834
22:10 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Add ensure-quay-repo role  https://review.opendev.org/c/zuul/zuul-jobs/+/877834
22:20 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Base jobs for quay.io image publishing  https://review.opendev.org/c/opendev/system-config/+/881285
22:20 <clarkb> infra-root: ^ feedback on those two changes would be much appreciated. I think it sketches out what our publishing process will look like for opendev images to quay. But if I've made silly mistakes I'd love that feedback
22:26 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/881277
22:27 <clarkb> The other thing I've realized is that some images like gerritbot are outside of system-config, so we may end up needing to do this a few times
22:27 <clarkb> but one step at a time
22:55 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Base jobs for quay.io image publishing  https://review.opendev.org/c/opendev/system-config/+/881285
23:33 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Use full image url in container buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/881277
23:39 <Clark[m]> wow, that last patchset might actually work. I'm not sure I'm a fan of that approach but if others don't mind it ...
23:39 <clarkb> er, that was meant for the zuul matrix room
