Monday, 2024-01-08

*** gthiemon1e is now known as gthiemonge07:41
opendevreviewJan Marchel proposed openstack/project-config master: Add new NebulOuS component: cloud-fog-service-broker  https://review.opendev.org/c/openstack/project-config/+/90495709:34
opendevreviewMerged openstack/project-config master: Add new NebulOuS component: cloud-fog-service-broker  https://review.opendev.org/c/openstack/project-config/+/90495713:26
*** blarnath is now known as d34dh0r5315:00
TheJuliao/ folks, any chance we can get the next failure of ironic-grenade held? We're specifically looking at a change to the default rbac policy, and the job is failing in unexpected ways; we're not sure if it is a bug in the upgrade, the state set up by devstack, or in the library, so instead of guessing it is easier to hold the next failure. Specifically the change is 902009 on openstack/ironic.15:23
fungiTheJulia: autohold has been set16:01
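A hold like this is normally created with zuul-client's autohold command. A minimal sketch of what such a command might look like, assuming admin credentials for the zuul API; the reason string and count here are illustrative:

    zuul-client --zuul-url https://zuul.opendev.org autohold \
        --tenant openstack --project openstack/ironic \
        --job ironic-grenade --change 902009 \
        --reason "TheJulia: debug rbac policy upgrade failure" \
        --count 1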
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Introduce LogJuicer roles  https://review.opendev.org/c/zuul/zuul-jobs/+/89921216:09
TheJuliafungi: thanks!16:17
clarkbmy firefox tabs are an append only database. Pruning this database makes me sad but firefox performance is suffering16:38
fungii'm hoping one of these times ff crashes, its session history will be corrupted and it won't be able to recover my tabs16:38
clarkbthe really fun behavior is that as you delete tabs the ui and subsequent tab deletions become quicker16:39
clarkbI wonder if I restart ff after every hundred or so tab closes if that will make it faster too16:40
clarkbThere is still no new haproxy 2.9.x release to test16:42
clarkbI'm a bit surprised that both gitea and haproxy are fine with these pretty major bugs hanging out on stable releases for weeks16:43
fungiin the age of container image continuous publication and consumption, many projects are starting to deem "releases" antiquated practice and some have stopped making them outright16:46
clarkbheh eventually things get quick enough to render the pages before you can close them16:52
clarkbthe held node confirms my fears that the weekly cron run to delete all repo archives conflicts with the cleanup run. The weekly run failed with a no such file or directory message17:27
clarkbI'll work on a proper cron specification to run it a few hours after the cleanup run, say 0300 Sunday17:27
clarkbjust have to figure out how to express that in the gitea config first17:28
clarkbgood thing I checked because the go cron lib uses a different format specification than my cron implementation17:32
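For context, a classic crontab entry uses five fields, while the cron library Gitea embeds (robfig/cron for Go) can also accept a leading seconds field and @-style descriptors. A minimal sketch of the two formats; the exact section name and whether this Gitea version wants the seconds field are assumptions to verify against its docs:

    # classic 5-field crontab: 03:00 every Sunday
    0 3 * * 0

    # possible Gitea app.ini equivalent, with a leading seconds field
    [cron.delete_repo_archives]
    ENABLED = true
    SCHEDULE = 0 0 3 * * 0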
opendevreviewClark Boylan proposed opendev/system-config master: Enable gitea delete_repo_archives cron job  https://review.opendev.org/c/opendev/system-config/+/90487417:33
JayFI'm working with adamcarthur5 to troubleshoot an issue with coder.com-hosted SSH agents not being able to connect to gerrit. I'm 99.99% sure it's some kind of bad assumption in their SSH client stack, but if you see anything interesting/weird in logs, or have any experiences to share it'd be appreciated.17:35
clarkbJayF: do you get any errors on the client side?17:36
JayFBad ones that are mostly from their bespoke client. I'm setting up a clean test environment with Adam now, but was mainly curious if there was anything beyond auth failures in the logs.17:36
JayFall in pushes to opendev/sandbox17:36
JayFpermission denied publickey is the error we're seeing17:37
clarkbI'm not seeing errors17:37
JayFre-checking against all the basics17:37
clarkbI see logins and log outs17:38
clarkbusually the things to check are that you aren't talking to port 22 (because gerrit's sshd listens on 29418), that you have the correct username (which is case sensitive), and that you are using the correct key which is in your gerrit account17:38
clarkbhistorically there were problems with rsa keys but that shouldn't be a problem with the version of gerrit we are running17:39
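A quick way to verify all three of those at once is to run a harmless command over Gerrit's SSH API; a minimal sketch, with USERNAME standing in for the account's gerrit username:

    ssh -p 29418 USERNAME@review.opendev.org gerrit version

If that prints a version string, the port, username, and key are all fine.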
JayFit's failing on scp -P29418 user@review:hooks/commit-msg .git/hooks/commit-msg17:39
JayFIt's overriding GIT_SSH_COMMAND17:39
clarkbaha that helps17:39
JayFso I think because that's doing an end-run around git, it's got no auth17:39
clarkbno17:40
clarkbit's because scp is in limbo right now17:40
JayFWe got more output with `-v`17:40
JayFIt was a worse error beforehand :)17:40
clarkbyou have to force old scp iirc17:40
clarkbwhich I think git review does by default but maybe the client there doesn't support this?17:40
JayFThis is some kind of container, I think debian-based?17:41
JayFubuntu 20.04.6 LTS17:41
JayFbut like I said, GIT_SSH_COMMAND is going to `/tmp/some_crazy_dir/coder gitssh --`17:41
clarkbsee https://opendev.org/opendev/git-review/commit/5bfaa4a6f355a6820fe16c1aea77a01ba7b97eaa17:41
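The limbo being referred to: newer OpenSSH builds default scp to the SFTP protocol, which Gerrit's sshd historically does not implement, so the copy fails unless the legacy scp protocol is forced. A minimal sketch of the kind of workaround that commit applies; USERNAME is a placeholder:

    # -O (OpenSSH 8.8+) forces the legacy scp protocol, which Gerrit's
    # sshd understands; without it newer scp tries SFTP and fails
    scp -O -p -P 29418 USERNAME@review.opendev.org:hooks/commit-msg .git/hooks/commit-msg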
JayFso I'm confused as to how that scp could ever be authenticating17:42
JayFb/c there's no ssh keys on the box directly whatsoever17:42
clarkbJayF: I don't see git-review overriding GIT_SSH_COMMAND so that must be part of your end17:43
JayFDoes that make sense? Unless you somehow have scp working without authentication, this is never going to work in a situation where GIT_SSH_COMMAND is how auth is provided17:43
JayFclarkb: no, I'm saying that's how SSH auth works on this: it overrides GIT_SSH_COMMAND17:43
JayFso anything we do outside of the git binary is just ... not gonna work17:43
fungiin the age of container image continuous publication and consumption, many projects are starting to deem "releases" antiquated practice and some have stopped making them outright17:43
fungier, wrong buffer recall, sorry17:44
clarkbJayF: it depends on how ssh (and really scp) are intended to work I think? If they've completely removed the functionality of regular ssh and scp from the env then yes17:45
clarkbbut I know nothing about coder.com other than what you've just told me, and honestly it doesn't provide a good impression17:45
clarkbone way you can work around this is to install the commit message hook for change ids manually17:45
JayFIt's one of those things that work wonderfully if integrated, and as always we are the weirdos and not integrated17:46
JayFthat's what I was about to ask, is that documented?17:46
clarkbJayF: yes https://review.opendev.org/Documentation/user-changeid.html#creation17:47
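Per those docs, the hook can also be fetched over HTTPS, which avoids the SSH stack entirely; a minimal sketch, run from the top level of the clone:

    curl -Lo .git/hooks/commit-msg https://review.opendev.org/tools/hooks/commit-msg
    chmod +x .git/hooks/commit-msg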
clarkb"wonderfully integrated" seems like a nice way of saying "set up in a weird way that prevents standard tools from functioning"17:47
JayFIt's my way of saying "I'm an old man who doesn't get it, but I'm trying to help someone from another generation who likes these tools so I'm doing what I can to bridge"17:48
fungiit sounds like whatever tool this is wasn't designed for use in the way it's being used17:48
clarkboh I'm all for these developer bootstrap tools. I just don't understand why you need a special ssh implementation and removal of scp17:48
fungimost likely it's designed for other platforms17:49
JayFPretty much yeah. And adamcarthur5 knows the people who make the tool, so likely this will end up getting feedback-looped in, and improving the tool :)17:49
JayFscp is there, they just built around git authentication not ssh authentication (e.g. they don't manage SSH keys)17:49
clarkbI mean if you really need a special implementation, at least drop it in the correct location and include the standard ssh tool suite17:49
fungigit-review can also use https instead of ssh, if that's a preferable alternative17:50
adamcarthur5Yeah folks, it's designed for GitHub and GitLab, I definitely agree it's an oversight to not extend it past that. I have struggled to get it to work with scp before.17:50
clarkbfungi: yes, the docs link I provided shows how to grab it via both https and scp. I guess we can dig up git-review https docs17:51
JayFI think Adam has the breadcrumbs he needs :D17:52
adamcarthur5I will go to the coder folks and see if I can get a change made. Thanks :))17:52
fungiright, i mean beyond just the hook fetching (which we've also debated vendoring into git-review to optionally make this even less challenging), but also being able to fetch/push changes over https17:52
JayFWe had about 3 problems happening simultaneously: the old git-review version in ubuntu was obscuring errors first, then we needed -v to get the scp output, and now we've ID'd the real problem so Adam can figure out how he wants to integrate that into his tools17:52
JayFfungi: ++++++++ to vendoring it into git-review17:53
JayFor even just having it curl from a (even provided, if needed) public https URL as an optional fallback17:53
JayFthat functionality would allow working around these kinds of limitations, which are not super unique (e.g. not allowing full access to the SSH stack but getting limited things that can replace GIT_SSH)17:53
clarkbI think the main reason we've avoided that is the script does change. However, I'm not sure it has changed in meaningful ways so maybe we shouldn't care too much about that17:54
clarkbavoided vendoring I mean17:54
JayFAs much as I dislike this pattern it's not the first time I've seen it17:54
clarkbI think for me it's an odd approach to optimize for because ssh is a powerful and useful development tool that I use all the time.17:55
clarkbwhether normal ssh into dev instances, socks proxies or port forwarding, scp/sftp, etc17:55
JayFI find it useful in that way as well; but I think we're predisposed as advanced linux engineers to love ssh :D 17:57
JayFI remember when "look at this cool new thing I found to do with a pipe, five small unix programs, and ssh" was a spectator sport at LUG meetings :P17:57
clarkbha17:57
clarkbyou should see what we do with socat and skopeo17:58
fungithat almost sounds like a setup for a crude joke17:58
fungiand maybe it is, come to think of it17:58
clarkbthe docker ecosystem refuses to accept ipv6 literal addresses as valid image locations. So we use socat to proxy ipv4 to ipv617:58
JayFfungi: I was about to say something about stunnel but it feels weird now LOL17:59
JayFclarkb: even '[::1]' style?!17:59
clarkbJayF: correct17:59
fungiyeah, in this case the joke's on us17:59
clarkbJayF: the eventual response on the issue I filed was something along the lines of "this is too hard to do so we're closing it"17:59
JayF...that's an option?18:00
fungiit's okay, ipv6 has only been around for about 25 years now18:00
clarkbthey said we should be editing /etc/hosts18:00
JayFI need to go back 10 years and close the ticket to create Rackspace OnMetal /s18:00
clarkb(we don't edit /etc/hosts because its mounted read only in the test env)18:00
clarkband then podman/skopeo said they won't change their behavior because they maintain compat with docker18:01
clarkbexcept if you actually use podman and skopeo you know they don't actually maintain compatibility in a billion places18:01
clarkblike volume mounts18:01
clarkband networking18:02
clarkband all of the extra features for image management skopeo has18:02
JayFContainer technology peaked with John Landis Mason. 18:03
JayFI've never seen anything OCI adjacent hold anything nearly as delicious, either.18:03
clarkbit's ok we have socat :)18:05
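For anyone curious, the socat workaround is just a small TCP relay; a minimal sketch in which the ports and the IPv6 address are made-up placeholders:

    # listen on an IPv4 loopback port and relay to the IPv6-only
    # registry, so docker can be pointed at 127.0.0.1:5100 instead
    # of an IPv6 literal it refuses to parse
    socat TCP4-LISTEN:5100,fork,reuseaddr TCP6:[2001:db8::1]:5000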
clarkbfungi: repo-archives growth on gitea09 is looking sane. It's only 29M18:06
clarkbon gitea12 the oldest uncleaned archive is from 1703238294 which is still far older than 24 hours ago, but is newer than when I checked it on friday (1702763079)18:10
clarkbI still suspect some sort of short circuit in the cleanup iteration, but I expected a persistent stuck state when I first theorized that was the problem18:11
clarkbit clearly isn't getting stuck on a single item though, as it is slowly cleaning things up18:11
opendevreviewClark Boylan proposed opendev/system-config master: Remove bullseye python3.11 image builds  https://review.opendev.org/c/opendev/system-config/+/90501818:20
TheJuliafungi: I guess if you avoid a "release", then you might be able to short circuit arguments "you shipped a thing" in court...18:30
opendevreviewClark Boylan proposed opendev/system-config master: Disable gitea's update checker cron job  https://review.opendev.org/c/opendev/system-config/+/90502018:30
TheJuliafungi: anyhow, autohold 0000000036 awaits :)18:30
clarkbTheJulia: where can I find your key?18:34
fungiTheJulia: ssh root@149.202.177.18518:35
clarkbfungi: that explains why the key is on the node twice :)18:36
clarkbI thought my #echo "foo" >> .ssh/authorized_keys didn't respect the comment18:36
fungihah. i added it the same way with >> redirection18:37
clarkbvi(m) isn't installed so I fall back on that18:40
TheJulialol18:42
TheJuliaThanks guys18:42
clarkbyou're welcome18:42
fungiany time!18:44
TheJuliaOkay, I'm 90% sure you guys can nuke the hold. Looks like it was one of the steps in grenade setting an environment variable :(19:20
clarkbTheJulia: do you want us to wait for you to be 100% sure?19:34
TheJuliaeh, go ahead, no sense to wait at this point19:34
clarkbok I'll get that done19:35
clarkbI'll put together a meeting agenda for tomorrow after lunch. Feel free to add items if you have them19:46
fungithanks19:55
clarkbI've removed items from the agenda that I felt reasonably confident were old and could be removed, and added a few. I'm less sure about the topics covered during the mid December meeting20:47
clarkbfungi: looks like tonyb +2'd your robots.txt change. Not sure if you want to send it in or discuss it tomorrow first20:48
fungiwe can in theory do both?20:53
clarkbsure20:54
clarkbfollowing up on that job stuck in zuul. I don't see any obviously broken nodepool providers after tailing debug logs on the four launchers21:13
clarkbthe job is for refs/heads/stable/2023.2 which means I can't grep zuul logs for a change id. I guess I'll try that ref21:13
clarkbzuul.nodepool: [e: 14a8e5a05b554cb4ac214e7c8fc0d5d1] Unable to revise locked node request <NodeRequest 300-0023038662 ['ubuntu-focal']>21:15
clarkbI think this is the issue.21:15
clarkbLooking in the zk db I see the request and the request lock. It isn't clear to me how to map the request lock to a client yet21:19
clarkbreading the kazoo Lock code, the uuid is generated fresh each time a lock is made and doesn't seem to map to a specific connection/client?21:24
clarkbhere's some progress: the actual lock node has an ephemeralOwner value attached to it21:25
clarkblooking at cons output from zk04 none of the session_id values reported for connections there match the ephemeralOwner value for the lock21:33
clarkband the lock doesn't show up in the wchp watch by path listing (which would also give us session info)21:35
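cons, wchp, and the dump command that comes up later are ZooKeeper "four letter word" admin commands issued over the client port; a minimal sketch, assuming the commands are whitelisted via 4lw.commands.whitelist and that zk04 resolves:

    echo cons | nc zk04 2181   # client connections with session ids
    echo wchp | nc zk04 2181   # watches listed by znode path
    echo dump | nc zk04 2181   # sessions plus the ephemeral nodes they
                               # own (only answered by the leader)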
clarkbcorvus: ^ any ideas on how to debug this further?21:36
clarkbit is almost like this ephemeral node has no session/connection anymore but zookeeper didn't clean it up21:38
clarkbbut that may just be a perspective skewed by not knowing where the proper location is to look21:38
*** mtreinish_ is now known as mtreinish21:43
JayFHmm. Is there a way for me to figure out why a job has been waiting for something a while?22:33
JayFProbably just unlucky, but https://zuul.opendev.org/t/openstack/status#openstack/governance has been waiting for a py311 builder for a long time22:33
JayFand I guess I somewhat expected that gate to dedupe and only run on the tip, but that's probably just not enabled for governance repo?22:34
JayF(I linked all in the repo so you could see the one done and the one waiting, 903992 / 903239 are the ones specifically in question)22:35
JayFas is traditional, by posting about the job in here I got it scheduled22:40
JayFzuul is always listening <.< :D 22:40
fungiJayF: some of our providers have a high incidence of boot errors and can also take a long time to come ready. if a provider exceeds its limit of failures and timeouts, another provider will try. as a result, it's unfortunately not uncommon to see a job waiting over half an hour for a node assignment22:42
JayFthere's no insight for users on that end22:42
JayF?22:42
JayFJust making sure there's not a resource I'm missing22:42
fungii believe when a build transitions from waiting to queued, a node request has been created and that's when the job is waiting for a node assignment to fulfil it22:44
fungithe node requests graph on https://grafana.opendev.org/d/21a6e53ea4/zuul-status tracks how many pending node requests are awaiting fulfilment22:45
fungihttps://grafana.opendev.org/d/6c807ed8fd/nodepool has an overview of node building activity, launch times and errors22:47
fungithere are also per-provider nodepool dashboards at https://grafana.opendev.org/ with more detailed breakdowns22:47
fungithis is all in aggregate though, there's currently no per-build details on where a particular build is in the process of starting22:48
funginor is any granular tracking of node request states surfaced in zuul aside from service debug logs22:49
JayFat least being able to see "lots of things waiting for this node type" would be nice22:50
JayFI only worry in this case because I suspect there could be a repo with a weird config that needs a type of build nowhere else does22:51
JayFand if that gets missed in an update, governance would be a candidate for it lol22:51
fungifor the most part, node types aren't provider-specific so it's "lots of things waiting on nodes" more generally, but there are exceptions like arm, high-ram, gpu...22:51
fungi...nested-virt...22:51
clarkbthings aren't scheduled in a truly fifo manner, which is what creates the most confusion I think. I've brought this up in zuul before, but it would require a fairly extensive rewrite of the way nodepool handles things, so interest in doing that is low22:58
clarkbbut that means that a single node boot can be slow when everything else looks fine22:59
clarkbwith zero load on zuul and plenty of quota headroom22:59
clarkbthe last time we merged something to nodepool (which would reset the zk connections for launchers) was December 12. The node request in periodic with the likely stale lock ended up in that state sometime later around December 1623:01
clarkbwe might want to restart launchers to see if that unsticks things. If it doesn't, then a zuul scheduler likely holds the lock, except those get restarted weekly so that is unlikely23:01
clarkbbut I don't want to do that until corvus has a chance to weigh in23:01
JayF> things aren't scheduled in a truly fifo manner23:03
JayFThis needs to be sticky noted on my monitor23:03
JayFbecause I think this is the base-level assumption that gets broken and sends me going "WTF" down the rabbithole23:04
fungizuul does try to fifo things, but it's best effort and there are a lot of variables that can cause stuff to run in a different order than the order in which triggering events were received23:11
JayFI assume it FIFOs requests but doesn't guarantee they get retried in order and/or that they succeed in order (I can imagine a "dumb"-in-a-good-way backoff system a la smtp failures)23:17
fungiright, though also projects and named queues are subject to a "fair queuing" algorithm which tries to prevent projects from monopolizing available resources and starving out requests from less active projects23:25
clarkbit FIFOs the requests not the node assignments23:25
fungiand pipelines also have relative priorities23:25
clarkbyour request gets assigned to a provider and it will attempt to run there three times. That provider may just be slow booting your node23:25
clarkbif all three fail that may have been up to 15 minutes or so to timeout and then you wait to get picked up by another provider23:26
fungiand then there are windows in dependent pipelines too23:26
clarkbthe alternative approach I've described as being more intuitive to users would be to keep track of how many of each node type has been requested and boot them from the various providers as needed, but assign them in fifo order23:26
fungiso lots of ways that enqueued time won't match up with node request generation time too23:26
JayFI don't think "making developers go 'wtf' less often at points of high contention or failure" should be a high priority on your todo list 23:27
JayFlol23:27
corvusclarkb: catching up23:28
clarkbone thing that complicates the fifo assignment idea is we currently schedule all nodes in a multinode request in the same provider23:28
fungiwhich adds to the user confusion because the subtle transition from buildset enqueued to build nodes requested is easy to miss23:28
clarkbcorvus: the discussion of the stuck job starts around 21:13 UTC23:29
corvusclarkb: the `dump` command says that /nodepool/requests-lock/300-0023038662/a4e1ceb94ce4472f9674961970869407__lock__0000000002 is held by the same session that holds /nodepool/components/launcher/nl01000000065323:41
clarkbcorvus: ok I figured there was some four letter command I was missing23:44
clarkblooks like that is nl01 according to a get on that path23:44
corvusyeah (it's in the path, so you don't have to do the extra get step)23:44
corvusjust gets lost in the really bignum :)23:44
clarkbah thats the prefix23:44
clarkbI see that now looking at the other entries in the launcher/ dir23:45
corvus(which is worse if your font doesn't have serifs)23:45
clarkbcorvus: if you grep 300-0023038662 in the nl01 launcher debug log all three rax providers say they are trying to lock the request but it is held by someone else23:45
clarkbI guess there could be a stray thread somewhere holding it for one of the providers23:45
clarkb(nl01 is only rax providers)23:45
corvusyep, and i haven't found any other logs about that23:45
corvusdo you know offhand how old the request is?23:46
clarkbcorvus: from ~December 12 which is probably older than our logs23:46
corvusthen i think the only debug step left is sigusr2; want to do it or shall i?23:46
clarkbgo for it23:47
clarkbdon't forget to sigusr2 a second time to disable profiling23:47
corvusdone (x2)23:48
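For context, zuul and nodepool services treat SIGUSR2 as a debug toggle: the first signal logs a stack trace for every thread and starts the yappi profiler if it is installed, and the second stops profiling and logs the report, hence the two signals. A minimal sketch, assuming the launcher runs as PID 1 in a container named nodepool-launcher:

    docker kill --signal=USR2 nodepool-launcher   # dump threads, start profiling
    docker kill --signal=USR2 nodepool-launcher   # stop profiling, log report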
clarkbour node request id string doesn't show up in thread names.23:50
corvusand i don't see any threads of concern.  so i suspect some catastrophe happened in the past and we recovered but leaked the lock.  probably too difficult to track without the logs, so i think we just restart now.23:52
clarkback. Do you want to do that or should I?23:52
corvusi can23:52
clarkbit's weird that the ephemeralOwner value doesn't seem to match anything in the wchp output23:53
clarkbbut maybe that is the hint that the connection is gone on the client side and zk is somehow thinking it still lives or something23:53
corvusi've never matched them that way, only with dump, so i can't address that.  but i don't think we've proven that the connection is gone; only that the launcher leaked the lock.  it could have done that by locking and throwing an exception and not unlocking (as an example)23:56
corvus#status log restarted nl01 to release leaked zk request lock23:57
opendevstatuscorvus: finished logging23:57
corvusclarkb: a more recent dump shows that old lock path is gone now (there's a new one held by the new nl01 connection)23:57
