Monday, 2024-01-08

*** gthiemon1e is now known as gthiemonge07:41
opendevreviewJan Marchel proposed openstack/project-config master: Add new NebulOuS component: cloud-fog-service-broker  https://review.opendev.org/c/openstack/project-config/+/90495709:34
opendevreviewMerged openstack/project-config master: Add new NebulOuS component: cloud-fog-service-broker  https://review.opendev.org/c/openstack/project-config/+/90495713:26
*** blarnath is now known as d34dh0r5315:00
TheJuliao/ folks, any chance we can get the next failure of ironic-grenade held? We're specifically looking at a change to the default rbac policy, and the job is failing in unexpected ways; we're not sure if it is a bug in the upgrade, the state set up by devstack, or in the library, so instead of guessing it is easier to hold the next failure. Specifically the change is 902009 on openstack/ironic.15:23
fungiTheJulia: autohold has been set16:01
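A hold like this is normally created with zuul-client's autohold command. A minimal sketch of what such a command might look like, assuming admin credentials for the zuul API; the reason string and count here are illustrative:

    zuul-client --zuul-url https://zuul.opendev.org autohold \
        --tenant openstack --project openstack/ironic \
        --job ironic-grenade --change 902009 \
        --reason "TheJulia: debug rbac policy upgrade failure" \
        --count 1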
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Introduce LogJuicer roles  https://review.opendev.org/c/zuul/zuul-jobs/+/89921216:09
TheJuliafungi: thanks!16:17
clarkbmy firefox tabs are an append only database. Pruning this database makes me sad but firefox performance is suffering16:38
fungii'm hoping one of these times ff crashes, its session history will be corrupted and it won't be able to recover my tabs16:38
clarkbthe really fun behavior is that as you delete tabs the ui and subsequent tab deletions become quicker16:39
clarkbI wonder if I restart ff after every hundred or so tab closes if that will make it faster too16:40
clarkbThere is still no new haproxy 2.9.x release to test16:42
clarkbI'm a bit surprised that both gitea and haproxy are fine with these pretty major bugs hanging out on stable releases for weeks16:43
fungiin the age of container image continuous publication and consumption, many projects are starting to deem "releases" antiquated practice and some have stopped making them outright16:46
clarkbheh eventually things get quick enough to render the pages before you can close them16:52
clarkbthe held node confirms my fears that the weekly cron run to delete all repo archives conflicts with the cleanup run. The weekly run failed with a no such file or directory message17:27
clarkbI'll work on a proper cron specification to run it a few hours after the cleanup run, say 0300 Sunday17:27
clarkbjust have to figure out how to express that in the gitea config first17:28
clarkbgood thing I checked because the go cron lib uses a different format specification than my cron implementation17:32
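For context, a classic crontab entry uses five fields, while the cron library Gitea embeds (robfig/cron for Go) can also accept a leading seconds field and @-style descriptors. A minimal sketch of the two formats; the exact section name and whether this Gitea version wants the seconds field are assumptions to verify against its docs:

    # classic 5-field crontab: 03:00 every Sunday
    0 3 * * 0

    # possible Gitea app.ini equivalent, with a leading seconds field
    [cron.delete_repo_archives]
    ENABLED = true
    SCHEDULE = 0 0 3 * * 0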
opendevreviewClark Boylan proposed opendev/system-config master: Enable gitea delete_repo_archives cron job  https://review.opendev.org/c/opendev/system-config/+/90487417:33
JayFI'm working with adamcarthur5 to troubleshoot an issue with coder.com-hosted SSH agents not being able to connect to gerrit. I'm 99.99% sure it's some kind of bad assumption in their SSH client stack, but if you see anything interesting/weird in logs, or have any experiences to share it'd be appreciated.17:35
clarkbJayF: do you get any errors on the client side?17:36
JayFBad ones that are mostly from their bespoke client. I'm setting up a clean test environment with Adam now, but was mainly curious if there was anything beyond auth failures in the logs.17:36
JayFall in pushes to opendev/sandbox17:36
JayFpermission denied publickey is the error we're seeing17:37
clarkbI'm not seeing errors17:37
JayFre-checking against all the basics17:37
clarkbI see logins and log outs17:38
clarkbusually the things to check are that you aren't talking to port 22 (because gerrit's sshd listens on 29418), that you have the correct username (which is case sensitive), and that you are using the correct key which is in your gerrit account17:38
clarkbhistorically there were problems with rsa keys but that shouldn't be a problem with the version of gerrit we are running17:39
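A quick way to verify all three of those at once is to run a harmless command over Gerrit's SSH API; a minimal sketch, with USERNAME standing in for the account's gerrit username:

    ssh -p 29418 USERNAME@review.opendev.org gerrit version

If that prints a version string, the port, username, and key are all fine.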
JayFit's failing on scp -P29418 user@review:hooks/commit-msg .git/hooks/commit-msg17:39
JayFIt's overriding GIT_SSH_COMMAND17:39
clarkbaha that helps17:39
JayFso I think because that's doing an end-run around git, it's got no auth17:39
clarkbno17:40
clarkbit's because scp is in limbo right now17:40
JayFWe got more output with `-v`17:40
JayFIt was a worse error beforehand :)17:40
clarkbyou have to force old scp iirc17:40
clarkbwhich I think git review does by default but maybe the client there doesn't support this?17:40
JayFThis is some kind of container, I think debian-based?17:41
JayFubuntu 20.04.6 LTS17:41
JayFbut like I said, GIT_SSH_COMMAND is going to `/tmp/some_crazy_dir/coder gitssh --`17:41
clarkbsee https://opendev.org/opendev/git-review/commit/5bfaa4a6f355a6820fe16c1aea77a01ba7b97eaa17:41
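The limbo being referred to: newer OpenSSH builds default scp to the SFTP protocol, which Gerrit's sshd historically does not implement, so the copy fails unless the legacy scp protocol is forced. A minimal sketch of the kind of workaround that commit applies; USERNAME is a placeholder:

    # -O (OpenSSH 8.8+) forces the legacy scp protocol, which Gerrit's
    # sshd understands; without it newer scp tries SFTP and fails
    scp -O -p -P 29418 USERNAME@review.opendev.org:hooks/commit-msg .git/hooks/commit-msg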
JayFso I'm confused as to how that scp could ever be authenticating17:42
JayFb/c there's no ssh keys on the box directly whatsoever17:42
clarkbJayF: I don't see git-review overriding GIT_SSH_COMMAND so that must be part of your end17:43
JayFDoes that make sense? Unless you somehow have scp working without authentication, this is never going to work in a situation where GIT_SSH_COMMAND is how auth is provided17:43
JayFclarkb: no, I'm saying that's how SSH auth works on this: it overrides GIT_SSH_COMMAND17:43
JayFso anything we do outside of the git binary is just ... not gonna work17:43
fungiin the age of container image continuous publication and consumption, many projects are starting to deem "releases" antiquated practice and some have stopped making them outright17:43
fungier, wrong buffer recall, sorry17:44
clarkbJayF: it depends on how ssh (and really scp) are intended to work I think? If they've completely removed the functionality of regular ssh and scp from the env then yes17:45
clarkbbut I know nothing about coder.com other than what you've just told me, and honestly it doesn't provide a good impression17:45
clarkbone way you can work around this is to install the commit message hook for change ids manually17:45
JayFIt's one of those things that work wonderfully if integrated, and as always we are the weirdos and not integrated17:46
JayFthat's what I was about to ask, is that documented?17:46
clarkbJayF: yes https://review.opendev.org/Documentation/user-changeid.html#creation17:47
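Per those docs, the hook can also be fetched over HTTPS, which avoids the SSH stack entirely; a minimal sketch, run from the top level of the clone:

    curl -Lo .git/hooks/commit-msg https://review.opendev.org/tools/hooks/commit-msg
    chmod +x .git/hooks/commit-msg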
clarkb"wonderfully integrated" seems like a nice way of saying "set up in a weird way that prevents standard tools from functioning"17:47
JayFIt's my way of saying "I'm an old man who doesn't get it, but I'm trying to help someone from another generation who likes these tools so I'm doing what I can to bridge"17:48
fungiit sounds like whatever tool this is wasn't designed for use in the way it's being used17:48
clarkboh I'm all for these developer bootstrap tools. I just don't understand why you need a special ssh implementation and removal of scp17:48
fungimost likely it's designed for other platforms17:49
JayFPretty much yeah. And adamcarthur5 knows the people who make the tool, so likely this will end up getting feedback-looped in, and improving the tool :)17:49
JayFscp is there, they just built around git authentication not ssh authentication (e.g. they don't manage SSH keys)17:49
clarkbI mean if you really need a special implementation, at least drop it in the correct location and include the standard ssh tool suite17:49
fungigit-review can also use https instead of ssh, if that's a preferable alternative17:50
adamcarthur5Yeah folks, it's designed for GitHub and GitLab, I definitely agree it's an oversight to not extend it past that. I have struggled to get it to work with scp before.17:50
clarkbfungi: yes, the docs link I provided shows how to grab it via both https and scp. I guess we can dig up git-review https docs17:51
JayFI think Adam has the breadcrumbs he needs :D17:52
adamcarthur5I will go to the coder folks and see if I can get a change made. Thanks :))17:52
fungiright, i mean beyond just the hook fetching (which we've also debated vendoring into git-review to optionally make this even less challenging), but also being able to fetch/push changes over https17:52
JayFWe had about 3 problems happening simultaneously: the old git-review version in ubuntu was obscuring errors first, then we needed -v to get the scp output, and now we've ID'd the real problem so Adam can figure out how he wants to integrate that into his tools17:52
JayFfungi: ++++++++ to vendoring it into git-review17:53
JayFor even just having it curl from a (even provided, if needed) public https URL as an optional fallback17:53
JayFthat functionality would allow working around these kinds of limitations, which are not super unique (e.g. not allowing full access to the SSH stack but getting limited things that can replace GIT_SSH)17:53
clarkbI think the main reason we've avoided that is the script does change. However, I'm not sure it has changed in meaningful ways so maybe we shouldn't care too much about that17:54
clarkbavoided vendoring I mean17:54
JayFAs much as I dislike this pattern it's not the first time I've seen it17:54
clarkbI think for me it's an odd approach to optimize for because ssh is a powerful and useful development tool that I use all the time.17:55
clarkbwhether normal ssh into dev instances, socks proxies or port forwarding, scp/sftp, etc17:55
JayFI find it useful in that way as well; but I think we're predisposed as advanced linux engineers to love ssh :D 17:57
JayFI remember when "look at this cool new thing I found to do with a pipe, five small unix programs, and ssh" was a spectator sport at LUG meetings :P17:57
clarkbha17:57
clarkbyou should see what we do with socat and skopeo17:58
fungithat almost sounds like a setup for a crude joke17:58
fungiand maybe it is, come to think of it17:58
clarkbthe docker ecosystem refuses to accept ipv6 literal addresses as valid image locations. So we use socat to proxy ipv4 to ipv617:58
JayFfungi: I was about to say something about stunnel but it feels weird now LOL17:59
JayFclarkb: even '[::1]' style?!17:59
clarkbJayF: correct17:59
fungiyeah, in this case the joke's on us17:59
clarkbJayF: the eventual response on the issue I filed was something along the lines of "this is too hard to do so we're closing it"17:59
JayF...that's an option?18:00
fungiit's okay, ipv6 has only been around for about 25 years now18:00
clarkbthey said we should be editing /etc/hosts18:00
JayFI need to go back 10 years and close the ticket to create Rackspace OnMetal /s18:00
clarkb(we don't edit /etc/hosts because its mounted read only in the test env)18:00
clarkband then podman/skopeo said they won't change their behavior because they maintain compat with docker18:01
clarkbexcept if you actually use podman and skopeo you know they don't actually maintain compatibility in a billion places18:01
clarkblike volume mounts18:01
clarkband networking18:02
clarkband all of the extra features for image management skopeo has18:02
JayFContainer technology peaked with John Landis Mason. 18:03
JayFI've never seen anything OCI adjacent hold anything nearly as delicious, either.18:03
clarkbit's ok we have socat :)18:05
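For anyone curious, the socat workaround is just a small TCP relay; a minimal sketch in which the ports and the IPv6 address are made-up placeholders:

    # listen on an IPv4 loopback port and relay to the IPv6-only
    # registry, so docker can be pointed at 127.0.0.1:5100 instead
    # of an IPv6 literal it refuses to parse
    socat TCP4-LISTEN:5100,fork,reuseaddr TCP6:[2001:db8::1]:5000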
clarkbfungi: repo-archives growth on gitea09 is looking sane. It's only 29M18:06
clarkbon gitea12 the oldest uncleaned archive is from 1703238294 which is still far older than 24 hours ago, but is newer than when I checked it on friday (1702763079)18:10
clarkbI still suspect some sort of short circuit in the cleanup iteration, but I expected a persistent stuck state when I first theorized that was the problem18:11
clarkbit clearly isn't getting stuck on a single item though, as it is slowly cleaning things up18:11
opendevreviewClark Boylan proposed opendev/system-config master: Remove bullseye python3.11 image builds  https://review.opendev.org/c/opendev/system-config/+/90501818:20
TheJuliafungi: I guess if you avoid a "release", then you might be able to short circuit arguments "you shipped a thing" in court...18:30
opendevreviewClark Boylan proposed opendev/system-config master: Disable gitea's update checker cron job  https://review.opendev.org/c/opendev/system-config/+/90502018:30
TheJuliafungi: anyhow, autohold 0000000036 awaits :)18:30
clarkbTheJulia: where can I find your key?18:34
fungiTheJulia: ssh root@149.202.177.18518:35
clarkbfungi: that explains why the key is on the node twice :)18:36
clarkbI thought my #echo "foo" >> .ssh/authorized_keys didn't respect the comment18:36
fungihah. i added it the same way with >> redirection18:37
clarkbvi(m) isn't installed so I fall back on that18:40
TheJulialol18:42
TheJuliaThanks guys18:42
clarkbyou're welcome18:42
fungiany time!18:44
TheJuliaOkay, I'm 90% sure you guys can nuke the hold. Looks like it was one of the steps in grenade setting an environment variable :(19:20
clarkbTheJulia: do you want us to wait for you to be 100% sure?19:34
TheJuliaeh, go ahead, no sense to wait at this point19:34
clarkbok I'll get that done19:35
clarkbI'll put together a meeting agenda for tomorrow after lunch. Feel free to add items if you have them19:46
fungithanks19:55
clarkbI've removed items from the agenda that I felt reasonably confident were old and could be removed, and added a few. I'm less sure about the topics covered during the mid December meeting20:47
clarkbfungi: looks like tonyb +2'd your robots.txt change. Not sure if you want to send it in or discuss it tomorrow first20:48
fungiwe can in theory do both?20:53
clarkbsure20:54
clarkbfollowing up on that job stuck in zuul. I don't see any obviously broken nodepool providers after tailing debug logs on the four launchers21:13
clarkbthe job is for refs/heads/stable/2023.2 which means I can't grep zuul logs for a change id. I guess I'll try that ref21:13
clarkbzuul.nodepool: [e: 14a8e5a05b554cb4ac214e7c8fc0d5d1] Unable to revise locked node request <NodeRequest 300-0023038662 ['ubuntu-focal']>21:15
clarkbI think this is the issue.21:15
clarkbLooking in the zk db I see the request and the request lock. It isn't clear to me how to map the request lock to a client yet21:19
clarkbreading the kazoo Lock code, the uuid is generated fresh each time a lock is made and doesn't seem to map to a specific connection/client?21:24
clarkbhere's some progress: the actual lock node has an ephemeralOwner value attached to it21:25
clarkblooking at cons output from zk04 none of the session_id values reported for connections there match the ephemeralOwner value for the lock21:33
clarkband the lock doesn't show up in the wchp watch by path listing (which would also give us session info)21:35
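cons, wchp, and the dump command that comes up later are ZooKeeper "four letter word" admin commands issued over the client port; a minimal sketch, assuming the commands are whitelisted via 4lw.commands.whitelist and that zk04 resolves:

    echo cons | nc zk04 2181   # client connections with session ids
    echo wchp | nc zk04 2181   # watches listed by znode path
    echo dump | nc zk04 2181   # sessions plus the ephemeral nodes they
                               # own (only answered by the leader)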
clarkbcorvus: ^ any ideas on how to debug this further?21:36
clarkbit is almost like this ephemeral node has no session/connection anymore but zookeeper didn't clean it up21:38
clarkbbut that may just be a perspective skewed by not knowing where the proper location is to look21:38
*** mtreinish_ is now known as mtreinish21:43
JayFHmm. Is there a way for me to figure out why a job has been waiting for something a while?22:33
JayFProbably just unlucky, but https://zuul.opendev.org/t/openstack/status#openstack/governance has been waiting for a py311 builder for a long time22:33
JayFand I guess I somewhat expected that gate to dedupe and only run on the tip, but that's probably just not enabled for governance repo?22:34
JayF(I linked all in the repo so you could see the one done and the one waiting, 903992 / 903239 are the ones specifically in question)22:35
JayFas is traditional, by posting about the job in here I got it scheduled22:40
JayFzuul is always listening <.< :D 22:40
fungiJayF: some of our providers have a high incidence of boot errors and can also take a long time to come ready. if a provider exceeds its limit of failures and timeouts, another provider will try. as a result, it's unfortunately not uncommon to see a job waiting over half an hour for a node assignment22:42
JayFthere's no insight for users on that end22:42
JayF?22:42
JayFJust making sure there's not a resource I'm missing22:42
fungii believe when a build transitions from waiting to queued, a node request has been created and that's when the job is waiting for a node assignment to fulfil it22:44
fungithe node requests graph on https://grafana.opendev.org/d/21a6e53ea4/zuul-status tracks how many pending node requests are awaiting fulfilment22:45
fungihttps://grafana.opendev.org/d/6c807ed8fd/nodepool has an overview of node building activity, launch times and errors22:47
fungithere are also per-provider nodepool dashboards at https://grafana.opendev.org/ with more detailed breakdowns22:47
fungithis is all in aggregate though, there's currently no per-build details on where a particular build is in the process of starting22:48
funginor is any granular tracking of node request states surfaced in zuul aside from service debug logs22:49
JayFat least being able to see "lots of things waiting for this node type" would be nice22:50
JayFI only worry in this case because I suspect there could be a repo with a weird config that needs a type of build nowhere else does22:51
JayFand if that gets missed in an update, governance would be a candidate for it lol22:51
fungifor the most part, node types aren't provider-specific so it's "lots of things waiting on nodes" more generally, but there are exceptions like arm, high-ram, gpu...22:51
fungi...nested-virt...22:51
clarkbthings aren't scheduled in a truly fifo manner, which is what creates the most confusion I think. I've brought this up in zuul before, but it would require a fairly extensive rewrite of the way nodepool handles things, so interest in doing that is low22:58
clarkbbut that means that a single node boot can be slow when everything else looks fine22:59
clarkbwith zero load on zuul and plenty of quota headroom22:59
clarkbthe last time we merged something to nodepool (which would reset the zk connections for launchers) was December 12. The node request in periodic with the likely stale lock ended up in that state sometime later around December 1623:01
clarkbwe might want to restart launchers to see if that unsticks things. If it doesn't, then a zuul scheduler likely holds the lock, except those get restarted weekly so that is unlikely23:01
clarkbbut I don't want to do that until corvus has a chance to weigh in23:01
JayF> things aren't scheduled in a truly fifo manner23:03
JayFThis needs to be sticky noted on my monitor23:03
JayFbecause I think this is the base-level assumption that gets broken and sends me going "WTF" down the rabbithole23:04
fungizuul does try to fifo things, but it's best effort and there are a lot of variables that can cause stuff to run in a different order than the order in which triggering events were received23:11
JayFI assume it FIFOs requests but doesn't guarantee they get retried in order and/or that they succeed in order (I can imagine a "dumb"-in-a-good-way backoff system a la smtp failures)23:17
fungiright, though also projects and named queues are subject to a "fair queuing" algorithm which tries to prevent projects from monopolizing available resources and starving out requests from less active projects23:25
clarkbit FIFOs the requests not the node assignments23:25
fungiand pipelines also have relative priorities23:25
clarkbyour request gets assigned to a provider and it will attempt to run there three times. That provider may just be slow booting your node23:25
clarkbif all three fail that may have been up to 15 minutes or so to timeout and then you wait to get picked up by another provider23:26
fungiand then there are windows in dependent pipelines too23:26
clarkbthe alternative approach I've described as being more intuitive to users would be to keep track of how many of each node type has been requested and boot them from the various providers as needed, but assign them in fifo order23:26
fungiso lots of ways that enqueued time won't match up with node request generation time too23:26
JayFI don't think "making developers go 'wtf' less often at points of high contention or failure" should be a high priority on your todo list 23:27
JayFlol23:27
corvusclarkb: catching up23:28
clarkbone thing that complicates the fifo assignment idea is we currently schedule all nodes in a multinode request in the same provider23:28
fungiwhich adds to the user confusion because the subtle transition from buildset enqueued to build nodes requested is easy to miss23:28
clarkbcorvus: the discussion of the stuck job starts around 21:13 UTC23:29
corvusclarkb: the `dump` command says that /nodepool/requests-lock/300-0023038662/a4e1ceb94ce4472f9674961970869407__lock__0000000002 is held by the same session that holds /nodepool/components/launcher/nl01000000065323:41
clarkbcorvus: ok I figured there was some four letter command I was missing23:44
clarkblooks like that is nl01 according to a get on that path23:44
corvusyeah (it's in the path, so you don't have to do the extra get step)23:44
corvusjust gets lost in the really bignum :)23:44
clarkbah thats the prefix23:44
clarkbI see that now looking at the other entries in the launcher/ dir23:45
corvus(which is worse if your font doesn't have serifs)23:45
clarkbcorvus: if you grep 300-0023038662 in the nl01 launcher debug log all three rax providers say they are trying to lock the request but it is held by someone else23:45
clarkbI guess there could be a stray thread somewhere holding it for one of the providers23:45
clarkb(nl01 is only rax providers)23:45
corvusyep, and i haven't found any other logs about that23:45
corvusdo you know offhand how old the request is?23:46
clarkbcorvus: from ~December 12 which is probably older than our logs23:46
corvusthen i think the only debug step left is sigusr2; want to do it or shall i?23:46
clarkbgo for it23:47
clarkbdon't forget to sigusr2 a second time to disable profiling23:47
corvusdone (x2)23:48
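For context, zuul and nodepool services treat SIGUSR2 as a debug toggle: the first signal logs a stack trace for every thread and starts the yappi profiler if it is installed, and the second stops profiling and logs the report, hence the two signals. A minimal sketch, assuming the launcher runs as PID 1 in a container named nodepool-launcher:

    docker kill --signal=USR2 nodepool-launcher   # dump threads, start profiling
    docker kill --signal=USR2 nodepool-launcher   # stop profiling, log report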
clarkbour node request id string doesn't show up in thread names.23:50
corvusand i don't see any threads of concern.  so i suspect some catastrophe happened in the past and we recovered but leaked the lock.  probably too difficult to track without the logs, so i think we just restart now.23:52
clarkback. Do you want to do that or should I?23:52
corvusi can23:52
clarkbit's weird that the ephemeralOwner value doesn't seem to match anything in the wchp output23:53
clarkbbut maybe that is the hint that the connection is gone on the client side and zk is somehow thinking it still lives or something23:53
corvusi've never matched them that way, only with dump, so i can't address that.  but i don't think we've proven that the connection is gone; only that the launcher leaked the lock.  it could have done that by locking and throwing an exception and not unlocking (as an example)23:56
corvus#status log restarted nl01 to release leaked zk request lock23:57
opendevstatuscorvus: finished logging23:57
corvusclarkb: a more recent dump shows that old lock path is gone now (there's a new one held by the new nl01 connection)23:57
