Tuesday, 2021-08-03

00:00 * corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/QJXNXwCbiAfkqpfrwWRKgcUw/message.txt >
00:01 <corvus> tristanC: i verified that the ssh private key is the same as what's currently being used with irc
00:01 <corvus> tristanC: could it be something about key types and the ssh library being used?
00:03 <corvus> i also verified that the key is present and readable in the container
00:05 <corvus> tristanC: oh interesting, i can reproduce that inside the container with plain ssh, but not on the host.
00:05 <corvus> tristanC: did you by any chance build this on a fedora container base?
00:05 <corvus> yes, appears to be f34
00:06 <corvus> tristanC: i've heard something about f34 and ssh.  i don't know the details, but i'm sure others in this channel do.
00:07 <corvus> tristanC: anyway, i bet you can reproduce the problem locally with an rsa key and the container image
00:17 <ianw> yeah that will need a PubkeyAcceptedKeyTypes ssh-rsa
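(For context: Fedora 34's default crypto policy disables SHA-1 ssh-rsa signatures in the OpenSSH client, so RSA auth from the f34-based container fails against a server that doesn't explicitly advertise SHA-2 support. A minimal sketch of the workaround ianw points at, runnable from inside the container -- the key path and account name are hypothetical, and PubkeyAcceptedKeyTypes is the older OpenSSH spelling of what later releases call PubkeyAcceptedAlgorithms:)

```python
# Sketch of the client-side workaround (hypothetical key path and account).
import subprocess

subprocess.run(
    [
        "ssh",
        "-o", "PubkeyAcceptedKeyTypes=+ssh-rsa",  # re-allow SHA-1 RSA signatures
        "-i", "/var/lib/gerritbot/id_rsa",        # hypothetical key path
        "-p", "29418",                            # Gerrit's standard ssh port
        "gerritbot@review.opendev.org",           # hypothetical account
        "gerrit", "version",
    ],
    check=True,
)
```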
00:40 <tristanC> corvus: https://review.opendev.org/c/opendev/system-config/+/803125/3 uses another container image which i think should work better
01:11 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
01:13 <tristanC> corvus: i've updated the image tag to a new version which uses libssh2 directly, the image is now only 20.9MB
01:17 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
01:31 <Clark[m]> Really seems like fedora should update their default fallback hash type in the client
01:42 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add script to import gerritbot configuration  https://review.opendev.org/c/opendev/system-config/+/802983
01:43 <Clark[m]> https://bugzilla.redhat.com/show_bug.cgi?id=1940201 is the issue basically
01:43 <Clark[m]> Except in our case it's not a really old server, just one that doesn't indicate support for sha2 explicitly
01:44 <Clark[m]> However in that case it is succeeding because it's an old server
01:44 <Clark[m]> We want it to fail and force sha2
02:00 <Clark[m]> https://github.com/openssh/openssh-portable/blob/master/sshconnect2.c#L1191 I think the code that needs modifying is somewhere near there
04:45 *** ykarel|away is now known as ykarel
05:50 *** marios is now known as marios|ruck
06:54 *** jpena|off is now known as jpena
07:27 *** sshnaidm|afk is now known as sshnaidm
07:30 *** rpittau|afk is now known as rpittau
08:35 <opendevreview> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro  https://review.opendev.org/c/openstack/diskimage-builder/+/784363
08:46 *** ykarel is now known as ykarel|lunch
09:02 *** diablo_rojo__ is now known as diablo_rojo
10:03 *** ykarel|lunch is now known as ykarel
10:04 <opendevreview> Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/803190
11:18 *** dviroel|out is now known as dviroel
11:20 *** jpena is now known as jpena|lunch
11:23 *** jpena|lunch is now known as jpena
11:30 *** jpena is now known as jpena|lunch
11:42 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure  https://review.opendev.org/c/opendev/system-config/+/803328
11:44 <tristanC> clarkb: with ^ the bot no longer uses openssh, which seems to fix the current issue
11:45 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
11:47 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
12:33 *** jpena|lunch is now known as jpena
12:33 <opendevreview> Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/803190
13:49 <fungi> python v3.10.0rc1 has been tagged
14:39 *** Guest3154 is now known as notcorvus
14:40 *** notcorvus is now known as corvus
14:40 *** corvus is now known as Guest3197
14:42 *** Guest3197 is now known as corvus
14:42 *** corvus is now known as notcorvus
14:42 *** notcorvus is now known as corvus
14:43 *** corvus is now known as Guest3200
14:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Update our python-base and -builder images  https://review.opendev.org/c/opendev/system-config/+/803357
14:51 <clarkb> it occurred to me that we're looking at gitea, gerrit, and zuul deployment updates and it seemed like a good idea to update our base images as a result
14:51 <clarkb> fungi's note about new python also had me thinking about it a bit
14:52 *** ykarel is now known as ykarel|away
14:55 *** ksambor is now known as ksambor|away
15:06 *** jpena is now known as jpena|off
15:15 <Guest3200> there are 2 matrix fixes at https://review.opendev.org/803294 and https://review.opendev.org/803328 that would be nice to merge soon
15:17 *** Guest3200 is now known as corvus
15:17 *** corvus is now known as notcorvus
15:17 *** notcorvus is now known as corvus
15:41 <fungi> rackspace says afsdb03 was rebooted on sunday around 16:45 utc due to a host issue
15:42 <fungi> seems to be up and running at least
15:47 <fungi> also i've discovered that lists.o.o has gotten its ipv6 address listed in the spamhaus xbl, i'm waiting for the delisting request to be reviewed
15:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Use the gitea api in the gitea renaming playbook  https://review.opendev.org/c/opendev/system-config/+/803366
15:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Update gitea project creation to only use the REST API  https://review.opendev.org/c/opendev/system-config/+/803367
15:50 <fungi> ahh, there it goes
15:51 <fungi> #status log Requested Spamhaus XBL delisting for the IPv6 address of the lists.openstack.org server
15:51 <opendevstatus> fungi: finished logging
15:51 <clarkb> I'm trying to separate good hygiene (above) from the gitea 1.15.0 stuff
15:52 <clarkb> and I split things up because the second change is quite a bit more involved than the first, but we really only need the first for the v1.15.0 upgrade
15:52 <clarkb> I'll rebase once it is clear things are working
15:53 <clarkb> tristanC: corvus: I guess libssh2 wants both a private key file and a pubkey file unlike openssh?
15:53 <clarkb> oh ha, the next file comments explain that
15:54 <fungi> clarkb: spot-checking the next proposed batch of 73 gerrit account cleanups i don't see anything concerning. comfortable with proceeding when convenient
15:54 <clarkb> fungi: excellent. Probably run the scripts tomorrow
15:56 <clarkb> corvus: I've approved the gerritbot change but have left the eavesdrop change in case fungi wants to review it. I think that change is more than small enough to go ahead and approve if fungi doesn't have time though
15:58 <clarkb> infra-root: https://review.opendev.org/c/opendev/system-config/+/803357 is a change to update our base python images. If you think that is a good idea I'd like to land it before I land any changes like https://review.opendev.org/c/opendev/system-config/+/803265, as it will ensure we're deploying up to date images as we go through and update images for other reasons
15:58 <tristanC> clarkb: corvus: thanks for the quick review!
15:59 <corvus> tristanC: you're welcome and thanks!
16:00 <fungi> i'll try to look at the eavesdrop change in a bit, after i get some lunch down
16:01 <tristanC> clarkb: about the bot monitoring, would it be ok to expose the prometheus port? and if so, what's the proper place to define the desired port number?
16:02 <clarkb> tristanC: it's fine, but we don't have a prometheus server yet so I'm not sure it will do much good. Writing a spec for that is on the todo list
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Rebuild gerrit images to include fixes  https://review.opendev.org/c/opendev/system-config/+/803372
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Use the mariadb jdbc connector in gerrit  https://review.opendev.org/c/opendev/system-config/+/803373
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Remove the mysql connector from our gerrit images  https://review.opendev.org/c/opendev/system-config/+/803374
16:13 <clarkb> ianw: ^ fyi that's a followup on our upstream bug fixes. They are present in the stable branches we build now.
16:13 *** rpittau is now known as rpittau|afk
16:14 <clarkb> stable-3.4 doesn't include them yet. There is a merge change in flight there that I want to land, then I can propose a merge change that includes our two fixes
16:14 <clarkb> I think it is ok for us to proceed now that stable-3.2 and stable-3.3 include them though
16:16 <corvus> tristanC: you can probably grep 'iptables' in system-config to find out how to open the port
16:17 <clarkb> fungi: https://review.opendev.org/c/opendev/system-config/+/803357 is another good one, then I'd like to go ahead and land https://review.opendev.org/c/opendev/system-config/+/803265 and https://review.opendev.org/c/opendev/system-config/+/803266
16:18 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
16:20 <clarkb> the internet says the air outside is breathable again. I'm going to go get some exercise now before the meeting. Back in a bit
16:23 <JayF> are there known issues with the mirrors? DIB jobs are 404'ing on a mirror: Status code: 404 for http://mirror.mtl01.inap.opendev.org/fedora/updates/32/Everything/x86_64/repodata/repomd.xml
16:25 <clarkb> JayF: I want to say ianw was cleaning up fedora 32
16:26 <clarkb> it is an eol release now
16:26 <JayF> That's incredibly unfortunate, it's hard-breaking the DIB gate
16:26 <clarkb> JayF: are there changes in DIB to remove fedora 32 testing?
16:27 <JayF> I haven't checked; I don't work on DIB usually, just trying to get a change thru
16:27 <clarkb> I suspect that ianw would be on board with removing those tests given he also cleaned up fedora 32 mirroring
16:27 <JayF> there are zero CI-related changes in the DIB pipeline
16:28 <clarkb> ah, it's just the func test that needs to be bumped forward. I expect this was unexpected fallout
16:28 <opendevreview> Merged opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure  https://review.opendev.org/c/opendev/system-config/+/803328
16:29 <JayF> I'll try to find a fix for it, just a little frustrated because the breakage of the mirror coincided almost perfectly with the mirror being broken
16:30 <gibi> hi! I don't know if it is a known issue but it seems that we don't have the bot resolving launchpad bugs in #openstack-nova
16:30 <JayF> ***coincided almost perfectly with getting +W
16:30 <JayF> lol
16:30 *** marios|ruck is now known as marios|out
16:31 *** ssbarnea[m] is now known as zbr
16:31 <fungi> gibi: that was a plugin we had in our old supybot-based meetbot implementation. it's not present in the new limnoria-based meetbot
16:37 <opendevreview> Jay Faulkner proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34  https://review.opendev.org/c/openstack/diskimage-builder/+/799341
16:37 <JayF> going to see if that is the fix it needs; that's an old patch that needed a rebase
16:41 <clarkb> I know there are/were problems with fedora 34 + debuntu rpm. That may work depending on how in depth those func tests get. I think if that fails you can set it to 33
16:41 <JayF> ack; thank you for the pointer
16:41 <clarkb> ok really popping out now. Back in a bit
17:23 <opendevreview> Merged opendev/system-config master: matrix-eavesdrop: fix initial room path creation  https://review.opendev.org/c/opendev/system-config/+/803294
17:23 <opendevreview> Merged opendev/system-config master: Update our python-base and -builder images  https://review.opendev.org/c/opendev/system-config/+/803357
17:24 <nst__> i have installed openstack using kolla-ansible and am trying to test the freezer project. after finishing the installation and trying a backup there is an error like this: "CRITICAL freezer.main [-] Critical Error: can only concatenate str (not "bytes") to str". anyone have experience with this? thank you
17:34 *** dviroel is now known as dviroel|brb
17:37 <fungi> nst__: you might have more luck asking in the #openstack-kolla channel or on the openstack-discuss@lists.openstack.org mailing list (i don't see any dedicated irc channel for the freezer project)
17:37 <fungi> #opendev is where we discuss the services which comprise the opendev collaboratory (so gerrit, zuul, etherpad/meetpad, mailman, et cetera)
17:58 <opendevreview> Merged opendev/system-config master: Use nodejs 14  https://review.opendev.org/c/opendev/system-config/+/803265
18:17 <clarkb> corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca does that indicate the merger failed to fetch and update the repo?
18:18 <corvus> clarkb: i think so.  i think that's worth a search of the uuid in the executor logs
18:19 <JayF> ianw: cgoncalves: clarkb: https://review.opendev.org/c/openstack/diskimage-builder/+/799341 appears to fix the DIB gate
18:20 <clarkb> corvus: ze07 appears to be the executor
18:20 <clarkb> corvus: 2021-08-03 18:03:24,008 ERROR zuul.ExecutorServer: [e: 6411f294b08945e1ba71d755335dbea1] [build: 70e992a5ba9c49bd8026b1bfa2a24aca] Got exception while updating repo gerrit/opendev/system-config
18:21 <clarkb> grepping by the event id doesn't seem to show more of an error unfortunately
18:22 <fungi> clarkb: the change to improve repo rename functional testing failed oddly on run-review-3.2 with a rather opaque-looking selenium error
18:23 <fungi> not sure if that's prone to nondeterminism, but it's the first time i've noticed that particular error anyway
18:24 <clarkb> fungi: ya I see you rechecked it. fwiw I'm not sure 803265 will use the new image from 803357. But that is ok, the v1.15 gitea update will get the new image. More important that we're generally updating things
18:25 <corvus> clarkb:   stderr: 'ssh: Could not resolve hostname review.opendev.org: Temporary failure in name resolution
18:25 <corvus> clarkb: (from the traceback, without the log line prefix)
18:25 <clarkb> corvus: neat! I think we still have a short ttl for that name. Let me push a change to bump it back out to an hour
18:25 <clarkb> while that won't totally fix things it should improve our ability to cache dns records
18:27 <clarkb> unfortunately I think zuul uses the python 3.8 images so we may need to reenqueue that?
18:27 <clarkb> or land another change to cause things to rebuild
18:28 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour  https://review.opendev.org/c/opendev/zone-opendev.org/+/803390
18:28 <clarkb> corvus: ^ there is the ttl bump
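(A quick way to verify the bump once 803390 deploys -- a sketch assuming dnspython >= 2.0 is available; note a caching resolver reports the remaining TTL, so expect a value counting down from 3600 rather than exactly 3600:)

```python
# Check the TTL resolvers now see for the record (assumes dnspython >= 2.0).
import dns.resolver

answer = dns.resolver.resolve("review.opendev.org", "CNAME")
print(answer.rrset.name, answer.rrset.ttl)  # should be <= 3600 after the change
```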
18:29 <clarkb> fungi: corvus: https://review.opendev.org/c/opendev/system-config/+/803366/1 should also be mergeable. Looks like the child of that one passed testing, but it likely deserves in depth review
18:30 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade  https://review.opendev.org/c/opendev/system-config/+/803231
18:30 <clarkb> And there is the v1.15.0 change rebased on top of the other cleanups
18:42 *** dviroel|brb is now known as dviroel
18:44 <opendevreview> Merged opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour  https://review.opendev.org/c/opendev/zone-opendev.org/+/803390
18:46 <clarkb> infra-root: thinking about the 3.8 promote failure a bit more, I'm worried that the builder and base image pairing being out of sync like that may be a problem for our wheel builds?
18:47 <clarkb> I guess it depends if debian buster has updated any packages since our last image build with incompatible library updates?
18:48 <clarkb> all that to say I think we should try to correct this problem if we can. Do you think reenqueuing the buildset is a good idea or should we land another change to rebuild the images and go through it all again?
18:49 <fungi> so the promote job failed and the image wasn't tagged?
18:50 <clarkb> fungi: correct, but python-builder:3.8 was tagged. This means we'll be building wheels on newer debian buster than where we install and execute those wheels
18:50 <fungi> is it python-base:3.8 we're missing then?
18:51 <clarkb> yes, that is the one that had its promote job fail
18:51 <clarkb> I think zuul and hound are the two images that use python3.8 primarily. Most things are still 3.7
18:51 <clarkb> Zuul has functional testing and would catch these issues with the images if it becomes a problem I think
18:52 <clarkb> which means we're probably fine until we manage to promote that image one way or another
18:52 <fungi> i don't think reenqueuing that change into promote should present a problem
18:52 <clarkb> fungi: it's deploy not promote
18:52 <fungi> oh, i misunderstood. thought you said python-base:3.8 wasn't tagged on dockerhub
18:53 <clarkb> fungi: yes that is correct, but the jobs run in the deploy pipeline
18:53 <fungi> i had no idea we were doing that in deploy instead of promote. interesting
18:53 <clarkb> which is where I think it gets confusing around whether or not we can reenqueue the buildset, as it will not only rerun all the promote jobs but also run infra-prod-service-refstack and infra-prod-service-gitea
18:54 <clarkb> fungi: we do that so that the images update before we do the pull, as we can't sequence across pipelines aiui
18:55 <fungi> i guess with it being a change-merged trigger, we already expect that changes could trigger out of sequence there, so at least we don't have to worry that reenqueuing an older change would roll back the latest branch state in the context of the deployment
18:56 <fungi> so probably still safe to do
18:56 <clarkb> fungi: well that is what I'm not totally sure of. I think in this specific case the refstack and gitea jobs won't do anything unexpected though, and in this specific case we should probably be ok
18:59 <clarkb> ya I think we're good to reenqueue
19:29 <tristanC> corvus: clarkb: did 803328 restart the compose, or should we add a force-recreate when the compose file changes?
19:30 <clarkb> tristanC: iirc the role does a docker-compose pull && docker-compose up -d, and since the image changed that should cause it to restart
19:30 <clarkb> basically up -d will restart if there is a new image present
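(Roughly what that amounts to -- a sketch of the sequence clarkb describes, not the actual Ansible role; the compose directory here is hypothetical:)

```python
# Pull-then-up sequence described above (hypothetical compose directory;
# the real steps live in an Ansible role in system-config).
import subprocess

compose_dir = "/etc/gerritbot-matrix-compose"  # hypothetical location
subprocess.run(["docker-compose", "pull"], cwd=compose_dir, check=True)
# "up -d" only recreates containers whose image or config changed, so
# pulling a new image tag is enough to get the bot restarted.
subprocess.run(["docker-compose", "up", "-d"], cwd=compose_dir, check=True)
```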
19:32 <tristanC> clarkb: alright, there it posted its first notification to the test channel!
19:43 <clarkb> fungi: only the hourly deploys are currently queued up if we want to enqueue the deploy jobs for https://review.opendev.org/c/opendev/system-config/+/803357
19:51 <clarkb> corvus: ^ re that, do you know of any reason it wouldn't be safe to rerun the docker image promote jobs?
19:53 <corvus> clarkb: it could "rewind" production?  do you have a build/buildset link?
19:53 <corvus> we can see what jobs will run
19:53 <clarkb> corvus: the buildset for the name resolution failure I linked to previously
19:53 <clarkb> corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca
19:54 <clarkb> corvus: as described above I'm slightly worried that having a newer python-builder image than our python-base image could result in errors building wheels if there are binary differences in a way that matters
19:54 <corvus> clarkb: so it could rewind refstack and gitea -- has anything changed with those since then?
19:54 <clarkb> However, only zuul and hound would be affected so it's not super critical (as zuul testing should catch that if this is the case for zuul)
19:55 <clarkb> corvus: gitea got https://review.opendev.org/c/opendev/system-config/+/803265 applied after that change with the failure
19:55 <clarkb> corvus: when you say rewind do you mean it could rewind the docker images or just system-config on bridge or both?
19:55 <clarkb> the alternative is to land another base image update change
19:55 <clarkb> I'm happy to propose that if we want to approve that instead
19:55 <corvus> system-config and whatever the ansible there does.  i don't think that will affect the gitea server
19:55 <corvus> i think it would have to be an ansible change, not a docker image change
19:56 <clarkb> corvus: I agree I don't think it will affect the gitea server
19:56 <corvus> oh wait
19:56 <fungi> do we actually use the change in the deployment tasks though, or the branch state?
19:56 <corvus> will that retag :latest?
19:56 <clarkb> corvus: I wonder if it will
19:57 <clarkb> corvus: you know I think it may
19:57 <clarkb> because it just looks at the current change's image and says you are latest now
19:57 <corvus> if it did not delete the old tags, then it probably will.  if the old tags are deleted, then it will try and fail.
19:57 <clarkb> fungi: we use the branch state for system-config but project-config should be latest iirc
19:57 <corvus> based on our earlier experience, that's a crap shoot.  we probably can't even tell because of the dockerhub ui/api difference.
19:57 <clarkb> corvus: ya I think we should do a new change to bump the image again
19:58 <corvus> clarkb: i think that's safest and most predictable :)
19:58 <clarkb> I'll push that momentarily
19:58 <fungi> that's fine by me. it's low-effort, maybe even lower (just going to take a little longer to go through)
19:58 <fungi> i'll be around for at least a few more hours anyway
20:01 <corvus> tristanC, clarkb, fungi: and eavesdrop is working too! https://meetings.opendev.org/irclogs/%23test/latest.log.html
20:01 <fungi> oh nice
20:01 <opendevreview> Clark Boylan proposed opendev/system-config master: Update the python docker images again  https://review.opendev.org/c/opendev/system-config/+/803394
20:01 <clarkb> corvus: fungi: ianw: ^ ok that should get us sorted
20:02 <clarkb> the ttl update is in place too so dns should be much more reliable now
20:03 <fungi> i need to disappear for a bit to make dinner but should probably be back by the time that's done
20:03 <clarkb> ya I need to get lunch myself
20:03 <clarkb> ianw: https://review.opendev.org/c/opendev/system-config/+/803372 is a stack that will probably interest you as it attempts to incorporate our upstream gerrit fixes
20:04 <clarkb> ianw: but I don't think we should approve anything out of that stack until 803394 promotes properly. That way we get shiny python:3.7 images to work with
20:04 <clarkb> and now lunch
20:13 <opendevreview> Merged opendev/system-config master: Improve repo rename functional testing  https://review.opendev.org/c/opendev/system-config/+/803266
20:14 <fungi> yay, it merged on retry. that selenium error was... weird
20:23 <opendevreview> Merged openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34  https://review.opendev.org/c/openstack/diskimage-builder/+/799341
20:31 <ianw> fungi: yeah, i think it can still be a bit racy, and because of the shadow-dom walking the queries are really weird
20:35 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix identity lookup configuration  https://review.opendev.org/c/opendev/system-config/+/803396
20:35 <ianw> clarkb: thanks, responded; it's pulling in a different variables file that should hopefully stop pip.conf adding wheel mirrors
20:35 <ianw> that's the theory, anyway
20:38 <clarkb> ianw: I'm not sure that will work since the defaults define wheel_mirror
20:38 <clarkb> ianw: also we want the pypi mirror to be set, just not the wheel mirror
20:40 <ianw> i was modelling on fedora, but it's possible that's also wrong
20:41 <clarkb> ya I suspect that we may be writing an incomplete pip.conf where it is extra-index-url = and then nothing
20:41 <clarkb> or if it is something it is https:///centos-8/wheels/ or whatever the path is
20:41 <clarkb> since we don't guard against the vars being undefined or empty
20:42 <ianw> hrm, i guess https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml applies
20:43 <ianw> i think we should rework this to more clearly delineate between platforms we build wheels for and not
20:43 <ianw> a role to generate the wheel slug would also be useful, as i could re-use that in the publishing jobs
20:47 <clarkb> ianw: in the centos-8-stream "make a wheel mirror" change, what ends up doing the afs vos release?
20:48 <clarkb> oh hrm, these are just check jobs to see if we can build the wheels, not publish them
20:48 <clarkb> then when that is working we add in publication I guess
20:48 <ianw> yeah, the publish jobs are in project-config
20:48 <clarkb> +2
20:49 <ianw> that's where i need to sort it out writing to the correct afs locations
20:49 <clarkb> fungi: ianw: heads up, I have approved https://review.opendev.org/c/opendev/system-config/+/803263
20:49 <clarkb> the newlist command fix
20:54 *** dviroel is now known as dviroel|out
21:00 <tristanC> corvus: for gerritbot matrix identity lookup, we would need to run this procedure; would you mind giving it a try and letting me know if these instructions work: https://review.opendev.org/c/opendev/system-config/+/803396/1/playbooks/roles/matrix-gerritbot/README.rst
21:06 <clarkb> 803394 is waiting for one more node in check. I wonder if we've got some sort of failure or slow boot somewhere /me tries to be patient
21:19 <opendevreview> Merged opendev/system-config master: Skip notification prompt for newlist command  https://review.opendev.org/c/opendev/system-config/+/803263
21:40 <clarkb> system-config-build-image-gitea-init had its node request fulfilled over an hour ago according to the logs on nl04
21:40 <clarkb> but the job hasn't started. Is this the bad locks in the nodepool launcher situation? should I restart nodepool-launcher on nl04?
21:40 <clarkb> fungi: ^
21:40 <clarkb> I think you've debugged a few of these before
21:41 <corvus> tristanC: done and added to private hostvars.  worked fine
21:41 <clarkb> https://paste.opendev.org/show/bMM7UPkSF7g1bqrXTYYX/ is what that looks like on the scheduler and the nodepool launcher
21:42 <clarkb> I'm going to restart nl04's launcher in a couple of minutes if I don't hear objections. I suspect this is that issue
21:48 <clarkb> restarting nl04's launcher does not seem to have changed the state of that job
21:48 <clarkb> corvus: ^ fyi this seems like some sort of zk related issue in job scheduling
21:49 <clarkb> comparing the logs for the request associated with the stuck job to logs for other requests, it seems that the nodepool side believes its work to be done
21:50 <corvus> clarkb: if the scheduler never logged an update to that node request then it's probably not due to the recent zk work
21:50 <corvus> i think that's more or less the same as before
21:51 <clarkb> 2021-08-03 20:28:17,876 INFO zuul.nodepool: [e: 7e9b6a5b1e2f41989b016e4b4699346a] Request 299-0014954444 no longer exists, resubmitting
21:51 <clarkb> ok that is curious
21:51 <corvus> oh yeah there's more logs than the paste
21:51 <corvus> that does point toward recent changes
21:52 <clarkb> ya sorry I realized i could grep by the request id in the scheduler
21:52 <clarkb> nodepool does not see that request id show up in any of its logs after the first pass of request handling
21:52 <clarkb> basically the resubmitting appears to not have happened
21:53 <clarkb> I suspect 802362,1 may be in a similar state?
21:53 <corvus> it'll get a new request id
21:54 <clarkb> ah ok let me dig more then
21:54 <corvus> there's a good chance that the new request may be ignored by zuul since it also acked the original
21:54 <corvus> still reading code
21:54 <clarkb> 299-0014954458 is the updated request id. I'm looking for it now
21:55 <clarkb> that one did the same thing on nl04. It was fulfilled and unlocked so nl04 removed its request handler for it
21:57 <clarkb> the scheduler only logs this: Updating node request <NodeRequest 299-0014954458 <NodeSet ubuntu-focal [<Node 0025752729 ('ubuntu-focal',):ubuntu-focal>]>> for that request id
21:57 <clarkb> it doesn't say anything about the request going away and resubmitting like the first one
22:04 <corvus> hrm, the updated request in zk looks correct
22:04 <corvus> {"state": "fulfilled", "state_time": 1628022502.423527, "declined_by": [], "node_types": ["ubuntu-focal"], "nodes": ["0025752729", "0025752729"], "reuse": true, "requestor": "zuul02.opendev.org", "requestor_data": {"build_set_uuid": "b397fc88821b4b398469a227dfde24f7", "tenant_name": "openstack", "pipeline_name": "check", "job_name": "system-config-build-image-gitea-init"}, "provider": "ovh-gra1", "relative_priority": 0, "event_id": "7e9b6a5b1e2f41989b016e4b4699346a"}
22:05 <corvus> though i wonder why the node id shows up twice
22:05 <corvus> oh, i wonder if that's due to the re-request
22:06 <clarkb> is https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L464 sufficient to tell the scheduler where the new request is?
22:06 <corvus> clarkb: i don't think the scheduler should use that any more -- it should rely on the requestor_data in the request itself now
22:07 <clarkb> gotcha
22:08 <corvus> in fact, that may be nearly dead code at this point
22:08 <clarkb> did _updateNodeRequest(), which is the watcher for the request, possibly return false so we stopped watching it, but the scheduler didn't/doesn't know about it somehow?
22:08 <fungi> clarkb: yes, we've been seeing this off and on since at least april
22:08 <clarkb> fungi: this one seems different fwiw
22:09 <clarkb> fungi: I tried the restart and no luck. corvus is digging in above
22:09 <fungi> restarting the launcher frees the node request locks it may be holding and allows other launchers to try to satisfy them
22:10 <fungi> more recently i've noticed that after some time zuul reissues a new node request id, and i've seen the same launcher keep grabbing and sitting on it
22:10 <fungi> okay, caught up reading now
22:11 <fungi> it seems like the stuck node requests we used to see have more recently been replaced by the re-request behavior where the same launcher which was sitting on the old request also grabs and sits on the new one
22:11 <fungi> the other day i traced a log where that happened three times in sequence for the same build
22:11 <clarkb> corvus: reading the code around where I just linked and _updateNodeRequest, I notice we are reusing the request object and setting some flags on it rather than creating a new one. Is it possible some status like request.canceled is set and when the node request comes back we essentially ignore it
22:12 <corvus> 2021-08-03 20:28:05,916 INFO kazoo.client: Zookeeper connection lost
22:12 <corvus> so it was correct to resubmit the request; the error here is that the resubmitted request did not work correctly
22:13 <clarkb> corvus: ya that is what I'm leaning towards. However I don't think it was canceled unless the zk connection does that implicitly without logging it somehow
22:13 <clarkb> neither the original request id nor the new request id logs that it was canceled
22:13 <corvus> clarkb: we shouldn't get to the point of resubmitting a request if it's canceled
22:13 <corvus> clarkb: requests are ephemeral, so zk will automatically delete it
22:19 <clarkb> corvus: I'm just going through the watcher callback and trying to rule branches out. "Request .* is unknown" doesn't show up in the log file at all so we didn't hit that branch.
22:19 <clarkb> 'Resubmitting lost node request' doesn't show up for these requests so we don't hit that branch either
22:19 <corvus> clarkb: since we saw one "Updating node request" from the scheduler, but nothing after that, it does sort of seem like the callback may have returned false
22:20 <clarkb> and "Node request 123-foo FULFILLED" doesn't show up so we don't hit that branch
22:20 <clarkb> leaving the canceled branch, but I don't think we hit that either
22:20 <clarkb> corvus: it's like the watcher went away entirely
22:21 <clarkb> I don't think the request was canceled because when we cancel a request that produces other logging which doesn't show up, though we don't explicitly log for that in updateNodeRequest()
22:22 <corvus> clarkb: a tricky thing to look out for: request.uid != request.id -- so the "Request %s is unknown" log may not show up in our greps.  however, "is unknown" doesn't show up at all, so we know that didn't hit
22:23 <clarkb> corvus: yup I ended up grepping for `Request .* is unknown` for that reason
22:26 <corvus> clarkb: fwiw, i think the instigator of this incident is the event size issue that we merged fixes for yesterday
22:26 <corvus> that doesn't make this okay -- we still have a problem.  but as an aside, the underlying zk disconnect issue should already be addressed.
22:27 <clarkb> that's good
22:27 <clarkb> corvus: does it make sense to abandon and restore this change for now?
22:27 <clarkb> and have zuul retry from scratch?
22:27 <clarkb> or do we maybe want to restart on the changes from yesterday?
22:28 <clarkb> (and queue restore will reenqueue)
22:28 <corvus> clarkb: yes, but if you can hold just a bit longer that'd be great
22:28 <JayF> If I have a job running, with a bunch of `nodepool` jobs that aren't starting, would this be a similar symptom of this issue
22:28 <clarkb> corvus: yup not in a rush. I already waited a bunch :)
22:28 <JayF> this one is pretty fresh if that makes a difference to you :)
22:28 <clarkb> JayF: if it's fresh then it can just be normal delay
22:29 <JayF> ack, just thought I'd offer up an example if there was a need for one
22:29 <clarkb> JayF: if this is for your dib change I think it is waiting on docker images to build, and that is slow because arm64 images are built via qemu or something
22:29 <clarkb> I don't think that change is in a bad spot yet
22:29 <JayF> I'm just judging based off how long the other DIB change I landed this afternoon took to run them
22:29 <corvus> clarkb: i agree the only way to exit the callback returning False with no log lines is if the req was canceled (or maybe an exception would stop the watcher?)
22:29 <JayF> which is highly subjective anyway :)
22:29 <clarkb> JayF: specifically nodepool-build-image-siblings needs to succeed, then those waiting jobs will all grab the built image from there
22:29 <JayF> oh wow, that's cool, I didn't realize you could seed jobs that way
22:30 <JayF> I might go look at that config instead of doomscrolling zuul status :D
22:30 <corvus> JayF: you can also mouseover "waiting" on the status page
22:30 <corvus> zuul will tell you what it's waiting for
22:30 <JayF> :-O
22:30 <clarkb> corvus: my suspicion is that something related to the carried-over state is leaking
22:30 <JayF> how have I been using openstack/zuul for this long and never realized that! Thanks!
22:31 <clarkb> corvus: since we reuse the request object and modify it when resubmitting rather than creating a new object
22:31 <corvus> JayF: it might be because we added it 2 months ago..... but then again... it might not be.  ;)
22:31 <clarkb> corvus: and somehow that is causing updateNodeRequest() to take an unexpected path
22:32 <corvus> clarkb: but the only possible path that could be is canceled, right?
22:32 <corvus> and that's not possible without a "Canceling node request" line
22:32 <clarkb> corvus: yes and yes
22:32 <corvus> (because that's the only place canceled=True is set)
22:33 <corvus> clarkb: i think i should fire up the repl and see if i can find the request object(s)
22:33 <clarkb> corvus: ok let me know if I can help more
22:33 <clarkb> I'll hold off on changing anything about the change itself until you give the all clear
22:33 <corvus> thanks
22:36 <clarkb> JayF: you'll notice those jobs have flipped from waiting to queued. Now they are queued up and looking for nodes to run on
22:36 <corvus> the replacement request is in memory, the original is not
22:37 <clarkb> corvus: aren't they the same object?
22:37 <corvus> good point :)
22:37 <corvus> req.canceled = False
22:37 <clarkb> https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L462-L463 because of that
22:37 <corvus> the uid is '47debb1d7af84d87918de66a7d3c9649'
22:38 <clarkb> we don't seem to log that uid at all
22:38 <corvus> req.state = 'requested'
22:38 <corvus> nope, only on that one "is unknown" line.
22:39 <clarkb> if state is requested and canceled is false then we should've fallen through and returned true
22:39 <clarkb> because we know it wasn't deleted (lack of logs for deleted)
22:40 <clarkb> is it possible we've somehow missed a watcher event or the watcher is no longer watching?
22:40 <clarkb> oh wait
22:40 <clarkb> 2021-08-03 20:28:22,423 DEBUG nodepool.driver.NodeRequestHandler[nl04.opendev.org-PoolWorker.ovh-gra1-main-8e061a04c2b24871b6e337ad89b66404]: [e: 7e9b6a5b1e2f41989b016e4b4699346a] [node_request: 299-0014954458] Fulfilled node request
22:40 <clarkb> that should've set req.state to fulfilled not requested
22:41 <corvus> clarkb: right, the zk node reflects that, the object in zuul's scheduler memory says requested
22:41 <clarkb> ah
22:41 <corvus> so that tells us that the watcher didn't fire for the fulfilled event for the re-request (the logs already told us that)
22:41 <clarkb> yup
22:41 <clarkb> it did fire at some point because we got the single log line, but it didn't fire for the fulfilled state
22:42 <clarkb> and we're running out of possibilities for a return False there
22:43 <corvus> it's possible the first callback is just the kazoo datawatch firing on creation, and that we never actually got a zk watch notification
22:44 <clarkb> good point
22:44 <clarkb> I forgot it does that
22:44 <clarkb> in fact it would have to be, right? since we got the log line and it does call on creation of the watch
22:45 <corvus> i think so
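(For reference, kazoo's DataWatch does behave the way corvus describes: the callback runs once immediately when the watch is registered, and it stops receiving events if it returns False. A minimal sketch, assuming a reachable test ZooKeeper; the znode path is illustrative:)

```python
# Minimal kazoo DataWatch sketch (assumes a test ZooKeeper on localhost).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

@zk.DataWatch("/nodepool/requests/299-0014954458")  # illustrative path
def watch(data, stat):
    # Fires once at registration time, before any actual change -- which
    # would account for the single "Updating node request" log line.
    print("callback fired:", stat)
    if data is None:
        return False  # znode is gone: returning False stops the watch
```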
22:48 <corvus> interesting seeing a node request in 'requested' state with nodes in its nodeset
22:49 <corvus> (because the nodeset was updated by the callback from the previous request)
22:49 <clarkb> oh, is that a bug because we don't clear out the nodes from the request and then reuse it?
22:49 <clarkb> ya
22:49 <clarkb> and then when it went back to nl04 it gave the same node back again
22:51 <clarkb> JayF: your change is running builds for all jobs now. It should end up being fine
22:51 <corvus> the zk data that it has (from the original request) only has one node id (which is correct).  the current data in zk has 2.
22:51 <corvus> i suspect that's a second (minor?) bug
22:52 <corvus> that's probably something we need to clear out but don't.
22:52 <corvus> fwiw, i think if we can drop the nodepool.requests dict, we can use fresh noderequest objects
22:53 <JayF> clarkb: I see that, thank you :D
23:04 <clarkb> corvus: thinking out loud a bit more. Is it possible that because we lost the zk connection there was some associated watcher cleanup on the scheduler side, but since we submitted a new request we got the initial call for a watch against a dead watch again?
23:04 <clarkb> I think that wouldn't be the case because from the zk perspective the node request did change
23:05 <clarkb> since the node request id in zk is the path entry
23:05 <corvus> clarkb: i've been thinking along similar lines, but yeah, from zk/kazoo they're just operating on paths, so it looks like 2 different paths, nodes, watchers
23:05 <corvus> (it's just that the watcher closures happen to have the same first argument)
23:05 <clarkb> yup
23:06 <corvus> now that does mean that a zombie watch callback for the first path would affect the shared request object, but we don't have a log entry indicating that
23:07 <clarkb> and it is an ephemeral node so in theory zk cleaned that up before we reconnected
23:07 <clarkb> right?
23:07 <corvus> yep
23:07 <clarkb> it would be weird if ephemeral nodes could make it across connections in that way
23:07 <corvus> maybe we should ask zk what watches it has
23:08 <corvus> that's going to be a bit longer than it used to be
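(ZooKeeper's four-letter admin commands can answer this: wchp lists watches grouped by znode path. A sketch -- wchp has to be whitelisted via 4lw.commands.whitelist on the server, and the host/port here are assumptions:)

```python
# Query ZooKeeper's "wchp" four-letter command to list watches by path.
import socket

def four_letter_word(cmd: str, host: str = "127.0.0.1", port: int = 2181) -> str:
    with socket.create_connection((host, port)) as sock:
        sock.sendall(cmd.encode())
        chunks = []
        while data := sock.recv(4096):  # server closes the connection when done
            chunks.append(data)
    return b"".join(chunks).decode()

print(four_letter_word("wchp"))  # watches grouped by znode path
```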
23:11 <corvus> nl01, nl02, and two sessions from nl04 hold watches on that path
23:12 <clarkb> but none from the scheduler
23:12 <clarkb> which would explain the behavior we see but not explain how or why it went wrong
23:12 <corvus> appears to be the case
23:12 <corvus> yep
23:13 <clarkb> nl01 and nl02 both yielded the request iirc
23:13 <opendevreview> Merged openstack/diskimage-builder master: Permit specification of extra bootstrap packages  https://review.opendev.org/c/openstack/diskimage-builder/+/802592
23:13 <clarkb> but nl03 didn't
23:17 <corvus> 2021-08-03 20:28:45,369 ERROR kazoo.recipe.watchers: list index out of range
23:17 <corvus> 2021-08-03 20:28:45,374 ERROR kazoo.handlers.threading: Exception in worker queue thread
23:18 <corvus>   File "/usr/local/lib/python3.8/site-packages/zuul/zk/nodepool.py", line 503, in updateNodeRequest
23:18 <corvus>     request_nodes[i].id = nodeid
23:18 <corvus> i think it's actually the mismatch in node lengths
23:18 <clarkb> oh neat
23:18 <clarkb> I guess the obvious weird thing was the thing to look at
23:18 <corvus> it looks like an exception there will cause it to stop watching
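(Putting the pieces together, a toy reproduction of the suspected failure mode, with simplified stand-ins rather than zuul's actual classes: the resubmitted request reuses the old object, whose node list still holds the node from the first fulfillment; when the re-request's znode reports the node id twice, the callback indexes past the stale list, raises, and the watch stops delivering events:)

```python
# Toy stand-ins for zuul's NodeRequest and the updateNodeRequest callback.
class Node:
    def __init__(self):
        self.id = None

class NodeRequest:
    def __init__(self):
        # carried-over state: one node left from the original fulfillment
        self.nodes = [Node()]

def update_node_request(request, zk_node_ids):
    # mirrors zuul/zk/nodepool.py line 503: copy ids onto the existing list
    for i, nodeid in enumerate(zk_node_ids):
        request.nodes[i].id = nodeid  # IndexError when zk lists more ids

req = NodeRequest()
try:
    # the re-request's znode ended up with the node id listed twice
    update_node_request(req, ["0025752729", "0025752729"])
except IndexError:
    print("callback raised -> watch stops -> the request looks stuck")
```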
23:20 <corvus> clarkb: so i think your hunch of "we're not clearing something out" was right -- and the thing we're not clearing out is the node list
23:20 <clarkb> that method gets attached to the callback we were looking at previously
23:20 <clarkb> what isn't clear to me is why the callback raising would cause the watch to go away? I thought you had to return False for that
23:22 <clarkb> in any case that should be an easy fix. We need to more aggressively clean up the previous state?
23:22 <corvus> clarkb: yeah, me neither.  i think we'll need to check the kazoo watch callback to fully understand that (i think it's outside the scope of the datawatch)
23:23 <corvus> clarkb: yes -- though i think that avoiding re-use of that object will solve this problem, and we can do that if we drop the internal dict
23:23 <corvus> if that's not a trivial change, then yeah, let's just clear out the node list for now
23:23 <corvus> i will finish up the work i have going on the merger then do that
23:24 <clarkb> corvus: ok, am I good to abandon and restore?
23:24 <corvus> clarkb: one sec
23:24 <clarkb> looking at the code it appears we use that dict to count things for statsd
23:25 <clarkb> but otherwise it's just internal accounting that we maybe don't need anymore
23:25 <corvus> oh yep, it's the current request length
23:26 <clarkb> corvus: we could use a simple counter for that though?
23:26 <clarkb> increment and decrement instead of adding and removing from a dict
23:26 <clarkb> and that integer won't drag state along with it and cause unexpected things to happen
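(A sketch of that idea -- a plain thread-safe counter for the statsd accounting instead of a dict that keeps NodeRequest objects alive; names here are illustrative, not zuul's actual attributes:)

```python
# Illustrative replacement for the internal requests dict (hypothetical names).
import threading

class OutstandingRequestCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def inc(self):
        with self._lock:
            self._count += 1

    def dec(self):
        with self._lock:
            self._count -= 1

    def value(self):
        with self._lock:
            return self._count
```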
23:27 <clarkb> Anyway I'll stand by on abandon/restore until you give the go ahead
23:27 <corvus> clarkb: okay, i cleaned up the repl and shut it off.  feel free to proceed
23:28 <clarkb> proceeding, thanks
23:28 <clarkb> there are a couple other changes that exhibit similar behavior that I will do this to as well
23:30 <clarkb> oh ha, I cannot abandon and restore in some repos because we changed how admin works
23:32 <fungi> so you don't think restarting the launcher holding those node requests will work this time? it seemed to address the instances i saw last week, some of which had multiple reissued requests
23:33 <corvus> in this case, the launchers are not holding the node request.  it's done with it.
23:33 <clarkb> fungi: I did restart the launcher and it did not help
23:33 <corvus> nothing will cause zuul to see an update to this particular node request
23:33 <clarkb> doing that and not changing the situation is what led to me doing the deeper dive debug above
23:33 <fungi> ahh, okay
23:34 <fungi> and zuul dequeue won't do it either?
23:35 <clarkb> fungi: that will probably work. It's easier to click buttons in the gerrit web ui though :P We can do dequeue/enqueue for the starlingx/ha change I guess
23:35 <clarkb> but my ssh keys are telling me I shouldn't :P
23:47 <timburke_> huh. anything change with paste.opendev.org lately? i just made https://paste.opendev.org/show/807865/, but it doesn't show the content -- yet when i go to https://paste.opendev.org/raw/807865/, it's all there :-/
23:59 <Clark[m]> timburke_: the server was upgraded and the db server changed as a result
23:59 <Clark[m]> I wonder if this is a weird interaction resulting from that
