Friday, 2021-04-02

*** mlavalle has quit IRC		00:03
*** hamalq has quit IRC		00:47
*** gothicserpent has joined #opendev		00:54
*** gothicserpent has quit IRC		01:04
*** auristor has joined #opendev		01:05
*** gothicserpent has joined #opendev		01:10
*** osmanlicilegi has joined #opendev		01:17
*** osmanlicilegi has quit IRC		01:17
*** osmanlicilegi has joined #opendev		01:18
*** osmanlicilegi has quit IRC		01:18
*** osmanlicilegi has joined #opendev		01:19
*** osmanlicilegi has quit IRC		01:19
*** osmanlicilegi has joined #opendev		01:20
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro https://review.opendev.org/c/openstack/diskimage-builder/+/784363	01:25
*** sshnaidm|afk is now known as sshnaidm|off		01:27
fungi	clarkb: sorry, stepped away for dinner, but nah even if those release note publication jobs were lost they'd just get updated on the next tag. worst case i can reenqueue any which didn't get new tags soon	03:09
fungi	and looks like they cleared up after the semaphore cleanup anyway	03:10
*** ykarel|away has joined #opendev		03:29
*** ykarel_ has joined #opendev		03:40
*** ykarel|away has quit IRC		03:44
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro https://review.opendev.org/c/openstack/diskimage-builder/+/784363	03:59
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Fix centos stream set mirror https://review.opendev.org/c/openstack/diskimage-builder/+/784530	04:28
*** auristor has quit IRC		04:56
*** marios has joined #opendev		04:56
*** sboyron has joined #opendev		06:06
*** ykarel__ has joined #opendev		06:32
openstackgerrit	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job https://review.opendev.org/c/openstack/diskimage-builder/+/783790	06:32
*** ykarel_ has quit IRC		06:34
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Fix centos stream set mirror https://review.opendev.org/c/openstack/diskimage-builder/+/784530	06:57
*** sboyron has quit IRC		07:27
*** sboyron has joined #opendev		07:34
*** hashar has joined #opendev		07:44
*** slaweq_ has quit IRC		07:55
*** slaweq has joined #opendev		07:57
*** ykarel_ has joined #opendev		08:06
*** ykarel__ has quit IRC		08:09
*** tosky has joined #opendev		08:09
*** slaweq has quit IRC		08:18
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Add openEuler distro element https://review.opendev.org/c/openstack/diskimage-builder/+/784550	09:02
openstackgerrit	xinliang proposed openstack/diskimage-builder master: Add openEuler distro element https://review.opendev.org/c/openstack/diskimage-builder/+/784550	09:07
*** ysandeep|away is now known as ysandeep|holiday		09:32
*** ralonsoh has joined #opendev		09:48
*** lpetrut has joined #opendev		09:51
*** ykarel_ has quit IRC		10:15
*** hashar has quit IRC		10:22
*** ykarel has joined #opendev		11:17
*** slaweq has joined #opendev		11:22
*** ralonsoh has quit IRC		11:47
*** slaweq has quit IRC		11:57
*** ykarel has quit IRC		12:05
*** auristor has joined #opendev		12:48
*** hashar has joined #opendev		13:31
fungi	looking at the hate status, i think we may have some leaked node request locks again, will see if i can track them back to a specific launcher and restart it to free them	13:40
fungi	s/hate/gate/	13:40
*** whoami-rajat has joined #opendev		14:01
*** tosky has quit IRC		14:01
*** lpetrut has quit IRC		14:21
yoctozepto	😂	14:21
yoctozepto	anyhow, any idea why https://review.opendev.org/c/openstack/kolla/+/781130/9#message-58b226ad25103f40d51935ddc85c11fa70e0e494 re-triggered the checks? did I use a magic word or..? :D	14:22
fungi	#status log Restarted the nodepool-launcher container on nl02.opendev.org to free stuck node request locks	14:22
fungi	that seems to have gotten node assignments for most of them now	14:23
fungi	still a few waiting centos-8-arm64 node requests, but i think that may be due to a different problem	14:24
fungi	looks like we've lost openstackstatus too, will get it going again	14:24
fungi	yoctozepto: nothing jumps out at me in the change comment which would have caused a reenqueue into the check pipeline. i'll have to check the zuul scheduler log to see what the trigger was, will do once i get statusbot back	14:27
yoctozepto	fungi: thanks, take your time :-)	14:27
fungi	2021-03-30 22:41:22 <-- openstackstatus (~openstack@eavesdrop01.openstack.org) has quit (Ping timeout: 265 seconds)	14:28
fungi	corvus: i expect that ^ is why your status log never made it to the wiki	14:28
fungi	2021-03-30 22:37:14,818 DEBUG irc.client: _dispatcher: quit	14:29
fungi	that was in the debug log, no indication prior to that why it quit though	14:29
*** artom has quit IRC		14:29
fungi	nor why it didn't reconnect	14:30
*** openstackstatus has joined #opendev		14:30
*** ChanServ sets mode: +v openstackstatus		14:30
fungi	#status log Restarted statusbot after it never returned from a 2021-03-30 22:41:22 UTC connection timeout	14:31
openstackstatus	fungi: finished logging	14:31
fungi	#status log Restarted the nodepool-launcher container on nl02.opendev.org to free stuck node request locks	14:31
openstackstatus	fungi: finished logging	14:31
fungi	yoctozepto: the best i can determine looking at the scheduler log for that comment event is that because the change had code-review +2 and workflow +1 and a verified of either -1 or -2 from the zuul user, that was considered reason to enqueue it into the check pipeline	14:41
yoctozepto	fungi: odd, so then I don't need to write "recheck" when that happens and go straight to swearing... roger that!	14:42
fungi	which i agree doesn't match my expectation, but maybe i've just never noticed	14:42
yoctozepto	I have neither	14:42
fungi	it's event 0016fdc0aa7947e2b72b2eb24c4b1e68 in the debug log if any other root sysadmin wants to double-check my assessment	14:44
clarkb	I think that is related to how gerrit previously didn't send existing votes on events but then you would have to do two comments, one to remove approval and another to readd it to reapprove things	14:46
clarkb	which was a regression in gerrit as it didn't do that previously and they undid it	14:46
clarkb	if someone else had commented it wouldn't have enqueued	14:46
clarkb	(and when you comment it shows that you are still checking +2 +1 below your text)	14:46
fungi	oh, good point, the comment was from an account which had supplied the workflow +1 vote	14:47
fungi	so yes, i agree this likely changed in ~november	14:47
fungi	yoctozepto: ^ plausible theory	14:47
yoctozepto	clarkb, fungi: thanks! makes sense	14:49
*** slaweq has joined #opendev		15:05
openstackgerrit	Jeremy Stanley proposed opendev/system-config master: Revert "Temporarily serve tarballs site from AFS R+W vols" https://review.opendev.org/c/opendev/system-config/+/784596	15:14
clarkb	fungi: I +2'd ^ but we can probably go ahead and +A too	15:15
fungi	clarkb: yeah, that's fine	15:17
fungi	looks like something may have changed with the centos-8-arm64 images that they're no longer booting. that's the reason for the remaining stuck changes/queued builds	15:17
clarkb	done	15:17
clarkb	catching up on the gerrit account cleanup process: what I did last time was to set accounts inactive for a few days to try and catch any issues before cleaning the external ids.	15:20
clarkb	I think that is still a good idea so I'll go through my notes on review and produce a list of accounts to set inactive today and do that, then next week clean their external ids	15:21
fungi	sounds good	15:23
*** mlavalle has joined #opendev		15:43
clarkb	hrm in double checking my lists I've noticed that some of the entries in the later lists are maybe not the best idea to clean up (in particular for some of them it seems their other account might be a better option). I think I may manually scrape this a bit more to see which look safe and only do that subset	15:45
*** d34dh0r53 has quit IRC		15:45
clarkb	some are definitely the right choice because the other account has been actively used	15:45
clarkb	but in some cases the other account will have an invalid openid or similar	15:45
corvus	i'm going to restart zuul now with the memleak fix	15:51
*** marios is now known as marios|out		15:54
corvus	#status log restarted all of zuul on commit 991d8280ac54d22a8cd3ff545d3a5e9a2df76c4b to fix memory leak	15:55
openstackstatus	corvus: finished logging	15:55
clarkb	thanks	15:59
corvus	starting re-enqueue	16:01
corvus	done	16:03
*** marios|out has quit IRC		16:04
corvus	also, the enqueue-ref commands are happy with fully qualified names now, so we don't need to worry about fixing up the enqueue script anymore (cc infra-root)	16:04
fungi	corvus: oh, excellent!	16:05
fungi	does it also handle timer triggered pipelines now?	16:05
fungi	(no more commit 0?)	16:05
fungi	ahh, nope, those are still showing up with 0000000 instead of refs/heads/master	16:06
corvus	that's probably because of this: + zuul enqueue-ref --tenant openstack --pipeline periodic --project opendev.org/openstack/requirements --ref refs/heads/master --newrev 0000000000000000000000000000000000000000	16:06
fungi	oh! those are lingering from the previous restart i bet	16:07
corvus	think it was in there the whole time?	16:07
fungi	due to the aforementioned inability to boot centos-8-arm64 nodes	16:07
corvus	ah ok	16:07
fungi	they were in the queue waiting for centos-8-arm64 nodes which never arrive	16:08
corvus	yeah, so we could just be carrying that queue entry around for days	16:08
fungi	the opendev-prod-hourly items got reenqueued with actual branches instead	16:08
corvus	that command just looked like + zuul enqueue-ref --tenant openstack --pipeline opendev-prod-hourly --project opendev.org/opendev/system-config --ref refs/heads/master	16:08
corvus	biab	16:09
fungi	also i tried but failed to dequeue those 0 ref items, not sure if i just got the parameters wrong or zuul-dequeue can't actually handle them	16:09
fungi	ha, fixed, specifying the branch those were supposed to be worked	16:14
clarkb	fungi: I'm going through the list again and finding an account here and there that I can clean up too so thats a bonus	16:15
clarkb	we'll have an email address with 4 conflicting accounts and all but one will be active or similar	16:15
fungi	i found an actual contributor to wallaby (not the proposal bot account) with no preferred email: 32673	16:16
fungi	not sure how that happens	16:16
fungi	seems like gerrit wouldn't allow you to autocreate an account via openid without an associated address, and i'm pretty sure it won't let you remove the preferred address without setting a different address preferred first	16:17
fungi	so probably related to one of the conflicting account tangles	16:18
zbr	fungi: two timeouts in a row with tempest-full on https://review.opendev.org/c/openstack/pbr/+/780633 -- time to increase another job timeout value?	16:19
clarkb	fungi: is the account active? if so then it wasn't caught by my cleanups	16:21
clarkb	since the removal of a preferred email address to fix inconsistency was always paired with deactivating the account	16:21
fungi	clarkb: yes, i expect it is. though it's possible it was associated with a change started long ago which only just merged in recent months	16:22
fungi	zbr: i thought i +2'd the test timeout increase already	16:22
fungi	zbr: though that looks like a job timeout not a test timeout	16:22
zbr	that is in a totally different place	16:23
fungi	right	16:23
zbr	imho, i would drop support for old pythons but that library is quite low level. what if an emergency patch would be needed for EOL python	16:25
*** hamalq has joined #opendev		16:27
fungi	the main problem there is that pbr is a setup-requires and so can't be effectively capped during package installation, especially where older pip/setuptools may still be in use	16:31
*** smcginnis has joined #opendev		16:33
*** auristor has quit IRC		16:41
clarkb	right you should fork pbr if you want to stop supporting very old things	17:03
clarkb	zbr: we can continue to test it on eol python as long as distros have that eol python	17:03
clarkb	zbr: once the distros drop the old python versions we tend not to care anymore and at that point it would probably be ok to consider how to make pbr python3 only or similar	17:04
clarkb	the proper fix for this is pyproject.toml and friends	17:04
clarkb	but openstack hasn't started to shift to anything like that and that prevents you from specifying versions of setup requires properly	17:05
openstackgerrit	Merged opendev/system-config master: Revert "Temporarily serve tarballs site from AFS R+W vols" https://review.opendev.org/c/opendev/system-config/+/784596	17:08
*** auristor has joined #opendev		17:10
openstackgerrit	Merged opendev/system-config master: Have audit-users.py write out serialized data https://review.opendev.org/c/opendev/system-config/+/780663	17:19
mordred	there are also non-openstack users of pbr	17:21
fungi	yep, i even use it in a non-openstack project	17:22
* mordred uses it in all python projects he can, regardless of openstack association		17:23
clarkb	ya I was using that as an indication that you can't rely on it	17:28
clarkb	not that if openstack does it then you can switch	17:28
clarkb	more of a "if even openstack does it then good luck" vs "we expect users to do this"	17:28
clarkb	*does not do it	17:28
*** whoami-rajat has quit IRC		17:31
*** hashar has quit IRC		18:12
clarkb	ok I've gone through and double checked everything and ended up with a list of 225 accounts from the list that fungi reviewed that I will retire now	18:32
clarkb	this step sets the account inactive and removes its preferred email in preparation for removing the conflicting external ids from it later	18:32
clarkb	I'll probably do the next step mid week next week	18:32
*** Alex_Gaynor has joined #opendev		18:45
fungi	awesome, thanks for working through those	18:45
Alex_Gaynor	I'm seeing a bunch of jobs that are in queue'd status for extended periods of time: https://zuul.opendev.org/t/pyca/status/ (all arm64 jobs it looks like), is there a known issue?	18:46
fungi	Alex_Gaynor: something seems to have happened in centos-8 such that our latest images for arm64 aren't booting (not sure what the situation with the bionic job is there but i'll check on that too)	18:48
Alex_Gaynor	fungi: 🙇‍♂️	18:49
fungi	going to try to roll back the centos-8-arm64 image to yesterday's copy and see if that's working	18:52
fungi	i've put nb03 in the emergency disable list so i can set the centos-8-arm64 image to paused and not have it rolled back by ansible	18:54
fungi	clarkb: the builder re-reads nodepool configuration between each image build, right?	18:57
fungi	#status log Deleted diskimage centos-8-arm64-0000036820 on nb03.opendev.org in order to roll back to the previous centos-8-arm64-0000036819 because of repeated boot failures with the newer image	18:59
openstackstatus	fungi: finished logging	18:59
fungi	the builder doesn't seem to have picked up the pause in the config (immediately began trying to build a replacement), so i'm guessing it needs to be restarted for that after all	19:00
fungi	no dice. it started building another replacement after the restart too	19:02
clarkb	fungi: yes it does reread	19:02
fungi	my fault, i think i paused the image in the provider section	19:03
fungi	oh, we've actually got an image-pause option for the cli now, i keep forgetting that	19:04
*** mailingsam has joined #opendev		19:05
fungi	console log from a centos-8-arm64 node which is currently in the process of being launched shows it's entering a dracut emergency shell after dracut-initqueue registers a timeout starting initscripts	19:09
clarkb	I would've expected a more catastrophic boot failure than that given some of the issues we have had before	19:19
clarkb	like incomplete image in the cloud	19:19
*** tosky has joined #opendev		19:20
fungi	we're finally reaching the boot output on the console from the first node booted from the rolled-back image state, i think	19:21
fungi	yeah, i see kmesg lines	19:21
fungi	seems to just sit after "[ 16.067703] cdrom: Uniform CD-ROM driver Revision: 3.20"	19:22
fungi	something's probably timing out	19:22
fungi	yeah, same as the newer image... it's dropping to a root shell prompt	19:29
fungi	i can connect to the console url, but it just says "guest disabled display" so i've initiated a ctrl-alt-del from it to see if i can get access when it reboots	19:31
fungi	i have a feeling this is also going to need someone with a better understanding of centos and probably trying to locally boot the image	19:32
fungi	yeah, seems like novnc isn't quite working	19:37
fungi	scratch that, novnc is working for an ubuntu-bionic-arm64 node i just tried	19:39
fungi	clarkb: any alternative ideas for debugging this?	19:58
fungi	since yesterday's image is exhibiting the same problem, i'm going to undo the pause	19:59
fungi	guess i'll see if centos-8-stream-arm64 is similarly broken	20:03
fungi	those seem to be broken in a different way, they're just going straight to deleting status i think?	20:06
*** hamalq has quit IRC		20:08
*** hamalq has joined #opendev		20:08
clarkb	straight to deleting usually means there is a cloud side error status	20:10
clarkb	you can check the instance details in nova before it gets deleted to see that or see if nodepool is able to bubble it up	20:10
clarkb	fungi: I would've suspected uefi at first but this seems to get far enough to do more than uefi	20:10
clarkb	maybe some x86 packages end up in there?	20:11
clarkb	I would try a rebuild and see if that helps	20:11
*** mailingsam has quit IRC		20:11
fungi	rebuild of the image?	20:12
clarkb	ya	20:20
clarkb	because if it was an upload problem or similar then a rebuild should correct it.	20:20
fungi	oh, of the stream image?	20:21
fungi	also the perpetually queued pyca-cryptography-ubuntu-bionic-py36-arm64 build turned out to be another leaked node request lock. i restarted the nodepool-launcher container on nl03 and it got a node assigned	20:22
mordred	something something driver something config-drive?	20:22
mordred	(guessing in the dark because of the cdrom line)	20:22
fungi	previously nl03 accepted the request and failed three times to build it, but apparently never released the lock on the request in zk	20:23
mordred	iirc config-drive presents as a cdrom	20:23
fungi	mordred: oh good idea	20:23
fungi	yeah could be something broken with configdrive	20:23
mordred	maybe something changed with how devices manifest - and we're trying to mount the wrong thing	20:23
mordred	now - how to debug that is a whole other question	20:24
fungi	i'm watching a centos-8-stream-arm64 build in progress now, it's not insta-delete after all	20:24
fungi	but again seems to be spending a lot of time after the cd-rom driver kmesg	20:24
fungi	yeah, confirmed, the stream images are booting to the same state as regular centos-8 actually	20:26
fungi	so something has happened to break booting for both centos-8 and centos-8-stream	20:27
fungi	seemingly in the same way	20:27
fungi	maybe a recent change in dib?	20:27
clarkb	ok 224 accounts "retired". I ended up reverting the retirement of one account because I looked at it again and decided it wasn't clear that it wasn't the currently used account due to it being created relatively recently	20:28
mordred	Merge "Change paths for bootloader files in iso element"	20:28
clarkb	I'll upload the log to review then rerun the audit script next	20:28
mordred	was 2 weeks ago	20:28
clarkb	mordred: fungi is that in a release?	20:28
clarkb	we consume dib via releases	20:29
mordred	oh - good question	20:29
mordred	no	20:29
mordred	and latest release is 2 months ago	20:29
clarkb	ya and we shouldn't use the iso element	20:30
clarkb	not having a recent release indicates to me an issue in packages (either on our nodepool image that we rely on or in the upstream distro)	20:31
mordred	any idea if centos8 recently released a point release?	20:31
mordred	those are rather notorious	20:31
clarkb	I don't know	20:32
clarkb	the log for the 224 account retirements is on review now. I'm going to work on getting the audit script running next	20:32
mordred	CentOS-8-updates-20210324.0 updated 3/24 according to the timestamp on the COMPOSE_ID file	20:32
mordred	and, I suppose, the contents :)	20:33
*** slaweq has quit IRC		20:36
fungi	oh, also i'm currently sidetracked by making dinner	20:37
mordred	dinner is potentially more tasty than config-drive	20:38
clarkb	I've got an audit running now, it should show a couple of hundred accounts ready to have external ids cleaned up when done	20:45
clarkb	mordred: fungi: config drive can be iso format or fat32	20:46
clarkb	I'm not sure which the linaro cloud has selected	20:46
* clarkb goes to see		20:47
clarkb	it appears to be present as /dev/sr0 on ubuntu so likely an iso	20:50
clarkb	is there possibly a regression in the cdrom drivers on arm for centos 8?	20:51
clarkb	we could ask kevin to switch us to fat32	20:51
clarkb	(part of the struggle here is going to be none of us have arm64 hardware that i know of but maybe we do)	20:51
fungi	i have several bits of arm64 hardware, but not beefy enough to boot virtual machines on	20:52
clarkb	and lsblk -f confirms it is an iso9660 fstype	20:52
clarkb	based on what you report the console does my hunch is that something related to probing the /dev/sr0 device causes things to hold up	20:53
clarkb	probably the quickest solution to that is to have the cloud use fat32 config drives	20:53
clarkb	https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.config_drive_format the mystery deepens, apparently vfat is deprecated and may be removed :(	20:55
clarkb	there is a new aarch64 kernel in centos-8 as of march 16, though I'm not sure when it was published to the package list as that may just be file creation time?	20:58
clarkb	I doubt we are hitting something like https://bugs.centos.org/view.php?id=16801 because you wouldn't expect it to get as far as it does in the console if so	20:59
fungi	nah, the kernel is running, problem seems to manifest in userspace after init starts	21:01
clarkb	is it possibly a problem with glean?	21:03
clarkb	there have been a few recent changes to help make ironic testing easier	21:03
clarkb	maybe they side effected in unexpected ways on arm64?	21:03
clarkb	https://bugs.centos.org/view.php?id=17816 is a really fun one but unrelated to our problem (I almost want to suggest dib except our images don't work right now :) )	21:04
clarkb	fungi: to test if it is glean what we can do is make a new image out of band with baked in dhcp and accept-ra configs	21:04
clarkb	then boot that and see if it works	21:04
clarkb	usually when I did this years ago I built the smallest image I possibly could	21:05
clarkb	as it speeds up rtt and reduces things that can interfere	21:05
fungi	glean echoes stuff to the console though right?	21:09
fungi	or has systemd made that challenging?	21:10
clarkb	it should via systemd iirc	21:10
clarkb	there is a flag you set on the unit to redirect it there /me checks	21:10
clarkb	https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.service#L19	21:11
clarkb	and centos uses glean-nm	21:11
clarkb	we should add that directive to https://opendev.org/opendev/glean/src/branch/master/glean/init/glean@.service as well	21:11
clarkb	fungi: maybe we should try boot https://cloud.centos.org/centos/8/aarch64/images/CentOS-8-GenericCloud-8.3.2011-20201204.2.aarch64.qcow2 out of band too?	21:12
clarkb	that may help us narrow down the problem?	21:13
clarkb	in particular that image is older than the recent kernel update	21:13
fungi	there's an instance booted in there from october named "test" which can probably be cleaned up	21:17
*** sboyron has quit IRC		21:21
clarkb	audit completed and shows 220 accounts that are inactive and ready for external id cleanup (that is about right because not every one of the 224 I did will have been part of a pair)	22:52
clarkb	the audit is now up on review as well under gerrit_user_cleanups/external_id_conflict_classifications.20210402 in my homedir	22:59
*** tosky has quit IRC		23:19
