Friday, 2021-04-02

00:03 *** mlavalle has quit IRC
00:47 *** hamalq has quit IRC
00:54 *** gothicserpent has joined #opendev
01:04 *** gothicserpent has quit IRC
01:05 *** auristor has joined #opendev
01:10 *** gothicserpent has joined #opendev
01:17 *** osmanlicilegi has joined #opendev
01:17 *** osmanlicilegi has quit IRC
01:18 *** osmanlicilegi has joined #opendev
01:18 *** osmanlicilegi has quit IRC
01:19 *** osmanlicilegi has joined #opendev
01:19 *** osmanlicilegi has quit IRC
01:20 *** osmanlicilegi has joined #opendev
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro
01:27 *** sshnaidm|afk is now known as sshnaidm|off
03:09 <fungi> clarkb: sorry, stepped away for dinner, but nah even if those release note publication jobs were lost they'd just get updated on the next tag. worst case i can reenqueue any which didn't get new tags soon
03:10 <fungi> and looks like they cleared up after the semaphore cleanup anyway
03:29 *** ykarel|away has joined #opendev
03:40 *** ykarel_ has joined #opendev
03:44 *** ykarel|away has quit IRC
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Fix centos stream set mirror
04:56 *** auristor has quit IRC
04:56 *** marios has joined #opendev
06:06 *** sboyron has joined #opendev
06:32 *** ykarel__ has joined #opendev
<openstackgerrit> Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job
06:34 *** ykarel_ has quit IRC
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Fix centos stream set mirror
07:27 *** sboyron has quit IRC
07:34 *** sboyron has joined #opendev
07:44 *** hashar has joined #opendev
07:55 *** slaweq_ has quit IRC
07:57 *** slaweq has joined #opendev
08:06 *** ykarel_ has joined #opendev
08:09 *** ykarel__ has quit IRC
08:09 *** tosky has joined #opendev
08:18 *** slaweq has quit IRC
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Add openEuler distro element
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Add openEuler distro element
09:32 *** ysandeep|away is now known as ysandeep|holiday
09:48 *** ralonsoh has joined #opendev
09:51 *** lpetrut has joined #opendev
10:15 *** ykarel_ has quit IRC
10:22 *** hashar has quit IRC
11:17 *** ykarel has joined #opendev
11:22 *** slaweq has joined #opendev
11:47 *** ralonsoh has quit IRC
11:57 *** slaweq has quit IRC
12:05 *** ykarel has quit IRC
12:48 *** auristor has joined #opendev
13:31 *** hashar has joined #opendev
13:40 <fungi> looking at the gate status, i think we may have some leaked node request locks again, will see if i can track them back to a specific launcher and restart it to free them
14:01 *** whoami-rajat has joined #opendev
14:01 *** tosky has quit IRC
14:21 *** lpetrut has quit IRC
14:22 <yoctozepto> anyhow, any idea why re-triggered the checks? did I use a magic word or..? :D
14:22 <fungi> #status log Restarted the nodepool-launcher container on to free stuck node request locks
14:23 <fungi> that seems to have gotten node assignments for most of them now
14:24 <fungi> still a few waiting centos-8-arm64 node requests, but i think that may be due to a different problem
14:24 <fungi> looks like we've lost openstackstatus too, will get it going again
14:27 <fungi> yoctozepto: nothing jumps out at me in the change comment which would have caused a reenqueue into the check pipeline. i'll have to check the zuul scheduler log to see what the trigger was, will do once i get statusbot back
14:27 <yoctozepto> fungi: thanks, take your time :-)
14:28 <fungi> 2021-03-30 22:41:22     <--     openstackstatus ( has quit (Ping timeout: 265 seconds)
14:28 <fungi> corvus: i expect that ^ is why your status log never made it to the wiki
14:29 <fungi> 2021-03-30 22:37:14,818 DEBUG irc.client: _dispatcher: quit
14:29 <fungi> that was in the debug log, no indication prior to that why it quit though
14:29 *** artom has quit IRC
14:30 <fungi> nor why it didn't reconnect
14:30 *** openstackstatus has joined #opendev
14:30 *** ChanServ sets mode: +v openstackstatus
14:31 <fungi> #status log Restarted statusbot after it never returned from a 2021-03-30 22:41:22 UTC connection timeout
14:31 <openstackstatus> fungi: finished logging
14:31 <fungi> #status log Restarted the nodepool-launcher container on to free stuck node request locks
14:31 <openstackstatus> fungi: finished logging
14:41 <fungi> yoctozepto: the best i can determine looking at the scheduler log for that comment event is that because the change had code-review +2 and workflow +1 and a verified of either -1 or -2 from the zuul user, that was considered reason to enqueue it into the check pipeline
14:42 <yoctozepto> fungi: odd, so then I don't need to write "recheck" when that happens and can go straight to swearing... roger that!
14:42 <fungi> which i agree doesn't match my expectation, but maybe i've just never noticed
14:42 <yoctozepto> neither have I
14:44 <fungi> it's event 0016fdc0aa7947e2b72b2eb24c4b1e68 in the debug log if any other root sysadmin wants to double-check my assessment
14:46 <clarkb> I think that is related to how gerrit previously didn't send existing votes on events, but then you would have to do two comments, one to remove approval and another to re-add it to reapprove things
14:46 <clarkb> which was a regression in gerrit as it didn't do that previously and they undid it
14:46 <clarkb> if someone else had commented it wouldn't have enqueued
14:46 <clarkb> (and when you comment it shows that you are still checking +2 +1 below your text)
14:47 <fungi> oh, good point, the comment was from an account which had supplied the workflow +1 vote
14:47 <fungi> so yes, i agree this likely changed in ~november
14:47 <fungi> yoctozepto: ^ plausible theory
14:49 <yoctozepto> clarkb, fungi: thanks! makes sense
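[Editor's note] The trigger behavior discussed above can be modeled as a simple predicate. This is an illustrative sketch only, not Zuul's actual trigger code; the function name and the approvals mapping are invented for clarity:

```python
def should_enqueue_check(approvals):
    """Simplified model of the re-enqueue behavior described above.

    A comment-added event carries the change's current approvals. If
    Zuul's own Verified vote is negative while Code-Review +2 and
    Workflow +1 are present, the change looks like it needs retesting,
    so it gets enqueued into check even without a "recheck" comment.
    """
    return (
        approvals.get("Verified", 0) < 0
        and approvals.get("Code-Review", 0) >= 2
        and approvals.get("Workflow", 0) >= 1
    )

# The situation from the log: CR +2, W +1, and a Verified -2 from zuul
event_approvals = {"Code-Review": 2, "Workflow": 1, "Verified": -2}
```

As clarkb notes, the key change was Gerrit once again including existing votes on comment events, so the commenter's own prior +1/+2 ride along and satisfy the pipeline requirements.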
15:05 *** slaweq has joined #opendev
<openstackgerrit> Jeremy Stanley proposed opendev/system-config master: Revert "Temporarily serve tarballs site from AFS R+W vols"
15:15 <clarkb> fungi: I +2'd ^ but we can probably go ahead and +A too
15:17 <fungi> clarkb: yeah, that's fine
15:17 <fungi> looks like something may have changed with the centos-8-arm64 images such that they're no longer booting. that's the reason for the remaining stuck changes/queued builds
15:20 <clarkb> catching up on the gerrit account cleanup process: what I did last time was to set accounts inactive for a few days to try and catch any issues before cleaning the external ids.
15:21 <clarkb> I think that is still a good idea so I'll go through my notes on review and produce a list of accounts to set inactive today and do that, then next week clean their external ids
15:23 <fungi> sounds good
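[Editor's note] The two-phase cleanup clarkb describes maps onto Gerrit's REST API: deactivation first (reversible, so mistakes surface while they can still be undone), external-ID deletion later (permanent). A hedged sketch that only builds the requests for review rather than sending them; the helper name and the account id below are invented, while the endpoints are Gerrit's documented `DELETE /accounts/{id}/active` and `POST /accounts/{id}/external.ids:delete`:

```python
def retirement_requests(account_id, external_ids):
    """Build the two Gerrit REST calls for the cleanup described above.

    Phase 1 (now): deactivate the account -- reversible.
    Phase 2 (days later): delete conflicting external IDs -- permanent.
    Returns (method, path, body) tuples so they can be audited first.
    """
    base = "/a/accounts/{}".format(account_id)
    return [
        ("DELETE", base + "/active", None),
        ("POST", base + "/external.ids:delete", list(external_ids)),
    ]
```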
15:43 *** mlavalle has joined #opendev
15:45 <clarkb> hrm, in double checking my lists I've noticed that some of the entries in the later lists are maybe not the best idea to clean up (in particular for some of them it seems their other account might be a better option). I think I may manually scrape this a bit more to see which look safe and only do that subset
15:45 *** d34dh0r53 has quit IRC
15:45 <clarkb> some are definitely the right choice because the other account has been actively used
15:45 <clarkb> but in some cases the other account will have an invalid openid or similar
15:51 <corvus> i'm going to restart zuul now with the memleak fix
15:54 *** marios is now known as marios|out
15:55 <corvus> #status log restarted all of zuul on commit 991d8280ac54d22a8cd3ff545d3a5e9a2df76c4b to fix memory leak
15:55 <openstackstatus> corvus: finished logging
16:01 <corvus> starting re-enqueue
16:04 *** marios|out has quit IRC
16:04 <corvus> also, the enqueue-ref commands are happy with fully qualified names now, so we don't need to worry about fixing up the enqueue script anymore (cc infra-root)
16:05 <fungi> corvus: oh, excellent!
16:05 <fungi> does it also handle timer triggered pipelines now?
16:05 <fungi> (no more commit 0?)
16:06 <fungi> ahh, nope, those are still showing up with 0000000 instead of refs/heads/master
16:06 <corvus> that's probably because of this:  + zuul enqueue-ref --tenant openstack --pipeline periodic --project --ref refs/heads/master --newrev 0000000000000000000000000000000000000000
16:07 <fungi> oh! those are lingering from the previous restart i bet
16:07 <corvus> think it was in there the whole time?
16:07 <fungi> due to the aforementioned inability to boot centos-8-arm64 nodes
16:07 <corvus> ah ok
16:08 <fungi> they were in the queue waiting for centos-8-arm64 nodes which never arrive
16:08 <corvus> yeah, so we could just be carrying that queue entry around for days
16:08 <fungi> the opendev-prod-hourly items got reenqueued with actual branches instead
16:08 <corvus> that command just looked like + zuul enqueue-ref --tenant openstack --pipeline opendev-prod-hourly --project --ref refs/heads/master
16:09 <fungi> also i tried but failed to dequeue those 0 ref items, not sure if i just got the parameters wrong or zuul dequeue can't actually handle them
16:14 <fungi> ha, fixed, specifying the branch those were supposed to be on worked
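[Editor's note] For the record, the working commands looked roughly like the following; the project name is a placeholder here since it was elided from the pasted commands above. The key difference is addressing the stale item by its branch ref rather than the all-zeros newrev it was originally enqueued with:

```shell
# Re-enqueue a periodic (branch-tip) item using a fully qualified ref
# (project name is a placeholder, not taken from the log):
zuul enqueue-ref --tenant openstack --pipeline periodic \
    --project example/project --ref refs/heads/master

# Dequeue a stale carried-over item by naming its branch ref, instead
# of trying to match the 0000000... newrev it was enqueued with:
zuul dequeue --tenant openstack --pipeline periodic \
    --project example/project --ref refs/heads/master
```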
16:15 <clarkb> fungi: I'm going through the list again and finding an account here and there that I can clean up too, so that's a bonus
16:15 <clarkb> we'll have an email address with 4 conflicting accounts and all but one will be active or similar
16:16 <fungi> i found an actual contributor to wallaby (not the proposal bot account) with no preferred email: 32673
16:16 <fungi> not sure how that happens
16:17 <fungi> seems like gerrit wouldn't allow you to autocreate an account via openid without an associated address, and i'm pretty sure it won't let you remove the preferred address without setting a different address preferred first
16:18 <fungi> so probably related to one of the conflicting account tangles
16:19 <zbr> fungi: two timeouts in a row with tempest-full on -- time to increase another job timeout value?
16:21 <clarkb> fungi: is the account active? if so then it wasn't caught by my cleanups
16:21 <clarkb> since the removal of a preferred email address to fix inconsistency was always paired with deactivating the account
16:22 <fungi> clarkb: yes, i expect it is. though it's possible it was associated with a change started long ago which only just merged in recent months
16:22 <fungi> zbr: i thought i +2'd the test timeout increase already
16:22 <fungi> zbr: though that looks like a job timeout, not a test timeout
16:23 <zbr> that is in a totally different place
16:25 <zbr> imho, i would drop support for old pythons, but that library is quite low level. what if an emergency patch were needed for an EOL python?
16:27 *** hamalq has joined #opendev
16:31 <fungi> the main problem there is that pbr is a setup-requires and so can't be effectively capped during package installation, especially where older pip/setuptools may still be in use
16:33 *** smcginnis has joined #opendev
16:41 *** auristor has quit IRC
17:03 <clarkb> right, you should fork pbr if you want to stop supporting very old things
17:03 <clarkb> zbr: we can continue to test it on eol python as long as distros have that eol python
17:04 <clarkb> zbr: once the distros drop the old python versions we tend not to care anymore, and at that point it would probably be ok to consider how to make pbr python3 only or similar
17:04 <clarkb> the proper fix for this is pyproject.toml and friends
17:05 <clarkb> but openstack hasn't started to shift to anything like that, and that prevents you from specifying versions of setup requires properly
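[Editor's note] The pyproject.toml mechanism mentioned here (PEP 518) is what finally lets a project pin its build requirements, pbr included, in a way `setup_requires` never could: pip installs them into an isolated environment before the build runs. A minimal sketch, with illustrative version bounds:

```toml
# pyproject.toml -- pip installs these into an isolated build
# environment before running the build, so pbr can be capped here
# (the bounds below are illustrative, not a recommendation).
[build-system]
requires = ["pbr>=5.5.0", "setuptools>=40.8.0", "wheel"]
build-backend = "setuptools.build_meta"
```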
<openstackgerrit> Merged opendev/system-config master: Revert "Temporarily serve tarballs site from AFS R+W vols"
17:10 *** auristor has joined #opendev
<openstackgerrit> Merged opendev/system-config master: Have write out serialized data
17:21 <mordred> there are also non-openstack users of pbr
17:22 <fungi> yep, i even use it in a non-openstack project
17:23 * mordred uses it in all python projects he can, regardless of openstack association
17:28 <clarkb> ya, I was using that as an indication that you can't rely on it
17:28 <clarkb> not that if openstack does it then you can switch
17:28 <clarkb> more of a "if even openstack does it then good luck" vs "we expect users to do this"
17:28 <clarkb> *does not do it
17:31 *** whoami-rajat has quit IRC
18:12 *** hashar has quit IRC
18:32 <clarkb> ok, I've gone through and double checked everything and ended up with a list of 225 accounts from the list that fungi reviewed that I will retire now
18:32 <clarkb> this step sets the account inactive and removes its preferred email in preparation for removing the conflicting external ids from it later
18:32 <clarkb> I'll probably do the next step mid week next week
18:45 *** Alex_Gaynor has joined #opendev
18:45 <fungi> awesome, thanks for working through those
18:46 <Alex_Gaynor> I'm seeing a bunch of jobs that are in queued status for extended periods of time: (all arm64 jobs it looks like), is there a known issue?
18:48 <fungi> Alex_Gaynor: something seems to have happened in centos-8 such that our latest images for arm64 aren't booting (not sure what the situation with the bionic job is there but i'll check on that too)
18:49 <Alex_Gaynor> fungi: 🙇‍♂️
18:52 <fungi> going to try to roll back the centos-8-arm64 image to yesterday's copy and see if that's working
18:54 <fungi> i've put nb03 in the emergency disable list so i can set the centos-8-arm64 image to paused and not have it rolled back by ansible
18:57 <fungi> clarkb: the builder re-reads nodepool configuration between each image build, right?
18:59 <fungi> #status log Deleted diskimage centos-8-arm64-0000036820 on in order to roll back to the previous centos-8-arm64-0000036819 because of repeated boot failures with the newer image
18:59 <openstackstatus> fungi: finished logging
19:00 <fungi> the builder doesn't seem to have picked up the pause in the config (immediately began trying to build a replacement), so i'm guessing it needs to be restarted for that after all
19:02 <fungi> no dice. it started building another replacement after the restart too
19:02 <clarkb> fungi: yes, it does reread
19:03 <fungi> my fault, i think i paused the image in the provider section
19:04 <fungi> oh, we've actually got an image-pause option for the cli now, i keep forgetting that
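[Editor's note] The CLI-based sequence, approximately; the image and build names come from the messages above, but the exact argument syntax may vary by nodepool version, so treat this as a sketch:

```shell
# Pause the image via the CLI (no builder config edit or restart needed):
nodepool image-pause centos-8-arm64

# Delete the suspect build so launchers fall back to the previous one:
nodepool dib-image-delete centos-8-arm64-0000036820

# Once the underlying problem is fixed:
nodepool image-unpause centos-8-arm64
```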
19:05 *** mailingsam has joined #opendev
19:09 <fungi> console log from a centos-8-arm64 node which is currently in the process of being launched shows it's entering a dracut emergency shell after dracut-initqueue registers a timeout starting initscripts
19:19 <clarkb> I would've expected a more catastrophic boot failure than that given some of the issues we have had before
19:19 <clarkb> like an incomplete image in the cloud
19:20 *** tosky has joined #opendev
19:21 <fungi> we're finally reaching the boot output on the console from the first node booted from the rolled-back image state, i think
19:21 <fungi> yeah, i see kmesg lines
19:22 <fungi> seems to just sit after "[   16.067703] cdrom: Uniform CD-ROM driver Revision: 3.20"
19:22 <fungi> something's probably timing out
19:29 <fungi> yeah, same as the newer image... it's dropping to a root shell prompt
19:31 <fungi> i can connect to the console url, but it just says "guest disabled display" so i've initiated a ctrl-alt-del from it to see if i can get access when it reboots
19:32 <fungi> i have a feeling this is also going to need someone with a better understanding of centos and probably trying to locally boot the image
19:37 <fungi> yeah, seems like novnc isn't quite working
19:39 <fungi> scratch that, novnc is working for an ubuntu-bionic-arm64 node i just tried
19:58 <fungi> clarkb: any alternative ideas for debugging this?
19:59 <fungi> since yesterday's image is exhibiting the same problem, i'm going to undo the pause
20:03 <fungi> guess i'll see if centos-8-stream-arm64 is similarly broken
20:06 <fungi> those seem to be broken in a different way, they're just going straight to deleting status i think?
20:08 *** hamalq has quit IRC
20:08 *** hamalq has joined #opendev
20:10 <clarkb> straight to deleting usually means there is a cloud side error status
20:10 <clarkb> you can check the instance details in nova before it gets deleted to see that, or see if nodepool is able to bubble it up
20:10 <clarkb> fungi: I would've suspected uefi at first but this seems to get far enough to do more than uefi
20:11 <clarkb> maybe some x86 packages end up in there?
20:11 <clarkb> I would try a rebuild and see if that helps
20:11 *** mailingsam has quit IRC
20:12 <fungi> rebuild of the image?
20:20 <clarkb> because if it was an upload problem or similar then a rebuild should correct it.
20:21 <fungi> oh, of the stream image?
20:22 <fungi> also the perpetually queued pyca-cryptography-ubuntu-bionic-py36-arm64 build turned out to be another leaked node request lock. i restarted the nodepool-launcher container on nl03 and it got a node assigned
20:22 <mordred> something something driver something config-drive?
20:22 <mordred> (guessing in the dark because of the cdrom line)
20:23 <fungi> previously nl03 accepted the request and failed three times to build it, but apparently never released the lock on the request in zk
20:23 <mordred> iirc config-drive presents as a cdrom
20:23 <fungi> mordred: oh, good idea
20:23 <fungi> yeah, could be something broken with configdrive
20:23 <mordred> maybe something changed with how devices manifest - and we're trying to mount the wrong thing
20:24 <mordred> now - how to debug that is a whole other question
20:24 <fungi> i'm watching a centos-8-stream-arm64 build in progress now, it's not insta-delete after all
20:24 <fungi> but again seems to be spending a lot of time after the cd-rom driver kmesg
20:26 <fungi> yeah, confirmed, the stream images are booting to the same state as regular centos-8 actually
20:27 <fungi> so something has happened to break booting for both centos-8 and centos-8-stream
20:27 <fungi> seemingly in the same way
20:27 <fungi> maybe a recent change in dib?
20:28 <clarkb> ok, 224 accounts "retired". I ended up reverting the retirement of one account because I looked at it again and decided it wasn't clear that it wasn't the currently used account, due to it being created relatively recently
20:28 <mordred> Merge "Change paths for bootloader files in iso element"
20:28 <clarkb> I'll upload the log to review then rerun the audit script next
20:28 <mordred> was 2 weeks ago
20:28 <clarkb> mordred: fungi is that in a release?
20:29 <clarkb> we consume dib via releases
20:29 <mordred> oh - good question
20:29 <mordred> and latest release is 2 months ago
20:30 <clarkb> ya, and we shouldn't use the iso element
20:31 <clarkb> not having a recent release indicates to me an issue in packages (either on our nodepool image that we rely on or in the upstream distro)
20:31 <mordred> any idea if centos8 recently released a point release?
20:31 <mordred> those are rather notorious
20:32 <clarkb> I don't know
20:32 <clarkb> the log for the 224 account retirements is on review now. I'm going to work on getting the audit script running next
20:32 <mordred> CentOS-8-updates-20210324.0 updated 3/24 according to the timestamp on the COMPOSE_ID file
20:33 <mordred> and, I suppose, the contents :)
20:36 *** slaweq has quit IRC
20:37 <fungi> oh, also i'm currently sidetracked by making dinner
20:38 <mordred> dinner is potentially more tasty than config-drive
20:45 <clarkb> I've got an audit running now, it should show a couple of hundred accounts ready to have external ids cleaned up when done
20:46 <clarkb> mordred: fungi: config drive can be iso format or fat32
20:46 <clarkb> I'm not sure which the linaro cloud has selected
20:47 * clarkb goes to see
20:50 <clarkb> it appears to be present as /dev/sr0 on ubuntu so likely an iso
20:51 <clarkb> is there possibly a regression in the cdrom drivers on arm for centos 8?
20:51 <clarkb> we could ask kevin to switch us to fat32
20:51 <clarkb> (part of the struggle here is going to be none of us have arm64 hardware that i know of, but maybe we do)
20:52 <fungi> i have several bits of arm64 hardware, but not beefy enough to boot virtual machines on
20:52 <clarkb> and lsblk -f confirms it is an iso9660 fstype
20:53 <clarkb> based on what you report the console does, my hunch is that something related to probing the /dev/sr0 device causes things to hold up
20:53 <clarkb> probably the quickest solution to that is to have the cloud use fat32 config drives
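[Editor's note] For anyone reproducing the check above: OpenStack labels config drives `config-2` regardless of format, so the filesystem type reported by `lsblk -f` for that label tells you whether the cloud chose iso9660 or vfat. A small self-contained sketch (the function name is invented):

```python
def config_drive_format(blockdevs):
    """Return the filesystem type of the config drive, or None.

    blockdevs: (name, fstype, label) rows as shown by `lsblk -f`.
    OpenStack labels config drives 'config-2' whether they are built
    as iso9660 or vfat, so the label is the reliable marker and the
    fstype then reveals which format the cloud used.
    """
    for _name, fstype, label in blockdevs:
        if label and label.lower() == "config-2":
            return fstype
    return None

# The observation above: /dev/sr0 carrying an iso9660 config drive
rows = [("vda1", "ext4", "cloudimg-rootfs"), ("sr0", "iso9660", "config-2")]
```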
20:55 <clarkb> the mystery deepens, apparently vfat is deprecated and may be removed :(
20:58 <clarkb> there is a new aarch64 kernel in centos-8 as of march 16, though I'm not sure when it was published to the package list as that may just be file creation time?
20:59 <clarkb> I doubt we are hitting something like because you wouldn't expect it to get as far as it does in the console if so
21:01 <fungi> nah, the kernel is running, problem seems to manifest in userspace after init starts
21:03 <clarkb> is it possibly a problem with glean?
21:03 <clarkb> there have been a few recent changes to help make ironic testing easier
21:03 <clarkb> maybe they side effected in unexpected ways on arm64?
21:04 <clarkb> is a really fun one but unrelated to our problem (I almost want to suggest dib except our images don't work right now :) )
21:04 <clarkb> fungi: to test if it is glean what we can do is make a new image out of band with baked in dhcp and accept-ra configs
21:04 <clarkb> then boot that and see if it works
21:05 <clarkb> usually when I did this years ago I built the smallest image I possibly could
21:05 <clarkb> as it speeds up rtt and reduces things that can interfere
21:09 <fungi> glean echoes stuff to the console though, right?
21:10 <fungi> or has systemd made that challenging?
21:10 <clarkb> it should via systemd iirc
21:10 <clarkb> there is a flag you set on the unit to redirect it there /me checks
21:11 <clarkb> and centos uses glean-nm
21:11 <clarkb> we should add that directive to as well
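[Editor's note] The unit flag being discussed is systemd's `StandardOutput=`; a `journal+console` value mirrors a unit's output onto the kernel console, which is what makes early-boot services like glean visible in the nova console log. A hypothetical drop-in sketch; the path and unit name are assumptions, not taken from the discussion:

```ini
# Hypothetical drop-in: /etc/systemd/system/glean@.service.d/console.conf
# Mirror the unit's stdout/stderr onto the kernel console so early-boot
# network configuration output shows up in the hypervisor console log.
[Service]
StandardOutput=journal+console
StandardError=journal+console
```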
21:12 <clarkb> fungi: maybe we should try booting out of band too?
21:13 <clarkb> that may help us narrow down the problem?
21:13 <clarkb> in particular that image is older than the recent kernel update
21:17 <fungi> there's an instance booted in there from october named "test" which can probably be cleaned up
21:21 *** sboyron has quit IRC
22:52 <clarkb> audit completed and shows 220 accounts that are inactive and ready for external id cleanup (that is about right because not every one of the 224 I did will have been part of a pair)
22:59 <clarkb> the audit is now up on review as well under gerrit_user_cleanups/external_id_conflict_classifications.20210402 in my homedir
23:19 *** tosky has quit IRC

Generated by 2.17.2 by Marius Gedminas - find it at!