Wednesday, 2021-04-07

*** mlavalle has quit IRC00:00
fungisounds likely00:04
*** brinzhang_ has joined #opendev00:09
*** brinzhang0 has quit IRC00:12
*** tosky has quit IRC00:17
corvusfungi, frickler: fungi just made me aware of this change: https://review.opendev.org/773710 because i spent about 40 minutes trying to figure out why a job which previously passed suddenly didn't have enough ram00:43
corvusi am strongly in favor of limiting the memory in use on those nodes to match the other providers00:44
fungiyeah, we merged that quickly because it was requested by a representative of the resource donor, indicating the old flavor was going to be removed00:44
corvushonestly, it's fine if we want to supply nodes with more ram00:44
corvusbut it's *really* important that nodes with the same nodepool label are at least roughly equivalent00:44
fungii was tempted to also announce it on the ml, as it worried me, but there didn't seem to be much concern from anyone else at the time so figured we'd address it when it became a problem00:45
corvusit's a problem, and i'm concerned :)00:45
corvuswas it decided that v3-standard-2 was not cpu sufficient?00:47
corvushttps://vexxhost.com/pricing/00:47
corvus2 cores, 8gb00:47
fungi(brief) discussion in addition to what's in the change comments happened in here http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T15:33:33-2 and then again later when i brought it up in the meeting http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-02-19.01.log.html#l-14600:47
fungiit seemed like the concern was that a 2 cpu server instance would be very slow relative to other providers00:47
fungiwhere we're typically doing ~8 cpus00:48
*** hamalq has quit IRC00:48
fungimnaser wanted the memory to balance the cpu count based on the ratios provided by his hardware00:48
corvusyeah, i assume that's still balanced at 2 vcpus; and i *assume* that's not enough, but it's just an assumption; i'm curious if anyone looked into it00:49
fungiwe can certainly give the 2cpu flavor a try, i'm not opposed, just expecting it will be slow00:49
fungiand no i don't think we tested it00:50
corvusfungi: what labels have > 8g of ram?01:02
openstackgerritJames E. Blair proposed openstack/project-config master: Revert "Remove restrict-memory"  https://review.opendev.org/c/openstack/project-config/+/78508101:07
corvusthat reverts a revert from 201601:07
corvusfungi, frickler, clarkb, ianw: ^ honestly no idea if that will still work, and unfortunately, i don't have time to drive something of this complexity right now.  i'd appreciate some help getting a fix in place.01:12
fungicorvus: ubuntu-focal-arm64-xxxlarge, ubuntu-bionic-arm64-large, ubuntu-bionic-expanded-vexxhost, centos-7-expanded, ubuntu-bionic-expanded, ubuntu-bionic-32GB, multi-numa-ubuntu-bionic-expanded, multi-numa-centos-7-expanded01:15
fungithose are the ones i'm finding in our launcher configs at least01:15
corvusfungi: oof.  i guess i missed a bunch then :(01:15
corvusoh, because we use flavor-name for those instead of min-ram01:16
corvusfungi: wait what about v2-highcpu-8 ?01:18
corvusare they only in sjc1 and not in ymq-1 ?01:18
fungiyeah, we have max-servers 0 in sjc101:21
fungiall our capacity is in ymq-101:21
corvusand do we know why v2-highcpu-8 wasn't used?  is it not available or some other reason?01:22
fungii don't know, but we could ask mnaser when he's around01:22
kevinzfungi: corvus: Morning! Is there anything I can help with?01:28
corvuskevinz: with what?01:28
corvuswith the 8gb limit thing?01:29
kevinzMaybe, I saw that  clarkb said that "re centos arm64 images maybe we should email kevinz about trying fat32 config drives instead of iso?"01:29
kevinzArm64 CI related01:29
corvuskevinz: oh i have no idea about that, sorry01:29
fungikevinz: we noticed our centos-8-arm64 images stopped booting sometime around friday, so were trying to figure out why (they just drop into a dracut emergency shell now according to the console log)01:30
fungiwe were trying to figure out what might have changed to cause that, and clarkb surmised it could be a change in iso support in dib01:31
kevinzcorvus: Np01:35
kevinzfungi: OK,  understand.  So if you need any change from the arm64 cloud side,  please let me know.01:37
fungikevinz: thanks, clarkb was suggesting as a possibility switching the configdrive type, but i don't think we've even determined what's causing dracut to timeout starting things. we probably need to boot a standard image or modified one there so we can log in and take a look01:40
kevinzfungi:  OK,  np,  thanks for letting me know :-)01:48
*** rh-jlabarre has joined #opendev01:55
*** rh-jlabarre has quit IRC01:55
*** rh-jlabarre has joined #opendev01:56
*** rh-jelabarre has quit IRC01:56
*** ykarel has joined #opendev04:11
*** rh-jlabarre has quit IRC04:20
*** ysandeep|away is now known as ysandeep04:59
*** marios has joined #opendev05:09
*** whoami-rajat has joined #opendev05:23
*** sboyron has joined #opendev05:32
*** ralonsoh has joined #opendev05:51
*** remal has joined #opendev05:56
*** lpetrut has joined #opendev06:02
*** ykarel_ has joined #opendev06:18
*** ykarel has quit IRC06:21
*** remal has quit IRC06:22
*** eolivare has joined #opendev06:27
*** ykarel_ is now known as ykarel06:32
*** slaweq has joined #opendev06:37
*** lpetrut has quit IRC06:40
*** fressi has joined #opendev06:42
*** amoralej|off is now known as amoralej06:56
*** rpittau|afk is now known as rpittau07:03
*** andrewbonney has joined #opendev07:14
openstackgerritRiccardo Pittau proposed openstack/diskimage-builder master: Convert multi line if statement to case  https://review.opendev.org/c/openstack/diskimage-builder/+/73447907:17
*** tosky has joined #opendev07:37
*** artom has quit IRC07:46
*** artom has joined #opendev07:46
hrwhm. looks like we need to find someone to donate more arm64 nodes. check-arm64 queue looks overloaded07:48
*** ykarel has quit IRC07:56
*** jpena|off is now known as jpena07:57
*** ykarel has joined #opendev08:20
*** tkajinam has quit IRC08:25
*** tkajinam has joined #opendev08:26
louroto/ zuul seems to be having random failures when git-cloning at the moment, sometimes from opendev.org, sometimes from github.com, see for example the last failures in this review: https://review.opendev.org/c/openstack/charm-ceph-iscsi/+/784421 - Is it a known issue? thanks!08:43
lourotoh it's a cert issue on opendev.org: if you open https://opendev.org/openstack/charm-ops-openstack in your browser you'll see the cert isn't trusted anymore08:50
*** darshna has quit IRC08:50
lourotthe cert just expired08:51
hrwlourot: 6th June 2021 is expiration date08:52
hrwlourot: cert was refreshed on 8th March08:52
lourotfor me it reads:08:53
lourotIssued On Thursday, January 7, 2021 at 6:43:43 AM08:53
lourotExpires On Wednesday, April 7, 2021 at 7:43:43 AM08:53
lourotam I hitting a server that still has the old cert?08:53
hrwlooks like08:53
lourotfrom gitea08.opendev.org08:53
hrwI got gitea07.opendev.org08:54
hrwthat's why08:54
*** rpittau is now known as rpittau|bbl09:23
kevinzhrw: will consider adding some resources from the Linaro Cambridge Colo side, the uk2.linaro.cloud09:44
hrwhm. it is not only gitea08 which has ssl cert expired09:58
hrwcurl -o /requirements/upper-constraints.txt https://releases.openstack.org/constraints/upper/master09:59
hrwINFO:kolla.common.utils.kolla-toolbox:curl: (60) SSL certificate problem: certificate has expired09:59
hrwso random CI jobs fail depending on which area they run in10:05
*** dtantsur|afk is now known as dtantsur10:23
*** fressi has quit IRC10:24
yoctozeptoinfra-root: at least one of opendev mirrors has an expired ssl cert, jobs fail randomly :-( ^^10:34
openstackgerritchandan kumar proposed openstack/diskimage-builder master: Make DIB_DNF_MODULE_STREAMS part of yum element  https://review.opendev.org/c/openstack/diskimage-builder/+/78513810:36
*** ykarel has quit IRC10:41
*** fressi has joined #opendev10:42
*** ykarel has joined #opendev10:48
openstackgerritDmitriy Rabotyagov proposed openstack/project-config master: Add Debian Bullseye nodepool images and wheels  https://review.opendev.org/c/openstack/project-config/+/78361311:02
yoctozeptoinfra running rootless today ;-(11:02
openstackgerritDmitriy Rabotyagov proposed openstack/project-config master: Add Debian bullseye wheel cache publish jobs  https://review.opendev.org/c/openstack/project-config/+/78363311:03
*** fressi has quit IRC11:15
*** fressi has joined #opendev11:17
*** zoharm has joined #opendev11:30
*** dpawlik4 has joined #opendev11:40
*** dpawlik4 is now known as dpawlik11:42
*** dtantsur is now known as dtantsur|bbl11:42
*** pas-ha has joined #opendev11:47
*** kopecmartin has quit IRC11:51
*** kopecmartin has joined #opendev11:52
*** pas-ha has left #opendev11:56
openstackgerritHervé Beraud proposed opendev/irc-meetings master: Switch the release team meeting to 2pm UTC  https://review.opendev.org/c/opendev/irc-meetings/+/78515712:03
*** amoralej is now known as amoralej|lunch12:15
*** artom has quit IRC12:27
*** rh-jlabarre has joined #opendev12:28
*** dtantsur|bbl is now known as dtantsur12:28
*** fresta has joined #opendev12:34
*** mgoddard has joined #opendev12:35
*** rpittau|bbl is now known as rpittau12:51
*** amoralej|lunch is now known as amoralej12:59
*** lpetrut has joined #opendev13:16
fungii'm working on disabling any gitea backends with expired certs now13:44
fungii'm not seeing an expired cert served from any of our 8 gitea backends... is anyone still seeing it?13:46
fungii suppose it could have been a transient problem with a stale apache worker which finally got recycled13:47
fungiright now my browser is happy with https://gitea01.opendev.org:3000/ through https://gitea08.opendev.org:3000/13:48
fungihrw: i don't know that the arm64 nodes are all that backed up, but we observed that as of friday we stopped being able to boot our centos-8-arm64 images there, the kernel loads but dracut times out waiting for something to start (doesn't say what) and then drops to an interactive recovery shell, but novnc doesn't seem to be working for it so we can't easily investigate and i'm personally not all that familiar with centos to begin guessing what the problem is13:51
*** chkumar|ruck is now known as raukadah13:52
fungiclarkb suggested something might have changed with dib's support for iso configdrives and that switching the configdrive to vfat might be able to rule that out, but we're down to a skeleton crew and could use all the help we can get diagnosing problems13:52
corvusfungi: i randomly got a gitea exp cert on 0813:54
corvusfungi: but via the lb, not directly to 300013:55
fungicorvus: like just now?13:56
corvusyep13:56
fungithe cert on it was replaced march 8, so the apache theory is strong13:56
fungii'll check process times13:56
corvusfungi: i also get good certs from 0813:56
fungii see 6 apache processes with start times prior to march 813:56
fungii'll restart apache there13:57
fungi#status log cold restarted apache on gitea08.opendev.org as there were some stale worker processes which seemed to be serving expired certs from more than a month ago13:59
openstackstatusfungi: finished logging13:59
fungithis would also explain why our cert checker didn't alert us to that14:00
*** tinwood has joined #opendev14:02
fungii need to go run some quick errands but should be back by 15:00 utc14:04
*** fressi has quit IRC14:18
yoctozeptothanks fungi14:27
*** artom has joined #opendev14:36
hrwfungi: thanks for info. I am not familiar with how dib works (despite having some changes there). I can add it to my todo list, but no idea whether I will find the time/ideas14:45
zbrcertificate expired on opendev.org? https://zuul.opendev.org/t/openstack/build/479f1cbbc2a34d259cbeb341e4075d5714:54
*** lpetrut has quit IRC14:54
yoctozeptozbr: fungi fixed that recently14:57
*** ysandeep is now known as ysandeep|dinner15:00
*** kopecmartin has quit IRC15:01
*** kopecmartin has joined #opendev15:01
zbrthis was like 2 minutes before I posted the message, so unless he fixed them during the last 15-20 mins.15:03
clarkbI know we put mitigations in place for that on some apache configs, iirc we set worker request limits15:03
clarkbit is possible more than one gitea backend has stale apache workers. Note everyone is able to check the backends directly, you can point s_client or your browser at them and see what you get. However if it is stale apache backends then you will only get errors some percentage of the time15:06
clarkbgive me a minute to load keys and I'll look15:06
zbrthanks!15:07
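A minimal sketch of the spot check clarkb describes, assuming only the Python standard library. Because stale workers serve the expired cert on just some connections, a single probe proves little, so the sketch repeats the TLS handshake and tallies the outcomes. The function names (`classify_verify_code`, `handshake_outcome`, `summarize`, `sample`) are hypothetical and not part of any opendev tooling; the value 10 is OpenSSL's verify code for an expired certificate.

```python
import socket
import ssl
from collections import Counter

# OpenSSL verify error code for "certificate has expired".
X509_V_ERR_CERT_HAS_EXPIRED = 10


def classify_verify_code(code):
    """Label a certificate verification failure by its OpenSSL verify code."""
    if code == X509_V_ERR_CERT_HAS_EXPIRED:
        return "expired"
    return "other-verify-error"


def handshake_outcome(host, port=443, timeout=5.0):
    """Attempt one verified TLS handshake and return a short outcome label."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Must be caught before OSError, which is its base class.
        return classify_verify_code(exc.verify_code)
    except OSError:
        return "connect-error"


def summarize(outcomes):
    """Count how often each outcome label occurred."""
    return dict(Counter(outcomes))


def sample(host, port=443, attempts=20):
    """Repeat the handshake, since only some workers may serve the old cert."""
    return summarize(handshake_outcome(host, port) for _ in range(attempts))
```

Something like `sample("gitea08.opendev.org", port=3000)` would then report how many of the attempts hit an expired certificate; a mix of "ok" and "expired" would be consistent with the stale-worker theory.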
fungii can go through and restart apache on all the backends i suppose, if we want to be completely sure15:09
clarkb06 had a couple of processes from february and I've restarted apache there now too15:11
fungilooks like https://zuul.opendev.org/t/openstack/build/479f1cbbc2a34d259cbeb341e4075d57/log/job-output.txt#783 was indeed ~50 minutes after i restarted apache on gitea08 so there may be more than one backend in a similar situation still15:12
clarkb05 and 04 also have older processes but not that old. I'll restart them too15:12
clarkbthe others look fine15:12
clarkboh I guess 07 has a slightly old set too15:12
fungiyeah, any with worker processes from prior to whatever the timestamp on /etc/letsencrypt-certs/gitea0?.opendev.org/gitea0?.opendev.org.cer is15:13
fungiwhich i suppose could vary between backends15:13
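The comparison fungi describes (worker start time versus the certificate file's timestamp) can be sketched roughly as follows. This assumes a Linux host with procps `ps` (for the `etimes` elapsed-seconds column) and that the processes are named `apache2`; the helper names are hypothetical.

```python
import os
import subprocess
import time


def parse_ps_etimes(ps_output, now):
    """Map pid -> start time (epoch seconds) from `ps -o pid=,etimes=` output."""
    starts = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        pid, elapsed = int(fields[0]), int(fields[1])
        starts[pid] = now - elapsed
    return starts


def stale_workers(starts, cert_mtime):
    """Return the pids that started before the certificate was last written."""
    return sorted(pid for pid, started in starts.items() if started < cert_mtime)


def find_stale_apache_workers(cert_path, process_name="apache2"):
    """Compare running process start times against the cert file's mtime."""
    now = time.time()
    result = subprocess.run(
        ["ps", "-C", process_name, "-o", "pid=,etimes="],
        capture_output=True, text=True, check=False,
    )
    return stale_workers(parse_ps_etimes(result.stdout, now),
                         os.path.getmtime(cert_path))
```

Note that the parent apache process will normally predate the cert as well, even though it rereads its configuration on a graceful reload, so a hit on the parent pid alone is not evidence of a stale worker.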
clarkbwe can port the worker limits from mirrors to the giteas if we haven't already15:14
fungii was wondering about that, i'll see if i can find them15:14
zbrthat is one moment when i am proud about my homelab using traefik which takes care of cert management itself (including renewals)15:14
clarkbzbr: we have automated renewals as well. The problem is that when you tell apache to gracefully reload it does so at its own pace15:15
clarkbit's not a problem with cert renewals15:15
clarkbit's an issue with apache reload being too graceful15:15
fungilooks like we're setting MaxConnectionsPerChild in /etc/apache2/conf-enabled/connection-tuning.conf on the mirrors15:15
zbrstuck processes I guess?15:15
clarkbzbr: yes15:15
fungizbr: not really "stuck" but rather by default apache doesn't recycle worker processes15:16
clarkbfungi: though a graceful reload is supposed to eventually turn them over aiui15:16
fungiright, but the default turnover is essentially set to never15:16
zbrtbh, i did not use apache in many years, more of nginx guy.15:16
zbri am reading the docs of MaxConnectionsPerChild -- and based on this i should assume that w/ default settings apache never restarts them? meaning graceful-restart never finishes?15:20
clarkbfungi: hrw: kevinz: note I'm not sure it was anything to do with dib's support for isos. glean + simple-init leans on mount and the kernel for that. I was suggesting that something may have changed in centos to break that. I have been meaning to boot the upstream arm64 centos8 image and see if that helps give us any clues (if it reproduces maybe the console works there, if it doesn't reproduce we can compare package versions, etc)15:21
hrwclarkb: and it is on c8 not cs8, right?15:22
*** andrewbonney has quit IRC15:22
* hrw in kolla meeting15:22
fungihrw: it's both actually15:22
hrwthanks15:22
fungii confirmed centos-stream-8-arm64 is breaking the same way15:22
clarkbzbr: I think it's a bit more forceful than that, but by setting our own limits we can cycle much more quickly15:23
clarkbzbr: it runs apachectl graceful which sends sigusr1 to apache if you want to go figure out exactly what it does15:24
zbrGracefulShutdownTimeout also defaults to 0, which may explain why very old processes were never restarted.15:24
*** andrewbonney has joined #opendev15:24
zbrwould a 60 (s) limit be too aggressive?15:25
zbrusually if you want to restart you have a need, so any value between 60-1800 would seem ok to me.15:26
fungizbr: we've been setting MaxConnectionsPerChild 8192 which seems to work fine15:27
Alex_GaynorFYI we're disabling centos arm64 builders on pyca/cryptography due to the booting issues. Is there a good way to get notified when they're working again, so we can re-enable?15:27
fungithe workers will each field far more connections than that in a month15:27
*** mlavalle has joined #opendev15:28
zbrthen there was at least one immortal among them :D15:28
fungiAlex_Gaynor: yeah we can give you a heads up in here15:30
Alex_Gaynorfungi: 🙇‍♂️ much obliged, thanks!15:30
fungizbr: well, no, i mean we've set that elsewhere (static sites, mirror servers...) just not on our gitea backends15:30
fungii'm adding that now15:31
fungichange will be up for review in a few minutes15:31
*** lourot has quit IRC15:32
clarkb#status log Restarted apache on gitea 04-07 to clean up additional stale processes which may have served old certs15:34
openstackstatusclarkb: finished logging15:34
openstackgerritJeremy Stanley proposed opendev/system-config master: Set MaxConnectionsPerChild 8192 for Gitea backends  https://review.opendev.org/c/opendev/system-config/+/78522615:38
fungiclarkb: corvus: zbr: yoctozepto: hrw: ^ that should prevent the issue in the future15:40
fungiprevent the stale certs problem i mean15:40
*** ykarel is now known as ykarel|away15:41
hrwfungi: super15:47
*** zoharm has quit IRC15:52
zbrapparently gerrit is no longer responding15:58
*** lourot has joined #opendev15:59
fungicurrent load average is only around 13, so actually fairly low for this time of day15:59
*** hamalq has joined #opendev16:00
clarkbI was just able to +2 fungi's change16:01
clarkband my dashboard loads16:01
*** ykarel|away has quit IRC16:02
*** sshnaidm is now known as sshnaidm|afk16:06
fungimaybe it was only briefly choked16:07
*** dtantsur is now known as dtantsur|afk16:18
*** ysandeep|dinner is now known as ysandeep16:28
clarkbit just occurred to me that another way to test the arm64 centos-8 image could be with qemu emulating arm6416:29
clarkbhttps://nb03.opendev.org/images/ the images are there if anyone wants to try ^ they expect a config drive. cc hrw16:29
*** marios is now known as marios|out16:31
openstackgerritMerged opendev/irc-meetings master: Switch the release team meeting to 2pm UTC  https://review.opendev.org/c/opendev/irc-meetings/+/78515716:43
hrwclarkb:     [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-294.el8.aarch64 root=/dev/mapper/loop7p3 ro16:44
hrwclarkb: root= value is weird16:44
fungii did see a kmesg in the consoles about a missing loop device, but it didn't look fatal16:45
fungihowever i hadn't considered it in conjunction with that kernel command line16:45
*** avass has joined #opendev16:46
hrwchanged by hand to root=/dev/vda3 and it boots fine16:46
*** rpittau is now known as rpittau|afk16:46
hrwCentOS Stream 816:46
hrwKernel 4.18.0-294.el8.aarch64 on an aarch6416:46
hrwlocalhost login:16:46
hrwso image build process needs fixing16:47
hrwI suspect that centos 8 has the same issue16:48
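As a rough illustration of the symptom hrw found, the sketch below pulls the `root=` value out of a kernel command line and flags values that point at a loop device, which exists only on the build host and not in the booted VM. The helper names and the regular expression are assumptions made for illustration; they are not part of diskimage-builder.

```python
import re

# Loop device paths such as /dev/loop0 or /dev/mapper/loop7p3 (illustrative pattern).
LOOP_DEVICE = re.compile(r"^/dev/(mapper/)?loop\d+(p\d+)?$")


def root_param(cmdline):
    """Return the value of the root= argument, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith("root="):
            return token[len("root="):]
    return None


def is_build_host_device(root):
    """True if root= names a loop device rather than a label, UUID or real disk."""
    return root is not None and LOOP_DEVICE.match(root) is not None


broken = ("BOOT_IMAGE=/boot/vmlinuz-4.18.0-294.el8.aarch64 "
          "root=/dev/mapper/loop7p3 ro")
print(root_param(broken), is_build_host_device(root_param(broken)))
```

A check along these lines could be run against the generated grub config at the end of an image build to catch a leaked build-time device path before the image is uploaded.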
clarkbI wonder why that only affects arm64 centos. That makes me suspect a bug in centos grub though that's mostly due to process of elimination16:48
*** eolivare has quit IRC16:48
clarkbhowever dib does use loop devices so maybe it's detecting the wrong device when it does the install. We also have the build logs at https://nb03.opendev.org/16:48
*** marios|out has quit IRC16:49
clarkbmaybe that will offer clues as to how that device is selected16:49
fungii agree, it does sound very much like some new centos mechanism has started trying to autodetect the rootfs at build time, and seems to be exclusive to the bootloader setup for arm16:50
fungicould it be (u)efi-related? and we do mbr on x86?16:51
clarkbyes good point16:55
clarkbmbr across x86 as far as I know16:55
hrwdib supports both mbr and uefi on x86. I do not know which one you use16:57
fungiwe set hw_firmware_type to uefi in the meta block for arm64 diskimages and not for the others. i guess mbr must be the default?16:59
hrwprobably17:00
hrwhw_firmware_type=uefi is not needed since mitaka17:00
hrwI got it to be default in nova then17:00
fungiahh, so that doesn't control image properties i guess17:00
hrwit is saying 'dear nova, please use uefi as a bootloader while booting VM with this image'17:01
hrwwhile nova assumes uefi on aarch64 unless said otherwise17:01
mordredit would be amazing if that was how api calls were written17:01
mordredPOST "dear nova, please use uefi as a bootloader while booting VM with this image"17:02
fungijust give ai/ml a little more time17:02
mordredRESPONSE "I understand you want to order a bucket of fried chicken"17:02
hrwmordred: nope, too many different versions of english17:02
hrwand someone could use 'my dear nova' or just 'nova'17:03
hrwor 'you #@$@#%^@^@$#%@ nova $@#$@%@'17:03
hrwhttps://marcin.juszkiewicz.com.pl/2018/01/04/today-i-was-fighting-with-nova-no-idea-who-won/17:03
mordredhrw: I mean - I do the 'you #@$@#%^@^@$#%@ nova $@#$@%@' version in the comments of openstacksdk already ... :)17:04
hrwwhen I was writing it my API calls would involve curses17:04
hrwmordred: please. 'openstack HERE-WE-SPEAK-TO-NOVA' vs novaclient is not something I can discuss while sober17:05
hrwbbiab - have to go and buy some stuff17:05
mordredhrw: we'll schedule it for when we're drinking some time17:05
*** jpena is now known as jpena|off17:08
fungiclarkb: hrw: okay, so we add the block-device-efi element in the config for our arm builder, but not in the config for our x86 builders: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nb03.opendev.org.yaml#L8117:08
fungiwhich means this element is only used in our arm images: https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/block-device-efi17:10
fungiand all that really seems to do is export DIB_BLOCK_DEVICE=efi in the environment for other elements to key on17:11
clarkbya so very likely x86 + mbr + centos8 is fine but arm64 + uefi + centos8 trips some issue with device detection17:11
fungihttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader has some switching on that value, seems to be the main place it's used17:13
*** amoralej is now known as amoralej|off17:14
*** ysandeep is now known as ysandeep|away17:16
clarkbinstall-packages -m bootloader grub-efi-$ARCH17:16
* clarkb looks at a build log to see if it emits device choices there17:17
fungimaybe this section? https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader#L244-L25517:18
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader#L14-L16 contains the loop device in it17:19
hrwmordred: ok17:19
clarkbbut that seems to only be used by ppc in that script17:19
clarkbhttps://nb03.opendev.org/centos-8-arm64-0000036826.log is the log I am looking at17:19
hrwI need the dib shell command so I can recreate locally17:20
clarkbhrw: something like `disk-image-create block-device-efi vm simple-init initialize-urandom growroot journal-to-console centos-minimal epel`17:21
clarkbthat should default to centos 8 as the latest centos and you need to run it on an aarch64 host17:22
hrwclarkb: thanks17:22
clarkbI've stripped out infra specific elements that cache repos and stuff as they significantly lengthen the build time17:22
*** brinzhang_ has quit IRC17:22
fungiwe also have diskimage-builder/src/branch/master/diskimage_builder/elements/block-device-efi/environment.d/15-block-device.bash though you'd need to uncomment line 39 there17:22
fungier, sorry, pasted from wrong buffer17:23
fungihttps://opendev.org/openstack/project-config/src/branch/master/tools/build-image.sh17:23
*** brinzhang_ has joined #opendev17:23
hrwand no initialize-urandom17:23
hrwheh. dib assumes centos on host too ;(17:24
clarkbhrw it shouldn't we cross build17:24
clarkbusing debian, you do have to have some rpm tools though iirc17:24
clarkbif you run on centos then they will already be present17:24
*** lbragstad has quit IRC17:25
clarkbCreating fs command [['mkfs', '-t', 'ext4', '-i', '4096', '-J', 'size=512', '-L', 'cloudimg-rootfs', '-U', '4c9f96ba-63ca-443f-9721-080da3b255de', '-q', '/dev/mapper/loop7p3']] create /usr/local/lib/python3.7/site-packages/diskimage_builder/block_device/level2/mkfs.py:132 <- that is from earlier in the build where the actual fs is written17:26
clarkbI suspect that we want grub-efi-aa64 to use labels or uuids and not paths17:26
clarkbsince it is grub-efi-aa64 that should be writing the kernel command line info right?17:27
hrwwould be best17:27
hrwclarkb: it can be either grub which generates it on the fly or rootfs17:30
clarkblooks like the centos8 grub2-efi-aa64 package (and related modules etc packages) were last updated on march 2. There are two versions though, an el8 version and an el8_3.117:31
clarkbel8_3.1 is the newer one17:31
clarkband build log says we install _3.117:32
fungii wonder how far back we started using that17:33
fungiseems like we had working images until last week17:33
fungibut also, possible something else was breaking centos image builds until last week and that was the first time it updated since early march17:34
clarkbya17:34
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader#L198 is likely related?17:36
clarkbnote the line above we specifically tell it to use the label, now to check if the logs show us that label is set to the label and not the path17:36
clarkb2021-04-06 23:29:22.724 | + echo GRUB_DEVICE=LABEL=cloudimg-rootfs17:37
fungioh, yep17:38
clarkbit appears we are explicitly attempting to set the root device via a label17:38
clarkband that label matches the label set in the mkfs above17:38
fungiwhich maybe has stopped working with uefi on centos?17:39
fungi(the label detection i mean)17:39
*** klonn has joined #opendev17:39
clarkbat build time or boot time?17:40
fungiif the problem is there in the config, then at boot time i guess17:42
hrwroot=LABEL=cloudimg-rootfs boots fine17:42
fungithough that doesn't square with how the root=/dev/loopfoo is getting into the config17:42
clarkbhrw: ya and we seem to explicitly try to set that value in our script17:43
clarkbso at least the intended behavior is good, we just have to figure out how to apply it :)17:44
clarkbwe run grub2-mkconfig twice with two different outputs https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader#L177-L179 and https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/finalise.d/50-bootloader#L22917:46
clarkbthe first one is uefi specific17:46
clarkbthe first one also happens prior to us setting GRUB_DEVICE=LABEL=cloudimg-rootfs17:46
clarkbare we maybe running that too early or we should be setting those grub.cfg options prior?17:47
fungii wonder if there's useful output in the build log to tell us17:47
clarkbfungi: I'm looking at the output to find ^17:47
*** slaweq has quit IRC17:47
clarkbwhat isn't clear to me is which one is used at boot time, but wouldn't be surprised if a uefi specific config is used when booting uefi :)17:47
hrwthe one from L17717:48
hrwon normal system /etc/grub.cfg is often symlink to efi one17:49
clarkbhrw: ok in that case we set GRUB_DEVICE after L177 which may explain the issue17:49
clarkbhrw: oh interesting I wonder if that was the case until recently on centos maybe?17:49
clarkbso when we updated the normal path before it was updating the efi file17:49
clarkbbut now perhaps not17:49
hrwcommit 27a326dafb621269c501225fd4842615ca4adf7317:51
hrwAuthor: Steve Baker <sbaker@redhat.com>17:51
hrwDate:   Fri Mar 5 16:35:21 2021 +130017:51
hrw    Support secure-boot bootloader where possible17:51
hrwI suspect that part17:51
hrwdib--17:52
hrw2021-04-07 17:50:36.484 | diskimage_builder.block_device.exception.BlockDeviceSetupException: exec_sudo failed17:52
hrwin shell sudo works..17:52
hrwah. no gdisk in system ;d17:52
clarkbhrw: ya I'm noticing the code that is bad mentions secure boot17:54
clarkbI'm trying to write up a quick change (that is probably wrong) but at least gives us enough breadcrumb to follow17:55
hrwclarkb: https://paste.centos.org/view/6f166b9d like?17:56
hrwgenerate /etc/default/grub and then handle generation of both grub configs17:56
hrweach dib run shows me what my aarch64 system lacks ;D17:58
hrwgdisk, kpartx, dosfstools...17:59
hrwto make things funnier - iirc I added them to dib...17:59
clarkbhrw: you might be able to grab the zuul/nodepool-builder docker image to avoid finding all that stuff18:00
clarkbhrw: I'm moving just the GRUB_MKCONFIG calls18:00
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Properly set grub2 root device when using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78524718:02
clarkbhrw: ^ something like that18:02
clarkbhrw: what isn't clear to me is why run `$GRUBNAME --modules="$modules" $extra_options $GRUB_OPTS $BOOT_DEV` only when /boot/efi/$EFI_BOOT_DIR does not exist18:05
*** klonn has quit IRC18:05
fungii wonder if that was meant to skip preexisting efi setups18:06
clarkbthe ubuntu focal build shows it is doing that skip which is why those haven't broken18:06
hrwno idea too18:06
hrwok, I have to end a day18:06
fungithanks for the help hrw! this was very useful18:07
clarkbwe run grub-mkconfig once at the very end of the script against /boot/grub/grub.cfg on ubuntu-focal18:07
hrwfungi: yw18:07
clarkbthat likely explains why only centos8 is broken18:07
* hrw off18:07
fungiyes, i figured it wasn't necessarily an efi problem, merely a problem exposed by some of the efi-specific switching logic18:08
clarkbthe change that seems to have introduced the problem does set EFI_BOOT_DIR="EFI/ubuntu" in the ubuntu-common element18:09
clarkbbut our build logs imply that doesn't exist18:10
clarkbon my local nas server that path does exist though18:11
*** andrewbonney has quit IRC18:13
clarkbreading the commit message and going through the logical paths of the script I think my change may actually be the fix18:15
*** lbragstad has joined #opendev18:19
*** sboyron has quit IRC18:29
clarkbfungi: re the flavor change change, I didn't sense urgency in the commit message when I read it, but it was early morning for me iirc and I may have missed it18:41
*** rh-jlabarre has quit IRC19:06
*** sboyron has joined #opendev19:10
*** sboyron has quit IRC19:11
*** sboyron has joined #opendev19:12
fungicommit message said the flavor we were using was being deleted, change was pushed by an operator from the donor who also immediately pinged us in here upon pushing it19:16
*** sboyron has quit IRC19:16
fungiclarkb: okay, so current best theory is the bug was introduced in 3.7.1 which was tagged a week ago... timing roughly matches when we started seeing problems19:19
fungiwe could roll back to 3.7.0 temporarily, though we're currently running 3.8.0 which includes fixes for our debian-bullseye images19:20
clarkbfungi: ah I totally missed that the old flavors had been removed19:24
clarkbfungi: ya, I also suspect that my fix would actually fix it19:24
clarkbthough testing is difficult. I guess we can deploy the nodepool-builder builder image made by check to nb03 and test it that way?19:24
fungioh, that's a good idea19:26
fungiyeah i was trying to figure out a good hotfix which we could use to exercise the patch19:26
*** weshay|ruck has left #opendev19:31
*** ralonsoh has quit IRC19:50
*** slaweq has joined #opendev19:53
openstackgerritTristan Cacqueray proposed opendev/gerritlib master: Add ignore events filter  https://review.opendev.org/c/opendev/gerritlib/+/78526219:57
*** whoami-rajat has quit IRC19:57
openstackgerritTristan Cacqueray proposed opendev/gerritbot master: Ignore replication event  https://review.opendev.org/c/opendev/gerritbot/+/78526419:59
*** mgagne has joined #opendev20:22
*** dmellado has quit IRC20:27
*** dmellado has joined #opendev20:29
*** spotz has quit IRC20:38
clarkbinfra-root I've put nb03 in the emergency file. Now I'm going to pause all the image builds but centos-8 and centos-8-stream on nb03 and run the image built for testing my change?21:18
fungiclarkb: sounds like a good test21:19
clarkbthe only concern with that is it is the siblings image which means a few things will be installed from source. I'm double checking what those items are now21:19
fungiyou may need to delete old images, but that's no loss, they don't boot anyway21:19
clarkblooks like openstacksdk and disk image builder are the expected siblings. I think we'll be ok with sdk21:21
clarkbthis didn't work21:25
clarkbwe seem to not build arm64 images in those jobs :(21:26
clarkbI think I know a workaround to that though21:26
clarkbdoes pausing only pause uploads but not builds? it seems that it is building an ubuntu image after I restarted it back on the old image which has arm64 version21:27
fungioh, so did you do it in the config or the cli?21:28
fungialso there's pausing the image for a specific provider vs pausing the diskimage21:28
fungiclarkb: stevebaker has a comment on 785247 too21:30
stevebakerhey21:30
fungimight be faster to hash stuff out in here ;)21:31
stevebakeryes let's21:31
fungistevebaker: if you need to catch up, basically the secure boot change seems to have stopped us from being able to boot arm64 centos-(stream-)8 images21:32
*** lbragstad has quit IRC21:32
clarkbok give me a minute to clean up the mess I just made (basically put nb03 back to normal)21:32
clarkbbut I'm around21:32
stevebakerfungi: ack21:33
fungicurrently it tries to boot with a nonexistent loop device as the kernel root= parameter21:33
fungidracut gives up on mounting the rootfs eventually and drops to an emergency shell21:33
clarkbinfra-root I have restarted the builder on its normal image, updated the nodepool.yaml to unpause all images, and removed nb03 from the emergency file21:34
clarkbthis is the state it was in before I tried to test things21:34
stevebakerhopefully my suggestion to copy the grub.cfg to /boot/grub2 at the end of the function only for amd64 will result in working arm64 *and* amd64 images which boot in legacy bios21:34
clarkbstevebaker: I don't think your suggestion will fix anything. The problem is that block you modified is before we set grub settings so any grub config that is written then is bad21:35
clarkbI think the fix for your concern is to run mkconfig twice or copy it later in the script21:36
stevebakerclarkb: that is what I meant to suggest, build on https://review.opendev.org/c/openstack/diskimage-builder/+/785247 to copy it later in the script21:38
clarkbstevebaker: gotcha, sorry the comment context was in the location that was the problem. I'm working on an update to do that now21:38
*** spotz has joined #opendev21:38
stevebakerclarkb, fungi: I'd like to discuss ubuntu secure boot too, when you're ready21:39
fungiwell, currently we're just trying to get arm jobs running again without breaking what you implemented so far, further feature work should probably be discussed in #openstack-dib, and when more of the dib maintainers are around21:42
stevebakerok21:42
* stevebaker joins that21:42
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Properly set grub2 root device when using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78524721:43
clarkbI think ^ addresses both of the comments on the previous patchset21:43
clarkbif you can review that and see if it makes sense I can push a change to nodepool that builds an arm64 image of that (or attempts to anyway)21:43
mordredclarkb: your commit message makes sense to me21:44
fungiawesome, was just about to ask what the plan was to work around the lack of test image builds for arm21:44
clarkbI've removed the x86 image from nb03 too just to avoid any potential future confusion21:50
stevebakerclarkb: I've commented21:51
clarkbgood catch. Do we expect an efi config to be different than a legacy config? It's all about path lookups in the bios/uefi system right? not the actual grub config (because once grub is running we want it to do the same thing in both scenarios)?21:54
clarkbwe can't use hardlinks because these files are on different partitions? and we can't use symlinks because uefi is looking for a specific partition uuid right?21:56
stevebakerclarkb: I would assume they should be identical, until we come across a specific need to do otherwise21:57
*** slaweq has quit IRC21:57
clarkbstevebaker: should we always run the grub-install --modules on line 179?22:02
clarkbbased on comments above that line it seems to imply that is necessary for supporting legacy boot too22:02
stevebakerclarkb: #172 handles dual legacy boot for x86 efi. So #179 is only needed for installing grub as the bootloader when that distro hasn't been properly set up for booting the secure boot shim. The redhat version of grub2 now errors when doing this to force proper shim secure boot. This is what prompted me to do that change in the first place22:07
clarkbstevebaker: but that only addresses x8622:08
clarkbstevebaker: why not do the grub-install for everyone after setting the appropriate flags as before, then additionally ensure grub.cfg exists in the proper location?22:09
clarkbit's not clear to me why those would be mutually exclusive if the point is to produce images that do legacy and efi22:09
clarkb(since you need both to do that aiui)22:09
clarkbis the problem that uefi will prefer the generic and not specific grub install?22:11
clarkboh wait I think I may be conflating a couple of things. There is grub support for legacy bios. grub support for generic uefi. grub support for specific uefi with the shim22:11
clarkbI was treating the first two as if they were the same thing in my head22:11
*** ianw_pto is now known as ianw22:14
ianwo/22:14
clarkbI'm going to try and simplify some of the conditions in this script to make this more apparent22:14
stevebakerclarkb: grub2-install now fails on redhat in the efi case (landing in rhel-8.4) so it can't just be run for everyone. I'm using the presence of /boot/efi/$EFI_BOOT_DIR as a flag for "the shim is the bootloader, assume everything is set up correctly to just use that instead of grub"22:14
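The directory check stevebaker describes can be sketched in Python. This is only an illustration under stated assumptions: `shim_is_bootloader` is a hypothetical helper, the image root argument and the example value `EFI/centos` for `EFI_BOOT_DIR` are assumptions, and the real diskimage-builder logic is a shell script that may differ.

```python
import os


def shim_is_bootloader(image_root: str, efi_boot_dir: str) -> bool:
    """Hypothetical helper: treat the presence of <image_root>/boot/efi/<efi_boot_dir>
    as the signal that the distro already set up the shim, so grub-install
    should be skipped and only grub.cfg needs to be put in place."""
    return os.path.isdir(os.path.join(image_root, "boot", "efi", efi_boot_dir))


def plan_bootloader_step(image_root: str, efi_boot_dir: str) -> str:
    # Returns a label for the step to take; it does not run any command.
    if shim_is_bootloader(image_root, efi_boot_dir):
        return "use-shim"
    return "run-grub-install"
```

For example, `plan_bootloader_step("/path/to/image", "EFI/centos")` would report which branch applies to a mounted image tree.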
stevebakerianw: hai!22:14
fungiianw: aren't you still on vacation?22:14
*** lbragstad has joined #opendev22:15
clarkbstevebaker: right, but it is failing because we set --removable right?22:15
clarkbstevebaker: I think if we drop the --removable then rhel should be fine?22:15
ianwfungi: no back thu/fri , driving to sydney next week and working from there for week two of school holidays :)22:16
clarkbfungi: I think the dateline makes stuff weird22:16
ianweaster makes things weird, we have easter friday and easter monday here as public holidays22:16
ianwthis all sounds super fun and i'll read through scrollback ... :/22:17
clarkbhrm --removable also prevents updating nvram settings /me reads to see if a flag does just that22:22
stevebakerclarkb: It looks like --removable stops it messing with nvram, which you wouldn't want during an image build. The grub2-install error is because the redhat grub2 maintainer is very opinionated that every UEFI boot should be secure boot capable ;) and that /boot/efi/EFI/BOOT/BOOTX64.EFI should always be the shim bootloader, and never the grub binary.22:22
clarkbstevebaker: ya, my concern is that we aren't installing the modules that we want. We're essentially crossing our fingers that the rhel package maintainer understands that you might need these various modules22:23
clarkbthough the manpage for grub2-install implies the default is all so we're doing an optimization by reducing the list?22:24
clarkbstevebaker: what if we set --boot-directory to /boot/efi/EFI/centos ?22:25
stevebakerclarkb: we can't write out any grub binary in the secure boot case, unless we can sign them. /boot/efi/EFI/BOOT/BOOTX64.EFI shim is signed by microsoft and adds keys so that redhat signed /boot/efi/EFI/centos/grubx64.efi can run. All we need to do is ensure /boot/efi/EFI/centos/grub.cfg is there for grubx64.efi to use.22:30
clarkbstevebaker: and in the case that /boot/efi/EFI/BOOT/BOOTX64.EFI is grub and not the shim it knows to look at /boot/grub/grub.cfg?22:31
clarkbit appears that my suse system does this but also with a shim22:32
clarkboh it has the config in both locations22:32
clarkbstevebaker: I think I get it now. I'm just trying to make the script in dib more verbose about what is going on22:33
clarkbI'll get a new ps up shortly22:33
stevebakerclarkb: yeah I'm not sure about that case. Does grub2-install generate /boot/efi/EFI/BOOT/BOOTX64.EFI *and* /boot/efi/EFI/BOOT/grub.cfg?22:34
clarkbstevebaker: my suse install does not22:36
clarkbso it must be using /boot/grub/grub.cfg along with bios boot22:36
stevebakerit must be, yeah22:37
stevebakerthis may be complicated by each distro forking grub and doing different things :/22:38
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Properly set grub2 root device when using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78524722:43
clarkbstevebaker: fungi  ianw ^ ok that tries to simplify things just a bit more to make it more clear about what is going on (also comments)22:44
ianwi'm running a backup prune in a root screen on backup02 vexxhost22:45
clarkboh I've got the no secure boot warning in there twice now. Let me clean that up22:45
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Properly set grub2 root device when using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78524722:46
stevebakerclarkb: that looks good, thanks. I'll try building some images locally and trying (non) secure uefi and legacy boot, but I'm happy for this to land asap22:48
clarkbstevebaker: thank you for talking me through that, it was really helpful to understand the various scenarios at play22:49
clarkbI've rechecked https://review.opendev.org/c/zuul/nodepool/+/785286 which should get us an arm64 image we can use too22:50
ianwclarkb: do you have an arm64 node setup as a builder already to try that?23:07
clarkbianw: no was going to put it on nb03 with everything but centos-8 paused23:08
*** tosky has quit IRC23:11
fungithat will hopefully also get centos-8-arm and centos-stream-8-arm jobs running again without having to wait for a dib release23:18
clarkbthe image build for that change failed though. Looking at it now23:18
fungithey've been stuck and queuing up for roughly a week at this point23:18
clarkbI think I see the issue (the first one to fail anyway :) )23:19
clarkbI doubt that will be done before I need to figure out dinner. If someone else wants to give it a go feel free. Otherwise I'll try it in the morning23:24
ianwclarkb: sorry 785286 is what we need to look at?23:31
clarkbianw: https://review.opendev.org/c/openstack/diskimage-builder/+/785247 is the dib change and 785286 depends on it in nodepool to try and build an arm64 image with that fix in it23:32
clarkbianw: the jobs that run against dib only do x86 out of the box so the second change updates the job in nodepool to also do arm23:33
ianwok, so which one failed?23:34
clarkb785286's previous patchset failed to build the arm image23:35
clarkbhttps://5acac5304d1654f87681-c72835273dbeced4955618c918040aa4.ssl.cf5.rackcdn.com/785286/1/check/nodepool-build-image-siblings/ab6540a/job-output.txt is the failure23:35
clarkbcurrent versions of both changes are passing23:36
clarkbthough the dib one did fail a nonvoting job23:36
ianwok, so in short we want to test the dib in that builder image building a centos image right?23:39
clarkbyup and the key thing for this particular issue is the resulting centos image boots23:41
clarkbthe current images fail booting very early in the process23:42
clarkbbecause root device target for the linux kernel as set by grub is /dev/mapper/loop7p3 but should be LABEL=cloudimg-rootfs (I think I got the two names correct)23:42
fungias in dracut essentially times out waiting for someone to hotplug the root device it's been told to look for23:42
clarkband the reason this happened is for centos efi builds we were running grub-mkconfig prior to setting GRUB_DEVICE=LABEL=cloudimg-rootfs in /etc/default/grub23:43
ianwok.  i'll see about getting a test env up.23:44
ianwi feel like the devstack arm64 work has continued, but iirc has issues fitting into 8gb.  that's the requirement for getting an end-to-end arm64 test which would obviously be good23:44
clarkbwhat my fix attempts to do is run grub-mkconfig only once then copy the resulting config to the appropriate locations once it has been written and had some tweaks applied to it23:45
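The ordering problem clarkb describes can be modelled with a toy Python sketch. It does not reproduce how grub-mkconfig works: `mkconfig`, `build_broken`, and `build_fixed` are hypothetical stand-ins, and the one-line "config" exists only to show why generating the file before `GRUB_DEVICE` is set bakes in the build-time loop device.

```python
def mkconfig(defaults: dict, detected_root: str) -> str:
    """Toy stand-in for config generation: use GRUB_DEVICE when it is set,
    otherwise fall back to whatever root device was detected at build time."""
    root = defaults.get("GRUB_DEVICE", detected_root)
    return f"linux /vmlinuz root={root}"


def build_broken(detected_root: str) -> str:
    defaults = {}
    cfg = mkconfig(defaults, detected_root)  # generated too early
    defaults["GRUB_DEVICE"] = "LABEL=cloudimg-rootfs"  # set after the fact, no effect on cfg
    return cfg


def build_fixed(detected_root: str, targets: list) -> dict:
    defaults = {"GRUB_DEVICE": "LABEL=cloudimg-rootfs"}  # set before generating
    cfg = mkconfig(defaults, detected_root)  # generated exactly once
    # Copy the same finished config to every location that needs it.
    return {path: cfg for path in targets}


broken = build_broken("/dev/mapper/loop7p3")
fixed = build_fixed(
    "/dev/mapper/loop7p3",
    ["/boot/grub2/grub.cfg", "/boot/efi/EFI/centos/grub.cfg"],
)
```

In this model `broken` carries the build-time loop device as its root, while every entry in `fixed` carries the filesystem label, which matches the symptom and the intended result discussed above.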
ianwhttps://grafana.opendev.org/d/T5zTt6PGk/afs?viewPanel=34&orgId=1&from=now-7d&to=now ... i think i underestimated the time openafs requires to keep the tarballs mirror in sync23:51
