Thursday, 2020-08-20

openstackgerritIan Wienand proposed zuul/zuul-jobs master: edit-json-file: add role to combine values into a .json  https://review.opendev.org/74683400:46
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed  https://review.opendev.org/74706200:46
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround  https://review.opendev.org/74706300:46
ianwhrmmm, linaro mirror issues again ... https://zuul.opendev.org/t/zuul/build/f0f9658cd3ca40ff8abb74586e6bb569/console failed getting apt01:13
ianwdoesn't seem to be responding :/01:14
ianwSHUTOFF01:15
ianwagain01:15
ianwkevinz: ^01:15
ianwi feel like this has to be an oops taking it down01:15
ianwi think i might as well rebuild it as a focal node.  i'm not going to spend time setting up captures etc. for an old kernel01:17
ianwsigh ... bridge is dying too01:23
ianw$ ps -aef | grep ansible-playbook | wc -l01:23
ianw21101:23
ianwall stuck on /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-zuul.yaml >> /var/log/ansible/service-zuul.yaml.log01:23
ianwi've killed them all.  the log file isn't much help, as everything has tried to write to it01:26
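A minimal sketch of the cleanup described above, assuming the stuck runs are all invocations of the service-zuul playbook quoted in the log line; the match pattern is illustrative and should be checked against the actual process list first.

```shell
# Count the ansible-playbook processes wedged on the service-zuul playbook.
pgrep -af ansible-playbook | grep -c service-zuul.yaml

# Ask them to exit cleanly first, then force-kill any stragglers.
pkill -f 'ansible-playbook.*service-zuul.yaml'
sleep 30
pkill -9 -f 'ansible-playbook.*service-zuul.yaml'
```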
clarkbianw: I think that may be the result of our zuul job timeouts that run the service playbooks01:36
clarkbthey dont seem to clean up nicely (and we run zuul hourly to get images?)01:36
ianwclarkb: i'll keep it open and see if one gets stuck, it's easier to debug one than 200 on top of each other :)01:38
openstackgerritIan Wienand proposed opendev/system-config master: arm64 mirror : update to Focal  https://review.opendev.org/74706901:43
openstackgerritIan Wienand proposed opendev/system-config master: arm64 mirror : update to Focal  https://review.opendev.org/74706901:49
ianwok, we've caught an afs oops during boot -> http://paste.openstack.org/show/796970/02:03
ianwauristor: ^ ... if that rings any bells02:03
ianwi'm performing a hard reboot02:04
ianw... interesting .. same oops02:05
ianwso then we seem to be stuck in "A start job is running for OpenAFS client (2min 56s / 3min 3s)"02:06
ianw[    8.338401] Starting AFS cache scan... ; i wonder if the cache is bad02:07
ianwi'm going to delete /var/cache/openafs02:08
ianwthe server is up, but no afs, to be clear, at this point02:09
ianwwell that solved the oops, but still no afs.  i'm starting to think ipv4 issues again02:14
ianwhrm, i dunno, i can ping afs servers02:15
fungithat's booting the ubuntu focal replacement arm64 server?02:21
ianwfungi: no, the extant bionic one that died02:38
ianwi'm going to try rebooting it again ... in case the fresh cache makes some difference02:39
fungiokay, but you're ready for reviews on the focal replacement then02:41
ianwsort of, it hasn't been tested on focal arm64 i don't think, because the mirror is down02:42
ianwbut i think we can merge 74706902:42
ianwok, it's back, and ls /afs works ...02:44
ianwand now the system-config gate is broken due to some linter stuff ...02:46
openstackgerritIan Wienand proposed opendev/system-config master: arm64 mirror : update to Focal  https://review.opendev.org/74706902:56
openstackgerritIan Wienand proposed opendev/system-config master: Work around new ansible lint errors.  https://review.opendev.org/74709402:56
ianwok, back to the zuul thing.  one of the playbooks is stuck again03:08
ianwit's ... 30.248.253.23.in-addr.arpa domain name pointer zm05.openstack.org.03:09
ianwas somewhat expected, it accepts the ssh connection then hangs03:10
ianwstandardish hung tasks messages on console03:11
ianw#status log reboot zm05.openstack.org that had hung03:13
openstackstatusianw: finished logging03:13
openstackgerritMerged opendev/system-config master: Work around new ansible lint errors.  https://review.opendev.org/74709403:31
openstackgerritIan Wienand proposed opendev/system-config master: arm64 mirror : update to Focal  https://review.opendev.org/74706903:32
*** ysandeep|away is now known as ysandeep03:34
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix  https://review.opendev.org/74710004:11
openstackgerritIan Wienand proposed opendev/system-config master: run-base-post: fix ARA artifact link  https://review.opendev.org/74710104:13
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ara-report: add option for artifact prefix  https://review.opendev.org/74710004:39
openstackgerritMerged opendev/system-config master: arm64 mirror : update to Focal  https://review.opendev.org/74706904:42
*** raukadah is now known as chkumar|rover04:43
openstackgerritIan Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host  https://review.opendev.org/74482105:09
openstackgerritIan Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host  https://review.opendev.org/74482105:10
fricklerianw: seems logstash-worker08.openstack.org is broken, closes ssh connection immediately, failing the ansible deploy job. do you want to take a deeper look or just reboot via the API?05:33
ianwfrickler: sounds like the same old thing; i have the console up and can reboot it05:34
ianwshould be done05:42
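A hedged sketch of the recovery path discussed here, assuming OpenStack CLI access to the hosting cloud; the server name comes from the conversation, while the --os-cloud value is purely illustrative.

```shell
# Check the instance state and its console output for hung-task messages.
openstack --os-cloud example-cloud server show logstash-worker08.openstack.org
openstack --os-cloud example-cloud console log show logstash-worker08.openstack.org

# If the guest is wedged (ssh closing immediately, hung kernel tasks on the
# console), a hard reboot through the API is usually the quickest recovery.
openstack --os-cloud example-cloud server reboot --hard logstash-worker08.openstack.org
```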
*** lseki has quit IRC05:54
*** lseki has joined #opendev05:54
ianwkevinz: so i'm having trouble starting another mirror node ... it seems ipv4 can't get in.  i'm attaching to os-control-network.  it actually worked once, but i had to delete that node, and now it doesn't06:13
ianw os-control-network=192.168.1.63, 2604:1380:4111:3e54:f816:3eff:fe57:7781, 139.178.85.14406:17
ianwls -l /tmp/ | grep console | wc -l06:20
ianw10416106:20
ianwbridge has this many "console-bc764e02-6612-005b-e2c9-000000000012-bridgeopenstackorg.log" files06:20
ianwi've removed them06:23
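A small sketch of that cleanup, assuming the leftover console streamer logs all match the filename pattern quoted above and sit directly under /tmp on bridge.

```shell
# Count the accumulated Zuul console log files before touching anything.
find /tmp -maxdepth 1 -name 'console-*.log' | wc -l

# Remove them; -maxdepth 1 avoids descending into anything else under /tmp.
sudo find /tmp -maxdepth 1 -name 'console-*.log' -delete
```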
*** lpetrut has joined #opendev06:50
*** DSpider has joined #opendev07:02
*** hashar has joined #opendev07:04
zbranyone that can help with https://review.opendev.org/#/c/747056/2 ?07:10
yoctozeptomorning infra; is https://docs.opendev.org/opendev/infra-manual/latest/creators.html the right guide to follow if I want to coordinate the etcd3gw move under the Oslo governance? i.e. the project already exists and this guide assumes it does not - what should I be aware of?07:27
yoctozeptothe current repo state (for reference) is here: https://github.com/dims/etcd3-gateway07:30
yoctozeptoit already used the (very old) cookiecutter template for libs; depends on tox but obviously does not use Zuul but Travis07:31
*** dtantsur|afk is now known as dtantsur07:34
*** johnsom has quit IRC07:41
AJaegeryoctozepto: yes, that's the right guide - and it explains what to do to import a repository that exists.07:43
AJaegeryoctozepto: check step 3 in https://docs.opendev.org/opendev/infra-manual/latest/creators.html#add-the-project-to-the-master-projects-list07:44
*** rpittau has quit IRC07:47
*** fressi has joined #opendev07:48
yoctozeptoAJaeger: ah, thanks! I was misled by the toc: https://docs.opendev.org/opendev/infra-manual/latest/creators.html#preparing-a-new-git-repository-using-cookiecutter07:53
*** rpittau has joined #opendev07:56
*** johnsom has joined #opendev07:57
*** elod is now known as elod_off07:58
chkumar|roverHello Infra, We are seeing rate limit issue in gate job08:00
chkumar|roverhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_047/746801/2/gate/tripleo-buildimage-overcloud-full-centos-8/04734f1/job-output.txt08:00
chkumar|roverprepare-workspace-git : Clone cached repo to workspace08:00
chkumar|roverprimary | /bin/sh: line 1: git: command not found08:00
jrosseri have an odd failure here https://zuul.opendev.org/t/openstack/build/f267841a98b443808365468e94ccdfa9/log/job-output.txt#17808:00
jrosser^ same08:00
*** moppy has quit IRC08:01
chkumar|roverI think it is widespread on all distros08:01
*** moppy has joined #opendev08:01
openstackgerritAntoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them  https://review.opendev.org/74711908:05
cgoncalveschkumar|rover, jrosser: this may help https://review.opendev.org/#/c/747025/08:09
chkumar|rovercgoncalves: thanks, just opened a bug https://bugs.launchpad.net/tripleo/+bug/189232608:10
openstackLaunchpad bug 1892326 in tripleo "Jobs failing with RETRY_LIMIT with primary | /bin/sh: line 1: git: command not found at prepare-workspace-git : Clone cached repo to workspace" [Critical,Triaged]08:10
cgoncalvesinfra-root: would it be possible to manually trigger rebuild of nodepool images and push them to providers once https://review.opendev.org/#/c/747025/  merges?08:14
*** ykarel has joined #opendev08:14
*** tosky has joined #opendev08:18
openstackgerrityatin proposed zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role  https://review.opendev.org/74712108:21
*** lseki has quit IRC08:30
*** lseki has joined #opendev08:30
*** rpittau has quit IRC08:30
*** rpittau has joined #opendev08:30
*** johnsom has quit IRC08:30
*** johnsom has joined #opendev08:30
ykarelif some core is around please also check ^08:35
ykarelall jobs relying on this role are affected08:36
ianwcgoncalves: i think we might have to release dib now to get it picked up08:39
cgoncalvesianw, thing is we got ourselves in a chicken-n-egg situation where CI is failing to verify the revert08:40
ianwykarel: installing git there is probably a better idea than relying on it in the base image, at any rate08:40
cgoncalvesat least two voting jobs already hit RETRY_LIMIT08:40
ianwi think the build-only thing is a bit of a foot-gun unfortunately.  anyway, that's not of immediate importance08:42
ianwcgoncalves: will 747121 fix those jobs?08:42
cgoncalvesianw, I think so but I've been wrong many times before xD08:42
ianwwelcome to the club :)08:43
cgoncalvesthanks!!08:43
ianwi'm going to single approve 747121 as i think that should unblock things.  then we can worry about the slower path of reverting, releasing, and rebuilding nodepool images and then ci images08:48
ianwi have to afk for a bit08:48
*** priteau has joined #opendev08:50
openstackgerritMerged zuul/zuul-jobs master: Ensure git is installed in prepare-workspace-git role  https://review.opendev.org/74712109:02
openstackgerritTobias Henkel proposed openstack/project-config master: Create zuul/zuul-cli  https://review.opendev.org/74712709:13
openstackgerritTobias Henkel proposed openstack/project-config master: Create zuul/zuul-client  https://review.opendev.org/74712709:33
*** andrewbonney has joined #opendev09:41
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck  https://review.opendev.org/72933610:15
ykarelianw, Thanks for merging quickly10:58
ykarelyes, we should not depend on the base image; having it in the base image is a plus though as it saves a couple of seconds10:58
zbrAJaeger: tobiash: https://review.opendev.org/#/c/747056/ -- please review, it is needed for https://review.opendev.org/#/c/729336/11:11
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck  https://review.opendev.org/72933611:12
*** hipr_c has joined #opendev11:33
toskyhi! If I click on the "Unit Tests Report" link here https://zuul.opendev.org/t/openstack/build/0cd50335a91b4e22a4776001e2d8478512:19
*** jaicaa has quit IRC12:19
AJaegerzbr: please explain what the change is about so that I can decide whether to open it or not. I'm not reviewing either of these repos - and neither does tobiash. Please ask the rest of the admins later12:19
toskyI get an empty page on chrome and an encoding error on Firefox12:19
toskys/chrome/Chromium/12:19
AJaegertosky: is that only for this specific report - or for every? I'm wondering whether that single file is corrupt or whether there's a generic problem.12:20
AJaegertosky: I can confirm the error on Firefox12:21
toskyAJaeger: just that one12:22
toskyI understand it may be a specific and once-in-a-while issue12:22
toskybut just in case...12:22
*** jaicaa has joined #opendev12:22
hasharhello. I have a basic patch that fails the task "ubuntu-bionic: Build a tarball and wheel",  python setup.py sdist bdist_wheel   yields "no module named setuptools"12:29
hasharis that a known issue by any chance?  The repository is opendev/gear , patch is https://review.opendev.org/#/c/747119/112:30
AJaegertosky: ok. Hope other can help further12:30
toskyAJaeger: thanks for checking! I know it may not be fixed, and that file is not critical anyway12:38
toskyjust reporting in case other reports start to pile up12:38
*** hashar has quit IRC12:50
*** redrobot has quit IRC13:08
fricklertosky: AJaeger: looks like a bad upload to me, unless we see duplicates of that, I'd say this can happen and just do a recheck of that patch13:13
toskyack, thanks13:17
*** hashar has joined #opendev13:35
fungihashar: i've seen that when a different python is used than the one for which setuptools is installed. we should probably switch that from python to python3 if it's not using a virtualenv13:45
lourothi o/ "openstack-tox-py35 https://zuul.opendev.org/t/openstack/build/8f4947ec185c4479a57b552de4338956 : RETRY_LIMIT in 2m 54s"13:45
lourotthis happened on at least two of our (openstack-charmers/canonical) reviews this afternoon13:46
lourotthe job seems to fail apt-installing git on xenial, is it something you noticed already?13:47
hasharfungi: I am not sure I understand the reason ;]   I have a hard time finding out where the job "build-python-release" is defined  though13:47
fungilourot: that looks like the fallout from diskimage-builder removing git by default from images. we're hoping https://review.opendev.org/747121 fixes it so we don't have to wait for a revert and release in dib followed by nodepool image rebuilds and uploads to all providers13:47
fungihashar: take a look at the "console" tab for that build result and it shows the repository and path for the playbook which called the failing task, in this case opendev.org/opendev/base-jobs/playbooks/base/pre.yaml13:49
yoctozeptofungi: it seems xenial broke13:49
yoctozeptobecause it has no git packages13:49
fungihashar: er, sorry, i was looking at the wrong console, trying to answer too many questions at once13:49
lourotfungi, understood, thanks!13:50
hashar:]]]]]13:50
fungihashar: opendev.org/zuul/zuul-jobs/playbooks/python/release.yaml13:50
yoctozeptohttps://review.opendev.org/747121 broke xenial and now we can't merge https://review.opendev.org/74702513:50
fungiyoctozepto: thanks, yeah i think we need git-vcs on xenial... checking now13:51
hasharfungi: ahhh thank you very much. So yeah it runs {{ release_python }} setup.py sdist bdist_wheel , which would be python313:52
hasharand somehow I guess the base image lacks setuptools13:52
yoctozeptofungi: thanks13:53
fungihashar: we install setuptools for python3 i think, not python. ideally things should be calling python3 these days13:54
hasharoh13:54
hasharroles/build-python-release/defaults/main.yaml  has an override: release_python: python13:54
fungiyoctozepto: i was wrong, it's not git-vcs on xenial either, this error is strange, https://packages.ubuntu.com/xenial/git says it should exist13:55
fungihashar: yeah, probably we're not seeing this in other places because we set release_python: python3 (or something like that). you could check codesearch.openstack.org for release_python:13:55
*** ykarel is now known as ykarel|away13:55
openstackgerritAntoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release  https://review.opendev.org/74716713:56
*** ykarel|away is now known as ykarel13:56
hasharfungi: or maybe if the base image has python2, it should also have setuptools?13:57
hasharanyway, I might have found a way to set it to python313:57
yoctozeptofungi: lack of apt-get update perhaps?13:57
ykarelseems ^ the case for git not found13:57
yoctozeptoafter working a lot on centos it feels nice to just hit install13:57
yoctozeptobut debian does not think so :-)13:58
hasharfungi: thank you very much for your guidance13:59
fungiyoctozepto: yeah, i suspect we may have tried to install a package too early before we've primed the pump for mirror stuff13:59
yoctozeptofungi, ykarel: then let's just do the apt-get update in the role, shall we?14:00
fungithough strange that this is only showing up for xenial14:00
openstackgerritAntoine Musso proposed opendev/gear master: zuul: use python3 for build-python-release  https://review.opendev.org/74716714:01
yoctozeptomaybe bionic+ images have the cache in them14:01
yoctozeptowhich is valid enough14:01
ykarelin other images it seems installed, the task is returning ok13:57
yoctozeptoor xenial's apt just got b0rken in the meantime14:01
ykareli saw a bionic job's log14:01
yoctozeptosadly the gate on the zuul change will not trigger the issue14:01
yoctozeptomaybe the ubuntu images did not rebuild?14:02
yoctozeptoi mean bionic+ ones14:02
yoctozeptoif you say they're 'ok' and not changed14:02
yoctozeptotbh, I only saw centos failures in kolla today14:02
ykarelubuntu-bionic | ok https://5fadcfca1ff80d23fcf2-2bdb8be3dd1329f8a48d0e165eec17e9.ssl.cf2.rackcdn.com/746432/1/check/openstack-tox-py36/8f6daf0/job-output.txt14:02
yoctozeptoso might have been the case14:02
yoctozeptobingo14:03
fungii've only just rubbed the sleep from my eyes, started to sip my coffee and stumbled into this in the last few minutes, so still trying to catch up on what's been happening from scrollback14:03
yoctozeptofungi: it's a fire-fighting week for me14:03
ykarelmay be can hold a node? and see what's going to fix it quickly?14:03
yoctozeptocan't wait to see what Friday brings to the table14:03
fungiwe'll have to pick a change to recheck for the hold. i guess we can use the failing job for the dib revert14:04
fungiworking on that now14:05
ykarelstrange, in dib change the job passed in check, ubuntu-xenial | ok14:06
fricklerfungi: maybe we also want throw away current images and revert to the previous ones until we can fix dib?14:06
fungifrickler: i think we need to pause all image builds/uploads if we do that, because just deleting the images will trigger nodepool to start trying to upload them again14:08
fungilast time i tried that i think i must not have paused them correctly14:08
fungianyway, the autohold and recheck are in, now waiting for openstack-tox-py35 to get a node14:09
openstackgerritRadosław Piliszek proposed zuul/zuul-jobs master: Fix git install on Debian distro family  https://review.opendev.org/74717014:10
yoctozeptoin case we want to go the apt-get update route, I prepared the above ^14:10
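Roughly what the proposed role change amounts to on the Debian family, shown here as the equivalent manual commands rather than the actual Ansible diff; on a node whose package indices have not been fetched yet, the install step alone fails with "no installation candidate".

```shell
# Refresh the package lists first, then install git; a freshly booted image
# with an empty /var/lib/apt/lists cannot resolve the package otherwise.
sudo apt-get update
sudo apt-get install -y git
```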
fungionce we have this node held, i can also bypass zuul to merge 739717 so dib folks can continue with the revert14:14
dmsimardregarding that git install issue, I've also seen the issue in non-debian distros14:15
dmsimard"/bin/sh: line 1: git: command not found" on CentOS8: https://zuul.openstack.org/build/d48c1f1a9e024f7ba4b1d68dea285d3e/console#0/3/8/centos-814:16
fungidmsimard: yep, but for those the role is installing git successfully now i think14:17
dmsimardah, was there a separate fix ? not caught up with entire backlog14:17
fungidmsimard: yeah, https://review.opendev.org/74712114:17
dmsimardneat, thanks14:18
fricklerfungi: yeah, forcing the revert in would be the other option, but IIUC we'd need to have another dib release then, too. not sure who except ianw can do that14:20
fungioh, right, since this job is failing in pre we have to wait for it to fail three times before it will trigger the autohold :/14:21
*** lpetrut has quit IRC14:24
*** chkumar|rover is now known as raukadah14:33
fungiit's starting attempt #3 now14:33
openstackgerritRadosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw  https://review.opendev.org/74718514:36
mnaser`/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"      install 'git'' failed: E: Package 'git' has no installation candidate\n`14:39
fungii think we finally have a held node14:39
fungior should momentarily14:40
mnaser^ anyone seen this today? i'm not seeing anything in logs14:40
openstackgerritRadosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw  https://review.opendev.org/74718514:40
fungimnaser: yes, it's fallout from the fix for the fix for dib removing git14:40
fungiwe're now trying to install git in the setup-workspace role but can't figure out why xenial is saying there's no git package14:41
fungii'm trying to get a node with that failure held now to see if i can work out what we're missing14:41
yoctozeptofungi, mnaser: I bet on lack of apt-get update and await my fix merged :-) https://review.opendev.org/74717014:42
fungidoes retry_limit not trigger autoholds?14:42
mnaserouch14:42
fungioh, nevermind, zuul hasn't finalized that build i guess14:42
fungiseems the scheduler's in the middle of a reconfiguration event14:44
clarkbretrylimit will hold, but only the third and final instance14:44
fungiyeah, it's finally failed the third but the result is in the queue backlog while the scheduler's reconfiguring14:45
fungii was just being impatient14:45
fungiand there it goes14:46
fungithough i still don't have a held node yet14:47
clarkbanother approach is to manually boot a xenial node14:48
fungioh fudge, i pasted in the wrong change number14:49
fungiclarkb: well, we want to see what state the node is in when it's claiming it can't install git14:49
fungiso just booting a xenial image won't necessarily get us that14:49
clarkbit should be pretty close though14:50
clarkbprepare-workspace-git happens very early iirc14:50
fungiyep, our current suspicion is that it happens too early to be able to install distro packages14:50
fungilike before we've set up mirroring configs and stuff14:51
clarkbwe can add git to our infra package needs element too14:52
*** ysandeep is now known as ysandeep|away14:52
clarkbrather than revert dibs change and rerelease14:52
fungiyeah, i was considering that as a fallback option14:52
fungifallback to installing it in the prepare workspace role i mean14:52
fungii'm ambivalent on whether dib maintainers want to keep or undo the git removal14:52
fungii corrected my autohold and abused zuul promote to restart check pipeline testing on the change in question14:54
*** qchris has quit IRC14:57
fungii'm about to enter an hour where i'm triple-booked for meetings, but will try to keep tabs on this at the same time14:58
clarkbI'm slowly getting to a real keyboard and can help more shortly15:02
clarkbI'll probably work on the infrapackage needs change first so we've got it if we want it15:02
fungithanks15:02
clarkbgit is already in infra-package-needs15:10
clarkbis dib removing it15:10
*** larainema has quit IRC15:10
*** qchris has joined #opendev15:10
* clarkb needs to find this dib change15:10
fungiyeesh15:11
clarkbhttps://review.opendev.org/#/c/745678/115:11
clarkbya I think the build time only thing gets handled at a later build stage which then removes it15:11
fungiright, that was the change which triggered this15:12
clarkbbasically that overrides our explicit request to install the package elsewhere15:12
clarkbthat makes me like the revert more15:12
clarkbits one thing to install it at runtime because we didn't install it on our images. Its another to tell dib to install it on the image and be ignored15:13
clarkbI'm going to see if we can have the package installs override the other direction15:13
clarkbif you ask to install it and not uninstall it somewhere then don't uninstall it15:13
fungilooks like we're a bit backlogged on available nodes15:20
*** ykarel is now known as ykarel|away15:21
*** ykarel|away has quit IRC15:28
*** tosky_ has joined #opendev15:35
*** tosky has quit IRC15:36
*** tosky_ is now known as tosky15:37
fungiyep, test nodes flat-lined around 750 in use as of ~12:30z and the node requests have been climbing since15:37
fungicurrent demand seems to be around 2x capacity15:38
fungialso looks like we might could stand to have an additional executor or two15:39
fungisince around 14:00z there's been very little time where we had any executors accepting new builds15:40
fungiand the executor queue graph shows we started running fewer concurrent builds since then15:41
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed  https://review.opendev.org/74722015:41
clarkbsomething like that maybe?15:41
clarkbfungi: the pre run churn is likely part of that15:41
fungii agree, this is probably a pathological situation15:41
fungiopenstack-tox-py35 has finally started its first try15:42
*** mlavalle has joined #opendev15:45
fungilooks like these are spending almost as much time waiting on an executor as they are waiting for a node15:47
clarkbthat will be affected by the job churn15:48
fungiabsolutely15:48
clarkbsince we rate limit job starts on executors15:48
clarkbhttps://review.opendev.org/#/c/729336/ shows https://review.opendev.org/#/c/747056/ is working. fungi once the bigger fire calms down (the gate won't pass for this with broken git anyway) maybe we can get reviews on those?16:04
fungiyep!16:05
fungithat's good16:05
corvusclarkb: is there a fire that i can help with?16:05
fungialso i'll have a break from meetings in about 55 minutes, maybe sooner16:05
clarkbcorvus: there is a fire. I think we're just trying to confirm which of the various fixes is our best bet. TL;DR is https://review.opendev.org/#/c/745678/1 merged to dib and was released. This has resulted in dib removing git from our images even though we explicitly request for git to be installed in infra-package-needs.16:06
fungicorvus: we've discovered that if dib marks a package as build-specific like in https://review.opendev.org/745678 then you can't also explicitly install that package as a runtime need in another element16:06
clarkbcorvus: an earlier attempt at a fix does a git install in prepare-workspace-git. but on ubuntu we think that may need an apt-get update (fungi is working to confirm that now before we land the update change into prepare-workspace-git)16:07
clarkbcorvus: on the dib side I've written https://review.opendev.org/747220 to not uninstall packages if something requests they be installed normally16:07
clarkbfor some reason this seems to most affect xenial. (Do we know why yet?)16:07
fungione theory i've not had a chance to check is that we haven't uploaded new images for bionic et al yet16:08
fungiand so they already have git preinstalled causing that task to no-op16:08
openstackgerritPaul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role"  https://review.opendev.org/74723816:08
clarkb^^ that revert won't help anything aiui16:10
clarkbthe jobs will just fail on the next tasks16:10
clarkbfwiw I think its reasonable to make git an image dependency hence https://review.opendev.org/74722016:11
corvusclarkb, fungi: pabelanger left a comment that may be relevant on 74723816:12
clarkbcorvus: ya I think we'll be installing git with default package mirrors (whatever those may be)16:13
clarkbDNS should work on boot (thats a thing we've tried very hard to ensure)16:13
clarkbthough we aren't using the same images as ansible so ...16:13
openstackgerritPaul Belanger proposed zuul/zuul-jobs master: Revert "Ensure git is installed in prepare-workspace-git role"  https://review.opendev.org/74723816:13
fungii expect that yoctozepto's fix to do an apt update first would clear the error we're seeing with it, but i respect that ansible's use of the role may be incompatible with installing packages (though it should also no-op if the package is preinstalled in their images)16:14
yoctozeptohmm, based on https://review.opendev.org/747170 - I know why it did not fail on the change - it tests PREVIOUS playbooks16:14
yoctozeptofungi: yeah, if only it wanted to merge now that the queues are b0rken ^16:15
clarkbfungi: yup that is what I just noted on the change about the noop16:15
clarkbbsaically reverting that zuul-jobs change doens't help much if the images are broken16:16
clarkbif the images are fixed then it noops (so I think we should either fix zuul-jobs or ignore it in favor of fixing the images)16:16
clarkbthen we can swing around and clean up zuul-jobs as necessary16:16
clarkbunless paul has images with git and the only problem is that ansible doesn't noop there for some reason16:17
clarkb(that info would be useful /me adds to change)16:17
fungilooks like we have a node assignment for retry #3 on the job which my autohold is set for, and then i'll see if i can work out why that's failing so cryptically16:17
fungionce it gets a free executor slot anyway16:17
corvusiiuc that paul is saying dns is broken, it may be that yoctozepto's change is unsafe for paul16:18
corvus(because even a no-op 'apt-get update' would fail due to broken dns)16:18
yoctozeptofungi, clarkb, corvus: I think our best bet is to force-merge the zuul-jobs revert by pabelanger, then same with dib revert and rebuild the images16:18
clarkbcorvus: yup, that is why I'm thinking addressing the image problem is our best bet16:18
clarkbyoctozepto: but pauls revert shouldn't affect anything16:19
clarkbthats what I'm trying to say. If the images are fixed we don't need the revert. If the images are not fixed the revert won't help16:19
clarkbwe need to focus on the images imo16:19
yoctozeptoclarkb: but we do want the revert, let's start off the clean plate16:19
yoctozeptoanyhow, any idea why those jobs test the PREVIOUS playbooks?16:19
yoctozeptoI mean, they don't test the CURRENT change16:19
clarkbyoctozepto: the revert has no bearing on whether jobs will fail or pass. I think we should ignore it and focus on what has an affect16:20
clarkbthen later we can revert if we want to clean up16:20
clarkbyoctozepto: because they run in trusted repos16:20
clarkbyoctozepto: that is normal expected behavior by zuul16:20
yoctozeptoclarkb: ok, missed that16:20
yoctozeptoclarkb: so it's even in gate?16:20
yoctozeptoit's scary to +2 such changes there then16:21
clarkbyoctozepto: yes, you have to merge the change before it can be used. We have the base-test base job set up to act as a tester for these things16:21
fungiwhich was not used in this case because things were already broken16:21
corvusreal quick q -- since there's a fire, did we delete the broken images to revert to previous ones?16:22
clarkbcorvus: no because nodepool will just rebuild and break us again16:22
clarkb(at least that was my read of scrollback)16:22
corvuswell, that's what pause is for16:22
fungii think last time i tried to pause all image updates i got it wrong16:23
corvusi mean, we have a documented procedure for exactly this case.  if we had followed it, everything would not be broken.16:23
corvusfungi: as an alternative, if there is any confusion, you can just stop the builders16:23
yoctozeptocan we focus on force-merging the dib revert change? :-)16:23
corvusor we could follow procedure and not have to force-merge anything16:23
fungihttps://docs.opendev.org/opendev/system-config/latest/nodepool.html#bad-images16:23
fungimaybe we didn't have that documented the last time i tried to do it16:24
fungii think i'll have to run the nodepool commands from nb03?16:24
fungiall the others are docker containers now16:24
clarkbfungi: you docker exec16:24
yoctozeptocorvus: so you pause, delete newest ones, and get previous ones?16:25
fungiyoctozepto: the older images will be used automatically16:25
yoctozeptofungi: ack16:25
openstackgerritClark Boylan proposed openstack/project-config master: Pause all image builds  https://review.opendev.org/74724116:25
clarkbso we force merge ^ then delete the image(s)?16:25
corvussure, or delete the image and then regular-merge that16:26
clarkbfungi: `sudo docker exec nodepool-builder-compose_nodepool-builder_1 nodepool $command` from my scrollback on nb0116:26
fungior stop builders delete the image, regular merge that, then start builders again once it's deployed?16:26
corvusfungi: yes or that16:27
corvusmain thing is -- we shouldn't have to force-merge anything in this situation16:27
corvus(and we should be able to get people working again immediately)16:27
fungiokay, i'll start downing the builders now16:28
clarkbif we delete the image then regularl merge it will build and then upload I think? so ya downing seems better16:28
clarkb(the pause will only apply to builds after the config is updated iirc)16:28
yoctozeptothat sounds very nice16:29
fungidoing `sudo docker-compose down` in /etc/nodepool-builder-compose on nb01,02,04 and `sudo systemctl stop nodepool-builder` on nb03 now16:29
fungi#status log all nodepool builders stopped in preparation for image rollback and pause config deployment16:31
openstackstatusfungi: finished logging16:31
fungiso next we need to build a list of the most recent images where there is at least one prior image and the latest image was built within the past day16:32
corvusfungi: should just be the list of images with "00:" as the first part of the age column16:33
fungiyep, that's what i just filtered on16:34
fungii guess we can assume there are prior images for all of those16:34
corvusnodepool dib-image-list|grep " 00:"16:34
corvusfungi: if there aren't, i don't think it matters anyway (essentially, every 00: image is broken yeah?)16:34
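A hedged sketch of the deletion step being worked through here, wrapped in the docker exec invocation quoted later in the scrollback; the container name and the "built in the last day" filter come from the conversation and may need adjusting per builder host.

```shell
# Helper to run nodepool commands inside the builder container.
np() { sudo docker exec nodepool-builder-compose_nodepool-builder_1 nodepool "$@"; }

# Images built within the last 24 hours show an age column starting with "00:".
np dib-image-list | grep ' 00:'

# Delete each suspect build; running builders take care of removing the
# corresponding uploads and on-disk files.
for img in $(np dib-image-list | grep ' 00:' | awk -F'|' '{print $2}' | tr -d ' '); do
    np dib-image-delete "$img"
done
```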
fungiwell, technically it's been less than 24 hours since the regression merged16:35
clarkbcorvus: the centos-8 one is 14 hours old which may not be new enough16:35
clarkbbut I think we can just assume they are broken if new like that and clean them up16:35
fungimore important is when dib release was published i guess16:35
* clarkb checks zuul builds16:35
corvusthis is what i get for that: http://paste.openstack.org/show/797001/16:35
yoctozeptoI think https://review.opendev.org/747025 can (and should) be abandoned thanks to clarkb's patch16:36
clarkb05:28 UTC yesterday16:36
fungi3.2.0 appeared on pypi 05:31z yesterday, so yeah more than 24 hours maybe16:36
clarkboh today is the 20th not 18th16:36
clarkbso ya anything built in the last 24 hours is likely bad16:36
fungii guess we just start with 00:16:36
clarkbfungi: ++16:37
fungiif i nodepool dib-image-delete will that also delete all the uploads of that build?16:37
fungior do i need to also manually delete them?16:37
clarkbfungi: it will but only once the builders are started16:37
clarkb(same with the on disk contents)16:37
fungiohh... right16:37
clarkbthe zk db updates should be sufficient to start booting on the older images though16:37
corvusand yes, the docs say only to run "dib-image-delete"; image-delete is not necessary.16:38
clarkbactually wait my earlier day math was right. Today is the 20th. The release was 05:30ish on the 19th16:39
clarkbso about 11 hours ago16:39
clarkbI think that means the centos-8 image is ok16:39
clarkb(but deleting it is also fine)16:39
fungi05:31 utc yesterday is 24 hours before 05:31 utc today. it's now 16:40 utc, so >24 hours16:40
clarkbbah timezones16:40
fungii failed to delete centos-7-0000134775 because it was building not ready, i guess i should have filtered on ready too16:42
clarkbwe'll need to delete it when it goes ready16:42
clarkboh wait it wont16:43
clarkbbecause we stopped the builders :)16:43
fungiyep16:43
clarkbthat should autocleanup then. Cool16:43
fungiso this is the list: http://paste.openstack.org/show/79700316:43
fungifor posterity16:43
fungiall but centos-7-0000134775 are in deleting state now16:43
clarkbnow we should cross check with the image-list16:44
clarkbit may be the case that we need the builders running to update their states16:44
corvusi approved the zuul-jobs revert for paul16:44
corvusyes, i think the 'stop the builders' variant is untested16:44
fungithis has reminded me that last time we did it without stopping the builders and had to deal with them immediately starting to build new bad images16:45
fungigranted, that takes a bit of time16:45
fungiso maybe also okay16:45
corvusyes, "immediately" is relative here16:45
clarkbya we haven't updated the image-list16:46
fungialso while i was working on that, my autohold was finally satisfied, so i'll see if i can confirm why the apt install git was breaking16:46
corvussure, we would probably need to delete a few again16:46
clarkbwe can set those to delete too, or update a builder config to pause and start it16:46
corvusbetter to let the builder do it16:46
corvustbh, i'd like to just follow the directions we wrote :)16:46
fungiwell, nodepool dib-image-delete won't let us delete an image which is building, so we have to catch it between completing the build and uploading16:47
clarkbfungi: and we'll also start new image builds16:47
clarkbbut corvus is saying we should just manually delete those again when they happen16:47
fungithe directions we wrote last time ended us with the problem coming back because we didn't catch and delete the new images fast enough16:47
clarkbmaybe start just nb01 to minimize the number of builds that can happen? A single builder should handle cleanup just fine16:48
clarkbfungi: yes16:48
corvusyes, it's possible that one or two jobs may end up running on new images with this process.  but right now, we've been running thousands of jobs on bad images16:48
corvusso it's like a 10000000000% improvement16:48
clarkbshould I up the container on nb01?16:48
fungii suppose we could mitigate it by manually applying the pause configuration to all the builders before starting to delete images?16:48
clarkbfungi: we only need to start one, and yes we could manually apply the config there16:49
clarkb(corvus is saying don't bother though)16:49
*** fressi has left #opendev16:49
fungior do we then risk ansible deploying the old config back over them before the pause config is merged?16:49
clarkbfungi: I think the idea is even if we rebuild one or two images we can just delete them again16:50
clarkbwhile we land the pause config change16:50
clarkband if we restart only nb01 we'll minimize nodepools ability to build new images16:51
clarkbso I think that is safe enough16:51
clarkbcorvus: ^ is that basically what you are saying?16:51
fungiwfm16:51
corvusyou may need all the builders up.  but yes.16:51
clarkbok I'll start with nb01, then check and see if we need to start the others16:51
corvusi'm pretty much going to just keep saying "do what the instructions say"16:51
clarkbI'm making sure I'm interpreting them correctly as well as articulating the corner case(s) in what the instructions say16:52
clarkbNote the directions say to pause first, which we are not doing16:53
clarkbdo we want to manually edit the configs to pause first then?16:53
corvusnope16:53
corvusjust start the builders16:53
corvusmerge the change16:53
corvuskeep deleting broken images16:53
fungii'd like to improve the instructions if we can come up with a less racy process for this, or at least figure out what feature to implement in nodepool so we can eliminate the race condition16:53
clarkbok nb01 is running16:54
corvusfungi: sure it could be better, but i honestly don't think it's a big deal16:54
corvusand considering we went off-script (even after we decided to go on-script) by stopping the builders, i don't think we can actually say we followed them this time16:54
openstackgerritPierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information  https://review.opendev.org/74725616:54
corvusthey don't say anything about stopping or starting builders16:54
fungimy main concern is that in the past it's resulted in us telling people a problem is fixed, only to have it crop back up again hours later and then there's confusion as to when it was actually fixed and what can safely be rechecked16:54
corvusokay, let's add a paragraph at the end saying "if new images got built, delete those as well after the pause change has landed"16:55
fungithe instructions don't say to stop the builders, they also don't say to keep monitoring the builders and deleting new images which were started before the pause went into place16:56
corvussure, but they do say "if you have a broken image, delete it"16:56
clarkbnb01 is attempting to delete images according to the log16:57
clarkbthere are some auth exceptions to some url I don't recognize16:57
corvusclarkb: its own images or others?16:57
corvusclarkb: or rather, i think nb01 will only delete images on providers it talks to16:57
clarkbcorvus: so far just confirmed its own16:57
corvusso it may only delete non-arm images16:58
clarkboh ya good point16:58
* clarkb checks arm16:58
clarkblogan-: fwiw it seems we get a cert verification error talking to limestone. We can dig in more once the images are in a happier place16:58
clarkbya doesn't seem to have touched the arm64 images16:59
clarkbI'll start nb03 too16:59
clarkbok I think we're good until new images get uploaded (which will start with centos-7-0000134776 and ubuntu-xenial-arm64-0000094376 in an hour or two)17:01
fungiyoctozepto: i've confirmed that your proposed patch to apt update also wouldn't have helped. this is running before we've set our apt configuration so fails with "The repository 'http://mirror.dfw.rax.opendev.org/ubuntu xenial-security Release' is not signed. Updating from such a repository can't be done securely, and is therefore disabled by default."17:03
corvusclarkb: okay so should we zuul enqueue 747241 into gate?17:03
corvusi'm assuming it's partway through failing some check jobs or something on the old images17:03
clarkbcorvus: ya I thik we can try that now17:04
clarkbfungi: thats odd because we build the images using the mirrors to ensure we don't get ahead with their packages iirc. Which means we should bake in the override for that?17:04
fungiclarkb: apparently that's not carried over17:04
clarkbfungi: maybe that extra apt config is helpfully cleaned up17:04
*** dtantsur is now known as dtantsur|afk17:04
yoctozeptofungi: ack, I've abandoned it either way because the approach is wrong17:04
corvusclarkb: in progress17:04
fungiand yeah, it's also trying to use mirror.dfw.rax.opendev.org when it was booted in ovh-gra117:05
fungiso apparently *some* of the configuration is not cleaned up17:05
fungithough i did confirm that once the package lists were correctly updated, it was able to successfully install the git package17:06
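A diagnostic-only sketch of what was checked on the held node (hostnames and paths are whatever the image booted with; nothing here is the actual fix), confirming that the failure comes from package indices that cannot be refreshed before the job-side mirror setup has run.

```shell
# See which apt sources the image booted with.
cat /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null

# Reproduce the failure: on this node the refresh aborts with the
# "repository ... is not signed" error quoted above.
sudo apt-get update

# With stale or empty lists, apt has no installation candidate for git yet.
apt-cache policy git
```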
clarkbas next steps I'm thinking revert the dib change, push a release. Then we can land my fix and a revert revert (and test it) then do another release17:08
clarkbhttps://review.opendev.org/#/c/747025/ is the dib revert17:09
clarkbyoctozepto: ^ see plan above. I think it makes sense to test this more completely and start by going back to what is known to work then roll forward with better testing from there17:09
clarkbI'm going to recheck that change now17:09
corvusclarkb, fungi: the pause change is running jobs which have passed the point at which they're doing things with 'git'17:13
corvusso ++17:13
fungigood deal17:13
fungidoes pause cause uploads to be paused too, or just builds?17:17
openstackgerritMerged openstack/project-config master: Pause all image builds  https://review.opendev.org/74724117:17
corvusfungi: there's a pause for either; clarkb paused the builds17:18
yoctozeptoclarkb: I'm not sure I agree but it's not bad either17:18
corvusso uploads of already built images will continue17:18
fungicorvus: yep, thanks, just found that in the docs too17:19
corvus(i think that is fine and correct in this case)17:19
fungiso if we wanted to avoid uploading images which were in a building state when the diskimage pause was set, we'd need to also add it for all providers17:19
fungiwe don't have a mechanism for cancelling a build in progress, right? other than maybe a well placed sigterm17:20
clarkbfungi: ya killing the dib process would do it, but nothing beyond that iirc17:21
fungiand at that point it wouldn't retry the build because of the pause17:23
clarkbyes17:23
* clarkb is trying to figure out how to test https://review.opendev.org/747220 now17:24
corvusi was wondering if some of the nodepool/devstack jobs actually boot an image?  but they probably don't do anything on it17:25
corvusas a one-off, you could probably do something that verifies that git is installed on the booted vm?17:25
corvusbut also, aren't there some dib tests that can check stuff like that?17:26
corvus(ie, build the image, then verify contents?)17:26
corvusat the functional test level17:26
clarkbcorvus: they boot the vm and I think check that ssh works. Which makes me wonder if I should s/git/openssh-server/ as that will confirm the package ends up sticking around17:27
fungithat does seem like it could also just be added as commands in a very last stage of an element, so that if the sanity checks don't succeed the image build fails17:27
clarkbcorvus: for the functional level tests they seem pretty basic.17:27
clarkbbut maybe there is something there I am missing /me looks more17:27
fungithen the test would be to try building the image. if those checks fail, the image build fails and the job then fails17:28
clarkboh you know I can probably just run the scripts in that element and check the outputs17:28
*** sgw has joined #opendev17:30
*** andrewbonney has quit IRC17:35
clarkbya I think that is enough to show I've got a bug so I'll keep pulling on it that way17:36
*** hashar has quit IRC17:41
corvusi'm going to delete | ubuntu-xenial-arm64-0000094376  | ubuntu-xenial-arm64  | nb03.openstack.org | qcow2         | ready    | 00:00:12:45  |17:46
openstackgerritJeremy Stanley proposed opendev/system-config master: Docs: Extra details for image rollback  https://review.opendev.org/74726117:47
fungicorvus: thanks! related ^17:47
corvusfungi: i'm not sure that pausing the provider-images would be effective.  it can't go into effect any earlier than the dib pause, and i think the dib pause is sufficient to stop the upload17:50
fungioh, uploads won't occur if the diskimage build is paused?17:50
corvusfungi: that's my understanding of the intent of the code.17:51
fungithat's what i was asking earlier as to whether pausing the diskimage building would also pause uploading of the images17:51
corvusi may have misunderstood that question then17:51
fungiso if an image is in building state when the pause for it takes effect, once it reaches ready state the nodepool-builder won't attempt to upload it to providers?17:52
corvusi believe that's the intent, but i'd give it 50/50 odds that that's what actually happens, because that's essentially a reconfiguration edge-case.17:53
corvusbut other than that potential edge case, in general, pausing a dib should stop derived uploads.17:53
fungiwell, yeah, i mean if you don't build an image then there's nothing to upload17:54
corvusuploads fail all the time, so the builders are constantly retrying them17:54
corvus(this is why i may have answered your question in a different context earlier)17:55
fungioh, i see, so it would prevent the upload from being retried, but not from being tried the first time17:55
fungi(maybe)17:55
corvusfungi: i'm just hedging my answer because it's a really specific question which i'm not sure is covered by a unit test17:56
fungisure, makes sense17:56
corvusin general, i think what we all want to have happen is what the authors of the code wanted to have happen too17:56
fungiso maybe really the only race we've encountered is from deleting images before the pause takes effect17:56
corvusso i think our docs should reflect that, until we prove otherwise :)17:56
corvusfungi: that is my expectation17:56
corvusspeaking of which, if infra-prod-service-nodepool ran, successfully, shouldn't "pause: true" appear in /etc/nodepool/nodepool.yaml on nb03?17:58
fungithat's what i would have expected17:58
fungiunless infra-prod-service-nodepool isn't handling the non-container deployment?17:59
fungimaybe that's still being done by the puppet-all job?17:59
corvusthat may be the case18:00
fungieven though it's technically not being configuration-managed by puppet18:00
corvusnb01 has true18:00
corvuswill that end up updated by a cron or something?18:00
fungialso i don't know how far ianw got with bringing the mirror for the arm64 provider back to sanity, so it's possible arm64 builds are hopelessly broken at the moment either way18:01
fungilooks like infra-prod-remote-puppet-else is queued in opendev-prod-hourly right now18:02
corvusokay, given the limited impact, i don't think exceptional action is warranted.18:04
corvusfungi: presumably the currently-building fedora-30 image will be a test of your question18:05
corvusfungi: i've confirmed that dibs are paused on nb01, and it's 20m into a build of fedora-3018:05
fungiyeah, we'll know in a "bit" (or "while" at least) whether infra-prod-remote-puppet-else takes care of it18:05
corvusso maybe when it's done, before we delete it, let's check to see if it uploads18:06
fungisounds good18:06
fungithen i'll revise the docs change accordingly18:06
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Don't remove packages that are requested to be installed  https://review.opendev.org/74722018:06
clarkbthat is tested now. It fails pep8 locally but not on any of the files I changed? I want to see what zuul says about linting18:07
fungithough also i agree if nodepool is expected to not upload images in that state, it's probably something worth fixing in nodepool18:07
corvusclarkb: ^ fyi double check that there are no fedora-30-0000018222 uploads once it finishes building18:07
corvus(before deleting it)18:07
clarkbk18:08
corvusi'm going to take a break18:09
fungii'll be breaking in about an hour to work on dinner prep but keeping an eye on this in the meantime18:10
clarkbfungi: can https://review.opendev.org/#/c/747056/ get a review before dinner prep?18:24
fungiyep, looking18:26
fungideleting centos-8-arm64-0000006345 which went ready ~20 minutes ago18:28
*** hashar has joined #opendev18:30
fungidib-image-list indicates fedora-30-0000018222 went ready 2 minutes ago18:38
fungialso indicates that nb01 has started building fedora-31-000001197318:38
fungiso, um, does it not realize we asked it to pause?18:38
clarkbfungi: it started before the pause18:39
fungiit started 2 minutes ago18:39
clarkboh 31 not 3018:39
clarkbinteresting18:39
fungiyup18:39
fungialso i can confirm fedora-30 is "uploading" to all providers currently18:40
clarkbthe config for fedora-31 on nb01 clearly says pause: true18:40
clarkbmaybe its using cached config?18:40
fungithough that could also simply be because the builder didn't actually pause18:40
fungii'm deleting fedora-30-0000018222 now before it taints more job builds18:41
clarkbfungi: you mean because there is a bug?18:41
fungiwhich statement was that question in relation to?18:41
clarkb"though that could also simply be because the builder didn't actually pause"18:41
fungiyes, either a bug in nodepool or a bug in how we're updating its configuration18:42
fungilike does the builder daemon also need some signal to tell it to reread its configuration?18:43
fungior does it only read its config at start?18:43
clarkbreading the code it seems to read it on every pass through its run loop18:44
fungialso deleting debian-stretch-arm64-0000093525 which has gone ready18:44
clarkbfungi: I think the loop is roughly: while true: load config; for image in images: if image is stale, rebuild16:46
fungialso, the infra-prod-remote-puppet-else build in opendev-prod-hourly finished, but /etc/nodepool/nodepool.yaml on nb03 still hasn't been updated18:46
clarkbfungi: my hunch is that its going to try and build every image with the pre pause config as it loops through that list18:46
clarkbits not reloading the config between rebuilds until it gets through the whole list18:46
fungiso if we want it to take effect ~immediately that requires a service restart, otherwise it will take effect in 6-12 hours18:47
clarkbyes? Would be good for someone else to double check my read of the code but that is my read of it18:48
fungiand i suppose we should bump the config read down one layer deeper in the nested loop if so18:48
fungiin other news, /etc/ansible/hosts/emergency.yaml includes "nb03.openstack.org # ianw 2020-05-20 hand edits applied to dib to build focal on xenial"18:50
fungiso this marks the three-month anniversary of the last configuration update there, i suppose18:50
fungii'll edit its config by hand for now18:50
clarkbhrm I think that can be removed now, but we should confirm with ianw today18:50
fungii should have looked there sooner, but so much going on18:50
corvusyeah, sounds like restart is needed currently, and we should have nodepool reload its config after each image build18:51
fungi#status log edited /etc/nodepool/nodepool.yaml on nb03 to pause all image builds for now, since its in the emergency disable list18:52
openstackstatusfungi: finished logging18:52
fungii've restarted nodepool-builder on nb03 to get it to read its updated configuration now18:53
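The restart commands for the two builder deployment styles in play here, sketched under the assumption of the compose directory and unit name mentioned earlier in the scrollback.

```shell
# Containerized builders (nb01, nb02, nb04):
cd /etc/nodepool-builder-compose && sudo docker-compose restart

# The puppet/systemd-managed builder (nb03):
sudo systemctl restart nodepool-builder

# Sanity-check that the pause actually landed in the config the daemon reads.
grep -n 'pause' /etc/nodepool/nodepool.yaml
```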
fungiinterestingly, after a restart it immediately began building ubuntu-focal-arm6418:55
fungithe config sets pause: true for ubuntu-focal-arm6418:55
fungiwhy would it begin building?18:55
fungioh! because it's a pause under providers, not under diskimages18:56
fungiall the pauses in its config are providers18:56
* fungi sighs, then fixes18:56
clarkbfungi: oh sorry I missed the difference in context. Normally we have the images set to pause: false ahead of time to toggle them18:58
fungiwell, in this case the config on nb03 had them set to pause: false in the diskimages list for linaro-us, not the main diskimages definitions list18:59
fungiand even so, after fixing and another restart it's still starting to build yet another new image19:02
openstackgerritMerged opendev/system-config master: Convert ssh keys for ruby net-ssh if necessary  https://review.opendev.org/74705619:02
fungiubuntu-xenial-arm64 this time19:02
clarkbhave we restarted nb01?19:03
fungiaha! that one's on me, i missed adding a pause to ubuntu-xenial-arm6419:03
fungii haven't restarted anything else yet. was trying to wrestle nb03 into line19:04
clarkbgotcha19:04
clarkbshould I restart nb01 then so that it short circuits that loop?19:04
fungiplease do19:04
clarkbdone19:05
fungiokay, after correctly reconfiguring nb03 it's no longer trying to build new images19:05
clarkbthere are no building images now19:05
funginot sure why the pause: false placeholders were in the provider instantiations rather than the definitions19:05
fungii did double-check nb01 and it looked correctly configured by comparison19:06
fungino remaining images in a building state now19:06
clarkbya I think things have stabilized now. If we want we can start nb02 and nb0419:18
clarkbbut I'm going to get lunch first.19:19
clarkbthe dib change will be entering the gate soon I hope as well19:20
fungino need to start more builders until we're ready to un-pause them. they're just going to sit there twiddling their thumbs anyway19:24
clarkbyup dib change is gating now. Now I'm really getting food as its just sit and wait time for zuul to run jobs19:25
fungiyeh, disappearing to work on dinner now19:25
openstackgerritRadosław Piliszek proposed openstack/project-config master: Add openstack/etcd3gw  https://review.opendev.org/74718519:25
openstackgerritAntoine Musso proposed opendev/gear master: wakeConnections: Randomize connections before scanning them  https://review.opendev.org/74711919:51
*** hashar has quit IRC19:52
*** yoctozepto2 has joined #opendev20:06
*** yoctozepto has quit IRC20:07
*** yoctozepto2 is now known as yoctozepto20:07
*** smcginnis has quit IRC20:12
openstackgerritMerged openstack/diskimage-builder master: Revert "source-repositories: git is a build-only dependency"  https://review.opendev.org/74702520:37
clarkbI expect ianw will be around soon and we can talk about making a release with ^ next20:45
clarkbthen work to land my change to package accounting and land a revert revert20:45
*** sshnaidm is now known as sshnaidm|afk20:47
openstackgerritPierre Riteau proposed opendev/irc-meetings master: Update CloudKitty meeting information  https://review.opendev.org/74725620:50
*** priteau has quit IRC20:52
clarkbzbr: the fix for the puppet jobs has merged. I'll try to approve the e-r python3 switch tomorrow (I'm running out of daylight today and want to make sure all the cleanup from the dib stuff is in a good spot)20:57
openstackgerritMerged openstack/project-config master: Re-introduce puppet-tripleo-core group  https://review.opendev.org/74675921:00
clarkbthe dib change to modify how package installs are handled is passing tests and has new tests to cover the behavior at https://review.opendev.org/#/c/747220/21:18
clarkbI'll stack a revert revert on top of that now21:18
clarkbhrm do I need to rebase to do that?21:19
clarkbmaybe I won't stack then21:19
ianwclarkb: hey, looking21:49
fungiianw: to catch you up, all diskimages are paused for all builders right now, and we've deleted the most recent diskimages. i manually edited the config for nb03 since it's been in the emergency disable list for months. we also discovered that builders won't notice config changes straight away generally, and so a restart is warranted if you need them to immediately apply21:51
fungioh, and on nb03 i moved the pause placeholders out of the provider section into the diskimage definitions to pause building instead of only pausing uploading21:52
ianwsigh ... so i guess we exposed a lot of assumptions about git being on the host ...21:53
fungiianw: well, also we actually explicitly install git in infra-package-needs21:53
fungibut the change to dib "cleans it up" helpfully anyway21:53
johnsomFYI, docs.openstack.org seems to not be responding21:53
fungijohnsom: thanks, checking now21:54
ianwhttps://review.opendev.org/#/c/747121/ didn't work?21:54
fungiianw: at the time setup workspace runs, we haven't configured package management on the systems yet, and they don't have package indices on debuntu type systems at that point21:54
clarkbianw: fungi and pabelanger in particular doesn't even have working dns at that point21:55
fungijohnsom: it isn't down for me21:55
johnsomfungi Yeah, just started loading for me21:55
clarkbbut ya we explicitly install git in infra-package-needs and so dib shouldn't undo that21:55
fungiyeah, i was about to add, also other users of that role don't even necessarily have fundamental network bits in place yet21:55
fungiso trying to install packages at that point is going to break for them regardless21:56
ianwok, so git is a special flower21:56
ianwthe revert is in -2 https://review.opendev.org/#/c/747238/21:57
fungijohnsom: looks like the webserver temporarily lost contact with the fileserver for six seconds at 21:47:22 and again for 27 seconds at 21:47:31 and another 7 seconds at 21:54:0021:58
johnsomThat would do it.21:58
ianwso there hasn't been a dib point release?21:59
clarkbianw: not yet, the revert merged not that long ago so I figured we'd wait for you just to double check21:59
clarkbianw: but I think we do that release then work on something like https://review.opendev.org/#/c/747220/ as the next step22:00
ianwclarkb: ok, your merging change lgtm as a stop-gap against this returning22:00
clarkbthen people can have git removed if they don't explicitly install it elsewhere22:00
ianwi agree, let me then push a .0.1 release22:00
fungijohnsom: which in turn seems to be due to high iowait on the fileserver out of the blue: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=6397&rra_id=all22:01
fungii'm trying to ssh into it now22:01
clarkbfungi: did static fail over to the RO server?22:02
clarkb(I think that is how it is supposed to work so yay if it did)22:02
fungiclarkb: i'm not sure, all the errors in dmesg are about losing and regaining access for 23.253.73.143 (afs02.dfw)22:03
fungiand i'm still waiting for ssh to respond on it22:03
ianwfungi/clarkb: so now we need to roll out 3.2.1 to builders and rebuild images?22:03
fungichecking oob console too22:03
fungiianw: yeah22:03
fungii'm woefully overdue for an evening beer22:04
ianwi guess the best way to do that is to bump the dib requirement in nodepool?22:04
fungiianw: or at least blacklist 3.2.022:04
*** tosky has quit IRC22:07
clarkbya adding a != 3.2.0 is what I would do22:07
clarkband then we need to revert the pause change22:07
fungioob console is showing hung kernel tasks22:07
clarkband start builders on nb02 and nb0422:07
ianwi feel like before we've just done a >=22:08
ianwgreat, my ssh-agent seems to have somehow died22:08
clarkbI think that's fine too22:08
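The two spellings being weighed (an exclusion like diskimage-builder!=3.2.0 versus a floor like >=3.2.1) express the same intent here. A hedged illustration of checking an installed dib against either constraint with the packaging library; the requirement strings are examples, not necessarily the exact line nodepool ended up with:

    from importlib.metadata import version          # Python 3.8+
    from packaging.specifiers import SpecifierSet

    installed = version("diskimage-builder")

    # Either spelling keeps the broken 3.2.0 release out.
    for spec in ("!=3.2.0", ">=3.2.1"):
        ok = installed in SpecifierSet(spec)
        print(f"diskimage-builder {installed} satisfies '{spec}': {ok}")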
fungisome day distros will start having console kmesg spew use estimated datetime rather than seconds since boot22:08
fungihung kernel tasks on afs02.dfw began 23818825 seconds after boot22:09
fungiif that was ~now then it means the server was booted 2019-11-19 05:5022:11
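The back-of-the-envelope conversion fungi is doing: the hung-task messages carry seconds-since-boot timestamps, so subtracting that offset from the approximate current time gives the boot time. Reproduced as a quick check:

    from datetime import datetime, timedelta, timezone

    uptime_at_message = 23818825          # seconds since boot, from the console
    approx_now = datetime(2020, 8, 20, 22, 11, tzinfo=timezone.utc)

    boot_time = approx_now - timedelta(seconds=uptime_at_message)
    print(boot_time)                      # ~2019-11-19 05:50 UTC, matching the estimate above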
fungichecking to see if we happened to log that22:11
fungiyay us! "2019-11-19 06:09:03 UTC rebooted afs02.dfw.openstack.org after it's console was full of I/O errors. very much like what we've seen before during host migrations that didn't go so well"22:12
fungiunfortunately unless it miraculously clears up, this probably means an ungraceful reboot, fsck and then lengthy full resync of all afs volumes22:14
fungithe cacti graph is also less reassuring... looks like the server stopped responding to snmp entirely 20 minutes ago22:15
fungiinfra-root: i'm going to hard reboot afs02.dfw22:15
clarkbfungi: ok22:15
clarkbalso looks like docs is still unhappy, implying we aren't using the other volume?22:16
fungihopefully once it's down all consumers will switch to the other server22:16
clarkbah ok maybe that is what is needed to flip flop22:16
ianwmy notes from that day say22:17
ianw  * eventually debug to afs02 being broken; reboot, retest, working22:17
fungi#status log hard rebooted afs02.dfw.openstack.org after it became entirely unresponsive (hung kernel tasks on console too)22:17
openstackstatusfungi: finished logging22:17
ianwthat i didn't log something about having to rebuild the world might be positive :)22:18
fungiianw: the subsequent entries in our status log worried me, until i realized that they were actually the result of a problem with afs01.dfw some days earlier which we didn't really grasp the full effects of until afs02.dfw hung22:19
fungidocs.o.o seems to be back up for me, btw22:20
ianwto nb03 -- i have hand edited the debootstrap there to know how to build focal images.  the plan was to get that replaced with a container.  *that* has been somewhat sidetracked by the slow builds of those containers.  which led to us looking at arm wheels.  which led to us doing 3rd party ci for cryptography22:21
ianwwhich led to us finding page size issues with the manylinux2014 images, which has led to patches for patchelf22:21
ianwi think this might be the definition of yak shaving22:21
fungiianw: ubuntu is usually good about backporting debootstrap so you can build chroots of newer releases on older systems22:22
ianwperhaps in the meantime xenial has updated its debootstrap22:22
ianwi don't think so, last entry seems to be 201622:23
fungi:(22:23
ianwsorry, better if i look in the updates repo22:24
fungicheck xenial-backports22:24
fungibut yeah, not in xenial-backports22:24
ianw * Add (Ubuntu) focal as a symlink to gutsy.  (LP: #1848716)22:24
openstackLaunchpad bug 1848716 in debootstrap (Ubuntu) "Add Ubuntu Focal as a known release" [High,Fix released] https://launchpad.net/bugs/1848716 - Assigned to Łukasz Zemczak (sil2100)22:24
ianw -- Łukasz 'sil2100' Zemczak <lukasz.zemczak@ubuntu.com>  Fri, 18 Oct 2019 14:17:06 +010022:24
ianwhrmm, i wonder if we don't have that22:24
ianwoh i think that's right, we need 1.0.114 for some other reason22:27
ianwhttps://launchpad.net/~openstack-ci-core/+archive/ubuntu/debootstrap/+sourcepub/11302190/+listing-archive-extra22:27
ianwhttp://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:30:19 and that's the discussion about it all ...22:31
clarkbis the debootstrap fix not in our ppa?22:32
clarkbif it is can't we turn ansible puppet back on?22:32
ianwit is; i think we can probably turn puppet back on.  i'm starting to think i might have just forgotten to do that after building ^^^22:32
clarkbgotcha22:32
ianwthe reason we run the backport is to build buster (http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-05-19.log.html#t2020-05-19T09:36:59)22:33
ianw*that's* why the xenial-updates version doesn't work for us, it can build focal but not buster22:34
clarkbah22:35
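Background for the debootstrap exchange above: new releases are normally supported just by adding a symlink in debootstrap's scripts directory (Ubuntu releases such as focal typically point at gutsy, Debian releases at sid), which is why a small backport or PPA rebuild is enough. A quick way to see what an installed copy knows about, assuming the standard script layout:

    import os

    scripts = "/usr/share/debootstrap/scripts"   # standard location on Debian/Ubuntu
    for release in ("focal", "buster"):
        path = os.path.join(scripts, release)
        if os.path.islink(path):
            print(f"{release} -> {os.readlink(path)}")
        elif os.path.exists(path):
            print(f"{release}: regular script")
        else:
            print(f"{release}: not known to this debootstrap")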
ianwclarkb: so you're looking into the "builders don't notice config changes"?22:36
fungiianw: he's got a fix proposed22:37
fungihttps://review.opendev.org/74727722:37
clarkbhttps://review.opendev.org/747277 is that proposed fix22:37
ianwok, https://review.opendev.org/#/c/747277/ ...22:37
ianwjinx22:37
fungithey *do* (eventually) load config changes22:37
fungijust not until after cycling through all the defined images which need builds22:37
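A sketch of the behaviour fungi describes; this is illustrative pseudocode, not nodepool's actual implementation. The builder only re-reads its config at the top of its build loop, so a change landed mid-pass isn't seen until every pending image build in the current pass has finished:

    import time

    def builder_loop(load_config, build_image, interval=60):
        while True:
            config = load_config()            # config changes are only picked up here
            for image in config["diskimages"]:
                if image.get("pause"):
                    continue
                build_image(image)            # each build can take hours
            time.sleep(interval)              # only then does the next pass reload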
ianwso, before everyone eod's :)  i can monitor the deploy of https://review.opendev.org/747303 and re-enable builds.  nb03 we can probably re-puppet, i'll look into that.  and clarkb has the config-not-noticed issue in review22:39
ianwi think those were the 3 main branches of the problem?22:39
fungiyep, i think that covers it22:40
clarkbwe also want a revert of the pause change?22:40
clarkb I guess that falls under re-enabling builds22:40
fungigood reminder that we need to do that part though, yep22:41
ianwyeah, i can watch that22:42
*** mlavalle has quit IRC22:56
ianwkevinz: if you can give me a ping about ipv4 access in the control plane cloud in linaro that would be super :)22:58
clarkboh that was the other thing I noticed22:58
clarkblimestone has an ssl cert error22:58
clarkbI don't think it is an emergency but once the other fires are out we should look into that /me makes a note for tomorrow and will try to catch lourot22:58
clarkber logan- sorry lourot bad tab complete22:58
ianwclarkb: rejection issues or more like not in container issues?23:05
clarkbianw: I think that cloud may use a self-signed cert and we explicitly add a trust for it? and ya maybe that isn't bind mounted or now it's an LE cert or something23:06
clarkbI should actually point s_client at it23:06
clarkbya s_client says it is a self signed cert23:07
clarkbso we're probably just not supplying the cert in clouds.yaml for verification23:08
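The check clarkb describes, done from Python rather than openssl s_client, plus the likely fix of pointing clouds.yaml at the provider's CA so verification can succeed. Hostname, port, and file path below are placeholders, not limestone's real endpoint:

    import ssl

    endpoint = ("keystone.example-limestone.cloud", 5000)   # placeholder endpoint
    pem = ssl.get_server_certificate(endpoint)               # fetches the cert without verifying it
    print(pem.splitlines()[0], "...")                        # confirms we got a certificate back

    # The corresponding clouds.yaml entry would then carry something like:
    #   clouds:
    #     limestone:
    #       cacert: /etc/openstack/limestone-ca.pem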
openstackgerritIan Wienand proposed openstack/project-config master: Revert "Pause all image builds"  https://review.opendev.org/74731223:23
*** DSpider has quit IRC23:41
fungiinfra-root: i keep forgetting to mention, but i'm planning to try to be on "vacation" all next week. in theory i'll be avoiding the computer23:44
ianwfungi: jealous!  i will be within my 5km restriction zone and 1hr of exercise time :/23:49
fungioh, i'm not going anywhere. i'll probably be put to work on a backlog of home improvement tasks23:50
clarkbfungi: but will you go past 5km?23:51
fungidoubtful. the hardware store is at most half that23:51
ianwheh, you could if you *wanted* to though :)23:53
ianwso the nodepool image is promoted, i guess we just need to wait for the next hourly roll out23:54
*** knikolla has quit IRC23:56
*** dviroel has quit IRC23:56
fungiianw: i *could* but i'd rather keep my good health ;)23:56
*** aannuusshhkkaa has quit IRC23:56
clarkbianw: yes the next hourly run should even restart the builders iirc23:57
*** ildikov has quit IRC23:58
*** knikolla has joined #opendev23:58
