Friday, 2020-05-22

*** tosky has quit IRC00:01
fungiinfra-root: we got a ticket from rackspace saying the host for paste01 is "showing imminent signs of hardware failure" so looks like they're going to migrate the instance. maybe related to the connectivity issue this time? maybe coincidence? maybe the migration will fix the connectivity issue anyway? place your bets!00:14
ianwfungi: my bet is that it's something to do with the migration that then breaks the ipv600:19
clarkbianw I do think that fixture would be helpful. Is it ready for review?00:41
ianwclarkb: yep00:41
ianwand it's used in the follow-on to autogen the ssl check list00:42
*** Meiyan has joined #opendev00:59
*** ysandeep|away is now known as ysandeep01:02
fungiianw: ooh, interesting theory... leftover routes or neighbor discovery responses for the old host?01:26
fungisomething cached01:27
ianwyeah, i don't think we could tell without backend access01:28
fungiagreed01:28
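A minimal sketch of the sort of checks available from bridge without backend access; the hostname, the v6 address and the interface details are placeholders, not taken from the log:

    ping -6 -c 3 paste01.opendev.org        # is the host reachable over v6 right now?
    traceroute -6 paste01.opendev.org       # if not, where does the path stop?
    ip -6 route get 2001:db8::1             # substitute paste01's actual v6 address
    ip -6 neigh show                        # look for stale or FAILED neighbour entries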
*** mlavalle has quit IRC02:20
*** elod has quit IRC03:23
*** elod has joined #opendev03:35
openstackgerritIan Wienand proposed opendev/system-config master: Add tool to export Rackspace DNS domains to bind format  https://review.opendev.org/72873904:00
*** Meiyan has quit IRC04:11
openstackgerritIan Wienand proposed opendev/system-config master: Add tool to export Rackspace DNS domains to bind format  https://review.opendev.org/72873904:20
*** sshnaidm is now known as sshnaidm|off04:33
*** ykarel|away is now known as ykarel04:34
ianwinfra-root: ^ i have done a manual run of that tool and the results are in bridge:/var/lib/rax-dns-backup04:42
ianwclarkb: did you get an answer on whether we could post the openstack.org zone for audit on a public tool?04:43
clarkbianw: fungi (or I) were going to share the output with them privately and have them double check first04:48
clarkbI dont think that has happened yet04:48
clarkbbut with the info on bridge that will make it easy04:49
ianwnp, we end up with 39 domains dumped in total when we walk the domain list04:50
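A hedged way to sanity-check the dump on bridge; the one-file-per-domain layout and the .zone suffix under /var/lib/rax-dns-backup are assumptions about how the export tool names its output:

    for z in /var/lib/rax-dns-backup/*.zone; do
      named-checkzone "$(basename "$z" .zone)" "$z" || echo "FAILED: $z"
    done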
*** sgw has quit IRC06:01
*** slaweq has joined #opendev06:57
openstackgerritzhangboye proposed openstack/diskimage-builder master: Add py38 package metadata  https://review.opendev.org/73022007:04
*** ysandeep is now known as ysandeep|afk07:12
*** ysandeep|afk is now known as ysandeep07:34
*** tosky has joined #opendev07:34
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck  https://review.opendev.org/72933607:39
*** DSpider has joined #opendev07:51
openstackgerritSorin Sbarnea (zbr) proposed openstack/diskimage-builder master: Validate virtualenv and pip  https://review.opendev.org/70710407:58
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Bump ansible-lint to 4.3.0  https://review.opendev.org/70267908:04
*** tkajinam_ has quit IRC08:05
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck  https://review.opendev.org/72933608:26
*** lpetrut has joined #opendev08:26
*** larainema has joined #opendev08:29
*** hashar has joined #opendev08:32
*** elod has quit IRC08:43
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: revoke-sudo: improve sudo removal  https://review.opendev.org/70306508:44
*** elod has joined #opendev08:50
*** ykarel is now known as ykarel|lunch08:56
*** elod has quit IRC08:56
*** elod has joined #opendev08:58
*** ysandeep is now known as ysandeep|lunch09:09
*** elod has quit IRC09:10
*** elod has joined #opendev09:10
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: bindep: Add missing virtualenv and fixed repo install  https://review.opendev.org/69363709:10
openstackgerritRotanChen proposed openstack/diskimage-builder master: The old link does't work,this one does.  https://review.opendev.org/73028609:35
slaweqfungi: mordred clarkb: thx a lot, sorry but I was busy yesterday and missed what You told me here. I will try to use that NODEPOOL_MIRROR_HOST variable next week in neutron-tempest-plugin jobs09:51
*** ysandeep|lunch is now known as ysandeep10:01
*** yuri has joined #opendev10:04
*** ykarel|lunch is now known as ykarel10:10
*** hashar has quit IRC10:31
hrwclarkb: thanks for the invitation. Will discuss with my manager and then reply.10:39
*** roman_g has joined #opendev11:01
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner  https://review.opendev.org/72868411:26
fungiianw: clarkb: i shared the original list with them yesterday... consensus was there's nothing sensitive in there to worry about, but lots of abandoned records they plan to clean up11:40
zbrwhat is the status of gentoo support in zuul-jobs? i see failures like https://zuul.opendev.org/t/zuul/build/ddc06a12b0f44d7a991cc4799c98b7cc11:56
zbrcan we make it non voting?11:56
zbrthat reminds me of an older question: who decides when to add/drop support for a specific operating system in zuul-roles?11:57
zbrit can easily grow out of control, especially by introducing less mainstream platforms11:57
*** priteau has joined #opendev12:33
*** hashar has joined #opendev12:36
hrwwho can I talk with about build-wheel-mirror-* CI jobs?12:45
fungiprobably any of us, what's the question?12:46
hrwI should probably have found it 2-3 years ago ;D12:47
hrwfrom what I see it is used by requirements to build x86-64 wheels and push them to infra mirrors12:47
hrwlooks like I should add aarch64 to it and then all aarch64 builds will speed up a lot12:48
hrwas numpy/scipy/grpcio etc will be already built as binary wheels on infra mirrors12:48
hrwam I right?12:48
AJaegerhrw: https://review.opendev.org/#/c/550582 was pushed 2 years ago but never moved forward, not sure why. that gives a start.12:50
AJaegerhrw: yes, that should speedup the builds12:50
hrwAJaeger: will concentrate on getting it working12:51
hrwI had no idea that such thing exists12:51
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Minor documentation rephrase  https://review.opendev.org/72864012:52
hrwit is not used even on x86-6412:53
hrwas it is run only when bindep changes instead of upper-constraints12:53
zbrfungi: clarkb ok to merge https://review.opendev.org/#/c/729974/ ?12:53
fungihrw: i think we were previously waiting to have a stable arm64/aarch64 provider to run the job in, but now that we do we should be able to run a mirror-update job there12:54
fungihrw: we run a periodic job, hold on i'll find it12:54
hrwfungi: thanks12:55
AJaegerhrw: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5442 - it's run every day12:56
AJaegerfungi, do we need to publish wheels for focal and CentOS-8 as well? I don't see them12:57
fungiAJaeger: eventually, i expect12:58
hrwAJaeger: ok. so have to add job there12:58
fungiinfra-root: bridge.o.o can now reach paste.o.o over ipv6, so may have been related to (or fixed by) host migration after all12:58
hrwAJaeger: c7 wheels should work on every other distro (maybe not xenial)12:59
hrwmanylinux2014 PEP defines c7 as base12:59
hrwhttps://review.opendev.org/#/c/728798/ is finally able to build all wheels as a CI job, with one Debian package build on the way13:00
*** ykarel is now known as ykarel|afk13:02
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL  https://review.opendev.org/73032213:12
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL  https://review.opendev.org/73032213:17
hrwneed to find a job which makes use of those wheels from infra mirror13:22
hrwok, I see it used to create the venv. now, in kolla we need to sneak it into being used in the build too13:24
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL  https://review.opendev.org/73032213:27
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683013:28
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64)  https://review.opendev.org/73032313:31
hrwAJaeger: here you have - focal, buster, centos813:31
hrwbut it is probably not complete13:32
hrwrelease.yaml has an afs_volume list which needs to be filled with extra entries13:33
hrwI may only guess their names13:34
fungihrw: if you're looking for the magic to get those provider-local wheelhouse caches, it's done with the /etc/pip.conf our base jobs install on all nodes13:34
hrwfungi: thanks!13:34
fungiso if you need them in a container chroot or something you could bindmount that in13:34
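A minimal sketch of what that /etc/pip.conf might contain; the exact option names are assumptions, while the wheel URL shape matches the download line hrw pastes later (14:47):

    cat > /etc/pip.conf <<'EOF'
    [global]
    index-url = http://mirror.bhs1.ovh.opendev.org/pypi/simple/
    extra-index-url = http://mirror.bhs1.ovh.opendev.org/wheel/ubuntu-18.04-x86_64/
    EOF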
*** lpetrut has quit IRC13:35
hrwcool13:35
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64)  https://review.opendev.org/73032313:36
hrwwith afs_volume names in it. guessed ones, so someone needs to take a look and fix them13:36
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL  https://review.opendev.org/73032213:36
hrwnow time to add arm64 ones on top13:36
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Undefined envlist should behave like tox -e ALL  https://review.opendev.org/73032213:43
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for more distributions (x86-64)  https://review.opendev.org/73032313:44
hrwDebian needs py2 too13:44
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683013:44
zbrAJaeger: how big of a gentoo fan are you?13:45
mordredzbr: you want prometheanfire for gentoo things13:47
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Revert "Add Gentoo integration tests"  https://review.opendev.org/73032913:47
zbrmordred: i asked AJaeger because the change originated from him and created jobs that were not triggered by the addition itself.13:48
zbrhmm, in fact i am a bit wrong, they only recently got broken.13:48
zbri still hope to get an answer about the platform support in general, i do not find the current setup sustainable13:50
zbrvery often when I want to touch a role I end up discovering that it was already broken on a less-than-mainstream platform.13:51
hrwwhich version would be better: build/publish-wheel-mirror job definitions for x86-64 and then the same for aarch64, or rather grouped by distro so build/publish-c7, build/publish-c7-arm64 etc?13:52
AJaegerzbr: they were added for completeness.13:52
zbrbut nobody knew because we do not have periodic on them and also no owners.13:52
AJaegerfungi, mordred, do we really need all these different wheels per OS version?13:52
AJaegerzbr: prometheanfire is the local Gentoo expert13:52
zbrmaybe we should run all zuul-jobs once a week to get an idea about what went broken... naturally.13:54
zbra bit-rot pipeline13:54
*** owalsh has quit IRC13:55
AJaegerand who monitors that one?13:56
mordredAJaeger: yeah - if we don't build per-os and per-arch wheels the wheel mirror won't work13:56
mordredI mean- it won't work for those arches13:56
mordredso - we should build wheels for every arch we have in the gate13:57
mordreds/arch/arch-distro-combo/13:57
zbrwe can send email on failures, i would not mind looking at it. i would also take responsibility for fixing the redhat ones.13:57
zbrwe can now assume that everything is fine, because we do not run them, but we have no idea how many are in the same situation.13:58
zbrmaybe we can run every 10, or 14 days, that is only an implementation detail.13:58
*** priteau has quit IRC13:59
zbrtravis has a very neat feature that allows a conditional cron, which runs only if nothing ran recently, but that is not possible for us.14:00
zbrstill zuul-jobs is really high-profile imho14:00
mordredzbr: the idea of a conditional periodic has come up before - I think it would have to wait for zuul v4 (which isn't too far away) because the scheduler would have to ask the database if a job has been run recently and the db is currently optional14:02
AJaegermordred: I see, seems we missed a few when setting up. This needs a bit of review.14:02
mordredzbr: saying that - I still don't know how feasible it would be for us - just that it would _definitely_ require v414:03
mordredI haven't actually thought about it from a design perspective14:03
zbrmordred: super. clearly db would enable lots of useful things.14:03
mordredyeah. that's the main v4 thing - the db becomes mandatory instead of optional (also TLS for zk)14:04
mordredbecause from an ops pov, the db all of a sudden becoming mandatory is a breaking change :)14:05
zbrprobably would make it easy to implement regression detection compared with last-passed-build (coverage going down, more warnings,....)14:05
mordredwe're pretty sure _everyone_ has a db though14:05
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL  https://review.opendev.org/73032214:06
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683014:06
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL  https://review.opendev.org/73033414:06
*** owalsh has joined #opendev14:13
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL  https://review.opendev.org/73032214:24
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: build and publish wheels for aarch64 architecture  https://review.opendev.org/73034214:25
hrwfungi, AJaeger: please take a look.14:26
*** ykarel|afk is now known as ykarel14:27
AJaegerhrw: both look good but I let fungi and al review it since it needs manual steps14:34
*** sgw has joined #opendev14:39
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL  https://review.opendev.org/73033414:40
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683014:40
hrwAJaeger: thanks. I am aware that some changes may need manual work. Just wanted to know whether the changes are more or less fine14:47
hrw2020-05-22 14:37:14.511378 | primary | INFO:kolla.common.utils.kolla-toolbox:  Downloading http://mirror.bhs1.ovh.opendev.org/wheel/ubuntu-18.04-x86_64/distlib/distlib-0.3.0-py3-none-any.whl (340 kB)14:47
hrwmirror will be in use ;D14:48
hrwHave to think about whether it (pip.conf) should be included in the final images or not. distro repos are14:49
prometheanfirezbr: hi?14:52
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL  https://review.opendev.org/73033414:53
zbrprometheanfire: hi! if you can help with https://review.opendev.org/#/c/728640/ it would be great, gentoo error is unrelated to the test patch.14:55
zbrfeel free to reuse the patch14:56
*** mlavalle has joined #opendev14:56
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683015:04
prometheanfirezbr: looks like one host worked, the other failed, so going to recheck15:11
zbrif you check the build history you will see that it started to fail a few days ago, and is not random.15:11
prometheanfirezbr: the nature of the error, is it always that it can't see the ovs bridge?15:12
zbrhttps://zuul.opendev.org/t/zuul/builds?job_name=zuul-jobs-test-multinode-roles-gentoo-17-0-systemd&project=zuul/zuul-jobs15:13
prometheanfireit looks like we stabilized openvswitch-2.13.0 on the 11th15:13
zbri bet something happened between 7th and 9th.15:13
openstackgerritSagi Shnaidman proposed zuul/zuul-jobs master: WIP Add ansible collection roles  https://review.opendev.org/73036015:14
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683015:19
prometheanfirezbr: to be honest I've been waiting on https://review.opendev.org/717177 to merge15:27
prometheanfireit's what I'm using for http://distfiles.gentoo.org/experimental/amd64/openstack/ at least15:27
mordredprometheanfire: +215:29
zbri am clueless about ^ but if that is fixing it, merge it.15:29
prometheanfireit helps simplify the image build process imo15:30
prometheanfireatm, upstream is shipping an older kernel for instance15:30
zbrin that case I will make the gentoo job nv.15:34
prometheanfireya, atm that sounds fine15:35
zbrmordred: how to make the job nv without breaking update-test-platforms ?15:38
*** hashar has quit IRC15:40
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683015:41
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Disable broken gentoo job nv  https://review.opendev.org/72864015:43
zbrfor some reason removing the auto-generated tag and adding voting: false has a nasty side effect: update-test-platforms creates a duplicate.15:46
*** ykarel is now known as ykarel|away15:57
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683015:59
* mordred afks for a bit16:00
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683016:03
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: Make gentoo jobs nv  https://review.opendev.org/72864016:03
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683016:14
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683016:18
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683016:31
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683016:53
*** ysandeep is now known as ysandeep|away16:54
*** cmurphy is now known as cmorpheus17:03
corvusi *think* we expect the base playbook to run successfully now?  i'll re-enqueue that change again17:03
openstackgerritMerged opendev/system-config master: Use ipv4 in inventory  https://review.opendev.org/73014417:20
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683017:33
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: envlist bugfixes  https://review.opendev.org/73038117:33
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check  https://review.opendev.org/72996617:35
corvusbase worked.  letsencrypt failed.17:36
corvusle failed on nb01 and nb0217:37
corvusnot entirely sure what nb01 and nb02 are doing with ssl certs...17:38
corvus /opt is full on both of those hosts17:40
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check  https://review.opendev.org/72996617:45
corvusinfra-root: anyone else around?  there seem to be some nodepool problems17:46
corvusit looks like there's a db error related to the dib image records17:46
corvusand we seem to have a whole bunch of failed image uploads17:46
corvusi'm going to start looking into the db error since it's preventing use of a diagnostic tool  ("nodepool dib-image-list" fails)17:47
*** tosky has quit IRC17:50
*** tosky has joined #opendev17:51
corvusthe znode for build 0000124190 exists but is empty17:56
corvusbut it does have a providers/vexxhost-ca-ymq-1/images directory (which is also empty)17:57
corvushrm, we should be doing a recursive delete on the build znodes when we delete it, so it shouldn't have mattered that there are nodes under it17:58
corvusi can't think of what may have gone wrong; perhaps a zk conflict of some kind18:01
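A sketch of the kind of zk-shell session involved here; the ZooKeeper endpoint is a placeholder and the prompt is approximate, but the paths are the ones named above:

    zk-shell zk-host.example.org:2181
    (CONNECTED) /> get /nodepool/images/centos-7/builds/0000124190    # comes back empty
    (CONNECTED) /> tree /nodepool/images/centos-7/builds/0000124190   # shows the empty providers/... subtree
    (CONNECTED) /> rmr /nodepool/images/centos-7/builds/0000124190    # recursive delete of the dead build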
zbrinfra-root: the POLLPRI change is ready for review at https://review.opendev.org/#/c/729966/18:02
corvuszbr: you can use infra-core to notify infra folks with core approval rights (not the smaller set with root access)18:03
zbrtx, time to update the magic keyword list.18:04
corvus#status log manually deleted empty znode /nodepool/images/centos-7/builds/000012419018:05
openstackstatuscorvus: finished logging18:05
zbrmy hopes are quite low around paramiko, it does not have an active community18:06
corvusokay now, i can see that we have znodes for 28k failed builds18:06
corvushopefully without the dead znode there, they'll get cleaned up18:06
corvusyes, that number is slowly decreasing; i think the thing to do now is to let it run for a bit and see what gets automatically cleaned up18:08
corvusnb02 has already managed to recover some space on /opt18:09
zbrcorvus: give https://review.opendev.org/#/c/729974/ a kick if you do not mind, that use of lowercase l, drives me crazy.18:12
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Resolve unsafe yaml.load use  https://review.opendev.org/73038918:22
fungicorvus: i'm back now, can look into nodepool problems18:37
fungithanks for finding/clearing the dead znode. i'll try to keep an eye on it18:38
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner  https://review.opendev.org/72868418:39
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: empty envlist should behave like tox -e ALL  https://review.opendev.org/73032218:41
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-tox-output: empty envlist should behave like tox -e ALL  https://review.opendev.org/73033418:41
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: envlist bugfixes  https://review.opendev.org/73038118:41
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Deprecate default tox_envlist: venv  https://review.opendev.org/72683018:41
fungilooks like they started growing around the 18th18:44
corvusit looks like we may have another znode in a similar situation19:02
corvus(presumably newly placed into this situation)19:02
corvusi'll dig after lunch19:05
hrwfungi: can you take a look at https://review.opendev.org/#/c/730323/ and https://review.opendev.org/#/c/730342/ patches? And add whoever is needed to get AFS volumes created?19:15
fungihrw: i can create them, just may not get to it until next week. trying to take today through monday off except for urgent crises19:17
fungiit may end up being straightforward, but i need to check quotas to see how much room we have19:18
fungiand how much we've allocated to the other wheel volumes19:18
hrwfungi: no problem19:20
hrwfungi: get some rest etc. I know the feeling. Spent too much time recently on yak shaving..19:21
hrwfungi: https://marcin.juszkiewicz.com.pl/2020/05/21/from-a-diary-of-aarch64-porter-firefighting/ ;D19:22
fungiError uploading image opensuse-15 to provider airship-kna1: [...] json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)19:22
fungiproblem with the citycloud api responses?19:23
funginope. also see it for ovh-bhs119:24
fungiahh, yeah this is bubbling up from nodepool.zk._bytesToDict() so presumably the same thing corvus saw earlier19:25
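For reference, that traceback is reproducible with nothing more than an empty value handed to the JSON decoder:

    python3 -c 'import json; json.loads("")'
    # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)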
*** elod has quit IRC19:26
* hrw off19:26
fungilooks like the current opensuse-15 image may be the commonality19:26
hrwhave a nice weekend folks19:26
fungithanks hrw, you too!19:26
*** elod has joined #opendev19:27
fungioh, yeah, nodepool dib-image-list even returns the same error19:28
fungiand traceback19:28
fungirefreshed my memory on getting zk-shell to work, looks like we have 7 znodes under /nodepool/images/opensuse-1519:37
fungithough in retrospect, opensuse-15 may have been showing up in the errors because that was just the image it was in the process of trying to upload19:37
fungisince dib-image-list is also generally returning an error, possible it could be anywhere in the /nodepool/images tree i suppose19:38
fungioof, running `tree` there exceeds my buffer19:41
fungimost of the tree looks reasonable except ubuntu-xenial, ubuntu-bionic, and opensuse-tumbleweed, which each have thousands of empty builds19:47
*** slaweq has quit IRC19:49
*** roman_g has quit IRC19:51
*** roman_g has joined #opendev19:53
corvusfungi: back20:04
corvusfungi: i'll see if i can figure out what znode is borked20:05
*** jesusaur has joined #opendev20:06
corvusit's /nodepool/images/opensuse-15/builds/000008949120:11
corvusit has providers/airship-kna1/images under it20:12
corvuswhich is empty, similar to before20:12
*** jesusaur has quit IRC20:37
*** jesusaur has joined #opendev20:37
fungiyep, sorry, had to jump to dinner mode, back again20:42
*** lpetrut has joined #opendev20:43
fungiokay, so an empty build tree is fine as long as it doesn't have an empty image provider list in it?20:43
fungier, empty provider image20:43
corvusi've been looking at the code, and i think we're seeing issues with multiple builders racing and the lock node being held underneath the thing we're deleting20:43
corvusfungi: no, it's never okay20:43
corvusfungi: but i think it's a clue as to why the node is still there20:43
fungigot it, so the slew of empty image build znodes is likely a symptom of that one empty provider upload znode?20:44
fungiand each time a builder throws an exception trying to parse that empty upload znode it leaves another empty build znode behind?20:45
corvusoh no idea about that20:45
corvusi'm not sure if there's a quick code fix for this... i'm inclined to just attempt to get things cleaned up for the weekend though and hope that whatever triggered this doesn't happen again for a while20:46
corvus(i think we've learned not to put the lock node under the thing we're locking in future designs)20:47
corvusi think the best way out of this is to shut down nb02, clear out the empty znode, then let nb01 do all its cleanup, then start up nb02 again20:47
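A rough shape of that plan; whether the builder runs under systemd or docker-compose on these hosts is an assumption, so the stop/start commands are illustrative only:

    ssh nb02.opendev.org sudo systemctl stop nodepool-builder
    # remove the empty build znode (see the zk-shell session earlier), then watch nb01 churn:
    watch -n 60 "nodepool dib-image-list | wc -l"
    ssh nb02.opendev.org sudo systemctl start nodepool-builder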
fungithat sounds reasonable. why only nb02? is it the source of the trouble?20:49
corvusno, just so it's not racing nb0120:49
fungioh! right20:49
fungiso either 01 or 02 just doesn't have to be both20:49
corvusyep20:49
fungiare you doing that or shall i?20:50
corvusi am20:50
corvusnb02 is off, and i've deleted the znode20:50
fungicool, thanks!20:50
fungiand we expect those other empty build znodes to clear out on their own20:50
corvusthere was only one empty znode20:50
corvusnodepool dib-image-list succeeds now; and reports ~4400 builds20:51
corvusso we're close to bottoming out20:51
fungiif i do tree for /nodepool/images i see a ton like ubuntu-xenial/builds/0000109279 with nothing under them... that's what i meant by empty20:51
*** DSpider has quit IRC20:51
corvusi meant if you "get" them you get the empty string back20:52
corvusthat's the cause of the traceback20:52
fungionly a few have a providers subtree20:52
fungiahh, okay20:52
fungiare those leaf build trees normal then?20:52
fungii guess they signify an image build with no provider uploads?20:53
corvusyes, probably because the build failed20:53
fungiand under at least some conditions we don't clear them out i suppose20:54
corvusone of those conditions is when everything is broken because of corrupt data20:55
fungiright, so likely a symptom of the problem with the empty upload znode you removed20:58
corvusokay, i think nb01 has finished clearing out its stuff, i'm going to stop it and restart nb0220:59
fungiwatching the builder log on nb01, exceptions now seem (so far) to be only about failures to delete backing images for bfv in vex20:59
fungisounds good20:59
corvusokay restarting nb02 now21:13
corvuser nb0121:13
corvuslooking at the image list now, it seems like we have some images that i would expect to be deleted but aren't21:15
corvusexample: | ubuntu-xenial-0000099848        | ubuntu-xenial        | nb01             | qcow2,raw,vhd | ready    | 19:01:35:13 |21:15
corvusthere are 3 newer images than that, and no uploads for it, so it should be gone21:15
*** lpetrut has quit IRC21:15
fungiyeah, and i don't see it in any providers according to nodepool image-list21:17
fungithe zk tree for it shows locks under each provider though21:18
corvusoooh21:18
corvusdid we replace the build nodes?21:18
fungiyes21:18
corvuswe did not copy over the builder ids21:18
funginb01 and 02 went from openstack.org to opendev.org21:18
corvusso everything with an nb01 or nb02 hostname is orphaned21:19
corvussince i'm here, i'll just delete the znodes21:20
fungiaha, and dib-image-list apparently still only shows short hostnames21:21
corvusno it shows whatever hostname was used to build it21:21
corvusso you can see both nb01 and nb01.opendev.org in there21:21
corvusbut i think we ran a version of nodepool that used short hostnames when we ran it on the openstack nodes21:22
fungiohh, okay. there was a patch which merged at one point to switch from short hostnames to full hostnames. so could those be from before that transition?21:22
fungiyeah, got it21:22
corvusi think we're going to leak | 0000123991 | 0000000002 | vexxhost-sjc1       | centos-7             | centos-7-1585726429             | e894339c-807d-4d46-9a36-51b2338e536d | deleting  | 47:19:42:15  |21:23
corvussince there's nothing left to delete that upload any more21:23
corvusi mean, it'll leak on the cloud side21:24
fungiso we probably need a todo to check our providers for orphaned images next week?21:24
*** lpetrut has joined #opendev21:26
fungithough odds are it'll just be vexxhost-sjc1, since we occasionally get stuck undeletable instances which lock the backing images for their boot volumes indefinitely21:26
corvusyeah21:26
fungiso there were likely a few when the old builders were being taken down21:26
corvusokay, i cleaned up everything that looked unused; there are still several images in use that only existed on the old builders :/21:32
openstackgerritOleksandr Kozachenko proposed openstack/project-config master: Add openstack/heat and openstack/heat-tempest-plugin  https://review.opendev.org/73041921:33
fungiwe may be able to forcibly detach and delete the volumes which have them locked in use21:34
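A hedged sketch of what "forcibly detach and delete" could look like with the openstack CLI; the cloud name and the UUIDs are placeholders:

    openstack --os-cloud vexxhost server remove volume <server-uuid> <volume-uuid>
    openstack --os-cloud vexxhost volume delete --force <volume-uuid>
    openstack --os-cloud vexxhost image delete <image-uuid>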
funginext week we can try https://opendev.org/opendev/system-config/src/branch/master/tools/clean-leaked-bfv.py on them if that's the problem21:38
corvussorry, i meant that we have uploads of images that we have no built copies of21:39
fungiohh, got it21:40
corvusie, nb01.openstack.org built opensuse-15, uploaded it everywhere, and now we can't build new ones, and we deleted the underlying image when we deleted the builder21:40
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner  https://review.opendev.org/72868421:45
Open10K8SHi team21:50
Open10K8SCan you check this PS on project-config? https://review.opendev.org/#/c/730419/21:50
*** lpetrut has quit IRC22:00
openstackgerritMerged openstack/project-config master: Add openstack/heat and openstack/heat-tempest-plugin  https://review.opendev.org/73041922:17
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner  https://review.opendev.org/72868422:21
*** smcginnis has quit IRC22:24
*** smcginnis has joined #opendev22:30
Open10K8SHi team22:43
Open10K8SZuul deploy failed for this https://review.opendev.org/#/c/730419/22:43
Open10K8SThe error msg is "Please check connectivity to [bridge.openstack.org:19885]"22:43
fungiOpen10K8S: that error is actually a red herring, we don't stream console logs from that node as it's our deployment bastion, output is directed to /var/log/ansible/service-zuul.yaml.log (per the failed task) so i'll check that file for the actual error22:49
fungiexciting, most of our zuul servers, including our scheduler, were considered unreachable22:50
fungibut it seems to be reachable from there now (via both ipv4 and ipv6)22:52
Open10K8Sfungi: ok22:52
fungimight have been a temporary network issue in that provider, i'll try to reenqueue the commit into the deploy pipeline22:52
Open10K8Sfungi: ok22:53
mnaserjust a heads up22:54
mnasergithub is unhappy -- https://www.githubstatus.com22:54
fungioh, funzies22:57
fungithanks for the heads up, mnaser!22:58
*** mlavalle has quit IRC22:58
fungisince 16:41z looks like22:59
*** tosky has quit IRC22:59
*** larainema has quit IRC23:00
fungiand the reenqueued deployment just bombed again23:00
Open10K8Sfungi: yeah23:01
Open10K8Sfungi: the same reason, seems like23:01
fungiwell, the reason is entirely hidden from the ci log23:04
fungithe only reason the ci is really reporting there is "something failed during deployment"23:04
fungiwe redirect all the deployment logging to a local file on the bastion so as to avoid leaking production credentials23:04
fungistill seeing a ton of unreachable states reported for most of the zuul servers23:05
fungithough also this error for the scheduler:23:06
fungigroupadd: GID '10001' already exists23:06
clarkbfungi: could connectivity issues be related to https://review.opendev.org/730144 ?23:07
fungigetting that for zuul01.openstack.org and ze09.openstack.org23:07
clarkbperhaps due to ssh host keys23:07
mnaserbtw there seems to be an error relating to permission denied23:07
fungioh, possibly... they're all ipv4 addresses it's complaining about in the log23:07
mnaserwhen checking things out23:07
mnaseri don't know if that's just a warning _or_ maybe problematic23:07
fungiData could not be sent to remote host "23.253.248.30". Make sure this host can be reached over ssh: Host key verification failed.23:08
fungiet cetera23:08
mnaserahhh, i am going to guess known_hosts contains hostnames and ipv6 addresses only fungi23:08
fungiand indeed, if i `sudo ssh 23.253.248.30` from bridge.o.o i see it prompts about an unknown host key23:08
clarkbwe don't use hostnames23:08
fungido i need to `sudo ssh -4 ...` all of the zuul servers from bridge, or are we maintaining a configuration-managed known_hosts file?23:09
clarkbfungi: the servers get added to known hosts with the launch node script. I expect it was only adding ipv6 records23:10
clarkbfungi: I think that means we need to manually add the ipv4 records (or we could go back to ipv6, or we could switch to hostnames)23:10
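One way the missing v4 records could be added in bulk, as a trust-on-first-use sketch; 23.253.248.30 is the example address below and the rest would come from the inventory (fungi ends up doing this interactively with `sudo ssh -4` instead):

    for ip in 23.253.248.30 <other-inventory-v4-addresses>; do
      ssh-keyscan -t ed25519,rsa "$ip" 2>/dev/null | sudo tee -a /root/.ssh/known_hosts >/dev/null
    done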
fungimanually running sudo ssh -4 for any of the zuul servers root already had in its known_hosts file by hostname auto-added the v4 addresses without any need to confirm an unknown key23:14
fungithough it choked on zm01-04, ze09 and ze1223:15
fungithose four mergers it complained about mismatched host keys (i guess we've rebuilt them since the last time it connected to them by name)23:15
clarkbnote the gid thing is likely to prevent the scheduler from being updated too23:15
clarkband I'm not sure what the correct answer is there23:16
fungiand the two executors seemed to not have entries by hostname23:16
clarkbI think corvus expected some unhappiness that might need to be corrected?23:16
clarkbpossibly via manual edit of /etc/passwd and /etc/group23:16
clarkband then maybe restarting services? though the uids stay the same so restarting is probably less important23:16
fungilet me at least check whether the change i approved for Open10K8S got applied to the scheduler23:17
fungibut yeah, the last successful build for infra-prod-service-zuul was 2020-05-15 and today is the first time it's been triggered since23:19
fungiso something we've merged in the past week, presumably23:19
clarkbfungi: yes, yesterday I think. It's the zuul -> zuuld user/group name change (but not uid/gid)23:20
funginope, the config addition from 730419 is not getting applied, so we're currently unable to update the tenant config it looks like23:20
clarkbI think we half expected ansible to be angry about it23:20
clarkbsince a user and group already exist with those uids and gids23:21
openstackgerritMerged zuul/zuul-jobs master: Patch CoreDNS corefile  https://review.opendev.org/72786823:24
mordredclarkb: yeah - I think we just have to manually edit the /etc/passwd and group files - I don't think we need to restart anything23:35
mordredclarkb: the zuulcd change landed?23:35
clarkbmordred: ya I think that stack was what corvus was trying to get applied yesterday when we ran into the problems23:36
clarkb/etc/shadow may also need editing too23:36
mordredclarkb: yes - almost certainly23:39
mordredclarkb: I think it would be 'sed -i "s/ˆzuul:/zuulcd:/" /etc/passwd /etc/group /etc/shadow''23:40
clarkbalso we probably want to audit our ssh host key problem after the ipv4 change landed. But I'm on a phone for the foreseeable future23:40
clarkbmordred: it might be zuuld not zuulcd23:40
mordredoh - yes, zuul d23:41
mordredzuuld23:41
mordredclarkb: yeah- I'm not in a great position to do a business but I could do either thing in the morning23:41
clarkbbut otherwise that looks correct to me too23:41
mordredor - I think I can do the zuul user rename on the zuul hosts23:41
mordredwant me to try that and then try re-running service-zuul?23:42
clarkbup to you I guess. I expect its that simple but it may not be23:42
clarkbfungi: ^ thoughts23:42
fungimordred: worth a try if you're in a position to be able to23:45
mordredok - I just did:23:45
mordredansible zuul -mshell -a"grep zuul: /etc/passwd /etc/group /etc/shadow"23:45
mordred(as a quick test)23:45
mordredand I had to accept a few more host keys)23:45
mordredbut I can run that now with no issues23:45
mordredso - I think what I'd run is: ansible zuul -mshell -a"sed -i 's/ˆzuul:/zuulcd:/' /etc/passwd /etc/group /etc/shadow"23:47
mordredok - I ran that (but a fixed version) just on ze01.openstack.org and it seems to have worked23:56
mordredps now shows the zuul processes running as zuuld23:56
mordredansible ze01.openstack.org -mshell -a"sed -i 's/^zuul:/zuuld:/' /etc/passwd /etc/group /etc/shadow"23:56
mordredfor the record23:56
mordredI'm going to run it across all of them23:56
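A hedged follow-up check across the group, mirroring the grep test above, to confirm the rename took everywhere:

    ansible zuul -mshell -a"grep -E '^zuul(d)?:' /etc/passwd /etc/group; ps -eo user= | sort | uniq -c | grep zuul"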
