Friday, 2020-03-13

openstackgerritMerged opendev/system-config master: pip3: Add python3-distutils  https://review.opendev.org/71281800:29
openstackgerritIan Wienand proposed opendev/system-config master: [dnm] test with plain nodes  https://review.opendev.org/71281901:40
openstackgerritIan Wienand proposed opendev/system-config master: [dnm] test with plain nodes  https://review.opendev.org/71281901:43
openstackgerritIan Wienand proposed opendev/system-config master: [dnm] test with plain nodes  https://review.opendev.org/71281901:44
openstackgerritIan Wienand proposed openstack/project-config master: Move fedora-30 builds to nb01.opendev.org  https://review.opendev.org/69312001:55
openstackgerritMerged openstack/project-config master: Move fedora-30 builds to nb01.opendev.org  https://review.opendev.org/69312002:22
openstackgerritIan Wienand proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache  https://review.opendev.org/71282402:22
openstackgerritMerged openstack/diskimage-builder master: Remove hacking from requirements  https://review.opendev.org/71277805:27
openstackgerritIan Wienand proposed openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org"  https://review.opendev.org/71283605:28
*** DSpider has joined #opendev05:51
openstackgerritMerged openstack/project-config master: Revert "Move fedora-30 builds to nb01.opendev.org"  https://review.opendev.org/71283606:33
*** factor has joined #opendev07:50
*** lpetrut has joined #opendev10:12
*** factor has quit IRC14:00
*** factor has joined #opendev14:00
*** factor has quit IRC14:03
*** factor has joined #opendev14:04
*** factor has quit IRC14:14
*** factor has joined #opendev14:14
*** factor has quit IRC14:15
openstackgerritMerged opendev/glean master: Fix a handful of bugs in config-drive processing  https://review.opendev.org/70362314:42
openstackgerritMerged openstack/project-config master: Add a new project and repository for tripleo-ipa  https://review.opendev.org/71111415:44
openstackgerritSorin Sbarnea proposed zuul/zuul-jobs master: DNM: rebase unittests to base-minimal-test  https://review.opendev.org/71298515:45
*** lpetrut has quit IRC15:48
openstackgerritMerged openstack/project-config master: New repo: devstack-plugin-open-cas  https://review.opendev.org/71187815:51
openstackgerritMerged openstack/project-config master: Add OpenInfra Labs IRC channels to bots  https://review.opendev.org/71258615:52
openstackgerritMerged openstack/project-config master: Remove neutron-tempest-dvr job from Neutron's dashboard  https://review.opendev.org/71204815:52
*** lpetrut has joined #opendev16:14
openstackgerritClark Boylan proposed openstack/project-config master: Install tox into a virtualenv on our images  https://review.opendev.org/71301716:29
*** openstackgerrit has quit IRC16:31
*** openstackgerrit has joined #opendev16:34
openstackgerritLance Bragstad proposed openstack/project-config master: Add queues for tripleo-ipa project  https://review.opendev.org/71111516:34
openstackgerritMerged openstack/project-config master: Install tox into a virtualenv on our images  https://review.opendev.org/71301717:32
corvusi'm seeing node failures for fedora-30 nodes18:06
corvusand i notice that there are some recent commits to project-config moving them around18:06
corvusis this known / is anyone working on it?18:06
corvusinfra-root, config-core: ^ ?18:06
clarkbit is not known to me18:07
clarkbmy guess is that the new builder managed to upload a fedora-30 image that does not work18:07
fungiianw moved them to the new nb01.opendev.org late yesterday18:07
clarkbwe can probably pause the builds on that builder then revert to the previous image?18:08
fungii think they're already paused18:08
corvusi don't see builder logs on nb01.opendev18:08
fungii believe he said he stopped the services on it and added it to the emergency disable list18:08
corvusi see build logs18:08
corvusbut not the builder itself18:08
corvusi don't know how to tell if the builder is running18:09
corvuswe don't have docker on that machine, so i guess we're using podman18:09
clarkbcorvus: `docker ps -a` or podman equivalent if using podman18:09
corvusclarkb: which user should i run podman as?18:10
Shrewsi thought the new builder was reverted?  https://review.opendev.org/#/c/712836/18:10
corvus(there's no global podman thingy like docker)18:10
corvusShrews: apparently that's not a revert, that's a rework18:10
clarkblooks like root18:10
corvusShrews: at least that's what the commit message says?18:10
clarkb(because we run podman-compose as root in system-config/playbooks/roles/nodepool-builder/tasks/main.yaml)18:11
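A minimal sketch of the check being suggested here (run as root on the builder host):

    sudo podman ps -a    # list all containers, running or exited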
corvusthere are now 2 podman processes, one running as corvus, one as nodepool, probably because i ran "podman ls"18:11
clarkbbut ya rereading it should be stopped because of nodepool's "delete everything I don't know about" behavior18:12
clarkbin which case I think we can remove the newer fedora-30 image and fall back to the older one?18:13
corvusas far as i can tell, nodepool-builder is not running on this host18:13
corvusdo we have a plan for running the nodepool cli on the docker hosts?18:13
clarkbcorvus: I think a lot of that is still in the learning phase18:13
mordredcorvus: I agree that nodepool-builder does not seem to be running on the host18:13
clarkbbut we do produce a nodepool image to run commands18:14
fungiianw mentioned in scrollback (maybe in #openstack-infra) that he stopped it18:14
clarkb(so could probably add that to our setup)18:14
mordredperhaps add a convenience script so that just running "nodepool" works and does the right thing with the image18:14
mordredlike in that openstackclient patch I did in system-config a little while ago18:15
corvusthat would be swell18:15
corvusi think having the nodepool cli handy before we break things would be good18:15
mordred++18:15
fungiyeah, #openstack-infra at 02:44z18:15
corvusso i guess we're done with nb01.opendev for now; i'll log into a different host18:15
corvus"nodepool image-list |grep -i fedora" shows nothing18:16
fungihe stopped the builder because it was deleting all the images id didn't know about18:16
fungis/id/it/18:16
corvusso i'm guessing all the f30 images were deleted, and the current state of the half-revert is that the builder config for f30 is on a host which is down18:16
corvusso we should continue to complete the revert so that the f30 builder config moves back to the nb*.openstack hosts ?18:17
clarkbcorvus: we can't build fedora30 on those hosts though18:17
clarkbI think we have to roll forward?18:17
corvushow did we ever build f30?18:17
fungihttps://review.opendev.org/712836 should have put them back18:17
clarkbcorvus: ~5 months ago fedora rpms were made with a compression tool that was available on ubuntu xenial; then at some point they switched off that, aiui18:18
corvusfungi: line 280: pause: true18:18
fungioh, it was put back with pause: true18:18
fungiyep, just spotted that myself18:18
clarkbat that point we paused the builds, and ianw has spent the intervening time trying to come up with a system to build them (and this is the result)18:18
corvusthen having that system delete the irreplaceable images is especially unfortunate18:19
clarkbyes, I think this is an aspect of nodepool that we should probably think about more. (running disjoint builders is likely desirable to accommodate different builder needs, architecture, operating system, whatever)18:21
clarkbI noted on IRC last night that I think the way nodepool wants you to express this is to always list all images, then pause them where they should not build18:21
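A sketch of that "list everything, pause what this builder should not build" idea; the image names and single-file layout here are illustrative assumptions, not the real config:

    # each builder's nodepool.yaml lists every image, pausing the ones
    # built elsewhere, so nothing looks "unknown" (and therefore deletable)
    cat > /etc/nodepool/nodepool.yaml <<'EOF'
    diskimages:
      - name: ubuntu-bionic       # built on this builder
      - name: fedora-30
        pause: true               # known here, but built on the other builder
    EOF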
mordredI don't suppose we accidentally still have the old fedora30 qcow on any of the nodes right?18:21
clarkbmordred: probably not if nodepool deleted them18:21
fungiit tries to clean them up aggressively18:21
clarkb(it's pretty good about cleaning those up)18:21
mordredyeah18:22
Shrewsi suspect the hostname change is what triggered the cleanup  (cc: ianw)18:22
clarkbShrews: there was no hostname change18:22
clarkbShrews: this is a new additive host18:22
Shrewsclarkb: nb01.opendev.org vs. nb01.openstack.org18:23
Shrewsright?18:23
clarkbShrews: yes that wasn't a change18:23
clarkbboth are/were expected to run side by side18:23
Shrewsclarkb: nodepool stores the hostname of the builder, so yes, as far as nodepool is concerned, it was  a change18:23
clarkbShrews: I am trying to clarify that we didn't delete or remove nb01.openstack.org18:24
clarkbwe added nb01.opendev.org to the set of existing servers18:24
clarkb(I understand why the images were deleted)18:24
Shrewsclarkb: i understand that18:24
Shrewsi don't think that invalidates my statement18:25
clarkbShrews: I read it as we changed nb01.openstack.org to nb01.opendev.org which did not happen18:25
clarkbwe simply added nb01.opendev.org18:25
Shrewsclarkb: didn't mean that. i meant "host ownership of a build" changed18:26
clarkbShrews: ah. I don't think that is fully it either. Because nb01.opendev.org apparently tried to delete all of nb01.openstack.org's images18:26
clarkband nb01.openstack.org tried to delete nb01.opendev.org's f30 image18:27
clarkbmaybe that is what you mean by host ownership? Basically they each decided the other's disjoint set was invalid18:27
clarkb(which is why I suggested that listing all images then pausing where we don't want to run it would be a way to express this to nodepool)18:28
corvusthey are a cluster, and are all supposed to have the same configuration.  the thing that we're doing with nb03 only works because it has a disjoint set of providers.18:28
corvus(it is, in effect, a second cluster of one)18:29
corvusbut... about the future...18:29
corvusthis is affecting at least one project (nodepool).  what are our options?18:29
clarkbI expect that if we fixed the nodepool configs (possibly via the pause idea or just letting the new server build all the images) that we'd be able to build and upload a fedora30 image18:30
clarkbbasically roll forward18:30
clarkbother ideas: manually upload a fedora-30 image in some set of clouds and use that in our providers18:31
clarkb(and the probably bad option) stop testing on fedora18:31
fungii think either press forward trying to get a nodepool builder working on a newer distro, or try to get the necessary decompression tooling backported to ubuntu-xenial so the existing builders can unpack newer rpms18:31
fungiand yes, also possibly someone build an image locally as a stopgap18:32
clarkb(I've secretly been hoping that centos rolling distro can slip into the spot fedora fills, but that's an entirely different set of things to sort out and should maybe be ignored for now)18:32
fungiwhat are the details on the rpm decompression problem? do we have that documented somewhere?18:32
clarkbI'm sure ianw has it in a story. Let me see if I can find it18:33
mordredwait- the container deployment is the thing that solves the rpm decompression problem18:33
clarkbmordred: yes18:33
mordredI don't think that's a thing that we need to go back to try to solve, is it?18:33
clarkbmordred: no18:34
clarkb(other than the container deployment had a sad)18:34
mordredright.18:34
clarkbbut I think we can make it not have a sad18:34
mordredI agree18:34
clarkb(the put all images in the config and then pause where we don't want them to build idea)18:34
mordredI just wanted to be clear that we didn't need to go back to a more complicated drawing board18:34
mordredclarkb: ++18:34
corvusyeah, i think rolling forward with container using the new config file strategy is probably easiest (depending on how easy manually uploading an image is)18:35
corvusi can help with that after lunch18:35
fungibut yeah, i expect that if there were an easy way to backport a decompression solution for new rpms to xenial, ianw would already have done that18:35
AJaegercorvus: the move around was reverted18:35
corvusAJaeger: *partially* reverted18:35
fungiAJaeger: it was, but only after fedora images we've ceased to be able to build were accidentally deleted by it18:36
corvusAJaeger: it's in backscroll, but tldr: nothing is building f30 images now18:36
corvus(and there are no f30 images)18:36
mordredhow about I take a stab at the config file change18:36
AJaegercorvus: see it now18:36
clarkbthe paging buttons in the storyboard story search page don't work18:36
clarkboh it's 1 to 6 stories, not 1 to 6 pages18:37
*** tristanC has joined #opendev18:37
clarkbmordred: wfm (I'll keep trying to dig the story details out of storyboard)18:37
corvusi gotta run, i'll be back after lunch to help.18:37
openstackgerritMonty Taylor proposed openstack/project-config master: Add fedora-30 to nb01.opendev.org  https://review.opendev.org/71304718:40
mordredclarkb, corvus, fungi: ^^18:40
mordredI believe that is what we're saying we want on nb01 yeah?18:41
mordredShrews:18:41
clarkbmordred: yes I think that would allow us to build without associated deletes. If shrews can confirm that would be good18:41
fungithat'll have to be manually installed for the time being, right?18:42
Shrewsi'm still not fully understanding why one is trying to delete the other's images. i can't say for sure if that's what we want until i know that18:42
clarkbfungi: no I think we are still ansibling it18:43
clarkbfungi: we've just stopped giving the service a config to do any work with18:43
fungiclarkb: did it get taken back out of the emergency disable list?18:43
clarkboh if it's in the emergency disable then ya we have to remove it from there or manually add it18:43
clarkbmaybe manually adding it is the safest thing18:44
fungii haven't checked the disable list, just saw ianw say that he had added it there18:44
clarkbya it's in there18:45
clarkbso ya I think once we are comfortable with that change (I am but others should definitely double check me on it) we can manually apply it and re up the podman-compose config on nb01.opendev18:45
clarkbthen monitor it for deletions as well as building f3018:45
clarkbanother option while we are brainstorming is to reduce that provider list in the nb01.opendev.org config to a single cloud to reduce blast radius. Have it build the image and upload it, then add the other providers once we are happy with it18:51
clarkbI'm worried that will trigger some other cluster mismatch deletion behavior though. I think the proposed config from mordred is likely safest18:52
openstackgerritMonty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local  https://review.opendev.org/71305018:52
mordredhow's that look for a helper script?18:52
clarkbmordred: I think we can probably trim the mount list since nodepool cli commands aren't running dib builds or logging to disk.18:53
clarkbmordred: we should only need to mount in the cloud config and the nodepool config I think18:54
clarkbI'm going to find lunch now too18:54
clarkb(I think both changes are good as proposed even if we can do cleanup on the helper script)18:55
openstackgerritMonty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local  https://review.opendev.org/71305018:57
mordredclarkb: good point18:57
openstackgerritMonty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local  https://review.opendev.org/71305018:57
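A rough sketch of the helper-script idea under discussion; the actual change is 713050, and the image name, mount paths, and flags below are assumptions:

    #!/bin/bash
    # /usr/local/bin/nodepool: run the nodepool CLI from the container image.
    # Host networking gives it ZooKeeper access; per the review comments
    # above, only the configs the CLI needs are mounted in.
    exec podman run --rm -it --net=host \
        -v /etc/nodepool:/etc/nodepool:ro \
        -v /etc/openstack:/etc/openstack:ro \
        docker.io/zuul/nodepool:latest nodepool "$@"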
*** lpetrut has quit IRC19:05
fungimordred: is that going to have to be run via sudo -u nodepool so that it has access to bindmount the config?19:23
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Use a zuul_* and add an .ansible-lint file  https://review.opendev.org/71254719:27
AJaegerclarkb: the goaccess reports run fine today, see https://7b8a363e2631f871420c-9d822a96c58fccb739d55f79e396b06d.ssl.cf1.rackcdn.com/periodic/opendev.org/opendev/system-config/master/docs-openstack-goaccess-report/44c9931/docs.openstack.org_goaccess_report.html19:29
clarkbAJaeger: that is great to hear19:29
clarkbAJaeger: and the reports have at least as much info as the old one right?19:30
AJaegerclarkb: it gives the data - but lots of other stuff as well ;) Need to dig into how to just get those URLs.19:35
AJaegerclarkb: so, I'm fine with going forward with it19:35
* AJaeger calls it a day and waves good night19:36
* corvus is back and catching up19:36
clarkbcorvus: I think if people agree mordred's changes look good we can proceed to apply the config update to new nb01 manually and manually up the service there19:37
clarkbI'm still about 20 minutes from being able to help with that19:38
corvusclarkb, mordred: +319:40
clarkbcorvus: note the server is in the emergency file19:40
clarkbif we want ansible to update it instead of manual we should remove it19:40
corvusclarkb: yeah i saw19:40
corvusShrews: are you still looking into that?  (don't mind the +3, since it's not getting applied until we're ready)19:41
corvusI think we should add warnings to both files that they need to be kept in sync during the transition period.19:41
Shrewscorvus: "that" being why images were being deleted? if so, no. i think i need ianw to walk me through it.19:43
openstackgerritMerged openstack/project-config master: Add fedora-30 to nb01.opendev.org  https://review.opendev.org/71304719:47
corvusShrews: i think you are right to be concerned.  the builder id's present in 'nodepool dib-image-list' are short hostnames19:50
corvusShrews: ie "nb01"19:50
Shrewswe left the hostname comparisons in for compatibility with older nodepool (each should have a unique id now). it might be time to just remove that19:51
corvusugh.  i just tried to run "podman run ..." to see if i could test the behavior on nb01.opendev.org and ran into gshadow permissions19:52
corvusi thought we were removing that?19:52
Shrewsi don't think there were plans to do so19:53
clarkbI'm not aware of gshadow issues (is that a podman thing?)19:53
corvusno, it's a we're doing something in the nodepool image we shouldn't be doing thing19:53
* fungi assumed something related to the system shadow group file (/etc/gshadow)19:54
corvusi thought there was a patch to nodepool to revert that out, after we rejected the corresponding patch to zuul19:54
corvusanyway, i *also* can't run it as root, for a different reason (failed to find plugin "loopback")19:55
corvusso, i can't really predict the behavior on nb01.opendev.org because i can't get a python prompt in the production environment to test :/19:55
mordredcorvus: I can confirm we have not reverted that out of nodepool19:56
mordredbut I think we should19:56
corvusmordred: ack; i'll put in on my backlog of things to do when we can merge nodepool changes again19:56
corvusmordred: i'm less confident about the config file change now19:56
corvus(i'm also less confident we can actually run nodepool)19:57
mordredcorvus: we can make an image by hand and upload it to a personal dockerhub location then try running that manually19:57
mordredto check19:57
mordredcorvus: I don't see the revert patch in the system - want me to make one real quick?19:57
corvusmordred: let's not worry about that for now19:58
corvusthe gshadow thing is preventing me from running nodepool as a user19:58
corvusbut we run it as root19:58
corvusthe root error is different and i don't understand it19:58
mordredcorvus: ok. let me see if the root error makes sense to me19:59
clarkbare you running it with all the mounts? if not perhaps that is causing trouble?19:59
corvusno mounts19:59
openstackgerritJames E. Blair proposed openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org"  https://review.opendev.org/71305819:59
corvusmordred: ^ that's a revert of your change because of the nb01/nb01 issue20:00
mordredcorvus:  +#20:00
mordredgah20:00
Shrewshttps://review.opendev.org/713057 Stop comparing hostnames to determine ownership20:00
corvusso to summarize, i have 2 concerns: 1) some nodepool behavior is determined by the hostname, and we will have 2 hosts with the same name.  2) i wasn't able to run the nodepool container as root with "podman run" which weirds me out20:01
corvusShrews: it looks like that "or" is going to cause us to hit that case when we currently don't want to.20:02
mordredcorvus: we need --net=host20:02
mordredcorvus: I have a script in /root20:02
mordredcalled "n"20:02
corvusShrews: so i think we either need to change the nb01.opendev.org hostname, or merge your change first20:02
mordredthat you can use20:02
openstackgerritMonty Taylor proposed opendev/system-config master: Add a nodepool helper script into /usr/local  https://review.opendev.org/71305020:03
fungior turn down nb01.openstack.org, though i suspect we have too much build load to do that before the new builder is in operation20:03
corvusmordred: thanks.  that confirms that the nb01.opendev environment reports 'nb01' as the hostname20:04
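Roughly how the hostname seen inside the container can be checked (sketch; image name is an assumption):

    sudo podman run --rm --net=host docker.io/zuul/nodepool:latest \
        python3 -c 'import socket; print(socket.gethostname())'
    # prints "nb01" here, matching the short builder ids in dib-image-list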
corvusfungi: yes, that's an option.  i'm unsure about the build load, let's check that out.20:04
fungii believe it comes down to how much usable disk space there is. the more builders we have, the less disk utilization on each20:04
clarkbit's a lot better now than it was after cleaning up old fedoras and precise20:05
funginb02 is 65% used on /opt, nb01 34%20:05
fungiso ~100% if we turned off nb01, i expect?20:05
clarkbfungi: ya that math should be roughly correct20:06
fungi(i mean, it wouldn't be immediate, it would take some time to hit that, but still)20:06
clarkband in the meantime we'd possibly delete all those images?20:06
fungiand yeah, the sudden deletion of fedora images is presumably what dropped the disk usage on nb0120:06
clarkbspinning up nb04.opendev.org wouldn't be too bad20:07
corvusthat's probably the safest thing20:07
clarkbbasically add mordred change back but change the name20:07
clarkband then update groups and stuff as necessary20:07
fungimakes sense to me, though i'm about to be tending a hot wok for the next little while so will be less help20:07
corvusthere's also the possibility we could change the hostname in the podman-compose file without building a new host.  but that could also inspire madness.20:08
clarkbcorvus: I like that for its simplicity but have no idea how reliable it would be20:08
fungiright, i thought about that, convince nodepool its hostname is different from what the system knows itself as20:08
fungibut i agree that's icky20:09
clarkbor land shrews' change20:09
corvuscan't land nodepool changes20:09
fungiwhich i've already +2'd20:09
fungibut right, catch-2220:09
clarkbbecause we need fedora-30?20:09
corvusyep.  it's used in a gating job20:09
corvusi think it would be safe to make it non-voting for Shrews change though, if we wanted to go that way20:10
Shrewscould make that job nv in my change, then re-enable?20:10
corvusit's just going to be a thing.20:10
Shrewsjinx20:10
corvus(there are some changes though that i definitely don't want to land without it)20:10
corvus(including the one to remove the gshadow thing)20:10
mordred++20:11
corvusadding "--hostname nb04" to podman run works as expected20:11
clarkbcrazy udea tine lets do ^ to tide us over then on monday we can build it right20:12
clarkb*crazy idea time20:12
mordredso - are we thinking do that - get a f30 image, be able to land changes - fix the underlying stuff20:12
corvusso i think changing the podman-compose file as a temporary measure would be workable.  but the longer that goes on, the more cognitive dissonance we will experience.20:12
mordredthen stop nb01 and rename it back20:12
mordredyeah20:12
mordredit seems like a thing that should only be in place for exactly as long as it takes for us to get a f30 node20:12
clarkbya I think we should commit to replacing new nb01.opendev with nb04 monday20:12
corvusthis is all assuming we can build an f30 image after not doing it for 6 months :)20:13
clarkb(I can do that)20:13
mordredcorvus: wcpgw?20:13
corvushow about we call the innerhostname "nb01opendev" ?20:13
clarkbI think the basics of f30 building is gated in dib20:13
funginb01forealz20:14
clarkbcorvus: ++ that will be less confusing20:14
corvusthen it won't conflict with its past or future replacement20:14
fungiyeah, wfm20:14
mordred++20:14
mordredjust call it george20:14
Shrewscan we just suspend all testing over covid 19 concerns?20:15
corvusShrews: but we promised anyone who wants fedora30 tests could get them20:15
mordredcorvus: by anyone we only meant Tom Hanks20:16
clarkband nba players20:16
fungiit would probably be okay if the tests were only 50% accurate20:16
mordredclarkb: Rudy Gobert and Tom Hanks are the same person20:16
mordredclarkb: Tom is just that good of an actor you never noticed20:16
clarkbthe french american sweetheart20:16
corvusi think the way to do the compose file change is just to keep nb01.opendev in emergency and manually apply that and the new rev of mordred's change20:17
clarkbcorvus: wfm20:17
mordred++20:17
fungiseems reasonable20:18
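The temporary compose tweak being agreed to, as a sketch; the service name, image, and mounts are assumptions, while privileged: true and the hostname override come from the discussion:

    cat > /etc/nodepool-builder-compose/docker-compose.yaml <<'EOF'
    version: '3'
    services:
      nodepool-builder:
        image: docker.io/zuul/nodepool-builder:latest
        hostname: nb01opendev   # temporary: avoid colliding with nb01.openstack.org's "nb01"
        privileged: true
        volumes:
          - /etc/nodepool:/etc/nodepool
          - /var/log/nodepool:/var/log/nodepool
          - /opt/dib_tmp:/opt/dib_tmp
    EOF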
corvusi'll work on that now20:19
openstackgerritMerged openstack/project-config master: Revert "Add fedora-30 to nb01.opendev.org"  https://review.opendev.org/71305820:21
corvusmordred, clarkb, fungi, Shrews: any of you who are available, want to check out the state on nb01.opendev.org?  i modified /etc/nodepool-builder-compose/docker-compose.yaml and checkout out mordred's change 713047 to update /etc/nodepool/nodepool.yaml20:23
clarkbcorvus: looking20:23
corvuss/checkout out/checked out/20:23
clarkbboth lgtm (and for others looking /opt/project-config was updated with mordred's change and /etc/nodepool/nodepool.yaml is a symlink into that repo)20:25
corvusanyone else want to weigh in, or should we run that now?20:29
mordredcorvus: lgtm20:30
corvusso now we should: cd /etc/nodepool-builder-compose; podman-compose up  ?20:30
corvus(as root)20:31
clarkbyes, I'll start a tail -f on builder-debug.log here and watch it20:31
corvusoh yeah, that was the other thing20:31
corvusno builder-debug.log20:31
clarkboh that file doesn't exist right now20:31
corvushrm.  we do have /var/log/nodepool bind mounted20:32
clarkbthe permissions and mounts are such that it should be logging there20:32
clarkband /var/log/nodepool/builds/ has content20:32
corvusbut we're probably running the default run in foreground and log to stderr thing20:32
clarkbin which case podman logs $containername would work?20:32
corvusyes: podman logs nodepool-builder-compose_nodepool-builder_120:33
clarkbas root20:33
corvusyep.  i feel like this is probably not how we want to run it in the long run.20:33
clarkbI'm ready to run that command (as root) and watch it once it is going20:34
corvusokay, i will "up" it now20:34
corvushttp://paste.openstack.org/show/790683/20:35
corvusapparently podman-compose does not work the same as docker-compose20:35
corvusalso, it's just sitting there now, it hasn't returned from that command invocation20:35
clarkbscared me for a second that it was trying to delete things again in the logs but those are from 18 hours ago20:35
corvusso... i guess i will ^C ?20:36
clarkbya it doesn't seem to be doing anything from what I am able to see20:36
mordredhrm20:36
corvusi will run podman-compose down, then podman-compose up.20:36
corvus(docker-compose "up" automatically recreates containers if needed)20:36
corvusit's running20:37
mordredyay20:37
clarkbmkdir: cannot create directory '/opt/dib_cache': Permission denied20:38
clarkbthat is why the build is failing20:38
clarkbI want to say I saw something about this, one moment please20:38
corvusit wants to delete some vexxhost images20:38
mordredthat seems like a really bad choice20:39
clarkbcorvus: mordred I think those are old logs, double check timestamps20:39
corvusno20:39
mordredoh - wait - vexxhost - those could be f30 vexxhost?20:39
clarkboh no it's doing it again20:39
clarkbprobably need to stop it then20:39
corvusi've stopped it20:39
corvusbut i didn't see it succeed at deleting anything it shouldn't20:39
Shrewsdoes it still think its name is nb01 by chance?20:39
corvuscan someone point it out to me20:39
clarkbcorvus: Shrews: one sec, I think I know what is happening (and it's ok for us)20:40
clarkbwe leak images in vexxhost20:40
clarkbwhen that happens it's fair game for any nodepool builder to delete them from the cloud side20:40
clarkbI think it has detected this case and is helpfully trying to delete a leaked image in vexxhost. But we should double check before turning it back on20:40
clarkbalso https://review.opendev.org/#/c/712824/ is the proposed fix for the dib cache thing20:40
corvusso all our builders just log that error every few minutes?20:40
clarkbcorvus: ya I think so20:41
clarkbI have a paste from the other day where I dug into this; trying to find it now so I can cross reference ids20:41
corvusopenstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://image-sjc1.vexxhost.us/v2/images/a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8, Image a0b6ea3e-6c39-41b6-8243-c0c9c6d027c8 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.: 409 Conflict20:41
clarkbhttp://paste.openstack.org/show/790497/ hrm those ids don't match so either they are new leaks or it is bugging out in the scary way20:42
clarkbhttp://paste.openstack.org/show/790684/ I think that shows we have new leaks20:43
clarkband those ids seem to match what it was deleting on new nb01 (I think that means we are ok)20:43
mordredsigh20:44
corvusclarkb: er, so your final word is, these leaks are expected?20:44
mordredclarkb: do we need to clear bfv's again?20:44
clarkbcorvus: ya basically boot from volume in vexxhost sometimes leaks volumes which prevents us from deleting the image. Once the image is in a "deleting" state in zk any builder is free to delete it in the cloud20:45
clarkbcorvus: the transition from active to deleting only happens on the "owner" builder though so we avoid races there by having it try first20:45
corvusyeah, but your pastes i don't understand20:45
clarkb(that also allows it to clean up its local disk)20:45
corvusclarkb: i think you said "ids don't match which is bad"  "something something nevermind it's good we're safe"20:46
clarkbcorvus: you can ignore the first paste at this point I think. What the second paste shows is we have 6 images that are all failing to delete in vexxhost and they are in a deleting state in zk.20:46
corvusso i just want to make sure you still think those particular leaks are harmless and don't represent some new bad thing that nb01.opendev is doing20:46
corvusok, cool.20:46
clarkbcorvus: what the builder logs show from what you just ran is that those images are not deleting because there are leaked volumes in vexxhost preventing the image from deleting20:46
corvusclarkb, mordred: so what's the deal with cleaning this up?20:46
clarkbyup I do think that because the ids in the builder logs match the ids in my second paste20:47
clarkbcorvus: usually we run volume list and search for volumes which have leaked then try to delete them again (there are heuristics for this and mordred has a tool but it's not perfect)20:47
corvusso it's impractical to have nodepool fix this?20:47
clarkbcorvus: do you think we should try cleaning that up first so that we can have clean logs in the builder when we try again with a dib_cache dir?20:47
clarkbcorvus: yes I would say it is impractical to have nodepool fix it20:47
corvusclarkb: under the circumstances, i think clean logs is important20:48
clarkbnodepool could probably do a subset of cases though20:48
mordredyes - there are some cases that are safe20:48
clarkbbut there is another subset where the server itself can't delete and it has the volume attached and that prevents the volume from deleting20:48
mordredbut not all20:48
clarkband we've seen that require intervention from the cloud itself20:48
mordredyeah20:48
clarkbcorvus: ok I'm going to look into cleaning these up manually20:48
corvusclarkb: cool, i'll modify the compose file with the cache fix20:49
corvusoh it's a perm thing20:49
corvuswhatever20:49
corvusi will do what 712824 does :)20:49
clarkbfor anyone else following along there are a bunch of unattached volumes in vexxhost (these are likely the source of the leak)20:49
clarkbI'm going to spot check them to ensure they don't belong to something important (its a test node tenant so shouldn't) then delete them20:50
clarkbmordred: ^ your tool might do it quicker than me though if you want to queue up running it ?20:50
mordredsure20:50
clarkbyup spot checking ~5 of them they are all from just after 2300UTC on march 1120:52
clarkband they are boot from volume volumes with image ids that match our unhappy images in nodepool20:52
clarkbwhat should happen is we delete them (using mordred's tool should work in this case), then old nb0X will delete them from the cloud as there is no volume keeping them around20:52
clarkbwe can then confirm with nodepool image-list as per my paste above and then try again with the new builder20:53
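The cleanup flow clarkb describes, roughly; the cloud name and exact flags here are assumptions:

    # list unattached volumes in the test-node tenant, then spot-check ids/ages
    openstack --os-cloud vexxhost volume list --status available -f value -c ID \
        > /tmp/leaked-volumes
    # after spot-checking, delete them one at a time
    xargs -n1 openstack --os-cloud vexxhost volume delete < /tmp/leaked-volumes
    # the owning builders can then finish the image deletes; confirm with
    nodepool image-list | grep -i fedora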
openstackgerritJames E. Blair proposed opendev/system-config master: nodepool-builder: add /opt/dib_cache  https://review.opendev.org/71282420:53
corvusclarkb: ^ the directory had been created, but it was not bind-mounted, so i have added it to docker-compose20:53
corvus(i'm guessing ianw may have manually run the mkdir?)20:53
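In other words, a sketch of the fix (the ownership detail is an assumption):

    # on the host: make sure the cache dir exists and is writable by the
    # container's nodepool user...
    sudo mkdir -p /opt/dib_cache
    sudo chown nodepool: /opt/dib_cache
    # ...and bind-mount it in the compose file's volumes list:
    #   - /opt/dib_cache:/opt/dib_cache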
mordredI'm going to run the clean tool yeah?20:53
clarkboh possibly20:53
clarkbmordred: yes I think it is safe to do so since the cleanup tool checks for unattached volumes >24 hours old iirc20:54
clarkbmordred: and I've confirmed these seem to be in that state20:54
mordredthat is correct20:54
mordredclarkb: can you re-check your list?20:54
clarkbmordred: they seem to still be there20:55
mordredclarkb: yeah. I didn't get prints. lemme see what's up20:55
clarkbshould I start manually deleting?20:55
mordredone sec20:56
clarkbok. I've edited a file with a list of them and can run it through xargs openstack volume delete if we want another option20:57
clarkbwill wait20:57
mordredclarkb: yeah. I don't know why it's not cleaning them20:58
mordredgo ahead20:58
clarkbk20:58
clarkbit's doing them serially and taking a couple seconds each so it may be a minute or two21:00
clarkbbut it is going21:00
mordredcool21:00
mordredclarkb: oh - I think my script didn't do it because they're already unattached volumes21:00
clarkbah21:00
mordrednot volumes reporting being attached bogusly21:01
mordredso - yay script doing what it's supposed to!21:01
clarkbdown to 4 images in zk now21:01
clarkbfrom 621:01
clarkbthere is one that is much older than the others that we might have less luck cleaning21:01
clarkbbut if we can get it down to one, check logs for a single uuid is better than 621:01
clarkbdown to 2 now21:03
clarkbok I don't think that bionic image will delete because it's used by 3 volumes that refuse to delete21:05
clarkbthe opensuse image isn't deleting because volume list claims it is in use by a server volume (not unattached)21:05
clarkb| 303ed29e-3c06-4738-a0bd-e2f0eb50991c |      | in-use    |   80 | Attached to opensuse-15-vexxhost-sjc1-0014437332 on /dev/vda     |21:06
clarkbI'm checking to see if that is a held node21:06
clarkb| 0014437332 | vexxhost-sjc1       | opensuse-15                   | d2d73e84-d988-4605-a596-b0ddef9b2b23 | 38.108.68.90    | 2604:e100:3:0:f816:3eff:fe52:b724       | deleting | 00:00:02:34  | locked   |21:07
corvusthat seems to be a recently deleting node...21:07
clarkbya we can probably be patient with it assuming that server deletes21:07
clarkbif it doesn't delete then it may be in a similar situation to the bionic images where it was attached to servers that refuse to delete which causes a chain reaction of undeletable resources21:07
corvusit is an opensuse-15 node, that does seem likely21:08
clarkbc5b3b55a-4c74-4d41-998c-265342ab3afc and c10176f9-56a3-4749-a5dc-44ab56ec3771 are the images that are safe for new builder to delete if it comes to that21:08
corvuswell, how about we go ahead and fire it up again21:08
clarkbI'm ok with that21:08
corvushopefully we can deal with those 2 errors :/21:08
corvusok, here goes21:09
corvusfailing the build again21:09
clarkb2020-03-13 21:09:53.958 | mount: /opt/dib_tmp/dib_build.s3OQSzgg/mnt/proc: permission denied.21:10
clarkbI think fs perms are ok, is that a caps issue with procfs?21:11
clarkbI wonder if the ci of this is using docker instead of podman and we are hitting behavior differences there21:12
corvuswhat kind of testing has this undergone?21:12
corvuswhat ci?21:12
clarkbcorvus: there is a full on integration job similar to the older nodepool job that runs it outside of a container. I'm trying to pull it up now21:12
corvusright, i'm curious where "run a nodepool-builder which runs dib inside a podman container" has been tested21:13
corvus(or even in a docker container)21:13
clarkbI know there was something; it's what we set up the sibling container stuff for, so we could use glean and stuff from source in containers21:15
clarkbnow just trying to sort out where it ended up21:15
clarkb(but that may have been docker not podman)21:15
mordredyeah. may have been21:15
mordredin fact - probably was21:15
mordredso maybe this is a good reason to use docker not podman - at least until such a time as we have podman-based gate testing21:16
corvuswe should "test like production"21:16
clarkbnodepool-functional-container*21:17
clarkbhttps://zuul.opendev.org/t/zuul/build/459f34fe1c93447c8353fe43a88e81b6 is a semi recent run21:18
clarkband ya it is using docker21:18
clarkbhttps://review.opendev.org/#/c/698818/6/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j2 shows the compose file21:19
clarkbok more investigating has been done. that mnt/proc path may be owned by root21:20
clarkbit's readable by non-root though21:20
clarkbbut dib is trying to mount a thing there21:21
clarkb| + /opt/dib_tmp/dib_build.w2ztziu9/hooks/root.d/08-yum-chroot:main:239              :   sudo mount -t proc none /opt/dib_tmp/dib_build.w2ztziu9/mnt/proc21:21
clarkband that will require root which it probably doesn't have?21:22
clarkbwill docker run the container processes as root maybe?21:22
clarkb(and podman does not)21:22
fungiokay, stir fry has been produced, consumed and then cleaned up. skimming to see where i can be of help21:22
mordredclarkb: the container as it is now is supposed to be running as nodepool and that nodepool is supposed to have sudo access21:23
corvusclarkb: no, the nodepool dockerfile says run as the nodepool user21:23
corvusmordred: i don't see evidence of sudo access21:23
mordredcorvus: there's a line adding a sudoers file that I deleted in the revert patch ... one sec21:23
clarkbdrwxr-xr-x 2 root     root     4096 Mar 13 21:21 proc21:23
clarkbthat is what it looks like from outside of the container21:23
corvusmordred: when i run "sudo" in "podman run" i get a password prompt21:24
mordredhttps://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L55-L5621:24
mordredcorvus: that's in the nodepool-builder image specifically21:24
corvusmordred: yeah, that's what i'm running21:25
mordredcorvus: so we should maybe update that script to use that and not nodepool-builder21:25
mordredyeah?21:25
mordredawesome21:25
corvusmordred: oh, nope, sorry21:25
corvusmordred: nm, sudo works21:25
corvusi don't understand dib, so i'm not the person to take point on fixing this.21:26
clarkbthen I'm stumped why this doesn't work because sudo is used in the element (and the dirs on the line above created with sudo are created)21:26
clarkbcorvus: basically it's trying to create a procfs because tools need it21:26
clarkbit does that in /opt/dib_tmp/$build_dir/proc so that it can be chrooted into and separate from the host's procfs aiui21:27
clarkbhttp://paste.openstack.org/show/790686/ we see sudo succeed at creating the dirs on the first line21:28
clarkbbut then line 3 fails due to "mount: /opt/dib_tmp/dib_build.JnvNFPzW/mnt/proc: permission denied."21:28
mordredwe're setting privileged: true in the compose file - so I'd expect it to not be a procfs thing21:28
clarkbI'm guessing this is a capabilities thing since fs stuff looks fine21:28
clarkbmordred: maybe privileged means less privileged than docker on podman?21:28
corvuswe're throwing this host away anyway, right?  should we jut apt-get install docker and see if it works?21:29
clarkbcorvus: ya we could try that I suppose.21:29
clarkb(since testing says that should work)21:30
corvusmordred: ?21:30
mordredyeah21:30
corvusk, will do21:30
mordredI think bionic has new enough that we don't need to bother doing the upstream repo21:30
mnaseri am just jumping in this and not reading scrollback21:30
mnaser(yet)21:30
mnaserbut we use cri-o in prod with k8s for nodepool builder21:30
mnaserand our image builds have been ok, if that helps signal anything.21:31
mordredcool. so it may just be a settings thing21:31
mnaseri _really_ remember running into a similar issue for that mount thing21:31
corvusyeah that does seem to suggest there should be a route to getting it working with podman21:31
* mnaser looks at the helm charts21:31
mordredyeah. it would be more fun to figure out if we weren't unexpectedly doing so when it's more important :)21:32
mnaserhttps://opendev.org/zuul/zuul-helm/src/branch/master/charts/nodepool/templates/builder/statefulset.yaml21:32
mnaserok so21:32
mnaseri mount /dev into the actual container. i don't remember why, a note there would be nice.21:32
mnaseri think its for losetup things21:32
corvuswe do not have that in our compose file21:32
mordredcorvus: do you think it's quicker to try adding /dev to the volume mount list real quick?21:33
mnaserthat might be something that fails much later on though21:33
clarkbcorvus: it's also not in the testing compose file21:33
mnaserbut .. i remember very much needing it21:33
corvuseither one at this point :)  i have installed docker21:33
mordredcorvus: you're driving - I defer to you on which you want to try first21:33
corvusi'm happy to throw /dev in there, see if it works with podman, then try docker without /dev, then try docker with /dev21:33
corvusi'll do that.  test cycle should be fast.21:33
mordredcorvus: let's do that21:34
corvus      - /dev:/dev:rw21:34
corvusjust like that?21:34
mordredcorvus: let's say yes!21:34
mnaserseems like that translates to roughly what k8s is modeling, yeah21:34
corvusit failed, i'm trying to find the error21:35
corvusmount: /opt/dib_tmp/dib_build.xUudni10/mnt/proc: permission denied.21:36
mordreddocker it is!21:36
corvusis dim_tmp bind-mounted in the testing?21:36
corvusdib_tmp21:36
corvusor is it a straight-up volume?21:36
clarkbcorvus: good question, no21:36
clarkbit's not even that, it may be using /tmp21:37
corvussigh21:37
corvusthis isn't going to work in docker either21:37
clarkbdib's default is to use the regular /tmp implementation. We have to use something else because our images are too big for that21:37
clarkb(that is where /opt/dib_tmp comes from)21:38
corvusi guess i'll do the docker test for completeness21:38
corvusbut i'm not optimistic21:38
mordredcorvus: maybe override user and run the container as root21:38
mordredrather than with the USER nodepool setting from inside the container21:39
mordredI don't know why that would make any difference of course21:39
mordredgrasping at straws21:39
corvusmordred: i doubt that's it -- i suspect the problem has something to do with mounting something on a bind mount in a container21:39
corvuslike, i don't think the mount can propagate up21:39
clarkbhrm21:39
mordredclarkb: maybe try not passing -v /opt/dib_tmp ?21:40
clarkbmordred: if our / is big enough to support that it may work21:40
corvusand hope that's enough for 1 image?21:40
clarkb/dev/xvda1                  39G  3.4G   36G   9% /21:40
clarkbit will be close21:40
clarkbbut I think that may be enough for a single image21:40
mordredyeah - it might work21:40
mordredthen - when we come back to this - we can set up docker/podman to put its container storage directly in /opt perhaps21:41
mordredbut that'll be for when we're sorting this out properly in the first place21:41
clarkb(fwiw I thought mounts were effectively flat in the kernel, and then the mount points give us an illusion of nesting, however cgroups may have completely changed that I guess)21:41
corvusi think it may be succeeding under docker21:42
mordredcool21:42
clarkbya fedora-30-0000000549.log is showing it doing package stuff which implies it got further than the /proc mount21:42
corvus(current run is docker-compose up without /dev mounted)21:42
clarkbyay, and an interesting podman difference for the supposedly compatible tool :)21:43
mordredcool so maybe docker is doing a mount propagation different21:43
corvusclarkb: it's incompatible except for that one thing, i guess :(21:43
corvusnow i regret running docker-compose without the -d argument21:43
corvusmy hubris is why it succeeded21:43
clarkbthat is probably worth bringing up with the podman folks since I know rhel installs podman as `docker`21:43
clarkbcorvus: oops21:43
clarkbit complains about its hostname not being resolvable but that appears to be a non-issue so far (it's not like my dib VMs locally ever resolve in dns properly either)21:44
mordredclarkb: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search for mount propagation21:44
clarkbmordred: I think we need shared propagation21:45
clarkb?21:45
mordredI think it's worth trying if/when we get back to investigation21:46
clarkbya21:46
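What trying shared propagation could look like under podman, per the linked man page; an untested sketch, and whether it actually fixes dib's in-chroot /proc mount is the open question:

    # the host path must itself be a shared mount for rshared to take effect
    sudo mount --make-rshared /opt
    sudo podman run --rm --privileged --net=host \
        -v /opt/dib_tmp:/opt/dib_tmp:rshared \
        docker.io/zuul/nodepool-builder:latest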
clarkbthe build is cloning all the git repos right now21:46
clarkbmay be a while21:46
clarkbmordred: maybe running podman as docker changes those behaviors?21:46
clarkb(so is a non issue when running it on rhel8 that way)21:46
mordredclarkb: maybe so? I'm sure there's more than one thing to learn here21:47
mnaseri think the /dev thing comes into play when losetup happens and tries to mount the qcow221:48
clarkbmnaser: well we don't mount /dev in our testing either21:48
* mnaser shrugs at why i ended up needing it21:49
clarkbat this point I'm mostly worried some element that is infra image specific will break rather than the stuff in dib because we test the stuff in dib21:49
mnasershould have probably documented that but yeah21:49
clarkbmnaser: possibly because docker mounts a /dev by default?21:49
clarkbor it does when privileged?21:49
clarkbyou need things like /dev/random typically21:49
mnaserperhaps it was that, or maybe cause cri-o doesn't mount it?  i dunno, sorry, can't provide much more useful input other than memory that's not usually reliable :)21:50
clarkbmnaser: also if podman is cool with breaking these things why can't it break or add ipv6 support21:50
clarkber sorry that was for mordred21:51
mnaseri'll take that one too21:51
mnaser:p21:51
clarkbLike I get the goal of being compatible if they actually did that, but they haven't (as evidenced here)21:51
corvusi have run "CTRL-\" to exit docker-compose without stopping the underlying containers21:51
mordredclarkb: sigh21:51
mordredcorvus: neat!21:51
clarkbthe dib build is still proceeding fwiw so seems to have worked21:51
clarkbI'm going to take a break from watching git clone log lines scroll by and find something to drink, back in a bit21:54
corvusclarkb: something to drink, or something to DRINK?  cause i'm pretty sure we could all use the latter21:55
corvusi will also bbiab21:55
clarkbFor now just drink :)21:55
*** DSpider has quit IRC22:02
ianw... thanks for looking in on this ... it was supposed to be a soft rollout but clearly got a little out of hand22:04
fungicorvus: i'm just impressed you can ctrl-\ without killing your xsession22:13
ianwnone of this is helped by me forgetting to git add nb01.opendev.org.yaml in https://review.opendev.org/712836 ... sigh :(  but then it seems we've found the hosts use short-names that collide anyway22:15
clarkbianw: ya I think all we want at this point is to get f30 uploaded. then we make nb04.opendev.org22:15
clarkbas well as clean up nodepool as necessary in parallel22:16
corvusianw: i think we're a little fuzzy on the contribution of the short-names -- the current system is using a unique short name but also a completely duplicated file with all the images22:16
corvusat this point, i don't know which of those, or perhaps both, are necessary for this to work22:16
ianwthe other thing i found, that i was hoping would not be an issue till monday, was that the limestone .pem in the config file is hard-coded to ~nodepool22:17
corvusianw: the other thing is this change is necessary: https://review.opendev.org/712824  and also we either need to run with docker, or explore mount propagation settings for podman22:18
corvusianw: i don't think we've recorded that last bit yet; you might want to jot that in your notes22:18
ianwyes, i had clearly over-estimated the podman == docker situation22:18
ianwbtw the new rpm format was switched in for *f31*; that's what i've been trying to get going22:20
clarkbianw: what stopped the f30 builds then?22:20
ianwf30, iirc, started having segfaults building, that, again, iirc, didn't happen with bionic building22:20
clarkbah ok so different issue, but happier on newer platform22:20
ianwbut, f31 was supposed to be the solution anyway22:21
fungigot it, so even if we'd unpaused it on the xenial builders they still wouldn't have produced a f30 image22:22
ianwno; and i don't have a good story filled out on this :/  which is my own fault22:23
mordredianw: I think we can sort out the limestone thing22:23
mordredianw: it might be a better choice to run the containerized hosts with the config in /etc/openstack and bind-mount that in rather than putting them in /home/nodepool like we have been doing - but we probably have a few things to figure out before we get to that :)22:24
ianwmordred: yeah, it will just prevent uploading; could either link it, or i was thinking it is probably better (but a bigger change) to move it to /etc altogether22:24
mordredyah22:24
ianwheh, jinx, ... that's why it was a "monday" thing :)22:25
mordredyup22:25
mordredand - I like making it a self-contained change rather than just part of the puppet>ansible+container22:25
ianw2020-03-13 22:24:14.798 | Couldn't parse 'sudo: unable to resolve host nb01opendev: Name or service not known file /opt/cache/files/sudo: unable to resolve host nb01opendev: Name or service not known sudo: unable to resolve host nb01opendev: Name or service not known' as a source repository22:26
fungiwell poop22:27
ianwoh dear; i guess we somehow look at the result of a "sudo" command, and that message has confused it22:27
mordred*headdesk*22:27
fungiso whatever we tell docker/podman the hostname is also has to resolve (at least via hostfile?)22:27
clarkbfungi: it must be at least via hostfile because I've done builds without proper dns setups on local VMs22:28
mordredis there a way to tell sudo to shut up about the host thing?22:28
clarkbmordred: there is iirc22:28
clarkbof course all the docs around this say just edit /etc/hosts22:30
ianwi really have no idea why all this needs sudo but it's just been like that @ https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/source-repositories/extra-data.d/98-source-repositories#L15722:31
fungiclarkb: yeah, scoured manpages and haven't turned up an option to disable local host resolution22:33
clarkbfungi: I thought it was to quiet the logging not necessarily stop the lookups22:34
ianwalso, this doesn't happen in gate?22:34
clarkbI want to say we tried this with devstack at some point22:34
clarkbianw: no, because it's just checking /etc/hosts22:34
clarkb(I guess we can bind mount that in)22:34
ianwi mean the gate test22:34
ianw... ohh, we probably just don't run it; don't cache any repos22:35
clarkbianw: it uses host networking too so the hostname is probably correct22:35
ianwyeah; nothing in e.g. https://zuul.opendev.org/t/zuul/build/6130771f463743708b410e7a2647641f/log/nodepool/builds/test-image-0000000001.log22:36
corvusoh, i guess if you use host networking docker might not update /etc/hosts on the container?22:36
clarkbcorvus: thats my guess22:37
corvusthe contents inside the container match the host; what if we add it to the host /etc/hosts and restart the container?  maybe it will copy it?22:37
clarkbcorvus: seems reasonable22:37
corvusi'll do that now22:37
corvusyes, it did that22:38
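The fix just applied, as commands (sketch; the 127.0.1.1 convention is an assumption):

    echo '127.0.1.1 nb01opendev' | sudo tee -a /etc/hosts
    cd /etc/nodepool-builder-compose
    sudo docker-compose down && sudo docker-compose up -d
    # with host networking, the restarted container picks up the host's /etc/hosts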
clarkbthis time around should be quicker due to caching22:38
corvusit's running build 551 now22:38
clarkbI don't see sudo warnings after sudo commands22:39
ianwwe can also iterate faster if we want to manually stop the caching with an override22:40
ianwhttps://opendev.org/openstack/project-config/src/branch/master/tools/build-image.sh#L7722:41
clarkbianw: I think we've cached the bulk of them now22:41
clarkbianw: so it should just update them at this point (and be much quicker)22:41
ianwyeah, much is relative :)22:41
clarkbits 1/4 done now :)22:44
ianwi'm adding stories to https://storyboard.openstack.org/#!/story/200740722:49
clarkbthanks22:49
ianwconverting this to nb04 after seems to avoid any collision issues, i'll put that in22:49
ianwdo we want to fully investigate podman before that, or convert to docker?22:49
ianw(all of this is me just following mordred ... at the time of the gate tests we were using docker, then when i wrote the production deployment i switched to podman because that's what gerrit was using now :)22:50
clarkbianw: I'm fine with docker honestly. But mordred did link to the podman docs on this22:50
clarkbianw: https://github.com/containers/libpod/blob/master/docs/source/markdown/podman-run.1.md search mount propagation22:51
clarkb2020-03-13 22:52:11.795 | Couldn't parse 'E: Unable to locate package lsb-release file /opt/cache/files/E: Unable to locate package lsb-release E: Unable to locate package lsb-release' as a source repository22:53
clarkbI'm guessing that means lsb_release isn't working properly22:54
clarkband it is trying to cache distro packages? so relies on that info22:54
ianwE: ... is that from apt?22:55
clarkbor yum?22:55
ianwsorry, yeah pkg manager ... that seems like a weird place to get that22:55
clarkb2020-03-13 22:52:11.775 | Getting /opt/dib_cache/source-repositories/repositories_flock: Fri Mar 13 22:52:11 UTC 2020 for /opt/dib_tmp/dib_build.IoVQmc68/hooks/source-repository-images22:56
clarkbis the thing before22:56
clarkbhttps://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos from there maybe?22:57
clarkboh that script generates the list then cache-url consumes it22:59
clarkbI think we may be generating an invalid image list23:01
clarkbpossibly because lsb-release doesn't exist23:01
clarkbhttps://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 that script (I'm pulling it up now to see what it does and if it expects lsb to be there)23:02
ianwour gate testing build of f30 for comparison https://zuul.opendev.org/t/openstack/build/879fcc41186d4a55b8ed4b6f561e2909/log/nodepool/builds/test-image-0000000001.log23:04
clarkbya I think that is it. That script sources devstack/functions which sources devstack/functions-common which attempts to install lsb-release if it does not exist23:04
clarkband that is coming from apt23:04
clarkbwhat's odd is that it would fail though23:05
clarkb(I would expect it to install, but it doesn't possibly because we clean things up on the image enough that it can't do a package install without an update first?)23:05
ianwthat sounds highly likely23:06
clarkbianw: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 is the underlying source of that though I think23:06
ianwit seems like lsb-release could be a bindep of dib23:06
clarkbexcept we don't even need it for this functionality23:07
ianwlsb-release [platform:dpkg]23:07
clarkb(the image list generation doesn't need to know what distro it is running on)23:07
ianwit is actually23:07
clarkbonly as a side effect I think23:07
ianwso that should only trigger if lsb_release isn't there, and it should be there from bindep.txt23:08
clarkbdo we install dib's bindep?23:09
clarkbthat's what the sibling stuff should get us?23:09
ianwwait, "command lsb_release" actually runs it, right23:10
ianwoh no, it's "-v"23:10
ianwclarkb: yes, i think that dib's bindep should be installed by the container build23:11
clarkbthinking out loud here we could edit /opt/project-config to stop caching images23:12
clarkb(just to see if anything else will break)23:12
clarkbcorvus: mordred fungi ^ any opinions on that?23:12
ianwohhh, actually https://zuul.opendev.org/t/zuul/build/d3ffa91e9f8d4fbea364203a864a054a/log/job-output.txt23:15
ianwit doesn't install dib bindep.txt ... i remember we talked about this https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L6623:16
clarkbwhat's the difference between lsb-base and lsb-release?23:16
fungiwhat's broken in the current image list?23:16
ianwi think we need to add it @ https://opendev.org/zuul/nodepool/src/branch/master/Dockerfile#L6623:16
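The kind of Dockerfile addition being suggested, as a sketch only; the real file's package handling isn't reproduced here:

    # in the nodepool-builder image build, install dib's bindep packages,
    # which include lsb-release on dpkg platforms
    RUN apt-get update \
        && apt-get install -y lsb-release \
        && rm -rf /var/lib/apt/lists/*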
fungilsb-base is a set of "standard" packages which make a given distro meet the lsb minimum requirements23:17
corvuscatching up23:17
fungilsb-release is a tool to tell you what distro you're on and various other bits needed to gauge lsb compatibility23:17
clarkbfungi: we need lsb-release on our image to make https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L84 happy23:17
clarkbcorvus: ^23:17
fungithe lsb and thus lsb-base are effectively dead since years23:18
clarkb*on our nodepool-builder image23:18
fungilsb-release as a relatively commonly found utility to tell you what distro/release your on has survived23:18
fungis/your/you're23:19
fungi/23:19
ianwclarkb: it's emergency-ed right?  i think we can skip caching just to see if it gets a .qcow2 out23:19
clarkbianw: yes it is emergency'd23:19
clarkbianw: and ya I think that is probably the next best step now. Should we just disable image caching?23:19
corvusyeah, i don't think this is used for devstack tests?23:19
corvusso shouldn't be a big deal23:20
corvusthis==f3023:20
ianwumm, it will be, but at this point that's a minor concern, it's non-voting23:20
clarkbI can comment out https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#L123-L145 and add a pass?23:20
clarkbor rm the file entirely23:20
clarkbin /opt/project-config on the server23:20
fungiwell, also devstack will just download those images if they're not cached23:20
corvusi didn't see it in a codesearch23:21
corvusbut it's there, i stand corrected23:22
clarkboh we can just remove the cache-devstack element23:22
corvusit's not hard to update the image if we want23:22
clarkbthat's cleaner and nicer23:22
corvuswe can just exec a shell, install the package and docker commit23:23
corvusas long as we're just plowing through the punch list to get something working23:23
ianwi think it's actually more just the "apt-get update" that's required23:23
clarkbcorvus: well dib is actually trying to install that but it is failing (so we'd have to debug that too)23:23
clarkbcorvus: the error we get is from the attempted install :)23:23
ianwyeah, i think that's just because the container has had all its metadata purged23:23
corvusdib is trying to install packages on the system it's running on?23:24
clarkbcorvus: by way of devstack :/23:25
ianw... i'm not saying it's right, but it is23:25
clarkbcorvus: so not really dib, but the dib element that calls into devstack which then side effects23:25
corvusi am now in favor of removing the image caching element everywhere23:25
corvusthat is really uncool23:25
clarkbcorvus: https://opendev.org/openstack/devstack/src/branch/master/functions-common#L321-L338 there23:25
corvusand unsafe23:25
corvusthat was never supposed to happen23:25
ianwwell, that package *is* part of bindep, so if we had fully installed dib's bindep we probably wouldn't notice23:26
corvus(i mean, we just gave devstack root on the entire control plane)23:26
corvus(i'm not exaggerating -- you can use our image builds to eventually jump to any host we manage)23:26
clarkband yes, from memory, it wasn't supposed to way back when sean added that23:26
corvuswell, i added the image caching but sure23:27
funginot *just*, as we've been running a script from it during image builds since, what, 2014?23:27
clarkbcorvus: well the original version was a naive scan (which was super safe)23:27
corvusclarkb: yep23:27
clarkbthen sean wrote a thing in devstack to list the images non naively23:27
corvusand there was a reason for that23:27
clarkband I'm pretty sure the original version of that was also safe23:27
clarkb(but I could be wrong)23:27
clarkbswitching to caching a cirros or three is probably sane at this point23:28
fungisean's implementation that i remember parsed the devstack scripts to find urls23:28
clarkbtrove and the weird container stuff that was going on are basically EOL23:28
clarkbso what we end up with is "what cirros images do we care about"23:28
corvusi have to go now.  my vote is to disable the caching element globally and add anything we want in a static element.23:29
clarkb(and I think we can just list those)23:29
fungii do think maintaining a static list of images at this point is probably low-effort. devstack rarely adds/removes/updates its own image set since ages23:29
fungibut also i agree, granting the devstack project a root access backdoor on all nodes through the image build process is not good under our present model23:30
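A sketch of what a static caching element could look like; the URLs are illustrative assumptions, not the contents of 713081, and $TMP_HOOKS_PATH/source-repository-images is the dib mechanism visible in the build logs above:

    #!/bin/bash
    # extra-data.d script: emit a fixed list of image URLs for cache-url to
    # download into /opt/cache/files, instead of sourcing devstack functions
    set -eu
    for url in \
        http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img \
        http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-uec.tar.gz ; do
        echo "$url" >> "$TMP_HOOKS_PATH/source-repository-images"
    done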
ianwi've filed https://storyboard.openstack.org/#!/story/2007407 task #3906623:32
ianwclarkb: so are you removing cache-devstack?23:34
clarkbianw: I got sidetracked thinking about rewriting cache-devstack :)23:36
clarkbI'll rm it from /etc/nodepool/nodepool.yaml now23:36
clarkbdone23:36
ianwfor right now, do you want to apt-get update in the container and just see if it passes?23:38
clarkbI probably won't; thinking time is better spent updating cache-devstack at the moment23:39
ianwdocker exec 4f141126d67d sudo apt-get update ... i did that ... let's see if this build goes23:39
ianwclarkb: #39066 assigned to you :)23:40
openstackgerritMohammed Naser proposed openstack/project-config master: add vexxhost/openstack-operator  https://review.opendev.org/71308023:48
ianw... ok, i think it got a little further, it's still going23:50
ianwit's at everyone's favourite pip-and-virtualenv23:54
* fungi can't wait to see that gone23:55
ianwit's making the image!  yay23:55
openstackgerritClark Boylan proposed openstack/project-config master: Statically cache devstack images and packages  https://review.opendev.org/71308123:55
clarkbianw: fungi corvus mordred ^ that should be a safe version and largely backward compatible with the current set of cached stuff23:56
clarkbI decided to punt on arm64 for now23:56
ianwclarkb: i was thinking perhaps devstack should keep that static list?  i'm not sure anyone contributing changes there would know to update it, at least it would come up in a grep in the local source tree?23:57
clarkbianw: ya we could do that as an improvement. I was basically trying to get a simple thing done that would work for now23:58
clarkbthough one issue is architecture differences23:58
clarkbif that is dynamic and in devstack we'd potentially have the same problem all over again23:58
clarkb(also the etcd caching is a bit annoying because as far as I know basically nothing is using it)23:59
clarkb(basically I acknowledge that I've punted on a few things including arch and user accessibility, but for short term this should be a good change and we can figure out longer term solutions to those problems?)23:59

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!