Monday, 2023-05-22

*** amoralej|off is now known as amoralej06:11
opendevreviewAlfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S  https://review.opendev.org/c/zuul/zuul-jobs/+/88379007:23
opendevreviewAlfredo Moralejo proposed zuul/zuul-jobs master: Use release CentOS SIGS repo to install openvswitch in C9S  https://review.opendev.org/c/zuul/zuul-jobs/+/88379008:06
opendevreviewIan Wienand proposed openstack/diskimage-builder master: fedora: don't use CI mirrors  https://review.opendev.org/c/openstack/diskimage-builder/+/88379810:31
*** amoralej is now known as amoralej|lunch12:29
*** dmellado90 is now known as dmellado12:48
*** amoralej|lunch is now known as amoralej13:09
TheJuliao/ Hi folks, any chance we can get a node held for the next failed ironic-grenade job ?14:55
TheJuliawe have a few different changes which now seem to result in the database upgrade freezing :(14:56
fungiTheJulia: sure, failure on any project and change for the job named "ironic-grenade" ?14:59
TheJuliaon openstack/ironic is fine14:59
TheJuliabut I think that is the only place it is run14:59
fungiin the past we matched failures for an ironic-grenade-multinode-multitenant according to my shell history, but this time the job name is just "ironic-grenade" right?15:00
TheJuliacorrect15:01
fungicool, i set this just now:15:01
fungizuul-client autohold --tenant=openstack --project=opendev.org/openstack/ironic --job=ironic-grenade --reason="TheJulia troubleshooting frozen database upgrades" --count=115:01
TheJuliaawesome, either myself or iurygregory will be investigating. We have independent changes which seem to tickle extreme database sadness :(15:02
fungiTheJulia: iurygregory: you should see a node with state=hold and the above reason text in the comment column at https://zuul.opendev.org/t/openstack/nodes once there is one. just let us know in here and one of us can add access for your ssh key15:03
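(Aside: a minimal sketch of how a held node can be found from the CLI, assuming zuul-client's autohold-list subcommand and the tenant name used above; the exact invocation is an assumption, not a command taken from this log.)

    # list outstanding autohold requests for the openstack tenant (sketch)
    zuul-client autohold-list --tenant=openstack
    # once a matching build fails, the held node shows up with state=hold at
    # https://zuul.opendev.org/t/openstack/nodes and an admin adds the
    # requester's ssh key so they can log in and debug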
TheJuliathanks15:04
fungialso i love the new(ish) nodes view in the zuul dashboard15:04
TheJuliaThat is kind of nice to see15:05
TheJuliagives people an idea of the scope of what is going on quite nicely15:05
funginow to figure out what's gone sideways in rax-iad15:05
fungiTheJulia: a more direct indicator of scale can be seen at https://grafana.opendev.org/d/21a6e53ea4/zuul-status15:06
fungibut that's not built into zuul, just plotting the statsd emissions it provides15:07
fungias for rax-iad, openstack server list reports 117 instances in ERROR state. looking at one chosen at random, it has task_state=deleting vm_state=error fault={'message': 'InternalServerError', 'code': 500, 'created': '2023-04-07T12:36:17Z'}15:14
fungiso that one has been stuck that way consuming quota for a month and a half15:15
fungii wonder if they all have roughly the same timestamp15:15
iurygregoryThanks for the information fungi o/15:20
opendevreviewBirger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element  https://review.opendev.org/c/openstack/diskimage-builder/+/88385515:31
fungilooking at the created timestamps in the fault messages, they range from 2023-04-03T03:35:15Z to 2023-05-22T05:23:55Z so whatever the issue, it's ongoing15:31
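(Aside: a rough sketch of how instances stuck in ERROR and their fault timestamps can be enumerated with the openstack CLI; the loop below is an assumption about how this was checked, not the exact commands from the log.)

    # list servers in ERROR state and dump each one's fault record, which
    # includes the 'created' timestamp referenced above (sketch)
    openstack server list --status ERROR -f value -c ID |
      while read -r id; do
        openstack server show "$id" -f value -c fault
      done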
clarkbI've had a weekend to think about it after now spending a good chunk of a couple of weeks digging into the whole quay.io + docker + speculative container testing problem and I just can't bring myself to recommend "switch everything to podman first". Podman, unfortunately, brings its own set of problems that we've run into so far. Installing it on Ubuntu is sketchy until Jammy, you15:35
clarkbcan't syslog, there's a whole transition problem I haven't even begun to really dig into (are podman and docker even coinstallable, I think they share some binary dependencies?), do we temporarily double our disk space needs between images and volumes?, how do we automate the switch (do we automate the switch)? Nothing that would prevent us from moving forward (though I haven't yet15:35
clarkbbeen able to poke at nested podman with nodepool-builder), but plenty that will make this process necessarily slow and measured. Additionally, it feels like I'm being expected to do 99% of the work. I understand there are ideals at play here but I can't personally be expected to upgrade every server to Jammy so that podman is installable, rewrite and test all of the configuration15:35
clarkbmanagement, and transition running services myself. If others continue to feel strongly about this I can help get the nodepool-builder testing up and running, but I don't think I can commit to more at this point. I'm also happy to revert the quay.io image moves or implement the skopeo workaround hack. The more I think about this workaround the less it bothers me. It is quick,15:35
clarkbstraightforward, gets us the functionality we want without dramatically compromising the testing of what we will eventually deploy to production. The biggest downside is we have to manually curate the list of images and the inclusion of the role in our playbooks/roles.15:35
clarkbcc corvus fungi frickler ianw tonyb and anyone else that might be interested15:35
corvusclarkb: we don't store any important data volumes, so i think you can basically strike that one off the list.15:39
corvusclarkb: (that's a minor point of course)15:40
fungiinfra-root: i've opened ticket #230522-ord-0001072 with rackspace about the stuck deleting nodes in iad15:41
clarkbcorvus: that's true, everything should be bind mounted in opendev. Except for any mounts we may have missed. It does look like both mysql/mariadb and zookeeper have a complete set though which would be the main ones to worry about15:41
clarkbfungi: thanks!15:41
corvusclarkb: i think if we don't want to do podman, then we should either switch back to docker, or make the skopeo solution a real solution (with automated pulls from zuul artifacts).  but keep in mind that has issues, like we can never do a "docker pull", and our production playbooks do that a lot.15:42
corvusi mean, do we actually have an idea for a solution to the "pull" problem?15:42
clarkbcorvus: the change I wrote addresses that by injecting the skopeo pull after the docker(-compose) pulls. I think that is really the only option15:43
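(Aside: a minimal sketch of the skopeo workaround under discussion: after the normal docker(-compose) pull, the speculative image from the in-job buildset registry is copied over the top of it in the local docker daemon. The registry variable and image name below are placeholders, not values from this log.)

    # run the usual production pull first
    docker-compose pull
    # then overwrite the image with the speculative build from the buildset
    # registry; TLS verification is relaxed for the in-job registry (sketch)
    skopeo copy --src-tls-verify=false \
      docker://$BUILDSET_REGISTRY/opendevorg/example-image:latest \
      docker-daemon:opendevorg/example-image:latest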
corvusright, so it's basically giving up on the testing production idea -- we have to remember to write our production playbooks to include test code, and if we forget that, we transparently lose testing without any indication.15:45
clarkbwe could update the role I have written to only do the skopeo pull based on artifacts and stop needing to account for the specific list of images there. But I don't think you can make this transparent and have docker(-compose) pulls15:45
corvustbh, dockerhub is sounding pretty good now15:46
opendevreviewBirger J. Nordølum proposed openstack/diskimage-builder master: feat: add almalinux-container element  https://review.opendev.org/c/openstack/diskimage-builder/+/88385515:46
clarkbI don't personally see it as giving up on testing production. We are still testing production. We even cover the docker(-compose) pull code. We just run a little extra code in testing. It isn't perfect but I don't see it as giving up15:46
corvusif the choice is between two principles, then i'd rather choose the principle of testing our exact production playbooks15:46
fungiinfra-root: i also opened ticket #230522-ord-0001075 for removal of a stuck shutoff instance in dfw which responds with an "is locked" error if i try to delete it15:46
clarkband ya I see both states as less than ideal, and if we decide to pick one it is a matter of deciding which is less problematic for us15:48
corvussince dockerhub is still an option, i have a hard time saying it's better to give up the fully working system we have now just to avoid using it.15:48
corvus(and keep in mind, the alternative is still "use all the docker tools just with quay.io for hosting", so we're not even making a very strong "pro-community" stance)15:49
corvusi think all things considered, we should just roll back to dockerhub, then start picking things off the podman punch list as we can (jammy, running nested, etc)15:50
clarkbthat works for me. FWIW I think if people want to they could push on podman for services already running on jammy too. (gitea and etherpad for example)15:51
corvusalso -- if we make the tool switch before the hosting switch, that addresses a lot15:51
clarkbyup15:51
corvuslike we can potentially slowly migrate to podman with images on dockerhub, one service at a time, then later switch hosts15:52
fungithat sounds like a reasonable way forward to me15:53
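(Aside: a hedged sketch of what making the tool switch before the hosting switch could look like for one service: keep pulling the published image from Docker Hub but run it with podman instead of docker. The image name is a placeholder.)

    # same Docker Hub image, run with podman instead of docker (sketch)
    podman pull docker.io/opendevorg/example-image:latest
    podman run -d --name example docker.io/opendevorg/example-image:latest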
clarkbalso this morning it has been discovered that siblings testing with nodepool, dib, and glean is broken due to this issue. Apparently we push a :siblings tag into the buildset registry to make that happen?15:53
clarkbThis may have a different solution (I think the skopeo hack may be more acceptable there for example)15:53
corvusfwiw, i'm okay with eating crow and switching zuul back to dockerhub too, though i'm not sure if that's necessary or not?15:53
clarkbor maybe just move all of that to podman since it isn't touching production15:54
clarkbcorvus: I think the nodepool builder jobs that do siblings with dib and glean are the only place that should really affect zuul and friends. And I think there are options there15:54
clarkbspecifically move those jobs to podman and if that doesn't work for some reason use a skopeo hack since that isn't a test like production case (it's a test for testing's sake case)15:54
clarkb99% of the problem here for opendev is that we're trying to also deploy this stuff to production on real servers which brings different concerns and needs15:55
corvusclarkb: i agree that doesn't need to drive the question15:56
corvusyeah, i think the main reason to move zuul back would be in solidarity (ie, to keep using the same sets of jobs), but if we still have a desire to (very slowly over time) move opendev to quay, then maybe zuul should stay there and be the advance team?15:57
corvusi should say: move opendev to podman and quay15:58
clarkbya I think it is ok for the jobs to differ. We might also be able to run the same jobs just with different options. I think maybe that the container jobs would work with docker hub too15:58
clarkbbut that change should come post rollback to simplify things15:58
corvus(if you actually want to keep opendev on docker+dockerhub indefinitely, then we should move zuul back i think.  that way we're better using our limited resources to collectively maintain a smaller set of common jobs)15:59
corvus(or, after reading your last comment, maintaining a smaller set of common job configurations :)15:59
clarkbok cool. I had a lot of time to noodle on this over the weekend and wanted to get the week started with a conversation to avoid doing a bunch of unnecessary work then deciding on things. We can bring this back up in tomorrow's meeting to catch any other opinions and if there aren't objections there I can start on the rollback for opendev.15:59
corvusokay.  i think on the zuul side, we still need to see nodepool functional testing in action with podman, right?  but for zuul itself, we worked out the issues and can switch when we're ready?16:01
clarkbcorvus: I think there are still good reasons to move to podman. I just don't see it as being quick and easy. side note: I feel like both docker and podman exhibit problems with what seems like straightforward functionality (logging, pulling images from not docker.io, exploding on ubuntu rootless due to a documented fallback that doesn't actually fall back, etc)16:01
clarkbcorvus: correct re zuul and nodepool testing16:01
corvusok, so i think if we want to have zuul as the advance party, then we should do the nodepool thing next, and if that works, switch them both over.16:02
corvusthe nested nodepool thing is not something i will be able to do though, unfortunately.16:03
clarkbnow that I think about it the nodepool testing update may exercise that for us. So we can use that as the advance party too16:03
corvusnodepool testing update?16:04
clarkb"we still need to see nodepool functional testing in action with podman"16:04
corvusright -- i mean, if that is anything other than straightforward, i'm not going to be in a position to fix it16:05
clarkbgotcha16:05
clarkbalso sidenote: podman had a ppa, they removed this ppa in favor of the opensuse kubic obs repo, kubic deleted packages from this because they weren't going to support it anymore for older things, everyone (rightly imo) complained since the ppa was also dead and the documentation for installing things says use kubic, kubic restored the packages but isn't updating them aiui. But kubic16:07
clarkbdoesn't matter for new things because new things package podman themselves, but those are the only ones kubic updates for. TL;DR I'm highly skeptical of kubic as a package source16:07
*** amoralej is now known as amoralej|off16:42
opendevreviewMerged opendev/system-config master: reprepro: mirror Ubuntu UCA Antelope for Ubuntu Jammy  https://review.opendev.org/c/opendev/system-config/+/88346717:54
*** mooynick is now known as yoctozepto18:22
yoctozeptomorning18:22
yoctozeptoa question about opendev container jobs not being ready for podman18:23
yoctozeptohttps://opendev.org/opendev/base-jobs/src/commit/3fc688b08dbe2ff41a75f051f53b4929dd35800f/playbooks/buildset-registry/pre.yaml18:23
yoctozeptoonly docker is installed there18:23
yoctozeptowould it be ok if I proposed a patch to handle podman as well?18:24
yoctozeptomaybe there is one already18:24
yoctozeptoforgot to check it18:24
yoctozeptohttps://review.opendev.org/q/project:opendev/base-jobs+podman18:24
yoctozeptonope18:24
yoctozeptook, so let me experiment in a moment18:25
yoctozeptounfortunately, it's a config project18:25
yoctozepto:-(18:25
*** mooynick is now known as yoctozepto18:31
yoctozepto(mobile network switch)18:31
clarkbyoctozepto: the jobs are fine with podman18:32
clarkbyou'll need to be more specific why they are not18:32
clarkb(I mean zuul is doing it in a half merged state and I've got at least one change up to experiment with it too. The problems are not with the jobs)18:33
clarkbwhat the buildset registry uses to run the buildset registry software is orthogonal to what you end up testing with the buildset registry as a tool18:34
yoctozeptoclarkb: https://review.opendev.org/c/nebulous/component-template/+/883304?tab=change-view-tab-header-zuul-results-summary18:48
yoctozeptoit tries to run podman18:48
yoctozeptonot having installed it18:49
yoctozeptoand fails obviously18:49
yoctozeptothat's the issue18:49
yoctozeptowhat you are saying means to me that it should not be trying to use podman19:00
yoctozeptomaybe it's some new development that it dies19:00
yoctozeptos/dies/does19:00
yoctozeptohttps://opendev.org/zuul/zuul-jobs/commits/branch/master/roles/run-buildset-registry/tasks/main.yaml19:02
yoctozeptonah, been like this for quite some time19:02
fungilooks like the container-image pre-run picks between docker and podman: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/container-image/pre.yaml19:02
fungidepending on what container_command is set to19:03
yoctozeptofungi: yeah, but it will come later19:03
yoctozeptoit does not reach it by then19:03
yoctozeptosee the run19:03
yoctozeptohttps://zuul.opendev.org/t/nebulous/build/b86ef46fbe424a54ac4cb46b0432dfb0/console19:03
yoctozeptocontainer_command is set to podman19:03
iurygregoryfungi, hey just saw the node in https://zuul.opendev.org/t/openstack/nodes 19:03
fungiright, i was comparing to the buildset-registry pre-run19:03
yoctozeptomy patch would be doing the same in buildset-registry19:04
fungiiurygregory: what ssh key do you want added?19:04
yoctozeptoI wonder if that's the only place that will need fixing19:04
yoctozeptobut I guess we won't know without trying19:04
iurygregoryfungi, will send to you in 1min19:06
fungiiurygregory: ssh root@104.130.135.4119:13
fungilet me know if it doesn't authenticate for you19:13
opendevreviewRadosław Piliszek proposed opendev/base-jobs master: buildset-registry: Add podman support  https://review.opendev.org/c/opendev/base-jobs/+/88386919:16
yoctozeptofungi, clarkb ^19:16
iurygregoryfungi, done19:16
iurygregoryit worked19:16
yoctozeptomeh that I can't just depends-on it to test it19:18
clarkbyoctozepto: as mentioned that is orthogonal to what you are doing19:20
clarkbthe buildset registry is a service that runs in jobs/buildsets. How it runs is independent of your jobs. If your jobs need podman then you need to install it in your jobs19:20
yoctozeptoclarkb: it's buildset-registry that fails to run when I set the command to podman19:20
yoctozeptosee https://review.opendev.org/c/nebulous/component-template/+/88330419:21
yoctozeptoaccording to docs, this should work fine19:21
yoctozeptoit fails to start the buildset-registry19:21
clarkbok I see you finally linked to a failure :)19:21
clarkbok so run-buildset-registry doesn't install either docker or podman19:24
yoctozeptoyeah, I even wrote a nice commit message on the fix to explain what's happening19:24
yoctozeptobuildset-registry installs docker only now19:24
clarkbI feel like "buildset-registry is independent of your job content" is what should be happening but I guess isn't19:24
yoctozeptoseems like some old approach before it was made more flexible19:24
clarkbcorvus: ^ do you have an opinion on that?19:24
yoctozeptoyeah, we could go that direction19:25
yoctozeptolike, using buildset_registry_container_command19:25
yoctozeptoindependent of container_command19:25
clarkbyoctozepto: or just set the var when you include the role19:25
clarkbthat should override it in inner scopes but not outer right?19:25
yoctozeptoI simply reuse your jobs, see my commit19:26
yoctozeptonot easy to hack without violating DRY19:26
yoctozepto:-)19:26
clarkbyes I mean here https://review.opendev.org/c/opendev/base-jobs/+/883869/1/playbooks/buildset-registry/pre.yaml#3619:26
yoctozeptoah19:26
yoctozeptocould be19:26
yoctozeptothough maybe supporting podman simply makes more sense19:26
yoctozeptoin the long term19:27
clarkbmy concern with that is podman doesn't run in a lot of places19:27
clarkbdocker runs everywhere so for generic "run this service" things where we don't really care about speculative gating I think docker might still be a better choice19:27
clarkbI also don't know that there is much value to supporting more than one way to run it19:27
clarkbtwice as many ways it might break19:28
yoctozeptotrue that19:28
clarkbI guess I can go either way on that now that I understand the problem19:28
clarkbflexibility vs potential reliability19:28
clarkbI'll update my review19:29
yoctozeptothanks, I am also largely indifferent; at most lazy to update the commit to do the other way ;p19:29
yoctozeptoas long as the desired speculative runs work in the end, I am happy to base it on either solution19:30
clarkbI left two notes: one to fix the proposal as is and the other to try to isolate running a registry from what is happening in the jobs19:32
clarkbOne upside of being consistent is that it reduces the number of external deps19:35
clarkbwhich is probably a bigger reliability concern than the chance of podman or docker changing behavior in unexpected ways19:36
clarkbyoctozepto: ^ if you want to update it to fix the default value to match run-buildset-registry's default of docker I think we can probably land it for that reason. Note that ensure-podman does not work on older ubuntu19:36
corvusthere's some pretty docker-specific stuff in there, so adding podman to that might be more than initially expected.  i think clarkb 's suggestion about the default makes sense19:40
yoctozeptoamending, clarkb 19:42
opendevreviewRadosław Piliszek proposed opendev/base-jobs master: buildset-registry: Always use Docker  https://review.opendev.org/c/opendev/base-jobs/+/88386919:46
* yoctozepto is finishing work for today19:51
yoctozeptotalk to you on gerrit19:51
dansmithclarkb: I didn't really follow the above, but just FYI we're jamming podman into jammy for the ceph jobs: https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/files/debs/devstack-plugin-ceph#L421:13
dansmithbecause cephadm wants it (I think)21:14
dansmithI certainly agree that docker is a known and supported quantity for anything that isn't opinionated about it21:14
clarkbdansmith: for your needs the main issue is that there is no reliable source of podman for ubuntu older than jammy21:17
dansmithclarkb: ah, older than jammy, yeah for sure21:18
dansmithwe were using another package repo (possibly the one you mentioned above) but switched to the inbuilt packages during the recent modernization effort21:18
clarkbdansmith: the longer story is that there was a PPA for this stuff which got deprecated and is no longer updated. This happened because there is an OBS repo called "kubic" that started building packages instead. But then kubic said never mind the older distros and deleted them all. People got angry/panicked/voiced displeasure so kubic added the packages back but is no longer updating21:19
clarkbthem. The problem is that podman exists on newer stuff so really you only need kubic for the older things anyway which means it too is not super useful21:19
clarkbfor OpenDev we have a mixture of servers and can't simply rely on jammy everywhere. In CI this is less problematic21:19
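(Aside: a short sketch of why Jammy is the practical cutoff being described: podman is packaged in the Ubuntu 22.04 archive, while older releases have no archive package and would have to rely on the effectively unmaintained kubic OBS repo.)

    # Ubuntu 22.04 (jammy) and newer: podman installs from the distro archive
    sudo apt-get update && sudo apt-get install -y podman
    # Ubuntu 20.04 (focal) and older: no archive package; the kubic OBS repo is
    # the usual third-party source but is no longer receiving updates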
dansmithack21:19
clarkbI've just updated the meeting agenda. Anything important missing?22:53
funginothing i can think of22:57
