Tuesday, 2022-09-13

*** diablo_rojo_phone is now known as Guest16607:50
clarkbWe'll start the meeting momentarily18:59
ianwo/19:01
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Sep 13 19:01:07 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-September/000359.html Our Agenda19:01
clarkbThere is an agenda with quite a number of things on it. They are mostly small things so I may go quickly to be sure we get through it all then we can swing back around on anything that needed extra discussion19:01
clarkb#topic Announcements19:01
clarkbNothing major here. Just a reminder that OpenStack is in the middle of its release process and elections19:02
clarkbDon't forget to vote if you're eligible and take care to double check changes you are making to ensure we don't inadvertently break something the release depends on19:02
clarkb#topic Topics19:03
clarkb#topic Bastion Host Updates19:03
clarkbWe've taken yet another pivot after realizing we likely just never want to run the console stream daemon in these infra prod jobs. At least not in its current form19:04
clarkbbut the command module (and its relatives like shell) write the files out regardless19:04
clarkbianw: wrote some changes to make that optional which I think will be helpful for us19:04
clarkb#link https://review.opendev.org/c/zuul/zuul/+/855309/ make console stream file writing toggleable19:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/855472 Disable file writing for infra-prod19:04
ianwyes sorry that needs a revision from your comments19:04
clarkbianw: ya and did you see my note about modifying the base jobs repo in a similar manner to system-config as well?19:04
ianwummm, sorry no, but can do 19:05
clarkbping me if I don't rereview those quickly enough after updates. I'd like to see those get in as they appear to be a good improvement for our use case (and probably others in a similar boat)19:05
clarkb#topic Upgrading Bionic Servers19:07
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.19:07
clarkbI keep meaning to pick this up but then other things pop up and grab my attention19:07
clarkbHelp appreciated and let me know if anyone starts working on this and needs changes reviewed or help debugging issues. I'm more than happy to take a look19:08
clarkbBut no real updates on this yet19:08
clarkb#topic Mailman 319:09
clarkb#link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.19:09
clarkb#link https://etherpad.opendev.org/p/mm3migration Server and list migration notes19:09
clarkbWe (mostly fungi at this point) continue to make progress on getting to the point where this is ready19:09
clarkbThe database appears happy with the larger connection buffer settings.19:09
clarkbfungi: ^ have you checked if mysqldump is happy with that setting too? We should check that (maybe by manually running the mysqldump?)19:10
fungino, i didn't check that, but can make a note in the etherpad to test it with the next hold after a full import19:10
clarkbOther todos include retesting now that the change is creating all the lists and not just those that ansible for mm2 knew about, checking the pipermail redirects, and I think adding redirects for non list archive urls19:10
clarkbfungi: thanks19:11
clarkbfungi: we should probably go ahead and add a hold and recheck nowish?19:11
clarkbfungi: I can do that after the meeting if that is helpful19:11
fungiyeah, i just hadn't gotten to it yet19:11
clarkbcool I'll sync up after the meeting to get that moving forward19:11
clarkbThanks for all the help on this. You definitely realize just how many little details go into a big migration like this when you start testing it out19:12
clarkb#topic Jaeger Tracing Server19:12
clarkb#link https://review.opendev.org/c/opendev/system-config/+/855983 adds deployment for jaeger19:12
corvusmy ball; will update this week.19:12
clarkbThere is a change now. CI isn't happy with it yet and I think ianw has some feedback19:12
clarkbcorvus: great, just wanted to make sure others were aware too.19:13
corvusseems like ppl generally like it so far. just working through some technical details.19:13
clarkb#topic Fedora 3619:15
clarkb#link https://review.opendev.org/c/zuul/nodepool/+/853914 Remove fedora 35 testing from nodepool19:16
clarkbianw everywhere else is using the fedora-latest label and will get automatically updated?19:16
ianwdevstack still has https://review.opendev.org/c/openstack/devstack/+/854334 but i need to look into that19:17
clarkbah they have their own labels.19:18
ianwbut other than that, yes -- so with the nodepool change one step closer to dropping f3519:18
clarkbianw: looks like the issue there is they are branching nodeset definitions :/19:18
clarkbthat's going to create problems for every transition that uses an alias like -latest19:18
clarkbmight make sense to move that into openstack-zuul-jobs or similar to avoid that problem19:19
ianwwe always seem to have this discussion about making sure various testing jobs don't end up on stable branches19:19
clarkbanother option is for them to use anonymous nodesets19:19
clarkbbut I don't think they should be managing aliased nodesets on branched repos19:20
clarkbas this will be a problem every 6 months19:20
fungiyeah, a branchless repo like osj should fit the bill19:20
fungier, ozj19:20
ianwthere should be no fedora on anything but master -- but i agree this could have a better home19:20
clarkbianw: ya the problem is they branch yoga and don't clean it up19:21
clarkbit's better to just avoid having it on master where it can end up in a stable branch probably19:21
ianwi can add a todo to have a look19:21
clarkbanyway we can sort that out with the qa team separately19:21
clarkbis there anything other than reviewing the nodepool change that we can do to help19:21
ianwi don't think so, thanks.  unless people want to start debugging devstack, which i don't think they do :)19:22
clarkb#topic Jitsi Meet Updates19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/856553 Update to use colibri websockets and scale out JVBs19:23
clarkbThis is one of those changes where in theory I've done what is expected of the service19:23
clarkbBut it's kinda hard to confirm that without having a full blown service running with dns setup and being able to talk to it with our browsers19:24
fungibut it's also fairly easy to test if we set aside a window to do so19:24
clarkbIn particular it isn't clear to me if the JVB java keystores need to have some relationship to a CA or to each other or be verifiable in some way19:24
clarkbAll of the bits I could find on the docs and forum posts about this don't indicate any sort of relationship like that so I think they may be using this just for encryption and not verification19:24
clarkbfungi: ya exactly19:25
clarkbI think if people are comfortable with the change we could probably land it and test things during a quiet time (Friday?)19:25
clarkband if it breaks either revert or try to roll forward and fix19:25
clarkbI do think there is a window of opportunity here where we should get it done or wait until after the ptg though. Probably week before ptg is not the time to land this but before that is ok?19:25
fungiand i think it should be reasonably safe to merge first, make sure things aren't broken, take a jvb server out of emergency, redeploy to it gets updated, stio the jvb container on the main server, test again19:25
fungii like friday19:26
fungis/stio/stop19:26
ianw++19:26
clarkbsounds good. /me makes a note on the todo list to try and get that done friday19:27
clarkbOther than that I think we are in good shape for having the service up for the ptg. The non jvb setup seems to be working19:27
clarkb(just a question of whether or not it can scale, but that is what the jvb change is for)19:28
clarkb#topic Stability of the Zuul Reboot Playbook19:28
clarkbIf you didn't know this already Clouds are excellent chaos monkey drivers19:28
clarkbOver the weekend we hit another issue with the playbook. This time it is a race between asking the container to stop in an exec and the container quitting out from under the docker exec19:29
clarkbwhen the container exits before the exec is complete the docker command return code is 137 and ansible gets angry19:29
clarkbI pushed an update to handle this as well as an unexpected behavior with docker-compose ps -q printing exited containers that frickler pointed out (docker ps -q does not do this)19:29
clarkbI started a manual run of that yesterday and we are currently waiting for ze08 to stop19:30
clarkbHoping that completes today which will have what should become zuul 6.4.0 deployed in opendev for a bit before the release is made19:30
clarkbCalling this out because I think it is a good idea for us to keep an eye on this playbook for a bit until we're satisfied it is stable19:30
corvusthe original run was dev10/dev1819:30
corvusthe new run is upgrading to dev21?19:30
corvuswas it resumed or is everything going to dev21?19:31
corvus(i'm not sure how to read ze01 being at dev18, ze05 at dev21, and ze12 at dev18 again)19:31
clarkbcorvus: all of the ze's updated to dev18 over the weekend as the crash happened on zm05 which was after the zes19:32
clarkbcorvus: some time after my manual restart of the playbook yesterday a change or two landed to zuul and our hourly zuul playbook docker-compose pulled that so nodes after that point are upgrading to dev2119:32
clarkbonce this is done we can go and update ze01-ze04 to dev21 to match19:33
clarkbas they should be the only ones out of sync (unless more zuul changes land in the interim)19:33
corvusgotcha.  i was hoping to avoid that, but it looks like the changes that merged are inconsequential to the release19:33
clarkbyes I looked at them and they didn't look to be major19:33
corvusnah, no need, we can run with diverse versions19:33
corvuswe don't use the elasticsearch reporter :)19:34
clarkbSo far the updated playbook seems happy. I'll continue to monitor it19:35
corvus\o/19:35
clarkb#topic Python Container Image Updates19:35
clarkb#link https://review.opendev.org/c/opendev/system-config/+/85653719:35
clarkbThis is a great time to update our python container base images as they now include a fixed glibc for the ansible issue and new python minor releases19:36
clarkbOnce we land that we can remove the zuul glibc workaround and let that change rebuild the zuul images19:36
clarkbI wouldn't call this urgent, but it is good hygiene to update these periodically so that changes to our various images can pick up the underlying updates19:37
ianwok, i feel like the zuul workaround is separate though19:37
clarkbianw: once the base image has fixed glibc the zuul workaround is no longer required?19:38
clarkbThis is a necessary precondition of removing the workaround19:38
ianwoh right, it builds on these base images.  although it might do an apt-get upgrade as part of building19:38
ianwzuul that is?19:38
clarkbzuul might, thats true19:38
clarkbour base images don't19:38
ianwanyway, yeah pulling into base images seems good19:39
corvus#link zuul workaround: https://review.opendev.org/84979519:39
corvusi'm not aware of an apt-get upgrade19:39
ianwright, and https://review.opendev.org/c/zuul/zuul/+/854939 was to revert it19:40
ianwi updated that to depends-on the system-config change; so ordering should be right now19:41
clarkbcool19:41
corvussounds like a plan19:41
clarkb#topic Improving Ansible Task Runtime19:42
clarkbThis is largely meant to be informational to help people be conscious of this as they write new ansible19:42
clarkbBut I'm also happy if people end up refactoring existing ansible :)19:42
clarkbThe TL;DR is that even though zuul is using ssh control persistence and ansible pipelining, the cost to run an individual task as simple as copying a file of a few bytes or execing ls is often measured in seconds19:43
clarkbThe exact number of seconds seems to vary across our clouds but we've seen it as high as 6 in some :(19:43
clarkbThis becomes particularly problematic when you are running ansible tasks in a loop with a large number of loop inputs19:44
clarkbeach input creates a new task that can take 6 seconds to execute. Multiply that by 100 items in a loop and now you just spent 10 minutes doing something that probably should've taken a second or two at most19:44
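The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the model assumes every loop item pays the same fixed per-task overhead, which is a simplification of what was observed across clouds.

```python
def loop_overhead_seconds(items: int, seconds_per_task: float) -> float:
    """Total fixed overhead when each loop item runs as its own task."""
    return items * seconds_per_task


def as_minutes(seconds: float) -> float:
    """Convert a duration in seconds to minutes."""
    return seconds / 60


if __name__ == "__main__":
    # Figures taken from the discussion: 100 loop items at 6 seconds each.
    looped = loop_overhead_seconds(100, 6)
    # The same work folded into a single task pays the overhead only once.
    batched = loop_overhead_seconds(1, 6)
    print(f"looped:  {looped} s ({as_minutes(looped)} min)")
    print(f"batched: {batched} s")
```

Under this model the overhead grows linearly with the number of loop items, which is why collapsing a loop into one task tends to pay off most on the slowest clouds.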
clarkbI've written a few changes at this point to pick off some low hanging fruit that suffer from this19:44
clarkb#link https://review.opendev.org/c/zuul/zuul-jobs/+/85540219:45
clarkb#link https://review.opendev.org/c/zuul/zuul-jobs/+/85722819:45
clarkbin particular improve some shared library roles so that everyone can benefit19:45
clarkb#link https://review.opendev.org/c/opendev/system-config/+/85723219:45
clarkbthis change is specific to how we run nested ansible and saves 1-3 minutes or so depending on the test node. As noted in the commit message of this change there is a downside to it (more complicated nested ansible setup) and I've asked for feedback on whether or not we think that cost is worthwhile19:46
clarkbI've just WIP'd it to ensure we don't merge it before additional feedback is given19:46
clarkbSo ya, try to be aware of this as you write ansible, it can make a big impact on how long our jobs take to execute19:47
clarkbsometimes it might be appropriate to move actions into a shell script rather than have ansible work through logic and iteration19:47
clarkbsometimes we can use synchronize instead of a loop of copies, and so on19:47
clarkbAnd be on the look out for any particularly problematic bits that we might be able to improve. The multi node known hosts stuff could be quicker after my improvement above for example and maybe our infra log encryption could be sped up too19:48
clarkb#topic Open Discussion19:49
clarkbWe got through the agenda. Anything else or anything we covered above that you'd like to go into more detail on?19:49
fungii've got nothing else19:50
clarkbthe debian reprepro mirror needs help19:51
fungiyep, planning to dig into that during/after dinner, unless someone beats me to it19:51
clarkbit somehow leaked a lock file which I cleaned up earlier today and now it complains of a corrupt db19:51
fungidatabase rebuild seems to be necessary19:51
ianwyeah, i feel like i have notes on that, i can take a look19:51
ianw#link https://review.opendev.org/c/opendev/system-config/+/85205619:51
ianwis one; about reverting the pin of the grafana container.  frickler isn't a fan, i'm a bit less worried about it -- not sure what others think19:52
clarkbianw: are they going to keep releasing beta software to :latest?19:52
clarkbI'm ok with deploying it if they stop doing that19:52
fricklerah, I checked that, the dashboard page looks empty with :latest19:52
clarkbthere was talk of them doing a :stable or similar tag iirc19:53
frickleralso didn't we have a patch that generates screenshots of all the individual dashboards? I didn't find that19:53
ianwthat's a point, this job doesn't run that19:53
clarkbfrickler: I think that job runs on the project-config side we could run it here too though and probably a good idea19:53
frickleranyway something still seems broken with latest, so we can either try some tagged version or try to find a fix in our setup19:54
fricklernot sure if someone has time and energy for that19:54
ianwwell yeah, if there is a problem with :latest, ignoring it is only going to make it worse :)  that's kind of my point19:55
clarkbright but it seems that they started releasing known problematic stuff to :latest19:55
clarkbwhereas before it was vetted releases19:55
clarkbI'm ok with keeping up with their releases but don't think we should be responsible for beta testing for them19:56
ianwwell, i doubt they would say that, and really it is our model of loading via the yaml path etc. that i think we're testing, and that's not going to be something covered by upstream ci19:57
clarkbianw: aiui when we broke previously it was because latest was a beta release19:57
clarkband the issue was a known issue they were already working to fix that would never end up in the final release19:57
ianwnot really -- it was their bug -- but we reported it, and confirmed it, and helped get it fixed19:58
fricklerdifferent topic, just to shortly mention it before time's up, there seem to be some issues with nested-kvm on ovh-gra1. I'm testing with a beta version of cirros, will apply some special cmdline option19:59
fricklerhope to have some more information tomorrow19:59
clarkbfrickler: thanks.19:59
ianwyeah, if it's the same thing as we saw with our jammy nodes booting there, i think we'll need some help from the cloud side 20:00
clarkbchecking docker hub they don't seem to have stable tags20:00
ianwit relates to kernel messages spewing from a prctl due to cpu flags20:00
ianw#link https://bugs.launchpad.net/ubuntu/+source/linux/+bug/197383920:00
fungithough i expect their business model provides them with an incentive to leverage users of the open source version as beta testers in order to shield their paying customers from bugs20:01
clarkbamorin was responsive at least. Suggested trying a different flavor on a one off boot to check if that was any better20:01
fungi(grafana's business model, i mean)20:01
fricklerI'll try the added kernel option first, other flavor second, didn't get to it today20:01
clarkband we are at time. Thanks everyone. Feel free to continue the grafana and nested virt discussion in #opendev20:02
fricklermade updates to service-types-authority work again20:02
clarkbWe'll be back here same time and place next week.20:02
clarkb#endmeeting20:02
opendevmeetMeeting ended Tue Sep 13 20:02:24 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:02
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html20:02
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.txt20:02
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.log.html20:02
fungithanks clarkb!20:02
ianwfungi: i feel like that's a bit unfair.  i think they would say that they have a lot of CI and try to maintain a lot of compatibility.  when we reported a bug it was investigated and fixed promptly.  not sure you can ask for a lot more20:02
fungiianw: rather, i mean they have an incentive to not provide a "stable" tag, because that's what people pay them for20:02
fungia tag for "give me the latest official release"20:03
fungirather than consuming their beta test stream or pinning to a specific version20:04
fricklerI think most people don't need that, they use numbered tagged versions and something like renovate bot on github to keep track20:04
