19:01:10 <clarkb> #startmeeting infra
19:01:10 <opendevmeet> Meeting started Tue Aug 29 19:01:10 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <opendevmeet> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/2LK5PHWBDIBZDHVLIEFKFZJKB3AEJZ45/ Our Agenda
19:01:22 <clarkb> #topic Announcements
19:01:33 <clarkb> Monday is a holiday in some parts of the world.
19:02:32 <clarkb> #topic Service Coordinator Election
19:02:47 <fungi> congratudolences
19:02:58 <clarkb> heh I was the only nominee so I'm it by default
19:03:20 <clarkb> feedback/help/interest in taking over in the future all welcome
19:03:23 <clarkb> just let me know
19:03:33 <clarkb> #topic Infra Root Google Account
19:03:56 <clarkb> This is me noting I still haven't tried to dig into that. I feel like I need to be in a forensic frame of mind for that and I just haven't had that lately
19:04:03 <clarkb> #topic Mailman 3
19:04:16 <clarkb> Cruising along to a topic with good news!
19:04:27 <fungi> si
19:04:34 <clarkb> all of fungi's outstanding changes have landed and been applied to the server. This includes an upgrade to the latest mailman3
19:04:40 <clarkb> thank you fungi for continuing to push this along
19:04:41 <fungi> i think we've merged everything we expected to merge
19:04:54 <fungi> so far no new issues observed and known issues are addressed
19:05:18 <fungi> next up is scheduling migrations for the 5 remaining mm2 domains we're hosting
19:05:26 <clarkb> we have successfully sent and received email through it since the changes
19:05:50 <fungi> migrating lists.katacontainers.io first might be worthwhile, since that will allow us to decommission the separate server it's occupying
19:06:17 <fungi> we also have lists.airshipit.org which is mostly dead so nobody's likely to notice it moving anyway
19:06:49 <fungi> as well as lists.starlingx.io and lists.openinfra.dev
19:07:04 <clarkb> ya starting with airshipit and kata seems like a good idea
19:07:15 <fungi> then lastly, lists.openstack.org (which we should also save for last, it will be the longest outage and should definitely have a dedicated window to itself)
19:07:35 <clarkb> do you think we should do them sequentially or try to do blocks of a few at a time for the smaller domains
19:07:42 <fungi> i expect the openstack lists migration to require a minimum of 3 hours downtime
19:08:25 <fungi> i think maybe batches of two? so we could do airship/kata in one maintenance, openinfra/starlingx in another
19:08:43 <clarkb> sounds like a plan. We can also likely go ahead with those two blocks whenever we are ready
19:08:53 <clarkb> I don't think any of those projects are currently in the middle of release activity or similar
19:09:15 <fungi> i'll identify the most relevant mailing lists on each of those to send a heads-up to
19:10:13 <clarkb> I'm happy to be an extra set of hands/eyeballs during those migrations. I expect you'll be happy for any of us to participate
19:10:14 <fungi> mainly it's the list moderators who will need to be aware of interface changes
19:10:24 <fungi> and yes, all assistance is welcome
19:10:48 <fungi> the migration is mostly scripted now, the script i've been testing with is in system-config
19:11:24 <clarkb> great I guess let us know when you've got times picked and list moderators notified and we can take it from there
19:12:00 <fungi> will do. we can coordinate scheduling those outside the meeting
19:12:24 <clarkb> #topic Server Upgrades
19:12:35 <clarkb> Another topic where I've had some todos but haven't made progress yet
19:12:52 <clarkb> I do plan to clean up the old insecure ci registry server today and then I need to look at replacing some old servers
19:13:03 <clarkb> #topic Rax IAD image upload struggles
19:13:15 <clarkb> fungi: frickler: anything new to add here? What is the current state of image uploads for that region?
19:13:34 <fungi> i cleaned up all the leaked images in all regions
19:14:33 <fungi> there were about 400 each in dfw/ord and around 80 new in iad. now that things are mostly clean we should look for newly leaked images to see if we can spot why they're not getting cleaned up (if there are any, i haven't looked)
19:14:41 <fungi> also i'm not aware of a ticket for rackspace yet
19:15:20 <clarkb> would be great if we can put one of those together. I feel like I don't have enough of the full debug history to do it justice myself
19:16:27 <fungi> yeah, i'll try to put something together for that tomorrow
19:16:27 <frickler> I think if we could limit nodepool to upload no more than one image at a time, we would have no issue
19:16:53 <clarkb> I think we can do that but it's nodepool builder instance wide. So we might need to run a special instance just for that region
19:17:03 <clarkb> (there is a flag for number of upload threads)
19:17:11 <clarkb> it would be clunky to do with current nodepool but possible
19:17:41 <frickler> so that would also build images another time just for that region?
19:18:12 <clarkb> yes
19:18:16 <clarkb> definitely not ideal
19:18:52 <frickler> the other option might be to delete other images and just run jammy jobs there? not sure how that would affect mixed nodesets
19:19:14 <clarkb> I think it would prevent mixed nodesets from running there but nodepool would properly avoid using that region for those nodesets
19:19:18 <clarkb> so ya that would work
19:19:38 <frickler> so I could delete the other images manually
19:19:52 <frickler> and then we can wait for the rackspace ticket to work
19:20:50 <clarkb> if things are okayish right now, maybe see if we get a response on the ticket quickly; otherwise we can refactor something like ^ or even look at nodepool changes to make it more easily "load balanced"
19:21:22 <frickler> well the issue is that the other images get older each day, not sure when that will start to cause issues in jobs
19:21:45 <clarkb> got it. The main risk is probably that we're ignoring possible bugfixes upstream of us.
19:21:57 <fungi> they are almost certainly already causing jobs to take at least a little longer since more git commits and packages have to be pulled over the network
19:21:57 <clarkb> definitely not ideal
19:22:27 <fungi> jobs which were hovering close to timeouts could be pushed over the cliff by that, i suppose
19:22:54 <fungi> or the increase in network activity could raise their chances that a stray network issue causes the job to be retried
19:23:20 <clarkb> ya maybe we should just focus on our default label (jammy) since most jobs run on that and let the others lie dormant/disabled/removed for now
19:24:18 <clarkb> ok anything else on this topic?
19:24:30 <frickler> ok, so I'll delete other image, we can still reupload manually if needed
19:24:36 <frickler> *images
19:24:54 <corvus> what if...
19:25:15 <corvus> what if we set the upload threads to 1 globally; so don't make any other changes than that
19:25:36 <clarkb> corvus: we'll end up with more stale images everywhere. But maybe within a few days so that's ok?
19:25:40 <corvus> it would slow everything down, but would it be too much?  or would that be okay?
19:26:02 <clarkb> I think the upper bound of image uploads on things that are "happy" is ~1 hour
19:26:16 <frickler> I think it will be too much, 10 or so images times ~8 regions times ~30mins per image
19:26:17 <clarkb> so we'll end up about 5 ish days behind doing some quick math in my head on fuzzy numbers
19:26:26 <fungi> and we have fewer than 24 images presently
19:26:35 <corvus> yeah, like, what's our wall-clock time for uploading to everywhere?  if that is < 24 hours then it's not a big deal?
19:26:55 <fungi> oh, upload to only one provider at a time too
19:26:55 <clarkb> 10 * 8 * .5 / 2 = 20 hours?
19:27:00 <corvus> (but also keeping in mind that we still have multiple builders, so it's not completely serialized)
19:27:09 <clarkb> .5 for half an hour per upload and /2 because we have two builders
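(A quick back-of-the-envelope sketch of the math above; the image count, region count, per-upload time, and builder count are the rough figures quoted in the discussion, not measured values.)

    # Wall-clock estimate for refreshing every image in every region when each
    # nodepool builder is limited to one upload at a time. All inputs are the
    # rough numbers from the meeting, not measurements.
    images = 10             # approximate number of image types we build
    regions = 8             # approximate number of provider regions we upload to
    hours_per_upload = 0.5  # ~30 minutes per upload on a "happy" provider
    builders = 2            # uploads still run in parallel across the two builders

    total_hours = images * regions * hours_per_upload / builders
    print(f"~{total_hours:.0f} hours to cycle all uploads")  # prints "~20 hours"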
19:27:35 <frickler> oh, that is per builder then, not global?
19:27:49 <frickler> so then we could still have two parallel uploads to IAD
19:27:50 <clarkb> frickler: yes, it's an option on the nodepool-builder process
19:27:54 <clarkb> frickler: yes
19:28:12 <corvus> (but of different images)
19:28:16 <corvus> (not that matters, just clarifying)
19:28:30 <corvus> so it'd go from 8 possible to 2 possible in parallel
19:28:45 <frickler> but that would likely still push those over the 1h limit according to what we tested
19:28:58 <clarkb> maybe it is worth trying since it is a fairly low effort change?
19:29:10 <clarkb> and reverting it is quick since we don't do anything "destructive" to cloud image content
19:29:50 <corvus> that's my feeling -- like i'm not strongly advocating for it since it's not a complete solution, but maybe it's easy and maybe close enough to good enough to buy some time
19:30:15 <frickler> yeah, ok
19:30:21 <clarkb> I'm up for trying it and if we find by the end of the week we are super behind we can revert
19:30:47 <corvus> yeah, if it doesn't work out, oh well
19:31:38 <clarkb> cool lets try that and take it from there (including a ticket to rax if we can manage a constructive write up)
19:32:15 <clarkb> #topic Fedora cleanup
19:32:18 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove the fedora-latest nodeset
19:32:46 <clarkb> I think we're readyish for this change? The nodes themselves are largely nonfunctional so if this breaks anything it won't be more broken than before?
19:33:11 <clarkb> then we can continue towards removing the labels and images from nodepool (which will make the above situation better too)
19:33:51 <clarkb> I'm happy to continue helping nudge this along as long as we're in rough agreement about impact and process
19:34:44 <corvus> i think zuul-jobs is ready for that.  wfm.
19:35:03 <fungi> yeah, we dropped the last use of the nodeset we're aware of (was in bindep)
19:35:11 <frickler> we are still building f35 images, too, btw
19:35:51 <clarkb> frickler: ah ok so we'll clean up multiple images
19:36:05 <clarkb> alright I'll approve that change later today if I don't hear any objections
19:36:23 <frickler> just remember to drop them in the right order (which I don't remember), so nodepool can clean them up on all providers
19:36:50 <clarkb> ya I'll have to think about the nodepool ordering after the zuul side is cleaner
19:37:15 <corvus> hopefully https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder helps
19:37:21 <clarkb> ++
19:37:30 <corvus> (but don't actually remove the provider at the end)
19:38:14 <clarkb> #topic Zuul Ansible 8 Default
19:38:31 <clarkb> We are ansible 8 by default in opendev zuul now everywhere but openstack
19:38:45 <clarkb> I brought up the plan to switch openstack to ansible 8 by default on Monday to the TC in their meeting today and no one screamed
19:38:54 <clarkb> It's also a holiday for some of us which should help a bit
19:39:10 <fungi> i'll be around in case it goes sideways
19:39:14 <clarkb> I plan to be around long enough in the morning (and probably longer) monday to land that change and monitor it a bit
19:39:18 <fungi> well, weather permitting anyway
19:39:37 <clarkb> ya I don't have any plans yet, but it is the day before my parents leave so might end up doing some family stuff but nothing crazy enough I can't jump on for debugging or a revert
19:39:44 <fungi> (things here might literally go sideways if the current storm track changes)
19:39:45 <clarkb> fungi: is that when the hurricane(s) might pass by?
19:40:04 <fungi> no, but if things get bad i'll likely be unavailable next week for cleanup
19:40:31 <frickler> if you prepare and review a patch, I can also approve that earlier on monday and watch a bit
19:40:32 <corvus> i should also be around
19:40:36 <clarkb> frickler: can do
19:41:12 <clarkb> looks like it is just one hurricane at least now
19:41:21 <clarkb> franklin is predicted to go further north and east
19:42:13 <clarkb> #topic Python container updates
19:42:16 <fungi> yeah, idalia is the one we have to watch for now
19:42:23 <clarkb> #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.
19:42:51 <clarkb> thank you corvus for pushing up another set of these. Other than the gerrit one I think we can probably land these whenever. For Gerrit we should plan to land it when we are able to restart the container just in case
19:43:03 <clarkb> particularly since the gerrit change bumps java up to java 17
19:43:11 <corvus> o7
19:43:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/893073 Gitea bookworm migration. Does not use base python image.
19:43:51 <clarkb> I pushed a change for gitea earlier today that does not use the same base python images but those images will do a similar bullseye to bookworm bump
19:44:07 <clarkb> similar to gerrit, gitea probably deserves a bit of attention in this case to ensure that gerrit replication isn't affected.
19:44:20 <clarkb> I'm also happy to do more testing with gerrit and/or gitea if we feel that is prudent
19:44:29 <clarkb> reviews and feedback very much welcome
19:45:19 <clarkb> #topic Open Discussion
19:45:32 <clarkb> Other things of note: we upgraded gitea to 1.20.3 and etherpad to 1.9.1 recently
19:45:42 <clarkb> It has been long enough that I don't expect trouble but something to be aware of
19:46:08 <fungi> yay upgrades. bigger yay for our test infrastructure which makes them almost entirely worry-free
19:46:25 <clarkb> I mentioned meetpad to someone recently and was told some group had tried it and ran into problems again. It may be worth doing a sanity check that it works as expected
19:47:00 <fungi> i'm free to do a test on it soon
19:47:18 <clarkb> I can do it after I eat some lunch. Say about 20:45 UTC
19:48:06 <fungi> i may be in the middle of food at that time but can play it by ear
19:48:10 <clarkb> tox 4.10.0 + pyproject-api 1.6.0/1.6.1 appear to have blown up projects using tox. Tox 4.11.0 fixes it apparently so rechecks will correct it
19:48:27 <clarkb> debugging of this was happening during this meeting so it is very new :)
19:49:20 <corvus> in other news, nox did not break today
19:49:33 <clarkb> Oh I meant to mention to tonyb to feel free to jump into any of the above stuff or new things if still able/interested. I think you are busy with openstack election stuff right now though
19:50:27 <clarkb> sounds like that is everything. Thank you everyone!
19:50:32 <clarkb> #endmeeting