19:00:06 #startmeeting infra
19:00:06 Meeting started Tue Mar 26 19:00:06 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:06 The meeting name has been set to 'infra'
19:00:16 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/WN2HTKAHF257JN2FT3ZZ5YOAXF3Y5KW3/ Our Agenda
19:00:29 o/
19:00:52 #topic Announcements
19:00:55 o/
19:01:06 The OpenStack release happens next week and the PTG is the week after
19:01:26 #link https://etherpad.opendev.org/p/apr2024-ptg-opendev Get your PTG agenda items on the agenda doc
19:01:34 feel free to add ideas for our PTG time on this etherpad
19:01:55 #topic Server Upgrades
19:02:14 I don't think there is much new here, other than to mention the rackspace MFA changes (which we'll dig into shortly)
19:02:28 We did make the changes on our end and we think that launch node as well as reverse dns updates should be working
19:02:53 it's only the forward dns commands that won't work anymore and we'll have to do those in the gui or figure out how to use the api key for that too (but we rarely update openstack.org records these days so not a big deal)
19:03:25 If you do launch new servers and run into trouble please say something. This is all fairly new and any new behavior is worth knowing about
19:04:00 noted
19:04:08 #topic MariaDB Upgrades
19:04:20 fyi i didn't test the volume options in launch-node with it
19:04:22 We upgraded the refstack DB and it went just as smoothly as the paste upgrade (all good things)
19:04:29 fungi: ack, but that was always weird before :/
19:04:47 The remaining services we need to upgrade are etherpad, gitea, gerrit, and mailman3
19:05:07 i think mm3 should be straightforward, also not a big version jump
19:05:15 due to the release and PTG I hesitate to upgrade the etherpad, gitea, and gerrit dbs right now (also I'm busy with that stuff so generally have less time)
19:05:25 ya I suspect mm3 may be the safest one to do in the next little bit
19:05:41 i can tackle that one when i'm on a more stable connection
19:05:52 sounds good. Let me know if I can help but it is pretty cookie cutter I think
19:05:56 yep
19:06:46 #topic AFS Mirror Cleanups
19:07:29 We got centos 7 and opensuse leap pretty well cleared out after I fixed the script updates. There is another cleanup change https://review.opendev.org/c/opendev/system-config/+/913454 to remove the opensuse script entirely since it is nooping now and doesn't need to run at all
19:07:34 That isn't urgent
19:07:54 Next up is ubuntu xenial cleanup. I'm not going to be able to look at that more closely until after the PTG though
19:08:16 I did mention in the TC meeting today that we've freed up space if anyone wants to start on noble mirroring and images
19:08:33 I think the rough plan there is add jobs to dib for noble, then add mirroring, then add image builds to nodepool
19:08:55 I'm happy for others to push on that otherwise I'm likely also going to wait until after the PTG to dig into that
19:09:18 also happy to report that cleaning out opensuse and centos didn't make openafs sad. May have just been coincidence
19:09:35 or the response to the issue has corrected it for the future. Either way I was happy I didn't have to scramble on openafs restoration
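A quick aside on verifying the cleanup results: the openafs client tools can report per-volume quota and usage, so the freed space is easy to confirm. This is only a minimal sketch, assuming the afs client utilities are installed; the mirror path and volume name below are illustrative, not necessarily the exact ones in use:

    # show quota and current usage for the volume backing a mirror path
    fs listquota /afs/openstack.org/mirror/centos
    # or inspect a volume directly by name (volume name is assumed here)
    vos examine mirror.centos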
19:10:08 #topic Rebuilding Gerrit Images
19:10:14 #link https://review.opendev.org/c/opendev/system-config/+/912470 Update our 3.9 image to 3.9.2
19:10:57 as mentioned previously this will also rebuild our gerrit 3.8 images against the latest state of that branch. I noticed in gerrit discord this morning that there are a number of bugfixes on stable-3.8 that people want a 3.8.5 release for. We can rebuild and get those deployed before a release happens so all the more reason to do this update
19:11:10 I'm thinking that maybe we plan to do this late next week after the openstack release is done?
19:11:19 we could potentially sneak in a gerrit mariadb upgrade at that time too
19:11:20 wfm
19:11:29 good with both
19:11:39 cool I won't worry about this too much until next week then
19:11:40 maybe as separate restarts
19:12:06 just to be extra sure we know what triggered a problem if it blows up
19:12:11 ++
19:12:31 #topic Rackspace MFA Requirement
19:12:46 fungi updated our three primary accounts to use MFA with info in the usual place
19:12:52 thank you for handling that
19:13:14 Today is also the big day it will become required. As mentioned earlier keep an eye out for unexpected changes in behavior and say something if you notice any
19:13:24 it was very straightforward, the complexity was just lots of extra precautions and staged testing on our end
19:13:44 fungi: I did mean to ask which app you chose to set up to get the secret totp key
19:13:56 maybe it didn't matter and any of the app choices provided a valid string
19:14:02 definitely keep an eye on build results from jobs since we're still wary of the swift/cloudfiles accounts
19:14:08 are we uploading logs to rax currently?
19:14:13 corvus1: we are
19:14:39 so we're going with the "assume it works and yank it out if it breaks" approach?
19:14:45 ya I think so
19:14:56 sounds good to me
19:14:59 clarkb: in the end it didn't make me choose a specific app. just said i had a phone app rather than choosing the sms option
19:15:07 fungi: gotcha
19:15:36 and fed the copyable string equivalent beneath their qr code
19:15:53 into our usual totp tool
19:15:57 cool I'm glad they made that easy (my credit union does not)
19:16:59 it was only confusing insofar as they didn't mention that it might work with totp implementations besides the three popular phone apps they listed
19:17:26 ya this is a common issue with 2fa setups. They push you to an app which almost always results in totp because that's what all the apps do
19:17:37 for my personal account, i hooked up two different librem key fobs (using the nitrocli utility which supports them)
19:17:58 they just pretend that the protocol is something people shouldn't be aware of
19:18:12 #topic Project Renames
19:18:21 Still pencilled in for April 19
19:18:26 #link https://review.opendev.org/c/opendev/system-config/+/911622 Move gerrit replication queue aside during project renames.
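For context on what that change does conceptually: the replication plugin persists its queued tasks on disk, and the point of moving the queue aside is presumably to keep stale tasks from replaying against pre-rename project names. A rough sketch of the idea only, run while Gerrit is stopped; the queue path below is an assumption for illustration and is not taken from change 911622:

    # park the persisted replication queue so leftover tasks don't replay
    # against old project names (path is an assumption, check the site layout)
    QUEUE=/home/gerrit2/review_site/data/replication/ref-updates
    mv "$QUEUE" "${QUEUE}.pre-rename.$(date +%Y%m%d)"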
19:18:37 sounds good, i expect to be around
19:18:45 this change is the only prep I've done so far but I expect to start organizing things after the PTG (there is a theme around scheduling for me, can you tell)
19:19:26 also we heard back from starlingx folks and they aren't entertaining any repo renames for that window
19:19:32 oh good
19:19:34 I missed that
19:19:56 at least that was my takeaway from the starlingx-discuss ml thread
19:20:09 if you are listening in and do intend on renaming a project now is a good time to start preparing for that
19:20:54 #topic Nodepool Image Delete After Upload
19:21:24 I haven't seen any changes to implement the config update in opendev. I can't remember if someone volunteered for that last week (also maybe I missed the change)
19:21:58 I do think this may be a good space saving change we can make in addition to cleaning up old image builds like buster, leap etc so I should tack this onto the cleanup work for images
19:22:08 unless someone else wants to go ahead and push a change up
19:22:56 #topic Review02 had an Oops Last Night
19:23:19 saw that, but too late to help out, sorry :(
19:23:24 around 02:46-02:47 UTC last night review02 was shut down. Best I can tell looking at logs this was not a graceful shutdown
19:24:00 didn't see any further updates from vexxhost folks yet
19:24:23 Once I realized this was happening I gave it a few minutes just to see if the cloud would automate a restart then manually started it. The containers were in an error 255 stopped state so did not auto start on boot. I manually down and up'd them again
19:24:46 ya the last update from vexxhost was that this was likely an OOM event on the host/hypervisor side which is why it was sudden and opaque to us
19:25:06 at least it booted with minimal trauma, aside from maybe some changes not being properly indexed?
19:25:06 I missed that it was gone, because it looked like the address range was gone so I assumed it was and didn't dig further
19:25:08 everything seems to have come back mostly ok. I did note one manila change that wasn't indexed
19:25:19 fungi: ya that was the only issue I could find
19:25:24 and just one change that I found
19:25:39 tonyb: i think (but am not certain) that vexxhost does bgp to the hypervisor hosts
19:26:08 which explains why routing gets turned around at the core or edge when a vm is unreachable
19:26:24 I had come out of my migraine/headache cocoon to eat something and maybe push out a meeting agenda and did this instead :)
19:26:33 okay. good to know
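On the change that missed indexing: individual changes can be reindexed online over Gerrit's ssh admin interface rather than doing a full reindex. A minimal sketch; the change number and admin user are placeholders, and it assumes an account with the needed capability:

    # reindex a single change by its number (12345 is a placeholder)
    ssh -p 29418 admin@review.opendev.org gerrit index changes 12345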
19:26:45 I was actually feeling a lot better at that point so not a huge deal but that is why I didn't get the agenda out until this morning
19:27:19 i had just gotten out of the car and onto the internet when i saw you'd rebooted the vm, terrible timing on my part
19:27:30 to me it is a bit worrying having to assume that this can repeat any time
19:27:54 frickler: I agree, but until we hear back from the cloud as to why it happened we don't really know what the problem is and if it can repeat
19:28:05 which is why hearing from vexxhost folks on an rca might help assuage our concerns
19:28:19 if it is an OOM then java does allocate memory quite greedily so I do think that is a good hunch
19:28:29 basically we'll look like the best candidate for killing in an oomkiller situation
19:28:56 and mnaser did say they would look today so hopefully we hear back soon and then determine if we need to make any suggestions or work with them to improve things
19:29:02 the hypervisor only sees qemu as a whole
19:29:23 frickler: correct but we'll actually be using all 96GB of memory or whatever in that qemu process
19:29:33 whereas other VMs with that much RAM may only use a small portion in reality
19:29:40 main concern is whether it's an oversubscribed host, vs a memory leak in services running alongside the hypervisor
19:29:43 due to the way java allocates memory
19:30:44 ceph rbd client in qemu would be a not uncommon consumer
19:31:07 ram oversubscription would be odd, since vexxhost had previously mentioned the hosts having an excess of available ram
19:31:53 we should keep an eye out for any unexpected fallout, but otherwise wait to hear back from them to see if we can help mitigate it going forward
19:32:35 thanks again for jumping on that
19:33:15 #topic Open Discussion
19:33:23 Gitea made another 1.21 release...
19:33:27 we're playing a game of tag with them
19:33:32 #link https://review.opendev.org/c/opendev/system-config/+/914292 Update Gitea to 1.21.10
19:33:51 and I've got a change for etherpad 2.0.1 that I'm testing currently
19:33:53 #link https://review.opendev.org/c/opendev/system-config/+/914119
19:34:02 the previous ps didn't load our plugin correctly but I think that may have been a bug in the dockerfile
19:34:31 I also think it may be a bug in the upstream dockerfile (the dockerfile is actually a huge mess and maybe once we get things sorted on our end we can look at pushing a PR to clean it up)
19:34:48 as with the etherpad db upgrade I don't think we land this until after the PTG though
19:36:14 Anything else?
19:36:34 I'm still fighting this cold but other than the random headache it's mostly a non issue at this point
19:36:40 not from me.
19:37:33 ah, one thing
19:37:46 did we want to block the storpool CI account now?
19:37:59 +1 from me. I emailed them months ago and got no response
19:38:00 still seeing random reviews in devstack and elsewhere
19:38:10 ++
19:38:20 also a heads up that i'm travelling on short notice due to a family emergency, not sure for how long, so spending a lot of time working from parking lots and waiting rooms over dicey phone tether. will try to help with things when i can safely do so
19:38:47 clarkb: maybe you can do that then to make it more "official"?
19:39:14 frickler: sure I can add it to my todo list. Might not happen today but probably can this week
19:40:29 fwiw, i think any of our gerrit admins should feel free to switch a problem account to inactive, especially with this much consensus. i just make sure to #status log it for posterity
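For reference, marking a third-party CI account inactive is a one-liner over the same ssh admin interface; the username and admin user below are placeholders, not necessarily the real account names:

    # deactivate the account so it can no longer authenticate or leave reviews
    ssh -p 29418 admin@review.opendev.org gerrit set-account --inactive example-third-party-ci
    # and remember to #status log it for posterity, as noted above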
19:40:42 ++
19:40:46 Oh one more thing. Do we want to have a meeting during the PTG or should we cancel?
19:40:56 I think my schedule is open enough to have the meeting but not sure about others
19:41:30 I guess we can decide next week
19:41:42 I'd rather skip unless something important happens
19:41:52 ack
19:42:00 there's an openstack rbac session at this time i might try to catch
19:42:11 so i concur with frickler
19:42:27 sounds like we're leaning towards cancelling on the 9th. We can make it official next week
19:42:38 Thank you for your time and help running opendev everyone!
19:42:47 I think we can call that a meeting
19:42:49 #endmeeting