19:00:06 #startmeeting infra
19:00:06 Meeting started Tue Mar 26 19:00:06 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:06 The meeting name has been set to 'infra'
19:00:16 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/WN2HTKAHF257JN2FT3ZZ5YOAXF3Y5KW3/ Our Agenda
19:00:29 o/
19:00:52 #topic Announcements
19:00:55 o/
19:01:06 The OpenStack release happens next week and the PTG is the week after
19:01:26 #link https://etherpad.opendev.org/p/apr2024-ptg-opendev Get your PTG agenda items on the agenda doc
19:01:34 feel free to add ideas for our PTG time on this etherpad
19:01:55 #topic Server Upgrades
19:02:14 I don't think there is much new here, other than to mention the rackspace MFA changes (which we'll dig into shortly)
19:02:28 We did make the changes on our end and we think that launch node as well as reverse dns updates should be working
19:02:53 it's only the forward dns commands that won't work anymore and we'll have to do those in the gui or figure out how to use the api key for that too (but we rarely update openstack.org records these days so not a big deal)
19:03:25 If you do launch new servers and run into trouble please say something. This is all fairly new and any new behavior is worth knowing about
19:04:00 noted
19:04:08 #topic MariaDB Upgrades
19:04:20 fyi i didn't test the volume options in launch-node with it
19:04:22 We upgraded the refstack DB and it went just as smoothly as the paste upgrade (all good things)
19:04:29 fungi: ack, but that was always weird before :/
19:04:47 The remaining services we need to upgrade are etherpad, gitea, gerrit, and mailman3
19:05:07 i think mm3 should be straightforward, also not a big version jump
19:05:15 due to the release and PTG I hesitate to upgrade the etherpad, gitea, and gerrit dbs right now (also I'm busy with that stuff so generally have less time)
19:05:25 ya I suspect mm3 may be the safest one to do in the next little bit
19:05:41 i can tackle that one when i'm on a more stable connection
19:05:52 sounds good. Let me know if I can help but it is pretty cookie cutter I think
19:05:56 yep
19:06:46 #topic AFS Mirror Cleanups
19:07:29 We got centos 7 and opensuse leap pretty well cleared out after I fixed the script updates. There is another cleanup change https://review.opendev.org/c/opendev/system-config/+/913454 to remove the opensuse script entirely since it is nooping now and doesn't need to run at all
19:07:34 That isn't urgent
19:07:54 Next up is ubuntu xenial cleanup. I'm not going to be able to look at that more closely until after the PTG though
19:08:16 I did mention in the TC meeting today that we've freed up space if anyone wants to start on noble mirroring and images
19:08:33 I think the rough plan there is add jobs to dib for noble, then add mirroring, then add image builds to nodepool
19:08:55 I'm happy for others to push on that otherwise I'm likely also going to wait until after the PTG to dig into that
19:09:18 also happy to report that cleaning out opensuse and centos didn't make openafs sad. May have just been coincidence
19:09:35 or the response to the issue has corrected it for the future. Either way I was happy I didn't have to scramble on openafs restoration
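A quick aside on verifying the cleanup results: the openafs client tools can report per-volume quota and usage, so the freed space is easy to confirm. This is only a minimal sketch, assuming the afs client utilities are installed; the mirror path and volume name below are illustrative, not necessarily the exact ones in use:

    # show quota and current usage for the volume backing a mirror path
    fs listquota /afs/openstack.org/mirror/centos
    # or inspect a volume directly by name (volume name is assumed here)
    vos examine mirror.centos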
19:10:08 #topic Rebuilding Gerrit Images
19:10:14 #link https://review.opendev.org/c/opendev/system-config/+/912470 Update our 3.9 image to 3.9.2
19:10:57 as mentioned previously this will also rebuild our gerrit 3.8 images against the latest state of that branch. I noticed in gerrit discord this morning that there are a number of bugfixes on stable-3.8 that people want a 3.8.5 release for. We can rebuild and get those deployed before a release happens so all the more reason to do this update
19:11:10 I'm thinking that maybe we plan to do this late next week after the openstack release is done?
19:11:19 we could potentially sneak in a gerrit mariadb upgrade at that time too
19:11:20 wfm
19:11:29 good with both
19:11:39 cool I won't worry about this too much until next week then
19:11:40 maybe as separate restarts
19:12:06 just to be extra sure we know what triggered a problem if it blows up
19:12:11 ++
19:12:31 #topic Rackspace MFA Requirement
19:12:46 fungi updated our three primary accounts to use MFA with info in the usual place
19:12:52 thank you for handling that
19:13:14 Today is also the big day it will become required. As mentioned earlier keep an eye out for unexpected changes in behavior and say something if you notice any
19:13:24 it was very straightforward, the complexity was just lots of extra precautions and staged testing on our end
19:13:44 fungi: I did mean to ask which app you chose to set up to get the secret totp key
19:13:56 maybe it didn't matter and any of the app choices provided a valid string
19:14:02 definitely keep an eye on build results from jobs since we're still wary of the swift/cloudfiles accounts
19:14:08 are we uploading logs to rax currently?
19:14:13 corvus1: we are
19:14:39 so we're going with the "assume it works and yank it out if it breaks" approach?
19:14:45 ya I think so
19:14:56 sounds good to me
19:14:59 clarkb: in the end it didn't make me choose a specific app. just said i had a phone app rather than choosing the sms option
19:15:07 fungi: gotcha
19:15:36 and fed the copyable string equivalent beneath their qr code
19:15:53 into our usual totp tool
19:15:57 cool I'm glad they made that easy (my credit union does not)
19:16:59 it was only confusing insofar as they didn't mention that it might work with totp implementations besides the three popular phone apps they listed
19:17:26 ya this is a common issue with 2fa setups. They push you to an app which almost always results in totp because that's what all the apps do
19:17:37 for my personal account, i hooked up two different librem key fobs (using the nitrocli utility which supports them)
19:17:58 they just pretend that the protocol is something people shouldn't be aware of
19:18:12 #topic Project Renames
19:18:21 Still pencilled in for April 19
19:18:26 #link https://review.opendev.org/c/opendev/system-config/+/911622 Move gerrit replication queue aside during project renames.
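For context on what that change does conceptually: the replication plugin persists its queued tasks on disk, and the point of moving the queue aside is presumably to keep stale tasks from replaying against pre-rename project names. A rough sketch of the idea only, run while Gerrit is stopped; the queue path below is an assumption for illustration and is not taken from change 911622:

    # park the persisted replication queue so leftover tasks don't replay
    # against old project names (path is an assumption, check the site layout)
    QUEUE=/home/gerrit2/review_site/data/replication/ref-updates
    mv "$QUEUE" "${QUEUE}.pre-rename.$(date +%Y%m%d)"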
19:18:37 sounds good, i expect to be around
19:18:45 this change is the only prep I've done so far but I expect to start organizing things after the PTG (there is a theme around scheduling for me, can you tell)
19:19:26 also we heard back from starlingx folks and they aren't entertaining any repo renames for that window
19:19:32 oh good
19:19:34 I missed that
19:19:56 at least that was my takeaway from the starlingx-discuss ml thread
19:20:09 if you are listening in and do intend on renaming a project now is a good time to start preparing for that
19:20:54 #topic Nodepool Image Delete After Upload
19:21:24 I haven't seen any changes to implement the config update in opendev. I can't remember if someone volunteered for that last week (also maybe I missed the change)
19:21:58 I do think this may be a good space saving change we can make in addition to cleaning up old image builds like buster, leap etc so I should tack this onto the cleanup work for images
19:22:08 unless someone else wants to go ahead and push a change up
19:22:56 #topic Review02 had an Oops Last Night
19:23:19 saw that, but too late to help out, sorry :(
19:23:24 around 02:46-02:47 UTC last night review02 was shut down. Best I can tell looking at logs this was not a graceful shutdown
19:24:00 didn't see any further updates from vexxhost folks yet
19:24:23 Once I realized this was happening I gave it a few minutes just to see if the cloud would automate a restart then manually started it. The containers were in an error 255 stopped state so did not auto start on boot. I manually down and up'd them again
19:24:46 ya the last update from vexxhost was that this was likely an OOM event on the host/hypervisor side which is why it was sudden and opaque to us
19:25:06 at least it booted with minimal trauma, aside from maybe some changes not being properly indexed?
19:25:06 I missed that it was gone, because it looked like the address range was gone so I assumed it was and didn't dig further
19:25:08 everything seems to have come back mostly ok. I did note one manila change that wasn't indexed
19:25:19 fungi: ya that was the only issue I could find
19:25:24 and just one change that I found
19:25:39 tonyb: i think (but am not certain) that vexxhost does bgp to the hypervisor hosts
19:26:08 which explains why routing gets turned around at the core or edge when a vm is unreachable
19:26:24 I had come out of my migraine/headache cocoon to eat something and maybe push out a meeting agenda and did this instead :)
19:26:33 okay. good to know
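On the change that missed indexing: individual changes can be reindexed online over Gerrit's ssh admin interface rather than doing a full reindex. A minimal sketch; the change number and admin user are placeholders, and it assumes an account with the needed capability:

    # reindex a single change by its number (12345 is a placeholder)
    ssh -p 29418 admin@review.opendev.org gerrit index changes 12345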
19:26:45 I was actually feeling a lot better at that point so not a huge deal but that is why I didn't get the agenda out until this morning
19:27:19 i had just gotten out of the car and onto the internet when i saw you'd rebooted the vm, terrible timing on my part
19:27:30 to me it is a bit worrying having to assume that this can repeat any time
19:27:54 frickler: I agree, but until we hear back from the cloud as to why it happened we don't really know what the problem is and if it can repeat
19:28:05 which is why hearing from vexxhost folks on an rca might help assuage our concerns
19:28:19 if it is an OOM then java does allocate memory quite greedily so I do think that is a good hunch
19:28:29 basically we'll look like the best candidate for killing in an oomkiller situation
19:28:56 and mnaser did say they would look today so hopefully we hear back soon and then determine if we need to make any suggestions or work with them to improve things
19:29:02 the hypervisor only sees qemu as a whole
19:29:23 frickler: correct but we'll actually be using all 96GB of memory or whatever in that qemu process
19:29:33 whereas other VMs with that much RAM may only use a small portion in reality
19:29:40 main concern is whether it's an oversubscribed host, vs a memory leak in services running alongside the hypervisor
19:29:43 due to the way java allocates memory
19:30:44 ceph rbd client in qemu would be a not uncommon consumer
19:31:07 ram oversubscription would be odd, since vexxhost had previously mentioned the hosts having an excess of available ram
19:31:53 we should keep an eye out for any unexpected fallout, but otherwise wait to hear back from them to see if we can help mitigate it going forward
19:32:35 thanks again for jumping on that
19:33:15 #topic Open Discussion
19:33:23 Gitea made another 1.21 release...
19:33:27 we're playing a game of tag with them
19:33:32 #link https://review.opendev.org/c/opendev/system-config/+/914292 Update Gitea to 1.21.10
19:33:51 and I've got a change for etherpad 2.0.1 that I'm testing currently
19:33:53 #link https://review.opendev.org/c/opendev/system-config/+/914119
19:34:02 the previous ps didn't load our plugin correctly but I think that may have been a bug in the dockerfile
19:34:31 I also think it may be a bug in the upstream dockerfile (the dockerfile is actually a huge mess and maybe once we get things sorted on our end we can look at pushing a PR to clean it up)
19:34:48 as with the etherpad db upgrade I don't think we land this until after the PTG though
19:36:14 Anything else?
19:36:34 I'm still fighting this cold but other than the random headache it's mostly a non issue at this point
19:36:40 not from me.
19:37:33 ah, one thing
19:37:46 did we want to block the storpool CI account now?
19:37:59 +1 from me. I emailed them months ago and got no response
19:38:00 still seeing random reviews in devstack and elsewhere
19:38:10 ++
19:38:20 also a heads up that i'm travelling on short notice due to a family emergency, not sure for how long, so spending a lot of time working from parking lots and waiting rooms over dicey phone tether. will try to help with things when i can safely do so
19:38:47 clarkb: maybe you can do that then to make it more "official"?
19:39:14 frickler: sure I can add it to my todo list. Might not happen today but probably can this week
19:40:29 fwiw, i think any of our gerrit admins should feel free to switch a problem account to inactive, especially with this much consensus. i just make sure to #status log it for posterity
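For reference, marking a third-party CI account inactive is a one-liner over the same ssh admin interface; the username and admin user below are placeholders, not necessarily the real account names:

    # deactivate the account so it can no longer authenticate or leave reviews
    ssh -p 29418 admin@review.opendev.org gerrit set-account --inactive example-third-party-ci
    # and remember to #status log it for posterity, as noted above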
19:40:42 ++
19:40:46 Oh one more thing. Do we want to have a meeting during the PTG or should we cancel?
19:40:56 I think my schedule is open enough to have the meeting but not sure about others
19:41:30 I guess we can decide next week
19:41:42 I'd rather skip unless something important happens
19:41:52 ack
19:42:00 there's an openstack rbac session at this time i might try to catch
19:42:11 so i concur with frickler
19:42:27 sounds like we're leaning towards cancelling on the 9th. We can make it official next week
19:42:38 Thank you for your time and help running opendev everyone!
19:42:47 I think we can call that a meeting
19:42:49 #endmeeting