19:01:16 <clarkb> #startmeeting infra
19:01:16 <opendevmeet> Meeting started Tue Mar 14 19:01:16 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <opendevmeet> The meeting name has been set to 'infra'
19:01:33 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YZXXWZ7LB3KEF3AMJV3WIPFKCGH2IA2O/ Our Agenda
19:02:07 <clarkb> #topic Announcements
19:02:34 <clarkb> Daving saving time has gone into effect for some of us. Heads up that it will go into effect for others in 2-3 weeks as well
19:02:46 <clarkb> heh I can't type either. *Daylight saving time
19:03:06 <fungi> i favor switching to daving savelights time
19:03:08 <clarkb> Our meeting doesn't change the time it occurs at. It remains at 19:00 UTC but this time may have shifted relative to your local timezone due to the time change
19:03:40 <clarkb> OpenStack is making its 2023.1/Antelope release next week. That should occur on a wednesday so roughly 8 days from now
19:04:02 <fungi> yeah, "festivities" will likely start around 09:00 utc
19:04:09 <fungi> maybe a bit later
19:04:16 <fungi> release notes jobs in the tag
19:04:34 <fungi> pipeline will need about 8-10 hours due to serialization
19:05:04 <fungi> would love to work out a better option than that semaphore at some point
19:05:25 <clarkb> its only there to prevent errors that aren't actually fatal in the docs jobs right?
19:05:35 <clarkb> I mean you could just remove the semaphore and tell them to validate docs publication?
19:05:54 <clarkb> or maybe I'm confusing issues and there is a more important reason to have the semaphore
19:06:00 <fungi> well, it's there to handle the case where someone approves release requests for several branches of the same project, the release notes uploads race, and one regresses the others
19:06:15 <fungi> because all branches share the same tree in afs
19:06:48 <fungi> so they need a per-project semaphore, which doesn't really exist (without defining a separate one for each of hundreds of repos)
19:07:01 <clarkb> aha, could possibly remove the semaphore temporarily for the release since only that one branch should be getting releases on that day?
19:07:16 <fungi> possible, i'll bring it up with them
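For context, the serialization fungi describes comes from a single Zuul semaphore shared by the release-notes publication jobs. A minimal sketch of that kind of configuration follows; the job and semaphore names are assumptions, not the actual entries in project-config:

```yaml
# Hypothetical names, shown only to illustrate a max-1 semaphore serializing jobs.
- semaphore:
    name: releasenotes-publish
    max: 1

- job:
    name: publish-openstack-releasenotes
    semaphores:
      - name: releasenotes-publish
```

A per-project semaphore would need one such definition per repository, which is the scaling problem fungi mentions above.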
19:08:00 <clarkb> The week after next the virtual PTG will be taking place
19:09:03 <clarkb> And that was it for announcements
19:09:09 <clarkb> #topic Bastion Host Changes
19:09:37 <clarkb> ianw: are you here? I was hoping we'd be able to decide on whether or not we are proceeding with the backups stack.
19:09:43 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:09:48 <ianw> yes :)
19:10:12 <clarkb> It looks like you may need a second reviewer? Something we should probably do in this case since we need multiple people to stash keys?
19:10:21 <clarkb> Any volunteers for second reviews?
19:10:47 <ianw> yes, and probably a few people to say they're on board with holding a secret for it, otherwise it's not going to work
19:11:52 <fungi> i can try to take a look, and am happy to safeguard a piece of the key
19:11:56 <clarkb> I'm happy to stash the bits into my keepassxc db
19:12:20 <ianw> ok, well if fungi can take a look we can move one way or the other
19:12:23 <clarkb> fungi: thanks! I think thats the next step then. Get a second review and assuming review is happy make a plan to distribute the right key bits
19:12:50 <clarkb> anything else bridge related?
19:13:30 <ianw> nope, not for now
19:13:55 <clarkb> #topic Mailman 3
19:14:07 <clarkb> fungi: I haven't seen anything new here, but want to make sure I didn't miss anything
19:15:20 <corvus> i'm on board for being a (partial) keymaster
19:16:03 <fungi> yeah, i got very close to picking it back up today, before things started to get exciting again
19:16:07 <fungi> so nothing new to share yet
19:16:47 <clarkb> yes boring would be nice occasionally
19:16:50 <fungi> vinz clortho, keymaster of gozer
19:16:50 <clarkb> #topic Gerrit Updates
19:17:11 <clarkb> ianw's stack of copyCondition and submit requirements changes has landed as has the manual update to All-Projects for submit requirements
19:17:30 <clarkb> We did run into some problems with the All-Projects update because 'and' and 'AND' are different in Gerrit 3.6 query expressions
19:17:42 <clarkb> But that got sorted out and I think things have been happy since (at least no new complaints since then)
19:17:55 <fungi> but not in 3.7. that seems like an unfortunate choice of fix not to backport
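For reference, the new-style ACL pieces being discussed look roughly like this. This is a hedged sketch with an illustrative label rather than the actual All-Projects content, using the uppercase AND that the Gerrit 3.6 query parser requires:

```ini
[label "Verified"]
    function = NoBlock
    copyCondition = changekind:NO_CHANGE OR changekind:TRIVIAL_REBASE

[submit-requirement "Verified"]
    description = Needs a maximum Verified vote and no veto
    submittableIf = label:Verified=MAX AND -label:Verified=MIN
```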
19:18:22 <clarkb> ianw: from your work on this are there other ACL updates you think we need to make or are we all up to date for modern Gerrit 3.7 expectations?
19:18:44 <ianw> nope, i think we're ready for the 3.7 transition from that POV now
19:19:02 <ianw> i will spend a little time updating https://etherpad.opendev.org/p/gerrit-upgrade-3.7 today
19:19:12 <clarkb> great!
19:19:33 <ianw> a couple of things to check, but i think all known knowns and known unknowns are dealt with :)
19:19:48 <clarkb> ianw: as far as ensuring we don't slide backwards goes can we update the little checker tool to only allow function = NoBlock and require copyCondition not the old thing?
19:20:07 <clarkb> I think if we do those two things it will prevent any cargo culting of old info accidentally
19:20:25 <ianw> oh yes, sorry that's on my todo list.  the snag i hit was that the normalizer isn't really a linter in the way of a normal linter, but a transformer, and then if there's a diff it stops
19:20:41 <clarkb> ya in that case maybe just delete the lines we don't want which will produce a diff
19:20:57 <clarkb> and hopefully that diff is clear that we don't want those lines because they are removed (don't need to replace them with an equivalent as that would be more effort)
19:21:08 <ianw> i guess the problem is that that then creates a diff that is wrong
19:21:16 <ianw> i wasn't sure if the point was that you could apply the diff
19:21:32 <ianw> if so, it kind of implies writing a complete function -> s-r transformer
19:21:37 <clarkb> I think the idea was the diff would help people correct their changes and bonus points if you could directly apply it
19:22:01 <clarkb> in this case I think it is ok if we have a diff that isn't going to completely fix changes for people and simply force an error and pull the eye to where the problem is
19:22:15 <ianw> i could do something like add a comment line "# the following line is deprecated, work around it"?
19:22:29 <clarkb> ++
19:22:39 <ianw> ok, i'll do that then
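A minimal sketch of the "flag it, don't transform it" approach being agreed on here, as hypothetical standalone Python rather than a change to the actual normalizer in project-config:

```python
import re
import sys

# Old copy* flags that should now be expressed as a copyCondition.
DEPRECATED_COPY_FLAGS = re.compile(
    r"^\s*copy(MinScore|MaxScore|AllScoresIfNoChange|"
    r"AllScoresIfNoCodeChange|AllScoresOnTrivialRebase)\s*=")
FUNCTION_LINE = re.compile(r"^\s*function\s*=\s*(\S+)")


def check_acl(path):
    """Return warnings for deprecated constructs in one ACL file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            m = FUNCTION_LINE.match(line)
            if m and m.group(1) != "NoBlock":
                problems.append(
                    f"{path}:{lineno}: deprecated label function '{m.group(1)}'; "
                    "use function = NoBlock plus a submit-requirement")
            if DEPRECATED_COPY_FLAGS.match(line):
                problems.append(
                    f"{path}:{lineno}: deprecated copy* flag; use copyCondition")
    return problems


if __name__ == "__main__":
    issues = [p for path in sys.argv[1:] for p in check_acl(path)]
    if issues:
        print("\n".join(issues))
    sys.exit(1 if issues else 0)
```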
19:24:22 <clarkb> #topic Project Renames and Gerrit Upgrade
19:24:57 <clarkb> Quick check if we think we are still on track for an April 7th upgrade of Gerrit and project renames
19:25:12 <clarkb> I think the only concern that has come up is the docker org deletion on the 14th
19:25:40 <clarkb> mostly worried that will demand our time and we won't be able to prep for gerrit things appropriately. But it is probably too early to cancel or move the date for that. Mostly bringing it up as a risk
19:26:15 <clarkb> And then I wanted to talk about the coordination of that. Do we want to do the renames and upgrade in one window or two separate windows? And what sorts of times are we looking at?
19:27:43 <clarkb> ianw: I think you were thinking of doing the Gerrit upgrade late April 6 UTC or early April 7 UTC? Then maybe fungi and I do the renames during our working hours April 7 if we do two different windows
19:28:05 <clarkb> If we do one window I can be around to do it all late April 6 early April 7 but I think that gets more difficult for fungi
19:28:38 <ianw> i guess question 1 is do we want renames or upgrade first?
19:28:45 <corvus> i agree it's worth keeping an eye on, and if anyone feels overburdened, raise a flag and we can slow down or deal with it.  but from right now at least, i think we can work on both.
19:29:00 <fungi> i can swing it
19:29:11 <fungi> i just can't do the week after as i'll be offline
19:29:39 <clarkb> ianw: I think one reason to do renames first would be if we had previously done renames under that gerrit version. But we have never renamed anything under 3.6 so order doesn't matter much
19:30:13 <clarkb> fungi: ianw  ok in that case maybe aim for ~2200-2300 UTC April 6 and do both of them?
19:30:28 <clarkb> and we can sort out the order another time if we're committing to a single block like that
19:31:10 <ianw> ok, if that's a bit late we could move it forward a few hours too
19:31:29 <fungi> wfm
19:31:51 <clarkb> Ok with that decided (lets say 2200 UTC to make it a bit easier for fungi) should we send email about that now?
19:32:03 <clarkb> for some value of now approximately equal to soon
19:32:11 <ianw> ++
19:32:13 <clarkb> I can do that I just want to make sure we're reasonably confident first
19:32:27 <clarkb> cool I'll add that to my todo list
19:32:30 <fungi> thanks!
19:32:38 <ianw> i am happy to drive, and we'll have checklists, so hopefully it's really just don't be drunk at that time in case the worst happens :)
19:32:54 <clarkb> haha
19:32:57 <ianw> or maybe, get drunk, in case the worst happens.  either way :)
19:33:22 <clarkb> #topic Old Server Upgrades
19:33:30 <clarkb> Much progress has been made with the giteas.
19:33:46 <clarkb> As of Friday we're entirely jammy for the gitea cluster in production behind the load balancer
19:34:17 <clarkb> I have changes up to clean up gitea01-04 but have WIP'd them because I think the openstack release tends to be a high load scenario for the giteas and that is a good sanity check we won't need those servers before deleting them
19:34:39 <clarkb> I'll basically aim to keep replicating to the gitea01-04 backends until after the openstack release and if all looks well after that clean them up
19:35:18 <fungi> yeah, especially when some of the deployment projects update and all their users start pulling the new release at the same time
19:35:19 <clarkb> there are two reasons for the caution here. The first is that we've changed the flavor type for the new servers and we've seen some high cpu steal at times (though those flavors are bigger on more modern cpus so in theory will be quicker anyway). The second is that I've reduced the gitea backend count from 8 to 6
19:35:46 <clarkb> so far though those new servers have looked ok
19:35:58 <clarkb> just want to keep an eye out through the release before making the cleanup more permanent
19:36:13 <clarkb> ianw has also started looking at nameserver replacements
19:36:15 <clarkb> #link https://etherpad.opendev.org/p/2023-opendev-dns
19:36:21 <clarkb> #link https://review.opendev.org/q/topic:jammy-dns
19:36:36 <clarkb> good news the docker stuff doesn't affect dns :)
19:36:38 <ianw> yep sorry got totally distracted on that, but will update all that now that we've got consensus on the names
19:36:52 <fungi> thanks for working on it
19:37:19 <clarkb> Ya this is all good progress. Still more work to do including etherpad which I had previously planned to do after the PTG
19:37:39 <clarkb> its possible to get it done quickly pre ptg but the ptg relies on etherpad so much I'd kinda prefer changing things after
19:37:52 <clarkb> jitsi meet as well
19:37:58 <corvus> clarkb: the gitea graphs look good.  qq (i hope it's quick, if not, nevermind and we can take it offline) -- what happened between march 7-9 -- maybe we had fewer new servers and then added more?
19:38:22 <clarkb> corvus: yes, we had 4 new servers and we also got hit by a bot crawler that was acting like a 2014 samsung phone
19:38:43 <clarkb> corvus: we addressed that by updating our UA block list to block the ancient phone and added two more servers for a total of 6
19:39:02 <clarkb> I thought we might get away with 4 servers instead of 8 but that incident showed that was probably too small
19:39:10 <fungi> so the issue was twofold: a bad actor and fewer backends
19:39:14 <corvus> cool; thanks
19:39:32 <fungi> it noticeably slowed down response times for clients too
19:39:40 <fungi> while that was going on
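As an aside on the mitigation mentioned above, a generic sketch of what a user-agent block can look like in an Apache front end; the actual rule and UA string in opendev's gitea proxy config will differ, and the Samsung model string here is purely illustrative:

```apache
# Illustrative only: return 403 to a specific old phone user agent.
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" "SM-G900" [NC]
RewriteRule "." "-" [F]
```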
19:40:10 <clarkb> If I get time this week or next I'll probably try to do a server or two that the ptg doesn't interact with (mirror nodes maybe?)
19:40:35 <clarkb> anyway measurable progress here. Thanks for all the help
19:40:44 <clarkb> #topic AFS volume quotas and utilization
19:41:00 <clarkb> Last week I bumped AFS quotas for the volumes that were very close to the limit
19:41:21 <clarkb> That avoided breaking any of those distro repo mirrors which is great. But doesn't address the ever-growing disk utilization problem
19:41:37 <clarkb> also it looks like deleting fedora 35 and adding fedora 37 resulted in a net increase of disk utilization
19:42:04 <ianw> i should be able to remove 36 fairly quickly
19:42:13 <clarkb> I did poke around looking for some easy wins deleting things (something that has worked well in the past) and didn't really come up with any other than: Maybe we can drop the openeuler mirror and force them to pull from upstream like we do with rocky?
19:42:17 <clarkb> ianw: oh thats good to know
19:42:34 <fungi> there's also a debian release coming up which we'll probably need at least a temporary bump in capacity for before we can drop old-oldstable
19:42:57 <clarkb> Maybe lets get that done before making any afs decisions. The other idea I had was we should maybe consider adding a new backing volume to the two dfw fileservers
19:44:14 <clarkb> I don't think this is urgent as long as we are not adding new stuff (debian will force the issue when that happens)
19:44:28 <clarkb> I guess start with fedora 36 cleanup then evaluate what is necessary to add new debian content
19:44:32 <fungi> worth trying to find out if debian-buster images are still heavily used, or transition them off our mirroring if they are infrequently used but unlikely to get dropped from use soon
19:44:59 <fungi> in order to free up room for debian-bookworm in a few months
19:45:19 <clarkb> fungi: ya thats an option. Can also make buster talk to upstream if infrequently used
19:45:21 <ianw> you can't have one *volume* > 2tb right (that was pypi's issue?)
19:45:22 <clarkb> but keep the images
19:45:26 <clarkb> ianw: correct
19:45:56 <clarkb> ianw: we can add up to 12 cinder volumes each a max of 1TB (these are cloud limitations) to the lvm on the fileservers so we are well under total afs disk potential
19:45:56 <fungi> yeah, that's what i meant by transition off our mirroring
19:46:08 <clarkb> but then an individual afs volume can't be more than 2TB
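For anyone following along, the quota bumps and volume-size checks above boil down to a few OpenAFS commands. A hedged sketch, where the mirror path and volume name are assumed to follow the usual layout:

```shell
fs listquota /afs/openstack.org/mirror/ubuntu                 # quota vs. current usage for one mirror
vos examine mirror.ubuntu                                     # volume size and which fileserver holds the RW copy
fs setquota /afs/openstack.org/mirror/ubuntu -max 650000000   # new quota, in kilobytes (~650 GB)
```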
19:46:58 <fungi> but also the more cinder devices we attach, the more precarious the server becomes
19:47:08 <ianw> i guess the only problem is if those screw up, it becomes increasingly difficult to recover
19:47:14 <ianw> heh, jinx
19:47:19 <corvus> we can add more servers
19:47:21 <clarkb> ya and also just general risk of an outage
19:47:22 <fungi> it basically multiplies the chances of the server suffering a catastrophic failure from an iscsi incident
19:48:11 <fungi> right, more afs servers with different rw volumes may be more robust than adding more storage to one server
19:48:34 <corvus> (doesn't affect our overall chances of being hit by an iscsi incident, but may contain the fallout and make it easier to recover)
19:48:57 <fungi> the risk of *an* outage doesn't decrease, but the impact of an outage for a single device or server decreases to just the volumes served from it
19:49:30 <clarkb> corvus: does growing vicepa require services be stopped?
19:49:35 <ianw> we also add everything under vicepa -- we could use other partitions?
19:49:36 <clarkb> if so that may be another good reason to use new servers
19:50:00 <clarkb> ianw: heh jinx. I'm not sure what the mechanics of the underlying data are like and whether or not one approach should be preferred
19:50:16 <fungi> also, vos release performance may improve, since we effectively serialize those today with the assumption that otherwise we'll overwhelm the one server with the rw volumes
19:50:54 <clarkb> we've only got 10 minutes left and there are a couple of other things I wanted to discuss. Lets keep afs in mind and we can brainstorm ideas going forward but it isn't urgent today
19:51:05 <clarkb> more of a mid term thing
19:51:18 <clarkb> #topic Quo vadis Storyboard
19:51:21 <corvus> we're not using raw partitions, we're using ext filesystems, so i don't think anything needs to be stopped to grow it, but i'm not positive on that.
19:51:27 <clarkb> corvus: ack
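If corvus is right about that (and ext4 does support online growth), extending vicepa after attaching another cinder volume would look roughly like the following; the device, volume group, and logical volume names are assumptions:

```shell
pvcreate /dev/xvdf                          # initialize the newly attached cinder volume
vgextend vicepa-vg /dev/xvdf                # add it to the volume group backing /vicepa
lvextend -r -l +100%FREE vicepa-vg/vicepa   # grow the LV and resize the ext4 filesystem online
```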
19:51:53 <clarkb> frickler won't be able to attend this meeting today but made a good point that with the PTG coming up there may be discussions from projects about not using storyboard anymore
19:52:12 <clarkb> I mentioned in #opendev that I think we should continue to encourage those groups to work together and coordinate any tooling they might produce so that we don't have duplicated efforts
19:52:49 <clarkb> But that does leave open the question of what we should do. I also mentioned in #opendev that if I was a lone person making a decision I think I'd look at sunsetting storyboard since we haven't been able to effectively operate/upgrade/maintain it
19:53:13 <clarkb> with an ideal sunset involving more than 30 days notice and, if we can, making some sort of read only archive that is easier to manage
19:53:29 <clarkb> That said I don't think I should make decisions like that alone so am open to feedback and other ideas
19:53:52 <clarkb> I'm also happy to jump into ptg sessions that involve storyboard to try and help where I can during the ptg
19:54:19 <clarkb> Maybe ya'll can digest those ideas and let me know if they make sense or are terrible or have better ones :)
19:54:39 <clarkb> Definitely not something we have time for today or in this meeting. But the feedback would be helpful
19:54:46 <ianw> perhaps sunsetting it would be the push someone needs to dedicate resources to it?
19:54:52 <clarkb> ianw: its possible
19:55:01 <clarkb> I think that is unlikely but it is a theoretical outcome
19:55:01 <ianw> either way something happens then, i guess
19:55:19 <clarkb> ok running out of time and one more item remains
19:55:29 <clarkb> this is not on the agenda but worth bringing up
19:55:35 <clarkb> #topic Docker ending free team organizations
19:55:44 <fungi> because people will ask about it anyway ;)
19:56:00 <clarkb> Docker is ending their free team organization setup which we use for opendevorg and zuul on docker hub
19:56:16 <clarkb> (there are actually two other orgs openstackinfra and stackforge which are unused and empty)
19:56:34 <clarkb> This will affect us one way or another and we are very likely going to need to make changes
19:56:57 <clarkb> It isn't clear yet which changes we will need to make and of the options which we should take but I started an etherpad to collect info and try to make that decision making easier
19:57:01 <clarkb> #link https://etherpad.opendev.org/p/MJTzrNTDMFyEUxi1ReSo
19:57:39 <clarkb> I think we should continue to gather information and collect ideas there for the next day or two without trying to attribute too much value to any of them. Then once we have a good clear picture make some decisions
19:57:59 <corvus> one point it would be useful to clarify is whether it's possible, and if so how, we can have an unpaid organization on quay.io to host our public images.  quay.io says that's possible, but i only see a $15/mo developer option on the pricing page, and account signup requires a phone number.
19:58:23 <clarkb> If you sign up with a phone number and get what you need I'm happy to sacrifice mine
19:58:35 <clarkb> ianw: ^ maybe that is something you can ask about at red hat?
19:58:48 <clarkb> basically clarify what account setup requirements are and if public open source projects need to pay for public image hosting
19:59:11 <corvus> (i'd be happy to sign up to find out too, except that i wear a lot of hats, and if i only get one phone number, i don't know if i should burn it for "opendev" "zuul" or "acme gating"...)
19:59:16 <ianw> i can certainly look into it -- off the top of my head i don't know anyone directly involved for instant answers but i'll see what i can find
19:59:27 <clarkb> ianw: thanks!
19:59:59 <corvus> (or maybe it's okay to have two accounts with the same phone number.. <shrug>)
20:00:30 <clarkb> Also NeilHanlon (Rocky Linux) and Ramereth (OSUOSL) have similar issues/concerns with this and we may be able to learn from each other. They have both applied for docker's open source program which is apparently one way around this
20:00:42 <clarkb> I asked them to provide us with info on how that goes just so that we've got it and can weigh that option
20:01:24 <fungi> or at least someone on twitter with the same name as a c-level exec at docker claimed that they won't delete teams who apply for the open source tier
20:01:47 * fungi takes nothing for granted these days
20:01:57 <clarkb> yes, they also say they won't allow names to be reused which means if/when we get our orgs deleted others shouldn't be able to impersonate us
20:02:13 <clarkb> this is important because docker clients default to dockerhub if you don't qualify the image names with a location
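To make that default concrete, a quick example; the image name and tag are illustrative, and the quay.io location is hypothetical:

```shell
docker pull opendevorg/python-builder:3.11-bullseye          # unqualified: implicitly docker.io/opendevorg/...
docker pull quay.io/opendevorg/python-builder:3.11-bullseye  # fully qualified: pinned to a specific registry
```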
20:02:47 <clarkb> and we are at time. I can smell lunch too so we'll end here :)
20:02:50 <clarkb> Thank you everyone!
20:02:53 <clarkb> #endmeeting