19:01:16 #startmeeting infra
19:01:16 Meeting started Tue Mar 14 19:01:16 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:33 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YZXXWZ7LB3KEF3AMJV3WIPFKCGH2IA2O/ Our Agenda
19:02:07 #topic Announcements
19:02:34 Daving saving time has gone into effect for some of us. Heads up that it will go into effect for others in 2-3 weeks as well
19:02:46 heh I can't type either. *Daylight saving time
19:03:06 i favor switching to daving savelights time
19:03:08 Our meeting doesn't change the time it occurs at. It remains at 19:00 UTC but this time may have shifted relative to your local timezone due to the time change
19:03:40 OpenStack is making its 2023.1/Antelope release next week. That should occur on a Wednesday so roughly 8 days from now
19:04:02 yeah, "festivities" will likely start around 09:00 utc
19:04:09 maybe a bit later
19:04:16 release notes jobs in the tag
19:04:34 pipeline will need about 8-10 hours due to serialization
19:05:04 would love to work out a better option than that semaphore at some point
19:05:25 it's only there to prevent errors that aren't actually fatal in the docs jobs right?
19:05:35 I mean you could just remove the semaphore and tell them to validate docs publication?
19:05:54 or maybe I'm confusing issues and there is a more important reason to have the semaphore
19:06:00 well, it's there to solve when someone approves release requests for several branches of the same project and they race uploads of the release notes and one regresses the others
19:06:15 because all branches share the same tree in afs
19:06:48 so they need a per-project semaphore, which doesn't really exist (without defining a separate one for each of hundreds of repos)
19:07:01 aha, could possibly remove the semaphore temporarily for the release since only that one branch should be getting releases on that day?
19:07:16 possible, i'll bring it up with them
19:08:00 The week after next the virtual PTG will be taking place
19:09:03 And that was it for announcements
19:09:09 #topic Bastion Host Changes
19:09:37 ianw: are you here? I was hoping we'd be able to decide on whether or not we are proceeding with the backups stack.
19:09:43 #link https://review.opendev.org/q/topic:bridge-backups
19:09:48 yes :)
19:10:12 It looks like you may need a second reviewer? Something we should probably do in this case since we need multiple people to stash keys?
19:10:21 Any volunteers for second reviews?
19:10:47 yes, and probably a few people to say they're on board with holding a secret for it, otherwise it's not going to work
19:11:52 i can try to take a look, and am happy to safeguard a piece of the key
19:11:56 I'm happy to stash the bits into my keepassxc db
19:12:20 ok, well if fungi can take a look we can move one way or the other
19:12:23 fungi: thanks! I think that's the next step then. Get a second review and assuming review is happy make a plan to distribute the right key bits
19:12:50 anything else bridge related?
19:13:30 nope, not for now
19:13:55 #topic Mailman 3
19:14:07 fungi: I haven't seen anything new here, but want to make sure I didn't miss anything
19:15:20 i'm on board for being a (partial) keymaster
19:16:03 yeah, i got very close to picking it back up today, before things started to get exciting again
19:16:07 so nothing new to share yet
19:16:47 yes boring would be nice occasionally
19:16:50 vinz clortho, keymaster of gozer
19:16:50 #topic Gerrit Updates
19:17:11 ianw's stack of copyCondition and submit requirements changes has landed as has the manual update to All-Projects for submit requirements
19:17:30 We did run into some problems with the All-Projects update because 'and' and 'AND' are different in Gerrit 3.6 query expressions
19:17:42 But that got sorted out and I think things have been happy since (at least no new complaints since then)
19:17:55 but not in 3.7. that seems like an unfortunate choice of fix not to backport
19:18:22 ianw: from your work on this are there other ACL updates you think we need to make or are we all up to date for modern Gerrit 3.7 expectations?
19:18:44 nope, i think we're ready for the 3.7 transition from that POV now
19:19:02 i will spend a little time updating https://etherpad.opendev.org/p/gerrit-upgrade-3.7 today
19:19:12 great!
19:19:33 a couple of things to check, but i think all known knowns and known unknowns are dealt with :)
19:19:48 ianw: as far as ensuring we don't slide backwards goes can we update the little checker tool to only allow function = NoBlock and require copyCondition not the old thing?
19:20:07 I think if we do those two things it will prevent any cargo culting of old info accidentally
19:20:25 oh yes, sorry that's on my todo list.
the snag i hit was that the normalizer isn't really a linter in the way of a normal linter, but a transformer, and then if there's a diff it stops
19:20:41 ya in that case maybe just delete the lines we don't want which will produce a diff
19:20:57 and hopefully that diff is clear that we don't want those lines because they are removed (don't need to replace them with an equivalent as that would be more effort)
19:21:08 i guess the problem is that that then creates a diff that is wrong
19:21:16 i wasn't sure if the point was that you could apply the diff
19:21:32 if so, it kind of implies writing a complete function -> s-r transformer
19:21:37 I think the idea was the diff would help people correct their changes and bonus points if you could directly apply it
19:22:01 in this case I think it is ok if we have a diff that isn't going to completely fix changes for people and simply force an error and pull the eye to where the problem is
19:22:15 i could do something like add a comment line "# the following line is deprecated, work around it"?
19:22:29 ++
19:22:39 ok, i'll do that then
19:24:22 #topic Project Renames and Gerrit Upgrade
19:24:57 Quick check if we think we are still on track for an April 7th upgrade of Gerrit and project renames
19:25:12 I think the only concern that has come up is the docker org deletion on the 14th
19:25:40 mostly worried that will demand our time and we won't be able to prep for gerrit things appropriately. But it is probably too early to cancel or move the date for that. Mostly bringing it up as a risk
19:26:15 And then I wanted to talk about the coordination of that. Do we want to do the renames and upgrade in one window or two separate windows? And what sorts of times are we looking at?
19:27:43 ianw: I think you were thinking of doing the Gerrit upgrade late April 6 UTC or early April 7 UTC?
Then maybe fungi and I do the renames during our working hours April 7 if we do two different windows
19:28:05 If we do one window I can be around to do it all late April 6 early April 7 but I think that gets more difficult for fungi
19:28:38 i guess question 1 is do we want renames or upgrade first?
19:28:45 i agree it's worth keeping an eye on, and if anyone feels overburdened, raise a flag and we can slow down or deal with it. but from right now at least, i think we can work on both.
19:29:00 i can swing it
19:29:11 i just can't do the week after as i'll be offline
19:29:39 ianw: I think one reason to do renames first would be if we had previously done renames under that gerrit version. But we have never renamed anything under 3.6 so order doesn't matter much
19:30:13 fungi: ianw ok in that case maybe aim for ~2200-2300 UTC April 6 and do both of them?
19:30:28 and we can sort out the order another time if we're committing to a single block like that
19:31:10 ok, if that's a bit late we could move it forward a few hours too
19:31:29 wfm
19:31:51 Ok with that decided (let's say 2200 UTC to make it a bit easier for fungi) should we send email about that now?
19:32:03 for some value of now approximately equal to soon
19:32:11 ++
19:32:13 I can do that I just want to make sure we're reasonably confident first
19:32:27 cool I'll add that to my todo list
19:32:30 thanks!
19:32:38 i am happy to drive, and we'll have checklists, so hopefully it's really just don't be drunk at that time in case the worst happens :)
19:32:54 haha
19:32:57 or maybe, get drunk, in case the worst happens. either way :)
19:33:22 #topic Old Server Upgrades
19:33:30 Much progress has been made with the giteas.
19:33:46 As of Friday we're entirely jammy for the gitea cluster in production behind the load balancer
19:34:17 I have changes up to clean up gitea01-04 but have WIP'd them because I think the openstack release tends to be a high load scenario for the giteas and that is a good sanity check we won't need those servers before deleting them
19:34:39 I'll basically aim to keep the gitea01-04 backends replicated to until after the openstack release and if all looks well after that clean them up
19:35:18 yeah, especially when some of the deployment projects update and all their users start pulling the new release at the same time
19:35:19 there are two reasons for the caution here. The first is that we've changed the flavor type for the new servers and we've seen some high cpu steal at times. But those flavors are bigger on more modern cpus so in theory will be quicker anyway so I've reduced the gitea backend count from 6 to 8
19:35:22 * 8 to 6
19:35:46 so far though those new servers have looked ok
19:35:58 just want to keep an eye out through the release before making the cleanup more permanent
19:36:13 ianw has also started looking at nameserver replacements
19:36:15 #link https://etherpad.opendev.org/p/2023-opendev-dns
19:36:21 #link https://review.opendev.org/q/topic:jammy-dns
19:36:36 good news the docker stuff doesn't affect dns :)
19:36:38 yep sorry got totally distracted on that, but will update all that now that we've got consensus on the names
19:36:52 thanks for working on it
19:37:19 Ya this is all good progress. Still more work to do including etherpad which I had previously planned to do after the PTG
19:37:39 it's possible to get it done quickly pre ptg but the ptg relies on etherpad so much I'd kinda prefer changing things after
19:37:52 jitsi meet as well
19:37:58 clarkb: the gitea graphs look good. qq (i hope it's quick, if not, nevermind and we can take it offline) -- what happened between march 7-9 -- maybe we had fewer new servers and then added more?
19:38:22 corvus: yes, we had 4 new servers and we also got hit by a bot crawler that was acting like a 2014 samsung phone
19:38:43 corvus: we addressed that by updating our UA agent block list to block the ancient phone and added two more servers for a total of 6
19:39:02 I thought we might get away with 4 servers instead of 8 but that incident showed that was probably too small
19:39:10 so the issue was twofold: a bad actor and fewer backends
19:39:14 cool; thanks
19:39:32 it noticeably slowed down response times for clients too
19:39:40 while that was going on
19:40:10 If I get time this week or next I'll probably try to do a server or two that the ptg doesn't interact with (mirror nodes maybe?)
19:40:35 anyway measurable progress here. Thanks for all the help
19:40:44 #topic AFS volume quotas and utilization
19:41:00 Last week I bumped AFS quotas for the volumes that were very close to the limit
19:41:21 That avoided breaking any of those distro repo mirrors which is great. But doesn't address the ever growing disk utilization problem
19:41:37 also it looks like deleting fedora 35 and adding fedora 37 resulted in a net increase of disk utilization
19:42:04 i should be able to remove 36 fairly quickly
19:42:13 I did poke around looking for some easy wins deleting things (something that has worked well in the past) and didn't really come up with any other than: Maybe we can drop the openeuler mirror and force them to pull from upstream like we do with rocky?
19:42:17 ianw: oh that's good to know
19:42:34 there's also a debian release coming up which we'll probably need at least a temporary bump in capacity for before we can drop old-oldstable
19:42:57 Maybe let's get that done before making any afs decisions.
The other idea I had was we should maybe consider adding a new backing volume to the two dfw fileservers
19:44:14 I don't think this is urgent as long as we are not adding new stuff (debian will force the issue when that happens)
19:44:28 I guess start with fedora 36 cleanup then evaluate what is necessary to add new debian content
19:44:32 worth trying to find out if debian-buster images are still heavily used, or transition them off our mirroring if they are infrequently used but unlikely to get dropped from use soon
19:44:59 in order to free up room for debian-bookworm in a few months
19:45:19 fungi: ya that's an option. Can also make buster talk to upstream if infrequently used
19:45:21 you can't have one *volume* > 2tb right (that was pypi's issue?)
19:45:22 but keep the images
19:45:26 ianw: correct
19:45:56 ianw: we can add up to 12 cinder volumes each a max of 1TB (these are cloud limitations) to the lvm on the fileservers so we are well under total afs disk potential
19:45:56 yeah, that's what i meant by transition off our mirroring
19:46:08 but then an individual afs volume can't be more than 2TB
19:46:58 but also the more cinder devices we attach, the more precarious the server becomes
19:47:08 i guess the only problem is if those screw up, it becomes increasingly difficult to recover
19:47:14 heh, jinx
19:47:19 we can add more servers
19:47:21 ya and also just general risk of an outage
19:47:22 it basically multiplies the chances of the server suffering a catastrophic failure from an iscsi incident
19:48:11 right, more afs servers with different rw volumes may be more robust than adding more storage to one server
19:48:34 (doesn't affect our overall chances of being hit by an iscsi incident, but may contain the fallout and make it easier to recover)
19:48:57 the risk of *an* outage doesn't decrease, but the impact of an outage for a single device or server decreases to just the volumes served from it
19:49:30 corvus: does growing vicepa require services
be stopped?
19:49:35 we also add everything under vicepa -- we could use other partitions?
19:49:36 if so that may be another good reason to use new servers
19:50:00 ianw: heh jinx. I'm not sure what the mechanics of the underlying data are like and whether or not one approach should be preferred
19:50:16 also, vos release performance may improve, since we effectively serialize those today with the assumption that otherwise we'll overwhelm the one server with the rw volumes
19:50:54 we've only got 10 minutes left and there are a couple of other things I wanted to discuss. Let's keep afs in mind and we can brainstorm ideas going forward but it isn't urgent today
19:51:05 more of a mid term thing
19:51:18 #topic Quo vadis Storyboard
19:51:21 we're not using raw partitions, we're using ext filesystems, so i don't think anything needs to be stopped to grow it, but i'm not positive on that.
19:51:27 corvus: ack
19:51:53 frickler won't be able to attend this meeting today but made a good point that with the PTG coming up there may be discussions from projects about not using storyboard anymore
19:52:12 I mentioned in #opendev that I think we should continue to encourage those groups to work together and coordinate any tooling they might produce so that we don't have duplicated efforts
19:52:49 But does leave open the question for what we should do.
I also mentioned in #opendev that if I was a lone person making a decision I think I'd look at sunsetting storyboard since we haven't been able to effectively operate/upgrade/maintain it
19:53:13 with an ideal sunset involving more than 30 days notice and if we can making some sort of read only archive that is easier to manage
19:53:29 That said I don't think I should make decisions like that alone so am open to feedback and other ideas
19:53:52 I'm also happy to jump into ptg sessions that involve storyboard to try and help where I can during the ptg
19:54:19 Maybe y'all can digest those ideas and let me know if they make sense or are terrible or have better ones :)
19:54:39 Definitely not something we have time for today or in this meeting. But the feedback would be helpful
19:54:46 perhaps sunsetting it would be the push someone needs to dedicate resources on it?
19:54:52 ianw: it's possible
19:55:01 I think that is unlikely but it is a theoretical outcome
19:55:01 either way something happens then, i guess
19:55:19 ok running out of time and one more item remains
19:55:29 this is not on the agenda but worth bringing up
19:55:35 #topic Docker ending free team organizations
19:55:44 because people will ask about it anyway ;)
19:56:00 Docker is ending their free team organization setup which we use for opendevorg and zuul on docker hub
19:56:16 (there are actually two other orgs openstackinfra and stackforge which are unused and empty)
19:56:34 This will affect us one way or another and we are very likely going to need to make changes
19:56:57 It isn't clear yet which changes we will need to make and of the options which we should take but I started an etherpad to collect info and try to make that decision making easier
19:57:01 #link https://etherpad.opendev.org/p/MJTzrNTDMFyEUxi1ReSo
19:57:39 I think we should continue to gather information and collect ideas there for the next day or two without trying to attribute too much value to any of them.
Then once we have a good clear picture make some decisions
19:57:59 one point it would be useful to clarify is whether it's possible, and if so how, we can have an unpaid organization on quay.io to host our public images. quay.io says that's possible, but i only see a $15/mo developer option on the pricing page, and account signup requires a phone number.
19:58:23 If you sign up with a phone number and get what you need I'm happy to sacrifice mine
19:58:35 ianw: ^ maybe that is something you can ask about at red hat?
19:58:48 basically clarify what account setup requirements are and if public open source projects need to pay for public image hosting
19:59:11 (i'd be happy to sign up to find out too, except that i wear a lot of hats, and if i only get one phone number, i don't know if i should burn it for "opendev" "zuul" or "acme gating"...)
19:59:16 i can certainly look into it -- off the top of my head i don't know anyone directly involved for instant answers but i'll see what i can find
19:59:27 ianw: thanks!
19:59:59 (or maybe it's okay to have two accounts with the same phone number.. )
20:00:30 Also NeilHanlon (Rocky Linux) and Ramereth (OSUOSL) have similar issues/concerns with this and we may be able to learn from each other. They have both applied for docker's open source program which is apparently one way around this
20:00:42 I asked them to provide us with info on how that goes just so that we've got it and can weigh that option
20:01:24 or at least someone on twitter with the same name as a c-level exec at docker claimed that they won't delete teams who apply for the open source tier
20:01:47 * fungi takes nothing for granted these days
20:01:57 yes, they also say they won't allow names to be reused which means if/when we get our orgs deleted others shouldn't be able to impersonate us
20:02:13 this is important because docker clients default to dockerhub if you don't qualify the image names with a location
20:02:47 and we are at time.
I can smell lunch too so we'll end here :)
20:02:50 Thank you everyone!
20:02:53 #endmeeting
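
The closing point in the Docker topic, that docker clients default to dockerhub when an image name isn't qualified with a location, can be illustrated with a small sketch. This is not Docker's own code; it assumes the commonly documented rule that the first path component is treated as a registry host only if it contains a "." or ":" or is exactly "localhost". The function name `qualify_image` and the `gitea` repository name are hypothetical and used only for illustration.

```python
DEFAULT_REGISTRY = "docker.io"   # Docker Hub, the implicit default location
DEFAULT_NAMESPACE = "library"    # namespace used for single-component names


def qualify_image(image: str) -> str:
    """Return the fully qualified form of an image name (hypothetical helper).

    Assumption: the first component counts as a registry host only if it
    contains '.' or ':' or equals 'localhost'. Anything else is treated as
    a Docker Hub path.
    """
    first, sep, _rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        # Already qualified with a location, e.g. quay.io/org/name
        return image
    if not sep:
        # Bare name such as "ubuntu" or "ubuntu:22.04"
        return f"{DEFAULT_REGISTRY}/{DEFAULT_NAMESPACE}/{image}"
    # org/name without a location falls through to Docker Hub
    return f"{DEFAULT_REGISTRY}/{image}"


if __name__ == "__main__":
    for name in ("opendevorg/gitea", "quay.io/opendevorg/gitea", "ubuntu:22.04"):
        print(name, "->", qualify_image(name))
```

Under that assumption, an unqualified `opendevorg/...` reference resolves to Docker Hub, which is why whoever controls that organization name there matters if the orgs are ever deleted.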