Tuesday, 2023-03-14

clarkbmeeting time19:00
clarkbI'm a bit behind due to the docker stuff. Please excuse my lack of organization today19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Mar 14 19:01:16 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YZXXWZ7LB3KEF3AMJV3WIPFKCGH2IA2O/ Our Agenda19:01
clarkb#topic Announcements19:02
clarkbDaving saving time has gone into effect for some of us. Heads up that it will go into effect for others in 2-3 weeks as well19:02
clarkbheh I can't type either. *Daylight saving time19:02
fungii favor switching to daving savelights time19:03
clarkbOur meeting doesn't change the time it occurs at. It remains at 19:00 UTC but this time may have shifted relative to your local timezone due to the time change19:03
clarkbOpenStack is making its 2023.1/Antelope release next week. That should occur on a wednesday so roughly 8 days from now19:03
fungiyeah, "festivities" will likely start around 09:00 utc19:04
fungimaybe a bit later19:04
fungirelease notes jobs in the tag pipeline will need about 8-10 hours due to serialization19:04
fungiwould love to work out a better option than that semaphore at some point19:05
clarkbits only there to prevent errors that aren't actually fatal in the docs jobs right?19:05
clarkbI mean you could just remove the semaphore and tell them to validate docs publication?19:05
clarkbor maybe I'm confusing issues and there is a more important reason to have the semaphore19:05
fungiwell, it's there to solve when someone approves release requests for several branches of the same project and they race uploads of the release notes and one regresses the others19:06
fungibecause all branches share the same tree in afs19:06
fungiso they need a per-project semaphore, which doesn't really exist (without defining a separate one for each of hundreds of repos)19:06
clarkbaha, could possibly remove the semaphore temporarily for the release since only that one branch should be getting releases on that day?19:07
fungipossible, i'll bring it up with them19:07
clarkbThe week after next the virtual PTG will be taking place19:08
clarkbAnd that was it for announcements19:09
clarkb#topic Bastion Host Changes19:09
clarkbianw: are you here? I was hoping we'd be able to decide on whether or not we are proceeding with the backups stack.19:09
clarkb#link https://review.opendev.org/q/topic:bridge-backups19:09
ianwyes :)19:09
clarkbIt looks like you may need a second reviewer? Something we should probably do in this case since we need multiple people to stash keys?19:10
clarkbAny volunteers for second reviews?19:10
ianwyes, and probably a few people to say they're on board with holding a secret for it, otherwise it's not going to work19:10
fungii can try to take a look, and am happy to safeguard a piece of the key19:11
clarkbI'm happy to stash the bits into my keepassxc db19:11
ianwok, well if fungi can take a look we can move one way or the other19:12
clarkbfungi: thanks! I think thats the next step then. Get a second review and assuming review is happy make a plan to distribute the right key bits19:12
clarkbanything else bridge related?19:12
ianwnope, not for now19:13
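
For context on the key-holding discussion above: the backup secret is divided among several roots so that no single person can restore it alone. Below is a minimal sketch of that general idea in Python, assuming a simple all-shares-required XOR split; the actual bridge-backups changes under review may well use a different scheme, and the names here are illustrative only.

# Minimal sketch of an all-shares-required split: XOR the secret with
# random pads so that no single holder can recover it alone. This is an
# assumption for illustration, not the scheme used by the reviewed changes.
import secrets
from functools import reduce

def split_secret(secret: bytes, holders: int) -> list[bytes]:
    """Split `secret` into `holders` shares; all shares are needed to rebuild."""
    pads = [secrets.token_bytes(len(secret)) for _ in range(holders - 1)]
    last = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), pads, secret)
    return pads + [last]

def join_secret(shares: list[bytes]) -> bytes:
    """Recombine all shares by XOR to recover the original secret."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)

if __name__ == "__main__":
    key = b"example backup passphrase"
    shares = split_secret(key, 3)
    assert join_secret(shares) == key
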
clarkb#topic Mailman 319:13
clarkbfungi: I haven't seen anything new here, but want to make sure I didn't miss anything19:14
corvusi'm on board for being a (partial) keymaster19:15
fungiyeah, i got very close to picking it back up today, before things started to get exciting again19:16
fungiso nothing new to share yet19:16
clarkbyes boring would be nice occasionally19:16
fungivinz clortho, keymaster of gozer19:16
clarkb#topic Gerrit Updates19:16
clarkbianw's stack of copyCondition and submit requirements changes has landed as has the manual update to All-Projects for submit requirements19:17
clarkbWe did run into some problems with the All-Projects update because 'and' and 'AND' are different in Gerrit 3.6 query expressions19:17
clarkbBut that got sorted out and I think things have been happy since (at least no new complaints since then)19:17
fungibut not in 3.7. that seems like an unfortunate choice of fix not to backport19:17
clarkbianw: from your work on this are there other ACL updates you think we need to make or are we all up to date for modern Gerrit 3.7 expectations?19:18
ianwnope, i think we're ready for the 3.7 transition from that POV now19:18
ianwi will spend a little time updating https://etherpad.opendev.org/p/gerrit-upgrade-3.7 today19:19
clarkbgreat!19:19
ianwa couple of things to check, but i think all known knowns and known unknowns are dealt with :)19:19
clarkbianw: as far as ensuring we don't slide backwards goes can we update the little checker tool to only allow function = NoBlock and require copyCondition not the old thing?19:19
clarkbI think if we do those two things it will prevent any cargo culting of old info accidentally19:20
ianwoh yes, sorry that's on my todo list.  the snag i hit was that the normalizer isn't really a linter in the way of a normal linter, but a transformer, and then if there's a diff it stops19:20
clarkbya in that case maybe just delete the lines we don't want which will produce a diff19:20
clarkband hopefully that diff is clear that we don't want those lines because they are removed (don't need to replace them with an equivalent as that would be more effort)19:20
ianwi guess the problem is that that then creates a diff that is wrong19:21
ianwi wasn't sure if the point was that you could apply the diff19:21
ianwif so, it kind of implies writing a complete function -> s-r transformer19:21
clarkbI think the idea was the diff would help people correct their changes and bonus points if you could directly apply it19:21
clarkbin this case I think it is ok if we have a diff that isn't going to completely fix changes for people and simply force an error and pull the eye to where the problem is19:22
ianwi could do something like add a comment line "# the following line is deprecated, work around it"?19:22
clarkb++19:22
ianwok, i'll do that then19:22
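
A rough illustration of the checker behaviour discussed above: flag old-style label options (function values other than NoBlock, and the copy* settings replaced by copyCondition) by inserting a comment line so the resulting diff draws the eye to them. The option names in the sample are intended to match Gerrit 3.6+ project.config syntax, but the helper itself is hypothetical and is not the normalizer used in project-config.

# Hypothetical sketch of the check ianw describes: walk project.config
# label lines and insert a comment above any deprecated option so the
# resulting diff points reviewers at the problem.
DEPRECATED_PREFIXES = (
    "copyAllScores",      # old copy* options replaced by copyCondition
    "copyMinScore",
    "copyMaxScore",
)

def annotate_deprecated(lines: list[str]) -> list[str]:
    out = []
    for line in lines:
        stripped = line.strip()
        is_old_function = (
            stripped.startswith("function =") and "NoBlock" not in stripped
        )
        is_old_copy = stripped.startswith(DEPRECATED_PREFIXES)
        if is_old_function or is_old_copy:
            out.append("# the following line is deprecated, use submit "
                       "requirements / copyCondition instead")
        out.append(line)
    return out

sample = [
    '[label "Verified"]',
    "    function = MaxWithBlock",
    "    copyAllScoresOnTrivialRebase = true",
    "    copyCondition = changekind:TRIVIAL_REBASE",
]
print("\n".join(annotate_deprecated(sample)))
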
clarkb#topic Project Renames and Gerrit Upgrade19:24
clarkbQuick check if we think we are still on track for an April 7th upgrade of Gerrit and project renames19:24
clarkbI think the only concern that has come up is the docker org deletion on the 14th19:25
clarkbmostly worried that will demand our time and we won't be able to prep for gerrit things appropriately. But it is probably too early to cancel or move the date for that. Mostly bringing it up as a risk19:25
clarkbAnd then I wanted to talk about the coordination of that. Do we want to do the renames and upgrade in one window or two separate windows? And what sorts of times are we looking at?19:26
clarkbianw: I think you were thinking of doing the Gerrit upgrade late April 6 UTC or early April 7 UTC? Then maybe fungi and I do the renames during our working hours April 7 if we do two different windows19:27
clarkbIf we do one window I can be around to do it all late April 6 early APril 7 but I think that gets more difficult for fungi19:28
ianwi guess question 1 is do we want renames or upgrade first?19:28
corvusi agree it's worth keeping an eye on, and if anyone feels overburdened, raise a flag and we can slow down or deal with it.  but from right now at least, i think we can work on both.19:28
fungii can swing it19:29
fungii just can't do the week after as i'll be offline19:29
clarkbianw: I think one reason to do renames first would be if we had previously done renames under that gerrit version. But we have never renamed anything under 3.6 so order doesn't matter much19:29
clarkbfungi: ianw  ok in that case maybe aim for ~2200-2300 UTC April 6 and do both of them?19:30
clarkband we can sort out the order another time if we're committing to a single block like that19:30
ianwok, if that's a bit late we could move it forward a few hours too19:31
fungiwfm19:31
clarkbOk with that decided (lets say 2200 UTC to make it a bit easier for fungi) should we send email about that now?19:31
clarkbfor some value of now approximately equal to soon19:32
ianw++19:32
clarkbI can do that I just want to make sure we're reasonably confident first19:32
clarkbcool I'll add that to my todo list19:32
fungithanks!19:32
ianwi am happy to drive, and we'll have checklists, so hopefully it's really just don't be drunk at that time in case the worst happens :)19:32
clarkbhaha19:32
ianwor maybe, get drunk, in case the worst happens.  either way :)19:32
clarkb#topic Old Server Upgrades19:33
clarkbMuch progress has been made with the giteas.19:33
clarkbAs of Friday we're entirely jammy for the gitea cluster in production behind the load balancer19:33
clarkbI have changes up to clean up gitea01-04 but have WIP'd them because I think the openstack release tends to be a high load scenario for the giteas and that is a good sanity check we won't need those servers before deleting them19:34
clarkbI'll basically aim to keep replicating to the gitea01-04 backends until after the openstack release and if all looks well after that clean them up19:34
fungiyeah, especially when some of the deployment projects update and all their users start pulling the new release at the same time19:35
clarkbthere are two reasons for the caution here. The first is that we've changed the flavor type for the new servers and we've seen some high cpu steal at times. But those flavors are bigger on more modern cpus so in theory will be quicker anyway so I've reduced the gitea backend count from 6 to 819:35
clarkb* 8 to 619:35
clarkbso far though those new servers have looked ok19:35
clarkbjust want to keep an eye out through the release before making the cleanup more permanent19:35
clarkbianw has also started looking at nameserver replacements19:36
clarkb#link https://etherpad.opendev.org/p/2023-opendev-dns19:36
clarkb#link https://review.opendev.org/q/topic:jammy-dns19:36
clarkbgood news the docker stuff doesn't affect dns :)19:36
ianwyep sorry got totally distracted on that, but will update all that now that we've got consensus on the names19:36
fungithanks for working on it19:36
clarkbYa this is all good progress. Still more work to do including etherpad which I had previously planned to do after the PTG19:37
clarkbits possible to get it done quickly pre ptg but the ptg relies on etherpad so much I'd kinda prefer changing things after19:37
clarkbjitsi meet as well19:37
corvusclarkb: the gitea graphs look good.  qq (i hope it's quick, if not, nevermind and we can take it offline) -- what happened between march 7-9 -- maybe we had fewer new servers and then added more?19:37
clarkbcorvus: yes, we had 4 new servers and we also got hit by a bot crawler that was acting like a 2014 samsung phone19:38
clarkbcorvus: we addressed that by updating our user agent block list to block the ancient phone and added two more servers for a total of 619:38
clarkbI thought we might get away with 4 servers instead of 8 but that incident showed that was probably too small19:39
fungiso the issue was twofold: a bad actor and fewer backends19:39
corvuscool; thanks19:39
fungiit noticeably slowed down response times for clients too19:39
fungiwhile that was going on19:39
clarkbIf I get time this week or next I'll probably try to do a server or two that the ptg doesn't interact with (mirror nodes maybe?)19:40
clarkbanyway measurable progress here. Thanks for all the help19:40
clarkb#topic AFS volume quotas and utilization19:40
clarkbLast week I bumped AFS quotas for the volumes that were very close to the limit19:41
clarkbThat avoided breaking any of those distro repo mirrors which is great. But doesn't address the ever-growing disk utilization problem19:41
clarkbalso it looks like deleting fedora 35 and adding fedora 37 resulted in a net increase of disk utilization19:41
ianwi should be able to remove 36 fairly quickly19:42
clarkbI did poke around looking for some easy wins deleting things (something that has worked well in the past) and didn't really come up with any other than: Maybe we can drop the openeuler mirror and force them to pull from upstream like we do with rocky?19:42
clarkbianw: oh thats good to know19:42
fungithere's also a debian release coming up which we'll probably need at least a temporary bump in capacity for before we can drop old-oldstable19:42
clarkbMaybe lets get that done before making any afs decisions. The other idea I had was we should maybe consider adding a new backing volume to the two dfw fileservers19:42
clarkbI don't think this is urgent as long as we are not adding new stuff (debian will force the issue when that happens)19:44
clarkbI guess start with fedora 36 cleanup then evaluate what is necessary to add new debian content19:44
fungiworth trying to find out if debian-buster images are still heavily used, or transition them off our mirroring if they are infrequently used but unlikely to get dropped from use soon19:44
fungiin order to free up room for debian-bookworm in a few months19:44
clarkbfungi: ya thats an option. Can also make buster talk to upstream if infrequently used19:45
ianwyou can't have one *volume* > 2tb right (that was pypi's issue?)19:45
clarkbbut keep the images19:45
clarkbianw: correct19:45
clarkbianw: we can add up to 12 cinder volumes each a max of 1TB (these are cloud limitations) to the lvm on the fileservers so we are wll under total afs disk potential19:45
fungiyeah, that's what i meant by transition off our mirroring19:45
clarkbbut then an individual afs volume can't be more than 2TB19:46
fungibut also the more cinder devices we attach, the more precarious the server becomes19:46
ianwi guess the only problem is if those screw up, it becomes increasingly difficult to recover19:47
ianwheh, jinx19:47
corvuswe can add more servers19:47
clarkbya and also just general risk of an outage19:47
fungiit basically multiplies the chances of the server suffering a catastrophic failure from an iscsi incident19:47
fungiright, more afs servers with different rw volumes may be more robust than adding more storage to one server19:48
corvus(doesn't affect our overall chances of being hit by an iscsi incident, but may contain the fallout and make it easier to recover)19:48
fungithe risk of *an* outage doesn't decrease, but the impact of an outage for a single device or server decreases to just the volumes served from it19:48
clarkbcorvus: does growing vicepa require services be stopped?19:49
ianwwe also add everything under vicepa -- we could use other partitions?19:49
clarkbif so that may be another good reason to use new servers19:49
clarkbianw: heh jinx. I'm not sure what the mechanics of the underlying data are like and whether or not one approach should be preferred19:50
fungialso, vos release performance may improve, since we effectively serialize those today with the assumption that otherwise we'll overwhelm the one server with the rw volumes19:50
clarkbwe've only got 10 minutes left and there are a couple of other things I wanted to discuss. Lets keep afs in mind and we can brainstorm ideas going forward but it isn't urgent today19:50
clarkbmore of a mid term thing19:51
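
For reference, a sketch of the kind of quota check behind the numbers discussed in this topic, assuming the OpenAFS fs listquota output layout; the paths are only examples and the actual opendev tooling may gather this differently (e.g. via grafana or the existing afs scripts).

# Rough sketch of spotting AFS volumes near quota by parsing `fs listquota`
# output. Mirror paths here are examples, not a complete list.
import subprocess

MIRROR_PATHS = [
    "/afs/openstack.org/mirror/ubuntu",
    "/afs/openstack.org/mirror/fedora",
]

def usage_percent(path: str) -> float:
    """Return used/quota (as a percentage) for the volume backing `path`."""
    out = subprocess.run(
        ["fs", "listquota", path], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    # Expected layout (may vary by OpenAFS version):
    # Volume Name   Quota   Used   %Used   Partition
    _name, quota, used = out[1].split()[:3]
    return 100.0 * int(used) / int(quota)

for p in MIRROR_PATHS:
    pct = usage_percent(p)
    if pct > 90:
        print(f"{p}: {pct:.0f}% of quota used, consider a bump or cleanup")
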
clarkb#topic Quo vadis Storyboard19:51
corvuswe're not using raw partitions, we're using ext filesystems, so i don't think anything needs to be stopped to grow it, but i'm not positive on that.19:51
clarkbcorvus: ack19:51
clarkbfrickler won't be able to attend this meeting today but made a good point that with the PTG coming up there may be discussions from projects about not using storyboard anymore19:51
clarkbI mentioned in #opendev that I think we should continue to encourage those groups to work together and coordinate any tooling they might produce so that we don't have duplicated efforts19:52
clarkbBut does leave open the question for what we should do. I also mentioned in #opendev that if I was a lone person making a decision I think I'd look at sunsetting storyboard since we haven't been able to effectively operate/upgrade/maintain it19:52
clarkbwith an ideal sunset involving more than 30 days notice and if we can make some sort of read-only archive that is easier to manage19:53
clarkbThat said I don't think I should make decisions like that alone so am open to feedback and other ideas19:53
clarkbI'm also happy to jump into ptg sessions that involve storyboard to try and help where I can during the ptg19:53
clarkbMaybe y'all can digest those ideas and let me know if they make sense or are terrible or have better ones :)19:54
clarkbDefinitely not something we have time for today or in this meeting. But the feedback would be helpful19:54
ianwperhaps sunsetting it would be the push someone needs to dedicate resources on it?19:54
clarkbianw: its possible19:54
clarkbI think that is unlikely but it is a theoretical outcome19:55
ianweither way something happens then, i guess19:55
clarkbok running out of time and one more item remains19:55
clarkbthis is not on the agenda but worth bringing up19:55
clarkb#topic Docker ending free team organizations19:55
fungibecause people will ask about it anyway ;)19:55
clarkbDocker is ending their free team organization setup which we use for opendevorg and zuul on docker hub19:56
clarkb(there are actually two other orgs openstackinfra and stackforge which are unused and empty)19:56
clarkbThis will affect us one way or another and we are very likely going to need to make changes19:56
clarkbIt isn't clear yet which changes we will need to make and of the options which we should take but I started an etherpad to collect info and try to make that decision making easier19:56
clarkb#link https://etherpad.opendev.org/p/MJTzrNTDMFyEUxi1ReSo19:57
clarkbI think we should continue to gather information and collect ideas there for the next day or two without trying to attribute too much value to any of them. Then once we have a good clear picture make some decisions19:57
corvusone point it would be useful to clarify is whether it's possible, and if so how, we can have an unpaid organization on quay.io to host our public images.  quay.io says that's possible, but i only see a $15/mo developer option on the pricing page, and account signup requires a phone number.19:57
clarkbIf you sign up with a phone number and get what you need I'm happy to sacrifice mine19:58
clarkbianw: ^ maybe that is something you can ask about at red hat?19:58
clarkbbasically clarify what account setup requirements are and if public open source projects need to pay for public image hosting19:58
corvus(i'd be happy to sign up to find out too, except that i wear a lot of hats, and if i only get one phone number, i don't know if i should burn it for "opendev" "zuul" or "acme gating"...)19:59
ianwi can certainly look into it -- off the top of my head i don't know anyone directly involved for instant answers but i'll see what i can find19:59
clarkbianw: thanks!19:59
corvus(or maybe it's okay to have two accounts with the same phone number.. <shrug>)19:59
clarkbAlso NeilHanlon (Rocky Linux) and Ramereth (OSUOSL) have similar issues/concerns with this and we may be able to learn from each other. They have both applied for docker's open source program which is apparently one way around this20:00
clarkbI asked them to provide us with info on how that goes just so that we've got it and can weigh that option20:00
fungior at least someone on twitter with the same name as a c-level exec at docker claimed that they won't delete teams who apply for the open source tier20:01
* fungi takes nothing for granted these days20:01
clarkbyes, they also say they won't allow names to be reused which means if/when we get our orgs deleted others shouldn't be able to impersonate us20:01
clarkbthis is important because docker clients default to dockerhub if you don't qualify the image names with a location20:02
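
To illustrate that default-registry point: an image reference without a registry host implicitly means Docker Hub, so any move to another registry means rewriting references explicitly. A small sketch of that rule follows; quay.io is used here only as an assumed example destination, not a decision from this meeting.

# Sketch of the "unqualified names mean Docker Hub" behaviour: add an
# explicit registry host to references that lack one. The quay.io default
# and the image names in the asserts are assumptions for illustration.
def qualify(image: str, registry: str = "quay.io") -> str:
    """Prefix `image` with a registry host if it does not already have one."""
    first, sep, _rest = image.partition("/")
    # The leading component counts as a registry host only if it looks like
    # one (contains a dot or port, or is "localhost") and a path follows it.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return image
    return f"{registry}/{image}"

assert qualify("opendevorg/python-builder:3.11") == "quay.io/opendevorg/python-builder:3.11"
assert qualify("docker.io/opendevorg/python-builder:3.11") == "docker.io/opendevorg/python-builder:3.11"
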
clarkband we are at time. I can smell lunch too so we'll end here :)20:02
clarkbThank you everyone!20:02
clarkb#endmeeting20:02
opendevmeetMeeting ended Tue Mar 14 20:02:53 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:02
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-03-14-19.01.html20:02
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-03-14-19.01.txt20:02
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-03-14-19.01.log.html20:02
corvusfungi: good point; technically someone on github who says they're at docker said there won't be namespace takeover issues... but they don't actually have any flair that would confirm their position...20:03
fungioof20:04
