19:01:27 #startmeeting infra
19:01:27 Meeting started Tue Apr 4 19:01:27 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:27 The meeting name has been set to 'infra'
19:01:34 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/T2HXV6JPAKXGQREUBECMYYVGEQC2PYNY/ Our Agenda
19:02:14 #topic Announcements
19:02:42 I have no announcements other than PTG is over and the openstack release has happened. We should be clear to make changes regularly
19:03:09 #topic Topics
19:03:18 #topic Docker Hub team shutdown
19:03:35 Good news everyone! They have reverted the team shutdown and will no longer be making this change
19:03:54 That means our deadline 10 days from now no longer applies, but I think we are all in agreement we should move anyway
19:04:03 #link https://review.opendev.org/q/topic:tag-deletion Changes to handle publication to registries generically.
19:04:05 they just took longer than the 24h we allocated for them to decide that ...
19:04:11 ya, took them about a week
19:04:56 This stack of changes from ianw is the current set of work around being able to move to our generic container roles, which we would point at quay
19:05:10 I need to re-review them but have had meetings all morning. I'll try to get to that soon
19:05:11 it's not surprising how quickly people will abandon a platform when it's already burned most of their good will
19:06:11 i think this gives us ultimate flexibility in the roles, which is pretty cool. we can use a promote pipeline like we have now with tags; a little bit of work and we'll have the ability to upload from the intermediate registry
19:06:26 both approaches have different trade-offs, which are documented in the changes
19:06:32 yup, and I think zuul users may choose one or the other depending on their specific needs
19:06:52 for opendev I'd like us to try the intermediate registry approach first since it relies on no registry-specific features
19:07:14 (though we'd have the weird creation step either way for new images in quay)
19:07:36 anyway, reviews on that stack are the next step, and figuring out the intermediate registry promotion process the step after that
19:07:55 Anything else to call out re docker hub?
19:08:06 yep, i personally would like to get that merged and try it with zuul-client, and then work on the promote from the intermediate registry
19:08:13 i plan on reviewing that today; fyi i'll be afk wed-fri this week.
19:08:47 sounds good, I should be able to re-review today as well
19:08:51 we can probably do similar and switch zuul-client to that as well, use it as a test. because it pushes to real repos it is a bit hard to test 100% outside that
19:09:21 i'm happy to use zuul-client as a test for both paths and can help with that
19:10:04 #topic Bastion Host Updates
19:10:26 I don't think there is really anything new here at this point? There was the launch env and rax rdns stuff, but I've got that listed under booting new servers later
19:11:16 no; i'm not sure the backup roles have really had full review, so i haven't done anything there
19:11:27 ack, we've been plenty busy with other items
19:11:34 #topic Mailman 3
19:11:40 with end of quarter, the openstack release and ptg finally in the rear view mirror i'm hoping to be able to resume work on this, but travelling for vacation all next week will likely delay things a little longer
19:11:50 no new updates though
19:11:52 thanks
19:11:58 we did have a brief related item
19:12:14 i raised a question on the mm3-users ml
19:12:35 #link https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/CYSMH4H2VC3P5JIOFJIPRJ32QKQNITJS/
19:12:58 quick summary, there's a hyperkitty bug which currently prevents list owners from deleting posts
19:13:27 if you log into hyperkitty with the admin creds from our private ansible hostvars, you can do it
19:14:02 it's not intuitive, you basically need to go from the whole thread view to the single-message view before "delete this message" shows up as a button
19:14:51 but it's possible at least
19:14:58 also the reply included a recommended alternative moderation workflow which may be worth consideration (basically moderate all users and then set them individually to unmoderated the first time they post something that isn't spam)
19:15:20 This is what mailman's own lists use, if I remember what it was like going through the initial setup there
19:15:31 it's similar to how we "patrol" edits on the wiki from new users
19:15:57 well, i say "we" but i think it's probably just me patrolling the wiki these days
19:16:13 I think we worry about that if the problem becomes more widespread
19:16:21 it's been one issue in a few months?
19:16:33 yes, with mm3 having the option to post from the webui it seems likely that we may run into it more
19:16:54 but i agree it's not critical for the moment
19:17:19 also we could bulk-set all current subscribers to unmoderated on a list if we switched moderation workflows
19:17:41 oh that is good to know
19:17:45 or i think they probably already will be, since the setting is that new subscribers are initially moderated
19:18:08 anyway, just wanted to catch folks up on that development, i didn't have anything more on the topic
19:18:17 #topic Gerrit 3.7 Upgrade and Project Renames
19:18:26 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.7
19:18:39 We plan to do a gerrit upgrade and project renames on April 6 starting at 22:00 UTC
19:18:57 There were a few items related to this I wanted to cover in the meeting to make sure we're happy with proceeding
19:19:25 First up: Do we have any strong opinions on renaming first or upgrading first? I think if we upgrade first the general process flow is easier with zuul and reindexing
19:19:42 ianw: I think your etherpad implies doing the upgrade first as well?
19:20:26 We have not done any project renames under gerrit 3.6 yet, so I don't think we need to prefer doing renames first for that reason
19:20:42 that was my thinking, seems like we can put everything in emergency and make sure manage-projects isn't running
19:20:59 also it's easier to postpone the renames if things go sideways with the upgrade
19:21:17 ianw: ok, keep in mind that the rename playbook requires things not be in emergency
19:21:41 but I think we can do the upgrade first, land the change to reflect that in configs, and have nothing in emergency, then proceed with renames and be happy
19:22:04 hrm, i'm not sure the current procedure reflects that
19:22:04 seems like upgrade then renames is the order of operations we are happy with.
19:22:51 https://docs.opendev.org/opendev/system-config/latest/gerrit.html#renaming-a-project
19:22:54 ianw: I think it does. Step 19 on line 156 removes things from emergency before renaming
19:23:16 ianw: oh! we must not use !disabled in the rename playbook
19:23:28 yeah, that's what i was thinking, but i didn't check
19:23:34 yup, that's the case, so what I said above is not correct
19:23:46 so I think we need to edit step 19 to keep things in emergency until we're completely done?
19:24:08 so there's step "18.5" before 19 which is "step to rename procedure"
19:24:40 so the idea was to do it with everything in emergency
19:24:59 sounds good
19:25:41 https://opendev.org/opendev/system-config/src/branch/master/playbooks/rename_repos.yaml
19:25:52 doesn't look at disabled, so i think the assumption is correct
19:26:19 Next up was comfort levels with landing three different rename changes after we are done, and our in-order processing of those changes. This has the potential to recreate projects under their old names in gerrit and gitea
19:26:45 On the Gerrit side of things fungi came to the realization that our jeepyb cache file should prevent this from happening. I'm still not sure what prevents it on the gitea side
19:26:59 possibly because the redirects exist for those names, gitea would fail to create them?
19:27:23 but also it's a warning to us not to invalidate jeepyb's cache during a rename ;)
19:27:26 I'm starting to lean towards squashing the rename changes together for simplicity though, then we don't have to worry about it
19:28:10 but I wanted to see if we have a preference. I think if we stick to separate changes we would need to run down what the potential gitea behavior is and double-check the jeepyb cache file has appropriate data in it for all involved projects
19:28:11 are we talking about the force-merge of the project-config changes?
19:28:14 ianw: yes
19:28:47 the other option is to leave everything in emergency until after those jobs completely finish, then let hourly/daily jobs ensure we're in sync after landing the separate changes.
19:29:36 step 7 of https://docs.opendev.org/opendev/system-config/latest/gerrit.html#renaming-a-project ?
19:29:48 So three options: 1) land as three changes and let things run normally if we understand why old project names won't be recreated 2) land as three changes but keep hosts in emergency, preventing the jobs for the first two changes from running 3) squash into a single change and then the concerns go away
19:30:13 ianw: yes, step 7
19:30:26 is it three changes or two changes?
19:30:38 ianw: basically I think we've come to a bit of a realization that if we land more than one project-config rename change, we've gotten lucky in the past that we haven't accidentally recreated projects
19:30:44 ianw: it's three now as of a couple of hours ago
19:30:58 i've accounted for virtualpdu and xstatic-angular-something
19:31:07 ovn-bgp-agent is the latest
19:31:14 oh right, ok, need to update for that
19:31:30 yeah, they just added it today
19:31:51 i tried to iterate with them on it quickly to give us time to work it in
19:32:16 we likely need to decide on a cut-off for further additions
19:32:32 which could be as soon as now, i suppose
19:32:34 (that's fine, was just a bit confused :)
19:32:34 I think that maybe in the gitea case our checks for creating new projects may see the redirect and/or trying to create a project where a redirect exists is an error. But I don't know for sure and feel like I'm running out of time to check that. Maybe we should provisionally plan to do 2) or 3) and only do 1) if we manage to run down the gitea and jeepyb cache state?
19:33:24 i was working under the assumption of 2) ... basically force-merge everything, and manage-projects only runs after all 3 are committed
19:33:27 i'm good with any of those options
19:33:34 ianw: ok, that would be option 2)
19:33:50 but i guess the point is zuul is starting manage-projects and we're relying on it not working as things are in emergency, right?
19:33:55 ianw: by default if we take things out of the emergency file manage-projects will run for each of them in order
19:34:01 yes exactly
19:34:19 but if we leave them all in emergency zuul can run the jobs and they will noop because they can't talk to the hosts.
19:34:30 I'm happy with that, particularly if you were already planning to go with that
19:35:13 yeah, i cargo-culted the checklist from the system-config docs, but it makes a bit more sense to me now. i can update the checklist to be a bit more explicit
19:35:28 sounds good
19:35:29 and i'll double-check the manage-projects playbook to make sure things won't run
19:35:53 next up is calling out that there is a third rename request. I should've brought that up first :)
19:36:06 i suppose we could hack around it by exiting from emergency mode between the penultimate deploy failing and submitting the ultimate rename change
19:36:06 heh :) i'll also add that in today
19:36:06 I think we've covered that and we can ensure all our notes and records change is updated
19:36:40 oh, since we've decided to keep the changes separate we may want to rebase them in order to address any merge conflicts
19:36:51 fungi: that could work
19:37:03 clarkb: i can do that and make sure they stack
19:37:08 thanks!
19:37:27 And the last question was whether or not the revert path for 3.7 -> 3.6 has been tested
19:37:39 I think you tested this recently based on the plugin checking you did yesterday?
19:37:53 yep, https://23.253.56.187/ is a reverted host
19:38:28 basically some manual git fiddling to revert the index metadata, and then init and reindex everything
19:38:46 as noted, that puts some initially worrying things into the logs that make you think you've got the wrong plugins
19:39:14 but we decided that what's really happening is that if you've got multiple tags in a plugin, bazel must be choosing the highest one
19:39:25 to stamp in as a version
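[For reference, a minimal sketch of what the revert test described above might look like on a held node. The site path, war file name, and the guess that the "index metadata" being reverted is the schema pointer in All-Projects are assumptions for illustration, not a verified runbook.]

    # stop gerrit, then roll the schema/index metadata back to its pre-upgrade state
    git --git-dir=/var/gerrit/git/All-Projects.git update-ref refs/meta/version <pre-upgrade-sha>
    # re-run init and a full offline reindex with the 3.6 war
    java -jar gerrit-3.6.war init -d /var/gerrit --batch --no-auto-start
    java -jar gerrit-3.6.war reindex -d /var/gerrit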
19:39:32 great, I think from a process and planning perspective this is coming together. Really just need to update our notes and records change
19:39:41 and get the changes rebased so they can merge cleanly
19:39:44 ++ will do
19:40:06 Oh, I will note I did some due diligence around the xstatic rename to ensure we aren't hijacking a project like moin did with a different xstatic package than us
19:40:15 i'll add a note on the rebase too, so that when we copy the checklist for next time we have that
19:40:32 and there was a discussion on a github issue with them that basically boiled down to: there are packages moin cares about and there are packages horizon cares about, splitting them to make that clear is happening, and this is part of that
19:40:45 all that to say I think we are good to rename the xstatic repo
19:41:35 I'll try to take a look over everything again tomorrow too
19:41:47 and I think that may be it? Thank you everyone for helping put this together. Gerrit stuff is always fun :)
19:41:58 this has been a big one!
19:42:20 #topic Upgrading Old Servers
19:42:27 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:42:38 ianw: the nameserver changes have landed for the most part
19:42:48 #link https://etherpad.opendev.org/p/2023-opendev-dns
19:43:03 I suspect docker/etc/etc have made this go slowly (I ran into similar issues)
19:43:14 But is there anything to do next on this? maybe reconvene next week after gerrit is done?
19:43:19 yeah, i haven't got back to that, sorry. i just need some clear air to think about it :)
19:43:32 understood
19:43:49 but yeah, still very high on the todo list
19:43:50 Yesterday I picked up my todo list here and launched a replacement static server and a replacement etherpad server
19:43:56 #link https://review.opendev.org/q/topic:add-static02 static and etherpad replacements
19:44:24 reviews on these changes are very much appreciated. I think everything should be good to go except for reverse dns records for these two servers. Neither does email, so that isn't critical, and fixing it is in progress
19:44:30 #link https://review.opendev.org/c/opendev/system-config/+/879388 Fix rax rdns
19:45:08 The updated scripting to do rdns automatically was tied into launch node and this change fixes that. I think once this change lands we can run the rdns command for the new servers to have it update the records. Worst case I can do it through the web ui
19:45:34 I also discovered in this process that the launch env openstack client could not list rax volumes (though it could attach them)
19:45:41 #link https://review.opendev.org/c/opendev/system-config/+/879387 Reinstall launch env periodically
19:46:02 this change should address that, as I think we decided that we didn't reinstall the env after we fixed the dep versions for listing volumes in rax?
19:46:13 fungi: ianw: ^ I'm still behind on that one if you have anything to add
19:46:31 i think that's it; fungi did confirm a fresh venv seemed to work
19:46:42 the other thing we can do is just rm it and start it again
19:46:50 i think we can just move /user/launcher-venv aside and let deploy recreate it in place as a test
19:46:59 or just merge the change
19:47:21 I need to review the change before I decide on that. But I suspect just merging the change is fine since we don't rely on this launch env for regular processing
19:47:52 i'm happy to monitor it and make sure it's doing what i think it's going to do :)
19:48:03 And ya, reviews on the new static server in particular would be good. I stacked things due to the dns updates and etherpad is more involved
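[As a rough illustration of the "move the venv aside and let deploy recreate it" test mentioned above; the venv path is the one quoted in the discussion and the cloud name is a placeholder, so treat this as a sketch rather than exact commands.]

    # on the bastion: stash the existing launch env, then let the deploy job rebuild it
    mv /user/launcher-venv /user/launcher-venv.old
    # once recreated, confirm the previously broken volume listing now works
    /user/launcher-venv/bin/openstack --os-cloud <rax-cloud> volume list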
19:48:36 #topic AFS volume quotas
19:48:38 lgtm, and i'll review the etherpad group stuff after a tea :)
19:48:46 thanks!
19:49:04 I don't have much new here other than to point out that the utilization is slowly climbing up over time :)
19:49:22 some of that may be all of the openstack release artifacts too, so not just mirror volumes
19:49:33 but seems like we might free up some space with the wheel cache cleanup too
19:49:36 But let's keep an eye on it so that we can intervene before it becomes an emergency
19:49:38 ++
19:50:02 #topic Gitea 1.19
19:50:06 moving along now as we are running out of time
19:50:11 #link https://review.opendev.org/c/opendev/system-config/+/877541 Upgrade opendev.org to 1.19.0
19:50:57 There is no 1.19.1 yet last I checked, and in the past we have waited for the next bugfix release to land before upgrading gitea. I'm not in a hurry for this reason. But I think if we review the change now it is unlikely there will be a big delta when .1 arrives. Also, if we like, we can update to 1.19.0 before .1 releases
19:51:14 maybe first thing after the gerrit upgrade? just in case?
19:51:15 The major update in this release is gitea actions, which is experimental and which we disable
19:51:19 ++
19:52:06 #topic Quo vadis Storyboard
19:52:34 As predicted, a number of projects decided to move off of storyboard at the PTG. We may be asked to mark the projects read-only in storyboard once they have moved
19:53:17 It does feel like the individual projects could do a bit more to work together to create a more consistent process, but I don't think we can force them to do that or mandate anything. Just continue to encourage them to talk to one another I guess
19:53:29 i have a stack of projects i need to switch to inactive and update descriptions on now
19:53:46 just haven't gotten to that yet
19:54:06 #topic Open Discussion
19:54:18 I actually meant to put the wheel mirror stuff on the agenda then spaced it.
19:54:29 But I think we can talk about that now if there was anything important to bring up that isn't already on the mailing list
19:54:44 yeah, i'm still trying to come up with a concrete plan
19:55:00 your preliminary exploration seems promising though
19:55:02 but i think the first thing we should probably action is double-checking the audit tool is right, and cleaning up
19:55:06 i need to find time to reply on the ml thread
19:55:08 at the very least I think the cleanup can probably be run before any long-term plan is committed to
19:55:14 ianw: ++
19:55:25 #link https://review.opendev.org/c/opendev/system-config/+/879239
19:55:32 that has some links to the output, etc.
19:55:40 yeah, i feel like we know what's safe to delete now, even if we haven't settled on what's safe to stop building yet
19:56:07 also, this is one of those times where vos backup may come in handy
19:56:19 right, we can make sure we're not putting back anything we don't want (i.e. the prune is working)
19:56:42 regarding afs, is there a way we could increase capacity further by adding more volumes or servers? or are we facing a hard limit there? just to know possible options
19:56:44 and evaluate what clarkb was talking about, in that we don't really need to carry .whl's that don't build against libraries
19:57:01 frickler: yes, we can add up to 14 cinder volumes of 1TB each to each server
19:57:17 for some stuff it's pretty straightforward. if we decide to delete all pure python wheels on the premise that they're trivially rebuildable from sdist in jobs, we might want to vos backup so we can quickly switch the volume state back if our assumptions turn out to be incorrect
19:57:22 frickler: then we add them to the vicepa pool or add a vicepb or something. I would have to look at the existing server setup to see how the existing 3TB is organized
19:57:31 and no known limit to the number of servers
19:57:44 it's all in vicepa
19:58:18 yeah, we discussed previously that adding more servers probably helps distribute the points of failure a bit, vs the situation we ended up in with the static file server that had 14 attached volumes and fell over if you sneezed
19:58:36 i think we can definitely grow if we need to, but "more storage == more problems" :)
19:58:41 heh, yeah
19:58:57 yup, and we do have three servers, only one of which is under the massive pressure. We might want to double-check if we can drop content from that server or rebalance somehow
19:59:09 (though I suspect with only three servers the number of ways to organize things is small)
19:59:34 we definitely have options though, which is a good thing
19:59:41 i would want to add a server if we're rebalancing, because we probably need to be able to continue functioning when one dies
20:00:14 and we are officially at time
20:00:19 Thank you everyone!
20:00:25 anyway, i'm happy to run the wheel cleanup, but yeah, after someone else has looked at it :)
20:00:29 #endmeeting
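[For the record, a small sketch of the "vos backup before pruning" idea raised in open discussion; the volume name and AFS path below are illustrative examples, not our real volume layout.]

    # create/refresh the .backup clone of the RW volume before running the wheel cleanup,
    # so the pre-prune state can be recovered quickly if the assumptions prove wrong
    vos backup mirror.wheel.example
    # afterwards, spot-check usage against quota from any afs client
    fs listquota /afs/openstack.org/mirror/wheel/example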