19:01:27 #startmeeting infra
19:01:27 Meeting started Tue Apr 4 19:01:27 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:27 The meeting name has been set to 'infra'
19:01:34 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/T2HXV6JPAKXGQREUBECMYYVGEQC2PYNY/ Our Agenda
19:02:14 #topic Announcements
19:02:42 I have no announcements other than PTG is over and the openstack release has happened. We should be clear to make changes regularly
19:03:09 #topic Topics
19:03:18 #topic Docker Hub team shutdown
19:03:35 Good news everyone! They have reverted the team shutdown and will no longer be making this change
19:03:54 That means our deadline 10 days from now no longer applies, but I think we are all in agreement we should move anyway
19:04:03 #link https://review.opendev.org/q/topic:tag-deletion Changes to handle publication to registries generically.
19:04:05 they just took longer than the 24h we allocated for them to decide that ...
19:04:11 ya, took them about a week
19:04:56 This stack of changes from ianw is the current set of work around being able to move to our generic container roles, which we would point at quay
19:05:10 I need to re-review them but have had meetings all morning. I'll try to get to that soon
19:05:11 it's not surprising how quickly people will abandon a platform when it's already burned most of their good will
19:06:11 i think this gives us ultimate flexibility in the roles, which is pretty cool. we can use a promote pipeline like we have now with tags; a little bit of work and we'll have the ability to upload from the intermediate registry
19:06:26 both approaches have different trade-offs, which are documented in the changes
19:06:32 yup, and I think zuul users may choose one or the other depending on their specific needs
19:06:52 for opendev I'd like us to try the intermediate registry approach first since it relies on no registry-specific features
19:07:14 (though we'd have the weird creation step either way for new images in quay)
19:07:36 anyway, reviews on that stack are the next step, and figuring out the intermediate registry promotion process the step after that
19:07:55 Anything else to call out re docker hub?
19:08:06 yep, i personally would like to get that merged and try it with zuul-client, and then work on the promote from the intermediate registry
19:08:13 i plan on reviewing that today; fyi i'll be afk wed-fri this week.
19:08:47 sounds good, I should be able to re-review today as well
19:08:51 we can probably do similar and switch zuul-client to that as well, use it as a test. because it pushes to real repos it is a bit hard to test 100% outside that
19:09:21 i'm happy to use zuul-client as a test for both paths and can help with that
19:10:04 #topic Bastion Host Updates
19:10:26 I don't think there is really anything new here at this point? There was the launch env and rax rdns stuff, but I've got that listed under booting new servers later
19:11:16 no; i'm not sure the backup roles have really had full review, so i haven't done anything there
19:11:27 ack, we've been plenty busy with other items
19:11:34 #topic Mailman 3
19:11:40 with end of quarter, the openstack release and ptg finally in the rear view mirror i'm hoping to be able to resume work on this, but travelling for vacation all next week will likely delay things a little longer
19:11:50 no new updates though
19:11:52 thanks
19:11:58 we did have a brief related item
19:12:14 i raised a question on the mm3-users ml
19:12:35 #link https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/CYSMH4H2VC3P5JIOFJIPRJ32QKQNITJS/
19:12:58 quick summary, there's a hyperkitty bug which currently prevents list owners from deleting posts
19:13:27 if you log into hyperkitty with the admin creds from our private ansible hostvars, you can do it
19:14:02 it's not intuitive, you basically need to go from the whole thread view to the single-message view before "delete this message" shows up as a button
19:14:51 but it's possible at least
19:14:58 also the reply included a recommended alternative moderation workflow which may be worth consideration (basically moderate all users and then set them individually to unmoderated the first time they post something that isn't spam)
19:15:20 This is what mailman's own lists use, if I remember what it was like going through the initial setup there
19:15:31 it's similar to how we "patrol" edits on the wiki from new users
19:15:57 well, i say "we" but i think it's probably just me patrolling the wiki these days
19:16:13 I think we worry about that if the problem becomes more widespread
19:16:21 it's been one issue in a few months?
19:16:33 yes, with mm3 having the option to post from the webui it seems likely that we may run into it more
19:16:54 but i agree it's not critical for the moment
19:17:19 also we could bulk-set all current subscribers to unmoderated on a list if we switched moderation workflows
19:17:41 oh that is good to know
19:17:45 or i think they probably already will be, since the setting is that new subscribers are initially moderated
19:18:08 anyway, just wanted to catch folks up on that development, i didn't have anything more on the topic
19:18:17 #topic Gerrit 3.7 Upgrade and Project Renames
19:18:26 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.7
19:18:39 We plan to do a gerrit upgrade and project renames on April 6 starting at 22:00 UTC
19:18:57 There were a few items related to this I wanted to cover in the meeting to make sure we're happy with proceeding
19:19:25 First up: Do we have any strong opinions on renaming first or upgrading first? I think if we upgrade first the general process flow is easier with zuul and reindexing
19:19:42 ianw: I think your etherpad implies doing the upgrade first as well?
19:20:26 We have not done any project renames under gerrit 3.6 yet, so I don't think we need to prefer doing renames first for that reason
19:20:42 that was my thinking, seems like we can put everything in emergency and make sure manage-projects isn't running
19:20:59 also it's easier to postpone the renames if things go sideways with the upgrade
19:21:17 ianw: ok, keep in mind that the rename playbook requires things not be in emergency
19:21:41 but I think we can do the upgrade first, land the change to reflect that in configs, and have nothing in emergency, then proceed with renames and be happy
19:22:04 hrm, i'm not sure the current procedure reflects that
19:22:04 seems like upgrade then renames is the order of operations we are happy with.
19:22:51 https://docs.opendev.org/opendev/system-config/latest/gerrit.html#renaming-a-project
19:22:54 ianw: I think it does. Step 19 on line 156 removes things from emergency before renaming
19:23:16 ianw: oh! we must not use !disabled in the rename playbook
19:23:28 yeah, that's what i was thinking, but i didn't check
19:23:34 yup, that's the case, so what I said above is not correct
19:23:46 so I think we need to edit step 19 to keep things in emergency until we're completely done?
19:24:08 so there's step "18.5" before 19 which is "step to rename procedure"
19:24:40 so the idea was to do it with everything in emergency
19:24:59 sounds good
19:25:41 https://opendev.org/opendev/system-config/src/branch/master/playbooks/rename_repos.yaml
19:25:52 doesn't look at disabled, so i think the assumption is correct
19:26:19 Next up was comfort levels with landing three different rename changes after we are done, and our in-order processing of those changes. This has the potential to recreate projects under their old names in gerrit and gitea
19:26:45 On the Gerrit side of things fungi came to the realization that our jeepyb cache file should prevent this from happening. I'm still not sure what prevents it on the gitea side
19:26:59 possibly because the redirects exist for those names, gitea would fail to create them?
19:27:23 but also it's a warning to us not to invalidate jeepyb's cache during a rename ;)
19:27:26 I'm starting to lean towards squashing the rename changes together for simplicity though, then we don't have to worry about it
19:28:10 but I wanted to see if we have a preference. I think if we stick to separate changes we would need to run down what the potential gitea behavior is and double-check the jeepyb cache file has appropriate data in it for all involved projects
19:28:11 are we talking about the force-merge of the project-config changes?
19:28:14 ianw: yes
19:28:47 the other option is to leave everything in emergency until after those jobs completely finish, then let hourly/daily jobs ensure we're in sync after landing the separate changes.
19:29:36 step 7 of https://docs.opendev.org/opendev/system-config/latest/gerrit.html#renaming-a-project ?
19:29:48 So three options: 1) land as three changes and let things run normally if we understand why old project names won't be recreated 2) land as three changes but keep hosts in emergency, preventing the jobs for the first two changes from running 3) squash into a single change and then the concerns go away
19:30:13 ianw: yes, step 7
19:30:26 is it three changes or two changes?
19:30:38 ianw: basically I think we've come to a bit of a realization that if we land more than one project-config rename change, we've gotten lucky in the past that we haven't accidentally recreated projects
19:30:44 ianw: it's three now as of a couple of hours ago
19:30:58 i've accounted for virtualpdu and xstatic-angular-something
19:31:07 ovn-bgp-agent is the latest
19:31:14 oh right, ok, need to update for that
19:31:30 yeah, they just added it today
19:31:51 i tried to iterate with them on it quickly to give us time to work it in
19:32:16 we likely need to decide on a cut-off for further additions
19:32:32 which could be as soon as now, i suppose
19:32:34 (that's fine, was just a bit confused :)
19:32:34 I think that maybe in the gitea case our checks for creating new projects may see the redirect and/or trying to create a project where a redirect exists is an error. But I don't know for sure and feel like I'm running out of time to check that. Maybe we should provisionally plan to do 2) or 3) and only do 1) if we manage to run down the gitea and jeepyb cache state?
19:33:24 i was working under the assumption of 2) ... basically force-merge everything, and manage-projects only runs after all 3 are committed
19:33:27 i'm good with any of those options
19:33:34 ianw: ok, that would be option 2)
19:33:50 but i guess the point is zuul is starting manage-projects and we're relying on it not working as things are in emergency, right?
19:33:55 ianw: by default if we take things out of the emergency file manage-projects will run for each of them in order
19:34:01 yes exactly
19:34:19 but if we leave them all in emergency zuul can run the jobs and they will noop because they can't talk to the hosts.
19:34:30 I'm happy with that, particularly if you were already planning to go with that
19:35:13 yeah, i cargo-culted the checklist from the system-config docs, but it makes a bit more sense to me now. i can update the checklist to be a bit more explicit
19:35:28 sounds good
19:35:29 and i'll double-check the manage-projects playbook to make sure things won't run
19:35:53 next up is calling out that there is a third rename request. I should've brought that up first :)
19:36:06 i suppose we could hack around it by exiting from emergency mode between the penultimate deploy failing and submitting the ultimate rename change
19:36:06 heh :) i'll also add that in today
19:36:06 I think we've covered that and we can ensure all our notes and records change is updated
19:36:40 oh, since we've decided to keep the changes separate we may want to rebase them in order to address any merge conflicts
19:36:51 fungi: that could work
19:37:03 clarkb: i can do that and make sure they stack
19:37:08 thanks!
19:37:27 And the last question was whether or not the revert path for 3.7 -> 3.6 has been tested
19:37:39 I think you tested this recently based on the plugin checking you did yesterday?
19:37:53 yep, https://23.253.56.187/ is a reverted host
19:38:28 basically some manual git fiddling to revert the index metadata, and then init and reindex everything
19:38:46 as noted, that puts some initially worrying things into the logs that make you think you've got the wrong plugins
19:39:14 but we decided that what's really happening is that if you've got multiple tags in a plugin, bazel must be choosing the highest one
19:39:25 to stamp in as a version
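[For reference, a minimal sketch of what the revert test described above might look like on a held node. The site path, war file name, and the guess that the "index metadata" being reverted is the schema pointer in All-Projects are assumptions for illustration, not a verified runbook.]

    # stop gerrit, then roll the schema/index metadata back to its pre-upgrade state
    git --git-dir=/var/gerrit/git/All-Projects.git update-ref refs/meta/version <pre-upgrade-sha>
    # re-run init and a full offline reindex with the 3.6 war
    java -jar gerrit-3.6.war init -d /var/gerrit --batch --no-auto-start
    java -jar gerrit-3.6.war reindex -d /var/gerrit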
19:39:32 great, I think from a process and planning perspective this is coming together. Really just need to update our notes and records change
19:39:41 and get the changes rebased so they can merge cleanly
19:39:44 ++ will do
19:40:06 Oh, I will note I did some due diligence around the xstatic rename to ensure we aren't hijacking a project like moin did with a different xstatic package than us
19:40:15 i'll add a note on the rebase too, so that when we copy the checklist for next time we have that
19:40:32 and there was a discussion on a github issue with them that basically boiled down to: there are packages moin cares about and there are packages horizon cares about, splitting them to make that clear is happening, and this is part of that
19:40:45 all that to say I think we are good to rename the xstatic repo
19:41:35 I'll try to take a look over everything again tomorrow too
19:41:47 and I think that may be it? Thank you everyone for helping put this together. Gerrit stuff is always fun :)
19:41:58 this has been a big one!
19:42:20 #topic Upgrading Old Servers
19:42:27 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:42:38 ianw: the nameserver changes have landed for the most part
19:42:48 #link https://etherpad.opendev.org/p/2023-opendev-dns
19:43:03 I suspect docker/etc/etc have made this go slowly (I ran into similar issues)
19:43:14 But is there anything to do next on this? maybe reconvene next week after gerrit is done?
19:43:19 yeah, i haven't got back to that, sorry. i just need some clear air to think about it :)
19:43:32 understood
19:43:49 but yeah, still very high on the todo list
19:43:50 Yesterday I picked up my todo list here and launched a replacement static server and a replacement etherpad server
19:43:56 #link https://review.opendev.org/q/topic:add-static02 static and etherpad replacements
19:44:24 reviews on these changes are very much appreciated. I think everything should be good to go except for reverse dns records for these two servers. Neither does email, so that isn't critical, and fixing it is in progress
19:44:30 #link https://review.opendev.org/c/opendev/system-config/+/879388 Fix rax rdns
19:45:08 The updated scripting to do rdns automatically was tied into launch node and this change fixes that. I think once this change lands we can run the rdns command for the new servers to have it update the records. Worst case I can do it through the web ui
19:45:34 I also discovered in this process that the launch env openstack client could not list rax volumes (though it could attach them)
19:45:41 #link https://review.opendev.org/c/opendev/system-config/+/879387 Reinstall launch env periodically
19:46:02 this change should address that, as I think we decided that we didn't reinstall the env after we fixed the dep versions for listing volumes in rax?
19:46:13 fungi: ianw: ^ I'm still behind on that one if you have anything to add
19:46:31 i think that's it; fungi did confirm a fresh venv seemed to work
19:46:42 the other thing we can do is just rm it and start it again
19:46:50 i think we can just move /user/launcher-venv aside and let deploy recreate it in place as a test
19:46:59 or just merge the change
19:47:21 I need to review the change before I decide on that. But I suspect just merging the change is fine since we don't rely on this launch env for regular processing
19:47:52 i'm happy to monitor it and make sure it's doing what i think it's going to do :)
19:48:03 And ya, reviews on the new static server in particular would be good. I stacked things due to the dns updates and etherpad is more involved
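[As a rough illustration of the "move the venv aside and let deploy recreate it" test mentioned above; the venv path is the one quoted in the discussion and the cloud name is a placeholder, so treat this as a sketch rather than exact commands.]

    # on the bastion: stash the existing launch env, then let the deploy job rebuild it
    mv /user/launcher-venv /user/launcher-venv.old
    # once recreated, confirm the previously broken volume listing now works
    /user/launcher-venv/bin/openstack --os-cloud <rax-cloud> volume list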
19:48:36 #topic AFS volume quotas
19:48:38 lgtm, and i'll review the etherpad group stuff after a tea :)
19:48:46 thanks!
19:49:04 I don't have much new here other than to point out that the utilization is slowly climbing up over time :)
19:49:22 some of that may be all of the openstack release artifacts too, so not just mirror volumes
19:49:33 but seems like we might free up some space with the wheel cache cleanup too
19:49:36 But let's keep an eye on it so that we can intervene before it becomes an emergency
19:49:38 ++
19:50:02 #topic Gitea 1.19
19:50:06 moving along now as we are running out of time
19:50:11 #link https://review.opendev.org/c/opendev/system-config/+/877541 Upgrade opendev.org to 1.19.0
19:50:57 There is no 1.19.1 yet last I checked, and in the past we have waited for the next bugfix release to land before upgrading gitea. I'm not in a hurry for this reason. But I think if we review the change now it is unlikely there will be a big delta when .1 arrives. Also, if we like, we can update to 1.19.0 before .1 releases
19:51:14 maybe first thing after the gerrit upgrade? just in case?
19:51:15 The major update in this release is gitea actions, which is experimental and which we disable
19:51:19 ++
19:52:06 #topic Quo vadis Storyboard
19:52:34 As predicted, a number of projects decided to move off of storyboard at the PTG. We may be asked to mark the projects read-only in storyboard once they have moved
19:53:17 It does feel like the individual projects could do a bit more to work together to create a more consistent process, but I don't think we can force them to do that or mandate anything. Just continue to encourage them to talk to one another I guess
19:53:29 i have a stack of projects i need to switch to inactive and update descriptions on now
19:53:46 just haven't gotten to that yet
19:54:06 #topic Open Discussion
19:54:18 I actually meant to put the wheel mirror stuff on the agenda then spaced it.
19:54:29 But I think we can talk about that now if there was anything important to bring up that isn't already on the mailing list
19:54:44 yeah, i'm still trying to come up with a concrete plan
19:55:00 your preliminary exploration seems promising though
19:55:02 but i think the first thing we should probably action is double-checking the audit tool is right, and cleaning up
19:55:06 i need to find time to reply on the ml thread
19:55:08 at the very least I think the cleanup can probably be run before any long-term plan is committed to
19:55:14 ianw: ++
19:55:25 #link https://review.opendev.org/c/opendev/system-config/+/879239
19:55:32 that has some links to the output, etc.
19:55:40 yeah, i feel like we know what's safe to delete now, even if we haven't settled on what's safe to stop building yet
19:56:07 also, this is one of those times where vos backup may come in handy
19:56:19 right, we can make sure we're not putting back anything we don't want (i.e. the prune is working)
19:56:42 regarding afs, is there a way we could increase capacity further by adding more volumes or servers? or are we facing a hard limit there? just to know possible options
19:56:44 and evaluate what clarkb was talking about, in that we don't really need to carry .whl's that don't build against libraries
19:57:01 frickler: yes, we can add up to 14 cinder volumes of 1TB each to each server
19:57:17 for some stuff it's pretty straightforward. if we decide to delete all pure python wheels on the premise that they're trivially rebuildable from sdist in jobs, we might want to vos backup so we can quickly switch the volume state back if our assumptions turn out to be incorrect
19:57:22 frickler: then we add them to the vicepa pool or add a vicepb or something. I would have to look at the existing server setup to see how the existing 3TB is organized
19:57:31 and no known limit to the number of servers
19:57:44 it's all in vicepa
19:58:18 yeah, we discussed previously that adding more servers probably helps distribute the points of failure a bit, vs the situation we ended up in with the static file server that had 14 attached volumes and fell over if you sneezed
19:58:36 i think we can definitely grow if we need to, but "more storage == more problems" :)
19:58:41 heh, yeah
19:58:57 yup, and we do have three servers, only one of which is under the massive pressure. We might want to double-check if we can drop content from that server or rebalance somehow
19:59:09 (though I suspect with only three servers the number of ways to organize things is small)
19:59:34 we definitely have options though, which is a good thing
19:59:41 i would want to add a server if we're rebalancing, because we probably need to be able to continue functioning when one dies
20:00:14 and we are officially at time
20:00:19 Thank you everyone!
20:00:25 anyway, i'm happy to run the wheel cleanup, but yeah, after someone else has looked at it :)
20:00:29 #endmeeting
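[For the record, a small sketch of the "vos backup before pruning" idea raised in open discussion; the volume name and AFS path below are illustrative examples, not our real volume layout.]

    # create/refresh the .backup clone of the RW volume before running the wheel cleanup,
    # so the pre-prune state can be recovered quickly if the assumptions prove wrong
    vos backup mirror.wheel.example
    # afterwards, spot-check usage against quota from any afs client
    fs listquota /afs/openstack.org/mirror/wheel/example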