19:01:02 #startmeeting infra
19:01:02 Meeting started Tue Apr 25 19:01:02 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:02 The meeting name has been set to 'infra'
19:01:10 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/UT4QRPDF3HXQNR3PTIJZR5MIP5WPOTAW/ Our Agenda
19:01:17 #topic Announcements
19:01:23 I don't have any announcements this week
19:01:41 #topic Topics
19:01:43 we can dive right in
19:01:52 #topic Migrating to Quay.io
19:02:12 A fair bit has happened on this since we last met
19:02:27 In the Zuul world changes are up to convert all of the zuul image publications to quay.io instead of docker hub
19:02:45 I think they all pass CI now too so Zuul should move soon and we'll want to update all our docker-compose.yaml files to match when that happens
19:03:37 Piggybacking off of the hard work done in Zuul I've started doing some of the work for opendev as well. In particular I have copied our images from docker hub to quay.io to pull in old content. Most of our images only tag latest so for most things this isn't super necessary, but other images (like Gerrit) do tag versions.
19:03:48 technically the zuul-operator doesn't pass, but i think the current issue is bitrot and there's only a small chance there's a container image issue lurking there.
19:04:10 I did skip four images that I don't plan to transfer over: opendevorg/bazel, opendevorg/grafana, opendevorg/jitsi-meet-prosody, and opendevorg/jitsi-meet-web, since none of these are images we use today
19:04:27 #link https://review.opendev.org/c/opendev/system-config/+/881285 WIP change to convert an image to quay.io using new jobs.
19:04:40 sounds good to me, thanks for working through that
19:04:46 This change is WIP and open for feedback. corvus already had some feedback which I need to address
19:05:25 corvus: did you see my message in #opendev earlier today? I'm tempted to update ensure-quay-repo to check for a registry-type flag on each image in container_images and if it is set to quay then the role would act on it, otherwise it would skip. Then we can run the role in the opendev base job
19:05:56 the major downside to this is that opendev quay users would need to include the api_token. I suppose I can skip if that isn't defined either
19:06:30 clarkb: ah missed that message, but i like the idea. also, checking for the api_token is probably a good idea; i don't think we want to *require* the role :)
19:06:45 ok, I can work on that and that will result in changes to 881285 (and a new change)
19:06:55 I did test ensure-quay-repo this morning and it worked for me locally outside of zuul
19:07:34 The other thing to keep in mind for opendev image updates is that we should resync the images from docker hub to quay.io just before we update the zuul jobs to build the image. That way we have the latest and greatest on both sides before the publication method changes
19:08:06 And I don't think we should try to do all images at once. It will be too much churn and potential broken stuff to debug. We should go an image or two at a time and ensure that we're deploying the image from its new home afterwards
19:08:38 ++
19:08:51 I'll try to pick this back up again later today. I think it is realistic that at least some of our images are published to quay.io automatically before our next meeting
19:09:15 any other questions/concerns/comments on this effort?
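The ensure-quay-repo idea discussed above amounts to a per-image filter: act only on container_images entries flagged for quay, and do nothing at all when no API token is supplied, so the role can safely run in the opendev base job. The role itself is Ansible; the Python below is only a minimal sketch of that filtering logic, and the field names (registry_type, the api_token key) are assumptions rather than the role's actual variables.

```python
# Minimal sketch of the per-image filtering discussed above.
# The real ensure-quay-repo role is Ansible; this only illustrates the
# decision logic. Field names here are assumptions, not the role's API.

def images_to_manage(container_images, api_token=None):
    """Return only the images the quay role should act on."""
    if not api_token:
        # No credentials supplied: the role should be a no-op so it can
        # safely run in a base job that not every project opts into.
        return []
    return [
        image for image in container_images
        if image.get("registry_type") == "quay"
    ]


if __name__ == "__main__":
    images = [
        {"repository": "opendevorg/gerrit", "registry_type": "quay"},
        {"repository": "opendevorg/lodgeit"},  # no flag: skipped
    ]
    print(images_to_manage(images, api_token="example-token"))
```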
19:10:42 #topic Bastion Host Updates
19:11:08 I think the backups changes still need reviews? Is any other infra root willing/able to look at those? I think there is value here but also getting it wrong is potentially dangerous so review is valuable
19:11:38 yeah it only makes sense if we have enough people who want to hold bits of it
19:12:04 i can, keep meaning to
19:12:11 thanks.
19:12:21 what's the review topic again?
19:12:31 #link https://review.opendev.org/q/topic:bridge-backups
19:12:39 thanks
19:13:03 Anything else bastion related?
19:13:54 not from me
19:14:07 #topic Mailman 3
19:14:17 fungi: any progress with the held node?
19:14:42 sorry, nothing to report yet. i've started playing with a new held node, but spent the past week catching up on urgent things that piled up while i was on vacation
19:15:11 #topic Gerrit Updates
19:15:32 before more important Gerrit items I want to note we are still running the most recent proper release :) but gerrit 3.8 seems imminent
19:15:42 #link https://review.opendev.org/c/openstack/project-config/+/879906 Gerrit also normalized indentation of config files; we should consider this to be in sync
19:16:12 fungi: ^ is a WIP change. We had talked about modifying the normalizer to insert a comment to explain the format to users. Is that the main missing piece before we unwip this change?
19:16:43 yeah, i guess so
19:17:01 i did propose something for that
19:17:30 #link https://review.opendev.org/c/openstack/project-config/+/880898/ better document gerrit acl config format
19:17:30 #link https://review.opendev.org/c/openstack/project-config/+/880898
19:17:49 also can't recall if i mentioned it already, but i saw something recently in the gerrit upstream matrix channel about changes to how links are done with ui plugins in 3.8, so we'll need to make sure we test our gitea integration
19:17:53 that i think helps generally. if something is wrong, it should give you a clue what it is without having to read the source
19:18:05 looks like the stack needs a rebase too?
19:18:16 oh something might have merged, i can rebase
19:18:27 thanks, i think we can remove the wip from the parent change at that point too
19:18:38 then it is just a matter of reviewing and applying the changes
19:18:49 it'll need a rebase for every single change that touches or adds an acl, so we probably need to just make a decision to merge, then rebase, then review while other changes are on hold
19:19:03 fungi: yes, that should be testable from our test instances since the links will send you to valid opendev.org links
19:19:09 fungi: something worth checking when we do upgrade testing though
19:20:02 fungi: you're also still wip on it
19:20:17 oh sorry, clarkb mentioned that
19:20:26 The other Gerrit thing to call out is cleaning up 3.6 images and updating our upgrade job
19:20:40 ianw: yeah, i had proposed it as a straw man seeking folks to convince me the benefits outweighed the negatives
19:20:46 I put this on the list of things that new contributors could do, but I think we should probably tackle this sooner than later.
I can probably poke at this later this week
19:21:27 i guess i'm convinced enough to un-wip since others see that balance tipping toward benefit
19:21:30 ++ i might have some time too
19:21:52 fungi: ya having cleaner diffs in the future is my motivation and that seems worthwhile to me
19:22:16 The last Gerrit item I have is the replication plugin leaking all over the review_site/data dir
19:22:31 #link https://review.opendev.org/c/opendev/system-config/+/880672 Dealing with leaked replication tasks on disk
19:22:47 This is where I started digging into the problem and while that will help a bit it won't fully solve the issue
19:23:17 I ended up reading a fair bit of the plugin code and that resulted in two upstream bugs
19:23:23 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16867
19:24:14 This first issue is related to trying to replicate All-Projects and All-Users. I'm pretty sure that what happens here is the replication plugin checks if it should replicate those repos after creating the waiting/ file on disk, but then short circuits when it shouldn't replicate without sending the finished event which cleans things up
19:24:21 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16868
19:25:47 This second one is related to the plugin filtering out refs that it shouldn't replicate on changes it would otherwise replicate. For example you update a change via the web ui; this updates three refs. Two of the three should be replicated but not the third. I think this creates a new file on disk that gets properly managed and deleted, orphaning the original with all three refs
19:26:22 I think this situation is also related to the massive number of errors we get on startup. In those cases I suspect that we are actually filtering all refs for replication and it confuses the state machine
19:26:46 My workaround change addresses the first issue and some of the second issue.
19:27:07 I think it is still worth landing to reduce mental overhead but it likely won't completely solve the noise problem on startup.
19:27:30 I am hoping someone upstream can chime in on whether or not my ideas for fixing these problems are sound but I haven't seen any responses yet
19:28:02 And after doing all that I'm reasonably sure that deleting the files is completely fine and that the only harm here is the leaking on disk
19:28:11 i'm guessing there's not a ton of activity on that plugin ... you might be the expert now :)
19:28:52 ya if I can find some time to set up a test instance (I'll likely hold one launched by zuul) and recompile etc I will try to fix it
19:29:02 just lots of other stuff going on now that I'm really certain this is almost harmless :)
19:29:05 and also the people who do have an interest in it are mostly interested in privileged gerrit-to-gerrit replication
19:29:28 fungi: yup that was something I realized in debugging this. Seems like very few people use the plugin with the restrictions we have
19:29:44 That was all I had for Gerrit. Anything else?
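For readers following the first replication bug above, the suspected sequence is: the plugin persists its waiting/ marker before deciding whether the repo should be replicated at all, and the early exit skips the "finished" event that would normally clean the marker up. The plugin is Java; this Python sketch only illustrates that suspected ordering, and the file names and directory layout are made up for illustration.

```python
# Illustration of the suspected ordering behind issue 16867, not the
# plugin's actual (Java) code. Names and paths here are made up.
from pathlib import Path
import uuid

WAITING_DIR = Path("review_site/data/replication-waiting")


def replicate(project):
    pass  # stand-in for the actual push to the replication target


def schedule_replication(project, should_replicate):
    WAITING_DIR.mkdir(parents=True, exist_ok=True)
    marker = WAITING_DIR / f"{project}-{uuid.uuid4().hex}"
    marker.write_text(project)           # marker written first...

    if not should_replicate:
        # ...then the short circuit returns without emitting the
        # "finished" event that would normally remove the marker,
        # leaving a stale file behind (e.g. for All-Projects/All-Users).
        return

    replicate(project)
    marker.unlink()                       # cleanup only on the happy path
```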
19:31:04 #topic Upgrading Servers
19:31:22 Etherpad is now running on the new server. The move went well and it stuck to the expected timeline
19:31:40 We have since removed etherpad01 from our inventory and dns. I shut down the server via the nova api too.
19:31:53 At this point we're probably comfortable deleting the server and its cinder volume?
19:32:01 fine by me, yeah
19:32:50 me too, new server lgtm
19:32:58 and we double checked the backups
19:32:59 i've been using it heavily, no problems
19:33:02 great, I'll try to get that done (probably tomorrow)
19:33:16 Nameserver replacement is also in progress
19:33:28 Looks like the change to deploy the new servers (but not use them as resolvers yet) has landed
19:33:34 #link https://etherpad.opendev.org/p/2023-opendev-dns
19:33:39 yep they should be active today, i'll keep on top of it
19:34:19 ianw: is there anything else you need at this point?
19:34:39 nope, i'll let people know when we can change registry records
19:34:46 thanks!
19:35:11 i am assuming gating.dev and zuul* on gandi are corvus?
19:35:12 I doubt I'll have time to take on any replacements this week, but I'd like to grab another one or two and push on them next week. Maybe jitsi meet or mirrors or something
19:35:45 ianw: I think so
19:35:46 actually zuul-ci.org is different to zuulci.org
19:35:57 the zuulci.org domain is likely the foundation
19:36:02 at least the registrar is
19:36:14 yep
19:36:52 yeah, zuulci.org is using csc global for its registrar, which is who the foundation goes through
19:37:33 any whois for our domains with "Registrar: CSC Corporate Domains, Inc." is pretty much guaranteed to be foundation staff coordinated
19:37:47 I did give the foundation registrar managers a heads up this was going to happen soon too. They are ready and willing to help us when the time comes
19:38:16 having a clear list of the domains we want them to update will be key, of course
19:38:28 fungi: sounds like opendev.org and zuulci.org
19:38:45 gating.dev and zuul-ci.org are the other two domains hosted by our nameservers and corvus has control of those
19:39:11 Anything else related to nameservers or replacing servers generally?
19:40:43 nope
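As a small aid to the registrar discussion above, this is roughly the kind of check fungi describes: run whois over the domains our nameservers host and look at the Registrar line. It assumes a whois binary is on the path; the domain list is the one named in the meeting.

```python
# Sketch of the registrar check described above. Assumes a `whois`
# binary is installed; the domain list is the one named in the meeting.
import subprocess

DOMAINS = ["opendev.org", "zuulci.org", "gating.dev", "zuul-ci.org"]

for domain in DOMAINS:
    output = subprocess.run(
        ["whois", domain], capture_output=True, text=True
    ).stdout
    registrar = next(
        (line.strip() for line in output.splitlines()
         if line.strip().lower().startswith("registrar:")),
        "Registrar: (not found)",
    )
    # Anything showing "CSC Corporate Domains, Inc." is foundation-managed.
    print(f"{domain}: {registrar}")
```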
19:40:51 #topic AFS volume quotas
19:41:00 Good news! The usage seems to have stabilized?
19:41:08 it's still quite high but I don't think it has budged since last week
19:41:18 the universe has changed in your favor, collect three tokens
19:41:39 i hope to get back to wheel cleanup
19:41:47 and fedora upgrades
19:42:02 sounds good and ya this continues to not be an emergency but I'm trying to keep an eye on it
19:43:02 #topic Gitea 1.19
19:43:32 At this point I'm happy waiting for 1.19.2. One of ianw's bugs has been fixed and will be included in 1.19.2. The other is marked as should be fixed for 1.19.2 but hasn't been fixed yet
19:43:50 There are no critical features we are missing. 1.19 was largely cut to add their actions implementation from what I can see
19:44:02 If that changes we may accelerate and deploy 1.19.1 before .2
19:44:15 #topic Quo vadis Storyboard
19:44:18 as an aside, are we any closer to being concerned about needing to switch to forgejo, or has that teapot stopped tempesting?
19:44:24 #undo
19:44:24 Removing item from minutes: #topic Quo vadis Storyboard
19:44:54 fungi: no, as far as I can tell upstream development is still happening in the open via loosely organized maintainers that write down their release goals at the beginning of their dev cycles
19:45:07 No features have been removed. Licensing has remained the same. etc
19:45:24 any idea if forgejo has hard-forked or is continuing to pull new commits from gitea?
19:45:43 I think their intention is to avoid a hard fork
19:45:54 just wondering if it's diverging substantially over time, or sitting stagnant
19:46:13 from what I've seen (and I'm totally not an expert) it's basically the same code base with a different name
19:46:28 the less it becomes like gitea, the harder it will be to switch if we decide we want to
19:46:36 hence my curiosity
19:47:21 anyway, i can read up on it, just didn't know if anyone had seen any news on that front
19:48:04 looking at their release notes they seem to be making intermediate releases between gitea releases but following gitea releases overall
19:48:22 and ya double checking on that is probably reasonable. I just haven't seen anything indicating major differences yet
19:48:28 #topic Pruning vexxhost backup server
19:48:49 We've gotten a few emails about this recently. We are at 92% of capacity. I suspect adding some of these new servers hasn't helped either
19:49:37 I can probably run the pruning script tomorrow
19:49:46 i think we're about monthly on pruning that, so it probably tracks
19:49:47 and then maybe we check relative disk usage compared to prior prunes
19:49:51 ah ok
19:50:44 i can run it and watch it
19:50:59 ianw: if you like. I'm happy to do it tomorrow too (I just have enough on my plate for today)
19:51:22 it would be interesting to track individual backup sizes
19:51:27 And with that we've made it to the end of our agenda (actually this last topic was added by me at the last minute)
19:51:29 yeah, i can also do it in the background tomorrow probably. i guess whoever gets to it first just let everyone else know ;)
19:51:34 ++
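One way to follow up on "check relative disk usage compared to prior prunes" above is to record the backup volume's usage before and after each prune run. The sketch below does just that; the mount point and log path are assumptions, and the prune itself is still performed by the existing script, which is not shown here.

```python
# Minimal sketch for tracking backup volume usage around prune runs.
# The mount point and log path below are assumptions, not taken from
# the meeting; the prune itself is still run by the existing script.
import json
import shutil
import time

BACKUP_MOUNT = "/opt/backups"            # assumed backup volume mount
LOG_FILE = "/var/log/backup-usage.jsonl"  # assumed location for the record

usage = shutil.disk_usage(BACKUP_MOUNT)
record = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "percent_used": round(100 * usage.used / usage.total, 1),
    "free_gib": round(usage.free / 2**30, 1),
}

# Append one JSON line per run so successive prunes can be compared.
with open(LOG_FILE, "a") as f:
    f.write(json.dumps(record) + "\n")

print(record)
```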
19:51:37 #topic Open Discussion
19:51:39 Anything else?
19:52:10 did you skip sb intentionally? not much to say probably anyway
19:52:30 also I'll be away for 3 weeks starting thursday
19:52:32 frickler: oh you know what, I undid it and then forgot it
19:52:35 that's on me
19:52:47 (backup prune running in a screen now)
19:53:12 frickler: fungi: re sb I did want to ask about recent discussion around that. Is that mostly around the logistics of reflecting changes in projects within our project-config repo?
19:53:17 ianw: noted, thanks!
19:54:10 clarkb: the latest was just about removing it for retired projects
19:54:29 clarkb: sb's database schema supports marking projects as "inactive" in order to not present them as autocompletion options and stuff, so i've been trying to remember to set that when projects retire, but should do an audit to see if any have been missed
19:54:50 got it. I guess it is a good idea to mark both retired projects that way as well as those that move elsewhere
19:55:01 also removing "use-storyboard" and "groups" entries from retired projects in our gerrit/projects.yaml file if there are any
19:55:22 another short note: the wheel builds are still broken, I think ianw has some patches up for that?
19:55:48 current issue seems to be afs on centos once again
19:56:09 on a similar note, there was at least one dib change to fix openeuler image builds, do we need a new dib release for that before we turn them back on, or did it already happen?
19:56:11 i do have something up that will release the builds separately. i'll try to take a look at the build failures
19:56:31 fungi: another dib release will be needed iirc
19:56:31 we would need a dib release and nodepool rebuild. i can look at that
19:56:51 i don't think we have anything else in the queue for dib
19:57:33 if centos is failing, that often means our images are out of sync and have different kernel versions to the headers available from the mirror
19:59:24 oh, there's also the pypi org thing
19:59:39 do we want to register at least an opendev org there?
19:59:49 I'll admit I haven't read it yet.
19:59:56 The trusted publisher thing kinda made me grumpy
20:00:43 related, i registered an "opendev.org" user on pypi ("opendev" was already taken) in case we eventually want to replace the openstackci account there
20:01:03 we are officially at time
20:01:10 thanks clarkb!
20:01:17 I'll end the meeting here but feel free to continue discussion in #opendev or on the mailing list
20:01:19 #endmeeting