19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Oct 12 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:12 #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000288.html Our Agenda
19:01:19 #topic Announcements
19:01:33 I forgot to mention this in the agenda but next week is the PTG
19:01:52 I requested a short amount of time for ourselves largely as office hours that other projects can jump into to discuss stuff with us
19:02:13 I plan to be there and should get an etherpad together today. If the times work out for you feel free to join, otherwise I think we'll have it covered
19:02:52 But also keep in mind that is happening next week and we should avoid changes to meetpad/etherpad if possible
19:03:26 #topic Actions from last meeting
19:03:32 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-05-19.01.txt minutes from last meeting
19:03:46 We had actions last week but they were all related to specs. Let's just jump into specs discussion then :)
19:03:51 #topic Specs
19:04:20 First up I did manage to update the prometheus spec based on feedback on how to run it. Ended up settling on using the built binary to avoid docker weirdness and old versions in distros
19:04:27 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:43 corvus brought up a good concern which is that we should ensure that it can run on old old distros and I confirmed it seems to work on xenial
19:04:53 fungi: I was going to work with you to check it on trusty when you have time
19:05:00 ahh, yep
19:05:05 I didn't want to touch the remaining trusty node without you being around
19:05:10 i should have time tomorrow
19:05:15 Cool I'll ping you tomorrow then.
19:05:27 My plan is to approve this spec if nothing comes up by end of day Thursday for me
19:05:38 if you haven't reviewed the spec and would like to, now is the time to do that
19:05:46 and we can note any trusty problems before landing it too
19:05:56 Next up is the mailman 3 spec
19:05:58 #link https://review.opendev.org/810990 Mailman 3 spec
19:06:23 I've reviewed this and the plan seems straightforward. Essentially spin up a new machine running mm3. Then migrate existing mm2 vhosts into it as users are ready, starting with opendev
19:06:43 If other infra-root can review this spec that would be much appreciated
19:07:07 fungi: ^ anything else to add on mailman 3?
19:07:43 nah, the migration tools are fairly honed from what i understand, so other than new interfaces and probably some new message formatting in places, users shouldn't really be impacted
19:08:22 thank you for putting that together. I'm excited to be able to use the new frontend
19:08:32 #topic Topics
19:08:39 #topic Improving OpenDev CD throughput
19:08:41 i guess the actual steps for the cut-over could stand to be drilled down into a bit, deciding how we want to go about making sure deliveries to the list get queued up while we're copying things over
19:08:59 but we can work that out as we get closer
19:09:02 ++
19:09:23 ianw: you mentioned trying to pick this up again. I think the next step is largely in making that first change in the stack mergeable (by running jobs for it somehow?)
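The exchange that follows turns on Zuul's file matchers: a job that declares a files filter only runs against changes that touch matching paths, so a change that only edits a README will not exercise jobs whose filters point at the playbooks. A minimal sketch of such a job definition, with illustrative names rather than the real system-config jobs:

    # Hypothetical Zuul job showing the file matcher behaviour discussed
    # below; the job name, playbook path, and patterns are assumptions.
    - job:
        name: infra-prod-example
        description: Illustrative deployment job.
        run: playbooks/service-example.yaml
        files:
          # Only run when a change modifies matching paths; a README-only
          # change matches nothing here and so does not trigger the job.
          - ^playbooks/.*$
          - ^roles/.*$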
19:10:17 note that corvus pointed out the change in zuul you thought would fix it likely won't as the playbooks aren't changing for those jobs
19:10:46 yeah i added an update to a readme
19:11:15 ah ok, I should go back and rereview then
19:11:17 (late last night ... and i just realised that file doesn't trigger anything either :)
19:11:33 i'll try something else! but yep, i did respond to all comments
19:11:41 #link https://review.opendev.org/c/opendev/system-config/+/807672 is the first change in the sequence
19:11:50 cool and I'll look at rereviewing things this afternoon
19:13:07 #topic Gerrit account cleanups
19:13:44 This is something that has gone on the back burner with zuul updates, gerrit upgrades, gitea upgrades, openstack releases etc. I haven't forgotten about it and will try and pick it up next week if the PTG is quiet for me (I expect it to be but you never know with an event)
19:14:03 Really just noting that I still intend on getting to this but it is an easy punt because it doesn't typically immediately affect stuff
19:14:21 #topic Gerrit Project Renames
19:14:37 We announced last week that we would rename gerrit projects Friday October 15 at 18:00 UTC
19:14:57 All of our project renaming testing continues to function, so we should largely be mechanically ready for this
19:15:20 The one thing that has been noted is that we need to update project metadata in gitea after renames to update descriptions and urls and storyboard links
19:15:57 I think the easiest way to do that would be to run the gitea project management with the full update flag set after we rename. Either using a subset of projects.yaml or just doing it for everything (which could take hours)
19:16:10 fungi: ^ you were thinking about this too, did you have a sense for how you wanted to approach it?
19:16:45 it's still not clear to me why we can't update specific projects
19:16:57 though i suppose we do need to perform a full run at least once to catch up
19:17:28 fungi: the fundamental issue is that the input to renaming and the input to setting project metadata are different. The rename playbook takes that simple yaml file with old and new names. The project metadata takes projects.yaml
19:17:42 This is why I think it is simpler to do it as two distinct steps.
19:17:59 oh, it doesn't filter projects.yaml by specific entries?
19:18:15 no, projects.yaml is not referenced at all in the rename process
19:18:32 Then to make things more complicated projects.yaml doesn't get updated until after the rename is done and we merge the associated changes
19:18:38 oh, i see, we would normally update projects.yaml after renaming
19:18:51 and since it takes hours to do the force update we don't do those
19:18:54 so would need to run the metadata update after that
19:18:56 er we don't automatically do those
19:19:02 yup
19:19:25 but we could tell it to only update the projects which had been renamed, once things are up and the projects.yaml update merges, right?
19:19:51 rather than telling it to update every project listed in the file
19:20:00 fungi: the current code does not support that. We could hack it in by running the update against an edited projects.yaml. But even that might be complicated since I think the playbook syncs project config directly
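As a rough illustration of the kind of filtering discussed here, a small helper could trim projects.yaml down to just the renamed projects before handing it to the gitea management playbook. This is only a sketch under assumptions: the file layouts (projects.yaml entries keyed by project, and a rename record listing old/new pairs) and all file names are illustrative, not the real formats the playbooks consume.

    # Sketch: reduce projects.yaml to only the projects named in a rename
    # record so a gitea metadata update touches just the renamed projects.
    # Requires PyYAML; the input formats are assumptions for illustration.
    import yaml

    def filter_projects(projects_path, renames_path, output_path):
        with open(projects_path) as f:
            projects = yaml.safe_load(f)   # assumed: list of {'project': name, ...}
        with open(renames_path) as f:
            renames = yaml.safe_load(f)    # assumed: list of {'old': ..., 'new': ...}
        renamed = {entry["new"] for entry in renames}
        subset = [entry for entry in projects if entry.get("project") in renamed]
        with open(output_path, "w") as f:
            yaml.safe_dump(subset, f, default_flow_style=False)

    if __name__ == "__main__":
        filter_projects("projects.yaml", "rename-20211015.yaml",
                        "projects-renamed-only.yaml")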
19:20:13 this is the bit I was hoping someone would have time to look at
19:20:33 I think it is ok to do the metadata update as a separate step post rename, but ya we should devise a method of making it less expensive
19:20:41 ahh, okay, so we need a way to tell the metadata update script to filter its actions to specific project names, i guess
19:20:46 ya that
19:21:33 seems like that shouldn't be too hard, we could probably have it use the rename file as the filter
19:21:53 i'll see if i can figure out what we need to be able to do that bit
19:21:55 essentially our rename process becomes: run rename playbook, restart things and land project-config updates, ensure zuul is happy, manually run the gitea repo management playbook hopefully not in the most expensive configuration possible
19:22:00 and it is that very last bit that we need someone to look at
19:22:11 thanks!
19:22:46 i guess we could even avoid doing a full sync this time by feeding it the historical rename files too
19:23:06 fungi: yup, we could go through and pull out all the names that need updating as a subset of the whole
19:23:10 since we have those records
19:23:52 I'll work on an etherpad process doc as well as reviewing the project-config proposal and writing up the record files for this rename tomorrow
19:24:32 Anything else to cover on this topic?
19:25:01 i don't think so
19:25:08 #topic Gerrit 3.3 upgrade
19:25:36 This went exceptionally well. I'm still looking around wondering what happened and if it is too good to be true :)
19:25:44 the 2.13 -> 3.2 upgrade has scarred me
19:26:18 I've been trying to add hashtag:gerrit-3.3 to changes related to the upgrade and the cleanup afterwards. Feel free to add this hashtag to your changes too
19:26:56 At this point the major change remaining is the 3.2 image cleanup change.
19:27:03 i'm trying to run down what looks like a new comment-related crash in gertty, probably related to the upgrade to 3.3
19:27:33 Do we have opinions on when we'll feel comfortable dropping the 3.2 images?
19:28:04 dropping the jobs won't purge the images from dockerhub though will it?
19:28:21 ianw: it will not, however the images in docker hub will eventually get aged out and deleted on that end
19:28:30 (I forget what the timing on that is with dockerhub's new policy)
19:29:19 https://review.opendev.org/c/opendev/system-config/+/813074 is the change. I'm feeling more and more confident in 3.3 since we haven't had any issues yet that make me think revert
19:29:35 iirc, the image ageout has to do with when it last got a download
19:29:37 maybe we go ahead and land it and we can always restore it again later if necessary or just use the existing 3.2 tag until that ages out in docker hub
19:29:44 yep; we can always rebuild too, and have local copies. so i think 813074 is probably gtg
19:29:49 fungi: ya and our test jobs for 3.2 download that tag
19:30:08 ianw: ok I've removed the WIP
19:30:19 i'm fine dropping them at any time, yeah. it's increasingly unlikely we'd try to roll back at this point
19:30:30 we're nearing the 48-hour mark
19:30:30 cool please review the change then :)
19:30:51 yeah, i was, but... gertty crashing on one of those changes distracted me ;)
19:31:06 the attention set seems to be working
19:31:21 I've also got https://review.opendev.org/c/opendev/system-config/+/813534/ up which is semi related in that the changes to update post upgrade hit a bug where we set the heap limit to 96gb in testing, which doesn't work if the jvm tries to allocate memory on an 8gb instance
19:31:50 ianw: ^ I made a new ps on that fixing an issue I caught when writing the followups we talked about yesterday. The followups are there too. I think the whole stack deserves careful review to ensure we don't accidentally remove anything.
19:32:12 oh it just occurred to me that the gerrit -> review group rename needs to be checked against the private group vars. Let me WIP that until that is done
19:33:09 ianw: oh neat it is telling me to review the CD Improvement change since you pushed a new ps
19:33:26 I really think the attention set has the potential for being very powerful, just need to sort out how to make it work for us
19:33:53 yeah i've been careful to unclick people when voting, which isn't something that needs attention
19:34:30 The last thing I had on this topic is pointing out that we are already testing the 3.3 -> 3.4 upgrade in CI now :) We should start thinking about scheduling that upgrade next. Probably really look at that post PTG, so in 2 weeks?
19:34:48 ianw: using the modify button at the bottom of the comment window thing?
19:34:51 i did add a note on that @
19:34:55 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=15154
19:35:08 still every bug seems to go into the polygerrit category there :/
19:35:45 i guess maybe this is polygerrit; i don't know who owns it
19:36:04 ianw: their bug tracker is broken and the normal issue type can't be submitted because no one is assigned to receive notifications for them or some such
19:36:20 so it defaults to polygerrit and you have to hope it gets in front of the right people. But I agree this could be a polygerrit issue
19:36:45 clarkb: yep; i think that's going to be the major issue -- if everyone adds your attention when they +1/+2 your change, your attention list becomes less useful
19:37:20 also if anyone pops up with dashboard issues see
19:37:24 #link https://groups.google.com/g/repo-discuss/c/565rD1Sjiag
19:37:45 might be worth an email to opendev-discuss calling out the modify action and the dashboard stuff
19:37:47 basically; /#/dashboard/... doesn't work, /dashboard/... does. unclear if this is a bug or feature
19:38:08 sure i can draft something
19:38:15 thanks
19:38:56 service-discuss?
19:39:10 fungi: yup sorry. Every other -discuss is name-discuss
19:39:33 cool, just making sure you didn't mean some other-discuss
19:39:41 (like openstack-discuss)
19:40:20 Thanks again to everyone who helped make this upgrade happen. I think we're in a really good place as far as gerrit goes. We can upgrade with minimal impact and in some cases downgrade. Many of the things we do with gerrit like project creation, project renaming, etc are tested. We even have upgrade testing
19:40:30 Oh and the new server etc
19:40:44 We've come a long way since we were on 2.13 a year ago
19:42:12 #topic Open Discussion
19:42:29 That was it for the agenda.
19:42:35 Anything else?
19:42:56 #link https://review.opendev.org/c/opendev/system-config/+/812622
19:43:17 fungi: ^ maybe you could double check i didn't fat finger anything; that's from the lock issues i think we discussed last week
19:43:20 oh for some reason I thought that had landed and noticed we still had the error that should fix
19:43:31 but I guess it hasn't landed and ++ to reviewing it and getting it in to fix those conflicts
19:43:34 checking
19:43:36 borg verify lock issues i mean
19:44:05 and yeah, i too saw the cron message and thought we had already fixed it, so good to know!
19:44:15 also gitea01 started failing backups again which makes me wonder about networking in vexxhost again :)
19:44:26 the recent -devel job failure had me thinking about bridge upgrades too
19:45:10 at one time it seemed to be under a lot of pressure for its small size, but i don't recall any issues recently
19:45:12 One tricky thing I remembered about bridge updates is we have ssh rules around bridge connectivity iirc. We may need to spin up a new bridge, then update all the things to talk to it, then swap over?
19:45:34 ianw: the effort to parallelize the CD stuff could have us wanting a bigger server again
19:45:50 ianw: might be a good idea to get that work done first, monitor resource needs and size appropriately for a new server?
19:46:06 i thought it might be a good time to start thinking about using zuul-secrets in more places
19:47:52 ianw: to avoid needing the bastion?
19:47:56 we could easily >4x the load on bridge before it's a problem.
19:49:39 ya the main issue that seems to affect bridge performance is having leaked ansible ssh connections pile up, which causes system load to grow
19:49:45 when that isn't happening it doesn't need a ton of resources
19:50:08 clarkb: yep; it would be a different way of working but I think moves us even more towards a "gitops" model
19:51:07 ya I think my biggest struggle there is still around manually running stuff. It seems super useful to be able to do that when things pop up and zuul doesn't have a great answer to that (yet?)
19:52:07 i think the cd story gets a lot better if we can remove the ansible module restrictions for untrusted jobs on the executor (but that's a v5+ idea)
19:52:15 and the "leaked" (really indefinitely hung) ansible ssh processes seem to crop up when we have pathological server failures where ssh authentication begins but the login never completes
19:52:53 rebooting whatever server is stuck generally clears it up
19:53:44 if we want to hold off a bit on updating bridge with the idea that zuul improves before we need to upgrade I'd be ok with that. But we should probably write down a concrete set of things we can go to zuul about making better for that. The modules thing is another good one
19:54:02 before we end, I also want to mention the issue with suds-jurko, which is mostly masked in CI because we have an old wheel built for it
19:54:05 clarkb: yep, i agree, but i imagine we could have some sort of "escape-hatch" where we do something like replicate the key and have a way to manually run ansible that provides the decrypted secrets
19:54:10 but also a "zuul please run this job now" bypass for sysadmins would be a nice alternative to the escape hatch bridge gives us
19:54:25 ianw: ya that could work too
19:54:32 worth noting, we'll be unable to upgrade to the upcoming ansible release until we're on focal, or else we need to use a nondefault python3 with it on bridge
19:54:45 frickler: oh ya that's a good call out
19:54:50 so maybe we want to expire wheels after some time or have some other way to check whether they still can be built
19:55:05 frickler: ahh, i have a spec for that i think :)
19:55:16 i'm unaware of the suds-jurko issue, is there a summary?
19:55:26 we built a wheel for suds-jurko some time ago with older setuptools and have that in our wheel mirror. But suds-jurko doesn't build with current setuptools. This means running openstacky things outside of our CI system is problematic as that package doesn't install
19:55:28 corvus: ^
19:55:35 #link https://review.opendev.org/#/c/703916/
19:55:36 the wheel was built with setuptools < 58, with 58 it fails
19:56:03 thx
19:56:15 a sort of toctou problem, i guess
19:56:16 and ya doing a fresh wipe of the wheel mirrors periodically would be a good way to expose that stuff in CI
19:56:17 that doesn't exactly cover this scenario, but related. basically that we are an append-only copy that grows indefinitely
19:56:25 also https://bugs.launchpad.net/cinder/+bug/1946340
19:56:43 we first noticed it with fedora, because we don't seem to have a wheel for py39
19:56:53 we probably don't need daily rebuilds but weekly or monthly might be a good approach. And solve the indefinite growth problem too
19:57:44 separately suds-jurko has been unmaintained for like 7 years...
19:58:03 and should be replaced too
19:58:18 one of half a dozen (at least) dead dependencies which setuptools 58 turned up in various projects i know about
19:58:27 yes, in that specific case, there's suds-community as a replacement
19:58:54 https://review.opendev.org/c/openstack/requirements/+/813302 lists what fails in u-c
19:59:40 with a job that doesn't use the pre-built wheels
20:00:12 We are at time. Feel free to continue conversation in #opendev or on the mailing list.
20:00:39 Thank you everyone for listening and participating. We'll probably be around next week since I don't expect a ton of direct PTG involvement. But if that changes I'll try to send an email about it
20:00:49 #endmeeting
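For background on the suds-jurko discussion above: setuptools 58 removed support for the long-deprecated use_2to3 option that suds-jurko's setup.py relies on, so source builds fail and only the old pre-built wheel in the mirror keeps installs working in CI. A rough way to see the failure outside of CI (a sketch; exact versions and error text may vary):

    # Create a clean environment with a setuptools new enough to have
    # dropped use_2to3, then force a source build of suds-jurko.
    python3 -m venv suds-test
    . suds-test/bin/activate
    pip install 'setuptools>=58'
    pip install --no-binary :all: suds-jurko
    # The build is expected to fail with an error along the lines of:
    #   error in suds-jurko setup command: use_2to3 is invalid.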