19:00:26 <clarkb> #startmeeting infra
19:00:26 <opendevmeet> Meeting started Tue Nov 14 19:00:26 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 <opendevmeet> The meeting name has been set to 'infra'
19:00:28 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NIDXZX7JT4MQJOUS7GKI5PPRMDIIY6FI/ Our Agenda
19:00:40 <clarkb> The agenda went out late because I wasn't around yesterday, but we do have an agenda
19:01:06 <clarkb> #topic Announcements
19:01:41 <clarkb> Next week is a big US holiday. That said, I expect to be around for the beginning of the week and plan to host our weekly meeting Tuesday
19:01:54 <clarkb> But be aware that by Thursday I expect it to be very quiet
19:02:30 <clarkb> #topic Mailman 3
19:02:55 <clarkb> fungi: I think you dug up some more info on the template file parse error? And basically mailman3 is missing some file that they need to add after django removed it from their library?
19:03:05 <fungi> the bug we talked about yesterday turns out to be legitimate, yes
19:03:10 <fungi> er, last week i mean
19:03:34 <tonyb> time flies
19:03:39 <clarkb> to confirm we are running all of the versions of the software we expect, but a new bug has surfaced and we aren't seeing an old bug due to accidental use of old libraries
19:04:21 <fungi> yeah, and this error really just means django isn't pre-compressing some html templates, so they're a little bigger on the wire to users
19:05:00 <clarkb> in that case I guess we're probably going to ignore this until the next mm3 upgrade?
19:05:24 <fungi> #link https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/36U5NY725FNJSGRNELFOJLLEZQIS2L3Y/ mailman-web compress - Invalid template socialaccount/login_cancelled.html
19:05:58 <fungi> yeah, it seems safe to just ignore and then we can plan to do a mid-release update when it gets fixed if we want, or wait until the next release
19:06:06 <clarkb> should we drop this agenda item from next week's meeting then?
19:06:36 <clarkb> I believe this was the last open item for mm3
19:06:38 <fungi> i think so, yes. we can add upgrades to the agenda as needed in the future
19:06:47 <clarkb> sounds good. Thanks again for working through all of this for us
19:06:54 <clarkb> #topic Server Upgrades
19:06:54 <fungi> thanks for your patience and help!
19:07:19 <clarkb> we added tonyb to the root list last week and promptly put him to work booting new servers :)
19:07:28 <tonyb> \o/
19:07:38 <clarkb> mirror01.ord.rax is being replaced with a new mirror02.ord.rax server courtesy of tonyb
19:07:43 <clarkb> #link https://review.opendev.org/c/opendev/zone-opendev.org/+/900922
19:07:48 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/900923
19:08:07 <clarkb> These changes should get the server all deployed, then we can confirm it is happy before updating DNS to flip over the mirror.ord.rax CNAMEs
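For context, the CNAME flip at the end of that sequence is a one-line edit in the opendev/zone-opendev.org zone file plus a quick verification; a rough sketch, with the record name and TTL assumed rather than copied from the actual zone:

    ; repoint the alias at the replacement once it is confirmed healthy (illustrative record)
    mirror.ord.rax    300  IN  CNAME  mirror02.ord.rax.opendev.org.

    # verify after the zone publishes
    dig +short CNAME mirror.ord.rax.opendev.org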
19:08:18 <clarkb> I think the plan is to work through this one first and then start doing others
19:08:19 <tonyb> After a good session booting mirror02 I managed to clip some of the longer strings, so the reviews took me longer to publish
19:08:38 <clarkb> tonyb: I did run two variations of ssh-keyscan in order to double check the data
19:08:49 <tonyb> clarkb: Thanks
19:08:49 <clarkb> I think it is correct and noted that in my reviews when I noticed the note about the copy paste problems
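The cross-check described above amounts to scanning the new host a couple of different ways and comparing the results; a minimal sketch, assuming the mirror02 hostname resolves (the exact invocations used may have differed):

    # scan by name and by IP; the key material should match either way
    ssh-keyscan -t rsa,ecdsa,ed25519 mirror02.ord.rax.opendev.org
    ssh-keyscan -t rsa,ecdsa,ed25519 <server IP>
    # comparing fingerprints is easier than eyeballing long base64 strings
    ssh-keyscan mirror02.ord.rax.opendev.org 2>/dev/null | ssh-keygen -lf -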
19:09:51 <clarkb> feel free to continue asking questions and poking for reviews. This is really helpful
19:09:52 <tonyb> I started writing a "standalone" tool for handling the volume setup as the mirror nodes are a little different
19:10:07 <tonyb> Yup I certainly will do.
19:10:30 <clarkb> tonyb: ++ to having the mirror volumes a bit more automated
19:11:17 <fungi> agreed, we have enough following that pattern that it could be worthwhile
19:11:47 <fungi> note that not all mirror servers get that treatment though, some have sufficiently large rootfs we just leave it as-is and don't create additional volumes
19:11:51 <tonyb> I think that's about it for the mirror nodes.  It's mostly carefully following the bouncing ball at this stage
19:12:25 <clarkb> cool. I'm happy to do another run-through too if we like. I feel like that was helpful for everyone as it made problems with cinder volume creation apparent and so on
19:13:10 <tonyb> fungi: Yup.  And as we can't always predict the device name in the guest it won't be fully automated or integrated; it's just to document/simplify the creation work we did on the meetpad
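A rough sketch of the manual steps such a standalone tool would wrap (the sizes, volume/LV names, and mount point here are placeholders, and as noted the device name really does need to be confirmed by hand):

    # create a cinder volume and attach it to the new mirror
    openstack volume create --size 200 mirror02.ord.rax.opendev.org/main01
    openstack server add volume mirror02.ord.rax.opendev.org mirror02.ord.rax.opendev.org/main01
    # on the guest: confirm which device showed up, then set up LVM and a filesystem
    lsblk
    pvcreate /dev/xvdb && vgcreate main /dev/xvdb
    lvcreate -L 190G -n cache main
    mkfs.ext4 /dev/main/cache && mount /dev/main/cache <mountpoint>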
19:13:12 <fungi> i too am happy to do another onboarding call, maybe for other activities
19:13:49 * tonyb too.
19:14:39 <clarkb> anything else on this topic?
19:14:48 <tonyb> not from me
19:14:59 <clarkb> #topic Python Container Updates
19:15:19 <clarkb> Unfortunately I haven't really had time to look at the failures here in more detail. I saw tonyb asking questions about them though, were you looking?
19:15:27 <clarkb> #link https://review.opendev.org/c/zuul/zuul-operator/+/881245 Is the zuul-operator canary change
19:15:45 <clarkb> specifically we need that change to begin passing in zuul-operator before we can land the updates for the docker image in that repo
19:16:20 <tonyb> I am looking at it
19:16:52 <tonyb> I spoke to dpawlik about status and background
19:16:56 <corvus> i suspect something has bitrotted with cert-manager; but with the switch in k8s setup away from docker, we don't have the right logs collected to see it, so that's probably the first task
19:17:20 <tonyb> No substantial progress but I'm finding my feet there
19:17:38 <corvus> (in other words, the old k8s setup got us all container logs via docker, but the new setup needs to get them from k8s and explicitly fetch from all namespaces)
19:17:44 <clarkb> gotcha
19:17:52 <clarkb> because we are no longer using docker under k8s
19:18:09 <corvus> yep
19:18:30 <clarkb> I agree, addressing log collection seems like a good next step
19:18:35 <tonyb> Okay that's good to know.
19:19:44 <clarkb> #topic Gitea 1.21
19:20:01 <clarkb> 1.21.0 has been released
19:20:03 <clarkb> #link https://github.com/go-gitea/gitea/blob/v1.21.0/CHANGELOG.md we have a changelog
19:20:22 <fungi> (and there was much rejoicing)
19:20:25 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/897679 Upgrade change needs updating now that we have changelog info
19:20:52 <clarkb> so ya the next step here is to go over the changelog and make sure our change is modified properly to handle their breaking changes
19:21:03 <clarkb> I haven't even looked at the changelog yet
19:21:18 <clarkb> but doing so and modifying that change is on my todo
19:21:20 <clarkb> *todo list
19:21:43 <clarkb> In the past we've often not upgraded until the .1 release anyway due to them very quickly releasing bugfixes
19:22:00 <fungi> nobody ever wants to go first
19:22:02 <clarkb> between that and the gerrit upgrade and then thanksgiving I'm not sure this is urgent, but also don't want it to get forgotten
19:22:41 <fungi> i agree that the next two weeks are probably not a great time to merge it, but i'll review at least
19:23:12 <clarkb> sounds good. Should have something to look at in the next day or so
19:23:16 <frickler> I'm wondering about the key length thing, how much effort would it be to use longer keys?
19:23:37 <tonyb> FWIW I'll review it too and, probably, ask "why do we $x" questions ;P
19:24:25 <clarkb> frickler: we need to generate a new key, add it to the gerrit user in gitea (this step may be manual currently, I think we only automate this at user creation time) and then add the key to gerrit and restart gerrit to pick it up
19:24:47 <clarkb> frickler: I suspect that if we switch to ed25519 then we can have it sit next to the existing rsa key in gerrit and we don't have to coordinate any moves
19:24:58 <clarkb> if we replace the shorter rsa key with a longer rsa key then we'd need a bit more coordination
19:25:12 <fungi> well, we could have multiple rsa keys too, right?
19:25:23 <clarkb> fungi: I don't think gerrit will find multiple rsa keys
19:25:35 <clarkb> but I'm not sure of that. We can test that on a held node I guess
19:25:42 <fungi> oh, right, local filename collision
19:26:13 <fungi> we can do two different keytypes because they use separate filenames
19:26:18 <clarkb> yup
19:26:30 <clarkb> I can look into that more closely as I page the gitea upgrade stuff back in
19:26:31 <fungi> i was thinking in the webui, not gerrit as a client
19:27:07 <fungi> so yeah, i agree parallel keys probably makes the transition smoother than having to swap one out in a single step
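A sketch of the parallel-key idea discussed above, assuming a fresh ed25519 keypair is generated for the gerrit account and the public half is added alongside the existing RSA key via Gitea's admin API (the user name, host, port, and endpoint here are assumptions from memory, not copied from our configuration):

    # new ed25519 keypair for the gerrit -> gitea replication user
    ssh-keygen -t ed25519 -N '' -C 'gerrit replication' -f ./gerrit-ed25519
    # attach the public key to the gerrit user in gitea using an admin token
    curl -s -X POST 'https://gitea01.opendev.org:3000/api/v1/admin/users/gerrit/keys' \
      -H "Authorization: token $GITEA_TOKEN" -H 'Content-Type: application/json' \
      -d "{\"title\": \"gerrit ed25519\", \"key\": \"$(cat ./gerrit-ed25519.pub)\"}"
    # gerrit then needs the private key in place and a restart to pick it up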
19:27:30 <clarkb> speaking of Gerrit:
19:27:35 <clarkb> #topic Gerrit 3.8 Upgrade
19:27:36 <fungi> though i guess if we add the old and new keys to gitea first, then we could swap rsa for rsa on the gerrit side
19:27:44 <fungi> but might need a restart
19:27:54 <clarkb> it will need a restart of gerrit in all cases iirc
19:27:59 <clarkb> because it reads the keys on startup
19:28:18 <clarkb> For the Gerrit upgrade I'm planning on going through the etherpad again tomorrow
19:28:24 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.8
19:28:33 <clarkb> I want to make sure I understand the screen logging magic a bit better
19:28:41 <clarkb> but also would appreciate reviews of that plan if you haven't read it yet
19:28:44 <fungi> also for the sake of the minutes...
19:28:46 <fungi> #link https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ Upgrading review.opendev.org to Gerrit 3.8 on November 17, 2023
19:28:59 <fungi> i figured you were going to include that in the announcements at the beginning
19:29:41 <clarkb> as far as coordination goes on the day of I expect I can drive things, but maybe fungi you can do some of the earlier stuff like adding hosts to emergency files and sending #status notice notices
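The notices themselves are just statusbot's command issued from an authorized nick in any channel it watches; the wording below is illustrative only:

    #status notice The Gerrit service on review.opendev.org is being upgraded to 3.8 and will be offline briefly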
19:29:59 <clarkb> I'll let you know if my expectations for that change, but I don't expect them to
19:30:28 <fungi> happy to. i think i'm driving christine to an eye appointment, but can do basic stuff from my phone in the car
19:31:10 <fungi> (also the appointment is about 2 minutes from the house)
19:31:10 <clarkb> seems like we are in good shape. And I'll triple check myself before Friday anyway
19:31:27 <tonyb> I can potentially do some of the "non-destructive" early work
19:31:44 <clarkb> tonyb: oh! we should add you to the statusbot acls
19:31:48 <fungi> we should add tonyb to statusbot
19:31:48 <clarkb> and generally irc acls
19:31:54 <tonyb> but that may make more work than doing it
19:31:54 <fungi> hah, jinx!
19:32:13 <tonyb> hehe
19:32:14 <fungi> tonyb: it's work that needs doing sometime anyway
19:32:20 <tonyb> so who owes who a soda?
19:32:25 <fungi> i can take care of it
19:32:32 <tonyb> kk
19:32:36 <fungi> i owe everyone soda anyway
19:32:40 <tonyb> LOL
19:32:42 <clarkb> openstack/project-config/accessbot/channels.yaml is one file that needs editing
19:32:49 <fungi> still repaying from my ptl days
19:33:14 <tonyb> I can do that.
19:33:46 <clarkb> I'm not actually sure where statusbot gets its user list. Does it just check for opers in the channel it is in?
19:33:59 <fungi> i'll look into it
19:34:16 <corvus> i think it's a config file
19:34:24 <clarkb> nope, it's statusbot_auth_nicks in system-config/inventory/service/group_vars/eavesdrop.yaml
19:34:28 <clarkb> tonyb: ^ so that file too
19:34:38 <fungi> thanks, i was almost there
19:34:44 <tonyb> gotcha
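Both edits are small list additions; a rough sketch of the eavesdrop group_vars change is below (the surrounding entries are assumed, check the file itself), with the accessbot change being the equivalent addition to the operator/channel lists in openstack/project-config/accessbot/channels.yaml:

    # opendev/system-config: inventory/service/group_vars/eavesdrop.yaml
    statusbot_auth_nicks:
      - clarkb
      - corvus
      - fungi
      - tonyb    # new entry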
19:34:58 <clarkb> anything else Gerrit upgrade related?
19:35:00 <fungi> i'm getting slow this afternoon, must be time to start on dinner
19:35:10 <clarkb> it's basically lunch time. I'm starving
19:35:43 <tonyb> Coffee o'clock and then a run. ... and then lunch
19:36:06 <clarkb> alright next up
19:36:13 <clarkb> #topic Ironic Bug Dashboard
19:36:19 <clarkb> #link https://github.com/dtantsur/ironic-bug-dashboard
19:36:34 <clarkb> The ironic team is asking if we would be willing to run an instance of their bug dashboard tool for them
19:36:40 <fungi> JayF: you were going to speak to this one?
19:36:43 <JayF> So some context; this is an old bug dashboard. No auth needed. Simplest python app ever.
19:36:44 <fungi> otherwise i can
19:37:04 <JayF> We've run it in various places we've just done custom-ly, before doing that again with our move to LP, we thought we'd ask about getting it a real home.
19:37:30 <JayF> No dependencies. Literally just needs a place to run, and I think dtantsur wrote a dockerfile for it the other day, too
19:37:56 <clarkb> My major concern is that running a service for a single project feels very inefficient from our side. If someone wanted to resurrect the openstack bug dashboard instead I feel like that might be a little different?
19:38:04 <fungi> so options are for adding it to opendev officially (deployment via ansible/container image building and testinfra tests), or us booting a vm for them to manage themselves
19:38:04 <tonyb> The docs show using podman etc so yeah I think that's been done
19:38:46 <clarkb> additionally teams like tripleo have had one tool and ironic has one apparently and so on. I think it is inefficient for the project teams too
19:38:46 <fungi> for historical reference, "the openstack bug dashboard" was called "bugday"
19:38:49 <JayF> clarkb: I talked to dtantsur; we are extremely willing to take patches (and will email the list about this existing again once we get it a home) if other teams want to use it
19:40:10 <frickler> JayF: so would you be willing to run this yourself if we give you a vm with a DNS record?
19:40:10 <JayF> fungi: it's extremely likely if infra says no, and we host it out of band, we'd do something similar to the second option (just get a VM somewhere and run it manually)
19:40:46 <JayF> frickler: replace instances of "you" and yourself" with ironic community as appropriate and the answer is "yes", with specific contacts being dtantsur and I to start
19:41:08 <JayF> frickler: if you all had no answer for us, nonzero chance this ended up as a container in my homelab :)
19:41:18 <frickler> that would be an easy start and we could see how it develops
19:41:41 <clarkb> so basically the idea behind openstack infra and now opendev was that we'd avoid doing stuff like this and instead create a commons where projects could work together to address common problems
19:42:04 <fungi> yeah, when this came up yesterday in #openstack-ironic i mentioned the current situation with the opensearch-backed log ingestion service dpawlik set up
19:42:20 <clarkb> where we've struggled is when projects do things like this specific tool and blaze their own trail. This takes away potential commons resources as well as multiplies effort required
19:42:42 <JayF> From an infra standpoint; I'm with you.
19:42:44 <JayF> This is why it's an opportunistic ask with almost an expectation that "no" was a likely answer.
19:42:51 <clarkb> I think that if we were to host it it would need to be a more generic tool for OpenDev users and not ironic specific. I have fewer concerns with handing over a VM
19:42:59 <JayF> From a community standpoint; that was storyboard; we adopted it; it disappeared; we are trying to dig out from that mistake
19:43:09 <frickler> iiuc the tool is open to be used by other projects, they just need to amend it accordingly
19:43:11 <fungi> i do think we want to encourage people to collaborate on infrastructure that supports their projects when there is a will to do so
19:43:11 <JayF> and I do not want to burn more time trying to go down alternate "work together" paths in pursuit of that goal
19:43:40 <clarkb> JayF: the problem is that all the cases of not working together are why we have massive debt
19:44:13 <clarkb> ironic is not the only project trying to deal with storyboard for example
19:44:18 <JayF> clarkb: I have lots of examples of cases of us working together that also have massive debt; so I'm not sure I agree with all of the root causing, but I do understand what you're getting at and like I said, if the answer is no, it's no.
19:44:26 <fungi> basically the risk is that the opendev sysadmins are the last people standing when whoever put some service together disappears and there are still users
19:44:29 <clarkb> and despite my prodding very little collaboration between teams with the same problems has occurred as an example
19:44:54 <fungi> so we get to be the ones who tell users "sorry, nobody's keeping this running any more"
19:45:27 <clarkb> the infra sig continues to field questions about how to set up LP
19:45:38 <clarkb> stuff that should have ideally been far more coordinated among the groups moving
19:45:50 <fungi> i mostly just remind folks that we don't run launchpad, and it has documentation
19:46:10 <clarkb> and I can't shake the feeling that an ironic bug dashboard is just an extension of these problems and we'll end up being asked to run a different tool for nova and then a different one for sdks and so on
19:46:28 <JayF> This is off topic for the meeting, but the coordination is always the most difficult part ime; which is why for Ironic's LP migration it finally started moving when I stopped trying so hard to pull half of openstack with me.
19:46:30 <clarkb> when what we need as a group is rough agreement on what a tool should be and then run that. And as mentioned before this tool did exist
19:46:42 <clarkb> but it too ran into disrepair and was no longer maintained and we shut it off
19:47:23 <JayF> It sounds like consensus is no though; so for this topic you all can move on. I wouldn't want you all to host it unless everyone was onboard, anyway.
19:47:39 <clarkb> I don't think we necessarily need to resurrect bugday the code base, but I think if opendev hosts something it should be bugday the spiritual successor tool and not an ironic specific tool
19:47:49 <fungi> i think it can be "not yet" instead of just "no"?
19:48:26 <fungi> also i'm not opposed to booting a vm for them to run it on themselves, while they work on building consensus across other teams to possibly make it useful beyond ironic's use case
19:48:39 <JayF> I just sent an email to the mailing list, last week, about how a cornerstone library to OpenStack is rotting and ~nobody noticed. I'm skeptical someone is going to take up the banner of uniting bug dashboards across openstack.
19:48:54 <JayF> fungi: I do not commit to building such a consensus. I commit to being open to accepting patches.
19:49:12 <fungi> with the expectation that if opendev is going to officially take it on, then there will need to be more of a cross-project interest (and of course configuration management and tests)
19:49:26 <clarkb> ya I'm far less concerned with booting a VM and adding a DNS record
19:49:38 <JayF> fungi: not trying to be harsh; just trying to set a reasonable expectation to be clear :)
19:49:45 <JayF> my plate is overflowing and I can't fit another ounce on it
19:49:58 <fungi> sure. and we've all been there more than once, i can assure you ;)
19:50:48 <fungi> JayF: so there are some options and stipulations you can take back to the ironic team for further discussion, i guess
19:51:31 <JayF> If you want to give us a VM and a DNS name, that will work for us. If not, I'll go get equivalent from my downstream/personal resources and my next steps are the same either way
19:51:40 <corvus> i'm not sure i'm a fan of the "boot a vm and hand it over" approach
19:51:42 <corvus> if a vm is going to be handed over, i don't see why that's an opendev/infra team ask... i don't feel like we're here to hand out vms, we're here to help facilitate collaboration.  anyone can propose a patch to run a service if the service fits the mission.  so if it does fit the mission, that's how it should be run.  and if it doesn't, then it shouldn't be an opendev conversation.
19:53:05 <fungi> should we not have provided the vm for the log ingestion system that loosely replaced the old logstash system? mistake in your opinion, or failed experiment, or...?
19:53:33 <corvus> i thought that ran on aws or something
19:54:08 <clarkb> the opensearch cluster runs in aws, but there is a node that fetches logs and sends them to opensearch that dpawlik is managing
19:54:11 <fungi> the backend does, but the custom log ingestion glue to zuul's interface is on a vm we booted for the systadmins
19:54:31 <fungi> er, s/systadmins/admins of that service/
19:54:57 <corvus> i was unaware of that, and yeah, i think that's the wrong approach.  for one, the fact that i'm a root member unaware of it and it's not documented in https://docs.opendev.org/opendev/system-config/latest/ seems like a red flag.  :)
19:56:06 <corvus> that seems like something that fits the mission and should be run in the usual manner to me
19:56:14 <clarkb> ya better documentation of the exceptional node(s) is a good idea
19:56:34 <fungi> and possibly also deciding as a group that exceptions are a bad idea
19:56:49 <corvus> i think the wiki is an instructive example here too
19:57:18 <JayF> One thing I'll note that is a key difference about the service I proposed (and I suspect that logstash service) is their stateless nature.
19:57:29 <fungi> the main takeaway we had from the wiki is that we made it clear we would not take responsibility for the services running the log search service
19:57:41 <JayF> It doesn't address the basic philosophical questions; but it does draw a different picture than something like the wiki does.
19:57:54 <fungi> and that if the people maintaining it go away, we'll just turn it off with no notice
19:58:09 <corvus> yeah, in both new cases running them is operationally dead simple
19:59:28 <clarkb> (side note I think the original plan was to run the ingestion on the cluster itself but then realized that you can't really do that with the opensearch as a service)
19:59:51 <corvus> i must have gotten the first version of the memo and not the update
20:00:08 <clarkb> because they delete and replace servers or something for upgrades. It's basically an appliance
20:00:19 <clarkb> we are at time.
20:00:27 <clarkb> #topic Upgrade Server Pruning
20:00:39 <clarkb> #undo
20:00:39 <opendevmeet> Removing item from minutes: #topic Upgrade Server Pruning
20:00:50 <clarkb> #topic Backup Server Backup Pruning
20:01:02 <clarkb> really quickly before we end I wanted to note that the rax backup server needs its backups pruned due to disk utilization
20:01:28 <clarkb> Maybe that is something tonyb wants to do with another root (ianw set it up and documented and scripted it well so it's mostly a matter of going through the motions)
20:01:45 <tonyb> Yup happy to.
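The documented process on the backup server is the thing to follow; under the hood it is essentially a borg prune per backed-up host, along these lines (the retention values and repository path are illustrative, not our actual policy):

    # run on the backup server, once per host repository
    borg prune --list --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /opt/backups/<host>/backup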
20:01:56 <clarkb> #topic Open Discussion
20:02:08 <fungi> i'm also happy to help tonyb if there are questions about backup pruning
20:02:31 <clarkb> We don't really have time for this but feel free to take discussion to #opendev or service-discuss@lists.opendev.org to bring up extra stuff and/or keep talking about the boot a VM and hand it over stuff
20:02:35 <tonyb> fungi: thanks.
20:02:42 <clarkb> and happy 1700000000 day
20:02:50 <fungi> woo!
20:02:52 <clarkb> I think we are about 2 hours away?
20:02:54 <clarkb> something like that
20:03:04 <clarkb> thank you everyone for your time!
20:03:06 <clarkb> #endmeeting