19:00:25 #startmeeting infra
19:00:25 Meeting started Tue Jan 9 19:00:25 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:25 The meeting name has been set to 'infra'
19:00:32 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/7IXDFVY34MYBW3WO2EEU3AIGOLAL6WRB/ Our Agenda
19:00:39 It's been a little while since we had these regularly
19:01:55 #topic Announcements
19:02:06 The OpenInfra Foundation Individual Board member election is happening now. Look for your ballot via email and vote.
19:02:27 This election also includes bylaw amendments to make the bylaws less openstack specific
19:02:44 If you expected to have a ballot and can't find it, please reach out. There may have been email delivery problems
19:03:52 Separately we're going to feature OpenDev on the OpenInfra Live stream/podcast/show (I'm not sure exactly how you'd classify it)
19:04:05 That will happen on January 18th at 1500 UTC?
19:04:21 I know the day is correct but not positive on the time. Feel free to tune in
19:04:47 clarkb: i think the kids are calling it a "realplayer tv show" now ;)
19:04:47 also some streaming platforms have the ability for you to heckle us and ask questions
19:06:58 #topic Topics
19:07:02 #topic Server Upgrades
19:07:11 I believe that tonyb has gotten all of the mirror nodes upgraded at this point
19:07:27 Not sure if tonyb is around for the meeting, but I think the plan was to look at meetpad servers next
19:08:00 Correct
19:08:47 I started looking at meetpad. One thing that worries me a little is I can't quite see how we add the jvb nodes to meetpad
19:09:07 tonyb: it should be automated via configuration somehow
19:09:13 tonyb: I can look into that after the meeting
19:09:33 it seems to just be "magic" and I don't want any new jvb nodes to auto register with the existing meetpad
19:09:38 clarkb: Thanks
19:09:48 tonyb: yes it should be magic and it happens via xmpp iirc
19:09:56 we've scaled up and down if you look at git history
19:10:16 Ah okay.
19:10:26 so ya one approach would be to have a new jvb join the old meetpad, then replace the old meetpad and have the new jvb join the new thing. Or update config management to allow two side by side installations then update dns
19:10:41 we'll need to sort out how the magic happens in order to make a decision on approach I think
19:10:59 That was my thinking
19:12:00 (i think a rolling replacement sounds good, but i haven't thought about it deeply)
19:12:16 I also looked at mediawiki and I'm reasonably close to starting that server. translate looks like we'll just turn it off when i18n are ready, but I'm trying to help them with new weblate tools
19:12:25 (just mostly that since we're not changing any software versions, we'd expect it to work)
19:12:36 so that leaves cacti and storyboard to look at
19:12:58 tonyb: we've got a spec to add a prometheus and some agents on servers to replace cacti which is one option there
19:13:10 but maybe the easiest thing right now is to just uplift cacti? I don't know
19:13:12 cacti was in theory going to be retired in favor of prometheus
19:13:20 yeah that
19:13:35 I think the main issue with prometheus was figuring out the agent stuff. Running the service to collect the data is straightforward
19:14:01 Okay, I know ianw was thinking prometheus would be a good place for me to start so I'd be happy to look at that
19:14:54 alright let's move on, we have a fair number of things to discuss and it sounds like we're continuing to make progress there. Thanks!
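
For context on the agent question above: the usual pattern for this kind of setup is to run a small exporter (for example node_exporter from the prometheus-community project) on each server and have a central Prometheus scrape it. A minimal scrape configuration sketch, with purely illustrative hostnames, might look like:

    # prometheus.yml fragment (sketch only; hostnames are hypothetical)
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              - 'mirror01.example.org:9100'    # node_exporter's default port
              - 'meetpad01.example.org:9100'

The open question in the discussion is mainly which agent to standardize on and how to deploy it from config management, not the central collection service.
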
19:15:02 #topic Python container updates
19:15:21 The zuul registry service migrated to bookworm images so I've proposed a change to drop the bullseye images it was relying on
19:15:27 #link https://review.opendev.org/c/opendev/system-config/+/905018 Drop Bullseye python3.11 images
19:15:40 That leaves us with zuul-operator on the bullseye python3.10 images as our last bullseye container images
19:16:26 #topic Upgrading Zuul's DB server
19:16:36 I realized while prepping for this meeting that I had completely spaced on this.
19:16:58 It happens at this time of year ;P
19:16:59 However, coincidentally hacker news had a post about postgres options recently
19:17:01 #link https://www.crunchydata.com/blog/an-overview-of-distributed-postgresql-architectures a recent rundown of postgresql options
19:17:18 I haven't read the article yet, but figured I should as a good next step on this item
19:17:59 did anyone else have new input to add?
19:18:59 * tonyb shakes head
19:19:23 #topic EMS discontinuing legacy consumer hosting plans
19:19:41 fungi indicated that at the last meeting the general consensus was that we should investigate a switch to the newer plans
19:20:13 fungi: have we done any discussion about this on the foundation side yet? I'm guessing we need a general ack there then we can reach out to element about changing the deployment type?
19:20:56 they indicated in the notice that they'd let folks on the old plan have a half-normal minimum user license
19:21:25 i did some cursory talking to wes about it and it sounded like they'd be able to work it in for 2024
19:21:41 we would have to pay for a full year up front though
19:21:53 I don't expect we'll stop using matrix anytime soon
19:21:58 so that seems fine from a usage standpoint
19:22:20 right, since we're supporting multiple openinfra projects with it, the cost is fairly easy to justify
19:22:26 fungi: in that case I guess we should reach out to Element. IIRC the email gave a contact for the conversion
19:22:45 maybe double check with wes that nothing has changed in the last few weeks before sending that email
19:22:52 * clarkb scribbles a note to do this stuff
19:22:55 will do
19:23:11 Also gives us this year to test self-hosting a homeserver
19:23:16 we've still got about a month to sort it
19:23:24 right we have until February 7
19:24:14 do we really want to test self-hosting? also, would we get an export from element that would allow moving and keeping rooms and history?
19:24:39 no export is needed; the system is fully distributed
19:24:58 they provided a link to a migration document in the email too
19:25:00 trying to find it
19:25:00 but they do have a settings export we can use too
19:25:10 https://ems-docs.element.io/books/element-cloud-documentation/page/migrate-from-ems-to-self-hosted
19:25:22 basically the homeserver config
19:25:36 so you start a new homeserver with the same name and the rooms just magically migrate?
19:25:38 frickler: I think it's something to investigate during the year. Gives us more information for making a long term decision
19:25:53 we "own" the room names so it would largely be history and room config to worry about aiui
19:26:25 the rooms and their contents exist on all matrix servers involved in the federation (typically homeservers of users in those rooms)
19:27:46 if the history is exported, cool, but in theory i think a replacement server should be able to grab the history from any other server
19:28:12 oh interesting. So if you stand up a new server and have the well known file say it is the :opendev.org homeserver then clients will talk to the new server. That new server will sync out of the federated state the history of its rooms
19:28:37 that's what i'd expect. i have not tested it.
19:29:06 ack. Also looks like we can copy databases per the ems migration doc should that be necessary
19:29:10 (you'd just need to use one of the other room ids initially)
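
As a concrete illustration of the "well known file" mechanism referenced above (this follows the Matrix specification; the replacement hostname is hypothetical): federation delegation for the opendev.org server name is done with /.well-known/matrix/server, and client discovery with /.well-known/matrix/client, roughly:

    # served at https://opendev.org/.well-known/matrix/server (sketch)
    {"m.server": "new-homeserver.example.org:443"}

    # served at https://opendev.org/.well-known/matrix/client (sketch)
    {"m.homeserver": {"base_url": "https://new-homeserver.example.org"}}

Repointing these files at a replacement homeserver is what would let clients and other federated servers find the new deployment while keeping the existing :opendev.org user and room IDs.
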
19:29:41 but i'm still in no rush to self-host.
19:29:44 in any case figuring that out is a next step. First up is figuring out a year of hosting
19:29:58 and if that is reasonable, which I can help coordinate with fungi at the foundation and in talking to element
19:30:04 #topic Followup on haproxy update being broken
19:30:37 There was a lot of info under this item but the two main points seem to be "should we be more explicit about the versions of docker images we consume" and "should we prune less aggressively"
19:30:48 (like, i'm not looking at ems as an interim step based on our conversations so far -- but i agree that keeping aware of future options is good)
19:31:13 I think for haproxy in particular we can and should probably stick with their lts tag
19:31:18 i think we mostly covered the haproxy topic at the last meeting, but happy to revisit since not everyone was present
19:31:28 ++lts tag
19:31:35 fungi: ack. I wanted to bring up one thing primarily on pruning
19:31:58 One gotcha with pruning is that it seems to be based on the image build/creation time not when you started using the newer image(s)
19:32:16 right, note that we hadn't actually pruned the old haproxy image we downgraded to; when i did the manual config change and pulled, it didn't need to retrieve the image
19:32:18 and so it is a bit of a clunky tool, but better than nothing for images like haproxy for example where we could easily revert
19:32:57 I'm happy for us to extend the time we keep images, but also be aware of this limitation with the pruning command
19:33:01 i'm ambivalent about pruning because i'm not worried about not being able to pull an old version from a registry on demand
19:33:29 the main thing it might offer is insurance against upstreams deleting their images
19:33:46 but i don't think that's actually been an issue we've encountered yet?
19:33:49 one concern of mine was being able to find out which last version it actually was that we were running
19:33:51 i'm not eager to run an image that upstream has deleted either
19:34:21 frickler: yes, if we could add some more verbosity around our image management, that could help
19:34:29 frickler: we could update our ansible runs to do something like a docker ps -a and docker image list
19:34:37 and record that in our deployment logs
19:34:45 even if it's just something that periodically interrogates docker for image ids and logs them to a file
19:34:55 or yeah that
19:35:02 maybe even somewhere more persistent than zuul build logs would be good
19:35:05 i agree with frickler that leaving an image sitting around for some number of days provides a good indication of what we were probably running before
19:36:07 ok so the outstanding need is better records of what docker images we ran during which timeframes
19:36:15 (we could stick version numbers in prometheus; it's not great for that though, but it's okay as long as they don't change too often)
19:36:45 ya this will probably require a bit more brainstorming
19:36:48 (the only way to do that with prometheus increases the cardinality of metrics with each new version number)
19:37:05 maybe start with the simple thing of having ansible record a bit more info then try and improve on that for longer term retention
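
A minimal sketch of the "simple thing" suggested above, assuming a cron job or Ansible task that appends a timestamped snapshot of the docker state to a local file (the log path is hypothetical):

    # record which containers and images are present, with digests, for later forensics
    date --iso-8601=seconds >> /var/log/docker-image-inventory.log
    docker ps --all --no-trunc >> /var/log/docker-image-inventory.log
    docker image list --digests >> /var/log/docker-image-inventory.log

Appending rather than overwriting keeps a rough history of which image IDs were on each host during which timeframes; retention somewhere more durable than the host itself is the part that still needs brainstorming.
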
19:38:17 I'll continue on as we have a few more items to discuss
19:38:19 #topic Followup on haproxy update being broken
19:38:38 Similar to the last one I'm not sure if this reached a conclusion but two things worth mentioning have happened recently. First zuul's doc quota was increased
19:38:57 that's the topic we just had?
19:39:02 bah yes
19:39:05 #undo
19:39:05 Removing item from minutes: #topic Followup on haproxy update being broken
19:39:10 #topic AFS Quota issues
19:39:13 copy and paste failure
19:39:28 * fungi is now much less confused
19:39:40 Second is that there are some early discussions around having openeuler be more involved with opendev and possibly contributing some CI resources
19:39:52 the zuul project quota was increased (not doc I think)
19:40:03 frickler: ya it hosts the zuul docs iirc
19:40:07 and website?
19:40:30 IIUC the release artefacts
19:40:45 There may be an opportunity to leverage this interest in collaboration to clean up the openeuler mirrors and give feedback to them on the growth problems
19:40:50 everything under zuul-ci.org is on one volume
19:40:51 zuul's docs are part of its project website
19:40:54 yeah that
19:40:55 and i increased it to 5gb
19:40:56 ahah
19:41:10 essentially work with the interested parties to improve the situation around mirrors for openeuler and maybe our CI quotas
19:41:52 responding to their latest queries about the sizes of VMs and how many is on my todo list after meetings and lunch
19:42:06 (you know we write that stuff down in a document but 100% of the time the questions get asked anyway)
19:42:07 do you have a reference to those openeuler discussions or are they private for now?
19:42:11 they have an openstack cloud?
19:43:06 frickler: I think keeping the email discussion small while we sort out if it is even possible is good, but once we know if it will go somewhere we can do that more publicly
19:43:38 corvus: yes sounds like it? We tried to be explicit that what we need is an openstack api endpoint and accounts that can provision VMs
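
To make "an openstack api endpoint and accounts that can provision VMs" concrete, what that boils down to on our side is roughly a clouds.yaml entry plus enough quota for the agreed number and size of test VMs; every value in this sketch is a placeholder:

    # clouds.yaml sketch; all values are hypothetical
    clouds:
      openeuler:
        auth:
          auth_url: https://cloud.example.org:5000/v3
          username: opendev-ci
          password: <secret>
          project_name: opendev-ci
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne
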
19:43:50 yeah, I just wanted to know whether I missed something somewhere
19:43:52 for transparency: openeuler representatives were in discussion with openinfra foundation staff members and offered to supply system resources, so the foundation staff are trying to put them in touch with us to determine more scope around it
19:43:58 it's all been private discussions so far
19:44:07 neat
19:44:32 were there other outstanding afs quota concerns to discuss?
19:44:47 since openstack is a primary use case for their distro, they have a vested interest in helping test openstack upstream on it
19:45:04 some other mirror volumes need watching
19:45:55 for centos stream I seem to recall digging around in those mirrors and we end up with lots of packages with many versions
19:46:05 centos-stream and ubuntu-ports look very close to their limit
19:46:10 in theory we only need the newest 2 to avoid installation failures
19:46:28 we could potentially write a smarter syncing script that scanned through and deleted older versions
19:46:59 for ubuntu ports I had thought we were still syncing old versions of the distro that we could delete but we aren't so I'm not sure what we can do there
19:47:15 are we syncing more than arm64 packages maybe? like 32bit arm and/or ppc? I think not
19:48:22 I don't think we have time to solve that in this meeting. Let's continue on as we have ~3 more topics to cover
19:48:28 #topic Broken wheel build issues
19:48:42 I don't know, I just noticed these issues when checking whether we have room to mirror rocky
19:48:48 frickler: ack
19:49:03 it's also possible that dropping old releases from our config isn't cleaning up the old packages associated with them
19:49:14 fungi: oh interesting. Worth double checking
19:49:38 for wheels I think we can stop building and mirroring them at any time because pip will prefer new sdists over old wheels right? so we don't even need to update the pip.conf in our test nodes
19:49:53 correct
19:50:05 fungi: ^ you probably know off the top of your head if that is the case. But my main concern would be that we start testing older stuff accidentally if we stop building wheels
19:50:09 unless you pass the pip option to prefer "binary" packages (wheels)
19:50:13 right
19:50:29 but it's not on by default
19:50:49 i'd treat that as a case of caveat emptor
19:50:50 in that case I think it is reasonable to send email to the service announce list indicating we plan to stop running those jobs in the future (say beginning of February), ask if anyone is interested in keeping them alive, and if not jobs will fall back to building from source
19:51:14 the fallback is slower and may require some bindep file updates but it isn't going to hard stop anyone from getting work done on centos distros
19:51:32 wfm
19:51:33 will we also clean out existing wheels at the same time? maybe keep the afs volume but not publish anymore?
19:52:00 frickler: I think we should keep the content for a bit as some of the existing wheels may be up to date for a while
19:52:04 we could probably do it in phases
19:52:29 ok
19:52:48 since pip's behavior is acceptable by default here we can still take advantage of the remaining benefit from the mirror for a bit
19:52:56 then maybe after 6-12 months clean it up
19:53:22 alright next topic
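
To illustrate the pip behavior discussed above: the wheel mirror is just an additional index, and because prefer-binary is not enabled, pip's default candidate ordering picks the newest available release even when it only exists as an sdist, so stale wheels in the mirror cannot mask newer source releases. A sketch of how a test-node pip.conf might be laid out (the mirror URL is illustrative, not the exact production configuration):

    [global]
    extra-index-url = https://mirror.example.org/wheel/ubuntu-22.04-x86_64/
    # prefer-binary is deliberately left unset; enabling it would make pip favor
    # an older wheel over a newer sdist, which is exactly the behavior we want
    # to avoid once the wheel build jobs stop.
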
19:53:28 #topic Gitea repo-archives filling server disk
19:53:32 fwiw, the python ecosystem has gotten a lot better about making cross-platform wheels for releases of things now, and in a more timely fashion
19:53:49 so our pre-built wheels are far less necessary
19:53:58 when you ask gitea for a repo archive (tarball/zip/.bundle) it caches that on disk
19:54:28 then once a day it runs an internal cron task (using a go library implementation of cron, not system cron) to clean up any repo archives that are more than a day old
19:54:34 oh, yeah this is a fun one. i'd somehow already pushed it to the back of my mind
19:54:41 can we disable that functionality? we do have our own tarballs instead (at least for openstack)?
19:54:43 i'm guessing people do that a lot to get releases even though like zero opendev projects make releases that way?
19:54:51 what frickler said :)
19:55:03 s/people/web crawlers/ i think
19:55:04 upstream indicated it could be web crawlers
19:55:11 so their suggestion was to update our robots.txt
19:55:15 #link https://review.opendev.org/c/opendev/system-config/+/904868 update robots.txt on upstream's suggestion
19:55:19 and no we can't disable the feature
19:55:29 at least I haven't found a way to do that
19:55:43 the problem is the daily cleanup isn't actually cleaning up everything more than a day old
19:56:15 I've spent a bit of time rtfs'ing and looking at the database and I can't figure out why it is broken, but you can see on gitea12 that it falls about 4 hours behind each time it runs so we end up leaking and filling the disk
19:56:40 In addition to reducing the number of archives generated by asking bots to leave them alone we can also run a cron job that simply deletes all archives
19:56:45 #link https://review.opendev.org/c/opendev/system-config/+/904874 Run weekly removal of all cached repo archives
19:56:48 does gitea break if we make the cache non-writeable?
19:57:06 frickler: I haven't tested that but I would assume so. I would expect a 500 error when you request the archive
19:57:22 which would also be like disabling it kind of
19:57:25 i suppose it depends on your definition of "break" ;)
19:57:34 since we are already trying to delete archives more than a day old, deleting all archives once a week on the weekend seems safe
19:57:46 and when you ask it to delete all archives it does successfully delete all archives
19:58:15 I would prefer we not intentionally create 500 errors
19:58:20 there are valid reasons to get repo archives
19:58:51 I also noticed when looking at the cron jobs that gitea has a cron job that phones home to check if it is running the latest release
19:59:03 the cron might have a small window of breakage, but should immediately work on a retry so lgtm
19:59:13 I pushed https://review.opendev.org/c/opendev/system-config/+/905020 to disable that cron job because I hate the idea of a phone home for that
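
For reference on the robots.txt approach (the change linked above has the actual content; this is only an illustrative sketch): gitea serves the cached archives under paths like /<org>/<repo>/archive/<ref>.tar.gz, so a wildcard rule along these lines asks well-behaved crawlers to skip them:

    User-agent: *
    Disallow: /*/*/archive/

Wildcard matching in Disallow rules is a de facto extension honored by the major crawlers rather than part of the original robots.txt standard, which is why the weekly deletion cron remains a useful backstop.
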
20:00:22 our hour is up and I have to context switch to another meeting
20:00:33 #topic Service Coordinator Election
20:01:02 really quickly before I end the meeting I wanted to call out that we're approaching the service coordinator election timeframe. I need to dig up emails to determine when I said that would happen (I believe it is end of January/early February)
20:01:38 nothing for anyone to do at this point other than consider if they wish to assume the role and nominate themselves. And I'll work to get things official via email
20:01:38 If it matches openstack PTL/TC elections then they'll start in Feb
20:01:46 tonyb: it's slightly offset
20:01:52 okay
20:01:54 #topic Open Discussion
20:02:00 Anything else important before we call the meeting?
20:03:13 nope
20:03:17 sounds like no. Thank you everyone for your time and help running the opendev services!
20:03:24 we'll be back next week same time and location
20:03:27 #endmeeting