19:00:15 <clarkb> #startmeeting infra
19:00:15 <opendevmeet> Meeting started Tue Nov 28 19:00:15 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:15 <opendevmeet> The meeting name has been set to 'infra'
19:00:22 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/3B75BMYDEBIQ56DW355IGF72ZH6JVVQI/ Our Agenda
19:00:28 <clarkb> #topic Announcements
19:00:38 <clarkb> I didn't have anything to announce.
19:01:27 <clarkb> OpenInfra Foundation individual board member seat nominations are open now
19:01:43 <clarkb> if that interests you I'm sure we can point you in the right direction
19:02:44 <clarkb> I'll give it a couple more minutes before diving into the agenda
19:05:04 <clarkb> #topic Server Upgrades
19:05:11 <clarkb> tonyb continues to push this along
19:05:18 <clarkb> #link https://review.opendev.org/q/topic:%22mirror-distro-updates%22+status:open
19:05:31 <clarkb> there are three mirrors all booted and ready to be swapped in now. Just waiting on reviews
19:06:15 <clarkb> one thing tonyb and I discovered yesterday is that the launcher venv cannot create new volumes in rax. We had to use fungi's xyzzy env for that. The xyzzy env cannot attach the volume :/
19:06:26 <clarkb> fungi: so maybe don't go cleaning up that env anytime soon :)
19:07:24 <clarkb> tonyb: once we get those servers swapped in we'll need to go through and clean out the old servers too. I'm happy to sit down for that and we can go over some other root topics as well
19:07:50 <fungi> yeah, a good quiet-time project for someone might be to do another round of bisecting sdk/cli versions to figure out what will actually work
19:08:33 <fungi> i think the launch venv might be usable for all those things? and we just didn't try it for volume creation
19:08:48 <fungi> but then ended up using it for volume attachment
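(A minimal sketch of the version-bisection fungi mentions above, assuming throwaway venvs with pinned releases; the version numbers and resource names are placeholders, not known-good combinations:)

    # one disposable venv per candidate openstackclient/openstacksdk release
    python3 -m venv /tmp/osc-bisect
    /tmp/osc-bisect/bin/pip install 'python-openstackclient==6.2.0' 'openstacksdk==1.4.0'
    # exercise the operations that have been failing against rax
    /tmp/osc-bisect/bin/openstack --os-cloud rax volume create --size 100 test-volume
    /tmp/osc-bisect/bin/openstack --os-cloud rax server add volume test-server test-volume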
19:08:50 <tonyb> clarkb: Yup.  It'd be good to have extra eyes on that
19:09:01 <frickler> iiuc the intention for latest sdk/cli is still to support rax, so reporting bugs if things don't work would be an option, too
19:09:06 <clarkb> fungi: yes, the launch env worked for everything but volume creation. volume creation failed
19:10:29 <clarkb> frickler: ya we can also run with the --debug flag to see what calls are actually failing
19:10:30 <tonyb> frickler: I think so.  The challenge is the CLI/SDK team don't have easy access to testing (yet)
19:11:12 <clarkb> anyway reviews for the current set of nodes would be good so we can get them in place and then figure out cleanup of the old nodes
19:11:15 <clarkb> anything else related to this?
19:11:35 <frickler> tonyb: that's why feedback from us would be even more valuable
19:12:18 <tonyb> frickler: fair point
19:12:28 <tonyb> clarkb: not from me.
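(Re the --debug suggestion above, a hedged example of surfacing the underlying API calls; the cloud name and volume parameters are placeholders:)

    # --debug makes the client log the REST requests and responses, so the
    # exact call that rax rejects (create vs attach) shows up in the output
    openstack --debug --os-cloud rax volume create --size 100 mirror-volume 2> volume-create-debug.log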
19:12:46 <clarkb> #topic Python Container Updates
19:12:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/898756 And parent add python3.12 images
19:13:13 <clarkb> At this point I think adding python3.12 images is the only action we can take as we are still waiting on the zuul-operator fixups. I have not personally had time to look into that more closely
19:13:31 <clarkb> That said I don't think anything should stop us from adding those images
19:13:53 <tonyb> Neither have I.  It's in the "top 5" items on my todo list
19:14:44 <clarkb> #topic Gitea 1.21
19:14:59 <clarkb> Gitea just released a 1.20.6 bugfix release that we should upgrade to prior to upgrading to 1.21
19:15:04 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/902094 Upgrade gitea to 1.20.6 first
19:15:17 <clarkb> They also made a 1.21.1 release which I bumped our existing 1.21 change to
19:15:47 <clarkb> in #opendev earlier today we said we'd approve the 1.20.6 update after this meeting. I think that still works for me, though I will be popping out from about 2100-2230 UTC
19:16:28 <clarkb> My hope is that later this week (maybe thursday at this rate?) I'll be able to write a change for the gerrit half of the key rotation and then generate a new key and stash it in the appropriate locations
19:16:45 <tonyb> Sounds good.
19:16:46 <clarkb> That said the gitea side of key rotation is ready for review and landable as is:
19:16:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/901082 Support gitea key rotation
19:17:11 <clarkb> The change there is set up to manage the single existing key and we can do a followup to add the new key
19:18:14 <clarkb> for clarity I think the rough plan here is 0) upgrade to 1.20.6 1) add gitea key rotation support 2) add gerrit key rotation support 3) add new key to gitea 4) add new key to gerrit 5) use new key in gerrit 6) remove old key from gitea (and gerrit?) 7) upgrade gitea
19:18:27 <clarkb> steps 0) and 1) should be good to go.
19:19:37 <tonyb> Seems like a plan. FWIW, I'll look again at 0 and 1
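(A hedged sketch of generating the new key for steps 3-5 of the plan above; the key type, filename, and comment are illustrative assumptions, not what was actually decided:)

    # fresh keypair for gerrit->gitea replication, no passphrase since gerrit
    # uses it non-interactively; stash it wherever our private hostvars live
    ssh-keygen -t ed25519 -N '' -C 'gerrit-replication-2023' -f gerrit-replication-2023
    # the public half gets added to gitea (step 3), the private half wired into
    # gerrit's replication config (steps 4-5), then the old key is dropped (step 6)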
19:19:51 <clarkb> #topic Upgrading Zuul's DB Server
19:19:59 <clarkb> #link https://etherpad.opendev.org/p/opendev-zuul-mysql-upgrade info gathering document
19:20:12 <clarkb> I haven't had time to dig into db cluster options yet
19:20:55 <frickler> I'm wondering whether we could reuse what kolla does for that
19:21:01 <clarkb> Looking at the document it seems like some conclusions can be made though. Backups are not currently critical; the database size is about 18GB uncompressed, so the server(s) don't need to be large; and the database should not be hosted on zuul nodes because we auto upgrade zuul nodes
19:22:03 <clarkb> frickler: that is an interesting idea.
19:22:10 <tonyb> Yup.  Given it won't be on any zuul servers I guess the RAM requirements are less interesting
19:22:29 <tonyb> frickler: Can you drop some pointers?
19:22:39 <fungi> also we can resize the instances if we run into memory pressure
19:23:40 <frickler> I need to look up the pointers in the docs, but in general there is quite a bit of logic in there to make things like upgrades work without interruption
19:23:55 <fungi> at one point we had played around with percona replicating to a hot standby
19:23:59 <clarkb> yes you need a lot of explicit coordination unlike say zookeeper
19:24:12 <clarkb> and you have to run a proxy
19:24:18 <fungi> may have relied on ndb?
19:25:30 <frickler> kolla uses either haproxy or proxysql
19:25:39 <clarkb> I don't remember that. The zuul-operator uses percona xtradb cluster and I think kolla uses galera
19:25:47 <clarkb> which are very similar backends and then ya a proxy in front
19:25:48 <corvus> looks like kolla may use galera.  that's one of the options (in addition to percona xtradb, and whatever postgres does for clustering these days)
19:26:18 <corvus> i don't think ndb is an option due to memory requirements
19:26:52 <fungi> i trust the former mysql contributors in these matters, i'm mostly database illiterate
19:27:11 <clarkb> one thing we should look at too is whether or not we can promote an existing mysql/mariadb to a galera/xtradb cluster, and similar with postgres
19:27:14 <corvus> (and the sort of archival nature of the data seems like not a great fit for ndb; though it is my favorite cluster tech just because of how wonderfully crazy it is)
19:27:35 <clarkb> then one option we may have is to start with a spof, which isn't a regression, then later add in the more complicated load balanced cluster
19:27:57 <fungi> in theory the trove instance is already a spof
19:28:04 <clarkb> yes that is why this isn't a regression
19:28:17 <corvus> clarkb: i think that's useful to know, but in all cases, a db migration for us won't be too burdensome
19:28:42 <corvus> worst case we're talking like an hour on a weekend for an outage if we want to completely change architecture
19:29:31 <clarkb> good point
19:30:32 <corvus> (so i agree, a good plan might look like "move to a spof mariadb and then make it better later" but also it's not the end of the world if we decide "move to a spof mariadb then move to a non-spof mariadb during a maint window")
19:31:14 <corvus> anyway, seems like a survey of HA options is still on the task list
19:31:14 <clarkb> fwiw it looks like postgres ha options are also fairly involved and require you to manage fault identification and failover
19:31:40 <clarkb> ++ let's defer any decision making until we have a bit more data. But I think we're leaning towards running our own system on dedicated machine(s) at the very least
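(To make the "spof first" option concrete: a minimal sketch of a single-node mariadb on a dedicated host, shown as a plain docker run for illustration; the image tag, names, and credentials are placeholders, and a galera/xtradb cluster plus proxy could replace it later:)

    # standalone (non-clustered) mariadb for the zuul db; ~18GB of data means a
    # modest flavor with a volume mounted at /var/lib/zuul-db should suffice
    docker run -d --name zuul-db \
      -e MARIADB_ROOT_PASSWORD=change-me \
      -e MARIADB_DATABASE=zuul \
      -e MARIADB_USER=zuul \
      -e MARIADB_PASSWORD=change-me \
      -v /var/lib/zuul-db:/var/lib/mysql \
      -p 3306:3306 \
      mariadb:10.11
    # firewalling/TLS between the zuul schedulers/web and this host is left out here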
19:32:05 <clarkb> #topic Annual Report Season
19:32:12 <clarkb> #link https://etherpad.opendev.org/p/2023-opendev-annual-report OpenDev draft report
19:32:50 <clarkb> I've done this for a number of years now. I'll be drafting a section of the openinfra foundation's annual report that covers opendev
19:33:08 <clarkb> I'm still in the brainstorming just get something started phase but feel free to add items to the etherpad
19:33:38 <clarkb> Once I've actually written something I'll ask for feedback as well. I think they want them written by the 22nd of december. Something like that
19:33:53 <fungi> i'll try to get preliminary 2023 engagement report data to you soon, though the mailing list measurements need to be reworked for mm3
19:34:10 <tonyb> Okay so there is some time, but not lots of time
19:34:38 <clarkb> tonyb: ya it's a bit earlier than previous years too. Usually we have until the first week of january
19:35:04 <fungi> final counts of things get an exception beyond december for obvious reasons, but the prose needs to be ready with placeholders or preliminary numbers
19:35:39 <tonyb> clarkb: Hmm okay
19:36:58 <clarkb> it's a good opportunity to call out work you've been involved in :)
19:37:08 <clarkb> definitely add those items to the brainstorm list so I don't forget about them
19:37:15 <clarkb> #topic Open Discussion
19:37:15 <fungi> yeah, whatever you think we should be proud of
19:37:38 <clarkb> tonyb stuck the idea of making it possible for people to run unittest jobs using python containers under open discussion
19:37:58 <corvus> what's the use case that's motivating this?  is it that someone wants to run jobs on containers instead of vms?  or that they want an easier way to customize our vm images than using dib?
19:38:21 <tonyb> Yeah, I just wanted to get a feel for what's been tried.  I guess there are potentially 2 "motivators"
19:39:04 <tonyb> 1) a possibly flawed assumption that we could do more unit tests in some form of container system as the startup/reset costs are lower?
19:39:45 <tonyb> 2) making it possible, if not easy, for the community to test newer pythons without the problems of chasing unstable distros
19:40:04 <fungi> where would those containers run?
19:40:15 <corvus> ok!  for 1 there are a few things:
19:40:34 <tonyb> Well that'd be part of the discussion.
19:41:04 <corvus> - in opendev, we're usually not really limited by startup/recycle time.  most of our clouds are fast.
19:41:26 <corvus> (and we have enough capacity we can gloss over the recycle time)
19:41:46 <clarkb> also worth noting that the last time I checked we utilize less than 30% of our total available resources on a long term basis
19:42:06 <tonyb> We could make zuul job templates like openstack-tox-* to set up a VM with an appropriate container runtime and run tox in there
19:42:18 <clarkb> from an efficiency standpoint we'd need to cut our resource usage down to about 1/3 if we use always-on/running container runners, since that dedicated capacity would be consumed continuously rather than only ~30% of the time
19:42:39 <tonyb> but that would negate the 1st motivator
19:42:55 <corvus> - nevertheless, nodepool and zuul do support running jobs in containers via k8s or openshift.  we ran a k8s cluster for a short time, but running it required a lot of work that no one had time for.  only one of our clouds provides a k8s aas, so that doesn't meet our diversity requirements
19:43:25 <tonyb> Both fair points
19:43:35 <corvus> that ^ goes to fungi's point about where to run them.  i don't think the answer has changed since then, sadly :(
19:44:00 <tonyb> Okay.
19:44:15 <corvus> yeah, we could write jobs/roles to pull in the image and run it, but if we do that a lot, that'll be slow and drive a lot of network traffic
19:44:35 <corvus> if the motivation is to expand python versions, we might want to consider new dib images with them?
19:44:50 <clarkb> part of the problem with that is dib images are extremely heavyweight
19:44:52 <corvus> i think there was talk of using stow to have a bunch of pythons on one image?
19:45:09 <clarkb> they are massive (each image is like 50GB * 2 of storage) and uploads are slow to certain clouds
19:45:28 <clarkb> ya so we could colocate instead.
19:45:58 <clarkb> My hesitancy here is that in the times when we've tried to make it easier for the projects to test with new stuff it's not gone anywhere because they have a hard time keeping up in general
19:46:04 <clarkb> tumbleweed and fedora are examples of this
19:46:19 <clarkb> but even today openstack isn't testing with python3.11 across the board yet (though it is close)
19:46:53 <clarkb> I think there is probably a balance in effort vs return and maybe containers are a good tool in balancing that out?
19:47:07 <tonyb> Yeah that's why I thought avoiding the DIB image side might be helpful
19:47:29 <clarkb> basically I don't expect all of openstack to run python3.12 jobs until well after ubuntu has packages for it anyway. But maybe a project like zuul would run python3.12 jobs and those are relatively infrequent compared to openstack
19:48:27 <clarkb> but also having a dib step install python3.12 on jammy is not a ton of work if we think this is largely a python problem
19:48:47 <clarkb> (I think generally it could be a nodejs, golang, rust, etc problem but many of those ecosystems make it a bit easier to get a random version)
19:49:31 <clarkb> corvus: does ensure-python with appropriate flags already know how to go to the internet to fetch a python version and build it?
19:49:47 <clarkb> I think it does? maybe we start there and see if there is usage and we can optimize from there?
19:50:32 <corvus> yeah, there's pyenv and stow
19:50:39 <corvus> in ensure-python
19:51:27 <tonyb> Okay.
19:52:05 <tonyb> I think that was helpful. I'd be willing to look at the ensure-python part and see what works and doesn't
19:52:43 <tonyb> it seems like the idea of using a container runtime isn't justified right now.
19:53:44 <clarkb> if our clouds had first class container runtimes as a service it would be much easier to sell/experiment with. But without that there is a lot of bootstrapping overhead for the humans and networking
19:54:21 <clarkb> side note: dox is a thing that mordred experimented with for a while: https://pypi.org/project/dox/
19:55:10 <clarkb> but ya let's start with the easy thing which is to try ensure-python's existing support for getting a random python and take what we learn from there
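(Roughly what that path amounts to on a stock jammy node, shown as a hedged manual sketch with pyenv; in a real job the zuul-jobs ensure-python role would drive this, and the exact role variables and python version should be checked against its docs rather than taken from here:)

    # assumes the usual cpython build deps (gcc, make, libssl-dev, ...) are present
    curl -fsSL https://pyenv.run | bash
    export PATH="$HOME/.pyenv/bin:$PATH"
    PY=3.12.0                                   # placeholder; whatever 3.12.x is current
    pyenv install "$PY"                         # compiles cpython independent of distro packages
    "$HOME/.pyenv/versions/$PY/bin/python" --version
    # a job would then point tox/nox at this interpreter for the unit tests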
19:55:20 <clarkb> Anything else? We are just about out of time for our hour.
19:55:46 <clarkb> Is everyone still comfortable merging that gitea 1.20.6 update even if I'm gone from 2100 to 2230?
19:55:56 <clarkb> if so I say someone should approve it :)
19:56:03 <fungi> i will, i can keep an eye on it
19:56:08 <clarkb> thanks!
19:56:26 <fungi> and done
19:56:45 <clarkb> I guess it's worth mentioning that I think I'll miss our meeting on December 12. I'll be around for the very first part of the day but then I'm popping out
19:57:05 * tonyb will be back in AU by then
19:57:11 <clarkb> thank you for your time everyone
19:57:13 <clarkb> #endmeeting