19:00:25 #startmeeting infra
19:00:25 Meeting started Tue Jan 9 19:00:25 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:25 The meeting name has been set to 'infra'
19:00:32 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/7IXDFVY34MYBW3WO2EEU3AIGOLAL6WRB/ Our Agenda
19:00:39 It's been a little while since we had these regularly
19:01:55 #topic Announcements
19:02:06 The OpenInfra Foundation Individual Board member election is happening now. Look for your ballot via email and vote.
19:02:27 This election also includes bylaw amendments to make the bylaws less openstack specific
19:02:44 If you expected to have a ballot and can't find it, please reach out. There may have been email delivery problems
19:03:52 Separately we're going to feature OpenDev on the OpenInfra Live stream/podcast/show (I'm not sure exactly how you'd classify it)
19:04:05 That will happen on January 18th at 1500 UTC?
19:04:21 I know the day is correct but not positive on the time. Feel free to tune in
19:04:47 clarkb: i think the kids are calling it a "realplayer tv show" now ;)
19:04:47 also some streaming platforms have the ability for you to heckle us and ask questions
19:06:58 #topic Topics
19:07:02 #topic Server Upgrades
19:07:11 I believe that tonyb has gotten all of the mirror nodes upgraded at this point
19:07:27 Not sure if tonyb is around for the meeting, but I think the plan was to look at meetpad servers next
19:08:00 Correct
19:08:47 I started looking at meetpad. One thing that worries me a little is I can't quite see how we add the jvb nodes to meetpad
19:09:07 tonyb: it should be automated via configuration somehow
19:09:13 tonyb: I can look into that after the meeting
19:09:33 it seems to just be "magic" and I don't want any new jvb nodes to auto register with the existing meetpad
19:09:38 clarkb: Thanks
19:09:48 tonyb: yes it should be magic and it happens via xmpp iirc
19:09:56 we've scaled up and down if you look at git history
19:10:16 Ah okay.
19:10:26 so ya one approach would be to have a new jvb join the old meetpad, then replace the old meetpad and have the new jvb join the new thing. Or update config management to allow two side by side installations then update dns
19:10:41 we'll need to sort out how the magic happens in order to make a decision on approach I think
19:10:59 That was my thinking
19:12:00 (i think a rolling replacement sounds good, but i haven't thought about it deeply)
19:12:16 I also looked at mediawiki and I'm reasonably close to starting that server. translate looks like we'll just turn it off when i18n are ready, but I'm trying to help them with new weblate tools
19:12:25 (just mostly that since we're not changing any software versions, we'd expect it to work)
19:12:36 so that leaves cacti and storyboard to look at
19:12:58 tonyb: we've got a spec to add a prometheus and some agents on servers to replace cacti which is one option there
19:13:10 but maybe the easiest thing right now is to just uplift cacti? I don't know
19:13:12 cacti was in theory going to be retired in favor of prometheus
19:13:20 yeah that
19:13:35 I think the main issue with prometheus was figuring out the agent stuff. Running the service to collect the data is straightforward
19:14:01 Okay, I know ianw was thinking prometheus would be a good place for me to start so I'd be happy to look at that
19:14:54 alright let's move on, we have a fair number of things to discuss and it sounds like we're continuing to make progress there. Thanks!
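
For context on the agent question above: the usual pattern for this kind of setup is to run a small exporter (for example node_exporter from the prometheus-community project) on each server and have a central Prometheus scrape it. A minimal scrape configuration sketch, with purely illustrative hostnames, might look like:

    # prometheus.yml fragment (sketch only; hostnames are hypothetical)
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              - 'mirror01.example.org:9100'    # node_exporter's default port
              - 'meetpad01.example.org:9100'

The open question in the discussion is mainly which agent to standardize on and how to deploy it from config management, not the central collection service.
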
19:15:02 #topic Python container updates
19:15:21 The zuul registry service migrated to bookworm images so I've proposed a change to drop the bullseye images it was relying on
19:15:27 #link https://review.opendev.org/c/opendev/system-config/+/905018 Drop Bullseye python3.11 images
19:15:40 That leaves us with zuul-operator on the bullseye python3.10 images as our last bullseye container images
19:16:26 #topic Upgrading Zuul's DB server
19:16:36 I realized while prepping for this meeting that I had completely spaced on this.
19:16:58 It happens at this time of year ;P
19:16:59 However, coincidentally hacker news had a post about postgres options recently
19:17:01 #link https://www.crunchydata.com/blog/an-overview-of-distributed-postgresql-architectures a recent rundown of postgresql options
19:17:18 I haven't read the article yet, but figured I should as a good next step on this item
19:17:59 did anyone else have new input to add?
19:18:59 * tonyb shakes head
19:19:23 #topic EMS discontinuing legacy consumer hosting plans
19:19:41 fungi indicated that at the last meeting the general consensus was that we should investigate a switch to the newer plans
19:20:13 fungi: have we done any discussion about this on the foundation side yet? I'm guessing we need a general ack there then we can reach out to element about changing the deployment type?
19:20:56 they indicated in the notice that they'd let folks on the old plan have a half-normal minimum user license
19:21:25 i did some cursory talking to wes about it and it sounded like they'd be able to work it in for 2024
19:21:41 we would have to pay for a full year up front though
19:21:53 I don't expect we'll stop using matrix anytime soon
19:21:58 so that seems fine from a usage standpoint
19:22:20 right, since we're supporting multiple openinfra projects with it, the cost is fairly easy to justify
19:22:26 fungi: in that case I guess we should reach out to Element. IIRC the email gave a contact for the conversion
19:22:45 maybe double check with wes that nothing has changed in the last few weeks before sending that email
19:22:52 * clarkb scribbles a note to do this stuff
19:22:55 will do
19:23:11 Also gives us this year to test self-hosting a homeserver
19:23:16 we've still got about a month to sort it
19:23:24 right we have until February 7
19:24:14 do we really want to test self-hosting? also, would we get an export from element that would allow moving and keeping rooms and history?
19:24:39 no export is needed; the system is fully distributed
19:24:58 they provided a link to a migration document in the email too
19:25:00 trying to find it
19:25:00 but they do have a settings export we can use too
19:25:10 https://ems-docs.element.io/books/element-cloud-documentation/page/migrate-from-ems-to-self-hosted
19:25:22 basically the homeserver config
19:25:36 so you start a new homeserver with the same name and the rooms just magically migrate?
19:25:38 frickler: I think it's something to investigate during the year. Gives us more information for making a long term decision
19:25:53 we "own" the room names so it would largely be history and room config to worry about aiui
19:26:25 the rooms and their contents exist on all matrix servers involved in the federation (typically homeservers of users in those rooms)
19:27:46 if the history is exported, cool, but in theory i think a replacement server should be able to grab the history from any other server
19:28:12 oh interesting. So if you stand up a new server and have the well known file say it is the :opendev.org homeserver then clients will talk to the new server. That new server will sync out of the federated state the history of its rooms
19:28:37 that's what i'd expect. i have not tested it.
19:29:06 ack. Also looks like we can copy databases per the ems migration doc should that be necessary
19:29:10 (you'd just need to use one of the other room ids initially)
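
As a concrete illustration of the "well known file" mechanism referenced above (this follows the Matrix specification; the replacement hostname is hypothetical): federation delegation for the opendev.org server name is done with /.well-known/matrix/server, and client discovery with /.well-known/matrix/client, roughly:

    # served at https://opendev.org/.well-known/matrix/server (sketch)
    {"m.server": "new-homeserver.example.org:443"}

    # served at https://opendev.org/.well-known/matrix/client (sketch)
    {"m.homeserver": {"base_url": "https://new-homeserver.example.org"}}

Repointing these files at a replacement homeserver is what would let clients and other federated servers find the new deployment while keeping the existing :opendev.org user and room IDs.
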
19:29:41 but i'm still in no rush to self-host.
19:29:44 in any case figuring that out is a next step. First up is figuring out a year of hosting
19:29:58 and if that is reasonable, which I can help coordinate with fungi at the foundation and in talking to element
19:30:04 #topic Followup on haproxy update being broken
19:30:37 There was a lot of info under this item but the two main points seem to be "should we be more explicit about the versions of docker images we consume" and "should we prune less aggressively"
19:30:48 (like, i'm not looking at ems as an interim step based on our conversations so far -- but i agree that keeping aware of future options is good)
19:31:13 I think for haproxy in particular we can and should probably stick with their lts tag
19:31:18 i think we mostly covered the haproxy topic at the last meeting, but happy to revisit since not everyone was present
19:31:28 ++lts tag
19:31:35 fungi: ack. I wanted to bring up one thing primarily on pruning
19:31:58 One gotcha with pruning is that it seems to be based on the image build/creation time not when you started using the newer image(s)
19:32:16 right, note that we hadn't actually pruned the old haproxy image we downgraded to; when i did the manual config change and pulled, it didn't need to retrieve the image
19:32:18 and so it is a bit of a clunky tool, but better than nothing for images like haproxy for example where we could easily revert
19:32:57 I'm happy for us to extend the time we keep images, but also be aware of this limitation with the pruning command
19:33:01 i'm ambivalent about pruning because i'm not worried about not being able to pull an old version from a registry on demand
19:33:29 the main thing it might offer is insurance against upstreams deleting their images
19:33:46 but i don't think that's actually been an issue we've encountered yet?
19:33:49 one concern of mine was being able to find out which last version it actually was that we were running
19:33:51 i'm not eager to run an image that upstream has deleted either
19:34:21 frickler: yes, if we could add some more verbosity around our image management, that could help
19:34:29 frickler: we could update our ansible runs to do something like a docker ps -a and docker image list
19:34:37 and record that in our deployment logs
19:34:45 even if it's just something that periodically interrogates docker for image ids and logs them to a file
19:34:55 or yeah that
19:35:02 maybe even somewhere more persistent than zuul build logs would be good
19:35:05 i agree with frickler that leaving an image sitting around for some number of days provides a good indication of what we were probably running before
19:36:07 ok so the outstanding need is better records of what docker images we ran during which timeframes
19:36:15 (we could stick version numbers in prometheus; it's not great for that though, but it's okay as long as they don't change too often)
19:36:45 ya this will probably require a bit more brainstorming
19:36:48 (the only way to do that with prometheus increases the cardinality of metrics with each new version number)
19:37:05 maybe start with the simple thing of having ansible record a bit more info then try and improve on that for longer term retention
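
A minimal sketch of the "simple thing" suggested above, assuming a cron job or Ansible task that appends a timestamped snapshot of the docker state to a local file (the log path is hypothetical):

    # record which containers and images are present, with digests, for later forensics
    date --iso-8601=seconds >> /var/log/docker-image-inventory.log
    docker ps --all --no-trunc >> /var/log/docker-image-inventory.log
    docker image list --digests >> /var/log/docker-image-inventory.log

Appending rather than overwriting keeps a rough history of which image IDs were on each host during which timeframes; retention somewhere more durable than the host itself is the part that still needs brainstorming.
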
19:38:17 I'll continue on as we have a few more items to discuss
19:38:19 #topic Followup on haproxy update being broken
19:38:38 Similar to the last one I'm not sure if this reached a conclusion but two things worth mentioning have happened recently. First zuul's doc quota was increased
19:38:57 that's the topic we just had?
19:39:02 bah yes
19:39:05 #undo
19:39:05 Removing item from minutes: #topic Followup on haproxy update being broken
19:39:10 #topic AFS Quota issues
19:39:13 copy and paste failure
19:39:28 * fungi is now much less confused
19:39:40 Second is that there are some early discussions around having openeuler be more involved with opendev and possibly contributing some CI resources
19:39:52 the zuul project quota was increased (not doc I think)
19:40:03 frickler: ya it hosts the zuul docs iirc
19:40:07 and website?
19:40:30 IIUC the release artefacts
19:40:45 There may be an opportunity to leverage this interest in collaboration to clean up the openeuler mirrors and give feedback to them on the growth problems
19:40:50 everything under zuul-ci.org is on one volume
19:40:51 zuul's docs are part of its project website
19:40:54 yeah that
19:40:55 and i increased it to 5gb
19:40:56 ahah
19:41:10 essentially work with the interested parties to improve the situation around mirrors for openeuler and maybe our CI quotas
19:41:52 responding to their latest queries about the sizes of VMs and how many is on my todo list after meetings and lunch
19:42:06 (you know we write that stuff down in a document but 100% of the time the questions get asked anyway)
19:42:07 do you have a reference to those openeuler discussions or are they private for now?
19:42:11 they have an openstack cloud?
19:43:06 frickler: I think keeping the email discussion small while we sort out if it is even possible is good, but once we know if it will go somewhere we can do that more publicly
19:43:38 corvus: yes sounds like it? We tried to be explicit that what we need is an openstack api endpoint and accounts that can provision VMs
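
To make "an openstack api endpoint and accounts that can provision VMs" concrete, what that boils down to on our side is roughly a clouds.yaml entry plus enough quota for the agreed number and size of test VMs; every value in this sketch is a placeholder:

    # clouds.yaml sketch; all values are hypothetical
    clouds:
      openeuler:
        auth:
          auth_url: https://cloud.example.org:5000/v3
          username: opendev-ci
          password: <secret>
          project_name: opendev-ci
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne
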
19:43:50 yeah, I just wanted to know whether I missed something somewhere
19:43:52 for transparency: openeuler representatives were in discussion with openinfra foundation staff members and offered to supply system resources, so the foundation staff are trying to put them in touch with us to determine more scope around it
19:43:58 it's all been private discussions so far
19:44:07 neat
19:44:32 were there other outstanding afs quota concerns to discuss?
19:44:47 since openstack is a primary use case for their distro, they have a vested interest in helping test openstack upstream on it
19:45:04 some other mirror volumes need watching
19:45:55 for centos stream I seem to recall digging around in those mirrors and we end up with lots of packages with many versions
19:46:05 centos-stream and ubuntu-ports look very close to their limit
19:46:10 in theory we only need the newest 2 to avoid installation failures
19:46:28 we could potentially write a smarter syncing script that scanned through and deleted older versions
19:46:59 for ubuntu ports I had thought we were still syncing old versions of the distro that we could delete but we aren't so I'm not sure what we can do there
19:47:15 are we syncing more than arm64 packages maybe? like 32bit arm and/or ppc? I think not
19:48:22 I don't think we have time to solve that in this meeting. Let's continue on as we have ~3 more topics to cover
19:48:28 #topic Broken wheel build issues
19:48:42 I don't know, I just noticed these issues when checking whether we have room to mirror rocky
19:48:48 frickler: ack
19:49:03 it's also possible that dropping old releases from our config isn't cleaning up the old packages associated with them
19:49:14 fungi: oh interesting. Worth double checking
19:49:38 for wheels I think we can stop building and mirroring them at any time because pip will prefer new sdists over old wheels right? so we don't even need to update the pip.conf in our test nodes
19:49:53 correct
19:50:05 fungi: ^ you probably know off the top of your head if that is the case. But my main concern would be that we start testing older stuff accidentally if we stop building wheels
19:50:09 unless you pass the pip option to prefer "binary" packages (wheels)
19:50:13 right
19:50:29 but it's not on by default
19:50:49 i'd treat that as a case of caveat emptor
19:50:50 in that case I think it is reasonable to send email to the service announce list indicating we plan to stop running those jobs in the future (say beginning of February), ask if anyone is interested in keeping them alive, and if not jobs will fall back to building from source
19:51:14 the fallback is slower and may require some bindep file updates but it isn't going to hard stop anyone from getting work done on centos distros
19:51:32 wfm
19:51:33 will we also clean out existing wheels at the same time? maybe keep the afs volume but not publish anymore?
19:52:00 frickler: I think we should keep the content for a bit as some of the existing wheels may be up to date for a while
19:52:04 we could probably do it in phases
19:52:29 ok
19:52:48 since pip's behavior is acceptable by default here we can still take advantage of the remaining benefit from the mirror for a bit
19:52:56 then maybe after 6-12 months clean it up
19:53:22 alright next topic
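
To illustrate the pip behavior discussed above: the wheel mirror is just an additional index, and because prefer-binary is not enabled, pip's default candidate ordering picks the newest available release even when it only exists as an sdist, so stale wheels in the mirror cannot mask newer source releases. A sketch of how a test-node pip.conf might be laid out (the mirror URL is illustrative, not the exact production configuration):

    [global]
    extra-index-url = https://mirror.example.org/wheel/ubuntu-22.04-x86_64/
    # prefer-binary is deliberately left unset; enabling it would make pip favor
    # an older wheel over a newer sdist, which is exactly the behavior we want
    # to avoid once the wheel build jobs stop.
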
19:53:28 #topic Gitea repo-archives filling server disk
19:53:32 fwiw, the python ecosystem has gotten a lot better about making cross-platform wheels for releases of things now, and in a more timely fashion
19:53:49 so our pre-built wheels are far less necessary
19:53:58 when you ask gitea for a repo archive (tarball/zip/.bundle) it caches that on disk
19:54:28 then once a day it runs an internal cron task (using a go library implementation of cron, not system cron) to clean up any repo archives that are more than a day old
19:54:34 oh, yeah this is a fun one. i'd somehow already pushed it to the back of my mind
19:54:41 can we disable that functionality? we do have our own tarballs instead (at least for openstack)?
19:54:43 i'm guessing people do that a lot to get releases even though like zero opendev projects make releases that way?
19:54:51 what frickler said :)
19:55:03 s/people/web crawlers/ i think
19:55:04 upstream indicated it could be web crawlers
19:55:11 so their suggestion was to update our robots.txt
19:55:15 #link https://review.opendev.org/c/opendev/system-config/+/904868 update robots.txt on upstream's suggestion
19:55:19 and no we can't disable the feature
19:55:29 at least I haven't found a way to do that
19:55:43 the problem is the daily cleanup isn't actually cleaning up everything more than a day old
19:56:15 I've spent a bit of time rtfs'ing and looking at the database and I can't figure out why it is broken, but you can see on gitea12 that it falls about 4 hours behind each time it runs so we end up leaking and filling the disk
19:56:40 In addition to reducing the number of archives generated by asking bots to leave them alone we can also run a cron job that simply deletes all archives
19:56:45 #link https://review.opendev.org/c/opendev/system-config/+/904874 Run weekly removal of all cached repo archives
19:56:48 does gitea break if we make the cache non-writeable?
19:57:06 frickler: I haven't tested that but I would assume so. I would expect a 500 error when you request the archive
19:57:22 which would also be like disabling it kind of
19:57:25 i suppose it depends on your definition of "break" ;)
19:57:34 since we are already trying to delete archives more than a day old, deleting all archives once a week on the weekend seems safe
19:57:46 and when you ask it to delete all archives it does successfully delete all archives
19:58:15 I would prefer we not intentionally create 500 errors
19:58:20 there are valid reasons to get repo archives
19:58:51 I also noticed when looking at the cron jobs that gitea has a cron job that phones home to check if it is running the latest release
19:59:03 the cron might have a small window of breakage, but should immediately work on a retry so lgtm
19:59:13 I pushed https://review.opendev.org/c/opendev/system-config/+/905020 to disable that cron job because I hate the idea of a phone home for that
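
For reference on the robots.txt approach (the change linked above has the actual content; this is only an illustrative sketch): gitea serves the cached archives under paths like /<org>/<repo>/archive/<ref>.tar.gz, so a wildcard rule along these lines asks well-behaved crawlers to skip them:

    User-agent: *
    Disallow: /*/*/archive/

Wildcard matching in Disallow rules is a de facto extension honored by the major crawlers rather than part of the original robots.txt standard, which is why the weekly deletion cron remains a useful backstop.
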
20:00:22 our hour is up and I have to context switch to another meeting
20:00:33 #topic Service Coordinator Election
20:01:02 really quickly before I end the meeting I wanted to call out that we're approaching the service coordinator election timeframe. I need to dig up emails to determine when I said that would happen (I believe it is end of January/early February)
20:01:38 nothing for anyone to do at this point other than consider if they wish to assume the role and nominate themselves. And I'll work to get things official via email
20:01:38 If it matches openstack PTL/TC elections then they'll start in Feb
20:01:46 tonyb: it's slightly offset
20:01:52 okay
20:01:54 #topic Open Discussion
20:02:00 Anything else important before we call the meeting?
20:03:13 nope
20:03:17 sounds like no. Thank you everyone for your time and help running the opendev services!
20:03:24 we'll be back next week same time and location
20:03:27 #endmeeting