19:00:26 #startmeeting infra
19:00:26 Meeting started Tue Nov 14 19:00:26 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:26 The meeting name has been set to 'infra'
19:00:28 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NIDXZX7JT4MQJOUS7GKI5PPRMDIIY6FI/ Our Agenda
19:00:40 The agenda went out late because I wasn't around yesterday, but we do have an agenda
19:01:06 #topic Announcements
19:01:41 Next week is a big US holiday. That said, I expect to be around for the beginning of the week and plan to host our weekly meeting Tuesday
19:01:54 But be aware that by Thursday I expect it to be very quiet
19:02:30 #topic Mailman 3
19:02:55 fungi: I think you dug up some more info on the template file parse error? And basically mailman3 is missing some file that they need to add after django removed it from their library?
19:03:05 the bug we talked about yesterday turns out to be legitimate, yes
19:03:10 er, last week i mean
19:03:34 time flies
19:03:39 to confirm, we are running all of the versions of the software we expect, but a new bug has surfaced and we aren't seeing an old bug due to accidental use of old libraries
19:04:21 yeah, and this error really just means django isn't pre-compressing some html templates, so they're a little bigger on the wire to users
19:05:00 in that case I guess we're probably going to ignore this until the next mm3 upgrade?
19:05:24 #link https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/36U5NY725FNJSGRNELFOJLLEZQIS2L3Y/ mailman-web compress - Invalid template socialaccount/login_cancelled.html
19:05:58 yeah, it seems safe to just ignore and then we can plan to do a mid-release update when it gets fixed if we want, or wait until the next release
19:06:06 should we drop this agenda item from next week's meeting then?
19:06:36 I believe this was the last open item for mm3
19:06:38 i think so, yes. we can add upgrades to the agenda as needed in the future
19:06:47 sounds good. Thanks again for working through all of this for us
19:06:54 #topic Server Upgrades
19:06:54 thanks for your patience and help!
19:07:19 we added tonyb to the root list last week and promptly put him to work booting new servers :)
19:07:28 \o/
19:07:38 mirror01.ord.rax is being replaced with a new mirror02.ord.rax server courtesy of tonyb
19:07:43 #link https://review.opendev.org/c/opendev/zone-opendev.org/+/900922
19:07:48 #link https://review.opendev.org/c/opendev/system-config/+/900923
19:08:07 These changes should get the server all deployed, then we can confirm it is happy before updating DNS to flip over the mirror.ord.rax CNAMEs
19:08:18 I think the plan is to work through this one first and then start doing others
19:08:19 After a good session booting mirror02 I managed to clip some of the longer strings and so the reviews took me longer to publish
19:08:38 tonyb: I did run two variations of ssh-keyscan in order to double check the data
19:08:49 clarkb: Thanks
19:08:49 I think it is correct and noted that in my reviews when I noticed the note about the copy paste problems
19:09:51 feel free to continue asking questions and poking for reviews. This is really helpful
19:09:52 I started writing a "standalone" tool for handling the volume setup as the mirror nodes are a little different
19:10:07 Yup I certainly will do.
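As an aside on the ssh-keyscan double-check mentioned above, here is a minimal sketch of comparing two scan variations (by name and by address) before trusting a new server's host keys; the hostname and IP below are placeholders, not the real mirror02 addresses.

```
# Sketch only: fetch the host keys two ways and confirm the key material matches.
# "mirror02.ord.rax.opendev.org" and "203.0.113.10" are placeholder values.
ssh-keyscan -t rsa,ed25519 mirror02.ord.rax.opendev.org | awk '{print $2, $3}' | sort > keys-by-name.txt
ssh-keyscan -t rsa,ed25519 203.0.113.10 | awk '{print $2, $3}' | sort > keys-by-address.txt
# Only the host column was dropped above; key type and key material should be identical.
diff keys-by-name.txt keys-by-address.txt && echo "host keys match"
```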
19:10:30 tonyb: ++ to having the mirror volumes a bit more automated
19:11:17 agreed, we have enough following that pattern that it could be worthwhile
19:11:47 note that not all mirror servers get that treatment though, some have a sufficiently large rootfs that we just leave as-is and don't create additional volumes
19:11:51 I think that's about it for the mirror nodes. It's mostly carefully following the bouncing ball at this stage
19:12:25 cool. I'm happy to do another run-through too if we like. I feel like that was helpful for everyone as it made problems with cinder volume creation apparent and so on
19:13:10 fungi: Yup. and as we can't always predict the device name in the guest it won't be fully automated or integrated; it's just to document/simplify the creation work we did on the meetpad
19:13:12 i too am happy to do another onboarding call, maybe for other activities
19:13:49 * tonyb too.
19:14:39 anything else on this topic?
19:14:48 not from me
19:14:59 #topic Python Container Updates
19:15:19 Unfortunately I haven't really had time to look at the failures here in more detail. I saw tonyb asking questions about them though, were you looking?
19:15:27 #link https://review.opendev.org/c/zuul/zuul-operator/+/881245 Is the zuul-operator canary change
19:15:45 specifically we need that change to begin passing in zuul-operator before we can land the updates for the docker image in that repo
19:16:20 I am looking at it
19:16:52 I spoke to dpawlik about status and background
19:16:56 i suspect something has bitrotted with cert-manager; but with the switch in k8s setup away from docker, we don't have the right logs collected to see it, so that's probably the first task
19:17:20 No substantial progress but I'm finding my feet there
19:17:38 (in other words, the old k8s setup got us all container logs via docker, but the new setup needs to get them from k8s and explicitly fetch from all namespaces)
19:17:44 gotcha
19:17:52 because we are no longer using docker under k8s
19:18:09 yep
19:18:30 I agree, addressing log collection seems like a good next step
19:18:35 Okay that's good to know.
19:19:44 #topic Gitea 1.21
19:20:01 1.21.0 has been released
19:20:03 #link https://github.com/go-gitea/gitea/blob/v1.21.0/CHANGELOG.md we have a changelog
19:20:22 (and there was much rejoicing)
19:20:25 #link https://review.opendev.org/c/opendev/system-config/+/897679 Upgrade change needs updating now that we have changelog info
19:20:52 so ya the next step here is to go over the changelog and make sure our change is modified properly to handle their breaking changes
19:21:03 I haven't even looked at the changelog yet
19:21:18 but doing so and modifying that change is on my todo
19:21:20 *todo list
19:21:43 In the past we've often not upgraded until the .1 release anyway due to them very quickly releasing bugfixes
19:22:00 nobody ever wants to go first
19:22:02 between that and the gerrit upgrade and then thanksgiving I'm not sure this is urgent, but also don't want it to get forgotten
19:22:41 i agree that the next two weeks are probably not a great time to merge it, but i'll review at least
19:23:12 sounds good. Should have something to look at in the next day or so
19:23:16 I'm wondering about the key length thing, how much effort would it be to use longer keys?
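On the key-length question just above, a rough sketch of generating an ed25519 key to sit alongside the existing RSA key, which is the approach discussed in the following exchange; the key path and comment are illustrative assumptions, not the actual locations used for Gerrit's replication key.

```
# Sketch only: create an ed25519 key next to an existing RSA key. Because the
# default filenames differ, the two can coexist without a local collision.
# The path and comment below are assumptions for illustration.
ssh-keygen -t ed25519 -N '' -C 'gerrit replication to gitea' -f ~/.ssh/id_ed25519
# The public half would then be added to the gerrit user in Gitea, and Gerrit
# restarted, since it only reads its keys at startup.
cat ~/.ssh/id_ed25519.pub
```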
19:23:37 FWIW I'll review it too and, probably, ask "why do we $x" questions ;P
19:24:25 frickler: we need to generate a new key, add it to the gerrit user in gitea (this step may be manual currently; I think we only automate this at user creation time) and then add the key to gerrit and restart gerrit to pick it up
19:24:47 frickler: I suspect that if we switch to ed25519 then we can have it sit next to the existing rsa key in gerrit and we don't have to coordinate any moves
19:24:58 if we replace the shorter rsa key with a longer rsa key then we'd need a bit more coordination
19:25:12 well, we could have multiple rsa keys too, right?
19:25:23 fungi: I don't think gerrit will find multiple rsa keys
19:25:35 but I'm not sure of that. We can test that on a held node I guess
19:25:42 oh, right, local filename collision
19:26:13 we can do two different keytypes because they use separate filenames
19:26:18 yup
19:26:30 I can look into that more closely as I page the gitea upgrade stuff back in
19:26:31 i was thinking in the webui, not gerrit as a client
19:27:07 so yeah, i agree parallel keys probably makes the transition smoother than having to swap one out in a single step
19:27:30 speaking of Gerrit:
19:27:35 #topic Gerrit 3.8 Upgrade
19:27:36 though i guess if we add the old and new keys to gitea first, then we could swap rsa for rsa on the gerrit side
19:27:44 but might need a restart
19:27:54 it will need a restart of gerrit in all cases iirc
19:27:59 because it reads the keys on startup
19:28:18 For the Gerrit upgrade I'm planning on going through the etherpad again tomorrow
19:28:24 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.8
19:28:33 I want to make sure I understand the screen logging magic a bit better
19:28:41 but also would appreciate reviews of that plan if you haven't read it yet
19:28:44 also for the sake of the minutes...
19:28:46 #link https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ Upgrading review.opendev.org to Gerrit 3.8 on November 17, 2023
19:28:59 i figured you were going to include that in the announcements at the beginning
19:29:41 as far as coordination goes on the day of I expect I can drive things, but maybe fungi you can do some of the earlier stuff like adding hosts to emergency files and sending #status notice notices
19:29:59 I'll let you know if my expectations for that change, but I don't expect them to
19:30:28 happy to. i think i'm driving christine to an eye appointment, but can do basic stuff from my phone in the car
19:31:10 (also the appointment is about 2 minutes from the house)
19:31:10 seems like we are in good shape. And I'll triple check myself before Friday anyway
19:31:27 I can potentially do some of the "non-destructive" early work
19:31:44 tonyb: oh! we should add you to the statusbot acls
19:31:48 we should add tonyb to statusbot
19:31:48 and generally irc acls
19:31:54 but that may make more work than doing it
19:31:54 hah, jinx!
19:32:13 hehe
19:32:14 tonyb: it's work that needs doing sometime anyway
19:32:20 so who owes who a soda?
19:32:25 i can take care of it
19:32:32 kk
19:32:36 i owe everyone soda anyway
19:32:40 LOL
19:32:42 openstack/project-config/accessbot/channels.yaml is one file that needs editing
19:32:49 still repaying from my ptl days
19:33:14 I can do that.
19:33:46 I'm not actually sure where statusbot gets its user list. Does it just check for opers in the channel it is in?
19:33:59 i'll look into it
19:34:16 i think it's a config file
19:34:24 nope, it's statusbot_auth_nicks in system-config/inventory/service/group_vars/eavesdrop.yaml
19:34:28 tonyb: ^ so that file too
19:34:38 thanks, i was almost there
19:34:44 gotcha
19:34:58 anything else Gerrit upgrade related?
19:35:00 i'm getting slow this afternoon, must be time to start on dinner
19:35:10 it's basically lunch time. I'm starving
19:35:43 Coffee o'clock and then a run. ... and then lunch
19:36:06 alright next up
19:36:13 #topic Ironic Bug Dashboard
19:36:19 #link https://github.com/dtantsur/ironic-bug-dashboard
19:36:34 The ironic team is asking if we would be willing to run an instance of their bug dashboard tool for them
19:36:40 JayF: you were going to speak to this one?
19:36:43 So some context; this is an old bug dashboard. No auth needed. Simplest python app ever.
19:36:44 otherwise i can
19:37:04 We've run it in various places we've just done custom-ly; before doing that again with our move to LP, we thought we'd ask about getting it a real home.
19:37:30 No dependencies. Literally just needs a place to run, and I think dtantsur wrote a dockerfile for it the other day, too
19:37:56 My major concern is that running a service for a single project feels very inefficient from our side. If someone wanted to resurrect the openstack bug dashboard instead I feel like that might be a little different?
19:38:04 so options are for adding it to opendev officially (deployment via ansible/container image building and testinfra tests), or us booting a vm for them to manage themselves
19:38:04 The docs show using podman etc so yeah I think that's been done
19:38:46 additionally teams like tripleo have had one tool and ironic has one apparently and so on. I think it is inefficient for the project teams too
19:38:46 for historical reference, "the openstack bug dashboard" was called "bugday"
19:38:49 clarkb: I talked to dtantsur; we are extremely willing to take patches (and will email the list about this existing again once we get it a home) if other teams want to use it
19:40:10 JayF: so would you be willing to run this yourself if we give you a vm with a DNS record?
19:40:10 fungi: it's extremely likely that if infra says no, and we host it out of band, we'd do something similar to the second option (just get a VM somewhere and run it manually)
19:40:46 frickler: replace instances of "you" and "yourself" with ironic community as appropriate and the answer is "yes", with specific contacts being dtantsur and I to start
19:41:08 frickler: if you all had no answer for us, nonzero chance this ended up as a container in my homelab :)
19:41:18 that would be an easy start and we could see how it develops
19:41:41 so basically the idea behind openstack infra and now opendev was that we'd avoid doing stuff like this and instead create a commons where projects could work together to address common problems
19:42:04 yeah, when this came up yesterday in #openstack-ironic i mentioned the current situation with the opensearch-backed log ingestion service dpawlik set up
19:42:20 where we've struggled is when projects do things like this specific tool and blaze their own trail. This takes away potential commons resources as well as multiplies the effort required
19:42:42 From an infra standpoint; I'm with you.
19:42:44 This is why it's an opportunistic ask with almost an expectation that "no" was a likely answer.
19:42:51 I think that if we were to host it it would need to be a more generic tool for OpenDev users and not ironic specific.
I have fewer concerns with handing over a VM
19:42:59 From a community standpoint; that was storyboard; we adopted it; it disappeared; we are trying to dig out from that mistake
19:43:09 iiuc the tool is open to be used by other projects, they just need to amend it accordingly
19:43:11 i do think we want to encourage people to collaborate on infrastructure that supports their projects when there is a will to do so
19:43:11 and I do not want to burn more time trying to go down alternate "work together" paths in pursuit of that goal
19:43:40 JayF: the problem is that all the cases of not working together are why we have massive debt
19:44:13 ironic is not the only project trying to deal with storyboard for example
19:44:18 clarkb: I have lots of examples of cases of us working together that also have massive debt; so I'm not sure I agree with all of the root causing, but I do understand what you're getting at and like I said, if the answer is no, it's no.
19:44:26 basically the risk is that the opendev sysadmins are the last people standing when whoever put some service together disappears and there are still users
19:44:29 and despite my prodding very little collaboration between teams with the same problems has occurred as an example
19:44:54 so we get to be the ones who tell users "sorry, nobody's keeping this running any more"
19:45:27 the infra sig continues to field questions about how to set up LP
19:45:38 stuff that should have ideally been far more coordinated among the groups moving
19:45:50 i mostly just remind folks that we don't run launchpad, and it has documentation
19:46:10 and I can't shake the feeling that an ironic bug dashboard is just an extension of these problems and we'll end up being asked to run a different tool for nova and then a different one for sdks and so on
19:46:28 This is off topic for the meeting, but the coordination is always the most difficult part ime; which is why for Ironic's LP migration it finally started moving when I stopped trying so hard to pull half of openstack with me.
19:46:30 when what we need as a group is rough agreement on what a tool should be and then run that. And as mentioned before this tool did exist
19:46:42 but it too fell into disrepair and was no longer maintained and we shut it off
19:47:23 It sounds like consensus is no though; so for this topic you all can move on. I wouldn't want you all to host it unless everyone was onboard, anyway.
19:47:39 I don't think we necessarily need to resurrect bugday the code base, but I think if opendev hosts something it should be bugday the spiritual successor tool and not an ironic specific tool
19:47:49 i think it can be "not yet" instead of just "no"?
19:48:26 also i'm not opposed to booting a vm for them to run it on themselves, while they work on building consensus across other teams to possibly make it useful beyond ironic's use case
19:48:39 I just sent an email to the mailing list, last week, about how a cornerstone library to OpenStack is rotting and ~nobody noticed. I'm skeptical someone is going to take up the banner of uniting bug dashboards across openstack.
19:48:54 fungi: I do not commit to building such a consensus. I commit to being open to accepting patches.
19:49:12 with the expectation that if opendev is going to officially take it on, then there will need to be more of a cross-project interest (and of course configuration management and tests)
19:49:26 ya I'm far less concerned with booting a VM and adding a DNS record
19:49:38 fungi: not trying to be harsh; just trying to set a reasonable expectation to be clear :)
19:49:45 my plate is overflowing and I can't fit another ounce on it
19:49:58 sure. and we've all been there more than once, i can assure you ;)
19:50:48 JayF: so there are some options and stipulations you can take back to the ironic team for further discussion, i guess
19:51:31 If you want to give us a VM and a DNS name, that will work for us. If not, I'll go get the equivalent from my downstream/personal resources and my next steps are the same either way
19:51:40 i'm not sure i'm a fan of the "boot a vm and hand it over" approach
19:51:42 if a vm is going to be handed over, i don't see why that's an opendev/infra team ask... i don't feel like we're here to hand out vms, we're here to help facilitate collaboration. anyone can propose a patch to run a service if the service fits the mission. so if it does fit the mission, that's how it should be run. and if it doesn't, then it shouldn't be an opendev conversation.
19:53:05 should we not have provided the vm for the log ingestion system that loosely replaced the old logstash system? mistake in your opinion, or failed experiment, or...?
19:53:33 i thought that ran on aws or something
19:54:08 the opensearch cluster runs in aws, but there is a node that fetches logs and sends them to opensearch that dpawlik is managing
19:54:11 the backend does, but the custom log ingestion glue to zuul's interface is on a vm we booted for the systadmins
19:54:31 er, s/systadmins/admins of that service/
19:54:57 i was unaware of that, and yeah, i think that's the wrong approach. for one, the fact that i'm a root member unaware of it and it's not documented in https://docs.opendev.org/opendev/system-config/latest/ seems like a red flag. :)
19:56:06 that seems like something that fits the mission and should be run in the usual manner to me
19:56:14 ya better documentation of the exceptional node(s) is a good idea
19:56:34 and possibly also deciding as a group that exceptions are a bad idea
19:56:49 i think the wiki is an instructive example here too
19:57:18 One thing I'll note as a key difference about the service I proposed (and I suspect that logstash service) is their stateless nature.
19:57:29 the main takeaway we had from the wiki is that we made it clear we would not take responsibility for the services running the log search service
19:57:41 It doesn't address the basic philosophical questions; but it does draw a different picture than something like the wiki does.
19:57:54 and that if the people maintaining it go away, we'll just turn it off with no notice
19:58:09 yeah, in both new cases running them is operationally dead simple
19:59:28 (side note I think the original plan was to run the ingestion on the cluster itself but then realized that you can't really do that with the opensearch as a service)
19:59:51 i must have gotten the first version of the memo and not the update
20:00:08 because they delete and replace servers or something for upgrades. It's basically an appliance
20:00:19 we are at time.
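As a reference point for the "operationally dead simple" comment above, a hedged sketch of what running the stateless dashboard as a container could look like; the image name and the listening/published port are assumptions for illustration, not taken from the project's documentation.

```
# Sketch only: build and run the ironic-bug-dashboard from its checkout.
# The image name and port 8080 are assumptions; check the upstream docs.
git clone https://github.com/dtantsur/ironic-bug-dashboard
cd ironic-bug-dashboard
podman build -t ironic-bug-dashboard .
podman run -d --name ironic-bug-dashboard -p 8080:8080 ironic-bug-dashboard
```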
20:00:27 #topic Upgrade Server Pruning
20:00:39 #undo
20:00:39 Removing item from minutes: #topic Upgrade Server Pruning
20:00:50 #topic Backup Server Backup Pruning
20:01:02 really quickly before we end I wanted to note that the rax backup server needs its backups pruned due to disk utilization
20:01:28 Maybe that is something tonyb wants to do with another root (ianw set it up and documented and scripted it well so it's mostly a matter of going through the motions)
20:01:45 Yup happy to.
20:01:56 #topic Open Discussion
20:02:08 i'm also happy to help tonyb if there are questions about backup pruning
20:02:31 We don't really have time for this but feel free to take discussion to #opendev or service-discuss@lists.opendev.org to bring up extra stuff and/or keep talking about the boot a VM and hand it over stuff
20:02:35 fungi: thanks.
20:02:42 and happy 1700000000 day
20:02:50 woo!
20:02:52 I think we are about 2 hours away?
20:02:54 something like that
20:03:04 thank you everyone for your time!
20:03:06 #endmeeting