19:01:06 #startmeeting infra
19:01:06 Meeting started Tue May 2 19:01:06 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 The meeting name has been set to 'infra'
19:01:16 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JNR4PAHZ5JD272JXUC3BQUSZPLRIJYID/ Our Agenda
19:01:24 #topic Announcements
19:01:39 I didn't have any on the agenda.
19:02:42 #topic Migrating to Quay
19:02:56 Significant progress has been made here
19:03:30 Zuul and all of its images have been moved and are automatically publishing to quay now. We also updated our deployment tooling to pull zuul and friends from quay
19:03:54 Since then Zuul had its normal weekend restart and that all seemed to work (nodepool auto updated shortly after the changes landed)
19:04:27 the only thing I've noticed as potentially problematic is that docker image prune seems to treat the docker hub images as something it should keep around by default, so we may need to do manual cleanup at some point. This isn't urgent but it is something to be aware of as we move things
19:05:09 On the opendev side of things my changes to auto create repos if necessary landed and I updated system-config with a second set of jobs that can be inherited from to move images to quay. I did this with zookeeper-statsd and that seems to work
19:05:49 Where this leaves us is making a plan and todo list for getting all of our images updated. In particular one thing that is annoying is that we have some images that rely on other images. We can either update these at the root first or at the leaves first
19:06:36 There are advantages and disadvantages to each approach. If we update the root first (python-base/python-builder in particular) then we would want to pretty quickly update all of their descendants to avoid getting caught out if we had to make quick updates to the base images.
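The root-first vs leaves-first question above is essentially a choice of direction for walking the image dependency graph. As a minimal sketch (Python, using a made-up subset of the images rather than the real list of ~40), the rebuild order falls out of a topological sort:

    from graphlib import TopologicalSorter

    # Hypothetical subset of the dependency graph: each image maps to the
    # base images it is built FROM; python-base/python-builder are the roots.
    deps = {
        "python-builder": {"python-base"},
        "zookeeper-statsd": {"python-builder"},
        "ircbot": {"python-builder"},
    }

    # Root-first: every image is rebuilt after the images it depends on, so a
    # single pass of rebuilds and publication follows the dependency direction.
    root_first = list(TopologicalSorter(deps).static_order())
    # Leaves-first is the same walk in reverse.
    leaf_first = list(reversed(root_first))

    print("root first:", root_first)
    print("leaf first:", leaf_first)

Root-first gives the single-pass property discussed next; leaves-first trades that for the freedom to rebuild from docker hub at any point during the transition.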
19:07:01 But if we do the root first we probably only need to do a single pass of image rebuilds and publication as the flow follows the direction of the dependencies
19:07:29 if we go the other direction and update the leaves first then we don't need to do things as quickly because we can update the base images and rebuild and pull from docker if necessary at any time
19:07:48 the downside to doing the leaves first is that we will want to update them to publish to quay first, then do the base images, then reupdate all the leaves again to pull from quay
19:08:10 I have two changes up for updating the root first
19:08:20 #link https://review.opendev.org/c/opendev/system-config/+/881932 move base images to quay
19:08:33 #link https://review.opendev.org/c/opendev/system-config/+/881933?usp=dashboard consume base images from quay
19:08:38 and one that updates a leaf
19:08:54 #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard
19:09:02 #undo
19:09:02 Removing item from minutes: #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard
19:09:08 #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard Move ircbot to quay
19:09:20 The idea here was to illustrate both approaches
19:09:43 I think whichever approach we take we should write down a small plan with a todo list so we don't get lost in the ~40-something images that all need to be done
19:10:09 and then we should also consider setting aside a few days to really focus on getting as much of this done as possible so that the time period where we might have to debug both docker and quay things is minimized
19:10:33 sounds good to me
19:10:37 fwiw it feels like updating the base images first makes the most sense
19:10:46 ianw: ya I am starting to come around to that myself
19:10:48 and then working through a checklist to deploy it
19:11:09 except it could lock us out of making base image updates to the leaf images for the remainder of the transition
19:11:33 fungi: yes. That said there is an outlet. We could manually push what we have pushed to quay.io back to docker hub
19:11:35 but as long as that timeframe is reasonably short, i'm in favor of whichever path is the least effort
19:11:58 yesterday I wasn't considering that we could do that manual push and I was far more concerned about the time where updates to base images would be difficult
19:12:04 yeah, i think the fact that we have updated the base images, and have a list of tasks to work through, puts a nice constraint on getting it done
19:12:13 but since then I've realized that this is a reasonable outlet and I think doing base images first is fine
19:12:53 in that case I will start putting together a document (etherpad most likely) with an overview of the plan and a todo list and links to ongoing work
19:12:58 wfm
19:13:46 cool. The only other thing I wanted to mention (and this should go on the todo list) is we will need to resync some of the images from docker hub to quay.io manually before updating them as a few have had updates pushed to docker since I did the initial sync
19:13:46 ++ sounds good. i can probably help as i have somewhere a fairly recent list of all the images
19:13:52 not a huge deal. Just a step on the todo list
19:14:16 at one point, i was trying to automate getting them into a .dot file for graphical view, i can't remember why
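One rough way to build the resync list mentioned at 19:13:46 is to diff the tag sets the two registries report for each image. This is only a sketch: the repository name is an example, and it assumes the public Docker Hub v2 and quay.io v1 tag-listing endpoints behave as documented:

    import requests

    # Example repository; the real todo list covers all of the opendev images.
    image = "opendev/zookeeper-statsd"

    # Docker Hub v2 API: list tags for a public repository.
    hub = requests.get(
        f"https://hub.docker.com/v2/repositories/{image}/tags",
        params={"page_size": 100},
        timeout=30,
    ).json()
    hub_tags = {t["name"] for t in hub.get("results", [])}

    # quay.io API: list tags for a public repository.
    quay = requests.get(
        f"https://quay.io/api/v1/repository/{image}/tag/",
        timeout=30,
    ).json()
    quay_tags = {t["name"] for t in quay.get("tags", [])}

    # Tags present on Docker Hub but missing on quay.io would need a manual
    # copy before the publication jobs for that image move over.
    print(sorted(hub_tags - quay_tags))

Note this only spots missing tags; tags re-pushed to docker hub under the same name would need a digest comparison as well.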
19:15:01 we have a plan to make a plan. I'll take it. Hopefully I'll have something to share tomorrow. Seems unlikely I'll get anywhere near done today
19:15:10 Anything else related to quay?
19:16:08 #topic Bastion Host Updates
19:16:16 Still just need reviews on the bridge backup thing
19:16:20 #link https://review.opendev.org/q/topic:bridge-backups
19:16:35 We've also seen some connectivity errors from bridge to various nodes randomly but I don't think that is anything to do with us
19:17:26 yeah i have no idea what's up with that
19:17:43 it was worrying that it happened both times on dns deploy changes, but in both cases it was nowhere near anything to do with dns
19:17:45 Probably just something to monitor and if it persists or gets worse we can bring it up with our network/cloud providers
19:18:06 in both cases it was rax/dfw -> rax/dfw, what you'd think would be the most reliable
19:18:29 there were reports of connectivity issues potentially to releases.openstack.org (static02.opendev.org) from job nodes earlier too, so i wonder if rax-dfw is having some network connectivity issues
19:18:49 fungi: that was likely the gitea thing though?
19:18:56 I suppose they could be separate issues
19:19:25 istr periods where we've had weird ipv6 dropouts. but with ansible we only use ipv4 in inventory
19:19:42 hard to know. the releases.o.o url in question is a redirect to opendev.org gitea, but it was unclear whether the problem was before or after redirecting
19:20:07 the job logs weren't precise enough to differentiate
19:20:37 ya if that persists after the UA filter update I guess we can dig deeper
19:20:46 #topic Mailman 3
19:21:06 fungi: any progress with the held test node?
19:22:09 none, sorry :/
19:22:17 #topic Gerrit Updates
19:22:20 i ned to prioritize it
19:22:24 er, need
19:22:25 ack
19:22:46 The acl updates landed and we should be all set there. However fungi noticed some behavior of the tool that does normalization that might need updating
19:22:54 fungi: do you have links to those changes?
19:23:39 #link https://review.opendev.org/882075 Add an "apply" transformation which applies all
19:23:53 #link https://review.opendev.org/882080 Make option indenting a selectable transformation
19:24:17 thanks
19:24:31 it mostly came up in reviewing recent project additions where the authors were struggling to figure out how to make their editors indent with tab characters
19:24:39 Neither will change the output we already applied but both are good updates to making the tool usable
19:25:01 fungi: you hit the tab button on the keyboard :P
19:25:01 once those merge, we can recommend a command in our documentation to reformat acls
19:25:03 ctrl-q tab :)
19:25:58 i was going to say "oh just run this and it'll reformat your acl file" but it wasn't as trivial as i remembered
19:26:35 didn't the final pass write it out with tabs?
19:27:02 i was disappointed that my prediction about people not having a basic grasp of how to use their text editors was accurate
19:27:03 ianw: the changes above basically showed that it was always running in noop mode which only prints and doesn't write to files
19:27:27 ianw: the fix is to have it also write to files so people don't need to have editors that work :)
19:27:52 For the gerrit replication plugin leaking files I have not seen any movement on the bugs I filed and I have not had time to dig into trying to write fixes myself
19:27:55 ianw: 882080 is just to make the tab indenting consistent with the other transformation rules. 882075 is about making it easier to reformat an acl from the command line
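The tab-indentation fix being discussed is simple to see in isolation. This is only an illustrative sketch, not the actual normalization tooling the changes above modify: it rewrites any leading whitespace on option lines in a Gerrit ACL file as a single tab, in place:

    import re
    import sys

    # Usage (hypothetical script name): python fix_acl_indent.py project.config
    path = sys.argv[1]
    with open(path) as f:
        lines = f.readlines()

    fixed = []
    for line in lines:
        if re.match(r"^[ \t]+\S", line):
            # Option lines under a [section] header get exactly one tab.
            fixed.append("\t" + line.lstrip(" \t"))
        else:
            fixed.append(line)

    with open(path, "w") as f:
        f.writelines(fixed)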
19:28:03 I'm somewhat scared to look at how many files are on disk now
19:28:10 ok, i guess that makes sense. the tox job would spit out a diff file
19:28:22 I am strongly considering that we revert the change to bind mount that directory now
19:28:23 well, the tox job already spits out a diff file
19:29:03 then when we restart gerrit we'll clear it out automatically. Far less than ideal but I'm beginning to think that might be better than my hacky solution
19:29:10 but yes, it's hard for someone who doesn't actually know the difference between spaces and tabs, or possibly doesn't know how to translate those terms into their native tongue, to know what the diff is telling them to do
19:29:17 but if I can get a second reviewer on my hacky solution we can revisit a revert afterwards
19:29:26 #link https://review.opendev.org/c/opendev/system-config/+/880672 Dealing with leaked replication tasks on disk
19:30:21 fungi: sure -- i guess i don't have an issue with it actually rewriting the file, and then us telling people "run this and check in the result"
19:30:58 I think the only real risk with the replication leas is if we cause ext4 to run out of inodes
19:31:18 *replication leaks. So this is not super urgent but something we should eventually address
19:31:43 And the last gerrit related item is we dropped 3.6 image builds, added 3.8 images, and have a 3.7 to 3.8 upgrade job running
19:31:53 this has already led to fixing a couple of UI issues with 3.8
19:31:56 ianw: today i saw an acl go through 6 patchsets and the author is still struggling to get the whitespace right
19:33:05 Anything else gerrit related?
19:33:22 none on my end
19:33:34 nope
19:33:37 #topic Upgrading Servers
19:33:46 Etherpad server cleanup happened so etherpad is done done at this point
19:33:55 Nameserver migration stuff happened as well (thank you ianw!)
19:34:15 all four of our zones/domains should be running off of new servers with a new authoritative server as well
19:34:26 I think the remaining tasks are cleaning up the old server?
19:34:31 *old servers
19:35:01 ianw: anything we should be helping with to finish this completely?
19:35:35 nope, i can remove the old servers now, and there's one zone change to remove the ns1/ns2/adns1 records
19:35:59 the remaining todos were AAAA glue records for opendev.org and rdns for the vexxhost server
19:36:10 https://review.opendev.org/c/opendev/zone-opendev.org/+/881935
19:36:11 i've logged a ticket about the rdns, so we'll see what happens
19:36:20 #link https://review.opendev.org/c/opendev/zone-opendev.org/+/881935 Cleanup old dns server records
19:37:09 I can mention to the foundation registrar intermediary that we want AAAA glue records
19:37:25 fungi: assuming you don't have objections or would prefer to do it yourself to help ensure they don't do the wrong thing at the registrar
19:37:54 no objections
19:38:23 i don't know how to phrase it any better to avoid confusion, i'd just be prepared for confusion
19:39:38 I figure something like "The ns03.opendev.org and ns04.opendev.org servers have ipv4 and ipv6 addresses. A glue records were added for both but not AAAA. Is that something we can add?"
19:39:40 we did have them before, so there's that
19:40:03 oh, we did?
19:40:03 Ok I'll reach out later today
19:40:19 i wasn't sure since i hadn't paid close enough attention before we changed them
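For reference when following up with the registrar, the current glue can be inspected by asking the .org parent servers directly and looking at the additional section of the referral. A minimal sketch using dnspython (assumed to be installed; today we would expect A glue for ns03/ns04 but no AAAA glue):

    import dns.flags
    import dns.message
    import dns.query
    import dns.rdatatype
    import dns.resolver

    # Find an address for one of the .org TLD servers.
    tld_ns = dns.resolver.resolve("org.", "NS")[0].target
    tld_ip = dns.resolver.resolve(tld_ns, "A")[0].address

    # Ask the parent, non-recursively, for the opendev.org delegation.
    query = dns.message.make_query("opendev.org", dns.rdatatype.NS)
    query.flags &= ~dns.flags.RD
    response = dns.query.udp(query, tld_ip, timeout=10)

    # Glue records for the delegated nameservers show up in the additional
    # section; any AAAA glue added by the registrar would appear here too.
    for rrset in response.additional:
        for rdata in rrset:
            print(rrset.name, dns.rdatatype.to_text(rrset.rdtype), rdata)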
19:41:30 The last item I wanted to note on this topic is we have more servers to upgrade. All at lower priority than what we've already done but still worth doing
19:41:41 I'm hoping to keep pushing on that here and there and help is much appreciated
19:42:03 #topic AFS volume utilization
19:42:11 Grafana reports utilization has jumped again
19:42:20 I suspect the growth is simply not as nicely linear day to day as I would like
19:42:35 #link https://review.opendev.org/c/opendev/system-config/+/881952 add bookworm mirrors
19:42:48 This change is going to be effectively blocked on us freeing space or adding space though
19:42:58 (I left a note on the change but didn't -1 or -W)
19:43:39 yeah, i saw that go by and thought the same. thanks for calling it out
19:44:02 I feel like with everything else going on I have little time to devote to this as it isn't urgent quite yet. But it is something that will need to be addressed
19:44:05 fungi: maybe you could look at https://review.opendev.org/c/opendev/system-config/+/879239 to confirm it's not insane and we can prune the wheels
19:44:16 can do
19:44:28 and ya starting with cleanups like that is a good starting point and we can take it from there
19:44:36 fedora has now become a move from 36 -> 38 situation
19:44:39 and yeah, the bookworm release is still a ways out, but at least there's a date now
19:44:59 i'm not sure how much time i'll have to drive that
19:45:02 ianw: one idea I had was whether or not centos 9 stream is sufficiently up to date that fedora is maybe less important?
19:45:16 ianw: but I'm not plugged into how those distros are being used well enough to know if that is the case
19:45:28 naively they seem to fit into a similar space for CI needs anyway
19:45:29 heh, yeah, i was about to say we might want to consider the future of it
19:46:04 maybe we should write an email to the service-discuss list about it
19:46:16 to solicit feedback from users to see how that might impact them
19:46:26 I can write and send that if we think it is a good idea
19:46:28 it was mostly for devstack; it's quite a lot of overhead in zuul-jobs
19:47:16 i'm happy to put a bit of pressure on openstack to justify the continued use of fedora nodes too
19:47:25 ok I can draft that email
19:47:52 I'll be sure to get it looked over before sending just to avoid saying anything obviously incorrect
19:48:02 #topic Gitea 1.19
19:48:04 ++
19:48:11 #link https://review.opendev.org/c/opendev/system-config/+/877541 Upgrade to gitea 1.19.2
19:48:20 I think gitea 1.19 is ready to go when we are
19:48:35 The "api requires auth to list orgs" bug was fixed and I updated our ansible to reflect this.
19:48:50 I can help with the Fedora update and also the CentOS determination
19:48:50 I did drop a no_log: true in that update since I dropped auth as well. But please double check me on that being safe
19:49:24 tonyb: cool I'll ping you with the email draft too to ensure it makes sense from your perspective and we can edit or decide to hold off if necessary
19:49:28 #link https://158.69.65.228:3081/opendev/system-config Held test node for checking
19:49:41 this is a held node running the deployment of 1.19.2 from the above change which can be used for checking it looks good
19:49:42 clarkb: Sounds good
19:50:38 My afternoon is going to be pretty busy bouncing around errands and parenting today but I'm happy to watch that deploy tomorrow if I get the necessary reviews today
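One quick way to exercise the fixed behavior on the held node is to list organizations anonymously through the gitea REST API. The /api/v1/orgs endpoint and response shape below reflect my understanding of the gitea API, and verify=False is only because the held node serves a self-signed certificate:

    import requests

    held = "https://158.69.65.228:3081"

    # Listing orgs without any auth should succeed again on 1.19.2 now that
    # the "api requires auth to list orgs" regression is fixed.
    resp = requests.get(f"{held}/api/v1/orgs", verify=False, timeout=30)
    print(resp.status_code)
    print([org["username"] for org in resp.json()])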
19:51:12 The major change is the addition of gitea actions, which we turn off. This makes it a much simpler upgrade compared to 1.18
19:51:21 #topic Storyboard
19:51:43 fungi: I think you've continued to do some work helping projects gracefully turn stuff off
19:52:00 anything new to add on this? (I think frickler is out for a few weeks right now too)
19:52:27 not really, just making sure we're consistently setting projects to inactive if they move off sb
19:52:47 and could stand to do an audit between projects.yaml and the sb database
19:53:10 to see what we've missed (moves to lp and general retirements)
19:53:14 storyboard doesn't have the gerrit issue of marking a project read only making it difficult to undo does it?
19:53:20 nah
19:53:24 Mostly just curious, I doubt it would be an issue anyway
19:53:52 however, it never added a ui widget to toggle the active field, so it has to be done with the cli
19:54:05 fungi: is that documented somewhere?
19:54:17 just thinking it would be good to be able to have other people do it if you go on vacation etc
19:54:21 it can be
19:54:34 but it's also not an urgent thing to change
19:54:53 i'll try to remember to push up some docs, or better still an audit script to go with it
19:55:44 mainly just stops the project from getting used in autocomplete for fields and turning up in searches, i think
19:56:00 makes sense
19:56:05 thanks!
19:56:11 #topic Open Discussion
19:56:14 Anything else?
19:56:36 As briefly mentioned earlier my afternoon is going to involve me being in and out.
19:57:11 i'll be out a lot over the next few days
19:57:30 but not gone entirely. some errands, long lunches, and a half day on friday
19:58:20 Sounds like that may be everything. Thank you for your time. We'll be back here next week
19:58:34 enjoy your morning/afternoon/evening!
19:58:37 #endmeeting