19:01:06 #startmeeting infra
19:01:06 Meeting started Tue May 2 19:01:06 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 The meeting name has been set to 'infra'
19:01:16 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JNR4PAHZ5JD272JXUC3BQUSZPLRIJYID/ Our Agenda
19:01:24 #topic Announcements
19:01:39 I didn't have any on the agenda.
19:02:42 #topic Migrating to Quay
19:02:56 Significant progress has been made here
19:03:30 Zuul and all of its images have been moved and are automatically publishing to quay now. We also updated our deployment tooling to pull zuul and friends from quay
19:03:54 Since then Zuul had its normal weekend restart and that all seemed to work (nodepool auto updated shortly after the changes landed)
19:04:27 the only thing I've noticed as potentially problematic is that docker image prune seems to treat the docker hub images as something it should keep around by default, so we may need to do manual cleanup at some point. This isn't urgent but it is something to be aware of as we move things
19:05:09 On the opendev side of things my changes to auto create repos if necessary landed and I updated system-config with a second set of jobs that can be inherited from to move images to quay. I did this with zookeeper-statsd and that seems to work
19:05:49 Where this leaves us is making a plan and todo list for getting all of our images updated. In particular one thing that is annoying is that we have some images that rely on other images. We can either update these at the root first or at the leaves first
19:06:36 There are advantages and disadvantages to each approach. If we update the root first (python-base/python-builder in particular) then we would want to pretty quickly update all of their descendants to avoid getting caught out if we had to make quick updates to the base images.
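The root-first vs leaves-first question above is essentially a choice of direction for walking the image dependency graph. As a minimal sketch (Python, using a made-up subset of the images rather than the real list of ~40), the rebuild order falls out of a topological sort:

    from graphlib import TopologicalSorter

    # Hypothetical subset of the dependency graph: each image maps to the
    # base images it is built FROM; python-base/python-builder are the roots.
    deps = {
        "python-builder": {"python-base"},
        "zookeeper-statsd": {"python-builder"},
        "ircbot": {"python-builder"},
    }

    # Root-first: every image is rebuilt after the images it depends on, so a
    # single pass of rebuilds and publication follows the dependency direction.
    root_first = list(TopologicalSorter(deps).static_order())
    # Leaves-first is the same walk in reverse.
    leaf_first = list(reversed(root_first))

    print("root first:", root_first)
    print("leaf first:", leaf_first)

Root-first gives the single-pass property discussed next; leaves-first trades that for the freedom to rebuild from docker hub at any point during the transition.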
19:07:01 But if we do the root first we probably only need to do a single pass of image rebuilds and publication as the flow follows the direction of the dependencies
19:07:29 if we go the other direction and update the leaves first then we don't need to do things as quickly because we can update the base images and rebuild and pull from docker if necessary at any time
19:07:48 the downside to doing the leaves first is that we will want to update them to publish to quay first, then do the base images, then reupdate all the leaves again to pull from quay
19:08:10 I have two changes up for updating the root first
19:08:20 #link https://review.opendev.org/c/opendev/system-config/+/881932 move base images to quay
19:08:33 #link https://review.opendev.org/c/opendev/system-config/+/881933?usp=dashboard consume base images from quay
19:08:38 and one that updates a leaf
19:08:54 #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard
19:09:02 #undo
19:09:02 Removing item from minutes: #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard
19:09:08 #link https://review.opendev.org/c/opendev/system-config/+/881931?usp=dashboard Move ircbot to quay
19:09:20 The idea here was to illustrate both approaches
19:09:43 I think whichever approach we take we should write down a small plan with a todo list so we don't get lost in the ~40-something images that all need to be done
19:10:09 and then we should also consider setting aside a few days to really focus on getting as much of this done as possible so that the time period where we might have to debug both docker and quay things is minimized
19:10:33 sounds good to me
19:10:37 fwiw it feels like updating the base images first makes the most sense
19:10:46 ianw: ya I am starting to come around to that myself
19:10:48 and then working through a checklist to deploy it
19:11:09 except it could lock us out of making base image updates to the leaf images for the remainder of the transition
19:11:33 fungi: yes. That said there is an outlet. We could manually push what we have pushed to quay.io back to docker hub
19:11:35 but as long as that timeframe is reasonably short, i'm in favor of whichever path is the least effort
19:11:58 yesterday I wasn't considering that we could do that manual push and I was far more concerned about the time where updates to base images would be difficult
19:12:04 yeah, i think the fact that we have updated the base images, and have a list of tasks to work through, puts a nice constraint on getting it done
19:12:13 but since then I've realized that this is a reasonable outlet and I think doing base images first is fine
19:12:53 in that case I will start putting together a document (etherpad most likely) with an overview of the plan and a todo list and links to ongoing work
19:12:58 wfm
19:13:46 cool. The only other thing I wanted to mention (and this should go on the todo list) is we will need to resync some of the images from docker hub to quay.io manually before updating them as a few have had updates pushed to docker since I did the initial sync
19:13:46 ++ sounds good. i can probably help as i have somewhere a fairly recent list of all the images
19:13:52 not a huge deal. Just a step on the todo list
19:14:16 at one point, i was trying to automate getting them into a .dot file for graphical view, i can't remember why
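One rough way to build the resync list mentioned at 19:13:46 is to diff the tag sets the two registries report for each image. This is only a sketch: the repository name is an example, and it assumes the public Docker Hub v2 and quay.io v1 tag-listing endpoints behave as documented:

    import requests

    # Example repository; the real todo list covers all of the opendev images.
    image = "opendev/zookeeper-statsd"

    # Docker Hub v2 API: list tags for a public repository.
    hub = requests.get(
        f"https://hub.docker.com/v2/repositories/{image}/tags",
        params={"page_size": 100},
        timeout=30,
    ).json()
    hub_tags = {t["name"] for t in hub.get("results", [])}

    # quay.io API: list tags for a public repository.
    quay = requests.get(
        f"https://quay.io/api/v1/repository/{image}/tag/",
        timeout=30,
    ).json()
    quay_tags = {t["name"] for t in quay.get("tags", [])}

    # Tags present on Docker Hub but missing on quay.io would need a manual
    # copy before the publication jobs for that image move over.
    print(sorted(hub_tags - quay_tags))

Note this only spots missing tags; tags re-pushed to docker hub under the same name would need a digest comparison as well.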
19:15:01 we have a plan to make a plan. I'll take it. Hopefully I'll have something to share tomorrow. Seems unlikely I'll get anywhere near done today
19:15:10 Anything else related to quay?
19:16:08 #topic Bastion Host Updates
19:16:16 Still just need reviews on the bridge backup thing
19:16:20 #link https://review.opendev.org/q/topic:bridge-backups
19:16:35 We've also seen some connectivity errors from bridge to various nodes randomly but I don't think that is anything to do with us
19:17:26 yeah i have no idea what's up with that
19:17:43 it was worrying that it happened both times on dns deploy changes, but in both cases it was nowhere near anything to do with dns
19:17:45 Probably just something to monitor and if it persists or gets worse we can bring it up with our network/cloud providers
19:18:06 in both cases it was rax/dfw -> rax/dfw, what you'd think would be the most reliable
19:18:29 there were reports of connectivity issues potentially to releases.openstack.org (static02.opendev.org) from job nodes earlier too, so i wonder if rax-dfw is having some network connectivity issues
19:18:49 fungi: that was likely the gitea thing though?
19:18:56 I suppose they could be separate issues
19:19:25 istr periods where we've had weird ipv6 dropouts. but with ansible we only use ipv4 in inventory
19:19:42 hard to know. the releases.o.o url in question is a redirect to opendev.org gitea, but it was unclear whether the problem was before or after redirecting
19:20:07 the job logs weren't precise enough to differentiate
19:20:37 ya if that persists after the UA filter update I guess we can dig deeper
19:20:46 #topic Mailman 3
19:21:06 fungi: any progress with the held test node?
19:22:09 none, sorry :/
19:22:17 #topic Gerrit Updates
19:22:20 i ned to prioritize it
19:22:24 er, need
19:22:25 ack
19:22:46 The acl updates landed and we should be all set there. However fungi noticed some behavior of the tool that does normalization that might need updating
19:22:54 fungi: do you have links to those changes?
19:23:39 #link https://review.opendev.org/882075 Add an "apply" transformation which applies all
19:23:53 #link https://review.opendev.org/882080 Make option indenting a selectable transformation
19:24:17 thanks
19:24:31 it mostly came up in reviewing recent project additions where the authors were struggling to figure out how to make their editors indent with tab characters
19:24:39 Neither will change the output we already applied but both are good updates to making the tool usable
19:25:01 fungi: you hit the tab button on the keyboard :P
19:25:01 once those merge, we can recommend a command in our documentation to reformat acls
19:25:03 ctrl-q tab :)
19:25:58 i was going to say "oh just run this and it'll reformat your acl file" but it wasn't as trivial as i remembered
19:26:35 didn't the final pass write it out with tabs?
19:27:02 i was disappointed that my prediction about people not having a basic grasp of how to use their text editors was accurate
19:27:03 ianw: the changes above basically showed that it was always running in noop mode which only prints and doesn't write to files
19:27:27 ianw: the fix is to have it also write to files so people don't need to have editors that work :)
19:27:52 For the gerrit replication plugin leaking files I have not seen any movement on the bugs I filed and I have not had time to dig into trying to write fixes myself
19:27:55 ianw: 882080 is just to make the tab indenting consistent with the other transformation rules. 882075 is about making it easier to reformat an acl from the command line
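The tab-indentation fix being discussed is simple to see in isolation. This is only an illustrative sketch, not the actual normalization tooling the changes above modify: it rewrites any leading whitespace on option lines in a Gerrit ACL file as a single tab, in place:

    import re
    import sys

    # Usage (hypothetical script name): python fix_acl_indent.py project.config
    path = sys.argv[1]
    with open(path) as f:
        lines = f.readlines()

    fixed = []
    for line in lines:
        if re.match(r"^[ \t]+\S", line):
            # Option lines under a [section] header get exactly one tab.
            fixed.append("\t" + line.lstrip(" \t"))
        else:
            fixed.append(line)

    with open(path, "w") as f:
        f.writelines(fixed)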
19:28:03 I'm somewhat scared to look at how many files are on disk now
19:28:10 ok, i guess that makes sense. the tox job would spit out a diff file
19:28:22 I am strongly considering that we revert the change to bind mount that directory now
19:28:23 well, the tox job already spits out a diff file
19:29:03 then when we restart gerrit we'll clear it out automatically. Far less than ideal but I'm beginning to think that might be better than my hacky solution
19:29:10 but yes, it's hard for someone who doesn't actually know the difference between spaces and tabs, or possibly doesn't know how to translate those terms into their native tongue, to know what the diff is telling them to do
19:29:17 but if I can get a second reviewer on my hacky solution we can revisit a revert afterwards
19:29:26 #link https://review.opendev.org/c/opendev/system-config/+/880672 Dealing with leaked replication tasks on disk
19:30:21 fungi: sure -- i guess i don't have an issue with it actually rewriting the file, and then us telling people "run this and check in the result"
19:30:58 I think the only real risk with the replication leas is if we cause ext4 to run out of inodes
19:31:18 *replication leaks. So this is not super urgent but something we should eventually address
19:31:43 And the last gerrit related item is we dropped 3.6 image builds, added 3.8 images, and have a 3.7 to 3.8 upgrade job running
19:31:53 this has already led to fixing a couple of UI issues with 3.8
19:31:56 ianw: today i saw an acl go through 6 patchsets and the author is still struggling to get the whitespace right
19:33:05 Anything else gerrit related?
19:33:22 none on my end
19:33:34 nope
19:33:37 #topic Upgrading Servers
19:33:46 Etherpad server cleanup happened so etherpad is done done at this point
19:33:55 Nameserver migration stuff happened as well (thank you ianw!)
19:34:15 all four of our zones/domains should be running off of new servers with a new authoritative server as well
19:34:26 I think the remaining tasks are cleaning up the old server?
19:34:31 *old servers
19:35:01 ianw: anything we should be helping with to finish this completely?
19:35:35 nope, i can remove the old servers now, and there's one zone change to remove the ns1/ns2/adns1 records
19:35:59 the remaining todos were AAAA glue records for opendev.org and rdns for the vexxhost server
19:36:10 https://review.opendev.org/c/opendev/zone-opendev.org/+/881935
19:36:11 i've logged a ticket about the rdns, so we'll see what happens
19:36:20 #link https://review.opendev.org/c/opendev/zone-opendev.org/+/881935 Cleanup old dns server records
19:37:09 I can mention to the foundation registrar intermediary that we want AAAA glue records
19:37:25 fungi: assuming you don't have objections or would prefer to do it yourself to help ensure they don't do the wrong thing at the registrar
19:37:54 no objections
19:38:23 i don't know how to phrase it any better to avoid confusion, i'd just be prepared for confusion
19:39:38 I figure something like "The ns03.opendev.org and ns04.opendev.org servers have ipv4 and ipv6 addresses. A glue records were added for both but not AAAA. Is that something we can add?"
19:39:40 we did have them before, so there's that
19:40:03 oh, we did?
19:40:03 Ok I'll reach out later today
19:40:19 i wasn't sure since i hadn't paid close enough attention before we changed them
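For reference when following up with the registrar, the current glue can be inspected by asking the .org parent servers directly and looking at the additional section of the referral. A minimal sketch using dnspython (assumed to be installed; today we would expect A glue for ns03/ns04 but no AAAA glue):

    import dns.flags
    import dns.message
    import dns.query
    import dns.rdatatype
    import dns.resolver

    # Find an address for one of the .org TLD servers.
    tld_ns = dns.resolver.resolve("org.", "NS")[0].target
    tld_ip = dns.resolver.resolve(tld_ns, "A")[0].address

    # Ask the parent, non-recursively, for the opendev.org delegation.
    query = dns.message.make_query("opendev.org", dns.rdatatype.NS)
    query.flags &= ~dns.flags.RD
    response = dns.query.udp(query, tld_ip, timeout=10)

    # Glue records for the delegated nameservers show up in the additional
    # section; any AAAA glue added by the registrar would appear here too.
    for rrset in response.additional:
        for rdata in rrset:
            print(rrset.name, dns.rdatatype.to_text(rrset.rdtype), rdata)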
19:41:30 The last item I wanted to note on this topic is we have more servers to upgrade. All at lower priority than what we've already done but still worth doing
19:41:41 I'm hoping to keep pushing on that here and there and help is much appreciated
19:42:03 #topic AFS volume utilization
19:42:11 Grafana reports utilization has jumped again
19:42:20 I suspect the growth is simply not as nicely linear day to day as I would like
19:42:35 #link https://review.opendev.org/c/opendev/system-config/+/881952 add bookworm mirrors
19:42:48 This change is going to be effectively blocked on us freeing space or adding space though
19:42:58 (I left a note on the change but didn't -1 or -W)
19:43:39 yeah, i saw that go by and thought the same. thanks for calling it out
19:44:02 I feel like with everything else going on I have little time to devote to this as it isn't urgent quite yet. But it is something that will need to be addressed
19:44:05 fungi: maybe you could look at https://review.opendev.org/c/opendev/system-config/+/879239 to confirm it's not insane and we can prune the wheels
19:44:16 can do
19:44:28 and ya starting with cleanups like that is a good starting point and we can take it from there
19:44:36 fedora has now become a move from 36 -> 38 situation
19:44:39 and yeah, the bookworm release is still a ways out, but at least there's a date now
19:44:59 i'm not sure how much time i'll have to drive that
19:45:02 ianw: one idea I had was whether or not centos 9 stream is sufficiently up to date that fedora is maybe less important?
19:45:16 ianw: but I'm not plugged into how those distros are being used well enough to know if that is the case
19:45:28 naively they seem to fit into a similar space for CI needs anyway
19:45:29 heh, yeah, i was about to say we might want to consider the future of it
19:46:04 maybe we should write an email to the service-discuss list about it
19:46:16 to solicit feedback from users to see how that might impact them
19:46:26 I can write and send that if we think it is a good idea
19:46:28 it was mostly for devstack; it's quite a lot of overhead in zuul-jobs
19:47:16 i'm happy to put a bit of pressure on openstack to justify the continued use of fedora nodes too
19:47:25 ok I can draft that email
19:47:52 I'll be sure to get it looked over before sending just to avoid saying anything obviously incorrect
19:48:02 #topic Gitea 1.19
19:48:04 ++
19:48:11 #link https://review.opendev.org/c/opendev/system-config/+/877541 Upgrade to gitea 1.19.2
19:48:20 I think gitea 1.19 is ready to go when we are
19:48:35 The "api requires auth to list orgs" bug was fixed and I updated our ansible to reflect this.
19:48:50 I can help with the Fedora update and also the CentOS determination
19:48:50 I did drop a no_log: true in that update since I dropped auth as well. But please double check me on that being safe
19:49:24 tonyb: cool I'll ping you with the email draft too to ensure it makes sense from your perspective and we can edit or decide to hold off if necessary
19:49:28 #link https://158.69.65.228:3081/opendev/system-config Held test node for checking
19:49:41 this is a held node running the deployment of 1.19.2 from the above change which can be used for checking it looks good
19:49:42 clarkb: Sounds good
19:50:38 My afternoon is going to be pretty busy bouncing around errands and parenting today but I'm happy to watch that deploy tomorrow if I get the necessary reviews today
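One quick way to exercise the fixed behavior on the held node is to list organizations anonymously through the gitea REST API. The /api/v1/orgs endpoint and response shape below reflect my understanding of the gitea API, and verify=False is only because the held node serves a self-signed certificate:

    import requests

    held = "https://158.69.65.228:3081"

    # Listing orgs without any auth should succeed again on 1.19.2 now that
    # the "api requires auth to list orgs" regression is fixed.
    resp = requests.get(f"{held}/api/v1/orgs", verify=False, timeout=30)
    print(resp.status_code)
    print([org["username"] for org in resp.json()])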
19:51:12 The major change is the addition of gitea actions, which we turn off. This makes it a much simpler upgrade compared to 1.18
19:51:21 #topic Storyboard
19:51:43 fungi: I think you've continued to do some work helping projects gracefully turn stuff off
19:52:00 anything new to add on this? (I think frickler is out for a few weeks right now too)
19:52:27 not really, just making sure we're consistently setting projects to inactive if they move off sb
19:52:47 and could stand to do an audit between projects.yaml and the sb database
19:53:10 to see what we've missed (moves to lp and general retirements)
19:53:14 storyboard doesn't have the gerrit issue of marking a project read only making it difficult to undo does it?
19:53:20 nah
19:53:24 Mostly just curious, I doubt it would be an issue anyway
19:53:52 however, it never added a ui widget to toggle the active field, so it has to be done with the cli
19:54:05 fungi: is that documented somewhere?
19:54:17 just thinking it would be good to be able to have other people do it if you go on vacation etc
19:54:21 it can be
19:54:34 but it's also not an urgent thing to change
19:54:53 i'll try to remember to push up some docs, or better still an audit script to go with it
19:55:44 mainly just stops the project from getting used in autocomplete for fields and turning up in searches, i think
19:56:00 makes sense
19:56:05 thanks!
19:56:11 #topic Open Discussion
19:56:14 Anything else?
19:56:36 As briefly mentioned earlier my afternoon is going to involve me being in and out.
19:57:11 i'll be out a lot over the next few days
19:57:30 but not gone entirely. some errands, long lunches, and a half day on friday
19:58:20 Sounds like that may be everything. Thank you for your time. We'll be back here next week
19:58:34 enjoy your morning/afternoon/evening!
19:58:37 #endmeeting