19:01:10 <clarkb> #startmeeting infra
19:01:11 <opendevmeet> Meeting started Tue Sep  7 19:01:10 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <opendevmeet> The meeting name has been set to 'infra'
19:01:16 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000281.html Our Agenda
19:01:25 <clarkb> #topic Announcements
19:01:35 <clarkb> I had nothing to announce
19:01:52 <fungi> ml upgrade
19:02:14 <clarkb> oh yup that is on the topic list but worth calling out here if people read the announcements and not the rest of the log.
19:02:33 <clarkb> lists.openstack.org will have its operating system upgraded September 12 beginning at 15:00 UTC
19:03:18 <fungi> #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000280.html Mailing lists offline 2021-09-12 for server upgrade
19:03:40 <fungi> i also sent a copy to the main discuss lists for each of the different mailman sites we host on that server
19:04:06 <clarkb> the lists.katacontainers.io upgrade seemed to go well and we've tested this on zuul test nodes as well as a snapshot of that server
19:04:25 <clarkb> should hopefully be a matter of answering questions from the upgrade system and checking things are happy after
19:05:27 <clarkb> #topic Actions from last meeting
19:05:32 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-31-19.01.txt minutes from last meeting
19:05:36 <clarkb> There were no actions recorded
19:05:40 <clarkb> #topic Specs
19:05:46 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:06:14 <clarkb> corvus: fungi: ianw: can I get reviews on this spec? I think it is fairly straightforward and approvable but wanted to make sure I got the details as others expected them
19:06:33 <clarkb> thank you tristanC and frickler for the reviews
19:06:49 <fungi> thanks for the reminder, i've starred it
19:07:21 <clarkb> #topic Topics
19:07:29 <clarkb> #topic lists.o.o operating system upgrade
19:07:40 <clarkb> as mentioned previously this is happening on September 12 at 15:00 UTC
19:07:55 <clarkb> This upgrade will affect lists for openstack, opendev, airship, starlingx and zuul
19:08:35 <fungi> i also did some preliminary calculations on memory consumption for the lists.katacontainers.io server post-upgrade and it seems like it's not going to present any significant additional memory pressure at least
19:08:52 <clarkb> thank you for checking that. I plan to be around for the upgrade as well
19:08:55 <fungi> unfortunately i didn't check memory utilization pre-upgrade and we don't have that server in cacti, so no trending
19:09:24 <fungi> however i'm not super concerned that the lists.o.o server will be under-sized for the upgraded state
19:09:54 <clarkb> it is bigger than I had thought previously too which gives us more headroom than I expected :)
19:10:34 <fungi> after the upgrade is concluded, the openinfra foundation is interested in adding a lists.openinfra.dev site and moving a number of foundation-specific lists to that, so i'll pay close attention to the memory utilization post-upgrade to make sure that addition won't pose a resource problem
19:11:31 <fungi> (for those who aren't aware, our current deployment model uses 9 python processes for the various queue runners for each site)
19:11:45 <clarkb> I think that is about it for the lists upgrade. be aware of it and fungi and I will keep everyone updated as we go through the process
19:12:09 <clarkb> #topic Improving OpenDev's CD Throughput
19:12:12 <fungi> also once the ubuntu upgrade is done, i think we can start planning more seriously for containerized mailman 3
19:12:18 <fungi> oops, sorry
19:12:18 <clarkb> fungi: ++
19:12:27 <clarkb> no worries I think that is the next step for the mailman services
19:12:49 <clarkb> I haven't had time to dig into our jobs yet. Too many things kept popping up
19:12:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807672/ starts to sketch this out.
19:12:56 <clarkb> But ianw took a look yesterday
19:13:32 <clarkb> ianw: can you give us the high level overview of this change? It seems you've modified a few jobs then started working on pipeline updates? Looks like you're sketching stuff out and this isn't quite ready yet
19:13:35 <fungi> i've been sort of paying attention to what jobs are running on system-config changes now, and it still seems sane
19:14:01 <ianw> yeah i was going to draw graphs and things but i noticed a few things
19:14:34 <ianw> firstly the system-config-run and infra-prod stages are fairly different; in that for system-config-run you just include the letsencrypt playbook, while for prod you need to run the job first
19:15:42 <ianw> in short, i think we really just need to make sure things depend on either the base job, or the letsencrypt job, or their relevant parent (but there's only a handful of cases like that)
19:16:01 <ianw> i don't see why they can't run in parallel after that
19:16:40 <clarkb> cool. I also noticed you changed how manage-projects runs a little bit. I believe we're primarily driving that from project-config today, but this has it run out of system-config more often?
19:17:27 <ianw> yeah, for manage-projects all i did was put the file matchers into the job definition rather than in the project pipelines
19:18:18 <clarkb> ianw: ok, I think it is done that way because we run it from openstack/project-config and file matchers are different there?
19:18:27 <ianw> and also i think that should probably depend on infra-prod-review?  as in if we've rolled out any changes to review we'd want them to merge before projects
19:18:31 <clarkb> that might need a little bit of extra investigating to understand how zuul handles that and whether it is appropriate for manage-projects
19:18:35 <clarkb> ianw: ++
19:18:48 <ianw> oh; that could be, yep.  something that probably wants a comment :)
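For reference, putting the file matchers on the job definition as ianw describes might look roughly like the sketch below; the file patterns are illustrative placeholders, not the exact matchers used in system-config.

    # Hypothetical sketch: attach file matchers to the job itself instead
    # of repeating them on each project pipeline entry.
    - job:
        name: infra-prod-manage-projects
        files:
          - ^playbooks/manage-projects\.yaml$   # illustrative path
          - ^inventory/service/.*               # illustrative path

With matchers on the job, the same filtering applies anywhere the job is attached, instead of depending on each project stanza repeating them.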
19:19:10 <ianw> then i think infra-prod-bridge was another one i wasn't sure of in the build hierarchy
19:19:44 <clarkb> that helps me understand some of what is going on there. I can leave some comments after the meeting
19:19:49 <ianw> that pulls an updated system-config onto bridge; but i don't think that matters?  everything runs on bridge, but via zuul-checkout?
19:20:25 <clarkb> infra-prod-bridge also configures other things on bridge like the ansible version iirc
19:20:33 <ianw> yeah, it was mostly a sketch, i see it syntax errored.  but it suggested to me that we can probably tackle the issue with mostly just thinking about it and formatting things nicely in the file
19:21:15 <fungi> forcibly updating the checkout on bridge seems like the most sensible way to prevent accidental rollbacks from races in different pipelines too
19:21:51 <clarkb> I think that each job is using the checkout associated with its triggering change
19:22:08 <clarkb> there is an escape hatch in that task that checks if it is running in a periodic pipeline in which case it uses master instead
19:22:17 <clarkb> definitely seems unnecessary to do the checkout in a prior job
19:22:44 <fungi> ahh, okay, so we still need some mitigation if mutex prioritization is implemented (did that ever land?)
19:23:27 <clarkb> ya I'm still not sure if we decided if that was necessary or not. Going from change to periodic should be fine, but periodic to change may not be?
19:23:49 <clarkb> though if we prioritize the change pipeline then periodic to change would only happen when a new change arrives and should be safe
19:23:55 <fungi> oh, when you say "checks if it is running in a periodic pipeline in which case it uses master instead" you mean explicitly updates the checkout when the build starts rather than using the master branch state zuul associated with it when enqueued. yeah that should be good enough
19:24:03 <clarkb> so ya I think we're ok as long as deploy has a higher priority than the periodic pipelines
19:24:21 <clarkb> fungi: yes, that was my reading of it
19:24:22 <fungi> yes, i concur
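A minimal sketch of the escape hatch being described, assuming an Ansible task running on bridge with the standard zuul job variables available; the repo path and task name are hypothetical, not the actual system-config playbook.

    # Hypothetical sketch: normally the checkout Zuul prepared for the
    # triggering change is used as-is; a periodic pipeline run forces the
    # checkout back to master instead.
    - name: Force system-config to master for periodic runs
      git:
        repo: https://opendev.org/opendev/system-config
        dest: /home/zuul/src/opendev.org/opendev/system-config
        version: master
      when: "'periodic' in zuul.pipeline"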
19:24:29 <ianw> so should "infra-prod-bridge" be the base job?  as in infra-prod-base <- infra-prod-bridge <- infra-prod-letsencrypt <- <most other jobs>
19:24:45 <fungi> i forgot we had already arrived at that conclusion
19:25:02 <fungi> ianw: that sounds great to me
19:25:16 <ianw> if we're thinking that say updating an ansible version on bridge should affect all following jobs
19:25:24 <clarkb> ianw: yes I think so but less for having system-config updates and more so that ansible and its config update before running more jobs
19:26:08 <ianw> these are all soft dependencies
19:26:32 <clarkb> that sounds right
19:26:38 <clarkb> we don't need -bridge to run if ansible isn't updating
19:26:45 <ianw> i assume they "pass upwards" correctly.  so basically if there are no changes that match on the base/bridge for the change we're running, then everything will just fire in parallel because it knows we're good
19:27:06 <clarkb> that is my understanding of how the soft dependencies should work
19:27:42 <ianw> we may uncover deficiencies in our file matchers, but i think we just have to watch what runs and debug that
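As a sketch of the hierarchy discussed above (infra-prod-base, infra-prod-bridge and infra-prod-letsencrypt are the jobs named in this conversation; infra-prod-service-foo is a stand-in for a typical service job, and the real layout in system-config may differ), the deploy pipeline entries could look something like:

    - project:
        deploy:
          jobs:
            - infra-prod-base
            - infra-prod-bridge:
                dependencies:
                  - name: infra-prod-base
                    soft: true
            - infra-prod-letsencrypt:
                dependencies:
                  - name: infra-prod-bridge
                    soft: true
            # stand-in for a typical service deployment job
            - infra-prod-service-foo:
                dependencies:
                  - name: infra-prod-letsencrypt
                    soft: true

With soft dependencies, a parent that is skipped by its file matchers does not block the child, so unrelated jobs can fan out in parallel as described above.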
19:29:04 <clarkb> that all sounds good. I'll try to leave those comments on the change and we can continue to refine this in review.
19:29:08 <clarkb> Anything else on this subject?
19:29:29 <ianw> nope, not from me
19:29:44 <clarkb> #topic Gerrit Account Cleanups
19:30:04 <clarkb> I finalized the previous batch of conflict cleanups which leaves us with 33 conflicts
19:30:26 <clarkb> My intention with these is to find a morning or afternoon where I can start writing down a plan for each one then email the users directly with that proposal
19:30:47 <clarkb> Then assuming I get acks back I'll go ahead and start committing those fixes in a tmp checkout of All-Users on review02.
19:30:57 <fungi> is the list of those in your homedir on review.o.o?
19:31:29 <clarkb> I'll probably give users 2-3 weeks to respond and if they don't, go ahead with my plan for them as well. Importantly, once we commit these last fixes we should be able to fix any account while gerrit is online by adding and removing commits to all-users that pass validations
19:31:42 <clarkb> fungi: yup all the logs and details are in the typical location including my most recent audit results
19:33:27 <clarkb> I'll probably reach out if I need help with planning for these users otherwise I'll start emailing people this week hopefully
19:33:34 <clarkb> is anyone interested in being CC'd on those comms?
19:34:53 <ianw> sure
19:35:01 <clarkb> thanks!
19:35:03 <clarkb> #topic OpenDev Logo Hosting
19:35:22 <clarkb> The changes to make the opendevorg/assets image a thing landed this morning and gitea redeployed using those builds
19:35:29 <clarkb> thank you ianw for working through this
19:35:39 <fungi> it's awesome
19:35:45 <fungi> truly
19:35:53 <clarkb> We do still need to update gerrit and paste to incorporate the new bits one way or another
19:36:13 <clarkb> with gerrit we currently bind mount the static content dir and could put the files in that location and serve them that way
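To illustrate the bind mount option clarkb mentions, the gerrit container's static content directory could carry the logo files with a compose entry along these lines; the paths are illustrative and the actual docker-compose file in system-config may differ.

    # Hypothetical docker-compose fragment: files dropped into the
    # host-side static directory are served by gerrit under /static/.
    services:
      gerrit:
        volumes:
          - /home/gerrit2/review_site/static:/var/gerrit/static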
19:36:22 <clarkb> I'm not sure what the best method for paste would be
19:36:38 <clarkb> ianw: ^ you might have thoughts on those services?
19:37:03 <ianw> i think the easy approach is just pointing them at https://opendev.org/opendev/system-config/assets/
19:37:09 <ianw> i can propose changes for them both
19:37:26 <clarkb> that works too, and thanks
19:37:41 <clarkb> certainly we can start there and that will be far more static for the gitea 1.15.x upgrade
19:38:35 <clarkb> Once this logo effort is done I'd like to see if we're happy enough with the state of things to do that gitea upgrade. I'll bring that up once logos are done
19:38:57 <clarkb> #topic Rebooting gitea servers for host migrations in vexxhost sjc1
19:39:20 <fungi> 08 is already done, yeah? just batching up the rest and then doing the lb?
19:39:23 <clarkb> This is a last minute addition as mnaser is asking us to reboot gitea servers to cold migrate them to new hardware
19:39:41 <fungi> did the gerrit server already get migrated?
19:40:02 <clarkb> yup 08 is done. 06 and 07 are pulled from haproxy and ready to go. Note that mnaser needs to do the reboot/cold migration on his end as we cannot trigger it ourselves so I'm working with mnaser to turn things off in haproxy and then he can migrate
19:40:16 <clarkb> fungi: the gerrit server is/was already on new amd hardware and doesn't need this to happen
19:40:28 <clarkb> I think previously mnaser had asked about doing review not realizing it was on the new amd stuff already
19:40:39 <clarkb> in any case review wasn't on the list supplied this morning
19:40:47 <fungi> oh, cool, i remember him indicating some weeks back a need to migrate it, but maybe that was old info
19:41:04 <clarkb> I can double check with him when he gets back to the migrations
19:41:37 <clarkb> do we have any opinions on how to do the load balancer? Probably just do it this afternoon (relative to my time) if the gitea backends are happy with the moves?
19:41:49 <clarkb> that potentially impacts zuul jobs but zuul tends to be quieter during that time of day
19:42:29 <fungi> yeah, i mean we could try to pause all running jobs $somehow but honestly we warn projects not to have their jobs pull from gitea or gerrit anyway
19:42:39 <ianw> does zuul need a restart for anything?
19:43:01 <clarkb> ianw: there are a few changes that we could restart zuul for but the one change that I really wanted to get in isn't ready yet or wasn't last I checked
19:43:14 <ianw> i don't think the fix to the log buttons rolled out, and iirc corvus mentioned maybe we should just roll the whole thing
19:43:16 <clarkb> https://review.opendev.org/c/zuul/zuul/+/807221/ that change
19:43:20 <corvus> my bugfix merged with 2 others and has gotten big
19:43:34 <clarkb> I intend on rereviewing that change this afternoon
19:43:59 <corvus> i'm not feeling a huge need to restart right now
19:44:06 <clarkb> ok
19:44:15 <fungi> i think a quick reboot for the lb is probably fine whenever
19:44:17 <clarkb> In that case probably the easiest thing for the load balancer is to just go for it
19:44:23 <fungi> agreed
19:44:46 <clarkb> sounds good I'll continue to coordinate with mnaser on that and get this done
19:44:47 <ianw> i can do it in my afternoon if we like, make it even quieter
19:45:00 <clarkb> ianw: I think the problem with that is mnaser (or vexxhost person) has to do it
19:45:09 * fungi does not know how to make ianw's afternoon even quieter
19:45:18 <ianw> oh right, well let me know :)
19:45:19 <clarkb> I'm mostly managing the impact on our side but mnaser is pushing the cold migrate button
19:45:49 <ianw> fungi: you could come and cure covid and get my kids out of homeschool, that would help! :)
19:46:16 <fungi> d'oh!
19:46:19 <clarkb> #topic Open Discussion
19:46:31 <clarkb> That was it for the agenda. Anything else worth mentioning?
19:47:18 <fungi> opendev's testing and deployment is going to be featured in a talk at ansiblefest, for those who missed the announcement in other places
19:49:09 <fungi> #link https://events.ansiblefest.redhat.com/widget/redhat/ansible21/sessioncatalog/session/16248953812130016Yue
19:50:12 <fungi> registration is "free" for the virtual event, after you fill out 20 pages about how your company might be interested in ansible ;)
19:50:21 <corvus> i hope it's good :)
19:51:20 <corvus> i think it's about the coolest thing you can do with ansible so...
19:51:55 <fungi> it certainly is cool, i'll give you that
19:52:09 <corvus> sessions are also ~20m so hopefully shouldn't be a slog
19:52:38 <clarkb> oh I like that
19:52:38 <corvus> sept 29-30
19:52:43 <clarkb> I think shorter works better for virtual
19:52:58 <corvus> yeah, they had some good info for speakers about exactly that
19:53:42 <corvus> like, your audience is in a different situation when virtual, so structure the talk a little differently
19:53:59 <fungi> that's helpful
19:56:26 <clarkb> Sounds like that may be it.
19:56:29 <clarkb> Thank you everyone!
19:56:36 <clarkb> See you here next week. Same time and location
19:56:39 <fungi> thanks clarkb!
19:56:39 <clarkb> #endmeeting