19:01:10 #startmeeting infra
19:01:11 Meeting started Tue Sep 7 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 The meeting name has been set to 'infra'
19:01:16 #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000281.html Our Agenda
19:01:25 #topic Announcements
19:01:35 I had nothing to announce
19:01:52 ml upgrade
19:02:14 oh yup that is on the topic list but worth calling out here if people read the announcements and not the rest of the log.
19:02:33 lists.openstack.org will have its operating system upgraded September 12 beginning at 15:00 UTC
19:03:18 #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000280.html Mailing lists offline 2021-09-12 for server upgrade
19:03:40 i also sent a copy to the main discuss lists for each of the different mailman sites we host on that server
19:04:06 the lists.katacontainers.io upgrade seemed to go well and we've tested this on zuul test nodes as well as a snapshot of that server
19:04:25 should hopefully be a matter of answering questions for the upgrade system and checking things are happy after
19:05:27 #topic Actions from last meeting
19:05:32 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-31-19.01.txt minutes from last meeting
19:05:36 There were no actions recorded
19:05:40 #topic Specs
19:05:46 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:06:14 corvus: fungi: ianw: can I get reviews on this spec? I think it is fairly straightforward and approvable but wanted to make sure I got the details as others expected them
19:06:33 thank you tristanC and frickler for the reviews
19:06:49 thanks for the reminder, i've starred it
19:07:21 #topic Topics
19:07:29 #topic lists.o.o operating system upgrade
19:07:40 as mentioned previously this is happening on September 12 at 15:00 UTC
19:07:55 This upgrade will affect lists for openstack, opendev, airship, starlingx and zuul
19:08:35 i also did some preliminary calculations on memory consumption for the lists.katacontainers.io server post-upgrade and it seems like it's not going to present any significant additional memory pressure at least
19:08:52 thank you for checking that. I plan to be around for the upgrade as well
19:08:55 unfortunately i didn't check memory utilization pre-upgrade and we don't have that server in cacti, so no trending
19:09:24 however i'm not super concerned that the lists.o.o server will be under-sized for the upgraded state
19:09:54 it is bigger than I had thought previously too which gives us more headroom than I expected :)
19:10:34 after the upgrade is concluded, the openinfra foundation is interested in adding a lists.openinfra.dev site and moving a number of foundation-specific lists to that, so i'll pay close attention to the memory utilization post-upgrade to make sure that addition won't pose a resource problem
19:11:31 (for those who aren't aware, our current deployment model uses 9 python processes for the various queue runners for each site)
19:11:45 I think that is about it for the lists upgrade. Be aware of it and fungi and I will keep everyone updated as we go through the process
19:12:09 #topic Improving OpenDev's CD Throughput
19:12:12 also once the ubuntu upgrade is done, i think we can start planning more seriously for containerized mailman 3
19:12:18 oops, sorry
19:12:18 fungi: ++
19:12:27 no worries I think that is the next step for the mailman services
19:12:49 I haven't had time to dig into our jobs yet. Too many things kept popping up
19:12:51 #link https://review.opendev.org/c/opendev/system-config/+/807672/ starts to sketch this out.
19:12:56 But ianw took a look yesterday
19:13:32 ianw: can you give us the high level overview of this change? It seems you've modified a few jobs then started working on pipeline updates? Looks like you're sketching stuff out and this isn't quite ready yet
19:13:35 i've been sort of paying attention to what jobs are running on system-config changes now, and it still seems sane
19:14:01 yeah i was going to draw graphs and things but i noticed a few things
19:14:34 firstly the system-config-run and infra-prod stages are fairly different; in that for system-config-run you just include the letsencrypt playbook, while for prod you need to run the job first
19:15:42 in short, i think we really just need to make sure things depend on either the base job, or the letsencrypt job, or their relevant parent (but there's only a handful of cases like that)
19:16:01 i don't see why they can't run in parallel after that
19:16:40 cool. I also noticed you change how manage-projects runs a little bit. I believe we're primarily driving that from project-config today, but this has it run out of system-config more often?
19:17:27 yeah, for manage-projects all i did was put the file matchers into the job: rather than in the projects
19:18:18 ianw: ok, I think it is done that way because we run it from openstack/project-config and file matchers are different there?
19:18:27 and also i think that should probably depend on infra-prod-review? as in if we've rolled out any changes to review we'd want them to merge before projects
19:18:31 that might need a little bit of extra investigating to understand how zuul handles that and whether it is appropriate for manage-projects
19:18:35 ianw: ++
19:18:48 oh; that could be, yep. something that probably wants a comment :)
19:19:10 then i think infra-prod-bridge was another one i wasn't sure of in the build hierarchy
19:19:44 that helps me understand some of what is going on there. I can leave some comments after the meeting
19:19:49 that pulls an updated system-config onto bridge; but i don't think that matters? everything runs on bridge, but via zuul-checkout?
19:20:25 infra-prod-bridge also configures other things on bridge like the ansible version iirc
19:20:33 yeah, it was mostly a sketch, i see it syntax errored. but it suggested to me that we can probably tackle the issue with mostly just thinking about it and formatting things nicely in the file
19:21:15 forcibly updating the checkout on bridge seems like the most sensible way to prevent accidental rollbacks from races in different pipelines too
19:21:51 I think that each job is using the checkout associated with its triggering change
19:22:08 there is an escape hatch in that task that checks if it is running in a periodic pipeline in which case it uses master instead
19:22:17 definitely seems unnecessary to do the checkout in a prior job
19:22:44 ahh, okay, so we still need some mitigation if mutex prioritization is implemented (did that ever land?)
19:23:27 ya I'm still not sure if we decided if that was necessary or not. Going from change to periodic should be fine, but periodic to change may not be?
19:23:49 though if we prioritize the change pipeline then periodic to change would only happen when a new change arrives and should be safe
19:23:55 oh, when you say "checks if it is running in a periodic pipeline in which case it uses master instead" you mean explicitly updates the checkout when the build starts rather than using the master branch state zuul associated with it when enqueued. yeah that should be good enough
19:24:03 so ya I think we're ok as long as deploy has a higher priority than the periodic pipelines
19:24:21 fungi: yes, that was my reading of it
19:24:22 yes, i concur
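For readers following along, the "escape hatch" described above amounts to resetting the system-config checkout on bridge to master whenever the job runs from a periodic pipeline, instead of using whatever state Zuul associated with the item at enqueue time. A minimal Ansible sketch of that idea follows; the paths, pipeline test, and variable fallback are illustrative assumptions, not the actual system-config task.

```yaml
# Illustrative sketch only -- hypothetical path and pipeline test, not the
# actual system-config task. Deploy runs check out the revision associated
# with the triggering change; periodic runs explicitly reset to master so a
# long-queued periodic build cannot roll the checkout backwards.
- name: Update the system-config checkout on bridge
  git:
    repo: https://opendev.org/opendev/system-config
    dest: /home/zuul/src/opendev.org/opendev/system-config  # hypothetical path
    version: "{{ 'master' if 'periodic' in zuul.pipeline else (zuul.newrev | default('master')) }}"
    force: yes
```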
19:24:29 so should "infra-prod-bridge" be the base job? as in infra-prod-base <- infra-prod-bridge <- infra-prod-letsencrypt <-
19:24:45 i forgot we had already arrived at that conclusion
19:25:02 ianw: that sounds great to me
19:25:16 if we're thinking that say updating an ansible version on bridge should affect all following jobs
19:25:24 ianw: yes I think so but less for having system-config updates and more so that ansible and its config update before running more jobs
19:26:08 these are all soft dependencies
19:26:32 that sounds right
19:26:38 we don't need -bridge to run if ansible isn't updating
19:26:45 i assume they "pass upwards" correctly. so basically if there's no changes that match on the base/bridge for the change we're running, then everything will just fire in parallel because it knows we're good
19:27:06 that is my understanding of how the soft dependencies should work
19:27:42 we may uncover deficiencies in our file matchers, but i think we just have to watch what runs and debug that
19:29:04 that all sounds good. I'll try to leave those comments on the change and we can continue to refine this in review.
19:29:08 Anything else on this subject?
19:29:29 nope, not from me
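Putting the pieces of that discussion together, the proposed deploy job graph would look roughly like the sketch below. The job names are the ones mentioned in the meeting; the file matchers, the example service job, and the exact layout are illustrative assumptions and not the contents of change 807672.

```yaml
# Illustrative Zuul job sketch; not the actual system-config change under
# review. Soft dependencies mean a skipped parent (no matching files) does
# not block the child, so unaffected service jobs can start in parallel.
- job:
    name: infra-prod-bridge
    parent: infra-prod-base
    files:
      - playbooks/install-ansible.yaml       # hypothetical matcher

- job:
    name: infra-prod-letsencrypt
    parent: infra-prod-base
    dependencies:
      - name: infra-prod-bridge
        soft: true

- job:
    name: infra-prod-service-example         # hypothetical service job
    parent: infra-prod-base
    dependencies:
      - name: infra-prod-letsencrypt
        soft: true
    files:
      - playbooks/service-example.yaml       # hypothetical matcher
```

If a change touches nothing matching the bridge or letsencrypt matchers, those jobs are skipped and, because the dependencies are soft, the service jobs still start immediately, which is the "pass upwards" behaviour discussed above.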
19:29:44 #topic Gerrit Account Cleanups
19:30:04 I finalized the previous batch of conflict cleanups which leaves us with 33 conflicts
19:30:26 My intention with these is to find a morning or afternoon where I can start writing down a plan for each one then email the users directly with that proposal
19:30:47 Then assuming I get acks back I'll go ahead and start committing those fixes in a tmp checkout of All-Users on review02.
19:30:57 is the list of those in your homedir on review.o.o?
19:31:29 I'll probably give users 2-3 weeks to respond and if they don't, go ahead with my plan for them as well. Importantly once we commit these last fixes we should be able to fix any account while gerrit is online by adding and removing commits to all-users that pass validations
19:31:42 fungi: yup all the logs and details are in the typical location including my most recent audit results
19:33:27 I'll probably reach out if I need help with planning for these users otherwise I'll start emailing people this week hopefully
19:33:34 is anyone interested in being CC'd on those comms?
19:34:53 sure
19:35:01 thanks!
19:35:03 #topic OpenDev Logo Hosting
19:35:22 The changes to make the opendevorg/assets image a thing landed this morning and gitea redeployed using those builds
19:35:29 thank you ianw for working through this
19:35:39 it's awesome
19:35:45 truly
19:35:53 We do still need to update gerrit and paste to incorporate the new bits one way or another
19:36:13 with gerrit we currently bind mount the static content dir and could put the files in that location and serve them that way
19:36:22 I'm not sure what the best method for paste would be
19:36:38 ianw: ^ you might have thoughts on those services?
19:37:03 i think the easy approach is pointing that at https://opendev.org/opendev/system-config/assets/
19:37:09 i can propose changes for them both
19:37:26 that works too, and thanks
19:37:41 certainly we can start there and that will be far more static for the gitea 1.15.x upgrade
19:38:35 Once this logo effort is done I'd like to see if we're happy enough with the state of things to do that gitea upgrade. I'll bring that up once logos are done
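For reference, the bind-mount option mentioned above for gerrit would look something like the docker-compose sketch below: let the opendevorg/assets image drop the shared logo into the static content directory that is already bind mounted into the Gerrit container. The host paths, image tags, and copy command are illustrative assumptions, not the actual system-config deployment.

```yaml
# Illustrative docker-compose sketch only; host paths, image tags, and the
# one-shot copy command are assumptions rather than the real deployment.
services:
  assets:
    image: opendevorg/assets:latest
    volumes:
      - /home/gerrit2/review_site/static:/target       # hypothetical host path
    command: sh -c "cp /usr/share/assets/* /target/"   # hypothetical source path
  gerrit:
    image: opendevorg/gerrit:latest                     # hypothetical tag
    depends_on:
      - assets
    volumes:
      - /home/gerrit2/review_site/static:/var/gerrit/static
```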
19:38:57 #topic Rebooting gitea servers for host migrations in vexxhost sjc1
19:39:20 08 is already done, yeah? just batching up the rest and then doing the lb?
19:39:23 This is a last minute addition as mnaser is asking us to reboot gitea servers to cold migrate them to new hardware
19:39:41 did the gerrit server already get migrated?
19:40:02 yup 08 is done. 06 and 07 are pulled from haproxy and ready to go. Note that mnaser needs to do the reboot/cold migration on his end as we cannot trigger it ourselves so I'm working with mnaser to turn things off in haproxy and then he can migrate
19:40:16 fungi: the gerrit server is/was already on new amd hardware and doesn't need this to happen
19:40:28 I think previously mnaser had asked about doing review not realizing it was on the new amd stuff already
19:40:39 in any case review wasn't on the list supplied this morning
19:40:47 oh, cool, i remember him indicating some weeks back a need to migrate it, but maybe that was old info
19:41:04 I can double check with him when he gets back to the migrations
19:41:37 do we have any opinions on how to do the load balancer? Probably just do it this afternoon (relative to my time) if the gitea backends are happy with the moves?
19:41:49 that potentially impacts zuul jobs but zuul tends to be quieter during that time of day
19:42:29 yeah, i mean we could try to pause all running jobs $somehow but honestly we warn projects not to have their jobs pull from gitea or gerrit anyway
19:42:39 does zuul need a restart for anything?
19:43:01 ianw: there are a few changes that we could restart zuul for but the one change that I really wanted to get in isn't ready yet or wasn't last I checked
19:43:14 i don't think the fix to the log buttons rolled out, and iirc corvus mentioned maybe we should just roll the whole thing
19:43:16 https://review.opendev.org/c/zuul/zuul/+/807221/ that change
19:43:20 my bugfix merged with 2 others and has gotten big
19:43:34 I intend on re-reviewing that change this afternoon
19:43:59 i'm not feeling a huge need to restart right now
19:44:06 ok
19:44:15 i think a quick reboot for the lb is probably fine whenever
19:44:17 In that case probably the easiest thing for the load balancer is to just go for it
19:44:23 agreed
19:44:46 sounds good I'll continue to coordinate with mnaser on that and get this done
19:44:47 i can do it in my afternoon if we like, make it even quieter
19:45:00 ianw: I think the problem with that is mnaser (or a vexxhost person) has to do it
19:45:09 * fungi does not know how to make ianw's afternoon even quieter
19:45:18 oh right, well let me know :)
19:45:19 I'm mostly managing the impact on our side but mnaser is pushing the cold migrate button
19:45:49 fungi: you could come and cure covid and get my kids out of homeschool, that would help! :)
19:46:16 d'oh!
19:46:19 #topic Open Discussion
19:46:31 That was it for the agenda. Anything else worth mentioning?
19:47:18 opendev's testing and deployment is going to be featured in a talk at ansiblefest, for those who missed the announcement in other places
19:49:09 #link https://events.ansiblefest.redhat.com/widget/redhat/ansible21/sessioncatalog/session/16248953812130016Yue
19:50:12 registration is "free" for the virtual event, after you fill out 20 pages about how your company might be interested in ansible ;)
19:50:21 i hope it's good :)
19:51:20 i think it's about the coolest thing you can do with ansible so...
19:51:55 it certainly is cool, i'll give you that
19:52:09 sessions are also ~20m so hopefully shouldn't be a slog
19:52:38 oh I like that
19:52:38 sept 29-30
19:52:43 I think shorter works better for virtual
19:52:58 yeah, they had some good info for speakers about exactly that
19:53:42 like, your audience is in a different situation when virtual, so structure the talk a little differently
19:53:59 that's helpful
19:56:26 Sounds like that may be it.
19:56:29 Thank you everyone!
19:56:36 See you here next week. Same time and location
19:56:39 thanks clarkb!
19:56:39 #endmeeting