19:01:02 <clarkb> #startmeeting infra
19:01:02 <opendevmeet> Meeting started Tue Oct  5 19:01:02 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:02 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:02 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000287.html Our Agenda
19:01:16 <clarkb> #topic Announcements
19:01:25 <ianw> o/
19:01:28 <clarkb> The OpenStack release is happening tomorrow afternoon UTC time
19:01:29 <fungi> ahoy
19:01:41 <fungi> it'll probably start tomorrow morning utc
19:01:44 <clarkb> We should avoid changes to tools that produce code today and tomorrow until that is done
19:02:02 <fungi> but should hopefully be complete by 14:00 utc or thereabouts
19:02:03 <clarkb> fungi: good point, it starts earlier but aims to be done by ~1500 UTC?
19:02:23 <fungi> yeah, 15z is press release time
19:02:38 <clarkb> Today is a good day to avoid touching gerrit, gitea, zuul, etc :)
19:02:51 <fungi> but they generally shoot to have all the artifacts and docs published and rechecked at least an hour prior
19:03:11 <clarkb> I plan to try and get up a bit early tomorrow to help out if anything comes up. But ya I expect it will be done by the time i have to take kids to school which will be nice as I can do that without concern then :)
19:03:16 <fungi> and it's a multi-hour process so usually begins around 10:00z or so
19:03:26 <clarkb> Anyway just be aware of that and lets avoid restarting zuul for example
19:04:28 <clarkb> #topic Actions from last meeting
19:04:34 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-28-19.01.txt minutes from last meeting
19:04:45 <clarkb> I don't see any recorded actions. Lets move on
19:04:51 <clarkb> #topic Specs
19:04:55 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:05:07 <clarkb> I didn't find time to update this spec to propose using the prebuilt binaries
19:05:16 <clarkb> #action clarkb update prometheus spec to use prebuilt binaries
19:05:26 <fungi> the current spec doesn't rule them out though
19:05:34 <clarkb> it does suggest we use the docker images
19:05:38 <clarkb> which are different
19:05:58 <clarkb> well specifically for node exporter the idea was docker images for that if we used it
19:06:15 <clarkb> but ianw makes a good point that we can just grab the prebuilt binaries for node exporter and host them ourselves and stick them on all our systems
19:06:36 <clarkb> that will give us consistent node exporter metrics without concern for version variance and avoids oddness with docker images
19:07:16 <clarkb> Anyway I'll try to update that. Seems like there are a lot of things happening this week (yay, pre-release pressure build-up all able to let go now)
19:07:23 <clarkb> #link https://review.opendev.org/810990 Mailman 3 spec
19:07:48 <clarkb> I'll give us all an action to give this a review as well. This is important not just for the service update but it will help inform us on the best path for handling the odd existing server
19:07:54 <clarkb> #action infra-root Review mailman 3 spec
19:07:59 <fungi> should be in a reviewable state, i don't have any outstanding todos for it, but feedback would be appreciated
19:08:22 <clarkb> #topic Topics
19:08:29 <clarkb> #topic PTGBot Deployment
19:08:43 <clarkb> This is a late entry to the meeting agenda that I added to my local copy
19:09:01 <clarkb> Looks like we've got a stack of changes to deploy ptgbot on the new eavesdrop setup, but we're struggling with LE staging errors
19:09:13 <clarkb> LE doesn't indicate any issues at https://letsencrypt.status.io/
19:09:51 <clarkb> We also had an idea that we might be able to split up the handlers/main.yaml that handles all the service restarts post cert update. That would then allow us to run more minimal sets of jobs when doing LE updates, but ansible doesn't work that way unfortunately
19:10:09 <clarkb> acme.sh already retries the requests where the protocol is expected to be less reliable
19:10:28 <clarkb> for that reason I hesitate to add retries in our ansible calls of acme.sh but that is a possibility too if we want to try and force this a bit more
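For context, a rough shell illustration of the kind of retry-with-backoff being discussed; this is a sketch only, not the actual role (in practice it would be expressed as retries/until on the ansible task that invokes acme.sh), and the example path in the usage comment is illustrative:

    # Retry an arbitrary command a few times with an increasing delay.
    retry_with_backoff() {
        local attempt
        for attempt in 1 2 3; do
            "$@" && return 0
            echo "attempt ${attempt} of '$*' failed; backing off" >&2
            sleep $((attempt * 60))
        done
        return 1
    }
    # usage (path is illustrative): retry_with_backoff /opt/acme.sh/driver.sh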
19:10:44 <ianw> i wonder if we should just short-cut acme.sh
19:10:52 <clarkb> fungi: ianw: ^ anything else to add on this subject? Mostly wanted to call it out because the LE stuff could have more widespread implications
19:11:01 <clarkb> ianw: when testing you mean?
19:11:10 <fungi> ianw: i wouldn't be entirely opposed
19:11:11 <ianw> at the moment, what it does is asks the staging to setup the certs, so we get the TXT responses
19:11:32 <ianw> but we never actually put them into dns and finish verifying, we self-generate with openssl
19:11:41 <fungi> or i considered deploying a pebble container on the test nodes and pointing acme.sh to a localhost:xxxx url
19:11:55 <ianw> yeah, that has been on my todo list for a long time :)
19:12:41 <clarkb> In the past when we've seen the staging api have trouble it usually goes away within a day. Not great to rely on that, nor is there any guarantee or indication that will be the case when it happens again
19:12:41 <fungi> the staging environment api docs outright say it's not recommended for use in ci jobs
19:12:43 <ianw> in testing mode, we could just avoid calling acme.sh and somehow output a fake/random TXT record, to keep testing the surrounding bits
19:13:29 <clarkb> ianw: that might be a good compromise
19:13:38 <clarkb> ianw: the driver.sh could echo that out pretty easily
19:13:58 <ianw> i can look at this today; it would be nice to keep the path on one job, but maybe we should have a specific acme.sh test job
19:14:14 <clarkb> something like cat /dev/urandom | tr [:stuff:] [:otherstuff:] | tail -20
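For illustration, a minimal shell sketch of the fake TXT record idea above; the function name is hypothetical, and the 43-character base64url shape is an assumption mirroring how real dns-01 challenge values look:

    # Emit a fake dns-01-style TXT value in test mode instead of calling acme.sh.
    fake_txt_record() {
        # 32 random bytes -> unpadded base64url -> 43 characters
        head -c 32 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=' | cut -c1-43
    }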
19:14:35 <ianw> i have had https://github.com/acmesh-official/acme.sh/pull/2606 open for 2 years to better detect failures; clearly it hasn't attracted attention
19:14:39 <fungi> yes, i was thinking the same, maybe the system-config-run-letsencrypt job should use the staging env properly, and then we fake out all the others?
19:14:51 <clarkb> ++
19:15:12 <ianw> i can look at this today
19:15:38 <clarkb> thanks
19:15:40 <fungi> i looked through the acme.sh code and it does seem to retry aggressively with delays all over the place, so i'm surprised we're still seeing 500 responses bubble up
19:15:56 <clarkb> #topic Improving OpenDev's CD throughput
19:16:09 <clarkb> lets keep moving as we have a number of other topics to talk about today and limited time :)
19:16:30 <clarkb> ianw has written a stack of changes to move this along and improve our CD throughput
19:16:44 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807672 List dependencies for all jobs
19:17:03 <clarkb> this isn't currently mergeable because Zuul doesn't detect this change as having changes that need jobs to run
19:17:20 <clarkb> ianw: I was thinking that we should maybe just put a simple edit in a file somewhere to trick it
19:17:32 <clarkb> ianw: like our README or a dockerfile or something
19:18:43 <ianw> clarkb: I do think https://review.opendev.org/c/zuul/zuul/+/755988 might fix this type of situation
19:18:56 <fungi> should we have some job which always runs?
19:19:16 <ianw> but yes, i can do something like that.  the syntax check is probably the important bit of that
19:19:21 <fungi> ahh, 755988 is a neat idea!
19:19:26 <clarkb> oh interesting I'll have to review that zuul change
19:19:33 <fungi> similar to how it handles config changes
19:19:44 <fungi> great approach
19:19:51 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/807807 Update opendev/base-jobs to support having jobs in system-config that don't clone repos
19:20:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807808 stop cloning system-config in every system-config job
19:20:34 <clarkb> ianw: at the end of this stack we'll still be running everything serially, but in theory we'll be ready to update semaphores and run stuff in parallel?
19:20:49 <ianw> yes, that's the intention
19:21:10 <clarkb> great, they are on my list of things to review I've just got to find time between everything else :)
19:21:16 <clarkb> hopefully this afternoon for those though
19:21:28 <ianw> np; they *should* all be no-ops for live jobs
19:21:51 <clarkb> thank you for working on that
19:21:59 <ianw> but, as about 7 hours of yesterday highlights, sometimes something you think is a no-op can have interesting side-effects :)
19:22:27 <clarkb> #topic Gerrit Account Cleanup
19:22:42 <clarkb> I'm going to keep moving along to be sure we can get through everything. Happy to swing back to any topic at the end of our hour if we have time
19:23:02 <clarkb> I don't have anything new to say on this item. This issue gets deprioritized pretty easily, unfortunately
19:23:32 <clarkb> I may drop it from the meeting until I expect to be able to send the emails
19:23:49 <clarkb> #topic Debian Buster to Bullseye Updates
19:24:20 <clarkb> We have updated python base images for our docker containers. We should try to move as many images as possible from buster to bullseye as buster will eventually stop getting updates
19:24:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/809269 Gitea bullseye update
19:24:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/809286 Gerrit bullseye update
19:25:08 <clarkb> I've got those two changes pushed up for gerrit and gitea because I've been making changes to their docker images recently. But basically all the containers we run need similar treatment aiui
19:25:46 <clarkb> I'm bringing this up first because in a few minutes we'll also discuss gitea and gerrit service upgrades. I think we should decide on the order we want to tackle these updates in. Do we do the service or the OS first?
19:26:11 <fungi> "soon" is relative, the debian-lts team expect to support buster until june 2024
19:26:25 <ianw> i would say OS then service
19:26:25 <clarkb> fungi: oh isn't it like a year after release?
19:26:36 <clarkb> maybe it is a year after the n-1 release
19:26:37 <fungi> official security support ends in july 2022
19:26:50 <corvus> 'soon' in debian time :)
19:26:51 <clarkb> fungi: aha I am not completely crazy then
19:26:54 <fungi> and then lts takes over
19:27:03 <ianw> it seems either are fairly easy to roll back
19:27:08 <fungi> the lts peeps are separate from the debian security team
19:27:13 <clarkb> ianw: ++ exactly my thinking and ya happy to do OS first as a result
19:27:19 <fungi> sort of like openstack's stable maintenance team and extended maintenance
19:27:30 <clarkb> fungi: ok I'm not sure if our python base images and the other base images enable the lts stuff or not. We don't make those
19:27:45 <clarkb> probably best to get off of buster by july 2022 then we don't have to worry about it
19:27:50 <fungi> right, that'll be the bigger concern. what is the support lifetime of the python-base image
19:28:08 <ianw> (i need to get back to the nodepool/dib image upgrades too)
19:28:12 <fungi> which may or may not be tied to debian's support timeline
19:28:35 <clarkb> fungi: those images are based on debian so there is some relationship there. I doubt they go past the lts period. But wouldn't be surprised if they end in july 2022
19:29:02 <clarkb> it is also possible they stop building updates sooner than that. And as ianw mentions the updates seem straightforward with easy reverts so we should go ahead and work through them
19:29:09 <fungi> the debian docker images also aren't official, at least from the debian release team's perspective
19:29:29 <fungi> so it's more about when the debian docker image maintainers want to stop supporting buster
19:29:47 <clarkb> fungi: right and they are unlikely to make new buster packages once debian stops doing so
19:30:09 <clarkb> that caps the useful life of those images to likely july 2022
19:30:16 <clarkb> (unless they do lts)
19:30:51 <clarkb> Considering there is a vote for doing OS updates first I guess I should plan to land those two changes above tomorrow after openstack release is complete
19:30:51 <fungi> might be able to infer something by looking at whether/when they stopped doing stretch images
19:31:19 <clarkb> fungi: they may also just directly say it somewhere
19:32:18 <clarkb> Anyway I think we can pretty quickly work through these updates and then not worry about it
19:32:27 <clarkb> and as a side effect we'll get newer git and other fancy new software
19:32:43 <clarkb> (but git in particular should give us improvements on gitea and possibly even gerrit doing things like repacking)
19:33:21 <clarkb> #topic Gitea 1.15.3 Upgrade
19:33:32 <clarkb> Once the gitea OS update is done, this is the next thing I would like to do to gitea
19:33:42 <clarkb> Latest test instance: https://198.72.124.104:3081/opendev/system-config
19:33:56 <clarkb> That test instance lgtm and the logo hosting situation has been addressed with gerrit and paste et al
19:34:09 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803231
19:34:24 <clarkb> Are there any other concerns with doing this upgrade tomorrow/thursday timeframe?
19:34:44 <fungi> after the openstack release has wrapped up, i should be around to help with it
19:35:01 <ianw> no issues, similarly i can help
19:35:06 <clarkb> great and thanks
19:35:29 <fungi> also this reminds me, i want to work on getting our infrastructure donors on the main opendev.org page, now that we have apache on the gitea servers we could just serve normal page content instead of having to stuff it into gitea's main page template, would that be a better place to start?
19:36:08 <clarkb> fungi: there might be issues doing that and needing to host gitea at say https://opendev.org/gitea/
19:36:14 <clarkb> since all of our existing links out there don't have that root
19:36:14 <fungi> we'd need apache to directly serve the donor logos anyway probably
19:36:28 <clarkb> fungi: you can have gitea serve them just like the opendev logos
19:36:43 <clarkb> they have a static content directory with what I hope are stable paths now that they moved them
19:37:47 <fungi> seems like if we configure apache to only serve that page for get requests to the / url and when there are no query parameters, that wouldn't interfere with gitea
19:37:51 <clarkb> I guess we could maybe set it up where only the gitea landing page was hosted at /gitea/ and then all other paths would keep working? That is definitely my concern with doing something like that
19:38:20 <clarkb> fungi: I think you still need a gitea landing page because gitea serves a home link
19:38:37 <clarkb> basically you either need to hack up redirects such that that continues to work or you're hacking templates either way
19:38:49 <clarkb> I don't have any objections to simply updating the home page template as a result
19:39:01 <fungi> i mean, as new content for what's served at the home url
19:39:08 <fungi> simply shadowing that one url
19:39:16 <clarkb> right, I think I prefer not relying on apache for that
19:39:34 <clarkb> since it doesn't really gain us anything and potentially complicates gitea in say k8s if we ever do that
19:40:01 <fungi> got it. i was hoping we could have a way to serve an opendev.org main page without the constraints of what the gitea homepage template can support, but we can talk about it another time i guess
19:40:18 <clarkb> I'm not sure I'm aware of what those constraints are?
19:40:25 <clarkb> I may be missing something important
19:40:30 <fungi> has to use gitea's templating, right?
19:40:55 <clarkb> "yes" you can just put what you want in there and ignore the templating stuff at the header and footer
19:41:15 <fungi> so we can't easily preview and publish that page separately from the gitea container
19:41:19 <ianw> fungi: ++ it has lightly troubled me for a while that that page is a wall of text that seems to start talking about gerrit workflows very very early.  so having something more flexible is a good goal
19:41:43 <clarkb> fungi: that is true, you have to run gitea to render the header and footer and see the entire page
19:42:03 <fungi> and i guess it can't have a different header/footer from the rest of gitea
19:42:20 <clarkb> I think it can, since it explicitly includes those bits
19:42:35 <clarkb> But you'd have to use the existing templating system to make changes
19:42:39 <fungi> oh, so we could at least leave them out of the template if we wanted
19:42:42 <clarkb> yes
19:43:12 <corvus> the header seems like a good header for that site regardless
19:43:18 <clarkb> {{template "base/head" .}} and {{template "base/footer" .}} are line 1 and line before EOF
19:43:19 <corvus> home/explore/get started
19:43:50 <clarkb> corvus: ++
19:44:34 <fungi> yeah, i don't object to the current header and footer, just would prefer not to be unable to extend them easily
19:44:44 <clarkb> fungi: you can extend them as well
19:44:48 <clarkb> (we do the header already)
19:44:58 <fungi> okay, anyway i didn't mean to derail the meeting
19:45:15 <corvus> if we get to a point where the header for opendev.org isn't appropriate for a gitea service then we should probably move gitea to a subdomain
19:45:32 <fungi> just noodling on how to have an actual project website for opendev as a whole rather than one specific to the code browser
19:45:41 <clarkb> corvus: ya that was my immediate reaction to what this would imply. I'm ok doing that too, but it seems like we haven't reached the point where that is necessary yet
19:46:16 <fungi> we could also have a different page for the opendev main page, but having it at https://opendev.org/ seems convenient
19:46:39 <clarkb> Lets continue as we have a couple more things to go over really quickly
19:46:44 <fungi> yep
19:46:45 <clarkb> These next two are going to be related
19:46:49 <clarkb> #topic Upgrading Gerrit to 3.3
19:47:07 <clarkb> We are running gerrit 3.2 today. Gerrit 3.3 and 3.4 exist. 3.5 is in development but has not been released yet
19:47:27 <clarkb> The upgrade from 3.2 to 3.3 is pretty straightforward with most of the changes being UX stuff not really server backend
19:47:38 <clarkb> Straightforward enough that we are testing that upgrade in CI now :)
19:47:56 <clarkb> The upgrade to 3.4 is quite a bit more involved and the release notes are extensive
19:48:31 <clarkb> For this reason I'm thinking we can do a near term upgrade to 3.3. Then plan for 3.4 maybe around quiet holidaying time? (or whenever is convenient, mostly just thinking that will take more time)
19:48:34 <fungi> what are the main challenges you see for 3.4?
19:49:07 <ianw> i'd be happy to do 3.3 upgrade on like my monday, which is usually very quiet
19:49:13 <clarkb> fungi: mostly just double checking that things like plugins and zuul etc are all working with it
19:49:27 <clarkb> note you can also revert 3.3 to 3.2 and 3.4 to 3.3
19:49:37 <clarkb> so doing this incrementally keeps the reverts as small changes that we can do
19:49:46 <ianw> yeah no schema changes for either i believe
19:49:47 <clarkb> (I think you could revert 3.4 to 3.2 as well just more pieces to update)
19:49:51 <fungi> i can be around to help with the 3.3 upgrade on ianw's monday morning
19:49:58 <fungi> (my sunday evening)
19:50:10 <clarkb> there is a schema change between 3.2 and 3.3 you have to manually edit All-Users or All-Projects to do the revert
19:50:44 <clarkb> The next topic item is scheduling the project renames next week. I was thinking it might be good to do the renames on 3.2 since we have tested and done that before
19:51:07 <clarkb> however, we test the renames in CI on 3.2 and 3.3 currently so it should just work if you're talking about this monday and not a general monday
19:51:37 <clarkb> In my head I was considering late next week renames, then late week after that (week of ptg) for the 3.3 upgrade
19:52:08 <fungi> i don't mind doing the renames on 3.3 and working through any unlikely gotchas we encounter
19:52:18 <fungi> but happy to go either way
19:52:32 <clarkb> ianw: ^ when you said monday did you mean this monday or just generally your monday is good?
19:53:03 <clarkb> we can also revert 3.3 to 3.2 if necessary so I'm comfortable doing it this au monday if we prefer
19:53:07 <ianw> i meant any monday, but the 11th does work
19:53:29 <fungi> also this coming monday (11th) is a national holiday for some in the usa and most of canada
19:53:30 <ianw> i feel like we've got about as much testing as is practical
19:53:59 <clarkb> ok in that case I think the two options we are talking about are 1) upgrade the 11th then rename the 15th or 2) rename the 15th then upgrade the 25th
19:54:08 <clarkb> sounds like everyone really likes 1) ?
19:54:18 <clarkb> do we think we need more time to announce that?
19:54:24 <fungi> the sooner we get it out of the way, the better
19:54:26 <ianw> ++ to 1
19:54:49 <clarkb> in that case any objections to doing renames on the 15th at say 1500UTC ish fungi ?
19:54:54 <fungi> i think we can announce it on relatively short notice since we anticipate only brief outages
19:55:08 <clarkb> yup thinking we can announce both the upgrade and the renames today if that is the schedule we like
19:55:24 <fungi> sgtm
19:55:40 <clarkb> ok I'll work on drafting that announcement after lunch today and make sure we get it sent out
19:56:44 <clarkb> I think the actual act of upgrading gerrit is captured in the CI job. We'll basically land a change to update the gerrit image to 3.3. Then manually stop gerrit once docker-compose is updated, pull the image, run the init command then start gerrit
19:56:49 <clarkb> pretty straightforward
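For reference, a rough sketch of that manual sequence; the compose directory, service name, war path, and init flags are assumptions rather than the exact production layout, and the init invocation depends on the image's entrypoint:

    cd /etc/gerrit-compose              # wherever the docker-compose.yaml lives (assumed path)
    docker-compose stop gerrit          # stop the running 3.2 container
    docker-compose pull                 # fetch the new 3.3 image
    # run gerrit's init step against the existing site before starting the daemon
    docker-compose run --rm gerrit \
        java -jar /var/gerrit/bin/gerrit.war init -d /var/gerrit --batch --no-auto-start
    docker-compose up -d gerrit         # start gerrit back up on 3.3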
19:56:57 <clarkb> And we are almost out of time.
19:57:01 <clarkb> #topic Open Discussion
19:57:09 <clarkb> Is there anything else to call out in our last ~3 minutes?
19:57:27 <fungi> i plan to not be around on friday this week
19:57:35 <ianw> #link https://review.opendev.org/c/zuul/zuul-jobs/+/812272
19:57:54 <ianw> if i could get some eyes on that, it reworks the rust install (an issue noticed by pyca/cryptography)
19:58:46 <clarkb> can do
19:59:33 <clarkb> fungi: enjoy the time off. I ended up not being around as much as I expected yesterday but it was fun to walk on the beach and stop at the salt water taffy shop
20:00:02 <fungi> all our salt water taffy is imported. no idea why. like our salt water isn't good enough?
20:00:11 <fungi> it's a shame
20:00:24 <clarkb> this was made onsite :)
20:00:29 <clarkb> And we are at time. Thank you everyone
20:00:31 <clarkb> #endmeeting