19:01:02 #startmeeting infra
19:01:02 Meeting started Tue Oct 5 19:01:02 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:02 The meeting name has been set to 'infra'
19:01:08 #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000287.html Our Agenda
19:01:16 #topic Announcements
19:01:25 o/
19:01:28 The OpenStack release is happening tomorrow afternoon UTC time
19:01:29 ahoy
19:01:41 it'll probably start tomorrow morning utc
19:01:44 We should avoid changes to tools that produce code today and tomorrow until that is done
19:02:02 but should hopefully be complete by 14:00 utc or thereabouts
19:02:03 fungi: good point, it starts earlier but aims to be done by ~1500UTC?
19:02:23 yeah, 15z is press release time
19:02:38 Today is a good day to avoid touching gerrit, gitea, zuul, etc :)
19:02:51 but they generally shoot to have all the artifacts and docs published and rechecked at least an hour prior
19:03:11 I plan to try and get up a bit early tomorrow to help out if anything comes up. But ya I expect it will be done by the time I have to take kids to school which will be nice as I can do that without concern then :)
19:03:16 and it's a multi-hour process so usually begins around 10:00z or so
19:03:26 Anyway just be aware of that and let's avoid restarting zuul for example
19:04:28 #topic Actions from last meeting
19:04:34 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-28-19.01.txt minutes from last meeting
19:04:45 I don't see any recorded actions. Let's move on
19:04:51 #topic Specs
19:04:55 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:05:07 I didn't find time to update this spec to propose using the prebuilt binaries
19:05:16 #action clarkb update prometheus spec to use prebuilt binaries
19:05:26 the current spec doesn't rule them out though
19:05:34 it does suggest we use the docker images
19:05:38 which are different
19:05:58 well specifically for node exporter the idea was docker images for that if we used it
19:06:15 but ianw makes a good point that we can just grab the prebuilt binaries for node exporter and host them ourselves and stick them on all our systems
19:06:36 that will give us consistent node exporter metrics without concern for version variance and avoids oddness with docker images
19:07:16 Anyway I'll try to update that. Seems like there are a lot of things happening this week (yay pre-release pressure build up all able to let go now)
19:07:23 #link https://review.opendev.org/810990 Mailman 3 spec
19:07:48 I'll give us all an action to give this a review as well. This is important not just for the service update but it will help inform us on the best path for handling the odd existing server
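
For the node exporter approach discussed under the Prometheus spec topic above (self-hosted prebuilt binaries on every system), a minimal sketch might look like the following; the mirror URL, version pin, and install paths are illustrative assumptions, not anything the spec has settled on.

    # Hedged sketch only: fetch a node_exporter release tarball from a
    # self-hosted mirror and install the single static binary.
    set -eu

    VERSION="1.2.2"                               # assumed version pin
    MIRROR="https://mirror.example.opendev.org"   # hypothetical self-hosted copy
    TARBALL="node_exporter-${VERSION}.linux-amd64.tar.gz"

    curl -fsSL "${MIRROR}/prometheus/${TARBALL}" -o "/tmp/${TARBALL}"
    tar -xzf "/tmp/${TARBALL}" -C /tmp
    install -m 0755 "/tmp/node_exporter-${VERSION}.linux-amd64/node_exporter" \
        /usr/local/bin/node_exporter

    # A service unit (not shown) would then run /usr/local/bin/node_exporter,
    # which serves metrics on :9100/metrics by default.
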
19:07:54 #action infra-root Review mailman 3 spec
19:07:59 should be in a reviewable state, i don't have any outstanding todos for it, but feedback would be appreciated
19:08:22 #topic Topics
19:08:29 #topic PTGBot Deployment
19:08:43 This is a late entry to the meeting agenda that I added to my local copy
19:09:01 Looks like we've got a stack of changes to deploy ptgbot on the new eavesdrop setup, but we're struggling with LE staging errors
19:09:13 LE doesn't indicate any issues at https://letsencrypt.status.io/
19:09:51 We also had an idea that we might be able to split up the handlers/main.yaml that handles all the service restarts post cert update. That would then allow us to run more minimal sets of jobs when doing LE updates, but ansible doesn't work that way unfortunately
19:10:09 acme.sh is already retrying things that the protocol is expected to be less reliable with
19:10:28 for that reason I hesitate to add retries in our ansible calls of acme.sh but that is a possibility too if we want to try and force this a bit more
19:10:44 i wonder if we should just short-cut acme.sh
19:10:52 fungi: ianw: ^ anything else to add on this subject? Mostly wanted to call it out because the LE stuff could have more widespread implications
19:11:01 ianw: when testing you mean?
19:11:10 ianw: i wouldn't be entirely opposed
19:11:11 at the moment, what it does is ask staging to set up the certs, so we get the TXT responses
19:11:32 but we never actually put them into dns and finish verifying, we self-generate with openssl
19:11:41 or i considered deploying a pebble container on the test nodes and pointing acme.sh to a localhost:xxxx url
19:11:55 yeah, that has been on my todo list for a long time :)
19:12:41 In the past when we've seen the staging api have trouble it usually goes away within a day. Not great to rely on that nor is there any guarantee or indication that will be the case when it happens again
19:12:41 the staging environment api docs outright say it's not recommended for use in ci jobs
19:12:43 in testing mode, we could just avoid calling acme.sh and somehow output a fake/random TXT record, to keep testing the surrounding bits
19:13:29 ianw: that might be a good compromise
19:13:38 ianw: the driver.sh could echo that out pretty easily
19:13:58 i can look at this today; it would be nice to keep the path on one job, but maybe we should have a specific acme.sh test job
19:14:14 something like cat /dev/urandom | tr [:stuff:] [:otherstuff:] | tail -20
19:14:35 i have had https://github.com/acmesh-official/acme.sh/pull/2606 open for 2 years to better detect failures; clearly it hasn't attracted attention
19:14:39 yes, i was thinking the same, maybe the system-config-run-letsencrypt job should use the staging env properly, and then we fake out all the others?
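
The fake TXT record idea floated above could look something like the sketch below in a test-only path of driver.sh; the function name and exact record shape are illustrative assumptions, not the actual implementation.

    # Hedged sketch: in CI-only mode, skip the acme.sh staging call and emit a
    # random value shaped like a dns-01 TXT challenge so the surrounding DNS
    # plumbing can still be exercised.
    fake_txt_record() {
        # Real dns-01 responses are 43-character base64url strings; approximate that
        LC_ALL=C tr -dc 'A-Za-z0-9_-' < /dev/urandom | head -c 43
        echo
    }

    # Example: one fake record per requested domain
    for domain in "$@"; do
        echo "${domain} $(fake_txt_record)"
    done
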
19:14:51 ++
19:15:12 i can look at this today
19:15:38 thanks
19:15:40 i looked through the acme.sh code and it does seem to retry aggressively with delays all over the place, so i'm surprised we're still seeing 500 responses bubble up
19:15:56 #topic Improving OpenDev's CD throughput
19:16:09 let's keep moving as we have a number of other topics to talk about today and limited time :)
19:16:30 ianw has written a stack of changes to move this along and improve our CD throughput
19:16:44 #link https://review.opendev.org/c/opendev/system-config/+/807672 List dependencies for all jobs
19:17:03 this isn't currently mergeable because Zuul doesn't detect this change as having changes that need jobs to run
19:17:20 ianw: I was thinking that we should maybe just put a simple edit in a file somewhere to trick it
19:17:32 ianw: like our README or a dockerfile or something
19:18:43 clarkb: I do think https://review.opendev.org/c/zuul/zuul/+/755988 might fix this type of situation
19:18:56 should we have some job which always runs?
19:19:16 but yes, i can do something like that. the syntax check is probably the important bit of that
19:19:21 ahh, 755988 is a neat idea!
19:19:26 oh interesting I'll have to review that zuul change
19:19:33 similar to how it handles config changes
19:19:44 great approach
19:19:51 #link https://review.opendev.org/c/opendev/base-jobs/+/807807 Update opendev/base-jobs to support having jobs in system-config that don't clone repos
19:20:15 #link https://review.opendev.org/c/opendev/system-config/+/807808 stop cloning system-config in every system-config job
19:20:34 ianw: at the end of this stack we'll still be running everything serially, but in theory we'll be ready to update semaphores and run stuff in parallel?
19:20:49 yes, that's the intention
19:21:10 great, they are on my list of things to review I've just got to find time between everything else :)
19:21:16 hopefully this afternoon for those though
19:21:28 np; they *should* all be no-ops for live jobs
19:21:51 thank you for working on that
19:21:59 but, as about 7 hours of yesterday highlights, sometimes something you think is a no-op can have interesting side-effects :)
19:22:27 #topic Gerrit Account Cleanup
19:22:42 I'm going to keep moving along to be sure we can get through everything. Happy to swing back to any topic at the end of our hour if we have time
19:23:02 I don't have anything new to say on this item. This issue gets deprioritized pretty easily unfortunately
19:23:32 I may drop it from the meeting until I expect to be able to send the emails
19:23:49 #topic Debian Buster to Bullseye Updates
19:24:20 We have updated python base images for our docker containers. We should try to move as many images as possible from buster to bullseye as buster will eventually stop getting updates
19:24:26 #link https://review.opendev.org/c/opendev/system-config/+/809269 Gitea bullseye update
19:24:32 #link https://review.opendev.org/c/opendev/system-config/+/809286 Gerrit bullseye update
19:25:08 I've got those two changes pushed up for gerrit and gitea because I've been making changes to their docker images recently. But basically all the containers we run need similar treatment aiui
19:25:46 I'm bringing this up first because in a few minutes we'll also discuss gitea and gerrit service upgrades. I think we should decide on the order we want to tackle these updates in. Do we do the service or the OS first?
19:26:11 "soon" is relative, the debian-lts team expect to support buster until june 2024
19:26:25 i would say OS then service
19:26:25 fungi: oh isn't it like a year after release?
19:26:36 maybe it is a year after the n-1 release
19:26:37 official security support ends in july 2022
19:26:50 'soon' in debian time :)
19:26:51 fungi: aha I am not completely crazy then
19:26:54 and then lts takes over
19:27:03 it seems either are fairly easy to roll back
19:27:08 the lts peeps are separate from the debian security team
19:27:13 ianw: ++ exactly my thinking and ya happy to do OS first as a result
19:27:19 sort of like openstack's stable maintenance team and extended maintenance
19:27:30 fungi: ok I'm not sure if our python base images and the other base images enable the lts stuff or not. We don't make those
19:27:45 probably best to get off of buster by july 2022 then we don't have to worry about it
19:27:50 right, that'll be the bigger concern. what is the support lifetime of the python-base image
19:28:08 (i need to get back to the nodepool/dib image upgrades too)
19:28:12 which may or may not be tied to debian's support timeline
19:28:35 fungi: those images are based on debian so there is some relationship there. I doubt they go past the lts period. But wouldn't be surprised if they end in july 2022
19:29:02 it is also possible they stop building updates sooner than that. And as ianw mentions the updates seem straightforward with easy reverts so we should go ahead and work through them
19:29:09 the debian docker images also aren't official, at least from the debian release team's perspective
19:29:29 so it's more about when the debian docker image maintainers want to stop supporting buster
19:29:47 fungi: right and they are unlikely to make new buster packages once debian stops doing so
19:30:09 that caps the useful life of those images to likely july 2022
19:30:16 (unless they do lts)
19:30:51 Considering there is a vote for doing OS updates first I guess I should plan to land those two changes above tomorrow after openstack release is complete
19:30:51 might be able to infer something by looking at whether/when they stopped doing stretch images
19:31:19 fungi: they may also just directly say it somewhere
19:32:18 Anyway I think we can pretty quickly work through these updates and then not worry about it
19:32:27 and as a side effect we'll get newer git and other fancy new software
19:32:43 (but git in particular should give us improvements on gitea and possibly even gerrit doing things like repacking)
19:33:21 #topic Gitea 1.15.3 Upgrade
19:33:32 Once the gitea OS update is done, this is the next thing I would like to do to gitea
19:33:42 Latest test instance: https://198.72.124.104:3081/opendev/system-config
19:33:56 That test instance lgtm and the logo hosting situation has been addressed with gerrit and paste et al
19:34:09 #link https://review.opendev.org/c/opendev/system-config/+/803231
19:34:24 Are there any other concerns with doing this upgrade tomorrow/thursday timeframe?
19:34:44 after the openstack release has wrapped up, i should be around to help with it
19:35:01 no issues, similarly i can help
19:35:06 great and thanks
19:35:29 also this reminds me, i want to work on getting our infrastructure donors on the main opendev.org page, now that we have apache on the gitea servers we could just serve normal page content instead of having to stuff it into gitea's main page template, would that be a better place to start?
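
As a quick aid to the buster/bullseye discussion above, one way to confirm which codename a published base image is built on, and how long upstream kept publishing stretch-era tags, might look like the sketch below; the image names and the use of skopeo are assumptions for illustration, not the team's actual process.

    # Hedged sketch: report the Debian codename inside the python-base image
    # and list any remaining stretch-based upstream python tags.
    docker run --rm --entrypoint cat opendevorg/python-base /etc/os-release \
        | grep VERSION_CODENAME

    # Upstream python images: see whether stretch tags are still being published
    skopeo list-tags docker://docker.io/library/python | grep stretch | head
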
19:36:08 fungi: there might be issues doing that and needing to host gitea at say https://opendev.org/gitea/
19:36:14 since all of our existing links out there don't have that root
19:36:14 we'd need apache to directly serve the donor logos anyway probably
19:36:28 fungi: you can have gitea serve them just like the opendev logos
19:36:43 they have a static content directory with what I hope are stable paths now that they moved them
19:37:47 seems like if we configure apache to only serve that page for get requests to the / url and when there are no query parameters, that wouldn't interfere with gitea
19:37:51 I guess we could maybe set it up where only the gitea landing page was hosted at /gitea/ and then all other paths would keep working? That is definitely my concern with doing something like that
19:38:20 fungi: I think you still need a gitea landing page because gitea serves a home link
19:38:37 basically you either need to hack up redirects such that that continues to work or you're hacking templates either way
19:38:49 I don't have any objections to simply updating the home page template as a result
19:39:01 i mean, as new content for what's served at the home url
19:39:08 simply shadowing that one url
19:39:16 right, I think I prefer not relying on apache for that
19:39:34 since it doesn't really gain us anything and potentially complicates gitea in say k8s if we ever do that
19:40:01 got it. i was hoping we could have a way to serve an opendev.org main page without the constraints of what the gitea homepage template can support, but we can talk about it another time i guess
19:40:18 I'm not sure I'm aware of what those constraints are?
19:40:25 I may be missing something important
19:40:30 has to use gitea's templating, right?
19:40:55 "yes" you can just put what you want in there and ignore the templating stuff at the header and footer
19:41:15 so we can't easily preview and publish that page separately from the gitea container
19:41:19 fungi: ++ it has lightly troubled me for a while that that page is a wall of text that seems to start talking about gerrit workflows very very early. so having something more flexible is a good goal
19:41:43 fungi: that is true, you have to run gitea to render the header and footer and see the entire page
19:42:03 and i guess it can't have a different header/footer from the rest of gitea
19:42:20 I think it can, since it explicitly includes those bits
19:42:35 But you'd have to use the existing templating system to make changes
19:42:39 oh, so we could at least leave them out of the template if we wanted
19:42:42 yes
19:43:12 the header seems like a good header for that site regardless
19:43:18 {{template "base/head" .}} and {{template "base/footer" .}} are line 1 and the line before EOF
19:43:19 home/explore/get started
19:43:50 corvus: ++
19:44:34 yeah, i don't object to the current header and footer, just would prefer not to be unable to extend them easily
19:44:44 fungi: you can extend them as well
19:44:48 (we do the header already)
19:44:58 okay, anyway i didn't mean to derail the meeting
19:45:15 if we get to a point where the header for opendev.org isn't appropriate for a gitea service then we should probably move gitea to a subdomain
19:45:32 just noodling on how to have an actual project website for opendev as a whole rather than one specific to the code browser
19:45:41 corvus: ya that was my immediate reaction to what this would imply. I'm ok doing that too, but it seems like we haven't reached the point where that is necessary yet
19:46:16 we could also have a different page for the opendev main page, but having it at https://opendev.org/ seems convenient
19:46:39 Let's continue as we have a couple more things to go over really quickly
19:46:44 yep
19:46:45 These next two are going to be related
19:46:49 #topic Upgrading Gerrit to 3.3
19:47:07 We are running gerrit 3.2 today. Gerrit 3.3 and 3.4 exist. 3.5 is in development but has not been released yet
19:47:27 The upgrade from 3.2 to 3.3 is pretty straightforward with most of the changes being UX stuff not really server backend
19:47:38 Straightforward enough that we are testing that upgrade in CI now :)
19:47:56 The upgrade to 3.4 is quite a bit more involved and the release notes are extensive
19:48:31 For this reason I'm thinking we can do a near term upgrade to 3.3. Then plan for 3.4 maybe around quiet holidaying time? (or whenever is convenient, mostly just thinking that will take more time)
19:48:34 what are the main challenges you see for 3.4?
19:49:07 i'd be happy to do the 3.3 upgrade on like my monday, which is usually very quiet
19:49:13 fungi: mostly just double checking that things like plugins and zuul etc are all working with it
19:49:27 note you can also revert 3.3 to 3.2 and 3.4 to 3.3
19:49:37 so doing this incrementally keeps the reverts as small changes that we can do
19:49:46 yeah no schema changes for either i believe
19:49:47 (I think you could revert 3.4 to 3.2 as well just more pieces to update)
19:49:51 i can be around to help with the 3.3 upgrade on ianw's monday morning
19:49:58 (my sunday evening)
19:50:10 there is a schema change between 3.2 and 3.3; you have to manually edit All-Users or All-Projects to do the revert
19:50:44 The next topic item is scheduling the project renames next week. I was thinking it might be good to do the renames on 3.2 since we have tested and done that before
19:51:07 however, we test the renames in CI on 3.2 and 3.3 currently so it should just work if you're talking about this monday and not a general monday
19:51:37 In my head I was considering late next week renames, then late week after that (week of ptg) for the 3.3 upgrade
19:52:08 i don't mind doing the renames on 3.3 and working through any unlikely gotchas we encounter
19:52:18 but happy to go either way
19:52:32 ianw: ^ when you said monday did you mean this monday or just generally your monday is good?
19:53:03 we can also revert 3.3 to 3.2 if necessary so I'm comfortable doing it this au monday if we prefer
19:53:07 i meant any monday, but the 11th does work
19:53:29 also this coming monday (11th) is a national holiday for some in the usa and most of canada
19:53:30 i feel like we've got about as much testing as is practical
19:53:59 ok in that case I think the two options we are talking about are 1) upgrade the 11th then rename the 15th or 2) rename the 15th then upgrade the 25th
19:54:08 sounds like everyone really likes 1) ?
19:54:18 do we think we need more time to announce that?
19:54:24 the sooner we get it out of the way, the better
19:54:26 ++ to 1
19:54:49 in that case any objections to doing renames on the 15th at say 1500UTC ish fungi ?
19:54:54 i think we can announce it on relatively short notice since we anticipate only brief outages
19:55:08 yup thinking we can announce both the upgrade and the renames today if that is the schedule we like
19:55:24 sgtm
19:55:40 ok I'll work on drafting that announcement after lunch today and make sure we get it sent out
19:56:44 I think the actual act of upgrading gerrit is captured in the CI job. We'll basically land a change to update the gerrit image to 3.3. Then manually stop gerrit once docker-compose is updated, pull the image, run the init command, then start gerrit
19:56:49 pretty straightforward
19:56:57 And we are almost out of time.
19:57:01 #topic Open Discussion
19:57:09 Is there anything else to call out in our last ~3 minutes?
19:57:27 i plan to not be around on friday this week
19:57:35 #link https://review.opendev.org/c/zuul/zuul-jobs/+/812272
19:57:54 if i could get some eyes on that, it reworks the rust install which was noticed by pyca/cryptography
19:58:46 can do
19:59:33 fungi: enjoy the time off. I ended up not being around as much as I expected yesterday but it was fun to walk on the beach and stop at the salt water taffy shop
20:00:02 all our salt water taffy is imported. no idea why. like our salt water isn't good enough?
20:00:11 it's a shame
20:00:24 this was made onsite :)
20:00:29 And we are at time. Thank you everyone
20:00:31 #endmeeting
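
The gerrit upgrade procedure described at 19:56:44 would roughly map to the hedged sketch below; the compose directory, service name, and init invocation are assumptions for illustration and not the exact commands in system-config.

    # Hedged sketch of the manual steps described above.  Run after the change
    # bumping the gerrit image to 3.3 has landed and updated the compose file.
    cd /etc/gerrit-compose           # assumed location of docker-compose.yaml

    docker-compose pull              # fetch the 3.3 image
    docker-compose down              # stop gerrit on the old image

    # Offline init/upgrade step against the site directory (assumed invocation)
    docker-compose run --rm gerrit init -d /var/gerrit

    docker-compose up -d             # start gerrit on 3.3
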