19:00:52 <fungi> #startmeeting infra
19:00:52 <opendevmeet> Meeting started Tue Dec 19 19:00:52 2023 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:52 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:52 <opendevmeet> The meeting name has been set to 'infra'
19:01:15 <fungi> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting Our Agenda
19:01:29 <fungi> #topic Announcements
19:02:37 <fungi> #info The OpenDev weekly meeting is cancelled for the next two weeks owing to lack of availability for many participants; we're skipping December 26 and January 2, resuming as usual on January 9.
19:03:06 <fungi> i'm also skipping the empty boilerplate topics
19:03:19 <fungi> #topic Upgrading Bionic servers to Focal/Jammy (clarkb)
19:03:30 <fungi> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:04:07 <tonyb> mirrors are done and need to be cleaned up
19:04:07 <fungi> there's a note here in the agenda about updating cnames and cleaning up old servers for mirror replacements
19:04:12 <fungi> yeah, that
19:04:18 <fungi> are there open changes for dns still?
19:04:35 <tonyb> I started doing this yesterday but wanted additional eyes as it's my first time
19:04:41 <fungi> or do we just need to delete servers/volumes?
19:04:48 <tonyb> no open changes ATM
19:04:53 <tonyb> that one
19:04:56 <fungi> what specifically do you want an extra pair of eyes on? happy to help
19:05:43 <tonyb> fungi: the server and volume deletes. I understand the process
19:06:13 <fungi> i'm around to help after the meeting if you want, or you can pick another better time
19:06:28 <tonyb> fungi: after the meeting is good for me
19:07:13 <fungi> sounds good, thanks!
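For anyone following along, the cleanup being discussed boils down to a handful of OpenStack CLI calls per region; the server and volume names below are hypothetical, a sketch only:

    # double-check what the old mirror still has attached before removing it
    openstack server show mirror01.old-region.opendev.org
    openstack server delete mirror01.old-region.opendev.org
    # the cache volume is deleted separately once it is no longer attached
    openstack volume delete mirror01-cache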
19:07:15 <tonyb> I've started looking at jvb and meetpad for upgrades
19:07:22 <fungi> that's a huge help
19:07:49 <tonyb> I'm thinking we'll bring up 3 new servers and then do a cname switch.
19:08:11 <fungi> that should be fine. there's not a lot of utilization on them at this time of year anyway
19:08:31 <fungi> #topic DIB bionic support (ianw)
19:08:34 <tonyb> I was considering a more complex process to grow the jvb pool but I think that way is unneeded
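A quick sanity check once the replacement servers are up and the records are switched, assuming meetpad.opendev.org remains a CNAME (record name illustrative):

    # should return the new backend's hostname after the zone update lands
    dig +short CNAME meetpad.opendev.org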
19:08:55 <fungi> i think this got covered last week. was there any followup we needed to do?
19:09:15 <fungi> seems like there was some work to fix the dib unit tests?
19:09:44 <fungi> i'm guessing this no longer needed to be on the agenda, just making sure
19:10:50 <fungi> #topic Python container updates (tonyb)
19:11:12 <fungi> zuul-operator seems to still need addressing
19:11:13 <tonyb> no updates this week
19:11:33 <fungi> no worries, just checking. thanks!
19:11:40 <fungi> Gitea 1.21.1 Upgrade (clarkb)
19:11:44 <fungi> er...
19:11:47 <fungi> #topic Gitea 1.21.1 Upgrade (clarkb)
19:11:59 <tonyb> Yup I intend to update the roles to enhance container logging and then we'll have a good platform to understand the problem
19:12:32 <fungi> we were planning to do the gitea upgrade at the beginning of the week, but with lingering concerns after the haproxy incident over the weekend we decided to postpone
19:12:57 <tonyb> I think we're safe to remove the 2 LBs from emergency right?
19:13:18 <fungi> #link https://review.opendev.org/903805 Downgrade haproxy image from latest to lts
19:13:23 <fungi> that hasn't been approved yet
19:13:45 <fungi> so not until it merges at least
19:13:45 <tonyb> Ah
19:14:02 <fungi> but upgrading gitea isn't necessarily blocked on the lb being updated
19:14:20 <fungi> different system, separate software
19:14:47 <tonyb> Fair point
19:15:21 <fungi> anyway, with people also vacationing and/or ill this is probably still not a good time for a gitea upgrade. if the situation changes later in the week we can decide to do it then, i think
19:15:39 <tonyb> Okay
19:15:54 <fungi> #topic Updating Zuul's database server (clarkb)
19:16:26 <tonyb> I suspect there hasn't been much progress this week.
19:16:35 <fungi> i'm not sure where we ended up on this, there was research being done, but also an interest in temporarily dumping/importing on a replacement trove instance in the meantime
19:16:50 <fungi> we can revisit next year
19:17:03 <fungi> #topic Annual Report Season (clarkb)
19:17:13 <fungi> #link OpenDev's 2023 Annual Report Draft will live here: https://etherpad.opendev.org/p/2023-opendev-annual-report
19:17:45 <fungi> we need to get that to the foundation staff coordinator for the overall annual report by the end of the week, so we're about out of time for further edits if you wanted to check it over
19:18:11 <fungi> #topic EMS discontinuing legacy/consumer hosting plans (fungi)
19:19:04 <fungi> we received a notice last week that element matrix services (ems) who hosts our opendev.org matrix homeserver for us is changing their pricing and eliminating the low-end plan we had the foundation paying for
19:20:01 <fungi> the lowest "discounted" option they're offering us comes in at 10x what we've been paying, and has to be paid a year ahead in one lump sum
19:20:10 <fungi> (we were paying monthly before)
19:20:12 <tonyb> when?
19:20:29 <tonyb> does the plan need to be purchased?
19:20:33 <fungi> we have until 2024-02-07 to upgrade to a business hosting plan or move elsewhere
19:20:43 <tonyb> phew
19:21:02 <fungi> so ~1.5 months to decide on and execute a course of action
19:21:10 <tonyb> not a lot of lead time but also some lead time
19:22:35 <corvus> is the foundation interested in upgrading?
19:22:43 <fungi> i've so far not heard anyone say they're keen to work on deploying a matrix homeserver in our infrastructure, and i looked at a few (4 i think?) other hosting options but they were either as expensive or problematic in various ways, and also we'd have to find time to export/import our configuration and switch dns resulting in some downtime
19:23:35 <fungi> i've talked to the people who hold the pursestrings on the foundation staff and it sounds like we could go ahead and buy a year of business service from ems since we do have several projects utilizing it at this point
19:24:01 <fungi> which would buy us more time to decide if we want to keep doing that or work on our own solution
19:24:02 <tonyb> A *very* quick look implies that hosting our own server wouldn't be too bad. The hardest part will be the export/import and downtime
19:24:16 <frickler> another option might be dropping the homeserver and moving the rooms to matrix.org?
19:24:18 <tonyb> I suspect that StarlingX will be the "most impacted"
19:25:54 <frickler> I tried running a homeserver privately some time ago but it was very opaque and not debuggable
19:25:56 <fungi> maybe, but with as many channels as they have they're still not super active on them (i lurk in all their channels and they average a few messages a day tops)
19:26:19 <corvus> fungi: does the business plan support more than one hostname? the foundation may be able to eke out some more value if they can use the same plan to host internal comms.
19:28:32 <fungi> looking at https://element.io/pricing it's not clear to me how that's covered exactly
19:28:35 <fungi> maybe?
19:28:44 <corvus> ok.  just a thought  :)
19:28:55 <frickler> also, is that "discounted" option a special price or does that match the public pricing?
19:29:24 <fungi> the "discounted" rate they offered us to switch is basically the normal business cloud option on that page, but with a reduced minimum user count of 20 instead of 50
19:30:53 <fungi> anyway, mostly wanted to put this on the agenda so folks know it's coming and have some time to think about options
19:31:16 <fungi> we can discuss again in the next meeting which will be roughly a month before the deadline
19:31:31 <corvus> if the foundation is comfortable paying for it, i'd lean that direction
19:31:51 <fungi> yeah, i'm feeling similarly. i don't think any of us has a ton of free time for another project just now
19:32:00 <corvus> (i think there are good reasons to do so, including the value of the service provided compared to our time and materials cost of running it ourselves, and also supporting open source projects)
19:32:42 <fungi> agreed, and while it's 10x what we've been paying, there wasn't a lot of surprise at a us$1.2k/yr+tax price tag
19:33:10 <fungi> helps from a budget standpoint that it's due in the beginning of the year
19:33:33 <corvus> tbh i thought the original price was way too low for an org (i'm personally sad that it isn't an option for individuals any more though)
19:34:11 <fungi> yeah, we went with it mainly because they didn't have any open source community discounts, which we'd have otherwise opted for
19:34:22 <fungi> any other comments before we move to other topics?
19:35:03 <fungi> #topic Followup on 20231216 incident (frickler)
19:35:07 <fungi> you have the floor
19:35:27 <frickler> well I just collected some things that came to my mind on sunday
19:36:06 <frickler> first question: Do we want to pin external images like haproxy and only bump them after testing? (Not sure that would've helped for the current issue though)
19:36:52 <fungi> there's a similar question from corvus in 903805 about whether we want to make the switch from "latest" to "lts" permanent
19:37:15 <fungi> testing wouldn't have caught it though i don't think
19:37:36 <corvus> yeah, unlike gerrit/gitea where there's stuff to test, i don't think we're going to catch haproxy bugs in advance
19:37:39 <fungi> but maybe someone with a higher tolerance for the bleeding edge would have spotted it before latest became lts
19:38:11 <fungi> also it's not like we use recent/advanced features of haproxy
19:38:15 <corvus> for me, i think maybe permanently switching to tracking the lts tag is the right balance of auto-upgrade with hopefully low probability of breakage
19:38:35 <fungi> so i think the answer is "it depends, but we can be conservative on haproxy and similar components"
19:38:52 <frickler> are there other images we consume that could cause similar issues?
19:39:22 <frickler> and I'm fine with haproxy:lts as a middle ground for now
19:39:25 <fungi> i don't know off the top of my head, but if someone wants to `git grep :latest$` and do some digging, i'm happy to review a change
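As a concrete sketch of both halves of that: the grep is the audit mentioned above; the tag itself lives in system-config and is deployed by ansible, but on the lb host the effect is roughly the following (the compose file path and the v2 `docker compose` invocation are assumptions, some hosts still use the older `docker-compose` wrapper):

    # audit which images are still tracking the mutable "latest" tag
    git grep -n ':latest$'
    # pin the load balancer image to the lts tag and roll it out
    sed -i 's|haproxy:latest|haproxy:lts|' /etc/haproxy-docker/docker-compose.yaml
    cd /etc/haproxy-docker && docker compose pull && docker compose up -d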
19:39:42 <frickler> ok, second thing:  Use docker prune less aggressively for easier rollback?
19:39:54 <frickler> We do so for some services, like https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea/tasks/main.yaml#L71-L76, might want to duplicate for all containers? Bump the hold time to 7d?
19:40:05 <corvus> (also, honestly i think the fact that haproxy is usually rock solid is why it took us so long to diagnose it.  normally checking restart times would be near the top of the list of things to check)
19:40:30 <fungi> fwiw, when i switched gitea-lb's compose file from latest to lts and did a pull, nothing was downloaded, the image was still in the cache
19:41:13 <frickler> so "docker prune" doesn't clear the cache?
19:41:32 <tonyb> IIRC it did download on zuul-lb
19:41:50 <corvus> i sort of wonder what we're trying to achieve there?  resiliency against upstream retroactively changing a tag?  or shortened download times?  or ability to guess what versions we were running by inspecting the cache?
19:42:26 <frickler> being able to have a fast revert of an image upgrade by just checking "docker images" locally
19:42:47 <fungi> i guess the concern is that we're tracking lts, upstream moves lts to a broken image, and we've pruned the image that lts used to point to so we have to redownload it when we change the tag?
19:43:01 <frickler> also I don't think we are having disk space issues that make fast pruning essential
19:43:25 <corvus> if it's resiliency against changes, i agree that 7d is probably a good idea.  otherwise, 1-3 days is probably okay... if we haven't cared enough to look after 3 days, we can probably check logs or dockerhub, etc...
19:43:27 <fungi> but also the download times are generally on the order of seconds, not minutes
19:44:05 <fungi> it might buy us a little time but it's far from the most significant proportion of any related outage
19:44:09 <frickler> the 3d is only in effect for gitea, most other images are pruned immediately after upgrading
19:44:23 <corvus> (we're probably going to revert to a tag, which, since we can download in a few seconds, means the local cache isn't super important)
19:45:17 <fungi> i'm basically +/-0 on adjusting image cache times. i agree that we can afford the additional storage, but want to make sure it doesn't grow without bound
19:45:29 <frickler> ok, so leave it at that for now
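For context, the gitea role linked above holds images for three days before pruning; bumping the hold to a week amounts to changing the prune filter, roughly like this (168h is just 7 days expressed in hours):

    # remove only unused images older than 7 days instead of 3
    docker image prune --force --filter "until=168h"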
19:45:33 <frickler> next up: Add timestamps to zuul_reboot.log?
19:45:39 <frickler> https://opendev.org/opendev/system-config/src/branch/master/playbooks/service-bridge.yaml#L41-L55 Also this is running on Saturdays (weekday: 6), do we want to fix the comment or the dow?
19:45:40 <fungi> also having too many images cached makes it a pain to dig through when you're looking for a recent-ish one
19:46:13 <fungi> is zuul_reboot.log a file? on bridge?
19:46:18 <frickler> yes
19:46:35 <frickler> the code above shows how it is generated
19:47:05 <fungi> aha, /var/log/ansible/zuul_reboot.log
19:47:21 <corvus> adding timestamps sounds good to me; i like the current time so i'd say change the comment
19:47:27 <fungi> i have no objection to adding timestamps
19:47:39 <fungi> to, well, anything really
19:47:41 <frickler> ok, so I'll look into that
19:47:48 <fungi> more time-based context is preferable to less
19:47:50 <fungi> thanks!
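One low-effort way to get those timestamps, assuming moreutils is available on bridge (gawk's strftime would work similarly), is to pipe the playbook output through ts before the redirect; the playbook path here is illustrative, only the log path is taken from the discussion above:

    # prefix every output line with a timestamp before appending to the log
    ansible-playbook playbooks/zuul_reboot.yaml 2>&1 | ts '%Y-%m-%dT%H:%M:%S' >> /var/log/ansible/zuul_reboot.log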
19:47:54 <frickler> final one: Do we want to document or implement a procedure for rolling back zuul upgrades? Or do we assume that issues can always be fixed in a forward going way?
19:48:26 <fungi> i think the challenge there is that "downgrading" may mean manually undoing database migrations
19:48:33 <frickler> like what would we have done if we hadn't found a fast fix for the timer issue?
19:48:45 <fungi> the details of which will differ from version to version
19:49:21 <fungi> frickler: if the solution hadn't been obvious i was going to propose a revert of the offending change and try to get that fast-tracked
19:49:44 <frickler> ok, what if no clear bad patch had been identified?
19:50:23 <frickler> anyway, we don't need to discuss this at length right now, more something to think about medium term
19:50:24 <fungi> for zuul we're in a special situation where several of us are maintainers, so we've generally been able to solve things like that quickly one way or another
19:50:39 <corvus> i agree with fungi, any downgrade procedure is dependent on the revisions in scope, so i don't think there's a generic process we can do
19:51:20 <fungi> it'll be an on-the-spot determination as to whether it's less work to roll forward or try to unwind things
19:51:33 <frickler> ok, time's tight, so let's move to AFS?
19:51:40 <fungi> yep!
19:51:48 <fungi> #topic AFS quota issues (frickler)
19:52:10 <frickler> mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months
19:52:14 <frickler> ubuntu mirrors are also getting close, but we might have another couple of months time there
19:52:17 <frickler> mirror.centos-stream seems to have a steep increase in the last two months and might also run into quota limits soon
19:52:20 <frickler> project.zuul with the latest releases is getting close to its tight limit of 1GB (sic), I suggest simply doubling that
19:52:47 <frickler> the last one is easy I think. for openeuler, instead of bumping the quota, someone may want to look into cleanup options first?
19:53:05 <frickler> the others are more of something to keep an eye on
19:53:09 <fungi> broken openeuler mirrors that nobody brought to our attention would indicate they're not being used, but yes it's possible we can filter out some things like we do for centos
19:53:45 <fungi> i'll try to figure out based on git blame who added the openeuler mirror and see if they can propose improvements before lunar new year
19:53:57 <frickler> well they are being used in devstack, but being out of date for some weeks does not yet break jobs
19:54:03 <corvus> feel free to action me on the zuul thing if no one else wants to do it
19:54:05 <fungi> i agree just bumping the zuul quota is fine
19:54:56 <fungi> #action fungi Reach out to someone about cleaning up OpenEuler mirroring
19:55:13 <fungi> #action corvus Increase project.zuul AFS volume quota
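For whoever picks up those actions, the OpenAFS side is a couple of fs commands; the path below is illustrative, and quota values are in KB so doubling 1GB is roughly 2097152:

    # check current usage against the quota
    fs listquota /afs/.openstack.org/project/zuul
    # double the quota on the read-write volume, then vos release if it is replicated
    fs setquota -path /afs/.openstack.org/project/zuul -max 2097152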
19:55:25 <fungi> let's move to the last topic
19:55:34 <fungi> #topic Broken wheel build issues (frickler)
19:55:56 <fungi> centos8 wheel builds are the only ones that are thoroughly broken currently?
19:56:07 <fungi> i'm pleasantly surprised if so
19:56:08 <frickler> fungi: https://review.opendev.org/c/openstack/devstack/+/900143 is the last patch on devstack that a quick search showed me for openeuler
19:56:21 <frickler> I think centos9 too?
19:56:35 <fungi> oh, >=8
19:56:42 <fungi> got it
19:57:19 <frickler> though depends on what you mean by thoroughly (8 months vs. just 1)
19:57:31 <fungi> how much centos testing is going on these days now that tripleo has basically closed up shop?
19:57:50 <fungi> wondering how much time we're saving by not rebuilding some stuff from sdist in centos jobs
19:58:20 <frickler> not sure, I think some usage is still there for special reqs like FIPS
19:58:40 <tonyb> yup FIPS still needs it.
19:58:54 <frickler> at least people are still concerned enough about devstack global_venv being broken on centos
19:58:55 <fungi> for 9 or 8 too?
19:59:09 <frickler> both I think
19:59:16 <tonyb> I can work with ade_lee to verify what *exactly* is needed and fix or prune as appropriate
19:59:37 <fungi> we can quite easily stop running the wheel build jobs, if the resources for running those every day are a concern
20:00:06 <fungi> i guess we can discuss options in #opendev since we're past the end of the hour
20:00:06 <frickler> the question is then do we want to keep the outdated builds or purge them too?
20:00:20 <fungi> keeping them doesn't hurt anything, i don't think
20:00:30 <fungi> it's just an extra index url for pypi
20:00:38 <fungi> and either the desired wheel is there or it's not
20:00:54 <fungi> and if it's not, the job grabs the sdist from pypi and builds it
20:00:54 <tonyb> and storage?
20:00:58 <frickler> it does mask build errors that can happen for people who do not have access to those wheels
20:01:19 <frickler> if the build was working 6 months ago but has broken since then
20:01:32 <frickler> but anyway, not urgent, we can also continue next year
20:01:36 <fungi> it's a good point, we considered that as a balance between using job resources continually building the same wheels over and over
20:01:59 <fungi> and projects forgetting to list the necessary requirements for building the wheels for things they depend on that lack them
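The consumer side of that trade-off is just pip's extra index: jobs point at the per-platform wheel cache and quietly fall back to building from the PyPI sdist when a wheel is missing. A sketch of what a job effectively does (the mirror URL is illustrative of the pattern, not an exact hostname):

    # prefer prebuilt wheels from the cache, fall back to PyPI sdists otherwise
    pip install --extra-index-url https://mirror.example.opendev.org/wheel/ubuntu-jammy-x86_64/ -r requirements.txt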
20:02:13 <fungi> okay, let's continue in #opendev. thanks everyone!
20:02:18 <fungi> #endmeeting