19:00:52 #startmeeting infra
19:00:52 Meeting started Tue Dec 19 19:00:52 2023 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:52 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:52 The meeting name has been set to 'infra'
19:01:15 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting Our Agenda
19:01:29 #topic Announcements
19:02:37 #info The OpenDev weekly meeting is cancelled for the next two weeks owing to lack of availability for many participants; we're skipping December 26 and January 2, resuming as usual on January 9.
19:03:06 i'm also skipping the empty boilerplate topics
19:03:19 #topic Upgrading Bionic servers to Focal/Jammy (clarkb)
19:03:30 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:04:07 mirrors are done and need to be cleaned up
19:04:07 there's a note here in the agenda about updating cnames and cleaning up old servers for mirror replacements
19:04:12 yeah, that
19:04:18 are there open changes for dns still?
19:04:35 I started doing this yesterday but wanted additional eyes as it's my first time
19:04:41 or do we just need to delete servers/volumes?
19:04:48 no open changes ATM
19:04:53 that one
19:04:56 what specifically do you want an extra pair of eyes on? happy to help
19:05:43 fungi: the server and volume deletes; I understand the process
19:06:13 i'm around to help after the meeting if you want, or you can pick another better time
19:06:28 fungi: after the meeting is good for me
19:07:13 sounds good, thanks!
19:07:15 I've started looking at jvb and meetpad for upgrades
19:07:22 that's a huge help
19:07:49 I'm thinking we'll bring up 3 new servers and then do a cname switch.
19:08:11 that should be fine. there's not a lot of utilization on them at this time of year anyway
19:08:31 #topic DIB bionic support (ianw)
19:08:34 I was considering a more complex process for growing the jvb pool but I think that is unneeded
19:08:55 i think this got covered last week. was there any followup we needed to do?
19:09:15 seems like there was some work to fix the dib unit tests?
19:09:44 i'm guessing this no longer needs to be on the agenda, just making sure
19:10:50 #topic Python container updates (tonyb)
19:11:12 zuul-operator seems to still need addressing
19:11:13 no updates this week
19:11:33 no worries, just checking. thanks!
19:11:40 Gitea 1.21.1 Upgrade (clarkb)
19:11:44 er...
19:11:47 #topic Gitea 1.21.1 Upgrade (clarkb)
19:11:59 Yup, I intend to update the roles to enhance container logging and then we'll have a good platform to understand the problem
19:12:32 we were planning to do the gitea upgrade at the beginning of the week, but with lingering concerns after the haproxy incident over the weekend we decided to postpone
19:12:57 I think we're safe to remove the 2 LBs from emergency, right?
19:13:18 #link https://review.opendev.org/903805 Downgrade haproxy image from latest to lts
19:13:23 that hasn't been approved yet
19:13:45 so not until it merges, at least
19:13:45 Ah
19:14:02 but upgrading gitea isn't necessarily blocked on the lb being updated
19:14:20 different system, separate software
19:14:47 Fair point
19:15:21 anyway, with people also vacationing and/or ill this is probably still not a good time for a gitea upgrade. if the situation changes later in the week we can decide to do it then, i think
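For reference on the mirror cleanup discussed under the Bionic upgrades topic above, a minimal sketch of the verify-then-delete sequence; the hostnames, region, and volume identifier below are illustrative placeholders, not the actual servers:

    # Confirm the mirror CNAME already points at the replacement host
    # before removing anything (the record name here is hypothetical):
    dig +short mirror.example-region.example-cloud.opendev.org CNAME

    # Then retire the old server and its backing volume with the
    # OpenStack CLI in the relevant cloud/region (names are placeholders):
    openstack server show old-mirror01.example-region.example-cloud.opendev.org
    openstack server delete old-mirror01.example-region.example-cloud.opendev.org
    openstack volume list | grep old-mirror01
    openstack volume delete <volume-uuid>

The ordering is the cheap safety step: checking the record first avoids deleting a server that the CNAME still points at.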
19:15:39 Okay
19:15:54 #topic Updating Zuul's database server (clarkb)
19:16:26 I suspect there hasn't been much progress this week.
19:16:35 i'm not sure where we ended up on this, there was research being done, but also an interest in temporarily dumping/importing on a replacement trove instance in the meantime
19:16:50 we can revisit next year
19:17:03 #topic Annual Report Season (clarkb)
19:17:13 #link OpenDev's 2023 Annual Report Draft will live here: https://etherpad.opendev.org/p/2023-opendev-annual-report
19:17:45 we need to get that to the foundation staff coordinator for the overall annual report by the end of the week, so we're about out of time for further edits if you wanted to check it over
19:18:11 #topic EMS discontinuing legacy/consumer hosting plans (fungi)
19:19:04 we received a notice last week that element matrix services (ems), which hosts our opendev.org matrix homeserver for us, is changing their pricing and eliminating the low-end plan we had the foundation paying for
19:20:01 the lowest "discounted" option they're offering us comes in at 10x what we've been paying, and has to be paid a year ahead in one lump sum
19:20:10 (we were paying monthly before)
19:20:12 when?
19:20:29 does the plan need to be purchased
19:20:33 we have until 2024-02-07 to upgrade to a business hosting plan or move elsewhere
19:20:43 phew
19:21:02 so ~1.5 months to decide on and execute a course of action
19:21:10 not a lot of lead time, but also some lead time
19:22:35 is the foundation interested in upgrading?
19:22:43 i've so far not heard anyone say they're keen to work on deploying a matrix homeserver in our infrastructure, and i looked at a few (4 i think?) other hosting options but they were either as expensive or problematic in various ways, and also we'd have to find time to export/import our configuration and switch dns, resulting in some downtime
19:23:35 i've talked to the people who hold the pursestrings on the foundation staff and it sounds like we could go ahead and buy a year of business service from ems since we do have several projects utilizing it at this point
19:24:01 which would buy us more time to decide if we want to keep doing that or work on our own solution
19:24:02 A *very* quick look implies that hosting our own server wouldn't be too bad. the hardest part will be the export/import and downtime
19:24:16 another option might be dropping the homeserver and moving the rooms to matrix.org?
19:24:18 I suspect that StarlingX will be the "most impacted"
19:25:54 I've tried running a homeserver privately some time ago but it was very opaque and not debuggable
19:25:56 maybe, but with as many channels as they have they're still not super active on them (i lurk in all their channels and they average a few messages a day tops)
19:26:19 fungi: does the business plan support more than one hostname? the foundation may be able to eke out some more value if they can use the same plan to host internal comms.
19:28:32 looking at https://element.io/pricing it's not clear to me how that's covered exactly
19:28:35 maybe?
19:28:44 ok. just a thought :)
19:28:55 also, is that "discounted" option a special price or does that match the public pricing?
19:29:24 the "discounted" rate they offered us to switch is basically the normal business cloud option on that page, but with a reduced minimum user count of 20 instead of 50
19:30:53 anyway, mostly wanted to put this on the agenda so folks know it's coming and have some time to think about options
19:31:16 we can discuss again in the next meeting, which will be roughly a month before the deadline
19:31:31 if the foundation is comfortable paying for it, i'd lean that direction
19:31:51 yeah, i'm feeling similarly. i don't think any of us has a ton of free time for another project just now
19:32:00 (i think there are good reasons to do so, including the value of the service provided compared to our time and materials cost of running it ourselves, and also supporting open source projects)
19:32:42 agreed, and while it's 10x what we've been paying, there wasn't a lot of surprise at a us$1.2k/yr+tax price tag
19:33:10 helps from a budget standpoint that it's due in the beginning of the year
19:33:33 tbh i thought the original price was way too low for an org (i'm personally sad that it isn't an option for individuals any more though)
19:34:11 yeah, we went with it mainly because they didn't have any open source community discounts, which we'd have otherwise opted for
19:34:22 any other comments before we move to other topics?
19:35:03 #topic Followup on 20231216 incident (frickler)
19:35:07 you have the floor
19:35:27 well I just collected some things that came to my mind on sunday
19:36:06 first question: Do we want to pin external images like haproxy and only bump them after testing? (Not sure that would've helped for the current issue though)
19:36:52 there's a similar question from corvus in 903805 about whether we want to make the switch from "latest" to "lts" permanent
19:37:15 testing wouldn't have caught it though, i don't think
19:37:36 yeah, unlike gerrit/gitea where there's stuff to test, i don't think we're going to catch haproxy bugs in advance
19:37:39 but maybe someone with a higher tolerance for the bleeding edge would have spotted it before latest became lts
19:38:11 also it's not like we use recent/advanced features of haproxy
19:38:15 for me, i think maybe permanently switching to tracking the lts tag is the right balance of auto-upgrade with hopefully low probability of breakage
19:38:35 so i think the answer is "it depends, but we can be conservative on haproxy and similar components"
19:38:52 are there other images we consume that could cause similar issues?
19:39:22 and I'm fine with haproxy:lts as a middle ground for now
19:39:25 i don't know off the top of my head, but if someone wants to `git grep :latest$` and do some digging, i'm happy to review a change
19:39:42 ok, second thing: Use docker prune less aggressively for easier rollback?
19:39:54 We do so for some services, like https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea/tasks/main.yaml#L71-L76, might want to duplicate that for all containers? Bump the hold time to 7d?
19:40:05 (also, honestly i think the fact that haproxy is usually rock solid is why it took us so long to diagnose it. normally checking restart times would be near the top of the list of things to check)
19:40:30 fwiw, when i switched gitea-lb's compose file from latest to lts and did a pull, nothing was downloaded, the image was still in the cache
19:41:13 so "docker prune" doesn't clear the cache?
19:41:32 IIRC it did download on zuul-lb
19:41:50 i sort of wonder what we're trying to achieve there? resiliency against upstream retroactively changing a tag? or shortened download times? or the ability to guess what versions we were running by inspecting the cache?
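As a concrete follow-up to the two suggestions above (auditing for other images on the mutable "latest" tag, and the haproxy latest-to-lts switch in 903805), a minimal sketch; the pull/up sequence is a generic docker-compose pattern under assumed defaults, not necessarily the exact deployment playbook:

    # Find other images in the repo still tracking "latest":
    git grep -n ':latest$'

    # The haproxy pin amounts to changing the compose file's image line from
    #   image: docker.io/library/haproxy:latest
    # to
    #   image: docker.io/library/haproxy:lts
    # and then refreshing the running container, roughly:
    docker-compose pull && docker-compose up -d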
19:42:26 being able to have a fast revert of an image upgrade by just checking "docker images" locally
19:42:47 i guess the concern is that we're tracking lts, upstream moves lts to a broken image, and we've pruned the image that lts used to point to, so we have to redownload it when we change the tag?
19:43:01 also I don't think we are having disk space issues that make fast pruning essential
19:43:25 if it's resiliency against changes, i agree that 7d is probably a good idea. otherwise, 1-3 days is probably okay... if we haven't cared enough to look after 3 days, we can probably check logs or dockerhub, etc...
19:43:27 but also the download times are generally on the order of seconds, not minutes
19:44:05 it might buy us a little time but it's far from the most significant proportion of any related outage
19:44:09 the 3d is only in effect for gitea, most other images are pruned immediately after upgrading
19:44:23 (we're probably going to revert to a tag, which, since we can download in a few seconds, means the local cache isn't super important)
19:45:17 i'm basically +/-0 on adjusting image cache times. i agree that we can afford the additional storage, but want to make sure it doesn't grow without bound
19:45:29 ok, so leave it at that for now
19:45:33 next up: Add timestamps to zuul_reboot.log?
19:45:39 https://opendev.org/opendev/system-config/src/branch/master/playbooks/service-bridge.yaml#L41-L55 Also, this is running on Saturdays (weekday: 6), do we want to fix the comment or the dow?
19:45:40 also having too many images cached makes it a pain to dig through when you're looking for a recent-ish one
19:46:13 is zuul_reboot.log a file? on bridge?
19:46:18 yes
19:46:35 the code above shows how it is generated
19:47:05 aha, /var/log/ansible/zuul_reboot.log
19:47:21 adding timestamps sounds good to me; i like the current time so i'd say change the comment
19:47:27 i have no objection to adding timestamps
19:47:39 to, well, anything really
19:47:41 ok, so I'll look into that
19:47:48 more time-based context is preferable to less
19:47:50 thanks!
19:47:54 final one: Do we want to document or implement a procedure for rolling back zuul upgrades? Or do we assume that issues can always be fixed in a forward-going way?
19:48:26 i think the challenge there is that "downgrading" may mean manually undoing database migrations
19:48:33 like what would we have done if we hadn't found a fast fix for the timer issue?
19:48:45 the details of which will differ from version to version
19:49:21 frickler: if the solution hadn't been obvious i was going to propose a revert of the offending change and try to get that fast-tracked
19:49:44 ok, what if no clear bad patch had been identied?
19:49:50 identified
19:50:23 anyway, we don't need to discuss this at length right now, more something to think about medium term
19:50:24 for zuul we're in a special situation where several of us are maintainers, so we've generally been able to solve things like that quickly one way or another
19:50:39 i agree with fungi, any downgrade procedure is dependent on the revisions in scope, so i don't think there's a generic process we can do
19:51:20 it'll be an on-the-spot determination as to whether it's less work to roll forward or try to unwind things
19:51:33 ok, time's tight, so let's move to AFS?
19:51:40 yep!
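Before the next topic, a generic sketch of one way to prefix each line of zuul_reboot.log with a UTC timestamp, as agreed above; this shows the shell technique only, not the actual change to service-bridge.yaml, and the command name is a placeholder. (In cron syntax, which the Ansible cron module follows, weekday 6 is Saturday, so fixing the comment rather than the schedule matches what was agreed.)

    # Pipe the job output through a small loop that stamps each line:
    some_long_running_command 2>&1 \
      | while IFS= read -r line; do
          printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%S')" "$line"
        done >> /var/log/ansible/zuul_reboot.log

    # If moreutils is available, ts(1) does the same more concisely:
    # some_long_running_command 2>&1 | ts '%Y-%m-%dT%H:%M:%S' >> /var/log/ansible/zuul_reboot.log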
19:51:48 #topic AFS quota issues (frickler)
19:52:10 mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months
19:52:14 ubuntu mirrors are also getting close, but we might have another couple of months there
19:52:17 mirror.centos-stream seems to have a steep increase over the last two months and might also run into quota limits soon
19:52:20 project.zuul with the latest releases is getting close to its tight limit of 1GB (sic), I suggest simply doubling that
19:52:47 the last one is easy I think. for openeuler, instead of bumping the quota someone may want to look into cleanup options first?
19:53:05 the others are more of something to keep an eye on
19:53:09 broken openeuler mirrors that nobody brought to our attention would indicate they're not being used, but yes it's possible we can filter out some things like we do for centos
19:53:45 i'll try to figure out based on git blame who added the openeuler mirror and see if they can propose improvements before lunar new year
19:53:57 well they are being used in devstack, but being out of date for some weeks does not yet break jobs
19:54:03 feel free to action me on the zuul thing if no one else wants to do it
19:54:05 i agree just bumping the zuul quota is fine
19:54:56 #action fungi Reach out to someone about cleaning up OpenEuler mirroring
19:55:13 #action corvus Increase project.zuul AFS volume quota
19:55:25 let's move to the last topic
19:55:34 #topic Broken wheel build issues (frickler)
19:55:56 centos8 wheel builds are the only ones that are thoroughly broken currently?
19:56:07 i'm pleasantly surprised if so
19:56:08 fungi: https://review.opendev.org/c/openstack/devstack/+/900143 is the last patch on devstack that a quick search showed me for openeuler
19:56:21 I think centos9 too?
19:56:35 oh, >+8
19:56:39 >=8
19:56:42 got it
19:57:19 though depends on what you mean by thoroughly (8 months vs. just 1)
19:57:31 how much centos testing is going on these days now that tripleo has basically closed up shop?
19:57:50 wondering how much time we're saving by not rebuilding some stuff from sdist in centos jobs
19:58:20 not sure, I think some usage is still there for special reqs like FIPS
19:58:40 yup, FIPS still needs it.
19:58:54 at least people are still concerned enough about devstack global_venv being broken on centos
19:58:55 for 9 or 8 too?
19:59:09 both I think
19:59:16 I can work with ade_lee to verify what *exactly* is needed and fix or prune as appropriate
19:59:37 we can quite easily stop running the wheel build jobs, if the resources for running those every day are a concern
20:00:06 i guess we can discuss options in #opendev since we're past the end of the hour
20:00:06 the question is then do we want to keep the outdated builds or purge them too?
20:00:20 keeping them doesn't hurt anything, i don't think
20:00:30 it's just an extra index url for pypi
20:00:38 and either the desired wheel is there or it's not
20:00:54 and if it's not, the job grabs the sdist from pypi and builds it
20:00:54 and storage?
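For the project.zuul quota action recorded above, a minimal sketch of the check-and-bump commands; the AFS mount point shown is an assumption about where the volume lives, and 2097152 KB is simply "double the 1GB limit":

    # Inspect current usage and quota at the volume's mount point:
    fs listquota /afs/openstack.org/project/zuul

    # Raise the quota to 2 GB (fs quotas are expressed in kilobyte blocks):
    fs setquota -path /afs/openstack.org/project/zuul -max 2097152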
20:00:58 it does mask build errors that can happen for people that do not have access to those wheels
20:01:19 if the build was working 6 months ago but has broken since then
20:01:32 but anyway, not urgent, we can also continue next year
20:01:36 it's a good point, we considered that as a balance between using job resources continually building the same wheels over and over
20:01:59 and projects forgetting to list the necessary requirements for building the wheels for things they depend on that lack them
20:02:13 okay, let's continue in #opendev. thanks everyone!
20:02:18 #endmeeting
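As a postscript on the "extra index url" point near the end: the wheel mirror behaves as an additional package index, so when a wheel is absent pip falls back to the sdist on PyPI and builds it in the job. A hedged illustration, with a made-up mirror hostname and an arbitrarily chosen package; jobs would normally set this in pip configuration rather than per command:

    # Mirror URL is illustrative only, not an exact OpenDev hostname;
    # netifaces is just an example of a package that may need building.
    pip install \
      --index-url https://pypi.org/simple \
      --extra-index-url https://mirror.example.opendev.org/wheel/ubuntu-22.04-x86_64/ \
      netifaces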