19:01:11 <clarkb> #startmeeting infra
19:01:11 <opendevmeet> Meeting started Tue Aug 22 19:01:11 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <opendevmeet> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VRBBT25TOTXJG3L5SXKWV3EELG34UC5E/ Our Agenda
19:01:42 <clarkb> We've actually got a fairly full agenda so I may move quicker than I'd like. But we can always go back to discussing items at the end of our hour if we have time
19:01:50 <clarkb> #topic Announcements
19:02:22 <clarkb> The service coordinator nomination period ends today. I haven't seen anyone nominate themselves yet. Does this mean everyone is happy with and prefers me to keep doing it?
19:02:36 <fungi> i'm happy to be not-it ;)
19:02:49 <fungi> (also you're doing a great job!)
19:02:59 <corvus> i can confirm 100% that is the correct interpretation
19:03:31 <clarkb> ok I guess I can make it official with an email later today, before the day ends, to avoid any needless process confusion
19:04:03 <clarkb> anything else to announce?
19:05:01 <clarkb> #topic Infra root google account
19:05:17 <clarkb> Just a quick note that I haven't tried to login yet so I have no news yet
19:05:41 <clarkb> but it is on my todo list and hopefully I can get to it soon. This week should be a bit less crazy for me than last week (half the visiting family is no longer here)
19:05:49 <clarkb> #topic Mailman 3
19:06:11 <clarkb> fungi: we made some changes and got things to a good stable stopping point I think. What is next? mailman 3 upgrade?
19:06:32 <fungi> we merged the remaining fixes last week and the correct site names are showing up on archive pages now
19:07:21 <fungi> i've got a held node built with https://review.opendev.org/869210 and have just finished syncing a copy of all our production mm2 lists to it to run a test import
19:07:33 <fungi> #link https://review.opendev.org/869210 Upgrade to latest Mailman 3 releases
19:08:06 <clarkb> oh right we wanted to make sure that upgrading wouldn't put us in a worse position for the 2 -> 3 migration
19:08:07 <fungi> i'll step through the migration steps in https://etherpad.opendev.org/p/mm3migration to make sure they're still working correctly
19:08:17 <fungi> #link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan
19:08:41 <fungi> at which point we can merge the upgrade change to swap out the containers and start scheduling new domain migrations
19:09:22 <fungi> that's where we're at currently
19:09:49 <clarkb> sounds good. Let us know when you feel ready for us to review and approve the upgrade change
19:09:58 <clarkb> #topic Gerrit Updates
19:09:59 <fungi> will do, thanks!
19:10:21 <clarkb> I did email the gerrit list last week about the too many query terms problem frickler has run into with starred changes
19:10:49 <clarkb> they seem to acknowledge that this is less than ideal (one idea was to log/report the query in its entirety so that you could use that information to find the starred changes)
19:11:11 <clarkb> but no suggestions for a workaround other than what we already know (bump index.maxTerms)
19:11:28 <clarkb> no one said don't do that either so I think we can proceed with frickler's change and then monitor the resulting performance situation
19:11:35 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/892057 Bump index.maxTerms to address starred changes limit.
19:12:30 <clarkb> ideally we can approve that and restart gerrit with the new config during a timeframe when frickler is able to quickly confirm it has fixed the issue (so that we can revert and/or debug further if necessary)
19:13:13 <clarkb> frickler: I know you couldn't attend the meeting today, but maybe we can sync up later this week on a good time to do that restart with you
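[Editor's note: for reference, index.maxTerms lives in Gerrit's gerrit.config; a minimal sketch of the kind of bump being discussed follows. The value shown is illustrative only, not necessarily what 892057 sets.]

    # etc/gerrit.config on the review server (sketch; value is illustrative)
    [index]
        type = lucene
        # Gerrit's default limit is 1024 terms per query; starred-changes
        # queries expand to roughly one term per starred change, so heavy
        # users can run past the default.
        maxTerms = 10000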
19:13:22 <clarkb> #topic Server Upgrades
19:13:43 <clarkb> no news here. Mostly a casualty of my traveling and having family around. I should have more time for this in the near future
19:14:02 <clarkb> (I don't like replacing servers when I feel I'm not in a position to revert or debug unexpected problems)
19:14:49 <clarkb> #topic Rax IAD image upload struggles
19:15:13 <clarkb> nodepool image uploads to rax iad are timing out
19:15:32 <clarkb> this problem seems to get worse the more uploads you perform at the same time
19:15:59 <clarkb> The other two rackspace regions do not have this problem (or if they share the underlying mechanism it doesn't manifest as badly, so we don't really care/notice)
19:16:07 <fungi> specifically, the bulk of the time/increase seems to occur on the backend in glance after the upload completes
19:16:31 <clarkb> fungi and frickler have been doing the bulk of the debugging (thank you)
19:16:48 <clarkb> I think our end goal is to collect enough info that we can file a ticket with rackspace to see if this is something they can fix up
19:16:48 <fungi> when we were trying to upload all our images, i clocked around 5 hours from the end of an upload to when it would first start to appear in the image list
19:17:47 <fungi> when we're not uploading any images, a single test upload is followed by around 30 minutes of time before it appears in the image list
19:18:50 <clarkb> and when you multiply the number of images we have by 30 minutes you end up with something suspiciously close to 5 hours
19:18:55 <fungi> and yes, what we're mostly lacking right now is someone with the time to file a ticket and try to explain our observations in a way that doesn't come across as us having unrealistic expectations
19:19:25 <corvus> in the mean time, are we increasing the upload timeout to accomodate 5h?
19:19:37 <clarkb> I think a key part of doing that is showing that dfw and ord don't suffer from this, which would indicate an actual problem with the region
19:19:44 <fungi> "we're trying to upload 15 images in parallel, around 20gb each, and your service can't keep up" is likely to result in us being told "please stop that"
19:20:12 <clarkb> fungi: note it should eventually balance out to an image an hour or so due to rebuild timelines
19:20:22 <clarkb> but due to the long processing times they all pile up instead
19:20:23 <fungi> corvus: frickler piecemeal unpaused some images manually to get fresher uploads for our more frequently used images to complete
19:21:22 <fungi> he did test by overriding the openstacksdk default wait timeout to something like 10 hours, just to confirm it did work around the issue
19:21:44 <clarkb> that isn't currently configurable in nodepool today? Or maybe we could do it with clouds.yaml?
19:21:44 <corvus> oh this is an sdk-level timeout?  not the nodepool one?
19:21:48 <clarkb> ya
19:21:49 <fungi> yes
19:22:05 <fungi> nodepool doesn't currently expose a config option for passing the sdk timeout parameter
19:22:19 <fungi> we could add one, but also this seems like a pathological condition
19:23:24 <corvus> still, i think that would be an ok change.
19:23:43 <fungi> yeah, i agree, i just don't think we're likely to want to actually set that long term
19:23:58 <corvus> in general nodepool has a bunch of user-configurable timeouts for stuff like that because we know especially with clouds, things are going to be site dependent.
19:24:20 <corvus> yeah, it's not an ideal solution, but, i think, an acceptable one.  :)
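[Editor's note: the timeout in question is the wait timeout openstacksdk applies while polling for an uploaded image to become active. A minimal Python sketch using the sdk's cloud-layer create_image() call follows; the cloud/region names, image name, file path, and the 10-hour value are illustrative assumptions, and nodepool drives this call internally rather than exposing the parameter today.]

    import openstack

    # Cloud and region names here are assumptions for illustration.
    conn = openstack.connect(cloud='rax', region_name='IAD')

    # With wait=True, create_image() uploads the file and then polls the
    # image until it goes active; timeout= bounds that wait.  This is the
    # knob frickler overrode (to roughly 10 hours) to ride out the long
    # post-upload processing seen in IAD.
    image = conn.create_image(
        'ubuntu-jammy-test',                        # illustrative image name
        filename='/opt/nodepool/ubuntu-jammy.vhd',  # illustrative path
        wait=True,
        timeout=10 * 3600,
    )
    print(image.id, image.status)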
19:24:26 <clarkb> other things to note. We had leaked images in all three regions. fungi cleared those out manually. I'm not sure why the leak detection and cleanup in nodepool didn't find them and take care of it for us.
19:24:51 <clarkb> and we had instances that we could not delete in all three regions that rackspace cleaned up for us after a ticket was submitted
19:25:09 <fungi> i cleared out around 1200 leaked images in iad (mostly due to the upload timeout problem i think, based on their ages). the other two regions have around 400 leaked images but have not been cleaned up yet
19:25:28 <corvus> maybe the leaked images had a different schema or something.  if it happens again, ping me and i can try to see why.
19:25:40 <clarkb> thanks!
19:25:59 <fungi> corvus: sure, we can dig deeper. it seemed outwardly that nodepool decided the images never got created
19:26:26 <fungi> because the sdk returned an error when it gave up waiting for them
19:26:28 <corvus> ah, could be missing the metadata entirely then too
19:26:55 <fungi> yes, it could be missing followup steps the sdk would otherwise have performed
19:27:19 <fungi> if the metadata is something that gets set post-upload
19:27:32 <fungi> (from the cloud api side i mean)
19:28:05 <clarkb> anything else on this topic?
19:28:18 <fungi> not from me
19:28:22 <clarkb> #topic Fedora Cleanup
19:29:03 <clarkb> I pushed two changes earlier today to start on this. Bindep is the only fedora-latest user in codesearch that is also in the zuul tenant config.
19:29:05 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Cleanup fedora-latest in bindep
19:29:23 <clarkb> er that's the next change, one second while I copy-paste the bindep one properly
19:29:36 <clarkb> #undo
19:29:36 <opendevmeet> Removing item from minutes: #link https://review.opendev.org/c/opendev/base-jobs/+/892380
19:29:49 <clarkb> #link https://review.opendev.org/c/opendev/bindep/+/892378 Cleanup fedora-latest in bindep
19:30:09 <clarkb> This should be a very safe change. The next one which removes nodeset: fedora-latest is less safe because older branches may use it etc
19:30:20 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove fedora-latest nodeset
19:30:51 <clarkb> I'm personally inclined to land both of them and we can revert 892380 if something unexpected happens. but that nodeset doesn't work anyway so those jobs should already be broken
19:31:08 <clarkb> We can then look at cleaning things up from nodepool and the mirrors etc
19:31:53 <clarkb> let me know if you disagree or find extra bits that need cleanup first
19:32:02 <clarkb> #topic Gitea 1.20 Upgrade
19:32:12 <fungi> this cleanup also led to us rushing through an untested change for the zuul/zuul-jobs repo, we do need to remember that it has stakeholders beyond our deployment
19:32:41 <clarkb> Gitea has published a 1.20.3 release already. I think my impression that this is a big release with not a lot of new features is backed up by the amount of fixing they have had to do
19:33:21 <clarkb> But I think I've managed to work through all the documented breaking changes (and one undocumented breaking change)
19:33:39 <clarkb> there is a held node that seems to work here: https://158.69.78.38:3081/opendev/system-config
19:33:59 <clarkb> The main thing at this point would be to go over the change itself and that held node to make sure you are happy with the changes I had to make
19:34:18 <clarkb> and if so I can add the new necessary but otherwise completely ignored secret data to our prod secrets and merge the change when we can monitor it
19:34:38 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/886993 Gitea 1.20 change
19:34:41 <clarkb> there is the change
19:35:05 <clarkb> looks like fungi did +2 it. I should probably figure out those secrets then
19:35:08 <clarkb> thanks for the review
19:35:11 <fungi> it looked fine to me, but i won't be around today to troubleshoot if something goes sideways with the upgrade
19:35:27 <clarkb> we can approve tomorrow. I'm not in a huge rush other than simply wanting it off my todo list
19:35:46 <clarkb> #topic Zuul changes and updates
19:36:01 <clarkb> There are a number of user facing changes that have been made or will be made very soon in zuul
19:36:18 <clarkb> I want to make sure we're all aware of them and have some sort of plan for working through them
19:36:47 <clarkb> First up Zuul has added Ansible 8 support. In the not too distant future Zuul will drop Ansible 6 support which is what we default to today
19:37:12 <clarkb> in the past what we've done is asked our users to test new ansible against their jobs if they are worried and set a hard cutover date via tenant config in the future
19:37:14 <corvus> and between those 2 events, zuul will switch the default from 6 to 8
19:37:19 <fungi> also it's skipping ansible 7 support, right?
19:37:31 <corvus> yeah we're too slow, missed it.
19:37:59 <clarkb> OpenStack Bobcat releases ~Oct 6
19:38:26 <clarkb> I'm thinking we switch all tenants to ansible 8 by default the week after that
19:38:43 <clarkb> though we should probably go ahead and switch the opendev tenant nowish. So on that date we'd switch all tenants that haven't switched yet
19:38:47 <corvus> oh that sounds extremely generous.  i think we can and should look into switching much earlier
19:39:14 <clarkb> corvus: my concern with that is openstack will have a revolt if anything breaks due to their CI already being super flaky and the release coming up
19:39:31 <corvus> what if we switch opendev, wait a while, switch everyone but openstack, wait a while, and then switch openstack...
19:39:38 <fungi> wfm
19:39:43 <clarkb> that is probably fine
19:40:14 <frickler> with "a while" = 6 weeks that sounds fine
19:40:24 <corvus> i think if we run into any problems, we can throw the brakes easily, but given that zuul (including zuul-jobs) switched with no changes, this might go smoothly...
19:40:29 <clarkb> maybe try to switch opendev this week, everyone else less openstack the week after if opendev is happy. Then do openstack after the bobcat release?
19:40:41 <fungi> sounds good
19:40:48 <clarkb> that way we can get an email out and give people at least some lead time
19:40:56 <corvus> well that's not really what i'm suggesting
19:41:18 <corvus> we could do a week between each and have it wrapped up by mid september
19:41:31 <clarkb> hrm I think the main risk is that specific jobs break
19:41:44 <clarkb> and openstack is going to be angry about that given how unhappy openstack ci has been lately
19:41:44 <corvus> or we could do opendev now, and then everyone else 1 week after that and wrap it up 2 weeks from now
19:42:03 <frickler> where is this urgency coming from?
19:42:23 <corvus> we actually should be doing this much more quickly and much more frequently
19:42:26 <fungi> being able to merge the change in zuul that drops ansible 6 and being able to continue upgrading the opendev deployment when that happens
19:42:34 <clarkb> I think last time we went quickly with the idea that specific jobs could force ansible 6, but then other zuul config errors complicated that more than we anticipated
19:42:39 <corvus> we need to change ansible versions every 6 months to keep up with what's supported upstream
19:42:50 <corvus> so i would like us to acclimate to this being somewhat frequent and not a big deal
19:42:57 <clarkb> and to be fair we've been pushing openstack to address those but largely only frickler has put any effort into it :/
19:43:37 <fungi> though as of an hour ago it seems like we've got buy-in from the openstack tc to start deleting branches in repos with unfixed zuul config errors
19:43:46 <frickler> what's wrong with running unsupported ansible versions? people are still running python2
19:44:17 <clarkb> frickler: at least one issue is the size of the installation for ansible. Every version you support creates massive bloat in your zuul executors
19:44:19 <corvus> it's not our intention to use out-of-support ansible
19:44:48 <frickler> I don't think that is compatible with the state openstack is in
19:44:57 <corvus> how so?
19:45:07 <corvus> do we know that openstack runs jobs that don't work with ansible 8?
19:45:14 <fungi> we certainly can't encourage it, given zuul is designed to run untrusted payloads and ansible upstream won't be fixing security vulnerabilities in those versions
19:45:14 <frickler> there is no developer capacity to adapt
19:45:30 <clarkb> no we do not know that yet. JayF thought that some ironic jobs might use openstack collections though which are apparently not backward compatible in 8
19:45:39 <frickler> if it all works fine, fine, but if not, adapting will take a long time
19:45:50 <corvus> zuul's executor can't use collections...
19:45:52 <fungi> i don't see how ironic jobs would be using collections with the executor's ansible
19:45:59 <clarkb> ah
19:46:20 <fungi> if they're using collections that would be in a nested ansible, so not affected by this
19:46:24 <clarkb> maybe a compromise would be to try it soonish and if things break in ways that aren't reasonable to address then we can revert to 6? But it sounds like we expect 8 to work so go for it until we have evidence to the contrary?
19:46:46 <corvus> yeah, i am fully on-board with throwing the emergency brake lever if we find problems
19:46:49 <clarkb> I'm not sure if installing the big pypi ansible package gets you the openstack stuff
19:47:01 <frickler> it does afaict
19:47:18 <corvus> i don't think we should assume that everything will break.  :)
19:47:47 <frickler> the problem is to find out what breaks, how will you do that?
19:47:55 <clarkb> ok proposal: switch opendev nowish. If that looks happy plan to switch everyone else less openstack early next week. If that looks good switch openstack late next week or early the week after
19:47:56 <frickler> waiting for people to complain will not work
19:48:19 <clarkb> frickler: I mean if a tree falls in a forest and no one hears...
19:48:29 <clarkb> I understand your concern but if no one is paying attention then we aren't going to solve that either way
19:48:41 <clarkb> we can however react to those who are paying attention
19:48:50 <clarkb> and I think that is the best we can do whether we wait a long time or a short time
19:48:56 <corvus> that works for me; if we want to give openstack more buffer around the release, then switching earlier may help.  either sounds good to me.
19:50:17 <clarkb> doesn't look like any of the TC took fungi's invitation to discuss it here so we can't get their input
19:51:19 <fungi> well, can't get their input in this meeting anyway
19:51:29 <clarkb> corvus: just pushed a change to switch opendev
19:51:55 <clarkb> let's land that asap and then if we don't have anything pushing the brakes by tomorrow I can send email to service-announce with the plan I sketched out above
19:52:11 <clarkb> we are running out of time though and I wanted to get to a few more items
19:52:18 <corvus> #link  https://review.opendev.org/c/openstack/project-config/+/892405 Switch opendev tenant to Ansible 8
19:52:25 <frickler> switching the ansible version does work speculatively, right?
19:52:31 <clarkb> frickler: it does at a job level yes
19:52:41 <clarkb> so we can ask people to test things before we switch them too
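[Editor's note: a sketch of the two knobs involved, per Zuul's tenant and job configuration. The tenant name matches the one discussed; the job name and exact values are illustrative examples.]

    # project-config tenant definition: flip the default for a whole tenant
    - tenant:
        name: opendev
        default-ansible-version: '8'

    # in-repo job config: a single job can pin a version, either to test
    # ahead of the tenant switch or to temporarily opt back out
    - job:
        name: my-example-job
        ansible-version: '8'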
19:52:58 <clarkb> Zuul is also planning to uncombine stdout and stderr in command/shell like tasks
19:53:29 <corvus> this one is riskier.
19:53:34 <clarkb> I think this may be more likely to cause problems than the ansible 8 upgrade since it isn't always clear things are going to stderr, particularly when it has just worked historically
19:53:59 <clarkb> we probably need to test this more methodically from a base jobs/base roles standpoint and then move up the job inheritance ladder
19:54:08 <corvus> i think the best we can do there is speculatively test it with zuul-jobs first to see if anything explodes...
19:54:17 <corvus> then what clarkb said
19:54:39 <corvus> we might want to make new base jobs in opendev so we can flip tenants one at a time...?
19:54:47 <corvus> (because we can change the per-tenant default base job...)
19:55:04 <clarkb> oh that is an interesting idea I hadn't considered
19:55:37 <corvus> i would be very happy to let this one bake for quite a while in zuul....
19:55:46 <clarkb> ok so this is less urgent. That is good to know
19:56:04 <clarkb> we can probably punt on decisions for it now. But keep it in mind and let us know if you have any good ideas for testing that ahead of time
19:56:23 <corvus> like, long enough for lots of community members to upgrade their zuuls and try it out, etc.
19:56:29 <clarkb> And finally early failure detection in tasks via regex matching of task output is in the works
19:56:31 <corvus> there's no ticking clock on this one...
19:56:37 <clarkb> ack
19:56:44 <corvus> the regex change is being gated as we speak
19:57:00 <clarkb> the failure stuff is more "this is a cool feature you might want to take advantage of" it won't affect existing jobs without intervention
19:57:28 <corvus> we'll try it out in zuul and maybe come up with some patterns that can be copied
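[Editor's note: the feature being gated matches streamed task output against job-supplied regexes so Zuul can report a failing build before the playbook finishes. Assuming it lands as a "failure-output" job attribute, usage would look roughly like the sketch below; the job name and patterns are illustrative.]

    - job:
        name: my-example-job
        # If a line of streamed console output matches one of these regexes,
        # zuul can mark the build as failing early instead of waiting for
        # the task (and playbook) to finish.
        failure-output:
          - 'FATAL ERROR'
          - 'Traceback \(most recent call last\)'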
19:58:19 <clarkb> #topic Base container image updates
19:58:36 <clarkb> really quickly before we run out of time. We are now in a good spot to convert the consumers of the base container images to bookworm
19:58:59 <clarkb> The only one I really expect to maybe have problems is gerrit due to the java stuff, and it may be a non-issue there since containers seem to have fewer issues with this
19:59:00 <corvus> i think zuul is ready for that
19:59:43 <clarkb> help appreciated updating any of our images :)
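[Editor's note: for most images this conversion is expected to be mainly a base-image tag bump in the Dockerfile, roughly as sketched below following the usual opendev builder/base pattern. The image tags shown are assumptions; check the published opendevorg tags before copying.]

    # Sketch: swap the bullseye-based opendev base images for bookworm tags
    FROM docker.io/opendevorg/python-builder:3.11-bookworm as builder
    COPY . /tmp/src
    RUN assemble

    FROM docker.io/opendevorg/python-base:3.11-bookworm
    COPY --from=builder /output/ /output
    RUN /output/install-from-bindep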
19:59:48 <clarkb> #topic Open Discussion
19:59:54 <clarkb> Anything important last minute before we call it a meeting?
20:00:46 <fungi> nothing here
20:00:58 <clarkb> Thank you everyone for your time. Feel free to continue discussion in our other venues
20:01:02 <clarkb> #endmeeting