19:01:12 <clarkb> #startmeeting infra
19:01:12 <opendevmeet> Meeting started Tue Sep 12 19:01:12 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 <opendevmeet> The meeting name has been set to 'infra'
19:01:40 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V23MFBBCANVK4YAK2IYOV5XNLFY64U3X/ Our Agenda
19:01:49 <clarkb> #topic Announcements
19:02:16 <clarkb> I will be out tomorrow
19:02:17 <fungi> hello to you too!
19:02:49 <clarkb> I'll be off doing a little bit of fishing for hopefully not little fish, so I won't really be around keyboards
19:02:54 <clarkb> but back thursday for the mailman fun
19:03:09 <clarkb> #topic Mailman 3
19:03:19 <clarkb> may as well jump straight into that
19:03:33 <fungi> nothing new here really. planning to migrate airship and kata on thursday
19:03:36 <clarkb> fungi: The plan is to migrate lists.katacontainers.io and lists.airshipit.org to mailman 3 on september 14
19:03:47 <clarkb> fungi: and you are starting around 1500 iirc?
19:03:58 <fungi> yep. will do a preliminary data sync tomorrow to prepare, so the sync during downtime will be shorter
19:04:30 <fungi> i notified the airship-discuss and kata-dev mailing lists since those are the most active ones for their respective domains
19:05:17 <clarkb> fungi: anything else we can do to help prepare?
19:05:26 <fungi> if thursday's maintenance goes well, i'll send similar notifications to the foundation and starlingx-discuss lists immediately thereafter about a similar migration for the lists.openinfra.dev and lists.starlingx.io sites thursday of next week
19:05:45 <fungi> i don't think we've got anything else that needs doing for mailman at the moment
19:06:01 <clarkb> excellent. Thank you for getting this together
19:06:25 <fungi> we did find out that the current lists.katacontainers.io server's ipv4 address was on the spamhaus pbl
19:06:36 <fungi> i put in a removal request for that in the meantime
19:06:55 <clarkb> the ipv6 address was listed too but we can't remove it because the entire /64 is listed and we have a /128 on the host?
19:07:03 <clarkb> that should improve after we migrate hosts anyway
19:07:04 <fungi> their css blocklist has the ipv6 /64 for that server listed, not much we can do about that yeah
19:07:25 <fungi> the "problem" traffic from that /64 is coming from another rackspace customer
19:07:39 <fungi> new server is in a different /64, so not affected at the moment
19:07:59 <clarkb> anything else mail(man) related?
19:08:57 <fungi> not from my end
19:08:59 <frickler> there was a spam mail on service-discuss earlier, just wondering whether that might be mailman3 related
19:09:06 <clarkb> oh right that is worth calling out
19:09:26 <clarkb> frickler: I think it is mailman3 related in that fungi suspects that maybe the web interface makes it easier for people to do?
19:09:40 <fungi> oh, did one get through? i didn't notice, but can delete it from the archive and block that sender
19:09:42 <frickler> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/message/JLGRB7TNXJK2W3ELRXMOTAK3NH5TNYI3/
19:09:55 <clarkb> fungi: is that process documented for mm3 yet? might be good to do so if not
19:10:02 <fungi> thanks, i'll take care of that right after the meeting
19:10:51 <clarkb> #topic Infra Root Google Account
19:11:04 <fungi> and yes, the supposition is that making it possible for people to post to mailing lists without a mail client (via their web browser) also makes it easier for spammers to do the same. the recommendation from the mm3 maintainers is to adjust the default list moderation policy so that every new subscriber is moderated by default until a moderator adds them to a whitelist
19:11:23 <clarkb> I did some investigating and learned things. tl;dr is that this account is used by zuul to talk to the Gerrit project's gerrit
19:11:43 <clarkb> using the resulting gerrit token doesn't appear to count as something google can track as account activity (probably a good thing imo)
19:12:21 <clarkb> I logged into the account, which I think is sufficient to reset the delete clock. However, reading elsewhere, some people suggest doing some action on top of logging in, like making a google search or watching a youtube video. I'll plan to log back in and do some searches then
19:12:39 <clarkb> unfortunately, this is clear as mud when it comes to what actually counts for not getting deleted
19:13:36 <fungi> worst case we get the gerrit admins to authorize a new account when that one gets deactivated by google
19:13:58 <clarkb> fungi: yup but that will require a new email address. I'm not sure if google will allow +suffixes
19:14:01 <clarkb> (my guess is not)
19:14:26 <corvus> but then we could use snarky email addresses :)
19:15:28 <corvus> (but still, who needs the trouble; thanks for logging in and searching for "timothee chalamet" or whatever)
19:15:51 <clarkb> I was going to search for zuul
19:15:53 <clarkb> make it super meta
19:16:05 <clarkb> anyway we should keep this account alive if we can and that was all i had
19:16:10 <corvus> ooh good one
19:16:13 <clarkb> #topic Server Upgrades
19:16:21 <clarkb> No updates from me on this and I haven't seen any other movement on it
19:16:31 <clarkb> #topic Nodepool image upload changes
19:16:53 <clarkb> We have removed fedora 35 and 36 images from nodepool which reduces the total number of image builds that need uploading
19:17:09 <clarkb> on top of that we reduced the build and upload rates. Has anyone dug in to see how that is going since we made the change?
19:17:47 <clarkb> glancing at https://grafana.opendev.org/d/f3089338b3/nodepool3a-dib-status?orgId=1 the builds themselves seem fine, but I'm not sure if we're keeping up with uploads particularly to rax iad
19:18:00 <corvus> we also added the ability to configure the upload timeout in nodepool; have we configured that in opendev yet?
19:18:27 <frickler> I don't think so, I missed that change
19:18:27 <clarkb> corvus: oh yes I put that in the agenda email even. I don't think we have configured it but we probably should
19:18:46 <corvus> it used to be hard-coded to 6 hours before nodepool switched to sdk
19:18:53 <corvus> i think that would be a fine value to start with :)
19:19:27 <clarkb> wfm
19:19:33 <fungi> yeah, i found in git history spelunking that we lost the 6-hour value when shade was split from nodepool
19:19:44 <fungi> so it's... been a while
19:19:46 <clarkb> maybe we set that value then check back in next week to see how rax iad is doing afterwards
19:19:50 <corvus> #link https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].image-upload-timeout
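[For reference, a minimal sketch of that setting in a builder provider config; the provider, cloud, and diskimage names here are only illustrative, with just the image-upload-timeout attribute from the doc above and the 6-hour value corvus mentioned taken from the discussion:]

```yaml
providers:
  - name: rax-iad           # hypothetical provider entry for illustration
    driver: openstack
    cloud: rax
    region-name: IAD
    # restore the old pre-sdk behavior of allowing up to 6 hours per glance upload
    image-upload-timeout: 21600
    diskimages:
      - name: ubuntu-jammy
```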
19:20:22 <frickler> one thing I notice about the 7d rebuild cycle is that it doesn't seem to work well with two images being kept
19:20:36 <clarkb> frickler: because we delete the old image after one day?
19:20:56 <frickler> yes and then upload another
19:21:19 <frickler> so the two images are just one day apart, not another 7
19:21:36 <frickler> but maybe that's as designed, kind of
19:22:20 <frickler> at least that's my interpretation for why we have images 6d and 7d old
19:22:39 <clarkb> oh hrm
19:22:46 <clarkb> I guess that still cuts down on total uploads but in a weird way
19:22:50 <clarkb> I think we can live with that for now
19:22:51 <fungi> makes sense, maybe we need to set a similarly long image expiration (can that also be set per image?)
19:23:29 <frickler> or maybe it is only because we started one week ago
19:23:45 <fungi> oh, possible
19:24:16 <fungi> so the older of the two was built on the daily rather than weekly cycle
19:25:18 <clarkb> ok any volunteers to push a change updating the timeout for uploads in each of the provider configs?
19:25:33 <clarkb> I think that goes in the two builder config files as the launcher nl0X files don't care about upload timeouts
19:25:45 <frickler> for bookworm we have 2d and 4d, that looks better
19:25:59 <frickler> I can do the timeout patch tomorrow
19:26:05 <clarkb> frickler: thanks!
19:26:44 <clarkb> #topic Zuul PCRE deprecation
19:27:18 <fungi> we've merged most of the larger changes impacting our tenants, right? just a bunch of stragglers now?
19:27:18 <clarkb> wanted to check in on this as I haven't been able to follow the work that has happened super closely. I know changes to fix the regexes are landing and I haven't heard of any problems. But is there anything to be concerned about?
19:27:31 <corvus> i haven't drafted an email yet for this, but will soon
19:27:49 <corvus> i haven't heard of any issues
19:27:55 <fungi> nor have i
19:28:11 <fungi> at least not after the fixes for config changes on projects with warnings
19:28:30 <frickler> this list certainly is something to be concerned about https://zuul.opendev.org/t/openstack/config-errors?severity=warning
19:29:54 <fungi> yes, all the individual projects with one or two regexes in their configs (especially on stable branches) are going to be the long tail
19:30:51 <fungi> i think we just let them know when the removal is coming and either they fix it or we start removing projects from the tenant config after they've been broken for too long
19:31:10 <frickler> or we just keep backwards compatibility for once?
19:31:31 <corvus> we can't keep backwards compat once this is removed from zuul
19:31:49 <frickler> well that's on zuul then. or opendev can stop running latest zuul
19:32:17 <corvus> and the point of this migration is to prevent dos attacks against sites like opendev
19:32:22 <clarkb> the motivation behind the change is actually one that is useful to opendev though, ya, that
19:32:31 <corvus> so it would be counter productive for opendev to do that
19:32:56 <corvus> is that a serious consideration?  is this group interested in no longer running the latest zuul?  because of this?
19:33:01 <frickler> how many dos have we seen vs. how much trouble has the queue change brought?
19:33:34 <corvus> i have to admit i'm a little surprised to hear the suggestion
19:33:38 <clarkb> frickler: I think the queue changes expose underlying problems in various projects more than they are the actual problem
19:33:55 <clarkb> it isn't our fault that many pieces of software are largely unmaintained and in those cases I don't think not running CI is a huge deal
19:34:06 <clarkb> when the projects become maintained again they can update their configs and move forward
19:34:24 <clarkb> either to fix existing problems in the older CI setups for the projects or by deleting what was there and starting over
19:34:56 <fungi> i guess it's a question of whether projects want a ci system that never breaks backwards compatibility, or one that fixes bugs and adds new features
19:34:59 <clarkb> I agree that this exposes a problem I just don't see the zuul updates as the problem. They indicate deeper problems
19:35:04 <corvus> indeed, zuul's configuration error handling was specifically designed for this set of requirements from opendev -- that if there are problems in "leaf node" projects, it doesn't take out the whole system.  so it's also surprising to hear that is not helpful.
19:36:23 <fungi> also though, zuul relies on other software (notably python and ansible) which have relatively aggressive deprecation and eol schedules, so not breaking backward compatibility would quickly mean running with eol versions of those integral dependencies
19:36:26 <clarkb> my personal take on it is that openstack should do like opendev and aggressively prune what isn't sustainable. But I know that isn't the current direction of the project
19:37:47 <fungi> even with jenkins we had upgrades which broke some job configurations and required changes for projects to continue running things in it
19:37:50 <corvus> i thought opendev's policy of making sure people know about upcoming changes, telling them how to fix problems, and otherwise not being concerned if individual projects aren't interested in maintaining their use of the system is reasonable.  and in those cases, just letting the projects "error out" and then pruning them.  i don't feel like that warrants a change in the ci system.
19:38:24 <clarkb> corvus: any idea what zuul's deprecation period looks like for this change? I suspect this one might take some time due to the likelihood of impacting existing installs? More so than the ansible switch anyway
19:39:24 <corvus> i don't think it's been established, but my own feeling is the same, and i would be surprised if anyone had an appetite for it in less than, say, 6 months?
19:39:35 <corvus> yeah, ansible switches are short, and not by our choice.
19:39:44 <corvus> these other things can take as long as we want.
19:39:50 <corvus> (we = zuul community)
19:40:22 <clarkb> ya so this is going to be less urgent anyway and the bulk of that list is currently tempest (something that should be fixable) and tripleo-ci (something that is going away anyway)
19:40:55 <clarkb> I feel like this is doable with a reasonable outcome in opendev
19:40:58 <corvus> so far, the regexes we've looked at are super easy to fix too
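[For context, the typical fix is replacing a PCRE-only construct such as a negative lookahead with zuul's re2-friendly negated regex form; a rough before/after sketch, with example branch names and the exact matcher syntax to be double-checked against the zuul docs:]

```yaml
# before: negative lookahead, not supported by re2
branches: ^(?!stable/(pike|queens)).*$

# after: equivalent matcher using an explicit negated regex
branches:
  regex: ^stable/(pike|queens)
  negate: true
```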
19:41:10 <fungi> the named queue change was a year+ deprecation period, fwiw. but this time we have active reporting of deprecations which we didn't have back then
19:42:17 <frickler> the errors page and filtering is really helpful indeed
19:42:36 <fungi> unfortunately, that "long tail" of inactive projects who don't get around to fixing things until after they break (if ever) will be the same no matter how long of a deprecation period we give them
19:43:03 <corvus> to me, it seems like a reasonable path forward to send out an email announcing it along with suggestions on how to fix (that's my TODO), then if there's some kind of hardship in the future when zuul removes the backwards compat, bring that up.  maybe zuul delays or mitigates it.  or maybe opendev drops some more idle projects/branches.
19:44:02 <frickler> anyway 6 months is more than I expected, that gives some room for things to happen
19:44:44 <corvus> oh yeah, when we have a choice, zuul tends to have very long deprecation periods
19:44:47 <clarkb> ok lets check back in after the email notice is sent and people have had more time to address things
19:46:12 <clarkb> #topic Python Container Updates
19:46:52 <clarkb> The only current open change is for Gerrit. I decided to hold off on doing that with the zuul restart because we had weekend things and keeping it to just zuul streamlined things a bit
19:47:17 <clarkb> I'm thinking maybe this Friday we land the change and do a short gerrit outage. Fridays tend to be quiet.
19:47:47 <clarkb> Nodepool and Zuul both seem happy running on bookworm though which is a good sign since they are pretty non trivial setups (nodepool builders in particular)
19:48:06 <fungi> i'm going to be mostly afk on friday, but don't let that stop you
19:49:05 <clarkb> ack
19:49:22 <clarkb> I suspect we've got a few more services that could be converted too, but I've got to look more closely at codesearch to find them
19:49:34 <clarkb> probably worth getting gerrit out of the way then starting with a new batch
19:49:40 <clarkb> #topic Open Discussion
19:49:47 <clarkb> We upgraded Gitea to 1.20.4
19:50:09 <clarkb> We removed fedora images as previously noted
19:50:20 <clarkb> Neither have produced any issues that I've seen
19:50:24 <clarkb> Anything else?
19:51:02 <fungi> i didn't have anything
19:52:12 <corvus> maybe the github stuff?
19:52:25 <frickler> maybe mention the inmotion rebuild idea
19:52:47 <clarkb> for github we can work around rate limits (partially) by adding a user token to our zuul config
19:52:55 <corvus> couple years ago, we noted that the openstack tenant can not complete a reconfiguration in zuul in less than an hour because of github api limits
19:53:04 <clarkb> but for that to work properly we need zuul to fallback from the app token to the user token to anonymous
19:53:16 <corvus> that problem still persists, and is up to >2 hours now
19:53:29 <clarkb> there is a change open for that https://review.opendev.org/c/zuul/zuul/+/794688 that needs updates and I've volunteered to revive it
19:53:30 <frickler> do we know if something changed on the github side?
19:53:32 <corvus> (and is what caused the full zuul restart to take a while)
19:53:39 <frickler> or just more load from zuul?
19:53:45 <clarkb> frickler: I think the changes are in the repos growing more branches which leads to more api requests
19:54:02 <clarkb> and then instead of hitting the limit once we now hit it multiple times and each time is an hour delay
19:54:12 <corvus> yeah, more repos + more branches since then, but already 2 years ago it was enough to be a problem
19:54:32 <clarkb> this is partially why I want to clean up those projects which can be cleaned up and/or are particularly bad about this
19:54:49 <clarkb> https://review.opendev.org/c/openstack/project-config/+/894814 and child should be safe to merge for that
19:54:51 <frickler> likely also we didn't do a real restart for a long time, so didn't notice that growth?
19:54:57 <clarkb> but the better fix is making zuul smarter about its github requests
19:55:08 <clarkb> frickler: correct we haven't done a full restart without cached data in a long time
19:55:55 <clarkb> for Inmotion the cloud we've got is older and was one of the first cloud as a service deployments they did. Since then they have reportedly improved their tooling and systems and can deploy newer openstack
19:56:00 <frickler> regarding inmotion, we could update the cluster to a recent openstack version and thus avoid having to fix stuck image deletions. and hopefully have less issues with stuck things going forward
19:56:06 <corvus> this does affect us while not restarting; tenant reconfigurations can take a long time.  it's just zuul is really good at masking that now.
19:56:18 <clarkb> corvus: ah fun
19:56:30 <clarkb> frickler: yup there are placement issues in particular that melwitt reports would be fixed if we upgraded
19:56:54 <clarkb> I can write an email to yuriys indicating we'd like to do that after the openstack release happens and get some guidance from their end before we start
19:57:26 <clarkb> but I suspect the rough process to be make notes for how things are setup today (number of nodes and roles of nodes, networking setup for neutron, size of nodes, etc) then replicate that using newer openstack in a new cluster
19:57:45 <frickler> getting feedback what versions they support would also be interesting
19:57:48 <clarkb> one open question is if we have to delete the existing cluster first or if we can have two clusters side by side for a short time. yuriys should be able to give feedback on that
19:58:14 <frickler> like switch to rocky as base os and have 2023.1? or maybe even 2023.2 by then?
19:58:36 <clarkb> I think they said they were doing stream
19:58:42 <clarkb> but ya we can clarify all that too
19:58:53 <frickler> I think two parallel clusters might be difficult in regard to IP space
19:59:11 <clarkb> good point particularly since that was a pain point previously. We probably do need to reclaim the old cloud or at least its IPs first
19:59:55 <fungi> yeah, they were reluctant to give us much ipv4 addressing, and still (last i heard) don't have ipv6 working
20:00:31 <clarkb> and that is our hour. Thank you for your time today and for all the help keeping these systems running
20:00:41 <clarkb> I'll take the todo to write the email to yuriys and see what they say
20:00:41 <frickler> I should point them to my IPv6 guide then ;)
20:00:47 <clarkb> #endmeeting