19:01:12 #startmeeting infra
19:01:12 Meeting started Tue Sep 12 19:01:12 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 The meeting name has been set to 'infra'
19:01:40 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V23MFBBCANVK4YAK2IYOV5XNLFY64U3X/ Our Agenda
19:01:49 #topic Announcements
19:02:16 I will be out tomorrow
19:02:17 hello to you too!
19:02:49 I'll be off doing a little bit of fishing for hopefully not-little fish, so I won't really be around keyboards
19:02:54 but back Thursday for the mailman fun
19:03:09 #topic Mailman 3
19:03:19 may as well jump straight into that
19:03:33 nothing new here really. planning to migrate airship and kata on Thursday
19:03:36 fungi: The plan is to migrate lists.katacontainers.io and lists.airshipit.org to mailman 3 on September 14
19:03:47 fungi: and you are starting around 1500 iirc?
19:03:58 yep. will do a preliminary data sync tomorrow to prepare, so the sync during downtime will be shorter
19:04:30 i notified the airship-discuss and kata-dev mailing lists since those are the most active ones for their respective domains
19:05:17 fungi: anything else we can do to help prepare?
19:05:26 if thursday's maintenance goes well, i'll send similar notifications to the foundation and starlingx-discuss lists immediately thereafter about a similar migration for the lists.openinfra.dev and lists.starlingx.io sites thursday of next week
19:05:45 i don't think we've got anything else that needs doing for mailman at the moment
19:06:01 excellent. Thank you for getting this together
19:06:25 we did find out that the current lists.katacontainers.io server's ipv4 address was on the spamhaus pbl
19:06:36 i put in a removal request for that in the meantime
19:06:55 the ipv6 address was listed too, but we can't remove it because the entire /64 is listed and we only have a /128 on the host
19:07:03 that should improve after we migrate hosts anyway
19:07:04 their CSS blocklist has the ipv6 /64 for that server listed, not much we can do about that, yeah
19:07:25 the "problem" traffic from that /64 is coming from another rackspace customer
19:07:39 new server is in a different /64, so not affected at the moment
19:07:59 anything else mail(man) related?
19:08:57 not from my end
19:08:59 there was a spam mail on service-discuss earlier, just wondering whether that might be mailman3 related
19:09:06 oh right, that is worth calling out
19:09:26 frickler: I think it is mailman3 related in that fungi suspects the web interface maybe makes it easier for people to do that?
19:09:40 oh, did one get through? i didn't notice, but can delete it from the archive and block that sender
19:09:42 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/message/JLGRB7TNXJK2W3ELRXMOTAK3NH5TNYI3/
19:09:55 fungi: is that process documented for mm3 yet? might be good to do so if not
19:10:02 thanks, i'll take care of that right after the meeting
19:10:51 #topic Infra Root Google Account
19:11:04 and yes, the supposition is that making it possible for people to post to mailing lists without a mail client (via their web browser) also makes it easier for spammers to do the same. the recommendation from the mm3 maintainers is to adjust the default list moderation policy so that every new subscriber is moderated by default until a moderator adds them to a whitelist
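For reference, a rough sketch of what that recommendation could look like when applied through the Mailman 3 REST API with the mailmanclient library; the endpoint URL, credentials, and the example subscriber address below are placeholders, not our actual configuration:

```python
from mailmanclient import Client

# Connect to Mailman's REST API (URL and credentials here are placeholders).
client = Client('http://localhost:8001/3.1', 'restadmin', 'restpass')

mlist = client.get_list('service-discuss@lists.opendev.org')
settings = mlist.settings

# Hold posts from new subscribers by default until a moderator reviews them.
settings['default_member_action'] = 'hold'
# Treat non-member posts the same way rather than accepting them outright.
settings['default_nonmember_action'] = 'hold'
settings.save()

# A moderator can later exempt a trusted subscriber from moderation
# (the address is hypothetical).
member = mlist.get_member('trusted@example.org')
member.moderation_action = 'accept'
member.save()
```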
19:11:23 I did some investigating and learned things. tl;dr is that this account is used by zuul to talk to the Gerrit project's gerrit
19:11:43 using the resulting gerrit token doesn't appear to count as something google can track as account activity (probably a good thing imo)
19:12:21 I logged into the account, which I think is sufficient to reset the delete clock on the account. However, after reading elsewhere, some people suggest doing some action on top of logging in, like making a google search or watching a youtube video. I'll plan to log back in and do some searches then
19:12:39 unfortunately, this is clear as mud when it comes to what actually counts for not getting deleted
19:13:36 worst case we get the gerrit admins to authorize a new account when that one gets deactivated by google
19:13:58 fungi: yup, but that will require a new email address. I'm not sure if google will allow +suffixes
19:14:01 (my guess is not)
19:14:26 but then we could use snarky email addresses :)
19:15:28 (but still, who needs the trouble; thanks for logging in and searching for "timothee chalamet" or whatever)
19:15:51 I was going to search for zuul
19:15:53 make it super meta
19:16:05 anyway we should keep this account alive if we can, and that was all i had
19:16:10 ooh good one
19:16:13 #topic Server Upgrades
19:16:21 No updates from me on this and I haven't seen any other movement on it
19:16:31 #topic Nodepool image upload changes
19:16:53 We have removed fedora 35 and 36 images from nodepool which reduces the total number of image builds that need uploading
19:17:09 on top of that we reduced the build and upload rates. Has anyone dug in to see how that is going since we made the change?
19:17:47 glancing at https://grafana.opendev.org/d/f3089338b3/nodepool3a-dib-status?orgId=1 the builds themselves seem fine, but I'm not sure if we're keeping up with uploads, particularly to rax iad
19:18:00 we also added the ability to configure the upload timeout in nodepool; have we configured that in opendev yet?
19:18:27 I don't think so, I missed that change
19:18:27 corvus: oh yes, I put that in the agenda email even. I don't think we have configured it but we probably should
19:18:46 it used to be hard-coded to 6 hours before nodepool switched to sdk
19:18:53 i think that would be a fine value to start with :)
19:19:27 wfm
19:19:33 yeah, i found in git history spelunking that we lost the 6-hour value when shade was split from nodepool
19:19:44 so it's... been a while
19:19:46 maybe we set that value then check back in next week to see how rax iad is doing afterwards
19:19:50 #link https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].image-upload-timeout
19:20:22 one thing I notice about the 7d rebuild cycle is that it doesn't seem to work well with two images being kept
19:20:36 frickler: because we delete the old image after one day?
19:20:56 yes, and then upload another
19:21:19 so the two images are just one day apart, not another 7
19:21:36 but maybe that's as designed, kind of
19:22:20 at least that's my interpretation for why we have images 6d and 7d old
19:22:39 oh hrm
19:22:46 I guess that still cuts down on total uploads, but in a weird way
19:22:50 I think we can live with that for now
19:22:51 makes sense, maybe we need to set a similarly long image expiration (can that also be set per image?)
19:23:29 or maybe it is only because we started one week ago
19:23:45 oh, possible
19:24:16 so the older of the two was built on the daily rather than weekly cycle
19:25:18 ok, any volunteers to push a change updating the timeout for uploads in each of the provider configs?
19:25:33 I think that goes in the two builder config files, as the launcher nl0X files don't care about upload timeouts
19:25:45 for bookworm we have 2d and 4d, that looks better
19:25:59 I can do the timeout patch tomorrow
19:26:05 frickler: thanks!
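For reference, a minimal sketch of where that timeout would land in the builder configs, restoring the old 6-hour value (in seconds), alongside the weekly rebuild setting discussed above; the provider, cloud, and image names here are illustrative, not copied from our real nodepool.yaml:

```yaml
providers:
  - name: rax-iad
    cloud: rax
    region-name: IAD
    # Allow uploads to take up to 6 hours, matching the pre-sdk hard-coded value.
    image-upload-timeout: 21600
    diskimages:
      - name: ubuntu-jammy

diskimages:
  - name: ubuntu-jammy
    # Rebuild weekly instead of daily to cut down on uploads.
    rebuild-age: 604800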
19:26:44 #topic Zuul PCRE deprecation
19:27:18 we've merged most of the larger changes impacting our tenants, right? just a bunch of stragglers now?
19:27:18 wanted to check in on this as I haven't been able to follow the work that has happened super closely. I know changes to fix the regexes are landing and I haven't heard of any problems. But is there anything to be concerned about?
19:27:31 i haven't drafted an email yet for this, but will soon
19:27:49 i haven't heard of any issues
19:27:55 nor have i
19:28:11 at least not after the fixes for config changes on projects with warnings
19:28:30 this list certainly is something to be concerned about https://zuul.opendev.org/t/openstack/config-errors?severity=warning
19:29:54 yes, all the individual projects with one or two regexes in their configs (especially on stable branches) are going to be the long tail
19:30:51 i think we just let them know when the removal is coming and either they fix it or we start removing projects from the tenant config after they've been broken for too long
19:31:10 or we just keep backwards compatibility for once?
19:31:31 we can't keep backwards compat once this is removed from zuul
19:31:49 well, that's on zuul then. or opendev can stop running latest zuul
19:32:17 and the point of this migration is to prevent dos attacks against sites like opendev
19:32:22 the motivation behind the change is actually one that is useful to opendev though, ya, that
19:32:31 so it would be counterproductive for opendev to do that
19:32:56 is that a serious consideration? is this group interested in no longer running the latest zuul? because of this?
19:33:01 how many DoS attacks have we seen vs. how much trouble has the queue change brought?
19:33:34 i have to admit i'm a little surprised to hear the suggestion
19:33:38 frickler: I think the queue changes expose underlying problems in various projects more than they are the actual problem
19:33:55 it isn't our fault that many pieces of software are largely unmaintained, and in those cases I don't think not running CI is a huge deal
19:34:06 when the projects become maintained again they can update their configs and move forward
19:34:24 either to fix existing problems in the older CI setups for the projects or by deleting what was there and starting over
19:34:56 i guess it's a question of whether projects want a ci system that never breaks backwards compatibility, or one that fixes bugs and adds new features
19:34:59 I agree that this exposes a problem, I just don't see the zuul updates as the problem. They indicate deeper problems
19:35:04 indeed, zuul's configuration error handling was specifically designed for this set of requirements from opendev -- that if there are problems in "leaf node" projects, it doesn't take out the whole system. so it's also surprising to hear that is not helpful.
19:36:23 also though, zuul relies on other software (notably python and ansible) which have relatively aggressive deprecation and eol schedules, so not breaking backward compatibility would quickly mean running with eol versions of those integral dependencies
19:36:26 my personal take on it is that openstack should do like opendev and aggressively prune what isn't sustainable. But I know that isn't the current direction of the project
19:37:47 even with jenkins we had upgrades which broke some job configurations and required changes for projects to continue running things in it
19:37:50 i thought opendev's policy of making sure people know about upcoming changes, telling them how to fix problems, and otherwise not being concerned if individual projects aren't interested in maintaining their use of the system is reasonable. and in those cases, just letting the projects "error out" and then pruning them. i don't feel like that warrants a change in the ci system.
19:38:24 corvus: any idea what zuul's deprecation period looks like for this change? I suspect this one might take some time due to the likelihood of impacting existing installs? More so than the ansible switch anyway
19:39:24 i don't think it's been established, but my own feeling is the same, and i would be surprised if anyone had an appetite for it in less than, say, 6 months?
19:39:35 yeah, ansible switches are short, and not by our choice.
19:39:44 these other things can take as long as we want.
19:39:50 (we = zuul community)
19:40:22 ya, so this is going to be less urgent anyway, and the bulk of that list is currently tempest (something that should be fixable) and tripleo-ci (something that is going away anyway)
19:40:55 I feel like this is doable with a reasonable outcome in opendev
19:40:58 so far, the regexes we've looked at are super easy to fix too
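For reference, a typical before/after for these fixes: a lookahead-style branch filter rewritten into the explicit negated form that Zuul's re2-based matcher accepts. The job and branch names are just examples, not taken from a specific project:

```yaml
# Before: PCRE negative lookahead, now deprecated
- job:
    name: example-job
    branches: ^(?!stable/).*$

# After: the same intent expressed with an explicit negated regex
- job:
    name: example-job
    branches:
      - regex: ^stable/
        negate: true
```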
19:41:10 the named queue change was a year+ deprecation period, fwiw. but this time we have active reporting of deprecations which we didn't have back then
19:42:17 the errors page and filtering is really helpful indeed
19:42:36 unfortunately, that "long tail" of inactive projects who don't get around to fixing things until after they break (if ever) will be the same no matter how long of a deprecation period we give them
19:43:03 to me, it seems like a reasonable path forward to send out an email announcing it along with suggestions on how to fix (that's my TODO), then if there's some kind of hardship in the future when zuul removes the backwards compat, bring that up. maybe zuul delays or mitigates it. or maybe opendev drops some more idle projects/branches.
19:44:02 anyway, 6 months is more than I expected, that gives some room for things to happen
19:44:44 oh yeah, when we have a choice, zuul tends to have very long deprecation periods
19:44:47 ok, let's check back in after the email notice is sent and people have had more time to address things
19:46:12 #topic Python Container Updates
19:46:52 The only current open change is for Gerrit. I decided to hold off on doing that with the zuul restart because we had weekend things, and simplifying to just zuul streamlined things a bit
19:47:17 I'm thinking maybe this Friday we land the change and do a short gerrit outage. Fridays tend to be quiet.
19:47:47 Nodepool and Zuul both seem happy running on bookworm though, which is a good sign since they are pretty non-trivial setups (nodepool builders in particular)
19:48:06 i'm going to be mostly afk on friday, but don't let that stop you
19:49:05 ack
19:49:22 I suspect we've got a few more services that could be converted too, but I've got to look more closely at codesearch to find them
19:49:34 probably worth getting gerrit out of the way then starting with a new batch
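For context, the bookworm conversions for the python-based services are mostly base-image bumps along these lines; this is a hypothetical sketch of the pattern used with the opendevorg builder/base images, not the actual Gerrit change, and the image names and tags are assumptions:

```dockerfile
# Hypothetical bump from bullseye- to bookworm-based images.
FROM docker.io/opendevorg/python-builder:3.11-bookworm AS builder

COPY . /tmp/src
RUN assemble

FROM docker.io/opendevorg/python-base:3.11-bookworm
COPY --from=builder /output/ /output
RUN /output/install-from-bindep
```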
19:49:40 #topic Open Discussion
19:49:47 We upgraded Gitea to 1.20.4
19:50:09 We removed fedora images as previously noted
19:50:20 Neither has produced any issues that I've seen
19:50:24 Anything else?
19:51:02 i didn't have anything
19:52:12 maybe the github stuff?
19:52:25 maybe mention the inmotion rebuild idea
19:52:47 for github we can work around rate limits (partially) by adding a user token to our zuul config
19:52:55 a couple of years ago, we noted that the openstack tenant cannot complete a reconfiguration in zuul in less than an hour because of github api limits
19:53:04 but for that to work properly we need zuul to fall back from the app token to the user token to anonymous
19:53:16 that problem still persists, and is up to >2 hours now
19:53:29 there is a change open for that https://review.opendev.org/c/zuul/zuul/+/794688 that needs updates and I've volunteered to revive it
19:53:30 do we know if something changed on the github side?
19:53:32 (and is what caused the full zuul restart to take a while)
19:53:39 or just more load from zuul?
19:53:45 frickler: I think the changes are in the repos growing more branches, which leads to more api requests
19:54:02 and then instead of hitting the limit once we now hit it multiple times, and each time is an hour delay
19:54:12 yeah, more repos + more branches since then, but already 2 years ago it was enough to be a problem
19:54:32 this is partially why I want to clean up those projects which can be cleaned up and/or are particularly bad about this
19:54:49 https://review.opendev.org/c/openstack/project-config/+/894814 and child should be safe to merge for that
19:54:51 likely also we didn't do a real restart for a long time, so didn't notice that growth?
19:54:57 but the better fix is making zuul smarter about its github requests
19:55:08 frickler: correct, we haven't done a full restart without cached data in a long time
19:55:55 for Inmotion, the cloud we've got is older and was one of the first cloud-as-a-service deployments they did. Since then they have reportedly improved their tooling and systems and can deploy newer openstack
19:56:00 regarding inmotion, we could update the cluster to a recent openstack version and thus avoid having to fix stuck image deletions. and hopefully have fewer issues with stuck things going forward
19:56:06 this does affect us while not restarting; tenant reconfigurations can take a long time. it's just that zuul is really good at masking that now.
19:56:18 corvus: ah fun
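For reference, the user-token idea is roughly this shape in zuul.conf; the IDs, paths, and token values are placeholders, and the fallback from app auth to the user token to anonymous still depends on reviving the zuul change linked above:

```ini
[connection github]
driver=github
# Existing GitHub app credentials (placeholder values)
app_id=12345
app_key=/etc/zuul/github_app_key.pem
webhook_token=examplewebhooktoken
# Additional user token the driver could fall back to for API reads
api_token=ghp_exampletoken
```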
19:56:30 frickler: yup, there are placement issues in particular that melwitt reports would be fixed if we upgraded
19:56:54 I can write an email to yuriys indicating we'd like to do that after the openstack release happens and get some guidance from their end before we start
19:57:26 but I suspect the rough process to be: make notes on how things are set up today (number of nodes and roles of nodes, networking setup for neutron, size of nodes, etc), then replicate that using newer openstack in a new cluster
19:57:45 getting feedback on what versions they support would also be interesting
19:57:48 one open question is if we have to delete the existing cluster first or if we can have two clusters side by side for a short time. yuriys should be able to give feedback on that
19:58:14 like switch to rocky as base os and have 2023.1? or maybe even 2023.2 by then?
19:58:36 I think they said they were doing stream
19:58:42 but ya, we can clarify all that too
19:58:53 I think two parallel clusters might be difficult in regard to IP space
19:59:11 good point, particularly since that was a pain point previously. We probably do need to reclaim the old cloud or at least its IPs first
19:59:55 yeah, they were reluctant to give us much ipv4 addressing, and still (last i heard) don't have ipv6 working
20:00:31 and that is our hour. Thank you for your time today and for all the help keeping these systems running
20:00:41 I'll take the todo to write the email to yuriys and see what they say
20:00:41 I should point them to my IPv6 guide then ;)
20:00:47 #endmeeting