19:01:11 #startmeeting infra
19:01:11 Meeting started Tue Aug 22 19:01:11 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 The meeting name has been set to 'infra'
19:01:18 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VRBBT25TOTXJG3L5SXKWV3EELG34UC5E/ Our Agenda
19:01:42 We've actually got a fairly full agenda so I may move quicker than I'd like. But we can always go back to discussing items at the end of our hour if we have time
19:01:50 #topic Announcements
19:02:22 The service coordinator nomination period ends today. I haven't seen anyone nominate themselves yet. Does this mean everyone is happy with and prefers me to keep doing it?
19:02:36 i'm happy to be not-it ;)
19:02:49 (also you're doing a great job!)
19:02:59 i can confirm 100% that is the correct interpretation
19:03:31 ok I guess I can make it official with an email later today before today ends to avoid any needless process confusion
19:04:03 anything else to announce?
19:05:01 #topic Infra root google account
19:05:17 Just a quick note that I haven't tried to log in yet so I have no news yet
19:05:41 but it is on my todo list and hopefully I can get to it soon. This week should be a bit less crazy for me than last week (half the visiting family is no longer here)
19:05:49 #topic Mailman 3
19:06:11 fungi: we made some changes and got things to a good stable stopping point I think. What is next? mailman 3 upgrade?
19:06:32 we merged the remaining fixes last week and the correct site names are showing up on archive pages now
19:07:21 i've got a held node built with https://review.opendev.org/869210 and have just finished syncing a copy of all our production mm2 lists to it to run a test import
19:07:33 #link https://review.opendev.org/869210 Upgrade to latest Mailman 3 releases
19:08:06 oh right we wanted to make sure that upgrading wouldn't put us in a worse position for the 2 -> 3 migration
19:08:07 i'll step through the migration steps in https://etherpad.opendev.org/p/mm3migration to make sure they're still working correctly
19:08:17 #link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan
19:08:41 at which point we can merge the upgrade change to swap out the containers and start scheduling new domain migrations
19:09:22 that's where we're at currently
19:09:49 sounds good. Let us know when you feel ready for us to review and approve the upgrade change
19:09:58 #topic Gerrit Updates
19:09:59 will do, thanks!
19:10:21 I did email the gerrit list last week about the too many query terms problem frickler has run into with starred changes
19:10:49 they seem to acknowledge that this is less than ideal (one idea was to log/report the query in its entirety so that you could use that information to find the starred changes)
19:11:11 but no suggestions for a workaround other than what we already know (bump index.maxTerms)
19:11:28 no one said don't do that either, so I think we can proceed with frickler's change and then monitor the resulting performance situation
19:11:35 #link https://review.opendev.org/c/opendev/system-config/+/892057 Bump index.maxTerms to address starred changes limit.
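
For reference, index.maxTerms lives in Gerrit's gerrit.config and caps how many leaf terms a single query may expand to; a starred-changes query expands to roughly one term per starred change, which is how the limit was hit here. A minimal sketch of the kind of bump under review, using an illustrative value rather than whatever 892057 actually proposes:

    [index]
        # Maximum number of leaf terms a query may expand to.
        # Gerrit's shipped default is 1024; the value below is illustrative.
        maxTerms = 10000

Raising this trades some query cost for headroom, which is why the plan below is to restart with the new setting at a time when the effect can be confirmed quickly and reverted if needed.
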
19:12:30 ideally we can approve that and restart gerrit with the new config during a timeframe where frickler is able to confirm quickly that it has fixed the issue (so that we can revert and/or debug further if necessary)
19:13:13 frickler: I know you couldn't attend the meeting today, but maybe we can sync up later this week on a good time to do that restart with you
19:13:22 #topic Server Upgrades
19:13:43 no news here. Mostly a casualty of my traveling and having family around. I should have more time for this in the near future
19:14:02 (I don't like replacing servers when I feel I'm not in a position to revert or debug unexpected problems)
19:14:49 #topic Rax IAD image upload struggles
19:15:13 nodepool image uploads to rax iad are timing out
19:15:32 this problem seems to get worse the more uploads you perform at the same time
19:15:59 The other two rackspace regions do not have this problem (or if they share the underlying mechanism it doesn't manifest as badly so we don't really care/notice)
19:16:07 specifically, the bulk of the time/increase seems to occur on the backend in glance after the upload completes
19:16:31 fungi and frickler have been doing the bulk of the debugging (thank you)
19:16:48 I think our end goal is to collect enough info that we can file a ticket with rackspace to see if this is something they can fix up
19:16:48 when we were trying to upload all our images, i clocked around 5 hours from the end of an upload to when it would first start to appear in the image list
19:17:47 when we're not uploading any images, a single test upload is followed by around 30 minutes of time before it appears in the image list
19:18:50 and when you multiply the number of images we have by 30 minutes you end up with something suspiciously close to 5 hours
19:18:55 and yes, what we're mostly lacking right now is someone with the time to file a ticket and try to explain our observations in a way that doesn't come across as us having unrealistic expectations
19:19:25 in the meantime, are we increasing the upload timeout to accommodate 5h?
19:19:37 I think a key part of doing that is showing that dfw and ord don't suffer from this, which would indicate an actual problem with the region
19:19:44 "we're trying to upload 15 images in parallel, around 20gb each, and your service can't keep up" is likely to result in us being told "please stop that"
19:20:12 fungi: note it should eventually balance out to an image an hour or so due to rebuild timelines
19:20:22 but due to the long processing times they all pile up instead
19:20:23 corvus: frickler piecemeal unpaused some images manually to get fresher uploads for our more frequently used images to complete
19:21:22 he did test by overriding the openstacksdk default wait timeout to something like 10 hours, just to confirm it did work around the issue
19:21:44 that isn't configurable in nodepool today? Or maybe we could do it with clouds.yaml?
19:21:44 oh this is an sdk-level timeout? not the nodepool one?
19:21:48 ya
19:21:49 yes
19:22:05 nodepool doesn't currently expose a config option for passing the sdk timeout parameter
19:22:19 we could add one, but also this seems like a pathological condition
19:23:24 still, i think that would be an ok change.
19:23:43 yeah, i agree, i just don't think we're likely to want to actually set that long term
19:23:58 in general nodepool has a bunch of user-configurable timeouts for stuff like that because we know, especially with clouds, things are going to be site dependent.
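
For reference, the manual test described above amounts to calling openstacksdk's image upload with a much longer wait timeout than the default. A minimal sketch of that reproduction, assuming a clouds.yaml entry named rax-iad; the names, formats, and the 10 hour value are illustrative placeholders, not the exact commands that were run:

    import openstack

    # Connect using a clouds.yaml entry; "rax-iad" is an illustrative name.
    conn = openstack.connect(cloud='rax-iad')

    # create_image() uploads the file and, with wait=True, polls glance until
    # the image goes active. Raising the timeout to 10 hours mirrors the manual
    # test above; it only masks the slow backend processing, it doesn't fix it.
    image = conn.create_image(
        'iad-upload-test',          # illustrative image name
        filename='test-image.vhd',  # illustrative local file
        disk_format='vhd',          # rackspace expects vhd images
        container_format='bare',
        wait=True,
        timeout=10 * 60 * 60,
    )
    print(image.status)

Since nodepool doesn't currently expose this sdk timeout as a builder config option, the sketch only covers the manual reproduction path; exposing it in nodepool would be the separate change discussed above.
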
19:24:20 yeah, it's not an ideal solution, but, i think, an acceptable one. :)
19:24:26 other things to note. We had leaked images in all three regions. fungi cleared those out manually. I'm not sure why the leak detection and cleanup in nodepool didn't find them and take care of it for us.
19:24:51 and we had instances that we could not delete in all three regions that rackspace cleaned up for us after a ticket was submitted
19:25:09 i cleared out around 1200 leaked images in iad (mostly due to the upload timeout problem i think, based on their ages). the other two regions have around 400 leaked images but have not been cleaned up yet
19:25:28 maybe the leaked images had a different schema or something. if it happens again, ping me and i can try to see why.
19:25:40 thanks!
19:25:59 corvus: sure, we can dig deeper. it seemed outwardly that nodepool decided the images never got created
19:26:26 because the sdk returned an error when it gave up waiting for them
19:26:28 ah, could be missing the metadata entirely then too
19:26:55 yes, it could be missing followup steps the sdk would otherwise have performed
19:27:19 if the metadata is something that gets set post-upload
19:27:32 (from the cloud api side i mean)
19:28:05 anything else on this topic?
19:28:18 not from me
19:28:22 #topic Fedora Cleanup
19:29:03 I pushed two changes earlier today to start on this. Bindep is the only fedora-latest user in codesearch that is also in the zuul tenant config.
19:29:05 #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Cleanup fedora-latest in bindep
19:29:23 er that's the next change, one second while I copy paste the bindep one properly
19:29:36 #undo
19:29:36 Removing item from minutes: #link https://review.opendev.org/c/opendev/base-jobs/+/892380
19:29:49 #link https://review.opendev.org/c/opendev/bindep/+/892378 Cleanup fedora-latest in bindep
19:30:09 This should be a very safe change. The next one, which removes the fedora-latest nodeset, is less safe because older branches may use it etc
19:30:20 #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove fedora-latest nodeset
19:30:51 I'm personally inclined to land both of them and we can revert 892380 if something unexpected happens. but that nodeset doesn't work anyway so those jobs should already be broken
19:31:08 We can then look at cleaning things up from nodepool and the mirrors etc
19:31:53 let me know if you disagree or find extra bits that need cleanup first
19:32:02 #topic Gitea 1.20 Upgrade
19:32:12 this cleanup also led to us rushing through an untested change for the zuul/zuul-jobs repo, we do need to remember that it has stakeholders beyond our deployment
19:32:41 Gitea has published a 1.20.3 release already. I think my impression that this is a big release with not a lot of new features is backed up by the amount of fixing they have had to do
19:33:21 But I think I've managed to work through all the documented breaking changes (and one undocumented breaking change)
19:33:39 there is a held node that seems to work here: https://158.69.78.38:3081/opendev/system-config
19:33:59 The main thing at this point would be to go over the change itself and that held node to make sure you are happy with the changes I had to make
19:34:18 and if so I can add the new necessary but otherwise completely ignored secret data to our prod secrets and merge the change when we can monitor it
19:34:38 #link https://review.opendev.org/c/opendev/system-config/+/886993 Gitea 1.20 change
19:34:41 there is the change
19:35:05 looks like fungi did +2 it. I should probably figure out those secrets then
19:35:08 thanks for the review
19:35:11 it looked fine to me, but i won't be around today to troubleshoot if something goes sideways with the upgrade
19:35:27 we can approve tomorrow. I'm not in a huge rush other than simply wanting it off my todo list
19:35:46 #topic Zuul changes and updates
19:36:01 There are a number of user facing changes that have been made or will be made very soon in zuul
19:36:18 I want to make sure we're all aware of them and have some sort of plan for working through them
19:36:47 First up, Zuul has added Ansible 8 support. In the not too distant future Zuul will drop Ansible 6 support, which is what we default to today
19:37:12 in the past what we've done is ask our users to test the new ansible against their jobs if they are worried, and set a hard cutover date via tenant config in the future
19:37:14 and between those 2 events, zuul will switch the default from 6 to 8
19:37:19 also it's skipping ansible 7 support, right?
19:37:31 yeah we're too slow, missed it.
19:37:59 OpenStack Bobcat releases ~Oct 6
19:38:26 I'm thinking we switch all tenants to ansible 8 by default the week after that
19:38:43 though we should probably go ahead and switch the opendev tenant nowish. So on that date we'd switch all tenants that haven't already switched
19:38:47 oh that sounds extremely generous. i think we can and should look into switching much earlier
19:39:14 corvus: my concern with that is openstack will have a revolt if anything breaks, given their CI is already super flaky and the release is coming up
19:39:31 what if we switch opendev, wait a while, switch everyone but openstack, wait a while, and then switch openstack...
19:39:38 wfm
19:39:43 that is probably fine
19:40:14 with "a while" = 6 weeks that sounds fine
19:40:24 i think if we run into any problems, we can throw the brakes easily, but given that zuul (including zuul-jobs) switched with no changes, this might go smoothly...
19:40:29 maybe try to switch opendev this week, everyone else less openstack the week after if opendev is happy. Then do openstack after the bobcat release?
19:40:41 sounds good
19:40:48 that way we can get an email out and give people at least some lead time
19:40:56 well that's not really what i'm suggesting
19:41:18 we could do a week between each and have it wrapped up by mid september
19:41:31 hrm I think the main risk is that specific jobs break
19:41:44 and openstack is going to be angry about that given how unhappy openstack ci has been lately
19:41:44 or we could do opendev now, and then everyone else 1 week after that and wrap it up 2 weeks from now
19:42:03 where is this urgency coming from?
19:42:23 we actually should be doing this much more quickly and much more frequently
19:42:26 being able to merge the change in zuul that drops ansible 6, and being able to continue upgrading the opendev deployment when that happens
19:42:34 I think last time we went quickly with the idea that specific jobs could force ansible 6, but then other zuul config errors complicated that more than we anticipated
19:42:39 we need to change ansible versions every 6 months to keep up with what's supported upstream
19:42:50 so i would like us to acclimate to this being somewhat frequent and not a big deal
19:42:57 and to be fair we've been pushing openstack to address those errors but largely only frickler has put any effort into it :/
19:43:37 though as of an hour ago it seems like we've got buy-in from the openstack tc to start deleting branches in repos with unfixed zuul config errors
19:43:46 what's wrong with running unsupported ansible versions? people are still running python2
19:44:17 frickler: at least one issue is the size of the installation for ansible. Every version you support creates massive bloat in your zuul executors
19:44:19 it's not our intention to use out-of-support ansible
19:44:48 I don't think that that is compatible with the state openstack is in
19:44:57 how so?
19:45:07 do we know that openstack runs jobs that don't work with ansible 8?
19:45:14 we certainly can't encourage it, given zuul is designed to run untrusted payloads and ansible upstream won't be fixing security vulnerabilities in those versions
19:45:14 there is no developer capacity to adapt
19:45:30 no we do not know that yet. JayF thought that some ironic jobs might use openstack collections though, which are apparently not backward compatible in 8
19:45:39 if it all works fine, fine, but if not, adapting will take a long time
19:45:50 zuul's executor can't use collections...
19:45:52 i don't see how ironic jobs would be using collections with the executor's ansible
19:45:59 ah
19:46:20 if they're using collections that would be in a nested ansible, so not affected by this
19:46:24 maybe a compromise would be to try it soonish and if things break in ways that aren't reasonable to address then we can revert to 6? But it sounds like we expect 8 to work so go for it until we have evidence to the contrary?
19:46:46 yeah, i am fully on-board with throwing the emergency brake lever if we find problems
19:46:49 I'm not sure if installing the big pypi ansible package gets you the openstack stuff
19:47:01 it does afaict
19:47:18 i don't think we should assume that everything will break. :)
19:47:47 the problem is finding out what breaks, how will you do that?
19:47:55 ok proposal: switch opendev nowish. If that looks happy, plan to switch everyone else less openstack early next week. If that looks good, switch openstack late next week or early the week after
19:47:56 waiting for people to complain will not work
19:48:19 frickler: I mean if a tree falls in a forest and no one hears...
19:48:29 I understand your concern but if no one is paying attention then we aren't going to solve that either way
19:48:41 we can however react to those who are paying attention
19:48:50 and I think that is the best we can do whether we wait a long time or a short time
19:48:56 that works for me; if we want to give openstack more buffer around the release, then switching earlier may help. either sounds good to me.
19:50:17 doesn't look like any of the TC took fungi's invitation to discuss it here so we can't get their input
19:51:19 well, can't get their input in this meeting anyway
19:51:29 corvus: just pushed a change to switch opendev
19:51:55 let's land that asap and then if we don't have anything pushing the brakes by tomorrow I can send email to service-announce with the plan I sketched out above
19:52:11 we are running out of time though and I wanted to get to a few more items
19:52:18 #link https://review.opendev.org/c/openstack/project-config/+/892405 Switch opendev tenant to Ansible 8
19:52:25 switching the ansible version does work speculatively, right?
19:52:31 frickler: it does at a job level yes
19:52:41 so we can ask people to test things before we switch them too
19:52:58 Zuul is also planning to uncombine stdout and stderr in command/shell like tasks
19:53:29 this one is riskier.
19:53:34 I think this may be more likely to cause problems than the ansible 8 upgrade since it isn't always clear things are going to stderr, particularly when it has just worked historically
19:53:59 we probably need to test this more methodically from a base jobs/base roles standpoint and then move up the job inheritance ladder
19:54:08 i think the best we can do there is speculatively test it with zuul-jobs first to see if anything explodes...
19:54:17 then what clarkb said
19:54:39 we might want to make new base jobs in opendev so we can flip tenants one at a time...?
19:54:47 (because we can change the per-tenant default base job...)
19:55:04 oh that is an interesting idea I hadn't considered
19:55:37 i would be very happy to let this one bake for quite a while in zuul....
19:55:46 ok so this is less urgent. That is good to know
19:56:04 we can probably punt on decisions for it now. But keep it in mind and let us know if you have any good ideas for testing it ahead of time
19:56:23 like, long enough for lots of community members to upgrade their zuuls and try it out, etc.
19:56:29 And finally, early failure detection in tasks via regex matching of task output is in the works
19:56:31 there's no ticking clock on this one...
19:56:37 ack
19:56:44 the regex change is being gated as we speak
19:57:00 the failure stuff is more "this is a cool feature you might want to take advantage of"; it won't affect existing jobs without intervention
19:57:28 we'll try it out in zuul and maybe come up with some patterns that can be copied
19:58:19 #topic Base container image updates
19:58:36 really quickly before we run out of time. We are now in a good spot to convert the consumers of the base container images to bookworm
19:58:59 The only one I really expect to maybe have problems is gerrit due to the java stuff, and it may be a non issue there since containers seem to have fewer issues with this
19:59:00 i think zuul is ready for that
19:59:43 help appreciated updating any of our images :)
19:59:48 #topic Open Discussion
19:59:54 Anything important last minute before we call it a meeting?
20:00:46 nothing here
20:00:58 Thank you everyone for your time. Feel free to continue discussion in our other venues
20:01:02 #endmeeting