19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Sep 5 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:24 It feels like yesterday was a holiday. So many things this morning
19:01:52 yesterday felt less like a holiday than it could have
19:02:01 o/
19:02:05 but them's the breaks
19:02:48 I have not been aware of such holidays but hope that many people had great holidays :)
19:02:51 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/F5B2EF7BWK62UQZCTHVGEKER4XFRDSIE/ Our Agenda
19:03:06 ianychoi: I did! it was the last day my parents were here visiting and we made smoked bbq meats
19:03:14 #topic Announcements
19:03:15 I have nothing
19:03:43 Wow great :)
19:04:02 (I will bring translation topics during open discussion time.. thanks!)
19:05:41 #topic Infra Root Google Account Activity
19:05:45 I have nothing to report here
19:06:08 I've still got it on the todo list
19:06:10 hopefully soon
19:06:15 #topic Mailman 3
19:06:43 as mentioned last week fungi thinks we're ready to start migrating additional domains and I agree. That means we need to schedule a time to migrate lists.kata-containers.io and lists.airshipit.org
19:07:36 yeah, so i've decided which are the most active lists on each site to notify
19:07:56 it seems like thursday september 14 may be a good date to do those two imports
19:08:43 if people agree, i can notify the airship-discuss and kata-dev mailing lists with a message similar to the one i sent to zuul-discuss when the lists.zuul-ci.org site was migrated
19:09:13 is there a time of day which would make sense to have more interested parties around to handle comms or whatever might arise?
19:10:18 I have no idea where the main timezones of those communities are located
19:10:43 airship was in US central time iirc
19:10:50 and kata is fairly global
19:11:26 so likely US morning would be best suited to give you some room to handle possible fallout
19:11:42 yeah, i'm more concerned with having interested sysadmins around for the maintenance window. the communities themselves will know to plan for the list archives to be temporarily unavailable and for mail deliveries to be deferred
19:12:28 i'm happy to run the migration commands (though they're also documented in the planning pad and scripted in system-config too)
19:12:35 I'm happy to help anytime after about 15:30 UTC
19:12:39 but having more people around in general at those times helps
19:12:50 and thursday the 14th should work for me anytime after then
19:13:02 the migration process will probably take around an hour start to finish, and that includes dns propagation
19:14:22 i'll revisit what we did for the first migrations, but basically we leveraged dns as a means of making incoming messages temporarily undeliverable until the migration was done, and then updated dns to point to the new server
19:15:03 for kata it may be easier since it's on its own server already, the reason we did it via dns for the first migrations is that they shared a server with other sites that weren't moving at the same times
19:16:02 using DNS seemed to work fine though so we can stick to that
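For context on the DNS-based cutover described above, a minimal sketch of the kind of records involved; the addresses and TTL are placeholders, not the actual opendev values. The idea is to temporarily point the list domain at an address with no SMTP listener so sending servers defer and retry rather than bounce, then switch to the new Mailman 3 server once the import finishes.

    ; before the window: lower the TTL so the later switch propagates quickly
    lists.airshipit.org.   300  IN  A  198.51.100.10   ; old server (placeholder)
    ; during the window: point at an address that refuses connections;
    ; remote MTAs get connection failures and queue the mail for retry
    lists.airshipit.org.   300  IN  A  192.0.2.1       ; placeholder "black hole"
    ; after the import: point at the new mailman 3 server
    lists.airshipit.org.   300  IN  A  198.51.100.20   ; new server (placeholder)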
19:16:28 frickler: does 15:30-16:30 utc work well for you?
19:16:48 if so i'll let the airship and kata lists know that's the time we're planning
19:17:07 I'm not sure I'll be around, but time is fine in general
19:17:14 *the time
19:17:16 okay, no worries. thanks!
19:17:31 great see yall then
19:17:36 anything else mailman 3 related?
19:17:37 let's go with 15:30-16:30 utc on thursday 2023-09-14
19:17:47 i'll send announcements some time tomorrow
19:18:00 thanks
19:19:55 #topic Server Upgrades
19:20:00 Nothing new here either
19:20:21 #topic Rax IAD image upload struggles
19:20:26 Lots of progress/news here though
19:20:40 fungi filed a ticket with rax and the response was essentially that iad is expected to behave differently
19:20:52 yes, sad panda
19:21:17 this means we can't rely on the cloud provider to fix it for us. Instead we've reduced the number of upload threads to 1 per builder and increased the image rebuild time intervals
19:21:20 though that prompted us to look at whether we're too aggressively updating our images
19:21:40 the idea here is that we don't actually need new images for all images constantly and can more conservatively update things in a way that the cloud region can hopefully keep up with
19:22:15 what's the timing look like now? how long between update cycles? and how long does a full update cycle take?
19:22:17 yeah, basically update the default nodeset's label daily, current versions of other distros every 2 days, and older versions of distros weekly
19:22:42 but that was only merged 2 days ago, so no effect seen yet
19:22:42 makes sense
19:23:07 noting that the "default nodeset" isn't necessarily always consistent across tenants, but we can come up with a flexible policy there
19:23:34 fungi: we currently define it centrally in opendev/base-jobs though
19:23:52 and this is an experiment in order to hopefully get reasonably current images for jobs while minimizing the load we put on our providers' image services
19:23:55 looking at the upload ids, some images seem to have taken 4 attempts in iad to be successful
19:24:12 maybe we could look at adjusting that so that current versions of all distros are updated daily, once we have more data.
19:24:37 frickler: that also means we probably have new leaked images in iad we should look at the metadata for now to see if we can identify for sure why nodepool doesn't clean them up
19:24:39 I was looking at nodepool stats in grafana and > 50% of the average used nodes were jammy
19:24:50 fungi: yes
19:24:52 corvus: yes, that also seems reasonable
19:25:28 if you find a leaked image, ping me with the details please
19:25:39 corvus: will do, thanks!
19:25:48 i'll try to take a look after dinner
19:26:02 i'll avoid cleaning them up until we have time to go over them
19:26:27 it's probably enough to keep one around if you want to clean up others
19:26:56 the handful we've probably leaked pales in comparison to the 1200 i cleaned up in iad recently
19:27:06 so i'll probably just delay cleanup until we're satisfied
19:29:22 sounds like that is all for images
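As an illustration of the rebuild-interval change discussed in the image upload topic above, a minimal sketch of a nodepool builder configuration. The image names and values are illustrative rather than the exact opendev settings, and it assumes nodepool's per-diskimage rebuild-age option (in seconds) and the builder's --upload-workers flag.

    # nodepool-builder started with a single upload thread, e.g.:
    #   nodepool-builder --upload-workers 1
    diskimages:
      - name: ubuntu-jammy        # default nodeset label: rebuild daily
        rebuild-age: 86400
      - name: debian-bookworm     # current releases of other distros: every 2 days
        rebuild-age: 172800
      - name: ubuntu-focal        # older releases: weekly
        rebuild-age: 604800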
19:29:28 #topic Fedora cleanup
19:29:34 The nodeset removal from base-jobs landed
19:29:47 I've also seen some projects like ironic push changes to clean up their use of fedora
19:30:08 I think the next step is to actually remove the label (and images) from nodepool when we think people have had enough time to prepare
19:30:17 should we send an email announcing a date for that?
19:31:01 well, nothing of that would still work, or was it only devstack that was broken?
19:31:23 i thought devstack dropped fedora over a month ago
19:31:46 dropped the devstack-defined fedora nodesets anyway
19:31:48 but didn't all fedora testing stop working when they pulled their repo content?
19:32:01 frickler: yes, unless jobs got updated to pull from other locations
19:32:05 yes, well anything that tried to install packages anyway
19:32:12 I don't think we need to wait very long as you are correct most things would be very broken
19:32:22 more of a final warning if anyone had this working somehow
19:33:05 I don't think waiting a week or two will help anyone, but it also doesn't hurt us
19:33:14 ya I was thinking about a week
19:33:21 maybe announce removal for Monday?
19:33:23 wfm
19:33:39 ack
19:33:42 that gives me time to write the changes for it too :)
19:33:44 cool
19:34:03 #topic Zuul Ansible 8 Default
19:34:24 All of the OpenDev Zuul tenants are ansible 8 by default now
19:34:32 I haven't heard of or seen anyone needing to pin to ansible 6 either
19:34:55 what's the openstack switcheroo date?
19:35:03 yesterday
19:35:04 This doesn't need to be on the agenda for next week, but I wanted to make note of this and remind people to call out oddities if they see them
19:35:06 it was yesterday
19:35:19 there's only one concern that I mentioned earlier: we might not notice when jobs pass that actually should fail
19:35:27 cool. :) sorry i misread the comment from clark :)
19:35:33 might be because new ansible changed the error handling
19:35:48 frickler: yes, I think that risk exists but it seems to be a low probability
19:35:59 since ansible is fail by default if anything goes wrong generally
19:36:03 frickler: i probably skimmed that comment too quickly earlier, what error handling changed in 8?
19:36:10 or was it hypothetical?
19:36:18 that was purely hypothetical
19:36:39 okay, yes i agree that there are a number of hypothetical concerns with any upgrade
19:36:58 since in theory any aspect of the software can change
19:38:15 yup mostly just be aware there may be behavior changes and if you see them please let zuul and opendev folks know
19:38:16 from a practical standpoint, unless anyone has mentioned specific changes to error handling in ansible 8 i'm not going to lose sleep over the possibility of that sort of regression, but we should of course be mindful of the ever-present possibility
19:38:26 both of us will be interested in any observed differences even if they are minor
19:38:36 (one thing I want to look at if I ever find time is performance)
19:39:11 #topic Zuul PCRE regex support is deprecated
19:39:43 The automatic weekend upgrade of zuul pulled in changes to deprecate PCRE regexes within zuul. This results in warnings where regexes that re2 cannot support are used
19:40:22 There was a bug that caused these warnings to prevent new config updates from being usable. We tracked down and fixed those bugs and corvus restarted zuul schedulers outside of the automated upgrade system
19:40:42 sorry for the disruption, and thanks for the help
19:40:58 Where that leaves us is opendev's zuul configs will need to be updated to remove pcre regexes. I don't think this is super urgent but cutting down on the warnings in the error list helps reduce noise
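As an example of the kind of update involved, a minimal sketch of converting a PCRE-only branch matcher to the re2-compatible form. The job name is hypothetical, and it assumes Zuul's negate attribute on regex values and that branches accepts a list whose entries are OR'ed together, per the discussion below.

    # old form, a PCRE negative lookahead that re2 cannot support:
    #   branches: ^(?!stable/).*$
    - job:
        name: example-job          # hypothetical job name
        branches:
          - regex: ^stable/.*$
            negate: true           # match every branch except stable/*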
19:41:07 no need to apologize, thanks for implementing a very useful feature
19:41:26 and for the fast action on the fixes too
19:41:45 #link https://review.opendev.org/893702 merged change from frickler for project-config
19:42:03 #link https://review.opendev.org/893792 change to openstack-zuul-jobs
19:42:19 that showed us an issue with zuul-sphinx which should be resolved soon
19:42:51 i think those 2 changes will take care of most of the "central" stuff.
19:43:15 after they merge, i can write a message letting the wider community know about the change, how to make updates, point at those changes, etc.
19:43:17 i guess the final decision on the !zuul user match was that the \S trailer wasn't needed any longer?
19:43:34 i agreed with that assessment and approved it
19:43:50 okay, cool
19:44:01 I also checked the git history of that line and it seemed to agree with that assessment
19:44:24 yay for simplicity
19:44:33 in general, there's a lot less line noise in our branch matchers now. :)
19:44:54 thankfully
19:45:21 also, quick reminder in case it's useful -- branches is a list, so you can make a whole list of positive and negated regexes if you need to. the list is boolean or.
19:45:44 i haven't run into a case where that's necessary, or looks better than just a (a|b|c) sequence, but it's there if we need it.
19:46:05 a list of or'ed negated regexes wouldn't work would it?
19:46:26 i mean... it'd "work"...
19:46:32 !a|!b would include everything
19:46:32 fungi: Error: "a|!b" is not a valid command.
19:46:38 opendevmeet agrees
19:46:38 fungi: Error: "agrees" is not a valid command.
19:46:53 and would like to subscribe to our newsletter
19:46:56 but yes, probably of limited utility. :)
19:47:13 anything else on regexes?
19:47:20 nak
19:47:24 have a couple more things to get to and we are running out of time. Thanks
19:47:29 #topic Bookworm updates
19:47:32 #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.
19:47:54 I think we're ready to proceed on this with zuul if zuul is ready. But nodepool may pose problems between ext4 options and older grub
19:48:20 I am helping someone debug this today and I'm not sure yet if bookworm is affected. But generally you can create an ext4 fs that grub doesn't like using newer mkfs
19:48:21 what's grub's problem there?
19:48:30 https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1844012 appears related
19:48:45 would we notice issues at the point where we already put the new image in service? so would need to roll back to "yesterday's"* image?
19:48:46 fungi: basically grub says "unknown filesystem" when it doesn't like the features enabled in the ext4 fs
19:48:54 (* yesterday may not equal yesterday anymore)
19:48:56 corvus: no, we would fail to build images in the first place
19:49:02 are we using grub in our bookworm containers?
19:49:16 fungi: the bookworm containers run dib which makes grub for all of our images
19:49:20 *all of our vm images
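A minimal sketch of the mkfs/grub interaction described above. The device paths are placeholders, and the specific feature names are assumptions (metadata_csum_seed and orphan_file are common culprits with newer e2fsprogs) that have not been confirmed for bookworm's dib builds.

    # a filesystem created by a newer mkfs.ext4 can enable features that an
    # older grub does not recognize, so grub reports "unknown filesystem"
    mkfs.ext4 /dev/vdb1
    grub-install --boot-directory=/mnt/boot /dev/vdb   # may fail with older grub

    # possible workaround: disable the newer features at filesystem creation time
    mkfs.ext4 -O ^metadata_csum_seed,^orphan_file /dev/vdb1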
19:49:25 oh ok. in that case... is there dib testing using the nodepool container?
19:49:35 oh, right, that's the nodepool connection. now i understand
19:49:38 corvus: oh there should be so we can use a depends on
19:49:40 corvus: good idea
19:49:53 clarkb: not sure if they're in the same tenant though
19:49:53 sorry I'm just being made aware of this as our meeting started so it's all new to me
19:50:09 exciting
19:50:16 but if depends-on doesn't work, some other ideas:
19:50:21 corvus: hrm they aren't. But we can probably figure some way of testing that. Maybe hardcoding off of the intermediate registry
19:50:44 yeah, manually specify the intermediate registry container
19:50:58 or we could land it in nodepool and fast-revert if it breaks
19:51:05 absolute worst case, new images we upload will fail to boot, so jobs will end up waiting indefinitely for node assignments until we roll back to prior images
19:51:05 it is possible that bookworm avoids the feature anyway and we're fine so definitely worth testing
19:51:22 fungi: dib fails hard on the grub failure so it shouldn't get that far
19:51:43 corvus: ya I'll keep the test in production alternative in mind if I can't find an easy way to test it otherwise
19:51:48 oh, so our images will just get stale? even less impact, we just have to watch for it since it may take a while to notice otherwise
19:51:50 oh, one other possibility might be a throwaway nodepool job that uses an older distro
19:52:35 (since nodepool does have one functional job that builds images with dib; just probably not on an affected distro)
19:52:44 fungi: right, dib running in bullseye today runs mkfs.ext4 and creates a new ext4 fs that focal grub can install into. When we switch to bookworm the concern is that grub will say unknown filesystem, exit with an error, and the image build will fail
19:52:49 but it shouldn't ever upload
19:53:19 corvus: oh cool I can look at that too. And see this is happening at build not boot time we don't need complicated verification of the end result. Just that the build itself succeeds
19:53:35 *since this is happening
19:53:52 that was all I had. Happy for zuul to proceed with bookworm in the meantime
19:54:07 #topic Open Discussion
19:54:16 ianychoi: I know you wanted to talk about the Zanata db/stats api stuff?
19:54:34 Yep
19:55:11 I'm not aware of us doing anything special to prevent the stats api from being used. Which is why I wonder if it is an admin only function
19:55:25 If it is i think we can provide you or someone else with admin access to use the api.
19:55:42 First, public APIs for user stats do not work - e.g., https://translate.openstack.org/rest/stats/user/ianychoi/2022-10-05..2023-03-22
19:55:49 I am concerned that I don't know what all goes into that database though so am wary of providing a database dump. But others may know more and have other thoughts
19:56:25 It worked to calculate translation stats previously to sync with Stackalytics + to calculate extra ATC status
19:57:17 The root cause might be some messy DB state in the Zanata instance but it is not easy to get help..
19:57:46 So I thought investigating the DB issues would be one idea.
19:58:22 I see. Probably before we get that far we should check the server logs to see if there are any errors associated with those requests. I do note that the zanata rest api documentation doesn't seem to show user stats, just project stats
19:59:01 http://zanata.org/zanata-platform/rest-api-docs/resource_StatisticsResource.html maybe you can get the data on a project by project basis instead?
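A minimal sketch of querying the project statistics endpoint mentioned above; the project slug and version are placeholders, and the URL path is taken from the linked StatisticsResource documentation rather than verified against this deployment.

    # per-project, per-version translation statistics as JSON
    curl -s -H 'Accept: application/json' \
      'https://translate.openstack.org/rest/stats/proj/horizon/iter/master'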
19:59:15 I just noticed once again that I'm too young an infra-root to be able to access that host, but I'm fine with not changing that situation
19:59:17 Yep I also figured out that project APIs are working well
19:59:38 frickler: interesting, I thought that server was getting users managed by ansible like the other servers do
19:59:51 So, I think some help from infra-root to work together on this part would be great ideally
20:00:08 Or maybe me or Seongsoo need to step up :p
20:00:35 ianychoi: ok, the main thing is that this service is long deprecated so we're unlikely to be able to invest much in it. But I think we can check logs for obvious errors.
20:00:43 frickler: you have an ssh account on translate.openstack.org
20:00:55 but maybe it's set up wrong or something
20:01:02 Agree with @clarkb, would you help check logs?
20:01:24 Or feel free to point me to them so that I can investigate the detailed log messages
20:01:28 service admins for that platform are probably added manually though, not through ansible
20:01:52 ianychoi: yes a root will need to check the logs. I can look later today
20:02:02 Thank you!
20:02:16 ah, I was using the wrong username. so I can check tomorrow if no one else has time
20:02:24 i think pleia2 added our sysadmins as zanata admins back when the server was first set up, but it likely hasn't been revisited since
20:02:28 sounds like a plan, we can sync up from there
20:02:58 but we are out of time. Feel free to bring up more discussion in #opendev or on the service-discuss mailing list
20:03:01 thank you everyone!
20:03:03 #endmeeting