19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Sep 5 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:24 It feels like yesterday was a holiday. So many things this morning
19:01:52 yesterday felt less like a holiday than it could have
19:02:01 o/
19:02:05 but them's the breaks
19:02:48 I have not been aware of such holidays but hope that many people had great holidays :)
19:02:51 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/F5B2EF7BWK62UQZCTHVGEKER4XFRDSIE/ Our Agenda
19:03:06 ianychoi: I did! it was the last day my parents were here visiting and we made smoked bbq meats
19:03:14 #topic Announcements
19:03:15 I have nothing
19:03:43 Wow great :)
19:04:02 (I will bring translation topics during open discussion time.. thanks!)
19:05:41 #topic Infra Root Google Account Activity
19:05:45 I have nothing to report here
19:06:08 I've still got it on the todo list
19:06:10 hopefully soon
19:06:15 #topic Mailman 3
19:06:43 as mentioned last week fungi thinks we're ready to start migrating additional domains and I agree. That means we need to schedule a time to migrate lists.kata-containers.io and lists.airshipit.org
19:07:36 yeah, so i've decided which are the most active lists on each site to notify
19:07:56 it seems like thursday september 14 may be a good date to do those two imports
19:08:43 if people agree, i can notify the airship-discuss and kata-dev mailing lists with a message similar to the one i sent to zuul-discuss when the lists.zuul-ci.org site was migrated
19:09:13 is there a time of day which would make sense to have more interested parties around to handle comms or whatever might arise?
19:10:18 I have no idea where the main timezones of those communities are located
19:10:43 airship was in US central time iirc
19:10:50 and kata is fairly global
19:11:26 so likely US morning would be best suited to give you some room to handle possible fallout
19:11:42 yeah, i'm more concerned with having interested sysadmins around for the maintenance window. the communities themselves will know to plan for the list archives to be temporarily unavailable and for mail deliveries to be deferred
19:12:28 i'm happy to run the migration commands (though they're also documented in the planning pad and scripted in system-config too)
19:12:35 I'm happy to help anytime after about 15:30 UTC
19:12:39 but having more people around in general at those times helps
19:12:50 and thursday the 14th should work for me anytime after then
19:13:02 the migration process will probably take around an hour start to finish, and that includes dns propagation
19:14:22 i'll revisit what we did for the first migrations, but basically we leveraged dns as a means of making incoming messages temporarily undeliverable until the migration was done, and then updated dns to point to the new server
19:15:03 for kata it may be easier since it's on its own server already, the reason we did it via dns for the first migrations is that they shared a server with other sites that weren't moving at the same times
19:16:02 using DNS seemed to work fine though so we can stick to that
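For context on the DNS-based cutover described above, a minimal sketch of the kind of records involved; the addresses and TTL are placeholders, not the actual opendev values. The idea is to temporarily point the list domain at an address with no SMTP listener so sending servers defer and retry rather than bounce, then switch to the new Mailman 3 server once the import finishes.

    ; before the window: lower the TTL so the later switch propagates quickly
    lists.airshipit.org.   300  IN  A  198.51.100.10   ; old server (placeholder)
    ; during the window: point at an address that refuses connections;
    ; remote MTAs get connection failures and queue the mail for retry
    lists.airshipit.org.   300  IN  A  192.0.2.1       ; placeholder "black hole"
    ; after the import: point at the new mailman 3 server
    lists.airshipit.org.   300  IN  A  198.51.100.20   ; new server (placeholder)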
19:16:28 frickler: does 15:30-16:30 utc work well for you?
19:16:48 if so i'll let the airship and kata lists know that's the time we're planning
19:17:07 I'm not sure I'll be around, but time is fine in general
19:17:14 *the time
19:17:16 okay, no worries. thanks!
19:17:31 great see yall then
19:17:36 anything else mailman 3 related?
19:17:37 let's go with 15:30-16:30 utc on thursday 2023-09-14
19:17:47 i'll send announcements some time tomorrow
19:18:00 thanks
19:19:55 #topic Server Upgrades
19:20:00 Nothing new here either
19:20:21 #topic Rax IAD image upload struggles
19:20:26 Lots of progress/news here though
19:20:40 fungi filed a ticket with rax and the response was essentially that iad is expected to behave differently
19:20:52 yes, sad panda
19:21:17 this means we can't rely on the cloud provider to fix it for us. Instead we've reduced the number of upload threads to 1 per builder and increased the image rebuild time intervals
19:21:20 though that prompted us to look at whether we're too aggressively updating our images
19:21:40 the idea here is that we don't actually need new images for all images constantly and can more conservatively update things in a way that the cloud region can hopefully keep up with
19:22:15 what's the timing look like now? how long between update cycles? and how long does a full update cycle take?
19:22:17 yeah, basically update the default nodeset's label daily, current versions of other distros every 2 days, and older versions of distros weekly
19:22:42 but that was only merged 2 days ago, so no effect seen yet
19:22:42 makes sense
19:23:07 noting that the "default nodeset" isn't necessarily always consistent across tenants, but we can come up with a flexible policy there
19:23:34 fungi: we currently define it centrally in opendev/base-jobs though
19:23:52 and this is an experiment in order to hopefully get reasonably current images for jobs while minimizing the load we put on our providers' image services
19:23:55 looking at the upload ids, some images seem to have taken 4 attempts in iad to be successful
19:24:12 maybe we could look at adjusting that so that current versions of all distros are updated daily, once we have more data.
19:24:37 frickler: that also means we probably have new leaked images in iad we should look at the metadata for now to see if we can identify for sure why nodepool doesn't clean them up
19:24:39 I was looking at nodepool stats in grafana and > 50% of the average used nodes were jammy
19:24:50 fungi: yes
19:24:52 corvus: yes, that also seems reasonable
19:25:28 if you find a leaked image, ping me with the details please
19:25:39 corvus: will do, thanks!
19:25:48 i'll try to take a look after dinner
19:26:02 i'll avoid cleaning them up until we have time to go over them
19:26:27 it's probably enough to keep one around if you want to clean up others
19:26:56 the handful we've probably leaked pales in comparison to the 1200 i cleaned up in iad recently
19:27:06 so i'll probably just delay cleanup until we're satisfied
19:29:22 sounds like that is all for images
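As an illustration of the rebuild-interval change discussed in the image upload topic above, a minimal sketch of a nodepool builder configuration. The image names and values are illustrative rather than the exact opendev settings, and it assumes nodepool's per-diskimage rebuild-age option (in seconds) and the builder's --upload-workers flag.

    # nodepool-builder started with a single upload thread, e.g.:
    #   nodepool-builder --upload-workers 1
    diskimages:
      - name: ubuntu-jammy        # default nodeset label: rebuild daily
        rebuild-age: 86400
      - name: debian-bookworm     # current releases of other distros: every 2 days
        rebuild-age: 172800
      - name: ubuntu-focal        # older releases: weekly
        rebuild-age: 604800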
19:29:28 #topic Fedora cleanup
19:29:34 The nodeset removal from base-jobs landed
19:29:47 I've also seen some projects like ironic push changes to clean up their use of fedora
19:30:08 I think the next step is to actually remove the label (and images) from nodepool when we think people have had enough time to prepare
19:30:17 should we send an email announcing a date for that?
19:31:01 well, nothing of that would still work, or was it only devstack that was broken?
19:31:23 i thought devstack dropped fedora over a month ago
19:31:46 dropped the devstack-defined fedora nodesets anyway
19:31:48 but didn't all fedora testing stop working when they pulled their repo content?
19:32:01 frickler: yes, unless jobs got updated to pull from other locations
19:32:05 yes, well anything that tried to install packages anyway
19:32:12 I don't think we need to wait very long as you are correct most things would be very broken
19:32:22 more of a final warning if anyone had this working somehow
19:33:05 I don't think waiting a week or two will help anyone, but it also doesn't hurt us
19:33:14 ya I was thinking about a week
19:33:21 maybe announce removal for Monday?
19:33:23 wfm
19:33:39 ack
19:33:42 that gives me time to write the changes for it too :)
19:33:44 cool
19:34:03 #topic Zuul Ansible 8 Default
19:34:24 All of the OpenDev Zuul tenants are ansible 8 by default now
19:34:32 I haven't heard of or seen anyone needing to pin to ansible 6 either
19:34:55 what's the openstack switcheroo date?
19:35:03 yesterday
19:35:04 This doesn't need to be on the agenda for next week, but I wanted to make note of this and remind people to call out oddities if they see them
19:35:06 it was yesterday
19:35:19 there's only one concern that I mentioned earlier: we might not notice when jobs pass that actually should fail
19:35:27 cool. :) sorry i misread the comment from clark :)
19:35:33 might be because new ansible changed the error handling
19:35:48 frickler: yes, I think that risk exists but it seems to be a low probability
19:35:59 since ansible is fail by default if anything goes wrong generally
19:36:03 frickler: i probably skimmed that comment too quickly earlier, what error handling changed in 8?
19:36:10 or was it hypothetical?
19:36:18 that was purely hypothetical
19:36:39 okay, yes i agree that there are a number of hypothetical concerns with any upgrade
19:36:58 since in theory any aspect of the software can change
19:38:15 yup mostly just be aware there may be behavior changes and if you see them please let zuul and opendev folks know
19:38:16 from a practical standpoint, unless anyone has mentioned specific changes to error handling in ansible 8 i'm not going to lose sleep over the possibility of that sort of regression, but we should of course be mindful of the ever-present possibility
19:38:26 both of us will be interested in any observed differences even if they are minor
19:38:36 (one thing I want to look at if I ever find time is performance)
19:39:11 #topic Zuul PCRE regex support is deprecated
19:39:43 The automatic weekend upgrade of zuul pulled in changes to deprecate PCRE regexes within zuul. This results in warnings where regexes that re2 cannot support are used
19:40:22 There was a bug that caused these warnings to prevent new config updates from being usable. We tracked down and fixed those bugs and corvus restarted zuul schedulers outside of the automated upgrade system
19:40:42 sorry for the disruption, and thanks for the help
19:40:58 Where that leaves us is opendev's zuul configs will need to be updated to remove pcre regexes. I don't think this is super urgent but cutting down on the warnings in the error list helps reduce noise
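As an example of the kind of update involved, a minimal sketch of converting a PCRE-only branch matcher to the re2-compatible form. The job name is hypothetical, and it assumes Zuul's negate attribute on regex values and that branches accepts a list whose entries are OR'ed together, per the discussion below.

    # old form, a PCRE negative lookahead that re2 cannot support:
    #   branches: ^(?!stable/).*$
    - job:
        name: example-job          # hypothetical job name
        branches:
          - regex: ^stable/.*$
            negate: true           # match every branch except stable/*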
19:41:07 no need to apologize, thanks for implementing a very useful feature
19:41:26 and for the fast action on the fixes too
19:41:45 #link https://review.opendev.org/893702 merged change from frickler for project-config
19:42:03 #link https://review.opendev.org/893792 change to openstack-zuul-jobs
19:42:19 that showed us an issue with zuul-sphinx which should be resolved soon
19:42:51 i think those 2 changes will take care of most of the "central" stuff.
19:43:15 after they merge, i can write a message letting the wider community know about the change, how to make updates, point at those changes, etc.
19:43:17 i guess the final decision on the !zuul user match was that the \S trailer wasn't needed any longer?
19:43:34 i agreed with that assessment and approved it
19:43:50 okay, cool
19:44:01 I also checked the git history of that line and it seemed to agree with that assessment
19:44:24 yay for simplicity
19:44:33 in general, there's a lot less line noise in our branch matchers now. :)
19:44:54 thankfully
19:45:21 also, quick reminder in case it's useful -- branches is a list, so you can make a whole list of positive and negated regexes if you need to. the list is boolean or.
19:45:44 i haven't run into a case where that's necessary, or looks better than just a (a|b|c) sequence, but it's there if we need it.
19:46:05 a list of or'ed negated regexes wouldn't work would it?
19:46:26 i mean... it'd "work"...
19:46:32 !a|!b would include everything
19:46:32 fungi: Error: "a|!b" is not a valid command.
19:46:38 opendevmeet agrees
19:46:38 fungi: Error: "agrees" is not a valid command.
19:46:53 and would like to subscribe to our newsletter
19:46:56 but yes, probably of limited utility. :)
19:47:13 anything else on regexes?
19:47:20 nak
19:47:24 have a couple more things to get to and we are running out of time. Thanks
19:47:29 #topic Bookworm updates
19:47:32 #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.
19:47:54 I think we're ready to proceed on this with zuul if zuul is ready. But nodepool may pose problems between ext4 options and older grub
19:48:20 I am helping someone debug this today and I'm not sure yet if bookworm is affected. But generally you can create an ext4 fs that grub doesn't like using newer mkfs
19:48:21 what's grub's problem there?
19:48:30 https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1844012 appears related
19:48:45 would we notice issues at the point where we already put the new image in service? so would need to roll back to "yesterday's"* image?
19:48:46 fungi: basically grub says "unknown filesystem" when it doesn't like the features enabled in the ext4 fs
19:48:54 (* yesterday may not equal yesterday anymore)
19:48:56 corvus: no, we would fail to build images in the first place
19:49:02 are we using grub in our bookworm containers?
19:49:16 fungi: the bookworm containers run dib which makes grub for all of our images
19:49:20 *all of our vm images
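A minimal sketch of the mkfs/grub interaction described above. The device paths are placeholders, and the specific feature names are assumptions (metadata_csum_seed and orphan_file are common culprits with newer e2fsprogs) that have not been confirmed for bookworm's dib builds.

    # a filesystem created by a newer mkfs.ext4 can enable features that an
    # older grub does not recognize, so grub reports "unknown filesystem"
    mkfs.ext4 /dev/vdb1
    grub-install --boot-directory=/mnt/boot /dev/vdb   # may fail with older grub

    # possible workaround: disable the newer features at filesystem creation time
    mkfs.ext4 -O ^metadata_csum_seed,^orphan_file /dev/vdb1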
19:49:25 oh ok. in that case... is there dib testing using the nodepool container?
19:49:35 oh, right, that's the nodepool connection. now i understand
19:49:38 corvus: oh there should be so we can use a depends on
19:49:40 corvus: good idea
19:49:53 clarkb: not sure if they're in the same tenant though
19:49:53 sorry I'm just being made aware of this as our meeting started so it's all new to me
19:50:09 exciting
19:50:16 but if depends-on doesn't work, some other ideas:
19:50:21 corvus: hrm they aren't. But we can probably figure some way of testing that. Maybe hardcoding off of the intermediate registry
19:50:44 yeah, manually specify the intermediate registry container
19:50:58 or we could land it in nodepool and fast-revert if it breaks
19:51:05 absolute worst case, new images we upload will fail to boot, so jobs will end up waiting indefinitely for node assignments until we roll back to prior images
19:51:05 it is possible that bookworm avoids the feature anyway and we're fine so definitely worth testing
19:51:22 fungi: dib fails hard on the grub failure so it shouldn't get that far
19:51:43 corvus: ya I'll keep the test in production alternative in mind if I can't find an easy way to test it otherwise
19:51:48 oh, so our images will just get stale? even less impact, we just have to watch for it since it may take a while to notice otherwise
19:51:50 oh, one other possibility might be a throwaway nodepool job that uses an older distro
19:52:35 (since nodepool does have one functional job that builds images with dib; just probably not on an affected distro)
19:52:44 fungi: right, dib running in bullseye today runs mkfs.ext4 and creates a new ext4 fs that focal grub can install into. When we switch to bookworm the concern is that grub will say unknown filesystem, exit with an error, and the image build will fail
19:52:49 but it shouldn't ever upload
19:53:19 corvus: oh cool I can look at that too. And see this is happening at build not boot time we don't need complicated verification of the end result. Just that the build itself succeeds
19:53:35 *since this is happening
19:53:52 that was all I had. Happy for zuul to proceed with bookworm in the meantime
19:54:07 #topic Open Discussion
19:54:16 ianychoi: I know you wanted to talk about the Zanata db/stats api stuff?
19:54:34 Yep
19:55:11 I'm not aware of us doing anything special to prevent the stats api from being used. Which is why I wonder if it is an admin only function
19:55:25 If it is i think we can provide you or someone else with admin access to use the api.
19:55:42 First, public APIs for user stats do not work - e.g., https://translate.openstack.org/rest/stats/user/ianychoi/2022-10-05..2023-03-22
19:55:49 I am concerned that I don't know what all goes into that database though so am wary of providing a database dump. But others may know more and have other thoughts
19:56:25 It worked to calculate translation stats previously to sync with Stackalytics + to calculate extra ATC status
19:57:17 The root cause might be some messy DB state in the Zanata instance but it is not easy to get help..
19:57:46 So I thought investigating the DB issues would be one idea.
19:58:22 I see. Probably before we get that far we should check the server logs to see if there are any errors associated with those requests. I do note that the zanata rest api documentation doesn't seem to show user stats, just project stats
19:59:01 http://zanata.org/zanata-platform/rest-api-docs/resource_StatisticsResource.html maybe you can get the data on a project by project basis instead?
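A minimal sketch of querying the project statistics endpoint mentioned above; the project slug and version are placeholders, and the URL path is taken from the linked StatisticsResource documentation rather than verified against this deployment.

    # per-project, per-version translation statistics as JSON
    curl -s -H 'Accept: application/json' \
      'https://translate.openstack.org/rest/stats/proj/horizon/iter/master'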
19:59:15 I just noticed once again that I'm too young an infra-root to be able to access that host, but I'm fine with not changing that situation
19:59:17 Yep I also figured out that project APIs are working well
19:59:38 frickler: interesting, I thought that server was getting users managed by ansible like the other servers do
19:59:51 So, I think some help from infra-root to work together on this part would be great ideally
20:00:08 Or maybe me or Seongsoo need to step up :p
20:00:35 ianychoi: ok, the main thing is that this service is long deprecated so we're unlikely to be able to invest much in it. But I think we can check logs for obvious errors.
20:00:43 frickler: you have an ssh account on translate.openstack.org
20:00:55 but maybe it's set up wrong or something
20:01:02 Agree with @clarkb, would you help check logs?
20:01:24 Or feel free to point me to them so that I can investigate the detailed log messages
20:01:28 service admins for that platform are probably added manually though, not through ansible
20:01:52 ianychoi: yes a root will need to check the logs. I can look later today
20:02:02 Thank you!
20:02:16 ah, I was using the wrong username. so I can check tomorrow if no one else has time
20:02:24 i think pleia2 added our sysadmins as zanata admins back when the server was first set up, but it likely hasn't been revisited since
20:02:28 sounds like a plan, we can sync up from there
20:02:58 but we are out of time. Feel free to bring up more discussion in #opendev or on the service-discuss mailing list
20:03:01 thank you everyone!
20:03:03 #endmeeting