Tuesday, 2023-09-05

clarkbhello it is meeting time19:01
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Sep  5 19:01:10 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkbIt feels like yesterday was a holiday. So many things this morning19:01
fungiyesterday felt less like a holiday than it could have19:01
ianychoio/19:02
fungibut them's the breaks19:02
ianychoiI have not been aware of such holidays but hope that many people had great holidays :)19:02
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/F5B2EF7BWK62UQZCTHVGEKER4XFRDSIE/ Our Agenda19:02
clarkbianychoi: I did! it was the last day my parents were here visiting and we made smoked bbq meats19:03
clarkb#topic Announcements19:03
clarkbI have nothing19:03
ianychoiWow great :)19:03
ianychoi(I will bring translation topics during open discussion time.. thanks!)19:04
clarkb#topic Infra Root Google Account Activity19:05
clarkbI have nothing to report here19:05
clarkbI've still got it on the todo list19:06
clarkbhopefully soon19:06
clarkb#topic Mailman 319:06
clarkbas mentioned last week fungi thinks we're ready to start migrating additional domains and I agree. That means we need to schedule a time to migrate lists.kata-containers.io and lists.airshipit.org19:06
fungiyeah, so i've decided which are the most active lists on each site to notify19:07
fungiit seems like thursday september 14 may be a good date to do those two imports19:07
fungiif people agree, i can notify the airship-discuss and kata-dev mailing lists with a message similar to the one i sent to zuul-discuss when the lists.zuul-ci.org site was migrated19:08
fungiis there a time of day which would make sense to have more interested parties around to handle comms or whatever might arise?19:09
fricklerI have no idea where the main timezones of those communities are located19:10
clarkbairship was in US central time iirc19:10
clarkband kata is fairly global19:10
fricklerso likely US morning would be best suited to give you some room to handle possible fallout19:11
fungiyeah, i'm more concerned with having interested sysadmins around for the maintenance window. the communities themselves will know to plan for the list archives to be temporarily unavailable and for mail deliveries to be deferred19:11
fungii'm happy to run the migration commands (though they're also documented in the planning pad and scripted in system-config too)19:12
clarkbI'm happy to help anytime after about 15:30 UTC19:12
fungibut having more people around in general at those times helps19:12
clarkband thursday the 14th should work for me anytime after then19:12
fungithe migration process will probably take around an hour start to finish, and that includes dns propagation19:13
fungii'll revisit what we did for the first migrations, but basically we leveraged dns as a means of making incoming messages temporarily undeliverable until the migration was done, and then updated dns to point to the new server19:14
fungifor kata it may be easier since it's on its own server already, the reason we did it via dns for the first migrations is that they shared a server with other sites that weren't moving at the same times19:15
clarkbusing DNS seemed to work fine though so we can stick to that19:16
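For reference, a minimal sketch of how the DNS cutover for a list domain could be watched from the outside, assuming dnspython is installed; the domains are the ones named above, and the target address is a placeholder rather than the real new server's IP.

    # Watch for the list domains to start resolving to the new server.
    # Assumes dnspython; the expected address below is a placeholder.
    import time

    import dns.resolver

    DOMAINS = ["lists.kata-containers.io", "lists.airshipit.org"]
    EXPECTED_TARGET = "203.0.113.10"  # placeholder for the new server


    def resolved_addresses(name):
        """Return the A record addresses currently visible for name."""
        try:
            answer = dns.resolver.resolve(name, "A")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return set()
        return {rr.to_text() for rr in answer}


    def wait_for_cutover(poll_seconds=60):
        """Poll until every domain resolves to the expected new address."""
        pending = set(DOMAINS)
        while pending:
            for name in sorted(pending):
                addrs = resolved_addresses(name)
                print(f"{name}: {sorted(addrs)}")
                if addrs == {EXPECTED_TARGET}:
                    pending.discard(name)
            if pending:
                time.sleep(poll_seconds)
        print("dns cutover visible from this resolver")


    if __name__ == "__main__":
        wait_for_cutover()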
fungifrickler: does 15:30-16:30 utc work well for you?19:16
fungiif so i'll let the airship and kata lists know that's the time we're planning19:16
fricklerI'm not sure I'll be around, but time is fine in general19:17
frickler*the time19:17
fungiokay, no worries. thanks!19:17
clarkbgreat see yall then19:17
clarkbanything else mailman 3 related?19:17
fungilet's go with 15:30-16:30 utc on thursday 2023-09-1419:17
fungii'll send announcements some time tomorrow19:17
clarkbthanks19:18
clarkb#topic Server Upgrades19:19
clarkbNothing new here either19:20
clarkb#topic Rax IAD image upload struggles19:20
clarkbLots of progress/news here though19:20
clarkbfungi filed a ticket with rax and the response was essentially that iad is expected to behave differently19:20
fungiyes, sad panda19:20
clarkbthis means we can't rely on the cloud provider to fix it for us. Instead we've reduced the number of upload threads to 1 per builder and increased the image rebuild time intervals19:21
fungithough that prompted us to look at whether we're too aggressively updating our images19:21
clarkbthe idea here is that we don't actually need to rebuild every image constantly and can more conservatively update things in a way that the cloud region can hopefully keep up with19:21
corvuswhat's the timing look like now?  how long between update cycles?  and how long does a full update cycle take?19:22
fungiyeah, basically update the default nodeset's label daily, current versions of other distros every 2 days, and older versions of distros weekly19:22
fricklerbut that was only merged 2 days ago, so no effect seen yet19:22
corvusmakes sense19:22
funginoting that the "default nodeset" isn't necessarily always consistent across tenants, but we can come up with a flexible policy there19:23
clarkbfungi: we currently define it centrally in opendev/base-jobs though19:23
fungiand this is an experiment in order to hopefully get reasonably current images for jobs while minimizing the load we put on our providers' image services19:23
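For reference, a rough model of the rebuild cadence described above; the image labels and exact day counts are illustrative only, and the real behaviour is driven by nodepool's builder configuration rather than anything like this script.

    # Illustrative only: which image labels would be due for a rebuild on a
    # given day under the cadence discussed in the meeting.
    import datetime

    # image label -> minimum days between rebuilds (per the policy above)
    REBUILD_INTERVAL_DAYS = {
        "ubuntu-jammy": 1,        # default nodeset label, rebuilt daily
        "debian-bookworm": 2,     # current versions of other distros
        "ubuntu-focal": 7,        # older distro versions, rebuilt weekly
    }


    def images_due(last_built, today=None):
        """Return the labels whose last build is older than their interval."""
        today = today or datetime.date.today()
        due = []
        for label, interval in REBUILD_INTERVAL_DAYS.items():
            age = (today - last_built[label]).days
            if age >= interval:
                due.append(label)
        return due


    if __name__ == "__main__":
        today = datetime.date.today()
        last_built = {
            "ubuntu-jammy": today - datetime.timedelta(days=1),
            "debian-bookworm": today - datetime.timedelta(days=1),
            "ubuntu-focal": today - datetime.timedelta(days=3),
        }
        print(images_due(last_built))  # -> ['ubuntu-jammy']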
fricklerlooking at the upload ids, some images seem to have taken 4 attempts in iad to be successful19:23
corvusmaybe we could look at adjusting that to so that current versions of all distros are updated daily.  once we have more data.19:24
fungifrickler: that also means we probably have new leaked images in iad we should look at the metadata for now to see if we can identify for sure why nodepool doesn't clean them up19:24
fricklerI was looking at nodepool stats in grafana and > 50% of the average used nodes were jammy19:24
fricklerfungi: yes19:24
fungicorvus: yes, that also seems reasonable19:24
corvusif you find a leaked image, ping me with the details please19:25
fungicorvus: will do, thanks!19:25
fungii'll try to take a look after dinner19:25
fungii'll avoid cleaning them up until we have time to go over them19:26
corvusit's probably enough to keep one around if you want to clean up others19:26
fungithe handful we've probably leaked pales in comparison to the 1200 i cleaned up in iad recently19:26
fungiso i'll probably just delay cleanup until we're satisfied19:27
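For reference, a sketch of how the suspected leaked images in iad might be inspected, assuming openstacksdk with a clouds.yaml entry named "rax-iad" and assuming nodepool tags its uploads with properties whose keys start with "nodepool"; both of those details are assumptions to verify against the actual cloud.

    # List images carrying nodepool-style metadata so their properties can
    # be compared against what nodepool thinks it still owns.
    import openstack


    def list_nodepool_images(cloud_name="rax-iad"):
        conn = openstack.connect(cloud=cloud_name)
        for image in conn.image.images():
            props = image.properties or {}
            nodepool_props = {k: v for k, v in props.items()
                              if k.startswith("nodepool")}
            if nodepool_props:
                print(image.id, image.name, image.created_at)
                for key, value in sorted(nodepool_props.items()):
                    print(f"  {key} = {value}")


    if __name__ == "__main__":
        list_nodepool_images()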
clarkbsounds like that is all for images19:29
clarkb#topic Fedora cleanup19:29
clarkbThe nodeset removal from base-jobs landed19:29
clarkbI've also seen some projects like ironic push changes to clean up their use of fedora19:29
clarkbI think the next step is to actually remove the label (and images) from nodepool when we think people have had enough time to prepare19:30
clarkbshould we send an email announcing a date for that?19:30
fricklerwell none of that would still work anyway, or was it only devstack that was broken?19:31
fungii thought devstack dropped fedora over a month ago19:31
fungidropped the devstack-defined fedora nodesets anyway19:31
fricklerbut didn't all fedora testing stop working when they pulled their repo content?19:31
clarkbfrickler: yes, unless jobs got updated to pull from other locations19:32
fungiyes, well anything that tried to install packages anyway19:32
clarkbI don't think we need to wait very long as you are correct most things would be very broken19:32
clarkbmore of a final warning if anyone had this working somehow19:32
fricklerI don't think waiting a week or two will help anyone, but it also doesn't hurt us19:33
clarkbya I was thinking about a week19:33
clarkbmaybe announce removal for Monday?19:33
fungiwfm19:33
fricklerack19:33
clarkbthat gives me time to write the changes for it too :)19:33
clarkbcool19:33
clarkb#topic Zuul Ansible 8 Default19:34
clarkbAll of the OpenDev Zuul tenants are ansible 8 by default now19:34
clarkbI haven't heard of or seen anyone needing to pin to ansible 6 either19:34
corvuswhat's the openstack switcheroo date?19:34
fungiyesterday19:35
clarkbThis doesn't need to be on the agenda for next week, but I wanted to make note of this and remind people to call out oddities if they see them19:35
fricklerit was yesterday19:35
fricklerthere's only one concern that I mentioned earlier: we might not notice when jobs pass that actually should fail19:35
corvuscool.  :)  sorry i misread comment from clark :)19:35
fricklermight be because new ansible changed the error handling19:35
clarkbfrickler: yes, I think that risk exists but it seems to be a low probability19:35
clarkbsince ansible generally fails by default if anything goes wrong19:35
fungifrickler: i probably skimmed that comment too quickly earlier, what error handling changed in 8?19:36
fungior was it hypothetical?19:36
fricklerthat was purely hypothetical19:36
fungiokay, yes i agree that there are a number of hypothetical concerns with any upgrade19:36
fungisince in theory any aspect of the software can change19:36
clarkbyup mostly just be aware there may be behavior changes and if you see them please let zuul and opendev folks know19:38
fungifrom a practical standpoint, unless anyone has mentioned specific changes to error handling in ansible 8 i'm not going to lose sleep over the possibility of that sort of regression, but we should of course be mindful of the ever-present possibility19:38
clarkbboth of us will be interested in any observed differences even if they are minor19:38
clarkb(one thing I want to look at if I ever find time is performance)19:38
clarkb#topic Zuul PCRE regex support is deprecated19:39
clarkbThe automatic weekend upgrade of zuul pulled in changes to deprecate PCRE regexes within zuul. This results in warnings where regexes that re2 cannot support are used19:39
clarkbThere was a bug that caused these warnings to prevent new config updates from being usable. We tracked down and fixed those bugs and corvus restarted zuul schedulers outside of the automated upgrade system19:40
corvussorry for the disruption, and thanks for the help19:40
clarkbWhere that leaves us is opendev's zuul configs will need to be updated to remove pcre regexes. I don't think this is super urgent but cutting down on the warnings in the error list helps reduce noise19:40
fungino need to apologize, thanks for implementing a very useful feature19:41
fungiand for the fast action on the fixes too19:41
corvus#link https://review.opendev.org/893702 merged change from frickler for project-config19:41
corvus#link https://review.opendev.org/893792 change to openstack-zuul-jobs19:42
corvusthat showed us an issue with zuul-sphinx which should be resolved soon19:42
corvusi think those 2 changes will take care of most of the "central" stuff.19:42
corvusafter they merge, i can write a message letting the wider community know about the change, how to make updates, point at those changes, etc.19:43
fungii guess the final decision on the !zuul user match was that the \S trailer wasn't needed any longer?19:43
corvusi agreed with that assessment and approved it19:43
fungiokay, cool19:43
fricklerI also checked the git history of that line and it seemed to agree with that assessment19:44
fungiyay for simplicity19:44
corvusin general, there's a lot less line noise in our branch matchers now.  :)19:44
fungithankfully19:44
corvusalso, quick reminder in case it's useful -- branches is a list, so you can make a whole list of positive and negated regexes if you need to.  the list is boolean or.19:45
corvusi haven't run into a case where that's necessary, or looks better than just a (a|b|c) sequence, but it's there if we need it.19:45
fungia list of or'ed negated regexes wouldn't work would it?19:46
corvusi mean... it'd "work"...19:46
fungi!a|!b would include everything19:46
opendevmeetfungi: Error: "a|!b" is not a valid command.19:46
fungiopendevmeet agrees19:46
opendevmeetfungi: Error: "agrees" is not a valid command.19:46
fungiand would like to subscribe to our newsletter19:46
corvusbut yes, probably of limited utility.  :)19:46
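For reference, a toy model (plain Python re, not re2) of the matcher semantics being discussed: a list of patterns, each optionally negated, with the list as a whole treated as a boolean OR. It also shows fungi's point that a list made only of negated patterns matches nearly everything. Zuul's real matcher lives in zuul itself, not in this sketch.

    import re


    def branch_matches(branch, matchers):
        """matchers is a list of (pattern, negate) tuples, OR'ed together."""
        for pattern, negate in matchers:
            hit = re.search(pattern, branch) is not None
            if negate:
                hit = not hit
            if hit:
                return True
        return False


    # positive and negated entries mixed in one list behave sensibly:
    print(branch_matches("stable/zed", [(r"^master$", False),
                                        (r"^stable/", False)]))   # True

    # but a list made only of negated patterns matches nearly everything,
    # since any branch failing one of the patterns satisfies the OR:
    only_negated = [(r"^a$", True), (r"^b$", True)]
    print(branch_matches("a", only_negated))  # True ("a" is not "b")
    print(branch_matches("b", only_negated))  # True ("b" is not "a")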
clarkbanything else on regexes?19:47
corvusnak19:47
clarkbhave a couple more things to get to and we are running out of time. Thanks19:47
clarkb#topic Bookworm updates19:47
clarkb#link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.19:47
clarkbI think we're ready to proceed on this with zuul if zuul is ready. But nodepool may pose problems between ext4 options and older grub19:47
clarkbI am helping someone debug this today and I'm not sure yet if bookworm is affected. But generally you can create an ext4 fs that grub doesn't like using newer mkfs19:48
fungiwhat's grub's problem there?19:48
clarkbhttps://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1844012 appears related19:48
corvuswould we notice issues at the point where we already put the new image in service?  so would need to roll back to "yesterday's"* image? 19:48
clarkbfungi: basically grub says "unknown filesystem" when it doesn't like the features enabled in the ext4 fs19:48
corvus(* yesterday may not equal yesterday anymore)19:48
clarkbcorvus: no, we would fail to build images in the first place19:48
fungiare we using grub in our bookworm containers?19:49
clarkbfungi: the bookworm containers run dib which makes grub for all of our images19:49
clarkb*all of our vm images19:49
corvusoh ok.  in that case... is there dib testing using the nodepool container?19:49
fungioh, right, that's the nodepool connection. now i understand19:49
clarkbcorvus: oh there should be so we can use a depends on19:49
clarkbcorvus: good idea19:49
corvusclarkb: not sure if they're in the same tenant though19:49
clarkbsorry I'm just being made aware of this as our meeting started so it's all new to me19:49
fungiexciting19:50
corvusbut if depends-on doesn't work, some other ideas:19:50
clarkbcorvus: hrm they aren't. But we can probably figure some way of testing that. Maybe hardcoding off of the intermediate registry19:50
corvusyeah, manually specify the intermediate registry container19:50
corvusor we could land it in nodepool and fast-revert if it breaks19:50
fungiabsolute worst case, new images we upload will fail to boot, so jobs will end up waiting indefinitely for node assignments until we roll back to prior images19:51
clarkbit is possible that bookworm avoids the feature anyway and we're fine so definitely worth testing19:51
clarkbfungi: dib fails hard on the grub failure so it shouldn't get that far19:51
clarkbcorvus: ya I'll keep the test in production alternative in mind if I can't find an easy way to test it otherwise19:51
fungioh, so our images will just get stale? even less impact, we just have to watch for it since it may take a while to notice otherwise19:51
corvusoh, one other possibility might be a throwaway nodepool job that uses an older distro19:51
corvus(since nodepool does have one functional job that builds images with dib; just probably not on an affected distro)19:52
clarkbfungi: right dib running in bullseye today runs mkfs.ext4 and creates a new ext4 fs that focal grub can install into. When we switch to bookworm the concern is that grub will say unknown filesystem. exit with and error and the image build will fail19:52
clarkbbut it shouldn't ever upload19:52
clarkbcorvus: oh cool I can look at that too. And see this is happening at build not boot time we don't need complicated verification of the end result. Just that the build itself succeeds19:53
clarkb*since this is happening19:53
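For reference, a small sketch for checking which ext4 features a freshly built filesystem has enabled, using dumpe2fs from e2fsprogs; the specific features named as grub-unfriendly (metadata_csum_seed, orphan_file) are assumptions based on the linked bug report and newer e2fsprogs defaults, and should be verified against the actual dib failure.

    # Print any ext4 features on a filesystem that older grub might reject.
    # The SUSPECT_FEATURES set is an assumption, not a confirmed list.
    import subprocess
    import sys

    SUSPECT_FEATURES = {"metadata_csum_seed", "orphan_file"}


    def ext4_features(device_or_file):
        """Return the set of features dumpe2fs reports for the filesystem."""
        output = subprocess.run(
            ["dumpe2fs", "-h", device_or_file],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in output.splitlines():
            if line.startswith("Filesystem features:"):
                return set(line.split(":", 1)[1].split())
        return set()


    if __name__ == "__main__":
        features = ext4_features(sys.argv[1])
        problems = features & SUSPECT_FEATURES
        if problems:
            print("features older grub may not understand:", sorted(problems))
        else:
            print("no suspect ext4 features found")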
clarkbthat was all I had. Happy for zuul to proceed with bookworm in the meantime19:53
clarkb#topic Open Discussion19:54
clarkbianychoi: I know you wanted to talk about the Zanata db/stats api stuff?19:54
ianychoiYep19:54
clarkbI'm not aware of us doing anything special to prevent the stats api from being used. Which is why I wonder if it is an admin only function19:55
clarkbIf it is i think we can provide you or someone else with admin access to use the api.19:55
ianychoiFirst, public APIs for user stats do not work - e.g., https://translate.openstack.org/rest/stats/user/ianychoi/2022-10-05..2023-03-2219:55
clarkbI am concerned that I don't know what all goes into that database though so am wary of providing a database dump. But others may know more and have other thoughts19:55
ianychoiIt previously worked for calculating translation stats to sync with Stackalytics and for calculating extra ATC status19:56
ianychoiThe root cause might be some messy DB state in the Zanata instance but it is not easy to get help..19:57
ianychoiSo I thought investigating the DB issues would be one idea.19:57
clarkbI see. Probably before we get that far we should check the server logs to see if there are any errors associated with those requests. I do note that the zanata rest api documentation doesn't seem to show user stats just project stats19:58
clarkbhttp://zanata.org/zanata-platform/rest-api-docs/resource_StatisticsResource.html maybe you can get the data on a project by project basis instead?19:59
fricklerI just noticed once again that I'm too young an infra-root in order to be able to access that host, but I'm fine with not changing that situation19:59
ianychoiYep I also figured out that project APIs are working well19:59
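For reference, a quick sketch for reproducing the failing user stats call so that the response code and body can be correlated with whatever shows up in the server logs; the URL is the one quoted above and the username and date range are just the example from the meeting.

    # Re-issue the user stats request and show what the server returns.
    import requests

    URL = ("https://translate.openstack.org/rest/stats/user/"
           "ianychoi/2022-10-05..2023-03-22")


    def fetch_user_stats(url=URL):
        response = requests.get(url, headers={"Accept": "application/json"},
                                timeout=30)
        print("status:", response.status_code)
        print(response.text[:500])
        return response


    if __name__ == "__main__":
        fetch_user_stats()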
clarkbfrickler: interesting, I thought that server was getting users managed by ansible like the other servers do19:59
ianychoiSo, I think it would be great to work together with infra-root on this part19:59
ianychoiOr maybe me or Seongsoo need to step up :p 20:00
clarkbianychoi: ok, the main thing is that this service is long deprecated so we're unlikely to be able to invest much in it. But I think we can check logs for obvious errors.20:00
fungifrickler: you have an ssh account on translate.openstack.org20:00
fungibut maybe it's set up wrong or something20:00
ianychoiAgree with @clarkb; would you help check logs?20:01
ianychoiOr feel free to point me at them so that I can investigate the detailed log messages20:01
fungiservice admins for that platform are probably added manually though not through ansible20:01
clarkbianychoi: yes a root will need to check the logs. I can look later today20:01
ianychoiThank you!20:02
fricklerah, I was using the wrong username. so I can check tomorrow if no one else has time20:02
fungii think pleia2 added our sysadmins as zanata admins back when the server was first set up, but it likely hasn't been revisited since20:02
clarkbsounds like a plan we can sync up from there20:02
clarkbbut we are out of time. Feel free to bring up more discussion in #opendev or on the service-discuss mailing list20:02
clarkbthank you everyone!20:03
clarkb#endmeeting20:03
opendevmeetMeeting ended Tue Sep  5 20:03:03 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:03
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.html20:03
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.txt20:03
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.log.html20:03
clarkb(I can smell lunch and I am very hungry :) )20:03
ianychoiThank you all20:03
fungithanks!20:03
