19:01:05 #startmeeting infra
19:01:05 Meeting started Tue Aug 15 19:01:05 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:05 The meeting name has been set to 'infra'
19:01:35 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/FRMRI2B7KC2HPOC5VTJYQBKARGCTY5GA/ Our Agenda
19:01:42 #topic Announcements
19:01:58 I'm back in my normal timezone. Other than that I didn't have anything to announce
19:02:37 oh! the openstack feature freeze happens right at the end of this month / start of september
19:02:49 yeah, that's notable
19:02:53 something to be aware of as we make changes, to avoid major impact to openstack
19:03:15 it will be the busiest time for our zuul resources
19:03:56 sounds like openstack continues to struggle with reliability problems in the gate too, so we may be asked to help diagnose issues
19:04:08 I expect those will become more urgent in feature freeze crunch time
19:04:48 #topic Topics
19:04:52 #topic Google Account for Infra root
19:05:16 our infra root email address got notified that an associated google account is on the chopping block December 1st if there is no activity before then
19:05:39 assuming we decide to preserve the account we'll need to do some sort of activity every two years or it will be deleted
19:05:42 any idea what it's used for?
19:05:46 or i guess not used for
19:05:53 deleted account names cannot be reused, so we don't have to worry about someone taking it over at least
19:05:54 formerly used for something long ago
19:06:05 I'm bringing it up here in hopes someone else knows what it was used for :)
19:06:13 i think they retire the address so it may be worth someone logging in just in case we decide it's important later
19:06:33 but i don't recall using it for anything, sorry
19:07:04 ya if we log in we'll reset that 2 year counter and the account itself may have clues for what it was used for
19:08:03 I can try to do that before our next meeting and hopefully have new info to share then
19:08:11 if anyone else recalls what it was for later please share
19:08:19 but we have time to sort this out for now at least
19:08:56 #topic Bastion Host Updates
19:09:04 #link https://review.opendev.org/q/topic:bridge-backups
19:09:13 this topic could still use root review by others
19:09:16 i haven't found time to look yet
19:10:00 I also thought we may need to look at upgrading ansible on the bastion but I think ianw may have already taken care of that
19:10:06 double checking is probably a good idea though
19:10:29 #topic Mailman 3
19:10:55 fungi: it looks like the mailman 3 vhosting stuff is working as expected now. I recall reviewing some mm3 changes though I'm not sure where we've ended up since
19:12:11 so the first three changes in topic:mailman3 still need more reviews but should be safe to merge
19:12:47 the last change in that series currently is the version upgrade to the latest mm3 releases
19:13:04 #link https://review.opendev.org/q/topic:mailman3+status:open
19:13:19 i have a held node prepped to do another round of import testing, but got sideswiped by other tasks and haven't had time to run through those yet
19:13:57 ok I've got etherpad and gitea prod updates to land too. After the meeting we should make a rough plan for landing some of these things and pushing forward
19:14:05 the upgrade is probably also safe to merge, but has no votes and i can understand if reviewers would rather wait until i've tested importing on the held node
19:14:24 that does seem like a good way to exercise the upgrade
19:15:12 to summarize: no known issues, general update changes needed, upgrade change queued after the general updates, import testing needed for further migration
19:15:22 also the manual steps for adding the django domains/postorius mailhost associations are currently recorded in the migration etherpad
19:15:34 i'll want to add those to our documentation
19:15:48 ++
19:16:11 they involve things like ssh port forwarding so you can connect to the django admin url from localhost
19:16:26 and need the admin credentials from the ansible inventory
19:17:07 once i've done another round of import tests and we merge the version update, we should be able to start scheduling more domain migrations
19:17:16 once logged in the steps are a few button clicks though, so pretty straightforward
19:17:24 yup
19:17:35 i just wish that part were easier to script
19:17:38 ++
19:18:26 from discussion on the mm3 list, it seems there is no api endpoint in postorius for the mailhost association step, which would be the real blocker (we'd have to hack up something based on the current source code for how postorius's webui works)
19:18:27 lets coordinate to land some of these changes after the meeting and then figure out docs and an upgrade as followups
19:18:53 sounds great, thanks! that's all i had on this topic
19:19:02 #topic Gerrit Updates
19:19:42 I've kept this agenda item for two reasons. First, I'm still hoping for some feedback on dealing with the replication task leaks. Second, I'm hoping to start testing the 3.7 -> 3.8 upgrade very soon
19:20:03 For replication task leaks the recent restart where we moved those aside/deleted them showed that is a reasonable thing to do
19:20:19 we can either script that or simply stop bind mounting the dir where they are stored so docker rms them for us
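A minimal sketch of what the "script that" option above could look like, assuming hypothetical locations for the leaked replication task files and the move-aside target (the real paths on the review server would need to be confirmed); it only moves stale task files out of the way rather than deleting them:

```python
#!/usr/bin/env python3
"""Move aside leaked Gerrit replication task files older than a cutoff.

Sketch only: TASK_DIR and ASIDE_DIR are placeholders, not the actual
paths used on the review server.
"""
import shutil
import time
from pathlib import Path

TASK_DIR = Path("/home/gerrit2/review_site/data/replication/ref-updates/waiting")  # placeholder
ASIDE_DIR = Path("/home/gerrit2/tmp/leaked-replication-tasks")  # placeholder
MAX_AGE = 7 * 24 * 3600  # one week, in seconds


def main():
    ASIDE_DIR.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - MAX_AGE
    moved = 0
    for task in TASK_DIR.iterdir():
        if task.is_file() and task.stat().st_mtime < cutoff:
            # move rather than delete so the tasks can still be inspected later
            shutil.move(str(task), str(ASIDE_DIR / task.name))
            moved += 1
    print(f"moved {moved} stale replication task files aside")


if __name__ == "__main__":
    main()
```

The other option mentioned above, dropping the bind mount so the leaked files simply don't survive a container recreation, would avoid needing any script at all.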
19:21:00 For gerrit upgrades the base upgrade job is working (and has been working) but we need to go through the release notes and test things on a held node, like reverts (if possible) and any new features or behaviors that concern us
19:21:51 did you ever see my issue with starred changes?
19:22:01 frickler: I don't think I did
19:22:14 seems I'm unable to view them because I have more than 1024 starred
19:22:32 interesting. This is querying the is:starred listing?
19:22:47 yes, getting "Error 400 (Bad Request): too many terms in query: 1193 terms (max = 1024)"
19:22:56 and similar via the gerrit cli
19:23:12 ick
19:23:14 have we brought this up on their mailing list or via a bug?
19:23:25 I don't think so
19:23:35 ok I can write an email to repo-discuss if you prefer
19:23:40 I also use this very rarely, so it may be an older regression
19:23:45 ack
19:23:51 it's also possible that is a configurable limit
19:24:21 that would be great, then I could at least find out which changes are starred and unstar them
19:24:40 a maybe related issue is that stars aren't shown in any list view
19:24:48 just in the view of the change itself
19:24:54 good point. I can ask for hints on methods for finding subsets for unstarring
19:25:17 thx
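Since the is:starred query trips over index.maxTerms, one possible way to enumerate starred changes without going through the change query parser is the accounts REST API. A rough, untested sketch assuming HTTP-password basic auth; it only prints candidates and leaves the actual unstar call commented out, and it is possible this endpoint hits the same index limit internally, in which case bumping index.maxTerms really is the only route:

```python
#!/usr/bin/env python3
"""List starred changes via the Gerrit accounts API and pick a subset to unstar.

Sketch only: credentials and the KEEP threshold are placeholders.
"""
import json

import requests

GERRIT = "https://review.opendev.org"
AUTH = ("myuser", "my-http-password")  # placeholder HTTP credentials
KEEP = 1000  # stay below the 1024 term limit with some headroom


def gerrit_get(path):
    resp = requests.get(f"{GERRIT}/a{path}", auth=AUTH)
    resp.raise_for_status()
    # Gerrit prefixes JSON responses with )]}' on the first line to defeat XSSI
    return json.loads(resp.text.split("\n", 1)[1])


starred = gerrit_get("/accounts/self/starred.changes")
print(f"{len(starred)} starred changes")

# the oldest-updated changes are the most likely candidates for unstarring
starred.sort(key=lambda c: c.get("updated", ""))
for change in starred[: max(0, len(starred) - KEEP)]:
    print(f"would unstar {change['_number']}: {change.get('subject', '')}")
    # to actually unstar:
    # requests.delete(f"{GERRIT}/a/accounts/self/starred.changes/{change['_number']}", auth=AUTH)
```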
19:26:17 #topic Server Upgrades
19:26:27 No new servers booted recently that I am aware of
19:26:38 index.maxTerms
19:26:43 frickler: ^
19:26:51 However we had trouble with zuul executors running out of disk today. The underlying issue was that /var/lib/zuul was not a dedicated fs with extra space
19:27:08 so a reminder to all of us replacing servers and reviewing server replacements to check for volumes/filesystem mounts
19:27:16 those got replaced over the first two weeks of july, so it's amazing we didn't start seeing problems before now
19:27:47 https://gerrit-review.googlesource.com/Documentation/config-gerrit.html (i can't seem to deep link it, so search for maxTerms there)
19:28:58 #topic Fedora Cleanup
19:29:02 re mounts -- i guess the question is what do we want to do in the future? update launch-node to have an option to switch around /opt? or make it a standard part of server documentation? (but then we have to remember to read the docs, which we normally don't have to do for a simple server replacement)
19:29:05 #undo
19:29:05 Removing item from minutes: #topic Fedora Cleanup
19:29:36 corvus: maybe a good place to annotate that info is in our inventory file since I at least tend to look there in order to get the next server in sequence
19:29:54 that's a good idea
19:29:55 because you are right that it will be easy to miss in proper documentation
19:30:02 launch node does also have an option to add volumes, i think, which would be more portable outside rackspace
19:30:18 fungi: yes, it does volume management and can do arbitrary mounts for volumes
19:30:44 or...
19:30:46 so if we moved the executors to, say, vexxhost or ovh we'd want to do it that way presumably
19:31:08 we could update zuul executors to bind-mount in /opt as /var/lib/zuul, and/or reconfigure them to use /opt/zuul as the build dir
19:31:29 /opt/zuul is a good idea actually
19:31:30 (one of those violates our "same path in container" rule, but the second doesn't)
19:31:38 since that reduces moving parts and keeps things simple
19:31:53 yeah, /opt/zuul would keep the "same path" rule, so is maybe the best option...
19:32:42 i like that.
19:32:54 it does look like launch/src/opendev_launch/make_swap.sh currently hard-codes /opt as the mountpoint
19:33:07 yup, and swap as the other portion
19:33:10 so would need patching if we wanted to make it configurable
19:33:49 I like the simplicity of /opt/zuul
19:34:10 thus patching the compose files seems more reasonable
19:34:30 and if we're deploying outside rax we just need to remember to add a cinder volume for /opt
19:34:30 ack, that's also what my zuul aio uses
19:34:45 or have a large enough /
19:34:57 (unless the flavor in that cloud has tons of rootfs and we feel safe using that instead)
19:35:02 yeah, exactly
19:36:29 coming back to index.maxTerms, do we want to try bumping that to 2k?
19:36:33 or 1.5k?
19:37:03 I think it'll likely require a restart, though?
19:37:12 at the very least we can probably bump it temporarily, allowing you to adjust your star count
19:37:18 yes, it will require a restart
19:37:57 ok, I'll propose a patch and then we can discuss the details
19:38:00 I don't know what the memory scaling is like for terms but that would be my main concern
19:38:32 #topic Fedora Cleanup
19:38:47 tonyb and I looked at doing this the graceful way and then upstream deleted the packages anyway
19:39:02 I suspect this means we can forge ahead and simply remove the image type since they are largely non-functional due to changes upstream of us
19:39:07 then we can clear out the mirror content
19:39:29 any concerns with that? I know nodepool recently updated its jobs to exclude fedora
19:39:34 I think devstack has done similar cleanup
19:39:49 it's possible people have adjusted the urls in jobs to grab packages from the graveyard, but unlikely
19:40:03 zuul-jobs is mostly fedora-free now due to the upstream yank
19:40:23 I'm hearing we should just go ahead and remove the images :)
19:40:30 I'll try to push that forward this week too
19:40:37 cc tonyb if still interested
19:41:03 (also it's worth specifically calling out that there is now no fedora testing in zuul jobs, meaning that the base job playbooks, etc, could break for fedora at any time)
19:41:29 even the software factory third party CI which uses fedora is on old nodes and not running jobs properly
19:41:38 so if anyone adds fedora images back to opendev, please make sure to add them to zuul-jobs for testing first before using them in any other projects
19:41:44 ++
19:41:53 and maybe software factory is interested in updating their third party ci
19:41:58 maybe the fedora community wants to run a third-party ci
19:42:12 since they do use zuul to build fedora packages
19:42:29 (in a separate sf-based zuul deployment from the public sf, as i understand it)
19:42:59 so it's possible they have newer fedora on theirs than the public sf
19:43:14 ya, we can bring it up with the sf folks and take it from there
19:43:22 or bookwar maybe
19:43:31 for the base jobs roles, first party ci would be ideal
19:43:52 certainly
19:44:00 and welcome. just to be clear.
19:44:17 just needs someone interested in working on that
19:44:32 base job roles aren't very effectively tested by third party ci
19:44:49 (there is some testing, but not 100% coverage, due to the nature of base job roles)
19:45:20 s/very effectively/completely effectively/ i think that's a little more accurate
19:46:27 good to keep in mind
19:46:30 #topic Gitea 1.20
19:47:06 I sorted out the access log issue. Turns out there were additional breaking changes not documented in the release notes
19:47:21 gotta love those
19:47:28 and you need different configs for access logs now. I still need to cross check their format against production, since the breaking change they did document is that the format may differ
19:47:45 Then I've got a whole list of TODOs in the commit message to work through
19:48:15 in general though I just need a block of focused time to page all this back in and get up to speed on it
19:48:21 but good news: some progress here
19:48:31 #topic etherpad 1.9.1
19:48:40 #link https://review.opendev.org/c/opendev/system-config/+/887006 Etherpad 1.9.1
19:48:59 Looks like fungi and ianw have tested the held node
19:49:08 yes, seems good to me
19:49:19 cool, so we can probably land this one real soon
19:49:36 as mentioned earlier we should sync up on a rough plan for some of these and start landing them
19:50:02 #topic Python Container Updates
19:50:24 we discovered last week when trying to sort out zookeeper installs on bookworm that the java packaging for bookworm is broken, but not in a consistent manner
19:50:40 it seems to run package setup in different orders depending on which packages you have installed, and it only breaks in one order
19:51:11 testing has package updates to fix this but they haven't made it back to bookworm yet. For zookeeper installs we are pulling the affected package from testing.
19:51:23 I think the only service this currently affects is gerrit
19:51:35 and we can probably take our time upgrading gerrit, waiting for bookworm to be fixed properly
19:51:44 but be aware of that if you are doing java things on the new bookworm images
19:51:54 #link https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129
19:53:04 otherwise I think we are as ready as we can be migrating our images to bookworm. Sounds like zuul and nodepool plan to do so after their next release
19:53:33 clarkb: ... but local testing with a debian:bookworm image had the ca certs install working somehow...
19:53:48 so ... actually... we might be able to build bookworm java app containers?
19:53:57 corvus: it appears to be related to the packages already installed and/or being installed affecting the order of installation
19:54:09 corvus: that's true, we may be able to build the gerrit containers and sidestep the issue
19:54:14 (but nobody will know why they work :)
19:54:48 specifically if the jre package is set up before the ca-certificates-java package it works. But if we go in the other order it breaks
19:55:19 the jre package depends on the certificates package so you can't do separate install invocations between them
19:55:53 #topic Open Discussion
19:55:55 Anything else?
19:56:36 forgot to add to the agenda, rackspace issues
19:56:56 around the end of july we started seeing frequent image upload errors to the iad glance
19:57:18 that led to filling up the builders and they ceased to be able to update images anywhere for about 10 days
19:57:35 i cleaned up the builders but the issue with glance in iad persists (we've paused uploads for it)
19:57:47 that still needs more looking into, and probably a ticket opened
19:58:16 ++ I mentioned this last week but I think our best bet is to engage rax and show them how the other rax regions differ (if not entirely in behavior, at least by degree)
19:58:21 separately, we have a bunch of stuck "deleting" nodes in multiple rackspace regions (including iad i think), taking up the majority of the quotas
19:59:16 frickler did some testing with a patched builder and increasing the hardcoded 60-minute timeout for images to become active did work around the issue
19:59:22 for glance uploads i mean
19:59:53 but clearly that's a pathological case and not something we should bother actually implementing
20:00:02 and that's all i had
20:00:08 yes, that worked when not uploading too many images in parallel
20:00:46 but yes, 1h should be enough for any healthy cloud
20:00:50 60m or 6 hours?
20:01:03 1h is the default from the sdk
20:01:21 I bumped it to 10h on nb01/02 temporarily
20:01:47 aha, so you patched the builder to specify a timeout when calling the upload method
20:01:52 got it, thx. (so many timeout values)
20:01:58 ack
20:02:27 we are out of time
20:02:28 anyway, we were seeing upwards of a 5.5 hour delay for images to become active there when uploading manually
20:02:34 thanks clarkb!
20:02:36 I'm going to end it here but feel free to continue the conversation
20:02:44 I just don't want to keep people from lunch/dinner/breakfast as necessary
20:02:48 thank you all!
20:02:50 #endmeeting
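For reference on the upload timeout discussion above, a hedged sketch of passing an explicit timeout to an image upload via openstacksdk's cloud layer; the image name, file, and cloud name are placeholders, and this only illustrates the sdk parameter, not the actual patch frickler tested on the builders:

```python
#!/usr/bin/env python3
"""Upload an image and wait longer than the sdk's 1 hour default for it to go active.

Sketch only: cloud name, image name, and file path are placeholders.
"""
import openstack

# cloud name as defined in clouds.yaml (placeholder)
conn = openstack.connect(cloud="rax-iad")

image = conn.create_image(
    "opendev-test-image",
    filename="/tmp/opendev-test-image.vhd",
    wait=True,          # block until glance reports the image active
    timeout=10 * 3600,  # 10 hours instead of the default 3600 seconds
)
print(image.status)
```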