19:01:05 #startmeeting infra
19:01:05 Meeting started Tue Aug 15 19:01:05 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:05 The meeting name has been set to 'infra'
19:01:35 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/FRMRI2B7KC2HPOC5VTJYQBKARGCTY5GA/ Our Agenda
19:01:42 #topic Announcements
19:01:58 I'm back in my normal timezone. Other than that I didn't have anything to announce
19:02:37 oh! the openstack feature freeze happens right at the end of this month / start of september
19:02:49 yeah, that's notable
19:02:53 something to be aware of as we make changes, to avoid major impact to openstack
19:03:15 it will be the busiest time for our zuul resources
19:03:56 sounds like openstack continues to struggle with reliability problems in the gate too, so we may be asked to help diagnose issues
19:04:08 I expect those will become more urgent in feature freeze crunch time
19:04:48 #topic Topics
19:04:52 #topic Google Account for Infra root
19:05:16 our infra root email address got notified that an associated google account is on the chopping block December 1st if there is no activity before then
19:05:39 assuming we decide to preserve the account we'll need to do some sort of activity every two years or it will be deleted
19:05:42 any idea what it's used for?
19:05:46 or i guess not used for
19:05:53 deleted account names cannot be reused, so we don't have to worry about someone taking it over at least
19:05:54 formerly used for something long ago
19:06:05 I'm bringing it up here in hopes someone else knows what it was used for :)
19:06:13 i think they retire the address so it may be worth someone logging in just in case we decide it's important later
19:06:33 but i don't recall using it for anything, sorry
19:07:04 ya if we log in we'll reset that 2 year counter and the account itself may have clues for what it was used for
19:08:03 I can try to do that before our next meeting and hopefully have new info to share then
19:08:11 if anyone else recalls what it was for later please share
19:08:19 but we have time to sort this out for now at least
19:08:56 #topic Bastion Host Updates
19:09:04 #link https://review.opendev.org/q/topic:bridge-backups
19:09:13 this topic could still use root review by others
19:09:16 i haven't found time to look yet
19:10:00 I also thought we may need to look at upgrading ansible on the bastion but I think ianw may have already taken care of that
19:10:06 double checking is probably a good idea though
19:10:29 #topic Mailman 3
19:10:55 fungi: it looks like the mailman 3 vhosting stuff is working as expected now. I recall reviewing some mm3 changes though I'm not sure where we've ended up since
19:12:11 so the first three changes in topic:mailman3 still need more reviews but should be safe to merge
19:12:47 the last change in that series currently is the version upgrade to the latest mm3 releases
19:13:04 #link https://review.opendev.org/q/topic:mailman3+status:open
19:13:19 i have a held node prepped to do another round of import testing, but got sideswiped by other tasks and haven't had time to run through those yet
19:13:57 ok I've got etherpad and gitea prod updates to land too. After the meeting we should make a rough plan for landing some of these things and pushing forward
19:14:05 the upgrade is probably also safe to merge, but has no votes and i can understand if reviewers would rather wait until i've tested importing on the held node
19:14:24 that does seem like a good way to exercise the upgrade
19:15:12 to summarize: no known issues, general update changes needed, upgrade change queued after the general updates, import testing needed for further migration
19:15:22 also the manual steps for adding the django domains/postorius mailhost associations are currently recorded in the migration etherpad
19:15:34 i'll want to add those to our documentation
19:15:48 ++
19:16:11 they involve things like ssh port forwarding so you can connect to the django admin url from localhost
19:16:26 and need the admin credentials from the ansible inventory
19:17:07 once i've done another round of import tests and we merge the version update, we should be able to start scheduling more domain migrations
19:17:16 once logged in the steps are a few button clicks though, so pretty straightforward
19:17:24 yup
19:17:35 i just wish that part were easier to script
19:17:38 ++
19:18:26 from discussion on the mm3 list, it seems there is no api endpoint in postorius for the mailhost association step, which would be the real blocker (we'd have to hack up something based on the current source code for how postorius's webui works)
19:18:27 lets coordinate to land some of these changes after the meeting and then figure out docs and an upgrade as followups
19:18:53 sounds great, thanks! that's all i had on this topic
19:19:02 #topic Gerrit Updates
19:19:42 I've kept this agenda item for two reasons. First, I'm still hoping for some feedback on dealing with the replication task leaks. Second, I'm hoping to start testing the 3.7 -> 3.8 upgrade very soon
19:20:03 For replication task leaks the recent restart where we moved those aside/deleted them showed that is a reasonable thing to do
19:20:19 we can either script that or simply stop bind mounting the dir where they are stored so docker rms them for us
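A minimal sketch of what the "script that" option above could look like, assuming hypothetical locations for the leaked replication task files and the move-aside target (the real paths on the review server would need to be confirmed); it only moves stale task files out of the way rather than deleting them:

```python
#!/usr/bin/env python3
"""Move aside leaked Gerrit replication task files older than a cutoff.

Sketch only: TASK_DIR and ASIDE_DIR are placeholders, not the actual
paths used on the review server.
"""
import shutil
import time
from pathlib import Path

TASK_DIR = Path("/home/gerrit2/review_site/data/replication/ref-updates/waiting")  # placeholder
ASIDE_DIR = Path("/home/gerrit2/tmp/leaked-replication-tasks")  # placeholder
MAX_AGE = 7 * 24 * 3600  # one week, in seconds


def main():
    ASIDE_DIR.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - MAX_AGE
    moved = 0
    for task in TASK_DIR.iterdir():
        if task.is_file() and task.stat().st_mtime < cutoff:
            # move rather than delete so the tasks can still be inspected later
            shutil.move(str(task), str(ASIDE_DIR / task.name))
            moved += 1
    print(f"moved {moved} stale replication task files aside")


if __name__ == "__main__":
    main()
```

The other option mentioned above, dropping the bind mount so the leaked files simply don't survive a container recreation, would avoid needing any script at all.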
19:21:00 For gerrit upgrades the base upgrade job is working (and has been working) but we need to go through the release notes and test things on a held node, like reverts (if possible) and any new features or behaviors that concern us
19:21:51 did you ever see my issue with starred changes?
19:22:01 frickler: I don't think I did
19:22:14 seems I'm unable to view them because I have more than 1024 starred
19:22:32 interesting. This is querying the is:starred listing?
19:22:47 yes, getting "Error 400 (Bad Request): too many terms in query: 1193 terms (max = 1024)"
19:22:56 and similar via the gerrit cli
19:23:12 ick
19:23:14 have we brought this up on their mailing list or via a bug?
19:23:25 I don't think so
19:23:35 ok I can write an email to repo-discuss if you prefer
19:23:40 I also use this very rarely, so it may be an older regression
19:23:45 ack
19:23:51 it's also possible that is a configurable limit
19:24:21 that would be great, then I could at least find out which changes are starred and unstar them
19:24:40 a maybe related issue is that stars aren't shown in any list view
19:24:48 just in the view of the change itself
19:24:54 good point. I can ask for hints on methods for finding subsets for unstarring
19:25:17 thx
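Since the is:starred query trips over index.maxTerms, one possible way to enumerate starred changes without going through the change query parser is the accounts REST API. A rough, untested sketch assuming HTTP-password basic auth; it only prints candidates and leaves the actual unstar call commented out, and it is possible this endpoint hits the same index limit internally, in which case bumping index.maxTerms really is the only route:

```python
#!/usr/bin/env python3
"""List starred changes via the Gerrit accounts API and pick a subset to unstar.

Sketch only: credentials and the KEEP threshold are placeholders.
"""
import json

import requests

GERRIT = "https://review.opendev.org"
AUTH = ("myuser", "my-http-password")  # placeholder HTTP credentials
KEEP = 1000  # stay below the 1024 term limit with some headroom


def gerrit_get(path):
    resp = requests.get(f"{GERRIT}/a{path}", auth=AUTH)
    resp.raise_for_status()
    # Gerrit prefixes JSON responses with )]}' on the first line to defeat XSSI
    return json.loads(resp.text.split("\n", 1)[1])


starred = gerrit_get("/accounts/self/starred.changes")
print(f"{len(starred)} starred changes")

# the oldest-updated changes are the most likely candidates for unstarring
starred.sort(key=lambda c: c.get("updated", ""))
for change in starred[: max(0, len(starred) - KEEP)]:
    print(f"would unstar {change['_number']}: {change.get('subject', '')}")
    # to actually unstar:
    # requests.delete(f"{GERRIT}/a/accounts/self/starred.changes/{change['_number']}", auth=AUTH)
```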
19:26:17 #topic Server Upgrades
19:26:27 No new servers booted recently that I am aware of
19:26:38 index.maxTerms
19:26:43 frickler: ^
19:26:51 However we had trouble with zuul executors running out of disk today. The underlying issue was that /var/lib/zuul was not a dedicated fs with extra space
19:27:08 so a reminder to all of us replacing servers and reviewing server replacements to check for volumes/filesystem mounts
19:27:16 those got replaced over the first two weeks of july, so it's amazing we didn't start seeing problems before now
19:27:47 https://gerrit-review.googlesource.com/Documentation/config-gerrit.html (i can't seem to deep link it, so search for maxTerms there)
19:28:58 #topic Fedora Cleanup
19:29:02 re mounts -- i guess the question is what do we want to do in the future? update launch-node to have an option to switch around /opt? or make it a standard part of server documentation? (but then we have to remember to read the docs, which we normally don't have to do for a simple server replacement)
19:29:05 #undo
19:29:05 Removing item from minutes: #topic Fedora Cleanup
19:29:36 corvus: maybe a good place to annotate that info is in our inventory file since I at least tend to look there in order to get the next server in sequence
19:29:54 that's a good idea
19:29:55 because you are right that it will be easy to miss in proper documentation
19:30:02 launch node does also have an option to add volumes, i think, which would be more portable outside rackspace
19:30:18 fungi: yes, it does volume management and can do arbitrary mounts for volumes
19:30:44 or...
19:30:46 so if we moved the executors to, say, vexxhost or ovh we'd want to do it that way presumably
19:31:08 we could update zuul executors to bind-mount in /opt as /var/lib/zuul, and/or reconfigure them to use /opt/zuul as the build dir
19:31:29 /opt/zuul is a good idea actually
19:31:30 (one of those violates our "same path in container" rule, but the second doesn't)
19:31:38 since that reduces moving parts and keeps things simple
19:31:53 yeah, /opt/zuul would keep the "same path" rule, so is maybe the best option...
19:32:42 i like that.
19:32:54 it does look like launch/src/opendev_launch/make_swap.sh currently hard-codes /opt as the mountpoint
19:33:07 yup, and swap as the other portion
19:33:10 so would need patching if we wanted to make it configurable
19:33:49 I like the simplicity of /opt/zuul
19:34:10 thus patching the compose files seems more reasonable
19:34:30 and if we're deploying outside rax we just need to remember to add a cinder volume for /opt
19:34:30 ack, that's also what my zuul aio uses
19:34:45 or have a large enough /
19:34:57 (unless the flavor in that cloud has tons of rootfs and we feel safe using that instead)
19:35:02 yeah, exactly
19:36:29 coming back to index.maxTerms, do we want to try bumping that to 2k?
19:36:33 or 1.5k?
19:37:03 I think it'll likely require a restart, though?
19:37:12 at the very least we can probably bump it temporarily, allowing you to adjust your star count
19:37:18 yes, it will require a restart
19:37:57 ok, I'll propose a patch and then we can discuss the details
19:38:00 I don't know what the memory scaling is like for terms but that would be my main concern
19:38:32 #topic Fedora Cleanup
19:38:47 tonyb and I looked at doing this the graceful way and then upstream deleted the packages anyway
19:39:02 I suspect this means we can forge ahead and simply remove the image type since they are largely non-functional due to changes upstream of us
19:39:07 then we can clear out the mirror content
19:39:29 any concerns with that? I know nodepool recently updated its jobs to exclude fedora
19:39:34 I think devstack has done similar cleanup
19:39:49 it's possible people have adjusted the urls in jobs to grab packages from the graveyard, but unlikely
19:40:03 zuul-jobs is mostly fedora-free now due to the upstream yank
19:40:23 I'm hearing we should just go ahead and remove the images :)
19:40:30 I'll try to push that forward this week too
19:40:37 cc tonyb if still interested
19:41:03 (also it's worth specifically calling out that there is now no fedora testing in zuul jobs, meaning that the base job playbooks, etc, could break for fedora at any time)
19:41:29 even the software factory third party CI which uses fedora is on old nodes and not running jobs properly
19:41:38 so if anyone adds fedora images back to opendev, please make sure to add them to zuul-jobs for testing first before using them in any other projects
19:41:44 ++
19:41:53 and maybe software factory is interested in updating their third party ci
19:41:58 maybe the fedora community wants to run a third-party ci
19:42:12 since they do use zuul to build fedora packages
19:42:29 (in a separate sf-based zuul deployment from the public sf, as i understand it)
19:42:59 so it's possible they have newer fedora on theirs than the public sf
19:43:14 ya, we can bring it up with the sf folks and take it from there
19:43:22 or bookwar maybe
19:43:31 for the base jobs roles, first party ci would be ideal
19:43:52 certainly
19:44:00 and welcome. just to be clear.
19:44:17 just needs someone interested in working on that
19:44:32 base job roles aren't very effectively tested by third party ci
19:44:49 (there is some testing, but not 100% coverage, due to the nature of base job roles)
19:45:20 s/very effectively/completely effectively/ i think that's a little more accurate
19:46:27 good to keep in mind
19:46:30 #topic Gitea 1.20
19:47:06 I sorted out the access log issue. Turns out there were additional breaking changes not documented in the release notes
19:47:21 gotta love those
19:47:28 and you need different configs for access logs now. I still need to cross check their format against production, since the breaking change they did document is that the format may differ
19:47:45 Then I've got a whole list of TODOs in the commit message to work through
19:48:15 in general though I just need a block of focused time to page all this back in and get up to speed on it
19:48:21 but good news: some progress here
19:48:31 #topic etherpad 1.9.1
19:48:40 #link https://review.opendev.org/c/opendev/system-config/+/887006 Etherpad 1.9.1
19:48:59 Looks like fungi and ianw have tested the held node
19:49:08 yes, seems good to me
19:49:19 cool, so we can probably land this one real soon
19:49:36 as mentioned earlier we should sync up on a rough plan for some of these and start landing them
19:50:02 #topic Python Container Updates
19:50:24 we discovered last week when trying to sort out zookeeper installs on bookworm that the java packaging for bookworm is broken, but not in a consistent manner
19:50:40 it seems to run package setup in different orders depending on which packages you have installed, and it only breaks in one order
19:51:11 testing has package updates to fix this but they haven't made it back to bookworm yet. For zookeeper installs we are pulling the affected package from testing.
19:51:23 I think the only service this currently affects is gerrit
19:51:35 and we can probably take our time upgrading gerrit, waiting for bookworm to be fixed properly
19:51:44 but be aware of that if you are doing java things on the new bookworm images
19:51:54 #link https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129
19:53:04 otherwise I think we are as ready as we can be migrating our images to bookworm. Sounds like zuul and nodepool plan to do so after their next release
19:53:33 clarkb: ... but local testing with a debian:bookworm image had the ca certs install working somehow...
19:53:48 so ... actually... we might be able to build bookworm java app containers?
19:53:57 corvus: it appears to be related to the packages already installed and/or being installed affecting the order of installation
19:54:09 corvus: that's true, we may be able to build the gerrit containers and sidestep the issue
19:54:14 (but nobody will know why they work :)
19:54:48 specifically if the jre package is set up before the ca-certificates-java package it works. But if we go in the other order it breaks
19:55:19 the jre package depends on the certificates package so you can't do separate install invocations between them
19:55:53 #topic Open Discussion
19:55:55 Anything else?
19:56:36 forgot to add to the agenda, rackspace issues
19:56:56 around the end of july we started seeing frequent image upload errors to the iad glance
19:57:18 that led to filling up the builders and they ceased to be able to update images anywhere for about 10 days
19:57:35 i cleaned up the builders but the issue with glance in iad persists (we've paused uploads for it)
19:57:47 that still needs more looking into, and probably a ticket opened
19:58:16 ++ I mentioned this last week but I think our best bet is to engage rax and show them how the other rax regions differ (if not entirely in behavior, at least by degree)
19:58:21 separately, we have a bunch of stuck "deleting" nodes in multiple rackspace regions (including iad i think), taking up the majority of the quotas
19:59:16 frickler did some testing with a patched builder and increasing the hardcoded 60-minute timeout for images to become active did work around the issue
19:59:22 for glance uploads i mean
19:59:53 but clearly that's a pathological case and not something we should bother actually implementing
20:00:02 and that's all i had
20:00:08 yes, that worked when not uploading too many images in parallel
20:00:46 but yes, 1h should be enough for any healthy cloud
20:00:50 60m or 6 hours?
20:01:03 1h is the default from the sdk
20:01:21 I bumped it to 10h on nb01/02 temporarily
20:01:47 aha, so you patched the builder to specify a timeout when calling the upload method
20:01:52 got it, thx. (so many timeout values)
20:01:58 ack
20:02:27 we are out of time
20:02:28 anyway, we were seeing upwards of a 5.5 hour delay for images to become active there when uploading manually
20:02:34 thanks clarkb!
20:02:36 I'm going to end it here but feel free to continue the conversation
20:02:44 I just don't want to keep people from lunch/dinner/breakfast as necessary
20:02:48 thank you all!
20:02:50 #endmeeting
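For reference on the upload timeout discussion above, a hedged sketch of passing an explicit timeout to an image upload via openstacksdk's cloud layer; the image name, file, and cloud name are placeholders, and this only illustrates the sdk parameter, not the actual patch frickler tested on the builders:

```python
#!/usr/bin/env python3
"""Upload an image and wait longer than the sdk's 1 hour default for it to go active.

Sketch only: cloud name, image name, and file path are placeholders.
"""
import openstack

# cloud name as defined in clouds.yaml (placeholder)
conn = openstack.connect(cloud="rax-iad")

image = conn.create_image(
    "opendev-test-image",
    filename="/tmp/opendev-test-image.vhd",
    wait=True,          # block until glance reports the image active
    timeout=10 * 3600,  # 10 hours instead of the default 3600 seconds
)
print(image.status)
```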