Tuesday, 2023-08-15

16:53 -opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers
17:42 -opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers
18:59 <clarkb> almost meeting time
19:00 <fungi> ahoy!
19:01 <clarkb> #startmeeting infra
19:01 <opendevmeet> Meeting started Tue Aug 15 19:01:05 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01 <opendevmeet> The meeting name has been set to 'infra'
19:01 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/FRMRI2B7KC2HPOC5VTJYQBKARGCTY5GA/ Our Agenda
19:01 <clarkb> #topic Announcements
19:01 <clarkb> I'm back in my normal timezone. Other than that I didn't have anything to announce
19:02 <clarkb> oh! the openstack feature freeze happens right at the end of this month / start of September
19:02 <fungi> yeah, that's notable
19:02 <clarkb> something to be aware of as we make changes, to avoid major impact to openstack
19:03 <fungi> it will be the busiest time for our zuul resources
19:03 <clarkb> sounds like openstack continues to struggle with reliability problems in the gate too, so we may be asked to help diagnose issues
19:04 <clarkb> I expect they will become more urgent in feature freeze crunch time
19:04 <clarkb> #topic Topics
19:04 <clarkb> #topic Google Account for Infra root
19:05 <clarkb> our infra root email address got notified that an associated google account is on the chopping block December 1st if there is no activity before then
19:05 <clarkb> assuming we decide to preserve the account we'll need to do some sort of activity every two years or it will be deleted
19:05 <fungi> any idea what it's used for?
19:05 <fungi> or i guess not used for
19:05 <clarkb> deleted account names cannot be reused so we don't have to worry about someone taking it over at least
19:05 <fungi> formerly used for long ago
19:06 <clarkb> I'm bringing it up here in hopes someone else knows what it was used for :)
19:06 <corvus> i think they retire the address so it may be worth someone logging in just in case we decide it's important later
19:06 <corvus> but i don't recall using it for anything, sorry
19:07 <clarkb> ya, if we log in we'll reset that 2 year counter and the account itself may have clues for what it was used for
19:08 <clarkb> I can try to do that before our next meeting and hopefully have new info to share then
19:08 <clarkb> if anyone else recalls what it was for later please share
19:08 <clarkb> but we have time to sort this out for now at least
19:08 <clarkb> #topic Bastion Host Updates
19:09 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:09 <clarkb> this topic could still use root review by others
19:09 <fungi> i haven't found time to look yet
19:10 <clarkb> I also thought we may need to look at upgrading ansible on the bastion but I think ianw may have already taken care of that
19:10 <clarkb> double checking probably a good idea though
19:10 <clarkb> #topic Mailman 3
19:10 <clarkb> fungi: it looks like the mailman 3 vhosting stuff is working as expected now. I recall reviewing some mm3 changes though I'm not sure where we've ended up since
19:12 <fungi> so the first three changes in topic:mailman3 still need more reviews but should be safe to merge
19:12 <fungi> the last change in that series currently is the version upgrade to latest mm3 releases
19:13 <clarkb> #link https://review.opendev.org/q/topic:mailman3+status:open
19:13 <fungi> i have a held node prepped to do another round of import testing, but got sideswiped by other tasks and haven't had time to run through those yet
19:13 <clarkb> ok, I've got etherpad and gitea prod updates to land too. After the meeting we should make a rough plan for landing some of these things and pushing forward
19:14 <fungi> the upgrade is probably also safe to merge, but has no votes and i can understand if reviewers would rather wait until i've tested importing on the held node
19:14 <clarkb> that does seem like a good way to exercise the upgrade
19:15 <clarkb> to summarize: no known issues. General update changes needed. Upgrade change queued after general updates. Import testing needed for further migration
19:15 <fungi> also the manual steps for adding the django domains/postorius mailhost associations are currently recorded in the migration etherpad
19:15 <fungi> i'll want to add those to our documentation
19:15 <clarkb> ++
19:16 <fungi> they involve things like ssh port forwarding so you can connect to the django admin url from localhost
19:16 <fungi> and need the admin credentials from the ansible inventory
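For reference, the port-forwarding step looks roughly like the sketch below; the host name and local port are assumptions for illustration, not details confirmed in the meeting:

    # forward a local port to the mailman3 web (django) admin interface,
    # then browse to http://localhost:8000/ and log in with the admin
    # credentials recorded in the ansible inventory
    ssh -L 8000:localhost:8000 root@lists01.opendev.org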
19:17 <fungi> once i've done another round of import tests and we merge the version update, we should be able to start scheduling more domain migrations
19:17 <clarkb> once logged in the steps are a few button clicks, so pretty straightforward
19:17 <fungi> yup
19:17 <fungi> i just wish that part were easier to script
19:17 <clarkb> ++
19:18 <fungi> from discussion on the mm3 list, it seems there is no api endpoint in postorius for the mailhost association step, which would be the real blocker (we'd have to hack up something based on the current source code for how postorius's webui works)
19:18 <clarkb> let's coordinate to land some of these changes after the meeting and then figure out docs and an upgrade as followups
19:18 <fungi> sounds great, thanks! that's all i had on this topic
19:19 <clarkb> #topic Gerrit Updates
19:19 <clarkb> I've kept this agenda item for two reasons. First, I'm still hoping for some feedback on dealing with the replication task leaks. Second, I'm hoping to start testing the 3.7 -> 3.8 upgrade very soon
19:20 <clarkb> For replication task leaks, the recent restart where we moved those aside/deleted them showed that that is a reasonable thing to do
19:20 <clarkb> we can either script that or simply stop bind mounting the dir where they are stored so docker rms them for us
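A rough sketch of what scripting that might look like; the paths are assumptions (the replication plugin persists its tasks under the site's data/replication directory by default), so verify against the actual container bind mounts before using:

    # with gerrit stopped, move the replication plugin's persisted task
    # files aside so they are not reloaded on the next start
    SITE=/home/gerrit2/review_site              # adjust to the real site path
    STASH=/home/gerrit2/replication-tasks-$(date +%F)
    mkdir -p "$STASH"
    mv "$SITE"/data/replication/ref-updates/* "$STASH"/ 2>/dev/null || true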
19:21 <clarkb> For gerrit upgrades the base upgrade job is working (and has been working) but we need to go through the release notes and test things on a held node, like reverts (if possible) and any new features or behaviors that concern us
19:21 <frickler> did you ever see my issue with starred changes?
19:22 <clarkb> frickler: I don't think I did
19:22 <frickler> seems I'm unable to view them because I have more than 1024 starred
19:22 <clarkb> interesting. This is querying the is:starred listing?
19:22 <frickler> yes, getting "Error 400 (Bad Request): too many terms in query: 1193 terms (max = 1024)"
19:22 <frickler> and similar via gerrit cli
19:23 <fungi> ick
19:23 <clarkb> have we brought this up on their mailing list or via a bug?
19:23 <frickler> I don't think so
19:23 <clarkb> ok, I can write an email to repo-discuss if you prefer
19:23 <frickler> I also use this very rarely, so it may be an older regression
19:23 <clarkb> ack
19:23 <clarkb> it's also possible that is a configurable limit
19:24 <frickler> that would be great, then I could at least find out which changes are starred and unstar them
19:24 <frickler> a maybe related issue is that stars aren't shown in any list view
19:24 <frickler> just in the view of the change itself
19:24 <clarkb> good point. I can ask for hints on methods for finding subsets for unstarring
19:25 <frickler> thx
19:26 <clarkb> #topic Server Upgrades
19:26 <clarkb> No new servers booted recently that I am aware of
19:26 <corvus> index.maxTerms
19:26 <corvus> frickler: ^
19:26 <clarkb> However we had trouble with zuul executors running out of disk today. The underlying issue was that /var/lib/zuul was not a dedicated fs with extra space
19:27 <clarkb> so a reminder to all of us replacing servers and reviewing server replacements to check for volumes/filesystem mounts
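As a concrete check on a newly launched server, something like the following confirms whether a work dir sits on its own filesystem and how much headroom it has (the /var/lib/zuul path is just the executor example from this meeting):

    # verify the executor work dir is a separate mount (or at least has room)
    findmnt /var/lib/zuul
    df -h /var/lib/zuul /opt /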
19:27 <fungi> those got replaced over the first two weeks of july, so it's amazing we didn't start seeing problems before now
19:27 <corvus> https://gerrit-review.googlesource.com/Documentation/config-gerrit.html  (i can't seem to deep link it, so search for maxTerms there)
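For reference, the setting corvus points to is bumped in gerrit.config along these lines; the 2048 value is only illustrative, not something agreed in the meeting:

    [index]
            # default is 1024; raising it requires a gerrit restart
            maxTerms = 2048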
19:28 <clarkb> #topic Fedora Cleanup
19:29 <corvus> re mounts -- i guess the question is what do we want to do in the future?  update launch-node to have an option to switch around /opt?  or make it a standard part of server documentation?  (but then we have to remember to read the docs which we normally don't have to do for a simple server replacement)
19:29 <clarkb> #undo
19:29 <opendevmeet> Removing item from minutes: #topic Fedora Cleanup
19:29 <clarkb> corvus: maybe a good place to annotate that info is in our inventory file since I at least tend to look there in order to get the next server in sequence
19:29 <corvus> that's a good idea
19:29 <clarkb> because you are right that it will be easy to miss in proper documentation
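Purely illustrative, one way such an annotation could look in a YAML inventory; the host name, address, and variable layout are hypothetical and not the actual opendev inventory:

    # hypothetical inventory entry carrying the extra-mount reminder
    ze01.opendev.org:
      # NOTE: executors need a large filesystem for the build dir; in rax
      # this comes from the ephemeral disk mounted at /opt, elsewhere
      # attach a cinder volume at launch time
      ansible_host: 203.0.113.10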
19:30 <fungi> launch node does also have an option to add volumes, i think, which would be more portable outside rackspace
19:30 <clarkb> fungi: yes, it does volume management and can do arbitrary mounts for volumes
19:30 <corvus> or...
19:30 <fungi> so if we moved the executors to, say, vexxhost or ovh we'd want to do it that way presumably
19:31 <corvus> we could update zuul executors to bind-mount in /opt as /var/lib/zuul, and/or reconfigure them to use /opt/zuul as the build dir
19:31 <clarkb> /opt/zuul is a good idea actually
19:31 <corvus> (one of those violates our "same path in container" rule, but the second doesn't)
19:31 <clarkb> since that reduces moving parts and keeps things simple
19:31 <corvus> yeah, /opt/zuul would keep the "same path" rule, so is maybe the best option...
19:32 <corvus> i like that.
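A minimal sketch of what the /opt/zuul option could look like in the executor's compose file, assuming the build dir is also pointed at /opt/zuul in zuul.conf; the file layout and image reference here are illustrative, not the actual opendev change:

    # docker-compose.yaml (executor) - same path inside and outside the
    # container, so the "same path in container" rule still holds
    services:
      executor:
        image: quay.io/zuul-ci/zuul-executor
        volumes:
          - /etc/zuul:/etc/zuul
          - /opt/zuul:/opt/zuul   # build/work dir on the large /opt filesystem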
19:32 <fungi> it does look like launch/src/opendev_launch/make_swap.sh currently hard-codes /opt as the mountpoint
19:33 <clarkb> yup, and swap as the other portion
19:33 <fungi> so would need patching if we wanted to make it configurable
19:33 <clarkb> I like the simplicity of /opt/zuul
19:34 <fungi> thus patching the compose files seems more reasonable
19:34 <fungi> and if we're deploying outside rax we just need to remember to add a cinder volume for /opt
19:34 <frickler> ack, that's also what my zuul aio uses
19:34 <frickler> or have large enough /
19:34 <fungi> (unless the flavor in that cloud has tons of rootfs and we feel safe using that instead)
19:35 <fungi> yeah, exactly
19:36 <frickler> coming back to index.maxTerms, do we want to try bumping that to 2k?
19:36 <frickler> or 1.5k?
19:37 <frickler> I think it'll likely require a restart, though?
19:37 <clarkb> at the very least we can probably bump it temporarily allowing you to adjust your star count
19:37 <clarkb> yes, it will require a restart
19:37 <frickler> ok, I'll propose a patch and then we can discuss the details
19:38 <clarkb> I don't know what the memory scaling is like for terms but that would be my main concern
19:38 <clarkb> #topic Fedora Cleanup
19:38 <clarkb> tonyb and I looked at doing this the graceful way and then upstream deleted the packages anyway
19:39 <clarkb> I suspect this means we can forge ahead and simply remove the image type since they are largely non functional due to changes upstream of us
19:39 <clarkb> then we can clear out the mirror content
19:39 <clarkb> any concerns with that? I know nodepool recently updated its jobs to exclude fedora
19:39 <clarkb> I think devstack has done similar cleanup
19:39 <fungi> it's possible people have adjusted the urls in jobs to grab packages from the graveyard, but unlikely
19:40 <corvus> zuul-jobs is mostly fedora free now due to the upstream yank
19:40 <clarkb> I'm hearing we should just go ahead and remove the images :)
19:40 <clarkb> I'll try to push that forward this week too
19:40 <clarkb> cc tonyb if still interested
19:41 <corvus> (also it's worth specifically calling out that there is now no fedora testing in zuul-jobs, meaning that the base job playbooks, etc, could break for fedora at any time)
19:41 <clarkb> even the software factory third party CI which uses fedora is on old nodes and not running jobs properly
19:41 <corvus> so if anyone adds fedora images back to opendev, please make sure to add them to zuul-jobs for testing first before using them in any other projects
19:41 <clarkb> ++
19:41 <clarkb> and maybe software factory is interested in updating their third party ci
19:41 <fungi> maybe the fedora community wants to run a third-party ci
19:42 <fungi> since they do use zuul to build fedora packages
19:42 <fungi> (in a separate sf-based zuul deployment from the public sf, as i understand it)
19:42 <fungi> so it's possible they have newer fedora on theirs than the public sf
19:43 <clarkb> ya, we can bring it up with the sf folks and take it from there
19:43 <fungi> or bookwar maybe
19:43 <corvus> for the base jobs roles, first party ci would be ideal
19:43 <fungi> certainly
19:44 <corvus> and welcome.  just to be clear.
19:44 <fungi> just needs someone interested in working on that
19:44 <corvus> base job roles aren't completely effectively tested by third party ci
19:44 <corvus> (there is some testing, but not 100% coverage, due to the nature of base job roles)
19:46 <clarkb> good to keep in mind
19:46 <clarkb> #topic Gitea 1.20
19:47 <clarkb> I sorted out the access log issue. Turns out there were additional breaking changes that weren't documented in the release notes
19:47 <fungi> gotta love those
19:47 <clarkb> and you need different configs for access logs now. I still need to cross check their format against production, since the breaking change they did document is that the format may differ
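For orientation, the rewritten logger configuration in Gitea 1.20 roughly takes this shape in app.ini; the exact keys and section names below are a best-effort sketch and should be checked against the 1.20 logging docs rather than taken as the change that was made:

    ; app.ini - access logging after the 1.20 logger rewrite (sketch only)
    [log]
    ENABLE_ACCESS_LOG = true
    ; access output is routed to a named writer section
    logger.access.MODE = access-file

    [log.access-file]
    MODE = file
    FILE_NAME = access.log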
19:47 <clarkb> Then I've got a whole list of TODOs in the commit message to work through
19:48 <clarkb> in general though I just need a block of focused time to page all this back in and get up to speed on it
19:48 <clarkb> but good news, some progress here
19:48 <clarkb> #topic Etherpad 1.9.1
19:48 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/887006 Etherpad 1.9.1
19:48 <clarkb> Looks like fungi and ianw have tested the held node
19:49 <fungi> yes, seems good to me
19:49 <clarkb> cool, so we can probably land this one real soon
19:49 <clarkb> as mentioned earlier we should sync up on a rough plan for some of these and start landing them
19:50 <clarkb> #topic Python Container Updates
19:50 <clarkb> we discovered last week when trying to sort out zookeeper installs on bookworm that the java packaging for bookworm is broken, but not in a consistent manner
19:50 <clarkb> it seems to run package setup in different orders depending on which packages you have installed, and it only breaks in one order
19:51 <clarkb> testing has package updates to fix this but they haven't made it back to bookworm yet. For zookeeper installs we are pulling the affected package from testing.
19:51 <clarkb> I think the only service this currently affects is gerrit
19:51 <clarkb> and we can probably take our time upgrading gerrit, waiting for bookworm to be fixed properly
19:51 <clarkb> but be aware of that if you are doing java things on the new bookworm images
19:51 <corvus> #link https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129
19:53 <clarkb> otherwise I think we are as ready as we can be migrating our images to bookworm. Sounds like zuul and nodepool plan to do so after their next release
19:53 <corvus> clarkb: ... but local testing with a debian:bookworm image had the ca certs install working somehow...
19:53 <corvus> so ... actually... we might be able to build bookworm java app containers?
19:53 <clarkb> corvus: it appears to be related to the packages already installed and/or being installed affecting the order of installation
19:54 <clarkb> corvus: that's true, we may be able to build the gerrit containers and sidestep the issue
19:54 <corvus> (but nobody will know why they work :)
19:54 <clarkb> specifically, if the jre package is set up before the ca-certificates-java package it works. But if we go in the other order it breaks
19:55 <clarkb> the jre package depends on the certificates package so you can't do separate install invocations between them
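A rough sketch of the "pull the affected package from testing" workaround mentioned above, for a bookworm-based image; the pinning style and package names are illustrative assumptions, not the exact change used for the zookeeper installs:

    # Dockerfile fragment (illustrative): install the fixed
    # ca-certificates-java from testing before the jre pulls it in
    FROM docker.io/library/debian:bookworm
    RUN echo 'deb http://deb.debian.org/debian testing main' \
          > /etc/apt/sources.list.d/testing.list \
        && apt-get update \
        && apt-get install -y -t testing ca-certificates-java \
        && apt-get install -y default-jre-headless \
        && rm -rf /var/lib/apt/lists/*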
19:55 <clarkb> #topic Open Discussion
19:55 <clarkb> Anything else?
19:56 <fungi> forgot to add to the agenda: rackspace issues
19:56 <fungi> around the end of july we started seeing frequent image upload errors to the iad glance
19:57 <fungi> that led to filling up the builders and they ceased to be able to update images anywhere for about 10 days
19:57 <fungi> i cleaned up the builders but the issue with glance in iad persists (we've paused uploads for it)
19:57 <fungi> that still needs more looking into, and probably a ticket opened
19:58 <clarkb> ++ I mentioned this last week, but I think our best bet is to engage rax and show them how the other rax regions differ (if not entirely in behavior, at least by degree)
19:58 <fungi> separately, we have a bunch of stuck "deleting" nodes in multiple rackspace regions (including iad i think), taking up the majority of the quotas
19:59 <fungi> frickler did some testing with a patched builder and increasing the hardcoded 60-minute timeout for images to become active did work around the issue
19:59 <fungi> for glance uploads i mean
19:59 <fungi> but clearly that's a pathological case and not something we should bother actually implementing
20:00 <fungi> and that's all i had
20:00 <frickler> yes, that worked when uploading not too many images in parallel
20:00 <frickler> but yes, 1h should be enough for any healthy cloud
20:00 <corvus> 60m or 6 hour?
20:01 <frickler> 1h is the default from sdk
20:01 <frickler> I bumped it to 10h on nb01/02 temporarily
20:01 <fungi> aha, so you patched the builder to specify a timeout when calling the upload method
20:01 <corvus> got it, thx.  (so many timeout values)
20:01 <frickler> ack
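For context, the timeout in question is the wait timeout openstacksdk applies when uploading an image and waiting for it to become active; a hedged sketch of the equivalent call is below, with the cloud name and image details as illustrative assumptions and the 10 hour value mirroring the temporary bump mentioned above:

    # sketch: upload an image and wait longer than the sdk's 1h default
    # for it to become active (uses openstacksdk's cloud-layer API)
    import openstack

    conn = openstack.connect(cloud='rax')  # cloud name is illustrative
    conn.create_image(
        'ubuntu-jammy-test',
        filename='ubuntu-jammy.vhd',
        wait=True,
        timeout=36000,  # 10 hours instead of the default 3600 seconds
    )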
20:02 <clarkb> we are out of time
20:02 <fungi> anyway, we were seeing upwards of a 5.5 hour delay for images to become active there when uploading manually
20:02 <fungi> thanks clarkb!
20:02 <clarkb> I'm going to end it here but feel free to continue conversation
20:02 <clarkb> I just don't want to keep people from lunch/dinner/breakfast as necessary
20:02 <clarkb> thank you all!
20:02 <clarkb> #endmeeting
20:02 <opendevmeet> Meeting ended Tue Aug 15 20:02:50 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:02 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.html
20:02 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.txt
20:02 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.log.html
