19:01:16 <clarkb> #startmeeting infra
19:01:17 <openstack> Meeting started Tue Feb 23 19:01:16 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 <openstack> The meeting name has been set to 'infra'
19:01:23 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000185.html Our Agenda
19:02:09 <clarkb> #topic Announcements
19:02:14 <clarkb> I did not have any announcements
19:02:24 <clarkb> #topic Actions from last meeting
19:02:30 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-16-19.01.txt minutes from last meeting
19:03:02 <clarkb> corvus has an action to unfork our jitsi meet installation
19:03:17 <clarkb> I haven't seen any changes for that and corvus was busy with the zuul v4 release so I expect that has not happened yet
19:04:25 <clarkb> corvus: ^ should I readd the action?
19:04:31 <corvus> yep so sorry
19:05:11 <clarkb> #action corvus unfork our jitsi meet installation
19:05:20 <clarkb> #topic Priority Efforts
19:05:27 <clarkb> #topic OpenDev
19:05:54 <clarkb> fungi: ianw: I guess we hit the missing accounts index lock problem again over the weekend
19:06:20 <clarkb> lslocks showed it was gone and fungi responded to the upstream gerrit bug pointing out that we saw it again
19:06:29 <ianw> yes, fungi did all the investigation, but we did restart it sunday/monday during quiet time
19:07:08 <clarkb> There hasn't been much movement on the bug since I originally filed it. I wonder if we should bring it up on the mailing list to see if anyone else has seen this behavior
19:08:17 <clarkb> I wonder if we can suggest it try to relock it
19:08:25 <fungi> and yes, the restart once again mitigated the error
19:08:30 <clarkb> according to lslocks there is no lock for the path so it isn't like something else has taken the lock away
19:08:48 <fungi> all i could figure is something is happening on the fs
19:09:19 <fungi> but i couldn't find any logs to indicate what that might have been
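(For reference, a minimal sketch of the kind of lslocks check being discussed; the exact Gerrit site path and index directory name are assumptions, not taken from the log.)

    # list all file locks without truncating paths, then look for the Lucene
    # write.lock under the accounts index; no output matches the symptom above
    sudo lslocks --notruncate | grep 'index/accounts'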
19:10:11 <clarkb> ok, I guess we continue to monitor it and if we find time bring it up with upstream on the mailing list to see if anyone else suffers this as well
19:10:51 <clarkb> Next up is the account inconsistencies. I have not yet found time to check which of the unhappy accounts are active. But I do still like fungi's idea of generating that list, retiring the others and sorting out only the subset of active accounts
19:10:54 <fungi> it's the first time we've seen it in ~4 months
19:10:57 <clarkb> that should greatly simplify the todo list there
19:11:07 <fungi> or was it 3? anyway, it's been some time
19:11:48 <fungi> yeah, it's the "if a tree falls in the forest and nobody's ever going to log into it again anyway" approach ;)
19:13:17 <clarkb> With gitea OOMs I tried to watch manage-projects as it ran yesterday as part of the "all the jobs" run for the zm01.opendev.org deployment. And there was a slight jump in resource utilization but things looked happy
19:13:32 <clarkb> that makes me suspect the "we are our own DoS" theory less
19:13:57 <clarkb> However, the dstat recording change for system-config-run jobs did eventually land yesterday so we can start to try and look at that sort of info for future updates
19:14:09 <clarkb> and that applies to all the system-config-run jobs, not just gitea.
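(As a rough illustration of what that recording can capture; the actual options used by the system-config-run jobs may differ.)

    # sample cpu, memory, disk and network stats once per second and also
    # write a csv that can be collected as a job artifact
    dstat --time --cpu --mem --disk --net --output dstat.csv 1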
19:14:43 <clarkb> Any other opendev related items or should we move on?
19:16:11 <clarkb> #topic Update Config Management
19:16:34 <clarkb> Are there any configuration management updates to call out? I haven't seen any, but have had plenty of distractions so may have missed something
19:18:19 <clarkb> #topic General Topics
19:18:27 <clarkb> #topic OpenAFS Cluster Status
19:18:55 <clarkb> Last we checked in on this subject all the servers had their openafs packages upgraded but we were still waiting on operating system upgrades. Anything new on this?
19:18:59 <ianw> i haven't got to upgrades for this yet
19:19:15 <clarkb> ok we can move on then
19:19:20 <clarkb> #topic Bup and Borg
19:19:39 <clarkb> At this point I think this topic might be more of just "Borg" but we're continuing to refine and improve the borg backups
19:20:02 <ianw> yep i need to do a final sweep through the hosts and make sure the bup jobs have stopped
19:20:25 <ianw> and then we can shutdown the bup servers and decide what to do with the storage
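(A hedged sketch of what that sweep might look like on a host; the assumption that bup was driven from root's crontab is illustrative.)

    # check for leftover bup cron entries and any bup processes still running
    sudo crontab -l | grep -i bup
    ps aux | grep '[b]up'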
19:21:08 <clarkb> in the past we've held on to old bup backup volumes when rotating in new ones. Probably want to keep them around for a bit to ensure we've got that overlap here?
19:22:18 <ianw> yep, we can keep for a bit.  practically, last time we tried to retrieve anything we'd sorted everything out before the bup processes had even completed extracting a tar :)
19:22:37 <clarkb> ya, that is a good point
19:23:03 <clarkb> anything else to add on this item?
19:23:19 <ianw> nope, next week might be the last time it's a thing of interest :)
19:24:05 <clarkb> excellent
19:24:15 <clarkb> #topic Picking up steam on server upgrades
19:24:53 <clarkb> I've jumped into trying to upgrade the operating systems under zuul, nodepool, and zookeeper
19:25:15 <fungi> thanks!
19:25:22 <clarkb> so far zm01.opendev.org has been replaced and seems happy. I've been working on replacing 02-08 this morning so expect changes for that after the meeting
19:25:34 <ianw> ++
19:25:43 <clarkb> Then my plan is to look at executors, launchers, zookeeper, and zuul scheduler (likely in that order)
19:26:09 <clarkb> I think that order is roughly from easiest to most difficult and working through the steps will hopefully make the more difficult steps easier :)
19:26:35 <clarkb> There are other services that need this treatment too. If you've got time and/or interest please jump in too :)
19:26:52 <clarkb> some of them will require puppet be rewritten to ansible as well. These are likely to be the most painful ones
19:27:01 <clarkb> but maybe doing that sort of rewrite is more interesting to some
19:28:01 <clarkb> Anything else to add to this item?
19:28:37 <fungi> not from me
19:28:54 <clarkb> #topic Upgrading refstack.o.o
19:29:07 <clarkb> ianw: kopecmartin: are there any changes we can help review or updates to the testing here?
19:29:53 <ianw> last update for me was we put some nodes on hold after finding the testinfra wasn't actually working as well as we'd hoped
19:30:42 <ianw> there were some unicode errors which i *think* got fixed too
19:31:26 <clarkb> ya I think some problems were identified in the service itself too (a bonus of better testing)
19:32:18 <clarkb> Sounds like we're still largely waiting for kopecmartin to figure out what is going on though?
19:33:04 <ianw> i think so yes; kopecmartin -- lmk if anything needs actioning
19:33:41 <clarkb> thanks for the update
19:33:55 <clarkb> #topic Bridge disk space
19:34:24 <clarkb> We're running low on disk space on bridge. I did some quick investigating yesterday and the three locations where we seem to consume the most space are /var/log, /home, and /opt
19:35:24 <clarkb> I think there may be some cleanup we can do in /var/log/ansible where we've leaked some older log files. /home has miscellaneous content in our various homedirs, so maybe we can each take a look and clean up unneeded files? And /opt seems to have a number of disk images on it, as well as some stuff for ianw
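(For illustration, the sort of quick survey that investigation might have involved; the paths are just the ones named above.)

    # per-tree totals, staying on the root filesystem
    sudo du -xsh /var/log /home /opt
    # biggest offenders under the leaked ansible logs
    sudo du -xsh /var/log/ansible/* | sort -h | tail -n 20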
19:35:50 <clarkb> mordred: I think the images in /opt were when you were trying to do builds for focal?
19:36:08 <clarkb> we ended up not using those iirc because we couldn't consistently build them with nodepool due to the way boot from volume treats images
19:36:20 <clarkb> should we just clean those up? or maybe remove the raw and vhd versions and keep qcow2?
19:36:51 <clarkb> (as a side note I used the cloud provided focal images for zuul mergers since we seemed to abandon the build our own idea for the time being)
19:37:58 <ianw> yeah i think they can go
19:38:12 <clarkb> in any case I suspect we'll run out of disk there in the near future so cleanup that can be made would be great.
19:38:38 <clarkb> if infra-root can check their homedirs and ianw can look at /opt/ianw I can take a look at the images and maybe start by removing the raw/vhd copies first
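(A minimal sketch of that first cleanup pass, assuming the images sit directly under /opt with those extensions; the actual filenames are not shown in the log.)

    # see what the duplicate formats are costing before removing anything
    du -sh /opt/*.raw /opt/*.vhd /opt/*.qcow2
    # drop the raw and vhd copies, keep the qcow2 originals
    rm -i /opt/*.raw /opt/*.vhd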
19:38:54 <fungi> apparently the launch-env in my homedir accounts for 174M
19:39:20 <fungi> but otherwise all cleaned up now
19:39:45 <clarkb> thanks!
19:39:51 <ianw> i can't remember what /opt/ianw was about, i'll clear it out
19:40:12 <clarkb> And I think that was all I had on the agenda
19:40:16 <clarkb> #topic Open Discussion
19:40:18 <clarkb> Anything else?
19:40:58 <fungi> i'm still struggling to get git-review testing working for python 3.9 so i can tag 2.0.0
19:41:28 <fungi> after discussion yesterday, i may have to rework more of how gerrit is being invoked in the test setup
19:41:51 <fungi> something is mysteriously causing the default development credentials to not work
19:42:15 <ianw> you are using the upstream image right?
19:42:27 <fungi> official warfile, yes
19:42:28 <ianw> oh, no, the upstream .jar ... not their container
19:43:03 <fungi> we could redo git-review's functional tests to use containerized gerrit, but that seemed like a much larger overhaul
19:44:06 <fungi> right now it's designed to set up a template gerrit site and then start a bunch of parallel per-test gerrits from that in different temporary directories
19:44:09 <clarkb> there are examples of similar in the gerritlib tests
19:44:22 <clarkb> and ya you'd probably want to switch it to using a project per test rather than gerrit per test
19:45:07 <fungi> right, and that gets into a deep overhaul of git-review's testing, which i was trying to avoid right now (i don't really have time for that, but maybe i don't have time for this either)
19:45:46 <fungi> alternatives are to say we support python 3.9 but not test with it, or say we don't support python 3.9 because we're unable to test with it
19:46:44 <ianw> it's sort of "we're unable to test git-review" in general ATM right?
19:46:47 <fungi> or maybe try to get 3.9 tests going on bionic instead of focal
19:47:13 <fungi> ianw: it's that our gerrit tests rely on gerrit 2.11 which focal's openssh can't connect to
19:47:32 <fungi> so we test up through python 3.8 just fine
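(As an aside, a possible and untested workaround sketch for poking at that ssh failure directly would be to re-enable whichever legacy algorithms the old Gerrit sshd still needs; the specific algorithm names, port, and user below are assumptions.)

    ssh -p 29418 \
        -o KexAlgorithms=+diffie-hellman-group14-sha1 \
        -o HostKeyAlgorithms=+ssh-rsa \
        -o PubkeyAcceptedKeyTypes=+ssh-rsa \
        testuser@localhost gerrit version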
19:47:38 <ianw> we could connect it to the system-config job; that has figured out the "get a gerrit running" bit ... but doesn't help local testing
19:48:22 <fungi> also doesn't help the "would have to substantially redo git-review's current test framework design" part
19:48:24 <mordred> clarkb: oh sorry - re: images - I think the focal images in /opt on bridge are the ones I manually built and uploaded for control plane things?
19:48:38 <clarkb> mordred: yes, but then we didn't really use them because boot from volume is weird iirc
19:48:40 <mordred> but - honestly - I don't see any reason to keep them around
19:48:59 <clarkb> that was the precursor to having nodepool do it, then nodepool did it, then we undid the nodepool
19:49:49 <fungi> i assumed the path of least resistance was to update the gerrit version we're testing against to one focal can ssh into, but 2.11 was the last version to keep ssh keys in the rdbms, which was how the test account was getting bootstrapped
19:50:15 <fungi> so gerrit>2.11 means changing how we bootstrap our test user
19:50:33 <fungi> but as usually happens, that's a rabbit hole to which i have yet to find the bottom
19:51:03 <clarkb> maybe bad idea: you could vendor an all-users repo state
19:51:08 <clarkb> and start gerrit with that
19:51:18 <fungi> that's something i considered, yeah
19:51:35 <fungi> though we'll need to vendor a corresponding ssh public key as well i suppose
19:51:42 <fungi> er, public/private keypair
19:51:53 <clarkb> or edit the repo directly before starting it with a generated value
19:52:06 <clarkb> but gerritlib and others bootstrap using a dev mode that should work
19:52:13 <clarkb> just need to sort out why it doesn't
19:52:53 <fungi> right, i thought working out how to interact with the rest api would be 1. easier than reverse-engineering undocumented notedb structure, and 2. an actual supported stable interface so we don't find ourselves right back here the next time they decide to tweak some implementation detail of the db
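(A rough sketch of what that REST bootstrap could look like, assuming the test site runs with auth.type = DEVELOPMENT_BECOME_ANY_ACCOUNT and an admin account with an HTTP password already exists; the credentials, port, and key path are made up for illustration.)

    curl -s -X PUT --user admin:secret \
        -H 'Content-Type: application/json' \
        -d '{"name": "Test User",
             "email": "test@example.com",
             "ssh_key": "'"$(cat /tmp/id_rsa.pub)"'",
             "http_password": "secret"}' \
        http://localhost:8080/a/accounts/testuser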
19:53:30 <fungi> anyway, it seems like something about how the test framework is initializing and then later starting gerrit might be breaking dev mode
19:53:46 <fungi> so that's the next string to tug on
19:54:14 <fungi> i'll see if it could be as simple as calling java directly instead of trying to use the provided gerrit.sh initscript
19:54:36 <ianw> i can try like actually running it today instead of just making red-herring comments on diffs and see if i can see anything
19:55:05 <fungi> ianw: a tip is to comment out all the addCleanup() calls and then ask tox to run a single test
19:55:28 <fungi> i've abused that to get a running gerrit exactly how the tests try to run it
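(Concretely, that trick might look something like the following; the test id is illustrative rather than the real name, and it assumes the tox env passes its positional args through to the test runner.)

    # temporarily comment out the addCleanup() calls in the test base class,
    # then run a single functional test so its gerrit site is left behind
    tox -e py38 -- tests.test_git_review.GitReviewTestCase.test_cloned_repo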
19:56:38 <fungi> right now the setup calls init with --no-auto-start, and then reindex, and then copies the resulting site and runs gerrit from the copies in daemon mode via gerrit.sh, which is rather convoluted
19:57:07 <fungi> and supplies custom configs to each site copy with distinct tcp ports
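(In rough outline, the sequence being described is something like the following; the paths are placeholders, not the exact test code.)

    # build a template site once, without starting the daemon
    java -jar gerrit.war init -d /tmp/gerrit-template --batch --no-auto-start
    java -jar gerrit.war reindex -d /tmp/gerrit-template
    # per test: copy the template, tweak ports in etc/gerrit.config, then start it
    cp -a /tmp/gerrit-template /tmp/gerrit-site-1
    /tmp/gerrit-site-1/bin/gerrit.sh start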
19:58:19 <clarkb> ok, let me know if I can help.
19:58:28 <clarkb> I may be partially responsible for the old setup :)
19:58:35 <clarkb> and now we're just about at time
19:58:48 <clarkb> feel free to continue discussion on the mailing list or in #opendev
19:58:56 <clarkb> and thank you everyone for your time
19:58:58 <clarkb> #endmeeting