19:01:16 #startmeeting infra
19:01:17 Meeting started Tue Feb 23 19:01:16 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 The meeting name has been set to 'infra'
19:01:23 #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000185.html Our Agenda
19:02:09 #topic Announcements
19:02:14 I did not have any announcements
19:02:24 #topic Actions from last meeting
19:02:30 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-16-19.01.txt minutes from last meeting
19:03:02 corvus has an action to unfork our jitsi meet installation
19:03:17 I haven't seen any changes for that and corvus was busy with the zuul v4 release so I expect that has not happened yet
19:04:25 corvus: ^ should I readd the action?
19:04:31 yep so sorry
19:05:11 #action corvus unfork our jitsi meet installation
19:05:20 #topic Priority Efforts
19:05:27 #topic OpenDev
19:05:54 fungi: ianw: I guess we hit the missing accounts index lock problem again over the weekend
19:06:20 lslocks showed it was gone and fungi responded to the upstream gerrit bug pointing out that we saw it again
19:06:29 yes, fungi did all the investigation, but we did restart it sunday/monday during quiet time
19:07:08 There hasn't been much movement on the bug since I originally filed it. I wonder if we should bring it up on the mailing list to see if anyone else has seen this behavior
19:08:17 I wonder if we can suggest it try to relock it
19:08:25 and yes, the restart once again mitigated the error
19:08:30 according to lslocks there is no lock for the path so it isn't like something else has taken the lock away
19:08:48 all i could figure is something is happening on the fs
19:09:19 but i couldn't find any logs to indicate what that might have been
19:10:11 ok, I guess we continue to monitor it and if we find time bring it up with upstream on the mailing list to see if anyone else suffers this as well
19:10:51 Next up is the account inconsistencies. I have not yet found time to check which of the unhappy accounts are active. But I do still like fungi's idea of generating that list, retiring the others and sorting out only the subset of active accounts
19:10:54 it's the first time we've seen it in ~4 months
19:10:57 that should greatly simplify the todo list there
19:11:07 or was it 3? anyway, it's been some time
19:11:48 yeah, it's the "if a tree falls in the forest and nobody's ever going to log into it again anyway" approach ;)
19:13:17 With gitea OOMs I tried to watch manage-projects as it ran yesterday as part of the "all the jobs" run for the zm01.opendev.org deployment. There was a slight jump in resource utilization but things looked happy
19:13:32 that makes me suspect the "we are our own dos" theory less
19:13:57 However, the dstat recording change for system-config-run jobs did eventually land yesterday so we can start to try and look at that sort of info for future updates
19:14:09 and that applies to all the system-config-run jobs, not just gitea.
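
(For anyone digging into those recordings later, the general shape of such a dstat capture is sketched below. The exact flags, interval, and output path used by the system-config-run role are assumptions here, not confirmed from the change that landed.)

    # Record per-second cpu, memory, disk, and network stats to a CSV file that
    # can be collected with the job logs. Options and path are illustrative only.
    dstat --time --cpu --mem --disk --net --output /var/log/dstat-csv.log 1 &

The resulting CSV can then be pulled down with the rest of the job logs and inspected or graphed locally.
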
19:14:43 Any other opendev related items or should we move on?
19:16:11 #topic Update Config Management
19:16:34 Are there any configuration management updates to call out?
I haven't seen any, but have had plenty of distractions so may have missed something
19:18:19 #topic General Topics
19:18:27 #topic OpenAFS Cluster Status
19:18:55 Last we checked in on this subject all the servers had their openafs packages upgraded but we were still waiting on operating system upgrades. Anything new on this?
19:18:59 i haven't got to upgrades for this yet
19:19:15 ok we can move on then
19:19:20 #topic Bup and Borg
19:19:39 At this point I think this topic might be more of just "Borg" but we're continuing to refine and improve the borg backups
19:20:02 yep i need to do a final sweep through the hosts and make sure the bup jobs have stopped
19:20:25 and then we can shutdown the bup servers and decide what to do with the storage
19:21:08 in the past we've held on to old bup backup volumes when rotating in new ones. Probably want to keep them around for a bit to ensure we've got that overlap here?
19:22:18 yep, we can keep them for a bit. practically, last time we tried to retrieve anything we'd sorted everything out before the bup processes had even completed extracting a tar :)
19:22:37 ya, that is a good point
19:23:03 anything else to add on this item?
19:23:19 nope, next week might be the last time it's a thing of interest :)
19:24:05 excellent
19:24:15 #topic Picking up steam on server upgrades
19:24:53 I've jumped into trying to upgrade the operating systems under zuul, nodepool, and zookeeper
19:25:15 thanks!
19:25:22 so far zm01.opendev.org has been replaced and seems happy. I've been working on replacing 02-08 this morning so expect changes for that after the meeting
19:25:34 ++
19:25:43 Then my plan is to look at executors, launchers, zookeeper, and the zuul scheduler (likely in that order)
19:26:09 I think that order is roughly from easiest to most difficult and working through the steps will hopefully make the more difficult steps easier :)
19:26:35 There are other services that need this treatment too. If you've got time and/or interest please jump in too :)
19:26:52 some of them will require puppet to be rewritten as ansible as well. These are likely to be the most painful ones
19:27:01 but maybe doing that sort of rewrite is more interesting to some
19:28:01 Anything else to add to this item?
19:28:37 not from me
19:28:54 #topic Upgrading refstack.o.o
19:29:07 ianw: kopecmartin: are there any changes we can help review or updates to the testing here?
19:29:53 last update for me was we put some nodes on hold after finding the testinfra wasn't actually working as well as we'd hoped
19:30:42 there were some unicode errors which i *think* got fixed too
19:31:26 ya I think some problems were identified in the service itself too (a bonus for better testing)
19:32:18 Sounds like we're still largely waiting for kopecmartin to figure out what is going on though?
19:33:04 i think so yes; kopecmartin -- lmk if anything needs actioning
19:33:41 thanks for the update
19:33:55 #topic Bridge disk space
19:34:24 We're running low on disk space on bridge. I did some quick investigating yesterday and the three locations where we seem to consume the most space are /var/log, /home, and /opt
19:35:24 I think there may be some cleanup we can do in /var/log/ansible where we've leaked some older log files. /home has miscellaneous content in our various homedirs, maybe we can each take a look and clean up unneeded files? and /opt seems to have a number of disk images on it as well as some stuff for ianw
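
(A quick way to reproduce that sort of investigation, using the three paths named above; the depth and the number of results shown are illustrative choices, not what was actually run:)

    # List the largest consumers under each suspect path, biggest first.
    sudo du -xh --max-depth=2 /var/log /home /opt 2>/dev/null | sort -rh | head -n 25
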
19:35:50 mordred: I think the images in /opt were when you were trying to do builds for focal?
19:36:08 we ended up not using those iirc because we couldn't consistently build them with nodepool due to the way boot from volume treats images
19:36:20 should we just clean those up? or maybe remove the raw and vhd versions and keep the qcow2?
19:36:51 (as a side note I used the cloud provided focal images for zuul mergers since we seemed to abandon the build-our-own idea for the time being)
19:37:58 yeah i think they can go
19:38:12 in any case I suspect we'll run out of disk there in the near future so any cleanup that can be made would be great.
19:38:38 if infra-root can check their homedirs and ianw can look at /opt/ianw I can take a look at the images and maybe start by removing the raw/vhd copies first
19:38:54 apparently the launch-env in my homedir accounts for 174M
19:39:20 but otherwise all cleaned up now
19:39:45 thanks!
19:39:51 i can't remember what /opt/ianw was about, i'll clear it out
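
(If the raw/vhd copies do get dropped first, that cleanup looks roughly like the following; the filenames are hypothetical, so check what is actually sitting in /opt before deleting anything:)

    # Confirm which copy is which, keep the qcow2, and remove the larger raw/vhd variants.
    # The image names here are made up for illustration.
    qemu-img info /opt/focal-server-cloudimg-amd64.qcow2
    rm /opt/focal-server-cloudimg-amd64.raw /opt/focal-server-cloudimg-amd64.vhd
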
19:40:12 And I think that was all I had on the agenda
19:40:16 #topic Open Discussion
19:40:18 Anything else?
19:40:58 i'm still struggling to get git-review testing working for python 3.9 so i can tag 2.0.0
19:41:28 after discussion yesterday, i may have to rework more of how gerrit is being invoked in the test setup
19:41:51 something is mysteriously causing the default development credentials to not work
19:42:15 you are using the upstream image right?
19:42:27 official warfile, yes
19:42:28 oh, no, the upstream .jar ... not their container
19:43:03 we could redo git-review's functional tests to use containerized gerrit, but that seemed like a much larger overhaul
19:44:06 right now it's designed to set up a template gerrit site and then start a bunch of parallel per-test gerrits from that in different temporary directories
19:44:09 there are examples of similar in the gerritlib tests
19:44:22 and ya you'd probably want to switch it to using a project per test rather than a gerrit per test
19:45:07 right, and that gets into a deep overhaul of git-review's testing, which i was trying to avoid right now (i don't really have time for that, but maybe i don't have time for this either)
19:45:46 alternatives are to say we support python 3.9 but not test with it, or say we don't support python 3.9 because we're unable to test with it
19:46:44 it's sort of "we're unable to test git-review" in general ATM right?
19:46:47 or maybe try to get 3.9 tests going on bionic instead of focal
19:47:13 ianw: it's that our gerrit tests rely on gerrit 2.11 which focal's openssh can't connect to
19:47:32 so we test up through python 3.8 just fine
19:47:38 we could connect it to the system-config job; that has figured out the "get a gerrit running" bit ... but doesn't help local testing
19:48:22 also doesn't help the "would have to substantially redo git-review's current test framework design" part
19:48:24 clarkb: oh sorry - re: images - I think the focal images in /opt on bridge are the ones I manually built and uploaded for control plane things?
19:48:38 mordred: yes, but then we didn't really use them because boot from volume is weird iirc
19:48:40 but - honestly - I don't see any reason to keep them around
19:48:59 that was the precursor to having nodepool do it, then nodepool did it, then we undid the nodepool
19:49:49 i assumed the path of least resistance was to update the gerrit version we're testing against to one focal can ssh into, but 2.11 was the last version to keep ssh keys in the rdbms, which was how the test account was getting bootstrapped
19:50:15 so gerrit>2.11 means changing how we bootstrap our test user
19:50:33 but as usually happens, that's a rabbit hole to which i have yet to find the bottom
19:51:03 maybe bad idea: you could vendor an all-users repo state
19:51:08 and start gerrit with that
19:51:18 that's something i considered, yeah
19:51:35 though we'll need to vendor a corresponding ssh public key as well i suppose
19:51:42 er, public/private keypair
19:51:53 or edit the repo directly before starting it with a generated value
19:52:06 but gerritlib and others bootstrap using a dev mode that should work
19:52:13 just need to sort out why it doesn't
19:52:53 right, i thought working out how to interact with the rest api would be 1. easier than reverse-engineering undocumented notedb structure, and 2. an actual supported stable interface so we don't find ourselves right back here the next time they decide to tweak some implementation detail of the db
19:53:30 anyway, it seems like something about how the test framework is initializing and then later starting gerrit might be breaking dev mode
19:53:46 so that's the next string to tug on
19:54:14 i'll see if it could be as simple as calling java directly instead of trying to use the provided gerrit.sh initscript
19:54:36 i can try like actually running it today instead of just making red-herring comments on diffs and see if i can see anything
19:55:05 ianw: a tip is to comment out all the addCleanup() calls and then ask tox to run a single test
19:55:28 i've abused that to get a running gerrit exactly how the tests try to run it
19:56:38 right now the setup calls init with --no-auto-start, and then reindex, and then copies the resulting site and runs gerrit from the copies in daemon mode via gerrit.sh, which is rather convoluted
19:57:07 and supplies custom configs to each site copy with distinct tcp ports
19:58:19 ok, let me know if I can help.
19:58:28 I may be partially responsible for the old setup :)
19:58:35 and now we're just about at time
19:58:48 feel free to continue discussion on the mailing list or in #opendev
19:58:56 and thank you everyone for your time
19:58:58 #endmeeting
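
(A footnote on the "calling java directly" idea from 19:54:14: outside the test framework, the usual sequence for standing up a dev-mode Gerrit site without gerrit.sh looks roughly like the sketch below. The site path is made up, and whether this matches what the git-review harness needs is exactly the open question from the discussion above.)

    # Initialize a throwaway site in dev mode (DEVELOPMENT_BECOME_ANY_ACCOUNT auth),
    # build the indexes, then run the daemon in the foreground; the site path is illustrative.
    java -jar gerrit.war init -d /tmp/gerrit-test-site --batch --dev --no-auto-start
    java -jar gerrit.war reindex -d /tmp/gerrit-test-site
    java -jar gerrit.war daemon -d /tmp/gerrit-test-site
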