19:01:07 #startmeeting infra
19:01:08 Meeting started Tue Mar 2 19:01:07 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 The meeting name has been set to 'infra'
19:01:16 #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000191.html Our Agenda
19:01:24 #topic Announcements
19:01:37 clarkb out March 23rd, could use a volunteer meeting chair or plan to skip
19:01:59 This didn't make it onto the email I sent, but I will be trying to spend time with the kids during their break from school
19:02:37 if you'd like to chair the meeting on the 23rd feel free to let us know and send out a meeting agenda prior to the meeting. Otherwise I think we can likely skip it
19:02:49 #topic Actions from last meeting
19:02:56 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-23-19.01.txt minutes from last meeting
19:03:23 corvus: are there changes to review to unfork jitsi meet's web component? (I think things continue to be busy with zuul so understood if not)
19:04:24 I'll go ahead and re-add the action and we can follow up on it next week
19:04:33 #action corvus unfork jitsi meet
19:04:41 #topic Priority Efforts
19:04:45 #topic OpenDev
19:05:14 Last week another user showed up requesting account surgery, which has bumped the priority on addressing gerrit account inconsistencies back up again
19:05:24 I've been trying to work through that since then
19:06:08 As suggested by fungi I have taken another approach at it, which is to try and classify the conflicts based on whether or not one side of the conflict belongs to an inactive account, or if the accounts appear to have been unused for significant periods of time
19:06:43 That has produced a list of ~35 accounts that we can go ahead and retire (which I did this morning) and then delete the conflicting external ids from the retired side
19:07:14 I haven't done the external id deletions for all of those accounts yet, but did push up the script I am planning to use for that if people can take a look and see if that seems safe enough
19:07:16 #link https://review.opendev.org/c/opendev/system-config/+/777846 Collecting scripting efforts here
19:08:01 Hoping to get through that chunk of fixes today, then rerun the consistency check for an up to date list of issues which can be fed back into the audit to get up to date classifications on accounts' recent usage
19:08:42 There are a good number of accounts that do appear to have not been used recently. For those I think we can go through the same process as above (either pick an account out of the conflicting set to "win" or retire and remove external ids for all of them)
19:09:13 I did notice that there may be some accounts that are only used to query the server though, and my organizing based on code reviews and pushes is probably incomplete
19:09:32 I reached out to weshay|ruck about one of these (a tripleo account) to see if we can better capture those use cases
19:10:07 it continues to feel like slow going, but it is progress and the more I look at things the better I understand them
19:10:56 One thing that occurred to me is that setting accounts inactive is a relatively low cost option. That makes me think we should do this in a staged process where we set the accounts inactive, then wait a week or whatever for people to complain (can send email about this too)
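[editor's note: the retire-and-clean-up flow described above maps onto standard Gerrit 3.x REST endpoints. A minimal sketch, assuming an admin HTTP credential and a placeholder account id; the actual script under review in 777846 is authoritative:]

    #!/bin/bash
    # Hedged sketch: mark a conflicting Gerrit account inactive and inspect its
    # external ids so the conflicting ones can be removed.
    BASE=https://review.opendev.org/a
    ACCOUNT=12345                  # placeholder account id
    CREDS='admin:http-password'    # placeholder HTTP credential

    # Set the account inactive (reversible; a PUT to .../active re-activates it)
    curl -s -u "$CREDS" -X DELETE "$BASE/accounts/$ACCOUNT/active"

    # List the external ids attached to the account
    curl -s -u "$CREDS" "$BASE/accounts/$ACCOUNT/external.ids"

    # Remove specific conflicting external ids (keys copied from the output above)
    curl -s -u "$CREDS" -X POST -H 'Content-Type: application/json' \
      -d '["mailto:old-address@example.com"]' \
      "$BASE/accounts/$ACCOUNT/external.ids:delete"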
19:11:19 then if people complain we reactivate their accounts and move them out of the list; for the rest we remove the external ids and fix the conflicts
19:11:37 anyway that is still a ways away as I want to refine the classifications further once this set is done
19:11:53 Any other OpenDev topics to discuss before we move on?
19:12:45 no, but thanks for working on this tricky set of circumstances! :)
19:13:26 i'm working on pushing git-review 2.0.0.0 release candidates now to exercise release automation for it in preparation for a new release
19:13:41 we've got everything merged at this point which was slated for release
19:13:54 cool, the big change being git-review will require python3?
19:13:59 rc1 is in the release pipeline as we speak
19:14:15 yes, no more 2.7 support (thanks zbr for the change for that)
19:16:14 #topic General topics
19:16:20 #topic OpenAFS cluster status
19:16:37 ianw is adding a third afs db server in order for us to have proper quorum in the cluster
19:16:44 apparently 2 is not enough (not surprising)
19:16:56 ianw: anything additional to add to that? changes to review maybe?
19:17:29 yeah, that third server is active and has validated that it works ok with focal, so i'll take on the in-place upgrades we've talked about
19:17:44 excellent
19:17:52 Also I noticed that afs01.dfw's vicepa is fairly full
19:17:57 couple of small reviews are https://review.opendev.org/c/opendev/system-config/+/778127 and https://review.opendev.org/c/opendev/system-config/+/778120
19:18:10 I noticed that a few weeks ago and pushed up some changes to work towards dropping fedora-old (not sure of the exact version)
19:18:54 There are probably other ways we could prune the data set; if others have ideas that would be great
19:19:11 ahh, ok, i can go through and look for that and deal. fedora is hitting up against our -minimal issues with tools on build hosts, the container-build stuff is working but needs polishing
19:20:07 ianw: ya we have fedora-old, fedora-intermediate, and fedora-current. It's -current that has trouble, most testing seems to be on -intermediate so I think we can drop -old
19:20:19 but if you can double check that and review some of the changes that would probably be good
19:20:55 will do
19:21:06 #topic Borg Backups
19:21:28 ianw: fungi: any new insight into why gitea db backups pushing to the vexxhost dest has trouble?
19:22:04 no, but i have to admit i haven't looked fully. i think i'll try and run the mysqldump a few times and see if that is dying locally
19:22:15 ++ that seems like a good test
19:22:28 ahh, yeah i got sidetracked after getting as far as finding the disconnect error in the mariadb logs
19:22:41 the fact that it died three days in a row at the same row number seems very suspicious
19:23:34 anything else on this topic?
19:23:36 and that the filesystem part doesn't seem to have issues; and no other host is reporting issues
19:24:06 nope, otherwise, i've retired the old servers, we have a 1tb drive attached to the RAX host with the latest rotation of bup backups if we require
19:24:15 thank you!
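[editor's note: a minimal sketch of the "run mysqldump a few times and see if it dies locally" test mentioned above; the database name and credentials are placeholders for whatever the gitea hosts actually use:]

    #!/bin/bash
    # Run the dump repeatedly and record the exit status and output size,
    # to see whether the failure reproduces outside of the backup job.
    for i in 1 2 3 4 5; do
        mysqldump --single-transaction -u backup -p"$BACKUP_PASS" gitea \
            > /tmp/gitea-test-$i.sql
        echo "run $i: exit=$? size=$(stat -c %s /tmp/gitea-test-$i.sql)"
    done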
19:24:27 #topic Server Updates
19:24:41 I've made some progress with zuul server rolling replacements
19:25:00 all the mergers are focal now and the old servers have been cleaned up (though it just occurred to me I still have dns records to clean up)
19:25:19 #link https://review.opendev.org/c/opendev/system-config/+/778227 is the next step for executor replacements
19:25:33 basically if you think the new ze server is happy (from what I can see it is, including tarball publishing jobs to afs)
19:25:51 er if ^ then please help land that change. I'll delete the old server then start doing some replacements in larger batches (3 at a time?)
19:26:41 Anyone else looking at updates other than afs servers, refstack, and zuul?
19:26:49 yeah i started on review
19:27:06 oh ya I saw your email to upstream about the mariadb weirdness
19:27:10 however we've got ourselves in a bit of a tangle with review01..org
19:27:29 so we have A dns records for review01.opendev.org
19:27:54 i proposed removing them for the new server ... https://review.opendev.org/c/opendev/zone-opendev.org/+/777926
19:28:32 i need to spend some time with system-config and see what we can do
19:28:59 calling the new server "review02.opendev.org" *may* help a little?
19:29:14 my poor memory says we may have done that for a reason
19:29:40 hrm ya and with the LE records too
19:30:12 ya we use the dns records there to validate the ssl cert on the server :/
19:30:14 git history might point to why we added it
19:30:23 but that sounds likely
19:30:42 but do we need a cert for review01.opendev.org?
19:30:49 i don't feel like anyone is accessing it like that
19:31:29 I think the major reason for it may be for sshfp, since we sshfp to review01 for 22 but to review.opendev.org for 29418
19:31:40 #link https://review.opendev.org/744557 Split review's resource records from review01's
19:31:49 and ya maybe we can stop doing a review01 altname and just generate certs for review.opendev and review.openstack
19:32:22 sshfp record was breaking ssh access to gerrit's ssh api port
19:32:24 and the sshfp records aren't super important right now iirc
19:32:32 fungi: ya so we moved it to review01 from review
19:33:15 so ya I think we are ok if we reduce the LE tie in and maybe clean up sshfp records too for completeness
19:33:48 ok, i can look at that, split 777926 up into two steps
19:33:55 makes sense
19:33:56 ianw: then for bootstrapping the new host with ansible we want to do something similar to what review-test did, without replication config, etc
19:34:19 anything else on the topic of server upgrades?
19:34:21 yep
19:34:28 one more thing, what did we decide about review-dev?
19:34:38 ianw: we should clean it up though that hasn't happened yet
19:34:45 ok, i'll do that too
19:35:00 might want to double check with mordred and corvus et al that they don't have anything on that server to retain (shouldn't but it was a sandbox for a while)
19:35:33 we also need to get review-test back into ansible but that is probably less urgent
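[editor's note: for the SSHFP cleanup discussed above, records can be regenerated from a host's public keys and checked from the client side. A minimal sketch; the review02 hostname and key paths are illustrative:]

    # On the server: emit SSHFP resource records for the zone file
    ssh-keygen -r review02.opendev.org -f /etc/ssh/ssh_host_ed25519_key.pub
    ssh-keygen -r review02.opendev.org -f /etc/ssh/ssh_host_rsa_key.pub

    # From a client: confirm what is published and let ssh verify against DNS
    dig +short SSHFP review.opendev.org
    ssh -o VerifyHostKeyDNS=yes -p 29418 review.opendev.org gerrit version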
19:36:34 #topic New refstack server
19:36:57 Looked like there was some new testing being done to sort out some problems? I didn't catch what the current problems are though
19:37:11 i have 2 patches up for that
19:37:12 #link https://review.opendev.org/c/opendev/system-config/+/776292
19:37:30 when merged, will the held server be updated automatically?
19:37:45 I'd like to test it one more time and then let's go to production finally
19:37:50 kopecmartin: nope, that's not being ansiblised
19:38:08 ok, np, I'll do it manually then
19:38:11 however, we could update things on the held bridge and run it manually to confirm without having to run new nodes
19:38:22 ianw: kopecmartin for that first change I think that may be a noop
19:38:25 ianw: or that, whatever you say :)
19:38:32 because we are already redirecting everything under / to localhost:8000
19:38:43 I want to say there is a way to define the refstack api path in refstack itself
19:39:13 api_url =<%= scope.lookupvar("::refstack::params::api_url") %> is what puppet does
19:39:32 clarkb: hmm, so maybe that's why refstack server didn't behave as expected when i tested it, because of the '/ to localhost:8000'
19:39:43 I think you may want to set the config such that the api_url has an /api at the end of it
19:39:54 yeah yeah, i was playing with the api_url option, but it was ignored and i couldn't figure out why
19:40:00 now it makes sense
19:40:12 I think you may also have to set a js config value too
19:40:24 I remember looking at it and leaving some comments recently
19:40:49 ok then, let me get back to it and i'll implement updates shortly and ping you back so that it's moving forward
19:41:11 kopecmartin: ianw in the ansible template for refstack config try changing api_url = {{ refstack_url }} to api_url = {{ refstack_url }}/api maybe?
19:41:27 but ya I'm not sure that apache config change will help since it is already sending things to /
19:41:32 i thought we did that, but maybe not
19:41:40 we did, but it didn't work
19:41:44 I see
19:41:53 it seemed like the opt was ignored or something like that
19:42:12 oh interesting, the puppet side runs it as a wsgi app: WSGIScriptAlias /api /etc/refstack/app.wsgi
19:42:13 therefore I reverted that and put the proxy pass there (as a workaround)
19:42:26 so ya maybe the real fix is to switch to using it as wsgi?
19:42:34 that gets awkward with containers though
19:43:08 anyway sounds like you're ahead of me in the debugging so I should get out of the way :)
19:43:23 Anything else on this?
19:43:56 so the WSGIScriptAlias /api /etc/refstack/app.wsgi is an equivalent for the ProxyPass I wrote?
19:44:24 kopecmartin: no, it runs a python wsgi process under apache and does wsgi "proxying" instead
19:44:28 they are similar in some ways but also different
19:45:05 ah
19:45:16 yeah it seems to be almost running the api bits separately
19:47:22 alright let's move on
19:47:28 #topic Bridge disk use
19:47:50 frickler discovered that /root/.cache is consuming a fair bit of disk. Particularly caches for python entrypoints and pip
19:48:41 does anyone know what caches entrypoints (is it pkg_resources?) and if it is safe to simply remove the entire dir?
19:48:55 I think my concern is that if a python process is running it may rely on that file being present after it has pkg_resourced
19:49:26 i think we could just mtime delete anything older than a day though?
19:49:41 ianw: ya we could do that too, but there are so many files I expect the stat()ing for that to be slow. But maybe that is fine
19:49:45 just start it and then wait :)
19:49:59 yeah, i was thinking a cron job
19:50:09 presumably it's not "leaking" as such, as it's under .cache ...
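[editor's note: a minimal sketch of the mtime-based cron cleanup floated above, against the entrypoints cache directory under /root/.cache mentioned earlier; the one-day threshold and schedule are just the values suggested in the discussion:]

    # /etc/cron.d/clean-entrypoints-cache (sketch)
    # Remove cached entrypoint files not touched in over a day, then prune
    # any directories left empty.
    17 4 * * * root find /root/.cache/python-entrypoints -type f -mtime +1 -delete
    47 4 * * * root find /root/.cache/python-entrypoints -mindepth 1 -type d -empty -delete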
19:51:59 what is weird is I can't find evidence that this is part of python packaging proper
19:52:11 ".cache/python-entrypoints" does not give many hits
19:52:56 I do have a much smaller number of entries on my local system from zuul testing it looks like
19:52:59 could it be stevedore?
19:53:00 ianw: I do not have anything on review-dev
19:53:14 fungi: ya maybe something in stevedore or ansible pulling in etc
19:53:34 or something similar caching entrypoints, anyway
19:53:46 I think it would be worthwhile to try and source it before we go and delete them so that we understand it better (and its expected rate of growth)
19:53:58 I can probably take a look at that after getting this batch of gerrit accounts sorted
19:54:04 I think it's stevedore
19:54:30 random other hit on the internet: https://github.com/cpoppema/docker-flexget/issues/82 - also mentions stevedore - and I think I remember someone saying something about doing that a while back for performance
19:55:03 stevedore/_cache.py: return os.path.join(base_path, 'python-entrypoints')
19:55:22 that looks incredibly suspicious :)
19:55:36 * mordred puts on his useful-for-the-day hat
19:55:47 i was hoping for something incredibly delicious. i shouldn't have skipped lunch
19:55:53 based on that it should be fine to do a time based clearing, but maybe we should also file a bug
19:55:57 and the latest patch is where you can drop a . file to stop it caching
19:56:10 ianw: oh ha someone else already hit this then I bet :)
19:56:10 Add possibility to skip caching endpoints to the filesystem when '.disable' file is present in the cache directory.
19:56:34 (the idea of a cache seems like a good one, I wonder why it needs so many cache files though)
19:56:50 is that coming from cloud launcher? what exactly is using stevedore?
19:56:55 we have just a few minutes left so one more thing
19:57:02 #topic InMotion OpenStack as a Service
19:57:50 This has ended up towards the bottom of my priority list due to other distractions. I think getting ssl sorted out on this system would still be worthwhile if anyone else wants to take a look (you basically need to figure out how to configure kolla then rerun kolla against the cluster)
19:58:01 I think you can even tell kolla to just make a self signed cert as a first step
19:58:20 anyway I think we are all busy so don't necessarily expect anyone to jump on that, but thought I would mention it so it doesn't get completely forgotten
19:58:23 #topic Open Discussion
19:58:29 Anything else in our minute and a half remaining?
19:58:57 the git-review 2.0.0.0rc1 tag seems to have worked fine
19:59:03 exciting
19:59:06 #link https://pypi.org/project/git-review/2.0.0.0rc1/
19:59:18 did you want people to install it and use it for a bit or was that mostly to exercise the publishing?
19:59:19 i just noticed though that the release notes could be better organized
19:59:38 mostly to exercise publishing, though we can ask folks to test it briefly
19:59:59 #link https://review.opendev.org/778257 will clean up release notes
20:01:02 That is all we have scheduled time for. Thank you everyone and feel free to continue discussion in #opendev
20:01:04 #endmeeting
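[editor's note: on the kolla self-signed cert idea from the InMotion topic, recent kolla-ansible releases can generate test certificates and then reconfigure the cluster. A minimal sketch under that assumption; the option names, inventory path, and cert location should be checked against the deployed kolla version rather than taken as given:]

    # globals.yml additions (assumed option names; verify against the release docs)
    #   kolla_enable_tls_external: "yes"
    #   kolla_external_fqdn_cert: "{{ kolla_certificates_dir }}/haproxy.pem"

    # Generate self-signed certificates, then push the new config to the cluster
    kolla-ansible -i /etc/kolla/inventory certificates
    kolla-ansible -i /etc/kolla/inventory reconfigure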