19:01:07 #startmeeting infra
19:01:08 Meeting started Tue Mar 2 19:01:07 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 The meeting name has been set to 'infra'
19:01:16 #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000191.html Our Agenda
19:01:24 #topic Announcements
19:01:37 clarkb out March 23rd, could use a volunteer meeting chair or plan to skip
19:01:59 This didn't make it onto the email I sent, but I will be trying to spend time with the kids during their break from school
19:02:37 if you'd like to chair the meeting on the 23rd feel free to let us know and send out a meeting agenda prior to the meeting. Otherwise I think we can likely skip it
19:02:49 #topic Actions from last meeting
19:02:56 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-23-19.01.txt minutes from last meeting
19:03:23 corvus: are there changes to review to unfork jitsi meet's web component? (I think things continue to be busy with zuul so understood if not)
19:04:24 I'll go ahead and re-add the action and we can follow up on it next week
19:04:33 #action corvus unfork jitsi meet
19:04:41 #topic Priority Efforts
19:04:45 #topic OpenDev
19:05:14 Last week another user showed up requesting account surgery, which has bumped the priority on addressing gerrit account inconsistencies back up again
19:05:24 I've been trying to work through that since then
19:06:08 As suggested by fungi I have taken another approach at it, which is to try and classify the conflicts based on whether or not one side of the conflict belongs to an inactive account, or if the accounts appear to have been unused for significant periods of time
19:06:43 That has produced a list of ~35 accounts that we can go ahead and retire (which I did this morning) and then delete the conflicting external ids from the retired side
19:07:14 I haven't done the external id deletions for all of those accounts yet, but did push up the script I am planning to use for that if people can take a look and see if that seems safe enough
19:07:16 #link https://review.opendev.org/c/opendev/system-config/+/777846 Collecting scripting efforts here
19:08:01 Hoping to get through that chunk of fixes today, then rerun the consistency check for an up to date list of issues which can be fed back into the audit to get up to date classifications on accounts' recent usage
19:08:42 There are a good number of accounts that do appear to have not been used recently. For those I think we can go through the same process as above (either pick an account out of the conflicting set to "win" or retire and remove external ids for all of them)
19:09:13 I did notice that there may be some accounts that are only used to query the server though, and my organizing based on code reviews and pushes is probably incomplete
19:09:32 I reached out to weshay|ruck about one of these (a tripleo account) to see if we can better capture those use cases
19:10:07 it continues to feel like slow going, but it is progress and the more I look at things the better I understand them
19:10:56 One thing that occurred to me is that setting accounts inactive is a relatively low cost option. That makes me think we should do this in a staged process where we set the accounts inactive, then wait a week or whatever for people to complain (can send email about this too)
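[editor's note: the retire-and-clean-up flow described above maps onto standard Gerrit 3.x REST endpoints. A minimal sketch, assuming an admin HTTP credential and a placeholder account id; the actual script under review in 777846 is authoritative:]

    #!/bin/bash
    # Hedged sketch: mark a conflicting Gerrit account inactive and inspect its
    # external ids so the conflicting ones can be removed.
    BASE=https://review.opendev.org/a
    ACCOUNT=12345                  # placeholder account id
    CREDS='admin:http-password'    # placeholder HTTP credential

    # Set the account inactive (reversible; a PUT to .../active re-activates it)
    curl -s -u "$CREDS" -X DELETE "$BASE/accounts/$ACCOUNT/active"

    # List the external ids attached to the account
    curl -s -u "$CREDS" "$BASE/accounts/$ACCOUNT/external.ids"

    # Remove specific conflicting external ids (keys copied from the output above)
    curl -s -u "$CREDS" -X POST -H 'Content-Type: application/json' \
      -d '["mailto:old-address@example.com"]' \
      "$BASE/accounts/$ACCOUNT/external.ids:delete"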
19:11:19 then if people complain we reactivate their accounts and move them out of the list; for the rest we remove the external ids and fix the conflicts
19:11:37 anyway that is still a ways away as I want to refine the classifications further once this set is done
19:11:53 Any other OpenDev topics to discuss before we move on?
19:12:45 no, but thanks for working on this tricky set of circumstances! :)
19:13:26 i'm working on pushing git-review 2.0.0.0 release candidates now to exercise release automation for it in preparation for a new release
19:13:41 we've got everything merged at this point which was slated for release
19:13:54 cool, the big change being git-review will require python3?
19:13:59 rc1 is in the release pipeline as we speak
19:14:15 yes, no more 2.7 support (thanks zbr for the change for that)
19:16:14 #topic General topics
19:16:20 #topic OpenAFS cluster status
19:16:37 ianw is adding a third afs db server in order for us to have proper quorum in the cluster
19:16:44 apparently 2 is not enough (not surprising)
19:16:56 ianw: anything additional to add to that? changes to review maybe?
19:17:29 yeah, that third server is active and has validated that it works ok with focal, so i'll take on the in-place upgrades we've talked about
19:17:44 excellent
19:17:52 Also I noticed that afs01.dfw's vicepa is fairly full
19:17:57 couple of small reviews are https://review.opendev.org/c/opendev/system-config/+/778127 and https://review.opendev.org/c/opendev/system-config/+/778120
19:18:10 I noticed that a few weeks ago and pushed up some changes to work towards dropping fedora-old (not sure of the exact version)
19:18:54 There are probably other ways we could prune the data set; if others have ideas that would be great
19:19:11 ahh, ok, i can go through and look for that and deal. fedora is hitting up against our -minimal issues with tools on build hosts, the container-build stuff is working but needs polishing
19:20:07 ianw: ya we have fedora-old, fedora-intermediate, and fedora-current. It's -current that has trouble, most testing seems to be on -intermediate so I think we can drop -old
19:20:19 but if you can double check that and review some of the changes that would probably be good
19:20:55 will do
19:21:06 #topic Borg Backups
19:21:28 ianw: fungi: any new insight into why gitea db backups pushing to the vexxhost dest has trouble?
19:22:04 no, but i have to admit i haven't looked fully. i think i'll try and run the mysqldump a few times and see if that is dying locally
19:22:15 ++ that seems like a good test
19:22:28 ahh, yeah i got sidetracked after getting as far as finding the disconnect error in the mariadb logs
19:22:41 the fact that it died three days in a row at the same row number seems very suspicious
19:23:34 anything else on this topic?
19:23:36 and that the filesystem part doesn't seem to have issues; and no other host is reporting issues
19:24:06 nope, otherwise, i've retired the old servers, we have a 1tb drive attached to the RAX host with the latest rotation of bup backups if we require
19:24:15 thank you!
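[editor's note: a minimal sketch of the "run mysqldump a few times and see if it dies locally" test mentioned above; the database name and credentials are placeholders for whatever the gitea hosts actually use:]

    #!/bin/bash
    # Run the dump repeatedly and record the exit status and output size,
    # to see whether the failure reproduces outside of the backup job.
    for i in 1 2 3 4 5; do
        mysqldump --single-transaction -u backup -p"$BACKUP_PASS" gitea \
            > /tmp/gitea-test-$i.sql
        echo "run $i: exit=$? size=$(stat -c %s /tmp/gitea-test-$i.sql)"
    done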
19:24:27 #topic Server Updates
19:24:41 I've made some progress with zuul server rolling replacements
19:25:00 all the mergers are focal now and the old servers have been cleaned up (though it just occurred to me I still have dns records to clean up)
19:25:19 #link https://review.opendev.org/c/opendev/system-config/+/778227 is the next step for executor replacements
19:25:33 basically if you think the new ze server is happy (from what I can see it is, including tarball publishing jobs to afs)
19:25:51 er if ^ then please help land that change. I'll delete the old server then start doing some replacements in larger batches (3 at a time?)
19:26:41 Anyone else looking at updates other than afs servers, refstack, and zuul?
19:26:49 yeah i started on review
19:27:06 oh ya I saw your email to upstream about the mariadb weirdness
19:27:10 however we've got ourselves in a bit of a tangle with review01..org
19:27:29 so we have A dns records for review01.opendev.org
19:27:54 i proposed removing them for the new server ... https://review.opendev.org/c/opendev/zone-opendev.org/+/777926
19:28:32 i need to spend some time with system-config and see what we can do
19:28:59 calling the new server "review02.opendev.org" *may* help a little?
19:29:14 my poor memory says we may have done that for a reason
19:29:40 hrm ya and with the LE records too
19:30:12 ya we use the dns records there to validate the ssl cert on the server :/
19:30:14 git history might point to why we added it
19:30:23 but that sounds likely
19:30:42 but do we need a cert for review01.opendev.org?
19:30:49 i don't feel like anyone is accessing it like that
19:31:29 I think the major reason for it may be for sshfp, since we sshfp to review01 for 22 but to review.opendev.org for 29418
19:31:40 #link https://review.opendev.org/744557 Split review's resource records from review01's
19:31:49 and ya maybe we can stop doing a review01 altname and just generate certs for review.opendev and review.openstack
19:32:22 sshfp record was breaking ssh access to gerrit's ssh api port
19:32:24 and the sshfp records aren't super important right now iirc
19:32:32 fungi: ya so we moved it to review01 from review
19:33:15 so ya I think we are ok if we reduce the LE tie in and maybe clean up sshfp records too for completeness
19:33:48 ok, i can look at that, split 777926 up into two steps
19:33:55 makes sense
19:33:56 ianw: then for bootstrapping the new host with ansible we want to do something similar to what review-test did, without replication config, etc
19:34:19 anything else on the topic of server upgrades?
19:34:21 yep
19:34:28 one more thing, what did we decide about review-dev?
19:34:38 ianw: we should clean it up though that hasn't happened yet
19:34:45 ok, i'll do that too
19:35:00 might want to double check with mordred and corvus et al that they don't have anything on that server to retain (shouldn't but it was a sandbox for a while)
19:35:33 we also need to get review-test back into ansible but that is probably less urgent
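[editor's note: for the SSHFP cleanup discussed above, records can be regenerated from a host's public keys and checked from the client side. A minimal sketch; the review02 hostname and key paths are illustrative:]

    # On the server: emit SSHFP resource records for the zone file
    ssh-keygen -r review02.opendev.org -f /etc/ssh/ssh_host_ed25519_key.pub
    ssh-keygen -r review02.opendev.org -f /etc/ssh/ssh_host_rsa_key.pub

    # From a client: confirm what is published and let ssh verify against DNS
    dig +short SSHFP review.opendev.org
    ssh -o VerifyHostKeyDNS=yes -p 29418 review.opendev.org gerrit version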
19:36:34 #topic New refstack server
19:36:57 Looked like there was some new testing being done to sort out some problems? I didn't catch what the current problems are though
19:37:11 i have 2 patches up for that
19:37:12 #link https://review.opendev.org/c/opendev/system-config/+/776292
19:37:30 when merged, will the held server be updated automatically?
19:37:45 I'd like to test it one more time and then let's go to production finally
19:37:50 kopecmartin: nope, that's not being ansiblised
19:38:08 ok, np, I'll do it manually then
19:38:11 however, we could update things on the held bridge and run it manually to confirm without having to run new nodes
19:38:22 ianw: kopecmartin for that first change I think that may be a noop
19:38:25 ianw: or that, whatever you say :)
19:38:32 because we are already redirecting everything under / to localhost:8000
19:38:43 I want to say there is a way to define the refstack api path in refstack itself
19:39:13 api_url =<%= scope.lookupvar("::refstack::params::api_url") %> is what puppet does
19:39:32 clarkb: hmm, so maybe that's why refstack server didn't behave as expected when i tested it, because of the '/ to localhost:8000'
19:39:43 I think you may want to set the config such that the api_url has an /api at the end of it
19:39:54 yeah yeah, i was playing with the api_url option, but it was ignored and i couldn't figure out why
19:40:00 now it makes sense
19:40:12 I think you may also have to set a js config value too
19:40:24 I remember looking at it and leaving some comments recently
19:40:49 ok then, let me get back to it and i'll implement updates shortly and ping you back so that it's moving forward
19:41:11 kopecmartin: ianw in the ansible template for refstack config try changing api_url = {{ refstack_url }} to api_url = {{ refstack_url }}/api maybe?
19:41:27 but ya I'm not sure that apache config change will help since it is already sending things to /
19:41:32 i thought we did that, but maybe not
19:41:40 we did, but it didn't work
19:41:44 I see
19:41:53 it seemed like the opt was ignored or something like that
19:42:12 oh interesting, the puppet side runs it as a wsgi app: WSGIScriptAlias /api /etc/refstack/app.wsgi
19:42:13 therefore I reverted that and put the proxy pass there (as a workaround)
19:42:26 so ya maybe the real fix is to switch to using it as wsgi?
19:42:34 that gets awkward with containers though
19:43:08 anyway sounds like you're ahead of me in the debugging so I should get out of the way :)
19:43:23 Anything else on this?
19:43:56 so the WSGIScriptAlias /api /etc/refstack/app.wsgi is an equivalent for the ProxyPass I wrote?
19:44:24 kopecmartin: no, it runs a python wsgi process under apache and does wsgi "proxying" instead
19:44:28 they are similar in some ways but also different
19:45:05 ah
19:45:16 yeah it seems to be almost running the api bits separately
19:47:22 alright let's move on
19:47:28 #topic Bridge disk use
19:47:50 frickler discovered that /root/.cache is consuming a fair bit of disk. Particularly caches for python entrypoints and pip
19:48:41 does anyone know what caches entrypoints (is it pkg_resources?) and if it is safe to simply remove the entire dir?
19:48:55 I think my concern is that if a python process is running it may rely on that file being present after it has pkg_resourced
19:49:26 i think we could just mtime delete anything older than a day though?
19:49:41 ianw: ya we could do that too, but there are so many files I expect the stat()ing for that to be slow. But maybe that is fine
19:49:45 just start it and then wait :)
19:49:59 yeah, i was thinking a cron job
19:50:09 presumably it's not "leaking" as such, as it's under .cache ...
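[editor's note: a minimal sketch of the mtime-based cron cleanup floated above, against the entrypoints cache directory under /root/.cache mentioned earlier; the one-day threshold and schedule are just the values suggested in the discussion:]

    # /etc/cron.d/clean-entrypoints-cache (sketch)
    # Remove cached entrypoint files not touched in over a day, then prune
    # any directories left empty.
    17 4 * * * root find /root/.cache/python-entrypoints -type f -mtime +1 -delete
    47 4 * * * root find /root/.cache/python-entrypoints -mindepth 1 -type d -empty -delete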
19:51:59 what is weird is I can't find evidence that this is part of python packaging proper
19:52:11 ".cache/python-entrypoints" does not give many hits
19:52:56 I do have a much smaller number of entries on my local system from zuul testing it looks like
19:52:59 could it be stevedore?
19:53:00 ianw: I do not have anything on review-dev
19:53:14 fungi: ya maybe something in stevedore or ansible pulling in etc
19:53:34 or something similar caching entrypoints, anyway
19:53:46 I think it would be worthwhile to try and source it before we go and delete them so that we understand it better (and its expected rate of growth)
19:53:58 I can probably take a look at that after getting this batch of gerrit accounts sorted
19:54:04 I think it's stevedore
19:54:30 random other hit on the internet: https://github.com/cpoppema/docker-flexget/issues/82 - also mentions stevedore - and I think I remember someone saying something about doing that a while back for performance
19:55:03 stevedore/_cache.py: return os.path.join(base_path, 'python-entrypoints')
19:55:22 that looks incredibly suspicious :)
19:55:36 * mordred puts on his useful-for-the-day hat
19:55:47 i was hoping for something incredibly delicious. i shouldn't have skipped lunch
19:55:53 based on that it should be fine to do a time based clearing, but maybe we should also file a bug
19:55:57 and the latest patch is where you can drop a . file to stop it caching
19:56:10 ianw: oh ha someone else already hit this then I bet :)
19:56:10 Add possibility to skip caching endpoints to the filesystem when '.disable' file is present in the cache directory.
19:56:34 (the idea of a cache seems like a good one, I wonder why it needs so many cache files though)
19:56:50 is that coming from cloud launcher? what exactly is using stevedore?
19:56:55 we have just a few minutes left so one more thing
19:57:02 #topic InMotion OpenStack as a Service
19:57:50 This has ended up towards the bottom of my priority list due to other distractions. I think getting ssl sorted out on this system would still be worthwhile if anyone else wants to take a look (you basically need to figure out how to configure kolla then rerun kolla against the cluster)
19:58:01 I think you can even tell kolla to just make a self signed cert as a first step
19:58:20 anyway I think we are all busy so don't necessarily expect anyone to jump on that, but thought I would mention it so it doesn't get completely forgotten
19:58:23 #topic Open Discussion
19:58:29 Anything else in our minute and a half remaining?
19:58:57 the git-review 2.0.0.0rc1 tag seems to have worked fine
19:59:03 exciting
19:59:06 #link https://pypi.org/project/git-review/2.0.0.0rc1/
19:59:18 did you want people to install it and use it for a bit or was that mostly to exercise the publishing?
19:59:19 i just noticed though that the release notes could be better organized
19:59:38 mostly to exercise publishing, though we can ask folks to test it briefly
19:59:59 #link https://review.opendev.org/778257 will clean up release notes
20:01:02 That is all we have scheduled time for. Thank you everyone and feel free to continue discussion in #opendev
20:01:04 #endmeeting
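[editor's note: on the kolla self-signed cert idea from the InMotion topic, recent kolla-ansible releases can generate test certificates and then reconfigure the cluster. A minimal sketch under that assumption; the option names, inventory path, and cert location should be checked against the deployed kolla version rather than taken as given:]

    # globals.yml additions (assumed option names; verify against the release docs)
    #   kolla_enable_tls_external: "yes"
    #   kolla_external_fqdn_cert: "{{ kolla_certificates_dir }}/haproxy.pem"

    # Generate self-signed certificates, then push the new config to the cluster
    kolla-ansible -i /etc/kolla/inventory certificates
    kolla-ansible -i /etc/kolla/inventory reconfigure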