19:01:03 <clarkb> #startmeeting infra
19:01:04 <ianw> o/
19:01:04 <openstack> Meeting started Tue Jan 26 19:01:03 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:07 <openstack> The meeting name has been set to 'infra'
19:01:10 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000174.html Our Agenda
19:01:16 <clarkb> #topic Announcements
19:01:23 <clarkb> I had no announcements so let's move along
19:01:29 <clarkb> #topic Actions from last meeting
19:01:37 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-01-19-19.01.txt minutes from last meeting
19:01:46 <clarkb> There were three actions recorded.
19:01:58 <clarkb> First up is ianw ensuring that wiki is still getting backed up with the new borg backup setup
19:02:12 <clarkb> I believe this may have happened but will let ianw confirm
19:02:44 <ianw> ahh i got a little distracted trying to fit everything into our space available
19:02:48 <diablo_rojo> o/
19:03:06 <ianw> so the short version is that no, it is not yet backing up to the new servers
19:03:20 <clarkb> got it, I've got the general borg updates on the agenda for later too so we can dig in then
19:03:35 <clarkb> fungi has an action to send an email to the openstack-discuss list asking for config-core assistance
19:03:46 <clarkb> #link https://etherpad.opendev.org/p/tact-sig-2021-rfh is the draft but the email hasn't been sent yet
19:04:04 <fungi> yeah, i was hoping mnaser could take a look first and make sure it covers what he was looking for
19:04:38 <fungi> since it was his topic in last week's meeting and the preceding openstack tc meeting which precipitated it
19:04:56 <fungi> but as far as i'm concerned it's ready to flu
19:04:58 <fungi> fly
19:05:05 <fungi> (please not flu)
19:05:09 <clarkb> and the last action was one for myself to start a puppet -> ansible and xenial upgrade todo list. I got thoroughly sniped by gerrit account inconsistencies and have not done this
19:05:17 <clarkb> #action ianw Backup wiki to new borg servers
19:05:31 <clarkb> #action fungi send https://etherpad.opendev.org/p/tact-sig-2021-rfh once mnaser is happy with it
19:05:45 <clarkb> #action clarkb Write puppet to ansible and xenial upgrade todo list
19:06:04 <clarkb> #topic Priority Efforts
19:06:09 <clarkb> #topic OpenDev
19:06:23 <clarkb> Nominations for the service coordination position are still open
19:06:29 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000161.html less than a week remaining.
19:06:38 <clarkb> we've got through the weekend, UTC time
19:07:06 <clarkb> If you are interested but want to learn more or have concerns don't hesitate to reach out
19:07:23 <clarkb> And that takes us to the thing that sniped me.
19:07:58 <clarkb> Last week a user managed to create an interesting gerrit account situation where openid and email conflicts caused gerrit to move an openid from one account to another
19:08:34 <clarkb> I've done a fair bit of digging into this as well as communicating upstream on the repo-discuss list about it and this opened a whole new can of problems
19:08:48 <clarkb> We have inconsistent user groups, user accounts, and account external ids
19:09:23 <clarkb> this is a problem because we can't push external id fixes to gerrit while it is online (to fix that user's problem for example) until all the inconsistencies are dealt with
19:10:08 <clarkb> A workaround is that it appears we can stop gerrit, modify the external ids directly in git (pushing straight to the on-disk repo rather than through gerrit), reindex accounts (and groups?), start gerrit, then clear caches (accounts and groups?)
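For reference, a minimal sketch of that offline workaround, assuming a conventional on-disk Gerrit site layout (the paths, service management, and reindex invocation here are illustrative, not opendev's actual setup):

    # with gerrit stopped, work directly against the on-disk All-Users repo
    git clone /var/gerrit/git/All-Users.git && cd All-Users
    git fetch origin refs/meta/external-ids && git checkout FETCH_HEAD
    # edit or delete the offending external-id note files, then:
    git commit -a -m "Fix conflicting external ids"
    git push origin HEAD:refs/meta/external-ids
    # reindex accounts before starting gerrit again
    java -jar gerrit.war reindex --index accounts -d /var/gerrit
    # after startup, flush the accounts (and groups) caches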
19:10:30 <clarkb> Since this workaround involves downtime, I've been trying to audit the errors to see if we can correct them and do online updates instead
19:11:03 <clarkb> For the group inconsistency it is a single group that has included itself as a subgroup which is a loop. My plan was to just fix that one today via the web ui.
19:11:35 <clarkb> We have about 109 accounts with preferred emails set in All-Users refs/users/XY/ABXY:account.config with no corresponding external id
19:12:13 <clarkb> the vast majority of these are accounts that are inactive or functionally inactive. For them I think we can set them to inactive and remove the preferred email address from account.config and push the update back to correct them
19:12:45 <clarkb> this can be done online because each account has its own ref under refs/users. If you try to push an invalid config there it should be rejected, but we are pushing updates that make them valid
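As an illustration of such an online fix, a rough sketch for a hypothetical account 123456 (whose user ref shards to refs/users/56/123456); pushing to refs/users/* for another account requires admin rights:

    git clone https://review.example.org/All-Users && cd All-Users
    git fetch origin refs/users/56/123456 && git checkout FETCH_HEAD
    # in account.config: remove the dangling preferredEmail and mark the
    # account inactive, e.g.
    #   [account]
    #     active = false
    git commit -a -m "Deactivate account 123456, drop stale preferredEmail"
    git push origin HEAD:refs/users/56/123456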
19:13:03 <clarkb> and finally we have ~642 email addresses in use by multiple external ids
19:13:43 <ianw> so catching up, this has happened because they changed their email in launchpad?
19:13:50 <clarkb> sorting these out is much more complicated because many of them seem to be active accounts. fungi and I were brainstorming around this a bit earlier today and I think we may be able to classify a subset of them (where one account has clearly been unused or underused) and merge it into the other account
19:14:08 <fungi> ianw: for some users, yes
19:14:11 <clarkb> ianw: that seems to be part of it yes
19:14:32 <fungi> it's hard to generalize, because there are a myriad of different sorts of conflicts currently returned by the validation check
19:14:57 <clarkb> the other big problem with the external id conflicts is they are all present in a single ref: refs/meta/external-ids, which means we have to fix all of them at once and push that, or do the downtime workaround and iterate
19:15:34 <clarkb> and ya I'm only just starting to scratch the surface on these. I think it is possible there are multiple scenarios going on. Including the potential for some users with multiple accounts where they actively use one for ssh and another for https
19:16:08 <clarkb> review-test:~clarkb/gerrit-consistency-notes/ is where I'm keeping notes and scripts
19:16:16 <corvus> in a different gerrit, i hosed my account by removing the email addr associated with my openid account  (i didn't change my openid addr).  in short: yes, hard to generalize.
19:16:39 <clarkb> conflicting_email_user_info and preferred-email-classifications are the two areas of distilled info and may be most interesting
19:17:03 <clarkb> it is also worth noting that I have yet to dump the info from prod
19:17:30 <clarkb> it shouldn't be vastly different than -test, but at some point i should do that dump from prod
19:17:50 <clarkb> I haven't done it yet as it isn't super clear to me how costly that check is to the running server. When I run it against -test it takes several minutes to return
19:18:04 <clarkb> Maybe I should fix the groups issue then run the consistency check against prod today?
19:18:25 <clarkb> (it is a rest api request)
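That check is Gerrit's consistency-check REST endpoint; a minimal sketch, with placeholder credentials and hostname:

    curl -u admin:HTTP_PASSWORD -X POST \
      -H 'Content-Type: application/json' \
      -d '{"check_accounts": {}, "check_account_external_ids": {}}' \
      https://review.example.org/a/config/server/check.consistency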
19:19:32 <clarkb> fungi indicated we could pair up tomorrow and start correcting some of the simpler situations for accounts that have preferred email addrs without external ids
19:20:02 <fungi> yeah, i'm up for that
19:20:19 <clarkb> I guess that is where I'm at on this: fix the group today, run the consistency check against prod today if there are no objections, cross check that against -test, fix the simpler cases tomorrow
19:20:44 <fungi> sounds great, thanks for digging into this ball of hair
19:20:45 <clarkb> if people want to take a look at the info I've put together on -test and try to classify the email conflicts or otherwise propose fixes for them I'd be grateful :)
19:21:17 <clarkb> another issue with doing a major fix for 642 emails all at once is that if we get something functionally wrong we'll potentially have a lot of people in a bad spot. vs being able to do this one by one
19:21:28 <clarkb> upstream said it is a bug that you can't do it one by one, but for now it is still a bug :/
19:21:51 <clarkb> That's all I had, happy to answer more questions on the subject if y'all have them
19:22:48 <ianw> is there any way to stop this happening once we fix them?
19:22:58 <fungi> they should no longer happen
19:22:58 <clarkb> ianw: yes, new gerrit doesn't allow it to happen anymore
19:23:06 <fungi> this is an artifact of the beforetime
19:23:14 <clarkb> it does still have the issue the original user hit, which is that it can move an openid, so we may have to do surgery on that in the future
19:23:41 <corvus> (and my issue)
19:23:53 <clarkb> but preferred emails lacking external ids and external id email conflicts shouldn't happen to accounts once we fix those
19:24:08 <ianw> ahh, right, excellent
19:24:21 <clarkb> corvus: after the meeting I should catch up with you on that to find out what exactly you edited to cause that (as I think it will be useful to know for editing these fixes)
19:24:21 <fungi> ahh, yes, gerrit still seems capable of getting itself thoroughly confused around external id changes, but it no longer creates new conflicts, just leaves a mess for you to fix
19:24:28 <clarkb> fungi: yes that
19:24:50 <corvus> clarkb: sure -- but to be clear, i caused the problem as a regular user.  fixing it required admin.
19:25:07 <clarkb> oh wow
19:25:21 <fungi> older gerrit allowed these inconsistencies, newer gerrit does check for them now, but we were able to upgrade without fixing them; we just can't push changes without fixing them because the push operation wants to validate everything, not just what you're changing
19:26:31 <fungi> you can make changes via the rest api without validating the entire set, however the rest api is currently limited to reading and deleting external-ids
19:26:41 <fungi> it can't create or update
19:27:32 <clarkb> I also don't think it checks for conflicts on login unless it is creating a new account
19:27:47 <clarkb> which means that users in this situation should be fine unless they try to introduce a new conflict
19:28:02 <fungi> yep
19:28:13 <clarkb> which is unfortunate because we are likely to introduce some pain for them when we correct things in our bookkeeping
19:28:16 <fungi> well, presumably it also checks for conflicts if you try to add an address to your account
19:28:28 <fungi> but only checks that the addition doesn't conflict
19:28:32 <clarkb> right
19:28:58 <clarkb> one of my thoughts here is that we set accounts to inactive to see who complains and then work with them to fix things
19:29:37 <clarkb> (and if we do that we can do some aggressive surgery on the external ids to make them pass consistency checking without worrying too much about user impacts. Then fix user impact when they can't login anymore and do it in a way that makes sense for them)
19:29:52 <clarkb> but that is super overkill
19:30:15 <clarkb> as a timecheck we're halfway through our hour. Let's continue on and we can talk about this in #opendev more as necessary
19:30:30 <clarkb> Next up is testing that Zuul handles WIP changes properly. Has anyone done this yet?
19:30:53 <clarkb> should be simple: push up a trivial change, mark it wip with the built-in state, then approve it and see if zuul enqueues it to the gate
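A possible sketch of that test, using Gerrit's %wip push option and an empty commit (repo and branch are illustrative):

    git commit --allow-empty -m "DNM: verify zuul ignores approved WIP changes"
    git push origin HEAD:refs/for/master%wip
    # then approve the change (Code-Review+2, Workflow+1) in the web ui and
    # confirm zuul does NOT enqueue it into the gate pipeline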
19:31:50 <clarkb> that might make a good distraction from the gerrit accounts task I can do later this week too, if no one beats me to it
19:32:35 <clarkb> Gerrit 3.3.1 includes a workaround for making Zuul notice recheck comments. There is also a follow-on change to this workaround that changes the event stream data structures to handle this in a richer way. Zuul support for that new, not-yet-landed method has already landed in Zuul
19:32:49 <clarkb> All of this to say that we should be ok to upgrade Gerrit from a Zuul perspective now.
19:33:06 <clarkb> However, I've now noticed two different users on the gerrit mailing list that have downgraded back to 3.2 after upgrading
19:33:23 <clarkb> I wonder if we should reach out to them and find out what their issues were?
19:33:36 <clarkb> (there is a documented downgrade process which I think is a first)
19:34:33 <clarkb> I also think that upgrading the gerrit server rather than gerrit itself might be a bigger priority right now if we had to order those
19:34:49 <clarkb> #topic Update Config Management
19:35:08 <clarkb> There have been updates to the change converting refstack to ansible and docker.
19:35:22 <clarkb> I'm not driving that anymore, but trying to help with reviews when I have time
19:35:33 <clarkb> fungi: do you know if there are changes for storyboard docker stuff yet too?
19:35:52 <fungi> no, not yet, other than a bit of planning
19:36:01 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/705258 refstack dockerization
19:36:12 <clarkb> Any other config management updates to call out?
19:36:18 <fungi> time for that has been split with planning for the storyboard-webclient rewrite framework discussion
19:37:17 <clarkb> sounds like that may be it
19:37:24 <clarkb> #topic General topics
19:37:31 <clarkb> #topic OpenAFS cluster status
19:37:38 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/771521 properly install new openafs on xenial openafs clients.
19:37:47 <clarkb> I have been rechecking this change for many days now. It's always something new :)
19:38:19 <fungi> i think the only outstanding problem right now is the wheel builder updates, but it's not clear the reason those jobs are failing is afs-related
19:38:20 <clarkb> ianw: fungi: I thought it would be good to get a quick update on the state of the afs server cluster. Are they all running 1.8.6 now from our ppa? are they out of the emergency file, etc
19:39:03 <ianw> the fileservers are all afs 1.8, the db servers i did not get to before a little PTO last week
19:39:22 <fungi> i haven't touched the db servers, but things have been stable
19:39:23 <ianw> (this week i mean)
19:39:53 <ianw> after that, i think we've decided on in-place focal updates which i can stage with (hopefully) zero downtime by doing one-at-a-time
19:40:29 <fungi> and it's worth noting not all client systems upgraded to the new packages have been restarted onto them, but since issues were predominantly around restarting, that should be okay
19:41:02 <clarkb> fungi: ya and we tested reboots on some prominent clients to ensure the others would likely be ok with a reboot if/when that happened
19:42:23 <clarkb> ianw: and ya I think that was the plan. Thanks for the update
19:42:29 <clarkb> #topic Bup and Borg Backups
19:42:57 <clarkb> We discovered that borg has filled disks somewhat quickly and are now looking at how to more sustainably run backups
19:43:01 <ianw> so yeah, i got sniped trying to get the working set to a more reasonable level
19:43:38 <ianw> the main issue is rotating gzipped sql backups that do not do well with delta updates
19:44:26 <ianw> my proposal is to use borg's feature of streaming a command's stdout directly into a separate archive, to store plain dumps
19:44:28 <ianw> https://review.opendev.org/c/opendev/system-config/+/771748/4
19:45:08 <ianw> with some help from mordred with the dump output, we made zero-delta mariadb updates even more efficient (not incorporated into the changes yet)
19:45:16 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/771748/4 stream database backups to borg to make it friendly to delta based backups
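A minimal sketch of that streaming approach, assuming a borg new enough to read stdin via "-" and to support --stdin-name (the repo URL and dump flags are illustrative):

    mysqldump --all-databases --skip-dump-date \
      | borg create --stdin-name mariadb.sql \
          ssh://borg@backup.example.org/opt/backups/borg::mariadb-{now} -
    # --skip-dump-date keeps dumps byte-identical when data is unchanged,
    # which helps the zero-delta behavior mentioned above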
19:46:37 <clarkb> ianw: we are also successfully backing up to one location (out of two) ?
19:47:02 <ianw> yes, vexxhost has run out of space, but rax is larger, and we haven't fully turned off bup
19:47:16 <clarkb> bup is off for review though iirc
19:47:22 <ianw> it is very confusing, which is why i'd like to make it consistent post-haste
19:47:28 <clarkb> ++
19:47:34 <ianw> ahh yes, indeed
19:47:38 <clarkb> thank you for working on this
19:48:06 <clarkb> Anything else on this subject?
19:48:18 <ianw> no, just i guess reviews on the streaming backup changes
19:48:40 <clarkb> #topic two-review rule impact on low-activity projects
19:48:41 <ianw> there are some trade-offs, we had a small discussion in #opendev; happy to continue the discussion with anyone concerned
19:48:48 <clarkb> thanks again!
19:49:07 <clarkb> I kept this on the agenda because I wasn't sure we had taken the discussion last week to a conclusion.
19:49:24 <clarkb> My interpretation from last week was that it would be good if we tried to set expectations appropriately (somehow)
19:49:45 <clarkb> and that updating and exposing the things we are working on (like the borg things and gerrit account db inconsistencies) would be helpful
19:50:14 <clarkb> Was there anything else to add to that or concerns we think aren't well captured already?
19:51:11 <fungi> yeah, well, there were two main points. it's (still) okay to approve changes with a single core reviewer in emergencies, or if the change is trivial, or you're otherwise comfortable taking responsibility for making sure it goes okay; but also we could be better about declining proposed changes, especially for some of our smaller utility projects and libraries when those changes aren't really in scope
19:52:37 <clarkb> ++
19:52:40 <ianw> perhaps also we should have a specific section of this meeting, "review review" or something, where people can more clearly put reviews that seem stalled?
19:53:05 <clarkb> ianw: I'd be happy to try that
19:53:08 <fungi> sure
19:53:36 <clarkb> I can add that to the wiki agenda so I don't forget
19:53:47 <clarkb> #topic InMotion Hosting Bare Metal Cloud
19:53:56 <ianw> generally if i've had/have something i add it as an agenda point, but perhaps people feel a little shy to do that
19:54:17 <clarkb> Last week I got pm'd to say the new inmotion cloud resources should be ready for us to try them out
19:54:31 <clarkb> the credentials and contact info are in the usual place if someone wants to try out deploying an openstack cloud
19:54:41 <clarkb> I had hoped to try it out this week but the gerrit stuff happened
19:54:50 <clarkb> and maybe I'll still give it a go just to focus the brain on something else for a bit
19:54:59 <clarkb> but if anyone else is interested feel free to go for it
19:55:19 <clarkb> #topic Open Discussion
19:55:31 <clarkb> We have just under 5 minutes for anything that may have been missed
19:56:22 <fungi> unless anyone else wants to review my updates to the opendev.org main page, i suppose i can self-approve them after the meeting
19:56:31 <fungi> #link https://review.opendev.org/769826 Polish the main opendev.org page
19:57:09 <fungi> wanted to get that cleaned up before we start looking at options like linking/embedding statusbot info or an infra donors callout
19:57:14 <clarkb> oh they've been updated since I last reviewed them. That said, looks like you have plenty of reviewers so I wouldn't wait on me
19:57:25 <clarkb> ++ I think they are good improvements overall too
19:57:32 <clarkb> like just for random users
19:57:50 * fungi considers himself a random user
19:58:02 <fungi> they don't come much more random than me
19:58:29 <fungi> oh, and a heads up, i'm trying to knock out significant git-review and bindep releases this week
19:58:50 <fungi> will discuss in #opendev after the meeting
19:58:57 <clarkb> thank you for the heads up
19:59:30 <fungi> zbr has been a huge help rescuing old reviews on git-review in particular
19:59:56 <clarkb> and thank you zbr for the help
20:00:18 <clarkb> we are at time
20:00:20 <fungi> thanks as always, clarkb!
20:00:22 <clarkb> thank you everyone!
20:00:24 <clarkb> #endmeeting