19:01:03 #startmeeting infra
19:01:04 o/
19:01:04 Meeting started Tue Jan 26 19:01:03 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:07 The meeting name has been set to 'infra'
19:01:10 #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000174.html Our Agenda
19:01:16 #topic Announcements
19:01:23 I had no announcements so let's move along
19:01:29 #topic Actions from last meeting
19:01:37 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-01-19-19.01.txt minutes from last meeting
19:01:46 There were three actions recorded.
19:01:58 First up is ianw ensuring that wiki is still getting backed up with the new borg backup setup
19:02:12 I believe this may have happened but will let ianw confirm
19:02:44 ahh i got a little distracted trying to fit everything into our space available
19:02:48 o/
19:03:06 so the short version is that no, it is not yet backing up to the new servers
19:03:20 got it, I've got the general borg updates on the agenda for later too so we can dig in then
19:03:35 fungi has an action to send an email to the openstack-discuss list asking for config-core assistance
19:03:46 #link https://etherpad.opendev.org/p/tact-sig-2021-rfh is the draft but the email hasn't been sent yet
19:04:04 yeah, i was hoping mnaser could take a look first and make sure it covers what he was looking for
19:04:38 since it was his topic in last week's meeting and the preceding openstack tc meeting which precipitated it
19:04:56 but as far as i'm concerned it's ready to flu
19:04:58 fly
19:05:05 (please not flu)
19:05:09 and the last action was one for myself to start a puppet -> ansible and xenial upgrade todo list.
I got thoroughly sniped by gerrit account inconsistencies and have not done this
19:05:17 #action ianw Backup wiki to new borg servers
19:05:31 #action fungi send https://etherpad.opendev.org/p/tact-sig-2021-rfh once mnaser is happy with it
19:05:45 #action clarkb Write puppet to ansible and xenial upgrade todo list
19:06:04 #topic Priority Efforts
19:06:09 #topic OpenDev
19:06:23 Nominations for the service coordination position are still open
19:06:29 #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000161.html less than a week remaining.
19:06:38 we've got through the weekend UTC time
19:07:06 If you are interested but want to learn more or have concerns don't hesitate to reach out
19:07:23 And that takes us to the thing that sniped me.
19:07:58 Last week a user managed to create an interesting gerrit account situation where openids and email conflicts caused a situation where gerrit moved an openid from one account to another
19:08:34 I've done a fair bit of digging into this as well as communicating upstream on the repo-discuss list about it and this opened a whole new can of problems
19:08:48 We have inconsistent user groups, user accounts, and account external ids
19:09:23 this is a problem because we can't push external id fixes to gerrit while it is online (to fix that user's problem for example) until all the inconsistencies are dealt with
19:10:08 A workaround to this is that it does appear that we can stop gerrit, modify the external ids directly in git (don't push through gerrit but straight to disk), reindex accounts (and groups?), start gerrit then clear caches (accounts and groups?)
19:10:30 Since this workaround involves downtime, I've been trying to audit the errors to see if we can correct them and do online updates instead
19:11:03 For the group inconsistency it is a single group that has included itself as a subgroup which is a loop. My plan was to just fix that one today via the web ui.
19:11:35 We have about 109 accounts with preferred emails set in All-Users refs/users/XY/ABXY:account.config with no corresponding external id
19:12:13 the vast majority of these are accounts that are inactive or functionally inactive. For them I think we can set them to inactive and remove the preferred email address from account.config and push the update back to correct them
19:12:45 this can be done online because each account has its own ref under refs/users. If you try to push an invalid config to there it should be rejected but we are pushing updates that make them valid
19:13:03 and finally we have ~642 email addresses in use by multiple external ids
19:13:43 so catching up, this has happened because they changed their email in launchpad?
19:13:50 sorting these out is much more complicated because many of them seem to be active accounts. fungi and I were brainstorming around this a bit earlier today and I think we may be able to classify a subset of them (where one account has clearly been unused or underused) and merge it into the other account
19:14:08 ianw: for some users, yes
19:14:11 ianw: that seems to be part of it yes
19:14:32 it's hard to generalize, because there are a myriad of different sorts of conflicts currently returned by the validation check
19:14:57 the other big problem with the external id conflicts is they are all present in a single ref: refs/meta/external-ids which means we have to fix all of them at once and push that or do the downtime workaround and iterate
19:15:34 and ya I'm only just starting to scratch the surface on these. I think it is possible there are multiple scenarios going on.
Including the potential for some users with multiple accounts where they actively use one for ssh and another for https
19:16:08 review-test:~clarkb/gerrit-consistency-notes/ is where I'm keeping notes and scripts
19:16:16 in a different gerrit, i hosed my account by removing the email addr associated with my openid account (i didn't change my openid addr). in short: yes, hard to generalize.
19:16:39 conflicting_email_user_info and preferred-email-classifications are the two areas of distilled info and may be most interesting
19:17:03 it is also worth noting that I have yet to dump the info from prod
19:17:30 it shouldn't be vastly different than -test, but at some point i should do that dump from prod
19:17:50 I haven't done it yet as it isn't super clear to me how costly that check is to the running server. When I run it against -test it takes several minutes to return
19:18:04 Maybe I should fix the groups issue then run the consistency check against prod today?
19:18:25 (it is a rest api request)
19:19:32 fungi indicated we could pair up tomorrow and start correcting some of the simpler situations for accounts that have preferred email addrs without external ids
19:20:02 yeah, i'm up for that
19:20:19 I guess that is where I'm at on this: fix the group today, run consistency check against prod today if there are no objections, cross check that against -test, fix the simpler cases tomorrow
19:20:44 sounds great, thanks for digging into this ball of hair
19:20:45 if people want to take a look at the info I've put together on -test and try to classify the email conflicts or otherwise propose fixes for them I'd be grateful :)
19:21:17 another issue with doing a major fix for 642 emails all at once is that if we get something functionally wrong we'll potentially have a lot of people in a bad spot.
vs being able to do this one by one
19:21:28 upstream said it is a bug that you can't do it one by one but still a bug :/
19:21:51 That's all I had, happy to answer more questions on the subject if y'all have them
19:22:48 is there any way to stop this happening once we fix them?
19:22:58 they should no longer happen
19:22:58 ianw: yes, new gerrit doesn't allow it to happen anymore
19:23:06 this is an artifact of the beforetime
19:23:14 it does have the issue that the original user had which is it can move an openid so we may have to surgery that in the future
19:23:41 (and my issue)
19:23:53 but preferred emails lacking external ids and external id email conflicts shouldn't happen to accounts once we fix those
19:24:08 ahh, right, excellent
19:24:21 corvus: after the meeting I should catch up with you on that to find out what exactly you edited to cause that (as I think it will be useful to know for editing these fixes)
19:24:21 ahh, yes, gerrit still seems capable of getting itself thoroughly confused around external id changes, but it no longer creates new conflicts, just leaves a mess for you to fix
19:24:28 fungi: yes that
19:24:50 clarkb: sure -- but to be clear, i caused the problem as a regular user. fixing it required admin.
19:25:07 oh wow
19:25:21 older gerrit allowed these inconsistencies, newer gerrit does now, but we were able to upgrade without fixing them, we just can't push changes without fixing them because the push operation wants to validate everything not just what you're changing
19:25:35 er, newer gerrit does check for them now
19:26:31 you can make changes via the rest api without validating the entire set, however the rest api is currently limited to reading and deleting external-ids
19:26:41 it can't create or update
19:27:32 I also don't think it checks for conflicts on login unless it is creating a new account
19:27:47 which means that users in this situation should be fine unless they try to introduce a new conflict
19:28:02 yep
19:28:13 which is unfortunate because we are likely to introduce some pain for them when we correct things in our bookkeeping
19:28:16 well, presumably it also checks for conflicts if you try to add an address to your account
19:28:28 but only checks that the addition doesn't conflict
19:28:32 right
19:28:58 one of my thoughts here is that we set accounts to inactive to see who complains and then work with them to fix things
19:29:37 (and if we do that we can do some aggressive surgery on the external ids to make them pass consistency checking without worrying too much about user impacts. Then fix user impact when they can't login anymore and do it in a way that makes sense for them)
19:29:52 but that is super overkill
19:30:15 as a timecheck we're halfway through our hour. Let's continue on and we can talk about this in #opendev more as necessary
19:30:30 Next up is testing that Zuul handles WIP changes properly. Has anyone done this yet?
19:30:53 should be simple if we push up a trivial change, mark it wip with the built in state, then approve it and see if zuul enqueues it to the gate
19:31:50 that might make a good distraction from gerrit accounts task I can do later this week too if no one beats me to it
19:32:35 Gerrit 3.3.1 includes a workaround for making Zuul notice recheck comments. There is also a follow-on change to this workaround, one that changes event stream data structures to do this in a richer way. Zuul support for that new unlanded method has landed in Zuul
19:32:49 All of this to say that we should be ok to upgrade Gerrit from a Zuul perspective now.
19:33:06 However, I've now noticed two different users on the gerrit mailing list that have downgraded back to 3.2 after upgrading
19:33:23 I wonder if we should reach out to them and find out what their issues were?
19:33:36 (there is a documented downgrade process which I think is a first)
19:34:33 I also think that upgrading the gerrit server rather than gerrit itself might be a bigger priority right now if we had to order those
19:34:49 #topic Update Config Management
19:35:08 There have been updates to the change to ansible and docker refstack.
19:35:22 I'm not driving that anymore, but trying to help with reviews when I have time
19:35:33 fungi: do you know if there are changes for storyboard docker stuff yet too?
19:35:52 no, not yet, other than a bit of planning
19:36:01 #link https://review.opendev.org/c/opendev/system-config/+/705258 refstack dockerization
19:36:12 Any other config management updates to call out?
19:36:18 time for that has been split with planning for the storyboard-webclient rewrite framework discussion
19:37:17 sounds like that may be it
19:37:24 #topic General topics
19:37:31 #topic OpenAFS cluster status
19:37:38 #link https://review.opendev.org/c/opendev/system-config/+/771521 properly install new openafs on xenial openafs clients.
19:37:47 I have been rechecking this change for many days now.
It's always something new :)
19:38:19 i think the only outstanding problem right now is the wheel builder updates, but it's not clear the reason those jobs are failing is afs-related
19:38:20 ianw: fungi: I thought it would be good to get a quick update on the state of the afs server cluster. Are they all running 1.8.6 now from our ppa? are they out of the emergency file, etc
19:39:03 the fileservers are all afs 1.8, the db servers i did not get to before a little PTO last week
19:39:22 i haven't touched the db servers, but things have been stable
19:39:23 (this week i mean)
19:39:53 after that, i think we've decided on in-place focal updates which i can stage with (hopefully) zero downtime by doing one-at-a-time
19:40:29 and it's worth noting not all client systems upgraded to the new packages have been restarted on them, but since issues were predominately around restarting, that should be okay
19:41:02 fungi: ya and we tested reboots on some prominent clients to ensure the others would likely be ok with a reboot if/when that happened
19:42:23 ianw: and ya I think that was the plan.
Thanks for the update
19:42:29 #topic Bup and Borg Backups
19:42:57 We discovered that borg has filled disks somewhat quickly and are now looking at how to more sustainably run backups
19:43:01 so yeah, i got sniped trying to get the working set to a more reasonable level
19:43:38 the main issue is rotating gzipped sql backups that do not do well with delta updates
19:44:26 my proposal is to use borg's feature of streaming in from stdout directly to a separate archive to store plain dumps
19:44:28 https://review.opendev.org/c/opendev/system-config/+/771748/4
19:45:08 with some help from mordred with the dump output, we made zero-delta updates mariadb even more efficient (not incorporated into changes yet)
19:45:16 #link https://review.opendev.org/c/opendev/system-config/+/771748/4 stream database backups to borg to make it friendly to delta based backups
19:46:37 ianw: we are also successfully backing up to one location (out of two) ?
19:47:02 yes, vexxhost has run out of space, but rax is larger, and we haven't fully turned off bup
19:47:16 bup is off for review though iirc
19:47:22 it is very confusing, which is why i'd like to make it consistent post-haste
19:47:28 ++
19:47:34 ahh yes, indeed
19:47:38 thank you for working on this
19:48:06 Anything else on this subject?
19:48:18 no, just i guess reviews on the streaming backup changes
19:48:40 #topic two-review rule impact on low-activity projects
19:48:41 there are some trade-offs, we had a small discussion in #opendev; happy to continue the discussion with anyone concerned
19:48:48 thanks again!
19:49:07 I kept this on the agenda because I wasn't sure we had taken the discussion last week to a conclusion.
19:49:24 My interpretation from last week was that it would be good if we tried to set expectations appropriately (somehow)
19:49:45 and that updating and exposing the things we are working on (like the borg things and gerrit account db inconsistencies) would be helpful
19:50:14 Was there anything else to add to that or concerns we think aren't well captured already?
19:51:11 yeah, well, there were two main points. it's (still) okay to approve changes with a single core reviewer in emergencies or if the change is trivial or you're otherwise comfortable taking responsibility for making sure it goes okay, but also that we could be better about declining proposed changes, especially for some of our smaller utility projects and libraries when those changes aren't really in
19:51:13 scope
19:52:37 ++
19:52:40 perhaps also we should have a specific section of this meeting "review review" or something, where we more clearly can have people put reviews that seem stalled?
19:53:05 ianw: I'd be happy to try that
19:53:08 sure
19:53:36 I can add that to the wiki agenda so I don't forget
19:53:47 #topic InMotion Hosting Bare Metal Cloud
19:53:56 generally if i've had/have something i add it as an agenda point, but perhaps people feel a little shy to do that
19:54:17 Last week I got pm'd to say the new inmotion cloud resources should be ready for us to try them out
19:54:31 the credentials and contact info are in the usual place if someone wants to try out deploying an openstack cloud
19:54:41 I had hoped to try it out this week but the gerrit stuff happened
19:54:50 and maybe I'll still give it a go just to focus the brain on something else for a bit
19:54:59 but if anyone else is interested feel free to go for it
19:55:19 #topic Open Discussion
19:55:31 We have just under 5 minutes for anything that may have been missed
19:56:22 unless anyone else wants to review my updates to the opendev.org main page, i suppose i can self-approve them after the meeting
19:56:31 #link
https://review.opendev.org/769826 Polish the main opendev.org page
19:57:09 wanted to get that cleaned up before we start looking at options like linking/embedding statusbot info or an infra donors callout
19:57:14 oh they've been updated since I last reviewed them. That said looks like you have plenty of reviewers so I wouldn't wait on me
19:57:25 ++ I think they are good improvements overall too
19:57:32 like just for random users
19:57:50 * fungi considers himself a random user
19:58:02 they don't come much more random than me
19:58:29 oh, and a heads up, i'm trying to knock out significant git-review and bindep releases this week
19:58:50 will discuss in #opendev after the meeting
19:58:57 thank you for the heads up
19:59:30 zbr has been a huge help rescuing old reviews on git-review in particular
19:59:56 and thank you zbr for the help
20:00:18 we are at time
20:00:20 thanks as always, clarkb!
20:00:22 thank you everyone!
20:00:24 #endmeeting