19:01:16 #startmeeting infra
19:01:17 Meeting started Tue Feb 9 19:01:16 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 The meeting name has been set to 'infra'
19:01:26 #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000180.html Our Agenda
19:01:33 #topic Announcements
19:01:37 I had no announcements
19:01:49 #topic Actions from last meeting
19:02:19 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-02-19.01.txt minutes from last meeting
19:02:36 I had an action to start writing down a xenial upgrade todo list.
19:02:43 #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:02:59 I started there, it is incomplete, but figured starting with something that we can all put notes on was better than waiting for perfect
19:03:17 ianw also had an action to follow up with wiki backups. Any update on that?
19:03:42 yes, i am getting closer :)
19:03:52 do you want to talk about pruning now?
19:04:02 let's pick that up later
19:04:27 #topic Priority Efforts
19:04:38 #topic OpenDev
19:04:58 I have continued to make progress (though it feels slow) on the gerrit account situation
19:05:32 11 more accounts with preferred emails lacking external ids have been cleaned up. The bulk of these were simply retired. But one example for tbachman's accounts was a good learning experience
19:06:13 With tbachman there were two accounts. An active one that had preferred email set and no external id for that email, and another inactive account with the same preferred email and external ids to match
19:06:46 tbachman said the best thing for them was to update the preferred email to a current email address. We tested this on review-test and tbachman was able to fix things on their end. The update was then made on the prod server
19:07:07 To avoid confusion with the other unused account I set it inactive
19:07:44 The important bit of news here is that users can actually update things themselves within the web ui and don't need us to intervene for this situation. They just need to update their preferred email address to be one of the actual email addresses further down in the settings page
19:08:06 I have also begun looking at the external id email conflicts. This is where two or more different accounts have external ids for the same email address
19:08:51 The vast majority of these seem to be accounts where one is clearly the account that has been used and the other is orphaned
19:09:23 for these cases I think we retire the orphaned account then remove the external ids associated with that account that conflict. The order here is important to ensure we don't generate a bunch of new "preferred email doesn't have external id" errors
19:10:10 There are a few cases where both accounts have been used and we may need to use our judgement or perhaps disable both accounts and let the user come to us with problems if they are still around (however most of these seem to be from years ago)
19:10:28 I suspect that the vast majority of users who are active and have these problems have reached out to us to help fix them
19:11:09 Where I am struggling is that I am finding it hard to automate the classification aspects. I have automated a good chunk of the data pulling but there is a fair bit of judgement in "what do we do next"
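In case it helps picture the data-pulling side described above, here is a rough, hypothetical sketch against the Gerrit REST API. The host, credentials and email address are placeholders, and this is only illustrative, not the actual audit tooling used on review-test:

```python
#!/usr/bin/env python3
"""Illustrative sketch: list accounts claiming an email and their external ids.

Assumes an API user with sufficient capabilities (e.g. Modify Account);
host, credentials and the example address are placeholders.
"""
import json

import requests

GERRIT = "https://review-test.opendev.org"  # placeholder host
AUTH = ("audit-user", "http-password")      # placeholder credentials


def gerrit_get(path):
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip it.
    resp = requests.get(GERRIT + "/a" + path, auth=AUTH)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


def accounts_for_email(email):
    # Every account (active or not) that claims this address.
    return gerrit_get("/accounts/?q=email:%s&o=DETAILS" % email)


def external_ids(account_id):
    # External ids attached to one account; conflicts are two accounts
    # carrying external ids for the same email address.
    return gerrit_get("/accounts/%s/external.ids" % account_id)


if __name__ == "__main__":
    for account in accounts_for_email("user@example.com"):
        print(account["_account_id"], account.get("email"))
        for ext in external_ids(account["_account_id"]):
            print("   ", ext)
```

The judgement call (which account is orphaned, which to retire) still has to be made by a human reading this output, which is the classification problem mentioned above.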
19:11:49 if others get a chance maybe they can take a look at my notes on review-test and see if any improvements to process or information gathering stand out. I'd also be curious if people think I've proposed invalid solutions to the issues
19:12:10 we don't need to go through that here though, can do that outside of meetings
19:12:56 As a reminder the workaround in the short term is to make changes with gerrit offline then reindex accounts (and groups?) with gerrit offline
19:13:14 I'm hoping we can fix all these issues without ever doing that, but that option is available if we run into a strong need for it
19:13:45 As far as next steps go I'll continue to classify things in my notes on -test and if others agree the proposed plans there seem valid I should make a checkout of the external ids on review and then start committing those fixes
19:14:02 then if we do have to take a downtime we can get as many fixes as are already prepared in too
19:14:24 Next up is a pointer to my gerrit 3.3 image build changes
19:14:26 #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images
19:14:29 reviews appreciated.
19:14:39 And that takes us to the gitea OOM'ing from last week
19:15:09 we had to add richer logging to apache so that we had source connection port for the haproxy -> apache connections. We haven't seen the issue return so haven't really had any new data to debug aiui
19:15:19 #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy.
19:15:45 I also put up an example of what haproxy tcp connection based rate limits might look like. I think the change as proposed would completely break users behind corporate NAT though
19:15:50 so the change is WIP
19:16:03 fungi: ianw anything else to add re Gitea OOMs?
19:16:41 i'm already finding it hard to remember last week. that's not good
19:17:00 yeah, i don't think we really found a smoking gun, it just sort of went away?
19:17:14 ya it went away and by the time we got better logging in place there wasn't much to look at
19:17:48 I guess we keep our eyes open and use better logging next time around if it happens again. Separately maybe take a look at haproxy rate limiting and decide if we want to implement some version of that?
19:18:42 (the trick is going to be figuring out what a valid bound is that doesn't just break all the corporate NAT users)
19:19:11 sounds like that may be it, let's move on
19:19:17 #topic Update Config Management
19:19:39 There are OpenAFS and refstack ansible (and docker in the case of refstack) efforts underway.
19:19:46 I also saw mention that launch node may not be working?
19:20:17 launch node was working for me yesterday (i launched a refstack) ... but openstack client on bridge isn't
19:20:27 oh I see I think I mixed up launch node and openstackclient
19:20:39 problems with latest openstackclient (or sdk?) talking to rackspace's api
19:20:43 well, it can't talk to rax anyway. i didn't let myself yak shave, fungi had a bit of a look too
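A hedged diagnostic sketch for narrowing down the failure described above: it talks to keystone v2 directly with keystoneauth1, bypassing openstackclient/openstacksdk, so the underlying exception is not hidden behind the client's retry logic. The endpoint URL and credentials below are placeholders, and this is only a way one might reproduce the problem, not what was actually run on bridge:

```python
#!/usr/bin/env python3
"""Minimal keystoneauth1-only check against a keystone v2 endpoint."""
import logging

from keystoneauth1 import session
from keystoneauth1.identity import v2

# Show the raw request/response traffic instead of the unhelpful
# "number of retries exceeded" summary the client surfaces.
logging.basicConfig(level=logging.DEBUG)

auth = v2.Password(
    auth_url="https://identity.api.rackspacecloud.com/v2.0",  # placeholder
    username="example-user",                                   # placeholder
    password="example-password",                               # placeholder
    tenant_name="example-tenant",                              # placeholder
)
sess = session.Session(auth=auth)
print(sess.get_token())
```

If this succeeds while the latest openstackclient fails, the regression is more likely in the higher layers; if it also fails, keystoneauth (or its dependencies) is the place to look, which matches the suspicion raised later in the discussion.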
19:20:48 ianw: I've got an older openstackclient in a venv in my homedir that I use to cross check against clouds when that happens
19:21:01 basically to answer the question of "does this work if we use old osc"
19:21:06 yeah, same, and my older client works
19:21:06 problem is the exception isn't super helpful because it's masked by a retry
19:21:16 so the exception is that the number of retries was exceeded
19:21:49 and it (confusingly) complains about host lookup failing
19:22:00 did osc drop keystone api v2 support?
19:22:04 that might be something to check?
19:22:19 if mordred gets bored he might be interested in looking at that failure case
19:22:46 I can probably take a look later today after lunch and bike ride stuff. Would be a nice change of pace from staring at gerrit accounts :)
19:22:53 let me know if that would be helpful
19:22:54 but it probably merits being brought up in #openstack-sdk if it hasn't been already
19:23:22 fungi: what did I do?
19:23:36 mordred: you totally broke rackspace ;)
19:23:44 not really
19:23:58 ah - joy
19:24:00 just thought you might be interested that latest openstacksdk is failing to talk to rackspace's keystone
19:24:14 that's exciting
19:24:46 using older openstacksdk works, so that's how we got around it in the short term
19:25:15 well, an older openstacksdk install, so also older dependencies. it could be any of a number of them
19:25:33 ianw: I've got openafs and refstack as separate agenda items. Should we just go over them here or move on and catch up under proper topic headings?
19:25:52 up to you
19:26:07 #topic General topics
19:26:13 #topic OpenAFS Cluster Status
19:26:13 fungi: I'll take a look - the only thing that would be likely to have an impact would be keystoneauth
19:26:40 I don't think I saw any movement on this but wanted to double check. The fileservers are upgraded to 1.8.6 but not the db servers?
19:26:51 the openafs status is that all servers/db servers are running 1.8.6-5
19:27:06 oh nice the db servers got upgraded too. Excellent. Thank you for working on that
19:27:18 next steps there are to do the server upgrades then?
19:27:30 I've got them on my initial pass of a list for server upgrades too
19:27:40 yep; so next i'll try an in-place focal upgrade, probably on one of the db servers first as they're small, and start that process
19:28:04 great, thanks again
19:28:35 #topic Refstack upgrade and container deployment
19:29:20 i got started on this
19:29:49 there's a couple of open reviews in https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) to add the production deployment jobs
19:29:56 is there a change to add a server to inventory yet? I suppose for this server we won't have dns changes as dns will be updated via rax
19:30:07 yeah i merged that yesterday
19:30:09 #link https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) Refstack changes that need review
19:30:25 if we can just double check those jobs, i can babysit it today
19:30:36 cool I can take a look at those really quickly after the meeting I bet
19:30:38 SotK: ^ that may also make a good example for doing the storyboard deployment
19:30:42 ++
19:30:51 then have to look at the db migration; the old one seemed to have a trove while we're running it from a container now
19:31:38 ya I expect we'll restore from dump for now, testing that things work? then schedule a downtime so that we can stop refstack properly, do a dump, restore from that, then start on the new server with dns updates
19:32:00 and kopecmartin volunteered to test the service that has been newly deployed which will go a long way as I don't even know how to interact with it properly
19:32:28 Anything else to add on this topic?
19:32:30 yep, there's terse notes at
19:32:32 #link https://etherpad.opendev.org/p/refstack-docker
19:32:41 other than that no
19:32:53 thank you everyone who helped move this along
19:33:03 #topic Bup and Borg Backups
19:33:52 ianw feel free to give us an update on borg db streaming and pruning and all other new info
19:34:27 the streaming bit seems to be going well
19:34:47 modulo of course mysqldump --all-databases stopping actually dumping all databases with a recent update
19:35:03 but it does still work if you specify specific databases
19:35:08 #link https://bugs.launchpad.net/ubuntu/+source/mysql-5.7/+bug/1914695
19:35:10 Launchpad bug 1914695 in mysql-5.7 (Ubuntu) "mysqldump --all-databases not dumping any databases with 5.7.33" [Undecided,New]
19:35:11 (which is the workaround we're going with?)
19:35:55 also there was some unanticipated fallout from the bup removal
19:36:00 nobody else has commented or mentioned anything in this bug, and i can't find anything in the mysql bug thing (though it's a bit of a mess) and i don't know how much more effort we want to spend on it, because it's talking to a 5.1 server in our case
19:36:44 apparently apt-get thought bup was the only reason we wanted pymysql installed on the storyboard server, so when bup got uninstalled so did the python-pymysql package. hilarity ensued
19:36:49 mordred: ^ possible you may be interested? but ya I think our workaround is likely sufficient
19:38:28 I also realised some things about borg's append-only model and pruning that are explained in their docs, if you read them the right way
19:39:07 i've put up some reviews at
19:39:08 #link https://review.opendev.org/q/topic:%22backup-more-prune%22+status:open
19:39:33 that provides a script to do manual prunes of the backups, and a cron job to warn us via email when the backup partitions are looking full
19:39:48 i think that is the best way to manage things for now
19:40:07 ianw: that seems like a good compromise, similar to how the certchecker reminded us to go buy new certs when we weren't using LE
19:40:14 i think the *best* way would be to have rolling LVM snapshots implemented on the backup server
19:41:26 but i think it's more important to just get running 100% with borg in a stable manner first
19:41:35 ++
19:42:01 so yeah, basically request for reviews on the ideas presented in those changes
19:42:09 thank you for sticking to this. It's never an easy thing to change, but helps enable upgrades to focal and beyond for a number of services
19:42:41 but i think we've got it working at a stable working set. some things we can't avoid like the review backups being big diffs due to git pack file updates
19:42:58 we could stop packing but then gerrit would get slow
19:43:32 Anything else on this or should we move on?
19:43:36 we could "backup" git repositories via replication rather than off the fs?
19:43:59 though what does the replication in that case?
19:44:00 fungi: the risk with that is a corrupted repo wouldn't be able to roll back easily
19:44:05 yeah
19:44:10 with proper backups we can go to an old state
19:44:26 well, assuming the repository was not mid-write when we backed it up
19:44:39 yep, and although the deltas take up a lot of space, the other side is they do prune well
19:44:39 I think git is pretty good about that
19:45:00 basically git does order of operations to make backups like that mostly work aiui
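For reference, a minimal sketch of the "name the databases explicitly" workaround for the --all-databases regression discussed above. The host, credentials and output path are placeholders, and in the real backup jobs the dump is streamed into borg rather than written to a local file; this is only an illustration of the idea:

```python
#!/usr/bin/env python3
"""Sketch: dump every non-system schema by name instead of --all-databases."""
import subprocess

import pymysql

SYSTEM_SCHEMAS = {"information_schema", "performance_schema", "mysql", "sys"}


def list_databases(host, user, password):
    # Enumerate the schemas we actually want to back up.
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES")
            return [row[0] for row in cur.fetchall()
                    if row[0] not in SYSTEM_SCHEMAS]
    finally:
        conn.close()


def dump_databases(host, user, password, databases, outfile):
    # Same effect as --all-databases, but with each schema named explicitly,
    # which still works with the affected 5.7.33 client.
    cmd = ["mysqldump", "-h", host, "-u", user, "--password=" + password,
           "--single-transaction", "--databases"] + databases
    with open(outfile, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    dbs = list_databases("localhost", "backup", "secret")
    dump_databases("localhost", "backup", "secret", dbs,
                   "/var/backups/mysql/all-databases.sql")
```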
19:45:54 Alright, let's move on as we have a few more topics to cover
19:46:03 #topic Xenial Server Upgrades
19:46:12 #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:46:35 this has sort of been in continual progress over time, but as xenial eol approaches I think we should capture what remains and start prioritizing things
19:46:44 I've started to write down a partial list in that etherpad
19:47:11 I'm hoping that I might have time next week to start doing rolling replacements of zuul-mergers, zuul-executors, and nodepool-launchers
19:47:41 my idea there was to redeploy one of each on focal and we can check everything is happy with the switch, then roll through the others in each group
19:48:04 If you've got ideas on priorities or process/method/etc feel free to add notes to that etherpad
19:48:36 #topic Meetpad Audio Stopped Working
19:49:03 Late last week a few of us noticed that meetpad's audio wasn't working. By the time I got around to trying it again in order to look at it this week it was working
19:49:27 yeah, it seems to be working fine today
19:49:35 used it for a while
19:49:36 Last week I had actually tried using the main meet.jit.si service as well and had problems with it too. I suspect that we may have deployed a bug then deployed the fix all automatically
19:50:02 This reminds me that I think corvus has mentioned we should be able to unfork one of the images we are running too
19:50:14 it is possible that having a more static image for one of the services could have contributed as well
19:50:22 * diablo_rojo appears suuuuper late
19:50:31 corvus: ^ is it just a matter of replacing the image in our docker-compose?
19:51:35 ohai
19:52:02 clarkb: everything except -web is unpinned i think
19:52:05 -web is the fork
19:52:26 and to unfork web we just update our docker-compose file? maybe set some new settings?
19:52:32 i don't think we'd be updating/restarting any of those automatically
19:52:53 corvus: I think we may do a docker-compose pull && docker-compose up -d regularly
19:53:03 gimme a sec
19:53:07 similar to how gitea does it (and it finds new mariadb images)
19:54:23 okay, yeah, looks like we do restart, last was 4/5 days ago
19:54:41 to unfork web, we actually update our dockerfile and the docker-compose
19:55:12 ok, it wasn't clear to me if we had to keep building the image ourselves or if we can use theirs like we do for the other services
19:55:22 (we're building the image from a github/jeblair source repo and deploying it; to unfork, change docker-compose to deploy from upstream and rm the dockerfile) -- but don't do that yet, upstream may not have updated the image.
19:55:29 we should use theirs
19:55:34 got it
19:55:54 https://hub.docker.com/r/jitsi/web
19:56:08 https://hub.docker.com/layers/jitsi/web/latest/images/sha256-018f7407c2514b5eeb27f4bc4d887ae4cd38d8446a0958c5ca9cee3fa811f575?context=explore
19:56:09 4 days ago
19:56:13 we should unfork now
19:56:35 excellent. Did you want to write that change? If not I'm sure we can find a volunteer
19:56:39 their build of -web should now have the meetpad PR merged in it
19:56:44 clarkb: i will do so
19:56:47 thank you
19:56:54 #action corvus unfork jitsi-meet
19:57:03 #topic InMotion OpenStack as a Service
19:57:21 Really quickly before our hour is up: I have deployed a control plane for an inmotion openstack managed cloud
19:57:48 everything seems to work at first glance and we could bootstrap users and projects and then point cloud launcher at it. Except that none of the api endpoints have ssl
19:58:11 there is a VIP involved somehow that load balances requests across the three control plane nodes (it is "hyperconverged")
19:58:48 I need to figure out how to properly listen on that VIP and then I can run a simple ssl terminating proxy with a self signed cert or LE cert that forwards to local services
19:58:52 I have not yet figured that out
19:59:13 I've also tried to give this feedback back to inmotion as something that would be useful
19:59:37 another thing worth noting is that we have a /28 of ipv4 addresses there currently so the ability to expand our nodepool resources is minimal right now
19:59:38 got it, so by default their cloud deployments don't provide a reachable rest api?
19:59:46 well they do but in plaintext
19:59:56 clarkb: what's the vip attached to?
20:00:01 oh! http just no https?
20:00:08 corvus: I have no idea. I tried looking and couldn't find it then ran out of time
20:00:35 a lot of things are in kolla containers and they all run the same exact command so it's been interesting poking around
20:00:49 (they run some sort of init that magically knows what other commands to run)
20:01:02 fungi: yup
20:01:06 is it only ipv4 or also ipv6?
20:01:16 ianw: currently only ipv4 but ipv6 is something that they are looking at
20:01:29 ipv6 is all the rage with the kids these days
20:01:38 (I expect that if we use this more properly it will be as an ipv6 "only" cloud then use the ipv4 /28 to do nat for outbound like limestone does)
20:01:44 but that is still theoretical right now
20:01:53 also we are now at time
20:01:56 thank you everyone!
20:01:57 yeah, a /28 is ... 16? nodes? - control plane bits?
20:02:02 thanks clarkb!
20:02:05 ianw: correct
20:02:05 ianw: the control plane has a separate /28 or /29
20:02:22 this /28 is for the neutron networking side so ya probably 14 usable, and after neutron uses a couple, 12?
20:02:32 We can continue conversations in #opendev
20:02:33 #endmeeting
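For the address math at the end, a quick sanity check with Python's ipaddress module; the subnet below is the RFC 5737 documentation prefix, not the real allocation:

```python
import ipaddress

# A /28 is 16 addresses total; the example network here is made up.
subnet = ipaddress.ip_network("203.0.113.0/28")

print(subnet.num_addresses)    # 16
usable = list(subnet.hosts())  # drops the network and broadcast addresses
print(len(usable))             # 14
# Neutron typically consumes a couple more (router and DHCP ports), which is
# roughly where the "probably 14 usable ... 12?" estimate above comes from.
print(len(usable) - 2)         # 12
```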