19:01:16 #startmeeting infra
19:01:17 Meeting started Tue Feb 9 19:01:16 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 The meeting name has been set to 'infra'
19:01:26 #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000180.html Our Agenda
19:01:33 #topic Announcements
19:01:37 I had no announcements
19:01:49 #topic Actions from last meeting
19:02:19 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-02-19.01.txt minutes from last meeting
19:02:36 I had an action to start writing down a xenial upgrade todo list.
19:02:43 #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:02:59 I started there, it is incomplete, but figured starting with something that we can all put notes on was better than waiting for perfect
19:03:17 ianw also had an action to follow up with wiki backups. Any update on that?
19:03:42 yes, i am getting closer :)
19:03:52 do you want to talk about pruning now?
19:04:02 let's pick that up later
19:04:27 #topic Priority Efforts
19:04:38 #topic OpenDev
19:04:58 I have continued to make progress (though it feels slow) on the gerrit account situation
19:05:32 11 more accounts with preferred emails lacking external ids have been cleaned up. The bulk of these were simply retired. But one example for tbachman's accounts was a good learning experience
19:06:13 With tbachman there were two accounts. An active one that had preferred email set and no external id for that email, and another inactive account with the same preferred email and external ids to match
19:06:46 tbachman said the best thing for them was to update the preferred email to a current email address. We tested this on review-test and tbachman was able to fix things on their end. The update was then made on the prod server
19:07:07 To avoid confusion with the other unused account I set it inactive
19:07:44 The important bit of news here is that users can actually update things themselves within the web ui and don't need us to intervene for this situation. They just need to update their preferred email address to be one of the actual email addresses further down in the settings page
19:08:06 I have also begun looking at the external id email conflicts. This is where two or more different accounts have external ids for the same email address
19:08:51 The vast majority of these seem to be accounts where one is clearly the account that has been used and the other is orphaned
19:09:23 for these cases I think we retire the orphaned account then remove the external ids associated with that account that conflict. The order here is important to ensure we don't generate a bunch of new "preferred email doesn't have external id" errors
19:10:10 There are a few cases where both accounts have been used and we may need to use our judgement or perhaps disable both accounts and let the user come to us with problems if they are still around (however most of these seem to be from years ago)
19:10:28 I suspect that the vast majority of users who are active and have these problems have reached out to us to help fix them
19:11:09 Where I am struggling is that I am finding it hard to automate the classification aspects. I have automated a good chunk of the data pulling but there is a fair bit of judgement in "what do we do next"
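In case it helps picture the data-pulling side described above, here is a rough, hypothetical sketch against the Gerrit REST API. The host, credentials and email address are placeholders, and this is only illustrative, not the actual audit tooling used on review-test:

```python
#!/usr/bin/env python3
"""Illustrative sketch: list accounts claiming an email and their external ids.

Assumes an API user with sufficient capabilities (e.g. Modify Account);
host, credentials and the example address are placeholders.
"""
import json

import requests

GERRIT = "https://review-test.opendev.org"  # placeholder host
AUTH = ("audit-user", "http-password")      # placeholder credentials


def gerrit_get(path):
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip it.
    resp = requests.get(GERRIT + "/a" + path, auth=AUTH)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


def accounts_for_email(email):
    # Every account (active or not) that claims this address.
    return gerrit_get("/accounts/?q=email:%s&o=DETAILS" % email)


def external_ids(account_id):
    # External ids attached to one account; conflicts are two accounts
    # carrying external ids for the same email address.
    return gerrit_get("/accounts/%s/external.ids" % account_id)


if __name__ == "__main__":
    for account in accounts_for_email("user@example.com"):
        print(account["_account_id"], account.get("email"))
        for ext in external_ids(account["_account_id"]):
            print("   ", ext)
```

The judgement call (which account is orphaned, which to retire) still has to be made by a human reading this output, which is the classification problem mentioned above.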
19:11:49 if others get a chance maybe they can take a look at my notes on review-test and see if any improvements to process or information gathering stand out. I'd also be curious if people think I've proposed invalid solutions to the issues
19:12:10 we don't need to go through that here though, can do that outside of meetings
19:12:56 As a reminder the workaround in the short term is to make changes with gerrit offline then reindex accounts (and groups?) with gerrit offline
19:13:14 I'm hoping we can fix all these issues without ever doing that, but that option is available if we run into a strong need for it
19:13:45 As far as next steps go I'll continue to classify things in my notes on -test and if others agree the proposed plans there seem valid I should make a checkout of the external ids on review and then start committing those fixes
19:14:02 then if we do have to take a downtime we can get as many fixes as are already prepared in too
19:14:24 Next up is a pointer to my gerrit 3.3 image build changes
19:14:26 #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images
19:14:29 reviews appreciated.
19:14:39 And that takes us to the gitea OOM'ing from last week
19:15:09 we had to add richer logging to apache so that we had source connection port for the haproxy -> apache connections. We haven't seen the issue return so haven't really had any new data to debug aiui
19:15:19 #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy.
19:15:45 I also put up an example of what haproxy tcp connection based rate limits might look like. I think the change as proposed would completely break users behind corporate NAT though
19:15:50 so the change is WIP
19:16:03 fungi: ianw anything else to add re Gitea OOMs?
19:16:41 i'm already finding it hard to remember last week. that's not good
19:17:00 yeah, i don't think we really found a smoking gun, it just sort of went away?
19:17:14 ya it went away and by the time we got better logging in place there wasn't much to look at
19:17:48 I guess we keep our eyes open and use better logging next time around if it happens again. Separately maybe take a look at haproxy rate limiting and decide if we want to implement some version of that?
19:18:42 (the trick is going to be figuring out what a valid bound is that doesn't just break all the corporate NAT users)
19:19:11 sounds like that may be it, let's move on
19:19:17 #topic Update Config Management
19:19:39 There are OpenAFS and refstack ansible (and docker in the case of refstack) efforts underway.
19:19:46 I also saw mention that launch node may not be working?
19:20:17 launch node was working for me yesterday (i launched a refstack) ... but openstack client on bridge isn't
19:20:27 oh I see I think I mixed up launch node and openstackclient
19:20:39 problems with latest openstackclient (or sdk?) talking to rackspace's api
19:20:43 well, it can't talk to rax anyway. i didn't let myself yak shave, fungi had a bit of a look too
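A hedged diagnostic sketch for narrowing down the failure described above: it talks to keystone v2 directly with keystoneauth1, bypassing openstackclient/openstacksdk, so the underlying exception is not hidden behind the client's retry logic. The endpoint URL and credentials below are placeholders, and this is only a way one might reproduce the problem, not what was actually run on bridge:

```python
#!/usr/bin/env python3
"""Minimal keystoneauth1-only check against a keystone v2 endpoint."""
import logging

from keystoneauth1 import session
from keystoneauth1.identity import v2

# Show the raw request/response traffic instead of the unhelpful
# "number of retries exceeded" summary the client surfaces.
logging.basicConfig(level=logging.DEBUG)

auth = v2.Password(
    auth_url="https://identity.api.rackspacecloud.com/v2.0",  # placeholder
    username="example-user",                                   # placeholder
    password="example-password",                               # placeholder
    tenant_name="example-tenant",                              # placeholder
)
sess = session.Session(auth=auth)
print(sess.get_token())
```

If this succeeds while the latest openstackclient fails, the regression is more likely in the higher layers; if it also fails, keystoneauth (or its dependencies) is the place to look, which matches the suspicion raised later in the discussion.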
19:20:48 ianw: I've got an older openstackclient in a venv in my homedir that I use to cross check against clouds when that happens
19:21:01 basically to answer the question of "does this work if we use old osc"
19:21:06 yeah, same, and my older client works
19:21:06 problem is the exception isn't super helpful because it's masked by a retry
19:21:16 so the exception is that the number of retries was exceeded
19:21:49 and it (confusingly) complains about host lookup failing
19:22:00 did osc drop keystone api v2 support?
19:22:04 that might be something to check?
19:22:19 if mordred gets bored he might be interested in looking at that failure case
19:22:46 I can probably take a look later today after lunch and bike ride stuff. Would be a nice change of pace from staring at gerrit accounts :)
19:22:53 let me know if that would be helpful
19:22:54 but it probably merits being brought up in #openstack-sdk if it hasn't been already
19:23:22 fungi: what did I do?
19:23:36 mordred: you totally broke rackspace ;)
19:23:44 not really
19:23:58 ah - joy
19:24:00 just thought you might be interested that latest openstacksdk is failing to talk to rackspace's keystone
19:24:14 that's exciting
19:24:46 using older openstacksdk works, so that's how we got around it in the short term
19:25:15 well, an older openstacksdk install, so also older dependencies. it could be any of a number of them
19:25:33 ianw: I've got openafs and refstack as separate agenda items. Should we just go over them here or move on and catch up under proper topic headings?
19:25:52 up to you
19:26:07 #topic General topics
19:26:13 #topic OpenAFS Cluster Status
19:26:13 fungi: I'll take a look - the only thing that would be likely to have an impact would be keystoneauth
19:26:40 I don't think I saw any movement on this but wanted to double check. The fileservers are upgraded to 1.8.6 but not the db servers?
19:26:51 the openafs status is that all servers/db servers are running 1.8.6-5
19:27:06 oh nice the db servers got upgraded too. Excellent. Thank you for working on that
19:27:18 next steps there are to do the server upgrades then?
19:27:30 I've got them on my initial pass of a list for server upgrades too
19:27:40 yep; so next i'll try an in-place focal upgrade, probably on one of the db servers first as they're small, and start that process
19:28:04 great, thanks again
19:28:35 #topic Refstack upgrade and container deployment
19:29:20 i got started on this
19:29:49 there's a couple of open reviews in https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) to add the production deployment jobs
19:29:56 is there a change to add a server to inventory yet? I suppose for this server we won't have dns changes as dns will be updated via rax
19:30:07 yeah i merged that yesterday
19:30:09 #link https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) Refstack changes that need review
19:30:25 if we can just double check those jobs, i can babysit it today
19:30:36 cool I can take a look at those really quickly after the meeting I bet
19:30:38 SotK: ^ that may also make a good example for doing the storyboard deployment
19:30:42 ++
19:30:51 then have to look at the db migration; the old one seemed to have a trove while we're running it from a container now
19:31:38 ya I expect we'll restore from dump for now, testing that things work? then schedule a downtime so that we can stop refstack properly, do a dump, restore from that, then start on the new server with dns updates
19:32:00 and kopecmartin volunteered to test the service that has been newly deployed which will go a long way as I don't even know how to interact with it properly
19:32:28 Anything else to add on this topic?
19:32:30 yep, there's terse notes at
19:32:32 #link https://etherpad.opendev.org/p/refstack-docker
19:32:41 other than that no
19:32:53 thank you everyone who helped move this along
19:33:03 #topic Bup and Borg Backups
19:33:52 ianw feel free to give us an update on borg db streaming and pruning and all other new info
19:34:27 the streaming bit seems to be going well
19:34:47 modulo of course mysqldump --all-databases stopping actually dumping all databases with a recent update
19:35:03 but it does still work if you specify specific databases
19:35:08 #link https://bugs.launchpad.net/ubuntu/+source/mysql-5.7/+bug/1914695
19:35:10 Launchpad bug 1914695 in mysql-5.7 (Ubuntu) "mysqldump --all-databases not dumping any databases with 5.7.33" [Undecided,New]
19:35:11 (which is the workaround we're going with?)
19:35:55 also there was some unanticipated fallout from the bup removal
19:36:00 nobody else has commented or mentioned anything in this bug, and i can't find anything in the mysql bug thing (though it's a bit of a mess) and i don't know how much more effort we want to spend on it, because it's talking to a 5.1 server in our case
19:36:44 apparently apt-get thought bup was the only reason we wanted pymysql installed on the storyboard server, so when bup got uninstalled so did the python-pymysql package. hilarity ensued
19:36:49 mordred: ^ possible you may be interested? but ya I think our workaround is likely sufficient
19:38:28 I also realised some things about borg's append-only model and pruning that are explained in their docs, if you read them the right way
19:39:07 i've put up some reviews at
19:39:08 #link https://review.opendev.org/q/topic:%22backup-more-prune%22+status:open
19:39:33 that provides a script to do manual prunes of the backups, and a cron job to warn us via email when the backup partitions are looking full
19:39:48 i think that is the best way to manage things for now
19:40:07 ianw: that seems like a good compromise, similar to how the certchecker reminded us to go buy new certs when we weren't using LE
19:40:14 i think the *best* way would be to have rolling LVM snapshots implemented on the backup server
19:41:26 but i think it's more important to just get running 100% with borg in a stable manner first
19:41:35 ++
19:42:01 so yeah, basically request for reviews on the ideas presented in those changes
19:42:09 thank you for sticking to this. It's never an easy thing to change, but helps enable upgrades to focal and beyond for a number of services
19:42:41 but i think we've got it working at a stable working set. some things we can't avoid like the review backups being big diffs due to git pack file updates
19:42:58 we could stop packing but then gerrit would get slow
19:43:32 Anything else on this or should we move on?
19:43:36 we could "backup" git repositories via replication rather than off the fs?
19:43:59 though what does the replication in that case?
19:44:00 fungi: the risk with that is a corrupted repo wouldn't be able to roll back easily
19:44:05 yeah
19:44:10 with proper backups we can go to an old state
19:44:26 well, assuming the repository was not mid-write when we backed it up
19:44:39 yep, and although the deltas take up a lot of space, the other side is they do prune well
19:44:39 I think git is pretty good about that
19:45:00 basically git does order of operations to make backups like that mostly work aiui
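For reference, a minimal sketch of the "name the databases explicitly" workaround for the --all-databases regression discussed above. The host, credentials and output path are placeholders, and in the real backup jobs the dump is streamed into borg rather than written to a local file; this is only an illustration of the idea:

```python
#!/usr/bin/env python3
"""Sketch: dump every non-system schema by name instead of --all-databases."""
import subprocess

import pymysql

SYSTEM_SCHEMAS = {"information_schema", "performance_schema", "mysql", "sys"}


def list_databases(host, user, password):
    # Enumerate the schemas we actually want to back up.
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES")
            return [row[0] for row in cur.fetchall()
                    if row[0] not in SYSTEM_SCHEMAS]
    finally:
        conn.close()


def dump_databases(host, user, password, databases, outfile):
    # Same effect as --all-databases, but with each schema named explicitly,
    # which still works with the affected 5.7.33 client.
    cmd = ["mysqldump", "-h", host, "-u", user, "--password=" + password,
           "--single-transaction", "--databases"] + databases
    with open(outfile, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    dbs = list_databases("localhost", "backup", "secret")
    dump_databases("localhost", "backup", "secret", dbs,
                   "/var/backups/mysql/all-databases.sql")
```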
19:45:54 Alright, let's move on as we have a few more topics to cover
19:46:03 #topic Xenial Server Upgrades
19:46:12 #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:46:35 this has sort of been in continual progress over time, but as xenial eol approaches I think we should capture what remains and start prioritizing things
19:46:44 I've started to write down a partial list in that etherpad
19:47:11 I'm hoping that I might have time next week to start doing rolling replacements of zuul-mergers, zuul-executors, and nodepool-launchers
19:47:41 my idea there was to redeploy one of each on focal and we can check everything is happy with the switch, then roll through the others in each group
19:48:04 If you've got ideas on priorities or process/method/etc feel free to add notes to that etherpad
19:48:36 #topic Meetpad Audio Stopped Working
19:49:03 Late last week a few of us noticed that meetpad's audio wasn't working. By the time I got around to trying it again in order to look at it this week it was working
19:49:27 yeah, it seems to be working fine today
19:49:35 used it for a while
19:49:36 Last week I had actually tried using the main meet.jit.si service as well and had problems with it too. I suspect that we may have deployed a bug then deployed the fix all automatically
19:50:02 This reminds me that I think corvus has mentioned we should be able to unfork one of the images we are running too
19:50:14 it is possible that having a more static image for one of the services could have contributed as well
19:50:22 * diablo_rojo appears suuuuper late
19:50:31 corvus: ^ is it just a matter of replacing the image in our docker-compose?
19:51:35 ohai
19:52:02 clarkb: everything except -web is unpinned i think
19:52:05 -web is the fork
19:52:26 and to unfork web we just update our docker-compose file? maybe set some new settings?
19:52:32 i don't think we'd be updating/restarting any of those automatically
19:52:53 corvus: I think we may do a docker-compose pull && docker-compose up -d regularly
19:53:03 gimme a sec
19:53:07 similar to how gitea does it (and it finds new mariadb images)
19:54:23 okay, yeah, looks like we do restart, last was 4/5 days ago
19:54:41 to unfork web, we actually update our dockerfile and the docker-compose
19:55:12 ok, it wasn't clear to me if we had to keep building the image ourselves or if we can use theirs like we do for the other services
19:55:22 (we're building the image from a github/jeblair source repo and deploying it; to unfork, change docker-compose to deploy from upstream and rm the dockerfile) -- but don't do that yet, upstream may not have updated the image.
19:55:29 we should use theirs
19:55:34 got it
19:55:54 https://hub.docker.com/r/jitsi/web
19:56:08 https://hub.docker.com/layers/jitsi/web/latest/images/sha256-018f7407c2514b5eeb27f4bc4d887ae4cd38d8446a0958c5ca9cee3fa811f575?context=explore
19:56:09 4 days ago
19:56:13 we should unfork now
19:56:35 excellent. Did you want to write that change? If not I'm sure we can find a volunteer
19:56:39 their build of -web should now have the meetpad PR merged in it
19:56:44 clarkb: i will do so
19:56:47 thank you
19:56:54 #action corvus unfork jitsi-meet
19:57:03 #topic InMotion OpenStack as a Service
19:57:21 Really quickly before our hour is up: I have deployed a control plane for an inmotion openstack managed cloud
19:57:48 everything seems to work at first glance and we could bootstrap users and projects and then point cloud launcher at it. Except that none of the api endpoints have ssl
19:58:11 there is a VIP involved somehow that load balances requests across the three control plane nodes (it is "hyperconverged")
19:58:48 I need to figure out how to properly listen on that VIP and then I can run a simple ssl terminating proxy with a self signed cert or LE cert that forwards to local services
19:58:52 I have not yet figured that out
19:59:13 I've also tried to give this feedback back to inmotion as something that would be useful
19:59:37 another thing worth noting is that we have a /28 of ipv4 addresses there currently so the ability to expand our nodepool resources is minimal right now
19:59:38 got it, so by default their cloud deployments don't provide a reachable rest api?
19:59:46 well they do but in plaintext
19:59:56 clarkb: what's the vip attached to?
20:00:01 oh! http just no https?
20:00:08 corvus: I have no idea. I tried looking and couldn't find it then ran out of time
20:00:35 a lot of things are in kolla containers and they all run the same exact command so it's been interesting poking around
20:00:49 (they run some sort of init that magically knows what other commands to run)
20:01:02 fungi: yup
20:01:06 is it only ipv4 or also ipv6?
20:01:16 ianw: currently only ipv4 but ipv6 is something that they are looking at
20:01:29 ipv6 is all the rage with the kids these days
20:01:38 (I expect that if we use this more properly it will be as an ipv6 "only" cloud then use the ipv4 /28 to do nat for outbound like limestone does)
20:01:44 but that is still theoretical right now
20:01:53 also we are now at time
20:01:56 thank you everyone!
20:01:57 yeah, a /28 is ... 16? nodes? - control plane bits?
20:02:02 thanks clarkb!
20:02:05 ianw: correct
20:02:05 ianw: the control plane has a separate /28 or /29
20:02:22 this /28 is for the neutron networking side so ya probably 14 usable, and after neutron uses a couple, 12?
20:02:32 We can continue conversations in #opendev
20:02:33 #endmeeting
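For the address math at the end, a quick sanity check with Python's ipaddress module; the subnet below is the RFC 5737 documentation prefix, not the real allocation:

```python
import ipaddress

# A /28 is 16 addresses total; the example network here is made up.
subnet = ipaddress.ip_network("203.0.113.0/28")

print(subnet.num_addresses)    # 16
usable = list(subnet.hosts())  # drops the network and broadcast addresses
print(len(usable))             # 14
# Neutron typically consumes a couple more (router and DHCP ports), which is
# roughly where the "probably 14 usable ... 12?" estimate above comes from.
print(len(usable) - 2)         # 12
```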