Tuesday, 2021-02-09

01:20 *** openstackstatus has quit IRC
01:22 *** openstack has joined #opendev-meeting
01:22 *** ChanServ sets mode: +o openstack
07:58 *** sboyron has joined #opendev-meeting
08:00 *** hashar has joined #opendev-meeting
11:44 *** hashar is now known as hasharAway
12:35 *** hasharAway is now known as hashar
15:27 *** hashar is now known as hasharAway
15:58 *** hasharAway is now known as hashar
18:23 *** hashar is now known as hasharAway
18:59 <clarkb> Anyone else here for the meeting? we will get started shortly
19:00 <ianw> o/
19:01 <fungi> yep
19:01 <clarkb> #startmeeting infra
19:01 <openstack> Meeting started Tue Feb  9 19:01:16 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01 *** openstack changes topic to " (Meeting topic: infra)"
19:01 <openstack> The meeting name has been set to 'infra'
19:01 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000180.html Our Agenda
19:01 <clarkb> #topic Announcements
19:01 *** openstack changes topic to "Announcements (Meeting topic: infra)"
19:01 <clarkb> I had no announcements
19:01 <clarkb> #topic Actions from last meeting
19:01 *** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"
19:02 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-02-19.01.txt minutes from last meeting
19:02 <clarkb> I had an action to start writing down a xenial upgrade todo list.
19:02 <clarkb> #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:02 <clarkb> I started there, it is incomplete, but figured starting with something that we can all put notes on was better than waiting for perfect
19:03 <clarkb> ianw also had an action to follow up with wiki backups. Any update on that?
19:03 <ianw> yes, i am getting closer :)
19:03 <ianw> do you want to talk about pruning now?
19:04 <clarkb> let's pick that up later
19:04 <clarkb> #topic Priority Efforts
19:04 *** openstack changes topic to "Priority Efforts (Meeting topic: infra)"
19:04 <clarkb> #topic OpenDev
19:04 *** openstack changes topic to "OpenDev (Meeting topic: infra)"
19:04 <clarkb> I have continued to make progress (though it feels slow) on the gerrit account situation
19:05 <clarkb> 11 more accounts with preferred emails lacking external ids have been cleaned up. The bulk of these were simply retired. But one example, tbachman's accounts, was a good learning experience
19:06 <clarkb> With tbachman there were two accounts: an active one that had a preferred email set and no external id for that email, and another inactive account with the same preferred email and external ids to match
19:06 <clarkb> tbachman said the best thing for them was to update the preferred email to a current email address. We tested this on review-test and tbachman was able to fix things on their end. The update was then made on the prod server
19:07 <clarkb> To avoid confusion with the other unused account I set it inactive
19:07 <clarkb> The important bit of news here is that users can actually update things themselves within the web ui and don't need us to intervene for this situation. They just need to update their preferred email address to be one of the actual email addresses further down in the settings page
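[For reference, the same self-service fix can also be driven through Gerrit's REST API. A minimal sketch, assuming the user has a Gerrit HTTP password and that the target address is already one of the account's registered emails; the credentials and email here are placeholders:]

```python
import requests

GERRIT = "https://review.opendev.org"
AUTH = ("myuser", "my-http-password")  # Gerrit HTTP credentials, not the SSO password

# Mark an already-registered address as the account's preferred email.
# This mirrors the "preferred" selector in Settings > Email Addresses.
email = "user@example.com"
resp = requests.put(f"{GERRIT}/a/accounts/self/emails/{email}/preferred", auth=AUTH)
resp.raise_for_status()  # 201 when changed, 200 if it was already preferred
```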
19:08 <clarkb> I have also begun looking at the external id email conflicts. This is where two or more different accounts have external ids for the same email address
19:08 <clarkb> The vast majority of these seem to be accounts where one is clearly the account that has been used and the other is orphaned
19:09 <clarkb> for these cases I think we retire the orphaned account then remove the external ids associated with that account that conflict. The order here is important to ensure we don't generate a bunch of new "preferred email doesn't have external id" errors
19:10 <clarkb> There are a few cases where both accounts have been used and we may need to use our judgement, or perhaps disable both accounts and let the user come to us with problems if they are still around (however most of these seem to be from years ago)
19:10 <clarkb> I suspect that the vast majority of users who are active and have these problems have reached out to us to help fix them
19:11 <clarkb> Where I am struggling is that I am finding it hard to automate the classification aspects. I have automated a good chunk of the data pulling but there is a fair bit of judgement in "what do we do next"
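[A rough sketch of the classification heuristic being described, over a hypothetical pair of conflicting accounts; the field names and the "last activity" signal are assumptions, the real notes live on review-test:]

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Account:
    account_id: int
    active: bool
    last_seen: datetime  # e.g. most recent change or review activity

def classify(a: Account, b: Account, now: datetime) -> str:
    """Suggest a next step for two accounts sharing an external id email."""
    stale = now - timedelta(days=365)
    a_used, b_used = a.last_seen > stale, b.last_seen > stale
    if a_used != b_used:
        orphan = b if a_used else a
        # Retire first, then drop the conflicting external ids, so we don't
        # create new "preferred email has no external id" errors.
        return f"retire {orphan.account_id}, then remove its conflicting external ids"
    if not (a_used or b_used):
        return "both idle: disable both, wait for the user to contact us"
    return "both recently used: needs human judgement"
```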
19:11 <clarkb> if others get a chance maybe they can take a look at my notes on review-test and see if any improvements to process or information gathering stand out. I'd also be curious if people think I've proposed invalid solutions to the issues
19:12 <clarkb> we don't need to go through that here though, can do that outside of meetings
19:12 <clarkb> As a reminder the workaround in the short term is to make changes with gerrit offline, then reindex accounts (and groups?) while it is still offline
19:13 <clarkb> I'm hoping we can fix all these issues without ever doing that, but that option is available if we run into a strong need for it
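[For reference, the offline reindex that this workaround implies is roughly the following; the war and site paths are placeholders for wherever they live in the gerrit container, and the same command can be repeated with --index groups:]

```python
import subprocess

# Rebuild only the accounts index while Gerrit is stopped.
# Paths are placeholders; opendev would run this inside the gerrit container.
subprocess.run(
    ["java", "-jar", "/var/gerrit/bin/gerrit.war",
     "reindex", "--index", "accounts", "-d", "/var/gerrit"],
    check=True,
)
```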
19:13 <clarkb> As far as next steps go I'll continue to classify things in my notes on -test and if others agree the proposed plans there seem valid I should make a checkout of the external ids on review and then start committing those fixes
19:14 <clarkb> then if we do have to take a downtime we can get in as many fixes as are already prepared too
19:14 <clarkb> Next up is a pointer to my gerrit 3.3 image build changes
19:14 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images
19:14 <clarkb> reviews appreciated.
19:14 <clarkb> And that takes us to the gitea OOM'ing from last week
19:15 <clarkb> we had to add richer logging to apache so that we had the source connection port for the haproxy -> apache connections. We haven't seen the issue return so haven't really had any new data to debug aiui
19:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy.
19:15 <clarkb> I also put up an example of what haproxy tcp connection based rate limits might look like. I think the change as proposed would completely break users behind corporate NAT though
19:15 <clarkb> so the change is WIP
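[For context, haproxy's usual tool for this is a stick table tracking per-source connection rates. A minimal sketch with made-up names and thresholds — and exactly the kind of per-IP bound that a large corporate NAT would trip over with legitimate traffic:]

```
# haproxy.cfg fragment, illustrative thresholds only
frontend balance_git_https
    bind :443
    mode tcp
    # Track each source address's connection rate over a 60s window.
    stick-table type ip size 100k expire 120s store conn_rate(60s)
    tcp-request connection track-sc0 src
    # Reject sources opening more than 100 connections per minute.
    tcp-request connection reject if { sc0_conn_rate gt 100 }
    default_backend gitea_servers
```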
19:16 <clarkb> fungi: ianw anything else to add re Gitea OOMs?
19:16 <fungi> i'm already finding it hard to remember last week. that's not good
19:17 <ianw> yeah, i don't think we really found a smoking gun, it just sort of went away?
19:17 <clarkb> ya it went away and by the time we got better logging in place there wasn't much to look at
19:17 <clarkb> I guess we keep our eyes open and use the better logging next time around if it happens again. Separately maybe take a look at haproxy rate limiting and decide if we want to implement some version of that?
19:18 <clarkb> (the trick is going to be figuring out what a valid bound is that doesn't just break all the corporate NAT users)
19:19 <clarkb> sounds like that may be it, let's move on
19:19 <clarkb> #topic Update Config Management
19:19 *** openstack changes topic to "Update Config Management (Meeting topic: infra)"
19:19 <clarkb> There are OpenAFS and refstack ansible (and docker in the case of refstack) efforts underway.
19:19 <clarkb> I also saw mention that launch node may not be working?
19:20 <ianw> launch node was working for me yesterday (i launched a refstack) ... but openstack client on bridge isn't
19:20 <clarkb> oh I see, I think I mixed up launch node and openstackclient
19:20 <fungi> problems with latest openstackclient (or sdk?) talking to rackspace's api
19:20 <ianw> well, it can't talk to rax anyway.  i didn't let myself yak shave, fungi had a bit of a look too
19:20 <clarkb> ianw: I've got an older openstackclient in a venv in my homedir that I use to cross check against clouds when that happens
19:21 <clarkb> basically to answer the question of "does this work if we use old osc"
19:21 <ianw> yeah, same, and my older client works
19:21 <fungi> problem is the exception isn't super helpful because it's masked by a retry
19:21 <fungi> so the exception is that the number of retries was exceeded
19:21 <fungi> and it (confusingly) complains about host lookup failing
19:22 <clarkb> did osc drop keystone api v2 support?
19:22 <clarkb> that might be something to check?
19:22 <fungi> if mordred gets bored he might be interested in looking at that failure case
19:22 <clarkb> I can probably take a look later today after lunch and bike ride stuff. Would be a nice change of pace from staring at gerrit accounts :)
19:22 <clarkb> let me know if that would be helpful
19:22 <fungi> but it probably merits being brought up in #openstack-sdk if it hasn't been already
19:23 <mordred> fungi: what did I do?
19:23 <fungi> mordred: you totally broke rackspace ;)
19:23 <fungi> not really
19:23 <mordred> ah - joy
19:24 <fungi> just thought you might be interested that latest openstacksdk is failing to talk to rackspace's keystone
19:24 <mordred> that's exciting
19:24 <fungi> using an older openstacksdk works, so that's how we got around it in the short term
19:25 <fungi> well, an older openstacksdk install, so also older dependencies. it could be any of a number of them
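[A minimal sketch of the kind of cross-check described above, assuming a "rax" entry exists in clouds.yaml; exercising just the keystone auth step can give a clearer error than a full client command whose failure is wrapped in retries:]

```python
import openstack

# Authenticate against one cloud from clouds.yaml and report the raw failure.
conn = openstack.connect(cloud="rax")  # cloud name is an assumption
try:
    token = conn.authorize()  # performs just the keystone auth round trip
    print("keystone auth ok, token:", token[:8] + "...")
except Exception as exc:
    print("keystone auth failed:", exc)
```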
19:25 <clarkb> ianw: I've got openafs and refstack as separate agenda items. Should we just go over them here or move on and catch up under proper topic headings?
19:25 <ianw> up to you
19:26 <clarkb> #topic General topics
19:26 *** openstack changes topic to "General topics (Meeting topic: infra)"
19:26 <clarkb> #topic OpenAFS Cluster Status
19:26 <mordred> fungi: I'll take a look - the only thing that would be likely to have an impact would be keystoneauth
19:26 *** openstack changes topic to "OpenAFS Cluster Status (Meeting topic: infra)"
19:26 <clarkb> I don't think I saw any movement on this but wanted to double check. The fileservers are upgraded to 1.8.6 but not the db servers?
19:26 <ianw> the openafs status is that all servers/db servers are running 1.8.6-5
19:27 <clarkb> oh nice, the db servers got upgraded too. Excellent. Thank you for working on that
19:27 <clarkb> next steps there are to do the server upgrades then?
19:27 <clarkb> I've got them on my initial pass of a list for server upgrades too
19:27 <ianw> yep; so next i'll try an in-place focal upgrade, probably on one of the db servers first as they're small, and start that process
19:28 <clarkb> great, thanks again
19:28 <clarkb> #topic Refstack upgrade and container deployment
19:28 *** openstack changes topic to "Refstack upgrade and container deployment (Meeting topic: infra)"
19:29 <ianw> i got started on this
19:29 <ianw> there's a couple of open reviews in https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) to add the production deployment jobs
19:29 <clarkb> is there a change to add a server to inventory yet? I suppose for this server we won't have dns changes as dns will be updated via rax
19:30 <ianw> yeah i merged that yesterday
19:30 <clarkb> #link https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) Refstack changes that need review
19:30 <ianw> if we can just double check those jobs, i can babysit it today
19:30 <clarkb> cool, I can take a look at those really quickly after the meeting I bet
19:30 <fungi> SotK: ^ that may also make a good example for doing the storyboard deployment
19:30 <clarkb> ++
19:30 <ianw> then have to look at the db migration; the old one seemed to use a trove database while we're running it from a container now
19:31 <clarkb> ya I expect we'll restore from a dump for now to test that things work, then schedule a downtime so that we can stop refstack properly, do a dump, restore from that, then start on the new server with dns updates
19:32 <clarkb> and kopecmartin volunteered to test the newly deployed service, which will go a long way as I don't even know how to interact with it properly
19:32 <clarkb> Anything else to add on this topic?
19:32 <ianw> yep, there's terse notes at
19:32 <ianw> #link https://etherpad.opendev.org/p/refstack-docker
19:32 <ianw> other than that no
19:32 <clarkb> thank you everyone who helped move this along
19:33 <clarkb> #topic Bup and Borg Backups
19:33 *** openstack changes topic to "Bup and Borg Backups (Meeting topic: infra)"
19:33 <clarkb> ianw feel free to give us an update on borg db streaming and pruning and all other new info
19:34 <ianw> the streaming bit seems to be going well
19:34 <ianw> modulo of course mysqldump --all-databases stopping actually dumping all databases with a recent update
19:35 <clarkb> but it does still work if you specify specific databases
19:35 <ianw> #link https://bugs.launchpad.net/ubuntu/+source/mysql-5.7/+bug/1914695
19:35 <openstack> Launchpad bug 1914695 in mysql-5.7 (Ubuntu) "mysqldump --all-databases not dumping any databases with 5.7.33" [Undecided,New]
19:35 <clarkb> (which is the workaround we're going with?)
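[A sketch of that workaround: name the databases explicitly instead of relying on --all-databases. The database names, output path, and defaults file are placeholders:]

```python
import subprocess

# Dump explicitly named databases; --databases keeps the CREATE DATABASE
# statements in the output, as --all-databases would have.
DATABASES = ["etherpad", "gerrit"]  # placeholders for the real service DBs
with open("/var/backups/mysql-dump.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "--defaults-file=/root/.my.cnf",
         "--single-transaction", "--databases", *DATABASES],
        stdout=out,
        check=True,
    )
```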
19:35 <fungi> also there was some unanticipated fallout from the bup removal
19:36 <ianw> nobody else has commented or mentioned anything in this bug, and i can't find anything in the mysql bug tracker (though it's a bit of a mess), and i don't know how much more effort we want to spend on it, because it's talking to a 5.1 server in our case
19:36 <fungi> apparently apt-get thought bup was the only reason we wanted pymysql installed on the storyboard server, so when bup got uninstalled so did the python-pymysql package. hilarity ensued
19:36 <clarkb> mordred: ^ possible you may be interested? but ya I think our workaround is likely sufficient
19:38 <ianw> I also realised some things about borg's append-only model and pruning that are explained in their docs, if you read them the right way
19:39 <ianw> i've put up some reviews at
19:39 <ianw> #link https://review.opendev.org/q/topic:%22backup-more-prune%22+status:open
19:39 <ianw> that provide a script to do manual prunes of the backups, and a cron job to warn us via email when the backup partitions are looking full
19:39 <ianw> i think that is the best way to manage things for now
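[A minimal sketch of the "warn when the backup partition is filling up" half, assuming local mail delivery works on the backup server; the threshold, mount point, and addresses are made up:]

```python
import shutil
import smtplib
from email.message import EmailMessage

BACKUP_FS = "/opt/backups"  # placeholder mount point
THRESHOLD = 0.90            # warn above 90% used

usage = shutil.disk_usage(BACKUP_FS)
used = usage.used / usage.total
if used > THRESHOLD:
    msg = EmailMessage()
    msg["Subject"] = f"backup partition {BACKUP_FS} is {used:.0%} full"
    msg["From"] = "root@backup01"        # placeholder addresses
    msg["To"] = "infra-root@example.org"
    msg.set_content("Time to run the manual borg prune script.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```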
19:40 <clarkb> ianw: that seems like a good compromise, similar to how the certchecker reminded us to go buy new certs when we weren't using LE
19:40 <ianw> i think the *best* way would be to have rolling LVM snapshots implemented on the backup server
19:41 <ianw> but i think it's more important to just get running 100% with borg in a stable manner first
19:41 <clarkb> ++
19:42 <ianw> so yeah, basically request for reviews on the ideas presented in those changes
19:42 <clarkb> thank you for sticking with this. It's never an easy thing to change, but helps enable upgrades to focal and beyond for a number of services
19:42 <ianw> but i think we've got it working at a stable working set.  some things we can't avoid, like the review backups being big diffs due to git pack file updates
19:42 <clarkb> we could stop packing but then gerrit would get slow
19:43 <clarkb> Anything else on this or should we move on?
19:43 <fungi> we could "backup" git repositories via replication rather than off the fs?
19:43 <fungi> though what does the replication in that case?
19:44 <clarkb> fungi: the risk with that is a corrupted repo wouldn't be able to roll back easily
19:44 <fungi> yeah
19:44 <clarkb> with proper backups we can go to an old state
19:44 <fungi> well, assuming the repository was not mid-write when we backed it up
19:44 <ianw> yep, and although the deltas take up a lot of space, the other side is they do prune well
19:44 <clarkb> I think git is pretty good about that
19:45 <clarkb> basically git does order of operations to make backups like that mostly work aiui
19:45 <clarkb> Alright let's move on as we have a few more topics to cover
19:46 <clarkb> #topic Xenial Server Upgrades
19:46 *** openstack changes topic to "Xenial Server Upgrades (Meeting topic: infra)"
19:46 <clarkb> #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades
19:46 <clarkb> this has sort of been in continual progress over time, but as xenial eol approaches I think we should capture what remains and start prioritizing things
19:46 <clarkb> I've started to write down a partial list in that etherpad
19:47 <clarkb> I'm hoping that I might have time next week to start doing rolling replacements of zuul-mergers, zuul-executors, and nodepool-launchers
19:47 <clarkb> my idea there was to redeploy one of each on focal and we can check everything is happy with the switch, then roll through the others in each group
19:48 <clarkb> If you've got ideas on priorities or process/method/etc feel free to add notes to that etherpad
19:48 <clarkb> #topic Meetpad Audio Stopped Working
19:48 *** openstack changes topic to "Meetpad Audio Stopped Working (Meeting topic: infra)"
19:49 <clarkb> Late last week a few of us noticed that meetpad's audio wasn't working. By the time I got around to trying it again in order to look at it this week it was working
19:49 <fungi> yeah, it seems to be working fine today
19:49 <fungi> used it for a while
19:49 <clarkb> Last week I had actually tried using the main meet.jit.si service as well and had problems with it too. I suspect that we may have deployed a bug then deployed the fix, all automatically
19:50 <clarkb> This reminds me that I think corvus has mentioned we should be able to unfork one of the images we are running too
19:50 *** diablo_rojo has joined #opendev-meeting
19:50 <clarkb> it is possible that having a more static image for one of the services could have contributed as well
19:50 * diablo_rojo appears suuuuper late
19:50 <clarkb> corvus: ^ is it just a matter of replacing the image in our docker-compose?
19:51 <corvus> ohai
19:52 <corvus> clarkb: everything except -web is unpinned i think
19:52 <corvus> -web is the fork
19:52 <clarkb> and to unfork web we just update our docker-compose file? maybe set some new settings?
19:52 <corvus> i don't think we'd be updating/restarting any of those automatically
19:52 <clarkb> corvus: I think we may do a docker-compose pull && docker-compose up -d regularly
19:53 <corvus> gimme a sec
19:53 <clarkb> similar to how gitea does it (and it finds new mariadb images)
19:54 <corvus> okay, yeah, looks like we do restart, last was 4/5 days ago
19:54 <corvus> to unfork web, we actually update our dockerfile and the docker-compose
19:55 <clarkb> ok, it wasn't clear to me if we had to keep building the image ourselves or if we can use theirs like we do for the other services
19:55 <corvus> (we're building the image from a github/jeblair source repo and deploying it; to unfork, change docker-compose to deploy from upstream and rm the dockerfile) -- but don't do that yet, upstream may not have updated the image.
19:55 <corvus> we should use theirs
19:55 <clarkb> got it
19:55 <corvus> https://hub.docker.com/r/jitsi/web
19:56 <corvus> https://hub.docker.com/layers/jitsi/web/latest/images/sha256-018f7407c2514b5eeb27f4bc4d887ae4cd38d8446a0958c5ca9cee3fa811f575?context=explore
19:56 <corvus> 4 days ago
19:56 <corvus> we should unfork now
19:56 <clarkb> excellent. Did you want to write that change? If not I'm sure we can find a volunteer
19:56 <corvus> their build of -web should now have the meetpad PR merged in it
19:56 <corvus> clarkb: i will do so
19:56 <clarkb> thank you
19:56 <corvus> #action corvus unfork jitsi-meet
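[The docker-compose change being described would look roughly like the fragment below, swapping the locally built image for the upstream jitsi/web one; the old image name and service layout are guesses, not the actual system-config contents:]

```yaml
# docker-compose.yaml fragment, illustrative only
services:
  web:
    # before: a locally built image from the forked jitsi-meet repo
    # image: opendevorg/jitsi-web:latest
    image: jitsi/web:latest   # upstream image, now containing the meetpad PR
    restart: always
```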
19:57 <clarkb> #topic InMotion OpenStack as a Service
19:57 *** openstack changes topic to "InMotion OpenStack as a Service (Meeting topic: infra)"
19:57 <clarkb> Really quickly before our hour is up: I have deployed a control plane for an inmotion openstack managed cloud
19:57 <clarkb> everything seems to work at first glance and we could bootstrap users and projects and then point cloud launcher at it. Except that none of the api endpoints have ssl
19:58 <clarkb> there is a VIP involved somehow that load balances requests across the three control plane nodes (it is "hyperconverged")
19:58 <clarkb> I need to figure out how to properly listen on that VIP and then can run a simple ssl terminating proxy with a self signed cert or LE cert that forwards to local services
19:58 <clarkb> I have not yet figured that out
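[A sketch of the kind of terminating proxy being described, in haproxy terms; the VIP, port, and cert path are placeholders, and this assumes keystone listening in plaintext on the same host:]

```
# haproxy.cfg fragment, illustrative only
frontend keystone_tls
    # Terminate TLS on the VIP with a combined cert+key PEM file.
    bind 203.0.113.10:5000 ssl crt /etc/ssl/private/cloud-api.pem
    mode http
    default_backend keystone_plain

backend keystone_plain
    mode http
    # Forward to the plaintext keystone endpoint on localhost.
    server local 127.0.0.1:5000
```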
19:59 <clarkb> I've also tried to give this feedback back to inmotion as something that would be useful
19:59 <clarkb> another thing worth noting is that we have a /28 of ipv4 addresses there currently, so the ability to expand our nodepool resources is minimal right now
19:59 <fungi> got it, so by default their cloud deployments don't provide a reachable rest api?
19:59 <clarkb> well they do, but in plaintext
19:59 <corvus> clarkb: what's the vip attached to?
20:00 <fungi> oh! http just no https?
20:00 <clarkb> corvus: I have no idea. I tried looking and couldn't find it then ran out of time
20:00 <clarkb> a lot of things are in kolla containers and they all run the same exact command so it's been interesting poking around
20:00 <clarkb> (they run some sort of init that magically knows what other commands to run)
20:01 <clarkb> fungi: yup
20:01 <ianw> is it only ipv4 or also ipv6?
20:01 <clarkb> ianw: currently only ipv4 but ipv6 is something that they are looking at
20:01 <fungi> ipv6 is all the rage with the kids these days
20:01 <clarkb> (I expect that if we use this more properly it will be as an ipv6 "only" cloud then use the ipv4 /28 to do nat for outbound like limestone does)
20:01 <clarkb> but that is still theoretical right now
20:01 <clarkb> also we are now at time
20:01 <clarkb> thank you everyone!
20:01 <ianw> yeah, a /28 is ... 16? nodes? - control plane bits?
20:02 <fungi> thanks clarkb!
20:02 <fungi> ianw: correct
20:02 <clarkb> ianw: the control plane has a separate /28 or /29
20:02 <clarkb> this /28 is for the neutron networking side, so ya probably 14 usable, and after neutron uses a couple, 12?
20:02 <clarkb> We can continue conversations in #opendev
20:02 <clarkb> #endmeeting
20:02 <fungi> if the entire /28 is routed to the endpoint you could in theory use all 16 addresses
20:02 *** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"
20:02 <openstack> Meeting ended Tue Feb  9 20:02:33 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:02 <openstack> Minutes:        http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.html
20:02 <openstack> Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.txt
20:02 <openstack> Log:            http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.log.html
21:14 *** hasharAway has quit IRC
21:21 <kopecmartin> ianw: thank you for working on it
22:13 *** gmann is now known as gmann_afk
23:18 *** sboyron has quit IRC
