19:01:21 <clarkb> #startmeeting infra
19:01:21 <opendevmeet> Meeting started Tue Sep 21 19:01:21 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:21 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:21 <opendevmeet> The meeting name has been set to 'infra'
19:01:24 <clarkb> hello
19:01:35 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000285.html Our Agenda
19:01:46 <clarkb> #topic Announcements
19:02:26 <clarkb> Minor notice that the next few days I'll be afk a bit. Have doctor visits, and my brothers are also dragging me out fishing, assuming the fishing is good today (the salmon are swimming upstream)
19:02:56 <clarkb> I'll be around most of the day tomorrow. Then not very available Thursday
19:03:11 <clarkb> #topic Actions from last meeting
19:03:17 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-14-19.01.txt minutes from last meeting
19:03:21 <diablo_rojo> o/
19:03:34 <clarkb> There were no recorded actions. However I probably should've recorded one for the next thing :)
19:03:39 <clarkb> #topic Specs
19:03:44 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:24 <clarkb> As we discussed last week there is some question of whether we should continue to use snmp or switch to node exporter
19:04:54 <clarkb> it looks like the big downside to node exporter is that using distro packages is problematic: the distro packages are older and have multiple backward incompatible changes relative to the current 1.x release series
19:05:18 <clarkb> using docker to deploy node exporter is possible but as ianw and corvus point out a bit odd because we have to expose system resources to it in the container
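[For context, a minimal sketch of the containerised node exporter deployment being discussed, i.e. exposing host resources into the container; the image tag and mount options here are illustrative assumptions, not a settled OpenDev configuration:]

```shell
# Run node_exporter in a container while bind-mounting the host's root
# filesystem and sharing its network/pid namespaces so host metrics are
# visible. The image tag is an example, not a pinned OpenDev choice.
docker run -d --name node-exporter \
  --net=host --pid=host \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:v1.2.2 \
  --path.rootfs=/host
```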
19:05:36 <clarkb> Then the downsides to snmp are needing to do a lot of work to build out the metrics and graphs ourselves
19:06:28 <clarkb> I think I'm leaning more towards node exporter. One of the reasons we are switching to prometheus is it gives us the ability to do richer metrics beyond just system level stuff (think applications) and leaning into their more native tooling seems reasonable as a result
19:06:46 <clarkb> Anyway please leave your preferences in review and I'll update it if necessary
19:07:05 <clarkb> #action Everyone provide feedback on Prometheus spec and indicate a preference for snmp or node exporter
19:07:16 <ianw> ++ personally i feel like un-containerised makes most sense, even if we pull it from a ppa or something for consistency
19:07:48 <fungi> in progress, i'm putting together a mailman 3.x migration spec, hope to have it up by the next meeting
19:07:50 <clarkb> ianw: I think if we can get at least node exporter v1.x that would work, as they appear to have gotten a lot better about not just renaming things
19:08:06 <clarkb> fungi: thanks!
19:08:15 <clarkb> #topic Topics
19:08:26 <clarkb> #topic Listserv updates
19:09:03 <clarkb> Just a heads up that we pinned the kernel packages on lists.o.o. If we need to update the kernel there we can do it explicitly then run the extract-vmlinux tool against the result and replace the file in /boot
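[Roughly, the pin-and-extract workflow described above looks like the following; the package names and output paths are assumptions about the lists.o.o setup:]

```shell
# Keep apt from pulling in new kernels automatically.
apt-mark hold linux-image-generic linux-generic

# When we explicitly want a newer kernel: upgrade it, then decompress the
# image with the kernel tree's extract-vmlinux script and swap the result
# into /boot (filenames here are placeholders).
apt-get install --only-upgrade linux-image-generic
./extract-vmlinux /boot/vmlinuz-<new-version> > /boot/vmlinuz-<new-version>.extracted
```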
19:09:43 <clarkb> As far as replacing the server goes I think we should consider that in the context of fungi's spec. Seems like there may be a process where we spin up a mm3 server and then migrate into that to transition servers as well as services
19:10:30 <fungi> yes, we could in theory migrate on a domain by domain basis at least. migrating list by list would be more complex (involving apache redirects and mail forwards)
19:10:31 <clarkb> Another option available to us is to spin up a new mm2 server using the test mode flag which won't email all the list owners. Then we can migrate list members, archives, and configs to that server. I think this becomes a good option if it is easier to upgrade from mm2 to mm3 in place
19:11:06 <clarkb> I'm somewhat deferring to reading fungi's spec to get a better understanding of which approach is preferable
19:11:39 <fungi> there's a config importer for mm3, and it can also back-populate the new hyeprkitty list archives (sans attachments) from the mbox copies
19:11:51 <fungi> er, hyperkitty
19:12:21 <fungi> but it's also typical to continue serving the old pipermail-based archives indefinitely so as to not break existing hyperlinks
19:12:32 <fungi> we can rsync them over as a separate step of course
19:12:46 <fungi> otherwise it's mostly switching dns records
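[A rough sketch of the migration steps fungi describes; the list name, paths, and exact invocations inside the mm3 containers are placeholders, not verified commands from the spec:]

```shell
# Import an mm2 list's configuration into mailman 3.
mailman import21 service-discuss@lists.opendev.org /srv/mm2/lists/service-discuss/config.pck

# Back-populate the hyperkitty archive from the old mbox (attachments are not carried over).
django-admin hyperkitty_import -l service-discuss@lists.opendev.org \
    /srv/mm2/archives/private/service-discuss.mbox/service-discuss.mbox

# Keep serving the old pipermail archives so existing links don't break.
rsync -a lists.opendev.org:/srv/mm2/archives/public/ /var/www/pipermail/
```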
19:12:49 <clarkb> that seems reasonable
19:13:07 <fungi> i think the bulk of the work will be up front, figuring out how we want to deploy and configure the containers
19:13:39 <fungi> so that's where i'm going to need the most input once i push up the draft spec
19:13:46 <clarkb> noted
19:14:02 <clarkb> are there any other concerns or items to note about the existing server?
19:14:21 <clarkb> I think we're fairly stable now. And the tools to do the kernel extraction should all be in my homedir on the server
19:14:25 <fungi> not presently, as far as i'm aware
19:14:45 <fungi> not presently any other concerns or items to note, i mean
19:15:08 <clarkb> #topic Improving OpenDev's CD throughput
19:15:19 <clarkb> I suspect that this one has taken a backseat to firefighting and other items
19:15:34 <clarkb> ianw: ^ anything new to call out? Totally fine if not (I've had my own share of distractions recently)
19:16:36 <ianw> no, sorry, will get back to it
19:16:47 <clarkb> #topic Gerrit Account Cleanups
19:17:01 <clarkb> I keep intending to send out emails for this early in a week but then finding other more urgent items early in the week :/
19:17:12 <clarkb> At this point I'm hopeful this can happen tomorrow
19:17:17 <clarkb> But I haven't sent any emails yet
19:17:28 <clarkb> #topic OpenDev Logo Hosting
19:17:36 <clarkb> This one has made great progress. Thank you ianw.
19:18:35 <clarkb> At this point we've got about 3 things loosely related to this gerrit update for the logo in our gerrit theme. In #opendev I proposed that we land the gerrit updates soon. Then we can do a gerrit pull and restart to pick up the replication timeout config change and the theme changes.
19:18:42 <ianw> i see we have a plan for paste
19:19:04 <ianw> we can also use the buildkit approach with the gerrit container and copy it in via the assets container
19:19:05 <clarkb> Then when that is done we can update the gitea 1.15.3 change to stop trying to manage the gerrit theme logo url and upgrade gitea
19:19:18 <ianw> but that is separate to actually unblocking gitea
19:19:19 <fungi> yeah, i'm still trying to get the local hosting for paste working, have added a test and an autohold, will see if my test fails
19:19:31 <ianw> ok, will peruse the change
19:19:33 <clarkb> ianw: I think I'm ok with the gerrit approach as is since we already copy other assets in using this system
19:19:58 <clarkb> Separately I did push up a gitea 1.14.7 change stacked under the 1.15.3 change which I think is safe to land today and we should consider doing so
19:20:16 <clarkb> (I'm not sure if gitea tests old point release to latest release upgrades)
19:20:41 <clarkb> ianw: anyway I didn't approve the gerrit logo changes because I wanted to make sure we are all cool with the above approach before committing to it
19:20:45 <fungi> i definitely don't mind copying assets into containers, the two goals as i saw it were 1. only have one copy of distinct assets in our git repositories, and 2. not cause browsers to grab assets for one service from an unrelated one
19:20:48 <clarkb> ianw: but feel free to approve if this sounds good to you
19:22:25 <clarkb> Sounds like that may be it on this topic?
19:22:26 <ianw> this sounds good, will go through today after breakfast
19:22:36 <clarkb> ianw: thanks. Let me know if I can help with anything too
19:22:56 <clarkb> #topic Gerrit Replication "leaks"
19:23:30 <clarkb> I did more digging into this today. What I found was that there is no indication on the gitea side that gerrit is talking to it (no ssh processes, no git-receive-pack processes and no sockets)
19:23:44 <clarkb> fungi checked the gerrit side and saw that gerrit did think it had a socket open to the gitea
19:24:04 <clarkb> The good news with that is I suspect the no network traffic timeout may actually help us here as a result
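[A sketch of the kind of checks being described on each side; the hostnames, user, and address placeholders are illustrative:]

```shell
# On the gitea backend: look for gerrit-initiated ssh/git processes and sockets.
ps -ef | egrep 'git-receive-pack|sshd'
ss -tnp | grep <review02-ip>

# On the gerrit server: list stuck replication tasks and gerrit's view of its
# outbound connections to that gitea.
ssh -p 29418 <admin-user>@review.opendev.org gerrit show-queue --wide --by-queue
ss -tnp | grep <gitea05-ip>
```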
19:24:39 <clarkb> Other things I have found include giteas have ipv6 addresses but no AAAA records. This means all replication happens over ipv4. This is a good thing because it appears gitea05 cannot talk to review02 via ipv6
19:25:10 <clarkb> I ran some ping -c 100 processes between gitea05 and review02 and from both sides saw about a 2% packet loss during one iteration
19:25:28 <clarkb> Makes me suspect something funny with networking is happening but that will need more investigating
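[The checks referred to above, roughly:]

```shell
# No AAAA records published for the backends, so replication sticks to ipv4.
dig +short AAAA gitea05.opendev.org

# Packet loss spot check, run from both ends; ~2% loss showed up in one
# iteration between gitea05 and review02.
ping -c 100 review02.opendev.org | tail -n 2
```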
19:26:01 <clarkb> Finally we've left 3 leaked tasks in place this morning to see if gerrit eventually handles them itself
19:26:33 <fungi> when looking at the leaked connections earlier, i did notice there was one which was open on the gitea side but not the gerrit side
19:26:34 <clarkb> If necessary we can kill and reenqueue the replication for those but as long as no one complains or notices it is a good sanity check to see if gerrit eventually cleans up after itself
19:26:46 <fungi> s/open/established
19:26:50 <clarkb> fungi: oh I thought it was the gerrit side that was open but not gitea
19:26:57 <clarkb> or did you see that too?
19:27:04 <fungi> er, yeah might have been. i'd need to revisit that
19:27:07 <clarkb> ok
19:27:23 <fungi> also we had some crazy dos situation at the time, so i sort of stopped digging deeper
19:27:27 <clarkb> Also while I was digging into this a bit more ^ happened
19:27:35 <fungi> conditions could have been complicated by that situation
19:27:39 <clarkb> fungi and I made notes of the details in the incident channel
19:27:45 <fungi> i would not assume they're typical results
19:27:56 <clarkb> should this occur again we've identified a likely culprit and they can be temporarily filtered via iptables on the haproxy server
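[If this does recur, a temporary block on the haproxy host would look something like the following; the addresses are documentation-range placeholders:]

```shell
# Drop traffic from the offending source, then remove the rule once the
# situation calms down.
iptables  -I INPUT -s 203.0.113.45 -j DROP
ip6tables -I INPUT -s 2001:db8::45 -j DROP

# Later, to undo:
iptables  -D INPUT -s 203.0.113.45 -j DROP
ip6tables -D INPUT -s 2001:db8::45 -j DROP
```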
19:29:27 <clarkb> #topic Scheduling Gerrit Project Renames
19:29:44 <clarkb> Just a reminder that these requests are out there and we said we would pencil in the week of October 11-15
19:30:00 <clarkb> I'm beginning to strongly suspect that we cannot delete old orgs and have working redirects from the old org name to the new one
19:30:03 <fungi> yeah, the list seems fairly solidified at this point, barring further additions
19:30:20 <fungi> if anyone has repos they want renamed, now's the time to get the changes up for them
19:30:40 <fungi> also we decided that emptying a namespace might cause issues on the gitea side?
19:30:51 <fungi> was there a resolution to that?
19:31:03 <clarkb> And I looked at the rename playbook briefly to see if I could determine what would be required to force update all the project metadata after a rename. I think the biggest issue here is access to the metadata as the rename playbook has a very small set of data
19:31:12 <clarkb> fungi: see my note above. I think it is only an issue if we delete the org
19:31:18 <fungi> ahh, okay
19:31:19 <clarkb> fungi: we won't delete the org when we rename.
19:31:38 <clarkb> I brought it up to try and figure out if we could safely cleanup old orgs but I think that is a bad idea
19:31:40 <fungi> and yes, for metadata the particular concern raised by users is that in past renames we haven't updated issues links
19:32:16 <fungi> so renamed orgs with storyboard links are going to their old urls still
19:32:33 <clarkb> ya for metadata I think where I ended up was that the simplest solution is to make our rename process a two pass system. First pass is the rename playbook. Then we run the gitea project management playbook with the force update flag set to true, but only against the subset of projects affected by the rename
19:32:50 <fungi> though separately, a nice future addition would be some redirects in apache on the storyboard end (could just be a .htaccess file even)
19:32:53 <clarkb> rather than try and have the rename playbook learn how to do it all at once (because the datastructures are very different)
19:33:21 <clarkb> This two pass system should be testable in the existing jobs we've got for spinning up a gitea
19:33:38 <clarkb> if someone has time to update the job to run the force update after a rename that would be a good addition
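[A sketch of the two pass flow being described; the playbook paths and variable names below are approximations of what lives in system-config, not verified invocations:]

```shell
# Pass 1: the existing rename playbook moves the projects in gerrit and gitea.
ansible-playbook playbooks/rename_repos.yaml -e repolist=/path/to/renames.yaml

# Pass 2: re-run the gitea project management playbook with its force-update
# flag set, limited to the renamed projects, so descriptions and other
# metadata get refreshed.
ansible-playbook playbooks/sync-gitea-projects.yaml \
    -e force_update=true -e project_list=/path/to/renamed-projects.yaml
```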
19:34:35 <clarkb> Anything else on project renames?
19:34:57 <fungi> nope, the last one went fairly smoothly
19:35:02 <clarkb> #topic InMotion Scale Up
19:35:12 <fungi> we do however need to make sure that all our servers are up so ansible doesn't have a cow man
19:35:18 <clarkb> ++
19:35:31 <clarkb> last week I fixed leaked placement records in the inmotion cloud which corrected the no valid host found errors there
19:35:55 <clarkb> Then on Friday and over the weekend the cloud was updated to have a few more IPs assigned to it and we bumped up the nodepool max-servers
19:36:12 <clarkb> In the process we've discovered we need to better tune that setting to the cloud's abilities
19:36:54 <clarkb> TheJulia noticed some unittests took a long time and more recently I've found that zuul jobs running there have difficulty talking to npm's registry (though I'm not yet certain this was a cloud issue as I couldn't replicate it from hosts with the same IP in the same cloud)
19:37:20 <clarkb> All this to say please be aware of this and don't be afraid to dial back max-servers if evidence points to problems
19:37:26 <fungi> i think yuriys mentioned yesterday adding some datadog agents to the underlying systems in order to better profile resource utilization too
19:37:44 <clarkb> They are very interested in helping us run our CI jobs and I want to support that which I guess means risking a few broken eggs
19:38:02 <fungi> as of this morning we lowered max_servers to 32
19:38:06 <clarkb> fungi: yup that was one idea that was mentioned. I was ok with it if they felt that was the best approach
19:38:14 <fungi> this morning my time (around 13z i think?)
19:38:26 <clarkb> But thought others might have opinions about using the non free service (I think they use it internally so are able to parse those metrics)
19:39:09 <fungi> i also suggested to yuriys that he can tweak quotas on the openstack side to more dynamically adjust how many nodes we boot if that's easier for troubleshooting/experimentation
19:39:24 <clarkb> note we can do that too as we have access to set quotas on the project
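[For reference, adjusting the instance quota on the openstack side amounts to the following; the project name is a placeholder:]

```shell
# Cap how many instances nodepool can boot without touching the launcher's
# config; nodepool backs off when it hits quota.
openstack quota set --instances 32 <nodepool-project>
openstack quota show <nodepool-project> | grep instances
```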
19:39:37 <clarkb> also 8 was the old stable node count
19:39:47 <fungi> yep, though we also have access to just edit the launcher's config and put it in the emergency list
19:40:25 <clarkb> But ya they seem very interested in helping us so I think it is worth working through this
19:40:39 <clarkb> and it seems like they have been getting valuable feedback too. Hopefully win win for everyone
19:42:01 <ianw> i'm not sure about the datadog things, but it sounds a lot like the stats nodepool puts out via openstackapi anyway
19:42:43 <clarkb> ianw: I think the datadog agents can attach to the various openstack python processes and record things like rabbitmq connection issues and placement allocation problems like we saw
19:43:02 <clarkb> similar to what prometheus theoretically lets us do with gerrit and so on
19:43:29 <clarkb> at the very least I'm willing to experiement with it if they feel it would be helpful. We've always said we can redeploy this cloud if necessary
19:43:44 <clarkb> but if anyone has strong objections definitely let yuriys know
19:44:10 <fungi> yeah, he's around in #opendev and paying attention
19:44:17 <ianw> https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is looking a little sad on openstackapi stats anyway
19:44:18 <fungi> at least lately
19:44:50 <clarkb> ianw: hrm is that a bug in our nodepool configs?
19:44:57 <clarkb> or maybe openstacksdk updated again and changed everything?
19:45:13 <ianw> i feel like i've fixed things in here before, i'll have to investigate
19:46:43 <clarkb> #topic Open Discussion
19:46:52 <clarkb> sounds like that may have been it for our last agenda item? Anything else can go here :)
19:47:17 <clarkb> I suspect that zuul will be wanting to do a full restart of the opendev zuul install soon. There are a number of scale out scheduler changes that have landed as well as bugfixes for issues we've seen
19:47:37 <clarkb> We should be careful to do that around the openstack release in a way that doesn't impact them greatly
19:47:47 <fungi> i still need to do the afs01.dfw cinder volume replacement
19:47:58 <fungi> that was going to be today, until git asploded
19:48:10 <clarkb> half related the paste cinder volume seems more stable today
19:48:25 <fungi> good
19:49:31 <clarkb> I know openstack has a number of bugs in its CI unrelated to the infrastructure too so don't be surprised if we get requests to hold instances or help debug them
19:49:49 <clarkb> some of them are as simple as debuntu package does not install reliably :/
19:50:44 <Guest490> would a restart later today be okay?
19:50:54 <fungi> also the great setuptools shakeup
19:51:15 <fungi> Guest490: i'm guessing you're corvus and asking about restarting zuul?
19:51:32 <clarkb> I suspect that today or tomorrow are likely to be ok particularly later in the day Pacific time
19:51:36 <fungi> if so, yes seems fine to me
19:51:38 <clarkb> seems we get a big early rush then it tails off
19:51:46 <clarkb> and then next week is likely to be bad for restarts
19:52:10 <clarkb> (I suspect second rcs to roll through next week)
19:52:16 <fungi> should we time the zuul and gerrit restarts together?
19:52:26 <clarkb> fungi: that is an option if we can get the theme updates in
19:52:40 <clarkb> zuul goes quickly enough that we probably don't need to require that though
19:52:58 <Guest490> yep i am corvus
19:54:34 <fungi> i'm happy to help with restarting it all after the outstanding patches for the gerrit container build and deploy
19:54:47 <fungi> in the meantime i need to start preparing dinner
19:54:56 <clarkb> cool I'll be around too this afternoon as noted in #opendev. And ya I need lunch now
19:55:10 <clarkb> Thanks everyone! feel free to continue conversation in #opendev or on the service-discuss mailing list
19:55:12 <clarkb> #endmeeting