Tuesday, 2021-09-21

*** corvus is now known as Guest49009:27
*** tristanC_ is now known as tristanC13:16
clarkbAlmost meeting time18:59
ianwo/19:01
fungiahoy?19:01
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Sep 21 19:01:21 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkbhello19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-September/000285.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbMinor notice that the next few days I'll be afk a bit. Have doctor visits and also brothers are dragging me out fishing assuming the fishing is good today (the salmon are swimming upstream)19:02
clarkbI'll be around most of the day tomorrow. Then not very around thursday19:02
clarkb#topic Actions from last meeting19:03
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-14-19.01.txt minutes from last meeting19:03
diablo_rojoo/19:03
clarkbThere were no recorded actions. However I probably should've recorded one for the next thing :)19:03
clarkb#topic Specs19:03
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement19:03
clarkbAs we discussed last week there is some consideration for whether or not we should continue to use snmp or switch to node exporter19:04
clarkbit looks like the big downside to node exporter is going to be that using distro packages is problematic: the distro packages are older and have multiple backward-incompatible changes relative to the current 1.x release series19:04
clarkbusing docker to deploy node exporter is possible but as ianw and corvus point out a bit odd because we have to expose system resources to it in the container19:05
clarkbThen the downsides to snmp are needing to do a lot of work to build out the metrics and graphs ourselves19:05
clarkbI think I'm leaning more towards node exporter. One of the reasons we are switching to prometheus is it gives us the ability to do richer metrics beyond just system level stuff (think applications), and leaning into its more native tooling seems reasonable as a result19:06
clarkbAnyway please leave your preferences in review and I'll update it if necessary19:06
clarkb#action Everyone provide feedback on Prometheus spec and indicate a preference for snmp or node exporter19:07
ianw++ personally i feel like un-containerised makes most sense, even if we pull it from a ppa or something for consistency19:07
fungiin progress, i'm putting together a mailman 3.x migration spec, hope to have it up by the next meeting19:07
clarkbianw: I think if we can get at least node exporter v1.x that would work. As they appear to have gotten a lot better about not just changing the names of stuff19:07
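For reference, running node_exporter in a container means handing it the host's /proc, /sys and root filesystem explicitly, which is the oddness ianw and corvus are pointing at. A minimal sketch along the lines of the upstream documentation (image tag is illustrative):

    docker run -d --name node-exporter \
      --net=host --pid=host \
      -v /:/host:ro,rslave \
      quay.io/prometheus/node-exporter:latest \
      --path.rootfs=/host

The host network, host PID namespace and read-only root mount are exactly the bits that make the containerised deployment feel awkward next to a plain distro or PPA package.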
clarkbfungi: thanks!19:08
clarkb#topic Topics19:08
clarkb#topic Listserv updates19:08
clarkbJust a heads up that we pinned the kernel packages on lists.o.o. If we need to update the kernel there we can do it explicitly then run the extract-vmlinux tool against the result and replace the file in /boot19:09
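A rough sketch of that pin-and-extract workflow, assuming the extract-vmlinux script from the kernel source tree is sitting in a homedir (package names, versions and paths are placeholders, not the exact ones on lists.o.o):

    # keep apt from pulling in a new kernel automatically
    sudo apt-mark hold linux-image-virtual

    # when we deliberately want to update: unhold, upgrade, then produce an
    # uncompressed kernel from the new vmlinuz and drop it into /boot
    sudo apt-mark unhold linux-image-virtual
    sudo apt-get install --only-upgrade linux-image-virtual
    sudo ~/extract-vmlinux /boot/vmlinuz-<version> > /boot/vmlinux-<version>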
clarkbAs far as replacing the server goes I think we should consider that in the context of fungi's spec. Seems like there may be a process where we spin up a mm3 server and then migrate into that to transition servers as well as services19:09
fungiyes, we could in theory migrate on a domain by domain basis at least. migrating list by list would be more complex (involving apache redirects and mail forwards)19:10
clarkbAnother option available to us is to spin up a new mm2 server using the test mode flag which won't email all the list owners. Then we can migrate list members, archives, and configs to that server. I think this becomes a good option if it is easier to upgrade mm3 from mm2 in place19:10
clarkbI'm somewhat deferring to reading fungi's spec to get a better understanding of which approach is preferable19:11
fungithere's a config importer for mm3, and it can also back-populate the new hyperkitty list archives (sans attachments) from the mbox copies19:11
fungibut it's also typical to continue serving the old pipermail-based archives indefinitely so as to not break existing hyperlinks19:12
fungiwe can rsync them over as a separate step of course19:12
fungiotherwise it's mostly switching dns records19:12
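The import tooling fungi is describing boils down to roughly the following per list; list address, file names and paths are examples, and the exact hyperkitty invocation depends on how mailman-web ends up being deployed:

    # import the mm2 list configuration and membership into mailman 3
    mailman import21 openstack-discuss@lists.openstack.org config.pck

    # back-populate the hyperkitty archive from the old mbox (attachments are dropped)
    django-admin hyperkitty_import -l openstack-discuss@lists.openstack.org openstack-discuss.mbox

    # keep serving the old pipermail archives unchanged by copying them across
    rsync -a /var/lib/mailman/archives/ newserver:/var/lib/mailman/archives/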
clarkbthat seems reasonable19:12
fungii think the bulk of the work will be up front, figuring out how we want to deploy and configure the containers19:13
fungiso that's where i'm going to need the most input once i push up the draft spec19:13
clarkbnoted19:13
clarkbare there any other concerns or items to note about the existing server?19:14
clarkbI think we're fairly stable now. And the tools to do the kernel extraction should all be in my homedir on the server19:14
funginot presently, as far as i'm aware19:14
funginot presently any other concerns or items to note, i mean19:14
clarkb#topic Improving OpenDev's CD throughput19:15
clarkbI suspect that this one has taken a backseat to firefighting and other items19:15
clarkbianw: ^ anything new to call out? Totally fine if not (I've had my own share of distractions recently)19:15
ianwno, sorry, will get back to it19:16
clarkb#topic Gerrit Account Cleanups19:16
clarkbI keep intending to send out emails for this early in a week but then finding other more urgent items early in the week :/19:17
clarkbAt this point I'm hopeful this can happen tomorrow19:17
clarkbBut I haven't sent any emails yet19:17
clarkb#topic OpenDev Logo Hosting19:17
clarkbThis one has made great progress. Thank you ianw.19:17
clarkbAt this point we've got about 3 things half related to this gerrit update for the logo in our gerrit theme. in #opendev I proposed that we land the gerrit updates soon. Then we can do a gerrit pull and restart to pick up the replication timeout config change and the theme changes.19:18
ianwi see we have a plan for paste19:18
ianwwe can also use the buildkit approach with gerrit container and copy it in via the assets container19:19
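As a sketch, the buildkit/assets-container pattern ianw mentions would look something like this in the gerrit image build; the image names and destination path here are assumptions, not the actual system-config contents:

    # syntax=docker/dockerfile:1
    FROM opendevorg/assets AS assets
    FROM opendevorg/gerrit:3.2
    # pull the shared logo out of the assets image rather than keeping a second copy in-tree
    COPY --from=assets /usr/share/assets/opendev.svg /var/gerrit/static/opendev.svg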
clarkbThen when that is done we can update the gitea 1.15.3 change to stop trying to manage the gerrit theme logo url and upgrade gitea19:19
ianwbut that is separate to actually unblocking gitea19:19
fungiyeah, i'm still trying to get the local hosting for paste working, have added a test and an autohold, will see if my test fails19:19
ianwok, will peruse the change19:19
clarkbianw: I think I'm ok with the gerrit approach as is since we already copy other assets in using this system19:19
clarkbSeparately I did push up a gitea 1.14.7 change stacked under the 1.15.3 change which I think is safe to land today and we should consider doing so19:19
clarkb(I'm not sure if gitea tests old point release to latest release upgrades)19:20
clarkbianw: anyway I didn't approve the gerrit logo changes because I wanted to make sure we are all cool with the above approach before committing to it19:20
fungii definitely don't mind copying assets into containers, the two goals as i saw it were 1. only have one copy of distinct assets in our git repositories, and 2. not cause browsers to grab assets for one service from an unrelated one19:20
clarkbianw: but feel free to approve if this sounds good to you19:20
clarkbSounds like that may be it on this topic?19:22
ianwthis sounds good, will go through today after breakfast19:22
clarkbianw: thanks. Let me know if I can help with anything too19:22
clarkb#topic Gerrit Replication "leaks"19:22
clarkbI did more digging into this today. What I found was that there is no indication on the gitea side that gerrit is talking to it (no ssh processes, no git-receive-pack processes and no sockets)19:23
clarkbfungi checked the gerrit side and saw that gerrit did think it had a socket open to the gitea19:23
clarkbThe good news with that is I suspect the no network traffic timeout may actually help us here as a result19:24
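For context, the timeout in question is presumably the per-remote setting in gerrit's replication.config, which bounds how long a network read or write may sit idle. A sketch (remote name, url and value are illustrative):

    [remote "gitea05"]
      url = ssh://git@gitea05.opendev.org:222/${name}.git
      timeout = 600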
clarkbOther things I have found include that the giteas have ipv6 addresses but no AAAA records. This means all replication happens over ipv4. This is a good thing because it appears gitea05 cannot talk to review02 via ipv619:24
clarkbI ran some ping -c 100 processes between gitea05 and review02 and from both sides saw about a 2% packet loss during one iteration19:25
clarkbMakes me suspect something funny with networking is happening but that will need more investigating19:25
clarkbFinally we've left 3 leaked tasks in place this morning to see if gerrit eventually handles them itself19:26
fungiwhen looking at the leaked connections earlier, i did notice there was one which was open on the gitea side but not the gerrit side19:26
clarkbIf necessary we can kill and reenqueue the replication for those but as long as no one complains or notices it is a good sanity check to see if gerrit eventually cleans up after itself19:26
fungis/open/established19:26
clarkbfungi: oh I thought it was the gerrit side that was open but not gitea19:26
clarkbor did you see that too?19:26
fungier, yeah might have been. i'd need to revisit that19:27
clarkbok19:27
fungialso we had some crazy dos situation at the time, so i sort of stopped digging deeper19:27
clarkbAlso while I was digging into this a bit more ^ happened19:27
fungiconditions could have been complicated by that situation19:27
clarkbfungi and I made notes of the details in the incident channel19:27
fungii would not assume they're typical results19:27
clarkbshould this occur again we've identified a likely culprit and they can be temporarily filtered via iptables on the haproxy server19:27
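For the record, the temporary filtering amounts to a plain iptables drop on the load balancer; the address below is an example, not the actual culprit:

    # block the offending source while the load subsides
    sudo iptables -I INPUT -s 198.51.100.23 -j DROP
    # then lift the block again afterwards
    sudo iptables -D INPUT -s 198.51.100.23 -j DROP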
clarkb#topic Scheduling Gerrit Project Renames19:29
clarkbJust a reminder that these requests are out there and we said we would pencil in the week of October 11-1519:29
clarkbI'm beginning to strongly suspect that we cannot delete old orgs and have working redirects from the old org name to the new one19:30
fungiyeah, the list seems fairly solidified at this point, barring further additions19:30
fungiif anyone has repos they want renamed, now's the time to get the changes up for them19:30
fungialso we decided that emptying a namespace might cause issues on the gitea side?19:30
fungiwas there a resolution to that?19:30
clarkbAnd I looked at the rename playbook briefly to see if I could determine what would be required to force update all the project metadata after a rename. I think the biggest issue here is access to the metadata as the rename playbook has a very small set of data19:31
clarkbfungi: see my note above. I think it is only an issue if we delete the org19:31
fungiahh, okay19:31
clarkbfungi: we won't delete the org when we rename.19:31
clarkbI brought it up to try and figure out if we could safely cleanup old orgs but I think that is a bad idea19:31
fungiand yes, for metadata the particular concern raised by users is that in past renames we haven't updated issues links19:31
fungiso renamed orgs with storyboard links are going to their old urls still19:32
clarkbya for metadata I think where I ended up was that the simplest solution is to make our rename process a two pass system. First pass is the rename playbook. Then we run the gitea project management playbook with the force update flag set to true, but only run it against the subset of projects that are affected by the rename19:32
fungithough separately, a nice future addition would be some redirects in apache on the storyboard end (could just be a .htaccess file even)19:32
clarkbrather than try and have the rename playbook learn how to do it all at once (because the datastructures are very different)19:32
clarkbThis two pass system should be testable in the existing jobs we've got for spinning up a gitea19:33
clarkbif someone has time to update the job to run the force update after a rename that would be a good addition19:33
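Sketching the two pass idea with hypothetical playbook and variable names (the real playbooks and variables in system-config may be named differently):

    # pass one: the existing rename playbook, fed the list of old -> new names
    ansible-playbook playbooks/rename_repos.yaml -e rename_file=renames.yaml

    # pass two: re-run gitea project management with the force-update flag,
    # limited to just the renamed projects so their metadata gets refreshed
    ansible-playbook playbooks/manage-gitea-projects.yaml \
        -e force_gitea_update=true -e project_filter=renamed-projects.yaml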
clarkbAnything else on project renames?19:34
funginope, the last one went fairly smoothly19:34
clarkb#topic InMotion Scale Up19:35
fungiwe do however need to make sure that all our servers are up so ansible doesn't have a cow man19:35
clarkb++19:35
clarkblast week I fixed leaked placement records in the inmotion cloud which corrected the no valid host found errors there19:35
clarkbThen on Friday and over the weekend the cloud was updated to have a few more IPs assigned to it and we bumped up the nodepool max-servers19:35
clarkbIn the process we've discovered we need to tune that setting for the cloud's abilities better19:36
clarkbTheJulia noticed some unittests took a long time and more recently I've found that zuul jobs running there have difficulty talking to npm's registry (though I'm not yet certain this was a cloud issue as I couldn't replicate it from hosts with the same IP in the same cloud)19:36
clarkbAll this to say please be aware of this and don't be afraid to dial back max-servers if evidence points to problems19:37
fungii think yuriys mentioned yesterday adding some datadog agents to the underlying systems in order to better profile resource utilization too19:37
clarkbThey are very interested in helping us run our CI jobs and I want to support that which I guess means risking a few broken eggs19:37
fungias of this morning we lowered max_servers to 3219:38
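The knob being tuned is the pool's max-servers in the nodepool launcher config, roughly like the following; provider, pool and label names here are illustrative rather than the actual deployment values:

    providers:
      - name: inmotion
        cloud: inmotion
        pools:
          - name: main
            max-servers: 32   # dialed back from the higher value tried over the weekend
            labels:
              - name: ubuntu-focal
                diskimage: ubuntu-focal
                flavor-name: ci-flavor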
clarkbfungi: yup that was one idea that was mentioned. I was ok with it if they felt that was the best approach19:38
fungithis morning my time (around 13z i think?)19:38
clarkbBut thought others might have opinions about using the non-free service (I think they use it internally so are able to parse those metrics)19:38
fungii also suggested to yuriys that he can tweak quotas on the openstack side to more dynamically adjust how many nodes we boot if that's easier for troubleshooting/experimentation19:39
clarkbnote we can do that too as we have access to set quotas on the project19:39
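Since we have access to the project's quotas, capping capacity from the cloud side is a one-liner; the numbers are examples and <project> is whatever the nodepool tenant is named:

    # cap the project at 32 instances regardless of what nodepool asks for
    openstack quota set --instances 32 --cores 256 <project>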
clarkbalso 8 was the old stable node count19:39
fungiyep, though we also have access to just edit the launcher's config and put it in the emergency list19:39
clarkbBut ya they seem very interested in helping us so I think it is worth working through this19:40
clarkband it seems like they have been getting valuable feedback too. Hopefully win win for everyone19:40
ianwi'm not sure about the datadog things, but it sounds a lot like the stats nodepool puts out via openstackapi anyway19:42
clarkbianw: I think the datadog agents can attach to the various openstack python processes and record things like rabbitmq connection issues and placement allocation problems like we saw19:42
clarkbsimilar to what prometheus theoretically lets us do with gerrit and so on19:43
clarkbat the very least I'm willing to experiment with it if they feel it would be helpful. We've always said we can redeploy this cloud if necessary19:43
clarkbbut if anyone has strong objections definitely let yuriys know19:43
fungiyeah, he's around in #opendev and paying attention19:44
ianwhttps://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is looking a little sad on openstackapi stats anyway19:44
fungiat least lately19:44
clarkbianw: hrm is that a bug in our nodepool configs?19:44
clarkbor maybe openstacksdk updated again and changed everything?19:44
ianwi feel like i've fixed things in here before, i'll have to investigate19:45
clarkb#topic Open Discussion19:46
clarkbsounds like that may have been it for our last agenda item? Anything else can go here :)19:46
clarkbI suspect that zuul will be wanting to do a full restart of the opendev zuul install soon. There are a number of scale out scheduler changes that have landed as well as bugfixes for issues we've seen19:47
clarkbWe should be careful to do that around the openstack release in a way that doesn't impact them greatly19:47
fungii still need to do the afs01.dfw cinder volume replacement19:47
fungithat was going to be today, until git asploded19:47
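For anyone following along, the cinder volume replacement on afs01.dfw is presumably the usual attach-and-pvmove shuffle: add a new volume, migrate the LVM extents onto it, then retire the old one. A hedged outline with placeholder volume, device and volume-group names:

    # create and attach the replacement volume
    openstack volume create --size 1024 afs01-dfw-new
    openstack server add volume afs01.dfw.openstack.org afs01-dfw-new

    # move the data off the old physical volume and drop it from the VG
    pvcreate /dev/xvdc
    vgextend <vgname> /dev/xvdc
    pvmove /dev/xvdb /dev/xvdc        # migrates extents off the old volume
    vgreduce <vgname> /dev/xvdb
    pvremove /dev/xvdb
    openstack server remove volume afs01.dfw.openstack.org <old-volume>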
clarkbhalf related the paste cinder volume seems more stable today19:48
fungigood19:48
clarkbI know openstack has a number of bugs in its CI unrelated to the infrastructure too so don't be surprised if we get requests to hold instances or help debug them19:49
clarkbsome of them are as simple as debuntu package does not install reliably :/19:49
Guest490would a restart later today be okay?19:50
fungialso the great setuptools shakeup19:50
fungiGuest490: i'm guessing you're corvus and asking about restarting zuul?19:51
clarkbI suspect that today or tomorrow are likely to be ok particularly later in the day Pacific time19:51
fungiif so, yes seems fine to me19:51
clarkbseems we get a big early rush then it tails off19:51
clarkband then next week is likely to be bad for restarts19:51
clarkb(I suspect second rcs to roll through next week)19:52
fungishould we time the zuul and gerrit restarts together?19:52
clarkbfungi: that is an option if we can get the theme updates in19:52
clarkbzuul goes quickly enough that we probably don't need to require that though19:52
Guest490yep i am corvus19:52
fungii'm happy to help with restarting it all after the outstanding patches for the gerrit container build and deploy19:54
fungiin the meantime i need to start preparing dinner19:54
clarkbcool I'll be around too this afternoon as noted in #opendev. And ya I need lunch now19:54
clarkbThanks everyone! feel free to continue conversation in #opendev or on the service-discuss mailing list19:55
clarkb#endmeeting19:55
opendevmeetMeeting ended Tue Sep 21 19:55:12 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:55
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.html19:55
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.txt19:55
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.log.html19:55
fungithanks clarkb!19:56
*** Guest490 is now known as corvus21:55
*** corvus is now known as _corvus21:56
*** _corvus is now known as corvus21:56
