Tuesday, 2021-09-21

*** corvus is now known as Guest49009:27
*** tristanC_ is now known as tristanC13:16
clarkbAlmost meeting time18:59
ianwo/19:01
fungiahoy?19:01
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Sep 21 19:01:21 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkbhello19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-September/000285.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbMinor notice that the next few days I'll be afk a bit. Have doctor visits and also brothers are dragging me out fishing assuming the fishing is good today (the salmon are swimming upstream)19:02
clarkbI'll be around most of the day tomorrow. Then not very around thursday19:02
clarkb#topic Actions from last meeting19:03
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-14-19.01.txt minutes from last meeting19:03
diablo_rojoo/19:03
clarkbThere were no recorded actions. However I probably should've recorded one for the next thing :)19:03
clarkb#topic Specs19:03
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement19:03
clarkbAs we discussed last week there is some consideration for whether or not we should continue to use snmp or switch to node exporter19:04
clarkbit looks like the big downside to node exporter is going to be that using distro packages is problematic: the distro packages are older and have multiple backward-incompatible changes relative to the current 1.x release series19:04
clarkbusing docker to deploy node exporter is possible but as ianw and corvus point out a bit odd because we have to expose system resources to it in the container19:05
clarkbThen the downsides to snmp are needing to do a lot of work to build out the metrics and graphs ourselves19:05
clarkbI think I'm leaning more towards node exporter. One of the reasons we are switching to prometheus is it gives us the ability to do richer metrics beyond just system level stuff (think applications), and leaning into its more native tooling seems reasonable as a result19:06
clarkbAnyway please leave your preferences in review and I'll update it if necessary19:06
clarkb#action Everyone provide feedback on Prometheus spec and indicate a preference for snmp or node exporter19:07
ianw++ personally i feel like un-containerised makes most sense, even if we pull it from a ppa or something for consistency19:07
fungiin progress, i'm putting together a mailman 3.x migration spec, hope to have it up by the next meeting19:07
clarkbianw: I think if we can get at least node exporter v1.x that would work. As they appear to have gotten a lot better about not just changing the names of stuff19:07
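For reference, running node_exporter in a container means handing it the host's /proc, /sys and root filesystem explicitly, which is the oddness ianw and corvus are pointing at. A minimal sketch along the lines of the upstream documentation (image tag is illustrative):

    docker run -d --name node-exporter \
      --net=host --pid=host \
      -v /:/host:ro,rslave \
      quay.io/prometheus/node-exporter:latest \
      --path.rootfs=/host

The host network, host PID namespace and read-only root mount are exactly the bits that make the containerised deployment feel awkward next to a plain distro or PPA package.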
clarkbfungi: thanks!19:08
clarkb#topic Topics19:08
clarkb#topic Listserv updates19:08
clarkbJust a heads up that we pinned the kernel packages on lists.o.o. If we need to update the kernel there we can do it explicitly then run the extract-vmlinux tool against the result and replace the file in /boot19:09
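A rough sketch of that pin-and-extract workflow, assuming the extract-vmlinux script from the kernel source tree is sitting in a homedir (package names, versions and paths are placeholders, not the exact ones on lists.o.o):

    # keep apt from pulling in a new kernel automatically
    sudo apt-mark hold linux-image-virtual

    # when we deliberately want to update: unhold, upgrade, then produce an
    # uncompressed kernel from the new vmlinuz and drop it into /boot
    sudo apt-mark unhold linux-image-virtual
    sudo apt-get install --only-upgrade linux-image-virtual
    sudo ~/extract-vmlinux /boot/vmlinuz-<version> > /boot/vmlinux-<version>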
clarkbAs far as replacing the server goes I think we should consider that in the context of fungi's spec. Seems like there may be a process where we spin up a mm3 server and then migrate into that to transition servers as well as services19:09
fungiyes, we could in theory migrate on a domain by domain basis at least. migrating list by list would be more complex (involving apache redirects and mail forwards)19:10
clarkbAnother option available to us is to spin up a new mm2 server using the test mode flag which won't email all the list owners. Then we can migrate list members, archives, and configs to that server. I think this becomes a good option if it is easier to upgrade mm3 from mm2 in place19:10
clarkbI'm somewhat deferring to reading fungi's spec to get a better understanding of which approach is preferable19:11
fungithere's a config importer for mm3, and it can also back-populate the new hyperkitty list archives (sans attachments) from the mbox copies19:11
fungibut it's also typical to continue serving the old pipermail-based archives indefinitely so as to not break existing hyperlinks19:12
fungiwe can rsync them over as a separate step of course19:12
fungiotherwise it's mostly switching dns records19:12
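The import tooling fungi is describing boils down to roughly the following per list; list address, file names and paths are examples, and the exact hyperkitty invocation depends on how mailman-web ends up being deployed:

    # import the mm2 list configuration and membership into mailman 3
    mailman import21 openstack-discuss@lists.openstack.org config.pck

    # back-populate the hyperkitty archive from the old mbox (attachments are dropped)
    django-admin hyperkitty_import -l openstack-discuss@lists.openstack.org openstack-discuss.mbox

    # keep serving the old pipermail archives unchanged by copying them across
    rsync -a /var/lib/mailman/archives/ newserver:/var/lib/mailman/archives/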
clarkbthat seems reasonable19:12
fungii think the bulk of the work will be up front, figuring out how we want to deploy and configure the containers19:13
fungiso that's where i'm going to need the most input once i push up the draft spec19:13
clarkbnoted19:13
clarkbare there any other concerns or items to note about the existing server?19:14
clarkbI think we're fairly stable now. And the tools to do the kernel extraction should all be in my homedir on the server19:14
funginot presently, as far as i'm aware19:14
funginot presently any other concerns or items to note, i mean19:14
clarkb#topic Improving OpenDev's CD throughput19:15
clarkbI suspect that this one has taken a backseat to firefighting and other items19:15
clarkbianw: ^ anything new to call out? Totally fine if not (I've had my own share of distractions recently)19:15
ianwno, sorry, will get back to it19:16
clarkb#topic Gerrit Account Cleanups19:16
clarkbI keep intending to send out emails for this early in a week but then finding other more urgent items early in the week :/19:17
clarkbAt this point I'm hopeful this can happen tomorrow19:17
clarkbBut I haven't sent any emails yet19:17
clarkb#topic OpenDev Logo Hosting19:17
clarkbThis one has made great progress. Thank you ianw.19:17
clarkbAt this point we've got about 3 things half related to this gerrit update for the logo in our gerrit theme. in #opendev I proposed that we land the gerrit updates soon. Then we can do a gerrit pull and restart to pick up the replication timeout config change and the theme changes.19:18
ianwi see we have a plan for paste19:18
ianwwe can also use the buildkit approach with gerrit container and copy it in via the assets container19:19
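As a sketch, the buildkit/assets-container pattern ianw mentions would look something like this in the gerrit image build; the image names and destination path here are assumptions, not the actual system-config contents:

    # syntax=docker/dockerfile:1
    FROM opendevorg/assets AS assets
    FROM opendevorg/gerrit:3.2
    # pull the shared logo out of the assets image rather than keeping a second copy in-tree
    COPY --from=assets /usr/share/assets/opendev.svg /var/gerrit/static/opendev.svg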
clarkbThen when that is done we can update the gitea 1.15.3 change to stop trying to manage the gerrit theme logo url and upgrade gitea19:19
ianwbut that is separate to actually unblocking gitea19:19
fungiyeah, i'm still trying to get the local hosting for paste working, have added a test and an autohold, will see if my test fails19:19
ianwok, will peruse the change19:19
clarkbianw: I think I'm ok with the gerrit approach as is since we already copy other assets in using this system19:19
clarkbSeparately I did push up a gitea 1.14.7 change stacked under the 1.15.3 change which I think is safe to land today and we should consider doing so19:19
clarkb(I'm not sure if gitea tests old point release to latest release upgrades)19:20
clarkbianw: anyway I didn't approve the gerrit logo changes because I wanted to make sure we are all cool with the above approach before committing to it19:20
fungii definitely don't mind copying assets into containers, the two goals as i saw it were 1. only have one copy of distinct assets in our git repositories, and 2. not cause browsers to grab assets for one service from an unrelated one19:20
clarkbianw: but feel free to approve if this sounds good to you19:20
clarkbSounds like that may be it on this topic?19:22
ianwthis sounds good, will go through today after breakfast19:22
clarkbianw: thanks. Let me know if I can help with anything too19:22
clarkb#topic Gerrit Replication "leaks"19:22
clarkbI did more digging into this today. What I found was that there is no indication on the gitea side that gerrit is talking to it (no ssh processes, no git-receive-pack processes and no sockets)19:23
clarkbfungi checked the gerrit side and saw that gerrit did think it had a socket open to the gitea19:23
clarkbThe good news with that is I suspect the no network traffic timeout may actually help us here as a result19:24
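For context, the timeout in question is presumably the per-remote setting in gerrit's replication.config, which bounds how long a network read or write may sit idle. A sketch (remote name, url and value are illustrative):

    [remote "gitea05"]
      url = ssh://git@gitea05.opendev.org:222/${name}.git
      timeout = 600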
clarkbOther things I have found include that the giteas have ipv6 addresses but no AAAA records. This means all replication happens over ipv4. This is a good thing because it appears gitea05 cannot talk to review02 via ipv619:24
clarkbI ran some ping -c 100 processes between gitea05 and review02 and from both sides saw about a 2% packet loss during one iteration19:25
clarkbMakes me suspect something funny with networking is happening but that will need more investigating19:25
clarkbFinally we've left 3 leaked tasks in place this morning to see if gerrit eventually handles them itself19:26
fungiwhen looking at the leaked connections earlier, i did notice there was one which was open on the gitea side but not the gerrit side19:26
clarkbIf necessary we can kill and reenqueue the replication for those but as long as no one complains or notices it is a good sanity check to see if gerrit eventually cleans up after itself19:26
fungis/open/established19:26
clarkbfungi: oh I thought it was the gerrit side that was open but not gitea19:26
clarkbor did you see that too?19:26
fungier, yeah might have been. i'd need to revisit that19:27
clarkbok19:27
fungialso we had some crazy dos situation at the time, so i sort of stopped digging deeper19:27
clarkbAlso while I was digging into this a bit more ^ happened19:27
fungiconditions could have been complicated by that situation19:27
clarkbfungi and I made notes of the details in the incident channel19:27
fungii would not assume they're typical results19:27
clarkbshould this occur again we've identified a likely culprit and they can be temporarily filtered via iptables on the haproxy server19:27
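For the record, the temporary filtering amounts to a plain iptables drop on the load balancer; the address below is an example, not the actual culprit:

    # block the offending source while the load subsides
    sudo iptables -I INPUT -s 198.51.100.23 -j DROP
    # then lift the block again afterwards
    sudo iptables -D INPUT -s 198.51.100.23 -j DROP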
clarkb#topic Scheduling Gerrit Project Renames19:29
clarkbJust a reminder that these requests are out there and we said we would pencil in the week of October 11-1519:29
clarkbI'm beginning to strongly suspect that we cannot delete old orgs and have working redirects from the old org name to the new one19:30
fungiyeah, the list seems fairly solidified at this point, barring further additions19:30
fungiif anyone has repos they want renamed, now's the time to get the changes up for them19:30
fungialso we decided that emptying a namespace might cause issues on the gitea side?19:30
fungiwas there a resolution to that?19:30
clarkbAnd I looked at the rename playbook briefly to see if I could determine what would be required to force update all the project metadata after a rename. I think the biggest issue here is access to the metadata as the rename playbook has a very small set of data19:31
clarkbfungi: see my note above. I think it is only an issue if we delete the org19:31
fungiahh, okay19:31
clarkbfungi: we won't delete the org when we rename.19:31
clarkbI brought it up to try and figure out if we could safely cleanup old orgs but I think that is a bad idea19:31
fungiand yes, for metadata the particular concern raised by users is that in past renames we haven't updated issues links19:31
fungiso renamed orgs with storyboard links are going to their old urls still19:32
clarkbya for metadata I think where I ended up was that the simplest solution is to make our rename process a two pass system. First pass is the rename playbook. Then we run the gitea project management playbook with the force update flag set to true, but only run it against the subset of projects that are affected by the rename19:32
fungithough separately, a nice future addition would be some redirects in apache on the storyboard end (could just be a .htaccess file even)19:32
clarkbrather than try and have the rename playbook learn how to do it all at once (because the datastructures are very different)19:32
clarkbThis two pass system should be testable in the existing jobs we've got for spinning up a gitea19:33
clarkbif someone has time to update the job to run the force update after a rename that would be a good addition19:33
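Sketching the two pass idea with hypothetical playbook and variable names (the real playbooks and variables in system-config may be named differently):

    # pass one: the existing rename playbook, fed the list of old -> new names
    ansible-playbook playbooks/rename_repos.yaml -e rename_file=renames.yaml

    # pass two: re-run gitea project management with the force-update flag,
    # limited to just the renamed projects so their metadata gets refreshed
    ansible-playbook playbooks/manage-gitea-projects.yaml \
        -e force_gitea_update=true -e project_filter=renamed-projects.yaml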
clarkbAnything else on project renames?19:34
funginope, the last one went fairly smoothly19:34
clarkb#topic InMotion Scale Up19:35
fungiwe do however need to make sure that all our servers are up so ansible doesn't have a cow man19:35
clarkb++19:35
clarkblast week I fixed leaked placement records in the inmotion cloud which corrected the no valid host found errors there19:35
clarkbThen on Friday and over the weekend the cloud was updated to have a few more IPs assigned to it and we bumped up the nodepool max-servers19:35
clarkbIn the process we've discovered we need to tune that setting for the cloud's abilities better19:36
clarkbTheJulia noticed some unittests took a long time and more recently I've found that zuul jobs running there have difficulty talking to npm's registry (though I'm not yet certain this was a cloud issue as I couldn't replicate it from hosts with the same IP in the same cloud)19:36
clarkbAll this to say please be aware of this and don't be afraid to dial back max-servers if evidence points to problems19:37
fungii think yuriys mentioned yesterday adding some datadog agents to the underlying systems in order to better profile resource utilization too19:37
clarkbThey are very interested in helping us run our CI jobs and I want to support that which I guess means risking a few broken eggs19:37
fungias of this morning we lowered max_servers to 3219:38
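The knob being tuned is the pool's max-servers in the nodepool launcher config, roughly like the following; provider, pool and label names here are illustrative rather than the actual deployment values:

    providers:
      - name: inmotion
        cloud: inmotion
        pools:
          - name: main
            max-servers: 32   # dialed back from the higher value tried over the weekend
            labels:
              - name: ubuntu-focal
                diskimage: ubuntu-focal
                flavor-name: ci-flavor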
clarkbfungi: yup that was one idea that was mentioned. I was ok with it if they felt that was the best approach19:38
fungithis morning my time (around 13z i think?)19:38
clarkbBut thought others might have opinions about using the non-free service (I think they use it internally so are able to parse those metrics)19:38
fungii also suggested to yuriys that he can tweak quotas on the openstack side to more dynamically adjust how many nodes we boot if that's easier for troubleshooting/experimentation19:39
clarkbnote we can do that too as we have access to set quotas on the project19:39
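Since we have access to the project's quotas, capping capacity from the cloud side is a one-liner; the numbers are examples and <project> is whatever the nodepool tenant is named:

    # cap the project at 32 instances regardless of what nodepool asks for
    openstack quota set --instances 32 --cores 256 <project>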
clarkbalso 8 was the old stable node count19:39
fungiyep, though we also have access to just edit the launcher's config and put it in the emergency list19:39
clarkbBut ya they seem very interested in helping us so I think it is worth working through this19:40
clarkband it seems like they have been getting valuable feedback too. Hopefully win win for everyone19:40
ianwi'm not sure about the datadog things, but it sounds a lot like the stats nodepool puts out via openstackapi anyway19:42
clarkbianw: I think the datadog agents can attach to the various openstack python processes and record things like rabbitmq connection issues and placement allocation problems like we saw19:42
clarkbsimilar to what prometheus theoretically lets us do with gerrit and so on19:43
clarkbat the very least I'm willing to experiment with it if they feel it would be helpful. We've always said we can redeploy this cloud if necessary19:43
clarkbbut if anyone has strong objections definitely let yuriys know19:43
fungiyeah, he's around in #opendev and paying attention19:44
ianwhttps://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is looking a little sad on openstackapi stats anyway19:44
fungiat least lately19:44
clarkbianw: hrm is that a bug in our nodepool configs?19:44
clarkbor maybe openstacksdk updated again and changed everything?19:44
ianwi feel like i've fixed things in here before, i'll have to investigate19:45
clarkb#topic Open Discussion19:46
clarkbsounds like that may have been it for our last agenda item? Anything else can go here :)19:46
clarkbI suspect that zuul will be wanting to do a full restart of the opendev zuul install soon. There are a number of scale out scheduler changes that have landed as well as bugfixes for issues we've seen19:47
clarkbWe should be careful to do that around the openstack release in a way that doesn't impact them greatly19:47
fungii still need to do the afs01.dfw cinder volume replacement19:47
fungithat was going to be today, until git asploded19:47
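For anyone following along, the cinder volume replacement on afs01.dfw is presumably the usual attach-and-pvmove shuffle: add a new volume, migrate the LVM extents onto it, then retire the old one. A hedged outline with placeholder volume, device and volume-group names:

    # create and attach the replacement volume
    openstack volume create --size 1024 afs01-dfw-new
    openstack server add volume afs01.dfw.openstack.org afs01-dfw-new

    # move the data off the old physical volume and drop it from the VG
    pvcreate /dev/xvdc
    vgextend <vgname> /dev/xvdc
    pvmove /dev/xvdb /dev/xvdc        # migrates extents off the old volume
    vgreduce <vgname> /dev/xvdb
    pvremove /dev/xvdb
    openstack server remove volume afs01.dfw.openstack.org <old-volume>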
clarkbhalf related the paste cinder volume seems more stable today19:48
fungigood19:48
clarkbI know openstack has a number of bugs in its CI unrelated to the infrastructure too so don't be surprised if we get requests to hold instances or help debug them19:49
clarkbsome of them are as simple as debuntu package does not install reliably :/19:49
Guest490would a restart later today be okay?19:50
fungialso the great setuptools shakeup19:50
fungiGuest490: i'm guessing you're corvus and asking about restarting zuul?19:51
clarkbI suspect that today or tomorrow are likely to be ok particularly later in the day Pacific time19:51
fungiif so, yes seems fine to me19:51
clarkbseems we get a big early rush then it tails off19:51
clarkband then next week is likely to be bad for restarts19:51
clarkb(I suspect second rcs to roll through next week)19:52
fungishould we time the zuul and gerrit restarts together?19:52
clarkbfungi: that is an option if we can get the theme updates in19:52
clarkbzuul goes quickly enough that we probably don't need to require that though19:52
Guest490yep i am corvus19:52
fungii'm happy to help with restarting it all after the outstanding patches for the gerrit container build and deploy19:54
fungiin the meantime i need to start preparing dinner19:54
clarkbcool I'll be around too this afternoon as noted in #opendev. And ya I need lunch now19:54
clarkbThanks everyone! feel free to continue conversation in #opendev or on the service-discuss mailing list19:55
clarkb#endmeeting19:55
opendevmeetMeeting ended Tue Sep 21 19:55:12 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:55
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.html19:55
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.txt19:55
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.log.html19:55
fungithanks clarkb!19:56
*** Guest490 is now known as corvus21:55
*** corvus is now known as _corvus21:56
*** _corvus is now known as corvus21:56
