19:01:21 #startmeeting infra
19:01:21 Meeting started Tue Sep 21 19:01:21 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:21 The meeting name has been set to 'infra'
19:01:24 hello
19:01:35 #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000285.html Our Agenda
19:01:46 #topic Announcements
19:02:26 Minor notice that the next few days I'll be afk a bit. I have doctor visits and my brothers are also dragging me out fishing, assuming the fishing is good today (the salmon are swimming upstream)
19:02:56 I'll be around most of the day tomorrow. Then not very around Thursday
19:03:11 #topic Actions from last meeting
19:03:17 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-14-19.01.txt minutes from last meeting
19:03:21 o/
19:03:34 There were no recorded actions. However I probably should've recorded one for the next thing :)
19:03:39 #topic Specs
19:03:44 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:24 As we discussed last week, the open question is whether we should continue to use snmp or switch to node exporter
19:04:54 It looks like the big downside to node exporter is that using distro packages is problematic: they are older and have multiple backward-incompatible changes relative to the current 1.x release series
19:05:18 Using docker to deploy node exporter is possible but, as ianw and corvus point out, a bit odd because we have to expose system resources to it inside the container
19:05:36 The downside to snmp is needing to do a lot of work to build out the metrics and graphs ourselves
19:06:28 I think I'm leaning more towards node exporter. One of the reasons we are switching to prometheus is that it gives us the ability to do richer metrics beyond just system level stuff (think applications), and leaning into its more native tooling seems reasonable as a result
19:06:46 Anyway please leave your preferences in review and I'll update it if necessary
19:07:05 #action Everyone provide feedback on Prometheus spec and indicate a preference for snmp or node exporter
19:07:16 ++ personally i feel like un-containerised makes most sense, even if we pull it from a ppa or something for consistency
19:07:48 in progress, i'm putting together a mailman 3.x migration spec, hope to have it up by the next meeting
19:07:50 ianw: I think if we can get at least node exporter v1.x that would work, as they appear to have gotten a lot better about not just changing the names of stuff
19:08:06 fungi: thanks!
19:08:15 #topic Topics
19:08:26 #topic Listserv updates
19:09:03 Just a heads up that we pinned the kernel packages on lists.o.o. If we need to update the kernel there we can do it explicitly, then run the extract-vmlinux tool against the result and replace the file in /boot
19:09:43 As far as replacing the server goes, I think we should consider that in the context of fungi's spec. It seems like there may be a process where we spin up a mm3 server and then migrate into that, transitioning servers as well as services
19:10:30 yes, we could in theory migrate on a domain by domain basis at least. migrating list by list would be more complex (involving apache redirects and mail forwards)
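For illustration, a minimal sketch of the containerized node exporter deployment discussed above, roughly following the upstream node_exporter documentation; the image tag and flags are the upstream defaults, not anything already in our configs. It shows why running it in docker feels odd: the container needs the host's network and PID namespaces plus a read-only bind mount of the root filesystem so it can read /proc, /sys, and the mounted filesystems it reports on.

    # Sketch only: run node exporter in docker with the host resources it needs exposed
    docker run -d \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      quay.io/prometheus/node-exporter:latest \
      --path.rootfs=/host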
19:10:31 Another option available to us is to spin up a new mm2 server using the test mode flag, which won't email all the list owners. Then we can migrate list members, archives, and configs to that server. I think this becomes a good option if it is easier to upgrade from mm2 to mm3 in place
19:11:06 I'm somewhat deferring to reading fungi's spec to get a better understanding of which approach is preferable
19:11:39 there's a config importer for mm3, and it can also back-populate the new hyperkitty list archives (sans attachments) from the mbox copies
19:12:21 but it's also typical to continue serving the old pipermail-based archives indefinitely so as to not break existing hyperlinks
19:12:32 we can rsync them over as a separate step of course
19:12:46 otherwise it's mostly switching dns records
19:12:49 that seems reasonable
19:13:07 i think the bulk of the work will be up front, figuring out how we want to deploy and configure the containers
19:13:39 so that's where i'm going to need the most input once i push up the draft spec
19:13:46 noted
19:14:02 are there any other concerns or items to note about the existing server?
19:14:21 I think we're fairly stable now. And the tools to do the kernel extraction should all be in my homedir on the server
19:14:25 not presently, as far as i'm aware
19:14:45 not presently any other concerns or items to note, i mean
19:15:08 #topic Improving OpenDev's CD throughput
19:15:19 I suspect that this one has taken a backseat to firefighting and other items
19:15:34 ianw: ^ anything new to call out? Totally fine if not (I've had my own share of distractions recently)
19:16:36 no, sorry, will get back to it
19:16:47 #topic Gerrit Account Cleanups
19:17:01 I keep intending to send out emails for this early in the week but then keep finding other more urgent items :/
19:17:12 At this point I'm hopeful this can happen tomorrow
19:17:17 But I haven't sent any emails yet
19:17:28 #topic OpenDev Logo Hosting
19:17:36 This one has made great progress. Thank you ianw.
19:18:35 At this point we've got about 3 changes loosely related to updating the logo in our gerrit theme. In #opendev I proposed that we land the gerrit updates soon. Then we can do a gerrit pull and restart to pick up the replication timeout config change and the theme changes.
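As a hedged sketch of the mm3 import tooling mentioned in the mailman discussion above: the list config importer and the HyperKitty archive back-population are separate steps. The list address, file names, and Django settings arguments below are illustrative assumptions; the real invocation will depend on how the mm3 containers end up being deployed.

    # Sketch only: import a Mailman 2 list's pickled config into an existing MM3 list
    mailman import21 service-discuss@lists.opendev.org config.pck
    # then back-populate the HyperKitty archive from the pipermail mbox copy
    # (attachments are not carried over)
    django-admin hyperkitty_import --pythonpath /etc/mailman3 --settings settings \
        -l service-discuss@lists.opendev.org service-discuss.mbox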
19:18:42 i see we have a plan for paste
19:19:04 we can also use the buildkit approach with the gerrit container and copy it in via the assets container
19:19:05 Then when that is done we can update the gitea 1.15.3 change to stop trying to manage the gerrit theme logo url and upgrade gitea
19:19:18 but that is separate to actually unblocking gitea
19:19:19 yeah, i'm still trying to get the local hosting for paste working, have added a test and an autohold, will see if my test fails
19:19:31 ok, will peruse the change
19:19:33 ianw: I think I'm ok with the gerrit approach as is since we already copy other assets in using this system
19:19:58 Separately I did push up a gitea 1.14.7 change stacked under the 1.15.3 change which I think is safe to land today, and we should consider doing so
19:20:16 (I'm not sure if gitea tests old point release to latest release upgrades)
19:20:41 ianw: anyway I didn't approve the gerrit logo changes because I wanted to make sure we are all cool with the above approach before committing to it
19:20:45 i definitely don't mind copying assets into containers, the two goals as i saw it were 1. only have one copy of distinct assets in our git repositories, and 2. not cause browsers to grab assets for one service from an unrelated one
19:20:48 ianw: but feel free to approve if this sounds good to you
19:22:25 Sounds like that may be it on this topic?
19:22:26 this sounds good, will go through today after breakfast
19:22:36 ianw: thanks. Let me know if I can help with anything too
19:22:56 #topic Gerrit Replication "leaks"
19:23:30 I did more digging into this today. What I found was that there is no indication on the gitea side that gerrit is talking to it (no ssh processes, no git-receive-pack processes and no sockets)
19:23:44 fungi checked the gerrit side and saw that gerrit did think it had a socket open to the gitea
19:24:04 The good news with that is I suspect the no-network-traffic timeout may actually help us here as a result
19:24:39 Other things I have found: the giteas have ipv6 addresses but no AAAA records. This means all replication happens over ipv4, which is a good thing because it appears gitea05 cannot talk to review02 via ipv6
19:25:10 I ran some ping -c 100 processes between gitea05 and review02 and from both sides saw about 2% packet loss during one iteration
19:25:28 Makes me suspect something funny with networking is happening, but that will need more investigating
19:26:01 Finally, we've left 3 leaked tasks in place this morning to see if gerrit eventually handles them itself
19:26:33 when looking at the leaked connections earlier, i did notice there was one which was open on the gitea side but not the gerrit side
19:26:34 If necessary we can kill and reenqueue the replication for those, but as long as no one complains or notices it is a good sanity check to see if gerrit eventually cleans up after itself
19:26:46 s/open/established
19:26:50 fungi: oh I thought it was the gerrit side that was open but not gitea
19:26:57 or did you see that too?
19:27:04 er, yeah might have been. i'd need to revisit that
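A quick sketch of the packet loss check and the temporary iptables mitigation mentioned above; the hostname and the address are placeholders, not the actual hosts or offender involved.

    # Sketch only: measure packet loss between a gitea backend and the gerrit server
    ping -c 100 review02.opendev.org | tail -n 2
    # temporarily drop an abusive source on the haproxy host, then remove the rule later
    sudo iptables -I INPUT -s 203.0.113.45 -j DROP
    sudo iptables -D INPUT -s 203.0.113.45 -j DROP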
19:27:07 ok
19:27:23 also we had some crazy DoS situation at the time, so i sort of stopped digging deeper
19:27:27 Also while I was digging into this a bit more ^ happened
19:27:35 conditions could have been complicated by that situation
19:27:39 fungi and I made notes of the details in the incident channel
19:27:45 i would not assume they're typical results
19:27:56 should this occur again we've identified a likely culprit, and they can be temporarily filtered via iptables on the haproxy server
19:29:27 #topic Scheduling Gerrit Project Renames
19:29:44 Just a reminder that these requests are out there and we said we would pencil in the week of October 11-15
19:30:00 I'm beginning to strongly suspect that we cannot delete old orgs and have working redirects from the old org name to the new one
19:30:03 yeah, the list seems fairly solidified at this point, barring further additions
19:30:20 if anyone has repos they want renamed, now's the time to get the changes up for them
19:30:40 also we decided that emptying a namespace might cause issues on the gitea side?
19:30:51 was there a resolution to that?
19:31:03 And I looked at the rename playbook briefly to see if I could determine what would be required to force update all the project metadata after a rename. I think the biggest issue here is access to the metadata, as the rename playbook has a very small set of data
19:31:12 fungi: see my note above. I think it is only an issue if we delete the org
19:31:18 ahh, okay
19:31:19 fungi: we won't delete the org when we rename.
19:31:38 I brought it up to try and figure out if we could safely clean up old orgs, but I think that is a bad idea
19:31:40 and yes, for metadata the particular concern raised by users is that in past renames we haven't updated issue links
19:32:16 so renamed orgs with storyboard links are going to their old urls still
19:32:33 ya, for metadata I think where I ended up is that the simplest solution is to make our rename process a two pass system. First pass is the rename playbook. Then we run the gitea project management playbook with the force update flag set to true, but only run it against the subset of projects that are affected by the rename
19:32:50 though separately, a nice future addition would be some redirects in apache on the storyboard end (could just be a .htaccess file even)
19:32:53 rather than try and have the rename playbook learn how to do it all at once (because the data structures are very different)
19:33:21 This two pass system should be testable in the existing jobs we've got for spinning up a gitea
19:33:38 if someone has time to update the job to run the force update after a rename that would be a good addition
19:34:35 Anything else on project renames?
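A purely hypothetical sketch of the two pass rename approach described above; the playbook names, file names, and variables are assumptions made for illustration, not the actual system-config playbooks or flags.

    # Sketch only -- pass 1: run the rename playbook to move the projects
    ansible-playbook playbooks/rename_repos.yaml -e repos_file=renames.yaml
    # pass 2: re-run gitea project management with force update enabled, limited to
    # the renamed projects, so descriptions and issue links get refreshed
    ansible-playbook playbooks/manage-gitea-projects.yaml \
        -e force_update=true -e project_filter=renamed-projects.yaml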
19:34:57 nope, the last one went fairly smoothly
19:35:02 #topic InMotion Scale Up
19:35:12 we do however need to make sure that all our servers are up so ansible doesn't have a cow, man
19:35:18 ++
19:35:31 last week I fixed leaked placement records in the inmotion cloud, which corrected the "no valid host found" errors there
19:35:55 Then Friday and over the weekend the cloud was updated to have a few more IPs assigned to it, and we bumped up the nodepool max-servers
19:36:12 In the process we've discovered we need to tune that setting better for the cloud's abilities
19:36:54 TheJulia noticed some unit tests took a long time, and more recently I've found that zuul jobs running there have difficulty talking to npm's registry (though I'm not yet certain this was a cloud issue as I couldn't replicate it from hosts with the same IP in the same cloud)
19:37:20 All this to say please be aware of this and don't be afraid to dial back max-servers if evidence points to problems
19:37:26 i think yuriys mentioned yesterday adding some datadog agents to the underlying systems in order to better profile resource utilization too
19:37:44 They are very interested in helping us run our CI jobs and I want to support that, which I guess means risking a few broken eggs
19:38:02 as of this morning we lowered max-servers to 32
19:38:06 fungi: yup that was one idea that was mentioned. I was ok with it if they felt that was the best approach
19:38:14 this morning my time (around 13z i think?)
19:38:26 But I thought others might have opinions about using the non-free service (I think they use it internally so are able to parse those metrics)
19:39:09 i also suggested to yuriys that he can tweak quotas on the openstack side to more dynamically adjust how many nodes we boot, if that's easier for troubleshooting/experimentation
19:39:24 note we can do that too as we have access to set quotas on the project
19:39:37 also 8 was the old stable node count
19:39:47 yep, though we also have access to just edit the launcher's config and put it in the emergency list
19:40:25 But ya they seem very interested in helping us so I think it is worth working through this
19:40:39 and it seems like they have been getting valuable feedback too. Hopefully a win win for everyone
19:42:01 i'm not sure about the datadog things, but it sounds a lot like the stats nodepool puts out via openstackapi anyway
19:42:43 ianw: I think the datadog agents can attach to the various openstack python processes and record things like rabbitmq connection issues and placement allocation problems like we saw
19:43:02 similar to what prometheus theoretically lets us do with gerrit and so on
19:43:29 at the very least I'm willing to experiment with it if they feel it would be helpful. We've always said we can redeploy this cloud if necessary
19:43:44 but if anyone has strong objections definitely let yuriys know
19:44:10 yeah, he's around in #opendev and paying attention
19:44:17 https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is looking a little sad on openstackapi stats anyway
19:44:18 at least lately
19:44:50 ianw: hrm is that a bug in our nodepool configs?
19:44:57 or maybe openstacksdk updated again and changed everything?
19:45:13 i feel like i've fixed things in here before, i'll have to investigate
19:46:43 #topic Open Discussion
19:46:52 sounds like that may have been it for our last agenda item? Anything else can go here :)
19:47:17 I suspect that zuul will be wanting to do a full restart of the opendev zuul install soon. There are a number of scale out scheduler changes that have landed as well as bugfixes for issues we've seen
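One way the quota suggestion from the InMotion discussion above could look, using standard openstackclient commands; the project name and the numbers are placeholders.

    # Sketch only: cap the instance count on the cloud side instead of editing the launcher config
    openstack quota set --instances 32 inmotion-opendev-project
    # raise it again when experimenting with higher node counts
    openstack quota set --instances 64 inmotion-opendev-project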
19:47:37 We should be careful to do that around the openstack release in a way that doesn't impact them greatly
19:47:47 i still need to do the afs01.dfw cinder volume replacement
19:47:58 that was going to be today, until git asploded
19:48:10 half related, the paste cinder volume seems more stable today
19:48:25 good
19:49:31 I know openstack has a number of bugs in its CI unrelated to the infrastructure too, so don't be surprised if we get requests to hold instances or help debug them
19:49:49 some of them are as simple as a debuntu package not installing reliably :/
19:50:44 would a restart later today be okay?
19:50:54 also the great setuptools shakeup
19:51:15 Guest490: i'm guessing you're corvus and asking about restarting zuul?
19:51:32 I suspect that today or tomorrow are likely to be ok, particularly later in the day Pacific time
19:51:36 if so, yes seems fine to me
19:51:38 seems we get a big early rush then it tails off
19:51:46 and then next week is likely to be bad for restarts
19:52:10 (I suspect second RCs will roll through next week)
19:52:16 should we time the zuul and gerrit restarts together?
19:52:26 fungi: that is an option if we can get the theme updates in
19:52:40 zuul goes quickly enough that we probably don't need to require that though
19:52:58 yep i am corvus
19:54:34 i'm happy to help with restarting it all after the outstanding patches for the gerrit container build and deploy
19:54:47 in the meantime i need to start preparing dinner
19:54:56 cool, I'll be around too this afternoon as noted in #opendev. And ya I need lunch now
19:55:10 Thanks everyone! Feel free to continue conversation in #opendev or on the service-discuss mailing list
19:55:12 #endmeeting