19:01:13 #startmeeting infra
19:01:13 Meeting started Tue Sep 14 19:01:13 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:13 The meeting name has been set to 'infra'
19:01:23 o/
19:01:32 #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000283.html Our Agenda
19:01:45 #topic Announcements
19:02:00 I didn't have any announcements. Did anyone else have announcements to share?
19:02:22 i don't think i did
19:03:18 #topic Actions from last meeting
19:03:21 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-07-19.01.txt minutes from last meeting
19:03:27 There were no actions recorded last meeting
19:03:45 * mordred waves to lovely hoomans
19:03:53 #topic Specs
19:03:58 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:18 I updated the spec based on some of the feedback I got. Everyone seems happy with the general plan, but one specific thing has come up since I pushed the update
19:04:45 Basically corvus is pointing out we shouldn't try to do both node exporter and snmp exporter, as that will double our work; we should commit to one or the other
19:05:10 I'll try to capture the pros/cons of each really quickly here, but I would appreciate it if y'all could take a look and leave your thoughts on this specific topic
19:05:41 For the SNMP exporter the upside is we already run and configure snmpd on all of our instances. This means the only change needed on our instances to collect snmp data is a firewall update to allow the new prometheus server to poll the data.
19:06:24 The snmp exporter downside is that we'll have to do a fair bit of configuration to tell the snmp exporter what snmp mibs (is that the right terminology?) to collect and where to map them into prometheus. Then we have to do a bunch of work to set up graphs for that data
19:06:58 oids, technically
19:07:01 For node exporter the issue is we need to run a new service that doesn't exist in our distros (at least I'm fairly certain there aren't packages for it). We would instead use docker + docker-compose to run this service
19:07:02 mibs are collections of oids
19:07:38 This means we will need to add docker to a number of systems that don't run docker today. The OpenAFS, DNS, and mailman servers immediately come to mind. This is possible but a bit of work too.
19:08:13 The upside to using node exporter is we use something that is a bit more ready out of the box to collect server performance metrics, and I'm sure there are preexisting grafana graphs we can borrow from somewhere too
19:08:47 That is the gist of it. Please leave your preferences on the spec and I'll follow up on that
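[Editor's note: for context, a minimal sketch of what the docker + docker-compose approach to node exporter described above could look like. The image, mounts, and flags below are illustrative assumptions, not the opendev configuration that was actually proposed or deployed.]

```yaml
# Hypothetical docker-compose.yaml for running node exporter on a host.
# Names and paths are placeholders; pinning an image tag would be wise.
version: '2'
services:
  node-exporter:
    image: prom/node-exporter   # upstream node exporter image
    network_mode: host          # exposes the default metrics port 9100 directly
    pid: host                   # lets the exporter see host processes
    volumes:
      - /:/host:ro,rslave       # read-only view of the host filesystem
    command:
      - --path.rootfs=/host     # collect host metrics from the bind mount
    restart: unless-stopped
```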
19:08:58 i guess we'd just include the docker role in our base playbook
19:09:02 right?
19:09:07 Personally I was leaning towards snmp simply because I thought we hadn't wanted to run docker in places like our dns servers
19:09:16 fungi: yup, and set up docker-compose for node exporter there
19:09:47 are there resource concerns with adding docker to some of those servers?
19:10:06 do we really need docker to run node-exporter?
19:10:23 frickler: we do if we don't want to reinvent systems/tooling to deploy an up to date version of node exporter
19:10:41 there are alternatives, but then you're doing a bunch of work to keep a binary blob up to date, which is basically what docker does
19:11:33 I definitely don't have a strong opinion on this myself right now and will need to think about it a bit more
19:11:59 yeah, I guess I'll need to do some research, too
19:11:59 fungi: that is probably the biggest reason not to do this, if dockerd + node exporter consume a bunch of resources. I can probably deploy it locally on my fileserver and see what sort of memory and cpu consumption it has
19:12:32 ubuntu has prometheus-node-exporter and prometheus-node-exporter-collectors packages, maybe that would be just as good?
19:12:42 I'm also thinking about whether I should add myself as a volunteer, but let me sleep on that idea first
19:12:44 fungi: but not far enough back in time for our systems iirc
19:13:23 I thought I looked at that and decided docker was really the only way we could run it with our heterogeneous setup
19:13:38 there's a prometheus-node-exporter on bionic
19:13:45 looks like focal does have it but the version is quite old (and focal's is quite old too), maybe that was the issue
19:13:50 we're just about out of the xenial weeds
19:14:50 Ya, let's look at this a bit more. Think it over and update the spec. I'm going to continue on in the meeting as we have other stuff to cover and are a quarter of the way through the hour
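[Editor's note: for comparison with the node exporter sketch above, a rough idea of what the prometheus side of snmp exporter polling tends to look like. The module name, target host, and exporter address are placeholders, not a proposed configuration.]

```yaml
# Hypothetical prometheus scrape fragment for polling hosts via an snmp exporter.
scrape_configs:
  - job_name: snmp
    metrics_path: /snmp
    params:
      module: [if_mib]            # which set of OIDs the exporter should walk
    static_configs:
      - targets:
          - lists.openstack.org   # host to poll over snmp (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the host as ?target= to the exporter
      - source_labels: [__param_target]
        target_label: instance         # keep the polled host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9116    # actually scrape the snmp exporter itself
```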
19:14:58 #topic Topics
19:15:06 #topic Mailman Ansible and Server Upgrades
19:15:14 i don't have a strong opinion on which; i just feel like writing system-config changes for either basically negates the value of the other, so we should try to pick one early
19:15:28 [eot from me; carry on]
19:15:36 On Sunday fungi and I upgraded lists.openstack.org and that was quite the adventure
19:15:46 corvus also helped with that
19:15:56 oh right, corvus helped out with the mailman stuff at the end
19:16:04 Everything went well until we tried to boot the Focal kernel on the ancient rax xen pv flavor
19:16:15 very little; i made only a brief appearance; :)
19:16:18 it turns out that xen can't properly decompress the focal kernels because they are compressed with lz4
19:16:38 corvus: brief but crucial to the plot
19:17:10 We worked around the kernel issue by manually decompressing the kernel using the linux kernel's extract-vmlinux tool, installing grub-xen, then chainbooting to the /boot/xen/pvboot-x86_64.elf that it installs
19:17:28 What that did was tell xen how to find the kernel as well as supply a kernel to it that it doesn't have to decompress
19:17:53 Then we had to fix up our exim, mailman, and apache configs to handle new mailman env var filtering
19:18:13 Where we are at right now is the host is out of the emergency file and ansible is successfully ansibling the new configs that we had to make
19:18:40 But the kernel situation is still all kinds of bad. We need to decide how we want to ensure that ubuntu isn't going to (re)install a compressed kernel.
19:18:48 note that the kernel dance is purely because the lists.o.o server was created in 2013 and has been in-place upgraded continuously since ubuntu 12.04 lts
19:19:01 so it's still running an otherwise unavailable pv flavor in rackspace
19:19:26 We can pin the kernel package. We can create a kernel postinst.d hook to decompress the kernel when the kernel updates. We can manually decompress the current kernel whenever we need to update (and use a rescue instance if the host reboots unexpectedly).
19:19:52 pv xen loads the kernel from outside the guest domu, while pvhvm works more like a bootable virtual machine similar to kvm
19:19:52 In all cases I think we should begin working to replace the server, but there will be some period of time between now and when we are running with a new server where we want to have a working boot setup
19:20:17 oh, the chainloaded kernel can't be compressed?
19:20:26 (i thought maybe the chainloading could get around that)
19:20:37 corvus: nope, because it still has to hand the kernel blob off to the pv xen hypervisor
19:20:40 corvus: we did some digging this morning, and while we haven't tested it we found sufficient evidence on mailing lists and web forums that this doesn't work, so we didn't want to try it
19:20:51 really all the chain load is doing is finding the correct kernel to hand to xen I think
19:20:58 because it understands grub2 configs
19:21:13 https://unix.stackexchange.com/questions/583714/xen-pvgrub-with-lz4-compressed-kernels covers what is involved in auto decompressing the kernel if we want to do that
19:21:14 yeah, it essentially communicates the offset where the kernel blob starts
19:21:17 ok. then i agree, we're in a hole and we should get out of it with a new server
19:22:14 with "new server" comes a number of questions, like should we take this opportunity to fold in lists.katacontainers.io? should we take this as an opportunity to migrate to mm3 on a new server?
19:22:15 yup, I think we should just accept that is necessary now. Then decide what workaround for the kernel we want to use while we do that new server work
19:22:21 pinning it as is, so a power-off situation doesn't become fatal, and working on a new server seems best to me
19:23:05 ianw: ya, and if we really need to do a kernel update on the server we can do it manually and do the decompress step at the same time
19:23:23 I'm leaning towards an apt pin myself for this reason. It doesn't prevent us from updating but ensures we do so with care
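[Editor's note: a minimal sketch of what holding the kernel package via the team's ansible tooling could look like; the package names and playbook placement are assumptions, and an equivalent manual `apt-mark hold` on the server would achieve the same thing.]

```yaml
# Hypothetical ansible task to hold kernel packages on lists.openstack.org so
# automatic upgrades cannot reinstall a compressed kernel. Package names are
# assumptions and would need to match what is actually installed.
- name: Hold kernel packages until the decompression workaround is no longer needed
  ansible.builtin.dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - linux-image-generic
    - linux-generic
```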
19:24:11 maybe too obvious a question, but resizing to a modern flavor isn't supported on rackspace?
19:24:33 frickler: ya, iirc you could only resize within pv or pvhvm flavors but not across
19:24:44 switching from pv to pvhvm isn't supported anyway
19:24:47 But I guess that is something we could ask? fungi, maybe as a followup on the issue you opened?
19:25:02 fungi: it seems sensible to make the migration also be a mm3 migration
19:25:05 oh, that trouble ticket is already closed after we worked out how to boot
19:25:19 i went back over the current state of semi-official mm3 containers, we'd basically need three containers for the basic components of mm3 (core, hyperkitty, postorius) plus apache and mysql. or we could use the distro packages in focal (it has mm 3.2.2 while latest is 3.3.4)
19:25:33 also there are tools to import mm 2.1 configs to 3.x
19:25:43 fungi: I think we should confirm we can't switch from pv to pvhvm. I'm fairly certain our image would support both, since the menu.lst is where we put the chainload and normal grub boot should ignore that
19:26:09 and import old archives (with some caveats), though we can also serve old pipermail copies of the archives for backward compatibility with existing hyperlinks
19:26:21 fungi: it could basically be stood up completely independently for validation, right? the archives seem like the main thing that needs importing
19:26:37 ianw: fungi: yes, and we should be able to use zuul holds for that too
19:26:43 clarkb: yeah, in theory the image we have now could work on a pvhvm flavor, if there's a way to switch it
19:27:28 ianw: archives and list configs both need importing, but yes, i expect we'd follow our test-based development pattern for building the new mm3 deployment and then just hold a test node
19:28:08 Let me try and summarize what we seem to be thinking: 1) pin the kernel package on lists.o.o so it doesn't break, and manually update and decompress the kernel if necessary. 2) Begin work to upgrade to mm3. 3 and 4) Determine whether we can switch to a pvhvm flavor which boots reliably with modern kernels, or replace the server.
19:28:24 Are there any objections to 1)? Getting that sorted sooner rather than later is a good idea.
19:28:38 ++ to all from me
19:28:53 yeah, i'm good with all of the above
19:29:03 i can set the kernel package hold once the meeting ends
19:29:19 fungi: thanks. I'd be happy to follow along since I always find those confusing and more experience with them would be good :)
19:29:58 happily
19:30:00 I'll see if I can do any research into the pv to pvhvm question
19:30:24 and sounds like fungi has already been looking at 2)
19:30:46 for years, but again this week, yes
19:31:08 Anything else on this subject? Concerns or issues you've noticed since the upgrade outside of the above?
19:31:40 aside from the kernel issue we also had some changes we needed to make to our tooling around envvars
19:32:00 corvus managed to work out that newer mailman started filtering envvars
19:32:30 so the one we made up for the site dir in our multi-site design was no longer making it through to the config script
19:32:52 and we ended up needing to pivot to a specific envvar it wasn't filtering
19:33:22 this meant refactoring the site hostname to directory mapping into the config script
19:33:57 since we switched from using an envvar which conveyed the directory to one which conveyed the virtual hostname
19:35:07 right, we could've theoretically set the site dir in the HOST env var but that would have been very confusing
19:35:14 and if mailman used the env var for anything else, potentially broken
19:35:46 worth noting, since mm3 properly supports distinct hostnames (you can have the same list localpart at multiple domains now and each is distinct) we'll be able to avoid all that complexity with a switch to mm3
19:36:09 https://review.opendev.org/c/opendev/system-config/+/808570 has the details if you are interested
19:36:46 Alright, let's move on. We have a few more things to discuss and more than half our time is gone.
19:37:03 #topic Improving CD throughput
19:37:31 ianw: ^ anything new on this subject since the realization we needed to update periodic pipelines? Sorry I haven't had time to look at this again in a while
19:38:02 umm, things in progress but i got a little sidetracked
19:38:17 i think we have the basis of the dependencies worked out, but i need to rework the changes
19:38:43 so in short, no, nothing new
19:38:58 It might also be good to sketch out what the future of the semaphores looks like in WIP changes just so we can see the end result. But no rush, lots to sort out on this stack
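[Editor's note: a minimal sketch of the general shape of a zuul semaphore, the mechanism being discussed for serializing deployment jobs. The semaphore and job names are illustrative assumptions, not the actual system-config changes in progress.]

```yaml
# Hypothetical zuul config fragment: a semaphore shared by deployment jobs so
# only one production playbook runs at a time. Names are placeholders.
- semaphore:
    name: infra-prod-deployment
    max: 1

- job:
    name: infra-prod-service-example
    semaphore: infra-prod-deployment
```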
19:39:23 yeah, it's definitely a "make it work serially first" situation
19:39:50 #topic Gerrit Account Cleanups
19:40:06 I have written notes on proposed plans for each user in the comments of review02:~clarkb/gerrit_user_cleanups/audit-results-annotated.yaml
19:40:30 There are 33 of these conflicts remaining. If you get a chance to look at the notes I wrote that would be great. fungi has read them over and didn't seem concerned though
19:40:56 My intent was to start writing those emails this week and make fixups in a checkout of the repo on review02, but mailing lists and other things have distracted me
19:41:15 Other than checking the notes I don't really need anything other than time, though. This is making decent progress when I get that time
19:41:23 #topic OpenDev Logo Hosting
19:41:54 at this point we just need to update paste and gerrit's themes to use the files hosted in the gitea repo, then we are cleaned up from a gitea upgrade perspective
19:42:01 this seems to be working well so far
19:42:02 ianw: you said you would write those changes, are they up yet?
19:42:14 fungi: and I agree, it seems to be working for what we are doing with gitea itself
19:42:50 ahh, no those changes aren't up yet. on my todo
19:43:03 feel free to ping me when they go up and I'll review them
19:43:18 #topic Expanding InMotion cloud deployment
19:43:37 It sounds like InMotion is able to give us a few more IPs in order to better utilize the cluster we have
19:43:58 I'll be working with them Friday morning to work through that. However, right now we are failing to boot instances there and I need to go look at it more in depth
19:44:08 apparently rabbitmq is fine? and it may be some nova quota mismatch problem
19:44:41 neat
19:44:56 I'll probably go ahead and disable the cloud soon, as it isn't booting stuff, and I think the network changes potentially mean we don't want nodepool running against it anyway
19:45:10 if anyone else wants to join let me know
19:45:28 it will be a conference call configuration meeting, sounds like
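[Editor's note: disabling a cloud for nodepool is typically done by zeroing out its launch quota in the provider configuration; a rough sketch below. The provider, cloud, and pool names are placeholders rather than the actual opendev configuration.]

```yaml
# Hypothetical nodepool provider fragment: setting max-servers to 0 stops
# nodepool from booting new instances in this cloud. Names are placeholders.
providers:
  - name: inmotion-example
    driver: openstack
    cloud: inmotion
    pools:
      - name: main
        max-servers: 0   # temporarily disable launches in this cloud
```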
19:46:08 #topic Scheduling Gerrit Project Renames
19:46:35 We've got a few project rename requests now. In addition to starting to think about a time to do those, we have discovered some additional questions about the rename process
19:46:55 When we rename projects we should update all of their metadata in gitea so that the issues links and review links all line up
19:47:11 This should be doable but requires updates to the rename playbook. Good news is that is tested now :)
19:47:26 The other question I had was what do we do with orgs in gitea (and gerrit) when all projects are moved out of them.
19:47:49 In particular I'm concerned that deleting an org would break the redirects gitea has for things from foo/bar -> bar/foo if we delete foo/
19:48:03 corvus: ^ you've looked at that code before, do you have a sense for what might happen there?
19:48:32 in addition to that, it might not be terrible to have a redirect mapping for storyboard.o.o, which could probably just be a flat list in a .htaccess file deployed on the server, built from the renames data we have in opendev/project-config (this could be a nice addition after the containerization work diablo_rojo is hacking on)
19:49:35 clarkb: i don't recall for certain, but i think there is a reasonable chance that it may break as you suspect
19:50:09 ok, something to test for sure then
19:50:45 As far as scheduling goes, I'm wary of trying to do it before the openstack release which happens October 6th ish
19:50:49 yeah, one of the proposed rename changes would empty the osf/ gerrit namespace and thus the osf org in gitea
19:51:04 But the week after, October 11-15, might be a good time to do renames
19:51:30 I think that is the week before the ptg too?
19:51:44 Probably a good idea to avoid doing it during the ptg :)
19:52:01 yeah, i'm good with that. i'm taking the friday before then off though
19:52:11 (the 8th)
19:52:58 ok, let's pencil in that week and decide on a specific day as we get closer. Also work on doing the metadata updates and test org removals
19:53:07 wfm
19:53:10 If orgs can't be removed safely that isn't the end of the world and we'll just keep them for redirects
19:53:26 #topic Open Discussion
19:53:33 Thank you for listening to me for the last hour :) Anything else?
19:54:34 nothing immediately springs to mind. i'll try to whip up a mm3 spec though
19:55:34 I'll give it another minute
19:56:31 Thanks everyone! we'll see you here next week same time and location
19:56:36 #endmeeting