19:01:13 <clarkb> #startmeeting infra
19:01:13 <opendevmeet> Meeting started Tue Sep 14 19:01:13 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:13 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:13 <opendevmeet> The meeting name has been set to 'infra'
19:01:23 <ianw> o/
19:01:32 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000283.html Our Agenda
19:01:45 <clarkb> #topic Announcements
19:02:00 <clarkb> I didn't have any announcements. Did anyone else have announcements to share?
19:02:22 <fungi> i don't think i did
19:03:18 <clarkb> #topic Actions from last meeting
19:03:21 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-07-19.01.txt minutes from last meeting
19:03:27 <clarkb> There were no actions recorded last meeting
19:03:45 * mordred waves to lovely hoomans
19:03:53 <clarkb> #topic Specs
19:03:58 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:18 <clarkb> I updated the spec based on some of the feedback I got. Seems everyone is happy with the general plan but one specific thing has come up since I pushed the update
19:04:45 <clarkb> Basically corvus is pointing out we shouldn't try to do both node exporter and snmp exporter as that will double our work; we should commit to one or the other
19:05:10 <clarkb> I'll try to capture the pros/cons of each really quickly here, but I would appreciate it if y'all could take a look and leave your thoughts on this specific topic
19:05:41 <clarkb> For the SNMP exporter the upside is we already run and configure snmpd on all of our instances. This means the only change on our instance needed to collect snmp data is a firewall update to allow the new prometheus server to poll the data.
19:06:24 <clarkb> The snmp exporter downside is that we'll have to do a fair bit of configuration to tell the snmp exporter what snmp mibs (is that the right terminology?) to collect and where to map them into prometheus. Then we have to do a bunch of work to set up graphs for that data
19:06:58 <fungi> oids, technically
19:07:01 <clarkb> For node exporter the issue is we need to run a new service that doesn't exist in our distros (at least I'm fairly certain there aren't packages for it). We would instead use docker + docker-compose to run this service
19:07:02 <fungi> mibs are collections of oids
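For reference, the firewall change the snmp exporter option needs on each instance is roughly the following (a sketch only; the prometheus server address is a placeholder and in practice this would go through our iptables rules management rather than an ad-hoc command):

```sh
# Allow the new prometheus server (placeholder address) to poll the local snmpd
iptables -A INPUT -p udp -s 203.0.113.10 --dport 161 \
  -m comment --comment "prometheus snmp_exporter polling" -j ACCEPT
```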
19:07:38 <clarkb> This means we will need to add docker to a number of systems that don't currently run docker today. OpenAFS, DNS, mailman servers immediately come to mind. This is possible but a bit of work too.
19:08:13 <clarkb> The upside to using node exporter is we use something that is a bit more ready out of the box to collect server performance metrics, and I'm sure there are preexisting grafana graphs we can borrow from somewhere too
19:08:47 <clarkb> That is the gist of it. Please leave your preferences on the spec and I'll followup on that
19:08:58 <fungi> i guess we'd just include the docker role in our base playbook
19:09:02 <fungi> right?
19:09:07 <clarkb> Personally I was leaning towards snmp simply because I thought we hadn't wanted to run docker in places like our dns servers
19:09:16 <clarkb> fungi: yup and set up docker-compose for node exporter there
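For node exporter, the upstream-documented container invocation is along these lines (a sketch; in practice we would wrap it in docker-compose via our usual roles rather than run it by hand):

```sh
# Run node_exporter with host networking and the host filesystem mounted read-only
docker run -d \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
```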
19:09:47 <fungi> are there resource concerns with adding docker to some of those servers?
19:10:06 <frickler> do we really need docker to run node-exporter?
19:10:23 <clarkb> frickler: we do if we don't want to reinvent systems/tooling to deploy an up to date version of node exporter
19:10:41 <clarkb> there are alternatives but then you're doing a bunch of work to keep a binary blob up to date which is basically what docker does
19:11:33 <clarkb> I definitely don't have a strong opinion on this myself right now and will need to think about it a bit more
19:11:59 <frickler> yeah, I guess I'll need to do some research, too
19:11:59 <clarkb> fungi: that is probably the biggest reason to not do this, if dockerd + node exporter consume a bunch of resources. I can probably deploy it locally on my fileserver and see what sort of memory and cpu it consumes
19:12:32 <fungi> ubuntu has prometheus-node-exporter and prometheus-node-exporter-collectors packages, maybe that would be just as good?
19:12:42 <frickler> I'm also thinking whether I should add myself as volunteer, but let me sleep about that idea first
19:12:44 <clarkb> fungi: but not far enough back in time for our systems iirc
19:13:23 <clarkb> I thought I looked at that and decided docker was really the only way we could run it with our heterogeneous setup
19:13:38 <fungi> there's a prometheus-node-exporter on bionic
19:13:45 <clarkb> looks like focal does have it but the version is quite old (and focal is quite old too), maybe that was the issue
19:13:50 <fungi> we're just about out of the xenial weeds
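A quick way to compare what each release would give us (run on a bionic or focal host; package availability and versions are whatever the distro reports, not something verified here):

```sh
# Show candidate versions of the distro-packaged exporter and its collectors
apt-cache policy prometheus-node-exporter prometheus-node-exporter-collectors
```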
19:14:50 <clarkb> Ya, let's look at this a bit more. Think it over and update the spec. I'm going to continue on in the meeting as we have other stuff to cover and are a quarter of the way through the hour
19:14:58 <clarkb> #topic Topics
19:15:06 <clarkb> #topic Mailman Ansible and Server Upgrades
19:15:14 <corvus> i don't have a strong opinion on which; i just feel like writing system-config changes for either basically negates the value of the other, so we should try to pick one early
19:15:28 <corvus> [eot from me; carry on]
19:15:36 <clarkb> On Sunday fungi and I upgraded lists.openstack.org and that was quite the adventure
19:15:46 <fungi> corvus also helped with that
19:15:56 <clarkb> oh right corvus helped out with the mailman stuff at the end
19:16:04 <clarkb> Everything went well until we tried to boot the Focal kernel on the ancient rax xen pv flavor
19:16:15 <corvus> very little;  i made only a brief appearance; :)
19:16:18 <clarkb> it turns out that xen can't properly decompress the focal kernels because they are compressed with lz4
19:16:38 <fungi> corvus: brief but crucial to the plot
19:17:10 <clarkb> We worked around the kernel issue by manually decompressing the kernel using the linux kernel's extract-vmlinux tool, installing grub-xen, then chainbooting to the /boot/xen/pvboot-x86_64.elf that it installs
19:17:28 <clarkb> What that did was tell xen how to find the kernel as well as supply a kernel to it that it doesn't have to decompress
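Roughly, the workaround boils down to steps like the following (a sketch reconstructed from the discussion above; exact paths, the menu.lst syntax, and where grub is then pointed at the decompressed image depend on the server):

```sh
# Decompress the running kernel with extract-vmlinux (run from a kernel source tree)
./scripts/extract-vmlinux /boot/vmlinuz-$(uname -r) > /boot/vmlinuz-$(uname -r).elf

# Install grub-xen, which provides /boot/xen/pvboot-x86_64.elf
apt-get install grub-xen

# Then chainload it from the legacy menu.lst that rackspace's pv grub reads,
# with an entry along these lines:
#   title Chainload grub-xen
#   root (hd0)
#   kernel /boot/xen/pvboot-x86_64.elf
```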
19:17:53 <clarkb> Then we had to fix up our exim, mailman, and apache configs to handle new mailman env var filtering
19:18:13 <clarkb> Where we are at right now is the host is out of the emergency file and ansible is successfully ansibling the new configs that we had to write
19:18:40 <clarkb> But the kernel situation is still all kinds of bad. We need to decide how we want to ensure that ubuntu isn't going to (re)install a compressed kernel.
19:18:48 <fungi> note that the kernel dance is purely because the lists.o.o server was created in 2013 and has been in-place upgraded continuously since ubuntu 12.04 lts
19:19:01 <fungi> so it's still running an otherwise unavailable pv flavor in rackspace
19:19:26 <clarkb> We can pin the kernel package. We can create a kernel postinst.d hook to decompress the kernel when the kernel updates. We can manually decompress the current kernel whenever we need to update (and use a rescue instance if the host reboots unexpectedly).
19:19:52 <fungi> pv xen loads the kernel from outside the guest domu, while pvhvm works more like a bootable virtual machine similar to kvm
19:19:52 <clarkb> In all cases I think we should begin working to replace the server, but there will be some period of time between now and when we are running with a new server where we want to have a working boot setup
19:20:17 <corvus> oh, the chainloaded kernel can't be compressed?
19:20:26 <corvus> (i thought maybe the chainloading could get around that)
19:20:37 <fungi> corvus: nope, because it still has to hand the kernel blob off to the pv xen hypervisor
19:20:40 <clarkb> corvus: we did some digging this morning and while we haven't tested it, we found sufficient evidence on mailing lists and web forums that this doesn't work, so we didn't want to try it
19:20:51 <clarkb> really all the chain load is doing is finding the correct kernel to hand to xen I think
19:20:58 <clarkb> because it understands grub2 configs
19:21:13 <clarkb> https://unix.stackexchange.com/questions/583714/xen-pvgrub-with-lz4-compressed-kernels covers what is involved in auto decompressing the kernel if we want to do that
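The approach in that link amounts to a kernel hook along these lines (a hedged sketch; the hook name and the extract-vmlinux location are assumptions, and we have not tested this):

```sh
#!/bin/sh
# Hypothetical /etc/kernel/postinst.d/zz-extract-vmlinux hook.
# Kernel postinst hooks are called with the version as $1 and the image path as $2.
version="$1"
image="${2:-/boot/vmlinuz-$version}"
# Assumes extract-vmlinux is available from the matching kernel source/headers tree
extract="/usr/src/linux-headers-$version/scripts/extract-vmlinux"
[ -x "$extract" ] || exit 0
tmp=$(mktemp)
"$extract" "$image" > "$tmp" && mv "$tmp" "$image"
```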
19:21:14 <fungi> yeah, it essentially communicates the offset where the kernel blob starts
19:21:17 <corvus> ok.  then i agree, we're in a hole and we should get out of it with a new server
19:22:14 <fungi> with "new server" comes a number of questions, like should we take this opportunity to fold in lists.katacontainers.io? should we take this as an opportunity to migrate to mm3 on a new server?
19:22:15 <clarkb> yup I think we should just accept that is necessary now. Then decide what workaround for the kernel we want to use while we do that new server work
19:22:21 <ianw> pinning it as is so a power-off situation doesn't become fatal and working on a new server seems best to me
19:23:05 <clarkb> ianw: ya and if we really need to do a kernel update on the server we can do it manually and do the decompress step at the same time
19:23:23 <clarkb> I'm leaning towards an apt pin myself for this reason. It doesn't prevent us from updating but ensures we do so with care
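For option 1) the hold itself is just apt-mark; which kernel packages need holding depends on the flavour installed on the server, so treat the names below as examples:

```sh
# Prevent unattended upgrades from pulling in a new (lz4-compressed) kernel
apt-mark hold linux-image-generic "linux-image-$(uname -r)"

# Later, when deliberately updating (and re-decompressing) the kernel:
apt-mark unhold linux-image-generic "linux-image-$(uname -r)"
```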
19:24:11 <frickler> maybe too obvious a question, but resizing to a modern flavor isn't supported on rackspace?
19:24:33 <clarkb> frickler: ya iirc you could only resize within pv or pvhvm flavors but not across
19:24:44 <fungi> switching from pv to pvhvm isn't supported anyway
19:24:47 <clarkb> But I guess that is something we could ask? fungi maybe as a followup on the issue you opened?
19:25:02 <ianw> fungi: it seems sensible to make the migration also be a mm3 migration
19:25:05 <fungi> oh, that trouble ticket is already closed after we worked out how to boot
19:25:19 <fungi> i went back over the current state of semi-official mm3 containers, we'd basically need three containers for the basic components of mm3 (core, hyperkitty, postorius) plus apache and mysql. or we could use the distro packages in focal (it has mm 3.2.2 while latest is 3.3.4)
19:25:33 <fungi> also there are tools to import mm 2.1 configs to 3.x
19:25:43 <clarkb> fungi: I think we should confirm we can't switch from pv to pvhvm. I'm fairly certain our image would support both since the menu.lst is where we put the chainload and normal grub boot should ignore that
19:26:09 <fungi> and import old archives (with some caveats), though we can also serve old pipermail copies of the archives for backward compatibility with existing hyperlinks
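The import tooling fungi mentions looks roughly like this (a sketch; list names and file paths are illustrative, and the hyperkitty command runs through mailman-web's django management interface):

```sh
# Import a mailman 2.1 list's configuration into mailman 3 core
mailman import21 openstack-discuss@lists.openstack.org \
    /var/lib/mailman/lists/openstack-discuss/config.pck

# Import the old mbox archive into hyperkitty
django-admin hyperkitty_import -l openstack-discuss@lists.openstack.org \
    /var/lib/mailman/archives/private/openstack-discuss.mbox/openstack-discuss.mbox
```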
19:26:21 <ianw> fungi: it could basically be stood up completely independently for validation right?  the archives seem the thing that need importing
19:26:37 <clarkb> ianw: fungi: yes and we should be able to use zuul holds for that too
19:26:43 <fungi> clarkb: yeah, in theory the image we have now could work on a pvhvm flavor, if there's a way to switch it
19:27:28 <fungi> ianw: archives and list configs both need importing, but yes i expect we'd follow our test-based development pattern for building the new mm3 deployment and then just hold a test node
19:28:08 <clarkb> Let me try and summarize what we seem to be thinking: 1) pin the kernel package on lists.o.o so it doesn't break. Manually update the kernel and decompress if necessary. 2) Begin work to upgrade to mm3. 3 and 4) Determine if we can switch to a pvhvm flavor which boots reliably against modern kernels, or replace the server
19:28:24 <clarkb> Are there any objections to 1), since getting that sorted sooner than later is a good idea?
19:28:38 <ianw> ++ to all from me
19:28:53 <fungi> yeah, i'm good with all of the above
19:29:03 <fungi> i can set the kernel package hold once the meeting ends
19:29:19 <clarkb> fungi: thanks. I'd be happy to follow along since I always find those confusing and more experience with them would be good :)
19:29:58 <fungi> happily
19:30:00 <clarkb> I'll see if I can do any research into the pv to pvhvm question
19:30:24 <clarkb> and sounds like fungi has already been looking at 2)
19:30:46 <fungi> for years, but again this week yes
19:31:08 <clarkb> Anything else on this subject? Concerns or issues you've noticed since the upgrade outside of the above
19:31:40 <fungi> aside from the kernel issue we also had some changes we needed to make to our tooling around envvars
19:32:00 <fungi> corvus managed to work out that newer mailman started filtering envvars
19:32:30 <fungi> so the one we made up for the site dir in our multi-site design was no longer making it through to the config script
19:32:52 <fungi> and we ended up needing to pivot to a specific envvar it wasn't filtering
19:33:22 <fungi> this meant refactoring the site hostname to directory mapping into the config script
19:33:57 <fungi> since we switched from using an envvar which conveyed the directory to one which conveyed the virtual hostname
19:35:07 <clarkb> right we could've theoretically set the site dir in the HOST env var but that would have been very confusing
19:35:14 <clarkb> and if mailman used the env var for anything else potentially broken
19:35:46 <fungi> worth noting, since mm3 properly supports distinct hostnames (you can have the same list localpart at multiple domains now and each is distinct) we'll be able to avoid all that complexity with a switch to mm3
19:36:09 <clarkb> https://review.opendev.org/c/opendev/system-config/+/808570 has the details if you are interested
19:36:46 <clarkb> Alright, let's move on. We have a few more things to discuss and more than half our time is gone.
19:37:03 <clarkb> #topic Improving CD throughput
19:37:31 <clarkb> ianw: ^ anything new on this subject since the realization we needed to update periodic pipelines? Sorry I haven't had time to look at this again in a while
19:38:02 <ianw> umm, things in progress but i got a little sidetracked
19:38:17 <ianw> i think we have the basis of the dependencies worked out, but i need to rework the changes
19:38:43 <ianw> so in short, no, nothing new
19:38:58 <clarkb> It might also be good to sketch out what the future of the semaphores looks like in WIP changes just so we can see the end result. But no rush lots to sort out on this stack
19:39:23 <ianw> yeah it's definitely a "make it work serially first" situation
19:39:50 <clarkb> #topic Gerrit Account Cleanups
19:40:06 <clarkb> I have written notes on proposed plans for each user in the comments of review02:~clarkb/gerrit_user_cleanups/audit-results-annotated.yaml
19:40:30 <clarkb> There are 33 of these conflicts remaining. If you get a chance to look at the notes I wrote that would be great. fungi has read them over and didn't seem concerned though
19:40:56 <clarkb> My intent was to start writing those emails this week and make fixups in a checkout of the repo on review02 but mailing lists and other things have distracted me
19:41:15 <clarkb> Other than checking the notes I don't really need anything other than time though. This is making decent progress when I get that time
19:41:23 <clarkb> #topic OpenDev Logo Hosting
19:41:54 <clarkb> at this point we just need to update paste and gerrit's themes to use the logo files hosted in the gitea repo, then we are cleaned up from a gitea upgrade perspective
19:42:01 <fungi> this seems to be working well so far
19:42:02 <clarkb> ianw: you said you would write those changes, are they up yet?
19:42:14 <clarkb> fungi: and I agree seems to be working for what we are doing with gitea itself
19:42:50 <ianw> ahh, no those changes aren't up yet.  on my todo
19:43:03 <clarkb> feel free to ping me when they go up and I'll review them
19:43:18 <clarkb> #topic Expanding InMotion cloud deployment
19:43:37 <clarkb> It sounds like InMotion is able to give us a few more IPs in order to better utilize the cluster we have
19:43:58 <clarkb> I'll be working with them Friday morning to work through that. However, right now we are failing to boot instances there and I need to go look at it more in depth
19:44:08 <clarkb> apparently rabbitmq is fine? and it may be some nova quota mismatch problem
19:44:41 <fungi> neat
19:44:56 <clarkb> I'll probably go ahead and disable the cloud soon as it isn't booting stuff and I think network changes potentially mean we don't want nodepool running against it anyway
19:45:10 <clarkb> if anyone else wants to join let me know
19:45:28 <clarkb> sounds like it will be a conference call configuration meeting
19:46:08 <clarkb> #topic Scheduling Gerrit Project Renames
19:46:35 <clarkb> We've got a few project rename requests now. In addition to starting to think about a time to do those we have discovered some additional questions about the rename process
19:46:55 <clarkb> When we rename projects we should update all of their metadata in gitea so that the issues link and review links all line up
19:47:11 <clarkb> This should be doable but requires updates to the rename playbook. Good news is that is tested now :)
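The gitea side of that metadata update would presumably be one call per repository to gitea's repo edit API, something like the following (a sketch; the backend URL, org/project names, and token variable are placeholders):

```sh
# Point the renamed repo's website metadata at its new review URL
curl -s -X PATCH "https://gitea01.opendev.org:3000/api/v1/repos/neworg/renamed-project" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"website": "https://review.opendev.org/q/project:neworg/renamed-project"}'
```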
19:47:26 <clarkb> The other question I had was what do we do with orgs in gitea (and gerrit) when all projects are moved out of them.
19:47:49 <clarkb> In particular I'm concerned that deleting an org would break the redirects gitea has for things from foo/bar -> bar/foo if we delete foo/
19:48:03 <clarkb> corvus: ^ you've looked at that code before do you have a sense for what might happen there?
19:48:32 <fungi> in addition to that, it might not be terrible to have a redirect mapping for storyboard.o.o, which could probably just be a flat list in a .htaccess file deployed on the server, built from the renames data we have in opendev/project-config (this could be a nice addition after the containerization work diablo_rojo is hacking on)
19:49:35 <corvus> clarkb: i don't recall for certain, but i think there is a reasonable chance that it may break as you suspect
19:50:09 <clarkb> ok something to test for sure then
19:50:45 <clarkb> As far as scheduling goes I'm wary of trying to do it before the openstack release which happens October 6th ish
19:50:49 <fungi> yeah, one of the proposed rename changes would empty the osf/ gerrit namespace and thus the osf org in gitea
19:51:04 <clarkb> But the week after: October 11 -15 might be a good time to do renames
19:51:30 <clarkb> I think that is the week before the ptg too?
19:51:44 <clarkb> Probably a good idea to avoid doing it during the ptg :)
19:52:01 <fungi> yeah, i'm good with that. i'm taking the friday before then off though
19:52:11 <fungi> (the 8th)
19:52:58 <clarkb> ok let's pencil in that week and decide on a specific day as we get closer. Also work on doing metadata updates and test org removals
19:53:07 <fungi> wfm
19:53:10 <clarkb> If orgs can't be removed safely that isn't the end of the world and we'll just keep them for redirects
19:53:26 <clarkb> #topic Open Discussion
19:53:33 <clarkb> Thank you for listening to me for the last hour :) Anything else?
19:54:34 <fungi> nothing immediately springs to mind. i'll try to whip up a mm3 spec though
19:55:34 <clarkb> I'll give it another minute
19:56:31 <clarkb> Thanks everyone! we'll see you here next week same time and location
19:56:36 <clarkb> #endmeeting