19:01:08 <clarkb> #startmeeting infra
19:01:08 <opendevmeet> Meeting started Tue Nov  2 19:01:08 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 <opendevmeet> The meeting name has been set to 'infra'
19:01:15 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000294.html Our Agenda
19:01:31 <clarkb> Welcome, you'll find the agenda for this meeting there ^
19:01:37 <clarkb> #topic Announcements
19:02:09 <clarkb> The Gerrit User Summit will be happening sometime early next month and details should be coming out soon. I expect that it will be remote but don't know that for sure.
19:02:36 <clarkb> I bring it up because I was discussing how we did the 3.2 -> 3.3 upgrade and automated much of our testing for it, and he thought other Gerrit users would be interested in hearing how we manage our gerrit
19:02:45 <fungi> i guess we're in a much better position to talk about the things we're doing with gerrit, now that we're running a relatively recent release
19:03:20 <clarkb> I think our installation is a bit different than many others because while we run a fairly large instance we don't currently do HA or have very strict uptime requirements. But at the same time we automate much of our testing and development around gerrit now.
19:03:44 <clarkb> Anyway, the event might be interesting to those who attend this meeting, so calling it out here. I'm hoping I can submit something to talk about how we run our gerrit too
19:04:42 <clarkb> #topic Actions from last meeting
19:04:44 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-26-19.01.txt minutes from last meeting
19:04:58 <clarkb> ianw you had an action to start on the gerrit 3.4 stuff
19:05:15 <fungi> i hope he didn't do that on his vacation
19:05:18 <clarkb> write a checklist, hold a node, and test the downgrade.
19:05:24 <ianw> yes i started on that
19:05:39 <ianw> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.4
19:05:49 <clarkb> thank you!
19:06:02 <ianw> i got as far as noticing that the plugin updates do seem to break our theme
19:06:22 <clarkb> that is good news since it was one of the questions we had
19:06:24 <ianw> so i'll dig into that first
19:06:29 <clarkb> er
19:06:38 <clarkb> I read it as "does not". But I guess knowing is good either way, just more work in this case
19:06:45 <clarkb> yay for testing :)
19:07:15 <ianw> indeed :)
19:07:31 <fungi> those used a "lightweight" polygerrit plugin method, so i guess we need some java to go along with it now
19:07:53 <clarkb> I think you may still be able to do pure javascript plugins but you have to hook in some specific way?
19:07:56 <fungi> s/those/the theme/
19:08:08 <clarkb> ianw: the original file came from the android theme iirc. We might be able to see how they updated their theme?
19:08:50 <ianw> yeah, good idea.  honestly i haven't even had an initial look at it yet
19:08:55 <fungi> paladox also might have suggestions as to how to update it since he effectively supplied the original for us
19:09:29 <clarkb> ya no rush. Just wanted to check in on this as it was recorded as an action. Sounds like good progress. Thanks
19:09:43 <clarkb> The other recorded action was for infra root to review the mailman 3 spec
19:09:47 <clarkb> let's just dive into that topic now
19:09:50 <clarkb> #topic Specs
19:09:59 <clarkb> #link https://review.opendev.org/810990 Mailman 3 spec
19:10:24 <clarkb> ianw and I have reviewed it and appear happy with the spec. I'd like to approve this soon if we can as the holiday period is a good time for this type of work
19:10:43 <clarkb> frickler: corvus: do you think you have time to review it this week? Any objections to landing it this week if not?
19:11:18 <frickler> I'll put it on my list but won't object to anything
19:11:37 <clarkb> ok in that case I'll aim to approve it end of day Friday if no review objections come up
19:11:40 <clarkb> thanks!
19:12:13 <clarkb> #topic Topics
19:12:15 <fungi> also i'm happy to make adjustments during setup if people come up with new concerns
19:12:23 <clarkb> fungi: thanks
19:12:25 <clarkb> #topic Improving OpenDev's CD throughput
19:12:52 <clarkb> I'll admit I haven't really had a chance to look at this yet. I feel like this sort of change requires me to not be juggling a few things so I can focus on understanding the end result, and I haven't had that opportunity yet
19:13:04 <clarkb> It is on my todo list if I ever find that block of time :/
19:13:20 <ianw> yeah i need to cycle back on some failures in jobs too
19:13:20 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807672
19:13:29 <clarkb> specifically that change and its child, if others have time too
19:14:38 <clarkb> at this point I think it just needs reviewers and someone to look at failures. Then we can improve it as required by review and start landing changes
19:15:15 <clarkb> #topic Gerrit account cleanups
19:15:31 <clarkb> Just a note that I haven't heard back from the user I most recently did the fixup for
19:15:37 <clarkb> I suppose no news is good news in this case
19:15:53 <fungi> they seemed relatively uncommunicative anyway
19:16:53 <clarkb> #topic Fedora 34 boot problems
19:17:07 <clarkb> I've not managed to keep up with the status of this other than reviewing a change here and there
19:17:22 <clarkb> Is this still an issue? Anything we need to do to help fix it?
19:17:53 <ianw> the dracut fix made it into f34, but i was not clear if that actually would fix the default initramfs
19:18:10 <ianw> so i have a change out still that regenerates it with dracut
19:18:39 <ianw> however, i also just updated for fedora 35
19:19:00 <fungi> what's the anticipated release date for 35?
19:19:06 <ianw> i wasn't sure what to do with the mirror, but i think it just released today
19:19:16 <ianw> so that solves having to figure out "/devel" paths
19:19:32 <clarkb> fungi: it was yesterday I think
19:19:48 <fungi> oh, then yeah that may just be a better place to focus regardless
19:20:09 <fungi> in the meantime we're not all that blocked on 34 since we've got three providers where it can boot now
19:20:26 <clarkb> I think only 2 have labels configured for it though
19:20:32 <ianw> (sorry just logging into gerrit)
19:20:32 <clarkb> but that is probably good enough while we get f35 up
19:20:36 <fungi> oh, i guess we never added it to vexxhost
19:21:10 <clarkb> the other related item was I had a -1 comment on the f33 mirror cleanup. I don't think we can remove the fedora atomic image yet because magnum has older branches still using it
19:21:12 <ianw> i guess we still think we have fedora 29 users
19:21:13 <fungi> granted it's also not terribly efficient that we've got poolworkers accepting node requests they'll ultimately be unable to fulfil after waiting 15-20 minutes for the node to never become reachable
19:21:17 <clarkb> but we should definitely clean out f33
19:21:46 <clarkb> ianw: ya I think we should also send a note to openstack-discuss that that image needs to go away. It isn't something anyone should be using, and they need to make a plan for using something else?
19:22:03 <fungi> if we do decide to abandon 34 and focus on 35, then we should probably remove the 34 label from providers where we know it's broken
19:22:09 <ianw> ok, i think it's more a "this is going away" message at this point ...
19:22:32 <clarkb> ianw: yup. Basically we know it is used but no one should use it and we need to clean it up. Lets give them sufficient warning then proceed
19:23:03 <fungi> maybe a one-sentence reminder that we tend to not keep eol distro versions around
19:23:10 <ianw> i can split those up, drop f33, add f35, then starts builds for f35, update zuul-jobs and any users and then we can drop f34
19:23:31 <clarkb> sounds like a plan
19:24:00 <fungi> but in the meantime, drop f34 everywhere besides inmotion and citycloud (and maybe add it to vexxhost?)
19:24:17 <fungi> we're just wasting resources trying to boot it everywhere else
19:24:21 <clarkb> fungi: ya otherwise that
19:25:09 <ianw> ok, i'll do that too, although i hope the removal can proceed in a timely fashion :)
19:25:54 <fungi> it's more just that i'm watching nodepool try to boot a f34 node in rackspace right now
19:26:11 <clarkb> thanks. Let me know if I can help. Happy to do reviews on that as the slow f34 boots affected random things when it was a bigger issue
19:26:49 <clarkb> #topic Zuul multi scheduler setup
19:27:09 <clarkb> Over the weekend zuul ran with an active-active scheduler setup for the first time
19:27:19 <clarkb> I saw a report that at least one job was started by one scheduler and finished by another
19:27:41 <clarkb> Unfortunately there have been some bumps along the way (corvus is currently doing a restart to fall back to a single scheduler after debugging the main issue)
19:28:05 <clarkb> basically keep this in mind if you are doing any zuul work. And if you notice any weird zuul behavior reporting that back to the zuul matrix room is a good idea
19:28:05 <corvus> it went a lot better than i expected actually :)
19:28:10 <fungi> we ran with one again just a few minutes ago!
19:29:02 <clarkb> I'm happy because it is nice to see all that code review done last week show results. Super exciting to see zuulv5 when it is ready
19:29:41 <clarkb> But ya if you notice weirdness please report it. That information and feedback is useful.
19:30:14 <fungi> also the zuul restart docs have been updated, and include information on dumping some diagnostic data
19:31:02 <clarkb> #topic FIPS testing in our CI system
19:31:26 <clarkb> We're seeing more and more interest in testing software on FIPS enabled systems, particularly for openstack.
19:32:07 <clarkb> The way we've been approaching this is having the jobs install whatever they need for that then enable configs and reboot into the new kernel state
19:32:32 <clarkb> The reason for this is that managing another set of centos-8 and fedora-* images just for FIPS doesn't really scale well, and zuul supports this reboot case just fine
19:33:16 <clarkb> This does present an issue where a lot of jobs set ephemeral state that doesn't survive reboots
19:33:47 <clarkb> for example multinode networking creates ovs networks and those don't come back after a reboot. swift unittesting creates an xfs filesystem that is mounted and not added to fstab
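A quick way to spot this class of problem on a held node is to diff the live mount table against fstab before the reboot. A minimal sketch, assuming a Linux host where /proc/mounts and /etc/fstab are the only persistence sources in play (systemd mount units are deliberately ignored); this is illustrative, not part of any OpenDev role:

```python
# Hedged sketch: list mounts that will not survive a reboot, i.e. live
# mounts (from /proc/mounts) whose mount points have no /etc/fstab entry.
from pathlib import Path

def mount_points(path: str) -> set[str]:
    # Field 1 is the mount point in both /proc/mounts and /etc/fstab.
    points = set()
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) > 1 and not parts[0].startswith("#"):
            points.add(parts[1])
    return points

for point in sorted(mount_points("/proc/mounts") - mount_points("/etc/fstab")):
    print(f"ephemeral mount (gone after reboot): {point}")
```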
19:34:08 <fungi> i had a random crazy thought about that... what if glean grew the ability to run userdata scripts like cloud-init can, and nodepool supplied those to do things like enable fips and reboot, or tweak kernel parameters to limit available memory and reboot... we wouldn't need separate images then, just separate labels
19:34:18 <clarkb> If people come to you with problems around FIPS testing it is probably a good idea to check for any lost ephemeral state as an early debugging step
19:34:44 <clarkb> fungi: I think we intentionally avoid that stuff because it is really difficult to debug
19:34:59 <clarkb> fungi: the reason that we prefer putting as much logic into zuul as possible is that the info can then be exposed to users easily
19:35:06 <fungi> yeah, fair, only the console log can really provide that info
19:35:15 <clarkb> in fact debugging the swift issue was done with a held node but I did not log into that machine at all and only used job info
19:35:53 <ianw> it would only be a label though to pass fips=1 on the command line?
19:36:18 <ianw> does that not happen via glance options, similar to some of the stuff we set for arm64 images?
19:36:19 <clarkb> ianw: it would be a label with a user script set that installed necessary packages and updated config, then did a reboot
19:36:21 <ianw> glance metadata
19:36:28 <clarkb> no it happens via nova metadata and is per instance
19:36:54 <clarkb> debugging that stuff is really difficult
19:37:06 <clarkb> personally I don't think we should go that route for this reason
19:37:44 <fungi> i don't think all hypervisors can alter the kernel command line anyway
19:38:17 <ianw> i did think it was just a switch, that we could have the images ready and boot them in either mode.  but i haven't looked at details, obviously
19:38:20 <clarkb> Something else to be aware of: to address the loss of state, people are wanting to modify all the jobs in zuul-jobs with FIPS-enabled flags. I've been -1'ing those and asking for different base jobs instead. The reason is that zuul-jobs is meant to be a generally reconsumable library of jobs for all zuul users, and adding a bunch of fips flags to generic jobs seems to pollute that goal
19:38:48 <clarkb> basically have a multinode-fips job instead of a multinode job with a fips flag
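To make the suggestion concrete, the pattern looks roughly like the config below; the job and playbook names are hypothetical, not actual zuul-jobs definitions. The FIPS variant inherits everything from the generic multinode job and adds a pre-run step, so the generic job stays flag-free:

```yaml
# Illustrative sketch only; names here are hypothetical.
- job:
    name: multinode-fips
    parent: multinode
    pre-run: playbooks/multinode-fips/pre.yaml  # enable FIPS, then reboot
```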
19:39:48 <clarkb> The last thing I wanted to call out on this topic is I think we should also try to encourage projects to avoid having two copies of all the jobs to cover fips=1 and fips=0
19:39:51 <fungi> i can't remember, does multiple inheritance (mix-ins) work? did that ever get added or just talked about?
19:40:03 <clarkb> fungi: there is a way to do it but it is undocumented iirc
19:40:12 <fungi> ahh, so probably discouraged
19:40:22 <corvus> i don't discourage it :)
19:40:43 <clarkb> We should probably encourage fips by default for projects where that is important, with the assumption that if it works under fips it will work without fips
19:40:54 <clarkb> and/or targeted fips testing, and not attempt to test everything under fips
19:42:07 <ianw> i'm just trying to read ... does the local config require regenerating the initramfs?
19:42:46 <clarkb> ianw: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/enable-fips/tasks/main.yaml is the current implementation
19:43:02 <clarkb> I do not know what all `fips-mode-setup --enable` does
19:43:13 <clarkb> but I assume it is non trivial if it comes with its own command
19:44:30 <clarkb> I don't think you can change the operating mode without a reboot either
19:44:42 <clarkb> since it changes kernel stuff that can't be modified without rebooting
19:44:43 <ianw> it looks like it's mostly regenerating initramfs, disabling prelink and some sshd_config tweaks
19:44:55 <fungi> apparently you can do it on ubuntu lts too, but need to have a ua subscription
19:45:58 <ianw> i thought prelink was dead anyway, have to investigate that
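For anyone debugging this later: after the reboot the kernel exposes its FIPS state directly, which makes for a cheap sanity check before running tests. A minimal sketch, assuming a RHEL/CentOS-style kernel that provides the standard proc flag:

```python
# Hedged sketch: confirm the node actually came back up in FIPS mode.
# /proc/sys/crypto/fips_enabled reads "1" when the kernel enforces FIPS.
from pathlib import Path

def fips_enabled() -> bool:
    flag = Path("/proc/sys/crypto/fips_enabled")
    return flag.exists() and flag.read_text().strip() == "1"

if __name__ == "__main__":
    print("FIPS mode:", "enabled" if fips_enabled() else "disabled")
```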
19:46:07 <clarkb> Anyway, the most important thing I wanted to call out was the debugging hint: FIPS-related problems are potentially related to losing ephemeral state across the reboot
19:46:17 <clarkb> since that was overlooked in the swift case for far too long
19:47:01 <clarkb> #topic Open Discussion
19:47:15 <clarkb> I ended up rebooting nb03 to address its weird high load average
19:47:33 <fungi> the implementation detail of swift's functional tests creating and mounting an xfs filesystem on a loop device is easily overlooked anyway
19:47:34 <clarkb> the system itself is happy afterwards but nodepool-builder won't start there due to openshift==0.0.1 being installed
19:47:55 <clarkb> https://review.opendev.org/c/zuul/nodepool/+/816389 should fix that, but we wanted to confirm it doesn't make the nodepool image builds very slow before approving it
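As an aside, a start-time sanity check of the sort sketched below would catch this class of failure early; the 0.11.2 floor comes from the version mentioned later in this discussion, and the check itself is illustrative, not something nodepool ships:

```python
# Illustrative sketch: fail fast if a dependency resolved to a bogus old
# version. Per the discussion below, the emulated arm64 container build
# avoids building sdists for speed, so pip fell back to openshift 0.0.1,
# which still had a usable wheel.
import sys
from importlib.metadata import version

installed = version("openshift")
parts = tuple(int(p) for p in installed.split(".")[:3] if p.isdigit())
if parts < (0, 11, 2):  # 0.11.2 is the known-good version from this log
    sys.exit(f"openshift {installed} looks like a broken fallback install")
print(f"openshift {installed} ok")
```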
19:48:42 <ianw> oh, indeed.  i'm not sure if we implemented something to add extra wheels to the opendev requirements build, or just talked about it
19:49:00 <frickler> that sounds like we lack some test for that image?
19:49:23 <clarkb> frickler: yup largely because we'd need to figure out arm64 specific jobs for it
19:49:33 <clarkb> it is doable but not as easy as getting it covered with say the dib jobs
19:49:55 <clarkb> ianw: I think your spec for zuul is likely to be the plan for properly solving that problem?
19:49:57 <ianw> it doesn't work on the dib side because devstack doesn't work on arm64
19:50:11 <clarkb> ah
19:51:03 <ianw> we could do a "does it start" test
19:51:43 <clarkb> ya we would fail that currently and that would be an improvement
19:51:56 <clarkb> won't cover everything but should ensure basic functionality
19:52:02 <fungi> that might be "good enough" for a secondary architecture exercise anyway
19:52:23 <fungi> since if it starts, the rest of its operation is unlikely to differ substantially
19:53:19 <ianw> yeah, not exactly sure what that would look like -- perhaps just start a ZK container and make sure it gets into a listening state?
19:53:57 <ianw> although probably just tox tests would pick this up too?
19:53:58 <clarkb> if we can have it build a simple image but not upload it (do we have a way to force a build without upload?) then we can nodepool dib-image-list it
19:54:10 <ianw> maybe we should just run that in check-arm64?
19:54:29 <clarkb> ianw: you would need to ensure deps are installed the same way when running unittests but that should work too
19:55:03 <clarkb> part of this is an artifact of how we make the qemu arm64 emulated docker image build run in a reasonable time frame
19:55:13 <clarkb> if we just ran tox on arm64 it would probably work
19:55:24 <clarkb> because it would find the sdist for openshift 0.11.2 and install from that.
19:55:24 <fungi> (or fail, more importantly)
19:55:34 <fungi> oh, i see what you mean
19:55:35 <clarkb> well in this case it wouldn't fail
19:55:39 <fungi> yeah
19:55:52 <clarkb> if we reproduced the same install method for unittests then it would fail
19:55:54 <fungi> so tox is not good enough
19:56:23 <ianw> yeah, it's more making sure we're pulling the same wheels etc. in tox
19:56:40 <fungi> in this case, not good enough without setting nonstandard options for pip's dep solver to skip source-only versions anyway
19:57:35 <ianw> this does more-or-less cycle back to the rough spec from our discussions on arm64 wheels + zuul
19:57:50 <clarkb> yup I think if we focus on that we'll largely solve this specific problem
19:57:55 <ianw> #link https://review.opendev.org/c/zuul/zuul/+/815406
19:57:57 <clarkb> any arm64 testing would be to sanity check that result
19:58:09 <clarkb> and we can possibly rely on unittests then if we have the better arm64 stuff for zuul
19:58:12 <ianw> yeah, so i think calling this out there and making sure we address it is probably the way forward
19:58:46 <clarkb> ++
19:58:51 <ianw> i can update that for the testing case and we can loop back on it
20:00:03 <clarkb> sounds good. And we are at time
20:00:05 <clarkb> thank you everyone!
20:00:15 <fungi> thanks clarkb!
20:00:27 <clarkb> We'll see you here next week and the week after, but then many of us have a big holiday in three weeks
20:00:43 <clarkb> I expect that I won't be around much during the week of US thanksgiving
20:00:45 <clarkb> #endmeeting