19:01:08 #startmeeting infra
19:01:08 Meeting started Tue Nov 2 19:01:08 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 The meeting name has been set to 'infra'
19:01:15 #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000294.html Our Agenda
19:01:31 Welcome, you'll find the agenda for this meeting ^ there.
19:01:37 #topic Announcements
19:02:09 The Gerrit User Summit will be happening sometime early next month and details should be coming out soon. I expect that it will be remote but don't know that for sure.
19:02:36 I bring it up because I was discussing how we did the 3.2 -> 3.3 upgrade and automated much of our testing for it, and he thought other Gerrit users would be interested in hearing how we manage our gerrit
19:02:45 i guess we're in a much better position to talk about the things we're doing with gerrit, now that we're running a relatively recent release
19:03:20 I think our installation is a bit different than many others because while we run a fairly large instance we don't currently do HA or have very strict uptime requirements. But at the same time we automate much of our testing and development around gerrit now.
19:03:44 Anyway the event might be interesting to those who attend this meeting so I'm calling it out here. I'm hoping I can submit something to talk about how we run gerrit too
19:04:42 #topic Actions from last meeting
19:04:44 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-26-19.01.txt minutes from last meeting
19:04:58 ianw you had an action to start on the gerrit 3.4 stuff
19:05:15 i hope he didn't do that on his vacation
19:05:18 write a checklist, hold a node, and test the downgrade.
19:05:24 yes i started on that
19:05:39 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.4
19:05:49 thank you!
19:06:02 i got as far as noticing that the plugin updates do seem to break our theme
19:06:22 that is good news since it was one of the questions we had
19:06:24 so i'll dig into that first
19:06:29 er
19:06:38 I read it as "does not". But I guess knowing is good either way, just more work in this case
19:06:45 yay for testing :)
19:07:15 indeed :)
19:07:31 those used a "lightweight" polygerrit plugin method, so i guess we need some java to go along with it now
19:07:53 I think you may still be able to do pure javascript plugins but you have to hook in some specific way?
19:07:56 s/those/the theme/
19:08:08 ianw: the original file came from the android theme iirc. We might be able to see how they updated their theme?
19:08:50 yeah, good idea. honestly i haven't even had an initial look at it yet
19:08:55 paladox also might have suggestions as to how to update it since he effectively supplied the original for us
19:09:29 ya no rush. Just wanted to check in on this as it was recorded as an action. Sounds like good progress. Thanks
19:09:43 The other recorded action was for infra root to review the mailman 3 spec
19:09:47 let's just dive into that topic now
19:09:50 #topic Specs
19:09:59 #link https://review.opendev.org/810990 Mailman 3 spec
19:10:24 ianw and I have reviewed it and appear happy with the spec. I'd like to approve this soon if we can, as the holiday period is a good time for this type of work
19:10:43 frickler: corvus: do you think you have time to review it this week? Any objections to landing it this week if not?
19:11:18 I'll put it on my list but won't object to anything
19:11:37 ok in that case I'll aim to approve it end of day Friday if no review objections come up
19:11:40 thanks!
19:12:13 #topic Topics
19:12:15 also i'm happy to make adjustments during setup if people come up with new concerns
19:12:23 fungi: thanks
19:12:25 #topic Improving OpenDev's CD throughput
19:12:52 I'll admit I haven't really had a chance to look at this yet. I feel like this sort of change requires me to not be juggling a few things so I can focus on understanding the end result, and I haven't had that opportunity yet
19:13:04 It is on my todo list if I ever find that block of time :/
19:13:20 yeah i need to cycle back on some failures in jobs too
19:13:20 #link https://review.opendev.org/c/opendev/system-config/+/807672
19:13:29 specifically that change and its child if others have time too
19:14:38 at this point I think it just needs reviewers and someone to look at failures. Then we can improve it as required by review and start landing changes
19:15:15 #topic Gerrit account cleanups
19:15:31 Just a note that I haven't heard back from the user I most recently did the fixup for
19:15:37 I suppose no news is good news in this case
19:15:53 they seemed relatively uncommunicative anyway
19:16:53 #topic Fedora 34 boot problems
19:17:07 I've not managed to keep up with the status of this other than reviewing a change here and there
19:17:22 Is this still an issue? Anything we need to do to help fix it?
19:17:53 the dracut fix made it into f34, but i was not clear if that actually would fix the default initramfs
19:18:10 so i have a change out still that regenerates it with dracut
19:18:39 however, i also just updated for fedora 35
19:19:00 what's the anticipated release date for 35?
19:19:06 i wasn't sure what to do with the mirror, but i think it just released today
19:19:16 so that solves having to figure out "/devel" paths
19:19:32 fungi: it was yesterday I think
19:19:48 oh, then yeah that may just be a better place to focus regardless
19:20:09 in the meantime we're not all that blocked on 34 since we've got three providers where it can boot now
19:20:26 I think only 2 have labels configured for it though
19:20:32 (sorry just logging into gerrit)
19:20:32 but that is probably good enough while we get f35 up
19:20:36 oh, i guess we never added it to vexxhost
19:21:10 the other related item was I had a -1 comment on the f33 mirror cleanup. I don't think we can remove the fedora atomic image yet because magnum has older branches still using it
19:21:12 i guess we still think we have fedora 29 users
19:21:13 granted it's also not terribly efficient that we've got poolworkers accepting node requests they'll ultimately be unable to fulfil after waiting 15-20 minutes for the node to never become reachable
19:21:17 but we should definitely clean out f33
19:21:46 ianw: ya I think we should also send a note to openstack-discuss that that image needs to go away. It isn't something anyone should be using and they need to make a plan for using something else?
19:22:03 if we do decide to abandon 34 and focus on 35, then we should probably remove the 34 label from providers where we know it's broken
19:22:09 ok, i think it's more a "this is going away" message at this point ...
19:22:32 ianw: yup. Basically we know it is used but no one should use it and we need to clean it up. Let's give them sufficient warning then proceed
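(For reference: the per-provider "labels" discussed above live in nodepool's provider configuration, and dropping a label from a pool is what stops nodepool from accepting node requests it can never fulfil there. A rough, hypothetical sketch of the relevant stanza; the provider, pool, and flavor values are illustrative, not OpenDev's actual config:)

    providers:
      - name: example-provider
        cloud: example-cloud
        diskimages:
          - name: fedora-34    # drop this where f34 is known not to boot
          - name: fedora-35
        pools:
          - name: main
            max-servers: 10
            labels:
              # Removing this label keeps the poolworker from accepting
              # fedora-34 requests it would wait on and never satisfy.
              - name: fedora-34
                diskimage: fedora-34
                min-ram: 8000
              - name: fedora-35
                diskimage: fedora-35
                min-ram: 8000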
19:23:03 maybe a one-sentence reminder that we tend to not keep eol distro versions around
19:23:10 i can split those up, drop f33, add f35, then start builds for f35, update zuul-jobs and any users, and then we can drop f34
19:23:31 sounds like a plan
19:24:00 but in the meantime, drop f34 everywhere besides inmotion and citycloud (and maybe add it to vexxhost?)
19:24:17 we're just wasting resources trying to boot it everywhere else
19:24:21 fungi: ya otherwise that
19:25:09 ok, i'll do that too, although i hope the removal will proceed in a timely fashion :)
19:25:54 it's more just that i'm watching nodepool try to boot an f34 node in rackspace right now
19:26:11 thanks. Let me know if I can help. Happy to do reviews on that as the slow f34 boots affected random things when it was a bigger issue
19:26:49 #topic Zuul multi scheduler setup
19:27:09 Over the weekend zuul ran with active-active schedulers for the first time
19:27:19 I saw a report that at least one job was started by one scheduler and finished by another
19:27:41 Unfortunately there have been some bumps along the way (corvus is currently doing a restart to fall back to a single scheduler after debugging the main issue)
19:28:05 basically keep this in mind if you are doing any zuul work. And if you notice any weird zuul behavior, reporting that back to the zuul matrix room is a good idea
19:28:05 it went a lot better than i expected actually :)
19:28:10 we ran with one again just a few minutes ago!
19:29:02 I'm happy because it is nice to see all that code review done last week show results. Super exciting to see zuulv5 when it is ready
19:29:41 But ya if you notice weirdness please report it. That information and feedback is useful.
19:30:14 also the zuul restart docs have been updated, and include information on dumping some diagnostic data
19:31:02 #topic FIPS testing in our CI system
19:31:26 We're seeing more and more interest in testing software on FIPS enabled systems, particularly for openstack.
19:32:07 The way we've been approaching this is having the jobs install whatever they need for that, then enable configs and reboot into the new kernel state
19:32:32 The reason for this is managing another set of centos-8 and fedora-* images just for FIPS doesn't really scale well, and zuul supports this reboot case just fine
19:33:16 This does present an issue where a lot of jobs set ephemeral state that doesn't survive reboots
19:33:47 for example multinode networking creates ovs networks and those don't come back after a reboot. swift unittesting creates an xfs filesystem that is mounted and not added to fstab
19:34:08 i had a random crazy thought about that... what if glean grew the ability to run userdata scripts like cloud-init can, and nodepool supplied those to do things like enable fips and reboot, or tweak kernel parameters to limit available memory and reboot... we wouldn't need separate images then, just separate labels
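(The reboot-in-job approach described at 19:32 boils down to something like the following Ansible sketch. This is a simplified illustration only; the actual implementation is the zuul-jobs enable-fips role linked later at 19:42, and the exact packages and commands vary by distro:)

    # Simplified sketch of enabling FIPS inside a running job on a
    # RHEL/CentOS/Fedora style node; not the real zuul-jobs role.
    - name: Enable FIPS mode
      become: yes
      command: fips-mode-setup --enable

    - name: Reboot so the kernel-level FIPS settings take effect
      become: yes
      reboot:
        reboot_timeout: 900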
19:34:18 If people come to you with problems around FIPS testing, it is probably a good idea to check for any lost ephemeral state as an early debugging step
19:34:44 fungi: I think we intentionally avoid that stuff because it is really difficult to debug
19:34:59 fungi: the reason that we prefer putting as much logic into zuul as possible is that the info can then be exposed to users easily
19:35:06 yeah, fair, only the console log can really provide that info
19:35:15 in fact debugging the swift issue was done with a held node but I did not log into that machine at all and only used job info
19:35:53 it would only be a label though to pass fips=1 on the command line?
19:36:18 does that not happen via glance options, similar to some of the stuff we set for arm64 images?
19:36:19 ianw: it would be a label with a user script set that installed necessary packages and updated config then did a reboot
19:36:21 glance metadata
19:36:28 no it happens via nova metadata and is per instance
19:36:54 debugging that stuff is really difficult
19:37:06 personally I don't think we should go that route for this reason
19:37:44 i don't think all hypervisors can alter the kernel command line anyway
19:38:17 i did think it was just a switch, that we could have the images ready and boot them in either mode. but i haven't looked at details, obviously
19:38:20 Something else to be aware of is that to address the loss of state, people are wanting to modify all the jobs with FIPS enabled flags. I've been -1'ing those and asking for different base jobs instead. The reason for this is zuul-jobs is meant to be a generally reconsumable library of jobs for all zuul users, and adding a bunch of fips flags to generic jobs seems to pollute
19:38:22 that goal
19:38:48 basically have a multinode-fips job instead of a multinode job with a fips flag
19:39:48 The last thing I wanted to call out on this topic is I think we should also try to encourage projects to avoid having two copies of all the jobs to cover fips=1 and fips=0
19:39:51 i can't remember, does multiple inheritance (mix-ins) work? did that ever get added or just talked about?
19:40:03 fungi: there is a way to do it but it is undocumented iirc
19:40:12 ahh, so probably discouraged
19:40:22 i don't discourage it :)
19:40:43 We should probably encourage fips by default for projects it is important for, with the assumption that if it works under fips it will work without fips
19:40:54 and/or targeted fips testing, and not attempt to test everything under fips
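(The "different base jobs" pattern advocated at 19:38 looks roughly like the hypothetical definition below. One ordering wrinkle: Zuul runs parent pre-run playbooks before a child's, so the FIPS enablement and reboot have to happen before any pre-run phase that creates ephemeral state such as the multinode overlay network, which is part of why a dedicated variant job is cleaner than a flag bolted onto the generic job. All names and playbook paths here are illustrative:)

    # Hypothetical sketch only; not the actual zuul-jobs definition.
    - job:
        name: multinode-fips
        parent: base
        nodeset: two-node    # assumes a multinode nodeset defined elsewhere
        pre-run:
          - playbooks/fips/enable-and-reboot.yaml   # FIPS first...
          - playbooks/multinode/pre.yaml            # ...then network setup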
19:42:07 i'm just trying to read ... is the required local config just regenerating initramfs?
19:42:46 ianw: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/enable-fips/tasks/main.yaml is the current implementation
19:43:02 I do not know what all `fips-mode-setup --enable` does
19:43:13 but I assume it is non-trivial if it comes with its own command
19:44:30 I don't think you can change the operating mode without a reboot either
19:44:42 since it changes kernel stuff that can't be modified without rebooting
19:44:43 it looks like it's mostly regenerating initramfs, disabling prelink and some sshd_config tweaks
19:44:55 apparently you can do it on ubuntu lts too, but you need to have a ua subscription
19:45:58 i thought prelink was dead anyway, have to investigate that
19:46:07 Anyway the most important thing I wanted to call out was the hint for helping debug FIPS-related problems: they may be caused by losing ephemeral state in the reboot
19:46:17 since that was overlooked in the swift case for far too long
19:47:01 #topic Open Discussion
19:47:15 I ended up rebooting nb03 to address its weird high load average
19:47:33 the implementation detail of swift's functional tests creating and mounting an xfs filesystem on a loop device is easily overlooked anyway
19:47:34 the system itself is happy afterwards but nodepool-builder won't start there due to openshift==0.0.1 being installed
19:47:55 https://review.opendev.org/c/zuul/nodepool/+/816389 should fix that but we wanted to confirm it doesn't make the nodepool image builds very slow before approving it
19:48:42 oh, indeed. i'm not sure if we implemented something to add extra wheels to the opendev requirements build, or just talked about it
19:49:00 that sounds like we lack some test for that image?
19:49:23 frickler: yup, largely because we'd need to figure out arm64-specific jobs for it
19:49:33 it is doable but not as easy as getting it covered with say the dib jobs
19:49:55 ianw: I think your spec for zuul is likely to be the plan for properly solving that problem?
19:49:57 it doesn't work on the dib side because devstack doesn't work on arm64
19:50:11 ah
19:51:03 we could do a "does it start" test
19:51:43 ya we would fail that currently and that would be an improvement
19:51:56 won't cover everything but should ensure basic functionality
19:52:02 that might be "good enough" for a secondary architecture exercise anyway
19:52:23 since if it starts, the rest of its operation is unlikely to differ substantially
19:53:19 yeah, not exactly sure what that would look like -- perhaps just start a ZK container and make sure it gets into a listening state?
19:53:57 although probably just tox tests would pick this up too?
19:53:58 if we can have it build a simple image but not upload it (do we have a way to force a build without upload?) then we can nodepool dib-image-list it
19:54:10 maybe we should just run that in check-arm64?
19:54:29 ianw: you would need to ensure deps are installed the same way when running unittests, but that should work too
19:55:03 part of this is an artifact of how we make the qemu arm64 emulated docker image build run in a reasonable time frame
19:55:13 if we just ran tox on arm64 it would probably work
19:55:24 because it would find the sdist for openshift 0.11.2 and install from that.
19:55:24 (or fail, more importantly)
19:55:34 oh, i see what you mean
19:55:35 well in this case it wouldn't fail
19:55:39 yeah
19:55:52 if we reproduced the same install method for unittests then it would fail
19:55:54 so tox is not good enough
19:56:23 yeah, it's more making sure we're pulling the same wheels etc. in tox
19:56:40 in this case, not good enough without setting nonstandard options for pip's dep solver to skip source-only versions anyway
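(The "does it start" smoke test floated around 19:51-19:53 could be as small as the following Ansible sketch, presumably run from a check-arm64 job. The container names, paths, and compose file are assumptions for illustration; `nodepool dib-image-list` is the real client command mentioned above:)

    - name: Start ZooKeeper and the freshly built nodepool-builder image
      command: docker-compose -f /etc/nodepool-smoke/docker-compose.yaml up -d

    - name: Poll until the builder answers a (possibly empty) image listing
      command: docker exec nodepool-builder nodepool dib-image-list
      register: dib_list
      until: dib_list.rc == 0
      retries: 10
      delay: 30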
19:57:35 this does more-or-less cycle back to the rough spec from our discussions on arm64 wheels + zuul
19:57:50 yup, I think if we focus on that we'll largely solve this specific problem
19:57:55 #link https://review.opendev.org/c/zuul/zuul/+/815406
19:57:57 any arm64 testing would be to sanity check that result
19:58:09 and we can possibly rely on unittests then if we have the better arm64 stuff for zuul
19:58:12 yeah, so i think calling this out there and making sure we address it is probably the way forward
19:58:46 ++
19:58:51 i can update that for the testing case and we can loop back on it
20:00:03 sounds good. And we are at time
20:00:05 thank you everyone!
20:00:15 thanks clarkb!
20:00:27 We'll see you here next week and the week after, but then many of us have a big holiday in three weeks
20:00:43 I expect that I won't be around much during the week of US thanksgiving
20:00:45 #endmeeting