19:01:01 <clarkb> #startmeeting infra
19:01:01 <opendevmeet> Meeting started Tue Dec 14 19:01:01 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:01 <opendevmeet> The meeting name has been set to 'infra'
19:01:04 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000309.html Our Agenda
19:01:11 <ianw> o/
19:01:38 <clarkb> #topic Announcements
19:02:13 <clarkb> We'll cancel next week's meeting and the meeting on January 4, 2022. I'll see what the temperature for having a meeting on the 28th is on the 27th. Though I half expect no one to be around for that one either :)
19:02:31 <clarkb> Hopefully we can all enjoy a bit of rest and holidays and so on
19:03:11 <clarkb> #topic Actions from last meeting
19:03:31 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-12-07-19.01.txt minutes from last meeting
19:03:39 <clarkb> There were no actions recorded so we'll just dive straight in
19:03:44 <clarkb> #topic Topics
19:03:53 <clarkb> #topic Log4j vulnerability
19:04:24 <clarkb> Last Thursday afternoon, my time, a 0day RCE in a popular java library was disclosed
19:04:53 <clarkb> The vast majority of the java applications that we care about either use an older version of the library or don't use it at all, so they were not vulnerable
19:05:11 <clarkb> The exception was meetpad, which we shut down in response
19:05:29 <fungi> though i was surprised to realize just how much java we do have scattered throughout our services
19:05:30 <clarkb> Since then the jitsi devs have patched and published updated docker images, which we have updated to, and the service is running again
19:06:05 <clarkb> The roughest part of this situation was that this was not a coordinated disclosure with well understood behaviors and parameters. Instead it was a fire drill with a lot of FUD and misinformation floating around
19:06:45 <clarkb> I ended up doing a fair bit of RTFSing and reading between the lines of what others had said that night to gain confidence that the older version was not affected. Eventually the authors of the older code pushed the current log4j authors to update their statements to make them accurate and clear, confirming our analysis
19:07:04 <fungi> and led to us having to do a lot of source code level auditing
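[For context, a minimal sketch of the kind of check that helps with this sort of audit: scanning jars for the JndiLookup class that makes log4j 2.x exploitable (log4j 1.x does not ship it). The scan root and this exact approach are illustrative assumptions, not the commands that were actually run on our hosts.]

    import sys
    import zipfile
    from pathlib import Path

    # log4j 2.x is exploitable via its JNDI lookup class; 1.x does not include it.
    MARKER = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

    def scan(root):
        # Walk the tree, open every jar as a zip, and flag any that bundle the marker class.
        for jar in Path(root).rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if MARKER in zf.namelist():
                        print(f"possible log4j 2.x JNDI lookup: {jar}")
            except (zipfile.BadZipFile, OSError):
                pass  # skip unreadable or non-jar files

    if __name__ == "__main__":
        scan(sys.argv[1] if len(sys.argv) > 1 else "/")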
19:07:19 <clarkb> Thank you to everyone that helped out digging into this and responding. I think a number of other organizations and individuals have had a much rougher go of it.
19:08:05 <frickler> and still have and will continue for some time
19:08:24 <clarkb> Really just wanted to mention that it happened, that we were aware right about when it started to become public, and that the whole team ended up digging in and assessing our risk as well as responding in the case of jitsi. That is deserving of thanks, so thank you all!
19:08:53 <clarkb> Is there anything else to add on this subject?
19:10:06 <fungi> i'll drink to that!
19:10:16 <fungi> thanks everybody!
19:10:29 <clarkb> #topic Improving OpenDev's CD throughput
19:10:34 <clarkb> Sounds like that was it so we can move on
19:11:08 <clarkb> ianw has made good progress getting our serially run jobs all organized. Now that we are looking at running in parallel the next step is centralizing the git repo updates for system-config on bridge at the beginning of each buildset
19:11:16 <clarkb> since we don't want the jobs fighting over repo contents
19:11:30 <clarkb> What this exposed is that bootstrapping the bridge currently requires human intervention
19:11:48 <clarkb> ianw is wondering if we should have zuul do a bare minimum of bootstrapping so that subsequent jobs can take it from there
19:12:01 <clarkb> Doing this requires using zuul secrets
19:12:03 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/821645 -- spec outlining some of the issues with secrets
19:12:08 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/821155 -- sample of secret writing; more info in changelog
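[As background on what "secret writing" tends to look like in practice: the general pattern is a job decrypting a Zuul secret and writing it onto the bastion with owner-only permissions. The sketch below is a generic illustration under that assumption; the path and names are hypothetical and not taken from 821155.]

    import os

    def write_secret(path, content):
        """Write secret material with 0600 permissions from the start,
        avoiding a window where the file exists world-readable."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as f:
            f.write(content)

    # Hypothetical usage: the playbook hands the decrypted secret to something like
    # write_secret("/home/zuul/credentials/example.conf", secret_material)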
19:12:22 <clarkb> ianw: ^ feel free to dive into more detail or point out what the next steps that need help are
19:13:13 <ianw> yeah, in the abstract, i guess our decision is to think about moving secrets into zuul
19:14:07 <clarkb> and the spec link above is the best venue for that discussion?
19:14:10 <ianw> i think this has quite a few advantages, particularly around running credential updates through gerrit as changes as "usual"
19:15:14 <ianw> the obvious disadvantages are that we have more foot-gun potential for publishing things, and more exposure to Zuul issues
19:15:48 <ianw> the spec -- i'm not sure if this has been discussed previously in the design of all this
19:16:06 <ianw> i mean the spec is probably a good place for discussion
19:16:19 <clarkb> ya I think the previous implementation was largely "how do we combine what we had before with zuul with minimal effort"
19:16:33 <clarkb> And having a spec to formally work through the design seems like a great idea.
19:16:46 <ianw> if people are ok with 821155 i think it's worth a merge and revert when quiet just to confirm it works as we think it works
19:17:03 <clarkb> considering the potential impact involved and the fast approaching holidays I don't think this is something we want to rush through, but infra-root review of that spec when able would be great
19:17:33 <clarkb> and ya 821155 seems low impact enough but good poc as input to the spec.
19:18:26 <clarkb> And maybe if we end up meeting on the 28th we can discuss the spec a bit more
19:18:36 <ianw> i'd also say it's not proposing a radical change to the CD pipeline
19:18:51 <clarkb> though I doubt we'll be able to do that synchronously so keeping discussion on the spec as much as possible is probably best
19:18:59 <ianw> the credentials are still on the bastion host, which is still running ansible independently
19:19:24 <ianw> just instead of admins updating them in git, Zuul would put them on the bastion host
19:19:38 <clarkb> right
19:19:48 <ianw> we would also have to run with both models -- i'm not proposing we move everything wholesale
19:20:15 <ianw> probably new things could work from zuul, and, like puppet, as it makes sense as we migrate we can move bits
19:21:16 <clarkb> thank you for writing the spec up. I'll do my best to get to it this week (though unsure if I'll get to it today)
19:21:52 <ianw> np -- i agree if nobody has preexisting thoughts on "oh, we considered that and ..." we can move on and leave it to the spec
19:23:27 <clarkb> #topic Container Maintenance
19:23:33 <clarkb> #link https://etherpad.opendev.org/p/opendev-container-maintenance
19:23:55 <clarkb> I've been working on figuring out which images need bullseye updates and what containers need uid changes and so on and ended up throwing it all into this etherpad
19:24:19 <clarkb> I've got changes up for all of the container images that need bullseye at this point and have been trying to approve them slowly one or two at a time as I can monitor them going in
19:24:58 <clarkb> So far the only real problem has been uWSGI's wheel not building reliably but jrosser dug into that and we think we identified the issue and are reasonably happy with the hacky workaround for now
19:25:15 <clarkb> Once the bullseye updates are done I'm hoping to look at the irc bot uid updates next
19:25:54 <clarkb> When I did the audit I noticed that we should also plan to upgrade our zookeeper cluster and our mariadb containers to more recent versions. The current versions are still supported which means this isn't urgent, but keeping up and learning what that process is seems like a good idea
19:26:13 <clarkb> For zookeeper we'd go to 3.6.latest as 3.7 isn't fully released yet aiui. For mariadb it is 10.6 I think
19:26:36 <clarkb> And another thing that occurred to me is I'm not sure if we are pruning the CI registry's old container contents
19:26:41 <clarkb> corvus: ^ do you know?
19:27:13 <clarkb> I'm happy for people to help out with this stuff too, feel free to put your name next to an item on that etherpad and I'll look for changes
19:28:22 <clarkb> #topic Mailman Ansible fixups
19:28:41 <clarkb> fungi has a stack of changes to better test our ansible for mailman and in the process fix that ansible
19:28:48 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/820900/ this stack should address newlist issues.
19:29:27 <clarkb> In particular we don't run newlist properly as it expects input. But there is an undocumented flag to make it stop looking for input, which we switch to. Then we also update firewall rules to block outbound smtp on the test nodes, so we can verify it attempted to send email rather than telling newlist to not send email
19:29:52 <clarkb> fungi: do you need anything else to move that along or is it largely a matter of you having time to approve changes and watch them?
19:29:55 <fungi> i'm working on the penultimate change in that series to move the service commands into a dedicated playbook rather than running them in testinfra
19:30:07 <clarkb> ah right that was suggested in review.
19:30:10 <fungi> the earlier changes can merge any time
19:30:35 <fungi> i'm about done with the requested update to 821144
19:30:41 <clarkb> fungi: were you planning to +A them when ready then?
19:30:48 <clarkb> mostly want to make sure you aren't waiting on anything from someone else
19:30:52 <fungi> sure, i can if nobody beats me to it
19:31:36 <corvus> clarkb: We are not
19:32:00 <clarkb> corvus: ok so that is something we should be doing. Is there anything preventing us from pruning those? or just need to do it?
19:32:31 <corvus> Suspected corruption bug
19:32:58 <clarkb> corvus: thanks I've updated the etherpad with these notes
19:33:08 <clarkb> #topic Nodepool image cleanup.
19:33:15 <clarkb> #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html
19:33:31 <corvus> (sorry for phone brevity)
19:33:58 <clarkb> I'm realizing that the timelines I put in that email were maybe a bit aggressive with holidays fast approaching and other demands. I think this is fine. Better to say it is going away at the end of the month and take it away a week or two later than to take it away earlier than expected
19:34:23 <frickler> I would volunteer to try to keep gentoo running
19:34:24 <clarkb> I expect that this cleanup can be broken down into subtasks if others are able to help (one person per image or similar)
19:34:50 <clarkb> frickler: cool good to hear. Maybe you can respond to the thread so that other gentoo interested parties know to reach out?
19:34:59 <ianw> #link https://review.opendev.org/c/opendev/base-jobs/+/821649
19:35:07 <frickler> yeah, I can do that
19:35:07 <ianw> i've proposed the fedora-latest update, i think we can do that
19:35:19 <ianw> also there's some other missing labels there in the follow-on
19:35:26 <clarkb> ianw: frickler thanks
19:35:30 <frickler> ianw: for fedora I think devstack is still running with f34?
19:36:04 <ianw> yeah, we should update that -- it doesn't actually boot on most of our hosts
19:36:19 <clarkb> ya f34 seems like a dead end unfortunately.
19:36:45 <ianw> or, actually, it was the other way, it doesn't boot on rax because they dropped some xen things from the initrd
19:36:55 <ianw> so it can't find its root disk
19:36:59 <clarkb> I'll try to start the tumbleweed cleanup after christmas if I've got time. I expect this is one of our least used images and it hasn't held up to its promise of being an early warning system
19:37:12 <clarkb> And then I can help with the centos-8 cleanup in the new year
19:37:31 <ianw> note with f35 there's some intersection with -> https://review.opendev.org/c/openstack/diskimage-builder/+/821526
19:38:06 <ianw> it looks like "grub2-mkconfig" isn't doing what we hoped on the centos 9-stream image (*not* centos-minimal) -- still need to investigate fully
19:39:38 <clarkb> it's always something with every new release :)
19:40:19 <clarkb> #topic Open Discussion
19:40:30 <clarkb> That was it for the published agenda. Anything else?
19:40:32 <frickler> two things from me
19:40:40 <clarkb> go for it
19:40:42 <fungi> the promised update for 821771 is up for review now
19:40:49 <frickler> a) I finally finished the exim4 paniclog cleanup
19:41:00 <fungi> (er, for 821144 i meant)
19:41:11 <frickler> all entries were from immediately after the nodes were set up
19:41:34 <fungi> yes, it seems like exim gets into an unhappy state during bootstrapping
19:41:35 <frickler> and all were bionic nodes, so not something likely to repeat
19:41:47 <fungi> oh, doesn't happen on focal? that's great news
19:42:09 <frickler> at least I didn't see any incidents there
19:42:48 <frickler> b) this came up discussing zuul restarts: do we want to have some kind of freeze for the holidays?
19:43:35 <clarkb> ya I think we should do our best to avoid big central changes for sure. It is probably ok to do job updates and more leaf node things since they can be reverted easily
19:43:42 <frickler> my suggestion would be to avoid things like zuul or other updates if possible for some time, maybe from this friday eob to 3rd of jan?
19:43:44 <corvus> I disagree
19:44:19 <clarkb> I would add that if people are around to babysit then my concern goes down significantly
19:44:21 <corvus> zuul is in a stabilization period; if we avoid restarts, we're only going to keep running buggy code and avoid fixes
19:44:30 <fungi> if zuul development is speeding toward 5.0.0 during that time, i'd rather opendev didn't fall behind
19:44:32 <corvus> this is the best period to make infrastructure changes
19:44:39 <clarkb> Historically, what we have had issues with is people making changes (pbr release on christmas eve one year iirc) then disappearing
19:44:51 <corvus> fewer people being around is the best time for zuul restarts
19:44:56 <corvus> that's why i do so many over weekends
19:45:01 <clarkb> If we avoid the "then disappearing" bit we should probably be ok
19:45:30 <corvus> if we're going to avoid infra changes when people are busy with openstack releases and also avoid them when they aren't, then there's precious little time to actually do them.
19:45:41 <fungi> maybe if people avoid doing things late in their day when they'll be falling unconscious soon after?
19:45:46 <clarkb> also with zuul specifically we can revert to a known good (or good enough)
19:45:48 <corvus> i don't think i have a history of doing that?
19:45:55 <clarkb> corvus: no you don't
19:46:09 <fungi> correct
19:46:31 <clarkb> I guess what I'm saying is it will often come down to a judgement call. If a change is revertable (say in the case of zuul) and the person driving it plans to pay attention (again say with zuul) then it is probably fine
19:46:45 <clarkb> but something like a gerrit 3.4 upgrade wouldn't fall into that bucket I don't think
19:47:04 <fungi> and plenty of information on how to undo things helps in case issues crop up much later (a day or three even)
19:47:06 <clarkb> in the past we've said we should be "slushy"
19:47:46 <clarkb> I think that is what I'm advocating for here. Which still allows for changes to be made but understanding how to revert and/or proceed if something goes wrong and ensuring someone is able to do that
19:48:29 <clarkb> and ya the next week and week after seem like the period of time where we'll want to take that into account
19:49:14 <ianw> fwiw from the 23rd (.au time) i'm not going to be easily within reach of an interactive login till prob around jan 10
19:49:16 <clarkb> I just want to avoid repeated PBR incidents, but I think we all want to avoid that and have a good understanding of what is likely to be safe or at least revertable
19:50:28 <clarkb> the reason I dig up things like image removals and various container image updates is they both fit under the likely safe to do with a revert path if necessary :)
19:50:39 <clarkb> But are still important updates so not a waste of time
19:50:43 <fungi> i'll have family visiting from the 24th through the 31st but will try to be around some for that week
19:51:32 <clarkb> Anyway I don't think we need to say no zuul updates. But if you are doing them monitoring the results and be prepared to revert to last known good seems reasonable (and so far that has been done with zuul so I'm not worried)
19:52:00 <frickler> sounds like a reasonable compromise, ok
19:52:11 <fungi> it seems like between now and 5.0.0 we should expect updates to come with more fixes than problems anyway
19:52:39 <clarkb> frickler: what if we say something like "Be aware that there may be minimal support from December 20-31, if you are making changes in OpenDev please consider what your revert path or path out of danger is before making the change and monitor to ensure this isn't necessary"
19:53:16 <clarkb> Also historically setuptools is due for a release that will break everything next week
19:53:23 <clarkb> Let's hope they don't do that to us :)
19:53:43 <fungi> right... most of the fire drills this time of year have traditionally not been of our own making
19:54:21 <clarkb> also credit to zuul, I don't see zuul upgrades as "big central changes" these days. Zuul is well tested and monitored and we have rollback plans
19:54:41 <clarkb> Something like Gerrit or our bridge ansible version or PBR are what I've got in mind
19:56:20 <clarkb> Sounds like that may be about it. Last call :)
19:56:48 <fungi> i'll go ahead and approve the smtp egress block for our deploy tests now
19:57:03 <fungi> will keep an eye on things while putting dinner together
19:58:31 <clarkb> Alright sounds like that was it. Thank you everyone
19:58:34 <clarkb> #endmeeting