19:01:01 #startmeeting infra
19:01:01 Meeting started Tue Dec 14 19:01:01 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:01 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:01 The meeting name has been set to 'infra'
19:01:04 #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000309.html Our Agenda
19:01:11 o/
19:01:38 #topic Announcements
19:02:13 We'll cancel next week's meeting and the meeting on January 4, 2022. I'll see what the temperature for having a meeting on the 28th is on the 27th. Though I half expect no one to be around for that one either :)
19:02:31 Hopefully we can all enjoy a bit of rest and holidays and so on
19:03:11 #topic Actions from last meeting
19:03:31 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-12-07-19.01.txt minutes from last meeting
19:03:39 There were no actions recorded so we'll just dive straight in
19:03:44 #topic Topics
19:03:53 #topic Log4j vulnerability
19:04:24 Last week on Thursday afternoon, relative to me, a 0day RCE in a popular java library was disclosed
19:04:53 The vast majority of the java applications that we care about either use an older version of the library or don't use it at all and were not vulnerable
19:05:11 The exception was meetpad, which we shut down in response
19:05:29 though i was surprised to realize just how much java we do have scattered throughout our services
19:05:30 Since then jitsi devs have patched and updated docker images, which we have updated to, and the service is running again
19:06:05 The roughest part of this situation was that this was not a coordinated disclosure with well understood behaviors and parameters. Instead it was a fire drill with a lot of FUD and misinformation floating around
19:06:45 I ended up doing a fair bit of RTFSing and reading between the lines of what others had said that night to gain confidence that the older version was not affected, and eventually the authors of the older code pushed for the current log4j authors to update their statements to make them accurate and clear, confirming our analysis
19:07:04 and led to us having to do a lot of source code level auditing
19:07:19 Thank you to everyone that helped out digging into this and responding. I think a number of other organizations and individuals have had a much rougher go of it.
19:08:05 and still have and will continue for some time
19:08:24 Really just wanted to mention it happened, we were aware right about when it started to become public, and the whole team ended up digging in and assessing our risk as well as responding in the case of jitsi. And that is deserving of thanks, so thank you all!
19:08:53 Is there anything else to add on this subject?
19:10:06 i'll drink to that!
19:10:16 thanks everybody!
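(For reference on the jar-level auditing mentioned above: the common check during this incident was to look inside each jar -- jars are just zip archives -- for the log4j 2.x JndiLookup class that CVE-2021-44228 abuses; log4j 1.x never shipped that class, which is consistent with the older-version analysis above. The following is only a rough sketch of that kind of scan, not the tooling anyone actually ran here; the directory argument is illustrative.)

    #!/usr/bin/env python3
    """Flag jars under a directory tree that bundle the log4j 2.x JndiLookup class."""
    import sys
    import zipfile
    from pathlib import Path

    # The lookup class behind CVE-2021-44228; it only exists in log4j-core 2.x.
    JNDI_CLASS = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

    def scan(root: Path) -> int:
        hits = 0
        for jar in root.rglob("*.jar"):
            try:
                # Jar files are zip archives, so zipfile can list their contents.
                with zipfile.ZipFile(jar) as archive:
                    if JNDI_CLASS in archive.namelist():
                        print(f"contains JndiLookup: {jar}")
                        hits += 1
            except zipfile.BadZipFile:
                print(f"skipping unreadable jar: {jar}", file=sys.stderr)
        return hits

    if __name__ == "__main__":
        root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
        sys.exit(1 if scan(root) else 0)

(A scan like this does not look inside nested jars bundled into fat/uber jars or inside container images, which is part of why the source code level auditing described above was still needed.)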
19:10:29 #topic Improving OpenDev's CD throughput
19:10:34 Sounds like that was it so we can move on
19:11:08 ianw has made good progress getting our serially run jobs all organized. Now that we are looking at running in parallel, the next step is centralizing the git repo updates for system-config on bridge at the beginning of each buildset
19:11:16 since we don't want the jobs fighting over repo contents
19:11:30 What this exposed is that bootstrapping the bridge currently requires human intervention
19:11:48 ianw is wondering if we should have zuul do a bare minimum of bootstrapping so that subsequent jobs can take it from there
19:12:01 Doing this requires using zuul secrets
19:12:03 #link https://review.opendev.org/c/opendev/infra-specs/+/821645 -- spec outlining some of the issues with secrets
19:12:08 #link https://review.opendev.org/c/opendev/system-config/+/821155 -- sample of secret writing; more info in changelog
19:12:22 ianw: ^ feel free to dive into more detail or point out the next steps that need help
19:13:13 yeah, in the abstract, i guess our decision is to think about moving secrets into zuul
19:14:07 and the spec link above is the best venue for that discussion?
19:14:10 i think this has quite a few advantages, particularly around running credential updates through gerrit as changes as "usual"
19:15:14 the obvious disadvantages are that we have more foot-gun potential for publishing things, and more exposure to Zuul issues
19:15:48 the spec -- i'm not sure if this has been discussed previously in the design of all this
19:16:06 i mean the spec is probably a good place for discussion
19:16:19 ya I think the previous implementation was largely "how do we combine what we had before with zuul with minimal effort"
19:16:33 And having a spec to formally work through the design seems like a great idea.
19:16:46 if people are ok with 821155 i think it's worth a merge and revert when quiet just to confirm it works as we think it works
19:17:03 considering the potential impact involved and the fast approaching holidays I don't think this is something we want to rush through, but infra-root review of that spec when able would be great
19:17:33 and ya 821155 seems low impact enough but a good poc as input to the spec.
19:18:26 And maybe if we end up meeting on the 28th we can discuss the spec a bit more
19:18:36 i'd also say it's not proposing a radical change to the CD pipeline
19:18:51 though I doubt we'll be able to do that synchronously so keeping discussion on the spec as much as possible is probably best
19:18:59 the credentials are still on the bastion host, which is still running ansible independently
19:19:24 just instead of admins updating them in git, Zuul would put them on the bastion host
19:19:38 right
19:19:48 we would also have to run with both models -- i'm not proposing we move everything wholesale
19:20:15 probably new things could work from zuul, and, like puppet, as it makes sense as we migrate we can move bits
19:21:16 thank you for writing the spec up. I'll do my best to get to it this week (though unsure if I'll get to it today)
19:21:52 np -- i agree if nobody has preexisting thoughts on "oh, we considered that and ..." we can move on and leave it to the spec
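(A small illustration of the "Zuul would put them on the bastion host" step discussed above, and of why the foot-gun concern is mostly about process rather than mechanics: whatever writes credentials onto bridge generally wants to avoid leaving a half-written or badly permissioned file behind. This is just a generic sketch of that write-then-rename pattern; install_secret and the example path are made up, and this is not what 821155 or the spec actually implement -- the real mechanism under discussion is Zuul secrets handed to Ansible.)

    import os
    import tempfile

    def install_secret(path: str, content: str) -> None:
        """Write content to path without a world-readable or half-written window."""
        directory = os.path.dirname(path) or "."
        # mkstemp creates the temporary file with mode 0600, and placing it in
        # the target directory keeps the final rename on the same filesystem.
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(content)
            os.replace(tmp, path)  # atomic rename over any existing file
        except BaseException:
            os.unlink(tmp)
            raise

    # Hypothetical usage on the bastion host:
    # install_secret("/home/zuul/secrets/cloud-credentials.yaml", secret_blob)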
19:23:27 #topic Container Maintenance
19:23:33 #link https://etherpad.opendev.org/p/opendev-container-maintenance
19:23:55 I've been working on figuring out which images need bullseye updates, what containers need uid changes, and so on, and ended up throwing it all into this etherpad
19:24:19 I've got changes up for all of the container images that need bullseye at this point and have been trying to approve them slowly, one or two at a time, as I can monitor them going in
19:24:58 So far the only real problem has been uWSGI's wheel not building reliably, but jrosser dug into that and we think we identified the issue and are reasonably happy with the hacky workaround for now
19:25:15 Once the bullseye updates are done I'm hoping to look at the irc bot uid updates next
19:25:54 When I did the audit I noticed that we should also plan to upgrade our zookeeper cluster and our mariadb containers to more recent versions. The current versions are still supported which means this isn't urgent, but keeping up and learning what that process is seems like a good idea
19:26:13 For zookeeper we'd go to 3.6.latest as 3.7 isn't fully released yet aiui. For mariadb it is 10.6 I think
19:26:36 And another thing that occurred to me is I'm not sure if we are pruning the CI registry's old container contents
19:26:41 corvus: ^ do you know?
19:27:13 I'm happy for people to help out with this stuff too, feel free to put your name next to an item on that etherpad and I'll look for changes
19:28:22 #topic Mailman Ansible fixups
19:28:41 fungi has a stack of changes to better test our ansible for mailman and in the process fix that ansible
19:28:48 #link https://review.opendev.org/c/opendev/system-config/+/820900/ this stack should address newlist issues.
19:29:27 In particular we don't run newlist properly as it expects input. But there is an undocumented flag to make it stop looking for input, which we switch to. Then we also update firewall rules to block outbound smtp on the test nodes so we can verify it attempted to send email, rather than telling newlist to not send email
19:29:52 fungi: do you need anything else to move that along or is it largely a matter of you having time to approve changes and watch them?
19:29:55 i'm working on the penultimate change in that series to move the service commands into a dedicated playbook rather than running them in testinfra
19:30:07 ah right that was suggested in review.
19:30:10 the earlier changes can merge any time
19:30:35 i'm about done with the requested update to 821144
19:30:41 fungi: were you planning to +A them when ready then?
19:30:48 mostly want to make sure you aren't waiting on anything from someone else
19:30:52 sure, i can if nobody beats me to it
19:31:36 clarkb: We are not
19:32:00 corvus: ok so that is something we should be doing. Is there anything preventing us from pruning those? or just need to do it?
19:32:31 Suspected corruption bug
19:32:58 corvus: thanks, I've updated the etherpad with these notes
19:33:08 #topic Nodepool image cleanup.
19:33:15 #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html
19:33:31 (sorry for phone brevity)
19:33:58 I'm realizing that the timelines I put in that email were maybe a bit aggressive with holidays fast approaching and other demands. I think this is fine. Better to say it is going away at the end of the month and take it away a week or two later than to take it away earlier than expected
19:34:23 I would volunteer to try to keep gentoo running
19:34:24 I expect that this cleanup can be broken down into subtasks if others are able to help (one person per image or similar)
19:34:50 frickler: cool, good to hear. Maybe you can respond to the thread so that other gentoo-interested parties know to reach out?
19:34:59 #link https://review.opendev.org/c/opendev/base-jobs/+/821649
19:35:07 yeah, I can do that
19:35:07 i've proposed the fedora-latest update, i think we can do that
19:35:19 also there's some other missing labels there in the follow-on
19:35:26 ianw: frickler: thanks
19:35:30 ianw: for fedora I think devstack is still running with f34?
19:36:04 yeah, we should update that -- it doesn't actually boot on most of our hosts
19:36:19 ya f34 seems like a dead end unfortunately.
19:36:45 or, actually, it was the other way, it doesn't boot on rax because they dropped some xen things from the initrd
19:36:55 so it can't find its root disk
19:36:59 I'll try to start the tumbleweed cleanup after christmas if I've got time. I expect this is one of our least used images and it hasn't held up to its promise of being an early warning system
19:37:12 And then I can help with the centos-8 cleanup in the new year
19:37:31 note with f35 there's some intersection with -> https://review.opendev.org/c/openstack/diskimage-builder/+/821526
19:38:06 it looks like "grub2-mkconfig" isn't doing what we hoped on the centos 9-stream image (*not* centos-minimal) -- still need to investigate fully
19:39:38 it's always something with every new release :)
19:40:19 #topic Open Discussion
19:40:30 That was it for the published agenda. Anything else?
19:40:32 two things from me
19:40:40 go for it
19:40:42 the promised update for 821771 is up for review now
19:40:49 a) I finally finished the exim4 paniclog cleanup
19:41:00 (er, for 821144 i meant)
19:41:11 all entries were from immediately after the nodes were set up
19:41:34 yes, it seems like exim gets into an unhappy state during bootstrapping
19:41:35 and all were bionic nodes, so not something likely to repeat
19:41:47 oh, doesn't happen on focal? that's great news
19:42:09 at least I didn't see any incidents there
19:42:48 b) this came up discussing zuul restarts: do we want to have some kind of freeze for the holidays?
19:43:35 ya I think we should do our best to avoid big central changes for sure. It is probably ok to do job updates and more leaf node things since they can be reverted easily
19:43:42 my suggestion would be to avoid things like zuul or other updates if possible for some time, maybe from this friday eob to 3rd of jan?
19:43:44 I disagree
19:44:19 I would add that if people are around to babysit then my concern goes down significantly
19:44:21 zuul is in a stabilization period; if we avoid restarts, we're only going to keep running buggy code and avoid fixes
19:44:30 if zuul development is speeding toward 5.0.0 during that time, i'd rather opendev didn't fall behind
19:44:32 this is the best period to make infrastructure changes
19:44:39 Historically, what we have had issues with is people making changes (pbr release on christmas eve one year iirc) then disappearing
19:44:51 fewer people being around is the best time for zuul restarts
19:44:56 that's why i do so many over weekends
19:45:01 If we avoid the "then disappearing" bit we should probably be ok
19:45:30 if we're going to avoid infra changes when people are busy with openstack releases and also avoid them when they aren't, then there's precious little time to actually do them.
19:45:41 maybe if people avoid doing things late in their day when they'll be falling unconscious soon after?
19:45:46 also with zuul specifically we can revert to a known good (or good enough) version
19:45:48 i don't think i have a history of doing that?
19:45:55 corvus: no you don't
19:46:09 correct
19:46:31 I guess what I'm saying is it will often come down to a judgement call. If a change is revertable (say in the case of zuul) and the person driving it plans to pay attention (again say with zuul) then it is probably fine
19:46:45 but something like a gerrit 3.4 upgrade wouldn't fall into that bucket I don't think
19:47:04 and plenty of information on how to undo things helps in case issues crop up much later (a day or three even)
19:47:06 in the past we've said we should be "slushy"
19:47:46 I think that is what I'm advocating for here. Which still allows for changes to be made, but with an understanding of how to revert and/or proceed if something goes wrong and someone able to do that
19:48:29 and ya the next week and week after seem like the period of time where we'll want to take that into account
19:49:14 fwiw from the 23rd (.au time) i'm not going to be easily within reach of an interactive login till prob around jan 10
19:49:16 I just want to avoid repeated PBR incidents, but I think we all want to avoid that and have a good understanding of what is likely to be safe or at least revertable
19:50:28 the reason I dig up things like image removals and various container image updates is that they both fit under "likely safe to do, with a revert path if necessary" :)
19:50:39 But they are still important updates so not a waste of time
19:50:43 i'll have family visiting from the 24th through the 31st but will try to be around some for that week
19:51:32 Anyway I don't think we need to say no zuul updates. But if you are doing them, monitoring the results and being prepared to revert to last known good seems reasonable (and so far that has been done with zuul so I'm not worried)
19:52:00 sounds like a reasonable compromise, ok
19:52:11 it seems like between now and 5.0.0 we should expect updates to come with more fixes than problems anyway
19:52:39 frickler: what if we say something like "Be aware that there may be minimal support from December 20-31. If you are making changes in OpenDev please consider what your revert path or path out of danger is before making the change and monitor to ensure this isn't necessary"
19:53:16 Also historically setuptools is due for a release that will break everything next week
19:53:23 Let's hope they don't do that to us :)
19:53:43 right... most of the fire drills this time of year have traditionally not been of our own making
19:54:21 also, credit to zuul, I don't see zuul upgrades as "big central changes" these days. Zuul is well tested and monitored and we have rollback plans
19:54:41 Something like Gerrit or our bridge ansible version or PBR is what I've got in mind
19:56:20 Sounds like that may be about it. Last call :)
19:56:48 i'll go ahead and approve the smtp egress block for our deploy tests now
19:57:03 will keep an eye on things while putting dinner together
19:58:31 Alright, sounds like that was it. Thank you everyone
19:58:34 #endmeeting