19:01:41 <clarkb> #startmeeting infra
19:01:41 <opendevmeet> Meeting started Tue Nov  1 19:01:41 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:41 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:41 <opendevmeet> The meeting name has been set to 'infra'
19:01:57 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000376.html Our Agenda
19:02:02 <clarkb> #topic Announcements
19:02:09 <clarkb> There were no announcements so we can dive right in
19:02:39 <clarkb> #topic Topics
19:02:44 <clarkb> #topic Bastion Host Updates
19:02:54 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group
19:03:01 <clarkb> #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:03:11 <clarkb> those are a couple of groups of changes to keep this moving along
19:03:40 * frickler should finally review some of those
19:03:55 <clarkb> frickler also discovered that the secrets management key is missing on the new host. Something that should be migrated over and tested before we remove the old one
19:04:19 <clarkb> but I think we're really close to being able to finish this up. ianw if you are around anything else to add?
19:04:33 <frickler> we should also agree when to move editing those from one host to the other
19:04:50 <ianw> o/
19:04:58 <clarkb> ++ at this point I would probably say anything that can't be done on the new host is a bug and we should fix that as quickly as possible and use the new host
19:05:14 <ianw> yes please move over anything from your home directories, etc. that you want
19:05:50 <ianw> i've added a note on the secret key to
19:05:51 <ianw> #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10
19:06:04 <ianw> thanks for that -- i will be writing something up on that
19:06:36 <clarkb> I also need to review the virtualenv management change since that will ensure we have a working openstackclient for rax and others
19:07:39 <ianw> yeah a couple of changes are out there just to clean up some final things
19:07:53 <clarkb> also the zuul reboot playbook ran successfully off the new bridge
19:08:04 <frickler> ianw: are you o.k. with rebooting bridge01 after the openssl updates or is there some blocker for that?
19:08:27 <frickler> (I ran the apt update earlier already)
19:08:29 <clarkb> one thing to consider when doing ^ is if we have any infra prod jobs that we don't want to conflict with
19:08:36 <clarkb> but I'm not aware of any urgent jobs at the moment
19:08:44 <ianw> the gist is I think that we can get testing to use "bridge99.opendev.org" -- which is a nice proof that we're not hard-coding in references
19:09:25 <ianw> i think it's fine to reboot -- sorry i've been out last two days and not caught up but i can babysit it soonish
19:09:43 <clarkb> sounds good we can coordinate further after the meeting.
19:09:45 <clarkb> Anything else on this topic?
19:10:11 <ianw> nope, thanks for the reviews and getting it this far!
19:10:33 <clarkb> #topic Upgrading Bionic Servers
19:10:51 <clarkb> at this point I think we've largely sorted out the jammy related issues and we should be good to boot just about anything on jammy
19:11:00 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/862835/ Disable phased package updates
19:11:22 <clarkb> that is one remaining item though. Basically it says don't do phased updates, which will ensure that our jammy servers all get the same packages at the same time
19:11:43 <clarkb> rather than staggering them over time. I'm concerned the staggering will just lead to confusion about whether or not a package is related to unexpected behaviors
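(For reference, a minimal sketch of what opting out of phased updates can look like in apt configuration; the path and exact option used by the change under review may differ:)
```
# /etc/apt/apt.conf.d/99-phased-updates -- illustrative path
# Always include phased updates immediately so every host picks up the
# same package versions at the same time instead of being staggered.
APT::Get::Always-Include-Phased-Updates "true";
```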
19:12:27 <clarkb> https://review.opendev.org/c/opendev/zone-opendev.org/+/862941 and its depends on are related to gitea-lb02 being brought up as a jammy node too (this is cleanup of old nodes)
19:12:49 <clarkb> Otherwise nothing extra to say. Just that we can (and probably should) do this for new servers and replacing old servers with jammy is extra good too
19:13:03 <clarkb> I'm hoping I'll have time later this week to replace another server (maybe one of the gitea backends)
19:13:13 <clarkb> #topic Removing snapd
19:13:23 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/862834 Change to remove snapd from our servers
19:13:46 <clarkb> after we discussed this in our last meeting I poked around on snapcraft and in ubuntu package repositories and I think there isn't much reason for us to have snapd installed on our servers
19:14:16 <clarkb> This change can and will affect a number of servers though so worth double checking. I haven't done an audit to see which would be affected but we could do that if we think it is necessary
19:15:03 <clarkb> to do my filtering I looked for snaps maintained by canonical on snapcraft to see which ones were likely to be useful for us. And many of them continue to have actual packages or aren't useful to servers
19:15:10 <clarkb> Reviews very much welcome
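(A minimal sketch of what such a removal might look like as an Ansible task; the real system-config change may be structured differently:)
```yaml
# Illustrative task only, not the exact contents of the change under review.
- name: Remove snapd and purge its configuration
  apt:
    name: snapd
    state: absent
    purge: true
```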
19:15:45 <clarkb> #topic Mailman 3
19:16:01 <clarkb> Since our last meeting the upstream for the mailman 3 docker images did land my change to add lynx to the images
19:16:09 <clarkb> No responses on the other issues I filed though.
19:16:42 <clarkb> Unfortunately, I think this makes the question of whether or not we should fork more confusing, not easier. I'm leaning more towards forking at this point simply because I'm not sure how responsive upstream will be. But feedback there continues to be welcome
19:18:03 <clarkb> When fungi is back we should make a decision and move forward
19:18:14 <clarkb> #topic Updating base python docker images to use pip wheel
19:18:34 <clarkb> Upstream seems to be moving slowly on my bugfix PR. Some of that slowness is due to changes to git that happened at the same time and impacted their CI setup
19:18:45 <clarkb> Either way I think we should strongly consider their suggestion of using pip wheel though
19:18:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/862152
19:19:13 <clarkb> There is a nodepool and a dib change too which help illustrate that this change functions and doesn't regress features like siblings installs
19:19:36 <clarkb> It should be a noop for us, but makes us more resilient to pip changes if/when they happen in the future. Reviews very much welcome on this as well
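(The general builder-image pattern being discussed, sketched with illustrative image names and paths rather than the exact opendev base images:)
```Dockerfile
# Builder stage: build wheels for the project and all of its dependencies
FROM docker.io/library/python:3.10-slim AS builder
COPY . /src
RUN pip wheel --wheel-dir /output/wheels /src

# Final stage: install only from the prebuilt wheels, without hitting the network
FROM docker.io/library/python:3.10-slim
COPY --from=builder /output/wheels /wheels
RUN pip install --no-index --find-links /wheels /wheels/*.whl
```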
19:20:43 <clarkb> #topic Etherpad service logging
19:20:56 <clarkb> ianw: did you have time to write the change to update etherpad logging to syslog yet?
19:21:40 <ianw> oh no, sorry, totally distracted on that
19:21:42 <ianw> will do
19:22:13 <clarkb> thanks
19:22:32 <clarkb> unrelated to the logging issue I had to reboot etherpad after its db volume got remounted RO due to errors
19:22:46 <clarkb> after reboot it mounted the volume just fine as far as I could tell and things have been happy since yesterday
19:22:55 <clarkb> (just a heads up I don't think any action is necessary there)
19:23:02 <clarkb> #topic Unexpected Gerrit Reboot
19:23:14 <clarkb> This happened around 06:00 UTC ish today
19:23:33 <clarkb> basically looks like review02.o.o rebooted and when it came back it had no networking until ~13:00 UTC
19:23:48 <clarkb> we suspect something on the cloud side which would explain the lack of networking for some time as well. But we haven't heard back on that yet
19:24:10 <frickler> do we have some support contact at vexxhost other than mnaser__ ?
19:24:28 <clarkb> frickler: mnaser__ has been the primary contact. There have been others in the past but I don't think they are at vexxhost anymore
19:25:24 <clarkb> If we think it is important I can ask if anyone at the foundation has contacts we might try
19:25:38 <frickler> there's also the (likely unrelated) IPv6 routing issue, which I think is more important
19:25:40 <clarkb> but at this point things seem stable and we're mostly just interested in confirmation of our assumptions? Might be ok to wait a day
19:25:41 <ianw> one thing i noticed was that corvus i think had to start the container?
19:26:05 <clarkb> ianw: yes, our docker-compose file doesn't specify a restart policy which mimics the old pre docker behavior of not starting automatically
19:26:21 <clarkb> frickler: re ipv6 thats a good point.
19:27:14 <frickler> regarding manual start we assumed that that was intentional and agreeable behavior
19:27:30 <corvus> i did perform some sanity checks to make sure the server looked okay before starting
19:27:44 <corvus> (which is one of the benefits of that)
19:27:49 <ianw> something to think about -- but also this is the first case i can think of since we migrated the review host to vexxhost that there's been what seems to be instability beyond our control
19:28:14 <ianw> so yeah, no need to make urgent changes to handle unscheduled reboots at this point :)
19:28:17 <clarkb> considering that we seemed to lack network access anyway I'm not sure it's super important to auto restart based on this event
19:28:20 <clarkb> we would've waited either way
19:28:44 <frickler> the other thing worth considering is whether we want to have some local account to allow debugging via vnc
19:28:57 <corvus> but i think honestly the main reason we didn't have it start on boot is so that if we stopped a service manually it didn't restart automatically.  that can be achieved with a "restart: unless-stopped" policy.  so really, there are 2 reasons not to start on boot, and we can evaluate whether we still like one, the other, both, or none of them.
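(As a sketch, the policy corvus mentions would look like this in a docker-compose service definition; the service and image names here are illustrative:)
```yaml
services:
  gerrit:
    image: docker.io/opendevorg/gerrit:3.5  # illustrative
    # Restart after a host reboot or daemon restart, but stay down if an
    # operator explicitly stopped the container.
    restart: unless-stopped
```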
19:29:16 <frickler> since my first assumption was lack of network device caused by a kernel update
19:29:33 <clarkb> frickler: the way we would normally handle that today is via a rescue instance
19:30:05 <clarkb> when you rescue an instance with nova it shuts down the instance then boots another image and attaches the broken instance's disk to it as a device which allows you to mount the partitions
19:30:30 <clarkb> it's a little bit of work, but the cases where we've had to resort to it are few and it's probably worth keeping our images as simple as possible without user passwords?
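(The rescue workflow described above maps to openstackclient commands roughly like the following:)
```bash
# Boot a rescue image with the broken instance's disk attached as a
# secondary device, inspect/repair it, then switch back.
openstack server rescue <server>
# ... log in to the rescue instance, mount and fix the partitions ...
openstack server unrescue <server>
```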
19:30:36 <frickler> except for boot-from-volume instances, which seem to be a bit more tricky?
19:30:41 <ianw> have we ever done that with vexxhost?
19:30:45 <clarkb> frickler: oh is bfv different?
19:31:10 <frickler> at least it needs a recent compute api (>=ussuri iirc)
19:31:12 <clarkb> ianw: I'm not sure about doing it specifically in vexxhost. Testing it is a good idea I suppose before I/we declare it is good enough
19:31:41 <clarkb> my concern with passwords on instances is that we don't have central auth so rotating/changing/managing them is more difficult
19:31:45 <corvus> i love not having local passwords.  i hope it is good enough.
19:31:59 <clarkb> ya I'd much rather avoid it if possible
19:32:12 <frickler> I was also wondering why we chose boot from volume, was that intentional?
19:32:21 <clarkb> I've made a note to myself to test instance rescue in vexxhost. Both bfv and not
19:32:50 <corvus> i have a vague memory that it might be a flavor requirement, but i'm not sure
19:33:01 <ianw> frickler: i'd have to go back and research, but i feel like it was a requirement of vexxhost
19:33:02 <clarkb> yes, I think at the time the current set of flavors had no disk
19:33:13 <clarkb> their latest flavors do have disk and can be booted without bfv
19:33:27 <ianw> heh, i think that's three vague memories, maybe that makes 1 real one :)
19:33:34 <clarkb> I booted gitea-lb02 without bfv (but uploaded the jammy image to vexxhost as raw allowing us to do bfv as well)
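(For reference, converting and uploading an image as raw so it can be used for boot-from-volume looks roughly like this; file and image names are illustrative:)
```bash
# Convert the qcow2 cloud image to raw, then upload it to glance.
qemu-img convert -f qcow2 -O raw jammy-server-cloudimg-amd64.img jammy.raw
openstack image create --disk-format raw --container-format bare \
  --file jammy.raw ubuntu-jammy-raw
```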
19:34:10 <clarkb> #action clarkb test vexxhost instance rescues
19:34:25 <clarkb> why don't we do that and come back to the idea of passwords for recovery once we know if ^ works
19:34:29 <clarkb> anything else on this subject?
19:34:35 <ianw> ++
19:34:40 <frickler> ack
19:35:01 <clarkb> also I'll use throwaway test instances not anything prod like  :)
19:35:16 <clarkb> #topic OpenSSL v3
19:35:52 <clarkb> As everyone is probably aware, openssl v3 had a big security release today. It turned out to be a bit less scary than the initial CRITICAL rating led everyone to believe (it was downgraded to high)
19:36:07 <clarkb> Since all but two of our servers are too old to have openssl v3 we are largely unaffected
19:36:24 <clarkb> all in all the impact is far more limited than feared which is great
19:36:41 <clarkb> Also Ubuntu seems to think the way they compile openssl with stack protections mitigates the RCE and this is only a DoS
19:37:58 <clarkb> #topic Upgrading Zookeeper
19:38:28 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/863089
19:38:38 <clarkb> I would like to upgrade zookeeper tomorrow
19:39:01 <clarkb> at first I thought that we could just let automation do it (which is still likely fine) but all the docs I can find suggest upgrading the leader, which our automation isn't aware of
19:39:41 <clarkb> That means my plan is to stop ansible via the emergency file on zk04-zk06 and do them one by one. Followers first then the leader (currently zk05). Then merge that change and finally remove the hosts from the emergency file
19:39:58 <clarkb> if I could get reviews on the change and any concerns for that plan I'd appreciate it.
19:40:13 <clarkb> That said it seems like zookeeper upgrades if you go release to release are meant to be uneventful
19:40:14 <corvus> (upgrading the leader last i think you missed a word)
19:40:24 <clarkb> yup leader last I mean
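(To confirm which node is currently the leader before upgrading, the ZooKeeper four-letter commands can be used, assuming srvr is in the 4lw whitelist:)
```bash
# Run against each of zk04-zk06; the Mode line reports leader or follower.
echo srvr | nc localhost 2181 | grep Mode
```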
19:40:53 <frickler> the plan sounds fine to me and I'll try to review by your morning
19:41:16 <corvus> i'll be around to help
19:41:19 <clarkb> thanks!
19:41:54 <clarkb> #topic Gitea Rebuild
19:42:09 <clarkb> There are golang compiler updates today as well and it seems worthwhile to rebuild gitea under them
19:42:19 <clarkb> I'll have that change up as soon as the meeting ends
19:42:38 <clarkb> I should be able to monitor that change as it lands and gets deployed today. But we should coordinate that with the bridge reboot
19:43:19 <clarkb> #topic Open Discussion
19:43:19 <ianw> ++
19:44:06 <clarkb> It is probably worth mentioning that gitea as an upstream is going through a bit of a rough time. Their community has disagreements over the handling of trademarks and some individuals have talked about forking
19:44:27 <ianw> :/
19:44:38 <clarkb> I've been trying to follow along as well as I can to understand any potential impact to us and I'm not sure we're at a point where we need to take a stance or plan to change anything
19:44:54 <clarkb> but it is possible that we'll be in that position whether or not we like it in the future
19:44:59 <ianw> on the zuul-sphinx bug that started occurring with the latest sphinx -- might need to think about how it handles including files per https://sourceforge.net/p/docutils/bugs/459/
19:49:49 <clarkb> Sounds like that may be it?
19:50:06 <clarkb> Everyone can have 10 minutes for breakfast/lunch/dinner/sleep :)
19:50:20 <clarkb> thank you all for your time and we'll be back here same time and location next week
19:50:23 <clarkb> #endmeeting