Tuesday, 2022-11-01

-opendevstatus- NOTICE: review.opendev.org (Gerrit) is currently down, we are working to restore service as soon as possible07:30
-opendevstatus- NOTICE: review.opendev.org (Gerrit) is back online14:25
clarkbalmost meeting time. We'll get started shortly19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Nov  1 19:01:41 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-October/000376.html Our Agenda19:01
clarkb#topic Announcements19:02
clarkbThere were no announcements so we can dive right in19:02
clarkb#topic Topics19:02
clarkb#topic Bastion Host Updates19:02
clarkb#link https://review.opendev.org/q/topic:prod-bastion-group19:02
clarkb#link https://review.opendev.org/q/topic:bridge-ansible-venv19:03
clarkbare a couple groups of changes to keep moving this along19:03
* frickler should finally review some of those19:03
clarkbfrickler also discovered that the secrets management key is missing on the new host. Something that should be migrated over and tested before we remove the old one19:03
clarkbbut I think we're really close to being able to finish this up. ianw if you are around anything else to add?19:04
fricklerwe should also agree when to move editing those from one host to the other19:04
ianwo/19:04
clarkb++ at this point I would probably say anything that can't be done on the new host is a bug and we should fix that as quickly as possible and use the new host19:04
ianwyes please move over anything from your home directories, etc. that you want19:05
ianwi've added a note on the secret key to 19:05
ianw#link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-1019:05
ianwthanks for that -- i will be writing something up on that19:06
clarkbI also need to review the virtualenv management change since that will ensure we have a working openstackclient for rax and others19:06
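For context, the venv approach boils down to something like this on the bastion (a sketch only; the real paths and package set live in the bridge-ansible-venv changes):
  # path and cloud name are illustrative, not the actual values from the change
  python3 -m venv /opt/osc-venv
  /opt/osc-venv/bin/pip install python-openstackclient
  /opt/osc-venv/bin/openstack --os-cloud <cloud> server list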
ianwyeah a couple of changes are out there just to clean up some final things19:07
clarkbalso the zuul reboot playbook ran successfully off the new bridge19:07
fricklerianw: are you o.k. with rebooting bridge01 after the openssl updates or is there some blocker for that?19:08
frickler(I ran the apt update earlier already)19:08
clarkbone thing to consider when doing ^ is if we have any infra prod jobs that we don't want to conflict with19:08
clarkbbut I'm not aware of any urgent jobs at the moment19:08
ianwthe gist is I think that we can get testing to use "bridge99.opendev.org" -- which is a nice proof that we're not hard-coding in references19:08
ianwi think it's fine to reboot -- sorry i've been out last two days and not caught up but i can babysit it soonish19:09
clarkbsounds good we can coordinate further after the meeting.19:09
clarkbAnything else on this topic?19:09
ianwnope, thanks for the reviews and getting it this far!19:10
clarkb#topic Upgrading Bionic Servers19:10
clarkbat this point I think we've largely sorted out the jammy related issues and we should be good to boot just about anything on jammy19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/862835/ Disable phased package updates19:11
clarkbthat is one remaining item though. Basically it says don't do phased updates which will ensure that our jammy servers all get the same packages at the same time19:11
clarkbrather than staggering them over time. I'm concerned the staggering will just lead to confusion about whether or not a package is related to unexpected behaviors19:11
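Disabling phased updates comes down to an apt.conf.d snippet along these lines (a sketch; the file name and exact contents in 862835 may differ):
  # /etc/apt/apt.conf.d/95phased-updates (illustrative name)
  # always apply phased updates immediately instead of waiting for the rollout window
  APT::Get::Always-Include-Phased-Updates "true";
  Update-Manager::Always-Include-Phased-Updates "true";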
clarkbhttps://review.opendev.org/c/opendev/zone-opendev.org/+/862941 and its depends on are related to gitea-lb02 being brought up as a jammy node too (this is cleanup of old nodes)19:12
clarkbOtherwise nothing extra to say. Just that we can (and probably should) do this for new servers and replacing old servers with jammy is extra good too19:12
clarkbI'm hoping I'll have time later this week to replace another server (maybe one of the gitea backends)19:13
clarkb#topic Removing snapd19:13
clarkb#link https://review.opendev.org/c/opendev/system-config/+/862834 Change to remove snapd from our servers19:13
clarkbafter we discussed this in our last meeting I poked around on snapcraft and in ubuntu package repositories and I think there isn't much reason for us to have snapd installed on our servers19:13
clarkbThis change can and will affect a number of servers though so worth double checking. I haven't done an audit to see which would be affected but we could do that if we think it is necessary19:14
clarkbto do my filtering I looked for snaps maintained by canonical on snapcraft to see which ones were likely to be useful for us. And many of them continue to have actual packages or aren't useful to servers19:15
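The removal itself is roughly equivalent to running the following on each server (a sketch; the actual change does this through our Ansible base roles):
  # purge snapd and any seeded snaps, then keep apt from pulling it back in
  apt-get purge --autoremove -y snapd
  apt-mark hold snapd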
clarkbReviews very much welcome19:15
clarkb#topic Mailman 319:15
clarkbSince our last meeting the upstream for the mailman 3 docker images did land my change to add lynx to the images19:16
clarkbNo responses on the other issues I filed though.19:16
clarkbUnfortunately, I think this makes the question of whether or not we should fork more confusing, not easier. I'm leaning more towards forking at this point simply because I'm not sure how responsive upstream will be. But feedback there continues to be welcome19:16
clarkbWhen fungi is back we should make a decision and move forward19:18
clarkb#topic Updating base python docker images to use pip wheel19:18
clarkbUpstream seems to be moving slowly on my bugfix PR. Some of that slowness is because changes to git happened at the same time and impacted their CI setup19:18
clarkbEither way I think we should strongly consider their suggestion of using pip wheel though19:18
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86215219:18
*** diablo_rojo_phone is now known as Guest18219:19
clarkbThere is a nodepool and a dib change too which help illustrate that this change functions and doesn't regress features like siblings installs19:19
clarkbIt should be a noop for us, but makes us more resilient to pip changes if/when they happen in the future. Reviews very much welcome on this as well19:19
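The suggested pattern is roughly: build wheels explicitly, then install only from that wheel directory (a sketch; 862152 has the real image build steps):
  # build stage: produce wheels for the project and all of its dependencies
  pip wheel --wheel-dir /output/wheels .
  # final stage: install from the pre-built wheels without hitting the network
  pip install --no-index --find-links /output/wheels <project>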
clarkb#topic Etherpad service logging19:20
clarkbianw: did you have time to write the change to update etherpad logging to syslog yet?19:20
ianwoh no, sorry, totally distracted on that19:21
ianwwill do19:21
clarkbthanks19:22
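One way to do this is Docker's syslog logging driver in the compose file, along these lines (illustrative only; ianw's actual change may take a different approach):
  services:
    etherpad:
      logging:
        driver: syslog
        options:
          tag: etherpad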
clarkbunrelated to the logging issue I had to reboot etherpad after its db volume got remounted RO due to errors19:22
clarkbafter reboot it mounted the volume just fine as far as I could tell and things have been happy since yesterday19:22
clarkb(just a heads up I don't think any action is necessary there)19:22
clarkb#topic Unexpected Gerrit Reboot19:23
clarkbThis happened around 06:00 UTC ish today19:23
clarkbbasically looks like review02.o.o rebooted and when it came back it had no networking until ~13:00 UTC19:23
clarkbwe suspect something on the cloud side which would explain the lack of networking for some time as well. But we haven't heard back on that yet19:23
fricklerdo we have some support contact at vexxhost other than mnaser__ ?19:24
clarkbfrickler: mnaser__ has been the primary contact. There have been others in the past but I don't think they are at vexxhost anymore19:24
clarkbIf we think it is important I can ask if anyone at the foundation has contacts we might try19:25
fricklerthere's also the (likely unrelated) IPv6 routing issue, which I think is more important19:25
clarkbbut at this point things seem stable and we're mostly just interested in confirmation of our assumptions? Might be ok to wait a day19:25
ianwone thing i noticed was that corvus i think had to start the container?19:25
clarkbianw: yes, our docker-compose file doesn't specify a restart policy which mimics the old pre docker behavior of not starting automatically19:26
clarkbfrickler: re ipv6 thats a good point.19:26
fricklerregarding manual start we assumed that that was intentional and agreeable behavior19:27
corvusi did perform some sanity checks to make sure the server looked okay before starting19:27
corvus(which is one of the benefits of that)19:27
ianwsomething to think about -- but also this is the first case i can think of since we migrated the review host to vexxhost that there's been what seems to be instability beyond our control19:27
ianwso yeah, no need to make urgent changes to handle unscheduled reboots at this point :)19:28
clarkbconsidering that we seemed to lack network access anyway I'm not sure it's super important to auto restart based on this event19:28
clarkbwe would've waited either way19:28
fricklerthe other thing worth considering is whether we want to have some local account to allow debugging via vnc19:28
corvusbut i think honestly the main reason we didn't have it start on boot is so that if we stopped a service manually it didn't restart automatically.  that can be achieved with a "restart: unless-stopped" policy.  so really, there are 2 reasons not to start on boot, and we can evaluate whether we still like 1, the other, both, or none of them.19:28
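In compose terms that would look like the following (illustrative only; the service name is assumed and no such change has merged):
  services:
    gerrit:
      restart: unless-stopped  # restart after a crash or reboot, but not after an explicit stop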
fricklersince my first assumption was lack of network device caused by a kernel update19:29
clarkbfrickler: the way we would normally handle that today is via a rescue instance19:29
clarkbwhen you rescue an instance with nova it shuts down the instance then boots another image and attaches the broken instance to it as a device which allows you to mount the partitions19:30
clarkbit's a little bit of work, but the cases where we've had to resort to it are few and it's probably worth keeping our images as simple as possible without user passwords?19:30
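The rescue workflow is the standard nova one (assuming the cloud exposes it):
  openstack server rescue <server>    # boots a rescue image with the broken instance's disk attached
  # ...mount and inspect the instance's filesystem from the rescue OS, then...
  openstack server unrescue <server>  # boot the original instance again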
fricklerexcept for boot-from-volume instances, which seem to be a bit more tricky?19:30
ianwhave we ever done that with vexxhost?19:30
clarkbfrickler: oh is bfv different?19:30
fricklerat least it needs a recent compute api (>=ussuri iirc)19:31
clarkbianw: I'm not sure about doing it specifically in vexxhost. Testing it is a good idea I suppose before I/we declare it is good enough19:31
clarkbmy concern with passwords on instances is that we don't have central auth so rotating/changing/managing them is more difficult19:31
corvusi love not having local passwords.  i hope it is good enough.19:31
clarkbya I'd much rather avoid it if possible19:31
fricklerI was also wondering why we chose boot from volume, was that intentional?19:32
clarkbI've made a note to myself to test instance rescue in vexxhost. Both bfv and not19:32
corvusi have a vague memory that it might be a flavor requirement, but i'm not sure19:32
ianwfrickler: i'd have to go back and research, but i feel like it was a requirement of vexxhost19:33
clarkbyes, I think at the time the current set of flavors had no disk19:33
clarkbtheir latest flavors do have disk and can be booted without bfv19:33
ianwheh, i think that's three vague memories, maybe that makes 1 real one :)19:33
clarkbI booted gitea-lb02 without bfv (but uploaded the jammy image to vexxhost as raw allowing us to do bfv as well)19:33
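For the record, that amounts to roughly the following (a sketch with illustrative names; a raw image is what lets cinder build boot volumes from it efficiently):
  openstack image create --disk-format raw --file jammy.raw ubuntu-jammy
  openstack server create --image ubuntu-jammy --flavor <flavor> --key-name <key> <server-name>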
clarkb#action clarkb test vexxhost instance rescues19:34
clarkbwhy don't we do that and come back to the idea of passwords for recovery once we know if ^ works19:34
clarkbanything else on this subject?19:34
ianw++19:34
fricklerack19:34
clarkbalso I'll use throwaway test instances not anything prod like  :)19:35
clarkb#topic OpenSSL v319:35
clarkbAs everyone is probably aware, openssl v3 had a big security release today. It turned out to be a bit less scary than the CRITICAL label that was initially shared led everyone to believe (they downgraded it to high)19:35
clarkbSince all but two of our servers are too old to have openssl v3 we are largely unaffected19:36
clarkball in all the impact is far more limited than feared which is great19:36
clarkbAlso ubuntu seems to think the way they compile openssl with stack protections mitigates the RCE and this is only a DoS19:36
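A quick way to check a given server is to look for the v3 library (bionic and focal ship libssl1.1, only jammy has libssl3):
  openssl version        # 3.0.x is affected, 1.1.x is not
  dpkg-query -W libssl3  # fails if the v3 library package is not installed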
clarkb#topic Upgrading Zookeeper19:37
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86308919:38
clarkbI would like to upgrade zookeeper tomorrow19:38
clarkbat first I thought that we could just let automation do it (which is still likely fine) but all the docs I can find suggest upgrading the leader which our automation isn't aware of19:39
clarkbThat means my plan is to stop ansible via the emergency file on zk04-zk06 and do them one by one. Followers first then the leader (currently zk05). Then merge that change and finally remove the hosts from the emergency file19:39
clarkbif I could get reviews on the change and any concerns for that plan I'd appreciate it.19:39
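Finding the current leader is straightforward with the four-letter-word admin commands (assuming they are whitelisted on our cluster; hostnames are illustrative):
  # "Mode: leader" / "Mode: follower" identifies each node's role
  for zk in zk04 zk05 zk06; do echo -n "$zk: "; echo srvr | nc $zk.opendev.org 2181 | grep Mode; done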
clarkbThat said it seems like zookeeper upgrades if you go release to release are meant to be uneventful19:40
corvus(upgrading the leader last i think you missed a word)19:40
clarkbyup leader last I mean19:40
fricklerthe plan sounds fine to me and I'll try to review it by your morning19:40
corvusi'll be around to help19:41
clarkbthanks!19:41
clarkb#topic Gitea Rebuild19:41
clarkbThere are golang compiler updates today as well and it seems worthwhile to rebuild gitea under them19:42
clarkbI'll have that change up as soon as the meeting ends19:42
clarkbI should be able to monitor that change as it lands and gets deployed today. But we should coordinate that with the bridge reboot19:42
clarkb#topic Open Discussion19:43
ianw++19:43
clarkbIt is probably worth mentioning that gitea as an upstream is going through a bit of a rough time. Their community has disagreements over the handling of trademarks and some individuals have talked about forking19:44
ianw:/19:44
clarkbI've been trying to follow along as well as I can to understand any potential impact to us and I'm not sure we're at a point where we need to take a stance or plan to change anything19:44
clarkbbut it is possible that we'll be in that position whether or not we like it in the future19:44
ianwon the zuul-sphinx bug that started occurring with the latest sphinx -- might need to think about how that works including files per https://sourceforge.net/p/docutils/bugs/459/19:44
clarkbSounds like that may be it?19:49
clarkbEveryone can have 10 minutes for breakfast/lunch/dinner/sleep :)19:50
clarkbthank you all for your time and we'll be back here same time and location next week19:50
clarkb#endmeeting19:50
opendevmeetMeeting ended Tue Nov  1 19:50:23 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:50
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.html19:50
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.txt19:50
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.log.html19:50
*** Guest182 is now known as diablo_rojo_phone20:34
