19:01:10 <clarkb> #startmeeting infra
19:01:10 <opendevmeet> Meeting started Tue Aug 10 19:01:10 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <opendevmeet> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000273.html Our Agenda
19:01:28 <clarkb> #topic Announcements
19:01:40 <clarkb> I had none. Let's just jump right into the meeting proper
19:01:45 <clarkb> #topic Actions from last meeting
19:01:50 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-03-19.01.txt minutes from last meeting
19:02:04 <clarkb> I did manage to get around to writing up the start of a prometheus spec yesterday and today
19:02:09 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus spec
19:02:27 <clarkb> This is still quite high level as I haven't run one locally but did read a fair bit of documentation yesterday
19:02:47 <clarkb> I think in this case we don't need to have a bunch of specifics sorted out early because we can run this side by side with cacti while we sort it out and make it do what we want
19:03:09 <corvus> ++
19:03:11 <fungi> yeah, i'm in favor of feeling it out once it's running
19:03:12 <clarkb> Then as noted in the spec we can run it for a month or so and compare data between cacti and prometheus before shutting down cacti
19:03:34 <corvus> i will review the spec asap
19:03:51 <clarkb> I think it captures the important bits, but I'm happy for feedback and will update it appropriately
19:03:59 <fungi> i need to read it still, was there any treatment of how we might import our historical data, or just keep the old graphs around?
19:04:11 <clarkb> fungi: no, I think that is well beyond the scope of that spec
19:04:18 <fungi> got it, thanks
19:04:24 <clarkb> you'd need to write an rrd to tsdb conversion tool which may exist?
19:05:01 <clarkb> https://groups.google.com/g/opentsdb/c/H7t-WPY11Ro
19:05:04 <fungi> yeah, or may be as simple as plugging a couple of python libraries into one another
19:05:16 <clarkb> if someone wants to work on that during that side by side period it should definitely be possible
19:05:22 <clarkb> but I'm not sure it is critical?
19:05:37 <fungi> right, it's something else we'll want to figure out as a group
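As a rough illustration of the rrd-to-tsdb conversion floated above, a minimal sketch follows. It assumes the rrdtool CLI is installed and that the target store exposes an OpenTSDB-style /api/put endpoint (the thread linked above concerns OpenTSDB; some Prometheus-compatible stores accept the same format). The RRD path, data source name, metric name and endpoint URL are illustrative, not anything in opendev's configs.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an RRD -> TSDB import.

The exact JSON layout of "rrdtool xport" varies slightly across rrdtool
versions; this assumes a modern release with --json support.
"""
import json
import subprocess

import requests

RRD_FILE = "/var/lib/cacti/rra/review_load_1min.rrd"  # illustrative path
TSDB_PUT = "http://localhost:4242/api/put"            # illustrative endpoint
METRIC = "cacti.load.1min"

# Export the archive as JSON; xport emits one row of values per step.
out = subprocess.run(
    ["rrdtool", "xport", "--json",
     "DEF:v={}:ds0:AVERAGE".format(RRD_FILE),  # "ds0" is an assumed DS name
     "XPORT:v"],
    check=True, capture_output=True, text=True,
).stdout
export = json.loads(out)

start = export["meta"]["start"]
step = export["meta"]["step"]

points = []
for i, row in enumerate(export["data"]):
    value = row[0]
    if value is None:  # RRD gaps come through as nulls
        continue
    points.append({
        "metric": METRIC,
        "timestamp": start + i * step,
        "value": value,
        "tags": {"host": "review.opendev.org"},
    })

# OpenTSDB's /api/put accepts a JSON list of datapoints.
requests.post(TSDB_PUT, json=points).raise_for_status()
print("imported %d datapoints" % len(points))
```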
19:05:38 <corvus> i'd vote for just keeping cacti around for many months until we don't care
19:05:54 <clarkb> corvus: ya that was sort of what I was thinking
19:05:57 <fungi> certainly one option
19:06:12 <clarkb> basically keep cacti around to ensure the data we have in prometheus is at least as accurate as cacti then when ready delete cacti
19:06:23 <clarkb> the spec says we can do that after a month but happy to update that to be more flexible
19:06:35 <fungi> depends on how much we value being able to compare against older trending (and how much older)
19:06:36 <corvus> if there's a security reason we can't keep cacti up, we could still keep it around but firewall it
19:06:55 <corvus> so that if we need to look at old data, it's possible (if not easy)
19:07:05 <fungi> anyway, all things we can hash out later
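For the side-by-side period discussed above, a spot check of one series from each system might look like the sketch below. The Prometheus URL, node_exporter-style metric name and RRD path are assumptions for illustration only.

```python
#!/usr/bin/env python3
"""Rough spot-check sketch: compare a day of data from both systems."""
import subprocess
import time

import requests

PROM = "http://prometheus.example.org:9090"                # illustrative
QUERY = "node_load5{instance='review.opendev.org:9100'}"   # illustrative
RRD = "/var/lib/cacti/rra/review_load_5min.rrd"            # illustrative

end = int(time.time())
start = end - 86400

# Prometheus side: range query over the HTTP API.
resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "300"},
)
resp.raise_for_status()
prom_values = [float(v) for _, v in resp.json()["data"]["result"][0]["values"]]

# Cacti side: "rrdtool fetch" prints a header, a blank line, then
# "timestamp: value" rows (nan for gaps).
fetch = subprocess.run(
    ["rrdtool", "fetch", RRD, "AVERAGE", "-s", str(start), "-e", str(end)],
    check=True, capture_output=True, text=True,
).stdout
rrd_values = []
for line in fetch.splitlines()[2:]:
    _, _, raw = line.partition(":")
    raw = raw.strip()
    if raw and "nan" not in raw:
        rrd_values.append(float(raw))

print("prometheus mean: %.2f" % (sum(prom_values) / len(prom_values)))
print("cacti mean:      %.2f" % (sum(rrd_values) / len(rrd_values)))
```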
19:07:24 <clarkb> #topic Topics
19:07:33 <clarkb> Ya lets hash it out in the spec review :)
19:07:41 <clarkb> #topic Service Coordinator Election
19:07:58 <clarkb> The end of today UTC time is the end of the service coordinator nomination period
19:08:07 <clarkb> I've not seen anyone volunteer yet :P
19:08:25 <clarkb> I'll keep doing it if no one else wants to do it, but definitely think someone else should do it
19:09:14 <clarkb> Anyway this is your reminder of that deadline. Please do volunteer if you are interested
19:09:23 <fungi> i can volunteer if you really need to step down, but i'm not sure another openinfra foundation staff member is a better choice. as things are, it's a struggle to explain that opendev is a community rather than a service run by the foundation
19:09:55 <fungi> (hard to explain that to the rest of the foundation staff as much as to the public)
19:10:10 <clarkb> I think for me it would be nice to be able to focus on more of the technical details of upgrading services and running new services, etc. But I agree that is also a struggle
19:10:35 <clarkb> and I think having someone else do it can be good for a shift in approach/perspective
19:10:51 <fungi> from a sustainability perspective, it would be nice to have an option other than foundation employees
19:12:19 <clarkb> #topic Review Upgrades
19:13:02 <clarkb> I believe old server cleanups have happened. Thank you ianw again for doing a bunch of the work on this
19:13:11 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803374 Clean up old mysql gerrit stuff
19:13:19 <ianw> yep all done
19:13:26 <clarkb> That removes the mysql connector from our images as well as support for h2 and mysql from the gerrit role in system-config
19:13:41 <clarkb> at this point I think we are good to move forward on landing that as there haven't been problems with prod since the mariadb switch
19:13:42 <fungi> neatly wrapped up!
19:13:54 <fungi> i agree
19:14:37 <ianw> the only thing left on the cleanup list is "decide on sshfp records"
19:15:10 <clarkb> our options are to have no sshfp records or only do port 29418 sshd records on review.o.o and port 22 on review02.o.o ?
19:15:23 <ianw> personally i think we generally want to access ssh on port 22 & 29418  @ review.opendev.org so that is in conflict with choosing one for sshfp records
19:15:29 <clarkb> fwiw I've been trying to train myself to ssh to the actual host fqdn when using port 22 and use review.o.o for 29418
19:15:29 <fungi> i'm okay leaving it as-is, but it's inconsistent with how we handle sshfp records for admin access to our other servers
19:16:04 <clarkb> but ya I'm not doing any sshfp verification from my client as far as I know
19:16:14 <clarkb> I'm happy to leave it as is with the comment in the zone file about why this host is different
19:16:27 <fungi> on the other hand, if we do have a review02.opendev.org-only sshfp record then it wouldn't directly conflict with anything, we'd just need to separate the address records and not use a cname for that
19:17:09 <ianw> at the time i was thinking also things like zuul want review02 as the ssh target
19:17:16 <ianw> but that turned out to not work so well
19:17:30 <ianw> (gerrit ssh port target i mean)
19:17:51 <fungi> another option would be to switch openssh to using the same host key as the gerrit service, it's the only service running there, and so i'm not super concerned that someone might get ahold of the api hostkey and use that to take control of the underlying operating system, if they get that first bit then the whole server is already sunk really
19:18:33 <fungi> it's not as if there's anything else to protect which the gerrit service doesn't have access to
19:18:36 <clarkb> that is an interesting idea. I hadn't considered that before. It would make distinguishing between gerrit hosts a bit more fuzzy, but would simplify sshfp records
19:19:15 <fungi> yeah, i guess it's the transitional gerrit server replacement period when there are two running which is the real issue
19:19:17 <ianw> hrm, i'm not sure we have any ansible logic for writing out host keys on base servers though
19:19:21 <clarkb> I don't feel strongly about any of the options fwiw. I'm happy with the current situation but have also started trying to train myself when ssh'ing to use the actual host fqdn which falls in line with the old sshfp setup
19:19:27 <clarkb> ianw: ya we don't
19:20:23 <fungi> right, my takeaway is that all the solutions are fairly complex and have their own distinct downsides, so i'm good with the option requiring the least work (that is, to be clear, just leaving it how it's configured now)
19:20:33 <ianw> i think we're all ok with no records and a comment why, which is the status quo
19:21:12 <ianw> all right, decided.  i'll cross that off the list and so other than that cleanup change, i think this is done!
19:21:14 <fungi> the split record solution was elegant enough until we had to reason about server replacements
19:21:25 <fungi> thanks!
19:21:27 <clarkb> ianw: ++ we can always reevaluate if some reason to have the sshfp records pops up
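Should SSHFP records be revisited later, the record lines themselves can be generated from a host's public keys with `ssh-keygen -r`. A minimal sketch, with an illustrative hostname and key path (nothing here reflects current opendev zone content):

```python
#!/usr/bin/env python3
"""Minimal sketch of generating SSHFP zone records."""
import subprocess


def sshfp_records(name: str, pubkey: str) -> list[str]:
    """Return SSHFP resource record lines for one public key file."""
    out = subprocess.run(
        ["ssh-keygen", "-r", name, "-f", pubkey],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()


for line in sshfp_records("review02.opendev.org",
                          "/etc/ssh/ssh_host_ed25519_key.pub"):
    print(line)

# Clients only consult SSHFP records when told to, e.g.:
#   ssh -o VerifyHostKeyDNS=yes review02.opendev.org
```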
19:21:49 <clarkb> #topic Project Renames
19:21:59 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803992 Accommodate zuul's new zk key management system
19:22:17 <clarkb> I've pushed that up with a depends-on to handle the future zuul state where it doesn't implicitly back up things to disk
19:22:56 <clarkb> The other thing we had on the todo list was updating the docs to handle the edits we made to the etherpad compared to the documented process
19:23:04 <clarkb> has anyone started on that change yet?
19:23:27 <fungi> also we discovered that accepting the inability to run zuul jobs on rename changes makes it hard to spot when you've caught all the remaining tentacles. we ended up merging two fixes (i think it was two?) where the old project name was referenced
19:23:57 <clarkb> yup, I think part of the doc updates should be splitting those changes up so that we can review them with more CI testing upfront
19:24:28 <fungi> i agree, but last time this came up we couldn't agree on where/how to split them so we wound up just keeping it all squashed
19:24:34 <clarkb> ya it's a bit of a pain iirc
19:25:08 <clarkb> I was thinking we could do an "add everything but don't remove old stuff" change for things like acls etc
19:25:16 <fungi> also no i haven't yet written any process changes based on the notes in the pad
19:25:24 <clarkb> then we can safely land that first and then land a cleanup that does the actual rename?
19:25:29 <fungi> #link https://etherpad.opendev.org/p/project-renames-2021-07-30 The maintenance plan we followed
19:25:40 <clarkb> fungi: ok, I can probably look at that this week.
19:25:46 <clarkb> that == writing the docs update change
19:26:19 <fungi> i may get to it if you don't. i think a lot of it is going to be deletions anyway
19:26:19 <clarkb> Then we can delete this from the agenda along with the review upgrade topic :)
19:26:27 <clarkb> fungi: thanks
19:27:00 <fungi> i guess it's step #5 there which will need some consideration
19:27:16 <fungi> well, and step #1
19:28:15 <fungi> also is there anything about how zuul handles configuration we can improve to make this easier, or which we can take advantage of (run a config check on the altered config in the check pipeline?)
19:28:36 <clarkb> fungi: the problem is that zuul in prod is verifying its own config against the config changes
19:28:51 <clarkb> fungi: we could run a testing zuul to validate things but those jobs won't even run due to the config errors in the proposal
19:29:47 <fungi> well, it isn't going to speculatively apply the change anyway, the refusal to enqueue is a safeguard
19:30:17 <fungi> maybe there's an option we could add to bypass that safety check in a yes-i-know-this-doesn't-make-sense kind of way?
19:30:32 <clarkb> something like that would work for acl verification at least
19:30:45 <clarkb> basically where we do out of band validation
19:31:25 <fungi> or post zuul v5 maybe some support for actual repository renames in zuul, where it can reason about such things... but that's likely to be a significant undertaking
19:32:16 <clarkb> ya something to bring up with the zuul maintainers I suspect
19:32:29 <clarkb> Lets continue on. We can hash out our options while writing and reviewing the docs updates
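The out-of-band validation mentioned above could be as small as a script run against a rename change's checkout, reporting files that still mention the old project name. A sketch, with illustrative names and directory layout (this is not an existing opendev tool):

```python
#!/usr/bin/env python3
"""Sketch of an out-of-band check for leftover references after a rename."""
import pathlib
import sys

OLD = "openstack/old-project-name"   # illustrative
NEW = "opendev/new-project-name"     # illustrative
ROOT = pathlib.Path("project-config")

problems = []
for path in ROOT.rglob("*"):
    if not path.is_file() or ".git" in path.parts:
        continue
    # Ignore undecodable bytes so binary files don't abort the scan.
    if OLD in path.read_text(errors="ignore"):
        problems.append(path)

if problems:
    print("old name %s still referenced in:" % OLD)
    for path in problems:
        print("  %s" % path)
    sys.exit(1)
print("no remaining references; %s looks fully renamed to %s" % (OLD, NEW))
```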
19:32:46 <clarkb> #topic Matrix Homeserver and bots
19:33:11 <clarkb> tristanC's prometheus metrics show that gerritbot loses connectivity to review.opendev.org reliably every hour
19:33:42 <clarkb> Sorting that out is probably a good idea, though possibly not critical to  zuul using the service
19:33:59 <fungi> that's affecting our production gerrit, or a test instance?
19:34:02 <clarkb> We also got billed for the homeserver in the expected amount which means that aspect is working without surprises (a very good thing)
19:34:04 <corvus> tristanC: are you working on that?
19:34:16 <clarkb> fungi: our production gerrit
19:34:20 <fungi> er, production gerritbot (the irc-connected one)?
19:34:26 <clarkb> fungi: aiui yes
19:34:29 <fungi> neat
19:34:33 <clarkb> oh sorry no
19:34:38 <clarkb> the production matrix gerritbot
19:34:40 <corvus> it's affected the irc one too?
19:34:46 <corvus> didn't think so
19:34:50 <clarkb> I don't have any evidence that it is affecting the irc gerritbot
19:35:00 <fungi> well, that's what i'm wondering. if the gerrit connection code is all the same then it could i suppose
19:35:00 <ianw> is it reliably at the same time every hour, or reliably once an hour?
19:35:18 <clarkb> ianw: same time every hour according to the prometheus graph I saw
19:35:43 <clarkb> fungi: it's completely different. irc gerritbot uses paramiko iirc and matrix gerritbot uses libssh2 in haskell
19:35:44 <ianw> i do seem to remember rewriting/fixing the gerritbot reconnect logic at some point
19:35:56 <ianw> it might be hiding any drops
19:36:14 <clarkb> I'm calling it out because it may lead to service impacts for zuul to use the matrix bot
19:37:08 <corvus> clarkb: do you know if tristanC is working on a fix?
19:37:26 <corvus> (i'm unaware of any previous discussion about this -- it's the first time i'm hearing of it)
19:37:27 <clarkb> corvus: I do not know. It was mentioned over the weekend and I don't know if anyone including tristanC is looking into it further
19:38:01 <clarkb> https://matrix-client.matrix.org/_matrix/media/r0/download/matrix.org/TIjNHQWUwHJlwgOpLbQRMYdN was what tristanC shared on Sunday (relative to me)
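The kind of metric behind a graph like that can be exposed from a bot process with a small gauge. The actual matrix gerritbot is Haskell, so the following only illustrates the idea with prometheus_client; metric and label names are made up.

```python
#!/usr/bin/env python3
"""Illustration of exposing a connection-status metric for a gerrit bot."""
import time

from prometheus_client import Gauge, start_http_server

connected = Gauge(
    "gerritbot_ssh_connected",
    "1 while the gerrit stream-events connection is up, 0 otherwise",
    ["gerrit"],
)


def run_stream_events():
    """Placeholder for the real connect/read loop."""
    while True:
        try:
            connected.labels(gerrit="review.opendev.org").set(1)
            # ... read events until the connection drops ...
            time.sleep(60)
            raise ConnectionError("simulated drop")
        except ConnectionError:
            connected.labels(gerrit="review.opendev.org").set(0)
            time.sleep(5)  # back off, then reconnect


if __name__ == "__main__":
    start_http_server(9100)  # scrape endpoint; port illustrative
    run_stream_events()
```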
19:39:29 <corvus> is there some discussion somewhere?
19:40:08 <corvus> i can't find anything in #opendev eavesdrop logs
19:40:10 <clarkb> corvus: it was in #opendev on oftc from ~2100UTC Sunday to early Monday
19:40:56 <clarkb> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-08.log.html#t2021-08-08T21:29:52 and https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:09
19:41:02 <corvus> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:09 looks relevant
19:41:05 <clarkb> I haven't seen mention of it since
19:41:54 <corvus> okay, well, i was hoping to get the 'all clear' at this meeting to move zuul over, but it doesn't seem like we're there
19:42:20 <corvus> tristanC: can you please provide an update (if you're not here now, maybe over in #opendev when you are around) on the impact of this issue and if you're addressing it?
19:42:32 <clarkb> I think I'm happy for Zuul to use it as is. It would be up to Zuul if they are ok with the connection error issue being sorted out concurrently with the move
19:43:04 <clarkb> Billing was my last major concern before moving (I didn't want zuul to move then us get a large unexpected bill and have to move quickly to something else for example)
19:43:04 <corvus> clarkb: i don't feel like i have enough info to make that decision -- like -- how much stream-events time does gerritbot miss?
19:43:19 <clarkb> corvus: ya getting more info makes sense
19:45:16 <clarkb> Why don't we follow up with tristanC on that then? If keepalives fixed it Zuul can proceed, otherwise dig in more and make a decision. But I think from OpenDev's perspective it's largely up to Zuul's level of comfort with starting to actually use the services
19:46:21 <clarkb> Anything else to bring up on this subject?
19:46:24 <fungi> that jibes with my reckoning
19:46:30 <corvus> that's it
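If the hourly drop turns out to be an idle timeout, SSH-layer keepalives are the usual fix. The IRC gerritbot uses paramiko, where keepalives look roughly like the sketch below; the matrix bot uses libssh2 in Haskell, so this only illustrates the concept. Username and key path are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of SSH keepalives on a gerrit stream-events connection."""
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # or load known_hosts
client.connect(
    "review.opendev.org",
    port=29418,
    username="gerritbot",                    # illustrative account
    key_filename="/home/gerritbot/.ssh/id_rsa",
)

# Send an SSH-layer keepalive every 60 seconds so an idle stream-events
# connection is not silently dropped by intermediate devices or timeouts.
client.get_transport().set_keepalive(60)

stdin, stdout, stderr = client.exec_command("gerrit stream-events")
for line in stdout:
    print(line.rstrip())
```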
19:46:49 <clarkb> #topic Gitea Backups
19:47:13 <clarkb> We got an email saying the lists backup failed as well. I was worried that it may be suffering the same issue now but it only happened the once
19:47:24 <clarkb> I suspect that was "normal" internet flakiness rather than the persistent variety
19:47:33 <clarkb> ianw: did an email get sent about this yet?
19:49:15 <ianw> ahh, no sorry
19:49:45 <clarkb> Alright, considering the lists issue hasn't persisted I think that is all for this topic
19:49:52 <clarkb> #topic Gitea 1.15.0 upgrade
19:50:22 <clarkb> Thank you everyone for helping to review and land the prep changes for this work. We are no longer using hacky UI interactions via http and instead use the REST api for all gitea project management updates
19:50:44 <clarkb> The latest gitea 1.15.0-rc3 release seems to work fine in testing with the associated template updates and file moves
19:51:26 <clarkb> Upstream has a milestone set up, due on the 18th, for the 1.15.0 release and no outstanding bugs are listed. I expect the release will happen soon. Once it happens we can update my change, hold the nodes, and do direct verification that stuff works as expected
19:51:39 <clarkb> The other gotcha is that the hosting of the logos changes and the paths move
19:51:45 <clarkb> this will impact review and paste's theming
19:52:11 <clarkb> If anyone has time to host those logos on static or with each service that uses them that might be a good idea
19:52:18 <fungi> we haven't merged any project additions to exercise the new api interactions in production, as far as anyone knows?
19:52:22 <clarkb> then we aren't updating a bunch of random stuff when our hacked up gitea theming changes
19:52:32 <clarkb> fungi: ya I don't know of any new project creations since
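For context on the REST interactions mentioned above, project creation against a gitea backend boils down to a call like the one below (gitea's v1 API). The host, org, repository name and token are illustrative; this is not the actual system-config code.

```python
#!/usr/bin/env python3
"""Sketch of creating a repository via gitea's REST API."""
import requests

GITEA = "https://gitea01.opendev.org:3000"  # illustrative backend URL
TOKEN = "REPLACE_ME"                        # illustrative API token

resp = requests.post(
    f"{GITEA}/api/v1/orgs/opendev/repos",
    headers={"Authorization": f"token {TOKEN}"},
    json={"name": "sandbox", "description": "Example project"},
)
resp.raise_for_status()
print("created", resp.json()["full_name"])
```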
19:53:07 <ianw> ah i can make a static logo location
19:53:07 <fungi> i think baking the logos into each image/deploying them to each server is probably the safest so we don't have unnecessary cross-site hosting
19:53:32 <fungi> but keeping them in a single place in system-config (or some repo) would be good so we don't have duplicates in git
19:54:18 <clarkb> I hadn't considered that concern. It seems to be working now at least, but preventing future problems seems like a good thing
19:54:40 <clarkb> We can definitely coordinate the 1.15.0 gitea update around making sure we're happy with logo hosting
19:54:49 <clarkb> While it would be nice to update early we don't need to
19:55:30 <clarkb> Almost out of time so lets move on here
19:55:35 <clarkb> #topic Mailman Ansible and Upgrades
19:55:41 <clarkb> The newlist fix landed
19:55:51 <clarkb> I don't know of any new lists being created since, so keep an eye out when that happens
19:56:14 <clarkb> I have not had time to snapshot the lists.kc.io server yet for in-place server upgrade testing but hope that it will happen this week
19:56:23 <clarkb> #topic Open Discussion
19:56:27 <clarkb> Anything else?
19:56:55 <fungi> i've got nothing
19:57:03 <clarkb> Rico Lin reached out to fungi and me about doing a presentation about OpenDev for Open Infra Days Asia 2021. This is happening in a month and we have ~3 weeks to put together a recorded talk. I'd like to give it a go, but am balancing that with everything else
19:57:18 <ianw> i'm trying to get to the bottom of debian-stable
19:57:24 <ianw> https://review.opendev.org/q/topic:%22debian-stretch-rm%22+(status:open%20OR%20status:merged)
19:57:39 <clarkb> Mentioning it in case anyone is interested in helping put that together. I've been told that one of the easiest ways to do a recording like that is to have a recorded conference call where you present the material either to an empty call or to your co-presenters
19:57:53 <fungi> ianw: not sure if you saw, but jrosser was in favor of bypassing ci to merge the removals from murano-dashboard
19:58:13 <ianw> fungi: oh, no missed that but that seems good
19:58:51 <fungi> clarkb: yeah, i expect we could talk through some slides on jitsi-meet and then someone could record it locally from their browser
19:58:58 <fungi> the more the merrier on that
20:00:05 <clarkb> And we are at time
20:00:06 <fungi> thanks clarkb!
20:00:08 <clarkb> Thank you everyone!
20:00:11 <clarkb> #endmeeting