19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Aug 10 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:17 #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000273.html Our Agenda
19:01:28 #topic Announcements
19:01:40 I had none. Let's just jump right into the meeting proper
19:01:45 #topic Actions from last meeting
19:01:50 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-03-19.01.txt minutes from last meeting
19:02:04 I did manage to get around to writing up the start of a prometheus spec yesterday and today
19:02:09 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus spec
19:02:27 This is still quite high level as I haven't run one locally but did read a fair bit of documentation yesterday
19:02:47 I think in this case we don't need to have a bunch of specifics sorted out early because we can run this side by side with cacti while we sort it out and make it do what we want
19:03:09 ++
19:03:11 yeah, i'm in favor of feeling it out once it's running
19:03:12 Then as noted in the spec we can run it for a month or so and compare data between cacti and prometheus before shutting down cacti
19:03:34 i will review the spec asap
19:03:51 I think it captures the important bits, but I'm happy for feedback and will update it appropriately
19:03:59 i need to read it still, was there any treatment of how we might import our historical data, or just keep the old graphs around?
19:04:11 fungi: no, I think that is well beyond the scope of that spec
19:04:18 got it, thanks
19:04:24 you'd need to write an rrd to tsdb conversion tool which may exist?
19:05:01 https://groups.google.com/g/opentsdb/c/H7t-WPY11Ro
19:05:04 yeah, or may be as simple as plugging a coupld or python libraries into one another
19:05:16 if someone wants to work on that during that side by side period it should definitely be possible
19:05:18 s/coupld or/couple of/
19:05:22 but I'm not sure it is critical?
19:05:37 right, it's something else we'll want to figure out as a group
19:05:38 i'd vote for just keeping cacti around for many months until we don't care
19:05:54 corvus: ya that was sort of what I was thinking
19:05:57 certainly one option
19:06:12 basically keep cacti around to ensure the data we have in prometheus is at least as accurate as cacti then when ready delete cacti
19:06:23 the spec says we can do that after a month but happy to update that to be more flexible
19:06:35 depends on how much we value being able to compare against older trending (and how much older)
19:06:36 if there's a security reason we can't keep cacti up, we could still keep it around but firewall it
19:06:55 so that if we need to look at old data, it's possible (if not easy)
19:07:05 anyway, all things we can hash out later
19:07:24 #topic Topics
19:07:33 Ya lets hash it out in the spec review :)
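(For reference on the rrd to tsdb conversion question above: one rough way to pull samples out of a cacti RRD and print them in a Prometheus-friendly text form is via the python rrdtool bindings. This is only a sketch; the file path and metric name below are made up for illustration.)

    import rrdtool  # python bindings for rrdtool, can read cacti's .rrd files

    # pull averaged samples out of an RRD file (path is illustrative)
    (start, end, step), ds_names, rows = rrdtool.fetch(
        "/var/lib/cacti/rra/review_load_1min.rrd", "AVERAGE")

    # print OpenMetrics-style lines that a backfill tool could ingest later;
    # the metric name is invented for this example
    ts = start
    for row in rows:
        for name, value in zip(ds_names, row):
            if value is not None:
                print(f'cacti_import{{ds="{name}"}} {value} {ts}')
        ts += step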
19:07:41 #topic Service Coordinator Election
19:07:58 The end of today UTC time is the end of the service coordinator nomination period
19:08:07 I've not seen anyone volunteer yet :P
19:08:25 I'll keep doing it if no one else wants to do it, but definitely think someone else should do it
19:09:14 Anyway this is your reminder of that deadline. Please do volunteer if you are interested
19:09:23 i can volunteer if you really need to step down, but i'm not sure another openinfra foundation staff member is a better choice. as things are, it's a struggle to explain that opendev is a community rather than a service run by the foundation
19:09:55 (hard to explain that to the rest of the foundation staff as much as to the public)
19:10:10 I think for me it would be nice to be able to focus on more of the technical details of upgrading services and running new services, etc. But I agree that is also a struggle
19:10:35 and I think having someone else do it can be good for a shift in approach/perspective
19:10:51 from a sustainability perspective, it would be nice to have an option other than foundation employees
19:12:19 #topic Review Upgrades
19:13:02 I believe old server cleanups have happened. Thank you ianw again for doing a bunch of the work on this
19:13:11 #link https://review.opendev.org/c/opendev/system-config/+/803374 Clean up old mysql gerrit stuff
19:13:19 yep all done
19:13:26 That removes the mysql connector from our images as well as support for h2 and mysql from the gerrit role in system-config
19:13:41 at this point I think we are good to move forward on landing that as there haven't been problems with prod since the mariadb switch
19:13:42 neatly wrapped up!
19:13:54 i agree
19:14:37 the only thing left on the cleanup list is "decide on sshfp records"
19:15:10 our options are to have no sshfp records or only do port 29418 sshd records on review.o.o and port 22 on review02.o.o ?
19:15:23 personally i think we generally want to access ssh on port 22 & 29418 @ review.opendev.org so that is in conflict with choosing one for sshfp records
19:15:29 fwiw I've been trying to train myself to ssh to the actual host fqdn when using port 22 and use review.o.o for 29418
19:15:29 i'm okay leaving it as-is, but it's inconsistent with how we handle sshfp records for admin access to our other servers
19:16:04 but ya I'm not doing any sshfp verification from my client as far as I know
19:16:14 I'm happy to leave it as is with the comment in the zone file about why this host is different
19:16:27 on the other hand, if we do have a review02.opendev.org-only sshfp record then it wouldn't directly conflict with anything, we'd just need to separate the address records and not use a cname for that
19:17:09 at the time i was thinking also things like zuul want review02 as the ssh target
19:17:16 but that turned out to not work so well
19:17:30 (gerrit ssh port target i mean)
19:17:51 another option would be to switch openssh to using the same host key as the gerrit service, it's the only service running there, and so i'm not super concerned that someone might get ahold of the api hostkey and use that to take control of the underlying operating system, if they get that first bit then the whole server is already sunk really
19:18:33 it's not as if there's anything else to protect which the gerrit service doesn't have access to
19:18:36 that is an interesting idea. I hadn't considered that before. It would make distinguishing between gerrit hosts a bit more fuzzy, but would simplify sshfp records
19:19:15 yeah, i guess it's the transitional gerrit server replacement period when there are two running which is the real issue
19:19:17 hrm, i'm not sure we have any ansible logic for writing out host keys on base servers though
19:19:21 I don't feel strongly about any of the options fwiw. I'm happy with the current situation but have also started trying to train myself when ssh'ing to use the actual host fqdn which falls in line with the old sshfp setup
19:19:27 ianw: ya we don't
19:20:23 right, my takeaway is that all the solutions are fairly complex and have their own distinct downsides, so i'm good with the option requiring the least work (that is, to be clear, just leaving it how it's configured now)
19:20:33 i think we're all ok with no records and a comment why, which is the status quo
19:21:12 all right, decided. i'll cross that off the list and so other than that cleanup change, i think this is done!
19:21:14 the split record solution was elegant enough until we had to reason about server replacements
19:21:25 thanks!
19:21:27 ianw: ++ we can always reevaluate if some reason to have the sshfp records pops up
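(Side note on the sshfp mechanics weighed above: the records themselves are easy to produce with stock OpenSSH. A minimal sketch wrapping the relevant command; the hostname comes from the discussion and the key path is the usual OpenSSH host key location.)

    import subprocess

    # emit an SSHFP resource record for one of the server's host keys,
    # suitable for pasting into the zone file (repeat for the other key types)
    record = subprocess.run(
        ["ssh-keygen", "-r", "review02.opendev.org",
         "-f", "/etc/ssh/ssh_host_ed25519_key.pub"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(record)

    # clients opt in to checking the published records with
    #   ssh -o VerifyHostKeyDNS=yes review02.opendev.org
    # which is only really trustworthy with DNSSEC-validated resolution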
19:21:49 #topic Project Renames
19:21:59 #link https://review.opendev.org/c/opendev/system-config/+/803992 Accommodate zuul's new zk key management system
19:22:17 I've pushed that up with a depends-on to handle the future zuul state where it doesn't implicitly back up things to disk
19:22:56 The other thing we had on the todo list was updating the docs to handle the edits we made to the etherpad compared to the documented process
19:23:04 has anyone started on that change yet?
19:23:27 also we discovered that accepting the inability to run zuul jobs on rename changes makes it hard to spot when you've caught all the remaining tentacles. we ended up merging two fixes (i think it was two?) where the old project name was referenced
19:23:57 yup, I think part of the doc updates should be splitting those changes up so that we can review them with more CI testing upfront
19:24:28 i agree, but last time this came up we couldn't agree on where/how to split them so we wound up just keeping it all squashed
19:24:34 ya it's a bit of a pain iirc
19:25:08 I was thinking we could do an add everything but don't remove old stuff change for things like acls etc
19:25:16 also no i haven't yet written any process changes based on the notes in the pad
19:25:24 then we can safely land that first and then land a cleanup that does the actual rename?
19:25:29 #link https://etherpad.opendev.org/p/project-renames-2021-07-30 The maintenance plan we followed
19:25:40 fungi: ok, I can probably look at that this week.
19:25:46 that == writing the docs update change
19:26:19 i may get to it if you don't. i think a lot of it is going to be deletions anyway
19:26:19 Then we can delete this from the agenda along with the review upgrade topic :)
19:26:27 fungi: thanks
19:27:00 i guess it's step #5 there which will need some consideration
19:27:16 well, and step #1
19:28:15 also is there anything about how zuul handles configuration we can improve to make this easier, or which we can take advantage of (run a config check on the altered config in the check pipeline?)
19:28:36 fungi: the problem is that zuul in prod is verifying its own config against the config changes
19:28:51 fungi: we could run a testing zuul to validate things but those jobs won't even run due to the config errors in the proposal
19:29:47 well, it isn't going to speculatively apply the change anyway, the refusal to enqueue is a safeguard
19:30:17 maybe there's an option we could add to bypass that safety check in a yes-i-know-this-doesn't-make-sense kind of way?
19:30:32 something like that would work for acl verification at least
19:30:45 basically where we do out of band validation
19:31:25 or post zuul v5 maybe some support for actual repository renames in zuul, where it can reason about such things... but that's likely to be a significant undertaking
19:32:16 ya something to bring up with the zuul maintainers I suspect
19:32:29 Lets continue on. We can hash out our options while writing and reviewing the docs updates
19:32:46 #topic Matrix Homeserver and bots
19:33:11 tristanC's prometheus metrics show that gerritbot loses connectivity to review.opendev.org reliably every hour
19:33:42 Sorting that out is probably a good idea, though possibly not critical to zuul using the service
19:33:59 that's affecting our production gerrit, or a test instance?
19:34:02 We also got billed for the homeserver in the expected amount which means that aspect is working without surprises (a very good thing)
19:34:04 tristanC: are you working on that?
19:34:16 fungi: our production gerrit
19:34:20 er, production gerritbot (the irc-connected one)?
19:34:26 fungi: aiui yes
19:34:29 neat
19:34:33 oh sorry no
19:34:38 the production matrix gerritbot
19:34:40 it's affected the irc one too?
19:34:46 didn't think so
19:34:50 I don't have any evidence that it is affecting the irc gerritbot
19:35:00 well, that's what i'm wondering. if the gerrit connection code is all the same then it could i suppose
19:35:00 is it reliably at the same time every hour, or reliably once an hour?
19:35:18 ianw: same time every hour according to the prometheus graph I saw
19:35:43 fungi: it's completely different. irc gerritbot uses paramiko iirc and matrix gerritbot uses libssh2 in haskell
19:35:44 i do seem to remember rewriting/fixing the gerritbot reconnect logic at some point
19:35:56 it might be hiding any drops
19:36:14 I'm calling it out because it may lead to service impacts for zuul to use the matrix bot
19:37:08 clarkb: do you know if tristanC is working on a fix?
19:37:26 (i'm unaware of any previous discussion about this -- it's the first time i'm hearing of it)
19:37:27 corvus: I do not know. It was mentioned over the weekend and I don't know if anyone including tristanC is looking into it further
19:38:01 https://matrix-client.matrix.org/_matrix/media/r0/download/matrix.org/TIjNHQWUwHJlwgOpLbQRMYdN was what tristanC shared on Sunday (relative to me)
19:39:29 is there some discussion somewhere?
19:40:08 i can't find anything in #opendev eavesdrop logs
19:40:10 corvus: it was in #opendev on oftc from ~2100UTC Sunday to early Monday
19:40:56 https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-08.log.html#t2021-08-08T21:29:52 and https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:09
19:41:02 https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:09 looks relevant
19:41:05 I haven't seen mention of it since
19:41:54 okay, well, i was hoping to get the 'all clear' at this meeting to move zuul over, but it doesn't seem like we're there
19:42:20 tristanC: can you please provide an update (if you're not here now, maybe over in #opendev when you are around) on the impact of this issue and if you're addressing it?
19:42:32 I think I'm happy for Zuul to use it as is. It would be up to Zuul if they are ok with the connection error issue being sorted out concurrently with the move
19:43:04 Billing was my last major concern before moving (I didn't want zuul to move then us get a large unexpected bill and have to move quickly to something else for example)
19:43:04 clarkb: i don't feel like i have enough info to make that decision -- like -- how much stream-events time does gerritbot miss?
19:43:19 corvus: ya getting more info makes sense
19:45:16 Why don't we follow up with tristanC on that, then if keepalives fixed it Zuul can proceed, otherwise dig in more and make a decision? But I think from OpenDev's perspective it's largely up to Zuul's level of comfort with starting to actually use the services
19:46:21 Anything else to bring up on this subject?
19:46:24 that jives with my reckoning
19:46:30 that's it
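(For context on the keepalive idea mentioned above: a minimal sketch of what enabling SSH keepalives looks like on a paramiko connection to the Gerrit event stream. This is not the actual gerritbot code; the username and key path are placeholders.)

    import paramiko

    # connect to the Gerrit SSH API (credentials are placeholders)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("review.opendev.org", port=29418,
                   username="gerritbot", key_filename="/path/to/key")

    # send a keepalive packet every 60 seconds so an otherwise idle
    # stream-events connection is less likely to be silently dropped
    client.get_transport().set_keepalive(60)

    stdin, stdout, stderr = client.exec_command("gerrit stream-events")
    for line in stdout:
        print(line, end="")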
19:46:49 #topic Gitea Backups
19:47:13 We got an email saying lists failed as well. I was worried that it may be suffering the same issue now but it only happened the once
19:47:24 I suspect that was "normal" internet flakiness rather than the persistent variety
19:47:33 ianw: did an email get sent about this yet?
19:49:15 ahh, no sorry
19:49:45 Alright, considering the lists issue hasn't persisted I think that is all for this topic
19:49:52 #topic Gitea 1.15.0 upgrade
19:50:22 Thank you everyone for helping to review and land the prep changes for this work. We are no longer using hacky UI interactions via http and instead use the REST api for all gitea project management updates
19:50:44 The latest gitea 1.15.0-rc3 release seems to work fine in testing with the associated template updates and file moves
19:51:26 Upstream has a milestone setup due on the 18th for the 1.15.0 release and no outstanding bugs are listed. I expect the release will happen soon. Once it happens we can update my change and hold the nodes and do direct verification that stuff works as expected
19:51:39 The other gotcha is that the hosting of the logos changes and the paths move
19:51:45 this will impact review and paste's theming
19:52:11 If anyone has time to host those logos on static or with each service that uses them that might be a good idea
19:52:18 we haven't merged any project additions to exercise the new api interactions in production, as far as anyone knows?
19:52:22 then we aren't updating a bunch of random stuff when our hacked up gitea theming changes
19:52:32 fungi: ya I don't know of any new project creations since
19:53:07 ah i can make a static logo location
19:53:07 i think baking the logos into each image/deploying them to each server is probably the safest so we don't have unnecessary cross-site hosting
19:53:32 but keeping them in a single place in system-config (or some repo) would be good so we don't have duplicates in git
19:54:18 I hadn't considered that concern. It seems to be working now at least, but preventing future problems seems like a good thing
19:54:40 We can definitely coordinate the 1.15.0 gitea update around making sure we're happy with logo hosting
19:54:49 While it would be nice to update early we don't need to
19:55:30 Almost out of time so lets move on here
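(For reference on the REST api interactions mentioned in the gitea topic above: project creation and updates can go entirely through gitea's documented HTTP API. A rough sketch using python requests; the URL, credentials, org and repo names are placeholders, not the actual management tooling.)

    import requests

    GITEA_URL = "https://gitea01.opendev.org:3000"  # placeholder backend URL
    session = requests.Session()
    session.auth = ("root", "secret")  # placeholder admin credentials

    # create a repository under an org via the REST API instead of
    # driving the web UI forms
    resp = session.post(
        f"{GITEA_URL}/api/v1/orgs/opendev/repos",
        json={"name": "sandbox", "private": False},
    )
    resp.raise_for_status()

    # update metadata on an existing repository
    resp = session.patch(
        f"{GITEA_URL}/api/v1/repos/opendev/sandbox",
        json={"description": "A sandbox repository"},
    )
    resp.raise_for_status()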
19:55:35 #topic Mailman Ansible and Upgrades
19:55:41 The newlist fix landed
19:55:51 I don't know of any new lists being created since, so keep an eye out when that happens
19:56:14 I have not had time to snapshot the lists.kc.io server yet for server in-place upgrade testing but hope that it will happen this week
19:56:23 #topic Open Discussion
19:56:27 Anything else?
19:56:55 i've got nothing
19:57:03 Rico Lin reached out to fungi and me about doing a presentation about OpenDev for Open Infra Days Asia 2021. This is happening in a month and we have ~3 weeks to put together a recorded talk. I'd like to give it a go, but am balancing that with everything else
19:57:18 i'm trying to get to the bottom of debian-stable
19:57:24 https://review.opendev.org/q/topic:%22debian-stretch-rm%22+(status:open%20OR%20status:merged)
19:57:39 Mentioning it in case anyone is interested in helping put that together. I've been told that one of the easiest ways to do a recording like that is to have a recorded conference call where you present the data either to an empty call or to your copresenters
19:57:53 ianw: not sure if you saw, but jrosser was in favor of bypassing ci to merge the removals from murano-dashboard
19:58:13 fungi: oh, no missed that but that seems good
19:58:51 clarkb: yeah, i expect we could talk through some slides on jitsi-meet and then someone could record it locally from their browser
19:58:58 the more the merrier on that
20:00:05 And we are at time
20:00:06 thanks clarkb!
20:00:08 Thank you everyone!
20:00:11 #endmeeting