19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Feb 14 19:01:06 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:16 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VH5EQYJH3E2YKTP3K4IXQI27WRRSEUMR/ Our Agenda
19:01:31 <clarkb> #topic Announcements
19:02:04 <clarkb> It is Service Coordinator nomination time. I've not seen any nominations yet and the time period ends today. I suppose that's a gentle way of saying I should keep doing it?
19:02:33 <fungi> congratudolences!
19:02:36 <clarkb> If no one speaks up indicating their interest before I finish lunch today then I guess I'll make my nomination official after lunch
19:03:53 <clarkb> The other announcement today is that we'll cancel next week's meeting. fungi and I are traveling and busy next tuesday. That's ~50% of our normal attendance so I think we can just skip
19:04:26 <ianw> ++
19:04:45 <clarkb> #topic Bastion Host Updates
19:04:51 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:04:56 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group Remaining changes are part of parallel ansible runs on bridge
19:05:28 <clarkb> ianw: ^ you should just start nagging me to review that first set of changes. I keep putting it off due to distractions. I have been doing zuul reviews today and when I get tired of those I should do some opendev reviews too
19:05:49 <clarkb> (zuul early day due to overlap with europe is good then opendev late day due to overlap with au is good :) )
19:05:59 <ianw> :) i should loop back on the parallel stuff too
19:06:17 <ianw> it probably needs remerging etc.
19:06:31 <clarkb> are there any other bastion concerns?
19:07:11 <ianw> it being a jammy host it hits
19:07:14 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/872808
19:07:23 <ianw> with the old apt-key config
19:07:37 <ianw> that's all i can think of
19:07:40 <clarkb> oh I had a question about that which i guess I didn't post on the review directly (my bad)
19:07:50 <clarkb> specifically how new of an apt do we need to support that method of key trust
19:08:05 <clarkb> We would need it to work on bionic and newer iirc
19:08:15 <ianw> i think 1.4 which is >=bionic
19:08:42 <clarkb> and I guess reverting and making it distro release specific isn't too terrible either
19:08:58 <clarkb> I'll do a quick rereview after the meeting since i didn't record my previous thoughts properly
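As a rough illustration of the version question above, a minimal sketch (assuming it is enough to compare the major.minor reported by "apt --version") for confirming a host's apt is new enough (>= 1.4, i.e. bionic or later) to rely on per-repository signed-by keyrings rather than the deprecated apt-key trusted keyring:

    import subprocess

    def apt_version():
        # "apt 2.4.11 (amd64)" -> "2.4.11"
        out = subprocess.check_output(["apt", "--version"], text=True)
        return out.split()[1]

    def supports_signed_by(version, minimum=(1, 4)):
        # Compare only major.minor; good enough for >= 1.4 checks.
        parts = tuple(int(p) for p in version.split(".")[:2])
        return parts >= minimum

    if __name__ == "__main__":
        v = apt_version()
        print(f"apt {v}: signed-by supported: {supports_signed_by(v)}")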
19:10:43 <clarkb> #topic Mailman 3
19:11:09 <clarkb> We're still poking at the site creation stuff last I saw, but there was one other thing that had a change to address it
19:11:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/873337 Fix warnings about missing migrations
19:11:31 <fungi> my testing on the previous held node was probably invalid because of the lingering db migration issue which i think is what also resulted in my container restart errors
19:12:05 <fungi> i thought i had approved 873337 already but i guess not
19:12:07 <fungi> approved now
19:12:23 <fungi> i'll get a new held node once we have new images
19:12:26 <clarkb> sounds good
19:12:38 <fungi> or i guess that fix won't need new images
19:12:49 <fungi> so i can recheck the dnm change as soon as that merges
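For reference, a minimal sketch of how the missing-migrations warnings can be spotted on a held node; the compose service name "mailman-web" and the manage.py invocation are assumptions, not taken from the actual deployment:

    import subprocess

    cmd = [
        "docker-compose", "exec", "-T", "mailman-web",
        "python", "manage.py", "makemigrations", "--check", "--dry-run",
    ]
    # makemigrations --check exits non-zero when model changes exist
    # without corresponding migration files.
    result = subprocess.run(cmd)
    print("missing migrations detected" if result.returncode else "migrations up to date")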
19:12:58 <clarkb> Any other mailman related items? I think we've managed to chip away at most of it other than the site creation to fix vhosting (which makes sense since that is the complicated bit with db migrations)
19:13:34 <fungi> i don't have anything else, no. i still haven't had time to wrap my head around creating new sites with django migrations
19:13:38 <clarkb> #topic Gerrit Updates
19:13:50 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs
19:14:06 <clarkb> This update has been announced. I think we can probably land it whenever we're confident jeepyb is happy (and last I saw it was working?)
19:14:14 <clarkb> ianw: ^ not sure if you had a specific plan for that one
19:14:52 <ianw> i think it might need to be manually applied as it will take longer than the job timeout
19:15:20 <clarkb> ianw: jeepyb will only update those 90ish repos? Oh except those files might be used in more than 90 repos
19:15:59 <clarkb> if we think increasing the timeout would work that seems fine, otherwise manual application also seems fine.
19:16:00 <fungi> we could chunk the change up into batches i guess, but manual manage-projects run seems fine to me
19:16:55 <ianw> it's *probably* ok, but still.  might be an idea to 1) put in emergency 2) down gerrit 3) run a manual backup 4) up gerrit 5) manually apply? 6) remove from emergency?
19:17:41 <clarkb> ianw: do we think we need a backup like that? The acls are all in git so theoretically we can just revert them if necessary
19:18:05 <clarkb> mostly just thinking that I'm not sure a gerrit downtime is necessary
19:19:06 <ianw> i could go either way; i was just thinking it's an unambiguous snapshot
19:19:40 <clarkb> I think I'm willing to trust the acl system's historical record here. We've relied on it in the past and can continue to do so
19:19:41 <ianw> i guess we have now double-checked all the acl files, and gerrit shouldn't let us merge anything it doesn't like
19:20:49 <fungi> right, as long as we merge it through gerrit rather than behind its back, the worst that should happen is manage-projects throws errors and we can't create new projects or update existing ones for a little while until we sort it out
19:21:26 <fungi> or would have to take manual action in order to do so at least
19:21:41 <clarkb> I guess doing a canary change with a smaller set of updates might be good if we're worried about getting syntax wrong etc
19:21:57 <clarkb> but ya I think a downtime for backups is overkill given gerrit's builtin checks and record keeping
19:22:22 <fungi> technically the syntax is already checked by the manage-projects test in our gerrit job, right?
19:22:38 <clarkb> fungi: yes, but using the rules we try to interpret from gerrit not gerrit itself
19:22:45 <clarkb> and this is a new set of rules so possible we got it wrong
19:23:11 <fungi> oh, i guess i thought we had an integration test creating a project in gerrit
19:23:30 <clarkb> not using our production acls
19:23:44 <clarkb> that would probably take too long to run, unfortunately
19:23:54 <fungi> and the acl change doesn't update the test acl to match?
19:24:07 <clarkb> I don't think so.
19:24:20 <fungi> i didn't think to check that myself
19:24:48 <clarkb> they are decoupled. We test jeepyb + gerritlib in that bubble. We test our deployment of gerrit in system-config. And then we do simple linter type checks in project-config
19:24:57 <clarkb> this change is to project-config and doesn't impact the others
19:25:00 <ianw> i don't think we set any conditions in system-config, but we can
19:25:35 <clarkb> ya maybe that is better than doing a canary change
19:25:45 <clarkb> just to make sure we get the general syntax correct
19:26:02 <clarkb> But ya I'm not too worried about it given the ability to rollback etc
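A minimal sketch of the rollback point above: snapshot a project's refs/meta/config before the copy-condition change is applied, assuming the credentials used can read refs/meta/config (the remote URL and project name are illustrative):

    import subprocess, tempfile

    def snapshot_acl(project, remote="https://review.opendev.org"):
        # Fetch the project's refs/meta/config into a scratch repo so the
        # pre-change ACL commit is recorded locally.
        tmp = tempfile.mkdtemp(prefix="acl-snap-")
        subprocess.run(["git", "init", "--quiet", tmp], check=True)
        subprocess.run(
            ["git", "-C", tmp, "fetch", f"{remote}/{project}",
             "refs/meta/config:refs/meta/config"],
            check=True,
        )
        rev = subprocess.check_output(
            ["git", "-C", tmp, "rev-parse", "refs/meta/config"], text=True
        ).strip()
        return tmp, rev

    if __name__ == "__main__":
        path, rev = snapshot_acl("opendev/system-config")
        print(f"refs/meta/config is {rev} (snapshot in {path})")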
19:26:13 <clarkb> There were two other Gerrit related items
19:26:27 <ianw> ok, will do, and then i'll plan to apply it manually just to really watch it closely and because of timeouts, but with no downtime
19:26:28 <clarkb> Both of which i've put on the Gerrit community meeting agenda for March 2 (8am pacific)
19:27:07 <clarkb> The first is Java 17 support. I have a change up to swap us to java 17 which works in our CI jobs. But you have to use an ugly java cli option to make it happen which seems at odds with their full compatibility statement
19:27:38 <clarkb> I'm hoping to get a better sense of that support in the community meeting and if that's the path forward I guess we roll with it
19:27:48 <clarkb> The other is the ssh connectivity problem with channel tracking
19:28:18 <clarkb> ianw: has been digging into this quite a bit and I think discovered a bug in the upstream implementation of channel tracking. That doesn't explain why ssh is unhappy though right? just that fixing the bug will get us better information when those cases happen?
19:29:09 <ianw> so i think the bug means that the workaround committed was actually not doing anything
19:29:15 <clarkb> #link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.
19:29:40 <ianw> #link https://gerrit-review.googlesource.com/c/gerrit/+/358314
19:29:42 <clarkb> ianw: "the workaround" is to disable channel tracking?
19:30:15 <ianw> specifically it's https://gerrit-review.googlesource.com/c/gerrit/+/238384
19:30:41 <ianw> what that is supposed to do is track when a ssh channel is opened in a variable
19:31:28 <ianw> then, if an unhandledchannelerror is raised by mina, it looks at what channel it was, and if that channel has been opened before, basically ignores it
19:31:58 <clarkb> ah, but since it wasn't tracking it, that error propagates. So your fix may be the actual fix too
19:32:29 <ianw> right, the "track when opened" was never running because it wasn't registered to receive the channel open events
19:32:46 <clarkb> In that case I can use the community meeting to beg for reviews if they haven't landed it by then
19:32:51 <clarkb> :)
19:32:59 <ianw> so ... it's a fix ... but it doesn't really seem to answer any questions of what's going on
19:33:25 <clarkb> which is why your extra logging change remains to collect that info and hopefully debug the underlying situation
19:33:48 <ianw> #link https://gerrit-review.googlesource.com/c/gerrit/+/357694
19:34:46 <ianw> right, yeah that change has logging for basically every channel event.  but i'm not sure how much it helps now -- since we would be getting log messages when the channel is initialized from the prior change, which was mostly what we were interested in
19:35:44 <ianw> i don't know.  i think maybe merge the "fix" and just move on with life and don't think too hard about it :)
19:35:58 <clarkb> works for me. I'll bring it up with gerrit if we don't manage to make progress before the meeting
19:36:25 <ianw> something is still not quite right in mina I think, but this probably isn't the context to find it
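The actual fix lives in Java (mina-sshd/Gerrit), but a toy Python analogue of the bug class ianw describes (a channel-open tracker that exists but was never registered with the event source, so the "have we seen this channel before" check always came up empty) might look like:

    class ChannelEvents:
        def __init__(self):
            self.listeners = []

        def add_listener(self, listener):
            self.listeners.append(listener)

        def channel_opened(self, channel_id):
            for listener in self.listeners:
                listener.on_open(channel_id)

    class OpenTracker:
        def __init__(self):
            self.opened = set()

        def on_open(self, channel_id):
            self.opened.add(channel_id)

        def seen(self, channel_id):
            return channel_id in self.opened

    events = ChannelEvents()
    tracker = OpenTracker()
    # The buggy version skipped this registration, so tracker.opened stayed
    # empty and the unexpected-channel errors were never suppressed:
    events.add_listener(tracker)
    events.channel_opened(42)
    assert tracker.seen(42)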
19:37:50 <clarkb> #topic Upgrading Servers
19:38:08 <clarkb> I'm trying to pick this up again and have begun looking at the gitea backends
19:38:22 <clarkb> A couple of things make this easier than I feared and one thing makes this painful :)
19:39:04 <clarkb> We control gitea ansible group independently of what servers haproxy load balances to and gerrit replicates to. This means we can pretty easily spin up a new gitea on a new server running with a bunch of empty git repos
19:39:27 <clarkb> Then when we are happy with the state of the server add it to gerrit replication, force gerrit to replicate everything to that server, then wait
19:39:47 <clarkb> Then add the server to haproxy and probably remove an old server. Repeat in a loop
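A minimal sketch of the "force gerrit to replicate everything to that server" step, assuming the replication plugin's "replication start" ssh command with a --url filter and an admin identity on the Gerrit host; the backend hostname is a placeholder:

    import subprocess

    def replicate_all_to(url_filter, gerrit="review.opendev.org", user="admin"):
        # Ask the replication plugin to push every project to remotes whose
        # URL matches url_filter, and wait for it to finish.
        cmd = [
            "ssh", "-p", "29418", f"{user}@{gerrit}",
            "replication", "start", "--all", "--url", url_filter, "--wait",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        replicate_all_to("gitea99.opendev.org")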
19:40:02 <clarkb> What makes this painful/difficult is ensuring gitea state is what we want it to be. Specifically for redirects
19:40:52 <clarkb> I poked around in a held gitea test node's db yesterday and I think we can construct the redirects from scratch given info we have, but one thing that complicates that is we need to create gitea orgs that don't exist in projects.yaml
19:41:22 <clarkb> essentially leading me to realize that bootstrapping that all from an empty state is probably more effort than necessary right now (though a noble exercise and maybe one we should get around to eventually)
19:41:43 <clarkb> instead I think we should stop gitea after the initial bring up then replace its db with a prod db
19:41:58 <clarkb> er replace its fresh db with a copy of a prod db from an old host
19:42:12 <clarkb> that will bring over the other orgs and redirects in theory.
19:42:38 <clarkb> My concern with doing this is that maybe we'll end up with stuff missing on disk. But since we never have to put the server into a public facing capacity until we are happy with it I think we just do that and see if it works
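As a spot check for the db-copy approach, a minimal sketch that counts redirects and repositories in the imported database; it assumes gitea keeps rename redirects in a repo_redirect table and uses placeholder connection details:

    import pymysql

    conn = pymysql.connect(host="localhost", user="gitea",
                           password="secret", database="gitea")
    with conn.cursor() as cur:
        # Redirects created by past project renames.
        cur.execute("SELECT COUNT(*) FROM repo_redirect")
        (redirects,) = cur.fetchone()
        # Repositories gerrit will need to replicate onto disk.
        cur.execute("SELECT COUNT(*) FROM repository")
        (repos,) = cur.fetchone()
    print(f"{redirects} redirects, {repos} repositories in the imported db")
    conn.close()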
19:43:14 <clarkb> Looking at my current calendar and todo list maybe I can spin up that new server tomorrow, get it deployed as a blank gitea, then start attempting to make it a prod-like gitea on thursday
19:44:00 <clarkb> For things that are not gitea we have etherpad, nameservers, static, mirrors, and jitsimeet. Of those I think etherpad and nameservers are the priorities
19:44:04 <ianw> this is all because in the past we've made the gitea projects as usual via the api, but then they've been moved, which we've also done via gitea, which has internally applied db updates to reflect this on its instance, but when we're starting a new host we have no way of capturing this (at the moment, at least), right?
19:44:37 <clarkb> ianw: correct. We have a repo that captures the renames at that point in time but there is no tooling to apply that to gitea as a set of old orgs and redirects
19:45:06 <clarkb> I suppose as an alternative we could do inplace server upgrades. But I like to avoid those when we can
19:45:54 <ianw> it is always nice to validate we can start fresh
19:46:17 <clarkb> For the other servers I'm thinking etherpad and nameservers are the other priorities. In particular I had some notes about doing the nameservers but am not really confident in the process for that. If anyone has time to think that through and write out a small plan that would be appreciated
19:46:27 <clarkb> I suspect I'm overcomplicating the effort to update the nameservers in my head
19:46:58 <clarkb> and yes help much appreciated. Thanks for all the help so far too
19:49:29 <ianw> ++ i can have a look at nameservers
19:49:58 <clarkb> #topic Quo vadis Storyboard
19:50:11 <clarkb> This topic like the service has become a victim of a lack of time
19:50:49 <clarkb> I don't have anything new here. But maybe we should have a meeting dedicated to this in order to create a forcing function to spend time on it
19:51:09 <clarkb> I'd suggest the PTG but the PTG conflicts with spring break around here so I'm trying to limit my PTG commitments :)
19:51:30 <clarkb> But maybe a higher bw call type setup the week before PTG or something?
19:52:19 <clarkb> Let me get through next week's travel and then try to put something together for that
19:52:32 <clarkb> #topic Open Discussion
19:53:02 <clarkb> As mentioned at the beginning of the meeting I'll make my service coordinator nomination official in an hour or so after lunch assuming no one beats me to it
19:53:35 <clarkb> Zuul's sqlalchemy 2.0 change merged earlier today. I may try to kick off a zuul restart sooner than the regularly scheduled weekend restart just to get that checked more quickly
19:56:16 <clarkb> anything else?
19:57:24 <ianw> not from me, thanks!
19:57:59 <clarkb> Thank you everyone for your time during this meeting but also for contributing to OpenDev. We'll skip next week's meeting and be back here in two weeks
19:58:01 <clarkb> #endmeeting