Tuesday, 2023-02-14

clarkbAlmost meeting time18:59
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Feb 14 19:01:06 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VH5EQYJH3E2YKTP3K4IXQI27WRRSEUMR/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbIt is Service Coordinator nomination time. I've not seen any nominations yet and the time period ends today. I suppose that's a gentle way of saying I should keep doing it?19:02
fungicongratudolences!19:02
clarkbIf no one speaks up indicating their interest before I finish lunch today then I guess I'll make my nomination official after lunch19:02
clarkbThe other announcement today is that we'll cancel next week's meeting. fungi and I are traveling and busy next Tuesday. That's ~50% of our normal attendance so I think we can just skip19:03
ianw++19:04
clarkb#topic Bastion Host Updates19:04
clarkb#link https://review.opendev.org/q/topic:bridge-backups19:04
clarkb#link https://review.opendev.org/q/topic:prod-bastion-group Remaining changes are part of parallel ansible runs on bridge19:04
clarkbianw: ^ you should just start nagging me to review that first set of changes. I keep putting it off due to distractions. I have been doing zuul reviews today and when I get tired of those I should do some opendev reviews too19:05
clarkb(zuul early day due to overlap with europe is good then opendev late day due to overlap with au is good :) )19:05
ianw:) i should loop back on the parallel stuff too19:05
ianwit probably needs remerging etc.19:06
clarkbare there any other bastion concerns?19:06
ianwit being a jammy host it hits19:07
ianw#link https://review.opendev.org/c/opendev/system-config/+/87280819:07
ianwwith the old apt-key config19:07
ianwthat's all i can think of19:07
clarkboh I had a question about that which i guess I didn't post on the review directly (my bad)19:07
clarkbspecifically how new of an apt do we need to support that method of key trust19:07
clarkbWe would need it to work on bionic and newer iirc19:08
ianwi think 1.4 which is >=bionic19:08
clarkband I guess reverting and making it distro release specific isn't too terrible either19:08
clarkbI'll do a quick rereview after the meeting since i didn't record my previous thoughts properly19:08
clarkb#topic Mailman 319:10
clarkbWe're still poking at the site creation stuff last I saw, but there was one other thing that had a change to address it19:11
clarkb#link https://review.opendev.org/c/opendev/system-config/+/873337 Fix warnings about missing migrations19:11
fungimy testing on the previous held node was probably invalid because of the lingering db migration issue which i think is what also resulted in my container restart errors19:11
fungii thought i had approved 873337 already but i guess not19:12
fungiapproved now19:12
fungii'll get a new held node once we have new images19:12
clarkbsounds good19:12
fungior i guess that fix won't need new images19:12
fungiso i can recheck the dnm change as soon as that merges19:12
clarkbAny other mailman related items? I think we've managed to chip away at most of it other than the site creation to fix vhosting (which makes sense since that is the complicated bit with db migrations)19:12
fungii don't have anything else, no. i still haven't had time to wrap my head around creating new sites with django migrations19:13
clarkb#topic Gerrit Updates19:13
clarkb#link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs19:13
clarkbThis update has been announced. I think we can probably land it whenever we're confident jeepyb is happy (and last I saw it was working?)19:14
clarkbianw: ^ not sure if you had a specific plan for that one19:14
ianwi think it might need to be manually applied as it will take longer than the job timeout19:14
clarkbianw: jeepyb will only update those 90ish repos? Oh except those files might be used in more than 90 repos19:15
clarkbif we think increasing the timeout would work that seems fine, otherwise manual application also seems fine.19:15
fungiwe could chunk the change up into batches i guess, but manual manage-projects run seems fine to me19:16
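The batching idea mentioned above could look roughly like the following. This is only a minimal sketch: the `chunked` helper, the batch size of 25, and the ACL file paths are hypothetical and not taken from project-config; the count of 90 echoes the "90ish repos" estimate in the discussion.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical ACL file names standing in for the ~90 files being updated.
acl_files = [f"acls/example/project-{n}.config" for n in range(90)]

# Each batch would become its own change, keeping every run under the job timeout.
batches = chunked(acl_files, 25)
print([len(batch) for batch in batches])  # [25, 25, 25, 15]
```

Smaller batches would trade more changes to review for shorter individual runs; a single manual manage-projects run avoids that trade-off entirely.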
ianwit's *probably* ok, but still.  might be an idea to 1) put in emergency 2) down gerrit 3) run a manual backup run 4) up gerrit 5) manually apply? 6) remove from emergency?19:16
clarkbianw: do we think we need a backup like that? The acls are all in git so theoretically we can just revert them if necessary19:17
clarkbmostly just thinking that I'm not sure a gerrit downtime is necessary19:18
ianwi could go either way; i was just thinking it's an unambiguous snapshot19:19
clarkbI think I'm willing to trust the acl system's historical record here. We've relied on it in the past and can continue to do so19:19
ianwi guess we have now double-checked all the acl files, and gerrit shouldn't let us merge anything it doesn't like19:19
fungiright, as long as we merge it through gerrit rather than behind its back, the worst that should happen is manage-projects throws errors and we can't create new projects or update existing ones for a little while until we sort it out19:20
fungior would have to take manual action in order to do so at least19:21
clarkbI guess doing a canary change with a smaller set of updates might be good if we're worried about getting syntax wrong etc19:21
clarkbbut ya I think a downtime for backups is overkill given gerrit's builtin checks and record keeping19:21
fungitechnically the syntax is already checked by the manage-projects test in our gerrit job, right?19:22
clarkbfungi: yes, but using the rules we try to interpret from gerrit not gerrit itself19:22
clarkband this is a new set of rules so possible we got it wrong19:22
fungioh, i guess i thought we had an integration test creating a project in gerrit19:23
clarkbnot using our production acls19:23
clarkbthat would take too long to run probably unfortunately19:23
fungiand the acl change doesn't update the test acl to match?19:23
clarkbI don't think so.19:24
fungii didn't think to check that myself19:24
clarkbthey are decoupled. We test jeepyb + gerritlib in that bubble. We test our deployment of gerrit in system-config. And then we do simple linter type checks in project-config19:24
clarkbthis change is to project-config and doesn't impact the others19:24
ianwi don't think we set any conditions in system-config, but we can19:25
clarkbya maybe that is better than doing a canary change19:25
clarkbjust to make sure we get the general syntax correct19:25
clarkbBut ya I'm not too worried about it given the ability to rollback etc19:26
clarkbThere were two other Gerrit related items19:26
ianwok, will do, and then i'll plan to apply it manually just to really watch it closely and because of timeouts, but with no downtime19:26
clarkbBoth of which i've put on the Gerrit community meeting agenda for March 2 (8am pacific)19:26
clarkbThe first is Java 17 support. I have a change up to swap us to java 17 which works in our CI jobs. But you have to use an ugly java cli option to make it happen which seems at odds with their full compatibility statement19:27
clarkbI'm hoping to get a better sense of that support in the community meeting and if that's the path forward I guess we roll with it19:27
clarkbThe other is the ssh connectivity problem with channel tracking19:27
clarkbianw: has been digging into this quite a bit and I think discovered a bug in the upstream implementation of channel tracking. That doesn't explain why ssh is unhappy though right? just that fixing the bug will get us better information when those cases happen?19:28
ianwso i think the bug means that the workaround committed was actually not doing anything19:29
clarkb#link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.19:29
ianw#link https://gerrit-review.googlesource.com/c/gerrit/+/35831419:29
clarkbianw: "the workaround" is to disable channel tracking?19:29
ianwspecifically it's https://gerrit-review.googlesource.com/c/gerrit/+/23838419:30
ianwwhat that is supposed to do is track when a ssh channel is opened in a variable19:30
ianwthen, if an unhandledchannelerror is raised by mina, it looks at what channel it was, and if that channel has been opened before, basically ignores it19:31
clarkbah but since it wasn't tracking it that error propagates. So your fix may be the actual fix too19:31
ianwright, the "track when opened" was never running because it wasn't registered to receive the channel open events19:32
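The failure mode described above (a tracker that is consulted on errors but never receives the open events) can be illustrated with a small sketch. The classes below are hypothetical stand-ins written in Python; they are not Gerrit's or MINA SSHD's actual Java API, and they only model the registration bug, not the underlying network problem.

```python
class ChannelTracker:
    """Remembers which channels were opened, so later errors on them can be ignored."""

    def __init__(self):
        self.opened = set()

    def on_channel_opened(self, channel_id):
        self.opened.add(channel_id)

    def should_ignore_error(self, channel_id):
        # Only errors for channels we have seen open are considered harmless.
        return channel_id in self.opened


class Session:
    """Toy session that notifies registered listeners when a channel opens."""

    def __init__(self):
        self.listeners = []

    def add_channel_listener(self, listener):
        self.listeners.append(listener)

    def open_channel(self, channel_id):
        for listener in self.listeners:
            listener.on_channel_opened(channel_id)


# Buggy setup: the tracker exists but was never registered, so it sees no events.
buggy_tracker = ChannelTracker()
buggy_session = Session()
buggy_session.open_channel(7)
print(buggy_tracker.should_ignore_error(7))  # False: the error propagates

# Fixed setup: registering the tracker lets it record the open event.
fixed_tracker = ChannelTracker()
fixed_session = Session()
fixed_session.add_channel_listener(fixed_tracker)
fixed_session.open_channel(7)
print(fixed_tracker.should_ignore_error(7))  # True: the error is ignored
```

In the buggy case the workaround is effectively a no-op, which matches the observation that it "was actually not doing anything".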
clarkbIn that case I can use the community meeting to beg for reviews if they haven't landed it by then19:32
clarkb:)19:32
ianwso ... it's a fix ... but it doesn't really seem to answer any questions of what's going on19:32
clarkbwhich is why your extra logging change remains to collect that info and hopefully debug the underlying situation19:33
ianw#link https://gerrit-review.googlesource.com/c/gerrit/+/35769419:33
ianwright, yeah that change has logging for basically every channel event.  but i'm not sure how much it helps now -- since we would be getting log messages when the channel is initialized from the prior change, which was mostly what we were interested in19:34
ianwi don't know.  i think maybe merge the "fix" and just move on with life and don't think too hard about it :)19:35
clarkbworks for me. I'll bring it up with gerrit if we don't manage to make progress before the meeting19:35
ianwsomething is still not quite right in mina I think, but this probably isn't the context to find it19:36
clarkb#topic Upgrading Servers19:37
clarkbI'm trying to pick this up again and have begun looking at the gitea backends19:38
clarkbA couple of things make this easier than I feared and one thing makes this painful :)19:38
clarkbWe control gitea ansible group independently of what servers haproxy load balances to and gerrit replicates to. This means we can pretty easily spin up a new gitea on a new server running with a bunch of empty git repos19:39
clarkbThen when we are happy with the state of the server add it to gerrit replication, force gerrit to replicate everything to that server, then wait19:39
clarkbThen add the server to haproxy and probably remove an old server. Repeat in a loop19:39
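The replacement loop described above can be sketched as a simulation. Everything here is an assumption made for illustration: the `rolling_replace` function, the `is_healthy` callback, and the server names are hypothetical, and the code only manipulates lists rather than touching Ansible, Gerrit replication, or haproxy.

```python
def rolling_replace(haproxy_pool, replication_targets, new_servers, is_healthy):
    """Swap old backends for new ones, one server at a time (simulation only)."""
    pool = list(haproxy_pool)
    targets = list(replication_targets)
    old_servers = list(haproxy_pool)
    for new in new_servers:
        # The new server is assumed to be deployed already with empty git repos.
        # Add it to replication; a real run would force a full replication and wait.
        targets.append(new)
        # Only put it behind the load balancer once we are happy with its state.
        if not is_healthy(new):
            targets.remove(new)
            continue
        pool.append(new)
        # Retire one old server per new server, if any old ones remain.
        if old_servers:
            old = old_servers.pop(0)
            pool.remove(old)
            if old in targets:
                targets.remove(old)
    return pool, targets


pool, targets = rolling_replace(
    ["old-a", "old-b"],
    ["old-a", "old-b"],
    ["new-a", "new-b"],
    is_healthy=lambda server: True,
)
print(pool)     # ['new-a', 'new-b']
print(targets)  # ['new-a', 'new-b']
```

Because the load balancer membership is separate from the replication targets, a server that fails the health check never becomes public facing, which is the property the plan relies on.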
clarkbWhat makes this painful/difficult is ensuring gitea state is what we want it to be. Specifically for redirects19:40
clarkbI poked around in a held gitea test node's db yesterday and I think we can construct the redirects from scratch given info we have, but one thing that complicates that is we need to create gitea orgs that don't exist in projects.yaml19:40
clarkbessentially leading me to realize that bootstrapping that all from an empty state is probably more effort than necessary right now (though a noble exercise and maybe one we should get around to eventually)19:41
clarkbinstead I think we should stop gitea after the initial bring up then replace its db with a prod db19:41
clarkber replace its fresh db with a copy of a prod db from an old host19:41
clarkbthat will bring over the other orgs and redirects in theory.19:42
clarkbWhat I'm concerned about doing this is that maybe we'll end up with stuff missing on disk. But since we never have to put the server into a public facing capacity until we are happy with it I think we just do that and see if it works19:42
clarkbLooking at my current calendar and todo list maybe I can spin up that new server tomorrow, get it deployed as a blank gitea then start attempting to make it a prod like gitea on Thursday19:43
clarkbFor things that are not gitea we have etherpad, nameservers, static, mirrors, and jitsimeet. Of those I think etherpad and nameservers are the priorities19:44
ianwthis is all because in the past we've made the gitea projects as usual via the api, but then they've been moved, which we've also done via gitea, which has internally applied db updates to reflect this on its instance, but when we're starting a new host we have no way of capturing this (at the moment, at least), right?19:44
clarkbianw: correct. We have a repo that captures the renames at that point in time but there is no tooling to apply that to gitea as a set of old orgs and redirects19:44
clarkbI suppose as an alternative we could do inplace server upgrades. But I like to avoid those when we can19:45
ianwit is always nice to validate we can start fresh19:45
clarkbFor the other servers I'm thinking etherpad and nameservers are the other priorities. In particular I had some notes about doing the nameservers but am not really confident in the process for that. If anyone has time to think that through and write out a small plan that would be appreciated19:46
clarkbI suspect I'm overcomplicating the effort to update the nameservers in my head19:46
clarkband yes help much appreciated. Thanks for all the help so far too19:46
ianw++ i can have a look at nameservers19:49
clarkb#topic Quo vadis Storyboard19:49
clarkbThis topic like the service has become a victim of a lack of time19:50
clarkbI don't have anything new here. But maybe we should have a meeting dedicated to this in order to create a forcing function to spend time on it19:50
clarkbI'd suggest the PTG but the PTG conflicts with spring break around here so I'm trying to limit my PTG commitments :)19:51
clarkbBut maybe a higher bw call type setup the week before PTG or something?19:51
clarkbLet me get through next week's travel and then try to put something together for that19:52
clarkb#topic Open Discussion19:52
clarkbAs mentioned at the beginning of the meeting I'll make my service coordinator nomination official in an hour or so after lunch assuming no one beats me to it19:53
clarkbZuul's sqlalchemy 2.0 change merged earlier today. I may try to kick off a zuul restart sooner than the regularly scheduled weekend restart just to get that checked more quickly19:53
clarkbanything else?19:56
ianwnot from me, thanks!19:57
clarkbThank you everyone for your time during this meeting but also for contributing to OpenDev. We'll skip next week's meeting and be back here in two weeks19:57
clarkb#endmeeting19:58
opendevmeetMeeting ended Tue Feb 14 19:58:01 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:58
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.html19:58
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.txt19:58
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.log.html19:58
