Tuesday, 2023-02-14

clarkb	Almost meeting time	18:59
clarkb	#startmeeting infra	19:01
opendevmeet	Meeting started Tue Feb 14 19:01:06 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.	19:01
opendevmeet	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	19:01
opendevmeet	The meeting name has been set to 'infra'	19:01
clarkb	#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VH5EQYJH3E2YKTP3K4IXQI27WRRSEUMR/ Our Agenda	19:01
clarkb	#topic Announcements	19:01
clarkb	It is Service Coordinator nomination time. I've not seen any nominations yet and the time period ends today. I suppose that's a gentle way of saying I should keep doing it?	19:02
fungi	congratudolences!	19:02
clarkb	If no one speaks up indicating their interest before I finish lunch today then I guess I'll make my nomination official after lunch	19:02
clarkb	The other announcement today is that we'll cancel next week's meeting. fungi and I are traveling and busy next Tuesday. That's ~50% of our normal attendance so I think we can just skip	19:03
ianw	++	19:04
clarkb	#topic Bastion Host Updates	19:04
clarkb	#link https://review.opendev.org/q/topic:bridge-backups	19:04
clarkb	#link https://review.opendev.org/q/topic:prod-bastion-group Remaining changes are part of parallel ansible runs on bridge	19:04
clarkb	ianw: ^ you should just start nagging me to review that first set of changes. I keep putting it off due to distractions. I have been doing zuul reviews today and when I get tired of those I should do some opendev reviews too	19:05
clarkb	(zuul early day due to overlap with europe is good then opendev late day due to overlap with au is good :) )	19:05
ianw	:) i should loop back on the parallel stuff too	19:05
ianw	it probably needs remerging etc.	19:06
clarkb	are there any other bastion concerns?	19:06
ianw	it being a jammy host it hits	19:07
ianw	#link https://review.opendev.org/c/opendev/system-config/+/872808	19:07
ianw	with the old apt-key config	19:07
ianw	that's all i can think of	19:07
clarkb	oh I had a question about that which i guess I didn't post on the review directly (my bad)	19:07
clarkb	specifically how new of an apt do we need to support that method of key trust	19:07
clarkb	We would need it to work on bionic and newer iirc	19:08
ianw	i think 1.4 which is >=bionic	19:08
clarkb	and I guess reverting and making it distro release specific isn't too terrible either	19:08
clarkb	I'll do a quick rereview after the meeting since i didn't record my previous thoughts properly	19:08
clarkb	#topic Mailman 3	19:10
clarkb	We're still poking at the site creation stuff last I saw, but there was one other thing that had a change to address it	19:11
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/873337 Fix warnings about missing migrations	19:11
fungi	my testing on the previous held node was probably invalid because of the lingering db migration issue which i think is what also resulted in my container restart errors	19:11
fungi	i thought i had approved 873337 already but i guess not	19:12
fungi	approved now	19:12
fungi	i'll get a new held node once we have new images	19:12
clarkb	sounds good	19:12
fungi	or i guess that fix won't need new images	19:12
fungi	so i can recheck the dnm change as soon as that merges	19:12
clarkb	Any other mailman related items? I think we've managed to chip away at most of it other than the site creation to fix vhosting (which makes sense since that is the complicated bit with db migrations)	19:12
fungi	i don't have anything else, no. i still haven't had time to wrap my head around creating new sites with django migrations	19:13
clarkb	#topic Gerrit Updates	19:13
clarkb	#link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs	19:13
clarkb	This update has been announced. I think we can probably land it whenever we're confident jeepyb is happy (and last I saw it was working?)	19:14
clarkb	ianw: ^ not sure if you had a specific plan for that one	19:14
ianw	i think it might need to be manually applied as it will take longer than the job timeout	19:14
clarkb	ianw: jeepyb will only update those 90ish repos? Oh except those files might be used in more than 90 repos	19:15
clarkb	if we think increasing the timeout would work that seems fine, otherwise manual application also seems fine.	19:15
fungi	we could chunk the change up into batches i guess, but manual manage-projects run seems fine to me	19:16
ianw	it's probably ok, but still. might be an idea to 1) put in emergency 2) down gerrit 3) run a manual backup run 4) up gerrit 5) manually apply? 6) remove from emergency?	19:16
clarkb	ianw: do we think we need a backup like that? The acls are all in git so theoretically we can just revert them if necessary	19:17
clarkb	mostly just thinking that I'm not sure a gerrit downtime is necessary	19:18
ianw	i could go either way; i was just thinking it's an unambiguous snapshot	19:19
clarkb	I think I'm willing to trust the acl system's historical record here. We've relied on it in the past and can continue to do so	19:19
ianw	i guess we have now double-checked all the acl files, and gerrit shouldn't let us merge anything it doesn't like	19:19
fungi	right, as long as we merge it through gerrit rather than behind its back, the worst that should happen is manage-projects throws errors and we can't create new projects or update existing ones for a little while until we sort it out	19:20
fungi	or would have to take manual action in order to do so at least	19:21
clarkb	I guess doing a canary change with a smaller set of updates might be good if we're worried about getting syntax wrong etc	19:21
clarkb	but ya I think a downtime for backups is overkill given gerrit's builtin checks and record keeping	19:21
fungi	technically the syntax is already checked by the manage-projects test in our gerrit job, right?	19:22
clarkb	fungi: yes, but using the rules we try to interpret from gerrit not gerrit itself	19:22
clarkb	and this is a new set of rules so possible we got it wrong	19:22
fungi	oh, i guess i thought we had an integration test creating a project in gerrit	19:23
clarkb	not using our production acls	19:23
clarkb	that would take too long to run probably unfortunately	19:23
fungi	and the acl change doesn't update the test acl to match?	19:23
clarkb	I don't think so.	19:24
fungi	i didn't think to check that myself	19:24
clarkb	they are decoupled. We test jeepyb + gerritlib in that bubble. We test our deployment of gerrit in system-config. And then we do simple linter type checks in project-config	19:24
clarkb	this change is to project-config and doesn't impact the others	19:24
ianw	i don't think we set any conditions in system-config, but we can	19:25
clarkb	ya maybe that is better than doing a canary change	19:25
clarkb	just to make sure we get the general syntax correct	19:25
clarkb	But ya I'm not too worried about it given the ability to rollback etc	19:26
clarkb	There were two other Gerrit related items	19:26
ianw	ok, will do, and then i'll plan to apply it manually just to really watch it closely and because of timeouts, but with no downtime	19:26
clarkb	Both of which i've put on the Gerrit community meeting agenda for March 2 (8am pacific)	19:26
clarkb	The first is Java 17 support. I have a change up to swap us to java 17 which works in our CI jobs. But you have to use an ugly java cli option to make it happen which seems at odds with their full compatibility statement	19:27
clarkb	I'm hoping to get a better sense of that support in the community meeting and if that's the path forward I guess we roll with it	19:27
clarkb	The other is the ssh connectivity problem with channel tracking	19:27
clarkb	ianw: has been digging into this quite a bit and I think discovered a bug in the upstream implementation of channel tracking. That doesn't explain why ssh is unhappy though right? just that fixing the bug will get us better information when those cases happen?	19:28
ianw	so i think the bug means that the workaround committed was actually not doing anything	19:29
clarkb	#link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.	19:29
ianw	#link https://gerrit-review.googlesource.com/c/gerrit/+/358314	19:29
clarkb	ianw: "the workaround" is to disable channel tracking?	19:29
ianw	specifically it's https://gerrit-review.googlesource.com/c/gerrit/+/238384	19:30
ianw	what that is supposed to do is track when a ssh channel is opened in a variable	19:30
ianw	then, if an unhandledchannelerror is raised by mina, it looks at what channel it was, and if that channel has been opened before, basically ignores it	19:31
clarkb	ah but since it wasn't tracking it that error propagates. So your fix may be the actual fix too	19:31
ianw	right, the "track when opened" was never running because it wasn't registered to receive the channel open events	19:32
clarkb	In that case I can use the community meeting to beg for reviews if they haven't landed it by then	19:32
clarkb	:)	19:32
ianw	so ... it's a fix ... but it doesn't really seem to answer any questions of what's going on	19:32
clarkb	which is why your extra logging change remains to collect that info and hopefully debug the underlying situation	19:33
ianw	#link https://gerrit-review.googlesource.com/c/gerrit/+/357694	19:33
ianw	right, yeah that change has logging for basically every channel event. but i'm not sure how much it helps now -- since we would be getting log messages when the channel is initialized from the prior change, which was mostly what we were interested in	19:34
ianw	i don't know. i think maybe merge the "fix" and just move on with life and don't think too hard about it :)	19:35
clarkb	works for me. I'll bring it up with gerrit if we don't manage to make progress before the meeting	19:35
ianw	something is still not quite right in mina I think, but this probably isn't the context to find it	19:36
clarkb	#topic Upgrading Servers	19:37
clarkb	I'm trying to pick this up again and have begun looking at the gitea backends	19:38
clarkb	A couple of things make this easier than I feared and one thing makes this painful :)	19:38
clarkb	We control gitea ansible group independently of what servers haproxy load balances to and gerrit replicates to. This means we can pretty easily spin up a new gitea on a new server running with a bunch of empty git repos	19:39
clarkb	Then when we are happy with the state of the server add it to gerrit replication, force gerrit to replicate everything to that server, then wait	19:39
clarkb	Then add the server to haproxy and probably remove an old server. Repeat in a loop	19:39
clarkb	What makes this painful/difficult is ensuring gitea state is what we want it to be. Specifically for redirects	19:40
clarkb	I poked around in a held gitea test node's db yesterday and I think we can construct the redirects from scratch given info we have, but one thing that complicates that is we need to create gitea orgs that don't exist in projects.yaml	19:40
clarkb	essentially leading me to realize that bootstrapping that all from an empty state is probably more effort than necessary right now (though a noble exercise and maybe one we should get around to eventually)	19:41
clarkb	instead I think we should stop gitea after the initial bring up then replace its db with a prod db	19:41
clarkb	er replace its fresh db with a copy of a prod db from an old host	19:41
clarkb	that will bring over the other orgs and redirects in theory.	19:42
clarkb	What I'm concerned about doing this is that maybe we'll end up with stuff missing on disk. But since we never have to put the server into a public facing capacity until we are happy with it I think we just do that and see if it works	19:42
clarkb	Looking at my current calendar and todo list maybe I can spin up that new server tomorrow, get it deployed as a blank gitea then start attempting to make it a prod like gitea on Thursday	19:43
clarkb	For things that are not gitea we have etherpad, nameservers, static, mirrors, and jitsimeet. Of those I think etherpad and nameservers are the priorities	19:44
ianw	this is all because in the past we've made the gitea projects as usual via the api, but then they've been moved, which we've also done via gitea, which has internally applied db updates to reflect this on its instance, but when we're starting a new host we have no way of capturing this (at the moment, at least), right?	19:44
clarkb	ianw: correct. We have a repo that captures the renames at that point in time but there is no tooling to apply that to gitea as a set of old orgs and redirects	19:44
clarkb	I suppose as an alternative we could do inplace server upgrades. But I like to avoid those when we can	19:45
ianw	it is always nice to validate we can start fresh	19:45
clarkb	For the other servers I'm thinking etherpad and nameservers are the other priorities. In particular I had some notes about doing the nameservers but am not really confident in the process for that. If anyone has time to think that through and write out a small plan that would be appreciated	19:46
clarkb	I suspect I'm overcomplicating the effort to update the nameservers in my head	19:46
clarkb	and yes help much appreciated. Thanks for all the help so far too	19:46
ianw	++ i can have a look at nameservers	19:49
clarkb	#topic Quo vadis Storyboard	19:49
clarkb	This topic like the service has become a victim of a lack of time	19:50
clarkb	I don't have anything new here. But maybe we should have a meeting dedicated to this in order to create a forcing function to spend time on it	19:50
clarkb	I'd suggest the PTG but the PTG conflicts with spring break around here so I'm trying to limit my PTG commitments :)	19:51
clarkb	But maybe a higher bw call type setup the week before PTG or something?	19:51
clarkb	Let me get through next week's travel and then try to put something together for that	19:52
clarkb	#topic Open Discussion	19:52
clarkb	As mentioned at the beginning of the meeting I'll make my service coordinator nomination official in an hour or so after lunch assuming no one beats me to it	19:53
clarkb	Zuul's sqlalchemy 2.0 change merged earlier today. I may try to kick off a zuul restart sooner than the regularly scheduled weekend restart just to get that checked more quickly	19:53
clarkb	anything else?	19:56
ianw	not from me, thanks!	19:57
clarkb	Thank you everyone for your time during this meeting but also for contributing to OpenDev. We'll skip next week's meeting and be back here in two weeks	19:57
clarkb	#endmeeting	19:58
opendevmeet	Meeting ended Tue Feb 14 19:58:01 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	19:58
opendevmeet	Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.html	19:58
opendevmeet	Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.txt	19:58
opendevmeet	Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.log.html	19:58
