Thursday, 2020-12-10

*** tosky has quit IRC00:02
*** DSpider has quit IRC00:11
openstackgerritMerged opendev/system-config master: graphite: also deny account page  https://review.opendev.org/c/opendev/system-config/+/76631800:12
*** mlavalle has quit IRC00:30
*** brinzhang has joined #opendev00:38
johnsomSomething is up with zuul. A bunch of jobs are stuck in the blue swirly00:40
johnsomMy job has been sitting that way for 21 minutes00:40
johnsomish00:40
fungihttps://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=100:58
fungilooks like no executors are accepting jobs now00:58
corvusit's possible we have a bunch of hung git processes01:01
corvuszuuld    29738  0.4  0.0  13088  6564 ?        S    00:38   0:06 ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova'01:02
corvusfrom ze0101:02
fungithat would make sense01:03
corvusi'll try to kill those git procs01:03
fungialso explains why ianw earlier observed one executor taking most of the active builds... the rest were probably already all stuck01:03
ianwyeah; corvus let me know if i can help01:12
corvusianw: thanks; i just went through a ps listing and killed all the upload-packs that started in the 00:00 hour01:16
corvushopefully that got them all01:16
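A rough sketch of the cleanup described above, assuming standard ps/grep/kill; the PID in the comment is the one from the earlier paste and is only an example:

    # list the hung upload-pack processes with their start times
    ps -eo pid,lstart,etime,args | grep '[g]it-upload-pack'
    # then kill the ones whose start time falls in the stuck window, e.g.
    # kill 29738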
corvusi do see https read timeouts against gerrit from the zuul scheduler01:17
corvusi think gerrit's dead again01:17
corvusyeah, continuous gc operations that are slow and ineffective: 49102M->49102M01:18
fungii'll go ahead and down/up the container again in that case01:19
fungii don't think it's recovering01:19
fungiand restarted again01:20
*** ysandeep|away is now known as ysandeep01:54
ianwnode allocation still seems quite slow02:03
ianwzuuld    21496 24191  1 02:24 ?        00:00:01 git clone ssh://zuul@review.opendev.org:29418/openstack/nova /var/lib/zuul/executor-git/opendev.org/openstack/nova02:27
ianwthis appears stopped02:27
ianwze02 has a bunch of "git cat-file --batch-check"02:28
ianwi'm starting to think an executor restart cycle might be a good idea02:28
ianwze03 has similar git calls going on02:28
ianwi'm in a bit of a hail mary here.  i'm going to stop the executors from bridge playbook and see if i can clear gerrit, and then restart and see if it gets stuck on nova again02:40
*** priteau has quit IRC03:03
*** ysandeep is now known as ysandeep|session03:19
openstackgerritmelanie witt proposed opendev/base-jobs master: Revert "Exclude neutron q-svc logs from indexing"  https://review.opendev.org/c/opendev/base-jobs/+/76639903:19
*** openstackgerrit has quit IRC03:22
ianw2020-12-10 02:46:57,288 DEBUG zuul.Repo: [e: ccab1fab1ca149a1a1e61aab06013f4f] [build: ef857c597f7f4dfabe37f10dea33f9e1] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianw2020-12-10 03:18:32,073 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: 6991e3c93e144b6b869bbb34bb8d73a6] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianw2020-12-10 03:37:22,547 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: a03f9ace9dce43269dc25fcbb6c6b959] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianwthe executors are making things worse i think.  they are in a 300 second timeout loop trying to clone nova; it fails, and they try again03:38
*** hamalq_ has quit IRC03:45
ianwi'm trying a manual update of the git_timeout on ze07 and seeing if its clone can work03:52
fungigerrit itself seems okay03:55
fungi300 seconds is a rather long time to complete a clone operation though03:56
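For context, the timeout being discussed is zuul's git operation timeout. A minimal sketch of the end state on an executor, assuming git_timeout (default 300 seconds) lives in the [merger] section of zuul.conf, which executors also read; the actual bump went in through system-config (change 766400 below):

    # /etc/zuul/zuul.conf (excerpt)
    [merger]
    git_timeout=600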
*** user_19173783170 has joined #opendev03:56
user_19173783170How can I change my gerrit account username03:57
ianwfungi: it's going about as fast as it can03:57
*** zbr has quit IRC03:57
fungiuser_19173783170: gerrit usernames are immutable once set03:57
ianwit's up to 1.9G03:58
ianw(i think it is)03:58
*** zbr has joined #opendev03:58
ianwok, it took about 7 minutes03:59
fungiianw: do you think it's because of the full executor restart clearing the extant copies?04:00
ianwfungi: I don't think the executors have managed to get these bigger repos since i guess nodedb04:00
ianwnotedb04:00
ianwthey all try to update it, get killed, rinse repeat04:00
fungioh, like the executors have had incomplete nova clones since the upgrade?04:01
fungiyikes04:01
ianwmaybe, or they've got into a bad state and can't get out04:01
fungii wonder if the discussed update to turn on git v2 protocol would speed that up for them at all04:02
ianwi think the problem is the git in the executors isn't v2 by default, so it would require more fiddling.  it may help04:03
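The fiddling in question is small; on a git client older than 2.26 (where v2 became the default) protocol v2 can be opted into per-user or per-invocation, for example:

    git config --global protocol.version 2
    # or for a single command:
    git -c protocol.version=2 fetch origin master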
ianwit looks like we lost gerritbot04:08
ianwfungi: https://review.opendev.org/c/opendev/system-config/+/76640004:10
fungiianw: have you restarted gerritbot, or shall i?04:21
ianwfungi: i have04:21
fungithanks!04:22
fungii've approved the timeout bump to 10 minutes, but need to drop offline at this point so won't be around to see it merge04:25
*** openstackgerrit has joined #opendev04:59
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630004:59
*** user_19173783170 has quit IRC05:26
openstackgerritMerged opendev/system-config master: zuul: increase git timeout  https://review.opendev.org/c/opendev/system-config/+/76640005:37
*** user_19173783170 has joined #opendev05:38
*** user_19173783170 has quit IRC05:42
*** cloudnull has quit IRC05:56
*** cloudnull has joined #opendev05:57
*** marios has joined #opendev06:02
*** marios is now known as marios|rover06:03
*** ShadowJonathan has quit IRC06:11
*** rpittau|afk has quit IRC06:11
*** ShadowJonathan has joined #opendev06:11
*** mnaser has quit IRC06:11
*** mnaser has joined #opendev06:12
*** rpittau|afk has joined #opendev06:12
*** ysandeep|session is now known as ysandeep06:28
*** jaicaa has quit IRC06:31
*** jaicaa has joined #opendev06:34
*** sboyron has joined #opendev06:34
*** marios|rover has quit IRC06:37
*** lpetrut has joined #opendev07:00
*** ysandeep is now known as ysandeep|sick07:03
*** slaweq has joined #opendev07:17
*** marios has joined #opendev07:18
*** marios is now known as marios|rover07:19
*** ralonsoh has joined #opendev07:25
frickleroh, I think I finally found the meaning of "CC" in gerrit vs. "Reviewer": the former is someone who leaves a comment but with code-review=0. doing some non-zero cr promotes a cc to reviewer. sorry for the noise if that was discussed already or well-known07:27
*** eolivare has joined #opendev07:29
*** noonedeadpunk has joined #opendev07:32
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596307:33
*** hamalq has joined #opendev08:06
*** andrewbonney has joined #opendev08:11
*** rpittau|afk is now known as rpittau08:17
*** marios|rover has quit IRC08:25
zbrclarkb: for the record, again "Credentials expired".08:33
*** marios|rover has joined #opendev08:34
*** larainema has joined #opendev08:35
*** hashar has joined #opendev08:39
*** tosky has joined #opendev08:47
*** zbr has quit IRC09:10
*** zbr has joined #opendev09:11
*** zbr has quit IRC09:13
*** zbr has joined #opendev09:13
mnasiadkaMorning09:33
mnasiadkaIt looks like some of the gitea servers have an expired certificate09:33
*** zbr has quit IRC09:35
*** zbr has joined #opendev09:37
*** zbr has quit IRC09:40
*** zbr has joined #opendev09:40
fricklermnasiadka: I checked all 8 and they look fine to me, can you be more specific?09:42
mnasiadkahttps://www.irccloud.com/pastebin/hvNCiiok/09:44
mnasiadkaonce in a while I get this ^^09:44
mnasiadkahttps://www.irccloud.com/pastebin/vsKrkUGp/09:45
mnasiadkaverbose output ^^09:45
fricklermnasiadka: can you run "openssl s_client -connect opendev.org:443" and show the output on paste.openstack.org? also please verify that your local clock is correct09:47
mnasiadkafrickler: sure, let me try - clock is correct, but it happens on one out of 10 tries I think09:48
mnasiadkahttp://paste.openstack.org/show/800924/09:49
*** DSpider has joined #opendev09:51
mnasiadkafrickler: so obviously sometimes I get an old cert, question why :)09:52
*** zbr has quit IRC09:54
fricklermnasiadka: with apache we have sometimes seen single workers not updating correctly, not sure if something similar can happen with gitea09:54
fricklerinfra-root: ^^ there's also a lot of "Z"ed subprocesses from gitea. I'll leave it in this state for now until someone else can have a second look, otherwise I'd suggest to just restart the gitea-web container09:55
*** zbr has joined #opendev09:56
*** ralonsoh_ has joined #opendev10:00
*** ralonsoh has quit IRC10:01
*** hamalq has quit IRC10:07
openstackgerritIlles Elod proposed zuul/zuul-jobs master: Add option to constrain tox and its dependencies  https://review.opendev.org/c/zuul/zuul-jobs/+/76644110:23
*** ralonsoh_ is now known as ralonsoh10:33
*** sboyron_ has joined #opendev10:43
*** sboyron has quit IRC10:45
*** hashar is now known as hasharLunch10:47
*** zbr has quit IRC10:54
*** zbr has joined #opendev10:56
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Increase flake8 version in lower-constraints  https://review.opendev.org/c/openstack/diskimage-builder/+/76644710:56
*** hamalq has joined #opendev11:09
*** hamalq has quit IRC11:14
*** fressi has joined #opendev11:23
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Increase flake8 and pyflakes version in lower-constraints.txt  https://review.opendev.org/c/openstack/diskimage-builder/+/76644711:25
*** dtantsur|afk is now known as dtantsur11:33
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596311:35
*** larainema has quit IRC11:44
openstackgerritIlles Elod proposed zuul/zuul-jobs master: Add option to constrain tox and its dependencies  https://review.opendev.org/c/zuul/zuul-jobs/+/76644111:46
sshnaidmdoes anyone know - which devstack jobs branches are supported now?11:51
*** zbr has quit IRC11:58
*** zbr has joined #opendev11:59
*** zbr has quit IRC12:10
*** zbr has joined #opendev12:10
*** larainema has joined #opendev12:12
*** zbr has quit IRC12:18
*** zbr has joined #opendev12:20
*** fressi_ has joined #opendev12:26
*** fressi has quit IRC12:27
*** fressi_ is now known as fressi12:27
*** hamalq has joined #opendev12:28
*** hamalq has quit IRC12:33
openstackgerritchandan kumar proposed openstack/diskimage-builder master: Enable dracut list installed modules  https://review.opendev.org/c/openstack/diskimage-builder/+/76623212:50
*** zbr has quit IRC13:01
*** zbr has joined #opendev13:03
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596313:08
*** sshnaidm has quit IRC13:09
*** sshnaidm has joined #opendev13:09
*** priteau has joined #opendev13:10
*** zbr has quit IRC13:21
*** zbr has joined #opendev13:24
*** slaweq has quit IRC13:29
*** zbr has quit IRC13:30
*** zbr has joined #opendev13:32
*** slaweq has joined #opendev13:33
*** sboyron__ has joined #opendev13:34
*** sboyron_ has quit IRC13:35
*** sboyron__ is now known as sboyron13:41
*** zigo has joined #opendev13:51
*** hasharLunch is now known as hashar14:02
*** zbr has quit IRC14:05
*** zbr has joined #opendev14:08
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: ensure-pip: install virtualenv, it is still used  https://review.opendev.org/c/zuul/zuul-jobs/+/76647714:19
fungifrickler: we have apache running on the gitea servers (used so we can filter specific abusive bots based on user agent strings), so the apache behavior can certainly apply there14:29
*** hamalq has joined #opendev14:29
fungisshnaidm: it's the openstack qa team who support and maintain devstack, so people in #openstack-qa are going to be best positioned to answer your question14:30
sshnaidmfungi, ack14:31
*** hamalq_ has joined #opendev14:32
*** hamalq has quit IRC14:34
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596314:34
*** hamalq_ has quit IRC14:36
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596314:47
*** fressi has quit IRC14:57
openstackgerritMerged openstack/project-config master: New Project Request: airship/vino  https://review.opendev.org/c/openstack/project-config/+/76388914:58
openstackgerritMerged openstack/project-config master: New Project Request: airship/sip  https://review.opendev.org/c/openstack/project-config/+/76388814:58
openstackgerritHervĂ© Beraud proposed opendev/irc-meetings master: Switch release team to 1700 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649014:59
openstackgerritHervĂ© Beraud proposed opendev/irc-meetings master: Switch oslo team to 1600 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649315:04
*** adrian-a has joined #opendev15:06
sshnaidmis it possible to see in job somewhere what exactly was returned in zuul_return?15:07
sshnaidmlike here https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/15:07
sshnaidmfungi, ^15:07
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649615:12
corvussshnaidm: not automatically, no; you could add a debug task, or copy the zuul return file into the logs dir15:13
sshnaidmcorvus, ok, just wanted to ensure it's the same..15:13
fungidoing it automatically for all jobs isn't a good idea because they could, for example, pass secrets between them15:13
sshnaidmack15:13
sshnaidmI think I have a problem that child job gets return data from different parent15:14
sshnaidmis it possible?15:14
corvussshnaidm: if it depends on multiple jobs and they both return the same vars, yeah, one of them is going to win15:15
sshnaidmcorvus, nope, it's one parent and multiple consumers15:15
corvussshnaidm: check the inventory file for the consumers to see what they received15:16
sshnaidmcorvus, yeah, what I did15:16
corvussshnaidm: the variable is there but the value is not what you expected?15:16
sshnaidmcorvus, yes, value is different15:16
sshnaidmlike there was a different parent job running somewhere..15:17
corvussshnaidm: what consumer job, and what variable?15:17
sshnaidmcorvus, consumer job gets IP of parent which deploys container registry: https://zuul.opendev.org/t/openstack/build/726aa3eb8c0640f4987272649c2a1040/log/zuul-info/inventory.yaml#5415:17
sshnaidmcorvus, and this is a parent: https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/logs  which has different IP address and pass something else: https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/log/job-output.txt#459015:19
sshnaidmin most cases it works fine15:19
sshnaidmbut recently we started to see such mess15:19
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596315:21
corvus2020-12-10 05:32:31,504 DEBUG zuul.ExecutorClient: Build <gear.Job 0x7fe48cfce460 handle: b'H:::ffff:127.0.0.1:2433698' name: executor:execute unique: cfebd9f976594e64b8757152b4633e87> update {'paused': True, 'data': {'zuul': {'pause': True}, 'provider_dlrn_hash_branch': {'master': 'cd51c9b4a10fea745ad818e64de40a7f'}, 'provider_dlrn_hash_tag_branch': {'master': 'cd51c9b4a10fea745ad818e64de40a7f'},15:22
corvus'provider_job_branch': 'master', 'registry_ip_address_branch': {'master': '188.95.227.214'}}}15:22
corvussshnaidm: ^15:22
corvussshnaidm: that was from this build: https://zuul.opendev.org/t/openstack/build/cfebd9f976594e64b8757152b4633e8715:23
corvuswhich was retried15:24
gibido we have some ongoing gerrit issue? we see a lot of "ERROR Failed to update project None in 3s" messages on patches recently15:24
corvusgibi: link?15:25
fungigibi: from today or prior?15:25
gibihttps://review.opendev.org/c/openstack/nova/+/76574915:26
gibithis is a pretty recent result ^^15:26
fungii think we expect those from ~monday through yesterday, but not since around 04:00 utc today hopefully15:26
fungiin particular our executors were failing to clone the nova repo from gerrit because it was taking longer than their 5-minute timeout15:27
fungiwe increased their git clone timeout to 10 minutes which allowed them to finally complete15:27
corvusour executors should [almost] never clone the nova repo from gerrit15:27
gibifungi: thanks, then we recheck15:28
corvuslike, they should do that once when they start for the first time15:28
corvusthen they should clone from their cache15:28
fungicorvus: ianw suspects they couldn't fully clone it following the upgrade when they got all the added notedb content, and have been looping ever since15:28
corvusfungi: then how did any nova job ever complete?15:29
fungithat i'm not sure about15:29
corvus(and do we really expect them to be pulling down notedb content?)15:29
corvusthis isn't adding up :/15:29
corvussshnaidm: i'm trying to figure out why that job retried15:30
fungicorvus: yeah, not sure, i'm mostly going by what ianw stated overnight at this point15:31
corvussshnaidm: i think it's because the executor was restarted (see my other conversation with fungi)15:31
sshnaidmcorvus, ack, so it's retries15:32
fungi(after a restart at least) the executors (possibly all the mergers?) were all looping trying to clone nova and then killing the git operation at 5 minutes and starting over, ianw tested cloning and it took 7 minutes, so once he increased the timeout to 10 minutes the executors all cloned nova successfully and resumed normal operation15:32
sshnaidmcorvus, does it mean the original paused job was aborted and a different one replaced it, while the child jobs remained as is?15:32
corvussshnaidm: possibly; or possibly the child jobs were replaced too but used the old data15:33
openstackgerritSorin Sbârnea proposed opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649915:48
*** adrian-a has quit IRC15:57
*** lpetrut has quit IRC15:57
*** zbr has quit IRC16:04
*** zbr has joined #opendev16:06
clarkbcorvus: fungi my understanding is that they shouldn't pull all the notedb content but without git protocol v2 enabled all the fetch (clone, etc) operations have to negotiate through those refs16:07
*** zbr has quit IRC16:07
corvusclarkb: is it possible that the word 'clone' is being used loosely here?16:08
*** zbr has joined #opendev16:08
fungiyeah, so maybe the nova repo's size and reduction in jgit/jetty performance in newer gerrit simply inched it over the clone timeout16:08
corvusie, maybe ianw saw fetches taking a really long time because of the lack of v2 negotiation?16:08
clarkbcorvus: ya that could be16:09
corvusbecause -- seriously -- if we're *actually* cloning that's an enormous drop-everything-and-fix-it regression in zuul16:09
corvusdoes anyone know what timeout ianw changed?16:10
fungi2020-12-10 03:18:32,073 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: 6991e3c93e144b6b869bbb34bb8d73a6] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova16:10
fungithat's what he saw looping, getting killed at the 5-minute timeout, and then repeating16:10
fungicorvus: https://review.opendev.org/766400 i approved it before i fell asleep16:11
clarkbhttps://review.opendev.org/c/opendev/system-config/+/766365 is a change to enable git protocol v2 on gerrit and is currently enabled on review-test if epople want to test it16:12
corvusfungi: 'resetting repo' is normally a fetch16:12
fungicorvus: got it, so it was fetches taking longer than 300s i guess16:12
clarkb(I did rudimentary testing using the flags and command at https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html but talking to our review.o.o and review-test repos over https and ssh)16:12
corvusif it were cloning, we would see "Cloning from ... to ..."16:13
fungicould a fetch be as expensive as a clone if something happened to the local copy of the repo? like maybe from killing the stuck nova fetches you saw on the executors earlier?16:13
clarkbfungi: both fetches and clones using older protocols have to list all refs to negotiate what data to transfer16:14
corvuswe do remove the repos and re-clone if there's an error16:14
clarkbfungi: with the v2 protocol the client says "I want to fetch foo and bar" and only those refs are examined by both sides16:14
corvusianw's commit msg does say 'it times out.. they delete the directory'16:15
clarkbreally the only difference between a fetch and a clone on the old protocol is how much data you end up transferring after negotiation16:15
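A quick way to see that advertisement cost against a gerrit-hosted repo (a sketch; counts vary, but a repo with many changes advertises a very large number of refs/changes/* refs under the old protocol):

    # every one of these refs is part of the v0/v1 advertisement
    git ls-remote https://review.opendev.org/openstack/nova | wc -l
    # with v2 the client can ask the server for only the refs it cares about
    git -c protocol.version=2 ls-remote https://review.opendev.org/openstack/nova refs/heads/master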
clarkbcorvus: aha16:15
corvusso maybe that's what happened?  we got a bunch of errors, zuul couldn't trust the repos any more, deleted them, and then we fell back to actual cloning?16:15
clarkbya that seems plausible16:15
fungiyeah, it sounds like git v2 is a good next step to reduce the overall cost, the question was whether we need to update git on the executors to make that happen so didn't want to try when i was already nodding off16:15
clarkbespecially since during the gc period errors would have been seen by zuul I bet16:16
corvusfungi: executors are containerized now, so should be very recent git16:16
clarkbcorvus: fungi that was my assumption too re git versions16:16
clarkbputting things in containers makes that much better for us I expect16:16
corvusclarkb, fungi: yeah, so i think i'm happy that zuul is not b0rked, ianw's timeout change is a good interim change, git v2 is a good next step (and may allow us to revert that)16:17
corvusi also suspect that this may explain the increasingly short cycle time between gerrit failures we observed yesterday16:17
corvussince the load on gerrit from zuul would have progressed geometrically as these failed16:18
clarkbah yup16:18
fungiright, seems like there was some reinforcement between different causes of load there16:19
openstackgerritSorin Sbârnea proposed opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649916:22
*** zbr has quit IRC16:24
*** zbr has joined #opendev16:26
*** zbr has quit IRC16:29
*** zbr has joined #opendev16:29
*** hamalq has joined #opendev16:30
openstackgerritMerged opendev/irc-meetings master: Switch oslo team to 1600 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649316:33
openstackgerritMerged opendev/irc-meetings master: Switch release team to 1700 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649016:33
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649616:42
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649616:43
*** zbr has quit IRC16:46
*** zbr has joined #opendev16:49
*** zbr has quit IRC16:49
*** zbr has joined #opendev16:55
*** auristor has quit IRC16:55
*** auristor has joined #opendev16:55
*** auristor has quit IRC16:58
gibiI've just got another round of "ERROR Failed to update project None in" for Zuul in https://review.opendev.org/c/openstack/nova/+/76647116:58
clarkbI wonder if an executor or merger or three haven't gotten the timeout config update properly applied?17:00
clarkbfungi: corvus ^ I'm still not quite properly caught up on all that, do we know if all the services were restarted to pick that up (or if that is even required?)17:01
corvusclarkb: yes and yes17:02
zbri guess that mirror script is not covered by any CI test? https://review.opendev.org/c/opendev/system-config/+/766499/17:02
clarkbzbr: correct, because it can take many hours and a lot of bw to properly test it. And half the time it fails due to various things that we don't control17:03
clarkbcorvus: fungi in that case maybe the timeout isn't quite long enough for all cases?17:04
clarkb(sorry I'm still trying to boot the brain today)17:04
clarkbbut maybe it is better to enable git protocol v2 then reevaluate the timeouts rather than doing another full zuul restart?17:05
zbri could add a test for it and run in draft, but probably there are better ways to use my time17:07
fungicorvus: clarkb: could it be that ianw made sure the executors were okay but didn't think to check all the stand-alone mergers?17:08
corvusfungi: start times look right17:08
fungizbr: if we wanted to test that script we'd probably be better off mocking out the mirror(s) we're copying from and just using a local path or an rsync daemon running on the loopback17:09
*** auristor has joined #opendev17:10
zbrmy personal approach: add support for draft mode and add a test playbook that runs it, should be enough to get an idea that it should do something17:10
zbrand catch problems like syntax errors in bash17:11
zbrcopy does need to happen, but it will spot a bad url17:11
zbrdry is a decent way to validate the logic17:12
*** marios|rover has quit IRC17:12
fungiahh, though the failed to update errors are on individual builds so not the scheduler failing to get a ref constructed from one of the mergers17:13
zbri find it hard not to ask why we're not using ansible to perform the mirroring; i would have found the file easier to read and maintain at the cost of not seeing live output.17:13
fungizbr: these scripts are older than ansible17:14
fungiyou might as well ask why the linux kernel wasn't written in go17:14
fungiokay, so one of the failing builds (d3bb8ddb4ef3433dbd01b49d7dff202b) ran on ze1217:14
zbrfungi: should I bother to rewrite the mirroring script or not?17:17
fungizbr: hard to say, right now i'm more interested in figuring out what's happening with our zuul executors killing random jobs17:18
fungilooks like ze12 is raising InvalidGitRepositoryError when updating nova17:19
*** adrian-a has joined #opendev17:19
danpawliktobiash: hey, this one will not pass the gates: https://review.opendev.org/c/openstack/diskimage-builder/+/761857/ until this one is merged https://review.opendev.org/c/openstack/diskimage-builder/+/76644717:19
yoctozeptoqq - is zuul retrying jobs only when they fail in pre or always?17:20
clarkbyoctozepto: it will always retry failures in pre. It will retry failures in other run stages if ansible reports the error is due to network connectivity17:20
yoctozeptoclarkb: I see, that could be it17:21
clarkband it is a total of 3 retries for the job regardless of where the retry causing failures originate17:21
yoctozeptobecause the job seems to be taking quite long to get to another retry17:21
*** zbr has quit IRC17:21
clarkbyoctozepto: when you retry you end up at the end of the queue17:21
yoctozeptoyeah, it's in run again17:21
yoctozeptoyeah -> https://zuul.opendev.org/t/openstack/stream/cf4074e0343e4f0dbee8c0e7afaa0189?logfile=console.log17:21
clarkbavoiding retries is a really good idea :)17:22
fungiclarkb: corvus: here's what the traceback from that build looks like: http://paste.openstack.org/show/800945/17:23
clarkbfungi: corvus should we stop ze12, move the repo aside then manually reclone it?17:23
*** zbr has joined #opendev17:24
yoctozeptoclarkb: I'd love to; no idea why this one job is persistent in retrying17:24
fungiclarkb: corvus: i'm queued up to take the container down on ze12 if this isn't going to disrupt anyone else's troubleshooting17:24
fungilooks like i'm the only one logged into the server anyway17:25
clarkbI'm deferring to ya'll on this one I think17:25
fungii'm taking it down now17:25
clarkbyoctozepto: common problems like that are jobs modifying host networking in a way that causes it to stop working. tripleo has had jobs disable dhcp, lease runs out, doesn't renew and now no more ip address. Similarly if the host is accessed via ipv6 and you disable RAs you can lose the ipv6 address, etc17:26
fungize12 has logged 1007 git.exc.InvalidGitRepositoryError exceptions in the current executor-debug.log17:26
fungiall for /var/lib/zuul/executor-git/opendev.org/openstack/nova17:27
clarkbyoctozepto: less common is the job doing something that crashes the test node. Stuff like nested virt crashing hard17:27
fungii'm making a copy of that to fsck and see what might be wrong with it17:27
yoctozeptoclarkb: nah, we ain't doing that17:27
yoctozeptonor that17:27
yoctozeptoI will see whether the 3rd one succeeds17:27
yoctozeptoand what it fails on17:27
yoctozeptoI mean we have many flavours of scenarios17:27
clarkbfungi: ++17:28
yoctozeptono idea why this one is acting erratically this time17:28
fungiwell, first problem: du says /var/lib/zuul/executor-git/opendev.org/openstack/nova is a mere 84K17:28
clarkbya so probably it timed out and was killed while negotiating and didn't really transfer any data?17:29
clarkbI think that is the behavior you'd expect if git was in the process of sorting out what it needed to do17:29
fungiit looks like a fresh git init17:29
fungiexcept it's not even as clean as an init17:30
fungiso yeah, more like an interrupted clone at the very early stages17:30
fungiload average on gerrit is getting rather high again too17:31
clarkbfungi: its doing backups right now I think17:33
fungioh, yep17:33
*** hamalq has quit IRC17:34
fungiokay, so the good news is that git.exc.InvalidGitRepositoryError only appears in the current executor-debug.log of ze12, no other executors17:34
clarkbcorvus: fungi do you think we should manually clone nova for zuul then start it again on ze12?17:34
fungiand there it's only about the nova repo17:34
clarkband maybe double check it got the timeout config update17:35
*** zbr has quit IRC17:35
fungigit_timeout=600 appears in the /etc/zuul/zuul.conf there17:35
*** zbr has joined #opendev17:37
yoctozeptogerrit 502 for me17:37
yoctozeptosend help17:37
yoctozeptoinfra-root: gerrit really seems down17:40
yoctozeptoI believe it is suffering the same it was yesterday17:40
tobiashdanpawlik: thanks for the info!17:41
clarkbyoctozepto: it actually isn't but the end result is the same17:41
fungiyesterday we had some other activity going on which is not present in the logs today17:41
clarkbinfra-root if you look at top -H the gc threads are not monopolizing all the time17:41
yoctozeptoclarkb: yeah, speaking about observable stuff ;-)17:41
fungialso this time the load average has shot up to 10017:41
* yoctozepto super sad about this17:41
fungiseems the cpu utilization is almost entirely the java process, and there's no iowait to speak of17:42
yoctozeptoprogramming error?17:43
fungiquite a few third-party ci systems in the current show-queue output trying to fetch details on change 75384717:44
clarkbfungi: also powerkvm fetching nova17:44
clarkbit seems to be recovering too17:45
clarkbthis feels a lot more like the "normal" disruptions we've seen previously. Things get busy but then recover17:45
clarkbvs the GC insanity17:45
fungiyeah, load average is falling rapidly17:45
fungithis might have been some sort of thundering herd condition17:46
clarkbya17:46
clarkbmy best next suggestion to try related to that is to enable protocol v217:46
fungi5-minute load average is below 10 now17:47
yoctozeptoI am still getting timeout17:47
fungiyep, was going to get back to looking at ze12 if the gerrit crisis has passed17:47
clarkbyoctozepto: I believe that apache will return those for a short period17:47
fungiright, i'm not saying it's necessarily recovered yet but looks like it might be starting to catch its breath17:48
clarkbfungi: I think load is low because apache is or has said go away17:50
fungiload average is rather low now but the gerrit webui is still not returning content17:50
clarkbapache may have filled its open slots17:52
clarkb?17:52
mnaseris it possible one of the gitea backends ssl certs are expired17:52
mnasera local ci job: fatal: unable to access 'https://opendev.org/zuul/zuul-helm/': SSL certificate problem: certificate has expired17:53
clarkbmnaser: that was brought up earlier and frickler checked them and they were all fine. But every backend serves a unique cert with its name so you can check ouyself too17:53
mnaseri am double checking now17:53
yoctozeptoclarkb: I am not getting any response now though17:53
yoctozeptoit timeouts at the transport level17:53
clarkbyoctozepto: ya looking at apache everything is close wait, lask ack, or fin wait17:53
yoctozeptoack17:53
fungiclarkb: however as i pointed out, if it's a stale apache worker serving the old ssl cert which had been rotated, you won't necessarily hit that worker when you test the server17:53
clarkbfungi: I think gitea serves the cert17:54
fungithen how are we doing ua filtering in the apache layer?17:54
clarkboh except if the filtering I guess apache would have to serve it? so ya17:54
clarkbfungi: ya you're right apache must be doing it now I guess17:54
clarkbfungi: should we restart apache on review? I don't understand why it seems to be doing not much17:54
fungii can give that a shot17:55
*** zbr has quit IRC17:55
mnaserfor i in `seq 1 9`; do curl -vv -s "https://opendev.org" 2>&1 | grep 'expire'; done;17:56
mnaser3/9 times i got an expired cert17:56
mnaserso i think fungi's theory that it is the stale apache worker might be valid17:56
mnaser(but i hit all the backends and they came back clean)17:56
fungiif you grab cert details the serveraltname should say which server you're hitting17:56
mnaserok sure one second17:56
fungii forget the curl syntax for that17:56
*** zbr has joined #opendev17:57
*** zbr5 has joined #opendev17:57
mnaserhmm seems curl even with -vvvvv just returns certificate has expired, nothing more17:57
clarkbmnaser: fungi openssl s_client will show you17:58
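A sketch of pulling the certificate names and expiry out of whichever backend answers; per the discussion above, the subjectAltName on each gitea backend's cert should identify it:

    echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
      | openssl x509 -noout -text | grep -E -A1 'Not After|Subject Alternative Name'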
mnaseryep, switching to that17:58
*** zbr5 has quit IRC17:58
fungiright17:58
clarkbanyway running ps across all of them I think 03 may be the problem17:58
*** zbr has quit IRC17:58
clarkbit has an old apache worker. The other 7 seem to have recycled them all at least today17:58
*** zbr has joined #opendev17:59
fungiclarkb: on review i'm starting to think apache might be the problem as well... i can wget http://localhost:8081/ locally on the server which is what it should be proxying to17:59
mnaserclarkb, fungi: i got a call i have to run into, but this is a failing s_client output -- http://paste.openstack.org/show/800946/17:59
mnaseri can try and help again but in an hour17:59
clarkbya that says 03, I'll restart apache2 there18:00
clarkband that is done18:00
fungii finally got a server-status out of apache on review.o.o and all slots are in "Gracefully finishing" state18:01
fungihuh?18:01
fungii wonder if this is mod_proxy misbehaving18:01
clarkbfungi: is that stale from pre restart? fwiw I think gerrit just loaded a diff for me18:01
clarkband the access log is showing stuff filtering in18:01
yoctozeptoit let me in now18:01
fungiclarkb: i haven't restarted anything18:01
clarkbfungi: oh18:02
clarkbfungi: I thought you had, but then ya I guess apache is slowly recovering on its own after cleaning up old slots?18:02
fungibut now server-status is returning instantly and reporting only half the slots in that state18:02
fungiso yeah, seems like this is some timeout in mod_proxy maybe18:02
yoctozeptomight need decreasing it to avoid these stalls18:03
*** hamalq has joined #opendev18:03
clarkbyoctozepto: feel free to propose a change :)18:04
*** rpittau is now known as rpittau|afk18:04
clarkbwe'd love help, and this week has been really not fun for everyone. But more and more it feels like we need to reinforce that we run these services with you18:05
clarkbfor gitea03 I've talked to it via port 3081 with s_client and it responds with Verification: Ok several times in a row now18:07
yoctozeptoclarkb: if I knew the right knob for sure!18:07
*** hamalq_ has joined #opendev18:07
fungiso reading more closely, "gracefully finishing" state is what apache workers enter when a graceful restart is requested18:07
yoctozeptoit has been a bad week for all of us I guess...18:07
yoctozeptoah, so it made a restart18:07
fungithe worker waits for all requests to complete18:08
clarkb(it wasn't me)18:08
yoctozeptomakes sense that it does so18:08
clarkbmaybe our cert updated or something?18:08
fungiwe trigger it automatically on configuration updates and cert rotation18:08
yoctozeptooh oh18:08
fungiif one of those hit in the middle of the insane load spike, maybe it had to time out a ton of dead sockets18:09
yoctozeptothat is quite a downtime for cert rotation :D18:09
yoctozeptomaybe put haproxies in front with their new dynamic cert reload functionality (they essentially pick up new certs for the incoming requests)18:09
*** hamalq has quit IRC18:10
fungilet's dispense with the premature optimizations. i've not yet even confirmed what triggered the restart18:10
fungii'm trying to approach this by methodically collecting information first and not making baseless assumptions as to a cause18:11
clarkb++18:12
*** eolivare has quit IRC18:12
fungi-rw-r----- 1 root letsencrypt 1976 Oct 21 06:54 /etc/letsencrypt-certs/review.opendev.org/review.opendev.org.cer18:13
fungiso it's not cert rotation18:13
clarkbunless we're somehow triggering the handler even when teh cert doesn't update18:13
clarkb(I doubt it, but could still happen)18:13
fungilast restart apache logged was Thu Dec 10 06:25:07 (utc)18:14
fungiso another possibility is that graceful state is also used for worker recycling18:14
fungimpm_event.conf sets MaxConnectionsPerChild to 0 so we're not recycling workers after a set number of requests18:19
clarkbfungi: could gracefully finishing be in response to the backend not responding?18:20
fungihowever they are being recycled by something, most of the processes are no more than 20 minutes old18:20
fungiso looks like the parent apache process is from october 20 which i guess was the last complete restart, the first child process has a start timestamp from today's log rotation which i expect is the last graceful restart, the remainder of the workers however seem to have been dynamically recycled judging by their much more recent timestamps18:22
clarkbfungi: maybe cross check with the ansible logs?18:23
fungieasier said than done, we've got a loose timeframe and no specific event (yet) to narrow it down with, plus ansible logs an insane number of lines mentioning apache on this server18:25
clarkbya...18:26
fungiso i'm sifting through that now but it will take a while18:26
clarkbroger18:26
fungi1737 was the first report in here of a problem related to the server18:27
fungibackups start at 17:1218:27
clarkbfungi: when I hopped on the server a little after that borg was no longer running18:27
clarkbbut that doesn't mean it is innocent18:27
fungistarting around 17:42:01 we get a number of lines in syslog from the kernel, systemd, containerd... clarkb did you run any docker commands maybe?18:31
clarkbfungi: I did not18:31
clarkbI ran simple things like w, top and apache log tailing18:31
fungiit was roughly 1.5 minutes after you logged in so figured i'd ask18:31
fungioh! it's track-upstream being called from ceon18:32
fungicron18:32
clarkband all of that was on the host side18:32
clarkbfungi: ooooh18:32
clarkbfungi: you know what, I'm not sure we track any upstreams that we currently consume?18:32
fungiso anyway that kicked off in the middle of all this18:32
clarkbmaybe we should disable that and see if things settle a bit more?18:32
fungii doubt it was at fault18:32
fungibut can't hurt to turn off a thing nobody's using, sure18:33
clarkbya seems unlikely except for maybe memory pressure?18:33
fungiwe've also still got an openstackwatch cronjob firing periodically18:33
clarkbI think our gerrit fork may be the only thing that "uses" it18:33
clarkbbut we're currently unforked18:33
clarkbexcept for the /x/* update18:33
clarkbbut that happens in the job on top of the upstream repo18:33
clarkb(double check me on zuul pulls from upstream not our fork though)18:33
fungiyeah, i was about to say, we're technically forked, just not forking the git repo18:33
fungianyway, after poring over syslog, no, nothing at all happening ansible-wise, and i see nothing which would have triggered apache to recycle worker processes, so it must have been an internal decision in apache18:35
fungias for apache's error log, we started hitting "AH00485: scoreboard is full, not at MaxRequestWorkers" at 17:56:17 just out of the blue18:37
fungiso i suspect that's the point at which every worker had entered gracefully finishing state18:38
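The same scoreboard is available in machine-readable form, assuming mod_status is reachable locally (a sketch; the exact URL and access rules depend on the vhost config):

    curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers|Scoreboard'
    # in the Scoreboard string, "G" marks workers in the gracefully-finishing state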
*** dtantsur is now known as dtantsur|afk18:40
fungiParent Server Config. Generation: 6818:42
fungiParent Server MPM Generation: 6718:42
fungiare those supposed to match?18:42
clarkbI have no idea18:43
clarkbfungi: where does that info come from?18:44
fungimod_status18:46
clarkbah gotcha18:47
clarkbfungi: maybe check it on another service and see if they line up?18:47
fungianyway, i'm starting to think, after a lot of digging, this is "normal" (albeit aberrant) behavior for ubuntu xenial era mpm-event18:47
clarkbmaybe they are off by one normally because config is loaded in that order or whatever18:47
clarkbfungi: ya certainly sounds like that based on your investigation18:47
fungithat is, all workers ending up in g state because they can take too long to end18:47
fungibasically apache periodically (i have yet to find the exact default conditions though) recycles mpm-event workers by gracefully stopping them, it's not supposed to recycle them all at the same time but if the stopping takes too long, say because your server load suddenly spiked to 100...18:49
fungiso anyway, apache should in theory behave a bit better in the future once we upgrade the operating system on that server. in the meantime if we see that again, a hard restart of the apache2 service should bring things back sooner18:55
fungithe gerrit load issue on the other hand, no idea what exactly triggered that, we'll need to keep an eye out for more incidents of the same18:56
fungii'm switching back to troubleshooting the nova git repo on ze12 for now18:57
mnaserfungi: ouch, do i understand part of the issue here is a service that we have no idea what's restarting it? :(18:58
fungimnaser: no, nothing restarted18:58
mnaseroh, the recycling of mpm event workers18:59
fungiapache spawns multiple worker processes. it periodically refreshes them by issuing an independent graceful stop to a worker and firing up a new one. if you're at your max worker count it won't fire up a new worker until a stopping one has exited. if something causes those workers to take too long to exit, they pile up, until you have no worker processes accepting connections any longer19:00
fungithis seems to have been triggered by the gerrit activity which caused load to skyrocket on the server19:01
mnaserfungi: https://httpd.apache.org/docs/2.4/mod/mpm_common.html see MaxConnectionsPerChild (or previously MaxRequestsPerChild)19:01
mnaserlet me see if our config has that19:01
fungiyeah, it's 019:01
fungii'm familiar with it. we tune that on some of our other services19:02
*** sboyron has quit IRC19:02
mnaserfungi: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/static/files/apache-connection-tuning ?19:02
mnaseris that potentially being reused?19:02
fungithat's how we do it on other servers, yeah, but doesn't appear to be applied on review.o.o19:02
fungianyway, it was an aftereffect of a pathological condition19:03
fungithe gerrit service started having problems at or before 17:37z, apache didn't run out of workers until 17:56z (almost 20 minutes later)19:04
mnaserdidn't mean to distract too much :) so i'll let you get other things19:05
fungigerrit cpu consumption had trailed back off by 17:50z even19:05
fungibut i think the insane system load sent apache into a dark place from which it eventually recovered on its own as it started successfully reaping those stopping workers around 18:02z19:06
fungiso anyway, back to corrupt nova repo on ze12, i'm rescanning the other 11 executors to make sure the gerrit incident didn't trigger anything similar on another one19:09
fungigood, they're still all coming back clean, so it's just the (stopped) ze1219:10
fungilooking at the timestamps, i have a feeling ze12 never successfully cloned the nova repo after its restart, unlike the others19:13
*** adrian-a has quit IRC19:20
fungioh, also after a bit more digging, i think it's our MaxKeepAliveRequests 100 with KeepAlive On (these are defaults) in /etc/apache2/apache2.conf which is what's determining worker process recycling: https://httpd.apache.org/docs/2.4/mod/core.html#MaxKeepAliveRequests19:22
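For reference, the stock Debian/Ubuntu values being referred to (a sketch of the relevant apache2.conf lines; tuning these changes how often event workers get recycled):

    KeepAlive On
    MaxKeepAliveRequests 100
    KeepAliveTimeout 5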
*** adrian-a has joined #opendev19:24
fungiokay, i've removed the incomplete /var/lib/zuul/executor-git/opendev.org/openstack/nova on ze12 and started the service again19:25
fungiwill keep an eye on it to make sure it clones nova completely and doesn't have any recurrence of its earlier errors19:25
fungi2020-12-10 19:33:02,289 DEBUG zuul.Repo: [e: b5c0c3a9504c4da8ba3ba8cae23adf3e] Cloning from ssh://zuul@review.opendev.org:29418/openstack/nova to /var/lib/zuul/executor-git/opendev.org/openstack/nova19:33
corvusfungi, clarkb: sorry was away earlier; back now19:40
fungi2020-12-10 19:39:46,351 DEBUG zuul.ExecutorServer: [e: b5c0c3a9504c4da8ba3ba8cae23adf3e] [build: 8d72d952f28f47798786350c6fbd34df] Finished updating repo gerrit/openstack/nova19:43
fungiso looks like it cloned in ~6m44s19:44
*** andrewbonney has quit IRC20:05
ianwo/20:31
*** hashar has quit IRC20:32
ianwreading backwards ...20:32
fungiand if that doesn't work, i recommend reading upside-down20:36
ianwi am in Australia, so i guess i read upside down by default? :)20:37
ianwif i've understood, there were some executor problems after the timeout update that were tracked down to ze12 having a bad copy on disk?20:37
clarkbianw: ya and we had a spike in load that was not attributed to garbage collection (based on top -H)20:38
clarkbreview recovered on its own from that but apache did so more slowly20:38
corvusclarkb: what's the status of v2?20:39
clarkbcorvus: on https://review.opendev.org/c/opendev/system-config/+/766365 ianw points out that while our git is new enough to use v2 on the executors/mergers we need to enable it?20:40
ianwi had a quick play with my local git and v2 seemed to work fine on the test server, though i only cloned a few things20:40
ianwbut i think we might need some work to deploy it on executors20:40
clarkbcorvus: otherwise I think the first step is to land ^ restart gerrit, then work to ensure ci systems use it?20:40
ianw... heh, or what clarkb said :)20:40
clarkbianw: ya in .gitconfig I Think you set protocol.version = 220:41
clarkbto enable it on older git20:41
clarkb2.18 introduced it20:41
fungiianw: the main thing i wasn't sure about was whether it's possible ze12 somehow ended up with a just-right corruption of its git repo that it never successfully blew away and recloned it (unlike the other executors after the timeout increase)?20:41
clarkb`git config --global protocol.version 2` will write out the config too20:42
fungiianw: reason being i didn't find any significant lull of nova repo errors in the debug log for ze1220:42
ianwfungi: yeah, i didn't check all the executors, but in the gerrit queue i could no longer see any long zuul processes cloning nova, so assumed that all had grabbed it ok20:42
fungiianw: cool, so that remains a possibility i guess20:43
ianwi saw some discussion, but did we decide why these clones are like 2gb?20:43
corvusreally?  wow why isn't using v2 just autonegotiated?20:43
openstackgerritMerged opendev/gear master: use python3 as context for build-python-release  https://review.opendev.org/c/opendev/gear/+/74216520:43
ianwcorvus: it is from a 2.26 onwards client20:43
clarkbcorvus: it is if your git is from this year20:43
clarkbI think ~may?20:43
fungiyeah, it seemed like maybe they weren't sure forcing everyone to v2 the same release they enabled it was a good idea and phased it in as an optional feature initially20:44
corvusexecutors say: git version 2.20.120:44
corvusfungi: gotcha20:44
clarkbalso the linux kernel reported problems after it was enabled by default20:44
clarkbbut I was never able to track that down to any conclusion20:45
clarkb(more data than expected was transferred in their requests, maybe something pathological with the size of the linux kernel?)20:45
mordreddoes gitpython support it?20:45
fungiif our container images are debian-based, they need to use buster-backports or be built on testing/bullseye20:46
corvushopefully 2.20.1 doesn't have any issues with v2 that are fixed in later versions20:46
fungito get a new enough git to have it on by default20:46
corvusmordred: i don't *think* (but i'm not positive) that gitpython does any network stuff20:46
mordrednod20:46
fungiand yeah, 2.20 was the version in buster20:46
fungi(current stable)20:46
corvusmordred: i think that's all outsourced to a spawned git process20:46
mordredthe images are debian based - it wouldn't be too hard to add backports and get newer git20:46
mordredoh - they're stretch20:47
fungiright, i would say we enable v2 in the .gitconfig and then if we see bugs turn it off and consider building new images with buster-backports to pull the 2.29 available there20:47
mordredwait- my local images are likely stale20:47
fungithey're stretch (oldstable)?20:47
corvusthey're "10"20:47
mordredpulling and re-checking20:47
corvuswhich i think is 'buster'20:48
mordredin the old images stretch-backports is already in sources.list20:48
fungi10 is buster, yes20:48
mordredso it should just be a matter of adding the pin20:48
clarkblooking at https://lore.kernel.org/lkml/xmqqzh9mu4my.fsf@gitster.c.googlers.com/ it seems there were a few server side issues that git fixed on the v2 protocol20:48
clarkbbut since gerrit uses jgit those issues shouldn't apply (there may be other issues)20:48
clarkbit sounds like as of 2.27 on the server side they are happy with the linux kernel?20:49
corvusmordred: buster-backports has 2.2920:49
mordredcool20:49
corvusas does bullseye/testing20:49
fungisame as what's in sid right now20:49
clarkbfungi: corvus if buster backports is already carrying 2.29 would it be better to just use that rather than 2.20.1 + config?20:50
ianwperhaps we should enable it, and just hand-install it on one or two executor images for a few hours first?20:50
clarkbianw: oh thats another option ya, via exec set the git config and see how it does20:50
corvusclarkb: i think so20:50
clarkbI've got 2.29.2 locally and am able to currently interact with gerrit20:51
fungiif we're really worried about v2 bugs solved after 2.20 then using buster-backports for the git package sounds reasonable20:51
mordredpoo. buster-backports is not in the current images20:51
mordredso we need to add it to sources.list and install git with the pin20:51
fungiand yeah, i'm running 2.29.2 (from sid) as well20:51
corvusmordred: agreed it's not there20:51
corvussounds like .gitconfig may be easier then20:52
corvusi like the idea of enabling it in gerrit, and writing .gitconfig on some executors20:52
openstackgerritMerged opendev/gear master: Bump crypto requirement to accomodate security standards  https://review.opendev.org/c/opendev/gear/+/74211720:52
clarkbya I like that suggestion from ianw too20:52
corvusand if that works, just do .gitconfig in zuul, and don't worry about adding backports/upgrading unless we find we need >2.2020:53
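The per-executor piece is tiny; a sketch of what writing that .gitconfig amounts to (equivalent to the "git config --global protocol.version 2" mentioned earlier), assuming the zuul user's home is /var/lib/zuul as seen in the ps output above:

    # /var/lib/zuul/.gitconfig
    [protocol]
            version = 2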
ianw++ on one thing at a time :)20:53
fungiright, it's easy enough to switch back to v1 protocol temporarily while we solve that if necessary20:53
corvus766365+220:54
mordredcorvus: yah - although doing the backports makes the zuul images dtrt without config out of the box - so still might want to consider it20:54
mordred but as a followup I'd imagine after testing20:54
fungistill no new errors on ze12 so i think blowing away that nova repo it had did the trick20:54
fungimordred: or we could wait for bullseye sometime mid/late-202120:55
ianwmordred: oh yeah, definitely let's fix the image once we know it doesn't instant-explode :)20:55
mordredfungi: you're assuming we won't have been eaten by space aliens by mid/late-202120:56
fungiwell, i mean, that was my plan, but i wasn't going to let on20:56
fungiand yeah 2021-03-12 is the penultimate freeze phase, full freeze phase date still tbd20:57
fungiso who knows, bullseye could release in june or it could be another potato20:57
ianwanother thing i saw discussion on, has anyone notified the powerkvm  people about the constant loop?20:58
fungiianw: not yet i don't think20:58
fungisome complication there is that i see their account connecting from quite a few different addresses so we can either block all those or disable the account i guess20:58
fungibut at first look it seems like they have multiple ci systems sharing a single gerrit account20:59
ianwi guess let me try find the contact and send mail, and we can disable after that if no response20:59
fungior it could be they have that many zuul executors i suppose20:59
fungibut the connections seem persistent, more like stream-events listeners20:59
ianwunlike our executors though, there's only 1/2 connections ... we had one for each executor21:00
fungipresumably clarkb already contacted them about switching to https21:00
ianwso is it that git v2 will not require cloning the entire 2gb repo?  i haven't quite followed that bit21:00
clarkbhrm my email search is not showing that I did21:01
fungiso may know who to reach out to if the wiki isn't providing good contact info21:01
clarkbianw: v2 adds more negotiation so that clients can say I want to update master and then the remote won't care about other refs21:01
fungioh, maybe they're using https for queries but fetching via ssh?21:01
clarkbianw: the older protocols have both sides exchange the refs they know about and since gerrit puts changes in refs it creates trouble21:01
clarkbfungi: ya that could be21:01
clarkbianw: it basically streamlines the initial negotiation of what data a client wants to get21:01
fungior maybe they just don't query often because they limit the conditions they trigger on21:02
fungiand so didn't turn up in the snapshots of the queue21:02
ianwMichael Turek (mjturek@us.ibm.com) from the wiki https://wiki.openstack.org/wiki/ThirdPartySystems/IBMPowerKVMCI21:03
ianwi'll send mail cc to discuss soon21:03
clarkbianw: thank you21:03
fungii've approved 766365 and can do or help with a gerrit restart later once it's deployed21:03
fungii have a few chores i need to get to and dinner to cook in the meantime21:04
ianwclarkb: yeah ... so my local git clone of "git clone https://opendev.org/openstack/nova" is only 160mb ... is that because gitea is v2 enabled?21:04
clarkbianw: maybe?21:05
mordredwe should enable v2 in dib21:05
ianwi couldn't quite figure why zuul was getting these multi gb clones21:05
mordredif that's the case21:05
ianwmordred: or a base job role, but yeah21:05
mordredfor building the repo cache21:05
mordredwell - both21:05
ianwohhh, yes indeed21:05
mordredin dib for repo cache build - and in base role for, you know, all the things21:06
ianwi get the same thing with "git clone ssh://iwienand@review.opendev.org:29418/openstack/nova"21:07
ianwoh, no, maybe not.  same number of objects, but http://paste.openstack.org/show/800957/21:09
fungiremember zuul also fetches all branches and tags21:15
fungiso lots more refs21:15
fungiit's not just a simple git clone and done21:15
clarkbfungi: ya but it should be able to ignore all the refs/changes/* stuff so still better?21:15
fungiin theory21:15
mordredfungi: tags? or just all branches?21:16
*** slaweq has quit IRC21:16
ianwfungi: yeah, but the command that was hanging is just "git clone"?  anyway, if i clone via ssh i get the 2gb repo21:16
fungimordred: mmm, maybe just branches yeah, i was looking at the log of nova getting cloned and thought i saw it also fetching tag refs at the end, but... maybe only ones reachable from a branch21:17
corvusianw: do you think nova should be < 2gb?21:19
ianwcorvus: when i clone it via http it is 165mb21:19
corvusfascinating21:19
mordred165mb is < 2gb21:19
corvusmaybe even <<21:20
ianwhttps://docs.gitlab.com/ee/administration/git_protocol.html has some good info on finding what version your client is using21:20
mordredby, you know, an actual order of magnitude21:20
ianwGIT_TRACE_PACKET=1 git -c protocol.version=2  ls-remote https://opendev.org/openstack/nova 2>&1 | head21:21
ianwdoes not suggest gitea is git v221:21
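For anyone reproducing the check: with a v2-capable server the packet trace starts with a "version 2" capability advertisement, whereas an old-protocol server replies with the full ref listing straight away (a sketch, running the same command against both endpoints):

    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://opendev.org/openstack/nova 2>&1 | head
    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://review.opendev.org/openstack/nova 2>&1 | head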
corvusanyone done an https clone from gerrit?21:22
openstackgerritMerged opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649921:23
ianw(^ i will manually do that under lock and confirm free space)21:23
fungii'm trying a plain git clone of nova from gerrit via https21:23
fungiwill report21:23
corvusme too21:23
fungithen we can compare notes ;)21:24
corvusit's >160mb so far, so i assume it'll end up just like the ssh one at 2gb21:24
ianwI note that my clone from gitea had "remote: Enumerating objects: 587205, done." which i didn't see in the gerrit one?21:25
ianwGIT_TRACE_PACKET=1 git -c http.sslVerify=false  -c protocol.version=2  ls-remote https://review-test.opendev.org/openstack/nova 2>&1 | head looks v2-y21:27
ianwhowever, cloning that it is still a 2gb repo21:30
clarkbmy understanding is it shouldn't change the amount of data as much as the negotiations21:30
ianwI just did opendev http again to ensure i'm not nuts -- "Receiving objects: 100% (587205/587205), 164.90 MiB | 4.39 MiB/s, done"21:31
corvusReceiving objects: 100% (587205/587205), 1.01 GiB | 2.56 MiB/s, done.21:31
corvusthat's gerrit http for me21:31
corvuslarger than 160mb but smaller than 2g  :/21:32
ianwianw@ze07:/var/lib/zuul/executor-git/opendev.org/openstack$ du -h -s nova21:33
ianw1.9G    nova21:33
ianwthat was where i got the 1.9g number from21:33
corvusthat may have more branches/tags21:33
ianwi agree, i get 1.01gb on a v2 http clone from review-test as well21:33
corvushrm, my local clone has all the branches/tags21:35
clarkbthe executors may not be packed?21:36
corvusmaybe the .8g is due to extra changes and some amount of expansion due to merges, etc?21:36
corvuseither way, 1.1g is still 10x 165m21:37
fungimy clone is still underway21:37
*** eharney has quit IRC21:38
fungiyeah, 1.1gb21:39
fungiand i have all tags, including unreachable eol tags21:39
fungiwithout expressly fetching21:39
fungii also seem to have notes refs21:41
corvusfungi: you do?  i don't see them in mine21:43
corvusat least, not in packed-refs or refs/21:43
fungimy .gitconfig is configured with notes.displayRef=refs/notes/review so i see them in git log21:45
fungii also have [remote "origin"] fetch = +refs/notes/*:refs/notes/*21:45
fungiso maybe it grabbed them during cloning21:45
corvusah, i don't have that21:45
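For reference, a reconstruction of the config fungi is describing; whether it lives in ~/.gitconfig or the clone's own .git/config isn't stated, and the heads refspec is assumed:

    [notes]
        displayRef = refs/notes/review
    [remote "origin"]
        fetch = +refs/heads/*:refs/remotes/origin/*
        fetch = +refs/notes/*:refs/notes/*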
ianwapropos nothing, the yum-puppetlabs volume has run out of quota, i'll up it21:57
openstackgerritMerged opendev/system-config master: Enable git protocol v2 on gerrit  https://review.opendev.org/c/opendev/system-config/+/76636522:09
*** mlavalle has joined #opendev22:13
TheJuliafwiw, this afternoon gerrit has seemed even almost... snappy22:15
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630022:19
openstackgerritIan Wienand proposed opendev/system-config master: WIP: remove all bup bits  https://review.opendev.org/c/opendev/system-config/+/76663022:19
ianwthe centos8 stream initial mirror sync is running in a screen on mirror-update22:21
ianwok, here's something weird22:30
clarkb?22:30
ianwthe list of packed objects, gathered via22:30
ianwfor p in pack/pack-*([0-9a-f]).idx ; do     git show-index < $p | cut -f 2 -d ' '; done > packed-objs.txt22:30
ianwis the same between my 1GB clone from gerrit and my 165mb clone from gitea22:31
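A sketch of how that comparison can be done end to end (the clone paths are hypothetical):

    # dump and sort the object ids in each clone's packs, then diff the lists
    for repo in ~/nova-from-gerrit ~/nova-from-gitea; do
        (cd "$repo/.git/objects" &&
         for p in pack/pack-*.idx; do git show-index < "$p"; done |
         cut -d' ' -f2 | sort > "/tmp/$(basename "$repo").objs")
    done
    diff /tmp/nova-from-gerrit.objs /tmp/nova-from-gitea.objs   # no output => same objects, only the delta compression differs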
clarkbdifferent compression types maybe?22:31
clarkb(does git do that?)22:31
ianw181M    ./objects/pack  |  1.1G    ./objects/pack22:33
clarkbor maybe you fetch a pack but not all of its content?22:33
ianwgit repack -a -d -F --window=350 --depth=250 has my fans spinning22:36
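For reference, roughly what those flags mean (paraphrasing the git-repack man page):

    git repack -a -d -F --window=350 --depth=250
    # -a                repack everything into a single pack
    # -d                delete the now-redundant old packs
    # -F                recompute every delta from scratch (--no-reuse-object),
    #                   which is the slow, CPU-hungry part
    # --window/--depth  how many candidate objects to try as delta bases and how
    #                   long a delta chain may get; larger = smaller pack, more CPU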
*** adrian-a has quit IRC22:38
ianwhttps://github.com/emanuelez/gerrit/blob/master/templates/default/scripts/repack-repositories.sh22:40
ianw# Gerrit Code Review however does not automatically repack its managed22:41
ianw# repositories.22:41
ianw# review.source.android.com runs the following script periodically,22:41
ianw# depending on how many changes the site is getting, but on average22:41
ianw# about once every two weeks:22:41
clarkbianw: correct we run a git gc daily22:41
ianwis this still true?22:41
ianwdoes that repack?22:41
clarkbyou can tell gerrit to do the gc'ing but jgit gc is single threaded and slow so we do it out of band22:41
clarkbya aiui gc implies packing (not sure if necessarily repack)22:41
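A hedged sketch of the knobs involved (the values are illustrative, not our production settings): a plain git gc ends up running git repack -d -l, which honours pack.window / pack.depth from the repo config, while the aggressive path uses its own pair:

    git config pack.window 250          # picked up by plain gc's repack
    git config pack.depth 50
    git config gc.aggressiveWindow 250  # only used with --aggressive
    git config gc.aggressiveDepth 50
    git gc --prune=now
    # note: without repack -f/-F existing deltas are reused, so these settings
    # mostly affect newly packed objects rather than shrinking an old pack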
ianwperhaps we should be setting those pack options from that script?22:43
clarkbpossibly? or maybe newer git would do better (I'm assuming that gitea's more up to date git may be why it is smaller?)22:44
clarkbianw: I think the current gc happens on the host side not the container side, but perhaps if we converted it to execing into the container's git we'd get better results22:45
clarkbianw: we can test with review-test for that22:45
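A quick way to confirm the version gap clarkb is suggesting (the container name is a placeholder):

    git --version                                  # host git doing today's gc
    docker exec <gerrit-container> git --version   # git shipped in the gerrit image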
ianwmy very old laptop is still churning :)22:47
ianw$ du -hs .git22:47
ianw142M    .git22:47
ianweven smaller than gitea22:48
clarkbwow22:48
clarkbcertainly seems worth investigating, if review-test gc using the container git gets us close that is probably the simplest thing in terms of moving parts22:48
ianwyep, ok, good well at least that explains the difference!  i thought i was going nuts22:49
clarkbianw: putting something about this on the meeting agenda for if/when we have the next meeting might be a good reminder assuming we don't do anything sooner22:49
ianwi'll try a gc with the container git on review-test.  but i think we may need to update the script to do the settings like in that repack-repos script22:49
clarkbya it is possible that newer git alone isn't enough (but again gitea isn't doing anything special either just newer git aiui)22:50
ianw$ git gc --auto --prune in the container doesn't seem to do anything22:54
ianwi'm going to try setting those options22:54
ianwok, it doesn't want to do anything with "--auto"22:58
ianwi will try two things; running with the git in the container, then adding the options from that script and running again.  see what's smaller22:59
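A sketch of those two runs (container name, repo path and option values are all placeholders/assumptions, not the real review-test ones):

    # run 1: just the container's newer git
    docker exec <gerrit-container> \
        git -C /var/gerrit/git/openstack/nova.git gc --prune=now
    # run 2: set pack options on the repo first, then gc again and compare du -sh
    docker exec <gerrit-container> sh -c '
        cd /var/gerrit/git/openstack/nova.git &&
        git config pack.window 250 &&
        git config pack.depth 50 &&
        git gc --prune=now'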
ianwwith no packing options; 962M23:08
ianwadding all the options from https://github.com/emanuelez/gerrit/blob/master/templates/default/scripts/repack-repositories.sh and running gc --prune has not made a difference23:12
*** smcginnis has quit IRC23:16
ianwtracking this @ https://etherpad.opendev.org/p/pu_RvmPeym2A7JZmOV4Q23:18
* melwitt wishes the zuul console tab had a floating shortcut button to the top of the page23:32
fungithe git protocol v2 change was deployed to production as of 22:13 so we can restart the service to pick it up once things calm down a bit more23:37
ianwmelwitt: it's actually *really* easy to setup a dev environment to fiddle with the webui :)23:37
* melwitt rolls up sleeves23:37
melwittcool. I'll give it a go. later. when our gate stops being on fire23:38
ianwmelwitt: i'm not trying to be surly :)  it can be kind of fun hacking on the UI, you at least get to see results in front of you23:38
melwittI didn't take it as surly, I was just trying to be funny23:38
melwittI'll try it later. I like working on new stuff23:40
fungithere are certainly times where i wish i didn't always have so much new stuff to work on. i suppose i should be thankful ;)23:40
ianwyeah, i took it as a bit of an opportunity to understand react/2020 javascript a bit more.  i'm still rather useless, but it is something worth knowing23:40
melwitt:)23:42
*** smcginnis has joined #opendev23:42
ianwthis 16x big server doesn't seem to be much faster compressing git trees than my skylake laptop23:44
fungii can confirm du says my nova clone from gitea is far smaller than my nova clone from gerrit23:45
fungi250mb vs 1.1gb23:45
ianwfungi: yeah, i've narrowed it down to the packing, and am trying different things in https://etherpad.opendev.org/p/pu_RvmPeym2A7JZmOV4Q23:46
fungii do still get the git notes straight away without fetching them separately, even when cloning from gitea23:47
ianwyeah, it's all there; i dumped the objects in the pack files and they're exactly the same23:48
*** smcginnis has quit IRC23:49
ianwthat you can have an order of magnitude difference in the repo size modulo fairly obscure techniques is ... well just very git like i guess23:49
fungigit gonna git23:52
*** tosky has quit IRC23:53
ianwok, an explicit repack in the gerrit container barely does anything.  the same thing on my local laptop is the 10x shrink.  so something later gits do better?23:55
corvusmelwitt: https://zuul-ci.org/docs/zuul/reference/developer/javascript.html#for-the-impatient-who-don-t-want-deal-with-javascript-toolchains23:55
corvusmelwitt: i just do what that tells me to every time i need to do something in zuul's js :)23:56
melwittimpatient, that's me!23:56
melwittthanks23:56

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!