Thursday, 2020-12-10

*** tosky has quit IRC00:02
*** DSpider has quit IRC00:11
openstackgerritMerged opendev/system-config master: graphite: also deny account page  https://review.opendev.org/c/opendev/system-config/+/76631800:12
*** mlavalle has quit IRC00:30
*** brinzhang has joined #opendev00:38
johnsomSomething is up with zuul. A bunch of jobs are stuck in the blue swirly00:40
johnsomMy job has been sitting that way for 21 minutes00:40
johnsomish00:40
fungihttps://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=100:58
fungilooks like no executors are accepting jobs now00:58
corvusit's possible we have a bunch of hung git processes01:01
corvuszuuld    29738  0.4  0.0  13088  6564 ?        S    00:38   0:06 ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 zuul@review.opendev.org git-upload-pack '/openstack/nova'01:02
corvusfrom ze0101:02
fungithat would make sense01:03
corvusi'll try to kill those git procs01:03
fungialso explains why ianw earlier observed one executor taking most of the active builds... the rest were probably already all stuck01:03
ianwyeah; corvus let me know if i can help01:12
corvusianw: thanks; i just went through a ps listing and killed all the upload-packs that started in the 00:00 hour01:16
corvushopefully that got them all01:16
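A rough sketch of the cleanup described above, assuming standard ps/grep/kill; the PID in the comment is the one from the earlier paste and is only an example:

    # list the hung upload-pack processes with their start times
    ps -eo pid,lstart,etime,args | grep '[g]it-upload-pack'
    # then kill the ones whose start time falls in the stuck window, e.g.
    # kill 29738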
corvusi do see https read timeouts against gerrit from the zuul scheduler01:17
corvusi think gerrit's dead again01:17
corvusyeah, continuous gc operations that are slow and ineffective: 49102M->49102M01:18
fungii'll go ahead and down/up the container again in that case01:19
fungii don't think it's recovering01:19
fungiand restarted again01:20
*** ysandeep|away is now known as ysandeep01:54
ianwnode allocation still seems quite slow02:03
ianwzuuld    21496 24191  1 02:24 ?        00:00:01 git clone ssh://zuul@review.opendev.org:29418/openstack/nova /var/lib/zuul/executor-git/opendev.org/openstack/nova02:27
ianwthis appears stopped02:27
ianwze02 has a bunch of "git cat-file --batch-check"02:28
ianwi'm starting to think an executor restart cycle might be a good idea02:28
ianwze03 has similar git calls going on02:28
ianwi'm in a bit of a hail mary here.  i'm going to stop the executors from bridge playbook and see if i can clear gerrit, and then restart and see if it gets stuck on nova again02:40
*** priteau has quit IRC03:03
*** ysandeep is now known as ysandeep|session03:19
openstackgerritmelanie witt proposed opendev/base-jobs master: Revert "Exclude neutron q-svc logs from indexing"  https://review.opendev.org/c/opendev/base-jobs/+/76639903:19
*** openstackgerrit has quit IRC03:22
ianw2020-12-10 02:46:57,288 DEBUG zuul.Repo: [e: ccab1fab1ca149a1a1e61aab06013f4f] [build: ef857c597f7f4dfabe37f10dea33f9e1] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianw2020-12-10 03:18:32,073 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: 6991e3c93e144b6b869bbb34bb8d73a6] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianw2020-12-10 03:37:22,547 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: a03f9ace9dce43269dc25fcbb6c6b959] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova03:38
ianwthe executors are making things worse i think.  they are in a 300 second timeout loop trying to clone nova; it fails, and they try again03:38
*** hamalq_ has quit IRC03:45
ianwi'm trying a manual update of the git_timeout on ze07 and seeing if its clone can work03:52
fungigerrit itself seems okay03:55
fungi300 seconds is a rather long time to complete a clone operation though03:56
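For context, the timeout being discussed is zuul's git operation timeout. A minimal sketch of the end state on an executor, assuming git_timeout (default 300 seconds) lives in the [merger] section of zuul.conf, which executors also read; the actual bump went in through system-config (change 766400 below):

    # /etc/zuul/zuul.conf (excerpt)
    [merger]
    git_timeout=600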
*** user_19173783170 has joined #opendev03:56
user_19173783170How can I change my gerrit account username03:57
ianwfungi: it's going about as fast as it can03:57
*** zbr has quit IRC03:57
fungiuser_19173783170: gerrit usernames are immutable once set03:57
ianwit's up to 1.9G03:58
ianw(i think it is)03:58
*** zbr has joined #opendev03:58
ianwok, it took about 7 minutes03:59
fungiianw: do you think it's because of the full executor restart clearing the extant copies?04:00
ianwfungi: I don't think the executors have managed to get these bigger repos since i guess nodedb04:00
ianwnotedb04:00
ianwthey all try to update it, get killed, rinse repeat04:00
fungioh, like the executors have had incomplete nova clones since the upgrade?04:01
fungiyikes04:01
ianwmaybe, or they've got into a bad state and can't get out04:01
fungii wonder if the discussed update to turn on git v2 protocol would speed that up for them at all04:02
ianwi think the problem is the git in the executors isn't v2 by default, so it would require more fiddling.  it may help04:03
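The fiddling in question is small; on a git client older than 2.26 (where v2 became the default) protocol v2 can be opted into per-user or per-invocation, for example:

    git config --global protocol.version 2
    # or for a single command:
    git -c protocol.version=2 fetch origin master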
ianwit looks like we lost gerritbot04:08
ianwfungi: https://review.opendev.org/c/opendev/system-config/+/76640004:10
fungiianw: have you restarted gerritbot, or shall i?04:21
ianwfungi: i have04:21
fungithanks!04:22
fungii've approved the timeout bump to 10 minutes, but need to drop offline at this point so won't be around to see it merge04:25
*** openstackgerrit has joined #opendev04:59
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630004:59
*** user_19173783170 has quit IRC05:26
openstackgerritMerged opendev/system-config master: zuul: increase git timeout  https://review.opendev.org/c/opendev/system-config/+/76640005:37
*** user_19173783170 has joined #opendev05:38
*** user_19173783170 has quit IRC05:42
*** cloudnull has quit IRC05:56
*** cloudnull has joined #opendev05:57
*** marios has joined #opendev06:02
*** marios is now known as marios|rover06:03
*** ShadowJonathan has quit IRC06:11
*** rpittau|afk has quit IRC06:11
*** ShadowJonathan has joined #opendev06:11
*** mnaser has quit IRC06:11
*** mnaser has joined #opendev06:12
*** rpittau|afk has joined #opendev06:12
*** ysandeep|session is now known as ysandeep06:28
*** jaicaa has quit IRC06:31
*** jaicaa has joined #opendev06:34
*** sboyron has joined #opendev06:34
*** marios|rover has quit IRC06:37
*** lpetrut has joined #opendev07:00
*** ysandeep is now known as ysandeep|sick07:03
*** slaweq has joined #opendev07:17
*** marios has joined #opendev07:18
*** marios is now known as marios|rover07:19
*** ralonsoh has joined #opendev07:25
frickleroh, I think I finally found the meaning of "CC" in gerrit vs. "Reviewer": the former is someone who leaves a comment but with code-review=0. doing some non-zero cr promotes a cc to reviewer. sorry for the noise if that was discussed already or well-known07:27
*** eolivare has joined #opendev07:29
*** noonedeadpunk has joined #opendev07:32
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596307:33
*** hamalq has joined #opendev08:06
*** andrewbonney has joined #opendev08:11
*** rpittau|afk is now known as rpittau08:17
*** marios|rover has quit IRC08:25
zbrclarkb: for the record, again "Credentials expired".08:33
*** marios|rover has joined #opendev08:34
*** larainema has joined #opendev08:35
*** hashar has joined #opendev08:39
*** tosky has joined #opendev08:47
*** zbr has quit IRC09:10
*** zbr has joined #opendev09:11
*** zbr has quit IRC09:13
*** zbr has joined #opendev09:13
mnasiadkaMorning09:33
mnasiadkaIt looks like some of the gitea servers have an expired certificate09:33
*** zbr has quit IRC09:35
*** zbr has joined #opendev09:37
*** zbr has quit IRC09:40
*** zbr has joined #opendev09:40
fricklermnasiadka: I checked all 8 and they look fine to me, can you be more specific?09:42
mnasiadkahttps://www.irccloud.com/pastebin/hvNCiiok/09:44
mnasiadkaonce in a while I get this ^^09:44
mnasiadkahttps://www.irccloud.com/pastebin/vsKrkUGp/09:45
mnasiadkaverbose output ^^09:45
fricklermnasiadka: can you run "openssl s_client -connect opendev.org:443" and show the output on paste.openstack.org? also please verify that your local clock is correct09:47
mnasiadkafrickler: sure, let me try - clock is correct, but it happens on one out of 10 tries I think09:48
mnasiadkahttp://paste.openstack.org/show/800924/09:49
*** DSpider has joined #opendev09:51
mnasiadkafrickler: so obviously sometimes I get an old cert, question why :)09:52
*** zbr has quit IRC09:54
fricklermnasiadka: with apache we have sometimes seen single workers not updating correctly, not sure if something similar can happen with gitea09:54
fricklerinfra-root: ^^ there's also a lot of "Z"ed subprocesses from gitea. I'll leave it in this state for now until someone else can have a second look, otherwise I'd suggest to just restart the gitea-web container09:55
*** zbr has joined #opendev09:56
*** ralonsoh_ has joined #opendev10:00
*** ralonsoh has quit IRC10:01
*** hamalq has quit IRC10:07
openstackgerritIlles Elod proposed zuul/zuul-jobs master: Add option to constrain tox and its dependencies  https://review.opendev.org/c/zuul/zuul-jobs/+/76644110:23
*** ralonsoh_ is now known as ralonsoh10:33
*** sboyron_ has joined #opendev10:43
*** sboyron has quit IRC10:45
*** hashar is now known as hasharLunch10:47
*** zbr has quit IRC10:54
*** zbr has joined #opendev10:56
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Increase flake8 version in lower-constraints  https://review.opendev.org/c/openstack/diskimage-builder/+/76644710:56
*** hamalq has joined #opendev11:09
*** hamalq has quit IRC11:14
*** fressi has joined #opendev11:23
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Increase flake8 and pyflakes version in lower-constraints.txt  https://review.opendev.org/c/openstack/diskimage-builder/+/76644711:25
*** dtantsur|afk is now known as dtantsur11:33
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596311:35
*** larainema has quit IRC11:44
openstackgerritIlles Elod proposed zuul/zuul-jobs master: Add option to constrain tox and its dependencies  https://review.opendev.org/c/zuul/zuul-jobs/+/76644111:46
sshnaidmdoes anyone know - which devstack jobs branches are supported now?11:51
*** zbr has quit IRC11:58
*** zbr has joined #opendev11:59
*** zbr has quit IRC12:10
*** zbr has joined #opendev12:10
*** larainema has joined #opendev12:12
*** zbr has quit IRC12:18
*** zbr has joined #opendev12:20
*** fressi_ has joined #opendev12:26
*** fressi has quit IRC12:27
*** fressi_ is now known as fressi12:27
*** hamalq has joined #opendev12:28
*** hamalq has quit IRC12:33
openstackgerritchandan kumar proposed openstack/diskimage-builder master: Enable dracut list installed modules  https://review.opendev.org/c/openstack/diskimage-builder/+/76623212:50
*** zbr has quit IRC13:01
*** zbr has joined #opendev13:03
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596313:08
*** sshnaidm has quit IRC13:09
*** sshnaidm has joined #opendev13:09
*** priteau has joined #opendev13:10
*** zbr has quit IRC13:21
*** zbr has joined #opendev13:24
*** slaweq has quit IRC13:29
*** zbr has quit IRC13:30
*** zbr has joined #opendev13:32
*** slaweq has joined #opendev13:33
*** sboyron__ has joined #opendev13:34
*** sboyron_ has quit IRC13:35
*** sboyron__ is now known as sboyron13:41
*** zigo has joined #opendev13:51
*** hasharLunch is now known as hashar14:02
*** zbr has quit IRC14:05
*** zbr has joined #opendev14:08
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: ensure-pip: install virtualenv, it is still used  https://review.opendev.org/c/zuul/zuul-jobs/+/76647714:19
fungifrickler: we have apache running on the gitea servers (used so we can filter specific abusive bots based on user agent strings), so the apache behavior can certainly apply there14:29
*** hamalq has joined #opendev14:29
fungisshnaidm: it's the openstack qa team who support and maintain devstack, so people in #openstack-qa are going to be best positioned to answer your question14:30
sshnaidmfungi, ack14:31
*** hamalq_ has joined #opendev14:32
*** hamalq has quit IRC14:34
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596314:34
*** hamalq_ has quit IRC14:36
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596314:47
*** fressi has quit IRC14:57
openstackgerritMerged openstack/project-config master: New Project Request: airship/vino  https://review.opendev.org/c/openstack/project-config/+/76388914:58
openstackgerritMerged openstack/project-config master: New Project Request: airship/sip  https://review.opendev.org/c/openstack/project-config/+/76388814:58
openstackgerritHervĂ© Beraud proposed opendev/irc-meetings master: Switch release team to 1700 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649014:59
openstackgerritHervĂ© Beraud proposed opendev/irc-meetings master: Switch oslo team to 1600 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649315:04
*** adrian-a has joined #opendev15:06
sshnaidmis it possible to see in job somewhere what exactly was returned in zuul_return?15:07
sshnaidmlike here https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/15:07
sshnaidmfungi, ^15:07
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649615:12
corvussshnaidm: not automatically, no; you could add a debug task, or copy the zuul return file into the logs dir15:13
sshnaidmcorvus, ok, just wanted to ensure it's the same..15:13
fungidoing it automatically for all jobs isn't a good idea because they could, for example, pass secrets between them15:13
sshnaidmack15:13
sshnaidmI think I have a problem that child job gets return data from different parent15:14
sshnaidmis it possible?15:14
corvussshnaidm: if it depends on multiple jobs and they both return the same vars, yeah, one of them is going to win15:15
sshnaidmcorvus, nope, it's one parent and multiple consumers15:15
corvussshnaidm: check the inventory file for the consumers to see what they received15:16
sshnaidmcorvus, yeah, what I did15:16
corvussshnaidm: the variable is there but the value is not what you expected?15:16
sshnaidmcorvus, yes, value is different15:16
sshnaidmlike there was a different parent job running somewhere..15:17
corvussshnaidm: what consumer job, and what variable?15:17
sshnaidmcorvus, consumer job gets IP of parent which deploys container registry: https://zuul.opendev.org/t/openstack/build/726aa3eb8c0640f4987272649c2a1040/log/zuul-info/inventory.yaml#5415:17
sshnaidmcorvus, and this is a parent: https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/logs  which has different IP address and pass something else: https://zuul.opendev.org/t/openstack/build/713e89ddf21e4e888fce5416d5c8a028/log/job-output.txt#459015:19
sshnaidmin most cases it works fine15:19
sshnaidmbut recently we started to see such mess15:19
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596315:21
corvus2020-12-10 05:32:31,504 DEBUG zuul.ExecutorClient: Build <gear.Job 0x7fe48cfce460 handle: b'H:::ffff:127.0.0.1:2433698' name: executor:execute unique: cfebd9f976594e64b8757152b4633e87> update {'paused': True, 'data': {'zuul': {'pause': True}, 'provider_dlrn_hash_branch': {'master': 'cd51c9b4a10fea745ad818e64de40a7f'}, 'provider_dlrn_hash_tag_branch': {'master': 'cd51c9b4a10fea745ad818e64de40a7f'},15:22
corvus'provider_job_branch': 'master', 'registry_ip_address_branch': {'master': '188.95.227.214'}}}15:22
corvussshnaidm: ^15:22
corvussshnaidm: that was from this build: https://zuul.opendev.org/t/openstack/build/cfebd9f976594e64b8757152b4633e8715:23
corvuswhich was retried15:24
gibido we have some ongoing gerrit issue? we see a lot of "ERROR Failed to update project None in 3s" messages on patches recently15:24
corvusgibi: link?15:25
fungigibi: from today or prior?15:25
gibihttps://review.opendev.org/c/openstack/nova/+/76574915:26
gibithis is a pretty recent result ^^15:26
fungii think we expect those from ~monday through yesterday, but not since around 04:00 utc today hopefully15:26
fungiin particular our executors were failing to clone the nova repo from gerrit because it was taking longer than their 5-minute timeout15:27
fungiwe increased their git clone timeout to 10 minutes which allowed them to finally complete15:27
corvusour executors should [almost] never clone the nova repo from gerrit15:27
gibifungi: thanks, then we recheck15:28
corvuslike, they should do that once when they start for the first time15:28
corvusthen they should clone from their cache15:28
fungicorvus: ianw suspects they couldn't fully clone it following the upgrade when they got all the added notedb content, and have been looping ever since15:28
corvusfungi: then how did any nova job ever complete?15:29
fungithat i'm not sure about15:29
corvus(and do we really expect them to be pulling down notedb content?)15:29
corvusthis isn't adding up :/15:29
corvussshnaidm: i'm trying to figure out why that job retried15:30
fungicorvus: yeah, not sure, i'm mostly going by what ianw stated overnight at this point15:31
corvussshnaidm: i think it's because the executor was restarted (see my other conversation with fungi)15:31
sshnaidmcorvus, ack, so it's retries15:32
fungi(after a restart at least) the executors (possibly all the mergers?) were all looping trying to clone nova and then killing the git operation at 5 minutes and starting over, ianw tested cloning and it took 7 minutes, so once he increased the timeout to 10 minutes the executors all cloned nova successfully and resumed normal operation15:32
sshnaidmcorvus, does it mean the original paused job was aborted and a different one replaced it, while the child jobs remained as is?15:32
corvussshnaidm: possibly; or possibly the child jobs were replaced too but used the old data15:33
openstackgerritSorin Sbârnea proposed opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649915:48
*** adrian-a has quit IRC15:57
*** lpetrut has quit IRC15:57
*** zbr has quit IRC16:04
*** zbr has joined #opendev16:06
clarkbcorvus: fungi my understanding is that they shouldn't pull all the notedb content but without git protocol v2 enabled all the fetch (clone, etc) operations have to negotiate through those refs16:07
*** zbr has quit IRC16:07
corvusclarkb: is it possible that the word 'clone' is being used loosely here?16:08
*** zbr has joined #opendev16:08
fungiyeah, so maybe the nova repo's size and reduction in jgit/jetty performance in newer gerrit simply inched it over the clone timeout16:08
corvusie, maybe ianw saw fetches taking a really long time because of the lack of v2 negotiation?16:08
clarkbcorvus: ya that could be16:09
corvusbecause -- seriously -- if we're *actually* cloning that's an enormous drop-everything-and-fix-it regression in zuul16:09
corvusdoes anyone know what timeout ianw changed?16:10
fungi2020-12-10 03:18:32,073 DEBUG zuul.Repo: [e: fcd6dfe4875746d295e181af6a5e74aa] [build: 6991e3c93e144b6b869bbb34bb8d73a6] Resetting repository /var/lib/zuul/executor-git/opendev.org/openstack/nova16:10
fungithat's what he saw looping, getting killed at the 5-minute timeout, and then repeating16:10
fungicorvus: https://review.opendev.org/766400 i approved it before i fell asleep16:11
clarkbhttps://review.opendev.org/c/opendev/system-config/+/766365 is a change to enable git protocol v2 on gerrit and is currently enabled on review-test if epople want to test it16:12
corvusfungi: 'resetting repo' is normally a fetch16:12
fungicorvus: got it, so it was fetches taking longer than 300s i guess16:12
clarkb(I did rudimentary testing using the flags and command at https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html but talking to our review.o.o and review-test repos over https and ssh)16:12
corvusif it were cloning, we would see "Cloning from ... to ..."16:13
fungicould a fetch be as expensive as a clone if something happened to the local copy of the repo? like maybe from killing the stuck nova fetches you saw on the executors earlier?16:13
clarkbfungi: both fetches and clones using older protocols have to list all refs to negotiate what data to transfer16:14
corvuswe do remove the repos and re-clone if there's an error16:14
clarkbfungi: with the v2 protocol the client says "I want to fetch foo and bar" and only those refs are examined by both sides16:14
corvusianw's commit msg does say 'it times out.. they delete the directory'16:15
clarkbreally the only difference between a fetch and a clone on the old protocol is how much data you end up transferring after negotiation16:15
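A quick way to see that advertisement cost against a gerrit-hosted repo (a sketch; counts vary, but a repo with many changes advertises a very large number of refs/changes/* refs under the old protocol):

    # every one of these refs is part of the v0/v1 advertisement
    git ls-remote https://review.opendev.org/openstack/nova | wc -l
    # with v2 the client can ask the server for only the refs it cares about
    git -c protocol.version=2 ls-remote https://review.opendev.org/openstack/nova refs/heads/master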
clarkbcorvus: aha16:15
corvusso maybe that's what happened?  we got a bunch of errors, zuul couldn't trust the repos any more, deleted them, and then we fell back to actual cloning?16:15
clarkbya that seems plausible16:15
fungiyeah, it sounds like git v2 is a good next step to reduce the overall cost, the question was whether we need to update git on the executors to make that happen so didn't want to try when i was already nodding off16:15
clarkbespecially since during the gc period errors would have been seen by zuul I bet16:16
corvusfungi: executors are containerized now, so should be very recent git16:16
clarkbcorvus: fungi that was my assumption too re git versions16:16
clarkbputting things in containers makes that much better for us I expect16:16
corvusclarkb, fungi: yeah, so i think i'm happy that zuul is not b0rked, ianw's timeout change is a good interim change, git v2 is a good next step (and may allow us to revert that)16:17
corvusi also suspect that this may explain the increasingly short cycle time between gerrit failures we observed yesterday16:17
corvussince the load on gerrit from zuul would have progressed geometrically as these failed16:18
clarkbah yup16:18
fungiright, seems like there was some reinforcement between different causes of load there16:19
openstackgerritSorin Sbârnea proposed opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649916:22
*** zbr has quit IRC16:24
*** zbr has joined #opendev16:26
*** zbr has quit IRC16:29
*** zbr has joined #opendev16:29
*** hamalq has joined #opendev16:30
openstackgerritMerged opendev/irc-meetings master: Switch oslo team to 1600 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649316:33
openstackgerritMerged opendev/irc-meetings master: Switch release team to 1700 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/76649016:33
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649616:42
openstackgerritJan Zerebecki proposed zuul/zuul-jobs master: Switch from Debian Stretch to Buster  https://review.opendev.org/c/zuul/zuul-jobs/+/76649616:43
*** zbr has quit IRC16:46
*** zbr has joined #opendev16:49
*** zbr has quit IRC16:49
*** zbr has joined #opendev16:55
*** auristor has quit IRC16:55
*** auristor has joined #opendev16:55
*** auristor has quit IRC16:58
gibiI've just got another round of "ERROR Failed to update project None in" for Zuul in https://review.opendev.org/c/openstack/nova/+/76647116:58
clarkbI wonder if an executor or merger or three haven't gotten the timeout config update properly applied?17:00
clarkbfungi: corvus ^ I'm still not quite properly caught up on all that, do we know if all the services were restarted to pick that up (or if that is even required?)17:01
corvusclarkb: yes and yes17:02
zbri guess that mirror script is not covered by any CI test? https://review.opendev.org/c/opendev/system-config/+/766499/17:02
clarkbzbr: correct, because it can take many hours and a lot of bw to properly test it. And half the time it fails due to various things that we don't control17:03
clarkbcorvus: fungi in that case maybe the timeout isn't quite long enough for all cases?17:04
clarkb(sorry I'm still trying to boot the brain today)17:04
clarkbbut maybe it is better to enable git protocol v2 then reevaluate the timeouts rather than doing another full zuul restart?17:05
zbri could add a test for it and run in draft, but probably there are better ways to use my time17:07
fungicorvus: clarkb: could it be that ianw made sure the executors were okay but didn't think to check all the stand-alone mergers?17:08
corvusfungi: start times look right17:08
fungizbr: if we wanted to test that script we'd probably be better off mocking out the mirror(s) we're copying from and just using a local path or an rsync daemon running on the loopback17:09
*** auristor has joined #opendev17:10
zbrmy personal approach: add support for draft mode and add a test playbook that runs it, should be enough to get an idea that it should do something17:10
zbrand catch problems like syntax errors in bash17:11
zbrcopy does need to happen, but it will spot a bad url17:11
zbrdry is a decent way to validate the logic17:12
*** marios|rover has quit IRC17:12
fungiahh, though the failed to update errors are on individual builds so not the scheduler failing to get a ref constructed from one of the mergers17:13
zbri find it hard not to ask why we're not using ansible to perform the mirroring; i would have found the file easier to read and maintain at the cost of not seeing live output.17:13
fungizbr: these scripts are older than ansible17:14
fungiyou might as well ask why the linux kernel wasn't written in go17:14
fungiokay, so one of the failing builds (d3bb8ddb4ef3433dbd01b49d7dff202b) ran on ze1217:14
zbrfungi: should I bother to rewrite the mirroring script or not?17:17
fungizbr: hard to say, right now i'm more interested in figuring out what's happening with our zuul executors killing random jobs17:18
fungilooks like ze12 is raising InvalidGitRepositoryError when updating nova17:19
*** adrian-a has joined #opendev17:19
danpawliktobiash: hey, this one will not pass the gates: https://review.opendev.org/c/openstack/diskimage-builder/+/761857/ until this one is merged https://review.opendev.org/c/openstack/diskimage-builder/+/76644717:19
yoctozeptoqq - is zuul retrying jobs only when they fail in pre or always?17:20
clarkbyoctozepto: it will always retry failures in pre. It will retry failures in other run stages if ansible reports the error is due to network connectivity17:20
yoctozeptoclarkb: I see, that could be it17:21
clarkband it is a total of 3 retries for the job regardless of where the retry causing failures originate17:21
yoctozeptobecause the job seems to be taking quite long to get to another retry17:21
*** zbr has quit IRC17:21
clarkbyoctozepto: when you retry you end up at the end of the queue17:21
yoctozeptoyeah, it's in run again17:21
yoctozeptoyeah -> https://zuul.opendev.org/t/openstack/stream/cf4074e0343e4f0dbee8c0e7afaa0189?logfile=console.log17:21
clarkbavoiding retries is a really good idea :)17:22
fungiclarkb: corvus: here's what the traceback from that build looks like: http://paste.openstack.org/show/800945/17:23
clarkbfungi: corvus should we stop ze12, move the repo aside then manually reclone it?17:23
*** zbr has joined #opendev17:24
yoctozeptoclarkb: I'd love to; no idea why this one job is persistent in retrying17:24
fungiclarkb: corvus: i'm queued up to take the container down on ze12 if this isn't going to disrupt anyone else's troubleshooting17:24
fungilooks like i'm the only one logged into the server anyway17:25
clarkbI'm deferring to ya'll on this one I think17:25
fungii'm taking it down now17:25
clarkbyoctozepto: common problems like that are jobs modifying host networking in a way that causes it to stop working. tripleo has had jobs disable dhcp, lease runs out, doesn't renew and now no more ip address. Similarly if the host is accessed via ipv6 and you disable RAs you can lose the ipv6 address, etc17:26
fungize12 has logged 1007 git.exc.InvalidGitRepositoryError exceptions in the current executor-debug.log17:26
fungiall for /var/lib/zuul/executor-git/opendev.org/openstack/nova17:27
clarkbyoctozepto: less common is the job doing something that crashes the test node. Stuff like nested virt crashing hard17:27
fungii'm making a copy of that to fsck and see what might be wrong with it17:27
yoctozeptoclarkb: nah, we ain't doing that17:27
yoctozeptonor that17:27
yoctozeptoI will see whether the 3rd one succeeds17:27
yoctozeptoand what it fails on17:27
yoctozeptoI mean we have many flavours of scenarios17:27
clarkbfungi: ++17:28
yoctozeptono idea why this one is acting erratically this time17:28
fungiwell, first problem: du says /var/lib/zuul/executor-git/opendev.org/openstack/nova is a mere 84K17:28
clarkbya so probably it timed out and was killed while negotiating and didn't really transfer any data?17:29
clarkbI think that is the behavior you'd expect if git was in the process of sorting out what it needed to do17:29
fungiit looks like a fresh git init17:29
fungiexcept it's not even as clean as an init17:30
fungiso yeah, more like an interrupted clone at the very early stages17:30
fungiload average on gerrit is getting rather high again too17:31
clarkbfungi: its doing backups right now I think17:33
fungioh, yep17:33
*** hamalq has quit IRC17:34
fungiokay, so the good news is that git.exc.InvalidGitRepositoryError only appears in the current executor-debug.log of ze12, no other executors17:34
clarkbcorvus: fungi do you think we should manually clone nova for zuul then start it again on ze12?17:34
fungiand there it's only about the nova repo17:34
clarkband maybe double check it got the timeout config update17:35
*** zbr has quit IRC17:35
fungigit_timeout=600 appears in the /etc/zuul/zuul.conf there17:35
*** zbr has joined #opendev17:37
yoctozeptogerrit 502 for me17:37
yoctozeptosend help17:37
yoctozeptoinfra-root: gerrit really seems down17:40
yoctozeptoI believe it is suffering the same it was yesterday17:40
tobiashdanpawlik: thanks for the info!17:41
clarkbyoctozepto: it actually isn't but the end result is the same17:41
fungiyesterday we had some other activity going on which is not present in the logs today17:41
clarkbinfra-root if you look at top -H the gc threads are not monopolizing all the time17:41
yoctozeptoclarkb: yeah, speaking about observable stuff ;-)17:41
fungialso this time the load average has shot up to 10017:41
* yoctozepto super sad about this17:41
fungiseems the cpu utilization is almost entirely the java process, and there's no iowait to speak of17:42
yoctozeptoprogramming error?17:43
fungiquite a few third-party ci systems in the current show-queue output trying to fetch details on change 75384717:44
clarkbfungi: also powerkvm fetching nova17:44
clarkbit seems to be recovering too17:45
clarkbthis feels a lot more like the "normal" disruptions we've seen previously. Things get busy but then recover17:45
clarkbvs the GC insanity17:45
fungiyeah, load average is falling rapidly17:45
fungithis might have been some sort of thundering herd condition17:46
clarkbya17:46
clarkbmy best next suggestion to try related to that is to enable protocol v217:46
fungi5-minute load average is below 10 now17:47
yoctozeptoI am still getting timeout17:47
fungiyep, was going to get back to looking at ze12 if the gerrit crisis has passed17:47
clarkbyoctozepto: I believe that apache will return those for a short period17:47
fungiright, i'm not saying it's necessarily recovered yet but looks like it might be starting to catch its breath17:48
clarkbfungi: I think load is low because apache is or has said go away17:50
fungiload average is rather low now but the gerrit webui is still not returning content17:50
clarkbapache may have filled its open slots17:52
clarkb?17:52
mnaseris it possible one of the gitea backends ssl certs are expired17:52
mnasera local ci job: fatal: unable to access 'https://opendev.org/zuul/zuul-helm/': SSL certificate problem: certificate has expired17:53
clarkbmnaser: that was brought up earlier and frickler checked them and they were all fine. But every backend serves a unique cert with its name so you can check ouyself too17:53
mnaseri am double checking now17:53
yoctozeptoclarkb: I am not getting any response now though17:53
yoctozeptoit timeouts at the transport level17:53
clarkbyoctozepto: ya looking at apache everything is close wait, lask ack, or fin wait17:53
yoctozeptoack17:53
fungiclarkb: however as i pointed out, if it's a stale apache worker serving the old ssl cert which had been rotated, you won't necessarily hit that worker when you test the server17:53
clarkbfungi: I think gitea serves the cert17:54
fungithen how are we doing ua filtering in the apache layer?17:54
clarkboh except if the filtering I guess apache would have to serve it? so ya17:54
clarkbfungi: ya you're right apache must be doing it now I guess17:54
clarkbfungi: should we restart apache on review? I don't understand why it seems to be doing not much17:54
fungii can give that a shot17:55
*** zbr has quit IRC17:55
mnaserfor i in `seq 1 9`; do curl -vv -s "https://opendev.org" 2>&1 | grep 'expire'; done;17:56
mnaser3/9 times i got an expired cert17:56
mnaserso i think fungi's theory that it is the stale apache worker might be valid17:56
mnaser(but i hit all the backends and they came back clean)17:56
fungiif you grab cert details the serveraltname should say which server you're hitting17:56
mnaserok sure one second17:56
fungii forget the curl syntax for that17:56
*** zbr has joined #opendev17:57
*** zbr5 has joined #opendev17:57
mnaserhmm seems curl even with -vvvvv just returns certificate has expired, nothing more17:57
clarkbmnaser: fungi openssl s_client will show you17:58
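A sketch of pulling the certificate names and expiry out of whichever backend answers; per the discussion above, the subjectAltName on each gitea backend's cert should identify it:

    echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
      | openssl x509 -noout -text | grep -E -A1 'Not After|Subject Alternative Name'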
mnaseryep, switching to that17:58
*** zbr5 has quit IRC17:58
fungiright17:58
clarkbanyway running ps across all of them I think 03 may be the problem17:58
*** zbr has quit IRC17:58
clarkbit has an old apache worker. The other 7 seem to have recycled them all at least today17:58
*** zbr has joined #opendev17:59
fungiclarkb: on review i'm starting to think apache might be the problem as well... i can wget http://localhost:8081/ locally on the server which is what it should be proxying to17:59
mnaserclarkb, fungi: i got a call i have to run into, but this is a failing s_client output -- http://paste.openstack.org/show/800946/17:59
mnaseri can try and help again but in an hour17:59
clarkbya that says 03, I'll restart apache2 there18:00
clarkband that is done18:00
fungii finally got a server-status out of apache on review.o.o and all slots are in "Gracefully finishing" state18:01
fungihuh?18:01
fungii wonder if this is mod_proxy misbehaving18:01
clarkbfungi: is that stale from pre restart? fwiw I think gerrit just loaded a diff for me18:01
clarkband the access log is showing stuff filtering in18:01
yoctozeptoit let me in now18:01
fungiclarkb: i haven't restarted anything18:01
clarkbfungi: oh18:02
clarkbfungi: I thought you had, but then ya I guess apache is slowly recovering on its own after cleaning up old slots?18:02
fungibut now server-status is returning instantly and reporting only half the slots in that state18:02
fungiso yeah, seems like this is some timeout in mod_proxy maybe18:02
yoctozeptomight need decreasing it to avoid these stalls18:03
*** hamalq has joined #opendev18:03
clarkbyoctozepto: feel free to propose a change :)18:04
*** rpittau is now known as rpittau|afk18:04
clarkbwe'd love help, and this week has been really not fun for everyone. But more and more it feels like we need to reinforce that we run these services with you18:05
clarkbfor gitea03 I've talked to it via port 3081 with s_client and it responds with Verification: Ok several times in a row now18:07
yoctozeptoclarkb: if I knew the right knob for sure!18:07
*** hamalq_ has joined #opendev18:07
fungiso reading more closely, "gracefully finishing" state is what apache workers enter when a graceful restart is requested18:07
yoctozeptoit has been a bad week for all of us I guess...18:07
yoctozeptoah, so it made a restart18:07
fungithe worker waits for all requests to complete18:08
clarkb(it wasn't me)18:08
yoctozeptomakes sense that it does so18:08
clarkbmaybe our cert updated or something?18:08
fungiwe trigger it automatically on configuration updates and cert rotation18:08
yoctozeptooh oh18:08
fungiif one of those hit in the middle of the insane load spike, maybe it had to time out a ton of dead sockets18:09
yoctozeptothat is quite a downtime for cert rotation :D18:09
yoctozeptomaybe put haproxies in front with their new dynamic cert reload functionality (they essentially pick up new certs for the incoming requests)18:09
*** hamalq has quit IRC18:10
fungilet's dispense with the premature optimizations. i've not yet even confirmed what triggered the restart18:10
fungii'm trying to approach this by methodically collecting information first and not making baseless assumptions as to a cause18:11
clarkb++18:12
*** eolivare has quit IRC18:12
fungi-rw-r----- 1 root letsencrypt 1976 Oct 21 06:54 /etc/letsencrypt-certs/review.opendev.org/review.opendev.org.cer18:13
fungiso it's not cert rotation18:13
clarkbunless we're somehow triggering the handler even when teh cert doesn't update18:13
clarkb(I doubt it, but could still happen)18:13
fungilast restart apache logged was Thu Dec 10 06:25:07 (utc)18:14
fungiso another possibility is that graceful state is also used for worker recycling18:14
fungimpm_event.conf sets MaxConnectionsPerChild to 0 so we're not recycling workers after a set number of requests18:19
clarkbfungi: could gracefully finishing be in response to the backend not responding?18:20
fungihowever they are being recycled by something, most of the processes are no more than 20 minutes old18:20
fungiso looks like the parent apache process is from october 20 which i guess was the last complete restart, the first child process has a start timestamp from today's log rotation which i expect is the last graceful restart, the remainder of the workers however seem to have been dynamically recycled judging by their much more recent timestamps18:22
clarkbfungi: maybe cross check with the ansible logs?18:23
fungieasier said than done, we've got a loose timeframe and no specific event (yet) to narrow it down with, plus ansible logs an insane number of lines mentioning apache on this server18:25
clarkbya...18:26
fungiso i'm sifting through that now but it will take a while18:26
clarkbroger18:26
fungi1737 was the first report in here of a problem related to the server18:27
fungibackups start at 17:1218:27
clarkbfungi: when I hopped on the server a little after that borg was no longer running18:27
clarkbbut that doesn't mean it is innocent18:27
fungistarting around 17:42:01 we get a number of lines in syslog from the kernel, systemd, containerd... clarkb did you run any docker commands maybe?18:31
clarkbfungi: I did not18:31
clarkbI ran simple things like w, top and apache log tailing18:31
fungiit was roughly 1.5 minutes after you logged in so figured i'd ask18:31
fungioh! it's track-upstream being called from ceon18:32
fungicron18:32
clarkband all of that was on the host side18:32
clarkbfungi: ooooh18:32
clarkbfungi: you know what, I'm not sure we track any upstreams that we currently consume?18:32
fungiso anyway that kicked off in the middle of all this18:32
clarkbmaybe we should disable that and see if things settle a bit more?18:32
fungii doubt it was at fault18:32
fungibut can't hurt to turn off a thing nobody's using, sure18:33
clarkbya seems unlikely except for maybe memory pressure?18:33
fungiwe've also still got an openstackwatch cronjob firing periodically18:33
clarkbI think our gerrit fork may be the only thing that "uses" it18:33
clarkbbut we're currently unforked18:33
clarkbexcept for the /x/* update18:33
clarkbbut that happens in the job on top of the upstream repo18:33
clarkb(double check me on zuul pulls from upstream not our fork though)18:33
fungiyeah, i was about to say, we're technically forked, just not forking the git repo18:33
fungianyway, after poring over syslog, no, nothing at all happening ansible-wise, and i see nothing which would have triggered apache to recycle worker processes, so it must have been an internal decision in apache18:35
fungias for apache's error log, we started hitting "AH00485: scoreboard is full, not at MaxRequestWorkers" at 17:56:17 just out of the blue18:37
fungiso i suspect that's the point at which every worker had entered gracefully finishing state18:38
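The same scoreboard is available in machine-readable form, assuming mod_status is reachable locally (a sketch; the exact URL and access rules depend on the vhost config):

    curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers|Scoreboard'
    # in the Scoreboard string, "G" marks workers in the gracefully-finishing state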
*** dtantsur is now known as dtantsur|afk18:40
fungiParent Server Config. Generation: 6818:42
fungiParent Server MPM Generation: 6718:42
fungiare those supposed to match?18:42
clarkbI have no idea18:43
clarkbfungi: where does that info come from?18:44
fungimod_status18:46
clarkbah gotcha18:47
clarkbfungi: maybe check it on another service and see if they line up?18:47
fungianyway, i'm starting to think, after a lot of digging, this is "normal" (albeit aberrant) behavior for ubuntu xenial era mpm-event18:47
clarkbmaybe they are off by one normally because config is loaded in that order or whatever18:47
clarkbfungi: ya certainly sounds like that based on your investigation18:47
fungithat is, all workers ending up in g state because they can take too long to end18:47
fungibasically apache periodically (i have yet to find the exact default conditions though) recycles mpm-event workers by gracefully stopping them, it's not supposed to recycle them all at the same time but if the stopping takes too long, say because your server load suddenly spiked to 100...18:49
fungiso anyway, apache should in theory behave a bit better in the future once we upgrade the operating system on that server. in the meantime if we see that again, a hard restart of the apache2 service should bring things back sooner18:55
fungithe gerrit load issue on the other hand, no idea what exactly triggered that, we'll need to keep an eye out for more incidents of the same18:56
fungii'm switching back to troubleshooting the nova git repo on ze12 for now18:57
mnaserfungi: ouch, do i understand part of the issue here is a service that we have no idea what's restarting it? :(18:58
fungimnaser: no, nothing restarted18:58
mnaseroh, the recycling of mpm event workers18:59
fungiapache spawns multiple worker processes. it periodically refreshes them by issuing an independent graceful stop to a worker and firing up a new one. if you're at your max worker count it won't fire up a new worker until a stopping one has exited. if something causes those workers to take too long to exit, they pile up, until you have no worker processes accepting connections any longer19:00
fungithis seems to have been triggered by the gerrit activity which caused load to skyrocket on the server19:01
mnaserfungi: https://httpd.apache.org/docs/2.4/mod/mpm_common.html see MaxConnectionsPerChild (or previously MaxRequestsPerChild)19:01
mnaserlet me see if our config has that19:01
fungiyeah, it's 019:01
fungii'm familiar with it. we tune that on some of our other services19:02
*** sboyron has quit IRC19:02
mnaserfungi: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/static/files/apache-connection-tuning ?19:02
mnaseris that potentially being reused?19:02
fungithat's how we do it on other servers, yeah, but doesn't appear to be applied on review.o.o19:02
fungianyway, it was an aftereffect of a pathological condition19:03
fungithe gerrit service started having problems at or before 17:37z, apache didn't run out of workers until 17:56z (almost 20 minutes later)19:04
mnaserdidn't mean to distract too much :) so i'll let you get other things19:05
fungigerrit cpu consumption had trailed back off by 17:50z even19:05
fungibut i think the insane system load sent apache into a dark place from which it eventually recovered on its own as it started successfully reaping those stopping workers around 18:02z19:06
fungiso anyway, back to corrupt nova repo on ze12, i'm rescanning the other 11 executors to make sure the gerrit incident didn't trigger anything similar on another one19:09
fungigood, they're still all coming back clean, so it's just the (stopped) ze1219:10
fungilooking at the timestamps, i have a feeling ze12 never successfully cloned the nova repo after its restart, unlike the others19:13
*** adrian-a has quit IRC19:20
fungioh, also after a bit more digging, i think it's our MaxKeepAliveRequests 100 with KeepAlive On (these are defaults) in /etc/apache2/apache2.conf which is what's determining worker process recycling: https://httpd.apache.org/docs/2.4/mod/core.html#MaxKeepAliveRequests19:22
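For reference, the stock Debian/Ubuntu values being referred to (a sketch of the relevant apache2.conf lines; tuning these changes how often event workers get recycled):

    KeepAlive On
    MaxKeepAliveRequests 100
    KeepAliveTimeout 5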
*** adrian-a has joined #opendev19:24
fungiokay, i've removed the incomplete /var/lib/zuul/executor-git/opendev.org/openstack/nova on ze12 and started the service again19:25
fungiwill keep an eye on it to make sure it clones nova completely and doesn't have any recurrence of its earlier errors19:25
fungi2020-12-10 19:33:02,289 DEBUG zuul.Repo: [e: b5c0c3a9504c4da8ba3ba8cae23adf3e] Cloning from ssh://zuul@review.opendev.org:29418/openstack/nova to /var/lib/zuul/executor-git/opendev.org/openstack/nova19:33
corvusfungi, clarkb: sorry was away earlier; back now19:40
fungi2020-12-10 19:39:46,351 DEBUG zuul.ExecutorServer: [e: b5c0c3a9504c4da8ba3ba8cae23adf3e] [build: 8d72d952f28f47798786350c6fbd34df] Finished updating repo gerrit/openstack/nova19:43
fungiso looks like it cloned in ~6m44s19:44
*** andrewbonney has quit IRC20:05
ianwo/20:31
*** hashar has quit IRC20:32
ianwreading backwards ...20:32
fungiand if that doesn't work, i recommend reading upside-down20:36
ianwi am in Australia, so i guess i read upside down by default? :)20:37
ianwif i've understood, there were some executor problems after the timeout update that were tracked down to ze12 having a bad copy on disk?20:37
clarkbianw: ya and we had a spike in load that was not attributed to garbage collection (based on top -H)20:38
clarkbreview recovered on its own from that but apache did so more slowly20:38
corvusclarkb: what's the status of v2?20:39
clarkbcorvus: on https://review.opendev.org/c/opendev/system-config/+/766365 ianw points out that while our git is new enough to use v2 on the executors/mergers we need to enable it?20:40
ianwi had a quick play with my local git and v2 seemed to work fine on the test server, though i only cloned a few things20:40
ianwbut i think we might need some work to deploy it on executors20:40
clarkbcorvus: otherwise I think the first step is to land ^ restart gerrit, then work to ensure ci systems use it?20:40
ianw... heh, or what clarkb said :)20:40
clarkbianw: ya in .gitconfig I Think you set protocol.version = 220:41
clarkbto enable it on older git20:41
clarkb2.18 introduced it20:41
fungiianw: the main thing i wasn't sure about was whether it's possible ze12 somehow ended up with a just-right corruption of its git repo that it never successfully blew away and recloned it (unlike the other executors after the timeout increase)?20:41
clarkb`git config --global protocol.version 2` will write out the config too20:42
fungiianw: reason being i didn't find any significant lull of nova repo errors in the debug log for ze1220:42
ianwfungi: yeah, i didn't check all the executors, but in the gerrit queue i could no longer see any long zuul processes cloning nova, so assumed that all had grabbed it ok20:42
fungiianw: cool, so that remains a possibility i guess20:43
ianwi saw some discussion, but did we decide why these clones are like 2gb?20:43
corvusreally?  wow why isn't using v2 just autonegotiated?20:43
openstackgerritMerged opendev/gear master: use python3 as context for build-python-release  https://review.opendev.org/c/opendev/gear/+/74216520:43
ianwcorvus: it is from a 2.26 onwards client20:43
clarkbcorvus: it is if your git is from this year20:43
clarkbI think ~may?20:43
fungiyeah, it seemed like maybe they weren't sure forcing everyone to v2 the same release they enabled it was a good idea and phased it in as an optional feature initially20:44
corvusexecutors say: git version 2.20.120:44
corvusfungi: gotcha20:44
clarkbalso the linux kernel reported problems after it was enabled by default20:44
clarkbbut I was never able to track that down to any conclusion20:45
clarkb(more data than expected was transferred in their requests, maybe something pathological with the size of the linux kernel?)20:45
mordreddoes gitpython support it?20:45
fungiif our container images are debian-based, they need to use buster-backports or be built on testing/bullseye20:46
corvushopefully 2.20.1 doesn't have any issues with v2 that are fixed in later versions20:46
fungito get a new enough git to have it on by default20:46
corvusmordred: i don't *think* (but i'm not positive) that gitpython does any network stuff20:46
mordrednod20:46
fungiand yeah, 2.20 was the version in buster20:46
fungi(current stable)20:46
corvusmordred: i think that's all outsourced to a spawned git process20:46
mordredthe images are debian based - it wouldn't be too hard to add backports and get newer git20:46
mordredoh - they're stretch20:47
fungiright, i would say we enable v2 in the .gitconfig and then if we see bugs turn it off and consider building new images with buster-backports to pull the 2.29 available there20:47
mordredwait- my local images are likely stale20:47
fungithey're stretch (oldstable)?20:47
corvusthey're "10"20:47
mordredpulling and re-checking20:47
corvuswhich i think is 'buster'20:48
mordredin the old images stretch-backports is already in sources.list20:48
fungi10 is buster, yes20:48
mordredso it should just be a matter of adding the pin20:48
clarkblooking at https://lore.kernel.org/lkml/xmqqzh9mu4my.fsf@gitster.c.googlers.com/ it seems there were a few server side issues that git fixed on the v2 protocol20:48
clarkbbut since gerrit uses jgit those issues shouldn't apply (there may be other issues)20:48
clarkbit sounds like as of 2.27 on the server side they are happy with the linux kernel?20:49
corvusmordred: buster-backports has 2.2920:49
mordredcool20:49
corvusas does bullseye/testing20:49
fungisame as what's in sid right now20:49
clarkbfungi: corvus if buster backports is already carrying 2.29 would it be better to just use that rather than 2.20.1 + config?20:50
ianwperhaps we should enable it, and just hand-install it on one or two executor images for a few hours first?20:50
clarkbianw: oh thats another option ya, via exec set the git config and see how it does20:50
corvusclarkb: i think so20:50
clarkbI've got 2.29.2 locally and am able to currently interact with gerrit20:51
fungiif we're really worried about v2 bugs solved after 2.20 then using buster-backports for the git package sounds reasonable20:51
mordredpoo. buster-backports is not in the current images20:51
mordredso we need to add it to sources.list and install git with the pin20:51
fungiand yeah, i'm running 2.29.2 (from sid) as well20:51
corvusmordred: agreed it's not there20:51
corvussounds like .gitconfig may be easier then20:52
corvusi like the idea of enabling it in gerrit, and writing .gitconfig on some executors20:52
openstackgerritMerged opendev/gear master: Bump crypto requirement to accomodate security standards  https://review.opendev.org/c/opendev/gear/+/74211720:52
clarkbya I like that suggestion from ianw too20:52
corvusand if that works, just do .gitconfig in zuul, and don't worry about adding backports/upgrading unless we find we need >2.2020:53
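The per-executor piece is tiny; a sketch of what writing that .gitconfig amounts to (equivalent to the "git config --global protocol.version 2" mentioned earlier), assuming the zuul user's home is /var/lib/zuul as seen in the ps output above:

    # /var/lib/zuul/.gitconfig
    [protocol]
            version = 2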
ianw++ on one thing at a time :)20:53
fungiright, it's easy enough to switch back to v1 protocol temporarily while we solve that if necessary20:53
corvus766365+220:54
mordredcorvus: yah - although doing the backports makes the zuul images dtrt without config out of the box - so still might want to consider it20:54
mordred but as a followup I'd imagine after testing20:54
fungistill no new errors on ze12 so i think blowing away that nova repo it had did the trick20:54
fungimordred: or we could wait for bullseye sometime mid/late-202120:55
ianwmordred: oh yeah, definitely let's fix the image once we know it doesn't instant-explode :)20:55
mordredfungi: you're assuming we won't have been eaten by space aliens by mid/late-202120:56
fungiwell, i mean, that was my plan, but i wasn't going to let on20:56
fungiand yeah 2021-03-12 is the penultimate freeze phase, full freeze phase date still tbd20:57
fungiso who knows, bullseye could release in june or it could be another potato20:57
ianwanother thing i saw discussion on, has anyone notified the powerkvm  people about the constant loop?20:58
fungiianw: not yet i don't think20:58
fungisome complication there is that i see their account connecting from quite a few different addresses so we can either block all those or disable the account i guess20:58
fungibut at first look it seems like they have multiple ci systems sharing a single gerrit account20:59
ianwi guess let me try find the contact and send mail, and we can disable after that if no response20:59
fungior it could be they have that many zuul executors i suppose20:59
fungibut the connections seem persistent, more like stream-events listeners20:59
ianwunlike our executors though, there's only 1/2 connections ... we had one for each executor21:00
fungipresumably clarkb already contacted them about switching to https21:00
ianwso is it that git v2 will not require cloning the entire 2gb repo?  i haven't quite followed that bit21:00
clarkbhrm my email search is not showing that I did21:01
fungiso may know who to reach out to if the wiki isn't providing good contact info21:01
clarkbianw: v2 adds more negotiation so that clients can say I want to update master and then the remote won't care about other refs21:01
fungioh, maybe they're using https for queries but fetching via ssh?21:01
clarkbianw: the older protocols have both sides exchange the refs they know about and since gerrit puts changes in refs it creates trouble21:01
clarkbfungi: ya that could be21:01
clarkbianw: it basically streamlines the initial negotiation of what data a client wants to get21:01
fungior maybe they just don't query often because they limit the conditions they trigger on21:02
fungiand so didn't turn up in the snapshots of the queue21:02
ianwMichael Turek (mjturek@us.ibm.com) from the wiki https://wiki.openstack.org/wiki/ThirdPartySystems/IBMPowerKVMCI21:03
ianwi'll send mail cc to discuss soon21:03
clarkbianw: thank you21:03
fungii've approved 766365 and can do or help with a gerrit restart later once it's deployed21:03
fungii have a few chores i need to get to and dinner to cook in the meantime21:04
ianwclarkb: yeah ... so my local git clone of "git clone https://opendev.org/openstack/nova" is only 160mb ... is that because gitea is v2 enabled?21:04
clarkbianw: maybe?21:05
mordredwe should enable v2 in dib21:05
ianwi couldn't quite figure why zuul was getting these multi gb clones21:05
mordredif that's the case21:05
ianwmordred: or a base job role, but yeah21:05
mordredfor building the repo cache21:05
mordredwell - both21:05
ianwohhh, yes indeed21:05
mordredin dib for repo cache build - and in base role for, you know, all the things21:06
ianwi get the same thing with "git clone ssh://iwienand@review.opendev.org:29418/openstack/nova"21:07
ianwoh, no, maybe not.  same number of objects, but http://paste.openstack.org/show/800957/21:09
fungiremember zuul also fetches all branches and tags21:15
fungiso lots more refs21:15
fungiit's not just a simple git clone and done21:15
clarkbfungi: ya but it should be able to ignore all the refs/changes/* stuff so still better?21:15
fungiin theory21:15
mordredfungi: tags? or just all branches?21:16
*** slaweq has quit IRC21:16
ianwfungi: yeah, but the command that was hanging is just "git clone"?  anyway, if i clone via ssh i get the 2gb repo21:16
fungimordred: mmm, maybe just branches yeah, i was looking at the log of nova getting cloned and thought i saw it also fetching tag refs at the end, but... maybe only ones reachable from a branch21:17
corvusianw: do you think nova should be < 2gb?21:19
ianwcorvus: when i clone it via http it is 165mb21:19
corvusfascinating21:19
mordred165mb is < 2gb21:19
corvusmaybe even <<21:20
ianwhttps://docs.gitlab.com/ee/administration/git_protocol.html has some good info on finding what version your client is using21:20
mordredby, you know, an actual order of magnitude21:20
ianwGIT_TRACE_PACKET=1 git -c protocol.version=2  ls-remote https://opendev.org/openstack/nova 2>&1 | head21:21
ianwdoes not suggest gitea is git v221:21
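For anyone reproducing the check: with a v2-capable server the packet trace starts with a "version 2" capability advertisement, whereas an old-protocol server replies with the full ref listing straight away (a sketch, running the same command against both endpoints):

    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://opendev.org/openstack/nova 2>&1 | head
    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://review.opendev.org/openstack/nova 2>&1 | head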
corvusanyone done an https clone from gerrit?21:22
openstackgerritMerged opendev/system-config master: Enable mirroring of centos 8-stream  https://review.opendev.org/c/opendev/system-config/+/76649921:23
ianw(^ i will manually do that under lock and confirm free space)21:23
fungii'm trying a plain git clone of nova from gerrit via https21:23
fungiwill report21:23
corvusme too21:23
fungithen we can compare notes ;)21:24
corvusit's >160mb so far, so i assume it'll end up just like the ssh one at 2gb21:24
ianwI note that my clone from gitea had "remote: Enumerating objects: 587205, done." which i didn't see in the gerrit one?21:25
ianwGIT_TRACE_PACKET=1 git -c http.sslVerify=false  -c protocol.version=2  ls-remote https://review-test.opendev.org/openstack/nova 2>&1 | head looks v2-y21:27
ianwhowever, cloning that it is still a 2gb repo21:30
clarkbmy understanding is it shouldn't change the amount of data as much as the negotiations21:30
ianwI just did opendev http again to ensure i'm not nuts -- "Receiving objects: 100% (587205/587205), 164.90 MiB | 4.39 MiB/s, done"21:31
corvusReceiving objects: 100% (587205/587205), 1.01 GiB | 2.56 MiB/s, done.21:31
corvusthat's gerrit http for me21:31
corvuslarger than 160mb but smaller than 2g  :/21:32
ianwianw@ze07:/var/lib/zuul/executor-git/opendev.org/openstack$ du -h -s nova21:33
ianw1.9G    nova21:33
ianwthat was where i got the 1.9g number from21:33
corvusthat may have more branches/tags21:33
ianwi agree, i get 1.01gb on a v2 http clone from review-test as well21:33
corvushrm, my local clone has all the branches/tags21:35
clarkbthe executors may not be packed?21:36
corvusmaybe the .8g is due to extra changes and some amount of expansion due to merges, etc?21:36
corvuseither way, 1.1g is still 10x 165m21:37
fungimy clone is still underway21:37
*** eharney has quit IRC21:38
fungiyeah, 1.1gb21:39
fungiand i have all tags, including unreachable eol tags21:39
fungiwithout expressly fetching21:39
fungii also seem to have notes refs21:41
corvusfungi: you do?  i don't see them in mine21:43
corvusat least, not in packed-refs or refs/21:43
fungimy .gitconfig is configured with notes.displayRef=refs/notes/review so i see them in git log21:45
fungii also have [remote "origin"] fetch = +refs/notes/*:refs/notes/*21:45
fungiso maybe it grabbed them during cloning21:45
corvusah, i don't have that21:45
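For reference, a reconstruction of the config fungi is describing; whether it lives in ~/.gitconfig or the clone's own .git/config isn't stated, and the heads refspec is assumed:

    [notes]
        displayRef = refs/notes/review
    [remote "origin"]
        fetch = +refs/heads/*:refs/remotes/origin/*
        fetch = +refs/notes/*:refs/notes/*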
ianwapropos nothing, the yum-puppetlabs volume has run out of quota, i'll up it21:57
openstackgerritMerged opendev/system-config master: Enable git protocol v2 on gerrit  https://review.opendev.org/c/opendev/system-config/+/76636522:09
*** mlavalle has joined #opendev22:13
TheJuliafwiw, this afternoon gerrit has seemed even almost... snappy22:15
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630022:19
openstackgerritIan Wienand proposed opendev/system-config master: WIP: remove all bup bits  https://review.opendev.org/c/opendev/system-config/+/76663022:19
ianwthe centos8 stream initial mirror sync is running in a screen on mirror-update22:21
ianwok, here's something weird22:30
clarkb?22:30
ianwthe list of packed objects, gathered via22:30
ianwfor p in pack/pack-*([0-9a-f]).idx ; do     git show-index < $p | cut -f 2 -d ' '; done > packed-objs.txt22:30
ianwis the same between my 1GB clone from gerrit and my 165mb clone from gitea22:31
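A sketch of how that comparison can be done end to end (the clone paths are hypothetical):

    # dump and sort the object ids in each clone's packs, then diff the lists
    for repo in ~/nova-from-gerrit ~/nova-from-gitea; do
        (cd "$repo/.git/objects" &&
         for p in pack/pack-*.idx; do git show-index < "$p"; done |
         cut -d' ' -f2 | sort > "/tmp/$(basename "$repo").objs")
    done
    diff /tmp/nova-from-gerrit.objs /tmp/nova-from-gitea.objs   # no output => same objects, only the delta compression differs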
clarkbdifferent compression types maybe?22:31
clarkb(does git do that?)22:31
ianw181M    ./objects/pack  |  1.1G    ./objects/pack22:33
clarkbor maybe you fetch a pack but not all of its content?22:33
ianwgit repack -a -d -F --window=350 --depth=250 has my fans spinning22:36
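For reference, roughly what those flags mean (paraphrasing the git-repack man page):

    git repack -a -d -F --window=350 --depth=250
    # -a                repack everything into a single pack
    # -d                delete the now-redundant old packs
    # -F                recompute every delta from scratch (--no-reuse-object),
    #                   which is the slow, CPU-hungry part
    # --window/--depth  how many candidate objects to try as delta bases and how
    #                   long a delta chain may get; larger = smaller pack, more CPU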
*** adrian-a has quit IRC22:38
ianwhttps://github.com/emanuelez/gerrit/blob/master/templates/default/scripts/repack-repositories.sh22:40
ianw# Gerrit Code Review however does not automatically repack its managed22:41
ianw# repositories.22:41
ianw# review.source.android.com runs the following script periodically,22:41
ianw# depending on how many changes the site is getting, but on average22:41
ianw# about once every two weeks:22:41
clarkbianw: correct we run a git gc daily22:41
ianwis this still true?22:41
ianwdoes that repack?22:41
clarkbyou can tell gerrit to do the gc'ing but jgit gc is single threaded and slow so we do it out of band22:41
clarkbya aiui gc implies packing (not sure if necessarily repack)22:41
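A hedged sketch of the knobs involved (the values are illustrative, not our production settings): a plain git gc ends up running git repack -d -l, which honours pack.window / pack.depth from the repo config, while the aggressive path uses its own pair:

    git config pack.window 250          # picked up by plain gc's repack
    git config pack.depth 50
    git config gc.aggressiveWindow 250  # only used with --aggressive
    git config gc.aggressiveDepth 50
    git gc --prune=now
    # note: without repack -f/-F existing deltas are reused, so these settings
    # mostly affect newly packed objects rather than shrinking an old pack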
ianwperhaps we should be setting those pack options from that script?22:43
clarkbpossibly? or maybe newer git would do better (I'm assuming that gitea's more up to date git may be why it is smaller?)22:44
clarkbianw: I think the current gc happens on the host side not the container side, but perhaps if we converted it to execing into the container's git we'd get better results22:45
clarkbianw: we can test with review-test for that22:45
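A quick way to confirm the version gap clarkb is suggesting (the container name is a placeholder):

    git --version                                  # host git doing today's gc
    docker exec <gerrit-container> git --version   # git shipped in the gerrit image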
ianwmy very old laptop is still churning :)22:47
ianw$ du -hs .git22:47
ianw142M    .git22:47
ianweven smaller than gitea22:48
clarkbwow22:48
clarkbcertainly seems worth investigating, if review-test gc using the container git gets us close that is probably the simplest thing in terms of moving parts22:48
ianwyep, ok, good well at least that explains the difference!  i thought i was going nuts22:49
clarkbianw: putting something about this on the meeting agenda for if/when we have the next meeting might be a good reminder assuming we don't do anything sooner22:49
ianwi'll try a gc with the container git on review-test.  but i think we may need to update the script to do the settings like in that repack-repos script22:49
clarkbya it is possible that newer git alone isn't enough (but again gitea isn't doing anything special either just newer git aiui)22:50
ianw$ git gc --auto --prune in the container doesn't seem to do anything22:54
ianwi'm going to try setting those options22:54
ianwok, it doesn't want to do anything with "--auto"22:58
ianwi will try two things; running with the git in the container, then adding the options from that script and running again.  see what's smaller22:59
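A sketch of those two runs (container name, repo path and option values are all placeholders/assumptions, not the real review-test ones):

    # run 1: just the container's newer git
    docker exec <gerrit-container> \
        git -C /var/gerrit/git/openstack/nova.git gc --prune=now
    # run 2: set pack options on the repo first, then gc again and compare du -sh
    docker exec <gerrit-container> sh -c '
        cd /var/gerrit/git/openstack/nova.git &&
        git config pack.window 250 &&
        git config pack.depth 50 &&
        git gc --prune=now'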
ianwwith no packing options; 962M23:08
ianwadding all the options from https://github.com/emanuelez/gerrit/blob/master/templates/default/scripts/repack-repositories.sh and running gc --prune has not made a difference23:12
*** smcginnis has quit IRC23:16
ianwtracking this @ https://etherpad.opendev.org/p/pu_RvmPeym2A7JZmOV4Q23:18
* melwitt wishes the zuul console tab had a floating shortcut button to the top of the page23:32
fungithe git protocol v2 change was deployed to production as of 22:13 so we can restart the service to pick it up once things calm down a bit more23:37
ianwmelwitt: it's actually *really* easy to setup a dev environment to fiddle with the webui :)23:37
* melwitt rolls up sleeves23:37
melwittcool. I'll give it a go. later. when our gate stops being on fire23:38
ianwmelwitt: i'm not trying to be surly :)  it can be kind of fun hacking on the UI, you at least get to see results in front of you23:38
melwittI didn't take it as surly, I was just trying to be funny23:38
melwittI'll try it later. I like working on new stuff23:40
fungithere are certainly times where i wish i didn't always have so much new stuff to work on. i suppose i should be thankful ;)23:40
ianwyeah, i took it as a bit of an opportunity to understand react/2020 javascript a bit more.  i'm still rather useless, but it is something worth knowing23:40
melwitt:)23:42
*** smcginnis has joined #opendev23:42
ianwthis 16x big server doesn't seem to be much faster compressing git trees than my skylake laptop23:44
fungii can confirm du says my nova clone from gitea is far smaller than my nova clone from gerrit23:45
fungi250mb vs 1.1gb23:45
ianwfungi: yeah, i've narrowed it down to the packing, and am trying different things in https://etherpad.opendev.org/p/pu_RvmPeym2A7JZmOV4Q23:46
fungii do still get the git notes straight away without fetching them separately, even when cloning from gitea23:47
ianwyeah, it's all there; i dumped the objects in the pack files and they're exactly the same23:48
*** smcginnis has quit IRC23:49
ianwthat you can have an order of magnitude difference in the repo size modulo fairly obscure techniques is ... well just very git like i guess23:49
fungigit gonna git23:52
*** tosky has quit IRC23:53
ianwok, an explicit repack in the gerrit container barely does anything.  the same thing on my local laptop is the 10x shrink.  so something later gits do better?23:55
corvusmelwitt: https://zuul-ci.org/docs/zuul/reference/developer/javascript.html#for-the-impatient-who-don-t-want-deal-with-javascript-toolchains23:55
corvusmelwitt: i just do what that tells me to every time i need to do something in zuul's js :)23:56
melwittimpatient, that's me!23:56
melwittthanks23:56

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!