Wednesday, 2020-12-09

*** tosky has quit IRC		00:00
openstackgerrit	Merged opendev/system-config master: Put jgit pack settings in jgit.config https://review.opendev.org/c/opendev/system-config/+/765867	00:01
clarkb	fungi: I think before restarting with ^ and the heap change we check that jgit.config, gerrit.config and docker-compose.yaml all look good?	00:03
fungi	yup	00:04
fungi	deploy hasn't finished yet though	00:05
clarkb	oh right need to wait for that too	00:05
*** slaweq has quit IRC		00:05
fungi	yeesh the heap change runs 28 jobs in deploy, all serialized	00:06
clarkb	oh because we changed host var stuff?	00:07
clarkb	:(	00:07
fungi	that's (at least part of) why the second change hasn't even started its deploy jobs yet	00:07
*** slaweq has joined #opendev		00:07
clarkb	ya I think it is because of the path of the host vars	00:08
clarkb	we don't distinguish on a per service basis for those we just glob on the higher paths?	00:08
fungi	anyway, it's fine... not like i have anywhere else to be	00:08
*** DSpider has quit IRC		00:11
*** ysandeep|away is now known as ysandeep		00:13
*** tkajinam has quit IRC		00:16
*** tkajinam has joined #opendev		00:16
clarkb	fungi: I guess the docker compose update should be in place now? but we still have to wait on the jgit.config changes?	00:34
fungi	yup	00:35
fungi	confirmed, JAVA_OPTIONS: "-Xmx44g"	00:37
*** guillaumec has quit IRC		00:52
*** guillaumec has joined #opendev		00:52
*** zigo has quit IRC		00:54
clarkb	fungi: I think the jgit change is deploying now	01:01
clarkb	and done? (I don't have ssh keys loaded anymore so can't easily check directly myself)	01:03
fungi	yeah looking	01:03
fungi	packedGitOpenFiles = 4096	01:04
fungi	et cetera	01:04
fungi	it's in there now	01:04
clarkb	and gerrit.config still looks happy?	01:04
clarkb	(I mean it should, just a good sanity check)	01:04
*** mlavalle has quit IRC		01:04
fungi	yep, those settings are also still in gerrit.config	01:05
fungi	i'll get ready to restart	01:05
fungi	#status notice The Gerrit service on review.opendev.org is being restarted quickly to make heap memory and jgit config adjustments, downtime should be less than 5 minutes	01:07
openstackstatus	fungi: sending notice	01:07
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly to make heap memory and jgit config adjustments, downtime should be less than 5 minutes		01:08
fungi	stopping	01:08
fungi	and starting again	01:08
fungi	looks like it's up again	01:09
clarkb	ya I can get the web dashboard	01:10
openstackstatus	fungi: finished sending notice	01:10
*** amotoki has quit IRC		02:01
*** amotoki has joined #opendev		02:02
*** hamalq has quit IRC		02:12
*** ysandeep is now known as ysandeep|session		02:36
*** cloudnull has quit IRC		03:18
*** cloudnull has joined #opendev		03:18
*** zbr has quit IRC		06:06
ianw	if gerrit's feeling a little slow, it's running a fairly inefficient backup process currently. i'm not going to stop it but i'll look at cleaning it up tomorrow	06:08
ianw	i feel like it's highly likely this could have caused the issues we saw at a similar time yesterday	06:08
*** ysandeep|session is now known as ysandeep|afk		06:09
chkumar|ruck	gerrit seems to be pretty slow	06:25
*** marios has joined #opendev		06:44
*** marios is now known as marios|rover		06:45
*** ysandeep|afk is now known as ysandeep|session		07:01
*** lpetrut has joined #opendev		07:12
openstackgerrit	daniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/765963	07:16
*** ralonsoh has joined #opendev		07:19
*** dmellado has quit IRC		07:26
*** dmellado has joined #opendev		07:26
*** eolivare has joined #opendev		07:31
*** dmellado has quit IRC		07:34
*** dmellado has joined #opendev		07:36
*** sboyron has joined #opendev		07:51
*** tosky has joined #opendev		08:02
*** elod_pto is now known as elod		08:05
*** hashar has joined #opendev		08:06
*** andrewbonney has joined #opendev		08:13
*** rpittau|afk is now known as rpittau		08:14
openstackgerrit	Stig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/766164	08:33
*** ysandeep|session is now known as ysandeep|lunch		09:03
*** eolivare_ has joined #opendev		10:05
*** eolivare has quit IRC		10:08
*** ssbarnea has joined #opendev		10:18
*** zbr has joined #opendev		10:23
*** DSpider has joined #opendev		10:27
*** brinzhang_ has quit IRC		10:34
*** ssbarnea has quit IRC		10:36
*** ysandeep|lunch is now known as ysandeep		10:38
*** brinzhang has joined #opendev		10:42
*** dtantsur|afk is now known as dtantsur		10:52
*** hashar is now known as hasharLunch		10:56
*** zbr has quit IRC		11:25
*** zbr has joined #opendev		11:28
*** zbr has quit IRC		11:46
*** zbr has joined #opendev		11:48
openstackgerrit	Tobias Henkel proposed zuul/zuul-jobs master: DNM: Test ensure-docker due to new docker release https://review.opendev.org/c/zuul/zuul-jobs/+/766207	11:57
*** fressi has joined #opendev		12:28
*** hasharLunch has quit IRC		12:48
*** d34dh0r53 has quit IRC		12:56
*** zbr has quit IRC		13:24
*** zbr has joined #opendev		13:27
*** zbr has quit IRC		13:45
*** zbr has joined #opendev		13:47
*** fressi has quit IRC		13:47
*** zbr has quit IRC		14:03
*** zbr has joined #opendev		14:04
*** bodgix has joined #opendev		14:12
*** fdegir0 has joined #opendev		14:24
*** fdegir has quit IRC		14:24
*** noonedeadpunk has quit IRC		14:30
openstackgerrit	chandan kumar proposed openstack/diskimage-builder master: [WIP] dracut list installed modules https://review.opendev.org/c/openstack/diskimage-builder/+/766232	14:35
sshnaidm	hi, any reason that override-branch and override-checkout don't work here? https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L159	14:41
sshnaidm	I have master instead of rocky in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1b5/766013/2/check/ansible-collections-openstack-functional-devstack-rocky-ansible-2.10/1b50b64/	14:41
sshnaidm	I think it was working fine before	14:42
openstackgerrit	chandan kumar proposed openstack/diskimage-builder master: Enable dracut list installed modules https://review.opendev.org/c/openstack/diskimage-builder/+/766232	14:48
*** pabelanger has left #opendev		14:48
*** diablo_rojo_phon has joined #opendev		15:04
*** fdegir0 is now known as fdegir		15:05
mnaser	silly question: is there a way with gerrit to change the topic without having to refresh the change?	15:17
mnaser	i've noticed that to remove the topic, i have to hit the x but then the "add topic" is missing, so i have to refresh for it to appear	15:17
mnaser	its not a big deal, but im just curious if i'm missing something	15:18
avass	mnaser: it takes a second to reappear for me	15:26
*** ralonsoh has quit IRC		15:27
*** ralonsoh has joined #opendev		15:27
*** zbr has quit IRC		15:29
*** zbr has joined #opendev		15:30
*** ralonsoh has quit IRC		15:34
*** ralonsoh has joined #opendev		15:38
*** ralonsoh_ has joined #opendev		15:45
*** ysandeep is now known as ysandeep|away		15:46
*** ralonsoh has quit IRC		15:46
fungi	mnaser: you should be able to edit the topic, you could in 2.13 anyway. and there's a rest api for it as well	15:46
mnaser	avass, fungi: maybe patience is what i needed. usually in 2.13 before, i'd be able to click and edit, this time around, i cant seem to edit so i click on the 'x' to remove the current one but the 'add topic' button doesnt show up until i refresh (but maybe i should be a bit more patient)	15:47
*** zbr has quit IRC		15:48
*** zbr has joined #opendev		15:51
sshnaidm	fungi, do you know maybe answer to my question above, about not working override-branch/checkout in https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L159	15:55
sshnaidm	fungi, I have still master cloned: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1b5/766013/2/check/ansible-collections-openstack-functional-devstack-rocky-ansible-2.10/1b50b64/	15:56
mnaser	would it be possible for the openstack/governance project to get a pipeline defined for it called 'governance-check' which listens to reviews/topic changes/comments and runs a 0 node job with pass/fail showing if the patch is ready to merge or not?	16:10
mnaser	we have the essential tooling which does all the checks right now	16:11
mnaser	id like to automate more of these chair-y things	16:11
*** mlavalle has joined #opendev		16:13
fungi	sshnaidm: i don't know, i can try to work it out but i'll need some time to familiarize myself with your job, how it's running, and what those options are supposed to do	16:13
fungi	mnaser: i think we already have a pipeline like that used for checking release requests, maybe we can reuse it?	16:13
fungi	mnaser: yeah, the release-approval pipeline... maybe it could also be renamed if that helps	16:14
mnaser	fungi: yeah maybe it could be a better name	16:16
fungi	mnaser: want to sync up with the release team and see if they'd be up for a more general name and more generalized success/failure labels? https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L256-L281	16:17
fungi	otherwise i suspect the pipeline you want will look identical to that one except for the name, description, and success/failure labels	16:18
fungi	but yes we could also add a separate pipeline, i'm just hesitant to continue to set a precedent for project-specific pipelines which are nearly identical	16:19
*** brinzhang has quit IRC		16:27
openstackgerrit	daniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/765963	16:40
fungi	sshnaidm: okay, i'm freed up kinda and seeing if i can make heads or tails of your branch overrides now	16:42
*** marios|rover is now known as marios|out		16:45
*** marios|out has quit IRC		16:46
*** lpetrut has quit IRC		16:46
*** hamalq has joined #opendev		16:53
*** hamalq_ has joined #opendev		16:55
*** hashar has joined #opendev		16:56
*** hamalq has quit IRC		16:59
openstackgerrit	daniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/765963	17:11
*** rpittau is now known as rpittau|afk		17:12
*** zbr has quit IRC		17:27
*** eolivare_ has quit IRC		17:28
*** zbr has joined #opendev		17:29
TheJulia	Gerrit seems to be returning Proxy 502 errors	17:38
clarkb	this looks similar to the thing that happened the other night (relative to me)	17:39
clarkb	java is up and running but busy	17:39
TheJulia	GC'ing?	17:40
clarkb	ya that is what we seem to think is happening	17:40
TheJulia	is it doing parallel gc'ing already?	17:40
TheJulia	there used to be a background gc option, where it would be able to run to keep the whole process lock master gc from firing	17:41
clarkb	yes I think we have 16 gc threads (we have 16 cores and so that is automatically set up that way?)	17:41
TheJulia	sadly the master one does exactly this when it occurs and can occur under heap pressure	17:41
TheJulia	I don't remember, I think that changed at some point	17:42
TheJulia	but don't remember. it has been a long time since I admined java webapps	17:42
clarkb	its "funny" because the gerrit maintainers say that java 11 largely fixed GC lock the world problems for gerrit. But we were running on 8 happily without that problem and now have it on java 11 :/	17:43
clarkb	ya ok just confirmed lots of busy gc threads according to top -H	17:44
* zbr wonders why we do not just give it more memory?		17:45
TheJulia	maybe they meant java 11 defaults?	17:45
TheJulia	and our current defaults might change that behavior?	17:45
TheJulia	zbr: more ram can make this even worse	17:45
TheJulia	and elongate the GC run times	17:45
TheJulia	at least, where the big lock is needed :(	17:45
fungi	zbr: we're already using a vm flavor with 60gb ram	17:45
fungi	and we need to leave some headroom for processes external to the jvm, for example backups	17:46
TheJulia	also filesystem buffer io	17:47
TheJulia	well, buffer and cache	17:47
fungi	yep	17:47
*** zbr has quit IRC		17:47
*** zbr has joined #opendev		17:49
yoctozepto	here it is gerrit 503	17:49
zbr	also being a VM does not help it deliver best performance either, but using baremetal may be harder.	17:50
* clarkb is trying to figure out jmap on the test node to then run it on prod to figure out which gc algorithm is used		17:53
clarkb	found a wikimedia bug that reports using the g1 gc helped them with similar problems	17:53
fungi	sshnaidm: looking closely at https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L159 is the "rocky" there expected to be a branch name? did you mean "stable/rocky" instead?	17:54
clarkb	but for whatever reason jmap doesn't want to talk to the test server	17:55
*** zbr has quit IRC		17:56
*** zbr has joined #opendev		17:58
clarkb	the internet indicates this is a common problem with about a million reasons for why it may happen (the jmap thing) yay	18:02
sshnaidm	fungi, should it be stable/rocky?	18:06
sshnaidm	fungi, btw, should it be override-branch or override-checkout? I see both are used there	18:07
fungi	sshnaidm: that was my question to you, did you mean stable/rocky (an actual branch in those projects)? rocky by itself doesn't exist as any sort of git ref in them that i can see	18:08
sshnaidm	fungi, yeah, of course I meant stable/rocky	18:08
fungi	sshnaidm: override-branch and override-checkout mean different things	18:08
sshnaidm	I see	18:08
fungi	also i suspect override-branch is intentionally undocumented in zuul because it was only kept for backward compatibility, but i need to dig deeper to confirm	18:10
corvus	yeah, we need to remove it	18:11
fungi	i saw a todo in the merger source about deprecating it	18:11
mnaser	i don't see a status message but yeah gerrit seems ded ;(	18:12
clarkb	the current issue with jmap is we need capadd ptrace	18:12
clarkb	I'm going to modify the docker-compose on review-test to do that	18:12
clarkb	(and restart the container)	18:12
avass	can't you just download more ram when the old ram runs out? ;)	18:12
sshnaidm	fungi, so better to use override-checkout in required_projects	18:13
fungi	status notice The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway	18:13
fungi	infra-root: how about i send that for now?	18:13
clarkb	wfm	18:14
fungi	sshnaidm: yes i expect so, but it depends on what you were trying to accomplish with it	18:14
fungi	#status notice The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway	18:14
openstackstatus	fungi: sending notice	18:14
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway		18:14
clarkb	ok if I add SYS_PTRACE to the container capabilities, restart the containers, then exec into the container as root using exec -u root -it containerid bash I can run `jhsdb jmap --heap --pid 7` on review-test	18:16
fungi	clarkb: remember our second borg backup of the day starts at 17:12 utc, could that coincide with a downward spiral?	18:16
fungi	just over an hour ago	18:16
clarkb	this shows we are running the G1 garbage collector with 13 threads on -test	18:16
clarkb	I expect we're in a similar setup on prod	18:16
mnaser	fungi: thank you for the status notice :)	18:16
fungi	yw	18:16
openstackstatus	fungi: finished sending notice	18:17
clarkb	and confirmed that java 11 docs say the g1 collector is the default	18:18
clarkb	"If this assumption doesn’t hold, then G1 will eventually schedule a Full GC. This type of collection performs in-place compaction of the entire heap. This might be very slow."	18:19
clarkb	I expect this is what we are currently experiencing?	18:19
clarkb	just thinking out loud here: We can try a different GC (though G1 is the one expected to work best in the scenario we are in aiui), we can try going back to java 8 (is this safe after install has been running on java 11? I expect so but dunno), We can try and tune G1 (I think we can also add better gc logging?)	18:21
clarkb	https://gerrit.wikimedia.org/r/c/operations/puppet/+/504073/4/modules/gerrit/manifests/jetty.pp	18:23
fungi	ahh, okay, so basically java 11 gc is less resource-intensive because it's less thorough, but falls back to being more resource-intensive if it turns out to have been insufficient?	18:23
clarkb	infra-root ^ thoughts on restarting the currently running container, then working to get something like ^ landed and restarted again?	18:23
clarkb	fungi: ya I'm suspecting that may be the case.	18:23
clarkb	I'm going to put that config in place on review-test if someone else wants to do the restarting	18:24
fungi	do we want to add SYS_PTRACE in production so we can inspect it the next time this crops up?	18:24
clarkb	(on prod I mean)	18:24
clarkb	fungi: that may be another good addition. Also we could manually add these bits to prod and do one restart? but test on review-test first?	18:25
clarkb	I'll work on review-test to make that possible	18:25
corvus	clarkb: ++ restart now then add logging, then make decision on gc changes	18:26
fungi	and yeah, it does seem like the main thing which changed before we started hitting this is the openjdk 8->11 upgrade	18:27
*** dtantsur is now known as dtantsur|afk		18:28
*** fbo is now known as fbo|lunch		18:28
*** fbo|lunch is now known as fbo|off		18:28
fungi	is freenode struggling too? i'm getting lots of messages in bursts	18:28
fungi	so this discussion is probably more async (or out of sync) than normal	18:29
clarkb	the wikimedia logging config is for an older jvm so I'm reading docs now	18:30
clarkb	not sure how long it will take to get a working config for java 11	18:30
mnaser	fungi: my timestamps don't seem very bursty shrug	18:30
fungi	i had like 5 messages come in from clarkb all at the same moment	18:31
fungi	i can prepare to restart the service, is the plan to restart asap to get things running again and then restart a second time to add the additional jetty logging, or try to work out the logging config now before we restart and avoid another restart later for that?	18:31
corvus	fungi: i read clarkb's comment as restart now to fix restart later to change	18:33
corvus	(which i agree with)	18:34
clarkb	ya I think its gonna be a while to sort out how to get logging going on review-test (I am working on this now and if someone else wants to restart prod that would be good)	18:34
fungi	okay, restarting it now	18:35
fungi	it should be on its way back up again	18:35
*** hashar has quit IRC		18:38
*** hashar has joined #opendev		18:38
clarkb	ok I think I sorted out a working config on review-test. It will write to review_site/logs/jvm_gc.log and rotate 10 copies there at 20M each	18:55
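A minimal sketch of how a JVM flag matching that description might be assembled, assuming Java 11 unified logging (`-Xlog`). The in-container path and the decorator list are assumptions for illustration; the actual change may use different values.

```python
def gc_log_option(path, filecount=10, filesize="20M"):
    """Build a JDK 9+ unified-logging flag that writes GC events to a
    file and lets the JVM rotate it (filecount files of filesize each)."""
    # Decorators are an assumption: wall-clock time, JVM uptime,
    # thread id, log level and tags.
    decorators = "time,uptime,tid,level,tags"
    return (
        f"-Xlog:gc*:file={path}:{decorators}"
        f":filecount={filecount},filesize={filesize}"
    )


# Hypothetical in-container path; the real one depends on the deployment.
print(gc_log_option("/var/gerrit/logs/jvm_gc.log"))
```

Letting the JVM handle rotation keeps the file names stable, so no external log rotation is needed.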
clarkb	fungi: corvus maybe you can take a quick look at the log it is writing currently then I can push up a change to modify prod if we are happy with that?	18:56
fungi	checking	18:57
clarkb	https://phabricator.wikimedia.org/T221026 also has a bunch of interesting info (also interesting that we've already done some of the changes they did independently discovering similar problems. I wonder if that points to gerrit needing better defaults)	18:58
fungi	wow, that sure does create a ton of different log files	18:59
fungi	ls -l /home/gerrit2/review_site/logs/jvm_gc.*	18:59
clarkb	fungi: well that's because i was fiddling with different rotation schemes	18:59
fungi	ahh, okay	18:59
clarkb	wikimedia puts the pid in there, other people suggest the jvm startup time	19:00
clarkb	but each of those may leak over time and then we need external rotation so I ended up with simple mode where the jvm does all the necessary rotation (we just have to look at timestamps for the files more carefully)	19:00
clarkb	the current config should just do jvm_gc.log*	19:00
fungi	times when my dyslexia amuses me: i read "metaspace" in the log as "meatspace"	19:00
clarkb	https://gerrit.wikimedia.org/r/c/operations/puppet/+/504448 also talks about tuning the packfile stuff more	19:01
*** sboyron has quit IRC		19:02
fungi	and yeah, without knowing quite what we're looking for, that log seems reasonable... it's got some gc activity metrics at least	19:02
*** andrewbonney has quit IRC		19:03
fungi	if we're going to put that in place in production as well, we might wait for a less disruptive time later today to restart with it	19:03
clarkb	ya I'll work on a change for that now. One thing to note is that this config is java 11 specific which means if we roll back we'll want to use the config that looks like wikimedias that I linked earlier	19:04
clarkb	infra-root do we want to add the ptrace cap to the docker compose file?	19:06
clarkb	I can do that in one change if so	19:06
clarkb	fungi: thinking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/504448 I wonder if that helps gc'ing because it packs the packfiles into more contiguous areas (assuming that gets preallocated)	19:08
clarkb	fungi: and maybe with our smaller 400m fragments all over we then hit these problems?	19:08
fungi	i wonder if we see nearly as much fetch activity in our deployment, but considering zuul and other ci systems fetch refs from gerrit i suppose that's where it could be hitting us	19:10
fungi	what sorts of additional inspection opportunities does allowing SYS_PTRACE cap give us?	19:12
openstackgerrit	Clark Boylan proposed opendev/system-config master: Add jvm gc logging to gerrit and traceability perms https://review.opendev.org/c/opendev/system-config/+/766283	19:14
clarkb	fungi: ^ the commit message in there hints at it. But we can run the jhsdb jmap commands against the running pid	19:14
clarkb	I think it can do other debugging too	19:14
clarkb	thinking out loud some more: maybe instead of reverting to java 8 we can try the parallel collector on java 11	19:31
clarkb	since that is the java 8 default	19:31
clarkb	re the git packfile size thing nova has a packfile on review-test that is 810MB large. Which is too large to fit into our 400m limit (which may cause problems?)	19:36
clarkb	but ya maybe we should consider bumping that value up like wikimedia did. They went with 1/6th of their heap (though this isn't how they calculated the value)	19:37
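A quick sketch of the arithmetic behind that comparison, using the figures mentioned in the log (44g heap, 400m limit, 810MB nova packfile). Treating "MB" and "m" as binary units is an assumption, and the one-sixth ratio is only the rule of thumb quoted above, not a recommendation.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

heap_bytes = 44 * GIB        # -Xmx44g, confirmed earlier in the log
current_limit = 400 * MIB    # the "400m limit" mentioned above
nova_pack = 810 * MIB        # nova packfile size seen on review-test

# One sixth of the heap, the ratio wikimedia ended up with.
sixth_of_heap = heap_bytes // 6

print(f"1/6 of heap: {sixth_of_heap / GIB:.2f} GiB")
print(f"nova packfile fits in current limit: {nova_pack <= current_limit}")
print(f"nova packfile fits in 1/6 of heap: {nova_pack <= sixth_of_heap}")
```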
clarkb	but maybe that points to it being a good idea to allocate a good portion of heap to the packfiles	19:37
clarkb	fungi: oh also we should cross check against melody I think	19:38
clarkb	I need to eat lunch now though	19:38
clarkb	hrm I check top before actually making a sandwich and are we already doing it again?	19:47
corvus	clarkb: sorry i was deep in another terminal; looking	19:47
clarkb	we == gerrit gc'ing in this case	19:48
*** lbragstad has joined #opendev		19:48
clarkb	I think that maybe rules out backups somehow triggering it	19:48
clarkb	thinking out loud here maybe lowering heap size has made this worse? and we should consider going back to 48g then work with spreading out other system activities to reduce their overlapping memory needs	19:50
clarkb	and on the crazy idea front we could try the Z GC	19:50
corvus	clarkb: why is ptrace cap needed in docker-compose file?	19:50
clarkb	corvus: to run jhsdb jmap --heap --pid $pid against the jvm	19:51
clarkb	its another avenue for digging into what the jvm is doing	19:51
* TheJulia guesses more unhappiness?		19:51
corvus	oh because you want to run that from inside a container	19:51
clarkb	corvus: ya	19:51
clarkb	we don't have the same jdk/jre outside of the container so using host tools is clunky	19:51
corvus	(vs on the outside where presumably no d-c change would be necessary, but we probably don't have the jhsdb tool outside)	19:51
corvus	cool, i'm caught up then :)	19:52
clarkb	TheJulia: yes top reports we're in a similar situation with garbage collecting	19:52
corvus	clarkb: if things are bad now, i think we should restart with 766283 in place	19:52
clarkb	corvus: ok I'll manually make those edits now	19:52
corvus	the logs on review-test look like if we had the same info in prod we might have more clues	19:52
* TheJulia wonders if a beer and maybe a stream for SN8 might be good for the day		19:53
corvus	TheJulia: any update on a time for sn8?	19:53
TheJulia	none	19:53
TheJulia	SpaceX says they won't go live to T-5 apparently... there is tank farm venting	19:54
TheJulia	so... could be close	19:54
clarkb	corvus: want to double check docker-compose.yaml on prod and if it looks good to you I can restart	19:54
* corvus hopes that raptor engines don't have java gc issues		19:54
TheJulia	corvus: that being said there are multiple T38s in the air	19:54
corvus	clarkb: lgtm	19:55
clarkb	ok restarting now	19:56
fungi	sorry, stepped away for a moment figuring we had more time before the problem resurfaced	19:57
fungi	caught up now	19:57
clarkb	jvm_gc.log is there and is logging now	19:57
fungi	interesting that it cropped back up so quickly this time. have to wonder what's almost immediately driving it into that state	19:58
TheJulia	corvus: looks like the rocket is starting to vent... so you may want to fire up any streaming	19:58
corvus	TheJulia: roger, wilco! :)	19:59
clarkb	in the log file the format is roughly timestamps, jvm uptime, thread id, log level, tags, then the info	20:01
*** hashar has quit IRC		20:03
* corvus is watching the gc log and the starship for changes in either		20:04
clarkb	corvus: If I'm reading it correctly it is frequently freeing about 8GB of memory	20:06
clarkb	maybe even more now	20:06
TheJulia	off, does the browser immediately submit comments when I click save or is it async?	20:07
clarkb	that's the 9876M->1234M(XYZAB) lines	20:07
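A rough sketch of pulling those before->after heap figures out of a GC log line, assuming the usual Java 11 unified-logging shape for pause events. The sample line is made up for illustration and is not taken from the real jvm_gc.log.

```python
import re

# Matches the tail of a pause line such as:
#   GC(17) Pause Young (Normal) (G1 Evacuation Pause) 9876M->1234M(45056M) 12.345ms
GC_LINE = re.compile(
    r"GC\((?P<num>\d+)\) (?P<what>Pause .*?) "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\) (?P<ms>[\d.]+)ms"
)


def parse_gc_line(line):
    """Return a dict describing one GC pause, or None if the line is not one."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    before, after = int(m["before"]), int(m["after"])
    return {
        "gc": int(m["num"]),
        "what": m["what"],
        "before_mb": before,
        "after_mb": after,
        "freed_mb": before - after,
        "heap_mb": int(m["total"]),
        "pause_ms": float(m["ms"]),
    }


# Hypothetical sample line, with the decorations described above.
sample = ("[2020-12-09T20:06:00.000+0000][540.123s][1234][info][gc] "
          "GC(17) Pause Young (Normal) (G1 Evacuation Pause) "
          "9876M->1234M(45056M) 12.345ms")
print(parse_gc_line(sample))
```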
clarkb	TheJulia: I think I've seen it register the posted message after I've navigated to a different page	20:07
TheJulia	Interesting...	20:08
* TheJulia re-types a comment		20:08
corvus	clarkb: 18... -> 5... is pretty common too	20:08
clarkb	corvus: that does make me wonder if maybe the packfile limit and possibly jgit strong refs would help (strong refs scare me as we can't gc those). Assuming those values are too small we may be allocating a small buffer, using it, then allocating a new small buffer to read the next chunk which deallocates the previous small buffer?	20:09
clarkb	and do that frequently enough and we may see this sort of behavior?	20:09
clarkb	alright I've been yelled at that my lunch is ready so I better go eat	20:11
ianw	o/	20:24
corvus	ianw: the action is at: tail -f /home/gerrit2/review_site/logs/jvm_gc.log	20:26
fungi	i'll be heading into dinner prep zone soon myself, but should be free later to discuss possible additional tunings	20:26
ianw	ahh excellent, yes i was going to get into suggesting we fiddle at that level today after the issues yesterday	20:26
ianw	was it the same thing with all the GC threads just spinning and nothing else happening?	20:27
fungi	seems so	20:27
fungi	twice	20:27
*** ralonsoh_ has quit IRC		20:28
fungi	we restarted and it came right back almost immediately	20:28
fungi	suggesting some external trigger	20:28
ianw	we have definitely seen that behaviour with java 8, i remember debugging it and can try finding it in my notes too	20:28
ianw	that is why i was thinking perhaps we have some magic trigger to a pathological case for that particular gc	20:28
clarkb	ianw: note that I think different GC systems are in play between 8 and 11	20:29
ianw	clarkb: that's about where i got to :) trying to figure out which one and the logging, so great minds i guess :)	20:30
clarkb	it just did a 35GB -> 18GB pass (and is now back up to ~24GB)	20:30
corvus	yeah, that after a bunch of 35->31 passes	20:31
clarkb	and more recently it looks like we bumped up against our heap limit and then it did a 3s full pause to bring that back down again	20:33
clarkb	Pause Full (GCLocker Initiated GC) <- grep for that to see the full pass	20:34
fungi	that gives some credence to the 48->44gb reduction in max heap being involved	20:36
clarkb	ya or at least making it worse	20:37
openstackgerrit	Ian Wienand proposed opendev/system-config master: bup: Remove from hosts https://review.opendev.org/c/opendev/system-config/+/766300	20:41
clarkb	it just did another full one	20:41
clarkb	6s this time	20:41
clarkb	also note that the GC number is important because they aren't always in order as it can do things in parallel	20:41
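A small sketch of collecting the full pauses and putting them back in GC-number order, since log lines can arrive out of sequence. The line shape is assumed to follow Java 11 unified logging, and the sample lines are invented for illustration.

```python
import re

FULL_PAUSE = re.compile(
    r"GC\((\d+)\) Pause Full \((.*?)\) (\d+)M->(\d+)M\((\d+)M\) ([\d.]+)ms"
)


def full_pauses(lines):
    """Return (gc_number, cause, freed_mb, pause_ms) for each full pause,
    sorted by GC number rather than by position in the log."""
    found = []
    for line in lines:
        m = FULL_PAUSE.search(line)
        if m:
            num, cause, before, after, _total, ms = m.groups()
            found.append((int(num), cause, int(before) - int(after), float(ms)))
    return sorted(found)


# Hypothetical log excerpt; note GC(12) is logged before GC(10).
lines = [
    "[info][gc] GC(12) Pause Full (GCLocker Initiated GC) 45000M->9000M(45056M) 6012.5ms",
    "[info][gc] GC(11) Pause Young (Normal) (G1 Evacuation Pause) 35000M->31000M(45056M) 80.0ms",
    "[info][gc] GC(10) Pause Full (GCLocker Initiated GC) 44000M->30000M(45056M) 3001.0ms",
]
for gc, cause, freed, ms in full_pauses(lines):
    print(f"GC({gc}) {cause}: freed {freed}M in {ms / 1000:.1f}s")
```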
ianw	not to derail, but i'm happy to babysit that ^^ hopefully before 5pm my time, when it certainly doesn't help the gerrit situation :)	20:42
ianw	both clarkb and myself have mounted the backups from both servers on review particularly, and can navigate and grab various bits	20:43
clarkb	wow after the last full run 2022 was the number for it. I think it went from 45g to 9g	20:44
clarkb	I wonder if this is the issue with the soft jgit cache refs that matthias has pointed out	20:44
clarkb	where you basically thrash the memory because you are constantly unloading and reloading into memory	20:44
clarkb	the problem is a proper lru system would be nicer that way we don't need a ton of memory to have everything cached all the time :/	20:44
clarkb	but I mean maybe if we see this persisting increasing packedGitLimit and setting the strong refs option are things to consider	20:45
clarkb	probably packedGitLimit first?	20:45
clarkb	there isn't a change for that yet, should I write one? or do others think that is a bad idea (mostly going off of https://phabricator.wikimedia.org/T221026 and repo discuss threads)	20:46
fungi	if i follow the argument, it sounds like it could help	20:54
fungi	basically nova (and probably other repos) have packfiles larger than the limit so they're not retained in memory?	20:54
clarkb	ya that is what I'm thinking	20:55
clarkb	and so we're just allocating and allocating and allocating memory to serve those things	20:55
clarkb	rather than allocating and hopefully reusing a buffer	20:55
clarkb	it is possible that we could just allocate more memory by bumping up that limit and things will get worse though :/	20:55
corvus	clarkb: why packedgitlimit and strong refs together? as opposed to two separate things	20:56
clarkb	corvus: matthias wrote an email that makes them sound like they go together, let me find a link	20:56
clarkb	corvus https://groups.google.com/g/repo-discuss/c/35-9sXmdtEQ/m/Ok1cOpBmBQAJ	20:57
clarkb	corvus: basically he is saying that the garbage collection has a habit of evicting all that jgit pack data only for gerrit to reload it again	20:57
clarkb	however, as I've noted before using strong refs for that memory will prevent it from being garbage collected so we may end up with more problems	20:57
corvus	clarkb, TheJulia: https://www.youtube.com/watch?v=ap-BkkrRg-o is live	20:58
clarkb	yup t minus ~5 minutes now	20:58
clarkb	corvus: maybe its better to do the packfile limit first	20:58
clarkb	observe and continue to refine our observations and thoughts on updates	20:59
* fungi turns on, tunes in, and drops out		20:59
* TheJulia opens numerous streams		21:00
clarkb	and maybe before we change anything we want to catch the problem in the jvm_gc.log?	21:00
corvus	yeah	21:01
fungi	this might be the longest two minutes and six seconds i've experienced in quite a while	21:10
corvus	fungi: time is an illusion? :)	21:10
* fungi checks to see if it's lunchtime again		21:10
*** slaweq has quit IRC		21:12
*** slaweq has joined #opendev		21:13
ianw	i guess the people at spacex are probably furiously discussing the possibility of a whole other type of garbage collection	21:14
clarkb	fungi: corvus ianw did you want to approve https://review.opendev.org/c/opendev/system-config/+/766283 since we are now running with that in prod?	21:18
fungi	done	21:19
clarkb	ianw: https://review.opendev.org/c/opendev/system-config/+/766300 is failing CI jobs	21:19
ianw	yeah just realised i'll need to pull that out of testinfra too	21:19
openstackgerrit	Ian Wienand proposed opendev/system-config master: bup: Remove from hosts https://review.opendev.org/c/opendev/system-config/+/766300	21:20
openstackgerrit	Ian Wienand proposed opendev/system-config master: bup: Remove from hosts https://review.opendev.org/c/opendev/system-config/+/766300	21:22
clarkb	gerrit is working really hard right now to gc	21:28
TheJulia	:(	21:29
clarkb	it's bouncing up against the heap limit, looks like; it will get 1gb free then go back up again	21:29
fungi	maybe it just needs encouragement	21:29
clarkb	do we want to try the jmap thing	21:33
fungi	if we wait until it's completely thrashing we may not be able to get any response from the jvm?	21:34
clarkb	fungi: well I think it is already there	21:34
TheJulia	are there tons of sockets?	21:35
fungi	ahh, i didn't realize, still finishing dinner	21:35
TheJulia	mentally thinking maybe cut traffic to it	21:35
clarkb	ya there are a few hundred but a number are in close wait	21:39
TheJulia	are connections being kept alive or being forced closed?	21:39
clarkb	well close wait is a force close that the server is waiting to make sure the client acks isn't it?	21:40
clarkb	the sockets are still there though	21:41
johnsom	502 proxy error here.	21:41
johnsom	close wait for a socket is the TCP reuse timer	21:41
TheJulia	clarkb: I'm thinking http socket closure	21:42
TheJulia	browsers like to hang connections open for a while in keep-alive mode and socket buffers can sit in the middle	21:42
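For reference, a minimal Python sketch of how the socket states discussed above could be tallied on a Linux host. It assumes the text of /proc/net/tcp is available; the function name count_tcp_states is hypothetical and was not used in the channel. In the kernel's table, state 08 is CLOSE_WAIT (the peer has closed and the local application has not yet closed its end) and 06 is TIME_WAIT.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Column 3 is the hex state; fall back to the raw code if unknown.
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts
```

On the server this could be fed with the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 sockets); a large CLOSE_WAIT count would point at the application not closing sockets its clients have already hung up on.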
clarkb	I ran the jmap it mostly just confirms what gc logs show. That we're full up	21:42
corvus	the mark/sweep is taking a long time now	21:43
corvus	5 seconds to mark	21:43
clarkb	TheJulia: those should be cheap though right? I mean a few hundred sockets is maybe a few hundred mb ?	21:44
TheJulia	clarkb: and threads they are living on... which consumes more of the heap	21:44
TheJulia	clarkb: in a past life, where we had similar issues, we had to let the frontend proxy handle http connection keep-alive and had the back-end force connection closure	21:45
clarkb	iirc java threads don't go on the heap (this was a major problem with jenkins because we would oom with plenty of heap remaining; instead there was no more stack space for threads)	21:46
corvus	it looks like we're just maxed out now; it's not actually releasing any memory, which makes me think we're not looking at a situation where we keep clearing and reloading the cache	21:46
clarkb	corvus: ya	21:46
TheJulia	dynamic caching inside the jvm?	21:46
corvus	no difference in outcome between the different gc types	21:47
TheJulia	err, well, for gerrit	21:47
clarkb	TheJulia: the caching of jgit data	21:47
clarkb	TheJulia: one of the things a gerrit maintainer has described that we thought may be similar is thrashing of the jgit caches	21:47
TheJulia	ugh	21:48
clarkb	TheJulia: gc will unload the jgit cache then jgit will reload the data immediately after in some situations according to them, however we'd expect that the gc'ing would actually unload it again if that was the case here	21:48
clarkb	I don't have any great ideas. Lots of less good ideas. java 8 seemed happier and would have been using parallel gc which we can try. we also reduced heap space by 4gb to give the system more room, we could give gerrit that 4gb back	21:49
clarkb	we can try enabling git protocol v2 which is supposed to be more efficient	21:50
clarkb	we can revert to java 8	21:50
clarkb	we can make apache close connections more aggressively as TheJulia suggests	21:50
TheJulia	I'd only change one thing at a time, fwiw	21:51
clarkb	yup that's what we've been doing since the upgrade	21:51
TheJulia	awesome	21:51
clarkb	one other thing I hadn't considered is that the java 11 update also pulled in a newer gerrit I think	21:51
clarkb	possible there is a regression between the gerrit we had on java 8 and the one we're running on java 11	21:52
corvus	when was the 8->11 upgrade?	21:52
clarkb	corvus: I want to say friday /me checks status updates	21:53
clarkb	2020-12-03 03:46:29 UTC restarted the gerrit service on review.o.o for the openjdk 11 upgrade from https://review.opendev.org/763656	21:53
clarkb	and ianw first observed this problem early monday morning (relative to my local time)	21:54
TheJulia	clarkb: fwiw, I don't remember how we were doing it anymore. we had 2 proxies, one was technically an authentication gateway for requests	21:54
TheJulia	I just remember, it became a factor where it was fine with the authentication gateway, but not the actual high memory usage webapp	21:55
clarkb	corvus: https://bugs.eclipse.org/bugs/show_bug.cgi?id=569349 could we be hitting that	21:56
openstack	bugs.eclipse.org bug 569349 in JGit "PackInvalidException when fetch and repack run concurrently" [Normal,Resolved: fixed] - Assigned to jgit.core-inbox	21:56
corvus	clarkb: http://cacti.openstack.org/cacti/graph_image.php?action=view&local_graph_id=25&rra_id=3	21:56
corvus	definite behavior difference this week vs last 2	21:57
*** otherwiseguy has joined #opendev		21:58
clarkb	the way gerrit does merges makes history really hard to read	21:58
clarkb	I think the gc in the context of that jgit bug is git gc not jvm gc	21:59
corvus	agree	21:59
fungi	do we want to push the heap limit back up to 48gb and restart now, or also try with the packed ref limit increased to something large enough to include the present nova packfile?	22:02
clarkb	I think setting it up to 48g is likely a good idea. due to what corvus said I'm not sure the packedfilelimit thing will help. My other thought for immediate ideas is use parallel gc since java 8 would have done that	22:03
clarkb	and java 8 had different behavior	22:03
corvus	those sound like 2 good next changes	22:03
fungi	both at once, or one we prefer doing first?	22:04
clarkb	I think both as that's most like the java 8 setup before without reverting to java 8	22:04
clarkb	I'll edit docker-compose.yaml now and y'all can check it when done	22:05
clarkb	fungi: corvus ianw ok docker-compose.yaml has been edited	22:06
clarkb	if that looks good let me know and I can do a down up -d	22:06
fungi	-XX:+UseParallelGC and -Xmx48g	22:08
fungi	lgtm	22:08
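A minimal Python sketch of the kind of check fungi describes, assuming the JVM flags live in a single JAVA_OPTIONS string as in the earlier `JAVA_OPTIONS: "-Xmx44g"` confirmation. The function name summarize_java_options is hypothetical, and the exact option order in the real docker-compose.yaml is not shown in the log.

```python
import re
import shlex


def summarize_java_options(java_options):
    """Return the max heap (-Xmx) and any -XX:+Use...GC flag in the string.

    Hypothetical helper for illustration only.
    """
    heap = None
    collector = None
    for opt in shlex.split(java_options):
        if opt.startswith("-Xmx"):
            heap = opt[len("-Xmx"):]
        else:
            # Matches collector selection flags such as -XX:+UseParallelGC.
            match = re.fullmatch(r"-XX:\+Use(\w+GC)", opt)
            if match:
                collector = match.group(1)
    return {"max_heap": heap, "collector": collector}
```

With no collector flag present, the "collector" value is None, which on java 11 means the JVM default (g1 gc, as noted later in the discussion).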
clarkb	I'll give corvus another minute or two but then I'll proceed	22:09
ianw	++	22:10
corvus	looking	22:11
corvus	lgtm	22:12
clarkb	alright proceeding	22:12
fungi	also the logging changes still seem to be in place	22:12
corvus	(sorry, was weighing in on https://bugs.chromium.org/p/gerrit/issues/detail?id=13800 -- the stream-events recheck thing for 3.3.0)	22:12
clarkb	fungi: yup and it shows the GC implementation switched	22:14
clarkb	766283 is in the gate and so may end up merging and then undoing the parallel gc + 48GB bump	22:15
clarkb	I'll write a change for ^	22:16
openstackgerrit	Clark Boylan proposed opendev/system-config master: Bump gerrit heap back to 48g and use parallel gc https://review.opendev.org/c/opendev/system-config/+/766317	22:20
fungi	the 48gb "bump" could just be a revert of us lowering it to 44gb? i think that was stand-alone	22:20
fungi	oh, or that, sure	22:20
fungi	whatever's easiest	22:21
clarkb	mostly I don't want to lose track and since we did them together pairing them is easy	22:21
fungi	yup	22:21
openstackgerrit	Ian Wienand proposed opendev/system-config master: graphite: also deny account page https://review.opendev.org/c/opendev/system-config/+/766318	22:23
openstackgerrit	Stig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/766164	22:23
clarkb	I'm not sure this is any happier :(	22:26
clarkb	which makes me think again to maybe changes in gerrit itself	22:26
clarkb	or java 11	22:27
clarkb	unfortunately I have basically zero good ideas	22:29
fungi	we could try rolling back to the patched 3.2 image we started with during the upgrade?	22:29
clarkb	or even the one before java 11	22:30
fungi	but my suspicion is the timing has more to do with activity actually ramping back up after the major holiday in the usa	22:30
fungi	and lots of extra patches/rechecks due to pip and virtualenv updates	22:31
clarkb	should we restart without parallel gc but keep 48g? I think parallelgc is just spinning its wheels	22:32
fungi	hmm, yeah maybe. i just tried the webui and it's not loading for me	22:33
clarkb	ya if you watch the jvm_gc.log you can see when it devolves	22:33
clarkb	and when I posted that it wasn't happier earlier it had basically already reached that point	22:33
corvus	yes; meanwhile, what's the deal with rolling back the gerrit+jvm upgrade?	22:34
fungi	still have to wonder what it was about today that things have gotten so much worse, just in the past few hours	22:34
clarkb	we can try the ZGC which is experimental on java 11	22:34
clarkb	corvus: we build gerrit against the stable-3.2 branch. If we can manage to land a change to revert java 11 we could rebuild on java 8	22:34
clarkb	corvus: separately we may need to decide if we want to checkout a specific stable-3.2 ref or tag?	22:34
clarkb	possibly we've got the image cached somewhere but every time I try to look at docker image ids I get immediately lost because what the docker hub ui shows you is not what docker image list shows you and so on	22:35
clarkb	I'm undoing parallel gc now	22:35
corvus	i mean, i'm pretty much at 'if we can get back to friday do it'	22:35
clarkb	corvus: ya	22:36
fungi	insecure-ci-registry.opendev.org:5000/opendevorg/gerrit f76ab6a8900f40718c6cd8a57596e3fc_3.2 fad1ccad836a 2 weeks ago 681MB	22:36
clarkb	corvus: maybe b12bec20e800 or 712c96672bbb	22:37
fungi	i think that's the one we used at the end of the upgrade?	22:37
ianw	for reference, this is the same sort of gc death spiral from the last time i remember debugging http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-10-16.log.html#t2020-10-16T00:11:34	22:37
clarkb	I've updated docker-compose and will restart again?	22:37
fungi	i confirm the parallel gc option is now out of the compose file	22:38
corvus	ianw: you suspect this is not gerrit/jvm related?	22:38
clarkb	I've not yet restarted and will wait for at least some other acks that that is a reasonable thing to do	22:41
corvus	wfm	22:41
ianw	corvus: don't think so, just going back over my notes. not sure what it means, other than we saw similar in the before times as well, but with less frequency	22:42
clarkb	ok restart is done	22:42
clarkb	where we are now is logging + g1 gc + 48g	22:43
clarkb	liftoff	22:45
openstackgerrit	Merged opendev/system-config master: Add jvm gc logging to gerrit and traceability perms https://review.opendev.org/c/opendev/system-config/+/766283	22:45
clarkb	^ will undo the 48g change	22:47
ianw	one engine stopped? is that supposed to happen?	22:47
clarkb	no idea	22:48
clarkb	to summarize everything that happened today so far: We noticed that GC'ing had gone crazy again. In response to that we added gc logging to the service and the ability to run jhsdb jmap commands against the running process in the container. When this happened again we were able to confirm it appeared we were at the limit of memory	22:51
clarkb	from there we compared to what things looked like last week and tried to mimic java 8 settings on java 11 (we enabled parallel gc instead of g1 gc and bumped heap back up to 48g). But that went poorly as well	22:53
clarkb	we are now back to the default java 11 gc which is g1 gc and 48g heap	22:53
clarkb	thinking about TheJulia's connection limits idea maybe we can tune that back in gerrit itself. sshd.threads is the limit for git fetches over http and ssh. We could dial that back a bit maybe?	22:54
ianw	would that just make things slightly slower, or reject connections leading to ci issues?	22:59
clarkb	ianw: it is supposed to queue things, but ya if things aren't serviced quickly enough maybe it would start failing ci systems and the like	23:00
clarkb	docker image id 3391de1cd0b2 is tagged 3.2 on review-test and is on review.o.o from 2 weeks ago	23:16
clarkb	I think it may be the promoted version of fad1ccad836a	23:17
clarkb	which is what we did the upgrade with	23:17
openstackgerrit	Stig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3 https://review.opendev.org/c/openstack/diskimage-builder/+/766164	23:18
ianw	there seems to be a lot of jobs maybe not getting nodes?	23:18
ianw	i guess we're just busy	23:21
*** bodgix has quit IRC		23:22
*** slaweq has quit IRC		23:23
*** bodgix has joined #opendev		23:23
openstackgerrit	Clark Boylan proposed opendev/system-config master: Revert "Reduce gerrit heap limit to 44g" https://review.opendev.org/c/opendev/system-config/+/766317	23:27
clarkb	fungi: ^ did your revert idea	23:27
clarkb	but kept the change number	23:27
fungi	wfm	23:27
ianw	i don't know what's up here but https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 shows ze12 going a bit nuts	23:45
clarkb	ianw: disk looks good on it at least (one thing we've seen cause problems with executors)	23:46
*** zbr has quit IRC		23:46
clarkb	seems like there was another case I'm not remembering maybe something to do with gearman?	23:46
clarkb	oh I'm thinking of the finger port/process dying	23:47
clarkb	in that case it continues to run jobs happily but you don't get the console logs from them on the dashboard	23:47
ianw	it seems to want to do all the work; perhaps the others are just busy	23:47
*** zbr has joined #opendev		23:48
clarkb	ya they are supposed to self throttle	23:48
ianw	Mem: 7.8G 7.1G 149M 34M 589M 200M	23:48
ianw	that's a lot of used memory and not much in caches	23:48
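A minimal Python sketch for labelling the numbers ianw pasted, assuming the six values follow the column order printed by recent procps `free -h` (total, used, free, shared, buff/cache, available); the header row was not included in the paste, so that mapping is an assumption. The helper name parse_free_mem_line is hypothetical.

```python
# Assumed column order of the "Mem:" row in recent procps `free` output.
FREE_COLUMNS = ["total", "used", "free", "shared", "buff/cache", "available"]
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_free_mem_line(line):
    """Map a 'Mem:' line from `free -h` to column names, with values in MiB.

    Hypothetical helper for illustration only.
    """
    label, *values = line.split()
    if label != "Mem:" or len(values) != len(FREE_COLUMNS):
        raise ValueError("unexpected line: %r" % line)
    return {
        name: float(value[:-1]) * UNIT_TO_MIB[value[-1]]
        for name, value in zip(FREE_COLUMNS, values)
    }


mem = parse_free_mem_line("Mem: 7.8G 7.1G 149M 34M 589M 200M")
used_fraction = mem["used"] / mem["total"]  # share of RAM reported as used
```

Under that column assumption, roughly nine tenths of the executor's RAM is in use, with 589M in buff/cache and 200M reported as available.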
openstackgerrit	Poornima Y N proposed openstack/project-config master: Integration of Secure Device onboard Rendezvous service on STX https://review.opendev.org/c/openstack/project-config/+/765494	23:51
openstackgerrit	Clark Boylan proposed opendev/system-config master: Enable git protocol v2 on gerrit https://review.opendev.org/c/opendev/system-config/+/766365	23:51
clarkb	infra-root ^ fyi, that is enabled on review-test and I've tested it. Compared against prod currently and modified -test config and it does enable it as far as I can tell. Harder to determine if it will make big impacts without seeing clients try it I guess? something to think about	23:52
