Tuesday, 2017-12-12

*** mwhahaha has joined #openstack-infra-incident00:05
*** jhesketh has quit IRC01:59
*** rlandy has quit IRC02:44
*** fungi has quit IRC05:45
*** fungi has joined #openstack-infra-incident05:47
*** myoung|rover|bbl is now known as myoung|rover06:53
-openstackstatus- NOTICE: Our CI system Zuul is currently not accessible. Please hold off on approving changes and rechecking until it's back online. Currently waiting for an admin to investigate.08:48
-openstackstatus- NOTICE: Zuul is back online, looks like a temporary network problem.09:08
*** AJaeger has quit IRC09:51
*** AJaeger has joined #openstack-infra-incident09:51
*** jhesketh has joined #openstack-infra-incident10:04
*** rosmaita has joined #openstack-infra-incident13:04
*** weshay|sickday is now known as weshay13:15
*** rlandy has joined #openstack-infra-incident13:31
-openstackstatus- NOTICE: We're currently seeing an elevated rate of timeouts in jobs and the zuulv3.openstack.org dashboard is intermittently unresponsive, please stand by while we troubleshoot the issues.14:39
fungigood call14:46
dmsimardseeing a lot of those in the logs http://paste.openstack.org/raw/628731/14:46
*** ChanServ changes topic to "Troubleshooting Zuul scheduler memory utilization | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation"14:46
fungihow far back? just since the last restart?14:47
dmsimardnot sure, let me see..14:48
dmsimardfwiw the number of jobs running doesn't correlate with the load and available capacity14:48
dmsimardhttp://grafana.openstack.org/dashboard/db/zuul-status14:48
dmsimardhttp://grafana.openstack.org/dashboard/db/nodepool14:48
*** Shrews has joined #openstack-infra-incident14:48
fungiyeah, if this is just the memory increase we typically see from stacks of config changes, then it generally won't14:48
clarkbiirc zk disconnects cause the lock to be released which could explain that if zk isn't staying connected due to memory pressure?14:49
clarkbits also possible that is related to the bug I've been trying to sort out on the nodepool side around node request locking14:50
dmsimard# grep "Exception" debug.log |grep "is not locked" |grep -v raise |wc -l14:50
dmsimard24892614:50
dmsimardthat's a problem14:50
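
A minimal sketch of bucketing the exceptions counted above by hour (to answer the "how far back" question that follows), assuming debug.log lines start with the usual "YYYY-MM-DD HH:MM:SS,mmm" timestamp and taking the log path as a command-line argument:

    # Sketch: count "is not locked" exceptions per hour in a nodepool/zuul debug log.
    from collections import Counter
    import re
    import sys

    TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):")  # capture date plus hour

    def count_by_hour(path):
        hits = Counter()
        with open(path, errors="replace") as f:
            for line in f:
                if "is not locked" in line and "raise" not in line:
                    m = TS.match(line)
                    if m:
                        hits[m.group(1)] += 1
        return hits

    if __name__ == "__main__":
        for hour, count in sorted(count_by_hour(sys.argv[1]).items()):
            print(hour, count)
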
clarkb(fwiw my changes haven't merged yet so unlikely to be source of regression)14:50
clarkbShrews: ^ you've been looking at things like this recently. Does it look familiar?14:51
fungilooks like it first showed up in the logs 2017-12-12 03:22:46,423 (zuul.log.1.gz right now)14:51
fungiwent back a week and no hits in the logs until then14:52
dmsimardthat's UTC right ?14:52
fungiyes14:52
Shrewsclarkb: no. but if zk disconnects are happening, that would be logged too14:52
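
Background on the lock-release theory: kazoo locks are ephemeral znodes tied to the ZooKeeper session, so a lost session silently drops any held lock. A minimal sketch of watching for that state; the host and lock path below are placeholder values:

    # Sketch: detect ZooKeeper session loss, which invalidates any held kazoo locks
    # (they are ephemeral znodes tied to the session).
    from kazoo.client import KazooClient, KazooState

    zk = KazooClient(hosts="127.0.0.1:2181")

    def state_listener(state):
        if state == KazooState.LOST:
            print("session lost: any held locks are gone")
        elif state == KazooState.SUSPENDED:
            print("connection suspended: lock ownership is uncertain")

    zk.add_listener(state_listener)
    zk.start()

    lock = zk.Lock("/nodepool/requests-lock/example", identifier="launcher-1")
    with lock:
        # If the session drops here, the ephemeral lock node disappears and a later
        # release attempt can surface as the "is not locked" style of exception.
        pass
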
fungiutc14:52
Shrewsclarkb: re: your fix, lemme check nodepool real quick14:52
fungi03:22z is ~ when we started swapping14:54
Shrewsclarkb: the main symptom I saw related to your bug fix was a lot of "ready" nodes that were locked. I'm not seeing that.14:54
fungiso i agree this could easily be a symptom of memory pressure14:54
*** frickler has joined #openstack-infra-incident14:55
fungiin fact, the timing of that first exception and when we used up the last of the ram is so close that it seems too correlated to be a coincidence14:55
dmsimardinfra-root: found the issue14:55
dmsimardwait no :/14:56
dmsimardlast zk logs on nodepool.o.o date back to october with no disk space left on device, only saw part of the trace14:56
clarkbya I think we've largely got that sorted, so the last time it shows up in the logs is the last time we had a problem (I got caught by that in the past trying to debug something else as well)14:57
fungiare you saying there's no space left on disk for nodepool.o.o right now?14:57
fungilooks like it has free space14:57
clarkbfungi: there wasn't free space months ago but thats the last thing in the error log iirc14:57
clarkbbecause we fixed it and been a non issue since14:58
*** jeblair has joined #openstack-infra-incident14:58
fungioh, you're saying october is the last time we ran out of space14:58
Shrewsyeah, we should have fixed that issue with the log space (by "we", i mean fungi)14:58
* fungi fixes nothing14:58
Shrewslies14:58
dmsimardfungi: I said "wait no", that's me not reading the full trace :(14:58
fungigot it14:58
fungiso anyway, i strongly suspect the exceptions you found in the scheduler log are an upshot of the server swapping heavily, given they started at the precise moment the server started paging15:00
dmsimardyeah, perhaps15:00
fungi+/- the 5-minute snmp polling granularity15:00
dmsimardis it normal that a nodepool list on nl02 would not be working? It says os_client_config.exceptions.OpenStackConfigException: Cloud citycloud was not found.15:00
jeblairdmsimard: run as nodepool user15:00
dmsimardah15:01
dmsimardjeblair: thanks ! /me still learning15:01
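
For context on the error above: os_client_config resolves cloud names from clouds.yaml files under the invoking user's config paths, which is why the same command works when run as the nodepool user. A minimal sketch of the lookup; the "citycloud" name is taken from the error message:

    # Sketch: os_client_config reads clouds.yaml from the *current* user's config
    # paths (e.g. ~/.config/openstack/clouds.yaml), so the lookup fails as root
    # and succeeds as the nodepool user.
    from os_client_config import OpenStackConfig, exceptions

    config = OpenStackConfig()
    try:
        cloud = config.get_one_cloud("citycloud")
        print("found cloud:", cloud.name)
    except exceptions.OpenStackConfigException as exc:
        print("not found for this user:", exc)
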
jeblairdid someone run sigusr2?15:01
dmsimardjeblair: we mentioned it earlier but I didn't run it.15:02
dmsimardnot sure if fungi or clarkb have15:02
clarkbI have not15:02
jeblairthen i'll stop looking for its output in the logs15:02
fungino, i did not15:02
jeblairi'm going to start poking at things using the repl; this may use a lot of cpu and make things worse15:03
dmsimardack15:05
jeblairzuul built 88 layouts during the 22:00 hour yesterday15:08
jeblairup from a more typical 20-30 per hour15:08
fungirelease jobs are almost all done, i think15:16
fungihard to tell since i'm getting intermittent proxy errors retrieving the status.json15:16
jeblairi don't see an unusual number of layouts being created around the 4 and 6 hours15:18
jeblairwhich seem to be the bumps into swap15:19
jeblairhere's the current object count: http://paste.openstack.org/show/628734/15:27
jeblair(since that has returned, things should start speeding up again)15:28
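
The per-type object count pasted above can be reproduced from a REPL with the gc module alone; a generic sketch (not Zuul's actual dump code):

    # Sketch: count live objects by type, similar in spirit to the object-count paste.
    # Run from a REPL inside the process being inspected.
    import gc
    from collections import Counter

    def object_counts(top=20):
        counts = Counter(type(obj).__name__ for obj in gc.get_objects())
        return counts.most_common(top)

    for name, count in object_counts():
        print("%10d  %s" % (count, name))
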
fungiyeah, i'm getting the status.json again15:29
fungijust two release jobs remaining15:29
fungiwe're getting really long delays between playbooks on jobs too... i'm going to check load on the executors15:33
pabelangerfungi: yah, load looks higher because of the pulsing jobs from zuul. Seems to overwhelm executors, then evens out, just in time for the next pulse15:35
fungimost are around 30-40 for load average. highest 5-minute average was ze06 at just over 5015:35
dmsimardthere was a large spike on ze0215:36
dmsimardobvious in grafana http://grafana.openstack.org/dashboard/db/zuul-status15:37
pabelangerfungi: yah, I think 25 loadavg is when we stop accepting jobs15:37
dmsimardthere was mention of an elevated rate of time outs15:38
*** tobiash has joined #openstack-infra-incident15:39
dmsimardthis is the error that comes up: http://logs.openstack.org/54/510654/5/check/openstack-ansible-functional-ubuntu-xenial/031df0b/job-output.txt.gz#_2017-12-12_12_40_13_06167615:39
dmsimardone of the errors, at least15:39
fungimakes sense. so zuul ends up batching up a lot of events, then thundering herds the executors, they push back because load gets too high, cycle continues a la hysteresis15:39
dmsimardlogstash -> message:"Make sure this host can be reached over ssh"    is picking up a lot of hits even within the last 15 minutes15:39
dmsimardfungi: so should there be a governor on top of load/disk that caps the amount of jobs per executor ?15:40
dmsimardbecause the load typically doesn't occur until after the executor picked up new jobs15:40
dmsimardso an executor might go ahead and pick up 50 new jobs but then it starts loading heavily a minute or two later15:40
pabelangeryah, when executors are under a large load, ansible-playbook runs slow down greatly, so much that there is almost zero communication between the playbook and the remote host. What that ends up doing is spawning a large number of job timeouts as a result. So I wouldn't be surprised if the high load is timing out running jobs15:40
fungii don't know that we should be designing safety valves into the executor to deal with pathological situations on the scheduler15:41
pabelangerIn v2.5, when a launcher picked up new jobs, it did it incrementally; is that something we still have in zuulv3?15:41
fungii would just consider this a symptom of scheduler memory pressure for now15:41
pabelangereg: when an executor can accept a job, not accept a large amount at once15:42
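
What pabelanger is describing amounts to a slow-start cap on how many new jobs an executor registers for per interval, rather than only reacting to load it observes later. A rough illustration of that idea, with made-up thresholds and helper names (not Zuul's actual governor):

    # Sketch: accept jobs in small batches and only widen the window while load
    # stays low, so a burst from the scheduler cannot be accepted all at once.
    import os
    import time

    MAX_ACCEPT_PER_INTERVAL = 5      # hypothetical starting cap
    LOAD_HIGH_WATER = 25.0           # back off above this 1-minute load average

    def accept_loop(pending_jobs, start_job):
        cap = MAX_ACCEPT_PER_INTERVAL
        while True:
            load1, _, _ = os.getloadavg()
            if load1 > LOAD_HIGH_WATER:
                cap = MAX_ACCEPT_PER_INTERVAL  # reset to the base cap
            else:
                cap += 1                       # slowly open the window
            for _ in range(min(cap, len(pending_jobs))):
                start_job(pending_jobs.pop(0))
            time.sleep(30)                     # accept in small batches
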
dmsimardexecutors are swapping as well15:42
jeblairthere are 10 layout objects known to the scheduler, and 79 in memory.  so there's a leak somewhere.15:42
dmsimardze02 is 4GB into swap territory15:42
dmsimardbut it's weird because there is 4GB "available"15:42
dmsimardvm.swappiness is 0 :/15:43
pabelangerdmsimard: Yah, that is the result of high load. Once it comes down, swap will stop15:43
jeblairfungi: when the release jobs are finished, i'd like to make a graph.  that's going to eat up cpu again and make things very bad; likely much worse than before15:43
fungisure. i guess we should grab a fresh copy of pipelines, generate memory graph, then restart scheduler and reenqueue everything?15:44
dmsimard+115:45
fungior are you going to want to keep it running in this condition a little longer after grabbing the object graph?15:45
clarkbpabelanger: that is what the load governor is supposed to provide, but the problem with it is that it's a delayed reaction15:45
dmsimardclarkb: yeah that's what I was pointing out, maybe there should be a soft and a hard limit on the job concurrency15:45
fungiand any delayed reaction leads to hysteresis under some conditions15:46
jeblairfungi: probably after i grab the graph we can restart.  your plan sounds good.15:46
pabelangerclarkb: yah, I think so. I'll look at code and see what is happening. From grafana it seems we might just be accepting jobs too fast at start.15:47
dmsimardfungi: you were mentioning thundering herd, looking at the graphs it's exactly that http://grafana.openstack.org/dashboard/db/zuul-status15:47
dmsimardexecutors picks them up in batches15:47
fungiright15:47
clarkbpabelanger: ya its a similar problem to the one that mnaser had with the cpu based nova scheduling15:47
dmsimardI mean, you don't see that in the "running builds" graph, but the load graph.15:48
clarkbpabelanger: you don't know what the compute cost of a job is until it is properly running and until that point you can grab a lot of jobs15:48
pabelangerclarkb: right15:48
dmsimardclarkb: computing the cost of a job (over an average) is doable15:48
dmsimardclarkb: through collectd/graphite for example15:49
fungibut yeah, any situation where you perform some action, take measurements, then adjust to correct is going to lead to that behavior under extremes15:49
pabelangerthe way zuul-launchers handled running ansible-playbook worked pretty well when jobs came online. Going to see if that would be an option for zuul-executors15:49
dmsimardclarkb: in a public cloud environment (such as mnaser) you can't really do that because the workloads from customers are really unpredictable15:49
clarkbdmsimard: right15:49
clarkbalso you could likely do something like a pid loop15:49
clarkbwhere you measure after each change and adjust (and don't really care about the work load being predictable because you measure at each step)15:50
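
A bare-bones version of the measure-and-adjust loop clarkb describes, using only the proportional term of a PID controller to steer the number of job slots toward a target load; the target, gain, and set_slots callback are all illustrative:

    # Sketch of the "pid loop" idea: measure load after each adjustment and nudge
    # the number of job slots toward a target, rather than predicting job cost up front.
    import os
    import time

    TARGET_LOAD = 20.0   # hypothetical target 1-minute load average
    KP = 0.5             # proportional gain
    MIN_SLOTS, MAX_SLOTS = 1, 100

    def control_loop(set_slots, slots=50, interval=60):
        while True:
            load1, _, _ = os.getloadavg()
            error = TARGET_LOAD - load1
            slots = int(max(MIN_SLOTS, min(MAX_SLOTS, slots + KP * error)))
            set_slots(slots)   # tell the executor how many jobs it may run
            time.sleep(interval)
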
jeblairas fungi said, let's not design something for when the system is broken.  anything more than a simple change or minor adjustment to what we have probably isn't worth it.15:51
ShrewsFYI, if the executors were also low on available memory, we should check the logs for the OOM killer trying to kill zuul-executor children (which tends to kill the finger daemon there). zuul-executor will need restarting to correct that15:58
jeblairShrews: did we ever adjust the ratio for that?15:59
dmsimardShrews: I don't see any OOM killer on ze02 which has this weird memory usage (4GB swap, 4GB available)15:59
Shrewsjeblair: i certainly didn't16:00
jeblairoh, we don't have a ram limiter.  that might be a thing.16:03
pabelangerYah, I think spamaps talked about writing that16:10
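
A RAM limiter along the lines discussed could be as simple as declining new work while available memory is below a floor; a hedged sketch using psutil, with a made-up threshold (this was not an existing Zuul feature at the time):

    # Sketch: decline new work when available memory drops below a floor.
    # psutil's virtual_memory() reports both percent used and bytes available.
    import psutil

    MIN_AVAILABLE_MB = 2048   # hypothetical floor

    def ok_to_accept_job():
        mem = psutil.virtual_memory()
        return mem.available >= MIN_AVAILABLE_MB * 1024 * 1024

    if not ok_to_accept_job():
        print("low memory: stop accepting jobs until it recovers")
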
jeblairfungi: how are the release jobs?16:12
fungijeblair: that last release job just reenqueued due to pre-task timeouts, but it's just a propose-update-constraints so maybe we should just cut our losses now16:12
jeblairok, i'll start my queries now and leave it to its own devices16:13
jeblairi just saved queues16:13
fungiit's been delaying ~30 minutes between each task due to executor load, so it'd be a couple hours at least before the rerun completed anyway16:13
fungier, between each playbook i mean16:14
jeblairgraph is still being constructed17:01
fungii figured. been watching the red rectangle growing to the right on the cacti cpu graph17:07
jeblairhow long you think we should let this run?18:07
clarkbjeblair: any rough idea how long it has taken in the past?18:32
*** rlandy is now known as rlandy|brb18:34
jeblairclarkb: hrm, i don't think i have timing info from previous runs.  i would have guessed 1-2 hours.  but we're at 2.25 now.18:34
jeblairi think we're using more memory than last time, so there's more work to do (where work involves a lot of swapping to examine all the objects)18:35
jeblairi'm sure it's working and will finish18:35
jeblairit's more of a trade off between lost productivity now and obtaining data vs lost productivity the next time this happens if we don't fix it18:36
fungialso we probably need to be ready to roll back the duplicate map key detection at restart if we find that there are projects which have inadvertently introduced any18:38
clarkbif we aren't swapping already does it perform reasonably well?18:39
jeblairfungi: or drop those projects from config, depending on which projects.18:39
clarkbbasically could we restart it, get jobs running again, then watch for leak and run graph generation before it becomes major performance issue?18:40
jeblairclarkb: usually much faster (like maybe tens of minutes), but i think it's intense enough there's a high risk of losing the zk connection and causing disruption.18:40
jeblairclarkb: it's an option, and may be less impactful, but it won't be transparent18:41
clarkbya just thinking that may be preferable to no jobs at all18:41
clarkbespecially if we do it around 0200UTC or so when we seem to have a quite time18:41
jeblairyup18:42
clarkb*quiet time18:42
jeblairwant to give it 30m more (so about 3h total)?18:43
clarkbsure18:43
fungiwfm18:44
fungii wonder if we should attempt to let events/results quiesce after the object graph is done, and then nab another copy of pipeline contents? the ones from before may not be as relevant now18:47
clarkbI've approved a few things myself but sort of expected to go back and recheck18:48
dmsimardfungi: yeah, we should definitely get a new snapshot of the pipelines right before restarting18:48
fungithat's only if the scheduler is still sane enough to provide one18:49
fungino guarantees. and yes, worst case we just remind people to recheck/reapprove stuff18:49
fungii also need to look through the stack of release job failures today and possibly reenqueue stuff if any got caught up in playbook timeouts18:50
*** myoung|rover is now known as myoung|rover|bia18:58
*** myoung|rover|bia is now known as myoung|rover|bbl19:00
jeblairwelp, we crossed the 3 hour mark :(19:23
dmsimardjeblair: do we have part of it? or is it an all-or-nothing deal?19:25
dmsimardlike, is it stuck in a recursion loop or something ?19:25
jeblairall or nothing19:26
dmsimardbleh19:26
jeblairand no progress indication19:26
dmsimardno insight from the dump either ?19:27
jeblaircould be 3 hours or 3 minutes away from completion19:27
jeblairdmsimard: dump?19:27
dmsimardthe SIGUSR219:27
dmsimardor is that what we're running right now19:27
jeblairoh, that hasn't been run, though i used the repl to run what it does independently.  that's the object count i pasted earlier19:28
dmsimardok19:28
jeblairwhat's running now is a graph traversal finding all the backlinks to one of the leaked layouts19:29
dmsimardthe swap and the load likely isn't helping :(19:29
jeblairoh at this point the load is entirely due to this process.  it has to access objects that have been swapped out, so everything is swapping in/out again, multiple times more than likely.19:31
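
The traversal described above is essentially a back-reference search of the kind objgraph provides; a sketch of the same sort of query, assuming the leaked objects are instances of a class named Layout:

    # Sketch: find what is still holding a reference to a leaked object. On a
    # heavily swapped process this walks a huge object graph, hence the long runtime.
    import gc
    import objgraph

    # Hypothetical: assume the leaked objects are instances of a class named Layout.
    leaked = [o for o in gc.get_objects() if type(o).__name__ == "Layout"]

    if leaked:
        # Render the chain of referrers keeping one of them alive.
        objgraph.show_backrefs(leaked[:1], max_depth=10, filename="layout-backrefs.dot")

        # Or print the shortest chain back to a module-level reference.
        chain = objgraph.find_backref_chain(leaked[0], objgraph.is_proper_module)
        print(" -> ".join(type(o).__name__ for o in chain))
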
*** rlandy|brb is now known as rlandy19:31
dmsimarddoes RAX have 32GB VMs ? :/19:32
clarkbit does have quite large VMs19:32
clarkbI wouldn't want to use a massive flavor long term but it could be a reasonable way to have a healthier zuul while fixing these bugs19:32
dmsimardWe should probably consider upgrading to the next flavor -- memory leaks will still be obvious while giving us the room we need to troubleshoot them19:33
fungiwe would probably still have at least mostly maxed out ram for their 30gb flavor19:35
jeblairrestarting zuul-scheduler19:41
fungithanks19:42
fungiso i guess there was at least that one19:53
fungiwill we need to manually submit 527507 through gerrit?19:53
jeblairfungi: i reverted the change manually and restarted19:54
fungioh, gotcha19:54
fungiwas that the only offender?19:54
jeblairi'm re-enqueing now19:54
jeblairfungi: it stops at one; i don't know :(19:54
fungiahh19:55
jeblairmaybe next time we restart, we should manually apply a patch to make that a warning, and collect them19:55
AJaegerjeblair: can we do dry runs to test this offline?19:55
jeblairAJaeger: i have no tool to do that :(19:55
AJaegerjeblair: that tool would be convenient for such changes. Ok...19:56
jeblairAJaeger: yes, though we're not going to get changes like this after the v3.0 release :)19:56
AJaeger;)19:57
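
The warn-and-collect idea jeblair mentions can be prototyped with a SafeLoader subclass that records duplicate mapping keys instead of failing; a sketch assuming plain YAML project config files (not Zuul's actual configloader), with "project.yaml" as a placeholder path:

    # Sketch: a YAML loader that collects duplicate mapping keys as warnings
    # instead of rejecting the document outright.
    import yaml

    class DuplicateKeyCollector(yaml.SafeLoader):
        duplicates = []  # collected across every load using this class

        def construct_mapping(self, node, deep=False):
            seen = set()
            for key_node, _ in node.value:
                key = self.construct_object(key_node, deep=deep)
                if key in seen:
                    self.duplicates.append((key, key_node.start_mark))
                seen.add(key)
            return super().construct_mapping(node, deep=deep)

    data = yaml.load(open("project.yaml"), Loader=DuplicateKeyCollector)
    for key, mark in DuplicateKeyCollector.duplicates:
        print("duplicate key %r at %s" % (key, mark))
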
clarkbwhich change was reverted?20:02
AJaegerclarkb: see #openstack-infra - openstack/congress-tempest-plugin20:03
AJaegerclarkb: see #openstack-infra - https://review.openstack.org/52723120:03
* AJaeger and copy & paste ;(20:03
AJaegerjeblair: what's up with zuulv3.openstack.org/ ? takes ages20:04
jeblairAJaeger: oh i think i need to restart zuul-web20:05
*** myoung|rover|bbl is now known as myoung|rover20:05
jeblairAJaeger: thanks.  there we go20:06
AJaegerjeblair: looks fine now, thanks20:06
-openstackstatus- NOTICE: The zuul scheduler has been restarted after lengthy troubleshooting for a memory consumption issue; earlier changes have been reenqueued but if you notice jobs not running for a new or approved change you may want to leave a recheck comment or a new approval vote20:15
*** ChanServ changes topic to "Situation normal | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation"20:16
*** myoung|rover has quit IRC20:35
*** myoung has joined #openstack-infra-incident20:39
*** clarkb has quit IRC20:45
*** clarkb has joined #openstack-infra-incident20:46
*** myoung is now known as myoung|rover20:47
*** rlandy is now known as rlandy|afk22:10
jeblairzuul's memory usage is annoyingly flat.  :|23:17
clarkbquick throw some reconfigures at it ?23:19
