Tuesday, 2017-12-12

*** mwhahaha has joined #openstack-infra-incident00:05
*** jhesketh has quit IRC01:59
*** rlandy has quit IRC02:44
*** fungi has quit IRC05:45
*** fungi has joined #openstack-infra-incident05:47
*** myoung|rover|bbl is now known as myoung|rover06:53
-openstackstatus- NOTICE: Our CI system Zuul is currently not accessible. Please hold off on approving changes and rechecking until it's back online. Currently waiting for an admin to investigate.08:48
-openstackstatus- NOTICE: Zuul is back online, looks like a temporary network problem.09:08
*** AJaeger has quit IRC09:51
*** AJaeger has joined #openstack-infra-incident09:51
*** jhesketh has joined #openstack-infra-incident10:04
*** rosmaita has joined #openstack-infra-incident13:04
*** weshay|sickday is now known as weshay13:15
*** rlandy has joined #openstack-infra-incident13:31
-openstackstatus- NOTICE: We're currently seeing an elevated rate of timeouts in jobs and the zuulv3.openstack.org dashboard is intermittently unresponsive, please stand by while we troubleshoot the issues.14:39
fungigood call14:46
dmsimardseeing a lot of those in the logs http://paste.openstack.org/raw/628731/14:46
*** ChanServ changes topic to "Troubleshooting Zuul scheduler memory utilization | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation"14:46
fungihow far back? just since the last restart?14:47
dmsimardnot sure, let me see..14:48
dmsimardfwiw the number of jobs running doesn't correlate with the load and available capacity14:48
dmsimardhttp://grafana.openstack.org/dashboard/db/zuul-status14:48
dmsimardhttp://grafana.openstack.org/dashboard/db/nodepool14:48
*** Shrews has joined #openstack-infra-incident14:48
fungiyeah, if this is just the memory increase we typically see from stacks of config changes, then it generally won't14:48
clarkbiirc zk disconnects cause the lock to be released which could explain that if zk isn't staying connected due to memory pressure?14:49
clarkbits also possible that is related to the bug I've been trying to sort out on the nodepool side around node request locking14:50
dmsimard# grep "Exception" debug.log |grep "is not locked" |grep -v raise |wc -l14:50
dmsimard24892614:50
dmsimardthat's a problem14:50
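
A minimal sketch of bucketing the exceptions counted above by hour (to answer the "how far back" question that follows), assuming debug.log lines start with the usual "YYYY-MM-DD HH:MM:SS,mmm" timestamp and taking the log path as a command-line argument:

    # Sketch: count "is not locked" exceptions per hour in a nodepool/zuul debug log.
    from collections import Counter
    import re
    import sys

    TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):")  # capture date plus hour

    def count_by_hour(path):
        hits = Counter()
        with open(path, errors="replace") as f:
            for line in f:
                if "is not locked" in line and "raise" not in line:
                    m = TS.match(line)
                    if m:
                        hits[m.group(1)] += 1
        return hits

    if __name__ == "__main__":
        for hour, count in sorted(count_by_hour(sys.argv[1]).items()):
            print(hour, count)
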
clarkb(fwiw my changes haven't merged yet so unlikely to be source of regression)14:50
clarkbShrews: ^ you've been looking at things like this recently. Does it look familiar?14:51
fungilooks like it first showed up in the logs 2017-12-12 03:22:46,423 (zuul.log.1.gz right now)14:51
fungiwent back a week and no hits in the logs until then14:52
dmsimardthat's UTC right ?14:52
fungiyes14:52
Shrewsclarkb: no. but if zk disconnects are happening, that would be logged too14:52
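
Background on the lock-release theory: kazoo locks are ephemeral znodes tied to the ZooKeeper session, so a lost session silently drops any held lock. A minimal sketch of watching for that state; the host and lock path below are placeholder values:

    # Sketch: detect ZooKeeper session loss, which invalidates any held kazoo locks
    # (they are ephemeral znodes tied to the session).
    from kazoo.client import KazooClient, KazooState

    zk = KazooClient(hosts="127.0.0.1:2181")

    def state_listener(state):
        if state == KazooState.LOST:
            print("session lost: any held locks are gone")
        elif state == KazooState.SUSPENDED:
            print("connection suspended: lock ownership is uncertain")

    zk.add_listener(state_listener)
    zk.start()

    lock = zk.Lock("/nodepool/requests-lock/example", identifier="launcher-1")
    with lock:
        # If the session drops here, the ephemeral lock node disappears and a later
        # release attempt can surface as the "is not locked" style of exception.
        pass
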
fungiutc14:52
Shrewsclarkb: re: your fix, lemme check nodepool real quick14:52
fungi03:22z is ~ when we started swapping14:54
Shrewsclarkb: the main symptom I saw related to your bug fix was a lot of "ready" nodes that were locked. I'm not seeing that.14:54
fungiso i agree this could easily be a symptom of memory pressure14:54
*** frickler has joined #openstack-infra-incident14:55
fungiin fact, the timing of that first exception and when we used up the last of the ram is so close that it seems too correlated to be a coincidence14:55
dmsimardinfra-root: found the issue14:55
dmsimardwait no :/14:56
dmsimardlast zk logs on nodepool.o.o date back to october with no disk space left on device, only saw part of the trace14:56
clarkbya I think we've largely got that sorted, so the last time it shows up in the logs is the last time we had a problem (I got caught by that in the past trying to debug something else as well)14:57
fungiare you saying there's no space left on disk for nodepool.o.o right now?14:57
fungilooks like it has free space14:57
clarkbfungi: there wasn't free space months ago but thats the last thing in the error log iirc14:57
clarkbbecause we fixed it and been a non issue since14:58
*** jeblair has joined #openstack-infra-incident14:58
fungioh, you're saying october is the last time we ran out of space14:58
Shrewsyeah, we should have fixed that issue with the log space (by "we", i mean fungi)14:58
* fungi fixes nothing14:58
Shrewslies14:58
dmsimardfungi: I said "wait no", that's me not reading the full trace :(14:58
fungigot it14:58
fungiso anyway, i strongly suspect the exceptions you found in the scheduler log are an upshot of the server swapping heavily, given they started at the precise moment the server started paging15:00
dmsimardyeah, perhaps15:00
fungi+/- the 5-minute snmp polling granularity15:00
dmsimardis it normal that a nodepool list on nl02 would not be working? It says os_client_config.exceptions.OpenStackConfigException: Cloud citycloud was not found.15:00
jeblairdmsimard: run as nodepool user15:00
dmsimardah15:01
dmsimardjeblair: thanks ! /me still learning15:01
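
For context on the error above: os_client_config resolves cloud names from clouds.yaml files under the invoking user's config paths, which is why the same command works when run as the nodepool user. A minimal sketch of the lookup; the "citycloud" name is taken from the error message:

    # Sketch: os_client_config reads clouds.yaml from the *current* user's config
    # paths (e.g. ~/.config/openstack/clouds.yaml), so the lookup fails as root
    # and succeeds as the nodepool user.
    from os_client_config import OpenStackConfig, exceptions

    config = OpenStackConfig()
    try:
        cloud = config.get_one_cloud("citycloud")
        print("found cloud:", cloud.name)
    except exceptions.OpenStackConfigException as exc:
        print("not found for this user:", exc)
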
jeblairdid someone run sigusr2?15:01
dmsimardjeblair: we mentioned it earlier but I didn't run it.15:02
dmsimardnot sure if fungi or clarkb have15:02
clarkbI have not15:02
jeblairthen i'll stop looking for its output in the logs15:02
fungino, i did not15:02
jeblairi'm going to start poking at things using the repl; this may use a lot of cpu and make things worse15:03
dmsimardack15:05
jeblairzuul built 88 layouts during the 22:00 hour yesterday15:08
jeblairup from a more typical 20-30 per hour15:08
fungirelease jobs are almost all done, i think15:16
fungihard to tell since i'm getting intermittent proxy errors retrieving the status.json15:16
jeblairi don't see an unusual number of layouts being created around the 4 and 6 hours15:18
jeblairwhich seem to be the bumps into swap15:19
jeblairhere's the current object count: http://paste.openstack.org/show/628734/15:27
jeblair(since that has returned, things should start speeding up again)15:28
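
The per-type object count pasted above can be reproduced from a REPL with the gc module alone; a generic sketch (not Zuul's actual dump code):

    # Sketch: count live objects by type, similar in spirit to the object-count paste.
    # Run from a REPL inside the process being inspected.
    import gc
    from collections import Counter

    def object_counts(top=20):
        counts = Counter(type(obj).__name__ for obj in gc.get_objects())
        return counts.most_common(top)

    for name, count in object_counts():
        print("%10d  %s" % (count, name))
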
fungiyeah, i'm getting the status.json again15:29
fungijust two release jobs remaining15:29
fungiwe're getting really long delays between playbooks on jobs too... i'm going to check load on the executors15:33
pabelangerfungi: yah, load looks higher because of the pulsing jobs from zuul. Seems to overwhelm executors, then evens out, just in time for the next pulse15:35
fungimost are around 30-40 for load average. highest 5-minute average was ze06 at just over 5015:35
dmsimardthere was a large spike on ze0215:36
dmsimardobvious in grafana http://grafana.openstack.org/dashboard/db/zuul-status15:37
pabelangerfungi: yah, I think 25 loadavg is when we stop accepting jobs15:37
dmsimardthere was mention of an elevated rate of time outs15:38
*** tobiash has joined #openstack-infra-incident15:39
dmsimardthis is the error that comes up: http://logs.openstack.org/54/510654/5/check/openstack-ansible-functional-ubuntu-xenial/031df0b/job-output.txt.gz#_2017-12-12_12_40_13_06167615:39
dmsimardone of the errors, at least15:39
fungimakes sense. so zuul ends up batching up a lot of events, then thundering herds the executors, they push back because load gets too high, cycle continues a la hysteresis15:39
dmsimardlogstash -> message:"Make sure this host can be reached over ssh"    is picking up a lot of hits even within the last 15 minutes15:39
dmsimardfungi: so should there be a governor on top of load/disk that caps the amount of jobs per executor ?15:40
dmsimardbecause the load typically doesn't occur until after the executor picked up new jobs15:40
dmsimardso an executor might go ahead and pick up 50 new jobs but then it starts loading heavily a minute or two later15:40
pabelangeryah, when executors are under a large load, ansible-playbook runs slow down greatly, so much that there is almost zero communication between the playbook and the remote host. What that ends up doing is spawning a large number of job timeouts as a result. So I wouldn't be surprised if the high load is timing out running jobs15:40
fungii don't know that we should be designing safety valves into the executor to deal with pathological situations on the scheduler15:41
pabelangerIn v2.5, when a launcher picked up new jobs, it did it incrementally; is that something we still have in zuulv3?15:41
fungii would just consider this a symptom of scheduler memory pressure for now15:41
pabelangereg: when an executor can accept a job, not accept a large amount at once15:42
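
What pabelanger is describing amounts to a slow-start cap on how many new jobs an executor registers for per interval, rather than only reacting to load it observes later. A rough illustration of that idea, with made-up thresholds and helper names (not Zuul's actual governor):

    # Sketch: accept jobs in small batches and only widen the window while load
    # stays low, so a burst from the scheduler cannot be accepted all at once.
    import os
    import time

    MAX_ACCEPT_PER_INTERVAL = 5      # hypothetical starting cap
    LOAD_HIGH_WATER = 25.0           # back off above this 1-minute load average

    def accept_loop(pending_jobs, start_job):
        cap = MAX_ACCEPT_PER_INTERVAL
        while True:
            load1, _, _ = os.getloadavg()
            if load1 > LOAD_HIGH_WATER:
                cap = MAX_ACCEPT_PER_INTERVAL  # reset to the base cap
            else:
                cap += 1                       # slowly open the window
            for _ in range(min(cap, len(pending_jobs))):
                start_job(pending_jobs.pop(0))
            time.sleep(30)                     # accept in small batches
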
dmsimardexecutors are swapping as well15:42
jeblairthere are 10 layout objects known to the scheduler, and 79 in memory.  so there's a leak somewhere.15:42
dmsimardze02 is 4GB into swap territory15:42
dmsimardbut it's weird because there is 4GB "available"15:42
dmsimardvm.swappiness is 0 :/15:43
pabelangerdmsimard: Yah, that is the result of high load. Once it comes down, swap will stop15:43
jeblairfungi: when the release jobs are finished, i'd like to make a graph.  that's going to eat up cpu again and make things very bad; likely much worse than before15:43
fungisure. i guess we should grab a fresh copy of pipelines, generate memory graph, then restart scheduler and reenqueue everything?15:44
dmsimard+115:45
fungior are you going to want to keep it running in this condition a little longer after grabbing the object graph?15:45
clarkbpabelanger: that is what the load governor is supposed to provide, but the problem with it is that it's a delayed reaction15:45
dmsimardclarkb: yeah that's what I was pointing out, maybe there should be a soft and a hard limit on the job concurrency15:45
fungiand any delayed reaction leads to hysteresis under some conditions15:46
jeblairfungi: probably after i grab the graph we can restart.  your plan sounds good.15:46
pabelangerclarkb: yah, I think so. I'll look at code and see what is happening. From grafana it seems we might just be accepting jobs too fast at start.15:47
dmsimardfungi: you were mentioning thundering herd, looking at the graphs it's exactly that http://grafana.openstack.org/dashboard/db/zuul-status15:47
dmsimardexecutors picks them up in batches15:47
fungiright15:47
clarkbpabelanger: ya its a similar problem to the one that mnaser had with the cpu based nova scheduling15:47
dmsimardI mean, you don't see that in the "running builds" graph, but the load graph.15:48
clarkbpabelanger: you don't know what the compute cost of a job is until it is properly running and until that point you can grab a lot of jobs15:48
pabelangerclarkb: right15:48
dmsimardclarkb: computing the cost of a job (over an average) is doable15:48
dmsimardclarkb: through collectd/graphite for example15:49
fungibut yeah, any situation where you perform some action, take measurements, then adjust to correct is going to lead to that behavior under extremes15:49
pabelangerthe way zuul-launchers handled running ansible-playbook worked pretty well when jobs came online. Going to see if that would be an option for zuul-executors15:49
dmsimardclarkb: in a public cloud environment (such as mnaser) you can't really do that because the workloads from customers are really unpredictable15:49
clarkbdmsimard: right15:49
clarkbalso you could likely do something like a pid loop15:49
clarkbwhere you measure after each change and adjust (and don't really care about the work load being predictable because you measure at each step)15:50
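
A bare-bones version of the measure-and-adjust loop clarkb describes, using only the proportional term of a PID controller to steer the number of job slots toward a target load; the target, gain, and set_slots callback are all illustrative:

    # Sketch of the "pid loop" idea: measure load after each adjustment and nudge
    # the number of job slots toward a target, rather than predicting job cost up front.
    import os
    import time

    TARGET_LOAD = 20.0   # hypothetical target 1-minute load average
    KP = 0.5             # proportional gain
    MIN_SLOTS, MAX_SLOTS = 1, 100

    def control_loop(set_slots, slots=50, interval=60):
        while True:
            load1, _, _ = os.getloadavg()
            error = TARGET_LOAD - load1
            slots = int(max(MIN_SLOTS, min(MAX_SLOTS, slots + KP * error)))
            set_slots(slots)   # tell the executor how many jobs it may run
            time.sleep(interval)
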
jeblairas fungi said, let's not design something for when the system is broken.  anything more than a simple change or minor adjustment to what we have probably isn't worth it.15:51
ShrewsFYI, if the executors were also low on available memory, we should check the logs for the OOM killer trying to kill zuul-executor children (which tends to kill the finger daemon there). zuul-executor will need restarting to correct that15:58
jeblairShrews: did we ever adjust the ratio for that?15:59
dmsimardShrews: I don't see any OOM killer on ze02 which has this weird memory usage (4GB swap, 4GB available)15:59
Shrewsjeblair: i certainly didn't16:00
jeblairoh, we don't have a ram limiter.  that might be a thing.16:03
pabelangerYah, I think spamaps talked about writing that16:10
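
A RAM limiter along the lines discussed could be as simple as declining new work while available memory is below a floor; a hedged sketch using psutil, with a made-up threshold (this was not an existing Zuul feature at the time):

    # Sketch: decline new work when available memory drops below a floor.
    # psutil's virtual_memory() reports both percent used and bytes available.
    import psutil

    MIN_AVAILABLE_MB = 2048   # hypothetical floor

    def ok_to_accept_job():
        mem = psutil.virtual_memory()
        return mem.available >= MIN_AVAILABLE_MB * 1024 * 1024

    if not ok_to_accept_job():
        print("low memory: stop accepting jobs until it recovers")
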
jeblairfungi: how are the release jobs?16:12
fungijeblair: that last release job just reenqueued due to pre-task timeouts, but it's just a propose-update-constraints so maybe we should just cut our losses now16:12
jeblairok, i'll start my queries now and leave it to its own devices16:13
jeblairi just saved queues16:13
fungiit's been delaying ~30 minutes between each task due to executor load, so it'd be a couple hours at least before the rerun completed anyway16:13
fungier, between each playbook i mean16:14
jeblairgraph is still being constructed17:01
fungii figured. been watching the red rectangle growing to the right on the cacti cpu graph17:07
jeblairhow long you think we should let this run?18:07
clarkbjeblair: any rough idea how long it has taken in the past?18:32
*** rlandy is now known as rlandy|brb18:34
jeblairclarkb: hrm, i don't think i have timing info from previous runs.  i would have guessed 1-2 hours.  but we're at 2.25 now.18:34
jeblairi think we're using more memory than last time, so there's more work to do (where work involves a lot of swapping to examine all the objects)18:35
jeblairi'm sure it's working and will finish18:35
jeblairit's more of a trade off between lost productivity now and obtaining data vs lost productivity the next time this happens if we don't fix it18:36
fungialso we probably need to be ready to roll back the duplicate map key detection at restart if we find that there are projects which have inadvertently introduced any18:38
clarkbif we aren't swapping already does it perform reasonably well?18:39
jeblairfungi: or drop those projects from config, depending on which projects.18:39
clarkbbasically could we restart it, get jobs running again, then watch for leak and run graph generation before it becomes major performance issue?18:40
jeblairclarkb: usually much faster (like maybe tens of minutes), but i think it's intense enough there's a high risk of losing the zk connection and causing disruption.18:40
jeblairclarkb: it's an option, and may be less impactful, but it won't be transparent18:41
clarkbya just thinking that may be preferable to no jobs at all18:41
clarkbespecially if we do it around 0200UTC or so when we seem to have a quite time18:41
jeblairyup18:42
clarkb*quiet time18:42
jeblairwant to give it 30m more (so about 3h total)?18:43
clarkbsure18:43
fungiwfm18:44
fungii wonder if we should attempt to let events/results quiesce after the object graph is done, and then nab another copy of pipeline contents? the ones from before may not be as relevant now18:47
clarkbI've approved a few things myself but sort of expected to go back and recheck18:48
dmsimardfungi: yeah, we should definitely get a new snapshot of the pipelines right before restarting18:48
fungithat's only if the scheduler is still sane enough to provide one18:49
fungino guarantees. and yes, worst case we just remind people to recheck/reapprove stuff18:49
fungii also need to look through the stack of release job failures today and possibly reenqueue stuff if any got caught up in playbook timeouts18:50
*** myoung|rover is now known as myoung|rover|bia18:58
*** myoung|rover|bia is now known as myoung|rover|bbl19:00
jeblairwelp, we crossed the 3 hour mark :(19:23
dmsimardjeblair: do we have part of it? or is it an all-or-nothing deal?19:25
dmsimardlike, is it stuck in a recursion loop or something ?19:25
jeblairall or nothing19:26
dmsimardbleh19:26
jeblairand no progress indication19:26
dmsimardno insight from the dump either ?19:27
jeblaircould be 3 hours or 3 minutes away from completion19:27
jeblairdmsimard: dump?19:27
dmsimardthe SIGUSR219:27
dmsimardor is that what we're running right now19:27
jeblairoh, that hasn't been run, though i used the repl to run what it does independently.  that's the object count i pasted earlier19:28
dmsimardok19:28
jeblairwhat's running now is a graph traversal finding all the backlinks to one of the leaked layouts19:29
dmsimardthe swap and the load likely isn't helping :(19:29
jeblairoh at this point the load is entirely due to this process.  it has to access objects that have been swapped out, so everything is swapping in/out again, multiple times more than likely.19:31
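
The traversal described above is essentially a back-reference search of the kind objgraph provides; a sketch of the same sort of query, assuming the leaked objects are instances of a class named Layout:

    # Sketch: find what is still holding a reference to a leaked object. On a
    # heavily swapped process this walks a huge object graph, hence the long runtime.
    import gc
    import objgraph

    # Hypothetical: assume the leaked objects are instances of a class named Layout.
    leaked = [o for o in gc.get_objects() if type(o).__name__ == "Layout"]

    if leaked:
        # Render the chain of referrers keeping one of them alive.
        objgraph.show_backrefs(leaked[:1], max_depth=10, filename="layout-backrefs.dot")

        # Or print the shortest chain back to a module-level reference.
        chain = objgraph.find_backref_chain(leaked[0], objgraph.is_proper_module)
        print(" -> ".join(type(o).__name__ for o in chain))
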
*** rlandy|brb is now known as rlandy19:31
dmsimarddoes RAX have 32GB VMs ? :/19:32
clarkbit does have quite large VMs19:32
clarkbI wouldn't want to use a massive flavor long term but it could be a reasonable way to have a healthier zuul while fixing these bugs19:32
dmsimardWe should probably consider upgrading to the next flavor -- memory leaks will still be obvious while giving us the room we need to troubleshoot them19:33
fungiwe would probably still have at least mostly maxed out ram for their 30gb flavor19:35
jeblairrestarting zuul-scheduler19:41
fungithanks19:42
fungiso i guess there was at least that one19:53
fungiwill we need to manually submit 527507 through gerrit?19:53
jeblairfungi: i reverted the change manually and restarted19:54
fungioh, gotcha19:54
fungiwas that the only offender?19:54
jeblairi'm re-enqueing now19:54
jeblairfungi: it stops at one; i don't know :(19:54
fungiahh19:55
jeblairmaybe next time we restart, we should manually apply a patch to make that a warning, and collect them19:55
AJaegerjeblair: can we do dry runs to test this offline?19:55
jeblairAJaeger: i have no tool to do that :(19:55
AJaegerjeblair: that tool would be convenient for such changes. Ok...19:56
jeblairAJaeger: yes, though we're not going to get changes like this after the v3.0 release :)19:56
AJaeger;)19:57
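
The warn-and-collect idea jeblair mentions can be prototyped with a SafeLoader subclass that records duplicate mapping keys instead of failing; a sketch assuming plain YAML project config files (not Zuul's actual configloader), with "project.yaml" as a placeholder path:

    # Sketch: a YAML loader that collects duplicate mapping keys as warnings
    # instead of rejecting the document outright.
    import yaml

    class DuplicateKeyCollector(yaml.SafeLoader):
        duplicates = []  # collected across every load using this class

        def construct_mapping(self, node, deep=False):
            seen = set()
            for key_node, _ in node.value:
                key = self.construct_object(key_node, deep=deep)
                if key in seen:
                    self.duplicates.append((key, key_node.start_mark))
                seen.add(key)
            return super().construct_mapping(node, deep=deep)

    data = yaml.load(open("project.yaml"), Loader=DuplicateKeyCollector)
    for key, mark in DuplicateKeyCollector.duplicates:
        print("duplicate key %r at %s" % (key, mark))
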
clarkbwhich change was reverted?20:02
AJaegerclarkb: see #openstack-infra - openstack/congress-tempest-plugin20:03
AJaegerclarkb: see #openstack-infra - https://review.openstack.org/52723120:03
* AJaeger and copy & paste ;(20:03
AJaegerjeblair: what's up with zuulv3.openstack.org/ ? takes ages20:04
jeblairAJaeger: oh i think i need to restart zuul-web20:05
*** myoung|rover|bbl is now known as myoung|rover20:05
jeblairAJaeger: thanks.  there we go20:06
AJaegerjeblair: looks fine now, thanks20:06
-openstackstatus- NOTICE: The zuul scheduler has been restarted after lengthy troubleshooting for a memory consumption issue; earlier changes have been reenqueued but if you notice jobs not running for a new or approved change you may want to leave a recheck comment or a new approval vote20:15
*** ChanServ changes topic to "Situation normal | OpenStack Infra team incident handling channel | Find us in #openstack-infra for more stimulating conversation"20:16
*** myoung|rover has quit IRC20:35
*** myoung has joined #openstack-infra-incident20:39
*** clarkb has quit IRC20:45
*** clarkb has joined #openstack-infra-incident20:46
*** myoung is now known as myoung|rover20:47
*** rlandy is now known as rlandy|afk22:10
jeblairzuul's memory usage is annoyingly flat.  :|23:17
clarkbquick throw some reconfigures at it ?23:19
