22:04:49 #startmeeting zuul
22:04:50 Meeting started Mon Jan 15 22:04:49 2018 UTC and is due to finish in 60 minutes. The chair is corvus. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:04:51 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:04:53 The meeting name has been set to 'zuul'
22:05:01 #topic Roadmap
22:05:16 I'm going to ping folks individually this week to check up on status
22:05:47 but does anyone here working on a 3.0 release blocker have an issue we should talk about now?
22:05:59 (i know a lot of folks are afk today, but i thought i'd ask)
22:06:49 #topic RAM governor for the executors
22:07:00 dmsimard: i think this is your topic?
22:07:10 oh, from last week yes
22:08:04 #link https://review.openstack.org/508960 ram governor
22:08:17 We're generally memory constrained right now -- we're often finding zuul executors in swap territory and at that point it becomes a vicious circle quickly (can't clear jobs fast enough, so you get more jobs, etc.) and there are several OOM kills going around
22:09:03 So we want to land and enable the RAM governor ASAP but there's also another "governor" I'd like to talk about -- it'd be "max concurrent builds"
22:09:08 dmsimard: when executors go above a certain load average they shouldn't accept new jobs
22:09:22 on the scheduler side the memory consumer is the size of the zuul config model. Do we know what is consuming the memory on executors? is it ansible?
22:09:39 Regardless of our current governors (even pretending RAM had landed), there's nothing preventing a single executor from picking up 200 builds by itself
22:09:40 dmsimard: when have you seen executors accept new jobs because they can't clear them fast enough?
22:10:04 dmsimard: yes there is -- we would have two things preventing it -- a load governor and a ram governor
22:10:09 right now we have one
22:10:12 (just want to make sure that governing job execution is expected to reduce memory use and it isn't the finger daemon that is consuming all the memory for example)
22:10:31 corvus: not from a cold boot -- when all executors crashed a week ago, ze01 started first and picked up all the backlogged jobs and (eventually) loaded up to 150
22:11:13 i gather the issue with the system load governor not kicking in fast enough is that system load average is a trailing indicator and so can in certain thundering herd scenarios pick up a glut of jobs before the system load spikes high enough to stop it
22:11:16 clarkb: i *think* it's ansible eating the memory, but it's not leaking, it just uses a lot. at least, that's my recollection. it would be good to confirm that.
22:11:52 dmsimard: yes, that's true. i think after we have a ram governor, we should look into tuning the rate at which jobs are accepted.
22:12:29 that sounds sane
22:12:37 Generally speaking, there are only so many SSH connections/ansible playbooks we can have running at any given time
22:12:54 Wouldn't it be reasonable to say an executor can accept no more than 100 concurrent builds, for example?
22:13:00 dmsimard: i'd like to save 'max jobs per server' as a last resort -- as in, i'd like us to avoid ever implementing it if possible, unless we completely fail at everything else. the reason is that not all jobs are equal in resource usage. i think it would be best if the executors could regulate themselves toward maximum resource usage without going overboard.
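[Editor's note: for context, a minimal sketch of the kind of check a RAM governor performs, using psutil. The 90% threshold and the function name are illustrative assumptions and may not match the actual change proposed in 508960.]

    # Minimal sketch of a RAM governor check; threshold and names are
    # hypothetical, and the change under review may gather memory
    # statistics differently or use other limits.
    import psutil

    MAX_MEM_PERCENT = 90.0  # hypothetical cutoff


    def ram_ok():
        """Return True if this executor should keep accepting new jobs."""
        return psutil.virtual_memory().percent < MAX_MEM_PERCENT


    if __name__ == "__main__":
        if ram_ok():
            print("memory ok; continuing to accept jobs")
        else:
            print("memory pressure too high; pausing job acceptance")

Such a check would sit alongside the existing load-average governor rather than replace it.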
22:13:23 dmsimard: depending on the resources available and performance of the server, that number may vary quite a lot though right?
22:13:59 what if you have two executors, one of which is ~half as powerful as the other... having zuul scale itself to them based on available resources is nice
22:14:05 fungi: If something like that lands, it would be something configurable (with a sane default) imo
22:14:18 the way I see it, it's more of a safety net
22:15:05 not enough of a safety net unless you get into fiddling with per-server knobs rather than having a sane resource scheduler which can guess the right values for you
22:15:08 i don't want admins to have to tune these options. there is no sensible global default for max jobs per server, and it would always need to be individually tuned. further, that ignores that not all jobs are the same, so it's problematic.
22:15:48 a job with 10 nodes that runs for 3 hours is different than a job with zero nodes that runs for 30 seconds. both are very likely in the same system.
22:16:05 right
22:16:13 o/ sorry I am late
22:16:31 pabelanger: ohai I was actually about to ask, do we think we can land https://review.openstack.org/#/c/508960/ soon?
22:17:33 i agree that we need to prevent the hysteresis from happening -- i think the road there goes through the ram governor first, then tune the acceptance rate (there should already be a small rolloff, but maybe we need to adjust that a bit) so that the trailing indicators have more time to catch up. finally, we may want to tune our heuristics a bit to give the trailing indicators more headroom.
22:17:39 dmsimard: i think we want to add tests first, I'm hoping to finish that up in the next day or so
22:18:14 corvus: fwiw I agree that the max build idea is not a definitive answer and instead we might want to do like you mentioned and revisit/tune how executors pick up jobs in the first place
22:18:15 pabelanger: ++ we should be able to use mock to return some ram data
22:18:38 corvus: wfm
22:19:52 dmsimard: it looks like right now we delay job acceptance a small amount but only with the goal of spreading the load across executors, so the response time is still pretty quick
22:20:00 speaking of jobs, one thing zuulv2.5 did, and I don't believe zuulv3 does, is we had some sort of back off method so a single executor wouldn't accept a bunch of jobs at once. That seemed to work well in zuulv2.5 with our zuul-launchers
22:20:10 and it only looks at the number of jobs currently running
22:20:32 what we may want to do is adjust that to *also* factor in how recently a job was accepted
22:21:08 or just increase the delay that's already there and only use jobs running
22:21:18 it's currently: delay = (workers ** 2) / 1000.0
22:21:44 'workers' means jobs running in this context
22:21:53 would that also explain why when all the executors crashed at once, the first one to get started went nuts on the backlog?
22:22:17 since there wasn't even the rotation between executors to save it
22:22:25 ya jobs running seems like maybe a better option than workers
22:22:43 that's only going to slow us down 6.4 seconds with 80 jobs, so that's not enough time for load/ram to start to catch up
22:23:13 clarkb: no sorry, the variable is called "workers" but it means "number of jobs that this executor is running"
22:23:22 it's the internal executor gearman worker count
22:23:26 fair.
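[Editor's note: to put numbers on the formula above, at 80 running jobs the per-acceptance delay is 80 ** 2 / 1000 = 6.4 seconds, the figure cited at 22:22:43. A minimal sketch of the current behaviour and of the variant raised at 22:20:32 (also factoring in how recently a job was accepted) follows; MIN_GAP and the function names are illustrative assumptions, not a proposed implementation.]

    import time


    def current_delay(jobs_running):
        # The existing formula; "workers" in the code means the number
        # of jobs this executor is already running.
        return (jobs_running ** 2) / 1000.0


    # current_delay(80) == 6.4 -- only 6.4 seconds of delay at 80 running
    # jobs, not long enough for trailing indicators (load, RAM) to catch up.

    MIN_GAP = 10.0  # seconds between acceptances; illustrative value only


    def delay_with_backoff(jobs_running, last_accept_time):
        # Keep the quadratic growth, but also refuse to accept new jobs
        # more often than once every MIN_GAP seconds.
        since_last = time.time() - last_accept_time
        return max(current_delay(jobs_running), MIN_GAP - since_last)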
and the load governor is based on one-minute load average, so you have a lot of time to ramp up to untenable levels of activity
22:23:38 gotcha
22:24:24 i have similar worries about the ram governor, if the amount of ram ansible is going to use grows over time (we may take on a glut of jobs, and not finish old ones fast enough to make way for the memory growth of the new spike)
22:24:27 We also don't start running ansible right away now too, we first merge code into the local executor. Perhaps that isn't load / memory heavy?
22:25:19 pabelanger: it's not too heavy, but it is a delay worth considering. we could even let that be a natural source of delay -- like don't acquire a new job until we've completed merging the most recent job.
22:25:26 pabelanger: I'm wondering if swap should be taken into account in the RAM governor (and how)
22:25:43 that would probably be fine fully loaded, but it would make for a very slow start.
22:26:06 corvus: yah, that might be something to try. I like that
22:26:12 i think once we've started paging zuul activity out to swap space, it's already doomed
22:26:38 fungi: that's for the scheduler, but executors will keep running even when swapping
22:26:50 ideally the ram governor prevents us from reaching swap territory
22:26:51 how well do they keep running?
22:27:30 not well -- when the executors start swapping, execution becomes largely I/O bound and there's a higher percentage of I/O wait
22:27:50 if "keep running" means jobs start timing out because it takes too long for ansible to start the next task/playbook then that's basically it
22:28:24 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64003&rra_id=all
22:28:31 i didn't mean doomed to need a restart, i meant doomed to introduce otherwise avoidable job failures
22:28:52 fungi: I noticed the i/o wait and swap usage when I was trying to understand the SSH connection issues, there might be a correlation but I don't know.
22:29:27 interesting -- they're pretty actively swapping but keep the used memory close to 50%
22:29:45 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64004&rra_id=all
22:29:49 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64005&rra_id=all
22:30:11 i wonder if buffer space there is the ansible stdout buffering stuff
22:30:40 fungi: actually I asked #ansible-devel about that and the buffer is on the target node, not the control node
22:31:03 speaking of SSH connection issues, we could use SSH retries from ansible: https://review.openstack.org/512130/ to help add some fail protection to jobs
22:31:04 so it wouldn't explain the ram usage
22:31:06 dmsimard: did they indicate what happens when the target node sends the data back to the control node?
22:31:07 regardless, system and iowait cpu usage there don't look super healthy, leading me to wonder whether we still have too few executors at peak
22:31:15 maybe even expose it to be configurable somehow to jobs
22:31:55 and 5-minute load average spiking up over 40 on ze01 just a few hours ago
22:31:58 pabelanger: there are some improvements we can do around SSH, yes
22:32:14 where it topped out around 4gb of swap in use
22:32:25 fungi: that's likely load due to i/o wait
22:32:32 (heavy swapping)
22:32:33 dmsimard: exactly
22:32:39 I also think OSIC suggested some things we could tune in ansible for network-related issues. Need to see if I can find that etherpad
22:32:59 or was it the OSA team
22:33:11 fungi: vm.swappiness is at 0 on ze01 too..
22:33:23 I read that as "never swap ever" so I don't know what's going on
22:33:57 Oh, actually it doesn't quite mean that
22:34:01 "Swap is disabled. In earlier versions, this meant that the kernel would swap only to avoid an out of memory condition, when free memory will be below vm.min_free_kbytes limit, but in later versions this is achieved by setting to 1."
22:34:21 dmsimard: no, it just means don't preemptively swap process memory out to make room for additional cache memory
22:34:39 and our min_free_kbytes is vm.min_free_kbytes = 11361
22:35:14 these are fairly typical configurations for "cloud" virtual machines
22:35:22 fungi: our cpu usage from last week is significantly different from november: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64000&rra_id=4&view_type=&graph_start=1483002375&graph_end=1516055559
22:35:31 ok I've got to head out now. The two things I wanted to bring up: first, merging feature/zuulv3 to master. I tested this with nodepool and wrote an infra list email about it. The other thing is my two nodepool changes to address cloud failures, 533771 and its parent. Left the test split out as I expect it may need cleanup, but it should be good enough to show the parent works
22:35:53 we may be seeing the hit from meltdown and may indeed need to add more executors
22:36:32 yep, it is a bit worse
22:37:27 clarkb: thanks, yeah, i think we can merge the branches soon, maybe let's set a date for thursday and send out a followup email?
22:37:35 meltdown mitigation performance hit seems as good a culprit as any
22:38:15 Re: adding more executors -- do we think we have the right size right now? In terms of flavors, disk size, etc.
22:38:29 looking at stats from zuul-launcher is a little interesting too: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=4683&rra_id=4&view_type=&graph_start=1483002628&graph_end=1516055812
22:38:41 we do seem to be using more system resources with executors
22:38:51 pabelanger: that had a *very* different mode of operation
22:39:06 pabelanger: executors also run zuul-merger, which is not negligible
22:39:20 yup
22:39:53 dmsimard: i'd argue it is negligible
22:40:07 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=1519&rra_id=all
22:40:11 that's on a 2G server
22:40:50 but there's 8 zm nodes :P
22:41:13 yes, for parallelization
22:41:48 i'm just saying that the internal merger is not what's eating things up on the executors. we're just doing a lot more with ansible than we were in zuulv2.5
22:41:58 * dmsimard nods
22:42:40 (among other things, in zuul v2.5, we did *not* ship the entire console output back to the controlling ansible process)
22:43:57 anyway, to conclude: let's say the plan for now is: add ram governor, then slow job acceptance rate. sound good?
22:44:04 wfm
22:44:07 I think we also used 2.1.x vs 2.3.x, so it's possible ansible is now just using more resources
22:44:14 corvus: ++
22:44:47 #agreed to reduce hysteresis and excess resource usage: add ram governor, then slow job acceptance rate
22:44:55 #topic merging feature branches
22:45:35 we should be all prepared as clarkb said -- but we still may want to actually schedule this so no one is surprised
22:45:43 how about we say we'll do it on thursday?
22:46:13 the puppet changes to make that hitless for people using puppet-openstackci are merged at this point, right?
22:46:38 fungi: yep
22:46:46 thursday is fine by me
22:47:09 #link puppet-openstackci change https://review.openstack.org/523951
22:47:11 i guess as long as the people using that module update it with some frequency they're protected. if they don't, then they're using a non-continuously-deployed puppet module to continuously deploy a service... so a learning experience for them?
22:47:31 we'll have to make some config changes to nodepool-builder, since it is using old syntax. I can propose some patches for that
22:47:41 fungi: yep. and i mean, it's not going to eat their data, they just need to checkout a different version and reinstall
22:47:44 maybe also upgrade to python3 at the same time
22:49:05 pabelanger: heh, well, if we're checking out master on nodepool builders, then i think we'll automatically get switched to v3. :)
22:50:03 pabelanger: do you want to deploy new builders running from the feature/zuulv3 branch before we merge?
22:50:09 corvus: I was thinking maybe we first switch nodepool builders to feature/zuulv3 branch and get config file changes in place
22:50:12 yah
22:50:35 pabelanger: think that's reasonable to do before thursday?
22:50:54 hopefully there'll be more folks around tomorrow to help too
22:50:56 I believe so, I can start work on it ASAP
22:51:07 cool
22:51:29 #agreed schedule feature branch merge for thursday jan 18
22:51:35 #topic open discussion
22:51:39 anything else?
22:52:14 oh, as far as release-related needs, i wouldn't mind if someone took a look at my change to link the zuul-base-jobs documentation from the user's guide
22:52:18 #link https://review.openstack.org/531912 Link to zuul-base-jobs docs from User's Guide
22:53:03 it's small, and could probably stand to be at least a little less small
22:53:55 fungi: ++
22:55:37 if that's it, i'll go ahead and end
22:55:40 thanks everyone!
22:55:45 #endmeeting