22:04:49 <corvus> #startmeeting zuul
22:04:50 <openstack> Meeting started Mon Jan 15 22:04:49 2018 UTC and is due to finish in 60 minutes.  The chair is corvus. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:04:51 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:04:53 <openstack> The meeting name has been set to 'zuul'
22:05:01 <corvus> #topic Roadmap
22:05:16 <corvus> I'm going to ping folks individually this week to check up on status
22:05:47 <corvus> but does anyone here working on a 3.0 release blocker have an issue we should talk about now?
22:05:59 <corvus> (i know a lot of folks are afk today, but i thought i'd ask)
22:06:49 <corvus> #topic RAM governor for the executors
22:07:00 <corvus> dmsimard: i think this is your topic?
22:07:10 <dmsimard> oh, from last week yes
22:08:04 <corvus> #link https://review.openstack.org/508960 ram governor
22:08:17 <dmsimard> We're generally memory constrained right now -- we're often finding zuul executors in swap territory, and at that point it becomes a vicious circle quickly (they can't clear jobs fast enough, so they get more jobs, etc.) and the OOM killer is firing on several of them
22:09:03 <dmsimard> So we want to land and enable the RAM governor ASAP but there's also another "governor" I'd like to talk about -- it'd be "max concurrent builds"
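[A RAM governor amounts to the executor checking available memory before registering for more work. A minimal sketch of that kind of check, assuming psutil; the names and threshold here are illustrative, not necessarily what 508960 implements:]

    import psutil

    # Illustrative floor: stop accepting new jobs when available memory
    # drops below this many MiB.
    MIN_AVAIL_MEM_MB = 2048

    def enough_memory_to_accept_work():
        # psutil reports available memory in bytes.
        avail_mb = psutil.virtual_memory().available / (1024 * 1024)
        return avail_mb > MIN_AVAIL_MEM_MB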
22:09:08 <corvus> dmsimard: when executors go above a certain load average they shouldn't accept new jobs
22:09:22 <clarkb> on the scheduler side the memory consumer is the size of the zuul config model. Do we know what is consuming the memory on executors? is it ansible?
22:09:39 <dmsimard> Regardless of our current governors (even pretending RAM had landed), there's nothing preventing a single executor from picking up 200 builds by itself
22:09:40 <corvus> dmsimard: when have you seen executors accept new jobs because they can't clear them fast enough?
22:10:04 <corvus> dmsimard: yes there is -- we would have two things preventing it -- a load governor and a ram governor
22:10:09 <corvus> right now we have one
22:10:12 <clarkb> (just want to make sure that governing job execution is expected to reduce memory use and it isn't the finger daemon that is consuming all the memory, for example)
22:10:31 <dmsimard> corvus: not from a cold boot -- when all executors crashed a week ago, ze01 started first and picked up all the backlogged jobs and (eventually) loaded up to 150
22:11:13 <fungi> i gather the issue with the system load governor not kicking in fast enough is that system load average is a trailing indicator, so in certain thundering herd scenarios an executor can pick up a glut of jobs before the load spikes high enough to stop it
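[For comparison, the existing load governor boils down to a check on the one-minute load average along these lines; a sketch, with an illustrative multiplier rather than the actual configured value:]

    import multiprocessing
    import os

    # Illustrative multiplier: stop accepting jobs once the 1-minute load
    # average exceeds this many times the CPU count.
    LOAD_MULTIPLIER = 2.5

    def load_low_enough_to_accept_work():
        # os.getloadavg() returns the 1-, 5- and 15-minute load averages;
        # the 1-minute figure is still a trailing indicator.
        return os.getloadavg()[0] < LOAD_MULTIPLIER * multiprocessing.cpu_count()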
22:11:16 <corvus> clarkb: i *think* it's ansible eating the memory, but it's not leaking, it just uses a lot.  at least, that's my recollection.  it would be good to confirm that.
22:11:52 <corvus> dmsimard: yes, that's true.  i think after we have a ram governor, we should look into tuning the rate at which jobs are accepted.
22:12:29 <fungi> that sounds sane
22:12:37 <dmsimard> Generally speaking, there are only so many SSH connections/ansible playbooks we can have running at any given time
22:12:54 <dmsimard> Wouldn't it be reasonable to say an executor can accept no more than 100 concurrent builds, for example?
22:13:00 <corvus> dmsimard: i'd like to save 'max jobs per server' as a last resort -- as in, i'd like us to avoid ever implementing it if possible, unless we completely fail at everything else.  the reason is that not all jobs are equal in resource usage.  i think it would be best if the executors could regulate themselves toward maximum resource usage without going overboard.
22:13:23 <fungi> dmsimard: depending on the resources available and performance of the server, that number may vary quite a lot though right?
22:13:59 <fungi> what if you have two executors, one of which is ~half as powerful as the other... having zuul scale itself to them based on available resources is nice
22:14:05 <dmsimard> fungi: If something like that lands, it would be something configurable (with a sane default) imo
22:14:18 <dmsimard> the way I see it, it's more of a safety net
22:15:05 <fungi> not enough of a safety net unless you get into fiddling with per-server knobs rather than having a sane resource scheduler which can guess the right values for you
22:15:05 <corvus> i don't want admins to have to tune these options.  there is no sensible global default for max jobs per server, so it would always need to be individually tuned.  further, that ignores that not all jobs are the same, so it's problematic.
22:15:48 <corvus> a job with 10 nodes that runs for 3 hours is different than a job with zero nodes that runs for 30 seconds.  both are very likely in the same system.
22:16:05 <dmsimard> right
22:16:13 <pabelanger> o/ sorry I am late
22:16:31 <dmsimard> pabelanger: ohai I was actually about to ask, do we think we can land https://review.openstack.org/#/c/508960/ soon ?
22:17:33 <corvus> i agree that we need to prevent the hysteresis from happening -- i think the road there goes through the ram governor first, then tune the acceptance rate (there should already be a small rolloff, but maybe we need to adjust that a bit) so that the trailing indicators have more time to catch up.  finally, we may want to tune our heuristics a bit to give the trailing indicators more headroom.
22:17:39 <pabelanger> dmsimard: i think we want to add tests first, I'm hoping to finish that up in the next day or so
22:18:14 <dmsimard> corvus: fwiw I agree that the max build idea is not a definitive answer and instead we might want to do like you mentioned and revisit/tune how executors pick up jobs in the first place
22:18:15 <corvus> pabelanger: ++ we should be able to use mock to return some ram data
22:18:38 <pabelanger> corvus: wfm
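[A rough idea of how mock could feed fake memory numbers into such a test; patching psutil and the 2048 MiB threshold are assumptions -- the real test would call whatever hook 508960 adds:]

    from unittest import mock
    import psutil

    # Pretend only 100 MiB is available, without needing a genuinely
    # memory-starved test machine.
    fake_mem = mock.Mock(available=100 * 1024 * 1024)

    with mock.patch('psutil.virtual_memory', return_value=fake_mem):
        avail_mb = psutil.virtual_memory().available / (1024 * 1024)
        # A real test would assert that the executor stops registering
        # for new work under this condition.
        assert avail_mb < 2048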
22:19:52 <corvus> dmsimard: it looks like right now we delay job acceptance a small amount but only with the goal of spreading the load across executors, so the response time is still pretty quick
22:20:00 <pabelanger> speaking of jobs, one thing zuulv2.5 did, and I don't believe zuulv3 does, is we had some sort of back-off method so a single executor wouldn't accept a bunch of jobs at once. That seemed to work well in zuulv2.5 with our zuul-launchers
22:20:10 <corvus> and it only looks at the number of jobs currently running
22:20:32 <corvus> what we may want to do is adjust that to *also* factor in how recently a job was accepted
22:21:08 <corvus> or just increase the delay that's already there and only use jobs running
22:21:18 <corvus> it's currently: delay = (workers ** 2) / 1000.0
22:21:44 <corvus> 'workers' means jobs running in this context
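[Working that formula through for a few job counts shows how small the delay stays; the 6.4 seconds at 80 jobs mentioned below comes straight from it:]

    # Current acceptance delay: quadratic in the number of jobs already
    # running on this executor, in seconds.
    def acceptance_delay(running_jobs):
        return (running_jobs ** 2) / 1000.0

    for jobs in (10, 40, 80, 150):
        print(jobs, acceptance_delay(jobs))
    # 10 -> 0.1s, 40 -> 1.6s, 80 -> 6.4s, 150 -> 22.5s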
22:21:53 <fungi> would that also explain why when all the executors crashed at once, the first one to get started went nuts on the backlog?
22:22:17 <fungi> since there wasn't even the rotation between executors to save it
22:22:25 <clarkb> ya jobs running seems like maybe a better option than workers
22:22:43 <corvus> that's only going to slow us down 6.4 seconds with 80 jobs, so that's not enough time for load/ram to start to catch up
22:23:13 <corvus> clarkb: no sorry, the variable is called "workers" but it means "number of jobs that this executor is running"
22:23:22 <corvus> it's the internal executor gearman worker count
22:23:26 <fungi> fair. and the load governor is based on one-minute load average, so you have a lot of time to ramp up to untenable levels of activity
22:23:38 <clarkb> gotcha
22:24:24 <fungi> i have similar worries about the ram governor, if the amount of ram ansible is going to use grows over time (we may take on a glut of jobs, and not finish old ones fast enough to make way for the memory growth of the new spike)
22:24:27 <pabelanger> We also don't start running ansible right away now; we first merge code on the local executor. Perhaps that isn't load / memory heavy?
22:25:19 <corvus> pabelanger: it's not too heavy, but it is a delay worth considering.  we could even let that be a natural source of delay -- like don't acquire a new job until we've completed merging the most recent job.
22:25:26 <dmsimard> pabelanger: I'm wondering if swap should be taken into account in the RAM governor (and how)
22:25:43 <corvus> that would probably be fine fully loaded, but it would make for a very slow start.
22:26:06 <pabelanger> corvus: yah, that might be something to try. I like that
22:26:12 <fungi> i think once we've started paging zuul activity out to swap space, it's already doomed
22:26:38 <dmsimard> fungi: that's the case for the scheduler, but executors will keep running even when swapping
22:26:50 <dmsimard> ideally the ram governor prevents us from reaching swap territory
22:26:51 <fungi> how well do they keep running?
22:27:30 <dmsimard> not well -- when the executors start swapping, execution becomes largely I/O bound and there's a higher percentage of I/O wait
22:27:50 <fungi> if "keep running" means jobs start timing out because it takes too long for ansible to start the next task/playbook then that's basically it
22:28:24 <corvus> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64003&rra_id=all
22:28:31 <fungi> i didn't mean doomed to need a restart, i meant doomed to introduce otherwise avoidable job failures
22:28:52 <dmsimard> fungi: I noticed the i/o wait and swap usage when I was trying to understand the SSH connection issues, there might be a correlation but I don't know.
22:29:27 <corvus> interesting -- they're pretty active swapping but keep the used memory close to 50%
22:29:45 <corvus> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64004&rra_id=all
22:29:49 <corvus> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64005&rra_id=all
22:30:11 <fungi> i wonder if buffer space there is the ansible stdout buffering stuff
22:30:40 <dmsimard> fungi: actually I asked #ansible-devel about that and the buffer is on the target node, not the control node
22:31:03 <pabelanger> speaking of SSH connection issues, we could use SSH retries from ansible: https://review.openstack.org/512130/ to help add some fail protection to jobs
22:31:04 <dmsimard> so it wouldn't explain the ram usage
22:31:06 <corvus> dmsimard: did they indicate what happens when the target node sends the data back to the control node?
22:31:07 <fungi> regardless, system and iowait cpu usage there don't look super healthy, leading me to wonder whether we still have too few executors at peak
22:31:15 <pabelanger> maybe even expose it somehow so it's configurable by jobs
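[For reference, the retry knob that 512130 is about can be set through ansible configuration; the value here is only an example, and the actual change may wire it up differently:]

    # ansible.cfg (example; this can also be set via the
    # ANSIBLE_SSH_RETRIES environment variable)
    [ssh_connection]
    # Retry the SSH connection this many times before marking the host
    # unreachable, to ride out transient network blips.
    retries = 3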
22:31:55 <fungi> and 5-minute load average spiking up over 40 on ze01 just a few hours ago
22:31:58 <dmsimard> pabelanger: there's some improvements we can do around SSH, yes
22:32:14 <fungi> where it topped out around 4gb of swap in use
22:32:25 <dmsimard> fungi: that's likely load due to i/o wait
22:32:32 <dmsimard> (heavy swapping)
22:32:33 <fungi> dmsimard: exactly
22:32:39 <pabelanger> I also think OSIC suggested some things we could tune in ansible for network-related issues.   Need to see if I can find that etherpad
22:32:59 <pabelanger> or was it the OSA team
22:33:11 <dmsimard> fungi: vm.swappiness is at 0 on ze01 too...
22:33:23 <dmsimard> I read that as "never swap ever" so I don't know what's going on
22:33:57 <dmsimard> Oh, actually it doesn't quite mean that
22:34:01 <dmsimard> "Swap is disabled. In earlier versions, this meant that the kernel would swap only to avoid an out of memory condition, when free memory will be below vm.min_free_kbytes limit, but in later versions this is achieved by setting to 1."
22:34:21 <fungi> dmsimard: no, it just means don't preemptively swap process memory out to make room for additional cache memory
22:34:39 <dmsimard> and our min_free_kbytes is vm.min_free_kbytes = 11361
22:35:14 <fungi> these are fairly typical configuration for "cloud" virtual machines
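[The two VM tunables being quoted can be read straight from /proc on an executor, e.g.:]

    # Print the VM tunables discussed above as the kernel currently sees them.
    for knob in ('swappiness', 'min_free_kbytes'):
        with open('/proc/sys/vm/%s' % knob) as f:
            print(knob, f.read().strip())
    # ze01 at the time: swappiness=0, min_free_kbytes=11361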
22:35:22 <corvus> fungi: our cpu usage from last week is significantly different from november: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64000&rra_id=4&view_type=&graph_start=1483002375&graph_end=1516055559
22:35:31 <clarkb> ok I've got to head out now. The two things I wanted to bring up were merging feature/zuulv3 to master -- I tested this with nodepool and wrote an infra list email about it -- and my two nodepool changes to address cloud failures, 533771 and its parent. I left the test split out as I expect it may need cleanup, but it should be good enough to show the parent works
22:35:53 <corvus> we may be seeing the hit from meltdown and may indeed need to add more executors
22:36:32 <fungi> yep, it is a bit worse
22:37:27 <corvus> clarkb: thanks, yeah, i think we can merge the branches soon, maybe let's set a date for thursday and send out a followup email?
22:37:35 <fungi> meltdown mitigation performance hit seems as good a culprit as any
22:38:15 <dmsimard> Re: adding more executors -- do we think we have the right size right now? In terms of flavors, disk size, etc.
22:38:29 <pabelanger> looking at stats from zuul-launcher is a little interesting too: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=4683&rra_id=4&view_type=&graph_start=1483002628&graph_end=1516055812
22:38:41 <pabelanger> we do seem to be using more system resources with executors
22:38:51 <corvus> pabelanger: that had a *very* different mode of operation
22:39:06 <dmsimard> pabelanger: executors also run zuul-merger which is not negligible
22:39:20 <pabelanger> yup
22:39:53 <corvus> dmsimard: i'd argue it is negligible
22:40:07 <corvus> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=1519&rra_id=all
22:40:11 <corvus> that's on a 2G server
22:40:50 <dmsimard> but there's 8 zm nodes :P
22:41:13 <corvus> yes, for parallelization
22:41:48 <corvus> i'm just saying that the internal merger is not what's eating things up on the executors.  we're just doing a lot more with ansible than we were in zuulv2.5
22:41:58 * dmsimard nods
22:42:40 <corvus> (among other things, in zuul v2.5, we did *not* ship the entire console output back to the controlling ansible process)
22:43:57 <corvus> anyway, to conclude: let's say the plan for now is: add ram governor, then slow job acceptance rate.  sound good?
22:44:04 <fungi> wfm
22:44:07 <pabelanger> I think we also used 2.1.x vs 2.3.x, so it's possible ansible is now just using more resources
22:44:14 <pabelanger> corvus: ++
22:44:47 <corvus> #agreed to reduce hysteresis and excess resource usage: add ram governor, then slow job acceptance rate
22:44:55 <corvus> #topic merging feature branches
22:45:35 <corvus> we should be all prepared as clarkb said -- but we still may want to actually schedule this so no one is surprised
22:45:43 <corvus> how about we say we'll do it on thursday?
22:46:13 <fungi> the puppet changes to make that hitless for people using puppet-openstackci are merged at this point, right?
22:46:38 <corvus> fungi: yep
22:46:46 <pabelanger> thursday is fine by me
22:47:09 <corvus> #link puppet-openstackci change https://review.openstack.org/523951
22:47:11 <fungi> i guess as long as the people using that module update it with some frequency they're protected. if they don't, then they're using a non-continuously-deployed puppet module to continuously deploy a service... so a learning experience for them?
22:47:31 <pabelanger> we'll have to make some config changes to nodepool-builder, since it is using old syntax. I can propose some patches for that
22:47:41 <corvus> fungi: yep.  and i mean, it's not going to eat their data, they just need to checkout a different version and reinstall
22:47:44 <pabelanger> maybe also upgrade to python3 at the same time
22:49:05 <corvus> pabelanger: heh, well, if we're checking out master on nodepool builders, then i think we'll automatically get switched to v3.  :)
22:50:03 <corvus> pabelanger: do you want to deploy new builders running from the feature/zuulv3 branch before we merge?
22:50:09 <pabelanger> corvus: I was thinking maybe we first switch nodepool builders to feature/zuulv3 branch and get config file changes in place
22:50:12 <pabelanger> yah
22:50:35 <corvus> pabelanger: think that's reasonable to do before thursday?
22:50:54 <corvus> hopefully there'll be more folks around tomorrow to help too
22:50:56 <pabelanger> I believe so, I can start work on it ASAP
22:51:07 <corvus> cool
22:51:29 <corvus> #agreed schedule feature branch merge for thursday jan 18
22:51:35 <corvus> #topic open discussion
22:51:39 <corvus> anything else?
22:52:14 <fungi> oh, as far as release-related needs, i wouldn't mind if someone took a look at my change to link the zuul-base-jobs documentation from the user's guide
22:52:18 <fungi> #link https://review.openstack.org/531912 Link to zuul-base-jobs docs from User's Guide
22:53:03 <fungi> it's small, and could probably stand to be at least a little less small
22:53:55 <corvus> fungi: ++
22:55:37 <corvus> if that's it, i'll go ahead and end
22:55:40 <corvus> thanks everyone!
22:55:45 <corvus> #endmeeting