22:01:05 <corvus> #startmeeting zuul
22:01:06 <openstack> Meeting started Mon Feb  5 22:01:05 2018 UTC and is due to finish in 60 minutes.  The chair is corvus. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:01:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:01:09 <openstack> The meeting name has been set to 'zuul'
22:01:11 <corvus> #topic Agenda
22:01:31 <corvus> there is no agenda in the wiki https://wiki.openstack.org/wiki/Meetings/Zuul
22:01:36 <jhesketh> o/
22:01:41 <corvus> #link agenda (or lack thereof) https://wiki.openstack.org/wiki/Meetings/Zuul
22:01:45 <fungi> the best kind of agenda
22:02:05 <corvus> anyone have anything they want to talk about?
22:02:42 <clarkb> the inap situation has maybe shown us that fixing timeouts is more important?
22:02:48 <clarkb> granted the actual fix there was to fix the cloud
22:02:59 <corvus> #topic timeouts
22:03:04 <corvus> clarkb: can you elaborate?
22:03:21 <clarkb> Last week we had trouble with instances in inap which resulted in slow disk and slow networking (as I understand it)
22:03:23 <corvus> which timeouts, and how were they broken?
22:03:39 <clarkb> this caused jobs to time out, but it took them like 6 hours to do so because we apply the timeout to each run stage rather than to the job as a whole
22:04:06 <corvus> ah yep.  that much at least should be a relatively easy change, as soon as we decide how we want to implement it.
22:04:08 <clarkb> this was painful because it meant that jobs weren't getting rescheduled (relatively) quickly on new nodes and instead were sitting around for a quarter of a day before failing
22:04:27 <clarkb> (I think the rescheduling would've had to be manual via recheck in this case)
22:04:52 <fungi> pathological scenario though, where the reused ssh connection basically becomes a blackhole for the commands being passed in
22:05:27 <corvus> we could go ahead and give the entire job the timeout budget.  so if the timeout is 2h, and the pre playbook takes 2h, we will timeout.
22:05:39 <corvus> that's probably not ideal, but it's probably better than what we have now.
22:06:03 <corvus> (i think the ideal thing would maybe be per-playbook timeouts, so pre could have a 10m timeout, and run could have 2h)
22:06:09 <clarkb> ya I think that's what I have in mind as far as addressing it
22:06:22 <clarkb> another approach would be to specify three timeouts one for each run stage
22:06:23 <corvus> but we can implement cumulative job timeout fairly easily, and then maybe talk about per-playbook timeouts later.
22:06:30 <corvus> or per-stage
22:06:38 <clarkb> but I think for user simplicity a single timeout is easy to understand
22:06:38 <mordred> yah. I think giving the entire job the timeout budget to start, and maybe enhancing in the future to have per-playbook timeouts?
22:07:33 <corvus> i feel like both of those are things we can do now, and change later without much disruption
22:07:39 <mordred> ++
22:07:48 <fungi> any of the above ideas seems fine to me. as long as a job that sometimes needs 3h for its run playbook doesn't end up potentially hung through 5x that because it gets the same timeout applied to two run playbooks and two post playbooks
22:07:57 <corvus> with the first update to cumulative timeout, folks *may* need to increase timeouts a bit.  but hopefully not much.
22:08:08 <clarkb> was there a particular issue that prevented us from implementing this behavior before? (I want to say I heard there was but don't know details)
22:08:14 <fungi> er, to two pre playbooks and two post playbooks
22:08:20 <dmsimard> fungi: I guess it's even worse if the timeout ends up occurring in pre which has the job retry
22:08:23 <clarkb> corvus: I think most people implemented the timeout values mostly as if they were cumulative
22:08:28 <corvus> clarkb: nope, just nobody typed the words into a text editor.
22:08:31 <fungi> dmsimard: which happened in some cases
22:08:38 <clarkb> cool, in that case I may try to poke at changing the behavior
22:09:51 <corvus> dmsimard: if it times out in pre, it will retry the job.  (and therefore, the timer would reset).  i assume that we'd generally want to continue that.
22:10:10 <fungi> clarkb: i think the reason was that for converted jobs we had one timeout value, and passing a timeout to the playbook was relatively trivial to implement
22:10:33 <fungi> since that's an ansible feature already
22:10:42 <corvus> there will be a case where we'll hit the timeout 3 times in 3 retries, and we'll be sad.  but i think by and large, the kind of error we'd expect a timeout to represent is exactly the kind of error we usually want to retry.
22:10:50 <clarkb> fungi: gotcha so it was just less accounting in zuul makes sense
22:10:57 <dmsimard> corvus: +1
22:11:27 <corvus> (in pre, of course)
22:11:42 <fungi> ahh, yeah i guess timing out one of the pre playbooks would have gotten the job aborted regardless of the ssh connection state
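[editor's note: the retry semantics corvus describes above — a timeout in a pre playbook retries the whole job with a fresh timer, up to a retry limit — could be sketched roughly as follows. This is an illustration only, not Zuul's actual code; the function names are made up, and the attempt limit of 3 matches the number corvus mentions.]

```python
class PrePlaybookTimeout(Exception):
    """Raised when a pre playbook exceeds its timeout (illustrative)."""


def run_job(run_pre, run_body, attempts=3):
    """Retry the whole job when its pre playbook times out.

    Each attempt gets a fresh timeout budget, so a node or network
    problem during pre leads to a retry on new resources rather than
    an immediate failure. After `attempts` consecutive pre-phase
    timeouts the job is reported as hitting the retry limit.
    Names here are hypothetical, not Zuul's real API.
    """
    for attempt in range(1, attempts + 1):
        try:
            run_pre()  # a timeout here triggers a retry, timer resets
        except PrePlaybookTimeout:
            if attempt == attempts:
                return "RETRY_LIMIT"
            continue
        # pre succeeded: run the job body; a timeout here does NOT retry
        return run_body()
    return "RETRY_LIMIT"
```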
22:11:43 <clarkb> I can look into that probably tomorrow if not later today
22:11:49 <clarkb> still catching up on being largely afk for almost two weeks
22:12:10 <corvus> #action clarkb make timeout cumulative in executor
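[editor's note: the cumulative-timeout behavior agreed above — one shared budget deducted across all playbooks, instead of the full timeout per stage — can be sketched like this. All names are illustrative and do not reflect the actual executor implementation.]

```python
import time


class JobTimeoutError(Exception):
    """Raised when the job's shared timeout budget is exhausted."""


def run_playbooks(playbooks, job_timeout):
    """Run each playbook against a single shared time budget.

    Under the old behavior every playbook independently got the full
    job timeout, so a job with several stages could hang for a
    multiple of its nominal timeout. Here each playbook's elapsed
    time is deducted from one budget, so the job as a whole fails
    after at most roughly `job_timeout` seconds. `playbooks` is a
    list of callables taking the remaining budget (hypothetical
    interface for illustration).
    """
    remaining = job_timeout
    for run_playbook in playbooks:
        start = time.monotonic()
        run_playbook(remaining)  # each stage only gets what's left
        remaining -= time.monotonic() - start
        if remaining <= 0:
            raise JobTimeoutError("job exceeded its cumulative timeout")
    return remaining
```

As noted in the discussion, existing jobs tuned for per-stage timeouts may need their configured timeout raised slightly once the budget becomes cumulative.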
22:12:19 <corvus> any other topics?
22:12:52 <fungi> some changes landed for a memory governor i guess? and there's question as to whether it's working as intended?
22:13:22 <corvus> yeah, i'm trying to sort that out now.
22:13:47 <corvus> we're hoping that will keep us from oom-killing the log streamer
22:13:59 <fungi> fingers crossed
22:14:28 <corvus> but at this point, i have a bunch of confusing data.  i'll keep brain-dumping into #zuul as i work through it, and bug people when i've got a coherent idea sorted
22:14:38 <fungi> thanks!
22:15:44 <corvus> if there's nothing else -- let's get back to it :)
22:15:47 <Shrews> fyi, i will be away beginning this thursday through (and including) the following thursday
22:15:56 <fungi> nothing else from me
22:15:58 <Shrews> so don't break nuttin
22:15:58 <corvus> Shrews: ah thanks!
22:16:14 <corvus> Shrews: anything we should try to get into prod before you leave?
22:16:21 <corvus> or get merged
22:16:44 <Shrews> corvus: nothing urgent. we've merged some good fixes to nodepool recently. perhaps we should restart the launchers?
22:16:52 <clarkb> I will be missing the next meeting as I'm doing taxes and it's cheaper if I do them before the 15th
22:17:20 <corvus> Shrews: probably a good idea
22:17:30 <corvus> okay, thanks everyone!
22:17:32 <corvus> #endmeeting