14:00:32 <fried_rice> #startmeeting nova-scheduler
14:00:33 <openstack> Meeting started Mon Dec  3 14:00:32 2018 UTC and is due to finish in 60 minutes.  The chair is fried_rice. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:36 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:39 <gibi> o/
14:00:41 <takashin> o/
14:01:38 <efried> #topic last meeting
14:01:38 <efried> #link last minutes: http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:01:50 <efried> hm, sec...
14:01:52 <edleafe> \o
14:01:59 <fried_rice> #chair efried
14:02:00 <openstack> Warning: Nick not in channel: efried
14:02:01 <openstack> Current chairs: efried fried_rice
14:02:08 <efried> #topic last meeting
14:02:08 <efried> #link last minutes: http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:02:11 <tetsuro> o/
14:02:12 <efried> there we go.
14:02:29 <efried> Any old business?
14:03:03 <cdent> o/
14:04:15 <efried> #link latest pupdate: http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000392.html
14:04:40 <efried> topic from the pupdate: os-resource-classes
14:04:43 * alex_xu waves late
14:04:54 <efried> without leakypipes here, not sure we're going to get to closure
14:05:36 <efried> My take, expressed on the ML, was to do *something* rather than continuing to discuss and do nothing.
14:05:40 <cdent> efried: I think you're right: is basically a guess of just doing something
14:05:50 <cdent> s/guess/case/
14:06:13 <cdent> I haven't done it myself because a) i've been busy with other things, b) I don't want to do too much
14:06:19 <efried> Mm.
14:06:34 <cdent> but if it comes to it, I'll reinstate my existing experiments and we'll just do that
14:06:50 <cdent> If someone else would like to do so, the code is available on the -4 etherpad
14:06:55 <efried> cdent: do we have an etherpad or anything with the existing proposals?
14:07:03 <cdent> #link -4 etherpad https://etherpad.openstack.org/p/placement-extract-stein-4
14:07:09 <efried> oh, does -4 have them all listed?  /me looks...
14:07:25 <cdent> no, it was assumed that other people would add theirs
14:07:30 <cdent> but they didn't
14:07:34 <cdent> so perhaps that's a sign
14:07:46 <edleafe> silent assent?
14:07:49 * mriedem joins late
14:07:51 <efried> I added Jay's
14:08:02 <efried> I'm in favor of "the simplest thing that could possibly work".
14:08:14 <cdent> the ptg etherpads had more discussion
14:08:15 <efried> Which is basically to take the enum that we're using in the placement repo and stuff it in its own repo.
14:08:32 <efried> I think that's what cdent's thing does.
14:08:38 <cdent> pretty much
14:09:13 <efried> talk of making an os-placement-artifacts library has only led to thrashing, so I say we abandon that route for now.
14:09:26 <cdent> pretty much
14:09:36 <efried> so we'll have a super thing os-traits and a super thing os-resource-classes and they'll be separate and done.
14:09:46 <efried> s/thing/thin/g
14:09:55 <cdent> pretty much
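[Editor's note: the "super thin" library agreed above — the placement enum stuffed into its own repo — could look roughly like this. The constant names mirror placement's standard resource classes; the `STANDARDS` list and `is_custom` helper are illustrative assumptions, not the merged library's confirmed API.]

```python
# A minimal sketch of a "super thin" os-resource-classes library:
# just the standard resource class names as module-level string
# constants, plus a list for callers that want to iterate them.

VCPU = 'VCPU'
MEMORY_MB = 'MEMORY_MB'
DISK_GB = 'DISK_GB'
PCI_DEVICE = 'PCI_DEVICE'
SRIOV_NET_VF = 'SRIOV_NET_VF'

# The standard classes, in declaration order.
STANDARDS = [VCPU, MEMORY_MB, DISK_GB, PCI_DEVICE, SRIOV_NET_VF]

# Operator-defined resource classes are namespaced with this prefix.
CUSTOM_NAMESPACE = 'CUSTOM_'


def is_custom(name):
    """Return True if the resource class name is operator-defined."""
    return name.startswith(CUSTOM_NAMESPACE)
```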
14:10:46 <efried> cool. Anyone want to volunteer to do the git/gerrit paperwork to seed a real openstack/os-resource-classes with https://github.com/cdent/os-resource-classes ?
14:11:00 <edleafe> I might have time for that
14:11:13 <efried> Nice.
14:11:33 <efried> #action edleafe to seed openstack/os-resource-classes from https://github.com/cdent/os-resource-classes
14:11:39 * efried ignores "might" :P
14:11:45 <edleafe> So not to be wishy-washy, I'll *make* time for that
14:11:50 <efried> Thanks edleafe
14:11:51 <edleafe> jinxish
14:12:00 <efried> Moving on.
14:12:04 <efried> Anything else from the pupdate?
14:12:20 <efried> or reviews/specs not mentioned therein?
14:12:51 <cdent> I wanted to know from you, efried, if your provider tree stuff was in the glide path, or still had humps?
14:13:14 <cdent> smooth sailing or rough seas
14:13:18 <efried> Just the gate, I think.
14:13:19 <cdent> good metaphors or bad
14:13:21 <efried> one sec.
14:13:41 <efried> #link nova-to-placement traffic reduction series now starting at https://review.openstack.org/#/c/615646/
14:13:51 <efried> Bottom patch is +A, in recheck vortex.
14:14:44 <efried> Next few patches have been revised for minor issues from leakypipes and should be ready for his +2.
14:15:05 <efried> SchedulerClient evisceration patches have leakypipes +2
14:15:36 <efried> and then the top one is pretty fresh.
14:15:56 <efried> Note that the top three patches are really cleanup, not specific to the nova-to-placement traffic business
14:16:15 <efried> They're on top to obviate merge conflicts, because I'm a special kind of lazy.
14:16:20 <cdent> cleanup++
14:17:12 <efried> I think I'm hoping mriedem will look at patches 2, 3, 4 in that series, since he's been involved in the bottom two.
14:17:27 <efried> (including https://review.openstack.org/#/c/615606/ which has merged)
14:18:13 <mriedem> yar
14:18:30 <efried> Thanks mriedem :*
14:18:51 <efried> Specs-wise, I should mention the cyborg/nova thing.
14:18:56 <mriedem> is cern running with any of this yet btw?
14:19:16 * bauzas is just following the convo here silently
14:19:17 <efried> mriedem: I don't think so. But tssurya and belmoreira have reviewed the patches and +1'd
14:19:24 <efried> at previous patch sets
14:19:40 <efried> since which not much of substance has changed
14:20:43 <bauzas> I don't get why the eviscerate patches are important, but meh, I'm +1 with them
14:20:51 <bauzas> I just need some time for reviewing them
14:21:09 <efried> tssurya: any chance y'all could pull down the series, say up to https://review.openstack.org/#/c/615705/ , and make sure a) they don't blow up the world, and b) they do what we expect to reduce traffic?
14:21:49 <efried> bauzas: Important, meh. They're just getting rid of an unnecessary abstraction layer that folks (mainly leakypipes and I) have been advocating ripping out for a while and just hadn't gotten around to.
14:22:08 <bauzas> well ok
14:22:36 <bauzas> that said, I'm still needing some comments for https://review.openstack.org/#/c/599208/ :)
14:22:56 <efried> If nothing else, having that SchedulerClient in the way made it hard to search for usages of those methods in PyCharm
14:23:23 <efried> bauzas: ack, that one's on my list for sure. It's actually on the agenda here in a minute...
14:24:36 <efried> #topic Extraction
14:24:36 <efried> #link Extraction etherpad https://etherpad.openstack.org/p/placement-extract-stein-4
14:25:01 <efried> libvirt/xenapi reshaper series  <== need reviews!
14:25:01 <efried> #link libvirt reshaper https://review.openstack.org/#/c/599208/
14:25:01 <efried> #link xen reshaper (middle of series) https://review.openstack.org/#/c/521041
14:25:29 <efried> Last time we agreed to continue deferring work on the reshaper FFU framework.
14:25:44 <efried> Any other extraction-related topics or discussion that we haven't already covered?
14:26:24 <cdent> the functional tests in nova, using external placement, are happy now
14:26:24 <gibi> efried: https://review.openstack.org/#/c/617941/ now works but there is some controversy
14:26:37 <efried> cdent: ++ \o/
14:26:39 <gibi> efried: about nova-status check test
14:26:42 <cdent> #link https://review.openstack.org/#/c/617941/
14:26:57 <cdent> yeah, that's the stickler we need to resolve
14:27:12 * efried hears nova-status, looks at mriedem
14:27:17 <cdent> related to that, zzzeek has some cleaning up on the database fixtures:
14:27:34 <cdent> #link data fixture cleanup https://review.openstack.org/#/c/621304/
14:27:35 <mriedem> that change doesn't drop the in-tree nova db objects for placement,
14:27:44 <mriedem> so i don't see why that change needs to drop the nova-status tests at all
14:27:55 <mriedem> they can go later when the in-tree placement code is dropped, right?
14:27:55 <cdent> mriedem: it's child does?
14:27:59 <cdent> its
14:28:04 <mriedem> so let the child worry about it
14:29:02 <cdent> can do that if you like, but it seemed odd to get rid of a bunch of placement tests and then not get rid of that one
14:29:07 <cdent> but whatever people like
14:29:11 <mriedem> it's not a placement test,
14:29:13 <mriedem> it's a nova test
14:29:20 <mriedem> it's just that it uses the db rather than the api
14:29:27 <cdent> it's miscegeny
14:29:34 <cdent> which must be purged!
14:29:52 * cdent seeks therapy
14:30:59 <cdent> i'll put the test back this afternoon sometime
14:32:01 <cdent> but we're still going to need to resolve how that command and the other "look at placement" commands that are in nova are going to be after these changes
14:32:03 <gibi> regarding zzzeek's fixture cleanup, I'm running nova functional against it
14:32:48 <gibi> which makes me wonder if we want to trigger nova functional for each placement change in CI
14:33:17 <mriedem> i can feel the bristling from here
14:33:42 <mriedem> if anything, maybe just an experimental queue job
14:33:51 <mriedem> i don't think placement changes want to gate on nova functional tests
14:34:02 <cdent> that would be teh suck
14:34:19 <cdent> for everyone
14:34:19 <efried> no?
14:34:24 <mriedem> experimental queue allows you to run it on-demand in case you worry about a breaking change to the fixture
14:34:42 <mriedem> but do we expect that to happen very often?
14:34:45 <mriedem> i wouldn't think so
14:35:03 <cdent> nor me
14:35:16 <gibi> OK, thanks for the feedback
14:35:41 <efried> Ready to move on?
14:35:45 <gibi> yes
14:35:52 <efried> #topic bugs
14:35:52 <efried> #link Placement bugs https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:36:07 <efried> any bugs to highlight?
14:36:57 * mriedem takes kid to bus stop
14:37:00 <efried> #topic opens
14:37:02 <cdent> I have one I find interesting
14:37:07 <efried> #undo
14:37:08 <openstack> Removing item from minutes: #topic opens
14:37:15 <efried> cdent: go
14:37:24 <cdent> #link ensure aggregates under load https://bugs.launchpad.net/nova/+bug/1804453
14:37:25 <openstack> Launchpad bug 1804453 in OpenStack Compute (nova) "maximum recursion possible while setting aggregates in placement" [Undecided,New]
14:37:58 <cdent> you have to really slam resource provider creation and aggregate manipulation to tickle that, but when you do the system gets very upset
14:38:58 <efried> wow. Why are we recursing there? I instinctively twitch at that.
14:39:09 <cdent> it's not real recursion
14:39:44 <cdent> well it is, it's a loop that if it gets stuck for long enough triggers "I've called this code too many times"
14:40:57 <cdent> but yes, it needs a way to exit itself before then
14:41:05 <efried> and it's caused by real conflicts happening, or the same conflict getting hit a bunch of times in a row while the other thread is trying to finish up?
14:41:16 <cdent> from a service mgt standpoint, the way to avoid the problem is to have higher concurrency in the service
14:41:43 <cdent> a single thread gets stuck while other threads are succeeding
14:42:25 <cdent> it's a very specific set of circumstances to trigger it, and placeload happens to have those circumstances
14:42:36 <cdent> you also have to under-provision the server so that it has too few threads
14:43:15 <cdent> but none of that really matters: what matters is that we have a place in the code that is insufficiently robust. it's not defensive against the unexpected
14:43:27 <cdent> exactly because of the reasons that make you twitch
14:45:13 <efried> Have you identified the piece of code in question? I don't see it in the bug report. If you could drop a link in there, I'd like to have a look at it.
14:45:26 <efried> just _ensure_aggregate?
14:45:52 <cdent> yes, ensure_aggregate calls ensure_aggregate
14:46:33 <cdent> there's a race between checking for an agg id and generating that agg anew
14:48:40 <cdent> anyway, I don't know that we need to belabor that now, I just wanted to point it out as a thing of interest. It was fun to see/experience
14:49:26 <efried> Seems like the simplest thing to get rid of the stack overflow is just to get rid of the recursion. At a glance, this could be turned into a loop very easily.
14:49:37 * cdent nods
14:49:42 <efried> If that happened, would it still be possible for it to spin "forever"?
14:49:59 <efried> I imagine we reach stack overflow fairly quickly (wallclock-wise) with the recursion.
14:50:11 <efried> Couple seconds of non-recursive looping and we come out the other side?
14:51:28 <cdent> we probably want a combo of a limited loop (which errors if it reaches its end) and an exponential backoff or random sleep
14:51:29 <efried> This is also creating a new writer context every time it recurses. That seems... bad?
14:51:49 <cdent> so yeah, it's code that needs to be fixed, we don't need to fix it here though
14:51:52 <efried> or is that necessary and part of the reason we're recursing rather than looping?
14:52:16 <efried> okay, yeah, let's take it offline.
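[Editor's note: the fix sketched in the discussion above — replace the `_ensure_aggregate` recursion with a bounded loop plus jittered exponential backoff, per cdent's suggestion — might look like this. All names here are hypothetical; this is not placement's actual code.]

```python
import random
import time


class AggregateRetryError(Exception):
    """Raised when the update keeps losing the conflict race."""


def ensure_with_retry(apply_fn, max_attempts=10, base_delay=0.01):
    """Retry apply_fn on conflict, with a cap and jittered backoff.

    apply_fn returns True on success, or False when it lost a
    generation-conflict race and should retry. A bounded loop
    replaces self-recursion, so a thread stuck behind winners
    errors out cleanly instead of hitting the recursion limit.
    """
    for attempt in range(max_attempts):
        if apply_fn():
            return
        # Jittered exponential backoff so stuck threads don't all
        # retry in lockstep against the thread that keeps winning.
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AggregateRetryError(
        'gave up after %d conflicting attempts' % max_attempts)
```

Usage: wrap the check-then-create body of the aggregate update in `apply_fn`; each attempt re-reads the current aggregate state (and, in placement's case, would open a fresh writer context per attempt rather than nesting them).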
14:52:19 <efried> #topic opens
14:52:24 <efried> anyone?
14:52:42 <efried> (hello, bot?)
14:52:47 <efried> (shrug)
14:53:02 <mriedem> fwiw i've thought about adding a "performance/scaling" doc into nova before when stuff like this comes up,
14:53:08 <mriedem> i.e. known ways things can go south
14:53:24 <mriedem> more as a dumping ground for not forgetting about known issues
14:53:32 <mriedem> but, you know
14:53:36 <efried> good idea. Seems to me like cdent has answered a lot of placement-related scaling questions before (in blogs maybe?). That info could go into docs very nicely.
14:54:30 <cdent> a) that's true, b) most of the solutions are just standard "I'm a web app" solutions, c) where they are not, it's because our code is bad and relatively easy to fix, and we should. The current situation is (c)
14:55:01 <mriedem> from a nova pov, i was thinking more about the myriad config options we have for tweaking things when scaling,
14:55:11 <mriedem> e.g. the way cern restricts projects to a small number of cells
14:55:39 <mriedem> a place to document things one can run into and ways people have worked around those things
14:58:05 <mriedem> anywho
14:58:28 <cdent> the only other thing I'd add to "opens" at the moment is to reinforce from the pupdate that there are multiple deployment projects doing placement bits and pieces that could use our eyes
14:59:38 <efried> okeydokey. Thanks all.
14:59:49 <efried> #endmeeting