14:00:00 <efried> #startmeeting nova_scheduler
14:00:01 <openstack> Meeting started Mon Jan  8 14:00:00 2018 UTC and is due to finish in 60 minutes.  The chair is efried. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:04 <openstack> The meeting name has been set to 'nova_scheduler'
14:00:15 <efried> https://wiki.openstack.org/wiki/Meetings/NovaScheduler#Agenda_for_next_meeting
14:00:37 <takashin> o/
14:00:39 <efried> Greetings all.  Welcome to the first Nova Scheduler meeting of the year.
14:00:39 <alex_xu_> o/
14:00:40 * bauzas waves back from the sloppy snowy hills
14:00:54 <efried> I'll be your host, while edleafe is somewhere on I-35
14:01:52 <cdent> o/
14:02:30 <efried> There's some interesting stuff on the agenda today.  jaypipes around?  (It'd be nice to have dansmith, but it's 6am for him, so...)
14:02:43 <jaypipes> I am
14:02:54 <efried> Cool, let's get started.
14:03:15 <efried> Reviews: Nested RPs.
14:03:27 <efried> ComputeDriver.update_provider_tree() series starting with: https://review.openstack.org/#/c/521685/
14:03:41 <efried> This has prompted some discussion topics which we'll get to later.
14:03:42 <jaypipes> I'm three patches deep in the nested with allocation candidates series. making decent progress.
14:04:00 <efried> ^^ Nested affordance in GET /allocation_candidates series starting with: https://review.openstack.org/#/c/531443/
14:04:04 * jaypipes puts on brakes
14:04:32 <efried> Granular resource requests (https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/granular-resource-requests) waiting for ^^
14:04:52 <efried> Anyone have anything they want to bring up on those?
14:05:01 <efried> ...other than the "open discussion" stuff
14:05:05 <bauzas> I need to review the nested RP stuff
14:05:23 <bauzas> I see the update_provider_tree() changes as WIP, correct ?
14:05:40 <efried> bauzas Only the top couple of patches.  The bottom two or three are ready.
14:05:52 <bauzas> efried: those aren't the driver side
14:06:00 <bauzas> amirite ?
14:06:19 * alex_xu_ happy to review more sql
14:06:21 <efried> Correct; they're in report client and its helpers.
14:06:39 <bauzas> k
14:06:41 <efried> All framework needed to get support in the driver side.
14:06:59 <efried> But yeah, bauzas I noticed you were looking to use this stuff for libvirt vgpu work.
14:07:02 <bauzas> I'm cool with that, just wanted to make things clear, since you were pointing out
14:07:07 <bauzas> (15:03:27) efried: ComputeDriver.update_provider_tree() series starting with: https://review.openstack.org/#/c/521685/
14:07:25 <bauzas> efried: definitely Rocky for me
14:07:30 <efried> okay.
14:07:44 <bauzas> efried: I'm still battling with lovely libvirt and the fancy nvidia issues that keep popping up
14:07:53 <bauzas> tl;dr: no docs, just word of mouth
14:08:13 <efried> Good times.  "Someone" should write the docs :)
14:08:17 <efried> Anything else on NRP?
14:08:33 <efried> Moving on.
14:08:36 <efried> Alternate hosts (edleafe):
14:08:36 <efried> Series now starts with test case patch https://review.openstack.org/#/c/531022/
14:08:36 <efried> Revealed bug: https://bugs.launchpad.net/nova/+bug/1741125
14:08:37 <openstack> Launchpad bug 1741125 in OpenStack Compute (nova) "Instance resize intermittently fails when rescheduling" [High,In progress] - Assigned to Ed Leafe (ed-leafe)
14:08:38 <bauzas> the Xen folks are probably the best customers for the NRP usecase
14:08:48 <bauzas> they are way more advanced than me
14:09:05 <efried> bauzas Noted.  They're on my radar.
14:09:29 <alex_xu_> efried: is nrp still targeted for this release?
14:09:43 <bauzas> the more we can merge, the better it will be
14:09:45 <efried> alex_xu_ As far as I know, yes.
14:09:59 <bauzas> at least the compute side is critical
14:10:10 <bauzas> not the scheduling one
14:10:16 <alex_xu_> efried: ok, just looks like the time is tight
14:10:48 <bauzas> alex_xu_: aaaaand I haven't harassed you with beggy asks for reviews on my VGPU changes :p
14:10:49 <efried> I believe the scope of the work is limited to what's described above, which I feel is doable if we stay aggressive.
14:11:00 <efried> jaypipes Thoughts on that?
14:11:13 <alex_xu_> bauzas: heh :)
14:11:13 * cdent gets up on the balls of his feet
14:12:18 <efried> Anything on alt hosts?
14:12:19 <jaypipes> efried: I believe we can get n-r-p in reportclient, n-r-p in alloc candidates, and xen as first client/consumer done in Queens.
14:12:29 <efried> jaypipes ++
14:12:31 <bauzas> I agree
14:12:44 <alex_xu_> \o/
14:13:00 <jaypipes> efried: what I *don't* believe will be done is the crazy-pants numa/pci stuff.
14:13:11 <jaypipes> efried: that will need more extensive testing
14:13:35 <efried> jaypipes Agree.  That is driver consumption of the stuff we're putting into Queens, yes?
14:13:36 <jaypipes> efried: so we will continue to need both the PciPassthroughFilter and the NUMATopologyFilter in the scheduler for at least Queens.
14:13:44 <cdent> I think on alt hosts, the only pending thing is fixing that bug (on the agenda). And any subsequent fall out from it existing
14:13:49 <jaypipes> efried: yes, xactly.
14:14:05 <bauzas> driver consumption or inventory ?
14:14:07 <alex_xu_> is granular request part of nrp?
14:14:09 <bauzas> I thought the former
14:14:12 <bauzas> oops
14:14:14 <efried> alex_xu_ yes
14:14:17 <bauzas> I thought the /latter/
14:14:19 <alex_xu_> efried: cool
14:14:22 <jaypipes> bauzas: driver consumption *is* inventory :)
14:14:35 <bauzas> dammit, speak French folks
14:15:16 <jaypipes> bauzas: by "driver consumption", we're referring to the virt driver API's proposed update_provider_tree() method, which is responsible for inventory tracking.
14:15:22 <bauzas> the driver will *provide* the inventory ala new way, better ?
14:15:27 <jaypipes> ++
14:15:44 <efried> bauzas jaypipes It's two-pronged, really.  The driver will have to supply the RP picture via update_provider_tree; and will also have to have the code to do the device creation/assignment to the VM based on the allocations from the scheduler.
14:16:06 <bauzas> jaypipes: that looks like a provider terminology to me, not a consumer, but meh :p
14:16:25 <jaypipes> efried: sure, yes.
14:16:26 <bauzas> anyway, I'm seriously diverting
14:16:35 * bauzas shuts up
14:16:36 <jaypipes> bauzas: agreed.
14:16:36 <efried> Providing inventory by consuming new interfaces :)
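
(A minimal sketch of what a driver's update_provider_tree() might look like under the interface being discussed; the ProviderTree helper names and the inventory values below are illustrative assumptions, not the merged API.)

```python
# Illustrative sketch only: ProviderTree helper names and inventory values are
# assumptions about the interface under review, not merged Nova code.
def update_provider_tree(self, provider_tree, nodename):
    """Driver edits the passed-in tree to reflect its desired inventory."""
    # Root provider: the compute node itself.
    provider_tree.update_inventory(nodename, {
        'VCPU': {'total': 16, 'reserved': 0, 'allocation_ratio': 16.0},
        'MEMORY_MB': {'total': 65536, 'reserved': 512, 'allocation_ratio': 1.5},
    })
    # Nested child provider, e.g. a VGPU-capable device hanging off the root.
    gpu_rp = '%s_pgpu_0' % nodename
    if not provider_tree.exists(gpu_rp):
        provider_tree.new_child(gpu_rp, nodename)
    provider_tree.update_inventory(gpu_rp, {
        'VGPU': {'total': 4, 'reserved': 0, 'allocation_ratio': 1.0},
    })
```
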
14:16:45 <jaypipes> ok, moving on...
14:17:06 * efried realizes total failure to use #topic thus far
14:17:10 <efried> #topic Bugs
14:17:21 <efried> https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:17:29 <jaypipes> no bugs. ever. next topic.
14:18:10 <efried> 25 of these, probably a good time to do a scrub.
14:18:31 <efried> Make sure the status is appropriate.  E.g. the other day I found a moldy one (but I only looked at the one).
14:18:40 <efried> Volunteers?
14:19:10 <efried> <crickets>
14:19:19 <jaypipes> not me, sorry. I need to focus on reviews this week.
14:19:29 <efried> I'll try to skim through them.
14:19:44 <efried> Anything else on bugs?
14:19:50 <cdent> i look every week, but usually the status hasn't changed much. will continue to do so
14:19:56 <efried> Thanks
14:20:07 <efried> #topic Open discussion
14:20:32 <efried> First on the list: Using the limits recently made available on GET /allocation_candidates.
14:20:46 <efried> Dan put up a WIP on Friday:  https://review.openstack.org/#/c/531517/
14:21:29 <efried> So I guess any discussion can happen on that review.
14:21:57 <efried> Anything to talk about here & now on that?
14:22:01 <jaypipes> nope
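
(For reference, the recently added limit is just a query parameter on the existing call; a hypothetical request follows, with made-up endpoint, token, and resource amounts.)

```python
import requests

# Hypothetical call: endpoint, token, and resource amounts are placeholders.
resp = requests.get(
    'http://placement.example.com/allocation_candidates',
    params={'resources': 'VCPU:1,MEMORY_MB:2048,DISK_GB:10', 'limit': 20},
    headers={'X-Auth-Token': '<token>',
             'OpenStack-API-Version': 'placement 1.16'},
)
body = resp.json()
# At most 20 candidates come back instead of one per matching provider.
candidates = body['allocation_requests']
summaries = body['provider_summaries']
```
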
14:22:28 <efried> Next.
14:23:04 <efried> There's been lots of words in the reviews of  https://review.openstack.org/#/c/526539
14:23:38 <cdent> "careful handling of conflicts" is a) probably right, b) scary, c) sigh making because b
14:23:54 <cdent> efried: can you expand on generation handing on aggregates?
14:24:01 <cdent> (the necessity thereof)
14:24:02 <efried> Sure.
14:24:13 <cdent> (for sake of completeness)
14:25:28 <efried> If we want to allow multiple threads, possibly/usually on different hosts, to manage the same RP, we need a way to handle race conditions.
14:25:40 <efried> ...a way that isn't process/thread level locking.
14:26:16 <efried> So the db record of a provider includes a 'generation', which is a monotonically increasing integer indicating the "revision" of the RP.
14:26:39 <bauzas> not sure I get the problem and/or the context of that conversation
14:26:40 <efried> You GET the provider, change something about it, and PUT it back.  You don't change the generation in that payload.
14:27:37 <efried> bauzas It's this: If thread A GETs the provider and makes a change, and thread B GETs the provider and makes a change, then thread A PUTs it back, then thread B puts it back, thread B's changes will overwrite thread A's changes.
14:27:51 <cdent> that part's mostly understood, but with aggregates we had originally decided not to. What's different? Instead of generation, do we need a way to add/remove a single member from an aggregate collection?
14:27:51 <jaypipes> correct.
14:28:16 <cdent> (if it's already there or already not there, do nothing, it's OK)
14:28:18 <jaypipes> cdent: no, I think we just need to do the exact same thing we do with the traits collection of an rp.
14:28:20 <efried> bauzas Is it clear how generation fixes that, or should I explain further?
14:28:51 <bauzas> I know why we have the generation bit, so I guess the problem is with aggregates that *don't* have this ?
14:28:57 <cdent> jaypipes: that's fine, just trying to make sure the picture is clear
14:29:20 <jaypipes> cdent: the trick is we need to deprecate the old API (that doesn't supply a generation) and add a new microversion that supplies a generation for PUT /aggregates
14:29:22 <efried> bauzas Right, today if you change the aggregates associated with a provider, that provider's generation is *not* incremented.
14:29:46 <efried> bauzas We're suggesting that's a bad thing, and needs to be fixed.
14:29:49 <jaypipes> efried: right. it's the only rp-related endpoint (that modifies something) that doesn't have a generation marker.
14:30:14 <bauzas> well, if the RP doesn't change, why should we increment the generation bit ?
14:30:39 <efried> bauzas Changing aggregate associations *is* changing the RP.
14:30:45 <efried> Just like changing the traits is changing the RP.
14:30:49 <jaypipes> bauzas: because the rp's generation is the lock for all things related to the rp. inventory, allocations, traits, etc
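
(A sketch of the generation-as-lock flow being described, using the traits call that aggregates would mimic; the endpoint and payload fields exist in placement, but the helper, token, endpoint URL, and retry count are illustrative assumptions.)

```python
import requests

PLACEMENT = 'http://placement.example.com'  # placeholder endpoint
HEADERS = {'X-Auth-Token': '<token>',
           'OpenStack-API-Version': 'placement 1.14'}

def set_traits(rp_uuid, traits):
    """Optimistically replace an RP's traits, retrying on a stale generation."""
    url = '%s/resource_providers/%s/traits' % (PLACEMENT, rp_uuid)
    for _ in range(4):  # bounded retries
        # GET the provider's current traits, which include its generation.
        current = requests.get(url, headers=HEADERS).json()
        payload = {
            'traits': sorted(traits),
            # Echo back the generation we just read; we never bump it ourselves.
            'resource_provider_generation':
                current['resource_provider_generation'],
        }
        resp = requests.put(url, json=payload, headers=HEADERS)
        if resp.status_code != 409:
            return resp  # success, or a non-conflict error for the caller
        # 409 Conflict: another thread changed the provider after our GET;
        # loop around, re-read, and try again with the fresh generation.
    raise RuntimeError('gave up on %s after repeated generation conflicts'
                       % rp_uuid)
```
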
14:31:26 <bauzas> I'm maybe missing something, but if that's just a matter of incrementing the gen bit, why should we need to modify the REST API ?
14:31:39 <bauzas> for passing the new generation bit ?
14:31:47 <bauzas> sorry, see, I'm rusty
14:31:59 <efried> No, the payload structures don't change.
14:32:10 <jaypipes> bauzas: because we need the thing that is attempting to change the rp to say what its latest view of the rp was.
14:32:18 <jaypipes> efried: yes, they do.
14:32:32 <efried> Oh yeah
14:32:37 <jaypipes> efried: unless you plan on passing the generation in a header...
14:32:37 <bauzas> jaypipes: right, we need to pass the index where we are
14:32:45 <efried> cause right now GET /rp/{uuid}/aggs doesn't include generation.
14:32:49 <bauzas> gotcha
14:32:50 <jaypipes> efried: which I don't recommend.
14:33:00 <cdent> it will make cdent very fussy
14:33:10 <efried> Because why?
14:33:14 <jaypipes> and we don't want that.
14:33:26 <cdent> two ways to do the same thing: bad
14:33:28 <jaypipes> efried: because it would be very inconsistent.
14:33:33 <cdent> jinx
14:33:36 <jaypipes> efried: so let's move on :)
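
(The proposal would make PUT /resource_providers/{uuid}/aggregates look like the traits call; a hypothetical body for the new microversion is shown below, with the generation field name assumed by analogy rather than taken from a merged API.)

```python
# Hypothetical body for the proposed aggregates microversion; the generation
# field name is assumed by analogy with the traits endpoint, not final API.
put_aggregates_body = {
    'aggregates': [
        '42896e0d-205d-4fe3-bd1e-100924931787',
        '5e08ea53-c4c6-448e-9334-ac4953de3cfa',
    ],
    # Generation read from the prior GET; a mismatch gets a 409 Conflict.
    'resource_provider_generation': 7,
}
```
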
14:33:39 <bauzas> do we expect those race conditions to happen often?
14:34:05 <bauzas> after all, Nova aggregates don't have any concurrency management in place
14:34:06 <efried> Okay, I got confused.  We *do* want to include the generation in the payload?
14:34:12 <cdent> we consider them very possible when bringing up several pieces of new hardware, especially in clustered envs, but then things stabilize
14:34:13 <jaypipes> bauzas: originally, cdent and I did not envision aggs changing a) often or b) by the compute virt driver. but drivers like PowerVM and vCenter apparently have designs to do this.
14:34:30 <bauzas> jaypipes: I thought the former as well, so I'm surprised
14:34:32 <jaypipes> bauzas: which is why we need to solve this problem now and not later with some distributed locking hacky thing
14:35:26 <bauzas> okay, is that a thing for now or Rocky?
14:35:34 <bauzas> because that's an API microversion bump, right?
14:35:39 <bauzas> I mean, surely we can
14:35:53 <cdent> we've bumped the placement microversion 6 times this cycle, we can afford a few more
14:36:00 <bauzas> but I just know we have a shit number of other things to merge
14:36:05 <cdent> (6 is a guesstimate)
14:36:19 <bauzas> then, Just Do It (c)
14:36:26 <efried> ++
14:36:28 <bauzas> and see when we can merge
14:36:33 <jaypipes> bauzas: yeah, I'm in the just do it camp.
14:36:44 <efried> Who would like to propose this patch?
14:36:45 <jaypipes> so... *who* should do it?
14:36:48 <jaypipes> jinx
14:36:53 <bauzas> a-ha!
14:36:55 <cdent> I can do the placement side
14:37:01 <jaypipes> ++
14:37:04 <jaypipes> excellent.
14:37:05 <efried> Thanks cdent
14:37:16 * cdent adds to to-do list
14:37:38 <gibi> I have one silly question
14:37:40 <efried> #action cdent to propose generations-on-aggregate placement microversion
14:37:47 <gibi> When adding an aggregate to an RP, do we increase the generation of just that RP, or of every RP that aggregate is connected to?
14:37:50 * efried finds the #action keyword
14:38:23 <efried> gibi Excellent question, as always.
14:38:24 <jaypipes> gibi: no, only update generation when *that* RP's set of aggs changes
14:38:34 <efried> That seems reasonable to me.
14:38:35 <sean-k-mooney> gibi: I would guess the generation of the aggregate and the RP being added, but not of other RPs in that aggregate
14:38:54 <efried> (sean-k-mooney the aggregate doesn't have a generation)
14:38:58 <efried> (but yeah)
14:39:09 <cdent> gibi: clearly not a silly question gibi, especially since there's no such thing as a silly question
14:39:22 <gibi> OK, the answer seems consistent :)
14:39:47 <efried> Okay, so cdent pointedly said "placement side".  The client side...
14:39:59 <efried> That's probably on me, unless someone else wants to volunteer.
14:40:20 <cdent> I figured splitting it up was probably better to avoid serializing.
14:40:30 <efried> suresure
14:40:44 <efried> Right now the report client doesn't have any kind of retry on 409.
14:40:49 <cdent> and also since my savvy of the ProviderTree stuff is limited
14:41:12 <efried> cdent Actually I don't think ProviderTree enters into it much or at all for this.
14:41:13 <cdent> there's code in there somewhere for the Retry exception that jaypipes (I think?) created
14:41:23 <efried> oh?
14:41:27 * efried looks
14:41:38 <cdent> which can probably be reused
14:41:43 <cdent> (with some tweaks)
14:42:24 <efried> report.py has class Retry and a @retries decorator.  Will investigate.
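
(The pattern being referred to is roughly the following; a generic sketch in which the Retry/retries names come from the conversation and everything else is an assumption, not the actual report.py code.)

```python
import functools

class Retry(Exception):
    """Raised by an operation to signal it hit a conflict and wants a retry."""

def retries(func):
    """Re-run a report client operation a few times when it raises Retry."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        attempts = 4  # arbitrary bound for the sketch
        for _ in range(attempts):
            try:
                return func(self, *args, **kwargs)
            except Retry:
                # Stale view (e.g. a generation conflict); try again, letting
                # the wrapped method re-read fresh state from placement.
                continue
        raise RuntimeError('%s failed after %d attempts'
                           % (func.__name__, attempts))
    return wrapper
```
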
14:42:31 <efried> So now the big question:
14:42:39 <cdent> I realize that the ProviderTree just uses the report client, but the larger context is the ProviderTree doing an effective sync of whatever the virt driver tells it
14:42:58 <cdent> that's the part I don't have close knowledge of, and it's the driver for all this stuff, and the part that we want to be the working result
14:43:25 <jaypipes> cdent: you talking about reportclient or ProviderTree? ProviderTree does not interface with placement API.
14:43:32 <efried> cdent That's an interesting point.  I had been thinking we would do the retry logic at the low level report client methods that do the PUTs.
14:43:53 <cdent> efried: yes, so would I
14:43:56 <efried> i.e. ProviderTree not involved - it calls those guys and they just work, with however many retries they need.
14:44:07 <jaypipes> cdent: ProviderTree doesn't use reportclient :)
14:44:32 <cdent> a thing which uses ProviderTree and reportclient is trying to manage the picture of the world
14:44:40 <cdent> the picture of the world is the thing I'm trying to say we should care most about
14:44:46 <efried> agreed
14:44:49 <cdent> that is: I'm trying to say: it's not the unit tests that matter here
14:44:55 <cdent> it's the end result
14:45:12 <jaypipes> cdent: reportclient builds the ProviderTree. ProviderTree simply implements thread-safe ability to set traits and inventory information for providers, and then we can pass that ProviderTree to the reportclient and reportclient will inspect it and call the placement REST API accordingly.
14:45:34 <cdent> yes, that's been my understanding all along
14:45:37 <jaypipes> k
14:45:50 <efried> It'll be a report client method that diffs two ProviderTrees and flushes the changes back to placement.  I'm saying that method will call the low-level PUT wrappers, and those wrappers will have the retry logic in 'em.
14:45:52 <cdent> so the provider tree is the central/key data structure that drives the picture of the world
14:46:01 <cdent> efried: yes
14:46:13 <jaypipes> cdent: yes.
14:46:42 * cdent is apparently more data structure oriented than he thought
14:46:49 <efried> The diff/flush method is the to-be-written one here: https://review.openstack.org/#/c/520246/19/nova/compute/resource_tracker.py@886
14:47:29 <cdent> see above about "scary"
14:47:42 <cdent> I know it is just a simple matter of programming
14:47:44 <efried> #action efried to start on conflict retry logic.
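
(A rough sketch of the to-be-written diff/flush method on the report client side; every name here is an assumption about code that was still under design at the time.)

```python
# Rough sketch; all method names are assumptions about not-yet-written code.
def update_from_provider_tree(self, new_tree):
    """Flush differences between placement's view and the driver's view."""
    for provider in new_tree.get_all_providers():
        known = (self._provider_tree.data(provider.uuid)
                 if self._provider_tree.exists(provider.uuid) else None)
        if known is None:
            # The driver wants a provider placement doesn't know about yet.
            self._create_resource_provider(provider)
        if known is None or known.inventory != provider.inventory:
            # Low-level PUT wrapper; generation handling lives down there.
            self._update_inventory(provider.uuid, provider.inventory)
        if known is None or known.traits != provider.traits:
            self._set_traits(provider.uuid, provider.traits)
    # Providers placement knows about that the driver no longer reports.
    for orphan_uuid in self._provider_tree.uuids_not_in(new_tree):
        self._delete_resource_provider(orphan_uuid)
```
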
14:47:55 <efried> So the big question:
14:48:13 <efried> Can we do this in report client in such a way that it works for all (reasonable) use cases?
14:48:43 <efried> See the most recent review missive on https://review.openstack.org/#/c/526539/7
14:49:03 <efried> It is not always clear how a conflict should be handled.
14:49:37 <efried> Consider especially the case where two hosts are creating the aggregate for a shared provider.
14:49:49 <efried> It's certainly legal for a shared provider to be a member of multiple aggregates.
14:50:41 <efried> So when we retry, how do we know whether to add our aggregate to the mix, or see that the aggregate was created by the "other thread" and just skip?
14:52:02 <jaypipes> that's the issue with >1 thing creating aggregates :)
14:52:05 <efried> In the general view, I contend that only the specific virt driver can know this answer for a specific shared provider.
14:52:24 <efried> ...assuming the picture where it's the virt driver controlling the shared provider.
14:52:55 <jaypipes> efried: yes, I think that's the case...
14:53:29 <efried> If so, then how do we manage this without giving virt the reins to do the placement update & conflict resolution itself?
14:53:35 <sean-k-mooney> efried: you are assuming that the virt driver will always know. If the shared resource is not a compute-related resource, will it have that knowledge?
14:53:52 <jaypipes> efried: the virt driver needs to be the thing that answers the question "ok, so the state of things changed. check if my desired state is now in play before I go trying to create that new desired state"
14:54:16 <efried> jaypipes So maybe if conflict is detected, reinvoke update_provider_tree?
14:54:33 <jaypipes> efried: all the generation change stuff means is "*something changed*!". it doesn't mean "this thing changed".
14:54:54 <bauzas> FWIW, 5 mins left-ish
14:55:16 <jaypipes> efried: I think it depends. certainly if you get a 409, then I would re-read all the provider's information (traits, inventory, etc)
14:55:41 <efried> So perhaps where this leads us is actually a lot broader and simpler than I had been thinking.
14:55:46 <jaypipes> efried: and then check to see if the desired state is already existing. and if so, do nothing. if not, then call the update/set again.
14:56:14 <efried> Instead of having the low-level PUT method detect 409s and redrive, have the resource tracker trap the 409 and reinvoke update_provider_tree.
14:56:30 <efried> That gives the driver the control it needs.
14:56:31 <jaypipes> efried: that "desired state" is going to be different depending on the virt driver and depending on what the virt driver is attempting to do (for example, changing a provider's inventory is different from changing the provider's set of traits)
14:56:49 <jaypipes> efried: yes, precisely what you said.
14:56:51 <efried> Because update_provider_tree is always doing as jaypipes says: checking whether the "current" state is the "desired" state.
14:56:54 <jaypipes> efried: that is the original design.
14:57:17 <jaypipes> efried: client-driven state retries, that is.
14:57:27 <efried> Dig.
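
(Putting the above together, the client-driven retry would look roughly like this at the resource tracker level; ConflictError and all method names are illustrative assumptions, not merged code.)

```python
class ConflictError(Exception):
    """Illustrative stand-in for a 409 surfaced by the report client."""

# Sketch of client-driven retries: on a conflict, rebuild the picture of the
# world and re-ask the driver for its desired state. All names are assumed.
def _update_to_placement(self, nodename):
    for _ in range(4):
        # Fresh copy of placement's current view of this node's providers.
        prov_tree = self.reportclient.get_provider_tree(nodename)
        # The driver edits the tree into its *desired* state. Re-invoking it
        # after a conflict is safe: if the "other thread" already did the
        # work, the driver's pass becomes a no-op.
        self.driver.update_provider_tree(prov_tree, nodename)
        try:
            self.reportclient.update_from_provider_tree(prov_tree)
            return
        except ConflictError:
            # A generation moved between our GET and PUT; loop and rebuild.
            continue
    raise ConflictError('unable to sync providers for %s' % nodename)
```
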
14:59:07 <efried> Well, that was the only real-world case where I had thought there might be a reason for virt to talk directly to placement.
14:59:14 <efried> The last open discussion topic is on that point in general.
14:59:35 <cdent> If a ProviderTree has state A (with hierarchy) and there is a conflict such that it should be reset to state B, and that leaves orphans in the hierarchy, who/what cleans up those orphans, and how does anything know?
14:59:54 <efried> cdent The aforementioned method.
15:00:04 <cdent> it deletes resource providers?
15:00:10 <efried> Yes
15:00:11 <cdent> maybe something else wants them too?
15:00:24 <efried> We're over time.  Continue in -nova?
15:00:28 * cdent nods
15:00:29 <efried> #endmeeting