15:03:34 #startmeeting scheduling
15:03:35 Meeting started Tue Sep 17 15:03:34 2013 UTC and is due to finish in 60 minutes. The chair is garyk. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:03:39 The meeting name has been set to 'scheduling'
15:03:55 hope that people are around to discuss
15:04:24 #topic summit sessions
15:04:50 Does anyone have any additional comments or updates to https://etherpad.openstack.org/IceHouse-Nova-Scheduler-Sessions
15:05:00 yes
15:05:21 MikeSpreitzer: ok, is that what you want to discuss later in the meeting or something else?
15:06:03 Debo and I added a topic called Smart Resource Placement, and we have added a blueprint
15:06:06 Can I start with a clarification on the whole host allocation part...
15:06:20 Yathi: thanks!
15:06:56 MikeSpreitzer: what would you like clarification on?
15:06:56 MikeSpreitzer: Sure. Unless people want to discuss something else regarding the proposed summit sessions
15:06:59 Is whole host allocation about bare metal allocation, really exclusive allocation, or is it about some bigger unit of allocation (a pool)?
15:07:28 MikeSpreitzer: It's not about bare metal. It's about allocation to host aggregates, essentially
15:08:10 host aggregates will be set aside for exclusive use by a tenant, or delegated tenants
15:08:24 It is about giving one tenant control over a whole host aggregate, right?
15:08:30 yes
15:08:33 So it is about this larger unit of allocation.
15:08:38 yep
15:09:08 Why do we want that?
15:10:01 performance and isolation may be motivations
15:10:11 There are customer requests for this type of allocation. I've heard it's for concerns about resource isolation and somewhat for security concerns, though that's questionable
15:10:12 security too
15:11:21 Performance and isolation can be delivered by requesting performance and isolation from one undivided cloud, letting that cloud decide where to place for performance and isolation.
15:12:13 Same thing for security, really.
15:13:24 that's kind of what this is doing
15:13:35 host aggregates just help the cloud decide where to place instances
15:13:52 it is allowing the tenant to run their instances on specific resources that may be reserved for that specific tenant
15:14:22 That sounds like AZ functionality.
15:15:31 in my opinion it is just another option that is available that enables the cloud provider to meet certain standards.
15:15:33 My point here is that a holistic scheduler that is aware of isolation issues could place for isolation, without having a separate feature for dividing up the cloud a priori.
15:16:41 MikeSpreitzer: that's likely the case, though how does it ensure that there remain enough spots to ensure isolation is possible?
15:16:46 I agree with you on that. but why not have the option of allocating a whole host?
15:16:59 ALaski: yes, ...
15:17:20 (thinking on my feet here...)
15:17:39 But whole host allocation is very early right now. I know it's going to be the topic of a lot of discussion, so alternative ideas are appreciated
15:17:54 OK, I'll stop here. I understand.
15:18:28 MikeSpreitzer: please feel free to take your questions or reservations to the lists or bring them up here.
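The exclusive-aggregate idea discussed above could take the shape of a scheduler filter. Below is a minimal standalone sketch, assuming a Nova-style host_passes() interface and a 'filter_tenant_id' aggregate metadata key (modeled on Nova's AggregateMultiTenancyIsolation filter, but not taken from Nova's code):

```python
# Sketch: keep tenants out of host aggregates reserved for someone else.
# The host_passes() signature mirrors Nova's scheduler-filter convention;
# this class and its inputs are illustrative assumptions, not Nova code.

class AggregateTenantIsolationFilter(object):
    """Pass a host only if none of its aggregates are reserved for a
    different tenant."""

    def host_passes(self, host_state, filter_properties):
        tenant_id = filter_properties['project_id']
        for aggregate in host_state.get('aggregates', []):
            reserved = aggregate.get('metadata', {}).get('filter_tenant_id')
            if reserved is not None and reserved != tenant_id:
                return False  # aggregate is set aside for another tenant
        return True


# Example: tenant 'acme' may use the reserved pool; tenant 'other' may not.
host = {'aggregates': [{'metadata': {'filter_tenant_id': 'acme'}}]}
f = AggregateTenantIsolationFilter()
assert f.host_passes(host, {'project_id': 'acme'})
assert not f.host_passes(host, {'project_id': 'other'})
```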
15:18:37 Next session. For multiple scheduler policies, what sort of differences are involved?
15:19:03 One point that came up at the Neutron meeting last night, and I am not sure if it is relevant here, is that people wanted to work only with etherpads at the summit and 'ban' presentations.
15:19:43 glikson you around?
15:19:51 I like the idea. I think it's good for us to think about but probably a topic for the Nova meeting
15:20:08 ok.
15:20:24 MikeSpreitzer: alex is not here to elaborate.
15:20:39 OK, I'll pursue that separately
15:20:58 Is Boris Pavlovic here?
15:20:59 I think that it enables different scheduling policies to be invoked for different requests. That is, not have one global configuration
15:21:21 MikeSpreitzer: not sure.
15:21:40 Are there any additional things we want to discuss regarding the summit sessions?
15:21:43 boris is boris_42. Doesn't look like he's here
15:22:02 he is currently driving a rally
15:22:10 I see significant overlap between the "Scheduling across Services" session proposal and the "Smart Resource Placement" session proposal.
15:22:35 Yathi: do you think there is overlap here?
15:23:19 I think that there may be room for some collaboration here.
15:23:25 Smart Resource Placement provides a generic framework to allow for complex constraints
15:23:41 Yathi: between resources of different types?
15:24:02 yes, that is part of our idea
15:24:24 Isn't that the essence of Scheduling Across Services?
15:25:36 I guess this framework is something that can be leveraged
15:25:47 to build complex constraints that run across services
15:25:51 It is in a sense, and it is something that we touched on at the last summit, but we did not make any progress with this
15:25:58 Anyway, I think I am just suggesting they go in the same session.
15:26:36 I suppose I am also suggesting the proponents talk to each other and see about a merge beforehand.
15:26:37 scheduling across services calls for an orchestration framework
15:26:55 smart scheduling provides a pluggable solver framework
15:27:07 um, anything calls for orchestration. What exactly do you mean?
15:27:08 MikeSpreitzer: agreed. that is why we are discussing this now, to try and be more efficient when it comes to the summit
15:27:58 trying to separate orchestration between services and the decision-making framework
15:28:10 that is what I meant
15:28:30 OK, no surprise there. The u-rpm proposal also has this, as does my group's running code.
15:28:32 #action consider combining "smart resource placement" and "multiple scheduler policies" to one session
15:28:40 if I understand you correctly
15:29:56 Anything else regarding summit or can we move to the resource tracking?
15:30:16 I'm done
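The multiple-scheduler-policies idea mentioned at 15:20:59 (different policies invoked per request rather than one global configuration) could look roughly like the sketch below. The dispatcher class, the 'policy' scheduler hint, and the select_destinations() name are all assumptions for illustration, not an existing Nova interface:

```python
# Sketch: pick a scheduler driver per request from a hint, instead of one
# globally configured driver. All names here are hypothetical.

class PolicyDispatchingScheduler(object):
    def __init__(self, drivers, default_policy):
        self.drivers = drivers            # policy name -> scheduler driver
        self.default_policy = default_policy

    def select_destinations(self, request_spec):
        hints = request_spec.get('scheduler_hints', {})
        policy = hints.get('policy', self.default_policy)
        driver = self.drivers.get(policy, self.drivers[self.default_policy])
        return driver.select_destinations(request_spec)
```

A request carrying scheduler_hints={'policy': 'energy_saving'} would then be placed by a different driver than the default, with no global reconfiguration.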
15:30:32 #topic resource tracking
15:30:54 I brought this up last time
15:30:54 alaski: do you want to explain your ideas? last week we touched on it but the meeting was ending
15:31:31 So my main idea is that I think it would be helpful to persist the resource tracker off of the compute node
15:31:45 And have it be remotely accessible by other components, like conductor
15:32:06 What does the tracker do?
15:32:31 My thinking being that I want to speed up scheduling, so I want to get a host from the scheduler and then consult the resource tracker quickly without having to round-trip to the compute
15:32:50 MikeSpreitzer: the resource tracker is the definitive source of what resources are available/used on a compute
15:33:08 definitive in Nova I mean
15:33:09 Really definitive, or a convenient cache?
15:33:35 As definitive as we get in Nova; it could still mismatch reality a bit
15:33:43 I would expect the hypervisor is the definitive source regarding what is actually being used now.
15:33:51 alaski: in some cases there is querying from the db, would that be replaced by interfacing with the conductor instead?
15:34:11 MikeSpreitzer: you're correct, so in that sense it is a cache
15:34:17 This is where the distinction between what I call observed and target state matters...
15:34:46 The observed state is a convenient cache of the real state, and the target state is about allocations that may or may not be in effect right now.
15:35:06 garyk: I'm not sure where the db queries are, so I don't know. But possibly
15:36:07 alaski: could this be related to the changes that boris and co are doing with the messages (I have yet to look at that code)
15:36:43 MikeSpreitzer: I have your emails flagged and need to read those thoroughly. I think we all want to move in a similar direction and need to figure out how to come together
15:36:55 Reading nova's DB, in preference to (a cache of) reads from the hypervisors, would be to get target state.
15:37:49 Yes, I am also trying to catch up on the other work here and help figure out how to bring it all together.
15:37:50 garyk: possibly, in the sense of using the same pattern for setting it up. But resource tracker and scheduler are separate entities, so it's not likely to be touched by his work
15:37:57 the complexity is being able to sync all schedulers
15:38:14 alaski: ok, understood
15:38:45 garyk: I wonder which multiplicity you are referring to. Different services, or different cells/regions/...?
15:39:33 MikeSpreitzer: I am trying to understand how the conductor(s) will manage the data and enable the scheduler(s) to access and use it
15:40:19 (I need to learn what a conductor is)
15:40:49 garyk: I am still wondering which multiplicity of scheduler you are referring to.
15:41:02 garyk: the way I'm looking at it, the conductor queries the scheduler for a host or list of hosts, then it consults the resource tracker to make sure the instance will fit on that host
15:41:25 garyk: I do not know what you meant by "the changes that boris and co are doing with the messages"… can you identify another way?
15:41:32 Right now we have to send the build to the compute host before it can fail the resource tracker check. I want it to fail faster
15:41:53 alaski: wouldn't the scheduler already check the available capacity? or are you suggesting to separate the two?
15:42:43 glikson: I think they're already separate. TBH I don't know everything that the scheduler looks at, I should dig into that a bit
15:43:12 But I know that sometimes an instance is scheduled to a host and then there's not actually enough free memory to build the instance
15:43:26 alaski: in that case it would go to rescheduling.
15:43:43 alaski: that might happen because of race conditions between schedulers, for example..
15:44:11 alaski: I have had colleagues running clouds tell me that happens for a variety of reasons; mistakes/discrepancies are possible at every level
15:44:44 garyk: right. My main concern is optimizing it so the schedule/reschedule loop can be faster
15:44:59 I think that there is an overcommit ratio that takes things like this into account (but I may be wrong)
15:45:15 glikson: yes. My understanding is that scheduling is a best-attempt, fail-fast setup. I want failure to be as fast as possible
15:45:41 alaski: I'm with you on that...
15:45:55 but every cache has lag, and there can be a nasty surprise in rare cases.
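The flow described at 15:41:02 and 15:41:32 might look like the sketch below, assuming the resource tracker were remotely readable: the conductor gets a candidate host from the scheduler and checks fit locally, so a doomed build fails before any round trip to the compute node. All names here (select_host, get_usage, NoValidHost) are hypothetical:

```python
# Sketch of a conductor-side fail-fast loop against a remotely readable
# resource tracker. Every interface named below is an assumption.

class NoValidHost(Exception):
    pass


def fits(usage, flavor):
    """Cheap local check: does the flavor fit the host's free resources?"""
    return (usage['free_ram_mb'] >= flavor['memory_mb'] and
            usage['free_disk_gb'] >= flavor['root_gb'])


def schedule_fast_fail(scheduler, tracker, request_spec, max_attempts=3):
    flavor = request_spec['instance_type']
    for _ in range(max_attempts):
        host = scheduler.select_host(request_spec)
        if fits(tracker.get_usage(host), flavor):
            return host  # safe to dispatch the build to this host
        # otherwise retry immediately -- no message to the compute node
    raise NoValidHost(request_spec)
```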
15:46:08 alaski: is this something that will work with multiple conductors (sorry, I am slow today)
15:46:41 alaski: so, are you thinking to keep that somewhere other than the DB, to keep better track of in-flight requests?
15:47:06 MikeSpreitzer: true. It's worth me looking into what can go wrong. I guess I'm thinking of a write-through type cache where lag shouldn't present itself, but I suppose it could
15:47:18 at the moment the flow is api -> scheduler -> compute node
15:47:45 only one scheduler can allocate on a compute node, I take it.
15:47:58 garyk: it would need to. Right now the resource tracker has synchronization based on being on a single compute, but moving it off the compute means we need to address synchronization another way
15:48:16 glikson: right now the resource tracker is in memory on the compute, I want it in a db or other store
15:48:20 alaski: ok
15:48:52 I have heard that when VM creation or deletion has a strange failure, a zombie can be left using memory that the scheduler does not realize exists.
15:49:35 alaski: I thought it is already using the DB.. but maybe I'm confusing it with something else.
15:49:40 MikeSpreitzer: multiple schedulers can allocate to a compute. It's racy, but known to be racy, and the resource tracker is the control point
15:50:02 alaski: didn't we just move those updates from using rpc fanout to using the DB?
15:50:15 alaski: it would be interesting to discuss the data structure for the resource tracking in more detail
15:51:30 alaski: or are you talking about the part that generates those updates, at nova-compute?
15:51:42 glikson: I think we're talking about different things. But now you have me wondering if it's sending data up to the scheduler
15:52:33 glikson: it's possible. I'm talking about the part that runs instance_claim() to claim resources
15:52:46 but it may also be populating something for the scheduler to use
15:53:07 we may be running out of time. do we want to continue with this or switch to MikeSpreitzer's mails and document? We could discuss that next week as I am not sure many of us got to read https://docs.google.com/document/d/1hQQGHId-z1A5LOipnBXFhsU3VAMQdSe-UXvL4VPY4ps/edit
15:53:37 I say we switch. I think I need to research a bit more and come up with a more solid proposal
15:53:54 alaski: ok
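One way the synchronization question raised at 15:47:58 could be answered, if the tracker's state lived in a shared database, is an atomic conditional UPDATE in place of the per-compute lock that instance_claim() relies on today. A minimal sketch; the compute_usage table and its columns are assumptions for illustration:

```python
# Sketch: make a resource claim race-safe in a shared store by updating
# only when the headroom still exists. Racing schedulers cannot both
# consume the same free RAM; the loser simply sees 0 rows updated.

import sqlalchemy as sa

def claim_ram(conn, host, ram_mb):
    """Try to reserve ram_mb on host; return False if we lost the race."""
    result = conn.execute(
        sa.text("UPDATE compute_usage "
                "SET free_ram_mb = free_ram_mb - :ram "
                "WHERE host = :host AND free_ram_mb >= :ram"),
        {"ram": ram_mb, "host": host})
    return result.rowcount == 1  # 0 means another claim got there first
```

A failed claim would send the request straight back to the scheduler, which is the fast failure alaski is after.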
15:54:05 #topic MikeSpreitzer's mail
15:54:22 MikeSpreitzer: with the few minutes left
15:54:32 we can always continue next week
15:54:52 I also had a quick question regarding the proposal to consider merging the multi-sched and smart-sched proposals, raised when I was away for a few minutes..
15:55:01 I am finding rough alignment between the u-rpm proposal and my group's work..
15:55:16 so I thought I would outline what we have worked out.
15:55:38 glikson: MikeSpreitzer suggested that we have them together as there may be some overlap
15:55:43 I have not yet roadmapped to a set of small changes, just wanted some review of the overall vision.
15:56:13 and hope to help out
15:56:14 glikson: I guess we can take it offline and discuss
15:56:24 garyk, glikson, I think garyk meant smart-sched and 'scheduling across services'
15:56:55 Yathi: yes, that is what I meant. sorry, my bad
15:56:56 garyk: I think the two are complementary -- one to introduce a new scheduler driver, and the second to have different driver configs co-exist within the same scheduler instance (regardless of which driver it is)
15:57:12 I think there is big overlap between those two session proposals and what I wrote about.
15:57:51 I think that we should try and read what you have written and then discuss it next week.
15:58:06 OK
15:58:11 I guess we could also have some time to see what we can combine (if possible)
15:58:23 #action discuss MikeSpreitzer proposal next week
15:58:41 #action check if we can merge/combine sessions
15:58:41 Yathi: ah, ok. I personally think those two are also complementary -- the optimization approach is rather orthogonal to the scope of the optimization problem to solve..
15:59:11 I thought it was said that Smart Resource Placement is also about going across services
16:00:03 Yathi: right?
16:00:21 I am sorry but I guess we will have to continue next week.
16:00:34 thanks guys
16:00:36 #endmeeting