15:00:53 <n0ano> #startmeeting scheduler
15:00:54 <openstack> Meeting started Tue Jul 23 15:00:53 2013 UTC.  The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:55 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:57 <openstack> The meeting name has been set to 'scheduler'
15:01:09 * n0ano can't type this morning
15:01:20 <n0ano> anyone here for the scheduler meeting?
15:01:24 <llu-laptop> o/
15:01:30 <jgallard> hi!
15:01:53 <PaulMurray> I'm here for scheduler
15:01:57 <PaulMurray> hi
15:02:25 <alaski> here, but multitasking unfortunately
15:02:52 <jog0> o/
15:02:52 <n0ano> alaski, NP, we'll just give you all the action items since you can't defend yourself :-)
15:03:03 <alaski> :)
15:03:40 <n0ano> we have some potentially contentious issues today, let's get started
15:03:51 <n0ano> #topic ceilometer vs. nova metric collector
15:04:37 <n0ano> I don't know if the thread on the dev list has bottomed out on this, I still think it makes sense to extend the current scheduler and not rely exclusively on ceilometer
15:04:54 <PaulMurray> I was in that thread
15:05:02 <PaulMurray> There are two issues for me
15:05:17 <PaulMurray> I am working on network aware scheduling
15:05:25 <alaski> I like the idea of using ceilometer, but agree that it can be a future thing.  I think it makes sense to share a lib if possible, and make sure we get an api that will work for both.
15:05:35 <PaulMurray> which is not the same as the analytical kind of stuff
15:05:58 <n0ano> alaski, +1
15:06:24 <jog0> perhaps we can get a joint ceilometer nova session at the summit on this
15:06:30 <glikson> hi
15:06:56 <n0ano> jog0, we should, I just don't want development to stop waiting for the next summit
15:07:32 <jog0> agreed, from the ML thread it sounded like everyone thought a non-ceilometer mode made sense
15:07:44 <PaulMurray> agreed
15:07:47 <n0ano> note I have a slightly vested interest - my group is working on these changes to the scheduler, so obviously we think that's the right way to go.
15:07:50 <jog0> and at the summit we can re-evaluate where that line is and how the two projects could work together
15:08:53 <jog0> n0ano: not sure what happened to it, but there was a top-level scheduler idea once so nova and cinder could have one scheduler
15:09:34 <n0ano> I'm hearing consensus here - extend the scheduler for now, have a joint session at the summit with scheduler & ceilometer to see intersections
15:10:03 <n0ano> jog0, I haven't seen that, do you think there was a BP about it?
15:10:03 <PaulMurray> I agree
15:10:29 <alaski> n0ano: +1
15:10:35 <jog0> n0ano: don't remember now, it was an idea that floated around for a while
15:10:35 <PaulMurray> To some extent
15:10:49 <alaski> I don't think there was a bp about it, but I remember the joint scheduler talk
15:10:51 <glikson> yep, there was a session on this at the last summit, and no one seemed interested in working on ceilometer integration..
15:11:09 <PaulMurray> Can I ask what you mean by extend the scheduler in this case?
15:11:13 <jog0> n0ano: I think timing-wise getting any major scheduler extensions into H3 will be hard, so they need to be proposed ASAP
15:11:25 <n0ano> glikson, sigh, I guess we'll try again in Hong Kong
15:12:02 <n0ano> PaulMurray, we're talking about the ability to add plugins to provide extra scheduling info
15:12:22 <PaulMurray> n0ano thanks, the question I have is
15:12:33 <jgallard> jog0, it was https://blueprints.launchpad.net/oslo/+spec/oslo-scheduler ?
15:12:37 <jog0> n0ano: the generic ability to do so more easily than currently?
15:12:40 <PaulMurray> about the resource consumption part
15:13:02 <n0ano> jog0, yes, that's the idea
15:13:02 <jog0> jgallard: that was different - that was about keeping different schedulers but deduplicating the code base
15:13:05 <PaulMurray> Do you see a need for the scheduler and compute nodes to check resource consumption
15:13:18 <PaulMurray> as the claims do for example
15:13:23 <PaulMurray> That part was missing for me
15:13:33 <n0ano> PaulMurray, what do you mean by `check`
15:14:13 <PaulMurray> I want to do the same as say ram_mb, free_ram_mb, ram_mb_used
15:14:28 <PaulMurray> the compute node makes sure it has capacity - it doesn't just accept instances
15:14:39 <PaulMurray> the scheduler also updates hoststate
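[Editor's note: a rough sketch of the claim/consume pattern PaulMurray is describing, assuming ram-style resources; the class and method names below are simplified illustrations, not the actual nova code.]

    class HostState(object):
        """Scheduler-side view of one compute host."""
        def __init__(self, total_ram_mb):
            self.total_ram_mb = total_ram_mb
            self.free_ram_mb = total_ram_mb

        def consume_from_instance(self, instance):
            # Scheduler side: optimistically deduct what the instance will
            # use so later decisions in the same pass see the new value.
            self.free_ram_mb -= instance['memory_mb']

    class ResourceTracker(object):
        """Compute-side guard: re-check capacity before accepting work."""
        def __init__(self, host_state):
            self.host_state = host_state

        def instance_claim(self, instance):
            # The compute node does not just trust the scheduler's view;
            # if the claim fails, the build is rejected and rescheduled.
            if instance['memory_mb'] > self.host_state.free_ram_mb:
                raise RuntimeError('ComputeResourcesUnavailable')
            self.host_state.consume_from_instance(instance)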
15:14:44 <jog0> PaulMurray: I think you are hitting on one of the main complaints people have today
15:14:57 <PaulMurray> what is your opinion?
15:15:30 <llu-laptop> PaulMurray: Have you seen my reply to your concern in https://review.openstack.org/#/c/35764/. The main problem is that we need to have a way to tell the scheduler which kind of resource will be consumed, and how much for each instance
15:15:57 <n0ano> I'm still confused, to me it's simple: compute nodes report their resource capabilities to the scheduler and then the scheduler makes decisions based upon that
15:16:27 <jog0> n0ano: and you need to add that logic all over the code base
15:16:48 <n0ano> I don't like the idea of duplicating the scheduler work in the compute nodes
15:17:01 <jog0> resulting in a complex change to support new resource tracking
15:17:13 <PaulMurray> n0ano that is the way it is now for static allocations
15:17:28 <llu-laptop> n0ano: PaulMurray is talking about resources like free_mem, which are touched by both the scheduler and the compute node when consumed
15:17:33 <jog0> n0ano: we do that now, with the retry logic
15:17:40 <n0ano> jog0, should be simple, a plugin for the compute node to report new usage info, a filter plugin for the scheduler to use that info.
15:17:57 <PaulMurray> n0ano - exactly my thinking
15:18:04 <jog0> and DB update, and resource tracker
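[Editor's note: a minimal sketch of the scheduler-side half of such a plugin pair, assuming the Havana-era filter API; 'example_metric' is a made-up value that a compute-side monitor plugin would publish into the host's stats.]

    from nova.scheduler import filters

    class ExampleMetricFilter(filters.BaseHostFilter):
        """Pass only hosts whose plugin-reported metric has enough headroom."""

        def host_passes(self, host_state, filter_properties):
            # Assumes host_state.stats carries compute-reported key/value
            # stats; both key names below are illustrative, not real nova keys.
            requested = filter_properties.get('example_metric_required', 0)
            available = host_state.stats.get('example_metric', 0)
            return available >= requested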
15:18:49 <Alexei_987> jog0: you should also consider algorithm efficiency - it's already quite slow and will become even slower if we use such an approach
15:19:05 <jog0> Alexei_987: that is a different discussion IMHO
15:19:20 <n0ano> jog0, I've never understood the retry logic but the DB should only be updated by the compute node so where's the overlap?
15:19:40 <jog0> n0ano: the retry logic is to handle race conditions in scheduling
15:19:48 <jog0> where scheduler sends instance to a node that cannot handle it
15:19:55 <PaulMurray> n0ano - the retry logic works with multiple schedulers as well as just the asynchrony
15:20:04 <PaulMurray> A scheduler can be wrong about what is out there
15:20:06 <jog0> the node checks the requirements and if it cannot run the instance it reschedules
15:20:31 <jog0> so we already have scheduling code on compute nodes
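[Editor's note: a simplified, self-contained sketch of that retry flow, not the actual nova code: the scheduler records attempted hosts in filter_properties, and a failed compute-side capacity check sends the request around again without them.]

    MAX_ATTEMPTS = 3

    def populate_retry(filter_properties):
        retry = filter_properties.setdefault('retry',
                                             {'num_attempts': 0, 'hosts': []})
        retry['num_attempts'] += 1
        if retry['num_attempts'] > MAX_ATTEMPTS:
            raise RuntimeError('NoValidHost: exceeded max scheduling attempts')
        return retry

    def schedule(hosts, instance, filter_properties):
        retry = populate_retry(filter_properties)
        candidates = [h for h in hosts if h['name'] not in retry['hosts']]
        if not candidates:
            raise RuntimeError('NoValidHost: no hosts left to try')
        host = candidates[0]  # filtering/weighing elided
        retry['hosts'].append(host['name'])
        if instance['memory_mb'] > host['free_ram_mb']:
            # The race jog0 describes: the host looked fine to the scheduler,
            # but its own check fails, so the request goes around again.
            return schedule(hosts, instance, filter_properties)
        return host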
15:21:16 <n0ano> jog0, scheduling code on the compute node - that just seems wrong, shouldn't the compute node fail the start request and then the scheduler try again?
15:21:31 <glikson> jog0: if I understand correctly, you are suggesting that nova-compute will double-check that it can accept the request instead of just passing it down to the hypervisor?
15:21:51 <jog0> n0ano: I am not suggesting anything, just stating current status
15:22:16 <PaulMurray> the compute node is ultimately in control of what it can do
15:22:25 <jog0> PaulMurray: exactly
15:22:29 <PaulMurray> that is a sound reliability control if nothing else
15:22:38 <PaulMurray> other services can be wrong about the state of the world
15:22:44 <n0ano> jog0, indeed, I'm thinking that's an area that needs to be looked at again, the current implementation seems wrong
15:22:44 <PaulMurray> possibly due to failure
15:23:10 <jog0> n0ano: we are way off topic now
15:23:18 <jog0> lets keep the meeting on topic
15:23:26 <jog0> we finished the ceilometer vs nova discussion
15:23:36 <jog0> then this turned into simple plugin discussion
15:23:47 <n0ano> someone should slap the moderator to keep things in line :-)
15:24:06 <n0ano> yes, let's move on
15:24:08 <n0ano> #agreed extend the scheduler for now, have a joint session at the summit with scheduler & ceilometer to see intersections
15:24:31 <n0ano> #topic a simple way to improve nova scheduler
15:24:57 <PaulMurray> Is this the no db one
15:24:58 <PaulMurray> ?
15:25:04 <n0ano> this to me is back to something we talked about weeks ago, DB update vs. fan out messages
15:25:51 <jog0> fan out == bad IMHO, sounds like snooping vs directory for cache coherence http://en.wikipedia.org/wiki/Cache_coherence#Cache_coherence_mechanisms
15:26:19 <jog0> that being said we need more numbers to understand this
15:26:28 <n0ano> jog0, not sure about the concern
15:26:34 <alaski> I'm going to quickly jump in for the last topic and say that I'm working to remove retry logic from computes, and move it to conductor.  But otherwise everything that was said stands.
15:27:07 <PaulMurray> I have built virtual infrastructures that work the "fan-out" way
15:27:30 <jog0> alaski: -- doing scheduling validation anywhere but the compute node raises the risk of race conditions
15:27:57 <n0ano> one problem I have is, currently, the compute nodes update the DB and also send a fan out message so we're doing both
15:28:12 <jog0> PaulMurray: sure fan-out broadcasts can work but at some point they fall over
15:28:26 <jog0> n0ano: agreed my vote is kill fanout
15:28:28 <PaulMurray> jog0 well - I was going to say
15:28:31 <alaski> jog0: compute still validates, but it's at the request of the conductor before sending the build there.
15:28:39 <jog0> there was a ML thread about it a few weeks back
15:28:41 <PaulMurray> that there are a lot of good arguments for the db approach
15:28:48 <n0ano> jog0, and I would vote to kill the DB
15:28:52 <alaski> agree on killing fanout
15:28:53 <jog0> alaski: why?
15:28:59 <n0ano> but at least we agree that one of them needs to be killed
15:29:52 <jog0> broadcast to scheduler means we are sending tons of duplicate data
15:30:05 <jog0> and if we overload one scheduler thread with updates they all overload
15:30:12 <PaulMurray> jog0 we must be using different protocols
15:30:20 <PaulMurray> that didn't happen for us
15:30:29 <n0ano> well, I would also like to remove the periodic update and only send messages when state on the compute node changes
15:30:30 <PaulMurray> the problems we had are more to do with tooling
15:30:31 <jog0> we distribute the load by multiplying it by the number of schedulers
15:30:45 <PaulMurray> db is easy to inspect and debug - big in production env
15:30:55 <alaski> jog0: to centralize scheduling logic for later taskflow work.  but it's getting confusing to have this conversation in the middle of another.
15:31:23 <Alexei_987> jog0: the problem with such an approach is that we cannot distribute the DB
15:31:25 <jog0> n0ano: as the meeting chair, which conversation? no-db, and return to alaski's later?
15:31:35 <n0ano> alaski, jog0 - yes, let's try and concentrate on the DB vs. fanout for now
15:31:56 <n0ano> we can take up the retry issues later
15:32:08 <jog0> Alexei_987: can you clarify
15:32:36 <Alexei_987> jog0: for each scheduling request we read data about ALL compute nodes + we do 2 table joins
15:32:56 <Alexei_987> jog0: such a solution won't scale if one DB server isn't able to handle the load
15:33:02 <jog0> Alexei_987: so you are making a big assumption here: that we cannot change how we use the DB
15:33:05 <jog0> but we can
15:33:14 <jog0> we don't have to do two joins or read all compute nodes
15:33:46 <Alexei_987> jog0: how can we read info from DB without them?
15:34:05 <jog0> them?
15:34:09 <Alexei_987> jog0: we need to read all the data because we do filtering and weighing
15:34:27 <jog0> Alexei_987: we can do some filtering in DB
15:34:30 <Alexei_987> jog0: sorry.. we do filtering in python code
15:34:44 <jog0> once again you are assuming we cannot change the basic scheduling logic
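[Editor's note: one hedged illustration of what "filtering in the DB" could look like, using a SQLAlchemy model loosely modelled on the compute_nodes table; the columns and helper below are simplified, not nova's actual schema.]

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class ComputeNode(Base):
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        host = Column(String(255))
        free_ram_mb = Column(Integer)
        vcpus_free = Column(Integer)

    def candidate_hosts(session, ram_mb, vcpus):
        # Push the cheap capacity checks into SQL so only hosts that could
        # possibly satisfy the request are loaded; Python-side filtering and
        # weighing then run on a much smaller set, with no joins needed here.
        return (session.query(ComputeNode)
                       .filter(ComputeNode.free_ram_mb >= ram_mb)
                       .filter(ComputeNode.vcpus_free >= vcpus)
                       .all())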
15:35:26 <Alexei_987> jog0: do you have a proposal for how we can improve that?
15:35:29 <n0ano> let's reset, seems to be agreement that the current DB usage is a scalability problem...
15:35:43 <PaulMurray> agreed +556
15:35:51 <Alexei_987> agreed
15:35:53 <n0ano> we can solve the problem by removing the DB access or coming up with a way to make the DB access more scalable
15:36:14 <jog0> don't agree fully
15:36:27 <jog0> n0ano: the issue IMHO is the current scheduler doesn't scale
15:36:36 <alaski> And from the other side, many of us feel that fanouts and RPC traffic are a scalability problem.
15:36:40 <jog0> as validated by bluehost
15:37:14 <n0ano> jog0, I believe bluehost was hit by the periodic update, remove that and their problems will potentially be much easier
15:37:25 <jog0> n0ano: yes, that IMHO is step 1
15:37:34 <jog0> either way we want to do that
15:37:48 <n0ano> note, removing the periodic update is independent of DB vs. fanout
15:37:56 <jog0> n0ano: right
15:38:11 <boris-42> n0ano +1
15:38:23 <boris-42> I don't see any reason to keep data in DB
15:38:24 <n0ano> so do we have agreement that removing the periodic update is the first thing that has to go?
15:38:31 <boris-42> RPC is much faster than the DB anyway
15:38:32 <jog0> well mostly, things get funny if you do RPC casts to the scheduler instead of the DB and assume you can lose a msg
15:38:49 <boris-42> and what about DB
15:38:55 <boris-42> we are working through RPC
15:39:00 <boris-42> and we are able to lose a msg
15:39:01 <boris-42> also
15:39:13 <boris-42> and our compute node won't be updated
15:39:24 <boris-42> we are trying to imagine new problems
15:39:27 <jog0> call vs cast. and conductor is optional
15:39:29 <boris-42> when we already have real problems
15:39:51 <jog0> boris-42: slow down, we are taking baby steps here
15:40:00 <boris-42> =)
15:40:26 <boris-42> why are we not able to remove the DB and use RPC (and then try to find a better method than RPC)?
15:40:40 <jog0> IMHO after removing the periodic update the next step is to better quantify the next scaling limits
15:40:49 <jog0> and get good numbers on potential solutions
15:40:58 <PaulMurray> Can I jump in here
15:41:06 <boris-42> jog0 go in nova?
15:41:11 <jog0> PaulMurray: please
15:41:18 <PaulMurray> We need to keep an eye on production
15:41:21 <n0ano> jog0, indeed, I think the periodic update is swamping scalability so much we need to remove it and then measure things
15:41:25 <PaulMurray> as well as simple performance
15:41:31 <PaulMurray> performance is a big deal
15:41:34 <PaulMurray> and scalability
15:41:40 <PaulMurray> but right now we have tooling that
15:41:48 <PaulMurray> allows us to inspect what is going on
15:41:53 <PaulMurray> I am all for a change
15:42:00 <PaulMurray> but we need to make sure we do it with
15:42:08 <PaulMurray> security, reliability, ha,
15:42:17 <PaulMurray> and debuggability (if that's a word)
15:42:18 <PaulMurray> in mind
15:42:39 <jog0> PaulMurray: ++
15:42:48 <PaulMurray> So whatever
15:42:50 <n0ano> PaulMurray, I like your thought process
15:43:02 <PaulMurray> small change we make to start removing the database
15:43:08 <PaulMurray> (if that is what we do)
15:43:13 <PaulMurray> must allow all these
15:43:24 <n0ano> but I think your concerns are addressable even if we remove the DB
15:43:35 <PaulMurray> I agree
15:43:48 <PaulMurray> But it is more complex
15:44:12 <PaulMurray> Not undoable - just a bigger step than it seems
15:44:38 <PaulMurray> Sorry - didn't mean to kill the discussion
15:44:40 <PaulMurray> :(
15:44:44 <n0ano> basically, the `simple` should be removed from the topic :-)
15:44:52 <PaulMurray> :)
15:45:07 <n0ano> PaulMurray, no, reality check is always good
15:45:44 <Alexei_987> I propose to start with doing clear diagrams of existing scheduling process
15:46:00 <n0ano> are we done with this discussion for now? there's still room for thought on the mailing list
15:46:05 <Alexei_987> to get better understanding of the flow and how it can be improved
15:46:14 <glikson> sounds like a good topic for the upcoming summit :-)
15:46:23 <boris-42> =)
15:46:32 <n0ano> Alexei_987, I want to do that, it's just that I've been mired in unrelated stuff so far
15:47:06 <boris-42> Also we should just compare the 2 approaches in numbers: DB vs RPC
15:47:10 <jog0> n0ano: we have one action item from this
15:47:13 <jog0> err two
15:47:22 <jog0> someone kill periodic update
15:47:50 <boris-42> jog0 periodic updates are not a problem
15:47:52 <glikson> seems that there is no consensus at the moment on either the "simple" or the "improve" part of the proposed method..
15:47:54 <boris-42> jog0 not so big a problem
15:47:55 <n0ano> #action kill the periodic update (even if just for scalability measurements)
15:47:57 <jog0> and this one while we are at it (https://bugs.launchpad.net/nova/+bug/1178008)
15:47:59 <uvirtbot> Launchpad bug 1178008 in nova "publish_service_capabilities does a fanout to all nova-compute" [Undecided,Triaged]
15:48:02 <jog0> boris-42: see above
15:48:39 <n0ano> #action analyze the `current` scheduler
15:48:51 <jog0> and the other action item is to get better numbers on boris-42's proposal along with how to reproduce the results
15:49:06 <jog0> numbers on current scheduler along with proposal
15:49:28 <n0ano> jog0, which numbers, the DB access or scheduler overhead
15:49:43 <boris-42> devananda pls
15:49:46 <jog0> n0ano: all and any that will help us decide
15:49:54 <boris-42> devananda could you say something about JOINs?)
15:50:23 <boris-42> =)
15:50:24 <jog0> boris-42: see above ^.
15:50:34 <boris-42> what should I see, I don't understand
15:50:37 <llu-laptop> just want to be more clear, is the 'kill the periodic update' the same thing as the bug jog0 mentioned?
15:50:49 <boris-42> IMHO
15:50:58 <boris-42> periodic tasks are not our problem at this moment
15:51:01 <jog0> llu-laptop: not exactly they are related though
15:51:23 <boris-42> removing or not removing, we will have the same situation +-
15:51:32 <n0ano> boris-42, I believe they are a problem in that they are swamping scalability measurements
15:52:19 <boris-42> n0ano they will be the next problem when we remove the JOIN
15:52:26 <llu-laptop> I think boris-42's problem is the DB join load, right?
15:52:39 <boris-42> llu-laptop yes
15:52:49 <boris-42> JOINs are always a problem
15:52:50 <jog0> sure one issue at a time
15:53:03 <boris-42> jog0 priority is an important thing
15:53:03 <n0ano> jog0, +1
15:53:13 <boris-42> jog0 JOIN has critical priority
15:53:20 <boris-42> jog0 periodic_task low
15:53:58 <n0ano> boris-42, note that bluehost showed significant scalability issues that we are pretty sure are related to the periodic updates, so they couldn't even look at the join issue
15:54:20 <jog0> boris-42: but the join won't be fixed before Icehouse, it's too late
15:54:47 <jog0> n0ano: exactly
15:55:24 <llu-laptop> Is there any bp about the 'killing periodic update' thing?
15:55:43 <n0ano> no, but I'd be willing to create one
15:55:50 <boris-42> llu-laptop I think that it is too late for such changes
15:55:55 <boris-42> imho
15:56:06 <boris-42> it is much worse than removing the join
15:56:12 <boris-42> because we are changing behavior
15:56:18 <n0ano> well, for havana it's too late but there's always icehouse
15:56:52 <n0ano> of course, I think it's too late to change the join for havana also
15:56:56 <ogelbukh> does nova-compute update state on provision/deprovision?
15:57:20 <n0ano> ogelbukh, yes (the DB) and it also does the periodic update
15:57:33 <ogelbukh> n0ano: thanks
15:57:59 <jog0> n0ano: it shouldn't be a big change http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html
15:58:09 <jog0> the periodic update is hardly used right now
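[Editor's note: a minimal sketch of the change-triggered reporting n0ano suggested earlier, i.e. publishing only when the compute node's state actually changes; the rpc client and its update_host_state method are hypothetical.]

    class CapabilityReporter(object):
        """Compute-side: publish capabilities only when they change."""

        def __init__(self, rpc_client):
            self.rpc_client = rpc_client
            self._last_sent = None

        def maybe_publish(self, capabilities):
            # Compare against the last published snapshot instead of casting
            # on every periodic tick; lost-message and restart cases would
            # still need a (much less frequent) resync path.
            if capabilities != self._last_sent:
                self.rpc_client.update_host_state(capabilities)
                self._last_sent = dict(capabilities)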
15:58:16 <n0ano> #action n0ano to create BP to remove the periodic update
15:58:17 <ogelbukh> so, basically only service state gets lost if periodic updates are just dropped?
15:58:34 <ogelbukh> well
15:58:35 <ogelbukh> not lost
15:58:49 <jog0> ogelbukh: service state? that is separate
15:59:06 <ogelbukh> ok
15:59:08 <n0ano> ogelbukh, we need to consider things like lost messages, new compute nodes, restarted compute nodes - the edge cases are always the problem.
15:59:42 <PaulMurray> n0ano doesn't the stats reporting introduce more of those if it is added?
15:59:56 <ogelbukh> n0ano: sure, just trying to get my head around this
15:59:58 <boris-42> jog0
15:59:59 <n0ano> hey guys, it's the top of the hour and I have to run, great meeting (we missed one topic but there's always next week)
16:00:02 <boris-42> jog0 One question
16:00:22 <boris-42> jog0 If we remove the DB but keep the periodic_task, and it works well on 10k or 30k nodes
16:00:28 <n0ano> PaulMurray, it's already there (we're doing a lot of extra work right now)
16:00:31 <boris-42> jog0 experimental hosts
16:00:47 <boris-42> jog0 will you change the priority of removing the periodic_task
16:00:49 <boris-42> ?)
16:00:58 <PaulMurray> good speaking to you all
16:01:03 <n0ano> sorry but I have to run
16:01:09 <n0ano> #endmeeting