15:00:53 #startmeeting scheduler
15:00:54 Meeting started Tue Jul 23 15:00:53 2013 UTC. The chair is n0ano. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:55 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:57 The meeting name has been set to 'scheduler'
15:01:09 * n0ano can't type this morning
15:01:20 anyone here for the scheduler meeting?
15:01:24 o/
15:01:30 hi!
15:01:53 I'm here for scheduler
15:01:57 hi
15:02:25 here, but multitasking unfortunately
15:02:52 o/
15:02:52 alaski, NP, we just give you all the actions, you can't defend :-)
15:03:03 :)
15:03:40 we have some potentially contentious issues today, let's get started
15:03:51 #topic ceilometer vs. nova metric collector
15:04:37 I don't know if the thread on the dev list has bottomed out on this, I still think it makes sense to extend the current scheduler and not rely exclusively on ceilometer
15:04:54 I was in that thread
15:05:02 There are two issues for me
15:05:17 I am working on network aware scheduling
15:05:25 I like the idea of using ceilometer, but agree that it can be a future thing. I think it makes sense to share a lib if possible, and make sure we get an api that will work for both.
15:05:35 which is not the same as the analytical kind of stuff
15:05:58 alaski, +1
15:06:24 perhaps we can get a joint ceilometer/nova session at the summit on this
15:06:30 hi
15:06:56 jog0, we should, I just don't want development to stop waiting for the next summit
15:07:32 agreed, from the ML thread it sounded like everyone thought a non-ceilometer mode made sense
15:07:44 agreed
15:07:47 note I have a slightly vested interest, my group is working on these changes to the scheduler but obviously we think that's the right way to go.
15:07:50 and at the summit we can re-evaluate where that line is and how the two projects could work together
15:08:53 n0ano: not sure what happened to it but there was a top-level scheduler idea once so nova and cinder could have one scheduler
15:09:34 I'm hearing consensus here - extend the scheduler for now, have a joint session at the summit with scheduler & ceilometer to see intersections
15:10:03 jog0, I haven't seen that, do you think there was a BP about it?
15:10:03 I agree
15:10:29 n0ano: +1
15:10:35 n0ano: don't remember now, it was an idea that floated around for a while
15:10:35 To some extent
15:10:49 I don't think there was a bp about it, but I remember the joint scheduler talk
15:10:51 yep. there was a session on this at the last summit, and no one seemed too interested in working on ceilometer integration..
15:11:09 Can I ask what you mean by extend the scheduler in this case
15:11:13 n0ano: I think timing-wise getting any major scheduler extensions into H3 will be hard so they need to be proposed ASAP
15:11:25 glikson, sigh, I guess we'll try again in Hong Kong
15:12:02 PaulMurray, we're talking about the ability to add plugins to provide extra scheduling info
15:12:22 n0ano thanks, the question I have is
15:12:33 jog0, it was https://blueprints.launchpad.net/oslo/+spec/oslo-scheduler ?
15:12:37 n0ano: the generic ability to do so more easily than currently?
15:12:40 about the resource consumption part
15:13:02 jog0, yes, that's the idea
15:13:02 jgallard: that was different, that was separate schedulers with a dedup'd code base
15:13:05 Do you see a need for the scheduler and compute nodes to check resource consumption
15:13:18 as the claims do for example
15:13:23 That part was missing for me
15:13:33 PaulMurray, what do you mean by `check`
15:14:13 I want to do the same as say ram_mb, free_ram_mb, ram_mb_used
15:14:28 the compute node makes sure it has capacity - it doesn't just accept instances
15:14:39 the scheduler also updates hoststate
15:14:44 PaulMurray: I think you are hitting on one of the main complaints people have today
15:14:57 what is your opinion?
15:15:30 PaulMurray: Have you seen my reply to your concern in https://review.openstack.org/#/c/35764/. The main problem is that we need to have a way to tell the scheduler which kind of resource will be consumed, and how much for each instance
15:15:57 I'm still confused, to me it's simple, compute nodes report their resource capabilities to the scheduler and then the scheduler makes decisions based upon that
15:16:27 n0ano: and you need to add that logic all over the code base
15:16:48 I don't like the idea of duplicating the scheduler work in the compute nodes
15:17:01 resulting in a complex change to support new resource tracking
15:17:13 n0ano that is the way it is now for static allocations
15:17:28 n0ano: PaulMurray is talking about resources like free_mem, which are touched by both the scheduler and the compute node during consumption
15:17:33 n0ano: we do that now, with the retry logic
15:17:40 jog0, should be simple, a plugin for the compute node to report new usage info, a filter plugin for the scheduler to use that info.
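The plugin split n0ano describes here (a compute-side plugin reports extra usage info, a scheduler-side filter consumes it) can be sketched in a few lines of standalone Python. The class and method names below only echo the shape of nova's filter scheduler; they are illustrative, not the real nova plugin API, and the bandwidth metric is an invented example.

```python
# Illustrative sketch of the plugin split discussed above: a compute-side
# plugin reports an extra metric, a scheduler-side filter consumes it.
# Names mirror nova's filter scheduler loosely but this is NOT the real API.

class HostState:
    """Scheduler's view of one compute node."""
    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb
        self.metrics = {}  # extra usage info contributed by plugins

class NetworkBandwidthReporter:
    """Hypothetical compute-side plugin: reports one extra metric."""
    def report(self, host_state, free_bandwidth_mbps):
        host_state.metrics['free_bandwidth_mbps'] = free_bandwidth_mbps

class BandwidthFilter:
    """Hypothetical scheduler-side filter using that metric."""
    def host_passes(self, host_state, request):
        needed = request.get('bandwidth_mbps', 0)
        return host_state.metrics.get('free_bandwidth_mbps', 0) >= needed

def filter_hosts(hosts, filters, request):
    """Keep only hosts that pass every filter, as a filter scheduler does."""
    return [h for h in hosts
            if all(f.host_passes(h, request) for f in filters)]
```

With this shape, adding a new resource type means adding one reporter and one filter rather than touching scheduler core logic, which is the "generic ability" being asked about above.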
15:17:57 n0ano - exactly my thinking
15:18:04 and DB update, and resource tracker
15:18:49 jog0: you should also consider algorithm efficiency - it's already quite slow and will become even slower if we use such an approach
15:19:05 Alexei_987: that is a different discussion IMHO
15:19:20 jog0, I've never understood the retry logic but the DB should only be updated by the compute node so where's the overlap?
15:19:40 n0ano: the retry logic is to handle race conditions in scheduling
15:19:48 where the scheduler sends an instance to a node that cannot handle it
15:19:55 n0ano - the retry logic works with multiple schedulers as well as just the asynchrony
15:20:04 A scheduler can be wrong about what is out there
15:20:06 the node checks the requirements and if it cannot run the instance it reschedules
15:20:31 so we already have scheduling code on compute nodes
15:21:16 jog0, scheduling code on the compute node - that just seems wrong, should the compute node fail the start request and then the scheduler should try again
15:21:26 s/should/shouldn't
15:21:31 jog0: if I understand correctly, you are suggesting that nova-compute will double-check that it can accept the request instead of just passing it down to the hypervisor?
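The race the retry logic handles, as jog0 explains above, can be modeled in a toy sketch: the scheduler picks hosts from a possibly stale view, the compute node re-checks capacity when it claims, and a failed claim triggers a retry on the next candidate. All names here are illustrative; this is not nova's actual claim/retry code.

```python
# Toy model of the retry flow described above: the scheduler's view may be
# stale, so the compute node re-checks (claims) capacity itself, and a
# failed claim makes the scheduler retry elsewhere. Illustrative only.

class ComputeNode:
    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def claim(self, ram_mb):
        """The node is the final authority: accept only if it really fits."""
        if ram_mb > self.free_ram_mb:
            return False  # reject, so the scheduler retries
        self.free_ram_mb -= ram_mb
        return True

def schedule(candidate_nodes, ram_mb, max_retries=3):
    """Try candidates in order; each rejection consumes one retry."""
    for attempt, node in enumerate(candidate_nodes):
        if attempt >= max_retries:
            break
        if node.claim(ram_mb):
            return node.name
    raise RuntimeError('NoValidHost: all claims failed')
```

The point being debated is whether this double-check belongs on the compute node (as today) or somewhere central like the conductor, which alaski mentions below.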
15:21:51 n0ano: I am not suggesting anything, just stating the current status
15:22:16 the compute node is ultimately in control of what it can do
15:22:25 PaulMurray: exactly
15:22:29 that is a sound reliability control if nothing else
15:22:38 other services can be wrong about the state of the world
15:22:44 jog0, indeed, I'm thinking that's an area that needs to be looked at again, the current implementation seems wrong
15:22:44 possibly due to failure
15:23:10 n0ano: we are way off topic now
15:23:18 let's keep the meeting on topic
15:23:26 we finished the ceilometer vs nova discussion
15:23:36 then this turned into a simple plugin discussion
15:23:47 someone should slap the moderator to keep things in line :-)
15:24:06 yes, let's move on
15:24:08 #agreed extend the scheduler for now, have a joint session at the summit with scheduler & ceilometer to see intersections
15:24:31 #topic a simple way to improve nova scheduler
15:24:57 Is this the no db one
15:24:58 ?
15:25:04 this to me is back to something we talked about weeks ago, DB update vs. fan out messages
15:25:51 fan out == bad IMHO, sounds like snooping vs directory for cache coherence http://en.wikipedia.org/wiki/Cache_coherence#Cache_coherence_mechanisms
15:26:19 that being said we need more numbers to understand this
15:26:28 jog0, not sure about the concern
15:26:34 I'm going to quickly jump in for the last topic and say that I'm working to remove retry logic from computes, and move it to conductor. But otherwise everything that was said stands.
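jog0's duplicate-data objection to fan-out can be made concrete with back-of-the-envelope arithmetic: broadcasting every compute update to every scheduler costs updates times schedulers messages, while the DB path costs one write per update plus reads only when scheduling. The numbers below are invented purely for illustration; only the shape of the two formulas matters.

```python
# Rough message-count comparison of the two update paths being debated.
# All numbers are hypothetical; the point is the multiplicative factor
# that fan-out adds (every update duplicated to every scheduler).

def fanout_messages(compute_nodes, schedulers, updates_per_node):
    # each node's update is broadcast to every scheduler
    return compute_nodes * updates_per_node * schedulers

def db_messages(compute_nodes, updates_per_node, schedule_requests):
    # one DB write per update, one read per scheduling request
    return compute_nodes * updates_per_node + schedule_requests

# e.g. 1000 nodes, 8 schedulers, 60 updates/node over some window:
print(fanout_messages(1000, 8, 60))   # 480000
print(db_messages(1000, 60, 5000))    # 65000
```

This is the snooping-vs-directory analogy in numbers: fan-out scales with the product of nodes and schedulers, the directory (DB) style with their sum-like terms.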
15:27:07 I have built virtual infrastructures that work the "fan-out" way
15:27:30 alaski: -- doing scheduling validation anywhere but the compute node raises the risk of race conditions
15:27:57 one problem I have is, currently, the compute nodes update the DB and also send a fan out message so we're doing both
15:28:12 PaulMurray: sure fan-out broadcasts can work but at some point they fall over
15:28:26 n0ano: agreed my vote is kill fanout
15:28:28 jog0 well - I was going to say
15:28:31 jog0: compute still validates, but it's at the request of the conductor before sending the build there.
15:28:39 there was a ML thread about it a few weeks back
15:28:41 that there are a lot of good arguments for the db approach
15:28:48 jog0, and I would vote to kill the DB
15:28:52 agree on killing fanout
15:28:53 alaski: why?
15:28:59 but at least we agree that one of them needs to be killed
15:29:52 broadcast to scheduler means we are sending tons of duplicate data
15:30:05 and if we overload one scheduler thread with updates they all overload
15:30:12 jog0 we must be using different protocols
15:30:20 that didn't happen for us
15:30:29 well, I would also like to remove the periodic update and only send messages when state on the compute node changes
15:30:30 the problems we had are more to do with tooling
15:30:31 we distribute the load by multiplying it by the number of schedulers
15:30:45 db is easy to inspect and debug - big in a production env
15:30:55 jog0: to centralize scheduling logic for later taskflow work. but it's getting confusing to have this conversation in the middle of another.
15:31:23 jog0: the problem with such an approach is that we cannot distribute the DB
15:31:25 n0ano: as the meeting chair which conversation? no db, and return to alaski
15:31:35 alaski, jog0 - yes, let's try and concentrate on the DB vs. fanout for now
15:31:56 we can take up the retry issues later
15:32:08 Alexei_987: can you clarify
15:32:36 jog0: for each scheduling request we read data about ALL compute nodes + we do 2 table joins
15:32:56 jog0: such a solution won't scale if one DB server won't be able to handle the load
15:33:02 Alexei_987: so you are making a big assumption here: that we cannot change how we use the DB
15:33:05 but we can
15:33:14 we don't have to do two joins or read all compute nodes
15:33:46 jog0: how can we read info from the DB without them?
15:34:05 them?
15:34:09 jog0: we need to read all data because we do filtering and weighing
15:34:27 Alexei_987: we can do some filtering in the DB
15:34:30 jog0: sorry.. we do filtering in python code
15:34:44 once again you are assuming we cannot change the basic scheduling logic
15:35:26 jog0: do you have a proposition for how we can improve that?
15:35:29 let's reset, there seems to be agreement that the current DB usage is a scalability problem...
15:35:43 agreed +556
15:35:51 agreed
15:35:53 we can solve the problem by removing the DB access or coming up with a way to make the DB access more scalable
15:36:14 don't agree fully
15:36:27 n0ano: the issue IMHO is the current scheduler doesn't scale
15:36:36 And from the other side, many of us feel that fanouts and RPC traffic are a scalability problem.
15:36:40 as validated by bluehost
15:37:14 jog0, I believe bluehost was hit by the periodic update, remove that and their problems will potentially be much easier
15:37:25 n0ano: yes, that IMHO is step 1
15:37:34 either way we want to do that
15:37:48 note, removing the periodic update is independent of DB vs. fanout
15:37:56 n0ano: right
15:38:11 n0ano +1
15:38:23 I don't see any reason to keep data in the DB
15:38:24 so do we have agreement that removing the periodic update is the first thing that has to go?
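jog0's point that we "don't have to do two joins or read all compute nodes" can be shown with sqlite: push the cheap capacity check into the WHERE clause so the scheduler reads only candidate hosts instead of every row. The one-table schema below is invented for illustration (no joins at all), not nova's real data model.

```python
# Sketch of pushing filtering into the DB, per the discussion above:
# instead of reading ALL compute nodes and filtering in Python, let the
# WHERE clause discard non-candidates. Schema is made up for illustration.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE compute_nodes (host TEXT, free_ram_mb INTEGER)')
conn.executemany('INSERT INTO compute_nodes VALUES (?, ?)',
                 [('node%d' % i, 512 * i) for i in range(1, 101)])

def candidates_python(requested_ram_mb):
    # current style: fetch every row, filter in Python code
    rows = conn.execute(
        'SELECT host, free_ram_mb FROM compute_nodes').fetchall()
    return [h for h, ram in rows if ram >= requested_ram_mb]

def candidates_sql(requested_ram_mb):
    # pushed-down style: the DB returns only viable hosts
    rows = conn.execute(
        'SELECT host FROM compute_nodes WHERE free_ram_mb >= ?',
        (requested_ram_mb,)).fetchall()
    return [h for (h,) in rows]
```

Both return the same candidate set, but the pushed-down version transfers only the viable rows, which is the kind of "change how we use the DB" jog0 is arguing is possible. Complex filters that need Python logic (or weighing) would still run on the reduced set.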
15:38:31 RPC is much faster than DB anyway
15:38:32 well mostly, things get funny if you do RPC casts to the scheduler instead of the DB and assume you can lose a msg
15:38:49 and what about the DB
15:38:55 we are working through RPC
15:39:00 and we are able to lose a msg
15:39:01 also
15:39:13 and our compute node won't be updated
15:39:24 we are trying to imagine new problems
15:39:27 call vs cast. and conductor is optional
15:39:29 when we already have real problems
15:39:51 boris-42: slow down, we are taking baby steps here
15:40:00 =)
15:40:26 why are we not able to remove the DB and use RPC (and then try to find a better method than RPC)
15:40:40 IMHO after removing the periodic update the next step is to better quantify the next scaling limits
15:40:49 and get good numbers on potential solutions
15:40:58 Can I jump in here
15:41:06 jog0 go in nova?
15:41:11 PaulMurray: please
15:41:18 We need to keep an eye on production
15:41:21 jog0, indeed, I think the periodic update is swamping scalability so much we need to remove it and then measure things
15:41:25 as well as simple performance
15:41:31 performance is a big deal
15:41:34 and scalability
15:41:40 but right now we have tooling that
15:41:48 allows us to inspect what is going on
15:41:53 I am all for a change
15:42:00 but we need to make sure we do it with
15:42:08 security, reliability, ha,
15:42:17 and debuggability (if that's a word)
15:42:18 in mind
15:42:39 PaulMurray: ++
15:42:48 So whatever
15:42:50 PaulMurray, I like your thought process
15:43:02 small change we make to start removing the database
15:43:08 (if that is what we do)
15:43:13 must allow all these
15:43:24 but I think your concerns are addressable even if we remove the DB
15:43:35 I agree
15:43:48 But it is more complex
15:44:12 Not undoable - just a bigger step than it seems
15:44:38 Sorry - didn't mean to kill the discussion
15:44:40 :(
15:44:44 basically, the `simple` should be removed from the topic :-)
15:44:52 :)
15:45:07 PaulMurray, no, a reality check is always good
15:45:44 I propose to start with doing clear diagrams of the existing scheduling process
15:46:00 are we exhausted over this discussion for now, still room for thought on the mailing list
15:46:05 to get a better understanding of the flow and how it can be improved
15:46:14 sounds like a good topic for the upcoming summit :-)
15:46:23 =)
15:46:32 Alexei_987, I want to do that it's just I've been mired in unrelated stuff so far
15:47:06 Also we should just compare the 2 approaches in numbers: DB vs RPC
15:47:10 n0ano: we have one action item from this
15:47:13 err two
15:47:22 someone kill periodic update
15:47:50 jog0 periodic updates are not a problem
15:47:52 seems that there is no consensus at the moment regarding both the "simple" and the "improve" of the proposed method..
15:47:54 jog0 so big a problem
15:47:55 #action kill the periodic update (even if just for scalability measurements)
15:47:57 and this one while we are at it (https://bugs.launchpad.net/nova/+bug/1178008)
15:47:59 Launchpad bug 1178008 in nova "publish_service_capabilities does a fanout to all nova-compute" [Undecided,Triaged]
15:48:02 boris-42: see above
15:48:39 #action analyze the `current` scheduler
15:48:51 and the other action item is to get better numbers on boris-42's proposal along with how to reproduce the results
15:49:06 numbers on the current scheduler along with the proposal
15:49:28 jog0, which numbers, the DB access or the scheduler overhead
15:49:43 devananda pls
15:49:46 n0ano: all and any that will help us decide
15:49:54 devananda could you say something about JOINs?)
15:50:23 =)
15:50:24 boris-42: see above ^.
15:50:34 what should I see, I don't understand
15:50:37 just want to be more clear, is the 'kill the periodic update' the same thing as the bug jog0 mentioned?
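The proposal above to "only send messages when state on the compute node changes" amounts to diffing against the last snapshot sent. A rough sketch of that reporter, with an injected publish callback standing in for whatever transport (RPC cast or DB write) is chosen; this is illustrative, not the actual nova change:

```python
# Sketch of replacing the periodic update with on-change updates, as
# proposed above: remember the last snapshot sent and publish only when
# the node's state actually differs. Illustrative, not nova's code.

class StateReporter:
    def __init__(self, publish):
        self._publish = publish   # e.g. an RPC cast or a DB write
        self._last_sent = None

    def maybe_report(self, state):
        """Publish only if state changed since the last report."""
        if state != self._last_sent:
            self._publish(dict(state))
            self._last_sent = dict(state)
            return True
        return False              # unchanged: stay quiet
```

With a 60-second periodic task an idle node emits an update a minute regardless; with on-change reporting it emits only as many messages as it has state transitions. The edge cases n0ano raises below (lost messages, new or restarted nodes) are exactly what this simple diff does not cover on its own.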
15:50:49 IMHO
15:50:58 periodic tasks are not our problem at this moment
15:51:01 llu-laptop: not exactly, they are related though
15:51:23 removing or not removing, we will have the same situation +-
15:51:32 boris-42, I believe they are a problem in that they are swamping scalability measurements
15:52:19 n0ano they will be the next problem when we remove the JOIN
15:52:26 I think boris-42's problem is the DB joinload, right?
15:52:39 llu-laptop yes
15:52:49 JOINs are always a problem
15:52:50 sure, one issue at a time
15:53:03 jog0 priority is an important thing
15:53:03 jog0, +1
15:53:13 jog0 the JOIN has critical priority
15:53:20 jog0 periodic_task is low
15:53:58 boris-42, note that bluehost showed significant scalability issues that we are pretty sure are related to the periodic updates so they couldn't even look at the join issue
15:54:20 boris-42: but the join won't be fixed before Icehouse, it's too late
15:54:47 n0ano: exactly
15:55:24 Is there any bp about the 'killing periodic update' thing?
15:55:43 no, but I'd be willing to create one
15:55:50 llu-laptop I think that it is too late for such changes
15:55:55 imho
15:56:06 it is much worse than removing the join
15:56:12 because we are changing behavior
15:56:18 well, for havana it's too late but there's always icehouse
15:56:52 of course, I think it's too late to change the join for havana also
15:56:56 does nova-compute update state on provision/deprovision?
15:57:20 ogelbukh, yes (the DB) and it also does the periodic update
15:57:33 n0ano: thanks
15:57:59 n0ano: it shouldn't be a big change http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html
15:58:09 the periodic update is hardly used right now
15:58:16 #action n0ano to create BP to remove the periodic update
15:58:17 so, only service state basically gets lost if periodic updates are just dropped?
15:58:34 well
15:58:35 not lost
15:58:49 ogelbukh: service state? that is separate
15:59:06 ok
15:59:08 ogelbukh, we need to consider things like lost messages, new compute nodes, restarted compute nodes - the edge cases are always the problem.
15:59:42 n0ano doesn't the stats reporting introduce more if it is added?
15:59:56 n0ano: sure, just trying to get my head around this
15:59:58 jog0
15:59:59 hey guys, it's the top of the hour and I have to run, great meeting (we missed one topic but there's always next week)
16:00:02 jog0 One question
16:00:22 jog0 If we remove the DB, but keep the periodic_task. And it will work well on 10k or 30k nodes
16:00:28 PaulMurray, it's already there (we're doing a lot of extra work right now)
16:00:31 jog0 experimental hosts
16:00:47 jog0 will you change the priority of removing the periodic_task
16:00:49 ?)
16:00:58 good speaking to you all
16:01:03 sorry but I have to run
16:01:09 #endmeeting