19:59:09 #startmeeting OpenStack Orchestration Group
19:59:10 Meeting started Thu May 2 19:59:09 2013 UTC. The chair is harlowja. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:59:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:59:13 The meeting name has been set to 'openstack_orchestration_group'
20:00:23 howdy all :)
20:00:27 hello
20:00:38 hello
20:00:48 howday!
20:01:32 so i don't have much of an agenda, since this is the first time this meeting has been held in a long time
20:02:34 but i was thinking it might be useful to talk about some of the primitives we need for orchestration/workflow like stuff, that might be useful, i have ideas here and other ideas might be useful
20:02:57 hi
20:03:00 howdy
20:03:04 what kind of primitives?
20:03:21 i'm on another call. :(
20:03:46 #topic workflow primitives
20:04:20 so primitives in the fashion of things that are needed to accomplish workflows, like u could think of a task object as being one of those
20:04:52 a lock primitive could be another
20:05:03 and then workflow patterns that build on top of tasks could be another
20:05:27 a task log could be another one
20:06:08 the workflows that exist in nova could then be refactored to use said patterns (slowly but surely)
20:06:28 and then once refactored (or during refactoring) the workflows can be moved up to say conductor
20:06:36 i see
20:06:37 *for workflows that make sense*
20:06:58 does that make sense? thoughts?
20:07:35 the discussions happening right now on the ML are related to all of this, so thats great that we have said discussions
20:07:44 and thanks changbl your paper was very nice reading last night
20:07:54 no problem:)
20:08:09 there are some discussions about DB vs ZK on the mailing list, seems to me these primitives are targeted for both right?
20:08:30 if possible that might be nice
20:08:51 id think it should try to target both
20:08:59 not everyone will want zk as a hard req
20:09:07 i would almost say the task and workflows are not connected to the actual thing doing the workflow
20:09:09 although it might not be "that easy"
20:09:22 +1 harlowja
20:09:29 ditto
20:09:34 Yes, +1
20:09:37 the ML discussions are a little bit about the thing, which is also very important
20:09:59 but there are primitives and refactoring that could occur without knowing what the thing is (at this stage)
20:10:20 agreed. Like everything else, shouldn't that be pluggable?
20:10:44 heh, flag driven development
20:10:58 randallburt yes, i think so, its just connecting at the right layer of abstraction where both are possible, that may not be so bad
20:11:23 the right layer is always the tricky part :)
20:11:29 maybe even just target _no_ zk at first, get a rough pass done, and add the flair and fanfare in later?
20:11:30 +1
20:11:50 i was planning to respond to the mailing list on zk vs db
20:11:58 do we have any design or doc of which primitives we need?
20:12:08 but i guess i can say it here..
20:12:08 hub_cap idk, i think if there are enough interested parties that we could do both, having 2 impls makes sure the layer of abstraction actually makes sense
20:12:11 sorry, I'm late to the party. DB?
20:12:38 +1 to multiple (at least 2) initially depending on interest
20:12:43 changbl i don't have a wiki yet on those, although i can start one if thats wanted
20:12:53 okey harlowja... as long as 1 doesnt muck w/ the other in terms of design
20:12:56 i guess we are talking about a pluggable backend for a few primitives, one of which depends on zookeeper, the other one on database..
20:13:17 i want to use zk, dont get me wrong, i just dont want a hard dep on it...
20:13:23 hub_cap well thats where having 2 impls makes sure that u make it so they don't muck, with 1 u get sloppy and usually expose the underlying stuff
20:13:30 hub_cap: agreed
20:13:57 #action harlowja make primitive wiki
20:14:07 i actually think an in-memory backend is more appropriate.
20:14:11 harlowja: cant necessarily disagree there. +1 to making the wiki and outlining it well
20:14:20 not all backends need to have HA, persistence, and all that fancy stuff
20:14:28 hub_cap: I agree on the ZK dependency issue
20:14:33 maoy agreed, 3 impls makes the api abstraction even better :)
20:14:49 do I hear four?
20:14:52 lol
20:14:57 :)
20:15:14 i'm not a big fan of db backend since it makes upgrading harder perhaps
20:15:26 and more load on DB...
20:15:49 changbl: agree.
20:16:05 agreed, but if we do it right, maybe we can have each path work (although i think there will be certain fundamental primitives that won't be easily possible with a DB)
20:16:32 many cases where you need rollback don't really come from crashes. unexpected errors are more likely. even if we don't keep things persistent/HA, there is still quite some value.
20:16:33 well, in memory has issues as well around recovery and failover, but they are options that let someone devstack and watch it go for example
20:16:42 thats almost why id like to try it w/o zk at first harlowja heh
20:18:15 randallburt: the in-memory allows the implementor to take shortcuts around reliability. If we offer that, the implementation better be transactional.
20:18:15 so all seems good, i think there is agreed upon stuff that says we want primitives with some backends
20:18:29 *backends >= 2
20:18:36 lol
20:18:37 thats progress!
20:18:42 tru
20:18:48 and one of the backends doesn't depend on zk. :)
20:19:12 ya everyone will likely use the zk backend for real world, we know that ;)
20:19:13 maoy well both of them can't or thats just 1 zk backend :-P
20:19:43 so very cool, https://review.openstack.org/#/c/27869/ was a piece i am working on, if others want to add in also, thats cool
20:20:12 but i'll write up a twiki with some more of this, and we can get the best primitive set possible, just i think we have to avoid getting too complex (since then we will not accomplish much)
20:20:37 since we are in nova, i'd also discuss which functions we want to put under this transactional workflow management
20:20:44 harlowja: which _real_ example are u planning on tackling first (just curious, if its derailing, we can chat privately)
20:20:58 maoy hub_cap your questions i think are similar :)
20:21:09 haha ya seems so
20:21:10 harlowja, is your work directly in Nova or in reusable Oslo library code?
20:21:27 i'd consider sth like migration/resize
20:21:29 kebray i see no reason why it can't be put into oslo if we get the primitives right
20:21:36 *and i would hope it does
20:21:51 k.. You know why I'm asking. I think Heat will need this.
20:21:54 :)
20:22:11 harlowja: have you looked at doing this without zookeeper?
20:22:19 Not necessarily for Havana… but, as stack templates become more complex.
20:22:35 jaybuff i've thought about it, but not a large part of my brain power was directed to that
20:22:56 maoy hub_cap so lets talk about the functions to tackle, the interesting goodies there :-p
20:23:03 #topic functions to attack!
20:23:13 nice topic
20:23:30 :)
20:23:34 harlowja: we (me and mercer at Yahoo) did orchestration/workflow with gearbox (github.com/coryb/gearbox) and we didn't have any shared state
20:23:51 well, i guess that's not true, we had an sql database
20:23:55 jaybuff agreed, gearman/gearbox might be possible
20:24:09 maybe someday in the future, who knows :)
20:24:18 so back to the functions to attack question
20:24:27 sorry, didn't mean to derail
20:24:29 np
20:24:33 i vote the hardest most complex one first!
20:24:35 all cool jaybuff
20:24:46 so resize/migrate do fit into that path
20:24:55 *as they are the most back and forth ones
20:25:05 pre, post, .. stages
20:25:09 wasnt there talk about a big code refactor for those tho?
20:25:25 to make them more similar rather than super distinct
20:25:32 yes, i think john from IBM and tiagi i believe are doing some stuff there
20:26:10 i'd like to allow them to use the same primitives that we want to move other functions to if possible, which is why having a nice primitive base would be useful
20:26:18 *that would be my ideal*
20:26:20 ah ic
20:26:45 so that may be who's working on those paths, although i don't know many details
20:27:07 that might be hard for u to tackle since others are touching the code a good bit
20:27:07 i do though think that before we alter functional paths that we need to make something like https://wiki.openstack.org/wiki/TheBetterPathToLiveMigrationResizing
20:27:09 #link https://wiki.openstack.org/wiki/TheBetterPathToLiveMigrationResizing
20:27:14 thoughts?
20:28:00 harlowja agree
20:28:19 couldnt hurt ;)
20:28:27 i think we have to be pretty darn careful about knowing the current path before and after altering, else things may go crazy :)
20:28:52 so hub_cap in the prototype that the ntt folks are working on, they have adjusted the run_instance path
20:28:56 *as an example for the prototype*
20:29:06 that one is like the other big one behind resize/migrate
20:29:07 heh ya thats what our flawless, test 99% of the ecosystem, tests are for right harlowja?
20:29:19 ya id put my bets on that one for a first run
20:29:28 an instance failure to me, at startup, for a small bug
20:29:34 hub_cap not for me :(
20:29:36 is better than a resize failure causing downtime
20:29:41 boo
20:29:43 ha
20:29:58 why not start with something easy and more used in rl like spawn new instance?
20:30:25 for those that are interested
20:30:27 #link https://github.com/yahoo/NovaOrc/blob/master/nova/orc/manager.py#L216
20:30:47 i am working with those guys to get that code up, at least so its reviewable and all that
20:30:50 i cant help but think about wow when i see NovaOrc
20:31:00 it could be the first stab at the spawn new instance workflow
20:31:01 nsavin: i think it can be easily resolved with periodic checking/repairing
20:31:13 i think working on start-instance could wait for two reasons.
20:31:31 maoy too much change on a critical path?
20:31:49 one: it's a code path that has been used a lot and debugged a lot, and on a critical path.
20:32:11 two: if start instance fails, the user has the choice to start another one, unlike resize/migrate
20:32:39 is there a blueprint for https://wiki.openstack.org/wiki/TheBetterPathToLiveMigrationResizing? who's working on it?
20:33:00 maoy so dansmith has one, i think its called unified path to resizing/migration
20:33:04 see id think that since its been used/debugged a lot, there would be a good test harness around it.. making it easier to test :)
20:33:29 hub_cap i am sorta in the same mind set, if it has all those features, then that means we can be pretty sure that we either screwed it up or not
20:33:33 #link https://blueprints.launchpad.net/nova/+spec/unified-migrations
20:33:57 ya.. to me, especially after the unification of migrations, there could be painful, subtle bugs
20:34:14 and this might compound the problem... possibly?
20:34:28 maoy i understand your one, two, just thinking that if those code paths aren't being actively 'modified' that it might be a good time to attempt to rework at least pieces of it
20:34:33 hub_cap: isn't that the reason why we need txn orchestration to deal with unexpected bugs/problems?
20:34:52 sure maoy, but not to introduce them during the _first_ impl of it
20:35:19 harlowja hub_cap good pts
20:35:33 maoy i also think it helps reorganize the code into things like 'task objects' that are easier to understand, easier to review
20:35:57 but i can understand the 'intimidation' changing critical paths can cause
20:36:06 *which is why we have to tread carefully
20:36:42 either way, its likely gonna rip up whatever path you choose :)
20:36:58 although i'd like to get john/tiagi to work with us when they are doing resize/migrate so that we can share the same primitives (i hope)
20:36:59 im sure idempotency is not thought of in a 300+ line multi branch if stmt
20:37:09 harlowja: thats not a bad idea
20:37:09 hub_cap sadly u are right
20:37:11 the other problem i've been having is not being able to terminate/override previously running tasks that got stuck.
20:37:37 maoy so u mean cancellation or something different?
20:37:48 harlowja: on that line.
20:37:52 along that line
20:38:27 e.g. we could change the reboot API behavior to "cancel whatever you are running which likely is stuck and reboot it".
20:38:42 maoy +1
20:38:57 maoy as long as its not say stuck on 'run_instance' right?
20:39:08 if it's too hard for the first step, something like "cancel the ongoing reboot and retry reboot"
20:39:17 harlowja: of coz. :)
20:39:43 maoy i totally agree, i put some of this up on my wiki, but feel free to add more ideas
20:39:44 #link https://wiki.openstack.org/wiki/StructuredStateManagementDetails#Cancellation
20:39:59 maoy i think yours is almost more of preemption
20:40:09 which could get complex quickly :)
20:40:33 *maybe add a preemption section?
20:40:40 harlowja: ideally it's both preemption and cancellation i think.
20:41:06 sure, they are related
20:41:07 i think people aren't as scared if we mess with reboot logic
20:41:31 maoy that could be, do u think it would expose enough usefulness there?
20:41:31 if we rollback a run instance to delete a vm, but it turns out we shouldn't, then people could get scared..
20:41:38 heh
20:42:10 maoy agree
20:42:20 harlowja: are there additional agenda items?
20:42:25 so maoy that could be an approach, maybe we can have a list of functions that we think we can tackle, what some of the issues with them might be (as u said, rolling back a runinstance flow deleting things)
20:42:43 kebray not really, sorta ad-hoc agenda that i am making up as we go :)
20:42:43 this will turn into a big FSM quickly talking about canncelations (not of the pasta kind)
20:42:49 man i cant spell
20:43:00 should we try to tackle restarts before we do cancellations?
20:43:10 as in, task or service died, redo from X
20:43:27 harlowja: Ok.. wasn't sure if some of the brainstorming would be better moved to mailing list. I have an item to propose: Perhaps consider renaming this something other than Orchestration?
20:43:53 hub_cap: yes, if task is resumable
20:44:08 kebray sure sure, does that work with everyone, anyone want to volunteer to start documenting the different workflows we could alter, and possibly the benefits/drawbacks of altering said workflow
20:44:12 def nsavin
20:44:33 having said document would make it easy for others to start picking up those workflows as things to do
20:44:49 thoughts?
20:45:01 i volunteer harlowja
20:45:13 #action josh to start workflow path wiki :)
20:45:17 +1
20:45:24 +1
20:45:27 :)
20:45:28 harlowja: i volunteer to review
20:45:30 sweet!
20:45:32 plz send updates
20:45:37 hub_cap thanks
20:45:50 #topic the name of 'it'
20:46:06 so this is an interesting question
20:46:10 naming proposal: Task System Library, or Task Management library. Workflow is broadly used in industry as a Business Process Management term, coordinating task execution across large disparate systems. Orchestration is the purpose of Heat, which is different than Workflow and at a higher level than Task Management, which is something Heat could leverage to do Orchestration.
20:46:21 +1
20:46:22 harlowja: one good name already chosen unfort - "conductor" :D
20:46:30 there is conductor (right now with functionality for db-proxy), there is 'orchestrator', there is ...
20:46:44 kebray did u type that all right now, u so quick ;)
20:46:49 ya conductor should be renamed to "db-on-behalf-of"
20:46:58 harlowja: kebray types 800wpm :)
20:47:04 awesome!
20:47:11 80wpm actually ;-)
20:47:19 hehe. English vocabulary is not enough!
20:47:35 so i think there are 2 things that could be named
20:47:45 but I like the extra zero hub_cap gave me.
20:47:48 the base library (which forms in nova) - convection could be it?
20:47:53 We need to drop the terminology of Orchestration
20:47:59 but i don't want to take over the heat name
20:48:00 im good at being...err giving... 0s
20:48:12 adrian_otto so i started calling stuff state management engines
20:48:17 but thats not so sexy
20:48:20 yes, that's more accurate
20:48:46 hehe hoard
20:48:48 or task management engines..
20:49:06 conductor might be a useful name, as long as we can get some of the conductor guys to help us understand how this functionality might fit in there
20:49:09 managed by the zookeeper
20:49:25 TME for short, ha
20:49:36 that reminds me of another thing to discuss
20:49:37 tss! several backends for sure :)
20:49:43 :)
20:49:44 where does this TME run
20:49:47 I like having Task somewhere in the name.. it reflects what it actually will handle.
20:49:47 heh
20:49:54 ah maoy good question :)
20:50:06 maoy maybe that one can go on the ML?
20:50:16 the scope of execution needs at least 3 supported contexts:
20:50:23 sure. but doesn't hurt to talk a bit here.
20:50:29 maoy kk
20:50:31 how bout flow :)
20:50:36 1) Within OpenStack, such as executing within Heat
20:50:42 * i have visions of the insurance sales lady flow
20:50:46 HA
20:51:02 but flow might not be so bad
20:51:03 2) Within the task system in a limited control (like calling API services only)
20:51:30 3) In a container, such as within a specified VM that the caller is authorized to use/create
20:51:31 flow sounds like network 5 tuple stuff
20:51:53 taskflow ? (so with word "Task" as well :) )
20:52:02 taskflow is decent
20:52:04 nsavin: i like that.
20:52:11 taskflow++
20:52:12 so still tme?
20:52:14 adrian_otto sure, so there is what i would call the core primitive library that 1,2,3 would use, then there is the name of 2 (conductor, idk?)
20:52:15 :)
20:52:24 taskflow++
20:52:42 ok so i want to use taskflow in reddwarf
20:52:50 hub_cap too bad, haha
20:52:51 j/k
20:52:57 HA
20:53:08 we only have a few min left, but should we maybe talk about making it distinct from nova?
20:53:09 hub_cap: yes you do
20:53:30 and heat will use it for sure, and I hear from at least one Cinder dev that they will use it.
20:53:38 hub_cap i think that is where if we can get something into oslo that might help
20:53:45 taskflow directory in oslo?
20:53:57 id prefer to put it there to start if possible...
20:53:58 but i just worry that if we don't prove it first somewhere (nova?) that it might be a crappy library
20:54:03 then my team could help contribute to it easier
20:54:13 hub_cap agreed
20:54:17 well every library in oslo starts out crappy heh
20:54:22 lol
20:54:26 then it gets awesome
20:54:28 *lets keep the trend
20:54:29 ha
20:54:29 when lots of groups use it
20:54:42 or it totally rots
20:54:44 hub_cap agreed, i can get markmc to see what he thinks
20:54:44 imo, for the very first baby step, if we could find one function in nova (e.g. migration, or reboot, whatever), convert the procedure to taskflow with log and rollback, with an in mem backend, running locally on nova-compute, i'm happy
20:54:45 If there is enough mailing list support for it across-projects, I'm hopeful we can incubate straight in oslo.. how do we make that happen?
20:54:52 well the problem in the original change remains, so start in oslo and follow up with deps in nova and maybe heat
20:55:13 kebray: likely we talk to markmc is my guess
20:55:15 maoy that might be acceptable, except i know others really really want this to help elsewhere also
20:55:25 so much desire to do this
20:55:27 maoy +1
20:55:37 harlowja: it would be a _much_ easier sell to my team if it started in oslo
20:55:40 harlowja: i understand.. but it has to start from somewhere
20:55:55 why?
20:56:01 maoy so i think to kebray hub_cap if it could start in oslo then it could gain immediate usage in 1+ projects
20:56:07 adrian_otto: directed to me?
20:56:10 instead of just start in 1 and move to oslo
20:56:14 is there no precedent for starting a new library in Oslo?
20:56:29 i am not aware of one, but we are burning new ground here, so might as well
20:56:35 :)
20:56:37 not really, the libs there started _from_ stuff that was copy/pasta'd from > 1 project
20:56:38 I think that's where this belongs
20:57:10 i'd like to get as much usage as we can (it just gets better/more ripe with usage)
20:57:15 *not rotten, ripe!
20:57:16 lol
20:57:19 heh +1
20:57:23 I'm painfully aware that OpenStack does not have a clear starting point for things like this that are definitely general purpose utilities that multiple projects will consume
20:57:32 so can we start one?
20:57:47 Who is PTL for Oslo?
20:57:52 markmc i beleive
20:57:56 god i cant spell
20:57:57 i think its markmc (from redhat)
20:58:01 is he present?
20:58:06 nope hes not in the us
20:58:13 he's probably sleeping
20:58:14 he is offline most late aftn's
20:58:22 ok, let's reach out to him and request guidance.
20:58:29 def
20:58:32 #action adrian_otto reach out to markmc
20:58:36 is that fine? :)
20:58:42 q: what other projects are you guys talking about?
20:58:44 its official
20:58:46 heat?
20:58:50 maoy: reddwarf for one
20:58:50 is it presumptuous to assume that the rest of this group thinks this should start in Oslo?
20:59:11 resizes are complicated in reddwarf, they require nova + additional things to database configuration files, db restarts etc..
20:59:12 hub_cap: interesting. i haven't looked at it though.
20:59:12 heat, nova, possibly cinder iirc
20:59:23 having missed almost all the meeting, imbw but taskflow sounds potentially interesting to ironic as well
20:59:24 oh and reddwarf
20:59:26 Cinder + Heat + Nova + RedDwarf + EventScheduler (assuming that becomes something)
20:59:30 i am conflicted there a little, haha, i agree that it should be as general as possible, but if made too general, it might not be useful in any of the projects, but its hard to tell this early
20:59:44 if we have advocates from each project, i don't see a problem at all starting this in Oslo
20:59:46 hi devananda! are we stepping on your meeting?
20:59:47 so i will believe that we can do it, so i say might as well try :)
20:59:49 eg, do x, y, and z before provisioning this node
20:59:52 oh crap
20:59:57 hub_cap: nope! i'm joining yours :)
21:00:01 sweet
21:00:06 times up :)
21:00:16 lol
21:00:19 I will email markmc
21:00:26 so can we get the potential people to speak up also from different projects
21:00:28 good for u to end on time harlowja
21:00:31 i think that will as maoy said def help
21:00:36 will need to wrap this up
21:00:38 kk
21:00:42 #endmeeting
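
The primitives raised in the meeting (a task object, a task log, rollback on failure, and an in-memory backend as the "first baby step") could be sketched roughly as below. This is a minimal illustration under stated assumptions, not code from the meeting, from the linked review, or from any OpenStack project: `Task`, `InMemoryTaskLog`, `LinearFlow`, and the example tasks `Reserve` and `Boot` are all hypothetical names.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Hypothetical task primitive: one unit of work that can undo itself."""

    name = "task"

    @abstractmethod
    def apply(self, context):
        """Do the work, reading and writing shared state in `context`."""

    def revert(self, context):
        """Undo apply(); the default is a no-op."""


class InMemoryTaskLog:
    """Hypothetical in-memory backend: no HA, no persistence.

    A ZooKeeper- or database-backed log would expose the same record() call.
    """

    def __init__(self):
        self.entries = []

    def record(self, task_name, state):
        self.entries.append((task_name, state))


class LinearFlow:
    """Hypothetical workflow pattern: run tasks in order, roll back on error."""

    def __init__(self, tasks, log):
        self.tasks = list(tasks)
        self.log = log

    def run(self, context):
        done = []
        for task in self.tasks:
            self.log.record(task.name, "STARTED")
            try:
                task.apply(context)
            except Exception:
                self.log.record(task.name, "FAILED")
                # Revert only the tasks that completed, newest first.
                for finished in reversed(done):
                    finished.revert(context)
                    self.log.record(finished.name, "REVERTED")
                raise
            done.append(task)
            self.log.record(task.name, "DONE")


class Reserve(Task):
    name = "reserve"

    def apply(self, context):
        context["reserved"] = True

    def revert(self, context):
        context["reserved"] = False


class Boot(Task):
    name = "boot"

    def apply(self, context):
        # Simulates the "unexpected error" case rather than a crash.
        raise RuntimeError("boot failed")


def demo():
    log = InMemoryTaskLog()
    context = {}
    try:
        LinearFlow([Reserve(), Boot()], log).run(context)
    except RuntimeError:
        pass
    return context, log.entries
```

Because the flow only talks to the log through `record()`, a second backend could be swapped in without touching the tasks, which is the point made above about two implementations keeping the abstraction honest.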