#openstack-meeting log

17:00:52 <sandywalsh> #startmeeting
17:00:53 <openstack> Meeting started Tue Nov  1 17:00:52 2011 UTC.  The chair is sandywalsh. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:12 <sandywalsh> #topic plead forgiveness
17:01:18 <sandywalsh> :)
17:01:29 <sandywalsh> #link http://wiki.openstack.org/Meetings/Orchestration
17:01:59 <maoy> Hello
17:02:06 <sandywalsh> o/
17:02:51 <sandywalsh> this will be a short meeting I think
17:03:07 <sandywalsh> so, I haven't had a lot of time to get any prep done
17:03:27 <sandywalsh> (specifically the video)
17:03:44 <maoy> that's fine by me, since I watched your talk..
17:03:51 <sandywalsh> but regardless there are two issues I think (and please jump in to correct me)
17:04:00 <sandywalsh> 1. the tactical issues
17:04:18 <sandywalsh> a. how to get events back to the orchestration layer from the services
17:04:31 <sandywalsh> b. where the orchestration service lives (in scheduler?)
17:04:49 <sandywalsh> and 2. what is the strategic approach to orchestration
17:05:02 <sandywalsh> a. a trivial state machine
17:05:10 <sandywalsh> b. a more complex state machine (petri)
17:05:30 <sandywalsh> c. another service (some pre-existing library)
17:05:33 <sandywalsh> d. other?
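For illustration only (not something shown in the meeting): option 2a, a "trivial state machine", could be as small as a transition table. The state names, event names, and the `advance` helper below are all hypothetical and are not taken from nova.

```python
from enum import Enum


class State(Enum):
    # Hypothetical lifecycle states for one provisioning workflow.
    SCHEDULING = "scheduling"
    BUILDING = "building"
    ACTIVE = "active"
    ERROR = "error"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.SCHEDULING, "host_selected"): State.BUILDING,
    (State.SCHEDULING, "no_valid_host"): State.ERROR,
    (State.BUILDING, "build_succeeded"): State.ACTIVE,
    # A failed build goes back to scheduling, i.e. a simple retry.
    (State.BUILDING, "build_failed"): State.SCHEDULING,
}


def advance(state, event):
    """Return the next state, or raise ValueError for an unknown transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}")
```

A Petri net (option 2b) would generalize this by allowing several concurrently marked places instead of one current state.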
17:05:42 <sandywalsh> maoy, I think this is where your paper comes in
17:06:02 <sandywalsh> (which I have to apologize for, I haven't read yet, but it's at the top of my stack)
17:06:29 <sandywalsh> In the link I have a list of what I think are tactical items
17:06:47 <maoy> great. i was about to say that the paper is quite relevant here.
17:06:47 <sandywalsh> which I think are applicable regardless of the strategic approach
17:07:25 <sandywalsh> good ... I'm keen to read it. I'll try to get some meaningful feedback on it by next meeting
17:07:40 <maoy> I think that the orchestration might make more sense to be below the scheduler.
17:08:23 <maoy> sandy, i'm also looking at petri net.
17:08:24 <sandywalsh> ok, so scheduler talks to orchestration and steps out of the way?
17:09:02 <sandywalsh> I sort of envisioned orchestration talking to scheduler, but you suggest the other way around?
17:10:03 <maoy> to me it depends on how we define what exactly the orchestration layer does
17:10:30 <sandywalsh> well, maoy if you can perhaps create a wiki page to summarize your idea (nothing fancy), we can comment on it there?
17:10:46 <maoy> sure
17:10:49 <maoy> i'll work on that
17:10:51 <sandywalsh> excellent
17:11:22 <maoy> i have a question on the petri net
17:11:22 <sandywalsh> we've started considering what it would take to do simple retry, so hopefully that will give us a little bit of the tactical stuff we need
17:11:26 <sandywalsh> sure
17:11:44 <vladimir3p> Hi all, sorry for being late, and you've probably already discussed this, but I think we need to divide it into several parts. One of them, returning status over AMQP, is related to the orchestrator but isn't really the orchestrator itself
17:12:09 <vladimir3p> to me it seems like the orchestrator should be the one who requests from scheduler what to do ...
17:12:10 <maoy> petrinet is a great way to model concurrent processes. i'm just curious after the modeling what could we do with it
17:12:33 <sandywalsh> vladimir3p, yes, I outlined one suggestion in the agenda: http://wiki.openstack.org/Meetings/Orchestration
17:12:51 <sandywalsh> maoy, can you give an example?
17:13:38 <sandywalsh> vladimir3p, I think Orch should ask of the scheduler, but maoy is going to propose an alternative approach.
17:14:11 <vladimir3p> ah, sorry. I was definitely late for this meeting :-)
17:14:28 <maoy> i'm completely unfamiliar with celery and the other tool you mentioned in the talk, but I am wondering what benefit we have with the modeling effort
17:14:33 <sandywalsh> vladimir3p, np
17:14:56 <sandywalsh> maoy, from the feedback I got, we don't want to use celery tasks.
17:15:18 <maoy> sandy and vlad, I think we might have a similar idea, but understand the terminology differently, esp. "orchestration"
17:15:31 <sandywalsh> quite likely
17:15:41 <vladimir3p> yep
17:15:52 <maoy> ok.
17:16:07 <sandywalsh> still, write up your suggestion and we'll make sure we're on the same page
17:16:23 <maoy> absolutely
17:16:39 <sandywalsh> #action maoy to write up his suggestions for how the orch service works with the scheduler (and other services)
17:17:27 <maoy> do we deal with high availability issues here?
17:17:35 <sandywalsh> maoy, my idea for using a petri net was simply to have a "better state machine". There were no other immediate plans from there. Just generic hooks to the outside world
17:17:39 <maoy> e.g. the orchestrator crashes.
17:17:51 <maoy> got it
17:18:11 <sandywalsh> maoy, that's a big issue ... we're running into that now with the scheduler. How do we synchronize state when there are many concurrent workers?
17:18:42 <sandywalsh> Master-Slave works great for these problems since there's only one decision maker. But it's a single point of failure
17:18:59 <maoy> in the paper, we use ZooKeeper, which provides a quorum-based, highly available storage and coordination service
17:19:25 <sandywalsh> Workers are great for scalability, but only when the tasks can be idempotent and can be done in parallel. Scheduling/State-management doesn't seem to be one of those problems.
17:19:42 <maoy> agreed.
17:19:51 <sandywalsh> #action sandy to learn about ZooKeeper
17:20:42 <vladimir3p> vlad
17:20:49 <vladimir3p> oops :-) sorry
17:20:51 <sandywalsh> ok ... I think those are two good starts. Ideally for next meeting we should be in some agreement how to tackle the concurrency problem.
17:21:17 <maoy> in the paper, we addressed 4 problems:
17:21:29 <sandywalsh> let's keep the discussion going on the mailing list. If zookeeper looks promising perhaps we work it into the tactical parts?
17:21:36 <sandywalsh> maoy, carry on ...
17:22:01 <maoy> concurrency, high availability, unexpected errors during worker execution, and imposing policies to prevent mis-operations
17:22:12 <maoy> we can probably ignore the 4th one
17:22:50 <sandywalsh> great ... that's the stuff we need to nail down.
17:22:51 <maoy> and see if the ideas in the others can be applied in nova in a non-disruptive way
17:23:03 <sandywalsh> #action give maoy some good feedback on his paper
17:23:19 <vladimir3p> a quick question - do you plan to apply the same principles of "operation" orchestration not only between the scheduler and compute/volume nodes, but also between the API nodes and the scheduler?
17:23:20 <sandywalsh> #link http://dl.dropbox.com/u/166877/CloudTransaction.pdf
17:23:49 <sandywalsh> vladimir3p, can you give an example?
17:24:09 <vladimir3p> when you create a bunch of instances the call goes to the scheduler
17:24:22 <vladimir3p> but if it was not accepted/received you probably want to retry it
17:24:33 <vladimir3p> especially if we have multiple schedulers
17:24:49 <vladimir3p> actually, it applies to any operation performed over AMQP
17:25:06 <maoy> is AMQP lossy?
17:25:17 <maoy> i'm not very familiar with it.. sorry
17:25:19 <vladimir3p> it may get stuck there
17:26:24 <maoy> this is undesirable..
17:26:30 <vladimir3p> I was thinking of the case where a particular scheduler accepted the request but crashed...
17:26:38 <vladimir3p> (as an example)
17:28:03 <maoy> seems like we can either retry and make the scheduling job idempotent, or fix amqp..
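For illustration only (not something shown in the meeting): the "retry and make the scheduling job idempotent" option could be sketched by tagging each request with an ID so that a re-sent request returns the earlier decision instead of scheduling twice. `Scheduler`, `call_with_retry`, and the host names are hypothetical, and the in-memory dict stands in for whatever shared store a real deployment would need.

```python
import uuid


class Scheduler:
    """Hypothetical scheduler that deduplicates requests by request ID."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        self._results = {}  # request_id -> host already chosen
        self.placements = 0  # number of real scheduling decisions made

    def schedule(self, request_id):
        # A repeated request gets the earlier answer: this is what makes
        # a retry safe.
        if request_id in self._results:
            return self._results[request_id]
        host = self._hosts[self.placements % len(self._hosts)]
        self.placements += 1
        self._results[request_id] = host
        return host


def call_with_retry(send, request_id, attempts=3):
    """Call send(request_id) up to `attempts` times (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return send(request_id)
        except TimeoutError as exc:  # e.g. the reply never came back
            last_error = exc
    raise last_error


scheduler = Scheduler(["host-a", "host-b"])
request_id = str(uuid.uuid4())
first = scheduler.schedule(request_id)
```

This does not cover the case vladimir3p raises where the scheduler crashes after accepting a request; that would need the request-to-result mapping to survive the crash.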
17:28:37 <sandywalsh> my assumption was the first step was to create the workflow and that would get picked up by orch layer and worked on from there.
17:29:02 <vladimir3p> ok, np
17:29:43 <sandywalsh> ok, well guys I think we have a good start here. Let's keep the discussion going on the ML once we review all the materials.
17:29:52 <maoy> great
17:29:54 <sandywalsh> cool?
17:30:02 <vladimir3p> fine
17:30:09 <maoy> i'll put up a wiki
17:30:17 <sandywalsh> excellent
17:30:21 <sandywalsh> ... thanks for your time guys
17:30:23 <maoy> my idea is still quite rough since I don't know nova that well
17:30:34 <vladimir3p> sandy, just to make sure - we could make the same error/reply logic "generic"
17:30:36 <maoy> but you guys will help me. :)
17:30:49 <vladimir3p> and try to apply it for API-sched communications
17:31:13 <vladimir3p> and it will be kind of an essential part of orch, but not really the orch
17:31:46 <sandywalsh> yes, makes sense
17:32:36 <sandywalsh> #endmeeting