13:00:05 <Qiming> #startmeeting senlin
13:00:06 <openstack> Meeting started Tue Dec 29 13:00:05 2015 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:09 <openstack> The meeting name has been set to 'senlin'
13:00:30 <Liuqing> hi
13:00:36 <yanyanhu> hello
13:00:41 <Qiming> hi, Liuqing
13:00:41 <elynn> o/
13:01:29 <Qiming> pls check agenda and see if you have things to add
13:01:40 <Qiming> #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:01:58 <zhangguoqing> hi
13:02:05 <Qiming> hi, zhangguoqing
13:02:14 <Qiming> let's get started
13:02:25 <Qiming> #topic mitaka work items
13:02:37 <Qiming> #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:02:54 <Qiming> heat resource type support
13:03:17 <elynn> Hi Qiming
13:03:23 <Qiming> seems blocked by action progress check
13:03:32 <elynn> yes
13:03:36 <Qiming> let's postpone that discussion
13:03:40 <elynn> still don't know how to deal with it.
13:03:46 <Qiming> it is item 2 on the agenda
13:03:56 <Qiming> client unit test
13:04:18 <Qiming> need to do a coverage test to find out where we are
13:04:30 <Qiming> unittest part 3 merged?
13:04:41 <yanyanhu> not yet, I think
13:04:55 <yanyanhu> poor network, can't open review page
13:04:59 <Qiming> any blockers?
13:05:21 <Liuqing> gerrit's problem, so slow..
13:05:25 <yanyanhu> it just hasn't got enough reviews yet, I think
13:06:11 <Qiming> not just a gerrit problem, looks more like a problem caused by the GFW
13:06:42 <yanyanhu> oh, sorry
13:06:46 <yanyanhu> it has been merged
13:06:52 <yanyanhu> https://review.openstack.org/258416
13:07:01 <yanyanhu> my fault
13:07:23 <Qiming> great
13:07:40 <Qiming> still need to unify the client calls
13:07:55 <Qiming> especially cluster action and node action ones
13:08:16 <Qiming> health policy
13:09:02 <Qiming> xinhui and I spent a whole afternoon together yesterday; we have a draft work plan for implementing the health manager and health policy
13:09:25 <Qiming> the first step would be about improving profiles so that do_check and do_recover are supported
13:09:38 <yanyanhu> great
13:09:40 <Qiming> then these operations will be exposed from engine and rpc
13:10:06 <yanyanhu> which operations?
13:10:13 <yanyanhu> do_check and do_recover?
13:10:28 <yanyanhu> as actions that can be triggered by the end user?
13:10:29 <Qiming> the health manager can poll cluster nodes in the background then invoke the do_recover() operation on a cluster if node anomalies are detected
13:10:39 <Qiming> yanyanhu, not yet
13:10:46 <Qiming> first step is to make them an internal RPC
13:10:52 <yanyanhu> ah, I see
13:11:07 <yanyanhu> make them available for internal requests
13:11:13 <Qiming> when we feel confident/comfortable, we can expose them as REST APIs
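(A rough sketch of the polling idea discussed above; names such as HealthManager, cluster_check and cluster_recover are illustrative assumptions, not the actual Senlin implementation.)

    import time

    class HealthManager(object):
        """Background loop that polls a cluster and triggers recovery."""

        def __init__(self, rpc_client, interval=60):
            self.rpc = rpc_client      # internal RPC client to the engine
            self.interval = interval

        def run(self, cluster_id):
            while True:
                # do_check / do_recover are reached through internal RPC
                # only for now; REST exposure may come later.
                unhealthy_nodes = self.rpc.cluster_check(cluster_id)
                if unhealthy_nodes:
                    self.rpc.cluster_recover(cluster_id, nodes=unhealthy_nodes)
                time.sleep(self.interval)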
13:11:49 <Qiming> there are some details we need to deal with
13:12:13 <Qiming> but I think xinhui is on the right track now
13:12:26 <yanyanhu> about health manager polling status and doing recovery action, it sounds good
13:12:31 <yanyanhu> yep
13:12:38 <Qiming> we have to do that
13:12:57 <Qiming> or else the auto-scaling scenario won't be reliable
13:13:33 <Qiming> and we want that feature enabled without requiring users to do any complex config
13:14:01 <Qiming> there could be many knobs exposed for customization, but the first step would be about the basics
13:14:08 <Liuqing> could the health policy cover the use case: instance HA?
13:14:20 <Qiming> Liuqing, yes
13:14:30 <Liuqing> great
13:14:34 <yanyanhu> Qiming, totally make sense
13:14:40 <Liuqing> cool
13:14:43 <zhangguoqing> great
13:14:58 <Qiming> by design, a profile (say nova server) will have a do_recover(**options) operation
13:15:34 <Qiming> the base profile will implement this as a create-after-delete logic
13:15:54 <Liuqing> what does that logic mean?
13:16:24 <Qiming> a specific profile will be able to override this behavior; it can probably do a better job of recovering a node
13:16:53 <Qiming> Liuqing, the default logic (implemented in base profile class) will be: delete the node, then recreate it
13:17:14 <Liuqing> Qiming: got it , thanks
13:17:16 <Qiming> that is the most generic solution, applicable to all profile types
13:17:46 <Qiming> a nova server profile will have more choices: reboot, rebuild, evacuate, ... recreate
13:18:16 <Qiming> these options can be exposed through a health policy
13:18:36 <Qiming> users can customize it as they see appropriate
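(A minimal sketch of the recovery design described above; the class layout, the 'operation' option name and helpers like handle_reboot are illustrative assumptions, not the actual Senlin code.)

    class Profile(object):
        """Base profile: generic recovery is delete-then-recreate."""

        def do_recover(self, obj, **options):
            # The most generic recovery, applicable to every profile type.
            self.do_delete(obj)
            return self.do_create(obj)

    class ServerProfile(Profile):
        """A nova server profile can override with cheaper operations."""

        def do_recover(self, obj, **options):
            # 'operation' would come from the health policy the user attached.
            op = options.get('operation', 'RECREATE')
            if op == 'REBOOT':
                return self.handle_reboot(obj)
            if op == 'REBUILD':
                return self.handle_rebuild(obj)
            # Fall back to the generic create-after-delete behaviour.
            return super(ServerProfile, self).do_recover(obj, **options)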
13:18:57 <Qiming> question?
13:19:30 <Qiming> update operation for profile types
13:19:38 <Liuqing> so I could customize it for instance HA or other cases, right?
13:19:50 <Qiming> yes, Liuqing, absolutely
13:19:58 <Liuqing> :-)
13:20:14 <Qiming> however, we will also want to add some fencing mechanisms in future
13:20:28 <Qiming> without fencing, recovery won't be a complete solution
13:20:29 <Liuqing> yes
13:20:42 <Liuqing> now we use pacemaker for instance HA
13:20:46 <Qiming> let's do it step by step
13:21:08 <Qiming> using pacemaker is ... a choice made for lack of alternatives, I believe
13:21:15 <Liuqing> instance HA is very important for enterprise production
13:21:42 <Liuqing> yes, Qiming
13:21:44 <Qiming> yes, totally agreed
13:22:09 <Liuqing> customers always ask about the HA problems....
13:22:10 <Qiming> yanyanhu, any progress on update operation for profile types?
13:22:35 <Qiming> Liuqing, they are not yet ready for real clouds
13:22:40 <yanyanhu> I'm working on adding function calls for server metadata
13:22:49 <yanyanhu> three new methods will be added to sdk
13:22:57 <yanyanhu> hope I can finish it in the coming week
13:23:07 <Qiming> most of the time, they use their private cloud as an advanced virtualization platform
13:23:11 <yanyanhu> and also the metadata update for nova server profile
13:23:14 <Qiming> great, yanyanhu
13:23:34 <Qiming> maybe we should add update support for heat stacks?
13:23:54 <yanyanhu> Qiming, sure. I plan to do some investigation after nova server related work is done
13:23:57 <Qiming> the interface is much easier compared to the nova server case
13:24:02 <Qiming> cool
13:24:05 <yanyanhu> yes, I think so :)
13:24:09 <Qiming> Receiver
13:24:17 <Qiming> We are done with it?
13:24:53 <Qiming> I pushed quite a few patches last weekend to close the loop
13:24:53 <yanyanhu> I think so
13:24:58 <yanyanhu> just need some tests
13:25:08 <Qiming> a simple test shows it is now working
13:25:11 <yanyanhu> thanks for your hard work :)
13:25:23 <yanyanhu> great, will try to add functional test for it
13:25:40 <Qiming> hoho, added
13:25:54 <yanyanhu> :)
13:26:34 <Qiming> btw, our API doc is up to date: https://review.openstack.org/261627
13:26:39 <Qiming> already merged
13:27:03 <Qiming> lock breaker
13:27:18 <Qiming> https://review.openstack.org/262151
13:27:19 <elynn> I submitted a patch to re-enable it
13:27:25 <Qiming> yes, reviewed
13:27:29 <elynn> Not sure it's the right way or not
13:27:30 <Qiming> I disagree with the logic
13:27:38 <Qiming> posted comment
13:27:42 <elynn> Just saw your comment
13:28:02 <Qiming> it took me 20 minutes or so to post the comment, frustrating ... network really bad today
13:28:18 <elynn> You intend to move it after retries?
13:28:28 <Qiming> elynn, please consider moving the check out of the critical path
13:28:39 <Qiming> by 'critical path', I mean the retry logic
13:28:53 <Qiming> retry could be very common if you are locking a cluster
13:29:29 <elynn> so we do it outside lock_acquire?
13:29:46 <Qiming> maybe we should even relax the number of retries before doing an engine-death check
13:30:00 <Qiming> or ...
13:30:18 <Qiming> we should move the lock breaker to engine startup
13:30:42 <elynn> engine-death check will be very quick if engine is alive
13:30:44 <Qiming> but when we have multiple engines, doing lock checks during startup isn't a good idea
13:30:59 <Qiming> elynn, no, engine could be very busy
13:31:00 <elynn> it only takes time if the engine is dead.
13:31:21 <Qiming> I encountered this several times when I was fixing the concurrency problem
13:31:42 <Qiming> many times I saw the engine-dead warning, but the engine was still running
13:32:24 <Qiming> and .... putting it before the retry logic has led to quite a few mistakes
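(A sketch of the suggestion above: retry first, and only do the engine-death check once retries are exhausted; helpers such as try_acquire, lock_holder_engine, is_engine_dead and steal_lock are placeholders, not the real Senlin lock code.)

    import random
    import time

    def cluster_lock_acquire(cluster_id, action_id, retries=3):
        # Hot path: plain retries, no engine-death check, since contention
        # on a cluster lock is expected and the holder is usually alive.
        for _ in range(retries):
            if try_acquire(cluster_id, action_id):
                return True
            time.sleep(random.uniform(0.1, 1.0))

        # Cold path: retries exhausted; only now check whether the engine
        # holding the lock is dead, and break the lock if so.
        holder = lock_holder_engine(cluster_id)
        if holder and is_engine_dead(holder):
            return steal_lock(cluster_id, action_id)
        return False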
13:32:26 <elynn> ok... I got your point...
13:32:38 <elynn> Maybe we should add a taskrunner first
13:33:03 <elynn> Like what heat does
13:33:08 <Qiming> adding a task runner won't help, AFAICT
13:33:39 <Qiming> it will introduce more concurrency problems
13:34:00 <Qiming> elynn, pls continue digging
13:34:14 <Qiming> haiwei is not online I guess
13:34:15 <elynn> Yes, I will
13:34:21 <Qiming> let's skip the last item
13:34:51 <Qiming> #topic checking progress of async operations/actions
13:35:04 <Qiming> this is blocking heat resource type work
13:35:14 <elynn> yes
13:35:19 <yanyanhu> just saw ethan's patch on the sdk side
13:35:20 <Qiming> because .... it is really a tricky thing to do
13:35:36 <elynn> need to figure out a way to receive correct action id
13:35:40 <Qiming> we have done our best to align our APIs to the guidelines from api-wg
13:36:03 <Qiming> you have to understand the principles behind it before starting this work
13:36:40 <Qiming> in senlin, most create, update and delete operations return a 202
13:36:46 <elynn> I think we are following the guidelines
13:37:01 <Qiming> and we are returning a 'location' in the header
13:37:20 <Qiming> most of the time, the location points you to an action
13:37:43 <Qiming> since we have action apis, we are not hiding this from users
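(To illustrate the pattern being described, a rough client-side sketch using the requests library; the endpoint path, header names and JSON fields are assumptions for illustration, not the exact Senlin API, and api, body and auth_headers are assumed to be defined elsewhere.)

    import time
    import requests

    resp = requests.post(api + '/v1/clusters', json=body, headers=auth_headers)
    assert resp.status_code == 202            # accepted, work happens asynchronously
    action_url = resp.headers['Location']     # usually points at an action

    # Poll the action until it reaches a terminal status.
    while True:
        action = requests.get(action_url, headers=auth_headers).json()['action']
        if action['status'] in ('SUCCEEDED', 'FAILED'):
            break
        time.sleep(2)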
13:38:10 <elynn> yes, most of the time except for cluster update...
13:38:14 <Qiming> one thing we still need to improve is the body returned
13:38:45 <Qiming> for DELETE requests, we cannot return a single byte in the body, as the HTTP protocol says
13:39:22 <Qiming> for UPDATE, we are returning the object in the body
13:39:35 <yanyanhu> for delete requests, I think checking until a NotFound exception happens is ok
13:39:39 <elynn> for cluster deletion, I can catch not_found in heat resource.
13:40:00 <elynn> The problem is UPDATE/RESIZE
13:40:01 <Qiming> we are also returning the pointer to the object in the header
13:40:11 <Qiming> update and resize are different
13:40:17 <elynn> for RESIZE, we have a body contain action.
13:40:25 <Qiming> UPDATE is itself a PATCH request
13:40:30 <Qiming> RESIZE is an action
13:40:42 <Qiming> these two operations are following different rules
13:40:57 <elynn> How do we check if an UPDATE is finished?
13:41:14 <Qiming> it depends on what you are updating, elynn
13:41:22 <Qiming> if you are updating the name of a cluster
13:41:30 <Qiming> you should just check the body
13:41:39 <elynn> I mean profile?
13:41:44 <Qiming> if you are updating a cluster's profile, ... you will need to check the profile
13:41:56 <Qiming> sorry, you will need to check the action
13:42:22 <elynn> yes, I think so
13:42:34 <Qiming> if we are not returning action in header, that is a bug to fix
13:42:59 <elynn> Do we?
13:43:16 <Qiming> you can check the api code, cannot remember
13:43:17 <yanyanhu> Yes, we do. We just haven't found a way to expose it in the client
13:43:45 <Qiming> okay, then, next step is to have the 'location' header used from sdk/client side
13:43:51 <elynn> yanyanhu: that would be the problem to solve.
13:43:57 <yanyanhu> yes
13:44:15 <Qiming> if we are checking the header from senlinclient, we are requiring the whole SDK to return a tuple to us
13:44:25 <xuhaiwei> the client side can get the action id, can it be used?
13:44:27 <Qiming> or embed the header into the object
13:44:43 <Qiming> xuhaiwei, you are a ghost
13:44:54 <elynn> My patch embeds the header into the object.
13:44:58 <xuhaiwei> sorry, didn't say hello just now
13:45:11 <Qiming> neither of the solutions above sounds elegant
13:45:13 <xuhaiwei> I am on vacation from today
13:45:37 <Qiming> since we are advised to use the function-call interface from the SDK, instead of the resource interface
13:45:52 <Qiming> I'm thinking maybe we should do some work in the _proxy methods
13:46:32 <Qiming> once the response is returned to senlinclient, we get no chance to check the header
13:46:35 <elynn> Qiming: You mean directly return response body?
13:46:44 <xuhaiwei> Qiming, you mean put the api response information into sdk?
13:47:05 <Qiming> the _proxy commands should know what they are doing
13:47:16 <Qiming> say cluster_create(**params)
13:47:44 <Qiming> when calling this method, we are expecting a 'location' header from the response
13:48:14 <Qiming> we can either squeeze it into the json before returning to senlinclient
13:48:27 <Qiming> or we can do a wait-for-complete inside sdk
13:48:52 <Qiming> I'd opt for the 2nd approach
13:49:16 <Qiming> since some other wait-for-something-to-complete logic is already in sdk
13:49:22 <elynn> I'm not very sure how to implement your #2 option
13:49:43 <elynn> I will have a try.
13:50:10 <Qiming> then we can add a keyword argument 'wait=True' to the cluster_create(**params)
13:50:27 <yanyanhu> I also think the option2 is better
13:50:47 <Qiming> if 'wait' is specified to be true, then we check the action repeatedly in the cluster_create method
13:50:47 <elynn> The self._create() call returns a resource object; it doesn't contain any headers.
13:51:15 <Qiming> elynn, you have already added 'location' to the resource.headers
13:51:59 <Qiming> we can check it there
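(A minimal sketch of option 2, waiting inside the SDK proxy method; helper names like create_cluster_resource and get_action, and the way the action id is parsed from the 'location' value, are assumptions, not real openstacksdk calls.)

    import time

    def cluster_create(conn, wait=False, poll_interval=2, **params):
        cluster = create_cluster_resource(conn, **params)   # POST, returns 202
        if wait:
            # Assume the response's 'location' header was attached to the
            # returned object and points at the triggering action.
            action_id = cluster.location.rsplit('/', 1)[-1]
            action = get_action(conn, action_id)
            # Poll until the action reaches a terminal status.
            while action.status not in ('SUCCEEDED', 'FAILED'):
                time.sleep(poll_interval)
                action = get_action(conn, action_id)
        return cluster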
13:52:06 <elynn> hmm...I just don't know how to use headers :P
13:52:11 <elynn> I will have a try.
13:52:27 <Qiming> 'headers' was designed to be part of a request
13:52:38 <Qiming> now you are using it in response
13:52:42 <Qiming> that is not good
13:52:58 <Qiming> next time you send a request, you may have to clean it
13:53:18 <Qiming> maybe adding a response_header property is a cleaner fix
13:53:56 <elynn> I put it in response just to find a way to expose it...
13:54:10 <Qiming> then we can discuss with brian and terry and see if it is an acceptable 'hack'
13:54:23 <elynn> Otherwise we don't have a way to set the location in the headers.
13:54:23 <Qiming> if it is not acceptable, we have to do it in a different way
13:54:39 <elynn> Ok
13:54:47 <Qiming> e.g. make 'action' a field of the object returned to senlinclient, parse it and do the wait there
13:54:47 <elynn> If we add a wait function.
13:55:03 <elynn> heat code might be blocked by this wait function
13:55:16 <elynn> I'm not sure if it's good way to go
13:55:21 <Qiming> if you want to wait, you will have to wait
13:56:17 <elynn> For now heat uses its taskrunner to schedule tasks
13:56:27 <Qiming> that is stupid
13:56:30 <Qiming> to be honest
13:56:44 <Qiming> there are proposals to remove them all
13:56:47 <elynn> the wait function might block its taskrunner...
13:57:46 <Qiming> are we risking blocking their engine?
13:58:05 <Qiming> taskrunner is so cheap
13:58:19 <elynn> Yes, that might be
13:58:42 <elynn> if we don't yield from wait
13:59:31 <Qiming> okay, let's spend some time reading some more resource type implementations
13:59:49 <Qiming> I don't see a way out
14:00:06 <Qiming> sorry guys, no time for open discussions today
14:00:15 <Qiming> let's continue on #senlin
14:00:19 <Qiming> #endmeeting