13:00:05 #startmeeting senlin
13:00:06 Meeting started Tue Dec 29 13:00:05 2015 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:09 The meeting name has been set to 'senlin'
13:00:30 hi
13:00:36 hello
13:00:41 hi, Liuqing
13:00:41 o/
13:01:29 pls check the agenda and see if you have things to add
13:01:40 #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:01:58 hi
13:02:05 hi, zhangguoqing
13:02:14 let's get started
13:02:25 #topic mitaka work items
13:02:37 #link https://etherpad.openstack.org/p/senlin-mitaka-workitems
13:02:54 heat resource type support
13:03:17 Hi Qiming
13:03:23 seems blocked by the action progress check
13:03:32 yes
13:03:36 let's postpone that discussion
13:03:40 still don't know how to deal with it.
13:03:46 it is item 2 on the agenda
13:03:56 client unit test
13:04:18 need to do a coverage test to find out where we are
13:04:30 unittest part 3 merged?
13:04:41 no, I think
13:04:55 poor network, can't open the review page
13:04:59 any blockers?
13:05:21 gerrit's problem, so slow..
13:05:25 it just hasn't got enough reviews yet, I think
13:06:11 not just a gerrit problem, looks more like a problem caused by the gfw
13:06:42 oh, sorry
13:06:46 it has been merged
13:06:52 https://review.openstack.org/258416
13:07:01 my fault
13:07:23 great
13:07:40 still need to unify the client calls
13:07:55 especially the cluster action and node action ones
13:08:16 health policy
13:09:02 xinhui and I spent a whole afternoon together yesterday, we have a draft work plan for implementing the health manager and health policy
13:09:25 the first step would be about improving profiles so that do_check and do_recover are supported
13:09:38 great
13:09:40 then these operations will be exposed from the engine and rpc
13:10:06 which operations?
13:10:13 do_check and do_recover?
13:10:28 as actions that can be triggered by the end user?
13:10:29 the health manager can poll cluster nodes in the background, then invoke the do_recover() operation on a cluster if node anomalies are detected
13:10:39 yanyanhu, not yet
13:10:46 first step is to make them an internal RPC
13:10:52 ah, I see
13:11:07 make them available for internal requests
13:11:13 when we feel confident/comfortable, we can expose them as REST APIs
13:11:49 there are some details we need to deal with
13:12:13 but I think xinhui is on the right track now
13:12:26 about the health manager polling status and doing recovery actions, it sounds good
13:12:31 yep
13:12:38 we have to do that
13:12:57 or else the auto-scaling scenario won't be reliable
13:13:33 and we want that feature enabled without bothering users to do any complex config
13:14:01 there could be many knobs exposed for customization, but the first step would be about the basics
13:14:08 could the health policy cover the use case: instance HA?
13:14:20 Liuqing, yes
13:14:30 great
13:14:34 Qiming, totally makes sense
13:14:40 cool
13:14:43 great
13:14:58 by design, a profile (say nova server) will have a do_recover(**options) operation
13:15:34 the base profile will implement this as a create-after-delete logic
13:15:54 what does the logic mean ?
13:16:24 a specific profile will be able to override this behavior, it can probably do a better job in recovering a node
13:16:53 Liuqing, the default logic (implemented in the base profile class) will be: delete the node, then recreate it
13:17:14 Qiming: got it, thanks
13:17:16 that is the most generic solution, applicable to all profile types
13:17:46 a nova server profile will have more choices: reboot, rebuild, evacuate, ... recreate
13:18:16 these options can be exposed through a health policy
13:18:36 users can customize it as they see appropriate
13:18:57 question?
13:19:30 update operation for profile types
13:19:38 so for me, I could customize it for instance HA or others, right?
13:19:50 yes, Liuqing, absolutely
13:19:58 :-)
13:20:14 however, we will also want to add some fencing mechanisms in the future
13:20:28 without fencing, recovering won't be a complete solution
13:20:29 yes
13:20:42 now we use pacemaker for instance HA
13:20:46 let's do it step by step
13:21:08 using pacemaker is ... a choice out of no choice, I believe
13:21:15 instance HA is very important for enterprise production
13:21:42 yes, Qiming
13:21:44 yes, totally agreed
13:22:09 customers will always ask about the HA problems....
13:22:10 yanyanhu, any progress on the update operation for profile types?
13:22:35 Liuqing, they are not yet ready for real clouds
13:22:40 I'm working on adding function calls for server metadata
13:22:49 three new methods will be added to the sdk
13:22:57 hope to finish it in the coming week
13:23:07 most of the time, they use their private cloud as an advanced virtualization platform
13:23:11 and also the metadata update for the nova server profile
13:23:14 great, yanyanhu
13:23:34 maybe we should add update support for heat stacks?
13:23:54 Qiming, sure. I plan to do some investigation after the nova server related work is done
13:23:57 the interface is much easier compared to the nova server case
13:24:02 cool
13:24:05 yes, I think so :)
13:24:09 Receiver
13:24:17 We are done with it?
13:24:53 I pushed quite a few patches last weekend to close the loop
13:24:53 I think so
13:24:58 just need some tests
13:25:08 a simple test shows it is now working
13:25:11 thanks for your hard work :)
13:25:23 great, will try to add a functional test for it
13:25:40 hoho, added
13:25:54 :)
13:26:34 btw, our API doc is up to date: https://review.openstack.org/261627
13:26:39 already merged
13:27:03 lock breaker
13:27:18 https://review.openstack.org/262151
13:27:19 I submitted a patch to re-enable it
13:27:25 yes, reviewed
13:27:29 Not sure whether it's the right way or not
13:27:30 I disagree with the logic
13:27:38 posted a comment
13:27:42 Just saw your comment
13:28:02 it took me 20 minutes or so to post the comment, frustrating ... network really bad today
13:28:18 You intend to move it after the retries?
13:28:28 elynn, please consider moving the check out of the critical path
13:28:39 by 'critical path', I mean the retry logic
13:28:53 retries could be very common if you are locking a cluster
13:29:29 so we do it outside lock_acquire?
13:29:46 maybe we should even relax the number of retries before doing an engine-death check
13:30:00 or ...
13:30:18 we should move the lock breaker to engine startup
13:30:42 the engine-death check will be very quick if the engine is alive
13:30:44 but when we have multiple engines, doing lock checks during startup is not a good thing
13:30:59 elynn, no, the engine could be very busy
13:31:00 it only takes some time if the engine is dead.
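[Editor's sketch of the recovery flow described above: do_recover() in the base profile does the generic delete-then-recreate, and a nova server profile can override it with cheaper operations selected through the health policy. Only the do_recover name comes from the discussion; the class names, the 'operation' option and the stubbed helpers are illustrative assumptions, not the actual Senlin code.]

    class BaseProfile(object):
        def do_create(self, node):
            print('creating node %s' % node)

        def do_delete(self, node):
            print('deleting node %s' % node)

        def do_recover(self, node, **options):
            # Generic recovery implemented in the base profile:
            # delete the node, then recreate it.
            self.do_delete(node)
            return self.do_create(node)


    class ServerProfile(BaseProfile):
        def do_recover(self, node, **options):
            # A nova server profile can override the generic logic with
            # cheaper operations; the choice would be exposed through the
            # health policy (reboot, rebuild, evacuate, ..., recreate).
            operation = options.get('operation', 'recreate')
            if operation == 'reboot':
                print('rebooting server for node %s' % node)
                return
            if operation == 'rebuild':
                print('rebuilding server for node %s' % node)
                return
            # Fall back to the generic create-after-delete logic.
            return super(ServerProfile, self).do_recover(node, **options)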
13:31:21 I encountered this several times when I was fixing the concurrency problem
13:31:42 many times there was a warning that the engine was dead, but the engine was still running
13:32:24 and .... putting it before the retry logic has led to quite some mistakes
13:32:26 ok... I got your point...
13:32:38 Maybe we should add a taskrunner first
13:33:03 Like what heat does
13:33:08 adding a task runner won't help, AFAICT
13:33:39 it will introduce more concurrency problems
13:34:00 elynn, pls continue digging
13:34:14 haiwei is not online I guess
13:34:15 Yes, I will
13:34:21 let's skip the last itme
13:34:24 item
13:34:51 #topic checking progress of async operations/actions
13:35:04 this is blocking the heat resource type work
13:35:14 yes
13:35:19 just saw ethan's patch on the sdk side
13:35:20 because .... it is really a tricky thing to do
13:35:36 need to figure out a way to receive the correct action id
13:35:40 we have done our best to align our APIs with the guidelines from the api-wg
13:36:03 you have to understand the principles behind them before starting this work
13:36:40 in senlin, we have most of the create, update and delete operations return a 202
13:36:46 I think we are following the guidelines
13:37:01 and we are returning a 'location' in the header
13:37:20 most of the time, the location points you to an action
13:37:43 since we have action apis, we are not hiding this from users
13:38:10 yes, most of the time, except for cluster update...
13:38:14 one thing we still need to improve is the body returned
13:38:45 for DELETE requests, we cannot return a single byte in the body, as the HTTP protocol says
13:39:22 for UPDATE, we are returning the object in the body
13:39:35 for a delete request, I think checking until a not-found exception happens is ok
13:39:39 for cluster deletion, I can catch not_found in the heat resource.
13:40:00 The problem is UPDATE/RESIZE
13:40:01 we are also returning the pointer to the object in the header
13:40:11 update and resize are different
13:40:17 for RESIZE, we have a body containing the action.
13:40:25 UPDATE is itself a PATCH request
13:40:30 RESIZE is an action
13:40:42 these two operations follow different rules
13:40:57 How to check if an UPDATE is finished?
13:41:14 it depends on what you are updating, elynn
13:41:22 if you are updating the name of a cluster
13:41:30 you should just check the body
13:41:39 I mean the profile?
13:41:44 if you are updating a cluster's profile, ... you will need to check the profile
13:41:56 sorry, you will need to check the action
13:42:22 yes, I think so
13:42:34 if we are not returning the action in the header, that is a bug to fix
13:42:59 Do we?
13:43:16 you can check the api code, I cannot remember
13:43:17 Yes, we have. Just haven't found a way to expose it in the client
13:43:45 okay, then, the next step is to have the 'location' header used from the sdk/client side
13:43:51 yanyanhu: that would be the problem to solve.
13:43:57 yes
13:44:15 if we are checking the header from senlinclient, we are requiring the whole SDK to return a tuple to us
13:44:25 the client side can get the action id, can it be used?
13:44:27 or embed the header into the object
13:44:43 xuhaiwei, you are a ghost
13:44:54 My patch is to embed the header into the object.
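[Editor's sketch for the DELETE case mentioned above: a caller such as a heat resource polls the cluster until a not-found error signals that deletion finished. The helper name and the injected not_found_exc parameter are illustrative assumptions, not an existing senlinclient or SDK interface.]

    import time

    def wait_for_delete(get_cluster, cluster_id, not_found_exc,
                        timeout=600, interval=5):
        """Poll get_cluster(cluster_id) until it raises not_found_exc.

        get_cluster would be the client's cluster-show call; not_found_exc
        is whatever not-found exception that client raises.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                get_cluster(cluster_id)
            except not_found_exc:
                # The cluster can no longer be retrieved: deletion finished.
                return True
            time.sleep(interval)
        raise RuntimeError('cluster %s was not deleted within %s seconds'
                           % (cluster_id, timeout))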
13:44:58 sorry, didn't say hello just now
13:45:11 neither of the solutions above sounds elegant
13:45:13 I am on vacation from today
13:45:37 since we are advised to use the function call interface of the SDK, instead of the resource interface
13:45:52 I'm thinking maybe we should do some work in the _proxy methods
13:46:32 once the response is returned to senlinclient, we get no chance to check the header
13:46:35 Qiming: You mean directly return the response body?
13:46:44 Qiming, you mean put the api response information into the sdk?
13:47:05 the _proxy commands should know what they are doing
13:47:16 say cluster_create(**params)
13:47:44 when calling this method, we are expecting a 'location' header from the response
13:48:14 we can either squeeze it into the json before returning to senlinclient
13:48:27 or we can do a wait-for-complete inside the sdk
13:48:52 I opt for the 2nd approach
13:49:16 since some other wait-for-something-to-complete logic is already in the sdk
13:49:22 I'm not very sure how to implement your #2 option
13:49:43 I will have a try.
13:50:10 then we can add a keyword argument 'wait=True' to cluster_create(**params)
13:50:27 I also think option 2 is better
13:50:47 if 'wait' is specified to be true, then we check the action repeatedly in the cluster_create method
13:50:47 The self._create() returns a resource object, it doesn't contain any headers.
13:51:15 elynn, you have already added 'location' to resource.headers
13:51:59 we can check it there
13:52:06 hmm...I just don't know how to use headers :P
13:52:11 I will have a try.
13:52:27 'headers' was designed to be part of a request
13:52:38 now you are using it in a response
13:52:42 that is not good
13:52:58 next time you send a request, you may have to clean it
13:53:18 maybe adding a response_header property is a cleaner fix
13:53:56 I put it in the response just to find a way to expose it...
13:54:10 then we discuss with brian and terry, see if it is an acceptable 'hack'
13:54:23 Or we don't have a way to set the location in headers.
13:54:23 if it is not acceptable, we have to do it in a different way
13:54:39 Ok
13:54:47 e.g. make 'action' a field of the object returned to senlinclient, parse it and do the wait there
13:54:47 If we add a wait function.
13:55:03 heat code might be blocked by this wait function
13:55:16 I'm not sure if it's a good way to go
13:55:21 if you want to wait, you will have to wait
13:56:17 For now heat uses its taskrunner to schedule tasks
13:56:27 that is stupid
13:56:30 to be honest
13:56:44 there are proposals to remove them all
13:56:47 the wait function might block its taskrunner...
13:57:46 are we risking blocking their engine?
13:58:05 the taskrunner is so cheap
13:58:19 Yes, that might be
13:58:42 if we don't yield from wait
13:59:31 okay, let's spend some time reading some more resource type implementations
13:59:49 I don't see a way out
14:00:06 sorry guys, no time for open discussions today
14:00:15 let's continue on #senlin
14:00:19 #endmeeting
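[Editor's sketch of the 'wait=True' idea discussed above: a wait-for-complete wrapper around the SDK proxy call that polls the action pointed to by the 'location' header. The names used here (create_cluster, get_action, the 'location' attribute, the action status strings) are assumptions based on the conversation, not the SDK's verified interface.]

    import time

    def cluster_create(conn, wait=False, interval=2, timeout=600, **params):
        """Create a cluster and optionally wait for the creation action."""
        cluster = conn.cluster.create_cluster(**params)
        if not wait:
            return cluster

        # The 202 response carries a 'location' header pointing at the action;
        # here we assume it has been surfaced on the returned object somehow.
        location = getattr(cluster, 'location', '') or ''
        action_id = location.rsplit('/', 1)[-1]
        deadline = time.time() + timeout
        while time.time() < deadline:
            action = conn.cluster.get_action(action_id)
            if action.status == 'SUCCEEDED':
                return cluster
            if action.status in ('FAILED', 'CANCELLED'):
                raise RuntimeError('action %s ended with status %s'
                                   % (action_id, action.status))
            time.sleep(interval)
        raise RuntimeError('timed out waiting for action %s' % action_id)

[A heat resource would likely still avoid wait=True and check the action itself, to keep from blocking its taskrunner, which is the concern raised at the end of the meeting.]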