13:01:14 #startmeeting senlin
13:01:15 Meeting started Tue Oct 20 13:01:14 2015 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:01:20 The meeting name has been set to 'senlin'
13:01:27 hello guys
13:01:30 hi
13:01:37 hello
13:01:41 #link https://wiki.openstack.org/wiki/Meetings/SenlinAgenda#Weekly_Senlin_.28Clustering.29_meeting
13:01:51 I guess haiwei is on his way back to his hometown
13:01:52 please revise the agenda if you have things to add
13:02:00 yes, heard that
13:02:05 feel sorry for him
13:02:20 yes
13:02:46 don't know if lixinhui will join today
13:02:58 anyway, let's get started
13:03:03 ok
13:03:15 #topic health policy and its relationship to scaling policy
13:03:51 so lixinhui talked to me this afternoon, she is looking into that policy
13:03:57 hi, elynn
13:04:14 great
13:04:15 Hi
13:04:31 it is a requirement from some production teams as well
13:04:58 I have got quite a few complaints that ResourceGroup cannot handle resource failures
13:05:16 yes, I think their thinking about 'desired_capacity' and 'current size' is understandable
13:05:25 we had a demo last year at the Paris summit, showing how VM HA can be done
13:06:08 the prototype was done using Heat, but the solution doesn't fall into Heat's scope
13:06:59 so we have a sample policy here: http://git.openstack.org/cgit/openstack/senlin/tree/examples/policies/health_policy_poll.yaml
13:07:16 and WIP code here: http://git.openstack.org/cgit/openstack/senlin/tree/senlin/policies/health_policy.py
13:08:04 my colleague helped write that code but didn't finish it before switching to other jobs
13:08:15 glad lixinhui is picking this up
13:08:27 yes
13:08:42 this work item has been there for about half a year, I think :)
13:08:45 definitely, she will need help
13:09:01 yes
13:09:05 one thing is about how to trigger recovery actions
13:10:05 yes, I think how to perform the health check for each node is the key issue we need to figure out
13:10:08 maybe we need to define a do_recover() method on the base Profile class, so an implementation can decide how to recover itself
13:10:32 I found that a 'periodic_task' decorator has been added
13:10:42 agree
13:10:49 well, failure detection is not a great challenge
13:11:16 we have at least three options: 1) poll the node status periodically for clusters with a health policy attached
13:11:37 2) poll the LB health monitor if a cluster has an lb policy attached
13:11:52 3) listen to lifecycle events on the message queue
13:12:28 the first, very naive, very generic solution is to have the health_manager call profile.do_check()
13:12:43 so we can know whether the object (stack, server) is still active
13:12:59 that could serve as the baseline
13:13:57 I am proposing it as the "baseline" because it introduces no dependencies on other services
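A minimal sketch of the "baseline" approach discussed above: do_check()/do_recover() hooks on the base Profile class plus a naive polling loop in the health manager. Only do_check()/do_recover() come from the discussion; the class layout, the cluster/node attributes and the default recovery logic are illustrative assumptions, not actual Senlin code.

    # Illustrative sketch only -- not actual Senlin code.
    import time


    class Profile(object):
        """Hypothetical base profile with health hooks."""

        def do_create(self, obj):
            raise NotImplementedError

        def do_delete(self, obj):
            raise NotImplementedError

        def do_check(self, obj):
            # A concrete profile (e.g. a nova server or heat stack profile)
            # would query the backend service and return True only when the
            # physical object is still ACTIVE.
            raise NotImplementedError

        def do_recover(self, obj, **options):
            # Naive default recovery: recreate the object. Concrete profiles
            # can override this with a cheaper operation (reboot, rebuild...).
            self.do_delete(obj)
            return self.do_create(obj)


    def poll_cluster_health(cluster, interval=60):
        """Naive loop a health manager thread might run for one cluster."""
        while cluster.health_policy_enabled:
            for node in cluster.nodes:
                if not node.profile.do_check(node):
                    # Node looks unhealthy; let the profile decide how to
                    # recover it.
                    node.profile.do_recover(node)
            time.sleep(interval)

The appeal of this shape is exactly what is said above: it needs nothing but the profile's own backend client, so it can ship first and the LB-monitor and event-listening variants can be layered on later.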
13:14:43 yes, these three choices should be available
13:14:43 yes
13:15:18 but I guess we may still need to rely on some APIs provided by other services to help decide the status of a node
13:15:24 we can figure out better/advanced solutions/strategies when we gain some experience from this naive version
13:15:54 I believe most objects exposed have a 'status' property
13:17:26 we may need to figure out how to handle those objects which do not support status checks
13:17:56 or sometimes, the status may not be able to reflect the real availability of a node
13:17:56 but ... if they don't have a status for checking, we cannot be sure we can "recover" it, right?
13:18:09 yanyanhu, that is very true
13:18:15 yeah, this is the problem...
13:18:43 so one of the key design points is to have a profile implement a do_recover() method
13:18:58 and somehow make that a built-in action to support
13:19:03 yes
13:19:04 then expose it with a webhook
13:19:19 we throw the question back to users
13:19:38 that sounds doable as a first attempt
13:20:17 I think we can start from the simplest case
13:20:22 I'll talk to lixinhui about this
13:20:39 great
13:20:39 yes, a simple case can help us learn a lot
13:20:54 another question asked is
13:21:14 how do we deal with CLUSTER_DEL_NODES/CLUSTER_SCALE_IN actions
13:21:31 if the actions are not changing the desired_capacity
13:22:24 I think that depends on our definition of cluster_del_nodes and cluster_scale_in
13:22:43 maybe a better question to ask is how we deal with the "conflict" between the health policy and the scaling policy
13:22:59 maybe we just define them as actions which will reduce the desired_capacity of the cluster :)
13:23:12 oh
13:23:41 previously, when I was studying the Amazon design
13:23:59 somewhere they have an option to temporarily disable the health policy
13:24:22 I believe that is a feature proposed out of hard lessons learned
13:24:40 that means the health policy is the most powerful one once it is enabled :)
13:24:46 yes
13:24:51 I think this is reasonable
13:25:14 since availability should have higher priority than performance
13:25:25 or cost
13:25:25 temporarily disabling the health policy was considered when we designed the cluster_policy table
13:25:41 the 'enabled' field in that table is for this very purpose
13:25:52 yes
13:26:04 we don't know what operators want to do
13:26:25 they may have a good reason to do this, for manual scaling or something
13:26:46 so .. refer to this code: https://github.com/openstack/senlin/blob/master/senlin/engine/actions/cluster_action.py#L672-L699
13:27:20 it is the logic of the CLUSTER_UPDATE_POLICY action
13:27:40 if we move this code to be a method of a cluster
13:27:50 it can be shared
13:28:16 agree
13:29:08 in policy checks, we can always temporarily disable the health policy (if any) and re-enable it, in the pre_op and post_op calls
13:30:01 a byproduct of this is that the action code gets greatly simplified
13:30:58 hmm, that's true
13:31:03 let's think about it
13:31:12 ok
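A rough sketch of the disable/re-enable idea above. It assumes the CLUSTER_UPDATE_POLICY logic has been moved onto the cluster as an update_policy() method, as proposed; the cluster.policies binding list, the type name string and the policy class shape are illustrative, not real Senlin APIs.

    # Illustrative sketch of temporarily disabling the health policy around
    # a scaling action, using the 'enabled' field on the cluster-policy
    # binding mentioned in the discussion.
    class ScalingPolicy(object):

        HEALTH_POLICY_TYPE = 'senlin.policy.health'  # assumed type name

        def pre_op(self, cluster, action):
            # Before a scale-in/del-nodes action runs, switch off the health
            # policy binding so the nodes we remove on purpose do not get
            # "recovered" behind our back.
            for binding in cluster.policies:
                if binding.type == self.HEALTH_POLICY_TYPE:
                    cluster.update_policy(binding.id, enabled=False)

        def post_op(self, cluster, action):
            # Once the scaling action completes, turn health checking back on.
            for binding in cluster.policies:
                if binding.type == self.HEALTH_POLICY_TYPE:
                    cluster.update_policy(binding.id, enabled=True)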
13:31:32 #topic heat resource type support
13:32:17 oh, elynn dropped
13:32:27 two links on the agenda
13:32:28 let's wait for him a while
13:32:43 I think the current spec looks good
13:32:55 first is the spec proposal in heat
13:33:03 yes, already +2'ed
13:33:06 the only possible question is about the spec property of the profile
13:33:16 however, we need to think beyond that
13:33:24 yes?
13:33:30 since it is really different from the current design around templates
13:33:54 yes, I will try my best to convince the heat team that we are doing the right thing
13:33:57 it could take a while for other people to understand it
13:34:12 I think heat cores can grasp the idea quickly
13:34:31 sure, what we need to do is describe it clearly :)
13:34:40 it's a lesson learned when the convergence proposal was discussed
13:34:40 and correctly
13:34:56 yes
13:35:19 basically, you have 'template', 'environment', 'parameters' and 'files' together to determine the 'desired_state'
13:35:59 all these four inputs should be consolidated into a single "spec" for defining a stack
13:36:40 yeah, this is what the spec is for
13:37:21 or else we cannot know what a stack should look like
13:37:46 Another possible issue is about the senlinclient plugin for heat
13:38:05 yes, senlinclient has some big issues
13:38:21 do we need to further refactor the implementation of python-senlinclient to make this work easier?
13:38:22 one thing is about the interfacing with openstacksdk
13:38:51 yes, that is something we need to do
13:39:00 especially for profile create
13:39:14 there is some preprocessing logic there
13:39:22 for parsing the get_file function
13:39:27 yes, I also think so
13:39:55 when invoked from other services (including senlin-dashboard), that logic is skipped
13:40:40 yes, and it is also needed by heat resource properties.
13:41:07 Just lost my connection on my way home...
13:41:09 we will need to fix the problem from our side first
13:41:13 so it seems that these get_file operations need to be done in senlinclient
13:41:18 on behalf of heatclient
13:41:46 right, we have dependencies on heatclient because of that extra parsing logic
13:42:22 #action file a bug on profile_create skipping template parsing
13:42:53 I think we need to fix this issue before elynn can start his work :)
13:43:09 yes, it is high priority
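A rough sketch of the client-side preprocessing discussed above: resolving get_file references with heatclient's template_utils helpers and folding template, environment, files and parameters into a single stack profile spec. The helper name and the exact spec layout shown are illustrative assumptions; only the general idea comes from the discussion.

    # Illustrative sketch: do the get_file parsing once, client-side, and
    # hand the engine one self-contained spec for the stack profile.
    from heatclient.common import template_utils


    def build_stack_profile_spec(template_path, env_path=None, parameters=None):
        # Resolves get_file references and collects the referenced files,
        # the same preprocessing senlinclient currently does on profile-create.
        tpl_files, template = template_utils.get_template_contents(
            template_file=template_path)

        env_files, env = {}, {}
        if env_path:
            env_files, env = template_utils.process_environment_and_files(
                env_path=env_path)

        files = dict(tpl_files)
        files.update(env_files)

        # All four inputs that determine the stack's desired state live in
        # one spec, so callers such as the heat resource plugin or the
        # dashboard do not have to repeat the parsing themselves.
        return {
            'type': 'os.heat.stack',
            'version': '1.0',
            'properties': {
                'template': template,
                'environment': env,
                'files': files,
                'parameters': parameters or {},
            },
        }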
13:43:20 regarding interaction with Heat, we have more questions to answer
13:43:29 #link http://sched.co/4QbQ
13:44:17 it would be a good and difficult discussion
13:44:48 yes, really hope this session won't conflict with the demo at the booth
13:44:51 yes, what are you going to talk about at the summit?
13:45:18 elynn, question to whom?
13:45:21 or maybe I can ask another guy to help me stay at the booth if so
13:45:36 questions about ASG in heat
13:45:55 it is a design summit working session
13:46:09 people sit around a table for discussion
13:46:32 elynn, you really should join the design session :)
13:46:45 I did hope..
13:46:52 I joined once at the 2013 Hong Kong summit, very interesting
13:46:56 elynn, don't worry, https://etherpad.openstack.org/p/mitaka-heat-autoscaling
13:47:03 this will be the etherpad
13:47:52 we need to think beyond the current spec
13:48:06 how autoscaling would be supported once cluster is in
13:48:07 I tried to implement ResourceGroup and found that it's not very easy to do; many properties and situations need to be considered, like index_var, rolling_update, removal_policy.
13:48:23 right
13:48:58 All these questions might be thrown out at that session. Need to be careful...
13:49:05 rolling_update and removal_policy can be modeled as senlin policies
13:49:09 just to make heat users' lives easier
13:49:26 index_var is difficult
13:49:44 it is completely a heat thing
13:50:09 so I think it can still be implemented in heat
13:50:48 what is index_var for?
13:50:50 Why are you saying that index_var is difficult? Since senlin can set the names of nodes in a cluster, is it possible to change that logic to support index_var?
13:51:00 oh, I see
13:51:24 elynn, I mean the logic should live in heat, not be passed to senlin for parsing
13:51:25 I recall we used it before
13:51:44 it is mainly used for naming things for easier reference later
13:52:35 the key is to translate the 'resource_def' property into a profile
13:52:49 Qiming, I think I understand elynn's concern about index_var
13:52:52 yes
13:53:04 and make future updates behave as users expect
13:53:14 If that logic lives in heat, and I delete a node from the cluster manually, heat might not be aware of that. And that will cause a NotFound error or something else...
13:53:44 if the cluster is created from a heat stack
13:53:50 heat is responsible for that
13:54:11 just like when you delete a nova server secretly, :)
13:54:52 elynn, feel free to call an f2f discussion this week if you have questions
13:55:20 #topic summit meetup planning
13:55:36 yes, I will, after I investigate more deeply.
13:55:42 I haven't looked at the logistics yet
13:55:48 room is very limited
13:56:08 need to spend some time on this
13:56:23 so please watch the mailing list
13:56:30 hi, elynn, I think a node should not be deleted manually using the senlin interface in that case
13:56:33 ok
13:56:50 will announce the time/location to meet when we have a plan
13:57:15 anything else?
13:57:35 nope
13:57:43 nothing from me
13:57:54 not from me
13:57:59 I need to talk to the sdk guys about code review
13:58:04 it is stagnating
13:58:13 need to know what the plan is next
13:58:31 yes, we have a dependency on it
13:58:33 it is blocking a lot of things for us
13:59:05 btw, next week's meeting will be cancelled
13:59:23 talk to you in two weeks, :)
13:59:23 thanks for joining
13:59:27 #endmeeting