13:01:06 #startmeeting senlin
13:01:07 Meeting started Tue Sep 20 13:01:06 2016 UTC and is due to finish in 60 minutes. The chair is yanyanhu. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:01:10 The meeting name has been set to 'senlin'
13:01:19 hello
13:02:07 Hi
13:02:08 hi, all
13:02:36 hi
13:02:52 Qiming will come soon
13:03:41 so let's go through the newton work item list first
13:04:00 https://etherpad.openstack.org/p/senlin-newton-workitems
13:04:02 this one
13:04:14 Performance test
13:04:21 no progress in the last week I think
13:04:36 #topic newton workitem
13:05:03 about more senlin support on the rally side, didn't get time to work on it recently
13:05:24 been working on the message type of receiver support in the last two weeks
13:05:47 will resume the performance test work after this job is done
13:06:01 Health Management
13:06:33 I think Qiming has something to update. Let's skip this and wait for him to come back with an update
13:07:00 Document, no progress I guess
13:07:12 container profile
13:07:20 haiwei is not here I think?
13:07:45 looks so. let's move on
13:08:01 Zaqar message type of receiver
13:08:57 I'm now working on it. The receiver creation part has been done, including queue/subscription creation and trust building between the end user and the zaqar trustee user
13:09:23 and also the api and rpc interfaces
13:09:42 now working on message notification handling
13:09:58 https://review.openstack.org/373004
13:10:00 o/
13:10:04 zaqar trustee user is a configuration option in senlin.conf?
13:10:05 hi, Qiming
13:10:10 elynn, yes
13:10:17 it is configurable
13:10:38 since the operator can define the trustee user on the zaqar side as well
13:10:41 it is a configuration option stolen from oslo_config I think
13:10:51 although the default trustee user will be 'zaqar'
13:11:06 Qiming, yes
13:11:28 will document this part well to make it clear for users/operators
13:11:34 I'd suggest we don't add this kind of config option only for it to be deprecated/invalidated some day
13:11:55 Qiming, yes
13:12:08 that also depends on how zaqar supports it
13:12:58 so will keep looking at it and talking with the zaqar team to ensure our usage is correct
13:13:08 sounds like something negotiable
13:13:30 yes
13:14:18 it now works as expected
13:14:37 https://review.openstack.org/373004 after applying this patch which is in progress
13:15:34 after creating a message type of receiver, users can trigger different actions on a specific cluster by posting messages to the message queue
13:16:13 the queue can be reused multiple times
13:16:17 util receiver is deleted
13:16:20 need a tutorial doc on this, so that users know how to use it
13:16:25 s/util/until
13:16:32 Qiming, absolutely
13:16:40 that is necessary
13:16:48 will work on it after the basic support is done
13:17:00 cool, thx
13:17:05 my pleasure
13:17:19 so this is about the progress of the message receiver
13:18:15 hi, Qiming, we just skipped the HA topic, could you plz give the update on it. thanks
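
(For context on the message-type receiver work above: once such a receiver exists, triggering a cluster action comes down to posting a message whose body names the target cluster and action. Below is a minimal Python sketch of what such a payload might look like; the field names are assumptions, and the actual schema is defined by the patch under review, https://review.openstack.org/373004, so it may differ.)

    # Illustrative only: a guessed shape for the message body posted to the
    # receiver's Zaqar queue. The real schema is defined by the patch under
    # review above and may differ.
    payload = {
        "cluster": "my_cluster",        # target cluster name/ID (assumed field name)
        "action": "CLUSTER_SCALE_OUT",  # cluster action to trigger (assumed field name)
        "params": {"count": 1},         # optional action parameters (assumed)
    }
    print(payload)  # in practice this body would be posted to the queue, e.g. via zaqarclient
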
13:18:41 yep, there is a patch which needs some eyes
13:18:42 https://review.openstack.org/#/c/369937/
13:19:10 the only problem we found so far is about the context used when calling cluster_check
13:19:46 the health manager will invoke cluster_check periodically using an admin context
13:20:17 an admin context is special, in that it has no meaningful fields except for is_admin set to True
13:20:23 Qiming, yes
13:20:43 such a context will be referenced later in the action, and the action wants to record the requesting user/project
13:20:50 which, in this case, are both None
13:21:32 an action having user/project set to None cannot be deserialized later because we strictly require all action objects to have user/project associated with them
13:21:49 I see
13:22:11 xuefeng has helped propose a fix to this. pls help review
13:22:11 maybe we should invoke cluster_check using the senlin service context?
13:22:32 Qiming, will check it
13:22:33 a service context has a different user/project
13:22:37 it makes sense also
13:22:40 from the cluster owner
13:22:42 yes
13:22:48 will think about it
13:22:59 a service context is more appropriate imo
13:23:05 more accurate
13:23:17 yes, since this action is actually done by senlin
13:23:34 yes
13:24:19 I've added something to the agenda today
13:24:30 one thing is about desired capacity
13:24:49 ok, I noticed some recent changes are about it
13:24:51 I'm still dealing with the last relevant action (CLUSTER_DEL_NODES)
13:25:07 hopefully can get it done early tomorrow
13:25:20 great
13:25:39 the idea is to encourage such a usage scenario
13:25:57 a user observes the current/actual capacity when examining a cluster
13:26:17 the desired capacity means nothing, it is just an indicator of an ideal case
13:26:18 so the current logic is that all desired_capacity recalculation will be done based on the 'real' size of the cluster when adding/deleting nodes to/from the cluster
13:26:47 which cannot be satisfied most of the time in a dynamic environment I'm afraid
13:27:05 at the end of the day, users have to face the truth
13:27:22 they need to know the actual/current capacity and make decisions about their next steps
13:27:31 Qiming, it makes sense when the real size of the cluster is different from desired_capacity
13:27:38 actually, that is the logic behind our auto-scaling scenario
13:28:08 Will senlin provide a cron or something to do health checks automatically?
13:28:13 the metrics collected and then used to trigger an auto-scaling operation are based on the actual nodes a cluster has
13:28:57 that implies the triggering was a decision based on real capacity, not the desired capacity
13:29:22 I'm trying to make things consistent across all actions related to cluster size changes
13:29:30 Qiming, I think it's reasonable for node creating/deleting scenarios
13:29:45 but for cluster scaling/resizing scenarios, I'm not sure
13:30:03 whenever an action is gonna change a cluster's size, it means the users are changing their new expectation, i.e. the new desired_capacity
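
(An aside on the cluster_check context issue from the health-management update above: a rough Python sketch of the difference between the two kinds of context being discussed. This is not Senlin's actual RequestContext; the class and the service user/project values are assumptions for illustration.)

    # Simplified illustration of the context problem discussed above; not Senlin code.
    class FakeContext(object):
        def __init__(self, user=None, project=None, is_admin=False):
            self.user = user          # recorded on actions built from this context
            self.project = project
            self.is_admin = is_admin

    # An admin context carries no user/project, so an action built from it has
    # user=None and project=None and cannot be deserialized later.
    admin_ctx = FakeContext(is_admin=True)

    # A service context carries the service credentials (values assumed here),
    # which is why it was suggested as more appropriate for periodic cluster_check.
    service_ctx = FakeContext(user="senlin", project="service")
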
13:30:28 even after those operations are performed, you will still face two numbers: actual and desired
13:30:43 especially if we want to differentiate 'scaling' and 'recovering'
13:30:56 Qiming, yes
13:31:14 okay, I was talking about manual operations, without taking health policy into the picture
13:31:41 My bad:)
13:32:01 when a health policy is attached, users will get more automation in keeping the actual number of nodes close to the desired number
13:32:15 there are some tricky cases to handle
13:32:27 Consider this case: a cluster's desired_capacity is 5, its real size is 4, so it is not totally healthy now (maybe in warning status)
13:32:51 currently, the recover operation is not trying to create or delete nodes so that the cluster size matches that of the desired capacity
13:33:09 yes, yanyan, that is a WARNING state
13:33:32 then the user performs a cluster_scale_out operation (or node_add operation). If the desired_capacity is recalculated with the real size, it will still be 5.
13:33:40 and a new node will be created/added
13:33:46 as we gain more experience with health policy usage, we can add options to the policy, teach it to do some automatic 'convergence' thing
13:33:54 then the cluster will switch to healthy status (active, e.g.)
13:34:09 yes, in that case, the cluster is healthy
13:34:19 the user's new expectation is 5 nodes
13:34:26 and he has got 5
13:34:30 Qiming, exactly, what I want to say is, if the desired_capacity recalculation is done using the real size, cluster scaling could become a kind of recovering operation
13:34:38 and will change the cluster's health status
13:34:41 right
13:34:46 sure
13:34:55 that is an implication, a very subtle one
13:35:15 I was even thinking of a cluster_resize operation with no argument
13:35:24 so I think we may need to decide whether this kind of status switch is as expected
13:35:30 yanyanhu: actually I think the user is expecting 6 if he does cluster_scale_out...
13:35:43 that operation will virtually reset the cluster's status, delete all non-active nodes and re-evaluate the cluster's status
13:35:53 elynn, yes, that is something that could confuse users
13:36:03 we may need to state it clearly
13:36:06 since desired_capacity is what he desired before and now he wants to scale out...
13:36:07 if we are chasing the desired capacity, we will never end the loop
13:36:28 desired is always a 'dream'
13:36:32 so to make that clear
13:36:52 Qiming, yes, if the new desired_capacity becomes 6, the cluster's real size will be added to, e.g., 5, and the cluster will remain in warning
13:37:11 I'm proposing to add a 'current_capacity' property to the cluster, automatically calculated at the client side or before returning to the client
13:37:13 but maybe this is what the user wants :)
13:37:18 Maybe desired_capacity is 6, and real size is 5? And then use the health policy to keep the cluster healthy
13:37:27 exactly, you will never get your cluster status fixed
13:37:50 so I mean cluster status will only be shifted when the user performs a recovering operation
13:37:56 maybe
13:38:07 since it is a health-status-related operation
13:38:09 so when the user does scale_out, the cluster should do: 1. check cluster size 2. create nodes up to the current desired_capacity, which is 5 3. add new nodes to reach 6
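
(To make the status discussion above concrete: a purely illustrative sketch of how a cluster's health status could be derived from desired vs. actual size. The status names follow the discussion; the thresholds are assumptions, not Senlin's actual evaluation logic.)

    # Purely illustrative: derive a status from desired vs. actual size,
    # matching the example above (desired=5, real size=4 -> warning).
    def evaluate_status(desired, active):
        if active == 0:
            return "ERROR"    # assumed status for a cluster with no active nodes
        if active < desired:
            return "WARNING"  # e.g. desired=5, active=4
        return "ACTIVE"

    print(evaluate_status(5, 4))  # WARNING
    print(evaluate_status(5, 5))  # ACTIVE
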
13:38:35 elynn, step 2 could fail, step 3 could fail
13:38:48 elynn, yes, that is a kind of implicit self recovering
13:38:55 that is possible
13:39:14 the only reliable status you can get is by automatically invoking the eval_status method after those operations
13:39:32 so maybe we only change the cluster's health status when the user explicitly performs a recovering operation?
13:39:43 Qiming, yes
13:39:47 If it failed, just show warning and change desired_capacity to 6.
13:39:50 users will always know the 'desired' status, as he/she expressed previously, and the 'current' status, which is always a fact
13:39:53 eval_status is for that purpose
13:40:40 elynn, if there is no health policy, how would you make the cluster healthy?
13:41:05 so my thought is we keep the cluster health status unchanged after cluster scaling/resizing/node_adding/deleting
13:41:14 each time you want to add new nodes, you are pushing the desired capacity higher
13:41:35 Qiming, yes. maybe manually perform cluster_recover?
13:41:55 cluster_recover is not that reliable
13:41:56 could we provide an operation like cluster_heal?
13:42:05 manually performed by users?
13:42:08 and it is too complex to get it done right
13:42:20 cluster_recover + cluster_heal ?
13:43:03 hmm... yes, that will become more complex...
13:43:53 it is too complicated, maybe I haven't thought it through, but I did spend a lot of time on balancing this
13:44:12 let's forget about cluster_heal, just cluster_recover.
13:44:21 I was even creating a table on this ...
13:44:38 any url to paste a pic?
13:44:38 understand your intention. just if cluster scaling doesn't change the cluster desired_capacity, that is confusing imho
13:44:46 SWOT?
13:45:04 url
13:45:19 cluster scaling does change the cluster's desired_capacity
13:45:41 say if you have a cluster: desired=5, current=4
13:45:48 yes
13:45:49 and you do scale-out by 2
13:46:13 but if I scale out by 1, the new desired will still be 5?
13:46:16 the current explanation of that request is: users know there are 4 nodes in the cluster, he/she wants to add 2
13:46:39 then I get d=7, c=7
13:46:41 so the desired_capacity is changed to 6
13:46:53 and we create 2 nodes as the user requested
13:46:59 if the desired is calculated with the real size, then this scaling will actually become "recover"
13:47:18 it is not recover
13:47:28 please read the recover logic
13:47:30 I mean for the scale out by 1 case
13:47:41 it heals those nodes that are created
13:47:55 Qiming, ah, yes, sorry I used the wrong term
13:47:55 it doesn't create new nodes
13:48:01 maybe cluster heal as ethan mentioned
13:48:10 give me a url to paste a pic pls
13:48:19 which will try to converge the cluster's real size to the desired one
13:48:21 why can't cluster recover create nodes?
13:48:30 it is a limitation
13:48:35 elynn, currently, recover means recovering a failed node
13:48:40 we can improve it to do that
13:48:44 through recreating, rebuilding, e.g.
13:48:59 To me cluster recover should bring the cluster back to health, from its words...
13:49:08 elynn, +1
13:49:25 just we haven't made it support creating nodes
13:49:36 maybe we can improve it as Qiming said
13:49:49 http://picpaste.com/001-tvkIE5Aw.jpg
13:49:51 I just talk about the logic here... Not the implementation...
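
(To make the two positions above concrete: a worked Python sketch of the two ways desired_capacity could be recalculated on scale-out, using the numbers from the discussion (desired=5, current=4, scale out by 2). Illustrative only, not Senlin's implementation.)

    # Two candidate interpretations of scale-out; illustrative only.
    def desired_from_actual(actual, count):
        # interpretation argued by Qiming: users reason from what the cluster
        # really has, so the new target is actual size + count
        return actual + count

    def desired_from_desired(desired, count):
        # interpretation raised by elynn/yanyanhu: users reason from the previous
        # target, so the new target is old desired + count
        return desired + count

    desired, actual = 5, 4
    print(desired_from_actual(actual, 2))    # 6 -> two new nodes get created
    print(desired_from_desired(desired, 2))  # 7 -> would need three new nodes
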
13:50:26 yes, go ahead and think about changing the desired_capacity based on the current desired_capacity then
13:50:34 see if you can fill the table with correct operations
13:50:59 in that picture, min_size is 1, desired is 2, max is 5
13:51:19 the first row contains the (active)/(total) nodes you have in a cluster
13:51:28 then you get a request from the user
13:51:53 tell me what you will do to keep the desired capacity a reality, or even near that number
13:52:36 so ... I was really frustrated at chasing desired_capacity in all these operations
13:52:46 Qiming, I think users will understand that the real size of the cluster could always be different from their desired one
13:52:55 we should really let users know ... you have your cluster's status accurately reported
13:53:03 for some reasons
13:53:08 let me go through this table, it takes some time...
13:53:18 you make your decisions based on the real size, not the imaginative (desired) capacity
13:53:30 but once that happens, they need to recover their cluster to match the real size to the desired one
13:53:43 that is an ideal case, senlin will do its best, but there will be no guarantee
13:53:57 that's why we call that operation "recover" or "heal"
13:54:00 yanyanhu, what if recover fails
13:54:07 I mean, fails in the middle
13:54:20 that could happen
13:54:22 we cannot hide that
13:54:29 and it just means recovery failed
13:54:36 the recover operation still cannot solve this problem
13:54:36 and the user can try it again later maybe
13:54:48 Qiming: let the cluster show warning status?
13:54:52 then why do we pretend we can achieve the desired capacity at all?
13:55:16 And stop there..
13:55:21 Qiming, yes, but I think no one can ensure that users always get what they want, right
13:55:29 yes, the recovery operation fails, what's the cluster's status?
13:55:39 so the logic is really simple
13:55:45 warning I think
13:55:46 expose the current capacity to users
13:56:00 let them make their decisions based on the real capacity
13:56:01 for example, d=5, c=4, scale_out=2
13:56:03 not the desired capacity
13:56:11 recover failed
13:56:22 d=7, c=4
13:56:28 the desired capacity has been proved to be a failure if you have 4 nodes created for a d=5 cluster
13:56:44 it's a warning status I think?
13:56:53 how do you explain scale_out=2 ?
13:57:05 the user means he wants to create 7 nodes?
13:57:06 why?
13:57:13 Qiming, maybe each action should just do what it should? and let other actions or policies keep real_capacity=desired
13:57:17 scale_out=2 means the desired_capacity will increase by 2
13:57:26 the user wants to scale_out 2 nodes, he totally wants 7 nodes here...
13:57:46 we can't even guarantee the 2 new nodes will be created correctly
13:57:51 elynn, users already saw only 4 nodes in the cluster
13:58:09 If he only wants current nodes +2, then he should do recover first and then scale_out I think...
13:58:09 alright, take a step back
13:58:16 think about this
13:58:32 you have d=3, c=2, then ceilometer triggers an autoscaling
13:58:43 skip the user intervention for now
13:58:47 That's another scenario....
13:58:47 elynn, that is also what I'm thinking actually
13:58:51 what was that decision based on?
13:59:02 that's based on actual nodes...
13:59:04 why shouldn't we keep this consistent?
13:59:18 Qiming, if the user leaves the bar to ceilometer, I think we can treat ceilometer as the user
13:59:20 sigh ...
14:00:01 that's the confusing part...
14:00:08 pls go through that table
14:00:09 umm, time is over... maybe keep on discussing it in the senlin channel?
14:00:17 you will realize what a problem we are facing
14:00:41 Qiming, I see. Will think about it and have more discussion on it
14:00:58 will end meeting
14:01:01 #endmeeting
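
(A closing aside on the "expose the current capacity to users" point near the end of the meeting: a minimal sketch of how the proposed 'current_capacity' value could be calculated on the client side from the node list. The property name comes from the proposal above; the calculation and status name are assumptions.)

    # Hypothetical client-side calculation of the proposed 'current_capacity':
    # count only the nodes that are actually usable. Not Senlin's code.
    def current_capacity(nodes):
        return sum(1 for node in nodes if node.get("status") == "ACTIVE")

    nodes = [{"status": "ACTIVE"}, {"status": "ACTIVE"}, {"status": "ERROR"}]
    print(current_capacity(nodes))  # 2, while desired_capacity might still be 3
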