13:00:10 #startmeeting senlin
13:00:10 Meeting started Tue Jun 7 13:00:10 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:13 The meeting name has been set to 'senlin'
13:00:26 #topic roll call
13:00:37 hello
13:00:58 hi
13:01:02 hi
13:01:08 hi, haiwei
13:01:09 hi
13:01:16 network connection at home is really unstable
13:01:17 o/
13:01:33 #topic newton work items
13:01:45 testing, where are we?
13:02:03 enable tempest api on gate
13:02:08 50% I think
13:02:15 about negative test cases
13:02:30 Saw some patches submitted by yanyanhu, many thanks!
13:02:40 okay, so we did find some inconsistencies in apis
13:02:43 elynn, my pleasure :)
13:02:57 also found some issues about our API implementation when writing the tests
13:03:00 elynn, posted some comments on your latest patches
13:03:02 Qiming, yes
13:03:06 that is valuable
13:03:36 Qiming: will check :)
13:03:50 tempest dsvm gate is not very slow, right?
13:03:56 yes
13:04:17 great
13:04:35 rally side
13:04:47 patch 318453 was in
13:04:49 gate job is finally ready
13:05:15 you mean the gate job at senlin side?
13:05:18 and 301522 works well now
13:05:21 both sides
13:05:25 in rally and senlin
13:05:41 just need to address the rally team's question about that patch
13:05:58 but no critical obstacle I think
13:05:58 we don't rely on 301522 to run the gate at senlin side, right?
13:06:07 yes
13:06:15 that is for the rally repo
13:06:24 my question is about the gate failures we saw when doing 'check experimental' at senlin side
13:06:56 actually 307170 works as well
13:07:03 you mean this one? https://review.openstack.org/307170
13:07:23 oh?
13:07:43 that is the first time we have gate-rally-dsvm-senlin-senlin working!!
13:08:01 sorry, I was dropped
13:08:06 Qiming, yes :)
13:08:08 This name is a little weird...
13:08:26 elynn, yes if you mean senlin-senlin.yaml :)
13:08:40 yes, let's get it in and fix it later?
13:08:58 that is because we try to match the gate-dsvm-rally-senlin-{name} job template
13:09:29 Qiming, ok, I think I may need to do some cleanup before that patch becomes ready
13:09:42 okay
13:09:48 will remove [WIP] when it's ok
13:09:53 double senlin, now we have amazon :P
13:10:04 :P
13:10:36 and will discuss with eldon their tests based on rally
13:10:36 better rename that... it is strange, indeed
13:10:42 hope we can provide some help for them
13:11:08 Qiming, yes, I think maybe we can propose another job template in the future
13:12:03 sure, we can work together on stress tests
13:12:18 Anyway, now we have many eyes on the gates :)
13:12:34 yes
13:12:48 moving on
13:12:55 health management
13:13:15 I saw xinhui has taken over that lbaas bug
13:13:25 patch proposed now
13:13:35 cool
13:13:47 yes
13:13:52 I am doing that
13:13:59 many thanks
13:14:07 my pleasure
13:14:19 will contribute more in the next weeks
13:14:29 there are still some gate failures
13:14:52 yeah
13:15:08 I have added health monitoring by listening to vm lifecycle events
13:15:23 it took me quite some time to understand the filtering logic
13:15:45 congrats!
13:15:49 you got it
13:15:56 there are still things unstable inside oslo.messaging, complaining about some regex matching failures
13:15:56 will learn from you
13:16:23 anyway, we can get notified when vm status is modified (reboot, start, stop, ...)
13:16:32 next thing is to trigger some actions
13:16:41 is that reliable?
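
[For reference, a minimal sketch of what listening to vm lifecycle events via oslo.messaging can look like. This is not Senlin's actual implementation; the endpoint class, the regex filters, and the 'notifications' topic are assumptions for illustration.]

    import oslo_messaging as messaging
    from oslo_config import cfg

    class InstanceEventEndpoint(object):
        # Only let compute instance lifecycle events through; this
        # filtering is the part described above as taking time to understand.
        filter_rule = messaging.NotificationFilter(
            publisher_id=r'^compute.*',
            event_type=r'^compute\.instance\..*')

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # e.g. event_type == 'compute.instance.power_off.end'
            print('VM %s: %s' % (payload.get('instance_id'), event_type))

    transport = messaging.get_notification_transport(cfg.CONF)
    # nova's legacy notifications are published on the 'notifications' topic
    targets = [messaging.Target(topic='notifications')]
    listener = messaging.get_notification_listener(
        transport, targets, [InstanceEventEndpoint()])
    listener.start()
    listener.wait()  # block; events arrive on the endpoint's info() method
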
13:16:42 will dive into that
13:16:53 I mean the listening
13:17:18 listening is reliable, just some initialization wasn't complete I guess
13:17:28 ...
13:17:37 when I restart the engine, the listeners are created, but not receiving events
13:17:54 maybe need to add some fuzzy delay
13:18:02 ok
13:19:09 as for the health threshold
13:19:21 I'm thinking of using desired_capacity
13:19:44 could you explain more?
13:20:06 if time permits
13:20:37 make desired_capacity the health threshold
13:21:04 a cluster is treated healthy if the number of active nodes >= desired_capacity
13:22:15 make sense?
13:22:16 okay
13:22:21 too harsh?
13:22:37 Qiming, if so, the node number could go beyond desired_capacity?
13:22:52 yes, between max_size and desired_capacity
13:23:26 in what case is the node number bigger than desired_capacity?
13:23:43 hmm, sounds a little different from our discussion at the summit
13:24:10 when you do a cluster check, there are some nodes not responding
13:24:25 when you do some operations later, those nodes come back to life
13:25:17 did we have any discussion about the health threshold during the summit?
13:25:28 so the desired_capacity only counts the alive nodes?
13:25:34 yes, so the total number of healthy nodes will finally match the desired_capacity
13:25:44 desired is always the 'desired'
13:25:47 Qiming, no, it's not about the health threshold
13:25:55 about the scaling baseline
13:26:16 it is not the number of actually active nodes, we can never assume so
13:26:50 Qiming, yes, that's what I mean. I think the case where the total number of nodes goes beyond desired_capacity is kind of a transient status?
13:27:11 if you don't do something, those nodes will be there
13:27:23 finally, the number of healthy nodes will be desired_capacity
13:27:30 there are transient statuses where some nodes are not active when you are checking them
13:28:05 the question is why are we maintaining the number of healthy nodes?
13:28:45 we are already not so sure about the number of active nodes, considering that there are transient problems
13:29:14 what we do care about is "whether the cluster is healthy"
13:29:30 which means there are enough nodes to share workloads
13:29:51 by "enough" here, we mean the number of active nodes >= desired_capacity
13:30:07 yes, it is a bit harsh
13:30:21 yes. I think the user specifies the desired_capacity, which is the number of healthy nodes they want to have, so we should try to match it with the actual number of active nodes?
13:30:31 but is there any way to maintain another statistic?
13:31:08 assume you are the user, when you are specifying desired_capacity, what are you thinking?
13:31:27 hmm, I want a cluster with this number of nodes being active
13:31:32 healthy
13:32:25 there could be a case where a user wants to create a cluster of 10 nodes, but 5 nodes is okay for him/her
13:32:44 yes, that is possible
13:32:54 why is he doing that?
13:33:19 if 5 is okay, then 5 is the min_size, right?
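
[A sketch of the health threshold proposed above: a cluster is treated as healthy when the number of active nodes reaches desired_capacity. The function and attribute names are hypothetical, for illustration only.]

    def is_cluster_healthy(cluster):
        # Count nodes currently in ACTIVE status.
        active = sum(1 for node in cluster.nodes if node.status == 'ACTIVE')
        # desired_capacity is the user's target. Active nodes can
        # transiently exceed it (nodes that looked dead during a check
        # may come back to life later), hence '>=' rather than '=='.
        return active >= cluster.desired_capacity
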
13:34:05 hmm, I think that means in any case, the cluster size should not be less than 5
13:34:22 yes, then 5 is actually the min_size
13:34:24 that is not directly related to health management
13:34:31 yes, 5 is the min_size
13:34:44 if the cluster is dropping below that level, the cluster is in error status
13:35:01 I think that's two cases
13:35:03 if cluster size is between min_size and desired_capacity, we can treat it as a warning
13:35:25 if cluster size is less than 5, that means an internal error happened on the senlin side
13:35:25 it is all about how we define the status of a cluster
13:35:41 users don't care what happened
13:35:56 yes, I understand what you mean, I just feel we shouldn't mix the health management case and the scaling case
13:36:01 maybe some nova nodes crashed
13:36:11 you cannot say it is senlin's fault
13:36:16 IMHO, min_size/max_size/desired is about scaling cases
13:36:41 they are properties you specify when you create a cluster
13:36:52 no matter whether you will scale that cluster or not
13:37:07 yes, that's the hard limit of cluster size
13:37:19 no matter whether the cluster has HA management support or not
13:37:28 exactly
13:38:07 so ... I'm wondering if we do want to introduce another threshold into senlin at the moment
13:38:14 so I think desired_capacity is something related to HA since it's user-desired
13:38:39 and we can never make sure it matches the reality
13:38:43 Qiming, I agree we can consider desired_capacity a health-related property
13:38:51 but min_size, max_size may not be
13:39:21 okay, agree to disagree
13:39:30 let's think about it offline
13:39:38 ok
13:39:51 we can have a further discussion tomorrow :)
13:39:56 I'd suggest we forget all the actions/policies we have in senlin
13:39:59 really need more thinking about it
13:40:16 just think from a user's perspective, what makes better sense for them
13:40:26 agree with this
13:40:40 jealous
13:40:46 a clear definition from the user's perspective is the most important
13:40:54 you can discuss face to face
13:41:07 we will discuss on irc
13:41:11 :)
13:41:15 lixinhui_, you can come here, someone will buy you coffee :P
13:41:24 cool!
13:41:31 or you can come to VMware
13:41:37 tomorrow we have happy hour
13:41:41 for free coffee :)
13:41:59 :P
13:42:01 we can define min_size, health_watermark, desired_capacity and max_size
13:42:13 see if you can explain all these four numbers to users
13:42:30 hmm, need more thinking on it
13:42:43 okay, let's move on
13:43:06 any news from you lixinhui_ on health management?
13:43:18 Sorry, Qiming
13:43:30 I will try to contribute more in the following weeks
13:43:32 last sentence from me about this issue: maybe we should reconsider why users define min_size/max_size and whether and when they really need it :)
13:43:39 too distracted
13:43:49 no worry, just ask questions, in case we have been moving too fast
13:44:21 :)
13:44:28 no update on documentation from me
13:44:38 container support
13:44:54 I submitted a patch
13:45:06 to initialize the docker driver
13:45:38 saw some patches from haiwei, I think we have been mixing things in a strange way
13:46:05 will have a closer look at the patch
13:46:31 yes, please comment on it
13:46:44 notification/event side, some basics are there
13:47:05 need some serializers and an example to encapsulate a notification into an object
13:47:20 then dispatch that object to oslo.messaging or db
13:47:29 will continue working on that
13:47:41 zaqar work is stalled
13:48:06 that's all from the etherpad
13:48:15 things to add?
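
[One possible reading of the four numbers floated above (min_size, health_watermark, desired_capacity, max_size) as cluster status levels. health_watermark was only a proposal in this meeting, so this is purely illustrative.]

    def cluster_status(active, min_size, health_watermark, desired_capacity):
        # Assumes min_size <= health_watermark <= desired_capacity <= max_size.
        if active < min_size:
            return 'ERROR'     # below the hard lower bound
        if active < health_watermark:
            return 'CRITICAL'  # too few nodes to share the workload
        if active < desired_capacity:
            return 'WARNING'   # degraded, but above the watermark
        return 'HEALTHY'       # active >= desired_capacity
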
13:48:39 nope, really lots of work items
13:49:03 #topic senlin cluster-do operation
13:49:27 here is the patch: https://review.openstack.org/#/c/326208/
13:49:51 we are adding OPERATIONS to a profile definition
13:50:00 it is not exposed to users for customization
13:50:30 but implementation-wise, we are modeling operation parameters using schemas
13:50:46 oh, it's for this purpose
13:50:54 I didn't get it when I saw it the first time
13:51:04 so an operation can be easily verified when we get a JSON containing the requested operation
13:51:10 will check it
13:51:15 yep
13:51:27 that's a nice wrap
13:51:31 an operation request could be {"reboot": {"type": "HARD"}}
13:51:55 when users input 'senlin cluster-do help cluster1'
13:52:26 we can iterate through the OPERATIONS dict and return a help text --- here are the operations you can try
13:52:49 just like when you do senlin profile-type-show
13:52:51 nice
13:53:23 in the case of a nova server cluster, you can do 'senlin cluster-do reboot --type=HARD cluster1'
13:53:57 command-wise, we can add more parameters so that users can reboot nodes with specific roles, but those can be added later
13:54:26 parameters are checked just like profile/policy properties, they have data types
13:54:34 :)
13:54:36 the only difference is that they are not 'updatable'
13:54:55 that is why I revised the common schema module
13:55:11 yes, saw that patch
13:55:17 in the future there could be some extensions to the Operation schema, today it is just a Map
13:55:32 that's some background about that patch
13:55:39 #topic open discussions
13:55:56 I think we have covered the health management part
13:56:16 yes
13:56:18 and ... 1 hour is definitely not enough for discussion
13:56:29 nod
13:56:30 yes :)
13:56:38 need some homework before we discuss it again
13:56:57 pls think from the user's perspective
13:56:58 :)
13:57:00 will think about it as well
13:57:06 :)
13:57:06 yes
13:57:15 ok
13:58:29 oh, don't know if you have noticed it
13:58:43 we had senlin 2.0.0.0b1 released last Friday
13:58:53 senlinclient 0.5.0 released today
13:59:08 saw it
13:59:11 the senlinclient version jump was based on the release team's suggestion
13:59:22 leave some version numbers for back-ports
13:59:30 hmm.
13:59:35 in global requirements?
13:59:45 not yet proposed to global requirements
13:59:50 I see
13:59:59 feel free to do so
14:00:03 #endmeeting
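
[A standalone sketch of the OPERATIONS idea discussed above: operations and their parameters are modeled as schemas so a request like {"reboot": {"type": "HARD"}} can be validated, and the same dict can feed the 'cluster-do help' output. The schema layout here is an assumption for illustration, not the actual code in patch 326208.]

    # Hypothetical parameter schema for a nova server profile's operations.
    OPERATIONS = {
        'reboot': {
            'description': 'Reboot the nova server.',
            'parameters': {
                'type': {'allowed': ['SOFT', 'HARD'], 'default': 'SOFT'},
            },
        },
    }

    def validate_operation(request):
        # request looks like {"reboot": {"type": "HARD"}}
        (name, params), = request.items()
        if name not in OPERATIONS:
            raise ValueError('unsupported operation: %s' % name)
        spec = OPERATIONS[name]['parameters']
        for key, value in (params or {}).items():
            if key not in spec:
                raise ValueError('unknown parameter: %s' % key)
            allowed = spec[key].get('allowed')
            if allowed and value not in allowed:
                raise ValueError('%s must be one of %s' % (key, allowed))
        return name, params

    validate_operation({'reboot': {'type': 'HARD'}})
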