12:59:57 #startmeeting senlin
12:59:57 Meeting started Tue Aug 30 12:59:57 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:59:59 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:02 The meeting name has been set to 'senlin'
13:00:28 evening
13:01:19 hi
13:01:54 hi, wait a few minutes and see if anyone else is joining
13:01:59 ok
13:03:02 o/
13:03:10 hi, elynn
13:03:13 hi, elynn and guoshan
13:03:18 not sure if others are joining
13:03:29 let's get started
13:03:31 #topic newton work items
13:03:32 hello
13:03:56 hi, qwebirc78218
13:03:56 hi, qwebirc78218
13:03:58 #link https://etherpad.openstack.org/p/senlin-newton-workitems
13:04:05 performance testing, any progress?
13:04:10 yes
13:04:27 roman has put +2 on the profile context patch
13:04:33 need another +2 and workflow
13:04:48 need to ping rally core?
13:04:49 once this patch is merged, will add context for cluster as well
13:05:11 Qiming, yes, maybe wait for another one or two days
13:05:15 okay
13:05:22 integration test side, https://review.openstack.org/#/c/354566/
13:05:31 good news is it works now
13:05:32 still waiting for another core to approve
13:05:42 Qiming, yes, for adding zaqar support
13:05:54 but at least we can rely on it to make some basic verifications
13:05:57 okay, that is not urgent
13:06:02 yes
13:06:12 basic verification passed, that is great
13:06:18 yep
13:06:27 health policy side
13:06:40 LB based health detection is still not there
13:06:46 not sure if xinhui is still pushing it
13:07:03 she has been working on fencing nova compute hosts
13:07:18 experimenting with IPMI drivers
13:07:58 the only problem in that direction is that nova is not emitting a notification if nova-compute is down
13:08:32 there are notifications if the compute service is shut down by operators, but if the compute host is down, there is no notification
13:08:36 that is too bad
13:08:46 so the only workaround, as of today, would be a poller
13:09:15 you have confirmed that
13:09:21 a poller sounds reasonable for this scenario
13:09:25 so ... I'm not sure if we should (in the Ocata release) make the health manager a separate service
13:09:31 yes, lixinhui_, confirmed
13:09:35 thanks for joining
13:09:51 sorry for being late
13:09:55 that is a stupid design, hopefully we can help improve it if we get cycles
13:10:20 other improvements to the health policy are about the recover/check workflow revision
13:10:26 most are done now
13:11:02 the policy can now suspend itself if node deletion was initiated from an RPC request instead of from a detected failure
13:11:13 that part is also done
13:11:25 I was thinking of making the policy a little bit smarter
13:11:26 great
13:12:03 if you look at this: http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/health_manager.py#n61
13:12:20 when a node is down and gets detected
13:12:47 we are actually sending this info as params when invoking the node_recover API
13:13:12 the policy can be improved to handle different 'event' and/or 'state' values a little more smartly
13:13:27 good point
13:13:27 say if a node is in SHUTDOWN state, the policy can try to just 'reboot' it
13:13:36 or 'start' it
13:13:55 this is still just an idea; we have to wait for the nova server operations patch to be merged into the sdk
13:14:19 profile/policy versioning
13:14:35 yanyan has been working on a 'workaround'
13:14:47 yes, basic versioning support for schema and spec is in place
13:14:58 I'm calling it a 'workaround' because ... versioning is a pretty big problem to solve
13:15:07 we'll get back to that later
13:15:18 but I think we have a lot more details to figure out before deciding how to support policy/profile version control
13:15:21 container support
13:15:23 yes
13:15:29 correct
13:15:38 haiwei's patch is finally in
13:15:49 yes, long run...
13:16:00 he is now experimenting with specifying a host_cluster when creating container clusters
13:16:05 good luck ...
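An editorial aside on the 'smarter health policy' idea discussed above (around 13:12-13:13): a minimal sketch of state-aware recovery selection, assuming hypothetical function names and state values — this is not the actual Senlin implementation:

```python
# Illustrative sketch: instead of always recreating a failed node,
# a health policy could pick a cheaper recovery operation first,
# based on the node state reported with the node_recover call.
# All names and state strings here are hypothetical.

def choose_recover_ops(node_state):
    """Return an ordered list of recovery operations to try."""
    if node_state == 'SHUTDOWN':
        # A stopped server may only need to be started or rebooted.
        return ['START', 'REBOOT']
    if node_state == 'ERROR':
        # Try an in-place rebuild before replacing the node entirely.
        return ['REBUILD', 'RECREATE']
    # Default fallback: delete and recreate the node.
    return ['RECREATE']
```

Returning an ordered list matches the 'list of operations for the profile to try' idea raised later in this meeting, whereas at the time of the discussion the nova profile only understood REBUILD and the generic profile only RECREATE.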
13:16:32 with that work as a starting point, we may want to discuss how to proceed as the next step
13:16:52 haven't had time to review his new spec proposal though
13:17:09 better to have a session at the summit to discuss this topic
13:17:11 but I'd like to call a cross-project discussion with magnum/zun on this
13:17:16 right
13:17:27 Qiming, sure, that would be best
13:17:56 receiver side, yanyan has been working on zaqar support
13:18:15 please delete the items that are done
13:18:15 Qiming, yes
13:18:25 sure
13:18:43 the initial part was merged today
13:18:46 hopefully, zaqar can bring in a more secure, more flexible channel for users/services to send signals to senlin
13:18:54 yes
13:19:03 that was another marathon
13:19:26 okay, anything else on the etherpad page?
13:19:32 seems so. hopefully we can have a basic version that works before we cut our release
13:19:39 this week is the week to cut the newton-3 release
13:19:58 I don't want to do it on Friday; too risky when the gate is so jammed
13:20:11 ah, hoping to catch rc1
13:20:29 we have the flexibility to merge more stable features in the next few weeks
13:20:42 because we don't have a huge pipeline for review/debate
13:21:00 good news
13:21:08 okay, moving on to the next topic
13:21:20 #topic health checking update
13:21:29 em ... I have basically covered that
13:21:42 yep
13:21:49 mostly about the check/recover workflow and the handling of different actions in the policy
13:21:57 there is still one feature not implemented
13:22:15 we were hoping that the recover action could be a list of operations for the profile to try
13:22:44 currently, the profile (nova in particular) only understands REBUILD, and the generic profile only handles RECREATE
13:23:01 that would be interesting work for the future
13:23:12 evening, xuhaiwei_
13:23:19 #topic cluster status update
13:23:21 hi, Qiming
13:23:56 if you are watching the gerrit notifications, you will notice that I have been working on a cluster status update fix these past two days
13:24:02 kept silent so as not to disturb you :)
13:24:19 the basic idea is this: we will update cluster status based on the status of the member nodes
13:24:30 NOT based on the last operation performed on it
13:24:55 e.g. a CLUSTER_UPDATE operation may fail, but the cluster may still remain ACTIVE
13:25:07 we have to differentiate these two things
13:25:49 a CLUSTER_SCALE_OUT may fail, but that is an action failure; it doesn't mean the cluster is not operable
13:26:08 I think this series of patches is nearing its end
13:26:38 when making these changes, I also changed the modification of 'desired_capacity'
13:27:01 we were changing the 'desired_capacity' after an action completed, but that is WRONG
13:27:08 it has been reported several times
13:27:21 yes, saw that patch, that is reasonable
13:27:29 especially from an HA perspective
13:27:29 so I was also making that happen before the action is executed
13:27:54 when a request arrives, the user's expectation is the desired_capacity
13:28:13 if the engine fails to perform the action, it should not change the user's expectation
13:28:28 that is simple logic, but we unfortunately learned it the hard way
13:28:42 questions/comments on this?
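The two changes described above — cluster status derived from member node statuses rather than from the last action, and desired_capacity recorded before the action runs — can be summarized in a minimal sketch. All names and status labels here are illustrative, not the actual Senlin code:

```python
# Sketch of the two ideas from the discussion above.

def derive_cluster_status(node_statuses):
    """Cluster status depends only on the member nodes, not on the
    outcome of the last action performed on the cluster."""
    if not node_statuses:
        return 'EMPTY'
    if all(s == 'ACTIVE' for s in node_statuses):
        return 'ACTIVE'
    if all(s == 'ERROR' for s in node_statuses):
        return 'ERROR'
    return 'WARNING'  # partially healthy

def scale_out(cluster, count):
    """Record the user's expectation *before* executing the action:
    a failed node creation marks the action FAILED, but must not
    roll back desired_capacity."""
    cluster['desired_capacity'] += count
    # ... the engine then attempts the node creations ...

cluster = {'desired_capacity': 0}
scale_out(cluster, 1)
# cluster['desired_capacity'] is now 1, even if creation later fails
```

This also illustrates the question raised just below: creating a node into a cluster should bump desired_capacity to 1 even when the node creation itself fails.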
13:29:09 seems like a no
13:29:15 sorry to break in
13:29:20 can I ask a question?
13:29:23 sure
13:29:34 last time, I created a node but it failed
13:29:42 so I recovered it
13:30:14 but the desired capacity is still 0
13:30:36 yep, that is exactly one of the problems we are fixing
13:30:37 shouldn't it be 1?
13:31:00 when you are creating a node, the desired capacity should be incremented by 1
13:31:03 okay, thanks for answering
13:31:06 even if the node creation was a failure
13:31:41 'increment the cluster size by one', that is the user's (your) desire
13:31:48 we should handle it differently
13:31:53 thanks for bringing this up
13:32:06 moving on
13:32:18 #topic ocata design summit sessions
13:32:29 #link https://etherpad.openstack.org/p/ocata-senlin-sessions
13:33:01 I have put my name on profile/policy versioning
13:33:01 I was just dumping some topics off the top of my head
13:33:29 policy/profile versioning definitely needs some discussion
13:33:37 even before/after that session
13:33:39 yes
13:33:56 maybe combined with Topic 4
13:34:05 "versioned everything"
13:34:11 Qiming, yes, topic 4 can be an extension of that discussion
13:34:24 yep, we cannot finish that in one session
13:34:31 maybe we need two slots
13:34:50 yes, if we have enough time slots
13:34:57 topic 2 is about health
13:35:25 we have some preliminary support now; the next step is to make it work in production environments
13:35:37 it is a huge problem space
13:35:51 we have to brainstorm the work items and prioritize them
13:36:17 maybe involve a congress extension or a mistral workflow
13:36:21 I just don't know
13:36:42 the 3rd topic I can think of is about container clustering
13:37:10 haiwei has set the stage for us; where are we heading next?
13:37:13 maybe I can be the driver
13:37:27 that would be excellent
13:37:52 I haven't spent enough time on it so far; will try to do more before the summit
13:38:30 so ...
13:38:36 any more ideas you can think of?
13:38:37 first we should get the container work going
13:39:10 or we can just let ttx know that we need 4 working sessions?
13:39:24 I guess another topic that may be worth discussing is cluster do operations?
13:39:41 okay
13:39:43 although we already have some basic ideas for it, we may need to figure out the details
13:39:48 and also use cases
13:40:15 openstack cluster do reboot
13:40:32 I updated the spec a few days ago, hope you can review it: https://review.openstack.org/#/c/281102/
13:41:05 we already support 'openstack cluster run --script