13:00:45 #startmeeting senlin
13:00:46 Meeting started Tue May 10 13:00:45 2016 UTC and is due to finish in 60 minutes. The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:49 The meeting name has been set to 'senlin'
13:01:35 hello
13:01:39 Good evening.
13:01:42 welcome back
13:01:49 o/, zzxwill
13:01:53 hi
13:02:08 Thanks. Crazy with my work recently :(
13:02:10 hi
13:02:13 hi, haiwei, in Taipei?
13:02:19 Evening!
13:02:25 back now
13:02:44 actually, I only managed to restore my work env 1 hour ago
13:02:57 lost my laptop at Austin airport
13:03:05 anyway
13:03:06 hi
13:03:07 :(
13:03:10 Oh, that's a pity.
13:03:18 Got a new one?
13:03:23 the only item I have in mind is about Newton work items
13:03:53 feel free to add topics to the meeting agenda: https://wiki.openstack.org/wiki/Meetings/SenlinAgenda
13:04:28 Just added a topic about adding new work items based on our discussion at the summit
13:04:55 yes, that is also about Newton work items
13:05:23 ok
13:06:06 Hi
13:06:08 let's quickly go through the current list
13:06:10 hi, cschulz_
13:06:22 ok
13:06:22 #link https://etherpad.openstack.org/p/senlin-newton-workitems
13:06:43 scalability improvement: need to sync with xujun/junwei
13:06:49 some of them have been done and some should be obsoleted
13:07:06 tempest test
13:07:12 we are on track
13:07:16 Ethan and I are working on it
13:07:18 yep
13:07:18 yes
13:07:20 basic support is there
13:07:34 We are working on API tests
13:07:50 still need policy type list, profile type list, and negative tests
13:07:59 Do we need to add a gate job for it?
13:08:07 in experimental
13:08:16 sure, that would be nice
13:08:40 ok, I will work on it then.
13:08:57 may also need to rework the client to enable negative tests.
Or we can use exceptions rather than the resp status to verify the result
13:08:59 recorded
13:09:33 About the negative tests for the API
13:09:36 I mean the clustering client of the tempest test
13:09:42 need to test the status code at least, imo
13:09:46 do we only check the response code?
13:09:52 Qiming_, yes
13:10:21 agree with this. So we may need to invoke raw_request directly
13:10:33 yes
13:10:59 I think only checking the status code and response body is enough for API negative tests.
13:11:06 elynn, yes
13:11:06 if we are bringing senlinclient into this, it looks more like a functional test of senlinclient, instead of an API test of the server
13:11:30 so.. benchmarking
13:11:31 Can the existing client not return a bad status code?
13:11:42 basic support has been done
13:11:50 on the rally side
13:12:03 lixinhui_, any update?
13:12:18 will work on some of the simplest test cases based on it
13:12:24 but maybe not now
13:12:49 Qiming
13:13:07 is it about benchmarking?
13:13:21 elynn, nope, the failure will be caught by the rest client of tempest lib and converted to an exception
13:13:22 I'm wondering if bran and xinhui have done some experiments on engine/api stress tests
13:13:31 we are
13:13:39 yanyanhu, okay, I got your point...
13:13:44 :)
13:13:46 but
13:14:03 we are bottlenecked by nova
13:14:13 still need to overcome the scalability issue of oslo.messaging?
13:14:28 not really about oslo
13:14:32 but nova
13:14:36 nova api rate-limit?
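The negative-test pattern agreed above (let the tempest-lib rest client convert a non-2xx response into an exception, then assert on the status code it carries) can be sketched roughly as follows. This is only an illustration of the pattern: the client class and method names below are stand-ins, not senlin's actual clustering client.

```python
# Sketch of the negative-test pattern discussed above: the tempest-lib
# style rest client raises an exception for non-2xx responses, and the
# test asserts on the status code carried by that exception.
# FakeClusteringClient and NotFound are illustrative stand-ins.

class NotFound(Exception):
    """Mimics a tempest-lib NotFound exception (HTTP 404)."""
    status_code = 404


class FakeClusteringClient:
    def get_cluster(self, cluster_id):
        # A real client would issue GET /v1/clusters/{id}; here we
        # simulate the server returning 404 for an unknown cluster.
        raise NotFound("cluster %s could not be found" % cluster_id)


def test_cluster_show_not_found():
    client = FakeClusteringClient()
    try:
        client.get_cluster("bogus-id")
    except NotFound as ex:
        # Per the discussion, a negative API test only needs to verify
        # the status code (and optionally the response body).
        assert ex.status_code == 404
    else:
        raise AssertionError("expected NotFound to be raised")


test_cluster_show_not_found()
print("negative test passed")
```

The point of the pattern is that the test never inspects a raw response object for the failure path; the exception type and its `status_code` are the whole assertion surface.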
13:14:41 oh, about this topic, I think there should be some performance improvement benefit from the latest scheduler rework
13:15:01 something like that
13:15:03 I mean the performance of the senlin engine
13:15:14 we may try to resolve it at the driver layer
13:15:27 from a production env
13:15:38 we have rally- and heat-based
13:15:44 okay, we need some rough numbers using both the fake driver and the real one
13:15:45 stress tests
13:15:58 Qiming_, agree
13:16:29 but that will depend on whether we need to bring senlin into this test env
13:17:04 yes, it would be nice to know whether senlin has a scalability issue or not
13:17:13 the earlier the better
13:17:24 Bran has tried with a simulated one
13:17:54 maybe we can paste the numbers on the senlin wiki?
13:17:58 and found that there is no upper limit on the
13:18:04 one-engine
13:18:07 test
13:18:19 but parallel tests will need more time
13:18:26 maybe I should implement a basic rally plugin for senlin cluster and node operations to support this test
13:18:32 okay
13:18:40 lixinhui_, if you guys need it, please just tell me
13:18:50 not really now
13:18:53 I see, so there is a dependency
13:18:54 thanks yanyanhu
13:19:00 no problem
13:19:20 we will keep working on the multiple-engine simulated driver test
13:19:25 or these two threads can go in parallel
13:19:37 cool
13:20:09 please check if we can record these "baseline" numbers on the senlin wiki: https://wiki.openstack.org/wiki/Senlin
13:20:16 sure
13:20:23 Rally side
13:20:49 basic support for senlin in rally has been done. Will start to work on the plugin
13:21:00 are we still planning to commit rally test cases to the rally project?
13:21:06 will start from basic cluster operations
13:21:39 by plug-in, you mean we will be hosting the rally test cases?
13:21:56 Qiming_, we can if we want to, I think
13:22:10 to hold the test jobs
13:22:11 what's the suggestion from the rally team?
13:22:23 they suggest we contribute the plugin to the rally repo
13:22:28 which I think makes sense
13:22:47 for those jobs, we can hold them in the senlin repo I think
13:23:28 ... jobs are not modelled as plugins?
13:23:48 no, jobs means those job description files :)
13:23:49 what is this then? https://review.openstack.org/#/c/301522/
13:23:54 those yaml or json files
13:24:23 Qiming_, those jobs are used as examples to verify the plugin :)
13:24:34 okay, makes sense
13:24:35 more jobs should be defined per our test requirements
13:24:44 which I guess should be held by ourselves
13:24:57 that is fine
13:25:59 pls help make that plugin work so others may help contribute job definitions etc.
13:26:10 sure
13:26:16 will work on it
13:26:26 health management
13:26:42 em, a huge topic indeed
13:26:44 is trying out linux HA
13:26:56 for health detection?
13:27:06 or recovery, or both?
13:27:09 want Qiming_ to share more of the picture in your mind
13:27:27 you mean a photo from San Antonio?
13:27:29 based on discussion with adam and DD
13:27:36 fencing
13:27:39 nowadays
13:27:46 with CentOS
13:27:54 VM
13:28:05 got it
13:28:05 but you know
13:28:20 just want to know more of the picture
13:28:31 need to spend some time on the specs and the etherpad
13:28:32 about the HA story
13:28:49 yes
13:28:58 we cannot cover all HA requirements in our very first step
13:29:12 we may not be able to cover them all in the future
13:29:14 from the presentation of Adam and DD
13:29:22 need to focus on some typical usage scenarios
13:29:38 They hope to leverage Senlin for recovery and
13:29:42 fencing
13:29:59 right
13:30:15 but is that a design assumed by ourselves?
13:30:27 so ... let's focus on the user story then
13:30:29 What is your thought — that HM will create events that may trigger cluster actions based on cluster policies?
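The job description files mentioned above are plain YAML task files consumed by rally. Assuming the plugin lands with a scenario name along the lines of `SenlinClusters.create_and_delete_cluster` (hypothetical here — the final name and arguments depend on the plugin patch under review), a minimal job file hosted in the senlin repo might look like:

```yaml
# Hypothetical rally task file for a basic senlin cluster scenario.
# Scenario name, arguments, and numbers below are illustrative only.
SenlinClusters.create_and_delete_cluster:
  - args:
      desired_capacity: 3
      min_size: 0
      max_size: 5
    runner:
      type: constant
      times: 10        # run the scenario 10 times in total
      concurrency: 2   # 2 scenario iterations in flight at once
    context:
      users:
        tenants: 2
        users_per_tenant: 2
```

This split matches the meeting's conclusion: the plugin (Python scenario code) lives in the rally repo, while job files like this one, tuned to senlin's own test requirements, stay under senlin's control.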
13:30:55 yes
13:31:02 we will build the story step by step
13:31:07 that is the recovery part
13:31:23 the first step is the check/recover mechanism, the very basic one
13:31:39 and fencing may become part of the recovery process
13:32:01 So there probably also needs to be policy-like things in HM that define how the health of a cluster is assessed?
13:32:12 the second step is to try to introduce some intelligence into failure detection
13:32:38 health check and failure recovery can be two work items in parallel, I guess?
13:32:48 the third step is to link the pieces together using some sample health policies
13:33:01 actually I do not think we should do many check things
13:33:03 yes, guess so
13:33:21 agreed, health checking is independent of what actions you take when you've made an assessment
13:33:42 if users don't like the health policy, we still provide some basic APIs for them to do cluster-check, cluster-recover [--with-fence], etc.
13:34:12 Actually that is where I'd start
13:34:30 users may not like the way we do health checking; still, they can do cluster-recover by triggering that operation from their software/service
13:34:31 Then add some basic mechanisms for those who just want something simple
13:34:40 right
13:35:14 I cannot assume we understand all usage scenarios
13:36:06 :) I was challenged by the linux-ha author during my presentation --- how do you detect application failure?
13:36:27 And your answer was?
13:36:38 it is a huge space, we cannot assume we know all the answers
13:36:52 application failure detection is currently out of senlin's scope
13:37:03 Agreed!
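The three-step plan above (basic check/recover first, smarter failure detection second, sample health policies third) implies a very small core loop for the first step. A rough sketch of that `cluster-check` / `cluster-recover [--with-fence]` shape, using hypothetical helper names rather than senlin's actual internals:

```python
# Rough sketch of the basic check/recover mechanism described as the
# first step of the HA story. check_node and recover_node are
# hypothetical placeholders; fencing is modelled as an optional part
# of the recovery process, as discussed.

def check_node(node):
    # Placeholder health check: a real implementation would query the
    # backing resource (e.g. a nova server) for its actual status.
    return node.get("status") == "ACTIVE"


def recover_node(node, with_fence=False):
    if with_fence:
        # Fencing would isolate the failed node before recovery so it
        # cannot interfere with its replacement.
        node["fenced"] = True
    node["status"] = "ACTIVE"


def cluster_recover(cluster, with_fence=False):
    """Minimal cluster-recover: check every node, recover the failed ones."""
    recovered = []
    for node in cluster:
        if not check_node(node):
            recover_node(node, with_fence=with_fence)
            recovered.append(node["name"])
    return recovered


cluster = [{"name": "n1", "status": "ACTIVE"},
           {"name": "n2", "status": "ERROR"}]
print(cluster_recover(cluster, with_fence=True))  # → ['n2']
```

Note how this matches the point made in the meeting: the check step and the recovery step are independent, so a user who dislikes the built-in checking can still call the recover path directly from their own monitoring software.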
13:37:03 yes
13:37:05 I think so
13:37:08 that is his answer
13:37:15 from this
13:37:23 there is plenty of software doing application monitoring; use them
13:37:27 I do not think we can understand all the use cases today
13:37:42 but we can start from the basics
13:37:57 or the design of the check-and-recover loop
13:38:15 so the key is how to leverage those monitoring tools/services
13:38:32 but trying to provide some basic investment
13:38:35 to detect failures of nodes/apps happening in a senlin cluster
13:38:40 on the choice of failure processing
13:38:40 we leave choices to users, though we do provide some basic support for simple cases
13:39:00 even today
13:39:03 masakari
13:39:08 's evacuate
13:39:15 recovering a heat stack is completely different from recovering a nova server
13:39:22 cannot work well with all guest OSes and hypervisors
13:39:33 you are already onto masakari?
13:39:53 tried that function of masakari a bit
13:40:07 need to investigate more
13:40:17 ... big thanks!
13:40:27 :)
13:40:31 masakari is new to me. Will investigate
13:41:04 it has a vagrant and chef deployer, cschulz
13:41:05 for HA support, let's focus on planning
13:41:20 yes
13:41:24 https://github.com/ntt-sic/masakari
13:41:30 this one?
13:41:31 build stories on the etherpad: https://etherpad.openstack.org/p/senlin-ha-recover
13:41:45 yanyanhu, yes
13:41:50 moving on
13:42:17 documentation side
13:42:31 I'm working on API documentation in RST
13:43:00 hopefully it can be done soon, then I can switch to tutorial/wiki docs
13:43:06 will provide some help on it
13:43:14 great, yanyanhu
13:43:24 container support
13:43:45 haiwei, maybe we can check in the container profile as an experimental one
13:44:22 you mean just create one first?
13:44:33 yes, a very simple one is okay
13:44:46 it has to work, it has to be clean
13:44:59 we can improve it gradually
13:45:32 ok, I will submit some patches for it
13:45:56 then we can start looking into the specific issues of CLUSTERING containers together
13:46:24 at the same time, we will watch the progress of the Higgins project: https://review.openstack.org/#/c/313935/
13:46:47 yes, I noticed it recently
13:47:00 if that one grows fast, we can spend less and less energy at this layer
13:47:14 just focusing on the clustering aspect of the problem
13:47:25 agree :)
13:47:41 that's why I think a simple profile suffices
13:47:50 ok
13:47:57 for us to think about the next layer
13:48:30 the "tickless" scheduler is out
13:48:33 that is great!!!
13:49:09 :)
13:49:22 it does improve the efficiency of our scheduler
13:49:27 any news from the zaqar investigation?
13:49:38 much appreciated your suggestion at the summit :P
13:49:54 about the event and notification mechanism
13:50:14 I've been very distracted since the week of the Austin summit, so not much progress.
13:50:19 I do not know if that is related to the scenario discussion at the summit
13:50:22 but vmware PM
13:50:37 on customisable reaction
13:50:45 okay
13:51:12 lixinhui_, I was thinking of this scenario
13:51:12 or just related to the processing of actions
13:51:14 Can someone give me a brief on the scenario discussion?
13:51:20 for vmware vm monitoring
13:51:53 senlin can emit events for vmware to listen to
13:52:06 so that it will know which node belongs to which cluster
13:52:09 that will be great
13:52:29 it will have some knowledge to filter out irrelevant vms when doing maths on metrics
13:52:43 yes
13:53:01 that is desired by a mixed deployment env
13:53:16 okay, we can work on a design first
13:53:43 a multi-string configuration option for the event backend
13:53:52 we only have the database backend implemented
13:54:19 we can add http and message queue as backends
13:55:01 a detailed design is still needed
13:55:12 em...
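The multi-string configuration option mentioned above would let an operator enable several event backends at once, with each emitted event fanned out to all of them. A toy dispatcher illustrating the idea — the backend names, option name, and event fields below are all assumptions for illustration, not senlin's actual design:

```python
# Toy sketch of fanning one event out to several configured backends.
# Only the database backend exists in senlin today; 'http' and
# 'message_queue' stand in for the backends discussed as future
# additions. All names here are illustrative.

sent = []  # records (backend_name, event) pairs for demonstration


def database_backend(event):
    sent.append(("database", event))


def http_backend(event):
    sent.append(("http", event))


def message_queue_backend(event):
    sent.append(("message_queue", event))


BACKENDS = {
    "database": database_backend,
    "http": http_backend,
    "message_queue": message_queue_backend,
}

# In a real deployment this list would come from a multi-string
# config option, e.g. event_dispatchers = database,http
configured = ["database", "http"]


def dispatch(event):
    """Send one event to every configured backend, in order."""
    for name in configured:
        BACKENDS[name](event)


dispatch({"cluster": "c1", "node": "n1", "action": "NODE_RECOVER"})
print([name for name, _ in sent])  # → ['database', 'http']
```

An http or message-queue backend like this is what would let an external listener (the vmware monitoring scenario above) learn which node belongs to which cluster without polling senlin's API.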
only 5 mins left
13:55:17 Okay
13:55:24 Are events predefined? Or can a stack/cluster define the events it wants?
13:55:33 yes, cschulz
13:55:39 Qiming_, maybe we postpone the second topic to the next meeting
13:55:54 about adding new work items based on the discussion at the summit
13:55:58 ok
13:56:21 there are followups wrt the design summit sessions
13:56:34 need to dump them into TODO items
13:56:49 and those items will be migrated to this etherpad for progress checking
13:57:01 yes
13:57:03 for example, profile/policy validation
13:57:18 that means one or two APIs to be added
13:57:34 when someone has cycles to work on it, we can add it to the etherpad
13:57:51 the same applies to all the other topics we discussed during the summit
13:58:33 cool
13:58:40 that's all from my side
13:58:51 two mins left for free discussion
13:58:59 #topic open topics
13:58:59 that was a good discussion there in Austin
13:59:15 yep :)
13:59:33 Anyone can send me anything they would like proofread for English.
13:59:41 okay, we successfully used up the 1-hour slot :)
13:59:50 bye
13:59:53 thanks, cschulz_
14:00:08 thanks everyone for joining
14:00:12 #endmeeting
14:00:15 thanks
14:00:18 bye
14:00:35 bye
14:00:35 * regXboi finds a corner in the room and quietly snores
14:00:39 cannot end meeting
14:00:49 ..
14:01:16 nickname occupied I think
14:01:51 hello networking nerds
14:02:02 o/
14:02:02 o/
14:02:04 #endmeeting