08:00:23 #startmeeting vitrage
08:00:24 Meeting started Wed Nov 22 08:00:23 2017 UTC and is due to finish in 60 minutes. The chair is ifat_afek. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:27 The meeting name has been set to 'vitrage'
08:00:29 Hi :-)
08:00:51 Hi o/
08:01:16 (ό‿ὸ)ﾉ
08:02:04 Yo :)
08:04:26 Today’s agenda:
08:04:33 • Status and Updates
08:04:40 • Open Discussion
08:04:46 #topic Status and Updates
08:05:03 I don’t have many updates of my own. I started looking into the change required to make the Vitrage-Mistral integration fully functional.
08:05:12 We need to pass one of the matched resources from the condition to Mistral as a parameter. For example:
08:05:21 - action:
08:05:22     action_type: execute_mistral
08:05:23     properties:
08:05:24       workflow: migrate_server
08:05:26       input:
08:05:27         server_to_migrate: ${instance}
08:05:27         another_input: blabla
08:05:35 The missing part is the ${instance}
08:06:07 That’s it for me
08:06:10 What is the ${...} here? yaml syntax?
08:06:34 The syntax that I plan to propose. We already have it in nagios_conf.yaml
08:06:42 It means: take the instance from the condition
08:06:57 Or maybe it’s ok without the $ ?…
08:07:06 As in action_target
08:07:19 Well, I need to give it some more thought…
08:07:31 Yes, I just wanted to mention action_target
08:08:18 But we need to distinguish (in the code) between the ‘blabla’ in the example, and the ‘instance’ which is a reference to a matched instance in the graph
08:08:23 hey
08:09:16 So I thought that the ${instance} would make it clearer
08:09:24 In action_target, it is always a reference
08:09:27 I would prefer to keep consistency. Currently, references in Vitrage templates do not require an extra token
08:09:46 Right, but how can you distinguish between these two?
08:09:56 server_to_migrate: instance
08:10:02 another_input: blabla
08:10:05 We reference entities and relationships in the condition as well
08:10:40 But how can you tell that blabla is a string, and not a missing reference (that should be detected by the template validator)?
08:11:01 Need to think more about it :-)
08:11:18 Maybe ${..} is the only choice... I'm not sure
08:12:19 Ok, let me know if you have a better idea
08:12:39 That’s it for me
08:13:42 Short update from me
08:13:59 The blueprint of proactive RCA is almost done
08:14:08 #link https://review.openstack.org/#/c/519264/
08:14:18 Please let me know if you have additional questions
08:14:47 The detailed implementation will be proposed in sub-blueprints. This one explains the general ideas of this evolution.
08:15:25 And I recommend that reviewers read the HTML output along with the rst source
08:15:27 I’m probably ok with it, but would like to give it a little more thought. And I also want more people to review it...
08:15:28 #link http://logs.openstack.org/64/519264/15/check/build-openstack-sphinx-docs/298cb8f/html/specs/queens/approved/proactive-rca.html
08:16:19 The HTML format definitely helps
08:16:40 Another patch helps you write rst
08:16:42 #link https://review.openstack.org/#/c/521410/
08:16:57 It will build the HTML automatically on any change
08:17:13 Cool, I will review it later
08:17:20 No need to run tox -edoc and refresh the browser again and again
08:17:42 That's all from my side
08:19:08 This link is helpful. I have to say the blueprint was not very clear to me. I hope the sub-blueprints will help me understand the change better.
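Editor's note on the ${instance} question raised earlier in this log (08:05 to 08:11): the sketch below shows one way the code could tell a reference apart from a plain string, assuming the ${...} form is adopted. It is only an illustration, not Vitrage code. The function name resolve_inputs, the matched_entities dictionary, and the 'vm-1' value are all hypothetical.

```python
import re

# Hypothetical convention: a value that is exactly ${name} is a reference to an
# entity matched by the condition; any other value is a literal.
_REFERENCE = re.compile(r"\$\{(\w+)\}")


def resolve_inputs(inputs, matched_entities):
    """Return a copy of inputs with ${name} values replaced by matched entities.

    Raises ValueError for a reference that the condition did not match, which
    is the kind of error a template validator could report.
    """
    resolved = {}
    for key, value in inputs.items():
        match = _REFERENCE.fullmatch(value) if isinstance(value, str) else None
        if match is None:
            resolved[key] = value  # plain literal such as 'blabla'
            continue
        name = match.group(1)
        if name not in matched_entities:
            raise ValueError("unknown reference: " + name)
        resolved[key] = matched_entities[name]
    return resolved


inputs = {"server_to_migrate": "${instance}", "another_input": "blabla"}
print(resolve_inputs(inputs, {"instance": "vm-1"}))
```

With the explicit token, 'blabla' is never mistaken for a missing reference, which is the ambiguity discussed at 08:10:40. Without the token, the validator would need some other rule to decide which values are references.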
08:20:03 Sure idan_hefetz
08:20:33 Ok, i'll update
08:20:43 The event api now allows creating custom alarms:
08:20:47 #link https://review.openstack.org/#/c/521775/
08:20:52 So we can now easily create dummy alarms using this cli:
08:20:59 vitrage event post --type 'compute.bad.stuff' --details '{"hostname": "compute-0-0","source": "sample_monitor","cause": "another alarm","severity": "critical","status":"down","monitor_id": "sample monitor","monitor_event_id": "456"}'
08:21:08 Cool!!
08:21:23 :)
08:21:26 It will also be available in tempest using the vitrage_utils.py method 'generate_fake_host_alarm'
08:21:36 #link https://review.openstack.org/#/c/521851/
08:21:42 Also regarding tempest, please see the folder 'vitrage/vitrage_tempest_tests/tests/common/'
08:21:49 It contains many utility functions for testing, please use and add your own.
08:22:13 Cool. This really helps during local development
08:23:00 use the status 'up' or 'down' to raise/remove
08:23:12 Note that you should use the ‘doctor’ datasource for this. Make sure you have it in your vitrage.conf
08:24:01 That's it for me
08:25:48 An update on behalf of Muhamad Najjar: he is about to finish the persistor implementation
08:25:55 It seems that 'status' in detail is positional.
08:27:08 what do you mean?
08:28:14 It seems that 'status' in detail is always required. :)
08:28:35 A mandatory attribute?
08:28:37 dwj: do you see a real use case where it doesn’t appear?
08:29:29 Without the status, we need to write in Vitrage: if event_type==‘compute.host.down’ then raise alarm; else if event_type==‘compute.host.up’ then clear the ‘compute.host.down’ alarm…
08:30:03 What is this event from? Is the key always named "status"?
08:30:15 Or is it an internal vitrage event format?
08:30:36 The format was copied and pasted from the Doctor SB API definition
08:31:05 And I believe that dwj is right and status is not mandatory
08:31:41 yes, i know.
I'm thinking about other monitors using the event API; maybe they do not use 'status'.
08:32:26 I will try to address this issue in the commit.
08:32:28 https://review.openstack.org/#/c/521775/
08:32:46 we can continue the discussion there
08:33:00 I guess we can just ignore events that don’t have a status, instead of failing. But we won’t raise or clear an alarm in this case (we won’t know what to do)
08:33:01 OK, thanks~
08:33:26 Does anyone else have an update?
08:34:45 I'll leave something to open discussion :-)
08:35:00 I have something for open discussion too :-)
08:35:07 #topic Open Discussion
08:35:18 yujunz: you go first
08:35:42 Ok, I now have some ideas for implementing an old blueprint
08:35:45 #link https://specs.openstack.org/openstack/vitrage-specs/specs/pike/approved/datasource-skeleton-generator.html
08:36:04 I remember this one
08:36:05 Shall I move it to queens/approved before working on it?
08:36:14 Yes, I think so
08:36:35 I want to utilize cookiecutter for it
08:36:37 #link https://github.com/audreyr/cookiecutter
08:37:14 It is used for creating project skeletons such as Python packages and jQuery plugins
08:37:24 Looks cool
08:37:27 Should be suitable for data sources as well
08:37:38 Go for it
08:38:08 Another thing is how frequently we should follow up on the prioritized goals in queens
08:38:32 Good question. Do you want to go over the list now?
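Editor's note on the 'status' discussion earlier in this log (08:23 to 08:33): the proposal was to ignore events that carry no status rather than fail on them. A minimal sketch of that rule follows. The function name alarm_action and the returned strings are hypothetical, and the mapping of 'down' to raise and 'up' to clear is an assumption based on the 08:29:29 message, not confirmed Vitrage behavior.

```python
def alarm_action(details):
    """Map an event's details to an alarm action.

    Returns 'raise' for status 'down', 'clear' for status 'up', and None when
    the status is missing or unrecognized, so the caller can skip the event
    instead of failing.
    """
    status = details.get("status")
    if status == "down":
        return "raise"
    if status == "up":
        return "clear"
    return None


print(alarm_action({"hostname": "compute-0-0", "status": "down"}))
print(alarm_action({"hostname": "compute-0-0"}))
```

Returning None keeps the decision to ignore an event explicit at the call site, which fits monitors that post events without a status field.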
08:38:41 As you wish :-)
08:38:56 I think if weekly is too frequent, we can make it monthly
08:38:57 Ok
08:39:06 #link https://etherpad.openstack.org/p/vitrage-ptg-queens
08:39:23 Must
08:39:23 Queens goals - ifat_afek
08:39:25 Move tempest tests to a separate repository
08:39:26 Policies in the code
08:39:27 Rewrite Aodh datasource (Ceilometer API is being removed these days) - Wenjuan
08:39:28 Support Networkx 2.0 - if the global requirements are changed to adopt the new version, we must support it - Nokia
08:39:31 So let’s go over it
08:39:44 Queens goals: the policy refactoring is over
08:40:16 The tempest tests - idan_hefetz and nivo are working on tempest refactoring and fixes, so I’m waiting for them to finish first
08:40:25 But it’s not a big task, so I’m not worried about it
08:40:54 dwj: are you still here? the rewrite Aodh datasource is over, right?
08:41:05 yes, it's done.
08:41:28 Networkx 2.0: I know that annarez started working on it, but she didn’t finish and she is now in New Zealand :-) one of us will have to take over this task
08:41:50 Looks like the ‘musts’ are in a very good state
08:42:00 Yeah \o/
08:42:31 Very Important
08:42:31 Improve the tempest tests - Nokia
08:42:33 Make sure all existing tests run in the gate
08:42:34 Add tempest tests to python-vitrageclient
08:42:35 Should wait for the zuul migration to finish ?
08:42:36 what about vitrage-dashboard ? need to understand how to test ui ?
08:42:43 copy&paste in stages since it’s a long list
08:43:10 Run existing tests in the gate - in progress (idan_hefetz and nivo)
08:43:22 tempest in python-vitrageclient - not started
08:43:28 same for vitrage-dashboard
08:43:42 Is anyone assigned to those two?
08:43:48 Configurable notifications API - in progress
08:43:58 yujunz: no
08:44:11 I believe most of us are working on more urgent issues
08:44:31 OK. I'm just asking. I'm not familiar with these fields so can't help much :-(
08:45:08 Me neither… eyalb is familiar with the python-vitrageclient.
Regarding the vitrage-dashboard, none of us knows how to handle it, or whether there is a standard way to do it at all
08:45:22 Moving on
08:45:23 HA - Muhamad Najjar
08:45:28 This one is in progress
08:45:52 Parallel evaluation of vitrage templates - done (idan_hefetz)
08:46:01 Bi-directional deducing - ZTE
08:46:13 Template simplification
08:46:13 Suspect status
08:46:15 Diagnostic action
08:46:16 Equivalence and aggregation - ZTE
08:46:18 Resource equivalence
08:46:19 Edge equivalence
08:46:21 Aggregation of equivalent entities (resources, edges, alarms)
08:46:25 I believe these are all in progress, right?
08:46:38 Yes. Most of them are covered by proactive RCA
08:47:02 Ok. Done with the very important. Moving to important
08:47:10 alarm history
08:47:11 Discovery agent - Nir Cohen
08:47:12 Integration with doctor - Wenjuan
08:47:14 support TripleO
08:47:15 call Nova reset-server-state
08:47:17 Use case repository - Liyin
08:47:18 fault model for openstack core services such as nova, neutron, etc.
08:47:18 SNMP parsing service - Peipei
08:47:27 alarm history - depends on the HA. I believe we will get to it only in Rocky
08:47:46 Discovery agent - not started
08:48:01 Integration with Doctor - in progress
08:48:09 support TripleO - not started
08:48:24 call Nova reset-server-state - in progress
08:48:37 Use case repository - yujunz, do you know anything about it?
08:48:58 I think it is not started. Could you confirm with Liyin? peipei
08:49:03 SNMP parsing service - spec was approved
08:49:38 yujunz, is it you updating the etherpad? :-)
08:49:42 Good idea
08:49:48 And regarding the medium priority:
08:49:48 Yes, it's me
08:50:09 Template CRUD - idan_kinory is working on it
08:50:16 The others were not started
08:50:22 I believe that’s it?
08:50:39 Sounds good
08:50:57 I feel that we are in a very good state, compared to the previous releases
08:51:14 So do I.
A good start in queens
08:51:33 :-)
08:52:04 Ok, we are almost running out of time, but I have two issues to discuss. We don’t have time for a real discussion, but at least we can think about them offline
08:52:18 I got an email from Greg Waines from WindRiver. They are thinking about new features in Vitrage.
08:52:32 One of them might be related to the alarm aggregation blueprint by Yujunz. It might also relate to the ‘host maintenance’ feature by Doctor.
08:52:43 WindRiver would like to clear all alarms on the host and its contained resources, in case the host is being rebooted.
08:52:55 My initial thoughts were that maybe the API function that returns the aggregated graph will filter out these alarms. Not sure exactly how, because ‘host reboot’ is an event and not an alarm. If you have any other ideas, let’s discuss them.
08:53:22 Another issue by WindRiver is that they wish to store and query events in Vitrage.
08:53:28 Examples:
08:53:35 • new compute hosts being added to the system,
08:53:35 • Administrative lock / unlock of host
08:53:36 • Administrative request for switch of activity between Active/Standby Controllers,
08:53:37 • software patch applied to host
08:53:38 • nova instance evacuated
08:53:58 I think such events could be useful for better RCA, but we need to think how they can be combined with the alarms mechanism.
08:54:24 Let me know if you have any feedback now, or we can discuss it in the next meeting (or maybe they’ll suggest a blueprint)
08:55:31 I think we may need to take "event" into vitrage template as well
08:55:47 Could be
08:55:56 In addition to the current entities "resource" and "alarm"
08:56:19 Good idea
08:56:21 It may not be present in the graph, but there needs to be some way to define actions on events
08:56:37 And we need to relate the events to the resources
08:56:42 Somehow
08:56:47 Or it could be in the graph. I'm not sure yet
08:56:53 And show an event list, but that’s the easy part
08:57:58 Another question is when to clear events.
Unlike alarms, they will never disappear on their own.
08:58:45 And after a certain time, they should no longer be considered for the RCA (a host reboot that happened yesterday is irrelevant, right?). So we might need to introduce a time window concept
08:58:49 Interesting :-)
08:59:19 Well, we should end the meeting in 1 minute
08:59:26 Bye :-)
08:59:29 bye
08:59:31 Bye
08:59:43 #endmeeting
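Editor's note on the time window idea from 08:58:45: since events never clear on their own, one option is to consider only events that fall inside a recent window when doing RCA. The sketch below illustrates that filter. Nothing here is from Vitrage: the function name events_in_window, the event dictionaries, and the event type strings are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def events_in_window(events, now, window):
    """Keep only events whose 'time' lies within `window` before `now`."""
    return [e for e in events if timedelta(0) <= now - e["time"] <= window]


now = datetime(2017, 11, 22, 9, 0, tzinfo=timezone.utc)
events = [
    {"type": "host.reboot", "time": now - timedelta(hours=25)},  # yesterday
    {"type": "host.lock", "time": now - timedelta(minutes=10)},
]

recent = events_in_window(events, now, timedelta(hours=1))
print([e["type"] for e in recent])
```

The window length is a policy choice that the meeting left open. A per-event-type window is another possibility, since a software patch may stay relevant longer than a reboot.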