08:00:23 <ifat_afek> #startmeeting vitrage
08:00:24 <openstack> Meeting started Wed Nov 22 08:00:23 2017 UTC and is due to finish in 60 minutes.  The chair is ifat_afek. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:00:25 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:00:27 <openstack> The meeting name has been set to 'vitrage'
08:00:29 <ifat_afek> Hi :-)
08:00:51 <yujunz> Hi o/
08:01:16 <eyalb> (ό‿ὸ)ノ
08:02:04 <idan_hefetz> Yo :)
08:04:26 <ifat_afek> Today’s agenda:
08:04:33 <ifat_afek> •	Status and Updates
08:04:40 <ifat_afek> •	Open Discussion
08:04:46 <ifat_afek> #topic Status and Updates
08:05:03 <ifat_afek> I don’t have many updates of my own. I started looking into the change required to make the Vitrage-Mistral integration fully functional.
08:05:12 <ifat_afek> We need to pass one of the matched resources from the condition to Mistral as a parameter. For example:
08:05:21 <ifat_afek> - action:
08:05:22 <ifat_afek>     action_type: execute_mistral
08:05:23 <ifat_afek>     properties:
08:05:24 <ifat_afek>       workflow: migrate_server
08:05:26 <ifat_afek>       input:
08:05:27 <ifat_afek>         server_to_migrate: ${instance}
08:05:27 <ifat_afek>         another_input: blabla
08:05:35 <ifat_afek> The missing part is the ${instance}
08:06:07 <ifat_afek> That’s it for me
08:06:10 <yujunz> What is the ${...} here? yaml syntax?
08:06:34 <ifat_afek> The syntax that I plan to propose. We already have it in nagios_conf.yaml
08:06:42 <ifat_afek> It means: take the instance from the condition
08:06:57 <ifat_afek> Or maybe it’s ok without the $ ?…
08:07:06 <ifat_afek> As in action_target
08:07:19 <ifat_afek> Well, I need to give it some more thought…
08:07:31 <yujunz> Yes, I just want to mention action_target
08:08:18 <ifat_afek> But we need to distinguish (in the code) between the ‘blabla’ in the example, and the ‘instance’ which is a reference to a matched instance in the graph
08:08:23 <nivolas> hey
08:09:16 <ifat_afek> So I thought that the ${instance} will make it more clear
08:09:24 <ifat_afek> In action_target, it is always a reference
08:09:27 <yujunz> I would prefer to keep consistency. Currently, references in a vitrage template do not require an extra token
08:09:46 <ifat_afek> Right, but how can you distinguish between these two?
08:09:56 <ifat_afek> server_to_migrate: instance
08:10:02 <ifat_afek> another_input: blabla
08:10:05 <yujunz> We have referenced entity and relationship in condition as well
08:10:40 <ifat_afek> But how can you tell that blabla is a string, and not a missing reference (that should be detected by the template validator)?
08:11:01 <yujunz> Need to think more about it :-)
08:11:18 <yujunz> Maybe ${..} is the only choice... I'm not sure
08:12:19 <ifat_afek> Ok, let me know if you have a better idea
08:12:39 <ifat_afek> That’s it for me
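The distinction discussed above (a `${instance}` reference versus a literal such as `blabla`) could be sketched roughly as follows. This is only an illustration of the proposed syntax, not Vitrage code: the function name `resolve_inputs`, the regular expression, and the shape of the `matched_entities` dictionary are all assumptions made for the example.

```python
import re

# Assumption: a value that is exactly "${name}" is a reference to a matched
# entity; anything else is passed through as a literal.
_REFERENCE = re.compile(r"^\$\{(\w+)\}$")


def resolve_inputs(inputs, matched_entities):
    """Replace ${name} values with the matched entity; keep literals as-is."""
    resolved = {}
    for key, value in inputs.items():
        match = _REFERENCE.match(value) if isinstance(value, str) else None
        if match is None:
            resolved[key] = value  # plain literal such as 'blabla'
            continue
        name = match.group(1)
        if name not in matched_entities:
            # A template validator could report this at load time instead.
            raise KeyError("unknown reference: %s" % name)
        resolved[key] = matched_entities[name]
    return resolved


# Hypothetical data: the entity matched by the condition under the id 'instance'.
inputs = {"server_to_migrate": "${instance}", "another_input": "blabla"}
matched = {"instance": {"vitrage_id": "abc-123"}}
result = resolve_inputs(inputs, matched)
```

With an explicit token, a misspelled reference such as `${instanse}` fails loudly, while `blabla` is never mistaken for a missing reference.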
08:13:42 <yujunz> Short update from me
08:13:59 <yujunz> The blueprint of proactive RCA is almost done
08:14:08 <yujunz> #link https://review.openstack.org/#/c/519264/
08:14:18 <yujunz> Please let me know if you have additional questions
08:14:47 <yujunz> The detailed implementation will be proposed in sub blueprints. This one will explain the general ideas of this evolution.
08:15:25 <yujunz> And I recommend the reviewer read the HTML output along with rst source
08:15:27 <ifat_afek> I’m probably ok with it, but would like to give it a little more thought. And I also want more people to review it...
08:15:28 <yujunz> #link http://logs.openstack.org/64/519264/15/check/build-openstack-sphinx-docs/298cb8f/html/specs/queens/approved/proactive-rca.html
08:16:19 <ifat_afek> The HTML format definitely helps
08:16:40 <yujunz> Another patch helps you write rst
08:16:42 <yujunz> #link https://review.openstack.org/#/c/521410/
08:16:57 <yujunz> It will build html automatically on any changes
08:17:13 <ifat_afek> Cool, I will review it later
08:17:20 <yujunz> No need to run tox -edoc and refresh the browser again and again
08:17:42 <yujunz> That's all from my side
08:19:08 <idan_hefetz> This link is helpful. I have to say the blueprint was not very clear to me. I hope the sub blueprints will help me understand the change better.
08:20:03 <yujunz> Sure idan_hefetz
08:20:33 <idan_hefetz> Ok, i'll update
08:20:43 <idan_hefetz> The event api now allows creating custom alarms:
08:20:47 <idan_hefetz> #link https://review.openstack.org/#/c/521775/
08:20:52 <idan_hefetz> So we can now easily create dummy alarms using this cli:
08:20:59 <idan_hefetz> vitrage event post --type 'compute.bad.stuff' --details '{"hostname": "compute-0-0","source": "sample_monitor","cause": "another alarm","severity": "critical","status":"down","monitor_id": "sample monitor","monitor_event_id": "456"}'
08:21:08 <ifat_afek> Cool!!
08:21:23 <idan_hefetz> :)
08:21:26 <idan_hefetz> It will also be available in tempest, using the vitrage_utils.py method 'generate_fake_host_alarm'
08:21:36 <idan_hefetz> #link https://review.openstack.org/#/c/521851/
08:21:42 <idan_hefetz> Also regarding tempest, please see the folder 'vitrage/vitrage_tempest_tests/tests/common/'
08:21:49 <idan_hefetz> It contains many utility functions for testing, please use and add your own.
08:22:13 <yujunz> Cool. This really helps during local development
08:23:00 <idan_hefetz> use the status 'up' or 'down' to raise/remove
08:23:12 <ifat_afek> Note that you should use the ‘doctor’ datasource for this. Make sure you have it in your vitrage.conf
08:24:01 <idan_hefetz> That's it for me
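For local experiments, the command line shown above could be assembled from Python so that the `--details` JSON is always well formed. This is a minimal sketch: only the `vitrage event post --type ... --details ...` form is taken from the log, while the helper name `build_event_post_command` and its parameters are made up for the example.

```python
import json
import shlex


def build_event_post_command(event_type, hostname, status, **extra):
    """Build the 'vitrage event post' command line as a shell-safe string."""
    # 'status' is 'down' to raise the alarm and 'up' to remove it.
    details = {"hostname": hostname, "status": status}
    details.update(extra)
    return "vitrage event post --type {} --details {}".format(
        shlex.quote(event_type), shlex.quote(json.dumps(details)))


raise_cmd = build_event_post_command(
    "compute.bad.stuff", "compute-0-0", "down",
    source="sample_monitor", severity="critical")
clear_cmd = build_event_post_command("compute.bad.stuff", "compute-0-0", "up")
```

The two strings differ only in the `status` field (plus the optional extras), which mirrors the raise/remove behaviour described above.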
08:25:48 <ifat_afek> An update on behalf of Muhamad Najjar: he is about to finish the persistor implementation
08:25:55 <dwj> It seems that 'status' in detail is positional.
08:27:08 <idan_hefetz> what do you mean?
08:28:14 <dwj> It seems that 'status' in detail is always required. :)
08:28:35 <yujunz> A mandatory attribute?
08:28:37 <ifat_afek> dwj: do you see a real use case where it doesn’t appear?
08:29:29 <ifat_afek> Without the status, we need to write in Vitrage: if event_type==‘compute.host.down’ then raise alarm; else if event_type==‘compute.host.up’ then clear the ‘compute.host.down’ alarm…
08:30:03 <yujunz> What is this event from? Is the key always named "status"?
08:30:15 <yujunz> Or it is an internal vitrage event format?
08:30:36 <ifat_afek> The format was copied&pasted from the Doctor SB API definition
08:31:05 <ifat_afek> And I believe that dwj is right and status is not mandatory
08:31:41 <dwj> yes, i know. I'm thinking about other monitors using the event API, maybe they do not use 'status'.
08:32:26 <idan_hefetz> I will try to address this issue in the commit.
08:32:28 <idan_hefetz> https://review.openstack.org/#/c/521775/
08:32:46 <idan_hefetz> we can continue the discussion there
08:33:00 <ifat_afek> I guess we can just ignore events that don’t have a status, instead of failing. But we won’t raise or clear an alarm in this case (we won’t know what to do)
08:33:01 <dwj> OK, thanks~
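The behaviour suggested above (raise on 'down', clear on 'up', ignore events that carry no status instead of failing) might look roughly like this. It is a sketch under stated assumptions, not the actual datasource code: `handle_event`, the `active_alarms` dictionary, and the (hostname, event type) key are hypothetical.

```python
def handle_event(event_type, details, active_alarms):
    """Return 'raised', 'cleared' or 'ignored' for a single event."""
    status = details.get("status")
    # Assumption: an alarm is identified by the host and the event type.
    key = (details.get("hostname"), event_type)
    if status == "down":
        active_alarms[key] = details
        return "raised"
    if status == "up":
        active_alarms.pop(key, None)
        return "cleared"
    # No (or unknown) status: we do not know what to do, so do nothing.
    return "ignored"


alarms = {}
first = handle_event("compute.host.down",
                     {"hostname": "compute-0-0", "status": "down"}, alarms)
second = handle_event("compute.host.down",
                      {"hostname": "compute-0-0"}, alarms)
third = handle_event("compute.host.down",
                     {"hostname": "compute-0-0", "status": "up"}, alarms)
```

An event without a status leaves the set of active alarms untouched, so a monitor that does not send the field cannot raise or clear anything by accident.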
08:33:26 <ifat_afek> Does anyone else have an update?
08:34:45 <yujunz> I'll leave something to open discussion :-)
08:35:00 <ifat_afek> I have something for open discussion too :-)
08:35:07 <ifat_afek> #topic Open Discussion
08:35:18 <ifat_afek> yujunz: you go first
08:35:42 <yujunz> Ok, I now have some idea for implementing an old blueprint
08:35:45 <yujunz> #link https://specs.openstack.org/openstack/vitrage-specs/specs/pike/approved/datasource-skeleton-generator.html
08:36:04 <ifat_afek> I remember this one
08:36:05 <yujunz> Shall I move it to queens/approved before working on it?
08:36:14 <ifat_afek> Yes, I think so
08:36:35 <yujunz> I want to utilize cookiecutter for it
08:36:37 <yujunz> #link https://github.com/audreyr/cookiecutter
08:37:14 <yujunz> It is used for creating project skeletons such as Python packages and jQuery plugins
08:37:24 <ifat_afek> Looks cool
08:37:27 <yujunz> Should be suitable for data source as well
08:37:38 <ifat_afek> Go for it
08:38:08 <yujunz> Another thing is how frequently we should follow up on the prioritized goals in queens
08:38:32 <ifat_afek> Good question. Do you want to go over the list now?
08:38:41 <yujunz> As you wish :-)
08:38:56 <yujunz> I think if weekly is too frequent, we can make it monthly
08:38:57 <ifat_afek> Ok
08:39:06 <ifat_afek> #link https://etherpad.openstack.org/p/vitrage-ptg-queens
08:39:23 <ifat_afek> Must
08:39:23 <ifat_afek> Queens goals - ifat_afek
08:39:25 <ifat_afek> Move tempest tests to a separate repository
08:39:26 <ifat_afek> Policies in the code
08:39:27 <ifat_afek> Rewrite Aodh datasource (Ceilometer API is being removed these days) - Wenjuan
08:39:28 <ifat_afek> Support Networkx 2.0 - if the global requirements are changed to adopt the new version, we must support it - Nokia
08:39:31 <ifat_afek> So let’s go over it
08:39:44 <ifat_afek> Queens goals: the policy refactoring is over
08:40:16 <ifat_afek> The tempest tests - idan_hefetz and nivo are working on tempest refactoring and fixes, so I’m waiting for them to finish first
08:40:25 <ifat_afek> But it’s not a big task, so I’m not worried about it
08:40:54 <ifat_afek> dwj: are you still here? the rewrite Aodh datasource is over, right?
08:41:05 <dwj> yes, it's done.
08:41:28 <ifat_afek> Networkx 2.0: I know that annarez started working on it, but she didn’t finish and she is now in New Zealand :-) one of us will have to take over this task
08:41:50 <ifat_afek> Looks like the ‘musts’ are in a very good state
08:42:00 <yujunz> Yeah \o/
08:42:31 <ifat_afek> Very Important
08:42:31 <ifat_afek> Improve the tempest tests - Nokia
08:42:33 <ifat_afek> Make sure all existing tests run in the gate
08:42:34 <ifat_afek> Add tempest tests to python-vitrageclient
08:42:35 <ifat_afek> Should wait for the zuul migration to finish ?
08:42:36 <ifat_afek> what about vitrage-dashboard ? need to understand how to test ui ?
08:42:43 <ifat_afek> copy&paste in stages since it’s a long list
08:43:10 <ifat_afek> Run existing tests in the gate - in progress (idan_hefetz and nivo)
08:43:22 <ifat_afek> tempest in python-vitrageclient - not started
08:43:28 <ifat_afek> same for vitrage-dashboard
08:43:42 <yujunz> Anyone assigned for those two?
08:43:48 <ifat_afek> Configurable notifications API - in progress
08:43:58 <ifat_afek> yujunz: no
08:44:11 <ifat_afek> I believe most of us are working on more urgent issues
08:44:31 <yujunz> OK. I'm just asking. I'm not familiar with these fields so can't help much :-(
08:45:08 <ifat_afek> Me neither… eyalb is familiar with the python-vitrageclient. Regarding the vitrage-dashboard, none of us knows how to handle it, or whether there is a standard way to do it at all
08:45:22 <ifat_afek> Moving on
08:45:23 <ifat_afek> HA - Muhamad Najjar
08:45:28 <ifat_afek> This one is in progress
08:45:52 <ifat_afek> Parallel evaluation of vitrage templates - done (idan_hefetz)
08:46:01 <ifat_afek> Bi-directional deducing - ZTE
08:46:13 <ifat_afek> Template simplification
08:46:13 <ifat_afek> Suspect status
08:46:15 <ifat_afek> Diagnostic action
08:46:16 <ifat_afek> Equivalence and aggregation - ZTE
08:46:18 <ifat_afek> Resource equivalence
08:46:19 <ifat_afek> Edge equivalence
08:46:21 <ifat_afek> Aggregation of equivalent entities (resources, edges, alarms)
08:46:25 <ifat_afek> I believe these are all in progress, right?
08:46:38 <yujunz> Yes. Most of them are covered by proactive RCA
08:47:02 <ifat_afek> Ok. Done with the very important. Moving to important
08:47:10 <ifat_afek> alarm history
08:47:11 <ifat_afek> Discovery agent - Nir Cohen
08:47:12 <ifat_afek> Integration with doctor - Wenjuan
08:47:14 <ifat_afek> support TripleO
08:47:15 <ifat_afek> call Nova reset-server-state
08:47:17 <ifat_afek> Use case repository - Liyin
08:47:18 <ifat_afek> fault model for openstack core services such as nova, neutron, etc.
08:47:18 <ifat_afek> SNMP parsing service - Peipei
08:47:27 <ifat_afek> alarm history - depends on the HA. I believe we will get to it only in Rocky
08:47:46 <ifat_afek> Discovery agent - not started
08:48:01 <ifat_afek> Integration with Doctor - in progress
08:48:09 <ifat_afek> support TripleO - not started
08:48:24 <ifat_afek> call Nova reset-server-state - in progress
08:48:37 <ifat_afek> Use case repository - yujunz, do you know anything about it?
08:48:58 <yujunz> I think it is not started. Could you confirm with Liyin? peipei
08:49:03 <ifat_afek> SNMP parsing service - spec was approved
08:49:38 <ifat_afek> yujunz, is it you updating the etherpad? :-)
08:49:42 <ifat_afek> Good idea
08:49:48 <ifat_afek> And regarding the medium priority:
08:49:48 <yujunz> Yes, it's me
08:50:09 <ifat_afek> Template CRUD - idan_kinory is working on it
08:50:16 <ifat_afek> The others were not started
08:50:22 <ifat_afek> I believe that’s it?
08:50:39 <yujunz> Sounds good
08:50:57 <ifat_afek> I feel that we are in a very good state, compared to the previous releases
08:51:14 <yujunz> So do I. A good start in queens
08:51:33 <ifat_afek> :-)
08:52:04 <ifat_afek> Ok, we are almost running out of time, but I have two issues to discuss. We don’t have time for a real discussion, but at least we can think about them offline
08:52:18 <ifat_afek> I got an email from Greg Waines from WindRiver. They are thinking about new features in Vitrage.
08:52:32 <ifat_afek> One of them might be related to the alarm aggregation blueprint by Yujunz. It might also relate to the ‘host maintenance’ feature by Doctor.
08:52:43 <ifat_afek> WindRiver would like to clear all alarms on the host and its contained resources, in case the host is being rebooted.
08:52:55 <ifat_afek> My initial thoughts were that maybe the API function that returns the aggregated graph will filter out these alarms. Not sure exactly how, because ‘host reboot’ is an event and not an alarm. If you have any other ideas, let’s discuss them.
08:53:22 <ifat_afek> Another issue by WindRiver is that they wish to store and query events in Vitrage.
08:53:28 <ifat_afek> Examples:
08:53:35 <ifat_afek> •	new compute hosts being added to system,
08:53:35 <ifat_afek> •	Administrative lock / unlock of host
08:53:36 <ifat_afek> •	Administrative request for switch of activity between Active/Standby Controllers,
08:53:37 <ifat_afek> •	software patch applied to host
08:53:38 <ifat_afek> •	nova instance evacuated
08:53:58 <ifat_afek> I think such events could be useful for better RCA, but we need to think how they can be combined with the alarms mechanism.
08:54:24 <ifat_afek> Let me know if you have any feedback now, or we can discuss it in the next meeting (or maybe they’ll suggest a blueprint)
08:55:31 <yujunz> I think we may need to take "event" into vitrage template as well
08:55:47 <ifat_afek> Could be
08:55:56 <yujunz> In addition to the current entities "resource" and "alarm"
08:56:19 <ifat_afek> Good idea
08:56:21 <yujunz> It may not be present in the graph, but there needs to be some way to define actions on events
08:56:37 <ifat_afek> And we need to relate the events to the resources
08:56:42 <ifat_afek> Somehow
08:56:47 <yujunz> Or it could be in graph. I'm not sure yet
08:56:53 <ifat_afek> And show an event list, but that’s the easy part
08:57:58 <ifat_afek> Another question is when to clear events. Unlike alarms, they will never disappear on their own.
08:58:45 <ifat_afek> And after a certain time, they should no longer be considered for the RCA (host reboot that happened yesterday is irrelevant, right?). So we might need to introduce a time window concept
08:58:49 <ifat_afek> Interesting :-)
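The time window idea raised above could be sketched as a simple filter: only events that happened within a configured window before "now" are considered for RCA. Everything here is an assumption made for illustration, including the function name `events_in_window`, the event dictionary layout, and the one-hour window.

```python
from datetime import datetime, timedelta


def events_in_window(events, now, window):
    """Keep only events whose age is between zero and the window length."""
    return [
        event for event in events
        if timedelta(0) <= now - event["timestamp"] <= window
    ]


# Hypothetical events: one reboot ten minutes ago, one a day ago.
now = datetime(2017, 11, 22, 9, 0)
events = [
    {"type": "host.reboot", "timestamp": now - timedelta(minutes=10)},
    {"type": "host.reboot", "timestamp": now - timedelta(days=1)},
]
recent = events_in_window(events, now, timedelta(hours=1))
```

Here only the reboot from ten minutes ago remains in `recent`; yesterday's reboot falls outside the window and would no longer be considered.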
08:59:19 <ifat_afek> Well, we should end the meeting in 1 minute
08:59:26 <ifat_afek> Bye :-)
08:59:29 <eyalb> bye
08:59:31 <yujunz> Bye
08:59:43 <ifat_afek> #endmeeting