15:00:52 #startmeeting monasca
15:00:52 hello all
15:00:53 Meeting started Wed Apr 10 15:00:52 2019 UTC and is due to finish in 60 minutes. The chair is witek. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:57 The meeting name has been set to 'monasca'
15:00:58 hi dougsz
15:01:17 Hi
15:01:23 hi chaconpiza
15:01:36 hi, everybody
15:01:53 hi
15:02:00 agenda for today:
15:02:03 https://etherpad.openstack.org/p/monasca-team-meeting-agenda
15:02:24 I don
15:02:35 sorry, let's start
15:02:47 #topic monasca-thresh replacement
15:03:08 I started thinking about how we can replace monasca-thresh
15:03:21 as we urgently need to replace it
15:04:08 and so I looked at how Prometheus and Aodh are doing this
15:04:31 and neither of them works on streams, they both query the DB
15:04:45 which is much easier to implement
15:05:18 and then I thought we could actually try to use what Prometheus offers
15:05:29 and came up with this document
15:05:35 https://docs.google.com/presentation/d/1tvllnWaridOG-t-qj9D2brddeQXsYNyZwoYUfby_3Ns/edit?usp=sharing
15:05:51 I've seen your first comments, thanks a lot for that
15:06:41 I'd like to start the discussion: what do you think of that approach? is it plausible?
15:07:33 maybe we can discuss smaller topics first? and then conclude whether it's plausible?
15:08:21 right, do we have to discuss whether monasca-thresh should be replaced?
15:09:03 What about the upgrade from the current solution to the new Prometheus-based one for current clients?
15:09:07 hi
15:09:37 hi Dobroslaw
15:09:57 chaconpiza: you mean, what an operator would have to do to upgrade from one Monasca version to another?
15:10:05 yes
15:10:47 I propose we discuss this (migration) later, once a decision has been taken
15:11:13 the measurement schema would change, so although the data is saved in InfluxDB, some migration would be needed if the new functionality is required
15:11:15 well, if we keep the monasca api and just use prometheus for the thresholding and alarming, it might not be much change for a current client
15:11:22 Regarding your problem statement, Witek: I agree with topics 1, 2 and 5.
15:11:35 4 (complex cluster): I can't really judge
15:12:22 topic 3 (high resource consumption): this is certainly true. However, I'm not sure if this is caused by Monasca itself or by Storm
15:12:38 I'm not sure if Prometheus will actually be lighter than Storm...
15:13:33 yes, would definitely want to qualify performance
15:13:37 and footprint
15:13:38 would be nice if we could find someone using Prometheus in production who could tell us how many resources it uses, on average and with data spikes
15:14:09 I found quite a few people complaining about memory usage
15:14:22 Dobroslaw: We're using it, I haven't benchmarked it yet, but I've frequently seen it at the top of `top`
15:15:00 Using "remote read" from Influx causes some further overhead - don't know to what extent
15:15:11 and it doesn't have built-in max memory tuning options
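
(For reference, a minimal sketch of the "remote read" setup mentioned above, in Prometheus's own YAML configuration. InfluxDB 1.x exposes a Prometheus-compatible read endpoint; the host, port and database name "mon" here are placeholders, not details from the meeting:)

    # prometheus.yml (sketch): let Prometheus fetch raw samples from InfluxDB
    # through its Prometheus-compatible remote read endpoint
    remote_read:
      - url: "http://influxdb:8086/api/v1/prom/read?db=mon"   # host and db name assumed
        read_recent: true   # query InfluxDB even for recent time ranges

With such a setup every rule evaluation pulls the raw samples for its lookback window from InfluxDB, which is where the extra overhead being discussed comes from.
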
15:15:52 In addition to extending the alarm expression language (#2), we also have a requirement to include metadata with alarms
15:15:59 I think I linked to a discussion about that, something like 10x more memory per measurement...
15:17:04 dougsz: where does the metadata come from, and can that requirement be addressed with Prometheus?
15:17:11 I've talked to a few people who have the impression that Prometheus has a smaller footprint than Monasca, but I suspect that is relative to their install (or just marketing speak)
15:18:36 witek: For example, we want to create a Jira ticket for every log error message. The metadata would include a snippet of the error message. Not sure if it can be done with Prometheus either. I think the approach would be to use something like mtail to make logs scrapable.
15:18:55 it's an invasive change, HA will need to be handled differently, and I'm not sure how to test it quickly with Monasca
15:20:47 Dobroslaw: what would be an alternative?
15:21:45 unfortunately I don't have an alternative, just raising an important point: Monasca would most likely be installed on the same machine as Prometheus
15:21:58 and share resources with it
15:22:38 we may need a POC to show it can be done...
15:22:59 remote read is for sure an important aspect; Prometheus normally makes use of built-in aggregations, and in the proposed setup the calculation would have to be done on the complete dataset
15:24:12 the complete dataset for a given alerting rule only, of course, normally the last 10 minutes of data or so
15:25:36 dougsz: how do you use Prometheus? do you have many alerting rules? how much data?
15:27:27 We aren't using it at scale yet and we don't have a large number of alerting rules.
15:27:53 We've combined it with mtail to generate metrics from log messages
15:28:44 Currently we use Prometheus as the TSDB, no Influx yet
15:30:02 We use kolla-ansible for the deployment - there are quite a few exporters included in that out of the box
15:31:33 yes, for the collector part we should advertise the monasca-agent Prometheus plugin better
15:32:07 thanks dougsz
15:32:29 +1 - I think that's a big win - Prometheus exporters are generally pretty up-to-date and it's great we can take advantage of the Monasca Agent.
15:32:51 bandorf has commented on the delay until the alarm gets triggered
15:32:56 is that an issue?
15:33:15 I think it's a good point.
15:33:56 is it a requirement for anyone?
15:34:29 I had a brief discussion with Cristiano (Product Management) about this. His opinion was: in a typical OpenStack environment it should be OK. In other scenarios (e.g. an IoT fire-alarm demo) it is not.
15:34:38 Generally we haven't used the buffering capabilities of Kafka too much, but it's slightly concerning that alarms could stop working if there was a large burst of metrics.
15:34:53 may depend on the use case. Some of the auto-scaling/self-healing scenarios may want faster alarming
15:36:14 to reduce downtimes and interruptions
15:37:51 I think a streaming-based implementation would be much more complicated, requiring knowledge of Kafka Streams or Apache Storm
15:38:09 or not scalable, like monasca-aggregator
15:39:35 the only way to scale the aggregator is to shard the data and consume from different Kafka topics
15:39:52 which is also a valid approach after all
15:40:52 I have another concern about the Prometheus-based setup
15:41:24 Prometheus defines all its alerting rules and notifications via config files
15:41:38 there is no API for setting them
15:41:58 only a query API to get the current configuration
15:42:12 yeah, that is a concern, especially if we do an HA setup (keeping the config files in sync)
15:42:55 does changing a rule then require restarting the Prometheus service?
15:43:03 reloading
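
(To illustrate the config-file based rule handling discussed above, a minimal sketch of a Prometheus alerting rule file; the metric name, labels and threshold are made up for the example:)

    # alert_rules.yml (sketch), referenced from rule_files: in prometheus.yml
    groups:
      - name: example-threshold-rules
        rules:
          - alert: HighCpuUsage                 # hypothetical alarm name
            expr: avg_over_time(cpu_percent{hostname="node-1"}[10m]) > 90
            for: 5m                             # condition must hold for 5 minutes
            labels:
              severity: critical
            annotations:
              summary: "CPU above 90% for 10 minutes on {{ $labels.hostname }}"

Changing such a file does not need a full restart: Prometheus re-reads its rule files on SIGHUP or on a POST to /-/reload (when started with --web.enable-lifecycle), and the loaded rules can only be inspected, read-only, via the /api/v1/rules endpoint.
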
15:44:32 ok, let's sum up what we have on advantages:
15:45:08 * great community ecosystem with many integrations
15:45:23 * very flexible alerting rules
15:45:47 * and a query language for visualisations
15:46:57 * easy deployment
15:47:31 anything else?
15:48:02 disadvantages:
15:48:03 * it could also monitor the Monasca components directly? e.g. alert if InfluxDB goes down
15:49:21 yes, I'm not sure if that's Prometheus specific
15:49:23 disadvantage: * potentially large footprint and resource usage
15:50:26 disadvantage: * no guaranteed delivery of metrics (a requirement for billing systems, not as much a concern for alerting)
15:50:41 * remote read requires getting complete data chunks from InfluxDB for every evaluation
15:51:02 disadvantage: * no native HA support, requires work to design
15:51:04 disadvantage: * HA model for the Prometheus server isn't totally clear (to me at least)
15:51:20 joadavis: well, with Kafka and InfluxDB we do get guaranteed delivery
15:52:04 disadvantage: * alerting chain is even more complex, e.g. Monasca API -> Kafka -> Persister -> InfluxDB -> Prometheus -> Alertmanager
15:52:11 disadvantage: * longer latency until the alarm gets fired
15:53:25 unknown: * impact of 'remote read to InfluxDB'
15:53:35 I would also argue about the HA model: it's the same model as for InfluxDB, and we can use the API and Kafka to help make it better
15:54:25 disadvantage: * no API for alerting rules and notifications, config-based operation
15:54:43 I have a question about whether this puts Cassandra out of our design, but we are short on time so we can save that for another day
15:55:46 for this setup we could not use Cassandra, it does not have remote read
15:56:14 OK, let's cut it here for now
15:56:28 let's quickly go through the other topics:
15:56:47 #topic Retirement of OpenStack-Ansible Monasca roles
15:56:55 http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004610.html
15:57:04 guimaluf: are you around?
15:57:33 unfortunately I don't know anyone using OSA
15:58:17 #topic Telemetry discussion
15:58:23 http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004851.html
15:58:52 there was a quick meeting of the Telemetry project yesterday
15:58:59 with the new PTL
15:59:26 after nobody had stood for PTL for Train
16:00:13 anyway, they have considered whether they should continue to rely on Gnocchi or search for alternatives
16:00:32 I want us to have a good response for that
16:00:51 I need to write a thoughtful email back and recommend monasca-ceilometer :)
16:01:12 as Mark has written in his email, it would be good to maintain just one monitoring project in OpenStack
16:01:26 was just thinking about ceilosca
16:01:33 but we could also have larger discussions about where the monasca agent and ceilometer agent overlap and how to make mon-agent cover it all
16:03:15 joadavis: do we want to sync on the answer to the mailing list?
16:03:35 sure. I can write a draft and send it to you, or you can
16:03:56 OK, I'll ping you offline
16:04:02 with these kinds of questions I start thinking in pictures, but that is hard to do in text emails
16:04:11 #topic PTG
16:04:32 we have a conflict with the self-healing session on the first day, Thursday
16:04:54 should we start our sessions on Friday?
16:05:02 and free the slot?
16:05:32 sounds sensible
16:05:53 +1
16:05:54 joadavis: chaconpiza ?
16:05:55 I'm not sure if chaconpiza will be returning on Friday
16:06:22 I will come back on Saturday, I found a good connecting flight :)
16:06:29 oh, great
16:06:30 I'm ok with that. I think one of our goals for this PTG should be working with other projects and SIGs
16:07:00 OK, thanks for joining today
16:07:05 and for the good discussion
16:07:26 next week I'm on vacation
16:07:46 so could someone else please start the meeting
16:08:12 that's all from me, bye
16:08:17 Thanks all, and have a good vacation
16:08:22 bye
16:08:26 bye
16:08:29 thank you, bye.
16:08:36 Ok, enjoy the vacation. Bye.
16:08:38 #endmeeting