15:30:14 <DinaBelova> #startmeeting Performance Team
15:30:15 <openstack> Meeting started Tue Dec 20 15:30:14 2016 UTC and is due to finish in 60 minutes.  The chair is DinaBelova. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:30:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:30:18 <openstack> The meeting name has been set to 'performance_team'
15:30:25 <DinaBelova> hey folks!
15:30:34 <akrzos> Hey DinaBelova
15:30:44 <rcherrueau> o/
15:30:48 <tovin07_> o/
15:31:03 <vbala> Hi
15:31:22 <DinaBelova> let's wait a few moments to ensure everyone who wanted to join has joined :)
15:32:16 <lezbar__> o/
15:32:26 <DinaBelova> hey lezbar__ o/
15:32:34 <DinaBelova> so I guess we may get started
15:32:35 <DinaBelova> #topic Action Items
15:32:45 <DinaBelova> last time we had only one action item on me
15:33:00 <DinaBelova> regarding verification of what grafana backend Mirantis is using
15:33:16 <DinaBelova> in fact we're using right now plain Prometheus with its own database
15:33:44 <DinaBelova> we plan to add persistent time series storage (e.g. Cassandra or OpenTSDB) a bit later
15:34:04 <DinaBelova> to store old monitoring data
15:34:10 <DinaBelova> and then we'll need to modify our grafana boards a bit
15:34:22 <DinaBelova> to grab data from it
15:34:29 <DinaBelova> but right now it's plain prometheus
15:34:57 <DinaBelova> I don't remember who was asking this question, I believe it might be you, akrzos
15:35:43 <DinaBelova> so we may proceed to the current progress
15:36:04 <DinaBelova> #topic Current progress on the planned tests
15:36:08 <DinaBelova> rcherrueau it looks like you're the only guy from Inria today :)
15:36:27 <rcherrueau> Yes, msimonin is on holiday, so I will speak for him/Inria.
15:36:36 <DinaBelova> rcherrueau cool :)
15:36:41 <DinaBelova> please go ahead
15:36:48 <rcherrueau> We are working on two things. First, deploying a multi-region OpenStack with kolla.
15:37:03 <rcherrueau> This almost works.
15:37:16 <DinaBelova> any issues met?
15:37:22 <DinaBelova> probably we may list bugs here
15:37:31 <DinaBelova> if any
15:37:55 <rcherrueau> We have something we call the Administrative Region (AR) that contains Keystone, MariaDB (with the Keystone tables) and Memcached.
15:38:10 <rcherrueau> This AR also contains one HAProxy since we deploy with kolla.
15:38:38 <rcherrueau> We then have n OpenStack Regions (OSRn), each containing Nova, Glance, Neutron, RabbitMQ, MariaDB and HAProxy
15:38:57 <rcherrueau> Each OSR registers itself with the AR Keystone. And when an operator connects to Horizon, they have to choose between all the OSRs
15:39:20 <rcherrueau> To do so, we had to patch kolla a little bit. We plan to send a mail to the kolla mailing list to share our experience with the community
15:40:04 <DinaBelova> so you have keystone separated from the OSRs into a separate region? just to make sure
15:40:06 <rcherrueau> So, no special issues except the patches we had to make to the kolla-ansible code.
15:40:32 <rcherrueau> Yes, exactly
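For illustration, a minimal sketch of how an OSR could register its endpoints against the central AR Keystone; this is a sketch only, not Inria's actual Kolla patch. The region name, service, and URL below are assumptions, and admin credentials for the AR Keystone are assumed to be set as OS_* environment variables:

    # Sketch only: register a hypothetical OpenStack Region (OSR1) and one of its
    # endpoints with the central AR Keystone via the standard openstack CLI.
    # Region name and endpoint URL are made up for illustration.
    import subprocess

    REGION = "OSR1"                                 # hypothetical region name
    NOVA_URL = "http://osr1.example.org:8774/v2.1"  # hypothetical Nova endpoint

    commands = [
        ["openstack", "region", "create", REGION],
        # service create is only needed once; later OSRs just add endpoints
        ["openstack", "service", "create", "--name", "nova", "compute"],
        ["openstack", "endpoint", "create", "--region", REGION,
         "compute", "public", NOVA_URL],
    ]
    for cmd in commands:
        subprocess.run(cmd, check=True)

Each OSR would repeat this for its other services (glance, neutron, ...), which is roughly what the Horizon region chooser then exposes.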
15:40:50 <DinaBelova> rcherrueau ok, and those regions might theoretically be located in different places
15:41:06 <rcherrueau> Yes, this is the idea
15:41:18 <DinaBelova> I think that keystone performance might be the issue in this case :/
15:41:28 <DinaBelova> although I think you'll test it anyway :)
15:41:54 <rcherrueau> Yes we will, and this brings us to the second thing we are working on
15:42:04 <DinaBelova> ok, thank you rcherrueau - please keep us updated regarding your experiments :)
15:42:07 <rcherrueau> At the same time we are adding `netem` to our deployment and test tool
15:42:07 <DinaBelova> and the second?
15:42:31 <rcherrueau> `netem` is a Linux tool that lets you emulate network latency, low bandwidth, packet loss ...
15:43:16 <akrzos> what about setting latency via tc?
15:43:26 <rcherrueau> The idea is to make several multi-region deployments on our G5k platform. Then use `netem` to simulate different locations with different latencies and bandwidths, and see how OpenStack behaves
15:43:42 <DinaBelova> #info Inria had to modify Kolla a bit to be able to proceed with their type of multisite deployment (Administrative Region and n OpenStack Regions)
15:43:59 <rcherrueau> akrzos: netem is tc ;)
15:44:04 <akrzos> ah
15:44:12 <akrzos> :D
15:44:31 <DinaBelova> #info the second part of the work is focused on adding `netem` to their deployment and test tool - to simulate different locations with different latencies and bandwidths and see how OpenStack behaves
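As a rough illustration of the netem side, a minimal sketch of shelling out to tc to emulate WAN conditions towards a remote region; the interface name and the delay/loss figures are assumptions, not Inria's actual tooling or numbers:

    # Sketch only: emulate WAN latency/jitter/loss towards a remote region
    # with tc/netem. Interface name and figures are illustrative assumptions.
    import subprocess

    IFACE = "eth0"  # hypothetical WAN-facing interface

    def add_wan_emulation(delay="50ms", jitter="10ms", loss="0.1%"):
        subprocess.run(
            ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
             "delay", delay, jitter, "loss", loss],
            check=True)

    def clear_wan_emulation():
        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

    if __name__ == "__main__":
        add_wan_emulation()
        # run the benchmark against the "remote" region, then clear_wan_emulation()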
15:44:37 <DinaBelova> ok, thanks rcherrueau
15:44:40 <rcherrueau> msimonin is working hard on this second part
15:44:59 <DinaBelova> hope to see him next week :)
15:45:34 <DinaBelova> akrzos any update from you sir? afair you got new HW for the telemetry testing :)
15:45:55 <akrzos> so I've been running into bottlenecks in the telemetry services
15:46:04 <akrzos> first was too few metricd workers
15:46:21 <akrzos> this is with 3 controllers, 4 ceph nodes, 10 computes
15:46:30 <akrzos> booted 1k instances
15:46:34 <DinaBelova> #info akrzos has started work on telemetry testing following the test plan - http://docs.openstack.org/developer/performance-docs/test_plans/telemetry_scale/plan.html
15:46:39 <akrzos> gnocchi backlog continuously grows
15:47:02 <akrzos> $os_Workers limits metricd workers to 6 on my controllers
15:47:07 <akrzos> (24 logical cpu cores)
15:47:21 <akrzos> so i redeployed overriding it with 48 workers
15:47:26 <akrzos> so 48 workers on each controller
15:47:41 <akrzos> so 144 total metricd workers
15:47:54 <akrzos> also reduced metric processing delay
15:47:59 <akrzos> from 60s to 30s
15:48:06 <akrzos> and 1k instances is now handled in realtime
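For reference, the tuning akrzos describes would typically land in the [metricd] section of gnocchi.conf along these lines (option names reflect my understanding of the Gnocchi releases of that time and should be double-checked against the deployed version):

    [metricd]
    # default is derived from the CPU count ($os_Workers capped it at 6 here);
    # overridden to 48 per controller, i.e. 144 metricd workers across 3 controllers
    workers = 48
    # pause between processing passes, reduced from the 60s default to 30s
    metric_processing_delay = 30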
15:48:11 <akrzos> in ceph there are 36 osds
15:48:19 <akrzos> also needed to tune pgs to avoid ceph health_warn
15:48:41 <akrzos> though the calculation for this is tricky using pgcalc
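For reference, a small sketch of the pgcalc-style rule of thumb (target roughly 100 PGs per OSD, scaled by the pool's share of the data, divided by the replica size, rounded to a power of two); the replica size and pool share below are assumptions for illustration, not the values from this environment:

    # Sketch of the common pgcalc-style rule of thumb for pg_num per pool.
    # Replica size and pool share are illustrative assumptions.
    import math

    def pg_num(osds, replica_size, pool_share=1.0, target_per_osd=100):
        raw = osds * target_per_osd * pool_share / replica_size
        return 2 ** max(round(math.log2(raw)), 0)  # nearest power of two, >= 1

    # e.g. 36 OSDs, 3x replication, a pool holding ~50% of the data
    print(pg_num(36, 3, pool_share=0.5))  # -> 512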
15:49:00 <akrzos> so with this tuning i can now sustain 1k instances in the cloud with gnocchi
15:49:10 <akrzos> on low archival policy
15:49:20 <akrzos> i attempted to scale further
15:49:26 <akrzos> (wanted 2k)
15:49:38 <akrzos> and got to ~1.9k before hitting new problems
15:49:46 <akrzos> load avg on controllers is >core count
15:50:04 <DinaBelova> wow
15:50:05 <DinaBelova> it's huge load
15:50:07 <akrzos> memory is rising in both rabbitmq and ceilometer-collector
15:50:14 <akrzos> at this scale now
15:50:19 <akrzos> also
15:50:23 <akrzos> to get to 1.9 k
15:50:29 <akrzos> i had to tune threads in gnocchi
15:50:54 <akrzos> aggregation worker threads default to 1
15:50:55 <DinaBelova> it looks like potentially for ~2k VMs gnocchi and rabbit need to be separated from each other onto different nodes - with more nodes given to the control plane side of the cloud
15:51:20 <akrzos> my concern now is that the collector's memory grows as i have seen in the past
15:51:37 <akrzos> i thought there was a patch put in to limit the # of messages it grabs off rabbit
15:51:43 <akrzos> to prevent growth
15:51:58 <akrzos> but i don't understand the problem enough right now
15:52:13 <DinaBelova> akrzos ack, thank you sir
15:52:16 <akrzos> so another factor
15:52:20 <akrzos> is the archival policy
15:52:35 <akrzos> high policy might actually mean fewer aggregations being "recalculated"
15:52:46 <akrzos> and could actually be a lower workload
15:52:58 <akrzos> due to a finer-grained "end" timeframe
15:53:10 <akrzos> so i should retest with a new archival policy
15:53:15 <akrzos> and maybe different number of aggregations
15:53:26 <akrzos> so lots to try still
15:53:43 <akrzos> another thing i can share with the community is a collectd plugin i wrote to monitor gnocchi backlog
15:54:04 <akrzos> #link https://review.openstack.org/#/c/411030/4/ansible/install/roles/collectd-openstack/files/collectd_gnocchi_status.py
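The review above is the real plugin; purely for orientation, a rough sketch of the general shape of such a collectd Python plugin follows, where the Gnocchi /v1/status response fields and the token handling are assumptions rather than what the linked code does:

    # Rough sketch of a collectd Python plugin reporting the Gnocchi backlog
    # (measures/metrics waiting to be processed). NOT the plugin from the review
    # above; the status response shape and auth handling are assumptions.
    import collectd
    import requests

    GNOCCHI_URL = "http://localhost:8041"  # hypothetical Gnocchi API endpoint
    TOKEN = "ADMIN_TOKEN"                  # placeholder; a real plugin would use keystoneauth

    def read_backlog():
        resp = requests.get(GNOCCHI_URL + "/v1/status",
                            headers={"X-Auth-Token": TOKEN}, timeout=10)
        resp.raise_for_status()
        summary = resp.json().get("storage", {}).get("summary", {})
        for name in ("measures", "metrics"):
            val = collectd.Values(plugin="gnocchi_status", type="gauge",
                                  type_instance=name)
            val.dispatch(values=[summary.get(name, 0)])

    collectd.register_read(read_backlog)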
15:54:39 <akrzos> I think that summarizes the chaos i've been working on as of last week pretty well :D
15:54:53 <DinaBelova> ack, really good job being done
15:54:59 <DinaBelova> thanks akrzos
15:55:02 <akrzos> thanks
15:55:50 <akrzos> also i agree separating telemetry from control plane for scale is a must
15:56:07 <DinaBelova> yeah, I believe this is needed
15:56:16 <DinaBelova> on that scale of monitored resources
15:56:40 <DinaBelova> ok, from the Mirantis side we've started uploading test plans / results for some recent research
15:56:49 <DinaBelova> #link https://review.openstack.org/411933
15:56:58 <DinaBelova> #link https://review.openstack.org/413048
15:57:27 <DinaBelova> the first one is regarding Cinder performance with Ceph backend - in case of running OpenStack services on k8s
15:57:35 <DinaBelova> Ceph is installed separately of course :)
15:58:04 <DinaBelova> the second one is related to max pods per host density testing
15:58:05 <DinaBelova> in fact what we got was a bit disappointing
15:58:27 <DinaBelova> after 200 pods are running on a host, the overall process of scheduling, etc. becomes really slow
15:58:39 <DinaBelova> so 400 pods is almost the limit here
15:59:02 <DinaBelova> we think we may be missing some pool / whatever configuration parameter
15:59:28 <DinaBelova> as we did not expect degradations to start that early (200 pods/node density)
15:59:42 <DinaBelova> so that's still in progress
16:00:05 <DinaBelova> also right now we're still working on workloads testing
16:00:06 <DinaBelova> on 200 nodes
16:00:25 <DinaBelova> where we're deploying heat stacks with various apps running on VMs and planning to run locust.io workloads against them
16:00:36 <DinaBelova> still on the deployment phase for now
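For illustration, a minimal locustfile of the kind that could drive such workloads (using the locust API of that era); the target path is an assumption, not the actual Mirantis scenario:

    # Minimal locust.io workload sketch; the endpoint is an illustrative assumption.
    from locust import HttpLocust, TaskSet, task

    class AppTasks(TaskSet):
        @task
        def index(self):
            # hit the app deployed on the Heat-provisioned VMs
            self.client.get("/")

    class AppUser(HttpLocust):
        task_set = AppTasks
        min_wait = 1000  # ms between tasks
        max_wait = 5000

This would be run against the stack's public endpoint with something like `locust --host=http://<stack-floating-ip>` once deployment finishes.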
16:01:02 <DinaBelova> we observed some strange issues with Heat support in the fuel-ccp - really bad performance
16:01:15 <DinaBelova> so we're debugging it right now to see what might be the reason for this issue
16:01:27 <DinaBelova> and I think that's pretty much all from my side
16:02:04 <DinaBelova> anything else to cover in test plans / test results topic?
16:02:19 <DinaBelova> it looks like we may proceed to the Open Discussions
16:02:22 <DinaBelova> #topic Open Discussion
16:02:53 <DinaBelova> vbala tovin07_ I have an idea to finish the work on https://review.openstack.org/#/c/407967/ patch
16:03:00 <DinaBelova> and cut new osprofiler release
16:03:17 <akrzos> Any ptg updates?
16:03:30 <vbala> vmware ci posted the result on that patch
16:03:30 <DinaBelova> vbala tovin07 are you ok with it?
16:03:40 <tovin07_> Yes, it’s from vbala
16:03:47 <vbala> i'm ok with it
16:03:55 <tovin07_> I think it’s ok
16:04:05 <DinaBelova> ack, thanks :)
16:04:13 <DinaBelova> akrzos well :) from the Mirantis side andreykurilin and I are still coming :)
16:04:27 <andreykurilin> hi hi
16:04:27 <DinaBelova> akrzos were you able to discuss it within your team?
16:04:54 <DinaBelova> rcherrueau the same question to you sir :) any updates on PTG side?
16:04:55 <akrzos> we are still looking into budget, but in an ideal world, we would have myself, rook, sai and justin on our team come
16:05:10 <DinaBelova> akrzos yay :) I hope this will happen :)
16:05:12 <akrzos> and each have a performance topic we could cover/discuss
16:05:22 <rcherrueau> no not right now
16:05:26 <DinaBelova> akrzos I think we may start preparing agenda
16:05:37 <akrzos> so i was wondering if we would put together a schedule/agenda
16:05:38 <DinaBelova> lemme create an etherpad for those purposes
16:05:48 <akrzos> perfect
16:05:53 <tovin07_> +1
16:06:05 <DinaBelova> #action DinaBelova create an etherpad for PTG agenda collection
16:06:06 <DinaBelova> ack, cool
16:06:07 <rcherrueau> I have to discuss that with ad_rien
16:06:13 <DinaBelova> rcherrueau sure
16:06:18 <DinaBelova> please take your time
16:06:49 <DinaBelova> akrzos as said, I plan to focus on test ideas / tools roadmaps / etc.
16:07:00 <DinaBelova> ok, one more thing to cover
16:07:16 <DinaBelova> the holiday season is getting close
16:07:29 <akrzos> DinaBelova: got it
16:07:38 <DinaBelova> I wanted to check who's going to be available and when :)
16:08:02 <akrzos> we are out all next week, back january 3rd
16:08:05 <DinaBelova> I have a PTO for Dec 27 - Dec 30
16:08:18 <DinaBelova> ok, so it looks like it makes sense to move our next meeting to Jan
16:08:27 <DinaBelova> rcherrueau and you folks?
16:08:51 <rcherrueau> Me also, I will be out next week. I don't know for msimonin
16:08:55 <DinaBelova> are you ok to meet on Jan 3rd?
16:09:14 <DinaBelova> ack, let's agree on next meeting on Jan 3rd, already in the new year :)
16:09:20 <rcherrueau> OK great
16:09:32 <DinaBelova> #info next meeting to be on Jan 3rd, usual time
16:09:40 <tovin07_> got it
16:09:48 <akrzos> Great Thanks!
16:10:04 <DinaBelova> and I think that's all from my side
16:10:05 <DinaBelova> anything else to cover?
16:10:17 <DinaBelova> tovin07_ akrzos you're welcome :)
16:10:34 <DinaBelova> ok, thank you folks! see you next year :D
16:10:39 <DinaBelova> bye!
16:10:41 <tovin07_> Bye
16:10:45 <DinaBelova> #endmeeting