15:30:14 #startmeeting Performance Team
15:30:15 Meeting started Tue Dec 20 15:30:14 2016 UTC and is due to finish in 60 minutes. The chair is DinaBelova. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:30:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:30:18 The meeting name has been set to 'performance_team'
15:30:25 hey folks!
15:30:34 Hey DinaBelova
15:30:44 o/
15:30:48 o/
15:31:03 Hi
15:31:22 let's wait a few moments to make sure everyone who wanted to join is here :)
15:32:16 o/
15:32:26 hey lezbar__ o/
15:32:34 so I guess we may get started
15:32:35 #topic Action Items
15:32:45 last time we had only one action item, assigned to me
15:33:00 regarding verifying which Grafana backend Mirantis is using
15:33:16 in fact, right now we're using plain Prometheus with its own database
15:33:44 we plan to add persistent time series storage (e.g. Cassandra or OpenTSDB) a bit later
15:34:04 to store old monitoring data
15:34:10 and then we'll need to modify our Grafana boards a bit
15:34:22 to grab data from it
15:34:29 but right now it's plain Prometheus
15:34:57 I don't remember who asked this question, I believe it might have been you, akrzos
15:35:43 so we may proceed to the current progress
15:36:04 #topic Current progress on the planned tests
15:36:08 rcherrueau it looks like you're the only person from Inria today :)
15:36:27 Yes, msimonin is on holiday, so I will speak for him/Inria.
15:36:36 rcherrueau cool :)
15:36:41 please go ahead
15:36:48 We are working on two things. First, deploying a multi-region OpenStack with Kolla.
15:37:03 This almost works.
15:37:16 any issues encountered?
15:37:22 probably we may list bugs here
15:37:31 if any
15:37:55 We have something we call the Administrative Region (AR) that contains Keystone, MariaDB (with the Keystone tables) and Memcached.
15:38:10 This AR also contains one HAProxy since we deploy with Kolla.
15:38:38 We then have n OpenStack Regions (OSRn), each containing Nova, Glance, Neutron, RabbitMQ, MariaDB and HAProxy
15:38:57 Each OSR registers itself with the AR Keystone. And when an operator connects to Horizon, they have to choose among all the OSRs
15:39:20 To do so, we had to patch Kolla a little bit. We plan to send a mail to the kolla mailing list to share our experience with the community
15:40:04 so you have Keystone separated from the OSRs into a separate region? just to make sure
15:40:06 So, no special issues except the patches we had to make to the kolla-ansible code.
15:40:32 Yes, exactly
15:40:50 rcherrueau ok, and those regions might theoretically be located in different places
15:41:06 Yes, this is the idea
15:41:18 I think that Keystone performance might be the issue in this case :/
15:41:28 although I think you'll test it anyway :)
15:41:54 Yes we will, and this brings us to the second thing we are working on
15:42:04 ok, thank you rcherrueau - please keep us updated regarding your experiments :)
15:42:07 At the same time we are adding `netem` to our deployment and test tool
15:42:07 and the second?
15:42:31 `netem` is a Linux tool that lets you emulate network latency, low bandwidth, packet loss ...
15:43:16 what about setting latency via tc?
15:43:26 The idea is to make several multi-region deployments on our G5k platform. Then use `netem` to simulate different locations with different latencies and bandwidths and see how OpenStack behaves
15:43:42 #info Inria had to modify Kolla a bit to be able to proceed with their type of multisite deployment (Administrative Region and n OpenStack Regions)
15:43:59 akrzos: netem is tc ;)
15:44:04 ah
15:44:12 :D
15:44:31 #info the second part of the work is oriented on adding `netem` to their deployment and test tool - to simulate different locations with different latencies and bandwidths and see how OpenStack behaves
15:44:37 ok, thanks rcherrueau
15:44:40 msimonin is working hard on this second part
15:44:59 hope to see him next week :)
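For reference, a minimal sketch of the point made above (netem is configured through the tc command). This is not Inria's actual tooling; the interface name and the delay/loss values below are illustrative assumptions, and the commands need root privileges.

import subprocess

def add_latency(interface, delay_ms, jitter_ms=0, loss_pct=0.0):
    """Attach a netem qdisc to `interface` emulating WAN latency and packet loss."""
    cmd = [
        "tc", "qdisc", "replace", "dev", interface, "root", "netem",
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
        "loss", f"{loss_pct}%",
    ]
    subprocess.run(cmd, check=True)

def clear_latency(interface):
    """Remove the netem qdisc, restoring normal network behaviour."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)

if __name__ == "__main__":
    # e.g. emulate a 50 ms +/- 10 ms inter-region link with 0.1% packet loss
    add_latency("eth0", delay_ms=50, jitter_ms=10, loss_pct=0.1)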
15:45:34 akrzos any update from you sir? AFAIR you got new HW for the telemetry testing :)
15:45:55 so I've been running into bottlenecks in telemetry services
15:46:04 first was too few metricd workers
15:46:21 this is with 3 controllers, 4 Ceph nodes, 10 computes
15:46:30 booted 1k instances
15:46:34 #info akrzos has started work on telemetry testing following the test plan - http://docs.openstack.org/developer/performance-docs/test_plans/telemetry_scale/plan.html
15:46:39 the Gnocchi backlog continuously grows
15:47:02 $os_Workers limits metricd workers to 6 on my controllers
15:47:07 (24 logical CPU cores)
15:47:21 so I redeployed, overriding it with 48 workers
15:47:26 so 48 workers on each controller
15:47:41 so 144 total metricd workers
15:47:54 also reduced the metric processing delay
15:47:59 from 60s to 30s
15:48:06 and 1k instances are now handled in real time
15:48:11 in Ceph there are 36 OSDs
15:48:19 also needed to tune PGs to avoid Ceph HEALTH_WARN
15:48:41 though the calculation for this is tricky using pgcalc
15:49:00 so with this tuning I can now sustain 1k instances in the cloud with Gnocchi
15:49:10 on the low archival policy
15:49:20 I attempted to scale further
15:49:26 (wanted 2k)
15:49:38 and got to ~1.9k before hitting new problems
15:49:46 load average on the controllers is > core count
15:50:04 wow
15:50:05 that's a huge load
15:50:07 memory is rising in both RabbitMQ and ceilometer-collector
15:50:14 at this scale now
15:50:19 also
15:50:23 to get to 1.9k
15:50:29 I had to tune threads in Gnocchi
15:50:54 aggregation worker threads default to 1
15:50:55 it looks like, potentially, for ~2k VMs Gnocchi and RabbitMQ need to be separated from each other onto different nodes - with more nodes given to the control plane side of the cloud
15:51:20 my concern now is that the collector grows as I have seen in the past
15:51:37 I thought there was a patch put in to limit the # of messages it grabs off RabbitMQ
15:51:43 to prevent growth
15:51:58 but I don't understand the problem well enough right now
15:52:13 akrzos ack, thank you sir
15:52:16 so another factor
15:52:20 is the archival policy
15:52:35 the high policy might actually mean fewer aggregations being "recalculated"
15:52:46 and could actually be a lower workload
15:52:58 due to a finer-grained "end" timeframe
15:53:10 so I should retest with a new archival policy
15:53:15 and maybe a different number of aggregations
15:53:26 so lots to try still
15:53:43 another thing I can share with the community is a collectd plugin I wrote to monitor the Gnocchi backlog
15:54:04 #link https://review.openstack.org/#/c/411030/4/ansible/install/roles/collectd-openstack/files/collectd_gnocchi_status.py
15:54:39 I think that summarizes the chaos I've been working on as of last week pretty well :D
15:54:53 ack, really good job
15:54:59 thanks akrzos
15:55:02 thanks
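For context, a minimal sketch of watching the Gnocchi measures backlog, in the spirit of the collectd plugin linked above. The endpoint URL and token below are placeholders, and the exact JSON layout of /v1/status is an assumption that may differ between Gnocchi releases.

import time
import requests

GNOCCHI_URL = "http://controller:8041"   # hypothetical Gnocchi endpoint
TOKEN = "<keystone-token>"               # obtain via keystoneauth in real use

def backlog():
    """Return (metrics, measures) waiting to be processed, per Gnocchi's status API."""
    resp = requests.get(
        f"{GNOCCHI_URL}/v1/status",
        headers={"X-Auth-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    # assumed response shape: {"storage": {"summary": {"metrics": N, "measures": M}}, ...}
    summary = resp.json()["storage"]["summary"]
    return summary["metrics"], summary["measures"]

if __name__ == "__main__":
    # Poll every 30 s; a backlog that keeps growing means the metricd workers
    # cannot keep up and need to be scaled out, as discussed above.
    while True:
        metrics, measures = backlog()
        print(f"backlog: {metrics} metrics / {measures} measures to process")
        time.sleep(30)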
15:55:50 also I agree that separating telemetry from the control plane is a must at scale
15:56:07 yeah, I believe this is needed
15:56:16 at that scale of monitored resources
15:56:40 ok, from the Mirantis side we've started uploading test plans / results for some recent research
15:56:49 #link https://review.openstack.org/411933
15:56:58 #link https://review.openstack.org/413048
15:57:27 the first one covers Cinder performance with a Ceph backend - in the case of running OpenStack services on k8s
15:57:35 Ceph is installed separately of course :)
15:58:04 the second one is related to max pods per host density testing
15:58:05 in fact what we got was a bit disappointing
15:58:27 after 200 pods are running on a host, the overall process of scheduling, etc. becomes really slow
15:58:39 so 400 pods is almost the limit here
15:59:02 we think we may be missing some pool or other configuration parameter
15:59:28 as we did not expect degradation to start that early (200 pods/node density)
15:59:42 so that's still in progress
16:00:05 also, right now we're still working on workload testing
16:00:06 on 200 nodes
16:00:25 we're deploying Heat stacks with various apps running on VMs and planning to run locust.io workloads against them
16:00:36 still in the deployment phase for now
16:01:02 we observed some strange issues with Heat support in fuel-ccp - really bad performance
16:01:15 so we're debugging it right now to see what might be the reason for this issue
16:01:27 and I think that's about all from my side
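As an illustration of the locust.io workload side, here is a minimal locustfile sketch. It is not Mirantis' actual scenario: the endpoint paths, task weights and wait times are placeholders, it targets whatever HTTP app the Heat-deployed VMs expose, and it uses the current Locust API.

# locustfile.py - hypothetical minimal workload, run with:
#   locust -f locustfile.py --host http://<app-endpoint>
from locust import HttpUser, task, between

class AppUser(HttpUser):
    # each simulated user waits 1-3 s between requests
    wait_time = between(1, 3)

    @task(3)
    def index(self):
        # hit the application's front page (placeholder path)
        self.client.get("/")

    @task(1)
    def status(self):
        # hypothetical health-check endpoint exposed by the app
        self.client.get("/status")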
16:02:04 anything else to cover in the test plans / test results topic?
16:02:19 it looks like we may proceed to Open Discussion
16:02:22 #topic Open Discussion
16:02:53 vbala tovin07_ I have an idea to finish the work on the https://review.openstack.org/#/c/407967/ patch
16:03:00 and cut a new osprofiler release
16:03:17 Any PTG updates?
16:03:30 VMware CI posted the result on that patch
16:03:30 vbala tovin07 are you ok with it?
16:03:40 Yes, it's from vbala
16:03:47 I'm ok with it
16:03:55 I think it's ok
16:04:05 ack, thanks :)
16:04:13 akrzos well :) from the Mirantis side andreykurilin and I are still coming :)
16:04:27 hi hi
16:04:27 akrzos were you able to discuss it within your team?
16:04:54 rcherrueau the same question to you sir :) any updates on the PTG side?
16:04:55 we are still looking into budget, but in an ideal world we would have myself, rook, sai and justin from our team come
16:05:10 akrzos yay :) I hope this will happen :)
16:05:12 and each would have a performance topic we could cover/discuss
16:05:22 no, not right now
16:05:26 akrzos I think we may start preparing the agenda
16:05:37 so I was wondering if we could put together a schedule/agenda
16:05:38 let me create an etherpad for that purpose
16:05:48 perfect
16:05:53 +1
16:06:05 #action DinaBelova create an etherpad for PTG agenda collection
16:06:06 ack, cool
16:06:07 I have to discuss that with ad_rien
16:06:13 rcherrueau sure
16:06:18 please take your time
16:06:49 akrzos as said, I plan to focus on test ideas / tool roadmaps / etc.
16:07:00 ok, one more thing to cover
16:07:16 the holiday season is close
16:07:29 DinaBelova: got it
16:07:38 I wanted to check who's going to be available and when :)
16:08:02 we are out all next week, back January 3rd
16:08:05 I have PTO for Dec 27 - Dec 30
16:08:18 ok, so it looks like it makes sense to move our next meeting to Jan
16:08:27 rcherrueau and you folks?
16:08:51 Me too, I will be out next week. I don't know about msimonin
16:08:55 are you ok to meet on Jan 3rd?
16:09:14 ack, let's agree on the next meeting on Jan 3rd, already in the new year :)
16:09:20 OK great
16:09:32 #info next meeting to be on Jan 3rd, usual time
16:09:40 got it
16:09:48 Great, thanks!
16:10:04 and I think that's all from my side
16:10:05 anything else to cover?
16:10:17 tovin07_ akrzos you're welcome :)
16:10:34 ok, thank you folks! see you next year :D
16:10:39 bye!
16:10:41 Bye
16:10:45 #endmeeting