13:05:01 #startmeeting hyper-v
13:05:02 Meeting started Wed Jan 20 13:05:01 2016 UTC and is due to finish in 60 minutes. The chair is alexpilotti. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:05:03 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:05:06 The meeting name has been set to 'hyper_v'
13:05:10 o/
13:05:22 o/
13:05:22 * alexpilotti keeps on forgetting :-)
13:05:31 o/
13:05:35 next time I'll try #deplymeeting
13:06:05 #topy networking-hyperv parallel ACL execution
13:06:12 #topic networking-hyperv parallel ACL execution
13:06:27 * alexpilotti reconnects fingers to brain
13:06:56 claudiub: would you like to share some thoughts on our recent benchmarks and results?
13:07:09 yeees, just a second, I have some numbers
13:07:39 o/
13:07:41 Sonu__, kvinod: are you guys with us today?
13:07:52 I would also like to discuss the following from the neutron front:
13:07:56 https://docs.google.com/spreadsheets/d/1DJtBFFan72HUOjXpg4_ojOD592OgQQiuXzarRtjPbjg/edit#gid=0
13:08:00 I am here.
13:08:02 Microsoft certification for OVS.
13:08:10 Help on using VXLAN with OVS on Hyper-V
13:08:18 Discuss PyMI results
13:08:42 kvinod: ok, adding all to the agenda
13:08:47 ok, so I've run a couple of neutron-hyperv-agent scenarios: WMI, PyMI, PyMI + native threads.
13:08:49 thanks
13:09:13 and on the given link there are some graphs, for a simpler visualization of the results. :)
13:09:32 drumroll...
13:09:46 first of all, there seems to be about a ~2.5X execution time difference between WMI and PyMI.
13:09:56 PyMI being the faster option.
13:10:08 claudiub: on the graph, which line is PyMI and which one is WMI?
13:10:29 see the legend on the right. :) red is PyMI.
13:10:48 for processing the 75th port, does it mean it takes 70 sec with PyMI and 32 native threads?
13:10:50 also: lower is better
13:11:00 can I interpret it like that?
13:11:31 Basically, what is the execution time on the Y-axis? Sorry.
13:11:37 anyways, as for native threads + PyMI, there is another ~6X improvement compared to origin/master + PyMI, so in total around a ~13X improvement.
13:12:21 as for the number of threads used, it seems to me that 10 native threads is the optimal solution
13:12:52 as there isn't a lot of performance gain between 10, 20 or 32 native threads.
13:13:25 Sorry for asking again: what is the execution time?
13:13:40 is it the time the VM started pinging?
13:13:56 Sonu__: this is a specific networking-hyperv test
13:14:14 Sonu__: so the time it takes to perform the operation
13:14:23 it is the time when the Nth port was processed: bound to the vSwitch, VLAN tagged, SG ACLs added.
13:14:24 Sonu__: this is not a Rally test
13:15:12 claudiub: Thanks for the answer
13:15:28 also, from what I've seen, a higher worker count can lead to a higher number of random 32775 Hyper-V exceptions
13:15:40 which means that the port will have to be reprocessed
13:15:46 We faced it too. And that happened in a single-process context
13:16:12 this basically means that the VMMS can't handle the amount of things we're trying to push into it. :)
13:16:14 In our case, we had spawned independent processes, so no such exceptions were seen.
13:16:31 within a single process, the VMM handle will have a limit.
13:17:13 Sonu__: in the multiprocessing patch you had, I couldn't see the logging output from the workers.
13:17:44 Sonu__: processes or threads have nothing to do with 32775 exceptions
13:17:49 there could have been exceptions, but they were only visible in stdout, not in the log file.
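To make the discussion above easier to follow, here is a rough, hypothetical sketch of the pattern being benchmarked: a small pool of native worker threads processing ports, with a port re-queued whenever a transient Hyper-V error (such as the 32775 code mentioned above) occurs. None of the names below come from the actual networking-hyperv agent; this only illustrates the approach under discussion.

    # Hedged illustration only, not the actual networking-hyperv agent code.
    import concurrent.futures
    import queue

    HYPERV_BUSY_ERROR = 32775  # transient error code discussed above


    class TransientHyperVError(Exception):
        """Stand-in for the exception the real WMI/MI layer would raise."""
        def __init__(self, code):
            super().__init__("Hyper-V error %s" % code)
            self.code = code


    def process_port(port_id):
        """Hypothetical per-port work: bind to vSwitch, set the VLAN, apply SG ACLs."""
        return port_id


    def run_port_workers(port_ids, workers=10):
        """Process ports with native threads; re-queue ports hitting transient errors."""
        pending = queue.Queue()
        for pid in port_ids:
            pending.put(pid)

        def worker():
            while True:
                try:
                    pid = pending.get_nowait()
                except queue.Empty:
                    return  # nothing left to process
                try:
                    process_port(pid)
                except TransientHyperVError as exc:
                    if exc.code == HYPERV_BUSY_ERROR:
                        pending.put(pid)  # reprocess the port later
                    else:
                        raise

        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            for _ in range(workers):
                pool.submit(worker)


    if __name__ == "__main__":
        run_port_workers(["port-%d" % i for i in range(100)], workers=10)

As noted in the benchmarks above, around 10 workers appears to be the sweet spot: more workers mostly add contention on the VMMS and increase the chance of the transient errors that force reprocessing.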
13:17:50 I know, but a retry would have happened if such an exception occurred.
13:18:10 we minimized retries to almost 0.
13:18:35 but the improvement I see using PyMI is great
13:19:03 Sonu__: thanks
13:19:14 and with native threads, you seem to get closer to the benchmark required to scale up.
13:19:23 this one (networking-hyperv) is just a part of the whole picture
13:19:31 yes.
13:19:44 the enhanced RPC patch I had posted will improve this further.
13:20:01 because the neutron server has to do a lot less work with the enhanced RPC.
13:20:04 claudiub: did you include Sonu__'s RPC patch in those tests?
13:20:12 no.
13:20:14 alexpilotti: haven't.
13:20:19 alexpilotti: are these improvements only implemented now in networking-hyperv?
13:20:24 Sonu__: BTW merging your RPC patch is next on the todo list
13:20:45 sagar_nikam: not sure what you mean
13:21:00 I meant: are these changes in os-win as well?
13:21:13 sagar_nikam: so, PyMI can be used on any other branch, the 2.5X benefit will still be the same.
13:21:26 ok
13:21:30 sagar_nikam: they work on both os-win and networking-hyperv
13:21:35 sagar_nikam: as for the native threads, they are only on master at the moment.
13:21:53 while the threading patch is in networking-hyperv
13:21:59 so the native threads change still needs to be implemented in os-win?
13:22:13 nope, because it's one layer above
13:22:27 os-win is agnostic on this
13:22:40 ok
13:23:03 so we need to do these changes in nova, cinder, networking-hyperv etc.
13:23:11 I mean the nova driver
13:23:14 for Hyper-V
13:23:25 the nova driver case is now under investigation
13:23:32 the operations there are parallel
13:23:39 ok
13:23:40 thanks to eventlet
13:24:06 PyMI allows parallelism even with greenlets by releasing the GIL before doing MI calls
13:24:33 we need some extra testing to ensure that we don't need some extra monkey_patching
13:25:21 that's what is usually referred to as "greening" a module
13:25:51 but so far parallelism has already improved a lot, even in Nova
13:27:15 we managed, thanks to PyMI, to identify with a profiler where the other bottlenecks are
13:27:26 most of them are on "associator" calls
13:28:01 so abalutoiu is currently working on a patch
13:28:17 that gives another good boost
13:28:48 based on Rally tests we're now much closer to KVM times
13:29:00 anything else to add on this topic?
13:29:04 anyways, if you are going to test the native threads on networking-hyperv for yourselves, make sure you have the latest PyMI version.
13:29:29 pip install -U pymi
13:29:57 I'm moving on
13:30:08 #topic rally and benchmarks
13:30:54 is Thalabathy with us?
13:31:13 Yes
13:31:17 cool
13:31:38 so first, thanks for sharing your results
13:31:53 what is the primary objective of your test runs?
13:32:10 e.g.: stability, benchmarking, etc.
13:32:51 to check how much concurrency I can ... so that load gets introduced on the OpenStack components
13:32:59 alexpilotti: I would assume stability, long hours of operations, concurrency (private cloud),
13:33:28 benchmarking is definitely an outcome that we wish to publish, and Rally can be quite useful.
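For readers unfamiliar with the "greening" term used above, here is a minimal, generic illustration (not the Nova or neutron agent startup code) of what eventlet monkey patching does: blocking stdlib primitives are replaced with cooperative versions so that greenthreads can interleave. A C extension such as PyMI that additionally releases the GIL around its native calls lets those calls run in parallel as well.

    # Generic eventlet "greening" example; assumes eventlet is installed.
    import eventlet
    eventlet.monkey_patch()  # patch socket, time, thread, etc. before they are used

    import time


    def worker(name):
        # After monkey_patch(), time.sleep() yields to other greenthreads
        # instead of blocking the whole process.
        time.sleep(1)
        print("%s done" % name)


    pool = eventlet.GreenPool(size=10)
    for i in range(10):
        pool.spawn_n(worker, "greenthread-%d" % i)
    pool.waitall()  # all ten workers finish in roughly one second, not ten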
13:33:30 based on this I can conclude how many users can log in and create their objects
13:33:45 Thala: for that, in order to have some data to support the results, some improvements are needed
13:34:42 the first scenario is to validate how many VM operations can be sustained by the environment
13:35:42 an ideal curve shows a correlation between the number of VMs and time that is at most linear
13:36:13 anything above that means that there are some bottlenecks to deal with
13:36:54 I could see that some of the time the egress packets are not going out
13:37:15 Thala: out from the VMs?
13:37:29 because of this the DHCP requests are not reaching the OpenStack network node,
13:37:30 or from the host?
13:37:53 when this happens, are the VLANs properly set on the ports?
13:37:57 correct, when I say egress I mean from the VM out
13:38:06 Yes
13:38:30 we found a Hyper-V bug where, even when WMI reports that the VLAN tag has been properly applied, it's not
13:38:38 I used to see this issue even with older OpenStack releases
13:38:47 this happens in 3-5% of the cases
13:38:54 Thala: I feel our requirement was to get the same results from Hyper-V as we got from KVM
13:39:42 we have a patch in progress for this, as it's a very annoying bug
13:40:11 and we carried out the same test under KVM and Hyper-V, in which we observed Hyper-V not giving the same results
13:40:15 vinod: agreed, but for these kinds of issues we do not have a workaround to fix them
13:40:17 basically we wait for the WMI event telling us that the port actually got set and repeat the operation if not
13:40:44 alexpilotti: the patch is in networking-hyperv?
13:41:04 one key thing was that the KVM flavor and the Hyper-V flavor were different
13:41:06 sagar_nikam: it's under development now, it will be in networking-hyperv
13:41:24 ok
13:42:01 another thing is that the tests require some extra work on the hosts before running them
13:42:03 Thala: do your tests show the improvement which claudiub's tests achieved?
13:42:35 as a minimum: full Windows updates, "high performance" power scheme
13:42:42 alexpilotti: can you share the extra work on the hosts that needs to be done?
13:43:12 sagar_nikam: it's all in the email thread
13:43:20 ok
13:44:07 for the performance scheme: PowerCfg.exe /S 8C5E7FDA-E8BF-4A96-9A85-A6E23A8C635C
13:44:30 this alone gives roughly a 10% boost
13:44:51 Windows updates are definitely necessary
13:45:13 especially on an old OS, like 2012 R2
13:45:36 alexpilotti: agreed, this will be taken care of in the next execution...
13:45:49 power management too
13:46:09 Thala: thanks
13:46:39 alexpilotti: one question
13:46:45 kvinod: sure
13:47:20 we will take care of doing all of those and try to test again
13:48:03 great, thanks
13:48:08 but what we saw in our previous run was that things start failing after we reach a certain no. of VMs
13:48:18 and the native thread fixes too.
13:48:34 and update PyMI. :)
13:48:41 kvinod: we need to see the logs for that
13:48:44 so, my question is: have you guys tried booting more than 1000 VMs
13:49:08 we booted more than 200 per host
13:49:09 and seen if you get 100% results?
13:49:47 ok, with how many users and how many computes?
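Regarding the VLAN-tag bug and the workaround alexpilotti describes above (checking that the port setting was actually applied and repeating the operation if not), the hypothetical sketch below approximates the idea with a simple apply-verify-retry loop. The real patch reportedly relies on a WMI event, and none of the helper names here are actual os-win or networking-hyperv APIs.

    # Hedged sketch of an apply-verify-retry loop; helper names are placeholders.
    import time


    def set_vlan_id(port_name, vlan_id):
        """Placeholder for the WMI/MI call that applies the VLAN tag to the port."""


    def get_vlan_id(port_name):
        """Placeholder for the WMI/MI query that reads the applied VLAN tag back."""
        return None


    def set_vlan_with_verification(port_name, vlan_id, attempts=5, delay=1.0):
        """Apply the VLAN tag and repeat until the readback confirms it took effect."""
        for _ in range(attempts):
            set_vlan_id(port_name, vlan_id)
            # Hyper-V may report success while the tag is not actually applied
            # (the 3-5% case mentioned above), so verify before trusting it.
            if get_vlan_id(port_name) == vlan_id:
                return True
            time.sleep(delay)
        return False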
13:50:01 the only errors that we see are due to the Hyper-V bug mentioned above
13:50:24 around 3%, you can see it in the link attached to the last email thread as an example
13:50:38 we ran the tests using 2 compute nodes
13:50:49 so 400 VMs in total
13:51:10 we don't care about the number of users, as that's not relevant for Hyper-V or the Nova driver
13:51:28 the only thing that matters at the driver level is the number of concurrent operations
13:51:30 kvinod: from what I've seen from the logs I've sent, the VMs couldn't be spawned because the neutron port could not be created
13:51:32 I was interested in knowing how your test environment behaves with 1000 VMs with all fixes included
13:51:53 kvinod: wondering if you have a certain quota set on neutron?
13:52:05 kvinod: again, the relevant number of ports is the number of VMs per node
13:52:17 if you have 1000 VMs on 10 compute nodes
13:52:45 assuming a uniform distribution you'll end up with 100 VMs per host
13:53:18 yes, but the number of ACLs will be different when you have 1000 VMs in a security group.
13:53:18 the tests that you are doing are more significant for stressing the controller
13:53:27 yes
13:53:44 so on a compute node, you may have to handle cyclic rules for all 1000 VMs.
13:53:49 not only the controller but the agent
13:53:54 also
13:53:59 Sonu__: not really: the ACLs applied on a port depend on the number of rules in the SG
13:54:09 kvinod: the agent won't care
13:54:10 in terms of applying rules
13:54:10 we don't try 1000, but we have 250 VMs in a single security group.
13:54:49 the default security group is what we use in our cloud deployment. That becomes a real challenge.
13:54:50 how many rules are in the SG?
13:55:21 Thala: after your next tests, will you be able to plot a graph of WMI vs PyMI, in the same way as claudiub's graphs?
13:55:52 as of now, not planned,
13:55:57 if you have an SG with e.g. TCP SSH and RDP ingress with 0.0.0.0/0, that's just 4 ACLs
13:56:04 around 150 to 200 rules
13:56:44 kvinod: can you share how you create the SG?
13:56:45 default security groups work in a different way than normal custom security groups
13:56:55 we use the default security group. And we have cyclic rules for each member, by protocol.
13:57:09 last 3 minutes.
13:57:52 Sonu__: ok, we'll reproduce your config!
13:58:01 thanks
13:58:05 we use the default security group, which will introduce one additional rule on all the VM ports when a new VM gets spawned
13:58:30 2 minutes to go
13:58:33 alexpilotti: it would be good if you could give us an update on the OVS certification from Microsoft
13:58:37 it's bound to the OpenStack projects
13:58:48 quick answer on that
13:58:57 #topic OVS certification
13:59:16 we are going to get the 2.5 OVS release signed by MSFT (WHQL)
13:59:38 when will OVS 2.5 be released?
13:59:47 which is due in a few weeks (unless the OVS TPL changes plans)
13:59:53 great, thanks
14:00:02 thanks
14:00:12 time's up, thanks guys!
14:00:16 #endmeeting
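As a follow-up illustration of the security group scaling concern Sonu__ and kvinod raised above: with the default security group, every member contributes cyclic (remote-group) rules visible on every other member's port, so the per-port ACL count grows with the group size and the total ACL count grows roughly quadratically. The formula and the numbers below are illustrative assumptions, not measurements from the tests discussed in the meeting.

    # Back-of-the-envelope estimate of default-SG ACL growth; assumptions only.
    def acls_per_port(static_rules, members, protocols=1):
        """Static rules plus one cyclic rule per other member, per protocol (assumed)."""
        return static_rules + (members - 1) * protocols


    def total_acls(static_rules, members, protocols=1):
        """Total ACLs the agent has to program across all ports in the group."""
        return members * acls_per_port(static_rules, members, protocols)


    if __name__ == "__main__":
        # e.g. 4 static ACLs (the SSH/RDP ingress example above) plus cyclic rules
        for members in (4, 250, 1000):
            print(members, acls_per_port(4, members), total_acls(4, members))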