16:00:51 #startmeeting neutron_performance
16:00:52 Meeting started Mon Jul 29 16:00:51 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:55 The meeting name has been set to 'neutron_performance'
16:01:05 hi
16:01:30 hi
16:01:51 hi
16:01:54 let's wait one minute and then we'll get going
16:02:16 well, I see haleyb is already here and I know njohnston won't attend
16:02:22 so let's get going
16:02:28 #topic Updates
16:03:04 rubasov merged our port binding scenario in rally-openstack: https://review.opendev.org/#/c/662781/
16:03:16 I've been playing with it the past few days
16:03:24 Thanks and great job!
16:03:34 I hope it's being useful
16:04:11 it is being useful
16:05:21 then I'm happy
16:05:27 and I have to point out that it is going to be even more useful for https://bugs.launchpad.net/neutron/+bug/1836834
16:05:28 Launchpad bug 1836834 in neutron "[RFE] introduce distributed locks to ipam" [Wishlist,Confirmed] - Assigned to qinhaizhong (qinhaizhong)
16:06:05 this is an RFE that we approved this past Friday. We need a Rally test to check how much progress we make with that RFE
16:06:23 and I think your rally test is the perfect match for that
16:06:30 what do you think slaweq?
16:07:08 I agree, with this rally scenario we can measure exactly the improvement (or not) that this lock in ipam will give us
16:07:29 as in other "port_create" scenarios, I think it's not allocating IP addresses, right?
16:07:35 or is it?
16:07:42 exactly
16:07:55 this is the rally test that allows us to exercise the IPAM
16:08:16 so really useful rubasov
16:08:20 \o/
16:08:50 thx rubasov for this :)
16:08:55 cool, when I was writing it I was thinking of adding some options to tune the IP contention
16:09:09 I mean how many IPs out of how large a pool
16:09:17 let me know if that's needed
16:09:25 it could be added easily
16:09:56 let me think about it, now that I am playing with it
16:10:09 ok
16:10:10 any other updates from you rubasov?
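[editor's note] For context, a rally run of this scenario is driven by a task file that fixes the load shape. The sketch below, written as a Python dict (rally accepts the equivalent JSON/YAML), shows roughly what that could look like, including the IP-contention knobs rubasov mentions above. The scenario name and all argument names are assumptions for illustration; check the merged change for the exact interface.

    # Hypothetical rally task for the port binding scenario discussed above.
    # Only "runner" and "context" follow rally's real task schema; the scenario
    # name and its args are illustrative assumptions, and the contention knobs
    # (subnet size, ports per network) are the ones rubasov offered to add.
    port_binding_task = {
        "NeutronNetworks.create_and_bind_ports": [
            {
                "args": {
                    "subnet_cidr": "10.2.0.0/28",   # assumed: small pool -> high IP contention
                    "ports_per_network": 10,        # assumed: many ports fighting for that pool
                },
                # The two numbers varied in the reports discussed later:
                # total iterations and how many run concurrently.
                "runner": {"type": "constant", "times": 80, "concurrency": 20},
                "context": {
                    "users": {"tenants": 1, "users_per_tenant": 1},
                    "quotas": {"neutron": {"network": -1, "subnet": -1, "port": -1}},
                },
            }
        ]
    }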
16:10:17 a bit
16:10:37 I know slaweq has already seen these changes
16:10:58 I have made some progress with osprofiling a vif plug
16:11:04 together with gibi
16:11:18 these two changes: https://review.opendev.org/666610 https://review.opendev.org/665715
16:11:49 I have some feedback from ralonsoh to fix
16:12:03 and also some ugly interdependent unit test failures to fix too
16:12:28 but the main point is that those two patches can now get a trace through ovs-agent
16:12:43 I took a quick look at these patches Friday (I think)
16:12:44 from nova through ovs-agent and back to neutron-server
16:13:03 I still have some concerns about storing so much information in the bridge register
16:13:20 ralonsoh: sorry I did not have the time to answer yet
16:13:26 rubasov, no problem
16:13:30 no rush
16:13:33 but I think it should not be that much info there
16:13:45 it depends on the number of concurrent traces
16:13:54 which can only come from trusted users
16:14:10 that number does not scale with the number of ports on an agent
16:14:19 no, for sure
16:14:23 but with the number of concurrent traces
16:14:29 ralonsoh: yes, IIUC rubasov's patch will remove any trace-id from br-int just after using it
16:14:43 but I also don't want to leave unneeded info in the bridge register
16:14:56 if so, that's ok for me
16:15:15 but rubasov please correct me if I'm wrong :)
16:15:25 and eventually, once Sean merges the patch in nova, we'll put everything in the port register
16:15:25 ralonsoh: by default ovs-agent removes the trace info as soon as it sees it
16:16:08 ralonsoh: on the other hand I fully agree that it's an ugly hack to put trace info there :-)
16:16:31 and I'd like to see Sean's patch merged
16:16:48 I'm just not sure I want to make it a dependency
16:17:01 not for now
16:17:43 ralonsoh: also I could add some extra cleanup
16:17:55 perfect
16:17:58 like deleting all trace info from the bridge on ovs-agent restart
16:18:24 would that be better?
16:18:30 for sure!
16:18:37 then I'll do that
16:19:11 let me know if you see other places in the code where we can do some meaningful cleanup
[editor's note: a sketch of what this restart cleanup could look like follows this segment of the log]
16:20:22 unless you have some questions about these patches that's all from me
16:20:50 thanks for the update. and thanks to ralonsoh for helping with this :-)
16:21:01 yep thanks
16:21:13 o/
16:22:45 On my side, as I indicated above, I've been running the create_bind_port scenario
16:23:09 I created reports and posted them here: https://github.com/miguellavalle/neutron-perf/tree/july-29th/threadripper
16:24:01 If you go to the reports folder, you will see the different runs, where I increase the number of iterations and the concurrency gradually
16:24:28 I also created a quick overview of the results here: http://paste.openstack.org/show/755065/
16:26:21 so it looks to me like using osprofiler is slowing down everything a lot
16:26:26 As you can see, we went from an average duration of 63.282 secs with 10 iterations, concurrency 5
16:26:34 yes, it does
16:27:17 but at least for the time being it is a good window into what is going on
16:27:35 but still the trends with osprofiler should be the same as without it, right?
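[editor's note] The restart cleanup rubasov agreed to above, removing stale trace info from br-int when ovs-agent starts, could look roughly like the sketch below. It shells out to ovs-vsctl for clarity, whereas the agent itself would go through neutron's ovs_lib wrappers, and the "osprofiler-" key prefix in the bridge's external-ids is an assumption about how the patches under review name their keys.

    # Rough sketch: drop any leftover trace ids stored in br-int's
    # external-ids, e.g. at ovs-agent startup. Assumes trace info is kept
    # as external-ids with an "osprofiler-" prefix (an assumption, not
    # necessarily what the patches under review do).
    import subprocess

    BRIDGE = "br-int"
    TRACE_KEY_PREFIX = "osprofiler-"   # assumed naming convention

    def cleanup_stale_trace_ids(bridge=BRIDGE):
        # 'ovs-vsctl br-get-external-id <bridge>' prints one "key=value" per line
        out = subprocess.check_output(
            ["ovs-vsctl", "br-get-external-id", bridge], text=True)
        for line in out.splitlines():
            key = line.split("=", 1)[0]
            if key.startswith(TRACE_KEY_PREFIX):
                # setting an external-id without a value removes the key
                subprocess.check_call(
                    ["ovs-vsctl", "br-set-external-id", bridge, key])

    if __name__ == "__main__":
        cleanup_stale_trace_ids()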
16:27:35 I don't think we should focus on the absolute durations
16:27:47 we should focus on the trend
16:27:52 so we should still see how it scales
16:27:58 exactly rubasov
16:28:05 I agree
16:28:18 I am just surprised that it has so much overhead
16:28:27 it is a lot of overhead
16:28:51 but if you think about it, it's not surprising
16:29:07 we are logging each DB operation
16:29:23 anyway...
16:29:25 a lot of I/O to send to the trace point db
16:30:55 going back to my original comment, with 80 iterations, 20 concurrency, we go to an average duration of 147.887
16:33:28 in that reports folder you will find the rally report (for example iterations-10-concurrency-5.html) with one or several corresponding osprofiler reports (for example iterations-10-concurrency-5-osprofiler-max.html)
16:34:15 if the osprofiler report name contains the 'max' suffix, it means that it refers to the iteration with the largest duration
16:35:27 what looks good is that there's not much of a spread, the 95th percentile values are close to the median values
16:35:49 so neutron-server seems to be performing at a consistent speed
16:36:03 rubasov: slow but stable :D
16:36:56 and based on the first two we may need to divide the numbers by 5 (the slowdown because of osprofiler)
16:37:00 would that be an indication that we haven't reached a limit? i.e. the performance knee where response time grows exponentially?
16:37:56 I mean the fact that the 95th percentile is still close to the median?
16:38:47 BTW any plan to perform more tests with more physical resources to compare the results with and w/o the profiler?
16:39:08 I don't think the spread in itself tells much about bottlenecks
16:39:15 or at least I don't see how
16:39:29 fair point
16:39:45 other than we're likely hitting the same cause of slowness in each run
16:40:24 so if we can improve that we'll likely improve all runs
16:41:06 I haven't identified a big / single obvious bottleneck in the case of port creation and update. so it's going to require a very fine-grained analysis of the reports, which I will attempt as my next step
16:41:43 But in the case of subnet creation I found what seems to be a tentative surprise
16:41:55 this is already very cool, thank you
16:42:23 mlavalle: one more question: what are those SQL update queries in your summary?
16:42:38 If you look at lines 75-78 and 99-100 in my summary....
16:43:09 you will see SQL update statements that are executed during subnet creation
16:43:53 in the 60 / 15 and 80 / 20 scenarios I can see that those SQL statements take around 8 to 9 seconds
16:44:11 while the total subnet creation time is around 25-30 seconds
16:44:22 am I being clear?
16:45:00 so those are the slowest queries in subnet creation requests, right?
16:45:15 in fact it's always the same query
16:45:22 exactly
16:45:35 let's take this as tentative for now
16:45:44 I just found this last night
16:45:50 but it really stands out
16:46:06 ok
16:46:19 do you have multiple query lines because we're hitting a db_retry?
16:46:52 yeah, I need to investigate and relate this to the code and the log files
16:47:03 but retries are clearly a possibility
16:47:15 ok
16:48:08 and I also need to compare a port creation / update in the 10 / 5 scenario with its counterpart in the 80 / 20 scenario
16:48:31 because in that case there is not a single operation that stands out
16:48:56 but clearly a lot of small operations are adding up to a big increase in time
16:49:57 makes sense?
16:50:06 yes
16:50:08 any more questions or observations
16:50:10 ?
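[editor's note] On the db_retry possibility raised above: neutron wraps most of its DB-facing methods in retry decorators, so when a deadlock or similar error is hit under concurrency the whole method runs again and the same statement can show up several times in an osprofiler trace. The sketch below uses oslo.db's wrap_db_retry as a simplified stand-in for neutron's own decorators; the function, table, and statement are placeholders, not the specific query from the summary.

    # Simplified illustration of why one logical UPDATE can appear several
    # times in a trace: on DBDeadlock the decorator re-runs the whole
    # function, re-issuing the statement each time.
    from oslo_db import api as oslo_db_api
    from sqlalchemy import text

    @oslo_db_api.wrap_db_retry(max_retries=10, retry_on_deadlock=True)
    def update_allocation_pool(session, subnet_id, first_ip, last_ip):
        # Placeholder statement; under contention each retry adds another
        # copy of it to the osprofiler report for the same API request.
        session.execute(
            text("UPDATE ipallocationpools "
                 "SET first_ip = :first, last_ip = :last "
                 "WHERE subnet_id = :sid"),
            {"first": first_ip, "last": last_ip, "sid": subnet_id},
        )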
16:50:14 yes it does
16:50:27 I don't have anything else to add
16:51:04 jrbalderrama: yes, that's the plan longer term
16:51:40 but before getting there, I would like to have deeper knowledge of where the bottlenecks are
16:52:02 so we can perform a very enlightening experiment
16:52:27 and btw, thanks for attending
16:52:58 and Viva Colombia! I bet all your French buddies are disappointed
16:53:06 hahahah
16:53:39 LOL
16:54:27 hahaha we are still the champions in other sports ;) and the Tour is a French heritage!
16:54:54 that's true
16:55:07 it is a big victory for the Colombians, though
16:55:25 back to the point. Sounds good to me. We are looking forward to launching some tests here
16:55:41 ok, we'll keep you posted
16:55:59 thank you all!
16:56:28 no rush on the PR
16:56:47 anything else we should discuss today?
16:57:36 not from me
16:58:04 neither from me
16:58:19 Thanks for attending!
16:58:23 #endmeeting