16:00:51 <mlavalle> #startmeeting neutron_performance
16:00:52 <openstack> Meeting started Mon Jul 29 16:00:51 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:55 <openstack> The meeting name has been set to 'neutron_performance'
16:01:05 <rubasov> hi
16:01:30 <slaweq> hi
16:01:51 <haleyb> hi
16:01:54 <mlavalle> let's wait one minute and then we'll get going
16:02:16 <mlavalle> well, I see haleyb is already here and I know njohnston won't attend
16:02:22 <mlavalle> so let's get going
16:02:28 <mlavalle> #topic Updates
16:03:04 <mlavalle> rubasov merged our port binding scenario into rally-openstack: https://review.opendev.org/#/c/662781/
16:03:16 <mlavalle> I've been playing with it the past few days
16:03:24 <mlavalle> Thanks and great job!
16:03:34 <rubasov> I hope it's being useful
16:04:11 <mlavalle> it is being useful
16:05:21 <rubasov> then I'm happy
16:05:27 <mlavalle> and I have to point out that it is going to be even more useful for https://bugs.launchpad.net/neutron/+bug/1836834
16:05:28 <openstack> Launchpad bug 1836834 in neutron "[RFE] introduce distributed locks to ipam" [Wishlist,Confirmed] - Assigned to qinhaizhong (qinhaizhong)
16:06:05 <mlavalle> this is an RFE that we approved this past Friday. We need a Rally test to check how much progress we make with that RFE
16:06:23 <mlavalle> and I think your rally test is the perfect match for that
16:06:30 <mlavalle> what do you think slaweq?
16:07:08 <slaweq> I agree, with this rally scenario we can measure exactly the improvement (or not) that this lock in ipam will give us
16:07:29 <slaweq> as the other "port_create" scenarios are, I think, not allocating IP addresses, right?
16:07:35 <slaweq> or is it?
16:07:42 <mlavalle> exactly
16:07:55 <mlavalle> this is the rally test that allows us to exercise the IPAM
16:08:16 <mlavalle> so really useful rubasov
16:08:20 <mlavalle> \o/
16:08:50 <slaweq> thx rubasov for this :)
16:08:55 <rubasov> cool, when I was writing it I was thinking of adding some options to tune the IP contention
16:09:09 <rubasov> I mean how many IPs out of how large pool
16:09:17 <rubasov> let me know if that's needed
16:09:25 <rubasov> it could be added easily
16:09:56 <mlavalle> let me think about it, now that I am playing with it
16:10:09 <rubasov> ok
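For context, the kind of tunables rubasov describes would live in the rally task file. Below is a minimal sketch (written as a Python dict; rally takes the equivalent JSON/YAML), where the scenario name and the "subnet_cidr" / "ports_per_network" arguments are placeholders for illustration, not necessarily what the merged scenario exposes.

```python
# Hypothetical knobs for IP contention: a small allocation pool combined with
# a port count close to pool exhaustion makes IPAM work harder per address.
task_snippet = {
    "NeutronNetworks.create_and_bind_ports": [{   # scenario name assumed
        "args": {
            "subnet_cidr": "10.0.0.0/28",   # ~13 usable addresses -> high contention
            "ports_per_network": 12,        # draw nearly the whole pool
        },
        "runner": {
            "type": "constant",
            "times": 80,        # total iterations
            "concurrency": 20,  # parallel API calls, as in the runs discussed later
        },
    }]
}
```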
16:10:10 <mlavalle> any other updates from you rubasov?
16:10:17 <rubasov> a bit
16:10:37 <rubasov> I know slaweq has already seen these changes
16:10:58 <rubasov> I have made some progress with osprofiling a vif plug
16:11:04 <rubasov> together with gibi
16:11:18 <rubasov> these two changes: https://review.opendev.org/666610 https://review.opendev.org/665715
16:11:49 <rubasov> I have some feedback from ralonsoh to fix
16:12:03 <rubasov> and also some ugly interdependent unit test failures to fix too
16:12:28 <rubasov> but the main point is that those two patches now can get a trace through ovs-agent
16:12:43 <mlavalle> I took a quick look at these patches Friday (I think)
16:12:44 <rubasov> from nova through ovs-agent and back to neutron-server
16:13:03 <ralonsoh> I still have some concerns about storing so much information in the bridge register
16:13:20 <rubasov> ralonsoh: sorry I did not have the time to answer yet
16:13:26 <ralonsoh> rubasov, no problem
16:13:30 <ralonsoh> no rush
16:13:33 <rubasov> but I think it should not be that much info there
16:13:45 <rubasov> it depends on the number of concurrent traces
16:13:54 <rubasov> which can only come from trusted users
16:14:10 <rubasov> that number does not scale together with the number of ports on an agent
16:14:19 <ralonsoh> no, for sure
16:14:23 <rubasov> but with the number of concurrent traces
16:14:29 <slaweq> ralonsoh: yes, IIUC rubasov's patch it will remove any trace-id from br-int just after using it
16:14:43 <ralonsoh> but I also don't want to leave unneeded info in the bridge register
16:14:56 <ralonsoh> if so, that's ok for me
16:15:15 <slaweq> but rubasov please correct me if I'm wrong :)
16:15:25 <ralonsoh> and eventually, once Sean merges the patch in nova, we'll put everything in the port register
16:15:25 <rubasov> ralonsoh: by default ovs-agent removes the trace info as soon as it sees it
16:16:08 <rubasov> ralonsoh: on the other hand I fully agree that it's an ugly hack to put trace info there :-)
16:16:31 <rubasov> and I'd like to see Sean's patch merged
16:16:48 <rubasov> I'm just not sure I want to make it a dependency
16:17:01 <ralonsoh> not for now
16:17:43 <rubasov> ralonsoh: also I could add some extra cleanup
16:17:55 <ralonsoh> perfect
16:17:58 <rubasov> like deleting all trace info from the bridge on ovs-agent restart
16:18:24 <rubasov> would that be better?
16:18:30 <ralonsoh> for sure!
16:18:37 <rubasov> then I'll do that
16:19:11 <rubasov> let me know if you see other places in the code where we can do some meaningful cleanup
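To make the br-int handoff concrete, here is a minimal sketch of the mechanism being discussed, not the actual patches: the caller parks the osprofiler context in the bridge's external_ids, the ovs-agent re-attaches to the trace and deletes the key immediately, and a restart-time cleanup drops anything left behind. The external_ids key name and the "base_id,parent_id" value format are assumptions for illustration.

```python
import subprocess

from osprofiler import profiler

TRACE_KEY = "neutron_osprofiler_trace"  # hypothetical external_ids key


def continue_trace_from_bridge(bridge="br-int", hmac_key="SECRET_KEY"):
    # Read the parked trace context, if any, from the bridge register.
    out = subprocess.run(
        ["ovs-vsctl", "--if-exists", "get", "Bridge", bridge,
         "external_ids:" + TRACE_KEY],
        capture_output=True, text=True, check=True).stdout.strip()
    if not out:
        return
    base_id, parent_id = out.strip('"').split(",")

    # Remove the key right away so no trace info lingers in the register.
    subprocess.run(
        ["ovs-vsctl", "remove", "Bridge", bridge, "external_ids", TRACE_KEY],
        check=True)

    # Re-attach to the caller's trace and record the vif plug handling.
    profiler.init(hmac_key, base_id=base_id, parent_id=parent_id)
    profiler.start("ovs-agent-vif-plug")
    try:
        pass  # ... the agent's normal port processing would run here ...
    finally:
        profiler.stop()


def cleanup_stale_trace_info(bridge="br-int"):
    # Extra cleanup on agent restart, as suggested above: drop any trace
    # key a previous run may have left behind.
    subprocess.run(
        ["ovs-vsctl", "remove", "Bridge", bridge, "external_ids", TRACE_KEY],
        check=True)
```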
16:20:22 <rubasov> unless you have some questions about these patches that's all from me
16:20:50 <mlavalle> thanks for the update. and thanks to ralonsoh for helping with this :-)
16:21:01 <rubasov> yep thanks
16:21:13 <ralonsoh> 0/
16:22:45 <mlavalle> On my side, as I indicated above, I've been running the create_bind_port scenario
16:23:09 <mlavalle> I created reports and posted them here: https://github.com/miguellavalle/neutron-perf/tree/july-29th/threadripper
16:24:01 <mlavalle> If you go to the reports folder, you will see the different runs, where I increase the number of iterations and the concurrency gradually
16:24:28 <mlavalle> I also created a quick overview of results here: http://paste.openstack.org/show/755065/
16:26:21 <slaweq> so it looks to me like using osprofiler is slowing everything down a lot
16:26:26 <mlavalle> As you can see, we went from an average duration of 63.282 secs with 10 iterations, concurrency 5
16:26:34 <mlavalle> yes, it does
16:27:17 <mlavalle> but at least for the time being it is a good window into what is going on
16:27:35 <rubasov> but still the trends with osprofiler should be the same as without it, right?
16:27:35 <mlavalle> I don't think we should focus on the absolute durations
16:27:47 <mlavalle> we should focus on the trend
16:27:52 <rubasov> so we should still see how it scales
16:27:58 <mlavalle> exactly rubasov
16:28:05 <slaweq> I agree
16:28:18 <slaweq> I am just surprised that it has so much overhead
16:28:27 <mlavalle> it is a lot of overhead
16:28:51 <mlavalle> but if you think about it, it's not surprising
16:29:07 <mlavalle> we are logging each DB operation
16:29:23 <mlavalle> anyway...
16:29:25 <rubasov> a lot of I/O to send to the trace point db
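The overhead follows from how the profiler is wired in: with SQLAlchemy tracing enabled, every DB statement becomes its own trace point shipped to the collector. A minimal standalone illustration of that hook follows, using osprofiler's public SQLAlchemy helper; the in-memory sqlite URL and the hmac key are placeholders, and in a real deployment neutron sets this up from its config rather than by hand.

```python
import sqlalchemy

from osprofiler import profiler
from osprofiler import sqlalchemy as osp_sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")
profiler.init("SECRET_KEY")

# Registers before/after cursor-execute listeners that emit a "db" trace
# point per statement -- this is the per-query overhead being discussed.
osp_sqlalchemy.add_tracing(sqlalchemy, engine, "db")

with engine.connect() as conn:
    conn.execute(sqlalchemy.text("SELECT 1"))  # recorded as its own trace point
```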
16:30:55 <mlavalle> going back to my original comment, with 80 iterations, 20 concurrency, we go up to an average duration of 147.887 secs
16:33:28 <mlavalle> in that report folder you will find the rally report (for example iterations-10-concurrency-5.html) with one or several corresponding osprofiler reports (for example iterations-10-concurrency-5-osprofiler-max.html)
16:34:15 <mlavalle> if the osprofiler report name contains the 'max' suffix, it means that it refers to the iteration with the largest duration
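For reference, the iteration a "-max" report corresponds to can be picked out of the JSON results dump. A rough sketch, assuming the layout of older `rally task results` output (a list of workloads, each with a "result" list whose items carry a "duration" field), which may differ between rally versions:

```python
import json
import sys


def slowest_iteration(results_file):
    # Return the index and duration of the longest-running iteration.
    with open(results_file) as f:
        data = json.load(f)
    iterations = data[0]["result"]
    idx, it = max(enumerate(iterations), key=lambda pair: pair[1]["duration"])
    return idx, it["duration"]


if __name__ == "__main__":
    idx, duration = slowest_iteration(sys.argv[1])
    print("slowest iteration: #%d, %.3f s" % (idx, duration))
```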
16:35:27 <rubasov> what looks good is that there's not much of a spread, the 95th percentile values are close to the median values
16:35:49 <rubasov> so neutron-server seems to be performing at a consistent speed
16:36:03 <slaweq> rubasov: slow but stable :D
16:36:56 <rubasov> and based on the first two we may need to divide the numbers by 5 (the slowdown because of osprofiler)
16:37:00 <mlavalle> would that be an indication that we haven't reached a limit? i.e. the performance knee where response time grows exponentially?
16:37:56 <mlavalle> I mean the fact that the 95th percentile is still close to the median?
16:38:47 <jrbalderrama> BTW any plan to perform more tests with more physical resources to compare the results with and w/o the profiler?
16:39:08 <rubasov> I don't think the spread in itself tells much about bottlenecks
16:39:15 <rubasov> or at least I don't see how
16:39:29 <mlavalle> fair point
16:39:45 <rubasov> other than we're likely hitting the same cause of slowness in each run
16:40:24 <rubasov> so if we can improve that we'll likely improve all runs
16:41:06 <mlavalle> I haven't identified a big / single obvious bottleneck in the case of port creation and update, so it's going to require a very fine-grained analysis of the reports, which I will attempt as my next step
16:41:43 <mlavalle> But in the case of subnet creation I found what, tentatively, looks like a surprise
16:41:55 <rubasov> this is already very cool, thank you
16:42:23 <slaweq> mlavalle: one more question: what are those sql update queries in Your summary?
16:42:38 <mlavalle> If you look at lines 75-78 and 99-100 in my summary....
16:43:09 <mlavalle> you will see SQL update statements that are executed during subnet creation
16:43:53 <mlavalle> in the 60 / 15 and 80 / 20 scenarios I can see that those SQL statements take around 8 to 9 seconds
16:44:11 <mlavalle> while the total subnet creation time is around 25 to 30 seconds
16:44:22 <mlavalle> am I being clear?
16:45:00 <slaweq> so those are slowest queries in subnet creation requests, right?
16:45:15 <slaweq> in fact it's always the same query
16:45:22 <mlavalle> exactly
16:45:35 <mlavalle> let's take this as tentative yet
16:45:44 <mlavalle> I just found this last night
16:45:50 <mlavalle> but it really stands out
16:46:06 <slaweq> ok
16:46:19 <rubasov> do you have multiple query lines because we're hitting a db_retry?
16:46:52 <mlavalle> yeah, I need to investigate and relate this to the code and the log files
16:47:03 <mlavalle> but retries are clearly a possibility
16:47:15 <rubasov> ok
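One quick way to check the retry theory is to count identical SQL statements inside a single osprofiler trace. A sketch that walks the JSON from `osprofiler trace show --json` and tallies any "statement" fields it finds, without depending on the exact report schema (which varies between osprofiler versions):

```python
import collections
import json
import sys


def iter_statements(node):
    # Recursively collect any "statement" string anywhere in the trace tree.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "statement" and isinstance(value, str):
                yield value.strip()
            else:
                yield from iter_statements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_statements(item)


def main(trace_file):
    with open(trace_file) as f:
        trace = json.load(f)
    counts = collections.Counter(iter_statements(trace))
    for stmt, count in counts.most_common(10):
        if count > 1:  # repeated statements hint at retries (e.g. db_retry)
            print("%3d x %s" % (count, stmt[:120]))


if __name__ == "__main__":
    main(sys.argv[1])
```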
16:48:08 <mlavalle> and I also need to compare a port creation / update in the 10 / 5 scenario with its counterpart in the 80 / 20 scenario
16:48:31 <mlavalle> because in that case there is not a single operation that stands out
16:48:56 <mlavalle> but clearly a lot of small operations are adding up to a big increase in time
16:49:57 <mlavalle> makes sense?
16:50:06 <slaweq> yes
16:50:08 <mlavalle> any more questions or observations
16:50:10 <mlavalle> ?
16:50:14 <rubasov> yes it does
16:50:27 <slaweq> I don't have anything else to add
16:51:04 <mlavalle> jrbalderrama: yes, that's the plan longer term
16:51:40 <mlavalle> but before getting there, I would like to have deeper knowledge on where bottlenecks are
16:52:02 <mlavalle> so we can perform a very enlightening experiment
16:52:27 <mlavalle> and btw, thanks for attending
16:52:58 <mlavalle> and Viva Colombia! I bet all your French buddies are disappointed
16:53:06 <ralonsoh> hahahah
16:53:39 <slaweq> LOL
16:54:27 <jrbalderrama> hahaha we are still the champions in other sports ;) and the Tour is a French heritage!
16:54:54 <mlavalle> that's true
16:55:07 <mlavalle> it is a big victory for the Colombians, though
16:55:25 <jrbalderrama> back to the point. Sounds good to me. We are looking forward to launching some tests here
16:55:41 <mlavalle> ok, we'll keep you posted
16:55:59 <jrbalderrama> thank you all!
16:56:28 <mlavalle> no rush on the PR
16:56:47 <mlavalle> anything else we should discuss today?
16:57:36 <slaweq> not from me
16:58:04 <rubasov> neither from me
16:58:19 <mlavalle> Thanks for attending!
16:58:23 <mlavalle> #endmeeting