16:01:09 <mlavalle> #startmeeting neutron_performance
16:01:10 <openstack> Meeting started Mon Oct  7 16:01:09 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:13 <openstack> The meeting name has been set to 'neutron_performance'
16:02:29 <rubasov_> hello
16:03:03 <jrbalderrama> hi!
16:03:05 <mlavalle> hi
16:03:11 <bcafarel> o/
16:03:47 <njohnston> o/
16:03:54 <slaweq> o/
16:04:49 <mlavalle> ok let's get going
16:05:00 <mlavalle> #topic Updates
16:05:10 <mlavalle> does anybody have any updates?
16:06:05 <mlavalle> I do but I want to let the others talk first ;-)
16:06:47 <bcafarel> none from me
16:07:13 <njohnston> nothing here
16:07:13 <slaweq> I don't have any
16:07:58 <rubasov_> neither do I
16:07:59 <jrbalderrama> Neither do I
16:08:34 <mlavalle> ok in that case let me give my update
16:09:17 <mlavalle> I continued working on the patch to collect profiling data from our Rally job: https://review.opendev.org/#/c/678438/
16:10:17 <mlavalle> as promised during the last meeting, I decorated the WSGI controllers in our API, both the legacy base controller https://review.opendev.org/#/c/678438/14/neutron/api/v2/base.py
16:10:42 <mlavalle> and the pecan controllers: https://review.opendev.org/#/c/678438/14/neutron/pecan_wsgi/controllers/resource.py
16:11:32 <mlavalle> The nice thing about this is that now we profile all the calls to the API that we receive from our Rally job
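(For context, a minimal sketch of that approach using the stdlib cProfile/pstats; the decorator name profile_controller and the LOG object are hypothetical, not necessarily what the patch under review uses:)

    import cProfile
    import functools
    import io
    import logging
    import pstats

    LOG = logging.getLogger(__name__)

    def profile_controller(func):
        """Profile a WSGI controller method and log the stats."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            try:
                # runcall profiles the wrapped call and returns its result
                return profiler.runcall(func, *args, **kwargs)
            finally:
                stream = io.StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                stats.sort_stats('cumulative').print_stats(50)
                LOG.debug(stream.getvalue())
        return wrapper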
16:12:12 <njohnston> fantastic
16:12:16 <mlavalle> even though the Rally job timed out
16:12:37 <mlavalle> you can still click on it and download the Neutron server log
16:12:52 <mlavalle> please download it, don't open it in your browser
16:13:21 <slaweq> 66 MB - wow
16:13:30 <slaweq> quite huge
16:13:31 <mlavalle> and then you can open it with less
16:13:49 <mlavalle> yeah, search it for the string "os-profiler parent trace-id"
16:15:29 <mlavalle> you will see that for each call, we are showing the profiling data for the top 50 methods by cumulative CPU time and underneath, for those top 50, who called whom and who was called by whom
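(That three-part report matches what the stdlib pstats module can emit; a sketch, assuming a pstats.Stats object named stats built as in the sketch above:)

    # Top-50 functions by cumulative time, then caller/callee
    # breakdowns restricted to those same entries.
    stats.sort_stats('cumulative')
    stats.print_stats(50)    # the top-50 table
    stats.print_callers(50)  # who called each of those functions
    stats.print_callees(50)  # whom each of those functions called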
16:16:21 <mlavalle> can you all see it?
16:16:45 <mlavalle> in MacOS I can open it with vim
16:17:55 <slaweq> mlavalle: I'm still downloading it
16:18:50 <mlavalle> and since I am also logging the osprofiler trace ids for each call, we can correlate the cProfile data in the server log with the osprofiler data in the Rally report
16:18:57 <mlavalle> does that make sense?
16:19:20 <mlavalle> slaweq: wow, you have a slow connection. I can download it in 5 secs
16:20:01 <rubasov_> if it's downloading more than 66 mb then you may want to do this: wget --header="accept-encoding: gzip" https://f5ae4a32b02585f5baea-19e17e5db2abd0e280dd7dd93f01d60e.ssl.cf5.rackcdn.com/678438/14/check/neutron-rally-task/832c214/controller/logs/screen-q-svc.txt.gz
16:20:13 <slaweq> mlavalle: I downloaded it with wget, it has 865M now
16:20:42 <rubasov_> I think the server gunzips it on the fly if you don't have that header
16:20:42 <mlavalle> yeah that size sounds right
16:21:59 <njohnston> soooo much data
16:22:38 <mlavalle> what do you all think?
16:24:02 <slaweq> at first glance looks interesting for debugging
16:24:19 <slaweq> but I'm not sure if we should merge it and use on each CI run
16:24:35 <mlavalle> yeah, I agree
16:25:08 <mlavalle> the bad news is that the rally job has timed out every time since I adopted this approach
16:25:26 <mlavalle> so here's what I propose as a next step
16:25:59 <slaweq> I just looked now at subnets create request with id req-561309d6-eadb-490b-8bcf-08c675ba3403
16:26:09 <slaweq> it took more than 13 seconds
16:26:43 <mlavalle> I am going to change the semantics of https://review.opendev.org/#/c/678438/14/neutron/conf/profiling.py@20
16:26:59 <slaweq> summary of it is: 2260690 function calls (2199747 primitive calls) in 13.825 seconds
16:27:25 <slaweq> more than 2 million function calls in one API request
16:27:35 <njohnston> wow
16:27:51 <mlavalle> well, regardless of the absolute times, that tells you the complexity of what our code does
16:28:13 <mlavalle> and IMO, at the very least, we should be cognizant of that
16:28:28 <njohnston> definitely
16:28:40 <slaweq> and look here:
16:28:42 <slaweq> /opt/stack/neutron/neutron/db/db_base_plugin_v2.py:695(_update_router_gw_ports)                      <-       1    0.000   12.052  /opt/stack/neutron/neutron/db/db_base_plugin_v2.py:732(_create_subnet_postcommit)
16:29:15 <mlavalle> yeah, these are the little surprises I was looking forward to
16:29:52 <slaweq> mlavalle: it seems to me that this is great data to analyze now
16:30:07 <njohnston> mlavalle: awesome work
16:30:22 <slaweq> and we can get from it probably some bottlenecks
16:30:40 <slaweq> mlavalle: I agree with njohnston - great work indeed :)
16:30:41 <mlavalle> yeah, with this, we will have analysis work for many months to come
16:30:44 <njohnston> my opinion is that we merge this code but have it controlled by a master config value that defaults to off, as a 'break glass in case of emergency' option we can always carry with us
16:31:05 <mlavalle> what I am going to propose is the following
16:31:49 <slaweq> njohnston: that's why I thought maybe to implement it as a separate middleware instead of a decorator - that way we can just add it to the pipeline of middlewares and get this data again
16:32:00 <mlavalle> 1) https://review.opendev.org/#/c/678438/14/neutron/conf/profiling.py@20 shouldn't be a boolean config option. It should be a list with the names of the resources whose calls will be profiled
16:32:17 <njohnston> mlavalle: good idea
16:32:36 <mlavalle> The default value of the list should be []
16:33:24 <mlavalle> that way the rally job doesn't collect the profiling data in every patch
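(A minimal sketch of such an option with oslo.config; the option name enabled_resources and the group name profiling are assumptions, not necessarily what the revised patch will use:)

    from oslo_config import cfg

    # Hypothetical option and group names; default=[] keeps profiling
    # off unless a patch or operator opts specific resources in.
    profiling_opts = [
        cfg.ListOpt('enabled_resources',
                    default=[],
                    help='Collection names (e.g. networks, subnets, '
                         'ports) whose API requests will be profiled; '
                         'an empty list disables profiling entirely.'),
    ]

    def register_profiling_opts(conf=cfg.CONF):
        conf.register_opts(profiling_opts, group='profiling')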
16:34:31 <njohnston> good
16:34:47 <mlavalle> 2) When someone wants to investigate performance, he or she proposes a patch where https://review.opendev.org/#/c/678438/14/.zuul.yaml@180 is set to the list of resources whose calls will be profiled. This patch is just a DNM patch
16:34:59 <mlavalle> that allows this person to collect the profiling data
16:35:15 <slaweq> mlavalle: or can run it locally also :)
16:35:25 <mlavalle> sure
16:36:37 <mlavalle> 3) as far as this patch, what I'll do is implement what I explained in the previous two points, and add a devref doc explaining what the process for analysis should be
16:36:49 <mlavalle> and with that, this patch should be ready to go
16:36:53 <mlavalle> makes sense?
16:36:57 <slaweq> yes
16:37:30 <njohnston> +100
16:37:39 <rubasov_> +1
16:37:57 <mlavalle> of course this entails modifying the decorator code to profile data only for the resources that are configured to be profiled
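(That gate could be a couple of lines at the top of the wrapper in the earlier sketch, assuming the hypothetical enabled_resources option above; 'collection' stands in for the resource name the controller serves:)

    # Skip profiling unless this resource was opted in via config.
    if collection not in cfg.CONF.profiling.enabled_resources:
        return func(*args, **kwargs)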
16:38:01 <bcafarel> +1, so merge with an empty list and have a devref explaining the process, then just throw up a DNM patch to enable it on the needed resources
16:38:15 <mlavalle> exactly bcafarel
16:38:40 <mlavalle> that way we don't punish the project contributors with overly long Rally job runs
16:39:13 <mlavalle> and still we have the means to continue with our performance analysis work
16:39:58 <mlavalle> I have two goals for the Shanghai PTG:
16:40:10 <mlavalle> 1) Explain to the team there how to use this tool
16:40:39 <mlavalle> 2) Have data to start making progress in uncovering and prioritizing bottlenecks
16:40:52 <mlavalle> makes sense?
16:41:08 <slaweq> definitely
16:41:20 <slaweq> will you add an item to the PTG etherpad?
16:41:21 <bcafarel> yes, also the data for 2 can be a good example of how to use the tool
16:41:31 <mlavalle> slaweq: I already did
16:41:36 <mlavalle> several weeks ago
16:41:39 <slaweq> mlavalle: great, thx
16:41:49 <slaweq> I still have to go through it before ptg :)
16:42:58 <mlavalle> lines 85 and 86 in https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
16:43:51 <mlavalle> ok so expect the next revision to the patch over the next few days
16:44:09 <mlavalle> hopefully we can merge it soon after so I can collect data for the PTG
16:45:29 <mlavalle> slaweq: I don't think we run dvr in that Rally job. Imagine the number of calls in that kind of scenario
16:45:53 <slaweq> mlavalle: yes, we run only legacy router (even not HA) in rally job
16:46:28 <mlavalle> so at some point in time, we should explore this with DVR.
16:46:39 <mlavalle> For the time being, let's go with the basics
16:46:47 <bcafarel> +1
16:46:55 <njohnston> +1
16:47:54 <mlavalle> any more questions or comments on this topic?
16:48:28 <slaweq> not from my side
16:48:50 <mlavalle> ok, I have one more small item to share with you
16:49:39 <mlavalle> About two weeks ago I was pinged in the channel by a new contributor, sapna, who wanted guidance on how to start contributing. He was looking for bugs to work on
16:49:57 <mlavalle> so I directed him to pay attention to the bug deputy's weekly email report
16:50:10 <mlavalle> and pick a bug from there
16:50:26 <mlavalle> he picked https://bugs.launchpad.net/neutron/+bug/1841865
16:50:26 <openstack> Launchpad bug 1841865 in neutron "[L2] stop processing ports twice in ovs-agent" [Wishlist,Confirmed] - Assigned to Sapna Jadhav (sapana45)
16:50:44 <mlavalle> which is not an entry level bug
16:50:56 <mlavalle> but I don't want to deflate his enthusiasm
16:50:57 <bcafarel> ah yes, he sent me an email on another potential bug (designate) that already had someone working on it in the end
16:51:27 <mlavalle> this bug is relevant from the performance point of view
16:51:47 <mlavalle> so this past weekend I spent some time providing him with some guidance in the bug
16:52:01 <mlavalle> you can read it in #4
16:52:26 <mlavalle> I wanted to share this with you for three reasons:
16:52:39 <mlavalle> 1) This bug is relevant from the performance point of view
16:53:09 <mlavalle> 2) By working with sapna, we might be able to mint a brand new contributor
16:54:03 <mlavalle> 3) If you see him around in the channel asking questions, help if you can. slaweq, bcafarel: he is in India, so closer to your time zones
16:54:32 <slaweq> mlavalle: sure, I'm also in touch with him
16:54:33 <bcafarel> sure thing!
16:55:13 <mlavalle> ok, does anybody have anything else to say?
16:56:05 <mlavalle> if not...
16:56:11 <mlavalle> have a great week
16:56:18 <bcafarel> the same :) o/
16:56:18 <mlavalle> thanks for attending
16:56:24 <mlavalle> #endmeeting