16:01:09 #startmeeting neutron_performance
16:01:10 Meeting started Mon Oct 7 16:01:09 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:13 The meeting name has been set to 'neutron_performance'
16:02:29 hello
16:03:03 hi!
16:03:05 hi
16:03:11 o/
16:03:47 o/
16:03:54 o/
16:04:49 ok let's get going
16:05:00 #topic Updates
16:05:10 does anybody have any updates?
16:06:05 I do but I want to let the others talk first ;-)
16:06:47 none from me
16:07:13 nothing here
16:07:13 I don't have any
16:07:58 neither do I
16:07:59 Neither do I
16:08:34 ok in that case let me give my update
16:09:17 I continued working on the patch to collect profiling data from our Rally job: https://review.opendev.org/#/c/678438/
16:10:17 as promised during the last meeting, I decorated the WSGI controllers in our API, both the legacy base controller https://review.opendev.org/#/c/678438/14/neutron/api/v2/base.py
16:10:42 and the pecan controllers: https://review.opendev.org/#/c/678438/14/neutron/pecan_wsgi/controllers/resource.py
16:11:32 The nice thing about this is that now we profile all the calls to the API that we receive from our Rally job
16:12:12 fantastic
16:12:16 even though the Rally job timed out
16:12:37 you can still click on it and download the Neutron server log
16:12:52 please download it; don't open it in your browser
16:13:21 66 MB - wow
16:13:30 quite huge
16:13:31 and then you can open it with less
16:13:49 yeah, search it for the string "os-profiler parent trace-id"
16:15:29 you will see that for each call, we are showing the profiling data for the top 50 methods in cumulative CPU time and, underneath, for those top 50, who called whom and who was called by whom
16:16:21 can you all see it?
16:16:45 in MacOS I can open it with vim
16:17:55 mlavalle: I'm still downloading it
16:18:50 and since I am also logging the osprofiler trace ids for each call, we can correlate the cProfile data in the server log with the osprofiler data in the Rally report
16:18:57 does that make sense?
16:19:20 slaweq: wow, you have a slow connection. I can download it in 5 secs
16:20:01 if it's downloading more than 66 MB then you may want to do this: wget --header="accept-encoding: gzip" https://f5ae4a32b02585f5baea-19e17e5db2abd0e280dd7dd93f01d60e.ssl.cf5.rackcdn.com/678438/14/check/neutron-rally-task/832c214/controller/logs/screen-q-svc.txt.gz
16:20:13 mlavalle: I downloaded it with wget, it has 865M now
16:20:42 I think the server gunzips it on the fly if you don't have that header
16:20:42 yeah that size sounds right
16:21:59 soooo much data
16:22:38 what do you all think?
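[For readers following the log: the decorator mlavalle describes above works roughly like the minimal sketch below, assuming plain cProfile/pstats and a standard Python logger. The name profile_api_call and the log format are illustrative assumptions, not the actual code in https://review.opendev.org/#/c/678438/, which also logs the osprofiler trace id so the output can be correlated with the Rally report.]

    import cProfile
    import functools
    import io
    import logging
    import pstats

    LOG = logging.getLogger(__name__)


    def profile_api_call(func):
        """Profile a controller call and log cumulative-CPU-time statistics."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                out = io.StringIO()
                stats = pstats.Stats(profiler, stream=out)
                # Top 50 functions by cumulative CPU time, followed by the
                # caller/callee relationships for those same functions.
                stats.sort_stats('cumulative').print_stats(50)
                stats.print_callers(50)
                stats.print_callees(50)
                # The real patch prefixes this with the osprofiler trace id;
                # omitted here to keep the sketch self-contained.
                LOG.debug('profiling data for %s:\n%s',
                          func.__name__, out.getvalue())

        return wrapper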
16:24:02 at first glance it looks interesting for debugging
16:24:19 but I'm not sure if we should merge it and use it on each CI run
16:24:35 yeah, I agree
16:25:08 the bad news is that the rally job has timed out all the time since I adopted this approach
16:25:26 so here's what I propose as a next step
16:25:59 I just looked now at a subnet create request with id req-561309d6-eadb-490b-8bcf-08c675ba3403
16:26:09 it took more than 13 seconds
16:26:43 I am going to change the semantics of https://review.opendev.org/#/c/678438/14/neutron/conf/profiling.py@20
16:26:59 summary of it is: 2260690 function calls (2199747 primitive calls) in 13.825 seconds
16:27:25 more than 2 million function calls in one API request
16:27:35 wow
16:27:51 well, regardless of the absolute times, that tells you the complexity of what our code does
16:28:13 and IMO, at the very least, we should be cognizant of that
16:28:28 definitely
16:28:40 and look here:
16:28:42 /opt/stack/neutron/neutron/db/db_base_plugin_v2.py:695(_update_router_gw_ports) <- 1 0.000 12.052 /opt/stack/neutron/neutron/db/db_base_plugin_v2.py:732(_create_subnet_postcommit)
16:29:15 yeah, these are the little surprises I was looking forward to
16:29:52 mlavalle: it seems to me that this is great data to analyze now
16:30:07 mlavalle: awesome work
16:30:22 and we can probably find some bottlenecks from it
16:30:40 mlavalle: I agree with njohnston - great work indeed :)
16:30:41 yeah, with this, we will have analysis work for many months to come
16:30:44 my opinion is that we merge this code but have it controlled by a master config value that defaults to off, as a 'break glass in case of emergency' we can carry with us always
16:31:05 what I am going to propose is the following
16:31:49 njohnston: that's why I thought to maybe implement it as a separate middleware instead of a decorator - that way we can just add it to the pipeline of middlewares and get this data again
16:32:00 1) https://review.opendev.org/#/c/678438/14/neutron/conf/profiling.py@20 shouldn't be a boolean config option. It should be a list with the names of the resources whose calls will be profiled
16:32:17 mlavalle: good idea
16:32:36 The default value of the list should be []
16:33:24 that way the rally job doesn't collect the profiling data in every patch
16:34:31 good
16:34:47 2) When someone wants to investigate performance, he or she proposes a patch where https://review.opendev.org/#/c/678438/14/.zuul.yaml@180 is set to the list of resources whose calls will be profiled. This patch is just a DNM patch
16:34:59 that allows this person to collect the profiling data
16:35:15 mlavalle: or can run it locally also :)
16:35:25 sure
16:36:37 3) as far as this patch goes, what I'll do is implement what I explained in the previous two points, and add a devref doc explaining what the process for analysis should be
16:36:49 and with that, this patch should be ready to go
16:36:53 makes sense?
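[The list option proposed in point 1) could look roughly like the sketch below, using oslo.config's ListOpt with an empty default so nothing is profiled unless a patch opts in. The option name profiled_resources and the [profiling] group are assumptions for illustration; the real option lives in neutron/conf/profiling.py in the review above and may be named differently.]

    from oslo_config import cfg

    profiling_opts = [
        cfg.ListOpt('profiled_resources',
                    default=[],
                    help='Names of the API resources (e.g. ports, subnets, '
                         'routers) whose calls will be profiled. The default '
                         'empty list disables profiling entirely.'),
    ]


    def register_profiling_opts(conf=cfg.CONF):
        # Registering under a dedicated group keeps the option easy to set
        # from a DNM patch or a local neutron.conf for ad-hoc analysis.
        conf.register_opts(profiling_opts, group='profiling')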
16:36:57 yes
16:37:30 +100
16:37:39 +1
16:37:57 of course this entails modifying the decorator code to profile data only for the resources that are configured to be profiled
16:38:01 +1 so merge with the empty list and have a devref explaining it, just throwing a DNM patch to enable it on the needed resources
16:38:15 exactly bcafarel
16:38:40 that way we don't punish the project contributors with overly long Rally job runs
16:39:13 and we still have the means to continue with our performance analysis work
16:39:58 I have two goals for the Shanghai PTG:
16:40:10 1) Explain to the team there how to use this tool
16:40:39 2) Have data to start making progress in uncovering and prioritizing bottlenecks
16:40:52 makes sense?
16:41:08 definitely
16:41:20 will you add an item to the PTG etherpad?
16:41:21 yes, also the data for 2 can be a good example of how to use the tool
16:41:31 slaweq: I already did
16:41:36 several weeks ago
16:41:39 mlavalle: great, thx
16:41:49 I still have to go through it before the PTG :)
16:42:58 lines 85 and 86 in https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
16:43:51 ok so expect the next revision of the patch over the next few days
16:44:09 hopefully we can merge it soon after so I can collect data for the PTG
16:45:29 slaweq: I don't think we run dvr in that Rally job. Imagine the number of calls in that kind of scenario
16:45:53 mlavalle: yes, we run only a legacy router (not even HA) in the rally job
16:46:28 so at some point in time, we should explore this with DVR.
16:46:39 For the time being, let's go with the basics
16:46:47 +1
16:46:55 +1
16:47:54 any more questions or comments on this topic?
16:48:28 not from my side
16:48:50 ok, I have one more small item to share with you
16:49:39 About two weeks ago I was pinged in the channel by a new contributor, sapna, who wanted guidance on how to start contributing. He was looking for bugs to work on
16:49:57 so I directed him to pay attention to the bug deputy's weekly email report
16:50:10 and pick a bug from there
16:50:26 he picked https://bugs.launchpad.net/neutron/+bug/1841865
16:50:26 Launchpad bug 1841865 in neutron "[L2] stop processing ports twice in ovs-agent" [Wishlist,Confirmed] - Assigned to Sapna Jadhav (sapana45)
16:50:44 which is not an entry level bug
16:50:56 but I don't want to deflate his enthusiasm
16:50:57 ah yes he sent me an email on another potential bug (designate) that already had someone working on it in the end
16:51:27 this bug is relevant from the performance point of view
16:51:47 so this past weekend I spent some time providing him with some guidance in the bug
16:52:01 you can read it in comment #4
16:52:26 I wanted to share this with you for two reasons:
16:52:39 1) This bug is relevant from the performance point of view
16:53:09 2) By working with sapna, we might be able to mint a brand new contributor
16:54:03 3) If you see him around in the channel asking questions, help if you can. slaweq, bcafarel: he is in India, so closer to your time zones
16:54:32 mlavalle: sure, I'm also in touch with him
16:54:33 sure thing!
16:55:13 ok, does anybody have anything else to say?
16:56:05 if not...
16:56:11 have a great week
16:56:18 the same :) o/
16:56:18 thanks for attending
16:56:24 #endmeeting
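[The decorator change mentioned at 16:37:57 - profiling only the resources that are configured - could gate on the list option with a small check like the sketch below. It assumes the hypothetical profiled_resources option from the previous sketch; the profiling decorator would call this before enabling cProfile, so the default empty list keeps the Rally job completely unprofiled.]

    from oslo_config import cfg


    def _should_profile(collection):
        """Profile a call only if its resource was opted in via config.

        With the default empty list this always returns False, so no
        cProfile overhead is added to regular CI runs.
        """
        return collection in cfg.CONF.profiling.profiled_resources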