16:00:27 <mlavalle> #startmeeting neutron_performance
16:00:28 <openstack> Meeting started Mon Jul  1 16:00:27 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 <openstack> The meeting name has been set to 'neutron_performance'
16:00:56 <njohnston> o/
16:01:14 <slaweq> hi
16:01:37 <haleyb> hi
16:04:11 <mlavalle> ok, let's get going
16:05:48 <mlavalle> #topic Rally / osprofiler / EnOS
16:06:01 <mlavalle> I have progress updates on this topic
16:06:57 <mlavalle> after the message with guidelines that the EnOS team sent us, all the small hurdles I was facing to start running tests were solved
16:08:41 <mlavalle> I was able to deploy the test that rubasov is implementing in rally-openstack: https://review.opendev.org/#/c/662781/
16:09:15 <mlavalle> I uploaded the rally report to dropbox:
16:09:19 <mlavalle> #link https://www.dropbox.com/s/ggk26tokkob4w8s/bind_port_rally_report.html?dl=0
16:09:26 <njohnston> coool
16:09:39 <mlavalle> so you can see the results of the task as it is defined by the patch
16:11:19 <mlavalle> I also uploaded for you an osprofiler report of an iteration representative of the average in the rally report:
16:12:44 <mlavalle> #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:13:25 <mlavalle> and an osprofiler report of the iteration with the highest elapsed time:
16:14:24 <mlavalle> #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:15:06 <slaweq> I wonder where create_subnet spends about 7 seconds in this example
16:15:18 <mlavalle> These are html files, so you need to download them and view them in your browser
16:15:23 <slaweq> creation of a subnet is "just" a db operation, right?
16:15:30 <mlavalle> it is
16:16:02 <slaweq> mlavalle: and Your environment wasn't overloaded by other tasks while You were running this test, right?
16:16:14 <mlavalle> well, it interacts with the dhcp agent
16:16:29 <mlavalle> but I don't think that's the problem
16:16:41 <mlavalle> and no, my test environment is not overloaded
16:17:15 <slaweq> mlavalle: to be sure that is not the problem, can You run the same test but with dhcp disabled on the subnets?
16:17:50 <mlavalle> sure
16:17:56 <mlavalle> that's a good suggestion
16:19:05 <slaweq> thx
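
A minimal sketch of slaweq's suggested experiment, assuming openstacksdk and a clouds.yaml entry named "devstack" (the cloud name, network name and CIDR are illustrative):

    # Time subnet creation with DHCP disabled, to separate the dhcp agent
    # notification from the pure DB/IPAM cost of create_subnet.
    import time

    import openstack

    conn = openstack.connect(cloud="devstack")  # assumed clouds.yaml entry

    net = conn.network.create_network(name="perf-test-net")
    start = time.monotonic()
    subnet = conn.network.create_subnet(
        network_id=net.id,
        ip_version=4,
        cidr="10.0.0.0/24",
        enable_dhcp=False,  # the point of the experiment: no dhcp agent update
    )
    print(f"create_subnet took {time.monotonic() - start:.2f}s ({subnet.id})")

The same comparison could also be driven through the Rally scenario, if its subnet create arguments allow disabling DHCP.
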
16:20:16 <mlavalle> one problem that I am finding with the osprofiler report is that we get a lot of lines of sqlalchemy tracing
16:20:48 <mlavalle> and they contain very detailed info on the DB operation
16:21:20 <mlavalle> the problem is that the context info (i.e. where in the code the DB operation was issued) is lacking
16:21:29 <slaweq> yes
16:21:36 <mlavalle> so it is difficult to relate the trace to the code
16:21:36 <slaweq> I also noticed that
16:21:53 <njohnsto_> yep, logging that is outside the scope of sql alchemy IMHO
16:23:02 <mlavalle> so I am going to ping Andrey Kurilin and see if he has any suggestions
16:23:17 <mlavalle> slaweq: do you remember his email address?
16:23:57 <slaweq> mlavalle: no but I will find it and send it to You, ok?
16:24:15 <mlavalle> slaweq: not needed. thanks. found it in gerrit
16:24:18 <slaweq> mlavalle: You can also try to catch him on #openstack-rally channel, his nick is andreykurilin
16:24:48 <mlavalle> anyways. Next steps are:
16:25:21 <mlavalle> 1) Play with this scenario changing the load and concurrency parameters
16:25:49 <mlavalle> 2) Try to add context info to the osprofiler traces (see the sketch after this list)
16:26:07 <mlavalle> 3) Investigate the subnet creation anomaly as suggested by slaweq
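
One possible way to attack step 2, as a sketch only: osprofiler lets you wrap arbitrary code in named trace points, so the sqlalchemy spans would nest under a span that identifies the calling code. The method and span names below are illustrative, not actual Neutron code:

    from osprofiler import profiler

    @profiler.trace("port-create-db", info={"module": "plugin"}, hide_args=True)
    def create_port_db(context, port_data):
        # existing DB logic goes here; every SQL statement issued inside this
        # method now shows up in the report as a child of "port-create-db"
        pass

    def create_subnet_db(context, subnet_data):
        # the context-manager form works for narrower regions inside a method
        with profiler.Trace("subnet-ipam-allocation"):
            pass  # IPAM allocation calls
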
16:27:03 <mlavalle> any other comments on this topic?
16:27:20 <slaweq> IMHO each operation there takes very long time
16:27:32 <mlavalle> it does
16:27:36 <slaweq> port create is almost as long as port update (binding) later
16:27:43 <slaweq> and port create is also only:
16:27:46 <slaweq> 1. db operations
16:27:55 <mlavalle> and it is a lot of DB operations
16:28:09 <slaweq> 2. ipam - but since each port is on a new subnet, ipam should be fast IMO because there should be no retries
16:28:11 <mlavalle> you can see that in the osprofiler reports
16:28:13 <slaweq> 3. dhcp agent update
16:29:04 <slaweq> so basically, before I looked into it I was expecting that most of the time would be spent on port update (port binding)
16:29:18 <slaweq> and it's not - everything else is also slow :/
16:29:35 <slaweq> do You agree with me or am I missing something?
16:29:44 <mlavalle> agree with you
16:30:05 <mlavalle> and the fact is that it is almost all db operations
16:30:29 <mlavalle> if you look at the osprofiler reports
16:30:45 <mlavalle> that is why I want to be able to relate the reports with the code
16:31:05 <slaweq> yes
16:31:28 <mlavalle> going back to your question about how much load I have in my test environment
16:31:58 <mlavalle> I am continuously monitoring the system with top
16:32:20 <mlavalle> it has 32 threads
16:32:30 <mlavalle> all of which are almost idle
16:32:42 <mlavalle> unless I trigger the rally workload
16:32:51 <mlavalle> and even then it is not very loaded
16:33:02 <mlavalle> so it's not the test environment
16:33:46 <mlavalle> any other comments?
16:33:58 <slaweq> great work mlavalle, thx :)
16:34:16 <mlavalle> :-)
16:34:28 <mlavalle> ok, let's move on
16:34:39 <mlavalle> #topic Security groups and RBAC
16:35:01 <mlavalle> I have continued digging into https://bugs.launchpad.net/neutron/+bug/1830679
16:35:02 <openstack> Launchpad bug 1830679 in neutron "Security groups RBAC cause a major performance degradation" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:35:45 <mlavalle> Taking a suggestion from njohnsto_, I've been doing some profiling with cProfile
16:37:15 <mlavalle> I posted one cProfile report in the bug and explained where I am conducting the profiling
16:38:36 <mlavalle> you can see the details in the report
16:38:55 * njohnsto_ will check it out
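
For reference, a minimal sketch of this kind of measurement with cProfile (the profiled call is a placeholder; substitute the actual get_objects() call being investigated):

    import cProfile
    import pstats

    prof = cProfile.Profile()
    prof.enable()
    # sgs = SecurityGroup.get_objects(admin_context)  # the call under test
    prof.disable()

    # sort by cumulative time to see whether the DB query or the OVO
    # construction dominates
    pstats.Stats(prof).sort_stats("cumulative").print_stats(30)
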
16:38:59 <mlavalle> but the conclusion is that querying 6000 security groups from the DB doesn't take much
16:39:38 <mlavalle> where we spend a lot of time is building the objects after retrieving them from the DB, here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L800
16:40:49 <slaweq> and it spends most of this time calculating whether the SG is shared with the tenant or not, am I understanding correctly what You wrote in the bug?
16:41:04 <njohnsto_> should we call on zzzeek’s expert opinion on this?
16:41:26 <mlavalle> slaweq: I have some second thoughts about that comment
16:41:46 <mlavalle> the point is that we spend a lot of time building the objects from the DB
16:42:01 <mlavalle> the DB query is almost nothing
16:42:30 <mlavalle> njohnsto_: that might be a good step
16:42:53 <mlavalle> but before that here's the strategy I intend to test later today:
16:43:25 <mlavalle> Instead of calling https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617 blindly and letting it retrieve all the objects from the DB
16:44:23 <mlavalle> I am going to query the DB from here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617
16:44:52 <njohnsto_> that is the same link
16:45:04 <mlavalle> and build a specific query instead of calling obj_db_api.get_objects
16:45:40 <mlavalle> in other words, instead of querying all the objects, I am going to query only the objects that the tenant has access to
16:45:50 <njohnsto_> good idea
16:46:11 <mlavalle> either because it owns the object or because there is an rbac entry that grants access
16:46:35 <mlavalle> that way, the number of OVOs to build will be much lower
16:47:03 <mlavalle> and in the case where there are no rbac entries, that user / deployer doesn't pay any extra cost
16:47:11 <mlavalle> does it make sense?
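
A sketch of that strategy in SQLAlchemy terms, with the model and column names assumed for illustration rather than taken from the actual Neutron schema: join the security group table to its RBAC table and keep only rows owned by the tenant or shared with it (directly or via the wildcard), so only those rows become OVOs.

    from sqlalchemy import or_

    def get_visible_security_groups(session, project_id, sg_model, rbac_model):
        # sg_model / rbac_model stand in for the security group and RBAC
        # declarative models; the column names used here are assumptions
        query = (
            session.query(sg_model)
            .outerjoin(rbac_model, rbac_model.object_id == sg_model.id)
            .filter(
                or_(
                    sg_model.project_id == project_id,       # owned by the tenant
                    rbac_model.target_tenant == project_id,  # shared with the tenant
                    rbac_model.target_tenant == "*",         # shared with everyone
                )
            )
            .distinct()
        )
        return query.all()  # only these rows get turned into OVOs
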
16:47:41 <njohnsto_> +1
16:47:57 <mlavalle> cool
16:48:44 <mlavalle> the only other observation I want to make is that last night I was wondering how many times we call get_objects from the base class blindly and let it retrieve all the objects from the DB
16:48:48 <slaweq> +1
16:49:03 <mlavalle> and then we filter later
16:49:36 <njohnsto_> Yes, that is definitely a bad pattern for operations at scale
16:49:37 <mlavalle> maybe after fixing this specific bug I am going to dig in the code looking for cases like this
16:50:17 <mlavalle> because if we retrieve a lot of objects from the DB, build the OVOs and then filter
16:50:27 <mlavalle> we waste a lot of resources building the OVOs
16:50:51 <njohnsto_> +1
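
To make the pattern concrete, a small illustration with hypothetical OVO usage (the import paths, filter names, and the idea that get_objects accepts field filters as keyword arguments are assumptions made for illustration):

    from neutron.objects.ports import Port      # assumed import path
    from neutron_lib import context as n_context

    context = n_context.get_admin_context()

    # anti-pattern: build every OVO, then throw most of them away in Python
    all_ports = Port.get_objects(context)
    active = [p for p in all_ports if p.status == "ACTIVE"]

    # better: push the filter into get_objects so the DB returns only the
    # needed rows and only those OVOs are constructed
    active = Port.get_objects(context, status="ACTIVE")
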
16:51:04 <mlavalle> ok, that's all I had for today
16:51:10 <mlavalle> any other comments?
16:51:26 <slaweq> I wanted to ask about one thing
16:51:48 <slaweq> in launchpad we have the tag "loadimpact" for bugs related to performance problems
16:51:55 <slaweq> here is the list of bugs with this tag:
16:51:56 <slaweq> https://bugs.launchpad.net/neutron/+bugs?field.searchtext=&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=loadimpact&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&orderby=-importance&start=0
16:52:02 <slaweq> should we maybe:
16:52:17 <slaweq> 1. clean this list and close old/already fixed bugs?
16:52:36 <slaweq> 2. check this list during these meetings?
16:52:41 <slaweq> what do You think?
16:52:46 <mlavalle> excellent idea
16:53:08 <mlavalle> I'll try to start with the clean up
16:53:18 <slaweq> thx a lot :)
16:53:40 <slaweq> that's all from me
16:53:46 <mlavalle> cool
16:53:50 <mlavalle> Thanks for attending
16:53:55 <mlavalle> have a great week!
16:53:59 <mlavalle> #endmeeting