16:00:27 #startmeeting neutron_performance
16:00:28 Meeting started Mon Jul 1 16:00:27 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 The meeting name has been set to 'neutron_performance'
16:00:56 o/
16:01:14 hi
16:01:37 hi
16:04:11 ok, let's get going
16:05:48 #topic Rally / osprofiler / EnOS
16:06:01 I have progress updates on this topic
16:06:57 after the message that the EnOS team sent us with guidelines, all the small hurdles that I was facing to start running tests were solved
16:08:41 I was able to deploy the test that rubasov is implementing in rally-openstack: https://review.opendev.org/#/c/662781/
16:09:15 I uploaded the rally report to dropbox:
16:09:19 #link https://www.dropbox.com/s/ggk26tokkob4w8s/bind_port_rally_report.html?dl=0
16:09:26 coool
16:09:39 so you can see the results of the task as it is defined by the patch
16:11:19 I also uploaded for you an osprofiler report of an iteration representative of the average in the rally report:
16:12:44 #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:13:25 and an osprofiler report of the iteration with the highest elapsed time:
16:14:24 #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:15:06 I wonder where on create_subnet it spends about 7 seconds in this example
16:15:18 These are html files, so you need to download them and view them in your browser
16:15:23 creation of a subnet is "just" a db operation, right?
16:15:30 it is
16:16:02 mlavalle: and Your environment wasn't overloaded by other tasks during the time when You ran this test, right?
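For context, a rally scenario like the one being deployed above is driven by a small task file. This is only an illustrative sketch: the actual scenario name and arguments are defined in the rally-openstack patch linked above (https://review.opendev.org/#/c/662781/), so the names below are assumptions, not the patch's real interface.

```yaml
---
# Hypothetical rally task file for a bind-port style scenario.
# Scenario name and args are illustrative, not taken from the patch.
NeutronNetworks.create_and_bind_ports:
  - args:
      ports_per_network: 1
    runner:
      type: constant
      times: 100        # total iterations: the "load" knob to play with
      concurrency: 5    # parallel iterations: the "concurrency" knob
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      quotas:
        neutron:
          network: -1
          subnet: -1
          port: -1
```

Varying `times` and `concurrency` in the runner section is the usual way to change load on the environment, which matches next step 1) discussed below.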
16:16:14 well, it interacts with the dhcp agent
16:16:29 but I don't think that's the problem
16:16:41 and no, my test environment is not overloaded
16:17:15 mlavalle: to be sure that is not a problem, can You run the same test but with subnets with dhcp disabled?
16:17:50 sure
16:17:56 that's a good suggestion
16:19:05 thx
16:20:16 one problem that I am finding with the osprofiler report is that we get a lot of lines of sqlalchemy tracing
16:20:48 and they contain very detailed info on the DB operation
16:21:20 problem is that the context info (i.e. where in the code that DB operation was issued) is lacking
16:21:29 yes
16:21:36 so it is difficult to relate the trace to the code
16:21:36 I also noticed that
16:21:53 yep, logging that is outside the scope of sqlalchemy IMHO
16:23:02 so I am going to ping Andrey Kurilin and see if he has any suggestions
16:23:17 slaweq: do you remember his email address?
16:23:57 mlavalle: no, but I will find it and send it to You, ok?
16:24:15 slaweq: not needed. thanks. found it in gerrit
16:24:18 mlavalle: You can also try to catch him on the #openstack-rally channel, his nick is andreykurilin
16:24:48 anyways. Next steps are:
16:25:21 1) Play with this scenario changing the load and concurrency parameters
16:25:49 2) Try to add context info to the osprofiler traces
16:26:07 3) Investigate the subnet creation anomaly as suggested by slaweq
16:27:03 any other comments on this topic?
16:27:20 IMHO each operation there takes a very long time
16:27:32 it does
16:27:36 port create is almost as long as port update (binding) later
16:27:43 and port create is also only:
16:27:46 1. db operations
16:27:55 and it is a lot of DB operations
16:28:09 2. ipam - but as each port is on a new subnet, ipam should be fast IMO because no retries should be there
16:28:11 you can see that in the osprofiler reports
16:28:13 3. dhcp agent update
16:29:04 so basically before I looked into it I was expecting that it would be something with most of the time spent on port update (port binding)
16:29:18 and it's not - everything else is also slow :/
16:29:35 do You agree with me or am I missing something?
16:29:44 agree with you
16:30:05 and the fact is that it is almost all db operations
16:30:29 if you look at the osprofiler reports
16:30:45 that is why I want to be able to relate the reports with the code
16:31:05 yes
16:31:28 going back to your question about how much load I have in my test environment
16:31:58 I am monitoring the system continuously with top
16:32:20 it has 32 threads
16:32:30 all of which are almost idle
16:32:42 unless I trigger the rally workload
16:32:51 and even then it is not very loaded
16:33:02 so it's not the test environment
16:33:46 any other comments?
16:33:58 great work mlavalle, thx :)
16:34:16 :-)
16:34:28 ok, let's move on
16:34:39 #topic Security groups and RBAC
16:35:01 I have continued digging into https://bugs.launchpad.net/neutron/+bug/1830679
16:35:02 Launchpad bug 1830679 in neutron "Security groups RBAC cause a major performance degradation" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:35:45 Taking a suggestion from njohnsto_, I've been doing some profiling with cProfile
16:37:15 I posted one cProfile report in the bug and explained where I am conducting the profiling
16:38:36 you can see the details in the report
16:38:55 * njohnsto_ will check it out
16:38:59 but the conclusion is that querying 6000 security groups from the DB doesn't take much
16:39:38 where we spend a lot of time is building the objects after retrieving them from the DB, here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L800
16:40:49 and it spends most of this time on calculating whether a SG is shared with the tenant or not, am I understanding correctly what You wrote in the bug?
16:41:04 should we call on zzzeek's expert opinion on this?
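As an aside, the kind of cProfile run mentioned above can be reproduced with a small stdlib-only harness. The profiled function here is a stand-in, not Neutron code; it just mimics the pattern under discussion, where fetching rows is cheap but building an object per row dominates:

```python
import cProfile
import io
import pstats


def build_objects(rows):
    # Stand-in for the expensive object-building step discussed above:
    # the DB rows themselves are cheap, but constructing one object
    # per row is where the elapsed time goes.
    return [dict(id=r, shared=(r % 2 == 0)) for r in rows]


profiler = cProfile.Profile()
profiler.enable()
objects = build_objects(range(6000))
profiler.disable()

# Sort by cumulative time so the hot call chain shows up first,
# which is the usual view when reading a cProfile report.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

Wrapping just the suspect call between `enable()` and `disable()` keeps the report focused on one code path, which is essentially what "explained where I am conducting the profiling" amounts to.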
16:41:26 slaweq: I have some second thoughts about that comment
16:41:46 the point is that we spend a lot of time building the objects from the DB
16:42:01 the DB query is almost nothing
16:42:30 njohnsto_: that might be a good step
16:42:53 but before that, here's the strategy I intend to test later today:
16:43:25 Instead of calling https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617 blindly and letting it retrieve all the objects from the DB
16:44:23 I am going to query the DB from here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617
16:44:52 that is the same link
16:45:04 and build a specific query instead of calling obj_db_api.get_objects
16:45:40 in other words, instead of querying all the objects, I am going to query only the objects that the tenant has access to
16:45:50 good idea
16:46:11 either because it owns the object or because there is an rbac entry that grants the access
16:46:35 that way, the number of OVOs to build will be much lower
16:47:03 and in the case where there are no rbac entries, that user / deployer doesn't pay any extra cost
16:47:11 does it make sense?
16:47:41 +1
16:47:57 cool
16:48:44 the only other observation I want to make is that last night I was wondering how many times we call get_objects from the base class blindly and let it retrieve all the objects from the DB
16:48:48 +1
16:49:03 and then we filter later
16:49:36 Yes, that is definitely a bad pattern for operations at scale
16:49:37 maybe after fixing this specific bug I am going to dig into the code looking for cases like this
16:50:17 because if we retrieve a lot of objects from the DB, build the OVOs and then we filter
16:50:27 we wasted a lot of resources building the OVOs
16:50:51 +1
16:51:04 ok, that's all I had for today
16:51:10 any other comments?
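The filtering strategy described above — fetch only the objects the tenant can see, instead of building every OVO and filtering afterwards — can be sketched in plain Python. In Neutron this would be a SQLAlchemy query joining the security group and RBAC tables; the data structures below are illustrative stand-ins, not the real schema:

```python
def visible_security_groups(all_sgs, rbac_entries, tenant_id):
    """Return only the SGs the tenant may see, before any object building.

    all_sgs: list of dicts with 'id' and 'project_id' (raw DB rows).
    rbac_entries: list of dicts with 'object_id' and 'target_tenant'
        ('*' meaning shared with everyone), mimicking rbac entries.
    """
    # SG ids granted to this tenant (or to everyone) via an rbac entry.
    granted = {
        e["object_id"]
        for e in rbac_entries
        if e["target_tenant"] in (tenant_id, "*")
    }
    # Keep a group if the tenant owns it, or an rbac entry grants access.
    return [
        sg for sg in all_sgs
        if sg["project_id"] == tenant_id or sg["id"] in granted
    ]


# With few or no rbac entries, a tenant only pays for its own groups,
# so the expensive OVO-building step gets a much smaller input list.
sgs = [{"id": n, "project_id": "t1" if n < 10 else "t2"} for n in range(6000)]
rbac = [{"object_id": 42, "target_tenant": "t1"}]
mine = visible_security_groups(sgs, rbac, "t1")
```

Pushing this predicate down into the SQL query (rather than into Python, as sketched here) is what keeps the number of rows retrieved, and hence OVOs built, low.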
16:51:26 I wanted to ask about one thing
16:51:48 in launchpad we have the tag "loadimpact" for bugs related to performance problems
16:51:55 here is the list of bugs with this tag:
16:51:56 https://bugs.launchpad.net/neutron/+bugs?field.searchtext=&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=loadimpact&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&orderby=-importance&start=0
16:52:02 should we maybe:
16:52:17 1. clean this list and close old/already fixed bugs?
16:52:36 2. check this list during those meetings?
16:52:41 what do You think?
16:52:46 excellent idea
16:53:08 I'll try to start with the clean up
16:53:18 thx a lot :)
16:53:40 that's all from me
16:53:46 cool
16:53:50 Thanks for attending
16:53:55 have a great week!
16:53:59 #endmeeting