16:00:11 #startmeeting neutron_performance
16:00:13 Meeting started Mon Jun 17 16:00:11 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:16 The meeting name has been set to 'neutron_performance'
16:00:28 o/
16:00:31 hi
16:00:40 hi!
16:01:27 njohnston: are you coming to the performance meeting? If you have a chance, please join us. I have a few questions for you
16:01:46 about OVO
16:02:50 ok, let's get going
16:02:58 #topic Updates
16:02:59 late o/
16:03:24 rubasov: just in time.... any updates for today?
16:03:40 mlavalle: yes
16:04:02 first we noticed the vm boot rally scenario to be quite unstable
16:04:21 * mlavalle already +2ed the related patch ;-)
16:04:27 so there's this change: https://review.opendev.org/665614
16:04:33 mlavalle: oh thanks
16:04:59 then I have some news about profiling the vif plugging
16:05:01 I also rechecked it
16:05:06 thank you
16:05:41 I looked into the osprofiler report and realized there's nothing in it about when a vif plug finishes
16:05:52 after some debugging we realized the cause
16:06:19 as you know, osprofiler requires the trace information (trace id, hmac, parent id, etc)
16:06:25 yes
16:06:36 to be passed on in each inter-component communication
16:06:48 this is usually passed on in each http message
16:06:53 or rpc message
16:06:59 yeap
16:07:02 but the vif plug is the exception
16:07:20 * njohnston is here, apologies for being late
16:07:20 there are no http calls, right?
16:07:26 because there the communication channel is the ovsdb (at least for the ovs plugin)
16:07:52 * bcafarel also late and sorry for it
16:08:03 we pass on the port uuid in the external_ids field of the port record in ovsdb
16:08:21 exactly
16:08:32 but we cannot really do the same with the trace info
16:08:50 first because libvirt does not allow arbitrary external_ids
16:09:11 second because libvirt deletes and recreates the ovs port
16:09:23 so whatever we put into it gets deleted by libvirt
16:09:38 we're thinking about a hacky way
16:10:08 to serialize the trace info into the bridges table of ovsdb when os-vif plugs
16:10:35 and remove it by neutron-ovs-agent as soon as it has triggered tracing in the ovs-agent
16:10:44 but as you can see it's quite hacky
16:10:51 yeah
16:11:14 if you see a better way please let us know
16:11:40 let's finish the updates section of the meeting and we'll come back to it in a few minutes
16:11:41 ok?
16:11:49 ok
16:12:02 does anybody else have more updates?
16:12:10 rubasov: I think that sean-k-mooney was working on some patch to nova/os-vif to make all plugging be done on the os-vif side
16:12:20 and then libvirt would not recreate it
16:12:32 there were some problems with this but you can ask him about the current status
16:12:45 slaweq: that was reverted I think
16:13:03 yes, it was, but I think he was still trying to do something like that later
16:13:04 I've seen the various patches fixing the security issue
16:13:29 slaweq: okay, will ask anyway, because we'll have os-vif changes likely
16:13:50 ok, I assume there are no more updates....
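[Editor's note: for readers unfamiliar with osprofiler, a minimal sketch of how the trace context normally travels between services over HTTP — exactly the kind of message the vif plug lacks, per the discussion above. This uses osprofiler's public API as I understand it; the HMAC key is a placeholder.]

    from osprofiler import profiler
    from osprofiler import web

    # The originating service starts a trace; the HMAC key is a shared
    # secret configured on every participating service.
    profiler.init(hmac_key="SECRET_KEY")

    # Every outgoing HTTP request then carries the trace context
    # (trace id, parent id, signed with the HMAC) as headers...
    headers = web.get_trace_id_headers()
    print(headers)  # {'X-Trace-Info': ..., 'X-Trace-HMAC': ...}

    # ...and the receiving service reconstructs the same trace from those
    # headers. The vif plug has no HTTP or RPC message to attach them to.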
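[Editor's note: a rough illustration of the "hacky way" described above — stashing the trace context on the Bridge record, which libvirt leaves alone, and having neutron-ovs-agent consume and delete it. The key names, the br-int default, and the use of ovs-vsctl are assumptions made for this sketch, not the actual proof-of-concept code.]

    import subprocess

    from osprofiler import profiler

    HMAC_KEY = "SECRET_KEY"  # in practice read from config on both sides

    def stash_trace_info(port_id, bridge="br-int"):
        # os-vif side: record the live trace context on the Bridge row;
        # libvirt deletes and recreates the Port row, but not the Bridge.
        p = profiler.get()
        if p is None:
            return  # tracing is not enabled for this request
        subprocess.check_call([
            "ovs-vsctl", "set", "Bridge", bridge,
            "external_ids:trace-base-%s=%s" % (port_id, p.get_base_id()),
            "external_ids:trace-parent-%s=%s" % (port_id, p.get_id()),
        ])

    def pop_trace_info(port_id, bridge="br-int"):
        # neutron-ovs-agent side: continue the trace, then remove the
        # keys so the hack leaves nothing behind, as discussed above.
        def get(key):
            out = subprocess.check_output([
                "ovs-vsctl", "get", "Bridge", bridge,
                "external_ids:%s-%s" % (key, port_id)])
            return out.decode().strip().strip('"')

        profiler.init(HMAC_KEY, base_id=get("trace-base"),
                      parent_id=get("trace-parent"))
        subprocess.check_call([
            "ovs-vsctl", "remove", "Bridge", bridge, "external_ids",
            "trace-base-%s" % port_id, "trace-parent-%s" % port_id])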
16:14:02 #topic EnOS framework
16:14:35 jrbalderrama: thanks for joining us. Is there any progress with the requests I made last week?
16:14:46 Yes!
16:15:10 msimonin sent a mail today addressing those questions
16:15:38 ahh, ok, just a few minutes ago
16:15:50 I checked my inbox just before that
16:16:08 it was sent 20 min before the meeting started
16:16:41 no problem. Thanks!
16:17:11 I just glanced over it quickly and it seems he followed up on all the pending points
16:17:39 I will restart my testing soon with that and will yell if I need help
16:17:42 ok?
16:17:51 right
16:18:08 thank you very much for the follow up
16:18:21 anything else we should say today about EnOS?
16:18:48 nothing for today
16:18:54 cool
16:18:57 let's move on
16:19:22 ahh, I have one more request....
16:19:55 rubasov: what is the status of merging your port binding scenario in the openstack rally repo?
16:20:27 mlavalle: as soon as zuul votes +1 I want to send a mail to the rally folks asking for review
16:21:37 jrbalderrama: ^^^ given what rubasov just said, could I get instructions on how to add an unmerged Rally scenario to EnOS?
16:21:56 I tried briefly a few days ago and couldn't figure it out
16:22:18 we will send you the instructions by mail
16:22:24 * mlavalle needs to learn more about the containerized deployment
16:22:34 cool, thanks!
16:23:13 #topic Rally job and osprofiler
16:23:49 I spent some time looking at the reports produced by the VM boot scenario: http://logs.openstack.org/14/665614/2/check/neutron-rally-task/cc6f223/results/report.html.gz#/NovaServers.boot_and_delete_server/output
16:24:31 rubasov, slaweq: I have to say that you guys have done a great job getting this scenario together and also in getting the osprofiler report in good shape
16:24:41 the reporting has improved since the PTG
16:25:06 mlavalle: thx :)
16:25:14 slaweq has done the bigger part
16:25:42 now, looking at the report, I noted that it is generated from the point of view of Nova
16:25:42 rubasov: not exactly, it was mostly done by Andrey from the rally team and kudos should go to him :)
16:26:26 slaweq: thanks to Andrey
16:26:46 what I mean by that is that I can see the conversation between Nova and the Neutron server, and work being done by the Neutron server
16:27:11 but I didn't find the conversation between the neutron server and the L2 agent
16:27:21 did I miss something?
16:27:53 rubasov: ^^^
16:27:56 mlavalle: isn't this about what we started to discuss right at the beginning of this meeting?
16:28:06 yes
16:28:22 all stuff that's related to a particular trace id
16:28:34 that is running in ovs-agent
16:28:44 is kicked off by the plug
16:29:00 so if we can't get the trace info over into the agent
16:29:09 then the agent's work is not really traced
16:29:12 yeah, but the trace id passes between the neutron server and the L2 agent through the rpc channel, doesn't it?
16:29:34 in other scenarios I have seen it
16:30:12 the only message where we could pass it is get_device_details I think
16:30:40 but can the neutron server tell which trace id corresponds to which get_device_details call?
16:31:10 rubasov: I don't think so, as it's based on what is stored in external_ids, right?
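[Editor's note: the question just asked — whether the server can tell which trace id goes with which get_device_details call — has one conceivable answer, raised next: the agent passing the trace context itself. A purely hypothetical sketch, assuming the agent has already picked the context up (e.g. from the ovsdb stash above); the trace_info argument does not exist in the real RPC API.]

    from osprofiler import profiler

    # Agent side: attach the current trace context to the RPC call.
    # "trace_info" is a made-up argument for illustration only.
    def get_device_details(rpc_client, context, device, agent_id):
        p = profiler.get()
        trace_info = None
        if p is not None:
            trace_info = {"base_id": p.get_base_id(),
                          "parent_id": p.get_id()}
        return rpc_client.call(context, "get_device_details",
                               device=device, agent_id=agent_id,
                               trace_info=trace_info)

    # Server side: adopt the caller's trace before doing the real work,
    # so the server's spans show up under the same trace id.
    def handle_get_device_details(context, device, agent_id,
                                  trace_info=None):
        if trace_info:
            profiler.init("SECRET_KEY",  # HMAC key from server config
                          base_id=trace_info["base_id"],
                          parent_id=trace_info["parent_id"])
        # ... existing handler logic ...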
16:31:11 I'm not sure if that's doable without the agent passing the trace id first
16:31:50 ok, I just wanted to bring this up
16:32:11 that's why I asked you to hold for a few minutes
16:32:16 I'm not 100% sure
16:32:29 ideally, we would like to trace the entire cycle
16:32:48 but currently I think the neutron server does not have enough information to make the correspondence between get_device_details and the trace id
16:32:54 and maybe we need to investigate how to do it
16:33:19 and that's what we hope to do by passing it in ovsdb
16:33:28 and of course one possible answer is that we need to invent it
16:33:38 yeah, I understand that
16:34:11 we hope to come up with some proof of concept code for it
16:34:11 so, will you continue looking into it?
16:34:16 both on the os-vif side
16:34:18 ok, cool
16:34:21 and on the ovs-agent side
16:34:25 perfect
16:34:29 yes, I will
16:34:29 Thanks!
16:34:44 np
16:34:57 anything else on this topic?
16:35:05 that's all from me
16:35:19 and sorry for hijacking the meeting right at the beginning
16:35:50 oh no problem....
16:36:00 on the contrary, very useful
16:36:30 #topic Security groups RBAC
16:36:54 I spent a good chunk of the last 2 days looking at https://bugs.launchpad.net/neutron/+bug/1830679
16:36:55 Launchpad bug 1830679 in neutron "Security groups RBAC cause a major performance degradation" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:38:15 I left extensive comments on my findings in notes #9, #11 and #12
16:39:52 the summary: for users who have lots of sec groups and are not using RBAC with them, we can optimize the code, as proposed here: https://review.opendev.org/#/c/665566/
16:40:33 essentially, we don't need to check whether the user has access rights for a sec group if the sec group has no RBAC entries associated with it
16:41:47 that cuts the response time of an "openstack security group list" from ~500 seconds to ~50 seconds in a scenario where there are more than 6000 security groups
16:42:04 and that's good but...
16:42:14 * bcafarel waits for the OVO "but"
16:43:15 in my digging, I found that with a straight SQLAlchemy query, we can retrieve all the sec groups in less than a second
16:43:18 * slaweq needs to leave earlier
16:43:24 I have had concerns about OVO being a source of slowness for some time; placement has seen significant speed benefits from removing OVO
16:43:52 but most of those remaining 50 seconds are spent in the sec groups query through OVO
16:44:00 about 46 secs
16:44:50 ok, I wanted njohnston to comment on my findings
16:44:52 This is the best, most clear-cut case of OVO being the primary culprit for a slowdown
16:45:36 yeah, in the end, OVO should be issuing the same SQLAlchemy query that I am using, right?
16:46:58 yes, or something similar. The slowdown could be pre-query, in which case baked queries might help, or it could be post-query and we'd have to see if the process of turning results into objects needs work
16:47:13 exactly
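[Editor's note: since baked queries just came up, a minimal sketch of SQLAlchemy's 1.x baked-query API, which caches the Python-side query construction that makes up the "pre-query" cost. SecurityGroup is an assumed model name, not neutron's actual class.]

    from sqlalchemy import bindparam
    from sqlalchemy.ext import baked

    bakery = baked.bakery()

    def list_security_groups(session, tenant_id):
        # The query construction steps are cached by the bakery, so the
        # pre-query Python overhead is paid only on the first call.
        q = bakery(lambda s: s.query(SecurityGroup))
        q += lambda q: q.filter(
            SecurityGroup.tenant_id == bindparam("tenant_id"))
        return q(session).params(tenant_id=tenant_id).all()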
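[Editor's note: the short-circuit proposed earlier in this topic (https://review.opendev.org/665566) boils down to something like the sketch below; the model and function names are illustrative, not the actual patch.]

    # SecurityGroupRBAC is assumed here to be the SQLAlchemy model for
    # the table holding security-group RBAC entries.
    def is_accessible_via_rbac(session, sg_id, tenant_id):
        # Cheap existence check first: when a security group has no RBAC
        # entries at all (the common case in this bug), skip the
        # per-tenant access check entirely.
        rbac_q = session.query(SecurityGroupRBAC).filter(
            SecurityGroupRBAC.object_id == sg_id)
        if not session.query(rbac_q.exists()).scalar():
            return False
        # Only groups that actually have RBAC entries pay for the full
        # shared-with-tenant lookup.
        return session.query(rbac_q.filter(
            SecurityGroupRBAC.target_tenant.in_(
                (tenant_id, "*"))).exists()).scalar()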
16:47:35 do the logs perchance indicate how far along the query is actually emitted?
16:47:54 I didn't get that far along
16:48:32 it was late on Father's Day and my wife had cooked a special dinner for me, so it was difficult to continue beyond the early evening
16:48:39 but I will investigate
16:49:13 there are two points I want to make in this meeting:
16:50:48 1) short term, I might fix this bug by using straight SQLAlchemy here: https://review.opendev.org/#/c/665566/1/neutron/objects/rbac_db.py@113
16:51:22 2) long term, OVO seems a ripe area of work for this performance team
16:51:47 I'm not saying get rid of it, but we need to investigate how to improve its performance
16:52:19 and understand better the trade-off between the OVO benefits and its performance cost
16:52:29 does it make sense?
16:52:50 One point I would like to make about your scripts is that the SQLAlchemy script only lists security groups, but the OVO script creates projects and creates security groups before listing them, and then destroys them after
16:52:52 sounds good
16:53:19 we should see if we can track the elapsed time just to list the security groups, to provide an apples-to-apples comparison
16:53:37 yeap, understood
16:54:19 but yes, I think we need to try to run cProfile or other tools to dig out what is taking such a long time, and how it can be optimized
16:54:44 especially if the improvement can be made in the OVO library itself, so we don't have to make tactical fixes in the hundred places where we already implemented OVO
16:55:12 and btw, that's what I mean by striking the right balance between OVO benefits and its performance cost
16:55:42 and you are right, we have to try to avoid coming up with a thousand tactical fixes
16:56:22 in the case of this specific bug, tactical is good if we want to avoid a painful revert
16:56:32 right
16:56:52 talking about https://review.opendev.org/#/c/665447/
16:56:55 I suppose yeah, if placement has OVO perf issues too, it's better to dig into it directly than to look for neutron-specific perf issues
16:57:32 like I said, placement saw definite speed improvements just by removing OVO: https://anticdent.org/placement-update-19-08.html
16:57:49 yes, and that's what I meant in point 2 above^^^
16:57:53 and I recall a reference to a similar performance gain in another project, although I cannot find the link now
16:58:53 cool, good stuff!
16:59:02 anything else?
16:59:25 ok, we ran out of time
16:59:28 #endmeeting
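[Editor's note: the apples-to-apples comparison suggested above — timing only the list step in both scripts — could look like this; list_security_groups stands in for whichever path (plain SQLAlchemy or OVO) is being measured.]

    import time

    start = time.monotonic()
    sgs = list_security_groups(session, tenant_id)
    print("listed %d security groups in %.2f s"
          % (len(sgs), time.monotonic() - start))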
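[Editor's note: and a sketch of the cProfile run mentioned above, to see whether the ~46 seconds goes to query construction, SQL execution, or hydrating rows into OVO objects. Again, list_security_groups is a placeholder for the slow OVO-backed path.]

    import cProfile
    import pstats

    prof = cProfile.Profile()
    prof.enable()
    list_security_groups(session, tenant_id)  # the OVO-backed path
    prof.disable()
    # Sorting by cumulative time shows where the wall-clock time goes.
    pstats.Stats(prof).sort_stats("cumulative").print_stats(30)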