16:00:11 <mlavalle> #startmeeting neutron_performance
16:00:13 <openstack> Meeting started Mon Jun 17 16:00:11 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:16 <openstack> The meeting name has been set to 'neutron_performance'
16:00:28 <slaweq> o/
16:00:31 <haleyb> hi
16:00:40 <jrbalderrama> hi!
16:01:27 <mlavalle> njohnston: are you coming to the performance meeting? If you have a chance, please join us. I have a few questions for you
16:01:46 <mlavalle> about OVO
16:02:50 <mlavalle> ok, let's get going
16:02:58 <mlavalle> #topic Updates
16:02:59 <rubasov> late o/
16:03:24 <mlavalle> rubasov: just in time.... any updates for today?
16:03:40 <rubasov> mlavalle: yes
16:04:02 <rubasov> first we noticed that the vm boot rally scenario is quite unstable
16:04:21 * mlavalle already +2ed the related patch ;-)
16:04:27 <rubasov> so there's this change: https://review.opendev.org/665614
16:04:33 <rubasov> mlavalle: oh thanks
16:04:59 <rubasov> then I have some news about profiling the vif plugging
16:05:01 <mlavalle> I also rechecked it
16:05:06 <rubasov> thank you
16:05:41 <rubasov> I looked into the osprofiler report and realized there's nothing in it about when a vif plug finishes
16:05:52 <rubasov> after some debugging we realized the cause
16:06:19 <rubasov> as you know osprofiler requires the trace information (trace id, hmac, parent id, etc)
16:06:25 <mlavalle> yes
16:06:36 <rubasov> to be passed on in each inter-component communication
16:06:48 <rubasov> this is usually passed on in each http message
16:06:53 <rubasov> or rpc message
16:06:59 <mlavalle> yeap
16:07:02 <rubasov> but the vif plug is the exception
16:07:20 * njohnston is here, apologize for being late
16:07:20 <mlavalle> there are no http calls, right?
16:07:26 <rubasov> because there the communication channel is the ovsdb (at least for the ovs plugin)
16:07:52 * bcafarel also late and sorry for it
16:08:03 <rubasov> we pass on the port uuid in the external_ids fields of the port record in ovsdb
16:08:21 <mlavalle> exactly
16:08:32 <rubasov> but we cannot really do the same with the trace info
16:08:50 <rubasov> first because libvirt does not allow arbitrary external_ids
16:09:11 <rubasov> second because libvirt deletes and recreates the ovs port
16:09:23 <rubasov> so whatever we put into it gets deleted by libvirt
16:09:38 <rubasov> we're thinking about a hacky way
16:10:08 <rubasov> to serialize the trace info into the bridges table of ovsdb when os-vif plugs
16:10:35 <rubasov> and remove it by neutron-ovs-agent as soon as it triggered tracing in ovs-agent
16:10:44 <rubasov> but as you can see it's quite hacky
16:10:51 <mlavalle> yeah
16:11:14 <rubasov> if you see a better way please let us know
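[Editor's note: the hack rubasov sketches above — smuggling osprofiler's trace context through ovsdb, since libvirt recreates the port and drops arbitrary external_ids — could look roughly like the helpers below. The function names, the payload fields, and the choice of base64-encoded JSON are assumptions for illustration, not the actual os-vif or neutron-ovs-agent code.]

```python
import base64
import json

# Hypothetical helpers: pack osprofiler's trace context (trace id,
# parent id, hmac key) into a single string that os-vif could write
# into the ovsdb bridge record's external_ids at plug time, and that
# neutron-ovs-agent could read back, use to resume the trace, and
# then delete once tracing has been kicked off.

def encode_trace_info(trace_id, parent_id, hmac_key):
    """Serialize trace context into one external_ids-friendly value."""
    payload = {"uuid": trace_id, "parent_id": parent_id,
               "hmac_key": hmac_key}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_trace_info(value):
    """Unpack the value written by encode_trace_info()."""
    return json.loads(base64.urlsafe_b64decode(value.encode()))
```

On the os-vif side the value would be written with something like `ovs-vsctl set Bridge br-int external_ids:trace_info=<value>`, and the agent would remove the key after emitting its first trace point; again, a sketch of the idea, not working code.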
16:11:40 <mlavalle> let's finish the updates section of the meeting and we'll come back to it in a few minutes
16:11:41 <mlavalle> ok?
16:11:49 <rubasov> ok
16:12:02 <mlavalle> does anybody else have more announcements?
16:12:09 <mlavalle> updates?
16:12:10 <slaweq> rubasov: I think that sean-k-mooney was working on some patch to nova/os-vif to make all plugging be done on the os-vif side
16:12:20 <slaweq> and then libvirt would not recreate it
16:12:32 <slaweq> there were some problems with this but you can ask him about current status
16:12:45 <rubasov> slaweq: that was reverted I think
16:13:03 <slaweq> yes, it was but I think he was still trying to do something like that later
16:13:04 <rubasov> I've seen the various patches fixing the security issue
16:13:29 <rubasov> slaweq: okay, will ask anyway, because we'll have os-vif changes likely
16:13:50 <mlavalle> ok, I assume there are no more updates....
16:14:02 <mlavalle> #topic EnOS framework
16:14:35 <mlavalle> jrbalderrama: thanks for joining us. Is there any progress with the requests I made last week?
16:14:46 <jrbalderrama> Yes!
16:15:10 <jrbalderrama> msimonin sent a mail today addressing those questions
16:15:38 <mlavalle> ahh, ok, just a few minutes ago
16:15:50 <mlavalle> I checked my inbox just before that
16:16:08 <jrbalderrama> It was 20 min before the meeting started
16:16:41 <mlavalle> no problem. Thanks!
16:17:11 <mlavalle> I just glanced over it quickly and it seems he followed up on all the pending points
16:17:39 <mlavalle> I will restart my testing soon with that and will yell if I need help
16:17:42 <mlavalle> ok?
16:17:51 <jrbalderrama> right
16:18:08 <mlavalle> Thank you very much for the follow up
16:18:21 <mlavalle> anything else we should say today about EnOS?
16:18:48 <jrbalderrama> nothing for today
16:18:54 <mlavalle> cool
16:18:57 <mlavalle> let's move on
16:19:22 <mlavalle> ahh, I have one more request....
16:19:55 <mlavalle> rubasov: what is the status of merging your port binding scenario in the openstack rally repo?
16:20:27 <rubasov> mlavalle: as soon as zuul votes +1 I wanted to send a mail to rally folks asking for review
16:21:37 <mlavalle> jrbalderrama: ^^^ given what rubasov just said, could I get instructions on how to add an unmerged Rally scenario to EnOS?
16:21:56 <mlavalle> I tried briefly a few days ago and couldn't figure it out
16:22:18 <jrbalderrama> We will send you the instructions by mail
16:22:24 * mlavalle needs to learn more about the containerized deployment
16:22:34 <mlavalle> cool, thanks!
16:23:13 <mlavalle> #topic Rally job and osprofiler
16:23:49 <mlavalle> I spent some time looking at the reports produced by the VM boot scenario: http://logs.openstack.org/14/665614/2/check/neutron-rally-task/cc6f223/results/report.html.gz#/NovaServers.boot_and_delete_server/output
16:24:31 <mlavalle> rubasov, slaweq: I have to say that you guys have done a great job getting this scenario together and also in getting the osprofiler report in good shape
16:24:41 <mlavalle> the reporting has improved since the PTG
16:25:06 <slaweq> mlavalle: thx :)
16:25:14 <rubasov> slaweq has done the bigger part
16:25:42 <mlavalle> now, looking at the report, I noted that it is generated from the point of view of Nova
16:25:42 <slaweq> rubasov: not exactly, it was mostly done by Andrey from rally team and kudos should go to him :)
16:26:26 <rubasov> slaweq: thanks to Andrey
16:26:46 <mlavalle> what I mean by that is that I can see the conversation between Nova and the Neutron server, and work being done by the Neutron server
16:27:11 <mlavalle> but I didn't find the conversation between neutron server and the L2 agent
16:27:21 <mlavalle> did I miss something?
16:27:53 <mlavalle> rubasov: ^^^
16:27:56 <rubasov> mlavalle: isn't this about what we started to discuss right at the beginning of this meeting?
16:28:06 <mlavalle> yes
16:28:22 <rubasov> all stuff that's related to a particular trace id
16:28:34 <rubasov> that is running in ovs-agent
16:28:44 <rubasov> is kicked off by the plug
16:29:00 <rubasov> so if we can't get the trace info over into the agent
16:29:09 <rubasov> then the agent's work is not really traced
16:29:12 <mlavalle> yeah, but the trace-id passes between neutron server and the L2 agent through the rpc channel, doesn't it
16:29:34 <mlavalle> in other scenarios I have seen it
16:30:12 <rubasov> the only message where we could pass it is get_device_details I think
16:30:40 <rubasov> but can neutron server tell which trace id corresponds to which get_device_details call?
16:31:10 <slaweq> rubasov: I don't think so as it is based on what is stored in external_ids, right?
16:31:11 <rubasov> I'm not sure if that's doable without the agent passing the trace id first
16:31:50 <mlavalle> ok, I just wanted to bring this up
16:32:11 <mlavalle> that's why I asked you to hold for a few minutes
16:32:16 <rubasov> I'm not 100% sure
16:32:29 <mlavalle> ideally, we would like to trace the entire cycle
16:32:48 <rubasov> but currently I think neutron server does not have enough information to make the correspondence between get_device_details and the trace_id
16:32:54 <mlavalle> and maybe we need to investigate how to do it
16:33:19 <rubasov> and that's what we hope to do by passing it in ovsdb
16:33:28 <mlavalle> and of course one possible answer is that we need to invent it
16:33:38 <mlavalle> yeah, I understand that
16:34:11 <rubasov> we hope to come up with some proof of concept code for it
16:34:11 <mlavalle> so, will you continue looking into it?
16:34:16 <rubasov> both in the os-vif side
16:34:18 <mlavalle> ok, cool
16:34:21 <rubasov> and on the ovs-agent side
16:34:25 <mlavalle> perfect
16:34:29 <rubasov> yes, I will
16:34:29 <mlavalle> Thanks!
16:34:44 <rubasov> np
16:34:57 <mlavalle> anything else on this topic?
16:35:05 <rubasov> that's all from me
16:35:19 <rubasov> and sorry for hijacking the meeting right at the beginning
16:35:50 <mlavalle> oh no problem....
16:36:00 <mlavalle> on the contrary, very useful
16:36:30 <mlavalle> #topic Security groups RBAC
16:36:54 <mlavalle> I spent a good chunk of the last 2 days looking at https://bugs.launchpad.net/neutron/+bug/1830679
16:36:55 <openstack> Launchpad bug 1830679 in neutron "Security groups RBAC cause a major performance degradation" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:38:15 <mlavalle> I left extensive comments on my findings in notes #9, #11 and #12
16:39:52 <mlavalle> the summary is: for users who have lots of sec groups and are not using RBAC with them, we can optimize the code, as proposed here https://review.opendev.org/#/c/665566/
16:40:33 <mlavalle> essentially, we don't need to check whether the user has access right for a sec group if the secgroup has no RBAC entries associated with it
16:41:47 <mlavalle> that cuts the response time of an "openstack security group list" from ~500 seconds to 50 seconds in a scenario where there are 6000+ security groups
16:42:04 <mlavalle> and that's good but...
16:42:14 * bcafarel waits for the OVO "but"
16:43:15 <mlavalle> in my digging, I found that with a straight SQLAlchemy query, we can retrieve all the secgroups in less than a second
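[Editor's note: the kind of set-based query mlavalle describes can be illustrated with the toy sketch below. The schema and column names are stand-ins loosely modeled on neutron's securitygroups/securitygrouprbacs tables, not the actual models; the point is that one LEFT JOIN resolves ownership and RBAC sharing for thousands of groups at once, with no per-group check.]

```python
import sqlite3

# Toy schema: table/column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE securitygroups (id TEXT PRIMARY KEY, project_id TEXT);
    CREATE TABLE securitygrouprbacs (
        id TEXT PRIMARY KEY, object_id TEXT, target_tenant TEXT);
""")
conn.executemany("INSERT INTO securitygroups VALUES (?, ?)",
                 [("sg-%d" % i, "proj-a") for i in range(6000)])
# Only one group has an RBAC entry sharing it with everyone.
conn.execute(
    "INSERT INTO securitygrouprbacs VALUES ('rbac-1', 'sg-0', '*')")

# One query resolves visibility for every group at once: groups the
# caller owns, plus groups shared with it via RBAC entries.
rows = conn.execute("""
    SELECT DISTINCT sg.id
    FROM securitygroups sg
    LEFT JOIN securitygrouprbacs r ON r.object_id = sg.id
    WHERE sg.project_id = ? OR r.target_tenant IN (?, '*')
""", ("proj-a", "proj-b")).fetchall()
print(len(rows))
```

Skipping the per-group RBAC check when a group has no RBAC entries (as in the patch above) follows the same principle: let the database answer the question in one pass instead of iterating per object.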
16:43:18 * slaweq needs to leave earlier
16:43:24 <njohnston> I have had concerns about OVO being a source of slowness for some time; placement has seen significant speed benefits from removing OVO
16:43:52 <mlavalle> but most of those remaining 50 seconds are spent in the sec groups query through OVO
16:44:00 <mlavalle> about 46 secs
16:44:50 <mlavalle> ok, I wanted njohnston to comment on my findings
16:44:52 <njohnston> This is the best, most clear-cut case of OVO being the primary culprit for a slowdown
16:45:36 <mlavalle> yeah, in the end, OVO should be issuing the same SQLAlchemy query that I am using, right?
16:46:58 <njohnston> yes, or something similar.  The slowdown could be pre-query, in which case baked queries might help, or it could be post-query and we'd have to see if the processing of results into objects needs work
16:47:13 <mlavalle> exactly
16:47:35 <njohnston> do the logs perchance indicate at what point the query is actually emitted?
16:47:54 <mlavalle> I didn't get that far along
16:48:32 <mlavalle> it was late on Father's day and my wife had cooked a special dinner for me, so it was difficult to continue beyond the early evening
16:48:39 <mlavalle> but I will investigate
16:49:13 <mlavalle> There are two points I want to make in this meeting:
16:50:48 <mlavalle> 1) short term, I might fix this bug by using straight SQLAlchemy here: https://review.opendev.org/#/c/665566/1/neutron/objects/rbac_db.py@113
16:51:22 <mlavalle> 2) long term, OVO seems a ripe area of work for this performance team
16:51:47 <mlavalle> I'm not saying get rid of it, but we need to investigate how to improve its performance
16:52:19 <mlavalle> and understand better the trade off between the OVO benefits and its performance cost
16:52:29 <mlavalle> does it make sense?
16:52:50 <njohnston> One point I would like to make about your scripts is that the SQLAlchemy script only lists security groups, but the OVO script creates projects and security groups before listing them, and then destroys them afterwards.
16:52:52 <bcafarel> sounds good
16:53:19 <njohnston> We should see if we can track the elapsed time just to list the security groups to provide an apples-to-apples comparison
16:53:37 <mlavalle> yeap, understood
16:54:19 <njohnston> but yes, I think we need to try to run cProfile or other tools to dig out what is taking such a long time, and how it can be optimized.
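[Editor's note: njohnston's cProfile suggestion can be sketched with the stdlib alone. The callable being profiled below is a placeholder standing in for the slow OVO path (e.g. a SecurityGroup.get_objects() call), not the real neutron code; the pattern — run under a profiler, sort by cumulative time, print the top offenders — is what would reveal whether the time goes pre-query or into hydrating results.]

```python
import cProfile
import io
import pstats

def list_security_groups():
    # Placeholder for the slow OVO call being investigated.
    return sorted("sg-%d" % i for i in range(5000))

profiler = cProfile.Profile()
result = profiler.runcall(list_security_groups)

# Report the functions where cumulative time is spent.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf).sort_stats("cumulative")
stats.print_stats(10)  # top 10 entries by cumulative time
print(buf.getvalue())
```

Running the same harness around the straight SQLAlchemy query and the OVO path would give the apples-to-apples comparison discussed above.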
16:54:44 <njohnston> especially if the improvement can be in the OVO library itself so we don't have to make tactical fixes in the hundreds of places where we already implemented OVO
16:55:12 <mlavalle> and btw, that's what I mean by striking the right balance between OVO benefits and its performance cost
16:55:42 <mlavalle> and you are right, we have to try to avoid trying to come up with a thousand tactical fixes
16:56:22 <mlavalle> in the case of this specific bug, tactical is good if we want to avoid a painful revert
16:56:32 <njohnston> right
16:56:52 <mlavalle> talking about https://review.opendev.org/#/c/665447/
16:56:55 <bcafarel> I suppose yeah, if placement has OVO perf issues too, better to dig into it directly than looking for neutron-specific perf issues
16:57:32 <njohnston> like I said, placement saw definite speed improvements just by removing OVO: https://anticdent.org/placement-update-19-08.html
16:57:49 <mlavalle> yes, and that's what I meant in point 2 above^^^
16:57:53 <njohnston> and I recall a reference to a similar performance gain in another project, although I cannot find the link now
16:58:53 <mlavalle> cool, good stuff!
16:59:02 <mlavalle> anything else?
16:59:25 <mlavalle> ok, we ran out of time
16:59:28 <mlavalle> #endmeeting