14:00:59 <liuyulong> #startmeeting neutron_l3
14:01:00 <openstack> Meeting started Wed Jul 31 14:00:59 2019 UTC and is due to finish in 60 minutes.  The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:01 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:03 <openstack> The meeting name has been set to 'neutron_l3'
14:01:23 <liuyulong> Hello
14:02:33 <slaweq> hi
14:02:41 <liuyulong> Today will be a quick short meeting, IMO.
14:02:44 <njohnston> o/
14:02:53 <liuyulong> #topic Announcements
14:03:20 <haleyb> hi
14:03:21 <liuyulong> #link https://etherpad.openstack.org/p/Shanghai-Neutron-Planning
14:03:31 <liuyulong> Just added my name here ^
14:03:50 <slaweq> great, we will finally meet in person liuyulong :)
14:03:51 <liuyulong> #chair haleyb
14:03:52 <openstack> Current chairs: haleyb liuyulong
14:05:18 <liuyulong> I have no more announcements. IMO, we covered all of them in yesterday's team meeting.
14:06:04 <liuyulong> OK, let's move on.
14:06:13 <liuyulong> #topic Bugs
14:06:24 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2019-July/008089.html
14:06:30 <liuyulong> Boden Russell was our bug deputy last week, thanks.
14:07:24 <liuyulong> And again, I will skip all the bugs that are already fixed or whose related patches are being merged now.
14:07:37 <liuyulong> First one:
14:07:39 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1837635
14:07:40 <openstack> Launchpad bug 1837635 in neutron "HA router state change from "standby" to "master" should be delayed" [Undecided,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:08:16 <liuyulong> The fix looks good to me, IMO, but I'd like to see more in-depth test results.
14:08:42 <liuyulong> I just added two scenarios here: https://review.opendev.org/#/c/672533/
14:10:06 <liuyulong> Sometimes, the actual results of the running program may be different from what you expected.
14:10:32 <liuyulong> Next:
14:10:37 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1834308
14:10:39 <openstack> Launchpad bug 1834308 in neutron "[DVR][DB] too many slow query during agent restart" [Medium,In progress] - Assigned to LIU Yulong (dragon889)
14:11:06 <liuyulong> I submitted the patch set yesterday.
14:11:16 <liuyulong> It is here: https://review.opendev.org/#/c/673557/
14:11:30 <liuyulong> A pep8 failure...
14:12:49 <liuyulong> The DB queries in this patch are the ones called most frequently when ovs-agent restarts.
14:13:58 <haleyb> do you have numbers on the improvement?
14:14:00 <liuyulong> And they are time-consuming: as your 'ports' table gets larger and larger, these queries perform worse.
14:14:52 <liuyulong> Restarting ovs-agent on 40 nodes calls these DB queries about 300K+ times.
14:17:02 <liuyulong> And these queries take about 0.1s+, as logged by our mariadb cluster.
14:17:28 <haleyb> .1s per query?
14:18:18 <liuyulong> Yes, and one of them takes about 0.5s+. Let me link it in gerrit.
14:18:50 <liuyulong> https://review.opendev.org/#/c/673557/1/neutron/db/dvr_mac_db.py@145
14:18:56 <liuyulong> get_ports_on_host_by_subnet
14:19:02 <liuyulong> This one.
14:20:01 <liuyulong> haleyb, those results are from when the ports table has about 10-20K records.
14:23:09 <haleyb> so _get_ports_query() is really slow
14:25:05 <liuyulong> The resource scale is roughly: 17000+ VMs, 3000+ DVR routers, 3000+ networks, 3000+ subnets and 3000+ security groups, with 40 security group rules per security group.
14:26:08 <liuyulong> After this change, the ovs-agent restart time improves very significantly, from about 40-50 mins to about 15 mins.
14:26:29 <njohnston> I wonder if it would be further optimized by adding an index specifically on the Port.device_owner field.  I'll comment on the change.
14:26:50 <tidwellr> interesting
14:27:08 <tidwellr> hi, btw
14:27:22 <haleyb> liuyulong: that's quite an improvement, even if 15mins is still a long time :)
14:27:44 <liuyulong> 40-50 mins. I couldn't believe it at first, but it's true.
14:28:23 <liuyulong> In rpc_loop 1 it scans the ports and processes them. 40-50 mins.......
14:30:24 <haleyb> liuyulong: it looks like you have lots of reviewers now
14:31:03 <liuyulong> More detail about our test deployment: 3 neutron-server nodes with about 172 workers, plus a 3-node DB and a 3-node MQ, all on dedicated servers.
14:31:10 <liuyulong> Yes, neutron has its own DB and MQ.
14:35:30 <liuyulong> Last one:
14:35:38 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1838431
14:35:39 <openstack> Launchpad bug 1838431 in neutron "[scale issue] ovs-agent port processing time increases linearly and eventually timeouts" [Undecided,New]
14:35:51 <liuyulong> More like an L2 issue...
14:37:05 <slaweq> liuyulong: this one looks related to the already known problem with "remote_security_group" rules in SG
14:37:19 <slaweq> there was a bug reported for it already IIRC
14:37:19 <liuyulong> The test has not succeeded yet.
14:38:12 <liuyulong> I will test it again today.
14:40:19 <liuyulong> One more interesting thing is that we disabled DHCP for this test. No DHCP agent in this test. I can imagine that if DHCP is enabled we may hit vif-plug-timeout even more...
14:41:13 <liuyulong> That's all the bugs from me.
14:41:24 <liuyulong> Any other bugs that need the team's attention?
14:41:37 <haleyb> there was one miguel filed yesterday
14:41:39 <slaweq> liuyulong: I can't find any bug reported for Your last issue but please check https://etherpad.openstack.org/p/openstack-networking-train-ptg in L347
14:41:48 <slaweq> njohnston raised this problem at the last PTG
14:41:58 <liuyulong> slaweq, OK, great
14:42:00 <haleyb> https://bugs.launchpad.net/neutron/+bug/1838449
14:42:01 <openstack> Launchpad bug 1838449 in neutron "Router migrations failing in the gate" [Medium,Confirmed] - Assigned to Miguel Lavalle (minsel)
14:42:23 <slaweq> liuyulong: please do Your test without security group rules that reference remote_group_id
14:42:32 <slaweq> then it should be much, much faster
14:44:20 <haleyb> liuyulong: that was the only other bug i had, was going to try and reproduce locally today for miguel
14:45:06 <slaweq> haleyb: yes, this one hurts us quite a lot in CI jobs
14:45:24 <liuyulong> slaweq, actually I refactored my test to use 27 tenants yesterday. It looks better now.
14:46:09 <liuyulong> haleyb, thanks for bringing this up; it seems Miguel has found the problematic code.
14:46:12 <slaweq> liuyulong: because if You have more tenants, there are probably fewer IPs (ports) using the same security group and thus it's faster
14:47:16 <liuyulong> slaweq, yes, and I'm trying to add more security groups for each tenant or network.
14:48:19 <liuyulong> For one tenant and one default security group, it is a disaster.
14:49:12 <liuyulong> IMO, anyone who tries to test this will easily run into this problem.
14:49:41 <liuyulong> njohnston's PTG summary looks very similar to this.
14:49:50 <slaweq> liuyulong: yes, we had this issue too
14:51:12 <liuyulong> And maybe some security group DB queries also need some optimization work.
14:51:28 <liuyulong> OK, next topic
14:51:36 <liuyulong> #topic Routed Networks
14:51:44 <liuyulong> mlavalle, tidwellr, wwriverrat: your turn now.
14:53:30 <tidwellr> if mlavalle and wwriverrat don't have anything, we can talk briefly about floating IPs for routed networks
14:53:55 <tidwellr> https://review.opendev.org/#/c/486450/ and the POC code https://review.opendev.org/#/c/669395/
14:55:04 <njohnston> I don't see mlavalle online
14:55:36 <tidwellr> if it isn't obvious from my nagging folks to take a look at these, this has turned into my pet project :)
14:57:50 <tidwellr> I've spun up a little lab where I've tested the POC code, it seems to work nicely and it's not terribly invasive. What I'm interested in is feedback about the approach in the spec
15:00:05 <liuyulong> tidwellr, thank you for replying to my question in the patch sets.
15:00:12 <liuyulong> OK, let's end the meeting.
15:00:17 <liuyulong> #endmeeting