15:01:54 <Swami> #startmeeting distributed-virtual-router
15:01:55 <openstack> Meeting started Wed Oct 15 15:01:54 2014 UTC and is due to finish in 60 minutes.  The chair is Swami. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:58 <openstack> The meeting name has been set to 'distributed_virtual_router'
15:02:03 <emagana> Swami: Hi
15:02:27 <Swami> #info Juno RC2 was released last week
15:02:48 <Swami> Hopefully this will be the final build for Juno, unless any critical errors are seen.
15:03:02 <Swami> Testers please use the RC2 build for testing.
15:03:11 <Rajeev> keep our fingers crossed
15:03:22 <Swami> #topic Bugs
15:03:46 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1377241
15:03:48 <uvirtbot> Launchpad bug 1377241 in neutron "Lock wait timeout on delete port for DVR" [High,In progress]
15:04:05 <Swami> This is one of the bugs that we have not fully resolved.
15:05:04 <Swami> The current patch posted by Kevin seems to fix it in the current state, but whenever the current code is changed, I see the lock wait timeout pop up again.
15:05:34 <Swami> #link https://review.openstack.org/127129
15:06:06 <Swami> This is the patch posted by Kevin and it addresses splitting the db transaction from the rpc call.
15:07:32 <Swami> But the main issue with this lock wait timeout is that two actions, "router-interface-delete" and "router-gateway-clear", both call "delete_csnat_router_interface", and at some point they both wait to delete the same port and end up in a lock wait scenario.
15:07:43 <Swami> carl_baldwin: hi
15:07:51 <carl_baldwin> Swami: hi
15:08:03 <Rajeev> Swami: this is completely plugin side -- right ?
15:08:21 <Swami> Rajeev: this has nothing to do with the agent.
15:08:32 <Swami> this is on the plugin db side.
15:08:36 <mrmsith> there has been some discussion on using a semaphore right?
15:08:37 <Rajeev> Swami: that is what I thought
15:09:14 <Swami> mrmsith: yes, but I got some review comments that we should find the root cause and not use the semaphore.
15:09:30 <mrmsith> huh
15:09:49 <Swami> As I see from the traces and logs it is definitely two actions trying to acquire a single resource and it should be solved by a lock.
15:10:12 <Swami> carl_baldwin: do you have any other ideas for fixing such problems without using a lock?
15:10:53 <carl_baldwin> Swami: generally, the problem is that one resource acquires the lock and yields while holding the lock.  Do you know which operation first gets the lock?
15:11:57 <Swami> It is the delete_port that is being called from the "router_gateway_clear" through delete_csnat_router_interface. This is holding the lock for deleting all the csnat_ports.
15:12:15 <Swami> While it has the lock, the router_interface_delete also comes in and tries to delete the same port.
15:12:47 <Swami> This happens when you issue back-to-back commands such as router_interface_delete (we have three commands) and then a gateway-clear.
15:13:56 <carl_baldwin> I think what armax is looking for is to look at that first delete_port to avoid holding the lock and yielding.  Does the bug have enough detail in it to follow the code path?
15:14:26 <Swami> Yes, I have uploaded a document with the neutron.log
15:15:03 <Swami> But I think I also sent out an email to you with all the log files, the db sql log and the neutron.log, since those files were huge.
15:16:49 <Swami> carl_baldwin: please take a look at it and let me know if you need more information.
15:17:32 <carl_baldwin> I will take a look.  I joined late, could you remind me what the bug url is?
15:17:50 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1377241
15:17:52 <uvirtbot> Launchpad bug 1377241 in neutron "Lock wait timeout on delete port for DVR" [High,In progress]
15:17:54 <carl_baldwin> Thanks
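
The pattern discussed above for bug 1377241 (keep the critical section short and do the RPC notification only after the lock is released) can be sketched as follows. This is a minimal illustration, not Neutron code: `ports`, `db_lock`, `notify_agents`, `delete_port` and `run_concurrent_deletes` are all hypothetical stand-ins, with a plain `threading.Lock` playing the role of the database row lock and a list append playing the role of the RPC cast.

```python
import threading

# Hypothetical stand-ins: a dict for the ports table, a lock for the row lock.
ports = {"csnat-port-1": {"router_id": "r1"}}
db_lock = threading.Lock()
notifications = []


def notify_agents(port_id):
    """Stand-in for an RPC cast; in a real server this call may yield."""
    notifications.append(port_id)


def delete_port(port_id):
    """Delete the port under the lock, then notify after the lock is released."""
    with db_lock:
        # Short critical section: DB work only, nothing that can yield.
        deleted = ports.pop(port_id, None)
    if deleted is None:
        # Another caller already removed the port; treat it as a no-op.
        return False
    # The lock is no longer held here, so a slow notification blocks nobody.
    notify_agents(port_id)
    return True


def run_concurrent_deletes(port_id, callers=2):
    """Simulate two actions racing to delete the same port."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(delete_port(port_id)))
        for _ in range(callers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

With two callers racing on the same port, exactly one of them deletes it and sends the notification; the other finds the port already gone and returns without waiting on anything.
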
15:18:24 <Swami> Let us move on to the next bug.
15:18:37 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1378468
15:18:42 <uvirtbot> Launchpad bug 1378468 in neutron "DBDuplicateError found sometimes when router_interface_delete issued with DVR" [Medium,In progress]
15:18:56 <Swami> This bug mentioned above is a result of the timing issues.
15:20:07 <Swami> As per our previous discussion, when the router_interface_delete and router_gateway_clear are issued back to back, sometimes while the router_interface_delete is still processing, the router_gateway_clear removes the snat_binding for the node.
15:20:54 <Swami> Then later, when a router_interface_delete comes in, instead of seeing the gw_port as None it still sees the gw_port as present, tries to rebind the snat to the node, and causes the DBDuplicateError.
15:22:39 <Swami> #link https://review.openstack.org/126793
15:23:01 <Swami> carl_baldwin: you have already provided your comments on this patch.
15:23:40 <carl_baldwin> Swami: are there responses that I haven’t seen?
15:24:17 <Swami> There are two things in this patch: one is to prevent the scheduler from calling "schedule_snat" when there is a router_interface_action; the other is to check the binding again first in "bind_snat_router" and return if it already has one.
15:24:36 <Swami> carl_baldwin: nothing on this patch.
15:25:29 <carl_baldwin> I’ll watch for them.
15:25:35 <Swami> But I just wanted to confirm: is handling it through "hints" still valid?
15:26:51 <carl_baldwin> That was armax’s comment but I would also like to know if there is a better solution.
15:27:11 <Swami> carl_baldwin: yes that's what I am trying to figure out.
15:28:01 <Swami> But I could not find the right solution so far.
15:28:20 <Swami> Let me know if you have any thoughts on this.
15:28:35 <carl_baldwin> Swami: I will
15:29:25 <Swami> carl_baldwin: thanks
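
The second half of the patch described above for bug 1378468 (check for an existing binding in "bind_snat_router" and return early if there is one) can be sketched as an idempotent bind. This is only an illustration under stated assumptions: it uses an in-memory SQLite table rather than Neutron's models, and the `snat_bindings` schema, `make_db` and the function signature are hypothetical. `sqlite3.IntegrityError` stands in for the DBDuplicateError seen in the bug.

```python
import sqlite3


def make_db():
    """Create a hypothetical bindings table with one row allowed per router."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snat_bindings ("
        "router_id TEXT PRIMARY KEY, agent_id TEXT NOT NULL)"
    )
    return conn


def get_binding(conn, router_id):
    """Return the bound agent id for the router, or None if it is unbound."""
    row = conn.execute(
        "SELECT agent_id FROM snat_bindings WHERE router_id = ?", (router_id,)
    ).fetchone()
    return row[0] if row is not None else None


def bind_snat_router(conn, router_id, agent_id):
    """Bind the router to an agent unless a binding already exists."""
    existing = get_binding(conn, router_id)
    if existing is not None:
        # Already bound: return the existing binding instead of inserting.
        return existing
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO snat_bindings (router_id, agent_id) VALUES (?, ?)",
                (router_id, agent_id),
            )
    except sqlite3.IntegrityError:
        # A concurrent caller inserted first; fall back to its binding.
        return get_binding(conn, router_id)
    return agent_id
```

Calling `bind_snat_router` twice for the same router leaves a single row and returns the first agent both times, so a late router_interface_delete that still sees the gateway port would not raise a duplicate error. It does not address the underlying ordering problem between the two actions, which is the open question in the discussion.
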
15:29:38 <Swami> Rajeev: hi
15:29:40 <Swami> are you still here?
15:29:59 <Rajeev> Swami: Hi
15:30:11 <Swami> haleyb: hi
15:30:23 <Rajeev> Swami: FYI: Saw the re-occurrence of the race condition in l_3 processing floating ips.
15:30:42 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1376325
15:30:44 <Rajeev> filed defect 1381238
15:30:44 <uvirtbot> Launchpad bug 1376325 in neutron "Cannot enable DVR and IPv6 simultaneously" [Medium,New]
15:31:02 <Swami> Rajeev: haleyb : This is regarding the IPv6 and dvr
15:31:39 <Swami> haleyb: did you get a chance to investigate the IPv6 changes for dvr further, based on last week's meeting?
15:31:49 <Rajeev> Swami: ok, will need to take a look at it
15:32:09 <Rajeev> is this for East west only ?
15:32:47 <Swami> take a look at the bug description
15:33:04 <Swami> I am not sure, I think it is for North-South and not for east-west
15:33:26 <Swami> Maybe haleyb can update you on this; he is the one who filed this bug.
15:33:36 <Swami> armax: hi
15:34:20 <armax> Swami: hi
15:34:43 <Rajeev> Swami: ok, if there is a doc that lists the IPv6 capabilities of the legacy router for North-South, it would be really helpful
15:35:19 <Swami> Rajeev: I don't think there is an official doc, but we can take a look at the IPv6 spec and start from there.
15:35:28 <Rajeev> IPv6 is a big area, that we need to enable. Knowing what is there and what is not will help
15:35:37 <Swami> But if we need more information we can check with markmcclain on this IPv6 support.
15:35:45 <Rajeev> Swami: sure, thanks.
15:36:51 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1374473
15:36:52 <uvirtbot> Launchpad bug 1374473 in neutron "500 error on router-gateway-set for DVR on second external network" [High,In progress]
15:37:14 <Swami> This bug stated above is related to having support for multiple external networks.
15:37:53 <Swami> Right now when we add a second external network, there is a TRACE with DBDuplicateError since it is trying to reschedule and bind to the same node.
15:39:47 <Swami> I am not sure about the effort required to fix this one for multiple external networks, but we need to fix it, and until we do, the documentation needs to state that dvr does not support multiple external networks.
15:40:34 <Swami> armax: do you have anything else on the bugs?
15:41:06 <armax> not from me
15:42:26 <Swami> We also have a bug that is related to HA and DVR not working right now.
15:43:03 <Swami> So, in order to work on this with the HA team (sylvain and amuller), I have created a wiki page to log notes so that it will be useful for both teams.
15:43:16 <Swami> #link https://wiki.openstack.org/wiki/Neutron/DVR/ServiceNode-HA
15:44:10 <Swami> #topic Open Discussions
15:44:42 <Swami> Do you have anything else to discuss?
15:45:45 <Swami> If there are no other items to discuss we can end the session.
15:45:53 <Swami> Thanks everyone for joining the meeting.
15:46:01 <Swami> see you all next week.
15:46:06 <Swami> #endmeeting