14:00:14 <liuyulong> #startmeeting neutron_l3
14:00:15 <openstack> Meeting started Wed Mar  4 14:00:14 2020 UTC and is due to finish in 60 minutes.  The chair is liuyulong. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:19 <openstack> The meeting name has been set to 'neutron_l3'
14:02:32 <liuyulong> Hi there
14:02:40 <liuyulong> #topic Announcements
14:03:13 <slaweq> hi
14:03:56 <liuyulong> #link https://www.openstack.org/events/opendev-ptg-2020/
14:04:59 <liuyulong> Hope I could get to Vancouver.
14:05:28 <liuyulong> I need a VISA.
14:05:51 <liuyulong> I will try the community travel support.
14:07:08 <slaweq> for now we also don't know how it will be, mostly due to this coronavirus :/
14:07:37 <liuyulong> #link https://etherpad.openstack.org/p/neutron-victoria-ptg
14:09:15 <liuyulong> slaweq, maybe, but the Summer is coming.
14:09:27 <liuyulong> Topics are wanted! ^^
14:11:00 <liuyulong> OK, no more announcement from me.
14:11:05 <liuyulong> let's move on.
14:11:08 <liuyulong> #topic Bugs
14:11:21 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-February/012766.html
14:11:27 <liuyulong> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-March/012926.html
14:11:52 <liuyulong> Because I was not here last week, we have two lists now.
14:12:08 <liuyulong> First one:
14:12:09 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1864963
14:12:10 <openstack> Launchpad bug 1864963 in neutron "loosing connectivity to instance with FloatingIP randomly" [Undecided,New]
14:12:52 <liuyulong> I have left some questions about the reporter's deployment; that could help us find out the real problem.
14:13:35 <liuyulong> These questions are mostly based on our local deployment. We have met some issues in these areas.
14:15:47 <slaweq> thx for taking care of this
14:16:17 <liuyulong> slaweq, np
14:16:31 <liuyulong> Next one
14:16:33 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865061
14:16:34 <openstack> Launchpad bug 1865061 in neutron "When neutron does a switch-over between router 1 and router2, the router1 conntrack flows shoud be deleted" [Low,Confirmed]
14:17:10 <slaweq> that is something which our QE found during testing
14:17:36 <slaweq> but it can be a problem only if the router fails over twice in a short period of time
14:17:57 <slaweq> and that's why it's set Low importance
14:18:02 <liuyulong> Yes, that is my question: how could that "twice" happen in the real world?
14:18:10 <liuyulong> https://bugs.launchpad.net/neutron/+bug/1865061/comments/1
14:18:11 <openstack> Launchpad bug 1865061 in neutron "When neutron does a switch-over between router 1 and router2, the router1 conntrack flows shoud be deleted" [Low,Confirmed]
14:18:30 <liuyulong> We have "non-preemptive" settings for HA router keepalived.
14:19:09 <liuyulong> So typically the new master should work then.
14:19:28 <liuyulong> The connections on the original host should all be broken.
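The "non-preemptive" behaviour mentioned here comes from keepalived's `nopreempt` keyword in the vrrp_instance block. A minimal sketch of the relevant options (this is not Neutron's actual template code; the helper name is illustrative):

```python
# Hedged sketch, not Neutron's actual template code: the vrrp_instance
# options behind "non-preemptive" HA routers. With 'nopreempt', a recovered
# node does not take mastership back, so a failover normally happens once.
def render_vrrp_instance(name, state='BACKUP', priority=50):
    lines = [
        'vrrp_instance %s {' % name,
        '    state %s' % state,    # all nodes start as BACKUP
        '    priority %d' % priority,
        '    nopreempt',           # real keepalived keyword
        '}',
    ]
    return '\n'.join(lines)

print(render_vrrp_instance('VR_1'))
```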
14:19:47 <slaweq> exactly, so I reported it there "just for the record", since such an issue can theoretically happen
14:20:08 <slaweq> but in fact that probably shouldn't be an issue in the real world
14:23:04 <liuyulong> An extreme case is when the HA network itself is not stable; that could cause the HA router state to change rapidly. For deployments running HA routers on hypervisors, a bad connection state could be a potential cause.
14:24:06 <liuyulong> That could be another story.
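As a side note on the conntrack bug above, the cleanup would boil down to running `conntrack -D` inside the router's namespace on the old master. A minimal sketch of building that command (`conntrack -D -d` is real conntrack-tools syntax; the helper and the namespace value are illustrative, not Neutron's actual code):

```python
# Hedged sketch: build the conntrack invocation that would delete stale
# entries for a given destination IP after a failover. 'conntrack -D -d'
# is real conntrack-tools syntax; the helper itself is illustrative.
def conntrack_delete_cmd(ip_address, namespace=None):
    cmd = ['conntrack', '-D', '-d', ip_address]
    if namespace:
        # run inside the router's network namespace on the old master
        cmd = ['ip', 'netns', 'exec', namespace] + cmd
    return cmd

print(conntrack_delete_cmd('203.0.113.10', 'qrouter-abc'))
```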
14:24:22 <liuyulong> OK, next one
14:24:33 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865891
14:24:34 <openstack> Launchpad bug 1865891 in neutron "Race condition during removal of subnet from the router and removal of subnet" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:24:48 <slaweq> yes, that one I'm working on now
14:25:27 <slaweq> it seems that sometimes, if You plug a subnet into the router and remove the subnet in parallel, Your router port will end up as a port without fixed_ips
14:25:31 <liuyulong> Alright
14:25:36 <liuyulong> see my comment here:
14:25:36 <liuyulong> https://bugs.launchpad.net/neutron/+bug/1865891/comments/2
14:25:38 <openstack> Launchpad bug 1865891 in neutron "Race condition during removal of subnet from the router and removal of subnet" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
14:25:57 <liuyulong> I can imagine another one: add a port as a router interface and concurrently delete the port.
14:26:23 <slaweq> I agree that maybe we will need to close it as "wontfix"
14:26:51 <slaweq> but I want first to dig a bit more and see what can be done there
14:28:11 <liuyulong> yes, it is indeed an issue. We just want to find a balance. : )
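The race discussed above can be pictured with a toy model (plain Python, not Neutron code; all names are illustrative): the subnet delete strips the port's IPs even though an interface-add just ran, leaving a router port with empty fixed_ips.

```python
# Toy model of the race, not Neutron code: a router port's fixed_ips
# end up empty when a concurrent subnet delete runs after interface-add.
port = {'fixed_ips': []}

def add_router_interface(subnet_id, ip):
    # interface-add gives the router port an IP from the subnet
    port['fixed_ips'] = port['fixed_ips'] + [{'subnet_id': subnet_id, 'ip_address': ip}]

def delete_subnet(subnet_id):
    # subnet delete removes the subnet's IPs without re-checking the port
    port['fixed_ips'] = [i for i in port['fixed_ips'] if i['subnet_id'] != subnet_id]

add_router_interface('s1', '10.0.0.1')
delete_subnet('s1')       # the concurrent operation in the real bug
print(port['fixed_ips'])  # → []
```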
14:28:42 <liuyulong> OK, next one
14:28:43 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865173
14:28:44 <openstack> Launchpad bug 1865173 in neutron "Revision number not bumped after update of router's description" [Low,Confirmed]
14:29:33 <liuyulong> I tested on stable/queens; it is not reproducible there.
14:29:55 <slaweq> I was testing this on master branch
14:31:58 <liuyulong> Alright, a regression on router revision number.
14:32:08 <slaweq> probably
14:32:25 <slaweq> but I saw it only when I tried to bump router's description
14:32:59 <liuyulong> Interesting...
14:33:00 <slaweq> anyway, that's nothing really critical, so I think it can stay in our backlog until someone has some time to take a look at it
14:33:24 <liuyulong> np, makes sense to me
14:33:38 <liuyulong> Next one:
14:33:40 <liuyulong> #link https://bugs.launchpad.net/neutron/+bug/1865557
14:33:41 <openstack> Launchpad bug 1865557 in neutron "Error reading log file from 'neutron-keepalived-state-change' in 'test_read_queue_send_garp'" [Low,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
14:33:59 <ralonsoh> Just a logging problem
14:34:08 <liuyulong> The fix is simple; it just fails the case instead of raising an exception.
14:34:08 <ralonsoh> I found the problem only once, in a test
14:34:15 <ralonsoh> as commented in the bug
14:34:25 <ralonsoh> no no, we need to raise the exception
14:34:26 <liuyulong> So I've +2ed that.
14:34:57 <liuyulong> https://review.opendev.org/#/c/710850/1/neutron/tests/functional/agent/l3/test_keepalived_state_change.py
14:34:58 <ralonsoh> ok, not an exception but a fail (the same effect)
14:35:03 <ralonsoh> yes, I know
14:35:22 <ralonsoh> because we are executing a test, it's better to use self.fail
14:35:31 <ralonsoh> but the core of this patch is the extra log
14:35:38 <liuyulong> OK, maybe I was not clear here.
14:36:07 <liuyulong> The fix is to just fail the case instead of raising an exception.
14:36:19 <ralonsoh> the effect is the same
14:36:28 <liuyulong> Yes
14:36:34 <ralonsoh> the point is to increase the log info
14:36:44 <ralonsoh> now we have the device list with the IP addresses
14:36:50 <ralonsoh> inside the testing namespace
14:37:36 <liuyulong> ralonsoh, great, thanks for working on this.
14:37:43 <ralonsoh> yw
14:38:34 <liuyulong> Alright, that's all bugs from me today.
14:38:41 <slaweq> I would like to talk about one also
14:38:43 <slaweq> https://bugs.launchpad.net/neutron/+bug/1859832
14:38:44 <openstack> Launchpad bug 1859832 in neutron "L3 HA connectivity to GW port can be broken after reboot of backup node" [Medium,In progress] - Assigned to LIU Yulong (dragon889)
14:39:02 <liuyulong> OK
14:39:03 <slaweq> and those 2 alternative solutions proposed by me and liuyulong for it
14:39:50 <slaweq> liuyulong: generally, with Your approach I'm afraid of those errors about failing to send GARPs during failover
14:40:43 <slaweq> and the second potential issue IMO is whether we will increase downtime during failover, as neutron-l3-agent has to be notified that the failover happened and then bring the gateway up
14:40:58 <slaweq> so 2 questions:
14:41:18 <slaweq> 1. do You know if there is any way to delay sending the first GARP, to avoid those errors from keepalived?
14:41:53 <slaweq> 2. You said that You tested it in Your cloud; how long is the downtime during failover with and without this patch?
14:42:41 <liuyulong> I replied to the comments in the patch set. Allow me to quote it here:
14:42:45 <liuyulong> We have run such code for a few months; no issue was found in the related logs. Keepalived will send GARPs after a 60s delay by default [1], and by then the L3 agent should have done the qg-dev link up action. One more detail: during keepalived's first GARP phase, do not send GARPs with no interval; there could be a 1 second delay between them (vrrp_garp_interval [2]).
14:42:45 <liuyulong> [1] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py#L165
14:42:45 <liuyulong> [2] https://www.keepalived.org/manpage.html
14:43:47 <liuyulong> Your first question could have the answer: vrrp_garp_interval.
14:44:18 <liuyulong> The link up action is really quick; we have not seen any side effects from it.
14:44:49 <slaweq> it's quick, but if the router has many other things to do, isn't it queued to be processed like other events?
14:45:00 <liuyulong> Another point: the outside world also has ARP.
14:45:04 <slaweq> e.g. if there would be many routers failovered in same time
14:45:49 <liuyulong> HA state change does not have a queue.
14:46:01 <liuyulong> It's not like the L3-agent main processing loop.
14:46:32 <slaweq> ok, but can we maybe move this "set device up" action to the neutron-keepalived-state-change monitor process?
14:46:53 <slaweq> so it would be done just after keepalived would configure VIP in the namespace
14:47:21 <liuyulong> That "enqueue_state_change" actually does not have a "queue"; it's just a list of functions.
14:48:26 <slaweq> yes, but how about doing it here: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/keepalived_state_change.py#L89
14:48:37 <ralonsoh> slaweq, are we going to add networking capabilities to the neutron-keepalived-state-change agent??
14:48:45 <ralonsoh> slaweq, I do not recommend it
14:48:56 <ralonsoh> this should be just a monitoring process
14:49:39 <slaweq> ralonsoh: look at the comment in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L166
14:49:52 <slaweq> according to it, there were already such plans some time ago :)
14:49:56 <liuyulong> That could be a heavy change.
14:50:10 <ralonsoh> I still don't recommend it
14:50:39 <ralonsoh> we'll have another service changing the net devices
14:50:51 <ralonsoh> this should be in only one process: the l3 agent
14:51:04 <liuyulong> We would need to pass router info from the l3-agent process to that separate monitor process.
14:51:13 <slaweq> we already have keepalived which is also changing those interfaces
14:51:22 <ralonsoh> yes
14:51:39 <ralonsoh> but this is an external process not managed/programmed by us
14:52:13 <slaweq> anyway, I really need to move forward with one of those potential fixes for this issue :)
14:52:20 <ralonsoh> I know
14:52:31 <slaweq> so first we should decide which one and then continue work on it
14:53:33 <liuyulong> I prefer one fix for all drivers.
14:53:50 <slaweq> liuyulong: yes, that's an advantage of Your approach for sure
14:53:56 <ralonsoh> I still don't have a clear idea
14:54:05 <ralonsoh> sorry
14:54:18 <slaweq> what I'm afraid of is that this may cause a longer failover time
14:54:48 <slaweq> but except that, I think that liuyulong's idea may be really better as it's more generic
14:54:57 <liuyulong> And an L3 issue should be handled in its own scope by default.
14:55:17 <liuyulong> slaweq, you have a QA team, I guess, as you mentioned in this meeting. : )
14:55:33 <slaweq> so ralonsoh, what do You think about continuing with liuyulong's patch?
14:56:05 <liuyulong> We also have a QA team; I will try to make sure they fully test the failover time.
14:56:15 <ralonsoh> I still need to check both again
14:56:34 <slaweq> ralonsoh: ok, thx
14:56:38 <slaweq> please check them
14:56:47 <liuyulong> Another thing: I will try to add that "vrrp_garp_interval" to the VRRP config of the HA router.
14:57:09 <slaweq> liuyulong: and one more comment on this: can You remove the config option from it? I don't think we really need such a config option there
14:57:18 <liuyulong> It will be an independent change.
14:57:35 <slaweq> IMO this is an internal implementation detail of HA routers, and it shouldn't be configurable
14:57:36 <liuyulong> slaweq, sure
14:58:05 <slaweq> ok, liuyulong, please ping me when You add this vrrp_garp_interval option
14:58:12 <slaweq> I will test it again on my env
14:58:18 <slaweq> and thx for working on this
14:58:24 <liuyulong> slaweq, the config option is for our local cloud; our operators would like to know about the cloud code changes.
14:58:37 <liuyulong> slaweq, np
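For reference, the GARP-related keepalived options discussed above could be rendered like this (the keywords `garp_master_delay` and `vrrp_garp_interval` are real keepalived options; the helper and its default values are illustrative assumptions, not the actual Neutron patch):

```python
# Hedged sketch of the GARP-related keepalived options discussed above.
# The keywords are real keepalived syntax; the helper is illustrative.
def garp_config_lines(garp_interval=1, garp_delay=60):
    return [
        '    garp_master_delay %d' % garp_delay,      # delay before the GARP burst on master transition
        '    vrrp_garp_interval %d' % garp_interval,  # seconds between GARP packets
    ]

for line in garp_config_lines():
    print(line)
```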
14:58:46 <slaweq> ok, that's all from my side
14:58:49 <slaweq> thx
14:59:02 <liuyulong> All right, we are out of time.
14:59:12 <liuyulong> let's end here.
14:59:23 <liuyulong> Thank you guys for attending.
14:59:25 <liuyulong> Bye
14:59:27 <ralonsoh> bye
14:59:31 <liuyulong> #endmeeting