14:00:54 #startmeeting neutron_l3
14:00:55 Meeting started Wed May 15 14:00:54 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:58 The meeting name has been set to 'neutron_l3'
14:01:13 o/
14:01:20 o/
14:01:41 o/ (partially - in another meeting)
14:02:39 hey panda, I was wondering about you yesterday. Nice to see you here :-)
14:02:57 good morning haleyb, tidwellr
14:03:07 hi
14:03:14 hey ralonsoh
14:03:28 mlavalle: happy wednesday!
14:05:16 ok, let's get going
14:05:42 #topic Announcements
14:06:11 We are on our way to the T-1 milestone, June 3 - 7
14:06:19 #link https://releases.openstack.org/train/schedule.html
14:07:41 These are our photos from the recent PTG:
14:07:43 o/
14:07:46 #link https://www.dropbox.com/sh/fydqjehy9h5y728/AAC1gIc5bJwwNd5JkcQ6Pqtra/Neutron?dl=0&subfolder_nav_tracking=1
14:08:28 everybody handsome as usual
14:08:45 especially in the one with the props
14:09:33 any other announcements from the team?
14:10:08 ok, let's move on
14:10:40 #topic Bugs
14:11:27 First, we have a critical issue https://bugs.launchpad.net/neutron/+bug/1824571
14:11:28 Launchpad bug 1824571 in neutron "l3agent can't create router if there are multiple external networks" [Critical,Confirmed] - Assigned to Miguel Lavalle (minsel)
14:11:52 it was recently promoted to critical by slaweq
14:12:07 so I better hurry up with a fix for this
14:12:52 I have an environment ready to reproduce the issue
14:14:26 Next bug is https://bugs.launchpad.net/neutron/+bug/1774459
14:14:28 Launchpad bug 1774459 in neutron "Update permanent ARP entries for allowed_address_pair IPs in DVR Routers" [High,Confirmed] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
14:15:33 We didn't have a chance to discuss this issue in Denver
14:15:51 couldn't reach Swami over google hangouts
14:16:05 but it seems he is making progress: https://review.opendev.org/#/c/651905/
14:16:09 before he left Denver he wanted me to just mention he needs reviews
14:17:05 in the commit message he indicates it is related to the bug
14:17:13 are there more patches coming?
14:17:15 there's also https://review.opendev.org/#/c/616272/ that I think is also related
14:18:26 and maybe this one too https://review.opendev.org/#/c/601336/
14:19:22 yes, these two also indicate in the commit message that they are related
14:19:37 * mlavalle leaving a note in the bug pointing to these 2 patches ^^^^
14:21:42 Last one I have today is https://bugs.launchpad.net/neutron/+bug/1823038
14:21:43 Launchpad bug 1823038 in neutron "Neutron-keepalived-state-change fails to check initial router state" [High,Confirmed]
14:22:13 which seems to have already been fixed
14:22:55 not yet
14:23:03 I'm going to propose a patch for it
14:23:16 the agent is now run under neutron-rootwrap
14:23:21 and privsep is failing
14:23:46 so I'm removing the newly added code and keeping only the privsep initialization
14:24:09 can I assign it to you?
14:24:19 I'm just helping Slawek
14:24:29 he knows the status of this patch
14:24:48 ah ok
14:25:06 that's all
14:25:10 thanks for the update :)
14:25:39 any other bugs we should discuss today?
14:26:24 One more https://bugs.launchpad.net/neutron/+bug/1821912
14:26:25 Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,In progress] - Assigned to LIU Yulong (dragon889)
14:26:56 Seems we hit this more frequently
14:27:54 is it a work around?
14:27:58 I have two directions of repair
14:28:20 one: https://review.opendev.org/#/c/659009/ wait until the floating IP is active
14:28:38 is this a work around? ^^^^
14:29:27 It can be one work around
14:29:42 the other is that we cannot rely on the nova DB instance status
14:30:03 every time the guest OS is booting, the test case is already trying to log in
14:30:52 So I wonder if we can ping the fixed IP first, then try to log in to it
14:31:36 But it seems tempest does not allow that now
14:31:41 so what you are saying is that we don't have an underlying connectivity / authentication problem, but rather a testing problem?
14:32:06 I have tried both in this https://review.opendev.org/659009, but have now reverted back to only "waiting for floating IP status"
14:33:15 The recently merged patches all have a lot of "recheck", so maybe we can raise this bug's level too.
14:33:48 critical?
14:34:57 Not entirely, Slawek mentioned that nova metadata may have something wrong.
14:35:17 ok, a combination of causes
14:35:47 liuyulong: just thinking out loud and maybe it's crazy, but I wonder if there's a way to set a static route on the host that would allow us to reach the fixed IP
14:37:51 mlavalle, Slawek and I are now aiming in different directions
14:38:29 I also noticed that l3-agent may have a really long router processing time, 40s+ in some cases.
14:38:32 liuyulong: sounds good too. I am also adding a 3rd direction: tcpdump in the namespace
14:39:13 tidwellr, I'm not quite sure, but if tempest can only reach the API, the route may not work.
14:39:55 yep, it's just a thought
14:40:58 This is really a tough one...
14:41:05 yes it is
14:41:47 anything else on this bug?
14:42:01 not from me
14:42:18 thanks for the update :-)
14:42:23 any other bugs?
14:43:19 mlavalle: there's one in the open agenda section, but if we're in a buggy mood we can discuss now
14:43:30 shoot
14:43:40 https://bugs.launchpad.net/neutron/+bug/1818824
14:43:42 Launchpad bug 1818824 in neutron "When a fip is added to a vm with dvr, previous connections loss the connectivity" [Low,In progress] - Assigned to Gabriele Cerami (gcerami)
14:44:03 I saw the chatter about this in IRC yesterday
14:44:27 In short, there is a difference between DVR/centralized here
14:44:52 I tried to lay down some solution, but a behavioural decision has to be made first
14:45:31 if an instance is using the default snat IP and a floating IP is associated, should we be deleting the conntrack entries for the existing connections?
14:46:32 i'm inclined to think we should always be cleaning them, since the instance should start using the floating IP
14:46:34 is there a concrete example of a workload in a VM that is affected by this?
14:47:50 i don't think so. it's an edge case since in order to trigger something in an instance you need a floating IP to get in first (or login from another instance on the private network)
14:48:15 haleyb, +1, yes, it should stop the previous connection to save the SNAT node bandwidth.
14:48:42 liuyulong always bringing the operator perspective
14:48:47 nice
14:48:51 I'm inclined to agree, once the FIP is associated force all traffic to use it
14:49:56 liuyulong: it doesn't happen with centralized routing today, you can have a connection continue to use the snat IP until it closes. DVR "breaks" it simply because it forces everything into the fip namespace where it dies
14:51:24 I agree with haleyb and tidwellr
14:51:30 so it seems we agree the conntrack entries should have been cleaned. i think if we make that change soon-ish we'll be able to get some feedback if we break something during the T cycle
14:51:45 yes!
14:51:50 the sooner the better
14:51:53 +1
14:51:55 for both DVR and non-DVR scenarios?
14:52:08 and i don't think we documented the behavior, so we should do that too
14:52:32 in DVR currently the connections just starve, they are not closed
14:52:49 panda: yes, both. with dvr it's essentially cleaned by the routing change, right?
14:53:14 as you say starved since the connection is broken
14:53:45 haleyb: it's not cleaned at all, the packets try to follow the new route but they just die somewhere, so the connection only clears on timeout
14:53:58 I'm trying to understand if they need to be explicitly closed instead
14:54:12 Could be a bug for centralized router, since we never test that.
14:55:09 I'd say explicitly close it
14:55:16 For dvr with centralized floating IPs, what's the behavior now?
14:55:16 panda: right, but removing the stale conntrack entries would make the connection fail quickly and not timeout slowly
14:55:47 good point
14:55:54 liuyulong: that's a good question, don't know
14:56:03 previous connection may stay, IMO
14:58:15 liuyulong: and have a different behaviour for the two scenarios? I think the idea here was to look for consistency
14:59:29 my personal preference is to try and maintain the old connection, but just because I found it a good entry point to experiment and learn the code :)
14:59:46 if we had a floating IP assigned and it got removed, conntrack gets cleaned-up, i think we should treat the default snat IP similarly - the (dis)association event flips which is used
14:59:47 we are running out of time
15:00:00 I lean towards consistency of behavior
15:00:13 #endmeeting
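Post-meeting note (not part of the log): the conntrack cleanup discussed for bug 1818824 could look roughly like the sketch below. It is only an illustration of the idea of dropping stale entries for an instance's fixed IP when a floating IP is associated, so that old SNAT connections fail quickly instead of timing out. The function name `conntrack_delete_commands` is hypothetical and is not Neutron's API; the sketch only builds `conntrack` (conntrack-tools) command lines and does not run them. Actually executing them would need root privileges and, presumably, the relevant router or snat namespace, which is an assumption not settled in the discussion.

```python
import ipaddress
from typing import List


def conntrack_delete_commands(fixed_ip: str) -> List[List[str]]:
    """Build conntrack-tools argv lists that drop entries for fixed_ip.

    Hypothetical helper for illustration only; it is not Neutron code.
    """
    # Validate the address up front; ip_address() raises ValueError on bad input.
    addr = ipaddress.ip_address(fixed_ip)
    family = "ipv4" if addr.version == 4 else "ipv6"
    base = ["conntrack", "-D", "-f", family]
    return [
        base + ["-s", str(addr)],  # entries whose original source is the instance
        base + ["-d", str(addr)],  # entries whose original destination is the instance
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; 10.0.0.5 is a made-up fixed IP.
    for argv in conntrack_delete_commands("10.0.0.5"):
        print(" ".join(argv))
```

Building the argument lists separately from executing them keeps the sketch testable without root access, and leaves open the question raised in the meeting of where (centralized router, DVR snat namespace, or both) the cleanup should run.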