15:01:18 #startmeeting neutron_dvr
15:01:19 Meeting started Wed Jan 27 15:01:18 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:20 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:23 The meeting name has been set to 'neutron_dvr'
15:01:26 #chair Swami
15:01:31 Current chairs: Swami haleyb
15:02:10 * haleyb wonders if it will just be Swami and himself
15:02:26 haleyb: are you also participating in the nova midcycle meetup this week
15:02:42 Swami: no
15:02:47 haleyb: today it should be short
15:02:59 #topic Announcements
15:03:29 https://etherpad.openstack.org/p/neutron-mitaka-midcycle announced
15:04:02 DVR items were on agenda - is anyone planning on attending? I see obondarev and myself as tentative
15:04:19 haleyb: I will plan to attend.
15:04:48 hi
15:04:50 haleyb: will you be there.
15:04:56 obondarev: hi
15:05:00 I'm planning
15:05:19 working on logistics
15:06:09 Swami: i'm trying, have to get approval, and i'm on vacation the week before, Mexico to Minnesota will be a shock :)
15:06:23 good to know that we have an agenda in place for DVR in the midcycle, haleyb: thanks for adding it to the agenda.
15:06:37 thank obondarev
15:06:49 haleyb: same here, I need to get approval as well, with all changes happening.
15:07:27 anyways, let's keep in touch in case one or more need to drop out of going
15:07:41 haleyb: ok makes sense
15:08:02 we can talk over irc instead of getting cold :)
15:08:19 #topic Bugs
15:08:24 haleyb: yep
15:08:37 unless you want to talk live migration first
15:08:40 There was one bug that was filed recently for this week.
15:09:06 haleyb: no we can proceed with bugs and then come back to live migration, since we have already spoken about live migration.
15:09:36 #link https://bugs.launchpad.net/neutron/+bug/1538163
15:09:38 Launchpad bug 1538163 in neutron "DVR: race in dvr serviceable port deletion" [Undecided,In progress] - Assigned to Oleg Bondarev (obondarev)
15:09:57 obondarev: I think you have filed this bug.
15:10:07 Swami: yep
15:10:19 https://review.openstack.org/#/c/272634/ is out for review
15:10:19 obondarev: you also have a patch for review.
15:10:22 it popped up during dvr refactoring work
15:10:37 I just noticed that there is a potential race
15:10:45 obondarev: yes I just had a first pass review at it yesterday.
15:10:45 the patch is https://review.openstack.org/#/c/272634/
15:11:08 Swami: thanks, I've addressed your comments
15:11:17 obondarev: yes I did notice today.
15:11:56 let us move on to the next one.
15:12:01 so please review and we can move on
15:12:03 yeah
15:12:08 The next high in the list is the HA and DVR
15:12:12 Swami: i had another new bug
15:12:40 haleyb: have you filed.
15:12:42 but i'll wait until the gate failures part
15:12:54 haleyb: ok no problem.
15:13:00 #link https://review.openstack.org/#/c/143169/
15:13:58 Again the DVR SNAT HA patch is also failing jenkins with a couple of L3-HA tests, I have asked adolfo to take a look at it.
15:14:35 otherwise it should be good and I already saw carl_baldwin added a +2.
15:15:12 Is it me or is the title of that change "HA or DVR" instead of "HA for DVR" ?
15:15:56 haleyb: has the title changed, I did not notice.
15:16:04 haleyb: haha, good catch!
15:16:52 changed in PS66, doh, but not as important as finding why the test fails
15:17:25 haleyb: might have been a wrong key hit when trying to amend the patch.
15:17:36 haleyb: yes good catch.
15:17:43 how might that happen, funny)
15:18:21 obondarev: sometimes things happen.
15:18:42 The next one in the list is
15:18:45 #link https://bugs.launchpad.net/neutron/+bug/1445255
15:18:46 Launchpad bug 1445255 in neutron "DVR FloatingIP to unbound allowed_address_pairs does not work" [Low,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:18:47 Swami: indeed
15:19:33 This one has a patch as well and has been in review for a while.
15:19:49 https://review.openstack.org/#/c/254439/
15:20:26 I think carl_baldwin is closely following this patch.
15:21:30 The next one in the list is
15:21:33 #link https://bugs.launchpad.net/neutron/+bug/1522824
15:21:34 Launchpad bug 1522824 in neutron "DVR multinode job: test_shelve_instance failure due to SSHTimeout" [High,In progress] - Assigned to shihanzhang (shihanzhang)
15:22:05 #link https://review.openstack.org/#/c/215467 - This patch is almost ready
15:23:09 The next one in the list is
15:23:15 #link https://bugs.launchpad.net/neutron/+bug/1445255
15:23:16 Launchpad bug 1445255 in neutron "DVR FloatingIP to unbound allowed_address_pairs does not work" [Low,In progress] - Assigned to Swaminathan Vasudevan (swaminathan-vasudevan)
15:23:31 #link https://review.openstack.org/#/c/254439/
15:23:44 haleyb: i have addressed review comments from you on this patch.
15:24:06 yes, thanks, i'll take a look
15:25:23 I think that's all I had for bugs this week.
15:25:47 Are there any other bugs that need attention at this time.
15:26:00 Of course we have the live migration bug.
15:26:33 There are at least two OVS/DVR bugs, let's jump to gate failures
15:26:43 haleyb: obondarev: I did see that there were some comments added to the nova midcycle etherpad on the live migration work.
15:27:07 So let us wait for their feedback and proceed on it.
15:27:08 can you share the libk please?
15:27:17 link*
15:27:23 #link https://etherpad.openstack.org/p/mitaka-nova-midcycle
15:27:51 haleyb: yes let us move on to the gate failures, that's all for the bugs.
15:27:54 #topic Gate Failures
15:28:16 I filed https://bugs.launchpad.net/bugs/1538387 yesterday after staring at it for a while
15:28:18 Launchpad bug 1538387 in neutron "fdb_chg_ip_tun throwing exception because fdb_entries not in correct format" [High,In progress] - Assigned to Kevin Benton (kevinbenton)
15:28:54 Kevin sent out https://review.openstack.org/272986 this morning
15:29:09 Seems this l2pop issue has been there for over a year
15:29:45 I basically started looking through the logs for exceptions, since we shouldn't have any
15:29:54 haleyb: is this bug only seen in DVR environments.
15:30:24 haleyb: did you find any exceptions in the logs.
15:30:28 I only saw it in the dvr-multinode job, but it could be other places
15:30:36 I also found "DVR: Unable to retrieve subnet information for subnet_id"
15:30:44 looks like get_subnet_for_dvr() call in _bind_centralized_snat_port_on_dvr_subnet() needs to check if subnet_info is not {}
15:31:01 throwing a KeyError
15:31:15 Was going to file that today
15:31:21 haleyb: was this introduced by the recent refactor on this function.
15:31:46 obondarev: I think you made some changes to this get_subnet_for_dvr recently.
15:31:47 haleyb: this was filed I guess
15:31:49 I don't think so - one caller of that function checks the return value, one does not
15:31:57 let me find the link
15:33:14 it is something that I reviewed recently..
15:33:53 Both the multinode and dvr-multinode jobs are pretty bad, over 50%, but it's the migration issue, at least the volume migration is the one test failing
15:34:16 haleyb: is the volume migration issue also seen on single node jobs.
15:34:31 https://review.openstack.org/#/c/272025/
15:35:23 obondarev: thanks. that check should probably be up near the call, i'll add that
15:35:43 obondarev: is this patch related to the issue that haleyb was seeing in the logs related to subnet_info.
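[Editorial sketch] The missing check described at 15:30:44 (one caller of get_subnet_for_dvr() tests the return value, another does not and hits a KeyError) can be illustrated roughly as below. This is a simplified, hypothetical stand-in and not the Neutron agent code: the two functions borrow the names mentioned in the meeting, but their signatures, the `subnets` dictionary, and the returned fields are assumptions made only for illustration.

```python
def get_subnet_for_dvr(subnets, subnet_id):
    """Hypothetical stand-in: return subnet details, or {} if lookup fails."""
    subnet = subnets.get(subnet_id)
    if subnet is None:
        # Mirrors the "Unable to retrieve subnet information" case:
        # the caller receives an empty dict rather than an exception.
        return {}
    return {"gateway_ip": subnet["gateway_ip"], "cidr": subnet["cidr"]}


def bind_centralized_snat_port(subnets, subnet_id):
    """Hypothetical caller that guards against an empty subnet_info."""
    subnet_info = get_subnet_for_dvr(subnets, subnet_id)
    if not subnet_info:
        # Without this guard, the lookup below would raise KeyError
        # whenever the subnet details could not be retrieved.
        return None
    return subnet_info["gateway_ip"]


known = {"subnet-a": {"gateway_ip": "10.0.0.1", "cidr": "10.0.0.0/24"}}
print(bind_centralized_snat_port(known, "subnet-a"))
print(bind_centralized_snat_port(known, "subnet-missing"))
```

Placing the guard right after the call, as suggested at 15:35:23, keeps the empty-result case from reaching any later key access.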
15:35:57 Swami: I guess so
15:36:06 i had even added myself but not made the connection
15:36:15 Swami: yes, same issue
15:36:44 what I noticed recently is that from time to time dvr multinode job fails with ~20 failed tests
15:36:58 not very often but still
15:37:00 haleyb: in a single node there should not be such migration failures, but was it introduced or triggered by any other patch that merged.
15:37:18 usually it either passes or fails with 1 failed test
15:37:30 haleyb: do you have the logstash filter to filter out the failures for this particular failure.
15:37:42 obondarev: same failure reason? I guess if you see it again we should look closer
15:38:07 haleyb: not sure about failure reason
15:38:37 Swami: no, but there are very few failures (ERROR) in the logs these days once we fix the two I mentioned
15:39:02 one of the examples http://logs.openstack.org/55/272555/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full/e6f9faa/console.html
15:39:18 17 failed tests
15:39:54 it seems like smth went wrong and tests start to fail
15:40:12 didn't have a chance to look closer yet
15:41:25 obondarev: seems to state about the SSHtimeout issue.
15:41:33 i know a lot of mtu patches have been merging as well, both in neutron and the gate, which might help as well
15:41:37 https://review.openstack.org/#/q/topic:multinode-neutron-mtu
15:42:14 cool, hope it'll help
15:42:35 haleyb: obondarev: btw the patch that I added for debugging the SSHtimeout issue is not quite working right because of a timing issue. I suspect by the time I try to ping the metadata is not in place.
15:43:24 haleyb: obondarev: is there a test to validate if the metadata is properly received by the VM.
15:44:23 I think there are plenty in tempest
15:45:04 any that boots a vm and checks connectivity
15:46:20 obondarev: we normally check it from external connectivity, but that does not tell us if it is a metadata issue or not.
15:46:37 ah
15:46:56 you mean which are checking specifically for metadata
15:47:08 obondarev: yes.
15:47:16 can't remember
15:47:36 Swami: i don't think there is a test, but we do have the VM console log
15:47:45 right
15:48:02 haleyb: yes, it is only through the vm console log we will be able to identify.
15:48:37 haleyb: obondarev: is there a way to figure out from the router namespace that the metadata request was processed for the particular vm.
15:49:15 I don't know if the proxy logs are there, but the metadata server log is i believe
15:49:31 haleyb: ok
15:50:36 http://logs.openstack.org/55/272555/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full/e6f9faa/console.html#_2016-01-26_18_25_22_075
15:50:58 i don't know metadata well enough to know if that's a complete failure
15:51:51 haleyb: so that seems that metadata might be another victim for the SSHtimeout issue.
15:52:16 Swami: yes, assuming these VMs are getting ssh keys that way
15:52:47 and one example is https://bugs.launchpad.net/neutron/+bug/1522824
15:52:48 Launchpad bug 1522824 in neutron "DVR multinode job: test_shelve_instance failure due to SSHTimeout" [High,In progress] - Assigned to shihanzhang (shihanzhang)
15:52:51 haleyb: yes I think in the tempest run that's how they pass the keys.
15:53:07 haleyb: Swami: exactly
15:54:15 agreed.
15:54:39 that failure is different from the one i linked, seems metadata worked in my case, but only instance-id was returned
15:55:38 5 minutes left, let's move onto last topic
15:55:46 #topic Performance/Scalability
15:55:54 obondarev: one patch left? :)
15:55:59 haleyb: right
15:56:02 the main one
15:56:34 it'll require some rebase work once HA for DVR merges
15:56:54 or it'll merge first :)
15:57:07 i borked the HA DVR patch updating the commit message, so it will take an hour or more to clear
15:58:14 haleyb: no problem, anyway we have a couple of tests that were failing, let me check with adolfo on this.
15:58:21 before we close.
15:58:36 he's getting a free recheck
15:58:57 #topic Open discussion
15:59:03 haleyb: we were (obondarev and myself) planning to add in a general session talk on DVR improvements for Mitaka.
15:59:16 haleyb: would you like to be part of that discussion.
15:59:35 Swami: sure, i was going to ping you offline about it, but works for me
15:59:56 let me know how i can help with abstract
16:00:16 haleyb: ok I will loop you in the discussion. We have an abstract in google doc.
16:00:21 haleyb: https://docs.google.com/document/d/1WCjq0FL1NxSistA7nfceeXaf3gmVzDCvJm0Jmwt80Dw/edit
16:00:31 obondarev: thanks for the link
16:00:38 thanks, and we need to close out for the ML2ers
16:00:40 #endmeeting