14:00:38 #startmeeting networking
14:00:38 hi
14:00:38 Meeting started Tue Mar 15 14:00:38 2016 UTC and is due to finish in 60 minutes. The chair is kevinbenton. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:39 o/
14:00:39 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:42 The meeting name has been set to 'networking'
14:00:43 hi
14:00:44 o/
14:01:00 \o/
14:01:04 This will be a relatively short meeting, just a few announcements
14:01:16 #topic Announcements/Reminders
14:01:18 o/
14:01:24 hello
14:01:41 Today the branch for stable/mitaka is going to be cut
14:01:53 \o/
14:01:59 hello
14:02:07 aloha
14:02:27 so any fixes that need to go into mitaka after this will need to be back-ported like we would for any other stable branch
14:02:28 o/
14:03:03 ack
14:03:09 o/
14:03:20 I believe armax and ihrachys have narrowed the bugs down so there are no major blockers for RC-1 that we need to worry about
14:03:38 does anyone have any bugs that need to be brought to everyone's attention?
14:04:05 (that would be a major issue in Mitaka)
14:04:13 https://bugs.launchpad.net/neutron/+bug/1556884
14:04:13 Launchpad bug 1556884 in neutron "floating-ip association is allowed via router interface" [Medium,In progress] - Assigned to YAMAMOTO Takashi (yamamoto)
14:04:52 hichihara: thanks, i saw this one as well and it looks like we accidentally added a feature :)
14:04:55 I'm not sure that is worth Mitaka. L3 folks should review it.
14:05:27 hichihara: it may be worth considering because we don't want to accidentally ship a feature that looks like it works
14:06:13 kevinbenton: I think so
14:06:15 hichihara: probably something to propose as a back-port to the mitaka branch before the final release
14:06:23 o/
14:07:09 Also, this week is the week to announce PTL candidacies, so if you are interested in being PTL, send an email to the list!
14:07:10 kevinbenton: The feature is indeed very useful
14:07:19 kevinbenton: I'm OK.
14:08:17 I haven't seen a Neutron candidacy yet.
14:08:39 hichihara: i don't think anyone has sent one yet
14:08:50 Armando is suspiciously quiet :)
14:09:00 hichihara: we still have hope Armando will lead the way ;)
14:09:22 ihrachys: Of course! :)
14:09:23 #link https://launchpad.net/neutron/+milestone/mitaka-rc1
14:09:43 I've heard Cthulhu wants to run as neutron PTL...
14:09:46 ^^ that's the stuff targeted for RC1, keep an eye on anything still open in high or critical status
14:10:00 salv-orlando, is he friends with zuul?
14:10:27 ajo: they might know each other from some past work experience but I don't think they're friends
14:10:40 * njohnston is highly amused
14:10:45 lol
14:10:46 #info hichihara is bug deputy this week!
14:10:57 Yeah.
14:11:07 good luck hichihara
14:11:17 I have already started
14:11:43 hichihara: ++
14:11:55 #link https://github.com/openstack/neutron/blob/master/doc/source/policies/bugs.rst#neutron-bug-deputy
14:12:01 ^^ bug deputy info
14:12:25 #topic open discussion
14:12:38 i don't have anything else. does anyone have anything they would like to discuss?
14:12:50 there are some restructure-l2-agent related bugs
14:13:02 bug/1528895 and bug/1430999
14:13:10 salv-orlando: lol
14:13:22 I wonder if we want a quick fix for the coming release
14:13:56 hichihara++
14:14:14 i don't think so on bug/1430999. we can advise timeout increases for that as a workaround
14:14:36 the bugs have been there for more than a release and affect only high density environments
14:15:22 iwamoto: yes, i don't think we want to put together last minute chunking fixes for these
14:15:54 I agree with you kevinbenton ...
14:16:12 we can target such issues in N-1?
14:16:46 ok
14:16:53 yes, we need to clearly identify the bottlenecks anyway
14:17:01 I think reverting the change is better than increasing timeouts
14:17:05 we need a more general approach to fix those issues... I think Kevin is working on that, right Kevinbenton?
14:17:09 the new RPCs don't scale
14:17:11 rossella_s: yes
14:17:21 iwamoto: wait, revert what?
14:17:39 batched agent RPC calls
14:18:32 or impose some limit on the number of ports one RPC can send
14:18:49 iwamoto: didn't that ship in liberty?
14:18:50 iwamoto, I think the former is better
14:19:16 iwamoto, we could have a parameter for agents (max bulk ports objects call) or something like that
14:19:21 objects per call
14:19:47 yes, liberty has the bug. and at least one person had to increase the timeout as a workaround, it seems
14:19:50 breaking the calls into chunks is basically pointless though
14:20:04 when increasing the timeout achieves the same effect
14:20:04 ajo we need something better than that
14:20:14 kevinbenton you were working on that right, can't find the patch right now
14:20:23 rossella_s: yes, there is a spec
14:20:32 https://review.openstack.org/#/c/225995/
14:20:46 why is it pointless?
14:20:54 ajo: what does it achieve?
14:21:07 ajo: the agent still sits there and waits for all of the calls to return
14:21:27 kevinbenton: smaller subsets that need to be completed
14:21:30 ajo: so waiting for 50 smaller calls to return instead of 1 big one doesn't improve anything on the agent side
14:21:38 IMO the agent should not issue such a gigantic RPC call
14:21:41 that could be backported when ready to fix these issues
14:21:41 thanks kevinbenton
14:21:45 the agent waits, but if some of the calls time out the succeeded ones don't need to be retried
14:21:56 why don't we increase the timeout from 1 min? I've made that suggestion before
14:22:12 We know 1 min is too low, we know the value is arbitrary anyway
14:22:15 Let's bump it up
14:22:17 amuller: +1
14:22:23 we can make the timeout dynamic
14:22:37 a factor per number of bulk call objects
14:23:09 if we can do that in oslo.messaging (not sure if we can dynamically control that per call)
14:23:12 amuller: +1
14:23:30 but +1 to just bumping it a bit
14:23:53 we could avoid such gigantic calls when they are not needed and send only a diff to update the l2 agent
14:23:53 we are only chunking because for some reason people have been afraid to increase these timeouts
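For reference, the workaround being advised above (simply raising the RPC timeout) is a one-line configuration change. A minimal sketch, assuming the stock oslo.messaging rpc_response_timeout option that neutron inherits (default 60 seconds) and an arbitrarily chosen higher value:

    # neutron.conf (and the agent's config file); a sketch, not a recommended value
    [DEFAULT]
    # default is 60; raise it so bulk agent calls on high density hosts
    # stop hitting MessagingTimeout
    rpc_response_timeout = 300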
14:24:13 I agree with Ajo on keeping the bump dynamic
14:24:23 rossella_s: i think the only time we get the huge ones is on startup anyway
14:24:54 how's the timeout controlled?
14:24:59 is there any way to set it per RPC call?
14:25:00 a configuration variable
14:25:03 kevinbenton, also on bulk create that might happen
14:25:12 * ajo digs oslo_messaging
14:25:18 is timeout an issue in other projects than neutron?
14:25:21 ajo: yes it can be per call
14:25:29 iwamoto, it is, for example in cinder
14:25:41 I know they need to bump it in some deployments
14:25:58 amuller, if that's the case, I'd propose controlling it on bulk calls based on the amount of objects
14:26:13 we need something now for the Mitaka release
14:26:15 and we can have a rpc_bulk_call_timeout_factor
14:26:16 there is almost no reason to not have a high timeout
14:26:24 amuller, short term: just bump it
14:26:39 short timeouts only protect against completely lost messages, that's it
14:26:44 amuller, bump it ;)
14:26:51 bump!
14:26:55 lol
14:27:02 :)
14:27:29 Kevinbenton, in fact, they don't even stop the server operation,
14:27:37 no, they don't
14:27:43 so the impact on server load is even worse, as the operation would be retried
14:27:49 so yes
14:28:21 from the server's perspective, processing a giant call is not much better than processing smaller calls
14:28:22 do RPC timeouts serve any positive purpose?
14:28:26 it makes sense to raise those timeouts by default, message loss is generally a non expected events, stuff could freeze for 3-4 minutes in such case and it'd be ok
14:28:31 iwamoto: detecting lost messages
14:28:56 "a non expected events" -> "a non expected event"
14:29:08 kevinbenton: aren't tcp and amqp supposed to handle that?
14:29:36 iwamoto: a server can die after it takes the call message off the queue
14:29:38 ajo: while your claim about system freeze is questionable, a timeout should be set in a way that 99% of non-problematic calls typically finish within that time
14:30:03 so if a call takes over 5 secs 50% of the time, a 5 sec timeout makes no sense, it must be increased
14:30:05 so it has some meaning in an active-active setup
14:30:20 iwamoto: yes
14:30:31 salv-orlando, even 99.9% ? :)
14:30:48 failing 1 of 100 non-problematic calls also sounds problematic
14:30:49 :)
14:31:07 ajo: whatever... nines are not my department
14:31:12 ;)
14:31:12 :)
14:31:20 salv-orlando, what timeout do we have now by default?
14:31:35 either 30 or 60 seconds
14:31:43 it comes from oslo messaging i think
14:31:50 kevinbenton: which we arbitrarily chose, didn't we?
14:31:55 right inherited
14:32:03 so arbitrary from our perspective
14:32:07 yep
14:32:27 60 secs
14:32:31 I just think a timeout should be set to a realistic value wrt the call you're making
14:32:42 I keep thinking, a certain other number would always be arbitrary...
14:32:52 but for now, higher is better
14:33:01 from a bulk calls perspective
14:33:36 ajo: an "educated guess" timeout... not entirely arbitrary, come on ;)
14:33:54 salv-orlando, yes that's why I say proportional timeouts could be a better approach looking at the long term
14:33:57 what about one that grows every time a timeout exception is encountered?
14:34:06 and sleeps in between?!
14:34:10 kevinbenton: you mean like the patch you already have up for review? =D
14:34:16 yeah, that one :)
14:34:21 funny you should mention it
14:34:46 https://review.openstack.org/#/c/286405/
14:35:11 +1 for exponential backoffs (not only extra timeout)
14:35:11 This was all an elaborate setup to get everyone to look at unicode table flipping
14:35:17 there's a lot of red on that one :)
14:35:42 Add exponential backoff to RPC client: https://review.openstack.org/#/c/280595/
14:36:31 hmm, Kevinbenton++
14:36:51 * ajo reviews again
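The backoff patch linked above is still under review, but the idea being discussed (grow the timeout each time a call times out, so a configured value that turns out to be too low for a given deployment corrects itself) looks roughly like the sketch below. It is illustrative only, not the code in https://review.openstack.org/#/c/280595/; the helper name and defaults are made up, while prepare(timeout=...) and MessagingTimeout are the actual oslo.messaging pieces mentioned in the discussion.

    import oslo_messaging


    def call_with_backoff(client, context, method, timeout=60,
                          max_timeout=600, **kwargs):
        """Issue an RPC call, doubling the timeout after each timeout error.

        `client` is assumed to be an oslo_messaging.RPCClient; everything
        else here is a hypothetical name used for illustration.
        """
        while True:
            try:
                # prepare() lets the timeout be overridden for this call only
                cctxt = client.prepare(timeout=timeout)
                return cctxt.call(context, method, **kwargs)
            except oslo_messaging.MessagingTimeout:
                if timeout >= max_timeout:
                    raise
                # exponential backoff: wait twice as long on the next attempt
                timeout = min(timeout * 2, max_timeout)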
14:36:55 what's the point in gradually increasing timeouts?
14:37:01 garyk1, about https://review.openstack.org/#/c/286405/
14:37:09 iwamoto: to make them larger :)
14:37:10 we can have the max from the beginning
14:37:12 I think it could be beneficial now,
14:37:26 and later on that could be added onto oslo messaging itself
14:37:51 iwamoto: the idea is that you want a timeout still to be able to detect lost messages in a reasonably quick time
14:38:45 iwamoto: this allows it to be increased for just the calls that trigger timeouts if the configuration setting is too low (which will probably be the case for many deployers)
14:39:16 Let's discuss on that patch
14:39:25 does anyone have anything else, or can we end the meeting?
14:40:04 I agree with kevinbenton, the timeout should be large enough that small delays are ignored, but not so large that it takes an eternity to get the message returned back. Keeping it dynamic helps in having the timeout in an acceptable range for different systems
14:41:51 kevinbenton: yay neutron as a learning system....
14:41:55 eventually it will be able to play go
14:42:22 salv-orl_: i think we need a neural network to determine timeout values
14:42:33 kevinbenton: I think you need some sleep
14:42:44 kevinbenton: how many layers of neural network do u need??? ;)
14:43:14 kevinbenton, I added a nit comment for you: https://review.openstack.org/#/c/280595/7/neutron/common/rpc.py
14:43:41 * haleyb buys stock in Skynet :)
14:43:50 kevinbenton: not interested unless it's running on containers
14:43:55 to really make the backoffs exponential too, as the literature suggests
14:44:05 I'm not an expert in fact, just a reader
14:44:18 ok
14:44:26 time for the meeting to be over i think :)
14:44:30 thanks everyone!
14:44:32 good night
14:44:34 #endmeeting
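The other idea floated during the discussion, scaling the timeout of bulk agent calls with the number of objects in the call (ajo's rpc_bulk_call_timeout_factor suggestion), was not written up during the meeting. A minimal sketch of what it could look like follows; it relies only on the per-call timeout that oslo.messaging's prepare() already supports, while the function, constant names and values are hypothetical rather than existing Neutron code.

    import oslo_messaging

    BASE_TIMEOUT = 60        # today's default rpc_response_timeout
    TIMEOUT_FACTOR = 0.5     # hypothetical extra seconds per device


    def get_devices_details_list(client, context, devices, agent_id):
        """Bulk device-details call whose timeout grows with the batch size.

        `client` is assumed to be an oslo_messaging.RPCClient; the RPC
        method name mirrors the agent call discussed above, but the whole
        function is illustrative rather than Neutron's real code path.
        """
        timeout = BASE_TIMEOUT + int(TIMEOUT_FACTOR * len(devices))
        cctxt = client.prepare(timeout=timeout)
        return cctxt.call(context, 'get_devices_details_list',
                          devices=devices, agent_id=agent_id)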