14:02:35 #startmeeting networking
14:02:36 Meeting started Tue Nov 18 14:02:35 2014 UTC and is due to finish in 60 minutes. The chair is mestery. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:02:38 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:02:39 hello
14:02:41 The meeting name has been set to 'networking'
14:02:43 hi!
14:02:50 hi all!
14:02:51 #link https://wiki.openstack.org/wiki/Network/Meetings Agenda
14:02:51 o/
14:02:54 hi
14:03:04 hi
14:03:07 hi
14:03:18 Hi
14:03:22 Welcome back folks! I missed everyone in Paris, hope everyone had a great trip :)
14:03:39 we did, and congrats to you.
14:03:43 Thanks dougwig.
14:03:51 Sounds like a lot of really great discussions happened.
14:03:52 congrats!
14:04:01 #link https://wiki.openstack.org/wiki/Kilo_Release_Schedule
14:04:06 +1 for the new mestery - hope all are doing well :)
14:04:09 I wanted to make sure people are aware of the Kilo release schedule.
14:04:21 Thanks nati_ueno and regXboi. :)
14:04:34 #info Kilo-1 date: 12-18-2014
14:04:43 That's the first date of importance for folks.
14:04:54 mestery: Do we have the spec proposal freeze and spec approval freeze dates?
14:05:10 amuller_: I will have those by EOD today and send email to the list.
14:05:15 Great :)
14:05:17 thanks
14:05:18 And document them as well.
14:05:38 Also, one more announcement
14:05:43 #link https://wiki.openstack.org/wiki/Sprints/NeutronKiloSprint Mid-Cycle
14:06:00 Please note if you're attending, send your contact information to Jun from Adobe so he can presetup your wifi access
14:06:06 Any other announcements for the team?
14:06:55 #topic Bugs
14:06:59 enikanorov_: Hi there!
14:07:01 hi
14:07:55 enikanorov_: How current is the bug section of the meeting page?
14:08:11 mestery: it's not actual, i'll update
14:08:31 so the update for today
14:08:51 #action enikanorov_ to update the bugs section of the meeting page.
14:08:53 there is a bug in tempest or neutron that is being hit in the gate quite often
14:09:00 let me find the link
14:09:29 https://bugs.launchpad.net/tempest/+bug/1357055
14:09:31 Launchpad bug 1357055 in tempest "Race to delete shared subnet in Tempest neutron full jobs" [Undecided,Confirmed]
14:09:40 enikanorov_: I recall discussing this bug with armax or markmcclain yesterday
14:10:31 Looks like this is assigned to salv-orlando at the moment.
14:10:35 yep
14:10:48 so I guess he's not here right now
14:10:55 Comment #29 indicates it's a Tempest bug as well.
14:11:10 Yes, salv-orlando must be predisposed at the moment.
14:11:16 another bug which has raised quite a bit of discussion is https://bugs.launchpad.net/neutron/+bug/1382064
14:11:17 Launchpad bug 1382064 in neutron "Failure to allocate tunnel id when creating networks concurrently" [High,In progress]
14:11:43 this one was discovered during concurrent api tests (via rally)
14:11:58 and revealed a major flaw in id allocation logic
14:12:09 enikanorov_: Yiikes! Are you looking into this one?
14:12:12 mestery: that bug from what I gather is a tempest thing, but in the past two weeks I did not find time to look at it.
14:12:31 salv-orlando: Thanks for the update there!
14:12:36 mestery: the fix is on review and we have quite long discussion there with amuller_
14:13:03 actually the discussion spans beyond particular fix
14:13:08 enikanorov_: mestery: I think Eugene, Mike and myself showed our perspectives clearly and we probably need 3rd party feedback on the patch
14:13:12 Excellent! Thanks for tackling that one enikanorov_ and amuller_!
14:13:17 * markmcclain sneaks in late due to traffic
14:13:37 i hope we'll get a few minutes at the end of the meeting to describe underlying problem
14:13:38 amuller_: All the comments are in the review for this one then?
14:13:41 *to discuss
14:13:44 mestery: yeah
14:13:45 enikanorov_: We can do that right here if you want.
14:13:54 We can use 5-10 minutes now.
14:14:20 well, it's a general question about getting rid of with_lockmode('update') (for the sake of galera/mysql) and related difficulties
14:14:37 i think it's better to postpone it to the end of the meeting
14:14:45 enikanorov_: OK, that sounds fine.
14:15:19 enikanorov_: I just wanted to also highlight the email you sent to the list last week on neutron bugs:
14:15:21 #link http://lists.openstack.org/pipermail/openstack-dev/2014-November/049975.html
14:15:36 Have you had much luck in getting additional people to sign up for triage, recreating, etc?
14:15:51 yes, I've got a couple of contacts
14:16:00 enikanorov_: Perfect, glad to hear that!
14:16:23 enikanorov_: Also, what are your thoughts on doing a bug-day in the coming weeks? I'm thinking it would be a good thing for the community to partake in and help clear the bug count a bit.
14:16:31 i'm planning to go over some open bugs that don't have updates for last couple of weeks, probably reassigning them to other people
14:16:58 enikanorov: yes, i think we need to have it as a regular event
14:17:05 #action mestery to work with enikanorov_ on a bug day in the coming 2 weeks.
14:17:12 enikanorov_: Agreed, we'll make it happen.
14:17:14 How many bugs have assignees but they have effectively abandoned work on them?
14:17:31 HenryG: i don't know
14:18:31 OK, thanks for the updates on the bugs enikanorov_, and we'll get a bug day rolling soon.
14:18:42 mestery: cool
14:19:05 * mestery doesn't see emagana around for a docs update ...
14:19:17 #topic Docs
14:19:23 #action emagana to update docs section of meeting page
14:19:39 #topic Technical Debt in the Agents
14:19:50 * mestery isn't sure who added this to the agenda ...
14:19:58 i did on behalf of carl_baldwin
14:20:08 marios_ and carl_baldwin: Excellent!
14:20:10 #link https://etherpad.openstack.org/p/kilo-neutron-agents-technical-debt
14:20:10 it came up at the end of last week's l3... carl_baldwin ?
14:20:28 maybe a good point to mention the possibility of DHCP sub-group
14:20:29 marios_: I'll let you and carl_baldwin lead us through this discussion then if you guys are ready.
14:20:49 so i really haven't prepared anything. it was mostly for a discussion point about where/how to organise the work
14:20:51 marios_: Go for it.
14:21:32 so all the agents are discussed together in the etherpad. for example, the l2 fixes, should they be discussed under ml2 subgroup?
14:21:36 or do we need a new group etc
14:21:56 please no more groups
14:21:57 I can imagine it needing to be separate
14:22:02 but I'm not suggesting how
14:22:06 markmcclain: +1
14:22:07 ml2 -> integration point
14:22:11 agents -> control plane
14:22:34 what would be the reason for a sub group? a regular weekly meeting?
14:23:24 salv-orlando: all those reasons and focus
14:23:26 Wondering specifically about how to organize the L2 agent improvements discussed in the etherpad above. Attaching to some weekly meeting could help get things started.
14:23:27 salv-orlando: i guess so, somewhere to discuss progress on that work
14:23:51 salv-orlando: & accountability
14:23:59 can we carve out some time at the end of this meeting to start with before spinning a new group?
14:24:20 we could include L2 agent discussion in the weekly ML2 agenda as needed
14:24:23 carl_baldwin: I would propose we could use part of the weekly Neutron meeting for this as well in the on-demand agenda, especially to bootstrap.
14:24:58 +1 I think this was suggested by someone last week too, perhaps carl
14:25:05 mestery: rkukura: I’d be fine with either.
14:25:14 if it's high priority core neutron work, seems to make perfect sense to discuss here
14:25:19 which it sounds like it is
14:25:24 russellb: ++
14:25:29 it's really hard to follow all of these groups
14:25:33 russellb: agreed
14:25:38 we need a new subgroup to discuss adding new groups.
14:25:43 lol
14:25:45 dougwig: +1
14:25:50 basically, i'm not even trying, because there's too many :)
14:25:53 if you want to have a task force focused on repaying debt in any agent for this release cycle and set up a meeting for it I’m ok with it. I just don’t want to create permanent sub groups of subject matter experts
14:25:53 dougwig: +1
14:25:55 * dougwig ducks
14:26:00 oh please dear lord, no!
14:26:04 dougwig: +1
14:26:04 Yes, we have too many subgroups without clear charters or missions.
14:26:09 * regXboi runs screaming from the chat room
14:26:37 So, let's discuss the agent things each week in this meeting.
14:26:39 for instance in the last release cycle we did this “db meetings” but at the end of the day it was just a bunch of folks (4 maybe 5) syncing up on a task of making migrations idempotent
14:26:39 Sound good to everyone?
14:26:50 mestery: +1
14:26:53 sounds good to me
14:26:54 mestery: +1
14:27:03 mestery: +1
14:27:05 sounds good to me
14:27:08 wfm
14:27:10 yup
14:27:13 #info Agent refactoring discussion to happen in weekly neutron meeting going forward
14:27:13 anything that needs more bandwidth… use #openstack-neutron or the mailing list
14:27:21 salv-orlando: +1
14:27:22 salv-orlando: ++
14:27:40 As a team, we need to refocus to having these discussions in the team meeting, and use ML and IRC for higher bandwidth discussions as needed.
14:27:53 mestery: +1
14:28:06 We can't scale with 10s of sub-teams meetings all the time. :)
14:28:07 In addition, regarding subgroup meeting, all meetings listed at https://wiki.openstack.org/wiki/Meetings#OpenStack_Networking_.28Neutron.29 will be held in Kilo cycle? If not, please update.
14:28:35 amotoki_: I'm going to request sub-teams to have a clear charter by next week's neutron meeting, and if they don't, they should disband.
14:28:47 We need to be having discussions in the broader meeting and not require people to attend all these other meetings.
14:28:50 It's not scalable.
14:29:08 +1
14:29:08 mestery: +1
14:29:13 mestery: +1
14:29:15 mestery: +1
14:29:19 mestery: +1
14:29:29 I'll send email to the list post meeting on this, and add agenda item for next week.
14:29:29 mestery: +1
14:30:00 Anything else to discuss on the agent item today?
14:30:06 mestery: +1
14:31:01 #topic Services Split Update
14:31:03 mestery: marios_: We need to identify who will be working on this. Maybe a mail to the ML could help bootstrap this?
14:31:06 #undo
14:31:07 Removing item from minutes:
14:31:16 carl_baldwin: I think that would be ideal, yes.
14:31:26 carl_baldwin: people have claimed stuff in the etherpad but we can make sure they are still up for it
14:31:27 #action carl_baldwin to send email to list to get a quorum of folks for agent refactoring.
14:31:35 carl_baldwin: I know in the past banix has expressed interest here as well.
14:32:02 Will do. I think we can move on.
14:32:10 Thanks carl_baldwin!
14:32:15 #topic Services Split Update
14:32:21 So, the short story here is there is no update yet. :)
14:32:31 lol
14:32:32 markmcclain and I are working on this with the TC at the moment.
14:32:52 any idea when the TC will be able to have an official meeting about it to decide?
14:32:53 From a proposal perspective. markmcclain, did I miss anything else?
14:33:10 blogan: likely next week
14:33:30 #info TC to have services split discussion next week
14:33:37 because of the async way we consider items… I'll propose for this week and for adding to next week's agenda
14:33:37 mestery markmcclain: thanks, what's the proposal that is put in front of the TC?
14:34:08 markmcclain: should probably have a ML thread to introduce it first, with about a week lead time before it hits TC agenda
14:34:24 SumitNaiksatam: to divide the main neutron repo into two that are managed by the networking program
14:34:27 #action markmcclain and mestery to send email to ML prior to TC agenda addition
14:34:30 russellb: Good idea
14:34:49 russellb: right wanted to wait until folks were back to introduce so that the thread didn't get skipped
14:34:57 col
14:34:58 cool, too
14:35:03 lol :)
14:35:25 So, that's the update on services split. Look for the ML discussion soon.
14:35:27 anything we can do in the meantime?
14:35:42 blogan: you can wait for a message on the ml ;)
14:35:47 lol
14:35:50 lol
14:35:50 lol thanks salv
14:36:13 #topic Open Discussion
14:36:20 * mestery notes we may end a bit early today.
14:36:24 blogan: dougwig has been working on the proposed code organization item
14:36:30 enikanorov_: This is your slot! :)
14:36:35 enikanorov_: For the previous discussion.
14:36:39 ok
14:36:49 mestery: can we talk about the meetup as well after?
14:36:50 DHCP direction, or god forbid I say sub-group
14:37:04 so here's the issue, we're trying to get rid of locking tables (with_lockmode)
14:37:39 * regXboi listens
14:37:41 in some cases consistency can't be achieved without that, so retries need to be used to achieve the result
14:38:00 dkehn: you have to file the form for the subgroup on subgroups :)
14:38:08 markmcclain: thanks
14:38:24 the problem is that such operations are performed under transactions which in mysql have 'repeatable read' isolation level
14:38:43 and hence retry logic just doesn't work because the code fetches the same values over and over again
14:38:45 enikanorov_ are the cases where consistency can't be achieved corners or are they in the main space?
14:38:54 enikanorov: the API refactoring plus separation into tasks should remove much of the need for the locking we're doing now
14:39:04 shouldn’t the transaction be inside the retry loop?
14:39:32 rkukura: that seems to be an obvious solution, but that's just a small method called by, say, create_network, that has one big transaction
14:40:29 markmcclain: can you explain about the tasks?
14:40:48 enikanorov_: it was a bit about what we talked about in the sessions in Paris
14:40:51 markmcclain: since that’s not happening tomorrow it still makes sense fixing in the existing code base
14:40:52 markmcclain: i mean how it helps
14:41:03 salv-orlando: agree
14:41:16 salv-orlando: yeah… not sure we'll be able to make a meaningful lock_mode fix for Juno
14:41:18 enikanorov_: what markmcclain is saying is that we’ll pretty much rewrite everything
14:41:38 salv-orlando: i like rewriting stuff :) but...
14:41:43 everything… even ourselves ;)
14:41:48 lol
14:41:49 In the mean time we have https://bugs.launchpad.net/neutron/+bug/1382064 (Concurrent network creations all try to use the same segmentation ID, and the retry loop just tries the same number 10 times and fails)
14:41:50 Launchpad bug 1382064 in neutron "Failure to allocate tunnel id when creating networks concurrently" [High,In progress]
14:41:55 you know, we're dealing with issues that appear in distributed environment
14:42:00 We need a short term solution to that bug
14:42:06 retry logic is a correct way to deal with those
14:42:36 can we special case things that need to retry like this to use a different isolation mode?
14:42:39 if we somehow serialize access - that would mean we create contention point
14:43:04 marun: exactly. that is a proposed solution
14:43:19 enikanorov_: are there any drawbacks to that approach?
14:43:53 marun: potentially, in 'update' operations, but anyway, postgres already uses 'read committed' isolation level that is fine for retry logic
14:44:28 marun: In my mind it destroys my ability to reason about the code base. If a subtransaction deep in the stack (such as in the patch proposed) sets a different transaction level, I have no way of knowing that unless I'm just familiar with the entire code base.
14:44:40 so for now the solution for mysql is to change tx isolation level from default to 'read committed'
14:44:46 I no longer know what transaction level each transaction uses
14:44:58 And I have to constantly check and think about what that means for every flow
14:45:04 amuller_: you don't know it anyway, because it is backend-dependent right now
14:45:16 each DB has a default
14:45:17 enikanorov_: That would seem to be pretty dangerous if code has been written to assume 'consistent read' :/
14:45:21 at least it's consistent
14:45:38 if different parts of the code base use different levels... I don't know, that seems insane to me to be honest
14:45:39 marun: correct
14:45:53 or are we safe there? I mean, postgres uses read committed and the code is fine, right?
14:45:54 marun: good news is that 'create' operations are safe for that
14:45:56 enikanorov_: I would leave the isolation where it is, because it's the default for everyone, and look at the length the lock is there
14:46:29 marun: we don't have anough concurrent api testing to say that code is fine with postgres
14:46:35 *enough
14:46:37 enikanorov_: fair enough
14:47:02 so if any call chain issues the same query twice for the same object, that is a potential issue
14:47:13 either way it's better to use the same isolation level for mysql and postgres
14:47:29 yamamoto: well, that implies a pretty significant change then.
14:47:34 yamamoto: i tend to agree.
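The retry problem described in this discussion (a retry loop that runs inside one 'repeatable read' transaction keeps fetching the same values) can be illustrated with a toy model. This is a minimal sketch and not Neutron code: ToyDB, ToyTransaction, Conflict and both allocate functions are hypothetical names invented for the illustration, and the snapshot is taken at begin() as a simplification of how a real database establishes it.

```python
class Conflict(Exception):
    """Another writer already committed the ID we tried to claim."""


class ToyDB:
    """Hypothetical in-memory stand-in for a table of allocated IDs."""

    def __init__(self, id_range):
        self.id_range = list(id_range)
        self.committed = set()

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    """Models 'repeatable read': reads only see the snapshot from begin()."""

    def __init__(self, db):
        self.db = db
        self.snapshot = frozenset(db.committed)

    def first_free_id(self):
        for seg_id in self.db.id_range:
            if seg_id not in self.snapshot:
                return seg_id
        raise RuntimeError("ID range exhausted")

    def claim(self, seg_id):
        # Writes are checked against live committed data, not the snapshot.
        if seg_id in self.db.committed:
            raise Conflict(seg_id)
        self.db.committed.add(seg_id)


def allocate_retry_inside_tx(tx, attempts=10):
    """Retry loop inside one transaction: every attempt re-reads the same snapshot."""
    for _ in range(attempts):
        try:
            seg_id = tx.first_free_id()
            tx.claim(seg_id)
            return seg_id
        except Conflict:
            continue
    return None


def allocate_tx_inside_retry(db, attempts=10):
    """Transaction inside the retry loop: every attempt starts from a fresh snapshot."""
    for _ in range(attempts):
        tx = db.begin()
        try:
            seg_id = tx.first_free_id()
            tx.claim(seg_id)
            return seg_id
        except Conflict:
            continue
    return None


db = ToyDB(range(1, 5))
stale_tx = db.begin()   # snapshot taken: nothing allocated yet
db.committed.add(1)     # a concurrent request commits ID 1 in the meantime

stale_result = allocate_retry_inside_tx(stale_tx)  # picks ID 1 on every attempt
fresh_result = allocate_tx_inside_retry(db)        # new snapshot sees ID 1 is taken
```

In this model the stale transaction exhausts its attempts on the same ID, while the variant that opens a transaction per attempt moves on to the next free one. Under 'read committed' each read would see newly committed rows, which is why that level is described above as fine for retry logic.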
14:47:36 I proposed an alternate on the patch
14:47:38 marun: agree as well
14:47:50 https://review.openstack.org/#/c/129288/4//COMMIT_MSG
14:48:40 Random will be fine as long as the available space is sparsely consumed. As soon as the space is not sparsely consumed then it will be terrible. Won't this be the case for VLAN ids?
14:48:51 carl_baldwin: exactly
14:49:00 carl_baldwin: We can provide a different solution for tunneling and VLANs, since they're different..
14:49:10 emagana won the prize for being the one confusing the time for the networking meeting.. DLT!!!! I hate you!!
14:49:12 What about writing tests that attempt to validate concurrency of the operations in question?
14:49:39 marun: well... wei use rally in our lab
14:49:40 We can debate the merits of an approach in theory, but unless we're validating our assumptions the debate will have to continue into production environments.
14:49:42 *we
14:49:56 emagana: lol
14:50:00 Won’t it also be the case if the configured available range of ids (regardless of type) is small?
14:50:04 emagana, not sure if you won ;), I was here 1 hour before ;D, and missed the last one
14:50:07 the issue was found with rally, and that's how i validated the fix
14:50:27 mestery: sorry.. and I thought I was early!! LOL
14:50:29 carl_baldwin: for tunnels there is not much sense to configure small ranges
14:50:29 enikanorov_: Hmmm, so rally is sufficient testing then?
14:50:53 enikanorov_: Can we change the transaction isolation to 'read committed' for mysql and retest to see if issues appear?
14:51:00 marun: it's load/concurrency/performance testing framework for which a couple of api tests for neutron is written
14:51:03 globally, I mean
14:51:04 Or have you already done so?
14:51:13 enikanorov_: but people tend to do, for some reason, so could be an issue, or we'd need to state clearly that they need to use broad tunnel id ranges
14:51:15 marun: that's what i did. it fixes the issue
14:51:26 ah, globally
14:51:40 yeah, we can test that.
14:51:56 enikanorov_: It is a small change with a potentially big impact.
14:52:02 enikanorov_: But at least it has consistency on its side.
14:52:18 Does anyone have any objection to attempting to move to the new isolation level?
14:52:26 It could cause problems, but we're early in the cycle.
14:52:44 and also, postgres is already 'read committed'
14:52:50 Has jaypipes looked at it?
14:52:52 Also, has anyone consulted mike bayer on the issue?
14:53:03 marun: let's test that at least...the lock wait timeout bugs are hitting us anyway
14:53:18 marun: I think it is worth some testing.
14:53:19 rossella_s: agreed, testing is the first step.
14:53:19 * regXboi wonders if we have to back off and treat concurrent access like we would treat multi-master
14:53:32 Jay and I talked about this class of problem before and he had some feedback
14:53:44 enikanorov_: Yes, postgres is using ‘read committed’ but we’re not testing that as much.
14:53:53 carl_baldwin: true
14:54:20 carl_baldwin: i just mean that at the level of confidence that gates give us, it was fine at times we tested neutron with postgres
14:54:59 maybe it's worth writing to the ML so that other people can give their feedback? They might have tried it in other projects...
14:55:08 rossella_s: will do
14:55:10 rossella_s: +1
14:55:10 rossella_s: That's a good idea actually
14:55:41 trying "READ COMMITTED" globally sounds like a good approach to me from the consistency point of view, ...
14:55:50 but it will be good to hear others' experiences on that.
14:56:16 enikanorov_: Fair enough. I just wanted to point out that it won’t necessarily be a slam dunk.
14:56:42 enikanorov_: Are you going to send the mail to the list?
14:56:53 mestery: yes
14:57:06 enikanorov: ought we time bound the ML + responses time, and set a target date for decision?
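The 14:48:40 concern about random ID selection in a densely consumed range can be put in numbers. A minimal sketch, assuming each pick is uniform over the whole configured range and independent of earlier picks (so the count of attempts follows a geometric distribution); expected_random_attempts is a hypothetical helper, not part of the patch under review.

```python
def expected_random_attempts(range_size, allocated):
    """Mean number of uniform random picks needed to land on a free ID.

    With `free` of `range_size` IDs unallocated, each pick succeeds with
    probability free / range_size, so the mean is range_size / free.
    """
    free = range_size - allocated
    if free <= 0:
        raise ValueError("no free IDs left in the range")
    return range_size / free


sparse = expected_random_attempts(100, 0)    # empty range
half = expected_random_attempts(100, 50)     # half consumed
dense = expected_random_attempts(100, 99)    # one free ID left
```

Under these assumptions a half-consumed range costs two picks on average, while a range with a single free ID costs as many picks as the range is large, which also applies to a small configured tunnel ID range.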
14:57:12 #action enikanorov_ to send mail to ML around locking issues with bug https://bugs.launchpad.net/neutron/+bug/1382064
14:57:14 Launchpad bug 1382064 in neutron "Failure to allocate tunnel id when creating networks concurrently" [High,In progress]
14:57:28 OK, 3 minutes left folks.
14:57:29 I'd like to point out that a solution to that problem may also buy you multi-master for essentially free
14:57:32 Anything else quick this week?
14:57:33 enikanorov: decision specifically for if we do the test or not
14:57:44 i have a quick question/clarification
14:57:54 wrt the functional tests coverage for technical debt - we *are* talking about in-tree functional tests right (not tempest). I have started poking at the dhcp_agent (non-existent in-tree functional tests afaics) testing and want to make sure I don't go off on a tangent
14:57:55 regXboi: +1 I was thinking of that, but at cost of higher query inter-locking...
14:57:59 mestery: we could continue the planning for the adv services spinout meetup while the TC deliberation is happening.
14:58:06 glebo: we will do such test, by saying 'we' i mean me and my colleagues
14:58:11 I just want to throw this bombshell… I think the application should not make assumptions on the underlying database isolation mode. But don’t take me seriously I just want to stir up the discussion.
14:58:17 mestery: #link https://etherpad.openstack.org/p/advanced-services-kilo-midcycle
14:58:21 salv-orlando: lol
14:58:21 marios_: in-tree, yes
14:58:24 ajo: I think we can avoid the query inter-locking at the cost of "eventual consistency" :(
14:58:27 amuller_: thx :)
14:58:29 I want to make sure the meeting finishes at 15GMT
14:58:31 SridarK: That's ongoing yes, we'll get that sorted out, expect email soon
14:58:34 enikanorov: ah, cool. Thx.
14:58:41 salv-orlando: somewhat true, somewhat not ;)
14:58:43 mestery: thanks
14:58:54 enikanorov: so we'll do in parallel then. Good to know.
14:58:55 marios_: Please add me as a reviewer as soon as you have something up :) It should be similar to the L3 agent functional testing I think?
14:59:02 OK, we're winding down now.
14:59:08 amuller_: indeed and will do
14:59:11 If you got an action item, please review post meeting.
14:59:14 mestery: action to enikanorov for doing the test in parallel and reporting back?
14:59:16 I'll walk through those next week at the meeting.
14:59:23 marios_, add me too ;)
14:59:26 ajo: ack
14:59:42 mestery: lbaas feature branch reviews
14:59:42 We'll see you all next week!
14:59:47 #endmeeting