14:00:32 #startmeeting neutron_drivers
14:00:33 Meeting started Fri Mar 22 14:00:32 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:36 The meeting name has been set to 'neutron_drivers'
14:00:42 hi
14:01:04 hey slaweq
14:01:13 hi
14:01:15 hi
14:01:46 hi doreilly, haleyb
14:01:56 o/
14:02:07 let's wait a minute for others to join
14:02:11 hi njohnston
14:02:15 hi
14:03:20 ok, let's start
14:03:28 #topic RC-1
14:04:04 This is our RC-1 dashboard: https://launchpad.net/neutron/+milestone/stein-rc1
14:04:32 As you can see, most of what we targeted has landed
14:05:37 hi
14:05:38 we are waiting for https://review.openstack.org/#/c/641712 and https://review.openstack.org/#/c/643486 to land
14:05:52 amotoki: talking about RC-1
14:06:51 mlavalle: yes, and those two failed in the check queue again :/
14:07:04 most gate failures today are caused by neutron-fullstack failures. perhaps it has been the same for the past few weeks.
14:07:16 amotoki: not this time
14:07:23 one failed on functional tests
14:07:27 and one on grenade
14:07:39 so, is it just bad luck?
14:07:43 yes, those are different failure patterns.
14:07:45 earlier it failed on a linuxbridge job because some packages weren't installed
14:08:25 mlavalle: our gate still isn't great, and in those 2 cases it was just bad luck :/
14:08:28 this afternoon (in my timezone), half of the failures were from fullstack and the others were what slaweq mentioned, plus the linuxbridge one
14:08:43 yes
14:08:50 for fullstack we have 2 bugs reported already
14:09:09 and for some of the functional test problems too
14:09:48 for linuxbridge, AFAIK there is one issue reported: https://bugs.launchpad.net/neutron/+bug/1815585
14:09:49 Launchpad bug 1815585 in neutron "Floating IP status failed to transition to DOWN in neutron-tempest-plugin-scenario-linuxbridge" [High,Confirmed]
14:10:09 the other things are IMO problems with e.g. volume tests, with infra, and the like
14:10:29 and sometimes random tests fail because the FIP is not accessible via SSH
14:10:40 so let's give it a try over the morning in the USA
14:10:43 all of those are already known problems, but we don't have fixes for them :/
14:11:33 if we haven't merged them by after lunch here, we will cut RC-1 and create an RC-2
14:11:39 agreed. I confirmed that what I saw today is covered.
14:11:58 ok, I agree with this plan
14:12:21 amotoki: don't worry about the release patch. I can take care of it
14:12:24 I am wondering whether we can mark fullstack as non-voting temporarily... to make the gate smoother
14:12:54 that's another alternative
14:13:03 just to land these two patches
14:13:27 what do others think?
14:13:48 fine for me
14:13:52 +1
14:13:56 but we will need to get that patch merged too :)
14:14:29 let's give it a try this morning
14:14:36 the other way is to release RC1 as-is. I think the two pending patches can land in RC2. One is really a bug fix and the other (the upgrade check) is super useful for operators.
14:15:30 mhhhhh, yeah, it seems sensible to have an RC-2
14:15:44 yes, but this bug fix changes an existing API (not released yet), so if we release with the old (buggy) behaviour, we will probably need an API extension to make the change discoverable
14:15:56 but if it lands in RC-2 I think it's still fine
14:16:14 just IMO we shouldn't release stable Stein without this bug fix :)
14:16:23 I agree
14:17:00 IMHO this kind of follow-up patch for a new feature is considered release critical :)
14:17:58 so, are we saying that we don't cut RC-1 without https://review.openstack.org/#/c/641712?
14:18:24 yes
14:19:29 ok
14:21:13 so, here's what I propose:
14:21:35 1) Keep trying to get https://review.openstack.org/#/c/641712 through the gate
14:22:03 2) In parallel we create a patch to disable fullstack temporarily
14:22:14 does that work?
14:22:42 it works for me.
14:22:42 sure, I will write that patch right now
14:22:50 3) https://review.openstack.org/#/c/643486 can be RC-2
14:23:29 IMHO yes, this can be in RC-2
14:23:34 +1 to 3) too
14:23:54 any other thoughts?
14:24:50 all of your 1-3 sound reasonable to me
14:24:51 seems like a sound plan to me
14:24:59 if we run out of time for RC1, https://review.openstack.org/#/c/641712 can be RC-2 too.
14:26:25 smcginnis: you around?
14:26:32 o/
14:26:56 smcginnis: we are struggling to merge a bug fix: https://review.openstack.org/#/c/641712
14:27:06 we don't want to cut RC-1 without it
14:27:15 how much more time do we have?
14:27:34 We'd really like to wrap up today, but probably Monday at the latest.
14:27:43 Are there gate issues blocking that?
14:27:51 yes, our gate
14:28:00 755611
14:28:04 sorry
14:28:30 Just the one patch still?
14:28:46 smcginnis: yes, just that one patch
14:29:09 OK, I guess just keep working on it, and we can cut the release as soon as it makes it through.
14:29:28 it fixes an API behavior, so we don't want to cut without it
14:29:31 If you know it will make the current state unusable as an RC1, we might as well wait a little longer.
14:29:57 no, it's not that bad
14:30:10 but it is limited to the QoS feature...
14:30:14 an RC is really a release candidate, so I think we can fix it in RC2. we just happen to know about the bug now, but if it were found after RC1 the situation would not be so different.
14:30:18 RC1 will be usable. it will just keep a behavior we want to prevent
14:30:44 Oh, then yeah, I would cut RC1 and pick that up along with any translations and other critical bugs in RC2.
14:32:10 smcginnis: ok, so we will give it a try during the morning. If by early afternoon US Central (your time and mine) it hasn't gone through, I'll cut RC-1. Does that work for you?
14:33:26 mlavalle: Yep, that sounds like a good plan.
14:33:50 smcginnis: cool, I'll keep you posted in the release channel. Thanks for the advice
14:34:47 np
14:34:53 mlavalle: amotoki: the patch to disable fullstack is ready: https://review.openstack.org/645602
14:36:06 slaweq: thanks. Just +2ed it
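[Editor's note: for readers unfamiliar with OpenStack CI, temporarily disabling a job's vote is normally a one-line job variant in the project's .zuul.yaml. The sketch below is an illustration of that pattern, not the contents of the review linked above; the exact pipeline placement in neutron's config is an assumption.]

    - project:
        check:
          jobs:
            # Temporarily non-voting while fullstack failures are
            # investigated; revert once the gate stabilizes.
            - neutron-fullstack:
                voting: false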
14:37:30 #topic RFEs
14:37:37 according to Murphy's law, I bet that both those patches will merge just after we cut RC1 :D
14:38:13 slaweq: yeap, LOL
14:38:15 that would be true :p
14:39:15 since doreilly is here, let's discuss https://bugs.launchpad.net/neutron/+bug/1817022 again
14:39:17 Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:39:37 I did some stress testing as suggested in the last meeting
14:39:49 thank you for doing that
14:40:00 and found that a longer inactivity_probe can prevent InvalidDatapath errors
14:40:24 https://bugs.launchpad.net/neutron/+bug/1817022/comments/6
14:41:08 But I don't really understand where the InvalidDatapath error comes from
14:43:35 it comes from the ryu ofctl app
14:44:14 yamamoto: yeah, I don't know about its internals :)
14:45:10 the agent reports it as a RuntimeError
14:45:56 and inside that exception we see Datapath Invalid
14:46:41 iirc InvalidDatapath usually means the switch disconnected
14:46:53 https://ryu.readthedocs.io/en/latest/app/ofctl.html#module-ryu.app.ofctl.exception
14:47:26 "This can happen when the bridge disconnects."
14:47:47 and this is in the stack trace: cookies = set([f.cookie for f in self.dump_flows()]) - maybe this takes a long time with 34k+ flows
14:47:57 mlavalle: thank you. it seems my memory is working better than i expected.
14:48:05 yamamoto: it is
14:48:09 LOL
14:50:09 doreilly: sounds plausible
14:51:26 so the question is: do we need to make inactivity_probe configurable, or maybe just hardcode a higher default?
14:51:39 but i'm not sure why the switch performs the inactivity probe while it's busy sending flows
14:52:48 yamamoto: hmm, right. But maybe python is still parsing the text just downloaded from the switch
14:53:37 doreilly: so what you are saying is that the slowness is on the Python side?
14:54:07 doreilly: it might be. i don't remember where messages can be buffered.
14:54:23 mlavalle: right, the eventlet thread might be blocking
14:55:24 yamamoto: where would you dig?
14:55:52 * mlavalle would like to at least end the meeting with an action item
14:56:40 i don't know. i'll read doreilly's comments in the bug and add some comments.
14:56:43 maybe next week
14:56:56 okay, thanks guys
14:57:21 doreilly: I'll also try to dig into it a bit
14:57:31 probably later today
14:57:46 I'll report any ideas / findings in the bug
14:57:51 mlavalle: thanks
14:58:27 doreilly: so based on your latest findings, lengthening inactivity_probe is sufficient?
14:58:35 yes
14:58:46 we don't need to tweak max_backoff
14:58:52 right?
14:58:57 i don't think so
14:59:08 it's only for reconnects
14:59:17 ok
14:59:24 thanks for the update
14:59:30 Have a nice weekend!
14:59:38 bye
14:59:42 we'll get this RC-1 out of the door today!
14:59:48 thx
14:59:53 #endmeeting
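[Editor's note: for reference, the two knobs discussed in the RFE above are columns in the OVS Controller table, both in milliseconds, and can be set out-of-band with ovs-vsctl. The Python sketch below illustrates that; it is not the RFE's patch (which is about making the probe configurable in the agent), and the bridge name and the 30-second value are illustrative assumptions.]

    # Sketch: lengthen the OpenFlow inactivity probe on a bridge's
    # Controller record, and optionally max_backoff (reconnects only).
    import subprocess

    def tune_bridge_controller(bridge, probe_ms=30000, backoff_ms=None):
        # ovs-vsctl can address a Controller record by its bridge's name
        # when the bridge has exactly one controller, as is the case with
        # the OVS agent's native OpenFlow interface.
        subprocess.check_call(
            ["ovs-vsctl", "set", "controller", bridge,
             "inactivity_probe=%d" % probe_ms])
        if backoff_ms is not None:
            # Per the discussion above, max_backoff only affects
            # reconnects, so it is left untouched by default.
            subprocess.check_call(
                ["ovs-vsctl", "set", "controller", bridge,
                 "max_backoff=%d" % backoff_ms])

    tune_bridge_controller("br-int")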