14:00:32 <mlavalle> #startmeeting neutron_drivers
14:00:33 <openstack> Meeting started Fri Mar 22 14:00:32 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:36 <openstack> The meeting name has been set to 'neutron_drivers'
14:00:42 <slaweq> hi
14:01:04 <mlavalle> hey slaweq
14:01:13 <doreilly> hi
14:01:15 <haleyb> hi
14:01:46 <mlavalle> hi doreilly, haleyb
14:01:56 <njohnston> o/
14:02:07 <mlavalle> let's wait a minute for others to join
14:02:11 <mlavalle> hi njohnston
14:02:15 <yamamoto> hi
14:03:20 <mlavalle> ok, let's start
14:03:28 <mlavalle> #topic RC-1
14:04:04 <mlavalle> This is our RC-1 dashboard: https://launchpad.net/neutron/+milestone/stein-rc1
14:04:32 <mlavalle> As you can see, most of what we targeted has landed
14:05:37 <amotoki> hi
14:05:38 <mlavalle> we are waiting for https://review.openstack.org/#/c/641712 and https://review.openstack.org/#/c/643486 to land
14:05:52 <mlavalle> amotoki: talking about RC-1
14:06:51 <slaweq> mlavalle: yes, and those two failed again in the check queue :/
14:07:04 <amotoki> most gate failures today are caused by neutron-fullstack failures. perhaps it has been the same these past weeks.
14:07:16 <slaweq> amotoki: not this time
14:07:23 <slaweq> one failed on functional tests
14:07:27 <slaweq> and one on grenade
14:07:39 <mlavalle> so, is it just bad luck?
14:07:43 <amotoki> yes, those are different failure patterns.
14:07:45 <slaweq> earlier it failed on some linuxbridge job because some packages weren't installed
14:08:25 <slaweq> mlavalle: our gate still isn't great, and in those 2 cases it was just bad luck :/
14:08:28 <amotoki> during today's afternoon (in my timezone), half of the failures were from fullstack and the others are what slaweq mentioned + the linuxbridge one
14:08:43 <slaweq> yes
14:08:50 <slaweq> for fullstack we have 2 bugs reported already
14:09:09 <slaweq> for some functional tests problems too
14:09:48 <slaweq> for linuxbridge, AFAIK there is one issue reported https://bugs.launchpad.net/neutron/+bug/1815585
14:09:49 <openstack> Launchpad bug 1815585 in neutron "Floating IP status failed to transition to DOWN in neutron-tempest-plugin-scenario-linuxbridge" [High,Confirmed]
14:10:09 <slaweq> the other failures are IMO problems with e.g. volume tests, with infra, and things like that
14:10:29 <slaweq> and sometimes random tests fail because a FIP isn't accessible via SSH
14:10:40 <mlavalle> so let's give it a try over the morning in the USA
14:10:43 <slaweq> all of those are already known problems but we don't have fixes for them :/
14:11:33 <mlavalle> if we don't merge them by after lunch here, we will cut RC-1 and create an RC-2
14:11:39 <amotoki> agreed. I confirmed that what I saw today is covered.
14:11:58 <slaweq> ok, I agree with this plan
14:12:21 <mlavalle> amotoki: don't worry about the release patch. I can take care of it
14:12:24 <amotoki> I am wondering whether we can mark fullstack as non-voting temporarily... to make the gate smoother.
14:12:54 <mlavalle> that's another alternative
14:13:03 <mlavalle> just to land these two patches
14:13:27 <mlavalle> what do others think?
14:13:48 <slaweq> fine for me
14:13:52 <njohnston> +1
14:13:56 <slaweq> but we will need to get this patch merged too :)
14:14:29 <mlavalle> let's give it a try this morning
14:14:36 <amotoki> the other way is to release RC1 as-is. I think the two pending patches can land in RC2. One is really a bug fix and the other (upgrade check) is super useful for operators.
14:15:30 <mlavalle> mhhhhh, yeah, it seems sensible to have an RC-2
14:15:44 <slaweq> yes, but this bug fix changes an existing API (not released yet), so if we release with the old (buggy) behaviour, we will probably need an API extension to make the change discoverable
14:15:56 <slaweq> but if it's in RC-2 I think it's still fine
14:16:14 <slaweq> just IMO we shouldn't release stable Stein without this bug fix :)
14:16:23 <mlavalle> I agree
14:17:00 <amotoki> IMHO such follow-up patches for new features are considered release critical :)
14:17:58 <mlavalle> so, are we saying that we don't cut RC-1 without https://review.openstack.org/#/c/641712?
14:18:24 <amotoki> yes
14:19:29 <mlavalle> ok
14:21:13 <mlavalle> so, here's what I propose:
14:21:35 <mlavalle> 1) Keep trying to get https://review.openstack.org/#/c/641712 through the gate
14:22:03 <mlavalle> 2) In parallel we create a patch to disable fullstack temporarily
14:22:14 <mlavalle> does that work?
14:22:42 <amotoki> it works for me.
14:22:42 <slaweq> sure, I will do this patch right now
14:22:50 <mlavalle> 3) https://review.openstack.org/#/c/643486 can be RC-2
14:23:29 <slaweq> IMHO yes, this can be in RC-2
14:23:34 <amotoki> +1 to 3) too
14:23:54 <mlavalle> any other thoughts?
14:24:50 <yamamoto> all of your 1-3 sound reasonable to me
14:24:51 <njohnston> seems like a sound plan to me
14:24:59 <amotoki> if we run out of time for RC1, https://review.openstack.org/#/c/641712 can be RC-2 too.
14:26:25 <mlavalle> smcginnis: you around?
14:26:32 <smcginnis> o/
14:26:56 <mlavalle> smcginnis: we are struggling to merge a bug fix: https://review.openstack.org/#/c/641712
14:27:06 <mlavalle> we don't want to cut RC-1 without it
14:27:15 <mlavalle> how much more time do we have?
14:27:34 <smcginnis> We'd really like to wrap up today, but probably Monday at the latest.
14:27:43 <smcginnis> Are there gate issues blocking that?
14:27:51 <mlavalle> yes, our gate
14:28:00 <slaweq> 755611
14:28:04 <slaweq> sorry
14:28:30 <smcginnis> Just the one patch left?
14:28:46 <mlavalle> smcginnis: yes, just that one patch
14:29:09 <smcginnis> OK, I guess just keep working on it, and we can cut the release as soon as it makes it through.
14:29:28 <mlavalle> fixes an API behavior so we don't want to cut without it
14:29:31 <smcginnis> If you know it will make the current state unusable as an RC1, we might as well wait a little longer.
14:29:57 <mlavalle> no, it's not that bad
14:30:10 <amotoki> but it is limited to the QoS feature...
14:30:14 <amotoki> an RC is really a release candidate, so I think we can fix it in RC2. We just happen to know about that bug now, but if it were found after RC1 the situation would not be so different.
14:30:18 <mlavalle> RC1 will be usable. It will just continue a behavior we want to prevent
14:30:44 <smcginnis> Oh, then yeah, I would cut RC1 and pick that up along with any translations and other critical bugs in RC2.
14:32:10 <mlavalle> smcginnis: ok, so we will give it a try during the morning. If by early afternoon US Central (your time and mine) it hasn't gone through, I'll cut RC-1. Does that work for you?
14:33:26 <smcginnis> mlavalle: Yep, that sounds like a good plan.
14:33:50 <mlavalle> smcginnis: cool, I'll keep you posted in the release channel. Thanks for the advice
14:34:47 <smcginnis> np
14:34:53 <slaweq> mlavalle: amotoki: patch to disable fullstack is ready https://review.openstack.org/645602
14:36:06 <mlavalle> slaweq: thanks. Just +2ed it
14:37:30 <mlavalle> #topic RFEs
14:37:37 <slaweq> according to Murphy's law, I bet that just after we cut RC1 both of those patches will merge :D
14:38:13 <mlavalle> slaweq: yeap, LOL
14:38:15 <amotoki> that would be true :p
14:39:15 <mlavalle> since doreilly is here, let's discuss again https://bugs.launchpad.net/neutron/+bug/1817022
14:39:17 <openstack> Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:39:37 <doreilly> I did some stress testing as suggested in the last meeting
14:39:49 <mlavalle> thank you for doing that
14:40:00 <doreilly> and found that a longer inactivity_probe can prevent InvalidDatapath errors
14:40:24 <doreilly> https://bugs.launchpad.net/neutron/+bug/1817022/comments/6
14:40:25 <openstack> Launchpad bug 1817022 in neutron "[RFE] set inactivity_probe and max_backoff for OVS bridge controller" [Wishlist,In progress] - Assigned to Darragh O'Reilly (darragh-oreilly)
14:41:08 <doreilly> But I don't really understand where InvalidDatapath error comes from
14:43:35 <yamamoto> it comes from ryu ofctl app
14:44:14 <doreilly> yamamoto: yeah I don't know about its internals :)
14:45:10 <mlavalle> the agent reports it as RuntimeError
14:45:56 <mlavalle> and inside that exception we see Datapath Invalid
14:46:41 <yamamoto> iirc InvalidDatapath usually means the switch disconnected
14:46:53 <mlavalle> https://ryu.readthedocs.io/en/latest/app/ofctl.html#module-ryu.app.ofctl.exception
14:47:26 <mlavalle> "This can happen when the bridge disconnects."
14:47:47 <doreilly> and this is in the stacktrace cookies = set([f.cookie for f in self.dump_flows()]) - maybe this takes a long time with 34k+ flows
14:47:57 <yamamoto> mlavalle: thank you. it seems my memory is working better than i expect.
14:48:05 <mlavalle> yamamoto: it is
14:48:09 <mlavalle> LOL
14:50:09 <yamamoto> doreilly: sounds plausible
14:51:26 <doreilly> so the question is do we need to make inactivity_probe configurable, or maybe just hardcode a higher default
14:51:39 <yamamoto> but i'm not sure why the switch performs the inactivity probe while it's busy sending flows
14:52:48 <doreilly> yamamoto: hmm right. But maybe python is parsing the text just downloaded from the switch
14:53:37 <mlavalle> doreilly: so what you are saying is that the slowness is on the Python side?
14:54:07 <yamamoto> doreilly: it might be. i don't remember where messages can be buffered.
14:54:23 <doreilly> mlavalle: right, the eventlet thread might be blocking
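(A minimal sketch, not Neutron code, only illustrating the hypothesis above: a CPU-bound eventlet greenthread that never yields starves every other greenthread, so a peer waiting on an inactivity probe sees no traffic until the work finishes; an explicit eventlet.sleep(0) yields control back. The "probe responder" and the flow-parsing stand-in are hypothetical.)

    # Illustration only: a busy greenthread that never yields keeps the
    # eventlet hub from running anything else, which is one way an OVS
    # inactivity probe could go unanswered. Both functions are stand-ins.
    import time

    import eventlet

    def probe_responder():
        # Pretend this greenthread must answer a probe every second.
        last = time.monotonic()
        for _ in range(5):
            eventlet.sleep(1)
            now = time.monotonic()
            print("probe answered after %.1fs" % (now - last))
            last = now

    def parse_flows(yield_every=0):
        # Stand-in for parsing ~34k dumped flows in pure Python.
        for i in range(10):
            time.sleep(0.5)            # CPU-bound work; does not yield to the hub
            if yield_every and i % yield_every == 0:
                eventlet.sleep(0)      # cooperative yield unblocks the responder

    responder = eventlet.spawn(probe_responder)
    eventlet.spawn(parse_flows).wait()  # spawn with yield_every=1 to see the gaps shrink
    responder.wait()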
14:55:24 <mlavalle> yamamoto: where would you dig?
14:55:52 * mlavalle would like to at least end the meeting with an action item
14:56:40 <yamamoto> i don't know. i'll read doreilly's comments in the bug and add some comments.
14:56:43 <yamamoto> maybe next week
14:56:56 <doreilly> okay thanks guys
14:57:21 <mlavalle> doreilly: I'll also try to dig a bit on it
14:57:31 <mlavalle> probably later today
14:57:46 <mlavalle> I'll report any ideas / findings in the bug
14:57:51 <doreilly> mlavalle: thks
14:58:27 <mlavalle> doreilly: so based on your latest findings, lengthening inactivity_probe is sufficient
14:58:35 <doreilly> yes
14:58:46 <mlavalle> we don't need to tweak max_backoff
14:58:52 <mlavalle> right?
14:58:57 <doreilly> i don't think so
14:59:08 <doreilly> it's only for reconnects
14:59:17 <mlavalle> ok
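(For reference, a minimal sketch, not the eventual Neutron patch, of what raising the probe timeout looks like at the OVS layer: each bridge's Controller record has an inactivity_probe column in milliseconds. The bridge name and the 30-second value below are only examples, not a decided default.)

    # Sketch only: bump the OpenFlow controller inactivity probe for one
    # bridge via ovs-vsctl. ovs-vsctl can address a Controller record by
    # its bridge name when the bridge has a single controller.
    import subprocess

    def set_inactivity_probe(bridge, probe_ms):
        subprocess.check_call(
            ["ovs-vsctl", "set", "controller", bridge,
             "inactivity_probe=%d" % probe_ms])

    set_inactivity_probe("br-int", 30000)  # example: 30 seconds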
14:59:24 <mlavalle> thanks for the update
14:59:30 <mlavalle> Have a nice weekend!
14:59:38 <doreilly> bye
14:59:42 <mlavalle> we'll get this RC-1 out of the door today!
14:59:48 <slaweq> thx
14:59:53 <mlavalle> #endmeeting