16:00:27 <ihrachys> #startmeeting neutron_ci
16:00:30 <mlavalle> o/
16:00:32 <openstack> Meeting started Tue Mar 28 16:00:27 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:35 * ihrachys waves at everyone
16:00:35 <openstack> The meeting name has been set to 'neutron_ci'
16:00:57 <manjeets_> o/
16:01:44 <jlibosva> o/
16:01:51 <ihrachys> let's review action items from the prev meeting
16:01:55 <ihrachys> aka shame on ihrachys
16:01:59 <jlibosva> and jlibosva
16:02:01 <ihrachys> "ihrachys fix e-r bot not reporting in irc channel"
16:02:29 <ihrachys> hasn't happened; I gotta track that in my trello I guess
16:02:33 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:02:38 <ihrachys> "ihrachys to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv"
16:02:44 <ihrachys> nope, hasn't happened
16:02:54 <ihrachys> will follow up on it after the meeting, it should not take much time :-x
16:03:19 <ihrachys> actually, mlavalle, maybe you could take that since you were going to track the dvr failure rate?
16:03:44 <ihrachys> that's a matter of editing grafana/neutron.yaml in project-config, not a huge task
16:03:49 <mlavalle> ihrachys: sure. I don't know how, but will find out
16:04:00 <ihrachys> that's a good learning opportunity then
16:04:05 <mlavalle> cool
16:04:08 <ihrachys> you can ask me for details in neutron channel
16:04:11 <ihrachys> and thanks
16:04:13 <mlavalle> will do
16:04:19 <ihrachys> #action mlavalle to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv
16:04:43 <mlavalle> I spent time looking at this
16:05:11 <mlavalle> all the failures I can see are due to hosts not being available for the tests
16:05:31 <mlavalle> or losing connection with the hypervisor
16:05:55 <mlavalle> the other failures I see are due to the patchsets in the check queue
16:06:23 <mlavalle> as a next step I'll be glad to talk to the infra team about this
16:06:23 <ihrachys> ok I see
16:06:41 <ihrachys> we may revisit that once we have data (grafana) back
16:07:00 <ihrachys> next was "jlibosva to figure out the plan for py3 gate transition and report back"
16:07:17 <jlibosva> didn't sync yet. Although it's quite important, I won't be able to make a plan before the next meeting as I'll be off most of the time. So I'm targeting now+2 weeks :)
16:08:09 <clarkb> mlavalle: yes please do ping us in -infra after the meeting if you can (I've been trying to get things under control failure wise want to make sure we aren't missing something)
16:08:26 <mlavalle> clarkb: will do
16:08:29 <ihrachys> ok let's punt py3 for now till jlibosva is back
16:09:00 <ihrachys> unless someone wants to take a stab at writing a proposal for py3 coverage in the gate
16:10:58 <ihrachys> ok
16:11:05 <ihrachys> #topic State of the Gate
16:11:10 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:12:01 <ihrachys> gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial is the only gate job that seems to show a high failure rate
16:12:07 <ihrachys> it's 8% right now
16:12:17 <ihrachys> anyone aware of what happens there?
16:12:39 * electrocucaracha is checking
16:12:59 <jlibosva> any chance it's still the echo from the spike before?
16:13:19 <ihrachys> I see this example: http://logs.openstack.org/17/412017/5/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/c2fab50/console.html#_2017-03-24_18_31_35_087822
16:13:28 <ihrachys> jlibosva: looks rather flat from grafana
16:13:36 <ihrachys> timeout?
16:14:37 <ihrachys> I don't see too many patches merged lately, could be a one-off
16:15:59 <ihrachys> #topic Fullstack voting progress
16:16:16 <ihrachys> jlibosva: fullstack is surely still at 100% failure rate, but are we making progress?
16:16:26 <ihrachys> do we have a grasp of all the failures there?
16:17:09 <jlibosva> re - linuxbridge - I found the latest failure: http://logs.openstack.org/71/450771/1/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/df0eaa2/console.html#_2017-03-28_14_05_34_665809
16:17:22 <jlibosva> ihrachys: there is still the patch for iptables firewall
16:17:41 <ihrachys> https://review.openstack.org/441353 ?
16:17:56 <jlibosva> yes, that one
16:18:07 <jlibosva> still probably a WIP
16:18:38 <ihrachys> I see test_securitygroup failing with it as in http://logs.openstack.org/53/441353/8/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/fe9f205/testr_results.html.gz
16:18:47 <ihrachys> does it suggest it's not solving it?
16:19:05 <jlibosva> it's probably introducing another regression
16:19:11 <ihrachys> :-)
16:19:17 <jlibosva> as it solves the iptables driver but breaks iptables_hybrid
16:19:24 <ihrachys> whack a mole
16:19:27 <jlibosva> they are closely related and both use the conntrack manager
16:20:08 <ihrachys> kevinbenton: fyi seems like we need the conntrack patch in to move forward with fullstack
16:20:21 <ihrachys> jlibosva: apart from this failure, anything pressing? or is it the last one?
16:20:28 <jlibosva> no, there are two others :)
16:20:31 <jlibosva> rather :(
16:20:46 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1673531 - introduced recently
16:20:46 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [Undecided,New]
16:21:02 <jlibosva> by merging tests for keeping data plane connectivity while the agent is restarted
16:21:47 <jlibosva> I also saw another failure in a trunk test where patch ports between tbr- and br-int are not cleaned up properly after the trunk is deleted.
16:22:01 <jlibosva> I haven't investigated that one and I don't think I've reported an LP bug yet
16:22:39 <ihrachys> I should probably raise the test_controller_timeout_does_not_break_connectivity_sigkill one at the upgrades meeting since it's directly related to upgrade scenarios
16:22:59 <jlibosva> It's unclear to me whether it's a fullstack issue or an agent issue
16:23:53 <ihrachys> http://logs.openstack.org/98/446598/1/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/2e0f93e/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_ovsdb-cli_/neutron-openvswitch-agent--2017-03-16--16-06-05-730632.txt.gz?level=TRACE
16:24:08 <ihrachys> looks like multiple agents trying to add the same manager?
16:24:28 <ihrachys> since we don't isolate ovs and we run two simulated hosts, maybe that's why
16:25:05 <ihrachys> gotta get otherwiseguy looking at it. the bug may be in the code that is now in ovsdbapp.
16:25:36 <jlibosva> that's weird
16:25:41 <jlibosva> it has vsctl ovsdb_interface
16:25:46 <jlibosva> I thought the manager is for native
16:25:55 <ihrachys> it's chicken and egg
16:26:06 <ihrachys> you can't do native before you register the port
16:26:17 <ihrachys> so if the connection fails, we call the CLI to add the manager port
16:26:21 <ihrachys> and then repeat native attempt
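(For context, the fallback ihrachys describes is roughly the following pattern. This is only a minimal sketch, not the actual agent code; connect_native() is a hypothetical callable standing in for the native OVSDB client, and the connection values are illustrative defaults.)

```python
import subprocess

# Values assumed for illustration; the agent reads them from cfg.CONF.OVS.
OVSDB_CONN = "tcp:127.0.0.1:6640"       # address the native client dials
MANAGER_TARGET = "ptcp:6640:127.0.0.1"  # passive listener ovsdb-server must expose

def connect_with_fallback(connect_native):
    """Try the native OVSDB connection first; if ovsdb-server is not
    listening yet, register the manager via the vsctl CLI and retry."""
    try:
        return connect_native(OVSDB_CONN)
    except Exception:
        # ovs-vsctl set-manager makes ovsdb-server listen on the TCP
        # port the native client needs -- the CLI step mentioned above.
        subprocess.check_call(
            ["ovs-vsctl", "--timeout=10", "set-manager", MANAGER_TARGET])
        return connect_native(OVSDB_CONN)
```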
16:26:24 <jlibosva> but there is no native whatsoever
16:26:49 <jlibosva> it's a vsctl test
16:27:17 <ihrachys> oh
16:27:36 <ihrachys> a reasonable question is then, why do we open the port
16:27:38 <ihrachys> right?
16:27:50 <jlibosva> but anyway, if it tries to create a new manager and it's already there, it shouldn't affect the functionality, right?
16:28:11 <ihrachys> depending on what the agent will do with the failure.
16:28:23 <ihrachys> not sure if the failure happens on this iteration or somewhere later
16:30:21 <ihrachys> yeah, seems like the failure happens 30sec+ after that error
16:30:27 <ihrachys> probably not directly related
16:31:04 <jlibosva> I'm looking at the code right now and the ovsdb monitor calls native.helpers.enable_connection_uri
16:32:06 <jlibosva> https://review.openstack.org/#/c/441447/
16:32:26 <ihrachys> yea, was actually looking for this exact patch
16:32:40 <jlibosva> but by that time the fullstack test wasn't in the tree yet
16:34:44 <ihrachys> oh so basically polling.py always passes cfg.CONF.OVS.ovsdb_connection
16:34:56 <ihrachys> and since it has default value, it always triggers the command
16:35:15 <ihrachys> I think there are several issues here. one is - we don't need that at all for vsctl
16:35:25 <ihrachys> another being - multiple calls may race
16:35:40 <ihrachys> neither is directly related to the fullstack failure
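(A rough sketch of the kind of code-level guard being suggested here, i.e. only registering the manager when the native interface is actually in use. This is not the real polling.py; the import path and option names are assumed from how the tree looked at the time.)

```python
from oslo_config import cfg

from neutron.agent.ovsdb.native import helpers  # path assumed, see note above

def maybe_enable_manager():
    """Skip the 'add manager' call entirely for the vsctl interface;
    only the native interface needs ovsdb-server listening on TCP."""
    if cfg.CONF.OVS.ovsdb_interface != 'native':
        return
    helpers.enable_connection_uri(cfg.CONF.OVS.ovsdb_connection)
```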
16:36:20 <ihrachys> #action ihrachys to report bugs for fullstack race in ovs agent when calling to enable_connection_uri
16:36:34 <jlibosva> we could hack fullstack to filelock the call
16:36:42 <ihrachys> I don't think that's correct
16:36:45 <jlibosva> to avoid races - it can't happen in the real world
16:37:11 <ihrachys> because we don't run multiple monitors?
16:37:20 <jlibosva> we don't run multiple ovs agents
16:38:32 <ihrachys> yeah seems like the only place we call the code path is in ovs agent
16:39:53 <ihrachys> I would still prefer a code-level fix for that, but it would work if we lock too
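(The locking workaround discussed above would look roughly like this: a sketch only, using oslo.concurrency's external file lock to serialize the concurrent 'add manager' calls made by fullstack agents that share one ovsdb-server. The wrapper name and lock path are hypothetical.)

```python
from oslo_concurrency import lockutils

from neutron.agent.ovsdb.native import helpers  # path assumed, as above

def enable_connection_uri_locked(conn_uri):
    """Serialize concurrent manager registration across agent processes
    with a file-based lock (fullstack-only workaround sketch)."""
    with lockutils.lock('ovsdb-set-manager', external=True, lock_path='/tmp'):
        helpers.enable_connection_uri(conn_uri)
```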
16:40:06 <jlibosva> the only other place where it's used is vsphere, in some dvs_neutron_agent ... http://codesearch.openstack.org/?q=get_polling_manager&i=nope&files=&repos=
16:40:12 <jlibosva> but dunno what that is
16:40:32 <ihrachys> this code looks like ovs agent copy-pasted :)
16:41:05 <ihrachys> but it doesn't seem like this code reimplements the agent
16:41:21 <ihrachys> the question would be whether the DVS agent can be used with OVS agent on the same node
16:42:34 * mlavalle has to step out
16:42:38 <ihrachys> ok let's move to the next topic
16:42:48 <ihrachys> #topic Functional job state
16:43:03 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:43:25 <ihrachys> there are still spikes in the recent past reaching up to 80%
16:43:36 <ihrachys> not sure what that was, I suspect some general gate breakage
16:43:43 <ihrachys> now it's at a reasonable 10%
16:43:53 <ihrachys> (note it's check queue so there may be valid breakages)
16:44:24 <ihrachys> of all the patches, I am aware of this fix for func test stability: https://review.openstack.org/#/c/446991/
16:44:37 <ihrachys> jlibosva: maybe you can have a look
16:45:00 <jlibosva> I will
16:45:19 <jlibosva> also note that for almost the whole previous week the rate was around 20%, which is still not ideal
16:46:28 <ihrachys> yeah. sadly I am consumed this week by misc stuff so won't be able to have a look.
16:48:25 <ihrachys> #topic Other gate failures
16:48:29 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:48:41 <jlibosva> we could monitor the trend for next week and we'll see
16:48:45 * ihrachys looks through the list to see if anything could benefit from review attention
16:49:10 <ihrachys> this patch may be interesting given the recent pecan switch: https://review.openstack.org/#/c/447781/
16:49:18 <ihrachys> but it needs dasanind to respin it with a test included
16:49:41 <ihrachys> I think the bug hits tempest sometimes.
16:50:46 <ihrachys> any bugs worth raising?
16:51:02 <ihrachys> oh there is one from tmorin with a fix here: https://review.openstack.org/#/c/450865/
16:51:12 <ihrachys> I haven't checked the fix yet, so I am not sure if it's the right thing
16:51:57 <manjeets_> he just enabled quotas explicitly and it worked
16:52:11 <manjeets_> need to check how the quota OVO change disrupted normal behavior
16:52:12 <ihrachys> I don't think the change we landed was intended to break subprojects ;)
16:52:19 <ihrachys> gotta find a fix on neutron side
16:52:47 <manjeets_> yea that would be the right fix
16:53:10 <ihrachys> ok let's move on
16:53:14 <ihrachys> #topic Open discussion
16:53:36 <ihrachys> https://review.openstack.org/#/c/439114/ from manjeets_ is still waiting for +W from infra
16:53:43 <ihrachys> I see Clark already +2d it, nice
16:54:00 <manjeets_> I asked clark yesterday for a review
16:54:16 <manjeets_> maybe we need to ping infra once more
16:54:35 <ihrachys> yeah, thanks for following up on it
16:54:51 <ihrachys> apart from that, anything CI related worth mentioning here?
16:55:27 <jlibosva> I noticed that qos is skipped in the api job
16:55:42 <jlibosva> e.g. http://logs.openstack.org/91/446991/2/check/gate-neutron-dsvm-api-ubuntu-xenial/044a331/testr_results.html.gz
16:55:49 <jlibosva> test_qos
16:55:50 <manjeets_> one question - I was looking at the functional tests
16:55:57 <manjeets_> I don't see much for qos
16:56:43 <manjeets_> I see trunk is covered in functional tests but not qos
16:57:09 <ihrachys> jlibosva: I think we had a skip somewhere there
16:57:22 <ihrachys> http://logs.openstack.org/91/446991/2/check/gate-neutron-dsvm-api-ubuntu-xenial/044a331/console.html#_2017-03-24_11_12_55_029613
16:57:36 <ihrachys> apparently the driver (ovs?) doesn't support it
16:57:41 <jlibosva> I dug into it a bit and found that settings from local.conf are not propagated to tempest.conf - but in this patch I see it works ... maybe it
16:57:53 <ihrachys> jlibosva: oh there was another thing related
16:57:55 <jlibosva> yeah, that's probably something other than what I saw - seems fixed by now
16:58:08 <ihrachys> https://review.openstack.org/#/c/449182/
16:58:20 <ihrachys> that should fix the issue with changes not propagated from hooks into tempest.conf
16:58:33 <ihrachys> so now we have 2 skips only, and they seem to be legit
16:58:37 <jlibosva> ihrachys: yeah, that's probably it :)
16:59:10 <jlibosva> it was weird to me though, as I actually saw crudini being called - anyway, it's solved. Thanks ihrachys :)
16:59:17 <ihrachys> np
16:59:31 <ihrachys> manjeets_: there are func tests for qos too
16:59:41 <ihrachys> manjeets_: I will give links in neutron channel since we are at top of the hour
16:59:45 <ihrachys> thanks everyone and keep it up!
16:59:47 <ihrachys> #endmeeting