16:00:34 <ihrachys> #startmeeting neutron_ci
16:00:35 <openstack> Meeting started Tue Aug 15 16:00:34 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <openstack> The meeting name has been set to 'neutron_ci'
16:00:39 <ihrachys> jlibosva, haleyb o/
16:00:41 <ihrachys> o/
16:00:41 <jlibosva> o/
16:01:18 <ihrachys> #topic Actions from prev week
16:01:25 <ihrachys> haleyb to reach out to all affected parties, and FBI, to get multinode grenade by default
16:01:59 <ihrachys> https://review.openstack.org/#/c/483600
16:02:09 <ihrachys> I see that we are going to wait for master to open
16:02:37 <ihrachys> also waiting for https://review.openstack.org/#/c/488381/ in devstack before making the multinode flavour voting
16:03:10 <ihrachys> next was "haleyb to look at why dvr grenade flavour surged to 30% failure rate comparing to 0% for other grenades"
16:03:21 <ihrachys> let's discuss that in grafana section
16:03:25 <ihrachys> these are all action items
16:03:29 <ihrachys> #topic Grafana
16:03:30 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:04:08 <jlibosva> how about "ihrachys to propose a removal for linuxbridge grenade multinode job" ? :)
16:04:43 <ihrachys> eh right!
16:04:48 <jlibosva> I see the line is gone, so was it done?
16:04:52 <ihrachys> manjeet was going to handle that instead of me, sec
16:05:08 <ihrachys> this: https://review.openstack.org/#/c/490993/
16:05:10 <ihrachys> so it's good
16:05:44 <ihrachys> back to grafana, I no longer see the dvr grenade flavour failure uptick; if anything, it's lower than the usual flavour
16:06:04 <ihrachys> looks like the gate is generally quite stable lately (?)
16:06:05 <jlibosva> ah, so it seems it should be moved in grafana too
16:06:15 <ihrachys> jlibosva, does it reflect your perception?
16:06:38 <ihrachys> jlibosva, or removed?
16:06:55 <jlibosva> yeah, rather removed
16:06:59 <ihrachys> ok I will
16:07:20 <ihrachys> so, speaking of general stability, grafana seems healthy. is it really back to normal? I haven't tracked the prev week.
16:07:21 <jlibosva> yeah, seems stable. also today at the team meeting there were no critical bugs mentioned
16:07:35 <jlibosva> there was a spike in one of the tempest dvr jobs 2 days ago
16:07:44 <jlibosva> up to ~60%
16:08:02 <ihrachys> check queue?
16:08:06 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=9&fullscreen
16:08:12 <jlibosva> yep, check queue - and all other tempest jobs are around ~20%?
16:08:28 <jlibosva> not sure I'd call that "stable" :)
16:08:35 <ihrachys> it may be the usual rate, since it tracks legit mistakes in patches
16:09:03 <ihrachys> worth comparing with gate: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
16:10:01 * jlibosva still loading
16:10:25 * ihrachys too
16:10:44 <ihrachys> grafana does seem to show me some artifacts instead of a proper chart
16:12:04 <ihrachys> ok I have it. 10-15% seems to be the average rate
16:12:11 <ihrachys> so yeah, it's not too stable
16:13:12 <ihrachys> what could we do about that? except maybe looking at reported bugs and encouraging people to not naked-recheck
16:13:52 <jlibosva> if I find a few minutes, I could inspect the neutron full job through logstash
16:14:51 <ihrachys> ok, let's go with that. note there are a lot of infra failures http://status.openstack.org/elastic-recheck/gate.html
16:15:22 <ihrachys> maybe worth clicking through those uncategorized: http://status.openstack.org/elastic-recheck/data/integrated_gate.html
16:16:28 <ihrachys> #action jlibosva to look through uncategorized/latest gate tempest failures (15% failure rate atm)
16:16:53 <ihrachys> apart from that, we have two major offenders: fullstack and scenarios being broken
16:17:21 <ihrachys> for fullstack, I started looking at l3ha failure that pops up but had little time to make it to completion; still planning to work on it
16:17:30 <ihrachys> for scenarios, I remember jlibosva sent email asking for help
16:17:40 <ihrachys> do we have reviewable results to chew for that one?
16:18:06 <jlibosva> no, doesn't seem to be popular: https://etherpad.openstack.org/p/neutron-dvr-multinode-scenario-gate-failures
16:19:04 <jlibosva> but there is a regression in ovs-fw after implementing conjunctions when tests use a lot of remote security groups
16:19:05 <ihrachys> we enabled qos tests; have they showed up since then?
16:19:19 <jlibosva> that causes random SSH denials from what I observed
16:19:24 * jlibosva looks for a bug
16:19:59 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1708092
16:20:00 <openstack> Launchpad bug 1708092 in neutron "ovsfw sometimes rejects legitimate traffic when multiple remote SG rules are in use" [Undecided,In progress] - Assigned to IWAMOTO Toshihiro (iwamoto)
16:21:27 <jlibosva> for QoS we have 13 failures in the last 7 days, but it's hard to judge as the failures might be caused by ^^
16:21:55 <ihrachys> hm. is it a serious regression? is revert possible?
16:22:41 <jlibosva> yes, revert is definitely possible
16:23:17 <ihrachys> what would the patch to revert be?
16:23:46 <jlibosva> https://review.openstack.org/#/c/333804/
16:25:07 <jlibosva> there was a followup patch to it
16:25:13 <ihrachys> wow, that's a huge piece to revert
16:25:19 <jlibosva> so that would need to be reverted first
16:25:24 <ihrachys> I guess we can really do it that late in Pike
16:25:31 <ihrachys> can't
16:25:33 <jlibosva> can or can't
16:25:56 <ihrachys> I see there is a patch for that but it's in conflict
16:26:14 <ihrachys> https://review.openstack.org/#/c/492404/
16:26:21 <jlibosva> and it's a wip
16:27:49 <ihrachys> I will ask in gerrit how close we are there
16:28:54 <ihrachys> ok let's switch to bugs
16:29:02 <ihrachys> #topic Gate failure bugs
16:29:03 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:29:17 <ihrachys> first is https://bugs.launchpad.net/neutron/+bug/1710589
16:29:19 <openstack> Launchpad bug 1710589 in neutron "rally sla failure / internal error on load" [High,Triaged]
16:30:05 <jlibosva> hmm, didn't kevin send a patch to reduce the number of created ports?
16:30:46 <jlibosva> ah no, that was for trunk subports
16:31:22 <ihrachys> https://review.openstack.org/492638 ?
16:31:27 <ihrachys> yeah not related
16:31:52 <ihrachys> it seems like staledataerror is raised over and over until retry limit is reached
16:34:10 <jlibosva> aaand Ihar is gone :)
16:34:18 <jlibosva> aaand Ihar is back :)
16:34:25 <ihrachys> sorry, lost connectivity
16:34:40 <ihrachys> last I saw in the channel was:
16:34:44 <ihrachys> <ihrachys> it seems like staledataerror is raised over and over until retry limit is reached
16:34:44 <ihrachys> <ihrachys> if it's just high contention, should the remedy be similar?
16:35:33 <jlibosva> the last message wasn't sent
16:35:39 <jlibosva> and I didn't write anything, was reading the bug
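For background on the retry behavior being discussed: the bug shows StaleDataError raised repeatedly under load until the retry budget runs out, at which point the API returns an internal error. A minimal generic sketch of that retry-on-StaleDataError pattern (plain Python for illustration only, not Neutron's actual decorator; the retry limit and backoff values are assumptions):

```python
import time

from sqlalchemy.orm.exc import StaleDataError

MAX_RETRIES = 10  # assumed limit, for illustration only


def run_with_retries(operation, *args, **kwargs):
    """Retry an operation that may hit StaleDataError under DB contention."""
    for attempt in range(MAX_RETRIES):
        try:
            return operation(*args, **kwargs)
        except StaleDataError:
            if attempt == MAX_RETRIES - 1:
                raise  # retry budget exhausted; surfaces as an internal error
            # back off briefly so concurrent writers can finish
            time.sleep(0.1 * (attempt + 1))
```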
16:36:36 <jlibosva> we don't have logstash for midokura gate and we don't see it on Neutron one, right?
16:37:10 <ihrachys> meh, my internet link is flaky
16:37:12 <jlibosva> we don't have logstash for midokura gate and we don't see it on Neutron one, right?
16:37:32 <ihrachys> is the scenario executed in our gate?
16:38:16 <ihrachys> I think we have a grafana board for rally, it would show there
16:38:22 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=11&fullscreen
16:38:47 <jlibosva> doesn't seem high
16:39:00 <jlibosva> oh, wrong timespan
16:39:38 <ihrachys> the bug was reported today
16:39:57 <ihrachys> and I don't see an issue in grafana for neutron (it's 5-10% rate)
16:40:26 <ihrachys> and also, those scenarios are pretty basic, they are executed
16:40:32 <ihrachys> maybe smth midonet gate specific
16:41:32 <jlibosva> sounds like that
16:41:44 <jlibosva> I'm sure Yamamoto will figure it out soon :)
16:42:08 <ihrachys> ok, I asked in LP for logstash and the diff in gate jobs; we'll see what the reply is
16:42:15 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1709869
16:42:16 <openstack> Launchpad bug 1709869 in neutron "test_convert_default_subnetpool_to_non_default fails: Subnet pool could not be found" [High,In progress] - Assigned to Itzik Brown (itzikb1)
16:42:26 <ihrachys> fix https://review.openstack.org/#/c/492522/
16:43:05 <jlibosva> the patch alone doesn't fix the test, I have another one: https://review.openstack.org/#/c/492653/
16:43:05 <ihrachys> is it a new test? why hasn't it failed?
16:43:10 <jlibosva> but the test is skipped
16:43:29 <jlibosva> cause devstack creates a default subnetpool so we have no way to test it
16:43:42 <jlibosva> maybe if the subnetpool is not necessary, we should change devstack
16:44:09 <jlibosva> the test has probably never worked, it was merged during the Pike dev cycle
16:44:23 <ihrachys> ack; both in gate now
16:44:31 <ihrachys> would be nice to follow up on gate setup
16:44:52 <ihrachys> do you want a task for that, or will we punt?
16:45:08 <jlibosva> gimme a task!
16:45:23 <jlibosva> I need some default subnetpool knowledge first though
16:46:08 <ihrachys> #action jlibosva to tweak gate not to create default subnetpool and enable test_convert_default_subnetpool_to_non_default
16:46:30 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1708030
16:46:31 <openstack> Launchpad bug 1708030 in neutron "netlink test_list_entries failing with mismatch" [High,In progress] - Assigned to Cuong Nguyen (cuongnv)
16:46:40 <ihrachys> seems like we have a fix here: https://review.openstack.org/#/c/489831/
16:47:42 <jlibosva> those tests are disabled for now
16:47:54 <jlibosva> because of kernel bug in ubuntu xenial
16:48:37 <ihrachys> yeah, we may want a follow up test patch to prove it works
16:48:55 <jlibosva> I checked this morning that the newly tagged kernel containing the fix hasn't been picked up by the gate yet
16:48:56 <ihrachys> ...but that won't work in gate
16:49:06 <ihrachys> oh there is a new kernel?
16:49:09 <jlibosva> yes
16:49:17 <jlibosva> it was tagged last friday
16:49:18 <ihrachys> cool. I imagine it's a question of days
16:49:34 <ihrachys> so we will procrastinate on the fix for now; I have a minor comment there anyway.
16:49:41 <jlibosva> I'm monitoring it; once I see we have a proper gate in place, I'll send a patch to enable those tests again
16:49:49 <ihrachys> ++
16:49:49 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707933
16:49:50 <openstack> Launchpad bug 1707933 in neutron "functional tests timeout after a test worker killed" [Critical,Confirmed]
16:49:56 <ihrachys> our old friend
16:50:36 <jlibosva> oh, it's in Ocata too
16:50:41 <ihrachys> I don't think we made progress. last time I thought about it, I wanted to mock os.kill but of course could not squeeze it in
16:50:47 <ihrachys> yeah, I saw it in ocata once
16:51:22 <ihrachys> maybe a backport, or external dep
16:52:49 <ihrachys> #action ihrachys to capture os.kill calls in func tests and see if any of those kill test threads
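A rough sketch of what that action item could look like: wrapping os.kill so every signal sent during a functional test run is logged together with its call stack (illustrative only; the log path and the way the patch gets applied are assumptions, not the actual change):

```python
import os
import traceback
from unittest import mock

_real_kill = os.kill


def _logging_kill(pid, sig):
    # record which code path sends which signal, then delegate to the real call
    with open('/tmp/os_kill_calls.log', 'a') as log:
        log.write('os.kill(%s, %s) called from:\n%s\n'
                  % (pid, sig, ''.join(traceback.format_stack())))
    return _real_kill(pid, sig)


# e.g. started in the functional test base class setUp() and stopped in cleanup:
#     patcher = mock.patch('os.kill', _logging_kill)
#     patcher.start()
#     self.addCleanup(patcher.stop)
kill_patcher = mock.patch('os.kill', _logging_kill)
```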
16:53:34 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1707003
16:53:35 <openstack> Launchpad bug 1707003 in neutron "gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:53:48 <ihrachys> I think this one is going to be solved by Sean Dague's fix for host discovery
16:53:51 <ihrachys> so we just wait
16:54:25 <jlibosva> one more thing about the functional tests getting killed
16:54:27 <jlibosva> http://logs.openstack.org/65/487065/5/check/gate-neutron-dsvm-functional-ubuntu-xenial/f9b22b8/logs/syslog.txt.gz#_Jul_31_08_04_37
16:54:48 <ihrachys> eh... is it the same time the kill happens?
16:54:53 <jlibosva> I wonder whether this could be related
16:54:57 <jlibosva> it's about one minute earlier
16:55:25 <jlibosva> well, it's about one minute earlier than the Killed output
16:56:12 <ihrachys> very interesting. I imagine parent talks to tester children via pipes.
16:57:24 <ihrachys> see https://github.com/moby/moby/issues/34472
16:57:36 <ihrachys> it's fresh, it's ubuntu, and it describes a child hanging
16:57:49 <jlibosva> but then why would only functional tests be affected and not all the others using forked workers?
16:58:52 <jlibosva> ah, so maybe the executor spawns other processes, like the rootwrap daemon, that get killed
16:59:52 <ihrachys> maybe. there are other google hits for a similar trace: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1702665
16:59:53 <openstack> Launchpad bug 1702665 in linux (Ubuntu) "4.4.0-83-generic + Docker + EC2 frequently crashes at cgroup_rmdir GPF" [High,Confirmed]
17:00:10 <ihrachys> ok time
17:00:14 <ihrachys> thanks for joining
17:00:19 <ihrachys> #endmeeting