16:00:42 <ihrachys> #startmeeting neutron_ci
16:00:46 <openstack> Meeting started Tue Jan 31 16:00:42 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:50 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <ihrachys> hello everyone, assuming there is anyone :)
16:01:32 <jlibosva> o/
16:01:37 <ihrachys> armax: kevinbenton: jlibosva: ding ding
16:02:39 <ihrachys> jlibosva: looks more like it's you and me :)
16:02:44 <jlibosva> :_
16:02:58 <jlibosva> ihrachys: what is the agenda?
16:03:32 <ihrachys> well I was going to present the initiative and go through the etherpad that we already have, listing patches up for review and such.
16:03:43 <ihrachys> and maybe later brainstorming on current issues
16:03:53 <jlibosva> ok, I looked at the etherpad and put some comments to the ovs failure
16:04:02 <jlibosva> yesterday
16:04:03 <ihrachys> if it's just me and you, it may not make much sense
16:04:23 <jlibosva> at least we would have meeting minutes
16:04:53 <jlibosva> if we come up with any action items
16:04:54 <ihrachys> right. ok. so, this is the first CI team meeting, spurred by the latest issues in the gate
16:05:09 <ihrachys> there was a discussion before on the issues that was captured in https://etherpad.openstack.org/p/neutron-upstream-ci
16:05:25 <ihrachys> we will use the etherpad to capture new details on gate problems in the future
16:05:46 <ihrachys> there were several things to follow up on, so let's walk through the list
16:06:53 <ihrachys> 1. checking all things around elastic-recheck, whether queries can target check queue and such. I am still to follow up on that with e-r cores, but it looks like they accepted a query targeting functional tests yesterday, so we hopefully should be able to classify func test failures.
16:07:22 <ihrachys> 2. "banning rechecks without bug number" again, I am to check with infra on that point
16:07:44 <ihrachys> 3. armax added func tests periodic job to grafana: https://review.openstack.org/#/c/426308/
16:08:04 <ihrachys> sadly, I don't see it showing up in periodic dashboard, see http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=4&fullscreen
16:08:28 <ihrachys> I see 3 trend lines while there are supposed to be 5 of those as per dashboard definition
16:09:02 <ihrachys> this will need a revisit I guess
16:09:05 <jlibosva> was there a run already?
16:09:26 <ihrachys> the patch was landed 32 hours ago, and it's supposed to trigger daily (?)
16:09:39 <ihrachys> I guess we can give it some time and see if it heals itself
16:09:59 <ihrachys> but the fact that another job in the dashboard does not show up too is suspicious
16:10:12 <jlibosva> oh, sorry :D I thought it's still 30th Jan today
16:10:31 <ihrachys> #action armax to make sure periodic functional test job shows up in grafana
16:10:49 <ihrachys> #action ihrachys to follow up on elastic-recheck with e-r cores
16:11:09 <ihrachys> #action ihrachys to follow up with infra on forbidding bare gerrit rechecks
16:11:54 <ihrachys> there is an action item on adding a CI deputy role, but I believe it's not critical and should be decided on by the next ptl
16:12:11 <jlibosva> agreed
16:12:22 <ihrachys> also, I was going to map all recent functional test failures, and I did (the report is line 21+ in the pad)
16:12:51 <ihrachys> the short story is that though different tests are failing, most of them turn out to be the same ovs native failure
16:13:33 <ihrachys> seems like bug 1627106 is our sole enemy right now in terms of func tests
16:13:33 <openstack> bug 1627106 in neutron "TimeoutException while executing test_post_commit_vswitchd_completed_no_failures" [High,In progress] https://launchpad.net/bugs/1627106 - Assigned to Miguel Angel Ajo (mangelajo)
16:14:08 <ihrachys> kevinbenton landed a related patch for the bug: https://review.openstack.org/426032 We will need to track some more to see if that fixes anything
16:14:25 <armax> ihrachys: I think it’s because it might not have a failure yet
16:14:42 <ihrachys> armax: wouldn't we see 0% failure rate?
16:14:47 <armax> ihrachys: it’s getting built here: http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:14:56 <armax> ihrachys: strangely I see that postgres is missing too
16:15:08 <armax> but I have yet to find the time to bug infra about seeing what is actually happening
16:15:09 <ihrachys> right
16:15:16 <ihrachys> aye, sure
16:15:24 <armax> but two builds so far
16:35:25 <armax> no errors
16:16:02 <ihrachys> back to ovsdb native, ajo also has a patch bumping ovs timeout: https://review.openstack.org/#/c/425623/ though afaiu otherwiseguy has reservations about the direction
16:16:33 <jlibosva> there are some interesting findings from this morning by iwamoto
16:16:39 <jlibosva> he claims the whole system freezes
16:16:54 <jlibosva> as dstat doesn't give any outputs by the time probe times out
16:17:03 <jlibosva> it's supposed to update every second
16:17:22 <ihrachys> vm progress locked by hypervisor?
16:18:17 <jlibosva> could be the reason why no one is able to reproduce it locally
16:18:39 <ihrachys> but do we see 10sec hangs?
16:18:42 <ihrachys> or shorter?
16:19:11 <jlibosva> let's have a look
16:19:36 <ihrachys> btw speaking of timeouts, another class of functional test failures that I saw in recent runs could be described as 'random tests failing with test case timeout', even those not touching ovs, like test_migration
16:20:05 <ihrachys> but the per-test-case timeout is a lot longer than the ovsdb 10 secs
16:21:01 <otherwiseguy> interesting.
16:21:29 <ihrachys> I see 5sec lock in dstat output that Iwamoto linked to
16:23:55 <ihrachys> interestingly, we see the functional job at ~10% failure rate at the moment, which is a drastic reduction from what we saw even on Friday
16:24:56 <ihrachys> not sure what could be the reason
16:27:11 <ihrachys> we don't have dstat in functional job, so it's hard to say if we see same hypervisor locks
16:27:24 <ihrachys> the logs that Iwamoto linked to are for neutron-full
16:27:56 <ihrachys> I will check if we can easily collect those in scope of functional tests
16:28:05 <jlibosva> interesting is that it didn't cause any harm in those tests
16:28:08 <ihrachys> #action ihrachys check if we can enable dstat logs for functional job
16:29:10 <ihrachys> otherwiseguy: so what's your take on bumping timeout for test env?
16:30:03 <otherwiseguy> ihrachys, it wouldn't hurt, but I have no idea if it would help.
16:31:03 <ihrachys> ok I guess it's worth a try then. though the latest reduction in failure rate may lower the severity of the issue and also make it harder to tell whether it's the fix that helps.
16:31:28 <ihrachys> otherwiseguy: apart from that, any other ideas how we could help debug or fix the situation from ovs library side?
16:32:49 <jlibosva> ihrachys: During xenial switch, I noticed ovs 2.6 is more prone to reproduce the issue
16:33:15 <otherwiseguy> ihrachys, right now I'm writing some scripts that spawn multiple processes and just create and delete a bunch of bridges. adding occasionally restarting the ovsdb-server, etc.
16:33:21 <jlibosva> so maybe having a patch that disables ovs compilation for functional tests and leaves the one that's packaged for ubuntu could improve the repro rate
16:33:24 <ihrachys> jlibosva: ovs python library 2.6, or openvswitch service 2.6?
16:33:33 <otherwiseguy> just trying to reproduce.
16:33:33 <jlibosva> ihrachys: service
16:33:49 <jlibosva> the ovsdb server itself probably
16:34:08 <ihrachys> jlibosva: improve rate as in 'raise' or as in 'lower'?
16:34:28 <jlibosva> ihrachys: raise :) so we can test patches or add more debug message etc
16:34:28 <ihrachys> just to understand, xenial is 2.5 or 2.6?
16:34:38 <jlibosva> IIRC it should be 2.6
16:34:42 <jlibosva> let me check
16:34:46 <ihrachys> oh and we compile 2.6.1?
16:35:37 * jlibosva is confused
16:36:34 <jlibosva> maybe it's vice-versa. 2.5 is worse and we compile 2.6.1
16:37:06 <jlibosva> yeah, so xenial contains packages 2.5 but we compile to 2.6.1 on xenial nodes
16:37:09 <ihrachys> ok, I guess it should not be hard to spin up the patch and see how it fails
16:37:32 <ihrachys> #action jlibosva to spin up a test-only patch to disable ovs compilation to improve reproduce rate
16:38:18 <jlibosva> done :)
16:38:32 <ihrachys> link
16:39:07 <jlibosva> ... some network issues with sending :-/
16:39:28 <ihrachys> nevermind, let's move on
16:39:42 <jlibosva> sure
16:39:52 <ihrachys> I mentioned several tests failing with test case timeouts before
16:40:06 <ihrachys> when they do, they fail with AttributeError on __str__ call for WaitTimeout
16:40:30 <ihrachys> there is a patch by tmorin to fix the error: https://review.openstack.org/#/c/425924/2
16:40:46 <ihrachys> while it won't fix the timeout root cause, it's still worth attention
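For context, the failure mode discussed above can be sketched in a few lines of Python; the class and attribute names here are illustrative only, not the actual neutron or tmorin patch code. The point is that when an exception's `__str__` itself references a missing attribute, rendering the exception raises AttributeError and masks the original timeout:

```python
class WaitTimeout(Exception):
    """Stand-in for a timeout exception; 'details' is a hypothetical attr."""

    def __str__(self):
        # Bug: self.details is never set, so rendering the exception
        # raises AttributeError instead of producing a useful message.
        return "Timed out: %s" % self.details


def render(exc):
    """Safely render an exception, falling back when __str__ is broken."""
    try:
        return str(exc)
    except AttributeError:
        return "<unprintable WaitTimeout>"


print(render(WaitTimeout()))  # falls back because __str__ is broken
```

A fix along these lines (making `__str__` rely only on attributes that are guaranteed to exist) removes the AttributeError but, as noted, does not address the underlying timeout.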
16:41:12 <jlibosva> yeah, gate is giving the patch a hard time
16:41:57 <ihrachys> closing the topic of func tests, I see jlibosva added https://bugs.launchpad.net/neutron/+bug/1659965 to the etherpad
16:41:57 <openstack> Launchpad bug 1659965 in neutron "test_get_root_helper_child_pid_returns_first_child gate failure" [Undecided,In progress] - Assigned to Jakub Libosvar (libosvar)
16:42:07 <ihrachys> jlibosva: is it some high impact failure?
16:42:18 <jlibosva> ihrachys: no, I don't think so
16:42:21 <ihrachys> or you just have the patch in place that would benefit from review attention
16:43:06 <jlibosva> I added it there as it's a legitimate functional failure. It's kinda new so I don't know how urgent it is
16:43:15 <jlibosva> the cause is pstree segfaulting
16:43:19 <ihrachys> ok, still seems like something to look at, thanks for pointing out
16:45:13 <ihrachys> that's it for functional tests. as for other jobs, we had oom-killers that we hoped to be fixed by the swappiness tweak: https://review.openstack.org/#/c/425961/
16:45:30 <ihrachys> ajo mentioned though we still see the problem happening in gate.
16:45:59 <jlibosva> :[
16:47:10 <ihrachys> yeah, I see that mentioned in https://bugs.launchpad.net/neutron/+bug/1656386 comments
16:47:10 <openstack> Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
16:48:14 <ihrachys> armax: I see the strawman patch proposing putting mysql on a diet was abandoned. was there any discussion before that?
16:48:39 <armax> ihrachys: not that I am aware
16:48:46 <ihrachys> :-o
16:48:54 <armax> ihrachys: we should check the openstack-qa channel
16:50:30 <ihrachys> I don't see anything relevant there, probably worth talking to Monty
16:50:57 <ihrachys> as for libvirtd malloc crashes, it's also not fixed, and I don't think we can help it
16:51:17 <jlibosva> we also have a new issue with linuxbridge job: https://bugs.launchpad.net/neutron/+bug/1660612
16:51:17 <openstack> Launchpad bug 1660612 in neutron "gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial times out on execution" [Undecided,New]
16:51:32 <jlibosva> the global timeout kills the test run as it runs more than an hour
16:52:31 <ihrachys> and how long does it generally take?
16:53:05 <jlibosva> I don't think we have an equivalent with ovs so it's hard to compare
16:53:22 <ihrachys> in another job, I see 40m for all tests
16:53:39 <ihrachys> could be a slowdown, hard to say. I guess we have a single data point?
16:54:05 <jlibosva> with successful linuxbridge job, the whole job takes around an hour
16:54:46 <jlibosva> so it's around 43 mins in a successful linuxbridge job
16:55:18 <ihrachys> weird, ok let's monitor and see if it shows more impact
16:55:39 <ihrachys> one final thing I want to touch base on before closing the meeting is gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv
16:55:40 <ihrachys> 100% failure rate
16:56:04 <ihrachys> jlibosva: do you know what happens there (seems like legit connectivity issues in some tests)?
16:56:18 <jlibosva> ihrachys: no, I haven't investigated it
16:56:24 <ihrachys> the trend seems to be 100% for almost a week
16:56:54 <ihrachys> I think it passed a while ago; we need to understand what broke and fix it, and have a plan to make it voting.
16:56:55 <jlibosva> ihrachys: yeah, I'm working on this one. SSH fails there but we don't collect console logs
16:57:26 <jlibosva> ihrachys: it might be related to the ubuntu image, as they update it on their site
16:57:34 <ihrachys> jlibosva: oh don't we? how come? isn't it controlled by generic devstack infra code?
16:57:48 <jlibosva> ihrachys: no, it's a tempest code
16:57:58 <jlibosva> ihrachys: and we have our own Neutron in-tree code
16:58:04 <ihrachys> jlibosva: don't we freeze a specific past version of the image?
16:58:05 <jlibosva> which doesn't have this capability
16:58:25 <jlibosva> ihrachys: that's the problem, they have 'current' dir and they don't store those with timestamps
16:58:38 <jlibosva> they store like maybe 4 latest but they get wiped eventually
16:58:43 <ihrachys> hm, then maybe we should store it somewhere ourselves?
16:58:59 <jlibosva> anyway, even when I fetch the same as in gate, the job passes on my environment
16:59:01 <jlibosva> classic
16:59:16 <ihrachys> #action jlibosva to explore what broke scenario job
16:59:17 <jlibosva> that would be best, but then we would need someone to maintain the storage
16:59:41 <ihrachys> jlibosva: well if it's one time update per cycle, it's not like huge deal
16:59:57 <ihrachys> ok thanks jlibosva for joining, I would feel lonely without you :)
17:00:06 <ihrachys> I hope next time we will have better presence
17:00:21 <ihrachys> if not maybe we will need to consider other time
17:00:25 <ihrachys> thanks again
17:00:27 <ihrachys> #endmeeting