16:02:11 <ihrachys> #startmeeting neutron_ci
16:02:12 <openstack> Meeting started Tue Feb 20 16:02:11 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:16 <openstack> The meeting name has been set to 'neutron_ci'
16:02:16 <ihrachys> sorry for late start
16:02:51 <ihrachys> I don't think we have jlibosva or haleyb today, both on pto
16:03:15 <ihrachys> waiting for at least some people to show up
16:03:47 <slaweq> ihrachys: hi
16:03:49 <slaweq> sorry for being late
16:03:51 <slaweq> and hello to all of you :)
16:03:58 <ihrachys> slaweq, hi, np I was late too. so far we are two.
16:04:08 <slaweq> ok
16:04:09 <ihrachys> Jakub and Brian are on PTO
16:04:18 <slaweq> I know about them
16:05:33 <ihrachys> ok, even though we are just two, let's do it (quickly)
16:05:34 <ihrachys> #topic Actions from prev meeting
16:05:40 <ihrachys> "ihrachys report test_get_service_by_service_and_host_name failure in periodic -pg- job"
16:06:00 <slaweq> ok
16:06:06 <ihrachys> I actually still haven't; I noticed that grafana periodic dashboard is broken (?) http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=14&fullscreen
16:06:18 <ihrachys> not showing data for legacy jobs
16:06:24 <ihrachys> I suspect it's because we renamed the jobs
16:06:33 <ihrachys> afair mlavalle was working on moving them into neutron repo
16:06:55 <slaweq> I think that patch for that was merged already
16:07:01 <ihrachys> here: https://review.openstack.org/#/q/topic:remove-neutron-periodic-jobs
16:07:22 <ihrachys> slaweq, right. but as it often happens we forgot about grafana
16:07:29 <slaweq> right
16:07:37 <ihrachys> #action ihrachys to update grafana periodic board with new names
16:08:33 <ihrachys> as for the original failure in -pg- I guess I should still follow up
16:08:42 <ihrachys> #action report test_get_service_by_service_and_host_name failure in periodic -pg- job
16:08:51 <ihrachys> next is "slaweq to backport ovsfw fixes to older stable branches"
16:09:00 <slaweq> so I checked those patches
16:09:09 <slaweq> and backport wasn't necessary in fact
16:09:22 <ihrachys> for neither?
16:09:31 <slaweq> yes
16:09:44 <ihrachys> so we know which patch broke it?
16:10:08 <slaweq> we are using an earlier version of ovs there and there is no issue with the crash dump there
16:10:54 <slaweq> and for the second patch I know which patch broke hard reboot of instances - that patch is not in Pike nor Ocata
16:11:18 <ihrachys> hm ok
16:11:52 <ihrachys> next was "slaweq to look at http://logs.openstack.org/53/539953/2/check/neutron-fullstack/55e3511/logs/testr_results.html.gz failures"
16:12:04 <slaweq> yes
16:12:18 <slaweq> so there are at least 4 different issues spotted in those tests
16:12:23 <ihrachys> that's https://bugs.launchpad.net/neutron/+bug/1673531 right?
16:12:23 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:12:42 <slaweq> it's one of them
16:13:09 <slaweq> no, wait
16:13:24 <slaweq> it is related but this one doesn't describe any specific reason
16:13:35 <slaweq> so I found and reported couple of issues:
16:13:38 <slaweq> 1. https://bugs.launchpad.net/neutron/+bug/1750337
16:13:38 <openstack> Launchpad bug 1750337 in neutron "Fullstack tests fail due to "block_until_boot" timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:13:59 <slaweq> and patch for this one is in review: https://review.openstack.org/546069
16:14:06 <slaweq> we discussed it yesterday
16:14:18 <slaweq> 2. https://bugs.launchpad.net/neutron/+bug/1728948
16:14:18 <openstack> Launchpad bug 1728948 in neutron "fullstack: test_connectivity fails due to dhclient crash" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:14:32 <slaweq> patch for this one is also ready https://review.openstack.org/#/c/545820/
16:14:44 <slaweq> 3. https://bugs.launchpad.net/neutron/+bug/1750334
16:14:45 <openstack> Launchpad bug 1750334 in neutron "ovsdb commands timeouts cause fullstack tests failures" [High,Confirmed]
16:15:20 <slaweq> for this one I don't know exactly how to fix it but maybe otherwiseguy can help?
16:15:57 <ihrachys> slaweq, can the timeout be legit because of high load you saw?
16:16:05 <slaweq> 4. sometimes tests fail for a "good" reason, i.e. networking is really interrupted, like e.g. http://logs.openstack.org/84/545584/1/check/neutron-fullstack/e49378a/logs/testr_results.html.gz
16:16:35 <slaweq> yes, it's possible that this high load causes timeouts in ovsdb commands too
16:17:01 <slaweq> but I really didn't have more time to check all those issues this week
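One knob that may be worth a try for the ovsdb timeout bug (purely a suggestion here, not something decided in the meeting) is the agent-side ovsdb command timeout, which defaults to 10 seconds and can be tight on a loaded test node:

    # openvswitch_agent.ini (or the fullstack-generated agent config)
    # Hedged suggestion only: raise the ovsdb command timeout under heavy load.
    [OVS]
    # default is 10; ovsdb commands that exceed it fail with a timeout error
    ovsdb_timeout = 30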
16:17:23 <ihrachys> slaweq, you mean "RuntimeError: Networking interrupted after controllers have vanished" is a red herring for some other issue that we already track in another place?
16:17:50 <ihrachys> slaweq, ok we're landing the workers patch and will see if it gets better.
16:17:58 <slaweq> yes
16:18:07 <ihrachys> great progress
16:18:24 <slaweq> and I will also try to dig more into these interrupted networking errors if they keep happening
16:18:37 <slaweq> but I think it's getting better and better :)
16:18:40 <ihrachys> next was "mlavalle to look into linuxbridge ssh timeout failures" but mlavalle is offline so we will just repeat it
16:18:42 <ihrachys> #action mlavalle to look into linuxbridge ssh timeout failures
16:19:01 <ihrachys> #topic Grafana
16:19:01 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:20:34 <slaweq> neutron-tempest-ovsfw is definitely much better than it was last week :)
16:21:04 <ihrachys> I was watching functional job the previous week and it was always at average of ~10-15% (spikes to 20%, dips to 0%)
16:21:35 <ihrachys> slaweq, yeah, and dvr scenarios are also pretty good now
16:21:44 <slaweq> :)
16:21:46 <ihrachys> not same for linuxbridge
16:22:04 <slaweq> linuxbridge and fullstack are still the worst ones
16:23:20 <ihrachys> yeah
16:23:29 <ihrachys> mlavalle, good day sir :)
16:23:38 <mlavalle> sorry, got distracted
16:23:50 <slaweq> hi mlavalle
16:23:54 <mlavalle> hi
16:23:56 <ihrachys> mlavalle, we were going through grafana
16:24:04 <mlavalle> ok cool
16:24:09 <ihrachys> mlavalle, one thing is functional, it seems rather stable, on par with unit tests
16:24:19 <mlavalle> that's great
16:24:21 <ihrachys> both show average failure in check queue around 10-15%
16:24:38 <ihrachys> well it's slightly lower for unit tests I guess
16:25:03 <ihrachys> but then one may wonder if it's because more patches legitimately break functional tests than unit tests
16:25:14 <ihrachys> the best validation would be comparing gates I guess
16:25:59 <ihrachys> functional mostly stays at 0% but we have a spike to 15% right now there.
16:26:19 <ihrachys> one complexity with using grafana to validate anything is that afaiu it captures results for all branches
16:26:42 <ihrachys> so if e.g. we stabilize functional in master but not stable/queens, and people post patches to the latter, it will show as failure in grafana
16:26:57 <mlavalle> ahh, right
16:27:02 <ihrachys> how do we deal with it
16:27:15 <ihrachys> maybe we should actually make the dashboard about master only
16:27:27 <mlavalle> I would say so
16:27:29 <ihrachys> (I hope graphite carries the data to distinguish)
16:27:55 <mlavalle> yeah, if the underlying platform carries the data, then we should strive for that
16:28:06 <ihrachys> ok, I will have a look
16:28:18 <ihrachys> #action ihrachys to update grafana boards to include master data only
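For context on that action item: a minimal grafyaml-style sketch of what a master-only query could look like, assuming the Zuul statsd metrics carry the branch as a segment of the graphite path; the exact metric path, tenant and job names below are guesses for illustration, not copied from the real neutron dashboard definition in project-config:

    # Hedged sketch only: metric path, tenant and job names are assumptions.
    dashboard:
      title: Neutron Failure Rate (master only)
      rows:
        - title: Functional Failure Rate (check queue, master)
          height: 320px
          panels:
            - title: neutron-functional
              type: graph
              targets:
                # asPercent(failures, failures + successes), restricted to the
                # ".master." branch segment so stable-branch runs are excluded
                - target: >-
                    alias(movingAverage(asPercent(
                      transformNull(stats_counts.zuul.tenant.openstack.pipeline.check.project.git_openstack_org.openstack_neutron.master.job.neutron-functional.FAILURE),
                      sum(stats_counts.zuul.tenant.openstack.pipeline.check.project.git_openstack_org.openstack_neutron.master.job.neutron-functional.{SUCCESS,FAILURE})),
                    '24hours'), 'neutron-functional (master)')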
16:28:51 <ihrachys> and I guess we postpone the decision on functional voting till next time, in 2 weeks
16:28:55 <slaweq> should we then also do a dashboard for stable branches?
16:29:07 <slaweq> or is it not necessary?
16:29:10 <ihrachys> slaweq, yeah and I was actually planning to do it for quite a while
16:29:20 <ihrachys> that's tangential though
16:29:31 <slaweq> ok :)
16:29:44 <slaweq> just asking to not forget about it :)
16:29:50 <ihrachys> at this point you should know when I plan something it doesn't happen
16:30:06 <slaweq> LOL
16:30:30 <ihrachys> I will add a new board in the patch
16:30:41 <slaweq> if you point me to where exactly it should be done I can do it
16:30:52 <ihrachys> nah I should do SOMETHING right?
16:31:02 <slaweq> ok, as You want :)
16:31:07 <mlavalle> you do more than enough
16:31:22 <mlavalle> and we are thankful
16:31:28 <slaweq> ++
16:32:03 <ihrachys> so, fullstack, we already went through it before mlavalle joined, but basically tl;dr is we land a bunch of slaweq's patches and will see if they fix other issues we may have with high load on the system
16:32:13 <ihrachys> hence skipping it now
16:32:19 <ihrachys> #topic Scenarios
16:32:19 <mlavalle> ok
16:32:24 <mlavalle> thanks for the summary
16:32:53 <ihrachys> with slaweq's patches for ovsfw we seem to be in great place for both dvr scenarios and ovsfw jobs now
16:33:04 <ihrachys> but not so much for linuxbridge
16:33:28 <ihrachys> mlavalle, afair you planned to have a look at ssh timeouts in the linuxbridge scenarios job
16:33:34 <ihrachys> have you got a chance to?
16:33:43 <mlavalle> I didn't have the time
16:33:53 <mlavalle> but I will try this week
16:34:39 <ihrachys> ok
16:34:49 <ihrachys> let's check if all failures are the same in the latest run
16:34:58 <mlavalle> ok
16:35:35 <ihrachys> http://logs.openstack.org/69/546069/1/check/neutron-tempest-plugin-scenario-linuxbridge/8561cce/logs/testr_results.html.gz
16:35:56 <ihrachys> yeah seems exactly the same 4 failures
16:36:11 <ihrachys> I guess once they are tackled, we'll have another green job
16:36:22 <slaweq> mlavalle: I can try to have a look at those issues as you are probably busy preparing for the PTG
16:36:47 <mlavalle> slaweq: if you have time on your hands, yes, please go ahead
16:37:04 <slaweq> sure, I will try to debug it
16:37:06 <ihrachys> speaking of green jobs, I suggest we also consider making the dvr scenarios job voting if it survives the next 2 weeks
16:37:09 <mlavalle> Thanks
16:37:22 <mlavalle> ihrachys: yeah
16:37:23 <ihrachys> and the same for ovsfw job
16:37:45 <mlavalle> it's early in the cycle, so this is the time to be aggressive with this type of thing
16:38:16 <slaweq> I agree
16:38:21 <ihrachys> ok. when we make it voting, do we make it voting in both queues?
16:38:38 <mlavalle> let's go with check
16:39:09 <ihrachys> ok
16:39:26 <ihrachys> we can revisit the gate if it proves to be stable with all the new jobs
16:39:41 <mlavalle> yeah
16:41:25 <ihrachys> #topic Bugs
16:41:29 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=status&start=0
16:41:35 <ihrachys> those are all bugs tagged with gate-failure
16:41:50 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1724253
16:41:50 <openstack> Launchpad bug 1724253 in BaGPipe "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [High,Confirmed] - Assigned to Thomas Morin (tmmorin-orange)
16:42:10 <ihrachys> this one is weird, in that it doesn't really seem to be a gate issue for neutron, and I suspect not for bagpipe either
16:42:28 <ihrachys> because they probably wouldn't have gate broken for months :)
16:43:29 <ihrachys> tmorin had this patch for neutron to make our scripts reusable for them: https://review.openstack.org/#/c/503280/
16:44:05 <ihrachys> not sure why the patch is not linked to the bug
16:44:10 <ihrachys> but I think it's related
16:44:18 * mlavalle removed -2
16:46:16 <ihrachys> I see fullstack failed in a weird way there though
16:46:31 <ihrachys> last thing I want is to break fullstack with this :)
16:46:55 <ihrachys> but apart from the failure that should be fixed, the patch seems innocent enough to help their project
* slaweq will cry if fullstack is totally broken again :)
16:47:20 <ihrachys> earlier I suggested the script is not part of neutron api and shouldn't be reused but maybe that's too pedantic and they know what they're doing
16:48:23 <ihrachys> regardless, I removed gate-failure tag from the bug since no gate is broken
16:48:56 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1744983
16:48:56 <openstack> Launchpad bug 1744983 in neutron "coverage job fails with timeout" [High,Confirmed]
16:49:05 <ihrachys> have we seen unit test / coverage job timeouts lately?
16:49:21 <ihrachys> we merged https://review.openstack.org/537016 that bumped the timeout for the job recently. did it help?
16:49:41 <mlavalle> I haven't seen it lately
16:49:44 <ihrachys> I can't recollect timeouts in last several weeks so maybe it's gone
16:49:45 <slaweq> I can't remember if I saw such timeouts lately
16:49:59 <ihrachys> ok let me close it; we can reopen if it resurfaces
16:51:12 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1660612
16:51:12 <openstack> Launchpad bug 1660612 in neutron "Tempest full jobs time out on execution" [High,Confirmed]
16:51:54 <ihrachys> not sure if this one still happens much. but back when I looked into it, it was because of low concurrency (1) for tempest scenarios
16:52:08 <ihrachys> (not our scenarios; tempest scenarios that are executed in -full jobs)
16:52:17 <ihrachys> I proposed this in tempest: https://review.openstack.org/#/c/536598/1/tox.ini
16:52:30 <ihrachys> but it seems like they are not entirely supportive of the change
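For reference, the change being discussed is about bumping scenario test concurrency in tempest's tox full env instead of running those tests serially. The snippet below is only an approximation of that kind of tweak, under the assumption that the env currently uses --serial for the scenario pass ('tempest run' does provide --serial and --concurrency flags); it is not the literal content of the proposed patch:

    ; tox.ini (tempest) - hedged sketch, not the real diff
    [testenv:full]
    ; api tests keep running at the default per-worker concurrency
    commands =
        tempest run --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.api)' {posargs}
        ; before: tempest run --combine --serial --regex '(^tempest\.scenario)' {posargs}
        tempest run --combine --concurrency 2 --regex '(^tempest\.scenario)' {posargs}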
16:54:19 <ihrachys> the -full jobs, they are still defined in our zuul config
16:54:35 <ihrachys> but the problem is that we don't control how we execute tox env from tempest
16:54:40 <ihrachys> it's devstack-gate that does it
16:55:24 <clarkb> in the new zuul tempest jobs it is tempest that does it fwiw
16:55:33 <clarkb> (via the in tree job config)
16:56:24 <ihrachys> clarkb, we have some neutron full jobs too. those trigger d-g for sure.
16:57:26 <clarkb> ya if they haven't been transitioned yet they will still hit d-g
16:57:52 <ihrachys> clarkb, but it's a good point, we also want to touch jobs coming from other repos
16:58:06 <ihrachys> clarkb, in tempest repo, I see run-tempest role used. where is it defined?
16:58:11 <ihrachys> codesearch doesn't help
16:58:45 <ihrachys> I mean here: http://git.openstack.org/cgit/openstack/tempest/tree/playbooks/devstack-tempest.yaml#n14
16:58:54 <clarkb> ihrachys: I think it may come from devstack/roles
16:59:05 <clarkb> (codesearch is probably failing because it's a dir name and not file content)
16:59:37 <ihrachys> clarkb, and devstack/roles is in which repo?
17:00:31 <clarkb> in devstack
17:00:35 <ihrachys> ack
17:00:37 <clarkb> sorry, devstack is the repo, roles is the dir
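For anyone following along, the mechanics are: Zuul resolves Ansible roles from the repo that holds the playbook plus any repos listed under roles: in the job definition, which is why a role such as run-tempest can be picked up from outside the playbook's own tree. A rough sketch of such a job definition follows; the repo names and parent job are assumptions for illustration, not a quote from tempest's .zuul.yaml:

    # Hedged sketch of a Zuul v3 job pulling roles from another repo.
    - job:
        name: devstack-tempest
        parent: devstack
        required-projects:
          - openstack/tempest
        roles:
          # roles are searched in this repo first, then in the repos listed here
          - zuul: openstack-dev/devstack
        run: playbooks/devstack-tempest.yaml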
17:00:46 <ihrachys> we are out of time. thanks for joining.
17:00:48 <ihrachys> #endmeeting