16:00:33 <ihrachys> #startmeeting neutron_ci
16:00:34 <openstack> Meeting started Tue Jan  9 16:00:33 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <openstack> The meeting name has been set to 'neutron_ci'
16:00:42 <mlavalle> o/
16:00:48 <slaweq> hi
16:00:49 <ihrachys> heya
16:00:52 <jlibosva> o/
16:01:06 <ihrachys> I think I happily missed the last time we were supposed to have a meeting. sorry for that. :)
16:01:17 <ihrachys> post-holiday recovery is hard!
16:01:29 <ihrachys> #topic Actions from prev meeting
16:01:41 <ihrachys> first is "mlavalle to send patch(es) removing duplicate jobs from neutron gate"
16:01:58 <mlavalle> I did
16:02:02 <ihrachys> I believe it's https://review.openstack.org/530500 and https://review.openstack.org/531496
16:02:19 <ihrachys> and of course we should land the last bit adding new jobs to stable: https://review.openstack.org/531070
16:02:38 <mlavalle> yeap
16:03:19 <ihrachys> mlavalle, speaking of your concerns in https://review.openstack.org/#/c/531496/
16:03:31 <mlavalle> ok
16:03:34 <ihrachys> mlavalle, I would think that we replace old jobs with in-tree ones for those other projects
16:03:41 <mlavalle> ok
16:03:50 <mlavalle> will work on that
16:03:54 <ihrachys> you are too agreeable today! come on! :)
16:04:14 <mlavalle> you speak with reason, so why not :-)
16:04:15 <ihrachys> mlavalle, I am not saying that's the right path; I just *assume*
16:04:22 <ihrachys> still makes sense to check with infra first
16:04:29 <mlavalle> yes I will do that
16:04:43 <mlavalle> but we don't need to rush on that patch
16:04:55 <mlavalle> as long as 530500 lands
16:04:58 <mlavalle> soon
16:04:59 <ihrachys> mlavalle, the project-config one will already remove duplicates from gate for us?
16:05:20 <mlavalle> 530500 will remove the duplication
16:05:40 <ihrachys> ok good. yeah, we can leave the second one for infra's take.
16:05:45 <ihrachys> ok next item was "mlavalle to report back about result of drivers discussion on unified tempest plugin for all stadium projects"
16:06:00 <ihrachys> mlavalle, could you please give tl;dr of the discussion?
16:06:14 <mlavalle> we discussed it in the drivers meeting before the Holidays
16:06:44 <mlavalle> we will let the stadium projects join the unified tempest plugin or keep their tests in their repos
16:06:55 <mlavalle> we don't want to stretch them with more work
16:07:04 <mlavalle> sent a message to the ML last week
16:07:24 <mlavalle> and discussed it in the Neutron meeting last week
16:07:32 <mlavalle> was well received
16:07:36 <mlavalle> done
16:08:02 <ihrachys> ok cool. are there candidates who are willing to go through the exercise?
16:08:11 <mlavalle> I haven't heard
16:08:18 <mlavalle> back
16:08:24 <ihrachys> ok, fine by me!
16:08:34 <ihrachys> next is "frickler to post patch updating neutron grafana board to include designate scenario job"
16:08:40 <ihrachys> I believe this was merged https://review.openstack.org/#/c/529822/
16:08:56 <ihrachys> I just saw the job in grafana before the meeting, so it works
16:09:07 <ihrachys> next one is "jlibosva to close bug 1722644 and open a new one for trunk connectivity failures in dvr and linuxbridge scenario jobs"
16:09:08 <openstack> bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed] https://launchpad.net/bugs/1722644
16:09:23 <jlibosva> I did
16:09:27 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1740885
16:09:28 <openstack> Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:09:39 <ihrachys> the other bug doesn't seem closed
16:09:41 <jlibosva> fix is here: https://review.openstack.org/#/c/531414/
16:10:25 <jlibosva> oops, I just wrote a comment saying I'm gonna close it and I actually didn't
16:10:27 <jlibosva> closed now
16:10:48 <ihrachys> jlibosva, right. I started looking at the patch yesterday, and was wondering if the fix makes it ignore some updates from the server. can we miss an update and leave a port in old state with the fix?
16:11:48 <jlibosva> I guess so, the update stores e.g. the IP in case a new port is added to a remote security group. So we should still receive the update, but the implementation of the flow rules is deferred
16:12:50 <ihrachys> to clarify, you mean there is a potential race now?
16:13:22 <jlibosva> there has been a race and the 531414 fixes it
16:13:38 <jlibosva> it caches the new data and once the port initializes, it already has the correct data in cache
16:14:04 <ihrachys> oh I see, so there is cache involved so we should be good
16:14:39 <jlibosva> BUT
16:15:02 <jlibosva> I saw the ovsfw job failing, reporting the issue, so I still need to investigate why it fails
16:15:23 <jlibosva> I'm not able to reproduce it, I'll need to run tempest in a loop overnight and put a breakpoint once it fails so I'll be able to debug
16:15:33 <jlibosva> it's likely a different issue
16:15:55 <ihrachys> the ovsfw job was never too stable. but yeah, it's good to look into it.
16:16:07 <ihrachys> ok we have way forward here.
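The deferred-update pattern jlibosva describes boils down to something like the sketch below. This is an illustration with made-up names, not the actual code in the 531414 patch: updates arriving for a port whose flows are not set up yet are cached, and the cached data is applied once the port initializes.

    # Illustrative sketch only -- class and method names are hypothetical.
    class DeferredSecGroupUpdates(object):
        def __init__(self):
            self._initialized_ports = set()
            self._pending = {}  # port_id -> latest update payload

        def port_initialized(self, port_id, apply_update):
            """Called once the firewall has set up the port's flows."""
            self._initialized_ports.add(port_id)
            update = self._pending.pop(port_id, None)
            if update is not None:
                apply_update(port_id, update)  # replay the deferred update

        def security_group_updated(self, port_id, update, apply_update):
            if port_id in self._initialized_ports:
                apply_update(port_id, update)  # normal path: apply right away
            else:
                # port not initialized yet: cache the newest data (e.g. the IP
                # added to a remote security group) instead of failing
                self._pending[port_id] = update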
16:16:10 <ihrachys> next item was "jlibosva to disable trunk scenario connectivity tests"
16:16:39 <ihrachys> I believe it's https://review.openstack.org/530760
16:16:44 <ihrachys> and it's merged
16:16:47 <jlibosva> yeah
16:16:54 <jlibosva> so now next step should be to enable it back :)
16:17:25 <jlibosva> perhaps it should be part of 531414 patch
16:17:25 <ihrachys> sure
16:18:09 <ihrachys> yeah it makes sense to have it incorporated. we can recheck it a bunch before merging.
16:18:47 <ihrachys> ok next was "ihrachys to report sec group fullstack failure"
16:18:53 <ihrachys> I am ashamed but I forgot about it
16:19:08 * haleyb wanders in late
16:19:11 <ihrachys> I will follow up after the meeting
16:19:12 <mlavalle> the Holidays joy I guess
16:19:12 <ihrachys> #action ihrachys to report sec group fullstack failure
16:19:29 <ihrachys> next item was "slaweq to debug qos fullstack failure https://bugs.launchpad.net/neutron/+bug/1737892"
16:19:30 <openstack> Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:05 <slaweq> so i was debugging it for some time
16:20:15 <slaweq> and I have no idea why it is failing
16:20:52 <slaweq> I was even able to reproduce it locally, about once every 20-30 runs, but I don't know what is happening there
16:21:04 <slaweq> I will have to check it more still
16:21:23 <ihrachys> I see. "Also strange thing is that after test is finished, those records are deleted from ovs." <- don't we recycle ovsdb on each run?
16:21:24 <slaweq> but maybe for now I will propose patch to mark this test as unstable
16:21:50 <slaweq> ihrachys: maybe, but I'm not sure about that
16:21:52 <ihrachys> ah wait, we didn't have it isolated properly yet
16:22:29 <slaweq> no, it's the same ovsdb for all tests IMO
16:22:53 <ihrachys> so if I read it right, ovsdb command is executed but structures are still in place?
16:22:59 <slaweq> yes
16:23:28 <jlibosva> could it be related to the ovsdb issue we have been hitting? that commands return TRY_AGAIN and never succeed?
16:23:29 <slaweq> and the ovsdb commands finished successfully every time when I checked locally
16:23:32 <jlibosva> or they timeout?
16:23:36 <jlibosva> ok
16:23:46 <slaweq> no, there wasn't any retry or timeout on it
16:23:49 <jlibosva> it doesn't sound like it then
16:24:00 <slaweq> the transactions always finished fine
16:24:09 <ihrachys> could it be that something recreates them after destroy?
16:24:46 <slaweq> ihrachys: I don't think so, because I was also watching them locally with the watch command and they didn't flap or anything like that
16:25:18 <slaweq> but I will check it once again maybe
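The local checking slaweq describes can be done with a small polling helper like the one below. The ovs-vsctl command and its --format=json output are real, but the wrapper itself is just a hypothetical debugging aid, not part of the fullstack test:

    # Hypothetical debugging aid: poll the OVS QoS table until it is empty or
    # a timeout expires, to see whether the rows really get removed.
    import json
    import subprocess
    import time

    def wait_for_empty_qos(timeout=60, interval=2):
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ["ovs-vsctl", "--format=json", "list", "QoS"])
            if not json.loads(out)["data"]:
                return True  # no QoS rows left
            time.sleep(interval)
        return False  # rows still present after the timeout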
16:26:11 <ihrachys> ok. I think disabling the test is a good idea while we are looking into it.
16:26:27 <slaweq> ihrachys: ok, so I will do such patch today
16:26:30 <ihrachys> and those were all action items we had
16:26:38 <ihrachys> #topic Tempest plugin
16:26:48 <ihrachys> I think we are mostly done with tempest plugin per se
16:27:04 <ihrachys> though there are some tempest bits in-tree that require stadium projects to switch to the new repo first
16:27:25 <ihrachys> and those were not moving quickly enough
16:27:47 <ihrachys> f.e. vpnaas: https://review.openstack.org/#/c/521341/
16:28:02 <ihrachys> I recollect that mlavalle was going to talk to their representatives about moving those patches forward
16:28:08 <ihrachys> mlavalle, any news on this front?
16:28:29 <mlavalle> well, now it is my turn to say that I forgot about that
16:28:53 <mlavalle> I will follow up this week
16:28:58 <mlavalle> sorry :-(
16:29:21 <ihrachys> #action mlavalle to follow up with stadium projects on switching to new tempest repo
16:29:27 <ihrachys> that's fine it happens :)
16:29:41 <ihrachys> the list was at https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move line 27+
16:29:49 <mlavalle> ok
16:29:56 <ihrachys> afaiu we should cover vpnaas, dynamic-routing and midonet
16:30:07 <mlavalle> cool
16:30:55 <ihrachys> afair there were concerns about how to install the new repo, and a devstack plugin was added to neutron-tempest-plugin, so stadium projects should consume it
16:31:09 <ihrachys> #topic Grafana
16:31:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:25 <ihrachys> we actually have some good news there
16:31:44 <ihrachys> the linuxbridge scenario job seems to be in decent shape after the recently disabled tests and the security group fix
16:31:55 <ihrachys> it's at ~10% right now
16:32:08 <ihrachys> which I would say is the regular failure rate one could expect from a tempest job
16:32:50 <mlavalle> ++
16:32:51 <ihrachys> so that's good. we should monitor and eventually make it voting if it keeps that level.
16:32:53 <slaweq> nice :)
16:33:34 <jlibosva> kudos to those who fixed it :)
16:33:36 <ihrachys> the dvr flavor is not as great, though it's also down from the ~90% level it stayed at for months
16:33:48 <ihrachys> currently at ~40% on my chart
16:34:09 <ihrachys> and fullstack is in the same bad shape. so mixed news, but definitely progress on the scenario side.
16:34:20 <ihrachys> kudos to everyone who was and is pushing those forward
16:34:44 <jlibosva> I see functional is now ~10%; yesterday EMEA morning it was up to 50%
16:35:15 <ihrachys> I saw a lot of weird errors in gate yesterday
16:35:22 <ihrachys> RETRY_TIMEOUTS and stuff
16:35:32 <ihrachys> could be that the gate was just unstable in general
16:35:49 <ihrachys> but we'll keep an eye
16:36:10 <ihrachys> f.e. I see a similar spike for unit tests around same time
16:36:19 <ihrachys> #topic Scenarios
16:36:35 <ihrachys> so since linuxbridge seems good now, let's have a look at the latest failure for dvr
16:37:35 <ihrachys> ok I took http://logs.openstack.org/98/513398/10/check/neutron-tempest-plugin-dvr-multinode-scenario/568c685/job-output.txt.gz
16:37:41 <ihrachys> but it seems like a timeout for the job
16:37:47 <ihrachys> took almost 3h and failed
16:38:39 <ihrachys> here is another one: http://logs.openstack.org/51/529551/4/check/neutron-tempest-plugin-dvr-multinode-scenario/ee927b8/job-output.txt.gz
16:38:48 <ihrachys> same story
16:39:05 <jlibosva> perhaps we should try now to increase the concurrency? let me check the load
16:39:05 <ihrachys> so from what I see, it either times out or it passes
16:39:24 <ihrachys> I suspect it may be related to Meltdown
16:39:40 <ihrachys> we were looking yesterday in neutron channel at slowdown for rally scenarios
16:39:54 <haleyb> https://bugs.launchpad.net/neutron/+bug/1717302 still isn't fixed either
16:39:55 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:39:57 <ihrachys> x2-x3 slowdown for some scenarios
16:40:13 <ihrachys> and figured it depends on the cloud and whether it is patched...
16:41:14 <jlibosva> haleyb: don't we skip the fip test cases?
16:41:31 <jlibosva> I mean that they are tagged as unstable, so skipped if they fail
16:41:33 <haleyb> oh yeah, didn't see a link in the bug
16:41:41 <ihrachys> jlibosva, we do skip them, yes
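For context, "tagged as unstable" means a decorator that turns a failure of a known-flaky test into a skip. A simplified illustration of the idea (not the exact decorator used in the neutron tree):

    # Simplified illustration: a failure in a known-flaky test is reported as
    # a skip instead of failing the whole job.
    import functools

    def unstable_test(reason):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as exc:
                    self.skipTest("Unstable test skipped: %s (%s)" % (reason, exc))
            return wrapper
        return decorator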
16:42:07 <ihrachys> jlibosva, you mentioned concurrency. currently we seem to run with =2
16:42:24 <ihrachys> would the suggestion be to run with eg. number of cpus?
16:42:35 <jlibosva> just remove the concurrency and let the runner decide
16:42:40 <ihrachys> I see most scenarios already taking like 5 minutes+ each in a good run
16:42:52 <jlibosva> I remember I added the concurrency there because we thought the server gets choked on slow machines
16:43:04 <ihrachys> I suspect there may be hardcoded timeouts in some of the scenarios we have, like waiting for a resource to come back
16:43:26 <slaweq> but wouldn't it be a problem with e.g. memory consumption if you have more threads?
16:43:31 <slaweq> (just asking)
16:43:46 <ihrachys> slaweq, there may be. especially where scenarios spin up instances.
16:44:11 <ihrachys> I guess it's easier to post a patch and recheck it to death
16:44:18 <ihrachys> and see how it fares
16:44:32 <ihrachys> I can have a look at it. need to report a bug for timeouts too.
16:44:53 <ihrachys> #action ihrachys to report bug for dvr scenario job timeouts and try concurrency increase
16:45:10 <ihrachys> it's interesting that linuxbridge doesn't seem to trigger it.
16:45:20 <ihrachys> is it because maybe more tests are executed for dvr?
16:46:18 <ihrachys> it's 36 tests in dvr and 28 in linuxbridge
16:47:03 <ihrachys> some of those are dvr migration tests so that's good
16:47:23 <ihrachys> I also noticed that NetworkMtuBaseTest is not executed for linuxbridge because apparently we assume the gre type driver is enabled, and it's not supported by linuxbridge
16:47:42 <ihrachys> we can probably configure the plugin for vxlan and have it executed for linuxbridge too then
16:48:06 <jlibosva> makes sense, let's make linuxbridge suffer too :)
16:48:25 <ihrachys> actually, it's 31 vs 21 tests
16:48:32 <ihrachys> I read the output incorrectly before
16:48:47 <ihrachys> jlibosva, huh. well it shouldn't affect much and would add coverage
16:49:00 <jlibosva> I totally agree :)
16:49:03 <ihrachys> I will report a bug for that at least. it's not critical to fix it right away.
16:49:18 <ihrachys> #action ihrachys to report bug for mtu scenario not executed for linuxbridge job
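The guard behind the MTU case discussed above boils down to a check like the sketch below; the function and parameter names are assumptions for illustration, not the real neutron-tempest-plugin code. Requiring vxlan (supported by both OVS and linuxbridge) instead of gre would let the linuxbridge job run the test too.

    # Illustrative only -- names are hypothetical.
    def should_skip_mtu_test(configured_type_drivers, required=("vxlan",)):
        """Skip the MTU scenario when no required ML2 type driver is configured."""
        return not set(required) & set(configured_type_drivers)

    # e.g. should_skip_mtu_test(["vlan", "vxlan"]) -> False (test runs)
    #      should_skip_mtu_test(["vlan", "vxlan"], required=("gre",)) -> True (skipped)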
16:50:03 <ihrachys> anyway, the diff in total time spent for tests is ~3k seconds
16:50:35 <ihrachys> which is like 50 minutes?
16:50:57 <ihrachys> that's kinda significant
16:51:10 <jlibosva> that could be it, average test is like 390 seconds? so if you add 10 more, with concurrency two, it adds almost 30 minutes
16:51:34 <ihrachys> yeah. and linuxbridge is already 2h:20m in good run
16:51:39 <jlibosva> with bad luck using some slow cloud, it could start hitting timeouts too. That's why I think it's good to make linuxbridge suffer too, to confirm our theory :)
16:51:50 <ihrachys> add 30-50 on top and you have timeout triggered
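Spelling out the back-of-the-envelope math with the numbers quoted above (10 extra tests, ~390 s per test, concurrency 2, a ~2h20m good linuxbridge run):

    # Rough wall-clock estimate for the dvr job using the numbers from the log.
    extra_tests = 10           # ~31 tests in the dvr job vs ~21 in linuxbridge
    avg_test_seconds = 390     # rough average per scenario test
    concurrency = 2            # current worker count in the job

    extra_minutes = extra_tests * avg_test_seconds / concurrency / 60.0
    baseline_minutes = 2 * 60 + 20   # linuxbridge good run: ~2h20m

    print(extra_minutes)                     # ~32.5 extra minutes
    print(baseline_minutes + extra_minutes)  # ~172 minutes, close to the ~3h job timeout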
16:52:05 <ihrachys> ok anyway, we have way forward. let's switch topics
16:52:08 <ihrachys> #topic Fullstack
16:52:20 <jlibosva> I remember in the tempest repo we have the slow tag, so maybe we will need to start using it in the future
16:52:37 <ihrachys> jlibosva, yeah but ideally we still want to have all those executed in some way in gate
16:52:50 <ihrachys> so the best you can do is then split the job into pieces
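The slow tag jlibosva refers to is tempest's attr decorator; a test marked like the sketch below is tagged 'slow', and job definitions can then include or exclude it via their test-selection regex (the test class here is illustrative, not an actual neutron test):

    # Marking a test as slow with tempest's attr decorator.
    import testtools

    from tempest.lib import decorators

    class ExampleScenarioTest(testtools.TestCase):
        @decorators.attr(type='slow')
        def test_long_running_scenario(self):
            # a long-running scenario would live here
            pass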
16:53:00 <ihrachys> ok fullstack. same exercise, looking at latest runs.
16:53:11 <jlibosva> yeah, looking at times of particular tests, they all are slow
16:53:13 <ihrachys> example: http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:53:28 <ihrachys> this is I believe the failure I should have reported but failed to.
16:53:54 <ihrachys> jlibosva, afaiu we don't have accelerated virtualization in infra clouds. so starting an instance takes ages.
16:54:12 <ihrachys> let's see if other fullstack runs are same
16:54:28 <ihrachys> ok this one is different: http://logs.openstack.org/98/513398/10/check/neutron-fullstack/b62a726/logs/testr_results.html.gz
16:54:48 <ihrachys> our old friend "Commands [...] exceeded timeout 10 seconds"
16:55:13 <haleyb> not a very nice friend
16:55:18 <ihrachys> I recollect jlibosva reported https://bugs.launchpad.net/neutron/+bug/1741889 lately
16:55:19 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [Critical,New]
16:55:19 <mlavalle> nope
16:55:38 <jlibosva> I thought it was more severe, as by the time I reported it the functional job seemed busted
16:55:41 <jlibosva> so it's not that hot anymore
16:56:25 <mlavalle> you mean 1741889?
16:56:29 <jlibosva> yeah
16:56:50 <mlavalle> should we lower the importance then?
16:56:58 <jlibosva> possibly
16:56:59 <ihrachys> jlibosva, do you suggest it may be intermittent and the same issue, and we may no longer experience it?
16:57:48 <jlibosva> well, I mean we talked about it before, you said UT had a peak at about that time, so the bug might not be the reason for the functional failure rate peak
16:58:14 <jlibosva> although it's not a nice bug, as it keeps coming back at weird intervals, like twice a year :), and then it goes away
16:58:21 <ihrachys> ok I lowered it to High for now
16:58:27 <ihrachys> and added details about fullstack
16:58:36 <jlibosva> could be related to slow hw used
16:58:37 <jlibosva> dunno
16:58:38 <mlavalle> thanks
16:58:40 <jlibosva> just thinking out loud
16:58:49 <ihrachys> yeah I think it's ok to just monitor
16:59:01 <ihrachys> that other sec group failure seems more important to have a look at.
16:59:14 <ihrachys> and we have some work to do for scenarios anyway till next week
16:59:21 <slaweq> I can check this failure with SG
16:59:38 <ihrachys> slaweq, if you like, I would be grateful of course.
16:59:42 <slaweq> sure
16:59:50 <ihrachys> have some other pressing things on my plate so I don't insist :)
16:59:53 <ihrachys> cool!
17:00:09 <ihrachys> #action slaweq to take over sec group failure in fullstack (report bug / triage / fix)
17:00:14 <ihrachys> we are out of time
17:00:31 <slaweq> just to be sure, we are talking about an issue like in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
17:00:33 <slaweq> right?
17:00:40 <ihrachys> slaweq, yes
17:00:41 <slaweq> ok
17:00:44 <slaweq> thx
17:01:07 <ihrachys> ok. thanks everyone for being so helpful and showing initiative. it's good it's just me slacking here, otherwise we wouldn't achieve all we did.
17:01:11 <ihrachys> #endmeeting