16:01:21 <slaweq> #startmeeting neutron_ci
16:01:22 <openstack> Meeting started Tue May  8 16:01:21 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:25 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <slaweq> welcome to the CI meeting :)
16:01:28 <mlavalle> o/
16:01:44 <mlavalle> the last one of my morning :-)
16:01:48 * slaweq has a busy day of meetings :)
16:02:06 <slaweq> mlavalle: same for me but last of my afternoon
16:02:12 <ihar> o/
16:02:18 <slaweq> hi ihar
16:02:22 <ihar> oh you guys stay for 3h? insane
16:02:24 <slaweq> haleyb will join?
16:02:26 <haleyb> hi
16:02:36 <slaweq> ihar: yes
16:02:46 <slaweq> I just finished QoS meeting and started this one
16:02:47 * haleyb takes an hour off for offline meetings :(
16:02:53 <slaweq> one by one :)
16:03:00 <slaweq> ok, lets start
16:03:07 <slaweq> #topic Actions from previous meetings
16:03:27 <slaweq> last week there was no meeting so let's check actions from 2 weeks ago
16:03:35 <slaweq> * slaweq will check failed SG fullstack test
16:03:46 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/1767829
16:03:47 <openstack> Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:04:16 <slaweq> I tried to reproduce it with some DNM patch with additional logs but I couldn't
16:04:32 <slaweq> I suppose that it is some race condition in conntrack manager module
16:04:58 <haleyb> slaweq: i hope not :(  but i will also try, and do some manual testing, since it's blocking that other patch
16:04:59 <slaweq> haleyb got a similar error on one of his patches I think, and there it was reproducible 100% of the time
16:06:10 <slaweq> ok, so haleyb You will check that on Your patch, right?
16:06:42 <haleyb> slaweq: yes, and i had tweaked the patch with what looked like a fix, but it still failed, so i'll continue
16:06:59 <slaweq> ok, thx
16:07:26 <slaweq> #action haleyb will debug failing security groups fullstack test: https://bugs.launchpad.net/neutron/+bug/1767829
16:07:28 <openstack> Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:07:38 <slaweq> next one is
16:07:39 <slaweq> * slaweq will report bug about failing trunk tests in dvr multinode scenario
16:07:48 <slaweq> Bug report is here: https://bugs.launchpad.net/neutron/+bug/1766701
16:07:49 <openstack> Launchpad bug 1766701 in neutron "Trunk Tests are failing often in dvr-multinode scenario job" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:08:26 <slaweq> I see that mlavalle is assigned to it
16:08:34 <slaweq> did You find something maybe?
16:08:46 <mlavalle> no, I haven't made progress on this one
16:09:13 <mlavalle> I will work on it this week
16:09:18 <slaweq> thx
16:09:43 <slaweq> #action mlavalle will check why trunk tests are failing in dvr multinode scenario
16:09:56 <mlavalle> thanks
16:10:01 <slaweq> ok, next one
16:10:04 <slaweq> * jlibosva will mark trunk scenario tests as unstable for now
16:10:15 <slaweq> he did: https://review.openstack.org/#/c/564026/
16:10:15 <mlavalle> I think he did
16:10:15 <patchbot> patch 564026 - neutron-tempest-plugin - Mark trunk tests as unstable (MERGED)
16:10:34 <slaweq> so mlavalle, be aware that those tests will no longer fail in new jobs
16:10:49 <slaweq> if You want to look for failures, You need to look for skipped tests :)
16:10:57 <mlavalle> right
16:11:10 <slaweq> next one was:
16:11:11 <slaweq> * slaweq to check rally timeouts and report a bug about that
16:11:21 <slaweq> I reported a bug: https://bugs.launchpad.net/neutron/+bug/1766703
16:11:22 <openstack> Launchpad bug 1766703 in neutron "Rally tests job is reaching job timeout often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:11:39 <slaweq> and I did some initial checks of logs
16:11:51 <slaweq> there are some comments in bug report
16:12:21 <slaweq> basically it doesn't look like we are very close to the limit on good runs
16:14:15 <slaweq> and also in neutron server logs I found that API calls on such bad runs are really slow, e.g. http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_23_23_632631
16:14:40 <slaweq> also there are a lot of errors in this log file: http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_14_59_003949
16:15:09 <slaweq> I don't think that this is the culprit of the slow responses because the same errors are in the logs of "good" runs
16:15:23 <slaweq> but are You aware of such errors?
16:15:34 <slaweq> ihar: especially You as it looks to be related to db :)
16:16:48 <ihar> I think savepoint deadlocks are common, as you said already
16:16:59 <ihar> I have no idea why tho
16:17:08 <slaweq> ok, just asking, thx for checking
16:17:28 <ihar> one reason for a slow db could be the limit on connections in oslo.db
16:17:36 <ihar> and/or wsgi
16:17:51 <ihar> neutron-server may just queue requests
16:17:51 <slaweq> ihar: thx for the tips, I will try to investigate these slow responses more during the week
16:18:16 <ihar> but I dunno, it depends on whether we see slowdowns in the middle of handlers or it just takes a long time to see the first messages of a request
16:19:09 <slaweq> problem is that if it reaches this timeout then there are none of those fancy rally graphs and tables to check everything
16:19:27 <slaweq> so it's not easy to compare with "good" runs
16:19:45 <slaweq> #action slaweq will continue debugging slow rally tests issue
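For reference, the connection limits ihar mentions live in the [database] section of neutron.conf and come from oslo.db; a minimal sketch with purely illustrative values (the gate may use different defaults):

    [database]
    # maximum number of SQLAlchemy connections kept open in the pool
    max_pool_size = 10
    # extra connections allowed beyond max_pool_size before callers have to wait
    max_overflow = 20
    # seconds a caller waits for a free connection before giving up
    pool_timeout = 30

If the pool is exhausted, API requests queue up waiting for a connection, which would show up as slow responses rather than errors.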
16:20:08 <slaweq> I think we can move on to the next one
16:20:11 <slaweq> * slaweq will report a bug and talk with Federico about issue with neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4
16:20:21 <slaweq> Bug report: https://bugs.launchpad.net/neutron/+bug/1766702
16:20:23 <openstack> Launchpad bug 1766702 in neutron "Periodic job * neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4 fails" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:47 <slaweq> dalvarez found that there was also another problem caused by Federico's patch
16:21:00 <slaweq> I already proposed a patch to fix what dalvarez found
16:21:22 <slaweq> and I'm working on fixing the neutron-dynamic-routing tests
16:21:40 <slaweq> the problem here is that tests now can't create two subnets with the same cidr
16:22:18 <slaweq> and the neutron-dynamic-routing scenario tests create a subnetpool and then a subnet "per test" but always use the same cidrs
16:22:34 <slaweq> so the first test passes but in the second one the subnet is not created as the cidr is already in use
16:23:13 <slaweq> haleyb posted some patch to check it but it wasn't a fix for this issue
16:23:24 <slaweq> so I will continue this work also
16:23:48 <slaweq> #action slaweq to fix neutron-dynamic-routing scenario tests bug
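For reference, a minimal sketch of the failing pattern described above, assuming openstacksdk and a clouds.yaml entry named "devstack"; all the names and the cidr are illustrative:

    import openstack
    from openstack import exceptions

    conn = openstack.connect(cloud="devstack")

    # one shared subnetpool, like the scenario tests create
    pool = conn.network.create_subnet_pool(
        name="bgp-test-pool", prefixes=["10.10.0.0/16"])
    net1 = conn.network.create_network(name="bgp-test-net-1")
    net2 = conn.network.create_network(name="bgp-test-net-2")

    # "first test": allocates a subnet from the pool with an explicit cidr
    conn.network.create_subnet(
        network_id=net1.id, ip_version=4, cidr="10.10.1.0/24",
        subnet_pool_id=pool.id)

    # "second test": reuses the same cidr from the same pool, which neutron
    # now rejects, so the second subnet is never created
    try:
        conn.network.create_subnet(
            network_id=net2.id, ip_version=4, cidr="10.10.1.0/24",
            subnet_pool_id=pool.id)
    except exceptions.HttpException as exc:
        print("second subnet rejected:", exc)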
16:24:00 <slaweq> ok, that's all from my list
16:24:07 <slaweq> do You have anything to add?
16:25:33 <slaweq> ok, so let's move on to next topic
16:25:46 <slaweq> #topic Grafana
16:25:51 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:27:22 <slaweq> Many tests had a high failure rate yesterday and today, but it might be related to some issue with devstack legacy jobs, see: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/latest.log.html#t2018-05-08T07:10:02
16:30:12 <slaweq> apart from that I think it's quite "normal" for most tests
16:30:38 <slaweq> do You see anything worrying?
16:32:49 <slaweq> ok, I guess we can move to next topics
16:32:54 <mlavalle> has the devstack issue been solved?
16:33:14 <slaweq> yes, frickler told me today it should have been solved last night
16:33:21 <mlavalle> ok
16:33:34 <slaweq> so a recheck of Your patch should work fine :)
16:33:49 <slaweq> moving on
16:33:50 <slaweq> #topic Scenarios
16:34:36 <slaweq> I see that the dvr-multinode job is still at about 20 to 40% failure rate
16:34:57 <slaweq> even after marking trunk tests and dvr migration tests as unstable
16:35:09 <slaweq> let's check some example failures then :)
16:35:13 <haleyb> :(
16:36:12 <slaweq> http://logs.openstack.org/24/558724/14/check/neutron-tempest-plugin-dvr-multinode-scenario/37dde27/logs/testr_results.html.gz
16:36:55 <haleyb> slaweq: Kernel panic - not syncing:
16:37:04 <haleyb> hah, that's not neutron!
16:37:09 <slaweq> haleyb: :)
16:37:25 <slaweq> I just copied the failure result without checking the reason
16:37:34 <slaweq> I'm looking for some other examples still
16:38:02 <slaweq> http://logs.openstack.org/27/534227/9/check/neutron-tempest-plugin-dvr-multinode-scenario/cff3380/logs/testr_results.html.gz
16:38:11 <slaweq> but this should already be marked as unstable, right haleyb?
16:38:45 <haleyb> this one or the last one?
16:39:03 <slaweq> last one
16:39:07 <slaweq> sorry
16:39:22 <slaweq> in the first one, if it was a kernel panic then it's "fine" for us :)
16:39:29 <mlavalle> LOL
16:39:51 <mlavalle> not our problem
16:40:11 <haleyb> i'm not sure the migration tests are marked unstable
16:40:12 <slaweq> mlavalle: I think we still have enough of our own problems ;)
16:40:18 <mlavalle> we do
16:40:32 <haleyb> both those failed metadata
16:40:35 <slaweq> haleyb: I thought that You did it a few weeks ago
16:41:12 <haleyb> https://review.openstack.org/#/c/561322/
16:41:13 <patchbot> patch 561322 - neutron-tempest-plugin - Mark DVR/HA migration tests unstable (MERGED)
16:41:18 <haleyb> looking
16:42:31 <haleyb> yes, it should have covered them
16:42:51 <slaweq> so maybe it was some old patch then
16:43:08 <slaweq> but we know this issue already so I think we don't need to talk about it now
16:43:19 <slaweq> thx for checking haleyb :)
16:43:33 <slaweq> in the meantime I found one more failed run:
16:43:33 <slaweq> http://logs.openstack.org/78/566178/5/check/neutron-tempest-plugin-dvr-multinode-scenario/92ef438/job-output.txt.gz#_2018-05-08_10_08_04_048287
16:43:43 <slaweq> and here the global job timeout was reached
16:46:06 <slaweq> looks like most tests took about 10 minutes
16:48:03 <slaweq> I think that if such issues keep repeating we will have to investigate what is slowing down those tests
16:48:07 <slaweq> do You agree? :)
16:48:55 <haleyb> yes, agreed
16:49:08 <slaweq> ok, thx haleyb :)
16:49:22 <slaweq> I didn't find other issues for this job from the last few days
16:49:48 <slaweq> regarding the other scenario jobs
16:50:03 <slaweq> I want to remind You that for some time now we have had 2 voting jobs:
16:50:09 <slaweq> neutron-tempest-plugin-scenario-linuxbridge
16:50:09 <slaweq> and
16:50:15 <slaweq> neutron-tempest-ovsfw
16:50:26 <slaweq> both are IMHO quite stable according to grafana
16:50:43 <slaweq> and I would like to ask if we can consider making them gating also
16:50:59 * mlavalle looking at grafana
16:51:02 <slaweq> what do You think?
16:51:46 <mlavalle> yes, let's give it a try
16:52:19 <slaweq> thx mlavalle for blessing :)
16:52:39 <slaweq> should I send the patch or do You want to do it?
16:54:14 <mlavalle> please send it
16:54:20 <slaweq> ok, I will
16:54:41 <slaweq> #action slaweq to make 2 scenario jobs gating
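For reference, making a job gating in zuul v3 roughly means listing it under the gate pipeline in addition to check in the project definition; a rough sketch only, since the exact file and queue settings in the neutron repos may differ:

    - project:
        check:
          jobs:
            - neutron-tempest-plugin-scenario-linuxbridge
            - neutron-tempest-ovsfw
        gate:
          jobs:
            - neutron-tempest-plugin-scenario-linuxbridge
            - neutron-tempest-ovsfw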
16:55:14 <slaweq> I don't have anything else regarding any of the job types for today
16:55:27 <slaweq> do You want to talk about something else?
16:55:41 <slaweq> if not we can finish a few minutes early :)
16:55:46 <mlavalle> I don't have anything else
16:56:52 <slaweq> ok, so enjoy Your free time then ;)
16:56:55 <slaweq> #endmeeting