16:00:35 <slaweq> #startmeeting neutron_ci
16:00:36 <openstack> Meeting started Tue Mar 27 16:00:35 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:37 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <slaweq> hi
16:00:39 <openstack> The meeting name has been set to 'neutron_ci'
16:01:04 <mlavalle> o/
16:01:21 <ihrachys> o/
16:01:23 <haleyb> hi
16:01:48 <jlibosva> late o/
16:01:57 <slaweq> it's my first time as chair of this meeting so please be tolerant with me :)
16:02:01 <slaweq> let's go
16:02:13 <slaweq> #topic Actions from prev meeting
16:02:29 <slaweq> jlibosva to take a look at the dvr trunk tests issue: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:03:02 <jlibosva> I spent a fair amount of time digging into it but I haven't been able to root cause it. I can see some errors in the ovs firewall but it recovers from those
16:03:02 <slaweq> jlibosva: any updates?
16:03:07 <slaweq> sorry :)
16:03:29 <jlibosva> I'll keep digging into it. I might need to send some additional debug patches
16:03:32 <ihrachys> do we have a bug report for the issue?
16:03:43 <jlibosva> no, not yet
16:04:25 <jlibosva> it could be related to the old issue we had in the past with a security group update being sent before the firewall is initialized
16:05:12 <slaweq> jlibosva: but AFAIR it happens only in those tests related to trunk, right?
16:05:32 <jlibosva> yes, only for trunk
16:05:46 <jlibosva> and it blocks SSH to parent port
16:06:36 <slaweq> maybe You could add some additional logs, e.g. for security groups or something like that, and try to spot it again in the job to check what's there?
16:07:01 <slaweq> or freeze the test node when it happens, log into it and debug there
16:07:13 <slaweq> AFAIK it happens quite often so it should be doable IMO
16:07:46 <jlibosva> ok, I'll try that. thanks for the tips. about freezing the test node, is it official or do I need to inject an ssh key?
16:08:06 <slaweq> it is official but You should ask on the infra channel for that
16:08:14 <slaweq> and give them Your ssh key
16:08:25 <slaweq> I tried it once when debugging linuxbridge jobs
16:08:53 <slaweq> I also tried using remote pdb and telnetting to it when a test fails - and that also worked fine for me :)
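For reference, a minimal sketch of the remote pdb approach slaweq describes, assuming the third-party remote-pdb package is installed on the held test node (the port is arbitrary):

```python
# Minimal sketch: drop this into the failing code path on a held test node.
# Requires the third-party remote-pdb package (pip install remote-pdb).
from remote_pdb import RemotePdb

# Instead of letting the test crash, pause here and listen on TCP port 4444;
# attach from another shell on the same node with: telnet 127.0.0.1 4444
RemotePdb('127.0.0.1', 4444).set_trace()
```

Once attached you get a normal pdb prompt and can inspect the security group / port state before letting the test continue.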
16:09:35 <ihrachys> slaweq, man you should document it all
16:09:43 <ihrachys> I don't think a lot of people are aware of how to do it
16:09:56 <ihrachys> I mean, that doc would be gold
16:10:05 <slaweq> ihrachys: ok, I will try to write it in docs
16:10:20 <slaweq> #action slaweq will write docs on how to debug test jobs
16:10:45 <ihrachys> thanks!
16:10:56 <slaweq> ok, moving on
16:11:03 <slaweq> next was: slaweq to check why the dvr-scenario job is broken with netlink errors in the l3 agent log
16:11:16 <slaweq> created bug report: https://bugs.launchpad.net/neutron/+bug/1757259
16:11:16 <openstack> Launchpad bug 1757259 in neutron "Netlink error raised when trying to delete not existing IP address from device" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:11:26 <slaweq> and fix: https://review.openstack.org/#/c/554697/ (merged)
16:11:57 <slaweq> and last action from last week:
16:12:00 <slaweq> slaweq to revert https://review.openstack.org/#/c/541242/ because the bug is probably fixed by https://review.openstack.org/#/c/545820/4
16:12:03 <slaweq> revert done: https://review.openstack.org/#/c/554709/
16:12:32 <slaweq> it looks like this issue still isn't fixed (but it happens much less frequently)
16:12:53 <slaweq> so I will keep debugging it if I spot this issue again
16:13:23 <slaweq> any questions/something to add? or can we move on to the next topic?
16:14:12 <ihrachys> I don't have anything
16:14:19 <slaweq> ok, so next topic
16:14:24 <slaweq> #topic Grafana
16:14:30 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:14:38 <slaweq> btw. I have one question to You
16:14:55 <slaweq> do we have a similar grafana dashboard for the stable branches?
16:15:09 <slaweq> I am only aware of this one which shows results from the master branch
16:15:23 <ihrachys> no we don't
16:15:43 <ihrachys> it was on my todo list for ages and it never bubbled up high enough
16:16:03 <slaweq> ah, ok
16:16:10 <ihrachys> should be quite easy though, just copy the current one and replace master with whatever string the stable branch query needs
16:16:29 <slaweq> I will add it to my todo list then (but not too high either) :)
16:17:07 <ihrachys> have fun :)
16:17:11 <slaweq> ihrachys: thx
16:17:25 <slaweq> ok, so getting back to dashboard for master branch
16:17:43 <slaweq> we have 100% failures of neutron-tempest-plugin-dvr-multinode-scenario
16:18:01 <slaweq> for at least a few days now
16:18:10 <ihrachys> it's 7 days at least
16:18:25 <slaweq> there was a problem with the FIP QoS test but that one is now skipped in this job
16:18:35 <slaweq> so I checked and found a few examples of failures
16:18:50 <slaweq> it looks that there are 2 main failures:
16:19:16 <slaweq> issue with neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromDVRHA, like: http://logs.openstack.org/97/552097/8/check/neutron-tempest-plugin-dvr-multinode-scenario/d242b10/logs/testr_results.html.gz
16:19:47 <slaweq> and a second issue with neutron_tempest_plugin.scenario.test_trunk.TrunkTest, like: http://logs.openstack.org/20/556120/12/check/neutron-tempest-plugin-dvr-multinode-scenario/76d1a5b/logs/testr_results.html.gz
16:20:35 <haleyb> slaweq: the migration issue is interesting, i thought we had that working before
16:20:59 <slaweq> haleyb: for sure it wasn't failing 100% of the time before :)
16:21:00 <ihrachys> haleyb, yes and it was very stable
16:21:20 <haleyb> i will have to take a look
16:21:24 <slaweq> but in all those cases it looks like the reason for the failure is a connectivity problem
16:21:31 <slaweq> thx haleyb
16:21:53 <ihrachys> haleyb, see in instance boot log, metadata is not reachable
16:21:54 <slaweq> #action haleyb to check router migrations issue
16:21:54 <ihrachys> [  456.685625] cloud-init[1021]: 2018-03-27 09:08:57,215 - DataSourceCloudStack.py[CRITICAL]: Giving up on waiting for the metadata from ['http://10.1.0.2/latest/meta-data/instance-id'] after 121 seconds
16:22:34 <ihrachys> and it's because of "Connection refused"
16:23:18 <ihrachys> slaweq, as for trunk, it's the same as jlibosva was looking at?
16:23:22 <jlibosva> right
16:23:25 <jlibosva> I was just about to write it
16:23:53 <slaweq> yes, it looks like it's the same
16:25:05 <slaweq> there are many more such examples in recent patches so it should be relatively easy to debug :)
16:25:23 <slaweq> (I don't know if it's possible to reproduce locally)
16:26:33 <slaweq> ok, among the other frequently failing jobs we also have neutron-tempest-dvr-ha-multinode-full, whose failure rate is about 50%
16:26:45 <ihrachys> yeah it's like that for quite a while
16:27:02 <slaweq> I found that in most cases the failures are also related to broken connectivity, like e.g. http://logs.openstack.org/52/555952/2/check/neutron-tempest-dvr-ha-multinode-full/9f94fa4/logs/testr_results.html.gz
16:27:39 <slaweq> ihrachys: yes, but I wanted to mention it at least :)
16:28:16 <slaweq> ahh, and one more thing, there are some periodic jobs failing 100% of the time
16:28:27 <slaweq> like e.g. openstack-tox-py35-with-oslo-master
16:28:47 <slaweq> my first question to You is: where can I find the results of such jobs? :)
16:29:26 <ihrachys> http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/
16:29:38 <ihrachys> then dive into a job dir
16:29:41 <ihrachys> and sort by time
16:29:44 <ihrachys> and check the latest
16:29:51 <ihrachys> which is http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/openstack-tox-py27-with-oslo-master/f65fec8/job-output.txt.gz
16:29:52 <slaweq> thx ihrachys
16:29:59 <ihrachys> DuplicateOptError: duplicate option: polling_interval
16:30:09 <ihrachys> I can take a look
16:30:22 <ihrachys> probably a conflict in the same section between an oslo lib and our options
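For context, a minimal sketch of how this kind of conflict surfaces in oslo.config: registering two different definitions under the same option name and group raises DuplicateOptError, which matches the error quoted above (the group and option definitions below are only illustrative, not the actual conflicting pair):

```python
# Minimal sketch of the failure mode: two different definitions of the same
# option in the same group make oslo.config raise DuplicateOptError.
from oslo_config import cfg

conf = cfg.ConfigOpts()

# First registration, e.g. done by an oslo library.
conf.register_opts([cfg.IntOpt('polling_interval', default=2)], group='agent')

# Re-registering the identical option is a no-op, but a *different*
# definition under the same name and group is rejected:
conf.register_opts([cfg.IntOpt('polling_interval', default=5)], group='agent')
# -> oslo_config.cfg.DuplicateOptError: duplicate option: polling_interval
```

If that is indeed what happens in the periodic job, the fix would be to make the two registrations agree on a single definition (or move one to another group).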
16:30:46 <slaweq> #action ihrachys to take a look at problem with openstack-tox-py35-with-oslo-master periodic job
16:30:47 <slaweq> thx
16:31:16 <slaweq> apart from those jobs I think it is quite fine now
16:31:38 <ihrachys> yeah. the unit test failure rate is a bit higher than I would expect but seems consistent with other jobs.
16:31:55 <ihrachys> maybe our unit tests are just so good now that they catch the majority of issues :)
16:32:13 <slaweq> maybe :)
16:32:19 <mlavalle> ++
16:32:24 <njohnston> we can always hope
16:32:26 <mlavalle> hopefully
16:32:42 <slaweq> ok, moving on to next topic
16:32:45 <slaweq> #topic Fullstack
16:33:08 <slaweq> fullstack is now at a similar failure rate to the functional and tempest-plugin-api jobs so it's fine
16:33:23 <slaweq> we had one issue with it during last week
16:33:33 <slaweq> Bug https://bugs.launchpad.net/neutron/+bug/1757089
16:33:34 <openstack> Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:33:49 <slaweq> but it is (I hope) already fixed by https://review.openstack.org/#/c/554940/
16:34:37 <slaweq> another problem is the previously mentioned bug with starting the nc process: https://bugs.launchpad.net/neutron/+bug/1744402
16:34:38 <openstack> Launchpad bug 1744402 in neutron "fullstack security groups test fails because ncat process don't starts" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:35:06 <slaweq> For now I added a small patch https://review.openstack.org/#/c/556155/ and when it fails I want to check whether it's because the IP wasn't configured properly or there is some other problem with starting the nc process
16:36:00 <slaweq> anyone wants to add anything?
16:36:11 <slaweq> or should we move on?
16:36:40 <jlibosva> nothing from me :)
16:36:55 <jlibosva> I'm just amazed how many things you can do
16:37:01 <slaweq> thx :)
16:37:15 <ihrachys> jlibosva, THIS
16:37:27 <ihrachys> sometimes I feel embarrassed
16:37:31 <ihrachys> :)
16:37:37 <slaweq> actually I have one more question here :)
16:38:15 <slaweq> as fullstack has been voting for 2 weeks and I don't think there are any big problems with it, should we consider making it gating as well?
16:38:53 <mlavalle> we are before Rocky-1, so it's a good time to give it a try
16:38:59 <ihrachys> +
16:39:10 <mlavalle> if not now, then when?
16:39:19 <slaweq> great, I will send patch for it then
16:39:27 <jlibosva> yay
16:39:35 <slaweq> #action slaweq to make fullstack job gating
16:39:46 <slaweq> ok, next topic
16:39:50 <slaweq> #topic Scenarios
16:40:03 <slaweq> we already talked about issues with dvr scenario job
16:40:15 <slaweq> so I only want to mention about linuxbridge job
16:40:25 <slaweq> we had two issues during last week:
16:40:44 <slaweq> both were in stable/queens branch only
16:41:20 <slaweq> the first problem was that we forgot to backport https://review.openstack.org/#/c/555263/ to queens, so it wasn't configured and the FIP QoS test was failing there
16:41:32 <slaweq> and the second was that we forgot to backport https://review.openstack.org/#/c/554859/, so this job was failing a lot due to ssh timeouts
16:41:51 <slaweq> this was a problem as this job is now voting in stable/queens too
16:41:59 <slaweq> but I think it's fine now
16:42:20 <slaweq> and this job is back at a good failure rate again :)
16:42:48 <ihrachys> very nice
16:43:08 <slaweq> so I wanted to ask about making this job gating as well - what do You think about it?
16:43:22 <mlavalle> let's do one job one week
16:43:29 <mlavalle> and then we add the next
16:43:37 <mlavalle> how about that?
16:43:39 <slaweq> mlavalle: sure, fine for me :)
16:44:06 <mlavalle> that way we don't drive ourselves crazy in case something starts to fail
16:44:27 <slaweq> mlavalle: agree
16:44:38 <slaweq> so I will be back with this question next week then :)
16:45:04 <mlavalle> I know you will
16:45:09 <slaweq> LOL
16:45:15 <mlavalle> relentless Hulk
16:45:21 <mlavalle> :-)
16:45:36 <slaweq> ok, so that's all I have on my list for today
16:45:44 <slaweq> #topic Open discussion
16:46:02 <slaweq> I wanted to ask You to review https://review.openstack.org/#/c/552846/
16:46:18 <slaweq> it's my first attempt to move a job definition to the zuul v3 format
16:46:30 <slaweq> it looks like it works for this job
16:46:47 <mlavalle> ahh nice
16:46:48 <slaweq> andreaf was reviewing it from the zuul point of view and it looks fine to him
16:46:55 <mlavalle> will take a look
16:47:00 <slaweq> so I think it's ready for review by You as well
16:47:03 <slaweq> thx mlavalle
16:47:06 <ihrachys> if andreaf is fine it must be ok. :)
16:47:11 <ihrachys> will do
16:47:45 <slaweq> he gave +1 on one of the patch sets - I will ask him to take a look again now
16:47:50 <mlavalle> I also want to migrate some job definitions, so we can partner on it
16:48:28 <slaweq> mlavalle: which job definitions do You want to migrate? the ones from the neutron repo?
16:48:44 <mlavalle> yeah I was thinking of those
16:49:09 <slaweq> I can help You with it if You want
16:49:49 <slaweq> ah, and one more question about tempest jobs
16:50:12 <mlavalle> sure, let's partner as I said
16:50:29 <slaweq> when I was checking grafana today I found that two jobs, neutron-tempest-ovsfw and neutron-tempest-multinode-full, have also been quite stable in the last few weeks
16:50:43 <slaweq> You can check it http://grafana.openstack.org/dashboard/db/neutron-failure-rate?from=1516976912028&to=1522153712028&panelId=8&fullscreen
16:51:12 <slaweq> so my question is: do You want to make them voting some day?
16:51:47 <ihrachys> what's this multinode-full job about? how is it different from the dvr-ha one? do we plan to replace it with dvr-ha eventually?
16:52:01 <slaweq> ihrachys: I don't know to be honest
16:52:12 <slaweq> I just found it on graphs and wanted to ask :)
16:52:51 <mlavalle> it might end up being functionally a subset of the dvr-ha
16:52:56 <slaweq> ihrachys: I can compare those jobs for You if You want
16:52:57 <ihrachys> I am actually surprised we don't have a multinode tempest full job that would vote :)
16:53:25 <mlavalle> so let's clarify the differences and then we can make a decision
16:53:35 <slaweq> ok, fine for me
16:53:42 <ihrachys> mlavalle, the thing is, dvr-ha would probably take some time to get to voting, so should we maybe enable the regular one (I believe it's legacy routers) while dvr-ha is being worked on, then revisit?
16:54:03 <ihrachys> slaweq, + would be nice to understand the diff to judge
16:54:05 <mlavalle> I agree that we should have a voting multinode
16:54:36 <slaweq> #action slaweq will check difference between neutron-tempest-multinode-full and neutron-tempest-dvr-ha-multinode-full
16:54:46 <ihrachys> also, the source of this job is important. it could be that the regular one is defined in tempest
16:55:19 <mlavalle> ihrachys so I agree with you
16:56:10 <slaweq> great, thx for opinions
16:56:17 <slaweq> and what about the ovsfw job? is it too early to make that one voting?
16:56:42 <slaweq> it was failing 100% of the time at the beginning of February
16:57:21 <mlavalle> should be voting at some point. I don't see why not
16:57:27 <slaweq> but for the last few weeks it has been better IMHO - the spikes are the same as for other jobs so it doesn't look like there are issues specific to this job exactly
16:57:43 <ihrachys> yeah though again, maybe dvr-ha one should incorporate this firewall driver and effectively replace it
16:57:44 <jlibosva> but I think it also has some additional failures, no?
16:58:16 <slaweq> jlibosva: I don't know about any issue specific to this one
16:59:14 <jlibosva> ah, right. it tracks the other jobs nicely over the last 7 days
16:59:25 <ihrachys> the way I see it, there are two major setup options for neutron in-tree stuff: the first is the old one - legacy l3, iptables... the other is the new stuff - dvr/ha, ovsfw. to keep the number of jobs from blowing up further, I would recommend we try to keep those two major kinds of setups as targets and try to consolidate features around them
16:59:48 <slaweq> ok, I think we can check it for a few more weeks and decide then, as we are almost out of time now
16:59:55 <mlavalle> that's good advice ihrachys. Thanks :-)
16:59:59 <jlibosva> I think that was agreed on at some PTG but never formally documented
17:00:13 <slaweq> ok, we are out of time now
17:00:17 <ihrachys> yeah
17:00:17 <slaweq> #endmeeting