16:00:18 <slaweq> #startmeeting neutron_ci
16:00:19 <openstack> Meeting started Tue Aug 21 16:00:18 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:20 <slaweq> hi
16:00:22 <openstack> The meeting name has been set to 'neutron_ci'
16:00:22 <mlavalle> o/
16:00:56 * mlavalle has to leave in 40 minutes for an appointment
16:01:12 <slaweq> ok, mlavalle so we will try to do it fast :)
16:01:16 <slaweq> lets start then
16:01:29 <slaweq> #topic Actions from previous meetings
16:01:40 <slaweq> njohnston to tweak stable branches dashboards
16:02:06 <slaweq> I don't know if he did something about that but I don't think so
16:03:00 <slaweq> I pinged njohnston on neutron channel, maybe he will join
16:03:16 <mlavalle> ok
16:03:37 <njohnston> o/
16:03:40 <slaweq> hi njohnston
16:03:57 <slaweq> we are talking about actions from previous meeting
16:03:59 <mlavalle> he always shows up
16:04:00 <njohnston> sorry I'm late, was engrossed in code :-)
16:04:02 <slaweq> and we have:
16:04:04 <slaweq> njohnston to tweak stable branches dashboards
16:04:21 <slaweq> no problem :)
16:04:27 <njohnston> Gah, I forgot all about that.  My apologies.  I'll do it today.
16:04:47 <slaweq> sure, fine
16:04:53 <slaweq> I will add it for next week then
16:04:58 <slaweq> #action njohnston to tweak stable branches dashboards
16:05:08 <slaweq> next action was:
16:05:10 <slaweq> slaweq add data about slow API to etherpad for PTG
16:05:35 <slaweq> I added it to etherpad but I still need to add some data about what tests and what API calls are slowest
16:06:00 <slaweq> there are also patches to improve that in progress so I will also point to them there
16:06:15 <mlavalle> that's good, thanks
16:06:31 <slaweq> ok, next one was:
16:06:33 <slaweq> slaweq will add number of failures to graphana
16:06:42 <slaweq> I added such graphs to grafana
16:06:53 <slaweq> it shows a summary of jobs from the last 24h
16:07:18 <slaweq> lets use it for a while and tweak if that will be necessary
16:07:30 <mlavalle> yeah, we need to give it some time
16:08:05 <slaweq> exactly
16:08:17 <slaweq> I also did small reorganization of graphs there
16:08:35 <slaweq> and moved integration tests to one graph and scenario jobs to separate graph
16:09:06 <slaweq> I wanted to make same graphs for "check" and "gate" queue and I think that now it is like that
16:09:13 <mlavalle> yes, good that you are tweaking it
16:09:42 <slaweq> maybe it would even be better to move the gate queue to a separate dashboard because there are a lot of graphs on it now
16:09:55 <slaweq> but let's see how it works like that for a few weeks
16:10:07 <njohnston> it definitely looks nicer +1
16:10:17 <mlavalle> I have always found the dashboard and some panels too loaded with data
16:10:23 <mlavalle> I'm not that smart
16:10:37 <slaweq> :)
16:10:40 <mlavalle> so I like that you are simplifying it
16:11:00 <slaweq> I'm trying but I'm not the best person for tweaking UX :P
16:11:35 <mlavalle> well, as long as we get something we all understand easily, don't worry about UX orthodoxies
16:11:51 <slaweq> ok, thx :)
16:11:54 <slaweq> I will remember that
16:12:03 <slaweq> ok, last action from last week was:
16:12:05 <slaweq> slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:12:14 <slaweq> Bug reported already: https://bugs.launchpad.net/neutron/+bug/1779328 I just updated it
16:12:14 <openstack> Launchpad bug 1779328 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails" [High,Confirmed]
16:12:32 <slaweq> I also did small patch to enable debug_iptables_rules in L2 agent config
16:12:40 <slaweq> it was merged today or yesterday
16:13:09 <slaweq> so if it will fail again, I hope that this debug option will help me to figure out what is wrong there
16:13:10 <mlavalle> so are you planning to work on it?
16:13:22 <slaweq> yes, I have an eye on that for now
16:13:36 <mlavalle> ok, I'll assign you to it and mark it as in progress
16:13:43 <slaweq> ok, thx
16:13:48 <slaweq> I forgot about that
16:14:07 <slaweq> it's always iptables driver which is failing, never openvswitch firewall driver
16:14:34 <slaweq> I suppose that it's some race condition again but have no idea what exactly
16:15:15 <slaweq> ok, lets talk about grafana now
16:15:17 <slaweq> #topic Grafana
16:15:27 <mlavalle> yaay, the new and improved
16:15:33 <slaweq> As I said, I reorganized it a bit recently
16:15:42 <slaweq> and I wanted to ask about one thing also
16:16:03 <slaweq> I found that among the available metrics there are also ones for jobs ending in TIMED_OUT and POST_FAILURE
16:16:17 <slaweq> should we include them in our graphs?
16:16:36 <slaweq> for now we have it calculated as (SUCCESS / (FAILURE+SUCCESS))
16:16:54 <mlavalle> well the danger there is that we overload the panels with data
16:16:55 <slaweq> so in graphs we don't see jobs which had TIMEOUT or POST_FAILURE
16:17:28 <njohnston> Can we use wildcards in the selection of metric names?  If so we could use a mid-string wildcard to count all of the TIMED_OUT and POST_FAILURE for all the jobs and just count the number of them occurring as a sort of "healthcheck on zuul" graph
16:17:30 <mlavalle> now, timed out and post failures are infra issues, aren't they?
16:17:46 <slaweq> but I was thinking about something like ((FAILURE + TIME_OUT + POST_FAILURE) / (FAILURE + TIME_OUT + POST_FAILURE + SUCCESS))
16:18:03 <mlavalle> mhhhh
16:18:13 <mlavalle> failures are our problem
16:18:25 <slaweq> post failure is infra issue usually
16:18:31 <mlavalle> TIME_OUT + POST_FAILURE are infra's, or am I wrong?
16:18:36 <slaweq> but timeout is mostly our issue (slow api for example)
16:18:50 <mlavalle> ok, I buy that
16:19:01 <slaweq> and time_out is what we hit more often
16:19:04 <njohnston> we should still have the tests get killed by the internal timer and register as a normal FAILURE instead of a TIMED_OUT
16:19:33 <mlavalle> so if we can organize this in such a way that we can easily discriminate what we need to worry about and what we need to communicate to infra, I am all for it
16:19:35 <slaweq> njohnston: yes, but if a job takes too long it reaches some kind of "global" timeout and the job is killed
16:20:21 <slaweq> so updating to something like ((FAILURE + TIME_OUT) / (FAILURE + TIME_OUT + SUCCESS)) [in %]
16:20:25 <slaweq> right?
16:21:09 <njohnston> ok
16:21:11 <mlavalle> that formula seems right to indicate "what we need to worry about"
16:21:14 <slaweq> we will then have the percentage of failures and timeouts for all our jobs, excluding post_failures (so the percentage of our own issues)
16:21:39 <slaweq> ok, so I will update it in grafana
16:21:55 <slaweq> #action slaweq to update grafana dashboard to ((FAILURE + TIME_OUT) / (FAILURE + TIME_OUT + SUCCESS))
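(A minimal sketch of what the agreed formula computes, with hypothetical job counts; the actual dashboard would express this as a Grafana/Graphite query rather than Python.)

```python
def failure_rate(failure, timed_out, success):
    """Percentage of runs that are 'our problem': FAILUREs and
    TIMED_OUTs over all counted runs.  POST_FAILUREs are excluded
    entirely, since those are usually infra issues."""
    total = failure + timed_out + success
    if total == 0:
        return 0.0
    return 100.0 * (failure + timed_out) / total

# hypothetical example: 3 failures + 2 timeouts out of 25 runs -> 20.0%
print(failure_rate(failure=3, timed_out=2, success=20))
```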
16:21:59 <mlavalle> njohnston: you agree with that formula?
16:23:36 <mlavalle> well, let's move on
16:23:38 <slaweq> I think we lost njohnston :/
16:23:51 <slaweq> ok
16:23:56 <njohnston> yes that is good
16:24:13 <mlavalle> ++
16:24:20 <slaweq> speaking about grafana, there are no new "big" issues IMO
16:24:31 <slaweq> so let's talk about some specific jobs now
16:24:34 <slaweq> #topic functional
16:24:49 <slaweq> I want to start with functional because there is one urgent issue with it
16:25:02 <slaweq> in stable/queens it looks like it's been failing 100% of the time for a few days
16:25:03 <mlavalle> ok
16:25:13 <slaweq> bug reported https://bugs.launchpad.net/neutron/+bug/1788185
16:25:13 <openstack> Launchpad bug 1788185 in neutron "[Stable/Queens] Functional tests neutron.tests.functional.agent.l3.test_ha_router failing 100% times " [Critical,Confirmed]
16:25:28 <slaweq> example of failure: http://logs.openstack.org/78/593078/1/check/neutron-functional/28fe681/logs/testr_results.html.gz
16:27:59 <slaweq> last patch merged to queens branch was https://review.openstack.org/#/c/584276/
16:28:15 <slaweq> which IMO can be potential culprit
16:28:21 <slaweq> but it has to be checked
16:29:34 <mlavalle> nobody working on it?
16:29:39 <mlavalle> the bug I mean
16:29:45 <mlavalle> it has no assignee
16:30:15 <slaweq> I reported it today, few hours ago
16:30:29 <slaweq> and I hadn't got time to work on it yet
16:30:42 <mlavalle> do you have bandwidth? I can help if you don't
16:31:14 <slaweq> would be great if You could check that
16:31:25 <mlavalle> ok, I'll take a stab at it
16:31:31 <slaweq> thx
16:31:39 <mlavalle> just assigned it to me
16:31:57 <mlavalle> I'll yell for help if I get myself in trouble
16:33:22 <slaweq> sure
16:33:49 <slaweq> ok, let's quickly move for other issues
16:34:19 <slaweq> I recently saw failures a few times in neutron.tests.functional.agent.l3.test_metadata_proxy.UnprivilegedUserGroupMetadataL3AgentTestCase
16:34:36 <slaweq> but this should be "fixed" by patch https://review.openstack.org/#/c/586341/4 :)
16:34:43 <slaweq> so please review it
16:35:05 <mlavalle> added it to my pile
16:35:16 <slaweq> thx
16:35:20 <mlavalle> will look at it when I come back after my appointment
16:35:28 <njohnston> I've been seeing issues with neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip but I haven't been able to track it to a source
16:36:03 <njohnston> It's happened about 40 times in the last week; I added an elastic-recheck change to look for it https://review.openstack.org/593722
16:36:08 <slaweq> njohnston: I know, it's even reported on https://bugs.launchpad.net/bugs/1788006 :)
16:36:08 <openstack> Launchpad bug 1788006 in neutron "neutron_tempest_plugin DNS integration tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Undecided,New]
16:36:47 <njohnston> yeah, I dropped that bug when I exceeded my timebox on tracking down the source of the issue
16:36:48 <mlavalle> mhhh the description sounds like a nova issue
16:37:03 <mlavalle> not saying it is not our issue
16:37:10 <mlavalle> but seems odd
16:37:24 <slaweq> yep, it looks so at first glance
16:37:34 <slaweq> but we should check it
16:37:48 <slaweq> anyone has bandwidth for it?
16:38:02 <mlavalle> don't have much, but I'll try to take a look
16:38:03 <njohnston> I could ping the nova-neutron liaison
16:38:28 <njohnston> I believe that is sean-k-mooney
16:38:38 <mlavalle> I'll take a quick look before pinging sean-k-mooney
16:38:46 <njohnston> thanks much
16:38:51 <slaweq> thx mlavalle
16:38:55 <slaweq> and thx njohnston :)
16:39:13 <mlavalle> ok, I have to leave guys
16:39:15 <slaweq> #action mlavalle to check neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip issue
16:39:20 <slaweq> ok, thx mlavalle
16:39:24 <mlavalle> o/
16:39:34 <slaweq> that's basically all the important things I had for today
16:40:13 <slaweq> so njohnston if You don't have anything to talk about I think we can finish earlier today
16:40:16 <slaweq> :)
16:40:50 <njohnston> nothing from me, it just seems like we have a lot of timeouts of late
16:41:40 <njohnston> anyhow, have a good evening slaweq, and thanks as always!
16:41:51 <slaweq> ok, thx njohnston
16:41:57 <slaweq> and have a nice day :)
16:42:04 <slaweq> #endmeeting