16:00:19 <slaweq> #startmeeting neutron_ci
16:00:20 <openstack> Meeting started Tue Dec 11 16:00:19 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 <slaweq> hi
16:00:25 <openstack> The meeting name has been set to 'neutron_ci'
16:00:29 <haleyb> hi
16:01:35 * haleyb is in another meeting if he gets unresponsive
16:01:45 <mlavalle> o/
16:02:29 <slaweq> lets wait a few more minutes for njohnston, hongbin and others
16:03:19 <njohnston> we may also want sean-k-mooney for this one but I don't think he is in channel
16:04:34 <slaweq> ok, lets start
16:04:42 <slaweq> #topic Actions from previous meetings
16:04:53 <slaweq> mlavalle to change trunk scenario test and see if that will help with FIP issues
16:05:05 <mlavalle> I pushed a patch yesterday
16:05:28 <mlavalle> https://review.openstack.org/#/c/624271/
16:05:35 <bcafarel> late hi o/
16:05:50 <mlavalle> I need to investigate what the results are
16:06:16 <slaweq> looks that this didn't solve problem: http://logs.openstack.org/71/624271/1/check/neutron-tempest-plugin-dvr-multinode-scenario/a125b91/testr_results.html.gz
16:06:30 <mlavalle> yeah
16:06:54 <mlavalle> I'll still take a closer look
16:07:16 <mlavalle> and will continue investigating the bug in general
16:07:32 <slaweq> maybe there is some issue with timeouts there - those tests are using the advanced image IIRC so it may be that it's trying ssh for too short a time
16:07:36 <slaweq> ?
16:08:01 <mlavalle> the lifecycle scenario doesn't use advanced image
16:08:29 <slaweq> #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:08:41 <slaweq> ahh, right - only one of the tests is using the adv image
16:08:44 <mlavalle> you know what, on second thought we don't know if it worked
16:09:09 <mlavalle> because the test case that failed was the other one
16:09:20 <mlavalle> not the lifecycle one
16:09:24 <slaweq> in this example both tests failed
16:09:53 <mlavalle> ok
16:09:58 <mlavalle> I'll investigate
16:10:05 <slaweq> thx mlavalle
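
On the timeout theory above: scenario tests usually wait for the FIP's ssh port with a retry-until-deadline loop, and a slow-booting advanced image needs a larger deadline than a cirros-style image. A generic sketch of that pattern (the helper name and the 60s default are illustrative, not the actual neutron-tempest-plugin values):

    import socket
    import time

    def wait_for_ssh_port(host, port=22, timeout=60, interval=5):
        # Probe the ssh port until it answers or the deadline passes.
        # A slow-booting ("advanced") image needs a larger timeout,
        # otherwise the test fails even though the FIP wiring is fine.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return
            except OSError:
                time.sleep(interval)
        raise AssertionError("ssh port on %s not reachable after %ss"
                             % (host, timeout))
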
16:10:08 <slaweq> lets move on
16:10:11 <slaweq> njohnston will remove neutron-grenade from neutron ci queues and add comment why definition of job is still needed
16:11:18 <njohnston> So the feedback I got from the QA team is that they would rather we keep neutron-grenade, as they want to keep py2 grenade testing
16:11:55 <njohnston> they consider it part of the minimum level of testing needed until we officially stop supporting py2
16:12:32 <slaweq> ok
16:12:54 <slaweq> so we can consider this point from https://etherpad.openstack.org/p/neutron_ci_python3 as done, right?
16:13:02 <njohnston> yes
16:13:17 <njohnston> I was waiting for us to talk about it in the meeting before marking it
16:14:17 <slaweq> I just marked it as done in etherpad then
16:14:20 <slaweq> thx njohnston
16:14:21 <njohnston> thanks
16:14:37 <slaweq> one more question
16:15:15 <slaweq> is it only the py2-based grenade job which QA still wants to have? or should we keep py2 variants of all grenade jobs too?
16:15:18 <slaweq> do You know?
16:15:57 <njohnston> They want grenade to cover both py2 and py3, so we should have both - the same way we have unit tests for both
16:16:57 <slaweq> so we should "duplicate" all our grenade jobs then to have py2 and py3 variants for each
16:17:11 <slaweq> probably more rechecks but ok :)
16:17:31 <mlavalle> LOL
16:17:57 <njohnston> Sorry, I was not specific enough.  I think they want at least one grenade for py3 and py2 each.  I don't think we need a full matrix.
16:18:32 <njohnston> So we should have grenade-py3 and neutron-grenade... but for example neutron-grenade-multinode-dvr could be just on py3 and they would be fine
16:18:32 <slaweq> ok, so we already have neutron-grenade (py2) and grenade-py3 (py3) jobs
16:19:04 <slaweq> so we can just switch neutron-grenade-dvr-multinode and neutron-grenade-multinode to py3 now?
16:19:45 <njohnston> yes.  I proposed a 'grenade-multinode-py3' job in the grenade repo https://review.openstack.org/#/c/622612/
16:20:02 <njohnston> I thought that we could use that perhaps, and then it becomes available for other projects
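
For reference, a rough sketch of how the jobs discussed above could end up in neutron's check queue once the py3 multinode grenade job exists in the grenade repo; the file layout (zuul.d/project.yaml) and the final job name are assumptions here, not the merged config:

    - project:
        check:
          jobs:
            - neutron-grenade                # py2 grenade, kept at the QA team's request
            - grenade-py3                    # single-node py3 grenade
            - grenade-multinode-py3          # the job proposed in the grenade repo
            - neutron-grenade-dvr-multinode  # still py2, candidate to switch to py3
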
16:20:13 <slaweq> ok, now it's clear
16:20:17 <slaweq> thx njohnston for working on this
16:20:20 <njohnston> np
16:20:34 <slaweq> ok, lets move on then
16:20:37 <slaweq> slaweq to continue debugging bug 1798475
16:20:37 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:20:56 <slaweq> unfortunately I didn't have much time to work on it last week
16:21:13 <slaweq> I lost a few days because of sick leave :/
16:21:27 <slaweq> I will try to check it this week
16:21:38 <slaweq> #action slaweq to continue debugging bug 1798475
16:21:39 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:21:47 <slaweq> next one
16:21:49 <slaweq> slaweq to continue fixing functional-py3 tests
16:21:50 <mlavalle> feeling better now?
16:21:55 <hongbin> o/
16:22:00 <slaweq> mlavalle: yes, thx. It's much better
16:22:05 <slaweq> hi hongbin
16:22:16 <hongbin> slaweq: sorry, a bit late today
16:22:32 <slaweq> so regarding the functional py3 tests, I was playing with them a bit during the weekend
16:23:02 <slaweq> I tried to disable all warnings in python and so on but it still didn't help
16:23:49 <slaweq> issue is probably caused by capturing stderr, like e.g.: http://logs.openstack.org/83/577383/17/check/neutron-functional/2907d2b/job-output.txt.gz#_2018-12-10_11_06_04_272396
16:24:54 <slaweq> but:
16:25:01 <slaweq> 1. I don't know how to get rid of it
16:25:34 <slaweq> 2. I'm not sure if it's a good idea to get rid of it because I'm not sure if that output comes from the test which failed or from a test which actually passed
16:26:17 <slaweq> if anyone has any idea how to fix this issue - feel free to take it :)
16:26:23 <njohnston> I am at a loss for what the best course forward is
16:26:46 <slaweq> if not I will assign it to myself for next week and will try to continue work on it
16:27:26 <slaweq> #action slaweq to continue fixing functional-py3 tests
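
For context on the stderr capture mentioned above: oslotest-style base test classes redirect sys.stderr into a stream that gets attached to the test result, which is what shows up in the linked job output even for tests that passed. A minimal sketch of that pattern using the fixtures library (this is not neutron's actual base class):

    import os

    import fixtures
    import testtools

    _TRUE_VALUES = ('True', 'true', '1', 'yes')

    class CapturingTestCase(testtools.TestCase):
        def setUp(self):
            super(CapturingTestCase, self).setUp()
            # With OS_STDERR_CAPTURE set, anything written to sys.stderr
            # during the test is collected by the fixture and attached to
            # the test result, so warnings printed by a passing test can
            # still end up in the job's failure details.
            if os.environ.get('OS_STDERR_CAPTURE') in _TRUE_VALUES:
                stderr = self.useFixture(
                    fixtures.StringStream('stderr')).stream
                self.useFixture(fixtures.MonkeyPatch('sys.stderr', stderr))
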
16:27:31 <slaweq> ok, lets move on
16:27:39 <slaweq> njohnston to research py3 conversion for neutron grenade multinode jobs
16:27:48 <njohnston> I think we covered that before
16:27:52 <slaweq> I think we alread talked about it :)
16:27:56 <slaweq> yes, thx njohnston
16:28:08 <slaweq> so next one
16:28:10 <slaweq> slaweq to update etherpad with what is already converted to py3
16:28:27 <slaweq> I updated etherpad https://etherpad.openstack.org/p/neutron_ci_python3 today
16:28:27 <bcafarel> on the functional tests, it may be worth sending a ML post - maybe some other projects would have an idea there
16:28:42 <bcafarel> (strange that it's only us getting hit by this "log limit")
16:28:46 <slaweq> bcafarel: good idea, I will send email today
16:30:35 <slaweq> basically we still need to convert most of the tempest jobs, grenade, rally and functional
16:30:48 <slaweq> for rally I proposed patch https://review.openstack.org/624358
16:30:55 <slaweq> lets wait for results of CI now
16:31:34 <slaweq> so etherpad is updated, if someone wants to help, feel free to propose patches for jobs which are still waiting :)
16:32:07 <slaweq> ok, and the last action was:
16:32:09 <slaweq> hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:32:40 <hongbin> i have to postpone this one since i am still figuring out how to set up the environment for testing
16:33:11 <njohnston> sean-k-mooney: thanks for joining, we'll talk about CI issues in a moment
16:33:37 <sean-k-mooney> njohnston: no worries
16:33:48 <slaweq> hongbin: ok, ping me if You need any help
16:33:56 <hongbin> slaweq: thanks, will do
16:34:00 <slaweq> I will assign it as an action for next week, ok?
16:34:07 <hongbin> sure
16:34:10 <slaweq> #action hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:34:18 <slaweq> ok, lets move on then
16:34:22 <slaweq> #topic Python 3
16:34:38 <slaweq> we already talked about the grenade jobs
16:35:21 <slaweq> I only wanted to mention this patch for neutron-rally job: https://review.openstack.org/624358
16:35:47 <slaweq> and also I sent patch https://review.openstack.org/624360 today to remove the tempest-full job as we have tempest-full-py3 already
16:36:02 <slaweq> so I think that we don't need both of them
16:36:46 <slaweq> anything else You want to talk about njohnston, bcafarel?
16:37:17 <njohnston> nope, I think that covers it
16:37:30 <bcafarel> same here
16:37:31 <slaweq> ok, so let's move on then
16:37:37 <slaweq> #topic Grafana
16:37:42 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:38:14 <njohnston> looks like failure rate on neutron-tempest-iptables_hybrid job has gone from 8% at 0930UTC to 46% at 1620UTC
16:38:14 <njohnston> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1&panelId=18&fullscreen&from=now%2Fd&to=now%2Fd
16:38:37 <njohnston> sean-k-mooney was looking into it and how it might be related to pyroute2
16:38:38 <slaweq> njohnston: yes, and I think this is what sean-k-mooney has found the culprit for, right?
16:38:59 <njohnston> https://bugs.launchpad.net/os-vif/+bug/1807949
16:39:00 <openstack> Launchpad bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] - Assigned to sean mooney (sean-k-mooney)
16:39:16 <sean-k-mooney> so is that breaking on all builds or just some?
16:39:47 <njohnston> just the neutron-tempest-iptables_hybrid it looks like
16:41:10 <slaweq> sean-k-mooney: is it this error: http://logs.openstack.org/60/624360/1/check/neutron-tempest-iptables_hybrid/a6a4a0a/logs/screen-n-cpu.txt.gz?level=ERROR#_Dec_11_15_10_54_319285 ?
16:41:37 <njohnston> I am wondering if we should make neutron-tempest-iptables_hybrid non-voting while we figure this out, or blacklist this version of os-vif....
16:42:23 <haleyb> we already had to blacklist 0.12.0...
16:42:29 <njohnston> based on the 24-hour-rolling-average nature of grafana lines I think a rise this rapid means we may have an effective 100% failure rate at the moment
16:42:38 <slaweq> sean-k-mooney: do You know why it may happen only in this job?
16:43:05 <slaweq> I don't see any such error e.g. in tempest-full job logs (at least the one which I'm checking now)
16:43:54 <njohnston> I did not see that error in the neutron-tempest-linuxbridge jobs I spot-checked
16:44:01 <njohnston> (as another datapoint)
16:46:08 <slaweq> that is strange to me - the only thing which is "special" in neutron-tempest-iptables_hybrid is the iptables_hybrid firewall driver instead of the openvswitch driver
16:46:18 <slaweq> how could this trigger such an error?
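
Since bug 1807949 is about leaking file descriptors ("Too many open files"), one quick way to confirm a leak while reproducing locally is to watch the fd count of the suspect process (nova-compute / its privsep helper) grow over a tempest run. A tiny Linux-only illustration; which pid to watch and what threshold matters is up to whoever debugs it:

    import os

    def count_open_fds(pid):
        # Count entries under /proc/<pid>/fd; a steadily growing number
        # while tempest runs points at an fd leak like bug 1807949.
        return len(os.listdir('/proc/%d/fd' % pid))
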
16:49:16 <mlavalle> I think sean-k-mooney is not online anymore
16:49:21 <slaweq> ok, I think that we should check if it happens 100% of the time in this job; if so, we should, as njohnston said, mark this job as non-voting temporarily and then try to investigate it
16:49:30 <slaweq> do You agree?
16:49:34 <mlavalle> yes
16:49:39 <hongbin> +1
16:49:47 <njohnston> +1
16:50:11 <slaweq> ok, I will check grafana tomorrow morning and will send a patch to set this job as non-voting
16:50:57 <njohnston> should we send something to the ML asking people not to recheck if the failure is in iptables_hybrid?
16:51:07 <slaweq> #action slaweq to switch neutron-tempest-iptables_hybrid job to non-voting if it keeps failing a lot because of bug 1807949
16:51:08 <openstack> bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] https://launchpad.net/bugs/1807949 - Assigned to sean mooney (sean-k-mooney)
16:51:12 <sean-k-mooney> hi sorry got disconnected
16:51:34 <slaweq> njohnston: yes, I will send an email
16:51:43 <bcafarel> I think I just did :/ (though there was a rally timeout too)
16:52:52 <sean-k-mooney> ill join the neutron channel after to discuss the pyroute2 issue
16:53:01 <slaweq> ok, so sean-k-mooney - we will mark our neutron-tempest-iptables_hybrid job as non-voting if it is failing 100% of the time because of this issue
16:53:20 <sean-k-mooney> ok
16:53:28 <slaweq> so we will have more time to investigate this :)
16:53:46 <sean-k-mooney> thanks :)
16:54:09 <slaweq> thx for helping with this :)
16:54:13 <slaweq> ok, lets move on
16:54:37 <slaweq> today I went through our list of issues in https://etherpad.openstack.org/p/neutron-ci-failures
16:54:55 <slaweq> and I wanted to find the 3 which happen most often
16:55:33 <slaweq> one of the problems which hits us the most is still this issue with db migrations in functional tests:
16:55:45 <slaweq> which happens many times
16:55:59 <slaweq> and which is in my backlog
16:56:09 <slaweq> but maybe we should mark those tests as unstable for now?
16:56:13 <slaweq> what do You think?
16:56:44 <bcafarel> sounds reasonable, I did see this db migration issue a few times recently
16:56:59 <mlavalle> yeah, I'm ok with that
16:57:06 <njohnston> it is a persistent bugaboo yes
16:57:06 <slaweq> ok, I will do that then
16:57:09 <mlavalle> we will continue trying to fix it, right?
16:57:18 <slaweq> mlavalle: of course
16:57:23 <mlavalle> ok
16:57:40 <slaweq> I even have a card for it in our trello, I just need some time
16:57:40 <mlavalle> yeah, if it is getting in the way, let's mark it unstable
16:57:59 <slaweq> #action slaweq to mark db migration tests as unstable for now
16:58:00 <mlavalle> thanks slaweq
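
On marking the db migration tests as unstable: neutron carries a helper for exactly this (unstable_test in neutron.tests.base, treat the exact location as an assumption); the idea is simply to turn a failure of a known-flaky test into a skip referencing the tracking bug, so it stops blocking the gate while the fix is worked on. A self-contained sketch of that technique:

    import functools

    def unstable_test(reason):
        def decor(f):
            @functools.wraps(f)
            def inner(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except Exception as e:
                    # The test still runs on every job, but a failure is
                    # reported as a skip that references the tracking bug
                    # instead of failing the whole run.
                    self.skipTest("Test unstable, skipping: %s; reason: %s"
                                  % (e, reason))
            return inner
        return decor

    # usage (hypothetical test name):
    # @unstable_test("bug 1798475")
    # def test_walk_versions(self):
    #     ...
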
16:58:12 <slaweq> other issues which I found were:
16:58:29 <slaweq> 1. issues with cinder volume backup timeouts - I will try to ping the cinder guys again about it
16:59:04 <slaweq> 2. various issues with FIP connectivity - it's not always the same test/job, the only common part is that ssh to the fip is not working
16:59:31 <slaweq> if someone wants to debug it more, I can send a list of jobs which failed because of that :)
16:59:40 <mlavalle> send it to me
16:59:46 <slaweq> mlavalle: ok, thx
17:00:08 <slaweq> we have to finish now
17:00:12 <slaweq> thx for attending guys
17:00:15 <slaweq> #endmeeting