16:00:37 <slaweq> #startmeeting neutron_ci
16:00:38 <openstack> Meeting started Tue Sep 18 16:00:37 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:39 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 <slaweq> hi
16:00:41 <openstack> The meeting name has been set to 'neutron_ci'
16:00:48 <mlavalle> o/
16:01:47 <slaweq> let's wait a few more minutes for others
16:01:53 <slaweq> maybe someone else will join
16:02:05 <mlavalle> np
16:02:28 <mlavalle> I didn't show up late, did I?
16:02:40 <slaweq> mlavalle: no, You were just on time :)
16:03:17 <mlavalle> I was distracted doing the homework you gave me last meeting and I was startled by the time
16:03:20 * haleyb wanders in
16:03:57 <slaweq> :)
16:04:50 <njohnston> o/
16:04:54 <slaweq> hi njohnston :)
16:04:58 <slaweq> let's start then
16:05:06 <njohnston> hello slaweq, sorry I am late - working on a bug
16:05:11 <slaweq> #topic Actions from previous meetings
16:05:18 <slaweq> njohnston: no problem :)
16:05:26 <slaweq> * mlavalle to talk with mriedem about https://bugs.launchpad.net/neutron/+bug/1788006
16:05:26 <openstack> Launchpad bug 1788006 in neutron "Tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Critical,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:05:31 <slaweq> I think it's done, right? :)
16:05:36 <mlavalle> we did and we fixed it
16:05:42 <mriedem> yar
16:05:42 <mlavalle> \o/
16:05:46 <mriedem> virt_type=qemu
16:05:52 <mlavalle> thanks mriedem
16:05:53 <slaweq> thx mriedem for help on that
16:06:12 <njohnston> \o/
16:06:19 <slaweq> ok, next one
16:06:22 <slaweq> * mlavalle continue debugging failing MigrationFromHA tests, bug https://bugs.launchpad.net/neutron/+bug/1789434
16:06:22 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:39 <mlavalle> manjeets took over that bug last week
16:06:56 <slaweq> yes, I saw, but I don't think his approach to fixing it is good
16:06:58 <manjeets> ++
16:07:00 <mlavalle> he literally stole it from my hands, despite my strong resistance ;-)
16:07:06 <slaweq> LOL
16:07:20 <slaweq> I can imagine that mlavalle :P
16:07:21 <mlavalle> manjeets: I am going to assign the bug to you, ok?
16:07:32 <manjeets> mlavalle, ++
16:07:52 <slaweq> I was also looking at it quickly over the weekend
16:08:13 <slaweq> and I think that it is again some race condition or something like that
16:08:30 <manjeets> It could be some subscribed callback as well ?
16:08:43 <slaweq> IMO this setting of ports to DOWN should come from the L2 agent, not directly from the L3 service plugin
16:09:26 <slaweq> because it is something like: neutron-server sends a notification to the L3 agent that the router is disabled, the L3 agent removes the ports, so the L2 agent updates the ports to DOWN status (probably)
16:09:48 <slaweq> and probably this notification isn't sent properly, and because of that the other things don't happen
16:10:22 <slaweq> haleyb: mlavalle does it make sense to You? or maybe I misunderstood something in this workflow?
16:11:06 <haleyb> slaweq: yes, that makes sense.  originally i thought it was in the hadbmode code, but l2 is maybe more likely
16:11:26 <mlavalle> agree
16:11:28 * manjeets takes a note, will dig into l2
16:11:29 <haleyb> but it's someone getting the event and missing something
16:12:34 <slaweq> give me a sec, I will check one thing related to that
16:14:37 <slaweq> so what I found was that when I was checking migration from Legacy to HA, ports were down after this notification: https://github.com/openstack/neutron/blob/master/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L55
16:14:53 <slaweq> in the case of migration from HA it didn't happen
16:15:17 <slaweq> but I didn't have more time at the airport to dig into it more
16:15:34 <manjeets> slaweq, you mean that notification wasn't called in the case of HA?
16:16:05 <slaweq> I think that it was called, but then the "hosts" list was empty and it wasn't sent to any agent
16:16:18 <slaweq> but it still has to be checked, I'm not 100% sure
16:16:58 <manjeets> i'll test that today
16:17:05 <slaweq> ok, thx manjeets
16:17:08 <manjeets> the hosts thing, whether it's empty in that case
16:17:17 <slaweq> so I will assign this to You as an action, ok?
16:17:24 <manjeets> sure !
16:17:27 <slaweq> thx
16:17:51 <slaweq> #action manjeets continue debugging why migration from HA routers fails 100% of times
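As a rough illustration of the notification fan-out slaweq and manjeets discuss above, here is a minimal Python sketch of the suspected failure mode; the helper names get_hosts_hosting_router and cast_routers_updated are hypothetical and are not neutron's actual API:

    def notify_router_updated(context, plugin, rpc_client, router_id):
        # Ask the plugin which hosts currently have an L3 agent hosting
        # this router.
        hosts = plugin.get_hosts_hosting_router(context, router_id)

        # Suspected failure mode: after migration *from* HA this list comes
        # back empty, so no cast is sent, no L3 agent removes the router's
        # ports, and the L2 agent never updates the port status.
        if not hosts:
            return

        for host in hosts:
            # One RPC cast per hosting agent.
            rpc_client.cast_routers_updated(context, [router_id], host=host)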
16:17:57 <slaweq> ok, lets move on
16:18:03 <slaweq> next one
16:18:05 <slaweq> * mlavalle to check issue with failing test_attach_volume_shelved_or_offload_server test
16:18:10 <manjeets> but the issue occurs after migration to HA, migration from HA worked
16:18:24 <manjeets> slaweq, it's migration to HA, from HA I think is fine
16:19:23 <slaweq> manjeets: http://logs.openstack.org/29/572729/9/check/neutron-tempest-plugin-dvr-multinode-scenario/21bb5b7/logs/testr_results.html.gz
16:19:25 <slaweq> first example
16:19:32 <slaweq> migration from HA to anything else always fails
16:19:56 <mlavalle> yes, that was my experience when I tried it
16:20:23 <manjeets> ah ok got it !
16:20:37 <slaweq> :)
16:20:47 <mlavalle> are we moving on then?
16:20:54 <slaweq> I think we can
16:21:13 <mlavalle> That bug was the reason I was almost late for this meeting
16:21:27 <mlavalle> we have six instances in kibana over the past 7 days
16:21:42 <mlavalle> two of them are with the experimental queue
16:21:45 <slaweq> so not too many
16:22:09 <mlavalle> njohnston is playing with change 580450 and the experimental queue
16:22:20 <njohnston> yes
16:22:24 <slaweq> so it's njohnston's fault :P
16:22:42 <njohnston> => fault <=
16:22:44 <njohnston> :-)
16:22:59 <mlavalle> the other failures are with neutron-tempest-ovsfw and tempest-multinode-full, which are non voting if I remember correctly
16:23:55 <slaweq> IIRC this issue was that the instance was not pingable after shelve/unshelve, right?
16:24:05 <mlavalle> yeah
16:24:10 <mlavalle> we get a timeout
16:24:29 <slaweq> maybe something has changed in nova since then and this is not an issue anymore?
16:25:20 <mlavalle> I'll dig a little longer
16:25:30 <mlavalle> before concluding that
16:25:45 <slaweq> ok
16:25:51 <mlavalle> for the time being I am just making the point that it is not hitting us very hard
16:26:03 <slaweq> that is good information
16:26:19 <slaweq> ok, let's assign it to You for one more week then
16:26:25 <mlavalle> yes
16:26:31 <slaweq> #action mlavalle to check issue with failing test_attach_volume_shelved_or_offload_server test
16:26:34 <slaweq> thx mlavalle
16:26:44 <slaweq> can we go to the next one?
16:27:04 * mlavalle has long let go of the hope of meeting El Comandante without getting homework
16:27:27 <slaweq> LOL
16:27:42 <slaweq> mlavalle: next week I will not give You homework :P
16:28:03 <mlavalle> np whatsoever.... just taking the opportunity to make a joke
16:28:14 <slaweq> I know :)
16:28:19 <slaweq> ok, let's move on
16:28:22 <slaweq> last one
16:28:24 <slaweq> njohnston to switch fullstack-python35 to python36 job
16:29:26 <slaweq> njohnston: are You around?
16:29:33 <njohnston> yes, I have a change for that up; I think it just needs a little love now that the gate is clear
16:29:47 <njohnston> I'll check it and make sure it's good to go
16:29:59 <slaweq> ok, thx njohnston
16:30:01 <slaweq> sounds good
16:30:18 <njohnston> https://review.openstack.org/599711
16:30:23 <slaweq> #action njohnston will continue work on switching the fullstack-python35 job to python36
16:30:31 <slaweq> ok, that's all from last meeting
16:31:00 <slaweq> #topic Grafana
16:31:05 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:32:07 <slaweq> I was checking grafana earlier today and there weren't many problems there
16:32:23 <slaweq> at least no problems which we are not aware of :)
16:32:30 <haleyb> not since we removed the dvr-multinode from the gate :(
16:32:39 <slaweq> haleyb: yes
16:33:28 <slaweq> so we still have neutron-tempest-plugin-dvr-multinode-scenario at 100% failures, but that's related to the issue with migration from HA routers
16:33:47 <slaweq> and the issue related to the grenade job
16:34:03 <slaweq> other things are, I think, in quite good shape now
16:34:43 <slaweq> I was also recently checking the reasons for some failures in tempest jobs, and they were usually issues with volumes (I don't have links to examples now)
16:34:45 <mlavalle> I have a question
16:34:50 <slaweq> sure mlavalle
16:35:17 <mlavalle> This doesn't have an owner: https://bugs.launchpad.net/neutron/+bug/1791989
16:35:18 <openstack> Launchpad bug 1791989 in neutron "grenade-dvr-multinode job fails" [High,Confirmed]
16:35:30 <slaweq> yes, sorry
16:35:35 <slaweq> I forgot to assign myself to it
16:35:40 <slaweq> I just did it now
16:35:42 <mlavalle> it is not voting for the time being, but we need to fix it right?
16:35:54 <slaweq> yes, I was checking that even today
16:35:59 <mlavalle> ah ok, question answered
16:36:02 <mlavalle> thanks
16:36:03 <slaweq> :)
16:36:28 <slaweq> and I wanted to talk about it now as it's the last point on my list for today :)
16:36:36 <mlavalle> ok
16:36:43 <slaweq> #topic grenade
16:36:49 <slaweq> so speaking about this issue
16:37:38 <slaweq> yesterday I pushed patch https://review.openstack.org/#/c/602156/6/playbooks/legacy/neutron-grenade-dvr-multinode/run.yaml to neutron
16:38:56 <slaweq> together with the depends-on from grenade https://review.openstack.org/#/c/602204/7/projects/60_nova/resources.sh it allowed me to log into at least the controller node in this job
16:39:18 <slaweq> so I tried it today and manually spawned the same VM as is spawned by the grenade script
16:39:37 <slaweq> and it all worked perfectly fine, the instance was pingable after around 5 seconds :/
16:40:05 <slaweq> so now I added some additional logs to this grenade script: https://review.openstack.org/#/c/602204/9/projects/60_nova/resources.sh
16:40:38 <slaweq> and I'm running this job once again: http://zuul.openstack.org/stream.html?uuid=928662f6de054715835c6ef9599aefbd&logfile=console.log
16:40:45 <slaweq> I'm waiting for results of it
16:41:23 <slaweq> I also compared the packages installed on nodes in such a failed job from this week with the packages installed before 7.09 on a job which passed
16:41:43 <slaweq> I have a list of packages which have different versions
16:41:59 <slaweq> there are different versions of libvirt, the linux kernel, qemu and openvswitch
16:42:08 <slaweq> so many potential culprits
16:42:35 <slaweq> I think I will start with downgrading libvirt, as it was updated in the cloud-archive repo on 7.09
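A minimal sketch of that kind of package comparison, assuming two "name version" dumps (e.g. extracted from dpkg -l output) saved from a passing and a failing run; the file names are hypothetical:

    def load_versions(path):
        # Parse "name version" pairs, one package per line.
        versions = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    versions[parts[0]] = parts[1]
        return versions

    good = load_versions("packages-passing.txt")   # hypothetical file names
    bad = load_versions("packages-failing.txt")

    # Print every package whose version differs between the two runs.
    for name in sorted(set(good) | set(bad)):
        if good.get(name) != bad.get(name):
            print("{}: {} -> {}".format(name, good.get(name, "-"), bad.get(name, "-")))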
16:42:47 <haleyb> slaweq: we will eventually figure that one out!
16:43:04 <slaweq> any ideas what else I can check/test/do here?
16:43:36 <mlavalle> checking packages seems the right way to go
16:44:33 <slaweq> yes, so I will try to send some DNM patches with each of those packages downgraded (except the kernel) and will try to recheck them a few times
16:44:45 <haleyb> yes, other than that you can keep adding debug commands to the script - eg for looking at interfaces, routes, ovs, etc, but packages is a good first step
16:44:48 <slaweq> and see if the issue still happens with each of them
16:45:21 <slaweq> haleyb: yes, I just don't know if it's possible (and how to do it) to run such commands on the subnode
16:45:56 <slaweq> so currently I only added some OSC commands to check the status of the instance/port/FIP at the control plane level
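The control-plane checks mentioned here are roughly of this shape: a sketch that shells out to standard OpenStackClient commands (openstack server show / port list / floating ip list); the server name is a placeholder, not the one used in the grenade script:

    import subprocess

    SERVER = "grenade-test-server"  # placeholder instance name

    # Dump server, port and floating IP state to the job log; keep going
    # even if one of the commands fails.
    for cmd in (
        ["openstack", "server", "show", SERVER],
        ["openstack", "port", "list", "--server", SERVER],
        ["openstack", "floating", "ip", "list"],
    ):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=False)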
16:47:12 <slaweq> so I will continue debugging of this issue
16:47:28 <slaweq> ohh, one more thing, yesterday haleyb and I also spotted it in a stable/pike job
16:47:58 <slaweq> and when I was looking for this issue in logstash, I found that it happened a couple of times in stable/pike
16:48:09 <slaweq> less often than in master, but it still happened there too
16:48:11 <mlavalle> nice catch
16:48:45 <slaweq> the strange thing is that I didn't see it on the stable/queens or stable/rocky branches
16:49:27 <slaweq> so as this is failing on the "old" openstack, this means it fails with neutron from the stable/rocky and stable/ocata branches
16:50:47 <slaweq> and that's all as a summary of this f..... issue :/
16:50:58 <mlavalle> LOL
16:51:09 <slaweq> I will assign it to me as an action for this week
16:51:15 <mlavalle> I see you are learning some French
16:51:28 <slaweq> #action slaweq will continue debugging multinode-dvr-grenade issue
16:51:40 <slaweq> mlavalle: it can be "French" :P
16:52:10 * slaweq is becoming Hulk when he has to deal with the grenade multinode issue ;)
16:52:21 <njohnston> LOL!
16:52:43 <slaweq> ok, that's all from me about this issue
16:52:52 <slaweq> #topic Open discussion
16:53:03 <slaweq> do You have anything else to talk about?
16:53:32 <njohnston> I just sent an email to openstack-dev to inquire about the python3 conversion status of tempest and grenade
16:53:52 <slaweq> thx njohnston, I will read it after the meeting then
16:54:01 <njohnston> if those conversions have not happened yet, and further if they need to be done globally, that could be interesting.
16:55:00 <njohnston> But I'll try not to borrow trouble.  That's it from me.
16:55:08 <slaweq> speaking about emails, I want to ask mlavalle one thing :)
16:55:19 <mlavalle> ok
16:55:37 <slaweq> do You remember to send the email about adding some 3rd party project jobs to neutron?
16:55:52 <mlavalle> yes
16:55:57 <slaweq> ok, great :)
16:56:50 <slaweq> ok, so if there is nothing else to talk about, I think we can finish now
16:56:58 <mlavalle> Thanks!
16:57:02 <slaweq> thanks for attending
16:57:06 <slaweq> and see You next week
16:57:07 <slaweq> o/
16:57:11 <slaweq> #endmeeting