16:01:05 <slaweq> #startmeeting neutron_ci
16:01:05 <openstack> Meeting started Tue Jun  4 16:01:05 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:07 <slaweq> hi
16:01:08 <openstack> The meeting name has been set to 'neutron_ci'
16:01:21 <njohnston> o/
16:01:59 <bcafarel> hi again
16:02:13 <slaweq> I know that mlavalle will not be able to join this meeting today
16:02:21 <slaweq> so I think we can start now
16:02:32 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:54 <slaweq> #topic Actions from previous meetings
16:03:01 <slaweq> slaweq to continue work on fetch-journal-log zuul role
16:03:11 <slaweq> I did the patch https://review.opendev.org/#/c/643733/ and https://review.opendev.org/#/c/661915/ shows that it works
16:03:20 <slaweq> but I'm not sure if this will actually be needed
16:03:47 <slaweq> I noticed recently that the full journal log is already dumped in a devstack.journal.xz.gz file in the job's logs
16:04:21 <slaweq> but it's in the binary journal format, so You need to download it and use journalctl to examine it
16:04:46 <njohnston> good to know!
16:04:57 <slaweq> yeah, I had no idea about that before
16:05:16 <bcafarel> maybe we could use a role in the job definition to translate that file?
16:05:51 <slaweq> bcafarel: You mean to a text file?
16:06:15 <bcafarel> yes, to skip the download step
16:06:29 <bcafarel> for lazy folks like me :)
16:06:51 <slaweq> that's basically what the role I proposed in https://review.opendev.org/#/c/643733/ is doing
16:07:15 <slaweq> we can try to convince infra-root people that it may be useful :)
16:08:08 <clarkb> slaweq: I'm still of the opinion that we should capture the serialized version in its entirety and compress it well as it is a very large file
16:08:12 <fungi> it ends up taking up a lot of additional disk space in some jobs to keep two copies of the log, and the native version is useful for more flexible local filtering
16:08:23 <clarkb> slaweq: it gives you way more flexibility that way, with a small amount of setup
16:09:02 <clarkb> if we know we need specific service log files we can pull those out along with the openstack service logs
16:09:19 <clarkb> but for the global capture I think what we've got now with the export format works really well
16:09:48 <slaweq> clarkb: yes, but e.g. for neutron there are many different services, like keepalived, haproxy, dnsmasq, which are spawned per router or network
16:10:56 <clarkb> yup which is why the current setup is great. You can do journalctl -u keepalived -u haproxy -u dnsmasq -u devstack@q-agt and get only those logs interleaved
16:11:04 <clarkb> you can also set date ranges to narrow down what you are looking at
16:11:10 <clarkb> you cannot easily do that with the change you have proposed
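(For reference, once the devstack.journal.xz.gz artifact mentioned above is downloaded and unpacked - gunzip then unxz is an assumption based on the file extension - the filtering clarkb describes runs directly against the exported file; the unit names and the time range below are only examples:)
    gunzip devstack.journal.xz.gz && unxz devstack.journal.xz
    journalctl --file devstack.journal -u keepalived -u haproxy -u dnsmasq \
        -u devstack@q-agt --since "2019-06-02 14:10" --until "2019-06-02 14:20"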
16:11:28 <slaweq> clarkb: so as I said, I can just abandon my patch, as now I know that this log is available and how to get it
16:11:51 <slaweq> I think that this will be the best approach, what do You think?
16:11:52 <clarkb> oh I was going to suggest you update the role so that you don't have to use devstack to get that functionality
16:12:03 <clarkb> basically do what devstack does but in a reconsumable role in zuul-jobs
16:12:07 <clarkb> then we can apply it to non devstack jobs
16:12:36 <slaweq> clarkb: tbh I need it only for devstack-based jobs, so what we have now is enough for us :)
16:12:54 <njohnston> I bet we could easily create a grabjournal script that could fetch the journal for a specific job of a specific change and run that journalctl command on it, if our issue is just that we want to make it more accessible for developers
16:13:06 <clarkb> slaweq: ok
16:13:35 <slaweq> njohnston: initially I wasn't aware that this log already exists in the job's logs, so I proposed the patch
16:14:15 <slaweq> but now IMO the only "issue" is the accessibility of the log, and I think that is not something we should spend a lot of time on :)
16:14:32 <njohnston> +1 sounds good
16:15:04 <bcafarel> yes, I can survive a download+parse :)
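(A rough sketch of the grabjournal helper njohnston suggested above, covering the download+parse path; the script name, log URL layout, and artifact path are assumptions, not an existing tool:)
    #!/bin/bash
    # grabjournal: fetch a job's exported journal and run journalctl queries on it
    # usage: ./grabjournal <job-log-base-url> [journalctl filter args...]
    set -e
    base="$1"; shift
    wget -q "$base/controller/logs/devstack.journal.xz.gz"
    gunzip -f devstack.journal.xz.gz
    unxz -f devstack.journal.xz
    journalctl --file devstack.journal "$@"
(e.g. ./grabjournal http://logs.openstack.org/<change>/<patchset>/check/<job>/<run> -u keepalived)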
16:15:29 <slaweq> so my proposal is: let's use what is already there for now
16:15:47 <slaweq> and we will see if this will have to be improved somehow :)
16:16:20 <slaweq> ok, so let's move on to the next action
16:16:28 <slaweq> slaweq to remove neutron-tempest-plugin-bgpvpn-bagpipe from "neutron-tempest-plugin-jobs" template
16:16:32 <slaweq> Done: https://review.opendev.org/#/c/661899/
16:16:44 <slaweq> and the last one was:
16:16:46 <slaweq> mlavalle to debug neutron-tempest-plugin-dvr-multinode-scenario failures (bug 1830763)
16:16:48 <openstack> bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] https://launchpad.net/bugs/1830763 - Assigned to Miguel Lavalle (minsel)
16:17:04 <slaweq> but as mlavalle is not here now, I think we can assign it to him again for next week so we don't forget about it
16:17:09 <slaweq> #action mlavalle to debug neutron-tempest-plugin-dvr-multinode-scenario failures (bug 1830763)
16:17:31 <slaweq> do You have anything else to add regarding actions from last week?
16:18:12 <bcafarel> all good here
16:18:21 <slaweq> ok, let's move on
16:18:27 <slaweq> #topic Stadium projects
16:18:36 <slaweq> Python 3 migration
16:18:42 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:18:47 <slaweq> any updates on it?
16:19:06 <bcafarel> not from me, sorry, no progress here
16:19:16 <njohnston> me neither
16:19:26 <slaweq> same from my side
16:19:30 <slaweq> so next thing
16:19:37 <slaweq> tempest-plugins migration
16:19:41 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:20:03 <slaweq> for networking-bgpvpn both main patches are merged
16:20:20 <slaweq> but there was a need to do some follow-up cleanup which I forgot to do
16:20:30 <slaweq> so there is also https://review.opendev.org/#/c/662231/ waiting for review
16:20:35 <njohnston> oh, what needed to be cleaned up?
16:20:51 <njohnston> ah I see
16:20:55 <slaweq> and there was https://review.opendev.org/#/c/662142/ from masayukig but this one is already merged
16:21:05 <slaweq> so please also remember that in Your patches :)
16:21:23 <njohnston> +1
16:21:30 <slaweq> and that's all from my side about this
16:21:36 <slaweq> any updates on Your side?
16:21:47 <bcafarel> for sfc https://review.opendev.org/#/c/653747 is close to merging (second patch), pending some gate fixes
16:22:08 <bcafarel> so mostly gerrit work left to do :)
16:22:16 <slaweq> bcafarel: great :)
16:22:33 <njohnston> I've been letting fwaas sit, but I will probably be able to dedicate some time to it about a week from now
16:23:04 <slaweq> njohnston: great, if You need any help, ping me :)
16:24:19 <slaweq> ok, so let's move on to the next topic then
16:24:25 <slaweq> #topic Grafana
16:24:31 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:14 <slaweq> still IMO things look relatively good
16:25:28 <slaweq> the biggest problem we have is some failures in functional/fullstack tests
16:25:39 <njohnston> yep
16:25:41 <slaweq> and those ssh failures in tempest jobs
16:25:51 <slaweq> but I think they happen less often recently :)
16:26:58 <njohnston> I did have one general grafana question; in the "Number of Functional/Fullstack job runs (Gate queue)" graph I see that the lines diverge.  Does that mean we lost data somewhere?  I can't imagine a scenario where neutron-functional runs but not neutron-functional-python27
16:27:24 <njohnston> that's just one example; it happens from time to time elsewhere as well
16:29:05 <slaweq> good question njohnston but I don't know an answer
16:29:09 <haleyb> that is interesting, if it's only +/- 1 i could see it just being that one had finished at that point in time but the other hadn't
16:29:51 <slaweq> IMHO it may be some data missing from what infra collects - we have gaps on the graphs from time to time, so maybe it's something like that
16:29:56 <njohnston> that's why I picked that one: neutron-functional = 10; neutron-functional-python27 = 7; neutron-fullstack = 6
16:30:22 <njohnston> yeah, it's just weird, something to be aware of
16:31:01 <slaweq> I agree, thx njohnston for pointing this out
16:32:26 <slaweq> anything else regarding grafana in general?
16:32:29 <haleyb> speaking of failure rates, i did update the neutron-lib dashboard last week, not that we talk about it much here, just an FYI
16:32:41 <slaweq> haleyb: thx
16:32:48 <haleyb> it looks similar to this one and ovn now
16:32:53 <slaweq> I should take a look at it from time to time too
16:33:04 <haleyb> http://grafana.openstack.org/d/Um5INcSmz/neutron-lib-failure-rate?orgId=1
16:33:41 <slaweq> not too much data there yet :)
16:33:58 <haleyb> nope, not many failures or patches
16:35:23 <slaweq> some periodic jobs are failing constantly
16:35:32 <slaweq> which is maybe worth checking
16:36:11 <haleyb> i think they are known failures, but will look!
16:36:42 <slaweq> haleyb: thx a lot
16:36:46 <slaweq> ok, let's move on then
16:36:51 <slaweq> #topic fullstack/functional
16:37:13 <slaweq> I was looking at results of some failed jobs from last couple of days
16:37:25 <slaweq> and I found 2 new failed tests in the functional job
16:37:32 <slaweq> http://logs.openstack.org/78/653378/7/check/neutron-functional/c5ac6a3/testr_results.html.gz and
16:37:37 <slaweq> http://logs.openstack.org/82/659882/3/check/neutron-functional-python27/5e30908/testr_results.html.gz
16:38:04 <slaweq> but each of those I saw only once
16:38:16 <slaweq> did You maybe see something like that before?
16:39:01 <haleyb> i haven't seen it, but failed running sysctl in the first?
16:39:51 <slaweq> here are logs from this first failed test: http://logs.openstack.org/78/653378/7/check/neutron-functional/c5ac6a3/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_ha_router_unbound_from_agents.txt.gz#_2019-06-02_14_15_09_388
16:40:48 <haleyb> oh, rtnetlink error, that could be a bug?
16:41:39 <haleyb> i.e. delete_gateway should deal with it
16:42:00 <slaweq> yes, it's possible
16:42:10 <slaweq> I will try to look deeper into this log this week
16:42:20 <slaweq> maybe I will find something obvious to change :)
16:42:29 <njohnston> for the second one nothing jumps out at me but this one error: http://logs.openstack.org/82/659882/3/check/neutron-functional-python27/5e30908/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_assert_br_phys_patch_port_ofports_dont_change.txt.gz#_2019-05-30_08_52_08_873
16:42:55 <slaweq> #action slaweq to check logs with failed test_dvr_ha_router_unbound_from_agents functional test
16:43:25 <haleyb> slaweq: i know rodolfo had a patchset up to change this area to use privsep, so might want to start there
16:43:38 <slaweq> haleyb: good to know, thx
16:43:48 <slaweq> I will ask him when he is back
16:44:27 <haleyb> https://review.opendev.org/#/c/661981/
16:44:49 <haleyb> the errors will change after that, into some pyroute2 ones perhaps?
16:45:13 <slaweq> ok, I will keep it in mind, thx :)
16:46:49 <slaweq> njohnston: for the second one, this error can maybe be related
16:47:09 <slaweq> or maybe not :)
16:47:37 <njohnston> :)
16:48:13 <slaweq> njohnston: this test is stopping the agent: https://github.com/openstack/neutron/blob/86139658efdc739c6cc330304bdf4455613df78d/neutron/tests/functional/agent/test_l2_ovs_agent.py#L261
16:48:26 <slaweq> and IMO this "error" in the log is related to stopping the agent
16:48:56 <slaweq> so for me it doesn't look like a real problem at first glance
16:49:28 <njohnston> I agree
16:50:06 <slaweq> so let's keep this issue in mind; if it happens more often we will investigate it :)
16:50:16 <slaweq> do You agree?
16:51:36 <slaweq> ok, I guess that means yes :)
16:51:51 <slaweq> so that is all I had for today
16:52:05 <slaweq> do You have anything else You want to talk about today?
16:52:28 <haleyb> -1 from me
16:52:35 <bcafarel> :)
16:52:38 <bcafarel> nothing from me either
16:52:39 <haleyb> i'm hungry
16:52:54 <slaweq> haleyb: I'm tired
16:53:02 <slaweq> so let's finish a bit earlier today
16:53:09 <slaweq> thx for attending :)
16:53:12 <slaweq> o/
16:53:14 <bcafarel> o/
16:53:18 <slaweq> #endmeeting