16:00:37 <ihrachys> #startmeeting neutron_ci
16:00:38 <openstack> Meeting started Tue Jul 11 16:00:37 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:39 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:41 <openstack> The meeting name has been set to 'neutron_ci'
16:00:49 <jlibosva> o/
16:01:01 <ihrachys> hi jlibosva
16:01:06 * ihrachys waves at haleyb too
16:01:35 <ihrachys> we haven't had a meeting for a while
16:01:45 <ihrachys> #topic Actions from prev week
16:01:51 <jlibosva> week :)
16:02:12 <ihrachys> more than a week no? anyhoo.
16:02:16 <ihrachys> first AI was "jlibosva to craft an email to openstack-dev@ with func-py3 failures and request for action"
16:02:28 <ihrachys> I believe that we made significant progress for py3 for func tests
16:02:36 <ihrachys> jlibosva, can you briefly update about latest?
16:02:48 <jlibosva> yep, team did a great job and took down failures pretty quick
16:03:01 <jlibosva> last time I checked we had a single failure that's likely caused by a bug in eventlet
16:03:28 <jlibosva> some thread switch takes too long - increasing the timeout for that particular test helps: https://review.openstack.org/#/c/475888/
16:03:34 <jlibosva> but that's not a correct way to go
16:03:49 <jlibosva> I planned to reach out to some eventlet peeps but I haven't yet
16:03:57 <ihrachys> who would be the peeps?
16:04:34 <jlibosva> I'll try vstinner first
16:05:24 <ihrachys> ack
16:05:34 <ihrachys> cool, seems like we are on track with it
16:05:45 <jlibosva> I can take an AI till the next mtg
16:05:47 <ihrachys> #action jlibosva to reach out to Victor Stinner about eventlet/py3 issue with functional tests
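For illustration, a toy sketch of the timeout-bump workaround pattern discussed above; this is not the actual failing test from the review, just a minimal eventlet example.

    import eventlet

    def test_event_arrives_within_timeout():
        done = eventlet.Event()
        eventlet.spawn(done.send, 'ok')
        # Raising this value (say from 5 to 30 seconds) masks a slow
        # greenthread switch instead of fixing it, which is why the bump
        # above is described as not the correct way to go.
        with eventlet.Timeout(30):
            assert done.wait() == 'ok'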
16:06:04 <ihrachys> next AI was "jlibosva to talk to otherwiseguy about isolating ovsdb/ovs agent per fullstack 'machine'"
16:06:18 <jlibosva> so that's an interesting thing
16:06:25 <ihrachys> jlibosva, I was trying to find the patch that otherwiseguy had for ovsdbapp with the test fixture lately and couldn't find it. have a link?
16:06:29 <otherwiseguy> ihrachys, we've been talking about it quite a bit today. :p
16:06:50 <jlibosva> ihrachys: it's hidden :)
16:06:52 <jlibosva> here https://review.openstack.org/#/c/470441/30/ovsdbapp/venv.py
16:07:19 <ihrachys> oh ok that patch. I expected a separate one.
16:07:25 <otherwiseguy> yeah, that should have been separate. :(
16:07:35 <jlibosva> I was actually doing some coding and I'm trying to use the ovsdb-server in a sandbox
16:07:57 <jlibosva> my concern was whether we'll be able to connect "nodes"
16:08:14 <jlibosva> meaning that bridges in one sandbox must be reachable by bridges from the other sandbox
16:08:40 <jlibosva> it seems the ovs_system sees all entities in the sandboxes - so I hope we're on a good track
16:09:05 <ihrachys> ok cool. how do you test it? depends-on won't work until new lib is released right?
16:09:13 <jlibosva> currently I have some code that runs ovs agent, each using its own ovsdb-server
16:09:49 <jlibosva> I haven't pushed anything to gerrit yet so I have an egg-link pointing to ovsdbapp dir
16:09:55 <ihrachys> ah ok.
16:10:05 <jlibosva> but generally, yeah, for gate we'll need a new ovsdbapp release
16:10:11 <ihrachys> otherwiseguy, can we get it split separately this week?
16:10:34 <otherwiseguy> It's entirely possible that I will have enough in for 1.0 this week.
16:10:41 <otherwiseguy> maybe 0.99 just to be safe. :p
16:10:49 <ihrachys> otherwiseguy, but this patch is not in yet right?
16:10:55 <ihrachys> oh it is
16:10:57 <ihrachys> sorry
16:11:02 <otherwiseguy> the venv patch is.
16:11:11 <otherwiseguy> just not in a release.
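A rough sketch of how a fullstack "machine" might consume the venv fixture from the ovsdbapp patch above; the constructor argument and the ovsdb_connection attribute are assumptions for illustration, not the actual ovsdbapp API.

    import fixtures
    from ovsdbapp import venv

    class FakeFullstackMachine(fixtures.Fixture):
        """Each simulated node gets its own OVS sandbox."""

        def _setUp(self):
            # One private ovsdb-server per fake node, so agents on different
            # nodes no longer share the host-wide OVSDB (assumed signature).
            self.ovs_venv = self.useFixture(venv.OvsVenvFixture('/tmp/node-1'))
            # The OVS agent under test would then be pointed at this
            # sandbox's connection instead of the system-wide one
            # (ovsdb_connection is an assumed attribute name).
            self.ovsdb_connection = self.ovs_venv.ovsdb_connection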
16:11:31 <jlibosva> I have some WIP for ovsdbapp too, to not start ovn schemas, would be nice to get it in release too, if the patch makes sense
16:11:41 <jlibosva> https://github.com/cubeek/ovsdbapp/commit/0f51ab16ec72a7033057740d928c599ba3cd7fc6?diff=split
16:11:53 <ihrachys> jlibosva, why github fork?
16:12:07 <jlibosva> ihrachys: to show the WIP patch
16:12:27 <jlibosva> ihrachys: it's 2 hours old :)
16:12:41 <ihrachys> the idea of the patch makes a lot of sense. I think we discussed that before.
16:12:52 <ihrachys> please post to gerrit so that we can bash it
16:13:18 <jlibosva> oh did we? maybe I forgot, I just wanted to have the small minimum for my fullstack work
16:13:25 <ihrachys> #action jlibosva to post patch splitting OVN from OvsVenvFixture
16:13:26 <jlibosva> I'll push it once I polish it
16:13:51 <jlibosva> I'm also not sure e.g. if vtep belongs to ovn or ovs ...
16:14:01 <ihrachys> jlibosva, two weeks ago while drinking beer. I am not surprised some details could be forgotten :)
16:14:17 <jlibosva> damn you beer
16:14:35 <ihrachys> nah. yay beer. it spurred discussion in the first place.
16:15:00 <ihrachys> ok, we'll wait for your patch on gerrit and then whine about it there
16:15:25 <ihrachys> nice work otherwiseguy btw, it's a long standing issue for fullstack and you just solved it
16:15:40 * otherwiseguy crosses his fingers
16:15:40 <ihrachys> next AI was on me "ihrachys to update about functional/fullstack switch to devstack-gate and rootwrap"
16:16:08 <jlibosva> otherwiseguy++
16:16:11 <ihrachys> so, to unblock the gate, we landed the patch switching fullstack and functional test runners to rootwrap, then landed the switch of those gates to devstack-gate
16:16:15 <otherwiseguy> jlibosva++
16:16:36 <ihrachys> which resulted in breakage of the fullstack job because some tests are still apparently not using rootwrap correctly
16:16:50 <ihrachys> which is the reason why fullstack is 100% failing in grafana ;)
16:17:13 <ihrachys> I didn't have time to look at it till now. I should have some before the next meeting.
16:17:30 <ihrachys> #action ihrachys to look at why fullstack switch to rootwrap/d-g broke some test cases
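To make the failure mode concrete, a hedged sketch of how a test helper can route privileged commands through rootwrap; the environment variable name and helper path are assumptions, not necessarily what the neutron tree uses.

    import os
    import shlex
    import subprocess

    # Assumed variable name and default path, for illustration only.
    ROOT_HELPER = os.environ.get(
        'OS_ROOTWRAP_CMD',
        'sudo /usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf')

    def execute_as_root(cmd):
        """Run cmd via the rootwrap helper instead of bare sudo."""
        return subprocess.check_output(shlex.split(ROOT_HELPER) + cmd)

    # A test that still hard-codes a plain ['sudo', ...] prefix, or runs a
    # binary with no matching rootwrap filter, breaks once the job only
    # allows sudo for the rootwrap entry point: the kind of breakage
    # mentioned above.
    execute_as_root(['ip', 'netns', 'list'])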
16:17:41 <ihrachys> next was "haleyb to continue looking at prospects of dvr+ha job"
16:17:51 <haleyb> yes, i'm here
16:18:11 <ihrachys> the last time we talked about it you were going to watch the progress of the new job
16:18:37 * ihrachys looks at grafana
16:19:02 <haleyb> the job looks ok, it is still non-voting of course
16:19:56 <haleyb> it can still be higher than the dvr-multinode job
16:19:58 <ihrachys> I see it's 25%+ right now. is it ok?
16:20:28 <haleyb> when i looked yesterday it was lower, need to refresh
16:21:27 <haleyb> ihrachys: maybe it's time to split that check queue panel into two - grenade and tempest
16:21:49 <ihrachys> yeah I guess that could help. it's a mess right now.
16:21:57 <ihrachys> haleyb, will you post a patch?
16:22:44 <haleyb> sure i can do that.  i don't know why it's failing more now, i'll have to look further
16:23:16 <ihrachys> #action haleyb to split grafana check dashboard into grenade and tempest charts.
16:23:37 <ihrachys> haleyb, re failure rate, even the dvr one seems to be at too high a level
16:23:43 <haleyb> as part of the job reduction i was going to suggest making it voting to replace the dvr-multinode
16:24:36 <haleyb> yes, they usually track each other.  last time i saw higher failures it was a node setup issue, which is more likely to happen the more nodes we use
16:24:56 <ihrachys> it's 3 nodes for ha right?
16:25:05 <haleyb> yes, 3 nodes versus 2
16:25:46 <ihrachys> ack. yeah, replacing would be the end goal. if we know for sure it's just node setup thing, I think we can make the call to switch anyway. we should know though.
16:26:05 <ihrachys> #action haleyb to continue looking at dvr-ha job failure rate and reasons
16:26:13 <haleyb> i will have to go to logstash to see what's failing and if it's not just bad patches, since it is the check queue and not gate
16:26:22 <haleyb> :)
16:26:48 <ihrachys> +
16:26:52 <ihrachys> next was "ihrachys to talk to qa/keystone and maybe remove v3-only job"
16:27:13 <ihrachys> this is done as part of https://review.openstack.org/474733
16:27:30 <ihrachys> next is "haleyb to analyze all the l3 job flavours in gate/check queues and see where we could trim"
16:27:37 <ihrachys> we already touched on it somewhat
16:27:44 <haleyb> let me cut/paste a comment
16:27:58 <haleyb> regarding the grenade gate queue
16:28:01 <haleyb> grenade-dsvm-neutron and grenade-dsvm-neutron-multinode are both
16:28:01 <haleyb> voting.  Propose we remove the single-node job, multinode will
16:28:01 <haleyb> just need a small Cells v2 tweak in its config.  This is
16:28:01 <haleyb> actually two jobs less since there's a -trusty and -xenial.
16:28:10 <haleyb> doh, that pasted bad
16:28:17 <ihrachys> clarkb and other infra folks were eager to see progress on it because they had some issues with log storage disk space.
16:28:56 <ihrachys> haleyb, trusty is about to go if not already since it was newton only, and mitaka is EOL now
16:28:57 <haleyb> basically there are single-node and multinode jobs, i think we can just use the multinode ones
16:29:16 <clarkb> ihrachys: trusty should mostly be gone at this point
16:29:20 <haleyb> ihrachys: i was going to ask about -trusty, that would be a nice cleanup
16:29:28 <ihrachys> haleyb, since the single node job is part of integrated gate, do you imply that we do the replacement for all projects?
16:29:36 <clarkb> if you notice any straggler trusty jobs let us know and we can help remove them
16:29:57 <ihrachys> clarkb, nice
16:30:03 <haleyb> neutron has a bunch
16:30:12 <clarkb> haleyb: still running as of today?
16:30:20 <clarkb> most of the cleanup happened late last week
16:30:46 <haleyb> ihrachys: i would think having multi-node is better than single-node, and more like a real setup
16:31:07 <haleyb> clarkb: i don't know, just see them on the grafana dashboard
16:31:17 <haleyb> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:30 <clarkb> haleyb: ya I think we kept them in the dashboard so you don't lose the historical data?
16:31:32 <ihrachys> grafana may have obsolete entries, not everyone is even aware of its existence in project-config
16:31:37 <clarkb> but we may have missed things
16:31:55 <ihrachys> clarkb, we use those dashboards for current state so it's ok to clean them up
16:32:16 <clarkb> as for making multinode default in the integrated gate, I've been in favor of it because like you say it is more realistic, but resistance/concern has been that it will be even more difficult for developers to reproduce locally and debug
16:32:31 <clarkb> now you need 16GB of ram and ~160GB of disk just to run a base test
16:32:49 <ihrachys> haleyb, since you are going to do some cleanup there anyway, I will put that on you too ;)
16:32:54 <clarkb> but I think it's worth revisiting that discussion because maybe that is necessary and we take that trade off (also how many people reproduce locally?)
16:33:00 <ihrachys> #action haleyb to clean up old trusty charts from grafana
16:33:45 <haleyb> ok, np
16:34:26 <haleyb> clarkb: yeah, i was looking more at the failure rates, which are about the same, and that's what matters more
16:34:59 <haleyb> the grenade jobs were the only ones with this overlap
16:35:48 <haleyb> the other thing is we would still have the multinode and dvr-multinode jobs, but i had a thought on that
16:35:49 <ihrachys> clarkb, I think the switch would fit nicely in your goal of reducing the number of jobs. where would we start from to drive it? ML? I guess folks would want to see stats on stability of the job before committing to anything?
16:36:11 <clarkb> yes I think the ML is good place to start. QA team in particular had concerns
16:36:21 <clarkb> including stats on stability would be good
16:36:41 <haleyb> since we don't want to reduce coverage on non-dvr code, i was wondering if it was possible to use the dvr setup, but also run tests with a "legacy" router
16:36:49 <clarkb> and maybe an argument for how it is more realistic, eg which code paths can we test on top of the base job (metadata proxy, live migration come to mind)
16:36:54 <ihrachys> ok. I guess it may make sense to focus on neutron state for a bit to prove to ourselves it's a good replacement, then go to broader audience.
16:38:05 <haleyb> so i can send something to the ML regarding the grenade jobs
16:38:37 <clarkb> and I can respond with info on log server retention and trying to get that under control
16:38:47 <clarkb> and how reducing job counts will help
16:38:59 <ihrachys> haleyb, in theory, each test class could be transformed into a scenario class passing different args to create_router (scenarios could be generated from the list of api extensions, except for dvr, which may incorrectly indicate support, at least before pike)
16:39:42 <ihrachys> ok, let's start a discussion on grenade reduction now, we can polish multinode job in parallel
16:39:44 <haleyb> ihrachys: right, the only problem could be that only the admin can create non-dvr routers in a dvr setup
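A minimal sketch of the scenario idea above, using testscenarios-style generation; the router arguments and the test body are illustrative stand-ins, and as haleyb notes, non-default flavours would need admin credentials on a DVR cloud.

    import testscenarios
    import testtools

    class RouterFlavourTest(testscenarios.WithScenarios, testtools.TestCase):

        scenarios = [
            ('legacy', {'router_args': {'distributed': False, 'ha': False}}),
            ('dvr', {'router_args': {'distributed': True, 'ha': False}}),
            ('dvr_ha', {'router_args': {'distributed': True, 'ha': True}}),
        ]

        def test_router_flavour(self):
            # A real test would call create_router(**self.router_args) via
            # the admin client and then run the usual connectivity checks;
            # here we only show how one class fans out into per-flavour tests.
            self.assertIn('distributed', self.router_args)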
16:40:02 <ihrachys> #action haleyb to spin up a ML discussion on replacing single node grenade job with multinode in integrated gate
16:40:28 <haleyb> unfortunately the tempest gate didn't have the overlap the grenade one did
16:42:16 * haleyb stops there since every time he talks he gets another job :)
16:42:28 <ihrachys> haleyb, I would imagine tempest core repo, being a certification tool, may not want to see dvr/ha specific scenarios.
16:43:01 <ihrachys> haleyb, haha. well we may find other candidates for some items that are on you. speak up. :)
16:43:51 <haleyb> ihrachys: nah, the grafana and jobs are easy, digging into my other dvr option would be harder
16:43:59 <ihrachys> ok ok
16:44:05 <ihrachys> and just to piss you off
16:44:05 <ihrachys> #action haleyb to continue looking at places to reduce the number of jobs
16:44:09 <ihrachys> :p
16:44:21 <ihrachys> ok those were all items we had
16:44:26 <ihrachys> #topic Grafana
16:44:35 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:44:41 <ihrachys> we somewhat discussed that before
16:44:58 <ihrachys> one thing to spot there though is that the functional job is in bad shape, it seems
16:45:07 <ihrachys> it's currently ~25-30% in gate
16:45:44 <ihrachys> I checked both https://bugs.launchpad.net/neutron/+bugs?field.tag=functional-tests&orderby=-id&start=0 and https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0 for any new bug reports that could explain it, with no luck
16:45:49 <ihrachys> I also checked some recent failures
16:46:04 <ihrachys> of those I spotted some were for https://bugs.launchpad.net/neutron/+bug/1693931
16:46:05 <openstack> Launchpad bug 1693931 in neutron "functional test_next_port_closed test case failed with ProcessExecutionError when killing netcat" [High,Confirmed]
16:47:02 <ihrachys> but I haven't done complete triage
16:47:09 <ihrachys> I will take it on me to complete it asap
16:47:24 <ihrachys> #action ihrachys to complete triage of latest functional test failures that result in 30% failure rate
16:47:40 <ihrachys> anyone aware of late issues with the gate that could explain it?
16:48:48 <ihrachys> I guess not. ok I will look closer.
16:49:17 <ihrachys> one other tiny thing that bothers me every time I look at grafana is - why do we have postgres job in periodics?
16:49:55 <ihrachys> it doesn't seem like anyone really cares, and the TC plans to state explicitly that psql is a second-class citizen in openstack
16:50:01 * ihrachys wonders if we need it there
16:50:18 <ihrachys> I would be fine to have it there if someone would work on the failures.
16:50:37 <ihrachys> btw I talk about http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=4&fullscreen
16:50:56 <ihrachys> it's always 100% and makes me click each of the other job names to see that their results are 0%
16:51:23 <jlibosva> just send a patch to remove it and let's see who will complain :)
16:52:08 <ihrachys> ok ok
16:52:21 <ihrachys> #action ihrachys to remove pg job from periodics grafana board
16:52:46 <ihrachys> finally, fullstack is 100% failing but I said I will have a look, so moving on
16:52:47 <ihrachys> #topic Gate bugs
16:52:52 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure&orderby=-id&start=0
16:52:58 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1696690
16:52:58 <openstack> Launchpad bug 1696690 in neutron "neutron fails to connect to q-agent-notifier-port-delete_fanout exchange" [Undecided,Confirmed]
16:53:03 <ihrachys> this was reported lately
16:53:23 <ihrachys> seems like affecting ironic
16:55:13 <ihrachys> it seems like a fanout queue was not created
16:55:23 <ihrachys> but shouldn't neutron-server itself initialize it on start?
16:58:15 <ihrachys> ok doesn't seem anyone has an idea :)
16:58:21 * jlibosva ¯\_(ツ)_/¯
16:58:41 <ihrachys> seems like something ironic/grenade specific, and they may in the end need to be more active in poking us in our channel to get more traction.
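For context on where the exchange name in the bug comes from, a hedged sketch of the server-side fanout cast with oslo.messaging; the rabbit driver derives the '<topic>_fanout' exchange name from the topic, and agents consume from it. The topic and method names below mirror the bug title, but the surrounding code is illustrative only.

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='q-agent-notifier-port-delete')
    client = oslo_messaging.RPCClient(transport, target)

    def notify_port_delete(context, port_id):
        # fanout=True publishes to the 'q-agent-notifier-port-delete_fanout'
        # exchange on the rabbit driver; each listening agent gets a copy.
        cctxt = client.prepare(fanout=True)
        cctxt.cast(context, 'port_delete', port_id=port_id)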
16:58:56 <ihrachys> we have little time, so let's take those 2 mins we have back
16:59:00 <ihrachys> thanks everyone
16:59:03 <ihrachys> #endmeeting