#openstack-meeting log

16:01:09 <ihrachys> #startmeeting neutron_ci
16:01:10 <openstack> Meeting started Tue Sep 26 16:01:09 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:14 <openstack> The meeting name has been set to 'neutron_ci'
16:01:15 <ihrachys> jlibosva, mlavalle, haleyb o/
16:01:19 <mlavalle> o/
16:01:19 <jlibosva> o/
16:01:24 <ihrachys> and to whoever else joined
16:01:24 <haleyb> hi
16:01:39 <ihrachys> #topic Actions from prev week
16:01:40 <slaweq_> hello
16:01:49 <ihrachys> hi slaweq_ !
16:02:05 <ihrachys> traditionally, we walk through action items from the previous week
16:02:10 <ihrachys> first item was "jlibosva to talk to armax about enabling test_convert_default_subnetpool_to_non_default in gate"
16:02:15 <jlibosva> I did!
16:02:31 <ihrachys> fill us in!
16:02:43 <jlibosva> tldr; it's not possible because default subnetpool is also used for nova tests
16:03:08 <ihrachys> ok then how is it supposed to be executed by a tempest consumer?
16:03:12 <jlibosva> and it would be very hacky to have both auto-allocated-topology tests + nova tests and other relying on default subnet pool and defautl subnetpool
16:03:21 <ihrachys> do users choose one or another?
16:03:38 <jlibosva> as per armax it was a mistake to merge the default subnetpool API tests
16:04:01 <jlibosva> so I made this patch https://review.openstack.org/#/c/507487/
16:04:02 <patchbot> patch 507487 - neutron - Replace default subnetpool API tests with UT
16:04:24 <jlibosva> it moves the tests from API to unittests using the extension tests
16:04:40 <jlibosva> as the current default subnetpool API tests touch only API and DB layer, the coverage is kept
16:05:21 <ihrachys> great, adding to my review queue
16:05:33 <ihrachys> next action item was also on jlibosva
16:05:34 <slaweq_> jlibosva: I'm not expert but can't it be moved to fullstack?
16:05:35 <ihrachys> "jlibosva to prepare a list of scenario related fixes"
16:05:58 <ihrachys> let's address slaweq_ question first
16:06:08 <jlibosva> slaweq_: yes, that was also one of the options but given the current state of fullstack I thought UT should be sufficient
16:06:24 <ihrachys> it's not that bad ;)
16:06:29 <ihrachys> I mean, we are getting closer
16:06:32 <slaweq_> ok, UT are voting at least :)
16:06:44 <slaweq_> but I just wanted to ask
16:06:51 <jlibosva> slaweq_: it's a good point :)
16:07:14 <ihrachys> I think it's fine to move with what you have. fullstack could be a nice follow up
16:07:29 <mlavalle> ++
16:07:33 <ihrachys> aka never happen
16:07:36 <ihrachys> :)
16:07:50 <ihrachys> ok, let's look at the next item. "jlibosva to prepare a list of scenario related fixes"
16:08:13 <ihrachys> I was hoping to have the list for the team meeting today but seems like we did not have one.
16:08:19 <ihrachys> haleyb was going to help jlibosva with the task
16:08:25 <haleyb> i have two
16:08:29 <ihrachys> two lists?
16:08:46 <jlibosva> I was looking at the fixes related to the dvr stuff, I somehow mixed the things ...
16:08:52 <haleyb> two reviews that help with scenarios jobs
16:09:16 <haleyb> https://review.openstack.org/#/c/500384/ - check router interface before ssh
16:09:16 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:09:32 <haleyb> https://review.openstack.org/#/c/505324
16:09:33 <patchbot> patch 505324 - neutron - DVR: Fix unbound fip port migration to bound port
16:09:46 <ihrachys> anilvenkata, do you plan to respin https://review.openstack.org/#/c/500384/ this week?
16:09:47 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:10:03 <anilvenkata> ihrachys, yes, doing it now
16:10:39 <ihrachys> great
16:10:51 <haleyb> ihrachys: i am reviewing swami's patch again now, but with it there are very few scenario failures, think 3 now
16:11:01 <ihrachys> haleyb, looking at the second patch. why is it every time we need to fix a bug in dvr, it ends up like a 200+ line patch? :)
16:11:07 <ihrachys> I am scared each time
16:11:40 <haleyb> it's not all butterflies and rainbows
16:12:16 <ihrachys> linuxbridge job seems to show more failures than ovs
16:12:19 <ihrachys> http://logs.openstack.org/24/505324/10/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/e1a0b47/logs/testr_results.html.gz
16:12:22 <haleyb> if we didn't do migration it would be real simple :)
16:12:23 <ihrachys> vs. http://logs.openstack.org/24/505324/10/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/c2e8825/testr_results.html.gz
16:12:37 <ihrachys> so we also have trunks and qos
16:13:00 <ihrachys> but some of those could be the iptables issue that I spotted today
16:13:06 <ihrachys> (I will update later in the meeting)
16:13:41 <ihrachys> haleyb, thanks for the update on the patches. would be nice if we land them this week.
16:13:53 <haleyb> ihrachys: oh, and there is one more
16:14:20 <haleyb> https://review.openstack.org/#/c/500143/
16:14:21 <patchbot> patch 500143 - neutron - DVR: Always initialize floating IP host
16:14:37 <haleyb> final fix for agent, but maybe that's more dvr-multinode related, getting ahead of myself
16:14:41 <mlavalle> ihrachys: land the 3 of them?
16:15:46 <ihrachys> mlavalle, yeah. 2 of them will require l3 experts to chime in. too complex for my 1oz brain
16:16:15 <mlavalle> lol, I'll keep an eye on them
16:16:23 <ihrachys> haleyb, the scenario job is dvr-multinode so it would affect it
16:16:45 <ihrachys> the patch from anilvenkata - https://review.openstack.org/#/c/500384/ - reveals some weirdness (to my taste) in api behavior. the change seems fine, but the number of things that api user is supposed to check to make sure data plane is healthy after an update is enormous. is it really the best we can do? see: https://review.openstack.org/#/c/500384/7/neutron/tests/tempest/scenario/test_migration.py
16:16:46 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:16:47 <patchbot> patch 500384 - neutron - tempest: check router interface exists before ssh
16:17:17 <ihrachys> we check ports, fips, ports deleted...
16:17:44 <ihrachys> I started to think that we need some attribute to expose whether router is ready
16:17:47 <ihrachys> maybe reuse status
16:17:47 <haleyb> ihrachys: i saw your comment regarding PENDING_UPDATE, just don't know how invasive that would be
16:19:00 <haleyb> i'm just thinking of a case where the server changes the state, informs agent, but reply lost.  server would need to re-send
16:19:42 <ihrachys> it would stay pending until they recover somehow. data plane would still be functional (to the point it can be functional with the update not applied)
16:20:19 <haleyb> but i'm willing to try and make it better
16:20:50 <ihrachys> great, thanks a lot
16:20:59 <mlavalle> yeah, thanks
16:21:00 <anilvenkata> haleyb, we have similar thing for router state change
16:21:12 <anilvenkata> haleyb, if agent's reply is lost, it will periodically retry
16:21:18 <anilvenkata> haleyb, Ann has implemented it
16:21:47 <ihrachys> anilvenkata, is it merged or up for review?
16:22:16 <anilvenkata> ihrachys, that is only regarding HA router active/backup state change
16:23:29 <anilvenkata> ihrachys, not related to data plane status change
16:23:34 <ihrachys> ok. we didn't have any more items
16:23:37 <ihrachys> let's move on
16:23:48 <ihrachys> #topic Grafana
16:23:51 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:25:09 <ihrachys> (boy it takes a while to load it)
16:25:43 <mlavalle> yeap
16:25:45 <ihrachys> so first, our all time friends - scenario and fullstack - are still close to 100%
16:25:54 <ihrachys> the good news is that fullstack is now not 100% all the time
16:25:58 <ihrachys> more like 80%
16:26:06 <ihrachys> so the latest fixes helped a bit
16:26:30 <ihrachys> scenarios are not that lucky
16:26:49 <ihrachys> we had a blip with rally breakages earlier in the week but it's back to normal now
16:27:15 <ihrachys> those guys apparently don't have enough test coverage for the code so regressions happen, and since we feed from their master, it breaks us once in a while
16:28:19 <ihrachys> tl;dr nothing new, we can move to discuss specific day-to-day offenders
16:28:23 <ihrachys> #topic Scenarios
16:28:31 <ihrachys> we already discussed a bit the dvr/migration fixes
16:29:12 <ihrachys> one thing that I noticed today while looking at results for the new sec group scenario is that linuxbridge job seems to be busted because of a iptables issue
16:29:21 <ihrachys> example: http://logs.openstack.org/21/504021/2/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/e47a3f3/logs/screen-q-agt.txt.gz#_Sep_26_09_17_36_512389
16:29:39 <ihrachys> we have code in iptables managers that applies rules twice and checks that result is same
16:29:59 <ihrachys> that is activated by debug_iptables_rules = True
16:30:03 <ihrachys> which is on in gate
16:31:14 <ihrachys> doesn't happen in ovs
16:31:31 <ihrachys> but that's probably because there we use the 'openvswitch' flow firewall driver
16:31:55 <ihrachys> so maybe the issue still affects ovs/iptables setups
16:32:17 <ihrachys> of course we could disable debug_iptables_rules to avoid the errors but that would be a silly thing to do
16:33:07 <ihrachys> I hear that haleyb and jlibosva would love to take it from here ;)
16:33:23 <jlibosva> I can take a look tomorrow :)
16:33:37 <ihrachys> ok good. I will prepare a bug report after the meeting
16:33:52 <ihrachys> #action jlibosva to triage iptables apply failure in linuxbridge scenarios job
16:33:53 <haleyb> i will help as well of course
16:34:00 <ihrachys> #action ihrachys to report bug for iptables apply failure
16:34:34 <ihrachys> also while we are at it, worth reviewing the test scenario itself that triggers the failure: https://review.openstack.org/#/c/504021
16:34:34 <patchbot> patch 504021 - neutron - [Tempest] Scenarios for several sec groups on VM
16:34:38 <mlavalle> Thanks jlibosva, haleyb
16:35:01 <ihrachys> final thing to discuss here is that tempest plugin with all tests, api and scenarios, is moving to a separate repo
16:35:12 <ihrachys> chandankumar is spearheading the initiative
16:35:14 * mlavalle will review it
16:35:20 <haleyb> ihrachys: oh, so it only happens with that patch?
16:35:48 <ihrachys> haleyb, well I am not sure. could be. that's where I spotted it.
16:36:17 <ihrachys> for the split tempest, we already have https://github.com/openstack/neutron-tempest-plugin
16:36:23 <ihrachys> that carries copy of tests
16:37:02 <ihrachys> chandankumar is currently working on preparing jobs that would run tests from the plugin: https://review.openstack.org/#/c/507038/
16:37:03 <patchbot> patch 507038 - openstack-infra/project-config - Added tempest dsvm job for neutron-tempest-plugin
16:37:32 <ihrachys> once they are in the separate repo gate, and we prove they work, we will switch neutron repo to new jobs that use the plugin repo
16:37:45 <ihrachys> and then clean up the neutron tree: https://review.openstack.org/#/c/507038/
16:37:45 <patchbot> patch 507038 - openstack-infra/project-config - Added tempest dsvm job for neutron-tempest-plugin
16:37:59 <ihrachys> oops wrong link
16:38:04 <mlavalle> ihrachys: one question, what happens to code we are changing now in the Neutron repo?
16:38:13 <ihrachys> this is the patch that removes the in-tree tests: https://review.openstack.org/#/c/506672/
16:38:14 <patchbot> patch 506672 - neutron - Remove the bundled intree neutron tempest plugin
16:38:30 <ihrachys> mlavalle, that's a good question, and the answer is, we will need to first sync everything before pulling the trigger
16:38:48 <mlavalle> yeah, I saw it over the weekend and was wondering
16:38:56 <mlavalle> good to know
16:39:02 <ihrachys> once we are closer to that, we should probably freeze all new patches, work through sync process, pull the trigger, cleanup; then ask everyone to move patches to the new repo
16:39:16 <mlavalle> sounds good
16:39:25 <ihrachys> there is a bit of legwork there
16:39:31 <mlavalle> yeap
16:40:27 <ihrachys> actually I now realized that project-config change is incomplete
16:40:44 <ihrachys> because since the new repo is branchless, we need to target all neutron stable branches there
16:40:49 <ihrachys> like tempest does
16:40:55 <mlavalle> right
16:41:00 <ihrachys> I commented in the patch
16:41:19 <ihrachys> the new setup will also temporarily complicate backports of test fixes into stable branches
16:42:37 <ihrachys> that's all I have for scenarios. there may be more issues with it, but I prefer we get back to them once migration/dvr/iptables stuff is fixed and we see new results
16:42:42 <ihrachys> #topic Fullstack
16:43:02 <ihrachys> there was a PTG discussion on things we should take care to make fullstack pass
16:43:35 <ihrachys> first issue was about trunk test case failing because ovs agents remove ports of each other because ovsdb is not sandboxed
16:43:50 <ihrachys> I believe the long term vision is we actually have it sandboxed
16:43:57 <mlavalle> correct
16:44:01 <ihrachys> but for the time being, armax sent this fix: https://review.openstack.org/#/c/504186/
16:44:02 <patchbot> patch 504186 - neutron - Isolate trunk bridges on fullstack runs
16:44:39 <ihrachys> it needs some more work, but if you folks could assess the general approach that would be great.
16:45:13 <mlavalle> will do
16:45:51 <ihrachys> another issue we have with fullstack is that sometimes it seems like netlink socket operations don't immediately translate into kernel state changes
16:45:51 <ihrachys> https://launchpad.net/bugs/1717582
16:45:53 <openstack> Launchpad bug 1717582 in neutron "fullstack job failing to create namespace because it's already exists" [Undecided,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:46:01 <ihrachys> which then makes ensure_namespace fail
16:46:15 <ihrachys> slaweq_ had a fix for that https://review.openstack.org/#/c/503890/ but seems like it didn't help
16:46:16 <patchbot> patch 503890 - neutron - Fix for race condition during netns creation (MERGED)
16:46:21 <slaweq_> and my patch didn't solve it
16:46:27 <ihrachys> so now we probably revert: https://review.openstack.org/#/c/507382/
16:46:28 <patchbot> patch 507382 - neutron - Revert "Fix for race condition during netns creation"
16:46:40 <slaweq_> I think that there is less such kind of errors now but it's only my "feeling"
16:46:48 <ihrachys> and look at alternative implementation of netns add using pyroute2: https://review.openstack.org/#/c/505701/
16:46:48 <patchbot> patch 505701 - neutron - Change ip_lib network namespace code to use pyroute2
16:46:55 <mlavalle> have we taken the decision to revert yet?
16:47:14 <ihrachys> haleyb, remind everyone why we think it helps?
16:47:24 <ihrachys> haleyb, should probably be part of the commit message (also a link to the bug)
16:48:02 <ihrachys> mlavalle, afaiu slaweq_ explained: "My patch changed ensure_namespace() method to catch ProcessExecutionError exception. If Brian's patch will be merged it will never happen here that such exception will be raised."
16:48:15 <slaweq_> yep
16:48:18 <ihrachys> sounds like useless code if we go the pyroute2 path
16:48:25 <mlavalle> cool
16:48:26 <haleyb> ihrachys: i more started the pyroute2 change as a performance improvement, but it should catch the namespace exists issue better
16:48:31 <slaweq_> because pyroute2 will raise exception with EEXIST error code
16:48:48 <ihrachys> slaweq_, but we will still need to catch it?
16:48:52 <slaweq_> yep
16:49:03 <ihrachys> ok so the haleyb's patch doesn't fix the bug
16:49:04 <slaweq_> but we can check if it's this error or other one
16:49:07 <mlavalle> ahhh, I glimpsed pieces of the EEXIST conversation yesterday
16:49:32 <ihrachys> haleyb, then maybe related-bug
16:49:35 <haleyb> ihrachys: it does because it catches the pyroute EEXIST exception and ignores it
16:49:39 <ihrachys> oh ok
16:49:51 <slaweq_> also, I don't know exactly how privsep works
16:49:55 <haleyb> i'll add related-bug
16:49:58 <slaweq_> and if it uses rootwrap daemon
16:50:11 <ihrachys> haleyb, is it in neutron that you catch, or it's library?
16:50:37 <slaweq_> but I think that current issue might be related to some kind of race condition between two rootwrap daemons which executes commands
16:50:59 <haleyb> ihrachys: in neutron, the pyroute2 code doesn't seem to catch it properly
16:51:11 <ihrachys> slaweq_, could well be, meaning this wouldn't solve the underlying issue
16:51:18 <jlibosva> haleyb: even with the flag passed we saw at pyroute2 docs?
16:51:36 <ihrachys> oh I see the catch in https://review.openstack.org/#/c/505701/4/neutron/privileged/agent/linux/ip_lib.py
16:51:37 <patchbot> patch 505701 - neutron - Change ip_lib network namespace code to use pyroute2
16:52:17 <haleyb> ihrachys: i'd have to do it differently to use any flags
16:53:12 <ihrachys> slaweq_, so it seems like the lib handles the error itself?
16:53:19 <ihrachys> we don't need to handle it in neutron then?
16:53:27 <slaweq_> not exactly
16:53:42 <slaweq_> there is try except which should catch it
16:54:03 <slaweq_> but as we checked yesterday with haleyb it raises exception in other place
16:54:15 <ihrachys> oh ok, I see
16:54:29 <ihrachys> maybe a good idea to patch pyroute2 (in addition to neutron)
16:54:41 <haleyb> it might be a bug in the pyroute2 code, i need to file a bug and get maintainer response
16:54:42 <jlibosva> I haven't played with it but I looked at the docs of pyroute2 and there was an option to ignore namespace creation in case it already existed
16:54:47 <slaweq_> I was thinking that in https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/__init__.py#L149 it will be catched
16:55:05 <slaweq_> but it looks that exception is raised in https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/__init__.py#L162
16:55:28 <ihrachys> slaweq_, right. because netnsdir is /var/run/netns/
16:55:35 <ihrachys> and netnspath is /var/run/netns/xxx
16:55:44 <ihrachys> so it handles exists for first only
16:56:12 <jlibosva> # create a new netns or use existing one
16:56:14 <jlibosva> netns = NetNS('test', flags=os.O_CREAT)
16:56:23 <jlibosva> from http://pyroute2.org/pyroute2-0.3.4/netns.html
16:56:59 <haleyb> i use the lazy pyroute2.netns.create("test") method instead of instantiating a class
16:57:20 <ihrachys> ok we have little time, let's move on with the issue. everyone make sure we review those patches quick.
16:57:34 <ihrachys> #topic Open discussion
16:57:40 <ihrachys> I would like to note one thing before we wrap up
16:58:01 <ihrachys> infra is switching gates to zuulv3
16:58:14 <ihrachys> there are significant changes in the future in how we define our jobs
16:58:29 <ihrachys> the gist of the changes can be found in https://docs.openstack.org/infra/manual/zuulv3.html
16:58:35 <ihrachys> make yourself comfortable
16:58:49 <jlibosva> I also wanted to note I'm working on some fullstack work: https://review.openstack.org/#/c/506722/ - it's very invasive :) but I hope it'll bring a lot of fruits
16:58:50 <patchbot> patch 506722 - neutron - WIP: fullstack: Put all services into namespaces
16:58:50 <ihrachys> the switch is currently ongoing, so project-config changes are frozen for now
16:59:08 <jlibosva> I'll probably split it into smaller pieces once it's done
16:59:09 <ihrachys> jlibosva, nice!
16:59:14 <mlavalle> thanks for pointing it out
16:59:50 <ihrachys> haleyb, btw you may want to push again on getting multinode-dvr the default once infra is done with zuulv3
16:59:53 <haleyb> i also plan on re-basing https://review.openstack.org/#/c/483600/ today
16:59:54 <patchbot> patch 483600 - openstack-infra/project-config - Make neutron-multinode job the default
16:59:58 <ihrachys> ok folks we are at the top of the hour
17:00:09 <ihrachys> haleyb, huh, great minds...
17:00:10 <ihrachys> :)
17:00:15 <ihrachys> thanks everyone for joining
17:00:18 <jlibosva> thanks all, cya :)
17:00:24 <slaweq_> thanks, bye
17:00:24 <mlavalle> o/
17:00:27 <ihrachys> mlavalle, slaweq_ thanks for joining and all the help, I appreciate
17:00:29 <ihrachys> #endmeeting