16:00:27 <slaweq> #startmeeting neutron_ci
16:00:28 <openstack> Meeting started Tue Jan 14 16:00:27 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 <openstack> The meeting name has been set to 'neutron_ci'
16:00:33 <slaweq> welcome back :)
16:00:40 <ralonsoh> hi
16:01:23 <njohnston> o/
16:02:05 <slaweq> let's start, maybe others will join in the meantime
16:02:07 <slaweq> #topic Actions from previous meetings
16:02:18 <bcafarel> hi again
16:02:20 <slaweq> sorry, first thing:
16:02:22 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:24 <slaweq> Please open now :)
16:02:32 <slaweq> and now we can start with first topic
16:02:37 <slaweq> njohnston to check failing NetworkMigrationFromHA in multinode dvr job
16:03:21 <njohnston> So I have not been able to find any incidents of that.  I tried last week, and logstash was not showing any.  I tried for about an hour this morning, but logstash became unresponsive a few times so I haven't been able to see if any happened over the weekend
16:03:34 <njohnston> I should say, any recent incidents
16:04:26 <njohnston> so I will keep searching, and once I find another failure I will poke at it.
16:04:52 <slaweq> njohnston: ok, if I find something like that I will ping You
16:04:59 <njohnston> slaweq: thanks!
16:05:39 <slaweq> njohnston: I think it's this for example https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b63/702249/3/check/neutron-tempest-plugin-dvr-multinode-scenario/b631ff4/testr_results.html
16:06:10 <njohnston> See, I knew you'd be able to find one. :-) slaweq++
16:06:18 <slaweq> :)
16:06:22 <slaweq> njohnston: thx
16:06:58 <slaweq> but I'm not sure if this is actually the error You are looking for
16:07:07 <slaweq> here it seems to be an ssh-to-instance failure
16:07:24 <njohnston> that is all I have ever seen for a failure on this test
16:07:26 <slaweq> and I'm not sure if this is a "router migration specific" problem or a general issue
16:08:21 <slaweq> I know that I saw similar ssh issue on other tests too
16:08:35 <slaweq> but maybe it's for some reason failing more often in those migration tests
16:09:08 <njohnston> yes, that is the angle I am taking on it
16:09:25 <slaweq> ok, thx njohnston for looking into this issue
16:09:39 <slaweq> if You need any help, just let me know :)
16:09:53 <njohnston> slaweq: Will do
16:10:44 <slaweq> thx
16:10:53 <slaweq> ok, let's move on
16:11:02 <slaweq> ralonsoh to report bug for timeout related to bridge creation
16:11:25 <ralonsoh> let me find the bug
16:11:39 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1858661
16:11:39 <openstack> Launchpad bug 1858661 in neutron "Error when creating a Linux interface" [High,Confirmed]
16:11:47 <ralonsoh> but I didn't work on this one
16:13:28 <slaweq> ralonsoh: thx, we can at least track it in case someone has time to work on it
16:13:32 <slaweq> thx a lot
16:13:37 <ralonsoh> yw
16:14:01 <slaweq> and last action from last week
16:14:03 <slaweq> ralonsoh to take a look at how to use a newer MariaDB in the periodic job
16:14:10 <ralonsoh> one sec
16:14:18 <ralonsoh> #link https://review.opendev.org/#/c/702416/
16:14:36 <ralonsoh> what I'm doing is adding a pre-run task in the zuul job
16:14:50 <ralonsoh> which adds, just for Ubuntu Bionic, the repo with MariaDB 10.4
16:15:23 <njohnston> nice
16:16:39 <slaweq> You can send a DNM patch on top of this one to run this job in the check queue
16:16:49 <slaweq> just to see if it will actually work as expected
16:16:50 <ralonsoh> hmmm, you are right
16:17:02 <slaweq> and thx for that fix :)
16:17:11 <ralonsoh> I'll ask you about this later
16:17:21 <ralonsoh> because I thought that was a periodic job
16:17:26 <ralonsoh> but in the neutron channel
16:17:50 <slaweq> ralonsoh: yes, this job is in periodic queue
16:17:54 <slaweq> and it should be like that
16:18:12 <slaweq> but You can send a DNM patch that adds it to the check queue
16:18:36 <slaweq> then zuul will run it on that patch so we can see the results before we merge Your fix
16:18:43 <slaweq> I can send this patch if You want
16:18:49 <ralonsoh> ahhh ok!
16:18:54 <ralonsoh> understood
16:18:56 <slaweq> :)
16:20:10 <slaweq> ok, next topic
16:20:12 <slaweq> #topic Stadium projects
16:20:26 <slaweq> I think we already have a good update about dropping py2 support
16:20:36 <slaweq> so we probably don't need to talk about it here
16:20:38 <slaweq> right?
16:20:54 <njohnston> +1
16:21:05 <ralonsoh> +1
16:21:12 <bcafarel> yep
16:21:12 <slaweq> but I have other question related to stadium projects
16:21:35 <slaweq> we now have midonet and tripleo jobs in the neutron check queue which are non-voting and have been failing 100% of the time for a very long time
16:21:49 <slaweq> both are failing because they try to run on python 2.7
16:21:59 <slaweq> my question is: what should we do with those jobs?
16:22:11 <slaweq> I would personally remove them for now from check queue
16:22:23 <slaweq> as it's only a waste of infra resources to run them on each patch
16:22:33 <slaweq> but I want to also know Your opinion about it
16:22:35 <njohnston> I would remove midonet as they don't have anything but UTs now
16:22:51 <njohnston> For tripleo I would ask if there is an updated job for py3 we could switch to
16:23:00 <slaweq> njohnston: there is no currently
16:23:03 <bcafarel> +1 as these cannot be fixed short-term, we can always add back when there is support
16:23:15 <slaweq> afaik tripleo pinned neutron to some version which supports py2 still
16:23:20 <slaweq> and they use it like that in their ci
16:23:25 <njohnston> ugh.  never mind.  nuke it.
16:23:31 <slaweq> also this job runs on Centos 7
16:23:35 <bcafarel> yeah for tripleo we need centos8 support (which is in progress but not there tomorrow)
16:23:42 <slaweq> so we should probably wait for centos8 before bringing this job back
16:24:52 <slaweq> so what do You think about removing both jobs from the check queue, with a comment that we can bring them back when they work?
16:24:55 <ralonsoh> for now, we should not waste CI resources (139 jobs in the queue now!)
16:25:05 <ralonsoh> we can comment them out
16:25:07 <njohnston> agreed 🤯
16:25:12 <slaweq> ok, I will do that
16:25:14 <bcafarel> good for me too :)
16:25:37 <slaweq> #action slaweq to remove networking-midonet and tripleo based jobs from Neutron check queue
16:26:02 <slaweq> ok, that's all from my side regarding stadium projects
16:26:12 <slaweq> anything else You have on this topic?
16:26:44 <njohnston> nope
16:27:15 <slaweq> ok, let's move on then
16:27:17 <slaweq> #topic Grafana
16:27:41 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:28:13 <slaweq> all graphs are going back to normal values today after yesterday's issue with devstack
16:28:19 <njohnston> well we can see the issues clearly enough
16:29:31 <njohnston> what was wrong with the neutron-tempest-plugin-scenario-* jobs in the check queue back on 1/8?  They were all at 100% fail then as well.
16:30:26 <slaweq> njohnston: I think it was this issue with 'None object don't have open_session" or something like that
16:30:36 <njohnston> ah ok, I recall that fix
16:30:39 <slaweq> :)
16:31:41 <slaweq> other than that I don't see any real problems in grafana
16:32:06 <slaweq> but today I checked the average number of rechecks over the last week, and according to my script it was more than 3 rechecks per patch
16:32:20 <slaweq> which still isn't a good result :/
16:32:48 <njohnston> let's see if that average is lower next week
16:33:19 <slaweq> yeah
16:33:25 <slaweq> I will try to monitor it every week
16:33:38 <ralonsoh> (can you share it?)
16:33:42 <ralonsoh> if it is possible
16:33:53 <ralonsoh> the script
16:34:20 <slaweq> ralonsoh: sure
16:34:22 <slaweq> https://github.com/slawqo/tools/blob/master/rechecks.py
16:34:29 <ralonsoh> cool!!
16:34:31 <slaweq> but it's nothing really "great"
16:34:40 <slaweq> just a simple script which parses some comments from gerrit :)
16:34:47 <njohnston> ralonsoh: It is linked to from the article that describes it: http://kaplonski.pl/blog/failed_builds_per_patch/
16:34:57 <slaweq> njohnston: yes, it is
16:35:05 <ralonsoh> that's right yes!
16:35:13 <slaweq> I think it works fine and IMO this metric makes some sense
16:35:28 <slaweq> but if You have any opinion about it, please let me know
16:35:48 <slaweq> maybe my way of thinking here is wrong and this doesn't have any informational value at all
16:36:22 <njohnston> I think you have a good point that it measures the developer pain due to bad CI
16:36:54 <njohnston> after all, if it is a code problem then the developer is probably not going to do a bare 'recheck', so nearly all the failures that get rechecked have to be CI issues
16:36:55 <slaweq> njohnston: thx
16:37:08 <bcafarel> ack, especially as it focuses on the "final" series of rechecks
16:37:21 <slaweq> yeah, that's why I counted only "build failed" comments from the last patchset
16:37:34 <slaweq> as in most cases this means the code was already fine
16:37:41 <slaweq> and the failures were not related to the patch
16:38:14 <slaweq> of course in some cases it may be different, e.g. when a patch introduces some race condition and tests fail intermittently, but in general that shouldn't be the case
16:39:07 <bcafarel> yeah from personal experience there should be a good signal to noise ratio
16:39:23 <slaweq> thx bcafarel
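
(As a rough illustration of the metric discussed above: query Gerrit for recently merged neutron changes, look only at the final patchset of each, and count Zuul "Build failed" comments there. The sketch below shows the idea only and is not the actual rechecks.py; the query string, the one-week window and the "Build failed" matching are assumptions.)

    import json
    import urllib.parse
    import urllib.request

    GERRIT = "https://review.opendev.org"
    # Assumed query: neutron changes merged and touched within roughly the last week.
    QUERY = "project:openstack/neutron status:merged -age:7d"
    URL = "%s/changes/?q=%s&o=MESSAGES&n=100" % (GERRIT, urllib.parse.quote(QUERY))

    def gerrit_get(url):
        """Fetch a Gerrit REST endpoint and strip the )]}' XSSI prefix."""
        with urllib.request.urlopen(url) as resp:
            raw = resp.read().decode("utf-8")
        return json.loads(raw.split("\n", 1)[1])

    failed_runs = []
    for change in gerrit_get(URL):
        messages = change.get("messages", [])
        if not messages:
            continue
        # The final patchset is the highest revision number seen in comments.
        last_ps = max(m.get("_revision_number", 0) for m in messages)
        # Count CI failures reported against that final patchset only: the code
        # was already good enough to merge, so these failures were most likely
        # CI/infra issues rather than problems with the patch itself.
        failed = sum(
            1 for m in messages
            if m.get("_revision_number") == last_ps
            and "Build failed" in m.get("message", "")
        )
        failed_runs.append(failed)

    if failed_runs:
        avg = sum(failed_runs) / len(failed_runs)
        print("average number of failed CI runs per merged patch: %.2f" % avg)
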
16:39:38 <slaweq> ok, anything else You want to discuss regarding grafana?
16:39:43 <slaweq> or can we move on?
16:39:53 <bcafarel> all good here
16:40:03 <slaweq> so let's move on then
16:40:05 <slaweq> #topic fullstack/functional
16:40:22 <slaweq> I found one failure in fullstack tests:
16:40:24 <slaweq> neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_min_bw_qos_port_removed
16:40:26 <slaweq> https://b89f49db332cb8f54892-19780c33aa00a3c0d825d79cd8c225b0.ssl.cf5.rackcdn.com/701571/2/check/neutron-fullstack/51a1d1b/testr_results.html
16:40:40 <slaweq> but it's probably issue which should be fixed with https://review.opendev.org/#/c/687922/
16:40:45 <slaweq> is that correct ralonsoh?
16:41:02 <ralonsoh> let me check
16:41:40 <ralonsoh> well, we didn't have the driver cache for qos min rules
16:41:52 <ralonsoh> and this specific test is totally refactored
16:42:08 <ralonsoh> so yes, I think this is going to fix it
16:42:14 <ralonsoh> (I hope so!)
16:42:24 <slaweq> ok, thx for confirmation
16:42:53 <slaweq> and I don't have any other new failures for functional/fullstack jobs
16:43:11 <slaweq> so I think we can move on to scenario jobs now
16:43:16 <slaweq> are You ok with that?
16:43:18 <bcafarel> I may have a new one for stein branch (functional)
16:43:26 <slaweq> bcafarel: shoot
16:44:00 <bcafarel> https://review.opendev.org/#/c/701898/ and https://review.opendev.org/#/c/702364/ both failed today on functional neutron.tests.functional.test_server tests
16:44:24 <bcafarel> may just be a bad node, I did not have time to dig further, for example https://deacf45b9e7640612342-216aa7667ced3686ee75e1188a89b185.ssl.cf2.rackcdn.com/702364/1/check/neutron-functional/a6cf355/testr_results.html
16:44:48 <bcafarel> but I don't recall recent changes/fixes in this part of code in master/train
16:45:41 <slaweq> I remember some failure like that in master branch in the past
16:45:55 <slaweq> maybe it's just some missing backport?
16:47:01 <ralonsoh> slaweq, you changed the start method
16:47:14 <ralonsoh> https://review.opendev.org/#/c/680001/
16:47:56 <ralonsoh> is this in stein?
16:48:05 <bcafarel> ahah so maybe oslo bump in stein?
16:48:15 <slaweq> ralonsoh: yes, and this patch in oslo was backported to stein: https://review.opendev.org/#/q/I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
16:48:17 <slaweq> :)
16:48:20 <bcafarel> that one I see only in master/train
16:48:24 <slaweq> so this may be same problem
16:48:48 <bcafarel> ralonsoh: slaweq thanks, I will check and send a cherry-pick if that is the one
16:48:48 <ralonsoh> my job is done here!
16:49:02 <bcafarel> though I am tempted to bet it is :)
16:49:15 <slaweq> ralonsoh++
16:49:22 <slaweq> bcafarel++ thx for checking that
16:50:25 <slaweq> #action bcafarel to send cherry-pick of https://review.opendev.org/#/c/680001/ to stable/stein to fix functional tests failure
16:50:46 <slaweq> ok, anything else regarding functional/fullstack jobs?
16:50:59 <bcafarel> nothing else from me at least :)
16:51:42 <slaweq> ok, let's move on then
16:51:44 <slaweq> #topic Tempest/Scenario
16:52:06 <slaweq> I have one failure related to scenario jobs to mention
16:52:09 <slaweq> Problem with ssh: paramiko.ssh_exception.SSHException: No existing session
16:52:17 <slaweq> e.g. https://947c62482e8e55a27073-47560c94aca274da9e9228ef37db57ef.ssl.cf1.rackcdn.com/701853/2/check/neutron-tempest-plugin-scenario-linuxbridge/e077ad8/testr_results.html
16:52:32 <slaweq> it may be the same issue njohnston saw in the router migration tests
16:53:15 <slaweq> last week I opened bug for that also https://bugs.launchpad.net/neutron/+bug/1858642
16:53:15 <openstack> Launchpad bug 1858642 in neutron "paramiko.ssh_exception.NoValidConnectionsError error cause dvr scenario jobs failing" [High,Confirmed]
16:53:27 <slaweq> but now it seems that it's not only related to dvr jobs
16:54:10 <ralonsoh> so you think this is not a DVR/no DVR problem but a paramiko one
16:54:11 <ralonsoh> ?
16:54:17 <slaweq> njohnston: if You find something with router migration, please also take a look at this to check whether it's the same problem
16:54:23 <njohnston> will do
16:54:52 <slaweq> ralonsoh: IMO it's not a paramiko problem, we just get that error from paramiko when there is no ssh connectivity
16:55:06 <ralonsoh> ok
16:55:23 <slaweq> ahh no
16:55:25 <slaweq> sorry
16:55:32 <slaweq> it's not the same error in this linuxbridge job
16:55:42 <slaweq> njohnston: so please don't check that one
16:55:46 <slaweq> it's something different
16:55:54 <njohnston> ok
16:56:04 <slaweq> here it may be some paramiko error or some issue in our test code
16:56:14 <slaweq> sorry for mixing things
16:57:01 <slaweq> so regarding this issue with the linuxbridge job, I think that if we spot it more often, I will open a new bug for it
16:57:09 <slaweq> and we can then check it
16:57:31 <slaweq> at least for now I have seen it only once, so I don't think it will be easy/doable to check
16:58:32 <slaweq> ok, we are almost out of time
16:58:42 <slaweq> anything else You want to discuss quickly today?
16:59:33 <slaweq> if not, let's end the meeting now, thx for attending o/
16:59:43 <slaweq> #endmeeting