16:00:19 #startmeeting neutron_ci
16:00:19 Meeting started Tue May 7 16:00:19 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 The meeting name has been set to 'neutron_ci'
16:00:37 hi
16:00:40 o/
16:00:56 hi
16:01:44 let's wait a couple more minutes for njohnston_, ralonsoh and others, maybe they will join
16:03:13 ok, let's start
16:03:18 first thing:
16:03:20 #link http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:03:33 please open now so it will be ready later :)
16:03:42 #topic Actions from previous meetings
16:03:46 ok
16:03:58 first action from 2 weeks ago was
16:04:00 mlavalle to continue debugging reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:13 I didn't make progress on this one
16:04:27 due to Summit / PTG
16:04:41 sure, I know :) Can I assign it to You for next week?
16:04:49 yes please
16:04:55 #action mlavalle to continue debugging reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:58 thx
16:05:06 mlavalle to recheck tcpdump patch and analyze output from ci jobs
16:05:13 that is the next one ^^
16:05:18 any update?
16:05:22 I spent time this morning looking at that
16:06:09 I think my tcpdump command is too broad: http://logs.openstack.org/21/653021/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0a24d77/controller/logs/screen-q-l3.txt.gz#_Apr_23_00_47_57_853357
16:06:40 getting weird output as you can see
16:06:40 bad checksums and that kind of stuff
16:06:55 so I am going to focus a little bit more
16:07:24 I am going to trace the qr and qg interfaces
16:07:24 with tcp and port 22
16:07:37 makes sense?
16:07:48 and please add the "-n" option to not resolve IPs to hostnames
16:07:58 IMHO it will be easier to read
16:08:17 yes, you are right
16:08:22 I will probably also focus on single node jobs first
16:08:33 and maybe -l to not buffer
16:08:36 yes, and also "-e" to print mac addresses
16:09:04 thanks for the recommendations. I will follow them
16:09:12 I have a stupid question
16:09:16 can I ask it?
16:09:20 sure :)
16:09:27 there are no stupid questions ;)
16:10:11 in looking at what job to focus on in kibana, I noticed this: http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_454517
16:10:42 we are not using password ssh in any case, right?
16:11:01 no, an ssh key is always used I think
16:11:21 yes, that's what I think also
16:11:32 but it was worth asking the dumb question
16:12:09 this is exactly an example of the second "type" of error with SSH connectivity
16:12:12 look at http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_550727
16:12:29 the instance-id was received properly from the metadata server
16:12:47 but then, 2 lines below, it failed to get the public-key
16:13:02 it is exactly what I was testing last week
16:13:24 do you have a feel of what is the ratio between type 1 and type 2 failures?
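The tcpdump invocation suggested a few lines above would look roughly like the sketch below. This is not taken from the patch under discussion; the router namespace and qr-/qg- interface names are hypothetical placeholders, and it simply combines the filter ("tcp and port 22") with the -n, -l and -e flags recommended in the meeting.

```python
# Minimal sketch (assumed, not the actual CI patch) of the discussed tcpdump run:
# capture SSH traffic on a router interface inside its network namespace.
import subprocess

ROUTER_NS = "qrouter-<router-id>"   # placeholder: router namespace under test
INTERFACE = "qr-xxxxxxxx-xx"        # placeholder: qr- or qg- device name

cmd = [
    "ip", "netns", "exec", ROUTER_NS,
    "tcpdump",
    "-i", INTERFACE,
    "-n",   # do not resolve IPs to hostnames
    "-l",   # line-buffered output, so it appears in the job logs promptly
    "-e",   # print MAC addresses on each line
    "tcp", "and", "port", "22",
]

# Runs until the test tears the namespace down; output goes to the L3 agent log.
subprocess.run(cmd, check=False)
```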
16:13:43 I don't know exactly but I would say 50:50
16:14:13 I think the tcpdump testing I'm doing should help with type 1
16:14:34 so I will focus on those
16:14:43 in type 2, we know we have connectivity
16:14:56 yes
16:14:58 because we fail authenticating
16:15:15 in this case the problem is a slow answer to metadata requests
16:17:48 http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/compute2/logs/screen-q-meta.txt.gz#_May_07_13_15_02_833919
16:18:00 here is this failed request in the neutron metadata agent's logs
16:18:17 and http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/controller/logs/screen-n-api-meta.txt.gz#_May_07_13_15_04_543966 -- that's how it looks in nova
16:18:27 there is a 10 second gap in the logs there
16:18:34 yeap
16:19:11 mlavalle: maybe You can try to talk with someone from the nova team to look into those issues
16:19:41 ok
16:19:59 thx
16:20:07 can I add it as an action for You also?
16:20:11 yes
16:20:31 #action mlavalle to talk with nova folks about slow responses for metadata requests
16:20:31 please
16:20:33 thx
16:20:45 ok, the next one was
16:20:47 njohnston move wsgi jobs to check queue nonvoting
16:21:05 I know it's done, we have wsgi jobs running in the check queue currently
16:21:31 and the tempest job is kinda broken now
16:21:48 so we will have to investigate it also
16:21:58 but that isn't very urgent for now
16:22:26 next one then
16:22:27 ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:22:38 I don't know if ralonsoh did anything with it
16:22:45 sorry
16:22:50 I didn't have time for it
16:22:58 sure, no problem :)
16:23:05 can I assign it to You for this week?
16:23:10 sure
16:23:15 #action ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:23:18 thx ralonsoh :)
16:23:27 and the last one was:
16:23:29 slaweq to cancel next week's meeting
16:23:34 done - that was easy :P
16:23:49 ok, any questions/comments?
16:24:41 ok, I will take this silence as no :)
16:24:50 so let's move on then
16:24:51 +1
16:24:52 #topic Stadium projects
16:25:02 first "Python 3 migration"
16:25:08 etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:25:34 I know that tidwellr started doing something with the neutron-dynamic-routing repo
16:26:07 his patch https://review.opendev.org/#/c/657409/
16:27:37 I just checked that for neutron-lib we are actually good
16:28:50 I will try to go through those projects in the next weeks
16:29:07 does anyone want to add something on this topic?
16:29:40 nope
16:29:57 ok, let's move on
16:30:03 tempest-plugins migration
16:30:09 Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:30:28 and I have a question here
16:30:52 I was recently struggling with an error in the jobs run on rocky and queens repos for networking-bgpvpn
16:31:16 but at the airport yesterday I realized that we probably don't need to run those jobs for stable branches yet
16:31:45 as we will not remove tests from the stable branches of the stadium projects' repos, right?
16:32:23 so for now we should only have these jobs in the neutron-tempest-plugin repo for the master branch, and add stable branch jobs starting from the Train release
16:32:28 is that correct?
16:32:50 I think so
16:33:41 ok, so that will make at least my patch easier :)
16:33:58 I will remove the jobs for stable branches from it and it will be ready for review then
16:34:12 any other questions/updates?
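To illustrate the "type 2" failure discussed above (metadata reachable, but ssh key injection fails because the neutron-metadata-agent -> nova-metadata-api chain answers too slowly), here is a simplified sketch of the two fetches the guest performs. It is not the guest image's actual code; the metadata IP and EC2-style paths are standard, but the 10-second per-request timeout is an assumption for illustration, matching the gap seen in the logs above.

```python
# Sketch of the guest-side metadata fetches: instance-id succeeds, but a slow
# backend can make the follow-up public-key request time out, so the ssh key
# never lands in the instance even though connectivity is fine.
from urllib.request import urlopen

METADATA = "http://169.254.169.254/2009-04-04/meta-data"

def fetch(path: str, timeout: float = 10.0) -> str:
    # timeout value is an assumption for this sketch, not taken from the image
    with urlopen(f"{METADATA}/{path}", timeout=timeout) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print("instance-id:", fetch("instance-id"))                 # succeeded in the log
    print("public-key:", fetch("public-keys/0/openssh-key"))    # the slow request
```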
16:35:22 not from me
16:35:35 ok, so let's move on then
16:35:37 #topic Grafana
16:35:47 http://grafana.openstack.org/dashboard/db/neutron-failure-rate - just a reminder :)
16:36:54 there isn't anything very bad there - it all looks pretty much the same as usual
16:37:04 yeap
16:37:08 I think so
16:37:12 do You see anything You want to talk about?
16:38:04 is there a bug for the slow job failure? looks like a volume issue?
16:38:47 haleyb: issues with volumes (or volume backups) happen in various tempest jobs quite often
16:38:55 haleyb: do You have a link to an example?
16:39:06 http://logs.openstack.org/57/656357/1/check/tempest-slow-py3/900d859/testr_results.html.gz
16:39:15 second failure
16:39:47 yes, such errors happen from time to time
16:40:17 I'm not sure if exactly this one was reported to cinder but I already reported some similar errors
16:40:38 and we also discussed it in the QA session at the PTG
16:40:39 it just seemed like it picked up recently in the gate, at 20% now
16:40:55 I hope You all saw the recent email from gmann about it
16:43:13 haleyb: I'm not sure if those 20% are only because of this issue
16:43:24 often there are also problems with ssh to instances
16:44:04 ok, let's move on to the next topic
16:44:05 slaweq: yes, that was in the other test failure, it's just at 2x the other jobs for failures
16:44:37 haleyb: where is tempest-slow at 2x the other jobs' failures? I don't see it being that high
16:45:32 slaweq: argh, it's the number of jobs run, i was looking at the right side...
16:45:56 ahh :)
16:46:11 but it is kinda strange that this job ran so many times :)
16:46:33 right, they should all be the same
16:46:58 yes, I will check if this graph is properly defined in grafana
16:47:38 it is not
16:47:55 it counts jobs from the check queue instead of the gate queue
16:47:58 I will fix that
16:48:17 #action slaweq to fix number of tempest-slow-py3 jobs in grafana
16:48:26 thx haleyb for pointing this out :)
16:48:36 :)
16:48:57 ok, let's move on then
16:48:59 #topic fullstack/functional
16:49:11 we still have quite high failure rates for those jobs :/
16:49:37 in functional tests we still quite often hit bug https://bugs.launchpad.net/neutron/+bug/1823038
16:49:38 Launchpad bug 1823038 in neutron "Neutron-keepalived-state-change fails to check initial router state" [High,Confirmed]
16:49:46 like e.g. in http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/testr_results.html.gz
16:50:14 and I have a question for You about that
16:51:14 some time ago I did https://github.com/openstack/neutron/commit/8fec1ffc833eba9b3fc5f812bf881f44b4beba0c
16:51:28 to address this race condition between keepalived and neutron-keepalived-state-change
16:51:36 and it works fine for me locally
16:52:41 but in the gate, for some reason unknown to me, this initial status check fails with an error like http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/controller/logs/journal_log.txt.gz#_May_07_11_21_05
16:52:49 I have no idea why it is like that
16:53:17 maybe You can take a look into that and help me with it
16:53:57 slaweq: sorry, tuned out for a second, will look
16:53:58 today I sent a DNM patch https://review.opendev.org/#/c/657565/ to check if this binary is really in the .tox/dsvm-functional/bin directory
16:54:07 and it is there
16:54:37 thanks haleyb.
16:55:54 thanks haleyb
16:55:55 slaweq: hmm, privsep-helper not found?
that's odd
16:55:56 fwiw, I've been noticing in the devstack I run on my mac (1 controller / network, 1 compute, DVR) that my HA routers sometimes have 2 masters
16:56:29 slaweq: project-config fix @ https://review.opendev.org/657646 :)
16:56:38 that happens after I restart the deployment
16:56:40 mlavalle: that is probably a different issue
16:57:24 when this race of mine happened there were 2 standby routers instead of 2 masters
16:57:28 haleyb: thx
16:58:17 ok
16:58:30 haleyb: but in my DNM patch I simply tried to install oslo.privsep
16:58:32 I'll try to debug it then
16:58:42 and it looks like the python2.7 job can find it now
16:58:54 but there is still another error there :/
16:58:59 I will have to look into it
16:59:24 and this is odd because it works fine for me locally, and I also don't think there are such errors in e.g. tempest jobs
16:59:39 so this issue is now strictly related to functional jobs IMO
16:59:52 ok, I think we are running out of time now
17:00:00 thx for attending and see You next week
17:00:08 #endmeeting
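For context on the keepalived race discussed under the fullstack/functional topic, here is a rough, illustrative sketch of the kind of initial-state check involved: decide whether a node is master by looking for the VIP on the HA interface inside the router namespace right after startup. This is not the actual neutron-keepalived-state-change code from the commit linked above; the namespace, interface and VIP values are placeholders, and the race being debugged is that keepalived may not have finished configuring the VIP when such a check runs.

```python
# Rough sketch (assumed, not neutron's implementation) of an initial HA router
# state check based on whether the keepalived VIP is already configured.
import subprocess

def initial_router_state(ns: str, ha_iface: str, vip_cidr: str) -> str:
    """Return 'master' if the VIP is present on the HA interface, else 'backup'."""
    out = subprocess.run(
        ["ip", "netns", "exec", ns, "ip", "-o", "addr", "show", "dev", ha_iface],
        capture_output=True, text=True, check=False,
    ).stdout
    # If keepalived has not configured the VIP yet, this reports 'backup' even
    # on the node that is about to become master -- the race discussed above.
    return "master" if vip_cidr in out else "backup"

# Example (placeholder values):
# initial_router_state("qrouter-<router-id>", "ha-xxxxxxxx-xx", "169.254.0.1/24")
```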