15:01:15 #startmeeting manila
15:01:16 Meeting started Thu Jan 3 15:01:15 2019 UTC and is due to finish in 60 minutes. The chair is tbarron. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:20 The meeting name has been set to 'manila'
15:02:03 ping bswartz
15:02:05 hello
15:02:13 o/
15:02:18 ping ganso
15:02:21 oh hi
15:02:26 .o/
15:02:32 ping zhongjun
15:02:36 ping zyang
15:02:41 ping toabctl
15:02:45 ping erlon
15:02:48 hey
15:02:49 ping tpsiva
15:02:53 ping amito
15:02:57 ping vkmc
15:03:25 that's the ping list, add yourself at https://wiki.openstack.org/wiki/Manila/Meetings if you want
15:03:41 * tbarron waits a couple minutes
15:03:52 Happy New Year!
15:04:22 ok, Hi all, and happy new year!
15:04:26 Happy New Year, everyone! :)
15:04:47 #topic announcements
15:04:59 oh, our agenda is here
15:05:07 happy new year!
15:05:16 #link https://wiki.openstack.org/wiki/Manila/Meetings
15:05:38 gouthamr: do you have any important football announcements?
15:05:47 hey
15:05:57 hi erlon
15:06:03 :D hehe, my new year started on a good note this year tbarron
15:06:20 I'm not thrilled with the football situation
15:06:35 * tbarron tries to start a fight when he can
15:06:50 bswartz: wut, thought you guys won the Rose Bowl
15:07:22 We did but that's not good enough for me
15:07:37 :D then beat Purdue next time
15:07:40 but on a different note, milestone 2 is next week
15:07:41 * gouthamr ducks
15:07:56 >_<
15:07:59 That's the new driver submission deadline but I don't think we have any.
15:08:39 Otherwise, I plan to cut some intermediary releases, so if there's anything important you want included let me know.
15:08:53 hi
15:09:03 It would be nice to be python3 ready by then but we'll see ...
15:09:14 hi xyang
15:09:27 Why releases plural?
15:09:27 Any other announcements?
15:09:37 Isn't it just 1 release for the milestone?
15:09:42 bswartz: manila-ui, client, etc.
15:09:49 Okay, library releases
15:09:51 they're not required at milestone
15:10:14 but it's a convenient time and we're being encouraged not to just wait till the end of the cycle.
15:11:05 Any other announcements?
15:11:19 I guess we can welcome toabctl back :)
15:11:42 thx :) but still part-time :)
15:11:44 (sorry, was in another meeting that ran too long)
15:11:46 Happy new year, and thanks for your cleanup patches and reviews again.
15:11:58 amito: hey, happy new year.
15:12:07 tbarron: thanks, happy new year!
15:12:32 #topic new user-developer experience
15:12:57 special thanks to gouthamr for https://review.openstack.org/#/c/627020/
15:13:34 We've had a fair number of folks trying to set up devstack and not getting anywhere good
15:14:13 But gouthamr has written up new instructions, including a gentler intro than just dropping them into full DHSS=True complexity at the start.
15:14:27 I think it will be a great help.
15:15:09 We still don't have great, easy-to-follow instructions for running tempest locally, so if anyone wants to
15:15:18 contribute on that front it would be great.
15:15:57 #topic gate issues
15:16:02 Somewhat related
15:16:23 we've made some progress cleaning up all the red non-voting first-party jobs.
15:16:31 but not enough.
15:16:54 The generic driver jobs were failing with SSH header protocol exceptions
15:17:05 ++, a relief to see https://review.openstack.org/#/c/627854/ passing
15:17:23 And we had new users trying to use the generic driver locally and they were failing
15:17:27 at the same point.
15:17:41 gouthamr: what was the trick here?
15:17:51 I've submitted a series of patches that *mitigate* the issue and that will help us debug further, I hope.
15:17:56 Or do I need to go read all 5 of those patches?
15:18:01 bswartz: crude, up the timeout.
15:18:12 which timeout
15:18:21 was going to ask tbarron: a combination of adjusting the SSH banner timeout, setting the group?
15:18:24 and remove an old keepalive hack we had in the paramiko code that no one else has anymore.
15:18:33 or this as well? https://review.openstack.org/#/c/627797
15:18:37 there's a specific banner timeout that we need to bump
15:18:38 ah
15:18:42 as a workaround
15:19:07 * gouthamr notes this must have taken much of tbarron's holiday
15:19:17 that enables it to work most of the time, just the way we see the openssh client work if you wait long enough
15:19:21 So we suspect something is slow (nova or neutron) and just waiting a little longer works around the slowness?
15:19:58 then https://review.openstack.org/#/c/627020/ seems to fix an intermittent issue even with the long timeouts
15:20:24 tbarron: wrong link?
15:20:25 gouthamr: not really (holiday) - mostly I was away from other work concerns so could think a little bit
15:20:27 tbarron: ^wrong patch?
15:20:57 *link https://review.openstack.org/#/c/627797
15:21:04 sorry bout that
15:21:11 Oh that one
15:21:22 I never investigated this keepalive stuff so I'll look more closely
15:21:31 Now there's still an issue that will probably turn into a bug for neutron/ovs
15:22:09 If you install tcpdump on the SVM and run it when connecting to it from the openssh client
15:22:31 you see "spurious retransmissions" of the SSH version header from the client.
15:22:53 the client keeps sending it over and over, with exponential backoff
15:23:13 And you see the server responding with acks (you see that on the server)
15:23:13 To the correct IP?
15:23:19 bswartz: yes
15:23:30 So it's not a DHCP issue
15:23:42 if you run tcpdump on the client you don't see the ACKs from the server
15:23:45 Probably the service VM is stuck halfway booted
15:23:51 which explains the retransmissions
15:24:03 no, I'm on the svm running tcpdump
15:24:04 Have you ever collected the kernel logs from a service VM where this issue occurred?
15:24:34 bswartz: nova console log shows it booted and I'm sshed into it,
15:24:44 There's a recurring (and extremely annoying) problem in the linux kernel where bootstrapping the random number generator can take 90+ seconds in virtual environments
15:24:50 running a second ssh login issue
15:25:09 running a second ssh login
15:25:33 So you're able to SSH in shortly after boot? And it's just manila that can't connect?
15:25:56 bswartz: manila can connect too if you give it long timeouts
15:25:57 bswartz, you might need haveged installed to have entropy
15:26:24 I'm debugging the slow connection, not complete failure
15:26:30 toabctl: yeah, there are several workarounds to the issue, installing userspace tools is one of them
15:26:38 You can connect after about 110s.
15:27:06 So connect, fix DNS, install tcpdump. Run tcpdump.
15:27:08 Well, what I've seen in the past is that sshd gets stuck while starting up because it's waiting for the kernel to be able to supply random numbers
15:27:15 in another window connect again.
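For context on the banner-timeout workaround discussed above: paramiko's SSHClient.connect() takes a banner_timeout argument that controls how long the client waits for the server's SSH protocol banner. The sketch below is illustrative only; the function name, credentials, and timeout values are assumptions, not manila's actual settings.

```python
# Minimal sketch of the workaround idea: raise the banner timeout so a slow
# service VM still completes the SSH version/banner exchange instead of the
# client giving up with a protocol banner error. Values are illustrative.
import paramiko


def connect_with_long_banner_timeout(host, username, key_filename):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        host,
        username=username,
        key_filename=key_filename,
        timeout=60,          # TCP connect timeout
        banner_timeout=300,  # wait much longer for the server's SSH banner
    )
    return client
```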
15:27:28 After the kernel finishes RNG initialization, sshd is able to fully start
15:27:35 bswartz: what i'm seeing is no lack of response on the server, but
15:27:41 bswartz: after it connects, if you disconnect and try again, it still takes 110 seconds
15:27:46 In unfortunate situations that can take ~90 seconds
15:27:48 packets from the server not getting back to the client
15:27:58 Hmm okay
15:28:08 Perhaps a red herring
15:28:23 tbarron: is this limited to the initial connection?
15:28:24 i've been playing with MTU reduction, etc.
15:28:45 gouthamr: seems to be limited to the ssh header transmission
15:28:48 tbarron: i mean, would this happen after share creation as well, i.e., when updating exports
15:29:07 the server is piggybacking its ssh header with an ack for the tcp segment from the client that
15:29:16 contains the client's ssh header
15:29:29 these are not extremely long though, and later
15:29:34 after it unstalls
15:29:46 there's a key exchange with a much bigger payload
15:31:18 anyways I'd like to get this issue resolved as well, not just longer timeouts, etc. as a workaround.
15:31:40 But with the workaround, we see that the main issues remaining with the
15:31:44 generic jobs are:
15:31:49 1) migration failures
15:32:00 2) timeouts on the scenario job
15:32:37 inspecting the scenario job, the test cases that run for tens of minutes are migration cases
15:33:03 so I'm wondering what people would think of splitting the host-assisted migration tests into their own job
15:33:11 I've seen a lot of lvm failures lately. Since it is voting, it is requiring a lot of rechecks
15:33:34 I'm confused about the 2 issues
15:33:38 That way maybe the other jobs will be green most of the time and we can focus on the red stuff separately.
15:33:46 Are the migrations causing the timeouts, or are the migrations failing outright?
15:33:51 tbarron: would we be splitting the host-assisted migration tests for all drivers or just generic in DHSS=True?
15:34:00 bswartz: both
15:34:09 Gah
15:34:11 ganso: not sure, what do you think?
15:34:23 Splitting would help with timeouts but not the outright failures
15:35:01 bswartz: it won't make those tests pass but it will allow us to see when non-migration stuff fails in the generic job
15:35:02 tbarron, bswartz: yep
15:35:13 by looking at the normally-green job
15:35:19 without having to dive into the logs
15:35:31 and the actual failures are intermittent
15:35:43 and probably cascade
15:35:49 I see
15:36:26 so my idea is that we limit the scope of the actual problem cases to speed things up, on the one hand, and
15:36:33 limit collateral damage on the other
15:36:51 Would the split be permanent though?
15:36:56 Or just until we sort out the problem?
15:36:59 ganso: you are right that there are other non-migration intermittent issues
15:37:23 like lvm job snapshot races
15:37:41 bswartz: I haven't really thought it through that far
15:37:58 Do we need locks in the LVM driver?
15:38:11 but if we can stabilize everything and get it to run fast enough then maybe we should consolidate again
15:38:15 or have that as a goal
15:38:46 bswartz: maybe. I don't understand the issue well enough to say.
15:39:23 we also have some intermittent failures in access-rule tests.
15:39:58 Anyways, I'm trying to sort these out a bit, make sure we have bugs, and try to get out of the mode where we just ignore failures.
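On the "do we need locks in the LVM driver?" question: if the snapshot races turn out to be concurrent operations against the same share inside one manila-share process, oslo.concurrency's lockutils could serialize them. This is only a sketch of that idea, not the actual driver code; the class, method, and lock names are hypothetical.

```python
# Illustrative sketch, assuming the races are intra-process: take one
# in-process lock per share so snapshot operations on the same logical
# volume cannot interleave.
from oslo_concurrency import lockutils


class LVMSnapshotSketch(object):
    def create_snapshot(self, snapshot):
        # Lock name is derived from the share ID, so snapshots of
        # different shares still run concurrently.
        @lockutils.synchronized('lvm-share-%s' % snapshot['share_id'])
        def _do_create():
            pass  # the real lvcreate --snapshot call would run here

        _do_create()
```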
15:40:16 +1
15:40:26 #topic our work for stein
15:40:49 priority for access rules https://review.openstack.org/#/c/572283/
15:41:04 the revert to snapshot feature also has races, the dummy driver fails from time to time on that test
15:41:25 ganso: agree, do you know if we have a bug for that?
15:41:50 tbarron: I am not sure, would need to look it up. Since it has been failing for a long time someone might have already opened one
15:42:15 I don't see an active champion/driver for the priority for access rules work and fear that it won't get done this cycle either.
15:42:49 That review has had a -1 on it since Dec. 13 with no update.
15:43:09 The client side review is in the same situation, only longer.
15:43:42 At this point I'm inclined to just indicate that it's at risk of not getting done two cycles in a row and
15:44:01 tbarron, I can remove my -1 if others think that splitting commits is useless
15:44:03 is zhongjun here?
15:44:14 try to figure out the gate failures we have currently with access rules.
15:44:30 No updates suggest that she's not working on it
15:44:49 tbarron: I don't think splitting commits is that helpful
15:45:02 tbarron, toabctl: oops, wrong Thomas
15:45:05 lol
15:45:07 toabctl: well, your -1 shouldn't be a blocker; if the champion for the feature disagrees then they should say so and keep pushing
15:45:08 toabctl: ^
15:45:41 ok
15:46:44 toabctl: theoretically I agree with you, but practically I would be willing to just push ahead if we can be confident that this one is safe and that the rest of the work has momentum
15:47:25 My issue is that there is no champion for the feature, there's a lot of work to do besides just this patch, including regression testing and
15:47:33 tbarron, sure. I just think that it takes *more* time if you have a huge commit. but fine for me to leave it as it is
15:47:39 we already have some intermittent access rule test issues.
15:48:29 Anyways, I'm going to work on getting the gate more stable -- including access rule tests -- and ignore this feature unless someone else drives it.
15:48:57 Moving along.
15:49:01 Python3.
15:49:24 I want to get this done more or less around milestone 2.
15:50:06 It looks like we don't set the actual python3 version variable (python3.6), so the lvm job
15:50:21 currently fails to start up the api service when running under py3.
15:50:32 >
15:50:35 Not sure why at the moment, it's set for cinder jobs.
15:50:40 we do set it
15:50:44 But we'll sort that out.
15:50:44 * gouthamr checks
15:50:58 gouthamr: I think cinder jobs learn it.
15:51:02 tbarron: https://review.openstack.org/#/c/623061/3/playbooks/legacy/manila-tempest-minimal-dsvm-lvm/run.yaml
15:51:31 gouthamr: not the boolean, the actual version
15:51:34 oh, wait, you mean python"3.6"
15:51:55 it's a different variable, cinder jobs learn it, and Hellmann's patch uses it
15:52:04 so we'll sort that out
15:52:06 ah, i see
15:52:23 like this one here: https://review.openstack.org/#/c/607379/2/devstack-vm-gate-wrap.sh
15:52:37 And we need to get the current centos jobs running under bionic so they can be py3 too
15:52:52 Hopefully we can get these out of the way in the coming week.
15:53:24 And declare the job done, except potentially for moving from native eventlet wsgi to uwsgi.
15:53:46 The last may or may not be needed since in production everyone uses
15:54:09 httpd or something in front of the api
15:54:26 and since eventlet/py3 issues may be getting sorted anyways.
15:55:02 #topic bugs
15:55:26 I don't have any hot ones other than the gate issues we discussed.
15:55:35 Some of which are also hitting users.
15:56:03 And some of which need new bugs filed, or old bugs discovered and prioritized.
15:56:15 Anyone else have particular bugs to talk about today?
15:56:38 #topic open discussion
15:56:55 ganso?
15:57:34 i'd like to discuss "public" shares, but that might be a longer topic
15:57:39 case in point: https://bugs.launchpad.net/manila/+bug/1801763
15:57:40 Launchpad bug 1801763 in Manila "public flag should be controllable via policy" [Medium,Confirmed]
15:57:57 i can add it to the agenda for next week
15:57:57 gouthamr: that's a good one :)
15:58:08 gouthamr: ok
15:58:31 you might get people to talk about it in #openstack-manila in the meantime as well
15:58:51 sure
15:59:21 also we have someone in #openstack-horizon asking about manila-ui
16:00:05 i tried helping, but i am out of ideas on why it isn't working for the guy
16:00:05 gouthamr: does he have vkmc's latest rdo fix?
16:00:15 tbarron: he does
16:00:23 k, we're out of time
16:00:28 Thanks everyone!
16:00:35 #endmeeting
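A footnote on bug 1801763 ("public flag should be controllable via policy") raised in open discussion: one way to read the request is to gate setting is_public behind an oslo.policy rule. The sketch below is only an illustration of that idea; the rule name, default check string, API paths, and helper function are hypothetical and not manila's actual policy code.

```python
# Hypothetical sketch: make the "public" flag subject to a policy rule
# instead of being settable by any share owner. Defaults are assumptions.
from oslo_config import cfg
from oslo_policy import policy

CONF = cfg.CONF

SET_PUBLIC_RULE = policy.DocumentedRuleDefault(
    name='share:set_public',          # hypothetical rule name
    check_str='role:admin',           # hypothetical default: admins only
    description='Set or unset the public flag on a share.',
    operations=[{'path': '/shares', 'method': 'POST'},
                {'path': '/shares/{share_id}', 'method': 'PUT'}],
)

enforcer = policy.Enforcer(CONF)
enforcer.register_default(SET_PUBLIC_RULE)


def can_set_public(target, creds):
    # Returns True/False; with do_raise=True it would raise
    # PolicyNotAuthorized, which an API layer could map to HTTP 403.
    return enforcer.authorize('share:set_public', target, creds)
```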