14:00:05 #startmeeting nova
14:00:06 Meeting started Thu May 18 14:00:05 2017 UTC and is due to finish in 60 minutes. The chair is mriedem. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:09 The meeting name has been set to 'nova'
14:00:28 o/
14:00:29 o/
14:00:31 o/
14:00:33 o/
14:00:35 o/
14:00:39 o/
14:00:41 o/
14:00:56 o/
14:01:21 alright then
14:01:26 #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
14:01:35 #topic release news
14:01:39 \o
14:01:44 #link Pike release schedule: https://wiki.openstack.org/wiki/Nova/Pike_Release_Schedule
14:02:01 #info Next upcoming milestone: Jun 8: p-2 milestone (3 weeks)
14:02:10 * johnthetubaguy sneaks in late
14:02:12 #info Blueprints: 70 targeted, 66 approved, 12 completed, 6 not started
14:02:33 anything not started by p-3 we'll likely defer
14:02:48 started or blocked i should say
14:03:06 as there are some blueprints that are blocked on needing changes / owners in other projects, like ironic
14:03:16 questions about the release?
14:03:41 #topic bugs
14:03:52 no critical bugs
14:04:08 #help Need help with bug triage; there are 94 new untriaged bugs as of today (May 18)
14:04:21 #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
14:04:27 things are ok'ish
14:04:40 the gate-grenade-dsvm-neutron-multinode-live-migration-nv job on master is 100% fail
14:05:16 i think because the ocata side is not running systemd and the new side is trying to stop things the systemd way, and those aren't running so it fails
14:05:49 sdague: it was likely due to https://review.openstack.org/#/c/465766/
14:05:50 patch 465766 - nova (stable/ocata) - Use systemctl to restart services
14:05:53 which didn't take grenade into account
14:06:02 oops i mean https://review.openstack.org/#/c/461803/
14:06:03 patch 461803 - nova - Use systemctl to restart services (MERGED)
14:06:26 i don't have any news for 3rd party ci
14:06:35 #topic reminders
14:06:41 #link Pike Review Priorities etherpad: https://etherpad.openstack.org/p/pike-nova-priorities-tracking
14:06:45 oh, right, we don't run nova grenade on the devstack gate I guess?
14:07:01 does not compute
14:07:11 the grenade live migration job is non-voting
14:07:18 so it doesn't run in the gate queue
14:07:28 and is restricted to i think nova, and maybe tempest experimental, not sure
14:07:42 mriedem: it was, mostly because I didn't realize the script was used that way
14:08:00 sdague: i forgot about it too
14:08:11 one more reminder,
14:08:13 If you led sessions at the Forum, it would be good to provide summaries on the mailing list.
14:08:36 #topic stable branch status
14:08:40 stable/ocata: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/ocata,n,z
14:09:06 stable/newton: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/newton,n,z
14:09:11 #link We have a few bugs which were regressions in Newton that we need to get fixed on master and backported: https://review.openstack.org/#/c/465042/ https://bugs.launchpad.net/nova/+bug/1658070 https://review.openstack.org/#/c/464088/
14:09:13 Launchpad bug 1658070 in OpenStack Compute (nova) "Failed SR_IOV evacuation with host" [High,In progress] - Assigned to Eli Qiao (taget-9)
14:09:13 patch 465042 - nova - Cache database and message queue connection objects
14:09:13 patch 464088 - nova - Handle special characters in database connection U...
14:09:19 ^ core reviews on stable/ocata would be helpful if anyone has time
14:09:51 we're starting to see newton bugs rolling in,
14:09:55 because people are just upgrading to newton now
14:10:37 #topic subteam highlights
14:10:43 Cells v2 (dansmith)
14:10:48 so,
14:11:16 we talked about the outstanding patch sets we have up. things like the quotas set
14:11:20 and we covered some recent bugs that cropped up, which I think we mostly have fixes for
14:11:39 including one where we, uh, press the limits of the network stack by trying to, uh, use all the connections
14:12:03 and I said I would send a summary of the cellsv2 session
14:12:08 which I totally might do
14:12:19 I think that's it. maybe?
14:12:28 and i'm back
14:12:45 mriedem: I'm done. I said amazing things. they were yuuge.
14:12:50 ok so we need https://review.openstack.org/#/c/465042/
14:12:51 patch 465042 - nova - Cache database and message queue connection objects
14:12:54 and start working on backports
14:13:16 yup
14:13:26 bauzas: scheduler
14:13:49 oh, whoops, I could see that bug being an issue
14:13:53 no huge discussion at the last meeting, apart from knowing which specs to review
14:14:19 I explained to edleafe that I was working on my series for the scheduler claims
14:14:39 and we did discuss how Jay is very good at presentations :p
14:14:45 that's it, small one.
14:15:03 ok, i don't know if there was an api meeting yesterday
14:15:08 since alex_xu is at the bug smash
14:15:20 johnthetubaguy: did you attend the api meeting?
14:15:29 ah, I did
14:15:37 we had a nice chat about the policy docs that are up for review
14:15:46 and the api-extension tidy-up stuff
14:16:10 the keystone policy meeting is starting to discuss how to make wider progress on the per-project admin thingy
14:16:26 #link https://review.openstack.org/#/q/topic:bp/policy-docs+status:open+project:+openstack/nova
14:16:46 #link https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/api-no-more-extensions-pike
14:16:50 that's all folks
14:16:57 ok thanks
14:17:04 gibi: notifications
14:17:09 focus is on searchlight notification additions:
14:17:16 #link https://review.openstack.org/#/q/topic:bp/additional-notification-fields-for-searchlight+status:open
14:17:22 and on 3 selected notification transformation patches:
14:17:28 #link https://review.openstack.org/#/c/396225/ and the series starting at #link https://review.openstack.org/#/c/396210
14:17:28 patch 396225 - nova - Transform instance.trigger_crash_dump notification
14:17:30 patch 396210 - nova - Transform aggregate.add_host notification
14:17:37 the next subteam meeting will be held on the 6th of June as I will be on vacation for the next two weeks
14:17:46 that is all
14:18:17 k
14:18:20 efried: powervm
14:18:27 No change since last time.
14:18:33 We're really at a point where we just need the reviews.
14:18:39 So removing powervm from the "subteams" section.
14:18:43 ok
14:19:05 the cinder new volume attach flow stuff is in the same boat
14:19:12 there is a series starting at https://review.openstack.org/#/c/456896/ with +2s
14:19:12 patch 456896 - nova - Add Cinder v3 detach call to _terminate_volume_con...
14:19:32 john and i have been helping on those but since i'm pushing some of the changes we need a 3rd core
14:19:51 that's it for subteam stuff
14:19:56 #topic stuck reviews
14:20:00 nothing in the agenda
14:20:05 anyone have anything they want to bring up here?
14:20:25 #topic open discussion
14:20:29 there is one item
14:20:34 Yep.
14:20:40 (mlakat) We'd like to get some advice on: What is the standard OpenStack way of running something (in this case an os.path.getsize()) with a timeout? This came up as part of: https://bugs.launchpad.net/nova/+bug/1691131 Basically what happened is that a broken NFS connection might block os.path.getsize(path) forever. Approaches tried:
14:20:41 Launchpad bug 1691131 in OpenStack Compute (nova) "IO stuck causes nova compute agent outage" [Undecided,In progress] - Assigned to Daniel Gonzalez Nothnagel (dgonzalez)
14:20:47 1) putting it on a separate green thread: Still blocks the main green thread forever
14:20:53 2) putting it on a separate Thread: We have no way of killing threads nicely in python
14:21:02 3) use the multiprocessing module: seems to work, but I am a bit worried about the overhead of it
14:21:13 * Someone must have solved this already in OpenStack
14:21:22 mlakat: I commented in the review: I think you need to address that problem at a different level.
14:21:29 don't we dispatch in the libvirt driver already?
14:21:52 Given that literally everything may be on NFS, if we've got fundamental issues like that, we can't use that technique everywhere.
14:21:59 right, i was going to say,
14:22:04 this seems like a giant game of whack-a-mole
14:22:11 mdbooth: That's why we put the issue on the agenda, it would be nice to get feedback on the issue and maybe some ideas on how this can be solved the proper way
14:22:15 mdbooth: that came up in channel yesterday, a related thingy
14:22:20 for me it's a design issue - the heartbeat should never stop even if there is a hanging fs
14:22:43 but if the thread is hanging, nothing will happen, so the heartbeat dying is actually useful?
14:22:48 FWIW, NFS in Linux has been a pita as long as I've been using it
14:23:12 i'm not sure what 'heartbeat' we're talking about
14:23:17 If the server goes away, you end up with a process stuck in uninterruptible sleep
14:23:25 And that's, well, uninterruptible
14:23:25 the update_available_resource periodic task in the compute?
14:23:25 the thing was we had an overload in the cloud and weren't able to stop any vm in this case
14:23:27 D state ftw
14:23:28 heartbeat = service up check (in DB)?
14:23:32 lyarwood: Indeed
14:23:45 I think this is best solved with monitoring tools
14:24:03 So how shall we handle/recover from situations where we have an I/O blocked NFS share?
14:24:07 We should blow the mount out of the water (is that possible now, never used to be) if it hangs
14:24:09 mdbooth: we have it in the monitoring
14:24:16 mriedem: yes it's the update_available_resource task
14:24:19 did os-privsep make this worse by accident, out of interest?
14:24:21 but we aren't able to stop the vm producing the overload
14:24:31 johnthetubaguy: i don't think it has anything to do with that
14:24:43 oh, this is a regular python call, just on an NFS filesystem
14:24:47 yes
14:24:50 yes, a stat call
14:24:52 in the end
14:24:55 uninterruptible sleep is a Linux kernel process state
14:24:56 which blocks.
14:25:00 This isn't a python issue
14:25:02 the moment os.path.getsize is called on the hanging nfs, the task gets stuck
14:25:07 * johnthetubaguy nodes at mlakat in a yes, and a hello nice to see you sense
14:25:17 heh nods
14:25:29 * mlakat nods back :-)
14:25:44 mdbooth: agreed
14:26:02 it's a deployment issue
14:26:19 so if this stat was threaded and timed out what then? the instances are still marked as running while they are actually in D-state right?
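
A minimal sketch of option 3 listed above (running the stat in a separate process and joining with a timeout), assuming Python 3; the helper name get_size_with_timeout is hypothetical and is not nova code. It also illustrates the caveat raised in the discussion: a child stuck in uninterruptible sleep (D state) on a dead NFS mount will not actually die on terminate(), so the timeout only detaches the caller from the hang rather than resolving it.

    # Illustrative sketch only, not nova code. Runs os.path.getsize() in a
    # separate process and gives up after a deadline.
    import multiprocessing
    import os

    def _getsize(path, queue):
        # Runs in the child; may block forever in D state on a dead NFS mount.
        queue.put(os.path.getsize(path))

    def get_size_with_timeout(path, timeout=10):
        queue = multiprocessing.Queue(maxsize=1)
        proc = multiprocessing.Process(target=_getsize, args=(path, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # Best effort: SIGTERM cannot kill a process stuck in
            # uninterruptible sleep, so this only stops us waiting for it.
            proc.terminate()
            raise TimeoutError('getsize(%s) timed out after %ss'
                               % (path, timeout))
        if queue.empty():
            # The child exited without a result, e.g. the stat itself raised.
            raise OSError('getsize(%s) failed in the child process' % path)
        return queue.get()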
14:26:19 mkoderer: when you say you can't stop any vm in the cloud, you mean just any vm on this compute host right?
14:26:20 I could have sworn we fixed this somewhere already, for something... like in disk cache or something
14:26:33 or does this lock every vm on every compute running on the same nfs share?
14:26:46 mriedem: the NFS is mounted on several compute hosts
14:27:07 so basically we lost control of a big portion of the cloud
14:27:14 i think i would reconsider using NFS
14:27:20 mriedem: +1
14:27:42 so the message is: don't use NFS?
14:27:47 and don't do any live migration?
14:27:50 really?
14:27:58 use iSCSI or something like that instead, or ceph
14:27:59 we can't thread out every os.path python call in the libvirt driver
14:28:14 just because NFS can lock up your entire cloud
14:28:29 mriedem: It's not just os.path
14:28:38 the message is, use monitoring tools to detect the issue, and then blow up the stuck mount
14:28:44 It's presumably literally anything which touches the filesystem
14:28:50 And they're just hitting it here first
14:29:18 ok
14:29:22 just to get my head straight, the problem is the system is broken, NFS has locked up, but we look bad because we also lock up and the computes appear dead?
14:29:50 johnthetubaguy: I'd say so
14:29:54 yeah
14:29:56 yes
14:29:56 johnthetubaguy, yes, and we have no way of stopping instances causing the high workload
14:30:01 sounds like we are doing the correct thing
14:30:16 mlakat: you can't kill them via virsh?
14:30:30 no, I would like a design where deletion is also possible in case of an overload
14:30:54 it's a good question, does virsh still work for the delete?
14:31:06 mriedem, we are still looking at virsh's reaction in this situation, we are in the process of testing it.
14:31:25 I am currently testing this, and it seems to be hanging too...
14:31:41 Assuming virsh is not cooperating, how would you then "kill" the busy domain?
14:31:45 ok that's tricky
14:32:01 we only really talk via libvirt, so game over I think
14:32:04 mriedem, you said to blow away the mountpoint?
14:32:15 does qemu have a builtin userspace NFS client?
14:32:17 this NFS isn't mounted by Nova I guess?
14:32:33 johnthetubaguy: It would have been, actually
14:32:33 like you can't do a force cinder detach at this point?
14:32:38 It's a volume
14:32:48 ah, I was assuming this was the whole instances dir
14:32:55 Although it could equally be the shared instance dir
14:33:17 However, force unmount of nfs is apparently *still* not possible robustly in Linux
14:33:31 dang
14:33:40 for us it's currently only volumes that are affected. But all VMs that have a volume from the affected NFS can't be deleted
14:34:54 if things getting stuck on NFS is a big problem then it's worth considering NFS soft mounts instead of hard mounts, but I worry that soft mounts create other problems which might be worse
14:35:15 Yep, soft mounts prevent hangs
14:35:26 However, they cause IO errors during transient load spikes
14:35:35 So...
14:35:39 yeah and that's probably an even worse situation
14:35:45 NFS is what it is
14:35:45 yep
14:36:11 ok seems it's not solvable in nova then.. :(
14:36:22 so i'm not sure we're going to come to any conclusion here - i don't really think this is nova's problem to solve, honestly
14:36:23 I think Nova is the wrong place to address it, unfortunately
14:36:38 yeah I agree with you...
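
For context on the soft-mount trade-off discussed above: a hard NFS mount (the default) retries I/O indefinitely, which is what leaves processes stuck in uninterruptible sleep, while a soft mount gives up after the timeo/retrans limits and returns an I/O error to the application. The line below is only an illustration of the knobs involved; the server, export, and mount-point names are placeholders, the values are examples, and, as noted in the discussion, soft mounts trade hangs for EIO errors under transient load, so this is not a recommendation.

    # Placeholder fstab-style entry showing soft-mount options; values are examples only.
    # timeo is in tenths of a second per retry, retrans is the retry count.
    nfsserver:/export/volumes  /var/lib/nova/mnt/example  nfs  soft,timeo=600,retrans=2  0  0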
14:37:11 even if you thread things out and log an error on timeout, you have to address and resolve the underlying issue at some point outside of nova most likely
14:37:20 by cleaning up the locked mount i guess
14:37:45 sounds like a monitoring agent, or something like that?
14:37:47 mriedem: Which can sometimes only be achieved with a reboot
14:37:48 threading out anything that touches the filesystem is going to be a mess, unless you're talking about monkey patching the os module or something
14:38:07 sounds like some stat call to each NFS mount in some monitoring loop would catch this OK?
14:38:10 mriedem, threads are not an option
14:38:33 johnthetubaguy, the action is still not clear in that case.
14:39:02 anyway, i'm -1 to what's proposed in the change, and i think we've spent enough time on it in this meeting
14:39:12 we can move to -nova if you want to discuss further
14:39:16 but let's end this meeting
14:39:17 Thank you for your time.
14:39:20 np
14:39:22 mlakat: true
14:39:22 thx
14:39:30 #endmeeting
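
Following on from the suggestion above of a stat call against each NFS mount in a monitoring loop: a rough sketch of what such an out-of-band probe could look like, run entirely outside nova and assuming Python 3. The mount list, interval, timeout, and alerting hook are all placeholders; the same D-state caveat applies, so a hung probe child is simply abandoned and the mount flagged for an operator to clean up (possibly only via a reboot, as noted above).

    # Rough sketch of an out-of-band NFS health probe; not nova code.
    import multiprocessing
    import os
    import time

    MOUNTS = ['/var/lib/nova/mnt/example']   # placeholder: NFS mount points to watch
    CHECK_INTERVAL = 60                      # seconds between probe rounds
    STAT_TIMEOUT = 10                        # seconds before a mount is flagged

    def _probe(path):
        # May block forever in D state if the mount is dead.
        os.stat(path)

    def mount_is_healthy(path, timeout=STAT_TIMEOUT):
        proc = multiprocessing.Process(target=_probe, args=(path,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # Best effort; a child stuck in uninterruptible sleep won't die.
            proc.terminate()
            return False
        return proc.exitcode == 0

    if __name__ == '__main__':
        while True:
            for mount in MOUNTS:
                if not mount_is_healthy(mount):
                    # Placeholder: hook into the deployment's alerting here so
                    # an operator can clean up the stuck mount.
                    print('ALERT: NFS mount %s appears hung or broken' % mount)
            time.sleep(CHECK_INTERVAL)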