14:00:36 #startmeeting Nova Live Migration
14:00:37 Meeting started Tue Sep 13 14:00:36 2016 UTC and is due to finish in 60 minutes. The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:38 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:40 The meeting name has been set to 'nova_live_migration'
14:00:46 o/
14:00:49 o/
14:00:51 o/
14:00:57 hi all
14:01:17 o/
14:01:33 agenda: https://wiki.openstack.org/wiki/Nova/Newton_Release_Schedule
14:02:08 release schedule says RC1 16th Sept
14:02:23 really EOD dansmith time on thursday
14:02:26 that's friday
14:02:49 * mriedem updates wiki
14:03:44 So we will go over the newton-rc-potential bugs
14:03:49 but first
14:04:01 #topic CI
14:04:09 anything to do on CI ?
14:04:36 I haven't kept track - so I wanted to make sure there is nothing urgent to attend to ?
14:05:05 i'm trying to work on LM job with grenade, any advice would be helpful https://review.openstack.org/#/c/364809/
14:05:23 heard that mriedem or dansmith tried to do something with that, but haven't found any patches yet
14:06:35 pkoniszewski: Relevant to the bug I'm working on, btw. Would definitely be good to get that.
14:07:08 i haven't
14:07:15 #link Add new job to test live migration with grenade https://review.openstack.org/#/c/364809/
14:07:18 we'd basically need a multinode grenade job to run live migration
14:07:30 we already have multinode grenade jobs
14:07:35 where n-cpu is backlevel on one compute
14:07:54 i think we'd just need to flip the live migration flag in tempest in that job to test it
14:08:06 but it would probably be experimental queue to start
14:08:10 okay, i will check that
14:08:12 unless we used the in-tree hook
14:08:36 so I wanted to use in-tree hook, but those tests that we have right now are not enough
14:08:58 we need tests that will live migrate an instance back and forth, in our tests we always just move an instance to another host and validate it
14:09:19 pkoniszewski: You thinking of cleanup bugs?
14:09:50 what do you mean?
14:09:50 Like the bug where you couldn't cold migrate back to a host because the instance directory hadn't been deleted?
14:10:43 also about this, but the priority for me is to check whether we can move VM between two versions in a basic scenario
14:10:52 Ok
14:12:19 let's move on
14:12:30 #topic Bugs
14:12:46 Starting with live migration bugs on https://bugs.launchpad.net/nova/+bugs?field.tag=newton-rc-potential
14:12:53 https://bugs.launchpad.net/nova/+bugs?field.tag=newton-rc-potential
14:13:20 I think the first is https://bugs.launchpad.net/nova/+bug/1605016
14:13:21 Launchpad bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] - Assigned to Matthew Booth (mbooth-9)
14:13:31 that's the one mdbooth is looking at
14:13:43 i see you have one patch up
14:13:50 So, it's taken me a while to get a 'reproducer'
14:14:11 And I'm still not 100% sure I'm testing the right thing, but I can see something
14:14:29 I don't currently think this is the show stopper it sounded like last week
14:14:33 Because of 2 things
14:14:54 Firstly, the period after post-copy switch over is really short, because post-copy is really fast
14:15:03 and efficient
14:15:16 mdbooth: That depends a little on how painful your workload is
14:15:26 Secondly, because even if you leave it alone and *never* call the network fixup stuff
14:15:33 Something fixes it up anyway
14:15:40 Max 60 seconds outage
14:15:47 I'd love to know what that is, btw
14:15:59 I looked, but I don't know enough about neutron
14:16:05 mdbooth: Yeh it's worrying knowing something is lurking fiddling with the config but not knowing what
14:16:14 mdbooth, that's with DVR
14:16:18 Yeah
14:16:41 haleyb are you lurking ?
14:16:43 That said, I'm working on fixing it anyway
14:17:03 First patch is here, and it's an RPC change: https://review.openstack.org/#/c/369423/
14:17:13 PaulMurray: sort-of, neutron meeting now too
14:17:33 I'd love to get eyes on ^^^ from somebody with a deep understanding of what those calls actually do for various backends
14:17:37 I think the issue is there is a variety of different neutron backend implementations and we don't know how they all behave
14:17:41 I'm just shifting them around
14:17:56 PaulMurray: Yup.
14:18:19 mdbooth, but 60 sec outage is a bummer
14:18:39 My initial testing of the above patch suggests it works fine, and slightly reduces the network outage in the non-post-copy case.
14:19:36 #action please review https://review.openstack.org/#/c/369423/
14:19:47 do you have the follow on coming ?
14:20:07 to be clear on https://review.openstack.org/#/c/369423/ ,
14:20:10 I only just knocked that out this morning, so not yet :)
14:20:18 if it doesn't make newton, it's probably not going to be backported b/c of the rpc change
14:20:25 mdbooth, that was hours ago !
14:20:27 However, the follow-on should be pretty simple in comparison, and won't involve rpc change
14:20:38 :)
14:20:52 mriedem: Right. It would be great to get ^^^ in, even if we don't get the follow-on in
14:21:06 So please, eyes on that urgently.
14:21:17 ok
14:21:18 next
14:21:28 https://bugs.launchpad.net/nova/+bug/1615613
14:21:30 Launchpad bug 1615613 in OpenStack Compute (nova) "Live migration always fails when VNC/SPICE is listening at non-local, non-catch-all address" [High,In progress] - Assigned to Paulo Matias (paulo-matias)
14:21:30 The follow-on will be a simple backport.
14:21:42 anyone know anything about this ?
14:22:33 There is a revert here with one +2: https://review.openstack.org/#/c/368732/
14:23:17 then: https://review.openstack.org/#/c/358599/6
14:23:34 So reviews for those as well please
14:23:57 pkoniszewski: for https://review.openstack.org/#/c/368732/
14:24:09 have you tested N->M and M->N and verified it fixes that vnc/spice console issue?
14:24:17 yes, i did
14:24:28 ok, great
14:24:36 so the problem was backward compatibility, i.e. live migrating from newton to mitaka
14:24:39 btw, pkoniszewski is once again our rc week live migration savior :)
14:24:57 it solves the issue so let's just prepare for fixing the check in ocata
14:25:37 i mean this change https://review.openstack.org/#/c/358599/ is a requirement to finally move the check in Ocata
14:26:02 pkoniszewski, are you saying we only need the first revert in newton
14:26:16 the one prepared by johnthetubaguy, yes
14:26:32 understood
14:26:58 and if we can land "fill destination check data..." change in Newton, then we will be able to move the check in Ocata
14:27:13 and then land all Markus's changes in Ocata to fix serial console
14:27:32 https://review.openstack.org/#/c/368732/ is approved now
14:28:39 #action review https://review.openstack.org/#/c/358599/
14:28:59 the next is on consoles again: https://bugs.launchpad.net/nova/+bug/1595962
14:29:00 Launchpad bug 1595962 in OpenStack Compute (nova) "live migration with disabled vnc/spice not possible" [Medium,In progress] - Assigned to Markus Zoeller (markus_z) (mzoeller)
14:29:23 this will be in a merge conflict once the revert is merged
14:30:07 it also has a +2 from bauzas and a -1 from alaski
14:30:15 on https://review.openstack.org/#/c/335132/
14:30:58 with the revert that one may not be necessary
14:31:05 at least that was my understanding
14:31:56 alaski, do we need to test it to confirm again with the revert just approved
14:32:31 or were you really asking a question
14:33:02 I'm hoping someone can confirm my understanding
14:33:17 we could also get https://review.openstack.org/#/c/338416/ which should provide some testing
14:33:25 pkoniszewski, you were on both patches - any thoughts ?
14:33:36 or can you try it out ?
14:33:40 yeah, what exactly are you asking about alaski?
14:33:50 and i will try it anyway
14:34:05 with the revert at https://review.openstack.org/#/c/368732/ is https://review.openstack.org/#/c/358599/ still necessary?
14:34:22 yes, it is
14:34:58 so the point of this issue was that we moved the check but we never moved the code that was responsible for populating migrate data with graphic addresses
14:35:21 right. but the revert moves the check back
14:35:34 exactly
14:36:05 so the check should now be in a place that has the graphic addresses
14:36:12 we just can't move the check to check_can_live_migrate_source when older release does not populate data in check_can_live_migrate_destination
14:37:04 once we merge the change you just pasted here we will be able to move the check in next release
14:37:05 I see. so the patch from markus_z no longer fixes a bug after the revert, but it is necessary to move the check in O?
14:37:21 because older release will populate the data earlier, in check at destination
14:37:25 right now Mitaka can't do that
14:38:12 just fyi, RPC chain during prechecks looks like conductor -> destination compute (check_can_live_migrate_destination) -> source compute (check_can_live_migrate_source)
14:38:12 yeah. so the patch is needed. but not for a bug fix, but so that we can move code in O
14:38:26 exactly
14:38:51 okay. I'll let markus_z rebase and see where it's at
14:39:09 thanks alaski
14:39:33 so, well, you are right, we can merge markus_z changes still in Newton, but the code needs to be on top of the revert
14:40:11 the next one is: https://bugs.launchpad.net/nova/+bug/1621709
14:40:12 Launchpad bug 1621709 in OpenStack Compute (nova) "There is no allocation record for migration action" [Medium,In progress] - Assigned to Alex Xu (xuhj)
14:41:46 Does anyone know what is going on with this? there is a chain of patches
14:42:27 is alex_xu around?
14:44:15 mriedem, do you know about this one ? ^^^^
14:44:44 PaulMurray: alex_xu wants to get the bottom change into newton
14:44:50 to stop leaking resources
14:44:59 just the bottom change
14:45:23 this one: https://review.openstack.org/#/c/369147/4
14:45:28 mriedem, ^
14:47:21 reading the comment it looks like that is what he meant
14:47:27 next is: https://bugs.launchpad.net/nova/+bug/1622854
14:47:28 Launchpad bug 1622854 in OpenStack Compute (nova) "pci: double pci migration is putting vm in ERROR" [Medium,Confirmed]
14:47:35 yes that's the one he wants in newton
14:47:39 he said the rest could be ocata
14:47:44 thanks
14:47:52 no one is working this last one
14:49:08 Incidentally, I assume we're anticipating an rc2? i.e. There's still an opportunity to fix bugs after Thursday?
14:49:10 mriedem, there is a comment on this last bug saying we can go without it and do a back port later
14:49:57 mdbooth, I think that's usually the case, but only for really critical bugs
14:50:06 ?
14:50:27 need to ask the boss
14:50:45 there is usually another rc for translations
14:50:48 PaulMurray: Does bug 1605016 come under that description? If not, I'll likely go do other stuff.
14:50:49 bug 1605016 in OpenStack Compute (nova) "Post copy live migration interrupts network connectivity" [High,In progress] https://launchpad.net/bugs/1605016 - Assigned to Matthew Booth (mbooth-9)
14:51:35 * mdbooth can't gauge the chances of that patch getting in by Thursday, but I'd guess they're low, right?
14:52:14 mdbooth, I would hope the chances are good for anything that actually works and is considered worth the tag
14:52:20 they get attention
14:52:33 mriedem, ^^
14:53:20 dansmith didn't have the bandwidth to look when I spoke to him earlier. I didn't appreciate that might be a complete blocker.
14:53:33 dan is working on the placement stuff
14:53:43 i've tagged the bug
14:53:44 so it's on the list
14:53:53 and it's in the etherpad
14:54:04 it's not a trivial change though, i haven't reviewed yet
14:54:12 and if it's not required for post-copy to work, then it might slide
14:54:20 mriedem: So, the critical bit is just the rpc change, because the follow-on will be backportable
14:54:34 it's way too disruptive for this point in the cycle, IMHO.. post-copy is still optional yes?
14:54:36 if the rpc change is a noop it might be ok
14:54:37 mriedem: And it's not required for post-copy
14:54:59 if the rpc change is a mistake though and we find that out later, we can't revert it
14:55:07 dansmith: Yup, this don't break post-copy.
14:55:24 It just makes it sub-optimal when using dvr
14:55:40 mdbooth: so only when using post-copy and dvr right?
14:55:44 And post-copy is also disabled by default iirc?
14:55:49 dansmith: Yup
14:55:53 mdbooth: IMHO it's much worse than suboptimal
14:55:55 it is disabled by default
14:56:02 right, so.. I would not make this change before the release, IMHO
14:56:13 dansmith: Cool, thanks.
14:56:31 * mdbooth will still work on it, just no longer top priority
14:56:54 so are we dropping the newton-rc-potential tag for that ?
14:56:59 Sounds like it
14:57:02 :-(
14:57:06 ok
14:57:10 yeah, drop IMHO
14:57:21 we're out of time
14:57:28 finished just before the bell
14:57:33 thanks for coming
14:57:34 hi, I have a backport patch for stable/mitaka, https://review.openstack.org/#/c/353851, please review the same and give your suggestions when you get some time, thank you
14:58:00 thanks abhishekk - sorry no time for discussion
14:58:08 in the open section
14:58:13 bye
14:58:19 #endmeeting