14:00:00 <PaulMurray> #startmeeting Nova Live Migration
14:00:00 <openstack> Meeting started Tue May 10 14:00:00 2016 UTC and is due to finish in 60 minutes.  The chair is PaulMurray. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:04 <openstack> The meeting name has been set to 'nova_live_migration'
14:00:13 <tdurakov> o/
14:00:15 <mdbooth> o/
14:00:15 <diana_clarke> o/
14:00:16 <PaulMurray> hi all
14:00:22 <paul-carlton2> o/
14:00:25 <luis5tb> hi!
14:00:32 <davidgiluk> o/
14:00:34 <mriedem> o/
14:00:41 <jlanoux> o/
14:00:42 * kashyap waves
14:01:04 <PaulMurray> For those with short memory: https://wiki.openstack.org/wiki/Meetings/NovaLiveMigration
14:01:10 <PaulMurray> Agenda ^^
14:01:10 <pkoniszewski> o/
14:01:21 <abhishek> 0/
14:01:35 <andrearosa> hi
14:01:42 <PaulMurray> #topic CI
14:01:55 <PaulMurray> tdurakov has been away - so welcome back
14:02:16 <PaulMurray> tdurakov, any update ?
14:02:46 <tdurakov> PaulMurray: as discussed mostly working on my spec now, but happy to help with storage pool for this
14:03:03 <tdurakov> any details on that?
14:03:14 <PaulMurray> What was the experimental stability like ?
14:03:43 <tdurakov> the same, going to check whether there are xenial nodes in nodepool already
14:03:47 <mriedem> i noticed the experimental live migration job wasn't running with latest libvirt/qemu yet?
14:03:55 <tdurakov> mriedem: true
14:04:12 <mriedem> i thought that job was going to use that new repo that installs latest libvirt/qemu?
14:04:14 <tdurakov> haven't seen xenial multinode yet
14:04:31 <mriedem> the one that markus and tonyb worked on
14:04:32 <jlanoux> tdurakov: What about the coverage of the Mitaka features? Are you done?
14:04:58 <tdurakov> mriedem: is that finished?
14:05:04 <pkoniszewski> mitaka features mostly aren't covered
14:05:06 <pkoniszewski> jlanoux: ^
14:05:14 <PaulMurray> I'm just getting back to the CI, did markus get it working ?
14:05:27 <PaulMurray> I mean with the latest versions of libvirt etc.
14:05:33 <mriedem> yeah i thought so, awhile ago
14:05:35 <jlanoux> pkoniszewski: ok
14:05:49 <PaulMurray> mriedem, ok, so we are playing catchup then
14:05:53 <tdurakov> mriedem: will add this to the job today then
14:05:58 <mriedem> talk to markus_z and/or tonyb about it
14:06:08 <tdurakov> mriedem: ok
14:06:15 <mriedem> last i knew they just wanted to move the git repo under the openstack namespace
14:06:23 <mriedem> but they had something in project-config for using this
14:06:51 <tdurakov> mriedem: will ask them after this meeting
14:07:10 <mriedem> #action tdurakov to follow up with markus_z/tonyb about the trunk libvirt/qemu repo
14:07:20 <tdurakov> yup
14:07:50 <PaulMurray> tdurakov, let me know if you need help
14:08:11 <kashyap> mriedem: On a slightly related note, there's now a new DevStack plugin up for review (revived from an old review) that lets one install custom libvirt / QEMU from tar releases -- https://review.openstack.org/#/c/313568/
14:08:19 <kashyap> It's an external plugin though.
14:08:40 <tdurakov> PaulMurray: as a plan B for that, who could help with adding multinode xenial to nodepool?
14:08:41 <mriedem> ok, but those are like daily builds?
14:09:08 <PaulMurray> jlanoux, do you know anything about nodepool ?
14:09:24 <jlanoux> PaulMurray: nope
14:09:25 <mriedem> tdurakov: i wonder if pabelanger could help with that in infra
14:09:28 <PaulMurray> if not I will go to andreaf
14:09:53 <PaulMurray> (I hope that's the right nick)
14:10:02 <PaulMurray> and see if he can help
14:10:05 <tdurakov> smth like that
14:10:06 <tdurakov> https://github.com/openstack-infra/project-config/blob/a755ee6d0257faafb4204a843f5265e935689639/nodepool/nodepool.yaml#L73
14:10:51 <kashyap> mriedem: If there are daily tarballs available, yes
14:11:09 <mriedem> kashyap: ok, i guess i'd rather gate on actual releases, rather than daily builds
14:11:14 <mriedem> we already have problems with stability
14:11:20 <pkoniszewski> we don't need daily tarballs here i think
14:11:24 <kashyap> mriedem: It by default uses official releases
14:11:49 <mriedem> we're going to need multinode xenial regardless, so i think that's time well spent
14:12:22 <kashyap> pkoniszewski: It's not daily, it uses official tar balls, from here (for QEMU) and similar URL for libvirt: http://wiki.qemu-project.org/download/
14:12:53 <pkoniszewski> got it
14:13:30 <PaulMurray> #link Devstack plugin for qemu/libvirt versions: https://review.openstack.org/#/c/313568/
14:13:50 <PaulMurray> I like the title of that ^^ "First version"
14:14:24 <kashyap> PaulMurray: :-) Yeah, they could improve the commit messages
14:14:30 <PaulMurray> moving on slightly
14:14:39 <PaulMurray> #topic Libvirt Storage Pools
14:14:52 <PaulMurray> What do we need for CI for storage pools ?
14:15:06 <mriedem> for the dependent refactor,
14:15:07 <mdbooth> I addressed that in the recent spec update
14:15:14 <paul-carlton> an lvm, rbd, ploop
14:15:16 <mriedem> i'd like at least an lvm-backed job
14:15:19 * mdbooth digs it out
14:15:26 <mriedem> we have rbd and ploop ci already
14:15:40 <paul-carlton> plus shared storage and non shared
14:15:43 <mriedem> we're missing lvm - it might start as an experimental job for nova
14:15:50 <mriedem> ceph is shared storage
14:15:55 <mdbooth> Note that Jenkins currently only tests the Qcow2 and Rbd(ceph) backends
14:15:56 <mdbooth> in the gate. All current libvirt tempest jobs run by Jenkins use the
14:15:56 <mdbooth> default Qcow2 backend except gate-tempest-dsvm-full-devstack-plugin-ceph, which
14:15:56 <mdbooth> uses Rbd. We additionally have coverage of the ploop backend in
14:15:56 <mdbooth> check-dsvm-tempest-vz7-exe-minimal run by Virtuozzo CI. This means that we
14:15:56 <mdbooth> currently have no gate coverage of the Raw and Lvm backends.
14:15:58 <andreaf> PaulMurray: I would ask in the -infra room, there is already a xenial single node image https://github.com/openstack-infra/project-config/blob/a755ee6d0257faafb4204a843f5265e935689639/nodepool/nodepool.yaml#L88 so it shouldn't be too difficult to setup a multinode env based on xenial - clarkb was working a lot on setting up the original multinode environment I believe
14:16:00 <mriedem> lvm/ephemeral is non-shared
14:16:07 <tdurakov> mriedem, what about adding this to existing live-migration job instead?
14:16:14 <paul-carlton> nfs shared would be good too, different from ceph
14:16:25 <PaulMurray> thanks andreaf
14:16:33 <davidgiluk> shared nfs migration tests have been quite good for finding bugs
14:16:49 <mdbooth> We're only interested in a limited set of tests for these backends
14:16:53 <mriedem> tdurakov: does the live migration job test nfs?
14:16:58 <tdurakov> yes
14:17:02 <tdurakov> and ceph too
14:17:07 <mdbooth> We're not going to run the full suite against each backend, are we?
14:17:22 <mriedem> can the live migration job also use lvm?
14:18:05 <mriedem> mdbooth: i was more concerned with the big refactor
14:18:09 <tdurakov> mdbooth: I'd prefer to start with live-migration, It would be expensive to have full multinode jobs for all backends
14:18:46 <mriedem> i'm not talking about a multinode job for lvm, just an experimental queue job that runs on nova and could run against these refactor changes
14:18:46 <mdbooth> mriedem: Right. We need coverage, but how complete?
14:18:47 <tdurakov> mriedem: I thought this is the plan
14:18:50 <pkoniszewski> one question - won't it take too much time to execute all tests in gate if we also add storage pools to existing CI? I mean, we still need to cover all mitaka features there
14:19:30 * mdbooth is thinking about our poor testing resources
14:19:32 <mriedem> well, the live migration job isn't going to test snapshots right?
14:19:42 <mriedem> the experimental queue is on-demand
14:19:51 <tdurakov> mriedem: it's not testing yet
14:20:07 <tdurakov> but we could enable this later, after fixing stability issues
14:20:22 <mriedem> enable what later?
14:20:34 <tdurakov> test snapshots
14:20:41 <pkoniszewski> snapshots in LM CI? it's totally different thing, isn't it?
14:20:50 <mriedem> pkoniszewski: yes
14:21:04 <mriedem> that's kind of my point, we don't need to test snapshots in the LM job
14:21:05 <tdurakov> renaming will help..
14:21:32 <PaulMurray> I thought the plan was to split live migration tests from other things - no point in extending it out from there
14:21:38 <pkoniszewski> yeah, +1
14:21:49 <mriedem> but i think it's useful to have a job, in the experimental queue, that runs with lvm which we can run on mdbooth's refactor series which will test the compute api and virt driver for things that the LM job won't test
14:21:54 <pkoniszewski> can't find a reason to mix live migration with other things, it is complex enough
14:22:07 <mdbooth> mriedem: Yup. Also the 'Raw' backend, don't forget.
14:22:18 <mriedem> mdbooth: yeah
14:22:20 <mdbooth> Bizarrely we don't currently have coverage of that, either
14:22:21 <tdurakov> we have job that already tests 3 different configs, we could expand it with lvm, and add all multinode actions there
14:22:34 <mriedem> mdbooth: we could maybe change the ceph job to use raw...
14:23:00 <mdbooth> mriedem: ? Then it wouldn't be the ceph job. Have I misunderstood?
14:23:09 <mriedem> oh right, hehe
14:23:12 <PaulMurray> tdurakov, I would expect the LM job to test enough backends
14:23:24 <mriedem> mdbooth: forgot that it had its own special imagebackend
14:23:32 <PaulMurray> so it seems good to put lvm there if it's not already
14:23:36 <mdbooth> It's special
14:23:45 <mriedem> well we have gate-tempest-dsvm-full (n-net) and gate-tempest-dsvm-neutron-full, those both use qcow2 right?
14:24:02 <mdbooth> Everything which doesn't use something explicitly uses qcow2
14:24:13 <mriedem> so maybe we make one of those use raw
14:24:15 * mdbooth audited them the other day
14:24:22 <mdbooth> mriedem: Makes sense
14:24:44 <mriedem> and then we just have 1 new experimental queue job for lvm
14:24:55 <mriedem> now having said this, changing one of the integrated gate jobs is i think branchless,
14:25:07 <mdbooth> mriedem: We could switch one of the other jobs to lvm, right?
14:25:08 <mriedem> so if changing it to use raw introduces a bunch of failures....that would be bad
14:25:14 <mdbooth> There are plenty of them
14:25:25 <mriedem> i have a feeling lvm will be racey
14:25:45 <mriedem> just based on what i've seen with the lxc job that uses it
14:25:53 <mdbooth> Hmm, ok. Of course we want to know that, but yeah...
14:26:05 <mriedem> anyway, i think i can hack up a devstack-gate change to test lvm and see how it looks
14:26:21 <mriedem> same for raw
14:26:53 <PaulMurray> mriedem, shall we call that an action - or just thinking out loud
14:26:59 <mriedem> sign me up
14:27:27 <PaulMurray> #action mriedem to hack up a devstack-gate change to test lvm
14:27:50 <mdbooth> (and Raw)
14:28:18 <mriedem> #undo
14:28:31 <mriedem> #action mriedem to hack up devstack-gate changes to test lvm and raw image backends
14:28:54 <mriedem> i'll also review https://review.openstack.org/#/c/302117/ after this meeting
14:29:31 <mdbooth> mriedem: Appreciated, thanks
14:29:47 <PaulMurray> paul-carlton, mdbooth are the two follow-on specs ready for broader review - looks like they still need subteam review
14:29:49 <PaulMurray> ?
14:29:56 <paul-carlton> yep
14:30:17 <mdbooth> We discussed 1 aspect of libvirt storage pools this morning
14:30:27 <paul-carlton> #link https://review.openstack.org/#/c/310538/
14:30:51 * mdbooth hasn't looked at that one in detail, yet
14:30:52 <paul-carlton> #link https://review.openstack.org/#/c/310505/
14:31:18 <mdbooth> paul-carlton: Ah, I see you've updated it. Need to re-review.
14:31:50 <PaulMurray> good - so its in hand - anyone else can help review too
14:32:18 <PaulMurray> #topic Specs
14:32:41 <PaulMurray> for specs: https://etherpad.openstack.org/p/newton-nova-priorities-tracking
14:32:55 <PaulMurray> if your spec isn't there just add it
14:33:41 <PaulMurray> Does anyone have one they want to mention ?
14:34:12 <luis5tb> can we briefly discuss about this: https://review.openstack.org/#/c/301509/
14:34:35 <PaulMurray> go ahead
14:34:49 <luis5tb> andrearosa suggested to include information about migration type in the migration object
14:35:09 <luis5tb> something like "postcopy-status" --> disabled/enabled/active
14:35:13 <andrearosa> luis5tb: it is not mandatory
14:35:17 <mdbooth> luis5tb: Is that spec related to https://review.openstack.org/#/c/306561/ ?
14:35:19 <luis5tb> what is your view about that?
14:35:37 <andrearosa> something I'd like to have more opinions on
14:35:45 <luis5tb> me too
14:35:54 <luis5tb> I think it could be a good idea
14:36:12 <tdurakov> migration works not only for libvirt, will it be valid to expose post-copy over the migration entity?
14:36:20 <luis5tb> also, I think we need to at least include information about the memory iterations, besides the remaining data and the other stats already there
14:36:24 <mdbooth> tdurakov: I think not
14:36:29 <pkoniszewski> i don't think that we want to expose low level details through API
14:36:43 <mdbooth> I don't think we should expose iterations, either
14:36:44 <tdurakov> mdbooth: so, here is the answer
14:36:44 <paul-carlton> nope, better to let the driver code work out if abort is allowed
14:37:03 <paul-carlton> i.e. have we switched to post copy
14:37:38 * mdbooth has a slightly better understanding of post-copy this week than I had last week
14:38:08 <tdurakov> paul-carlton: what about saving such switches to instance-actions instead?
14:38:11 <mdbooth> In my view, it's something which should just happen when the user requests force completion
14:38:33 <tdurakov> as a separate instance-action-event step
14:38:35 <davidgiluk> paul-carlton: Be a little careful of races around that; if you do 'have we switched to postcopy? No - ok, abort' then you might switch to postcopy between asking and aborting
14:38:52 <paul-carlton> that would be ok I guess but current code doesn't look at that
14:39:22 <luis5tb> libvirt will deny the abort if the switch to postcopy has already been triggered anyway
14:39:34 <tdurakov> davidgiluk: races story https://review.openstack.org/#/c/287997/
14:39:34 <pkoniszewski> well, we don't save an instance action when we increase downtime during LM
14:39:37 <paul-carlton> davidgiluk, yep, you'd need to take a lock, check, then proceed
14:39:52 <tdurakov> pkoniszewski: what about start doing this?
14:39:57 <pkoniszewski> why?
14:40:21 <tdurakov> instance actions are pretty explicit, they even allow storing tracebacks
14:40:34 <tdurakov> would be useful to get details on migrations
14:40:38 <tdurakov> thoughts?
14:40:44 <paul-carlton> but checking that would not prevent the race
14:40:48 <davidgiluk> tdurakov: Yes, especially if things go wrong
14:41:26 <andrearosa> tdurakov: yes that was my idea. I'd like to have something tell me what happened. I do not have any real use cases atm but I bet that for debugging purpose it could be handy
14:41:39 <PaulMurray> slightly confused, are you thinking of having lots of instance actions for a migration ?
14:41:45 <pkoniszewski> aren't logs enough for debugging purposes?
14:41:48 <paul-carlton> better to have the migration monitor thread in the driver get a lock on the instance before switching to post copy so other thread can't abort it
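The lock-based ordering paul-carlton describes (and the check-then-act race davidgiluk warns about above) can be sketched roughly as follows; all names here are hypothetical, this is not actual Nova code:

```python
import threading


class MigrationMonitor:
    """Sketch: serialize the post-copy switch against abort with one lock,
    so 'have we switched yet?' and the resulting action happen atomically."""

    def __init__(self):
        self._lock = threading.Lock()
        self._post_copy_active = False
        self._aborted = False

    def switch_to_post_copy(self):
        with self._lock:
            if self._aborted:
                return False  # migration already aborted, too late to switch
            self._post_copy_active = True
            return True

    def abort(self):
        with self._lock:
            if self._post_copy_active:
                # libvirt would deny the abort at this point anyway;
                # checking under the lock avoids the check-then-act race
                return False
            self._aborted = True
            return True
```

Whichever thread takes the lock first wins; the loser sees the updated state instead of racing past a stale check.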
14:41:57 <tdurakov> there are already separate steps for migration
14:42:09 <pkoniszewski> we will save tons of instance actions per migration if we go this way
14:42:21 <tdurakov> pkoniszewski: why tons?
14:42:33 <tdurakov> save only several event from libvirt
14:42:35 <tdurakov> not all
14:42:48 <tdurakov> I'm not proposing to store whole progress
14:42:50 <pkoniszewski> i just mentioned downtime which is increased iteratively during LM process
14:42:54 <pkoniszewski> do we need to save it?
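For context, the iterative downtime increase pkoniszewski refers to works roughly like the sketch below; the parameter names and the linear stepping are assumptions for illustration, not the exact Nova algorithm:

```python
def downtime_steps(max_downtime=500, steps=10, delay=75):
    """Yield (wait_seconds, downtime_ms) pairs, stepping the allowed
    live-migration downtime up from a small base toward the maximum,
    giving pre-copy a chance to converge before downtime grows large."""
    base = max_downtime // steps
    for i in range(steps + 1):
        yield (i * delay, base + ((max_downtime - base) * i) // steps)
```

Each of these steps is a candidate event to record, which is why saving them all per migration adds up quickly.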
14:43:05 <tdurakov> pkoniszewski: worth to discuss
14:43:09 <PaulMurray> admin will look at instance actions to see what happened to a finished migration
14:43:17 <PaulMurray> needs to be decipherable for the admin
14:43:19 <paul-carlton> saving switch to post copy is a reasonable thing to do but as user information
14:43:33 <PaulMurray> paul-carlton, agreed
14:43:35 <andrearosa> I was thinking saving the switch
14:43:42 <tdurakov> paul-carlton: live-migration is admin action
14:43:43 <PaulMurray> but not lots of progress stuff
14:43:45 <tdurakov> not user
14:43:48 <andrearosa> but not the progress
14:43:59 <tdurakov> surely no progress
14:44:13 <pkoniszewski> but for switch we need to save all the data
14:44:23 <pkoniszewski> like memory remaining, how many cycles
14:44:29 <luis5tb> why?
14:44:31 <mdbooth> I would expose live migration started, live migration ended, and live migration aborted
14:44:31 <tdurakov> pkoniszewski: do we?
14:44:34 <mdbooth> To the user
14:44:34 <pkoniszewski> because if we save only switch, it means nothing
14:44:36 <mdbooth> And nothing else
14:44:37 <luis5tb> switch could be just based on the memory iteration
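A minimal sketch of the iteration-based trigger luis5tb suggests (entirely hypothetical; the threshold and names are made up for illustration):

```python
def should_switch_to_post_copy(iteration, data_remaining, prev_remaining,
                               max_iterations=2):
    """Switch to post-copy once pre-copy has had its chance: either the
    allowed number of memory-copy iterations is exhausted, or the dirty
    memory remaining stopped shrinking between iterations."""
    if iteration >= max_iterations:
        return True
    return data_remaining >= prev_remaining
```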
14:44:48 <pkoniszewski> okay, lm switched to post-copy, but what it really means?
14:45:03 <pkoniszewski> and how it would help?
14:45:15 <paul-carlton> exposing post copy switch, as a force complete action to be driver agnostic is reasonable
14:45:31 <paul-carlton> it is informative for user and admin
14:45:34 <tdurakov> pkoniszewski: it would be explicit to operator, what steps were done to converge migration
14:45:36 <mdbooth> If the user is trying to debug why their performance went a bit funky for a bit, knowing that the funkiness occurred during a live migration is sufficient
14:45:42 <mdbooth> They don't need to know every detail of it.
14:46:03 <davidgiluk> mdbooth: Do they have an easy way to get the detail if they need it?
14:46:18 * PaulMurray is thinking about the time....
14:46:44 <PaulMurray> Shall we continue on the spec and move on ?
14:46:44 <tdurakov> davidgiluk: I believe checking logs is the only way
14:46:46 <pkoniszewski> so once we start using auto converge we will save all auto converge steps, all downtime steps, and post-copy switch?
14:46:47 <paul-carlton> Can we talk about  https://review.openstack.org/#/c/301509/
14:46:52 <paul-carlton> https://review.openstack.org/#/c/301561
14:46:56 <pkoniszewski> sounds like a small book in DB for a single migration
14:47:19 <PaulMurray> paul-carlton, we need to cover the agenda
14:47:25 <PaulMurray> lets come back at the end if time
14:47:26 <paul-carlton> ok
14:47:38 <tdurakov> pkoniszewski: not all, but let's discuss this in spec instead, will leave a comment
14:47:46 <PaulMurray> #topic Review Requests
14:47:47 <pkoniszewski> tdurakov: +1
14:48:00 <PaulMurray> https://review.openstack.org/#/c/310352/
14:48:03 <PaulMurray> eliqiao,
14:48:35 <tdurakov> PaulMurray: https://review.openstack.org/#/c/287997/
14:48:49 <tdurakov> still not merged
14:49:31 <mdbooth> Also mechanical cleanup: https://review.openstack.org/#/c/308876/
14:49:39 <PaulMurray> tdurakov, noted - lets see if we can get help with it
14:50:17 <PaulMurray> https://review.openstack.org/#/c/310707/
14:50:31 <PaulMurray> mdbooth, you've been doing reviews - well done
14:50:57 <PaulMurray> your name's on most things I look at
14:51:21 <pkoniszewski> eliqiao is not there
14:51:30 <PaulMurray> that explains it
14:51:38 <PaulMurray> ...the silence I mean
14:51:39 <pkoniszewski> we need some eyes here https://review.openstack.org/#/c/310707/
14:51:46 <pkoniszewski> it's a regression in mitaka
14:51:50 <paul-carlton> can I get reviews on https://review.openstack.org/#/c/307131/ and https://review.openstack.org/#/c/306561/ please, as well as https://review.openstack.org/#/c/310505/
14:51:54 <pkoniszewski> and requires a backport
14:52:03 <PaulMurray> pkoniszewski, yes - I noticed that (ref above too)
14:52:19 <PaulMurray> that made me think about the
14:52:30 <PaulMurray> CI with latest versions discussion earlier
14:52:56 <PaulMurray> 1.3.1 does not work for selective block migration on tunnelled connections
14:54:14 <PaulMurray> ok
14:54:27 <PaulMurray> #topic Open Discussion
14:54:35 <PaulMurray> only a few minutes left
14:54:46 <PaulMurray> anything else to cover
14:54:48 <PaulMurray> ?
14:54:56 <PaulMurray> (quickly)
14:54:58 <pkoniszewski> yeah
14:54:59 <pkoniszewski> one question
14:55:21 <davidgiluk> paul-carlton: Can you figure out how your spec for automatic live migration completion goes together with luis5tb's postcopy spec?
14:55:24 <pkoniszewski> do we really need a spec for that? https://review.openstack.org/#/c/248358/
14:55:39 <mdbooth> davidgiluk paul-carlton +1
14:55:45 <mdbooth> I'm also confused about that
14:55:57 <pkoniszewski> i mean, this is something that is already supported in nova, right now everyone can use auto converge by adding a flag to live_migration_flags
14:56:07 <paul-carlton> davidgiluk, I think it is dependent on luis5tb's spec
14:56:19 <pkoniszewski> because we want to remove live_migration_flags, this new flag is just to keep a way to turn auto converge on, nothing more
14:56:26 <pkoniszewski> mriedem_meeting: ^^
14:56:35 <mdbooth> I'd like to see all of these merged into a single spec
14:56:48 <mdbooth> 'How do I force my live migration to complete' is 1 topic
14:57:18 <davidgiluk> mdbooth: yeh
14:57:35 <paul-carlton> not a good idea to mix them up
14:57:44 <paul-carlton> post copy is one thing
14:57:46 <pkoniszewski> mdbooth: these are two different topics
14:57:51 <pkoniszewski> post-copy is a way to force to complete
14:57:58 <davidgiluk> paul-carlton: I think it would be good to see how yours, postcopy and autoconverge go together; if they're really 3 specs or 1
14:58:01 <paul-carlton> the auto completion stuff build on it
14:58:03 <mdbooth> Right, and so is auto converge
14:58:07 <pkoniszewski> auto converge and compression is just to increase chances to converge
14:58:10 <mdbooth> But they're both parts of the same problem
14:58:22 <mdbooth> Treating them separately is confusing
14:58:34 <pkoniszewski> auto converge will never force to complete, really
14:58:43 <pkoniszewski> even if you cut 99% cpu cycles
14:58:58 <paul-carlton> talking to danbp, auto-converge basically needs to stop the instance to get it done
14:59:00 <PaulMurray> This is going to go over the end of the meeting
14:59:07 <PaulMurray> do we want another time to talk about it ?
14:59:10 <paul-carlton> post-copy is much more effective
14:59:23 <paul-carlton> we can discuss on specs?
14:59:24 <pkoniszewski> post-copy is a way to force to complete
14:59:26 <pkoniszewski> auto converge is not
14:59:27 <mdbooth> paul-carlton: Agreed. We need to discuss that in no more than 1 spec :)
14:59:52 <PaulMurray> I'll have to cut off now, so lets continue in nova room
14:59:59 <PaulMurray> #endmeeting