09:01:01 <jiaopengju> #startmeeting karbor
09:01:02 <openstack> Meeting started Tue Oct 23 09:01:01 2018 UTC and is due to finish in 60 minutes.  The chair is jiaopengju. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:06 <openstack> The meeting name has been set to 'karbor'
09:01:13 <jiaopengju> hi guys
09:01:15 <yinweiishere> hi
09:01:34 <luobin-smile> hi
09:01:44 <jiaopengju> hi yinweiishere, luobin-smile
09:01:57 <yinweiishere> hi pengju
09:02:15 <yinweiishere> have we got any schedule for S version new features?
09:02:39 <jiaopengju> I saw the meeting agenda has been updated here, https://wiki.openstack.org/wiki/Meetings/Karbor
09:02:47 <yinweiishere> yeah
09:03:02 <yinweiishere> we want to propose adding a snapshot feature there
09:03:19 <yinweiishere> and just now luobin and I checked current karbor code
09:03:27 <jiaopengju> yinweiishere, I have done some planning for the S version, mainly focused on optimization
09:03:32 <yinweiishere> we found some problems for snapshot there
09:03:51 <yinweiishere> any wiki page for S version plan?
09:04:53 <jiaopengju> I will send an etherpad link after meeting
09:05:08 <yinweiishere> OK, good
09:05:26 <jiaopengju> We can talk about the feature in the agenda first
09:05:37 <yinweiishere> sure
09:05:41 <jiaopengju> #topic add snapshot feature to Karbor to support crash consistency and app consistency further
09:05:59 <yinweiishere> I'd also like to hear your ideas for optimization
09:06:06 <jiaopengju> :)
09:06:19 <yinweiishere> since as far as I know Yuval has done a lot of optimization there :)
09:06:25 <yinweiishere> ok
09:06:31 <yinweiishere> so for snapshot
09:06:39 <jiaopengju> now you can describe the snapshot feature
09:06:48 <yinweiishere> as we know, snapshot has two usages
09:07:02 <yinweiishere> one to restore another instance from snapshot
09:07:17 <yinweiishere> the other is to rollback current instance to the snapshot
09:07:36 <yinweiishere> here, the instance I mean server or volume or volume groups
09:07:47 <jiaopengju> get it
09:07:56 <yinweiishere> we checked current implementation in karbor
09:08:25 <yinweiishere> we only provide restore API for snapshot, where we can't rollback current instance
09:09:23 <yinweiishere> if you check the actual nova libvirt snapshot or ceph rbd snapshot, the backends all provide snapshot rollback
09:10:25 <yinweiishere> so, for the API aspect, the snapshot feature is missing there
09:10:25 <jiaopengju> This means we should add rollback operation in the protection plugins?
09:10:39 <jiaopengju> just like verify, copy and so on
09:10:45 <yinweiishere> first add a rollback API in the restore part
09:11:19 <yinweiishere> then, support it in the snapshot protection plugins
09:11:24 <jiaopengju> yes
09:11:39 <yinweiishere> so this is the API part
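(Editor's note: a minimal sketch of the rollback hook discussed above, by analogy with the existing verify/copy operations mentioned earlier; the class and method names are illustrative, not the actual Karbor plugin API.)

```python
class RollbackOperation:
    """Hypothetical per-resource operation, analogous to the existing
    protect/restore/verify/copy operations; names are illustrative."""

    def __init__(self, backend):
        self.backend = backend

    def on_main(self, checkpoint, resource):
        # Look up the backend snapshot id stored at protect time,
        # then ask the backend (e.g. Ceph RBD) to roll back in place.
        snap_id = checkpoint[resource]
        self.backend.rollback(resource, snap_id)


class FakeRbdBackend:
    """Stand-in for a backend such as Ceph RBD that supports
    in-place snapshot rollback; it just records the calls."""

    def __init__(self):
        self.calls = []

    def rollback(self, resource, snap_id):
        self.calls.append((resource, snap_id))
```

A plugin would expose such an operation the same way verify and copy operations are wired in today, so the new rollback API can dispatch to it per resource.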
09:11:56 <yinweiishere> second, snapshot has a consistency semantics
09:12:22 <yinweiishere> now Karbor's checkpoint doesn't support any level of consistency
09:12:27 <yinweiishere> do you agree?
09:13:06 <jiaopengju> yes, agree
09:14:54 <yinweiishere> when we initially started this project, the founder agreed to postpone the consistency issue. But without consistency, neither the snapshot nor the checkpoint (whatever you call it) satisfies the 'protect' semantics, in fact.
09:15:21 <yinweiishere> actually, there are two levels of consistency
09:15:39 <yinweiishere> the initial level: crash consistency
09:15:56 <yinweiishere> the higher level: app consistency
09:16:48 <yinweiishere> although Karbor has analyzed dependencies among resources, it still fails to support even crash consistency
09:17:23 <jiaopengju> do you have specs for these two level?
09:17:34 <jiaopengju> or some documentation link
09:18:02 <yinweiishere> look at how we protect server and its volumes/volume groups, we didn't maintain the consistency there
09:18:35 <yinweiishere> we propose to support snapshot with the two levels of consistency step by step
09:18:51 <yinweiishere> first crash consistency and then app level consistency
09:19:01 <yinweiishere> yes, we do have some ideas there
09:19:21 <yinweiishere> want to achieve consensus first before writing it to spec
09:20:36 <yinweiishere> pengju, are you there?
09:21:16 <jiaopengju> yes, can you give more messages about crash consistency?
09:21:27 <yinweiishere> sure
09:22:40 <yinweiishere> actually, I'm thinking that we need a pair of APIs for snapshot: take_snapshot(consistency_level) and rollback_snapshot(checkpoint_id)
09:23:29 <yinweiishere> to differentiate from the existing protect/restore APIs, which only support loose restrictions on consistency
09:23:43 <jiaopengju> do you mean that, if we take a snapshot in karbor, the checkpoint info that maps to this snapshot will be saved?
09:24:15 <yinweiishere> yes, the snapshot will be the backend resource of the checkpoint
09:24:26 <yinweiishere> similar to the backup id
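(Editor's note: a sketch of the proposed API pair and its checkpoint bookkeeping, where the snapshot id becomes the checkpoint's backend resource, just as a backup id is stored today. All names, including `SnapshotService` and `FakeBackend`, are illustrative assumptions.)

```python
import uuid


class FakeBackend:
    """Stand-in for a storage backend; snapshot ids are deterministic
    here so the example is easy to follow."""

    def __init__(self):
        self.rolled_back = []

    def snapshot(self, resource):
        return "snap-%s" % resource

    def rollback(self, resource, snap_id):
        self.rolled_back.append((resource, snap_id))


class SnapshotService:
    """Sketch of the proposed take_snapshot/rollback_snapshot pair.
    Each resource's backend snapshot id is stored inside the checkpoint,
    the same way a backup id is stored today."""

    def __init__(self, backend):
        self.backend = backend
        self.checkpoints = {}

    def take_snapshot(self, resources, consistency_level="none"):
        cp_id = str(uuid.uuid4())
        self.checkpoints[cp_id] = {
            "consistency_level": consistency_level,
            "resources": {r: self.backend.snapshot(r) for r in resources},
        }
        return cp_id

    def rollback_snapshot(self, checkpoint_id):
        cp = self.checkpoints[checkpoint_id]
        for resource, snap_id in cp["resources"].items():
            self.backend.rollback(resource, snap_id)
```

As discussed below, this could instead be folded into the existing protect API by passing a consistency-level parameter, with "none" meaning today's behavior.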
09:24:44 <jiaopengju> ok, I understand, that sounds useful
09:25:12 <yinweiishere> or we can merge it into the protect API with a consistency level param, where none means the current behavior
09:25:38 <yinweiishere> those are details, we can discuss it in spec
09:25:51 <yinweiishere> and for crash consistency
09:26:42 <yinweiishere> we need to pause the server, flush the memory to disk, and then call the volume/volume group's backend snapshot methods
09:28:10 <yinweiishere> in contrast, currently we never pause the server, but only back up each volume attached to the server if it booted from a volume. If it booted from an image, we never back up the changes in the system volume.
09:28:29 <yinweiishere> that's the crash consistency
09:28:55 <yinweiishere> where the system could really boot from the snapshot, and the apps won't crash
09:29:38 <jiaopengju> I have talked about the VMs that boot from an image with yuval and chenying before
09:31:18 <jiaopengju> But if we pause the server in karbor, this will be a bit intrusive to the end user
09:31:23 <yinweiishere> our current way may back up at the wrong moment, when the system and the apps haven't saved the necessary status to disk, and they may fail to boot without that status.
09:31:45 <yinweiishere> that's the protection plugin
09:32:19 <jiaopengju> part of karbor
09:32:28 <yinweiishere> sure
09:32:49 <yinweiishere> that's dependent on the consistency param
09:33:22 <yinweiishere> if a user asks for crash consistency, it means he/she understands what it means
09:33:43 <yinweiishere> for crash consistency itself, that's the way it should be
09:34:26 <jiaopengju> ok, understood, those are details; you can write them in the spec
09:35:00 <jiaopengju> app consistency?
09:37:07 <jiaopengju> yinweiishere are you there?
09:37:33 <yinweiishere> yes, I'm here
09:37:43 <yinweiishere> app consistency is a bit complicated
09:38:31 <yinweiishere> Luobin, could you pls. elaborate here?
09:39:02 <yinweiishere> it's about global consistent snapshot
09:39:18 <yinweiishere> which requires some background knowledge
09:39:19 <jiaopengju> for an app, is the resource that maps to openstack a couple of resources, or a group of resources?
09:39:55 <yinweiishere> have you heard of chandy-lamport algorithm for distributed snapshot?
09:40:51 <yinweiishere> this algorithm was enhanced as the ABS algorithm and applied in the stream engine of flink
09:41:15 <yinweiishere> our idea is to enhance ABS and apply it in karbor
09:41:43 <yinweiishere> we can put it as a more long term target
09:42:52 <yinweiishere> the paper of ABS is here
09:42:54 <yinweiishere> https://arxiv.org/pdf/1506.08603.pdf
09:42:58 <yinweiishere> for your reference
09:43:33 <jiaopengju> actually I think we should define which resources mapped to openstack we protect in the app consistency scenario
09:43:53 <jiaopengju> and then how
09:44:41 <yinweiishere> resources are the same
09:45:04 <yinweiishere> the problem is how to make the app aware of the snapshot
09:45:21 <yinweiishere> and, as there are many apps in one plan's servers
09:45:41 <yinweiishere> how to make sure all apps are consistent
09:46:27 <yinweiishere> the distributed snapshot issue is to make sure the causal ordering is correct
09:46:49 <yinweiishere> as, APP1 is the input of APP2
09:47:02 <yinweiishere> we take snapshot on the whole system
09:47:28 <yinweiishere> but the snapshot command may get triggered at different timings on server1 and server2
09:47:38 <jiaopengju> ok, I understand a little; I will look at the reference you sent after the meeting
09:47:47 <yinweiishere> since there're events on the fly
09:47:58 <yinweiishere> yeah
09:48:14 <yinweiishere> similar to crash consistency
09:48:31 <yinweiishere> we need to maintain the dependencies among the APPs
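(Editor's note: a toy simulation of the Chandy-Lamport marker protocol that ABS builds on. The initiator records its state and emits markers; each process records its own state on the first marker it sees and logs messages still in flight on channels whose marker has not yet arrived. All class names are illustrative, and the "state" is just a counter.)

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state          # local state: here just a number
        self.snapshot = None        # recorded local state, once taken
        self.in_channels = {}       # sender pid -> FIFO of pending messages
        self.marker_seen = {}       # sender pid -> marker already arrived
        self.channel_record = {}    # sender pid -> messages caught in flight


class Network:
    """Deterministic in-memory message fabric with FIFO channels."""

    def __init__(self, procs):
        self.procs = {p.pid: p for p in procs}
        for p in procs:
            for q in procs:
                if q is not p:
                    p.in_channels[q.pid] = deque()
                    p.channel_record[q.pid] = []

    def send(self, src, dst, msg):
        self.procs[dst].in_channels[src].append(msg)

    def start_snapshot(self, pid):
        self._record(self.procs[pid])

    def _record(self, p):
        p.snapshot = p.state        # record local state
        for dst in self.procs:      # emit a marker on every outgoing channel
            if dst != p.pid:
                self.send(p.pid, dst, MARKER)

    def deliver(self, src, dst):
        p = self.procs[dst]
        msg = p.in_channels[src].popleft()
        if msg == MARKER:
            if p.snapshot is None:  # first marker: record state now
                self._record(p)
            p.marker_seen[src] = True
        else:
            p.state += msg          # ordinary application message
            if p.snapshot is not None and not p.marker_seen.get(src):
                p.channel_record[src].append(msg)  # in flight at cut time


def demo():
    net = Network([Process("A", 100), Process("B", 50)])
    net.send("A", "B", 10)       # app message in flight toward B
    net.start_snapshot("A")      # A records 100; marker queued behind the 10
    net.send("B", "A", 5)        # B sends before seeing any marker
    net.deliver("A", "B")        # B applies the 10 -> state 60
    net.deliver("A", "B")        # B gets the marker, records 60
    net.deliver("B", "A")        # A applies the 5 and logs it as in flight
    net.deliver("B", "A")        # A gets B's marker; snapshot is complete
    return net
```

The recorded cut (A=100, B=60, plus the 5 captured on the B-to-A channel) is causally consistent even though the two servers took their snapshots at different moments, which is exactly the ordering problem described above.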
09:48:42 <yinweiishere> ok
09:48:59 <yinweiishere> that's all for my part
09:49:34 <jiaopengju> thanks for giving this useful idea
09:50:03 <yinweiishere> you can send out the schedule link
09:50:21 <jiaopengju> we can add this to stein plan
09:50:30 <yinweiishere> and we can put the effort there to see if more people will get interested there
09:50:50 <yinweiishere> yes, we can support it step by step
09:52:16 <jiaopengju> I can briefly describe the plan for the stein cycle here: 1. optimization of multiple nodes of the operation engine (yuval provided the first version)
09:52:42 <jiaopengju> 2. cross-site backup and restore (cross keystone)
09:52:55 <yinweiishere> haho
09:52:57 <jiaopengju> 3. documentation
09:53:09 <yinweiishere> cross site is what I proposed long long ago
09:53:29 <jiaopengju> yes, some people have asked questions about it
09:53:36 <yinweiishere> it's really useful, right?
09:53:47 <jiaopengju> cross keystone seems not
09:54:29 <yinweiishere> I think cross region/AZ is necessary
09:54:46 <yinweiishere> cross keystone is a bit difficult
09:54:50 <jiaopengju> at the same time, we do not have enough documentation about it
09:55:02 <yinweiishere> again, step by step is more feasible
09:55:11 <jiaopengju> yes, agree
09:55:30 <yinweiishere> ok, I think the time has run out
09:55:54 <jiaopengju> yeah, so I will end the meeting soon and we can talk about it in karbor channel
09:56:13 <yinweiishere> if you have made the schedule link, pls. let us know in karbor channel
09:56:19 <jiaopengju> ok
09:56:33 <jiaopengju> #endmeeting