09:00:44 #startmeeting Dragonflow
09:00:45 Meeting started Mon Nov 7 09:00:44 2016 UTC and is due to finish in 60 minutes. The chair is oanson. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:46 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:48 The meeting name has been set to 'dragonflow'
09:00:51 Hello. Who's here for the meeting?
09:00:58 Hi
09:01:00 hello,
09:01:00 Hi
09:01:43 Let's wait another minute, maybe nick-ma and yuli will also join.
09:01:45 Hello
09:01:54 hi all
09:02:32 hello
09:02:39 All right. We can begin
09:02:44 #topic Ocata Roadmap
09:03:04 Let's start with a very quick status update.
09:03:31 Openstack-ansible deployment is coming along nicely. You can see it here: https://review.openstack.org/#/c/391524/
09:04:03 Great thanks to the openstack-ansible guys who practically wrote all of it (I only stitched it together)
09:04:32 Good to know that
09:04:40 hey
09:04:44 About the other items, I suggest we wait for next week.
09:05:05 I would be happy if anyone working on new features would upload a spec in time for next week's meeting, so that it could be discussed.
09:05:21 According to the relase timetable, we are supposed to have specs up by the end of next week.
09:05:29 (I hope we can make it (: )
09:05:37 release*
09:05:45 I will have the SFC spec up for review today
09:05:51 Great. Thanks!
09:06:39 about the release, dragonflow releases independently. do we need to release a version for the N cycle?
09:07:07 nick-ma_, I thought I had. I released version 2.0.0.
09:07:23 oho, got it.
09:07:27 If it didn't register as an N cycle release, I'll have to go back and fix it
09:07:39 Additionally, I plan to move us to the Openstack release cycle
09:07:56 i see the tags, but not a branch.
09:07:57 I want to see how this and next week go, before I finalise it
09:08:08 nick-ma_, then I'll look into it
09:08:21 #action oanson Branch out N cycle version from tag 2.0.0
09:08:32 Next on the roadmap are a couple of blueprints: Controller HA, services' status, and monitoring and notification
09:08:43 it is a big change and we will have a fixed cycle.
09:08:50 Yes
09:08:51 if we follow openstack.
09:09:06 I think the project is mature enough to manage
09:09:14 Unless there are objections
09:10:09 In fact, if there are objections, now would be a good time to bring them up and discuss. If this is a bad idea, I would like to know :)
09:10:47 there is a bp about chassis status report. what about service status? health check? maybe we need to do them together to prevent duplicate work?
09:11:31 Yes. That's a good idea.
09:12:10 The chassis status report spec is here https://review.openstack.org/#/c/385719/
09:12:25 yes.
09:12:27 #link Chassis status report spec patch https://review.openstack.org/#/c/385719
09:12:28 oanson: notification and monitoring, i would like to know more.
09:12:43 rajivk, actually I was hoping you'd share your ideas :)
09:13:16 okay, i just wanted to say that notification is not in the scope of dragonflow
09:13:31 In general, Dragonflow should monitor its artifacts, e.g. service health, statistics, etc., and pass that information on (e.g. to ceilometer).
09:13:51 however monitoring can be used for internal scheduling etc.
09:13:53 rajivk, yes. My understanding is that notification is handled in project aodh,
09:14:03 which takes its info from ceilometer
09:14:18 But I assume we have to provide ceilometer with the relevant data
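As a rough illustration of the hand-off described above, the sketch below shows how a Dragonflow service could emit a status notification that ceilometer (or any other listener) might consume, using oslo.messaging. The event type, publisher id, and payload fields are assumptions made for this example, not an agreed-on format.

    # Hedged sketch: emit a service-status notification over oslo.messaging.
    # The event type and payload layout below are illustrative only.
    from oslo_config import cfg
    import oslo_messaging

    def report_service_status(host, service, status):
        transport = oslo_messaging.get_notification_transport(cfg.CONF)
        notifier = oslo_messaging.Notifier(transport,
                                           publisher_id='dragonflow.%s' % host,
                                           driver='messagingv2',
                                           topics=['notifications'])
        # ceilometer's notification agent can pick this up from the
        # 'notifications' topic if a matching meter definition exists.
        notifier.info({}, 'dragonflow.service.status',
                      {'host': host, 'service': service, 'status': status})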
09:14:19 okay, i think i misunderstood what was meant by notification
09:14:41 i thought notification means notifying a user or admin, as per our earlier discussion
09:14:55 rajivk, yes. I thought so too...
09:15:05 notify metrics of the virtual network.
09:15:14 notification to other components like ceilometer, congress etc. should be there
09:15:48 I agree on notification for openstack components
09:16:12 Maybe we should start over :)
09:16:18 Help from other component teams might be required in some scenarios.
09:16:21 yes
09:16:47 My understanding is that ceilometer exists to receive monitoring information from components. In dragonflow's case: services' health, statistics, etc.
09:17:11 Aodh exists to raise alarms, which could be used to take actions, or notify other components or users and admins
09:17:16 neutron will send basic notifications in its api worker.
09:17:46 So far, am I correct?
09:17:47 nick-ma_, to whom?
09:17:55 ceilometer
09:18:03 okay, maybe we can think about integration with congress as well.
09:18:14 like starting to create a router, create router success/failure.
09:18:56 rajivk, sure.
09:19:06 Maybe congress allows providing some policy for neutron
09:19:27 we will have to provide some mechanism to do the same in Dragonflow.
09:20:00 Congress probably pushes policy using the Neutron API.
09:20:08 So that should natively reach Dragonflow
09:20:15 (Unless I am wrong)
09:20:34 Sorry, no idea about the internal workings of congress :(
09:20:49 i have no idea how congress works.
09:20:54 me too~
09:21:00 From a quick look at their documentation, there is a Neutron Policy Engine. But I don't know how it works internally either
09:21:08 rajivk, could you do the research?
09:21:15 I know someone who can help me
09:21:19 from the congress community
09:21:24 That would be great!
09:22:41 I suggest you also work with xiaohhui about the service monitor ideas you mentioned the other day. See if you can collaborate on the work he is doing on the chassis status.
09:23:09 ok, i will work with him.
09:23:15 :)
09:23:17 Would you like to talk about controller HA?
09:23:38 yes
09:24:11 The floor is yours! :)
09:24:18 Currently, if the local controller goes down on a compute node, then no flows will be added or removed
09:25:03 As per discussion with oanson, we can take two approaches to avoid this problem
09:25:53 1) Add a watchdog that keeps monitoring the local controller and, if it goes down, tries to restart it.
09:26:33 It tries a few times (configurable); if it fails every time, then some other node's controller can be notified and from that point onward
09:26:59 the remote controller takes care of its own flows as well as the failed node's flows.
09:27:01 what about deploying two df controllers, master and slave?
09:27:17 You mean on the same node.
09:27:55 it doesn't make sense to deploy two of the same process on the same node.
09:28:04 hujie, usually if a service fails on one machine then there must be some external factor which prevented it from continuing
09:28:15 therefore the slave will most probably fail
09:28:34 This is also why the watchdog solution may not be enough
09:28:42 in production, we use a watchdog.
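For illustration, a minimal watchdog along the lines rajivk describes might look like the sketch below. The df-local-controller unit name, the retry limit, and the notify_peer() hook are assumptions for the example; a real deployment might instead rely on systemd or kubernetes restart policies, as discussed next.

    # Hedged sketch of the watchdog idea: restart the local controller a
    # configurable number of times, then hand over to another node.
    import subprocess
    import time

    SERVICE = 'df-local-controller'   # assumed systemd unit name
    MAX_RETRIES = 3                   # assumed; would be configurable
    CHECK_INTERVAL = 10               # seconds between health checks

    def is_alive():
        return subprocess.call(
            ['systemctl', 'is-active', '--quiet', SERVICE]) == 0

    def watchdog(notify_peer):
        failures = 0
        while True:
            if is_alive():
                failures = 0
            else:
                failures += 1
                if failures > MAX_RETRIES:
                    # Give up locally; ask a remote controller to take over.
                    notify_peer(SERVICE)
                    return
                subprocess.call(['systemctl', 'restart', SERVICE])
            time.sleep(CHECK_INTERVAL)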
09:29:04 indeed we may not need to consider HA in a fully-distributed SDN solution: if df goes down, the server is usually also down. but if you consider the case where df goes down and the server works well, you can consider deploying two df controllers with different roles
09:29:54 Sorry to pop in, but if the controller is deployed using kubernetes (kolla-kubernetes?) with a health check, why is there a need for a watchdog?
09:30:56 yuval, the watchdog is there to verify the process is still running and behaving correctly. If k8s' health check does that, then that is a watchdog implementation.
09:31:02 yuval, you are right. In that case it might not be required.
09:31:06 But not all deployments use k8s
09:31:25 w.g. OSA use lxc
09:31:27 e.g.*
09:31:40 sounds like the watchdog is a deployment issue, not specific to dragonflow
09:32:19 the watchdog is to solve the issue of a short misbehaviour of the service.
09:32:21 Possibly. We need to know what solutions exist before writing our own
09:32:59 But the point is what to do if the process fails, and the watchdog can't bring it back up.
09:33:46 In that case, we can notify another node's controller to take over and do all the tasks remotely, if possible. (Not sure whether that is possible or not)
09:34:33 In theory, currently, it should be possible, since both ovsdb and the OVS ofproto interface can be connected to over the net.
09:34:59 okay, are there any major challenges to implementing it?
09:35:26 can you see anything that can stop us from doing it? (i am new to Dragonflow and therefore don't know the internal details)
09:35:37 How would the other node's controller get the VMs of the current node? Besides VMs, I think other resources don't need to migrate
09:36:02 I suspect the whole thing is a challenge :). But I don't see a technological problem.
09:36:02 maybe we should consider one df controller to implement all rules on all cns?
09:36:14 compute nodes
09:36:32 yuli_s, that goes against the dragonflow design. We want to be fully distributed, not migrate back to a central control unit
09:36:33 with a failover in this case
09:36:56 this is only for failover, in case local solutions (e.g. watchdog) fail
09:37:15 yuli_s: we are a fully distributed SDN solution :)
09:37:21 xiaohhui, I think all the necessary information is stored in the OVSDB. If it is still running, the event should be received
09:38:14 We don't even need to know about the VM. Just how to connect the southbound (OVS/OVSDB) port to the northbound (Neutron DB) port
09:38:30 And as far as I know, that information is stored in OVSDB.
09:38:51 i remember seeing a patch to update the chassis table periodically. it can be used for this
09:39:03 (to detect a failed controller)
09:39:06 if another df manages a remote ovs, it is an in-band flow; the OM and data plane are shared,
09:39:08 We can also try adding a plug-vif driver to nova, which would help when we want to extend beyond VMs and beyond ovs. But I don't think we'll make it for Ocata.
09:39:10 when the remote controller takes over the work, it also needs to update its local cache for all the remote topology
09:40:22 And tell apart items that belong to the local compute node, and to the HAed compute node
09:40:32 nick-ma_: can you elaborate
09:42:18 rajivk: i can help discuss and review. :-)
09:42:30 I think it is a good feature.
09:42:42 if df could manage a remote ovs, it seems dragonflow is a highly distributed ODL/floodlight/onos/ryu..., not fully distributed
09:43:06 nick-ma_: i would need a lot of help and discussion. Thanks.
09:43:07 I agree with hujie
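A hedged sketch of the periodic chassis-report idea mentioned above: if every local controller refreshes a timestamp on its chassis record, any peer can detect a failed controller by looking for stale records. The nb_api accessors and the timeout value here are assumptions for illustration, not the actual Dragonflow API.

    # Hedged sketch: detect failed controllers from stale chassis timestamps.
    import time

    CHASSIS_DOWN_TIMEOUT = 30  # assumed: seconds without a refresh

    def find_stale_chassis(nb_api, now=None):
        # nb_api stands in for Dragonflow's NB database API; get_all_chassis()
        # and get_timestamp() are hypothetical accessors for this example.
        now = now if now is not None else time.time()
        return [chassis for chassis in nb_api.get_all_chassis()
                if now - chassis.get_timestamp() > CHASSIS_DOWN_TIMEOUT]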
09:43:10 rajivk, the local DF controller holds an in-memory cache of the database objects. We try to have it as small as possible. In case of HA, we need to read the information of the other compute node into the cache
09:43:35 hujie, xiaohhui, this feature is for fallback only. There should be a dragonflow local controller on every node.
09:43:49 what about supporting a distributed cache as well, like memcache?
09:43:57 yes, HA is an exception for centralization. we can run HA for all the compute nodes, but that doesn't make sense to deploy in production.
09:43:57 But it is possible it will crash, and it might be possible that the watchdog won't be able to raise it again.
09:44:32 we do have a distributed data store.
09:44:58 The local cache is just to speed up reads from that data store.
09:44:58 if we need a distributed cache, we just remove the local cache layer. that's all.
09:45:20 every read will go to the db layer.
09:45:37 hmm, i got it.
09:46:19 rajivk, additionally, the data store layer is fully pluggable. If we want to use memcache specifically, a driver can be written
09:46:53 i just said it for caching the remote machine's info.
09:47:18 But i think i don't understand that much about Dragonflow yet. Maybe i will discuss it later on.
09:47:31 No worries. I was just showing off our pluggability :)
09:47:47 rajivk, sure.
09:47:57 I am always available (if not in IRC, then by mail)
09:48:17 oanson, okay thanks.
09:48:41 I would ask that you let me know what you want to implement, and that you upload a spec so that we'll have it organised.
09:48:54 But that can be done later
09:49:33 okay, i will discuss and let you know on IRC.
09:49:40 Great. Thanks!
09:49:50 Anything else for the roadmap?
09:50:05 I have created a bp
09:50:24 it is not a feature, but other components are now also centralizing configurations
09:50:59 that's it from my side.
09:51:14 oanson, u wnat to talk about ml2 and dumping plugin ?
09:51:31 oanson, you want to talk about ml2 and dumping plugin.py ?
09:51:53 rajivk, I brushed over the spec. Seems like a good idea. I think nick-ma_ started working on something like that.
09:52:06 Using oslo config generation
09:52:22 yuli_s, not sure what you mean. Could you please explain?
09:53:05 yes, that was done. centralized configuration is also welcomed. please share the spec link here. i can catch up.
09:53:07 rajivk, done in this patch: https://review.openstack.org/#/c/373796/
09:53:32 for the Ocata release do we want to switch completely to ml2 and dump old plugin support
09:53:33 #link Centralised configuration blueprint https://blueprints.launchpad.net/dragonflow/+spec/centralize-config-options
09:53:34 ?
09:53:43 yuli_s, yes.
09:53:57 It is not urgent, but it should be done within the next 4-6 weeks.
09:54:03 ok
09:54:14 I can do that, seeing as it's just deleting a couple of files
09:54:45 I have this work https://bugs.launchpad.net/dragonflow/+bug/1618792 which might be similar to dumping plugin.py
09:54:45 Launchpad bug 1618792 in DragonFlow "RFE: Use ml2 as default option for devstack" [Wishlist,In progress] - Assigned to Hong Hui Xiao (xiaohhui)
09:55:17 xiaohhui, this is an important step along the way. Yes.
09:55:38 But it looks like it's merged :)
09:55:53 I plan to add more code for it,
09:56:04 currently it just updates the sample local.conf files
09:56:22 You want dragonflow's plugin.sh to set the variables by default?
09:56:33 yes,
09:56:42 xiaohhui, sounds good!
09:56:47 good idea
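Relating back to the centralized-configuration blueprint and the oslo config generation mentioned earlier in the discussion, a minimal sketch of that approach is shown below. The option names and the 'df' group are made up for the example; only the oslo.config mechanics (register_opts plus a list_opts hook for oslo-config-generator) are the point.

    # Hedged sketch of centralizing options with oslo.config.
    from oslo_config import cfg

    df_opts = [
        cfg.StrOpt('nb_db_class',
                   default='etcd_nb_db_driver',
                   help='NB database driver to load (assumed option name).'),
        cfg.IntOpt('report_interval',
                   default=10,
                   help='Seconds between chassis status reports (assumed).'),
    ]

    def register_opts(conf=cfg.CONF):
        conf.register_opts(df_opts, group='df')

    def list_opts():
        # Entry point consumed by oslo-config-generator to build a
        # sample configuration file with all options in one place.
        return [('df', df_opts)]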
09:56:59 I also want to discuss https://bugs.launchpad.net/dragonflow/+bug/1638151 . jingting: I added a comment to the bug, could you please reply?
09:56:59 Launchpad bug 1638151 in DragonFlow "Router schedule error in L3 router plugin as there are multi-external network" [High,New] - Assigned to rajiv (rajiv-kumar)
09:57:12 It is the only high priority bug that isn't marked 'in progress'.
09:57:42 I went through the details for this bug.
09:58:06 It seems like it fails during the update of the router at the neutron side.
09:58:36 This is actually the issue that dragonflow doesn't support multiple external networks now.
09:58:39 it uses the router scheduler, but it failed to do it in dragonflow.
09:58:50 If br-ex is configured in neutron, the same exception will be reported
09:59:27 We are running out of time.
09:59:43 If you all could share your information on the bug, we could take it from there
09:59:46 let's discuss it on the Dragonflow IRC channel
09:59:47 #link https://bugs.launchpad.net/dragonflow/+bug/1638151
09:59:47 Launchpad bug 1638151 in DragonFlow "Router schedule error in L3 router plugin as there are multi-external network" [High,New] - Assigned to rajiv (rajiv-kumar)
09:59:55 rajivk, Sure.
10:00:12 I want to bring up this review: https://review.openstack.org/#/c/339975/
10:00:20 Thanks everyone for coming. We can continue in #openstack-dragonflow .
10:00:22 It is legacy from the N release
10:00:47 OK
10:01:08 xiaohhui, once we have a Newton branch (I'll take care of it ASAP), we can backport important patches
10:01:25 I suggest we discuss it once the patch is merged into master
10:01:33 Thanks again
10:01:36 #endmeeting