09:00:44 <oanson> #startmeeting Dragonflow
09:00:45 <openstack> Meeting started Mon Nov  7 09:00:44 2016 UTC and is due to finish in 60 minutes.  The chair is oanson. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:48 <openstack> The meeting name has been set to 'dragonflow'
09:00:51 <oanson> Hello. Who's here for the meeting?
09:00:58 <lihi> Hi
09:01:00 <xiaohhui> hello,
09:01:00 <rajivk> Hi
09:01:43 <oanson> Let's wait another minute, maybe nick-ma and yuli will also join.
09:01:45 <dimak> Hello
09:01:54 <nick-ma_> hi all
09:02:32 <yuli_s> hello
09:02:39 <oanson> All right. We can begin
09:02:44 <oanson> #topic Ocata Roadmap
09:03:04 <oanson> Let's start with a very quick status update.
09:03:31 <oanson> Openstack-ansible deployment is coming along nicely. You can see it here: https://review.openstack.org/#/c/391524/
09:04:03 <oanson> Great thanks to the openstack-ansible guys who practically wrote all of it (I only stitched it together)
09:04:32 <xiaohhui> Good to know that
09:04:40 <yuval> hey
09:04:44 <oanson> About the other items, I suggest we wait until next week.
09:05:05 <oanson> I would be happy if anyone working on new features would upload a spec in time for next week's meeting, so that it could be discussed.
09:05:21 <oanson> According to the release timetable, we are supposed to have specs up by the end of next week.
09:05:29 <oanson> (I hope we can make it (: )
09:05:45 <dimak> I will have SFC spec up for review today
09:05:51 <oanson> Great. Thanks!
09:06:39 <nick-ma_> about the release, dragonflow releases independently. do we need to release a version for N cycle?
09:07:07 <oanson> nick-ma_, I thought I had. I released version 2.0.0.
09:07:23 <nick-ma_> oho, got it.
09:07:27 <oanson> If it didn't register as N cycle release, I'll have to go back and fix it
09:07:39 <oanson> Additionally, I plan to move us to the Openstack release cycle
09:07:56 <nick-ma_> i see the tags, but not branch.
09:07:57 <oanson> I want to see how this and next week go, before I finalise it
09:08:08 <oanson> nick-ma_, then I'll look into it
09:08:21 <oanson> #action oanson Branch out N cycle version from tag 2.0.0
09:08:32 <oanson> Next on the roadmap are a couple of blueprints: Controller HA, services' status, and monitoring and notification
09:08:43 <nick-ma_> it is a big change and we will have a fixed cycle.
09:08:50 <oanson> Yes
09:08:51 <nick-ma_> if we follow the OpenStack cycle.
09:09:06 <oanson> I think the project is mature enough to manage
09:09:14 <oanson> Unless there are objections
09:10:09 <oanson> In fact, if there are objections, now would be a good time to bring them up and discuss. If this may be a bad idea, I would like to know :)
09:10:47 <nick-ma_> there is a bp about chassis status report. what about service status? health check? maybe we should work on them together to prevent duplicate work?
09:11:31 <oanson> Yes. That's a good idea.
09:12:10 <xiaohhui> The chassis status report spec is here https://review.openstack.org/#/c/385719/
09:12:25 <nick-ma_> yes.
09:12:27 <oanson> #link Chassis status report spec patch https://review.openstack.org/#/c/385719
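A minimal sketch of the kind of periodic status report the chassis/service health work points at, assuming a hypothetical db_api object with a report_service_up() call (not the actual Dragonflow NB API):

```python
# Hypothetical sketch: each service periodically writes a liveness timestamp
# for its chassis to the NB database. db_api.report_service_up() is a
# placeholder, not the real Dragonflow API.
import socket
import threading
import time


def start_status_reporter(db_api, service_name, interval=30):
    chassis = socket.gethostname()

    def _report():
        while True:
            db_api.report_service_up(chassis=chassis,
                                     service=service_name,
                                     timestamp=time.time())
            time.sleep(interval)

    thread = threading.Thread(target=_report, daemon=True)
    thread.start()
    return thread
```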
09:12:28 <rajivk> oanson: notification and monitoring, i would like to know more.
09:12:43 <oanson> rajivk, actually I was hoping you'd share your ideas :)
09:13:16 <rajivk> okay, i just wanted to say that notification is not in the scope of dragonflow
09:13:31 <oanson> In general, Dragonflow should monitor its artifacts, e.g. service health, statistics, etc., and pass that information on (e.g. to ceilometer).
09:13:51 <rajivk> however monitoring can be used for internal scheduling etc.
09:13:53 <oanson> rajivk, yes. My understanding is that notification is handled in project aodh,
09:14:03 <oanson> which takes its info from ceilometer
09:14:18 <oanson> But I assume we have to provide ceilometer with the relevant data
09:14:19 <rajivk> okay, I think I misunderstood what was meant by notification
09:14:41 <rajivk> i thought notification means notifying user or admin as per our earlier discussion
09:14:55 <oanson> rajivk, yes. I thought so too...
09:15:05 <nick-ma_> notify metrics of virtual network.
09:15:14 <rajivk> notification to other components like ceilometer, congress etc should be there
09:15:48 <rajivk> I agree on notification for openstack component
09:16:12 <oanson> Maybe we should start over :)
09:16:18 <rajivk> Help from other component teams might be required in some scenarios.
09:16:21 <rajivk> yes
09:16:47 <oanson> My understanding is that Ceilometer exists to receive monitoring information from components. In Dragonflow's case: services' health, statistics, etc.
09:17:11 <oanson> Aodh exists to raise alarms, which could be used to take actions, or notify other components or users and admins
09:17:16 <nick-ma_> neutron will send basic notifications in its api worker.
09:17:46 <oanson> So far, am I correct?
09:17:47 <oanson> nick-ma_, to whom?
09:17:55 <nick-ma_> ceilometer
09:18:03 <rajivk> okay, maybe we can think about integration with Congress as well.
09:18:14 <nick-ma_> like starting to create router, create router success/failure.
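A minimal sketch of how such notifications could be emitted with the standard oslo.messaging notifier API; the event type and payload below are illustrative, not an agreed Dragonflow schema:

```python
# Sketch: emit a notification that Ceilometer's notification agent could
# consume. The event type and payload are illustrative only.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_notification_transport(cfg.CONF)
notifier = oslo_messaging.Notifier(transport,
                                   publisher_id='dragonflow.controller',
                                   driver='messagingv2',
                                   topics=['notifications'])

notifier.info({},  # request context
              'dragonflow.service.status',  # illustrative event type
              {'chassis': 'node-1',
               'service': 'df-local-controller',
               'alive': True})
```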
09:18:56 <oanson> rajivk, sure.
09:19:06 <rajivk> Maybe Congress allows providing some policies for Neutron
09:19:27 <rajivk> we will have to provide some mechanism to do the same in Dragonflow.
09:20:00 <oanson> Congress probably pushes policy using the Neutron API.
09:20:08 <oanson> So that should natively reach Dragonflow
09:20:15 <oanson> (Unless I am wrong)
09:20:34 <rajivk> Sorry, no idea about internal working of congress :(
09:20:49 <nick-ma_> i have no idea about how the congress works.
09:20:54 <nick-ma_> me too~
09:21:00 <oanson> From a quick look at their documentation, there is a Neutron Policy Engine. But I don't know how it works internally either
09:21:08 <oanson> rajivk, could you do the research?
09:21:15 <rajivk> I know some who can help me
09:21:19 <rajivk> from congress community
09:21:24 <oanson> That would be great!
09:22:41 <oanson> I suggest you also work with xiaohhui about the service monitor ideas you mentioned the other day. See if you can collaborate on the work he is doing on the chassis status.
09:23:09 <rajivk> ok, i will work with him.
09:23:15 <xiaohhui> :)
09:23:17 <oanson> Would you like to talk about controller HA?
09:23:38 <rajivk> yes
09:24:11 <oanson> The floor is yours! :)
09:24:18 <rajivk> Currently, if the local controller goes down on a compute node, then no flows will be added or removed
09:25:03 <rajivk> As per discussion with oanson, we can take two approaches to avoid this problem
09:25:53 <rajivk> 1) Add a watchdog that keeps monitoring the local controller and, if it goes down, tries to restart it.
09:26:33 <rajivk> It retries a few times (configurable); if it fails every time, then some other node's controller can be notified, and from that point onward
09:26:59 <rajivk> the remote controller takes care of its own flows as well as the failed node's flows.
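A rough sketch of this watchdog approach, assuming the local controller runs as a systemd service named df-local-controller; notify_remote_takeover() stands in for the hand-over mechanism, which is still to be designed:

```python
# Sketch of approach 1: restart the local controller a configurable number
# of times; only when that keeps failing, hand over to a remote controller.
# The service name and notify_remote_takeover() are placeholders.
import subprocess
import time


def watchdog(notify_remote_takeover, max_retries=3, interval=10,
             service='df-local-controller'):
    failures = 0
    while True:
        alive = subprocess.call(['systemctl', 'is-active', '--quiet',
                                 service]) == 0
        if alive:
            failures = 0
        else:
            failures += 1
            if failures <= max_retries:
                subprocess.call(['systemctl', 'restart', service])
            else:
                # Local recovery failed; ask another node to take over.
                notify_remote_takeover()
                return
        time.sleep(interval)
```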
09:27:01 <hujie> what about deploy two df controllers, master and slave?
09:27:17 <rajivk> You mean on the same node.
09:27:55 <nick-ma_> it doesn't make sense to deploy two of the same process on the same node.
09:28:04 <rajivk> hujie, usually if a service fails on one machine, there must be some external factor that prevented it from continuing
09:28:15 <rajivk> therefore the slave will most probably fail too
09:28:34 <oanson> This is also why the watchdog solution may not be enough
09:28:42 <nick-ma_> in production, we use watchdog.
09:29:04 <hujie> indeed, we may not need to consider HA in a fully distributed SDN solution: if DF goes down, the server is usually also down. But if you consider the case where DF goes down while the server works well, you can consider deploying two DF controllers in master and slave roles
09:29:54 <yuval> Sorry to pop in, but if the controller is deployed using kubernetes (kolla-kubernetes?) with health check, why is there a need for a watchdog?
09:30:56 <oanson> yuval, the watchdog is there to verify the process is still running and behaving correctly. If k8s' health check does that, then that is a watchdog implementation.
09:31:02 <rajivk> yuval, you are right. In that it might not be required.
09:31:06 <oanson> But not all deployments use k8s
09:31:25 <oanson> e.g. OSA uses lxc
09:31:40 <yuval> sounds like watchdog is a deployment issue not specific to dragonflow
09:32:19 <rajivk> the watchdog solves the issue of a short misbehaviour of the service.
09:32:21 <oanson> Possibly. We need to know what solutions exist before writing our own
09:32:59 <oanson> But the point is what to do if the process fails, and the watchdog can't bring it back up.
09:33:46 <rajivk> In that case, we can notify another node's controller to take over and do all the tasks remotely, if possible. (Not sure whether that is possible or not)
09:34:33 <oanson> In theory, currently, it should be possible, since both ovsdb and the OVS ofproto interface can be connected over the net.
09:34:59 <rajivk> okay, are there any major challenges to implementing it?
09:35:26 <rajivk> can you see anything that would stop us from doing it? (I am new to Dragonflow, so I don't know the internal details)
09:35:37 <xiaohhui> How would the other node's controller get the VMs of the current node? Besides VMs, I think other resources don't need to migrate
09:36:02 <oanson> I suspect the whole thing is a challenge :). But I don't see a technological problem.
09:36:02 <yuli_s> maybe we should consider one df controller that implements all rules on all CNs?
09:36:14 <yuli_s> compute nodes
09:36:32 <oanson> yuli_s, that goes against the dragonflow design. We want to be fully distributed, not migrate back to a central control unit
09:36:33 <yuli_s> with a failover in this case
09:36:56 <oanson> this is only for failover, in case local solutions (e.g. watchdog) fail
09:37:15 <hujie> yuli_s: we are full distributed SDN solution:)
09:37:21 <oanson> xiaohhui, I think all the necessary information is stored in the OVSDB. If it is still running, the event should be received
09:38:14 <oanson> We don't even need to know about the vm. Just how to connect the southbound (OVS/OVSDB) port to the northbound (Neutron DB) port
09:38:30 <oanson> And as far as I know, that information is stored in OVSDB.
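A rough illustration of the takeover idea: since both OVSDB and OpenFlow can be reached over TCP, the surviving controller could repoint the failed node's integration bridge at itself. This assumes the remote ovsdb-server listens on the standard TCP port; the failover orchestration around it is hypothetical:

```python
# Sketch: repoint a failed node's br-int at a surviving controller, using
# the standard ovs-vsctl CLI over TCP. Assumes the remote ovsdb-server
# listens on ptcp:6640; the surrounding failover logic is hypothetical.
import subprocess


def take_over(failed_node_ip, my_ip, bridge='br-int'):
    remote_db = 'tcp:%s:6640' % failed_node_ip
    subprocess.check_call(['ovs-vsctl', '--db=%s' % remote_db,
                           'set-controller', bridge,
                           'tcp:%s:6653' % my_ip])
```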
09:38:51 <yuli_s> i remember seeing a patch to update the chassis table periodically. it can be used for this
09:39:03 <yuli_s> (to detect failed controller)
09:39:06 <hujie> if another df manages a remote ovs, it is in-band control; the O&M and data planes are shared,
09:39:08 <oanson> We can also try adding a plug-vif driver to nova, which would help when we want to extend beyond vms and beyond ovs. But I don't think we'll make it for Ocata.
09:39:10 <nick-ma_> when the remote controller takes over the work, it also needs to update its local cache for all the remote topology
09:40:22 <oanson> And tell apart items that belong to the local compute node from those of the compute node it has taken over
09:40:32 <rajivk> nick-ma_: can you elaborate
09:42:18 <nick-ma_> rajivk: i can help discuss and review. :-)
09:42:30 <rajivk> I think it is a good feature.
09:42:42 <hujie> if df could manage a remote ovs, it seems dragonflow is a highly distributed ODL/Floodlight/ONOS/Ryu..., not fully distributed
09:43:06 <rajivk> nick-ma_: I would need a lot of help and discussion. Thanks.
09:43:07 <xiaohhui> I agree with hujie
09:43:10 <oanson> rajivk, the local DF controller holds an in-memory cache of the database objects. We try to have it as small as possible. In case of HA, we need to read the information of the other compute node into the cache
09:43:35 <oanson> hujie, xiaohhui, this feature is for fallback only. There should be a dragonflow local controller on every node.
09:43:49 <rajivk> what about supporting a distributed cache as well, like memcache?
09:43:57 <nick-ma_> yes, HA is an exception for centralization. we can run HA for all the compute nodes, but that doesn't make sense to deploy in production.
09:43:57 <oanson> But it is possible it will crash, and it might be possible that the watchdog won't be able to bring it up again.
09:44:32 <nick-ma_> we do have a distributed data store.
09:44:58 <oanson> The local cache is just to speed up reads from that data store.
09:44:58 <nick-ma_> if we need distributed cache, we just remove the local cache layer. that's all.
09:45:20 <nick-ma_> every read will go to db layer.
09:45:37 <rajivk> hmm, i got it.
09:46:19 <oanson> rajivk, additionally, the data store layer is fully pluggable. If we want to use specifically memcache, a driver can be written
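For illustration, a hedged sketch of what a memcache-backed driver might look like; the set_key/get_key/delete_key method names are made up for the example and are not the actual Dragonflow driver interface:

```python
# Illustrative only: a memcache-backed key/value driver. The method names
# below are invented for the example, not the real Dragonflow driver API.
from pymemcache.client.base import Client


class MemcacheDbDriver(object):
    def __init__(self, host='127.0.0.1', port=11211):
        self.client = Client((host, port))

    def set_key(self, table, key, value):
        self.client.set('%s/%s' % (table, key), value)

    def get_key(self, table, key):
        return self.client.get('%s/%s' % (table, key))

    def delete_key(self, table, key):
        self.client.delete('%s/%s' % (table, key))
```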
09:46:53 <rajivk> i just suggested it for caching the remote machine's info.
09:47:18 <rajivk> But I think I don't understand Dragonflow that well yet. Maybe I will discuss it later on.
09:47:31 <oanson> No worries. I was just showing off our pluggability :)
09:47:47 <oanson> rajivk, sure.
09:47:57 <oanson> I am always available (if not in IRC, then by mail)
09:48:17 <rajivk> oanson, okay thanks.
09:48:41 <oanson> I would ask that you let me know what you want to implement, and that you upload a spec so that we'll have it organised.
09:48:54 <oanson> But that can be done later
09:49:33 <rajivk> okay, i will discuss and let you know on IRC.
09:49:40 <oanson> Great. Thanks!
09:49:50 <oanson> Anything else for roadmap?
09:50:05 <rajivk> I have created a bp
09:50:24 <rajivk> it is not a feature, but other components are also centralizing their configuration options now
09:50:59 <rajivk> that's it from my side.
09:51:31 <yuli_s> oanson, do you want to talk about ml2 and dropping plugin.py?
09:51:53 <oanson> rajivk, I skimmed the spec. Seems like a good idea. I think nick-ma_ started working on something like that.
09:52:06 <oanson> Using oslo config generation
09:52:22 <oanson> yuli_s, not sure what you mean. Could you please explain?
09:53:05 <nick-ma_> yes, that was done. centralized configuration is also welcomed. please share the spec link here. i can catch up.
09:53:07 <oanson> rajivk, done in this patch: https://review.openstack.org/#/c/373796/
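For reference, a minimal sketch of the oslo.config pattern that centralised options and oslo-config-generator rely on; the group and option names here are examples only, not Dragonflow's actual options:

```python
# Sketch of centralised option registration with oslo.config. list_opts()
# is the hook oslo-config-generator discovers via an entry point in
# setup.cfg. Group and option names are examples only.
from oslo_config import cfg

df_opts = [
    cfg.StrOpt('nb_db_driver', default='etcd',
               help='Northbound database driver to load.'),
    cfg.IntOpt('report_interval', default=30,
               help='Seconds between service status reports.'),
]


def register_opts(conf=cfg.CONF):
    conf.register_opts(df_opts, group='df')


def list_opts():
    return [('df', df_opts)]
```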
09:53:32 <yuli_s> for the Ocata release, do we want to switch completely to ml2 and drop the old plugin support
09:53:33 <oanson> #link Centralised configuration blueprint https://blueprints.launchpad.net/dragonflow/+spec/centralize-config-options
09:53:34 <yuli_s> ?
09:53:43 <oanson> yuli_s, yes.
09:53:57 <oanson> It is not urgent, but it should be done within the next 4-6 weeks.
09:54:03 <yuli_s> ok
09:54:14 <oanson> I can do that, seeing as it's just deleting a couple of files
09:54:45 <xiaohhui> I have this work https://bugs.launchpad.net/dragonflow/+bug/1618792 which might be similar to dropping plugin.py
09:54:45 <openstack> Launchpad bug 1618792 in DragonFlow "RFE: Use ml2 as default option for devstack" [Wishlist,In progress] - Assigned to Hong Hui Xiao (xiaohhui)
09:55:17 <oanson> xiaohhui, this is an important step on the way. Yes.
09:55:38 <oanson> But it looks like it's merged :)
09:55:53 <xiaohhui> I plan to add more code for it,
09:56:04 <xiaohhui> currently it just updates the sample local.conf files
09:56:22 <oanson> You want dragonflow's plugin.sh to set the variables by default?
09:56:33 <xiaohhui> yes,
09:56:42 <oanson> xiaohhui, sounds good!
09:56:47 <yuli_s> good idea
09:56:59 <oanson> I also want to discuss https://bugs.launchpad.net/dragonflow/+bug/1638151 . jingting: I added a comment to the bug, could you please reply?
09:56:59 <openstack> Launchpad bug 1638151 in DragonFlow "Router schedule error in L3 router plugin as there are multi-external network" [High,New] - Assigned to rajiv (rajiv-kumar)
09:57:12 <oanson> It is the only high priority bug that isn't marked 'in progress'.
09:57:42 <rajivk> I went through the details for this bug.
09:58:06 <rajivk> It seems like it fails during the update of the router on the Neutron side.
09:58:36 <xiaohhui> This is actually because dragonflow doesn't support multiple external networks now.
09:58:39 <nick-ma_> it uses the router scheduler, but that fails in dragonflow.
09:58:50 <xiaohhui> If br-ex is configured in neutron, the same exception will be reported
09:59:27 <oanson> We are running out of time.
09:59:43 <oanson> If you all could share your information on the bug, we could take it from there
09:59:46 <rajivk> let's discuss it on IRC channel of Dragonflow
09:59:47 <oanson> #link https://bugs.launchpad.net/dragonflow/+bug/1638151
09:59:47 <openstack> Launchpad bug 1638151 in DragonFlow "Router schedule error in L3 router plugin as there are multi-external network" [High,New] - Assigned to rajiv (rajiv-kumar)
09:59:55 <oanson> rajivk, Sure.
10:00:12 <xiaohhui> I want to bring up this review: https://review.openstack.org/#/c/339975/
10:00:20 <oanson> Thanks everyone for coming. We can continue in #openstack-dragonflow .
10:00:22 <xiaohhui> It is legacy from the N release
10:00:47 <xiaohhui> OK
10:01:08 <oanson> xiaohhui, once we have a Newton branch (I'll take care of it ASAP), we can backport important patches
10:01:25 <oanson> I suggest we discuss it once the patch is merged into master
10:01:33 <oanson> Thanks again
10:01:36 <oanson> #endmeeting