18:00:32 <daneyon> #startmeeting container-networking
18:00:32 <openstack> Meeting started Thu Oct  1 18:00:32 2015 UTC and is due to finish in 60 minutes.  The chair is daneyon. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:00:37 <openstack> The meeting name has been set to 'container_networking'
18:00:39 <daneyon> Agenda
18:00:46 <daneyon> #link https://wiki.openstack.org/wiki/Meetings/Containers#Agenda
18:00:59 <daneyon> I'll wait a minute for everyone to review the agenda
18:01:08 <daneyon> It's a short one :-)
18:01:30 <daneyon> #topic roll call
18:01:32 <adrian_otto> might as well begin roll call
18:01:37 <adrian_otto> Adrian Otto
18:01:44 <dane_leblanc> o/
18:01:44 <suro-patz> Surojit Pathak
18:01:54 <vilobhmm111> o/
18:02:26 <daneyon> Thank you adrian_otto dane_leblanc suro-patz vilobhmm111 for attending the meeting.
18:02:33 <daneyon> #topic Review Swarm patch.
18:02:37 <daneyon> #link https://review.openstack.org/#/c/224367/
18:02:40 <eghobo> o/
18:02:49 <daneyon> Not much has changed with the patch I posted last week
18:02:58 <daneyon> eghobo thanks for joining
18:03:20 <daneyon> I have a newer version of the patch locally that I'm still playing with.
18:03:39 <daneyon> I got a bit side tracked fixing a few bugs.
18:04:22 <daneyon> Hopefully I can post an updated version of the patch later today that will address using None as the default net_driver for Swarm
18:05:00 <daneyon> I have removed using a VIP and all the associated load-balancing config for the swarm api
18:05:11 <daneyon> it's not needed and does not work since the tls patch was merged.
18:05:51 <daneyon> since neutron lbaas does not support tls offload, we will need to figure out a plan for supporting tls with load-balancing.
18:06:14 <daneyon> Is anyone familiar with project Octavia?
18:06:21 <adrian_otto> each node holds the cert, and we use layer 3 LB (TCP port forwarding)
18:06:36 <daneyon> #link https://wiki.openstack.org/wiki/Octavia
18:06:40 <adrian_otto> use a simple health check to drop dead nodes
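A minimal sketch of the layer-3/4 approach described above, assuming hypothetical manager addresses and port: TLS stays on the nodes, and the checker only verifies that each manager's API port accepts TCP connections, so dead nodes can be dropped without terminating TLS on a load balancer.

```python
# Sketch only: layer-3/4 liveness probe for swarm manager endpoints.
# Addresses and port below are placeholders, not Magnum defaults.
import socket

SWARM_MANAGERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # hypothetical
SWARM_API_PORT = 2376  # assumed TLS-enabled Docker/Swarm API port

def is_alive(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

healthy = [h for h in SWARM_MANAGERS if is_alive(h, SWARM_API_PORT)]
print("healthy managers:", healthy)
```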
18:07:24 <daneyon> adrian_otto that can be a near-term fix
18:08:15 <daneyon> long-term, it would be nice to perform l7 load-balancing by offloading the session to the load-balancer and then re-encrypting on the backend from the lb -> the swarm managers
18:09:08 <daneyon> adrian_otto we will look at reimplementing the swarm mgr load-balancing when the bay type supports multiple swarm managers.
18:09:35 <daneyon> here is the guide that will be followed for implementing multiple managers:
18:09:37 <daneyon> #link https://docs.docker.com/swarm/multi-manager-setup/
18:09:49 <adrian_otto> daneyon, I don't understand the desire to offload ssl, and then use encrypted back channels
18:09:59 <eghobo> daneyon: but I believe only one can be active
18:10:01 <adrian_otto> seems like more complexity that may not be needed
18:10:16 <daneyon> as you can see from the guide, only 1 mgr is primary and others are backups
18:10:22 <adrian_otto> is there some routing decision that involves layer 7?
18:10:39 <daneyon> I would expect that Docker better addresses swarm mgr ha/scale in a future release.
18:11:01 <eghobo> daneyon: mesos has the same model
18:11:15 <daneyon> adrian_otto I would expect that we may need to address different security use cases.
18:12:42 <adrian_otto> have we detailed the use cases anywhere?
18:12:51 <daneyon> From my experience, some users are OK with off-loading ssl to an slb and running clear on the back-end. Others want e-2-e encryption. In that case, we can do simple L4 checks/load-balancing, but L7 is preferred as long as the hw can handle it
18:13:31 <adrian_otto> if the client can do simple SRV lookups, and designate is present, there may be no need for load balancing
18:13:55 <daneyon> adrian_otto currently load-balancing the swarm mgr's is unneeded. It can be implemented, but any traffic to the replicas will be forwarded to the primary
18:14:05 <adrian_otto> just inform designate to update the SRV record when the service availability changes
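A minimal sketch of the SRV-based discovery idea, assuming a Designate-managed zone publishes an SRV record for the active swarm manager (the record name below is hypothetical) and that the dnspython library is available:

```python
# Sketch only: clients resolve an SRV record (kept current by Designate)
# instead of going through a load balancer to find the active manager.
import dns.resolver  # dnspython >= 2.0

def find_swarm_manager(srv_name="_swarm._tcp.bay.example.com"):
    """Return (host, port) for the best-priority SRV target."""
    answers = dns.resolver.resolve(srv_name, "SRV")
    best = min(answers, key=lambda r: (r.priority, -r.weight))
    return str(best.target).rstrip("."), best.port

host, port = find_swarm_manager()
print("active swarm manager: %s:%d" % (host, port))
```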
18:14:11 <hongbin> o/
18:14:28 <eghobo> adrian_otto: +1
18:14:50 <eghobo> most clients can handle retries
18:15:09 <adrian_otto> because that sounds to me like a "Where do I find the active master" question, which is a service discovery issue, not a load balancing one
18:15:15 <Tango> joining late
18:15:58 <daneyon> we could set up load-balancing so the vip always sends traffic to the primary until the L3/4 health check fails, and then fails over to 1 of the replicas. However, we may get into a situation where node-3 becomes the master and the slb sends traffic to node-2. node-2 will redirect to node-3. ATM I don't see much value in load-balancing the swarm managers until Docker provides a better ha/scale story
18:16:58 <eghobo> daneyon: how do you know who is primary?
18:17:00 <daneyon> eghobo you are correct, kind of. The replicas simply redirect requests to the primary in the cluster.
18:17:19 <daneyon> eghobo good to know that mesos follows the same approach.
18:17:49 <daneyon> adrian_otto I have not detailed the swarm manager ha, scale, load-balancing, etc.. use cases.
18:18:16 <eghobo> actually it's the other way around, swarm mimics it from Mesos ;)
18:18:30 <adrian_otto> I suggest we record the use cases first, and then consider design/implementation options based on those
18:19:00 <daneyon> atm i think we simply table using a load-balancer for swarm managers until A. We implement swarm clustering (right now we only deploy a single swarm mgr) B. Docker has a better ha/scale story for swarm.
18:19:17 <adrian_otto> fine with me
18:19:55 <daneyon> Tango thx for joining
18:20:00 <hongbin> sure. The swarm HA should be addressed in a dedicated blueprint
18:20:42 <eghobo> hongbin: +1, the same way as ha for kub and mesos
18:21:49 <Tango> Would it make sense for us to get involved in developing the ha/scale proposal for Docker, or at least follow it closely?
18:21:52 <daneyon> adrian_otto I think it's a bit of both, which is why I reference ha/scale. If we had a large swarm cluster, we want to have all mgr nodes in the cluster active. In that scenario, we want to front-end the mgr's with a load-balancer. This is the typical ha/scale scenario that I see most users request. ATM this is a moot point since swarm scaling is not there.
18:22:20 <daneyon> eghobo primary = 1st node in the cluster
18:22:28 <Tango> Especially if we have opinion about how it should be done
18:22:58 <daneyon> hongbin agreed re: swarm ha bp
18:23:06 <daneyon> I believe I have already created one
18:23:54 <daneyon> Tango I think it's a good idea to get involved in any upstream projects that can have an effect on Magnum
18:24:41 <daneyon> here is the link to the swarm ha bp
18:24:45 <daneyon> #link https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:24:50 <daneyon> feel free to add to it
18:25:34 <daneyon> I have also created a bp for swarm scaling
18:25:36 <daneyon> #link https://blueprints.launchpad.net/magnum/+spec/swarm-scale-manager
18:26:04 <daneyon> it would be nice to eventually auto scale swarm nodes
18:26:56 <daneyon> it would be great to see someone from the team tackle these bp's
18:27:20 <daneyon> If not, I am hoping that I can tackle them when I'm done with the net-driver implementation across all bay types
18:27:39 <eghobo> daneyon: I feel it's out of magnum scope, it's a feature of the swarm scheduler
18:27:44 <Tango> There is a talk on autoscaling at the Summit, we can follow up with these BPs
18:27:57 <daneyon> eghobo what is?
18:28:14 <eghobo> scale-up
18:28:36 <hongbin> Here is the autoscale blueprint:
18:28:40 <hongbin> #link https://blueprints.launchpad.net/magnum/+spec/autoscale-bay
18:28:52 <daneyon> thanks hongbin
18:30:01 <daneyon> eghobo I am referring to adding new nodes to the bay. If I create a bay with master_count 1 and node_count 1, things work great until I need add'l capacity. Then I need to scale out the node count
18:30:43 <daneyon> eghobo the swarm scheduler seems pretty decent, so I'm not talking about touching the swarm scheduler
18:31:02 <daneyon> swarm scheduler strategies
18:31:04 <daneyon> #link https://docs.docker.com/swarm/scheduler/strategy/
18:31:11 <eghobo> daneyon: i see, we definitely need it and it should work the same way for all coe
18:31:12 <suro-patz> daneyon: Would you please elaborate on what we want to achieve in https://blueprints.launchpad.net/magnum/+spec/swarm-high-availability
18:31:16 <daneyon> swarm scheduler filters
18:31:19 <daneyon> #link https://docs.docker.com/swarm/scheduler/filter/
18:31:54 <eghobo> should we return to the networking topic? ;)
18:32:12 <daneyon> eghobo agreed. Unfortunately, as adrian_otto has mentioned, we do not have feature parity across all bay types.
18:32:27 <daneyon> hopefully that will change going fwd
18:33:05 <suro-patz> daneyon: by incrementing the --master_count attribute, from magnum's point of view we are just adding a node to the bay as one more control end-point. Providing HA for API/ETCD should be out of magnum's scope
18:33:12 <eghobo> add/delete nodes is common for all bays, isn't it?
18:33:22 <daneyon> suro-patz I am basically saying in the bp we should implement ha for the swarm mgr's. Our only solution is from Docker's HA guide
18:33:29 <daneyon> #link https://docs.docker.com/swarm/multi-manager-setup/
18:33:46 <eghobo> daneyon: +1
18:34:02 <hongbin> eghobo: yes, currently users can manually add/remove nodes from bay
18:34:15 <hongbin> eghobo: for all bay types
18:34:43 <hongbin> although removing a node doesn't work very well with swarm, due to the lack of a replication controller
18:35:17 <daneyon> suro-patz so, the swarm bay type needs to implement the master_count attr. The heat templates need to be updated to orchestrate multiple masters. When master_count is > 1, the --replication and --advertise flags should be added to the swarm manage command
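A minimal sketch of how the manager invocation could be assembled following the Docker multi-manager guide linked above; the listen address, advertise address, and discovery URL are placeholders, not Magnum's actual Heat template parameters:

```python
# Sketch only: build the `swarm manage` command line. With master_count > 1,
# --replication and --advertise let one manager become primary and the
# others act as replicas (per the Docker multi-manager setup guide).
def swarm_manage_command(master_count, advertise_addr, discovery_url):
    cmd = ["swarm", "manage", "-H", "tcp://0.0.0.0:2375"]
    if master_count > 1:
        cmd += ["--replication", "--advertise", advertise_addr]
    cmd.append(discovery_url)
    return cmd

print(" ".join(swarm_manage_command(3, "10.0.0.11:2375",
                                    "etcd://10.0.0.5:2379/swarm")))
```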
18:35:49 <daneyon> I think it could be done pretty easily. I think this is really important to address Magnum's primary goal of being production ready
18:36:28 <daneyon> In the meantime users would have to deploy multiple swarm bays and spread their containerized apps across the multiple bays to achieve HA
18:36:47 <daneyon> I think it would be nice to provide users with an option to have in-cluster ha
18:38:08 <eghobo> daneyon: +1
18:38:17 <daneyon> eghobo re: scaling. I was referring to having a future option to auto scale nodes. For swarm mgr's I don't think auto-scaling is needed anytime soon. Instead we need to support multiple masters for HA purposes.
18:38:49 <suro-patz> daneyon: I see, this is to support HA of the control plane of swarm, and magnum should help setting that up
18:38:53 <suro-patz> +1
18:39:08 <daneyon> suro-patz master_count adds swarm manager nodes, not swarm agent nodes.
18:39:49 <suro-patz> daneyon: correct, I meant swarm manager by 'control plane'
18:40:59 <daneyon> suro-patz the patch I'm working on removes the swarm agent from the swarm manager. This provides a clear separation between control/data planes. swarm managers are strictly the control plane while swarm agent nodes are the data plane. We will eventually want to separate the communication between swarm mgr/agent and standard container traffic, but that's a different topic.
18:41:32 <daneyon> suro-patz that is correct. We want HA in the control plane
18:42:06 <eghobo> daneyon: can we do without ha first and add it later?
18:42:19 <suro-patz> daneyon: I am still not clear on the original LB issue you raised, maybe we can spend some time on the IRC after this meeting
18:42:20 <daneyon> we will leave it up to the swarm scheduler to provide ha to containers based on the scheduling strategy.
18:42:57 <daneyon> eghobo yes. None of my network-driver work depends on ha.
18:43:08 <eghobo> great
18:44:44 <daneyon> suro-patz sure. In summary, a load-balancer is not needed b/c A. We have not implemented multiple swarm managers and B. Swarm mgr clustering != all mgr's are active... only 1 active mgr (primary) and others (replicas) are on standby.
18:45:06 <daneyon> #topic Review Action Items
18:45:13 * daneyon danehans to look into changing the default network-driver for swarm to none.
18:45:58 <daneyon> I have looked into it and am working through the changes to default swarm to network_driver None, with flannel as an option
18:46:12 <daneyon> dane_leblanc is working on a required patch to make this work too
18:46:33 <suro-patz> daneyon: if we are suggesting flannel for kub, why not for swarm too, as default?
18:46:48 <daneyon> He and I implemented network-driver api validation... currently the validation only allows for network-driver=flannel
18:46:57 <daneyon> not good for the none type ;-)
18:47:24 <daneyon> this is the validation patch that was merged:
18:47:26 <daneyon> #link https://review.openstack.org/#/c/222337/
18:47:52 <daneyon> dane_leblanc is working on a patch to update the validation to include "none" type
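A minimal sketch of what per-bay-type driver validation could look like, using hypothetical names; the real logic lives in the merged patch linked above and in dane_leblanc's follow-up review:

```python
# Sketch only: allowed network drivers per COE. None means the COE's
# native/default networking (e.g. the standard docker bridge for swarm).
ALLOWED_NETWORK_DRIVERS = {
    'kubernetes': {'flannel'},
    'swarm': {'flannel', None},
}

def validate_network_driver(coe, driver):
    allowed = ALLOWED_NETWORK_DRIVERS.get(coe, {None})
    if driver not in allowed:
        raise ValueError("network_driver %r is not supported for %s bays"
                         % (driver, coe))
    return driver

validate_network_driver('swarm', None)       # allowed once the patch lands
validate_network_driver('swarm', 'flannel')  # still allowed as an option
```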
18:48:13 <daneyon> suro-patz we had a lengthy discussion on that topic during last week's meeting.
18:48:14 <dane_leblanc> Should have the validation up for review today
18:48:32 <suro-patz> daneyon: Will check the archive
18:48:38 <daneyon> pls review the meeting logs to come up to speed and ping myself or others over irc if you would like to discuss further.
18:48:57 * daneyon danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:05 <daneyon> I still have not had time to address this
18:49:19 <daneyon> I tried pinging gsagie today, but I did not see him on irc
18:49:24 <daneyon> I will carry this fwd
18:49:31 <daneyon> #action danehans to continue coordinating with gsagie on a combined kuryr/magnum design summit session.
18:49:41 <daneyon> #topic Open Discussion
18:50:13 <daneyon> We have a few minutes to discuss anything the group would like.
18:50:19 <eghobo> daneyon: are you testing swarm with atomic 3 or 5?
18:50:42 <daneyon> eghobo 3
18:50:46 <eghobo> thx
18:52:15 <daneyon> anyone see this article?
18:52:18 <daneyon> #link http://blog.kubernetes.io/2015/09/kubernetes-performance-measurements-and.html
18:52:39 <daneyon> I think it would be awesome if we can pull something like this off for Magnum
18:52:57 <daneyon> would give users/operators a lot of confidence in using Magnum
18:53:39 <daneyon> I'll wait 1 minute before ending the meeting.
18:55:21 <daneyon> Alright then... thanks for joining.
18:55:26 <daneyon> #endmeeting