12:00:20 <jaosorior> #startmeeting TripleO Security Squad
12:00:21 <openstack> Meeting started Wed Jun 20 12:00:20 2018 UTC and is due to finish in 60 minutes.  The chair is jaosorior. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:24 <openstack> The meeting name has been set to 'tripleo_security_squad'
12:00:28 <moguimar> #link https://etherpad.openstack.org/p/tripleo-security-squad
12:00:31 <jaosorior> Let's wait a few minutes for more folks to log in
12:03:15 <jaosorior> Alright, I guess it's fine now
12:03:32 <jaosorior> #topic oslo pluggable secrets backend discussion
12:04:11 <jaosorior> raildo, moguimar: wanna take it from here?
12:04:32 <raildo> yeah, I'll try to sync everything here :)
12:05:43 <raildo> so we're starting to discuss the castellan driver, which will probably be supported by TripleO, as a secure and automated way to handle the secrets in configuration files
12:06:37 <raildo> so we were discussing that in yesterday's meeting and dhellmann had a good point about understanding more of tripleo's needs for this feature, so we can guarantee that we're covering those points in that driver
12:07:35 <dhellmann> right, I would hate for us to design the driver to work in a way that doesn't fit with tripleo
12:07:36 <jaosorior> right
12:08:14 <jaosorior> So, the way we do things at the moment, is that we write everything to hiera and after that eventually it gets persisted to the configuration files
12:08:30 <chem> ccamacho: do you know if that py35 issue is generalized or a recheck will do
12:08:50 <jaosorior> It would be possible, however, instead of writing sensitive info to hiera, to write it to a secure backend (Vault?)
12:09:12 <chem> ccamacho: by the way waiting on the memcached one that ci passes (or should I just merge the backport now ?)
12:09:29 <jaosorior> chem: we're in the middle of a meeting
12:09:42 * chem ->[ ]
12:09:59 <jaosorior> With these secrets in the secure backend, we could then write "constants" to the config files that point to the relevant entry in the backend, plus a reference on how to get to that backend
12:10:18 <dhellmann> it looks like castellan supports barbican and vault today, but I don't know how complete either of those drivers is
12:10:35 <jaosorior> dhellmann: barbican would be the most complete implementation I would say
12:10:47 <dhellmann> jaosorior : yeah, for each secret you would need the "id" string that the service gives you, I think
12:10:55 <jaosorior> the problem is that we can't use Barbican, because... how would keystone AND barbican get their own secrets?
12:11:07 <dhellmann> so instead of storing the actual secret in the config, you store the id of the secret in a separate file the driver will read
12:11:09 <jaosorior> dhellmann: either an ID or just a unique tag
12:11:16 <raildo> dhellmann, well, we are looking to use the vault backend, since it won't require keystone auth tokens and also because we will be able to store the keystone and barbican secrets as well
12:11:29 <dhellmann> raildo : that makes sense
12:11:39 <dhellmann> jaosorior : it sounds like you probably know more about this than I do :-)
12:11:59 <jaosorior> so... we can't use the barbican backend for castellan
12:12:11 <dhellmann> so someone needs to evaluate that vault driver and figure out if it is complete enough
12:12:20 <raildo> dhellmann, also, the way that the vault backend is implemented, it just points to an external vault server, passing the vault key, user...
12:12:26 <dhellmann> assuming we don't want to use a completely different backend, of course
12:12:27 <jaosorior> we also don't want the overcloud to depend on the undercloud. So ideally it has to be a service that's deployed by TripleO
12:12:57 <jaosorior> redrobot has been taking a look at the Vault driver... not sure if he got to any conclusions
12:13:00 * redrobot sneaks into the back of the room
12:13:02 <jaosorior> hopefully he'll be online soon
12:13:09 <redrobot> o/
12:13:11 <jaosorior> aaah there you go! Hi! redrobot
12:13:13 <dhellmann> nice timing
12:13:17 <raildo> ++
12:14:11 <redrobot> haha, sorry guys.  Give me a sec to read the scrollback
12:15:04 <ccamacho> chem, we are still waiting for the BZ flags
12:15:14 <ccamacho> so we can wait for the patch to merge
12:15:54 <dhellmann> just for my own clarification, are we at a stage where we need to deep dive into questions like "does the driver work?" or are we still working out higher level issues like what parts of the system are responsible for different actions?
12:16:01 <redrobot> Re: Vault driver.  Still evaluating but, yeah I'm concerned about the way the castellan-context is used
12:16:04 <redrobot> or not used rather
12:16:33 <dhellmann> context?
12:16:35 <redrobot> the idea was that the castellan-context (which is a terrible name, should have been castellan-auth) would be used to abstract away auth from the backend
12:17:33 <dhellmann> does that mean the driver doesn't work? it's not secure? or it's not using the preferred implementation pattern?
12:17:49 <redrobot> and so http://git.openstack.org/cgit/openstack/castellan/tree/castellan/common/utils.py#n95 was supposed to be called to get a castellan-context
12:18:04 <redrobot> but the Vault plugin sidesteps all that and just reads the token from config
12:18:29 <redrobot> also, the whole context naming has people passing oslo contexts into the Castellan API
12:18:29 <dhellmann> ah
12:18:31 <raildo> dhellmann, my concern from a tripleo perspective, if we choose to go with the vault backend, is how we're going to ship/build vault to use it in tripleo. Or are we just going to ask for an external vault server, like castellan did in that driver?
12:19:02 <dhellmann> raildo : good question
12:19:54 <dhellmann> redrobot : changing the name of the public classes and arguments may be a challenge at this point, but fixing up the driver to use them seems like a good idea
12:20:27 <dhellmann> I'm not sure why a centralized set of options and a function to access them is needed. It seems like each driver is just going to have 1 auth method, right?
12:20:54 <dhellmann> but if that's the preferred way, it seems like the driver can just be fixed to use it
12:21:02 <jaosorior> Seems that Vault is the only choice at the moment. We could try to use the Barbican driver, but it would need to be a barbican instance that uses a context-middleware that's not the keystone one, and we would need to write up proper auth and permissions for that one. That just seems like too much work when we could try to fix up the vault driver.
12:21:44 <redrobot> dhellmann, yeah... the more I think about it, the more I think the credential_factory is not needed, and maybe the way the Vault backend gets its credentials may be the better pattern.
12:21:50 <dhellmann> so what's involved in getting tripleo to deploy vault in a way that it can be used by the services to access their secrets?
12:22:17 <redrobot> a prod-ready vault needs an HA backend
12:22:22 <redrobot> so etcd or consul
12:22:31 <dhellmann> redrobot : I'd be happy to deep-dive into that with you at some point if you want to talk about it separately
12:23:01 <dhellmann> are we deploying either of those yet?
12:23:18 * dhellmann knows embarrassingly little about tripleo today
12:23:28 <redrobot> heh, you and me both dhellmann
12:23:43 <dhellmann> for a PoC, would we need etcd or consul?
12:23:44 <redrobot> I hear jaosorior is the TripleO expert...
12:24:04 <redrobot> No, a non-prod vault server doesn't really have any dependencies
12:24:21 <dhellmann> ok, so maybe we could do it in stages, if we need to
12:24:38 <dhellmann> we should start #info-ing these things
12:24:51 <dhellmann> or #action-ing
12:25:06 <dhellmann> who's going to look at the vault driver? redrobot, is that you?
12:25:29 <redrobot> yep, been deep diving into it, and I think I'm the Vault expert
12:25:43 <redrobot> and by expert I mean I probably read more docs than anyone else, but still don't know much, haha
12:25:52 <dhellmann> #action redrobot investigate completeness of the vault driver in castellan and identify any shortcomings that need to be resolved
12:26:29 <jaosorior> What's required for an HA Vault deployment?
12:26:38 <jaosorior> just etcd or consul?
12:26:45 <redrobot> depends ...
12:26:54 <redrobot> etcd or consul gives your storage HA
12:27:02 <redrobot> but Vault runs in single instance mode
12:27:14 <redrobot> unless you have a boatload of cash to dump on Hashicorp
12:27:30 <redrobot> only the Enterprise version has failover IIRC
12:27:53 <jaosorior> uhm...
12:27:57 <dhellmann> maybe the next question to answer is which backend we actually want to use
12:28:08 * redrobot needs to revisit the Vault open-source vs Enterprise feature set
12:28:28 <jaosorior> Seems to me that without HA, the cluster startup is gonna be quite prone to failure
12:28:43 <dhellmann> do we have any info from users about which service they have experience running?
12:28:56 <jaosorior> basically, when OpenStack services start, they will pull the secrets from Vault
12:28:58 <dhellmann> AT&T has some thing I can never remember the name of
12:29:14 <dhellmann> s/has/likes/ maybe
12:29:19 <jaosorior> when we're deploying (or updating) a cloud, this means there will be a LOT of traffic coming to Vault at that one point
12:29:38 <jaosorior> and then for most of the rest of the cloud's lifetime, there won't be any traffic...
12:29:52 <dhellmann> that's an interesting point
12:30:10 <jaosorior> most of what we need is for Vault to be HA, but for read operations
12:30:21 <jaosorior> write operations can happen quite serialized
12:30:35 <redrobot> Also, open-source Vault does not support HSMs
12:30:41 <shardy> TripleO can deploy etcd, but it's not enabled by default
12:30:46 <dhellmann> yeah, we also need to talk about what we have to do to update a cloud when secrets are changed
12:30:56 <shardy> AFAICS only the neutron vpp ml2 plugin requires it
12:30:57 <jaosorior> and will only happen when we first deploy the cloud and when we update secrets (password rotation for instance)
12:31:20 <jaosorior> shardy: sure, if we want to enable the secure backend we could just deploy etcd for that setup
12:31:39 <jaosorior> redrobot: HSMs are not a requirement (yet)
12:34:00 <jaosorior> redrobot: so, given that our main concern isn't really write operations... will open-source Vault be alright? or does everything still rely on one node?
12:35:26 <redrobot> Vault is a single node, but presumably etcd or consul won't be.  Supposedly even though Vault seems like a bottleneck, because it's written in Go it can keep up with large loads and the performance limit is the speed of the backend you're using.
12:35:51 <redrobot> this is all theoretical btw.  I need to set up a for realsies Vault and actually put some numbers together
12:36:02 <jaosorior> that would be nice
12:36:15 <jaosorior> redrobot: also, for failover, we could potentially write a pacemaker resource agent for Vault
12:36:16 <dhellmann> that sounds like a good way to test the castellan driver, too :-)
12:36:19 <moguimar> afaik the backend is the bottleneck, not vault itself
12:36:53 <dhellmann> so we need to figure out which backend to use, as well
12:37:10 <jaosorior> well, seems etcd is the best bet we have right now, given that tripleo can deploy it
12:37:15 <moguimar> vault only encrypts/decrypts stuff
12:37:23 <moguimar> the IO is done in the backend
12:38:03 <dhellmann> so it sounds like we need etcd regardless of whether we're worried about HA?
12:38:12 <moguimar> yep
12:38:15 <moguimar> etcd or consul
12:38:30 <moguimar> consul is also from hashicorp
12:38:46 <jaosorior> what are the other backend alternatives?
12:38:49 <raildo> btw, I'm just collecting some of this discussion on: https://etherpad.openstack.org/p/oslo-config-plaintext-secrets so we can come back to it later
12:39:25 <moguimar> the in memory vault backend should never be used in production
12:40:03 <moguimar> we can start with it to test the castellan integration then move on to connect vault to a real backend
12:40:44 <raildo> jaosorior, looks like they have a lot of plugin options for backend: https://www.vaultproject.io/docs/configuration/storage/index.html
12:41:22 <jaosorior> Well
12:41:22 <redrobot> Yeah, last time I looked only etcd and Consul were considered "HA"
12:41:27 <jaosorior> there is a mysql backend
12:41:32 <jaosorior> that we do deploy by default
12:41:32 <redrobot> but that may have changed, it's been a while
12:41:34 <jaosorior> why not use that?
12:41:47 <redrobot> > No High Availability
12:41:57 <redrobot> > the MySQL storage backend does not support high availability.
12:42:05 <jaosorior> What does that mean? :D
12:42:16 <dhellmann> that seems odd
12:43:37 <jaosorior> Vault will merely go to either its local mysql instance, or the VIP (if it's not colocated). Replication is handled elsewhere, so it's nothing vault has to worry about
12:45:18 <redrobot> I want to say Vault itself has an "HA" option that can't be turned on when configured to use MySQL
12:45:33 <redrobot> but I can't recall off the top of my head what that actually implies
12:45:43 <dhellmann> ok, this feels like something that needs more investigation but that we're not going to answer here today
12:45:47 <jaosorior> redrobot: can you investigate that?
12:45:51 <redrobot> yessir
12:45:54 <jaosorior> cause that would then be the easiest option
12:47:23 <gfidente> myoung|off I am trying to understand why scenarios 001/004 are failing https://review.openstack.org/#/c/564285/
12:47:27 <jaosorior> note that I'm actually not taking into account Hashicorp's enterprise HA offering...
12:47:51 <redrobot> jaosorior, noted
12:48:28 <jaosorior> alright
12:48:51 <jaosorior> redrobot: seems it all lies on you now :D
12:48:52 <myoung> gfidente: o/  good morning.
12:49:20 <weshay|ruck> quiquell|rover, http://logs.openstack.org/85/564285/20/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/ef7b33c/logs/df.txt.gz
12:49:30 <dhellmann> raildo : did we cover the topics you were hoping for? we have a few minutes left...
12:49:42 <myoung> gfidente: regarding the gate check jobs and scenario 001/004, I haven't sync'd up with realtime yet this morning, weshay|ruck or quiquell|rover should have current status/details
12:49:42 <raildo> dhellmann, I believe we're good for now. jaosorior, thanks for taking the time today for this discussion :)
12:49:58 <jaosorior> thanks for bringing it up!
12:50:03 <weshay|ruck> gfidente, what's up
12:50:12 <jaosorior> quite eager to see the result of the Vault research :D
12:50:21 <myoung> weshay|ruck: see above, he was asking about https://review.openstack.org/#/c/564285, scenario 1/4 fails
12:51:14 <weshay|ruck> myoung, gfidente see the alerts guys
12:51:22 <weshay|ruck> it's scen001/4
12:51:23 <jaosorior> #topic Any other business
12:51:28 <jaosorior> Anything else that folks want to bring up to the meeting
12:51:30 <jaosorior> ?
12:51:45 <raildo> nothing from me
12:52:14 <jaosorior> Alright
12:52:15 <jaosorior> well
12:52:18 <jaosorior> thanks for joining everyone!
12:52:18 <dhellmann> nothing from me
12:52:23 <jaosorior> very interesting stuff!
12:52:27 <dhellmann> thanks, jaosorior , redrobot , & raildo
12:52:31 <moguimar> o/
12:52:31 <quiquell|rover> myoung, gfidente: one of them is RBD reporting 0 GB of disk free space
12:52:36 <jaosorior> #endmeeting
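
Note on the pattern discussed above: the idea raised by jaosorior and dhellmann is that the config on disk holds no secret values, only an ID (or unique tag) per secret, and a driver resolves those IDs against the secure backend when the service starts. A minimal Python sketch of that flow follows. Every name in it (InMemorySecretStore, resolve_config, the option names) is hypothetical and made up for illustration; it is not the castellan, oslo.config, or Vault API, and the in-memory store only stands in for a real backend.

```python
import uuid


class InMemorySecretStore:
    """Hypothetical stand-in for a secure backend such as Vault (test use only)."""

    def __init__(self):
        self._secrets = {}

    def store(self, value):
        # The backend hands back an opaque ID for each stored secret.
        secret_id = uuid.uuid4().hex
        self._secrets[secret_id] = value
        return secret_id

    def get(self, secret_id):
        return self._secrets[secret_id]


def resolve_config(config, secret_refs, store):
    """Return a copy of config with referenced options filled in from the store.

    config: plain option values that are safe to keep on disk.
    secret_refs: maps option name -> secret ID (the separate reference file).
    """
    resolved = dict(config)
    for option, secret_id in secret_refs.items():
        resolved[option] = store.get(secret_id)
    return resolved


# Deploy time: the secret goes to the backend, only its ID is written out.
store = InMemorySecretStore()
ref = store.store("s3cr3t")
on_disk = {"db_user": "nova"}
refs = {"db_password": ref}

# Service start: references are resolved against the backend.
resolved = resolve_config(on_disk, refs, store)
```

In this sketch a password rotation would only touch the backend entry and the reference mapping; the plain config stays unchanged. How the driver itself authenticates to the backend is the open question from the meeting and is deliberately left out.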