16:00:22 <lbragstad> #startmeeting keystone
16:00:23 <openstack> Meeting started Tue Oct 23 16:00:22 2018 UTC and is due to finish in 60 minutes.  The chair is lbragstad. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:27 <openstack> The meeting name has been set to 'keystone'
16:00:28 <lbragstad> #link https://etherpad.openstack.org/p/keystone-weekly-meeting
16:00:32 <lbragstad> o/
16:00:53 <cmurphy> o/
16:01:06 <hrybacki> o/
16:01:09 <gagehugo> o/
16:02:02 <ayoung> Oyez oyez
16:02:16 <wxy|> o/
16:03:00 <lbragstad> #topic Release status
16:03:16 <lbragstad> #info next week is Stein-1 and specification proposal freeze
16:03:50 <ayoung> I assume we have real things to discuss prior to my two agenda items
16:04:04 <lbragstad> we should be smoothing out concerns with specs sooner rather than later at this point
16:04:33 <kmalloc> o/
16:04:42 <lbragstad> if you have specific items wrt specs and want higher bandwidth to discuss, please let someone know
16:04:55 <lbragstad> or throw it on the meeting agenda
16:05:13 <lbragstad> ayoung do you want to reorder the schedule
16:05:13 <lbragstad> ?
16:05:22 <lbragstad> or is that what you're suggesting?
16:06:28 <ayoung> Nah
16:06:31 <ayoung> I'm going last
16:06:40 <lbragstad> ok
16:06:49 <lbragstad> #topic Oath approach to federation
16:07:05 <lbragstad> last week we talked about Oath open-sourcing their approach to federation
16:07:14 <ayoung> so replace uuid3 with uuid5 and I like it
16:07:46 <ayoung> couple other things: we could make it so the deployer chooses the namespace, and could keep that in sync across their deployments, to get "unique" IDs that are still distributed
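A minimal sketch of the uuid5 suggestion above, assuming a hypothetical deployer-chosen namespace UUID: the same (namespace, name) pair always produces the same ID, so independently run deployments that agree on the namespace compute identical user IDs.

    import uuid

    # Hypothetical deployer-chosen namespace; any UUID the deployments agree
    # on (and keep in sync) will do.
    DEPLOYER_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

    def predictable_user_id(name):
        # uuid5 hashes with SHA-1 rather than uuid3's MD5; both are
        # deterministic for the same (namespace, name) pair.
        return uuid.uuid5(DEPLOYER_NAMESPACE, name).hex

    # Every deployment sharing DEPLOYER_NAMESPACE derives the same ID here.
    print(predictable_user_id('admin'))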
16:07:49 <lbragstad> tl;dr they consume Athenz tokens in place of SAML assertions and have their own auth plugin for doing their version of auto-provisioning
16:08:13 <lbragstad> you can find the code here:
16:08:15 <lbragstad> #link https://github.com/yahoo/openstack-collab/tree/master/keystone-federation-ocata
16:08:43 <lbragstad> i started walking through it and comparing their implementation against what we have, just to better understand the differences
16:08:53 <lbragstad> you can find that here, but i just started working on it
16:08:55 <lbragstad> #link https://etherpad.openstack.org/p/keystone-shadow-mapping-athenz-delta
16:09:04 <ayoung> Could Athenz be done as a middleware module?  Something that looks like REMOTE_USER/REMOTE_GROUPS?  Or does it provide more information than we currently accept from SAML etc
16:09:40 <lbragstad> it doesn't really follow the saml spec at all - from what i can tell, it gets everything from the athenz token and the auth body
16:10:35 <lbragstad> the auth plugin decodes the token and provisions users, projects, and roles based on the values
16:10:41 <ayoung> Because Autoprovisioning is its own thing, and we should be willing to accept that as a standalone contribution anyway.
16:11:44 <lbragstad> yeah - i guess it's important to note that Oath developed this for replicated use cases and not auto-provisioning specifically, but the implementation is very similar to what we developed as a solution for auto-provisioning
16:12:19 <ayoung> also...Oath needs predictable IDs.
16:12:19 <ayoung> I have a WIP spec to support those.  It is more than just Users, it looks like
16:12:38 <lbragstad> i'm not sure they need those if they come from the identity provider
16:12:42 <lbragstad> which is athenz
16:12:49 <ayoung> I think the inter-tubes are congested
16:13:22 <ayoung> https://review.openstack.org/#/c/612099/
16:13:51 <lbragstad> why does athenz need predictable user ids?
16:14:07 <ayoung> lbragstad, because they need to be the same from location to location
16:14:28 <ayoung> so admin can't be ABC in region1 and 123 in region2
16:14:30 <lbragstad> https://github.com/yahoo/openstack-collab/blob/master/keystone-federation-ocata/plugin/keystone/auth/plugins/athenz.py#L123-L129
16:14:56 <ayoung> they state they use uuid3(NAMESPACE, name)
16:14:56 <lbragstad> the user id is generated by athenz
16:15:36 <lbragstad> and keystone just populates it in the database, from what i can tell
16:16:04 <lbragstad> so long as you're using athenz tokens to access keystone service providers, you should have the same user id at each site?
16:16:15 <ayoung> that is my understanding, yes
16:16:41 <lbragstad> so their implementation has already achieved predictable user ids
16:16:59 <lbragstad> right?
16:19:14 <lbragstad> if anyone feels like parsing that code, feel free to add your comments, questions, or concerns to that etherpad
16:19:28 <lbragstad> it might be helpful if/when we or penick go to draft a specification
16:19:42 <lbragstad> worst case, it helps us understand their use case a bit better
16:19:59 <wxy|> Will take a look later.
16:20:07 <lbragstad> thanks wxy|
16:20:11 <lbragstad> any other questions on this?
16:20:57 * knikolla will read back. am stuck in meetings as we have the MOC workshop next week. sorry for being AWOL this time period.
16:21:11 <lbragstad> no worries - thanks knikolla
16:21:16 <lbragstad> alright, moving on
16:21:29 <lbragstad> #topic Another report of upgrade failures with user options
16:21:45 <lbragstad> #link https://bugs.launchpad.net/openstack-ansible/+bug/1793389
16:21:45 <openstack> Launchpad bug 1793389 in openstack-ansible "Upgrade to Ocata: Keystone Intermittent Missing 'options' Key" [Medium,Fix released] - Assigned to Alex Redinger (rexredinger)
16:21:56 <lbragstad> we've had this one crop up a few times
16:22:17 <lbragstad> specifically, the issue is due to caching during a live upgrade
16:22:54 <lbragstad> from pre-Ocata to Ocata
16:23:11 <lbragstad> it's still undetermined if this impacts FFU scenarios
16:23:18 <lbragstad> (e.g. Newton -> Pike)
16:23:57 <lbragstad> but it boils down to the cache returning a user reference during authentication on Ocata code that expects user['options'] to be present, but it isn't because the user was cached prior to the upgrade
16:24:20 <ayoung> Gah...disconnect.  I'll try to catch up
16:24:45 <lbragstad> deployment projects have a workaround to flush memcached as a way to force a miss on authentication and refetch the user
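For reference, the workaround amounts to something along these lines (a sketch assuming the python-memcached client; the address is illustrative): flushing drops every key, so the next authentication misses the cache and refetches the user from SQL with the 'options' key populated.

    import memcache  # python-memcached

    client = memcache.Client(['127.0.0.1:11211'])  # point at the real cache hosts
    client.flush_all()  # empty the cache so post-upgrade reads repopulate it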
16:25:46 <lbragstad> cmurphy odyssey4me and i were discussing approaches for mitigating this in keystone directly
16:26:11 <lbragstad> there is a WIP review in gerrit
16:26:14 <lbragstad> #link https://review.openstack.org/#/c/612686/
16:26:24 <lbragstad> but curious if people have thoughts or concerns about this approach
16:26:40 <lbragstad> or if there are other approaches we should consider
16:27:31 <ayoung> wouldn't deploying a fix like this flush the cache anyway?
16:27:38 <ayoung> How could they ever get in this state?
16:28:01 <lbragstad> the memcached instance has a valid cache for a specific user
16:28:22 <ayoung> Is this a side effect of 0 downtime upgrades?  Keep the cache up, even as we change the data out from underneath?
16:28:38 <lbragstad> yeah - that's the problem
16:28:41 <lbragstad> the cache remains up
16:28:48 <lbragstad> thus holding the cached data
16:28:50 <ayoung> that is going to be a problem in other ways
16:29:13 <ayoung> needs to be part of the upgrade.  Flush cache when we do ....
16:29:18 <ayoung> contract?
16:29:38 <ayoung> we change the schema in the middle.  The cache will no longer reflect the schema after some point
16:30:12 <lbragstad> that's what https://review.openstack.org/#/c/608066/ does
16:30:19 <lbragstad> but not in process
16:30:25 <cmurphy> that's the problem, the question is whether we can be a bit more surgical instead of flushing the whole cache
16:30:35 <ayoung> I see that, but it is on a row by row basis
16:31:00 <ayoung> yeah,  that review looks like it is in the right direction
16:31:41 <ayoung> so...can we tell memcache to flush all of a certain class of entry?  As I recall from token revocations, that is not possible
16:31:47 <lbragstad> also - alex's comment on https://review.openstack.org/#/c/612686/ proves this could affect FFU
16:31:57 <ayoung> it only knows about key/value stores
16:32:38 <lbragstad> ayoung are you asking about cache region support?
16:32:50 <ayoung> lbragstad, maybe.
16:33:02 <ayoung> does each region reflect a specific class of cached objects?
16:33:04 <lbragstad> some parts of keystone rely on regions, yes
16:33:21 <lbragstad> computed role assignments have their own region, for example
16:33:26 <lbragstad> same with tokens
16:33:50 <ayoung> are regions expensive?  Is there a reason to avoid using them?
16:34:04 <lbragstad> i'm not sure - that might be a better question for kmalloc
16:34:15 <kmalloc> no
16:34:16 <lbragstad> #link https://review.openstack.org/#/c/612686/1/keystone/identity/core.py,unified is an attempt at creating a region specifically for users
16:34:29 <ayoung> could we wrap user, groups, projects etc each with a region, and then, as part of the sql migrations, flush the region
16:34:29 <kmalloc> not expensive, but we have cases where we cannot invalidate an explicit cache key
16:34:40 <kmalloc> e.g. many entries via kwargs into a single method
16:34:46 <kmalloc> so we need to invalidate the entire region
16:34:49 <lbragstad> #link https://review.openstack.org/#/c/612686/1/keystone/auth/core.py,unified@389 drops the entire user region (every cached user)
16:34:57 <kmalloc> it is better to narrow the invalidation to as small a subset as possible
16:35:15 <kmalloc> no reason to invalidate *everything* if only computed role assignments needs to be invalidated
16:35:30 <ayoung> kmalloc, if we change the schema on, in this case, users, we need to invalidate all cached users.  Is that too specific?
16:35:41 <kmalloc> you can do so.
16:35:52 <ayoung> each class of object gets its own region?
16:36:03 <kmalloc> so far yes
16:36:13 <kmalloc> well...
16:36:20 <kmalloc> each manager
16:36:27 <ayoung> ok...so, we could tie in with the migration code, too, to identify what regions need to be invalidated
16:36:32 <lbragstad> correct - if that region needs to be invalidated
16:36:37 <kmalloc> and some managers have extra regions, eg. computed assignments
16:36:41 <ayoung> OK,  so users and groups would go together, for example?
16:36:46 <kmalloc> right now, yes
16:37:00 <lbragstad> but - they could be two separate regions if needed
16:37:04 <kmalloc> ++
16:37:07 <lbragstad> depends on the invalidation strategy
16:37:08 <kmalloc> it's highly modular
16:37:23 <ayoung> Backend is probably granular enough
16:37:29 <lbragstad> or what needs to invoke invalidation, how often, etc...
16:37:31 <ayoung> identity, assignment, resource
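A rough sketch of the per-region approach with dogpile.cache (which keystone's caching layer builds on); the region names and the cached function are illustrative, not keystone's actual wiring.

    from dogpile.cache import make_region

    identity_region = make_region(name='identity').configure('dogpile.cache.memory')
    assignment_region = make_region(name='assignment').configure('dogpile.cache.memory')

    @identity_region.cache_on_arguments()
    def get_user(user_id):
        # stand-in for the real SQL lookup
        return {'id': user_id, 'options': {}}

    get_user('abc123')                     # warms the identity region
    identity_region.invalidate(hard=True)  # treats every cached user as stale
                                           # without touching assignment_region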
16:37:57 <kmalloc> you can also force a cache pop by changing the argument(s)/kwargs [once https://review.openstack.org/#/c/611120/ lands] in the method signature
16:38:06 <kmalloc> since we cache memoized
16:38:10 <ayoung> yech
16:38:14 <ayoung> lets not count on that.
16:38:37 <kmalloc> it is a way caching works.
16:38:37 <ayoung> I'd hate to have to change kwargs just to force a cache pop
16:38:51 <ayoung> yeah, and it is ok, just not what we want to use for this requirement
16:38:56 <kmalloc> it is a way a lot of things on the internet work, explicit change to the request forcing a cache cycle
16:39:36 <kmalloc> in either case you can force a cache pop. though i would not want to do that in db_sync
16:40:02 <kmalloc> it might make sense to do an explicit region (all region) cache expiration/invalidation on keystone start
16:40:38 <kmalloc> or as a keystone-manage command
16:40:59 <kmalloc> hooking in all the cache logic into db_sync seems ... rough
16:41:17 <lbragstad> in that case, a single keystone node could invalidate the memcached instances
16:41:26 <ayoung> what if db_sync set the values that would then be used by the manage-command
16:41:31 <lbragstad> but that behavior also depends on cache configuration
16:41:42 <ayoung> like a scratch table with the set of regions to invalidate?
16:41:43 <kmalloc> ayoung: there is no reason to do something like that
16:42:05 <kmalloc> really, just invalidate the regions
16:42:10 <kmalloc> they will re-warm quickly
16:42:25 <kmalloc> upgrade steps should be expected to need a cache invalidation/rewarm
16:42:29 <lbragstad> performance will degrade for a bit
16:42:50 <lbragstad> also - cmurphy brought up a good point earlier that it would be nice to find a solution that wasn't super specific to just this case
16:42:54 <kmalloc> which is fine for an upgrade process. we already say "turn everything off except X"
16:42:57 <lbragstad> since this is likely going to happen in the future
16:43:13 <kmalloc> so, i'd say keystone-manage that forces a region-wide invalidation
16:43:17 <kmalloc> [all regions]
16:43:55 <ayoung> I'll defer.  I thought we were going more specific, to flush only regions we knew had changed, but this is ok
16:44:44 <kmalloc> for the most part our TTLs are very narrow
16:45:10 <kmalloc> i'll bet most cache is popped just by timeout (5m) during upgrade process
16:45:16 <kmalloc> or a restart of memcache servers as part of the deal
16:45:53 <kmalloc> this is just explicit. Another option is to add a namespace value that we change per release of keystone
16:46:05 <kmalloc> that just forces rotation of the cache based upon code base.
16:46:09 <ayoung> ok, so keystone-manage cache-invalidate [region | all ]  ?
16:46:30 <kmalloc> fwiw, a namespace is just "added" to the cache key (before sha calculation)
16:46:51 <kmalloc> which then forces a new keystone to always use new cache namespace
16:47:02 <kmalloc> no "don't forget to run this command"
16:47:19 <kmalloc> (though an explicit cache invalidate command might be generally useful regardless)
16:47:50 <ayoung> cool.  We good here?
16:48:06 <lbragstad> i think so - we're probably at a good point to continue the discussion in review
16:48:17 <kmalloc> we could use https://github.com/openstack/keystone/blob/master/keystone/version.py#L15 anyway. yeah we should continue discussion in review
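A hedged sketch of the "namespace per release" option using a dogpile.cache key mangler: a release marker is folded into every cache key, so upgraded code never reads entries written by the previous release. The release string is an illustrative stand-in for whatever keystone/version.py exposes; the mangler itself is an assumption, not current keystone behaviour.

    import hashlib

    from dogpile.cache import make_region

    RELEASE = '15.0.0'  # illustrative; the real value would come from keystone/version.py

    def versioned_key_mangler(key):
        # Prefix every key with the release before hashing so keys written by
        # a different release can never be read back by this one.
        return hashlib.sha256(('%s:%s' % (RELEASE, key)).encode('utf-8')).hexdigest()

    region = make_region(key_mangler=versioned_key_mangler).configure('dogpile.cache.memory')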
16:48:29 <ayoung> Cool...ok
16:48:31 <lbragstad> #topic open discussion
16:48:41 <ayoung> two things
16:48:48 <ayoung> 1.  service roles
16:48:49 <lbragstad> we have 12 minutes to talk about whatever we wanna talk about
16:48:51 <kmalloc> Flask has 2 more reviews, all massive code removals! yay, we're done with the migration
16:48:59 <ayoung> we need a way to convert people from admin-everywhere to service roles
16:49:02 * kmalloc has nothing else to talk about there, just cheering that we got there
16:49:08 <ayoung> so...short version:
16:49:43 * kmalloc hands the floor to ayoung... and since ayoung is now holding the entire floor, everyone falls ... into the emptiness/bottomless area below the floor.
16:49:48 <ayoung> we roll in rules that say admin (not everywhere) is service role or is_admin_project and leave the current mechanism in place
16:50:23 <ayoung> so, once we enable a bogus admin project in keystone, none of the tokens will ever have is_admin_project set
16:50:29 <ayoung> then we can remove those rules
16:51:05 <ayoung> it will let a deployer decide when to switch on service roles as the only allowed way to perform those ops
16:51:11 <lbragstad> why wouldn't we just use system-scope and use the upgrade checks to make sure people have the right role assignments according to their policy?
16:51:59 <ayoung> lbragstad, so...
16:52:10 <ayoung> that implied a big bang change
16:52:14 <ayoung> those never go smoothly
16:52:30 <ayoung> we want to be able to have people get used to using system roles, but not break their existing workflows
16:52:38 <lbragstad> but upgrade checkers are a programmable way to help with those types of things?
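For context, an upgrade check along those lines might look roughly like this with oslo.upgradecheck; the check and its logic are hypothetical, not an existing keystone check.

    from oslo_upgradecheck import upgradecheck

    class Checks(upgradecheck.UpgradeCommands):

        def _check_system_scope_assignments(self):
            # Hypothetical: report operators who still lack a system-scoped
            # role assignment matching the new default policies.
            missing = []  # e.g. query the assignment backend here
            if missing:
                return upgradecheck.Result(
                    upgradecheck.Code.WARNING,
                    'No system-scoped assignments for: %s' % ', '.join(missing))
            return upgradecheck.Result(upgradecheck.Code.SUCCESS)

        _upgrade_checks = (
            ('Check system-scope assignments', _check_system_scope_assignments),
        )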
16:52:51 <ayoung> will it make sure that Horizon works?
16:52:58 <ayoung> Will it make sure 3rd party apps work?
16:53:10 <ayoung> we want to leave the existing policy in place until they are ready to throw the switch
16:53:16 <ayoung> and give them a way to throw it back
16:53:33 <ayoung> right now, people are misusing admin tokens
16:53:47 <ayoung> I've seen some really crappy code along those lines
16:54:18 <kmalloc> ayoung: that is the idea behind the deprecated policy bits, they just do a logical OR between new and old
16:54:20 <ayoung> we want to tell people: switch to using "service scoped tokens" and make it their choice
16:54:34 <ayoung> yeah, but....
16:54:39 <kmalloc> until we remove the declaration of the "this is the deprecated rule"
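Roughly what that deprecation mechanism looks like in oslo.policy (the rule name and check strings are illustrative, and keyword placement has shifted between releases): while the deprecated rule is still registered, enforcement passes if either the new or the old check string matches, which is the logical OR kmalloc describes.

    from oslo_policy import policy

    deprecated_rule = policy.DeprecatedRule(
        name='identity:create_user',
        check_str='rule:admin_required',
    )

    rules = [
        policy.DocumentedRuleDefault(
            name='identity:create_user',
            check_str='role:admin and system_scope:all',
            scope_types=['system'],
            description='Create a user.',
            operations=[{'path': '/v3/users', 'method': 'POST'}],
            deprecated_rule=deprecated_rule,
            deprecated_reason='Moving to system-scoped admin.',
            deprecated_since='S',
        ),
    ]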
16:54:47 <ayoung> I don't want to have to try and synchronize this across all of the projects in openstack
16:54:55 <kmalloc> you are going to have to.
16:54:56 <ayoung> so...we absolutely use those
16:55:03 <kmalloc> it's just how policy works
16:55:10 <ayoung> re-read what I said
16:55:26 <ayoung> it allows us to roll in those changes, but keep things working as-is until we throw the switch
16:55:28 <kmalloc> you can't just wave a wand here.
16:55:41 <ayoung> I worked long and hard on this wand
16:56:10 <kmalloc> it is going to be a "support (new or old) or supply custom policy"
16:56:12 <ayoung> so, the idea is we get a common definition of service scoped admin-ness
16:56:16 <kmalloc> the switch is thrown down the line.
16:56:25 <ayoung> yes!
16:56:29 <kmalloc> and it likely will be an upgrade
16:56:36 <kmalloc> where the old declaration is removed
16:56:46 <kmalloc> but it COULD be re-added with a custom policy override
16:56:56 <kmalloc> this has to be done per-project in that project's tree
16:57:10 <ayoung> what happens if that breaks a critical component?
16:57:18 <ayoung> they are not going to do a downgrade
16:57:22 <kmalloc> 3 things: 1) supply a fixed custom policy
16:57:28 <kmalloc> (quick remediation)
16:57:41 <kmalloc> 2) do better UAT and/or halt upgrade
16:57:46 <kmalloc> 3) roll back to previous
16:58:10 <kmalloc> custom policy to the old policy string is immediate and fixes "critical path is broken"
16:58:20 <ayoung> So...nothing I am saying is going to break that.  But it ain't going to work that smoothly
16:58:22 <ayoung> so...
16:58:27 <ayoung> here is the middle piece:
16:58:47 <ayoung> make it an organizational decision to enable and disable the service scoped roles as the ONLY way to enforce that policy
16:58:53 <ayoung> and isolate that decision
16:59:01 <lbragstad> final minute
16:59:05 <kmalloc> this feels like a deployer/installer choice.
16:59:07 <kmalloc> fwiw
16:59:18 <ayoung> OK...one other thing
16:59:19 <kmalloc> not something we can encode directly
16:59:35 <kmalloc> (just because of how we sucked at building how policy works in the past)
16:59:36 <ayoung> I propose that the custom policies we discussed last week go to oslo-context instead of oslo-policy
16:59:42 <kmalloc> -2
16:59:59 <kmalloc> put them external in a new lib if it doesn't go in oslo-policy
16:59:59 <ayoung> oslo-context is the openstack specific code. oslo-policy is a generic rules engine.
17:00:11 <ayoung> there is a dependency between them for this anyway
17:00:11 <kmalloc> context is the wrong place to put things that are policy rules.
17:00:20 <ayoung> so is oslo-policy, tho
17:00:21 <kmalloc> oslo context is a holder object for context data.
17:00:32 <ayoung> but we insist on it for enforcing policy
17:00:33 <kmalloc> put them in oslo-policy and then extract to new thing
17:00:38 <kmalloc> or put it in new thing and fight to land it
17:00:42 <lbragstad> oslo.context is often overridden for service specific implementations, too
17:00:53 <ayoung> I think it stays in new thing, then
17:00:55 <kmalloc> do not assume oslo.context even is in use.
17:01:08 <kmalloc> i told you i recommend oslo-policy for one reason only
17:01:12 <kmalloc> just for ease of landing it
17:01:15 <kmalloc> then extract
17:01:18 <lbragstad> ok - we're out of time folks
17:01:24 <kmalloc> but, i am happy to support a new thing as well
17:01:31 <ayoung> cool.  I'll push for new thing
17:01:32 <kmalloc> it will just be painful to get adopted (overall)
17:01:34 <lbragstad> reminder that we have office hours and we can continue there
17:01:48 <lbragstad> thanks all!
17:01:54 <kmalloc> but i am fine with +2ing lots of stuff for that as it comes down the line
17:02:10 <lbragstad> #endmeeting