21:00:05 <strigazi> #startmeeting containers
21:00:06 <openstack> Meeting started Tue Mar  5 21:00:05 2019 UTC and is due to finish in 60 minutes.  The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:09 <openstack> The meeting name has been set to 'containers'
21:00:11 <strigazi> #topic Roll Call
21:00:17 <strigazi> o/
21:00:19 <schaney> o/
21:00:21 <jakeyip> o/
21:01:48 <brtknr> o/
21:02:38 <strigazi> Hello schaney jakeyip brtknr
21:02:41 <strigazi> #topic Stories/Tasks
21:02:53 <imdigitaljim> o/
21:03:08 <strigazi> I want to mention three things quickly.
21:03:18 <strigazi> CI for swarm and kubernetes is not passing
21:03:21 <colin-> hello
21:03:34 <strigazi> Hello colin- imdigitaljim
21:04:09 <strigazi> I'm finding the error
21:04:44 <strigazi> for example for k8s http://logs.openstack.org/73/639873/3/check/magnum-functional-k8s/06f3638/logs/screen-h-eng.txt.gz?level=ERROR
21:04:59 <strigazi> The error is the same for swarm
21:06:01 <strigazi> If someone wants to take a look, please comment in https://review.openstack.org/#/c/640238/ or propose a fix :)
21:06:15 <strigazi> 2.
21:06:50 <strigazi> a small regression I found with the etcd_volume_size label (persistent storage for etcd) https://storyboard.openstack.org/#!/story/2005143
21:07:06 <strigazi> the fix is obvious
21:07:25 <strigazi> 3.
21:07:33 <strigazi> imdigitaljim created a story, "Cluster creators that leave WRT Keystone cause major error": https://storyboard.openstack.org/#!/story/2005145
21:07:40 <imdigitaljim> yeah thats my 1
21:07:57 <strigazi> it has been discussed many times. the keystone team says there is no fix
21:08:21 <strigazi> in our cloud we manually transfer the trustee user to another account.
21:08:23 <imdigitaljim> could we rework magnum to poll heat with a service account, for one part
21:08:31 <imdigitaljim> instead of using the trust creds to poll heat
21:08:54 <strigazi> imdigitaljim: some say this is a security issue, it was like this before.
21:09:01 <imdigitaljim> oh?
21:09:09 <strigazi> but this fixes part of the problem
21:09:11 <imdigitaljim> couldn't it be scoped to read-only/GETs for heat
21:09:25 <imdigitaljim> the kubernetes side
21:09:33 <imdigitaljim> one option might be trust transfer (like you suggest)
21:09:46 <imdigitaljim> or, what we have been opting for: teams use a bot-account type approach for their tenant
21:09:58 <imdigitaljim> that persists when users leave
21:10:19 <strigazi> trusts transfer *won't* happen in keystone, ever
21:10:24 <imdigitaljim> yeah
21:10:28 <imdigitaljim> i doubt it would
21:10:43 <jakeyip> does this happen only if the user is deleted from keystone?
21:10:47 <strigazi> they were clear about this at the Dublin PTG
21:10:56 <strigazi> yes
21:10:58 <imdigitaljim> yeah
21:11:12 <strigazi> the trust powers die when the user is deleted
21:11:19 <strigazi> same for application creds
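[For context, a minimal sketch of the trust-scoped authentication pattern the trustee credentials rely on, using keystoneauth1; endpoint, user names and the trust id are placeholders. Once the trustor (the cluster creator) is deleted in keystone, this authentication starts failing, as in the AuthorizationFailure pasted later in the meeting.]

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Authenticate as the trustee user, consuming a trust delegated by the
    # cluster creator (the trustor). If the trustor user is deleted, the
    # trust can no longer be used and requests fail with a 404 for the user.
    auth = v3.Password(
        auth_url='https://keystone.example.com/v3',  # placeholder endpoint
        username='trustee-user',                     # placeholder trustee
        password='trustee-password',
        user_domain_name='Default',
        trust_id='TRUST_ID_FROM_MAGNUM',             # placeholder trust id
    )
    sess = session.Session(auth=auth)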
21:11:28 <imdigitaljim> to be honest, even if we only fix the "magnum polls heat with a service account" part
21:11:32 <imdigitaljim> that would be a huge improvement
21:11:41 <imdigitaljim> that would at least enable us to delete the clusters
21:11:43 <imdigitaljim> without db edits
21:11:58 <strigazi> admins can delete the cluster anyway
21:12:06 <imdigitaljim> we could not
21:12:11 <strigazi> ?
21:12:14 <imdigitaljim> with our admin accounts
21:12:22 <imdigitaljim> the codepaths bomb out with heat polling
21:12:39 <imdigitaljim> not sure where
21:12:43 <jakeyip> is this a heat issue instead?
21:12:44 <imdigitaljim> the occurrence was just yesterday
21:12:53 <strigazi> maybe you diverged in the code?
21:12:56 <imdigitaljim> no i had to delete the heat stack underneath with normal heat functionality
21:13:03 <imdigitaljim> and then manually remove the cluster via db
21:13:16 <strigazi> wrong policy?
21:13:16 <imdigitaljim> not in that regard
21:13:32 <colin-> +1 re: service account, fwiw
21:14:05 <imdigitaljim> nope
21:15:20 <imdigitaljim> AuthorizationFailure: unexpected keystone client error occurred: Could not find user: <deleted_user>. (HTTP 404) (Request-ID: req-370b414f-239a-4e13-b00d-a1d87184904b)
21:15:34 <strigazi> ok
21:15:36 <jakeyip> ok so figuring out why admin can't use magnum to delete a cluster but can use heat to delete a stack will be a way forward?
21:15:48 <jakeyip> I wonder what is the workflow for normal resources (e.g. nova instances) in case of people leaving?
21:16:02 <strigazi> the problem is magnum can't check the status of the stack
21:16:17 <brtknr> it would be nice if the trust was owned by a role+domain rather than a user, so anyone with the role+domain can act as that role+domain
21:16:24 <imdigitaljim> ^
21:16:25 <imdigitaljim> +1
21:16:26 <imdigitaljim> +1
21:16:36 <brtknr> guess it's too late to refactor things now...
21:16:52 <imdigitaljim> imo not really
21:16:53 <strigazi> it is a bit bad as well
21:17:04 <imdigitaljim> but it can be bad based on the use-case
21:17:07 <imdigitaljim> for us its fine
21:17:11 <strigazi> the trust creds are a leak
21:17:39 <imdigitaljim> yeah
21:17:44 <imdigitaljim> the trust creds on the server
21:17:46 <strigazi> userA takes trust creds from userB on a cluster they both own
21:17:50 <imdigitaljim> and you can get access to other clusters
21:17:58 <strigazi> userA is fired, but can still access keystone
21:18:23 <brtknr> oh, because trust is still out in the wild?
21:18:32 <strigazi> the polling issue is different than the trust in the cluster
21:18:37 <imdigitaljim> yeah
21:18:40 <brtknr> change trust password *rolls eyes*
21:18:42 <imdigitaljim> different issues
21:18:56 <strigazi> we can do service account for polling again
21:19:07 <imdigitaljim> but an admin readonly scope
21:19:08 <imdigitaljim> ?
21:19:21 <strigazi> That is possible
21:19:32 <strigazi> since the magnum controller is managed by admins
21:19:35 <imdigitaljim> yeah
21:19:44 <imdigitaljim> i think that would be a satisfactory solution
21:19:53 <imdigitaljim> the clusters we can figure out/delete/etc
21:20:03 <imdigitaljim> but magnum's behavior is a bit unavoidable
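[A rough sketch of what "poll heat with a service account" could look like; this is the proposal under discussion, not current Magnum behaviour, and it assumes an operator-owned service user (names and ids are placeholders), ideally scoped read-only for heat.]

    from heatclient import client as heat_client
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Service account owned by operators, so it survives cluster creators
    # leaving; ideally restricted to read-only access on heat stacks.
    auth = v3.Password(
        auth_url='https://keystone.example.com/v3',  # placeholder endpoint
        username='magnum-service',                   # placeholder service user
        password='service-password',
        project_name='services',
        user_domain_name='Default',
        project_domain_name='Default',
    )
    sess = session.Session(auth=auth)
    heat = heat_client.Client('1', session=sess)

    # Poll the cluster's stack status without touching the per-cluster trust.
    stack = heat.stacks.get('CLUSTER_STACK_ID')      # placeholder stack id
    print(stack.stack_status)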
21:20:39 <imdigitaljim> thanks strigazi!
21:20:43 <imdigitaljim> you going to denver?
21:21:04 <strigazi> https://github.com/openstack/magnum/commit/f895b2bd0922f29a9d6b08617cb60258fa101c68#diff-e004adac7f8cb91a28c210e2a8d08ee9
21:21:19 <strigazi> I'm going yes
21:21:31 <imdigitaljim> lets meet up!
21:22:01 <strigazi> sure thing :)
21:22:58 <strigazi> Is anyone going to work on the polling thing? maybe a longer description first in storyboard?
21:23:12 <flwang1> strigazi: re https://storyboard.openstack.org/#!/story/2005145 i think you and ricardo raised this issue on the mailing list before
21:24:10 <strigazi> yes, I mentioned this. I discussed it with the keystone team in Dublin
21:24:11 <flwang1> and IIRC, we need support from keystone side?
21:24:41 <strigazi> there won't be help or change
21:24:51 <strigazi> from the keystone side
21:25:10 <strigazi> 22:11 < strigazi> trusts transfer *won't* happen in keystone, ever
21:25:24 <strigazi> nor for application credentials
21:25:25 <flwang1> strigazi: so we have to fix it in magnum?
21:25:31 <strigazi> yes
21:25:45 <strigazi> two issues, one is the heat polling issue
21:26:03 <strigazi> 2nd, the cluster inside the cluster must be rotated
21:26:11 <imdigitaljim> creds inside*
21:26:28 <strigazi> we had a design for this in Dublin, but no manpower
21:26:33 <strigazi> yes, creds :)
21:26:43 <imdigitaljim> yeah 1) trust on magnum, fixable and 2) trust on cluster, no clear path yet
21:27:06 <strigazi> 2) we have a rotate certificates api with noop
21:27:17 <strigazi> it can rotate the certs and the trust
21:27:22 <strigazi> that was the design
21:27:26 <flwang1> strigazi: ok, i think we need longer discussion for this one
21:27:44 <imdigitaljim> im more concerned about 1) for the moment which is smaller in scope
21:27:52 <imdigitaljim> 2) might be more challenging and needs more discussion/design
21:27:57 <strigazi> no :) we did the design a year ago, someone can implement it :)
21:28:50 <strigazi> I'll bring up the pointer in storyboard
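[The Dublin design referenced here builds on Magnum's existing certificate-rotate API (currently a no-op in some drivers): the same operation would rotate the cluster certs and, eventually, the trust/credentials inside the cluster. A minimal sketch of triggering it from Python, assuming the usual python-magnumclient entry point and that the certificates manager exposes the rotate call as rotate_ca (what the `openstack coe ca rotate` command wraps); `sess` is an authenticated keystoneauth1 session as in the earlier snippet.]

    from magnumclient.client import Client as MagnumClient

    magnum = MagnumClient('1', session=sess)
    cluster = magnum.clusters.get('my-cluster')      # placeholder cluster name

    # Rotate the cluster CA; per the Dublin design this operation would be
    # extended to also rotate the trust/credentials stored in the cluster.
    magnum.certificates.rotate_ca(cluster_uuid=cluster.uuid)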
21:30:17 <strigazi> For the autoscaler, are there any outstanding comments? Can we start pushing the maintainers to accept it?
21:30:36 <flwang1> strigazi: i'm happy with current status.
21:30:43 <flwang1> it passed my test
21:31:12 <schaney> strigazi: there are some future enhancements that I am hoping to work with you guys on
21:31:17 <flwang1> strigazi: so we can/should start to push CA team to merge it
21:32:22 <strigazi> schaney: do you want to leave a comment that you are happy with the current state? then we can ping the CA team, the {'k8s', 'sig', 'openstack'} in some order
21:32:23 <flwang1> schaney: sure, the /resize api is coming
21:34:44 <schaney> I can leave a comment yeah
21:35:15 <schaney> Are you alright with me including some of the stipulations in the comment?
21:35:41 <schaney> for things like nodegroups, resize, and a couple bugs
21:35:59 <strigazi> schaney: I don't know how it will work for them
21:36:18 <schaney> same, not sure if it's better to get something out there and start iterating
21:36:29 <strigazi> +1 ^^
21:36:33 <schaney> or try to get it perfect first
21:36:58 <flwang1> schaney: i would suggest tracking them in magnum or opening separate issues later, but just my 2c
21:37:30 <imdigitaljim> we'll probably just do PRs against the first iteration
21:37:31 <schaney> track them in magnum vs the autoscaler?
21:37:43 <imdigitaljim> and use issues in autoscaler repo probably
21:37:47 <imdigitaljim> ./shrug
21:38:27 <schaney> yeah, us making PRs to the autoscaler will work for us going forward
21:38:40 <schaney> the current PR has so much going on already
21:38:48 <strigazi> We can focus on the things that work atm, and when it is in, PR in the CA repo are fine
21:38:53 <flwang1> issues in autoscaler, but don't scare them :)
21:39:03 <flwang1> strigazi: +1
21:39:40 <schaney> one question: has tghartland looked into the TemplateNodeInfo interface method implementation?
21:39:41 <strigazi> as long as we agree on the direction
21:40:08 <schaney> I think the current implementation will cause a crash
21:40:16 <imdigitaljim> imho i think we're all heading the same direction
21:40:47 <strigazi> crash on what? why?
21:40:56 <schaney> the autoscaler
21:41:24 <strigazi> is it reproducible?
21:41:58 <schaney> Should be, I am curious as to if you guys have seen it
21:42:03 <strigazi> no
21:42:28 <schaney> I'll double check, but the current implementation should crash 100% of the time when it gets called
21:42:49 <strigazi> it is a specific call that is not implemented?
21:42:55 <schaney> yes
21:42:57 <strigazi> TemplateNodeInfo  this >
21:42:59 <schaney> TemplateNodeInfo()
21:43:16 <strigazi> I'll discuss it with him tmr
21:43:48 <schaney> kk sounds good, I think for good faith for the upstream autoscaler guys, we might want to figure that part out
21:44:11 <schaney> before requesting merge
21:44:38 <strigazi> 100% probability of crash should be fixed first
21:44:58 <schaney> :) yeah
21:45:40 <strigazi> it is the vm flavor basically?
21:45:57 <schaney> yeah pretty much
21:46:29 <schaney> the autoscaler gets confused when there are no schedulable nodes
21:46:54 <schaney> so TemplateNodeInfo() should generate a sample node for a given nodegroup
21:47:14 <strigazi> sounds easy
21:47:56 <schaney> Yeah shouldn't be too bad, just need to fully construct the template node
21:48:07 <strigazi> this however: 'the autoscaler gets confused when there are no schedulable nodes' sounds bad.
21:48:33 <schaney> it tries to run simulations before scaling up
21:48:45 <strigazi> so how does it work now?
21:49:04 <schaney> if there are valid nodes, it will use their info in the simulation
21:49:14 <strigazi> it doesn't do any simulations?
21:49:17 <schaney> if there is no valid node, it needs the result of templateNodeInfo
21:50:28 <strigazi> if you can send us a scenario to reproduce, it would help
21:51:15 <schaney> cordon all nodes and put the cluster in a situation to scale up, should show the issue
21:51:36 <strigazi> but, won't it create a new node?
21:52:04 <strigazi> I pinged him, he will try tmr
21:52:33 <flwang1> strigazi: in my testing, it scaled up well
21:52:43 <strigazi> schaney: apart from that, anything else?
21:52:52 <strigazi> to request to merge
21:53:01 <strigazi> flwang1: for me as well
21:54:18 <schaney> I think that was the last crash that I was looking at, everything else will just be tweaking
21:54:30 <strigazi> nice
21:54:38 <schaney> flwang1: to be clear, this issue is only seen when effectively scaling up from 0
21:55:02 <flwang1> schaney: i see. i haven't tested that case
21:55:39 <schaney> rare case, but I was just bringing it up since it will cause a crash
21:55:54 <flwang1> schaney: cool
21:55:58 <strigazi> we can address it
21:56:09 <schaney> awesome
21:58:16 <strigazi> we are almost out of time
21:58:43 <flwang1> strigazi: rolling upgrade status?
21:58:54 <strigazi> I'll just ask one more time, Can someone look into the CI failures?
21:59:05 <flwang1> strigazi: i did
21:59:20 <strigazi> flwang1: end meeting first and then discuss it?
21:59:20 <flwang1> the current ci failure is related to nested virt
21:59:30 <strigazi> how so?
21:59:30 <flwang1> strigazi: sure
21:59:45 <flwang1> i even brought it up in the infra channel
21:59:51 <strigazi> let's end the meeting first
21:59:58 <colin-> see you next time
22:00:03 <strigazi> thanks everyone
22:00:07 <flwang1> and there is no good workaround right now, seems infra recently upgraded their kernel
22:00:16 <flwang1> mnaser may have more input
22:00:33 <strigazi> #endmeeting