10:00:15 #startmeeting containers
10:00:16 Meeting started Tue Jul 10 10:00:15 2018 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
10:00:19 The meeting name has been set to 'containers'
10:00:20 #topic Roll Call
10:00:23 o/
10:00:29 o/
10:01:30 agenda:
10:01:37 #link https://wiki.openstack.org/wiki/Meetings/Containers#Agenda_for_2018-07-10_1700_UTC
10:01:42 #topic Blueprints/Bugs/Ideas
10:02:19 nodegroups, different flavors and AZs
10:02:21 strigazi: after you
10:03:24 At the moment we only support selecting flavors for the master and worker nodes
10:03:45 For availability zones, we can select the AZ for all nodes.
10:04:12 There is a need to specify different AZs and flavors
10:04:39 AZ for availability, flavors for special vms, for example flavors with GPUs
10:04:59 strigazi: If we want multiple flavours for the cluster creation, can we go ahead and implement this https://blueprints.launchpad.net/magnum/+spec/support-multiple-flavor
10:05:22 i guess the spec for this BP is not approved, can i modify the same one or raise a new one?
10:05:39 o/
10:05:43 Nodegroups were proposed in the past, IMO it is an over-engineered solution and we don't have the manpower for it.
10:05:58 mvpnitesh: we don't use blueprints any more.
10:06:19 strigazi: ok
10:06:25 mvpnitesh: we migrated to storyboard. These BPs were moved there with the same name.
10:07:06 brtknr: hi
10:07:41 we can consolidate these three blueprints into one that allows a cluster with different types of nodes
10:08:04 strigazi: hi, thanks for reviewing the floating ip patch, im confused about the time on the agenda... says 1700.
10:08:06 each node can be different from the others in flavor and AZ
10:08:20 brtknr is it?
10:08:51 brtknr: we alternate and I copied the wrong line
10:09:31 ah, so 1 week at 1000, another week at 1700? i like that solution
10:10:31 brtknr: http://lists.openstack.org/pipermail/openstack-dev/2018-June/131678.html
10:11:08 So, for AZs and flavors. No one is working on it.
10:11:36 We need to design it and target it for S
10:12:15 mvpnitesh: ^^
10:12:34 strigazi: We want to have multiple flavours for a single cluster. I'll look into the storyboard and come up with the design and i'll target that for S
10:13:26 We can discuss it next week again and we can come prepared.
10:13:40 brtknr: flwang1: are interested in this ^^
10:14:39 brtknr: flwang1: are you interested in this ^^
10:14:41 strigazi: not really ;)
10:14:41 brtknr: multiple flavors and multiple os too if possible
10:14:50 multiple os?
10:15:09 for the special images you have for gpus?
10:15:45 strigazi: yes, i looked into fedora support for gpu, looks a bit hacky
10:15:56 considering nvidia do not officially support gpu
10:16:02 whereas centos is supported
10:16:27 nvidia do not officially support fedora gpu drivers
10:16:49 Let's see, I think we can even use centos-atomic without any changes
10:17:27 Next subject,
10:18:40 I'm working on changing this method to sync the cluster status https://github.com/openstack/magnum/blob/master/magnum/drivers/heat/driver.py#L191
10:19:11 strigazi: what's the background?
10:19:16 Instead of the list that I mentioned, we can do a get without resolving the outputs of the stack
10:19:23 ah, i see
10:19:24 got it
10:19:58 flwang1: with a big stack magnum tries to kill heat.
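[Sketch, not from the log: the change discussed above is to fetch the one cluster stack directly and skip resolving the stack outputs, which is the expensive part on big stacks. A rough CLI equivalent, assuming python-heatclient's --no-resolve-outputs option for stack show and a hypothetical stack name:

    openstack stack show --no-resolve-outputs k8s-cluster-abc
]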
10:20:16 flwang1: With the health check you are working on
10:20:38 we can avoid getting the worker nodes, like ever
10:20:39 i can remember the issue now
10:22:12 but this is another discussion
10:23:10 ok
10:23:34 This week I expect to push this patch, the cloud-provider enable/disable patch, k8s upgrades and the keypair patch.
10:23:38 is there a story for this?
10:23:57 this one https://storyboard.openstack.org/#!/story/2002648
10:24:25 I saw that flwang1 and imdigitaljim had questions about rebuilding and fixing clusters
10:24:58 The above change is also required for supporting users.
10:25:30 Ah, I had the same issue with clusters created using just the heat template definition, which we are using to create clusters with complex configuration
10:25:50 At the moment, when a user has a cluster with a few nodes (not one node)
10:26:08 And a node is in bad shape we delete this node with heat
10:26:21 Actually we tell him how to do it:
10:27:29 openstack stack update --existing -P minions_to_remove= -P number_of_minions=
10:28:30 the resource id can be found either from the name of the vms or by doing openstack stack resource list -n 2
10:29:23 using the resource id is more helpful since you can even delete nodes that didn't get an ip and also the command we cook for users is smaller :)
10:29:31 strigazi: btw, is Ricardo still working on auto healing?
10:29:57 flwang1: he said he will
10:30:26 flwang1: so auto healing will do what I described automatically
10:30:30 strigazi: ok, cool
10:30:59 even so, the change with the keypair is useful so that admins can do this operation
10:31:08 or other users on the project
10:31:11 or other users in the project
10:31:40 make sense?
10:32:06 strigazi: yes for me
10:32:15 And for supporting users and the health status
10:32:25 At the moment
10:32:44 users can not retrieve certs if the cluster is in a failed state
10:33:17 We must change this so that we also know what k8s or swarm think about the cluster
10:33:44 yep, we need the health status to help magnum understand the cluster status
10:34:13 Instead of checking the status of the cluster magnum can check if the CA is created
10:34:51 If the CA is created users should be able to retrieve the certs
10:34:53 makes sense?
10:35:14 we may need some discussion about this
10:35:27 What are your doubts?
10:35:54 The solution we need is
10:36:05 A user created a cluster with 50 nodes
10:36:28 the CA is created, the master is up and also 49 workers
10:36:56 1 vm failed to boot, to get connectivity or to report to heat.
10:37:08 strigazi: my concern is why the user has to care about a cluster failure
10:37:14 cluster status goes to CREATE_FAILED
10:37:16 can't he just create a new one?
10:37:39 what if it is 100 nodes?
10:38:00 no matter how many nodes, the time of creation should be 'same'
10:38:05 does it sound reasonable for the rest of the openstack services to take the load again?
10:38:11 so the effort to create a new one is 'same'
10:38:49 flwang1: in practice it will not be the same
10:38:57 i can see your point
10:39:32 i'm happy to have it fixed, but personally i'd like to see a 'fix' api
10:39:33 i was dealing with a 500 node cluster last week and I still feel the pain
10:39:48 instead of introducing too much manual effort for the user
10:40:28 we can continue it offline
10:40:30 when automation fails you need to have manual access
10:40:47 not if, when
10:40:51 ok
10:40:56 that's it from me
10:41:10 you finished?
10:41:24 yes, go ahead
10:41:31 i have a long list
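[Sketch, not from the log: a filled-in version of the node-removal command from 10:27:29, with hypothetical values (cluster stack k8s-cluster-abc, bad node at resource index 5, worker count going from 10 to 9):

    # find the resource index/name of the bad node, two levels down in the nested stacks
    openstack stack resource list -n 2 k8s-cluster-abc
    # remove that node and update the worker count in a single stack update
    openstack stack update --existing \
      --parameter minions_to_remove=5 \
      --parameter number_of_minions=9 \
      k8s-cluster-abc
]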
10:41:47 1. i'm still working on the multi region issue
10:42:01 and the root cause is in heat
10:42:26 i have proposed several patches to heat, heat-agents and os-collect-config
10:42:41 I saw only one for heat
10:42:44 pointers?
10:42:58 https://review.openstack.org/580470
10:43:08 https://review.openstack.org/580229
10:43:16 these two are for heat
10:43:20 one has been merged in master
10:43:26 i'm cherry-picking to queens
10:43:42 for occ https://review.openstack.org/580554
10:44:10 these three?
10:44:17 for heat-agents, https://review.openstack.org/580984 is the backport to queens
10:44:35 we don't have to care about the last one
10:44:44 yeap
10:44:51 but we do need the fixes for heat and occ
10:45:24 2. heat-container-agent images can't be built
10:45:41 it failed to find python-docker-py, not sure if there is anything i missed
10:46:48 3. etcd race condition issue https://review.openstack.org/579484
10:46:59 flwang1 I'll have a look at 2
10:47:06 3 is ok now
10:47:08 strigazi: thanks
10:47:31 strigazi: yep, can you bless #3? ;)
10:47:41 Merged openstack/magnum master: Pass in `region_name` to get correct heat endpoint https://review.openstack.org/579043
10:47:43 Merged openstack/magnum master: Add release notes link in README https://review.openstack.org/581242
10:47:51 3: Patch in Merge Conflict
10:48:10 ah, yep.
10:48:28 flwang1: you can trim down the commit message when you rebase
10:48:32 as for the rename-scripts patch that has been reverted, can we get it in again? https://review.openstack.org/581099 i have fixed it
10:48:41 flwang1: yes
10:48:47 strigazi: thanks
10:49:12 flwang1: I was checking the cover job, not sure why it sometimes fails
10:49:30 4. Clear resources created by k8s before delete cluster https://review.openstack.org/497144
10:49:50 the method used in this patch is not good IMHO
10:50:20 technically, there could be many clusters in the same subnet
10:50:27 flwang1: what do you propose?
10:50:49 we're working on a fix in CPO to add the cluster id into the LB's description
10:51:03 that's the only safe way IMHO
10:51:10 I might be missing something, but why don't we clear loadbalancers in a separate software deployment?
10:51:24 we won't need to connect to the k8s api then
10:51:44 sfilatov: what do you mean?
10:51:55 the lb is created by k8s
10:52:07 we're not talking about the lb of the master
10:52:12 yes
10:52:18 i'm talking about k8s too
10:52:29 we could have done kubectl delete on all of them
10:52:33 inside a cluster
10:53:14 sfilatov: can you just propose a patch? im happy to review and test it
10:53:24 yes, I'm working on it
10:53:39 but i kinda have a patch similar to the one proposed
10:53:51 so to delete a cluster we will do a stack update first?
10:53:59 and we have a lot of issues with it
10:54:04 no
10:54:31 we can have a softwaredeployment for the DELETE action
10:54:49 +1
10:55:18 we are working on this issue, I can provide a patch by the end of this or next week I guess
10:55:27 sfilatov: nice
10:55:39 strigazi: that's me
10:55:56 can you link a bug or a blueprint for this issue?
10:56:02 i'm still keen to understand the auto upgrade and auto healing status
10:56:10 sfilatov: wait a sec
10:56:13 brtknr: sfilatov something to add?
10:56:25 sfilatov: story/1712062
10:57:00 thx
10:57:33 np
10:57:45 hi
10:58:50 Anything else for the meeting?
10:59:43 strigazi: nope, im good
10:59:48 ok then, see you this Thursday or in 1 week - 1 hour
10:59:59 #endmeeting
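[Sketch, not from the log: the cleanup discussed around 10:51-10:55 amounts to deleting every LoadBalancer-type Service from inside the cluster, for example from a software deployment that runs on the DELETE action, so the cloud provider removes the corresponding load balancers before Heat deletes the network. A rough shell version, with the jsonpath filter and the exact invocation treated as assumptions:

    # delete all Services of type LoadBalancer across namespaces
    kubectl get services --all-namespaces \
      -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' \
    | while read -r ns name; do
        kubectl delete service -n "$ns" "$name"
      done
]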