14:01:06 #startmeeting openstack-cyborg
14:01:06 Meeting started Wed May 9 14:01:06 2018 UTC and is due to finish in 60 minutes. The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:09 The meeting name has been set to 'openstack_cyborg'
14:01:36 #topic Roll Call
14:01:41 #info Howard
14:01:54 #info sum12
14:02:00 #info Mike
14:02:01 #info Helloway
14:02:01 #info Ed
14:02:43 #info Li_Liu
14:02:54 #info shaohe
14:03:02 Hi sum12, could you introduce a bit about yourself?
14:03:20 #info xinran__
14:03:35 Hey, I am Sumit. I work for SUSE and have been part of some other projects already.
14:03:49 Hi Sumit
14:03:50 welcome :)
14:03:58 Hi Sumit
14:04:05 Thanks everyone :)
14:04:31 Welcome
14:04:34 #info Sundar
14:04:56 welcome sum12
14:05:29 glad to be here, thanks sundar zhipeng Li_Liu NokMikeR
14:06:17 okey let's get into business
14:06:25 #topic driver subteam meeting time
14:07:11 so shaohe has helped start a poll
14:07:19 and it seems three options are at the top
14:08:11 Could somebody post the poll link here?
14:08:12 UTC 1:00am Mon, UTC 2:00pm Mon, UTC 8:30am Wed
14:08:59 #link https://doodle.com/poll/pa3gi78yncsr7qee
14:10:34 sundar: any preference on the time?
14:10:38 let's pick one :)
14:11:23 I am good with anytime at night (which is morning back in China)
14:12:00 Yes, Shaohe. I clicked on the doodle now. Sunday eve or Mon 7 am PST are both good
14:13:31 ok let's do Mon UTC 2:00pm
14:13:57 which will be 10:00pm in China, 10:00am EST, 7:00am PST
14:15:19 it's the most popular time
14:15:22 okey?
14:15:32 ok
14:15:32 Sounds good to me!
14:16:02 #agreed driver subteam meeting time every Mon UTC1400
14:16:04 Monday?
14:16:15 Monday
14:16:17 ok
14:16:21 sounds good
14:16:42 #topic new core reviewer promotion
14:17:06 in order to increase our review bandwidth, i hereby promote Sundar to be the new core reviewer
14:17:24 Sundar has been very active and has taken charge of several important specs
14:17:57 so as usual, we will have one week's time for any feedback, and acknowledgment of the promotion next Wed :)
14:18:26 #info kosamara
14:18:29 Thanks, zhipeng
14:18:41 gratz :)
14:19:03 gratz :)
14:19:19 well not yet :)
14:19:39 let's wait a week for feedback
14:20:00 Thanks, Li_Liu and shaohe :). As zhipeng says, one week from now, we will know.
14:20:06 but i would like to thank Sundar for his great effort so far
14:20:12 :)
14:20:16 okey moving on
14:20:41 #topic KubeCon feedback
14:20:58 okey last week i attended KubeCon and the resource mgmt wg deep dive session
14:21:34 the k8s res mgmt wg is the center in k8s that deals with general and acceleration resources
14:22:10 My takeaway is that support for general accelerator mgmt is still not in any shape in k8s
14:22:23 Google is interested in GPU passthrough support for ML, mainly
14:23:32 so if anyone wants to introduce any feature in-tree
14:24:26 that would require a PoC up-front
14:24:59 many things we discussed here, like vGPU types and general accelerator support including FPGA and others,
14:25:05 are viewed as non-priority
14:25:22 the resource class/resource api PRs are also a long shot
14:25:28 according to vishnu
14:27:29 zhipeng: Agreed with your assessments. If we can get just one feature in -- passing annotations to the device plugin API -- that will help us meet most basic FPGA use cases, IMHO
14:27:55 Sasha mentioned the Intel team has finished an FPGA DPI PoC
14:28:04 but also just pre-programmed FPGAs
14:28:39 We can couple that with a scheduler extension. However, scheduler extensions are not viewed favorably because the scheduler framework itself may change and the APIs may change along with it.
14:29:00 yes, and also DPI is designed at the node level
14:29:09 However, we can do a PoC based on it, including programming support, and revamp it when the APIs change. Just my thought. :)
14:29:15 and a mostly "reschedule" focused mechanism
14:29:36 Yes, we do have a PoC that does only the pre-programmed use case. That does not show the strength of FPGAs, which is reprogramming
14:30:01 so DPI is designed to mostly work for the hot-plug use case, not scheduling upfront
14:30:22 the scheduling will be retriggered once the node discovers the DPI plugin
14:30:35 anyways that is the current lay of the land
14:31:04 so my thinking is, maybe it is reasonable to introduce a CRD framework for cyborg into the k8s community
14:31:35 so that we could have all of our data model preserved, have leeway on the api and scheduling design
14:31:46 maintain a k8s-ish API interface
14:32:12 and an out-of-band general accelerator mgmt functionality, not bound to DPI development
14:32:42 i don't know what other team members think on this matter?
14:32:44 The CRD framework does not allow AFAIK for a nested topology, which OpenStack supports.
14:33:02 CRD is just an API mechanism right?
14:33:06 needs more blinking lights
14:33:12 not implementation specific
14:33:56 Yes. How do we model regions inside FPGAs, accelerators inside regions, local memory inside either ...
14:34:10 that could all be done in Cyborg
14:35:18 for example if you look at the kubernetes service catalog
14:35:25 k8s will do the scheduling for Cyborg then?
14:35:43 we could use a scheduling extension maybe for that
14:36:31 but I doubt Google wants to have the k8s core doing scheduling that takes accelerators into consideration
14:37:46 The Cyborg implementation could relate different resources. Agreed. The CRD discussions also seem to get into resource classes etc., which seem to be a long shot, as you said. Yes, agreed that the scheduling core cannot be changed
14:37:46 but the scheduling extension still requires some change in the k8s main tree right?
14:38:10 Li_Liu: scheduler extension is a standard mechanism in K8s today
14:38:21 yep what sundar said
14:38:33 However, the scheduler framework itself may evolve, and the extension APIs may change along with it
14:39:12 Link to proposed K8s scheduler framework: https://docs.google.com/document/d/1NskpTHpOBWtIa5XsgB4bwPHz4IdRxL1RNvdjI7RVGio/edit#
14:40:45 #link https://medium.com/@trstringer/create-kubernetes-controllers-for-core-and-custom-resources-62fc35ad64a3
14:40:52 some crd fundamentals
14:41:40 zhipeng: can not open it.
14:41:49 so as I understand, crd is basically a way for us to write a non-core k8s-ish controller
14:41:53 shaohe you need vpn
14:42:07 it listens on the api-server
14:42:28 and the keyword will trigger the request going to the crd controller, instead of the core k8s controller
14:42:40 basically a hat on cyborg, if you will
14:44:11 so it's a subscribe/notify model right?
14:44:41 in essence, as I understand, yes
14:46:14 in go land
14:47:56 yep :)
14:48:02 is it trivial to use python from go, or how does the cyborg api interaction work with k8s and go?
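[Editor's note: a minimal sketch of the subscribe/notify (watch) pattern just described, using the kubernetes Python client against a hypothetical Cyborg CRD. The group/version/plural names and the reconciliation step are illustrative assumptions, not an agreed design.]

    from kubernetes import client, config, watch

    config.load_kube_config()  # use load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    # Watch events on a hypothetical 'accelerators' custom resource. The CRD
    # routes these requests to our controller instead of a core controller.
    w = watch.Watch()
    for event in w.stream(api.list_cluster_custom_object,
                          group="cyborg.openstack.org",  # hypothetical group
                          version="v1alpha1",
                          plural="accelerators"):
        obj = event["object"]
        # Reconcile desired vs. actual state here, e.g. ask Cyborg to
        # (re)program an FPGA region to match the object's spec.
        print(event["type"], obj["metadata"]["name"])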
14:48:28 we could have gRPC clients that abstract away the lang difference
14:48:45 ok
14:49:17 I need that for English to Finnish to English also :)
14:50:09 haha
14:50:22 you need google duplex for that :P
14:51:10 :)
14:51:17 Exactly, gRPC -- as Zhipeng said. The controller is a separate daemon. Also, kubelet and the Cyborg DP will also be separate processes.
14:52:34 let's keep the discussion alive offline :)
14:52:49 #topic bugs and issues
14:53:26 shaohe a colleague of mine reported that when devstack starts, he could not find the cyborg services
14:53:37 have you encountered a similar problem?
14:55:28 shaohe dropped i think
14:55:39 okey let's move on to the next topic then
14:55:43 I reported the same problem many moons ago, but have not tried lately to install it
14:56:31 NokMikeR i think during some of the past fixes it turned out ok
14:56:42 i'm not sure if some of the recent patches break it
14:57:45 the mutable config page shows a failure? https://review.openstack.org/#/c/559303/
14:58:08 ping
14:58:23 hey welcome back
14:58:25 pong
14:58:32 NokMikeR that should already be fixed
14:58:38 ok
15:00:09 so the specific problem is c-cond and c-agent are not running
15:00:24 shaohe_ that is not normal right?
15:00:55 yes.
15:01:23 if devstack succeeds
15:01:38 c-api, c-cond, c-agent should all be running right?
15:01:38 devstack should report an error if c-cond and c-agent are not running
15:02:24 ok i will contact the author
15:02:28 okey moving on
15:03:18 #topic spec review day
15:03:28 it should be cyborg-agent, cyborg-api, cyborg-cond
15:03:38 nova and other teams all have this custom of spec sprints, or runways
15:03:43 let's have one as well
15:03:51 shaohe_ ok I will let him know
15:04:08 #link https://etherpad.openstack.org/p/cyborg-rocky-spec-day
15:04:12 is c-xxx a cinder process?
15:04:25 maybe he just referred to it wrong
15:04:31 I should double check with him
15:04:48 yes
15:05:34 okey back to the topic
15:06:26 let's start with the "old ones" :)
15:06:33 we should be ready to land those
15:06:44 * NokMikeR braces for impact
15:07:06 first up
15:07:09 #info python-cyborgclient framework
15:07:19 #link https://review.openstack.org/565023
15:07:35 shaohe_ mentioned the client code is actually ready, so let's land this one
15:07:40 any objections?
15:07:48 agree, I think we can merge that one
15:07:54 I think the syntax is not quite in line?
15:08:05 just finished reviewing
15:08:15 Other commands are like 'openstack server ...' where the 2nd argument is the object on which an action is applied
15:08:50 Whereas we are proposing 'openstack acceleration ...' I saw Shaohe respond to my comment.
15:09:18 It is not clear to me why we cannot have 'openstack accelerator ...'
15:09:54 wondering why it's not openstack cyborg list/show?
15:09:57 because accelerator is more like an object that we will choose to act upon
15:10:06 NokMikeR legal issue
15:10:12 ok
15:10:32 NokMikeR: Just as we have 'openstack server ...' instead of 'openstack nova ...'
15:10:45 cyborg is the project name, acceleration is the service type.
15:10:46 yeah - project names should be avoided
15:11:00 clarified thanks.
15:11:14 that's good to know :)
15:11:42 shaohe: It is not clear to me why we cannot have 'openstack accelerator ...'
15:12:03 sundar see my above comment
15:12:09 because accelerator is more like an object that we will choose to act upon
15:12:24 acceleration is a type of service, like the server service or volume service
15:12:46 zhipeng, yes. Like server/image etc., shouldn't it be the 2nd arg?
15:12:55 we might have something like openstack acceleration fpga create
15:13:15 shaohe_ correct me if I'm wrong
15:13:32 yes. I have looked at other projects.
15:13:48 they do use this syntax
15:14:01 some use this syntax
15:14:07 is it command (create) then object (fpga)?
15:15:05 openstack [<global-options>] <object-1> <action> [<object-2>] [<command-arguments>]
15:15:25 yes, as per NokMikeR's suggestion
15:15:37 NokMikeR yes the example i mentioned might not be strictly correct :P
15:15:40 global-options can be the service type
15:16:12 for cyborg the service type is acceleration.
15:16:29 for cinder the service type is volume
15:16:43 #link openstackclient guidelines: https://docs.openstack.org/python-openstackclient/3.4.1/humaninterfaceguide.html#command-structure
15:16:44 for glance the service type is image
15:16:47 shaohe: Maybe I still have a disconnect. :) "openstack [<global-options>] <object-1> <action> [<object-2>] [<command-arguments>]" Where is the service here?
15:17:11 sundar: object-1
15:18:06 So, we will have syntax like 'openstack acceleration create ...'
15:18:49 something like that
15:18:52 openstack network --help |less
15:18:52 If everybody else is ok with it, I am ok too :)
15:19:07 I would suggest s/acceleration/accelerator
15:19:15 openstack network flavor create
15:19:52 edleafe any reason for that?
15:20:04 IMHO, the term 'accelerator' is more in line with other usages -- but go ahead :)
15:20:16 ^^ what sundar said
15:21:06 ya I'm just thinking when we actually use accelerator it will need to be more specific (FPGA, GPU, ...)
15:21:24 acceleration could just represent the service type offered by cyborg in general
15:21:26 sundar: edleafe: $ openstack --help |grep volume
15:21:35 Yes I think accelerator is more reasonable
15:21:52 zhipeng: yes, openstack accelerator create fpga -- here 'fpga' is 'object-2'
15:21:55 acceleration implies more than one device? accelerator is singular or one device. If we debate this to the end we end up somewhere in particle physics... :)
15:22:09 haha NokMikeR
15:22:17 but I think accelerator has more votes here
15:22:31 i feel accelerator/acceleration are both too complicated as compared to service/image/network/...
15:22:47 sum12 lol what is your suggestion
15:23:26 I am suggesting we ask if we have anything easier in our arsenal
15:23:44 maybe just acc?
15:23:54 sum12: acc as an abbreviation?
15:24:07 shaohe_ man crush
15:24:07 sum12: we use the term 'accelerator' or 'device'. But the term device is too general?
15:24:25 sundar ya that might confuse people
15:24:25 accel?
15:24:38 accel is a little bit too Xilinx-ish?
15:24:43 sum12: accel is ok by me. I use that abbrev too
15:24:52 okey anyone else?
15:25:15 if accel could work then accel it is :)
15:25:16 If we give bash completions
15:25:33 If we provide bash completions, it may not matter :)
15:25:38 'accelerator' is a known term for computers.
15:25:43 'accel' not so much
15:25:45 the openstack command supports bash completion
15:26:13 sum12: with bash completion, do you still see a problem with 'accelerator'?
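[Editor's note: whichever noun wins, python-cyborgclient would hook into openstackclient as a cliff plugin command. A minimal sketch of an 'openstack accelerator list' command follows; the 'accelerator' client_manager attribute and the accelerators.list() call are assumptions for illustration, not the actual patch under review.]

    from osc_lib.command import command


    class ListAccelerator(command.Lister):
        """List accelerators (the 'openstack accelerator list' command)."""

        def take_action(self, parsed_args):
            # The OSC plugin entry point would attach a client under a
            # service-type attribute; the name 'accelerator' is an assumption.
            client = self.app.client_manager.accelerator
            columns = ("uuid", "name", "device_type")
            data = ((acc.uuid, acc.name, acc.device_type)
                    for acc in client.accelerators.list())
            return columns, data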
15:26:47 let's see cinder's command names
15:26:48 $ openstack --help |grep volume
15:27:04 volume type create     Create new volume type
         volume type delete     Delete volume type(s)
         volume type list       List volume types
         volume type set        Set volume type properties
         volume type show       Display volume type details
         volume type unset      Unset volume type properties
         volume unset           Unset volume properties
15:27:12 oh, sorry
15:27:19 volume type create Create new volume type
15:27:27 volume snapshot create Create new volume snapshot
15:27:33 bash completion is not a problem, but if I were a devops guy I'd like it small and easy to remember (scripting) and not too character-y
15:27:37 https://docs.openstack.org/python-openstackclient/latest/ they are listed here
15:28:16 https://docs.openstack.org/python-openstackclient/latest/cli/commands.html
15:28:21 From the link by NokMikeR: $ volume type list # 'volume type' is a two-word single object
15:28:22 $ openstack volume type create --help
15:28:43 usage: openstack volume type create [-h] [-f {json,shell,table,value,yaml}]
15:29:03 sum12: at least it isn't as long as 'application credential' :)
15:29:31 edleafe: :)
15:29:49 do we have a consensus now? :)
15:29:50 sundar: yes, it is a two-word single object
15:30:12 and volume is the service type.
15:30:14 We have a single-word single object :) which is better
15:30:50 Anyways, I vote for 'openstack accelerator ...' That's my 2 cents
15:31:56 i vote for that as well
15:32:07 same here, if I'm allowed.
15:32:32 OK, remove the service type.
15:32:39 but keep in mind
15:33:04 I would prefer accelerator too
15:33:12 if cyborg supports a flavor api
15:33:23 flavor create/list
15:33:27 what should it be?
15:33:33 sum12 that word looks a bit shorter now? lol
15:33:35 ^ sundar:
15:33:54 :)
15:34:19 let's show the command:
15:34:20 $ openstack --help |grep flavor
15:34:31 flavor create Create new flavor
15:34:40 network flavor create Create new network flavor
15:34:50 there are two flavors,
15:35:19 Shaohe: why should Cyborg commands create flavors? Flavors with accelerators should still be under the usual command, right? In any case, we can do 'openstack accelerator create flavor ...'
15:35:19 the first one is the nova flavor
15:35:35 the second one is the network flavor.
15:36:03 shaohe: ok. we can do 'openstack accelerator create flavor ...'
15:36:37 sundar: but openstack accelerator create flavor is not formal
15:37:21 for accelerator is just a collection of our RESTful URLs.
15:38:12 and acceleration is the service type we register in keystone
15:38:25 shaohe: Not sure what you mean by 'formal'. If you prefer 'openstack accelerator flavor create', like nova/network, I am fine.
15:38:50 shaohe_ i think let's just change it to accelerator
15:38:59 seems like the team consensus at the moment
15:39:18 after that update the patch should be good to go :)
15:39:57 zhipeng: OK, so we register in keystone using accelerator instead of acceleration?
15:40:00 There was some deprecated code in that patch
15:40:16 sundar: I will remove it
15:40:25 Thanks, shaohe
15:40:33 shaohe- yes I guess so
15:41:16 OK, let me file a patch to correct the service type first.
15:41:52 many thx :)
15:41:58 ok given the time in China
15:42:00 xiran_
15:42:12 could you provide an update on the quota spec?
15:42:16 xinran_
15:42:41 #info Quota spec
15:42:49 #link https://review.openstack.org/560285
15:42:59 #link https://review.openstack.org/564968
15:44:53 Is this the Keystone-based quota that we were recommended to use?
15:45:03 Yes, i have updated the spec. Firstly we should support quota usage in cyborg, and implement the limit part by invoking oslo.limit once the keystone guys finish that
15:46:42 I have a doubt about resource type. Should we just count the total number of accelerators, or should we count per type, like fpga, gpu, etc.?
15:47:11 We count the number of deployables I think
15:47:29 Li_Liu shall we count them per type?
15:47:42 Li_Liu: An FPGA, as well as the regions within it, will all be Deployables, right?
15:47:44 count granularity
15:48:08 yes, they should be grouped into types for sure
15:48:29 zhipeng: seems there is only one resource class (accelerator) for now
15:48:57 So, would the quota be based on Deployable type? i.e. you can get X regions
15:49:04 the deployable patch has already been merged
15:49:22 regions are just a type of deployable
15:49:53 I think xinran is right -- quotas are based on resource classes, right?
15:50:55 sundar: yes I think so :)
15:51:24 okey then we could settle upon that :)
15:51:26 sundar: only one resource type for quota?
15:51:35 OK, then there is only one resource class in Cyborg -- CUSTOM_ACCELERATOR -- as we agreed with Nova
15:52:28 i see, that's what is exposed to nova.
15:52:53 for example, there may be spdk software accelerators and vgpu accelerators. do they share the same quotas?
15:52:56 shaohe: I think so, but maybe I need to read more on quotas
15:53:04 xinran__ sundar their db existence is as deployables, just to be clear
15:53:49 Li_Liu: when we get to oslo.limit based on Keystone, as Xinran said, that will be based on resource classes, right?
15:53:56 for nova, at present, the granularity is cpu, mem...
15:54:11 sundar: why just one CUSTOM_ class? The whole idea behind CUSTOM_ resource classes is that the service can create what it needs.
15:54:33 maybe gpu is also one quota
15:54:36 sundar right, deployables are just db existences.
15:54:43 need to drop
15:55:14 edleafe: this is what the Nova folks proposed to us, right? :) Are we ok with CUSTOM_ACCELERATOR_FPGA, CUSTOM_ACCELERATOR_GPU, ...?
15:55:14 I think quota should depend on resource class
15:55:39 we should keep in mind, if we use one CUSTOM_ class, we must use nested resource providers.
15:55:49 sundar: I guess I missed that proposal. Was it to keep the Nova flavors simple?
15:56:10 or we cannot distinguish the different resources
15:56:55 edleafe: Multiple resource classes would actually be better. I didn't see a specific reason for a single RC. Maybe the discussion was centered around vGPU types, and one was enough
15:57:27 what's the difference in flavors if we use one CUSTOM_ class or multiple CUSTOM_ classes?
15:57:58 sundar: that sounds more correct. As shaohe_ notes, if I ask for CUSTOM_ACCELERATOR, I might get back an FPGA or a GPU. :)
15:58:36 edleafe: we have traits that distinguish FPGAs of different types, GPUs of different types, ...
15:58:55 So, the flavor would ask for the traits too, as noted in the spec
15:59:20 But, for quotas, it would be better to have distinct RCs based on device type (FPGA, GPU, HPTS, ....)
15:59:22 I also feel a little bit confused about why there is only one resource class, but anyway quota should accord with resource class, right?
16:00:04 Li_Liu any input?
16:00:13 so for one CUSTOM_ class, it needs nested RPs first, and the flavor should be: resources:CUSTOM_ACCELERATOR:1, traits: FPGA
16:00:44 for multiple CUSTOM_ classes, the flavor should be: resources:CUSTOM_FPGA:1
16:00:58 for multiple CUSTOM_ classes, the flavor should be: resources:CUSTOM_GPU:1
16:01:29 ^ edleafe: right?
16:01:32 shaohe: yes, you are right. We don't need nested RPs for this flavor definition.
16:01:54 It would be better, IMO, to have separate resource classes to distinguish the different devices (GPU vs. FPGA), and use traits to further refine the capabilities of a particular device
16:02:00 But that will introduce limitations on combining different device types
16:02:23 edleafe: Agreed. Let me amend the spec. Please review it!
16:02:29 multiple CUSTOM_ classes can work without nRPs.
16:02:48 sundar: ack
16:02:48 one CUSTOM_ class must work with nRPs
16:03:08 shaohe: multiple CUSTOM_ classes without nRPs will also have issues if you combine 2 different FPGAs on the same host
16:03:24 zhipeng as far as I'm concerned, using CUSTOM_ACCELERATOR might be a bit too general; if we decide to use it, additional information will be needed to schedule/allocate the resources
16:03:51 Li_Liu: like traits?
16:04:13 will the way shaohe_ just mentioned work?
16:04:28 right. coz essentially, we want to guide nova during their scheduling.
16:04:32 We will not create RCs like CUSTOM_ACCELERATOR_FPGA_INTEL_ARRIA10, but only CUSTOM_ACCELERATOR_FPGA, CUSTOM_ACCELERATOR_GPU, etc.
16:04:38 xinran__: yes, it needs traits on the sub resource provider.
16:04:59 shaohe_'s way should work
16:05:43 shaohe and all: I have been tracking nRP support in Nova, and apparently we are a week or two away from getting it. edleafe can confirm :) So, maybe we don't have to split hairs over what to do without nRP ;)
16:06:29 Even with CUSTOM_ACCELERATOR_FPGA etc., we still need traits
16:07:03 but then we do not need nRPs.
16:07:48 Sorry, late for my next meeting. I'll do my reviews offline and catch up here. :)
16:08:30 it does seem that nested RPs may be merged soon
16:09:04 edleafe: good. Then one CUSTOM_ class can work.
16:09:34 sounds very promising :)
16:09:38 shaohe_: sure, it *can* work, but it isn't really a good design
16:10:03 shaohe: multiple classes are good for quotas
16:10:36 Sorry Howard I have to leave physically but will leave this on monitoring. Best of luck sleeping :)
16:11:19 NokMikeR :) no problem
16:11:39 edleafe: I have no objection to one CUSTOM_ class after we support nRPs
16:11:58 okey so since this is a spec review day, Li_Liu sundar plz continue discussing your specs here and I will leave the meeting recorded throughout the day
16:12:43 Anyway I think quota should accord with RCs... if one RC, quota counts as one...
16:14:41 I am gonna grab some lunch, we can discuss this later today
16:14:52 I will keep my session open
16:15:08 @sundar and all
16:20:05 Sorry, guys. I am jumping between my meeting and here. I will keep it open too.
16:28:36 zhipengh[m]: you forgot to #endmeeting
16:29:47 That is on purpose :)
16:34:38 Li_Liu: I am getting customer feedback that some of them want to use function names in Rocky. The problem is that FPGA hardware and bitstreams may expose function IDs, not names.
16:35:39 So, one possibility is to let the operator apply a Glance property for the function name when he onboards a bitstream, and reference that in the flavor. For Cyborg, it is just another Glance property to query -- whether it is a function ID or name.
16:35:51 What do you think?
18:15:03 sundar, I added function_uuid based on your comments for the metadata spec. That uuid should be mapped to a specific function name
18:56:57 Li_Liu: yes. At least for Rocky, we can leave that mapping to the operator.
18:59:06 IOW, the traits we apply are still based only on UUIDs. No further complexity in Rocky
19:00:22 Can you add a function name as an optional property in your spec?
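[Editor's note: a sketch of the Glance-property idea above -- Cyborg querying an operator-applied function name from a bitstream image via python-glanceclient. The property key 'accel:function_name' and the credentials are illustrative assumptions, not part of any spec.]

    from glanceclient import Client
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Placeholder credentials; a real deployment reads these from config.
    auth = v3.Password(auth_url='http://controller:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    glance = Client('2', session=session.Session(auth=auth))

    image = glance.images.get('11111111-2222-3333-4444-555555555555')
    # v2 image objects are dict-like; the operator-applied property maps the
    # bitstream's function ID to a human-readable name.
    try:
        function_name = image['accel:function_name']
    except KeyError:
        function_name = None  # fall back to matching by function ID/UUID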
19:11:46 sundar sure will do
19:40:36 Li_Liu: Thanks!
02:29:00 Nguyen Van Trung proposed openstack/cyborg master: Add doc8 to pep8 check for cyborg project https://review.openstack.org/567457
23:36:38 Sundar Nadathur proposed openstack/cyborg master: Specification for Cyborg/Nova interaction for scheduling. https://review.openstack.org/554717
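[Editor's note: under the multiple-resource-class outcome discussed above, a flavor requesting one FPGA would look roughly like this sketch using python-novaclient. CUSTOM_ACCELERATOR_FPGA matches the classes named in the discussion; the trait CUSTOM_FPGA_INTEL_ARRIA10 and the flavor sizing are illustrative only.]

    from novaclient import client as nova_client

    # 'sess' is an authenticated keystoneauth1 session, built as in the
    # earlier glanceclient sketch.
    nova = nova_client.Client('2.60', session=sess)

    flavor = nova.flavors.create(name='fpga.small', ram=4096, vcpus=2, disk=20)
    # One custom resource class per device family, refined by traits:
    flavor.set_keys({
        'resources:CUSTOM_ACCELERATOR_FPGA': '1',
        'trait:CUSTOM_FPGA_INTEL_ARRIA10': 'required',
    })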