14:03:00 #startmeeting openstack-cyborg
14:03:02 Meeting started Wed Apr 4 14:03:00 2018 UTC and is due to finish in 60 minutes. The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:03:03 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:03:05 The meeting name has been set to 'openstack_cyborg'
14:03:12 #topic Roll Call
14:03:18 #info Howard
14:03:33 #info Sundar
14:03:43 Hi Howard!
14:03:45 #info Li_Liu
14:04:26 Hi Sundar :)
14:04:26 #info Melissa
14:05:24 Hi Melissa_S
14:05:38 Can you introduce yourself a little bit ?
14:06:04 Hi! I am new to this chat - my name is Melissa Sussmann. Sundar and I actually used to work together at Intel a while back. I am at Xilinx now and sitting in for Dutch :)
14:06:28 ah nice to meet you :)
14:06:38 Welcome Melissa!
14:06:50 Likewise :)
14:07:33 Hi Melissa, welcome! Good to have you here.
14:07:58 Thanks Sundar :)
14:08:51 #info kosamara
14:08:58 hi!
14:09:04 Hey !
14:09:11 #info Yumeng__
14:09:14 Hi
14:09:23 Hi Yumeng :)
14:09:42 Hi zhipeng
14:10:16 Hi everyone!
14:10:37 Hi chucksong
14:10:44 don't believe we met before
14:10:52 can you introduce yourself a little bit ?
14:11:12 yes, I am Chuck Song of Xilinx
14:11:27 Currently taking charge of the FaaS team
14:11:41 I suppose we are going to talk about the Cyborg project.
14:11:45 FPGA as a service ?
14:11:48 yes
14:11:50 Hi Chuck, welcome!
14:11:56 Hi Chuck
14:12:12 We have a lot of people joining :)
14:12:15 hi Sundar and LiLiu
14:12:41 let's get into business :)
14:12:50 #topic sub-team report
14:13:22 I think we could start with release subteam
14:13:49 #link https://review.openstack.org/558223
14:14:16 #info cyborg has been added for milestone based release
14:14:44 that was part of Rocky goal
14:15:05 Is Apr 19 the spec freeze day for Cyborg too?
14:16:08 yes Sundar the week of Apr 16 to Apr 20
14:16:23 will be our very first deadline for spec freeze :)
14:16:44 what do you need delivered on the Xilinx side before freeze?
14:17:45 Pardon me - I am new to the project :)
14:18:00 Melissa_S altho currently all the drivers are in-tree which means Xilinx driver also needs a spec
14:18:00 I think the major thing we need is Xilinx Device driver. But that doesn't have to be before spec freeze I think
14:18:17 but driver spec freeze should be expected for Rocky MS-2
14:18:28 which will be week of Jun 04-Jun 08
14:18:30 Does the vendor specific driver need a spec Howard?
14:18:44 Li_Liu at the moment yes
14:18:47 ok
14:19:10 but yes as Li_Liu said, plz don't wait for the deadline :)
14:19:12 Not to jump the gun :) but could I ask folks to review https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction so that I can update https://review.openstack.org/#/c/554717/ ?
14:19:38 This is based on the latest proposal from Nova community
14:19:51 Sundar will go into that very soon :)
14:19:56 sure, I will take a look
14:20:03 Let's finish the subteam report first
14:20:32 Melissa_S chucksong plz refer to https://releases.openstack.org/rocky/schedule.html#r-release
14:20:41 for the general milestone sched
14:21:10 Thanks, Li_Liu and Howard
14:21:45 Yes - thank you.
14:22:09 Cyborg's specific sched can be found at http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-14-14.07.html
14:22:28 okey moving on
14:22:50 Yumeng__ could you update the progress on your side ?
14:23:54 I sent out an update about the high time synchronisation card on the mail list
14:24:05 Pls check
14:24:29 And launchpad migration to storyboard will be done soon.
14:25:00 did you just send the clock driver proposal ?
14:25:05 or maybe I missed it ?
14:25:50 1 second
14:26:36 BTW in case not everyone is following the storyboard migration, plz refer to
14:26:39 #link https://review.openstack.org/#/c/558327/
14:26:52 https://docs.openstack.org/infra/manual/zuulv3.html
14:27:14 Pls go ahead, I will send out the right link
14:27:28 #link https://review.openstack.org/558821
14:28:02 #info storyboard migration for cyborg
14:28:47 thx Yumeng__ :) doc subteam has been very responsive and active
14:29:17 Melissa_S chucksong Dutch signed on as co-lead for the driver sub-team
14:29:35 would one of you maybe be interested in taking the role ?
14:29:52 help update and coordinate the driver development for Rocky ?
14:30:31 Chuck and I can work on this
14:30:57 great :)
14:31:11 I think you could put Chuck's name down - he is on the apps side
14:31:24 duly noted :)
14:31:30 we will find the correct internal team for this project
14:31:35 #topic critical rocky spec review
14:32:02 first up, Sundar's Nova-Cyborg interaction spec
14:32:22 zhipeng, yes I will take the role
14:32:25 Sundar could you help get everyone up to speed ?
14:33:26 Sure, Howard
14:33:27 #info Sundar cyborg-nova spec discussion
14:33:30 #link https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction
14:33:46 #link https://review.openstack.org/#/c/554717/
14:34:01 We had arrived at a way of representing FPGA components in Nova, and workflows for all use cases based on that representation, back in the PTG
14:34:26 The spec that Howard provided above codifies that
14:35:01 However, there was a race condition for one use case, which has been debated in the community. An analogy was found with a vGPU issue, and a joint solution was proposed
14:35:27 http://lists.openstack.org/pipermail/openstack-dev/2018-March/128888.html
14:35:41 #link http://lists.openstack.org/pipermail/openstack-dev/2018-March/128888.html
14:36:03 Based on that, I worked out the details, in a way that should cover GPUs etc. but probably focuses more on FPGAs.
14:36:16 #link https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction
14:36:38 I think this is a better solution than the PTG flow :) both for Nova and Cyborg
14:37:34 For Cyborg, this means that we don't need to keep accelerator usage details in Cyborg DB -- just use placement info :)
14:38:16 Eventually, if/when Nova supports preferred traits, maybe we don't need the weigher -- but we can evaluate that when it becomes real
14:38:54 For the new folks in Cyborg, this may sound cryptic :) I am open to walking folks through the proposal in a call if needed
14:39:27 Howard, do you want me to explain the crux of the proposal here?
14:39:28 Sundar, regardless, we still need to track individual assignment for each device in Cyborg, right?
14:40:04 Sundar it would be great to have a brief walk through and the central take aways :)
14:40:07 Li_Liu: Yes, we will be handling accelerator-PF/VF matching
14:40:46 That is not involved in the scheduling, because Placement counts accelerator resources
14:41:16 As instances are spawned and terminated, Cyborg agent can track PF/VF usage
14:41:50 Howard, sure.
14:41:56 The brief summary is this:
14:42:13 Sundar - I am open to scheduling a call and walkthrough of your proposal.
14:42:32 We want to expose FPGAs in 2 ways: FPGA as a Service (FPGA aaS), and Accelerated Function as a Service (AFaaS)
14:43:20 For FPGA aaS, user/flavor asks for a device type (or region type) and optionally a bitstream ID that needs to be applied by Cyborg before the instance comes up.
14:43:44 This is similar to Amazon flow.
14:44:19 For AFaaS, the user/flavor asks for a function/algorithm e.g.
'ipsec' + some indication of what device family the VM has the drivers for
14:44:42 To cover both cases, we say that an accelerator can be a device/region or a function
14:45:09 We represent a generic accelerator with the custom resource class (RC) CUSTOM_ACCELERATOR
14:45:34 We also represent FPGAs and their inner regions as nested Resource Providers (RPs)
14:45:57 So, a region RP can provide N instances of a CUSTOM_ACCELERATOR class. Most commonly, N=1
14:46:48 Also, each region RP has traits: region type (e.g. CUSTOM_FPGA__REGION_...)
14:47:08 possibly function type (e.g. CUSTOM_FPGA_)
14:47:37 and device family (e.g. CUSTOM_FPGA_XILINX_...)
14:47:52 Sundar, for AFaaS, on top of asking for a function/algorithm, I think users can also specify the minimum kpi/capability for the requested resources
14:48:05 you keep going
14:48:15 I just throw some of my thoughts here
14:48:22 Li_Liu, NP, I'll get back to that
14:48:51 With this background, here's how a flavor can ask for FPGA aaS:
14:49:24 resource:CUSTOM_ACCELERATOR=1; trait:CUSTOM_FPGA__REGION_=required
14:49:48 optionally, one more extra spec: bitstream:3A56D4=required
14:50:29 This gets Placement to choose all matching devices based on trait. Once a node is selected, the Cyborg agent in the node notes the extra spec and applies the bitstream
14:50:44 Before I go to AFaaS, does this make sense for FPGA aaS?
14:50:57 it does for me, this is awesome Sundar
14:51:19 yup, it looks great
14:51:34 Cool :) Now for AFaaS flavor
14:51:43 It sounds good Sundar. I need to know more about this project though. I just joined this project and still have a lot to learn
14:52:22 Yes agreed.
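The FPGA aaS flavor request described at 14:49:24 to 14:50:29 can be sketched as a small helper that builds the extra specs. This is only an illustration under stated assumptions: the function name and the trait `CUSTOM_FPGA_REGION_EXAMPLE` are hypothetical (the trait names in the log are elided), and the `bitstream:` key is the proposal from this meeting, not an existing Nova feature. The log writes `resource:`, while Nova's documented flavor key prefix is `resources:`, which is what the sketch uses.

```python
def fpga_aas_extra_specs(region_trait, bitstream_id=None, count=1):
    """Build flavor extra specs for an FPGA aaS request (illustrative only).

    region_trait: a custom trait naming the region type (hypothetical value).
    bitstream_id: optional bitstream that Cyborg would apply before boot,
                  following the 'bitstream:<id>=required' form proposed above.
    """
    specs = {
        # Ask Placement for N units of the generic accelerator resource class.
        "resources:CUSTOM_ACCELERATOR": str(count),
        # Restrict candidates to providers that carry the region-type trait.
        f"trait:{region_trait}": "required",
    }
    if bitstream_id is not None:
        # Not interpreted by Placement; in the proposal the Cyborg agent
        # on the selected node reads this and programs the region.
        specs[f"bitstream:{bitstream_id}"] = "required"
    return specs


specs = fpga_aas_extra_specs("CUSTOM_FPGA_REGION_EXAMPLE", bitstream_id="3A56D4")
for key, value in specs.items():
    print(f"{key}={value}")
```

Keeping the bitstream as a separate extra spec mirrors the point made in the discussion: scheduling depends only on the resource class and trait, and programming happens after a node is chosen.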
Sundar, can you share some more information on this proposal for FPGA aaS?
14:52:42 Dutch let's schedule another dedicated call for it
14:52:45 :)
14:52:57 resource:CUSTOM_ACCELERATOR=1; trait:CUSTOM_FPGA_INTEL_=required; trait:CUSTOM_FPGA_INTEL_=required
14:52:59 video conf would be better for a deep dive walk through
14:53:18 Yes, we can do more in a video call
14:53:18 Agreed.
14:53:39 Sundar plz carry on the AFaaS scenario
14:54:10 Some folks don't like the UUIDs. I think UUIDs make it very concrete and we can make it more user-friendly later. Also, for AFaaS, we can avoid region UUIDs and use just product name
14:54:37 OK, back to AFaaS :)
14:54:37 string name could do the trick ?
14:54:50 sorry go ahead
14:56:50 Zhipeng: We can discuss strings in more detail, maybe in an email?
14:57:19 Sundar absolutely
14:57:38 plz carry on, sorry for the interruption :)
14:57:39 OK. AFaaS flavor: resource:CUSTOM_ACCELERATOR=1; trait:CUSTOM_FPGA_INTEL_=required; trait:CUSTOM_FPGA_INTEL_=required
14:58:37 This picks all devices that have the required function. Makes sense, except that if no free instance of that function is available, the request will just fail, rather than have Cyborg pick an available region and reprogram it
14:59:18 That is ok if that's what the operator wants: he may want to prevent reprogramming for whatever reason.
14:59:37 But, in the general case, we want the ability to reprogram if needed.
15:00:04 So, what we want is: trait:CUSTOM_FPGA_INTEL_=preferred (not required)
15:00:34 But Nova does not support preferred traits today.
I am told that it is not even close
15:01:01 So, the next best thing: resource:CUSTOM_ACCELERATOR=1; trait:CUSTOM_FPGA_INTEL_=required
15:01:20 another extra spec: function:CUSTOM_FPGA_INTEL_=required
15:01:45 Now, Placement chooses all devices that match the product, whether or not they have the function
15:02:21 A Cyborg weigher can check the allocation candidates to see which ones have the function -- based on the function trait -- and rank them higher
15:03:13 So, Nova is likely to pick a device that has the function. If not, the Cyborg agent in the compute node will note that the requested function is not present in the selected device/region RP
15:03:35 It will contact Glance to get a matching bitstream and program it, before the instance comes up
15:03:46 where would the weigher be implemented ?
15:04:06 This is the core of the proposal, and covers situations where each bitstream implements only one function
15:04:48 Zhipeng: the weigher would initially be a Cyborg weigher in my understanding -- but it is not Cyborg-specific. It could possibly become Nova in-tree hopefully
15:05:25 The weigher is looking at a list of RPs and choosing those with a specific trait. It is generic
15:05:37 i mean the weigher will be part of the agent ?
15:06:10 It is a weigher in the Nova controller -- like all other filters/weighers in Nova framework
15:06:21 The weigher has to be in Cyborg Controller I think
15:07:06 yes, it should. And it seems nova agrees on the weigher
15:07:18 Li_Liu: OK, what I meant is, it runs in the controller along with Nova/Cyborg. The operator must update nova.conf to use this weigher
15:07:36 Right :)
15:07:47 Okey
15:08:15 yes.
just a config option in nova
15:08:45 Sundar I think it is definitely fine to update the spec patch based upon the current discussion conclusion
15:09:04 and let's schedule another video conf for a detailed discussion with Xilinx team
15:09:08 The part of this proposal which needs further review from Nova is when one bitstream has multiple functions, say crypto and compression. I think that is for the future and may not be needed in rocky. Does that sound agreeable?
15:09:15 to see if there are any further improvements
15:09:37 but the weigher may not be very high priority. It helps speed up the creation of a VM, but it does not help VM performance after the VM starts
15:09:39 Thanks, Howard!
15:09:40 Sundar yes that would be maybe next release :)
15:10:28 Just an FYI - it's not likely that nested RPs will be complete in Rocky
15:10:39 shaohe_feng_ agree, let's implement the custom rc and traits first
15:10:54 edleafe it is possible that we try with the nrp first right ?
15:11:32 zhipeng: sure, but it looks like the earliest nrp will be available will be in Stein
15:11:46 edleafe: There is a release note in Queens for nRPs, right? https://github.com/openstack/nova/blob/adc4d4a29d108c87f884c779af5696e4941b9549/releasenotes/notes/placement-rest-api-nested-resource-providers-552a923a96d7adca.yaml
15:12:42 Sundar: that is the very beginnings of the structural changes needed for nrp
15:13:10 edleafe: so cyborg can only support an fpga resource class in the node provider in this release?
15:13:19 the full model we need is still far away
15:13:44 edleafe: The backup option would be to apply the RCs and traits to the compute node RP. But, when there are multiple devices in the same node, that can result in issues.
15:13:53 shaohe_feng_: I'm not sure how that would work if you have multiple devices per node
15:14:25 edleafe: We crossed.
:)
15:14:29 I'm on a call right now - I just wanted to set your expectations
15:14:38 edleafe thx :)
15:14:39 edleafe: IMHO, we can support multiple devices later.
15:14:56 yes let's be flexible
15:15:27 #action Sundar update the spec according to the ml discussion conclusion
15:15:37 We can support multiple devices with some restrictions, which may satisfy immediate needs and still give freedom to operators
15:15:52 I would especially like to thank Sundar for his initiative on the mailing list
15:16:06 Sundar: any code plan for the spec?
15:16:08 and also the spec discussion with Nova team
15:16:14 Or can we put a simple version of multiple device support in Cyborg for now?
15:16:21 shaohe, yes.
15:16:25 shaohe_feng_ your PoC code could be used right ?
15:16:27 Thanks, Howard
15:16:54 zhipeng: yes. I think so.
15:16:59 The POC code does not publish RCs and traits
15:17:04 But we can build on that
15:17:17 yes that's what i meant
15:17:20 Sundar: it publishes
15:17:40 Also, the notion of using PFs and VFs as resources is something the Nova/PTG folks didn't want ;)
15:18:04 Shaohe: sorry, to clarify, it publishes PCI functions as RCs right?
15:18:12 let's take the details offline :)
15:18:20 ok :)
15:18:22 Sundar: the poc is similar to the nova team's conclusion.
15:18:27 next up, Li Liu's metadata spec
15:18:38 #info metadata standardization spec
15:18:50 #link https://review.openstack.org/558265
15:19:03 and we know nrp is not ready, so simplify it.
15:19:04 folks plz review it
15:19:13 anything you want to add, Li_Liu ?
15:19:33 zhipeng, sorry to interrupt, what's the expected freeze date for the Xilinx driver spec? April 19 or Jun 4?
15:19:51 chucksong Jun 4
15:19:55 zhipeng, I made some modifications based on shaohe's comments a couple of days ago. Waiting for more suggestions
15:20:08 Li_Liu okey :)
15:20:09 good, thanks!
15:20:21 okey moving on
15:20:27 My next spec for programmability is on the way.
within the week I think
15:20:44 Li_Liu you are the rock star man
15:20:50 Li_Liu: I had some high level comments on the spec. We can discuss them in more detail when you want
15:20:58 I will cry myself to sleep tonight :P
15:21:17 #info cyborg-spec setup
15:21:28 sure sundar, wechat/skype/phone/email whatever you want man
15:21:32 #link https://review.openstack.org/554766
15:21:45 Yumeng__ has done great work setting up the cyborg-spec repo
15:22:08 I think the current patch looks good, so if any core could plz give a +2, I will land it this week
15:23:00 okey folks we still have planned specs missing for rocky
15:23:16 #info quota and os-acc spec still missing
15:23:51 I will talk to individual owners to see how to push forward
15:23:59 deadline is less than three weeks away :)
15:24:24 #action Howard to track the missing quota and os-acc spec
15:24:39 #topic open patches that need attention
15:24:57 #info shaohe_feng's devstack fix
15:25:10 #link https://review.openstack.org/557742
15:25:25 sorry to interrupt. Here is the more detailed clock-driver use case description link I sent in the mail list: https://etherpad.openstack.org/p/clock-driver
15:25:30 great work from shaohe, I will have zhuli help merge this by the end of the week
15:25:32 zhipeng: this should land first. :)
15:25:34 Yumeng thx !
15:25:35 Driver proposal would be provided next.
15:25:58 Yumeng I think you could directly put it up as a driver spec :P
15:26:24 shaohe_feng_: It can help to avoid other debugs coming.
15:26:29 zhipeng: okey. that would be great.
15:26:39 Li Liu proposed openstack/cyborg master: Implemented the Objects and APIs for vf/pf https://review.openstack.org/552734
15:26:46 shaohe_feng_ I will nag zhuli :P
15:26:58 next, which Li_Liu just updated
15:27:17 #info object and apis for vf/pf
15:27:28 lol..
I just addressed some comments from Shaohe yesterday
15:27:30 #link https://review.openstack.org/552734
15:27:42 :)
15:28:18 Okey I think that is all the important stuff we need to discuss today
15:28:20 Sundar, I heard you say Nova folks do not like vf/pf representations
15:29:05 good
15:29:28 Li_Liu: it is an internal implementation, why don't they like it?
15:29:30 but I think vf/pfs are more friendly to vendor drivers. And it can be used interchangeably with Deployables
15:29:45 Li_Liu: yes.
15:30:12 Li_Liu: yes, this is PTG feedback: e.g. https://etherpad.openstack.org/p/cyborg-ptg-rocky-nova-cyborg-interaction Line 21
15:30:24 Li_Liu: but your vf/pf representations are not exposed
15:30:26 shaohe_feng_, I know, just wanna point it out, in case nova folks have questions on them. :)
15:30:44 Our implementation still uses PFs/VFs of course
15:31:04 We can correlate regions/functions with PFs/VFs
15:31:17 great
15:31:48 #topic Vancouver Forum
15:31:52 almost forgot this
15:32:05 plz feel free to propose forum topics for Vancouver Summit
15:32:21 I might not be able to attend that Summit due to several reasons
15:32:37 what is the vancouver summit?
15:32:37 Li Liu will be my double there :P
15:32:50 OpenStack Vancouver Summit
15:33:03 Chuck went to bed and I am here to keep track of actions :)
15:33:40 he is in China right now
15:34:47 wow
15:34:59 would love to chat if he travels to shenzhen
15:35:05 or beijing next week
15:35:29 oh ok - I will let him know
15:36:26 what email do I use for you?
15:36:39 zhipengh512@gmail.com
15:37:34 my cell is +86-18576658966
15:40:10 okay - I sent you his cell phone number and our info via email
15:40:15 zhipeng: if https://review.openstack.org/#/c/555722/ is not merged, I cannot create the client repo?
15:41:56 Zhipeng, shall we propose a talk on Cyborg architecture?
15:42:12 yes shaohe_feng_
15:42:43 zhipeng: it is ready. wait for merge.
15:42:54 Sundar definitely a good idea :)
15:43:05 I have tried to push the repo, but failed
15:43:19 Cool ;)
15:48:02 okey if there are no other topics
15:48:10 let's conclude the meeting today :)
15:48:51 sounds good - any additional actions for us?
15:50:22 nuh just keep in touch would be great :)
15:54:12 oh one more thing, kosamara will you have the bandwidth to work on the GPU spec ?
15:57:31 let me just close the meeting first in case I sleep over it again :P
15:57:33 #endmeeting
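For readers following up on the AFaaS weigher described at 15:01:45 to 15:03:35, the ranking idea can be sketched in plain Python. This is a rough illustration under stated assumptions, not the Nova weigher interface or Cyborg code: the candidate structure, the function names, and the traits `CUSTOM_FPGA_INTEL_PRODUCT_X` and `CUSTOM_FPGA_INTEL_FUNCTION_IPSEC` are all hypothetical, since the trait names in the log are elided.

```python
def weigh(candidate_traits, function_trait):
    """Score a candidate region RP: 1.0 if it already has the function."""
    return 1.0 if function_trait in candidate_traits else 0.0


def rank_candidates(candidates, function_trait):
    """Order candidates so those already programmed with the function come first.

    Every candidate is assumed to have passed the Placement query on the
    product trait already; the weigher only re-orders, it never filters.
    """
    return sorted(
        candidates,
        key=lambda c: weigh(c["traits"], function_trait),
        reverse=True,  # sorted() stays stable, so ties keep their input order
    )


def needs_reprogramming(selected, function_trait):
    """True if the agent would have to fetch a bitstream and program the region."""
    return function_trait not in selected["traits"]


# Hypothetical allocation candidates (region resource providers and traits).
PRODUCT = "CUSTOM_FPGA_INTEL_PRODUCT_X"
FUNCTION = "CUSTOM_FPGA_INTEL_FUNCTION_IPSEC"
candidates = [
    {"rp": "region-a", "traits": {PRODUCT}},
    {"rp": "region-b", "traits": {PRODUCT, FUNCTION}},
]

ranked = rank_candidates(candidates, FUNCTION)
print([c["rp"] for c in ranked])
print(needs_reprogramming(ranked[0], FUNCTION))
```

Because the weigher only changes the order, a request still succeeds when no candidate has the function; in that case the selected region is the one that would be reprogrammed, which is the behaviour the `preferred` trait would otherwise provide.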