15:00:56 #startmeeting openstack-cyborg
15:00:56 Meeting started Wed Jun 7 15:00:56 2017 UTC and is due to finish in 60 minutes. The chair is zhipeng_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:57 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:00 The meeting name has been set to 'openstack_cyborg'
15:01:05 hahaha
15:01:10 let's hope so
15:01:18 okey so quick update from my side
15:01:22 on the api/db patch
15:01:40 #topic BP discussion
15:01:43 #link https://review.openstack.org/#/c/445814/
15:02:07 so ChrisD reviewed with the comments that there is an ongoing discussion on the traits
15:02:16 we might consider aligning our design with it
15:02:43 originally, the placement resource provider was meant for just the compute node
15:02:53 I was looking over that, care to summarize?
15:03:16 sure I'm putting my thoughts together now
15:03:27 so now the placement team sees the pitfall in that
15:03:48 since for example for shared storage (external arrays I would suppose)
15:04:03 if you only count the storage side of things on the compute node
15:04:22 your resource provider will never correctly reflect the required traits
15:04:43 so this is an issue with accelerators that may be shared between many computes?
15:04:48 the resource provider should reflect the shared storage arrays, rather than only local disks
15:05:02 no, I think this is an issue for accelerators as a whole
15:05:14 how so?
15:05:23 since if the resource provider only identifies with the compute node
15:05:56 we could wind up with the same problem as we have now, since accelerator characteristics are bundled with the compute characteristics
15:06:13 well we could have our own resource class for sure, but that does not solve the problem
15:06:36 nova scheduler asks the placement api to provide all the necessary resources
15:07:00 and for Cyborg, one of the important goals is that accelerators be treated as first-class citizens
15:07:51 meaning that we should have individual resource providers for accelerators
15:08:20 from the email link Chris provided, there is an etherpad documenting the "Plan B"
15:09:02 ok so the issue is that if we have a 'gpu' resource provider it's dependent on computes in a way that resource providers aren't supposed to be.
15:09:02 which I liked very much, it works on extending the current nested resource provider definition to a more relaxed, multiple resource providers one
15:09:11 yes exactly
15:09:42 the scheduling decision would still largely depend on the regular compute features, since we are just part of the traits
15:09:59 interesting
15:10:12 so back to the "Plan B", the current nested resource provider model is designed primarily for stuff like NUMA nodes
15:10:22 where you got this parent-child relationship
15:10:28 So, how does that change our implementation?
15:10:42 the Plan B extends the scope to be more general, meaning for Cyborg use cases
15:11:05 we could have multiple resource providers, for each and every accelerator
15:11:14 (if they are deemed important for the workload)
15:11:23 crushil the change is that
15:11:46 our DB design has to align with the proposed nested resource provider/trait design
15:12:02 at least DB schemas
15:12:18 so that when the cyborg agent populates our inventory to the placement api
15:12:25 it could understand it correctly
15:13:59 Ok, what about the other specs?
15:14:13 not concerned that much :)
15:14:18 gotcha
15:14:58 So I'm thinking we might need two DB schemas
15:15:15 the current one in the spec patch could be used for the discovery phase
15:15:37 that is when the user starts the cyborg service and then the agent/driver does the discovery/pre-config
15:15:53 collect what we have, on the host
15:16:16 the second set of schemas needs to be aligned with the nested resource provider
15:16:33 to interact with placement api and eventually nova-scheduler
15:16:53 for the VM to select the correct accelerator resource
15:17:19 so we need to maintain two parallel db's for each purpose or do you mean we want to change the format in a future release?
15:18:20 what I'm thinking is that we don't have exhaustive knowledge on the hardware now
15:18:58 therefore we keep a separate DB schema, the host side one should be more extendable or more abstract
15:19:17 But on another thought
15:19:23 it might be just too complex .....
15:19:26 what do you guys think
15:19:49 I think we should try and keep one db as much as possible, I don't want to try and maintain parallel sets of data
15:19:58 that makes sense
15:20:24 I agree, having multiple DBs is just clunky
15:21:04 in that case we will just use the resource provider schema, I will follow up with Chris to see which one I should use
15:21:11 the current one or the proposed one
15:22:19 sounds good.
15:22:56 Anything else on that subject?
15:22:59 nope
15:23:13 anything else from you guys on the open spec ?
15:23:33 nope
15:23:51 great
15:23:58 #topic initial code development
15:24:05 so, any roadblocks
15:24:39 been trying to understand oslo rpc and message passing and start structuring the conductor/agent
15:24:52 sounds like a great start :)
15:25:02 I have created stubs and I will push them up by the end of the week
15:25:11 great !
15:25:18 crushil, sounds good.
15:26:03 let's do small pieces like Justin suggested
15:26:09 I will fill them out rebased on top of the API and agent patches
15:26:10 so a lot of what we will be doing involves rpc between different components, so people with integrating parts need to talk to each other about interfaces
15:26:19 I don't think we should be too worried about a stable internal interface
15:26:29 yes I agree
15:27:04 oslo.messaging could provide everything we need
15:27:46 well sometimes we need rpc, for example the driver should be called by the agent over rpc I'm thinking (we could invoke directly but I'm not sure if I want to do that)
15:29:38 i think it should be done over rpc
15:29:57 unless, we gave driver restful apis ?
15:30:31 I don't think that's the right application here. Our internal code needs to be more tightly integrated than restfulness allows.
15:30:45 yep
15:31:07 so rpc should be fine here
15:31:27 i think at the moment, it is agent talking to the generic driver
15:31:50 later on, we should design something like the neutron ml2 driver interface
15:32:29 that every driver, vendor or not, implements the interface which rpc calls will go through
15:32:34 in a rather standard way
15:33:29 Ok. So, are we going to follow the neutron model vs the nova/cinder model?
15:33:47 i think more like the neutron model
15:33:55 for out-of-tree drivers
15:34:03 But isn't that too complicated
15:34:09 cinder and nova are mostly in-tree maintained drivers
15:34:20 it won't be too complicated for us i think
15:34:40 neutron is complicated because they have to define the type drivers and mechanism drivers
15:34:42 Well, cinder has out of tree drivers based on whether you have CI or not
15:35:04 I think in-tree drivers also require the CI
15:35:14 otherwise the cinder team removes your driver
15:35:38 No, they just make it unsupported i.e. move it out of tree
15:35:54 for us, as long as it is PCIe communicated devices, the driver interface won't be too complicated
15:36:12 but if we need to support extra protocols, that is where things will get wild
15:36:24 rushil ah okey
15:36:25 Ok. I just want to make sure we don't make things more complicated than they should be
15:36:37 yes that is always our goal
15:36:38 I can agree on a standard rpc interface but that's less complicated than I think you are making it out to be.
15:36:49 we even wanted to skip the conductor :P
15:37:14 and I nearly got away with it too!
15:37:21 jkilpatr haha
15:37:41 Lol
15:39:26 rushil the cyborg ml2 driver would be modeled from your generic driver implementation :P
15:40:37 I wouldn't call it ml2 driver though
15:40:52 of course we will have another name for it
15:41:14 aluminum drivers :P
15:41:24 for cyborg robots
15:41:45 Hehe
15:42:36 Anyways I'll try to have a stub up this week (conductor) and then agent next week.
15:42:46 depends on how other tasks go for me.
15:43:44 jkilpatr: Cool
15:43:50 sounds great, i got another colleague working on cyborg this week, so api code will be developed in parallel
15:44:09 Awesome
15:44:10 hopefully when we have settled the spec, the initial code will come out
15:44:21 and we could iterate over it
15:44:48 #topic AoB
15:44:52 any other topics
15:44:59 Btw our group at Lenovo sent out initial emails to vendors to get their drivers aligned with cyborg
15:45:20 wow
15:45:24 that is awesome
15:45:45 I'll keep you guys posted on that
15:45:51 could you disclose the vendor names for now ?
15:45:57 or should we wait until later
15:46:14 The usual suspects
15:46:35 e.g ?
15:46:47 Nvidia, AMD
15:47:00 And smaller ones like Micron
15:47:29 cool !
15:47:30 I'll let y'all know when they are committed to contributing code
15:47:40 great :)
15:50:41 okey if there are no other topics, we go to the usual long slumber ~~
15:50:56 will try to remember to close the meeting an hour later
15:51:05 Cool, thanks zhipeng_
17:00:56 #endmeeting
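
The driver interface idea raised between 15:31:50 and 15:32:34 (every driver, vendor or not, implements one interface that the agent's rpc calls go through) might look roughly like the sketch below. This is only an illustration under assumptions: the names `AcceleratorDriver`, `GenericDriver`, `AgentEndpoint`, `discover` and `attach` are hypothetical and do not come from the meeting or from any Cyborg code, and the in-process `call` method merely stands in for whatever rpc transport (for example oslo.messaging) is eventually used.

```python
import abc


class AcceleratorDriver(abc.ABC):
    """Hypothetical interface that every driver, vendor or generic, implements."""

    @abc.abstractmethod
    def discover(self):
        """Return a list of dicts describing accelerators found on the host."""

    @abc.abstractmethod
    def attach(self, accelerator_id, instance_id):
        """Attach an accelerator to an instance; return True on success."""


class GenericDriver(AcceleratorDriver):
    """Toy generic driver returning canned data (assumption, for illustration)."""

    def __init__(self):
        self._attached = {}

    def discover(self):
        return [{"id": "acc-0", "type": "generic", "bus": "pcie"}]

    def attach(self, accelerator_id, instance_id):
        self._attached[accelerator_id] = instance_id
        return True


class AgentEndpoint:
    """Stand-in for the agent side: routes a method name to a registered driver.

    In a real deployment an rpc layer would deliver the call; here it is a
    plain in-process dispatch so the sketch stays self-contained.
    """

    def __init__(self):
        self._drivers = {}

    def register(self, name, driver):
        # Only objects implementing the common interface are accepted.
        if not isinstance(driver, AcceleratorDriver):
            raise TypeError("driver must implement AcceleratorDriver")
        self._drivers[name] = driver

    def call(self, driver_name, method, **kwargs):
        # Only methods declared on the interface may be invoked remotely.
        if method not in AcceleratorDriver.__abstractmethods__:
            raise ValueError("unknown driver method: %s" % method)
        return getattr(self._drivers[driver_name], method)(**kwargs)
```

A caller would register a driver once, for example `agent.register("generic", GenericDriver())`, and then invoke `agent.call("generic", "discover")`. Restricting dispatch to the methods declared on the abstract class is one way to keep the interface standard across in-tree and out-of-tree drivers.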