11:00:27 <oneswig> #startmeeting scientific-sig
11:00:28 <openstack> Meeting started Wed Mar 28 11:00:27 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:31 <openstack> The meeting name has been set to 'scientific_sig'
11:00:38 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_March_28th_2018
11:00:49 <oneswig> Good morning good afternoon good evening
11:01:06 <zhipeng> good evening :)
11:01:16 <oneswig> Hey zhipeng, thanks for joining today
11:01:30 <zhipeng> glad to be here
11:01:43 <priteau> Hello everyone
11:01:44 <verdurin> Hello - I can only join for a while because I only remembered about the timing change this morning...
11:01:47 <daveholland> hi
11:02:03 <oneswig> Then let's get started and talk fast :-)
11:02:08 <martial__> good day :)
11:02:10 <oneswig> #topic Cyborg
11:02:18 <oneswig> Hi martial__, good morning
11:02:23 <oneswig> #chair martial__
11:02:24 <openstack> Current chairs: martial__ oneswig
11:02:44 <oneswig> #link Zhipeng's presentation https://docs.google.com/presentation/d/1tERW4CVhyxNdX50AOPZRa44iPEhico8O_vQ0Ou75L80/edit?usp=sharing
11:03:23 <oneswig> zhipeng: Thanks for sharing the presentation with us.  I have plenty of questions and I am sure others do too.
11:03:34 <zhipeng> no problem !
11:03:51 <oneswig> Can you start by describing what is missing in OpenStack's support for (say) GPUs?
11:03:54 <zhipeng> have to apologize for not having a more up-to-date one
11:04:49 <zhipeng> ok, so we have discussed very early on with Scientific SIG
11:05:07 <zhipeng> and got valuable input as well
11:05:38 <b1airo> I'm still awake it seems
11:05:47 <zhipeng> the initial feedback we got regarding GPU, is that it is difficult to fully balance between GPU and CPU resource
11:05:54 <oneswig> #chair b1airo
11:05:55 <openstack> Current chairs: b1airo martial__ oneswig
11:06:02 <oneswig> Evening Blair
11:06:07 <zhipeng> hey Blair
11:06:10 <b1airo> O/
11:06:14 <martial__> welcome b1airo :)
11:06:44 <zhipeng> so for example, users typically have to use host aggregates for the GPU resources in order to fully utilize them
11:06:53 <oneswig> zhipeng: balance, as in manage the scheduling of workloads that require GPUs without under-utilising the GPU hardware?
11:07:05 <zhipeng> oneswig exactly
11:07:06 <zz9pzza> o/
11:07:39 <b1airo> Sounds like that is more a user problem?
11:07:43 <zhipeng> or users have a mixed CPU-GPU setup but workloads that need GPUs are not scheduled onto GPU nodes as planned
11:08:00 <zhipeng> that was the input back in Boston Summit :)
11:08:07 <zhipeng> from Jim Golden I believe :)
11:08:46 <oneswig> I think this is a good example that people can relate to.
11:08:49 <zhipeng> well I think if we look at the latest release, with the Placement and all, this issue is not that severe
11:09:03 <b1airo> I'm not sure if I could be talking at cross purposes here as I came in late...
11:09:11 <oneswig> How does Cyborg help?
11:09:38 <zhipeng> so Cyborg is being developed as a general mgmt framework dedicated to acceleration resources
11:09:48 <zhipeng> like FPGA, GPU, NVMe SSD,...
11:10:21 <zhipeng> will help treat these types of resources as first-class citizens when Nova schedules
11:10:31 <zhipeng> so we have a more recent use case
11:10:37 <zhipeng> from another perspective
11:10:42 <b1airo> But this sounds like an inherent problem with direct attached devices and passthrough - balancing usage of must-have system resources like CPU & RAM versus accelerator resources has to be done at hardware purchase time - i.e. cannot be flexible or optimal for all use-cases/workloads
11:11:49 <zhipeng> b1airo well I think getting a reasonable utilization rate is achievable
11:12:20 <zhipeng> ok,back to the new use case
11:12:26 <zhipeng> provided by kosamara from CERN
11:12:50 <zhipeng> is that for security reasons, users might want to clean up the GPU after usage
11:13:28 <zhipeng> this could also be something Cyborg could help with , via NVIDIA driver
11:13:35 <oneswig> zhipeng: in a virtualised context?  Cleaning after pass-through?
11:14:23 <zhipeng> nuh it is for hpc
11:14:28 <zhipeng> no virtualization
11:14:30 <kosamara> Yes, cleaning the GPU memory after passthrough.
11:15:02 <zhipeng> kosamara tho not virtualized GPU right ?
11:15:11 <kosamara> No, only in passthrough config
11:15:19 <b1airo> That's an interesting one - what is the attack / information leak vector there kosamara ?
11:15:50 <kosamara> Events: user 1 uses a gpu, relinquishes it, user 2 claims it
11:16:08 <belmoreira> I think firmware is more important than memory
11:16:18 <kosamara> Then, user 2 can access user 1's data, which is not erased automatically
11:16:21 <b1airo> And how will you clean? Load a custom CUDA kernel that zeros all memory (without using unified memory)?
11:17:13 <kosamara> That's what I think. The problem is that the host can't do that, because it doesn't have the nvidia driver kernel modules loaded.
11:17:32 <kosamara> To allow gpu passthrough, the device must be claimed by vfio.
11:17:38 <b1airo> belmoreira: please let me know if you get a different response from elsewhere inside NVIDIA than I did
11:18:02 <belmoreira> b1airo sure
11:18:38 <martial__> (this reminds me there were quite a few announcements during GTC yesterday)
11:18:48 <zhipeng> well on a sidenote, we are really looking forward to work with NVIDIA team to have a driver ready for Rocky :)
11:18:57 <b1airo> kosamara: would this require a rowhammer alike attack?
11:19:03 <zz9pzza> You could have an image that is run between jobs whose task is to do the clean up
11:19:20 <b1airo> zhipeng: you can hope! :-)
11:19:25 <oneswig> kosamara: Are you simply able to write to the PCI memory regions of the GPU without a driver loaded (or is that naive)?
11:19:32 <zhipeng> b1airo lol
11:20:26 <kosamara> blairo no. If user 1 in the above example leaves his data on memory, then user 2 can simply read the entire gpu memory and find them.
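The scrub step under discussion amounts to overwriting every byte of device memory before the GPU is handed to the next user. A minimal host-side sketch in Python, where a bytearray stands in for GPU memory (in practice this would be a CUDA kernel run from a cleaning VM, as kosamara describes; names here are purely illustrative):

```python
def scrub(device_memory: bytearray) -> None:
    """Overwrite the whole allocation with zeros before reassignment."""
    device_memory[:] = bytes(len(device_memory))

# user 1 leaves data behind in "GPU memory"
mem = bytearray(b"user1-secret-weights" + bytes(44))
assert b"secret" in mem

# cleaning step between users
scrub(mem)

# user 2 now reads only zeros
assert mem == bytearray(len(mem))
```

The hard part raised in the discussion is not the zeroing itself but where to run it: the host has no NVIDIA driver loaded (the device is claimed by vfio), so the scrub has to happen inside a trusted cleaning VM or via direct PCI access.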
11:20:29 <zz9pzza> Having a cleaning image per thing is more generic.
11:20:51 <b1airo> Mind you, they are pretty busy making boxes that melt racks... (for those who read about DGX-2 today)
11:21:25 <belmoreira> zhipeng: not sure if I completely understood the goal of cyborg. Can you explain how cyborg differs from the work done in nova to support vGPUs?
11:21:43 <kosamara> oneswig: I'm currently researching that possibility. I don't have low-level pci knowledge yet, so it will take me some time. Perhaps someone else can provide a better answer?
11:22:01 <b1airo> kosamara: is that a verified leak, i.e. between guest instance boots and driver initialisations - I didn't know about this :-/
11:22:43 <zhipeng> belmoreira great question, so on a higher level, cyborg aims to provide a general framework. Per GPU, we are actually discussing with the vGPU folks in Nova on working out a collaboration plan
11:22:57 <kosamara> blairo: yes, I can link to this paper: https://www.semanticscholar.org/paper/Confidentiality-Issues-on-a-GPU-in-a-Virtualized-E-Maurice-Neumann/693a8b56a9e961052702ff088131eb553e88d9ae
11:23:14 <oneswig> To follow on belmoreira's question, this issue with GPU cleaning between uses, I guess it can generalise to other kinds of acceleration.  But does that require a service?
11:23:19 <priteau> kosamara: so it's quite similar to Ironic node cleaning, but with a cleaning VM loaded after each user instance is terminated?
11:23:21 <b1airo> Thanks, will have a look
11:23:24 <zhipeng> the current thinking is that cyborg could provide a more nuanced representation of vGPU resources, for example in a tree structure
11:23:42 <zhipeng> which was originally planned but later ditched in the nova spec, if I remember correctly
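A toy illustration of the nested representation zhipeng mentions: the physical GPU appears as a child provider of the compute host, with its vGPU inventory hanging off the GPU node (names and capacities are made up for the sketch):

```python
# Hypothetical nested-provider view of one compute node.
resource_tree = {
    "name": "compute-01",
    "inventories": {"VCPU": 32, "MEMORY_MB": 131072},
    "children": [
        {
            "name": "compute-01_gpu0",
            "inventories": {"VGPU": 4},  # four vGPU slots on this card
            "children": [],
        }
    ],
}

def total(tree: dict, resource_class: str) -> int:
    """Sum a resource class across the whole provider tree."""
    n = tree["inventories"].get(resource_class, 0)
    return n + sum(total(c, resource_class) for c in tree["children"])

assert total(resource_tree, "VGPU") == 4
assert total(resource_tree, "VCPU") == 32
```

The point of the tree is that a scheduler can distinguish "4 vGPUs on one card" from "4 cards with one vGPU each", which a flat per-host count cannot express.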
11:24:53 <zhipeng> One thing worth mentioning is that Cyborg utilizes and interacts with Placement for resource information aggregation
11:24:56 <priteau> kosamara: which component is responsible for launching this cleaning VM? Is there a cyborg-compute agent?
11:25:00 <belmoreira> oneswig: that's a good point. If nova supports PCI passthrough maybe it should be handled there
11:25:28 <kosamara> priteau: I don't know how ironic node cleaning works. But yes, a cleaning VM loaded after each user. Unless what oneswig suggests above can actually work, through the vfio driver, which is already on the host
11:25:34 <b1airo> oneswig: kosamara: re. direct PCI config space writes - yes I believe you can. Vfio can intercept, but it doesn't currently protect everything that should be protected with GPU BAR0
11:26:23 <belmoreira> zhipeng: but at the end is nova scheduler that needs to be aware of this available resources in placement
11:26:41 <zhipeng> yes exactly
11:26:54 <b1airo> I can imagine cyborg coming into its own with a solid network/fabric based accelerator attachment model
11:26:59 <zhipeng> cyborg-conductor will sync with Placement about all the acceleration resources
11:27:11 <zhipeng> b1airo that is definitely something we are looking at
11:27:28 <b1airo> Things like PCIe fabrics
11:27:37 <oneswig> One issue with this approach is that programmed-IO writes to how-many-GB of GPU RAM might be slower than booting a vm to get the GPU to do it itself.
11:27:38 <zhipeng> that model suits better because the life cycle is independent from the compute
11:27:47 <martial__> zhipeng: what is your timeline for features in cyborg?
11:27:51 <b1airo> Or perhaps NVMeoF
11:28:01 <zhipeng> martial__ which features ?
11:28:24 <oneswig> b1airo: you thinking of RCUDA here?
11:29:13 <zhipeng> oneswig the Huawei Cloud will actually have an rCUDA-enabled remote GPU available for use this year
11:29:14 <b1airo> Yes, RCUDA is a good example for GPUs and would be cool to have a prototype
11:29:32 <zhipeng> the service end is implemented based upon cyborg
11:29:53 <oneswig> zhipeng: sounds good, better get that cleaning working :-)
11:30:06 <zhipeng> cleaning is more fun :)
11:30:06 <b1airo> Will we hear about that in Berlin zhipeng ? :-)
11:30:20 <martial__> zhipeng: given the abstraction level per hardware, are you prioritizing some components/hardware first or is the model/solution thought as a generic enabler for all hardware?
11:30:29 <zhipeng> b1airo will endeavour to do so :P
11:30:58 <zhipeng> martial__ starting in Rocky we will try to establish something like a standardized metadata description
11:31:05 <zhipeng> across FPGA, GPU and other things
11:31:17 <zhipeng> device tree for ARM for example
11:31:41 <zhipeng> we want to make Cyborg talk as general as possible to the accelerators
11:32:14 <martial__> sounds very good
11:32:34 <oneswig> zhipeng: can you talk more on the interaction with nova/placement?  Does Cyborg do something with custom resource classes?
11:33:08 <zhipeng> yes oneswig, cyborg implements custom trait and resource class for FPGA resources at the moment
11:33:09 <b1airo> zhipeng: I think the SLURM scheduler already has a similar tree like resource model, you should look into that for inspiration and/or blatant copying
11:33:21 <zhipeng> and will do the same for other types of accelerators as well
11:33:32 <zhipeng> b1airo any pointers ?
11:33:59 <zhipeng> would love to blatantly copy XD
11:34:42 <b1airo> It's called GRES
11:35:08 <oneswig> zhipeng: what extra does Cyborg add to the placement service's handling of custom resource classes for scheduling with accelerators?
11:35:53 <zhipeng> oneswig actually nothing special (beauty of the placement design)
11:36:13 <zhipeng> as long as we define the schema correctly, it could work :)
11:36:30 <zhipeng> our Intel dev team did a PoC to verify that, just couple of days ago
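For context on the custom resource classes mentioned here: in Placement a custom class is just a name matching `CUSTOM_[A-Z0-9_]+` plus an inventory record on a resource provider. A sketch of the kind of inventory body a cyborg agent might report (the class name and numbers are illustrative, not from the Cyborg PoC):

```python
import re

CUSTOM_RC = re.compile(r"^CUSTOM_[A-Z0-9_]+$")

def fpga_inventory(total: int, generation: int):
    """Build a Placement-style inventory body for one custom resource class."""
    rc = "CUSTOM_FPGA_INTEL_ARRIA10"  # illustrative class name
    assert CUSTOM_RC.match(rc)
    return rc, {
        "resource_provider_generation": generation,
        "total": total,       # devices on this host
        "reserved": 0,
        "min_unit": 1,
        "max_unit": 1,        # one whole device per allocation
        "step_size": 1,
        "allocation_ratio": 1.0,
    }

rc, body = fpga_inventory(total=2, generation=0)
assert rc.startswith("CUSTOM_")
assert body["total"] == 2
```

With the inventory in place, a flavor can then request the class directly (e.g. `resources:CUSTOM_FPGA_INTEL_ARRIA10=1`), which is why zhipeng notes that nothing special is needed on the scheduling side.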
11:36:48 <oneswig> Ah, OK, so the focus of development effort is for supporting the hardware end, more than the scheduling end?
11:37:06 <oneswig> cleaning and so on?
11:37:11 <zhipeng> yes, the gaps, for example for FPGA, are how to interact with Glance on image mgmt
11:37:18 <zhipeng> and how to attach
11:37:38 <zhipeng> so one of the outcome of the discussion we had with the nova team in Dublin
11:37:49 <zhipeng> is that they suggested we create an os-acc lib
11:37:57 <zhipeng> similar to os-vif and os-brick, to handle that
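os-vif and os-brick both boil down to a small per-device attach/detach plugin contract that Nova calls; a hypothetical os-acc could take the same shape. The library did not exist at this point, so everything below is a guess at the interface, with made-up names:

```python
import abc

class AcceleratorPlugin(abc.ABC):
    """Hypothetical os-acc plugin contract, modelled on os-vif/os-brick."""

    @abc.abstractmethod
    def attach(self, instance_uuid: str, device: str) -> dict:
        """Prepare the device and return attachment info for the hypervisor."""

    @abc.abstractmethod
    def detach(self, instance_uuid: str, device: str) -> None:
        """Release the device (and e.g. trigger the cleaning step for GPUs)."""

class PassthroughGPU(AcceleratorPlugin):
    def attach(self, instance_uuid, device):
        return {"type": "pci-passthrough", "address": device}

    def detach(self, instance_uuid, device):
        # a real driver would kick off the memory scrub here
        pass

plugin = PassthroughGPU()
info = plugin.attach("inst-1", "0000:3b:00.0")
assert info["address"] == "0000:3b:00.0"
```

Keeping the attach/detach logic in a library rather than in Nova itself is the same separation of concerns that motivated os-vif and os-brick.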
11:38:36 <zhipeng> For GPU, I guess it would be the cleaning and attach/detach as well
11:38:45 <oneswig> Do you have support for loading user netlists onto FPGAs and passing-through as a prepared device?
11:40:04 <zhipeng> oneswig I have to double check on that with the driver team :)
11:40:20 <b1airo> Oh right, now I realise at the start of the meeting you were probably talking about dynamic attachment of GPUs etc to instances (as opposed to fixed to instance like we have now)
11:40:42 <oneswig> could be good, but very scary attack potential
11:40:58 <zhipeng> b1airo yep :)
11:41:35 <oneswig> Ah, I have to drop off - I have a visitor - chairs can you take over the meeting bot?
11:42:15 <b1airo> Copy
11:42:47 <martial__> sure
11:43:17 <b1airo> zhipeng: have you looked at potential for cyborg to orchestrate vGPU?
11:44:13 <zhipeng> yes definitely, we have actually invited Jianghua to join the discussion at our weekly meeting, about 2 hours and 15 mins from now :P
11:44:19 <zhipeng> on #openstack-cyborg
11:45:12 <martial__> zhipeng: then can you also describe how to best use Cyborg; ie how to deploy and make use of it efficiently ?
11:45:54 <b1airo> I am not sure what actually ended up being implemented for the new vGPU support in Nova, but I'm guessing since there are still no NVIDIA Linux/KVM drivers available for the host side yet that there must be gaps
11:46:21 <martial__> (or plan for use, ie best practice with cyborg)
11:47:14 <zhipeng> martial__ well as you know we wrote the project from the ground up, so it is still very buggy, but devstack is the best way at the moment to try it out
11:47:55 <zhipeng> b1airo i think I could confirm that with Jianghua later
11:48:24 <b1airo> Thanks, I will follow up to check the logs
11:48:30 <martial__> sounds good, thank you
11:49:00 <b1airo> zhipeng: did you have anything else to report on Cyborg?
11:49:03 <martial__> zhipeng: anyhting else we need to know about cyborg?
11:49:20 <b1airo> Or for that matter does anyone else have further questions?
11:49:28 <zhipeng> i think we've covered all the important bits
11:49:29 <b1airo> (jinx martial__ )
11:49:43 <martial__> (indeed b1airo :) )
11:49:55 <b1airo> Great!
11:50:02 <martial__> zhipeng: thank you very much for taking the time to come talk to us
11:50:02 <zhipeng> for our rocky priorities, you can check out the mailing list archive
11:50:06 <zhipeng> no problem, cyborg could not have been born without the great early support from the SWG, and your input is always welcome
11:50:16 <zhipeng> :)
11:50:47 <martial__> :)
11:50:49 <b1airo> martial__: do you have the Forum brainstorm etherpad link handy?
11:51:28 <martial__> #link Forum brainstorming https://etherpad.openstack.org/p/YVR18-scientific-sig-brainstorming
11:52:07 <martial__> so far still only Blair's content
11:52:28 <martial__> we will have more added as we get closer and get confirmation of who will be able to join
11:53:18 <b1airo> kosamara: belmoreira - please throw your ideas in there regarding a session on GPUs
11:53:22 <martial__> but FYI, fellow Scientific SIG participants, the Etherpad is for our collection of ideas for the Forum
11:54:58 <martial__> And as I see no comments yet
11:55:11 <martial__> moving on to the next topic
11:55:27 <martial__> #topic AOB
11:55:50 <martial__> Well GTC yesterday gave us some things to look into
11:56:23 <martial__> "NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. With TensorRT, you can optimize neural network models, calibrate for lower precision with high accuracy, and finally deploy the models to hyperscale data centers, embedded, or automotive product platforms. TensorRT-based applications on
11:56:23 <martial__> GPUs perform up to 100x faster than CPU during inference for models trained in all major frameworks." https://developer.nvidia.com/tensorrt
11:56:49 <martial__> and "NVLink is a great advance to enable eight GPUs in a single server, and accelerate performance beyond PCIe. [...] NVIDIA NVSwitch is the first on-node switch architecture to support 16 fully-connected GPUs in a single server node and drive simultaneous communication between all eight GPU pairs" https://www.nvidia.com/en-us/data-center/nvlink/
11:57:11 <martial__> For people interested, the full 2h30 video is at https://www.ustream.tv/gpu-technology-conference
11:57:31 <martial__> and the model 2 ... b1airo ? :)
11:57:53 <b1airo> Yeah, 10kW beast
11:59:05 <martial__> Link for people interested https://www.nvidia.com/en-us/data-center/dgx-2/
11:59:19 <martial__> and with that, we are reaching the end of the hour
11:59:39 <b1airo> I want HGX-2 (for the little people)
12:00:17 <martial__> thanks again to zhipeng for spending some quality time talking to us about Cyborg (reminder on presentation https://docs.google.com/presentation/d/1tERW4CVhyxNdX50AOPZRa44iPEhico8O_vQ0Ou75L80/edit?usp=sharing )
12:00:17 <b1airo> Time's up!
12:00:27 <b1airo> Thanks!!
12:01:04 <martial__> thanks everybody for joining us for another fun session :)
12:01:11 <martial__> #endmeeting