11:00:42 #startmeeting scientific-sig
11:00:43 Meeting started Wed Jun 20 11:00:42 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:44 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:47 The meeting name has been set to 'scientific_sig'
11:00:58 Hi there
11:01:08 hey Stig :)
11:01:08 #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_June_20th_2018
11:01:17 Hey janders, good evening
11:01:30 Hi everyone!
11:01:45 ceph-RDMA! looking forward to the discussion :)
11:01:46 Greetings priteau!
11:01:55 janders: have you got experience of this?
11:02:20 oneswig: not yet, but it's an area of strong interest
11:02:30 Would be a good fit in your place, perhaps...
11:02:41 How's the production roll-out going?
11:02:50 indeed! :)
11:02:58 Good day
11:03:09 Morning martial__!
11:03:12 #chair martial__
11:03:13 Current chairs: martial__ oneswig
11:03:16 Good morning martial__
11:03:22 up early again
11:03:41 OK we must give apologies in advance that this meeting might not run the full hour
11:03:44 but only able to stay for 20 minutes or so
11:04:01 oneswig: making progress - multirail issues are better understood now. Working on adding some virtualisation capability for the bits that don't need bare-metal.
11:04:08 o/
11:04:17 Hey b1airo
11:04:22 #chair b1airo
11:04:23 Current chairs: b1airo martial__ oneswig
11:04:23 Hey Blair
11:04:28 how goes it?
11:04:31 We should make the most of it and get cracking
11:04:37 indeed!
11:04:43 #topic DockerCon roundup
11:04:49 martial__ has been on the road...
11:04:52 How was it?
11:04:59 I have indeed been doing the rounds :)
11:05:11 well it was Dockercon sponsors Kubernetes conference
11:05:26 most of the talks were about using K8s with Docker
11:05:26 yeah, before i fall asleep and drool on my laptop
11:05:37 very good conversation indeed
11:05:56 there was an AI/HPC session, right?
11:06:08 the talk by Christine Lovett and Christian Kniep (who presented here) was the HPC talk
11:06:21 their slides/video are not available yet
11:06:30 man, these 2-bit conferences :-)
11:06:31 but it talked about the future of Docker
11:06:43 lol
11:07:02 no fntech i guess
11:07:18 #link https://www.linkedin.com/feed/update/urn:li:activity:6413174151317590016
11:07:47 the above link is a post I did that gives you a few of the pictures of that presentation
11:07:59 There was also a very interesting talk about ML
11:08:21 #link https://www.linkedin.com/feed/update/urn:li:activity:6413098707868221440
11:08:41 by two Microsoft people, lots of interesting technology presented there
11:09:12 have NVIDIA presented anything on their DGX1/docker stack?
11:09:24 and then there was a "Black Belt" talk on orchestration
11:09:40 #link https://www.linkedin.com/feed/update/urn:li:activity:6412796018143821824
11:10:01 very interesting in that context but still not ready for HPC usage
11:10:05 Interesting claim on your slides Martial "no integration with upstream ecosystem" ... that overlooks all the Docker composition tools
11:10:23 sorry - correction - your photo of their slides :-)
11:10:28 no Nvidia that I could say
11:10:34 see
11:11:14 yes Christian wants to make a clear distinction with true HPC usage
11:11:38 among the cool stuff I saw:
11:11:49 martial__: were there many OpenStack / HPC folks there?
11:12:32 Helm and openwhisk are two other popular topics of conversation
11:12:46 openwhisk?
What's that
11:12:50 (oneswig a few, that fellow from Sandia and Chris Hoge)
11:13:08 Lambdas right?
11:13:23 Dotmesh
11:13:44 There was a demo of this live debugging service (squash and more) and it seems powerful https://www.solo.io (and on github)
11:13:55 dotmesh - I know the founder, is the technology an evolution of the work they did with stateful containers at ClusterHQ?
11:14:17 Calico https://www.projectcalico.org/getting-started/docker/
11:14:22 Helm https://helm.sh
11:14:29 Prometheus https://prometheus.io
11:14:30 Istio https://istio.io
11:14:38 Jaeger https://github.com/jaegertracing/jaeger
11:14:49 and for Istio https://github.com/IBM/istio101
11:15:19 openwhisk https://openwhisk.apache.org/
11:15:39 pierre: yes like Amazon Lambda
11:15:49 a lot of serverless conversation
11:16:21 oneswig: exactly
11:16:49 Without too much irony, OpenStack would benefit from a method for resilient event-driven function processing. Would save all this polling
11:16:51 oneswig: although I think they discussed it being a continuation of the work started a year ago
11:17:25 Be interesting to hear more about how it is being used.
11:17:41 Is there talk of using serverless more in HPC use cases, or is it still mainly for web services?
11:17:41 and that is it for my report from DockerCon ... talks and slides are said to be available soon
11:17:56 just how soon is not clear
11:18:25 OK, we're on a shorter session today, should we move on? Thanks martial__
11:18:38 oneswig: isn't that what Mistral is for?
11:18:47 oneswig: 👍
11:19:04 Ceph-RDMA! :)
11:19:08 b1airo: perhaps it is. Never seems to make things faster though.
11:19:12 #topic Ceph-RDMA update
11:19:21 you ask, we give :-)
11:19:42 OK, so I wanted to follow up on the digging I've been doing on this subject since Vancouver.
11:20:05 I came away from there with a few new areas of activity which I've (briefly) explored
11:20:31 First up - anyone here used RDMA with Ceph?
11:21:10 the closest i've come is testing RoCE between OSD nodes
11:21:20 but not with actual OSD traffic
11:21:47 OK, so the way it works is to implement a derivation of Ceph's async messenger classes
11:22:01 There are currently two competing implementations
11:22:05 I haven't. Last time I checked (a while ago) it was experimental and used accelio
11:22:20 One from XSky in China that builds on IB verbs directly
11:22:35 A new one from Mellanox that uses a userspace messaging library called UCX
11:22:44 Which sounds in overall pitch a lot like Accelio
11:23:18 The XSky project has been running for probably a couple of years
11:23:46 I have the privilege right now of 3 different systems to test on: RoCE, IB and OPA.
11:23:50 It only works on RoCE.
11:23:59 from a quick look it seems UCX replaced Accelio
11:24:26 http://www.mellanox.com/page/products_dyn?product_family=79&mtag=roce
11:24:38 janders: that would make sense. This is the repo: http://www.openucx.org/
11:25:20 The async+ucx messenger is a separate implementation from async+rdma
11:25:41 It's still being actively developed but there is code available
11:25:51 https://github.com/Mellanox/ceph/tree/vasily-ucx
11:26:24 Unfortunately, it uses a new API for UCX that is not yet upstream, but may be "in a few weeks"
11:26:39 So we cannot test this code yet, unfortunately.
11:27:00 (need to get going guys, sorry)
11:27:16 see ya martial__
11:27:30 martial__: thanks!
11:27:30 do you have any experience benchmarking "vanilla" ceph vs RDMA/ceph implementations?
11:27:43 janders: a bit, not enough yet.
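(For reference, the XSky async+rdma messenger discussed above is selected in ceph.conf. A minimal sketch, assuming a build with the RDMA messenger compiled in; the device name is a placeholder, and the OSD daemons generally also need their memlock limit raised:)

    [global]
    # replace the default async+posix (TCP) transport with ibverbs
    ms_type = async+rdma
    # RDMA-capable device to bind, as reported by ibv_devices
    ms_async_rdma_device_name = mlx5_0
    # alternative: keep client traffic on TCP and use RDMA on the cluster network only
    # ms_cluster_type = async+rdma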
11:27:49 I'm curious about 1) throughput 2) CPU utilisation
11:27:59 I'm working on a third way - there's a branch from Intel in China
11:28:00 https://github.com/tanghaodong25/ceph
11:28:24 This adds iWARP support (which I don't care about) and RDMA-CM support (which *should* enable me to use IB and OPA)
11:28:34 I'm working on that right now.
11:29:23 So for testing, what I've seen is that people typically get performance by enabling busy polling, which totally obliterates the efficiency claims. But if that's not enabled, it should (in theory) do better
11:29:43 Anyone used memstore for testing?
11:29:46 RDMA-CM sounds like it is probably the most logical and sustainable option
11:29:47 iWARP.. is that long-range RDMA that can be carried over commodity networks (eg Internet)?
11:30:01 It's RDMA over TCP, so it's routable.
11:30:21 oneswig: UDP actually i think?
11:30:23 Was a side-note in history (NetEffect, remember them?), but Intel has revived it
11:30:37 oh sorry, i was meaning RoCE v2
11:30:43 Apparently it now comes on-board on Intel Purley platforms
11:30:48 UDP sounds about right, but it's been a while since I looked into that
11:30:53 b1airo: could be, I thought TCP but am probably wrong
11:31:06 janders: RoCE v2 is routable too
11:31:22 do Mellanox call RoCEv2 RROCE?
11:31:30 (Routable ROCE)
11:31:49 Could lead to hilarity in mispronunciation attempts...
11:32:06 no you're right oneswig, iWARP is TCP - i was getting ahead of myself
11:32:19 janders: yes sometimes
11:32:24 janders: in terms of performance I have a specific issue to face
11:33:07 If I can avoid using IPoIB, I'm expecting a huge boost for free, which kind of distorts the uplift
11:33:57 I'm also experimenting with Mimic but the ceph-volume preparation keeps hanging for some reason.
11:34:11 Anyone see that? This is with the CentOS packages
11:34:52 ceph-volume does seem a little buggy still, early days
11:35:00 Probably an answer to be found on the ceph mailing list
11:35:28 oneswig: so did you get any tests working?
11:35:54 Only RoCE so far, and that's old data - just picking up these new pieces of work now
11:36:28 would you expect much of a difference with the other connection management options?
11:36:35 The RoCE system was I/O bound, not network bound, so it didn't make masses of difference
11:36:39 oneswig: regarding hangs, it's not anywhere near a "gathering keys" step, is it?
11:36:47 The connection management is an enabler, that's all
11:36:54 Without it, the parameters seem to be all wrong
11:37:08 janders: not in this instance.
11:37:56 The nodes are running ceph-disk or ceph-volume, but not making any progress
11:38:07 Well, I'll keep plugging on that.
11:38:14 janders: you're using BeeGFS, right?
11:38:18 Or was it Lustre?
11:38:42 it'll be BeeGFS - we don't have it yet, kit is ordered and in flight
11:39:00 we do have some GPFS too
11:39:02 no Lustre
11:39:09 Excellent - been hearing some good things on BeeGFS
11:39:24 this is a little off topic, but given you mentioned IPoIB - do you have any experience running software vxlan over IPoIB?
11:39:43 I think the summary on Ceph-RDMA is still "watch this space".
11:39:48 I'm not after outstanding performance (SRIOV will cover this part), more after flexibility
11:40:48 I keep getting reminded to get our BeeGFS playbooks into better shape so they are respectable
11:40:55 i don't janders, but i would expect it to work given both are L3
11:40:58 One of these days...
11:41:28 Ah - VXLAN - just saw that. What's the use case?
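(On janders' throughput and CPU questions earlier: one hedged way to compare a vanilla async+posix pool against async+rdma is to drive identical rados bench runs from a client while sampling OSD CPU on the storage nodes; oneswig's memstore suggestion, i.e. osd objectstore = memstore on a scratch cluster, takes the disks out of the picture so the messenger is what is being measured. Pool name and durations below are placeholders:)

    # client side: 60 seconds of 4 MB writes, then sequential reads back
    rados bench -p testpool 60 write --no-cleanup
    rados bench -p testpool 60 seq
    # on each OSD node: per-daemon CPU every 5 seconds while the bench runs
    pidstat -u -p $(pgrep -d, ceph-osd) 5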
11:41:49 a bit of virtualised capacity on the baremetal cloud
11:41:58 It might have a better chance of working with the kernel module than with (say) OVS
11:42:12 without adding an extra ethernet fabric
11:42:40 (if it was mostly KVM w/o SRIOV, I'd probably go with Mellanox ethernet + vxlan offload)
11:43:15 yeah i was thinking kernel too
11:44:16 noted! thanks guys
11:44:59 janders: the mix of bare metal + virt is interesting to us, can you come back some time and tell us how your solution's working out?
11:45:13 oneswig: sure! :)
11:45:19 will do
11:45:47 b1airo: we should wrap up given your timing
11:45:52 #topic AOB
11:46:04 not that we've rigidly kept to topic thus far.
11:46:09 Any other news?
11:46:11 maybe just quickly as this was missed last week
11:46:16 on the Ceph note, anyone seen any material comparing bluestore with rocksdb metadata.db on ssd/flash versus using underlying block-caching layer and presenting a single cached device to bluestore?
11:46:16 focus areas for the cycle
11:46:21 I propose: spot instances
11:46:41 (I think I saw Belmiro joining :)
11:47:13 there were a couple of presentations from the Beijing Cephalocon, but they are in Chinese and no slidedecks published
11:47:16 b1airo: got a link?
11:47:24 ah
11:47:32 i'm after a link! :-)
11:47:41 janders: spot instances - agreed - we are after this too
11:48:02 The "missing link" for us is how to do it nicely at the platform level
11:48:37 oneswig: I think we need spot instances + clever chargeback to make sure resources are used efficiently
11:48:55 Totally, that's how we see it too.
11:49:07 Anyone going to ISC next week?
11:49:16 otherwise - people will either hog resources, or there'll be a lot of unused capacity that ain't good either
11:49:29 I was supposed to, but ended up not going - too much work
11:49:40 Pete is going for Sanger
11:50:20 John T will be there for us, I'm going to HPCKP (tomorrow) instead
11:50:47 it'll be interesting to hear more about the new Summit machine
11:50:58 that's a few more Volta GPUs than we have... :)
11:51:08 indeed!
11:51:31 i just got some quotes with the new 32GB V100 variants
11:52:01 how will you run 'em? baremetal? vGPU?
11:53:18 passthrough mostly i think
11:54:07 probably make these ones the new compute hosts and use some of the existing 16GB fleet for vGPU (desktop nodes in the cluster service)
11:54:51 makes sense!
11:55:15 on the GPU topic, i've been trying to do some multi-gpu profiling and getting gpudirect running within KVM, have overcome some minor issues but no major result comparisons to share yet
11:55:50 Ah, too bad. Would be great to know the state of the art while you're still able to
11:55:56 if i get it together i'll post it back to the lists as a follow up to jon from csail@mit
11:56:07 Sounds good.
11:57:24 i need to grab one of the hosts we have that supports gpu p2p and try the experimental qemu gpu clique support. the host i've reserved at the moment only has two P100s on different root complexes. rubbish ;-)
11:58:12 P100s... can't work with that... ;)
11:58:57 OK - anything more to add?
11:59:15 I'm good - thank you guys!
11:59:39 oneswig: I will keep you posted on virt+ironic mix,
11:59:42 Same here - thanks everyone
11:59:48 janders: please do, sounds interesting
11:59:54 should have some data in the coming weeks
12:00:00 https://ardc.edu.au/
12:00:45 b1airo: NeCTAR-NG?
12:01:05 yep. 5 yr NCRIS roadmap looks to be funded reasonably well - $60m odd for capex that isn't peak hpc
12:01:28 cool!
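(Back on janders' VXLAN-over-IPoIB question: the kernel vxlan driver doesn't care that the underlay interface is IPoIB, so a rough iproute2 sketch on two hypervisors would look like the following. Interface names, VNI and addresses are placeholders, and the VXLAN MTU has to fit inside the IPoIB MTU since encapsulation adds roughly 50 bytes:)

    # underlay is the IPoIB interface ib0; unicast peering, no multicast needed
    ip link add vxlan100 type vxlan id 100 dstport 4789 dev ib0 local 10.0.0.1
    ip link set vxlan100 up
    # point the FDB at the other hypervisor's IPoIB address
    bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.2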
12:01:35 Wow, sounds good
12:01:44 OK, gotta close the meeting
12:01:51 bon voyage b1airo!
12:01:56 night gents
12:02:02 night
12:02:03 #endmeeting