21:02:30 <martial> #startmeeting Scientific-SIG
21:02:31 <openstack> Meeting started Tue Sep  3 21:02:30 2019 UTC and is due to finish in 60 minutes.  The chair is martial. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:34 <openstack> The meeting name has been set to 'scientific_sig'
21:03:06 <martial> quite informal meeting today as well :)
21:03:18 <janders42> I'm making up the numbers by joining twice
21:03:24 <janders42> trying to get my mac to cooperate
21:04:18 <martial> janders & janders42 : good man !
21:04:50 <martial> I am nowadays using irccloud, simpler way to go online
21:05:05 <janders42> good idea - I should look at this, too
21:05:17 <janders42> what are you up to these days martial?
21:05:45 <martial> trying to train some models for ML for a Yolo test
21:05:54 <janders42> I'm fighting our GPFS hardware, finally winning
21:06:18 <janders42> (looking up initial benchmark of what the kit is capable of)
21:06:38 <janders42> Run status group 0 (all jobs):   READ: bw=36.1GiB/s (38.7GB/s), 4618MiB/s-4745MiB/s (4842MB/s-4976MB/s), io=800GiB (859GB), run=21579-22174msec
21:06:38 <janders42> Run status group 1 (all jobs):  WRITE: bw=24.0GiB/s (26.8GB/s), 3194MiB/s-3224MiB/s (3349MB/s-3381MB/s), io=800GiB (859GB), run=31757-32057msec
21:06:38 <janders42> [root@s206 ~]# cat bw.fio
21:06:38 <janders42> # Do some important numbers on SSD drives, to gauge what kind of
21:06:38 <janders42> # performance you might get out of them.
21:06:38 <janders42> #
21:06:38 <janders42> # Sequential read and write speeds are tested, these are expected to be
21:06:38 <janders42> # high. Random reads should also be fast, random writes are where crap
21:06:38 <janders42> # drives are usually separated from the good drives.
21:06:38 <janders42> #
21:06:39 <janders42> # This uses a queue depth of 4. New SATA SSD's will support up to 32
21:06:39 <janders42> # in flight commands, so it may also be interesting to increase the queue
21:06:39 <janders42> # depth and compare. Note that most real-life usage will not see that
21:06:39 <janders42> # large of a queue depth, so 4 is more representative of normal use.
21:06:39 <janders42> #
21:06:39 <janders42> [global]
21:06:39 <janders42> bs=10M
21:06:39 <janders42> ioengine=libaio
21:06:39 <janders42> iodepth=32
21:06:39 <janders42> size=100g
21:06:39 <janders42> direct=1
21:06:39 <janders42> runtime=60
21:06:39 <janders42> #directory=/mnt
21:06:39 <janders42> filename=/dev/md/stripe
21:06:39 <janders42> numjobs=8
21:06:39 <janders42> [seq-read]
21:06:39 <janders42> rw=read
21:06:39 <janders42> stonewall
21:06:39 <janders42> [seq-write]
21:06:39 <janders42> rw=write
21:06:39 <janders42> stonewall
21:06:39 <janders42> [root@s206 ~]#
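For readers wanting to reproduce a run like this, the bw.fio job above maps onto a plain fio command line roughly as follows (a sketch only: the /dev/md/stripe target and the parameters are taken from the paste, the two separate invocations stand in for the job file's stonewall sections, and note the write pass destroys whatever is on the target device):

    # sequential read pass, same parameters as the [seq-read] section above
    fio --name=seq-read --rw=read --bs=10M --ioengine=libaio --iodepth=32 \
        --direct=1 --numjobs=8 --size=100g --runtime=60 \
        --filename=/dev/md/stripe --group_reporting
    # sequential write pass ([seq-write]) - destructive to /dev/md/stripe
    fio --name=seq-write --rw=write --bs=10M --ioengine=libaio --iodepth=32 \
        --direct=1 --numjobs=8 --size=100g --runtime=60 \
        --filename=/dev/md/stripe --group_reporting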
21:06:55 <janders42> 39GB/s read, 27GB/s write per node
21:06:58 <janders42> there will be 6
21:07:01 <janders42> with HDR200
21:07:04 <janders42> so should be good
21:07:38 <janders42> this is evolution of our BeeGFS design - more balanced, non-blocking in terms of PCIe bandwidth
21:07:41 <martial> that is quite good
21:08:08 <janders42> unfortunately the interconnect doesn't fully work yet, but I'll be at the DC later this morning, hopefully getting it to work
21:08:57 <janders42> have you ever looked at k8s/GFPS integration?
21:09:27 <janders42> this is meant to be HPC OpenStack storage backend and HPC storage backend but I think there's some interesting work happening in k8s-gpfs space
21:10:11 <martial> I confess we have not had the need for now
21:10:43 <martial> what are your reading recommendations on this topic?
21:11:09 <janders42> I think all I had was some internal email from IBM folks
21:11:41 <janders42> I hope to learn more about this in the coming months - and will report back
21:12:17 <martial> that would indeed be useful
21:13:09 <martial> maybe a small white paper on the subject?
21:14:04 <martial> #chair b1airo
21:14:05 <openstack> Current chairs: b1airo martial
21:14:08 <janders42> if it will be possible to remote into the Shanghai SIG meetup, I can give a lightning talk about GPFS as an OpenStack and/or k8s backend
21:14:11 <janders42> hey Blair!
21:14:11 <b1airo> o/
21:14:21 <b1airo> issues connecting today sorry
21:14:45 <janders42> another "interesting" thing I've come across lately is how Mellanox splitter cables work on some switches
21:14:46 <b1airo> how goes it janders42
21:14:49 <martial> not sure if Stig will have that capability at the Scientific SIG meet in Shanghai but I hope so
21:15:16 <b1airo> i'd be keen to see that lightning talk janders42
21:15:19 <janders42> yeah that would be great! :)  - unlikely I will be able to attend in person at this stage
21:15:47 <martial> not sure if Blair will be there. I will not
21:15:54 <martial> (Shanghai)
21:16:13 <janders42> I plugged a 50GE to 4x10GE cable into my mlnx-eth switch yesterday and enabled the splitter function on the port
21:16:35 <janders42> and the switch went "port eth1/29 reconfigured. Port eth1/30 reconfigured"
21:16:48 <janders42> erm... I did NOT want to touch port 30 - it's an uplink for an IPMI switch...
21:17:00 <janders42> boom
21:17:03 <janders42> another DC trip
21:17:10 <janders42> nice "feature"
21:17:27 <janders42> with some switches it is possible to use splitters and still use all of the ports
21:17:45 <janders42> with others - the above happens - connecting a splitter automagically kills off the port immediately below
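For context, enabling a breakout on a MLNX-OS/Onyx Ethernet switch looks roughly like the following (a sketch from memory, so treat the exact syntax as an assumption and check the release notes for your firmware; on the switch models janders describes, this is the step that also unmaps the port below):

    # MLNX-OS / Onyx CLI - split port 1/29 into 4 lanes (syntax is approximate)
    enable
    configure terminal
    interface ethernet 1/29 shutdown
    interface ethernet 1/29 module-type qsfp-split-4 force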
21:17:57 <janders42> oh well lesson learned
21:18:06 <janders42> hopefully will undo this later this morning
21:18:52 <martial> or maybe you will lose your uplink ... <disconnect>
21:19:25 <martial> (that's why we love technology :) )
21:19:43 <b1airo> no, i won't make it to Shanghai. already have a couple of trips before the end of the year and another to the States early Jan
21:20:15 <martial> joining us in Denver ?
21:20:21 <janders42> Denver = SC?
21:20:36 <b1airo> janders42: iirc that functionality is documented regarding the breakout ports
21:21:04 <janders42> yeah... @janders
21:21:07 <janders42> RTFM
21:21:07 <janders42> :D
21:21:21 <janders42> just quite counter-intuitive
21:21:46 <martial> SC yes
21:21:50 <b1airo> reconfiguring the port requires a full switchd restart and associated low-level changes, which interrupts all the way down to L2 i guess
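On Cumulus-based switches (the 1G IPMI switches mentioned later in the meeting run Cumulus) the equivalent breakout is a ports.conf entry plus the switchd restart b1airo refers to, which briefly disrupts the whole switch. A sketch, assuming Cumulus Linux 3.x/4.x:

    # /etc/cumulus/ports.conf - declare port 29 as a 4x10G breakout
    29=4x10G
    # then restart switchd to apply; this interrupts forwarding switch-wide
    sudo systemctl restart switchd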
21:22:11 <b1airo> SC yes
21:22:20 <janders42> just looked at the calendar and SC19 and Kubecon clash
21:22:27 <janders42> I'm hoping to go to Kubecon
21:22:41 <janders42> otherwise would be happy to revisit Denver
21:22:59 <janders42> how long of a trip is it for you Blair to get to LAX? 14 hrs? Bit closer than from here I suppose..
21:23:54 <b1airo> yeah bit closer, just under 12 i think
21:23:58 <janders42> nice!
21:24:17 <b1airo> usually try to fly via San Fran though
21:24:29 <janders42> yeah LAX can be hectic
21:24:35 <janders42> I quite like Dallas
21:24:37 <janders42> very smooth
21:24:54 <janders42> good stopover while heading to more central / eastern parts of the States
21:25:25 <janders42> not optimal for San Diego though - that'd be a LAX/SFO connection
21:25:52 <martial> (like I said a very AOB meeting today ;) )
21:26:02 <janders42> since Blair is here I will re-post the GPFS node benchmarks
21:26:06 <b1airo> :-)
21:26:12 <b1airo> yes please
21:26:13 <janders42> Run status group 0 (all jobs):   READ: bw=36.1GiB/s (38.7GB/s), 4618MiB/s-4745MiB/s (4842MB/s-4976MB/s), io=800GiB (859GB), run=21579-22174msec
21:26:14 <janders42> Run status group 1 (all jobs):  WRITE: bw=24.0GiB/s (26.8GB/s), 3194MiB/s-3224MiB/s (3349MB/s-3381MB/s), io=800GiB (859GB), run=31757-32057msec
21:26:26 <janders42> this is from a single node with 12 NVMes - no GPFS yet
21:26:38 <janders42> but we did manage to get a 12-NVMe non-blocking PCIe topology going
21:26:47 <janders42> 39GB/s read 27GB/s write
21:27:07 <janders42> we'll have six of those puppies on HDR200 so should be good
21:27:34 <janders42> but having said that I need to head off to the DC soon to bring this dropped-out IPMI switch back on the network - otherwise I can't build the last node...
21:28:15 <b1airo> would be interesting to see how those numbers change with different workload characteristics
21:28:26 <janders42> bad news is I haven't found any way whatsoever to build these through Ironic
21:28:48 <janders42> 14 drives per box, booting thru UEFI from drives number 8 and 9, is too much of an ask
21:28:53 <b1airo> but those look like big numbers for relatively low depth
21:28:55 <janders42> and these drives need to be SWRAID, too
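The software-RAID layout being described appears to be a mirrored boot pair plus an md stripe across the data NVMes (the /dev/md/stripe target in bw.fio). A hypothetical mdadm sketch, with device names, counts and RAID levels as assumptions rather than the actual build:

    # mirrored pair for the OS/boot drives (hypothetical device names)
    mdadm --create /dev/md/boot --level=1 --raid-devices=2 /dev/sda /dev/sdb
    # RAID-0 stripe across the twelve data NVMes, used as the fio target above
    mdadm --create /dev/md/stripe --level=0 --raid-devices=12 /dev/nvme{0..11}n1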
21:29:54 <janders42> I think there's a fair bit of room for tweaking, this was just to prove that the topology is right
21:30:20 <janders42> it would be very interesting to see how the numbers stack up against our IO500 #4 BeeGFS cluster
21:30:27 <b1airo> what's the GPFS plan with these? declustered raid?
21:30:28 <janders42> in the ten node challenge
21:30:34 <janders42> GPFS-EC
21:31:04 <janders42> though we will run 4 nodes of EC and 2 nodes of non-EC just to understand the impact (or lack thereof) of EC on throughput/latency
21:31:19 <janders42> for prod it will be all EC
21:32:15 <janders42> the idea behind this system is a mash-up of Ceph-style arch, RDMA transport and NVMes connected in a non-blocking fashion
21:32:34 <janders42> hoping to get the best of all these worlds
21:32:46 <janders42> so far the third bit looks like a "tick"
21:32:55 <janders42> these run much smoother than our BeeGFS nodes in the early days
21:33:40 <janders42> the BeeGFS nodes do have some blocking which is causing all sorts of special effects if not handled carefully
21:33:49 <janders42> the new GPFSes have none
21:34:25 <janders42> which gets a little funny cause people hear this and ask me - so what's the difference between BeeGFS and GPFS, why do we need both?
21:34:41 <janders42> I used to say BeeGFS is a Ferrari and GPFS is a beefed up Ford Ranger
21:34:47 <b1airo> ha!
21:34:53 <janders42> but it really comes down to a Ferrari and a Porsche Cayenne now
21:35:21 <b1airo> i guess a lot of it comes down to the question of what happens when either a) the network goes tits up, and/or b) the power suddenly goes out from a node/rack/room
21:35:33 <janders42> all good questions
21:35:41 <janders42> and with BeeGFS I have a catch-all answer
21:35:43 <janders42> it's scratch
21:35:45 <janders42> it's not backed up
21:35:48 <janders42> if it goes it goes
21:35:48 <b1airo> i.e., do you still have a working cluster tomorrow :-), and is there any data still on it
21:36:00 <janders42> GPFS on the other hand... ;)
21:36:07 <b1airo> get out of jail free card ;-)
21:36:38 <janders42> if one was to go through a car crash, I recommend being in the Cayenne not the Ferrari
21:36:41 <janders42> let's put it this way
21:37:14 <janders42> but yeah it's an interesting little project
21:37:34 <janders42> if it wasn't for all the drama with HDR VPI firmware slipping and splitter cables giving us shit, it would be up and running by now
21:37:37 <janders42> hopefully another month
21:37:51 <janders42> and on that note I better start getting ready to head out to the DC to fix that IPMI switch..
21:38:10 <janders42> we got mlnx 1GE switches with Cumulus on them for IPMI
21:38:14 <janders42> I can't remember why
21:38:29 <janders42> they are funky, but it is soo distracting switching between mlnx-os and cumulus
21:38:39 <janders42> completely different philosophy and management style
21:38:53 <janders42> probably wouldn't get those again - just some braindead cisco-like ones for IPMI
21:39:17 <janders42> MLNX-100GE and HDR ones are great to work with though
21:39:39 <janders42> (except the automatic shutdown of the other port when enabling splitters :)  )
21:39:53 <b1airo> if you go Cumulus all the way then you can obviously get 1G switches too, but then bye bye IB
21:40:08 <b1airo> i'm slightly distracted in another Zoom meeting now sorry
21:40:14 <janders42> no worries
21:40:19 <janders42> wrapping up here, too
21:40:45 <janders42> I hope I haven't bored you to death with my GPFS endeavours
21:41:13 <b1airo> no, sounds fun!
21:41:42 <janders42> if you happen to have a cumulus-cisco cheat sheet I'd love that
21:41:55 <janders42> I wish I could just paste config snippets to Google Translate
21:41:58 <janders42> :D
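In lieu of that cheat sheet, a few of the most common day-to-day equivalents, as a rough sketch (classic Cisco-style syntax on the left, Cumulus NCLU on the right; MLNX-OS is close to the Cisco column but not identical):

    # Cisco-style                         # Cumulus NCLU
    show running-config                   net show configuration
    show interface status                 net show interface
    interface GigabitEthernet1/0/1        (no config mode - net commands stage changes)
      switchport access vlan 100          net add interface swp1 bridge access 100
      shutdown                            net add interface swp1 link down
    write memory                          net commit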
21:44:48 <martial> with that, should we move to AOB? :)
21:45:26 <martial> If not, I propose we adjourn today's meeting
21:47:15 <martial> see you all next week
21:47:21 <martial> #endmeeting