11:00:16 <oneswig> #startmeeting scientific-sig
11:00:17 <openstack> Meeting started Wed Mar 25 11:00:16 2020 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:21 <openstack> The meeting name has been set to 'scientific_sig'
11:01:08 <janders> g'day! :)
11:01:13 <janders> hope everyone is keeping safe
11:01:16 <verdurin> Morning.
11:01:26 <oneswig> hi all
11:01:43 <janders> here packets are still allowed to leave NIC ports, we'll see for how long
11:02:17 <belmoreira> hello
11:02:28 <oneswig> all healthy here
11:03:02 <janders> that's great to hear :)
11:03:30 <oneswig> There isn't an agenda today - apologies - I've been helping with a training workshop this week
11:03:56 <janders> no worries
11:06:18 <oneswig> I've heard a few things related to coronavirus and providing compute resources to help research
11:06:30 <oneswig> Anyone else active in this?
11:07:01 <verdurin> Yes, we're active there - lots of frantic requests and cluster re-arrangements.
11:07:13 <janders> we're under change moratorium to make sure systems are up and running for COVID research
11:07:40 <janders> personally I'm not involved in anything beyond this at this point in time :(
11:08:03 <oneswig> I registered for folding@home, but in 2 days (so far) I've had just 1 work unit... apparently the response has been so huge their queues have drained.
11:08:40 <oneswig> What form is the compute workload taking?
11:11:30 <oneswig> janders: I assume the fires are all done with, or is that still ongoing in the background?
11:11:51 <janders> luckily, yes
11:11:55 <janders> on to the next crisis
11:12:00 <verdurin> We have some people preparing to work on patient data, others running simulations.
11:12:22 <verdurin> Some working on drug targets/therapies.
11:12:45 <oneswig> verdurin: from the news I've seen Oxford University does appear to be very active.
11:13:12 <verdurin> Yes. I wasn't deemed glamorous enough to appear on the News, though.
11:13:53 <oneswig> Ah
11:14:30 <oneswig> Elsewhere I've not heard how it's affecting IT supply chains but inevitably there must be consequences.
11:15:17 <verdurin> It has. We had to pay over the odds for some parts to ensure timely delivery, by going with the only vendor who had stock already.
11:16:31 <oneswig> janders: is Canberra in lockdown?  We started in earnest this week.
11:16:43 <janders> partially, yes
11:16:51 <janders> though comms aren't particularly clear
11:17:02 <janders> we've been working from home for weeks now so not much change
11:17:21 <janders> but the authorities are gradually cracking down on pretty much anything
11:20:52 <oneswig> Did you see that Ceph Octopus has been released?
11:21:37 <janders> not yet... are there any new features or improvements relevant to HPC?
11:21:45 <janders> RDMA support, improvements to CephFS?
11:21:56 <oneswig> From what I've heard, the Ceph-Ansible support is doubtful. A pity, considering it tends to work well nowadays.
11:22:25 <janders> that's unfortunate
11:22:27 <oneswig> I think there may be less support for RDMA, but an opportunity to do it better second time round.
11:24:00 <oneswig> I'm not sure this represents the official plan of record but it's along the right lines: https://docs.ceph.com/docs/master/dev/seastore/
11:25:45 <verdurin> The big thing I noticed is this new 'cephadm' tool.
11:27:02 <oneswig> verdurin: is that part of the orchestrator?
11:34:45 <verdurin> Yes, it's a new installation tool that's somehow integrated with the orchestrator.
11:39:01 <verdurin> I just scanned the release notes and flipped through the new docs, though.
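[Editor's note: for readers following up on the cephadm discussion above, the Octopus docs describe a bootstrap flow roughly like the sketch below. The monitor IP is a placeholder and the final orchestrator command is one illustrative example, not the only option.]

```shell
# Fetch the standalone cephadm script, per the Octopus documentation
curl -sLO https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm

# Bootstrap a minimal cluster; 10.0.0.1 is a placeholder monitor IP
./cephadm bootstrap --mon-ip 10.0.0.1

# Further daemons are then placed via the integrated orchestrator, e.g.:
# ceph orch apply osd --all-available-devices
```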
11:39:18 <janders> on another note, I had a chance to finally get back to VPI work on CX6 and I did get it to work with MOFED5 and latest FW
11:39:34 <janders> will probably use it more over the coming weeks
11:40:02 <janders> one interesting challenge I ran into is - how to tackle dual-PCIe-slot setup for 100/200GE
11:40:09 <verdurin> janders: was this with newer firmware than before or did you need to do something different?
11:40:29 <janders> verdurin: newer firmware
11:40:46 <janders> I had issues with VPI in the past and ended up using onboard 10GE for ethernet comms
11:40:54 <janders> now I'm trying to move back to the Mellanox cards
11:41:13 <janders> where it gets interesting is - for the dual-slot, dual-port card, how to wire this up to ovs?
11:41:27 <janders> it comes out as two ports
11:41:36 <janders> even though physically it's one
11:41:49 <janders> with IB it's easy as multipathing is more natural
11:42:11 <verdurin> Ah, you have the dual-slot card. We haven't used any of those yet.
11:42:15 <janders> with ethernet I'm not sure - do you guys have any experience with that?
11:42:19 <janders> yes
11:42:34 <janders> they work beautifully for storage, even if they are a little convoluted
11:43:13 <oneswig_> sorry, dropped off
11:45:05 <janders> I will likely chat to Mellanox about this over the coming days, happy to relay back what I learn if you're interested
11:45:34 <janders> it seems that for non-performance-sensitive workloads, just using one of the two interfaces will suffice
11:46:07 <janders> what I'm wondering about, though, is if we're using leftover PCIe bandwidth for ethernet traffic, maybe it's better for IB if ethernet load-balances across PCIe slots as well
11:46:18 <janders> IB likes symmetry
11:47:13 <janders> if each card gets say 12.5GB/s but we try to grab 5GB/s on one port for eth, I am not too sure if we still get 20GB/s on IB using what's left
11:47:29 <janders> so this may be motivation for figuring this out
11:47:41 <janders> to prevent IB slowdowns due to intermittent ethernet bandwidth peaks
11:48:06 <verdurin> Hmm. Sounds potentially troublesome, as you say.
11:48:12 <janders> yeah
11:48:16 <verdurin> Would be interested to hear.
11:48:23 <janders> dual-slot CX6s are fast but tricky to drive at times
11:48:32 <janders> luckily GPFS uses them very efficiently
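[Editor's note: one way to wire the two ports of a dual-slot card into OVS as a single logical port, so ethernet traffic balances across both PCIe slots as janders suggests, is an OVS LACP bond. This is an untested sketch; the interface names ens1f0/ens1f1 and bridge name are placeholders, and the switch side must also be configured for LACP.]

```shell
# Create a bridge and add both PCIe functions of the card as one bonded port
ovs-vsctl add-br br-data
ovs-vsctl add-bond br-data bond0 ens1f0 ens1f1 \
    -- set port bond0 lacp=active bond_mode=balance-tcp

# Verify LACP negotiation and per-member traffic distribution
ovs-appctl bond/show bond0
ovs-appctl lacp/show bond0
```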
11:48:32 <oneswig_> Surprising that VPI has been an issue, you'd think a lot of people would want that.
11:49:09 <janders> I think it got fixed initially around October, but that had the limitation of having to use a specific port for IB; it wouldn't work the other way round
11:49:15 <janders> I think now it is fully functional
11:50:06 <janders> new MOFED packaging convention is interesting though
11:50:27 <janders> the archive has multiple repos and accidentally using more than one leads to dependency hell
11:50:42 <janders> I think it's been like that since around 4.7 but I only really looked into this today on 5.0
11:50:44 <oneswig_> oh no...
11:50:57 <janders> makes sense when you really look through but takes a while to get used to
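[Editor's note: to illustrate the packaging pitfall janders describes, the MOFED archive since around 4.7 ships more than one RPM tree, and a yum configuration should point at exactly one of them to avoid conflicting dependencies. The mount point and directory name below are illustrative placeholders, not verified paths.]

```shell
# Point yum at exactly ONE repo tree from the unpacked MOFED archive;
# mixing trees from the same archive leads to dependency conflicts.
cat > /etc/yum.repos.d/mlnx-ofed.repo <<'EOF'
[mlnx-ofed]
name=Mellanox OFED (single repo tree only)
baseurl=file:///opt/mofed/RPMS
enabled=1
gpgcheck=0
EOF
```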
11:52:14 <oneswig_> We've been using this - https://github.com/stackhpc/stackhpc-image-elements/tree/master/elements/mlnx-ofed - but sometimes the kernel weak-linkage doesn't work correctly, perhaps it's related to the synthetic DIB build environment.
11:52:54 <janders> yeah I had some issues with weak linkage in the past
11:53:01 <janders> best rebuild against a specific kernel revision
11:57:38 <oneswig_> That can be tricky in diskimage-builder, but I expect there's a way through.
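[Editor's note: the "rebuild against a specific kernel" approach mentioned above maps to the MOFED installer's kernel-support options, which build modules for an explicit kernel rather than relying on weak-updates linkage against the running one. A sketch; the kernel version string is a placeholder and would need to match the kernel installed in the image build chroot.]

```shell
# Rebuild MOFED kernel modules for an explicit target kernel,
# avoiding weak-updates linkage against the build host's kernel
./mlnxofedinstall --add-kernel-support \
    --kernel 3.10.0-1062.el7.x86_64 \
    --skip-repo --without-fw-update
```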
11:58:24 <oneswig_> Nearly at time
11:58:39 <janders> stay safe everyone :)
11:58:46 <oneswig_> same to you!
11:58:49 <janders> hopefully there won't be curfew on internet comms next week :)
11:58:56 <oneswig_> overload, more like.
11:59:03 <janders> I don't think COVID transmits through RDMA though
11:59:19 <janders> so that may be the safest comms channel :)
12:02:07 <oneswig_> cheers all
12:02:10 <oneswig_> #endmeeting