21:00:15 #startmeeting scientific-sig
21:00:16 Meeting started Tue Aug 21 21:00:15 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:19 The meeting name has been set to 'scientific_sig'
21:00:29 Hi
21:00:42 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_August_21st_2018
21:01:04 Martial sends his apologies as he is travelling
21:01:10 hello everyone
21:01:29 Blair's hoping to join us later - overlapping meeting
21:01:39 Hey rbudden! :-)
21:01:46 hello hello :)
21:01:54 How's BRIDGES?
21:02:12 it’s doing well
21:02:41 Our Queens HA setup is about to get some friendly users/staff for testing :)
21:02:55 I've been having almost too much fun working with these guys on a Queens system: https://www.euclid-ec.org
21:03:23 Mapping the dark matter of OpenStack...
21:03:23 Nice
21:03:59 It boils down to a bunch of heat stacks and 453 nodes in a Slurm partition with some whistles and bells attached
21:04:17 Sounds interesting
21:04:44 Found the scaling limits of Heat on our system surprisingly quickly!
21:05:06 Hehe, I haven’t used Heat much
21:05:09 rbudden: did you ever get your work on Slurm into the public domain?
21:05:28 I took the lazy programmer approach and wrote it all in bash originally, then converted it mostly to Ansible
21:05:34 Actually, I did not
21:05:48 I have piles that need to be pushed to our public GitHub
21:06:07 There's some interest from various places in common Ansible for OpenHPC+Slurm
21:06:10 It’s on my never ending todo list somewhere in there
21:06:21 ;)
21:06:26 Right, somewhere beyond the event horizon...
21:07:45 Are you going to the PTG rbudden? It's merged with the Ops meetup too this time
21:08:30 I am not
21:08:52 Unfortunately I don’t have much funding for travel right now
21:08:56 No, me neither - Martial's leading a session there
21:09:31 I’d have loved to get more involved with the PTG and Ops, but time and funding have always kept me away unfortunately!
21:09:41 Some of our team will be there but I'm of limited use being less hands-on than I'd like to be nowadays
21:10:06 What's new on BRIDGES, what have you been working on?
21:10:09 As long as we have someone there that’s good
21:10:35 Largely the Queens HA and looking into scaling Ironic better
21:10:50 We’ve also had a large influx of requests for Windows VMs from up on CMU campus
21:11:01 bare metal windows?
21:11:04 so we’ve been trying to patch together some infra to make those easier
21:11:10 no, virtual ATM
21:11:18 figures.
21:11:37 I’m starting to plan for our next major upgrade as well
21:11:54 what are you fighting with Ironic scaling?
21:12:05 so getting some multiconductor testing and attempting boot over OPA are high on the list
21:12:40 IPoIB interface on OPA?
21:12:43 yes
21:13:00 i had read the 7.4 centos kernel has stock drivers that could be used for the deploy
21:13:07 I’m not sure if anyone has tried it though
21:13:33 The OPA system I've been working on has the Intel HFI driver packages
21:13:43 i’d love to deploy nodes via IPoIB vs our lonely GigE private network ;)
21:13:50 I'd be surprised if the in-box driver is new enough.
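[Editor's note: as an aside on the multiconductor and bulk-reimage discussion above, a minimal sketch of watching Ironic provision states across a large fleet with openstacksdk. It assumes openstacksdk is installed and a clouds.yaml entry exists; the cloud name "bridges" is hypothetical and not taken from the log.]

```python
# Minimal sketch: tally Ironic node provision states during a bulk reimage.
# Assumes openstacksdk and a configured clouds.yaml; "bridges" is a
# hypothetical cloud name used only for illustration.
from collections import Counter

import openstack

conn = openstack.connect(cloud="bridges")  # hypothetical cloud name

# Count nodes by provision state (e.g. available, deploying, active, error).
states = Counter(node.provision_state for node in conn.baremetal.nodes())

for state, count in sorted(states.items()):
    print(f"{state}: {count}")
```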
21:14:14 our last upgrade to 7.4 took longer than expected pushing out updates via puppet/scripts
21:14:17 Last time I looked was end of last year though
21:14:30 so we’d like to attempt to reimage all 900+ nodes as fast as possible via Ironic this time
21:14:51 yeah, I admit I haven’t done much looking other than finding the pieces are there
21:14:58 i.e. PXE boot via OPA has been done
21:15:00 This might interest you - proposal for Ironic native RBD boot from Ceph volume - https://storyboard.openstack.org/#!/story/2003515
21:15:16 generic OPA IPoIB drivers should be in the mainline kernel now (whether they work is another story)
21:15:29 cool, thx, i’ll check it out
21:16:04 Just a framework for an idea right now but one we've already tried to explore ourselves, because it's a natural fit
21:16:32 we’ve only had limited experience with Ceph
21:16:42 and so far we’ve seen incredibly bad performance
21:16:56 I’ve been meaning to solicit some advice/stats from the group
21:17:04 on setups/performance numbers
21:17:39 Ceph's single-node performance is lame in comparison but the scaling's excellent, particularly for object and block. File is still a little doubtful.
21:17:52 by single-node I mean single-client
21:18:02 ok, interesting
21:18:09 that’s what we were curious about
21:18:15 especially given this:
21:18:16 https://docs.openstack.org/performance-docs/latest/test_results/ceph_testing/index.html#ceph-rbd-performance-results-50-osd
21:18:32 Our team has recently been working on BeeGFS deployed via ansible
21:18:35 it looks like single client performance is a small fraction of what the hardware is capable of
21:19:00 but i was curious if that was due to scaling/sustaining those numbers in parallel to a large portion of clients
21:19:05 That matches my experience, but 20 clients will get 75% of the raw performance of the hardware
21:19:19 we have a very small amount of hardware to dedicate
21:19:26 I bet I have less :-)
21:19:27 say 3-4 servers for Ceph
21:19:33 yup, won hands down :-)
21:20:06 so far the best we’ve seen was 15 MB/s writes, though reads were maybe between 50-100 MB/s
21:20:13 pretty atrocious
21:20:14 The EUCLID deploy's home dir is CephFS, served from 2 OSDs. It's a bit ad-hoc, all we had available at the time.
21:20:35 rbudden: wow, that's not good!
21:20:39 by any measure
21:20:41 there must be some configuration issues
21:21:00 yeah, not sure what’s wrong just yet
21:21:38 we pulled in Mike from IU to try and spot something obvious ;)
21:21:56 On performance, one of the team at Cambridge has been benchmarking BeeGFS vs Lustre on the same OPA hardware.
21:22:05 oh yeah?
21:22:11 we have both at PSC as well
21:22:15 Mike's a great call :-)!
21:22:24 Write performance was equivalent.
21:22:28 our Olympus cluster has both Lustre and BeeGFS
21:22:40 BeeGFS ~20% faster for reads
21:22:45 very interesting
21:22:50 Quite a surprise
21:22:56 we’ve been toying with using BeeGFS on Bridges
21:23:25 since we already mount Olympus filesystems on both
21:23:26 There seems to be good momentum behind it currently. It's pretty easy to set up and use
21:23:37 I’ll have to relay the performance numbers to ppl here
21:23:40 Olympus?
21:23:54 it’s a small cluster from our Public Health group
21:23:58 I'll see if I can get graphs for you.
21:24:01 exact hardware as Bridges
21:24:21 just a separate Slurm scheduler to manage them
21:24:33 we’ve even folded their cluster under Ironic now
21:25:12 Nice work. Hopefully not twice the effort for you now!
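[Editor's note: on the single-client vs aggregate Ceph numbers discussed above, a minimal sketch of the kind of test that shows the difference, using the python-rados bindings. The pool name "bench", the ceph.conf path and the object sizes are assumptions; threads on one host only approximate separate clients, and running "rados bench" from several hosts would be the more realistic measurement.]

```python
# Minimal sketch: compare single-worker vs parallel write throughput against a
# Ceph pool with python-rados. Pool name "bench" and conf path are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import rados

OBJ_SIZE = 4 * 1024 * 1024          # 4 MiB objects
OBJS_PER_WORKER = 64                # ~256 MiB written per worker
PAYLOAD = b"\0" * OBJ_SIZE

def write_objects(cluster, worker_id):
    # Each worker gets its own I/O context onto the (assumed) "bench" pool.
    ioctx = cluster.open_ioctx("bench")
    try:
        for i in range(OBJS_PER_WORKER):
            ioctx.write_full(f"bench-{worker_id}-{i}", PAYLOAD)
    finally:
        ioctx.close()

def run(num_workers):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(write_objects, cluster, w)
                   for w in range(num_workers)]
        for f in futures:
            f.result()               # propagate any write errors
    elapsed = time.time() - start
    total_mib = num_workers * OBJS_PER_WORKER * OBJ_SIZE / 2**20
    print(f"{num_workers:2d} workers: {total_mib / elapsed:8.1f} MiB/s aggregate")
    cluster.shutdown()

if __name__ == "__main__":
    for n in (1, 4, 16):
        run(n)
```

[The point of sweeping the worker count is to separate a slow single client (expected with Ceph) from a genuinely misconfigured cluster: if aggregate throughput barely grows with concurrency, the problem is likely the cluster rather than the client.]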
21:25:15 I think that might put us over 1000 nodes on single conductor :P
21:25:44 not that i’d normally advise that… but we only image a few nodes a week at this point
21:25:54 Any additional concerns taken into account for public health data on that system?
21:26:16 not to my knowledge. we don’t do any classified work, etc.
21:26:34 so nothing fancy with network isolation or anything of that nature
21:26:59 Got it.
21:27:30 It's a requirement that is proliferating though
21:27:36 indeed
21:27:52 we’ve done some small projects that required HIPAA with the local medical community
21:28:03 on Bridges?
21:28:24 i’m not sure if it was Bridges or a small subproject on our DXC hardware
21:28:38 i’d have to check and make sure i’m not mixing things up ;)
21:28:54 I know being compliant for securing big data is on our radar
21:29:11 HIPAA compliance?
21:29:34 that’s one, but i believe there are others
21:29:50 familiar with HIPAA?
21:30:28 faintly at best...
21:30:37 it’s a health care act to help protect sensitive patient data
21:30:45 or should i say regulation ;)
21:31:30 oneswig: sorry to jet, but i need to step offline for daycare pickup!
21:31:43 we should follow up offline, it’s been a while since we chatted in Vancouver
21:31:44 NP, good to talk
21:31:54 I'll mail re: cambridge data
21:31:54 i’ve been meaning to reach out
21:32:00 sounds good
21:32:04 rbudden: always a pleasure
21:32:14 indeed
21:32:18 tty soon
21:38:53 am i too late?
21:39:03 b1air: howzit?
21:39:15 a'ight, you?
21:39:19 Was just chatting with rbudden, he's dropped off
21:39:34 Good thanks. How's the new role going?
21:40:16 i'm in the thick of having my head exploded by all the context i'm trying to soak up
21:40:29 but otherwise good
21:40:58 Are you going to be hands-on with the Cray?
21:41:24 at least i now have a vague idea of what preconceptions the organisation has about my role, i honestly didn't know what i was really signing up for to start off with, just that it would probably be quite different from what i was doing at Monash
21:41:57 i will be unlikely to have root access there anytime soon
21:42:33 NIWA seem to like to keep systems stuff pretty closely guarded
21:43:08 how was your holiday?
21:43:47 Very good indeed. We hiked in the Pyrenees and it was ruddy hot
21:44:59 nice
21:45:24 Perhaps we should close the meeting, and then I'll reply to your email :-)
21:45:35 i need to come up with a hiking/biking/sailing itinerary for summer here!
21:45:48 b1air: in NZ, it's not hard, surely mate
21:46:05 the poor boat needs a bit of love first
21:46:17 sandpaper = tough love ?
21:46:48 lol, water blaster might be enough i hope! anyway, sure, let's continue to be penpals :-)
21:47:10 ok bud, have a good day down there
21:47:20 cheers, over and out
21:47:28 until next time
21:47:30 #endmeeting