11:00:32 <oneswig> #startmeeting scientific-sig
11:00:33 <openstack> Meeting started Wed Oct  9 11:00:32 2019 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:36 <openstack> The meeting name has been set to 'scientific_sig'
11:00:55 <oneswig> greetings
11:01:01 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_October_9th_2019
11:01:05 <oneswig> (such as it is)
11:01:41 <dh3> hi
11:02:02 <oneswig> Hi dh3, how's things?
11:02:09 <dh3> busy :)  no change there!
11:02:47 <oneswig> Saw an interesting development this morning
11:02:59 <oneswig> #link SUSE drops their OpenStack product https://www.suse.com/c/suse-doubles-down-on-application-delivery-to-meet-customer-needs/
11:03:09 <oneswig> Crikey
11:03:20 <verdurin> Hello.
11:03:52 <dh3> Suse has always been a bit niche (IME) and OpenStack is niche so maybe they're only dropping a tiny bit of business
11:03:53 <oneswig> Hi verdurin, afternoon
11:04:00 <janders> g'day
11:04:02 <janders> sorry for being late
11:04:07 <janders> Suse dropping OpenStack?
11:04:25 <oneswig> dh3: does seem like it.  I'm mostly worried about where it leaves their core contributors
11:04:47 <dh3> mmm that's a point
11:04:52 <oneswig> janders: yes it seems like it
11:05:34 <oneswig> I wonder what alternatives they'll be transitioning customers to, doesn't say in the post...
11:06:27 <janders> good question, however I have to say I don't think I've ever met anyone running SuSE Cloud...
11:07:05 <oneswig> actually, not sure I have either.
11:07:58 <oneswig> They do appear to have taken a wrong turn in deployment tooling in recent years, not sure what they ended up with.
11:08:33 <janders> I did consider them for a couple of projects to be fair, but the tooling was always like... wait... what?!?
11:08:47 <janders> active/passive databases, Chef config management, etc.
11:09:00 <oneswig> Well, interesting times.
11:09:19 <janders> it does look like OpenStack is past the top of the curve
11:09:34 <janders> those who are using it right are having a blast
11:09:39 <janders> the others realise maybe it's not the way
11:10:08 <oneswig> janders: there's certainly a lot less hype and more practical usage it seems.
11:10:36 <janders> shame the likes of IBM seem to be slowing down development of things such as GPFS-OpenStack integration
11:10:51 <janders> good that they wrote the container-friendly version before that happened
11:10:56 <janders> it's a shame because it's a killer
11:11:28 <janders> with this I've got faster storage in VMs than quite a few supercomputers
11:12:01 <janders> and that's without baremetal/sriov
11:12:10 <oneswig> I wonder what will come of work like this for secure computing frameworks: https://blog.adamspiers.org/2019/09/13/improving-trust-in-the-cloud-with-openstack-and-amd-sev/
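For context: the linked post covers AMD SEV support landing in OpenStack. A minimal sketch of how a SEV-enabled flavor is exposed to users, assuming SEV-capable AMD EPYC compute hosts and Nova Train or later; the flavor name and sizing are illustrative:

```
# Create a flavor whose guests request SEV memory encryption.
# hw:mem_encryption is the Nova extra spec; m1.sev is a made-up name.
openstack flavor create --ram 8192 --vcpus 4 --disk 40 m1.sev
openstack flavor set m1.sev --property hw:mem_encryption=true
```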
11:12:30 <janders> yet people are still undecided whether to develop it further... :/
11:12:38 <dh3> we've had users stand up gpfs on openstack instances (without needing sysadmin help!) as part of a k8s layer, that must say something
11:12:58 <oneswig> dh3: that's impressive and surprising
11:13:23 <oneswig> might also be a statement on your users
11:13:56 <oneswig> Is your k8s an OpenShift deployment or do you roll your own?
11:14:01 <dh3> some of them do jump in with all 3 feet :)
11:14:05 <janders> yeah that is the justification I hear from IBM when they say OpenStack integration will not be developed further. Diverting resources to k8s.
11:14:22 <dh3> our k8s is DIY at the moment but we are pushing towards Rancher to get the nice UI layer
11:14:57 <oneswig> dh3: kubespray or more fundamentally DIY?
11:15:21 <oneswig> hmm seems we've lost dh3
11:15:24 <janders> GPFS is quite flexible - can be relatively simple or quite complex
11:16:07 <janders> our first attempt at deploying GPFS-EC ended up destroying all the OSes it was supposed to run on
11:16:28 <janders> the magic script was filtering out the software RAID but not the member drives - and formatted them :D
11:16:40 <dh3> (dunno what happened there)
11:16:49 <dh3> AFAIK people are building on kubespray
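For context: a typical kubespray bootstrap against existing OpenStack instances looks roughly like the following; the inventory path is illustrative and the instances are assumed to be reachable over SSH:

```
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt
# Start from the sample inventory and list your OpenStack instances
cp -rfp inventory/sample inventory/mycluster
# edit inventory/mycluster/hosts.yaml to describe your nodes
ansible-playbook -i inventory/mycluster/hosts.yaml --become cluster.yml
```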
11:16:56 <janders> we were probably the first site to deploy with software RAID, that's why
11:17:19 <oneswig> janders: oops...
11:17:56 <janders> got the hotfix overnight and the second deploy was all good
11:18:52 <janders> now it's doing 120/50 GB/s read/write on six servers and six clients
11:19:10 <janders> gpfs backed cinder is getting close to 20GB/s in VMs
11:19:13 <oneswig> janders: are you using SR-IOV or ASAP2 for storage I/O to VMs?
11:19:24 <oneswig> or is it all block?
11:19:31 <janders> that's the best part
11:19:33 <janders> no
11:19:53 <janders> hypervisors connect to GPFS over HDR100/HDR200 (depending which ones)
11:20:00 <dh3> janders: do you have any write ups, blog posts, etc?
11:20:10 <janders> VM networking is stock standard
11:20:24 <janders> no - but happy to chat if you're interested
11:20:35 <janders> jacob.anders.au@gmail.com
11:21:06 <dh3> potentially yes. we haven't used gpfs (on the "systems supported" side) for years. but always happy to look around. I'll drop you an email, thanks
11:21:28 <janders> we could make it quite a bit faster especially on the write side but we traded that off for redundancy
11:22:28 <janders> losing a server doesn't hurt it, could probably run without two but that's not a requirement so haven't tested that scenario
11:23:01 <janders> essentially it's Ceph architecture with HPC filesystem performance
11:23:19 <janders> and minimal changes to OpenStack
11:23:41 <oneswig> janders: have you seen roadmaps for that?
11:23:56 <janders> for what exactly?
11:24:23 <oneswig> Ongoing development for GPFS+OpenStack
11:24:47 <janders> I'm being told they pulled resources from this and only maintain it but do not develop it
11:24:47 <oneswig> Are there constraints on the version you can run?
11:25:00 <janders> I've got it going with RH-OSP13
11:25:16 <janders> but I think it would run with latest, too
11:26:15 <janders> I currently have cinder and glance integrated, I tested nova, too and it worked fine
11:27:26 <dh3> do you mount gpfs as (say) /var/lib/nova/instances on the hypervisor then let everything run as normal?
11:27:39 <janders> for nova, yes
11:27:47 <janders> for cinder, no
11:29:35 <janders> though with nova I only turned it on for testing, I have 1.5TB of NVMe in each compute node
11:29:54 <janders> GPFS can do 70k IOPS per client, that NVMe can do 10x that
11:30:19 <janders> so right now it's cinder for capacity/throughput and ephemeral for IOPS
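For context: the GPFS Cinder driver discussed here ships in-tree with Cinder; hypervisors mount the filesystem directly and volumes are files on it. A minimal cinder.conf sketch, with the mount point as a placeholder value:

```
[gpfs]
volume_backend_name = gpfs
volume_driver = cinder.volume.drivers.ibm.gpfs.GPFSDriver
# Base directory where volume files are created (placeholder path)
gpfs_mount_point_base = /gpfs/openstack/cinder/volumes
# Thin-provision volume files rather than preallocating them
gpfs_sparse_volumes = True
```

For the nova case dh3 asked about, no driver is involved: the hypervisor simply mounts GPFS at /var/lib/nova/instances and ephemeral disks land on the filesystem.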
11:30:41 <oneswig> A good balance of options
11:30:47 <verdurin> janders: I'll contact you about this too, if I may
11:30:49 <oneswig> (for the user that knows how to exploit that)
11:31:38 <janders> sure - no worries, happy to chat more
11:31:53 <dh3> similar to us but the compute is SSD not NVMe
11:32:14 <janders> it is a very interesting direction because there's a lot of push for performance, hence the move away from VMs to containers and bare metal
11:32:24 <oneswig> janders: what kind of workloads are you supporting with this cloud?
11:32:40 <janders> and with something like that if it's storage performance they want, VMs suddenly become good enough again
11:33:01 <janders> it's for the cybersecurity research system. The workloads are still being identified/decided.
11:33:13 <janders> GPFS was designed to stand up to ML workloads that were killing our older HPC storage
11:33:40 <janders> it's essentially a smaller, more efficiently balanced version of our BeeGFS design
11:33:55 <janders> if I had PCIe4 these could do 40GB/s read per node
11:34:02 <janders> but unfortunately I don't
11:35:47 <janders> do you guys have any experience tuning GPFS clients for good IOPS numbers?
11:36:43 <oneswig> sorry, not here jamespage
11:36:52 <oneswig> janders I mean - lost again?
11:39:14 <oneswig> welcome back janders :-)
11:39:26 <janders> thanks! :)  network hiccups
11:39:28 <verdurin> janders: our cluster nodes are all GPFS but we haven't needed to do any advanced tuning with them
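For context: GPFS client tuning is usually done with mmchconfig. The two knobs below are commonly cited for small-I/O performance; the values and node list are illustrative only, and some changes (such as pagepool) generally require a daemon restart to take effect:

```
# Enlarge the client-side page pool (GPFS file data cache) on the clients
mmchconfig pagepool=16G -N client_nodes
# Raise the worker thread count to allow more concurrent I/O requests
mmchconfig workerThreads=512 -N client_nodes
```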
11:39:46 <janders> do you remember what IOPS you are getting on clients?
11:40:17 <verdurin> Not offhand, but I can find out.
11:40:32 <janders> that would be quite interesting
11:40:44 <janders> what's the storage backend on your cluster?
11:40:53 <verdurin> This is mainly EDR to spinning disk.
11:41:00 <verdurin> Some FDR.
11:41:09 <janders> JBODs or array based?
11:41:43 <verdurin> DDN arrays.
11:42:27 <verdurin> Latest iteration will have a small pool of SSD.
11:42:39 <janders> what's the capacity?
11:43:48 <verdurin> ~7PB usable.
11:44:38 <janders> nice!
11:44:47 <janders> ours is ~250TB
11:44:52 <verdurin> Hence we're not desperately keen on capacity-based licensing...
11:44:52 <janders> but all-NVMe
11:45:10 <janders> then EC is not a good idea - cheaper to buy more kit
11:45:57 <verdurin> oneswig: your earlier point about contributors is an important one. Lots of people stepping down from PTL-type roles of late, I see.
11:46:32 <oneswig> Yes that has been a trend.
11:48:33 <oneswig> We've been looking recently at the condition of Tungsten Fabric integration with Kolla-Ansible.  It's pretty good, in that it's less than a year behind, but doesn't appear to be advancing beyond that point.  I'm still investigating.
11:49:38 <oneswig> It appears Tungsten has some invasive requirements for installing widgets in the containers of other services.
11:50:36 <oneswig> janders: in a vague attempt to follow the agenda, there was a question on the SIG Slack about usage accounting.  What do you use for this, if anything?
11:51:10 <janders> nothing at the moment
11:51:49 <janders> our User Services guys interview users to identify how much resource they really need (as opposed to what they think they need or what would look good) and set the quotas accordingly
11:51:58 <janders> from there it's assumed it's a solved problem
11:52:11 <janders> not very accurate but kinda works for now
11:52:58 <janders> better than giving users what they ask for on a shared system I suppose
11:53:06 <janders> hope to have a better answer a few months down the track :)
11:53:21 <oneswig> Thanks janders, good to know
11:54:10 <janders> given the User Services guys are nice to us, we are nice to them and give them a simple YAML interface for managing projects, memberships and quotas
11:54:24 <janders> from there Ansible sets it all, they don't need to know OpenStack commands
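For context: a sketch of how such a YAML interface might drive Ansible, using modules from the openstack.cloud collection. The projects.yml file and its top-level "projects:" map are hypothetical, and parameter names should be checked against your collection version:

```yaml
- hosts: localhost
  gather_facts: false
  vars_files:
    - projects.yml          # hypothetical file holding a 'projects:' map
  tasks:
    - name: Ensure projects exist
      openstack.cloud.project:
        name: "{{ item.value.name }}"
        description: "{{ item.value.description }}"
      loop: "{{ projects | dict2items }}"

    - name: Apply compute and volume quotas
      openstack.cloud.quota:
        name: "{{ item.value.name }}"
        instances: "{{ item.value.quota.instances }}"
        cores: "{{ item.value.quota.cores }}"
        ram: "{{ item.value.quota.ram }}"
        volumes: "{{ item.value.quota.volumes }}"
        gigabytes: "{{ item.value.quota.gigabytes }}"
        snapshots: "{{ item.value.quota.snapshots }}"
      loop: "{{ projects | dict2items }}"
```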
11:56:40 <dh3> "simple" + "yaml"... :/  our service desk get to set quotas using Cloudforms (it was only marginally quicker to set up than writing our own interface)
11:57:12 <oneswig> Nearly at time - anything for AOB?
11:57:14 <verdurin> oneswig: billing (or rather costing) keeps coming up for us, and we've relied on rough heuristics for now.
11:58:40 <oneswig> verdurin: noted.  At this end, I'm hoping for priteau to update the study he did on CloudKitty earlier this summer.
11:58:59 <janders25> ################################
sample-five:
  name: sample-five
  enabled: true
  description: asdf
  quota:
    instances: 3
    cores: 20
    ram: 25600
    volumes: 6
    gigabytes: 5
    snapshots: 3
    floating_ips: 8
  members:
    - user123
  networking:
    create_default: true
    create_router: true
################################
11:59:24 <janders25> oops, the formatting died in IRC, but I wanted to show that YAML can be simple to work with
11:59:43 <janders25> my networking is really bad today too
11:59:52 <dh3> I know that but some people are CLI-avoidant!
12:00:05 <priteau> janders25: If they were not nice to you, the interface would be XML?
12:00:08 <dh3> (hard enough getting them to edit the LSF users file)
12:00:09 <janders25> true! :)
12:00:13 <oneswig> And for APEL users, at some point we hope to complete the loop with data submission from OpenStack infrastructure
12:00:31 <oneswig> hi priteau - ears burning :-) ?
12:00:38 <oneswig> Ah, we are out of time
12:00:38 <janders25> priteau: yes! xml.... you nailed it, I was editing pacemaker configs today... argh!
12:00:52 <oneswig> xml gives a vintage feel nowadays
12:00:59 <oneswig> #endmeeting