21:00:45 <b1airo> #startmeeting scientific-sig
21:00:46 <openstack> Meeting started Tue Mar 19 21:00:45 2019 UTC and is due to finish in 60 minutes.  The chair is b1airo. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:49 <openstack> The meeting name has been set to 'scientific_sig'
21:00:49 <jmlowe> I expect rbudden to make an appearance today
21:00:52 <rbudden> hello
21:00:55 <rbudden> :)
21:00:59 <b1airo> o/
21:01:05 <b1airo> #chair oneswig
21:01:06 <openstack> Current chairs: b1airo oneswig
21:01:08 <oneswig> jmlowe: so quickly proven correct
21:01:10 <b1airo> #chair martial
21:01:11 <openstack> Current chairs: b1airo martial oneswig
21:01:27 <oneswig> greetings all
21:01:43 <b1airo> hopefully joe manages to get on soon
21:01:48 <jmlowe> I may have cheated and compared calendars with him within the past 45 min
21:01:59 <rbudden> heh
21:02:37 <rbudden> been far too long
21:03:07 <janders> g'day
21:03:35 <b1airo> morning janders
21:03:50 <martial> I invited Maxime and Alex to join this session to see how we run these sessions; they are proposing to discuss their work with us next week :)
21:04:09 <oneswig> cool
21:04:32 <oneswig> Although I am going to Texas next week to see our friends at Dell
21:05:03 <martial> I will help start the meeting
21:05:05 <oneswig> I expect I'll be jet-lagged so still in about the right time zone!
21:05:21 <martial> was just about to email you about it
21:05:55 <janders> texan bbq! :)
21:06:24 <jtopjian> I'm here! sorry
21:06:40 <oneswig> janders: damn right
21:06:46 <oneswig> hi jtopjian, welcome
21:07:08 <jtopjian> Hello
21:07:24 <janders> have you guys heard the news about NVIDIA's Mellanox acquisition?
21:07:36 <rbudden> yep
21:07:45 <rbudden> should be interesting
21:07:59 <oneswig> Mixes things up a little!
21:08:11 <janders> I sense this should push the HPC side of things more
21:08:30 <janders> (unlike with some other potential buyers who were on the table)
21:08:40 <martial> yep, outbid Intel by a lot too
21:08:45 <oneswig> There's a good deal of talk about the high performance data centre
21:08:55 <oneswig> which sounds good to me
21:09:13 <janders> any thoughts on how this would affect Mellanox's OpenStack work?
21:09:24 <janders> s/would/will
21:10:06 <oneswig> neutral-to-positive, I'd guess
21:10:13 <b1airo> i doubt it will have any immediate impact janders
21:10:20 <janders> same feelings here
21:10:32 <jmlowe> I'm kind of banking on Mellanox continuing to do good things for both ceph and openstack
21:10:54 <janders> I came across this yesterday: https://www.nextplatform.com/2019/03/18/intel-to-take-on-openpower-for-exascale-dominance-with-aurora/
21:10:54 <oneswig> b1airo: increased interest in virtualised gpu direct, perhaps?
21:10:55 <b1airo> though a general HPC-datacentre trend might make NVIDIA more interested in e.g. Cyborg
21:11:17 <b1airo> possibly oneswig
21:11:31 <b1airo> or some new version thereof (more likely)
21:11:32 <jmlowe> Which I find astounding: "We've delivered nothing, but double our money and we'll deliver something even better, we promise"
21:11:55 <janders> yeah.. and the interconnect bit is "interesting"
21:12:26 <janders> as much as I am a fan of a competitive landscape, no one except Mellanox seems to have the right answers to the right questions about advanced fabric features
21:12:54 <oneswig> That's more than a ripple in the pond, janders, if it's true it's a huge splash
21:13:14 <b1airo> they are also guilty of claiming to have the answers 2 years before they have a product containing any of them, but i guess everyone knows that now
21:13:18 <jmlowe> Who exactly signed off on 1/2 billion to a company that just completely botched your 1/4 billion dollar acquisition?
21:14:10 <b1airo> NVIDIA's acquisition history is not great mind you, would have been interesting to have a front row seat to Mellanox boardroom talks when that topic came up
21:14:19 <jmlowe> I'm still waiting on my sriov live migration from Mellanox
21:14:34 <b1airo> lols, good luck
21:14:53 <b1airo> there is some upstream kvm work that looks promising - vfio migrations
21:15:31 <b1airo> ok, we should move along and pass the baton to jtopjian
21:15:48 <jmlowe> I might get live migration with vgpus from nvidia though
21:15:50 <jmlowe> yes
21:15:53 <janders> jmlowe: are you thinking eth or ib for live migration?
21:16:02 <jtopjian> sure. I shouldn't take long, but happy to answer questions, too.
21:16:13 <janders> I think SRIOV is tricky, but the RDMA/QP part is even trickier
21:16:14 <b1airo> #topic Nomad for GPU container workload management
21:16:16 <jmlowe> janders: We wound up pitching mellanox eth
21:16:30 <b1airo> stow it for the wrap-up lads :-)
21:16:36 <janders> ok!
21:16:49 <martial> jtopjian: and link for slides or external material?
21:16:57 <jtopjian> none :)
21:16:59 <b1airo> give us the pitch jtopjian
21:17:05 <jtopjian> There's a group we're working with who have a stack of nine servers, each with four GPUs. Currently, users log into each server directly and do whatever GPU processing they need there.
21:17:40 <b1airo> was i correct that they are doing exploratory data-science and ML stuff jtopjian ?
21:17:46 <jtopjian> That's correct
21:18:19 <jtopjian> The group has a wide range of knowledge and skill as well as a wide range of what they're working on.
21:18:40 <jtopjian> They asked us if we could do anything to help automate job submission. We figured we'd try a bit of an experiment and do something totally different.
21:18:50 <oneswig> I guess these are all single node workloads, right?
21:19:17 <jtopjian> Possibly. I don't think there was ever enough parallel work for a node to be running more than one job :)
21:20:06 <jtopjian> Unfortunately we were not able to get direct access to the hardware, due to a myriad of non-technical reasons, so we set up a PoC cluster in our OpenStack cloud: 5 instances in total, one control node and 4 worker nodes with 1 GPU each
21:20:54 <martial> for a second I thought it was 4 GPUs per node
21:21:10 <jtopjian> The existing cluster is. We could not get physical access to it.
21:21:23 <jtopjian> So we built a virtual PoC instead.
21:21:37 <jtopjian> We had a Nomad master on the control node and a worker on each of the others. The only Nomad driver we used was "raw_exec", which basically just runs a shell command on the root filesystem. The reason we went this route was that the group had a second request: Singularity support.
21:22:41 <jtopjian> Nomad 0.9-dev (master branch) has support for GPU polling. It'll poll the amount of memory the GPU cards are using and schedule jobs that way. So I had to compile Nomad from source, but that's not terribly hard at all.
21:23:02 <jtopjian> With all of that in place, I set up some sample jobs and shared them out
21:23:09 <jtopjian> And then waited for feedback
21:23:20 <jtopjian> Out of the six people who used the cluster, I received feedback from two.
21:23:27 <b1airo> #link https://www.hashicorp.com/products/nomad
21:24:09 <jtopjian> Right, some quick background: Nomad is a very basic scheduler. It supports various execution styles (batch, service) and drivers (docker, exec, qemu).
21:24:24 <jtopjian> Jobs are declared using an HCL-type syntax. If you've worked with Terraform, Packer, etc - it's the same.
21:25:02 <b1airo> #link https://github.com/hashicorp/hcl
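(For illustration only: a minimal Nomad batch job of the kind jtopjian describes might look roughly like the sketch below. The job name, datacenter and command are made-up placeholders rather than anything from the PoC, the raw_exec driver has to be enabled in the Nomad client config, and the device stanza assumes the Nomad 0.9 NVIDIA device plugin mentioned above.)

    job "gpu-smoke-test" {
      datacenters = ["dc1"]     # placeholder datacenter name
      type        = "batch"     # batch jobs run once to completion, as in the PoC

      group "gpu" {
        task "check" {
          driver = "raw_exec"   # runs the command directly on the host filesystem

          config {
            command = "/usr/bin/nvidia-smi"   # placeholder command
          }

          resources {
            device "nvidia/gpu" {
              count = 1         # 0.9 device plugin places the task on a node with a free GPU
            }
          }
        }
      }
    }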
21:25:06 <jtopjian> One person "got it" and one person did not.
21:26:15 <jtopjian> I'm quite new to working with data analysis, ML etc and learned that there are a number of different ways users work on data. I assumed we'd just give them a bunch of GPUs with a scheduler and they'd have a field day
21:26:51 <b1airo> heh, users are the hardest part of this world
21:27:04 <jtopjian> The person with positive feedback was at this phase of their work: they had some code they wanted to run repeatedly with different arguments and Nomad did this quite well (you can create "parameterized jobs" which repeat jobs with new input)
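(A sketch of what such a parameterized job might look like; the meta key and script path are illustrative, not the user's actual code. Each dispatch re-runs the same task with the supplied arguments.)

    job "train" {
      type = "batch"

      parameterized {
        meta_required = ["input_url"]   # every dispatch must supply this value
      }

      group "workers" {
        task "run" {
          driver = "raw_exec"
          config {
            command = "/opt/jobs/train.sh"          # placeholder script
            args    = ["${NOMAD_META_input_url}"]   # interpolated per dispatch
          }
        }
      }
    }

    # dispatched repeatedly with new input, e.g.:
    # nomad job dispatch -meta input_url=https://example.org/data-001.tar.gz train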
21:27:31 <jtopjian> The person with negative feedback had no code written and wanted to be as close to the GPU as possible.
21:28:00 <jtopjian> So they wondered why they couldn't get shell access to the GPU and thought it was a bother to have to commit or copy their code somewhere each time they ran a job
21:28:09 <jtopjian> So it was a learning experience.
21:28:38 <oneswig> jtopjian: what is the execution environment nomad creates for a job that might get in the way for this user?
21:29:08 <martial> any insight into nvidia-docker (v2) to try to give people access to the bare metal resources through an abstraction layer (seems to be very "core" OpenStack so far or am I missing something?)
21:29:16 <jtopjian> We used "raw_exec" which just ran a command on the root filesystem. So whatever you have installed on the OS is what you have access to.
21:29:29 <jtopjian> But Nomad also allows for chroot execution, docker execution, etc
21:29:58 <jtopjian> I would have used Docker, but a few users wanted to use Singularity instead.
21:30:19 <jtopjian> So "raw_exec" basically just exec'd the singularity command with an image the user made available
21:31:00 <martial> (the HPC ... we need root squash issue ... re: Docker)
21:31:55 <b1airo> dunno about jtopjian , but i can't parse that question (was it a question?) martial ...
21:31:57 <jtopjian> So that's the summary of the PoC. As for next steps, I've learned that having job submission access is just a small part of what these users want, so I'm going to be looking at something like Kubeflow to see if that can provide better features for them.
21:32:03 <oneswig> Does it assume a network filesystem - or how is data and code copied if not?
21:32:29 <jtopjian> Nomad can download "artifacts". Unfortunately it only supports a few methods such as http and git at the moment.
21:33:12 <jtopjian> In your job spec, you declare the different artifacts you want the worker to download and it'll pull them. The artifacts can also be interpolated, so you can use an input variable to download a specific version of something
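(A rough sketch of the artifact stanza being described; the URL and version variable are illustrative. The worker fetches the source before the task starts, and the interpolated variable picks which version to pull.)

    task "analyse" {
      driver = "raw_exec"

      artifact {
        # fetched over http(s) by the worker before the task starts;
        # the version is supplied as an input (meta) variable
        source      = "https://example.org/code/analysis-${NOMAD_META_version}.tar.gz"
        destination = "local/code"
      }

      config {
        command = "local/code/run.sh"
      }
    }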
21:33:21 <b1airo> jtopjian: do they need to turn these things into services with jobs that will be spun up in the cluster?
21:33:23 <oneswig> And I guess outputs need something too?
21:33:52 <jtopjian> @b1airo we used the "batch" scheduler which ran jobs once
21:34:23 <jtopjian> @oneswig Nomad will save the stdout for review. For saving data somewhere else, that's something I didn't explore.
21:34:52 <oneswig> I guess there's many ways that can be approached.
21:34:55 <b1airo> it sort of sounds like the primary use-case is interactive exploration/design-iteration style work
21:35:16 <jtopjian> However, one thing I was not a fan of was that Nomad does not keep logs around long. It expects you to send logs to a more robust logging system. While I understand the intention of the design, it felt lame
21:35:45 <b1airo> if that's true i wonder whether container orchestration is really buying them anything?
21:36:01 <jtopjian> @b1airo actually, no. This was more for something along the lines of "I have a large dataset that I want to run a known good piece of code on for a few hours"
21:36:39 <jtopjian> And so one piece of learning on my part was that most users are actually in the exploratory phase which made this cluster harder to work with.
21:36:52 <jtopjian> They really just need easy access to, say, Jupyter with GPU
21:37:11 <oneswig> The world needs that :-)
21:37:22 <jtopjian> right?
21:38:44 <b1airo> how had some of them come to Singularity already jtopjian ?
21:40:00 <jtopjian> That's a good question. There were 3-4 users who were very keen on it, but others in the group had not heard of it.
21:40:15 <jtopjian> I'm not sure how the one group came across it
21:40:23 <b1airo> had those 3-4 already been exposed to HPC in some way?
21:40:53 <jtopjian> Yes and I believe they wanted Singularity for this PoC because they were unable to use it in their institution's existing clusters
21:41:35 <martial> (b1airo not a question)
21:42:09 <oneswig> What was Singularity gaining for them, specifically?
21:42:15 <b1airo> which leads me to the next question, why not Slurm/PBS (plus Singularity or whatever) etc instead of these higher level programming interfaces ?
21:43:01 <jtopjian> @oneswig: Again, good question and I'm not sure. Personally, I didn't mind using Singularity for this (I had to learn it to implement it) but I wasn't able to figure out what the big difference between it and Docker was.
21:43:49 <oneswig> Were they wanting to use images from Singularity hub?
21:44:24 <b1airo> i suspect you'd have to sit down with the users for a few hours to see how they were actually using it. i guess they maybe just learned how to create Singularity images and didn't know Docker images are very similar and compatible ?
21:44:44 <jtopjian> @b1airo Slurm was mentioned in the first meeting and we chose not to implement it for two reasons: 1) it's been done and we wanted to take a chance and 2) part of our interaction with users is to... how do I want to say... work with them outside of an academic context. Being a little facetious: I don't see too many Medium articles on Slurm :)
21:45:02 <jtopjian> @oneswig They had their own images to use
21:45:22 <b1airo> Singularity Hub... what are people's thoughts? i went looking for an Intel optimised tensorflow container the other day and it felt like a mess
21:45:43 <oneswig> Not tried using it b1airo
21:46:55 <jtopjian> And "sit down" - yes, this is something we plan to do. We feel there's a big gap between students/researchers/users and infrastructure teams like my group. We want to bridge that
21:48:46 <oneswig> Sometimes people call that bridge you're making "ResOps" - is that familiar?
21:49:04 <jtopjian> It's not! That's new to me :)
21:49:28 <martial> what were the advantages of Nomad for this setup?
21:49:39 <oneswig> kind of derivative but fits the bill
21:50:16 <b1airo> jtopjian: sure, it's great you're getting to do some exploration. though i'd challenge you on the "academic" context - HPC is a much bigger industry than just universities. anyway, i think yours is the first real-world example i've seen of attempting to push the status quo of workload management in this space. convergence of container orchestration and hpc workload management was raised in a couple different forums at SC last year, so you're on trend :-)
21:51:03 <jtopjian> Exactly: from the specs which were discussed, it checked off all the boxes. Nomad, being a single binary, made deployment really quick and simple (ignoring that I had to compile it, but I'm familiar with Go so that was fine). In practice, it helped us uncover a lot of gaps in our own knowledge about data processing.
21:52:12 <jtopjian> To be clear: I was not (and am not) trying to challenge the status quo or indoctrinate people. Part of my job is to do experiments like this.
21:53:08 <jtopjian> If it was requested, I'd have no problems deploying something like Slurm for this group. But if I have the opportunity to explore different approaches, I'll take it.
21:53:39 <martial> still it sounds pretty fun, and a good setup
21:53:47 <martial> Thanks for sharing
21:53:48 <b1airo> some other interesting examples floating around like what CERN is doing on top of Magnum, but that seems to be more about using a dynamic k8s cluster on a per-workload basis
21:54:08 <oneswig> In a similar vein, I'd be very interested to hear about anyone's experiences with Univa NavOps - https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/
21:54:19 <jtopjian> Indeed. It was a great learning experience. Nomad itself works great as a very foundational scheduler. But it's truly foundational. You'll need to add things on top of it to make it more user-friendly.
21:54:59 <jtopjian> In another similar vein, (I can't remember if I mentioned Kubeflow earlier in this meeting), Kubeflow is on my list to look at.
21:56:21 <b1airo> yeah that sounds interesting too, suspect it might have a broader potential audience
21:56:31 <oneswig> Thanks jtopjian - very interesting to hear about your work
21:56:49 <b1airo> you got Magnum in any of your clouds jtopjian ?
21:57:00 <jtopjian> @oneswig very welcome!
21:57:24 <jtopjian> @b1airo Yes, we support both Swarm and Kubernetes
21:57:25 <b1airo> yes +1, thanks jtopjian , interesting stuff
21:57:47 <rbudden> indeed, thanks for sharing
21:57:54 <b1airo> would make a good lightning talk if you're in Denver :-)
21:57:57 <janders> +1 thank you
21:58:16 <janders> great idea!
21:58:18 <jtopjian> Unfortunately I won't make it to Denver :/
21:58:19 <oneswig> going back to the NextPlatform article janders shared - see the reference to "Intel OneAPI" at the top of this - http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2019/03/argonne-aurora-tech.jpg - I wondered if that was an IaaS API but it seems not...
21:59:06 <janders> regarding jmlowe's live migration concerns, I found this:
21:59:08 <janders> https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8401692
22:00:12 <janders> copying hardware state for bm migration sounds like a nightmare. But.. if an identical image were to be spun up on identical hardware, maybe copying the delta wouldn't actually be that bad..
22:00:17 <janders> interesting idea :)
22:00:32 <oneswig> Live migration of bare metal?  "1) power off server, 2) move server to new rack, 3) power on server again?"
22:00:37 <jmlowe> I used to use openvz and was able to live migrate with that
22:01:07 <jmlowe> effectively live migration of containers
22:01:09 <janders> a friend used to leverage dual PSUs and LACP to move servers between racks without powering off... we used to joke that's the bm live migration
22:01:39 <janders> true.. container could be that shim layer they are referring to
22:01:41 <b1airo> the vfio migration work that is being discussed upstream at the moment sounds like it is allowing for vendor-specific state capture and transfer
22:02:10 <b1airo> ooh, we're overtime!
22:02:23 <janders> thanks guys! great chat
22:02:25 <b1airo> thanks all, good turnout!
22:02:27 <oneswig> Thanks all
22:02:27 <janders> see you next week
22:02:33 <rbudden> catch ya later
22:02:34 <b1airo> sounds good!
22:02:39 <b1airo> #endmeeting