21:00:27 <oneswig> #startmeeting scientific-sig
21:00:28 <openstack> Meeting started Tue Jan 22 21:00:27 2019 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:32 <openstack> The meeting name has been set to 'scientific_sig'
21:00:38 <oneswig> Let us get the show on the road!
21:00:58 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_January_22nd_2019
21:01:58 <oneswig> #chair martial
21:01:59 <openstack> Current chairs: martial oneswig
21:02:39 <oneswig> While we are gathering, I'll do the reminder - CFP ends tomorrow, midnight pacific time!
21:03:00 <oneswig> #link CFP link for Open Infra Denver https://www.openstack.org/summit/denver-2019/call-for-presentations/
21:03:42 <oneswig> janders: you ready?
21:03:48 <janders> I am
21:04:02 <oneswig> Excellent, let's start with that.
21:04:16 <oneswig> #topic Long-tail latency of SR-IOV Infiniband
21:04:24 <oneswig> So what's been going on?
21:04:31 <janders> Here are some slides I prepared - these cover IB microbenchmarks across bare-metal and SRIOV/IB
21:04:41 <janders> https://docs.google.com/presentation/d/1k-nNQpuPpMo6v__9OGKn1PIkDiF2fxihrLcJSh_7VBc/edit?usp=sharing
21:04:57 <oneswig> Excellent, thanks
21:05:17 <b1air> o/ hello! (slight issue with NickServ here...)
21:05:27 <janders> As per Slide 2, we were aiming to get a better picture of the low-level causes of the performance disparity we were seeing (or mostly hearing about)
21:05:32 <oneswig> Hi b1air, you made it
21:05:36 <oneswig> #chair b1air
21:05:36 <openstack> Current chairs: b1air martial oneswig
21:05:42 <b1air> tada!
21:05:54 <janders> We were also thinking this would help decide what's best run on bare-metal and what can run happily in a (HPC) VM
21:05:59 <martial> bad NickServ
21:06:02 <janders> hey Blair!
21:06:51 <martial> welcome Blair :)
21:07:09 <janders> Slide 3 has the details of the lab setup. We tried to stay as generic as possible so the numbers aren't super-optimised, however we've done all the usual reasonable things for benchmarking (performance mode, CPU passthrough)
21:07:39 <janders> A similar "generic" approach was applied to the microbenchmarks - we ran with the default parameters, which are captured in Slide 4
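For reference, a minimal sketch of the perftest invocations being described, run with default parameters (hostnames are placeholders; each benchmark is a separate server/client run):
    # on the "server" side of the point-to-point test
    ib_write_bw
    ib_write_lat
    # on the "client" side, pointing at the server
    ib_write_bw <server>
    ib_write_lat <server>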
21:08:32 <janders> On that - the results once again prove what many of us here have seen - virtualised RDMA (or IB in this case) can match the bandwidth of bare-metal RDMA
21:08:38 <martial> I asked Rion to join as well (since it is his scripts for Kubespray)
21:09:11 <janders> We're at around 99% efficiency bandwidth wise
21:09:13 <janders> With latency, however, things are different
21:09:36 <oneswig> This is just excellent. :-)
21:09:51 <janders> Virtualisation seems to add a fair bit there, about 0.4 us on top of bare-metal
21:10:09 <janders> (or ~24%)
21:10:11 <b1airo> ok, i've managed to log in twice it seems :-)
21:10:25 <martial> b1airo: cool :)
21:10:33 <janders> the more the merrier :)
21:10:35 <oneswig> Does your VM have NUMA passthrough?
21:11:02 <oneswig> #chair b1airo
21:11:03 <openstack> Current chairs: b1air b1airo martial oneswig
21:11:04 <janders> I haven't explicitly tweaked NUMA, however the CPUs are in passthru mode
21:11:06 <b1airo> thanks oneswig
21:11:20 <b1airo> hugepages...?
21:11:39 <janders> I'm happy to check later in the meeting and report back - just run "numactl --show"?
21:12:11 <b1airo> i note the hardware is relatively old now, wonder how that impacts these results as compared to say Skylake with CX-4
21:12:19 <oneswig> That ought to do it, or if lscpu shows multiple sockets in the guest, I think that also means it's on
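A quick way to check the guest's view of NUMA, along the lines suggested here:
    numactl --show                       # current policy and allowed NUMA nodes
    numactl --hardware                   # node count and per-node memory
    lscpu | grep -i -e numa -e socket    # sockets/NUMA nodes as seen by the VM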
21:12:36 <janders> I was considering applying NFV style tuning, but thought this would make it less generic
21:12:48 <oneswig> janders: what kind of things?
21:12:53 <janders> good point on hugepages though - just enabling that shouldn't have any negative side effects
21:13:11 <janders> NFV style approach would involve CPU pinning
21:13:33 <janders> however my worry there is we'll likely improve the IB microbenchmark numbers at the expense of Linpack
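If the NFV-style pinning were tried, a sketch of the corresponding flavor settings (flavor name is a placeholder; this was not used in the benchmarks discussed above):
    openstack flavor set hpc.pinned \
      --property hw:cpu_policy=dedicated \
      --property hw:cpu_thread_policy=isolate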
21:13:37 <b1airo> this could be a nice start to some work to come up with a standard set of microbenchmarks we could use...
21:13:54 <martial> oh here is Rion
21:14:13 <oneswig> Hello Rion, welcome
21:14:14 <janders> I hope to get some newer kit soon and I'm intending to retest on Skylake/CX5 and maybe newer
21:14:33 <b1airo> surely if you were going to be running MPI code you'd want pinning though janders ?
21:14:39 <deardooley> Hi all
21:14:54 * b1airo waves to Rion
21:15:46 <oneswig> Your page 6, plotting sigmas against one another.  Are you able to plot histograms instead?  I'd be interested to see the two bell curves superimposed.
21:16:17 <b1airo> it's an interesting consideration though - how much tuning is reasonable and what are the implications...
21:17:02 <janders> oneswig: thank you for the suggestion - I will include that when I work on this again soon (with more hardware)
21:17:15 <oneswig> One potential microbenchmark for capturing the effect of long-tail jitter on a parallel job would be to time multi-node barriers
21:17:36 <oneswig> I'm not sure there's an ib_* benchmark for that but there might be something in IMB.
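A sketch of the multi-node barrier timing suggested here, assuming the Intel MPI Benchmarks binary is built and a hostfile lists one slot per node (mpirun flags vary by MPI implementation):
    mpirun -np 2 -hostfile ./hosts ./IMB-MPI1 Barrier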
21:17:59 <janders> The second interesting observation is the impact of virtualisation on standard deviation in results
21:18:29 <janders> interestingly this is seen across the board. In these slides I'm mostly focusing on ib_write_* but I put Linpack there for reference too
21:18:46 <janders> whether it's Linpack, bandwidth or latency, bare metal is heaps more consistent
21:18:59 <oneswig> Is that a single-node Linpack configuration?
21:19:05 <janders> in absolute numbers the fluctuation isn't huge, but in relative numbers it's an order of magnitude
21:19:09 <janders> yes, it's single node
21:20:05 <janders> I considered multinode but was getting mixed messages from different people about the potential impact of interconnect virtualisation on the results, so I thought I'd better keep things simple
21:20:22 <janders> this way we have the overheads measured separately
21:20:59 <oneswig> I'd love to know more on the root causes (wouldn't we all)
21:21:03 <janders> I think the variability could likely be addressed with NFV style tuning
21:21:14 <janders> at least to some degree
21:21:39 <janders> with the latency impact, I think the core of it might have to do with the way IB virtualisation is done, however I've never heard the official explanation
21:22:02 <janders> I feel it likely gets better for larger message sizes
21:22:59 <b1airo> seems to me that once you start doing cpu and/or numa pinning and/or static hugepage backing, then you really need to commit to isolating specific hardware for that job and you create just a few instance-types that fit together on your specific hardware nicely. then you probably also have some other nodes for general purpose. so perhaps there are at least three interesting scenarios to benchmark: 1) VM tuned for general purpose hosts; 2) VM tuned for high-performance dedicated hosts; 3) bare-metal
21:23:25 <janders_> sorry got kicked out :(
21:23:41 <janders_> 1 2 3 testing
21:23:50 <oneswig> Did you catch b1airo's message with 3 cases?
21:24:11 <janders_> unfortunately not, I lost everything past "08:22] <janders> given the bandwidth numbers are quite good"
21:24:49 <janders_> b1airo: can you re-post please?
21:24:56 <janders_> sorry about that
21:25:26 <martial> helping b1airo - reposting: <b1airo> seems to me that once you start doing cpu and/or numa pinning and/or static hugepage backing, then you really need to commit to isolating specific hardware for that job and you create just a few instance-types that fit together on your specific hardware nicely. then you probably also have some other nodes for general purpose. so perhaps there are at least three interesting scenarios to benchmark: 1) VM tuned for general purpose hosts; 2) VM tuned for high-performance dedicated hosts; 3) bare-metal
21:25:55 <janders_> indeed
21:26:27 <janders_> have you guys had a chance to benchmark CPU pinned configurations in similar ways?
21:26:33 <oneswig> This test might be interesting, across multiple nodes: IMB-MPI1 Barrier
21:26:44 <janders_> I wonder if pinning helps with consistency
21:26:57 <janders_> and what would be the impact of the pinned configuration on local Linpack?
21:27:05 <oneswig> janders_: yes, I did some stuff using OpenFOAM (with paravirtualised networking)
21:27:16 <b1airo> thanks martial - i was in the bathroom briefly
21:27:44 <janders_> (I suppose we would likely lose some cores for the host OS - and I wonder to what degree the performance improvement on the pinned cores would compensate for that)
21:28:32 <oneswig> You can see the impact of pinning in simple test cases like iperf - I found it didn't increase TCP performance much but it certainly did help with jitter
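A sketch of the kind of pinned iperf test mentioned here (assuming iperf3; core numbers are arbitrary):
    # server, pinned to one core
    taskset -c 4 iperf3 -s
    # client, pinned likewise, 30-second TCP run
    taskset -c 4 iperf3 -c <server> -t 30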
21:28:40 <b1airo> yes, i believe pinning does help with consistency
21:28:47 <janders_> I think the config proposed by Blair is a good way forward - my worry is whether users will know if they need the max_cores configuration (20-core VM on a 20-core node) or the NFV configuration
21:29:29 <b1airo> i think the question of reserved host cores is another interesting one for exploration...
21:29:51 <janders_> I tried Linpack in 18 core and 20 core VMs in the past and 20 core was still faster
21:30:15 <janders_> despite the potential of scheduling issues between the host and the guest
21:30:18 <b1airo> i would contend that if most of your network traffic is happening via SR-IOV interface then reserving host cores is unnecessary
21:30:56 <oneswig> b1airo: makes sense, unless those cores are also doing work for the libvirt storage
21:31:20 <oneswig> OK, we should move on.  Any more questions for janders?
21:31:58 <oneswig> janders_: one final thought - did you disable hyperthreading?
21:31:59 <b1airo> ah yes, good point oneswig - i guess i was thinking of storage via SR-IOV too, i.e., parallel filesystem
21:32:05 <janders_> yes I did
21:32:20 <janders_> no HT
21:32:43 <oneswig> We found that makes a significant difference.
21:32:57 <janders_> I typically work with node-local SSD/NVMe for scratch
21:33:04 <b1airo> to Linpack oneswig ?
21:33:08 <janders_> and a parallel fs indeed mounted via SRIOV interface
21:33:23 <oneswig> b1airo: haven't tried that.  Other benchmarks.
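For completeness, a sketch of checking and disabling SMT at runtime (BIOS is the usual place to turn hyperthreading off; the sysfs knob assumes a 4.19+ kernel):
    cat /sys/devices/system/cpu/smt/active       # 1 = hyperthreading on
    echo off | sudo tee /sys/devices/system/cpu/smt/control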
21:33:31 <janders_> on the local scratch it would be interesting to look at the impact of qcow2 vs lvm
21:34:02 <janders_> lvm helps with IOPS a lot, and in a scenario like the one we're discussing, where there's little CPU left for the host OS, it might be even more useful
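A sketch of switching Nova's ephemeral disk backend from qcow2 to LVM on a compute node (volume group name is a placeholder; assumes crudini is available; the service name varies by distro):
    crudini --set /etc/nova/nova.conf libvirt images_type lvm
    crudini --set /etc/nova/nova.conf libvirt images_volume_group nova-ephemeral
    systemctl restart nova-compute    # or openstack-nova-compute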
21:34:52 <janders_> so - b1airo - do you think in the configuration you proposed, would we need two "performance VM" flavors?
21:35:02 <janders_> max_cpu and low_latency (pinned)?
21:35:11 <b1airo> janders_: we could talk more on this offline perhaps, i'd be keen to try and get some comparable benchmarks together as well
21:36:14 <oneswig> janders_: you need Ironic deploy templates... let's revisit that
21:36:19 <janders_> OK! being mindful of time, let's move on to the next topic. Thank you for your attention and great insights!
21:36:19 <oneswig> OK time is pressing
21:36:31 <oneswig> #topic Terraform and Kubespray
21:36:42 <oneswig> deardooley: martial: take it away!
21:36:51 <martial> so I invited Rion to this conversation
21:37:09 <martial> but the idea is simple, I needed to deploy a small K8s cluster for testing on top of OpenStack
21:37:21 <martial> internally we have used Kubespray to do so
21:37:44 <martial> to deploy a Kubernetes cluster (one master and two minion nodes) in a pre-configured OpenStack project titled nist-ace, using Kubespray and Terraform.
21:38:14 <martial> the default Kubespray install requires the creation of pre-configured VMs
21:38:17 <martial> #link https://github.com/kubernetes-sigs/kubespray/blob/master/docs/openstack.md
21:38:44 <martial> Terraform has the advantage of pre-configuring the OpenStack project for running the Ansible playbook, given information about networking, users, and the OpenStack project itself. Terraform then handles VM creation and configuration.
21:38:51 <oneswig> Given Kubespray is Ansible, why the need for preconfiguration?
21:39:02 <martial> #link https://github.com/kubernetes-sigs/kubespray/blob/master/contrib/terraform/openstack/README.md
21:39:34 <martial> that was also my question, but the ansible script did not create the OpenStack instances
21:39:37 <martial> terraform will
21:40:00 <deardooley> @oneswig it's a pretty common pattern. terraform is much faster at pure provisioning than ansible, but config management is not its strong suit. ansible is a good complement once the infrastructure is in place.
21:40:15 <martial> you obviously need the OpenStack project's RC file
21:41:20 <martial> once you have this sourced you are able to create the terraform configuration file to include master/minion number and names
21:41:37 <martial> images, flavors, IP pools
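A rough sketch of this setup step, following the Kubespray Terraform/OpenStack README (exact paths and variable names may differ between Kubespray versions; the RC filename is a placeholder):
    source nist-ace-openrc.sh
    cp -LRp contrib/terraform/openstack/sample-inventory inventory/nist-ace
    cd inventory/nist-ace && ln -s ../../contrib/terraform/openstack/hosts
    # then edit cluster.tfvars: number of masters/minions, names, image, flavors, network and floating IP pool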
21:42:36 <oneswig> How do you find Terraform compares to expressing the equivalent in Heat?
21:43:34 <martial> given that kubespray has its own ways of spawning on top of OpenStack, I did not try heat for this purpose
21:44:00 <janders_> in the context of a Private Cloud, would it make sense to disable port-security so that we don't need to worry about address pairs?
21:44:16 <janders_> or do you see value in having this extra layer of protection?
21:44:32 <martial> supposedly Terraform creates the private network needed for k8s communication, on top of the OpenStack project's own network
21:45:36 <martial> once the configuration is done, it's a matter of calling terraform init
21:45:46 <martial> followed by terraform apply
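Roughly, from the inventory directory created earlier (the exact invocation depends on the Kubespray version):
    terraform init ../../contrib/terraform/openstack
    terraform apply -var-file=cluster.tfvars ../../contrib/terraform/openstack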
21:46:06 <oneswig> What do you configure for Kubernetes networking?  Does it use Weave, Calico, ?
21:46:18 <deardooley> within the context of kubespray, the terraform provisioner will handle all security group creation and management for you as part of the process. You will need to implement any further security at the edge of your k8s apiservers on your own.
21:47:53 <martial> (with a little extra obviously) that creates the OpenStack resources
21:48:28 <martial> I was checking in my configuration file and I do not see the k8s networking set
21:48:45 <deardooley> it's pluggable. defaults to calico. there are some tweaks you need to make in your specific inventory depending on the existence of different openstack components in your particular cloud.
21:48:52 <oneswig> Seems like a lot of interesting things are happening around blending the layers of abstraction, and interfacing with OpenStack to provide resources (storage, networking, etc) for Kubernetes - eg https://github.com/kubernetes/cloud-provider-openstack - does Kubespray support enabling that kind of stuff?
21:48:53 <martial> the default seems to be Calico
21:49:25 <oneswig> ... would be cool if it did
21:49:35 <deardooley> for example, to plug into external loadbalancers, dynamically configure upstream dns, use cinder as persistent volume provisioner, etc.
21:49:59 <oneswig> deardooley: that kind of stuff
21:50:35 <deardooley> yeah. it's all pluggable with the usual caveates.
21:51:20 <martial> (I am kind of tempting Rion here to think about a presentation at the summit on the topic of Kubespray deployment on top of OpenStack)
21:51:39 <oneswig> Have you been using those cloud-provider integrations and is it working?
21:53:00 <martial> I have not
21:53:14 <deardooley> I use them on a couple different openstack installs.
21:54:13 <deardooley> they work, but there are ways to get the job done, and there are ways to get the job done and keep all your hair and staff intact
21:54:35 <oneswig> deardooley: sounds familiar :-)
21:55:43 <deardooley> it's likely anyone on this channel could pull it off in a day or two by pinging this list, but once you do, you'll appreciate the "Hey Google, create a 5 node Kubernetes cluster" demo in a whole new way.
21:56:26 <deardooley> that being said, once you get your config down, it really is push button to spin up another dozen in the same environment.
21:56:44 <oneswig> deardooley: in your experience, what does Kubespray do wrong / badly?  Does it have shortcomings?
21:56:46 <martial> after Terraform pre-configures everything, it is simply a matter of running the ansible playbook
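That step is roughly the standard Kubespray run (inventory path as created earlier; options vary by site):
    ansible-playbook --become -i inventory/nist-ace/hosts cluster.yml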
21:57:11 <b1airo> certainly sounds like it could be an interesting presentation topic
21:57:27 <deardooley> it can build a production scale cluster for you. it can't do much to help you manage it.
21:57:45 <b1airo> i think i need a diagram of the openstack - kubespray - terraform interactions
21:57:53 <martial> see Rion "it could be an interesting presentation topic" :)
21:58:04 <martial> (wink)
21:58:38 <deardooley> as long as you treat nodes idempotently and get sufficient master quorum defined up front, it's not a huge issue. when something goes weird, just delete it, remove the node, and rerun the cluster.yml playbook with a new vm.
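A sketch of that delete-and-rerun flow (node, server and path names are placeholders):
    kubectl delete node k8s-node-2         # drop the misbehaving node from the cluster
    openstack server delete k8s-node-2     # remove the VM
    terraform apply -var-file=cluster.tfvars ../../contrib/terraform/openstack   # recreate a replacement
    ansible-playbook --become -i inventory/nist-ace/hosts cluster.yml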
21:58:44 <oneswig> We are nearly at time - final thoughts?
21:58:52 <b1airo> one other motivation type question? why not Magnum?
21:59:37 <martial> in my particular case, the person I was working with wanted K8s to test containers ... truth is given how many containers they want, docker swarm might be enough
21:59:46 <deardooley> flexibility, portability across cloud providers, security across multiple layers of the infrastructure and application stack, logging, monitoring, etc...
21:59:58 <deardooley> +1
22:00:12 <oneswig> OK, we are at time
22:00:22 <martial> and I will second Rion's comment of "remove" "rerun", that was very useful for testing things
22:00:29 <oneswig> Thanks deardooley martial - interesting to hear about your work.
22:00:38 <janders_> great work, thanks guys!
22:00:47 <oneswig> Until next time
22:00:50 <oneswig> #endmeeting