20:59:19 #startmeeting scientific-sig
20:59:20 Meeting started Tue Jul 10 20:59:19 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:59:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:59:23 The meeting name has been set to 'scientific_sig'
20:59:36 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_July_10th_2018
20:59:46 greetings
21:00:14 o/
21:00:23 Afternoon trandles
21:00:49 I was hoping to catch up with you on the latest status for Charliecloud
21:01:00 bonsoir oneswig ;)
21:01:14 ah yeah, I can fill you in on Charliecloud developments
21:01:26 Hi Stig! :)
21:01:33 In fact I am in German-speaking territories right now, so guten Abend, I guess...
21:01:44 Hey janders, how's things?
21:01:51 Hi Stig
21:02:02 martial_: bravo, you made it, good effort
21:02:05 #chair martial_
21:02:06 Current chairs: martial_ oneswig
21:02:17 oneswig: good, thank you :)
21:02:26 how are you?
21:02:33 OK, I think the agenda might be a quick one today...
21:02:40 Berlin :)
21:02:51 janders: all good here, wanted to ask you about your news later
21:02:59 janders: oh go on then
21:03:08 #topic Berlin CFP
21:03:14 It's a week off!
21:04:07 I am hoping to put a talk (or two) in on some recent work - hope you'll all do the same
21:04:16 Which of the three topics would you be interested to hear about: 1) CSIRO baremetal cloud project report 2) Cloudifying NVIDIA DGX1 or 3) dual-fabric (Mellanox Ethernet+IB) OpenStack?
21:04:21 trandles: I assume you're going to Denver?
21:04:37 er... all of the above?
21:04:38 Dallas, but I know what you meant ;)
21:04:41 I can do one, perhaps two out of the three
21:05:05 trandles: understandable, but dang all the same...
21:05:38 I'm not sure who to drink a G&T with now!
21:05:51 oneswig: yeah I'd prefer a trip to Berlin but I have HQ duties at SC for at least the next two years :( Have to get the Summit organizing committee to pick a week that doesn't conflict.
21:06:22 Is there a Summit system going to your place?
21:06:36 Ah. Different summit.
21:06:36 doh
21:06:43 and I have to get our SC18 BoF situated. I have a K8s person, 2x from Docker and our OpenStack crowd as well
21:07:03 martial_: that's even bigger than your panel at Vancouver!
21:07:04 oneswig: that would be great - it's a real PITA that OSS & SC clash pretty much every time
21:07:33 martial_: if you think of it, put me in touch with Hoge to discuss HPC containers + OpenStack
21:07:56 ok - I consider the notice served on Berlin CFP.
21:08:02 Yes Tim, sorry - I got your email, it's just in the queue and things keep getting added
21:08:11 martial_: no worries :)
21:08:18 Can we fit a containers-on-openstack update in here?
21:08:33 Charliecloud news?
21:09:42 Charliecloud news-in-brief: 1) new version coming soonish but I don't know of anything hugely impactful in the changelog 2) several times now I've been pinged about being able to launch baremetal Charliecloud containers using nova but I haven't followed up on it
21:10:38 if this crowd would like a brief talk on Charliecloud I'd be happy to do one at this meeting sometime
21:10:53 trandles: that would be great! :)
21:11:02 On 2, what's that all about?
21:11:04 discuss our HPC container runtime philosophy and why Charliecloud is the way it is
21:11:13 Tim: yes please
21:11:24 trandles: any news on OpenHPC packaging?
21:12:12 trandles: seconded on the SIG talk, yes please
21:12:23 well, for 2), I'm not sure what folks are thinking and no one has brought it up enough for me to make it a priority, so maybe consider this a very generic request for more info
21:13:15 I can tell you we have a lot of users that use TACC that are really unhappy with Singularity
21:13:22 Given Charliecloud takes docker containers as input, isn't this a bit odd?
21:13:23 we are investigating gVisor
21:13:23 oneswig: we (one of the exascale project focus areas) just got an OpenHPC presentation from the founders and their slides said Charliecloud is in the new release
21:13:35 martial_: care to elaborate?
21:13:53 I suspect gVisor isn't going far, seems to be a fix at the wrong level with the user-space interposer
21:13:59 gVisor IIRC is more of a container hypervisor-type thing
21:15:07 oneswig: I agree about the oddness of wanting to launch Charliecloud using nova, hence my lack of interest in seriously running down the idea
21:15:16 oneswig: we have this issue where they are forced to use Singularity but the analytics are constrained within a docker container and shifting it to Singularity is limiting
21:16:02 plus we are hoping to remove root access. I have been pitching Charliecloud as an alternative but TACC runs Singularity so it's hard to force the issue
21:16:07 martial_: perhaps this applies to a number of the non-docker runtimes?
21:16:53 martial_: They'll need to upgrade their kernel - but on the plus side they'll also get the Mimic Ceph client!
21:17:09 at a higher level, there is an ECP container working group discussing two general use cases 1) containerizing HPC microservices, 2) runtime requirements for containerized HPC jobs, and 3) containerizing HPC applications
21:17:19 *three general use cases...
21:17:59 Tim: if you have some time and are willing to discuss Charliecloud for root isolation (vs Singularity or gVisor) :)
21:18:12 "book 5 in the increasingly inaccurately named trilogy..."
21:18:28 oneswig: +1
21:18:33 martial_: let me know when
21:18:50 martial_: what's the main issue with users having root - is it a data access issue or a risk for the container host itself? Or all of the above?
21:18:52 FYI - I'm out on work travel July 14-25, so likely missing the next two of these meetings
21:18:59 checking if our Chris is around
21:19:14 trandles: are we talking to the Charliecloud marketing dept?
21:19:48 oneswig: no comment (might be the PR department)
21:20:03 tim: data exfiltration protection
21:20:31 that's a nice segue to the next topic on the agenda, isn't it martial_?
21:20:46 always here to help :)
21:20:48 Ah, indeed, have we come to a conclusion on HPC containers?
21:21:06 #topic survey on controlled-access data
21:21:15 pencil me in for a Charliecloud talk at the August 7 meeting
21:21:25 Just wanted to draw everyone's attention to new data on our etherpad
21:21:31 cool, I will try to bring in my crowd too
21:21:51 trandles: someone else will need to pencil that one as I'll be on vacation, alas...
21:22:02 (not alas for the vacation however)
21:22:13 vacation? Stig ... it is for science!
21:22:35 I'll pack my telescope...
21:22:46 #link Updates on the etherpad https://etherpad.openstack.org/p/Scientific-SIG-Controlled-Data-Research
21:22:53 oneswig: heading to Western Australia? South Africa? :)
21:23:17 janders: not this time...
21:23:33 Geneva to CERN next week however, really looking forward to that!
21:24:15 OK - on the etherpad, the search has turned up a few prominent projects
21:24:33 oneswig: excellent, enjoy.
21:24:41 I was hoping these might get people thinking of other items to add
21:25:11 oneswig: pro-tip - check out Charly's in Saint Genis for a friendly CERN crowd
21:25:20 I have a meeting 8/7 that ends at 5pm, so I might be penciled in on that one too
21:26:13 I am sharing this with Khalil and Craig because we are in conversation with places for data access too (with ORCA), so they might be able to extend the list
21:26:25 Etherpad editing isn't usually a spectator sport so I wanted to leave that one with people
21:26:54 martial_: that sounds really good, I think we could do with more data from their domain
21:27:55 I would like to get some guest speakers in, if possible. Martial - got any NIST connections for FedRAMP?
21:29:36 oneswig: Charly's is pretty cool indeed :)
21:29:59 trandles: janders: now I'll have to go!
21:31:09 OK, shall we move on?
21:31:15 stig: Bob Bohn
21:31:34 Stig: I was thinking on it, he can help you with that, you know Bob, right?
21:31:51 martial_: I do indeed. I'll drop him a mail, thanks
21:32:12 :)
21:32:48 #topic AOB
21:33:00 janders: how's the IB deploy going?
21:33:50 oneswig: making good progress. This week I'm training colleagues who are new on the project, so less R&D happening.
21:34:03 An interesting observation / hint:
21:34:34 Make sure that the GUIDs are always specified in lowercase
21:34:41 Otherwise Bad Things (TM) will happen
21:34:53 sounds ominous
21:35:10 Like - the whole SDN workflow accepting the uppercase GUIDs end to end... till the Subnet Manager is restarted
21:35:28 then all the GUIDs disappear and all the IB ports go down
21:35:32 janders: I'm just now thinking how I want to incorporate IB into my OpenStack testbed. OK if I bug you with thoughts/questions?
21:35:45 sure! :)
21:35:58 jacob.anders@csiro.au
21:36:04 trandles@lanl.gov
21:36:10 I'll fire something off tomorrow, thx
21:36:29 trandles: are you after a virtualised/SRIOV implementation, baremetal or both?
21:37:05 SRIOV primarily, but I'd like to hear about the SDN stuff especially
21:37:25 plus I might be able to extend the use case to OPA with another testbed
21:37:47 just to wrap up the uppercase GUID challenge: it seems that input validation for API calls is different than for reading config files on startup. This will be addressed in a future (hopefully the next) OpenSM release :)
21:38:06 rumor is the fabric manager is much happier with the dynamic stuff than an IB subnet manager
21:38:16 It was interesting to debug in R&D - not so much if it somehow happened in prod - that would have been catastrophic
21:38:48 I have to run folks... until next time...
21:38:55 thanks trandles
21:39:06 see you trandles
21:39:17 janders: is it in production now? At what scale?
21:39:55 oneswig: not yet. The requirements changed a fair bit
21:39:59 (bye Tim)
21:40:08 seems pretty cool janders
21:40:27 It turns out that the cybersecurity research system will be completely separate - so we'll build this first
21:40:31 it will be 32 nodes
21:40:37 (most likely)
21:40:51 janders: did you end up using Secure Host?
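The lowercase-GUID advice above lends itself to being enforced in tooling rather than by convention. Below is a minimal Python sketch, not something discussed in the meeting, which canonicalizes a GUID before it is written into an OpenSM config file or handed to an SDN API; the accepted input formats (optional "0x" prefix, optional ":" separators) are an assumption for illustration.

```python
import re

def normalize_guid(guid: str) -> str:
    """Return a 64-bit InfiniBand GUID as "0x" + 16 lowercase hex digits.

    The failure described above was that uppercase GUIDs were accepted
    end-to-end through the SDN/API path but dropped once the subnet manager
    re-read its config, so emitting lowercase everywhere avoids the mismatch.
    Accepted input formats here are an assumption, not from the meeting.
    """
    g = guid.strip().lower().replace(":", "")
    if g.startswith("0x"):
        g = g[2:]
    if not re.fullmatch(r"[0-9a-f]{16}", g):
        raise ValueError(f"not a 64-bit GUID: {guid!r}")
    return "0x" + g

if __name__ == "__main__":
    print(normalize_guid("0x248A0703009C1E72"))  # -> 0x248a0703009c1e72
```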
21:41:08 and then we'll be looking at a larger system, possibly an order of magnitude larger than that
21:41:27 oneswig: not yet, but it's on the roadmap - thanks for reminding me to remind Mellanox to send me the firmware
21:41:47 we had some concerns about SecureHost on OEM-branded HCAs but I'm happy to try it, got a heap of spares
21:42:00 if a couple get bricked while testing, not a big deal
21:42:12 janders: indeed - I assume it breaks warranty
21:42:29 Do you have firmware for CX5?
21:42:47 no, SH FW is CX3 only at this stage AFAIK
21:42:54 ancient, I know
21:43:14 this doesn't worry me as most of my current kit is CX3, however this will change soon so I hope to work this out with mlnx
21:43:34 keep up the good fight :-)
21:43:43 I will :)
21:44:09 another update that you might be interested in is on running nova-compute in ironic instances
21:44:14 it works pretty well
21:44:31 janders: what's that?
21:44:49 my current deployment got pretty badly affected by the GUID case mismatch issue, but other than that it *just works*
21:45:32 I figured that if I have ironic/SDN capability, there is no point in deploying "static", nominated nova-compute nodes
21:45:46 better to make everything ironic-controlled - and deploy as many as needed at a given point in time
21:46:33 if no longer needed - just delete and reuse the hardware, possibly for a "tenant" baremetal instance
21:46:39 So you're running nova hypervisors within your Ironic compute instances?
21:46:46 correct
21:46:50 That's awesome!
21:47:03 networking needs some careful thought, but when done well it's pretty cool
21:47:24 there will be a talk proposal submitted on this with a vendor, should be coming in very soon
21:47:25 How does that even work - you won't get multi-tenant network access?
21:48:01 with or without SRIOV?
21:48:09 both work, but the implementation is very different
21:48:43 If your Ironic node is attached to a tenant network, how does it become a generalised hypervisor?
21:49:19 are you familiar with Red Hat / tripleo "standard" OpenStack networking (provisioning, internalapi, ...)?
21:49:58 sure
21:50:13 I define the internalapi network as a tenant network
21:50:26 same with externalapi (which is router-external)
21:50:41 then, I boot the ironic nodes meant to be nova-compute on the internalapi network
21:50:48 this way they talk to the controllers
21:50:57 the router allows them to hit external API endpoints, too
21:51:19 and - in the non-SRIOV case we can run vxlan over internalapi too (though this is work in progress, I've got the SRIOV implementation going for now)
21:51:36 I guess you're not using multi-tenant Ironic networking, otherwise you wouldn't be able to present other networks (genuine tenant networks) to your new hypervisor, right?
21:52:02 Or are all tenant networks VXLAN?
21:52:13 the CX3 SRIOV implementation uses pkey mapping
21:52:29 there's the "real" pkey table and then each vf has its own virtual one
21:53:08 this seems to work well without the need for pkey trunking
21:53:34 so - now that you asked me I will test this more thoroughly, but I did have it working with VMs in multiple tenant networks
21:53:51 sounds promising, keep us updated!
21:54:07 there might be some extra work required for CX4 and above, but it is doable
21:54:52 I will :) I will test this again more thoroughly and I'm happy to drop you an email with results if you like?
21:56:09 yes please janders, this is very interesting
21:56:16 will do!
21:56:24 OK we are nearly at the hour, anything else to add?
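As a rough illustration of the approach janders outlines above (control-plane networks such as internalapi defined as ordinary Neutron tenant networks, with the would-be nova-compute nodes booted on internalapi as Ironic-backed instances), here is a minimal openstacksdk sketch. The cloud name, image, flavor and CIDR are placeholders rather than values from the actual deployment, and the post-boot configuration that installs and registers nova-compute on the node is out of scope.

```python
import openstack

# Placeholder cloud/image/flavor names; the real deployment's values are
# not given in the discussion above.
conn = openstack.connect(cloud="mycloud")

# internalapi as an ordinary Neutron tenant network (externalapi would be
# created the same way and attached to an external-facing router).
internalapi = conn.network.create_network(name="internalapi")
conn.network.create_subnet(
    network_id=internalapi.id,
    name="internalapi-subnet",
    ip_version=4,
    cidr="172.16.2.0/24",
)

# Boot a baremetal (Ironic-backed) instance on internalapi; once deployed
# and configured, it runs nova-compute and reaches the controllers over
# this network, with external API endpoints reachable via the router.
image = conn.compute.find_image("overcloud-compute")
flavor = conn.compute.find_flavor("baremetal")
server = conn.compute.create_server(
    name="ironic-hypervisor-0",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": internalapi.id}],
)
conn.compute.wait_for_server(server)
```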
21:56:28 just to finish off I will quickly jump to my opening question
21:56:31 1) CSIRO baremetal cloud project report 2) Cloudifying NVIDIA DGX1 or 3) dual-fabric (Mellanox Ethernet+IB) OpenStack?
21:56:44 which one do you think will get the most interest (and hence best chances of getting into the Summit)?
21:57:01 with 1) I might have even more interesting content for the Denver timeframe
21:57:17 but it's doable for Berlin, too
21:58:37 Number 2 - I think there is strong competition for this. It might be interesting to do the side-by-side bake-off hinted at by 3
21:58:49 (just my personal view, mind)
21:59:07 No harm in submitting all options!
21:59:14 thanks Stig! :)
21:59:14 OK, we are out of time
21:59:21 I already have one in, so limited to 2
21:59:30 thanks all
21:59:35 #endmeeting