11:00:26 #startmeeting scientific-sig
11:00:27 Meeting started Wed Jun 6 11:00:26 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:30 The meeting name has been set to 'scientific_sig'
11:00:40 Afternoon.
11:00:50 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_June_6th_2018
11:00:58 Hi verdurin, good afternoon
11:01:05 Hello
11:01:06 hi
11:01:21 hello all
11:01:41 (might be distracted, working on some instances whose IP addresses show in Horizon but not "openstack server show")
11:02:00 daveholland: intriguing...
11:02:25 AOB :)
11:02:39 keep us updated!
11:03:07 OK shall we get on with the show...
11:03:41 #topic summit roundup
11:04:10 We had a well-attended session. I think there were probably 60 people in for the meeting and 80 for the lightning talks
11:04:13 a good showing
11:04:38 o/
11:05:01 Hey b1airo, evening
11:05:07 #chair b1airo
11:05:08 Current chairs: b1airo oneswig
11:05:11 oneswig: was looking at Tuesday logs, have people thought of more talk recordings to recommend since then?
11:06:13 John's talk on preemptible instances went online. I think it got overlooked.
11:06:45 #link preemptible instances and bare metal containers https://www.openstack.org/videos/vancouver-2018/containers-on-baremetal-and-preemptible-vms-at-cern-and-ska
11:07:34 John also did a pretty comprehensive roundup on Scientific SIG interests from the forum
11:08:12 #link Scientific SIG and forum https://www.stackhpc.com/openstack-forum-vancouver-2018.html
11:08:17 yeah i saw that on twitter, haven't had a chance yet though
11:08:49 it's a good read
11:09:01 thanks daveholland, will pass that on
11:10:41 daveholland: any session picks from you?
11:11:39 there were a few things that stood out... the session on "enterprise problems" was familiar and reassuring (other people also find lost/orphaned instances, for example)
11:12:21 one tiny nugget from a CERN talk that raised eyebrows: 8 minutes average VM boot time? maybe there is pre-work in scheduling GPUs or other scarce resources, or ironic cleaning... but we think 1 minute is on the slow side
11:12:47 daveholland: might be CVMFS-related
11:13:00 That's interesting. I missed most of the CERN talks, alas.
11:13:18 @verdurin interesting point I hadn't thought of, thanks
11:14:16 but I'd assume all the CernVMFS repositories would be local, on-site?
11:14:26 you'd think
11:14:34 I found some other interesting points around operating-at-scale, e.g. Walmart (I think) with dozens of OpenStack deployments/clouds
11:15:22 They would be, yes.
11:15:34 daveholland: yeah i've noticed a few enterprise users have scale in sheer number of deployments, whereas on the scientific side it's more likely to be large individual deployments
11:16:26 the same was true for Ceph when we were at the inaugural advanced user meeting at the Red Hat summit last month
11:16:31 There was a good demo of exactly this in the keynotes - Riccardo Rocha's federated Kubernetes - scale through federation at the platform level instead
11:16:33 are enough people running large enough (single) deployments that there's a feel for where the sensible boundaries are?
Sanger has a single ~5000 core deployment and is looking to add ~4000, currently pondering whether lumping it all in one is sensible
11:17:18 depends what you mean by all in one i guess, lots of different ways to architect
11:17:27 likewise for large Ceph (PB+)
11:17:44 scale-wise the only practical consideration for Nova is the number of hosts in a cell
11:17:47 single cell, 3x HA controllers (Pike)
11:17:57 but that doesn't have to translate into anything user-facing
11:18:41 as far as i remember 300 hosts in a single cell was the max rule of thumb (possibly according to rackspace)
11:19:12 The thorny issue in any discussion on scale is how much it depends on how the control plane is used. What kind of churn is going on at the instance level and how many services are active...
11:19:22 indeed oneswig
11:19:49 and placement adds quite a bit of extra api load it seems
11:20:16 brb
11:20:32 daveholland: ceph-wise the main consideration that i haven't seen any good guidance on is what a sensible limit to the number of OSDs in a pool is
11:21:09 or rather than a pool, a single PG set
11:21:55 so, number of OSDs supporting a PG? we have 3 (for 3-way replication) but I can see using more for EC (which we don't yet do). 3060 OSDs in total which is starting to feel a bit unwieldy
11:23:04 i mean, depending on how your crushmap is implemented (but assuming the default sort of layout), the more OSDs in a pool the greater the chance of concurrent failures and possible data loss
11:23:38 @oneswig, @b1airo: by the way… the vapourware I’ve promised for the last year and a half is almost here (had what looks like almost the final build yesterday). I might *actually* be able to… demo… macOS integration with this thing called Moonshot. :-/
11:24:07 StefanPaetowJisc: hi Stefan, bring it on!
11:24:57 there was a talk many summits ago, possibly in tokyo, from "something"Stack, maybe UnitedStack, talking about how they implemented limited failure domains within crush to increase reliability and improve backfill
11:24:58 Would be interested to see integration with Keystone, or what the options are for this.
11:25:08 Indeed.
11:25:18 blairo: I may not be understanding, will drop you a line
11:25:26 oneswig: haven't even seen a demo and already making feature requests!
11:25:43 b1airo: give an inch, take a mile...
11:26:04 b1airo: I'm interested in this discussion on PG size because it seems counter-intuitive
11:26:45 But perhaps we can carry on in #scientific-wg afterwards - jmlowe might be interested too
11:27:08 (I have to dash promptly at 1pm sorry)
11:27:28 not the PG size (as in num reps) but the total population of OSDs
11:27:39 found the slides from that talk: https://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
11:28:33 blairo: ta, will read/learn/inwardly digest (and then be in touch :) )
11:28:36 Thanks, I'll take a look
11:28:43 ditto...
11:29:00 OK, next item?
11:29:37 #topic gathering best practice on handling controlled-access data
11:30:02 There was some good discussion here at the session.
11:30:27 Somehow the group membership has swung from particle physics to life sciences!
11:31:21 My aim is that if we can gather good practice it might make new content for the HPC book
11:31:39 oneswig: spent nearly all last week hosting a site visit for a large project in this area, so very timely for us.
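
An aside on b1airo's failure-domain point from the Ceph discussion above: a back-of-the-envelope sketch of why a pool spread over more OSDs is more exposed to concurrent failures. All figures are assumptions for illustration - the per-OSD outage probability q, the ~100 PG replicas per OSD, and the example pool sizes are not anyone's real cluster - and PG placements are treated as independent, which a real CRUSH map is not.

    # Toy model (not a CRUSH simulation): each OSD is assumed to be down with
    # probability q at any instant, and the PG count scales with the pool
    # (~100 PG replicas per OSD). With the default layout, a bigger pool means
    # more acting sets that can be wiped out by simultaneous OSD failures.
    def p_some_pg_loses_all_copies(osds, q=0.001, replicas=3, pgs_per_osd=100):
        pgs = osds * pgs_per_osd // replicas       # PG count grows with the pool
        p_one_pg = q ** replicas                   # all replicas of one PG down at once
        return 1 - (1 - p_one_pg) ** pgs           # at least one PG fully unavailable

    for n in (300, 1000, 3000):                    # illustrative pool sizes only
        print(f"{n:5d} OSDs -> risk ~ {p_some_pg_loses_all_copies(n):.1e}")

The UnitedStack slides linked above make the fuller version of this argument, factoring in recovery time and correlated failures, which this toy model ignores.
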
11:31:58 I started an etherpad to start gathering fellow travellers: https://etherpad.openstack.org/p/Scientific-SIG-Controlled-Data-Research
11:32:10 verdurin: interesting, would love to hear more.
11:32:37 first question, what do we mean by "Implementation Standard"?
11:33:15 daveholland: at the Sanger OpenStack event, I met someone looking to combine a trifecta of sensitive data - genomics, patient records and location. Who would that have been?
11:33:51 b1airo: I was thinking of regulatory frameworks. Probably a better term.
11:34:06 Franceso at PHE maybe?
11:34:57 daveholland: not this time, it was someone from Sanger IIRC, possibly working with a population in Africa?
11:35:40 most of the big studies i hear about have some element like this in the data-science aspects
11:36:07 which is where we think the safe haven environment comes into play
11:36:20 humm, not sure, was the angle human, or cancer, or pathogen? that'd nail down which team I ask at least
11:36:44 b1airo: that project @Monash, how can we get them involved? Was it Jerico?
11:37:13 daveholland: human I'd guess but it wasn't a detailed discussion
11:37:23 I think it was likely Martin in human genetics, they have links to Africa
11:38:19 daveholland: could be one of the malaria-related groups? We have people here with a presence at your end.
11:38:31 oneswig: yeah Jerico is the main resource doing technical stuff on it at the moment, and now we are looking to focus on this i'll suggest he joins (will forward him today's logs and the etherpad)
11:39:05 excellent, thanks b1airo
11:39:48 daveholland: I think (somehow) location data was coming from phones / gps as well. Could that use of technology track it down?
11:40:06 oneswig: I will have to ask around, not heard of that aspect
11:40:34 Thanks - I hope I'm not making all this up :-)
11:40:38 oneswig: i was thinking about how to structure things around this topic
11:40:51 i think we need to break out specific areas as it's very broad
11:41:24 makes sense. Different standards / regulatory frameworks will have common requirements and common solutions
11:41:50 there are also multiple layers in any solution to consider
11:42:04 b1airo: yes, data ingest, anonymization, data analysis etc.
11:42:40 were you thinking of layers as human processes vs platform vs infrastructure?
11:42:49 e.g. on the infra side you might need to have your OpenStack cloud implemented in a certain way and need specific controls in the environment, plus also ensure your deployment conforms to reporting requirements etc
11:43:28 that'd be one good way of carving it, yes
11:43:34 b1airo: yes, that was exactly the sort of discussion we were having last week
11:44:02 verdurin: with your guests?
11:44:08 b1airo: yes
11:44:47 Seems like a good approach b1airo - go ahead and note it in the etherpad
11:45:14 verdurin: any specific noteworthy items from that discussion?
11:45:26 so then if you assume your underlying infra risks are well managed you need to move on to the tenant-level infra, guests, networks etc
11:45:50 oneswig: don't think I'm allowed to say much publicly yet - will do as soon as I can
11:46:15 controlled-access data in action :-)
11:47:42 We should leave some time for AOB - any other items to cover here for now?
11:47:46 then you start getting more towards the application level and thinking about whether controlled data access is needed, plus how to get data screened and out of a controlled environment
11:48:41 also proper deletion: wiping TB+ volumes makes performance sad; is deleting the encryption key equivalent to deleting an encrypted volume; etc
11:49:30 plus lots of very specific, but important, points like that
11:50:13 b1airo: it seems a lot of people delegate responsibility for data management to the users. Having trusted users bound by a usage agreement has to play a role here, doesn't it? Is it a non-starter if you don't have users you trust?
11:50:17 i don't have wide knowledge of the various standards / requirements, but i think encryption at rest usually suffices for that daveholland
11:51:23 oneswig: it does seem to be surprisingly common, but i don't think it flies when you start dealing with clinical or commercial data
11:52:04 Providing controlled data to users you can't trust is where things get interesting and I guess that's what underpins the Monash project.
11:52:11 you can't expect scientists to be able to implement best practice IT security
11:53:11 You can’t, no. But making it easier for them helps :-)
11:53:19 I wonder if that's always true, when scientists know the data is sensitive.
11:53:34 oneswig: it's a little bit analogous to the problem the bloomberg terminal solves i suppose
11:53:54 what problem is that?
11:54:05 data is the product
11:54:40 ok, got it
11:54:43 you want to "sell" it (let people use it), but not let them take it away
11:54:56 en masse that is
11:55:09 makes sense.
11:55:29 We ought to move on - time and all that.
11:55:35 #topic AOB
11:56:02 time to turn in that is
11:56:16 I want to hear about daveholland and his mystery IPs
11:56:19 #link Berlin CFP closes end of June! https://www.openstack.org/summit/berlin-2018/call-for-presentations/
11:56:35 daveholland: any update?
11:56:37 Uh-oh.
11:57:07 verdurin: looks like a hypervisor bug, many instances (but not all) on that hypervisor in this state. hypervisor logs an "unexpected VIF plugged" event when the instance reboots (and the instance then has no NIC)
11:57:19 is that a mistake? i feel like it can't be right. previous summits haven't closed 4 months before the actual conference have they?
11:57:41 oneswig: it does feel a bit previous
11:57:48 b1airo: seems pretty harsh, who to ask - Jimmy?
11:57:52 fortunately most of these instances are expendable (dev, or auto-scaled workers) so.... byebye
11:58:19 daveholland: what does the virsh dumpxml say about network vifs?
11:58:22 daveholland: good - grtbr
11:58:26 if I get any closer to finding the cause I'll pass it on. But I need to dash now
11:58:47 don't leave us hanging!
11:58:59 oneswig: do let us know if you hear about the submission date
11:59:10 there's a stubborn sysadmin inside everyone huh
11:59:19 verdurin: will do - I'll open that mail now
11:59:49 looks like we're done
12:00:09 back to Cumulus upgrades...
12:00:16 One other activity from the summit - two new code branches for Ceph RDMA to explore
12:00:19 I'm on it...
12:00:31 hmm sounds promising
12:00:45 b1airo: thought you were turning in...
12:00:58 we are out of time - final comments
12:01:15 w00t - last comment
12:01:26 #endmeeting
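
As a follow-up to the AOB thread on daveholland's mystery IPs, a minimal diagnostic sketch of the kind of cross-check discussed: compare the addresses Nova reports for a server (what "openstack server show" displays) with the Neutron ports bound to it, using openstacksdk. The cloud name "mycloud" and server name "suspect-instance" are placeholders, and this is only a sketch of the check, not the diagnosis reached in the meeting.

    # Compare Nova's view of a server's addresses with the Neutron ports
    # attached to it, to see on which side the addresses went missing.
    import openstack

    conn = openstack.connect(cloud="mycloud")                      # assumed clouds.yaml entry
    server = conn.compute.find_server("suspect-instance", ignore_missing=False)

    print("nova view   :", server.addresses)                       # what `openstack server show` reports
    for port in conn.network.ports(device_id=server.id):           # ports Neutron has bound to the server
        ips = [ip["ip_address"] for ip in port.fixed_ips]
        print("neutron view:", port.id, port.status, ips)

On the hypervisor itself, virsh dumpxml for the affected domain (as oneswig suggested) shows whether the guest still has an <interface> element at all, which would confirm the missing-NIC symptom after the "unexpected VIF plugged" event.
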