11:01:00 #startmeeting scientific-sig
11:01:01 Meeting started Wed Mar 27 11:01:00 2019 UTC and is due to finish in 60 minutes. The chair is martial_. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:01:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:01:04 The meeting name has been set to 'scientific_sig'
11:01:11 #chair oneswig
11:01:12 Current chairs: martial_ oneswig
11:01:16 up, up and away
11:01:24 Good morning team
11:01:35 Greetings
11:01:59 #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_March_27th_2019
11:02:09 Morning.
11:02:17 Morning verdurin
11:02:30 Today we have an exciting agenda (as always :) )
11:02:38 #topic Evaluation of hardware-specific machine learning systems orchestrated in an OS cluster at NIST
11:03:04 We welcome Maxime and Alexandre to talk to us about the work they did on their project for NIST
11:03:28 Thank you for coming!
11:03:38 For people that want to follow along at home, the link to the outline for their presentation is at
11:03:41 #link https://docs.google.com/document/d/190v0XtuVt1oH7yhPwjpuQ6mFLhL1dNcjNwTIF7Y9n9w/edit?usp=sharing
11:03:55 and it has been accepted as a lightning talk for the Denver Summit
11:04:05 What is Multimodal Information?
11:04:07 #link https://www.openstack.org/summit/denver-2019/summit-schedule/events/23243/evaluation-of-hardware-specific-machine-learning-systems-orchestrated-in-an-os-cluster-at-nist
11:04:23 Alex, Maxime: the floor is yours
11:04:32 Thanks for welcoming us, that's a great opportunity to dry-run our presentation!
11:05:43 Thank you for this opportunity. Multimodal Information is a group of researchers that performs evaluations of machine learning systems
11:05:56 So first of all: hi everyone, we (Alexandre Boyer and Maxime Hubert) are working at NIST and we are going to present to you how we automated our machine learning evaluation process using OpenStack.
11:06:33 The Multimodal Information Group, which is part of the National Institute of Standards and Technology, often performs advanced evaluations and benchmarking of the performance of computer programs.
11:06:34 The main actors are:
11:06:34 - The performers, whose systems will be evaluated
11:06:34 - The data providers, who collect and send the data used to evaluate the performers’ systems
11:06:34 - The evaluators, who design the evaluation steps and report the results
11:07:29 This is how we usually proceed:
11:07:29 - Collect and send the data sets to the performers: data collection/delivery
11:07:29 - Collect the system outputs from the performers
11:07:29 - Score/analyse the results
11:07:29 However, this approach has some drawbacks:
11:07:30 - The data collection task may be (very) costly, and has to be performed for every new evaluation
11:07:30 - Some agencies may not want their data to be released.
11:07:31 - Systems are being evaluated on “known” data (some ML systems are good at learning)
11:08:14 All these constraints led us to rethink the evaluation process.
11:08:14 New evaluation tasks involve in-house system runs:
11:08:14 - Collect the data sets: data collection, deliver a validation data set
11:08:14 - Collect and validate the computer programs (aka systems delivery)
11:08:14 - Run the systems over the collected data, in our sequestered environment.
11:08:14 - Score/analyse the results
11:08:14 This process raises a new question: how do we deliver a system and guarantee that it will run in-house?
11:08:15 - OS virtualization (containers) or full virtualization (VMs) to avoid the configuration hassle
11:08:15 - The system can be run against the validation data set in advance
11:08:16 Once the systems are delivered and validated, the evaluator still has to run them on the evaluation data sets. This could be done manually, but it would require a lot of time if many systems have to be evaluated against many data sets.
11:09:35 What kind of machine learning frameworks were you using and what were the use cases?
11:09:54 To support what project?
11:11:04 Also, explain the constraints that control the use of resources and the solutions to alleviate those, please
11:11:32 this process was applied during a NIST evaluation consisting of six systems performing activity detection on CCTV footage: some dozens of hours of video were processed by the six systems in different chunks
11:12:23 Any special considerations with the GPUs?
11:12:42 The teams designing the systems had the freedom to pick whatever framework they wanted.
11:13:34 The configuration consisted of the following:
11:13:35 16 cores (32 hyperthreaded)
11:13:35 128GB of memory
11:13:35 4 x Nvidia GTX1080Ti
11:13:35 40GB root disk space
11:13:35 256GB of ephemeral SSD disk space mounted to /mnt/
11:14:14 per VM or system?
11:16:28 One system consisted of one VM only so far, but we expect to receive multi-VM systems in the next few months
11:16:59 so 4x GPU per VM ... nice :)
11:17:51 And this was with GPU passthrough, not Ironic?
11:18:10 Yes it was
11:19:51 Did you deploy OpenStack specifically for managing this evaluation work?
11:20:21 you are listing "4 x Nvidia GTX1080Ti" in a server system or a desktop system (ie which chassis was used to support the non-server version)?
11:20:54 Yes, we deployed OpenStack specifically for this evaluation work
11:21:27 what prevents exfiltration of information?
11:23:06 In case people have questions, would you be okay adding your contact information in the outline document?
11:23:18 Did you use specific node-cleaning in between runs?
11:24:52 We would give one node to each team so they could integrate their system on it first through an OpenStack project, and collect the images and volumes once they were ready.
11:24:52 The last step for us was to automate the runs of the systems: we abstracted the system runs so all pairs of systems/data sets can be executed the same way. A job scheduler would instantiate the systems and trigger them. This only required a change to our systems delivery: we collected and validated the computer programs wrapped into a standard CLI/API.
11:25:49 martial what do you mean by server system or desktop system?
11:26:06 Data exfiltration was prevented by cutting the internet access to the VMs before plugging in our datasets
11:27:05 1080s have a connector for power that makes them hard to use in a server form factor usually
11:27:11 Detail on that last step you described would be interesting
11:27:23 martial_: we have a number of server systems like that
11:28:00 alexandre_b + verdurin: investigating server box to use :)
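A minimal sketch (not part of the meeting, and not the NIST code) of the instantiation step described above, as an in-house job scheduler might drive it through the OpenStack SDK. The cloud, image, flavor, and network names below are assumptions for illustration; the team notes later that their real setup goes through Heat templates rather than direct server calls.

# Sketch: boot one evaluation VM, wait for it, and tear it down after the run.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")        # credentials taken from clouds.yaml (assumed cloud name)

image = conn.compute.find_image("team-a-system-v3")       # snapshot delivered by a performer (hypothetical name)
flavor = conn.compute.find_flavor("eval.16c.128g.4gpu")   # 16 vCPUs, 128 GB RAM, 4x GPU passthrough (hypothetical name)
network = conn.network.find_network("sequestered-net")    # hypothetical isolated network

server = conn.compute.create_server(
    name="eval-team-a-run-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)             # block until the instance is ACTIVE

# ... trigger the system through its agreed CLI/API entrypoint and collect the outputs ...

conn.compute.delete_server(server)                        # free the node (and its GPUs) for the next run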
11:28:00 Instantiating the VMs over and over again would sometimes make our nodes run out of memory; we had this problem when we had to deal with several image versions at the beginning. So the node cleaning would consist of deleting the past images on the nodes
11:29:22 maxhub_: Sure, I was wondering whether you had other cleaning processes in addition, given that you imply the need for strong isolation between groups
11:32:10 what are your lessons learned?
11:32:36 No we didn't, each system instantiation offered a clean environment so we didn't feel the need for any other cleanup
11:33:06 How OpenStack made this possible for us:
11:33:06 Systems delivery:
11:33:06 - Opening an OpenStack project per team for a limited amount of time allowed performers to remotely integrate their system on our hardware, which can be specific (a particular set of GPUs, SSDs). Performers have one VM per system.
11:33:06 - The ability to “snapshot” the state of a system made it easy to deliver, as well as easy to re-deploy. One system can consist of one or several images and volumes.
11:33:06 OpenStack already provided all the mechanisms to help us improve our systems delivery: it's easier for the evaluator to collect the systems, and easier for the performers, who can integrate their systems directly.
11:33:07 Run the systems:
11:33:07 - An NFS server instance using OS Cinder allowed us to distribute large amounts of data to different systems at the same time
11:33:08 - The use of VMs + OS security group mechanisms helped us meet the requirements in terms of security and protection against data exfiltration
11:33:08 - VM snapshots guaranteed the replicability needed when performing experiments
11:33:09 - A job scheduler using the OS API helped us to automate the instantiation and execution of our systems.
11:34:17 Lessons learned: OpenStack is a good tool to work with, but it is not designed to support the instantiation of a lot of VMs at the same time
11:34:23 "A job scheduler using the OS API" - what was that?
11:34:57 so basically using the openstack CLI to run the VM?
11:35:31 Yes, after abstracting the systems into Heat templates
11:35:40 any "wrapper" type technology to start/control the process within the VM?
11:36:12 The systems delivery would require teams to implement specific entrypoints
11:36:39 We use a job scheduler developed in-house, similar to Slurm; it allows us to instantiate/delete OpenStack instances and sequester the network thanks to the OpenStack API
11:37:06 so firewall rules?
11:37:14 Yes
11:38:39 anything else that you want to share on your work that you have not shared yet?
11:39:02 or is the rest secrets for the lightning talk? :)
11:39:29 Right now we are working on generalizing this evaluation process,
11:39:57 This would imply using more resource management technologies (Terraform for example)
11:40:01 Speaking of lightning talks, oneswig, do you think we have space available for them to talk to our audience?
11:40:29 And using a more advanced job scheduling technology (Airflow)
11:40:44 cool!
11:40:59 Thanks for listening to us!
11:41:13 martial_: I expect so.
11:41:50 maxhub_: alexandre_b: Thanks to both of you.
11:42:21 Are you done with the project now or will you continue to work on it?
11:42:24 was looking for the link to the Etherpad to share with them, do you have it handy oneswig ?
11:42:59 I am not sure if there is one yet. I think Blair might have set one up.
11:43:46 Yes, thanks for presenting.
11:43:47 We will be working on it for at least another year
11:44:09 Thanks for listening to us
11:44:41 I hope the next steps go well
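A minimal sketch of the network-sequestration mechanism described above (security group rules used as firewall rules to cut traffic before the evaluation data is plugged in). This is an illustration with assumed names via the OpenStack SDK, not the NIST job scheduler itself.

# Sketch: move a running evaluation VM onto a security group with no rules,
# so it can send and receive no traffic during the scored run.
import openstack

conn = openstack.connect(cloud="evaluation-cloud")         # assumed cloud name

sequester = conn.network.create_security_group(
    name="sequestered",
    description="no ingress/egress during scored runs",
)
# A freshly created security group carries default egress-allow rules;
# delete them so the group permits no traffic at all.
for rule in conn.network.security_group_rules(security_group_id=sequester.id):
    conn.network.delete_security_group_rule(rule)

server = conn.compute.find_server("eval-team-a-run-01")    # hypothetical instance name
default = conn.network.find_security_group("default")
conn.compute.add_security_group_to_server(server, sequester)
conn.compute.remove_security_group_from_server(server, default)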
11:46:27 OK, let's move on
11:46:37 #topic Open Infra Days London
11:46:58 This is 1st April (ie Monday), and there's a scientific track
11:47:06 (was looking for Etherpad link, no luck)
11:47:13 including a nice chap from Oxford, I believe...
11:47:24 (maxhub_ alexandre_b will share with you when I have it)
11:47:54 verdurin: there's always plenty of interest in the AAI system on your infrastructure
11:47:57 cool :)
11:49:37 #topic Denver Forum Sessions
11:50:30 No dates yet?
11:51:01 We had a mail from Rico Lin to highlight this session
11:51:11 #link Help most needed for SIGs and WGs https://www.openstack.org/summit/denver-2019/summit-schedule/events/23612/help-most-needed-for-sigs-and-wgs
11:51:52 martial_: for the PTG session? I don't think I've seen anything confirmed
11:53:41 For the forum session, we'd need to re-think what the gaps may be for OpenStack support for research computing.
11:54:19 And that's probably a session in itself.
11:55:10 I would agree
11:56:14 any update on ISC?
11:56:59 #topic AOB
11:57:03 ISC
11:57:11 Not heard anything yet I don't think
11:58:04 Just checking when we should hear
11:58:30 April 10th
11:58:38 for ISC, I am on the program chair for HPCW
11:58:43 #link https://easychair.org/cfp/hpcw2019
11:59:03 "5th High Performance Containers Workshop - In conjunction with ISC HIGH PERFORMANCE 2019"
11:59:22 (program committee)
11:59:32 Interesting, not seen that before
12:00:02 Christian Kniep was organizing it
12:00:13 We started this conversation at SC19
12:00:28 I was wondering where Christian was, assumed he'd be involved somewhere...
12:00:29 so inviting submissions obviously :)
12:00:46 Let's put it on the agenda for next time
12:00:49 Time to close!
12:00:53 Waiting to hear back from him, but not at Docker anymore it seems
12:00:53 Final comments?
12:01:06 Thanks for a great session everybody :)
12:01:15 Indeed, thanks
12:01:19 #endmeeting