11:00:13 <oneswig> #startmeeting scientific_wg
11:00:13 <openstack> Meeting started Wed Aug  2 11:00:13 2017 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:16 <openstack> The meeting name has been set to 'scientific_wg'
11:00:31 <simon-AS5591> o/
11:00:38 <priteau> o/
11:00:40 <johnthetubaguy> o/
11:01:09 <daveholland> \o
11:02:28 <martial_> Hello all :)
11:03:11 <oneswig> Hello \o/
11:03:11 <oneswig> #link agenda for today is https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_August_2nd_2017
11:03:11 <oneswig> greetings all!  Who said that nothing happens in Europe in August? :-)
11:03:11 <oneswig> Blair is in transit currently and may or may not attend.  If he makes it, he mentioned he had an update on his THP fragmentation issues
11:03:12 <oneswig> Martial, you there?
11:03:12 <oneswig> OK, let's get going
11:03:13 <oneswig> aha
11:03:14 <oneswig> #chair martial_
11:03:14 <oneswig> good morning Martial
11:03:15 <oneswig> #topic Workload tracing on Chameleon
11:03:15 <openstack> Current chairs: martial_ oneswig
11:03:38 <oneswig> BTW I am on a train, am likely to go through some long dark tunnels ... apologies in advance
11:03:49 <oneswig> priteau: you have the floor
11:03:52 <priteau> Thanks oneswig
11:03:57 <priteau> I would like to provide an update on our cloud traces work
11:04:05 <b1airo> Evening
11:04:11 <martial_> #chair b1airo
11:04:12 <openstack> Current chairs: b1airo martial_ oneswig
11:04:13 <oneswig> Hi Blair you made it
11:04:14 <b1airo> Hi oneswig, hi priteau
11:04:23 <b1airo> Hi martial_
11:04:24 <priteau> The idea is to provide something similar to the Parallel Workloads Archive or Grid Workloads Archive
11:04:41 <b1airo> Just made it. Not sure I'll last long to be honest - already in bed...
11:04:42 <masber> hi all, sorry to bother, I am new here. The reason I joined this meeting is that I am planning to support our HPC system on OpenStack and was wondering whether this was a good place to ask those questions?
11:04:45 <oneswig> priteau: can you describe what these do and why they are useful?
11:04:45 <priteau> These focus on HPC cluster and grids rather than clouds
11:04:57 <b1airo> Welcome masber !
11:05:00 <oneswig> masber: you've come to the right place :-)
11:05:27 <masber> thank you
11:05:38 <priteau> They provide data about all the jobs that have been run on some infrastructures. They can be used by researchers to simulate workloads that match reality.
11:05:53 <priteau> It can be particularly useful for e.g. a researcher working on job scheduling
11:06:07 <b1airo> If I recall correctly those existing archives are mainly used in scheduling research
11:06:23 <oneswig> kind of like IMIX for network packets?  (a sample of random internet frames)
11:06:49 <priteau> oneswig: I didn't know IMIX, but yes it looks similar in another context
11:06:55 <b1airo> But I think you and Kate were looking for richer data than just instance requests?
11:07:46 <priteau> except that IMIX might only be providing a distribution of packets, rather than actual headers
11:08:15 <oneswig> priteau: feasibly - I've seen it but not used it. Carry on :-)
11:08:32 <priteau> b1airo: eventually yes, I think it could be combined with telemetry data. We started with just instance requests for now.
11:09:01 <b1airo> masber: briefly, this is our group meeting time. We typically have an agenda (posted to the mailing list prior) that includes any other business and discussion time towards the end. But there is also a dedicated channel where we can chat anytime
11:09:12 <oneswig> priteau: is this formed from nova notifications or something higher-level?
11:09:33 <priteau> We've defined a data format which is heavily based on the structure of the Nova database
11:09:43 <priteau> http://scienceclouds.org/cloud-traces/cloud-trace-format/
11:10:16 <priteau> It's at version 0.1 so it may change in the future depending on feedback
11:10:22 <johnthetubaguy> priteau: I am curious why the DB format and not the API format?
11:10:40 <b1airo> Impressive domain name
11:10:42 <masber> b1airo, yes I don't want to hijack the meeting, would you mind sharing the channel I can use to ask my questions?
11:11:10 <priteau> johnthetubaguy: I would assume that they're quite similar, but I will check
11:11:35 <johnthetubaguy> priteau: the API format is a public API contract, the DB changes quite a lot, in general anyways
11:11:47 <priteau> Good point
11:12:06 <priteau> Basically it's using data from instance_actions_events joined with data from the instances table
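[A minimal sketch of the extraction join priteau describes, assuming direct access to a MySQL-backed Nova database; table and column names follow the Nova schema of the time and may differ across releases, which is exactly johnthetubaguy's concern:]

    import pymysql  # assumes a MySQL-backed Nova database

    # Join each low-level action event to its parent action and instance,
    # so user/project and flavor context appear on the same row.
    QUERY = """
    SELECT e.event, e.start_time, e.finish_time, e.result,
           a.action, a.instance_uuid,
           i.user_id, i.project_id, i.vcpus, i.memory_mb
      FROM instance_actions_events e
      JOIN instance_actions a ON e.action_id = a.id
      JOIN instances i ON a.instance_uuid = i.uuid
     ORDER BY e.start_time
    """

    conn = pymysql.connect(host="localhost", user="nova",
                           password="secret", database="nova")
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for row in cur.fetchall():
            print(row)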
11:12:07 <oneswig> priteau: what are you looking for in the data you're collecting?
11:12:15 <b1airo> masber: #scientific-wg
11:12:30 <johnthetubaguy> priteau: actually the notification have some versions, so maybe what b1airo said is a better option
11:12:53 <priteau> oneswig: All events associated with all instances running on a cloud deployment
11:13:34 <priteau> Hopefully enough for a researcher to analyze and simulate the same kind of activity
11:13:40 <johnthetubaguy> priteau: the idea being you can re-run them to see if your job scheduling is better, etc, etc?
11:13:46 <johnthetubaguy> ah, that is a yes then
11:13:50 <priteau> johnthetubaguy: Yes, that's the idea
11:14:32 <martial_> priteau: pretty impressive :)
11:14:40 <oneswig> priteau: what's interesting here is that chameleon is bare metal.  I'm interested to compare this with the usage of a virtualised resource - for scalability considerations
11:15:09 <priteau> oneswig: Actually we've only provided a trace from our KVM cloud so far: http://scienceclouds.org/cloud-traces/chameleon-openstack-kvm-cloud-trace/
11:15:25 <oneswig> I guess also Chameleon users are a fairly unique bunch, who may be hard to characterize
11:15:35 <priteau> We thought it may be more representative and would cover more instance events (such as migration)
11:15:48 <oneswig> priteau: I'll not download that right now or the entire train network will go down :-)
11:16:06 <priteau> it's only 6 MB zipped, barely more than a modern webpage ;-)
11:16:28 <johnthetubaguy> yeah, on a UK train, that might take everyone out
11:16:39 <oneswig> priteau: What are you using to visualise your data?  (please say grafana diagram :-)
11:17:23 <oneswig> johnthetubaguy: thinking of alternative expansions of GWR now...
11:17:37 <priteau> oneswig: We haven't done visualization from these traces, but the student who worked on this is using Grafana for another project related to experiment visualization
11:18:19 <oneswig> From here, I wonder if it is not too many steps to visualise notification data coming from a live system
11:18:30 <priteau> It's still work in progress, but since the student is leaving soon we wanted to release a trace and format to request feedback
11:18:53 <b1airo> johnthetubaguy: I agree it'd certainly be better if we could do this with the APIs and/or notifications
11:19:15 <priteau> One question we have is whether folks are interested in high- or low-level OpenStack events. For example, an action like “migrate” is composed of four separate action events: cold_migrate, compute_prep_resize, compute_resize_instance, and compute_finish_resize.
11:19:42 <priteau> Actions that we saw in our KVM cloud are: create/delete, start/stop, reboot, rebuild, migrate, live-migration, suspend/resume, resize/confirmResize, and pause/unpause.
11:19:43 <b1airo> Problem with notifications is it essentially means needing a trace service for capturing, whereas I imagine most admins would be happier doing one-off dumps
11:19:58 <oneswig> priteau: most interesting for me is the hierarchical connection of an instigating API request and the sub-events it generates.  The level of detail's like a fractal, perhaps
11:20:15 <johnthetubaguy> there are two ways to look at this I think: (1) measure how long each thing took, (2) use it for replay
11:20:24 <oneswig> b1airo: don't ceilometer and stacktach do that?
11:20:30 <johnthetubaguy> yeah for replay, you don't need as many events recorded
11:20:46 <simon-AS5591> b1airo: Personally as an operator I wouldn't mind setting up a notification collector if it seems safe to use and easy to set up.
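[As a rough illustration of what such a collector involves: a bare-bones oslo.messaging listener that appends Nova instance notifications to a trace file. The transport URL and topic name are deployment-specific assumptions:]

    import json

    import oslo_messaging
    from oslo_config import cfg

    class TraceEndpoint(object):
        # oslo.messaging dispatches one method per priority; 'info'
        # covers most Nova instance lifecycle notifications.
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            if event_type.startswith("instance."):
                with open("trace.jsonl", "a") as f:
                    f.write(json.dumps({"event_type": event_type,
                                        "payload": payload}) + "\n")

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url="rabbit://guest:guest@localhost:5672/")
    targets = [oslo_messaging.Target(topic="versioned_notifications")]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [TraceEndpoint()], executor="threading")
    listener.start()
    listener.wait()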
11:20:49 <priteau> oneswig: Have you looked at os-profiler?
11:20:51 <johnthetubaguy> stacktach would collect all the notifications for you, I guess ceilometer should
11:20:59 <priteau> Sorry, osprofiler
11:21:09 <priteau> #link https://github.com/openstack/osprofiler
11:21:12 <oneswig> priteau: ah, knew I'd missed one.  Heard the name, like the face, never used it, alas
11:21:26 <oneswig> works with rally, right?
11:21:29 <johnthetubaguy> os-profiler reads from ceilometer I think? by default
11:21:41 <johnthetubaguy> although that may have changed
11:21:59 <priteau> I think osprofiler can be used with Rally, but not necessarily
11:22:17 <priteau> johnthetubaguy: seems so: https://docs.openstack.org/osprofiler/latest/user/integration.html
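[For context, the basic osprofiler usage pattern looks roughly like this; the HMAC key and trace-point names are illustrative only:]

    from osprofiler import profiler

    # The HMAC key must match the one configured in the traced services.
    profiler.init(hmac_key="SECRET_KEY")

    @profiler.trace("extract-events")
    def extract_events():
        ...  # traced function

    # Trace an arbitrary code section.
    with profiler.Trace("db-extraction", info={"table": "instance_actions"}):
        extract_events()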
11:22:45 <priteau> So I have noted the comments about using the API rather than DB access
11:22:46 <oneswig> priteau: do you think the format you've proposed could be generated through 'distillation' of notification objects, or are they too different?
11:23:28 <priteau> oneswig: I think it would be possible as long as the same data is available in the notification or via an extra Nova API query
11:23:47 <johnthetubaguy> you should be able to get it all from the notifications
11:23:54 <oneswig> priteau: I am not sure, all I know is that notifications are big.
11:24:09 <oneswig> might depend on what data you're doing the join for?
11:24:32 <priteau> it's to have things like user_id, project_id, etc. on the same row
11:25:26 <priteau> BTW we anonymized fields like user_id in case Chameleon users didn't want their username to be shared (our KVM cloud is old and still using the LDAP backend for Keystone, which means that user_id == username)
11:25:57 <oneswig> priteau: you mentioned that your student is nearing completion of their placement.  What are the next steps for you/
11:26:08 <b1airo> Yes that step is going to be very important
11:26:56 <priteau> oneswig: We will release the code to extract this data and would like to see if bigger clouds could share their own traces
11:27:25 <priteau> I think b1airo was interested to share NeCTAR data
11:27:28 <b1airo> I wonder if those id fields could go through a non-reversible hash
11:27:56 <b1airo> Or is that what you're doing already?
11:27:56 <johnthetubaguy> b1airo: I think turbo-hipster from mikal tried to do some of this stuff in the past
11:28:19 <priteau> I think we've used SHA1. Obviously if you've got access to a rainbow table you might be able to reverse it.
11:28:39 <b1airo> Ah yeah I remember that now you mention it johnthetubaguy
11:28:57 <priteau> Maybe we should use scrypt?
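[A sketch of what that might look like: a keyed scrypt hash keeps equal inputs mapping to equal tokens, so the trace stays analysable, while a per-site secret salt defeats rainbow tables. The cost parameters here are illustrative:]

    import hashlib

    SITE_SECRET = b"per-deployment-random-salt"  # never published with the trace

    def anonymise(field):
        # Same input -> same opaque token; reversal is impractical
        # without the secret salt, unlike a plain unsalted SHA1.
        digest = hashlib.scrypt(field.encode(), salt=SITE_SECRET,
                                n=2**14, r=8, p=1, dklen=16)
        return digest.hex()

    print(anonymise("alice"))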
11:29:59 <b1airo> Turbo-hipster was about capturing prod dbs for migration testing right? Is that still happening at all?
11:30:14 <priteau> Are you talking about https://github.com/openstack/turbo-hipster?
11:30:54 <johnthetubaguy> yeah, it did a bit of anonymising of the data
11:31:12 <priteau> interesting
11:31:14 <priteau> #link http://turbo-hipster.readthedocs.io/en/latest/intro.html
11:31:18 <johnthetubaguy> to allow folks to contribute their datasets
11:31:33 <johnthetubaguy> I think Rackspace gave their pre-prod dataset at one point
11:31:53 <priteau> apparently fields are anonymised with https://github.com/mrda/fuzzy-happiness
11:32:01 <johnthetubaguy> its just there may be ideas in there worth borrowing
11:32:25 <oneswig> That's a great idea for improving the test coverage.
11:32:37 <b1airo> priteau: maybe worth trying it to see if the trace format can be populated from a turbo-hipster dump?
11:32:49 <johnthetubaguy> that did catch some upgrade bugs in the past
11:33:07 <johnthetubaguy> hmm, not sure about using its format
11:33:30 <johnthetubaguy> I would try the notification or similar format
11:34:00 <priteau> b1airo: good idea. Do you know if the "variety of real-world databases" that turbo-hipster uses is freely available?
11:34:51 <johnthetubaguy> I am not sure there is much variety in the end, maybe three DBs we ran the upgrade migrations on?
11:35:22 <johnthetubaguy> I suspect they are too old now, but I would have to go dig to find out
11:36:37 <priteau> If they are big enough they may be interesting
11:36:37 <oneswig> This discussion is a good example of data gathered for one purpose taking on an entirely different one
11:36:48 <b1airo> Anyway priteau, this is a good start. I'm sure I can convince Nectar Directorate to share an initial set
11:37:21 <priteau> b1airo: Thanks. We need to do a bit of code cleanup first but I'll be in touch.
11:37:53 <johnthetubaguy> +1 seems like a great start
11:37:56 <oneswig> priteau: thanks for sharing your work
11:38:06 <b1airo> Don't be too fussy about it, we can patch/fix/comment code too as needed
11:38:20 <oneswig> priteau: btw, any progress on the Blazar UI?
11:38:22 <priteau> oneswig: Thanks for allowing me :-)
11:38:39 <martial_> pierre: any trace that can be analyzed?
11:38:40 <priteau> oneswig: yes, it's in review, it should be merged by the end of the week!
11:38:43 <b1airo> Is that in horizon?
11:39:08 <priteau> martial_: what do you mean?
11:39:23 <oneswig> priteau: that's great, well done
11:39:50 <oneswig> OK - few other items to cover, shall we move on?
11:39:50 <martial_> well I looked at the website and listened to your presentation of the project, but now I am trying to see if we can draw comparisons from the collected data
11:39:58 <priteau> b1airo: yep, https://github.com/openstack/blazar-dashboard
11:40:41 <martial_> pierre: we can take that outside of the meeting
11:40:47 <martial_> stig: please go ahead
11:40:47 <priteau> sure
11:41:04 <oneswig> #topic Update from Blair - THP
11:41:11 <oneswig> b1airo: why don't you fit it in here
11:41:24 <b1airo> I would like to look at using Blazar in Nectar too, need to pick your brains at some stage so I can get a 1-pager together on what it might look like
11:41:40 <b1airo> Sure oneswig
11:42:05 <priteau> I would love to help you with that
11:43:03 <b1airo> I now have a repeatable test case that shows simply having a full pagecache before running a compute job can cause a 2-3x slowdown
11:43:17 <oneswig> in the hypervisor?
11:44:16 <b1airo> The effect seems to be on both BM and VM, but VM emphasises it
11:44:50 <oneswig> as in the 'cache' column in vmstat, 'Cached' in /proc/meminfo?
11:45:25 <martial_> how memory heavy is that job?
11:45:40 <b1airo> This particular code (SIMPLE) is a CryoEM structural refinement thing. And I'm just using a Singularity container that is setup to run SIMPLE's built-in speed test
11:45:56 <oneswig> What rune do you use to evict the page cache?
11:46:32 <b1airo> And that test only ends up using about 2GB of memory, but parallelises well on SMP
11:47:20 <b1airo> It is very memory heavy I believe, and has ill-defined random access patterns
11:47:58 <b1airo> I filled the pagecache by simply cat-ing a single file equal to VM ram size
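[For reference, a minimal reproduction of that setup, including the page-cache eviction "rune" oneswig asked about; needs root, and the file path is an assumption:]

    import os

    BIG_FILE = "/tmp/ram-sized-file"  # pre-created, roughly equal to VM RAM

    # Fill the page cache (equivalent of `cat bigfile > /dev/null`).
    with open(BIG_FILE, "rb") as f:
        while f.read(1 << 20):
            pass

    # Evict clean page cache; sync first so dirty pages are written back.
    # Writing "3" would also drop dentries and inodes.
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")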
11:48:00 <oneswig> I wonder what quantity of memory amounts to a full TLB when using 4K pages.
11:48:15 <b1airo> oneswig: very little
11:48:40 <b1airo> TLBs are only a few thousand entries iirc
11:48:55 <oneswig> That's actually more than I thought
11:49:03 <martial_> is it possible the cache is not local (would be strange but that slowdown is impressive)?
11:49:13 <oneswig> So 2GB random access memory patterns would totally kill it
11:49:55 <oneswig> martial_: you're thinking it would involve a write back to networked storage to evict dirty pages?
11:50:02 <martial_> yep
11:50:10 <b1airo> The problem appears to be that the Linux mm is for some reason not able to allocate THPs even when all the memory usage is pagecache and should be droppable immediately
11:50:13 <priteau> b1airo: are you using the default vm.swappiness value?
11:50:41 <martial_> seems unlikely and not natural at all to me, but disk is not that slow
11:51:42 <b1airo> priteau: no swap in the guest at this stage, but I plan to investigate that, as we have seen SLUB allocation problems and the reading I've been doing there makes me think the kernel allocator might still like having a teeny bit of swap around even if it doesn't really use it
11:52:29 <johnthetubaguy> are you doing all the CPU pinning, NUMA passthrough, hugepages etc for the VM?
11:52:41 <b1airo> Yep
11:52:52 <oneswig> b1airo: I wonder if THPs are allocated in a GFP_ATOMIC context, so cannot muck about with other page mappings for fear of a fault
11:53:35 <oneswig> b1airo: When are you going to write all this stuff up?
11:54:22 <b1airo> oneswig: possibly, but I doubt it because normal behaviour is to defrag on fault in order to allocate a THP if possible
11:54:48 <oneswig> b1airo: ah ok, thanks
11:55:01 <b1airo> oneswig: once I've made some decent graphs to pictorialise it
11:55:11 <oneswig> did perf help there?
11:55:42 <priteau> b1airo: I was thinking about the settings on the host
11:56:07 <daveholland> @b1airo did you tweak zone_reclaim_mode at all?
11:56:30 <daveholland> (also the old-fashioned sysadmin in me agrees that having a bit of swap is no bad thing)
11:56:36 <b1airo> Actually I still haven't done that; once I discovered this behaviour I figured inspecting guest vmstat was the better place to start
11:56:56 <johnthetubaguy> I wonder if you are switching THP too much, so smaller pages would be better, as they are quicker to read in?
11:57:32 <b1airo> daveholland: was just about to say that - interestingly, turning on zone reclaim makes the problem go away (at least for this little test case)
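[For anyone wanting to try the same toggle, this is the vm.zone_reclaim_mode sysctl; a minimal sketch, run as root:]

    # Equivalent of `sysctl vm.zone_reclaim_mode=1`. Caveat: zone reclaim
    # can hurt pagecache-heavy workloads such as file servers.
    with open("/proc/sys/vm/zone_reclaim_mode", "w") as f:
        f.write("1\n")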
11:57:35 <martial_> I would second Stig here on the 2GB random access on a THP pagecache; that seems prone to slowdown even if the accesses are simple jumps
11:58:25 <oneswig> Time call!
11:58:25 <b1airo> With zone reclaim on even after filling pagecache the kernel is keeping over 2GB free on my 8GB test VM
11:58:34 <daveholland> b1airo: we came to that via a user seeing soft lockups, https://access.redhat.com/solutions/1560893 (don't know if you need a RH account to view)
11:59:06 <b1airo> By the way, I've confirmed this in both centos7 (3.10 kernel) and with 4.12 from EPEL
11:59:36 <oneswig> guest OS?
11:59:36 <b1airo> Thanks daveholland , will have a look (I do have a RH account)
11:59:45 <b1airo> Centos7
11:59:50 <oneswig> thought you ran Xenial in the hypervisor?
12:00:03 <martial_> blair, take a look at that https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
12:00:04 <b1airo> Ubuntu Trusty host with 4.4 kernel
12:00:30 <oneswig> We are out of time, alas.
12:00:38 <oneswig> Thanks b1airo, good update
12:00:55 <oneswig> have to close now
12:00:55 <b1airo> I did read that already martial_ ;-)
12:01:07 <oneswig> #endmeeting