Monday, 2019-11-25

*** auk has joined #kata-dev02:33
*** auk has quit IRC06:27
*** lpetrut has joined #kata-dev07:16
*** jodh has joined #kata-dev07:52
*** sgarzare has joined #kata-dev08:06
*** jodh has quit IRC08:42
*** jodh has joined #kata-dev08:44
*** davidgiluk has joined #kata-dev08:55
*** sameo has joined #kata-dev09:05
*** gwhaley has joined #kata-dev09:07
*** pohly has joined #kata-dev10:38
pohlystefanha: hello. I am trying to understand how (and how well) virtio-fs supports mmap. Background: I work on PMEM-CSI, a driver which enables the use of PMEM in Kubernetes. Ultimately the goal is that an application can do mmap(MAP_SYNC) and then do byte reads/writes directly to the underlying hardware. That works without kata-containers involved. I now looked at kata-containers 1.9.1 with the kata-qemu-virtiofs. I can see that this10:42
pohly passes the dax-capable filesystem (XFS, in case that this matters) into the qemu instance with virtiofs. A test program can do mmap(MAP_SYNC) on a file.10:42
pohlyBut... it can also do that with 9p as file system and with the container root filesystem served by virtio-fs although that filesystem on the host does not support dax (hosted by plain SSD).10:43
pohlyI was under the (perhaps mistaken) impression that virtio-fs would somehow support mmap. I thought I had read that somewhere. Is that really true?10:45
pohlyI checked the /proc/<pid>/maps for the /opt/kata/bin/qemu-virtiofs-system-x86_64 process that runs the pod. It doesn't have any entry for the file that currently is mapped inside the container.10:46
brtknrpohly: following this discussion10:51
davidgilukpohly: is the mount mounted with DAX?10:57
pohlyYes: kataShared on /data type virtio_fs (rw,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,dax)10:58
pohlyThat is inside qemu.10:58
pohlyAnd also outside of it: /dev/mapper/ndbus0region0fsdax-e7660acd0fd86e6aea32589af51903654f6a4e41 on /var/lib/kubelet/pods/6576fed5-5488-4ee4-a6a2-578c5519ae9c/volumes/kubernetes.io~csi/my-csi-volume/mount type xfs (rw,relatime,attr2,dax,inode64,noquota)10:59
stefanhapohly: virtio-fs isn't intended for pmem.  QEMU won't use MAP_SYNC.11:00
stefanhapohly: If you need MAP_SYNC semantics then QEMU's nvdimm device can do that.11:01
stefanhapohly: MAP_SYNC support could be added to virtio-fs but today it doesn't do that.11:01
pohlystefanha: if virtio-fs doesn't support MAP_SYNC, shouldn't it then reject the mmap call?11:01
stefanhapohly: Probably.  Inside the guest the virtio-fs and FUSE code isn't doing anything that violates MAP_SYNC,11:03
stefanhabut the problem is that the host side doesn't necessarily honor those semantics.11:03
pohlyBut plain mmap works?11:03
stefanhapohly: Yep, plain mmap is supported.11:03
pohlyShould I then see a /proc/*/maps entry for the file? I don't have that.11:04
pohlyOr am I checking the wrong process? I looked at qemu-virtiofs-system-x86_64, because that is where the code runs.11:04
stefanhapohly: There isn't necessarily a 1:1 mmap relationship between guest application mmaps and host qemu-virtiofs-system-x86_64 mmaps.11:05
stefanhapohly: What are you trying to confirm by looking at qemu-virtiofs-system-x86_64 mmaps?11:05
pohlyLooking more closely I do see one entry that has at least the right size: 7f2f1bffe000-7f2f1bfff000 ---p 00000000 00:00 0 11:06
pohlyBut it doesn't have a file name associated with it. Should it have that?11:06
pohlyI am trying to verify that a file on the host has indeed been mapped into the address space of the process running inside qemu.11:07
pohlyIf that isn't the case, then how does mmap support work?11:07
stefanhapohly: The lack of filename could be due to file descriptor passing11:07
stefanhaThe file is opened by virtiofsd and passed to QEMU.  Maybe that's why no name is reported.11:08
stefanhaBut that's just a guess.11:08
pohlyThat might be it. Let me remove the mapping inside qemu...11:08
davidgilukthe name normally does show up11:08
davidgilukpohly: Have you accessed the mmap'd area, or just done the mmap?11:08
pohlyJust the mmap. So it's waiting for a page fault before doing anything on the host side? I can add that.11:10
stefanhaYes, that sounds likely.11:11
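The "access the mmap'd area" step that davidgiluk asks about can be sketched as below. This is a minimal illustration rather than pohly's actual test program: the helper name `touch_mapping` and any path passed to it are hypothetical. It maps a file with plain MAP_SHARED and writes one byte per page, so that every page is faulted in instead of only being reserved by mmap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `path` shared and write one byte into each page.
 * Returns the number of pages touched, or -1 on error. */
long touch_mapping(const char *path, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* Make sure the file is at least as large as the mapping. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long touched = 0;
    for (size_t off = 0; off < len; off += (size_t)page) {
        p[off] = 0xA5; /* the store triggers the page fault */
        touched++;
    }

    msync(p, len, MS_SYNC); /* flush dirty pages to the file */
    munmap(p, len);
    return touched;
}
```

Calling `touch_mapping` on a file under the virtio-fs mount (for example under /data in the mount shown earlier) and pausing before `munmap` would leave time to inspect the host side.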
stefanhapohly: But again, if your goal is to get pmem semantics then virtio-fs in its current state doesn't guarantee that.11:11
davidgilukpohly: Yes, I think so - remember for virtiofs we only have a fixed sized cache window, so we can't guarantee to mmap the whole region11:12
stefanhapohly: QEMU has -device nvdimm and -device virtio-pmem-pci for that.11:12
pohlyUsing those for a mounted filesystem in kata-containers isn't going to be easy.11:13
pohlyvirtio-fs looked much more promising ;-}11:14
*** pcaruana has joined #kata-dev11:14
davidgilukstefanha: What stops us passing the MAP_SYNC all the way through?11:15
pohlydavidgiluk: even if you do, "fixed size cache window" sounds like another big roadblock. PMEM comes in higher capacity than DRAM, that's partly why it is appealing for some workloads.11:17
pohlyMAP_SYNC isn't even needed for all workloads. In fact, most apps currently don't depend on it.11:18
pohlySo virtio-fs may already be a good step forward and sufficient.11:18
pohlyOTOH, if it needs to set up and tear down mappings on the host side often, then that may affect performance.11:19
pohlymemcached uses PMEM as DRAM replacement and stores its data there. Predictable access times for that data probably is important.11:20
davidgilukpohly: Right; if you've got a single PMEM device to pass through then as stefan says using the -device stuff is the right way; if you're trying to pass through files that on the host are mounted on a filesystem that's backed by pmem, then virtiofs might be interesting11:20
pohlydavidgiluk: we are trying the former. PMEM-CSI basically splits up a single PMEM device and hands out portions of it to individual apps. We cannot assume that only a single app uses that device; that would be rather limiting.11:22
pohlyAhem, I meant "we are trying the latter"...11:22
davidgilukpohly: But do the PMEM-CSI portions look like individual block devices that you then put a filesystem on, and is that filesystem built in the host or the guest?11:23
pohlydavidgiluk: it is a block device. But applications in Kubernetes typically will ask for a filesystem, so PMEM-CSI formats and mounts that device.11:26
*** sameo has quit IRC11:26
pohlyAnd then Kubernetes passes the directory name of the mounted FS to the runtime.11:26
pohlyI heard that kata-containers sometimes does tricks like then passing the device into qemu and mounting again inside.11:27
pohlyThat's a bit dirty, because there are two Linux kernels which both might write to the same block device.11:27
davidgilukpohly: OK, if it's a device+filesystem just for that container then it does feel like passing that block device into the container is right rather than passing the filesystem through virtiofs11:27
pohlydavidgiluk: yes, that would be the better alternative, except for the "is already mounted" part.11:28
pohlyAlso, does it have to be some actual device? Currently the block devices are either LVM logical volumes or PMEM namespaces (/dev/pmem*).11:29
pohlyWe can't use PCI device pass-through - it's not even on the PCI bus.11:30
pohlyNor do we want to pass in the entire NVDIMM.11:30
*** openstack has joined #kata-dev11:39
*** ChanServ sets mode: +o openstack11:39
*** openstack has joined #kata-dev11:51
*** ChanServ sets mode: +o openstack11:51
davidgilukpohly: Yeh probably best to make an issue; I'm also not sure the best way to wire it through - but if it looks like a block device, and that block device is intended just for this container, then treat it as a block device and let the guest handle it11:51
*** irclogbot_1 has quit IRC11:52
*** irclogbot_2 has joined #kata-dev11:52
gwhaleypohly: include 'devimc' on that Issue, if not already - he'll have a good idea I think of what knitting would be required.11:52
gwhaleyyes, the hard bit is how to annotate that volume/mount/device to ensure it ends up mapped via the correct route. It may be that 'annotations' are the route.11:53
gwhaleyoh, amshinde might have good input as well11:53
gwhaleyso, historically we've always noted that nvdimm/dax could be used to pass items in (kata uses it for iirc the kernel image, or is it the rootfs....) - but, I don't believe there is a defined mechanism to set that all up via the orchestrators and runtime, and I don't think I've ever seen anybody actually using an nvdimm/dax mount/map for themselves ... yet....11:55
pohlygwhaley: /opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.9.1_agent_d4bbd8007f.img is passed via "-object" + "-device nvdimm".11:59
pohlyLooks like the rootfs. There's also "root=/dev/pmem0p1".12:00
gwhaleypohly: right, the rootfs for the VM (I can never remember if it is the rootfs or the kernel we do it with ;-) )... so, we use it, we know it works.... now it would be how do we enable 'users' to do it...12:00
pohlydavidgiluk: to get closure on this: when actually writing into the memory mapped region via virtio-fs, I do see map entries on the host side, including the file name.12:03
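The host-side check described here (looking for the file name in /proc/<pid>/maps of the QEMU process) can be sketched as follows. This is an illustrative helper under stated assumptions, not a tool used in the discussion: the function names are hypothetical, and the parsing relies only on the usual column layout of a maps line (address, perms, offset, dev, inode, optional pathname).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the pathname column of one
 * /proc/<pid>/maps line contains `needle`, else 0. */
int maps_line_has_path(const char *line, const char *needle)
{
    assert(line != NULL && needle != NULL);
    int pos = -1;
    /* Skip address, perms, offset, dev and inode; %n records where the
     * optional pathname column starts. */
    int rc = sscanf(line, "%*s %*s %*s %*s %*s %n", &pos);
    (void)rc;
    if (pos < 0 || needle[0] == '\0')
        return 0;
    return strstr(line + pos, needle) != NULL;
}

/* Count matching lines in a maps file such as "/proc/self/maps" or
 * "/proc/<pid>/maps". Returns -1 if the file cannot be opened. */
int count_mappings(const char *maps_path, const char *needle)
{
    FILE *f = fopen(maps_path, "r");
    if (!f)
        return -1;
    char line[4096];
    int hits = 0;
    while (fgets(line, sizeof line, f))
        hits += maps_line_has_path(line, needle);
    fclose(f);
    return hits;
}
```

An anonymous entry like the one quoted earlier (`---p 00000000 00:00 0`) has no pathname column and therefore never matches, while a file-backed entry does once the mapping has been set up on the host.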
pohlydavidgiluk: how large is this "fixed size cache window"?12:03
davidgilukpohly: It's configurable via an option, normally a few GB12:04
gwhaleyhttps://github.com/kata-containers/runtime/blob/master/cli/config/configuration-qemu-virtiofs.toml.in#L118-L131 :-)12:05
* davidgiluk disappears for 2 hours12:05
* gwhaley goes for lunch...12:05
pohlySo a lot less than the hundreds of GB that people may have as PMEM. Might be worth testing how that affects performance. Thanks!12:06
* pohly too12:06
* pohly lunch...12:06
*** lpetrut has joined #kata-dev12:53
pohlystefanha: should I also file a bug about rejecting MAP_SYNC? Where?13:46
pohlyOh, in case someone wants to follow, the issue about adding PMEM support is here: https://github.com/kata-containers/runtime/issues/2262 13:46
*** canyounot has joined #kata-dev14:02
stefanhapohly: Sorry, I was offline.  Please file it here: https://gitlab.com/groups/virtio-fs/-/issues 14:10
pohlystefanha: which project? "libfuse"?14:13
pohlyNote that 9p has the same issue, so it might be common to fuse-based filesystems.14:14
stefanhapohly: linux please14:16
stefanhapohly: virtio-9p is not FUSE-based.14:16
pohlyOh, okay.14:16
*** devimc has joined #kata-dev14:22
*** fuentess has joined #kata-dev14:36
*** sameo has joined #kata-dev14:36
pohlystefanha: never mind. I made a slight mistake in my test program (MAP_SHARED instead of MAP_SHARED_VALIDATE) and the effect is that MAP_SYNC gets silently ignored, as specified in the man page.14:39
stefanhaaha! :)14:46
stefanhaSo now mmap(2) rejects the flag?14:46
pohlyYes.15:02
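The difference pohly ran into can be sketched as below. This is a minimal illustration and not the original test program: the helper name `try_mmap_flags` and the file path are hypothetical. Per the mmap(2) man page, MAP_SYNC is silently ignored when combined with MAP_SHARED and is only validated when combined with MAP_SHARED_VALIDATE, in which case mmap fails (typically with EOPNOTSUPP) on a file that does not support DAX.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values from the Linux UAPI headers
 * (asm-generic, as used on x86). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical helper: try to map `len` bytes of `path` with `flags`.
 * Returns 0 on success, otherwise the errno of the failing call. */
int try_mmap_flags(const char *path, size_t len, int flags)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return errno;
    if (ftruncate(fd, (off_t)len) != 0) {
        int err = errno;
        close(fd);
        return err;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    int err = (p == MAP_FAILED) ? errno : 0;
    if (p != MAP_FAILED)
        munmap(p, len);
    close(fd);
    return err;
}
```

With `MAP_SHARED | MAP_SYNC` the call succeeds even on a filesystem without DAX, which matches the misleading result seen earlier with 9p and the container root filesystem. With `MAP_SHARED_VALIDATE | MAP_SYNC` the same call returns an error there.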
*** lpetrut has quit IRC16:51
*** dklyle has quit IRC16:53
*** dklyle has joined #kata-dev16:54
*** sgarzare has quit IRC17:04
*** igordc has joined #kata-dev17:18
*** devimc has quit IRC17:33
*** devimc has joined #kata-dev17:34
*** devimc has quit IRC17:40
*** devimc has joined #kata-dev17:41
*** devimc has quit IRC17:57
*** devimc has joined #kata-dev17:58
*** jodh has quit IRC18:05
*** gwhaley has quit IRC18:13
*** igordc has quit IRC18:39
*** noahm has joined #kata-dev18:49
*** igordc has joined #kata-dev19:55
*** davidgiluk has quit IRC20:09
*** sameo has quit IRC20:42
*** igordc has quit IRC20:47
*** igordc has joined #kata-dev20:53
*** pcaruana has quit IRC21:38
*** pohly has quit IRC21:55
*** canyounot has quit IRC22:07
*** devimc has quit IRC23:07
