21:04:16 <timburke> #startmeeting swift
21:04:17 <openstack> Meeting started Wed Nov 13 21:04:16 2019 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:04:20 <openstack> The meeting name has been set to 'swift'
21:04:30 <timburke> who's here for the swift meeting?
21:04:31 <kota_> o/
21:04:35 <clayg> o/
21:04:39 <mattoliverau> o/
21:04:39 <alecuyer> o/
21:04:43 <rledisez> o/
21:05:18 <timburke> sorry to run a bit late; was upgrading the OS on my laptop
21:05:18 <clayg> alecuyer: rledisez: I feel like I *just* saw you guys!?
21:05:40 <alecuyer> chinese flashback!
21:05:41 <tdasilva> o/
21:05:54 <timburke> on that note...
21:05:58 <timburke> #topic PTG recap
21:06:01 <rledisez> clayg: did you travel in time when flying home? that may be the explanation ;)
21:06:40 <timburke> as expected, it was wonderful seeing everyone who made it :-)
21:06:48 <clayg> rledisez: so funny story, I started my travel day by going to the wrong airport - but... i made it home
21:06:56 <timburke> everyone else, hopefully i'll get to see you in vancouver
21:07:03 <clayg> VANCOUVER!!!
21:07:11 <timburke> clayg, oh jeeze! glad you still made it home!
21:07:15 <clayg> i hear vancouver has a bunch of chinese millionaires
21:08:01 <timburke> we had some really good discussions, hopefully we more or less kept the etherpad updated
21:08:04 <timburke> #link https://etherpad.openstack.org/p/swift-ptg-shanghai
21:08:06 * rledisez wishes he was a canadian millionaire in china
21:08:15 <mattoliverau> lol
21:08:17 <tdasilva> lol
21:09:03 <timburke> we met a new Korean operator -- if you see seongsoocho in -swift, you ought to say hi :D
21:09:21 <timburke> had a good ops feedback session
21:09:21 <mattoliverau> nice :)
21:09:24 <timburke> #link https://etherpad.openstack.org/p/PVG-swift-ops-feedback
21:09:31 <clayg> it's 6am in seoul
21:09:51 <clayg> i'm sure kota_ and mattoliverau wouldn't complain if we wanted to make the meeting a little later 🤷‍♂️
21:10:40 <mattoliverau> 8am here now, so it isn't too bad. But it's 6 in Tokyo, isn't it kota_?
21:10:58 <mattoliverau> daylight savings for the win here
21:11:11 <tdasilva> when is ptg in vancouver?
21:11:36 <timburke> i want to say june? let me find the email...
21:11:40 <kota_> mattoliverau: yup
21:11:55 <kota_> same as the Tokyo timezone
21:11:59 <rledisez> yep, 8 to 11 june I think
21:12:13 <mattoliverau> Seeing as I'm not in the cloud team at SUSE anymore, I probably won't have any travel funding to go. I guess I could attempt to get a talk in; maybe if it gets accepted they'll send me. I'd have to talk to my new manager.
21:12:13 <rledisez> I'm sure it ends on June 11th
21:12:13 <timburke> http://lists.openstack.org/pipermail/foundation/2019-September/002794.html says Jun 8-11
21:13:02 <timburke> mattoliverau, there's also the travel support program: https://wiki.openstack.org/wiki/Travel_Support_Program
21:13:14 <timburke> i should make sure seongsoocho knows about it, too
21:13:15 <mattoliverau> that's true, I have used it before :)
21:13:53 <timburke> big takeaways i got out of the week (and feel free to chime in with corrections or additional info):
21:14:12 <clayg> tdasilva: aren't you going to New Zealand at some point?  Can you pick up matt on your way back to the states?
21:14:35 <tdasilva> heh, sounds like a good idea
21:14:47 <timburke> on LOSF: the main cluster that rledisez and alecuyer needed this for is getting phased out, so its future is up in the air a bit
21:14:48 <tdasilva> or maybe you all come down to join us
21:15:19 <rledisez> timburke: to be precise, it will happen in about 12 to 18 months, so we still have a bit of time on it ;)
21:15:20 <mattoliverau> ^ that :)
21:15:57 <timburke> we know there are a bunch more tests that we'd like to see, but we're also not sure we want to take on the maintenance burden when we might be able to get a lot of benefit out of things like xfs's realtime device support
21:16:26 <clayg> rledisez: well, but even given the runway on the phase out - aren't you also being tasked with planning for the NEW cluster (i.e. benchmarking alternatives to the existing LOSF index & slab storage)?
21:17:03 <rledisez> right now, we are still investigating all possibilities:
21:17:07 <timburke> we know drives are only going to get bigger over time, though -- and we suspect that lots-of-small-files as a problem is going to start to look like lots-of-files
21:17:19 <rledisez> xfs realtime => I asked about its status on the XFS ML, to see if it's maintained/tested/…
21:17:26 <rledisez> zfs => looks like a nice possibility
21:17:35 <rledisez> LOSF => still in progress
21:17:50 <rledisez> open-cas => does not seem stable, but is maintained so we may talk to them
21:18:08 <timburke> rledisez, alecuyer was saying that zfs doesn't have good recovery tooling, though, yeah?
21:18:41 <rledisez> I think it was about what happens in case of an I/O error. The only option might be to reboot the server
21:18:53 <rledisez> so we need to check that, right
21:18:58 <alecuyer> I started to work on eBPF scripts to monitor block device I/O and link it to inode/xattr access vs file data access. It's working well on a SAIO, but not on our prod (needs different kernel options). Once I have something good I'll share it so everyone can check whether they'd benefit from xattrs stored or cached on a fast device (allowing for the use of larger HDDs for data)
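A minimal sketch of that kind of tracing, assuming the bcc Python bindings are installed and the kernel exposes the block:block_rq_issue tracepoint; it only counts block I/O requests per device and does not correlate them with inode/xattr access, so treat it as an illustrative starting point rather than alecuyer's actual scripts:

```python
#!/usr/bin/env python
# Illustrative sketch only (assumes bcc is installed; needs root).
# Counts block I/O requests per device via the block:block_rq_issue tracepoint.
import time
from bcc import BPF

prog = r"""
BPF_HASH(io_count, u32, u64);

TRACEPOINT_PROBE(block, block_rq_issue) {
    u32 dev = args->dev;          // device the request was issued to
    io_count.increment(dev);      // bump the per-device request counter
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing block I/O... hit Ctrl-C to stop.")
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    for dev, count in b["io_count"].items():
        print("dev %#x: %d requests" % (dev.value, count.value))
```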
21:19:13 <tdasilva> anyone ever look at how hard it would be to use something like bluestore?
21:19:46 <rledisez> tdasilva: does it allow placing the rocksdb on a different device (SSD/NVMe)?
21:20:10 <tdasilva> dunno
21:20:39 <tdasilva> rledisez: why?
21:20:44 <mattoliverau> alecuyer: yeah
21:20:53 <rledisez> the goal is to place filesystem metadata (inode/xattr), or the LOSF index, on a faster device
21:21:01 <mattoliverau> you can decide if you want the rocksdb and WAL on a separate device.
21:21:39 <rledisez> mattoliverau: so yes, that's something that could be investigated, but it's pretty much the same solution as LOSF, so I would stick with LOSF for now as it's designed especially for swift
21:22:10 <timburke> on versioning: people seemed enthusiastic. null namespace didn't seem to scare anyone off, and swift growing another versioning scheme seems like a good idea given how poorly the current one maps to s3 versioning. iirc, everyone wants s3 versioning
21:23:19 <timburke> so, clayg, tdasilva, and i will be working on that a lot, hopefully getting it ready to merge to master within the next few weeks
21:23:35 <mattoliverau> nice
21:24:52 <timburke> on "atomic large objects": we recognize the utility, but still aren't sure about how to implement it. had a couple discussions but no clear resolution -- will probably come up again the next time we meet in person
21:26:07 <clayg> yes bluestore was very much aimed at putting the index/metadata db on a separate device from the blob slab
21:26:31 <clayg> you can also put it on the same device, but it's not quite as awesome as filestore
21:26:35 <timburke> on the object updater, i had an idea about grouping async pendings by container and sending UPDATE requests to do them all at once
21:26:37 <timburke> will almost certainly need some benchmarking before we know whether it's actually a *good* idea
21:26:57 <mattoliverau> oh cool
21:27:09 <mattoliverau> with the new UPDATE that makes sense.
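A rough sketch of the batching idea: the field names mirror what the object updater stores in its async_pending pickles, but the exact entry format and any batched-UPDATE wire protocol are assumptions for illustration, not the agreed design:

```python
# Hypothetical sketch: group pending container updates by (account, container)
# so one batched UPDATE request could cover many objects, instead of one
# PUT/DELETE per object.
from collections import defaultdict

def group_async_pendings(pendings):
    """pendings: iterable of dicts loaded from async_pending files."""
    grouped = defaultdict(list)
    for update in pendings:
        key = (update['account'], update['container'])
        grouped[key].append((update['op'], update['obj'], update['headers']))
    return grouped

# Each group could then be sent as a single UPDATE request to the
# container servers for that (account, container).
```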
21:27:14 <clayg> either way it's VERY ceph specific - I couldn't find any documentation for an ABI or something that allows general access to the volume
21:27:45 <timburke> on recon dumps: it'd be nice to get more/better keys, but where we *really* want to go involves tracking what the oldest unprocessed work item is
21:27:47 <mattoliverau> clayg: yeah I've struggled to find decent info on it too
21:29:11 <timburke> like, the replicator should be able to track when partitions have successfully synced with all primaries, expose the one that most needs to sync, and prioritize that work
21:29:49 <timburke> in the meantime, getting missing keys into recon could be a good short-term win
21:30:06 <clayg> #ftw
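A toy sketch of the bookkeeping timburke describes, tracking when each partition last fully synced so the most-stale one can be reported and prioritized; the class and method names are made up for illustration:

```python
import time

class SyncTracker(object):
    """Hypothetical helper: remember the last full-sync time per partition."""

    def __init__(self, partitions):
        # None means "never known to have fully synced with all primaries"
        self.last_synced = {p: None for p in partitions}

    def record_success(self, partition):
        self.last_synced[partition] = time.time()

    def oldest(self):
        """Partition whose last full sync is furthest in the past."""
        return min(self.last_synced,
                   key=lambda p: self.last_synced[p] or 0)
```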
21:31:09 <timburke> on tiering... we didn't actually talk much about it. sorry mattoliverau. fwiw, i know that we have customers wanting that sort of behavior, though, and i think that the null namespace could be very useful for the implementation
21:31:47 <mattoliverau> yeah, I thought so too. if the null namespace is the future, we should hold off so we can use it.
21:32:16 <timburke> in particular, it'd be useful in combination with versioning -- so you could tier off the non-active versions to somewhere cheaper
21:32:31 <mattoliverau> +1
21:32:43 <mattoliverau> speaking of recon, I wrote this ages ago.. need to see if it's still correct and everything, it's a little old: https://review.opendev.org/#/c/541141/
21:33:10 <timburke> i think it'd be great to keep in mind as we think about implementing some of s3's bucket policies (in particular, deleting non-active versions older than X)
21:33:27 <zaitcev> Dude. I once tried to write a Ceph client in Python. The current one at the time spawned a thread for each RADOS request, which called into C++. But it was impossible with the lack of docs, and the code was fairly impregnable. You may be able to reverse-engineer Bluestore API, but it's not going to be easy, I can guarantee that much.
21:33:53 <timburke> mattoliverau, oh, nice! yeah, that does seem useful
21:35:00 <clayg> zaitcev: i bet alecuyer could do it :nerd_snipe:
21:35:11 <mattoliverau> :)
21:35:19 <timburke> on performance: https://review.opendev.org/#/c/693116/ looks good, clayg brought some nice history to the discussion that made us all feel a lot better about getting rid of the queues
21:35:36 <alecuyer> clayg: all that talk certainly got me curious ;)
21:35:37 <timburke> working its way through the gate now
21:35:49 <clayg> EAT IT GATE
21:36:13 <clayg> timburke: rledisez: i was unclear on whether dropping the queue significantly helped throughput - or if it just mostly reduced CPU?
21:36:24 <rledisez> clayg: both actually
21:36:34 <rledisez> I'll soon submit a patch to get rid of the with Chunk*Timeout blocks
21:36:49 <timburke> clayg, https://etherpad.openstack.org/p/swift-profiling says just no-queue brought like 45% better throughput
21:36:53 <mattoliverau> here's my dodgy benchmark results from a SAIO: https://etherpad.openstack.org/p/swift-remove-prxy-queues-benchmarks
21:37:15 <rledisez> after that, I found some other places that could provide some perf improvement (especially a place in the proxy that builds a cache that is reset on every request :))
21:37:18 <mattoliverau> which was just a look with ssbench and getput
21:37:55 <mattoliverau> rledisez: oh that sounds like a useful cache :P
21:38:12 <rledisez> and after that, I want to get rid of MD5 as a checksum algorithm (not as a placement algorithm)
21:39:20 <clayg> ok then!
21:39:36 <timburke> rledisez, i wonder if it'd actually be easier to get rid of it as a placement algo... or at least, make the choice of algo a property of the ring
21:39:57 <rledisez> timburke: yeah, but I'm not sure yet the gain is worth it
21:40:07 <timburke> fair
21:40:13 <rledisez> whereas as a checksum, heh, the double md5 calculation in EC is really killing perf
21:40:17 <rledisez> but it's gonna be hard to drop MD5 because I think it's part of the API (etag header)
21:40:41 <timburke> yup, that was my thought, too :-(
21:41:40 <timburke> still, if EC only had to do one MD5 and one (HW-optimized, yeah?) SHA-256 or SHA-512... might be a solid win
21:41:52 <timburke> similar with encryption
21:41:56 <rledisez> yeah, that was my thought too
21:42:36 <timburke> on swiftclient test directory layout: yeah, just make it consistent with swift. merged.
21:42:36 <rledisez> i'm thinking of maybe adler32, which is designed especially for that (and is 4 times faster than md5 on my bench server)
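For anyone curious, a quick way to compare checksum throughput locally (a rough micro-benchmark sketch, not rledisez's benchmark; absolute numbers depend heavily on the CPU and on whether the hashes are hardware-accelerated):

```python
# Rough micro-benchmark: relative throughput of md5 vs sha256/sha512 vs adler32.
import hashlib
import timeit
import zlib

data = b'\0' * (64 * 1024 * 1024)  # 64 MiB of zeros, purely for illustration

def bench(name, fn, repeat=3):
    # best-of-N wall-clock time for one pass over the buffer
    best = min(timeit.repeat(lambda: fn(data), number=1, repeat=repeat))
    print('%-8s %7.1f MiB/s' % (name, len(data) / best / 2 ** 20))

bench('md5', lambda d: hashlib.md5(d).digest())
bench('sha256', lambda d: hashlib.sha256(d).digest())
bench('sha512', lambda d: hashlib.sha512(d).digest())
bench('adler32', lambda d: zlib.adler32(d))
```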
21:44:33 <timburke> i *think* that about covers the PTG... am i forgetting anything?
21:44:52 <clayg> i hope not!  I gotta go async
21:45:55 <timburke> there were some interesting talks from cmurphy about keystone scoping and access rules -- seems like they might fit well with swift
21:46:32 <timburke> and i know people are interested in having keystone application credential support in swiftclient
21:46:56 <mattoliverau> oh really, I might have to go look those up. cmurphy is awesome
21:47:12 <timburke> all right, i think that's all i've got
21:47:13 * cmurphy blushes
21:47:17 <timburke> #topic open discussion
21:48:35 <alecuyer> I'll just leave the link here; if anyone has ideas about using larger drives, please add to it: https://etherpad.openstack.org/p/swift-ptg-shanghai-large-drives
21:50:29 <timburke> all right. thank you all for coming, and thank you for working on swift!
21:50:37 <timburke> #endmeeting