21:04:16 #startmeeting swift
21:04:17 Meeting started Wed Nov 13 21:04:16 2019 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:04:20 The meeting name has been set to 'swift'
21:04:30 who's here for the swift meeting?
21:04:31 o/
21:04:35 o/
21:04:39 o/
21:04:39 o/
21:04:43 o/
21:05:18 sorry to run a bit late; was upgrading the OS on my laptop
21:05:18 alecuyer: rledisez: I feel like I *just* saw you guys!?
21:05:40 chinese flashback!
21:05:41 o/
21:05:54 on that note...
21:05:58 #topic PTG recap
21:06:01 clayg: did you travel in time when flying home? it may be the explanation ;)
21:06:40 as expected, it was wonderful seeing everyone who made it :-)
21:06:48 rledisez: so funny story, I started my travel day by going to the wrong airport - but... i made it home
21:06:56 everyone else, hopefully i'll get to see you in vancouver
21:07:03 VANCOUVER!!!
21:07:11 clayg, oh jeeze! glad you still made it home!
21:07:15 i hear vancouver has a bunch of chinese millionaires
21:08:01 we had some really good discussions, hopefully we more or less kept the etherpad updated
21:08:04 #link https://etherpad.openstack.org/p/swift-ptg-shanghai
21:08:06 * rledisez wishes he was a canadian millionaire in china
21:08:15 lol
21:08:17 lol
21:09:03 we met a new Korean operator -- if you see seongsoocho in -swift, you ought to say hi :D
21:09:21 had a good ops feedback session
21:09:21 nice :)
21:09:24 #link https://etherpad.openstack.org/p/PVG-swift-ops-feedback
21:09:31 it's 6am in seoul
21:09:51 i'm sure kota_ and mattoliverau wouldn't complain if we wanted to make the meeting a little later 🤷‍♂️
21:10:40 8am here now, so it isn't too bad. But it's 6 in Tokyo, isn't it kota_?
21:10:58 daylight savings for the win here
21:11:11 when is ptg in vancouver?
21:11:36 i want to say june? let me find the email...
21:11:40 mattoliverau: yup
21:11:55 same with Tokyo timezone
21:11:59 yep, 8 to 11 june I think
21:12:13 Seeing as I'm not in the cloud team anymore at Suse I probably won't have any travel funding to go. I guess I could attempt to get a talk in, maybe if it gets accepted they'll send me. I'd have to talk to my new manager.
21:12:13 I'm sure it ends June 11th
21:12:13 http://lists.openstack.org/pipermail/foundation/2019-September/002794.html says Jun 8-11
21:13:02 mattoliverau, there's also the travel support program: https://wiki.openstack.org/wiki/Travel_Support_Program
21:13:14 i should make sure seongsoocho knows about it, too
21:13:15 that's true, I have used it before :)
21:13:53 big takeaways i got out of the week (and feel free to chime in with corrections or additional info):
21:14:12 tdasilva: aren't you going to New Zealand at some point? Can you pick up matt on your way back to the states?
21:14:35 heh, sounds like a good idea
21:14:47 on LOSF: the main cluster that rledisez and alecuyer needed this for is getting phased out, so its future is up in the air a bit
21:14:48 or maybe you all come down to join us
21:15:19 timburke: to be precise, it will happen in about 12 to 18 months, so we still have a bit of time on it ;)
21:15:20 ^ that :)
21:15:57 we know there are a bunch more tests that we'd like to see, but we're also not sure we want to take on the maintenance burden when we might be able to get a lot of benefit out of things like xfs's realtime device support
21:16:26 rledisez: well, but even given the runway on the phase out - aren't you also being tasked with planning for the NEW cluster (i.e. benchmarking alternatives to the existing LOF index & slab storage)
21:17:03 right now, we are still investigating all possibilities:
21:17:07 we know drives are only going to get bigger over time, though -- and we suspect that lots-of-small-files as a problem is going to start to look like lots-of-files
21:17:19 xfs realtime => I asked about the status on the XFS ML, to see if it's maintained/tested/…
21:17:26 zfs => looks like a nice possibility
21:17:35 LOSF/LOF => still in progress
21:17:50 open-cas => does not seem stable, but is maintained so we may talk to them
21:18:08 rledisez, alecuyer was saying that zfs doesn't have good recovery tooling, though, yeah?
21:18:41 I think it was about what happens in case of an I/O error. the only option might be to reboot the server
21:18:53 so we need to check that, right
21:18:58 I started to work on eBPF scripts to monitor block device IO and link it to inode/xattr access, vs file data. Good going on SAIO, but not working on our prod (need diff kernel options). Once I have something good i'll share so everyone can check if they'd benefit from xattr stored or cached on a fast device (allowing for the use of larger HDDs for data)
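
For anyone curious about that eBPF approach, here is a minimal BCC sketch (not alecuyer's actual scripts, just an illustration of the idea): it counts xattr lookups versus data reads via kprobes to give a rough feel for how much I/O is filesystem metadata. It assumes the bcc Python bindings and kernel headers are installed; vfs_getxattr/vfs_read are generic VFS entry points and symbol availability can vary by kernel version.

    # Minimal sketch, assuming the bcc python bindings are available.
    import time
    from bcc import BPF

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(counts, u64, u64);

    int kprobe__vfs_getxattr(struct pt_regs *ctx) {
        u64 key = 0;                 // xattr lookups (metadata)
        counts.increment(key);
        return 0;
    }

    int kprobe__vfs_read(struct pt_regs *ctx) {
        u64 key = 1;                 // file data reads
        counts.increment(key);
        return 0;
    }
    """

    b = BPF(text=prog)               # functions named kprobe__<sym> auto-attach
    labels = {0: 'getxattr', 1: 'read'}
    print('Tracing vfs_getxattr vs vfs_read... Ctrl-C to stop')
    try:
        while True:
            time.sleep(5)
            for k, v in sorted(b['counts'].items(), key=lambda kv: kv[0].value):
                print('%-8s %d' % (labels[k.value], v.value))
    except KeyboardInterrupt:
        pass
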
21:19:13 anyone ever looked at how hard it would be to use something like bluestore?
21:19:46 tdasilva: does it allow placing the rocksdb on a different device? (SSD/NVMe)
21:20:10 dunno
21:20:39 rledisez: why?
21:20:44 alecuyer: yeah
21:20:53 the goal is to place filesystem metadata (inode/xattr) on a faster device (or the LOSF index)
21:21:01 you can decide if you want the rocksdb and WAL on a separate device.
21:21:39 mattoliverau: so yes, that's something that could be investigated, but it's pretty much the same solution as LOSF, so I would stick to LOSF for now as it is designed especially for swift
21:22:10 on versioning: people seemed enthusiastic. null namespace didn't seem to scare anyone off, and swift growing another versioning scheme seems like a good idea given how poorly the current one maps to s3 versioning. iirc, everyone wants s3 versioning
21:23:19 so, clayg, tdasilva, and i will be working on that a lot, hopefully getting it ready to merge to master within the next few weeks
21:23:35 nice
21:24:52 on "atomic large objects": we recognize the utility, but still aren't sure about how to implement it. had a couple discussions but no clear resolution -- will probably come up again the next time we meet in person
21:26:07 yes bluestore was very much aimed at putting the index/metadata db on a separate device from the blob slab
21:26:31 you can also put it on the same device, but it's not quite as awesome as filestore
21:26:35 on the object updater, i had an idea about grouping async pendings by container and sending UPDATE requests to do them all at once
21:26:37 will almost certainly need some benchmarking before we know whether it's actually a *good* idea
21:26:57 oh cool
21:27:09 with the new UPDATE that makes sense.
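
A rough sketch of the grouping half of that updater idea (not the eventual patch): asyncs on disk are pickled dicts carrying account/container/obj/op/headers, so bucketing them per container is straightforward; the batched UPDATE request itself is the part that still needs design and benchmarking.

    # Sketch only, assuming the async_pending layout the object updater reads
    # today: one pickled dict per update under <device>/async_pending*.
    import os
    import pickle
    from collections import defaultdict

    def group_async_pendings(device_path, policy_suffix=''):
        """Walk async_pending* dirs and bucket updates by (account, container)."""
        async_dir = os.path.join(device_path, 'async_pending' + policy_suffix)
        grouped = defaultdict(list)
        for root, _dirs, files in os.walk(async_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, 'rb') as fp:
                    update = pickle.load(fp)
                grouped[(update['account'], update['container'])].append(
                    (path, update))
        return grouped

    # With updates grouped, the updater could send one batched request per
    # container instead of one PUT/DELETE per object -- hand-waved here.
    for (acct, cont), updates in group_async_pendings('/srv/node/sdb1').items():
        print('%s/%s: %d pending updates' % (acct, cont, len(updates)))
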
21:27:14 either way it's VERY ceph specific - I couldn't find any documentation for an ABI or something that allows general access to the volume
21:27:45 on recon dumps: it'd be nice to get more/better keys, but where we *really* want to go involves tracking what the oldest unprocessed work item is
21:27:47 clayg: yeah I've struggled to find decent info on it too
21:29:11 like, the replicator should be able to track when partitions have successfully synced with all primaries, expose the one that needs to sync most, and prioritize that work
21:29:49 in the meantime, getting missing keys into recon could be a good short-term win
21:30:04 #ft
21:30:06 #ftw
21:31:09 on tiering... we didn't actually talk much about it. sorry mattoliverau. fwiw, i know that we have customers wanting that sort of behavior, though, and i think that the null namespace could be very useful for the implementation
21:31:47 yeah, I thought so too. if null namespace is the future, we should hold off so we use it.
21:32:16 in particular, it'd be useful in combination with versioning -- so you could tier off the non-active versions to somewhere cheaper
21:32:31 +1
21:32:43 speaking of recon, I wrote this ages ago.. need to see if it still is correct and everything, it's a little old: https://review.opendev.org/#/c/541141/
21:33:10 i think it'd be great to keep in mind as we think about implementing some of s3's bucket policies (in particular, deleting non-active versions older than X)
21:33:27 Dude. I once tried to write a Ceph client in Python. The current one at the time spawned a thread for each RADOS request, which called into C++. But it was impossible with the lack of docs, and the code was fairly impregnable. You may be able to reverse-engineer the Bluestore API, but it's not going to be easy, I can guarantee that much.
21:33:53 mattoliverau, oh, nice! yeah, that does seem useful
21:35:00 zaitcev: i bet alecuyer could do it :nerd_snipe:
21:35:11 :)
21:35:19 on performance: https://review.opendev.org/#/c/693116/ looks good, clayg brought some nice history to the discussion that made us all feel a lot better about getting rid of the queues
21:35:36 clayg: all that talk certainly got me curious ;)
21:35:37 working its way through the gate now
21:35:49 EAT IT GATE
21:36:13 timburke: rledisez: i was unclear on whether dropping the queue significantly helped throughput - or it just mostly reduced cpu?
21:36:24 clayg: both actually
21:36:34 I'll submit a patch soon to get rid of the with Chunk*Timeout
21:36:49 clayg, https://etherpad.openstack.org/p/swift-profiling says just no-queue brought like 45% better throughput
21:36:53 here are my dodgy benchmark results from a SAIO: https://etherpad.openstack.org/p/swift-remove-prxy-queues-benchmarks
21:37:15 after that, I found some other places that could provide some perf improvement (especially a place in the proxy that has a cache that is reset at every request :))
21:37:18 which was just a look with ssbench and getput
21:37:55 rledisez: oh that sounds like a useful cache :P
21:38:12 and after that, I want to get rid of MD5 as a checksum algorithm (not as the placement algorithm)
21:39:20 ok then!
21:39:36 rledisez, i wonder if it'd actually be easier to get rid of it as a placement algo... or at least, make the choice of algo a property of the ring
21:39:57 timburke: yeah, but I'm not sure yet that the gain is worth it
21:40:07 fair
21:40:13 whereas as a checksum, heh, the double md5 calculation in EC is really killing perf
21:40:17 but it's gonna be hard to drop MD5 because I think it's part of the API (etag header)
21:40:41 yup, that was my thought, too :-(
21:41:40 still, if EC only had to do one MD5 and one (HW-optimized, yeah?) SHA-256 or SHA-512... might be a solid win
21:41:52 similar with encryption
21:41:56 yeah, that was my thought too
21:42:36 on swiftclient test directory layout: yeah, just make it consistent with swift. merged.
21:42:36 i'm thinking of adler32 maybe, which is designed especially for that (and 4 times faster than md5 on my bench server)
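
To put numbers on the MD5 vs SHA-2 vs adler32 question on your own hardware, a quick hashlib/zlib micro-benchmark along these lines is enough to reproduce the comparison (throughput figures will obviously differ per machine):

    # Quick-and-dirty hash throughput comparison -- illustration only.
    import hashlib
    import timeit
    import zlib

    CHUNK = b'\0' * (64 * 1024 * 1024)  # 64 MiB of data per pass

    def bench(name, fn, passes=5):
        secs = timeit.timeit(lambda: fn(CHUNK), number=passes)
        rate = len(CHUNK) * passes / secs / (1024 ** 2)
        print('%-8s %8.1f MiB/s' % (name, rate))

    bench('md5', lambda d: hashlib.md5(d).digest())
    bench('sha256', lambda d: hashlib.sha256(d).digest())
    bench('sha512', lambda d: hashlib.sha512(d).digest())
    bench('adler32', lambda d: zlib.adler32(d))
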
21:44:33 i *think* that about covers the PTG... am i forgetting anything?
21:44:52 i hope not! I gotta go async
21:45:55 there were some interesting talks from cmurphy about keystone scoping and access rules -- seems like they might fit well with swift
21:46:32 and i know people are interested in having keystone application credential support in swiftclient
21:46:56 oh really, I might have to go look those up. cmurphy is awesome
21:47:12 all right, i think that's all i've got
21:47:13 * cmurphy blushes
21:47:17 #topic open discussion
21:48:35 I'll just leave the link here, if anyone has ideas about using larger drives, please add to it: https://etherpad.openstack.org/p/swift-ptg-shanghai-large-drives
21:50:29 all right. thank you all for coming, and thank you for working on swift!
21:50:37 #endmeeting
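
As a reference for the application credential interest mentioned above: even before first-class swiftclient support lands, a keystoneauth session built from an application credential can already be handed to swiftclient. A minimal sketch, with made-up endpoint and credential values:

    # Sketch only: driving python-swiftclient with a keystone application
    # credential via a keystoneauth session. auth_url/id/secret are hypothetical.
    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from swiftclient import client as swift_client

    auth = v3.ApplicationCredential(
        auth_url='https://keystone.example.com/v3',   # hypothetical endpoint
        application_credential_id='abc123',           # hypothetical credential id
        application_credential_secret='s3cr3t',
    )
    conn = swift_client.Connection(session=session.Session(auth=auth))

    headers, containers = conn.get_account()
    print('%s objects across %d containers' % (
        headers.get('x-account-object-count', '0'), len(containers)))
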