Friday, 2019-08-23

*** swifterd_ has joined #openstack-swift00:13
*** ChanServ sets mode: +v swifterd_00:13
*** swifterd_ has quit IRC00:15
*** swifterdarrell has quit IRC00:16
*** gyee has quit IRC00:36
*** swifterdarrell has joined #openstack-swift02:26
*** ChanServ sets mode: +v swifterdarrell02:26
*** psachin has joined #openstack-swift03:02
*** gkadam has joined #openstack-swift03:19
*** e0ne has joined #openstack-swift05:33
*** e0ne has quit IRC05:44
*** viks___ has joined #openstack-swift06:35
*** e0ne has joined #openstack-swift06:42
*** gkadam has quit IRC06:43
*** e0ne has quit IRC06:43
*** e0ne has joined #openstack-swift06:53
*** e0ne has quit IRC06:59
*** zaitcev has quit IRC07:05
*** takamatsu has joined #openstack-swift07:08
*** rcernin has quit IRC07:15
*** zaitcev has joined #openstack-swift07:17
*** ChanServ sets mode: +v zaitcev07:17
*** rdejoux has joined #openstack-swift07:44
*** zaitcev has quit IRC07:55
*** e0ne has joined #openstack-swift08:05
*** zaitcev has joined #openstack-swift08:07
*** ChanServ sets mode: +v zaitcev08:07
*** diablo_rojo has joined #openstack-swift08:12
*** tkajinam has quit IRC08:19
*** zaitcev has quit IRC08:25
*** zaitcev has joined #openstack-swift08:38
*** ChanServ sets mode: +v zaitcev08:38
*** diablo_rojo has quit IRC08:48
*** zaitcev_ has joined #openstack-swift08:56
*** ChanServ sets mode: +v zaitcev_08:56
*** zaitcev has quit IRC08:59
*** psachin has quit IRC10:04
*** psachin has joined #openstack-swift10:06
*** mrjk has quit IRC10:27
*** tomha has joined #openstack-swift10:35
*** tomha has quit IRC10:40
*** tesseract has joined #openstack-swift11:12
*** zaitcev_ has quit IRC12:39
*** zaitcev_ has joined #openstack-swift12:53
*** ChanServ sets mode: +v zaitcev_12:53
*** zaitcev_ has quit IRC13:07
*** zaitcev_ has joined #openstack-swift13:21
*** ChanServ sets mode: +v zaitcev_13:21
*** BjoernT has joined #openstack-swift13:27
tdasilvaclayg: just to clarify my comment, my argument is more that conceptually, putting an object with an expiration on it, only to have it versioned later and have that version still expire as originally scheduled, seems a bit wrong to me13:35
claygWell, maybe. But. I’m not intentionally trying to change or improve the interaction of expiring objects and versioned writes.13:38
tdasilvaclayg: makes sense, this might also be a good opportunity to add some tests and I'm not sure we have many/any ??13:42
tdasilvaclayg: btw: are you currently updating p 633857 based on reviews? or are you on p 673682 ?13:43
patchbothttps://review.opendev.org/#/c/633857/ - swift - symlink-backed versioned_writes - 18 patch sets13:43
patchbothttps://review.opendev.org/#/c/673682/ - swift - s3api: Implement versioning status API - 1 patch set13:43
claygMostly based on some new tests I wrote. I found a bunch of stuff where the existing tests were wrong. I had to go write new s3api tests to understand how key-marker and version-marker were supposed to work.13:45
claygWait, do you mean the S3 versions - or the symlink one?13:46
claygI’m going to come back to symlinks. You can push over it for now; I’ll rebase S3 before I go address review stuff.13:47
*** psachin has quit IRC13:47
claygI could pause the S3 stuff and look at symlink comments if there’s anything controversial we should talk about. If it’s just like “fix this” - I’ll get to it as soon as I can.13:48
tdasilvaclayg: nope, just wanted to sync up with you as I'm trying to go over the "fix this" comments from timburke and just wanted to make sure you weren't doing the same13:49
claygCool!  Thanks!13:50
*** zaitcev_ has quit IRC14:22
*** zaitcev has joined #openstack-swift14:53
*** ChanServ sets mode: +v zaitcev14:53
*** zaitcev has quit IRC14:57
*** zaitcev has joined #openstack-swift15:10
*** ChanServ sets mode: +v zaitcev15:10
swifterdarrellrledisez and alecuyer, I'd love to chat about losf whenever you've got time :)15:14
alecuyerswifterdarrell: hello ! sure, no problem, I heard you've been running some tests :) I still have some time today, or next week15:18
rledisezswifterdarrell: for the record, alecuyer is in france and i'm in canada (east) so depending on the questions and the hours, you might get an immediate answer or not :)15:19
swifterdarrellso I heard... that's why I'm asking now instead of late last night (my time), hehe :)15:20
swifterdarrellalso, no worries about async questions/answers--that's no problem15:20
swifterdarrellI've been using the tip of the losf branch (not any code still pending reviews, if there are any).15:22
swifterdarrellPerformance-wise, I've roughly seen losf outperform "normal" swift in req/s, with lower latencies (esp. toward the tail, like the 99th percentile), at lower client-concurrencies, but found an "elbow" where it reverses and "normal" swift becomes faster15:23
swifterdarrellOh, I also haven't really tuned any losf things, if there are tunables to be had15:24
swifterdarrellI'm sort of borrowing my main hardware, so I can't just wipe the drives of non-losf storage policies/data15:24
alecuyerBeyond using the native golang leveldb lib, I don't think it would make much of a difference (you can choose the maximum size of a volume, defaults to 5GB)15:24
alecuyerIf I understand what you're saying, it starts out faster, but as you create more objects, it starts to slow down, right ?15:25
alecuyer(vs "normal" swift)15:25
swifterdarrellno, more like losf is consistently faster than "normal" at concurrency X for lower values of X, but there is a level of client-concurrency for which normal swift seems consistently faster15:26
alecuyerah, thanks, I see15:26
swifterdarrellOh, I may have not compiled with native golang leveldb? let me check that15:26
alecuyeryou should have both in the golang binary15:26
alecuyerthere is a switch, let me check15:27
swifterdarrellI think I may have been scared of it since it was a non-default thing15:27
alecuyeryes, we've only switched a couple of machines to use that15:27
alecuyerbut for tests it may be worth it :)15:27
alecuyerconf.get('use_go_leveldb', 'false')15:28
alecuyerso, in the object server configuration, if set to true, it would start the golang binary with the correct option to use the native go leveldb implementation15:28
swifterdarrellcool!15:29
alecuyerof course I don't know if that would help the test :) it should lower the CPU usage, at least15:29
swifterdarrellI'll try that15:29
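Pulling together what alecuyer describes above, flipping to the native Go leveldb implementation would be a one-line object-server config change. A minimal sketch, not taken from the losf branch: the section layout follows stock Swift convention, and everything except the `use_go_leveldb` option itself is an assumption:

    [app:object-server]
    use = egg:swift#object
    # Confirmed above: read via conf.get('use_go_leveldb', 'false').
    # When true, the golang index-server binary is started with the
    # option to use the native Go leveldb implementation.
    use_go_leveldb = true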
swifterdarrellI did notice short spikes of 6+ cores getting used by the golang procs, I guess as the data structure was getting merged/consolidated or whatever?  (while under constant decent ingest workload)15:30
swifterdarrellI think I saw somewhere the replication IP or port getting used by something for losf where I'd initially expected the cluster IP to be used; wasn't a big deal and I switched my systemd service for the rpcmanager to use an `object-server/2.conf` that had replication IP/port stuff in it15:32
*** joeljwright has quit IRC15:32
swifterdarrellah, it was the filtering done in ObjectRpcManager.get_policy2devices()15:33
alecuyerYes, that might be leveldb doing its "compaction", it does that in the background. We have seen that in tests, it would be worth checking again - I haven't profiled it in quite some time (we also had issues with Go garbage collection, but they should have been fixed)15:33
alecuyerAh, that's not something we've seen but we have a pretty much "fixed" configuration with our set of parameters, so maybe (probably!) I missed something and it's not doing the right thing wrt another configuration15:34
swifterdarrellI don't see it as "right" or "wrong", per se, it's just a matter of whether what it's doing is semantically primarily "replication related" or "normal data" related.15:36
alecuyerbut hm the index server should not be listening on any port actually, only the unix socket15:36
swifterdarrellAIUI, the rpcmanager is used for both, really, right? reading/storing obj data as well as delivering listings and obj data for replication/reconstruction activities?15:36
swifterdarrellI think what happened was that no devices were "seen" because they got filtered out15:37
alecuyerthe rpcmanager would only store the name and location of files (the location being: volume index + offset within volume)15:37
swifterdarrellbecause the config's notion of "my IP" is used somewhere15:37
alecuyerthe actual file content, and metadata (xattrs) are all stored only in the volumes, and that's written directly by the python code, it does not go through the index server15:38
swifterdarrelland in our config, the `object-server/1.conf` IP was different from the 2.conf's (data vs. replication); i.e. our obj data and replication traffic was configured to run over separate networks15:38
alecuyerok, I'll have to try that configuration and find what is happening15:39
swifterdarrellso it's an interaction between `from swift.common.ring.utils import is_local_device` and get_policy2devices() filtering on is_local_device(...replication_ip, replication_port)15:40
swifterdarrellwhen ring devices have the same values for ip/replication_ip and port/replication_port, you'd never have an issue15:41
alecuyerThanks ! I'll check it. It does sound wrong because we don't mean to change the behavior of these functions15:41
alecuyerWhich tool are you using for load testing ?15:43
swifterdarrellit'd be more cumbersome, but get_policy2devices() could filter with is_local_device(...ip, port) or is_local_device(...replication_ip, replication_port);  i.e. select all devices "local" with respect to regular ip/port OR replication ip/port15:43
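A minimal sketch of the OR-filter swifterdarrell proposes here; the helper name and its parameters are hypothetical, though `is_local_device(my_ips, my_port, dev_ip, dev_port)` is the real signature in swift.common.ring.utils:

    from swift.common.ring.utils import is_local_device

    def local_devices(ring_devs, my_ips, port, replication_port):
        """Keep devices that are local by EITHER the client-facing
        ip/port OR the replication ip/port, so split-network setups
        (1.conf vs 2.conf) are not filtered down to zero devices."""
        return [
            dev for dev in ring_devs
            if dev and (
                is_local_device(my_ips, port,
                                dev['ip'], dev['port']) or
                is_local_device(my_ips, replication_port,
                                dev['replication_ip'],
                                dev['replication_port']))
        ]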
swifterdarrellmostly ssbench15:43
swifterdarrellthough we may also use a wacky ingest-only test tool a customer was using that's golang & S3 API15:43
swifterdarrellone of our guys may use cosbench, but he had a better time with ssbench, so he may not15:44
swifterdarreller, "... so he may not use cosbench, I don't know"15:44
alecuyerOK (Tim sent us a stack where it was unable to rename an EC frag to "#d"; I'll try to reproduce that, I think Tim mentioned concurrency).15:44
*** gyee has joined #openstack-swift15:46
swifterdarrellya, I had a 12-hr run going, I'd have to check but I think it was mixed-but-write-heavy, and with about 2 hrs left in the run, the req/s dropped down to a new, lower plateau and we started seeing a consistent small level of 503s15:46
swifterdarrellWhen I started a fresh run a few hours later, it was at the lower plateau of req/s and the 503s were still there15:46
swifterdarrellI believe a full volcheck of all the losf storage locations (which took a long time of course) cleared it up15:47
alecuyerdoesn't sound good :/  the 503s were all caused by the rename issue ? (if you know)15:47
swifterdarrellI think so?15:47
swifterdarrellthey appeared spread around all 3 nodes and volumes... i.e. it wasn't like one storage disk got messed up in isolation15:47
alecuyerok that's really something I ought to reproduce. We normally don't have to reconstruct the DB outside cases where the filesystem/disk failed (missing files in the DB, etc)15:48
swifterdarrellI did see some objs in vols but not in the KV store without filesystem/disk failures, at least not any disk/filesystem failures I noticed15:49
swifterdarrellssbench is pretty "mean" :D15:49
alecuyerwithout any process being killed ? (OOM or otherwise)15:49
alecuyerit's a good thing ;)15:49
swifterdarrellI don't _think_ OOM was in that run; later, I dropped the RAM ceiling in the servers w/ the `mem=14G` linux kernel command-line arg, and when it was set smaller than 14G I was seeing the OOM killer active15:50
swifterdarrellI was trying to force heavy memory pressure to see if losf would do better when the xfs inode slab working set was unable to fit in physical RAM15:50
swifterdarrellbut it was mostly just bad for both "normal" and losf, and I didn't get any results I was confident in15:51
alecuyeryes I see, that's our use case (not enough memory to run with regular xfs)15:51
swifterdarrellbrb gotta make coffee15:51
swifterdarrellmy cat's refusing to make it for me15:51
alecuyertakes a lot of training for a cat :-)15:51
alecuyerI can see how you can have more files in the vols vs the KV store. If the index-server is down, the write to the volume can still happen, but registering the file in the KV will fail. Later, volcheck may pick it up (if it hasn't been overwritten yet). We've seen it happen, we would probably want to immediately erase/punch that file if the RPC call fails15:57
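A simplified sketch of the failure mode alecuyer describes, where the volume write succeeds but KV registration fails; all of the names here (append, register_file, punch_hole, RpcError) are hypothetical stand-ins, not the losf branch's actual API:

    class RpcError(Exception):
        """Stand-in for a failed call to the index-server."""

    def put_object(volume, kv_rpc, name, data):
        # The Python object server writes content + metadata directly
        # into the append-only volume; this does not involve the KV.
        offset = volume.append(data)
        try:
            # Only the name and location (volume index + offset) are
            # registered in the KV via the index-server RPC.
            kv_rpc.register_file(name, volume.index, offset)
        except RpcError:
            # Orphaned volume entry: punch it immediately (as suggested
            # above) rather than waiting for a volcheck pass to notice.
            volume.punch_hole(offset)
            raise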
swifterdarrell(back)16:07
swifterdarrellso in that respect, is the addition to the KV store analogous to writing an object and then trying to update the container server? sometimes that fails and an async pending is written to get it done later, async?  Only in this case, the "async pending" would get dynamically noticed and fixed by a later volcheck?16:07
alecuyerthat's right, but it shouldn't fail regularly (on a regular machine you should never see an RPC call fail)16:09
alecuyerwe mean volcheck to catch pending writes after a power loss/kernel panic16:09
swifterdarrellyeah, this was under heavy load16:09
*** tesseract has quit IRC16:11
swifterdarrellOTOH, I just did a ~15 hr EC ingest-only, tiny file ssbench run last night with 60 concurrent users and it was pretty solid; no 503s that I can see in graphs16:11
swifterdarrellhttps://gist.github.com/dbishop/2163d82ac1a795a6afae2919848c05ae16:12
alecuyerok! thanks a lot for sharing these results :) and I will look at that renaming issue16:12
swifterdarrellso 1 error and 0.03% of reqs needed to be retried16:12
alecuyerthis is for a single object server ? or the cluster ?16:14
swifterdarrellthat's a whole cluster16:14
swifterdarrell3 storage nodes, 1 proxy node 10G networking16:15
alecuyerok16:15
swifterdarrellAnother question I had was: what drove the losf project? One thing we're looking at is filling in the "benefit" part of a "cost to benefit ratio".  I think I understand the costs of getting losf out to our customers (helping with upstream, helping with testing, integration w/our product's management/monitoring, etc.), but I'm trying to get a handle on the benefits16:15
alecuyerThat makes sense16:16
swifterdarrellSo far, I haven't seen a place where losf really crushes "normal"; better sometimes, maybe worse sometimes.  So any tips on situations or areas where losf really shines would be helpful16:17
alecuyerI'll try to explain our case16:17
alecuyerwe had these large clusters, that were converted to EC16:18
swifterdarrellOne thing we can't know or replicate very well atm is if there was a case where "normal" swift simply wasn't working well enough to be said to "be working" with, say, tons of small files. If losf took the cluster from "not working" to "working", that's a huge benefit, just not necessarily a bump in some synthetic benchmark16:18
alecuyerwe did not foresee the problem with EC and small files16:18
alecuyeryes so thats what happened16:19
swifterdarrelldid it manifest as OOM activity from extreme memory pressure?  Or just really degraded performance?16:19
alecuyerwe had all these machines, with not much memory, and we went over a threshold16:19
alecuyerfirst degraded performance16:19
alecuyerand then issues with XFS16:20
alecuyerlike, we had dedicated machines with hundreds of GB of RAM to which disks were plugged, because xfs_repair wouldn't run on a regular machine16:20
alecuyerXFS corruptions16:20
alecuyernot so much OOM16:21
swifterdarrellk, gtk16:21
alecuyerand we saw with eBPF,16:21
alecuyerthat most of the IO load (lots of disks were 100% busy) was caused by reading directory content16:21
alecuyerso REPLICATE was extremely costly16:21
alecuyerupgrading the hardware was not a possibility in our case16:22
alecuyerso that's the initial reason. Now we'll have to evaluate if we need it everywhere (we do not run LOSF everywhere)16:23
swifterdarrell*nod*16:23
alecuyer(and I'll say it worked.. these 3 clusters were basically unusable for a while… :/ now it's ok, and dispersion is getting under control :) )16:24
swifterdarrellthat was my next question: have you guys found any situations where losf is contraindicated--like where it'd be worse than just leaving things "normal"16:24
swifterdarrellnice!16:24
swifterdarrellI saw an active migrator tool (I think); is that state of the art? I think I heard that for migration you would replace a node w/empty disks using losf and let replication fill it; was the active migration thing added later to do that more efficiently?16:25
alecuyerI don't see that it would break a specific workload, but I wouldn't use it where it's not needed (if you don't have small files the benefits are not obvious to me, and you have an extra component to take care of)16:25
swifterdarrellone problem we have is that customers really have no idea if they have small files; and if they say they don't, they're quite often actually wrong16:26
alecuyerthat's an old version of a test we did, I didn't mean to add it and that's not actually what we used, but something similar, yes16:26
alecuyeraha I hear you :-)16:26
swifterdarrellso it's not like we could let customers say "I have only large files" and not use losf because 1 yr later the support call would come in and we'd find small objects everywhere ;)16:27
alecuyerCertainly with drives getting larger and larger LOSF might be useful at a higher object size (unless you add memory proportional to drive size)16:28
alecuyerheh16:28
swifterdarrell*nod*16:28
swifterdarrellhave you run into any CPU utilization issues w/leveldb using more CPU than "normal" swift w/same hardware and workload? (I think I observed higher cpu usage for losf in my limited testing)16:28
swifterdarrellnot that I ran out of cpu, just that obj nodes were less idle16:29
alecuyerdefinitely, that's something we want to look at16:29
alecuyerit's using more CPU everywhere, and it's been an issue on Atom CPUs16:29
swifterdarrellbut i could see cheap-and-deep low-cpu, high disk-count chassis getting into trouble16:29
alecuyer(for others, we had CPU to spare so no problem despite higher usage)16:29
swifterdarrellheh, those atom procs really suck16:29
swifterdarrellgtk16:29
alecuyer(high disk count, we have two drives with atoms, mostly, and yet..)16:29
alecuyerso, native golang leveldb should help16:30
alecuyerand then, we did not want to add code, but we considered caching the partition list, things like that16:30
swifterdarrellwe had some engineering sample atom gear that I had our guys throw away when we moved recently because it was so bad... felt like a 1st gen raspberry pi16:30
alecuyer(caching is not the word I'm looking for, "computing"  and keeping in RAM?)16:30
alecuyerwe tend to get old hardware that nobody wants for some of these clusters so we have to make do :)16:31
swifterdarrellhehe16:31
alecuyerSo I see how you have to consider if it's worth it in your use cases16:31
swifterdarrellcool, well thanks so much for the info and I'll be in here more to ask more questions and share results and stuff from our testing16:31
alecuyerno problem, feel free to ask; happy to share what I can, and thanks for your feedback!16:32
swifterdarrellawesome, thanks!16:32
alecuyerhave a good day and weekend after, bye !16:32
swifterdarrellyou too!16:32
zaitcevSo, are there places that run out of inodes? I thought XFS could run out of inodes, just like traditional filesystems.16:38
zaitcevLoSF should help with that.16:38
swifterdarrellnot sure... on a 3T drive in the cluster I've been fiddling with, looks like I have 586,053,312 inodes.  I guess you have, what, 1 inode per dir & 1 per file?  Is the average dir inode overhead per object somewhere just north of 1/4096? (in normal swift, figuring that there's 4096 possible hash dir thingies, did I do that math right?)16:42
alecuyerzaitcev: I don't remember if that happened, I think Romain will know.  But even if you don't hit the limit it helps to have a sane number of inodes, because at some point xfs_repair does not run (or you need obscene amounts of memory)16:43
swifterdarrell*nod*16:43
alecuyernow I have to go pick up the kid, see you all16:44
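On swifterdarrell's inode math above: in "normal" swift an object lands at objects/<partition>/<suffix>/<hash>/<timestamp>.data, where the suffix is 3 hex digits (the 4096 possibilities mentioned) but the hash dir is per object, not shared; so the directory overhead is closer to one extra inode per object than to 1/4096. A back-of-the-envelope sketch with made-up disk numbers:

    # Hypothetical figures for one object disk; only the layout (one
    # .data file and one hash dir per object, up to 4096 suffix dirs
    # per partition) comes from swift's on-disk format.
    objects_on_disk = 20_000_000
    partitions_on_disk = 4_000

    file_inodes = objects_on_disk         # one .data file per object
    hashdir_inodes = objects_on_disk      # each object gets its own hash dir
    suffix_inodes = min(partitions_on_disk * 4096, objects_on_disk)
    partition_inodes = partitions_on_disk

    total = file_inodes + hashdir_inodes + suffix_inodes + partition_inodes
    print(total / objects_on_disk)        # ~2.8 inodes per object in this example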
*** zaitcev_ has joined #openstack-swift17:47
*** ChanServ sets mode: +v zaitcev_17:47
openstackgerritThiago da Silva proposed openstack/swift master: Add func test of changes versionining modes  https://review.opendev.org/67827717:49
*** zaitcev has quit IRC17:51
*** zaitcev_ has quit IRC18:38
*** gyee has quit IRC18:42
*** e0ne has quit IRC18:46
*** gyee has joined #openstack-swift18:46
*** swifterdarrell has quit IRC18:51
*** swifterdarrell has joined #openstack-swift18:51
*** ChanServ sets mode: +v swifterdarrell18:51
*** zaitcev_ has joined #openstack-swift18:52
*** ChanServ sets mode: +v zaitcev_18:52
*** swifterdarrell has quit IRC18:57
*** zaitcev_ has quit IRC19:01
rledisezzaitcev_: we never ran out of inodes. at max, we had 70M inodes per 6TB disk (36 disks and 64GB of RAM per server). we would have needed about 1.5TB of RAM to fit all the inodes in the VFS cache. so, we saw that half of the IO budget of the disks went to fetching inodes, the other half to serving the user's data. that was the goal of LOSF: reduce the IO waste19:02
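rledisez's numbers check out with some quick arithmetic; the bytes-per-cached-inode figure below is derived from his totals, not measured:

    inodes_per_server = 70_000_000 * 36   # 2.52 billion inodes across 36 disks
    ram_needed = 1.5e12                   # ~1.5 TB to cache every inode, per rledisez
    server_ram = 64e9                     # 64 GB actually installed per server

    print(ram_needed / inodes_per_server) # ~595 bytes of VFS cache per inode
    print(server_ram / ram_needed)        # ~4% of inodes can be cached at once,
                                          # which is why half the IO budget went
                                          # to re-fetching inodes from disk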
DHEhow is losf going anyway?19:14
*** zaitcev_ has joined #openstack-swift19:15
*** ChanServ sets mode: +v zaitcev_19:15
openstackgerritMerged openstack/swift master: py3: Switch swift-dsvm-functional-py3 to run tests under py3  https://review.opendev.org/67704719:24
rledisezDHE: sorry, i'm not sure i get your question. do you mean "is it working fine?". if that's it, yes, it's running in production in some of our clusters. as alecuyer said, it basically saved the clusters. the real primary goal was to make replication work, and it is working now. it never ate any data as far as we know, which is encouraging :)19:27
DHEalmost merge time?19:30
*** swifterdarrell has joined #openstack-swift19:31
*** ChanServ sets mode: +v swifterdarrell19:31
rledisezthere is still work to do before that ;)19:33
timburkegood afternoon20:01
swifterdarrellalecuyer: rledisez: is it expected to have lots of relatively small volume files during small-file EC ingest?  Here's what volume sizes look like in a volumes dir for my losf EC policy I've been benchmarking with lots of small file writes: https://gist.github.com/dbishop/81dab0269ad0859d7890bcb23b17a4b320:39
swifterdarrellthat's 39 GiB across 7749 volume files for avg 5.4 MB/volume20:41
rledisezswifterdarrell: a volume stores data for only one partition (so that when a partition is rebalanced, the volume can be removed, reducing the number of extents). because the volumes are append-only (they act like a journal that can be replayed in case of crash), if you have concurrent uploads for a partition, you need multiple volumes per partition (at least one per connection). to avoid creating too many volumes, there is a limit (configurable, I think) of 10 volumes per partition.20:47
rledisezi think you are in this situation of having multiple volumes for your partitions because you had concurrency20:47
swifterdarrellOh I definitely had concurrency!20:47
swifterdarrellWith long-running concurrent ingest does volume count grow without bound or is it bounded by something like partition_count * concurrency?20:48
rledisezit is bounded by partition_count * volume_per_partition_limit20:48
rledisezwell, i should say partition_count * min(concurrency, volume_per_partition_limit)20:49
rledisezapprox :)20:49
swifterdarrellgotcha20:49
swifterdarrellyeah, I was assuming approx :)20:49
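The bound rledisez gives, sketched and checked against the gist numbers above; the partition estimate is inferred from the discussion, not something either of them stated:

    def max_volumes(partition_count, concurrency, volume_per_partition_limit=10):
        # Approximate bound from the discussion: one volume per
        # concurrent writer per partition, capped by the limit.
        return partition_count * min(concurrency, volume_per_partition_limit)

    # 7749 volumes on one disk with the default limit of 10 implies the
    # workload touched at least 7749 / 10 ~= 775 partitions there.
    print(7749 / 10)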
swifterdarrellfwiw, toward the end of my last small-file losf EC ingest test I was getting ~55 req/s; under the same conditions, with the `use_go_leveldb = true` change as the only difference, a subsequent similar run I just started is getting ~62-63 req/s20:55
swifterdarrellnot quite apples to apples wrt how far into the run the sampling is, but not surprising, either20:56
rledisezthat's interesting. we never thought it could improve PUT performance (not that much, at least). we were only expecting to save a lot of CPU and time on REPLICATE20:58
rledisezgtg, have a nice week end everybody21:00
*** takamatsu has quit IRC21:55
*** rcernin has joined #openstack-swift22:11
*** BjoernT has quit IRC22:18
*** zaitcev_ has quit IRC22:39
*** zaitcev_ has joined #openstack-swift22:53
*** ChanServ sets mode: +v zaitcev_22:53
*** rcernin has quit IRC23:11
*** rcernin has joined #openstack-swift23:12
*** zaitcev__ has joined #openstack-swift23:15
*** ChanServ sets mode: +v zaitcev__23:15
*** zaitcev_ has quit IRC23:19
