Friday, 2019-08-23

*** swifterd_ has joined #openstack-swift00:13
*** ChanServ sets mode: +v swifterd_00:13
*** swifterd_ has quit IRC00:15
*** swifterdarrell has quit IRC00:16
*** gyee has quit IRC00:36
*** swifterdarrell has joined #openstack-swift02:26
*** ChanServ sets mode: +v swifterdarrell02:26
*** psachin has joined #openstack-swift03:02
*** gkadam has joined #openstack-swift03:19
*** e0ne has joined #openstack-swift05:33
*** e0ne has quit IRC05:44
*** viks___ has joined #openstack-swift06:35
*** e0ne has joined #openstack-swift06:42
*** gkadam has quit IRC06:43
*** e0ne has quit IRC06:43
*** e0ne has joined #openstack-swift06:53
*** e0ne has quit IRC06:59
*** zaitcev has quit IRC07:05
*** takamatsu has joined #openstack-swift07:08
*** rcernin has quit IRC07:15
*** zaitcev has joined #openstack-swift07:17
*** ChanServ sets mode: +v zaitcev07:17
*** rdejoux has joined #openstack-swift07:44
*** zaitcev has quit IRC07:55
*** e0ne has joined #openstack-swift08:05
*** zaitcev has joined #openstack-swift08:07
*** ChanServ sets mode: +v zaitcev08:07
*** diablo_rojo has joined #openstack-swift08:12
*** tkajinam has quit IRC08:19
*** zaitcev has quit IRC08:25
*** zaitcev has joined #openstack-swift08:38
*** ChanServ sets mode: +v zaitcev08:38
*** diablo_rojo has quit IRC08:48
*** zaitcev_ has joined #openstack-swift08:56
*** ChanServ sets mode: +v zaitcev_08:56
*** zaitcev has quit IRC08:59
*** psachin has quit IRC10:04
*** psachin has joined #openstack-swift10:06
*** mrjk has quit IRC10:27
*** tomha has joined #openstack-swift10:35
*** tomha has quit IRC10:40
*** tesseract has joined #openstack-swift11:12
*** zaitcev_ has quit IRC12:39
*** zaitcev_ has joined #openstack-swift12:53
*** ChanServ sets mode: +v zaitcev_12:53
*** zaitcev_ has quit IRC13:07
*** zaitcev_ has joined #openstack-swift13:21
*** ChanServ sets mode: +v zaitcev_13:21
*** BjoernT has joined #openstack-swift13:27
tdasilvaclayg: just to clarify my comment, my argument is more that conceptually, putting an object with an expiration on it, only to have it versioned later and have that version still expire as originally scheduled, seems a bit wrong to me13:35
claygWell, maybe. But. I’m not intentionally trying to change or improve the interaction of expiring objects and versioned writes.13:38
tdasilvaclayg: makes sense, this might also be a good opportunity to add some tests and I'm not sure we have many/any ??13:42
tdasilvaclayg: btw: are you currently updating p 633857 based on reviews? or are you on p 673682 ?13:43
patchbothttps://review.opendev.org/#/c/633857/ - swift - symlink-backed versioned_writes - 18 patch sets13:43
patchbothttps://review.opendev.org/#/c/673682/ - swift - s3api: Implement versioning status API - 1 patch set13:43
claygMostly based on some new tests I wrote. I found a bunch of stuff where the existing tests were wrong. I had to go write new s3api tests to understand how key-marker and version-marker were supposed to work.13:45
claygWait, do you mean the S3 versions - or the symlink one?13:46
claygI’m going to come back to symlinks. You can push over it for now; I’ll rebase S3 before I go address review stuff.13:47
*** psachin has quit IRC13:47
claygI could pause the S3 stuff and look at symlink comments if there’s anything controversial we should talk about. If it’s just like “fix this” - I’ll get to it as soon as I can.13:48
tdasilvaclayg: nope, just wanted to sync up with you as I'm trying to go over the "fix this" comments from timburke and just wanted to make sure you weren't doing the same13:49
claygCool!  Thanks!13:50
*** zaitcev_ has quit IRC14:22
*** zaitcev has joined #openstack-swift14:53
*** ChanServ sets mode: +v zaitcev14:53
*** zaitcev has quit IRC14:57
*** zaitcev has joined #openstack-swift15:10
*** ChanServ sets mode: +v zaitcev15:10
swifterdarrellrledisez and alecuyer, I'd love to chat about losf whenever you've got time :)15:14
alecuyerswifterdarrell: hello ! sure, no problem, I heard you've been running some tests :) I still have some time today, or next week15:18
rledisezswifterdarrell: for the record, alecuyer is in france and i'm in canada (east) so depending on the questions and the hours, you might get an immediate answer or not :)15:19
swifterdarrellso I heard... that's why I'm asking now instead of late last night (my time), hehe :)15:20
swifterdarrellalso, no worries about async questions/answers--that's no problem15:20
swifterdarrellI've been using the tip of the losf branch (not any code still pending reviews, if there are any).15:22
swifterdarrellPerformance-wise, I've roughly seen losf outperform "normal" swift in req/s, with lower latencies (esp. toward the tail, like the 99th percentile), at lower client-concurrencies, but found an "elbow" where it reverses and "normal" swift becomes faster15:23
swifterdarrellOh, I also haven't really tuned any losf things, if there are tunables to be had15:24
swifterdarrellI'm sort of borrowing my main hardware, so I can't just wipe the drives of non-losf storage policies/data15:24
alecuyerBeyond using the native golang leveldb lib, I don't think it would make much of a difference (you can choose the maximum size of a volume, defaults to 5GB)15:24
alecuyerIf I understand what you're saying, it starts out faster, but as you create more objects, it starts to slow down, right ?15:25
alecuyer(vs "normal" swift)15:25
swifterdarrellno, more like losf is consistently faster than "normal" at concurrency X for lower values of X, but there is a level of client-concurrency for which normal swift seems consistently faster15:26
alecuyerah, thanks, I see15:26
swifterdarrellOh, I may have not compiled with native golang leveldb? let me check that15:26
alecuyeryou should have both in the golang binary15:26
alecuyerthere is a switch, let me check15:27
swifterdarrellI think I may have been scared of it since it was a non-default thing15:27
alecuyeryes, we've only switched a couple of machines to use that15:27
alecuyerbut for tests it may be worth it :)15:27
alecuyerconf.get('use_go_leveldb', 'false')15:28
alecuyerso, in the object server configuration, if set to true, it would start the golang binary with the correct option to use the native go leveldb implementation15:28
swifterdarrellcool!15:29
alecuyerof course I don't know if that would help the test :) it should lower the CPU usage, at least15:29
swifterdarrellI'll try that15:29
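Pulling together what alecuyer describes above, flipping to the native Go leveldb implementation would be a one-line object-server config change. A minimal sketch, not taken from the losf branch: the section layout follows stock Swift convention, and everything except the `use_go_leveldb` option itself is an assumption:

    [app:object-server]
    use = egg:swift#object
    # Confirmed above: read via conf.get('use_go_leveldb', 'false').
    # When true, the golang index-server binary is started with the
    # option to use the native Go leveldb implementation.
    use_go_leveldb = true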
swifterdarrellI did notice short spikes of 6+ cores getting used by the golang procs, I guess as the data structure was getting merged/consolidated or whatever?  (while under constant decent ingest workload)15:30
swifterdarrellI think I saw somewhere the replication IP or port getting used by something for losf where I'd initially expected the cluster IP to be used; wasn't a big deal and I switched my systemd service for the rpcmanager to use an `object-server/2.conf` that had replication IP/port stuff in it15:32
*** joeljwright has quit IRC15:32
swifterdarrellah, it was the filtering done in ObjectRpcManager.get_policy2devices()15:33
alecuyerYes, that might be leveldb doing its "compaction", it does that in the background. We have seen that in tests, it would be worth checking again - I haven't profiled it in quite some time (we also had issues with Go garbage collection, but they should have been fixed)15:33
alecuyerAh, that's not something we've seen but we have a pretty much "fixed" configuration with our set of parameters, so maybe (probably!) I missed something and it's not doing the right thing wrt another configuration15:34
swifterdarrellI don't see it as "right" or "wrong", per se, it's just a matter of whether what it's doing is semantically primarily "replication related" or "normal data" related.15:36
alecuyerbut hm the index server should not be listening on any port actually, only the unix socket15:36
swifterdarrellAIUI, the rpcmanager is used for both, really, right? reading/storing obj data as well as delivering listings and obj data for replication/reconstruction activities?15:36
swifterdarrellI think what happened was that no devices were "seen" because they got filtered out15:37
alecuyerthe rpcmanager would only store the name and location of files (the location being: volume index + offset within volume)15:37
swifterdarrellbecause the config's notion of "my IP" is used somewhere15:37
alecuyerthe actual file content, and metadata (xattrs) are all stored only in the volumes, and that's written directly by the python code, it does not go through the index server15:38
swifterdarrelland in our config, the `object-server/1.conf` IP was different from the 2.conf's (data vs. replication); i.e. our obj data and replication traffic was configured to run over separate networks15:38
alecuyerok, I'll have to try that configuration and find what is happening15:39
swifterdarrellso it's an interaction between `from swift.common.ring.utils import is_local_device` and get_policy2devices() filtering on is_local_device(...replication_ip, replication_port)15:40
swifterdarrellwhen ring devices have the same values for ip/replication_ip and port/replication_port, you'd never have an issue15:41
alecuyerThanks ! I'll check it. It does sound wrong because we don't mean to change the behavior of these functions15:41
alecuyerWhich tool are you using for load testing ?15:43
swifterdarrellit'd be more cumbersome, but get_policy2devices() could filter with is_local_device(...ip, port) or is_local_device(...replication_ip, replication_port);  i.e. select all devices "local" with respect to regular ip/port OR replication ip/port15:43
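A minimal sketch of the OR-filter swifterdarrell proposes here; the helper name and its parameters are hypothetical, though `is_local_device(my_ips, my_port, dev_ip, dev_port)` is the real signature in swift.common.ring.utils:

    from swift.common.ring.utils import is_local_device

    def local_devices(ring_devs, my_ips, port, replication_port):
        """Keep devices that are local by EITHER the client-facing
        ip/port OR the replication ip/port, so split-network setups
        (1.conf vs 2.conf) are not filtered down to zero devices."""
        return [
            dev for dev in ring_devs
            if dev and (
                is_local_device(my_ips, port,
                                dev['ip'], dev['port']) or
                is_local_device(my_ips, replication_port,
                                dev['replication_ip'],
                                dev['replication_port']))
        ]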
swifterdarrellmostly ssbench15:43
swifterdarrellthough we may also use a wacky ingest-only test tool a customer was using that's golang & S3 API15:43
swifterdarrellone of our guys may use cosbench, but he had a better time with ssbench, so he may not15:44
swifterdarreller, "... so he may not use cosbench, I don't know"15:44
alecuyerOK (Tim sent us a stack where it was unable to rename an EC frag to "#d"; I'll try to reproduce that, I think Tim mentioned concurrency).15:44
*** gyee has joined #openstack-swift15:46
swifterdarrellya, I had a 12-hr run going, I'd have to check but I think it was mixed-but-write-heavy, and with about 2 hrs left in the run, the req/s dropped down to a new, lower plateau and we started seeing a consistent small level of 503s15:46
swifterdarrellWhen I started a fresh run a few hours later, it was at the lower plateau of req/s and the 503s were still there15:46
swifterdarrellI believe a full volcheck of all the losf storage locations (which took a long time of course) cleared it up15:47
alecuyerdoesn't sound good :/  the 503s were all caused by the rename issue ? (if you know)15:47
swifterdarrellI think so?15:47
swifterdarrellthey appeared spread around all 3 nodes and volumes... i.e. it wasn't like one storage disk got messed up in isolation15:47
alecuyerok that's really something I ought to reproduce. We normally don't have to reconstruct the DB outside cases where the filesystem/disk failed (missing files in the DB, etc)15:48
swifterdarrellI did see some objs in vols but not in the KV store without filesystem/disk failures, at least not any disk/filesystem failures I noticed15:49
swifterdarrellssbench is pretty "mean" :D15:49
alecuyerwithout any process being killed ? (OOM or otherwise)15:49
alecuyerit's a good thing ;)15:49
swifterdarrellI don't _think_ OOM was in that run; later, I dropped the RAM ceiling in the servers w/ the `mem=14G` linux kernel command-line arg, and when it was set smaller than 14G I was seeing the OOM killer active15:50
swifterdarrellI was trying to force heavy memory pressure to see if losf would do better when the xfs inode slab working set was unable to fit in physical RAM15:50
swifterdarrellbut it was mostly just bad for both "normal" and losf, and I didn't get any results I was confident in15:51
alecuyeryes I see, that's our use case (not enough memory to run with regular xfs)15:51
swifterdarrellbrb gotta make coffee15:51
swifterdarrellmy cat's refusing to make it for me15:51
alecuyertakes a lot of training for a cat :-)15:51
alecuyerI can see how you can have more files in the vols vs the KV store. If the index-server is down, the write to the volume can still happen, but registering the file in the KV will fail. Later, volcheck may pick it up (if it hasn't been overwritten yet). We've seen it happen, we would probably want to immediately erase/punch that file if the RPC call fails15:57
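A simplified sketch of the failure mode alecuyer describes, where the volume write succeeds but KV registration fails; all of the names here (append, register_file, punch_hole, RpcError) are hypothetical stand-ins, not the losf branch's actual API:

    class RpcError(Exception):
        """Stand-in for a failed call to the index-server."""

    def put_object(volume, kv_rpc, name, data):
        # The Python object server writes content + metadata directly
        # into the append-only volume; this does not involve the KV.
        offset = volume.append(data)
        try:
            # Only the name and location (volume index + offset) are
            # registered in the KV via the index-server RPC.
            kv_rpc.register_file(name, volume.index, offset)
        except RpcError:
            # Orphaned volume entry: punch it immediately (as suggested
            # above) rather than waiting for a volcheck pass to notice.
            volume.punch_hole(offset)
            raise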
swifterdarrell(back)16:07
swifterdarrellso in that respect, is the addition to the KV store analogous to writing an object and then trying to update the container server? sometimes that fails and an async pending is written to get it done later, async?  Only in this case, the "async pending" would get dynamically noticed and fixed by a later volcheck?16:07
alecuyerthat's right, but it shouldn't fail regularly (on a regular machine you should never see an RPC call fail)16:09
alecuyerwe mean volcheck to catch pending writes after a power loss/kernel panic16:09
swifterdarrellyeah, this was under heavy load16:09
*** tesseract has quit IRC16:11
swifterdarrellOTOH, I just did a ~15 hr EC ingest-only, tiny file ssbench run last night with 60 concurrent users and it was pretty solid; no 503s that I can see in graphs16:11
swifterdarrellhttps://gist.github.com/dbishop/2163d82ac1a795a6afae2919848c05ae16:12
alecuyerok! thanks a lot for sharing these results :) and I will look at that renaming issue16:12
swifterdarrellso 1 error and 0.03% of reqs needed to be retried16:12
alecuyerthis is for a single object server ? or the cluster ?16:14
swifterdarrellthat's a whole cluster16:14
swifterdarrell3 storage nodes, 1 proxy node 10G networking16:15
alecuyerok16:15
swifterdarrellAnother question I had was: what drove the losf project? One thing we're looking at is filling in the "benefit" part of a "cost to benefit ratio".  I think I understand the costs of getting losf out to our customers (helping with upstream, helping with testing, integration w/our product's management/monitoring, etc.), but I'm trying to get a handle on the benefits16:15
alecuyerThat makes sense16:16
swifterdarrellSo far, I haven't seen a place where losf really crushes "normal"; better sometimes, maybe worse sometimes.  So any tips on situations or areas where losf really shines would be helpful16:17
alecuyerI'll try to explain our case16:17
alecuyerwe had these large clusters, that were converted to EC16:18
swifterdarrellOne thing we can't know or replicate very well atm is if there was a case where "normal" swift simply wasn't working well enough to be said to "be working" with, say, tons of small files. If losf took the cluster from "not working" to "working", that's a huge benefit, just not necessarily a bump in some synthetic benchmark16:18
alecuyerwe did not foresee the problem with EC and small files16:18
alecuyeryes so thats what happened16:19
swifterdarrelldid it manifest as OOM activity from extreme memory pressure?  Or just really degraded performance?16:19
alecuyerwe had all these machines, with not much memory, and we went over a threshold16:19
alecuyerfirst degraded performance16:19
alecuyerand then issues with XFS16:20
alecuyerlike, we had dedicated machines with hundreds of GB of RAM to which disks were plugged, because xfs_repair wouldn't run on a regular machine16:20
alecuyerXFS corruptions16:20
alecuyernot so much OOM16:21
swifterdarrellk, gtk16:21
alecuyerand we saw with eBPF,16:21
alecuyerthat most of the IO load (lots of disks were 100% busy) was caused by reading directory content16:21
alecuyerso REPLICATE was extremely costly16:21
alecuyerupgrading the hardware was not a possibility in our case16:22
alecuyerso that's the initial reason. Now we'll have to evaluate if we need it everywhere (we do not run LOSF everywhere)16:23
swifterdarrell*nod*16:23
alecuyer(and I'll say it worked.. these 3 clusters were basically unusable for a while… :/ now it's ok, and dispersion is getting under control :) )16:24
swifterdarrellthat was my next question: have you guys found any situations where losf is contraindicated--like where it'd be worse than just leaving things "normal"16:24
swifterdarrellnice!16:24
swifterdarrellI saw an active migrator tool (I think); is that state of the art? I think I heard that for migration you would replace a node w/empty disks using losf and let replication fill it; was the active migration thing added later to do that more efficiently?16:25
alecuyerI don't see that it would break a specific workload, but I wouldn't use it where it's not needed (if you don't have small files the benefits are not obvious to me, and you have an extra component to take care of)16:25
swifterdarrellone problem we have is that customers really have no idea if they have small files; and if they say they don't, they're quite often actually wrong16:26
alecuyerthat's an old version of a test we did, I didn't mean to add it and that's not actually what we used, but something similar, yes16:26
alecuyeraha I hear you :-)16:26
swifterdarrellso it's not like we could let customers say "I have only large files" and not use losf because 1 yr later the support call would come in and we'd find small objects everywhere ;)16:27
alecuyerCertainly with drives getting larger and larger LOSF might be useful at a higher object size (unless you add memory proportional to drive size)16:28
alecuyerheh16:28
swifterdarrell*nod*16:28
swifterdarrellhave you run into any CPU utilization issues w/leveldb using more CPU than "normal" swift w/same hardware and workload? (I think I observed higher cpu usage for losf in my limited testing)16:28
swifterdarrellnot that I ran out of cpu, just that obj nodes were less idle16:29
alecuyerdefinitely, that's something we want to look at16:29
alecuyerit's using more CPU everywhere, and it's been an issue on Atom CPUs16:29
swifterdarrellbut i could see cheap-and-deep low-cpu, high disk-count chassis getting into trouble16:29
alecuyer(for others, we had CPU to spare so no problem despite higher usage)16:29
swifterdarrellheh, those atom procs really suck16:29
swifterdarrellgtk16:29
alecuyer(high disk count, we have two drives with atoms, mostly, and yet..)16:29
alecuyerso, native golang leveldb should help16:30
alecuyerand then, we did not want to add code, but we considered caching the partition list, things like that16:30
swifterdarrellwe had some engineering sample atom gear that I had our guys throw away when we moved recently because it was so bad... felt like a 1st gen raspberry pi16:30
alecuyer(caching is not the word I'm looking for, "computing"  and keeping in RAM?)16:30
alecuyerwe tend to get old hardware that nobody wants for some of these clusters so we have to make do :)16:31
swifterdarrellhehe16:31
alecuyerSo I see how you have to consider if it's worth it in your use cases16:31
swifterdarrellcool, well thanks so much for the info and I'll be in here more to ask more questions and share results and stuff from our testing16:31
alecuyerno problem, feel free to ask; happy to share what I can, and thanks for your feedback!16:32
swifterdarrellawesome, thanks!16:32
alecuyerhave a good day and weekend after, bye !16:32
swifterdarrellyou too!16:32
zaitcevSo, are there places that run out of inodes? I thought XFS could run out of inodes, just like traditional filesystems.16:38
zaitcevLoSF should help with that.16:38
swifterdarrellnot sure... on a 3T drive in the cluster I've been fiddling with, looks like I have 586,053,312 inodes.  I guess you have, what, 1 inode per dir & 1 per file?  Is the average dir inode overhead per object somewhere just north of 1/4096? (in normal swift, figuring that there's 4096 possible hash dir thingies, did I do that math right?)16:42
alecuyerzaitcev: I don't remember if that happened, I think Romain will know.  But even if you don't hit the limit it helps to have a sane number of inodes, because at some point xfs_repair does not run (or you need obscene amounts of memory)16:43
swifterdarrell*nod*16:43
alecuyernow I have to go pick up the kid, see you all16:44
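On swifterdarrell's inode math above: in "normal" swift an object lands at objects/<partition>/<suffix>/<hash>/<timestamp>.data, where the suffix is 3 hex digits (the 4096 possibilities mentioned) but the hash dir is per object, not shared; so the directory overhead is closer to one extra inode per object than to 1/4096. A back-of-the-envelope sketch with made-up disk numbers:

    # Hypothetical figures for one object disk; only the layout (one
    # .data file and one hash dir per object, up to 4096 suffix dirs
    # per partition) comes from swift's on-disk format.
    objects_on_disk = 20_000_000
    partitions_on_disk = 4_000

    file_inodes = objects_on_disk         # one .data file per object
    hashdir_inodes = objects_on_disk      # each object gets its own hash dir
    suffix_inodes = min(partitions_on_disk * 4096, objects_on_disk)
    partition_inodes = partitions_on_disk

    total = file_inodes + hashdir_inodes + suffix_inodes + partition_inodes
    print(total / objects_on_disk)        # ~2.8 inodes per object in this example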
*** zaitcev_ has joined #openstack-swift17:47
*** ChanServ sets mode: +v zaitcev_17:47
openstackgerritThiago da Silva proposed openstack/swift master: Add func test of changes versionining modes  https://review.opendev.org/67827717:49
*** zaitcev has quit IRC17:51
*** zaitcev_ has quit IRC18:38
*** gyee has quit IRC18:42
*** e0ne has quit IRC18:46
*** gyee has joined #openstack-swift18:46
*** swifterdarrell has quit IRC18:51
*** swifterdarrell has joined #openstack-swift18:51
*** ChanServ sets mode: +v swifterdarrell18:51
*** zaitcev_ has joined #openstack-swift18:52
*** ChanServ sets mode: +v zaitcev_18:52
*** swifterdarrell has quit IRC18:57
*** zaitcev_ has quit IRC19:01
rledisezzaitcev_: we never ran out of inodes. at max, we had 70M inodes per 6TB disk (36 disks and 64GB of RAM per server). we would have needed about 1.5TB of RAM to fit all the inodes in the VFS cache. so, we saw that half of the IO budget of the disks went to fetching inodes, the other half to serving the user's data. that was the goal of LOSF: reduce the IO waste19:02
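rledisez's numbers check out with some quick arithmetic; the bytes-per-cached-inode figure below is derived from his totals, not measured:

    inodes_per_server = 70_000_000 * 36   # 2.52 billion inodes across 36 disks
    ram_needed = 1.5e12                   # ~1.5 TB to cache every inode, per rledisez
    server_ram = 64e9                     # 64 GB actually installed per server

    print(ram_needed / inodes_per_server) # ~595 bytes of VFS cache per inode
    print(server_ram / ram_needed)        # ~4% of inodes can be cached at once,
                                          # which is why half the IO budget went
                                          # to re-fetching inodes from disk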
DHEhow is losf going anyway?19:14
*** zaitcev_ has joined #openstack-swift19:15
*** ChanServ sets mode: +v zaitcev_19:15
openstackgerritMerged openstack/swift master: py3: Switch swift-dsvm-functional-py3 to run tests under py3  https://review.opendev.org/67704719:24
rledisezDHE: sorry, i'm not sure i get your question. do you mean "is it working fine?". if that's it, yes, it's running in production in some of our clusters. as alecuyer said, it basically saved the clusters. the real primary goal was to make replication work, and it is working now. it never ate any data as far as we know, which is encouraging :)19:27
DHEalmost merge time?19:30
*** swifterdarrell has joined #openstack-swift19:31
*** ChanServ sets mode: +v swifterdarrell19:31
rledisezthere is still work to do before that ;)19:33
timburkegood afternoon20:01
swifterdarrellalecuyer: rledisez: is it expected to have lots of relatively small volume files during small-file EC ingest?  Here's what volume sizes look like in a volumes dir for my losf EC policy I've been benchmarking with lots of small file writes: https://gist.github.com/dbishop/81dab0269ad0859d7890bcb23b17a4b320:39
swifterdarrellthat's 39 GiB across 7749 volume files for avg 5.4 MB/volume20:41
rledisezswifterdarrell: a volume stores data for only one partition (so that when a partition is rebalanced, the volume can be removed, reducing the number of extents). because the volumes are append-only (they act like a journal that can be replayed in case of crash), if you have concurrent uploads for a partition, you need multiple volumes per partition (at least one per connection). to avoid creating too many volumes, there is a limit (configurable, I think) of 10 volumes per partition.20:47
rledisezi think you are in this situation of having multiple volumes for your partitions because you had concurrency20:47
swifterdarrellOh I definitely had concurrency!20:47
swifterdarrellWith long-running concurrent ingest does volume count grow without bound or is it bounded by something like partition_count * concurrency?20:48
rledisezit is bounded by partition_count * volume_per_partition_limit20:48
rledisezwell, i should say partition_count * min(concurrency, volume_per_partition_limit)20:49
rledisezapprox :)20:49
swifterdarrellgotcha20:49
swifterdarrellyeah, I was assuming approx :)20:49
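The bound rledisez gives, sketched and checked against the gist numbers above; the partition estimate is inferred from the discussion, not something either of them stated:

    def max_volumes(partition_count, concurrency, volume_per_partition_limit=10):
        # Approximate bound from the discussion: one volume per
        # concurrent writer per partition, capped by the limit.
        return partition_count * min(concurrency, volume_per_partition_limit)

    # 7749 volumes on one disk with the default limit of 10 implies the
    # workload touched at least 7749 / 10 ~= 775 partitions there.
    print(7749 / 10)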
swifterdarrellfwiw, toward the end of my last small-file losf EC ingest test I was getting ~55 req/s; under the same conditions, with the `use_go_leveldb = true` change as the only difference, a subsequent similar run I just started is getting ~62-63 req/s20:55
swifterdarrellnot quite apples to apples wrt how far into the run the sampling is, but not surprising, either20:56
rledisezthat's interesting. we never thought it could improve PUT performance (not that much, at least). we were only expecting to save a lot of CPU and time on REPLICATE20:58
rledisezgtg, have a nice week end everybody21:00
*** takamatsu has quit IRC21:55
*** rcernin has joined #openstack-swift22:11
*** BjoernT has quit IRC22:18
*** zaitcev_ has quit IRC22:39
*** zaitcev_ has joined #openstack-swift22:53
*** ChanServ sets mode: +v zaitcev_22:53
*** rcernin has quit IRC23:11
*** rcernin has joined #openstack-swift23:12
*** zaitcev__ has joined #openstack-swift23:15
*** ChanServ sets mode: +v zaitcev__23:15
*** zaitcev_ has quit IRC23:19
