21:00:05 <timburke> #startmeeting swift
21:00:06 <openstack> Meeting started Wed May  6 21:00:05 2020 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:09 <openstack> The meeting name has been set to 'swift'
21:00:12 <timburke> who's here for the swift meeting?
21:00:18 <seongsoocho> o/
21:00:36 <rledisez> hi o/
21:00:54 <kota_> hi
21:00:57 <mattoliverau> o/
21:01:13 <alecuyer> o/
21:01:50 <clayg> o/
21:02:04 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:16 <timburke> #topic PTG
21:02:53 <timburke> mattoliverau very graciously offered to help with PTG planning/organization
21:03:04 <clayg> ❤️ mattoliverau
21:03:21 <mattoliverau> hey cool.
21:03:46 <timburke> mattoliverau, what have you learned?
21:03:51 <mattoliverau> So we need to get the registration in and start putting sessions into time slots
21:04:01 <mattoliverau> sorry, irc is lagging
21:04:33 <mattoliverau> So step 1, to get the registration in, I need to know if there are any other projects you're interested in so we can try and avoid overlap
21:04:47 <mattoliverau> I already have storlets and first contact sig
21:05:30 <rledisez> I would be interested in Keystone (things like operator feedback etc…)
21:05:47 <mattoliverau> Next we need to know what time suits people because of timezones. For this I'll come up with a doodle poll and post the link in our channel.
21:06:08 <mattoliverau> rledisez: great!
21:07:20 <mattoliverau> I'll hold the doodle poll, once I've created it, open until the end of Friday (my time, though I can wait a bit longer) just so people have a chance to pick times.
21:07:54 <mattoliverau> I'm guessing meeting time isn't bad, but I'm happy to get up early or stay up late if need be.
21:08:31 <timburke> i'm feeling the same way -- my current plan is to just show up in irc as much as possible that week :-D
21:08:43 <kota_> yup. in the day time, it might be hard because the kids always interrupt me.
21:09:11 <mattoliverau> But step 3 is to make sure all the topics you want to talk about are in the etherpad! So when we have times we can decide the number of blocks and get rooms booked.
21:09:30 <kota_> oic
21:09:56 <mattoliverau> most of this I think happens by the 10th, though this could just be the registration side.
21:09:57 <timburke> #link https://etherpad.opendev.org/p/swift-ptg-victoria
21:10:21 <clayg> so there IS an in-person presence?
21:10:23 <mattoliverau> but I'd like to get some virtual rooms booked before our best times get taken by other projects
21:10:32 <clayg> oh, virtual rooms
21:10:41 <mattoliverau> *needs to happen by
21:11:14 <mattoliverau> room booking spreadsheet
21:11:20 <mattoliverau> #link https://ethercalc.openstack.org/126u8ek25noy
21:11:30 <mattoliverau> if you want to see what it currently looks like.
21:11:36 <timburke> kota_, otoh, we'll get to know each other's families so much better than we typically do just from pictures :-)
21:12:05 <kota_> timburke: :)
21:12:18 <mattoliverau> lol
21:12:19 <kota_> good idea
21:12:32 <mattoliverau> Anyway, sorry for the brain dump
21:13:05 <timburke> don't apologize! that was just the sort of overview i was hoping for and never got around to putting together myself
21:13:21 <timburke> again, thank you so much for taking that on, and sorry i didn't ask for help earlier
21:13:59 <mattoliverau> basically, 1. If there is any project you're interested in, let me know; 2. fill out the doodle poll once I get a link up later today; 3. update the etherpad;
21:14:05 <mattoliverau> timburke: nps
21:14:29 <timburke> #topic object updater
21:14:46 <clayg> updater 😡
21:14:59 <timburke> rledisez, thanks for putting this on the agenda! i think you may have noticed that we've been interested in this lately, too ;-)
21:15:22 <rledisez> Yeah, I thought I'd bring the point up here because some of us are having issues with it (at least we do, at ovh :))
21:15:37 <rledisez> there are mostly 2 issues in my mind:
21:16:17 <rledisez> the first one is that async-pendings can quickly pile up on disks for many reasons (big unsharded container, network issue, hung process, …)
21:16:56 <rledisez> and while some of them will never be able to be handled by the updater (at least not without an operator intervention), some of them can be handled because it was a really transient situation (like a switch reboot…)
21:17:14 <rledisez> the first review is about that: p 571917
21:17:15 <patchbot> https://review.opendev.org/#/c/571917/ - swift - Manage async_pendings priority per containers - 5 patch sets
21:17:29 <rledisez> the blocking point seems to be that it changes the way async pendings are stored
21:18:02 <rledisez> the second issue is that the way we communicate with the container-server is by sending one request per async-pending instead of batching them
21:18:10 <rledisez> that's what p 724943 is about
21:18:11 <patchbot> https://review.opendev.org/#/c/724943/ - swift - WIP: Batched updates for object-updater - 2 patch sets
21:18:27 <rledisez> I thought I would bring this here to get some other points of view
21:19:22 <rledisez> (I'm done with the summary. any questions, remarks?)
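[A minimal sketch of the two ideas above -- grouping async-pendings by container and batching the requests -- assuming the pickled async-pending format and leaving the actual send to the caller; the function names are illustrative and this is not the code in either patch:]

    # Illustrative only: group async pendings by (account, container) so the
    # updater could send one batched request per container instead of one
    # request per pending update.
    import os
    import pickle
    from collections import defaultdict

    def gather_updates(async_dir):
        """Walk an async_pending dir, grouping pending updates by container."""
        batches = defaultdict(list)
        for root, _dirs, files in os.walk(async_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, 'rb') as fp:
                    # assumed: each async pending unpickles to a dict carrying
                    # account, container, obj, op and the headers to forward
                    update = pickle.load(fp)
                batches[(update['account'], update['container'])].append(
                    (path, update))
        return batches

    def iter_batches(batches, batch_size=100):
        """Yield (account, container, chunk) tuples, each meant to become a
        single request to the container servers; the caller would unlink the
        on-disk files once a chunk is accepted."""
        for (account, container), updates in sorted(batches.items()):
            for i in range(0, len(updates), batch_size):
                yield account, container, updates[i:i + batch_size]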
21:19:51 <timburke> so a bit of perspective from clayg, tdasilva, and me: we've got a cluster that's filling up, leading to quotas being implemented, leading to users wanting to delete a good bit of data, often in fairly large containers
21:20:46 <clayg> it sounds like the issue we're having might be slightly different then - we accepted ~350M deletes into some sharded containers with billions of objects and the updaters keep dos-ing the container dbs 🤷‍♂️
21:21:04 <timburke> we're currently sitting at like 450+M async pendings across ~250 nodes, and that's still going up by ~2.5M/hr
21:21:05 <alecuyer> ouch
21:21:36 <clayg> there's just no flow control across the nodes to try and put db updates in at "the correct rate"
21:21:41 <rledisez> clayg: yeah, I'm more looking into treating quickly what can be treated while still trying for the problematic containers
21:22:02 <clayg> we're also learning there's still lots of OTHER updates going into these same containers so we're trying to break up the work and prioritize stuff
21:22:57 <clayg> rledisez: yeah on a "per node" basis we need some way to have "bad containers" somehow get... I guess "error limited" or something like what you've done in the top-of-stack patch where it just "moves on"
21:23:47 <clayg> I'd really like it if AT LEAST per-node we could have a per-container rate limit
21:23:48 <timburke> fwiw, "treating quickly what can be treated" is actually *exactly* what clayg did earlier this week -- run a filtered updater that ignores certain containers, then try running foreground updaters for the remaining ones
21:25:16 <clayg> yeah I don't guess I have that gist up just now, one sec
21:25:58 <rledisez> timburke: I saw that tool, there is a link in the review. my issue with it is that it has to open all async-pendings to filter them. I really want to avoid wasting I/O on that (are you running on SSD guys?)
21:26:02 <clayg> oh, no ... i did, just lost it -> https://gist.github.com/clayg/c3d31a62eba590eebd5f5d257c24a297
21:26:18 <clayg> anyways - this is *useful* but not very scalable or operator friendly
21:27:51 <timburke> no, we just took the io hit -- worst case, it slows down object servers which might apply some backpressure to the clients :P
21:28:10 <timburke> fwiw, another thought i'd had recently was to make the number of successes required to unlink configurable -- if we can get the update to 2/3 replicas, that's *probably* good enough, right?
21:28:14 <clayg> we're not on SSDs but we do have SOME head room on iops
21:28:18 <timburke> let container replication square it
21:28:52 <clayg> I hadn't really considered the overhead of opening an async to parse it only to find out that container is ratelimited 🤔
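[A rough sketch of the per-container "error limiting" idea being discussed -- the class name and thresholds are hypothetical, not anything in the patches; note rledisez's point that with today's layout the updater still has to unpickle every async just to learn which container to check:]

    # Illustrative only: remember which containers keep failing so one
    # problematic container doesn't eat the whole updater cycle.
    import time

    class ContainerErrorLimiter:
        def __init__(self, max_failures=5, cooldown=300):
            self.max_failures = max_failures
            self.cooldown = cooldown
            self.failures = {}  # (account, container) -> (count, last failure time)

        def should_skip(self, account, container):
            count, last = self.failures.get((account, container), (0, 0))
            return (count >= self.max_failures and
                    time.time() - last < self.cooldown)

        def record_failure(self, account, container):
            count, _ = self.failures.get((account, container), (0, 0))
            self.failures[(account, container)] = (count + 1, time.time())

        def record_success(self, account, container):
            self.failures.pop((account, container), None)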
21:28:57 <rledisez> how does replication work on a big unsharded container? I'm not sure it would do a better job
21:29:25 <clayg> @timburke wants to just put all the async updates in a database - I don't want to have to deal with more .pending file lock timeouts
21:29:48 <timburke> i mean, they're big... but not *that* big. ~20M rows or so, i think?
21:30:02 <clayg> timburke: I think rledisez was asking about *un* sharded
21:30:23 <clayg> yeah, sharding was primarily in my mind to fix container replication
21:30:35 <timburke> right, but i mean, we could shard that big shard -- we just haven't
21:30:39 <clayg> replication works great on the shards!
21:30:52 <clayg> timburke: yeah, we have more sharding to do
21:31:16 <timburke> we've been pretty good about sharding the *biggest* guys, we've only got like 2 containers over 50M
21:31:39 <timburke> one of them is actually itself a shard 🤔
21:31:54 <mattoliverau> then shard the shard :)
21:32:16 <alecuyer> I'm wondering if this is "fixable" without throttling DELETEs? Unless you have excess IO capacity in your container servers, something is always going to be lagging in your situation, no? (it would be nice of course to prioritize some things but still)
21:32:47 <clayg> we DO have IO headroom in the container dbs tho
21:33:10 <alecuyer> ok so it's sqlite contention?
21:33:31 <clayg> yeah it's some kind of locking - either we're doing it or sqlite is
21:34:11 <clayg> yesterday it looked like the replication UPDATE request locked up the db for 25s - then FAILED
21:34:37 <clayg> so we're leaving throughput on the floor and it's unclear we're making progress - we have more investigation to do
21:34:58 <clayg> I'll think more on the ordering asyncs by containers - I'm generally pretty happy with the filesystem layout
21:36:05 <rledisez> with a new layout, it would be pretty easy to batch the updates sent to a container, as they'd all be grouped together
21:36:41 <clayg> it hadn't been obviously terrible to me that a cycle of the updater would read and open the 125K asyncs per-disk
21:36:51 <rledisez> it was actually a followup I wanted on the first patch (and also to move "legacy" async-pendings)
21:36:59 <mattoliverau> re container replication, if the 2 containers are generally close in size, we use usync (ie sync a bunch of rows), so maybe only needing quorum successes and letting the containers usync (batched with other updates) is better, rather than waiting for all replicas. Of course large unsharded containers will always be an issue.. we just need to shard the buggers.
21:37:10 <timburke> and i need to work on getting a good process going for manual shrinking -- part of why i'm hesitant to shard is that i think a lot of those shards are going to end up mostly-empty once everything settles, and we haven't really invested in shrinking yet :-/
21:37:33 <clayg> yeah, the batch updates may indeed be useful ... rledisez you're winning me over on the layout change
21:38:20 <rledisez> yay! can I offer you a virtual-beer during the virtual-ptg?
21:38:31 <mattoliverau> I like the idea of splitting the asyncs by container in the sense that at a glance you can see if containers are struggling, but is that too much directory walking i/o, ie listdir? Maybe one per partition, which would map to container replicas?
21:39:01 <clayg> at this point my biggest complaint about changing it is probably just reservations about changing on-disk layouts and legacy migrations etc
21:39:18 <mattoliverau> yeah
21:39:19 <clayg> it's a bunch of work - but maybe it's worth it - thanks for bringing this up
21:39:40 <clayg> I don't think I had a good picture of where your thinking was coming from - it's clearer now
21:39:51 <rledisez> so, the current patch is compatible with the current layout, so nothing breaks during an upgrade. in case of a downgrade, some files would need to be moved
21:40:06 <clayg> mattoliverau: yeah some workloads I'd seen had a BUNCH of containers in their cluster
21:40:25 <clayg> I think of all the dirs we create if a node is offline for awhile
21:40:57 <clayg> instead of "a handful" of problematic containers we get one TLD for each container on a node - which... might still be less than 1M - but in the 100Ks
21:41:42 <rledisez> the cost of listing a directory of 100K entries is not big, but the cost of inserting a new one is not negligible
21:41:51 <rledisez> (I made some measurements during my tests)
21:42:04 <mattoliverau> timburke: yeah, mark as SHRINKING and maybe get a tool to search for the donor etc.
21:42:17 <timburke> hmm... i wonder if a db could still be a good idea -- have the object-server continue dropping files all over the fs, then have the updater walk that tree and load into db before fanning out workers to read from the db...
21:42:56 <alecuyer> sounds good, I'm afraid of having too many files on disk, wonder why :)
21:43:01 <rledisez> why drop a file then? with the WAL, insertion should be quite fast, no?
21:43:26 <mattoliverau> use the general task queue.. and just hope that container gets updates so we don't have to deal with async.. damn :P
21:43:38 <clayg> mattoliverau: 😆
21:43:40 <rledisez> I would then even move to appending to a file, just to avoid too many fsync
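[A minimal sketch of timburke's db idea above -- the object server keeps dropping async files as today, and the updater loads them into a scratch SQLite db before handing out per-container work; the table name, schema, and pickled-dict keys are all assumptions here:]

    # Illustrative only: load async pendings into a local scratch db so
    # workers can pull work one container at a time.
    import os
    import pickle
    import sqlite3

    def load_asyncs_into_db(async_dir, db_path):
        conn = sqlite3.connect(db_path)
        conn.execute('''CREATE TABLE IF NOT EXISTS pending (
                            path TEXT PRIMARY KEY,
                            account TEXT, container TEXT, obj TEXT, op TEXT)''')
        with conn:  # single transaction; WAL mode could cut the fsync cost further
            for root, _dirs, files in os.walk(async_dir):
                for name in files:
                    path = os.path.join(root, name)
                    with open(path, 'rb') as fp:
                        u = pickle.load(fp)
                    conn.execute(
                        'INSERT OR REPLACE INTO pending VALUES (?, ?, ?, ?, ?)',
                        (path, u['account'], u['container'], u['obj'], u['op']))
        return conn

    # a worker could then grab everything for one container at a time, e.g.
    #   SELECT path, obj, op FROM pending WHERE account = ? AND container = ?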
21:45:09 <timburke> ok, this has been a good discussion. are there any decisions or action items we can take away from it?
21:46:25 <rledisez> should we investigate the DB idea?
21:46:44 <rledisez> I'm pretty sure it would be better, but it's also more work so won't be ready soon
21:47:07 <alecuyer> (I have to say I have no updates on LOSF having had no time to work on it this week. rledisez left it on the agenda as I should get some time next week. So, we can have more time for object updater or other topics)
21:47:26 <timburke> alecuyer, thanks for the heads-up
21:47:34 <clayg> i need to investigate the problem in our cluster - that's ahead of me making a decision on the suitability of rledisez's proposed layout change - but I'd like to review that more seriously given the new perspective and anything I learn trying to fix our mess
21:48:57 <timburke> ok, one more crazy idea, that might be a somewhat cheaper way to investigate the db idea: what about a db per container? we could put it in the disk's containers/ directory...
21:50:17 <clayg> I was seriously attracted to this idea because of leverage
21:50:31 <rledisez> yeah, we can just run the container-replicator then
21:50:33 <clayg> "can't stuff it in the primary container - just put it in a local handoff!"
21:50:35 <timburke> i mean, we've already got this db schema for tracking exactly the info that's in these updates...
21:50:48 <clayg> hehhe
21:51:40 <clayg> i was concerned that for AC/O clusters there might be some distaste for adding a container-replicator to your object layer
21:51:48 <timburke> *especially* for the shards -- then you run no risk of proxies getting bad acls from the handoff that got popped into existence when the primaries are overloaded
21:51:56 <clayg> we could just import the container-replicator into the updater and ... well do something
21:52:39 <timburke> true enough! we've already got the sharder doing something not so dissimilar
21:52:41 <clayg> timburke: yeah vivifying containers in the read path is probably not ideal - i was thinking out of band
21:53:22 <mattoliverau> as handoff containers? if so, just make sure you get the rowids close to their parent's. otherwise it might cause an rsync_then_merge and that wouldn't go well on really large containers.
21:54:25 <clayg> mattoliverau: good point
21:55:37 <rledisez> we can avoid that by tuning the limits on the async updater/replicator I guess. but yes, something to take care of for sure
21:56:18 <timburke> i should read the rsync-then-merge code again...
21:56:40 <timburke> all right
21:56:44 <timburke> #topic open discussion
21:56:56 <timburke> anything else we should talk about in these last few minutes of meeting time?
21:58:21 <mattoliverau> nope, /me wants breakfast ;)
21:58:31 <clayg> nom nom
21:58:41 <clayg> i might cook eggs and bacon for dinner 🤔
21:59:02 <mattoliverau> you totally should! :)
21:59:14 <kota_> it seems the kids woke up.
21:59:33 <timburke> :-) then this is as good a time as any to
21:59:41 <timburke> #endmeeting