21:00:00 #startmeeting swift
21:00:01 Meeting started Wed May 13 21:00:00 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 The meeting name has been set to 'swift'
21:00:09 who's here for the swift meeting?
21:00:14 o/
21:00:18 hi
21:00:55 hello
21:01:05 o/
21:02:11 clayg, mattoliverau, alecuyer?
21:02:19 o/
21:02:35 agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:53 #topic PTG
21:03:22 first up, the usual reminder that we'll have our virtual PTG in just a few weeks
21:03:43 mattoliverau has been doing a great job of working to organize it
21:04:06 and you guys have been great about adding topics to the etherpad
21:04:11 #link https://etherpad.opendev.org/p/swift-ptg-victoria
21:05:24 mattoliverau added some times for us to https://ethercalc.openstack.org/126u8ek25noy
21:05:37 is there any info anywhere on how it will actually work? I read somewhere about reserving "rooms", but I'm not sure I understand what that means
21:06:04 i'll make sure they get added to the etherpad
21:07:03 tdasilva, that's a great question! unfortunately i've been a bit distracted, so i'm not entirely sure myself. i'll find out more and drop what i find out (and probably links to mailing list messages) in -swift
21:07:41 timburke: no worries, I can go dig for it too
21:08:12 are there any other questions about the PTG? i'm not sure i'll be able to answer them right away, but i should be able to do some research
21:10:05 all right, on to my most recent distraction, then :-)
21:10:14 #topic object updater
21:10:50 so last week i shared that one of our clusters had a *lot* of async pendings
21:11:05 good news!
we're doing much better now :-)
21:11:28 better news: we've written up a bunch of bugs describing things we saw going wrong
21:12:27 best news (?): patches are starting to come up -> https://review.opendev.org/#/c/727876/1
21:12:28 clayg did the lion's share of the investigation -- the core of it was
21:12:28 patch 727876 - swift - Breakup reclaim into batches - 1 patch set
21:12:33 #link https://bugs.launchpad.net/swift/+bug/1877651
21:12:33 Launchpad bug 1877651 in OpenStack Object Storage (swift) "Reclaim of tombstone rows is unbounded and causes LockTimeout (10s)" [Medium,In progress] - Assigned to clayg (clay-gerrard)
21:12:47 yeah, that's the stuff :-)
21:15:08 rledisez: i don't know yet how this affects my thinking about the on-disk layout for the per-container stuff
21:15:54 tl;dr: after pushing our container-replicator cycle time up by about a hundred-fold, we got our updaters from not keeping up with incoming asyncs to net-reducing them by about 11M/hr
21:16:06 clayg: me neither. i'll make sure to go look at these patches; i'll see then if it still makes sense
21:16:16 we had a node offline for a few days to get a nic swapped, and we have a bunch of async pendings all over the cluster, spread across all the containers on that node...
21:16:31 ...
which is kind of different than the situation you're optimizing for
21:16:49 along the way we noticed some workers dying off due to https://bugs.launchpad.net/swift/+bug/1877924
21:16:49 Launchpad bug 1877924 in OpenStack Object Storage (swift) "object-updater should be more tolerant of already-removed async pendings" [Undecided,In progress]
21:17:22 we did some sharding (which helped, but it could be better; see https://bugs.launchpad.net/swift/+bug/1878090)
21:17:23 Launchpad bug 1878090 in OpenStack Object Storage (swift) "object-updater should remember redirects and proactively check whether an update should be pointing at a shard instead" [Undecided,Confirmed]
21:18:16 did some config changes that restarted updaters cluster-wide and caused some minor heart attacks due to https://bugs.launchpad.net/swift/+bug/1878056
21:18:16 Launchpad bug 1878056 in OpenStack Object Storage (swift) "object-updater should shuffle work before making requests" [Undecided,In progress]
21:18:58 but at the end of the day, we've got a really great system!
21:19:41 💪 ten years of sweat and tears has to be good for something - it's resilient and flexible if nothing else
21:20:29 (speaking of -- swift's birthday is in four days! 10 years running in prod!)
21:21:02 i also filed a couple more: https://bugs.launchpad.net/bugs/1877662 https://bugs.launchpad.net/bugs/1877663 https://bugs.launchpad.net/bugs/1877665
21:21:02 Launchpad bug 1877662 in OpenStack Object Storage (swift) "Magic number for per_diff in rsync_then_merge" [Low,New]
21:21:03 Launchpad bug 1877663 in OpenStack Object Storage (swift) "Default db_replicator per_diff is the degenerate configuration" [Low,New]
21:21:04 Launchpad bug 1877665 in OpenStack Object Storage (swift) "Database WAL PENDING_CAP should be configurable" [Medium,New]
21:21:16 i'm not sure how many I'll get to before I go back to waterfall-ec
21:21:17 congrats on the birthday
21:21:31 10 years, awesome! is swift one of the oldest object storage systems?
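[Editor's aside: bug 1877651 above is about a single unbounded DELETE of tombstone rows holding the SQLite write lock long enough to trip the 10s LockTimeout; patch 727876's direction is to break that reclaim into batches. A minimal sketch of the batching idea, using a deliberately simplified schema and invented names rather than Swift's actual broker code:]

```python
import sqlite3


def reclaim_in_batches(conn, age_timestamp, batch_size=10000):
    """Delete reclaimable tombstone rows a batch at a time, committing
    between batches so no single transaction holds the write lock long
    enough to starve other readers/writers. (Hypothetical helper on a
    simplified 'object' table, not Swift's real schema or API.)"""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM object WHERE rowid IN ("
            "  SELECT rowid FROM object"
            "  WHERE deleted = 1 AND created_at < ?"
            "  LIMIT ?)", (age_timestamp, batch_size))
        conn.commit()  # release the lock so other work can interleave
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
```

The rowid-subquery form is used because plain `DELETE ... LIMIT` needs an optional SQLite compile flag; the subquery works on stock builds.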
21:21:41 i guess none of them really have anything to do with big containers just eating PUTs slowly - but we're 100% on the sharding train and trying to do more and go faster
21:22:00 so the whole "500M row database" might not be a thing we keep caring about 🤔
21:23:06 rledisez, i'm not sure, good question...
21:23:20 Sage defended his dissertation on Ceph in 2007, and it was running on Bluehost, so that's longer than Swift.
21:23:29 clayg, we need to help out mattoliverau with getting some *real* auto-sharding going :D
21:24:02 😬
21:24:11 timburke: (operator-speaking) it's clearly the next big thing
21:24:32 zaitcev, otoh, https://en.wikipedia.org/wiki/Ceph_(software) lists their initial release in 2012
21:24:34 *shrug*
21:24:59 o/ sorry I'm very late, kinda slept in.
21:25:07 no worries!
21:27:04 anyway, i think that's about all i've got by way of post-mortem -- i've been playing at putting a story together with graphs and everything; i'll see what comes out of that (and what i can share :-/ this working-for-a-Big-Corp thing is kinda new to me)
21:27:50 only other thing i've got for the agenda is an update for LOSF
21:28:15 but i think alecuyer said he wasn't going to be able to make it to discuss?
21:28:19 so, no update this week, alecuyer was busy on another project
21:28:37 👍
21:28:44 i know how it goes ;-)
21:29:52 oh, last-minute topic! i think i'd like to get a swiftclient release out soon. i missed the window to get versioning support out in ussuri, but it's sitting there on master -- we should publish it!
21:30:38 it'll also bring some recently-landed application credential support, and i'll see what other client patches might make sense to get merged in the next week
21:31:19 if anyone has any they want to get in, please let me know (or update the priority reviews page)
21:31:26 #topic open discussion
21:31:49 anything else we should talk about?
21:33:02 Nothing here.
21:33:13 A note from earlier: the time booking is also booking the room.
So I booked the same location (different times) for the other. We're in Liberty... Now I'm not sure what that means; I assume it's some virtual room
21:33:32 *PTG
21:33:39 Damn autocorrect
21:33:44 Well, the Dark Data patch is stuck. I didn't work on it in a while. I think I addressed Romain's concerns.
21:35:36 I'll try to go look at it while I'm on updater patches
21:36:13 I don't think Romain is a great candidate for the dark data auditors - it's like part power adjustment or container sync - it won't work for everyone to their satisfaction; still useful
21:36:54 crap, i'd meant to look at that, hadn't i...
21:36:56 He's the only one who might have some for this to actually find. My test cluster seems to have none at all.
21:37:06 Is there a question of whether it should be maintained upstream?
21:37:31 zaitcev: no dark data is good!
21:37:52 Anything that's not upstream rots. Just look how well it worked for swift3 and swauth.
21:37:54 it normally takes something to fall over pretty good and be missed for awhile
21:38:38 counter-point: container-sync
21:39:17 timburke: but rledisez is gunna look at that again, and... Gil before that? Alistair was a container-sync fan. 1space container crawler is based off container-sync!
21:39:21 (just playing devil's advocate -- i'm all for having audit-watchers upstream, and the feature makes a lot more sense when there's at least one user)
21:40:18 well, unless it's broken; in principle I'm in favor of merging it - with docs
21:40:37 some iterative review would be a great pre-PTG project
21:41:51 For me, the feature itself has value, but Sam wanted a general API. Does that need still exist, and do we have people modding the auditor with additional functions?
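[Editor's aside: the "general API" question above is easier to picture with a toy example. Below is a hypothetical sketch of what a pluggable audit-watcher hook could look like -- every class and method name here is invented for illustration and is not Swift's actual interface.]

```python
class DarkDataWatcher:
    """Example watcher plugin: counts objects that aren't listed in any
    container, a simplified stand-in for the real dark-data check."""

    def __init__(self):
        self.dark = 0
        self.total = 0

    def see_object(self, metadata, is_listed):
        # Called once per audited object.
        self.total += 1
        if not is_listed:
            self.dark += 1

    def end(self):
        # Called when the audit pass finishes; return a summary.
        return {'dark': self.dark, 'total': self.total}


def audit(objects, watchers):
    """Toy auditor loop: walk the objects once and fan each one out to
    every registered watcher, then collect the watchers' summaries."""
    for metadata, is_listed in objects:
        for watcher in watchers:
            watcher.see_object(metadata, is_listed)
    return [watcher.end() for watcher in watchers]
```

The point of such a shape is that adding a new check means registering another watcher, rather than writing a whole extra auditor pass over the disks.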
21:42:17 lord knows I'm going to need some help/review https://review.opendev.org/#/c/727876/1/test/unit/account/test_backend.py@209
21:42:18 patch 727876 - swift - Breakup reclaim into batches - 1 patch set
21:43:27 i seem to recall that we have at least one audit-watcher-like thing that was implemented as a whole extra auditor (alongside ZBF and ALL) -- i forget what it was for offhand, though
21:44:34 The math sounds like an interesting challenge, although this probably can just be tested by running a few boundary conditions through.
21:44:37 zaitcev: can you add dark data and auditor watchers to https://etherpad.opendev.org/p/swift-ptg-victoria
21:46:20 how about this as an idea, too -- i'm assigning homework! everyone find 1-5 patches they'd like to see progress on over the course of PTG week. add them to the priority reviews wiki
21:47:05 timburke: can they all be patches I write or review between now and PTG week?
21:47:29 i'll put a new section in for it. list of patches, then maybe a sub-bullet for irc nicks interested
21:47:38 I wonder if we could use audit watchers to let the auditors walk the filesystem and trigger events like replication, sharding, etc. That would mean much less filesystem walking by almost all the daemons (just thinking out loud and pre-coffee).
21:48:11 mattoliverau: i like that, I kind of thought about something similar last week
21:48:12 kinda like our old in-person sticky-notes-and-dots system at swift hackathons
21:48:40 split the "job producer" from the "job executor"
21:49:07 mattoliverau, rledisez yes!
i've thought about that too -- gets us closer to a central point for all i/o scheduling
21:49:17 we are taking this approach to scale container-sync => we split container-sync: one part crawls the DB, another executes what needs to be synchronized
21:49:19 yeah, might screw up our tunings on different daemon intervals, but less io is less io
21:52:07 i want this even more with generic task queues
21:52:53 let's do it - let's build generic task queues - single producer, multi executor for all daemons - and s3 bucket policies for multi-site replication and expiry
21:53:50 sounds like another PTG topic :)
21:54:04 or two
21:54:18 I wish we had this discussion before I had to book times. I could have tried to find some more slots.
21:54:36 hahaha
21:54:42 we'll make do
21:54:46 worst case, we keep talking in irc :D
21:55:07 i'll just need to make sure i have enough coffee at home
21:55:57 all right, let's let mattoliverau, seongsoocho, and kota_ go have breakfast
21:56:10 thank you all for coming, and thank you for working on swift!
21:56:15 #endmeeting
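[Editor's aside: the "single producer, multi executor" split discussed near the end of the meeting can be sketched in a few lines. This is a generic toy pattern with invented names, not a proposal for Swift's actual daemons: one producer walks the work and feeds a bounded queue, and a pool of executors drains it.]

```python
import queue
import threading

def run_task_queue(produce, execute, num_executors=4):
    """Run a single-producer, multi-executor pipeline: produce() yields
    task items, execute(item) handles one item. The bounded queue gives
    backpressure so the producer can't race far ahead of the executors."""
    tasks = queue.Queue(maxsize=100)
    done = object()  # sentinel marking end-of-work

    def worker():
        while True:
            item = tasks.get()
            if item is done:
                tasks.put(done)  # pass the sentinel on to the next worker
                return
            execute(item)

    workers = [threading.Thread(target=worker) for _ in range(num_executors)]
    for w in workers:
        w.start()
    for item in produce():
        tasks.put(item)
    tasks.put(done)
    for w in workers:
        w.join()
```

In the meeting's framing, `produce` would be the crawler (DB or filesystem walk) and `execute` the replication/sync/update work, so only one daemon pays the walking cost.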