21:01:01 #startmeeting swift
21:01:02 Meeting started Wed Oct 7 21:01:01 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:03 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:05 The meeting name has been set to 'swift'
21:01:15 who's here for the swift meeting?
21:01:20 o/ (mostly)
21:01:22 o/
21:01:23 o/
21:01:29 o/
21:01:57 agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:05 #topic TC election
21:02:34 just a reminder that there's an election currently being held! vote!
21:03:04 there are 7 candidates for (i believe) 4 seats
21:03:07 #link https://governance.openstack.org/election/
21:03:44 #topic ptg
21:04:20 we're also just a couple weeks out from the virtual PTG!
21:04:36 clayg's done a good job of seeding the etherpad
21:04:39 #link https://etherpad.opendev.org/p/swift-ptg-wallaby
21:05:03 i know *i* need to add some words about ALOs
21:05:30 and probably some other topics
21:06:11 #topic stable releases
21:07:09 there are a couple patches currently working their way through the gate to get changelogs for 2.25.1 and 2.23.2
21:07:22 p 756166 and p 756167
21:07:23 https://review.opendev.org/#/c/756166/ - swift (stable/ussuri) - Authors/ChangeLog for 2.25.1 - 1 patch set
21:07:25 https://review.opendev.org/#/c/756167/ - swift (stable/train) - ChangeLog for 2.23.2 - 1 patch set
21:08:07 one of the big motivations for them (i feel like) is the backported fix for the py3 crypto bug
21:09:18 just making sure people are aware of them and some of the great fixes they include :-)
21:09:36 that's about it for announcements; any questions or comments?
21:10:10 none on my side.
21:11:25 all right, let's talk about some patches!
21:11:36 o/
21:11:58 #topic replication and ring version skew
21:12:01 yeah, ALOs like s3 MPU!!! 👍
21:12:06 #link https://review.opendev.org/#/c/754242/
21:12:07 patch 754242 - swift - Fix a race condition in case of cross-replication - 5 patch sets
21:12:37 rledisez, i said i'd work on getting a dev env where i could actually repro the problem -- sorry, i haven't done that yet :-(
21:12:49 so, as a reminder, this patch only fixes the issue for ssync. i'm working on a patch for rsync, but nothing to propose yet
21:13:12 tests look good - I don't think I had any specific concerns left over from last week - it might be ready to go?
21:13:16 i've been testing the patch in prod, where we were seeing the issue on a very regular basis. so far, so good, but i'll still monitor it closely
21:13:31 i'd like to share this
21:13:39 #link https://dl.plik.ovh/file/OIouMcSnLK2W2kwX/v8teIwe3Gapjjl2p/handoff-lock.png
21:13:56 it's a rebalance of an EC policy. we can see the patch avoided the issue many times
21:14:21 the ring distribution started at 18:00 and took 30 minutes
21:14:55 what bothers me is that even after the ring is distributed everywhere, we still get occurrences of the reconstructor failing to lock the partition before reverting it. I don't understand it
21:15:26 cool! i'm loving that visual
21:16:36 what i want to try is using a shorter timeout value for the reconstructor to see if it can improve perf (the rebalance without the patch was taking 1h; with the patch it's 3h)
21:16:53 other than that, i'm done with it, i think
21:17:11 waiting for reviews :)
21:17:15 maybe previous-and-still primaries are doing actual rebuilds to the new primary, so it needs to grab the lock, too?
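For context, here is a minimal sketch of the lock-before-revert idea being discussed, assuming swift's real lock_path() directory-lock helper. The function name, the do_revert callback, and the timeout value are illustrative placeholders, not the actual code from patch 754242; rledisez's "shorter timeout" experiment corresponds to the timeout parameter here.

```python
# Minimal sketch (not the actual change in patch 754242): take the
# partition's directory lock before reverting a handoff, and skip the
# partition if another process already holds it. lock_path() and
# LockTimeout are swift's real helpers; maybe_revert_handoff() and
# do_revert() are hypothetical.
from swift.common.exceptions import LockTimeout
from swift.common.utils import lock_path


def maybe_revert_handoff(part_path, do_revert, timeout=10):
    try:
        with lock_path(part_path, timeout=timeout):
            do_revert(part_path)  # ssync fragments to primaries, then unlink
        return True
    except LockTimeout:
        # someone else (e.g. a primary rebuilding into this partition)
        # holds the lock; leave this partition for the next pass
        return False
```

A shorter timeout means the reconstructor gives up on contended partitions sooner and moves on, which is the trade-off rledisez wants to measure against the 1h-to-3h rebalance slowdown.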
21:17:51 that reminds me -- clayg, we need to make sure tsync is on the etherpad, too ;-)
21:17:53 that would be a perfect explanation, except that in our clusters we don't rebuild partitions that were recently moved (to avoid reconstructing something that should be moved)
21:18:10 timburke: oh... yeah...
21:19:30 timburke: i think you're right, it's the rebuild, but not of the moving partitions -- of all the other partitions. so yeah, it should be fine. i'll check that, but I think you got it
21:20:10 alternatively, maybe there's new data landing in the partition, and the new primary's reconstructor locks it while checking in with neighbors? though you'd think that should be pretty fast
21:20:17 tim doesn't even see the code anymore, it's just 'primary partition, revert hand-off, orphaned slo segment...'
21:21:39 all right, sounds like i need to do some reviews and rledisez is awesome
21:21:54 #topic async cleanup of slo segments
21:22:34 i think that's about ready to go -- thanks for the review, mattoliverau, i'll check to make sure i cover the more-than-one-container case
21:23:14 in large part i left it on here as a segue to talking about ALOs ;-)
21:24:25 Another Large Objects?
21:24:45 i've been thinking "atomic", but yeah, that could work, too :P
21:25:35 the gist of it is, i want to have something like SLOs, but where: the client API mirrors (basically exactly) S3's, the segments are all in the reserved namespace that we introduced for object versioning, and the segments get cleaned up asynchronously after delete/overwrite
21:25:55 YES!! zaitcev ❤️ 🤣
21:27:06 timburke: can we even make it so it's not racy? since we control the manifest names and segments, even if an overwrite thinks it's just a create, we still destroy the inaccessible segments at some point?
21:27:09 that last part may get interesting -- i'm assuming it'll need to get plumbed clear down to diskfile's cleanup_ondisk_files -- basically, before unlinking the old version, scatter a whole bunch of async_pendings to schedule the deletes
21:27:37 hopefully not an auditor plugin to clean segments hehe
21:27:38 interesting! I was thinking it'd mostly happen at the container layer! Can't wait to discuss!
21:27:48 auditor plugins 😢
21:28:24 hey, those are making progress! i keep seeing new patchsets from david
21:29:13 all right, that's all i've got on the agenda
21:29:18 #topic open discussion
21:29:29 anything else we ought to bring up today?
21:30:23 I already engaged your attention about the dark data, so I have nothing. I'm going to look at Romain's patch seriously.
21:30:31 timburke: i was late, so i couldn't say it when you mentioned it - but KUDOS on taking care of all those backports and stable branches, man - that's really great
21:31:22 i may have found a new lead on https://bugs.launchpad.net/swift/+bug/1710328 -- i suspect some of the iterable/iterator cleanup in https://review.opendev.org/#/c/755639/ may address it
21:31:24 Launchpad bug 1710328 in OpenStack Object Storage (swift) "object server deadlocks when a worker thread logs something" [High,Fix released] - Assigned to Samuel Merritt (torgomatic)
21:31:24 patch 755639 - swift - New proxy logging field for wire status - 5 patch sets
21:31:50 er, not *that* deadlock patch... https://bugs.launchpad.net/swift/+bug/1895739
21:31:50 Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress]
21:33:34 wow
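As background for the "iterable/iterator cleanup" lead above: when a response generator is abandoned (say, on client disconnect) and left to the garbage collector, its cleanup can fire at an arbitrary moment on an arbitrary thread, which is how it can collide with logging and deadlock. A minimal illustration of the general idea -- closing the iterator deterministically instead of leaving it to gc; this is an assumed shape, not the code from patch 755639:

```python
# Illustration only (not the code from patch 755639): close response
# iterators explicitly so their finally blocks run on the request's own
# greenthread, instead of whenever gc collects an abandoned generator --
# which can fire in the middle of another thread's work and deadlock.

def body_iter():
    try:
        yield b'some chunk'
        yield b'another chunk'
    finally:
        # stand-in for the real cleanup (logging, connection teardown, ...)
        print('cleaned up deterministically')

it = body_iter()
try:
    for chunk in it:
        pass  # write chunk out to the client; may raise on disconnect
finally:
    it.close()  # raises GeneratorExit inside the generator now, not at gc time
```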
21:35:05 BTW... I don't know if it's going to be helpful to you, but I generally found re-entrant locks to be rather harmful in the kernel arena.
21:35:25 They aren't saving you from deadlocks if 2 entities are present.
21:35:46 And some people clearly have trouble considering how locks work.
21:36:32 Obviously that was the case 10 years ago, and I'm sure you can think them through. I'm just saying that re-entrant locks tend to make the code harder to comprehend for mere mortals.
21:37:21 I saw it happen when someone tried to port AFS to Linux
21:37:43 yeah -- my hope is that if i can clean up the close() calls, we won't be punting to gc to handle the generators and i won't need to touch the _active_limbo_lock thing *at all*
21:38:03 And in drivers, if an rwlock is used for its re-entrancy property, it's a signal that the code is out of control and someone is band-aiding around the baggage.
21:38:23 Okay.
21:39:24 even when i *did* try swapping out the lock, it didn't actually fix the issue -- it'd still come up now and then, i think because of some craziness in eventlet
21:39:29 it was gross.
21:40:34 but now i've got that patch applied in the cluster where i first characterized the problem, and i'll try running the same sorts of workloads -- i guess we'll see in a month or whatever if it actually fixed it 🤮
21:40:54 timburke: can you post the link to the jira again (i don't see it from the bug)
21:41:06 lp bug #1895739
21:41:07 Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress] https://launchpad.net/bugs/1895739
21:41:07 i really wish i had a reliable repro
21:41:14 You know, I cannot see the emoji all that well. Is that face puking?
21:42:49 yup -- i think that was one of the first emojis where i thought, "well that *is* difficult to express without emojis (at least, while remaining as succinct)"
21:43:49 kota_, clayg, rledisez, mattoliverau, how are we feeling about https://review.opendev.org/#/c/739164/ ?
21:43:49 patch 739164 - swift - ec: Add an option to write fragments with legacy crc - 3 patch sets
21:44:09 https://review.opendev.org/#/c/738959/ landed -- i should cut a libec release
21:44:09 patch 738959 - liberasurecode - Be willing to write fragments with legacy crc (MERGED) - 4 patch sets
21:44:40 (step 1: remember how we do that)
21:45:07 hahah
21:45:29 Yeah. I only know that it ends up on tarballs.opendev.org eventually.
21:45:49 timburke: it seems right
21:45:51 dependency is always a hard problem, i'm feeling
21:45:52 i think we can just push a (signed) tag, but i bet tdasilva remembers
21:47:03 If we're done, I need to go.
21:47:17 kota_, the good news is that despite the Depends-On, the new swift code is perfectly happy working with old libec -- it'll set the env var, and nothing will actually look at it
21:48:08 :P
21:50:36 all right, seems like we're winding down
21:50:48 thank you all for coming, and thank you for working on swift!
21:50:51 #endmeeting
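For readers following zaitcev's point above that re-entrant locks "aren't saving you from deadlocks if 2 entities are present": reentrancy only lets the *same* thread re-acquire a lock it already holds; it does nothing for the classic two-lock, two-thread ordering deadlock. A minimal illustration (the names and lock count are ours, not from any swift code):

```python
# Reentrancy doesn't prevent an ABBA deadlock between two threads.
import threading

a = threading.RLock()
b = threading.RLock()

def one():
    with a:
        with b:      # holds a, wants b
            pass

def two():
    with b:
        with a:      # holds b, wants a -> potential deadlock; RLock can't help
            pass

t1 = threading.Thread(target=one)
t2 = threading.Thread(target=two)
# t1.start(); t2.start()  # uncomment to (possibly) hang forever
```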