21:00:11 #startmeeting swift
21:00:12 Meeting started Wed Jan 13 21:00:11 2021 UTC and is due to finish in 60 minutes. The chair is timburke_. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:15 The meeting name has been set to 'swift'
21:00:19 who's here for the swift meeting?
21:00:27 o/
21:00:29 o/
21:00:44 o/
21:01:03 o/
21:01:12 o/
21:01:25 o/
21:01:54 as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:14 first up
21:02:18 #topic reconciler/ec/encryption
21:02:28 #link https://bugs.launchpad.net/swift/+bug/1910804
21:02:30 Launchpad bug 1910804 in OpenStack Object Storage (swift) "Encryption doesn't play well with processes that copy cleartext data while preserving timestamps" [Undecided,New]
21:03:07 so i had a customer report an issue with an object that would consistently 503
21:03:49 ohai
21:04:06 digging in more, we found that they had 11 frags of it for an 8+4 policy... but those had 3 separate sets of crypto meta between them
21:04:28 ...and no set of crypto meta had more than 7 frags
21:04:34 lawl
21:05:16 (I had to think about this at first) meaning frags have been encrypted with three different body keys... for the same object!!!
21:06:47 root cause was traced back to a couple of issues: (1) we deploy with encryption in the reconciler pipeline and (2) we have every (container?) node running a reconciler
21:07:55 (well, that and the fact that it was moved to an EC policy. if it were going to a replicated policy, any replica regardless of crypto meta would be capable of generating a client response)
21:08:48 i've got a fix up to pull encryption out of the reconciler pipeline if it was misconfigured -- https://review.opendev.org/c/openstack/swift/+/770522
21:09:23 but i wanted to raise awareness of the issue so no one else finds themselves in this situation
21:10:56 also worth noting: i think you could run into a similar issue *without encryption* if your EC backend is non-deterministic
21:12:50 the open source backends are deterministic as i recall (that is, the frag outputs only depend on the EC params from swift.conf and the input data), but i don't know the details of shss, for example
21:13:39 does anyone have any questions about the bug or its impact?
21:14:54 all right
21:14:57 Nice investigation!
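To make the failure mode above concrete: with an 8+4 EC policy the proxy needs at least 8 fragments that share the same crypto meta (i.e. were encrypted with the same body key) before it can decode a response; three body keys with at most 7 frags each can never satisfy that, hence the persistent 503. A minimal sketch of that check, illustrative only and not Swift's actual internals (the `frags`/`crypto_meta_id` structures here are hypothetical):

```python
from collections import Counter

def decodable(frags, ndata=8):
    """Return True if any single set of crypto meta has enough frags to decode."""
    by_body_key = Counter(frag['crypto_meta_id'] for frag in frags)
    return any(count >= ndata for count in by_body_key.values())

# The situation from the bug: 11 frags of an 8+4 object, spread across
# three different body keys, with no key covering more than 7 frags.
frags = ([{'crypto_meta_id': 'key-a'}] * 7 +
         [{'crypto_meta_id': 'key-b'}] * 3 +
         [{'crypto_meta_id': 'key-c'}] * 1)
print(decodable(frags))  # False -> no decodable set, so the proxy returns 503
```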
21:15:00 #topic SSYNC and non-durable frags
21:15:14 #link https://bugs.launchpad.net/swift/+bug/1778002
21:15:16 Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor." [High,Confirmed]
21:15:49 i know acoles (and clayg?) has been working on this problem a bit lately, though i'm not sure where things stand
21:15:49 shss might be impacted. i'll check it.
21:16:03 I just got my probe test working!
21:16:11 \o/
21:16:24 background: we noticed some partitions were never cleaned up on handoffs
21:16:50 turned out they had non-durable data frags on them, so the dir would not be deleted
21:17:06 but reconstructor/ssync does not sync non-durable frags
21:17:24 :(
21:17:35 so https://review.opendev.org/c/openstack/swift/+/770047 should fix that
21:18:10 by (a) sync'ing non-durables (they could still be useful data) and (b) then removing non-durables on the handoff
21:18:52 https://bugs.launchpad.net/swift/+bug/1778002 has been around for a while - anyone doing EC rebalances has probably noticed it
21:18:53 Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor." [High,Confirmed]
21:20:47 Hrm. I never noticed because I have excess space.
21:21:51 i think we mainly noticed because we monitor handoffs as part of our rebalances
21:22:05 the commit message on the patch details the various changes needed to get the non-durables yielded to ssync and then have ssync sync them
21:22:11 acoles, are there any questions that might need answering, or is this something that everyone should just anticipate getting better Real Soon Now?
21:23:27 review always welcome, but there's no specific issue I have in mind for feedback
21:23:49 excellent
21:23:57 I'm about to push a new patchset - and I have one more test to write
21:25:18 #topic cleaning up shards when root DB is deleted and reclaimed
21:25:34 meanwhile, mattoliverau has picked up
21:25:37 #link https://bugs.launchpad.net/swift/+bug/1911232
21:25:38 Launchpad bug 1911232 in OpenStack Object Storage (swift) "empty shards fail audit with reclaimed root db" [Undecided,Confirmed] - Assigned to Matthew Oliver (matt-0)
21:26:02 how's that going?
21:26:32 Yeah things are moving along. I have https://review.opendev.org/c/openstack/swift/+/770529
21:26:53 it's not fixed yet, just worked on a probe test that shows the problem.
21:27:09 a very good place to start :)
21:27:40 In an ideal world we'd have shrinking and autosharding, so shards with nothing in them would collapse into the root before reclaim_age
21:28:03 but we don't have that, and there is still an edge case where they're not getting cleaned up.
21:28:59 I'll have another patchset up today that should have an initial version of a fix. Currently still on my laptop as it needs some debugging and tests
21:29:34 keep an eye out for that and then please review and we can make sure we don't leave any pesky shards around :)
21:29:46 sounds good
21:30:23 #topic s3api and allowable clock skew
21:31:10 i've had some clients getting back RequestTimeTooSkewed errors for a while -- not real common, but it's a fairly persistent problem
21:31:58 i'm fairly certain it's that they retry a failed request verbatim, rather than re-signing with the new request time
21:32:51 eventually, given the right retry/backoff options, the retry goes longer than 5mins and they get back a 403
21:33:35 so, there's nothing we can do, right?
21:33:54 i *think* AWS has an allowable skew of more like 15mins (though can't remember whether i read it somewhere or determined it experimentally)
21:34:11 That's what I remember, too.
21:34:24 so i proposed a patch to make it configurable, with a default of (what i recall as being) AWS's limit
21:34:41 #link https://review.opendev.org/c/openstack/swift/+/770005
21:34:48 It was mentioned in the old Developer's Guide. But that document is gone, replaced with the API Reference.
21:35:30 i wanted to check if anyone had concerns about increasing this default value (it would of course be called out in release notes later)
21:35:35 should we extend the default value too?
21:36:24 * kota_ said same thing :P
21:36:29 kota_, yeah, the patch as written increases the timeout from 5mins to 15mins (if you don't explicitly set a value)
21:37:55 ok, seems like we're generally ok with it :-)
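For operators who want to tune this ahead of a release, the patch under review exposes the skew window as an s3api filter option in proxy-server.conf. A sketch of what that configuration might look like, assuming the option name and 900-second default from the review as it stood at the time (verify against the merged sample config before relying on it):

```ini
[filter:s3api]
use = egg:swift#s3api
# Maximum difference (in seconds) allowed between a request's signed timestamp
# and the proxy's clock before s3api returns RequestTimeTooSkewed.
# 900 seconds (15 minutes) matches what AWS is believed to allow.
allowable_clock_skew = 900
```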
21:38:06 #topic relinker
21:39:17 i found a couple of issues recently that might be good to know about if anyone's planning a part-power increase (or two) soon
21:39:24 #link https://bugs.launchpad.net/swift/+bug/1910589
21:39:25 Launchpad bug 1910589 in OpenStack Object Storage (swift) "Multiple part power increases leads to misplaced data" [Undecided,New]
21:39:59 ^^^ characterizes something i think i mentioned last week, but hadn't gotten a clean repro for
21:40:24 rledisez, do you think you might have time to review https://review.opendev.org/c/openstack/swift/+/769855 (which should address it)?
21:40:45 Christian is no longer around essentially, we have to do without.
21:41:05 😢 I hope he's doing well tho! 😁
21:41:47 timburke_: absolutely, I'll do that this week
21:42:36 thanks! only thing worth calling out (i think) is that the state file format changed in such a way that any old state files will just be discarded
21:43:08 not a big deal. don't upgrade if you're relinking, and worst case scenario, it restarts from zero
21:43:15 but that should only really be a concern if someone is doing a swift upgrade mid-part-power-increase, which doesn't seem like a great plan anyway
21:43:41 hahaha
21:43:59 the other one i noticed is a little thornier
21:44:01 #link https://bugs.launchpad.net/swift/+bug/1910470
21:44:02 Launchpad bug 1910470 in OpenStack Object Storage (swift) "swift-object-relinker does not handle unmounted disks well" [Undecided,New]
21:44:49 essentially, on master, if the relinker hits an unmounted disk, you get no feedback about it at all
21:45:23 i've got a patch that at least has us log the fact that the disk is getting skipped -- https://review.opendev.org/c/openstack/swift/+/769632
21:45:45 but it doesn't exit with a non-zero status code or anything
21:46:12 So now, is it safe to increase the partition power only once until the patch is applied?
21:47:00 seongsoocho: from production experience, it is. we did it on multiple clusters with the current status of the relinker
21:47:05 seongsoocho, yes, increasing it once will definitely be fine. once it's been increased, you could go manually clear the state files -- then it would be safe to do it again
21:48:04 but you should watch out for the last bug mentioned by timburke_ -- make sure your permissions are ok (root:root) on unmounted disks to avoid bad surprises
21:48:10 they'd be named something like /srv/node/*/.relink.*.json
21:48:47 aha. ok thanks :)
21:49:17 at some point, it would be useful to have a recon option that returns the values of the relink.json and tells you when one is missing (eg: because unmounted)
21:49:52 good thought!
21:50:49 all right, i mostly wanted to raise awareness on those -- i'll let you know if i get a good idea on a better resolution for that second one
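To make the "manually clear the state files" step above concrete: once a part-power increase (relink and cleanup phases) has fully finished, the per-disk state files can be removed so the next increase starts from a clean slate. A hedged example using the default /srv/node device root mentioned in the discussion:

```shell
# Only run this after the previous part-power increase has completely finished
# (both relink and cleanup); otherwise the relinker loses its progress and
# restarts from zero.
rm -f /srv/node/*/.relink.*.json
```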
21:50:55 #topic open discussion
21:51:03 what else should we talk about this week?
21:52:46 OMM I'm seeing this test fail in virtualenvs (e.g. tox -e py36) but not outside virtualenv: 'nosetests ./test/unit/common/test_manager.py:TestManagerModule.test_verify_server' - anyone else noticed that? I'm baffled
21:53:35 AFAICT the test is asserting that swift-Object-server is not on my PATH
21:53:48 note the capital 'O'
21:53:49 does it always fail?
21:54:09 inside virtualenv yes - I mean, I just noticed in the last 20mins
21:54:09 are you on a case insensitive file system 🤣
21:54:33 vsaio and macos both the same
21:55:02 apart from it failing, I don't like that a unit test is making assertions about what I might have on my PATH
21:55:34 oh, i just tried venv in my vsaio and it worked 🤷‍♂️
21:55:39 if no-one else has noticed I'll dig some more
21:55:42 py3.8 tho
21:56:42 acoles: so, if you comment out verify_server, does it fail?
21:56:51 or, well
21:56:59 it's a test, so it's a little artificial
21:57:52 py2.7 fails
21:58:08 sec
21:58:13 a second
21:58:34 maybe related to https://review.opendev.org/c/openstack/swift/+/769848 ?
21:58:42 yes, that one
21:59:04 i should go review that... or maybe acoles should ;-)
21:59:22 Yeah
21:59:50 all right
21:59:59 don't think so, in my virtualenvs, '$ which swift-Object-server' actually finds a match
22:00:10 :/
22:00:16 O.o
22:00:17 Maybe just back out the whole thing. It's an option. But I hoped that just backing out the effects in the decorator, and _only_ screwing with the exit code, would let us preserve it.
22:00:40 Oh
22:00:45 zaitcev, seems likely to be reasonable
22:01:08 acoles, well where did it come from!?
22:01:18 anyway, we're at time
22:01:19 I think there's a case-insensitivity thing going on in my virtualenvs ?!? v weird
22:01:27 thank you all for coming, and thank you for working on swift!
22:01:44 there's a lot going on, and i'm excited to see it all happening
22:01:50 #endmeeting