21:00:02 #startmeeting swift
21:00:03 Meeting started Wed Jun 24 21:00:02 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:06 The meeting name has been set to 'swift'
21:00:10 who's here for the swift meeting?
21:00:19 o/
21:00:32 o/
21:01:03 o/
21:01:52 o7
21:02:10 agenda's at https://wiki.openstack.org/wiki/Meetings/Swift -- i don't have much to bring up, so it may be a fairly short meeting ;-)
21:02:37 in fact, it's just updates
21:02:40 #topic replication network and background daemons
21:03:03 after the feedback last week, i updated https://review.opendev.org/#/c/735751/ to touch more daemons
21:03:03 patch 735751 - swift - Allow direct and internal clients to use the repli... - 2 patch sets
21:04:22 there were a couple (reconciler and expirer) where i still provided an option of using the client-data-path network
21:04:25 is that a filter xprofile? what is even going on with this change set?! 🤯
21:04:50 oh, that's unrelated - use_legacy_network 👍
21:05:07 timburke: so is that one ready to go then!? 🤗
21:05:13 wait, what about xprofile?
21:05:20 yeah, pretty sure it's good to go
21:05:31 i just clicked on the file with the biggest diff
21:06:35 my logic for whether to provide an option was basically: if it walks disks (which presumably would be exposed on the replication network), don't bother to provide an option. if it just gets work by querying swift, include it
21:07:23 oh, yeah -- i was just moving the filter to what seemed like a more natural location. i could back that out; it's unrelated
21:07:39 dude, this change looks good
21:08:32 i hate the option; i'm sure all of these places would have used the replication network from the get-go if these interfaces had already existed
21:09:29 does anybody know of deployments where a node that might be running the reconciler or expirer *wouldn't* have access to the replication network? maybe i don't even need to include the option
21:09:40 but the way you did it, it's super clear; not a blocker for me
21:09:44 at least we don't have that
21:10:04 (i mean they do have access to the replication network)
21:10:24 well the reconciler is interesting because it uses internal client and direct client - it's not clear to me that the direct client requests honor the config option?
21:11:22 i haven't looked at this since you pushed it up last thursday - i thought it was still wip 🤷‍♂️
21:11:28 I'm sure it'll be great
21:11:31 heh -- whoops. yeah, they don't atm -- feel free to leave a -1
21:11:54 anyway, i just wanted to raise attention on it
21:11:57 -1 let's just drop the option!? 😍
21:12:17 or do you mean -1 because if we have the option it has to work and be tested and 🤮
21:12:36 either one, up to you ;-)
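A rough sketch of the kind of per-daemon switch being debated above, for daemons that only reach Swift through internal/direct client (reconciler, expirer). The option name and defaults here are placeholders for illustration only -- see patch 735751 for what it actually exposes:

    # Illustrative only: placeholder option name, not the patch's real knob.
    # The open question above is whether these switches should exist at all,
    # or whether such daemons should always use the replication network.
    [object-expirer]
    use_replication_network = true

    [container-reconciler]
    # hypothetical escape hatch for clusters whose reconciler nodes cannot
    # reach the replication network
    use_replication_network = false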
21:12:44 #topic waterfall EC
21:13:03 i totally get the argument "if ONE deployment wants the option we'll wish we had that code ready to go" 🤷‍♂️
21:13:06 Sorry I'm late, slept in a bit o/
21:13:11 clayg, i saw you got some more patchsets up; how's it going?
21:13:14 i just don't know if such a deployment exists and tend to YAGNI
21:13:24 timburke: it's so great, i'm really happy
21:13:29 mattoliverau, no worries -- we'll probably let you get breakfast before long ;-)
21:14:07 thanks for the ping about the concurrent frag fetcher clay. Have started reading but I think I need to read more before I start asking questions :)
21:14:23 ok, great, np!
21:14:56 you also wrote up an etherpad about non-durables
21:14:58 #link https://etherpad.opendev.org/p/the-non-durable-problem
21:15:32 didn't catch that, thanks
21:16:41 the implication was some sort of dumb, overly-complicated-looking code to define/override what the "shortfall" is for a non-durable bucket
21:17:15 but it makes a clean way to put non-durable responses into a gradient of "probably fine" until either we get a durable response or get down to ~parity outstanding requests
21:18:12 are there any open questions you'd like us to discuss, clayg? or should we just get on with reading the patches?
21:18:13 it could be debated/researched/proved where on the slider is "most correct" - but I think it's a good sliding scale, so the code is "correct" in some sense
21:18:24 uh, I'd be happy to answer questions
21:18:51 but I would like some "early" feedback on gross over-expressiveness of https://review.opendev.org/#/c/737096/
21:18:52 patch 737096 - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:19:09 having concurrent_gets/timeout be *per policy* is obviously a move in a positive direction
21:19:58 but my crazy implementation of alecuyer's idea that waterfall-ec should be able to express "start ndata+N concurrent requests before feeding in the remaining primaries"
21:20:09 ... could probably be expressed different ways than how I wrote it up
21:21:07 the underlying structure (the per-primary-count timeout) might be a good implementation; but a smart group of folks like we have here may have some better ways to configure it
21:21:25 i.e. is there a "do_what_i_want = true" option that would be better
21:21:55 ... than the `concurrency_timeout` that I plumbed through
21:22:38 I mean it's great! I'm happy with it - it's *completely* sufficient for anything I might want to test in a lab
21:22:55 ... and I'm sure different clusters would have good reasons to want to do different things; so it doesn't bother me to have it exposed
21:23:14 morning
21:23:16 We'll have to test it - still swamped with romain about unrelated things but I hope we can test it soon
21:23:33 oh wow, that'd be cool!
21:23:43 sounds like i need to find time to review p 711342 and p 737096 this week :-)
21:23:44 https://review.opendev.org/#/c/711342/ - swift - Add concurrent_gets to EC GET requests - 12 patch sets
21:23:46 https://review.opendev.org/#/c/737096/ - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:23:49 I'll probably be testing it in our lab like... in another week or two?
21:23:56 I think we want to use EC more and we have things like 12+3 so i'm quite sure it would help :)
21:24:20 alarm didn't ring
21:25:11 alecuyer, oh for sure -- we're already grumpy with 8+4; needing to get another 4 live connections would surely make our problems worse ;-)
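For context, a sketch of the kind of tuning the waterfall-EC discussion above is about. concurrent_gets and concurrency_timeout are existing proxy-server options; the per-policy override section and the values shown are assumptions about where patch 737096 is heading, not its actual syntax:

    [app:proxy-server]
    use = egg:swift#proxy
    # existing global knobs: open extra backend GETs, and how long to wait
    # before starting the next one
    concurrent_gets = on
    concurrency_timeout = 0.5

    # hypothetical per-policy override: let a wide EC policy (e.g. 12+3)
    # fan out to extra primaries much sooner than a 3-replica policy
    [proxy-server:policy:2]
    concurrent_gets = on
    concurrency_timeout = 0.1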
21:25:23 that's it for the agenda
21:25:30 #topic open discussion
21:25:41 anything else we should talk about this week?
21:25:53 I want to mention quickly an issue we found with romain during some tests,
21:26:15 😱
21:26:17 the test setup had a ubuntu 20.04 proxy, and 18.04 object servers
21:26:42 we used an EC policy, and were unable to reread data,
21:26:47 py2 still, or trying out py3?
21:26:49 eep!
21:26:54 still py2 sorry :/
21:27:11 cool, just making sure 👍
21:27:16 it _seems_ liberasurecode has a problem in ubuntu between these two versions
21:27:22 i blame 20.04 - that focal 😠
21:27:38 which versions of liberasurecode?
21:27:39 the object server will do get_metadata (i think, sorry, i don't recall the function name)
21:27:42 wasn't there a thing with CRC libs that was $%^&ing INSANE (cc zaitcev)
21:27:56 were they coming from ubuntu's repos?
21:28:00 1.6.1 vs 1.5.0 i think
21:28:01 yes
21:28:06 recompiling appeared to fix it,
21:28:15 timburke fixed most of it, I only looked at that. I think you're thinking about zlib.
21:28:19 and it seems to be related to linker flags (??) if i used ubuntu flags it broke again
21:28:22 https://github.com/openstack/liberasurecode/commit/a9b20ae6a
21:28:23 but well
21:28:36 LINKER FLAGS
21:28:51 just wanted to say we saw that, and maybe i can say more outside the meeting, or once i have tested this properly
21:28:57 (this was not the goal of the test erm..)
21:29:26 Yes. It was related to linking order. If system zlib got routines ahead of liberasurecode, then they're used instead of ours.
21:29:26 to find bugs is always the goal of testing - you are a winner
21:29:39 zaitcev: thanks, i didn't figure it out
21:30:45 alecuyer: so, does this mean that the fix we committed (linked above) was incorrect?
21:31:03 so i guess probably someone would be glad if we could produce an actionable bug for ubuntu 20.04's liberasurecode package
21:31:04 I reviewed it and it seemed watertight to me, buuuuut
21:31:17 Sorry to say i'm not sure, haven't had time to dig further, but I'll check it and post a detailed version and a test script we have to check it outside of swift
21:31:32 ... but "always compile/distribute your own liberasurecode" also seems like reasonable advice 🤔
21:31:32 i think the issue must be new frags getting read by old libec :-(
21:31:46 ah, okay
21:31:47 trolorlololo
21:32:23 No, wait. What if you have a cluster that's halfway updated and rsync moves fragment archives from new to old
21:32:31 i think if it was the other way around, with old proxy talking to new object, it'd probably be fine? until the reconstructor actually had work to do :-(
21:32:32 timburke: but it's only: new ec compiled with "wrong" flags breaks
21:34:02 maybe we could offer an env var or something to have the new lib write old crcs (that could then still be read by old code)?
21:34:19 and once everything's upgraded, take out the env flag
21:34:23 alecuyer: can you share the werx and borked flags?
21:34:43 yes I will send it on #openstack-swift and etherpad
21:34:52 FABULOUS!!!
21:34:55 alecuyer: well done
21:35:08 so last night i ran down an s3api issue ormandj saw upgrading to py3
21:35:13 #link https://bugs.launchpad.net/swift/+bug/1884991
21:35:13 Launchpad bug 1884991 in OpenStack Object Storage (swift) "s3api on py3 doesn't use the bytes-on-the-wire when calculating string to sign" [Undecided,In progress]
21:35:41 just a heads-up -- the long tail of py3 bugs continues
21:35:57 glad someone is testing py3 😬
21:36:22 They have no choice. We, for example, aren't shipping py26 anymore, at all.
21:36:43 zaitcev: you're a hero!!! 🤗
21:36:44 have a fix at https://review.opendev.org/#/c/737856/, but it needs tests
21:36:44 patch 737856 - swift - py3: Stop munging RAW_PATH_INFO - 1 patch set
21:37:12 i'm very anti munging - +1 on principle alone
21:38:06 *especially* when it's for something with "raw" in the name :P
21:38:32 timburke: ❤️
21:39:00 i'm not having a great time with s3 tests in https://review.opendev.org/#/c/735738/ right now
21:39:00 patch 735738 - swift - s3api: Don't do naive HEAD request for auth - 1 patch set
21:39:42 BTW, I have the same 2 things on my plate: Dark Data with dsariel and account server crashing. Not much change on either... I know Tim looked at the account server thing, but I'm going to write a probe test that solidly reproduces it.
21:40:14 seongsoocho: how are things going for you?
21:40:50 zaitcev, oh, yeah yeah -- p 704435 -- i've got a head start on a probe test for you at p 737117
21:40:50 https://review.opendev.org/#/c/704435/ - swift - Mark a container reported if account was reclaimed - 2 patch sets
21:40:52 https://review.opendev.org/#/c/737117/ - swift - probe: Explore reaping with async pendings - 1 patch set
21:41:22 clayg: just having a normal day. everything's good
21:42:07 something i've learned the last few months: every day your cluster isn't on fire is a good day :D
21:42:45 definitely :)
21:43:21 timburke: thank you
21:44:04 zaitcev, how is the watcher going? should i find time to look at it again soon, or wait a bit?
21:44:46 timburke: wait a bit please. We're putting together a switch for selected action.
21:44:56 👍
21:45:00 basically
21:45:19 Sam's design didn't allow for configuration options specific to watchers.
21:45:41 So, there's no way to express "The DD watcher should do X"
21:46:05 David wants something crammed into the paste line
21:46:58 so it can turn a little unwieldy like watchers=watcher_a,watcher_b#do_this=yes,watcher_c
21:47:06 I'll let you know.
21:47:06 🤯
21:47:26 makes me think of the config changes sam was thinking about in p 504472 ...
21:47:26 https://review.opendev.org/#/c/504472/ - swift - Shorten typical proxy pipeline. - 4 patch sets
21:48:31 i feel like it should be fair for the DD watcher to claim the dark_data_* config namespace within the object-auditor
21:48:33 I'd prefer something like [object-auditor:watcher_b] \n do_this=yes
21:48:37 But dunno
21:48:43 Seems like overkill.
21:48:56 or that! also seems good :-)
21:49:08 ok
21:50:33 noice
21:50:33 all right, let's let kota_, mattoliverau, and seongsoocho get on with their day ;-)
21:50:46 thank you all for coming, and thank you for working on swift!
21:50:50 #endmeeting
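For reference, the two config shapes floated in the watcher discussion above, written out side by side. watcher_a, watcher_b, and do_this are the placeholder names used in that discussion, not real options; neither shape is what was ultimately decided here:

    # (a) options crammed into the pipeline-style watchers line
    [object-auditor]
    watchers = watcher_a, watcher_b#do_this=yes, watcher_c

    # (b) a dedicated config section per watcher
    [object-auditor:watcher_b]
    do_this = yes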