21:00:47 #startmeeting swift
21:00:48 Meeting started Wed Apr 28 21:00:47 2021 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:51 The meeting name has been set to 'swift'
21:00:55 who's here for the swift meeting?
21:01:19 o/
21:01:59 o/
21:02:08 o/
21:04:02 as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:04:06 first up
21:04:10 #topic PTG
21:04:51 i wanted to thank everyone who came out to the PTG last week -- i feel like we had some good, productive discussions
21:05:33 and that needing to explain some of our ideas to devs we don't necessarily get to talk to super-regularly helped firm up a lot of them
21:06:02 +1
21:06:13 i don't know that i've got a lot more to say, other than thanks again!
21:06:35 #topic rolling upgrade job failures
21:06:44 I'm grateful to those who were up in their night time - thank you!
21:07:39 i don't know how much other people have noticed, but i've seen a fair few failures lately
21:08:11 yes, I HAVE seen some rolling upgrade job failures - "grenade" too!
21:08:19 I noticed a few in the last 24 hours
21:08:40 but func-cors is rock solid :D
21:08:46 i suspect they've been flaky for a while (they seem to be related to listing consistency issues), but we've had them disabled/non-voting a decent bit lately and hadn't noticed
21:09:43 i also think (but haven't yet verified) there's a chance they'll improve the next time we cut a tag, since i added the ability to retry failed func tests
21:10:18 just wanted to keep people updated; nothing really you all need to do
21:10:35 at least one rolling upgrade fail was a timeout https://zuul.opendev.org/t/openstack/build/79a7ae5a3cc649d0a556a29e76dc0800
21:11:31 on to updates!
21:11:52 we've got a lot of things in-flight these days; i think that was another nice benefit of the PTG :-)
21:11:59 #topic sharding
21:12:15 so, current patches:
21:12:18 https://review.opendev.org/c/openstack/swift/+/784617 - Add sharding to swift-recon (already approved)
21:12:28 https://review.opendev.org/c/openstack/swift/+/785628 - swift-manage-shard-ranges: fix exit codes
21:12:32 https://review.opendev.org/c/openstack/swift/+/774002 - Fix shrinking making acceptors prematurely active
21:12:43 https://review.opendev.org/c/openstack/swift/+/777585 - stall cleaving at shard range gaps (already approved, but waiting on pre-req ^^^)
21:12:49 https://review.opendev.org/c/openstack/swift/+/782832 - Consider tombstone count before shrinking a shard
21:12:56 https://review.opendev.org/c/openstack/swift/+/787637 - Don't consider responses generated from cache as "already visited"
21:13:29 do we have any upgrade concerns about the exit code changes? 'cause if not, i'm happy to +A :-)
21:15:31 re the exit codes (patch 785628), IIRC back 3 years there was maybe a thought to differentiate warnings from errors using codes 1 and 2 (or vice-versa), but it's slipped since then
21:15:42 exit code changes on which patch?
21:15:54 second one
21:15:59 and I discovered recently that argparse exits with 2 on invalid args
21:17:28 so my thinking with the patch is to line up all invalid CLI usage to return 2 and any other non-success to be 1.
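[editor's note: a minimal sketch of the exit-code convention described above. The constant names and the do_command() dispatch are illustrative stand-ins, not necessarily what patch 785628 actually does; the sourced facts are only that argparse exits with 2 on invalid args and that any other failure should return 1.]

    import argparse
    import sys

    EXIT_SUCCESS = 0
    EXIT_ERROR = 1         # any failure other than bad CLI usage
    EXIT_INVALID_ARGS = 2  # argparse itself exits with 2 on invalid args

    def do_command(args):
        """Hypothetical stand-in for the real subcommand dispatch."""

    def main(cli_args=None):
        parser = argparse.ArgumentParser(prog='swift-manage-shard-ranges')
        parser.add_argument('container_db')
        # on bad args, parse_args() prints usage and exits with status 2
        args = parser.parse_args(cli_args)
        try:
            do_command(args)
        except Exception as err:
            print('ERROR: %s' % err, file=sys.stderr)
            return EXIT_ERROR
        return EXIT_SUCCESS

    if __name__ == '__main__':
        sys.exit(main())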
21:18:20 seems reasonable, approving
21:18:21 i feel like we ought to prioritize the "prematurely active" patch since it's blocking the "stall cleaving" patch, which is otherwise good to go
21:18:37 thanks
21:19:06 how are we feeling about the tombstone counting? just waiting on review?
21:19:34 I think I attracted some interest in tombstones from clayg
21:19:37 yup, looks good (exit code). I'll look at prematurely active today to unstick it.
21:19:47 thanks!
21:19:53 thanks mattoliverau
21:21:46 i wouldn't mind talking through the "already visited" patch a bit, but maybe that'd be better next week
21:22:23 any other sharding topics i'm forgetting?
21:22:49 Maybe put the rest on priority review (if they aren't already) so I don't forget about them.. it's early here and my brain isn't working yet.
21:22:58 I need to be convinced on not including cached responses in the loop-detection
21:23:39 I haven't really looped back around to active_age post-PTG, so not much to say there yet. But I want to get back to it soon.
21:23:52 my main thought is that *we haven't gone to disk yet*
21:24:20 ooh -- yeah -- it'll be interesting to see if my idea pans out :-)
21:24:32 timburke: it may be that we need a way to provoke a backend request without mandating that it's for objects only
21:25:32 but also retain the break if that just results in the same loop, somehow
21:26:23 #topic relinker
21:26:48 another wall of patches:
21:26:55 https://review.opendev.org/c/openstack/swift/+/783731 - Rehash the parts actually touched when relinking
21:27:01 https://review.opendev.org/c/openstack/swift/+/788089 - Only mark partitions "done" if there were no (new) errors
21:27:05 https://review.opendev.org/c/openstack/swift/+/779655 - Add /recon/relinker endpoint and drop progress stats
21:27:09 https://review.opendev.org/c/openstack/swift/+/788413 - Log and recon on SIGTERM signal
21:27:14 https://review.opendev.org/c/openstack/swift/+/788177 - add aggregate data to recon drop
21:28:12 re 788089 - when did we ignore errors?
21:28:17 so the first two seem pretty useful for correctness and clear ops-signalling
21:28:51 they weren't *ignored* exactly... i mean, we logged them and everything, and we'll exit non-zero
21:29:14 it's just that we mark the partition as having been relinked
21:29:23 but we set state to True?
21:29:30 yup
21:29:33 eek
21:30:01 so a subsequent relink either skips the partition that had errors, or ops need to manually go clear the state file
21:30:20 yeah, we should fix that
21:30:34 yup
21:30:43 the last 3 are based around the new recon patch, split up, plus one I wrote to trap signals and dump the error code as appropriate to recon, the CLI return, and the log.
21:31:59 the "rehash parts touched" strikes my interest since we've had instances where we had hashes in partitions from part-power 17 when relinking into 19 (for instance)
21:32:00 788414 took longer than expected when trying to write tests because of differences in sys.exit and os._exit.. which made the signals kill my test suite run.. fun times :P
21:32:07 https://review.opendev.org/c/openstack/swift/+/779655 (first in relinker recon chain) has been coming along - should we focus on getting that merged? I think there are still a few things to resolve, like the option name, but hopefully it is close
21:33:20 @mattoliverau fun times!!! 🤣
21:33:35 i.e. should recon_interval actually be stats_interval like the replicator has?
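[editor's note: a sketch of the behavior change discussed above for patch 788089. The function and state names are hypothetical, not the relinker's actual internals; the sourced point is only that a partition should be marked done solely when it relinked without errors, so a re-run retries it instead of skipping it.]

    def walk_partition(part_path):
        """Hypothetical stand-in: yield each diskfile under a partition dir."""
        return iter([])

    def relink(diskfile):
        """Hypothetical stand-in for the per-file link-and-cleanup step."""

    def process_partition(part_path, state):
        errors = 0
        for diskfile in walk_partition(part_path):
            try:
                relink(diskfile)
            except OSError:
                errors += 1
        # the bug discussed above: state was set to True even when errors
        # occurred, so a subsequent run skipped the partition and ops had
        # to clear the state file by hand. Only record success on a clean pass.
        if errors == 0:
            state[part_path] = True
        return errors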
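[editor's note: a minimal sketch of the signal-trapping idea behind patch 788413, assuming a relinker-style recon cache file; the path, logger name, and recon keys are assumptions. The sys.exit/os._exit comment reflects the testing trouble mentioned above: sys.exit raises SystemExit so cleanup still runs, while os._exit tears down the whole process, test runner included.]

    import json
    import logging
    import signal
    import sys

    logger = logging.getLogger('relinker')
    recon_data = {'state': 'running'}
    RECON_PATH = '/var/cache/swift/relinker.recon'  # assumed cache location

    def dump_recon(data, path=RECON_PATH):
        with open(path, 'w') as fp:
            json.dump(data, fp)

    def handle_sigterm(signum, frame):
        logger.warning('relinker stopped by SIGTERM; recording progress')
        recon_data['state'] = 'aborted'
        dump_recon(recon_data)
        sys.exit(1)  # raises SystemExit; os._exit(1) would skip all cleanup

    signal.signal(signal.SIGTERM, handle_sigterm)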
21:33:37 yeah, i definitely like the recon idea -- haven't had a chance to look at it since it was split up, unfortunately
21:33:58 i'll try to take another pass at it this afternoon
21:35:07 yeah, if there is already a stats_interval elsewhere, I'm all for using it. keeps things consistent.
21:36:42 so mattoliverau -- do you have a preference on which fork people look at next after the first recon change?
21:36:50 acoles: did you have other concerns that got dropped? the name change is a good idea, and easy enough to fix 💪
21:37:12 I like that you've broken things out into some follow-on patches
21:38:25 timburke: not really. the trap just makes sure we write "done" and let people know things are done if the process is killed by something like (ahem) ansible timeouts.
21:38:38 the aggregator we might need to discuss some more.
21:39:15 sounds like maybe i should look at signals next, then ;-)
21:39:39 i know acoles had some comments on the base patch -- did those ever get addressed?
21:40:04 clayg: mattoliverau I think the two non-nits were recon_interval to stats_interval and the duplicated start_time, although the latter isn't a blocker. But we must straighten out the option name.
21:40:30 I might go poke an op to take a look at the existing recon and the aggregator follow-up to see what they'd like to see, or rather if/what they can use.
21:40:56 acoles: ahh yeah, the start time, somehow I missed that again in yesterday's rework.
21:41:27 👍
21:42:08 #topic stale EC frags
21:42:29 I haven't looked at the patch this morning so don't know what's there, but will push a new patchset today. maybe I'll wait until timburke has a look (if he gets to it this arvo his time). no pressure tho
21:42:29 we've got a couple patches currently working their way through the gate (thanks clayg and acoles!)
21:42:41 https://review.opendev.org/c/openstack/swift/+/787279 - reconstructor: log more details when rebuild fails (already approved)
21:42:47 https://review.opendev.org/c/openstack/swift/+/788540 - reconstructor: extract closure for handle_response (already approved)
21:42:54 so how are we feeling on
21:42:58 https://review.opendev.org/c/openstack/swift/+/786084 - Quarantine stale EC fragments
21:43:56 I started a review on it last night.. but ran out of time. Planning on continuing it today. So don't have too much to say atm myself.
21:45:28 acoles, any known rough edges to watch out for?
21:45:38 or just waiting on review?
21:46:47 probably will need a rebase once the other two land...
21:47:07 we'll see what it looks like next week
21:47:18 #topic dark data watcher
21:47:44 so we've got a couple patches for some known deficiencies
21:48:00 https://review.opendev.org/c/openstack/swift/+/788398 - Make dark data watcher ignore the newly updated objects
21:48:11 https://review.opendev.org/c/openstack/swift/+/787656 - Work with sharded containers
21:49:04 i don't think either is quite ready yet (zaitcev's patch has a WIP in the commit message, and mine probably should, too)
21:49:16 but i wanted to keep them on people's radars
21:49:44 Mine needs tests.
21:50:07 i think those are the main major efforts in-flight right now
21:50:10 timburke: zaitcev thanks for those patches
21:50:12 #topic open discussion
21:50:23 anything else we ought to bring up this week?
21:52:40 nothing comes to mind
21:53:02 so i had a thought that feels like a good idea, but idk if it presents some backwards-compat issues
21:53:04 https://review.opendev.org/c/openstack/swift/+/787905 - proxy: Downgrade some client problems to info
21:54:04 basically, stop logging client disconnects and timeouts at warning -- they're client behaviors, so it is (or can be) way too noisy at that level
21:56:28 well, something to think about, anyway
21:56:50 that's all i've got
21:57:05 thank you all for coming, and thank you for working on swift!
21:57:15 and thanks for coming to the PTG :-)
21:57:20 #endmeeting
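[editor's note: a sketch of the logging change proposed above in patch 787905 -- downgrading client-caused problems from warning to info. The exception class and handler shape are illustrative stand-ins, not the proxy's actual code.]

    import logging

    logger = logging.getLogger('proxy-server')

    class ChunkReadTimeout(Exception):
        """Stand-in for swift.common.exceptions.ChunkReadTimeout."""

    def read_client_chunk():
        """Hypothetical stand-in for reading the next request chunk."""
        raise ChunkReadTimeout()

    try:
        chunk = read_client_chunk()
    except ChunkReadTimeout:
        # a slow or vanished client is the client's behavior, not an
        # operator-actionable event; logging at info instead of warning
        # keeps the warning log meaningful
        logger.info('Client disconnected without sending enough data')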