21:00:34 #startmeeting swift
21:00:35 Meeting started Wed May 20 21:00:34 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:38 The meeting name has been set to 'swift'
21:00:41 who's here for the swift meeting?
21:00:48 o/
21:00:49 hi o/
21:00:53 o/
21:01:17 o/
21:02:15 agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:22 #topic ptg
21:02:35 it's only a week and a half away!
21:03:04 thanks everybody for adding topics to the etherpad
21:03:07 #link https://etherpad.opendev.org/p/swift-ptg-victoria
21:03:29 i made sure that the link was included on http://ptg.openstack.org/etherpads.html
21:04:15 if you haven't already, please do register -- i expect it'll help with logistics/planning
21:04:20 #link https://www.eventbrite.com/e/virtual-project-teams-gathering-june-2020-tickets-103456996662
21:05:15 mattoliverau did a great job of making sure we had some timeslots booked for video conferencing
21:05:36 the schedule of those is up at http://ptg.openstack.org/ptg.html
21:05:48 as well as the top of the etherpad
21:07:30 one thing i think i'd like to try is having everyone come to the ptg with a few patches they'd really like to see some progress on -- pick three or so and add them (and your name) to the priority reviews page!
21:07:33 #link https://wiki.openstack.org/wiki/Swift/PriorityReviews
21:08:07 any questions/comments on the ptg?
21:08:31 Nope, looking forward to it :)
21:09:42 on to new business :-)
21:09:52 #topic ratelimit + s3api
21:10:55 so after digging out from the massive pile of async pendings, i wanted to make sure it didn't happen again, at least not easily. and one easy way to do that is to limit how quickly writes happen in a cluster
21:11:34 fortunately, we have a ratelimit middleware! unfortunately, it could be a little annoying to deploy with s3api
21:12:47 (my understanding is) you usually want to place it left of auth -- that auth decision may be expensive, and you don't want to have auth fall down because of an over-eager swift client
21:13:34 but if it's an s3 request, you won't have the full account/container path until *after* auth
21:14:28 so i'm thinking that having it twice (once before s3api and once after auth) might be a reasonable way to go? how do other people deploy that?
21:14:51 #link https://review.opendev.org/729051
21:14:51 patch 729051 - swift - ratelimit: Allow multiple placements - 3 patch sets
21:15:55 rledisez, surely you've got *something* for this, yeah? is it a custom ratelimiter?
21:16:23 right now (when we enabled it) we put it left of auth and s3api, but we don't have many s3 requests. and by default we do not use it
21:16:34 timburke: nothing custom. we don't ratelimit, we scale :P
21:17:03 joke aside, we enable this only when really necessary, it's pretty rare
21:17:04 excellent
21:17:35 oh, cool! never mind then ;-)
21:17:38 the usual situation is the "delete storm"
21:18:49 fwiw, following https://review.opendev.org/#/c/697535/ we can have ratelimit right of s3api and auth and still serve "reasonable" responses to s3 clients
21:18:50 patch 697535 - swift - s3api: Better handle 498/429 responses (MERGED) - 1 patch set
21:20:04 ...but i quickly realized that it'd throw off my metrics, since AWS sends out 503s -- i want a way to easily differentiate between 503 (slow down) and 503 (backend failed)
21:20:20 which led to https://review.opendev.org/729092
21:20:21 patch 729092 - swift - s3api: Log ratelimited requests as 498s - 2 patch sets
21:21:22 idk how sane of a thing that is to do though -- it feels a little dirty lying in logs like that
21:23:03 so i also started thinking about returning some other error code -- i don't know of any s3 clients that would retry a 498, but the rfc-compliant 429 seemed to get awscli to retry at least
21:23:06 #link https://review.opendev.org/729093
21:23:07 patch 729093 - swift - s3api: Add config option to return 429s on ratelimit - 1 patch set
21:23:42 rledisez, kota_: any opinion on which approach seems better/more reasonable?
21:24:43 (could even do both, i suppose; if configured, log & return 429, otherwise log 498 but return 503)
21:25:13 we need to log the same code as what is returned to the customer (otherwise we can't discuss the SLA: I got 10% 503, I only see 2%)
21:25:30 sounds reasonable, but what i thought when looking at the log change patch a little is we may leave 503 logging for debug due.
21:25:31 so I would say we need to return something different
21:26:00 s/due/for user support/
21:26:17 i didn't dig into it in detail yet, just a quick look.
21:27:14 *nod* makes sense. i suppose i ought to dig in more to see how well other s3 clients support 429
21:27:48 I know that I would for sure only use the "429" option you describe, seems the best option (and retrying is not that bad I guess, it does not consume many resources)
21:28:06 users might say, "hey, I got 503s", so 503s reported in the swift logs help us to debug.
21:29:39 all right, on to updates!
21:29:50 #topic lots of small files
21:30:12 rledisez, alecuyer how's it going?
21:31:03 So alecuyer is off, he told me a few days ago that some tests were passing. I guess he worked on it this week but to be honest I don't know much. We saw a change in diskfile that needs to be backported to LOSF (a new parameter to _finalize_durable from a recent patch of yours, timburke)
21:31:54 that's it
21:32:31 👍 sorry for the extra trouble ;-)
21:33:10 #topic database reclaim locking
21:33:45 i was hoping clayg would be around to discuss his findings on https://review.opendev.org/#/c/727876/
21:33:46 patch 727876 - swift - Breakup reclaim into batches - 4 patch sets
21:34:02 but i could take a stab at it :D
21:35:36 the current code breaks up the deleted namespace into batches of ~1k and reaps them in a loop, getting a fresh connection/lock each time
21:36:48 and it seems to be working well! the reclaim still takes a while, but the server could continue writing deletes at a decent clip while it was happening
21:37:44 i think tests should pass now, and it's ready for review! i might have to ask clay for his testing setup though
21:38:07 has anyone else had a chance to take a look at it yet?
21:38:51 not yet
21:40:08 no worries. it's ready when you are :-)
21:40:17 that's all i had
21:40:22 #topic open discussion
21:40:32 anything else we should discuss?
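To make the placement question above concrete, the dual placement timburke floated would look roughly like the proxy-server.conf sketch below. The filter names and ordering are illustrative (a real pipeline carries more middleware), the ratelimit settings are made-up example values, and running the same filter twice in one pipeline assumes the "Allow multiple placements" change in patch 729051:

    [pipeline:main]
    # first ratelimit sits left of s3api and auth, so an over-eager plain-Swift
    # client is turned away before the (possibly expensive) auth decision;
    # the second runs after auth, once s3api has translated the request and
    # the full account/container path is known
    pipeline = catch_errors proxy-logging cache ratelimit s3api tempauth ratelimit proxy-logging proxy-server

    [filter:ratelimit]
    use = egg:swift#ratelimit
    # illustrative limits only
    account_ratelimit = 20
    container_ratelimit_100 = 100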
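A loose illustration of the batching described under "database reclaim locking": reap tombstone rows a small batch at a time, opening a fresh connection (and so taking a fresh lock) for each batch, so concurrent writers only ever wait on one short transaction rather than one giant reclaim. This is a paraphrase in plain sqlite3, not the code from patch 727876 (which works through the broker); the table and column names mirror Swift's container schema, and the batch size is the ~1k mentioned above:

    import sqlite3

    RECLAIM_BATCH = 1000  # roughly the ~1k batch size mentioned in the meeting

    def batched_reclaim(db_path, age_timestamp):
        # Delete reclaimable tombstones one batch at a time; each iteration
        # uses its own connection and its own short write transaction, so
        # pending writers can grab the lock between batches.
        while True:
            conn = sqlite3.connect(db_path)
            try:
                with conn:  # one short write transaction per batch
                    cur = conn.execute(
                        'DELETE FROM object WHERE ROWID IN ('
                        '    SELECT ROWID FROM object'
                        '    WHERE deleted = 1 AND created_at < ?'
                        '    LIMIT ?)', (age_timestamp, RECLAIM_BATCH))
                    if cur.rowcount < RECLAIM_BATCH:
                        return  # last (partial) batch; nothing left to reap
            finally:
                conn.close()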
21:42:50 i wonder whether clayg's reclaim patch will make https://review.opendev.org/#/c/571917/ and https://review.opendev.org/#/c/724943/ unnecessary...
21:42:50 patch 571917 - swift - Manage async_pendings priority per containers - 5 patch sets
21:42:51 patch 724943 - swift - WIP: Batched updates for object-updater - 2 patch sets
21:43:54 one of the comments on the patch has a gist for the script i'm using to do reclaim in one thread while inserting tombstones in another
21:45:24 all right, we oughta let mattoliverau and kota_ start their day ;-)
21:45:32 I'm not sure that it makes them unnecessary. they are just other tools in the toolbox to reduce the container-listing lag
21:45:40 thank you all for coming, and thank you for working on swift!
21:45:45 #endmeeting
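The testing setup mentioned above (the gist attached as a comment on patch 727876) isn't reproduced here, but its general shape is probably something like this sketch: one thread keeps writing tombstones through a ContainerBroker while another runs reclaim() against the same database. The path, object count, and timing are assumptions, not the contents of the gist:

    import os
    import threading
    import time

    from swift.common.utils import Timestamp
    from swift.container.backend import ContainerBroker

    DB_FILE = '/tmp/reclaim-test/containers/container.db'  # hypothetical path

    def write_tombstones(count=100000):
        # each thread uses its own broker, and so its own db connection
        broker = ContainerBroker(DB_FILE, account='a', container='c')
        for i in range(count):
            broker.delete_object('obj-%06d' % i, Timestamp(time.time()).internal)

    def run_reclaim():
        broker = ContainerBroker(DB_FILE, account='a', container='c')
        now = time.time()
        # reclaim every tombstone older than "now" while the writer is still going
        broker.reclaim(Timestamp(now).internal, Timestamp(now).internal)

    if __name__ == '__main__':
        os.makedirs(os.path.dirname(DB_FILE), exist_ok=True)
        ContainerBroker(DB_FILE, account='a', container='c').initialize(
            Timestamp(time.time()).internal, 0)
        writer = threading.Thread(target=write_tombstones)
        reaper = threading.Thread(target=run_reclaim)
        writer.start()
        time.sleep(5)  # let some tombstones build up first
        reaper.start()
        writer.join()
        reaper.join()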