Tuesday, 2021-10-12

*** thelounge94 is now known as redrobot13:02
*** redrobot is now known as thelounge9413:04
*** thelounge94 is now known as redrobot13:04
opendevreviewMerged openstack/swift-bench master: Migrate from testr to stestr  https://review.opendev.org/c/openstack/swift-bench/+/79894116:54
reid_gHello again! handoffs_only seems a bit risky, and mgt doesn't want to use it for adding capacity because of the risk to durability. We were able to speed up the rebalances when we increased the reconstructor workers... but this seems to lead to more timeout errors of various kinds being logged. Do you have recommendations for tuning these? Or is it just whack-a-mole?18:57
claygreid_g: correct, do NOT leave handoffs_only turned on after the EC rebalance finishes20:28
claygreid_g: process workers are great!  You can definitely make a rebalance go quite fast... maybe TOO fast!20:29
reid_gRight. I was asked not to use it at all20:29
reid_gSo maybe my answer is to scale down the workers? Went from 1 --> 1220:31
claygreid_g: unfortunately there's a lot of unhelpful i/o contention during a rebalance if you allow primaries to attempt rebuilds during a rebalance - if the capacity increase is of sufficient size I would say it's "not possible" to do an EC rebalance w/o handoffs_only mode.  This level of operational complexity/hand holding is considered a bug and an area of ongoing investigation at Nvidia.20:31
claygreid_g: there's a number of other knobs to tune besides workers - depending on what kind of timeouts you're seeing you may need to increase the concurrency settings for the object replication server ssync receivers so there's enough capacity to eat all the parts that are trying to get pushed off20:34
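[editor's note] A minimal sketch of where the knobs mentioned above live, assuming the option names from Swift's sample object-server.conf (`replication_concurrency`, `replication_concurrency_per_device` and `replication_lock_timeout` on the receiving object server; `reconstructor_workers`, `concurrency`, `handoffs_only`, `node_timeout` and `http_timeout` on the reconstructor). Every value below is an illustrative assumption, not a recommendation made in this discussion; the Python wrapper only parses the fragment so the settings can be inspected side by side.

```python
import configparser

# Hypothetical object-server.conf fragment. Option names are assumed to match
# Swift's sample config; all values are illustrative only.
FRAGMENT = """
[app:object-server]
# Receiver side: limits on concurrent incoming SSYNC requests.
replication_concurrency = 8
replication_concurrency_per_device = 2
# Seconds to wait for an existing replication device lock.
replication_lock_timeout = 30

[object-reconstructor]
# Sender side: worker processes and per-worker concurrency.
reconstructor_workers = 4
concurrency = 2
# Leave off outside of a rebalance (see the warning earlier in the log).
handoffs_only = False
node_timeout = 20
http_timeout = 120
"""


def load_fragment(text):
    """Parse an ini-style config fragment into a ConfigParser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


conf = load_fragment(FRAGMENT)
receiver = conf["app:object-server"]
sender = conf["object-reconstructor"]

print("ssync receiver concurrency:", receiver.getint("replication_concurrency"))
print("per-device receiver concurrency:",
      receiver.getint("replication_concurrency_per_device"))
print("reconstructor workers:", sender.getint("reconstructor_workers"))
print("handoffs_only:", sender.getboolean("handoffs_only"))
```

The point of showing both sections together is that the sender and receiver limits interact: raising `reconstructor_workers` alone pushes more work at receivers whose concurrency limits have not changed, which is one plausible source of the timeouts described above.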
opendevreviewTim Burke proposed openstack/swift master: ring: Introduce a v2 ring format  https://review.opendev.org/c/openstack/swift/+/80853020:42
opendevreviewTim Burke proposed openstack/swift master: ring: Allow RingData to vary dev_id_bytes  https://review.opendev.org/c/openstack/swift/+/80853120:42
opendevreviewTim Burke proposed openstack/swift master: Allow ring-builder CLI users to specify device ID  https://review.opendev.org/c/openstack/swift/+/80853220:42
opendevreviewTim Burke proposed openstack/swift master: ring: Allow builder to vary dev_id_bytes  https://review.opendev.org/c/openstack/swift/+/80853320:42
opendevreviewTim Burke proposed openstack/swift master: ring: Keep track of last primary nodes from last rebalance  https://review.opendev.org/c/openstack/swift/+/79055020:42
timburke_reid_g, fwiw, the hope is that getting https://review.opendev.org/c/openstack/swift/+/792075 stacked on top of all that ^^^ will make it so we can avoid needing to switch on handoffs_only at all during a rebalance. i still need to take a closer look at it, though; i had an idea about doing a HEAD *first* (before fanning out to get frags) that i wanted to try out20:46
reid_gThat sounds like an interesting change20:50
reid_gWe started down the path of trying to tackle the timeouts that occurred with the increased workers. Each time we changed 1 timeout, another timeout error would appear.20:51
timburke_oh! i'm glad i looked -- mattoliver already added the HEAD-first idea!20:53
reid_gUltimately I don't think it is affecting the application too much since it can detect the failed uploads but it would be nice to add the capacity without the danger of reducing durability while not taking a long time between iterations.20:53
opendevreviewTim Burke proposed openstack/swift master: WIP: Reconstructor: Use past node and abort to handoff  https://review.opendev.org/c/openstack/swift/+/79207520:54
timburke_(just a rebase)20:55
opendevreviewTimur Alperovich proposed openstack/swift master: Fix multipart upload listings  https://review.opendev.org/c/openstack/swift/+/81371523:37
