Friday, 2021-10-08

12:17 <reid_g> Tried out the handoffs_only setting... very nice
16:00 <reid_g> In handoffs_only mode, if a disk fails, will handoffs for missing fragments be created?
16:02 <timburke__> reid_g, new writes may land on handoffs, but the data that was on the failed disk *will not* get rebuilt elsewhere (whether it was a primary or handoff location)
16:03 <timburke__> so it's the sort of thing you turn on during the massive rebalances that happen during an expansion, then turn off again as quickly as you can so you can ensure full durability
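For reference, the setting being discussed is a reconstructor option; a minimal ini sketch of where it typically lives in object-server.conf (assuming a stock deployment layout):

```ini
# object-server.conf -- sketch of the setting under discussion
[object-reconstructor]
# only revert fragments from handoffs back to their primaries; normal
# rebuild work is skipped, so leave this off outside of big rebalances
handoffs_only = True
```

As tim notes above, this trades durability work for rebalance speed, so it is meant to be temporary.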
16:07 <reid_g> Alright. So in normal operation, a failed disk will cause handoffs to be created for EC, for both new writes and stored data.
16:10 <timburke__> yup -- though the rebuilding-to-a-handoff is a fairly recent addition. for a while, the assumption was that the failed disk would get removed from the ring fairly quickly, and we'd only rebuild to the new primary. i should double-check when we added that...
16:11 <reid_g> Is that recent for EC only, or REP also?
16:13 <reid_g> Yeah, would be nice to know
16:37 <timburke__> got the rebuild-to-handoffs behavior back in stein: https://github.com/openstack/swift/blob/master/CHANGELOG#L863-L871
16:37 <timburke__> had it for replicated policies for a good long while
16:39 <reid_g> So when a swift drive fails: replication policy = create handoff copies of the missing data? EC policy (stein+) = reconstruct the missing fragments on handoffs?
16:40 <reid_g> And when the drive is replaced: replication / EC (stein+) = move the data back to the primary; EC otherwise = reconstruct the missing fragments on the new primary.
16:41 <timburke__> yup -- you'll want to unmount the failed drive as soon as you notice the failure. that'll cause the object server to start responding 507 to REPLICATE requests, and the replicator/reconstructor will use handoffs to ensure full durability
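The 507 behavior follows from the object server's mount check: once the drive directory is no longer a mount point, requests for that drive fail. A rough Python sketch of the idea (the function name and layout here are mine, not swift's actual code):

```python
import os
import tempfile

def drive_is_usable(devices_root, drive, mount_check=True):
    """Loose sketch of the per-request check swift's object server applies:
    with mount_check enabled, the drive directory must be a real mount
    point, otherwise the server answers 507 Insufficient Storage."""
    path = os.path.join(devices_root, drive)
    if mount_check:
        return os.path.ismount(path)
    return os.path.isdir(path)

devices_root = tempfile.mkdtemp()           # stand-in for /srv/node
os.mkdir(os.path.join(devices_root, 'd1'))  # plain dir, not a mount point

status = 200 if drive_is_usable(devices_root, 'd1') else 507
print(status)  # 507 -- an unmounted drive directory fails the mount check
```

Those 507s are what tell the replicator/reconstructor to stop counting that drive and push data to handoffs instead.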
16:43 <timburke__> hands in the DC go swap out drives. SRE gets a new FS on the replacement, mounts it in the old location, then swift works to fill it back up
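The swap workflow tim describes can be sketched as a few shell steps; device and mount-point names here are hypothetical, adjust for your layout:

```shell
# hypothetical names: failed drive mounted at /srv/node/d42, new disk /dev/sdx
umount /srv/node/d42            # object server now 507s for d42; handoffs kick in
# ...hands in the DC swap the physical drive...
mkfs.xfs /dev/sdx               # fresh filesystem on the replacement
mount -o noatime /dev/sdx /srv/node/d42
chown swift:swift /srv/node/d42
# no ring changes needed: the replicator/reconstructor refills the disk
```

Because the drive keeps its place in the ring, no rebalance is required; swift repopulates it on subsequent replication/reconstruction passes.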
16:48 <timburke__> note that sometimes drives fail in subtle ways -- they start to get just a *little* FS corruption, maybe you see ENODATA tracebacks in logs. it's not enough for SMART to indicate anything too fishy, and you can still do a healthcheck with a full PUT/POST/GET/DELETE of some well-known object through the object-server
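The direct-to-object-server healthcheck mentioned here can be sketched roughly as below; the drive, partition, account/container/object names, and port are all illustrative, and the object server speaks swift's internal HTTP API, so an X-Timestamp header is required:

```shell
# illustrative names: drive d1, partition 0, object-server on :6200
curl -i -X PUT "http://127.0.0.1:6200/d1/0/AUTH_test/health/canary" \
     -H "X-Timestamp: $(date +%s.%N)" -H "Content-Type: text/plain" \
     --data-binary 'healthcheck'
curl -i "http://127.0.0.1:6200/d1/0/AUTH_test/health/canary"
curl -i -X DELETE "http://127.0.0.1:6200/d1/0/AUTH_test/health/canary" \
     -H "X-Timestamp: $(date +%s.%N)"
```

The point of the check is that a subtly failing drive may still pass this round trip even while throwing ENODATA on older data.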
16:50 <timburke__> i'm ambivalent about how to handle those cases -- on the one hand, it's tempting to keep the disk in the ring but drop its weight to zero and drain off whatever data you can still read. on the other, the drive seems to be having trouble; how much should we really trust *anything* that's still on there?
16:52 <timburke__> if the cluster's generally fairly healthy, i'd lean toward just unmounting the disk and letting the other replicas/frags get it back to full durability. if it's *not* doing great, i start to worry about whether that failing drive is my last backstop against data loss
17:14 <reid_g> Yeah, they are normally healthy
17:24 <reid_g> Our normal process now is to get the list of failed disks in the morning and have them replaced by end of day, usually.

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!