Monday, 2019-07-29

02:03 *** baojg has joined #openstack-swift
02:45 *** rcernin has quit IRC
02:47 *** rcernin has joined #openstack-swift
03:09 *** m75abrams has joined #openstack-swift
03:31 *** psachin has joined #openstack-swift
03:34 *** psachin has quit IRC
03:37 *** psachin has joined #openstack-swift
03:55 *** psachin has quit IRC
05:41 *** e0ne has joined #openstack-swift
06:11 *** e0ne has quit IRC
07:08 *** tesseract has joined #openstack-swift
07:11 *** rcernin has quit IRC
07:31 *** pcaruana has joined #openstack-swift
08:30 *** tkajinam has quit IRC
08:44 *** gkadam has joined #openstack-swift
08:58 *** tesseract has quit IRC
09:08 *** e0ne has joined #openstack-swift
09:13 *** ccamacho has joined #openstack-swift
09:20 *** tesseract has joined #openstack-swift
09:22 *** csmart has quit IRC
09:24 *** altlogbot_0 has quit IRC
09:26 *** altlogbot_0 has joined #openstack-swift
09:33 *** csmart has joined #openstack-swift
09:49 *** m75abrams has quit IRC
11:02 *** gkadam has quit IRC
11:17 *** gkadam has joined #openstack-swift
11:42 *** gkadam has quit IRC
11:44 *** gkadam has joined #openstack-swift
12:11 *** tdasilva has joined #openstack-swift
12:11 *** ChanServ sets mode: +v tdasilva
12:16 *** tdasilva has quit IRC
12:16 *** tdasilva has joined #openstack-swift
12:16 *** ChanServ sets mode: +v tdasilva
12:22 *** mvkr has quit IRC
13:18 *** BjoernT has joined #openstack-swift
13:20 *** mvkr has joined #openstack-swift
13:29 *** BjoernT_ has joined #openstack-swift
13:30 *** BjoernT has quit IRC
13:33 *** BjoernT has joined #openstack-swift
13:34 *** BjoernT_ has quit IRC
13:34 *** mvkr has quit IRC
13:47 *** mvkr has joined #openstack-swift
15:30 *** e0ne has quit IRC
15:39 *** gyee has joined #openstack-swift
15:55 <openstackgerrit> Tim Burke proposed openstack/python-swiftclient stable/stein: Fix SLO re-upload  https://review.opendev.org/673321
15:56 <openstackgerrit> Tim Burke proposed openstack/python-swiftclient stable/rocky: Fix SLO re-upload  https://review.opendev.org/673322
15:56 <openstackgerrit> Tim Burke proposed openstack/python-swiftclient stable/queens: Fix SLO re-upload  https://review.opendev.org/673323
15:59 <timburke> i really shoulda written up a bug for ^^^
15:59 <timburke> i realized that we probably ought to get that backported for the sake of distros -- bionic, for example, ships 3.5.0
16:00 <timburke> (which was also the release that introduced the bug)
16:01 *** mvkr has quit IRC
16:14 *** tdasilva has quit IRC
16:15 *** tdasilva has joined #openstack-swift
16:15 *** ChanServ sets mode: +v tdasilva
16:28 *** tesseract has quit IRC
17:14 *** e0ne has joined #openstack-swift
17:19 *** rovanleeuwen has joined #openstack-swift
17:20 <rovanleeuwen> Hello, I have a question: We had a container node that was on-line longer than the tombstone time-out and we started that up. Now we have lots of files in the listings that are actually not there. Is there a way to clean up this mess we made?
17:20 *** e0ne has quit IRC
17:20 <rovanleeuwen> I meant off-line longer than the tombstone time-out, obviously
17:21 *** aj11 has joined #openstack-swift
17:56 <timburke> rovanleeuwen, ouch :-( yeah, long maintenance windows usually require that you either increase the reclaim age (at least for a while) or wipe the drives before re-introducing them to the cluster
17:56 <timburke> fortunately, the situation you're in with objects appearing in listings that don't have data on-disk (which we sometimes call "ghost listings") isn't too bad to recover from: you can still issue deletes for the affected objects -- you'll get back a 404, but the object-server will still go update the container to mark the name as deleted
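For illustration, a rough sketch (not an official tool) of the clean-up timburke describes, using python-swiftclient: issue a DELETE for each affected name; the 404s are expected, but the object-server still updates the container so the ghost entry goes away. The auth settings and container name are placeholders, and this naive loop deletes every name in the listing -- narrowing it down to only the affected objects is the hard part discussed next.

```python
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

# placeholder credentials -- point this at your own cluster
conn = Connection(authurl='http://proxy:8080/auth/v1.0',
                  user='account:user', key='secret')

container = 'affected-container'
_, listing = conn.get_container(container, full_listing=True)
for entry in listing:
    try:
        conn.delete_object(container, entry['name'])
    except ClientException as err:
        if err.http_status != 404:   # 404 just means the data was already gone
            raise
```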
17:58 <rovanleeuwen> Any easy way to find the files and clean them up? It looks like swift-account-audit will show you the files with errors. I guess I can do some bash magic to issue deletes for those, but an option on that command to force a clean-up would be nice :)
17:59 <timburke> i was just about to mention how the hard part becomes identifying the objects that were affected :-) you might want to spin up a special proxy server with a particularly-high request_node_count and long conn_timeout and node_timeout
17:59 <timburke> you might also be able to do some digging through your proxy logs to find object names that were deleted
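As a sketch of that "special proxy" idea -- the values below are illustrative, not recommendations -- the relevant proxy-server.conf knobs look something like:

```ini
[app:proxy-server]
use = egg:swift#proxy
# ask far more nodes than usual before trusting a 404
request_node_count = 3 * replicas
# be patient with slow backends while auditing
conn_timeout = 5
node_timeout = 60
```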
18:05 <timburke> rledisez, alecuyer: if you get a chance, i feel like your particular perspective might be valuable on https://review.opendev.org/#/c/672186/ -- i feel like i'm changing our contract a decent bit and i think you'd both have a pretty good feel for whether this would be preferable for users
18:05 <patchbot> patch 672186 - swift - Ignore 404s from handoffs for objects when calcula... - 6 patch sets
18:06 <timburke> fwiw, tdasilva and i talked a bit about it last week, see http://eavesdrop.openstack.org/irclogs/%23openstack-swift/%23openstack-swift.2019-07-23.log.html#t2019-07-23T14:28:42
18:09 <tdasilva> rledisez, timburke: I thought that this etherpad was also a nice way to visualize which behavior changed and which remained the same: https://etherpad.openstack.org/p/swift-fail-responses
18:13 <rledisez> thx for the link tdasilva
18:13 <rledisez> is there something missing for situation 5? I don't understand it
18:14 <rledisez> situation 6 might look a bit weird, but it clearly reflects the "eventually consistent" aspect of swift, i'm okay with that
18:15 <rledisez> timburke: for sure it's gonna hurt my SLA, but it's for the best, right? :)
18:17 <timburke> for 5 -- we may not actually have responses from primaries, because we *never made the requests* since the nodes were error-limited. so that situation is specifically for "all primaries are error-limited, all handoffs respond 404"
18:17 <timburke> yeah, the SLA-aspect was part of my concern...
18:18 <rledisez> timburke: ok, error limited, got it.
18:19 <rledisez> just joking about SLA, it would feel like hiding the issue. is somebody monitoring 404? (we don't, as we assumed it was a user issue)
18:24 <timburke> we had a customer complain about incorrect status codes during a stress test -- they knew the data existed and had for a decent period of time. they were concerned about how putting load on the cluster to the point that timeouts started popping, nodes got error-limited, etc. would cause inconsistencies in what had been settled data
18:25 <timburke> i guess in part i'm wondering about how concerned you'd be if there was a sudden up-tick in 503s after pushing that change out
18:26 <timburke> but maybe you already have ratelimiting in place such that your customers aren't running into this problem?
18:29 <rledisez> we don't do ratelimiting on GET. we do some on PUT/POST to protect the clusters. our 5xx levels are pretty low. would it make sense to make this new behavior configurable, at least for a few versions, so that we can try it on one proxy, check what happens, etc…?
18:34 <timburke> rledisez, i can see about doing that. what's your thinking on the default for it? old-but-maybe-wrong behavior to minimize upgrade disruption? or new-but-more-error-ful behavior so we can get people where we "want" them faster? (cc clayg)
18:35 <rledisez> I would say the new behavior is the default, with a note/warning in the changelog
18:36 <timburke> 👍
18:36 <rledisez> I'm assuming everybody is reading the changelog, but I also know i'm probably wrong. good way to teach them they should :)
18:38 <rledisez> totally unrelated subject:
18:38 <rledisez> do you know where the request dispatching to workers is done for the proxy? I would say eventlet, but i'm not comfortable yet with wsgi.py :)
18:38 <rledisez> I'm having the issue that it does not dispatch the requests equally (e.g. round-robin). So, even if there are workers doing nothing, some workers might be handling multiple requests.
18:38 <rledisez> It matters with EC, because one worker runs on one core, so the proxy-server can get CPU-bound while the system is mostly idle.
18:53 <rledisez> hum, I think I get it. there is no "dispatch code", the kernel is auto-magically doing it
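For reference, a minimal, hypothetical sketch of that pre-fork model (not Swift's actual wsgi.py): the parent binds one listening socket and forks workers, every worker blocks in accept() on that same socket, and the kernel decides which worker wakes up for a given connection -- which is why the spread across workers isn't round-robin.

```python
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', 8080))
sock.listen(128)

for _ in range(4):                      # hypothetical worker count
    if os.fork() == 0:                  # child process acts as one "worker"
        while True:
            conn, _addr = sock.accept() # all workers wait here; the kernel
            conn.sendall(b'HTTP/1.1 204 No Content\r\n\r\n')
            conn.close()                # hands the connection to one of them
        os._exit(0)

os.wait()                               # parent just waits on the children
```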
18:54 *** rovanleeuwen has quit IRC
19:07 <timburke> fwiw, we make GreenAsyncPiles to make requests and gather responses in each of Controller.make_requests and ResumingGetter._get_source_and_node in proxy/controllers/base.py and ECObjectController._get_or_head_response in proxy/controllers/obj.py
19:08 <timburke> how it gets scheduled from there, i'm not so sure. sounds like you found out it's in kernel-land :-)
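A rough sketch of that fan-out/gather shape using plain eventlet.GreenPile -- Swift's GreenAsyncPile (swift/common/utils.py) is its own class, so this only approximates the pattern, not the proxy's code: spawn one greenthread per backend node, then iterate to gather the results.

```python
import eventlet
from eventlet.greenpool import GreenPile


def fetch_from_node(node):
    """Stand-in for one backend GET/HEAD request; just pretends to do I/O."""
    eventlet.sleep(0.01)
    return (node, 200)


nodes = ['node1', 'node2', 'node3']   # hypothetical primary nodes
pile = GreenPile(len(nodes))
for node in nodes:
    pile.spawn(fetch_from_node, node)

# iterating the pile yields one result per spawn()
print([result for result in pile])
```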
19:11 <openstackgerrit> Tim Burke proposed openstack/swift master: Give ECAppIter greenthreads a chance to wrap up  https://review.opendev.org/665773
19:16 <timburke> tdasilva, clayg: what do you guys think about the idea of having hardlink-to-symlink validation stop at the symlink? it feels a little weird (to me) that it's perfectly valid to have a symlink point to a name that doesn't exist yet, but you can't make a hardlink point to that symlink
19:18 <clayg> yeah that sounds weird... that traversal code is a bit brittle - IIRC it's reused on PUT (validation) & GET
19:18 <timburke> and given a symlink that *does* point at something, a hardlink pointing at it won't have any stronger guarantees for having done the extra work up front -- you could still swap out the target and the hardlink would 200
19:19 <clayg> yeah I'm pretty sure there's a test for something very close to that - so I'm kind of surprised by the situation you described... have you attempted a fix?
19:19 <clayg> 100% agree there's no guarantees once the hardlink points to a symlink - at that point no further validation seems needed
19:20 <clayg> from my perspective!!! 🤷‍♂️
19:22 *** e0ne has joined #openstack-swift
19:24 *** aj11 has quit IRC
19:44 <timburke> "I'm kind of surprised by the situation you described" -- what, that last part, where the hardlink-PUT fails? yeah, it responds 404 as it tries to follow the symlink, and _validate_etag_and_update_sysmeta() returns any error it hits during validation
20:02 <clayg> yeah i guess that's the part that was surprising/undesirable - but thinking about how that code works for GET/PUT I guess it makes sense how we got here - if you drop a comment I'll figure out some way to get some tests together and make them pass
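For context, a sketch of the two PUTs under discussion, with made-up container/object names and placeholder auth, using the symlink middleware's X-Symlink-Target and X-Symlink-Target-Etag headers (the etag value below is just a placeholder).

```python
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

conn = Connection(authurl='http://proxy:8080/auth/v1.0',
                  user='account:user', key='secret')

# a symlink may point at a name that doesn't exist yet -- this PUT succeeds
conn.put_object('cont', 'mylink', contents=b'',
                headers={'X-Symlink-Target': 'cont/does-not-exist-yet'})

# but a hardlink (pinned with X-Symlink-Target-Etag) aimed at that symlink
# is validated up front, and per the discussion above the PUT comes back 404
try:
    conn.put_object('cont', 'myhardlink', contents=b'',
                    headers={'X-Symlink-Target': 'cont/mylink',
                             'X-Symlink-Target-Etag': 'd41d8cd98f00b204e9800998ecf8427e'})
except ClientException as err:
    print(err.http_status)   # 404 in the case timburke describes
```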
20:41 <openstackgerrit> Merged openstack/swift stable/pike: Fix the pep8 test will raise "Unknown test found in profile: B109"  https://review.opendev.org/673131
20:46 *** e0ne has quit IRC
20:59 *** tdasilva has quit IRC
20:59 *** tdasilva has joined #openstack-swift
20:59 *** ChanServ sets mode: +v tdasilva
21:22 <timburke> rledisez, if i'm adding a config option anyway... i wonder whether it ought to control the account/container behavior change from https://review.opendev.org/#/c/667411/ too...
21:22 <patchbot> patch 667411 - swift - Return 503 when primary containers can't respond (MERGED) - 2 patch sets
21:22 <timburke> there was much more obvious bad behavior there -- we'd cache a container's non-existence when it was actually just overloaded :-(
21:23 <timburke> which would prevent further writes
21:53 *** baojg has quit IRC
21:59 *** BjoernT has quit IRC
22:06 *** mvkr has joined #openstack-swift
22:12 *** zaitcev has joined #openstack-swift
22:12 *** ChanServ sets mode: +v zaitcev
22:22 <zaitcev> Anyone noticed recently that proxy-server double-quotes in its log?
22:22 <zaitcev> Jul 26 20:30:35 rhev-a24c-01 proxy-server[9123]: 127.0.0.1 127.0.0.1 27/Jul/2019/00/30/35 PUT /v1/AUTH_test/%25EF%25BF%25BD%25EF%25BF%25BD%25EF%25BF%25BD.....
22:22 <zaitcev> Look at all these %25
22:23 <zaitcev> Same transaction: Jul 26 20:30:35 rhev-a24c-01 container-server[9036]: 192.168.50.1 - - [27/Jul/2019:00:30:35 +0000] "PUT /a2/196445/AUTH_test/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF
22:38 <timburke> zaitcev, i'm even more concerned about the fact that there's a bunch of REPLACEMENT CHARACTERs...
22:40 <timburke> the double-quoting is a pretty long-standing wart that even prompted notmyname to write https://github.com/openstack/swift/commit/11e81cfc8
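A tiny reproduction of the effect zaitcev is seeing (a sketch, not the proxy's logging code): the object name arrives already percent-encoded on the wire, and quoting it again for the access log turns every "%" into "%25".

```python
from urllib.parse import quote

name = '\ufffd\ufffd\ufffd'                # the U+FFFD REPLACEMENT CHARACTERs
on_the_wire = quote(name.encode('utf-8'))  # '%EF%BF%BD%EF%BF%BD%EF%BF%BD'
logged_again = quote(on_the_wire)          # '%25EF%25BF%25BD%25EF%25BF%25BD...'

print(on_the_wire)    # what the container-server line shows
print(logged_again)   # what the proxy-server line shows
```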
22:54 *** tkajinam has joined #openstack-swift
23:02 *** rcernin has joined #openstack-swift
23:18 <openstackgerrit> Tim Burke proposed openstack/swift master: Ignore 404s from handoffs for objects when calculating quorum  https://review.opendev.org/672186
23:19 <timburke> rledisez, i may need your help on the UpgradeImpact wording in ^^^
23:20 <timburke> as i tried to come up with an explanation there, i was really struggling to find a compelling reason to use the option -- it kinda feels like a cop-out to me
