Tuesday, 2019-07-23

00:07 *** gyee has quit IRC
00:10 *** tkajinam has quit IRC
00:10 *** godog has quit IRC
01:17 <mattoliverau> morning
01:45 *** baojg has joined #openstack-swift
02:32 *** BjoernT has joined #openstack-swift
02:36 *** BjoernT has quit IRC
02:41 *** BjoernT has joined #openstack-swift
02:58 *** BjoernT has quit IRC
02:58 *** BjoernT_ has joined #openstack-swift
03:07 *** BjoernT_ has quit IRC
03:28 *** psachin has joined #openstack-swift
03:49 *** gkadam has joined #openstack-swift
03:50 *** gkadam has quit IRC
04:13 *** rcernin has quit IRC
04:14 *** rcernin has joined #openstack-swift
04:20 *** rcernin has quit IRC
04:43 *** pcaruana has joined #openstack-swift
05:04 *** threestrands has quit IRC
05:11 *** m75abrams has joined #openstack-swift
05:41 *** notmyname has quit IRC
05:42 *** notmyname has joined #openstack-swift
05:42 *** ChanServ sets mode: +v notmyname
06:26 *** new_student1411 has joined #openstack-swift
06:40 *** e0ne has joined #openstack-swift
06:41 *** e0ne has quit IRC
07:09 *** tesseract has joined #openstack-swift
07:18 *** godog has joined #openstack-swift
07:20 *** irclogbot_1 has quit IRC
07:20 *** openstackstatus has quit IRC
07:22 *** irclogbot_0 has joined #openstack-swift
07:23 *** joeljwright has quit IRC
07:23 *** joeljwright has joined #openstack-swift
07:23 *** ChanServ sets mode: +v joeljwright
07:23 *** cwright has quit IRC
07:24 *** cwright has joined #openstack-swift
08:00 *** mikecmpbll has joined #openstack-swift
08:46 *** e0ne has joined #openstack-swift
09:20 *** psachin has quit IRC
09:24 *** klamath has quit IRC
09:35 *** psachin has joined #openstack-swift
10:06 *** viks___ has quit IRC
10:21 *** tdasilva has joined #openstack-swift
10:21 *** ChanServ sets mode: +v tdasilva
11:43 <openstackgerrit> Thiago da Silva proposed openstack/swift master: Probe tests for 'ignore 404s from handoff nodes'  https://review.opendev.org/672268
12:22 *** ccamacho has joined #openstack-swift
12:33 *** henriqueof has joined #openstack-swift
13:22 *** openstackstatus has joined #openstack-swift
13:22 *** ChanServ sets mode: +v openstackstatus
14:18 *** baojg has quit IRC
14:19 *** baojg has joined #openstack-swift
14:23 *** gyee has joined #openstack-swift
14:23 <timburke> tdasilva, so i think i've got an explanation for the flaky test in https://review.opendev.org/#/c/672186/ -- we don't know which of https://github.com/openstack/swift/blob/master/test/unit/proxy/test_server.py#L7634-L7635 we'll get first, and we ignore the durable flag if it comes second
14:23 <patchbot> patch 672186 - swift - Ignore 404s from handoffs for objects when calcula... - 1 patch set
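
[To make the order dependence concrete, here is a toy illustration with made-up names -- not the test's actual code: two stubbed 404 responses where only one carries the durable marker, and logic that only honors the marker on the first response it sees.]

    import random

    # Two fake object-server responses; only one advertises a durable frag.
    responses = [{'status': 404, 'durable': True},
                 {'status': 404, 'durable': False}]
    random.shuffle(responses)  # the test can't control which arrives first

    # If the code under test only looks at the first response, the durable
    # flag is silently ignored whenever it happens to come second...
    saw_durable = responses[0]['durable']
    print(saw_durable)  # True or False at random -> a flaky test
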
14:26 <tdasilva> timburke: that test is confusing :/
14:27 <openstackgerrit> Tim Burke proposed openstack/swift master: Ignore 404s from handoffs for objects when calculating quorum  https://review.opendev.org/672186
14:28 <timburke> the larger point about us pretty clearly having chosen to prefer 404 over 503 in these cases is definitely valid, though. i'll dig through some review history, see what i can glean
14:30 <timburke> part of me thinks that we (as server devs) chose the thing that made us look less bad (4xx instead of 5xx -- must be a client problem!) instead of the thing that would be more useful to clients (i couldn't talk to *any* of the primaries; maybe i should try that again)
14:32 <tdasilva> yeah, these are the times I do miss having some sort of documentation detailing the path chosen; it would just make it easier to argue one way or the other
14:32 <tdasilva> for the scenario of EC where we know we have some data (especially if a frag is durable), it's easier to argue that a 503 makes sense
14:33 <tdasilva> but in the case of a timeout and no sign of data, an argument for 404 is not wrong
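
[The behavior the patch argues for can be sketched roughly like this -- a toy model with made-up names, not Swift's actual quorum code: a 404 from a handoff says nothing about whether the object exists, so it is dropped from the tally, and with too few trustworthy answers the proxy returns a 503 so the client knows to retry.]

    def pick_status(responses, quorum_size):
        """responses: list of (status, is_primary) pairs from backend nodes."""
        # A primary's 404 is evidence of absence; a handoff's 404 only says
        # "i never had it", so it shouldn't count toward a 404 quorum.
        counted = [status for status, is_primary in responses
                   if is_primary or status != 404]
        for status in (200, 404):
            if sum(1 for s in counted if s == status) >= quorum_size:
                return status
        return 503  # not enough definitive answers: tell the client to retry

    # Three handoff 404s no longer add up to a confident "not found":
    print(pick_status([(404, False)] * 3, quorum_size=2))  # -> 503
    print(pick_status([(404, True)] * 2, quorum_size=2))   # -> 404
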
14:41 *** dosaboy has joined #openstack-swift
14:41 <timburke> yeah -- certainly, a 404 can *always* be argued as an "eventually consistent" response -- there definitely *was a time* when that was correct!
14:42 <timburke> but it means that our (or at least *my*) model of consistency windows based on cycle times is busted -- all data can be durable and squared away where it belongs, but then a bunch of load comes along and kicks out my consistency window arbitrarily
14:43 <timburke> i wonder how much the patch would have to change if i *just* did the EC part in obj, and left base alone...
14:48 <tdasilva> timburke: it feels a bit more complex than that though; it's not only cycle times... in the case of a PUT, we would be glad to go to a handoff and put the data there
14:50 <timburke> and if we *find* that data on the handoff, we're still happy to serve it! i'm just not so sure that we should consider a 404 from a handoff with the same weight that we would a 404 from a primary
14:51 <timburke> definitely seems like i ought to write up a new, separate bug rather than just referencing the related one for containers
14:52 <timburke> and we maybe should raise it at the meeting tomorrow, try to get more people's input on the idea
14:52 <tdasilva> sorry, i'm not trying to be difficult, just playing devil's advocate
14:52 <tdasilva> yeah!
14:53 <timburke> no need to apologize, it's great! this is hard, complicated stuff!
14:53 <timburke> it really goes to the core of some parts of swift and how it works
14:57 <timburke> the container issue i felt better about in part because of some of the clearly-bad outcomes that happen because we'd cache the non-existence. we don't quite do anything like that for objects -- but that doesn't tell us anything about what *clients* may be doing differently as a result! they may *also* cache that non-existence for some period, and there's certainly no expectation that they'd *retry* ...
15:00 <timburke> it really doesn't help that external clients get nothing to differentiate a 404-with-tombstone from a 404-without
15:02 <timburke> like, if i have some reason to think that data *should be there* but i get back a 404 and it's not even from a tombstone -- maybe i shouldn't entirely trust that
15:03 <timburke> but the more i continue down that line of thinking, i just arrive at the patch i've proposed: as a server, the client sending the GET should be all the reason i *need* to think that there ought to be data there
15:04 <timburke> and if i expect clients to not trust the 404 i'm about to send, why should i send a 404 at all? *i* probably shouldn't trust it either!
15:13 <timburke> indirection objects (dlo, slo, symlink) make this weirder. you might do a GET that bypasses indirection, have it succeed, so now as a client you know where to look for the backing data. do a GET that would follow the indirection, but it fails with a 404 because there's a bunch of load. but using our prior knowledge we can still try to GET the underlying data and it works! and if we wait a while so load dies down, the GET to the indirection obj will work, too!
15:14 <timburke> all of this with the system in a more-or-less steady state. no new writes, no active rebalances
15:21 *** ccamacho has quit IRC
15:22 <tdasilva> regarding this statement: "like, if i have some reason to think that data *should be there* but i get back a 404 and it's not even from a tombstone -- maybe i shouldn't entirely trust that"   OTOH, one could argue that this behavior is the eventual-consistency contract, or at least that the result you are getting is the result of an eventually consistent system... the issue is that now we might have two possible results for the same situation
15:23 <tdasilva> because we can't guarantee that one won't see a 404 after putting objects in the system
15:24 <tdasilva> in other words, we can't guarantee that we always return a 503, can we?
15:25 <timburke> ...which is a bit at odds with our (usual) ability to read new writes :-/
15:26 <timburke> hmm. i suppose i could see some pathological cases where we'd still see 404s when, looking more broadly, a 503 would've been more appropriate (for this new notion of "appropriate")
15:28 <tdasilva> timburke: the scenario i'm thinking of is a network partition where writes would go to handoff nodes, but a read would go to a primary node and would return a 404
15:28 * tdasilva is attending a meeting, afk for a bit
15:28 <timburke> yeah, but we'd still dig into handoffs since we haven't found data...
15:30 <timburke> what i had in mind was something like yours, but then there's *also* a rebalance between writing and reading that shuffles the handoff list enough that we can't find the write
15:32 *** m75abrams has quit IRC
15:39 *** e0ne has quit IRC
15:59 <clayg> yes, I think we definitely considered "not found" a good approximation for "it's not available ... right *now*"
16:00 <clayg> But we've done all kinds of weird things with status codes before... I don't think it's wrong to change it, but we may uncover some code, like... in container sync or the expirer or reconciler, that was handling "404" as "need to retry"
16:00 <clayg> I think 5XX conveys "you should retry" better than 404, so... it's probably FINE
16:01 *** e0ne has joined #openstack-swift
16:07 *** tesseract has quit IRC
16:26 *** baojg has quit IRC
16:35 *** mikecmpbll has quit IRC
16:38 <timburke> i stand by my assumption that "reduce 5xx responses" was a prior design goal :P
16:47 <timburke> hmm... so following https://review.opendev.org/#/c/215276/ -- will an object server that 404s (because it knows it only has frags the proxy doesn't need) still send an indication that it's got a durable?
16:47 <patchbot> patch 215276 - swift - Enable object server to return non-durable data (MERGED) - 43 patch sets
16:47 <timburke> i miss acoles :-(
16:49 <tdasilva> yeah, was just thinking the same
17:27 <timburke> oh good! looks like it doesn't actually 404, we just let the proxy drop it: https://github.com/openstack/swift/blob/2.22.0/swift/obj/diskfile.py#L3473-L3479
17:33 *** e0ne has quit IRC
17:53 <openstackgerrit> Merged openstack/swift master: Update api-ref location  https://review.opendev.org/672107
17:53 <openstackgerrit> Merged openstack/swift master: Make py36 job voting  https://review.opendev.org/657034
17:53 <openstackgerrit> Merged openstack/swift master: Bump up our minimum eventlet version  https://review.opendev.org/665758
17:53 <openstackgerrit> Merged openstack/swift master: Add Python 3 Train unit tests  https://review.opendev.org/669511
18:19 *** psachin has quit IRC
18:42 *** e0ne has joined #openstack-swift
18:51 *** tdasilva has quit IRC
19:18 <zaitcev> wait, where do you think you're going
19:19 <zaitcev> timburke: Did you try to run py2 functional tests against a py3 cluster?
19:23 <timburke> zaitcev, yeah -- and we've even got a gate job that does it now
19:23 <timburke> but you're still seeing trouble with the non-ascii metadata?
19:23 <zaitcev> Yes, well.... Maybe it's some old version of whatever. It's a live cluster and it accumulates cruft.
19:24 <timburke> what version of eventlet is it running? what version of python?
19:25 <timburke> i'm *really* interested in the failures that come because of cruft!
19:25 <zaitcev> I have an account now that has two metadata entries with the exact same non-ASCII key, but in different encoding. When doing a HEAD, one of them pops out of the depths of db.py in native unicode, and one in WSGI form. The account server attempts to return both and then crashes in eventlet.
19:27 <zaitcev> I hacked around it roughly, so tests can continue. But my hack wraps the .encode('latin-1').decode('utf-8') dance in a try:, and if that fails, it does nothing. This causes the account server to return both keys, but now they are indistinguishable. That also means that I cannot delete one of them, ever.
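
[A rough sketch of the workaround zaitcev is describing -- the function name and placement here are guesses, not the actual hack:]

    def wsgi_to_native(key):
        # Keys stored in WSGI form (bytes masquerading as latin-1 text)
        # round-trip cleanly to native unicode; keys that are already
        # native unicode blow up somewhere in the dance, so those are
        # left alone.
        try:
            return key.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return key

[As zaitcev notes, after this both stored keys normalize to the same string, so there is no longer any way to address -- or delete -- just one of them.]
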
timburke"exact same non-ASCII key, but in different encoding" 😳19:27
19:28 <zaitcev> I'm thinking about a mode where you do GET /AUTH_foo?everything=1 and this uses some kind of %- or \- escape to print everything raw. Then the same goes into POST, gets parsed, and permits me to delete the offending metadata.
19:28 <timburke> what's the history with the cluster? was it running mixed py2/py3? did the metadata predate the py3 upgrade?
19:29 <timburke> if we look at the account db from sqlite, what's the (json) metadata dict look like?
19:30 <zaitcev> Response.__call__ status '200 OK' headers [('Content-Type', 'application/json; charset=utf-8'), ('Content-Length', '2'), ('X-Account-Meta-Unià¸\x92', '1'), ('X-Account-Meta-Uniฒ', '1')]
19:31 <zaitcev> >>> s="X-Account-Meta-Unià¸\x92"
19:31 <zaitcev> >>> s.encode('latin-1').decode('utf-8')
19:31 <zaitcev> 'X-Account-Meta-Uniฒ'
19:31 <zaitcev> If your IRC client allows, you can see that they are literally the same.
19:32 <timburke> yeah -- i see it... hmmm...
19:32 <zaitcev> To answer your question, the history is straightforward: it was on py2 until a week ago, when I installed the first py3 that passed all unit tests.
19:33 <zaitcev> I think maybe PUT writes WSGI and then POST writes native, or vice versa
19:34 <zaitcev> Well, if our gate does the functests, then maybe it's nothing.
19:34 <timburke> could be. gives me something to try...
19:35 <timburke> how was the metadata set? python-swiftclient? if so, running with which version of python?
19:35 <zaitcev> As for running mixed py2/py3: yes, it did run that for a few days. I didn't notice anything amiss, it looked real good.
19:36 <zaitcev> I don't know how to find out how the metadata was set. Do we even write that down?
19:37 <timburke> i still have troubles with swiftclient -- see also https://review.opendev.org/#/c/645388/12/test/functional/test_account.py@722
19:37 <patchbot> patch 645388 - swift - py3: Cover account/container func tests - 12 patch sets
19:38 <timburke> and my comments about it back on patchset 8
19:40 *** henriqueof has quit IRC
19:41 <timburke> if we look at a recent gate run like http://logs.openstack.org/86/672186/2/check/swift-dsvm-functional-py3/db60a3f/ ...
19:42 <timburke> we can see that swift isn't even installed on py2: http://logs.openstack.org/86/672186/2/check/swift-dsvm-functional-py3/db60a3f/controller/logs/pip2-freeze.txt.gz
19:42 <timburke> but is on py3: http://logs.openstack.org/86/672186/2/check/swift-dsvm-functional-py3/db60a3f/controller/logs/pip3-freeze.txt.gz
19:42 <timburke> but then tox is using py2 to run the tests: http://logs.openstack.org/86/672186/2/check/swift-dsvm-functional-py3/db60a3f/tox/func-0.log
19:49 <timburke> fwiw, my hope would be that we'd serialize that as '\\u0e12' in json on both py2 and py3 -- and that we'd be able to clear the bad guy with b'X-Remove-Account-Meta-Uni\xc3\xa0\xc2\xb8\xc2\x92: x' or the like
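
[The byte math here checks out; a quick self-contained check, just an illustration rather than anything from the patch:]

    import json

    native = u'\u0e12'                               # 'ฒ', the real key suffix
    wsgi = native.encode('utf-8').decode('latin-1')  # the mojibake twin 'à¸\x92'

    # json escapes the native form identically on py2 and py3...
    assert json.dumps(native) == '"\\u0e12"'
    # ...and re-encoding the mojibake as UTF-8 doubles the bytes, giving
    # exactly the sequence in timburke's proposed remove header.
    assert wsgi.encode('utf-8') == b'\xc3\xa0\xc2\xb8\xc2\x92'
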
19:51 <timburke> i *think* https://opendev.org/openstack/swift/commit/76fde892 landed *after* full unit test coverage? you'd need that to make this work right...
19:52 <timburke> that was certainly a large part of why i had it blocking the release
19:55 <timburke> but then, the proxy on py3 wouldn't show *either* of those headers without that patch... so...
20:06 *** e0ne has quit IRC
20:20 <zaitcev> Proxy does. It's the account server that crashes after the app returns, when eventlet starts to process headers.
20:28 <timburke> :-/ we definitely shouldn't be sending non-wsgi down to eventlet
20:29 <timburke> i kinda really want that raw meta. like, `sqlite3 <path to account db> 'select metadata from account_stat;'`
20:32 <timburke> looks like we finished up unit tests as of https://github.com/openstack/swift/commit/9f1ef3563 -- is that the sha you're on?
20:42 <zaitcev> http://www.zaitcev.us/things/swift/metadup.txt
20:50 *** pcaruana has quit IRC
21:22 *** e0ne has joined #openstack-swift
21:33 *** altlogbot_1 has quit IRC
21:33 *** irclogbot_0 has quit IRC
21:34 *** altlogbot_0 has joined #openstack-swift
21:35 *** irclogbot_0 has joined #openstack-swift
21:50 *** e0ne has quit IRC
21:59 *** irclogbot_0 has quit IRC
22:01 *** altlogbot_0 has quit IRC
22:23 *** altlogbot_1 has joined #openstack-swift
22:23 *** gyee has quit IRC
22:27 *** altlogbot_1 has quit IRC
22:49 *** new_student1411 has quit IRC
22:51 *** tkajinam has joined #openstack-swift
23:05 <openstackgerrit> Clay Gerrard proposed openstack/swift master: WIP: Rebuild non-durable frags if we can  https://review.opendev.org/672385
23:10 *** gyee has joined #openstack-swift
23:14 *** altlogbot_1 has joined #openstack-swift
23:16 *** rcernin has joined #openstack-swift
23:19 *** altlogbot_1 has quit IRC
23:29 *** altlogbot_3 has joined #openstack-swift
23:33 *** irclogbot_1 has joined #openstack-swift
