Wednesday, 2020-06-24

01:12 *** rcernin has quit IRC
01:17 *** rcernin has joined #openstack-glance
01:18 *** Liang__ has joined #openstack-glance
01:32 *** Liang__ has quit IRC
01:33 *** Liang__ has joined #openstack-glance
01:47 *** Liang__ has quit IRC
02:34 *** rcernin has quit IRC
02:36 *** rcernin has joined #openstack-glance
03:46 *** rcernin has quit IRC
03:55 *** rcernin has joined #openstack-glance
04:04 *** rcernin has quit IRC
04:05 *** rcernin has joined #openstack-glance
04:33 *** evrardjp has quit IRC
04:33 *** evrardjp has joined #openstack-glance
04:35 *** udesale has joined #openstack-glance
04:44 *** ratailor has joined #openstack-glance
05:23 *** m75abrams has joined #openstack-glance
05:40 *** udesale has quit IRC
06:25 *** udesale has joined #openstack-glance
06:29 *** udesale_ has joined #openstack-glance
06:30 *** udesale has quit IRC
06:51 *** gyee has quit IRC
07:26 <openstackgerrit> Abhishek Kekane proposed openstack/glance master: WIP: Fix race condition in copy image operation  https://review.opendev.org/737596
07:39 *** rcernin has quit IRC
07:47 *** priteau has joined #openstack-glance
07:49 *** belmoreira has joined #openstack-glance
07:57 *** bhagyashris is now known as bhagyashris|lunc
08:08 *** belmoreira has quit IRC
08:08 *** belmoreira has joined #openstack-glance
08:11 *** udesale_ has quit IRC
09:08 *** bhagyashris|lunc is now known as bhagyashris
09:12 *** rcernin has joined #openstack-glance
09:17 *** rcernin has quit IRC
10:03 *** tkajinam has quit IRC
10:05 <openstackgerrit> vinay harsha mitta proposed openstack/glance master: api-ref needs update about 'checksum' image prop  https://review.opendev.org/737735
10:10 *** rcernin has joined #openstack-glance
10:15 *** rcernin has quit IRC
11:27 *** rcernin has joined #openstack-glance
11:34 *** rcernin has quit IRC
11:39 *** rcernin has joined #openstack-glance
11:40 *** wxy has quit IRC
12:08 *** rcernin has quit IRC
12:34 *** rchurch has quit IRC
12:35 *** rcernin has joined #openstack-glance
12:37 *** rchurch has joined #openstack-glance
13:18 *** jawad_axd has quit IRC
13:18 *** jawad_axd has joined #openstack-glance
13:29 *** rcernin has quit IRC
13:56 *** Liang__ has joined #openstack-glance
13:57 *** Liang__ is now known as LiangFang
14:14 <dansmith> abhishekk: correct me if I'm wrong, but this test seems bogus to me: https://github.com/openstack/glance/blob/master/glance/tests/unit/v2/test_images_resource.py#L2947-L2958
14:14 <dansmith> the failure that generates the forbidden result is failure to create a task as owner=None
14:14 <dansmith> which would never happen via the API, right?
14:15 <dansmith> I guess that is why you say "not expected in normal scenarios",
14:15 <abhishekk> yes
14:16 <dansmith> but the test being there makes it look like a user could get Forbidden via the API, but they can't
14:16 <dansmith> even if I change that to owner=TENANT2, it allows the test to complete (without my code)
14:16 <abhishekk> that was added for coverage purposes
14:16 <dansmith> so shall I change that test and note to be used to test my auth check?
14:16 <dansmith> what coverage?
14:16 <abhishekk> Sounds good
14:17 <abhishekk> code coverage
14:17 <dansmith> heh
14:18 <dansmith> I mean, what code does this cover that remains uncovered without it? it's testing something that's not possible via the API, so unless there is some dead code that isn't normally reachable...
14:18 <abhishekk> That's why I added the note there
14:35 *** ratailor has quit IRC
14:37 <openstackgerrit> Dan Smith proposed openstack/glance master: Check authorization before import for image  https://review.opendev.org/737548
14:41 <abhishekk> dansmith, this issue ^ is for the copy-image operation, right?
14:44 <dansmith> abhishekk: import_image(method='copy-to-store') but I imagine it applies to all import_image operations, no?
14:47 <abhishekk> dansmith, could be (but not sure this will happen in real scenarios other than the copy-to-store case)
14:48 <dansmith> why not?
14:48 <abhishekk> like if an image is in queued state and another user uses that image id to import an image, less possibility
14:48 <dansmith> you mean it's not likely for a user to try to do that, but if the image was created with --public then they could _try_ right?
14:49 <abhishekk> hmm
14:49 <dansmith> meaning if I create an image with --public, and then go make coffee,
14:49 <dansmith> some other user could see it and say "hey let me try to import my qcow into that"
14:49 <abhishekk> got it
14:49 <dansmith> without this, they will get a 202 from the import command, and the task will fail on the server side, but otherwise not know why
14:50 <dansmith> note the commit message, where even a different tenant will be allowed to create the task, which is why it's important to change that test to use a _different_ tenant and not _no_ tenant
14:50 <abhishekk> clear :D
14:50 <dansmith> ack :)
14:51 <abhishekk> by the way, I tried a little workaround for the race condition, posted a new PS
14:51 <dansmith> okay will look in a sec
14:52 <abhishekk> no problem, take your time
14:53 <abhishekk> Cleanup of pending tasks on service start needs to be considered as new work
15:07 *** m75abrams has quit IRC
15:27 *** priteau has quit IRC
15:37 *** LiangFang has quit IRC
15:48 *** suryasingh has joined #openstack-glance
15:56 * abhishekk going for dinner, back in 30
15:57 *** gyee has joined #openstack-glance
16:04 *** mordred has joined #openstack-glance
16:05 <mordred> rosmaita: all_stores_must_succeed ... the docs describe both true and false cases but don't indicate what the default is. I'd *guess* default is false, would that be right?
16:06 <rosmaita> mordred: i don't remember!  give me a few minutes to look
16:09 <mordred> rosmaita: thanks!
16:10 <rosmaita> mordred: looks like it defaults to true: https://opendev.org/openstack/glance/src/branch/master/glance/api/v2/images.py#L108
16:11 <mordred> rosmaita: aha! glad I asked
16:14 <rosmaita> yeah, it's kind of counterintuitive -- but the idea is that you requested that the image be in stores X, Y, Z, so if it can't be stored in all 3, the call should fail
16:15 <rosmaita> mordred: i'm going to file a doc bug for that, thanks for pointing it out
16:17 <dansmith> mordred:             all_stores = body.get('all_stores', False)
16:18 <rosmaita> dansmith: that's a different one
16:18 <dansmith> oh all_stores and all_stores_must_succeed I guess? confusing
16:19 <rosmaita> all_stores is shorthand so that you don't have to list all the stores in the call; all_stores_must_succeed refers to all the stores listed in the call
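[Editor's note: the distinction rosmaita describes can be illustrated with two hedged example bodies for the import call; the store names are invented for illustration, and the defaults follow the code link above.]

```python
# Hypothetical request bodies for POST /v2/images/{image_id}/import.
# Store names "ceph-a" and "swift-b" are made up for illustration.

# Import only into the stores listed; fail the whole task if any one
# of them fails (True matches the code default per images.py#L108).
import_specific_stores = {
    "method": {"name": "glance-direct"},
    "stores": ["ceph-a", "swift-b"],
    "all_stores_must_succeed": True,
}

# "all_stores" is the shorthand: import into every configured store,
# so no explicit "stores" list is needed.
import_all_stores = {
    "method": {"name": "glance-direct"},
    "all_stores": True,
    "all_stores_must_succeed": True,
}
```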
16:19 <rosmaita> you can tell this API was designed by developers!
16:33 <dansmith> gotcha
16:38 <mordred> rosmaita: ya - it is a little counterintuitive - although I do think it's the right default behavior
16:38 <mordred> rosmaita: like - it's one of those flags that probably nobody should ever use :)
16:38 <rosmaita> :D
16:39 <dansmith> you mean all_stores_must_succeed should default to true?
16:40 <dansmith> I'm not really arguing that it shouldn't, but it would suck to upload 1TiB over your home cable connection only to find out that one store ran out of disk, and if others had succeeded, at least the image is in the cloud and you can copy-from-store to fix it :)
16:41 <rosmaita> yes, that would be a motivating scenario for learning about the option!
16:42 <mordred> dansmith, rosmaita: https://review.opendev.org/737845 Add support for multiple image stores
16:42 <dansmith> rosmaita: heh, fair enough
16:42 <mordred> dansmith: well - the failure case for all_stores_must_succeed is that the image remains in the staging area
16:43 <mordred> and the image is deleted from any store that it did succeed copying to - but import can be called again without needing to re-upload over your home cable connection
16:43 <rosmaita> mordred: i'll put your patch on the agenda for tomorrow's glance meeting to get some eyes on it
16:43 <mordred> rosmaita: there's one ahead of it in the stack adding image import support - and also one adding the flag to openstackclient
16:44 <mordred> rosmaita: https://review.opendev.org/#/c/737625/
16:44 <rosmaita> ok
16:44 <rosmaita> thanks
16:44 <dansmith> mordred: oh, it does that for all store types?
16:45 <mordred> rosmaita: while I'm bugging you ... with normal upload the content is copied straight to the backing store (like ceph) but with import it goes into a staging area and my understanding is that that's on the controller. is it possible to configure the staging area to also be in ceph - like for controller nodes with limited local disk?
16:46 <mordred> dansmith: yeah.
16:46 <rosmaita> mordred: not yet
16:46 <mordred> rosmaita: oh - interesting - I see another place in the docs where it does indicate False is default
16:46 <rosmaita> that's not good
16:46 <mordred> rosmaita: the text docs of import image indicate it - it's just not on the field docs for all_stores_must_succeed itself
16:50 <abhishekk> default value for all_stores_must_succeed is False
16:50 <rosmaita> abhishekk: https://opendev.org/openstack/glance/src/branch/master/glance/api/v2/images.py#L108
16:51 <abhishekk> oops
16:52 <mordred> honestly - True is likely a more expected value - I'd vote that the docs just get updated
16:52 <mordred> but - you know - I'm really mostly asking so I can document the parameter properly :)
16:52 <rosmaita> mordred: where did you see that in the docs -- I only found (default is True)
16:53 <mordred> rosmaita: https://docs.openstack.org/api-ref/image/v2/index.html?expanded=import-an-image-detail
16:53 <abhishekk> rosmaita, that is in the api ref
16:53 <mordred> rosmaita: "done and the state of the image remains unchanged. When set to False (default), the workflow will fail only if the upload fails on all stores"
16:54 <mordred> but I agree - the code seems to say True
16:55 <rosmaita> mordred: thanks for finding that
16:55 <mordred> \o/
16:55 * mordred is mildly helpful
16:55 <abhishekk> yes, that needs to be corrected
16:56 <dansmith> so, sorry to rewind, but... I'm still kinda stuck on this staging directory thing... when I upload via import, I send the whole image to the staging area, and only then does it start copying to stores? does it copy to stores in parallel?
16:57 <mordred> dansmith: it copies to stores with the import call
16:57 <dansmith> and if the initial upload fails, say because the glance service dies, how does glance know that the thing in the staging area isn't valid if I try to import again?
16:58 <dansmith> mordred: maybe I'm missing the fact that there are multiple API steps because I've just used the client?
16:58 <mordred> dansmith: so, it's POST /images ; PUT /images/{id}/stage ; POST /images/{id}/import
16:59 <dansmith> okay, totes didn't get that from using the client, so ... if all the links are the same speed, twice as long to get the image ready in the new location I guess
16:59 <mordred> dansmith: yah. so in the single-store no-transforms-needed case it's more efficient to PUT to image.data
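[Editor's note: the glance-direct flow mordred lists can be sketched as a (method, path) sequence; this is a simplified illustration of the three steps, not a client implementation — a real client also sends the image bytes with the PUT and a JSON body with the final POST.]

```python
def glance_direct_calls(image_id):
    """Return the (method, path) sequence for a glance-direct import.

    Sketch of the three-step flow described in the discussion above.
    """
    return [
        ("POST", "/v2/images"),                        # create the image record
        ("PUT", "/v2/images/%s/stage" % image_id),     # upload bytes to staging
        ("POST", "/v2/images/%s/import" % image_id),   # kick off the import task
    ]
```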
16:59 <dansmith> this puts the race condition abhishekk and I are working on in perspective
17:00 <dansmith> mordred: yeah I didn't grok that before, but I get it now
17:00 <dansmith> assume there's some hash/size checking to make sure that a previous stage got the image there in one piece before you try to import from staging?
17:01 <mordred> dansmith: no, not to my knowledge - there is no hash/checksum returned from the put to stage
17:02 <dansmith> hrm, okay
17:02 <mordred> *but* - I haven't tried - it's possible the image resource will have an updated checksum after upload to staging that could be checked
17:02 <dansmith> so and, um, the staging dir is per node/worker, so how do you ensure you hit the same worker again when you go to do the import?
17:07 <dansmith> abhishekk: ^ ?
17:08 <abhishekk> dansmith, the staging area is common for all workers on a single node
17:08 <dansmith> abhishekk: on a single node, sure
17:08 <abhishekk> For HA we need to set that staging area on NFS
17:08 <dansmith> but if you have three controller nodes, each running some workers... if I stage to worker1...
17:09 <dansmith> dear god really?
17:09 <abhishekk> that's the limitation at the moment if you use the glance-direct import method
17:09 <abhishekk> if you use the web-download import method then it's not needed
17:09 <dansmith> that seems like a limitation that puts it outside the realm of being useful at all
17:10 <dansmith> oh
17:10 <dansmith> import that takes a URI instead of uploading the image from the client?
17:10 <abhishekk> yes
17:11 <dansmith> okay
17:11 <dansmith> that significantly complicates getting the image uploaded though
17:11 <abhishekk> in the case of the web-download import method there will be only two API calls
17:11 <dansmith> okay but presumably glance still stores it in its staging directory after it pulls the image?
17:12 <abhishekk> yes and in this case /import does the work of staging and then importing to the store
17:26 <dansmith> abhishekk: when using copy-to-store, I assume glance is downloading the image to the staging area and then pushing it back to the new store, right?
17:26 <abhishekk> yes
17:27 <dansmith> okay and if that fails because I kill -9 the glance worker, leaving residue in the staging area, will a subsequent copy-to-store on that worker try to upload the partial image to the new store without re-fetching it?
17:29 <abhishekk> it will
17:30 <dansmith> okay shall I file a bug for that as well?
17:31 <abhishekk> I guess yes
17:35 <dansmith> https://bugs.launchpad.net/glance/+bug/1885003
17:35 <openstack> Launchpad bug 1885003 in Glance "Interrupted copy-to-store may corrupt a subsequent operation" [Undecided,New]
17:36 <abhishekk> ack
17:39 <abhishekk> dansmith, ran this scenario locally
17:40 <abhishekk> the worker does try to copy the image to the store from the staging area but it fails while importing and the operation gets reverted, the log message below confirms it
17:40 <abhishekk> Jun 24 17:38:26 vicoria glance-api[27935]: ERROR glance.location [-] size of uploaded data is different from current value set on the image.
17:41 <dansmith> abhishekk: does it delete the residue?
17:41 <abhishekk> yes
17:42 <dansmith> okay, so the bug is less corruption then and more just "it makes the next operation fail"
17:42 <dansmith> I'll alter the title
17:42 <abhishekk> dansmith, wait
17:42 * dansmith waits :)
17:43 <dansmith> surely something could just check the size of the stage residue against image.size and delete/restart instead of making the user have to do that right?
17:44 <abhishekk> it does not delete the actual data from the store if all_stores_must_succeed is False
abhishekk1. I start copy operation for image which size is 4 gb17:45
abhishekk2. After copying 1gb data to staging service goes down17:46
abhishekk3. I restart the service and initiates the copy to store B17:46
abhishekk(Note image is in store A)17:46
abhishekkwhile copying all_stores_must_succeed is True (step 3)17:47
abhishekk4. It sees some data in staging and without checking actual size it goes to import that data in store B17:47
abhishekk5. Now after import is complete it checks for the existing size and current size and marked the operation as failure17:48
*** priteau has joined #openstack-glance17:48
abhishekksorry all_stores_must_succeed is False (step 3)17:49
abhishekkNow after step 5 raises exception and all_stores_must_succeed is False it Ignores the failure and does not initiates the revert operation17:49
abhishekkso even if image locations does not get updated, the partial data in store B does not get cleaned17:50
dansmithokay what about the residue in the staging dir?17:50
abhishekkit gets deleted17:52
dansmithokay, and what if all_stores_must_succeed=True ?17:53
abhishekkin this case data in staging remains as it is17:54
abhishekkbut it should delete it from stores17:54
dansmithjust the failed store or ALL stores?17:54
abhishekkfailed store17:55
dansmithwhew, okay :)17:55
dansmithso in the =False case, the user gets slapped with an immediate fail, but can retry and it will work because the residue from the stage is cleared, but we might leak storage on the backend, right?17:55
dansmithand on =True case, we leave the stage but delete the backend partial.. what happens if the user tries again? fail again?17:56
abhishekkdansmith, for false yes we might leak storage on the backend17:57
abhishekkand on True if user tries again with True result will be same17:58
dansmithokay, IMHO, both of those cases need to be fixed. leaking storage is definitely not okay, and in both cases a pre-check of the size in the staging area would prevent the user from having to retry (potentially with no improvement) and prevent copying some data only to fail and waste the time, network, etc... agree?17:59
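[Editor's note: the pre-check dansmith proposes might look roughly like this — a sketch, not Glance code; the function name and the delete-on-mismatch behavior are assumptions. The idea is to reuse staged data only when it matches the image's known size, and otherwise discard the residue so the copy restarts cleanly instead of importing a partial image.]

```python
import os

def staging_residue_usable(staging_path, expected_size):
    """Hypothetical pre-check before a copy-to-store operation.

    Returns True if the staged file exists and matches the image's
    known size; otherwise deletes any partial residue (left by an
    interrupted copy) and returns False so the copy re-fetches.
    """
    if not os.path.exists(staging_path):
        return False
    if os.path.getsize(staging_path) == expected_size:
        return True
    os.unlink(staging_path)  # partial residue from an interrupted copy
    return False
```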
18:01 <abhishekk> dansmith, agree, this will work gem and no chance of leaking
18:01 <dansmith> "work gem" ?
18:02 <abhishekk> this is a good way to solve this leaking issue
18:02 <dansmith> ++
18:02 <dansmith> I shall update the bug and then you can see if I got it right :)
18:02 <abhishekk> will put up a patch
18:03 <abhishekk> dansmith, ack
18:04 <abhishekk> so, this will not occur for the import case
18:04 <abhishekk> only in the case of the copy-image (copy-to-store) operation
18:07 <dansmith> right, I figured
18:07 <dansmith> bug updated
18:07 <abhishekk> looking
18:08 <zzzeek> dansmith: hey did you try to ping me here, just noticed my irc was closed
18:08 <abhishekk> dansmith, in glance there is no concept of copy-to-store, it's copy-image
18:08 <dansmith> zzzeek: I didn't because you were marked as away
18:08 <zzzeek> dansmith: ah very polite :)
18:09 <zzzeek> dansmith: the basic idea is do that UPDATE and then get the rowcount back for the number of rows affected
18:09 <dansmith> abhishekk: ah sorry
18:09 <zzzeek> if it's zero, you missed
18:09 <zzzeek> does that make sense ?
18:09 <dansmith> zzzeek: meaning my example SQL update query you mean right?
18:09 <zzzeek> yes
18:09 <abhishekk> dansmith, no problem
18:10 <zzzeek> I would skip the ORM and just run result = connection.execute(table.update()...)
18:10 <zzzeek> then result.rowcount
18:10 <dansmith> zzzeek: yeah, okay that's what I thought.. I don't really know how to translate that into ORM stuff, but maybe I can work on that a bit and get you to validate that what I'm doing is legit?
18:10 <dansmith> zzzeek: suhweet, that's what I'd rather do
18:10 <zzzeek> dansmith: yeah I would just use a core statement
18:10 <dansmith> zzzeek: a dedicated claim() operation and just do it that way
18:10 <zzzeek> dansmith: just for the claim part at least.  then once you have it you can do whatever ORM things
18:10 <dansmith> yep
18:11 <zzzeek> dansmith: also if you have an ORM session, you can get the connection for the transaction as conn = session.connection()
18:12 <dansmith> zzzeek: don't we want to do that update thing outside a transaction where we have already read some of the data we'll be examining?
18:12 <zzzeek> dansmith: hmmmm yes if you are on repeatable read then I guess so
18:12 <dansmith> I thought if we did it with other stuff, we'd be operating on repeated read data... yeah that
18:12 <zzzeek> might not matter
18:12 <zzzeek> the other transactions would block
18:12 <dansmith> so I figured we'll just do the claim as a single shot thing, which is fine because the real work happens later in an async task
18:13 <zzzeek> sure I guess commit is safer b/c then it goes out and gets past the locks but also does the galera writeset thing
18:14 <zzzeek> it's just when you do work subsequent to that, you want to have a new transaction
18:14 <zzzeek> so if this is the very first thing you're doing it should be OK
18:15 <dansmith> ack okay
18:15 <zzzeek> assuming this table is also involved in the subsequent work
18:15 <dansmith> zzzeek: I need to look at their stuff more to be able to answer that question, but .. probably
18:15 <zzzeek> it's just cleaner to have two serial transactions rather than one ongoing and another that pops out in the middle
18:16 <abhishekk> dansmith, this solution will also solve our race?
18:16 <dansmith> so abhishekk I will try to work on this claim thing, zzzeek can check my work and/or make fun of it if it's wrong and then we can use it to solve your race
18:16 <dansmith> zzzeek: yep, two is what I'm going for
18:16 <abhishekk> dansmith, ack :D
18:16 <abhishekk> dansmith, back to our staging clearing solution
18:16 <zzzeek> when transactions overlap that's how you can get deadlocks so keeping them serial is simpler
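[Editor's note: the claim zzzeek describes — a single UPDATE whose rowcount tells you whether you won the race — can be demonstrated with stdlib sqlite3. With SQLAlchemy core it would be `result = connection.execute(table.update().where(...).values(...))` followed by `result.rowcount`; the tasks schema below is invented for illustration.]

```python
import sqlite3

# Hypothetical single-table schema just to demonstrate the pattern;
# Glance's real tables look different.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO tasks (id, owner) VALUES (1, NULL)")

def claim_task(conn, task_id, worker):
    # Compare-and-swap: the WHERE clause only matches an unowned row,
    # so exactly one concurrent caller can win the claim.
    cur = conn.execute(
        "UPDATE tasks SET owner = ? WHERE id = ? AND owner IS NULL",
        (worker, task_id),
    )
    conn.commit()  # end this transaction so subsequent work starts fresh
    return cur.rowcount == 1  # 0 rows updated means another worker won
```

Keeping the claim in its own short transaction, as discussed above, avoids overlapping transactions and the deadlocks they can cause.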
18:16 *** priteau has quit IRC
18:17 <dansmith> zzzeek: ++
18:17 <zzzeek> ok i will copy this to the gerrit as a record
18:17 <dansmith> zzzeek: thanks
18:17 <abhishekk> dansmith, sorry, forget it, I am mixing too many things
18:17 <abhishekk> zzzeek, is Mike?
18:17 <dansmith> yes :)
18:17 <zzzeek> yes
18:17 <abhishekk> :D
18:18 <abhishekk> thanks zzzeek
18:18 <zzzeek> ok the formatting is disastrous in my comment
18:18 <dansmith> zzzeek: I can get you an eavesdrop link
18:19 <dansmith> http://eavesdrop.openstack.org/irclogs/%23openstack-glance/%23openstack-glance.2020-06-24.log.html#t2020-06-24T18:08:36
18:19 <dansmith> you can just mic-drop that if you want
18:19 <dansmith> "the zzzeek hath spoken"
18:22 <dansmith> abhishekk: so I need to step away for a bit and I know it's late there, I'll try to work on that a bit later and we can sync up in the morning okay?
18:22 <abhishekk> dansmith, sounds good
18:23 <abhishekk> morning means what time in UTC?
18:30 <dansmith> abhishekk: about 1330 UTC
18:30 <dansmith> that is 0630 local time for me
18:30 <abhishekk> dansmith, I will be around at that time
18:30 <dansmith> cool
18:31 <abhishekk> thank you, have a nice time ahead
18:31 <dansmith> :) o/
18:33 *** belmoreira has quit IRC
18:34 <abhishekk> o/
18:34 *** irclogbot_0 has quit IRC
18:37 *** irclogbot_2 has joined #openstack-glance
20:06 <openstackgerrit> Abhishek Kekane proposed openstack/glance master: Fix: Interrupted copy-imgae leaking data on subsequent operation  https://review.opendev.org/737867
20:07 <abhishekk> dansmith, ^^
20:07 <dansmith> abhishekk: -1
20:07 <dansmith> abhishekk: but cool, will review for real in a bit :)
20:07 <abhishekk> jokke, ^^
20:07 <dansmith> about to push up the atomic update patch
20:07 <abhishekk> why -1?
20:08 <dansmith> spelling! :D
20:08 <abhishekk> ohhh
20:08 <abhishekk> I am going offline now, will fix your comments tomorrow
20:08 <abhishekk> thank you again
20:08 <openstackgerrit> Dan Smith proposed openstack/glance master: Add image_set_property_atomic() helper  https://review.opendev.org/737868
20:08 <dansmith> abhishekk: ^
20:08 <dansmith> abhishekk: go to sleep :)
20:08 <abhishekk> looking
20:09 <dansmith> don't look, go to sleep :)
20:09 <abhishekk> will have better sleep then :D
20:12 <abhishekk> dansmith, some print statements are there in api.py
20:12 <dansmith> ah crap
20:13 <openstackgerrit> Dan Smith proposed openstack/glance master: Add image_set_property_atomic() helper  https://review.opendev.org/737868
20:13 <abhishekk> cool, see you tomorrow
20:14 <abhishekk> good night
20:14 <dansmith> o/
20:14 <abhishekk> o/~
20:21 *** belmoreira has joined #openstack-glance
20:26 *** belmoreira has quit IRC
20:28 *** priteau has joined #openstack-glance
20:39 *** priteau has quit IRC
20:52 <dansmith> mordred: https://zuul.opendev.org/t/openstack/build/307319ffe4b845f4b58cb78bd32f1911
20:52 <dansmith> mordred: same config worked with glanceclient, so I think that means you're checking something they're not
20:55 <mordred> dansmith: cool! that's great - it means the machinery all worked, but there's a bug in the detection of import capability
20:56 <mordred> dansmith: I'll poke and see what that bug is, then we should be good to go
20:56 <dansmith> thanks
20:56 <dansmith> you can recheck that patch to get the full stack test, but it won't actually pass for other reasons
20:56 <dansmith> but you can at least see it not get stuck at import
20:57 <mordred> dansmith: it'll be https://review.opendev.org/#/c/737608/4/openstack/image/v2/_proxy.py@224 and https://review.opendev.org/#/c/737608/4/openstack/image/v2/image.py@234
20:57 <mordred> the api docs say that header is returned from the image creation if the server supports import
21:03 <dansmith> hmm
21:53 <mordred> dansmith: fixing a different thing first - then i'll figure out what it thinks it's doing and what it should be doing
22:04 <dansmith> aight
22:42 *** rcernin has joined #openstack-glance
22:51 *** tkajinam has joined #openstack-glance

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!