Thursday, 2016-01-14

*** dshulyak_ has joined #openstack-solar07:27
*** salmon_ has joined #openstack-solar08:12
pigmejHello08:37
dshulyak_hi09:06
salmon_dshulyak_: pigmej https://github.com/Mirantis/solar-resources/pull/209:28
pigmejwtf is 'echo 1' ?09:29
salmon_:)09:30
salmon_It can be empty if you want :)09:30
salmon_pigmej: updated09:35
pigmejsounds good to me09:39
pigmejdshulyak_: thanks for +1 :) I merged :)09:40
salmon_https://review.openstack.org/#/c/267193/ anyone? :)09:55
pigmejI'm checking it right now09:56
pigmejwhat is our merge policy right now?09:58
salmon_?09:59
pigmejShould I give you +2 and +1 Workflow there or we stick with +2 from one and +1 workflow from someone else ?10:00
salmon_I think 2 reviews are better than one10:01
pigmejyeah but speed--10:01
salmon_quality++10:01
pigmej~10:02
dshulyak_https://github.com/openstack/fuel-plugin-contrail10:08
pigmejdshulyak_: https://review.openstack.org/#/c/260082/10:08
pigmejrandom review from this project :D10:09
salmon_noop? Is it just Null test?10:09
pigmejwell the noop is wtf, but there is one extra gate10:09
pigmejfuel-plugin.contrail.build10:09
salmon_I think noop is a way to go10:10
dshulyak_one more - https://review.openstack.org/#/c/265780/10:11
dshulyak_only noop tests10:11
pigmejok this one is cool10:12
pigmejok, then I will try to create a similar repo to it10:12
pigmejobviously if someone knows how to do it properly... then feel free to take it;D10:14
salmon_dshulyak_: https://review.openstack.org/#/c/267193/ :)10:23
pigmejsalmon_: https://review.openstack.org/#/c/266255/ :P10:24
dshulyak_done10:25
salmon_thx10:25
openstackgerritMerged openstack/solar: Allow to modify computable inputs in Composer files  https://review.openstack.org/26719310:28
pigmejhttps://review.openstack.org/267453 :)10:41
*** tzn has joined #openstack-solar10:54
tzn@salmon_ how is CI work going?10:59
salmon_tzn: we agreed with Sasha to do a test job where you can set which tests to run. I'm working on the job definition now11:00
tznok, cool11:01
tznany estimate?11:01
tzn@pigmej do we have solar-resources on review?11:03
tznI mean - repo creation?11:03
pigmejhttps://review.openstack.org/26745311:04
salmon_tzn: they are very very very busy, so no. I will prepare a review today but they need to check it11:04
dshulyak_should we split the worker that runs tasks and the scheduler? i split them, but i'm not sure if thats better11:11
dshulyak_because there will always be at least 2 processes11:11
pigmejdshulyak_: we should11:12
pigmejbecause then we will be able to create that "small" worker11:12
pigmejisn't it?11:13
dshulyak_i dont see how splitting them will help in creating the small worker11:14
dshulyak_ok, it will be better if i share my work and then we talk11:14
dshulyak_tomorrow probably11:14
pigmejyeah probably :)11:14
pigmejk, because talking about something without code is... tricky :)11:14
pigmejhmm, guys do we require gevent now?11:52
openstackgerritJedrzej Nowak proposed openstack/solar: Set ansible<2.0 in requirements (removed callbacks)  https://review.openstack.org/26750012:07
openstackgerritJedrzej Nowak proposed openstack/solar: Conditional imports in locking (riak or peewee)  https://review.openstack.org/26750312:18
pigmejsalmon_: `pip install solar` *almost* works12:19
salmon_yupi12:20
pigmejyou just need to have these 2 patches ;D12:21
pigmejand obviously we need some password based examples12:22
pigmejbut that's other story :)12:22
pigmejupdated12:33
openstackgerritJedrzej Nowak proposed openstack/solar: Set ansible<2.0 in requirements  https://review.openstack.org/26750012:37
pigmejsalmon_: message extended :)12:37
salmon_+2ed12:39
pigmejOk, I was able to use solar without vagrant env :)12:39
pigmejdshulyak_ tzn  salmon_ :)12:39
pigmej"archivement unlocked"12:40
pigmejsalmon_: it will also simplify fuel-devops (and it will add more speed to it, no docker magic required)12:40
openstackgerritMerged openstack/solar: Set ansible<2.0 in requirements  https://review.openstack.org/26750012:48
tznpigmej: +112:50
pigmejtzn: is there any bot that reports launchpad bugs etc ?12:51
tznfrom IRC?12:51
tznor to IRC?12:51
pigmejno, posts bug changes TO irc12:51
tznyes, there are plenty12:51
tznbut I need some time to configure them12:52
tznI will talk to devops guys12:52
pigmejk12:52
salmon_https://blueprints.launchpad.net/solar/+spec/cleanup-solar-resources for next release :)12:53
pigmejsalmon_: we should also undo versions probably12:54
salmon_why undo?12:54
pigmejor maybe even now... because some resources are marked as 1.012:54
pigmejwhich is ekhm...12:54
salmon_all are marked as 1.0.012:54
pigmejnot all :D12:55
pigmejthere are some 0.0.112:55
pigmejsalmon_: you should also add that it's mostly about Openstack resources12:56
pigmejbecause, transport, ro_node etc are fine12:56
salmon_this is why I created this bp :)12:57
tzncan you guys mark them as 0.1.012:59
pigmejall versions ?13:04
salmon_dshulyak_: pigmej https://bpaste.net/show/f2b61ca779eb13:04
salmon_hosts example :(13:04
salmon_hosts_file2.run -> INPROGRESS13:04
salmon_hosts_file1.run -> SUCCESS13:04
pigmejsalmon_: hmm13:04
salmon_it hung13:05
pigmejyeah because it crashed13:05
pigmejdshulyak_: is it desired behaviour ?13:06
salmon_ah, yes https://bpaste.net/show/57bd70255dfb13:06
salmon_do we need retries here ?13:06
pigmejsalmon_: wait, what have you done?13:07
salmon_pigmej: hosts example13:07
pigmejbut how did you make object in conflict ?13:08
salmon_I just run the example... :P13:08
salmon_via fuel-devops13:08
salmon_clean env13:08
pigmejhmm13:09
pigmejyou crashed history13:11
salmon_I did nothing!13:11
salmon_dshulyak_: pigmej I reproduced it again. Just run hosts example13:20
pigmejdshulyak_: then it means that sadly riak lock is broken13:21
dshulyak_error from bpaste is not related to lock13:22
pigmejthe first is13:22
pigmejthe second is probably side effect13:22
salmon_full log https://bpaste.net/show/803b18d0e29f13:22
pigmejthe thing is, it works for me ;(13:23
pigmejsalmon_: can you wipe riak container and try again?13:23
salmon_wipe?13:23
pigmejah you spawn always on fresh env ?13:24
salmon_yup13:24
pigmejsalmon_: can you print siblings data there?13:24
salmon_command?13:25
pigmejdshulyak_: maybe the reason is that counter ?13:25
pigmejbut hmm, you should reach resolver first...13:26
salmon_in the meantime  you can +1 https://review.openstack.org/#/c/26755813:29
pigmejdshulyak_: you started to debug it ?13:31
openstackgerritLukasz Oles proposed openstack/solar: Include ansible config when syncing repo  https://review.openstack.org/26756213:32
dshulyak_not yet13:32
pigmejok, riak lock is broken13:35
pigmejI'm able to crash it13:35
dshulyak_how?13:36
pigmejuse gevent worker13:36
pigmejand create 10 hosts file example13:36
pigmejthen I switched to "ensemble" and it seems working13:37
pigmejon sqlite it seems to be ok too13:37
pigmejdshulyak_: with single riak and n_val=1, I now have a broken history13:37
pigmejhttps://bpaste.net/show/854fc91bbba913:37
pigmejyup salmon_ I can reproduce13:39
pigmejthough I needed more hosts13:39
pigmejbut it's weird, because it looks like some things were done twice13:40
pigmejsalmon_: can you do solar o report last13:40
pigmej?13:40
dshulyak_i see, i think my collision resolution doesnt work properly13:41
pigmejyeah something is wrong there13:41
pigmejdshulyak_: ['{"status": "PENDING", "task_type": "solar_resource", "target": "314b40de7e918d2897b6b84fbe8b9baa", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node2.run", "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node1.run"], "execution": "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39", "errmsg": "", "name": "hosts_file1.run"}', '{"status": "PENDING", "task_t13:41
pigmej"solar_resource", "target": "314b40de7e918d2897b6b84fbe8b9baa", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node2.run", "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node1.run"], "execution": "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39", "errmsg": "", "name": "hosts_file1.run"}']13:41
dshulyak_after except SiblingsError:13:41
pigmejoh crap13:41
pigmejI have conflict where we have 2 different parents13:41
pigmejor they are even the same...13:41
dshulyak_do you have similar line in your log - Race condition for lock with UID system_log:7f9e1785-c075-4fcb-a5c5-1a9e5d092fc8, among [u'140691354006064', u'140691354009424'] ?13:42
pigmejvagrant@solar-dev:~$ grep -i 'race condition' /var/run/celery/celery1.log13:43
pigmejvagrant@solar-dev:~$13:43
pigmejdshulyak_: these items are identical for me. so for me it looks like something started twice13:43
pigmejsalmon_: can you check it too please ?13:43
dshulyak_for example here it is not related to lock - https://bpaste.net/show/854fc91bbba913:44
*** tzn has quit IRC13:45
pigmejdshulyak_: well, I disagree probably13:45
pigmejI have 2 identical childs13:45
dshulyak_childs of what?13:46
pigmejsiblings13:46
dshulyak_which Model ?13:46
pigmejhttps://bpaste.net/show/9d110bac7b7213:46
pigmejdshulyak_: history13:46
pigmejor whatever we call it13:46
pigmejdshulyak_: I never saw these errors before13:51
dshulyak_so how did you reproduce it? just run hosts_file 10 times?13:54
pigmejyeah try to do so,13:54
pigmejnow it crashed on standard example even13:54
pigmejjust like salmon_ did13:55
pigmejdshulyak_: now I have exactly the same as salmon_ had, + race condition in logs13:57
dshulyak_pigmej: can you try with save(force=True) on L104 in locking ?13:58
pigmejnothing will change14:00
pigmejisn't it ?14:00
pigmejah14:00
pigmejno, we raise error when nothing changes14:00
pigmejhmm14:01
pigmejdshulyak_: but look14:01
pigmejthere is "fuckup"14:02
pigmej2 siblings, A , B14:02
pigmejB checks and notices A, B siblings, in the same time A checks and notices A, B14:02
dshulyak_i think it is the other way here14:03
pigmejso, then you have this for loop, which will remove "me" from conflicts, right ?14:03
pigmejso B removes B and A removes A14:03
dshulyak_not like this14:03
pigmejand they save object with one sibling, but still conflicting14:03
dshulyak_they are not saving it :)14:03
dshulyak_there is no save(force=True)14:03
dshulyak_and i dont think that both A and B see a race14:03
dshulyak_only B14:03
dshulyak_and we can see it in log14:03
pigmejwell it would crash without force :)14:03
dshulyak_please try with force, i cant reproduce it14:04
pigmejbut force changes nothing there....14:04
dshulyak_it changes14:04
dshulyak_it will be saved actually14:05
pigmejhttps://github.com/openstack/solar/blob/master/solar/dblayer/model.py#L922 if no force is given then there would be an exception, isn't it ?14:05
dshulyak_ok, can you please try :) ?14:06
pigmejyeah14:06
pigmejdoing14:06
dshulyak_there is clearly no loop in logs14:06
pigmejthe same14:07
dshulyak_https://bpaste.net/show/803b18d0e29f14:07
pigmejor even worse, because for me history conflicted now14:07
pigmejhttps://bpaste.net/show/ef135161332b14:07
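
A rough sketch of the resolution pattern described above - two writers each read siblings {A, B}, each drops its own identity from the conflict set, and each may end up saving a "resolved" value that still conflicts. All names are hypothetical illustrations, not the actual solar.dblayer.locking code:

    # Hypothetical sketch of the lock-conflict resolution being discussed;
    # not solar's actual code.
    class RaceDetected(Exception):
        pass

    def resolve_lock_conflict(lock_obj, my_identity):
        # drop "me" from the conflicting siblings, keep only other writers
        others = [s for s in lock_obj.siblings if s.owner != my_identity]
        if others:
            # someone else also wrote the lock -> back off and retry later
            raise RaceDetected(others)
        # only my own write is left -> claim the lock; without force=True the
        # save would raise, because nothing in the object's data has changed
        lock_obj.save(force=True)

    # The suspected failure: A and B both read siblings {A, B}; A drops A and
    # B drops B, so each can save a "resolved" value that still conflicts with
    # the other's, leaving identical duplicated siblings behind.
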
pigmejhmm, dshulyak_ I have question,14:08
pigmejhmm, nvm14:08
pigmejbut I think that both workers think that they are the only one14:08
pigmejonly one notices race condition though14:09
dshulyak_it looks to me that A acquired it, B sees this race and starts to wait, but when A released lock - B doesnt see it14:10
dshulyak_what is that ['{"count": -9}', '{"count": -9}'] ?14:10
pigmejsiblings content14:10
dshulyak_of what?14:10
pigmejof conflicted object14:10
dshulyak_COunter ?14:11
pigmejI added print just before raise14:11
pigmejyup14:11
pigmejdshulyak_:14:12
pigmejhttps://bpaste.net/show/c870ad64296214:12
dshulyak_for Lock thats normal, the problem is that B always thinks that the lock is acquired by A14:13
pigmejhow does A release it ?14:14
dshulyak_delete value in database14:14
pigmejah14:14
dshulyak_record14:14
pigmejso if action is fast there will be conflict14:15
pigmejmaybe that's the case14:15
pigmejbecause A will delete, but B will overwrite14:15
pigmejI have an idea how to improve it14:15
pigmeja crdt-like structure14:15
pigmeja tuple with + or - identity14:15
pigmejthen in conflict resolution we can easily figure out wtf, and in the lock too14:16
dshulyak_overwrite?14:16
dshulyak_release is here - [2016-01-14 15:06:59,173: DEBUG/MainProcess] Release lock system_log:581d20df-48ff-454f-8f67-0cdb920447b7 with 14035997903112014:16
dshulyak_but then in B14:16
dshulyak_Found lock with UID system_log:581d20df-48ff-454f-8f67-0cdb920447b7, owned by 140359979031120, owner False14:16
pigmejdshulyak_: and it's 30ms later than B saves object14:16
dshulyak_nope14:17
pigmejit *could* be reordered in riak14:17
pigmej[2016-01-14 15:06:59,120: DEBUG/MainProcess] Race condition for lock with UID system_log:581d20df-48ff-454f-8f67-0cdb920447b7, among [u'140359979031120', u'140359979031920']14:17
pigmejthis is from B, right?14:17
pigmej[2016-01-14 15:06:59,173: DEBUG/MainProcess] Release lock system_log:581d20df-48ff-454f-8f67-0cdb920447b7 with 14035997903112014:17
pigmejand this is from A14:17
dshulyak_ah, so it is possible that B saves object that was removed14:19
pigmejyeah that's what I'm talking about14:19
pigmejA removes lock, then B saves lock with A inside "because it was like that"14:19
pigmejdshulyak_: I can try to fix it with crdt like thingy14:20
pigmejthen no delete, and we should be fine14:20
pigmejworks for you dshulyak_ ?14:21
pigmejI mean can I ? :)14:21
dshulyak_sure)14:21
pigmejwith crdt like thingy, we will be safe, we may have slightly longer latency though14:21
pigmejbut we should be fine14:21
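
A rough sketch of the proposed crdt-like lock: an append-only set of acquire (+1) / release (-1) markers instead of deleting the record, so siblings always merge by union and a release cannot be lost. Names are made up; this is only the idea as described above, not an implementation:

    # Rough sketch of the "crdt-like" lock value: an append-only set of
    # (identity, +1/-1) markers; releases never delete the record.
    def merge_siblings(siblings):
        merged = set()
        for markers in siblings:      # each sibling is an iterable of markers
            merged.update(markers)
        return merged

    def current_holder(markers):
        balance = {}
        for identity, delta in markers:
            balance[identity] = balance.get(identity, 0) + delta
        holders = [ident for ident, total in balance.items() if total > 0]
        return holders[0] if holders else None   # None -> lock is free

    # acquire: add (my_identity, +1); release: add (my_identity, -1).
    # A stale concurrent write can only add markers, and the union above
    # resolves siblings deterministically, so a release is never overwritten.
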
pigmejI wonder why it worked for me before....14:22
pigmejsalmon_: good finding :)14:23
pigmejbrb, I have to prepare chicken for lunch :(14:23
dshulyak_hm, but if object was deleted there should be a sibling w/o data14:24
dshulyak_pigmej: we should be able to see that collision, on second write, yes?14:33
*** tzn has joined #openstack-solar14:35
pigmejdshulyak_: BUT it was there before A deleted it14:37
pigmejA & B reads, A saves, A deletes, B saves14:37
dshulyak_yes, but on B saves - there should be a tombstone from A14:37
pigmejwhich was resolved by our conflict resolution14:39
pigmej:)14:39
dshulyak_yes seems so, but i cant reproduce :) i guess i have too slow environment for this14:40
pigmejgood that we have different cpus14:41
*** dshulyak_ has quit IRC15:00
salmon_re15:14
salmon_pigmej: how can I help?15:14
pigmejI'm improving the lock15:15
pigmejswitch to sqlite :P15:15
pigmejbrb15:25
pigmejlunch15:25
salmon_pigmej: yup, with sqlite it's ok, but seems to be slower16:01
pigmejyeah sqlite is sometimes a bit slower than riak16:01
*** tzn has quit IRC16:39
pigmejok new lock seems to be working...16:48
pigmejI like when my room is full of paperwork :D16:48
*** dshulyak_ has joined #openstack-solar16:52
pigmejsalmon_: https://review.openstack.org/#/c/266255/17:06
pigmejplease review this17:06
*** tzn has joined #openstack-solar17:06
*** tzn has quit IRC17:08
salmon_pigmej: ok17:08
salmon_pigmej: in the meantime, new error: https://bpaste.net/show/89ef23df900f ;)17:08
pigmejnot to me :P17:08
salmon_dshulyak_: ^ ;)17:09
dshulyak_salmon_: do you see any errors in celery.log?17:14
pigmejdshulyak_: I'm constantly getting conflicts on the Counter object17:17
pigmejno matter what I will do, I'm getting conflicts there17:17
pigmejalways on [2016-01-14 18:16:54,544: WARNING/MainProcess] ['{"count": -9}', '{"count": -9}']17:17
pigmejare you sure that everything is correct in that manner?17:17
salmon_dshulyak_: I deleted env already, recreating now17:18
openstackgerritMerged openstack/solar: Use stevedore for handlers  https://review.openstack.org/26625517:18
dshulyak_pigmej: well, it might be that gevent affected counter somehow, because that part wasnt concurrent previously17:20
pigmejyeah it certainly is broken now17:22
dshulyak_looks like we need same logic for counter as for the lock, either resolve SiblingsError or use ensemble17:23
*** tzn has joined #openstack-solar17:25
pigmejdshulyak_: I described it already, you can't do counter in the same way17:26
dshulyak_pigmej: hm, why?17:27
pigmej1) with ensemble "it will work", with normal riak not at all17:27
dshulyak_isnt it just a matter of retry on error?17:27
dshulyak_on SiblingsError17:27
pigmejnope, why would it be ?17:27
pigmejehs, you will be able to save the same object twice17:28
pigmejand neither of them could see the conflict17:28
pigmejI just checked and it seems that sadly I was right at the very beginning: it's perfectly possible to save the same object twice, and neither write notices "siblings"17:28
dshulyak_but with n_val we will always see it17:28
pigmejit seems not17:29
dshulyak_n_val=117:29
dshulyak_are u sure?17:29
pigmejNo, I need test more17:29
pigmej:)17:29
pigmejso there is still chance that we're not totally f** :)17:29
dshulyak_for me 2nd write is always able to see a conflict17:30
pigmejfor you lock also worked :)17:31
dshulyak_well, it still works :)17:31
pigmejyeah...17:31
pigmejthat's why I want to run long tests to verify that :)17:31
pigmejI mean that write behaviour ;)17:31
pigmejif yes, then we need similar logic as for locks and we're safe17:32
pigmejit will still be against all known good practices though :D17:32
dshulyak_u said once that we cannot use this types - http://docs.basho.com/riak/latest/dev/using/data-types/#Counters ?17:33
pigmejyup17:33
pigmejit's CRDT type17:33
dshulyak_increment operation looks much better17:33
pigmejit's floating counter17:33
dshulyak_but maybe with n_val=1 :)17:33
pigmejit still doesn't guarantee you that you will not see the same number twice17:33
pigmejyeah, there is a chance IF n_val works as we want17:34
pigmejthen ... maybe :)17:34
pigmejthough it's CRDT...17:34
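
For reference, the Riak data type linked above exposes an increment operation instead of read-modify-write. A minimal sketch with the Python riak client, assuming a bucket type named "counters" (datatype=counter) has already been created and activated on the cluster; whether its eventually consistent value is unique enough for this use case is exactly the open question here:

    # Sketch: Riak's CRDT counter via the Python client (assumes an active
    # "counters" bucket type with datatype=counter).
    import riak

    client = riak.RiakClient(protocol='pbc', host='localhost', pb_port=8087)
    bucket = client.bucket_type('counters').bucket('solar_counters')

    counter = bucket.new('task_counter')
    counter.increment(1)       # merged server-side, no read-modify-write
    counter.store()

    counter = bucket.get('task_counter')
    print(counter.value)       # eventually consistent: two readers may still
                               # observe the same number, which is the concern
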
pigmejI just found one case where lock is not working correctly ;/17:35
dshulyak_when?17:39
dshulyak_or where?)17:39
pigmejin my implementation :D17:39
pigmejbut I was unlucky17:40
pigmejI hit the same `identity` after restart :D17:40
pigmejhmm dshulyak_ how can I start celery in the foreground now?17:43
dshulyak_celery worker -A ….17:43
pigmejthx17:44
dshulyak_maybe w/o pidfile17:44
pigmejyeah and log :D17:44
pigmejehs17:49
pigmejsiblings17:49
pigmej['{"status": "PENDING", "task_type": "solar_resource", "target": "6053ea6868fb026b81af3637d4ec79e2", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node2.run", "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node1.run"], "execution": "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035", "errmsg": "", "name": "hosts_file1.run"}', '{"status": "PENDING", "task_type":17:49
pigmej"solar_resource", "target": "6053ea6868fb026b81af3637d4ec79e2", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node2.run", "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node1.run"], "execution": "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035", "errmsg": "", "name": "hosts_file1.run"}']17:49
pigmejand another one dshulyak_17:53
pigmejhttps://bpaste.net/show/77271b6156b617:53
pigmejthis one is bad, because it says SUCCESS vs INPROGRESS17:53
pigmejI can obviously write conflict resolution functions for it, BUT it should not happen anyway17:54
pigmejthis conflict probably says it all about n_val=117:54
pigmejI'm even able to get [2016-01-14 18:56:50,243: WARNING/MainProcess] ['{"count": -1}', '{"count": -1}']17:57
dshulyak_counter isnt protected by lock17:57
dshulyak_thats a separate thing17:57
pigmejI know17:57
dshulyak_but those two should be17:57
pigmejwhat should be ?17:57
pigmejit's the same identity process, isn't it?17:57
dshulyak_identity process?17:58
dshulyak_what do u mean?17:58
pigmejthat identity that you use to "check unique" thingy17:58
dshulyak_i mean that those 2 errors (tasks with multiple siblings) should be protected by lock17:58
pigmejbut it's the same task, just updated from inprogress to DONE17:59
pigmejand there is a chance that it's done within the same worker, isn't it ?17:59
dshulyak_identity is not a process id18:00
dshulyak_i dont get your last point18:01
dshulyak_there is clearly an error somewhere, maybe lock was acquired by two threads because n_val=1 doesnt work like i wanted18:01
dshulyak_but what about worker identity?18:01
pigmejnvm then, I thought identity == some worker gevent thingy18:01
pigmejhttps://bpaste.net/show/ab6c28f54e4918:02
dshulyak_it is id of gevent thread18:02
pigmej(don't look at line numbers, I tuned up stuff for performance)18:02
dshulyak_yeah, looks like n_val doesnt work like i want18:04
pigmejthen we're kinda fucked18:04
dshulyak_it is same code, or u changed something?18:04
pigmejI made all tasks do nothing etc, so it's certainly not "the same code"18:05
dshulyak_well, we can enable old concurrency for scheduler18:05
dshulyak_with prefork=118:05
pigmejbut it's weird, it worked for me when I tested it18:05
pigmejbut now it's not18:06
pigmej:(18:06
pigmejdshulyak_: yeah for release it will be "maybe good idea"18:06
pigmejbut still, we need to solve it18:06
pigmejmaybe the right solution would be to drop n_val=1 support completely18:06
openstackgerritLukasz Oles proposed openstack/solar: Update path in tests  https://review.openstack.org/26776118:27
openstackgerritDmitry Shulyak proposed openstack/solar: Set concurrency=1 for system log and scheduler queues  https://review.openstack.org/26776918:43
dshulyak_pigmej: oh wait, that log doesnt prove that n_val=1 doesnt work :)18:47
dshulyak_there is save without force18:47
pigmejI changed it to force18:47
dshulyak_still the same?18:47
pigmejas you wanted ;)18:47
pigmejyup, I mean it happened with force18:47
pigmejwithout force it didn't show those beautiful errors17:48
pigmejBUT I'm tired I may make stupid mistakes now ;D18:48
pigmejI'm checking again, crossing fingers that I was wrong :)18:48
dshulyak_:D18:48
dshulyak_i am a bit disappointed in riak :)18:49
tznGuys, what about fault tolerance of state machine?18:49
tznWhen you start executing graph - what happens if something breaks18:50
pigmejtzn: that's not a problem18:50
pigmejdshulyak_: well, it works as desired18:50
tznok18:50
tznany explanation?18:50
pigmejstate is saved in DB18:50
pigmejeach task state18:50
tznso also every step status?18:51
tznok18:51
pigmejyeah, though we have some problems about that part right now18:51
pigmej;D18:51
tznno in memory storing18:51
tznYes, I figured that out ;)18:51
tznit just reminded me about this fault tolerance ;)18:51
salmon_tzn: as long as the db is working we can restore the execution18:51
tznok, so for fault tolerance we need riak18:52
pigmejwe could start it from any point18:52
tznat this stage18:52
pigmejtzn: well, the easiest answer is "it depends"18:52
tznwhat if a task starts and has no chance to send a confirmation/status to solar?18:52
pigmejLukasz's answer was correct: as long as the DB has all the needed info, everything is fine18:52
dshulyak_if something breaks in an unexpected way - then we can miss a status update, and the user will have to restart the execution18:52
tzn@pigmej as always in your case ;)18:52
salmon_pigmej: 'depends' is never the easiest answer :P18:53
pigmejdshulyak_: but only from this broken task,18:53
tznyes, but assuming idempotency, that should be safe18:53
pigmejsalmon_: it's easiest for me :D18:53
salmon_pigmej: :D18:53
pigmejtzn: i would keep salmon_'s sentence "as long as DB has correct info we're safe"18:53
tznsure18:53
pigmejit implies all backend features (riak vs sql)18:54
tznbut this is not an answer from my perspective ;)18:54
pigmejit also implies what dshulyak_ said :)18:54
salmon_assuming idempotency ")18:54
salmon_:)18:54
dshulyak_well anyways :) the truth is we can miss an update if something breaks in an unexpected way18:54
pigmejdshulyak_: but assuming idempotency of tasks we're safe18:55
dshulyak_but not all tasks will be idempotent18:55
pigmejand we can't miss the `n-1` update18:55
pigmejwe can miss the `n` update, but not the `n-1`18:55
pigmej(excluding backend fuckups)18:55
dshulyak_provisioning, removal of smth18:55
dshulyak_is not idempotent18:55
dshulyak_so it still may lead to error18:56
pigmejremoval is18:56
pigmejif you want to remove something which is already removed then you just "pass"18:56
pigmejdshulyak_: in theory we have 4 states18:56
dshulyak_well, what if you want to erase a node, but u cant ssh to the node?18:56
pigmejPENDING, INPROGRESS, SUCCESS | ERROR18:56
dshulyak_how can u know if it is removed?18:56
tznwell, it's about new ref architecture18:57
pigmejif task is PENDING => it wasn't executed yet for sure.18:57
tznand orchestrating upgrades for example18:57
tznthere is no way they will make tasks idempotent18:57
pigmejand if it's success or error, it was for sure executed18:57
pigmejINPROGRESS is tricky, but then you should probably check the system state by hand if a major problem was detected in the middle18:57
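
A small illustration of the restart logic implied by those four states, under the idempotency assumption discussed above; the names are illustrative, not solar's orchestration API:

    # Illustrative restart logic based on the four states discussed above.
    PENDING, INPROGRESS, SUCCESS, ERROR = 'PENDING', 'INPROGRESS', 'SUCCESS', 'ERROR'

    def tasks_to_resume(tasks):
        resume = []
        for task in tasks:
            if task.status == PENDING:
                resume.append(task)        # never executed - safe to run
            elif task.status == INPROGRESS:
                if task.idempotent:
                    resume.append(task)    # may have run, rerunning is harmless
                else:
                    task.status = ERROR    # needs an operator to check by hand
            # SUCCESS / ERROR: definitely executed, nothing to do
        return resume
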
pigmejdshulyak_: well, in that case I would look up whether the machine's card is registered on a switch / port whatever, or whether a mac address / ip address mapping exists; if not, I can assume it's removed18:58
dshulyak_hm, no :)18:59
dshulyak_it won't tell you anything about the state of the machine18:59
pigmejsure,19:00
dshulyak_i think the only way here is to mark such a task as an error19:00
pigmejbut keeping the machine in a detached state is not a big problem. It will not have network connectivity etc. So it will not mess with other systems19:00
pigmejdshulyak_: it should be marked as "wtf" :)19:00
pigmejas with all network connection problems19:01
dshulyak_so what about that n_val=1 - you were able to reproduce it again? i dont know wtf but on my env i am not able to reproduce even the race19:02
pigmejsalmon_ can you change that save to save(force) ?19:02
pigmejbut i'm able easily to reproduce that error even with force....19:02
salmon_?19:03
salmon_what, where?19:03
pigmejdshulyak_: give him line numbers :)19:03
dshulyak_salmon_: L104 dblayer/locking.py19:04
salmon_dshulyak_: lk.save(force-True) ?19:05
pigmejyup19:06
salmon_force=True19:06
salmon_?19:06
salmon_ok19:06
salmon_checking  hosts.py example19:06
pigmejdshulyak_: to me it now looks clear that n_val solves nothing19:06
pigmejit narrows the window, but still19:06
dshulyak_A vnode is the unit of concurrency, replication, and fault tolerance :)19:07
dshulyak_strange19:07
pigmejtrying to replicate with simple script19:07
pigmejdshulyak_: but our riak has 8vnodes19:08
dshulyak_but n_val is a replication number, isnt it?19:09
pigmejkinda19:10
pigmejit's how many copies of a SINGLE object are kept19:10
dshulyak_maybe pw=1 should be added19:10
pigmejhttps://bpaste.net/show/415b3339388d19:10
pigmejit's default afair19:10
dshulyak_i remember about sloppy quorum, but it is only when primary vnode is not available, right?19:11
pigmejor maybe not, because they changed it19:11
pigmejwhat is the primary vnode for a non-existing key ?19:11
dshulyak_whatever it is - it should be the same one) for A and B, because the placement is consistent19:12
pigmejbut it can be adjusted19:13
pigmejvnode isn't static19:13
pigmejit's not like key % vnode_num => it's given vnode19:13
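
For context on this exchange: n_val, w and pw are per-bucket properties (replication factor and write quorums) that can be inspected or changed from the Python riak client roughly as below; the bucket name is made up:

    # Sketch: inspecting/setting the bucket properties being discussed.
    import riak

    client = riak.RiakClient(protocol='pbc', host='localhost', pb_port=8087)
    bucket = client.bucket('solar_locks')        # illustrative bucket name

    print(bucket.get_properties())               # n_val, allow_mult, w, pw, ...
    bucket.set_properties({'n_val': 1,           # one replica per object
                           'allow_mult': True,   # keep siblings, no last-write-wins
                           'pw': 1})             # require the primary vnode on write
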
pigmejsalmon_: crashes for you?19:16
salmon_pigmej: worked ok now19:20
pigmej...19:20
salmon_running again19:20
pigmejhttps://bpaste.net/show/5a8be828c95f19:20
pigmejdshulyak_:19:23
pigmejhttps://bpaste.net/show/5bd25a7d960919:23
pigmejsalmon_: you too19:23
pigmejcould you guys please execute that ?19:23
pigmejand what it prints to you?19:23
pigmej1 and -1 nothing more ?19:24
pigmejIt would be super cool if, for both of you, it prints only one 1 and the rest are -119:24
pigmejor maybe better change that range to something smaller19:25
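
The pasted scripts are not preserved here, but a check in the same spirit might look like the following: two gevent greenlets write the same fresh key and each reports whether it ends up seeing the other's write as a sibling (the 1 / -1 convention is arbitrary). This is a hypothetical reconstruction, not the original paste:

    # Hypothetical reconstruction of the sibling-visibility check.
    import gevent.monkey
    gevent.monkey.patch_all()

    import gevent
    import riak

    client = riak.RiakClient(protocol='pbc', host='localhost', pb_port=8087)
    bucket = client.bucket('nval_test')                 # made-up test bucket
    bucket.set_properties({'n_val': 1, 'allow_mult': True})

    def writer(key, ident):
        obj = bucket.get(key)
        obj.data = {'owner': ident}
        obj.store()
        refreshed = bucket.get(key)
        # -1 if this writer sees the concurrent write as a sibling, 1 if not
        return -1 if len(refreshed.siblings) > 1 else 1

    for i in range(10):
        jobs = [gevent.spawn(writer, 'key-%d' % i, ident) for ident in ('A', 'B')]
        gevent.joinall(jobs)
        print(i, [job.value for job in jobs])   # [1, 1] = neither writer ever
                                                # saw the conflict (the bad case)
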
salmon_pigmej: on second run it crashed in the same way as before19:25
pigmejyeah19:25
pigmejso n_val is not working as we need19:25
salmon_[1, -1, -1, -1, -1, -1, -1, -1, -1, 1]19:26
salmon_always one 119:26
pigmejwait fixing one thing there19:26
pigmejhttps://bpaste.net/show/0727dc69d7be19:27
pigmejthis one19:27
pigmeji sometimes have two 1s...19:27
salmon_all are the same: 0 [1, -1]19:27
salmon_to 9 [1, -1]19:28
pigmejthen super cool19:28
pigmejbecause then it would mean that n_val works as dshulyak_ expected...19:28
dshulyak_also made a test - http://paste.openstack.org/show/483918/19:28
pigmejBUT wtf I have [1, 1] sometimes19:28
pigmejsalmon_: so you have always 1,-1 and it still crashes "as before" ?19:29
salmon_yup19:29
pigmejwhat is exception ?19:29
salmon_the same as before19:29
pigmejwe discussed like 10 exceptions there... ;)19:30
salmon_https://bpaste.net/show/2e13c965df2d19:30
salmon_going off now, see you tomorrow19:30
pigmejcounter19:30
pigmejsalmon_: counter is different exception :D19:30
pigmejdshulyak_: so let's say my riak is stupid, and that it does something 'wrong'19:30
dshulyak_0 [1, -1]19:31
dshulyak_1 [1, -1]19:31
dshulyak_2 [1, -1]19:31
dshulyak_3 [1, -1]19:31
dshulyak_4 [1, -1]19:31
dshulyak_5 [1, -1]19:31
dshulyak_6 [1, -1]19:31
dshulyak_7 [1, -1]19:31
dshulyak_8 [1, -1]19:31
dshulyak_9 [1, -1]19:31
dshulyak_sorry :)19:31
pigmejI will run tests for night, and we will see19:31
pigmejdshulyak_: yeah, then wtf I have [1, 1] sometimes19:31
pigmejlike in <1%19:31
dshulyak_in my test there is always 1 with 2 siblings, and 1 with 119:31
dshulyak_let me try 319:31
pigmejyeah, it's then the same as mine [1, -1]19:32
dshulyak_http://paste.openstack.org/show/483919/19:32
dshulyak_it is [1,2,3]19:32
pigmejso perfect19:33
pigmejhmm19:33
pigmejdoes your gevent support AttributeError: 'module' object has no attribute 'monkey' ?19:33
dshulyak_still i think it would be better to rollback init script to two workers, and then decide what we will do with n_val and counter for gevent19:33
dshulyak_yeah, it works19:34
pigmejHmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm19:34
dshulyak_gevent==1.0.219:34
pigmejsame there19:35
pigmejmaybe that's message "It's late Jedrzej, go off" ?19:35
dshulyak_maybe u are using old container for riak?19:36
dshulyak_without actual n_val=1 ?19:36
dshulyak_its not that late :)19:37
pigmejwell, i've been here since 9:30 :D19:37
pigmejso it's around 11 hours ;P19:38
pigmejI would need 2-3 more to switch to hardcore mode ;]19:38
pigmejdshulyak_: your script fails for me...19:38
pigmejI have [1,2,2] sometimes19:38
pigmejnot funny ;/19:39
pigmejbut well, if n_val works like that, there is a chance that the crdt counter will work as we need (it shares the same logic: imagine an object whose siblings are lists of negative and positive entries, and the value is just the sum of these)19:41
pigmejthen we could use counters on riak and "autoincrement" on sqlite19:44
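
The sibling-merge idea above is essentially a PN-counter: each sibling carries grow-only sets of positive and negative entries, merging is a union, and the value is the difference of the set sizes. A tiny backend-independent illustration (hypothetical, not solar code):

    # Tiny PN-counter-style merge illustrating the idea above.
    def merge(siblings):
        p, n = set(), set()
        for s in siblings:
            p.update(s['p'])   # positive entries, e.g. (identity, sequence)
            n.update(s['n'])   # negative entries
        return {'p': p, 'n': n}

    def value(counter):
        return len(counter['p']) - len(counter['n'])

    # Two conflicting siblings merge without losing or double counting anything:
    a = {'p': {('A', 1), ('A', 2)}, 'n': set()}
    b = {'p': {('A', 1), ('B', 1)}, 'n': set()}
    print(value(merge([a, b])))   # -> 3
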
pigmejok, dshulyak_ thanks for the debugging session, I spawned 3 riak vms, your script is running, mine too19:44
pigmejwe will see :)19:45
pigmejtake care!19:45
dshulyak_yes, thanks, it was an interesting debug session :)19:58
*** dshulyak_ has quit IRC20:08
*** dshulyak_ has joined #openstack-solar20:24
*** dshulyak_ has quit IRC20:41
tznanyone still online?20:45
salmon_tzn: what's up?20:51
*** mihgen has quit IRC21:28
*** mihgen has joined #openstack-solar21:35
*** tzn has quit IRC22:11
*** salmon_ has quit IRC22:20
*** 21WAASK71 has joined #openstack-solar22:34
*** 21WAASK71 has quit IRC22:37
*** tzn has joined #openstack-solar23:16
*** tzn has quit IRC23:26
