Monday, 2016-01-18

*** dshulyak_ has joined #openstack-solar08:24
*** salmon_ has joined #openstack-solar08:39
pigmejdshulyak_: I have some concerns about our test_lock.py. Have you ever seen them fail on the current implementation?10:11
dshulyak_pigmej: no, they passed on 3 backends, what is the failure?10:12
pigmejI ran them overnight, and with fast acquire / release scenarios I can crash them easily10:13
pigmejthat's that delete + write case probably10:14
pigmejbut I don't get why it's not crashing for you. That's why I asked if you saw failure there10:14
pigmejand I again started to wonder if my env is OK, but our scripts from last week are working properly10:15
dshulyak_which test fails?10:18
pigmejacquire_release_logic, lock_acquired_released10:18
pigmejpretty "standard" tests10:19
pigmejobviously, in release_logic it fails on last assert,10:19
pigmejand in acquired_released it fails because 11 != 1210:19
dshulyak_it is with riak n_val=1 backend?10:22
dshulyak_or sqlite?10:22
pigmejn_val=110:22
pigmejit's probably that delete + 'write old state' scenario, but I cannot confirm it, because if I add additional debug, it always works10:23
pigmejdshulyak_: I mostly needed confirmation whether it sometimes fails for you or not at all10:25
dshulyak_i thought that case with write of old state only possible in concurrent env10:26
pigmejyeah10:26
pigmejme too10:26
pigmejbut maybe delete works in strange way10:27
pigmejbecause we know that it deletes with some delay10:27
dshulyak_yeah, but you added conflict resolution for deleted items10:28
pigmejyeah, anyway I will debug this somehow ;)10:30
pigmejBUT coffee first :D10:30
dshulyak_but yeah, it looks like for you the old identity is returned either in get or after a siblings error10:37
dshulyak_i will try to run those tests for some time10:38
pigmejyeah kinda like that10:39
dshulyak_i thought that maybe the problem was that i am using 1cpu for vagrant, but i switched to 2, and its all the same10:42
pigmejwell, I added 3 debug prints to check it and then everything always worked10:42
pigmejbut it looks like that10:43
pigmejDEBUG (locking.py::76)::Lock for 11 acquired by 1110:43
pigmejDEBUG (locking.py::86)::Release lock 11 with 1110:43
pigmejDEBUG (locking.py::106)::Found lock with UID 11, owned by 11, owner False10:43
pigmejso it's clear that after release *sometimes* it still finds the old one10:44
dshulyak_so it is even in get10:44
pigmejyup10:45
pigmejBUT I'm not sure if it's always like that10:45
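The failing tests exercise a loop roughly like the sketch below. The in-memory dict stands in for the Riak backend and the Lock API is illustrative, not Solar's actual locking.py. Against a dict the loop always passes; the intermittent failure needs an eventually consistent delete, where a get() after release can still return the old owner, which is what the debug log above shows.

```python
# Sketch of the acquire/release pattern behind the failing tests.
# An in-memory dict stands in for the Riak backend; this Lock API
# is illustrative, not Solar's actual implementation.

class Lock(object):
    def __init__(self, store, uid):
        self.store = store
        self.uid = uid

    def acquire(self, owner):
        current = self.store.get(self.uid)
        if current is not None and current != owner:
            return False  # held by someone else
        self.store[self.uid] = owner
        return True

    def release(self, owner):
        # Deleting the key is what opens the stale-read window on
        # Riak n_val=1: a later get() may still see the old object.
        if self.store.get(self.uid) == owner:
            del self.store[self.uid]


def acquire_release_logic(store, uid, owner, iterations=100):
    lock = Lock(store, uid)
    for _ in range(iterations):
        assert lock.acquire(owner)
        lock.release(owner)
    # the "last assert": after release the lock must be gone
    assert store.get(uid) is None
```

With a strongly consistent backend this never fails; the flakiness only appears when delete visibility lags behind the next get.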
dshulyak_pigmej: btw you are using 1cpu for solar-dev or 2-3 ?10:48
pigmej210:49
pigmej+ 2.5 GB RAM10:49
dshulyak_i tried to add more ram but it is still the same, 100000 iterations of acquire_release - no failure11:02
pigmejcool....11:03
pigmejwtf is with my laptop then?11:03
pigmejor with me :D11:05
dshulyak_salmon_: can you try to run test_lock with this change? http://paste.openstack.org/show/484121/11:05
dshulyak_solar/test/test_lock.py11:05
salmon_dshulyak_: sure, one moment11:05
dshulyak_with riak11:05
salmon_dshulyak_: how to configure tests to use riak?11:06
dshulyak_cat /.solar_config_override11:07
dshulyak_solar_db: riak://10.0.0.2:808711:07
salmon_ok11:08
pigmejsalmon_: default vagrant env uses riak n_val=111:08
salmon_dshulyak_: btw, is this `x` used somewhere?11:08
pigmejso it's default config11:08
salmon_pigmej: ok11:08
pigmejsalmon_: dshulyak_ added it just for parametrization :)11:08
dshulyak_yeah, range loop, i think otherwise pytest will fail11:09
pigmejyup11:10
pigmejhmm,11:16
salmon_how long will it take? :)11:16
pigmej100000 * 0.0511:17
dshulyak_i didnt notice, but should be quite fast :)11:17
salmon_...11:17
salmon_still running11:17
pigmejsalmon_: ;D11:17
pigmejok, I restarted env + laptop11:18
pigmejand... it works for me too (this test)11:18
dshulyak_if it behaves sometimes this way on your laptop then it is also possible in production11:20
pigmejsure11:20
pigmejthat's why I asked you for checking11:21
salmon_..'.count('.')11:21
salmon_Out[2]: 537611:21
salmon_It will take hours....11:21
dshulyak_hm11:21
dshulyak_maybe i executed 10000 :)11:21
pigmejone of you have broken env then ;P11:21
dshulyak_let me recheck11:21
salmon_pigmej: how long did it take for you?11:21
pigmej~0.02 each test I think11:21
pigmejI switched to my branch now11:22
pigmejbut wait I can check11:22
pigmej0.0311:22
pigmejso 0.03 * 100 00011:22
salmon_hmm, "Killed"11:22
pigmejmeans like 3000 seconds?11:22
pigmej3000 seconds which is hmm, like 50 minutes?11:23
pigmejdshulyak_: you didn't notice a test that ran for 50 minutes? :D11:23
salmon_after 5805 iteration I got message "Killed"11:23
pigmejoom or pytest killed it?11:23
salmon_all tests passed though11:23
dshulyak_i think i run 10000, not 10000011:23
salmon_rechecking11:23
dshulyak_sorry :D11:23
pigmejdshulyak_: :D11:24
salmon_riak + the test eat a lot of RAM during this test11:25
dshulyak_so what the conclusion - we wont rely on n_val=1?11:25
salmon_why does it take so much ram?11:27
pigmejdshulyak_: I think something is wrong in my env11:27
pigmejbecause I already had a stupid problem with that n_val11:27
pigmejI think we can rely on it11:27
pigmejBUT idk what's wrong with my stuff11:28
dshulyak_pigmej: you should buy youself macbook11:28
dshulyak_zero problems :D11:28
pigmejYou know what's worst about buying macbook ?11:28
pigmejor maybe I shouldn't say this joke there....11:29
salmon_you shouldnn't :P11:29
dshulyak_what?11:29
pigmejsalmon_: yeah I stopped it in the middle :P11:30
pigmejanyway, I ran this 10000 and it failed on the 117th try11:31
pigmejbefore restart11:31
dshulyak_parametrize its a lot - with 10k it is about 600, but with 100k - 120011:32
salmon_10004 passed in 1851.42 seconds11:59
pigmejk, then I will blame my env...12:01
salmon_I may try with more RAM because it was slow, it was using swap12:03
pigmejok I have alternative lock approach12:12
pigmejwhich seems to be working12:12
dshulyak_i remember that on friday we discussed locking based on state12:13
pigmejyeah12:13
pigmejI have some experiment about that too12:13
pigmej:)12:13
pigmejbut it requires changes in workers etc, so I think we should wait with that reimplementation for the new worker, shouldn't we?12:14
pigmejor we should somehow integrate it with model...12:14
openstackgerritJedrzej Nowak proposed openstack/solar: CRDTish lock to avoid concurrent update/delete  https://review.openstack.org/26901812:26
salmon_CRDTish :)12:27
pigmejdshulyak_ salmon_ https://review.openstack.org/#/c/269018/12:27
pigmejyup12:27
pigmejit's kinda like an aworset (add-wins observed-remove set)12:27
pigmejwe could use native set from riak too, but we need this implementation for SQLite for sure12:28
pigmejnow time to tackle counters :)12:28
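The aworset idea can be sketched generically as below; this is a textbook add-wins observed-remove set illustration, not the implementation in the review. Every add gets a unique tag, and a remove only cancels tags it has observed, so a concurrent add (a new tag) survives a merge: "add wins", which is what makes delete safe for the lock use case.

```python
import uuid

# Generic add-wins observed-remove set (AWORSet) sketch.
class AWORSet(object):
    def __init__(self):
        self.adds = {}        # unique tag -> element
        self.removes = set()  # observed tags that were removed

    def add(self, element):
        self.adds[uuid.uuid4().hex] = element

    def remove(self, element):
        # cancel only the tags observed locally; a concurrent add
        # elsewhere has a different tag and survives the merge
        for tag, el in self.adds.items():
            if el == element:
                self.removes.add(tag)

    def value(self):
        return {el for tag, el in self.adds.items()
                if tag not in self.removes}

    def merge(self, other):
        merged = AWORSet()
        merged.adds = dict(self.adds, **other.adds)
        merged.removes = self.removes | other.removes
        return merged
```

Simulating two replicas (copy the state, remove on one while the other re-adds, then merge) shows the element is still present after the merge, unlike with a plain delete.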
openstackgerritLukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now  https://review.openstack.org/26903513:12
salmon_oops13:12
pigmej:>13:13
pigmejwtf ?13:13
pigmejisn't it https://review.openstack.org/#/c/268331/ ?13:13
salmon_I messed with topics :/13:14
openstackgerritLukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now  https://review.openstack.org/26903513:14
openstackgerritLukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now  https://review.openstack.org/26903513:15
salmon_the last one is correct ^13:16
pigmej...13:29
pigmejwhy do you have 3 changes there?13:30
pigmej213:30
salmon_Patch sets?13:30
pigmejyup13:32
pigmejno13:32
pigmejehs13:32
pigmejsalmon_:13:32
pigmejRemove ansible.cfg, we use .ssh/config now Change-Id: I382bfb6e2b969a4058b74d569972418c19ebc834 Fix provision and image build after removing ansible.cfg Change-Id: I257bd0c7050516746ff77b8ef09dc169b945deae13:32
pigmejthis is how your commit msg looks like13:32
salmon_ah13:33
salmon_git stash...13:33
pigmejyeah yeah, excuses  ;P13:33
salmon_afk, ~1h :)13:34
pigmej;]13:34
openstackgerritJedrzej Nowak proposed openstack/solar: CRDTish lock to avoid concurrent update/delete  https://review.openstack.org/26901815:08
pigmejdshulyak_: in fact, are duplicates of "counter" bad for us?15:24
pigmejbecause I'm looking into the code, and ... so far I don't find a place where we require it to be unique15:25
dshulyak_in general no, but we need to be sure that the history is correct15:27
dshulyak_sorry not in general.. in some cases no15:27
pigmejwhat do you mean by 'history is correct' ?15:28
dshulyak_e.g. if B was executed after A - they shouldnt be the same15:29
pigmejok but what if C and B were executed just after A ?15:30
pigmejwouldn't it be valid if A would have counter 1, B would have 2 and C would have also 2 ?15:30
pigmejI mean, do we really need a "numbered order" there, or can we have a non-numbered order, like successors and predecessors?15:31
pigmejBecause I'm looking into code, and I find that history.filter is used only in one place, which is "history_last" method15:32
dshulyak_i think it is also used in solar ch history15:34
dshulyak_or should be15:34
dshulyak_C and B can be the same i guess15:34
pigmejwell solar ch history uses that composite15:36
pigmejwhich uses log, resource and action15:36
pigmejthe drawback could be that we could have a bit "randomized" order15:40
pigmejwhich is unwanted15:40
pigmejbecause all tasks with the same counter value *could* be then presented in any order15:40
pigmejBUT they are independent (because they were executed at the same time), so...15:41
pigmejso it may be a problem but I'm not sure15:43
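If concurrent tasks may legitimately share a counter value (A gets 1, B and C both get 2), the "randomized" presentation order for ties can still be made deterministic by sorting on a composite key. The entries and field names below are illustrative, not Solar's actual history schema:

```python
# Ties on the counter are broken deterministically by a secondary
# key, so repeated renders of the history show the same order.
history = [
    {'counter': 1, 'log': 'A.run'},
    {'counter': 2, 'log': 'C.run'},  # B and C ran concurrently
    {'counter': 2, 'log': 'B.run'},
]
ordered = sorted(history, key=lambda e: (e['counter'], e['log']))
assert [e['log'] for e in ordered] == ['A.run', 'B.run', 'C.run']
```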
pigmejdshulyak_: it turns out that we can use counters for n_val=1... I was mistaken about that.15:56
dshulyak_u mean crdt counters, right?15:57
pigmejyeah15:57
pigmejI just executed some testing things15:57
pigmejand with one node with n_val 1 we're fine15:57
pigmejand obviously the same for strongly consistent buckets15:58
dshulyak_btw can we use these data structures with strongly consistent buckets?15:59
pigmej?15:59
pigmejwe can use CRDT with strong consistent buckets,15:59
dshulyak_can we use counters and sets with ensemble buckets?15:59
pigmejAND we can use it with n_val=116:00
pigmejyeah we can16:00
pigmej:(16:00
dshulyak_whats up? it started to fail :) ?16:01
pigmejno16:01
pigmejI'm sad, because I was wrong about n_val=1 ;P16:02
pigmejthe tests were executed for an hour16:02
pigmejone of counters is now ~10mln16:02
pigmejnot a single value missed / duplicated16:02
pigmej:)16:03
pigmejthough we will need client side stuff for sql :)16:05
pigmejdshulyak_: any ideas how to do our counter in SQL ?16:07
dshulyak_client side?16:08
dshulyak_can we just insert empty rows?16:08
pigmejand take first non empty ?16:08
dshulyak_ah16:08
pigmejthe problem is that we need to know value16:09
pigmejI mean, `self.history = StrInt(next(NegativeCounter.get_or_create('history')))`16:09
dshulyak_but isnt it the same as with increment?16:09
pigmejI know how to do it with riak with counter16:10
dshulyak_i mean we will know it for sure after write16:10
pigmejok, then we should make the counter value the pkey in sql16:10
openstackgerritDmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules  https://review.openstack.org/26916616:10
pigmejbecause otherwise we could still get doubled the same value, isn't it ?16:11
dshulyak_i will still work on cleaning this, but here is an example how it works - https://review.openstack.org/#/c/269166/1/solar/test/functional/test_tasks_subscribers.py16:12
pigmejI wonder if sqlite will handle properly 2 concurrent +1 to the same row16:12
dshulyak_yes, i thought that we will use pkey for sqlite16:13
pigmejthe thing is that then we will need to have nested transactions16:14
dshulyak_where?16:15
pigmejbecause we will not know about counter conflict otherwise16:16
pigmejdshulyak_: transaction begin, x = [sql update +1], A.a = x, transaction end16:18
pigmejin a concurrent env it will not work properly, will it?16:18
dshulyak_wont sqlite lock whole table?16:20
pigmejLet's check, because I'm not sure16:22
pigmejI mean, it's not even +1 in our case16:22
pigmejbecause we will do x = (db.get() + 1).save().x16:22
pigmejpseudocode obviously ^16:22
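On the locking question: SQLite serializes writers on the whole database file, so a single transaction that increments in place and reads back cannot interleave with another writer, unlike the get() + save() read-modify-write in the pseudocode above, where two clients can read the same value and both write back the same counter. A minimal sketch (table and column names are made up for illustration):

```python
import sqlite3

# Atomic in-place increment instead of read-modify-write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('history', 0)")

def next_value(conn, name):
    with conn:  # one transaction: UPDATE and SELECT can't interleave
        conn.execute(
            "UPDATE counters SET value = value + 1 WHERE name = ?", (name,))
        (value,) = conn.execute(
            "SELECT value FROM counters WHERE name = ?", (name,)).fetchone()
    return value

assert next_value(conn, 'history') == 1
assert next_value(conn, 'history') == 2
```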
pigmejbrb16:29
dshulyak_wont we use autoincrement? i think it should be ().save().pk16:38
-openstackstatus- NOTICE: Gerrit is restarting quickly as a workaround for performance degradation16:50
pigmejback16:51
pigmejdshulyak_: yeah BUT https://www.sqlite.org/autoinc.html16:51
pigmejdshulyak_: for some reason16:58
pigmejhttps://bpaste.net/show/08849a0c59e316:58
pigmejworks fine16:58
pigmejso... for now I will treat this as a solution :)16:58
dshulyak_with sqlite?16:58
pigmejyup16:59
pigmejlater we can adjust it for pk16:59
pigmejbut I will make counter based for riaks16:59
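The pk-based variant discussed above can be sketched like this: let SQLite's rowid allocation hand out counter values, since each INSERT gets a unique, increasing id even with concurrent writers, read back via lastrowid. AUTOINCREMENT additionally keeps ids from ever being reused (the caveat on the autoinc.html page). The schema is illustrative, not Solar's actual model:

```python
import sqlite3

# Counter values come from SQLite's own id allocation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history_counter (id INTEGER PRIMARY KEY AUTOINCREMENT)")

def next_value(conn):
    with conn:
        cur = conn.execute("INSERT INTO history_counter DEFAULT VALUES")
    return cur.lastrowid

assert next_value(conn) == 1
assert next_value(conn) == 2
```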
openstackgerritDmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules  https://review.openstack.org/26916617:17
dshulyak_the good thing is that with new worker i can catch counter errors pretty easily17:18
pigmejit shouldn't crash :)17:18
pigmejdshulyak_: can you set workflow -1 for yourself there ?17:18
dshulyak_ok, usually i dont care about that, anyway patch wont be merged accidentally17:20
pigmejwell, for now probably no, but later we may do it by accident17:20
pigmej:)17:21
pigmejbut it's just my opinion :)17:21
pigmejhmm, any ideas how should we proceed with riak data types & our vagrant env?17:24
openstackgerritDmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules  https://review.openstack.org/26916618:27
openstackgerritDmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules  https://review.openstack.org/26916618:29
salmon_dshulyak_: it looks really nice18:35
pigmejyeah the separation is cool18:36
pigmejsalmon_: https://review.openstack.org/#/c/269166/4/solar/orchestration/executors/inproc.py this is my fav18:36
dshulyak_lets hope it will work :)18:36
pigmejyeah :)18:36
salmon_pigmej: yeah, nice :)18:36
pigmejdshulyak_: well, we're engineers, we don't hope, we know :D18:37
pigmejexcept: https://scontent-frt3-1.xx.fbcdn.net/hphotos-xpl1/v/t1.0-9/12507573_1114134135263634_3266113277144032533_n.jpg?oh=bf3b89c35ef4991aa3600078c06b1866&oe=570128EF :P18:37
salmon_:D18:39
pigmej;]18:39
openstackgerritJedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter  https://review.openstack.org/26923818:42
pigmejit's WIP18:43
openstackgerritJedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter  https://review.openstack.org/26923818:43
openstackgerritLukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now  https://review.openstack.org/26903518:44
openstackgerritLukasz Oles proposed openstack/solar: Hardcode ansible version. We are not ready for 2.0  https://review.openstack.org/26923918:44
salmon_pigmej: what is riak.yaml ?18:45
pigmejhttps://review.openstack.org/#/c/269238/2/bootstrap/playbooks/tasks/riak.yaml18:45
salmon_pigmej: can you move running it from Vagrantfile to solar.yaml? Otherwise you will break devops tests18:47
pigmejwell, I'm not sure :)18:49
pigmejI mean, what DB do you use in devops tests?18:49
pigmejriak ?18:49
salmon_yup18:50
pigmej'm18:50
pigmejok18:50
salmon_it's the same env as in vagrant18:50
salmon_and I'm using solar.yaml to bootstrap it18:50
pigmejthen fine, in fact I didn't want to break your stuff so I created it as a separate thing18:50
salmon_good intentions  ;)18:51
openstackgerritJedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter  https://review.openstack.org/26923818:52
pigmejk18:52
pigmej:)18:52
pigmejdshulyak_: do you have any eta when your worker may be usable / testable ?18:55
pigmejanyway, I'm off for today,18:55
dshulyak_it deploys some examples already, i think tomorrow it will be usable18:57
pigmejdshulyak_: cool, I expect to have counter working tomorrow (it works already but I haven't tested setup), locks are working too ;)19:03
*** dshulyak_ has quit IRC19:12
openstackgerritLukasz Oles proposed openstack/solar: Hardcode ansible version. We are not ready for 2.0  https://review.openstack.org/26923920:38
openstackgerritJedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter  https://review.openstack.org/26923820:57
*** salmon_ has quit IRC23:06

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!