20:00:46 #startmeeting trove
20:00:47 o/
20:00:48 Meeting started Wed Oct 2 20:00:46 2013 UTC and is due to finish in 60 minutes. The chair is hub_cap. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:52 The meeting name has been set to 'trove'
20:00:57 o/
20:00:59 just in time
20:01:02 o^/
20:01:06 nick robertmyers
20:01:06 o/
20:01:07 o/
20:01:08 \0/
20:01:09 ha
20:01:11 #link https://wiki.openstack.org/wiki/Meetings/TroveMeeting
20:01:16 nice robertmy_
20:01:18 o/
20:01:22 o/
20:01:23 at least u didnt show your password like grapex did once
20:01:31 hi
20:01:32 hunter2
20:01:51 hai
20:01:54 hub_cap: The worst part was that it was 12345
20:01:54 #link http://eavesdrop.openstack.org/meetings/trove/2013/trove.2013-09-25-20.03.html
20:01:55 i'm going to hunter2 you up datsun180b
20:01:55 hi
20:02:07 thats my luggage password grapex
20:02:34 so is slick not around vipul?
20:02:58 i didnt do nuttun wrt launchpad + perms
20:03:07 #action SlickNik, hub_cap to check with other teams to set groups permissions correctly on LaunchPad
20:03:12 i think what we need is a -bugs team
20:03:22 o/
20:03:26 ok moving on
20:03:26 here
20:03:29 ok
20:03:38 SlickNik: did u do anything wrt the LP stuff?
20:03:41 hub_cap: -contributors team
20:04:00 hub_cap: nope, haven't had a chance.
20:04:34 moving on
20:04:46 #topic rolling back resources
20:04:49 so
20:04:59 historically we have not rolled back resources
20:05:04 and we have let a delete clean up
20:05:18 i'm here
20:05:31 dmakogon_ipod has suggested cleaning up some of the things when a failure happens
20:05:36 main idea for rolling back is to avoid quota exceedance
20:05:46 why some, but not all?
20:05:46 well i have a Q
20:05:57 do the items you roll back have quotas associated with them?
20:06:07 yes
20:06:09 yeah - i was going to say do you have a lower quota for security groups than instances
20:06:10 isviridov: one component per review
20:06:22 hub_cap: no
20:06:38 ok so then the quota exceedance does not matter here right?
20:06:50 security groups quota is controlled by nova
20:07:11 ok so we are talking about the 3rd party quotas, and im sure DNS will be the same
20:07:17 so i've got a couple of issues: 1) delete instance call when dns support is enabled causes instance delete to fail if dns entry doesn't exist
20:07:19 nova quota exceedance matters, but this one is out of trove scope
20:07:30 but our main quota is instance, right?
20:07:34 redthrux: this is a bug
20:07:46 if you have an instance, in failed or building state
20:07:46 hub_cap: yes
20:07:47 its a part of your quotas
20:07:56 so when an instance errors in the models (either because of sec-group or DNS), a record in the DB is created
20:07:57 you get 1 less instance you can provision
20:08:07 +1 hub_cap, esmute
20:08:09 https://review.openstack.org/#/c/45723/ - although take a look at it
20:08:19 and we're only rolling back if prepare fails?
20:08:20 redthrux: It seems like we could handle that by not failing the delete if an associated resource can't be found (i.e. was already deleted or orphaned)
20:08:20 right esmute so when you do a POST /instances, no matter the outcome of it
20:08:22 when the exception is raised, the quota engine rolls back the quota... back to what it originally was
20:08:29 but then the record is still there
20:08:42 +1 grapex
20:08:46 esmute: so if an instance is in FAILED status, its not a part of the quotas?
20:08:48 so when the user tries to delete that instance, the quota will falsely decrease
20:08:58 ok thats a bug esmute
20:09:10 grapex - yes - basically i'm saying this has to be addressed as a prerequisite to cleaning up
20:09:13 ok im really confused
20:09:16 let me reiterate
20:09:20 with what should happen
20:09:24 1) an instance is created
20:09:25 so what the rollback is trying to do is also to rollback the db record crated
20:09:27 created*
20:09:29 2) the quota is increased
20:09:33 3) a failure occurs
20:09:39 4) the quotas remain the same
20:09:47 5) a user deletes
20:09:55 6) the quota goes back down by 1
20:09:56 1) instance is created
20:10:03 2) quota is increased
20:10:17 3) failure occurs in the models (different if it occurs in TM)
20:10:33 4) quota catches the exception and rolls back quota to original value
20:10:49 esmute: What is the distinction for 3? By models do you mean if it fails in the api daemon?
20:10:49 5) the user sees the instance still there (because it was not rolled back)
20:10:54 6) user does delete
20:11:04 esmute: this is not a conversation about the bug
20:11:07 7) quota usage decreases (falsely)
20:11:07 esmute: I think what hub_cap is saying is that if the instance is around (even in FAILED state), the quota shouldn't be rolled back.
20:11:09 that you have found / fixed
20:11:12 it's about resources
20:11:17 esmute ^^
20:11:25 so - things like dns fall under this
20:11:31 esmute: current trove state is that users cannot delete a stuck instance
20:11:33 lets not worry about that right now
20:11:47 grapex: Once the request goes to the TM, the quota usage is committed...
20:11:52 stuck how dmakogon_ipod
20:12:00 but if it fails in the API, the quota is rolled back
20:12:05 BUILDING status on poll_until
20:12:10 we should fix a stuck instance
20:12:16 rather than roll back _some_ of the resources
20:12:22 that will lead to more confusion
20:12:25 okay - dmakogon_ipod - that means the prepare call failed
20:12:31 hub_cap: how ??
20:12:42 So, the quota should really be tied to the resource.
20:12:47 do you know the reason for it getting stuck?
20:12:51 dmakogon_ipod: ^^
20:12:53 what we can do is mark the instance as FAILED or ERROR but do not re-raise the error
20:12:54 hub_cap: Agreed. If we roll back resources because something failed we'll end up duplicating logic to delete resources that already exists in the delete call
20:13:00 otherwise the quota will roll back
20:13:01 we actually shouldn't wait for instances to come out of BUILD status with a timeout
20:13:03 redthrux: it means that instance cannot be repaired
20:13:06 If the resource still exists in the DB, the quota should be +1
20:13:24 ok we are rabbit holing
20:13:27 *holeing
20:13:32 i don't think rolling back is smart period
20:13:35 this is not helping the decision
20:13:40 *wholing?
20:13:48 kevinconway: 1) users, 2) punch
20:14:01 its a matter of do we roll back resources on a poll_until timeout
20:14:04 i'd rather us reorder things - dns before prepare - and if prepare fails, then we mark it as failed
20:14:05 right?
20:14:06 3) profit.
20:14:10 lol
20:14:13 esmute: So if the quota is falsely updated if it fails in the api daemon, i.e. before the taskmanager, I think the resolution is to make the failed state happening in the api daemon something which is shown to the user as FAILED but maybe is a different state (has a different description)
20:14:14 cweid: lol
20:14:29 clear ;
20:14:37 i actually loathe that a slow prepare call - that eventually finishes - can cause an instance to be in failed status
20:14:39 1:14 PM hub_cap its a matter of do we roll back resources on a poll_until timeout
20:14:40 Actually, let's talk about the bug esmute brought up after we finish talking about dmakogon_ipod's topic.
20:14:44 as a customer i don't know anything about low-level deployment, so if it cannot become active it is broken, and the customer should delete it, but he can't
20:14:45 this happens - i've seen it with my own eyes
20:14:55 grapex: sure we can add it to the end of the agenda
20:14:58 grapex: agree. But if we do that, we cant re-raise the error.. Otherwise the quota will rollback to what it was before
20:15:06 lets not talk about esmute's bug
20:15:09 its not the topic
20:15:10 i'd like us to understand the poll until is for USAGE -
20:15:10 period
20:15:25 1:14 PM hub_cap its a matter of do we roll back resources on a poll_until timeout
20:15:28 1:14 PM hub_cap its a matter of do we roll back resources on a poll_until timeout
20:15:31 1:14 PM hub_cap its a matter of do we roll back resources on a poll_until timeout
20:15:34 lets talk about this
20:15:37 and this only
20:15:37 hub_cap: The fix for this rollback affects my bug :P
20:15:42 lets cut out the poll until
20:15:47 what's the definition of rollback?
20:15:52 delete everything?
20:15:58 #link https://review.openstack.org/#/c/45708/
20:15:59 or mark something in a terminal status
20:16:01 vipul: no
20:16:07 no vipul, to remove security groups
20:16:19 vipul: we're supposed to leave the instance and nothing else
20:16:20 #link https://review.openstack.org/#/c/45708/29/trove/taskmanager/models.py
20:16:20 vipul: delete associated artifacts.
20:16:25 everyone look @ that
20:16:39 a consistent view of the world, in my opinion, is that if a virtual/physical asset is still provisioned (whether it's active or failed), the quota should not be rolled back. An important addendum is that a user/admin should be able to delete an instance in BUILD/ERROR to subtract the current quota usage. So, in short, we should not (imho) rollback resources on a poll_until timeout.
20:16:43 where associated artifacts = security groups.
20:17:07 amcrn: thats what ive been trying to say
20:17:10 I don't agree with removing/deleting resources explicitly on timeout
20:17:11 FWIW: I'm of the view that we should not roll back.
20:17:12 that was my 6 step program
20:17:15 i do agree with marking it as deleted
20:17:17 hub_cap: I'm agreeing with you :)
20:17:18 amcrn: that is why i suggested updating the status on poll_until timeout
20:17:20 err.. error
20:17:34 amcrn: +1
20:17:36 explicit is better than implicit in this case.
20:17:43 because the idea is that when the user issues a delete, it will remove the associated resources
20:17:51 Because something failed when the user wasn't looking.
20:17:53 https://gist.github.com/hub-cap/6799894
20:17:56 amcrn: +1
20:18:01 vipul: if so, you would get a quota exceeded exception on the next provisioning
20:18:03 look @ that gist
20:18:14 that's fine.. you need to delete
20:18:18 hub_cap: That's esmute's issue
20:18:18 dmakogon_ipod: then you can call delete
20:18:19 yes
20:18:22 Do we want to talk about that now?
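For illustration, a minimal sketch in Python of the consensus above: quota usage tracks the existence of the instance record, not its health, and is released only on delete. The class and method names are hypothetical, not Trove's actual quota engine API.

```python
class QuotaExceeded(Exception):
    pass


class InstanceQuota(object):
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def reserve(self):
        # Called when a POST /instances request is accepted.
        if self.in_use + 1 > self.limit:
            raise QuotaExceeded("instance quota exceeded")
        self.in_use += 1

    def release(self):
        # Called only when the instance record is deleted -- never because
        # provisioning failed. A FAILED or BUILD instance still counts.
        self.in_use -= 1


quota = InstanceQuota(limit=10)
quota.reserve()   # instance created: usage goes to 1
# ... provisioning fails, the instance goes to FAILED, usage stays at 1 ...
quota.release()   # user issues DELETE on the failed instance: usage back to 0
```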
20:18:32 i think manual delete/cleanup of resources is fine if you are talking one or two errors but this does not scale
20:18:32 if you have instances in Error, those count against quota
20:18:40 dmakogon_ipod: only if you have misconfigured quotas in nova would you have quota issues
20:18:41 hub_cap: what do you mean by "4) the quotas remain the same"?
20:18:45 we are provisioning a whole system not parts
20:19:04 i see it as a complete transaction: either completely done or completely undone
20:19:06 4) quota rolls back
20:19:06 you prov a resource, it is a hit to quotas, period
20:19:11 no
20:19:16 they should not
20:19:17 hub_cap: suppose we have fewer sec.gr than VMs in quota
20:19:19 I feel like the real problem here is dmakogon_ipod has encountered a case where the delete call is unable to fully work. We need to fix that case.
20:19:19 if a user deletes, then quotas roll back
20:19:29 dmakogon_ipod: then you will hit the issue even w/o rollbacks
20:19:34 10 instances, 8 secgroups
20:19:38 ok..that is what grapex suggested
20:19:39 even w/ perfect instance provisioning
20:19:46 you will get 8 instances
20:19:50 and 2 failures
20:19:54 juice: I think if we want to switch to using transactions, maybe we (wait for it everyone) wait for Heat. :)
20:20:19 hub_cap: no, nova assigns default sec.gr to instance
20:20:26 if heat addresses this issue then we should not build our own
20:20:48 it's a workflow type of scenario.. you have a distributed loosely coupled system.. you can't implement transactions unless you implement workflow
20:20:49 s/heat/lets not talk about this/
20:20:54 lol
20:20:57 grapex: turn up the HEAT!
20:21:02 yep, heat has a parameter to rollback or not
20:21:10 hey lets not talk heat
20:21:13 kevinconway: The heat is on.
20:21:15 so - wait - I think the consensus is to say "don't roll back parts of an instance"
20:21:16 dmakogon_ipod: tell me what you mean
20:21:21 kevinconway: It's on the street.
20:21:30 well everyone agrees w/ that but dmakogon_ipod, rev
20:21:33 *redthrux
20:21:39 and id like to get his opinion on it
20:21:46 dmakogon_ipod: explain the scenario plz
20:22:02 hub_cap: if you cannot create a new security group, None is passed, and nova will assign the default security group to the instance and that's it
20:22:21 we do not check that a secgrp is honored by nova?
20:22:26 thats a bug
20:23:02 but, the default sec. gr is shared
20:23:23 you cannot add identical rules to it
20:23:35 i understand that
20:23:48 but do we not check that the secgrp we created is honored
20:23:54 the current workflow is missing checks for creating groups/rules
20:24:50 so i understand what dmakogon_ipod is saying, but a rollback would not change the scenario
20:24:53 right?
20:25:01 misconfiguration can be the cause too
20:25:09 yes
20:25:12 that's what it sounds like
20:25:12 so im not sure that rolling back will "fix" this
20:25:28 and it does leave things in a different state between nova and trove
20:25:32 ok, then we should update the status
20:26:10 instance and task status, to let the user delete instances with BUILDING/FAILED/ERROR statuses
20:26:35 yes definitely dmakogon_ipod
20:26:40 if we dont, we have a bug
20:26:51 users should be able to delete failed instances
20:27:03 and instances should go failed if they are broken (fail timeout)
20:27:03 hub_cap: but i'm still proposing deleting components that are not controlled by trove quota
20:27:17 the delete will do that dmakogon_ipod
20:27:33 right - the instance delete call will do that.
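A rough sketch of "the instance delete call will do that": cleanup of associated artifacts belongs in the delete path, not in a create-time rollback, and (per the DNS bug discussed just below) a missing artifact should be skipped rather than failing the delete. All helper names here are hypothetical stubs, not Trove's real taskmanager code.

```python
class NotFound(Exception):
    pass


def delete_security_group(instance):
    raise NotFound()  # stub: pretend the secgroup was never created


def delete_dns_entry(instance):
    pass  # stub


def delete_compute_instance(instance):
    pass  # stub


def delete_instance(instance):
    # Artifact cleanup lives here, not in a create rollback. An artifact
    # that was never created (or is already gone) must not block the delete.
    for cleanup in (delete_security_group, delete_dns_entry):
        try:
            cleanup(instance)
        except NotFound:
            pass  # skip it and finish the delete
    delete_compute_instance(instance)
    # only once the record is removed does quota usage go back down


delete_instance(instance=None)  # succeeds despite the missing secgroup
```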
20:28:00 what would it do if there is no specific component ?
20:28:06 and - why roll back anything - people running the infra will want to investigate what's going on with a delete
20:28:07 we already heard about dns
20:28:17 can we implement an API call like "refresh all"?
20:28:20 we would not fail if it does not exist
20:28:36 but there is no reason to "roll back"
20:28:50 hub_cap: even if support is turned on ?
20:29:08 if you try to delete dns, for example, and it does not exist properly
20:29:12 because it failed
20:29:18 hub_cap: then how does it fail with DNS ?
20:29:19 then it should just skip it and finish the delete
20:29:25 its a bug dmakogon_ipod
20:29:44 hub_cap: ok
20:29:44 +1 hub_cap
20:30:06 i filed it
20:30:09 core team, do we have consensus? we should move on. so far i have 1) we have a bug in delete logic, 2) we will not rollback on create
20:30:18 are we good? ready to move on?
20:30:29 I'm good.
20:30:31 here's the bug: https://bugs.launchpad.net/trove/+bug/1233852
20:30:36 we have lots of stuff to do
20:30:39 hub_cap: i'll fix my review with the status update tomorrow
20:30:42 <3 dmakogon_ipod
20:30:56 #topic Cloud-init service extensions
20:31:07 so, the security group workflow update will be abandoned
20:31:14 ahhh
20:31:18 my topic again
20:31:30 guys, can you have that in writing somewhere?
20:31:41 #link https://gist.github.com/crazymac/6791694
20:31:46 esmute: it is in writing, this is logged :)
20:31:55 hub_cap: I would want to mark it as 'error' though
20:32:07 yes vipul i think dmakogon_ipod will do that
20:32:08 dmakogon_ipod: are you abandoning that fix?
20:32:12 to elaborate on esmute's point, can we get a table of scenarios with desired end states?
20:32:33 vipul hub_cap: error for which status ?
20:32:39 instance or service ?
20:32:53 instance status
20:32:58 ok
20:33:02 got it
20:33:13 now it's another topic
20:33:35 updating cloud-init before passing it into userdata
20:33:50 my idea is described in the gist
20:33:59 please, take a look
20:34:08 yes i think that this is ok
20:34:13 im fine w/ it
20:34:19 but we need to really focus on heat support too
20:34:24 yes
20:34:40 definitely
20:34:46 and when we do
20:34:53 itll be easy to shift to it
20:35:07 i already marked it as a TODO
20:36:02 ok great
20:36:16 i have no issues w/ it, so we good to move on?
20:37:08 #topic Configuration + service type
20:37:27 ashestakov ?
20:37:34 hey guys i will say thx to dmakogon_ipod and isviridov_ for putting their names on their topics
20:37:37 very smart
20:37:47 dmakogon_ipod: this must be from last wk?
20:37:58 maybe
20:38:10 hub_cap, yep. i've removed one. Please refresh
20:38:11 someone forgot to update it, right ?)))
20:38:22 andrey is not around ya?
20:38:38 seems like yes
20:38:38 isviridov_ k
20:38:46 #action moving on :)
20:38:53 yogesh: what should i call your next topic?
20:38:56 sounds good
20:39:01 lots of this is on the ML
20:39:02 #topic service registration
20:39:03 read it there
20:39:09 #link https://review.openstack.org/#/c/41055/
20:39:13 yup...
20:39:16 is it updated ?
20:39:47 #link https://gist.github.com/crazymac/6784871
20:39:49 dmakogon_ipod: please update the gist per the latest decision
20:40:11 or is it already... :-)
20:40:25 ok do we need to talk about this?
20:40:30 we are good...
20:40:34 ok cool
20:40:40 yogesh: there was a typo )))
20:40:49 yeah...
20:40:53 #topic trove-conductor
20:40:54 but the intent is clear..
20:40:58 datsun180b: go go go
20:40:58 yea!
20:41:02 datsun180b: update?
20:41:02 hello
20:41:29 datsun180b: conductor code?
20:41:30 i'm sorting out problems with restart and unit tests but conductor at the moment successfully intercepts mysql status updates
20:41:49 hooray
20:41:49 it's more than a trove review, there's also a devstack review linked in the agenda.
20:41:54 nice
20:41:59 yogesh: done, gist updated
20:42:02 can you link?
20:42:07 one moment
20:42:16 dmakogon_ipod: cool thanks
20:42:23 #link https://review.openstack.org/#/c/45116/
20:42:29 #link https://review.openstack.org/#/c/49237/
20:42:38 datsun180b: do you have links?
20:42:49 datsun180b: anything else to say on the subject?
20:43:10 at the moment no, but i'd appreciate eyeballs and advice on the code i've shared so far
20:43:37 that's pretty much it for me
20:43:49 moving on, great work datsun180b
20:44:00 #topic trove-heat
20:44:05 yogesh: go
20:44:05 hub_cap: i listed down some points from trove/heat integration perspective... https://gist.github.com/mehrayogesh/6798720
20:44:42 can you folks making gists make sure to put in line breaks so everything fits in the frame?
20:45:23 points 1 and 2 can be skipped...
20:45:27 kevinconway: +++++++++++++
20:45:41 as hardening is anyway in progress and heat events are not supported as of now...
20:45:50 polling is the only way for checking the stack status in heat..
20:46:10 im very happy to hear that heat support is going to be fixed up
20:46:39 hub_cap: on its way...
20:46:40 :-)
20:47:14 point 3: template in code, should be configurable....is there a reasoning....
20:47:38 template should not be in code
20:47:41 well
20:47:42 sure...
20:47:46 the template should be "built"
20:47:56 some things, like user data, should be configurable
20:48:21 but other things, like the instance itself in the yaml, can be generated
20:48:32 it will make building a multi node stack easier
20:48:34 yes agreed...
20:48:54 yogesh hub_cap: could we externalize the template out of the taskmanager ?
20:49:04 dmakogon_ipod: thats the plan i think
20:49:11 i think yogesh asked about it
20:49:12 it'll be externalized..
20:49:22 and stored like the cloud-init script
20:49:26 but then we won't have a completely cooked template...
20:49:33 it'll be created dynamically
20:49:39 right yogesh
20:49:39 ok
20:49:46 and some parts will be config'd
20:49:48 like user data
20:49:51 and some will just be generated
20:49:59 hub_cap: absolutely....
20:50:01 perfect
20:50:07 are you going to generate your template using template templates?
20:50:08 what template parts are you going to generate?
20:50:08 point 4.
20:50:10 i'd like to take a look at the mechanism of dynamic creation of the heat template
20:50:30 i'll keep updating the GIST and mark it off to you guys..
20:50:46 isviridov_ the part that defines the # of instances, so u can generate a stack w/ > 1 instance
20:50:49 for clustering
20:51:03 dmakogon_ipod: once yogesh publishes the review you will be able to :)
20:51:11 the specific user scripts will be configurable...
20:51:12 can't you just roll a jinja2 template for the HEAT templates?
20:51:15 do we want to use HEAT HOT DSL?
20:51:15 number of instances can be parametrized
20:51:24 it supports IF and ELSE and all that
20:51:24 is trove/heat integration spec'd in a blueprint?
20:51:31 Key5_: if it impls everything that the cfn does
20:51:53 does heat support it now ?
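A tiny sketch of kevinconway's jinja2 suggestion: render the heat template dynamically so the instance count (for clustering) and the user-data script are parameters rather than a precooked YAML file. The template body is illustrative only, not the template Trove would actually ship; it assumes the jinja2 library is available.

```python
from jinja2 import Template

# Illustrative HOT-style skeleton; the real resource properties would differ.
HEAT_TEMPLATE = Template("""\
heat_template_version: 2013-05-23
resources:
{%- for i in range(instance_count) %}
  instance_{{ i }}:
    type: OS::Nova::Server
    properties:
      flavor: {{ flavor }}
      user_data: |
        {{ user_data | indent(8) }}
{%- endfor %}
""")

print(HEAT_TEMPLATE.render(instance_count=3,
                           flavor="m1.small",
                           user_data="#!/bin/sh\necho configured by trove"))
```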
20:52:04 kevinconway: right, im not sure we are defining the _how_ of template generation
20:52:15 the design of dynamic template creation can be a separate topic
20:52:35 hub_cap: yogesh: let's take a weed and think about it
20:52:38 i was just curious about the term "dynamic template generation"
20:52:50 kevinconway, +1
20:52:55 kevinconway +1
20:52:55 HAH dmakogon_ipod
20:53:05 i AM DOWN
20:53:07 I think he meant "week"
20:53:08 oops caps
20:53:14 dynamic, in the sense that there is not a precooked yaml/template file which gets loaded into heat
20:53:15 yes
20:53:16 yes
20:53:26 cweid: too bad u cant specify font size in irc
20:53:30 template generation will be abstracted...
20:53:42 i guess my confusion was with template
20:53:53 i think of a template like a jinja2 template, but it means something else in HEAT
20:54:00 yup...
20:54:11 i think we could have a minimally working template and then extend it for the needs of each service
20:54:12 dynamic heat template generation, i should say
20:54:16 so topic 4
20:54:17 so when you say dynamic template you mean parameterizing your HEAT templates
20:54:27 yeah ok, thanks for the clarification
20:54:35 point 4...
20:54:38 yogesh: lets talk about pt 4
20:54:46 for multi instance templates...
20:55:04 marking it back into the instance list in trove...
20:55:18 ok i think we need to know that its > 1 instance
20:55:25 yeah...
20:55:25 so we will have to put something in the db
20:55:29 yup
20:55:34 but im not sure i want to call them separate instances
20:55:40 ^ ^ people working on clustering api will kill me
20:55:42 and relating it to point 5
20:55:58 do we need an instance-group abstraction in trove
20:56:02 :-D
20:56:11 amcrn: do u remember when i asked you about making a cluster an instance a few wks ago? ;)
20:56:14 hub_cap: i kinda agree... :-)
20:56:20 the topic has come back up, and i like it
20:56:44 instance_group classification in trove would be nice...
20:56:56 yes yogesh we will need to define something like this for clustering
20:56:56 make the pipeline all aligned...
20:57:03 awesome..
20:57:52 do we have a BP for heat integration?
20:57:57 instance group - it is like a cluster but in theory ?
20:58:01 in addition, do we think that there is any part which may be missing and needs to be taken care of...just wanted to have a trove/heat task list...from multi instance / clustering perspective
20:58:16 ok 3 min left, here is what weve come up with
20:58:18 correct, yes
20:58:21 an instance is an instance is a cluster
20:58:34 amcrn: hehe its come back up ;)
20:59:00 conceptually, whether multiple instances will always be spun up for clustering...
20:59:09 thinking aloud...
20:59:31 :|
21:00:09 ok time to end
21:00:12 #endmeeting