18:01:22 #startmeeting Trove
18:01:23 Meeting started Wed Mar 25 18:01:22 2015 UTC and is due to finish in 60 minutes. The chair is SlickNik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:01:27 The meeting name has been set to 'trove'
18:01:29 \o
18:01:37 o/
18:01:42 o/
18:01:43 o/
18:01:44 \0/
18:01:49 o/
18:01:50 o/
18:01:58 \m/ \m/
18:02:02 (o/
18:02:04 hello all
18:02:16 Meeting agenda at:
18:02:18 #link https://wiki.openstack.org/wiki/Meetings/TroveMeeting
18:02:21 don't tell sgotliv that we have a meeting.
18:02:38 0/
18:02:44 o/
18:03:01 #topic Trove pulse update
18:03:17 #link https://etherpad.openstack.org/p/trove-pulse-update
18:03:49 Awesome job everyone on getting the review numbers up last week! :)
18:04:14 And thanks to all who did reviews.
18:04:19 o/
18:04:54 We did more than double the number of reviews from the previous week.
18:05:17 ?/
18:05:25 SlickNik, it's really amazing
18:06:24 There are a lot more non-cores starting to get involved in the reviews, which is great.
18:06:27 sgotliv: ++
18:06:27 200+ reviews, that's a large number of reviews ....
18:07:25 yeah nice work
18:07:27 everyone
18:08:09 definitely great to see the backlog clearing out
18:08:17 lot of progress :D
18:08:46 dougshelley66, a backlog, it almost doesn't exist :-)
18:08:55 Any other questions wrt the pulse numbers?
18:09:02 that's only because we got oslo.messaging merged.
18:09:06 that was holding up EVERYTHING.
18:09:12 ;)
18:09:41 Okay, let's move on.
18:10:01 #topic Instance Storage for Replication
18:10:06 vgnbkr: around?
18:10:11 Hi.
18:10:28 If everyone wants to have a quick look at the note I wrote.
18:10:30 https://etherpad.openstack.org/p/trove-replication-storage
18:10:45 I think we should wait on the metadata - any dissenters?
18:11:44 vgnbkr: So I had one question - if the master is down/unavailable, couldn't you query a slave for the master's server-id?
18:12:13 I don't think so, unless we stored it on every slave.
18:13:01 Oh, it's not the server_id, it's the master's server_UUID.
18:15:41 georgelorch, am I correct that the master's UUID cannot be retrieved from a slave when the master is unreachable?
18:15:52 FWIW - just looking over the options in that note - I'm not liking the ones that make the DB schema mysql specific / gnostic
18:16:20 agreed - I'm just enumerating the options.
18:17:44 vgnbkr, not 100% sure about that but I think you are correct, I don't recall seeing any explicit way to get the master UUID from a disconnected slave, but there may be 'tricks'
18:19:38 so vgnbkr - barring some way of doing this by inspecting the replicated mysql schema / using a mysql function that does this, I'm liking the metadata option.
18:20:03 there is SHOW SLAVE HOSTS but that is a master call...
18:21:02 Discussing SlickNik's question, it would also be possible to store it in a file on the guest. How do we feel about storing info in files on the guest?
18:21:19 s/guest/slave guest/
18:21:27 ohh wait vgnbkr, what about Master_UUID in show slave status, or does that go to NULL when the slave loses contact with the master?
18:22:01 * georgelorch quickly looks up docs...
18:22:58 hmm, no details in the docs on behavior
18:23:09 OK, I'll check it out after the meeting...
18:23:28 georgelorch / vgnbkr: maybe that's something that we can try out? That would be a good solution if it worked.
18:23:34 vgnbkr: Sounds good.
18:23:50 I'm not sure I want to go down the route of storing state information on the guests.
18:23:53 If not, what's the opinion on storing instance data in files on guests?
18:24:02 Never mind :-)
18:24:17 OK, Thanks.
18:24:52 yeah vgnbkr, on a GTID enabled (5.6) slave, SHOW SLAVE STATUS _should_ have a Master_UUID field that is the server id of the master...I presume that the slave would have had to make successful contact with the master at least once in order to retrieve that value.
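The Master_UUID idea above could be sketched roughly as follows. This is only an illustration in Python, assuming the guest has already run SHOW SLAVE STATUS and holds the result as one row in a dict keyed by column name. The helper name `master_uuid_from_slave_status` is hypothetical and is not part of Trove. Whether the column keeps its value once the slave loses contact with the master was left open in the meeting, so the sketch treats a missing, NULL, or empty value as "unknown".

```python
from typing import Mapping, Optional


def master_uuid_from_slave_status(
    row: Optional[Mapping[str, object]],
) -> Optional[str]:
    """Return the master's server UUID from a SHOW SLAVE STATUS row.

    `row` is assumed to be a dict-like row keyed by column name, or None
    when the server is not configured as a slave. Returns None when the
    UUID is not available.
    """
    if not row:
        return None
    value = row.get("Master_UUID")
    if value is None:
        return None
    text = str(value).strip()
    # An empty string is treated the same as NULL: the UUID is unknown.
    return text or None
```

A caller would fall back to another source (for example the metadata option discussed above) whenever this returns None.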
18:25:09 Oh, so what I'm taking away is: get it from "show slave status" or wait for metadata.
18:25:43 Thanks georgelorch, I'll check it out.
18:26:21 vgnbkr, sure, let me know offline if you have any questions, I can dig into source or ask someone to make sure we get it right.
18:26:25 Awesome, thanks georgelorch and vgnbkr.
18:27:18 #topic Different timeouts in configuration based per datastore
18:27:27 hello
18:27:59 https://review.openstack.org/#/c/164640/
18:28:20 #link https://review.openstack.org/#/c/164640/
18:28:34 So, now with a number of experimental datastores around in trove
18:28:39 sushilkm: Did you have a specific question / clarification around this?
18:29:07 i wanted to get this discussed here
18:29:18 as in reviews it came from peterstac and vkmc
18:30:13 so this patchset aims at getting usage_timeout to datastore configs instead of the default config
18:31:50 i got the review that this parameter was continuously roaming between datastores and default ... and this should be discussed in the meeting
18:32:21 sushilkm, I'm not against the change. I just thought it needed more visibility since just last July we did the reverse change
18:32:23 in the proposed PS a new option, usage_timeout, is added per datastore and we also keep it in the general options
18:33:01 IMO it makes sense to keep it per datastore, and also mark the option in general options as deprecated... for a while (to keep backwards compatibility)
18:34:14 vkmc: I tend to agree with that. With multiple datastores, I've noted that I've had to change that value back and forth depending on which datastore I've been wanting to deploy.
18:34:54 yeah
18:35:40 I think at some point in the past, we moved to consolidate this into the default config for simplicity.
18:35:51 personally, I don't like to add more options...
but in this case, it seems a good call
18:36:24 along with usage_timeout, it would be worthwhile to move more options into datastore configs, cluster_usage_timeout and agent_call_..._timeout, for the same reasons
18:36:35 so is the discussion only about usage_timeout or about timeouts in general?
18:36:49 oh, perfect. thanks sushilkm that's where I wanted to go.
18:37:30 I think the better solution is to get rid of that type of timeout, and instead watch for agent heartbeats.
18:37:38 my only question about this would be whether we want to just do usage_timeout now and do the rest post Kilo. or do we want to do the whole lot in Kilo.
18:38:49 i would love to see them all being managed into datastores ... that would help not making the changes for every deployment
18:39:13 and would also help in future for maybe new datastores while testing+Deployment
18:39:14 amrith: It makes sense to tackle the two (usage + cluster_usage) together.
18:39:27 SlickNik, in that case I have a different proposal
18:39:38 have a common timeout and a per datastore timeout
18:39:52 and the code can look for the per datastore timeout and if not found go to the common timeout
18:40:01 this model (in general) for all timeouts would be a good one to adopt
18:40:14 I suggest agent_call_timeout not be included in such a change. It's general purpose - if there are cases where a longer timeout is required, it should have some other timeout specified.
18:40:16 it allows you to override on a per datastore basis while also having a default
18:40:23 amrith: That sounds reasonable.
18:40:34 but then there is vgnbkr's suggestion of doing away with timeouts which I think is much better.
18:40:44 and worth looking at before we make this change.
18:41:07 vgnbkr: I also like your idea of watching for heartbeats and not polling for ACTIVE - perhaps this is something that a scheduled task in the conductor could do?
18:41:24 I feel a Liberty BP here.
18:42:04 SlickNik, yes.
my crystal ball has a picture of a blueprint fluttering around
18:42:21 but not of a scheduled task
18:42:29 but rather a proper implementation of a state machine
18:42:36 for these kinds of state transitions.
18:43:29 amrith, way more than I was thinking. I was just thinking that the poll_until would not timeout if it has seen a heartbeat within 60 seconds or so.
18:43:30 amrith: That's another BP - I don't think we have a state machine mechanism at all right now.
18:44:10 I'd probably keep the two BPs separate. There are other parts of trove that would benefit from using an FSM model.
18:44:30 I'm definitely in favor of a FSM for this
18:44:40 but it is definitely a far cry from the current timeout mechanism.
18:44:56 I was thinking we'd just wrap the whole thing into the review of this bug and hand it to sushilkm ;)
18:45:27 for kilo-2
18:45:45 hah I'm sure sushilkm was thinking of exactly this when he penciled this in to the agenda. :)
18:46:26 so SlickNik to be clear ... I'll make sure a bp for this comes along.
18:46:45 So 1. Short term, I think datastore specific timeouts falling back to default makes sense.
18:46:57 +1
18:47:04 amrith: I'm confused, kilo-2 was Feb 5 ...
18:47:06 this fsm implementation will have to be later, like liberty
18:47:33 2. Long term, let's try and get rid of this polling usage timeout altogether => amrith will follow up with a BP for Liberty.
18:47:53 SlickNik, either dougshelley66 or vgnbkr or I
18:47:57 yes
18:48:03 do we have any thoughts about agent_call_..._timeout
18:48:17 I agree with vgnbkr on that
18:48:29 don't include it in the change
18:48:37 if you want a longer timeout, find one.
18:49:32 so #1 for now is only usage_timeout
18:49:36 and cluster_timeout
18:49:36 find one implies to update those parameters in the conf
18:49:48 sushilkm, yes.
18:49:52 ok fine
18:50:13 sushilkm, I imagine that each case where agent_call_.._timeout is insufficient would warrant a new config value.
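The short-term model agreed above (look for the per-datastore timeout and, if it is not set, fall back to the common one) might look something like the sketch below. It uses plain nested dicts rather than Trove's real configuration objects. The function name `resolve_timeout`, the section names, and the numeric values are all assumptions made for illustration.

```python
from typing import Mapping

DEFAULT_SECTION = "DEFAULT"  # assumed name for the common options section


def resolve_timeout(
    conf: Mapping[str, Mapping[str, int]],
    datastore: str,
    option: str = "usage_timeout",
) -> int:
    """Return the datastore-specific value of `option`, else the default.

    `conf` is a hypothetical nested mapping: section name -> option -> value.
    """
    datastore_section = conf.get(datastore, {})
    if option in datastore_section:
        return datastore_section[option]
    # No per-datastore override: fall back to the common value.
    return conf[DEFAULT_SECTION][option]


# Example values are made up for illustration only.
example_conf = {
    "DEFAULT": {"usage_timeout": 600, "cluster_usage_timeout": 675},
    "cassandra": {"usage_timeout": 1200},
    "mysql": {},
}
```

With this shape, `resolve_timeout(example_conf, "cassandra")` picks up the override, while `resolve_timeout(example_conf, "mysql")` and any datastore without a section use the common value, which is the backwards-compatible behaviour discussed above.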
18:50:50 personally, I also feel that the generic timeout (what are they called, short timeout and long timeout) should be outlawed
18:51:06 but when we get to at least watching for heartbeats we can head in that direction
18:51:15 vgnbkr / sushilkm: And that config value would probably be datastore specific as well.
18:51:40 SlickNik, likely
18:51:47 SlickNik, yes which is why i wanted those call_timeouts to be datastore specific too :)
18:52:05 amrith: those are RPC timeouts, so watching for heartbeats won't help get rid of those, I don't think.
18:52:50 since right now guestagent\api calls are based on those timeouts
18:53:19 which is again a problem-head for different datastores, as some datastores would obviously take longer than others
18:53:23 sushilkm: What vgnbkr and I are saying is that if there's a datastore specific call that's causing you to go beyond the bounds of those timeouts (like snapshot provisioning) you probably want to call that out as a separate timeout that would be datastore specific and not overload those two.
18:53:48 SlickNik, +1
18:54:22 okies
18:54:42 Okay, so I think we have some traction on this one.
18:55:01 Let's move on
18:55:06 #topic Open Discussion
18:55:06 #action sushilkm to make usage_timeout and cluster_timeout datastore specific
18:55:24 #action amrith (or dougshelley66 or vgnbkr) to propose bp for FSM in Liberty
18:55:55 ./
18:56:10 amrith: go for it
18:56:35 for kilo
18:56:40 do we want to do the oslo.log thing?
18:57:41 currently we have the trove-integration part of it ready to merge
18:57:56 once that gets in and I can rebuild the guest images, there's a chance to get the trove change to work
18:58:04 but the question is do we want all this for Kilo?
18:58:10 we don't *HAVE TO*
18:58:34 amrith: Context on the oslo.log thing?
18:58:50 #link https://review.openstack.org/#/c/162676/
18:59:02 #link https://review.openstack.org/#/c/162677/
18:59:17 this is a module being deprecated
18:59:24 oslo-incubator logging vs.
oslo.log
19:00:24 trove-integration doesn't follow a release cycle, so I'm good with merging the first change - shouldn't hurt and will make sure that the images we build have the dependency once the second change does merge.
19:00:33 OK, let me be more explicit. Let us punt this to Liberty.
19:00:40 the whole thing.
19:00:46 let's move to #openstack-trove
19:00:49 time's up
19:01:05 sounds good.
19:01:08 #endmeeting
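
The heartbeat idea from the timeout topic (a poll that does not time out if it has seen a heartbeat within 60 seconds or so) could be sketched as below. This is a standalone illustration and not Trove's poll_until: the function `poll_until_with_heartbeat`, the `PollTimeout` exception, and the `last_heartbeat` callback are all hypothetical names. `last_heartbeat` is assumed to return the time of the most recent heartbeat on the same clock as `clock`, or None if none has been seen.

```python
import time
from typing import Callable, Optional


class PollTimeout(Exception):
    """Raised when the deadline passed and the guest has gone quiet."""


def poll_until_with_heartbeat(
    is_done: Callable[[], bool],
    last_heartbeat: Callable[[], Optional[float]],
    timeout: float = 600.0,
    heartbeat_grace: float = 60.0,
    sleep_time: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until is_done() is true.

    Give up only when `timeout` has elapsed AND no heartbeat has been
    seen within the last `heartbeat_grace` seconds.
    """
    start = clock()
    while True:
        if is_done():
            return
        now = clock()
        deadline_passed = now - start >= timeout
        beat = last_heartbeat()
        # A guest that never sent a heartbeat counts as stale.
        stale = beat is None or (now - beat) > heartbeat_grace
        if deadline_passed and stale:
            raise PollTimeout("timed out with no recent heartbeat")
        sleep(sleep_time)
```

The `clock` and `sleep` parameters exist only so the loop can be driven by a fake clock in tests. As noted in the discussion, this would not help with the RPC call timeouts, which are a separate mechanism.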