18:01:22 <SlickNik> #startmeeting Trove
18:01:23 <openstack> Meeting started Wed Mar 25 18:01:22 2015 UTC and is due to finish in 60 minutes.  The chair is SlickNik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:01:27 <openstack> The meeting name has been set to 'trove'
18:01:29 <mvandijk> \o
18:01:37 <danritchie> o/
18:01:42 <vkmc> o/
18:01:43 <dougshelley66> o/
18:01:44 <amrith> \0/
18:01:49 <peterstac> o/
18:01:50 <vgnbkr> o/
18:01:58 <georgelorch> \m/  \m/
18:02:02 <vkmc> (o/
18:02:04 <sushilkm> hello all
18:02:16 <SlickNik> Meeting agenda at:
18:02:18 <SlickNik> #link https://wiki.openstack.org/wiki/Meetings/TroveMeeting
18:02:21 <amrith> don't tell sgotliv that we have a meeting.
18:02:38 <sgotliv> 0/
18:02:44 <nshah> o/
18:03:01 <SlickNik> #topic Trove pulse update
18:03:17 <SlickNik> #link https://etherpad.openstack.org/p/trove-pulse-update
18:03:49 <SlickNik> Awesome job everyone on getting the review numbers up last week! :)
18:04:14 <SlickNik> And thanks to all who did reviews.
18:04:19 <atomic77> o/
18:04:54 <SlickNik> We did more than double the number of reviews from the previous week.
18:05:17 <pmalik> ?/
18:05:25 <sgotliv> SlickNik, it's really amazing
18:06:24 <SlickNik> There are a lot more non-cores starting to get involved in the reviews, which is great.
18:06:27 <SlickNik> sgotliv: ++
18:06:27 <sushilkm> 200+ reviews, that's a large number of reviews ...
18:07:25 <edmondk> yeah nice work
18:07:27 <edmondk> everyone
18:08:09 <dougshelley66> definitely great to see the backlog clearing out
18:08:17 <vkmc> lot of progress :D
18:08:46 <sgotliv> dougshelley66, a backlog, it almost doesn't exist :-)
18:08:55 <SlickNik> Any other questions wrt the pulse numbers?
18:09:02 <amrith> that's only because we got oslo.messaging merged.
18:09:06 <amrith> that was holding up EVERYTHING.
18:09:12 <amrith> ;)
18:09:41 <SlickNik> Okay, let's move on.
18:10:01 <SlickNik> #topic Instance Storage for Replication
18:10:06 <SlickNik> vgnbkr: around?
18:10:11 <vgnbkr> Hi.
18:10:28 <vgnbkr> If everyone could have a quick look at the note I wrote:
18:10:30 <vgnbkr> https://etherpad.openstack.org/p/trove-replication-storage
18:10:45 <vgnbkr> I think we should wait on the metadata - any dissenters?
18:11:44 <SlickNik> vgnbkr: So I had one question — if the master is down/unavailable, couldn't you query a slave for the master's server-id?
18:12:13 <vgnbkr> I don't think so, unless we stored it on every slave.
18:13:01 <vgnbkr> Oh, it's not the server_id, it's the master's server_UUID.
18:15:41 <vgnbkr> georgelorch, am I correct that the master's UUID cannot be retrieved from a slave when the master is unreachable?
18:15:52 <SlickNik> FWIW — just looking over the options in that note — I'm not liking the ones that make the DB schema mysql-specific / mysql-aware
18:16:20 <vgnbkr> agreed - I'm just enumerating the options.
18:17:44 <georgelorch> vgnbkr, not 100% sure about that but I think you are correct, I don't recall seeing any explicit way to get the master UUID from a disconnected slave, but there may be 'tricks'
18:19:38 <SlickNik> so vgnbkr - barring some way of doing this by inspecting the replicated mysql schema / using a mysql function that does this, I'm liking the metadata option.
18:20:03 <georgelorch> there is SHOW SLAVE HOSTS but that is a master call...
18:21:02 <vgnbkr> Discussing SlickNik's question, it would also be possible to store it in a file on the guest.  How do we feel about storing info in files on the guest?
18:21:19 <vgnbkr> s/guest/slave guest/
18:21:27 <georgelorch> ohh wait vgnbkr, what about Master_UUID in show slave status, or does that go to NULL when the slave loses contact with the master?
18:22:01 * georgelorch quickly looks up docs...
18:22:58 <georgelorch> hmm, no details in the docs on behavior
18:23:09 <vgnbkr> OK, I'll check it out after the meeting...
18:23:28 <SlickNik> georgelorch / vgnbkr: maybe that's something that we can try out? That would be a good solution if it worked.
18:23:34 <SlickNik> vgnbkr: Sounds good.
18:23:50 <SlickNik> I'm not sure I want to go down the route of storing state information on the guests.
18:23:53 <vgnbkr> If not, what's the opinion on storing instance data in files on guests?
18:24:02 <vgnbkr> Never mind :-)
18:24:17 <vgnbkr> OK, Thanks.
18:24:52 <georgelorch> yeah vgnbkr, on a GTID enabled (5.6) slave, SHOW SLAVE STATUS _should_ have a Master_UUID field that is the server UUID of the master...I presume that the slave would have had to make successful contact with the master at least once in order to retrieve that value.
18:25:09 <vgnbkr> Oh, so what I'm taking away is: get it from "show slave status" or wait for metadata.
18:25:43 <vgnbkr> Thanks georgelorch , I'll check it out.
18:26:21 <georgelorch> vgnbkr, sure, let me know offline if you have any questions, I can dig into source or ask someone to make sure we get it right.
18:26:25 <SlickNik> Awesome, thanks georgelorch and vgnbkr.
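(A minimal sketch of the check vgnbkr agreed to run, for reference. On a GTID-enabled MySQL 5.6+ slave, SHOW SLAVE STATUS exposes a Master_UUID column; the open question is whether it stays populated after the master becomes unreachable. The host/user/password here are placeholders, and PyMySQL is just one convenient client.)

```python
import pymysql  # any MySQL client works; PyMySQL keeps the sketch short

# Placeholder connection details for a test slave.
conn = pymysql.connect(host='slave-host', user='root', password='secret',
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute('SHOW SLAVE STATUS')
    status = cur.fetchone()
    # Master_UUID is populated on GTID-enabled (5.6+) slaves; the test is
    # whether it remains populated once the master is unreachable.
    print('Master_UUID:', status and status.get('Master_UUID'))
    print('Slave_IO_Running:', status and status.get('Slave_IO_Running'))
```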
18:27:18 <SlickNik> #topic Different timeouts in configuration per datastore
18:27:27 <sushilkm> hello
18:27:59 <sushilkm> https://review.openstack.org/#/c/164640/
18:28:20 <SlickNik> #link https://review.openstack.org/#/c/164640/
18:28:34 <sushilkm> So, now with a number of experimental datastores around in trove
18:28:39 <SlickNik> sushilkm: Did you have a specific question / clarification around this?
18:29:07 <sushilkm> i wanted to get this discussed here
18:29:18 <sushilkm> as in reviews it came from peterstac and vkmc
18:30:13 <sushilkm> so this patchset aims at moving usage_timeout into the datastore configs instead of the default config
18:31:50 <sushilkm> i got review comments that this parameter keeps roaming between the datastore configs and the default ... and that this should be discussed in the meeting
18:32:21 <dougshelley66> sushilkm, I'm not against the change. I just thought it needed more visibility since just last July we did the reverse change
18:32:23 <vkmc> in the proposed PS a new option, usage_timeout, is added per datastore and we also keep it in the general options
18:33:01 <vkmc> IMO it makes sense to keep it per datastore, and also mark the option in general options as deprecated... for a while (to keep backwards compatibility)
18:34:14 <SlickNik> vkmc: I tend to agree with that. With multiple datastores, I've found I have to change that value back and forth depending on which datastore I want to deploy.
18:34:54 <vkmc> yeah
18:35:40 <SlickNik> I think at some point in the past, we moved to consolidate this into the default config for simplicity.
18:35:51 <vkmc> personally, I don't like to add more options... but in this case, it seems a good call
18:36:24 <sushilkm> along with usage_timeout, it would be worthwhile to move more options into the datastore configs, like cluster_usage_timeout and agent_call_..._timeout, for the same reasons
18:36:35 <amrith> so is the discussion only about usage_timeout or about timeouts in general?
18:36:49 <amrith> oh, perfect. thanks sushilkm that's where I wanted to go.
18:37:30 <vgnbkr> I think the better solution is to get rid of that type of timeout, and instead watch for agent heartbeats.
18:37:38 <amrith> my only question about this would be whether we want to just do usage_timeout now and do the rest post Kilo. or do we want to do the whole lot in Kilo.
18:38:49 <sushilkm> i would love to see them all managed in the datastore configs ... that would save making these changes for every deployment
18:39:13 <sushilkm> and it would also help in future with testing + deployment of new datastores
18:39:14 <SlickNik> amrith: It makes sense to tackle the two (usage + cluster_usage) together.
18:39:27 <amrith> SlickNik, in that case I have a different proposal
18:39:38 <amrith> have a common timeout and a per datastore timeout
18:39:52 <amrith> and the code can look for the per datastore timeout and if not found go to the common timeout
18:40:01 <amrith> this model (in general) for all timeouts would be a good one to adopt
18:40:14 <vgnbkr> I suggest agent_call_timeout not be included in such a change.  It's general purpose - if there are cases where a longer timeout is required, it should have some other timeout specified.
18:40:16 <amrith> it allows you to override on a per datastore basis while also having a default
18:40:23 <SlickNik> amrith: That sounds reasonable.
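(A minimal sketch of amrith's fallback model using oslo.config. The option and group names mirror the discussion, but the registration wiring is illustrative rather than Trove's actual code.)

```python
from oslo_config import cfg

CONF = cfg.CONF

# The common (default) timeout, as it exists today.
CONF.register_opts([
    cfg.IntOpt('usage_timeout', default=600,
               help='Seconds to wait for an instance to become ACTIVE.'),
])

# Per-datastore override with no default, so "unset" is detectable.
for datastore in ('mysql', 'redis', 'cassandra'):
    CONF.register_group(cfg.OptGroup(datastore))
    CONF.register_opts([
        cfg.IntOpt('usage_timeout',
                   help='Overrides the common usage_timeout when set.'),
    ], group=datastore)


def get_usage_timeout(datastore_manager):
    # Prefer the per-datastore value; fall back to the common one.
    override = CONF[datastore_manager].usage_timeout
    return override if override is not None else CONF.usage_timeout
```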
18:40:34 <amrith> but then there is vgnbkr's suggestion of doing away with timeouts which I think is much better.
18:40:44 <amrith> and worth looking at before we make this change.
18:41:07 <SlickNik> vgnbkr: I also like your idea of watching for heartbeats and not polling for ACTIVE — perhaps this is something that a scheduled task in the conductor could do?
18:41:24 <SlickNik> I sense a Liberty BP here.
18:42:04 <amrith> SlickNik, yes. my crystal ball has a picture of a blueprint fluttering around
18:42:21 <amrith> but not of a scheduled task
18:42:29 <amrith> but rather a proper implementation of a state machine
18:42:36 <amrith> for these kinds of state transitions.
18:43:29 <vgnbkr> amrith, that's way more than I was thinking.  I was just thinking that the poll_until would not time out if it has seen a heartbeat within 60 seconds or so.
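(A rough sketch of what vgnbkr describes: defer the poll timeout while the guest agent is still heartbeating. is_active and last_heartbeat are hypothetical callables standing in for an instance-status check and a lookup of the conductor-maintained agent heartbeat; Trove's real poll_until in trove.common.utils is shaped differently.)

```python
import time

HEARTBEAT_GRACE = 60  # keep waiting while the agent heartbeated this recently


def poll_until_active(is_active, last_heartbeat, timeout=600, sleep_time=3):
    """Wait for ACTIVE, but only enforce the timeout once the guest
    agent has also stopped heartbeating."""
    deadline = time.time() + timeout
    while not is_active():
        now = time.time()
        if now > deadline and (now - last_heartbeat()) > HEARTBEAT_GRACE:
            raise RuntimeError('instance never went ACTIVE and the guest '
                               'agent stopped heartbeating')
        time.sleep(sleep_time)
```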
18:43:30 <SlickNik> amrith: That's another BP — I don't think we have a state machine mechanism at all right now.
18:44:10 <SlickNik> I'd probably keep the two BPs separate. There are other parts of trove that would benefit from using an FSM model.
18:44:30 <amrith> I'm definitely in favor of a FSM for this
18:44:40 <amrith> but it is definitely a far cry from the current timeout mechanism.
18:44:56 <amrith> I was thinking we'd just wrap the whole thing into the review of this bug and hand it to sushilkm ;)
18:45:27 <amrith> for kilo-2
18:45:45 <SlickNik> hah I'm sure sushilkm was thinking of exactly this when he penciled this into the agenda. :)
18:46:26 <amrith> so SlickNik to be clear ... I'll make sure a bp for this comes along.
18:46:45 <SlickNik> So 1. Short term, I think datastore specific timeouts falling back to the default makes sense.
18:46:57 <sushilkm> +1
18:47:04 <peterstac> amrith: I'm confused, kilo-2 was Feb 5  ...
18:47:06 <amrith> this fsm implementation will have to be later, like liberty
18:47:33 <SlickNik> 2. Long term, let's try and get rid of this polling usage timeout altogether => amrith will follow up with a BP for Liberty.
18:47:53 <amrith> SlickNik, either dougshelley66 or vgnbkr or I
18:47:57 <amrith> yes
18:48:03 <sushilkm> do we have any thoughts about agent_call_..._timeout?
18:48:17 <amrith> I agree with vgnbkr  on that
18:48:29 <amrith> don't include it in the change
18:48:37 <amrith> if you want a longer timeout, find one.
18:49:32 <amrith> so #1 for now is only usage_timeout
18:49:36 <amrith> and cluster_usage_timeout
18:49:36 <sushilkm> find one implies updating those parameters in the conf
18:49:48 <amrith> sushilkm, yes.
18:49:52 <sushilkm> ok fine
18:50:13 <vgnbkr> sushilkm, I imagine that each case where agent_call_.._timeout is insufficient would warrant a new config value.
18:50:50 <amrith> personally, I also feel that the generic timeout (what are they called, short timeout and long timeout) should be outlawed
18:51:06 <amrith> but when we get to at least watching for heartbeats we can head in that direction
18:51:15 <SlickNik> vgnbkr / sushilkm: And that config value would probably be datastore specific as well.
18:51:40 <vgnbkr> SlickNik, likely
18:51:47 <sushilkm> SlickNik, yes which is why i wanted those call_timeouts to be datastore specific too :)
18:52:05 <SlickNik> amrith: those are RPC timeouts, so watching for heartbeats won't help get rid of those, I don't think.
18:52:50 <sushilkm> since right now guestagent/api makes its calls based on those timeouts
18:53:19 <sushilkm> which is again a headache for different datastores, as some datastores will obviously take longer than others
18:53:23 <SlickNik> sushilkm: What vgnbkr and I are saying is that if there's a datastore specific call that's causing you to go beyond the bounds of those timeouts (like snapshot provisioning) you probably want to call that out as a separate timeout that would be datastore specific and not overload those two.
18:53:48 <vgnbkr> SlickNik, +1
18:54:22 <sushilkm> okies
18:54:42 <SlickNik> Okay, so I think we have some traction on this one.
18:55:01 <SlickNik> Let's move on
18:55:06 <SlickNik> #topic Open Discussion
18:55:06 <amrith> #action sushilkm to make usage_timeout and cluster_usage_timeout datastore specific
18:55:24 <amrith> #action amrith (or dougshelley66 or vgnbkr) to propose bp for FSM in Liberty
18:55:55 <amrith> ./
18:56:10 <SlickNik> amrith: go for it
18:56:35 <amrith> for kilo
18:56:40 <amrith> do we want to do the oslo.log thing?
18:57:41 <amrith> currently we have the trove-integration part of it ready to merge
18:57:56 <amrith> once that gets in and I can rebuild the guest images, there's a chance to get the trove change to work
18:58:04 <amrith> but the question is do we want all this for Kilo?
18:58:10 <amrith> we don't *HAVE TO*
18:58:34 <SlickNik> amrith: Context on the oslo.log thing?
18:58:50 <amrith> #link https://review.openstack.org/#/c/162676/
18:59:02 <amrith> #link https://review.openstack.org/#/c/162677/
18:59:17 <amrith> this is a module being deprecated
18:59:24 <amrith> oslo-incubator logging vs. oslo.log
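(The gist of the change under review: swap the deprecated oslo-incubator logging module carried in-tree for the graduated oslo.log library. The full patches also touch requirements and option registration; this shows only the import swap, and the call site is illustrative.)

```python
# Before: the deprecated oslo-incubator copy carried in the trove tree
#   from trove.openstack.common import log as logging
# After: the graduated oslo.log library
from oslo_log import log as logging

LOG = logging.getLogger(__name__)
LOG.info('Call sites stay the same; only the import changes.')
```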
19:00:24 <SlickNik> trove-integration doesn't follow a release cycle, so I'm good with merging the first change — shouldn't hurt and will make sure that the images we build have the dependency once the second change does merge.
19:00:33 <amrith> OK, let me be more explicit. Let us punt this to Liberty.
19:00:40 <amrith> the whole thing.
19:00:46 <amrith> let's move to #openstack-trove
19:00:49 <amrith> time's up
19:01:05 <SlickNik> sounds good.
19:01:08 <SlickNik> #endmeeting