#openstack-meeting log

00:00:21 <thinrichs> #startmeeting CongressTeamMeeting
00:00:22 <openstack> Meeting started Thu Aug 13 00:00:21 2015 UTC and is due to finish in 60 minutes.  The chair is thinrichs. Information about MeetBot at http://wiki.debian.org/MeetBot.
00:00:22 <pballand> hi
00:00:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
00:00:24 <masahito> Hi
00:00:26 <openstack> The meeting name has been set to 'congressteammeeting'
00:00:30 <veena> Hi
00:00:30 <Yingxin1> hi
00:00:31 <RuiChen> hola, guys
00:00:39 <jwy> hello
00:00:55 <thinrichs> We've got a good crew here right on time.  So let's get started.
00:00:56 <Prakash_DataTap> #info Prakash Kanthi
00:01:18 <thinrichs> Let's start with a quick recap of last week's mid-cycle sprint.
00:01:22 <thinrichs> #topic Mid-cycle sprint
00:01:39 <thinrichs> We had 9-10 people working hard on a new distributed architecture.
00:01:54 <thinrichs> I sent out an email to the mailing list with details.
00:02:14 <thinrichs> But the short version is that we decided on...
00:02:23 <thinrichs> running each datasource driver in its own process;
00:02:30 <thinrichs> running each policy engine in its own process;
00:02:36 <thinrichs> running the API in its own process;
00:02:45 <thinrichs> and having them all communicate using oslo.messaging.
00:03:16 <thinrichs> There were also plans for generalizing that, as the need arises, to enable multiple datasources/policy engines to run in a single process.
00:03:40 <thinrichs> Comments/questions?  (I'll try to dig up a pointer to the email I sent in the meantime.)
00:04:07 <thinrichs> #link http://lists.openstack.org/pipermail/openstack-dev/2015-August/071653.html
00:04:20 <RuiChen> will we support multiple workers for the same datasource driver?
00:05:00 <thinrichs> What does multi-workers mean?
00:05:12 <RuiChen> multi processes
00:05:43 <thinrichs> I suppose that within a single process we could eventually multi-thread the datasource driver code.
00:05:46 <RuiChen> like nova-conductor
00:06:15 <RuiChen> ok, get it
00:06:17 <thinrichs> I don't know nova-conductor
00:06:39 <thinrichs> But we don't expect to ever need more than 1 datasource driver process per datasource.
00:06:43 <thinrichs> Even for high-availability.
00:06:44 <Yingxin1> multiple policy engines means HA? There is only one engine right now
00:07:09 <thinrichs> If one datasource driver process crashes, we just bring up another one, and it'll pull data immediately.
00:07:35 <thinrichs> Yingxin1: either HA or if we end up with different kinds of policy engines (such as the vm-placement policy engine we experimented with)
00:07:36 <Yingxin1> ok
00:08:03 <thinrichs> For policy-engines, we WILL need multiple replicas (multi-master) to handle HA and high query throughput.
00:08:15 <thinrichs> Each of those policy-engines will run in its own process.
00:08:38 <thinrichs> For HA we'll put those policy-engine processes on different boxes.
00:09:00 <thinrichs> Though you could imagine having multiple policy-engine processes on the same box if all you wanted was high query-throughput.
00:09:28 <thinrichs> Before I forget, there are a bunch of notes on the etherpad.
00:09:30 <thinrichs> #link https://etherpad.openstack.org/p/congress-liberty-sprint
00:09:54 <Yingxin1> get it
00:10:15 <thinrichs> Other questions/comments/suggestions?
00:11:00 <masahito> We try to implement it by Liberty, right?
00:11:37 <masahito> it's just a confirmation for me and others who didn't attend the meet-up.
00:11:42 <thinrichs> Feature-freeze (liberty-3) is Sept 1-3, so there's no real way I see getting it done by liberty.
00:12:34 <thinrichs> For liberty, we will release the existing architecture.  For M we'll release the new distributed architecture.
00:13:11 <pballand> I do remember agreeing that we will try to have the code ready by the liberty summit though
00:13:50 <pballand> (I assumed that meant we would have a distributed version available on an alpha basis, but the full existing functionality still working)
00:13:58 <thinrichs> pballand: Oh right.  The goal was to have the first draft ready in master by the summit.
00:14:09 <thinrichs> And then make it part of the release in M.
00:14:40 <masahito> Thank you. (when I re-read the etherpad there was no description of the timeline, so I wanted to confirm it)
00:15:06 <thinrichs> With that, maybe it's time to move on to a discussion about the work-items we produced.
00:15:09 <thinrichs> #topic Blueprints
00:15:29 <thinrichs> The blueprints that came out of the meeting all start with dist-
00:15:36 <thinrichs> #link https://blueprints.launchpad.net/congress
00:15:51 <thinrichs> They're all Medium priority.
00:16:03 <thinrichs> (Typically High priority are the ones we're targeting for the current release.)
00:16:49 <thinrichs> About half of the dist- blueprints don't have assignees.
00:17:22 <thinrichs> A good number of them are for migrating different API modules to use the RPC-style of interaction with the policy engine/datasource drivers.
00:17:42 <thinrichs> All the dist-api- prefixed ones are for the api.
00:17:50 <thinrichs> Those are all fairly small and self-contained.
00:18:44 <thinrichs> Please make sure to sign up before actually starting the work
00:18:50 <thinrichs> so that we don't duplicate effort
00:19:19 <masahito> thinrichs: I can't change assignee because I'm not core.
00:20:26 <thinrichs> masahito: Send me an email and I'll sign you up.
00:20:29 <masahito> thinrichs: I think it's good to send a mail to the ML if someone decides to implement it.
00:20:43 <masahito> thinrichs: OK
00:20:58 <thinrichs> masahito: that sounds good, but I'm guessing people will forget.
00:21:10 <Yingxin1> Have we defined a uniform RPC interface, to make dist-api- easier to implement?
00:21:13 <thinrichs> Or won't want to tell everyone they're signing up.
00:21:22 <RuiChen> masahito: I think it's a good idea
00:21:59 <thinrichs> Yingxin1: that's a good thought.  Let's discuss that in a couple of minutes.
00:22:14 <pballand> Yingxin1: I am working on the RPC interface as part of the base class in ‘dist-cross-process-dse’
00:23:00 <pballand> thinrichs: ok, will wait to discuss until you say it’s time
00:23:30 <thinrichs> Question: can anyone else change assignees on blueprints?  Or can you only change the assignee on blueprints that you have created?
00:23:33 <Yingxin1> pballand: I'll have a look at it
00:23:58 <RuiChen> maybe we should add the dependencies between the dist-* blueprints
00:24:10 <masahito> thinrichs: I can only change the assignee on blueprints I've created.
00:24:25 <Yingxin1> masahito: agreed
00:24:40 <thinrichs> masahito: thanks.  I don't see a way to let anyone change the assignee on my blueprints.
00:24:43 <masahito> thinrichs: and specifically, I can only change it to myself.
00:25:27 <thinrichs> Can someone try this one:
00:25:29 <thinrichs> #link https://blueprints.launchpad.net/congress/+spec/dist-api-rpcify-row
00:25:53 <thinrichs> I set it to Approved and set an Approver.  If that doesn't work, I don't see another way.
00:26:56 <RuiChen> I can try it on dist-api-rpcify-row
00:27:02 <masahito> Your setting looks like it works.
00:28:00 <masahito> launchpad allows me to change the whiteboard and work items.
00:28:17 <thinrichs> I don't see anyone's name showing up under assignee.
00:28:40 <thinrichs> It seems only the owner/core can change that.
00:28:41 <Yingxin1> thinrichs: I still cannot change the Assignee
00:28:47 <thinrichs> Yingxin1: thanks for trying.
00:28:52 <thinrichs> Let's move on.
00:28:57 <thinrichs> #topic RPC interface
00:29:08 <thinrichs> pballand says he's working on something
00:29:40 <thinrichs> I'm guessing it's this:
00:29:42 <thinrichs> #link https://review.openstack.org/#/c/210159/
00:29:56 <pballand> I’ve been testing out various strategies for DseNode using oslo.messaging primitives
00:30:01 <thinrichs> I did some of the preliminary work for the policy-model API
00:30:25 <thinrichs> #link https://review.openstack.org/#/c/210691/
00:30:36 <thinrichs> pballand: please continue—didn't mean to interrupt
00:30:36 <pballand> and also working on the spec for ‘dist-cross-process-dse’
00:30:59 <pballand> the spec isn’t pushed for review yet, because I wanted to test out some things before proposing a design
00:31:22 <pballand> I’ve found one major shortcoming in oslo.messaging
00:31:49 <pballand> it seems that the message bus connection is managed automatically (including reconnects)… this is good
00:32:07 <pballand> the problem is that the application logic isn’t notified when the message bus is disconnected
00:32:28 <pballand> this presents a problem with the design we outlined in the midcycle
00:32:55 <pballand> if a node is disconnected, and oslo.messaging reconnects, the dse doesn’t know that it may have missed messages
00:33:11 <alexsyip> We can detect that with sequence numbers.
00:33:27 <pballand> alexsyip: yes - more on that in a sec
00:34:04 <pballand> I am working on two solutions: 1) chatted with some of the oslo.messaging folks, and am going to send an email to the mailing list to propose getting a trigger for connections and disconnections
00:34:29 <pballand> 2) as alexsyip said: we can use sequence numbers to detect gaps
00:35:05 <pballand> sequence numbers don’t work for services that aren’t sending updates, however, so we will need to have periodic heartbeats
00:35:06 <alexsyip> Does oslo messaging lose messages in any other situation?
00:35:22 <alexsyip> Sometimes these messaging systems will drop messages under overload conditions.
00:35:46 <pballand> alexsyip: I have yet to determine that, however it seems that the solution in 2) will handle that case as well
00:36:20 <alexsyip> Ok.  The clone pattern is meant to deal with lost messages: http://zguide.zeromq.org/php:chapter5#Reliable-Pub-Sub-Clone-Pattern
00:36:39 <pballand> so I’m currently thinking we will ship with design 2, and changes in oslo.messaging will be an optimization if/when they come
00:36:52 <alexsyip> sounds good to me.
00:37:16 <pballand> I am working on a trial of this using oslo.messaging’s RPC interface, and hope to publish a spec by the end of the week
00:37:18 <pballand> that’s it from me
00:37:55 <thinrichs> pballand: sounds good to me too.
00:37:59 <masahito> pballand: sounds great.
00:39:27 <Yingxin1> :)
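
The sequence-number and heartbeat idea discussed above (design 2) could be sketched roughly as follows. This is only an illustration under assumptions: `Publisher`, `Subscriber`, the message dictionaries, and the `snapshot`/`resync` methods are hypothetical names, not the DseNode API and not part of oslo.messaging. The point is that an update with a skipped number reveals a gap, and a heartbeat carrying the latest number reveals a gap even when the sender has gone quiet.

```python
class Publisher:
    """Hypothetical service on the bus that numbers every update it sends."""

    def __init__(self):
        self.seq = 0
        self.rows = []

    def update(self, row):
        # Each update carries the next sequence number.
        self.seq += 1
        self.rows.append(row)
        return {"type": "update", "seq": self.seq, "row": row}

    def heartbeat(self):
        # A heartbeat carries no data, only the latest sequence number.
        return {"type": "heartbeat", "seq": self.seq}

    def snapshot(self):
        # Full data, used by a subscriber that has detected a gap.
        return {"seq": self.seq, "rows": list(self.rows)}


class Subscriber:
    """Hypothetical receiver that detects missed messages."""

    def __init__(self):
        self.last_seq = 0
        self.rows = []
        self.needs_resync = False

    def receive(self, msg):
        if self.needs_resync:
            return  # ignore incremental traffic until a full resync
        if msg["type"] == "heartbeat":
            # The sender is ahead of us, so at least one update was lost.
            if msg["seq"] > self.last_seq:
                self.needs_resync = True
            return
        if msg["seq"] <= self.last_seq:
            return  # duplicate delivery, already applied
        if msg["seq"] != self.last_seq + 1:
            self.needs_resync = True  # a sequence number was skipped
            return
        self.rows.append(msg["row"])
        self.last_seq = msg["seq"]

    def resync(self, snapshot):
        self.rows = list(snapshot["rows"])
        self.last_seq = snapshot["seq"]
        self.needs_resync = False


pub, sub = Publisher(), Subscriber()
sub.receive(pub.update("row-1"))
pub.update("row-2")              # sent, but lost on the bus
sub.receive(pub.heartbeat())     # the heartbeat exposes the gap
if sub.needs_resync:
    sub.resync(pub.snapshot())
```

In this sketch the subscriber falls back to pulling full data once it notices a gap, which matches the "send updates and send full data" split mentioned for the base class; how a real node would request that full data is left open here.
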
00:39:39 <thinrichs> Am I right in thinking that this sequence-number issue will be handled at a lower layer than the api-models would worry about?
00:40:11 <thinrichs> That is, when doing the api-modifications we can assume RPC is reliable, right?
00:40:48 <pballand> thinrichs: that’s right - I expect the DseNode class will be a parent for all services on the bus, and it will contain methods that send updates and send full data - the base class will manage adding metadata such as sequence numbers
00:40:50 <alexsyip> you can’t ever expect RPCs to be reliable.
00:41:12 <alexsyip> unless you receive an ack.
00:41:20 <pballand> (that’s right was for the first message, agree with alexsyip’s comment)
00:42:04 <Yingxin1> So I think there could be an ack or a timeout.
00:42:08 <pballand> a given table will have in-order updates up to some point, without the caller needing to worry about sequence numbers
00:42:24 <thinrichs> So when someone is writing the API model that inserts a rule, and we send off an API call but don't hear back, what do we do?
00:42:31 <pballand> but when making an explicit RPC to a service, you need an ack or timeout as Yingxin1 says
00:43:03 <thinrichs> Does the behavior depend on the API call?
00:43:07 <alexsyip> You can ask to see if the rule exists.
00:43:11 <pballand> thinrichs: my initial thought would be to throw a 503 in that case
00:43:24 <thinrichs> pballand: but what if the rule actually got inserted?
00:43:30 <pballand> the caller won’t know if the call succeeded or not, but that’s a common problem
00:43:54 <pballand> the caller can check for the rule, or try the insert again (if/when we support idempotent create)
00:44:04 <thinrichs> pballand: so you're saying never retry—it's the user's problem.
00:44:23 <pballand> thinrichs: well, we retry internally, but only up to some time limit
00:44:46 <pballand> ultimately, it’s always the user’s problem (there can be disconnections other places in the line)
00:44:51 <thinrichs> pballand: is that rolled into the base class, or does the retry logic depend on the particulars of the API call?
00:45:03 <thinrichs> I'm just trying to figure out what abstraction we need to use when modifying the API models.
00:45:12 <veena> thinrichs: yes, there should be a time out limit to handle a request message
00:45:31 <pballand> oslo.messaging has support for that internally
00:45:42 <pballand> #link: I am working on two solutions: 1) chatted with some of the
00:45:45 <pballand> oops
00:45:52 <pballand> #link http://docs.openstack.org/developer/oslo.messaging/rpcclient.html#oslo_messaging.RPCClient.call
00:46:26 <thinrichs> So I should be treating the RPC method implemented in the DseNode base class as something that either returns a value/ack or that times out.
00:46:37 <thinrichs> And then if there's a time-out I return a 503.
00:46:55 <pballand> (from the docs, however, it doesn’t look like the call timeout is configurable, so we may need to implement some more logic)
00:47:19 <veena> pballand: it is configurable
00:47:31 <thinrichs> Lost track of the time.
00:47:39 <thinrichs> I wanted to make 1 quick comment before moving on.
00:47:45 <pballand> thinrichs: I would treat it as synchronous, but could raise MessagingTimeout, RemoteError, MessageDeliveryFailure
00:47:51 <pballand> veena: thanks
00:48:00 <thinrichs> pballand: sounds good.
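
The behaviour agreed above (treat the RPC as synchronous, retry internally up to a time limit, and return a 503 on timeout) might look roughly like the sketch below. Everything here is an assumption made for illustration: `RpcTimeout` stands in for an exception such as oslo.messaging's MessagingTimeout, and `call_with_deadline` and `insert_rule` are hypothetical helpers, not Congress or oslo.messaging functions.

```python
import time


class RpcTimeout(Exception):
    """Stand-in for a timeout such as oslo.messaging's MessagingTimeout."""


def call_with_deadline(rpc_call, deadline_seconds, retry_delay=0.0):
    """Retry rpc_call on timeout until deadline_seconds have elapsed.

    Retrying is only safe if the remote operation is idempotent; for a
    non-idempotent insert the caller cannot tell whether it succeeded.
    """
    give_up_at = time.monotonic() + deadline_seconds
    while True:
        try:
            return rpc_call()
        except RpcTimeout:
            if time.monotonic() + retry_delay >= give_up_at:
                raise
            time.sleep(retry_delay)


def insert_rule(rpc_call, deadline_seconds=1.0):
    """Hypothetical API-model method returning (HTTP status, body)."""
    try:
        result = call_with_deadline(rpc_call, deadline_seconds)
    except RpcTimeout:
        # No ack arrived in time: the rule may or may not have been inserted.
        return 503, {"error": "policy engine did not respond"}
    return 201, result


def unresponsive_engine():
    raise RpcTimeout()


status, body = insert_rule(unresponsive_engine, deadline_seconds=0.0)
```

On a 503 the caller is left to check whether the rule exists or to try again, as described in the discussion.
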
00:48:14 <thinrichs> In the first edits I made to the policy-model,
00:48:16 <thinrichs> #link https://review.openstack.org/#/c/210691/
00:48:20 <thinrichs> I did 2 things:
00:48:43 <thinrichs> 1. introduced a self.rpc to mimic the one that will belong to the DSENode and made all communication go through that.
00:49:02 <thinrichs> (The implementation is just invoking the policy-engine's methods directly.)
00:49:15 <thinrichs> 2. I moved the database logic out of the API model and into the policy-engine.
00:49:35 <thinrichs> So instead of the API keeping the database and policy engine synchronized, that's left to the policy engine itself.
00:50:10 <thinrichs> None of that relies on having DseNode ready, so we can do all that in parallel with pballand's efforts.
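
A rough sketch of the two changes just described, under assumptions: `PolicyEngine`, `PolicyModel`, `insert_rule`, and `add_item` are hypothetical names chosen for illustration and are not taken from the review linked above. The API model sends everything through `self.rpc`, which for now invokes the engine directly, and the engine (not the API) owns persistence.

```python
class PolicyEngine:
    """Hypothetical engine that validates rules and persists them itself."""

    def __init__(self):
        self.db = []       # stand-in for the database table of rules
        self.rules = []    # in-memory rules used for evaluation

    def insert_rule(self, policy, rule):
        if not rule.strip():
            raise ValueError("empty rule")
        # The engine keeps its own state and the database in step.
        self.db.append((policy, rule))
        self.rules.append((policy, rule))
        return {"policy": policy, "rule": rule}


class PolicyModel:
    """Hypothetical API model that only talks to the engine through rpc()."""

    def __init__(self, engine):
        self._engine = engine

    def rpc(self, method, **kwargs):
        # In-process stand-in: call the engine method directly. Later this
        # could be replaced by a real call over the message bus.
        return getattr(self._engine, method)(**kwargs)

    def add_item(self, policy, rule):
        return self.rpc("insert_rule", policy=policy, rule=rule)


engine = PolicyEngine()
model = PolicyModel(engine)
created = model.add_item("classification", "p(x) :- q(x)")
```

Because the model never touches the engine or the database except through `rpc`, swapping the stand-in for a bus-backed implementation should not require changes to `add_item`.
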
00:50:24 <thinrichs> masahito: you had a question about whether (2) makes sense.
00:50:41 <masahito> thinrichs: yap.
00:51:02 <thinrichs> The question was whether the API should directly talk to the database or whether the API should always talk to one of the other processes to answer questions.
00:51:11 <thinrichs> Is that right?
00:51:14 <masahito> yes
00:51:18 <alexsyip> thinrichs: you could just write directly to the db from the API.  Then wait for the policy engine to read from the DB.
00:51:54 <alexsyip> My hunch is that it’s better to talk directly to the db.
00:52:00 <thinrichs> alexsyip: understood, but in the PE case, only the PE knows whether a new rule can be inserted so we need to talk to the PE anyway.
00:53:22 <masahito> IMO, writing access to db is only permitted for PE, reading DB is permitted for all.
00:53:23 <alexsyip> I think it’s not enough to ask the PE
00:53:47 <alexsyip> because there may be two writers going through different PEs that make conflicting writes.
00:54:24 <alexsyip> Oh, maybe not.
00:54:58 <alexsyip> Are you saying the PE writes the transaction, but only the PE knows under what conditions to write the transaction?
00:55:38 <masahito> alexsyip: maybe yes
00:55:50 <thinrichs> alexsyip: yes.  The PE is the only one that knows syntactically valid statements, and whether adding a statement would cause cycles.
00:56:09 <thinrichs> Lost track of time again.  Let's put this on hold and see if anyone else needs help from the group.
00:56:13 <thinrichs> #topic open discussion
00:56:16 <alexsyip> Are there multiple PEs with different ideas of what is valid?
00:57:31 <thinrichs> Sorry we ran so short on time this week, everyone.
00:57:49 <thinrichs> Lots of energy for the new architecture!
00:57:54 <thinrichs> Great to see!
00:58:06 <thinrichs> No one has anything to ask?
00:58:52 <RuiChen> pleased to see the new architecture in plan
00:58:58 <RuiChen> no from me
00:59:09 <thinrichs> alexsyip: if there were a bunch of rules additions coming in close together, then 2 PEs could get out of sync and then evaluate whether a new rule would create a cycle, I suppose.
00:59:17 <thinrichs> I'm guessing that's going to be rare, but possible.
00:59:40 <thinrichs> It wouldn't be until the sync that the two realized the rules in the DB were actually recursive.
00:59:44 <thinrichs> And so not permitted.
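
The race described above can be illustrated with a small sketch. It assumes a simplified model in which each rule is a pair of a head table and the list of body tables it depends on; `has_cycle` is a hypothetical helper, not the Congress implementation. Each policy engine checks its new rule against the rules it already knows, both checks pass, and only the combined set turns out to be recursive.

```python
def has_cycle(rules):
    """Return True if the table dependency graph of rules has a cycle.

    Each rule is (head_table, [body_tables]).
    """
    graph = {}
    for head, body in rules:
        graph.setdefault(head, set()).update(body)

    visiting, done = set(), set()

    def visit(table):
        if table in done:
            return False
        if table in visiting:
            return True  # reached a table already on the current path
        visiting.add(table)
        found = any(visit(dep) for dep in graph.get(table, ()))
        visiting.discard(table)
        done.add(table)
        return found

    return any(visit(table) for table in list(graph))


shared = [("p", ["q"])]       # rules both engines already have
rule_a = ("q", ["r"])         # arrives at the first engine
rule_b = ("r", ["p"])         # arrives at the second engine at about the same time

ok_at_engine_1 = not has_cycle(shared + [rule_a])
ok_at_engine_2 = not has_cycle(shared + [rule_b])
recursive_after_sync = has_cycle(shared + [rule_a, rule_b])
```
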
01:00:00 <thinrichs> RuiChen: agreed.
01:00:04 <alexsyip> Yes, so that means the PE is not able to really evaluate at API time.
01:00:15 <thinrichs> Thanks all—we're officially out of time.
01:00:21 <thinrichs> I can continue on #congress for a few minutes.
01:00:25 <thinrichs> #endmeeting