00:00:21 #startmeeting CongressTeamMeeting
00:00:22 Meeting started Thu Aug 13 00:00:21 2015 UTC and is due to finish in 60 minutes. The chair is thinrichs. Information about MeetBot at http://wiki.debian.org/MeetBot.
00:00:22 hi
00:00:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
00:00:24 Hi
00:00:26 The meeting name has been set to 'congressteammeeting'
00:00:30 Hi
00:00:30 hi
00:00:31 hola, guys
00:00:39 hello
00:00:55 We've got a good crew here right on time. So let's get started.
00:00:56 #info Prakash Kanthi
00:01:18 Let's start with a quick recap of last week's mid-cycle sprint.
00:01:22 #topic Mid-cycle sprint
00:01:39 We had 9-10 people working hard on a new distributed architecture.
00:01:54 I sent out an email to the mailing list with details.
00:02:14 But the short version is that we decided on...
00:02:23 running each datasource driver in its own process;
00:02:30 running each policy engine in its own process;
00:02:36 running the API in its own process;
00:02:45 and having them all communicate using oslo.messaging.
00:03:16 There were also plans for generalizing that, as the need arises, to enable multiple datasources/policy engines to run in a single process.
00:03:40 Comments/questions? (I'll try to dig up a pointer to the email I sent in the meantime.)
00:04:07 #link http://lists.openstack.org/pipermail/openstack-dev/2015-August/071653.html
00:04:20 will we support multiple workers on the same datasource driver?
00:05:00 What does multi-workers mean?
00:05:12 multiple processes
00:05:43 I suppose that within a single process we could eventually multi-thread the datasource driver code.
00:05:46 like nova-conductor
00:06:15 ok, got it
00:06:17 I don't know nova-conductor
00:06:39 But we don't expect to ever need more than 1 datasource driver process per datasource.
00:06:43 Even for high-availability.
00:06:44 multiple policy engines means HA?
There is only one engine right now
00:07:09 If one datasource driver process crashes, we just bring up another one, and it'll pull data immediately.
00:07:35 Yingxin1: either HA or if we end up with different kinds of policy engines (such as the vm-placement policy engine we experimented with)
00:07:36 ok
00:08:03 For policy-engines, we WILL need multiple replicas (multi-master) to handle HA and high query throughput.
00:08:15 Each of those policy-engines will run in its own process.
00:08:38 For HA we'll put those policy-engine processes on different boxes.
00:09:00 Though you could imagine having multiple policy-engine processes on the same box if all you wanted was high query-throughput.
00:09:28 Before I forget, there are a bunch of notes on the etherpad.
00:09:30 #link https://etherpad.openstack.org/p/congress-liberty-sprint
00:09:54 got it
00:10:15 Other questions/comments/suggestions?
00:11:00 We'll try to implement it by Liberty, right?
00:11:37 it's just a confirmation for me and others who didn't attend the meet-up.
00:11:42 Feature-freeze (liberty-3) is Sept 1-3, so there's no real way I see getting it done by Liberty.
00:12:34 For Liberty, we will release the existing architecture. For M we'll release the new distributed architecture.
00:13:11 I do remember agreeing that we will try to have the code ready by the Liberty summit though
00:13:50 (I assumed that meant we would have a distributed version available on an alpha basis, but with the full existing functionality still working)
00:13:58 pballand: Oh right. The goal was to have the first draft ready in master by the summit.
00:14:09 And then make it part of the release in M.
00:14:40 Thank you. (when I re-read the etherpad there was no description of the timeline, so I wanted to confirm it)
00:15:06 With that, maybe it's time to move on to a discussion about the work-items we produced.
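[Editor's note: the process layout decided at the sprint (one process per datasource driver, one per policy engine, one for the API, all communicating over oslo.messaging) can be sketched with a toy in-process bus. Everything here (Bus, DatasourceService, PolicyService, the "nova" target name) is a hypothetical stand-in for illustration, not the real DSE code.]

```python
class Bus:
    """Toy message bus: routes named RPC calls to registered services.
    In the real design, oslo.messaging plays this role across processes."""
    def __init__(self):
        self._services = {}

    def register(self, name, service):
        self._services[name] = service

    def call(self, target, method, **kwargs):
        # Synchronous request/response, like oslo.messaging RPC.
        return getattr(self._services[target], method)(**kwargs)


class DatasourceService:
    """Would run in its own process; owns one datasource's tables."""
    def __init__(self):
        self.rows = [("vm1", "active")]

    def get_rows(self, table):
        # Toy version ignores the table name and returns everything.
        return self.rows


class PolicyService:
    """Policy engine: reaches datasources only through the bus."""
    def __init__(self, bus):
        self.bus = bus

    def snapshot(self, datasource, table):
        return self.bus.call(datasource, "get_rows", table=table)


bus = Bus()
bus.register("nova", DatasourceService())
pe = PolicyService(bus)
rows = pe.snapshot("nova", "servers")  # crosses the (toy) bus
```

The point of the indirection is that `PolicyService` never imports or touches `DatasourceService` directly, so the two can later live in separate processes with only the bus implementation changing.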
00:15:09 #topic Blueprints
00:15:29 The blueprints that came out of the meeting all start with dist-
00:15:36 #link https://blueprints.launchpad.net/congress
00:15:51 They're all Medium priority.
00:16:03 (Typically High priority are the ones we're targeting for the current release.)
00:16:49 About half of the dist- blueprints don't have assignees.
00:17:22 A good number of them are for migrating different API modules to use the RPC-style of interaction with the policy engine/datasource drivers.
00:17:42 All the dist-api- prefixed ones are for the API.
00:17:50 Those are all fairly small and self-contained.
00:18:44 Please make sure to sign up before actually starting the work
00:18:50 so that we don't duplicate effort
00:19:19 thinrichs: I can't change the assignee because I'm not core.
00:20:26 masahito: Send me an email and I'll sign you up.
00:20:29 thinrichs: I think it's good to send a mail to the ML if someone decides to implement one.
00:20:43 thinrichs: OK
00:20:58 masahito: that sounds good, but I'm guessing people will forget.
00:21:10 Have we defined a uniform RPC interface, to make dist-api- easier to implement?
00:21:13 Or won't want to tell everyone they're signing up.
00:21:22 masahito: I think it's a good idea
00:21:59 Yingxin1: that's a good thought. Let's discuss that in a couple of minutes.
00:22:14 Yingxin1: I am working on the RPC interface as part of the base class in 'dist-cross-process-dse'
00:23:00 thinrichs: ok, will wait to discuss until you say it's time
00:23:30 Question: can anyone else change assignees on blueprints? Or can you only change the assignee on blueprints that you have created?
00:23:33 pballand: I'll have a look at it
00:23:58 maybe we should add the dependencies between the dist-* blueprints
00:24:10 thinrichs: I can only change the assignee on blueprints I've created.
00:24:25 masahito: agreed
00:24:40 masahito: thanks. I don't see a way to let anyone change the assignee on my blueprints.
00:24:43 thinrichs: and even then, I can only change it to myself.
00:25:27 Can someone try this one:
00:25:29 #link https://blueprints.launchpad.net/congress/+spec/dist-api-rpcify-row
00:25:53 I set it to Approved and set an Approver. If that doesn't work, I don't see another way.
00:26:56 I can try dist-api-rpcify-row
00:27:02 Your setting looks like it works.
00:28:00 launchpad allows me to change the whiteboard and work items.
00:28:17 I don't see anyone's name showing up under assignee.
00:28:40 It seems only the owner/core can change that.
00:28:41 thinrichs: I still cannot change the Assignee
00:28:47 Yingxin1: thanks for trying.
00:28:52 Let's move on.
00:28:57 #topic RPC interface
00:29:08 pballand says he's working on something
00:29:40 I'm guessing it's this:
00:29:42 #link https://review.openstack.org/#/c/210159/
00:29:56 I've been testing out various strategies for DseNode using oslo.messaging primitives
00:30:01 I did some of the preliminary work for the policy-model API
00:30:25 #link https://review.openstack.org/#/c/210691/
00:30:36 pballand: please continue—didn't mean to interrupt
00:30:36 and also working on the spec for 'dist-cross-process-dse'
00:30:59 the spec isn't pushed for review yet, because I wanted to test out some things before proposing a design
00:31:22 I've found one major shortcoming in oslo.messaging
00:31:49 it seems that the message bus connection is managed automatically (including reconnects)… this is good
00:32:07 the problem is that the application logic isn't notified when the message bus is disconnected
00:32:28 this presents a problem with the design we outlined in the midcycle
00:32:55 if a node is disconnected, and oslo.messaging reconnects, the DSE doesn't know that it may have missed messages
00:33:11 We can detect that with sequence numbers.
00:33:27 alexsyip: yes - more on that in a sec
00:34:04 I am working on two solutions: 1) chatted with some of the oslo.messaging folks, and am going to send an email to the mailing list to propose getting a trigger for connections and disconnections
00:34:29 2) as alexsyip said: we can use sequence numbers to detect gaps
00:35:05 sequence numbers don't work for services that aren't sending updates, however, so we will need to have periodic heartbeats
00:35:06 Does oslo.messaging lose messages in any other situation?
00:35:22 sometimes these messaging systems will drop messages under overload conditions.
00:35:46 alexsyip: I have yet to determine that, however it seems that the solution in 2) will handle that case as well
00:36:20 Ok. The clone pattern is meant to deal with lost messages: http://zguide.zeromq.org/php:chapter5#Reliable-Pub-Sub-Clone-Pattern
00:36:39 so I'm currently thinking we will ship with design 2, and changes in oslo.messaging will be an optimization if/when they come
00:36:52 sounds good to me.
00:37:16 I am working on a trial of this using oslo.messaging's RPC interface, and hope to publish a spec by the end of the week
00:37:18 that's it from me
00:37:55 pballand: sounds good to me too.
00:37:59 pballand: sounds great.
00:39:27 :)
00:39:39 Am I right in thinking that this sequence-number issue will be handled at a lower layer than the api-models would worry about?
00:40:11 That is, when doing the api-modifications we can assume RPC is reliable, right?
00:40:48 thinrichs: that's right - I expect the DseNode class will be a parent for all services on the bus, and it will contain methods that send updates and send full data - the base class will manage adding metadata such as sequence numbers
00:40:50 you can't ever expect RPCs to be reliable.
00:41:12 unless you receive an ack.
00:41:20 (the "that's right" was for the first message; I agree with alexsyip's comment)
00:42:04 So I think there could be an ack or a timeout.
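[Editor's note: a minimal sketch of design 2 above, i.e. per-publisher sequence numbers to detect gaps, plus heartbeat timestamps for publishers that send no updates. The SequenceTracker class and its method names are hypothetical; in the real design this bookkeeping would live in the DseNode base class.]

```python
import time


class SequenceTracker:
    """Detects missed pub/sub messages via per-publisher sequence numbers,
    and silent publishers via heartbeat timestamps (hypothetical names)."""

    def __init__(self, heartbeat_timeout=10.0):
        self.last_seq = {}       # publisher -> last sequence number seen
        self.last_heard = {}     # publisher -> time of last message/heartbeat
        self.heartbeat_timeout = heartbeat_timeout

    def observe(self, publisher, seq, now=None):
        """Record a message. Return True if in order, False if a gap was
        detected (some sequence numbers were skipped)."""
        now = time.time() if now is None else now
        prev = self.last_seq.get(publisher)
        self.last_seq[publisher] = seq
        self.last_heard[publisher] = now
        return prev is None or seq == prev + 1

    def stale(self, publisher, now=None):
        """True if the publisher has not been heard from (data or heartbeat)
        within the timeout, so its data may be out of date."""
        now = time.time() if now is None else now
        heard = self.last_heard.get(publisher)
        return heard is None or now - heard > self.heartbeat_timeout


tracker = SequenceTracker(heartbeat_timeout=5.0)
tracker.observe("nova-driver", 1, now=0.0)   # first message: in order
tracker.observe("nova-driver", 2, now=1.0)   # in order
ok = tracker.observe("nova-driver", 4, now=2.0)  # gap: 3 was lost -> False
```

On a detected gap (or a stale publisher), the subscriber would request a full snapshot from the publisher rather than trusting incremental updates, which also covers messages dropped under overload.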
00:42:08 a given table will have in-order updates up to some point, without the caller needing to worry about sequence numbers
00:42:24 So when someone is writing the API model that inserts a rule, and we send off an API call but don't hear back, what do we do?
00:42:31 but when making an explicit RPC to a service, you need an ack or timeout as Yingxin1 says
00:43:03 Does the behavior depend on the API call?
00:43:07 You can ask to see if the rule exists.
00:43:11 thinrichs: my initial thought would be to throw a 503 in that case
00:43:24 pballand: but what if the rule actually got inserted?
00:43:30 the caller won't know if the call succeeded or not, but that's a common problem
00:43:54 the caller can check for the rule, or try the insert again (if/when we support idempotent create)
00:44:04 pballand: so you're saying never retry—it's the user's problem.
00:44:23 thinrichs: well, we retry internally, but only up to some time limit
00:44:46 ultimately, it's always the user's problem (there can be disconnections in other places along the line)
00:44:51 pballand: is that rolled into the base class, or does the retry logic depend on the particulars of the API call?
00:45:03 I'm just trying to figure out what abstraction we need to use when modifying the API models.
00:45:12 thinrichs: yes, there should be a timeout limit for handling a request message
00:45:31 oslo.messaging has support for that internally
00:45:42 #link: I am working on two solutions: 1) chatted with some of the
00:45:45 oops
00:45:52 #link http://docs.openstack.org/developer/oslo.messaging/rpcclient.html#oslo_messaging.RPCClient.call
00:46:26 So I should be treating the RPC method implemented in the DseNode base class as something that either returns a value/ack or that times out.
00:46:37 And then if there's a time-out I return a 503.
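[Editor's note: the timeout-to-503 behavior just discussed might look roughly like this. RpcTimeout stands in for oslo.messaging's MessagingTimeout, and insert_rule_api is a hypothetical API-model handler, not actual Congress code.]

```python
class RpcTimeout(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout."""


def rpc_call(handler, timeout_expired=False):
    """Toy stand-in for RPCClient.call(..., timeout=...): either returns
    the remote result or raises when no reply arrives in time."""
    if timeout_expired:
        raise RpcTimeout("no reply within the configured timeout")
    return handler()


def insert_rule_api(handler, timeout_expired=False):
    """API model: forward the insert over (simulated) RPC.
    On timeout, return HTTP 503; the caller cannot know whether the insert
    actually happened, so it must re-check for the rule or retry
    (idempotently, if/when idempotent create is supported)."""
    try:
        result = rpc_call(handler, timeout_expired)
        return 200, result
    except RpcTimeout:
        return 503, None


status, body = insert_rule_api(lambda: "rule-id-1")
late_status, _ = insert_rule_api(lambda: "rule-id-1", timeout_expired=True)
```

The design choice here matches the discussion: the API layer treats the RPC as synchronous and simply surfaces the timeout as 503, leaving "did my write land?" to the client.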
00:46:55 (from the docs, however, it doesn't look like the call timeout is configurable, so we may need to implement some more logic)
00:47:19 pballand: it is configurable
00:47:31 Lost track of the time.
00:47:39 I wanted to make 1 quick comment before moving on.
00:47:45 thinrichs: I would treat it as synchronous, but it could raise MessagingTimeout, RemoteError, or MessageDeliveryFailure
00:47:51 veena: thanks
00:48:00 pballand: sounds good.
00:48:14 In the first edits I made to the policy-model,
00:48:16 #link https://review.openstack.org/#/c/210691/
00:48:20 I did 2 things:
00:48:43 1. introduced a self.rpc to mimic the one that will belong to the DseNode and made all communication go through that.
00:49:02 (The implementation is just invoking the policy-engine's methods directly.)
00:49:15 2. I moved the database logic out of the API model and into the policy-engine.
00:49:35 So instead of the API keeping the database and policy engine synchronized, that's left to the policy engine itself.
00:50:10 None of that relies on having DseNode ready, so we can do all that in parallel with pballand's efforts.
00:50:24 masahito: you had a question about whether (2) makes sense.
00:50:41 thinrichs: yep.
00:51:02 The question was whether the API should directly talk to the database or whether the API should always talk to one of the other processes to answer questions.
00:51:11 Is that right?
00:51:14 yes
00:51:18 thinrichs: you could just write directly to the db from the API. Then wait for the policy engine to read from the DB.
00:51:54 My hunch is that it's better to talk directly to the db.
00:52:00 alexsyip: understood, but in the PE case, only the PE knows whether a new rule can be inserted, so we need to talk to the PE anyway.
00:53:22 IMO, write access to the db should be permitted only for the PE; read access is permitted for all.
00:53:23 I think it's not enough to ask the PE
00:53:47 because there may be two writers going through different PEs that make conflicting writes.
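[Editor's note: the two policy-model changes thinrichs described at 00:48 (route all communication through self.rpc, and let the policy engine own the database sync) could be sketched like this. All names are illustrative and not the actual code in review 210691.]

```python
class PolicyEngine:
    """Only the PE knows whether a rule can be inserted, so it alone
    writes the DB and keeps it synchronized with its in-memory state."""

    def __init__(self):
        self.rules = []   # in-memory rule set
        self.db = []      # stands in for the SQL table the PE now owns

    def persist_rule(self, rule):
        self.rules.append(rule)
        self.db.append(rule)  # PE, not the API, syncs DB and memory
        return rule


class DirectRpc:
    """Change (1): invokes engine methods directly today, but exposes the
    same call() shape a real DseNode RPC client would, so it can be
    swapped out without touching the API model."""

    def __init__(self, engine):
        self.engine = engine

    def call(self, method, **kwargs):
        return getattr(self.engine, method)(**kwargs)


class PolicyModel:
    """API model: all communication goes through self.rpc."""

    def __init__(self, rpc):
        self.rpc = rpc

    def add_rule(self, rule):
        return self.rpc.call("persist_rule", rule=rule)


engine = PolicyEngine()
model = PolicyModel(DirectRpc(engine))
model.add_rule("p(x) :- q(x)")
```

Because the API model only ever sees the `call()` interface, none of this depends on DseNode being finished, which is why the two efforts can proceed in parallel.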
00:54:24 Oh, maybe not.
00:54:58 Are you saying the PE writes the transaction, but only the PE knows under what conditions to write the transaction?
00:55:38 alexsyip: maybe yes
00:55:50 alexsyip: yes. The PE is the only one that knows which statements are syntactically valid, and whether adding a statement would cause cycles.
00:56:09 Lost track of time again. Let's put this on hold and see if anyone else needs help from the group.
00:56:13 #topic open discussion
00:56:16 Are there multiple PEs with different ideas of what is valid?
00:57:31 Sorry we ran so short on time this week, everyone.
00:57:49 Lots of energy for the new architecture!
00:57:54 Great to see!
00:58:06 No one has anything to ask?
00:58:52 pleased to see the new architecture in plan
00:58:58 nothing from me
00:59:09 alexsyip: if there were a bunch of rule additions coming in close together, then 2 PEs could get out of sync and then each evaluate whether a new rule would create a cycle, I suppose.
00:59:17 I'm guessing that's going to be rare, but possible.
00:59:40 It wouldn't be until the sync that the two realized the rules in the DB were actually recursive.
00:59:44 And so not permitted.
01:00:00 RuiChen: agreed.
01:00:04 Yes, so that means the PE is not able to really evaluate at API time.
01:00:15 Thanks all—we're officially out of time.
01:00:21 I can continue on #congress for a few minutes.
01:00:25 #endmeeting
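[Editor's note: the recursion scenario raised at the end (two PEs each accept a rule that is fine alone, but the combined rule set in the DB turns out to be recursive) comes down to cycle detection in the head-to-body dependency graph. A sketch with hypothetical names.]

```python
def has_cycle(rules):
    """rules: list of (head_predicate, [body_predicates]).
    Return True if the combined dependency graph contains a cycle,
    i.e. the rule set is recursive."""
    graph = {}
    for head, body in rules:
        graph.setdefault(head, set()).update(body)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {}

    def visit(node):
        color[node] = GRAY
        for dep in graph.get(node, ()):
            c = color.get(dep, WHITE)
            if c == GRAY:
                return True        # back edge: cycle found
            if c == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n)
               for n in list(graph))


# PE 1 accepts p :- q; PE 2 independently accepts q :- p. Each check
# passes in isolation, but once both inserts land in the shared DB the
# rule set is recursive and is detected only at sync time.
alone_1 = has_cycle([("p", ["q"])])
alone_2 = has_cycle([("q", ["p"])])
merged = has_cycle([("p", ["q"]), ("q", ["p"])])
```

This illustrates thinrichs's point: with multiple masters, cycle validation at API time is only a best-effort check against one PE's view, and the authoritative check happens when the replicas sync.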