00:04:03 #startmeeting CongressTeamMeeting
00:04:03 Meeting started Thu Jun 9 00:04:03 2016 UTC and is due to finish in 60 minutes. The chair is masahito_. Information about MeetBot at http://wiki.debian.org/MeetBot.
00:04:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
00:04:06 The meeting name has been set to 'congressteammeeting'
00:04:17 oh, the bot reacts!
00:04:45 Today's agenda is:
00:04:48 haha =)
00:05:01 1. newton-1 release
00:05:06 2. status update
00:05:29 something else?
00:06:16 I was hoping we could settle some HA design decisions, but maybe not without thinrichs
00:06:18 let's move on to the newton-1 release
00:06:56 ekcs: ok
00:07:09 #topic newton-1 release
00:07:17 If we can make progress on HA in 10-15 minutes, I'll be here at least that long.
00:07:43 ekcs: is that enough time to be valuable?
00:07:54 thinrichs: I think so.
00:08:19 masahito: would it be okay to start with the HA discussion topic?
00:08:29 thinrichs: sure
00:08:44 #topic HA design
00:09:36 ok so I summarized the main decision points in my latest comments. Here’s the short version:
00:09:46 A. node configurations:
00:09:47 1. single node-type (every node has API+PE+DSDs).
00:09:48 2. two node-types (API+PE nodes, all-DSDs node).
00:09:49 3. many node-types (API+PE nodes, all DSDs in separate nodes).
00:10:07 B. global vs local leader for action execution
00:10:08 1. global leader: Pacemaker anoints a global leader among PE instances; only the leader sends action-execution requests.
00:10:08 2. local leader: every PE instance sends action-execution requests, but each receiving DSD locally picks a leader to listen to.
00:10:18 C. DSD redundancy:
00:10:19 1. warm standby: only one set of DSDs running at a given time; backup instances ready to launch.
00:10:20 2. hot standby: multiple instances running, but only one set is active.
00:10:21 3. active-active: multiple instances active.
00:11:20 In my comment in the spec discussion I also gave my take on the decisions. #link https://review.openstack.org/#/c/318383/
00:11:40 If we have thoughts here, we may be able to more or less settle them.
00:11:55 if not, we can do it offline.
00:13:02 my suggestion is: A2 (two node-types), B2 (local leader), C1 (DSD warm standby)
00:14:14 Is A2 saying multiple API+PE nodes (for HT) and a single DSD node? All with nannies to restart.
00:14:44 yes. though restart may be on a different node (managed by pacemaker)
00:15:39 C1 (DSD warm standby) sounds right
00:16:10 A2 seems right too.
00:17:00 Is B2 (local leader election) easier to implement? Or … why B2 instead of B1 (briefly)?
00:18:02 I'm also in favor of A2 and C1; local leader sounds good too, as I think it can be isolated just for action execution
00:18:03 We'd need leader election in both cases, right? The difference is just where it's implemented.
00:18:31 And I'm all in favor of leaving the complexity of action-execution in the PE alone.
00:18:43 I think it’s about the same dev work: API calls and custom resource agents for global. Some new logic in DSD for local. The difference is that with local leader, the deployer doesn’t need to know about leader election.
00:19:05 the PE doesn’t even know it’s a leader or follower.
00:19:22 ekcs: global leader can also be elected only when there is an action request
00:19:33 ekcs: or it should be done when PE starts
00:19:41 “leadership” is only local to each DSD.
00:20:31 on deployment, local leadership is transparent to the deployer. and it doesn’t require any quorum taking.
00:20:54 ramineni_: it could probably be done either way. though doing it at the very start makes sense for global leader I think.
00:21:34 I would think leadership is transparent in both cases. There's just an API that sets the leader, and that message gets routed either to the PEs or to the DSDs.
00:22:03 no quorum required because the decision is entirely made by the single DSD instance.
00:22:59 So it sorts them or something to pick a leader. That's cool. If we can avoid leader election, I would definitely vote for that one.
00:23:20 ekcs: yes, but in that case can we track that it's an action execution, so that only the leader executes it, and in all other cases it acts as active/active (without any leader)?
00:23:41 thinrichs: yes exactly. it’s not actual leader election.
00:24:05 I've got to run, but based on what I hear I agree with ekcs's suggested path forward.
00:24:33 later thinrichs
00:24:35 ekcs: hmmm…one other thought though. How do the DSDs know what the list of PEs actually is?
00:25:02 thinrichs: it can be tracked via control bus peer status
00:25:03 thinrichs: it doesn’t. it can just pick the first one it hears from. and stick with it until it loses contact.
00:25:07 If A is the leader, and the DSD is ignoring all action executions from B and C, and then A dies, the DSD needs to know about B and C
00:25:08 ok
00:25:22 The one other thing I'd think through is mitigating technical risk.
00:25:27 thinrichs: DSD will continue to receive messages from all leaders.
00:25:42 s/all leaders/all PEs
00:25:44 We want to have a solution that we can implement somewhat incrementally
00:25:45 ugh.
00:26:21 But I think you've probably thought through that.
00:26:35 Got to run. I'll check out the meeting logs afterwards.
00:27:34 more thoughts? ramineni_ and masahito_. we don’t have to settle everything now, this just gives me a direction to run with in preparing the next draft of the spec.
00:27:36 ramineni_: do you have additional discussion for HA?
00:28:00 ekcs: yes, but in that case can we track that it's an action execution, so that only the leader executes it, and in all other cases it acts as active/active (without any leader)?
00:28:08 one question
00:28:15 in the global leader case
00:28:35 ekcs: as I posted on gerrit, I'm in favor of ekcs's suggestion.
00:29:27 ramineni_: I’m not sure what benefit it offers to delay leader election in that case.
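The local-leader rule described above (option B2: each DSD picks the first PE it hears from and sticks with it until it loses contact, while still receiving messages from all PEs) can be sketched roughly as follows. This is a minimal sketch, not Congress's actual code: the `LocalLeader` class, its methods, and the `timeout` parameter are hypothetical, and "losing contact" is modeled with an assumed silence timeout standing in for the control-bus peer status mentioned in the discussion.

```python
import time
from typing import Dict, Optional


class LocalLeader:
    """Hypothetical DSD-side helper for option B2 (local leader).

    Each DSD instance decides on its own which PE's action-execution
    requests to honor: no quorum and no cluster-wide election.
    """

    def __init__(self, timeout: float = 10.0) -> None:
        # Assumption: a PE silent for longer than `timeout` seconds is
        # treated as lost (a stand-in for control-bus peer status).
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}  # PE id -> last time heard from
        self.leader: Optional[str] = None

    def _leader_alive(self, now: float) -> bool:
        if self.leader is None:
            return False
        return now - self.last_seen[self.leader] <= self.timeout

    def heard_from(self, pe_id: str, now: Optional[float] = None) -> None:
        """Record any message (request or heartbeat) from a PE."""
        now = time.monotonic() if now is None else now
        self.last_seen[pe_id] = now
        if not self._leader_alive(now):
            # No leader yet, or contact with it was lost: the first PE
            # heard from becomes the leader, and the DSD sticks with it.
            self.leader = pe_id

    def should_execute(self, pe_id: str, now: Optional[float] = None) -> bool:
        """Return True only for action requests from the local leader.

        Requests from other PEs are ignored, but still refresh their
        liveness so one of them can take over if the leader goes away.
        """
        now = time.monotonic() if now is None else now
        self.heard_from(pe_id, now)
        return pe_id == self.leader


if __name__ == "__main__":
    dsd = LocalLeader(timeout=10.0)
    print(dsd.should_execute("pe-A", now=0.0))   # True: first PE heard from
    print(dsd.should_execute("pe-B", now=1.0))   # False: pe-A still alive
    print(dsd.should_execute("pe-B", now=20.0))  # True: pe-A lost, pe-B takes over
    print(dsd.should_execute("pe-A", now=21.0))  # False: DSD sticks with pe-B
```

Because each DSD decides from the messages it already receives from all PEs, no quorum is needed, and the PEs never learn whether they are "leader" or "follower".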
00:29:52 ekcs: since all other API requests don't need to go to the leader
00:30:09 masahito_: thanks. just note that A2 is different from what’s in the most recent draft (A1). A2 came out of ramineni_’s comment.
00:30:59 ekcs: anyway, I'm also in favor of local leader as it sounds simpler to me and the code can handle it :)
00:31:02 ramineni_: ah that’s never intended to be the case. even when we have a leader, the only difference is that only that leader initiates action-execution. but API requests are always routed to the same node as the API service.
00:31:46 ekcs: ah, ok
00:31:47 so you get active-active on queries and everything. just that followers don’t initiate action-execution.
00:32:09 ekcs: so, that code goes in the congress resource agent, right?
00:33:36 I’m not sure. I’m thinking the resource agent just tells a PE it is leader or follower and the PE remembers it. and action-execution is automatically ignored on followers in the PE logic (agnostic.py).
00:33:57 ekcs: oh ok
00:34:22 you could also put the disabling of action-execution directly in the resource agent, but I think that exposes too much of congress's internals.
00:34:43 ok well I don’t have anything else.
00:34:45 on HA.
00:34:48 ekcs: ok
00:35:22 masahito_: I'm also done with HA questions
00:35:34 ramineni_: got it
00:35:42 let's move on to the next topic
00:35:53 #topic Newton-1
00:36:38 last week was the newton-1 release week
00:37:04 We've released it already.
00:37:20 #link http://releases.openstack.org/newton/index.html
00:37:47 nice. and this time the artifact exists!
00:38:12 yeah.
00:38:45 we got to the new architecture with one node by newton-1
00:39:25 It sounds like we all did great work.
00:39:36 that’s awesome.
00:39:57 anything else for the N-1 release?
00:40:08 masahito_: maybe we should make the new arch voting now
00:40:27 ramineni_: oh, right.
00:41:01 masahito_: I can put up a patch for that
00:41:16 ramineni_: nice!
00:42:24 ok, let's move on to the next topic.
00:42:37 #topic status update
00:43:08 ekcs: want to start?
00:43:13 sure.
00:43:58 Been working on the HA spec, and actually installing and configuring the tools we're going to rely on: pacemaker, corosync, haproxy, etc.
00:44:28 also have the diff update and sequencing patch ready. #link https://review.openstack.org/#/c/304991/
00:45:02 that’s it from me.
00:45:33 ekcs: thanks. I'll check the patch.
00:46:09 ramineni_: want to start?
00:46:16 sure
00:46:41 I have put up a patch for migrating the existing synchronizer code to the new arch https://review.openstack.org/#/c/324328/
00:47:13 some details still need to be worked out, so your and ekcs's comments on that would be helpful
00:48:12 ok I'll also check it.
00:48:16 got it.
00:48:53 thanks
00:49:02 masahito_: that's it from my side
00:49:08 ok,
00:49:17 next from me.
00:50:04 I pushed the patch for the race-condition issue in creating a datasource.
00:50:25 https://review.openstack.org/#/c/325195/
00:51:22 I wasn't able to check your comment before. I'll check it and respond.
00:51:31 that's it from my side.
00:52:29 great!
00:54:30 status update is done.
00:54:40 #topic open discussion
00:54:59 6 mins left.
00:55:21 Does anyone have anything else?
00:55:49 quick question
00:56:19 masahito_: have you used pacemaker before?
00:56:28 ekcs: yes.
00:56:55 masahito_: ok great. may need help from you down the road =)
00:57:20 for OpenStack, neutron-agent, nova-compute and so on.
01:00:41 we're running out of time.
01:00:46 thanks all. bye
01:00:50 thanks all
01:00:55 #endmeeting