04:01:19 #startmeeting masakari
04:01:20 Meeting started Tue Apr 25 04:01:19 2017 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:21 hi
04:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:24 The meeting name has been set to 'masakari'
04:01:25 hi
04:01:27 hi all o/
04:01:41 sorry for last week..
04:01:48 o/
04:02:06 NP
04:02:40 Had a super busy week, because I became a father :)
04:02:58 Anyway..
04:03:12 Congratulations!!!
04:03:16 samP: congrats!!
04:03:21 congrats!
04:03:25 thanks..
04:03:27 Congratulations..!
04:03:36 thank you all..
04:03:46 let's jump into the agenda
04:03:58 #topic critical bugs
04:04:14 any bugs to discuss?
04:05:54 If there are no bugs to discuss, let's move to Discussion. If any come up, we can address them later in AOB
04:06:14 #topic Discussion Points
04:06:41 #link: https://etherpad.openstack.org/p/masakari-recovery-method-customization
04:06:51 triggering crash dump in a server?
04:07:03 samP: Yes
04:07:23 samP: can you please explain a little bit about this use case?
04:07:30 tpatil: sure
04:08:43 This is for waiting for a core dump or crash dump before shutting down a server
04:09:23 When pacemaker stoniths a server, it does not wait for the core dump.
04:10:03 In our user environment, servers have 264GB RAM and it takes 20-30 mins to do the core dump.
04:11:05 However, when a host fails, pacemaker does the stonith and there is no time for the server to do the core dump
04:12:19 This feature is for isolating the server from the network, except for IPMI, and giving the server some time to do the core dump.
04:14:11 samP: can masakari-monitor receive the event to trigger a core dump?
04:14:38 tpatil: not currently
04:15:09 Oh, I hadn't recognised that.
04:15:56 Should I write a bp for receiving the core dump trigger in the monitor?
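The 20-30 minute figure quoted for dumping 264GB of RAM is consistent with a simple back-of-the-envelope estimate. The ~200 MB/s sustained write speed below is an assumption for illustration, not a number from the discussion:

```python
# Rough estimate of crash-dump time for the 264 GB hosts mentioned above.
# The 200 MB/s sustained write speed to the dump device is an assumption.
ram_mib = 264 * 1024
write_mib_per_s = 200
dump_minutes = ram_mib / write_mib_per_s / 60
print(f"estimated dump time: {dump_minutes:.0f} min")  # falls inside the quoted 20-30 min range
```

At slower dump-device speeds (or with dump-level filtering disabled) the time grows proportionally, which is why pacemaker's immediate stonith leaves no room for the dump to finish.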
04:16:44 samP: I think this job should be done by masakari-monitor instead of masakari
04:17:27 rkmrHonjo: currently I have no idea how to receive this trigger in masakari monitors
04:18:13 tpatil: IMHO, the job can be done in masakari monitors, but the recovery method should be defined in masakari
04:18:52 samP: what action will massacre take after receiving the notification to take a core dump?
04:19:09 sorry, s/massacre/masakari
04:20:03 Massacre is scared...
04:20:05 tpatil: IMO, masakari will not get a notification for the core dump.
04:20:28 It is one of the recovery actions
04:20:52 On my machine auto spell check is enabled, I will find out a way to disable this feature
04:21:17 Masakari only gets the node failure notification, and the recovery action would be isolate server (wait for core dump) -> evacuate
04:22:44 samP: Who will trigger the core dump, and signal when it is complete?
04:23:24 tpatil: the core dump will be automatically triggered by the kernel
04:24:22 HW failures, exceptions, kernel panics, etc. will trigger the core dump in the server
04:24:47 we just have to wait for it to dump all the pages to a file..
04:25:09 samP: Are you suggesting masakari should run a recovery action which will trigger the core dump and wait until the kernel signals it's complete using the IPMI protocol?
04:27:37 tpatil: No, that cannot be done. On the other hand, masakari does not have to wait for the core dump; it just needs to ask pacemaker to ifdown the networks and isolate the node.
04:28:21 for masakari, network isolation = node dead
04:28:54 So masakari can evacuate the VMs in the normal way
04:29:51 samP: I'm trying to understand the end-to-end workflow when any node is down for some reason
04:30:47 tpatil: OK, let me write down the simple flow
04:31:11 samP: thank you
04:31:15 (1) masakari monitor sends the node failure notification
04:32:13 (2) Masakari asks pacemaker to do NW isolation of the node (<- calls the pacemaker cluster to do that)
04:32:53 (3) Masakari gets the reply from pacemaker: "node isolation is done"
04:33:23 (4) Masakari triggers nova evacuate for the VMs on that node
04:33:31 (5) done
04:33:38 samP: Masakari doesn't store any info about the pacemaker cluster; we need to figure out how to store this info when the operator configures failover segments and hosts
04:34:19 tpatil: you are correct..
04:34:31 samP: understood the workflow. Will check what information should be stored to isolate the node
04:35:10 tpatil: we have to configure the resources on the pacemaker side to do this..
04:35:46 most of the work will be done on the pacemaker side, so pacemaker and corosync need to be configured correctly
04:36:41 I will write down more info related to this in the etherpad..
04:37:07 I think we need to avoid split brain for volume-booted VMs on the failed node
04:37:12 samP: thank you
04:37:40 Do we need some confirmation? We need to design this feature carefully so that those VMs are fenced enough.
04:37:48 sagara: does NW isolation avoid that?
04:39:04 I don't know whether NW isolation is enough. In the case of many L2 switches, is it isolated enough?
04:40:12 for example, in an environment with separate management and storage NWs
04:40:32 sagara: NW isolation of the node means ifdown of the IFs, which will kill all the connections and sessions
04:41:24 sagara: Above, I described the NW isolation. Ex: NW isolation = ifdown all the IFs, except IPMI
04:42:35 sagara: you may choose specific IFs to down, such as only the storage NW and tenant NW, but not the management NW.
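The five-step flow samP walks through (failure notification, asking pacemaker for NW isolation, waiting for confirmation, then nova evacuate), with isolation meaning "ifdown all the IFs except IPMI", can be sketched roughly as below. All names here (`isolation_commands`, `handle_node_failure`, the `pacemaker` and `nova` helpers) are hypothetical illustrations, not real masakari, pacemaker, or nova interfaces:

```python
# Hypothetical sketch of the recovery flow discussed above; none of these
# names correspond to actual masakari, pacemaker, or nova APIs.

IPMI_IF = "ipmi0"  # assumed name of the interface left up for IPMI access


def isolation_commands(interfaces, keep=(IPMI_IF,)):
    """Build the 'ifdown everything except IPMI' command list (dry run only)."""
    return [f"ip link set {name} down" for name in interfaces if name not in keep]


def handle_node_failure(node, interfaces, pacemaker, nova):
    # (1) masakari-monitor has already sent a node-failure notification for `node`.
    # (2) Ask the pacemaker cluster to isolate the node's networks.
    pacemaker.run_on_cluster(node, isolation_commands(interfaces))
    # (3) Wait for pacemaker's "node isolation is done" reply;
    #     for masakari, network isolation = node dead.
    if not pacemaker.wait_isolated(node):
        raise RuntimeError(f"isolation of {node} was not confirmed")
    # (4) Evacuate the VMs that were running on the node. (5) Done.
    return [nova.evacuate(vm) for vm in nova.servers_on(node)]
```

The choice of which IFs to down (only storage and tenant NWs, or everything but IPMI) would be a deployment decision expressed in the pacemaker resource configuration, matching the point that most of the work sits on the pacemaker/corosync side.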
04:43:15 So do we need to clarify that if we are using an FC-HBA host, we cannot fence the VMs enough? Is that right?
04:43:35 samP: that will surely help us figure out how to implement this recovery action
04:43:56 sagara: Ah..yes...In that case we need to do some special things
04:46:01 An FC-HBA environment is maybe rarer than iSCSI; do we first take this forward without the FC case, or
04:46:26 do we consider some general design?
04:47:20 sagara: To my understanding, the problem in the FC case is how to disable the port, which is highly HW dependent.
04:48:01 sagara: On the other hand, how to disable it is a pacemaker configuration problem and not a masakari problem
04:48:25 I think there are two ways: one is disabling the FC port, the other is waiting long enough for the dump.
04:49:40 IMO, we can proceed without considering the FC case, because I cannot see what we can do on the masakari side for FC
04:49:48 sagara: please correct me if I'm wrong
04:51:43 operating an FC port will be a little difficult; Cinder already has an FC auto zoning feature, so some FC switches can be controlled with the cinder FC switch driver code.
04:52:08 I agree to proceed without considering the FC case.
04:52:38 sagara: are you proposing to cut off FC channels from the switch side?
04:53:22 sagara: I was only focused on the server side
04:54:40 Yes, Sampath-san said "the problem in the FC case is how to disable the port, which is highly HW dependent", so I understood that as controlling FC switches.
04:55:18 sagara: OK..
04:56:01 I think controlling the FC-HBA on the server is also difficult
04:56:03 I was talking about FC ports on the server side.
04:57:00 Anyway, I will write more details in the etherpad for this, so we can discuss how to proceed with this.
04:57:28 we do not have much time... let's move to AOB
04:57:33 #topic AOB
04:58:06 In masakari-recovery-method-customization, "Send Alert/Mail to operator", how does it send an "Alert"? Logs?
04:58:06 I don't know whether the FC-HBA's device path is still alive after the dump kernel starts to work
04:58:10 tpatil: sorry for the delay; I replied to your (and also Pooja-san's) mail about the summit presentation
04:58:11 ok
04:58:16 oh, sorry...
04:58:32 samP: Thanks
04:58:44 rkmrHonjo: the operator may configure it
04:59:13 sagara: the path will stay alive till you kill it
04:59:18 samP: Do operators configure drivers? e.g. logs, messaging?
05:00:13 rkmrHonjo: In my mind, this was a Mistral workflow; the operator does not configure the drivers
05:00:16 samP: but the login/logout mechanism is only for iSCSI. FC is not.
05:00:25 let's finish
05:00:27 samP: Thanks. I understand.
05:00:32 we are out of time
05:00:50 let's discuss this on the ML or in #openstack-masakari
05:00:53 ok
05:00:55 thank you all......
05:00:59 #endmeeting