08:59:15 #startmeeting HA
08:59:16 Meeting started Mon Feb 8 08:59:15 2016 UTC and is due to finish in 60 minutes. The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:59:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:59:19 The meeting name has been set to 'ha'
08:59:26 Hello all
08:59:48 Hi
09:00:32 hi
09:01:13 hi
09:01:27 Ok, let's start with a quick status
09:01:29 <_gryf> hi
09:01:42 #topic Quick status report
09:02:29 My status: I have prepared two workflows for instance HA, both can be seen on https://github.com/gryf/mistral-evacuate
09:02:57 For one of them, I have hit a bug https://bugs.launchpad.net/mistral/+bug/1535295
09:02:58 Launchpad bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] - Assigned to Dawid Deja (dawid-deja-0)
09:03:36 That's all from my side
09:03:46 masahito: are you willing to give a report?
09:04:27 The patch for enabling masakari to work on pacemaker-remote was merged.
09:04:43 That's great
09:04:56 And some bug fixes were also merged.
09:05:02 That's from my side.
09:05:04 #info patch for enabling masakari to work on pacemaker-remote was merged.
09:05:28 Ok, cool. Does anyone else have something to report?
09:05:43 I have one announcement
09:06:12 hey guys
09:06:25 beekhof: hi
09:06:26 A new time for the HA guide docs meeting was assigned. The next meeting is on Feb 10, 17:00 UTC, don't hesitate to participate and add agenda items https://wiki.openstack.org/wiki/Documentation/HA_Guide_Update#Next_Meeting
09:06:27 hi beekhof :)
09:06:29 hi
09:07:34 OK, thanks bogdando
09:08:19 I think we are done with status reports
09:08:36 #topic Mistral Workflow for instance HA
09:08:55 Should we open a nova-client bug related to bug 1535295 as well? For example, that it should not throw errors when a second evacuate request arrives?
09:08:56 bug 1535295 in Mistral "Task with join runs more than once" [Undecided,New] https://launchpad.net/bugs/1535295 - Assigned to Dawid Deja (dawid-deja-0)
09:09:37 I think it's expected behaviour
09:09:58 I was rather thinking about catching the exception
09:10:23 as a workaround
09:10:52 can i ask how this workflow... works :)
09:11:02 yup, sure
09:11:32 eg. when is filter_vm_action.py called?
09:11:54 beekhof: you are looking at the code on the master branch, aren't you?
09:12:00 yep
09:12:11 ok, so let me start by explaining this workflow
09:12:38 It starts with calling nova.servers.list() in the nova python client
09:12:54 nova.servers_list is a wrapper for the python client
09:13:01 oh, i see the yaml now
09:13:18 then it calls filter_vm_action
09:13:23 this is rather neat
09:13:39 i can see the attraction
09:13:55 yup, it's very simple
09:14:38 Right now I'm thinking about adding some action that would check if evacuation succeeded
09:15:19 also, if you change the branch, there is another workflow that separates filtering from asking nova for flavors
09:16:01 interesting
09:16:35 I think it's better since in approach no. 2 there would be fewer nova calls
09:16:36 the branch looks like it changes the workflow depending on the extra_spec.
09:16:51 looks interesting.
09:16:58 what's the advantage of splitting them up?
09:17:38 masahito: oh, asking for flavor_extra_spec should also happen in the first approach, I'll fix that :)
09:17:51 beekhof: fewer calls to the nova API
09:18:16 I guess it allows the admin to define an extra_spec attached to evacuated VMs.
09:18:45 masahito: There are two ways of determining if a given VM should be evacuated
09:19:09 1. It has a flavor with extra spec 'evacuation:evacuate' set to True
09:19:32 2. The VM itself has metadata with the 'evacuate' flag set to True
09:19:51 I should've written it in the repo...
09:19:52 ddeja, could you please put that in the README as well?
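The two evacuation criteria listed above (flavor extra spec 'evacuation:evacuate', or VM metadata 'evacuate') can be sketched as a small predicate. This is only an illustration: the function names `should_evacuate` and `_is_true` are hypothetical, the inputs are assumed to be plain dicts, and accepting string values such as "True" is an assumption. It is not the code of filter_vm_action.py in the mistral-evacuate repository.

```python
def _is_true(value):
    """Assumption: flags may arrive as booleans or as strings like 'True'."""
    return str(value).strip().lower() in ("true", "1", "yes")


def should_evacuate(flavor_extra_specs, server_metadata):
    """Return True if either criterion from the meeting marks the VM.

    1. The flavor has extra spec 'evacuation:evacuate' set to True.
    2. The VM itself has metadata 'evacuate' set to True.
    """
    if _is_true(flavor_extra_specs.get("evacuation:evacuate", False)):
        return True
    return _is_true(server_metadata.get("evacuate", False))
```

Checking the flavor first and the per-VM metadata second is an arbitrary choice here; since either flag is sufficient, the order does not change the result.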
09:20:11 yup, you read my mind
09:20:20 bogdando: :)
09:20:50 I'll do it as soon as the meeting ends
09:21:07 ddeja: sounds nice!
09:21:20 this looks really simple and neat, indeed. So what do you think about the final solution?
09:21:29 like that PoC + fence agent?
09:22:18 So from my side the simplest scenario is: this workflow + a really simple fence agent that only calls the Mistral API
09:23:09 like to post the YAML?
09:23:31 not really - the yaml file should be loaded first and is stored in the DB
09:23:36 ah
09:23:49 you only call it by its name and provide input argiments
09:23:59 arguments*
09:24:06 I like that idea
09:24:37 #action ddeja to write simple fence agent that will call mistral API
09:24:39 and this to be put into the fence topology perhaps
09:24:50 yup
09:25:02 i like it as the mechanism for performing the evacuation, but i can also see scope for some of the masakari pieces for deciding when to trigger it
09:25:32 beekhof: yes, Masakari may call the workflow instead of the fence agent
09:25:33 we could use it as a "fallback fence level", if Mistral fails )
09:25:56 one thing though... i like the concept, but how well does it handle corner cases?
09:26:14 so it may be a topology like that: 1) first try the Mistral flow, 2) then try masakari, 3) fence the node
09:26:22 that's the key thing
09:26:42 bogdando: you need to have the node fenced before you call evacuate
09:26:51 It's a post-mortem process, when the node is dead
09:27:17 ddeja, I see, then it probably cannot fit the classic fence topos
09:27:20 and we can be sure that there won't be two VMs writing to the same storage
09:27:57 bogdando: It can, we can have a fencing topology like 1) fence node; evacuate VMs (call mistral)
09:28:11 beekhof: Which corner cases do you mean?
09:28:21 at the one level?
yes, that should work
09:28:39 mistral nodes falling over while a workflow is in progress
09:28:40 though, I may have forgotten details
09:28:57 service APIs dropping in and out while a workflow is in progress
09:28:59 etc
09:29:16 beekhof: That's a problem with Mistral, unfortunately
09:29:32 nodes returning while a WFIIP
09:29:44 exactly
09:29:59 there is a blueprint for Mistral HA itself
09:30:04 https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:30:16 ddeja: Does Mistral plan to resolve it in Mitaka?
09:30:17 that should be done for M3
09:30:20 the design is beautiful, but it comes down to the suitability of mistral itself
09:30:51 so for Mitaka there would be a spec for what needs to be done
09:31:28 and for the N release the mistral team would be working on resolving it
09:32:21 beekhof: so answering your question: It is not fully reliable now
09:32:57 Do we have a spec for this initiative, to reflect alternatives and decisions made here?..
09:33:05 so the big question for me is: how long until it is, and can we wait that long?
09:34:08 bogdando: I'm not sure if I understand correctly. Do you mean whether we have a spec for using a Mistral Workflow for instance evacuate?
09:34:21 ddeja just a spec for instance evacuate
09:34:57 so there is this spec
09:35:00 #link https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
09:35:08 not this one...
09:35:13 #link https://review.openstack.org/#/c/257809/
09:35:16 sorry
09:35:18 this one
09:35:27 thank you
09:35:40 but I'm not the submitter
09:36:15 beekhof: I don't know if we can wait that long
09:36:47 we can make this configurable: how long to wait and which option to fall back to
09:37:01 ddeja: i somewhat suspect that too
09:37:15 the current scenario is to have Mistral HA for the N release, so it will be like 8 months
09:37:20 assuming 100% reliable things would be a design flaw :)
09:38:11 that is why they use fence topologies AFAICT
09:38:22 ddeja: TBH, i'd be surprised if 8 months is all that was needed
09:38:25 so perhaps we should do likewise
09:38:58 and think of the Mistral flow and masakari as just two fence agents
09:39:24 to co-exist and help each other to cover more failure modes
09:40:02 bogdando: The problem is that we can, for example, call mistral from fence_agent
09:40:12 and get OK as a response
09:40:40 let's make the fence agent verify results before returning its own OL
09:40:41 but it only means that Mistral accepted the request
09:40:42 OK
09:40:58 and in fence agents, we can't wait for the result
09:40:59 by a given timeout
09:41:15 Why can't we?
09:41:29 all agents have things like power timeout to wait for results
09:41:36 here is the same
09:41:36 you're blocking recovery
09:41:41 because fencing freezes the whole cloud, if I remember correctly
09:41:50 oh, why?..
09:41:54 which might be the bit that is needed for the workflow to complete
09:42:01 == deadlock
09:42:59 well, we could run all evacuate options in parallel in the hope of idempotence and immutability
09:44:14 I mean using both Mistral and masakari based fence agents to ensure results... Makes sense?..
09:44:40 And leave operators the option to decide which agent to go with
09:44:48 It is better to have more options
09:45:06 maybe it's the way to go...
09:45:30 I'm not sure about implementation details, but from the design point of view I see no issues here
09:45:44 on the other hand, there was a discussion about using Masakari to call Mistral workflows
09:45:52 the evacuation request must be idempotent and immutable for any agents acting
09:46:40 since Mistral is not HA, maybe Masakari can check if the evacuation ended successfully, and if not call the Mistral API again?
09:46:50 but that's just an idea
09:47:02 good point as well
09:47:02 bogdando: you mean we make the notification interface from the monitoring process the same for the mistral workflow and masakari.
09:47:49 masahito, yes, monitoring should understand competing agents
09:48:19 or retries, if agents are kinda "nested" and masakari retries Mistral
09:48:40 ddeja: just a question. Can Mistral execute evacuate actions in parallel?
09:49:05 bogdando: got it.
09:49:27 masahito: yes. If we have like 5 VMs to evacuate, there would be 5 requests running in parallel
09:49:32 that was only my point though, we should think of it and suggest it to the spec perhaps...
09:50:31 ddeja: thanks.
09:51:15 masahito: but to be sure - you call mistral only once. The parallelism is done inside the mistral engine :)
09:53:18 OK, I guess with this topic done, we have a few minutes for open discussion
09:53:24 #topic Open discussion
09:53:53 I think we should discuss the details somewhere outside the IRC meeting.
09:54:20 because TL will flow.
09:55:33 masahito: I agree, but it will be good to wait till aspiers comes back from vacation
09:55:57 ddeja: right.
09:57:03 another topic, we are working on sqlalchemy support for masakari. Is there any database backend you would like to use other than MySQL?
09:57:28 kazuIchikawa: postgres
09:57:38 <_gryf> kazuIchikawa, postgresql is also a popular choice
09:57:43 ddeja: got it
09:57:50 most OpenStack deployments are on PostgreSQL or MySQL
09:59:23 ok, we are running out of time
09:59:37 thank you all for a productive meeting and see you next week :)
09:59:49 bye
09:59:51 bye
09:59:58 <_gryf> cu
10:00:08 bye
10:00:20 #endmeeting
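
The retry idea raised in the meeting (a caller such as Masakari checks whether the evacuation ended successfully and, if not, calls the Mistral API again) can be sketched as below. Everything here is hypothetical: `evacuate_with_retry`, `start_workflow` and `evacuation_succeeded` are placeholder names passed in by the caller, not Mistral, Masakari or nova APIs. The sketch assumes the evacuation request is idempotent, as required in the discussion, and that the loop runs outside the fence agent so it does not block recovery.

```python
import time


def evacuate_with_retry(start_workflow, evacuation_succeeded, host,
                        max_attempts=3, polls_per_attempt=12,
                        poll_interval=5.0, sleep=time.sleep):
    """Trigger an evacuation workflow and re-trigger it until verified.

    start_workflow(host): hypothetical callable that submits the workflow;
        a successful return only means the request was accepted.
    evacuation_succeeded(host): hypothetical callable that checks the result.
    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Safe to repeat only because the request is assumed idempotent.
        start_workflow(host)
        for _ in range(polls_per_attempt):
            if evacuation_succeeded(host):
                return attempt
            sleep(poll_interval)
    return None
```

Injecting the two callables and `sleep` keeps the sketch independent of any real client library and makes the retry logic easy to exercise with fakes.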