04:00:17 <samP> #startmeeting masakari
04:00:18 <openstack> Meeting started Tue Jan 16 04:00:17 2018 UTC and is due to finish in 60 minutes.  The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:00:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:00:21 <tpatil> Hi
04:00:21 <openstack> The meeting name has been set to 'masakari'
04:00:27 <samP> tpatil: Hi
04:00:28 <sagara> hi
04:00:36 <samP> sorry for the long absence
04:00:59 <samP> #topic High priority items
04:01:14 <samP> Any high priority items to discuss?
04:01:53 <samP> if any come up, please bring them up any time. Proceeding to the next topic
04:02:01 <samP> #topic Critical bugs
04:02:09 <samP> Any Bugs to discuss?
04:02:22 <tpatil> #link https://review.openstack.org/#/c/531310/
04:02:53 <tpatil> I think this bug should be fixed in the Queens release
04:02:59 <tpatil> I have voted -1
04:03:31 <rkmrHonjo> tpatil: I heard that Takahara is addressing your comments now.
04:03:51 <samP> tpatil: thanks for the review.
04:03:56 <tpatil> rkmrHonjo: Ok
04:04:28 <samP> the comments are not critical, it's an easy fix.
04:04:48 <samP> tpatil: rkmrHonjo: let's merge this in Q
04:04:58 <tpatil> samP: Yes
04:05:08 <rkmrHonjo> samP: ok, I'll tell Takahara.
04:05:15 <tpatil> Another patch : https://review.openstack.org/#/c/486576/
04:05:20 <samP> tpatil: rkmrHonjo: thanks
04:05:48 <tpatil> we should merge this patch as py35 tests are failing on all patches
04:06:13 <tpatil> I have already voted +2, need another +2
04:06:26 <samP> tpatil: I will look into this
04:06:40 <tpatil> samP: ok, Thanks
04:06:45 <rkmrHonjo> samP: thanks.
04:06:53 <samP> tpatil: thanks for review
04:07:00 <samP> rkmrHonjo: thanks for the fix
04:08:17 <rkmrHonjo> Dinesh suggested making the py35 test voting (the current py35 test is non-voting). I think we can change it after that patch is merged.
04:09:04 <samP> rkmrHonjo: sure, let's see how things work after merging the above patch
04:09:18 <rkmrHonjo> ok.
04:09:37 <samP> if there are no problems, then let's make py35 voting
04:10:01 <samP> Any other bugs?
04:10:17 <tpatil> https://bugs.launchpad.net/masakari/+bug/1738340
04:10:18 <openstack> Launchpad bug 1738340 in masakari "When no reserved_host available, nova-compute service on failed host remains enabled" [Undecided,In progress] - Assigned to takahara.kengo (takahara.kengo)
04:11:34 <tpatil> in the fix, a new config option is introduced, so I have commented on the patch asking them to write a lite-spec
04:11:51 <tpatil> should we treat this issue as a bug or as a feature?
04:12:50 <rkmrHonjo> tpatil: I think that this patch doesn't add a new config option.
04:12:53 <tpatil> sorry, no new config option
04:13:58 <rkmrHonjo> tpatil: ok. And I think this is just a bug fix, because this patch doesn't add a new action.
04:14:51 <rkmrHonjo> In the current rh_workflow, the failed host stays enabled and VMs are not evacuated if there is no reserved host.
04:14:55 <samP> rkmrHonjo: tpatil: sorry for the delay, I was trying to understand the problem here
04:15:33 <tpatil> The notification request will be marked as complete, which will give the operator a false impression
04:15:34 <samP> rkmrHonjo: this patch proposes to disable nova-compute on the failed host
04:15:49 <tpatil> after disabling the compute host
04:15:54 <rkmrHonjo> But the failed host will be disabled if this patch is merged. I think that it is not a new action, and it is good from a safety point of view.
04:16:52 <rkmrHonjo> tpatil: Ah, thanks, I understand your opinion.
04:17:03 <tpatil> if the notification request is complete, it means that the failed host was evacuated successfully; but if a reserved host is not available, the workflow simply disables the compute host, which I think isn't sufficient
04:17:40 <samP> tpatil: agree.
04:18:02 <rkmrHonjo> tpatil: What do you think of this idea: "Disable the host, but don't complete the notification"?
04:18:19 <samP> you may disable the compute node but should not complete the recovery
04:18:26 <rkmrHonjo> samP: yes.
04:19:12 <tpatil> samP: agree
04:19:52 <rkmrHonjo> ok, can I (and Takahara) create a new patch according to samP's idea?
04:19:55 <tpatil> in the periodic task, it will try to execute the workflow again, and finally it will give up and set it to failed
04:20:12 <samP> tpatil: correct
04:21:46 <tpatil> rkmrHonjo: I will add this comment on the patch
04:21:56 <samP> I think we should move this step to the very beginning of the flow: (1) disable the compute node (2) run the evacuation
04:22:08 <samP> I will add my comments too
04:22:24 <rkmrHonjo> tpatil: thanks! In that case, should I write a lite-spec?
04:22:40 <tpatil> Just noticed that it is raising a ReservedHostsUnavailable exception
04:23:23 <tpatil> so I think whatever we discussed is already taken care of. I will review the patch again
04:24:06 <rkmrHonjo> tpatil: I got it.
04:24:52 <samP> rkmrHonjo: I will add a comment if a spec is required
04:25:06 <rkmrHonjo> samP: ok.
04:25:32 <rkmrHonjo> I think that a release note is required.
04:27:16 <samP> rkmrHonjo: it depends on how you implement this.
04:27:42 <rkmrHonjo> samP: Yeah. I'll wait for your comments.
04:28:30 <samP> let's do the review first and check what kind of changes are needed
04:28:52 <rkmrHonjo> I got it.
04:28:56 <samP> then we can discuss the spec and release notes
04:29:29 <samP> rkmrHonjo: BTW thanks for bringing up the release note point
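For reference, the recovery ordering agreed above, disable nova-compute on the failed host first and leave the notification incomplete when no reserved host exists so the periodic task can retry, roughly corresponds to the sketch below. It uses python-novaclient for illustration only; the function, parameters, and exception are hypothetical and not taken from the patch under review.

    # Illustrative sketch only, not the patch under review. Assumes
    # python-novaclient and compute API microversion 2.11+ (note that
    # 2.53+ changed service actions to take a service UUID).
    from novaclient import client as nova_client

    def handle_host_failure(session, failed_host, reserved_hosts):
        nova = nova_client.Client('2.11', session=session)

        # Step (1): disable nova-compute on the failed host so the
        # scheduler stops placing new instances there.
        nova.services.disable(failed_host, 'nova-compute')

        # Step (2): only evacuate if a reserved host exists; otherwise
        # raise, leaving the notification incomplete so the periodic task
        # retries and eventually marks it as failed.
        if not reserved_hosts:
            raise RuntimeError('No reserved host available')

        # ... evacuation workflow against the chosen reserved host ...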
04:30:13 <samP> any other bugs/patches to discuss?
04:30:38 <tpatil> samP: not from my side
04:30:50 <rkmrHonjo> no.
04:31:02 <samP> thanks.. let's move to next topic
04:31:11 <samP> #topic Discussion Points
04:32:06 <samP> Sorry that I couldn't follow the work.
04:32:55 <samP> Please proceed if you have any updates on your work.
04:33:16 <tpatil> Regarding horizon dashboard
04:33:31 <tpatil> Niraj has pushed the initial cookiecutter patch
04:33:54 <tpatil> I have voted +2, need another +2
04:34:25 <samP> tpatil: I will check
04:34:33 <rkmrHonjo> oh, sorry, I couldn't review it last week...
04:34:43 <samP> tpatil: Niraj: Thanks
04:37:17 <samP> any other updates?
04:37:18 <rkmrHonjo> Can I talk about my update?
04:37:40 <samP> rkmrHonjo: sure, go ahead
04:37:48 <rkmrHonjo> samP: thanks.
04:37:50 <rkmrHonjo> Call Force down API when host-failure is notified
04:38:03 <rkmrHonjo> tpatil: Takahara replied to you on gerrit. Please check it.
04:38:21 <rkmrHonjo> #link https://review.openstack.org/#/c/526598/
04:39:55 <tpatil> rkmrHonjo: I have read his reply and I agree the evacuation will succeed. But the compute service might still be up and running, which could update instance states
04:41:32 <rkmrHonjo> tpatil: I think that is prevented; there is the forced_down flag. I'll re-confirm it and write it on gerrit.
04:42:29 <tpatil> after the force down API is called, is the compute service still running?
04:44:41 <tpatil> I will test this case and let you know my results
04:45:37 <rkmrHonjo> tpatil: I think that is case-by-case. The force down API doesn't kill the process. But basically, the operator will configure the crm to stop the node. And it is the same as the current implementation (waiting 3 minutes).
04:47:26 <tpatil> rkmrHonjo: If the operator is going to handle the force down notification and kill the compute service on the failed node, then I don't see any problem
04:48:58 <samP> rkmrHonjo: not clear; does the operator configure the crm to catch the "forced_down" flag?
04:50:50 <rkmrHonjo> tpatil: No, the crm doesn't catch the forced_down flag. The operator will configure the crm to catch a host going down. If pacemaker catches the host down, masakari-monitor sends a (host down) notification.
04:51:16 <samP> rkmrHonjo: got it.
04:51:36 <rkmrHonjo> I think that the forced_down flag is referenced by nova. If it is true, nova doesn't change the status back to up.
04:51:46 <samP> your point is, when masakari gets the HostDown notification, the host is already down for sure
04:52:15 <tpatil> rkmrHonjo: I got it
04:52:28 <samP> and it's totally safe to use the force-down API to set the forced_down flag on that binary (not to bring down the binary itself) and proceed with the evacuation workflow
04:54:19 <samP> rkmrHonjo: understood
04:54:49 <samP> we only have 5 mins left
04:57:47 <samP> I think if we make this change to masakari, it means pacemaker or whatever is controlling the cluster must make sure that it kills the node before it sends the host-failed notification to masakari
04:58:18 <samP> I think we have already put this point in our docs (if not, we must)
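For reference, the forced_down behavior discussed above: Nova's force-down API only sets a flag on the service record, it does not stop the nova-compute process, which is why pacemaker (or whatever controls the cluster) still has to fence the node before the host-failure notification reaches masakari. A minimal sketch with python-novaclient, for illustration only; the function and host name are hypothetical.

    # Illustrative sketch only. Assumes python-novaclient and compute API
    # microversion 2.11+ (where the force-down action was introduced).
    from novaclient import client as nova_client

    def mark_compute_forced_down(session, failed_host):
        nova = nova_client.Client('2.11', session=session)
        # Sets the forced_down flag on the service record so Nova treats
        # the host as down immediately instead of waiting for the service
        # report timeout. The nova-compute process itself is NOT stopped
        # by this call; fencing the node remains the cluster manager's job.
        nova.services.force_down(failed_host, 'nova-compute', force_down=True)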
04:58:37 <samP> any other updates?
04:58:51 <tpatil> No
04:59:05 <rkmrHonjo> no.
04:59:10 <samP> Thank you all...
04:59:17 <rkmrHonjo> thank you.
04:59:30 <samP> please use #openstack-masakari or ML with [masakari] for further discussion
04:59:38 <samP> Thank you all
04:59:45 <samP> #endmeeting