09:00:42 #startmeeting blazar
09:00:42 Meeting started Tue Dec 19 09:00:42 2017 UTC and is due to finish in 60 minutes. The chair is masahito. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:44 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:46 The meeting name has been set to 'blazar'
09:00:54 #topic RollCall
09:01:24 o/
09:01:32 o/
09:01:51 priteau, bertys__: hello
09:01:57 Today's agenda is
09:02:09 1. resource monitoring
09:02:12 2. next meeting
09:02:14 3. AOB
09:02:20 anything else?
09:02:39 hiro-kobayashi is out of town today
09:04:01 #topic resource monitoring
09:04:47 This topic was raised by hiro-kobayashi.
09:05:49 As I commented in https://review.openstack.org/#/c/524054/, resource monitoring re-calculates ALL allocations on a failed host.
09:06:23 My comment and hiro's reply are at https://review.openstack.org/#/c/524054/10/blazar/plugins/oshosts/host_plugin.py@690
09:07:06 priteau: do you have a preference or an idea for this, since you're an operator of Blazar?
09:07:23 I haven't seen these comments yet, just a minute
09:07:30 got it.
09:07:36 I had a day off yesterday
09:07:43 np
09:07:49 masahito: this is related to what we discussed last week during the code review. The first challenge is that Masakari, Vitrage and Congress behave differently once a host failure has been detected
09:08:07 masahito: https://review.openstack.org/#/c/526598/1/masakari/engine/drivers/taskflow/host_failure.py
09:08:43 For Masakari for instance, the compute node is first disabled
09:09:18 It's a hard problem because Nova doesn't give information about how long a node might be disabled for
09:09:28 Whereas for Vitrage and Congress, the compute node is marked down
09:09:35 The summary is that resource monitoring tries to re-allocate ALL reservations which use the failed host.
09:10:14 But the question is: should Blazar re-allocate a reservation which will start a year later?
09:10:16 As operators, we often have to do a quick maintenance session on a node which has errors, but it could last only a few minutes
09:10:49 bertys__: thanks for sharing the info.
09:11:51 masahito: I think we could combine "time since the node has been down or disabled" plus "time until the lease start" to come up with something sensible
09:12:31 e.g., if a node has been disabled only for 30 minutes, there is no urgency in reallocating leases that are a month away
09:12:33 I agree. We could define a threshold of e.g. 1 day
09:13:50 we could also introduce a background service that, 1 day ahead of the start time, verifies that resources are still available and tries to re-allocate otherwise
09:13:56 GeraldK: meaning 1 day for detecting failure? or 1-day advance re-allocation?
09:14:25 masahito: 1-day advance re-allocation
09:14:34 GeraldK: got it.
09:15:07 Should be configurable
09:15:11 the operator can set the parameter according to their requirements and preferences
09:15:32 yes, of course.
09:16:08 Looks like approach #2 is better for this
09:16:35 Does this mean that we would implement a new event "before_start_date"?
09:17:25 bertys__: I don't think it's needed.
09:18:44 bertys__: I thought the approach we discussed is that Blazar re-allocates reservations which start within a configured time.
09:19:16 so Blazar checks only the start_time of each lease.
09:20:08 ok, it seems I have misunderstood GeraldK's intention
09:20:58 we may have some misunderstanding here
09:21:17 priteau's idea is to add a decision time frame to detect whether the host has really failed or not.
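[Editor's note: a minimal sketch of the decision discussed above, combining "time since the node has been down" with "time until the lease starts". All names and threshold values here are hypothetical illustrations, not actual Blazar configuration options or APIs.]

    from datetime import datetime, timedelta

    # Hypothetical, operator-configurable thresholds.
    REALLOCATION_HORIZON = timedelta(days=1)      # re-allocate now if the lease starts within this window
    FAILURE_CONFIRMATION = timedelta(minutes=30)  # how long a node must be down before it counts as failed

    def should_reallocate_now(host_down_since, lease_start, now=None):
        now = now or datetime.utcnow()
        # Too early to tell: the outage could be a quick maintenance session.
        if now - host_down_since < FAILURE_CONFIRMATION:
            return False
        # Far-future lease: no urgency, defer and re-check closer to the start time.
        if lease_start - now > REALLOCATION_HORIZON:
            return False
        # Host confirmed down and the lease starts soon: re-allocate immediately.
        return True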
09:21:29 let me try to summarize my proposal:
09:22:28 host down event and reservation start time is more than 1 day (configurable) ahead -> no re-allocation
09:23:23 as we don't know how long the host will be down, a background service / before-start-date event will check 1 day ahead of the start time whether the node is still down. if yes, try to re-allocate
09:23:57 if host down and start time less than 1 day ahead -> immediate re-allocation
09:24:10 does that make sense?
09:25:40 We already have the event processing pool running every 10 seconds. The manager could create an event to remind itself of checking whether the lease is ok
09:25:47 GeraldK: Blazar doesn't re-allocate reservations more than 1 day ahead until the second check of host status, right?
09:26:55 GeraldK: meaning leases 3 days ahead at the time of a host failure are re-allocated 2 days later.
09:27:09 That's my understanding of GeraldK's proposal
09:27:24 me too.
09:27:37 masahito: no, so far it does not. but if we want to omit re-allocation of reservations that are in the far future (>1 day ahead), wouldn't we need such an option?
09:29:07 priteau's proposal to have the manager create an event to remind itself (to check periodically, e.g. every 24 h, or to check 1 day ahead of start time) sounds good to me
09:29:45 And my additional proposal, adding to GeraldK's, is to have a minimum time required to confirm that a host is down
09:29:49 masahito: sorry, I misread your message. yes, that is true.
09:29:52 e.g. 30 seconds, or 5 minutes (configurable)
09:29:57 Users can reserve resources that start in 1 month or a year. So there could be lots of re-allocations.
09:30:52 priteau: it's good to have. the monitor system has polling, so we can use the periodic task.
09:30:55 masahito: that is why I proposed to re-allocate only in the case the node is still unavailable 1 day ahead of the start time
09:31:47 GeraldK: np, my wrong grammar could have misled you.
09:33:57 okay, looks like we have a good idea for the problem.
09:34:50 any comments on the topic?
09:36:31 #topic next meeting
09:37:11 Next week is the last week of this year.
09:37:52 I will be out of office for the next two weeks
09:38:04 So people in many regions could be on holiday.
09:38:37 And I'm also out of office for two weeks.
09:39:45 If there are fewer people we can skip the next two meetings.
09:40:06 That's fine for me
09:40:15 okay.
09:40:32 Then the next meeting is 9th January.
09:40:56 I'll announce it on openstack-dev.
09:41:24 #topic AOB
09:41:51 Does someone have something to share/discuss?
09:43:07 FYI: Q-2 was released last Friday. https://review.openstack.org/#/c/526616/
09:43:48 Thank you everyone in the team! This milestone was good progress for the team.
09:44:12 Congratulations to the team for this milestone.
09:44:21 Are we going to release a new client soon?
09:44:40 We now have the gate jobs to push to PyPI
09:44:44 priteau: thanks for heading this up.
09:44:51 I'll push the patch soon.
09:46:08 Thanks
09:50:16 anything else?
09:50:33 If nothing, we can finish early today.
09:51:36 Thanks everyone
09:51:46 Enjoy the holidays
09:52:39 thx. happy holidays.
09:53:33 Thanks all. Have a good holiday!
09:53:40 #endmeeting
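[Editor's note: a sketch of GeraldK's summarized policy combined with priteau's reminder-event idea, appended for reference. The function names, event labels, and data shapes are hypothetical and only illustrate the decisions agreed in the discussion above; this is not Blazar's actual implementation.]

    from datetime import datetime, timedelta

    REALLOCATION_HORIZON = timedelta(days=1)  # configurable, per the proposal

    def plan_for_failed_host(lease_starts, now):
        # For each (lease_id, start_time) on the failed host, decide between
        # immediate re-allocation and a deferred re-check of the host status.
        plan = []
        for lease_id, start in lease_starts:
            if start - now <= REALLOCATION_HORIZON:
                # Lease starts within 1 day: immediate re-allocation.
                plan.append((lease_id, 'reallocate_now', now))
            else:
                # Far-future lease: schedule a reminder event that the manager's
                # event processing pool (polled every 10 seconds) picks up one
                # horizon before the start time; re-allocate then only if the
                # host is still down.
                plan.append((lease_id, 'recheck_host_status',
                             start - REALLOCATION_HORIZON))
        return plan

    # Example matching the discussion: a lease 3 days ahead at failure time is
    # re-checked (and, if the host is still down, re-allocated) 2 days later.
    now = datetime(2017, 12, 19, 9, 30)
    print(plan_for_failed_host([('lease-a', now + timedelta(hours=6)),
                                ('lease-b', now + timedelta(days=3))], now))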