Wednesday, 2017-04-12

*** zhurong has joined #senlin  00:39
*** XueFeng has joined #senlin  01:06
*** dixiaoli has joined #senlin  01:33
*** yanyanhu has joined #senlin  01:35
<xuhaiwei> morning Qiming, yanyanhu and others. About the concurrency problem I reported the day before yesterday, I found there is a problem here
<xuhaiwei> because when concurrency happens, one action is executed and the other one goes to READY status. This action is supposed to run later, but the scheduler logic finishes checking the action's status before it goes to READY  01:45
<xuhaiwei> so the READY action is left unexecuted, even though the former one has released the lock  01:46
<yanyanhu> hi, xuhaiwei, that action will finally get a chance to run when the next scheduling event comes  01:48
*** yuanbin has quit IRC  01:48
<yanyanhu> although there is a risk that the action could fail if no more scheduling events come  01:49
*** yuanbin has joined #senlin  01:49
<yanyanhu> I mean in case no new action is created  01:49
<xuhaiwei> yanyanhu, but if no other scheduling event comes for a long time, the action will be left there all the time?  01:49
<yanyanhu> xuhaiwei, yes, currently it is  01:50
<yanyanhu> so maybe we can add a periodic task to trigger action scheduling internally  01:51
<yanyanhu> to avoid this issue  01:51
<XueFeng> hi, haiwei, yanyanhu  01:51
<XueFeng>         else:  # result == self.RES_RETRY:  01:51
<XueFeng>             status = self.READY  01:51
<XueFeng>             # Action failed at the moment, but can be retried  01:51
<XueFeng>             # We abandon it and then notify other dispatchers to execute it  01:51
<XueFeng>             ao.Action.abandon(self.context,
<xuhaiwei> so the scheduler should be monitoring the action pool all the time, instead of only starting to work when triggered by someone  01:51
<xuhaiwei> yanyanhu, exactly  01:51
<yanyanhu> oh, check what XueFeng posted  01:52
<xuhaiwei> XueFeng, 'abandon' doesn't mean abandoning the action  01:52
<xuhaiwei> # We abandon it and then notify other dispatchers to execute it  01:53
<XueFeng> yes, the status is READY  01:53
<yanyanhu> XueFeng, could you please paste the link to that code section  01:53
<XueFeng> and the reason string is about abandoning  01:53
<yanyanhu> let me take a look :) if the comment is accurate, the problem haiwei mentioned won't exist  01:54
<yanyanhu> oh, this is different  01:55
<XueFeng> currently, we notify nothing  01:55
<yanyanhu> it is just for a worker/engine to abandon an action which has been locked by itself  01:56
<XueFeng> so the ready action will only be picked when the next action comes  01:56
<yanyanhu> and then allow other workers to pick it up for scheduling  01:56
<yanyanhu> XueFeng, they are two different cases I think  01:56
<yanyanhu> they are for two different cases  01:56
<yanyanhu> action_abandon is to avoid action deadlock I guess  01:57
<yanyanhu> while the problem xuhaiwei mentioned is more about scheduling  01:57
<XueFeng> I remember I met the problem when I did cluster_resize  01:58
<yanyanhu> once an action is acquired and locked by a worker/engine, no other engine or worker can acquire it anymore. So there could be cases where the action owner wants to give up this action and allow other workers to acquire it  01:58
<yanyanhu> I guess this db api is for this purpose  01:58
<xuhaiwei> yanyanhu: I think the scheduling logic needs to be modified  01:58
<yanyanhu> xuhaiwei, there can be an improvement I think  01:59
<XueFeng> two cluster_resize commands coming in quick succession  01:59
<yanyanhu> XueFeng, you mean the same issue xuhaiwei met?  02:00
<XueFeng> not sure  02:02
<yanyanhu> it could be, if you observed a ready action hanging there without being scheduled for a long time :)  02:02
<XueFeng> I think we have a problem in
*** openstackgerrit has joined #senlin  02:03
*** ChanServ sets mode: +v openstackgerrit  02:03
<openstackgerrit> RUIJIE YUAN proposed openstack/senlin master: handle node which status is WARNING
<XueFeng> yanyanhu, yes  02:03
<XueFeng> it's easy to reproduce  02:03
<XueFeng> we expect another senlin engine to schedule the action  02:04
<XueFeng> but most of the time we only run one engine worker  02:04
<yanyanhu> so maybe simply adding "dispatcher.start_action(action_id)" after this line can address the issue?  02:05
<yanyanhu> this line  02:05
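The fix under discussion, roughly: after abandoning a RES_RETRY action back into READY, immediately notify the dispatcher instead of waiting for the next unrelated scheduling event. A simplified, self-contained sketch; the real logic lives in Senlin's `Action.set_status()`, and `store` and `dispatcher` here are hypothetical stand-ins for the DB layer and `senlin.engine.dispatcher`:

```python
RES_OK, RES_RETRY = 'OK', 'RETRY'

class Action:
    """Stripped-down stand-in for senlin.engine.actions.base.Action."""

    def __init__(self, action_id, store, dispatcher):
        self.id = action_id
        self.store = store            # hypothetical DB-layer stand-in
        self.dispatcher = dispatcher  # hypothetical dispatcher stand-in

    def set_status(self, result):
        if result == RES_OK:
            self.store.mark_succeeded(self.id)
        else:  # result == RES_RETRY
            # The action failed for now but can be retried: release our
            # lock so any worker may pick it up again ...
            self.store.abandon(self.id)
            # ... and (the proposed fix) actively re-trigger scheduling
            # rather than waiting for the next action to arrive.
            self.dispatcher.start_action(self.id)
```

As XueFeng notes just below, this can loop if the target stays locked, so the retry still depends on the in-progress action eventually releasing the cluster/node lock.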
<XueFeng> maybe not :)  02:06
<XueFeng> we reschedule the action, but it can also fail  02:06
<xuhaiwei> yanyanhu: it is meaningless to set READY status on an action if you do so  02:06
<XueFeng> and maybe it will go into a dead loop  02:07
<yanyanhu> yes. but once the code execution reaches line 320, the action will be freed and re-fired again  02:07
<yanyanhu> XueFeng, that could be. But that means the target obj (cluster/node) is always locked  02:07
<yanyanhu> we can do nothing about that I guess  02:08
<XueFeng> right, so we can try the idea  02:08
<yanyanhu> currently, we don't promise strict time-sequence-based operations in senlin  02:08
<xuhaiwei> yanyanhu: you mean if an action fails, we try it again and again, until it gets the lock to run?  02:08
<yanyanhu> users should be aware of this point  02:08
<yanyanhu> xuhaiwei, yes. If the user doesn't control it  02:09
<yanyanhu> by checking the action or op target status themselves  02:09
<xuhaiwei> yanyanhu, it may work, but it's not an intelligent way  02:09
<yanyanhu> xuhaiwei, yeah, it's not perfect. It's just that on the internet, no one knows which API request REALLY comes first and will be handled first, especially when you have multiple API service instances running  02:11
<yanyanhu> so we may expect users to have some logic on their side to prepare the operation sequence  02:11
<xuhaiwei> yanyanhu, if the user writes the senlin resource in a heat template, the user can't control the sequence  02:12
<yanyanhu> e.g. you request multiple cluster scaling operations at the same time; you won't know which one will get executed first...  02:12
<yanyanhu> although the final result should be the same  02:12
<yanyanhu> xuhaiwei, yes. For a limited number of operations (with random sequence), senlin should guarantee that the final result is consistent  02:13
<yanyanhu> however, we can't ensure the sequence in which each operation happens  02:13
<xuhaiwei> yes, that's not the important part  02:14
<yanyanhu> so if you do file LOTS of operations targeting the same cluster/node, some of them could wait for a long while before getting a chance to be executed  02:14
<yanyanhu> an alternative is adding timeout logic for actions  02:15
<yanyanhu> not the current timeout, which only takes effect when the action is scheduled  02:15
<yanyanhu> we log the timestamp when the action becomes ready; once the elapsed time exceeds a threshold, e.g. 24 hours, we mark the action as failed  02:16
<yanyanhu> however, this could be inappropriate in some cases where the action really does take a long time to finish  02:16
<yanyanhu> so this can be an option  02:16
<yanyanhu> I mean a configuration option  02:17
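The ready-timeout idea yanyanhu floats could look roughly like this. The record fields (`status`, `ready_at`) and the `max_ready_wait` option name are made up for illustration; Senlin at this point only tracks `created_at` and `updated_at`, so an extra per-status timestamp would be needed first:

```python
import time

def expire_stale_ready_actions(actions, now=None, max_ready_wait=24 * 3600):
    """Mark actions FAILED if they have sat in READY beyond the threshold.

    actions: iterable of dicts with 'id', 'status' and 'ready_at'
             (epoch seconds when the action last became READY).
    max_ready_wait: the hypothetical configuration option, in seconds.
    Returns the IDs of the actions that were expired.
    """
    now = now if now is not None else time.time()
    expired = []
    for action in actions:
        waited = now - action['ready_at']
        if action['status'] == 'READY' and waited > max_ready_wait:
            action['status'] = 'FAILED'
            expired.append(action['id'])
    return expired
```

Making the threshold a config option keeps the behavior opt-in for deployments whose actions legitimately wait a long time, which is the concern raised above.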
<xuhaiwei> so your suggestion is to add a 'try again' logic when the action can't get the lock?  02:18
<XueFeng> yanyanhu, xuhaiwei, action_acquire_random_ready may need optimization  02:19
<yanyanhu> xuhaiwei, yes, that could help address the issue you met. But for an action waiting a long time without being scheduled, we may need another solution  02:19
<yanyanhu> XueFeng, you mean?  02:20
<xuhaiwei> yanyanhu, I would suggest improving the scheduling logic to make the READY action executable  02:21
<yanyanhu> xuhaiwei, yes, that will also work. It just still can't resolve the problem of an action waiting too long to be scheduled...  02:21
<yanyanhu> e.g. a cluster is locked for 24 hours for some reason, e.g. scaling or maintenance; any other action targeting it will keep failing until the in-progress action finishes  02:23
<yanyanhu> keep failing and retrying  02:23
<XueFeng> we pick a ready action randomly. For a cluster/node, we could pick it by create time  02:25
<XueFeng> add the target and create_time  02:26
<yanyanhu> XueFeng, you mean order the actions by their timestamp first?  02:27
<yanyanhu> then pick up the oldest one  02:27
<XueFeng> when an action is running, the cluster or node is locked  02:30
<XueFeng> and the later actions for the cluster/node can't get the lock, and may go back to ready again  02:31
<XueFeng> then for that target we'd better pick the action by create time  02:31
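XueFeng's idea, sketched: instead of acquiring a random READY action, order the READY actions for a given target (cluster/node) by creation time and take the oldest. The field names are illustrative, not Senlin's actual schema:

```python
def pick_oldest_ready(actions, target):
    """Return the oldest READY action for the given cluster/node, or None.

    actions: iterable of dicts with 'status', 'target' and 'created_at'.
    """
    ready = [a for a in actions
             if a['status'] == 'READY' and a['target'] == target]
    # Oldest first: smallest created_at wins.
    return min(ready, key=lambda a: a['created_at']) if ready else None
```

As yanyanhu points out next, strict oldest-first has a starvation risk: if the oldest action keeps failing and returning to READY, younger actions for the same target never get picked, unless the timestamp is refreshed each time the action is put back.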
<XueFeng> yes, there will be another problem  02:36
<yanyanhu> yes... so my suggestion is retriggering the action scheduling after we mark it ready in that position, or, as xuhaiwei suggested, optimizing our scheduler to add periodic scheduling logic  02:37
<XueFeng> we can do this first  02:38
<yanyanhu> currently, the senlin scheduler works in a way that combines both tickless and event-driven  02:39
<yanyanhu> XueFeng, yes  02:39
<yanyanhu> then we consider how to better handle the situation where a ready action waits too long to be scheduled  02:39
<XueFeng> and I considered it again; maybe waiting too long to be scheduled will not happen frequently  02:41
<XueFeng> here it was rescheduled because the action couldn't get the target lock. And once it gets the lock, it will run to success/failure...  02:43
<yanyanhu> XueFeng, uhm, yes, although it could happen, depending on the use case. But anyway, the user can handle it as well by checking the action and target cluster/node status  02:43
<XueFeng> ok, we can try this way first  02:44
<XueFeng> I will do a test now  02:45
<yanyanhu> XueFeng, great, thanks a lot :)  02:45
<XueFeng> my pleasure :)  02:46
<XueFeng> root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 2  02:49
<XueFeng> Request accepted by action: e43d5b4e-fd6a-409b-adab-5ecec25c84ef  02:49
<XueFeng> root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 3  02:49
<XueFeng> Request accepted by action: e1863376-70cd-4605-9611-b99dc546be6a  02:49
<XueFeng> it's easy to reproduce now  02:50
<XueFeng> and I will change the code to see the effect  02:51
<openstackgerrit> Qiming Teng proposed openstack/senlin master: Fix ovo object for requests
*** zhurong has quit IRC  04:06
<openstackgerrit> OpenStack Proposal Bot proposed openstack/python-senlinclient master: Updated from global requirements
<openstackgerrit> OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements
<openstackgerrit> OpenStack Proposal Bot proposed openstack/senlin-dashboard master: Updated from global requirements
*** zhurong has joined #senlin  04:34
<openstackgerrit> XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action
*** shu-mutou-AWAY is now known as shu-mutou  05:45
<openstackgerrit> XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action
*** yuanying_ has joined #senlin  06:48
*** yuanying has quit IRC  06:48
<Qiming> @everyone, py35 jobs are now demoted to non-voting  07:44
<Qiming> it may take some time for this change to propagate to all CI nodes; then we won't get blocked by the py35 jobs for critical patches  07:45
<Qiming> we can fix the py35 job later when the root cause is identified  07:45
<XueFeng> ok, got it  08:05
<openstackgerrit> RUIJIE YUAN proposed openstack/senlin master: revise engine cluster obj to update runtime data
<openstackgerrit> Merged openstack/senlin master: fix node do_check invalid code
<openstackgerrit> Qiming Teng proposed openstack/senlin master: Pike-1 release notes
*** yanyanhu has quit IRC  10:50
*** dixiaoli has quit IRC  11:06
*** zhurong has quit IRC  11:26
<openstackgerrit> yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover
<openstackgerrit> yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover
<openstackgerrit> Merged openstack/senlin master: Fix scheduing problem about abandon action
<openstackgerrit> OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements
<openstackgerrit> Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments
*** shu-mutou is now known as shu-mutou-AWAY  11:55
<openstackgerrit> Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments
*** catintheroof has joined #senlin  13:27
*** rate has joined #senlin  13:28
*** zhurong has joined #senlin  13:46
*** zhurong_ has joined #senlin  13:59
*** zhurong has quit IRC  14:01
*** rate has quit IRC  14:13
*** rate has joined #senlin  14:19
*** rate has quit IRC  14:48
*** rate has joined #senlin  14:55
*** zhurong_ has quit IRC  14:57
*** rate has quit IRC  15:06
*** rate has joined #senlin  15:07
*** rate has quit IRC  15:11
<openstackgerrit> Merged openstack/senlin master: Updated from global requirements
-openstackstatus- NOTICE: Restarting Gerrit for our weekly memory leak cleanup.  21:27
*** Qiming has quit IRC  21:41
*** Qiming has joined #senlin  21:46
*** catintheroof has quit IRC  22:48

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at!