#openstack-meeting log

13:00:22 <Qiming> #startmeeting senlin
13:00:23 <openstack> Meeting started Tue Aug 15 13:00:22 2017 UTC and is due to finish in 60 minutes.  The chair is Qiming. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:27 <openstack> The meeting name has been set to 'senlin'
13:00:49 <Qiming> evening
13:01:21 <ruijie> evening Qiming
13:02:12 <Qiming> hi
13:02:55 <Qiming> network connection is no good today
13:03:20 <ruijie> my computer went down.. need to reinstall OS..
13:03:30 <Qiming> now?
13:03:38 <Qiming> ;)
13:03:42 <ruijie> later :)
13:04:30 <Qiming> I'm looking at the etherpad
13:04:43 <Qiming> checking if there are critical things to be completed by this release
13:05:27 <Qiming> liyi has been a shy hero recently
13:05:27 <ruijie> em,yes Qiming, I just proposed a commit about the scheduler and actions
13:06:56 <Qiming> it is a little bit complex
13:07:02 <Qiming> need some time to digest
13:07:24 <ruijie> sure
13:08:10 <Qiming> can you elaborate the points of the change made?
13:08:40 <ruijie> yes Qiming, basically I want to use one backend thread to process all the actions belonging to the engine service
13:09:17 <ruijie> and the dispatcher just tells the scheduler to queue the actions whose status is READY
13:09:17 <Qiming> one "backend" thread?
13:10:06 <Qiming> so when we are creating a cluster of 10 nodes
13:10:30 <Qiming> the 10 NODE_CREATE actions will now get executed sequentially?
13:10:32 <ruijie> currently we use dispatcher.start_action() to notify the scheduler to work, and each request enters the scheduler, so there might be a lot of requests trying to grab the action AND threads to process it
13:11:55 <ruijie> in fact, the start_action() method will not process the action it gets from the DB; it will be queued, and the backend thread will grab a thread from the thread pool to process it
13:13:02 <Qiming> but there is a single backend thread now
13:13:24 <ruijie> yes Qiming..
13:15:04 <Qiming> although there is no real multi-threading in Python, this patch doesn't look like an improvement to me
13:17:04 <ruijie> em, it may reduce thread conflicts when a lot of requests come in :)
13:17:33 <Qiming> but it will hurt performance badly
13:18:05 <ruijie> yes, one thread will be used.. okay
13:18:14 <Qiming> please think again
13:18:42 <ruijie> and another thing is that we grab action randomly
13:19:07 <Qiming> yes, that is not-so-good scheduling
13:19:15 <Qiming> can be improved for sure
13:19:20 <ruijie> in some cases, some actions may never get processed
13:19:53 <Qiming> for example, fix that db call to be acquire_first_ready()
13:20:23 <Qiming> by "first_ready", I mean we sort READY actions in DB by start_time or created_at etc
13:20:59 <ruijie> Qiming, you mean :acquire_random_in(latest=20) or acquire_the_first_one()?
13:21:28 <Qiming> acquire_the_first_one
13:21:50 <Qiming> to ensure no action will be starving
13:21:57 <ruijie> but this will increase the risk of deadlock?
13:22:10 <ruijie> all the schedulers want to get the first one
13:22:15 <Qiming> no ... we only acquire READY actions
13:22:49 <Qiming> if there are action dependencies, the dependent won't be selected because they are not READY
13:23:08 <ruijie> but we broadcast the request to all dispatchers
13:23:17 <Qiming> that is fine
13:23:27 <Qiming> all threads are assumed to be dummy workers
13:23:44 <Qiming> they don't (shouldn't) know the semantics or dependencies
13:23:57 <Qiming> they just grab a ready action and do it
13:24:22 <ruijie> okay, that makes sense
13:24:35 <Qiming> we pushed all synchronization problems to the db layer
13:24:52 <Qiming> instead of handling them at different layers --- a common source of dead locks
13:25:28 <Qiming> if there are concurrency problems, we look into the db records, we blame and fix sqlalchemy calls
13:26:06 <ruijie> one problem we met is: cluster action timeout, but the depends and dependents are still there..
13:26:41 <Qiming> that means one or two db calls are not thread safe
13:27:25 <ruijie> yes Qiming, we want to delete all the records when action timeout, but the node action are still executing..
13:27:31 <ruijie> that is a known problem?
13:27:43 <Qiming> I spent a lot of time looking into sqlalchemy doc, trying to learn some best practices
13:28:18 <Qiming> the cluster action depends on the node actions
13:28:39 <Qiming> if node action is still running, you are not supposed to delete the cluster action, right?
13:29:30 <ruijie> yes Qiming
13:29:32 <Qiming> we have a signal call before
13:30:08 <Qiming> "action_signal"
13:30:22 <Qiming> it was designed for this purpose
13:31:04 <Qiming> the intent is to have an action occasionally check if it has received a signal ...
13:31:13 <Qiming> if it does, it will abort
13:32:01 <Qiming> it looks like we have not yet used that weapon
13:32:22 <ruijie> not yet, cluster-resize 100 --> cluster action timeout --> mark_timeout(delete all dependents) --> return : node actions left, locks, dependents
13:33:11 <Qiming> yes, we are talking about the same thing
13:33:32 <Qiming> I was proposing to add "action_signal" call in the "mark_timeout" logic
13:33:55 <Qiming> so mark_timeout can kill the depended node actions
13:35:34 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n260
13:35:39 <ruijie> its hard to tell whether they are blocked or just process slowly
13:36:05 <Qiming> http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n366
13:37:01 <Qiming> such a logic may be pushed down to db layer as well
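
The proposal above (have mark_timeout send a signal so that the node actions abort themselves) could look roughly like the sketch below. Everything here is hypothetical: the `Action` class, its `children` attribute, the "CANCEL" signal value, and this `mark_timeout()` are illustration-only names, not the code behind the two base.py links. The signal is kept in memory for brevity, whereas the discussion suggests it would live in the DB.

```python
import threading


class Action:
    """Hypothetical action that polls for a signal between work steps."""

    def __init__(self, action_id, children=()):
        self.id = action_id
        self.status = "READY"
        self.children = list(children)  # actions this one is waiting on
        self._signal = None
        self._lock = threading.Lock()

    def signal(self, cmd):
        with self._lock:
            self._signal = cmd

    def is_cancelled(self):
        with self._lock:
            return self._signal == "CANCEL"

    def run(self, steps):
        """Do `steps` units of work; return how many were completed."""
        self.status = "RUNNING"
        for step in range(steps):
            # Occasionally check whether a signal has arrived; abort if so.
            if self.is_cancelled():
                self.status = "CANCELLED"
                return step
            # ... one unit of real work would happen here ...
        self.status = "SUCCEEDED"
        return steps


def mark_timeout(cluster_action):
    """Mark the cluster action as timed out and signal its node actions."""
    cluster_action.status = "TIMEOUT"
    for node_action in cluster_action.children:
        node_action.signal("CANCEL")


nodes = [Action("node-create-%d" % i) for i in range(3)]
cluster = Action("cluster-resize", children=nodes)

done_early = nodes[0].run(5)   # finishes before the timeout fires
mark_timeout(cluster)
done_late = nodes[1].run(5)    # sees the signal and aborts immediately
print(done_early, nodes[0].status, done_late, nodes[1].status)
```

With this shape the cluster action does not delete records out from under running node actions; each node action stops at its next check and can release its own locks.
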
13:40:54 <Qiming> by the way, I hope team has noticed liyi's contribution
13:41:23 <Qiming> he has proposed quite a few high quality patches
13:41:41 <ruijie> yes, liyi is doing great jobs
13:41:57 <Qiming> he doesn't show up on IRC
13:42:19 <Qiming> but that won't be a blocking factor for hiring him/her
13:42:26 <XueFeng> yes
13:42:53 <Qiming> XueFeng, is liyi from your team?
13:43:24 <XueFeng> He comes from another company
13:43:50 <Qiming> okay
13:44:22 <Qiming> I'll contact him/her and see if there is interest to become one of us
13:44:48 <XueFeng> It's good
13:44:57 <XueFeng> +1
13:45:10 <ruijie> #vote
13:45:34 <XueFeng> I contacted him before; he does good work on the API
13:47:00 <Qiming> he knows sdk
13:47:38 <Qiming> it would be great to have more eyes on gating the contributions
13:48:05 <Qiming> last thing in my mind for today is about high priority bugs
13:48:27 <Qiming> liyi has proposed several patches related to node adoption
13:48:43 <XueFeng> Yes, we should keep our team and meetings running smoothly
13:48:47 <Qiming> most of them are about some cases we haven't thought about
13:49:15 <XueFeng> Will review these patches
13:49:25 <Qiming> great
13:49:51 <Qiming> senlin is already in rdo?
13:49:55 <XueFeng> About bugs, I think there are no high priority bugs
13:50:14 <Qiming> senlinclient and dashboard are not there yet?
13:50:27 <XueFeng> Yes
13:50:48 <XueFeng> Senlin server (API and Engine) is in rdo
13:50:52 <Qiming> there are hands and eyes on it?
13:50:57 <Qiming> on them?
13:51:05 <XueFeng> Senlinclient is in process
13:51:06 <ruijie> https://bugs.launchpad.net/senlin/+bug/1710834
13:51:07 <openstack> Launchpad bug 1710834 in senlin "physical id should be None when creation process failed" [Undecided,In progress] - Assigned to RUIJIE YUAN (cnjie0616)
13:52:09 <Qiming> ... "UNKNOWN" .. what's that for?
13:52:14 <ruijie> for heat stack there is no such problem
13:52:22 <ruijie> for the exception message ..
13:52:27 <ruijie> can we just remove it
13:52:51 <Qiming> cannot recall why we set it that way
13:52:58 <Qiming> there must be a reason
13:53:13 <ruijie> we raised ResourceException when creation failed..
13:53:14 <Qiming> but I'm fine with setting it to None
13:53:39 <ruijie> and the resource_id was set to 'UNKNOWN'
13:54:14 <Qiming> I cannot recall why we explicitly set it to 'UNKNOWN'
13:54:38 <Qiming> but I believe there was a reason, not just for exception message
13:54:59 <ruijie> compute.create() may not return server and we may not have server.id, then we are not supposed to show the exception message with server_id ..
13:55:12 <ruijie> can we just remove it from the Exception class
13:55:34 <Qiming> the problem is ...
13:55:51 <Qiming> it may return a server, but that server won't get active
13:56:18 <ruijie> yup ..
13:56:31 <Qiming> if it is not returning a server record, we should set it to None certainly
13:57:13 <ruijie> okay Qiming, will do it in profile layer
13:57:22 <Qiming> thx
13:57:53 <ruijie> np :)
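
The outcome agreed above for bug 1710834 can be sketched as follows. This is an assumption-laden illustration, not the Senlin profile code: `ResourceError`, `do_create()`, and the two callables passed in are hypothetical stand-ins. It shows the two cases from the discussion: no server record comes back (the id stays None), or a server comes back but never becomes active (the real id is kept).

```python
class ResourceError(Exception):
    """Hypothetical error type; resource_id is None when nothing was created."""

    def __init__(self, kind, message, resource_id=None):
        self.resource_id = resource_id
        target = kind if resource_id is None else "%s '%s'" % (kind, resource_id)
        super().__init__("Failed in creating %s: %s" % (target, message))


def do_create(compute_create, wait_for_active):
    """Create a server and return its id, or raise ResourceError."""
    server_id = None
    try:
        server = compute_create()
        server_id = server["id"]
        wait_for_active(server_id)
    except Exception as ex:
        # server_id is still None if compute_create() itself failed.
        raise ResourceError("server", str(ex), resource_id=server_id) from ex
    return server_id


def failing_create():
    raise RuntimeError("quota exceeded")


def ok_create():
    return {"id": "srv-1"}


def never_active(server_id):
    raise RuntimeError("timeout waiting for ACTIVE")


def failed_id(create, wait):
    try:
        do_create(create, wait)
    except ResourceError as err:
        return err.resource_id, str(err)


no_record = failed_id(failing_create, never_active)
not_active = failed_id(ok_create, never_active)
print(no_record)
print(not_active)
```

Keeping the id in the second case matters because the half-created server still exists and has to be cleaned up later.
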
13:58:41 <Qiming> alright, time is up
13:58:47 <XueFeng> OK
13:58:57 <XueFeng> Good night
13:59:09 <Qiming> thanks for joining!
13:59:13 <Qiming> #endmeeting