16:05:18 <ddeja> #startmeeting mistral
16:05:18 <openstack> Meeting started Mon Sep 19 16:05:18 2016 UTC and is due to finish in 60 minutes.  The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:05:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:05:21 <openstack> The meeting name has been set to 'mistral'
16:05:34 <ddeja> hello
16:06:26 <rakhmerov> hi
16:06:28 <rakhmerov> I was finally able to join
16:06:30 <rakhmerov> sorry
16:06:36 <ddeja> oh, cool
16:06:36 <d0ugal> Hey
16:06:40 <rakhmerov> ddeja: still here?
16:06:43 <rakhmerov> d0ugal: hi hi )
16:07:02 <ddeja> rakhmerov: I've just read your mail and I've just started the meeting
16:07:26 <rakhmerov> ok, good
16:07:38 <rakhmerov> ddeja: please keep in mind that you'll have to finish it
16:07:41 <rakhmerov> because you started it
16:08:00 <mgershen> hi
16:08:06 <rakhmerov> so, let's sync up quickly
16:08:10 <rakhmerov> mgershen: hi!
16:08:14 <ddeja> rakhmerov: yes, I know. Not the first time chairing ;)
16:08:27 <ddeja> #topic Review action items
16:08:39 <rakhmerov> ok :)
16:08:43 <rakhmerov> thanks a lot
16:08:53 <rakhmerov> you saved my ...
16:09:22 <rakhmerov> ddeja: I'm not sure if you have any AIs
16:09:34 <rakhmerov> we skipped the last 2 meetings, I guess
16:09:38 <ddeja> oh, ok
16:09:41 <d0ugal> Yeah, probably not because of that
16:09:41 <rakhmerov> yeah
16:09:50 <ddeja> #topic Current status (progress, issues, roadblocks, further plans)
16:10:07 <rakhmerov> sorry for that, I've been extremely busy the last couple of months, and I've been travelling for 2 weeks now
16:10:37 <d0ugal> rakhmerov: No problem, maybe we should move the meeting to a time that is easier for you? but that is a different discussion :)
16:10:56 <rakhmerov> yeah, we were supposed to do that a long time ago :)
16:11:03 <rakhmerov> it's not convenient for many people
16:11:15 <rakhmerov> it's my debt
16:11:22 <d0ugal> :)
16:11:41 <rakhmerov> my status: still working on stability and performance improvements; last week I made some great changes that make Mistral work much faster on large workflows
16:11:45 <mgershen> status: I have some code on review in rally (yes, still...), but internal things take most of my time.
16:12:15 <rakhmerov> now optimizing processing of workflow context
16:12:24 <d0ugal> TripleO integration is taking most of my time, so not much to report (other than the bug I added to the agenda :) )
16:12:36 <rakhmerov> mgershen: can you please add us as reviewers?
16:12:56 <d0ugal> mgershen: or just link it?
16:13:00 <rakhmerov> d0ugal: ok, this is the main thing we probably need to discuss today
16:13:04 <ddeja> my status: mostly testing, found one bug and a root cause for another; besides that, a little bit of reviewing (but too little!)
16:13:17 <mgershen> sure, I'll find the link
16:13:26 <rakhmerov> ok
16:13:52 <mgershen> I have changes to do, hopefully I will have time soon... https://review.openstack.org/#/c/358352
16:14:01 <rakhmerov> ok, thanks
16:14:15 <rakhmerov> so, just before we move forward
16:14:25 <rakhmerov> please keep in mind that RC1 is released
16:14:39 <rakhmerov> and master is now open for developing new features
16:14:56 <d0ugal> rakhmerov: There is no newton branch yet
16:15:05 <d0ugal> rakhmerov: so I don't think master should be open?
16:15:08 <rakhmerov> from now on we'll be backporting only bug fixes into stable/newton
16:15:28 <rakhmerov> d0ugal: it should be created, I saw an email from Doug
16:15:31 <rakhmerov> let me check
16:15:36 <d0ugal> I asked in #openstack-release earlier, they said it should be done today
16:15:39 <d0ugal> but I still can't see it
16:16:00 <ddeja> rakhmerov: yup, there is no stable/newton branch
16:16:00 <rakhmerov> yeah, true
16:16:13 <rakhmerov> yes, hm.. it's kinda weird
16:16:20 <d0ugal> I agree :)
16:16:23 <rakhmerov> maybe something was broken in their toolkit
16:16:32 <rakhmerov> for making releases
16:16:36 <rakhmerov> ok, anyway
16:16:50 <rakhmerov> ddeja: let's move on?
16:17:40 <ddeja> rakhmerov: yup
16:17:58 <ddeja> #topic (d0ugal) MessagingTimeout when executing mistral actions https://bugs.launchpad.net/mistral/+bug/1624284
16:18:00 <openstack> Launchpad bug 1624284 in Mistral "MessagingTimeout when executing mistral actions" [Critical,Confirmed] - Assigned to Dawid Deja (dawid-deja-0)
16:18:32 * rakhmerov Renat is reading again..
16:18:36 <d0ugal> Okay, so for anyone unfamiliar, the last comment on that bug from ddeja is a good summary
16:19:08 <rakhmerov> d0ugal: yes
16:19:46 <rakhmerov> ddeja: does it help if engine and executor are running in separate processes?
16:19:56 <ddeja> rakhmerov: no
16:20:00 <rakhmerov> ok
16:20:04 <rakhmerov> just for my info
16:20:08 <ddeja> rakhmerov: I have such configuration on my devstack
16:20:41 <ddeja> and it doesn't matter
16:20:52 <rakhmerov> ok
16:21:12 <rakhmerov> ok, I'm reading these 4 steps that you pointed out
16:21:29 <rakhmerov> and I'm not sure I understand the problem 100%
16:21:37 <rakhmerov> so, again
16:21:52 <rakhmerov> engine sends a request to run "std.sleep"
16:21:59 <rakhmerov> executor sleeps for 30 sec
16:22:14 <ddeja> rakhmerov: yes, but the request is a workflow (it's important)
16:22:30 <rakhmerov> which one?
16:22:36 <ddeja> the first one
16:22:42 <ddeja> std.sleep is an action in workflow
16:22:55 <rakhmerov> ooh, ok
16:23:17 <rakhmerov> reading again...
16:23:35 <rakhmerov> I don't understand #4
16:23:45 <rakhmerov> "Executor sends *sync* request: I woke up!"
16:24:02 <rakhmerov> ddeja: can you explain it?
16:24:17 <rakhmerov> what did you mean by "I woke up!"?
16:24:48 <ddeja> rakhmerov: Oh, that can be misleading
16:24:54 <ddeja> it is just sending the action results
16:25:07 <rakhmerov> for run-action ?
16:25:11 <ddeja> no
16:25:12 <rakhmerov> ooh, I got it
16:25:22 <rakhmerov> but for what?
16:25:44 <ddeja> it is for action run as a task t1 from 'sleep' workflow
16:26:07 <rakhmerov> ok
16:26:19 <rakhmerov> and why do we have a deadlock?
16:26:25 <ddeja> so
16:26:55 <ddeja> the engine sends a request to the executor: 'run action std.sleep'. Since this action is part of a workflow, the request is async
16:27:05 <ddeja> which means, that we send a message via RPC and move on
16:27:05 <rakhmerov> yes
16:27:06 <rakhmerov> ok
16:27:10 <rakhmerov> yes
16:27:15 <rakhmerov> on engine side
16:27:48 <ddeja> on the engine side, nothing is happening right now. On the executor side, it goes to sleep (which simulates any long-running task)
16:28:03 <rakhmerov> yes
16:28:22 <ddeja> while the executor is doing the 'long-running task', the API sends the engine another request, to run action std.noop
16:28:35 <rakhmerov> ok
16:28:56 <ddeja> the engine accepts the request, and since this is a 'run-action', not part of a workflow, it sends a request to the executor synchronously
16:29:05 <rakhmerov> yep
16:29:13 <ddeja> but the executor is doing its previous job
16:29:17 <ddeja> so, engine waits
16:29:26 <rakhmerov> yes
16:29:31 <ddeja> after some time, the executor finishes its first job
16:29:39 <rakhmerov> so essentially it's not a real deadlock
16:29:41 <ddeja> and wants to send the result back to the engine
16:29:50 <ddeja> and it does so synchronously
16:29:52 <rakhmerov> it's just run-action fails with timeout, right?
16:30:20 <rakhmerov> ooh, no
16:30:26 <rakhmerov> ok, it's a real deadlock
16:30:28 <rakhmerov> now I see
16:30:29 <ddeja> so it waits for the engine to reply to its message, but at the same time the engine is waiting for the executor to answer its message
16:30:32 <ddeja> yup
16:30:37 <rakhmerov> yes, gotcha
16:30:50 <rakhmerov> it can't even send a result for 'sleep'
16:30:56 <ddeja> yes
16:31:00 <rakhmerov> because RPC subsystem is busy
16:31:05 <rakhmerov> so
16:31:20 <ddeja> well, it does send it eventually, because the first message times out and the engine starts to operate again
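To make the four steps above concrete, here is a toy, in-process Python model of two single-threaded RPC servers. It is a sketch under stated assumptions, not Mistral or oslo.messaging code: BlockingServer, MessagingTimeout, simulate() and the handler names are hypothetical, in-memory queues stand in for the message broker, and the 30-second std.sleep is scaled down to 0.2 s. With report_sync=True the executor reports the std.sleep result with a blocking call, so the engine's synchronous run-action for std.noop times out; with report_sync=False the executor casts the result instead and the same scenario completes.

```python
import queue
import threading
import time


class MessagingTimeout(Exception):
    """Local stand-in for a messaging timeout (not the oslo.messaging class)."""


class BlockingServer:
    """Toy RPC server that handles one message at a time, like a server
    configured with a single-threaded 'blocking' executor (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.handlers = {}
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self.inbox.put((None, (), None))  # sentinel ends the loop

    def _loop(self):
        while True:
            method, args, reply_q = self.inbox.get()
            if method is None:
                return
            result = self.handlers[method](*args)
            if reply_q is not None:
                reply_q.put(result)

    def cast(self, method, *args):
        """Async request: enqueue the message and return immediately."""
        self.inbox.put((method, args, None))

    def call(self, method, *args, timeout):
        """Sync request: enqueue the message and wait for the reply."""
        reply_q = queue.Queue()
        self.inbox.put((method, args, reply_q))
        try:
            return reply_q.get(timeout=timeout)
        except queue.Empty:
            raise MessagingTimeout(f"{self.name}.{method} timed out") from None


def simulate(report_sync, timeout=0.5, sleep=0.2):
    engine, executor = BlockingServer("engine"), BlockingServer("executor")
    events, done = [], threading.Event()

    def engine_run_action(name):
        # A standalone run-action is not part of a workflow, so the engine
        # calls the executor synchronously.
        try:
            return executor.call("run_action", name, timeout=timeout)
        except MessagingTimeout as exc:
            events.append(f"engine: {exc}")
            return "timeout"

    def engine_on_action_complete(name):
        events.append(f"engine: got result of {name}")
        done.set()
        return "ack"

    def executor_run_workflow_action(name):
        time.sleep(sleep)  # long-running workflow action, e.g. std.sleep
        if report_sync:  # report the result with a blocking call
            try:
                engine.call("on_action_complete", name, timeout=timeout)
            except MessagingTimeout as exc:
                events.append(f"executor: {exc}")
        else:  # report the result with a fire-and-forget cast
            engine.cast("on_action_complete", name)

    def executor_run_action(name):
        return f"{name}: ok"

    engine.handlers.update(run_action=engine_run_action,
                           on_action_complete=engine_on_action_complete)
    executor.handlers.update(run_workflow_action=executor_run_workflow_action,
                             run_action=executor_run_action)
    engine.start()
    executor.start()

    executor.cast("run_workflow_action", "std.sleep")  # 1. workflow task: async
    time.sleep(sleep / 4)                              # 2. executor is now busy
    # 3. the API asks the engine to run std.noop as a standalone action
    result = engine.call("run_action", "std.noop", timeout=3 * timeout)
    done.wait(timeout=3 * timeout)                     # 4. wait for the sleep result
    engine.stop()
    executor.stop()
    return result, events
```

In this model, simulate(report_sync=True) should return "timeout" plus a timeout event, while simulate(report_sync=False) should return "std.noop: ok"; in both cases the engine eventually receives the std.sleep result, matching the observation that it "does send it eventually".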
16:31:35 <rakhmerov> yes
16:31:38 <rakhmerov> what about configuring the RPC server differently for the engine and executor?
16:31:54 <ddeja> it should work
16:32:04 <rakhmerov> will it help if the executor doesn't wait when sending results?
16:32:22 <rakhmerov> it's one thing that we can do
16:32:36 <d0ugal> Configuring them differently where?
16:32:57 <rakhmerov> when we are initializing them
16:33:01 <rakhmerov> in launch.py
16:33:06 <ddeja> another thing - in mistral there is a lot of places where we use sync calls, but we are not doing anything with the results
16:33:25 <rakhmerov> ddeja: yes, right, we need to fix that too
16:33:37 <ddeja> it would improve performance
16:33:43 <rakhmerov> agree
16:34:26 <rakhmerov> I hope that pretty soon we'll get it back to 'eventlet' for engine too once I solve that stupid problem with green threads
16:34:34 <rakhmerov> I'll be working on it later this week
16:34:54 <ddeja> So, we want to change the executor so it uses eventlet?
16:35:10 <ddeja> or we want to use it async for returning messages?
16:35:18 <ddeja> returning results*
16:35:22 <rakhmerov> we need to do both
16:35:26 <ddeja> OK
16:35:36 <rakhmerov> starting with the simplest and more obvious change
16:35:58 <d0ugal> which one is that? :)
16:36:11 <rakhmerov> it seems like enabling 'eventlet' for the executor should be pretty simple
16:36:29 <d0ugal> Right
16:36:50 <rakhmerov> we just need to add one more parameter into the function that creates an RPC server for us and pass a different value when initializing engine and executor in launch.py
16:36:55 <rakhmerov> ddeja: sounds about right?
16:37:48 <d0ugal> Sounds easy.
16:37:54 <ddeja> rakhmerov: yup.
16:37:56 <rakhmerov> yes
16:38:02 <rakhmerov> ok :)
16:38:06 <d0ugal> I'd be happy to help in any way I can.
16:38:09 <ddeja> but the kombu driver would still be broken
16:38:20 <rakhmerov> yeah, that's what I thought too
16:38:37 <rakhmerov> but, you know, for Kombu we can just ignore this parameter for now
16:38:48 <ddeja> no, that is not a problem
16:38:59 <d0ugal> rakhmerov: If you plan to land performance fixes, can we switch back to eventlet and take the performance hit for a week or so?
16:39:07 <ddeja> a problem is that this deadlock bug will still be happening if one is using the kombu driver instead of oslo
16:39:11 <rakhmerov> we can give it some abstract name like 'rpc_processing_method' and ignore it for Kombu
16:39:30 <rakhmerov> ddeja: true, but we'll have time to fix it soon
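A minimal sketch of the extra parameter discussed above, assuming a factory along these lines. Everything here is hypothetical and is not Mistral's actual launch.py: make_rpc_server(), rpc_processing_method, OsloRPCServer, KombuRPCServer, launch() and the topic names are illustrative. In the real oslo.messaging library the corresponding knob is the executor argument of get_rpc_server() (for example 'blocking' or 'eventlet'). The Kombu branch accepts and ignores the value, as suggested in the discussion.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical set of processing methods a server may be created with.
SUPPORTED_METHODS = ("blocking", "eventlet")


class OsloRPCServer:
    """Placeholder for a server built on oslo.messaging (hypothetical)."""

    def __init__(self, topic, endpoints, executor):
        self.topic = topic
        self.endpoints = endpoints
        self.executor = executor  # would be passed on as oslo.messaging's executor


class KombuRPCServer:
    """Placeholder for the Kombu-based server, with one fixed processing model."""

    def __init__(self, topic, endpoints):
        self.topic = topic
        self.endpoints = endpoints
        self.executor = "blocking"


def make_rpc_server(driver, topic, endpoints, rpc_processing_method="blocking"):
    """Create an RPC server; the processing method only affects the oslo driver."""
    if rpc_processing_method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported rpc_processing_method: {rpc_processing_method}")
    if driver == "oslo":
        return OsloRPCServer(topic, endpoints, executor=rpc_processing_method)
    if driver == "kombu":
        if rpc_processing_method != "blocking":
            LOG.warning("Kombu driver ignores rpc_processing_method=%s",
                        rpc_processing_method)
        return KombuRPCServer(topic, endpoints)
    raise ValueError(f"unknown RPC driver: {driver}")


def launch(driver):
    """launch.py-style wiring: the engine stays blocking, the executor gets eventlet."""
    engine = make_rpc_server(driver, "mistral_engine", [],
                             rpc_processing_method="blocking")
    executor = make_rpc_server(driver, "mistral_executor", [],
                               rpc_processing_method="eventlet")
    return engine, executor
```

Giving the knob a driver-neutral name lets both drivers share one call site; as ddeja notes, ignoring it in Kombu means the deadlock would still affect Kombu users.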
16:39:52 <ddeja> I'll check tomorrow if it is safe to change from sync to async in the executor
16:40:13 <rakhmerov> ddeja: yes, please take it if you can
16:40:37 <rakhmerov> ddeja: btw, awesome job on investigating this
16:40:48 <d0ugal> ++
16:40:58 <ddeja> #action ddeja will check if it is safe to change from sync to async in default executor while returning action results
16:41:05 <ddeja> thanks :)
16:41:21 <rakhmerov> d0ugal: what did you mean by "performance hit"? :)
16:41:27 <rakhmerov> sorry, didn't get your question
16:41:35 <d0ugal> rakhmerov: don't worry, I think the plan you have sounds good
16:42:02 <rakhmerov> ooh, the performance fixes I made last week are in RC1 already
16:42:06 <rakhmerov> they are merged
16:42:15 <d0ugal> rakhmerov: I just got a bit confused with the switch from eventlet to blocking and then you said you want to go back to eventlet?
16:42:28 <rakhmerov> as far as what I'm working on, they will be finished tomorrow (one test is failing)
16:42:48 <rakhmerov> d0ugal: yes, but only for executor
16:42:57 <d0ugal> I see, thanks.
16:43:05 <rakhmerov> by design, it's safe to use 'eventlet' for executor
16:43:17 <rakhmerov> but not safe for engine (problem with green threads)
16:43:30 <rakhmerov> d0ugal: at least ddeja and I believe so :)
16:43:38 <rakhmerov> hopefully we're right
16:43:41 <d0ugal> haha, I trust you :)
16:43:53 <d0ugal> Hopefully I can find time to learn this part of Mistral more soon.
16:44:01 <rbrady> +1
16:44:19 <rakhmerov> d0ugal: sure, it's pretty complicated but I can explain everything
16:44:29 <ddeja> rakhmerov: it should be totally safe as long as actions do not try to communicate with the DB
16:44:53 <rakhmerov> yes
16:44:56 <rakhmerov> right
16:45:16 <rakhmerov> ok, seems like we have a plan
16:45:26 <rakhmerov> let's move on
16:45:35 <rakhmerov> any other topics?
16:45:42 <ddeja> #topic Open discussion
16:46:00 <rakhmerov> btw, just FYI
16:46:15 <rbrady> ddeja: actions not communicating with the DB? directly, or calling something that communicates with the DB? e.g. fetching the mistral environment
16:46:22 <rakhmerov> what we did last week makes mistral ~5 times faster
16:46:24 <rakhmerov> :)
16:46:37 <ddeja> rbrady: directly
16:46:43 <rakhmerov> I found some huge huge problems that I was able to remove
16:46:44 <rbrady> ddeja: ack.  thanks
16:47:33 <rakhmerov> rbrady: yeah, the problem occurs only when we use green threads (eventlet's) and they do some blocking external calls
16:47:36 <rakhmerov> potentially blocking
16:47:43 <d0ugal> rakhmerov: Nice!
16:47:50 <rakhmerov> like acquiring a lock in DB
16:47:56 <rakhmerov> yeah :)
16:48:34 <d0ugal> rakhmerov: Do you have any benchmarks you can share? It would be a good thing to show off for Newton.
16:48:53 <rakhmerov> rbrady: in this case the green thread dispatcher doesn't switch threads as expected (although my understanding was different before I ran into this problem)
16:49:19 <rakhmerov> d0ugal: well, I can provide some numbers, yes
16:49:35 <rakhmerov> for some test workflows that I use
16:49:42 <d0ugal> rakhmerov: That would be cool, but not urgent at all :)
16:49:47 <rakhmerov> ok )
16:50:05 <rakhmerov> alright
16:50:06 <d0ugal> Okay, sorry but I need to leave a bit early
16:50:12 <rakhmerov> me too!
16:50:26 <rakhmerov> rbrady, mgershen, ddeja: how about you?
16:50:33 <rakhmerov> ok to close the meeting?
16:50:39 <rbrady> ok for me
16:50:39 <d0ugal> Thanks rakhmerov and ddeja for your discussion, that was very useful and please let me know if I can help at all.
16:50:41 <ddeja> yes
16:50:42 <mgershen> sure
16:50:48 <rakhmerov> d0ugal: sure
16:50:53 <rakhmerov> thanks everyone
16:51:00 <ddeja> ok, thank you all and see you next week
16:51:06 <rbrady> bye
16:51:08 <rakhmerov> ddeja: thanks twice! For investigation and for driving the meeting :)
16:51:12 <mgershen> bye
16:51:16 <d0ugal> Bye :)
16:51:17 <rakhmerov> see ya
16:51:21 <ddeja> rakhmerov: no problem, bye
16:51:26 <ddeja> #endmeeting