16:05:18 #startmeeting mistral
16:05:18 Meeting started Mon Sep 19 16:05:18 2016 UTC and is due to finish in 60 minutes. The chair is ddeja. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:05:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:05:21 The meeting name has been set to 'mistral'
16:05:34 hello
16:06:26 hi
16:06:28 I was finally able to join
16:06:30 sorry
16:06:36 oh, cool
16:06:36 Hey
16:06:40 ddeja: still here?
16:06:43 d0ugal: hi hi )
16:07:02 rakhmerov: I've just read your mail and I've just started the meeting
16:07:26 ok, good
16:07:38 ddeja: please keep in mind that you'll have to finish it
16:07:41 because you started it
16:08:00 hi
16:08:06 so, let's sync up quickly
16:08:10 mgershen: hi!
16:08:14 rakhmerov: yes, I know. Not my first time chairing ;)
16:08:27 #topic Review action items
16:08:39 ok :)
16:08:43 thanks a lot
16:08:53 you saved my ...
16:09:22 ddeja: I'm not sure if you have any AIs
16:09:34 we skipped the last 2 meetings I guess
16:09:38 oh, ok
16:09:41 Yeah, probably not because of that
16:09:41 yeah
16:09:50 #topic Current status (progress, issues, roadblocks, further plans)
16:10:07 sorry for that, I've been extremely busy the last couple of months, and I've been travelling for 2 weeks by now
16:10:37 rakhmerov: No problem, maybe we should move the meeting to a time that is easier for you? but that is a different discussion :)
16:10:56 yeah, we were supposed to do that a long time ago :)
16:11:03 it's not convenient for many people
16:11:15 it's my debt
16:11:22 :)
16:11:41 my status: still working on stability and performance improvements, last week made some great changes, they made Mistral work much faster on large workflows
16:11:45 status: I have some code on review in rally (yes still...), but internal things take most of my time.
16:12:15 now optimizing processing of workflow context
16:12:24 TripleO integration is taking most of my time, so not much to report (other than the bug I added to the agenda :) )
16:12:36 mgershen: can you please add us as reviewers?
16:12:56 mgershen: or just link it?
16:13:00 d0ugal: ok, this is the main thing we probably need to discuss today
16:13:04 my status: mostly testing, found one bug and a root cause for another; besides that, a little bit of reviewing (but too little!)
16:13:17 sure, I'll find the link
16:13:26 ok
16:13:52 I have changes to do, hopefully I will have time soon... https://review.openstack.org/#/c/358352
16:14:01 ok, thanks
16:14:15 so, just before we move forward
16:14:25 please keep in mind that RC1 is released
16:14:39 and master is now open for developing new features
16:14:56 rakhmerov: There is no newton branch yet
16:15:05 rakhmerov: so I don't think master should be open?
16:15:08 from now on we'll be backporting only bug fixes into stable/newton
16:15:28 d0ugal: it should be created, I saw an email from Doug
16:15:31 let me check
16:15:36 I asked in #openstack-release earlier, they said it should be done today
16:15:39 but I still can't see it
16:16:00 rakhmerov: yup, there is no stable/newton branch
16:16:00 yeah, true
16:16:13 yes, hm.. it's kinda weird
16:16:20 I agree :)
16:16:23 maybe something was broken in their toolkit
16:16:32 for making releases
16:16:36 ok, anyway
16:16:50 ddeja: let's move on?
16:17:40 rakhmerov: yup
16:17:58 #topic (d0ugal) MessagingTimeout when executing mistral actions https://bugs.launchpad.net/mistral/+bug/1624284
16:18:00 Launchpad bug 1624284 in Mistral "MessagingTimeout when executing mistral actions" [Critical,Confirmed] - Assigned to Dawid Deja (dawid-deja-0)
16:18:32 * rakhmerov Renat is reading again..
16:18:36 Okays, so for anyone unfamiliar, the last comment on that bug from ddeja is a good summary
16:19:08 d0ugal: yes
16:19:46 ddeja: does it help if engine and executor are running in separate processes?
16:19:56 rakhmerov: no
16:20:00 ok
16:20:04 just for my info
16:20:08 rakhmerov: I have such a configuration on my devstack
16:20:41 and it doesn't matter
16:20:52 ok
16:21:12 ok, I'm reading these 4 steps that you pointed out
16:21:29 and I'm not sure that I understand the problem 100%
16:21:37 so, again
16:21:52 engine sends a request to run "std.sleep"
16:21:59 executor sleeps for 30 sec
16:22:14 rakhmerov: yes, but the request is a workflow (it's important)
16:22:30 which one?
16:22:36 the first one
16:22:42 std.sleep is an action in workflow
16:22:55 ooh, ok
16:23:17 reading again...
16:23:35 I don't understand #4
16:23:45 "Executor sends *sync* request: I woke up!"
16:24:02 ddeja: can you explain it?
16:24:17 what did you mean by "I woke up!"?
16:24:48 rakhmerov: Oh, that can be misleading
16:24:54 it is just sending the action results
16:25:07 for run-action ?
16:25:11 no
16:25:12 ooh, I got it
16:25:22 but for what?
16:25:44 it is for the action run as task t1 from the 'sleep' workflow
16:26:07 ok
16:26:19 and why do we have a deadlock?
16:26:25 so
16:26:55 engine sends a request to executor 'run action std.sleep'. Since this action is a part of a workflow, the request is async
16:27:05 which means that we send a message via RPC and move on
16:27:05 yes
16:27:06 ok
16:27:10 yes
16:27:15 on engine side
16:27:48 on engine side, nothing is happening right now.
On executor side, it goes to sleep (which simulates any long running task)
16:28:03 yes
16:28:22 while the executor is doing the 'long running task', API sends engine another request, to run action std.noop
16:28:35 ok
16:28:56 engine accepts the request, and since this is a 'run-action', not a part of a workflow, it sends a request to executor in sync manner
16:29:05 yep
16:29:13 but executor is doing its previous job
16:29:17 so, engine waits
16:29:26 yes
16:29:31 after some time, executor ends its first job
16:29:39 so essentially it's not a real deadlock
16:29:41 and wants to send the result back to engine
16:29:50 and it does it in sync manner
16:29:52 it's just run-action fails with timeout, right?
16:30:20 ooh, no
16:30:26 ok, it's a real deadlock
16:30:28 now I see
16:30:29 so it waits for engine to reply to its message but at the same time, engine is waiting for executor to answer its message
16:30:32 yup
16:30:37 yes, gotcha
16:30:50 it can't even send a result for 'sleep'
16:30:56 yes
16:31:00 because RPC subsystem is busy
16:31:05 so
16:31:20 well, it sends it at least, because the first message times out, and engine starts to operate again
16:31:35 yes
16:31:38 what about configuring RPC server differently for engine and executor?
16:31:54 it should work
16:32:04 will it help if executor won't be waiting to send results
16:32:22 it's one thing that we can do
16:32:36 Configuring them differently where?
16:32:57 when we are initializing them
16:33:01 in launch.py
16:33:06 another thing - in mistral there are a lot of places where we use sync calls, but we are not doing anything with the results
16:33:25 ddeja: yes, right, we need to fix that too
16:33:37 it would improve performance
16:33:43 agree
16:34:26 I hope that pretty soon we'll get it back to 'eventlet' for engine too once I solve that stupid problem with green threads
16:34:34 I'll be working on it later this week
16:34:54 So, we want to change the executor so it uses eventlet?
16:35:10 or we want to use it async for returning messages?
16:35:18 returning results*
16:35:22 we need to do both
16:35:26 OK
16:35:36 starting with the simplest and most obvious change
16:35:58 which one is that? :)
16:36:11 it seems like enabling 'eventlet' for executor should be pretty simple
16:36:29 Right
16:36:50 we just need to add one more parameter into the function that creates an RPC server for us and pass a different value when initializing engine and executor in launch.py
16:36:55 ddeja: sounds about right?
16:37:48 Sounds easy.
16:37:54 rakhmerov: yup.
16:37:56 yes
16:38:02 ok :)
16:38:06 I'd be happy to help in any way I can.
16:38:09 but it would still leave the kombu driver broken
16:38:20 yeah, that's what I thought too
16:38:37 but, you know, for Kombu we can just ignore this parameter for now
16:38:48 no, that is not a problem
16:38:59 rakhmerov: If you plan to land performance fixes, can we switch back to eventlet and take the performance hit for a week or so?
16:39:07 the problem is that this deadlock bug will still be happening if one is using the kombu driver instead of oslo
16:39:11 we can give it some abstract name like 'rpc_processing_method' and ignore it for Kombu
16:39:30 ddeja: true, but we'll have time to fix it soon
16:39:52 I'll check tomorrow if it is safe to change from sync to async in executor
16:40:13 ddeja: yes, please take it if you can
16:40:37 ddeja: btw, awesome job on investigating this
16:40:48 ++
16:40:58 #action ddeja will check if it is safe to change from sync to async in default executor while returning action results
16:41:05 thanks :)
16:41:21 d0ugal: what did you mean by "performance hit"?
:)
16:41:27 sorry, didn't get your question
16:41:35 rakhmerov: don't worry, I think the plan you have sounds good
16:42:02 ooh, the performance fixes I made last week are in RC1 already
16:42:06 they are merged
16:42:15 rakhmerov: I just got a bit confused with the switch from eventlet to blocking and then you said you want to go back to eventlet?
16:42:28 as for what I'm working on now, it will be finished tomorrow (one test is failing)
16:42:48 d0ugal: yes, but only for executor
16:42:57 I see, thanks.
16:43:05 by design, it's safe to use 'eventlet' for executor
16:43:17 but not safe for engine (problem with green threads)
16:43:30 d0ugal: at least ddeja and I believe so :)
16:43:38 hopefully we're right
16:43:41 haha, I trust you :)
16:43:53 Hopefully I can find time to learn this part of Mistral more soon.
16:44:01 +1
16:44:19 d0ugal: sure, it's pretty complicated but I can explain everything
16:44:29 rakhmerov: it should be totally safe as long as actions do not try to communicate with DB
16:44:53 yes
16:44:56 right
16:45:16 ok, seems like we have a plan
16:45:26 let's move on
16:45:35 any other topics?
16:45:42 #topic Open discussion
16:46:00 btw, just FYI
16:46:15 ddeja: actions don't communicate with db? directly, or by calling something that does communicate with db? e.g. fetching mistral environment
16:46:22 what we did last week makes mistral ~5 times faster
16:46:24 :)
16:46:37 rbrady: directly
16:46:43 I found some huge huge problems that I was able to remove
16:46:44 ddeja: ack. thanks
16:47:33 rbrady: yeah, the problem occurs only when we use green threads (eventlet's) and they do some blocking external calls
16:47:36 potentially blocking
16:47:43 rakhmerov: Nice!
16:47:50 like acquiring a lock in DB
16:47:56 yeah :)
16:48:34 rakhmerov: Do you have any benchmarks you can share? It would be a good thing to show off for Newton.
16:48:53 rbrady: in this case the green threads dispatcher doesn't switch threads as expected (although my understanding was different before I got this problem)
16:49:19 d0ugal: well, I can provide some numbers, yes
16:49:35 for some test workflows that I use
16:49:42 rakhmerov: That would be cool, but not urgent at all :)
16:49:47 ok )
16:50:05 alright
16:50:06 Okay, sorry but I need to leave a bit early
16:50:12 me too!
16:50:26 rbrady, mgershen, ddeja: how about you?
16:50:33 ok to close the meeting?
16:50:39 ok for me
16:50:39 Thanks rakhmerov and ddeja for your discussion, that was very useful and please let me know if I can help at all.
16:50:41 yes
16:50:42 sure
16:50:48 d0ugal: sure
16:50:53 thanks everyone
16:51:00 ok, thank you all and see you next week
16:51:06 bye
16:51:08 ddeja: thanks twice! For investigation and for driving the meeting :)
16:51:12 bye
16:51:16 Bye :)
16:51:17 see ya
16:51:21 rakhmerov: no problem, bye
16:51:26 #endmeeting
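
Appendix (not part of the meeting log): the deadlock walked through under the MessagingTimeout topic can be reproduced with a small toy model. The sketch below is only an illustration under stated assumptions. It uses plain Python threads and queues, not Mistral or oslo.messaging code. The names Service, cast, call and run_scenario are hypothetical, and the string "MessagingTimeout" merely stands in for the real exception. Each service handles one message at a time, which is how the 'blocking' RPC server behaviour was described in the discussion.

```python
import queue
import threading
import time


class Service:
    """Toy RPC endpoint that processes one message at a time ('blocking')."""

    def __init__(self):
        self.inbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            handler, reply_box = self.inbox.get()
            result = handler()  # nothing else is served until this returns
            if reply_box is not None:
                reply_box.put(result)

    def cast(self, handler):
        """Async request: enqueue the message and move on."""
        self.inbox.put((handler, None))

    def call(self, handler, timeout):
        """Sync request: wait for the reply, raise queue.Empty on timeout."""
        reply_box = queue.Queue()
        self.inbox.put((handler, reply_box))
        return reply_box.get(timeout=timeout)


def run_scenario(sync_result_return, sleep_s=0.1, timeout_s=0.5):
    """Replay the steps from the bug; return what the API caller sees."""
    engine, executor = Service(), Service()

    def on_action_complete():  # engine side: receives the std.sleep result
        return "ok"

    def run_sleep():  # executor side: long-running workflow action
        time.sleep(sleep_s)
        if sync_result_return:
            try:
                engine.call(on_action_complete, timeout_s)  # blocks executor
            except queue.Empty:
                pass
        else:
            engine.cast(on_action_complete)  # fire and forget

    def run_noop():  # executor side: std.noop for run-action
        return "noop done"

    def engine_run_action():  # engine side: sync run-action request
        try:
            return executor.call(run_noop, timeout_s)
        except queue.Empty:
            return "MessagingTimeout"  # stand-in for the real exception

    executor.cast(run_sleep)  # engine -> executor, async (workflow task)
    time.sleep(sleep_s / 2)   # API request arrives while executor is busy
    return engine.call(engine_run_action, timeout=3 * timeout_s)


if __name__ == "__main__":
    print("sync result return: ", run_scenario(sync_result_return=True))
    print("async result return:", run_scenario(sync_result_return=False))
```

With the sync result return, engine and executor each wait on the other until the engine's call times out, so the run-action request fails. With the async result return, the executor is free to serve std.noop as soon as it finishes sleeping. This mirrors only the second option discussed (returning results asynchronously). Switching the executor to 'eventlet' is not modelled here.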