08:02:23 #startmeeting Mistral
08:02:24 Meeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:02:25 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
08:02:27 The meeting name has been set to 'mistral'
08:02:36 Morning
08:03:14 morning
08:03:48 apetrich: btw, just a reminder: still waiting for you to change those backports )
08:03:54 no rush, but please don't forget
08:04:15 rakhmerov, I know. Thanks for understanding.
08:04:32 no problem
08:04:35 so
08:04:36 oh, I did not have time to write a blueprint about the fail-on policy
08:04:41 sorry
08:04:48 :)
08:04:55 please do
08:05:20 vgvoleg_: I know you wanted to share some concerns during the office hour
08:05:27 you can go ahead and do that
08:06:00 yes
08:06:30 We are currently testing Mistral with a huge workflow
08:07:00 it has about 600 nested workflows and a very big context
08:07:21 we found 3 problems
08:08:07 1) There are some memory leaks in the engine
08:08:15 ok
08:09:23 ok
08:09:30 2) Mistral sometimes gets stuck because of DB deadlocks in the action execution reporter
08:09:52 Oleg, how do you know that you're observing memory leaks?
08:10:19 maybe it's just a large memory footprint (which is totally OK with that big a workflow)
08:10:20 we see lots of active DB sessions updating the state with the error info 'heartbeat wasn't received'
08:10:31 ok
08:10:50 maybe you need to increase the timeouts?
08:10:59 no, no, no
08:11:09 it's ok for the action to fail
08:11:22 it is not ok for Mistral to get stuck :D
08:12:00 and from that point the engines can't do anything: they lose the connection to RabbitMQ and never return to a working state
08:12:39 btw we see a lot of sessions in the 'idle in transaction' state, tbh I don't know what that means
08:13:15 ok
08:13:17 I see
08:13:32 about the memory leaks: we use monitoring to see the current state of Mistral's pods
08:13:41 Oleg, we've recently found one issue with RabbitMQ
08:14:00 if you're using the latest code you probably hit it as well
08:14:13 the thing is that oslo.messaging recently removed some deprecated options
08:14:18 and we see that the memory usage increases after a complete run
08:14:34 and the configuration option responsible for retries is now zero by default
08:14:41 so it never tries to reconnect
08:14:56 it's easy to solve just by reconfiguring the connection a little bit
08:15:22 if we run the flow once again, it adds some more memory; in our case this is about 2 GB per pod
08:15:43 2 GB that we don't know where they come from
08:15:55 even if we turn off all caches
08:16:50 Oh, great news about RabbitMQ, ty! We'll try to research it
08:17:48 vgvoleg_: yes, I can share the details on RabbitMQ with you separately
08:18:24 as for the leaks, ok, understood. We used to observe them, but they have all been fixed
08:18:46 we haven't observed any leaks for at least a year of constant Mistral use
08:18:58 although the workflows are also huge
08:19:15 but ok, I assume you may be hitting some corner case or something
08:19:38 I can also advise you on how to find memory leaks
08:19:44 I can recommend some tools for that
08:19:48 can anyone help me and tell me about mechanisms for detecting where they come from?
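(A minimal sketch of the RabbitMQ reconnect issue rakhmerov mentions above at 08:14. The exact configuration option wasn't named in the meeting; at the oslo.messaging API level the relevant knob is the RPC client's retry argument, where 0 means "never retry" and None or -1 means "retry forever". The topic name below is illustrative, not necessarily Mistral's actual one.)

    # Minimal sketch, assuming oslo.messaging/oslo.config are installed and
    # transport_url is set in the loaded configuration. Not Mistral's code.
    from oslo_config import cfg
    import oslo_messaging as messaging

    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='mistral_engine')  # illustrative topic

    # retry=None (or -1) keeps retrying the connection forever; retry=0 never
    # retries, which is what leaves an engine without a RabbitMQ connection.
    client = messaging.RPCClient(transport, target, retry=None)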
08:19:58 oh :)
08:20:00 ok
08:20:09 basically you need to see which types of objects occupy most of the Python heap
08:20:29 yes, I'll let you know later here in the channel
08:20:34 so that others can see as well
08:20:44 I just need to find all the relevant links
08:20:56 I've tried to do something like this, and output like 'dict: 9999KB' is not helping at all
08:22:36 yeah-yeah, I know
08:22:39 it's not that simple
08:23:03 The third issue I'd like to talk about is load balancing between engine instances
08:23:10 you need to trace a chain of references from these primitive types to our user types
08:23:20 vgvoleg_:
08:23:21 ok
08:24:03 There are cases where, e.g., one task has an on-success clause with lots of other tasks
08:24:17 yep
08:24:43 Starting and executing all these tasks is one indivisible operation
08:25:03 yes
08:25:23 so it isn't split between engines, and we can see that one engine uses 99% CPU while the others use 2-4%
08:25:52 as we discussed, I'd propose making the option that we recently added (start_subworkflows_via_rpc) more universal
08:26:17 we use one very dirty hack to solve this problem: we add 'join: all' to all these tasks, and it helps to balance the load
08:26:23 so that it works not only at the start but at any point during the execution lifetime
08:27:28 vgvoleg_: yes
08:27:42 what do you think about my suggestion? Do you think it will help you?
08:28:50 I think creating tasks in the WAITING state and starting them via RPC is the only correct solution
08:29:13 yes, makes sense
08:29:33 this is good because it could solve one more problem
08:29:33 can you come up with a short spec or blueprint for this, please?
08:29:58 we need to understand what else it will affect
08:30:22 basically, here we're going to change how tasks change their state
08:30:27 and this is a serious change
08:31:03 if all execution steps are atomic and RPC-based, we can use priorities to make old executions finish faster than new ones
08:31:45 vgvoleg_: we definitely need to write up a description of this somewhere :)
08:31:54 with all the details and consequences of this change
08:32:01 I'd propose creating a blueprint for now
08:32:06 can you do this please?
08:32:09 ok, I'll try
08:32:26 hi
08:33:00 #action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using the WAITING state and RPC
08:33:07 the action execution reporting may fail because of custom actions, unfortunately
08:33:08 akovi: hi! how's it going?
08:33:18 custom actions?
08:33:21 ad-hoc actions, you mean?
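(A minimal, stdlib-only sketch of the heap inspection approach rakhmerov describes above at 08:20-08:23: count which object types dominate the heap, then walk referrers back from a suspicious primitive object toward application-level types. Illustrative only, and not necessarily the tools he later shared in the channel; packages such as objgraph, or the stdlib tracemalloc, give friendlier output.)

    # Minimal sketch: which object types dominate the heap, and what keeps
    # a given object alive. Run inside the leaking process, e.g. from a
    # debug endpoint or a signal handler.
    import gc
    from collections import Counter

    def top_types(limit=15):
        """Count live objects tracked by the GC, grouped by type name."""
        counts = Counter(type(o).__name__ for o in gc.get_objects())
        for name, count in counts.most_common(limit):
            print(f'{name:30s} {count}')

    def who_refers_to(obj, depth=3, indent=0):
        """Print a few levels of referrers, to trace a primitive (e.g. a big
        dict) back toward the user-level object holding on to it."""
        if depth == 0:
            return
        for ref in gc.get_referrers(obj)[:5]:   # limit fan-out for readability
            print(' ' * indent, type(ref).__name__)
            who_refers_to(ref, depth - 1, indent + 2)

    gc.collect()
    top_types()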
08:33:44 if the reporter thread (a green thread) doesn't get time to run, timeouts will happen in the engine and the execution will be closed
08:34:11 we use only std.noop and std.echo in our test case
08:34:21 we found such an error in one of our custom actions that listened to the output of a forked process
08:35:40 ok
08:36:33 btw, we have tried using random_delay in this job, but deadlocks still appear quite often
08:38:12 vgvoleg_: it's not going to help much
08:38:19 we've already proven that
08:38:44 I have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture
08:38:58 that whole thing is going to change pretty soon
08:39:08 a large context can also cause issues in RPC processing
08:39:33 yes
08:39:45 yes, but the action execution reporter works with the DB
08:40:25 I tried compressing the data (with lzo and then lz4) but the results were not conclusive
08:40:29 no
08:40:44 the reporter has a list of the running actions
08:41:04 this list is sent to the engine over RPC at given intervals
08:41:15 the DB update happens in the engine
08:41:31 oh, I got it
08:41:40 it could be moved to the executor (this happened accidentally once)
08:41:59 but that would mean the executor has to have direct DB access
08:42:06 which was not required earlier
08:42:25 I'll try to research it (the problem appeared yesterday)
08:42:26 yeah, we've always tried to avoid that for several reasons
08:42:40 if the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too
08:44:47 action heartbeating should probably be a last resort, actually
08:44:56 for closing an execution
08:45:13 giving a fair amount of time for processing is good practice
08:45:50 it may cause crashes to be processed later, but overall stability improves significantly
08:46:42 30 sec reporting and 10 allowed misses should be ok
08:47:00 akovi: yes, it's been working for us pretty well so far
08:47:16 no complaints
08:49:17 vgvoleg: so, please complete those two action items (creating the blueprints)
08:49:43 I'll provide you with details on RabbitMQ connectivity and detecting leaks in about an hour
08:50:23 thank you :)
08:50:34 sure thing
08:52:56 ok, I guess we've discussed what we had
08:53:15 I'll close the meeting now (the logs will be available online)
08:53:18 (thumbs up)
08:53:20 #endmeeting
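(A simplified sketch of the action heartbeat mechanism akovi describes at 08:40-08:46: the executor-side reporter keeps a list of running action executions and sends it to the engine over RPC at a fixed interval, and the engine only closes an execution after several missed reports. Illustrative only, not Mistral's actual reporter; the report_running_actions method name is hypothetical, and the 30 s / 10 misses values follow the discussion above.)

    # Simplified, hypothetical sketch of an executor-side heartbeat reporter.
    import threading
    import time

    REPORT_INTERVAL = 30   # seconds between reports, as suggested above
    MAX_MISSED = 10        # the engine would close the execution after 10 misses

    class HeartbeatReporter:
        def __init__(self, engine_client):
            self._engine = engine_client     # RPC client to the engine (assumed)
            self._running_actions = set()    # IDs of currently running actions
            self._lock = threading.Lock()

        def add(self, action_ex_id):
            with self._lock:
                self._running_actions.add(action_ex_id)

        def remove(self, action_ex_id):
            with self._lock:
                self._running_actions.discard(action_ex_id)

        def _report_loop(self):
            # In Mistral this runs as a green thread; a plain thread is used
            # here to keep the sketch dependency-free.
            while True:
                with self._lock:
                    ids = list(self._running_actions)
                if ids:
                    # The engine resets its "missed heartbeats" counters and
                    # performs the DB update on its side.
                    self._engine.report_running_actions(ids)  # hypothetical RPC call
                time.sleep(REPORT_INTERVAL)

        def start(self):
            threading.Thread(target=self._report_loop, daemon=True).start()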