10:16:18 #startmeeting Mistral Bug Review
10:16:19 Meeting started Wed Nov 11 10:16:18 2015 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:16:20 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
10:16:23 The meeting name has been set to 'mistral_bug_review'
10:16:31 1. Mistral stops responding after a few days that we haven't investigated / opened yet
10:16:50 ok
10:17:04 then let's at least file a bug
10:17:04 <[1]melisha> We need to investigate further.
10:17:28 #action melisha: File a bug for "Mistral stops responding after a few days that we haven't investigated / opened yet"
10:17:34 <[1]melisha> Cool
10:17:36 ok
10:17:56 2. task stuck in RUNNING state when all action executions are finished - https://bugs.launchpad.net/mistral/+bug/1513456
10:17:56 Launchpad bug 1513456 in Mistral "task stuck in RUNNING state when all action executions are finished" [Critical,Triaged]
10:18:39 On this one, we came across it a number of times
10:18:47 we get this a lot, specifically when running in HA mode, we see that all action-executions were successful but the task doesn't en
10:18:48 end
10:19:24 yes, I investigated that a little bit
10:19:28 have you done any investigation? Any assumption where the problem is?
10:19:42 we also got it
10:19:49 yes
10:19:58 the potential problem is our transactions
10:20:04 I assume that the issue is in transactions
10:20:04 yes
10:20:23 Winson also observed this behavior but in a different context
10:20:42 nkoffman: do you know a reliable way of reproducing it?
10:21:04 or at least conditions that increase the chances of reproducing it
10:21:09 I saw it using a workflow with a task using with_items on HA,
10:21:22 ok
10:21:28 I can try to reproduce on our node, haven't seen it on devstack though
10:21:38 please try to fill all info you have in bugs' comments
10:21:42 ok
10:21:48 ok
10:21:51 nkoffman: I saw it in devstack installation without any ha
10:22:06 I can try to dig this task myself since I have a couple of thoughts how to track it down
10:22:14 ok
10:22:55 nastya_: I assume the HA might only bring it up more often
10:23:07 yes, I guess so
10:23:11 nkoffman: yeah, agree
10:23:27 ok, I assigned it to myself
10:23:36 will try to fix it
10:23:38 soon
10:23:45 let's continue
10:24:07 2. Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel - https://bugs.launchpad.net/mistral/+bug/1508379
10:24:07 Launchpad bug 1508379 in Mistral "Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel" [Medium,In progress] - Assigned to Tomer Shtilman (tomer-shtilman)
10:24:08 rakhmerov: you can use my env to debug where this problem occurred
10:24:16 ooh, sorry, it was #3
10:24:36 nastya_: ok, will talk to you once I get to working on it, thanks
10:25:44 this one is being fixed in https://review.openstack.org/#/c/240705/
10:25:54 [1]melisha, LimorStotland, nkoffman: so this happens if you need to reinstall one of Mistral instances?
10:26:02 <[1]melisha> rakhmerov: We all know the reason for this and Tomer is working on a fix with very responsive reviews from you all so that's OK
10:26:04 but I'm not sure on 100%
10:26:58 nmakhotkin: yes, this seems to be the right patch
10:27:12 <[1]melisha> rakhmerov: On production setups, there is a puppet agent that always makes sure that the VM is up-to-date
10:27:28 #action: rakhmerov, nmakhotkin: review https://review.openstack.org/#/c/240705/
10:27:35 [1]melisha: ok
10:27:39 <[1]melisha> This puppet agent runs every X minutes and compares conf files, etc. and also runs mistral sync db
10:27:50 I see
10:28:21 I'm just wondering.. Maybe we should change the whole algorithm of updating actions in DB
10:28:27 w/o deleting them
10:28:43 but on the other hand, if we use transactions properly it should fix the problem
10:28:52 ok, let's move on
10:28:57 <[1]melisha> It will fix the problem
10:29:09 4. Workflow executed more than once when using cron-trigger with multiple engines - https://bugs.launchpad.net/mistral/+bug/1513548
10:29:09 Launchpad bug 1513548 in Mistral "Workflow executed more than once when using cron-trigger with multiple engines" [High,In progress] - Assigned to Moshe Elisha (melisha)
10:29:22 this is being worked on
10:29:34 [1]melisha: I still owe you a review, sorry
10:29:34 <[1]melisha> Yes
10:29:41 <[1]melisha> np
10:30:04 #action: rakhmerov: Review https://review.openstack.org/243234 ASAP
10:30:58 ok, guys, btw, just for the same of time saving I'm not tagging these tickets with the new tag
10:31:06 I'll do it once we finish the meeting
10:31:16 .. for the sake ...
10:31:36 the next one
10:31:41 5. Some DB queries are reported slow as no indices are used - https://bugs.launchpad.net/mistral/+bug/1505664
10:31:41 Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:31:48 this one is assigned to Winson
10:32:05 I'll help him fix that, it's a pretty straightforward thing to do
10:32:31 <[1]melisha> Cool. Do you have an easy way to know the indexes that are needed?
10:32:39 #action rakhmerov: tag all needed bugs with liberty-backport-potential
10:32:42 <[1]melisha> or the queries that are executed?
10:33:01 [1]melisha: yes, it's mostly in my head )
10:33:32 If I look at DB model I'll say exactly what should be indexed and what should not
10:33:45 <[1]melisha> Great
10:33:56 of course, this doesn't cancel the need of some testing
10:34:22 <[1]melisha> :-) Sure. We will help with that
10:34:43 #action rakhmerov: put info into https://bugs.launchpad.net/mistral/+bug/1505664 about what exact indexes need to be created
10:34:43 Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:35:10 6. WF execution is not created if input preparation of initial task fails - https://bugs.launchpad.net/mistral/+bug/1506470
10:35:10 Launchpad bug 1506470 in Mistral "WF execution is not created if input preparation of initial task fails" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:35:30 so here my question is: is this a bug at all?
10:36:01 opinions?
10:36:11 the fix is already committed
10:36:22 IMO, yes, it is a bug
10:36:25 <[1]melisha> I think it is a bug. As I see it an execution should always be created
10:36:34 I agree
10:36:51 me 2
10:37:35 already committed? or merged?
10:37:43 can you please help me to find it?
10:37:55 merged
10:38:07 fix committed in LP means that it is merged :)
10:38:16 rakhmerov: https://review.openstack.org/#/c/239638/
10:38:29 ok, I can find it via the ticket
10:38:35 yep, thanks
10:39:09 ok, great!
10:39:45 #action rakhmerov: take a look at https://review.openstack.org/#/c/239638/ and backport it
10:40:06 7. HTTP connection issues on simple load testing - https://bugs.launchpad.net/mistral/+bug/1423054
10:40:07 Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:40:38 <[1]melisha> If the bug description is true - this will surely be an issue for our customers
10:41:17 ok, I'll just share what I know quickly
10:41:43 we discussed it a lot with StackStorm about 8 months ago and particularly with Winson
10:42:04 note that latest comment was made on 2015-02-18
10:42:41 so, I'm almost sure this is not really a bug if we just consider Mistral codebase
10:43:07 <_gryf> 1
10:43:10 Winson told me that once they put Mistral behind Apache server or Nginx this issue stopped appearing completely
10:43:50 the thing is that if we use just an http server provided out of the box it's mostly intended to be used for development, not for production
10:44:19 in other words, it can't really serve a lot of parallel requests well and dies under even modest load
10:44:40 <[1]melisha> OK. I see
10:44:59 Apache or Nginx help exactly with a big number of requests coming in in parallel
10:45:15 <[1]melisha> so no need to backport
10:45:25 just in case, I'd suggest we talk to Winson again and clarify this information
10:46:17 #action rakhmerov, [1]melisha: talk to Winson about https://bugs.launchpad.net/mistral/+bug/1423054 and confirm that this can be solved with putting Apache or Nginx in front of Mistral API server
10:46:17 Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:47:34 8. execution-get truncates "State info" - https://bugs.launchpad.net/mistral/+bug/1509456
10:47:34 Launchpad bug 1509456 in Mistral "execution-get truncates "State info"" [Medium,Confirmed] - Assigned to hardik (hardik-parekh047)
10:48:13 [1]melisha: can you confirm this bug too?
10:48:25 or anyone else?
10:48:33 I didn't see it myself
10:48:35 we didn
10:48:38 't
10:49:07 see it either, but based on the description, it looks like it could be an issue, if it does happen
10:49:31 I'm ready to bet that Mistral server doesn't truncate anything. If the problem exists it might be something on a client side
10:49:38 nmakhotkin:I see it is confirmed by
10:50:18 ooh, yes, nmakhotkin confirmed it
10:50:20 yep, I confirmed that
10:50:44 ok, then it should be something simple to fix
10:50:44 state_info is really truncated
10:50:57 let's not spend time on that now, we just need to fix it
10:51:16 9. wait-before and retry policies directly call task_handler.run_existing_task() method via RPC - https://bugs.launchpad.net/mistral/+bug/1484521
10:51:16 Launchpad bug 1484521 in Mistral "wait-before and retry policies directly call task_handler.run_existing_task() method via RPC" [High,In progress] - Assigned to Renat Akhmerov (rakhmerov)
10:51:40 Yes, this is definitely a bug but it's more like an architectural bug
10:52:05 what are the consequences of this bug to users?
10:52:06 we've discovered it with Limor together while improving Scheduler
10:52:37 no consequences I'd be able to tell about actually
10:52:51 it's rather an ugly design
10:53:07 and it requires some serious refactoring in engine and policies
10:53:28 not sure we need to backport it actually
10:53:34 yes, we have a bp on improving Scheduler: https://blueprints.launchpad.net/mistral/+spec/fallback-mechanism-for-scheduler
10:53:34 ok, so in that case, probably unnecessary to backport
10:54:03 yes, I think we need to make a design improvement for Mitaka
10:54:27 it'll require I think a couple of weeks for me to fix it properly
10:54:41 ok
10:54:49 I think if it doesn't have any effect on the user and it's risky we shouldn't backport
10:55:08 I assigned to myself to M-2 for now
10:55:41 yes, it is risky a little bit because, as I said, it's not just a simple change, it's rather a refactoring
10:55:45 of engine and policies
10:56:06 which I'd love to do personally but it's pretty time consuming
10:56:23 10. Wrong execution state with conditional transitions - https://bugs.launchpad.net/mistral/+bug/1510936
10:56:23 Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:57:12 this one is fixed already
10:57:40 ok, cool
10:57:40 we haven't tried to reproduce it on our set up yet, but again based on description, we will need the fix
10:57:46 needs to be backported I think
10:58:05 one thing, this bug is not completely fixed (case of on-complete is uncovered)
10:58:09 yes, it was definitely a bug, we discussed it with Nikolay before
10:58:26 ok, so we need to backport it
10:58:52 #action nmakhotkin: fix https://bugs.launchpad.net/mistral/+bug/1510936 completely and backport all related patches
10:58:52 Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:58:52 :)
10:59:41 11. create pagination for the mistral client (This should be treated like a bug) - https://blueprints.launchpad.net/python-mistralclient/+spec/pagination-execution-mitralclient (This is mandatory as after a few days of work there is no way to get the execution list anymore).
11:00:28 i am missing one more +2 : https://review.openstack.org/#/c/242996/
11:01:04 #action rakhmerov: review https://review.openstack.org/#/c/242996/ and backport it into stable/liberty
11:01:22 no questions on that, this is really a bad thing
11:01:36 and then i think we need to backport it because users with cron-triggers can't use execution-list without it
11:01:38 I wish we could spend more time polishing such things
11:01:50 LimorStotland: sure, agree on 100%
11:02:05 cool :-)
11:02:48 12. Add ceilometer apis as mistral actions - https://blueprints.launchpad.net/mistral/+spec/mistral-ceilometer-actions (This is not mandatory but we have some use cases that require this).
11:03:15 as we discussed at the team meeting this is pretty easy to implement
11:03:23 nmakhotkin can do it in 10 mins ;)
11:03:29 yes, I did it on our installation as a POC,
11:03:35 :D
11:03:41 <[1]melisha> in 11 mins
11:03:50 :)
11:03:50 :)
11:03:52 <[1]melisha> or was it 9 mins?
11:04:23 ok, a serious question: who will be working on it? nmakhotkin or nkoffman?
11:04:40 I can take it
11:04:57 ok
11:05:00 is it ok for backporting?
11:05:03 ok
11:05:16 then I'll tag it properly as well to backport it
11:05:30 alright, we ran out of time already actually
11:05:33 great :)
11:05:42 do we really need to backport it? it is not so critical bug
11:05:47 very productive meeting I think
11:06:01 nastya_: good question
11:06:10 [1]melisha, nkoffman: what do you think guys?
11:06:15 actually it is a new feature
11:06:16 <[1]melisha> nastya_: You are right. It is not even a bug
11:06:35 this is indeed not critical, however since this is a low risk, and useful for our customers, it would be helpful if backported
11:06:37 <[1]melisha> Yes. But so easy impl will make some really cool use cases possible
11:06:51 I don't think it's mandatory for backport but it can be nice
11:07:00 <[1]melisha> But up to you to decide
11:07:07 my perspective: it's pretty easy to backport and if it brings some comfort for your customers then let's do this
11:07:25 [1]melisha: completely agree
11:07:33 customers' happiness first
11:07:42 <[1]melisha> :-)
11:07:48 it has no risk and it can be very useful, so why not?
11:07:56 yes
11:08:07 ok, are we good now?
11:08:10 ok, if it is not risky, then let's do it
11:08:14 yes
11:08:15 any other questions?
11:08:18 yep
11:08:28 I have a request,
11:08:43 not regarding the meeting though..
11:09:09 let me end the meeting then
11:09:15 and we'll continue to talk
11:09:19 #endmeeting
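
Note on item 3 (bug 1508379): the discussion suggests that a DB sync which deletes and re-creates actions is safe for running workflows as long as it happens inside one properly used transaction. The sketch below illustrates that general idea only; it is not the patch under review. It uses Python's standard sqlite3 module, and the `actions` table, its columns, and the helper names are hypothetical stand-ins rather than Mistral's real schema or code.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a hypothetical `actions` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE actions (name TEXT PRIMARY KEY, definition TEXT NOT NULL)"
    )
    return conn


def resync_actions(conn, actions):
    """Replace every row of `actions` in a single transaction.

    `with conn` commits if the block succeeds and rolls back if it
    raises, so a failed sync leaves the previous rows in place instead
    of a half-empty table.
    """
    with conn:
        conn.execute("DELETE FROM actions")
        conn.executemany(
            "INSERT INTO actions (name, definition) VALUES (?, ?)", actions
        )


def action_names(conn):
    """Return the registered action names in sorted order."""
    rows = conn.execute("SELECT name FROM actions ORDER BY name")
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_db()
    resync_actions(conn, [("std.echo", "v1"), ("std.http", "v1")])
    try:
        # The duplicate name violates the primary key, so the whole
        # sync (including the DELETE) is rolled back.
        resync_actions(conn, [("std.echo", "v2"), ("std.echo", "dup")])
    except sqlite3.IntegrityError:
        pass
    print(action_names(conn))
```

With a server database and several engine processes, whether readers ever observe the intermediate empty state also depends on the isolation level in use, which this single-connection sketch does not exercise.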