10:16:18 <rakhmerov> #startmeeting Mistral Bug Review
10:16:19 <openstack> Meeting started Wed Nov 11 10:16:18 2015 UTC and is due to finish in 60 minutes.  The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:16:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
10:16:23 <openstack> The meeting name has been set to 'mistral_bug_review'
10:16:31 <rakhmerov> 1. Mistral stops responding after a few days that we haven't investigated / opened yet
10:16:50 <rakhmerov> ok
10:17:04 <rakhmerov> then let's at least file a bug
10:17:04 <[1]melisha> We need to investigate further.
10:17:28 <rakhmerov> #action melisha: File a bug for "Mistral stops responding after a few days that we haven't investigated / opened yet"
10:17:34 <[1]melisha> Cool
10:17:36 <rakhmerov> ok
10:17:56 <rakhmerov> 2. task stuck in RUNNING state when all action executions are finished - https://bugs.launchpad.net/mistral/+bug/1513456
10:17:56 <openstack> Launchpad bug 1513456 in Mistral "task stuck in RUNNING state when all action executions are finished" [Critical,Triaged]
10:18:39 <rakhmerov> On this one, we came across it a number of times
10:18:47 <nkoffman> we get this a lot, specifically when running in HA mode; we see that all action-executions were successful but the task doesn't end
10:19:24 <nmakhotkin> yes, I investigated that a little bit
10:19:28 <rakhmerov> have you done any investigation? Any assumption where the problem is?
10:19:42 <nastya_> we also got it
10:19:49 <rakhmerov> yes
10:19:58 <nmakhotkin> the potential problem is our transactions
10:20:04 <rakhmerov> I assume that the issue is in transactions
10:20:04 <rakhmerov> yes
10:20:23 <rakhmerov> Winson also observed this behavior but in a different context
10:20:42 <rakhmerov> nkoffman: do you know a reliable way of reproducing it?
10:21:04 <rakhmerov> or at least conditions that increase the chances of reproducing it
10:21:09 <nkoffman> I saw it using a workflow with a task using with_items on HA,
10:21:22 <rakhmerov> ok
10:21:28 <nkoffman> I can try to reproduce on our node, haven't seen it on devstack though
10:21:38 <rakhmerov> please try to put all the info you have in the bug's comments
10:21:42 <rakhmerov> ok
10:21:48 <nkoffman> ok
10:21:51 <nastya_> nkoffman: I saw it in a devstack installation without any HA
10:22:06 <rakhmerov> I can try to dig into this task myself since I have a couple of thoughts on how to track it down
10:22:14 <rakhmerov> ok
10:22:55 <nkoffman> nastya_: I assume the HA might only bring it up more often
10:23:07 <rakhmerov> yes, I guess so
10:23:11 <nastya_> nkoffman: yeah, agree
10:23:27 <rakhmerov> ok, I assigned it to myself
10:23:36 <rakhmerov> will try to fix it
10:23:38 <rakhmerov> soon
10:23:45 <rakhmerov> let's continue
10:24:07 <rakhmerov> 2. Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel - https://bugs.launchpad.net/mistral/+bug/1508379
10:24:07 <openstack> Launchpad bug 1508379 in Mistral "Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel" [Medium,In progress] - Assigned to Tomer Shtilman (tomer-shtilman)
10:24:08 <nastya_> rakhmerov: you can use my env, where this problem occurred, to debug it
10:24:16 <rakhmerov> ooh, sorry, it was #3
10:24:36 <rakhmerov> nastya_: ok, will talk to you once I get to working on it, thanks
10:25:44 <nmakhotkin> this one is being fixed in https://review.openstack.org/#/c/240705/
10:25:54 <rakhmerov> [1]melisha, LimorStotland, nkoffman: so this happens if you need to reinstall one of the Mistral instances?
10:26:02 <[1]melisha> rakhmerov: We all know the reason for this and Tomer is working on a fix with very responsive reviews from you all so that's OK
10:26:04 <nmakhotkin> but I'm not 100% sure
10:26:58 <rakhmerov> nmakhotkin: yes, this seems to be the right patch
10:27:12 <[1]melisha> rakhmerov: On production setups, there is a puppet agent that always makes sure that the VM is up-to-date
10:27:28 <rakhmerov> #action rakhmerov, nmakhotkin: review https://review.openstack.org/#/c/240705/
10:27:35 <rakhmerov> [1]melisha: ok
10:27:39 <[1]melisha> This puppet agent runs every X minutes, compares conf files, etc., and also runs the mistral DB sync
10:27:50 <rakhmerov> I see
10:28:21 <rakhmerov> I'm just wondering.. Maybe we should change the whole algorithm of updating actions in DB
10:28:27 <rakhmerov> w/o deleting them
10:28:43 <rakhmerov> but on the other hand, if we use transactions properly it should fix the problem
10:28:52 <rakhmerov> ok, let's move on
10:28:57 <[1]melisha> It will fix the problem
10:29:09 <rakhmerov> 4. Workflow executed more than once when using cron-trigger with multiple engines - https://bugs.launchpad.net/mistral/+bug/1513548
10:29:09 <openstack> Launchpad bug 1513548 in Mistral "Workflow executed more than once when using cron-trigger with multiple engines" [High,In progress] - Assigned to Moshe Elisha (melisha)
10:29:22 <rakhmerov> this is being worked on
10:29:34 <rakhmerov> [1]melisha: I still owe you a review, sorry
10:29:34 <[1]melisha> Yes
10:29:41 <[1]melisha> np
10:30:04 <rakhmerov> #action rakhmerov: Review https://review.openstack.org/243234 ASAP
10:30:58 <rakhmerov> ok, guys, btw, just for the same of time saving I'm not tagging these tickets with the new tag
10:31:06 <rakhmerov> I'll do it once we finish the meeting
10:31:16 <rakhmerov> .. for the sake ...
10:31:36 <rakhmerov> the next one
10:31:41 <rakhmerov> 5. Some DB queries are reported slow as no indices are used - https://bugs.launchpad.net/mistral/+bug/1505664
10:31:41 <openstack> Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:31:48 <rakhmerov> this one is assigned to Winson
10:32:05 <rakhmerov> I'll help him fix that, it's a pretty straightforward thing to do
10:32:31 <[1]melisha> Cool. Do you have an easy way to know the indexes that are needed?
10:32:39 <rakhmerov> #action rakhmerov: tag all needed bugs with liberty-backport-potential
10:32:42 <[1]melisha> or the queries that are executed?
10:33:01 <rakhmerov> [1]melisha: yes, it's mostly in my head )
10:33:32 <rakhmerov> If I look at the DB model I can say exactly what should be indexed and what should not
10:33:45 <[1]melisha> Great
10:33:56 <rakhmerov> of course, this doesn't remove the need for some testing
10:34:22 <[1]melisha> :-) Sure. We will help with that
10:34:43 <rakhmerov> #action rakhmerov: put info into https://bugs.launchpad.net/mistral/+bug/1505664 about what exact indexes need to be created
10:34:43 <openstack> Launchpad bug 1505664 in Mistral "Some DB queries are reported slow as no indices are used" [High,Confirmed] - Assigned to Winson Chan (winson-c-chan)
10:35:10 <rakhmerov> 6. WF execution is not created if input preparation of initial task fails - https://bugs.launchpad.net/mistral/+bug/1506470
10:35:10 <openstack> Launchpad bug 1506470 in Mistral "WF execution is not created if input preparation of initial task fails" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:35:30 <rakhmerov> so here my question is: is this a bug at all?
10:36:01 <rakhmerov> opinions?
10:36:11 <nmakhotkin> the fix is already committed
10:36:22 <nmakhotkin> IMO, yes, it is a bug
10:36:25 <[1]melisha> I think it is a bug. As I see it an execution should always be created
10:36:34 <nkoffman> I agree
10:36:51 <LimorStotland> me 2
10:37:35 <rakhmerov> already committed? or merged?
10:37:43 <rakhmerov> can you please help me to find it?
10:37:55 <nastya_> merged
10:38:07 <nmakhotkin> "Fix Committed" in LP means that it is merged :)
10:38:16 <nastya_> rakhmerov: https://review.openstack.org/#/c/239638/
10:38:29 <rakhmerov> ok, I can find it via the ticket
10:38:35 <rakhmerov> yep, thanks
10:39:09 <rakhmerov> ok, great!
10:39:45 <rakhmerov> #action rakhmerov: take a look at https://review.openstack.org/#/c/239638/ and backport it
10:40:06 <rakhmerov> 7. HTTP connection issues on simple load testing - https://bugs.launchpad.net/mistral/+bug/1423054
10:40:07 <openstack> Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:40:38 <[1]melisha> If the bug description is true - this will surely be an issue for our customers
10:41:17 <rakhmerov> ok, I'll just share what I know quickly
10:41:43 <rakhmerov> we discussed it a lot with StackStorm about 8 months ago and particularly with Winson
10:42:04 <rakhmerov> note that the latest comment was made on 2015-02-18
10:42:41 <rakhmerov> so, I'm almost sure this is not really a bug if we just consider the Mistral codebase
10:43:07 <_gryf> 1
10:43:10 <rakhmerov> Winson told me that once they put Mistral behind Apache server or Nginx this issue stopped appearing completely
10:43:50 <rakhmerov> the thing is that if we use just an http server provided out of the box it's mostly intended to be used for development, not for production
10:44:19 <rakhmerov> in other words, it can't really serve a lot of parallel requests well and dies under even modest load
10:44:40 <[1]melisha> OK. I see
10:44:59 <rakhmerov> Apache or Nginx help exactly with a big number of requests coming in in parallel
10:45:15 <[1]melisha> so no need to backport
10:45:25 <rakhmerov> just in case, I'd suggest we talk to Winson again and clarify this information
10:46:17 <rakhmerov> #action rakhmerov, [1]melisha: talk to Winson about https://bugs.launchpad.net/mistral/+bug/1423054 and confirm that this can be solved with putting Apache or Nginx in front of Mistral API server
10:46:17 <openstack> Launchpad bug 1423054 in Mistral mitaka "HTTP connection issues on simple load testing" [High,Triaged]
10:47:34 <rakhmerov> 8. execution-get truncates "State info" - https://bugs.launchpad.net/mistral/+bug/1509456
10:47:34 <openstack> Launchpad bug 1509456 in Mistral "execution-get truncates "State info"" [Medium,Confirmed] - Assigned to hardik (hardik-parekh047)
10:48:13 <rakhmerov> [1]melisha: can you confirm this bug too?
10:48:25 <rakhmerov> or anyone else?
10:48:33 <rakhmerov> I didn't see it myself
10:48:35 <nkoffman> we didn't see it either, but based on the description, it looks like it could be an issue, if it does happen
10:49:31 <rakhmerov> I'm ready to bet that the Mistral server doesn't truncate anything. If the problem exists it might be something on the client side
10:49:38 <nkoffman> I see it is confirmed by nmakhotkin
10:50:18 <rakhmerov> ooh, yes, nmakhotkin confirmed it
10:50:20 <nmakhotkin> yep, I confirmed that
10:50:44 <rakhmerov> ok, then it should be something simple to fix
10:50:44 <nmakhotkin> state_info is really truncated
10:50:57 <rakhmerov> let's not spend time on that now, we just need to fix it
10:51:16 <rakhmerov> 9. wait-before and retry policies directly call task_handler.run_existing_task() method via RPC - https://bugs.launchpad.net/mistral/+bug/1484521
10:51:16 <openstack> Launchpad bug 1484521 in Mistral "wait-before and retry policies directly call task_handler.run_existing_task() method via RPC" [High,In progress] - Assigned to Renat Akhmerov (rakhmerov)
10:51:40 <rakhmerov> Yes, this is definitely a bug but it's more like an architectural bug
10:52:05 <nkoffman> what are the consequences of this bug to users?
10:52:06 <rakhmerov> we've discovered it with Limor together while improving Scheduler
10:52:37 <rakhmerov> no consequences that I can point to, actually
10:52:51 <rakhmerov> it's rather an ugly design
10:53:07 <rakhmerov> and it requires some serious refactoring in engine and policies
10:53:28 <rakhmerov> not sure we need to backport it actually
10:53:34 <LimorStotland> yes, we have a bp on improving Scheduler: https://blueprints.launchpad.net/mistral/+spec/fallback-mechanism-for-scheduler
10:53:34 <nkoffman> ok, so in that case, probably unnecessary to backport
10:54:03 <rakhmerov> yes, I think we need to make a design improvement for Mitaka
10:54:27 <rakhmerov> I think it'll take me a couple of weeks to fix it properly
10:54:41 <nkoffman> ok
10:54:49 <LimorStotland> I think if it doesn't have any effect on the user and it's risky we shouldn't backport
10:55:08 <rakhmerov> I assigned it to myself for M-2 for now
10:55:41 <rakhmerov> yes, it is risky a little bit because, as I said, it's not just a simple change, it's rather a refactoring
10:55:45 <rakhmerov> of engine and policies
10:56:06 <rakhmerov> which I'd love to do personally but it's pretty time consuming
10:56:23 <rakhmerov> 10. Wrong execution state with conditional transitions - https://bugs.launchpad.net/mistral/+bug/1510936
10:56:23 <openstack> Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:57:12 <nmakhotkin> this one is fixed already
10:57:40 <rakhmerov> ok, cool
10:57:40 <nkoffman> we haven't tried to reproduce it on our set up yet, but again based on description, we will need the fix
10:57:46 <rakhmerov> needs to be backported I think
10:58:05 <nmakhotkin> one thing, this bug is not completely fixed (the on-complete case is not covered)
10:58:09 <rakhmerov> yes, it was definitely a bug, we discussed it with Nikolay before
10:58:26 <nkoffman> ok, so we need to backport it
10:58:52 <rakhmerov> #action nmakhotkin: fix https://bugs.launchpad.net/mistral/+bug/1510936 completely and backport all related patches
10:58:52 <openstack> Launchpad bug 1510936 in Mistral "Wrong execution state with conditional transitions" [High,Fix committed] - Assigned to Nikolay Makhotkin (nmakhotkin)
10:58:52 <rakhmerov> :)
10:59:41 <rakhmerov> 11. create pagination for the mistral client (This should be treated like a bug) - https://blueprints.launchpad.net/python-mistralclient/+spec/pagination-execution-mitralclient (This is mandatory as after a few days of work there is no way to get the execution list anymore).
11:00:28 <LimorStotland> i am missing one more +2 : https://review.openstack.org/#/c/242996/
11:01:04 <rakhmerov> #action rakhmerov: review https://review.openstack.org/#/c/242996/ and backport it into stable/liberty
11:01:22 <rakhmerov> no questions on that, this is really a bad thing
11:01:36 <LimorStotland> and then I think we need to backport it because users with cron-triggers can't use execution-list without it
11:01:38 <rakhmerov> I wish we could spend more time polishing such things
11:01:50 <rakhmerov> LimorStotland: sure, agree 100%
11:02:05 <LimorStotland> cool :-)
11:02:48 <rakhmerov> 12. Add ceilometer apis as mistral actions - https://blueprints.launchpad.net/mistral/+spec/mistral-ceilometer-actions (This is not mandatory but we have some use cases that require this).
11:03:15 <rakhmerov> as we discussed at the team meeting this is pretty easy to implement
11:03:23 <rakhmerov> nmakhotkin can do it in 10 mins ;)
11:03:29 <nkoffman> yes, I did it on our installation as a POC,
11:03:35 <nmakhotkin> :D
11:03:41 <[1]melisha> in 11 mins
11:03:50 <nkoffman> :)
11:03:50 <rakhmerov> :)
11:03:52 <[1]melisha> or was it 9 mins?
11:04:23 <rakhmerov> ok, a serious question: who will be working on it? nmakhotkin or nkoffman?
11:04:40 <nkoffman> I can take it
11:04:57 <rakhmerov> ok
11:05:00 <nkoffman> is it ok for backporting?
11:05:03 <nmakhotkin> ok
11:05:16 <rakhmerov> then I'll tag it properly as well to backport it
11:05:30 <rakhmerov> alright, we ran out of time already actually
11:05:33 <nkoffman> great :)
11:05:42 <nastya_> do we really need to backport it? it is not such a critical bug
11:05:47 <rakhmerov> very productive meeting I think
11:06:01 <rakhmerov> nastya_: good question
11:06:10 <rakhmerov> [1]melisha, nkoffman: what do you think guys?
11:06:15 <nastya_> actually it is a new feature
11:06:16 <[1]melisha> nastya_: You are right. It is not even a bug
11:06:35 <nkoffman> this is indeed not critical, however since this is low risk and useful for our customers, it would be helpful if backported
11:06:37 <[1]melisha> Yes. But such an easy impl will make some really cool use cases possible
11:06:51 <LimorStotland> I don't think it's mandatory to backport, but it would be nice
11:07:00 <[1]melisha> But up to you to decide
11:07:07 <rakhmerov> my perspective: it's pretty easy to backport and if it brings some comfort for your customers then let's do this
11:07:25 <rakhmerov> [1]melisha: completely agree
11:07:33 <rakhmerov> customers' happiness first
11:07:42 <[1]melisha> :-)
11:07:48 <LimorStotland> it has no risk and it can be very useful, so why not?
11:07:56 <rakhmerov> yes
11:08:07 <rakhmerov> ok, are we good now?
11:08:10 <nastya_> ok, if it is not risky, then let's do it
11:08:14 <nkoffman> yes
11:08:15 <rakhmerov> any other questions?
11:08:18 <LimorStotland> yep
11:08:28 <nkoffman> I have a request,
11:08:43 <nkoffman> not regarding the meeting though..
11:09:09 <rakhmerov> let me end the meeting then
11:09:15 <rakhmerov> and we'll continue to talk
11:09:19 <rakhmerov> #endmeeting