04:01:47 <samP> #startmeeting masakari
04:01:47 <openstack> Meeting started Tue Apr 21 04:01:47 2020 UTC and is due to finish in 60 minutes.  The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:48 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:51 <openstack> The meeting name has been set to 'masakari'
04:02:00 <samP> Hi all
04:02:07 <samP> Let's start
04:02:36 <samP> #topic Critical patches and bugs
04:02:37 <suzhengwei> hi
04:02:47 <samP> suzhengwei: hi
04:03:26 <tpatil> LP bug : 1873736
04:03:28 <samP> Any patches that need to merge before we tag the RC for U?
04:03:37 <tpatil> Patch : https://review.opendev.org/#/c/721174/1
04:04:43 <tpatil> Our FT doesn't use tls_proxy=True
04:05:00 <tpatil> Zuul job : masakari-functional-devstack-multinode
04:05:38 <suzhengwei> https://review.opendev.org/#/c/702328/
04:06:04 <suzhengwei> I will add a unit test today; would you like to review it?
04:07:04 <tpatil> suzhengwei: I will review this patch
04:07:18 <suzhengwei> thank you
04:07:58 <samP> tpatil: suzhengwei: thanks
04:08:26 <suzhengwei> https://review.opendev.org/#/c/720623/
04:08:46 <samP> tpatil: about the TLS issue, should we use tls_proxy?
04:09:17 <samP> I don't understand the issue correctly
04:09:26 <suzhengwei> samp: you have given some comments. If you have time, we can talk about it later
04:09:28 <tpatil> I think we should set it to True, but we still don't have any FT that will evacuate a vm
04:10:46 <tpatil> If we set tls_proxy=True, then I expect that FT written for vm notification should fail.
04:11:14 <tpatil> The changes made in the patch are simple  but I will need to verify it first before merging this patch
04:12:28 <suzhengwei> yes, we can talk about it with nova project developers.
04:13:57 <samP> In this patch, it passes the ca cert file to the keystone auth session. If the session goes through the tls proxy then it will use it. Otherwise, it will be ignored.
04:14:05 <samP> I think..
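samP's description of the patch can be sketched roughly as follows. This is illustrative only — the helper name is hypothetical; the real change wires the option into masakari's keystone session setup:

```python
def session_verify(nova_ca_certificates_file):
    """Choose the TLS verification argument for the keystone auth session.

    When a CA bundle is configured, it is handed to the session, so
    requests routed through the TLS proxy are verified against it.
    When no bundle is set, default certificate verification applies
    and the option is simply never consulted.
    """
    if nova_ca_certificates_file:
        return nova_ca_certificates_file  # path to the configured CA bundle
    return True  # fall back to default system certificate verification
```

In keystoneauth1 terms, the returned value would be passed as the session's `verify` argument.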
04:14:33 <samP> since we don't have an FT for this, I can't tell for sure
04:15:04 <suzhengwei> we can look at how other projects deal with this common issue.
04:15:40 <samP> suzhengwei: yes, let's look around..
04:16:51 <tpatil> samP: NotificationVMTestCase->test_create_notification will invoke the changes made in the above patch.
04:18:45 <tpatil> samP: but you are right, we don't have the mechanism to set nova_ca_certificates_file config option, but still in that case the said FT should fail.
04:20:19 <tpatil> currently, there is no way to verify the changes in the gate; we will need to do it manually ourselves
04:20:29 <samP> might it be the case that the FT passes because nova_ca_certificates_file is NULL?
04:20:31 <tpatil> but the changes look simple and pretty straightforward.
04:21:34 <tpatil> if tls_proxy is set to True, I expect the nova_ca_certificates_file config option should be set properly; otherwise I'm expecting it raises SSLError
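For reference, the setup under discussion would look roughly like this. The devstack service name is the standard one, but the masakari config section placement and the CA bundle path are assumptions here:

```ini
# devstack local.conf — route API traffic through the TLS proxy
[[local|localrc]]
enable_service tls-proxy

# masakari.conf — CA bundle for verifying the proxied nova endpoint
# (section placement and path are illustrative assumptions)
[DEFAULT]
nova_ca_certificates_file = /opt/stack/data/ca-bundle.pem
```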
04:23:34 <samP> correct.
04:24:03 <tpatil> one thing is clear, we cannot test it in gate, need to verify manually. I will see if I can make changes to the gate job to verify these changes.
04:24:39 <tpatil> I mean masakari gate job "masakari-functional-devstack-multinode".
04:25:05 <samP> tpatil: thanks. Can you please comment on this patch with your findings
04:26:03 <tpatil> I simply need to verify it, no findings. The changes look good to me.
04:26:25 <samP> in the meantime, let's look around to see how other projects are dealing with the issue.
04:26:36 <tpatil> Are you asking me to add a comment to make changes to the devstack plugin in order to verify this change in gate?
04:27:11 <samP> tpatil: no, I mean if you need to change something in this patch
04:27:28 <samP> if it is good, then it is ok
04:27:54 <tpatil> ok, got it.
04:28:21 <tpatil> Next patch we were talking about is : https://review.opendev.org/#/c/720623
04:29:28 <samP> tpatil: suzhengwei: thanks for bringing this up
04:29:29 <tpatil> I mean next patch....
04:29:41 <samP> got it
04:31:43 <tpatil> First question is : Do we recommend to run masakari-engine on multiple hosts?
04:32:06 <tpatil> I don't think so..
04:32:53 <samP> masakari-engine will not work correctly in that case
04:33:11 <tpatil> samP: Correct, that's my understanding too
04:33:32 <suzhengwei> actually, we have a large scale cloud env: three controllers with three masakari-api/engine.
04:33:53 <suzhengwei> they work correctly.
04:34:26 <samP> did you divide them into separate clusters?
04:35:29 <samP> masakari-monitors only talks to one masakari-api, and that masakari-api passes messages to one masakari-engine
04:35:39 <samP> am I correct?
04:36:20 <tpatil> there will be no issues except for the periodic task
04:36:44 <tpatil> _process_unfinished_notifications
04:36:44 <suzhengwei_> yes
04:37:24 <tpatil> There is a possibility that all three engines process the same notification due to a concurrency issue
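One common way to avoid that race — purely an illustrative sketch, not what masakari does today — is to claim a notification with an atomic compare-and-swap on its status, so only one engine wins:

```python
import threading


class NotificationStore:
    """Minimal in-memory stand-in for the masakari DB (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, notification_id, status="new"):
        self._status[notification_id] = status

    def claim(self, notification_id):
        """Atomically flip 'new' -> 'running'; only one caller succeeds."""
        with self._lock:
            if self._status.get(notification_id) == "new":
                self._status[notification_id] = "running"
                return True
            return False
```

In a real multi-engine deployment the claim would be a single `UPDATE ... WHERE status='new'` statement, so the database enforces atomicity across engines.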
04:37:28 <samP> Then it works as 3 different masakari clusters, with nothing shared between the clusters.
04:38:16 <suzhengwei_> they are sharing messaging via amqp.
04:39:17 <samP> OK then, what tpatil is saying could happen
04:39:28 <suzhengwei_> like other projects, masakari can run with multiple workers.
04:41:22 <samP> I need to check this, but my understanding is the same as tpatil's.
04:42:08 <samP> masakari can run multi workers, but you might face some concurrency issues
04:43:05 <suzhengwei_> I haven't faced any concurrency issues.
04:43:48 <suzhengwei_> except for the reported bug.
04:46:35 <tpatil> suzhengwei_: I need time to think about this new periodic task and all possible scenarios associated with running masakari-engine on multiple machines, because the likelihood of setting FAILED status for already running notifications is high.
04:47:18 <suzhengwei_> ok.
04:47:38 <tpatil> if a notification is in RUNNING status, that means an engine is processing that notification.
04:47:51 <tpatil> if it errors out, then the status would be set to ERROR
04:48:09 <tpatil> which would then be picked up by the periodic task for a second retry
04:48:20 <suzhengwei_> to lower the risk, the default expiry is set to 86400 seconds = one day.
04:49:19 <tpatil> but this inconsistency issue could still occur at the edge..
04:50:57 <suzhengwei_> most times, when notifications have remained for so long, the env has changed significantly; it is high risk to process them again.
04:52:14 <tpatil> so in such cases, you don't want these notifications to be processed, correct?
04:52:30 <suzhengwei_> yes, not processed again.
04:53:01 <tpatil> so in that case, you must at least ignore notifications in RUNNING status
04:53:42 <samP> and you might need to alert the operator about them.
04:53:56 <suzhengwei_> failure is allowed; continued operation should not be blocked.
04:57:15 <tpatil> Can it be based on notification updated_at instead of notification.generated_time?
04:58:04 <samP> that should be more appropriate.... I think
04:58:22 <samP> Anyway, I will add my comments to patch.
04:59:02 <suzhengwei_> no. generated_time is the exact time when the failure occurred.
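The filter being debated could be sketched like this, using generated_time as suzhengwei proposes and the one-day default mentioned above. The function itself is hypothetical, not masakari code:

```python
from datetime import datetime, timedelta

EXPIRED_AFTER = timedelta(seconds=86400)  # proposed one-day default


def should_retry(status, generated_time, now):
    """Decide whether the periodic task should reprocess a notification.

    Skip anything still RUNNING (another engine may own it) and anything
    whose originating failure is older than the expiry window, since the
    environment has likely changed too much to act on it safely.
    """
    if status == "running":
        return False
    return now - generated_time <= EXPIRED_AFTER
```

Basing the check on `updated_at` instead, as tpatil suggests, would only change which timestamp is compared against the window.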
04:59:18 <samP> since we don't have much time, send an email to the ML if you need any patch merged for U
05:00:17 <samP> sorry to cut the meeting off. We don't have more time.
05:00:29 <samP> Let's discuss further in gerrit.
05:00:35 <tpatil> ok
05:00:37 <suzhengwei_> If the failure hasn't been processed for a long time, I think there is no need to process it again.
05:00:55 <suzhengwei_> ok
05:01:01 <suzhengwei_> bye
05:01:03 <samP> Thank you all for coming!
05:01:09 <samP> #endmeeting