03:02:27 <Sundar> #startmeeting openstack-cyborg
03:02:29 <openstack> Meeting started Thu Apr 16 03:02:27 2020 UTC and is due to finish in 60 minutes.  The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:02:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:02:32 <openstack> The meeting name has been set to 'openstack_cyborg'
03:02:38 <brinzhang> welcome to songwenping_
03:03:01 <chenke> from beijing university?
03:03:04 <Sundar> Welcome, songwenping_ . Would you like to introduce yourself?
03:03:14 <Yumeng> Sundar: Good. How's it going with your new job?
03:03:14 <chenke> welcome songwenping
03:03:25 <songwenping_> Hi all
03:03:28 <Yumeng> welcome songwenping
03:03:45 <s_shogo> welcome
03:03:55 <songwenping_> I am from Inspur.
03:04:15 <Sundar> Good. Brin has company. :)
03:04:27 <songwenping_> do the nova-cyborg-interaction work.
03:04:42 <brinzhang> songwenping_ mainly work on nova-cyborg interaction, and his take some patch in this meeting
03:04:54 <brinzhang> Sundar, Yumeng, start meeting?
03:05:04 <Sundar> Interesting. What aspect of Nova interaction?
03:05:34 <xinranwang> welcome
03:05:37 <Yumeng> brinzhang,songwenping_ : Are you planning to use cyborg in production, right?
03:05:40 <brinzhang> GPU, and others
03:05:58 <brinzhang> Yumeng, considering
03:05:59 <Sundar> brinzhang: The meeting has started. We'll get to the patches after hearing a bit more about the Nova interaction.
03:06:22 <Sundar> Is it about more instance ops like shelve and suspend?
03:07:32 <Sundar> songwenping_, brinzhang: ^
03:08:04 <brinzhang> Sundar: now follow you patch https://review.opendev.org/#/c/716185/6/api-guide/source/accelerator-support.rst
03:08:12 <songwenping_> not do these operators
03:08:28 <xinranwang> Sundar:  I'd like to know what operation could be add in next release, shelve/unshelve, like what else? And is there any gap here to complete. Because I have started to investigate it and hope add more operations in Nova in next release.
03:09:11 <brinzhang> Sundar, we also blocked these operations now, maybe we can open that after enough develop
03:09:34 <shaohe_feng> hi all
03:09:55 <Yumeng> xinranwang,brinzhang,Sundar and all: here is the etherpad for V release. https://etherpad.opendev.org/p/cyborg-victoria-goals  we can add topics now.
03:09:58 <brinzhang> Sundar: we are still in testing
03:10:40 <Sundar> xinranwang: The API guide that brinzhang posted above lists the unsupported operations. Many opes there are important for operators. Sean Mooney from Nova is also working on some of these ops for V release, I think. You may want to check with him.
03:11:00 <Sundar> brinzhang: Good to know. Thanks.
03:11:15 <brinzhang> Yumeng: sean-k-mooney works on evacuate now
03:11:20 <Sundar> Yumeng: Thanks for posting the V release etherpad.
03:11:35 <brinzhang> https://review.opendev.org/#/c/715326/
03:12:04 <chenke> Hi Sundar, do you mean sean will implement some virtual machine operations in the next version?
03:12:22 <xinranwang> sean-k-mooney:  Is there already some patches from sean mooney? I'd like to have a look.
03:12:28 <Sundar> chenke: Yes, he says he intends to do that
03:12:28 <xinranwang> Sundar:
03:12:46 <Sundar> xinranwang: Please see https://review.opendev.org/#/c/715326/
03:13:15 <xinranwang> Sundar: thanks.
03:13:15 <chenke> Okay, this is good news, for both cyborg and nova》
03:13:51 <Sundar> Yes.Shall we get started on our patches? https://review.opendev.org/#/q/status:open+project:openstack/cyborg+branch:master
03:14:03 <Sundar> I am reviewing https://review.opendev.org/#/c/696089/
03:14:33 <chenke> Sure.
03:14:55 <brinzhang> Sundar https://review.opendev.org/#/c/696089/, how about this patch? need your check
03:15:25 <brinzhang> except some nits inline, others looks good to me
03:15:36 <Sundar> brinzhang: Yes, that's what I am looking at. Will complete after this meeting.
03:16:13 <Sundar> BTW, Release candidate 1 for Ussuri is due Apr 23 https://releases.openstack.org/ussuri/schedule.html
03:16:36 <Sundar> So, 8 days left. What are the highest priority patches to merge before that?
03:17:11 <Yumeng> brinzhang: Thanks for posting 696089!
03:17:22 <Sundar> Programming support https://review.opendev.org/698190 ?
03:18:18 <shaohe_feng> link:  https://etherpad.opendev.org/p/cyborg-victoria-goals
03:18:32 <shaohe_feng> I have add this program API in the etherpad
03:18:42 <shaohe_feng> this is really needed in product env to adopt cyborg
03:18:43 <Yumeng> Sundar: from my side. I think Zuul check bandit failure can be merged before Ussuri is due. https://review.opendev.org/#/q/topic:fix-bandit-check-failures+(status:open+OR+status:merged)
03:18:56 <Sundar> Yumeng: agreed
03:19:23 <Sundar> But it is failing the bandit check :)
03:19:32 <brinzhang> Yumeng, I found the bandit are not be voted, so you should enable it in zuul task
03:19:59 <brinzhang> and that should be works fine, that we can consider merge it.
03:20:31 <brinzhang> a questions, what is bandit to do? I am know few of it
03:20:32 <Yumeng> Sundar, brinzhang: yes. After we fixed all failure, we can set that to voting. and at that time, bandit check will not fail again.
03:21:09 <brinzhang> IMO, firstly, you should enable this task in zuul task
03:21:29 <Sundar> brinzhang: Bandit is a tool to detect security flaws. https://pypi.org/project/bandit/
03:21:32 <brinzhang> than fix the failure, that we can ensure that correctly
03:22:20 <brinzhang> Sundar: ack, thanks, will see it later
03:22:23 <Sundar> brinzhang: bandit is now a non-voting job. We need to fix the issues there and then make it voting. Otherwise, other open patches will be affected.
03:22:48 <xinranwang> https://review.opendev.org/#/c/696089/ The rafactor arq api is also important, I think.
03:22:48 <Yumeng> bandit is for code security check, which can check operations like if shell operations in cyborg code is safe
03:22:49 <chenke> Ye. I think that's the right way
03:23:07 <Sundar> So, the programming API https://review.opendev.org/698190 is not for Ussuri?
03:23:10 <brinzhang> Sundar: I mean, we should enable it as the base patch, and then fix the issue, right?
03:24:00 <shaohe_feng> Bandit  will scan the python code and gen AST to check  security issues
03:24:49 <s_shogo> Sundar , shaohe_feng : As I mentioned in the program patch , as reply for the Sundar's comment (https://review.opendev.org/698190) , I would like to take discussion the interface of PATCH API or not. IMO, that effects the schedule of merge.
03:25:15 <shaohe_feng> we had better let programming API in cyborg as soon as possible
03:25:34 <shaohe_feng> so we should try do our best to make it ready
03:26:04 <shaohe_feng> this is a very important feature
03:26:16 <shaohe_feng> s_shogo thank you for post this feature.
03:26:19 <Sundar> brinzhang: IMHO, it is better to have a patch series that fixes the open bandit issues and then make it a voting job. We need to merge other open patches for Ussuri. They will be affected if we make it a voting job too soon. Agreed?
03:26:24 <brinzhang> shaohe_feng:I also prefer to use update method
03:26:45 <shaohe_feng> update for firmware update?
03:27:04 <shaohe_feng> do you means a new update API?
03:27:52 <brinzhang> Sundar: we just enable the vote in zuul, but cannot merged, that cannot impact any open patches, I think
03:27:59 <Yumeng> Sundar,brinzhang : Agreed. We  should fix bandit failures first, then update bandit job as voting in zuul check. otherwise, if we change voting first, other patches will get Zuul -1.
03:28:19 <shaohe_feng> let's welcome a new fresher haibin-huang
03:28:38 <brinzhang> if we just fix that bandit failures, we cannot ensure it work fine after we enabled bandit vote in zuul
03:28:40 <Yumeng> brinzhang: I will post all patches for bandit today.
03:28:44 <shaohe_feng> she will try to help verify the program API in product evn.
03:29:17 <chenke> s/she/he
03:29:28 <haibin-huang23> I am very glad to join cyborg family.
03:29:29 <Sundar> Welcome haibin-huang . Would you like to introduce yourself?
03:29:38 <shaohe_feng> sorry, he :]
03:29:48 <Yumeng> welcome haibin-huang! :)
03:29:53 <shaohe_feng> haibin-huang you can help to review this new API
03:29:58 <brinzhang> welcome haibin-huang
03:30:03 <chenke> welcome haibin again. since beijing hackson.
03:30:17 <s_shogo> welcome haibin-huang
03:30:18 <xinranwang> welcome haibin-huang23
03:30:26 <songwenping> welcome haibin-huang too
03:30:41 <shaohe_feng> #link  https://review.opendev.org/#/c/698190/
03:30:59 <brinzhang> Yumeng: ok, sound good, we shuold ensure the bandit run works fine, than star merge the open fix bandit patches
03:31:04 <haibin-huang23> I am from Intel. I work on ONAP in past years. I focus on multicloud in ONAP project
03:31:22 <haibin-huang23> thank you everyone
03:31:55 <Sundar> Good to know you, haibin-huang23
03:32:24 <brinzhang> shaohe_feng, Sundar: the program API, if we use update instead of patch, is it ok?
03:33:51 <Sundar> brinzhang: The HTTP verbs are PUT, POST and PATCH. When you say update, which one do you mean?
03:34:22 <brinzhang> shaohe_feng, Sundar: ... PUT, soory
03:34:33 <Sundar> IMO, PATCH is the one that is closest in semantics to the programming operation. We are also using it for programming during ARQ binding.
03:35:06 <shaohe_feng> brinzhang you can read feilding's paper
03:35:42 <shaohe_feng> try to find which is more accurate
03:36:33 <brinzhang> I just think it's not very popular, if you all think that ok, I will be not concern this ^
03:36:49 <songwenping> https://review.opendev.org/#/c/718584/
03:36:55 <Sundar> PUT should be idempotent. PATCH need not be. The programming operation is not idempotent -- it can fail once and succeed the next time
03:37:26 <shaohe_feng> but the patch seems is newer method, I'm not sure feilding have discuss  the patch in his paper.
03:38:09 <Sundar> brinzhang: I think you pointed me to some Nova oprrations. But Nova developers themselves criticized my patches when I followed their precedent. :)
03:38:20 <Sundar> *operations with PUT
03:39:01 <brinzhang> Sundar: I see that, and I have not looked into
03:39:11 <brinzhang> pls see https://review.opendev.org/#/c/718584/
03:40:46 <brinzhang> and https://review.opendev.org/#/c/718584, that revert device and deployable when resource provider create fail
03:41:25 <Sundar> Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails?
03:41:30 <brinzhang> Sundar: did you review this patch?
03:41:37 <shaohe_feng> brinzhang why do you prefer PUT?
03:42:18 <songwenping> i am sorry my network is not good
03:42:45 <brinzhang> songwenping: Sundar said Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails?
03:43:36 <Sundar> xinranwang, Yumeng, s_shogo, shaohe_feng, all: Re. https://review.opendev.org/#/c/718584, Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails?
03:44:43 <xinranwang> IMO, It is reasonable.
03:45:16 <xinranwang> Otherwise, the conductor will not notice the diff and will now report to placement correctly.
03:45:20 <shaohe_feng> Let me think about it
03:45:46 <xinranwang> which will cause inconsistency
03:46:03 <Sundar> Ok. if you all decide that the best approach is to revert Cyborg db updates, then why not do Placement update first and update Cyborg db only if that succeeds?
03:46:18 <songwenping> xinranwang:right
03:47:02 <shaohe_feng> yes, there will be disaccord  between placement and cyborg DB
03:47:48 <Sundar> Personally, I'd prefer to update Cyborg db always, and mark the objects that failed to sync with Placement. That way, operators will know that Cyborg did its job correctly and Placement failed.
03:48:03 <shaohe_feng> Sundar  our design is do placement update first.
03:48:28 <shaohe_feng> we have discuss this issue in Dan smith's patch.
03:49:00 <brinzhang> Sundar: I saw you comment, I think your suggestion will be compex, but it is also an alternative way
03:49:20 <brinzhang> s/compex/complex
03:49:30 <shaohe_feng> do Placement update first and update Cyborg db only if that succeeds is the easiest way to keep consistent
03:49:46 <shaohe_feng> we have discuss it before.
03:50:26 <Sundar> Ok, I'll support your decision. Has it been tested fully in various scenarios, including Placement service down for some time and then comes up?
03:50:39 <shaohe_feng> bet another things: if placement DB successfully, but cyborg DB failed
03:50:51 <shaohe_feng> then what should cyborg do?
03:51:28 <xinranwang> Sundar:  please see chenke's patch here: https://review.opendev.org/#/c/711912/2
03:51:32 <songwenping> Sundar:no problem
03:51:41 <brinzhang> shaohe_feng: that will be raise > 500, I rembered chenke done this
03:52:06 <songwenping> i have test this scenario
03:52:10 <xinranwang> The commit message has explain the root reason of data imconsistency
03:53:02 <shaohe_feng> so I prefer make sure cyborg DB successfully first.
03:53:48 <Yumeng> agree we do the revert to handle the exception. and Let's  add reconsidering placement report as a V release goal. whether we need to decouple placement report or not.
03:55:27 <xinranwang> If placement succeed and cyborg DB failed. The next time cyborg conductor do diff, it finds a diff,  writes db and update placement with same info. I think the placment will return like "resourve provider already exists" something like this, But it will not crash the service. What do you think?
03:55:40 <shaohe_feng> sundar's comment:  Personally, I'd prefer to update Cyborg db always, and mark the objects that failed to sync with Placement. That way, operators will know that Cyborg did its job correctly and Placement failed.
03:56:03 <brinzhang> this way just reduce the date inconsistence's scenario, that cannot cover all scenarios.
03:56:39 <chenke> actually. no one can cover all scenriaos.
03:56:43 <s_shogo> I also discussed the Placement-Cyborg interaction in enable/disable API , with xinranwang.
03:57:03 <shaohe_feng> we have issues(same to sudar's comments) this in Dan's patch, let us check, why not we give this?
03:57:08 <shaohe_feng> let me check it.
03:57:51 <shaohe_feng> another things:
03:58:31 <xinranwang> chenke's patch has fix several secenarios already.  But as we said, it is difficult to cover all sceanrios.
03:58:32 <shaohe_feng> we need More accelerators drivers support
03:59:03 <xinranwang> I think we can revert like this patch does, and discuss the placement and db update decoupling in V.
03:59:20 <chenke> agree.
03:59:21 <brinzhang> if we add a flag to mark the data, I think we also need to add a period task to check this flag, and then do resource consistency
03:59:49 <Sundar> Our philosophy should be that Cyborg db is the source of knowledge about accelerators in any OpenStack cloud. If we let Placement take precedence, that reduces the value of Cyborg. That's just my opinion. As brinzhang said, it may be complex to make those changes.
04:00:10 <Sundar> brinzhang: we already have a periodc task from the agent
04:01:35 <shaohe_feng> we discuss it in:  https://review.opendev.org/#/c/708726/
04:02:02 <shaohe_feng> Sundar agree.
04:02:07 <shaohe_feng> cyborg db should firstly.
04:03:18 <shaohe_feng> but I see brinzhang agree Dan comments.
04:03:47 <brinzhang> yeah, I agree this way, but now we should do?
04:03:59 <Sundar> shaohe_feng: Dan Smith's suggestion may not apply well for Cyborg. For example, when an instance terminates, Placement resources get released. ARQs are unbound and deleted, but the unbinding may take some time if we do any device cleanup in the future. There is a period of time when the device is still not free (which Cyborg knows) but Placement
04:04:00 <Sundar> doesn't know.
04:04:22 <brinzhang> maybe we can make this as a plan in V release
04:04:37 <shaohe_feng> yes. This is a known and controversial issue
04:04:43 <shaohe_feng> I agree with you.
04:04:47 <brinzhang> songwenping's just make this as a bug fix
04:04:52 <Sundar> brinzhang and all: first, is it an important issue to fix in Ussuri?
04:04:57 <shaohe_feng> and I have discuss with xinranwang for it.
04:05:28 <shaohe_feng> we are all same options with you.
04:06:24 <shaohe_feng> make cyborg db correctly firstly?
04:06:40 <shaohe_feng> brinzhang agree?
04:07:29 <shaohe_feng> or you can list some pros and cons to make the conclusion
04:07:49 <xinranwang> The root reason is that now we coupling placement and db update, and we can not avoid this dependency. IMHO, we should discuss the decoupling in V. Now, we just make sure that there is less risk to have data inconsistency.
04:08:31 <chenke> +1
04:08:32 <brinzhang> I am not sure whether is good for me, when would like to fix this issue? shaohe_feng
04:09:28 <brinzhang> xinranwang: agree +1
04:09:47 <shaohe_feng> OK, let's xinranwang to make a good design firstly
04:09:59 <songwenping> xinranwang: agree +1
04:10:06 <Sundar> xinranwang: are you voting to merge the patch now, and revisit it later?
04:10:34 <brinzhang> xinranwang's mean is to agree to merge this patch, then in V we should do Sundar suggestion, right?
04:10:40 <shaohe_feng> maybe a new spec for it firstly, and we can discuss it base on the new spec.
04:11:26 <xinranwang> I think we can merge it to let U have less risk of data inconsistency
04:12:13 <Sundar> all ok with that?
04:12:32 <Yumeng> agree +1
04:12:53 <Sundar> shaohe_feng, s_shogo ^
04:13:03 <s_shogo> agree, +1
04:13:15 <chenke> some advice inline .
04:13:25 <chenke> https://review.opendev.org/#/c/718584/9/cyborg/conductor/manager.py
04:13:26 <brinzhang> Yumeng, xinranwang: please add this to V release plan
04:13:27 <xinranwang> I have thought about the decoupling a bit, there's lots of things to discuss,several gaps as I mentioned in last meeting. We should think more and discusss more to have a solution of decoupling.
04:13:32 <shaohe_feng> I give up vote.
04:13:55 <shaohe_feng> stay neutral
04:14:05 <Sundar> Ok, great. Please mention this in commit message of this patch.
04:14:36 <shaohe_feng> if we merge it, please leave details note in the patch
04:14:47 <brinzhang> xinranwang: can you leave comments in this patch? that songwenping can update that
04:14:57 <shaohe_feng> such as FIXME or NOTE in it.
04:15:14 <brinzhang> That TODO(), I think
04:15:17 <Sundar> Anything else, folks? We are over the time.
04:15:19 <xinranwang> shaohe_feng:  yes, understand, it is a hard decision, we have a long way to go  lol...
04:15:30 <xinranwang> brinzhang:  sure
04:15:47 <brinzhang> xinranwang: thanks
04:16:07 <brinzhang> it's time to launch, end meetting?
04:16:13 <shaohe_feng> OK.
04:16:28 <shaohe_feng> more drivers are welcome this new release right?
04:16:36 <Sundar> Great. Thanks a lot, everybody! Great progress this cycle. We have another week to make even more contributions. Take care and stay safe!
04:16:39 <Sundar> #endmeeting