03:02:43 <Sundar> #startmeeting openstack-cyborg
03:02:44 <openstack> Meeting started Thu Sep 19 03:02:43 2019 UTC and is due to finish in 60 minutes.  The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:02:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:02:47 <openstack> The meeting name has been set to 'openstack_cyborg'
03:02:59 <Sundar> #topic Who's here
03:03:02 <Sundar> o/
03:03:08 <chenke> o/
03:03:20 <Yumeng> #info Yumeng
03:03:26 <s_shogo> #info s_shogo
03:03:32 <wangzhh> #info wangzhh
03:03:40 <changzhi> #info changzhi
03:03:49 <chenke> #info chenke
03:04:00 <Sundar> Hi chenke, Yumeng, s_shogo, wangzhh. Welcome changzhi
03:04:03 <shaohe_feng> #info shaohe_feng
03:04:10 <Sundar> Hi shaohe
03:04:17 <Sundar> #topic Status
03:04:41 <Sundar> First, thank you all for an active Train cycle. We hit feature freeze a week ago
03:05:00 <Sundar> So did other projects.
03:05:41 <Sundar> The good news: Cyborg side of the Nova integration is pretty much done. We just need to clean up the way we invoke other services
03:06:07 <chenke> Great
03:06:24 <wangzhh> Cool
03:06:33 <Sundar> Not so good news: Our Nova patches did not get enough reviews from Nova developers, and so did not make the cut.
03:07:24 <Sundar> Part of the problem is that the Cyborg patches were open for a long time, so Nova developers did not see the work as ready, though we could bring up a VM with the Cyborg + Nova patches
03:08:16 <Sundar> Also, there was a longstanding request to show tempest CI working. That completed exactly in the milestone week. That was too late to get sustained reviews.
03:08:23 <shaohe_feng> We know integration is a big effort
03:08:45 <shaohe_feng> Sundar: you put in a lot of effort. Thanks
03:09:42 <chenke> It is understandable that patches in Nova get merged slowly.
03:09:51 <Sundar> NP, thanks Shaohe. I am optimistic about U because I think we are close, and I have re-proposed the Nova spec. This time, tempest and most things are merged. Things that attract cross-project attention, like tempest, privsep, sdk_adapter stuff, etc. are all done or making good progress
03:10:36 <Sundar> Hope to get the Nova patches in the runway very early in the cycle. The more we wait, the more things get bogged down among the tons of other reviews.
03:11:28 <Sundar> That said, we have a few more things to wrap up in Train :)
03:12:12 <Sundar> First, remove the hardcoding of 'devstack-admin'. Thanks, chenker and all for addressing that :)
03:12:47 <Sundar> Second, v1 API is deprecated but still supported in Train. But it is not working because we removed all v1 from devstack. I should re-enable it, I think
03:13:28 <xinranwang> #info xinranwang
03:13:31 <Sundar> Shaohe's async bind, privsep, rbac are important
03:13:36 <xinranwang> Hi all
03:14:05 <Sundar> I think all the pep8/flake fixes from chenker/zhurong are looking good and will probably merge this week
03:14:33 <Sundar> Can you all think of anything else?
03:14:44 <Yumeng> Sundar: and please don't forget the patch to update the device_profile db by conductor: https://review.opendev.org/#/c/679406/
03:14:48 <Yumeng> just updated
03:15:06 <shaohe_feng> Sundar, please leave a slot for me to introduce the async jobs, so others can review it easily.
03:15:17 <shaohe_feng> Thanks
03:15:25 <Yumeng> and this gpu fix: https://review.opendev.org/#/c/675059/ I tested it in my devstack env, it works
03:15:57 <Sundar> Ah yes, that too, Yumeng :)  There are quite a few patches up there, including https://review.opendev.org/680953.
03:16:26 <Sundar> Sure, let's knock off as much as we can. Was just listing the ones critical to complete in Train
03:16:35 <openstackgerrit> Merged openstack/cyborg master: P5: Fix pep8 error in cyborg/accelerator  https://review.opendev.org/679175
03:16:54 <Sundar> shaohe_feng: Sure
03:17:15 <Sundar> Folks, anything else before we dive into Shaohe's async bind?
03:18:01 <s_shogo> I'm starting a test and validation task on a real machine, beginning with common functions that are independent of specific accelerators.
03:18:11 <s_shogo> If I find any bugs or errors, I will report them or post patches before the Train release.
03:19:36 <Sundar> Sure, s_shogo. I think the client effort can be aimed early in U release, since the Train release milestone for clients is past
03:19:41 <Sundar> I have some questions on RBAC: https://review.opendev.org/#/c/678177/ . In https://review.opendev.org/#/c/678177/3/cyborg/common/policy.py@83, should it be an allow rule? Anybody can create an ARQ and thereby bind that ARQ, and so program an FPGA?
03:21:11 <s_shogo> Sundar: OK, I'll continue the client and SDK task into the U release.
03:22:26 <Sundar> wangzhh: What do you think?
03:22:33 <xinranwang> should we complete v2 API in T?
03:23:09 <wangzhh> Sundar, it should be allowed, and then rechecked in the method depending on whether it is a program action or not.
03:23:29 <Sundar> wangzhh: ok
03:23:51 <Sundar> xinranwang: Only devices API remains. We are supposed to merge only bug fixes, I think. So, it will probably go to U. Is anything else remaining?
03:25:30 <Sundar> OK, 35 min remaining. Let's move to async bind.
03:25:41 <Sundar> #topic Async bind
03:25:54 <Sundar> Shaohe, take it away!
03:26:05 <shaohe_feng> Now let's start the introduction to async bind. Questions can come after the introduction.
03:26:12 <shaohe_feng> Briefly put, bind is to find a suitable device (maybe PCI, or MDEV) on the right host for a server instance to use.
03:26:18 <shaohe_feng> So what is a suitable device? We need a spec to describe it.
03:26:25 <shaohe_feng> In v1 we describe the device directly in the Nova flavor extra specs, and Cyborg parses the spec. Xinran implemented this work.
03:26:32 <shaohe_feng> In v2, after the PTG discussion, we define it in Cyborg's own Device Profile, and Sundar implemented it.
03:26:43 <shaohe_feng> I had no chance to attend the PTG discussion. For more details, please talk with Sundar.
03:26:50 <shaohe_feng> Thanks to Xinran and Sundar for their efforts.
03:26:59 <shaohe_feng> Before we introduce async bind, let's first go over some implementation rules in the current code.
03:27:08 <shaohe_feng> 1. The AttachHandler in ExtARQ is not a list, so only one AttachHandler (one device per ARQ)
03:27:08 <shaohe_feng> profile group in order to get the expected devices.
03:27:19 <shaohe_feng> Now our Cyborg ARQ bind API is sync, but we define it as async, so we need to improve it.
03:27:28 <shaohe_feng> So what we changed:
03:27:43 <shaohe_feng> 1. Use a thread pool to start the async job.
03:27:50 <shaohe_feng> In the Cyborg spec, Sundar suggests using concurrent.futures; yes, it is a Python standard library. See the official Python link:
03:27:57 <shaohe_feng> https://docs.python.org/3/library/concurrent.futures.html
03:28:05 <shaohe_feng> We can also "green" it with greenlets, by patching it with eventlet.
03:28:11 <shaohe_feng> futures = eventlet.import_patched('concurrent.futures') # 'greening' futures,
03:28:13 <openstackgerrit> Merged openstack/cyborg master: P6: Fix pep8 error in cyborg/agent and cyborg/db  https://review.opendev.org/679193
03:28:27 <shaohe_feng> so it is easy to green
03:28:37 <shaohe_feng> See the Python mailing list discussion.
03:28:52 <shaohe_feng> I have done a simple test and it works, but I did not test its performance, so greening is not enabled in the patch.
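
A minimal sketch of the thread pool approach described above, using only the standard library's concurrent.futures. The names slow_bind and start_bind are hypothetical and are not taken from the Cyborg patch; the sleep stands in for a slow bind step.

```python
import concurrent.futures
import time


def slow_bind(arq_uuid):
    """Hypothetical stand-in for a slow bind step (not Cyborg code)."""
    time.sleep(0.1)
    return "%s bound" % arq_uuid


# A small pool; an API handler could return right after submit().
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def start_bind(arq_uuid):
    """Submit the bind job and return a Future without waiting for it."""
    return executor.submit(slow_bind, arq_uuid)


future = start_bind("arq-1")
# The caller is free to do other work here; result() blocks only when asked.
result = future.result(timeout=5)
executor.shutdown()
```

The point of the design is that the API call returns as soon as the job is submitted, while the state of the ARQ is updated later by the job itself.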
03:29:00 <shaohe_feng> 2. I moved the bind logic out of the ExtARQ object.
03:29:13 <shaohe_feng> Let ExtARQ keep its basic functions, such as CRUD for its attributes.
03:29:20 <shaohe_feng> Move it to cyborg/accelerator/common/handler.py (not sure this is a good place; this is an open question)
03:29:30 <shaohe_feng> Add a basic and general bind handler class named Accelerators. (not sure this is a good name; this is an open question)
03:29:37 <shaohe_feng> It supports the base _bind
03:29:44 <shaohe_feng> https://review.opendev.org/#/c/681005/16/cyborg/accelerator/common/handler.py
03:29:54 <shaohe_feng> If a new accelerator needs extra operations, you can derive from it and extend it as needed, as for FPGA
03:30:02 <shaohe_feng> line 386 at
03:30:47 <shaohe_feng> For FPGA, it needs to get the image metadata, download the image, program the image and update placement.
03:31:18 <shaohe_feng> If _bind is time-consuming, use "wrap_job_tb" to wrap it.
03:31:30 <shaohe_feng> In this wrapper I tag it with "is_job", and it can catch every Exception/traceback during the bind process and then log it.
03:31:31 <openstackgerrit> Merged openstack/cyborg master: P7: Fix pep8 error in cyborg/objects and cyborg/image  https://review.opendev.org/679526
03:31:32 <openstackgerrit> Merged openstack/cyborg master: P8: Fix pep8 error in cyborg/tests and add post_mortem_debug.py  https://review.opendev.org/679538
03:31:38 <shaohe_feng> I also added a bind in the general class to start the jobs tagged with "is_job".
03:31:46 <shaohe_feng> I also added a master to monitor the jobs (as Sundar suggested)
03:31:52 <shaohe_feng> https://review.opendev.org/#/c/681005/16/cyborg/accelerator/common/handler.py
03:32:00 <shaohe_feng> It checks the job status and also gets the job's Exception/traceback.
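
The wrapper and monitor described above might look roughly like the sketch below. Only the names wrap_job_tb and is_job come from the discussion; the bodies, the monitor function and the two sample jobs are assumptions made for illustration and are not the code in the patch.

```python
import concurrent.futures
import functools
import logging
import traceback

LOG = logging.getLogger(__name__)


def wrap_job_tb(func):
    """Tag func as a job and log any exception raised while it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            LOG.error("Job %s failed:\n%s",
                      func.__name__, traceback.format_exc())
            raise  # keep the exception visible to the Future
    wrapper.is_job = True
    return wrapper


@wrap_job_tb
def ok_bind():  # hypothetical job that succeeds
    return "Bound"


@wrap_job_tb
def failing_bind():  # hypothetical job that fails
    raise RuntimeError("programming failed")


def monitor(futures_by_name):
    """Wait for each job and record whether it failed (assumed design)."""
    report = {}
    for name, fut in futures_by_name.items():
        err = fut.exception()  # blocks until the job is done
        report[name] = "failed: %s" % err if err else "ok"
    return report


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    jobs = {"ok": pool.submit(ok_bind), "bad": pool.submit(failing_bind)}
    report = monitor(jobs)
```

Re-raising inside the wrapper is a design choice in this sketch: the traceback is logged where it happens, and the monitor can still read the failure from the Future.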
03:32:28 <shaohe_feng> please add a SUPPORT_RESOURCES in
03:32:41 <shaohe_feng> 4. I added ARQ_STATES_TRANSFORM_MATRIX to sync the status.
03:32:49 <shaohe_feng> After talking with Sundar and Xinran, we added extra statuses: ARQ_DELETING and ARQ_BIND_STARTED
03:32:57 <shaohe_feng> line at 29
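
One way a state transition matrix of this kind could be laid out is sketched below. The names ARQ_STATES_TRANSFORM_MATRIX, ARQ_DELETING and ARQ_BIND_STARTED come from the discussion; the other states, the string values, the allowed transitions and can_transition are assumptions for illustration only.

```python
# State names; ARQ_INITIAL, ARQ_BOUND and ARQ_BIND_FAILED are assumed here.
ARQ_INITIAL = "Initial"
ARQ_BIND_STARTED = "BindStarted"
ARQ_BOUND = "Bound"
ARQ_BIND_FAILED = "BindFailed"
ARQ_DELETING = "Deleting"

# Maps a target state to the states it may be entered from (illustrative).
ARQ_STATES_TRANSFORM_MATRIX = {
    ARQ_BIND_STARTED: [ARQ_INITIAL],
    ARQ_BOUND: [ARQ_BIND_STARTED],
    ARQ_BIND_FAILED: [ARQ_BIND_STARTED],
    ARQ_DELETING: [ARQ_INITIAL, ARQ_BOUND, ARQ_BIND_FAILED],
}


def can_transition(current, target):
    """Return True if moving from current to target is allowed."""
    return current in ARQ_STATES_TRANSFORM_MATRIX.get(target, [])
```

With a table like this, a concurrent job that tries to mark an ARQ as bound after it has already moved to deleting can be rejected in one place.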
03:33:11 <shaohe_feng> I just refactored Sundar's work. I did not change his logic at present, so I did not change any API definition exposed to users. Thanks for Sundar's effort.
03:33:17 <shaohe_feng> I did not test multi/batch ARQs, for example, a request for 2 FPGAs, or 1 GPU and 1 FPGA.
03:33:21 <shaohe_feng> I have no real environment for that.
03:33:50 <shaohe_feng> So I think we need to merge the patch, and let more developers test it.
03:34:00 <shaohe_feng> That's the difference from VM management. Ironic or Cyborg sometimes need hardware, so it is difficult to manage.
03:34:39 <shaohe_feng> the commit message shows you how to test this patch and
03:35:17 <shaohe_feng> analyze the process from the log:   https://review.opendev.org/#/c/681005/16//COMMIT_MSG
03:35:39 <shaohe_feng> Also, there is still a lot of work to do on it. We need to improve it continuously. Let it work first, then improve it.
03:36:31 <shaohe_feng> sorry
03:37:01 <shaohe_feng> any questions?
03:37:30 <Sundar> shaohe_feng: Thanks for all the time and hard work
03:38:19 <Sundar> For testing, hope people can use the fake driver. It supports FPGA resource class. Can we get it to take the programming patch but treat it as a no-op?
03:38:36 <Sundar> *programming code path
03:39:19 <shaohe_feng> Do you mean adding a mock so that it does not really program?
03:39:23 <Sundar> Yes
03:39:36 <shaohe_feng> Hardware support is really harder than VM
03:39:51 <Yumeng> shaohe_feng: that's really a comprehensive and deep research and very helpful introduction.
03:40:40 <shaohe_feng> Yumeng thanks. Hopefully it is useful.
03:40:50 <s_shogo> Thanks, shaohe_feng :
03:41:12 <xinranwang> shaohe_feng:  thanks Shaohe for your efforts
03:41:13 <shaohe_feng> Sundar let me give a method to mock it later.
03:41:18 <Sundar> Not everybody has hardware, as you said. But concurrent execution is not easy to test thoroughly. It may work in my env but fail in somebody else's. We can hopefully get more people to check it out using the fake driver
03:41:30 <Sundar> Great, thanks
03:41:54 <shaohe_feng> Yes, will give a guide for how to mock it.
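
Treating the programming code path as a no-op, as discussed above, could be sketched with the standard library's unittest.mock. FPGADriver, program and bind_fpga are hypothetical names for illustration; they are not the Cyborg fake driver or its API.

```python
from unittest import mock


class FPGADriver:
    """Hypothetical driver; real programming needs hardware."""

    def program(self, device, image):
        raise RuntimeError("no FPGA hardware in this environment")


def bind_fpga(driver, device, image):
    """Hypothetical bind step that goes through the programming path."""
    driver.program(device, image)
    return "Bound"


driver = FPGADriver()
# Replace program() with a no-op for the duration of the block, so the
# bind code path still runs end to end without touching hardware.
with mock.patch.object(driver, "program", return_value=None) as fake_program:
    state = bind_fpga(driver, "fpga-0", "image-uuid")
```

The mock also records its calls, so a test can check that the programming step was reached with the expected arguments.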
03:42:26 <chenke> Great jobs thanks ShaoHe.
03:42:40 <Sundar> Also: "Move it to cyborg/accelerator/common/handler.py". Bind is really an operation on an ExtARQ. It logically belongs with objects/ext_arq.py. If you want to split that into separate source file, that is OK. But it can be a mix-in rather than a separate object/class, IMHO
03:43:37 <Yumeng> shaohe_feng: great! looking forward to the mock guide
03:44:12 <shaohe_feng> I checked Nova's object code, then I made this change.
03:44:28 <wangzhh> shaohe_feng, Thx for your effort.
03:44:38 <shaohe_feng> Sundar any details for how to split it?
03:45:23 <Sundar> shaohe_feng: I found this blog useful: http://www.qtrac.eu/pyclassmulti.html
03:46:00 <Sundar> It considers many ways to split a Python class into different source files, and finally recommends mix-ins
03:48:43 <shaohe_feng> I glanced at it. It seems to be a big change.
03:49:52 <Sundar> Hmmm... only the last part is the mix-in. That could be a small change. You can move your chosen methods into a separate file, put it in a mix-in, and inherit that mix-in into the ExtARQ object class
03:50:11 <Sundar> I can help as much as I can.
03:51:09 <shaohe_feng> good, then I can write a mock env guide for testing.
03:51:35 <Sundar> In that article, the last section "The Definitive Version?" alone is about mix-ins
03:51:40 <Sundar> OK, great
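
The mix-in split suggested above can be sketched as follows. BindMixin and the simplified ExtARQ here are illustrative only: the real ExtARQ is a versioned object with more fields, and the bind body is a placeholder.

```python
class BindMixin:
    """Bind-related methods, which could live in their own source file."""

    def bind(self, hostname):
        # Placeholder logic; the real bind would do much more.
        self.state = "Bound"
        self.hostname = hostname
        return self


class ExtARQ(BindMixin):
    """Simplified stand-in that inherits the bind methods from the mix-in."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.state = "Initial"
        self.hostname = None


arq = ExtARQ("arq-1").bind("host-1")
```

Callers still see bind as a method of ExtARQ, so the split changes the file layout without changing how the object is used.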
03:52:34 <Sundar> Anything else, Shaohe?
03:52:48 <shaohe_feng> no, that's all for me.
03:53:02 <Sundar> Thanks very much, once again.
03:53:06 <Sundar> #topic AoB
03:53:09 <shaohe_feng> let's move the patch on
03:53:20 <Sundar> Python IPv6 jobs: https://review.opendev.org/#/c/682517/ Please review
03:53:52 <Sundar> Many patches hit merge conflict after recent merges
03:54:04 <shaohe_feng> it does not matter.
03:54:25 <shaohe_feng> we just improve our git skill
03:54:49 <shaohe_feng> in other active projects
03:55:18 <Sundar> We need one more review for https://review.opendev.org/#/c/680953/ from outside Intel.
03:55:28 <shaohe_feng> conflicts are very common
03:55:51 <Sundar> Sure
03:56:03 <Sundar> Train schedule: https://releases.openstack.org/train/schedule.html RC1 candidate is next week!
03:56:18 <Sundar> Hope to get the critical patches in by that time.
03:56:49 <Sundar> After that, even bug fixes are not assured
03:57:48 <Sundar> BTW, Cyborg will get packaged as an RPM as part of the OpenStack release:  https://opendev.org/openstack/rpm-packaging/src/branch/master/openstack/cyborg
03:58:19 <Sundar> Anything else, guys?
03:58:28 <shaohe_feng> no
03:58:32 <chenke> no
03:58:52 <Sundar> Have a good day! Bye
03:58:56 <Sundar> #endmeeting