14:01:23 <edmondsw> #startmeeting PowerVM Driver Meeting
14:01:24 <openstack> Meeting started Tue Sep  4 14:01:23 2018 UTC and is due to finish in 60 minutes.  The chair is edmondsw. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:26 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:29 <openstack> The meeting name has been set to 'powervm_driver_meeting'
14:01:45 <edmondsw> #link https://etherpad.openstack.org/p/powervm_driver_meeting_agenda
14:02:05 <edmondsw> #topic In-Tree Driver
14:02:57 <edmondsw> I was out last week, so if anything of significance happened here I missed it
14:03:08 <edmondsw> efried anything of note?
14:04:14 <efried> nope
14:04:39 <edmondsw> alrighty then
14:05:02 <edmondsw> #topic Out-of-Tree driver
14:05:24 <edmondsw> anything here?
14:06:37 <efried> I'll wait for other topics
14:06:49 <edmondsw> #topic Device Passthrough
14:07:04 <edmondsw> efried ^
14:07:08 <efried> okay
14:07:15 <efried> so things are getting confusing here.
14:07:22 <efried> Bear with me for a bit
14:08:18 <efried> Background:
14:08:18 <efried> I proposed the nova-powervm spec, and have a bunch of code up for it.
14:08:18 <efried> kosamara (CERN) noticed and proposed essentially the same spec into nova.
14:08:18 <efried> It has been getting reviews.
14:08:47 <efried> Recently the elephant in the room was brought up, which is: How does this play with cyborg?
14:09:23 <efried> And at this point I'm... not exactly stuck, but I haven't yet decided how we're going to answer that, either from a nova perspective or from a nova-powervm perspective.
14:10:02 <efried> I *think* in nova we're going to need to essentially abandon the idea of doing anything independent of cyborg.
14:10:27 <edmondsw> ?
14:10:58 <efried> And what that's going to mean in practical terms - despite the cyborg team's best intentions (which are IMO and based on my experience, naïve) - is that there's no way in hell we're going to have anything workable in nova in the Stein timeframe.
14:11:28 <efried> so what we need to figure out is what that means for nova-powervm.
14:11:42 <efried> because we want^Wneed to have something workable in Stein
14:12:52 <efried> So I think this means we're going to need to move forward with something like the plan we started on for nova-powervm. And then around the Train release, we'll need to do another big shift to get things working with cyborg. Or... not.
14:13:14 <efried> depending how far nova gets with that, and how far we're willing to diverge and/or stay separated from what they're doing.
14:13:21 <edmondsw> back up to why nova would abandon the idea of doing anything independent of cyborg
14:13:57 <efried> It would be like saying nova is going to add some kind of new volume support without involving cinder.
14:14:15 <efried> or network support without involving neutron.
14:14:16 <edmondsw> is it, though? I don't think so
14:14:39 <efried> Well, that's why I said *think*.
14:14:47 <edmondsw> maybe we need to start with what you mean by independent
14:15:01 <efried> Right now nova has the legacy pci passthrough subsystem.
14:15:06 <efried> Everybody hates it and agrees it needs to diaf.
14:15:11 <edmondsw> yep
14:15:20 <efried> and we've all agreed that whatever solution comes next ought to involve placement
14:16:07 <efried> And in Denver last year (Queens ptg) we started talking about generic device management, inventory/whitelist via yaml, ... all the stuff we've been working on putting together right now.
14:16:24 <efried> But then at some point during Q/R, the cyborg project materialized
14:16:37 <efried> and now, device management is recognized as being their bailiwick.
14:16:41 <efried> so
14:16:47 <efried> in addition to involving placement
14:17:01 <efried> any device management work is also seen as needing to involve cyborg.
14:17:27 <edmondsw> isn't cyborg more about device programming than device management?
14:17:43 <efried> Like, I don't see nova deciding it's okay to implement a nova+placement solution in Stein only to have to rework everything to make it a nova+placement+cyborg solution in Train.
14:18:33 <efried> and furthermore, the aforementioned naïveté will have us working as if n+p+c could become a reality in Stein
14:18:38 <efried> even though that's IMO a pipe dream.
14:19:02 <edmondsw> and isn't cyborg specific to accelerators, with no intention to have anything to do with other types of devices?
14:19:35 <efried> edmondsw: To answer your question, the practical *value add* of cyborg, in the short/middle term, is programming accelerators. But their scope is definitely defined to encompass device management in general.
14:19:48 <edmondsw> not according to them
14:19:51 <edmondsw> https://wiki.openstack.org/wiki/Cyborg
14:20:03 <edmondsw> " to provide a general purpose management framework for acceleration resources"
14:20:27 <edmondsw> and I've never heard them mention anything more general
14:21:04 <efried> heard where?
14:21:17 <efried> In their meetings? Specs? IRC? Dublin?
14:22:12 <efried> That wiki page hasn't had substantive updates since last November; I wouldn't rely on it as being a current/accurate description of their project's scope.
14:22:37 <edmondsw> yes. I won't claim intimate familiarity with what they're doing, but I've talked to them a few times, read some things on the ML, etc... could certainly have missed this, but it would be a surprise
14:22:50 <efried> It should also be noted that, until recently, there was nobody on that team with a fabulous grasp of English.
14:23:07 <edmondsw> so where are they defining their scope?
14:24:16 <efried> Well, here's an example of a spec: http://logs.openstack.org/38/577438/11/check/openstack-tox-docs/cc6ea12/html/specs/rocky/approved/compute-node.html
14:24:40 <edmondsw> again, that says "for accelerators"
14:24:42 <efried> The bulk of the first section is boilerplate that they're including in all of their specs, and it pretty well describes what they're doing.
14:24:47 <efried> Okay, what do you think an accelerator is?
14:25:09 <efried> It certainly encompasses GPUs, which is what we care about right now.
14:25:10 <edmondsw> ok, do you think all devices are accelerators?
14:25:33 <edmondsw> (the correct answer is no)
14:25:44 <edmondsw> e.g. infiniband adapter
14:25:59 <efried> of course; not sure how that matters in this context though.
14:26:01 <edmondsw> so what happens to them?
14:26:28 <edmondsw> if the solution must involve cyborg, and cyborg won't have anything to do with infiniband, does the solution not cover infiniband?
14:26:30 <edmondsw> sounds like it
14:26:32 <efried> Well, cyborg is going to manage them also. And SR-IOV etc.
14:27:02 <edmondsw> then they need to state that they're broadening their scope
14:27:08 <efried> But I guess that's been an underlying assumption in the background of discussions rather than explicitly stated in a spec or anything.
14:27:12 <edmondsw> to include more than just accelerators
14:27:13 <efried> okay, cool, you should tell them that.
14:28:37 <efried> still not sure how this gets us further along.
14:28:49 <efried> let's say hypothetically that they'll never manage infiniband or SR-IOV.
14:28:59 <efried> How does that help us get non-cyborg management of GPUs into nova in Stein?
14:29:18 <edmondsw> wrong question
14:29:55 <edmondsw> forget schedules until we figure out what the right path is
14:30:38 <efried> okay
14:30:44 <efried> How does that help us get non-cyborg management of GPUs into nova?
14:30:54 <efried> is that the right question?
14:31:44 <edmondsw> that's a good topic for conversation including the nova and cyborg teams
14:32:53 <edmondsw> I'm just saying that it seems we need to step back and look at this more generally
14:32:55 <efried> right; and last week someone (I don't remember who) asked cyborg to put up a nova-specs doc to describe what they think the *nova* side of things is going to look like.
14:33:08 <efried> so I think we'll know more based on the outcome of that
14:33:12 <efried> *and*
14:33:15 <efried> whatever happens next week.
14:33:27 <efried> btw, who all is going to Denver?
14:33:39 <edmondsw> right... to what extent does cyborg need to be involved when the device is an accelerator? Would need the cyborg guys to chime in there
14:33:55 <efried> you mean when the device is *not* an accelerator?
14:34:03 <edmondsw> and then how do we handle non-accelerators that cyborg doesn't care about?
14:34:04 <efried> or was the emphasis on *need*?
14:34:21 <edmondsw> so that we can cover both accelerators and non-accelerators in whatever design is worked out
14:34:36 <edmondsw> I'm fine including cyborg to the extent that makes sense. I just want a plan that covers more than accelerators
14:34:56 <edmondsw> and then when the long-term plan is laid out, we can figure out how to incrementally get there while meeting business objectives along the way
14:35:35 <edmondsw> efried to your questions: 1) let's talk PTG in open discussion, 2) no, 3) no
14:37:19 <edmondsw> restating... we need the cyborg guys to chime in on the extent to which they need to be involved when the device is an accelerator as an important thing for nova to understand while designing a solution that is not accelerator-specific but does support accelerators
14:38:19 <edmondsw> i.e. cyborg definitely needs to be involved here, but not required
14:38:30 <efried> "here" where?
14:38:35 <efried> Are you speaking for nova?
14:39:32 <edmondsw> cyborg definitely needs to be involved in device attachment, but can't be required for devices that cyborg doesn't have anything to do with
14:40:21 <edmondsw> right?
14:40:41 <efried> okay, and I'm saying I'm pretty sure, long-term, "devices that cyborg doesn't have anything to do with" == {}
14:41:56 <efried> but again, we should be attempting to get clarity on that soon, esp next week.
14:42:06 <efried> My plan for M/T is to be hanging out in the cyborg room.
14:42:13 <edmondsw> "let's say hypothetically that they'll never manage infiniband or SR-IOV."
14:42:29 <efried> yeah; I don't think that's valid, just hypothetical.
14:43:04 <edmondsw> sounds good... we definitely need clarity there
14:43:53 <edmondsw> table this until that's figured out?
14:44:59 <efried> Yup. You'll notice I opened up with how I'm confused and unsure and needing discussion/clarity.
14:45:27 <edmondsw> yup. I hope this helped? I'm at least glad to understand what's going on
14:45:39 <efried> I made some predictions based on what I've read, discussed, but also sensed and felt as (apparently) purely undercurrents.
14:45:58 <efried> Well, no, I don't feel further along on any of that I'm afraid.
14:46:07 <efried> but that's okay, I didn't really expect to.
14:46:25 <efried> I was just airing what's been going on (in my head and elsewhere) on the topic.
14:46:33 <efried> to get/keep y'all informed.
14:47:00 <edmondsw> thanks
14:47:38 <efried> not totally sure what, if any, action I should be taking this week on the device passthrough front.
14:48:00 <efried> other than continuing to review cyborg specs.
14:48:07 <efried> I wish Sundar would spend more time in IRC.
14:48:17 <efried> so I could like ask him some of these questions.
14:48:40 <edmondsw> find out whether cyborg has any intention of handling devices that are not accelerators?
14:49:12 <efried> ack
14:49:47 <edmondsw> that seems to be the key
14:50:45 <edmondsw> if you're right and they will handle all devices, then yeah, I totally get why nova would make them integral to the design and we'll have to then figure out how we deal with that in Stein
14:51:43 <edmondsw> but have to get that answered first so we're not designing based on the wrong assumptions
14:51:47 <edmondsw> thanks
14:52:01 <edmondsw> #topic PowerVM CI
14:52:22 <edmondsw> mujahidali ^
14:52:53 <edmondsw> link: http://ci-watch.tintri.com/project?project=nova
14:52:57 <mujahidali> We are seeing in-tree failures on almost all the jobs. I tried to look into it but had no luck.
14:52:59 <edmondsw> #link: http://ci-watch.tintri.com/project?project=nova
14:53:09 <edmondsw> yeah, I was just going to ask about that
14:53:42 <edmondsw> mujahidali does it look like the same issue that efried dug into last week?
14:53:56 <edmondsw> efried do you know if that fix merged?
14:54:03 <mujahidali> not sure.
14:54:11 <efried> https://review.openstack.org/#/c/598365/ not yet merged.
14:54:22 <efried> I shoulda freakin +W'd it before Sylvain got hold of it.
14:55:17 <mujahidali> All the failing in-tree jobs are failing on the same 39 test cases.
14:55:31 <edmondsw> efried does he not realize this is causing CI runs to fail?
14:55:45 <edmondsw> I'd think there would be a little more urgency to merge and then cleanup in a followup in that case
14:56:10 <efried> I would think so too. I hadn't gotten back around to it yet today, but I'll catch up quick and suggest that.
14:56:17 <efried> It's not blocking libvirt CI, so they don't give a shit.
14:57:09 <edmondsw> thanks
14:57:28 <edmondsw> mujahidali what else?
14:57:32 <mujahidali> nodepool latest version is 3.2.0 https://zuul-ci.org/docs/nodepool/releasenotes.html
14:57:33 <mujahidali> so why does the etherpad say we want to upgrade it from 0.3.0 to 0.5.0?
15:00:06 <edmondsw> mujahidali I think we want to upgrade to *at least* 0.5.0
15:00:12 <edmondsw> so 3.2.0 would be fine
15:00:47 <edmondsw> and I'd rather we use the latest we can
15:01:32 <efried> what could possibly go wrong?
15:01:37 <edmondsw> lol
15:01:38 <mujahidali> I wanted to try the nodepool upgrade and the installation of zookeeper along with diskimage-builder on the staging environment. esberglu: can I directly do a pip install --upgrade nodepool?
15:02:38 <edmondsw> I'll let you guys work that out after the meeting
15:02:45 <edmondsw> any other status?
15:02:55 <efried> btw, I unfortunately think holding up https://review.openstack.org/#/c/598365/ to get the conf helps modified is legit because this is going to be backported, so we want it in one patch.
15:03:06 <edmondsw> efried ack
15:03:16 <efried> I'll see if I can light a fire under Matt to do that.
15:03:23 <efried> I would do it, but want to retain my +2 power.
15:03:27 <mujahidali> efried: there are some dependencies on the jenkins version; if we upgrade one, then we need to upgrade all the other dependent components.
15:03:27 <edmondsw> mujahidali I guess we could just add it to our patching file, right?
15:04:15 <edmondsw> mujahidali patch it in for now, so we can see if anything else pops up to cause issues
15:04:24 <mujahidali> I think yes.
15:06:29 <edmondsw> #topic Open Discussion
15:06:51 <edmondsw> looks like I will not be attending the PTG next week
15:07:04 <edmondsw> and last I heard they were still trying to get approval for gman-tx and efried
15:07:44 <efried> I'm approved and booked
15:07:50 <edmondsw> yay
15:08:00 <efried> I wanted to bring up that other CI topic
15:08:12 <edmondsw> go ahead
15:08:37 <edmondsw> oh, yeah, I meant to do that
15:08:47 <efried> You want to take it?
15:08:49 <edmondsw> I added some notes to the CI todo etherpad about it
15:09:02 <efried> thought we should bring mujahidali up to speed here
15:09:08 <edmondsw> yep
15:09:09 <efried> in case he can take action
15:09:26 <edmondsw> mujahidali basically, efried wrote a patch where we make our virt driver not work
15:09:42 <efried> https://review.openstack.org/599066
15:09:48 <edmondsw> so then we can look at what tests pass there and know that those tests must not really be things we need to test in our CI
15:10:01 <efried> http://184.172.12.213/66/599066/5/check/nova-powervm-out-of-tree-pvm/a1b42d5/powervm_os_ci.html.gz
15:10:05 <edmondsw> and it turns out there are >700 tests that passed
15:10:53 <edmondsw> I think ideally we work with nova and tempest to develop a solution that will allow us to say "test virt driver" rather than try to skip every individual thing that doesn't touch the virt driver
15:11:16 <edmondsw> so like I said, I threw that on the TODO etherpad
15:11:23 <edmondsw> efried anything to add?
15:11:31 <edmondsw> mujahidali make sense?
15:11:43 <efried> I wanted to say
15:12:12 <efried> that if mujahidali has time in the near future, it would be neat to try to assemble that list of 707 tests in its own skip section and/or separate blacklist or whatever
15:12:43 <efried> and set us up to somehow run with that blacklist (i.e. the smaller subset of tests) like 90% of the time or something.
15:13:04 <mujahidali> efried: are you asking me to add the failed test cases to the blacklist?
15:13:11 <efried> Not the failed tests.
15:13:14 <efried> The 707 passing ones.
15:13:16 <efried> But
15:13:27 <mujahidali> why the passing one ??
15:13:31 <efried> I would like to see it done in such a way that we can toggle it on and off, preferably at run time, preferably automatically.
15:13:39 <efried> Because those are the tests that aren't touching our code, so we don't really care about them.
15:13:47 <mujahidali> okay
15:14:05 <efried> There's value to running them every now and then, in case somebody somewhere else makes a change that happens to break specifically when it runs on Power.
15:14:33 <efried> But if we could limit the full run to only a small fraction of the time, it would take a big load off our CI systems, and reduce our run time drastically.
15:15:01 <efried> So like, use a random number and if it's less than a certain threshold, run the full test; otherwise run the subset. Kind of thing.
15:16:09 <efried> I would guess that logic would happen in the shell script somewhere. Let me know if you need help coding it up.
15:16:32 <mujahidali> sure.
15:17:43 <mujahidali> So if we add the 700 passing tests to the blacklist then they will never run, so how are we going to run the full test?
15:18:21 <efried> that's what I'm saying, we would want to maintain that blacklist as a separate file; then whenever we don't hit that random threshold, we cat it together with the real blacklist.
15:18:24 <edmondsw> toggle between different blacklist files
15:18:27 <efried> or that
15:18:41 <edmondsw> or what efried said
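[editor's note] The toggle described above (draw a random number; below a threshold run the full suite, otherwise append the driver-agnostic blacklist to the real one) could look roughly like the sketch below. It is written in Python for illustration only; the log suggests the real logic would live in the CI shell script. The function name, the file paths, and the 10% default are assumptions, not anything taken from the actual CI code.

```python
import random
from pathlib import Path
from typing import Optional


def choose_blacklist(base: Path, driver_agnostic: Path, out: Path,
                     full_run_probability: float = 0.1,
                     rng: Optional[random.Random] = None) -> bool:
    """Write the effective blacklist to `out`; return True for a full run.

    `base` is the regular blacklist that always applies.
    `driver_agnostic` (hypothetical name) lists the tests that pass even
    with the virt driver disabled, i.e. tests that never touch our code.
    """
    rng = rng or random.Random()
    # random() is in [0.0, 1.0), so a probability of 0.1 yields a full
    # run roughly one time in ten.
    full_run = rng.random() < full_run_probability

    text = base.read_text()
    if not full_run:
        # Subset run: concatenate both files, like `cat base extra`.
        if text and not text.endswith("\n"):
            text += "\n"
        text += driver_agnostic.read_text()
    out.write_text(text)
    return full_run
```

Keeping the two lists in separate files means the regular blacklist is never edited by the toggle, and the full run stays one coin flip away. Passing a seeded `random.Random` makes the choice reproducible when debugging.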
15:19:15 <mujahidali> full_blacklist file ??
15:19:48 <mujahidali> I will do that ASAP once the CI is stable.
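[editor's note] Assembling the separate list of roughly 700 tests could be sketched as follows, assuming the results of the disabled-driver run are available as (test id, status) pairs. The input shape, the "passed" status string, and the function name are assumptions for illustration; the real report format is not shown in this log. The `$` anchor also assumes that the test runner matches each regex against the complete test id.

```python
import re
from typing import Iterable, List, Tuple


def build_driver_agnostic_blacklist(
        results: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one anchored regex per test that passed with the driver disabled.

    A test that passes while the virt driver is broken cannot be
    exercising the driver, so it is a candidate for skipping on most runs.
    """
    passed = sorted({test_id for test_id, status in results
                     if status == "passed"})
    # Escape each id so characters such as '.' and '[' match literally.
    return ["^" + re.escape(test_id) + "$" for test_id in passed]
```

Writing the returned lines to their own file, one per line, gives the separate blacklist that can be toggled on and off without touching the regular one.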
15:21:32 <edmondsw> #endmeeting