13:02:10 <ijw> #startmeeting pci_passthrough
13:02:11 <openstack> Meeting started Mon Jan 20 13:02:10 2014 UTC and is due to finish in 60 minutes.  The chair is ijw. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:02:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:02:14 <openstack> The meeting name has been set to 'pci_passthrough'
13:02:26 <ijw> Hope that name is right...
13:02:45 <ijw> Who's about?  I know Robert's skiving and I imagine yongli is too
13:03:08 <irenab> hi
13:03:53 <ijw> Well then, this will be a short meeting
13:04:08 <ijw> #topic Provider networks
13:04:49 <ijw> So the issue we've been discussing is that provider networks in particular are usually on separate trunk networks to internal networks.  I think that in ML2 the issue might potentially be more general than that.
13:05:11 <ijw> So you have to pick a card attached to the right trunk in order to be able to connect to the network.
13:05:35 <irenab> ijw:  yes. What do you mean by trunk?
13:05:50 <ijw> Well, in the VLAN case it's the physical network you're attached to, without VLANs
13:06:42 <ijw> So you could have two provider networks that are completely isolated from each other, and a number of Neutron networks set up on those networks with different segmentation IDs
13:06:47 <irenab> ijw: so in neutron language it's a phy network label
13:06:51 <ijw> Yup
13:07:05 <ijw> In ML2 I think it's a segment (because it can use separate segments internally)
13:07:22 <ijw> But I shall defer to Bob on that because I am not a master of ML2
13:07:37 <irenab> ijw: fine, it's exactly my case. So we just need a way to group PCI devices on the compute node according to labels
13:07:57 <ijw> If I wanted to connect to a Neutron network on one of those provider networks I would have to choose a port with the right physical connection or I just can't do it
13:08:19 <ijw> And since that's a problem of scheduling then Neutron can't solve it by itself - you have to choose the right device
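    As a rough illustration of the setup being discussed (a sketch only: the physnet labels and credentials are made up, though the provider extension attributes are standard Neutron API), two Neutron networks that share a segmentation ID but sit on different physical networks are completely isolated, so a PCI device cabled to one trunk can never reach the other:

        # Sketch: two provider networks on physically separate trunks.
        # 'physnet1'/'physnet2' are hypothetical labels from the operator's
        # VLAN configuration; the credentials are placeholders.
        from neutronclient.v2_0 import client

        neutron = client.Client(username='admin', password='secret',
                                tenant_name='admin',
                                auth_url='http://controller:5000/v2.0')

        for physnet in ('physnet1', 'physnet2'):
            neutron.create_network({'network': {
                'name': 'provider-%s-vlan100' % physnet,
                'provider:network_type': 'vlan',
                'provider:physical_network': physnet,
                'provider:segmentation_id': 100,
            }})
        # A NIC wired to physnet1 can only ever carry the first network,
        # whatever segmentation IDs the two networks use.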
13:08:34 <irenab> I think rkukura is here
13:09:09 <ijw> What I propose is that - initially - we just pick a device from the flavor, and if it can't be attached to the network because it's physically the wrong connection we abort the machine start.  That would practically mean you would want to set up two flavors and choose the right flavor for the network.
13:09:37 <ijw> That's obviously crap as a solution, but it means that we can implement the Neutron part and solve the other problem independently
13:09:52 <ijw> The other half of the problem is then making sure Nova picks the right device based on the Neutron network.
13:10:02 <irenab> ijw: I am OK with this approach for a start
13:10:05 <rkukura> I missed the proposal here - sorry I'm late
13:10:11 <ijw> Hey bob
13:10:14 <rkukura> hi
13:10:17 <irenab> hi
13:10:31 <irenab> ijw: want to recap?
13:10:51 <ijw> We were just discussing how you make PCI work when you need to choose a device that's connected to the right bit of network to suit the Neutron network you want to attach to
13:11:13 <ijw> Step one: ignore the problem and hope someone else makes it go away.  If you get an unsuitable port you refuse to start the VM
13:11:17 <irenab> assuming ML2 and segments,..
13:11:43 <ijw> Or anything else and provider networks - it's a problem with the OVS plugin too, in fact (not that I see us implementing PCI passthrough in the OVS plugin)
13:11:45 <irenab> ijw: not sure how you know if the port is unsuitable
13:12:09 <heyongli> ijw, i missed a little - if the device is labelled by its phy connection, why can we not attach it to the VM?
13:12:20 <ijw> Well, provider-network-wise you can find the provider network that the Neutron network is on.  ML2-wise, dunno
13:12:49 <ijw> heyongli: if I pick Neutron network X and some network device flavor, there's no guarantee I can connect the device I get to X
13:13:04 <ijw> As a user, I don't know (in theory) which devices are suitable - even if I have multiple flavors
13:13:17 <irenab> ijw: if we label the device flavor with provider network label, it should be solved?
13:13:22 <ijw> The best solution would be where I don't have to know and I just go grab a 10Gbit flavor and leave Neutron to solve the problem.
13:13:51 <ijw> irenab: that works better, but you can still come up with a Neutron network and a flavor in a single --nic argument that won't work together
13:14:16 <irenab> ijw: why?
13:14:23 <rkukura> ijw: We eventually need some way for nova scheduling to take neutron connectivity into account
13:14:27 <heyongli> ijw, for example, if a device connects to phy network X, it's labelled X, and in the flavor we request phy-network: X - why can this not resolve the problem?
13:14:38 <ijw> rkukura: I have a cunning plan, we'll get there in a mo ;)
13:14:52 <ijw> I think the best solution is that (regardless of connection) you put all your 10G NICs in a single flavor and Neutron and Nova work out which specific device from that flavor to use
13:15:30 <ijw> heyongli: that works, but you have to know that if you choose a neutron network 'thing' on phy network X - which you can't see as a user, I think, anyway - then you need to choose a flavor tagged with phy-network X to make it work
13:15:48 <irenab> ijw: I think it should be this way, but to be realistic and have something working for the current release, we can have a separate flavor for each net connectivity. I think it should work
13:16:01 <ijw> Better is if you have a set of devices connected to multiple networks (X and Y say) all in the one flavor, and the scheduler finds you a suitable device without having to do anything special
13:16:16 <ijw> irenab: yep, that's plan A, basically, what we do to start with
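    A minimal sketch of what plan A amounts to at attach time, assuming the device's extra-info carries a physical_network attribute (the attribute and exception names here are hypothetical): if the device and the Neutron network disagree about the physical network, the boot is simply aborted.

        class UnsuitablePCIDevice(Exception):
            """Raised when the chosen device cannot reach the requested network."""

        def check_device_matches_network(device_extra_info, network):
            dev_physnet = device_extra_info.get('physical_network')
            net_physnet = network.get('provider:physical_network')
            if dev_physnet != net_physnet:
                # Plan A: no rescheduling, no second-guessing - abort the boot
                # and leave it to the admin/user to pick a consistent flavor.
                raise UnsuitablePCIDevice(
                    'device on %s cannot reach a network on %s'
                    % (dev_physnet, net_physnet))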
13:16:53 <ijw> Plan B would involve changing Neutron so that, at the point in nova-api where Nova checks all the networks exist, you also say 'and I need a device labelled as on segment X'
13:17:08 <ijw> Neutron tells nova-api that and nova-api passes the information on to the scheduler.
13:17:30 <ijw> Plus point: without the user knowing anything special about the setup of the cloud, they will get a PCI device attached to the network
13:17:50 <irenab> ijw: so for plan B, you suggest that neutron manages PCI device network connectivity awareness?
13:17:56 <ijw> Minus point: if I have a flavor with 10 devices left in it but they're all attached to network Y, I may be unable to schedule the machine
13:18:21 <heyongli> ijw, can you give an example of your plan B? that's interesting
13:18:32 <ijw> irenab: I think Neutron needs to tell Nova which segment (provider network, whatever) it needs and Nova has to check the extra-info on the devices to work out which one suits
13:19:09 <irenab> ijw: understood
13:19:14 <ijw> So basically Nova says 'does this network exist?' and neutron now replies 'it exists and it's on segment X'.  Nova tells the scheduler 'I want a PCI device from flavor '10G' and on segment 'X'
13:19:17 <rkukura> port binding is where ml2 figures out which segment is going to be used
13:19:39 <ijw> rkukura: you can see from above where that wouldn't work here, we'd need the information a bit earlier
13:19:55 <rkukura> ijw: I think you are using "segment" slightly differently than in ml2
13:20:01 <ijw> And segment might be the wrong term, to be honest - this is the underlying trunk
13:20:03 <ijw> indeed
13:20:16 <rkukura> trunk == physical_network
13:20:25 <ijw> OK, physical_network it is
13:20:31 <heyongli> ijw, you mean we label the device with the network, but the flavor does not need to be set up for that.
13:20:43 <ijw> heyongli: that's what I think we want, yes
13:20:48 <rkukura> A virtual network (tenant or provider) can have multiple segments, on different physical_networks, but somehow bridged
13:20:53 <heyongli> so when do we tell the nova scheduler, and how?
13:21:25 <ijw> heyongli: I think to be properly generic Neutron should be supplying its own list of scheduling requirements as {attr: value}
13:22:09 <heyongli> ijw, where to inject this information?
13:22:18 <ijw> And (and I think from the way he's spoken about it baoli doesn't like this much) we will be setting aside specific named extra-info attributes for special purposes, so we would reserve 'physical_network' say and use that consistently for this purpose in both the extra-info and in Neutron
13:22:18 <heyongli> to nova scheduler,
13:22:41 <ijw> nova-api would pass it on to the scheduler, and the scheduler would find matching information in extra-info, done like this
13:23:11 <ijw> We'd use a consistent attribute name in both Neutron and in the configuration we added to nova-compute (the pci_information)
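    A minimal sketch of plan B as described above (the function and field names are hypothetical, not an agreed API): nova-api asks Neutron which physical_network the requested network lives on, folds that into the PCI request, and the scheduler matches it against each device's extra-info.

        def build_pci_request(neutron, net_id, pci_flavor):
            # nova-api already calls Neutron here to check the network exists;
            # we additionally read its physical_network.
            net = neutron.show_network(net_id)['network']
            return {'flavor': pci_flavor,
                    # 'physical_network' is the attribute name reserved
                    # consistently in Neutron and in pci_information.
                    'physical_network': net.get('provider:physical_network')}

        def device_satisfies(device_extra_info, pci_request):
            # Scheduler-side check against the device's extra-info.
            wanted = pci_request.get('physical_network')
            return (wanted is None or
                    device_extra_info.get('physical_network') == wanted)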
13:23:58 <irenab> ijw: the only problem I have is that we put networking info into nova configuration, but I think it's not a tragedy
13:24:01 <ijw> In any case, plan B is up for discussion but plan A, I think, is what we should do for now - if Neutron can't attach the device it just kills the machine start
13:24:21 <ijw> irenab: yeah - we discussed that before and it's annoying but it's not the end of the world
13:24:45 <rkukura> ijw: Is "if Neutron can't attach the device it just kills the machine start" enforced in ml2's port binding?
13:24:55 <ijw> I think that's best fixed by rethinking the way the pci_information is gathered at some point in the future.  If you could pull it from several files or sources this would work better
13:25:03 <ijw> rkukura: yup
13:25:09 <heyongli> ijw, this information should be injected into the VM's instance type before the scheduler does its work; i'm wondering whether nova should do something at the API stage, it's unclear to me ...
13:25:17 <irenab> ijw: for plan A, I do not like that it may end up without resources, so probably should use a PCI flavor per phy_net
13:25:40 <ijw> heyongli: yes, that's exactly what I'm trying to describe - there's a call to neutron in the API code and we can just get the information and push it on to the instance just there
13:26:22 <heyongli> ijw: thanks, understood.
13:26:32 <ijw> irenab: We're agreeing, I think - we're saying that the administrator has to configure things consistently and, if the admin or the user does something that's not consistent we have a fallback plan (don't boot the VM)
13:27:00 <heyongli> ijw: this sounds like a configuration problem
13:27:02 <irenab> ijw: fine with me
13:27:21 <ijw> heyongli: for plan A, it is, absolutely - we need everything to be set up just right or things won't work.
13:27:33 <irenab> since we have Bob here, can we discuss the --nic vnic_type?
13:27:35 <rkukura> Even without PCI passthru, as soon as we have heterogeneous connectivity, we really need to make nova scheduling take account of connnectivity - so  I do not think this is a PCI-specific issue, and the solution shouldn't be PCI-specific
13:27:36 <ijw> For plan B then we basically solve the configuration problem for people by doing better scheduling
13:28:16 <ijw> rkukura: indeed, and that should work - it's the same call in the same place and I think we need to work out how we can get Neutron to return information to be passed to the scheduler that isn't too Neutron-specific
13:28:24 <heyongli> rkukura: sounds great.
13:28:40 <ijw> As in a requirement 'I want a machine that looks like this' and not 'I want a neutron-specific thing from nova'
13:28:56 <irenab> sounds like a task for the next summit, if we don't make it before
13:29:05 <ijw> rkukura: how far down the line is heterogeneous network connectivity?
13:29:14 <ijw> Not soon, I'm thinking
13:29:30 <rkukura> ijw: It's been a long time since I looked at the nova scheduler in detail - can it be made to call into neutron to test whether connectivity is possible, or to prune a list of candidates or anything like that?
13:29:40 <ijw> And in fact if someone doesn't configure a provider network on one compute node presumably we get the same failure right now...
13:30:07 <ijw> It prunes candidates and it's really best if it's operating on static-ish information, so sources of information about the compute nodes and a requirement from the API
13:30:46 <ijw> I think callouts would be wrong - it tries to build and use a data model and the compute nodes generally try and keep that model current.  In your case I think the neutron agents would be supplying more information to its model.
13:30:54 <irenab> ijw: there is probably an assumption of a homogeneous environment, but with ML2 it is changing ..
13:31:04 <ijw> And it's also allowed to be wrong - in the case the information is slightly out of date it reschedules
13:31:05 <rkukura> ijw: OK, so static-ish but not static?
13:31:34 <ijw> Yeah - periodic updates are favoured.  You don't call to discover the info, you have it to hand and the backend keeps it current
13:32:06 <ijw> I'm not in the current scheduler rewrite discussions but I don't think anyone was planning on changing that aspect of it
13:32:30 <rkukura> the assumption has been that connectivity is homogeneous, but we've basically been saying that right now, if you need heterogeneous connectivity, you are on your own to configure some nova feature such as flavors/cells/... to match
13:32:47 <ijw> Also you can add plugins (filters, really) so adding one that reads more information and excludes machines is exactly the right model
13:33:28 <ijw> rkukura: yes, so that model and this one are exactly the same - the 'make the flavors right' plan A here is exactly what you're recommending right now, too
13:33:33 <rkukura> ijw: Filters calling into neutron was what I was originally thinking might work
13:34:12 <ijw> rkukura: I think 'neutron updates a table of machine information in all the schedulers with a cast, then a filter acts on that information' is the model you'll end up with
13:34:20 <irenab> ijw: I think a filter calling into neutron will take more effort to implement
13:34:23 <ijw> 'table' != DB table here
13:34:49 <ijw> It's not really the right model anyway - there can be multiple schedulers, which is why the information broadcast is favoured
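    A rough sketch of the filter model being described, against nova's BaseHostFilter interface (the pci_device_pools attribute and the request-spec key are assumptions for illustration): the filter works only on information the scheduler already holds, refreshed periodically by the compute nodes, and never calls out to Neutron.

        from nova.scheduler import filters

        class PhysicalNetworkFilter(filters.BaseHostFilter):
            """Exclude hosts with no free PCI device on the wanted physical network."""

            def host_passes(self, host_state, filter_properties):
                wanted = filter_properties.get('request_spec', {}) \
                                          .get('pci_physical_network')
                if not wanted:
                    return True
                # host_state is built from periodic compute-node updates; the
                # filter only prunes candidates, it does not call Neutron.
                for pool in getattr(host_state, 'pci_device_pools', []):
                    if (pool.get('physical_network') == wanted
                            and pool.get('count', 0)):
                        return True
                return False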
13:35:12 <ijw> So, are we good with the A/B approach?
13:35:18 <heyongli> ijw: for plan A, we label the device with its phy connectivity and put it into a flavor, and for the neutron network configuration, use irenab's rename approach?
13:35:25 <rkukura> ijw: I think we could get the info to the filters, but what about having the scheduler reserve an actual resource?
13:35:43 <heyongli> ijw: i'm ok with it.
13:35:47 <irenab> ijw: what you suggest is that the neutron agent will be required to gather local machine PCI device connectivity info, propagate it to the neutron server, and the server will update the scheduler?
13:35:51 <ijw> rkukura: it already has stuff for that - you have a resource type and counts, it prereserves till the next update
13:36:33 <ijw> irenab: no, for Bob's case where he's talking about provider network bridges I think the neutron agent would be reporting - for our case I think we would report it from nova-compute at least for now
13:37:10 <irenab> ijw: I meant for plan B
13:37:15 <rkukura> ijw: Right now, neutron L2 agents report their connectivity to neutron-server via the agents_db RPC
13:37:19 <ijw> For plan B too, I would just use an attribute
13:37:25 <rkukura> Any ml2 port binding uses this info
13:37:41 <ijw> rkukura: ok - then maybe neutron-server should cast it
13:37:58 <ijw> Or maybe you should just call out.  Either works.
13:38:14 <irenab> ijw: I just think that not every vendor solution will require a neutron agent
13:38:36 <irenab> I am not sure, but think that baoli's case can manage without the agent
13:38:37 <rkukura> Maybe the RPC should be replaced with a notification that both the neutron-server and nova-schedulers could subscribe to
13:38:37 <ijw> irenab: yeah - this is more Bob's problem we're solving.  PCI passthrough, we'll use the nova-compute config
13:38:58 <rkukura> irenab: good point about not always having an L2 agent
13:39:01 <irenab> rkukura: cool, I like it
13:39:45 <rkukura> Has the creation of PCI-passthru-capable neutron networks been addressed?
13:40:00 <ijw> So - plan A: 'we get the right device for the network we're trying to attach or we just abort'; plan B: 'we add code so that Neutron can add stuff to the scheduling request'.  A now, B later - possibly in Juno
13:40:18 <ijw> rkukura: as in tagging them as capable?
13:40:39 <irenab> ijw: agree. We need good examples of how to set up the nova and neutron conf
13:40:40 <ijw> At the moment we're assuming they're all capable, practically speaking.
13:41:06 <irenab> rkukura: we are talking about PCI pass-through capable ports
13:41:07 <rkukura> As in a normal tenant specifying that they need (and are willing to pay for) this capability without having to know anything about the provider's physical topology
13:41:23 <ijw> I can see how you might want to mark them so that Neutron provisions them accordingly, but we haven't gone there at present
13:41:41 <ijw> rkukura: seems like a thing we could implement independently
13:41:57 * ijw really needs to write that capability stuff.
13:41:58 <rkukura> Right, so neutron picks a physical_network that is attached to PCI-passthru-capable NICs
13:42:29 <ijw> rkukura: not discussed.  Want to enter a BP?
13:42:41 <heyongli> ijw: in plan A, if we schedule the VM to a host it should match the requirements, so it may not always fail, except when a race condition happens.
13:42:57 <irenab> rkukura: not following your case, can you elaborate please?
13:42:58 <rkukura> Initially, it might make sense to require these to be provider networks, where the administrator knows which physical_network to pick
13:43:07 <ijw> heyongli: indeed - you pick a network on provider network X and a flavor on Y and it will go wrong, and it will all be your fault
13:43:36 <irenab> rkukura: makes sense
13:43:44 <ijw> OK, I don't see any disagreement here, just clarifications and refinements at the moment
13:43:58 <ijw> So, since Bob's here
13:44:02 <ijw> #topic ML2 plugin
13:44:03 <heyongli> ijw: let's do A, and B is more user friendly.
13:44:25 <irenab> rkukura: I registered a blueprint for vnic_type request
13:44:45 <rkukura> If PCI-passthru NICs are a very limited resource, it might make most sense to connect all of them to one physical network, and only allocate that physical_network to virtual networks that will need passthru
13:44:57 <ijw> irenab: link?
13:45:09 <heyongli> irenab: cool
13:45:16 <irenab> ijw: https://blueprints.launchpad.net/neutron/+spec/ml2-request-vnic-type
13:45:24 <ijw> rkukura: totally true and also reasonable; I think we'll do as you say, we'll just implement it later
13:45:31 <ijw> #link https://blueprints.launchpad.net/neutron/+spec/ml2-request-vnic-type
13:45:43 <ijw> #link https://blueprints.launchpad.net/neutron/+spec/ml2-request-vnic-type Irena's blueprint
13:45:47 <rkukura> The terms "plan A" and "plan B" imply either/or - I think you want something more like "phase 1" and "phase 2"
13:45:51 <ijw> Screw you, meetingbot
13:46:08 <ijw> Fair.  The mailing list email is 1/2
13:46:32 <irenab> for ML2 mech drivers, we're going with cisco and mellanox as separate Mech. Drivers
13:46:52 <rkukura> So who sets vnic_type on a port?
13:47:05 <ijw> It's set via the nova boot api
13:47:08 <ijw> (and therefore cli)
13:47:23 <ijw> --nic net-id=xxx,pci-flavor=yyy,vnic-type=macvtap
13:47:31 <irenab> on the nova boot command as part of the --nic option (which we should extend) or on neutron port-create
13:47:58 <irenab> vnic-type can be macvtap/pci direct/virtio
13:48:04 <irenab> name can be more logical
13:48:13 <irenab> Fast/Slow/...
13:48:16 <rkukura> So does nova interpret this --nic and pass the vnic_type to the port_create or port_update along with host_id?
13:48:26 <irenab> rkukura: yes
13:48:31 <irenab> will do so
13:48:32 <ijw> I would just go with macvtap (needs a PCI flavor), pci passthrough (ditto), and absent for a vnic, right now
13:48:55 <irenab> ijw: default = vnic
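    A sketch of the flow rkukura asks about above, assuming the binding:vnic_type attribute proposed in the ml2-request-vnic-type blueprint (the exact names are not settled): nova interprets the --nic option and passes the vnic type to Neutron when it creates or updates the port, alongside the host_id it already sends for binding.

        def create_port_for_instance(neutron, net_id, host, vnic_type='virtio'):
            body = {'port': {
                'network_id': net_id,
                'binding:host_id': host,         # already passed today for port binding
                'binding:vnic_type': vnic_type,  # proposed: 'virtio' (default),
                                                 # 'macvtap' or 'direct'
            }}
            return neutron.create_port(body)['port']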
13:49:29 <irenab> rkukura: can you please take a look and approve this blueprint?
13:49:47 <rkukura> Is it considered desirable for the user to pick between these vnic_types explicitly?
13:50:08 <rkukura> vs. just saying they need low-latency high-bandwidth option
13:50:52 <ijw> yes - firstly the device drivers change dramatically for passthrough, and secondly you can't migrate with macvtap (which some people may care about)
13:50:54 <irenab> on the nova boot api it makes sense to go with the latter, but I think for neutron it can be explicit
13:51:16 <rkukura> Might also not get security groups, right?
13:51:45 <irenab> rkukura: yes, with PCI passthrough it's not that easy
13:51:52 <rkukura> But I get the point that the VM has to be prepared for the vnic_type chosen, so it needs to be explicit
13:52:18 <irenab> for pci passthrough VM image must have vendor drivers
13:52:54 <rkukura> OK, so having nova interpret --nic option and set binding:vnic_type seems reasonable
13:53:04 <ijw> rkukura: I'm sure secgroups and antispoof will be a nightmare at some point, yes...
13:53:18 <irenab> sorry, I have to leave now. I'll try to chat in few hours on IRC
13:53:26 <irenab> bye
13:54:23 <rkukura> OK, so ml2 MechanismDrivers can then look at the binding:vnic_type, but what would they do with it?
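    One speculative answer to rkukura's question, sketched against the ml2 MechanismDriver port-binding interface (the vif type and the exact handling of binding:vnic_type are assumptions, not anything agreed in the meeting): a vendor driver binds the port only when it supports the requested vnic type, otherwise it declines and another mechanism driver gets a chance.

        from neutron.plugins.ml2 import driver_api as api

        class PassthroughMechanismSketch(api.MechanismDriver):
            SUPPORTED_VNIC_TYPES = ('direct', 'macvtap')

            def initialize(self):
                pass

            def bind_port(self, context):
                vnic_type = context.current.get('binding:vnic_type', 'virtio')
                if vnic_type not in self.SUPPORTED_VNIC_TYPES:
                    return  # decline; some other mechanism driver may bind it
                for segment in context.network.network_segments:
                    if segment[api.NETWORK_TYPE] == 'vlan':
                        # 'hostdev' vif type used here purely for illustration
                        context.set_binding(segment[api.ID], 'hostdev',
                                            {'vlan': segment[api.SEGMENTATION_ID]})
                        return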
13:56:20 <heyongli> so for the nova part we've got some agreement, so i'm going to update my bp, and i hope ijw can review it.
13:56:29 <heyongli> seems time is up soon.
13:56:39 <ijw> Please, yes - mail me at my Cisco address if you want me to spot it ;)
13:56:43 <ijw> One other thing
13:56:53 <ijw> #topic Xen
13:57:05 <ijw> Apparently the Xen guys (I've just been talking on another chat) have been implementing the PCI passthrough as it stands
13:57:36 <BobBall> *cough* Sorry 'bout that! Not quite ready for external review - but we have patches up and are doing internal reviews with a few guys :)
13:57:36 <ijw> So, what I suggest is we try and get heyongli's compute changes agreed and coded sharpish (obviously we're going to miss I-2 but we get quiet time afterward) so that we can show them and get them to change their implementation in step
13:58:03 <ijw> Bad Bob.
13:58:07 <BobBall> we were aiming for I-2 but figured it'll slip a little so now we're aiming for early I-3
13:58:24 <BobBall> yeah, we're very happy to update the impl as needed
13:58:51 <ijw> BobBall: the changes are that we don't (just) group by device and vendor any more, we can provide extra parameters in the config and group by them as well (per an externally defined list of 'significant' attributes)
13:59:03 <ijw> The discovery ought to be pretty similar, I think
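    A hedged sketch of the grouping change just described (the attribute names are illustrative): devices are pooled not just by vendor/product but by whatever extra attributes the compute node's pci_information marks as significant, so identical NICs on different trunks end up in different pools.

        from collections import defaultdict

        SIGNIFICANT_ATTRS = ('vendor_id', 'product_id', 'physical_network')

        def group_devices(devices):
            pools = defaultdict(list)
            for dev in devices:
                key = tuple(dev.get(attr) for attr in SIGNIFICANT_ATTRS)
                pools[key].append(dev)
            return pools

        devices = [
            {'vendor_id': '8086', 'product_id': '10fb',
             'address': '0000:06:00.1', 'physical_network': 'physnet1'},
            {'vendor_id': '8086', 'product_id': '10fb',
             'address': '0000:06:00.2', 'physical_network': 'physnet2'},
        ]
        print(len(group_devices(devices)))   # -> 2 pools, same vendor/product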
13:59:06 <ijw> heyongli is your guy
13:59:25 <BobBall> Ah, great.
13:59:27 <ijw> Top of the hour.  Any more for any more?
13:59:41 <ijw> #action heyongli to update BP, ijw to review
13:59:53 <ijw> #action Annoy BobBall with superfluous changes to Xen
13:59:53 <heyongli> bob, talk to me any time.
14:00:03 <ijw> #endmeeting