13:00:01 #startmeeting PCI Passthrough
13:00:01 Meeting started Wed Jan 15 13:00:01 2014 UTC and is due to finish in 60 minutes. The chair is baoli. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:00:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:00:04 The meeting name has been set to 'pci_passthrough'
13:00:20 Hi
13:00:29 o/
13:00:31 hi
13:00:46 hi
13:01:12 https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU is the very late proposal from yesterday, plus a list of use cases at the bottom based on what we were talking about. I've not tried to justify the proposal against the use cases in there.
13:01:53 Let's continue with yesterday's use case, then.
13:02:54 a couple of minutes to look at your proposal
13:03:38 It's supposed to be what I was describing yesterday, though it wouldn't kill me to add a diagram...
13:03:54 I think the questions to answer are:
13:04:05 - do we have more use cases than I've described (or are there some that should go away)
13:04:14 - is the proposal going to satisfy them
13:04:30 - is there something else we could do that would be simpler / do better?
13:05:23 ijw, not much time to digest it. But I see it's very complicated.
13:05:54 First of all, about the attributes.
13:06:05 may I raise a concern not related to the document?
13:06:29 Go ahead
13:07:02 We have about a two-month window to push something into Icehouse and get it accepted. What are we going to do?
13:07:34 Do we want a full list of use cases with all the APIs, or can we define basic cases and have an e2e flow?
13:08:12 My feeling is we discuss a lot, but progress is very slow...
13:08:15 The proposal on the table basically uses code we almost have ready - heyongli has a patch out there that's about 90% of the backend work. This can be implemented in stages, so if we can't write a flavor API in that time we would write a config item to do the same job temporarily. The question is whether we're going to accept it or argue about it again
13:08:48 ijw, agree.
13:08:58 irena, I agree with you
13:09:00 baoli has the code for the --nic changes we need, you're on the NIC backend, the whole thing is pretty close if we'd just sit down and write it rather than debating
13:09:34 irenab: on the nova side, we have to wait for nova core to approve BPs and code, so it's important to get john's ack.
13:09:37 ijw, what are the use cases that can't be taken care of by PCI group but can be by PCI flavor?
13:10:31 so there is also baoli's code that requires blueprint approval
13:10:53 baoli: or, vice versa, any use case that can't be taken care of by PCI flavor but can be by PCI group?
13:10:56 and neutron work that I need to translate into blueprints
13:11:12 Can you answer my question first?
13:11:26 You can't easily do the online modifications - some of it is making things easier - 6-a-2 is a bugger in PCI group - some is because administrators might well have reasons to change the content of a flavor, as in 6-b, 6-c
13:11:47 irenab, I remember john said he'd like to review and approve once we've done the work split
13:11:53 PCI group is mainly much simpler to implement. *Much* simpler. But a lot of flexibility goes away with it
13:12:08 How do you justify the flexibility?
13:12:16 How do you define attributes?
13:12:23 People asked for it. John and Alan K from Ericsson
13:12:33 Attribute definition is in there.
13:13:07 In the backend arbitrary attributes are allowed for extra_info, there's no restriction. In the scheduler there's a list of available attrs.
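A minimal sketch of the matching model being described here, with invented names and data (this is not the actual Nova code): devices carry arbitrary extra_info attributes set on the compute node, and a flavor is a match expression evaluated against the standard device properties plus those attributes.

```python
# Hypothetical illustration only. The fields and helper are invented for
# the sketch; the idea -- match expressions over standard properties plus
# arbitrary extra_info attributes -- comes from the discussion above.

def device_matches(device, match_expr):
    """True if every key in match_expr equals the device's value for it,
    checking standard properties first and extra_info second."""
    for key, wanted in match_expr.items():
        actual = device.get(key, device.get('extra_info', {}).get(key))
        if actual != wanted:
            return False
    return True

device = {
    'address': '0000:08:00.1',
    'vendor_id': '8086',
    'product_id': '10ed',
    'extra_info': {'pci_group': 'fastnic', 'provider_network': 'outside'},
}

# A flavor can match on vendor/product ID...
assert device_matches(device, {'vendor_id': '8086', 'product_id': '10ed'})
# ...or on a backend-assigned attribute such as 'pci_group', which is how
# the groups proposal falls out as a special case of this design.
assert device_matches(device, {'pci_group': 'fastnic'})
```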
13:13:18 ijw, I don't see any difference between the PCI flavor and the PCI group
13:13:22 if you do that
13:13:23 heyongli: that's good; we also need the neutron blueprints approved by Mark
13:13:39 baoli: if you do what?
13:13:50 If you define the extra attributes
13:14:17 Do we agree that PCI address and host can't be used in the PCI flavor?
13:14:35 baoli: agree
13:14:38 The distinction is there's a mapping from combinations of attributes (which, absolutely, you can select to be 'pci_group' and you get the groups proposal, that's intentional) to flavor. That extra level of mapping gives you the requested API flexibility.
13:15:01 baoli: I'm happy to accept that
13:15:02 I think we've talked about the fact that you can add "meta-data" to a PCI group down the road if needed
13:15:02 baoli, a PCI flavor can contain your group as an extra attr, but it doesn't push the group into being the dominant factor of the design
13:15:13 the meta-data is the attribute
13:15:38 baoli: The difference between this and group is tiny in that respect but still significant.
13:16:10 Are you saying that a PCI group is a special attribute?
13:16:26 In group, you put all devices into a group, presumably on the backend, and add attributes to the group. In this, you put attributes on the device and then group by the attribute (which conveniently can reuse the matching code we already have)
13:16:55 That gets you two things. Firstly, if you're of a mind like yjiang5 or heyongli, you can still define your flavors in terms of device and vendor ID and it will work
13:17:14 Second, you can put attributes on individual devices which don't apply to every device in the group, which we need
13:18:01 ijw: I think group is a more compute-specific definition (only the name is global) and flavor is global, correct?
13:18:06 No extra_info attributes are special, I'm saying, but if you want to call one pci_group and use it to group your devices that is entirely your choice, and you get pretty much the grouping behaviour you're looking for (while not denying other people the behaviour they'd like)
13:18:14 ijw, do you still need to add the attributes in the PCI whitelist that is defined on the compute nodes?
13:18:20 yup
13:18:29 in that case, how could it be flexible?
13:18:45 baoli: yup. irenab: I think group was always driven from the compute node, it doesn't really work well done at the frontend
13:19:05 you can modify your flavor whatever way you want, but do you have to modify your compute node?
13:19:19 I see complexity, but not flexibility
13:19:29 It's inflexible in that you're configuring values there (though if you really want to change them you change them and restart, and as I said yesterday those values are largely dependent on the physical configuration and situation of the server, so config's a good place for them)
13:19:48 And what benefits do we get from it?
13:19:50 It's flexible in that you get to interpret those values in light of how you define the flavors, which means that you can change your mind with an API call later
13:20:22 Can you give an example of that?
13:20:33 You are defining the wiring of the host at the backend, where it belongs, and the features offered to the user at the frontend. I would say that's why it works well
13:20:48 give an example
13:20:56 not something in general
13:21:06 So if I want to say 'this PCI device is attached to switch X port 4' then I would do that on the compute node, because that's very situational and nothing will apply globally to all compute nodes.
13:21:42 If I want to say 'Cisco and Intel NICs are sold as the flavor 'fastnic'' then I do that at the frontend, because that's a user-facing decision and one I might want to re-evaluate in the future
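A rough sketch of the split ijw describes, using invented structures (the real config format is discussed later under 'pci_information'): situational facts live in per-host config, while user-facing flavors are defined once centrally and can be changed by API without touching any compute node.

```python
# Sketch only; every structure and value here is hypothetical.

# On one compute node: this particular device happens to be wired to
# switch X port 4 -- a fact that applies to this host alone.
backend_tags = [
    ({'address': '0000:01:01.0'}, {'switch': 'X', 'port': '4'}),
]

# On the controller: 'fastnic' is any Cisco (vendor 1137) or Intel
# (vendor 8086) NIC -- a user-facing decision that an admin might
# re-evaluate later via the flavor API.
flavors = {
    'fastnic': [{'vendor_id': '1137'}, {'vendor_id': '8086'}],
}
```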
13:22:23 ijw: what if you need to specify the compute node network interface, like neutron (eth2), not the phy switch?
13:23:05 If I'm running a small cloud and the flavor functionality's too much bother then I can define a flavor as 'flavor X is all devices in group X' and label up my devices at the backend. If I don't like grouping at all, I can say 'flavor X is all Intel NICs (by device ID)' and that works too. And your specific case of 'all SRIOV NICs should be used for passthrough' works as well, bonus
13:23:24 group?
13:23:29 irenab: if we want to do that we're going to have to add the network interface as a device property.
13:23:51 You can't do it now, and that's basically why. That information's pretty easy to discover though, so that wouldn't be a huge change.
13:23:59 group X
13:24:02 group X ?
13:24:03 ijw: not sure, this is the network interface of the PF
13:24:30 ijw, you can do it with PCI group as well
13:24:35 irenab: I think it's the same - either we autodiscover it and it's a device property or we have to manually find it and it's extra_info
13:24:54 baoli: yes, indeed - my point is that this does precisely what groups do and a number of other things besides
13:25:08 a number of other things besides?
13:25:09 'group' here would be an extra_info property that I just happened to call 'group'
13:25:12 ijw: you need it to choose the device that is a child of this PF
13:25:22 irenab: we'll have to add that to the pci_device then.
13:25:30 Different problem but not too hard to solve
13:25:32 irenab: for an SR-IOV device, we record the corresponding PF even in the current code already
13:25:37 Ah, ok
13:25:50 irenab: as a PCI device property.
13:26:17 irenab: define PF as an attr on the flavor in ijw's proposal and you get what you want
13:26:21 yjiang51: it's fine, but I think we need some correlation with it in the PCI flavor
13:26:22 So you are saying that we need a special attribute, something called, say, sriov_group?
13:26:26 baoli: As I said before, and as the document says, groups can offer you backend configuration (if you set them up in config) or frontend flexibility (if you have an API to set them up), but they can't offer you both at the same time.
13:26:45 irenab: how'd you mean?
13:26:49 heyongli: cannot, since the PF can differ between compute nodes
13:27:36 irenab: then you can add an attr 'connectivity' and define it differently on every host, based on its PF
13:27:38 irenab: so you mean you want to select an SR-IOV device based on the attributes of the corresponding PF device?
13:27:39 ijw: I want to specify a flavor that suits the connectivity case
13:27:39 irenab: that's where you'd use the backend tagging, I think: match PF = whatever, add an attr (not quite sure what you have in mind there though)
13:28:15 match PF = whatever, add provider_network='outside' or something
13:28:29 guys, are we trying to define something so complicated when we have a simple solution that solves all the practical cases we have right now?
13:28:30 ijw: +1
13:28:46 ijw: as an analogy, for the neutron agent you specify where the provider net goes outside (i.e. eth2)
13:28:52 baoli: this is not complicated for me to implement
13:28:52 baoli: because we don't. Whatever you think groups are, they are either defined in backend config or via a frontend API
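A minimal sketch of the 'match PF = whatever, add provider_network' backend tagging mentioned a few lines up, with invented helper and field names: every VF whose parent PF faces the external network gets a provider_network attribute that a flavor can later select on.

```python
# Hypothetical sketch; 'parent_pf' and tag_devices are illustrative, not
# the actual Nova schema. The technique -- attach attributes to devices
# matching a per-host condition -- is what the discussion describes.

def tag_devices(devices, match, extra_attrs):
    """Attach extra_attrs to every device whose fields match `match`."""
    for dev in devices:
        if all(dev.get(k) == v for k, v in match.items()):
            dev.setdefault('extra_info', {}).update(extra_attrs)

devices = [
    {'address': '0000:01:10.0', 'parent_pf': '0000:01:00.0'},
    {'address': '0000:01:10.1', 'parent_pf': '0000:01:00.0'},
    {'address': '0000:02:10.0', 'parent_pf': '0000:02:00.0'},
]

# On this host, PF 0000:01:00.0 happens to face the provider network.
tag_devices(devices, {'parent_pf': '0000:01:00.0'},
            {'provider_network': 'outside'})
```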
13:29:12 you need all the APIs to support attributes, wouldn't you?
13:29:15 baoli: and yes, I don't think this *is* complicated, it's designed to build largely on what we have
13:29:31 baoli: no, you wouldn't - there's no API there that takes a bare attribute
13:29:55 Flavors take a matching expression; PCI flavors in flavors and NICs don't use attributes at all, only flavor names
13:30:03 how do you report stats?
13:30:51 There's no use case for stats there - if you have one in mind we can add it and then I'll answer that question, because I can think of a number of stats that might be useful.
13:31:05 ijw: where do you define the (match PF = whatever, add provider_network='outside')? Flavor?
13:31:08 Also, I would question whether stats are a priority for Icehouse
13:31:25 ijw, how do you do scheduling?
13:31:28 irenab: I remember even in neutron, you have to specify the provider net through config, right? Then in nova, you have to specify the provider net by providing that attribute on the compute node, and then create a flavor for it.
13:32:01 irenab: I define that in the compute node config (because the compute node's physical connectivity determines the attribute value to use), then I make my flavor up by matching on e.provider_network
13:32:21 yjiang51: so we need to associate the list of devices on the compute node with this provider net
13:32:48 ijw: can you, maybe later, write down how the configs will look?
13:33:12 baoli: same as I described to you before - pci_stats buckets can be created by doing a SELECT COUNT(*) GROUP BY attr, attr, attr - giving you a limited number of buckets - then each PCI flavor corresponds to a number of buckets and you find a set of choices out of those buckets with availability that satisfies your demand
13:33:40 irenab: it's in there in the abstract, but there's no worked example - 'pci_information'
13:33:45 ok, so a compute node can use whatever attributes you have in mind to tag an entry?
13:33:49 yup
13:34:17 irenab: just wondering if nova has a mechanism to get that network provider automatically?
13:34:40 ijw: I feel I need to take it a little bit more down to earth (example)
13:34:44 as an implementation, stats can be calculated based on flavor properties
13:35:08 yjiang51: a provider network here is 'an external network attached to a compute node port', so it's not usually automatically discoverable - it's also a Neutron concept so Nova would find it hard to get hold of, I think
13:35:31 yjiang51: not as far as I know, nova gets it at the allocate_network stage
13:35:37 heyongli: indeed, which is pretty much the same as the 'group' that I was using in the worked examples above
13:35:55 irenab: the thing I don't like about that pci_information at the moment is it's ugly, but I think you'd end up with
13:36:16 ijw: yeah, anyway stats and scheduling are fine in this design
13:36:27 pci_information=[{pf => '0000:01:01.00'}, { provider_network => 'outside'}]
13:36:48 ijw: +2 for this
13:36:53 irenab: do you have a simple use case? Seems it's a good time to move on and see what an e2e solution will look like.
13:37:52 BrianB: the use case is very simple. Each compute node is connected to the provider network via a specific network interface (SRIOV NIC)
13:37:52 I think that PCI information thing really wants a DSL, but for the minute we'll stick with JSON or Python
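A guess at how the 'pci_information' compute-node config item could be rendered in JSON, following the rough shape ijw pastes above; the exact format is explicitly left open in the meeting, so the 'match'/'attrs' keys here are invented for illustration.

```python
# Hypothetical rendering of pci_information as JSON: each entry pairs a
# match expression (which devices on this host) with the attributes to
# attach to matching devices.
import json

pci_information = [
    {
        'match': {'pf': '0000:01:01.00'},
        'attrs': {'provider_network': 'outside'},
    },
]
print(json.dumps(pci_information, indent=2))
```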
13:38:41 ijw, we need to support live migration soon.
13:38:48 how do you take care of this?
13:39:30 baoli: live migration or migration?
13:39:36 live migration
13:39:50 in the case of SRIOV
13:40:15 baoli: we need a libvirt network then
13:40:23 yes
13:40:31 irenab: how is that provider network for each SRIOV NIC defined in neutron? In a config file or automatically? If in a config file, will it be OK to define the same in nova also, as ijw suggested? (i.e. for all devices with PF=xxx, define provider_network as xxx)
13:41:41 yjiang51: I think it will be sufficient
13:41:43 Also, does the compute node know about flavors?
13:42:01 irenab: great.
13:42:08 irenab: cool
13:42:45 yjiang51: but it should be considered by the scheduler too
13:43:28 baoli: live migration shouldn't be a problem from a scheduling perspective, but very very few devices support it, and none that I know of with KVM support
13:43:39 baoli: irenab's macvtap connection will work though.
13:43:50 irenab: sure, and ijw and I gave a rough description above
13:43:51 irenab: yes, if we have the attribute 'provider_network' defined among the attributes supported for pci_stats/pci_flavor.
13:44:12 yjiang51: provider nets are in a config file
13:44:39 baoli: compute nodes don't know about flavors, only the results of flavors, I *think*
13:45:01 ijw: I also think so
13:45:04 As in, by that point you have the allocations in terms of RAM, CPU and now PCI devices, so you don't need the connection back to the flavor
13:45:05 Again, we come back to the question of how PCI stats are reported by the compute nodes
13:45:58 baoli: I think the compute node reports the stats by the available PCI flavor properties
13:46:20 baoli: right now, they're reported individually by device. This allows you to report them grouped by the grouping attributes, but since we'd quite like something to work I was expecting to do what we do now - report all the PCI devices and group in the DB - and move to a different form later. Also, since the scheduler's being hacked to bits right now, there's not much point in prejudging the results of that
13:46:23 pci_flavor_attrs=attr,attr,attr on the control node
13:46:41 ijw, no. they are not individually reported
13:47:41 OK, so why do we have a pci_devices table?
13:47:47 How does it get filled?
13:47:55 irenab: so are you ok with ijw's suggestion?
13:48:12 ijw, they maintain status information
13:48:40 yjiang51: seems OK, need to go over the details to validate
13:48:50 ijw: not so far, but we can report by pci_flavor attrs, or even directly from the DB, which is good (we tried to push that in the current implementation at Havana)
13:49:24 do you report by flavor? or do you report by individual attribute?
13:49:45 You'll notice I didn't say, because it's not terribly important
13:49:55 ijw, it's important
13:50:22 From a use case perspective it isn't - it doesn't matter which you're doing because it doesn't affect behaviour
13:50:28 As long as you get it right...
13:50:35 You can't conceptually bring something up without considering its complexity and practicality
13:51:02 Actually, I considered it both ways
13:51:04 baoli: not by flavor; it can be based on pci_flavor_attrs, which is practical, or just from the DB
13:51:26 what do you mean by based on pci_flavor_attrs?
13:51:33 or just by DB?
13:51:53 So, if you group on the control node, then that's a lot of network traffic, but otherwise fine. If you group on the compute node, the only thing you need to consider is that you need the grouping attributes available, so the one visible change would be whether you need that list of grouping attrs on the compute node (presumably in its configuration)
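A minimal sketch of the stats grouping being debated, assuming invented device data: devices are bucketed by the configured grouping attributes (pci_flavor_attrs), each bucket carrying a count - the in-memory equivalent of the SELECT COUNT(*) ... GROUP BY attr, attr, attr that ijw mentions earlier. The same computation works whether it runs on the compute node or over the pci_devices table on the control node.

```python
# Hypothetical sketch; attribute names and device records are invented.
from collections import Counter

pci_flavor_attrs = ['vendor_id', 'product_id', 'provider_network']

def pci_stats(devices):
    """Bucket devices by the grouping attributes and count each bucket."""
    buckets = Counter()
    for dev in devices:
        key = tuple(dev.get(a, dev.get('extra_info', {}).get(a))
                    for a in pci_flavor_attrs)
        buckets[key] += 1
    return buckets

devices = [
    {'vendor_id': '8086', 'product_id': '10ed',
     'extra_info': {'provider_network': 'outside'}},
    {'vendor_id': '8086', 'product_id': '10ed',
     'extra_info': {'provider_network': 'outside'}},
    {'vendor_id': '8086', 'product_id': '10ed', 'extra_info': {}},
]

# -> Counter({('8086', '10ed', 'outside'): 2, ('8086', '10ed', None): 1})
print(pci_stats(devices))
```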
13:52:35 I would prefer that we group on the compute node, but I was also assuming that we could do it with control node grouping and fix the problem later, particularly since there's no user-visible change there
13:52:48 ok, so you define an API for PCI flavors; you need to notify all the compute nodes about that, so that they can do the grouping properly, is that what you're saying?
13:52:56 Personally I would choose - for now - the way that's closest to what we already have in the code, to minimise work
13:53:01 No.
13:53:10 ok, then how
13:53:10 Not at all.
13:53:24 Flavors are defined in terms of a list of attributes that can be used in the flavor.
13:53:27 Which you define in config.
13:53:35 Setting aside for a moment where that config is.
13:53:50 So, your pci_stats table has a row for every value combination of those attributes.
13:54:31 Now, if you're grouping on compute nodes, you don't need to tell them about flavors. They do need to know what the grouping attributes are. Either they have the same attrs in their config and confirm early on that they have the right set, or they have to ask the control node for the list. Either works.
13:54:34 you are saying the compute node will report every device?
13:54:43 No, I'm not.
13:56:30 baoli: I'm not sure if anyone else has more questions. If you have more, possibly you can discuss with ijw after the meeting, since you are at the same company and it's easy to discuss?
13:56:33 remember that PCI flavors are defined on the controller node with arbitrary criteria
13:56:59 ijw: baoli: do we want to set a plan/dates for how we want to progress?
13:57:03 yjiang51, sure
13:57:30 baoli: any chance you can share a git fork with your changes?
13:57:32 irena, I hope we can
13:58:00 Irena, I can provide a full patch if you need one
13:58:18 baoli: flavors are defined on a controller node, but the criteria are not arbitrary, they're validated against the list of grouping attributes.
13:58:36 And I think that that is what you're missing
13:59:04 ijw, let's discuss it offline then
13:59:15 #endmeeting