13:02:17 <baoli> #startmeeting pci passthrough
13:02:18 <openstack> Meeting started Wed Jan 22 13:02:17 2014 UTC and is due to finish in 60 minutes.  The chair is baoli. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:02:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:02:21 <openstack> The meeting name has been set to 'pci_passthrough'
13:03:18 <irenab> hi
13:03:28 <baoli> Hi
13:03:59 <sadasu> Hello!
13:04:56 <irenab> baoli: following your email, I suggested covering SRIOV-related items this week
13:05:17 <baoli> Irenab, I agree
13:05:37 <irenab> As for heyongli's BP, I reviewed it, no major issues
13:05:53 <ijw> yo
13:05:53 <irenab> Just do not like the way it deals with networking for now....
13:06:01 <baoli> What's exactly the plan B you mentioned
13:06:09 <irenab> But at least it adds some support
13:06:11 <ijw> the one I suggested?
13:06:44 <irenab> yes, what Ian meant by neutron aware scheduler
13:07:09 <baoli> How is it exactly defined?
13:08:29 <sadasu> I had one question regarding the bp wiki
13:08:49 <irenab> sadasu: I think heyongli is not here
13:09:08 <sadasu> ok...maybe ijw can answer
13:09:28 <ijw> baoli: the problem is that we need to decide in advance of starting a VM which specific network cards can be used to attach to the networks it wants
13:09:38 <irenab> baoli: you can see Ian's suggestion at #link: https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#
13:09:40 <sadasu> pci-flavor can be specified as part of nova flavor create and as part of --nic option
13:09:46 <ijw> baoli: the solution is for nova-api to ask Neutron what the requirements are and hand it to the scheduler
13:09:50 <sadasu> ignoring the exact syntax for now
13:10:14 <heyongli> sorry, my home network is unstable today
13:10:26 <sadasu> have we thought of the case where pci-flavors can be specified both places?
13:10:31 <sadasu> I don't see it in the doc
13:10:39 <ijw> And according to rkukura, there's a similar requirement for provider networks, so I think in practice a lot of that solution would be done by them and we could piggyback on it, but I also think it's Juno territory
13:10:48 <irenab> sadasu: as far as I understood, it is there
13:11:08 <sadasu> nova flavor-key m1.small set pci_passthrough:pci_flavor= '1:bigGPU'
13:11:15 <ijw> sadasu: yes - the ones in the nova flavor are non-Neutron passthrough devices which are simply attached (even if they're NICs).  The ones in Neutron are attached by vif plugging.
13:11:34 <sadasu> nova boot mytest --flavor m1.tiny --image=cirros-0.3.1-x86_64-uec --nic net-id=network_X pci_flavor= '1:phyX_NIC;'
13:11:50 <ijw> And we have to combine all the PCI device requirements before we try to schedule.
13:12:30 <irenab> heyongli: Can you please address my comments on wiki?
13:12:41 <sadasu> ijw: I understand that since I have been following this discussion...I think it should be made very clear and explicit in the doc
13:12:46 <heyongli> which one? irenab?
13:12:53 <ijw> sadasu: Ok, that seems sensible
13:13:25 <sadasu> also, admin should be careful while creating a pci-flavor that is a mixture of networking and non-networking devices
13:13:29 <irenab> heyongli: can we quickly go over all of them
13:13:33 <ijw> The document could do with a work-over for formatting, too, it's hard to read because the formatting's patchy
13:13:38 <heyongli> irenab: ok
13:13:47 <ijw> I was going to do that if I found time but I haven't managed it yet
13:14:11 <heyongli> ijw: i had to do it tomorrow
13:14:12 <irenab> heyongli: just look for irenab on wiki page
13:14:16 <ijw> I suggest anyone else who has 5 minutes to spare picks a section
13:14:26 <irenab> first on pci_information format, do you agree?
13:14:52 <heyongli> format or name?
13:14:55 <ijw> The idea of pci_information as I understood it is that you could supply the config item more than once rather than having a list of matches
13:15:08 <ijw> But we should choose one, clearly that's not defined well enough
13:15:34 <irenab> heyongli: both format and name
13:15:51 <irenab> you have '=>' there
13:16:31 <heyongli> irenab: sure, if it's this kind of problem or a typo, i think you can modify it directly.
13:16:44 <irenab> heyongli: thanks
13:17:02 <heyongli> for the name, i don't have a good idea.
13:17:11 <irenab> heyongli: another question on pci_flavor_attrs. Is it a single definition?
13:17:30 <heyongli> it can be multiple, if we want it to be.
13:18:04 <irenab> should it be defined per PCI flavor or not?
13:18:09 <heyongli> this should be identical on all cloud machines, is it a problem?
13:18:32 <heyongli> irenab: no, i don't think so, any concern?
13:18:58 <irenab> no, I'm just not sure I understand the usage and what should be defined there by the cloud admin
13:19:20 <baoli> well, I imagine it would. What if you want to add a new attribute in order to do some new stuff?
13:19:50 <irenab> As I understood, it's the list of properties the scheduler uses to filter the PCI devices, right?
13:20:06 <baoli> Then the question becomes why would you like to define pci flavors on the fly if the attribute is static?
13:20:20 <ijw> I think it should be (a) single and (b) comma separated
13:20:28 <heyongli> irenab: not a filter, it's how the scheduler organizes the stats pools, i think.
13:20:28 <ijw> a, b, c for example
13:20:58 <ijw> baoli: flavors can be defined on the fly but we need to validate the attributes they use, I think
13:21:04 <irenab> heyongli: but eventually the device will be picked by matching all attrs?
13:21:12 <heyongli> list is a json structure, like all other definition,
13:21:15 <ijw> You still have quite a lot of flexibility in what flavors you make even if the attrs are the same
13:21:19 <baoli> why attr is static?
13:21:41 <ijw> Because pci_stats are created based on it, really, and you could invalidate flavors if you change it
13:22:02 <ijw> We discussed this before and I think the conclusion is that yes, it could be something you could change, but if we wanted to do that we could add it later
13:22:17 <heyongli> ijw: yeah, there could be some improvement maybe, but have no idea yet.
13:22:29 <ijw> But if we have lots of lines of the form:
13:22:39 <ijw> pci_information=[{match}, {attrs}]
13:22:42 <ijw> That's not awful
13:22:55 <ijw> And the lines are short enough to be readable - long config items don't wrap well, I think
13:23:13 <heyongli> ijw: aren't we talking about the pci_flavor_attrs?
13:23:22 <baoli> would compute node read it as well so that it can validate the attr in the list?
13:23:34 <irenab> at least I do
13:23:34 <ijw> heyongli: sorry, I'm talking about both at once ;)
13:23:43 <ijw> pci_flavor_attrs=a, b, c I think
13:23:50 <ijw> spaces ignored
13:24:08 <irenab> so attrs in pci_info can be from the (a,b,c) only
13:24:09 <irenab> ?
13:24:16 <ijw> baoli: compute node uses it, yes - not to validate, but to group the pci_stats it's returning
13:24:26 <ijw> irenab: yes
13:24:31 <heyongli> ijw: this means the code needs to parse it, but if we use [], you don't need to do anything.
13:24:45 <baoli> ijw, I mean to validate the attr that's used in the pci information list?
13:25:07 <ijw> heyongli: string.split will do it for commas, so I think that's 2 lines of code for a slightly nicer format
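The comma-separated format ijw proposes for pci_flavor_attrs can be parsed in a couple of lines. The following is only a sketch of that idea: the helper name parse_flavor_attrs is hypothetical and is not taken from Nova.

```python
def parse_flavor_attrs(raw):
    """Split a comma-separated option value into a list of attribute names.

    Hypothetical helper for illustration only; not an existing Nova function.
    """
    # Spaces around each name are ignored and empty entries are dropped.
    return [name.strip() for name in raw.split(",") if name.strip()]


attrs = parse_flavor_attrs("a, b, c")
print(attrs)  # ['a', 'b', 'c']
```

The programmatic value is a plain list of strings, the same result a JSON list such as ["a","b","c"] would give.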
13:25:28 <ijw> baoli: In pci_information you can specify things outside the pci_flavor_list
13:25:38 <ijw> Or are you saying we should check they're all present
13:25:39 <ijw> ?
13:25:52 <heyongli> ijw: i'm fine with it. but someone will jump in suggest the [] i think, i accept it , and let's move on
13:26:04 <baoli> so attrs used in the pci information could be anything?
13:26:23 <irenab> baoli: as long as defined on Controller
13:26:33 <irenab> for pci_flavor_attr
13:26:48 <heyongli> baoli: i think this is the start of the configuration, there's no way to verify it, except the format
13:27:30 <ijw> baoli: yes
13:27:36 <ijw> irenab: no
13:28:08 <ijw> irenab: You can put what you like in there, they don't have to be in the flavor_attrs - but the ones that aren't can only be used for information, not for scheduling
13:28:09 <baoli> ijw, ok. got it
13:28:10 <irenab> ijw: can you please define the pci_flavor_attr, I am confused
13:28:33 <irenab> ijw: got it
13:28:48 <ijw> OK
13:29:26 <ijw> Haven't caught up with John yet - he did a driveby on the google document, I replied to his comment but haven't seen a response.  The mailing list message also had no response when I checked an hour ago.
13:30:00 <baoli> How does the flavor definition get conveyed to all the compute nodes?
13:30:02 <ijw> I see where he's coming from with his points about host aggregate but I think he's basically suggesting that we use host aggregates in place of the whole system, both pci_information and flavors
13:30:03 <heyongli> ijw: john remove his name from this blue print, he is busy on other thing, and we kill his time too much
13:30:13 <ijw> heyongli: ok
13:30:24 <ijw> I'm going for beer with him next week, I'll have a conversation then
13:30:48 <heyongli> ijw: so cool
13:30:59 <ijw> Indeed ;)
13:31:13 <ijw> One week, three countries, no sleep ;)
13:31:30 <heyongli> ijw: thank you so much for doing this.
13:31:50 <ijw> Nah - this is not work, this is beer because I happen to be in the area ;)
13:31:54 <ijw> Also, beer
13:32:14 <heyongli> i updated the pci attr to a, b, c
13:32:26 <irenab> so, are we OK with this BP?
13:32:26 <baoli> Setting aside the beer, How does the flavor definition get conveyed to all the compute nodes?
13:32:43 <irenab> Can we move to baoli's list for the rest?
13:32:49 <ijw> baoli: don't you touch my pint!
13:32:55 <ijw> baoli: That's actually a good question
13:33:27 <baoli> it's early morning, and I have to work. No beer
13:33:27 <ijw> baoli: I think there are two options.  Either a compute node requests it when it comes up and doesn't start sending pci_stats till it has it, or the control node casts it occasionally
13:33:29 <heyongli> sorry, i don't understand the question
13:33:57 <baoli> well, you can define it on the fly
13:34:20 <ijw> Yes, that's why an update after cast might be a better option
13:34:41 <irenab> ijw: can you repeat the question?
13:34:42 <ijw> I mean, even if you can't change it via the API, you can stop the control server, change it and restart, so we have to be flexible
13:35:09 <ijw> irenab: baoli asked how we get the pci flavor attrs to the compute hosts so that they can send pci_stats in the right format
13:35:31 <sadasu> so at startup time of compute and controller nodes and right after update via CLI?
13:35:35 <baoli> ijw, restarting is an interesting proposition
13:36:16 <ijw> baoli: indeed - if you restart the compute it needs to find out the information, which is either a need for a call or a periodic broadcast; if you restart the control node you have to send the information, which can be a cast
13:36:39 <ijw> If the two are out of sync then the compute node would send the wrong info and the control node would have to drop it
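The out-of-sync case can be made concrete with a small check on the control side. This is a sketch under assumptions: the row layout (one dict per pool with a "count" key) and the function name stats_match_attrs are hypothetical, chosen only to illustrate dropping stats that were grouped by a stale attribute list.

```python
def stats_match_attrs(stats_rows, flavor_attrs):
    """Return True if every stats row is keyed by exactly the expected attrs.

    Hypothetical check: a row is assumed to be a dict holding one entry per
    attribute in pci_flavor_attrs plus a 'count' entry.
    """
    expected = set(flavor_attrs) | {"count"}
    return all(set(row) == expected for row in stats_rows)


# A compute node still grouping by 'vendor' only, while the control node
# now expects 'vendor, product': the report would be dropped.
stale_report = [{"vendor": 1, "count": 2}]
print(stats_match_attrs(stale_report, ["vendor", "product"]))  # False
```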
13:37:12 <sadasu> not sure if periodic broadcast is a good idea
13:37:13 <baoli> Well, it doesn't sound good to me if restarting needs to be done for a change of flavor, does it?
13:37:24 <heyongli> ijw: the compute node doesn't send the flavor, i think, but it would use the wrong specs.
13:37:32 <ijw> baoli: No - this is only if you decide to change the list of attributes (in the config, at the moment)
13:37:45 <ijw> flavors would be fine for dynamic update, the compute node doesn't need to know what they are
13:38:08 <sadasu> baoli: that may not be the reason for a restart, but before a flavor definition update has been sent to all compute nodes, it is possible that the controller is restarted
13:38:20 <baoli> ijw, why? are you sending stats based on flavor definition?
13:38:32 <ijw> baoli: no - as you say
13:38:36 <heyongli> baoli: i don't think so.
13:38:44 <irenab> on flavor_attrs I think
13:38:49 <heyongli> stats are constructed based on the attr list,
13:38:54 <ijw> You're sending pci stats (and instance start requests) using pci attrs, that's the only way control and compute communicate
13:39:09 <baoli> Can you explain how the stats get calculated?
13:39:15 <ijw> But they do have to agree on the list of attributes they're using (pci_flavor_attrs)
13:40:12 <baoli> Again, can you explain how the stats get calculated?
13:40:22 <baoli> by a compute node?
13:40:31 <ijw> baoli: I have devices (1, 2) and (1, 4) for (vendor, product) - if the attrs list is 'vendor' I would get {vendor: 1, count: 2} and if it's vendor, product I would get {vendor:1, product:2, count:1} and {vendor:1, product:4, count:1}
13:40:50 <ijw> Make sense?
13:41:01 <baoli> how do you know it's by vendor or (vendor, product_id)?
13:41:11 <ijw> pci_flavor_attrs' value
13:41:23 <ijw> As received from the control node
13:41:32 <ijw> ... somehow, which is where we started the discussion
13:42:12 <ijw> Make sense?
13:42:12 <baoli> I must have missed something
13:42:19 <baoli> Not yet.
13:43:18 <baoli> you said the attribute list is one item?
13:43:55 <baoli> sorry. You define an attribute list as (a,b,c, ....)?
13:44:01 <ijw> No, a comma separated list
13:44:12 <ijw> of multiple attributes
13:44:13 <baoli> Am I using a comma in there?
13:44:31 <ijw> baoli: I mean 'no, it's not one item'
13:44:41 <ijw> I'll try to be clear
13:44:46 <baoli> so is it (a1, a2, a3)?
13:44:54 <baoli> for example?
13:45:39 <ijw> I meant that I think pci_flavor_attrs should appear just once in the config, not repeatedly; that it should have a value like 'a,b,c' because the JSON format '["a","b","c"]' is a bit unnecessary for a list of simple items; and that the programmatic value would be a list of strings.
13:45:56 <baoli> that's fine
13:46:06 <baoli> I don't care about the format, actually
13:46:16 <ijw> Yeah, we went all over the shop in that discussion but those are the three things I meant to say
13:47:01 <baoli> how do stats get calculated based on ["a", "b", "c"]?
13:47:21 <heyongli> baoli: let me show you
13:47:31 <ijw> As above - we bucket based on unique combinations of value, so there would be one stats row for each unique combination of a, b, c attribute values
13:47:42 <heyongli> if a device has [a, b], it should be in the [a, b] pool
13:47:58 <heyongli> if a device has [a, b, c], it's in the [a, b, c] pool
13:48:14 <heyongli> if a device has only a, it's in the [a,] pool
13:48:18 <ijw> Um, no
13:48:30 <baoli> oh, guys, come on
13:48:31 <heyongli> actually, it's the [a, none, none] pool
13:48:34 <ijw> not at all
13:48:57 <ijw> Look - so the attributes will be ones that the pci devices have.  device, product, things in extra info, so on
13:49:03 <heyongli> there should be no overlap. what do you have in mind? interesting.
13:49:16 <ijw> Absences in any of them would be None
13:49:34 <heyongli> ijw: cool, i agree
13:49:35 <ijw> But the bucket is unique combinations of those attributes' values in the pci device list on the compute node
13:49:48 <heyongli> i said [a] is [a, none, none]
13:50:08 <ijw> We do an operation like the SQL statement SELECT $pci_flavor_attrs, count(*) from pci_devices GROUP BY $pci_flavor_attrs
13:50:22 <ijw> Which means that, for each unique value combo we get a count of devices with that combo
13:50:50 <ijw> heyongli: yeah, I think I agree with what you mean but I don't think your explanation was at all clear ;)
13:51:11 <ijw> So for pci_flavor_attrs='device,vendor'
13:51:13 <heyongli> ijw: may be , i'm not good at it like you, really,
13:51:16 <ijw> Take a list of PCI devices
13:51:32 <ijw> Turn it into a list of [device-value,vendor-value] one for each device
13:51:37 <ijw> Count unique combinations
13:51:39 <heyongli> ijw: on the compute node we don't need to run SQL against the db to calculate this
13:51:48 <ijw> Return the combination and count to the control node
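The grouping ijw describes (the GROUP BY over pci_flavor_attrs, done in memory rather than in SQL) can be sketched as follows. The function name build_pci_stats and the dict-per-device layout are assumptions made for illustration; only the bucketing rule itself comes from the discussion. A missing attribute is treated as None, as stated above.

```python
from collections import Counter


def build_pci_stats(devices, flavor_attrs):
    """Count devices per unique combination of flavor attribute values.

    Hypothetical sketch: each device is assumed to be a dict of attributes.
    """
    # One key per device: its value for each attribute, None when absent.
    keys = (tuple(dev.get(attr) for attr in flavor_attrs) for dev in devices)
    counts = Counter(keys)
    # One stats row per unique combination, carrying the device count.
    return [dict(zip(flavor_attrs, key), count=n) for key, n in counts.items()]


devices = [{"device": 2, "vendor": 1}, {"device": 4, "vendor": 1}]
print(build_pci_stats(devices, ["vendor"]))
# [{'vendor': 1, 'count': 2}]
print(build_pci_stats(devices, ["device", "vendor"]))
# [{'device': 2, 'vendor': 1, 'count': 1}, {'device': 4, 'vendor': 1, 'count': 1}]
```

Each device lands in exactly one row, because the key always uses the full attribute list.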
13:52:13 <baoli> ijw, you need [device], [vendor], [device, vendor], right?
13:52:17 <ijw> heyongli: yes - I'm not suggesting we use SQL, it's just I can express the transformation unambiguously with one line of SQL
13:52:31 <heyongli> ijw: in this way, a device will belong to multiple pools, which is a problem.
13:52:36 <ijw> baoli: nope - in the schedulers, flavors match more than one of those buckets
13:53:00 <baoli> ijw, I'm really lost here.
13:53:24 <ijw> So if I have a flavor that says 'vendor: 1' and I have pci_flavor_attrs of 'device,vendor'
13:53:36 <heyongli> ijw: if a flavor requests [a, b] you may allocate it from pool [a,b,c], which will use up the device while the stats still say the compute node has it
13:53:38 <ijw> Then in the scheduler I have pci_stats for each device,vendor combination
13:53:52 <baoli> you must have thought it through. But I would recommend documenting this in a well specified algorithm. Because this is critical
13:53:59 <ijw> I check each pci_stats row that has vendor=1, ignoring the device
13:54:04 <ijw> And find a free device
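The partial match on the scheduler side can be sketched for the simplest case, a request for a single device. The names find_matching_pool and claim_device and the pool layout are hypothetical. The sketch deliberately leaves out the retry logic that requests for several devices would need.

```python
def find_matching_pool(pools, flavor_spec):
    """Return the first pool with a free device that satisfies the flavor.

    Hypothetical sketch: attributes the flavor does not mention are ignored.
    """
    for pool in pools:
        if pool["count"] > 0 and all(
            pool.get(attr) == value for attr, value in flavor_spec.items()
        ):
            return pool
    return None


def claim_device(pools, flavor_spec):
    """Take one device from a matching pool; report whether one was found."""
    pool = find_matching_pool(pools, flavor_spec)
    if pool is None:
        return False
    pool["count"] -= 1
    return True


pools = [
    {"device": 2, "vendor": 1, "count": 1},
    {"device": 4, "vendor": 1, "count": 1},
]
# A flavor of 'vendor: 1' matches both pools, whatever the device value.
print(claim_device(pools, {"vendor": 1}))  # True
```

Because a claim decrements the pool that actually supplied the device, the counts stay consistent even when the flavor names fewer attributes than the pools are keyed by.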
13:55:30 <baoli> There seems to be a big disconnect in this regard. So it must be clearly specified in the doc.
13:55:55 <heyongli> ijw: this will lead to one device being in multiple pools (buckets), which is not good. i suggest: if a device matches [vendor=1, product=5,...], it is counted only in the longest-match pool.
13:56:04 <heyongli> baoli: agree
13:56:16 <ijw> heyongli: One device will never be in multiple pools
13:56:31 <ijw> heyongli: There is no short-match set of pools
13:56:40 <heyongli> ijw: so , that's good for me
13:57:16 <ijw> A pool is always tagged by a full set of pci_flavor_attrs.  The scheduler works out partial matches, so the complexity isn't passed on to the compute nodes
13:57:16 <baoli> Ok, time's almost up now.
13:57:23 <ijw> So hopefully that's clear
13:57:38 <baoli> We'll start SRIOV tomorrow
13:57:43 <heyongli> ijw: i only ask that the stats calculation algorithm has this feature (longest match), anything else is fine with me.
13:57:46 <ijw> baoli: I'll see about that algorithm, but I guarantee it will horrify - it's definitely a long search in the worst case
13:58:09 <baoli> ijw, I don't like anything that's horrific.
13:58:11 <ijw> heyongli: I think you and I agree what the stats will be - just put a patch up and I'll review
13:58:29 <heyongli> i'll update the doc based on this request.
13:59:08 <heyongli> ijw: sure, i have the patch and have sent it, it fits our goal, i think.
13:59:18 <ijw> baoli: It's slow in the worst case, which the admin has an option to set up.  It has to be because we allow such general flavor conditions, because people want them.  We minimise the data it operates on to ensure it's not going to be too slow in practice, but the algorithm as code does have to do retries as it finds a match so it will look suspiciously like lots of nested loops
13:59:38 <ijw> It will have a comment explaining that when we come to write it
13:59:47 <baoli> ijw, my concern is how it works because I haven't seen a complete description on this yet.
13:59:56 <heyongli> ijw: this is fine, stats are reported per compute node, so there aren't that many devices
14:00:23 <ijw> baoli: I certainly wrote a fairly complete one, but if you'll take it on trust that it will work quite honestly I would sooner write the algorithm as code
14:00:30 <heyongli> baoli: much like the current implementation, i think.
14:00:50 <baoli> And as far as I'm concerned, it's nasty and unnecessarily complicated. I'm not sure we are going in the right direction any more
14:00:56 <ijw> (which I can do once heyongli's patch is up so I can work off of it)
14:01:07 <baoli> #endmeeting