13:02:17 #startmeeting pci passthrough
13:02:18 Meeting started Wed Jan 22 13:02:17 2014 UTC and is due to finish in 60 minutes. The chair is baoli. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:02:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:02:21 The meeting name has been set to 'pci_passthrough'
13:03:18 hi
13:03:28 Hi
13:03:59 Hello!
13:04:56 baoli: following your email, I suggested covering SRIOV-related topics this week
13:05:17 irenab, I agree
13:05:37 As for heyongli's BP, I reviewed it, no major issues
13:05:53 yo
13:05:53 Just do not like the way it deals with networking for now....
13:06:01 What exactly is the plan B you mentioned?
13:06:09 But at least it adds some support
13:06:11 the one I suggested?
13:06:44 yes, what Ian meant by neutron eawre scheduler
13:06:58 ^aware
13:07:09 How is it exactly defined?
13:08:29 I had one question regarding the bp wiki
13:08:49 sadasu: I think heyongli is not here
13:09:08 ok...maybe ijw can answer
13:09:28 baoli: the problem is that we need to decide in advance of starting a VM which specific network cards can be used to attach to the networks it wants
13:09:38 baoli: you can see Ian's suggestion at #link: https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#
13:09:40 pci-flavor can be specified as part of nova flavor create and as part of --nic option
13:09:46 baoli: the solution is for nova-api to ask Neutron what the requirements are and hand it to the scheduler
13:09:50 ignoring the exact syntax for now
13:10:14 sorry, my home network is unstable today
13:10:26 have we thought of the case where pci-flavors can be specified in both places?
13:10:31 I don't see it in the doc
13:10:39 And according to rkukura, there's a similar requirement for provider networks, so I think in practice a lot of that solution would be done by them and we could piggyback on it, but I also think it's Juno territory
13:10:48 sadasu: as far as I understood, it is there
13:11:08 nova flavor-key m1.small set pci_passthrough:pci_flavor= '1:bigGPU'
13:11:15 sadasu: yes - the ones in the nova flavor are non-Neutron passthrough devices which are simply attached (even if they're NICs). The ones in Neutron are attached by vif plugging.
13:11:34 nova boot mytest --flavor m1.tiny --image=cirros-0.3.1-x86_64-uec --nic net-id=network_X pci_flavor= '1:phyX_NIC;'
13:11:50 And we have to combine all the PCI device requirements before we try to schedule.
13:12:30 heyongli: Can you please address my comments on the wiki?
13:12:41 ijw: I understand that since I have been following this discussion...I think it should be made very clear and explicit in the doc
13:12:46 which one? irenab?
13:12:53 sadasu: Ok, that seems sensible
13:13:25 also, admin should be careful while creating a pci-flavor that is a mixture of networking and non-networking devices
13:13:29 heyongli: can we quickly go over all of them
13:13:33 The document could do with a work-over for formatting, too, it's hard to read because the formatting's patchy
13:13:38 irenab: ok
13:13:47 I was going to do that if I found time but I haven't managed it yet
13:14:11 ijw: i had to do it tomorrow
13:14:12 heyongli: just look for irenab on the wiki page
13:14:16 I suggest anyone else who has 5 minutes to spare picks a section
13:14:26 first on pci_information format, do you agree?
13:14:52 format or name?
13:14:55 The idea of pci_information as I understood it is that you could supply the config item more than once rather than having a list of matches
13:15:08 But we should choose one, clearly that's not defined well enough
13:15:34 heyongli: both format and name
13:15:51 you have '=>' there
13:16:31 irenab: sure, if it's this kind of problem or a typo, i think you can modify it directly.
13:16:44 heyongli: thanks
13:17:02 for the name, i don't have a good idea.
13:17:11 heyongli: another question on pci_flavor_attrs. Is it a single definition?
13:17:30 it can be multi, if we want it to be.
13:18:04 should it be defined per PCI flavor or not?
13:18:09 this should be identical on all cloud machines, is it a problem?
13:18:32 irenab: no, i don't think so, any concern?
13:18:58 no, I'm just not sure I understand the usage and what should be defined there by the cloud admin
13:19:20 well, I imagine it would. What if you want to add a new attribute in order to do some new stuff?
13:19:50 As I understood, it's the list of properties the scheduler uses to filter the PCI devices, right?
13:20:06 Then the question becomes why would you like to define pci flavors on the fly if the attribute is static?
13:20:20 I think it should be (a) single and (b) comma separated
13:20:28 irenab, not filter, it's how the scheduler organizes the stats pools, i think.
13:20:28 a, b, c for example
13:20:58 baoli: flavors can be defined on the fly but we need to validate the attributes they use, I think
13:21:04 heyongli: but eventually the device will be picked by matching all attrs?
13:21:12 list is a json structure, like all other definitions,
13:21:15 You still have quite a lot of flexibility in what flavors you make even if the attrs are the same
13:21:19 why is attr static?
13:21:41 Because pci_stats are created based on it, really, and you could invalidate flavors if you change it
13:22:02 We discussed this before and I think the conclusion is that yes, it could be something you could change, but if we wanted to do that we could add it later
13:22:17 ijw: yeah, there could be some improvement maybe, but i have no idea yet.
13:22:29 But if we have lots of lines of the form:
13:22:39 pci_information=[{match}, {attrs}]
13:22:42 That's not awful
13:22:55 And the lines are short enough to be readable - long config items don't wrap well, I think
13:23:13 ijw: aren't we talking about the pci_flavor_attrs?
13:23:22 would the compute node read it as well so that it can validate the attr in the list?
13:23:34 at least I do
13:23:34 heyongli: sorry, I'm talking about both at once ;)
13:23:43 pci_flavor_attrs=a, b, c I think
13:23:50 spaces ignored
13:24:08 so attrs in pci_info can be from the (a,b,c) only
13:24:09 ?
13:24:16 baoli: compute node uses it, yes - not to validate, but to group the pci_stats it's returning
13:24:26 irenab: yes
13:24:31 ijw: this means the code needs to parse it, but if we use [], you don't need to do anything.
13:24:45 ijw, I mean to validate the attr that's used in the pci information list?
13:25:07 heyongli: string.split will do it for commas, so I think that's 2 lines of code for a slightly nicer format
13:25:28 baoli: In pci_information you can specify things outside the pci_flavor_list
13:25:38 Or are you saying we should check they're all present
13:25:39 ?
13:25:52 ijw: i'm fine with it. but someone will jump in and suggest the [] i think, i accept it, and let's move on
13:26:04 so attrs used in the pci information could be anything?
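[Editor's note] The comma-separated format proposed above (pci_flavor_attrs=a, b, c with spaces ignored, parsed with a plain string split) could be handled roughly as follows. This is a minimal sketch, not code from any Nova patch; the function name parse_flavor_attrs is hypothetical.

```python
def parse_flavor_attrs(raw):
    """Split a comma-separated config value into a list of attribute names.

    Hypothetical helper: whitespace around each name is ignored and
    empty entries (for example from a trailing comma) are dropped.
    """
    return [name.strip() for name in raw.split(",") if name.strip()]


# The programmatic value is a list of strings, as described in the meeting.
attrs = parse_flavor_attrs("a, b, c")
print(attrs)
```

The JSON form '["a","b","c"]' would need no custom parsing, which is the trade-off raised in the discussion.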
13:26:23 baoli: as long as defined on Controller
13:26:33 for pci_flavor_attr
13:26:48 baoli: i think this is the start of configuration, there's no method to verify it, except the format
13:27:30 baoli: yes
13:27:36 irenab: no
13:28:08 irenab: You can put what you like in there, they don't have to be in the flavor_attrs - but the ones that aren't can only be used for information, not for scheduling
13:28:09 ijw, ok. got it
13:28:10 ijw: can you please define the pci_flavor_attr, I am confused
13:28:33 ijw: got it
13:28:48 OK
13:29:26 Haven't caught up with John yet - he did a driveby on the google document, I replied to his comment but haven't seen a response. The mailing list message also had no response when I checked an hour ago.
13:30:00 How does the flavor definition get conveyed to all the compute nodes?
13:30:02 I see where he's coming from with his points about host aggregate but I think he's basically suggesting that we use host aggregates in place of the whole system, both pci_information and flavors
13:30:03 ijw: john removed his name from this blueprint, he is busy with other things, and we've taken too much of his time
13:30:13 heyongli: ok
13:30:24 I'm going for beer with him next week, I'll have a conversation then
13:30:48 ijw: so cool
13:30:59 Indeed ;)
13:31:13 One week, three countries, no sleep ;)
13:31:30 ijw: thank you so much for this.
13:31:50 Nah - this is not work, this is beer because I happen to be in the area ;)
13:31:54 Also, beer
13:32:14 i update the pci attr to a, b, c
13:32:26 so, are we OK with this BP?
13:32:26 Setting aside the beer, how does the flavor definition get conveyed to all the compute nodes?
13:32:43 Can we move to baoli's list for the rest?
13:32:49 baoli: don't you touch my pint!
13:32:55 baoli: That's actually a good question
13:33:27 it's early morning, and I have to work. No beer
13:33:27 baoli: I think there are two options.
Either a compute node requests it when it comes up and doesn't start sending pci_stats till it has it, or the control node casts it occasionally
13:33:29 sorry, i don't understand the question
13:33:57 well, you can define it on the fly
13:34:20 Yes, that's why an update after cast might be a better option
13:34:41 ijw: can you repeat the question?
13:34:42 I mean, even if you can't change it via the API, you can stop the control server, change it and restart, so we have to be flexible
13:35:09 irenab: baoli asked how we get the pci flavor attrs to the compute hosts so that they can send pci_stats in the right format
13:35:31 so at startup time of compute and controller nodes and right after update via CLI?
13:35:35 ijw, restarting is an interesting proposition
13:36:16 baoli: indeed - if you restart the compute it needs to find out the information, which is either a need for a call or a periodic broadcast; if you restart the control node you have to send the information, which can be a cast
13:36:39 If the two are out of sync then the compute node would send the wrong info and the control node would have to drop it
13:37:12 not sure if periodic broadcast is a good idea
13:37:13 Well, it doesn't sound good to me if restarting needs to be done for a change of flavor, does it?
13:37:24 ijw: compute doesn't send the flavor i think, but uses wrong specs.
13:37:32 baoli: No - this is only if you decide to change the list of attributes (in the config, at the moment)
13:37:45 flavors would be fine for dynamic update, the compute node doesn't need to know what they are
13:38:08 baoli: that may not be the reason for a restart, but before a flavor definition update has been sent to all compute nodes, it is possible that the controller is restarted
13:38:20 ijw, why? are you sending stats based on flavor definition?
13:38:32 baoli: no - as you say
13:38:36 baoli: i don't think so.
13:38:44 on flavor_attrs I think
13:38:49 stats are constructed based on the attr list,
13:38:54 You're sending pci stats (and instance start requests) using pci attrs, that's the only way control and compute communicate
13:39:09 Can you explain how the stats get calculated?
13:39:12 just 7win 378
13:39:15 But they do have to agree on the list of attributes they're using (pci_flavor_attrs)
13:40:12 Again, can you explain how the stats get calculated?
13:40:22 by a compute node?
13:40:31 baoli: I have devices (1, 2) and (1, 4) for (vendor, product) - if the attrs list is 'vendor' I would get {vendor: 1, count: 2} and if it's 'vendor, product' I would get {vendor: 1, product: 2, count: 1} and {vendor: 1, product: 4, count: 1}
13:40:50 Make sense?
13:41:01 how do you know it's by vendor or (vendor, product_id)?
13:41:11 pci_flavor_attrs' value
13:41:23 As received from the control node
13:41:32 ... somehow, which is where we started the discussion
13:42:12 Make sense?
13:42:12 I must have missed something
13:42:19 Not yet.
13:43:18 you said the attribute list is one item?
13:43:55 sorry. You define an attribute list as (a,b,c, ....)?
13:44:01 No, a comma separated list
13:44:12 of multiple attributes
13:44:13 Am I using a comma in there?
13:44:31 baoli: I mean 'no, it's not one item'
13:44:41 I'll try to be clea
13:44:42 r
13:44:46 so is it (a1, a2, a3)?
13:44:54 for example?
13:45:39 I meant that I think pci_flavor_attrs should appear just once in the config, not repeatedly; that it should have a value like 'a,b,c' because the JSON format '["a","b","c"]' is a bit unnecessary for a list of simple items; and that the programmatic value would be a list of strings.
13:45:56 that's fine
13:46:06 I don't care about the format, actually
13:46:16 Yeah, we went all over the shop in that discussion but those are the three things I meant to say
13:47:01 how do stats get calculated based on ["a", "b", "c"]?
13:47:21 baoli: let me show you
13:47:31 As above - we bucket based on unique combinations of value, so there would be one stats row for each unique combination of a, b, c attribute values
13:47:42 if a device has [a, b], it should be in the [a, b] pool
13:47:58 if a device has [a, b, c], it's in the [a, b, c] pool
13:48:14 if a device has only a, it's in the [a,] pool
13:48:18 Um, no
13:48:30 oh, guys, come on
13:48:31 actually, it's the [a, none, none] pool
13:48:34 not at all
13:48:57 Look - so the attributes will be ones that the pci devices have. device, product, things in extra info, so on
13:49:03 there should be no overlap, and what is it you have in mind? interesting.
13:49:16 Absences in any of them would be None
13:49:34 ijw: cool, i agree
13:49:35 But the bucket is unique combinations of those attributes' values in the pci device list on the compute node
13:49:48 i said [a], which is [a, none, none]
13:50:08 We do an operation like the SQL statement SELECT $pci_flavor_attrs, count(*) FROM pci_devices GROUP BY $pci_flavor_attrs
13:50:22 Which means that, for each unique value combo we get a count of devices with that combo
13:50:50 heyongli: yeah, I think I agree with what you mean but I don't think your explanation was at all clear ;)
13:51:11 So for pci_flavor_attrs='device,vendor'
13:51:13 ijw: maybe, i'm not as good at it as you, really,
13:51:16 Take a list of PCI devices
13:51:32 Turn it into a list of [device-value,vendor-value] one for each device
13:51:37 Count unique combinations
13:51:39 ijw: on the compute node we don't need to query the db with sql to calculate this
13:51:48 Return the combination and count to the control node
13:52:13 ijw, you need [device], [vendor], [device, vendor], right?
13:52:17 heyongli: yes - I'm not suggesting we use SQL, it's just I can express the transformation unambiguously with one line of SQL
13:52:31 ijw: this way, the device will belong to multiple pools, which is a problem.
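[Editor's note] The bucketing described above (one stats row per unique combination of pci_flavor_attrs values, with missing attributes treated as None, equivalent to the quoted GROUP BY) can be sketched in a few lines. This is an illustration only, assuming devices are plain dicts; build_pci_stats and the device layout are hypothetical, not taken from heyongli's patch.

```python
from collections import Counter


def build_pci_stats(devices, flavor_attrs):
    """Group devices by their values for flavor_attrs and count each group.

    Each device lands in exactly one pool, keyed by the full tuple of
    attribute values; an attribute the device lacks becomes None.
    """
    keys = Counter(
        tuple(dev.get(attr) for attr in flavor_attrs) for dev in devices
    )
    return [dict(zip(flavor_attrs, key), count=n) for key, n in keys.items()]


# Worked example from the meeting: devices (1, 2) and (1, 4) as (vendor, product).
devices = [{"vendor": 1, "product": 2}, {"vendor": 1, "product": 4}]
print(build_pci_stats(devices, ["vendor"]))
print(build_pci_stats(devices, ["vendor", "product"]))
```

Because the key is always the full attribute tuple, no device is counted in more than one pool, which addresses the overlap concern raised in the log.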
13:52:36 baoli: nope - in the schedulers, flavors match more than one of those buckets
13:53:00 ijw, I'm really lost here.
13:53:24 So if I have a flavor that says 'vendor: 1' and I have pci_flavor_attrs of 'device,vendor'
13:53:36 ijw: if a flavor requests [a, b] you may allocate it from pool [a, b, c]; this will use up the device while the stats say the compute node still has it
13:53:38 Then in the scheduler I have pci_stats for each device,vendor combination
13:53:52 you must have thought it through. But I would recommend documenting this in a well specified algorithm. Because this is critical
13:53:59 I check each pci_stats row that has vendor=1, ignoring the device
13:54:04 And find a free device
13:55:30 There seems to be a big disconnect in this regard. So it must be clearly specified in the doc.
13:55:55 ijw: this will lead to one device being in multiple pools (buckets), which is not good. i suggest: if a device matches [vendor=1, product=5,...], it counts only in the longest-match pool.
13:56:04 baoli: agree
13:56:16 heyongli: One device will never be in multiple pools
13:56:31 heyongli: There is no short-match set of pools
13:56:40 ijw: so, that's good for me
13:57:16 A pool is always tagged by a full set of pci_flavor_attrs. The scheduler works out partial matches, so the complexity isn't passed on to the compute nodes
13:57:16 Ok, time's almost up now.
13:57:23 So hopefully that's clear
13:57:38 We'll start SRIOV tomorrow
13:57:43 ijw: i only request that the stats calculation algorithm has this feature (longest match), anything else is fine with me.
13:57:46 baoli: I'll see about that algorithm, but I guarantee it will horrify - it's definitely a long search in the worst case
13:58:09 ijw, I don't like anything that's horrific.
13:58:11 heyongli: I think you and I agree what the stats will be - just put a patch up and I'll review
13:58:29 i update the doc based on this request.
13:59:08 ijw: sure, i have the patch, and sent it, it fits our goal, i think.
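[Editor's note] The scheduler-side partial match described above (a flavor such as 'vendor: 1' checked against every pci_stats row, ignoring attributes the flavor does not mention) might look like the sketch below. It covers only a single flavor request; the retry logic for combining several requests, which the log says will involve nested loops, is not shown. The names matching_pools and free_device_count and the dict layout are hypothetical.

```python
def matching_pools(pools, flavor_spec):
    """Return the pools whose attribute values satisfy every key in flavor_spec.

    Pools are tagged with the full pci_flavor_attrs set; the flavor may
    name only some of those attributes, so one flavor can match several pools.
    """
    return [
        pool for pool in pools
        if all(pool.get(key) == value for key, value in flavor_spec.items())
    ]


def free_device_count(pools, flavor_spec):
    """Total number of devices available to a flavor across all matching pools."""
    return sum(pool["count"] for pool in matching_pools(pools, flavor_spec))


# pci_flavor_attrs of 'device,vendor' and a flavor that says 'vendor: 1'.
pools = [
    {"device": 2, "vendor": 1, "count": 1},
    {"device": 4, "vendor": 1, "count": 1},
    {"device": 4, "vendor": 9, "count": 3},
]
print(free_device_count(pools, {"vendor": 1}))
```

In this sketch a flavor that names an attribute outside pci_flavor_attrs matches no pool, in line with the earlier remark that such attributes are informational only and cannot be used for scheduling.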
13:59:18 baoli: It's slow in the worst case, which the admin has an option to set up. It has to be because we allow such general flavor conditions, because people want them. We minimise the data it operates on to ensure it's not going to be too slow in practice, but the algorithm as code does have to do retries as it finds a match so it will look suspiciously like lots of nested loops
13:59:38 It will have a comment explaining that when we come to write it
13:59:47 ijw, my concern is how it works because I haven't seen a complete description of this yet.
13:59:56 ijw: this is fine, the stats report is based on one compute node, there are not so many devices
14:00:23 baoli: I certainly wrote a fairly complete one, but if you'll take it on trust that it will work, quite honestly I would sooner write the algorithm as code
14:00:30 baoli: much like the current implementation, i think.
14:00:50 And as far as I'm concerned, it's nasty and unnecessarily complicated. I'm not sure we are going in the right direction any more
14:00:56 (which I can do once heyongli's patch is up so I can work off of it)
14:01:07 #endmeeting