17:20:33 #startmeeting ovn_community_development_discussion
17:20:33 Meeting started Thu Aug 13 17:20:33 2020 UTC and is due to finish in 60 minutes. The chair is imaximets. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:20:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:20:36 The meeting name has been set to 'ovn_community_development_discussion'
17:21:02 AFAIK, mmichelson and dceara will not be here today.
17:21:16 numans, are you here?
17:22:51 OK. I could start with a quick update.
17:23:37 This week I looked at issues with DB sizes. Mostly SB DB size.
17:24:12 It seems like we are creating lots of identical lflows for each logical datapath.
17:24:58 And that might be optimized by only having one lflow referencing all logical datapaths it should be applicable to.
17:25:35 imaximets, zhouhan_ Hi
17:25:41 I don't know how exactly and under which conditions this could be done, but I'm looking into that.
17:25:50 I'm late
17:26:07 That's it from my side.
17:26:21 imaximets, Thanks for looking into that.
17:26:34 zhouhan seems to have connection issues.
17:26:47 imaximets, yeah there are many flows which are repetitive
17:27:07 I can go real quick.
17:27:18 numans, It's all yours. :)
17:27:28 I did some reviews.
17:27:29 imaximets: sorry I was in readonly mode. I was asking if there is any example?
17:28:05 zhouhan, for example we had a lot of reject ACL flows.
17:28:18 * numans will continue after this discussion.
17:28:46 imaximets: with the optimization, do we still need datapath information in the flow?
17:29:38 zhouhan, what do you mean by "we still need datapath information ..."
17:29:45 zhouhan, I thought to have a new table, e.g. Datapath_Group with sets of logical datapaths and have a single reference to a set from the logical flow.
17:30:45 numans: I mean, if the flows are common for all datapaths, then we can replace them with just one flow, removing the datapath match.
17:31:05 zhouhan, ok. That makes sense too.
17:31:45 zhouhan, I see, but we have switches and routers and a flow might be only applicable for switches, but not routers.
17:32:03 at least.
17:32:21 imaximets, maybe a new column option which says it's for logical switches or for routers.
17:32:36 imaximets: if that's the case, would it be better to add a datapath type, instead of creating groups?
17:33:41 zhouhan, that makes sense. Good point. Need to explore usecases deeply to understand if it's possible/feasible to have smaller groups.
17:34:37 zhouhan, one more case is lflows for port group. e.g. port group specific ACLs.
17:35:06 I think it is good to optimize such cases if it is low hanging fruit, but I would avoid heavy changes for that, because I think the number of datapaths is much smaller than the number of ports. The size of the flow table is mainly determined by the number of ports.
17:36:37 imaximets: oh, I didn't notice that. If it is for all port groups, it may be straightforward to optimize, too.
17:37:04 zhouhan, in the case of ovn-k8s, where it's a switch per node, there could be significant flows if say number of computes is 100
17:37:25 ok.
17:38:01 numans: even so, it is normal to have 10x more ports than number of computes, right?
17:38:42 sometimes, even 100x
17:38:58 yeah.
17:39:49 zhouhan, ok. we definitely still need to explore some usecases and see if it will have real benefits in real-world cases. Work in progress. :)
17:40:11 imaximets: sure, thanks!
17:40:35 numans: please continue. I will update after you.
17:41:24 zhouhan, thanks.
17:42:13 So I did some reviews.
17:42:21 and a couple of small bug fix patches.
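A minimal sketch of the grouping idea raised between 17:24 and 17:40, assuming a logical flow can be modelled as a datapath plus a body (pipeline, table, priority, match, actions). The `LFlow` type, the `group_lflows` helper and the integer group ids are hypothetical and only illustrate the proposed Datapath_Group table; they are not taken from ovn-northd or the SB schema.

```python
from collections import defaultdict
from typing import NamedTuple


class LFlow(NamedTuple):
    # Hypothetical in-memory model of one logical flow row.
    datapath: str
    pipeline: str
    table_id: int
    priority: int
    match: str
    actions: str


def group_lflows(lflows):
    """Collapse flows that differ only in datapath into one flow per group."""
    # Map each flow body (everything except the datapath) to its datapaths.
    by_body = defaultdict(set)
    for flow in lflows:
        by_body[flow[1:]].add(flow.datapath)

    # Intern datapath sets so that equal sets share a single group id.
    groups = {}
    grouped = []
    for body, datapaths in by_body.items():
        gid = groups.setdefault(frozenset(datapaths), len(groups))
        grouped.append((gid, *body))
    return groups, grouped
```

With one identical flow on each of N datapaths, this stores one flow and one N-member group instead of N flows. Whether the saving matters in practice depends on how many flows are truly identical across datapaths, which is the open question in the discussion.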
17:42:49 There is one issue reported by openshift on openstack scenario
17:43:13 the etcd cluster is having a downtime and a new leader is elected when some tests are run
17:43:34 and the leader change happens when ovn-controller programs flows and it updates the conjunction ids of existing flows
17:43:46 it is for ACLs which results in conjunction
17:44:06 So I'm working on making conjunction ids persistent
17:44:28 so that when a port is added to a port group or during a recompute, we use the same conjunction id.
17:44:46 It is not a big issue. But there is a very very small window for packet drops.
17:45:02 numans: sorry, how does etcd cluster impact ovn-controller?
17:45:19 zhouhan, etcd cluster is running as application pods
17:46:01 so the etcd traffic gets disrupted when ovn-controller changes the conjunction id for ACL flows which allow traffic for these etcd ports
17:46:44 zhouhan, the CI test creates other pods and other ACLs and while processing those, ovn-controller is updating the existing flows
17:46:44 I see, does it happen only during flow recompute
17:46:46 ?
17:47:07 zhouhan, it also happens when a port is added to the port group
17:47:19 ok
17:47:43 zhouhan, with the new I-P patches, the issue is not seen often
17:47:44 I wonder if it is a generic problem even without conjunction flows.
17:47:54 as we don't recompute enough now
17:48:08 but the issue is still seen when the CI is run with parallel=2 it seems.
17:48:24 I'm not sure what parallel=2 exactly means, I assume more tests are run in parallle
17:48:28 parallel.
17:49:09 i.e. problem when there are updates in OVS flows, groups, meters, is it possible to see transient traffic breakage?
17:49:33 zhouhan, that's what I observed.
17:49:47 or, is conjunction ID recompute the only thing we worry about?
17:50:09 zhouhan, the issue is seen when the ACL is added like this -- "ip && .. inport == @pg1 && tcp.dst >=900 && tcp.dst <=901"
17:50:33 zhouhan, and the issue is not seen when 2 separate ACLs are added for these 2 tcp dst ports
17:50:51 so I think it's happening when conjunction is involved
17:51:07 numans: I understand that the problem you saw is related to conjunction. I was just thinking whether there is a similar issue even without using conjunction.
17:51:24 zhouhan, I don't think so.
17:51:37 numans: that's great.
17:51:39 zhouhan, in the case of conjunction, we do FLOW_MOD.
17:51:49 zhouhan, in other cases, we don't do FLOW_MOD right ?
17:52:26 I think either the OF flow will be deleted and added again
17:52:30 numans: I don't remember. Maybe we can check offline. I don't remember either whether there is any chance ct-zone-id, etc. could have a similar issue.
17:52:40 or nothing happens.
17:52:41 zhouhan, ok
17:52:43 sounds good
17:52:57 maybe we can discuss further when I submit the patch
17:53:12 numans: yeah, sounds good
17:53:25 one point though - I'm planning to revisit the lflow expr patches which we had reverted earlier.
17:53:30 to solve this issue.
17:53:36 That's it from me.
17:53:52 I can go quickly
17:54:44 zhouhan, sure.
17:54:53 I found the root cause of the scale-test regression in 20.03 compared with 2.12. It has nothing to do with 20.03 OVN, but is related to the upstream OVS.
17:55:27 great finding. I didn't see the patch closely though.
17:55:50 It is a change in ovsdb IDL code that caused the problem. I reverted the patch and the performance is comparable with 2.12 now. The revert is merged by imaximets. Thanks imaximets for the review.
17:56:11 zhouhan, thanks for finding this!
17:56:22 zhouhan, it's on the client IDL side right ?
17:56:39 numans: yes
17:56:43 zhouhan, ok. cool.
17:57:44 Now with this solved, I can compare 20.03 vs. 20.06 more fairly, because northd doesn't appear to be the main bottleneck now.
17:58:24 There is an obvious latency reduction in 20.06, thanks to numans's I-P improvement for handling changes in local chassis.
17:59:26 Now there are still bottlenecks in ovn-controller ofctrl_put() when number of ports is big enough. I am working on incrementally installing flows.
17:59:44 zhouhan, that's cool.
18:00:34 zhouhan, anilvenkata had a WIP patch to improve the ofctrl_put. But he switched his focus elsewhere. I'll just share the commit in his private branch.
18:00:49 please take a look in case that interests you.
18:01:00 numans: cool. Thanks a lot!
18:01:10 While working on this, I also found a bug related to conjunction when I-P is involved. It was introduced by the patch that handles merging conjunction flows from different logical flows.
18:01:35 ok.
18:01:38 I fixed this first, and am working on the tests, hope to send a patch soon.
18:01:50 that's great. Looking forward to it.
18:02:24 that's my update
18:02:41 zhouhan, Thanks!
18:02:50 Anyone else here?
18:04:38 zhouhan, I just need another minute to share the commit
18:05:03 numans: no worries. We can share offline.
18:05:15 imaximets, I think we are done.
18:05:20 zhouhan, sounds good.
18:05:30 OK. Let's call it. :)
18:05:32 Let's end the meeting and discuss secrets to punish the people who don't join :)
18:05:35 Thanks everyone!
18:05:45 zhouhan, :)
18:05:48 #endmeeting
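
A minimal sketch of the "persistent conjunction ids" idea from 17:44, assuming ids can be keyed by something stable such as the logical flow that produced them. The `ConjIdAllocator` class and its string keys are hypothetical and are not how ovn-controller implements this; the sketch only shows why a stable key keeps existing flows untouched across a recompute or a port-group change.

```python
class ConjIdAllocator:
    """Hands out conjunction ids that stay stable per key (hypothetical helper)."""

    def __init__(self):
        self._by_key = {}  # key -> id currently bound to it
        self._free = []    # released ids available for reuse
        self._next = 1     # next never-used id

    def get(self, key):
        """Return the id already bound to key, allocating only on first use."""
        if key not in self._by_key:
            if self._free:
                self._by_key[key] = self._free.pop()
            else:
                self._by_key[key] = self._next
                self._next += 1
        return self._by_key[key]

    def release(self, key):
        """Free the id once the owning logical flow is deleted."""
        conj_id = self._by_key.pop(key, None)
        if conj_id is not None:
            self._free.append(conj_id)
```

Because `get()` never reassigns an id for a known key, reprocessing the same ACLs in a different order yields the same ids, so flows that are already installed would not need to be modified.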