17:14:16 #startmeeting ovn_community_development_discussion
17:14:17 Meeting started Thu Jul 30 17:14:16 2020 UTC and is due to finish in 60 minutes. The chair is mmichelson. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:14:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:14:20 The meeting name has been set to 'ovn_community_development_discussion'
17:14:32 Normally I'd start the meeting by giving my update first, but I need to step away for a couple of minutes
17:14:44 So if anyone else wants to go ahead, I'll be back in a bit.
17:15:01 Hi
17:17:22 <_lore_> hi all
17:21:41 I can start, I have a quick update: we've been hitting some (probably) raft related issues lately. In ovn-k8s deployments, in specific conditions, the SB database ends up in an inconsistent state, i.e., on a follower the raft logs try to modify/delete records that are not in the snapshot. We're still investigating to figure out what the trigger is. It sounds a bit similar to what zhouhan reported a month or so ago. I was wondering if we got a
17:21:42 root cause of that until now.
17:23:41 Once the DB ends up in this situation it will refuse any write transactions from clients.
17:25:00 That's it on my side for today. Thanks.
17:25:11 OK, and I'm back now.
17:25:56 I can go next
17:26:32 Easy things first: I got the ECMP symmetric reply patch merged. Thanks numans for reviews. And thanks zhouhan for fixing the compile error introduced.
17:27:26 Next, if you're an OVN committer you've probably seen my messages with Jeremy Kerr of Patchwork. It looks like we're going to have OVN as a separate project in patchwork from OVS. This will make it significantly easier to spot relevant patch series and get them reviewed.
17:27:46 And having a separate patchwork project is also going to simplify the existing CI (i.e. 0-day robot) processing.
17:28:09 If you are a committer and have an objection to moving OVN to its own patchwork project, please speak up in the email thread.
17:28:10 mmichelson++
17:28:32 mmichelson: sounds great
17:29:28 And finally, we've had a number of fixes go into 20.06 and I think we're verging on the need for another release. Right now, all regressions and other bugs found by ovn-kubernetes have been fixed. However, one thing that's worth talking about is whether we think it is appropriate to put any "flow explosion" fixes into the 20.06 branch.
17:30:51 ovn-kubernetes is looking at changing to a shared gateway mode, and they have flow explosion concerns. So the question is, are these changes (those that have gone in, as well as those that are still up for review) candidates for branch-20.06?
17:31:42 if ovn-k8s can't wait for 20.09, then I think it is ok to add them to 20.06
17:32:14 otherwise, it would be better to avoid backporting, because those are not new features
17:32:52 mmichelson: For the arp responder flow explosion patches, even though they're quite large, I think we can argue that they are bug fixes.
17:33:21 s/not new features/not bug fixes
17:34:00 but as dceara said, some of them were bug fixes. However, are those critical bugs?
17:34:06 I'm fine if they need to be backported
17:36:03 zhouhan, I think criticality is in the eye of the beholder :)
17:37:35 I'm fine with backporting, but I just want to make sure we can always keep released branches stable enough. We should be cautious about backporting any change that could impact existing features.
17:38:18 zhouhan, +1. Yeah, that's why I wanted to float the idea in here.
17:39:03 Anyways, the backporting idea doesn't have any hard vetoes, so that's good to see.
17:39:17 And that's all I had wanted to bring up. Whoever wants to go next, feel free.
17:39:26 I can go real quick.
17:39:44 I worked on stabilizing the 20.06 branch as ovn-k8s CI reported issues.
17:40:17 All the issues are addressed now and I hope this will be the last regression because of I-P patches.
17:40:40 numans++
17:41:06 Last week I submitted a 2 patch series to improve conntrack usage in OVN. Would appreciate some reviews on it - https://patchwork.ozlabs.org/project/openvswitch/list/?series=191630
17:41:18 mmichelson: for a release, we freeze for weeks to make sure what's released is stable. We may have the same criteria if we want to backport features - give them some time in the master branch so that we have more confidence in their stability
17:41:36 zhouhan, I couldn't get the chance to review the other 2 patches of yours. I'll get back to them soon. Hopefully by tomorrow.
17:42:09 zhouhan, that makes sense. I'd argue that maybe we need more hardened CI so that we can get more immediate feedback as patches are merged to master.
17:42:19 zhouhan, I've one point here. Right now no CMS, be it openstack or ovn-k8s, is running its CI tests on top of OVN master.
17:42:21 numans: thanks numans
17:42:38 and hence we are not able to catch any regressions on master
17:42:51 And our test coverage is definitely not covering many things.
17:43:32 All the I-P patch series regressions were caught once 20.06 was consumed by our internal QE testing and ovn-k8s testing.
17:43:42 We need to improve test coverage on master.
17:43:58 in order for us to be sure that new features don't cause regression.
17:44:22 Maybe we should run ovn-k8s kind tests when we commit a patch to ovn master branch.
17:44:29 any thoughts here ?
17:44:38 I think that should be possible with github actions.
17:45:19 +`1
17:45:25 +1, I mean
17:45:57 mmichelson, you had some plans on the upstream CI right ?
17:46:00 numans: mmichelson: yes, that's a problem. We should improve test in master. But still, there is more chance to find bugs in master when people keep developing on it.
Otherwise, if we completely trust CI and then release, there is not much point to keep a released branch :)
17:46:38 mmichelson, maybe github actions can be considered.
17:47:29 zhouhan, Agree. But as a developer we definitely miss out on edge cases and some scenarios :)
17:47:34 numans, github actions could be a good idea. The only problem I have is that since we don't use PRs, the CI would run after the change is already pushed
17:48:00 mmichelson, github actions would also run once we push a patch.
17:48:35 So maybe patchwork based CI (if you're planning on those lines) can test a patch before applying.
17:48:51 and once a patch is committed we can run ovn-k8s tests for example.
17:49:09 But I guess we can discuss it on the ML too :)
17:49:13 numans: yes, I mean, we should do both: 1) improve testing on master, e.g. borrow CI from ovn-k8s/networking-ovn to test against OVN master. 2) give more time for a new feature on master before backporting to released branch
17:49:18 numans, sure.
17:49:19 hi all. I'm getting this error every minute: Jul 30 20:48:12 ministore ovs-vswitchd[13839]: ovs|02297|odp_util(handler13)|ERR|internal error parsing flow key
17:49:19 recirc_id(0x1),dp_hash(0xa90679df),skb_priority(0x7),in_port(7),skb_mark(0),ct_state(0x21),ct_zone(0),ct_mark(0),ct_label(0),ct_tuple4(src=10.50.18.6,dst=239.0.0.250,proto=2,tp_src=0,tp_dst=0),eth(src=b2:1d:c3:86:4d:33,dst=01:00:5e:00:00:fa),eth_type(0x8100),vlan(vid=18,pcp=0),encap(eth_type(0x0800),ipv4(src=10.50.18.6,dst=239.0.0.250,proto=2,tos=0xc0,ttl=1,frag=no))
17:49:29 zhouhan, agree on both.
17:49:52 I'm done with the update. If someone wants to go next.
17:50:00 stintel, Hi. this is on OVN deployment ?
17:50:03 openvswitch-2.13.0 on kernel 5.7.8 (using kernel openvswitch modules)
17:50:16 we are in the middle of OVN meeting. Probably we can discuss after it.
17:50:21 numans: I am seeing this permanently
17:50:25 ah sorry about that
17:50:30 or you can bring up next if you want :)
17:51:00 may I go next?
17:51:09 sure.
17:51:26 I was working on scale testing last week.
17:52:46 I found that there was a regression between 2.12 and later branches. The northd CPU utilization almost doubled in 20.03/20.06 compared to 2.12.
17:53:25 I was testing the scenario of creating and binding 12K ports in 1200 HVs
17:53:56 zhouhan, ouch
17:54:38 I am also reworking the separate nb_cfg in Chassis/Chassis_private. Will send the patch soon.
17:55:11 I'll do more testing and analysis, and this is my update.
17:55:54 I want to discuss a bit on the ovn-northd. Any idea on the ovn-northd-ddlog ?
17:56:25 * zhouhan has the same question
17:56:32 I feel maybe we should add I-P support to ovn-northd (maybe a rudimentary one to start with)
17:56:52 With my last work on the I-P patches, I feel more confident in it.
17:57:01 numans: do you mean I-P without DDlog?
17:57:08 And this could relieve a bit of CPU for ovn-northd
17:57:11 zhouhan, yes.
17:57:18 until we have ddlog ready
17:57:47 not a full I-P support, but start with some basic scenarios
17:58:05 Just a thought and wanted to check what everyone here thinks ?
17:58:07 Is it worth it ?
17:58:10 numans: but it seems blp and leonid have brought ddlog very close for northd
17:58:28 numans: I wonder if this would be a big waste of effort
17:58:54 zhouhan, That's the concern I have too.
17:59:11 I think the DDlog problem is (I guess) that northd code keeps changing and then it would be hard for Ben to catch up with
17:59:26 If we do I-P, would it be the same problem?
17:59:30 zhouhan, yes. that's the problem. I think the sooner we have ddlog the better it is.
18:00:29 zhouhan, probably not. Because we are not adding new feature to northd right ? So the ddlog version doesn't need to catch up on it.
18:01:07 Anyway I wanted to check on this :)
18:01:29 numans: sorry, what do you mean "we are not adding new feature to northd"? I think we kept adding :)
18:01:59 Adding I-P doesn't add features that ddlog cares about
18:01:59 zhouhan, I think I misunderstood your comment - If we do I-P, would it be the same problem?
18:02:10 yes.
18:03:03 If someone wants to jump in and update please do so. Looks like I'm taking more time :)
18:03:11 oh, I meant, if we do I-P manually (without DDlog), would we face the same problem that northd keeps changing and our I-P implementation can't catch up?
18:03:30 I think it depends to what degress we add I-P
18:03:34 *degree
18:03:35 numans: BTW, do you have any idea why northd CPU doubled after 2.12?
18:04:14 zhouhan, no idea on that.
18:04:22 ok
18:04:35 zhouhan: what scale scenario are you testing with?
18:04:51 dceara: I was testing the scenario of creating and binding 12K ports in 1200 HVs
18:05:25 zhouhan: without ACLs/LBs I assume, right?
18:05:39 zhouhan: one thing that comes to mind is the hairpin flows for LBs on logical switches.
18:06:14 Oh, it seems northd is costing CPU when system is idle (not running any tests). This didn't happen before.
18:06:26 dceara: no, not ACLs/LBs
18:06:31 zhouhan: ack
18:07:04 I'll dig more on this. Please continue if anyone wants to update
18:10:00 I'm guessing by the silence that there's nobody else wanting to update
18:10:10 So I'll end the meeting here. Thanks everyone
18:10:14 Bye
18:10:14 #endmeeting