17:18:44 <numans> #startmeeting ovn_community_development_discussion
17:18:45 <openstack> Meeting started Thu Jun 18 17:18:44 2020 UTC and is due to finish in 60 minutes.  The chair is numans. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:18:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:18:49 <openstack> The meeting name has been set to 'ovn_community_development_discussion'
17:18:56 <numans> Hello everyone.
17:19:07 <numans> Not sure if mmichelson is there or not.
17:19:08 <_lore_> hi all
17:19:11 <numans> we can probably start.
17:19:14 <flaviof> o/
17:19:19 <mmichelson> sorry I'm here
17:19:25 <mmichelson> Just got pulled away for a sec
17:19:29 <panda> o/
17:19:33 <numans> mmichelson, I just started. All yours.
17:19:37 <mmichelson> OK, thanks
17:19:51 <mmichelson> Biggest thing is that last week we released 20.06.0 and 20.03.1
17:20:09 <numans> thanks for the release.
17:20:13 <flaviof> woot!
17:20:36 <mmichelson> I noticed blp's patch series to OVS to remove insensitive language where possible. I think that at least in our documentation we probably should follow suit where it makes sense.
17:20:58 <numans> mmichelson, agree. I did a grep and found a few instances of such words.
17:21:10 <mmichelson> So I did some searches for specific trigger words, and in documentation it's not too difficult to fix up
17:21:30 <numans> agree.
17:21:57 <mmichelson> Other than that, I've been chipping away at old patches of mine to try to get them in shape to be updated (case sensitivity in MAC and IPv6 addresses, ovs-scale-test plaintext client)
17:22:04 <mmichelson> And I've been reviewing
17:22:13 <mmichelson> That's all from me. Whoever wants to go next, feel free.
17:22:26 <numans> I can go real quick.
17:22:50 <numans> I got the ack from zhouhan for the v12 of I-P patches.
17:23:00 <numans> zhouhan thanks for the review.
17:23:07 <numans> waiting for dceara's comments if any.
17:23:24 <numans> I worked on a couple of patches and submitted for review.
17:23:55 <numans> One was to add packet marking for packets which got the router policies applied.
17:23:59 <zhouhan> numans: np
17:24:06 <numans> And did some reviews.
17:24:11 <numans> That's it from me.
17:24:19 <dceara> numans, ack, I'll try to have another look at the I-P patches tomorrow.
17:24:22 <_lore_> can I go next? very quick
17:24:25 <numans> dceara, thanks.
17:25:06 <_lore_> this week I mainly worked on an mtu issue in ovs/ovn
17:25:47 <_lore_> in particular if DF is not set, the sender is fragmenting the traffic after an ICMP error msg sent by OVN
17:26:24 <_lore_> the issue is OVN still continues to send an ICMP error msg on fragmented traffic if we have connection tracking in ingress pipeline
17:26:49 <_lore_> I figured out it is an issue in the ovs kernel datapath, I need to send the fix upstream
17:27:11 <numans> _lore_, thanks for fixing this.
17:27:19 <_lore_> then I noticed the value we configured for check_packet_len is the frame size and not the mtu
17:27:24 <_lore_> so I posted a patch for it
17:27:33 <_lore_> numans: imaximets: I sent a v2
17:27:41 <_lore_> any comments on it?
17:27:44 <numans> _lore_, ack.
17:27:57 <numans> I don't have any.
17:28:31 <imaximets> _lore_, I didn't look yet.
17:28:50 <_lore_> ack, actually in the current implementation we are wasting 14+4 bytes
17:29:06 <_lore_> that's all from my side
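One reading of the "14+4 bytes" figure above, sketched as arithmetic: if the length check compares against the full frame size while the configured value is the MTU, the usable L3 payload shrinks by the Ethernet header (14 bytes) plus one VLAN tag (4 bytes). The function name and the assumption of a single 802.1Q tag are illustrative and are not taken from the patch under discussion.

```python
ETH_HEADER_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_TAG_LEN = 4     # one 802.1Q tag (assumption: at most one tag)


def frame_len_for_mtu(mtu: int, vlan_tagged: bool = True) -> int:
    """Return the frame size that corresponds to a given L3 MTU.

    Hypothetical helper: it only illustrates the MTU vs. frame-size gap.
    """
    return mtu + ETH_HEADER_LEN + (VLAN_TAG_LEN if vlan_tagged else 0)


mtu = 1500
# Difference between the frame-size threshold and the bare MTU value.
print(frame_len_for_mtu(mtu) - mtu)
```

With these assumptions, passing the bare MTU where a frame size is expected would reject packets up to 18 bytes earlier than intended.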
17:31:58 <zhouhan> may I go next?
17:32:02 <numans> sure.
17:32:16 <zhouhan> we noticed another RAFT problem this week
17:33:28 <zhouhan> For some reason, one of the nodes in the cluster missed some transactions, and became inconsistent with the leader and the other node
17:34:24 <zhouhan> Restarting the node doesn't help, because the current logs are consistent with the cluster and updates can continue.
17:34:49 <numans> zhouhan, so the missed transactions are gone for ever ?
17:35:10 <zhouhan> The inconsistent part is in the snapshot, which is never going to be synced unless an install snapshot RPC is triggered, which doesn't happen usually.
17:35:38 <numans> ok
17:35:53 <zhouhan> numans: yes, for that node, the data is inconsistent for ever. So any clients connected to that server initially would get inconsistent data
17:36:20 <numans> zhouhan, its the leader ?
17:36:26 <zhouhan> only re-joining the node to the cluster would solve the issue.
17:36:31 <zhouhan> numans: no it is not the leader
17:36:44 <numans> zhouhan, ok.
17:37:25 <numans> so ovn-controller and ovn-northd will not see this inconsistency since they always connect to leader right ?
17:37:28 <dceara> zhouhan, Should we have a periodic consistency check to detect such cases earlier?
17:38:32 <zhouhan> For some time it is not even detected. However, once there is a transaction appended from the leader that needs to touch the inconsistent part of the data, e.g. delete a nonexistent row, the server would detect itself as inconsistent and then prevent any transaction through that node, and all the clients connected to that node would fail for ever.
17:39:10 <zhouhan> dceara: it detects as much as it can from the server point of view.
17:39:52 <zhouhan> dceara: but it is not gracefully handled, and it still allows client to connect
17:40:17 <imaximets> zhouhan, but clients should reconnect to other correct server and sync with it.
17:40:42 <dceara> zhouhan, So is the problem that we allow new client connections even when in this state?
17:40:57 <zhouhan> The root cause of the inconsistent data is still not clear. One thing suspected to be triggering this is that the node rebooted by itself before this happens.
17:41:39 <zhouhan> dceara: that is one of the problems. But the first thing is how could the inconsistent data happen. I still have no clue.
17:42:21 <dceara> zhouhan, ack.
17:42:25 <zhouhan> imaximets: because of fast-resync, clients won't get the correct data unless they restart
17:43:13 <zhouhan> so I am thinking maybe the client side detection added by dceara is still needed for such cases.
17:43:27 <imaximets> zhouhan, will the recent fix from dceara that I merged fix the issue with fast-resync in this case?
17:43:39 <dceara> zhouhan, but does the IDL detect the missing updates?
17:44:02 <imaximets> zhouhan, I mean, client will detect inconsistency eventually and disable fast-resync.
17:44:22 <zhouhan> imaximets: that fix is for conditional monitoring. In this case it is purely data inconsistency on server side, so that's not helpful
17:44:38 <imaximets> zhouhan, oh.. ok.
17:45:16 <zhouhan> imaximets: the IDL detection and disabling fast-resync (the last patch of the series) was not merged :)
17:45:50 <dceara> zhouhan, OK, but you should still see logs on ovn-controller about inexistent rows. Do you see those in your case?
17:46:00 <zhouhan> dceara: I guess it would detect, if there are transactions to trigger it. But we fixed them before it happens (by restarting the clients)
17:46:00 <imaximets> zhouhan, I understand.
17:46:33 <dceara> zhouhan, OK, I can address the comments from imaximets and send a new version of that patch then.
17:46:45 <zhouhan> I would like to thank Ali for reporting this issue (who may not be here in the channel today)
17:46:52 <zhouhan> That's my update :)
17:48:11 <mmichelson> OK, anybody else care to give an update?
17:48:23 <flaviof> May I go next?
17:48:35 <dceara> I just have a quick note for today: zhouhan my plan is to have a go at the lflow explosion reported by Girish for dnat_and_snat as soon as I get a chance. That's unless you started already on it.
17:48:38 <zhouhan> dceara: I am still not sure. It would help to self-correct in such situation, but we also need to make sure such problem is exposed without being hidden completely
17:49:27 <zhouhan> dceara: sure, thanks for helping on dnat_and_snat flow problem!
17:49:53 <dceara> zhouhan, that's why I was thinking of a periodic self check on the server side to see if the DBs are consistent.
17:50:19 <imaximets> zhouhan, dceara: I think we should report such issues loudly in logs with ERR log level at least.
17:50:29 <dceara> imaximets, ++
17:51:43 <zhouhan> dceara: hmm, on server side, I am not sure what other check can be done, beside the current check when transaction detects inconsistency.
17:52:18 <zhouhan> dceara: in theory, the raft log should already ensure consistency. It must be a bug somewhere in some corner situation.
17:52:55 <dceara> zhouhan, I see, ok.
17:52:56 <zhouhan> imaximets: +1 for error logs
17:53:30 <dceara> zhouhan, imaximets: then i'll respin the patch and use error logs instead of the current WARN and we can continue the discussion on the ML (at least for the client side)
17:53:36 <zhouhan> imaximets: on server side, it already has error logs when it is detected, but not quite straightforward. It is only "syntax error: ..."
17:53:49 <zhouhan> dceara: sounds good
17:54:42 <imaximets> zhouhan, We might need to improve server side logs.
17:55:07 <imaximets> dceara, thanks.
17:55:36 <zhouhan> yeah, and better handling on disconnecting itself from the cluster in such case, I think
17:56:58 <panda> zhouhan: is all this captured in some bug description ?
17:57:22 <zhouhan> panda: no, it is just here :)
17:57:34 <panda> zhouhan: ok.
17:58:22 <zhouhan> One way to detect such situation from monitoring point of view, is to compare the number of rows of particular tables, such as logical_flow and port_binding, periodically from each individual node.
17:58:22 <imaximets> panda, zhouhan: It's good that we have meeting logs. :)
17:58:42 <flaviof> ++
17:58:49 <dceara> imaximets, zhouhan Shall we consider opening a github issue for this?
17:59:02 <mmichelson> Probably a good idea
17:59:07 <zhouhan> ++
17:59:37 <zhouhan> We don't have an official way to track OVS bugs I guess
17:59:43 <zhouhan> Or even OVN bugs
17:59:59 <imaximets> it's usually just an e-mail thread.
18:00:08 <numans> github issues could be a starter here.
18:00:30 <mmichelson> Yeah, github issues make sense to me. The project is on github, after all :)
18:01:01 <zhouhan> yes, email thread is good for discussion but doesn't provide a good track. Can we agree on github as bug tracking in the future?
18:01:09 <panda> zhouhan: rows may not be the right path, if you receive two updates so the rows number remains the same, you are not detecting changes.
18:01:38 <flaviof> #action use https://github.com/ovn-org/ovn/issues as a way of tracking ovn bugs going forward
18:02:03 <zhouhan> panda: yes, it is just one indicator. When this happened, there were hundreds of rows difference in our case :)
18:02:33 <panda> zhouhan: yep ok.
18:02:47 <zhouhan> panda: of course it doesn't guarantee to detect all inconsistency by such monitoring
18:03:23 <dceara> zhouhan, maybe such a monitoring utility would be useful to have in the repo itself. It doesn't have to be 100% precise if it raises an alarm about a potential inconsistency. What do you think?
18:03:48 <zhouhan> dceara: +1
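The row-count comparison suggested above could be sketched roughly as follows. This is a minimal illustration, not an existing utility in the repo: the function name, the input shape (node -> table -> row count) and the sample numbers are all hypothetical, and how the counts are collected from each node is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: node name -> table name -> row count.
RowCounts = Dict[str, Dict[str, int]]


def find_row_count_mismatches(
    counts: RowCounts, tolerance: int = 0
) -> List[Tuple[str, Dict[str, int]]]:
    """Return (table, {node: count}) for tables whose per-node counts
    differ by more than `tolerance` rows."""
    tables = set()
    for per_table in counts.values():
        tables.update(per_table)

    mismatches = []
    for table in sorted(tables):
        # A table missing on a node is treated as having zero rows.
        per_node = {node: t.get(table, 0) for node, t in counts.items()}
        if max(per_node.values()) - min(per_node.values()) > tolerance:
            mismatches.append((table, per_node))
    return mismatches


# Illustrative numbers only, not taken from the reported incident.
sample = {
    "node1": {"logical_flow": 1200, "port_binding": 300},
    "node2": {"logical_flow": 1200, "port_binding": 300},
    "node3": {"logical_flow": 950, "port_binding": 300},
}
for table, per_node in find_row_count_mismatches(sample):
    print(f"possible inconsistency in {table}: {per_node}")
```

As noted in the discussion, equal row counts do not prove consistency, so a check like this can only raise an alarm about a potential problem. A nonzero `tolerance` may help avoid false alarms when nodes are sampled at slightly different times.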
18:04:35 <panda> I would try to use logical clocks if possible
18:04:52 <panda> but it might be a long term solution.
18:05:51 <zhouhan> or maybe implement a feature in raft to periodically compare snapshots between the servers
18:06:16 <imaximets> zhouhan, dceara: crazy idea: client IDL that connects to all cluster nodes at once and monitors the difference over time.
18:06:20 <panda> does the update use 2-part or 3-part commits ?
18:06:24 <zhouhan> I'd like to try to figure out the root cause after all :)
18:06:32 <mmichelson> imaximets, I was wondering if you might bring something like that up :)
18:06:37 <dceara> imaximets, that sounds cool!
18:06:38 <panda> zhouhan: yeah, that would be the start :)
18:06:44 <imaximets> zhouhan, sure. root cause must be identified anyway.
18:07:44 <zhouhan> This was an interesting discussion. Thanks all.
18:08:04 * dceara has to run. Bye all!
18:08:11 * numans too.
18:08:14 <flaviof> bye dceara !
18:08:17 <zhouhan> (a pity that blp wasn't here)
18:08:21 <zhouhan> bye all
18:08:25 <mmichelson> bye!
18:08:29 <numans> bye
18:08:32 <panda> bye
18:08:33 <imaximets> bye
18:08:34 <dceara> Thanks!
18:08:39 <mmichelson> Seems a good place to end the meeting :)
18:08:41 <mmichelson> #endmeeting
18:09:06 <imaximets> didn't work. :)
18:09:18 <mmichelson> uhhh
18:09:21 <flaviof> maybe numans has to do it?
18:09:30 <panda> too late
18:09:30 <numans> #endmeeting