20:00:04 <johnsom> #startmeeting Octavia
20:00:05 <openstack> Meeting started Wed Jul  4 20:00:04 2018 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:08 <openstack> The meeting name has been set to 'octavia'
20:00:11 <johnsom> Hi folks!
20:00:15 <nmagnezi> O/
20:00:19 <cgoncalves> hi
20:00:27 <johnsom> I'm guessing this will be a quick one as the US is on holiday today.
20:00:47 <johnsom> But since we are an international team I figured we should still have our meeting as scheduled.
20:01:03 <johnsom> #topic Announcements
20:01:13 <cgoncalves> oh, right. what are you doing here today?! :)
20:01:31 <nmagnezi> johnsom, don't miss out on the fireworks! :)
20:01:34 <johnsom> cgoncalves It's called dedication to the PTL role....  Grin
20:01:59 <johnsom> My normal reminder, we have a priority bug list for the Rocky release:
20:02:06 <johnsom> #link https://etherpad.openstack.org/p/octavia-priority-reviews
20:02:11 <nmagnezi> johnsom, four more years!.. mm.. cycles :)
20:02:16 <johnsom> Thank you to everyone that has been helping there!
20:02:25 * johnsom sighs
20:02:52 <johnsom> Also a heads up, python-octaviaclient needs to have its final Rocky release before July 26th per the release calendar
20:03:04 <johnsom> #link https://releases.openstack.org/rocky/schedule.html
20:03:28 <johnsom> So any additions we want to get into the "official" Rocky version of the client need to be in by the 26th.
20:03:31 <nmagnezi> aye. We have some client patches to get in.
20:03:50 <johnsom> I think we have two right now, backup members and the UDP extensions (which I have been testing).
20:04:33 <johnsom> If there are any others not on that list, let me know so I can keep track of them and make sure we get them in
20:05:17 <cgoncalves> FWIW I quickly tested the backup member patch and looked good to me
20:05:18 <johnsom> The last item I have is about the recent queens breakage in neutron-lbaas. However I understand this is now resolved.
20:05:27 <johnsom> Cool, thanks!
20:05:29 <cgoncalves> it is. thank YOU!
20:05:42 <nmagnezi> johnsom, great job!
20:06:15 <johnsom> FYI, the trouble with stable/queens started with this patch: https://review.openstack.org/576526
20:06:16 <nmagnezi> I reviewed a bunch of queens backports today so we can have those in soon
20:06:32 <cgoncalves> jobs are arbitrarily failing due to DNS failures. rechecking should do the trick
20:06:44 <johnsom> Someone added neutron-lbaas to upper constraints and then later global requirements.
20:07:09 <johnsom> If you look in those files, none of the "plugins" for horizon or neutron are in there
20:08:39 <johnsom> This is because the tests for these plugins need to install the "parent" project, neutron in our case, from a filesystem path (zuul) or git (tox) and you can't set an upper-constraint limit on those install types.
20:09:23 <johnsom> Anyway, we got those changes reverted.  They should not have been approved, as requirements changes on stable branches are against the stable policy, but that is a different issue.
20:09:53 <johnsom> cgoncalves Thanks for the rechecks and looking into that DNS issue. I know Mr. Naser is working to fix it.
20:10:22 <johnsom> Any other announcements today?
20:10:45 <johnsom> #topic Brief progress reports / bugs needing review
20:11:31 <johnsom> I have wrapped up the initial version of the neutron-lbaas to octavia LB migration tool. I have been working on the gate to test it, which was a pain with little Ansible gotchas.
20:11:57 * nmagnezi still waits for the magical local.conf :-)
20:11:57 <johnsom> Like the one where it merges stderr into the stdout content from the "neutron" command
20:12:33 <johnsom> Yeah, I need to do another spin to take another approach at getting two drivers enabled in neutron-lbaas.
20:12:45 <johnsom> I will poke at that after the meeting.
20:13:19 <johnsom> Anyway, the second test will migrate a non-octavia LB (even though there is no octavia driver for it) as a test.
20:13:29 <nmagnezi> Will that migration tool work for migration from n-lbaas (any provider) to Octavia as a keystone endpoint?
20:13:36 <johnsom> Should be good soon-ish.  I plan to make it a periodic gate
20:14:19 <johnsom> Yes, any neutron-lbaas provider to Octavia. Octavia should have the appropriate provider installed of course.
20:14:49 <johnsom> It sounds like VMware is making good progress on their provider driver, so likely to see that soon.  They already have a third party gate setup.
20:15:21 <johnsom> The tool will not migrate from one provider to a different provider on Octavia. That is not its intent.
20:15:49 <johnsom> I think that is possible, but a bunch more work that probably wouldn't make Rocky
20:16:16 <johnsom> Then after that work I am focusing on helping with the UDP support.
20:16:59 <nmagnezi> I think we should at least think about haproxy-ns to amphora driver migration since both are reference implementations. But maybe that is something I will follow up on after you finish what you plan for Rocky
20:17:06 <johnsom> I have been doing some testing. I was able to push 4.76 Gbits/sec through it on my little devstack setup, so performance looks decent. We just have some stuff to finish up there.
20:17:56 <johnsom> Yeah, technically the namespace driver ceased to be a reference driver in Liberty, but I'm not sure that was communicated well.
20:18:24 <johnsom> Agreed though, that is a really good use case.
20:18:37 <johnsom> Feel free to start work on it.... grin
20:18:51 <nmagnezi> Sadly many folks still use it, and a migration path might help them move :-)
20:19:13 <nmagnezi> Haha, first we finish yours. I plan to test it
20:19:18 <johnsom> I have "ideas" of how to do it if someone wants to work on it.
20:19:30 <johnsom> Cool, thanks!
20:19:51 <johnsom> It "should" work now, I'm really just poking at the gate tests
20:20:59 <johnsom> After UDP is in good shape I need to work on an active/standby test gate. (internal request)
20:21:26 <johnsom> Any other progress updates?
20:22:30 <johnsom> #topic Open Discussion
20:22:35 <johnsom> Any other topics today?
20:22:45 <cgoncalves> not much from my side. essentially backporting patches and working in tripleo to have an octavia scenario with octavia tempest tests
20:23:10 <johnsom> Cool. I really appreciate the help with backports BTW.
20:23:38 <johnsom> I know you mentioned cutting a new release, should we do that around MS3 time or would you like it sooner?
20:24:45 <cgoncalves> I *think* I don't have a preference. I just thought that it would be nice to release since we have backported a couple of good things
20:25:05 <nmagnezi> On my side, I'm currently digging into active/standby. Looks like the data plane takes longer than expected to recover (but the backup amp does send GARPs, so.. still looking)
20:25:07 <johnsom> Yeah, agreed.  Ok, I will set a mental upper bound on MS3 timeline
20:26:03 <johnsom> Yeah, that is interesting. If we see the GARPs coming out of the backup that became master, but traffic isn't getting to us, then something is fishy in neutron land.  Well worth looking into.
20:26:06 <cgoncalves> nmagnezi, that reminds me of the MASTER/BACKUP role topic we discussed offline
20:26:08 <nmagnezi> And by longer... it took 8 minutes to recover on my devstack. So definitely something worth looking into
20:26:19 <johnsom> Yeah, that is crazy
20:26:21 <cgoncalves> but maybe we can leave it to another time
20:27:07 <nmagnezi> cgoncalves, yeah, this is something I'm also looking into. Trying to find a way to determine the nodes state in a reliable way so we can report it to Octavia.
20:27:12 <johnsom> cgoncalves We can chat about it if you want. The meeting is scheduled for an hour
20:27:39 <johnsom> nmagnezi Let me share a video demo I prepared for Vancouver but didn't get to present.
20:27:52 <nmagnezi> johnsom, please do :)
20:27:56 <cgoncalves> johnsom, I'd rather like having you reviewing https://review.openstack.org/#/c/568361/ xD
20:28:41 <johnsom> #link https://drive.google.com/open?id=1wx1kkLjUxwNAOpd9KeTGSAV-INZS58fD
20:29:24 <johnsom> That is a video I recorded of a failover flow (it was part of a dashboard demo)
20:29:37 <cgoncalves> johnsom, in a nutshell we think that showing amphora roles as MASTER or BACKUP confuses users. folks tend to think the MASTER one is the active, and the other the standby
20:30:04 <cgoncalves> and that on an amp failover, they would expect the BACKUP to become MASTER and the new spawned amp the BACKUP
20:30:22 <cgoncalves> cool, thanks for the video!
20:30:28 <johnsom> cgoncalves Yes, I understand.  This is why we called it "role" and not "status" as it's really about configuration settings applied to the amphora
20:31:02 <johnsom> Luckily it's only an admin API that exposes that.
20:31:41 <nmagnezi> Today when I created an HA lb, without even doing anything, the actual active/standby state was exactly the opposite of the listed roles
20:31:43 <nmagnezi> O_O
20:32:01 <johnsom> Yeah, and that is perfectly ok.
20:32:27 <johnsom> The amps are autonomous on their "active" status
20:33:07 <nmagnezi> By that logic I agree with you. It's just counterintuitive
20:33:13 <cgoncalves> why would admins care about configuration settings if they can't (in many cases) log in to the amps? what is the use case for exposing the roles?
20:35:04 <nmagnezi> cgoncalves, for swapping amps (to update images) and I would even say to understand the impact of evacuating a specific compute node that has amps running on it, No?
20:35:45 <johnsom> None really. It was just an API Adam wanted so he could see the details of an amphora.  Amphorae are really supposed to be "hidden" things that are an implementation detail.  I think that column just got swept up with the more interesting columns about the amp
20:36:27 <cgoncalves> nmagnezi, I don't follow. what you said sounds to me more of a reason for exposing the current status of the amps than their roles
20:36:44 <nmagnezi> cgoncalves, yup
20:36:50 <johnsom> No, the Role column has no operational value really. It is just there to track the priority value and preemption settings that get applied to each in the pair.
20:37:05 <nmagnezi> Sorry, you got it right
20:38:47 <cgoncalves> johnsom, ok. I don't see right now a use case where admins/operators would be interested in consuming that column. Adam may have and might be able to share
20:39:28 <cgoncalves> anyhow, I brought this up because I expect users to get confused about this
20:39:41 <johnsom> Hmm, looking in the code I'm not sure that even matters anymore....  We may have removed the differentiation between the settings on each (other than the priority which is tracked separately)
20:40:45 <johnsom> *If* we can find a reliable way to get the actual "active" status of an amp (long discussion with Nir about that yesterday), then we certainly can add a status column for it.
20:40:58 <johnsom> The challenge is getting an accurate status from keepalived.
20:41:15 <johnsom> This also brings us back around to Mr. Ohara
20:41:34 <cgoncalves> right. I read the discussion you had to amuller back in end of May here too
20:42:30 <cgoncalves> s/to/with/
20:42:57 <johnsom> grin
20:44:01 <johnsom> Yeah, it does have a bit of usefulness. I also don't want to break the autonomy of the amps in this failover. They should be free to switch whenever they detect an issue, which may be faster than the heartbeat interval.
20:46:08 <johnsom> VRRP failover should be around a second or less depending on the configuration. Heartbeat is usually spaced higher than that.
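(A rough sketch of why the detection time is configuration-dependent: standard VRRP, per RFC 5798, has the backup declare the master dead after roughly three missed advertisements plus a priority-based skew. The advert interval and priority values below are illustrative assumptions, not Octavia's actual keepalived settings.)

    # Standard VRRP failure-detection arithmetic (RFC 5798); the inputs are
    # illustrative only and not Octavia's defaults.
    def vrrp_master_down_interval(advert_int, priority):
        """Seconds a BACKUP waits without adverts before promoting itself."""
        skew_time = ((256 - priority) * advert_int) / 256.0
        return 3 * advert_int + skew_time

    print(vrrp_master_down_interval(0.25, 100))  # ~0.9s with a 250ms advert interval
    print(vrrp_master_down_interval(1.0, 100))   # ~3.6s with a 1s advert interval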
20:46:59 <johnsom> So, please feel free to work on this.  We can also add a column "active status" or something and just put "unknown" in it....  grin
20:47:00 <cgoncalves> right. I think, though, that keepalived has a notification subsystem. we could leverage that and callback octavia
20:47:10 <cgoncalves> lol
20:47:20 <nmagnezi> haha
20:47:21 <johnsom> Yeah, I couldn't get it to reliably give me the actual state.
20:47:51 <johnsom> It would skip some, or give bogus "Failed" while in the master role, etc.
20:48:03 <johnsom> Maybe it's been fixed and we can make it work.
20:48:35 <johnsom> The active IP thing might also solve this if we go clean that up.
20:48:40 <cgoncalves> interesting. maybe I can play a bit with that some day
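(For reference, keepalived's notify hook can run a script on each state transition, passing the type, instance name, and new state. A minimal sketch in Python follows; the state-file path and the idea of the amphora agent relaying it back to the control plane over the existing heartbeat are assumptions, not the current Octavia implementation, and as johnsom notes above the notifications themselves have not proven reliable.)

    #!/usr/bin/env python3
    # Minimal keepalived notify hook sketch. keepalived invokes it as:
    #   <script> <"GROUP"|"INSTANCE"> <name> <"MASTER"|"BACKUP"|"FAULT">
    # The output path is hypothetical; something on the amphora would need to
    # read it and report the state back to the Octavia control plane.
    import json
    import sys
    import time

    STATE_FILE = "/var/lib/octavia/vrrp_state.json"  # hypothetical location

    def main():
        vrrp_type, name, state = sys.argv[1:4]
        with open(STATE_FILE, "w") as fh:
            json.dump({"type": vrrp_type,       # "GROUP" or "INSTANCE"
                       "name": name,            # vrrp instance/group name
                       "state": state,          # "MASTER", "BACKUP" or "FAULT"
                       "timestamp": time.time()}, fh)

    if __name__ == "__main__":
        main()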
20:49:34 <johnsom> Yeah, right now I don't have cycles to go play with it again.  As long as it is *working* it's lower on the priority list.
20:50:11 <cgoncalves> knowing the status of amps would improve maintenance ops greatly (e.g. fail over the standby amp first for an image update, then the active amp)
20:50:25 <johnsom> Well, technically, not really.
20:51:10 <johnsom> Since if you happen to kill the master, it's going to failover inside the TCP timeout window, so it should be transparent other than a slight delay in that packet.
20:53:34 <johnsom> This is why it's a bit funny when the marketing folks ask how many milliseconds an HA failover takes. With TCP flows it's just a metric of jitter, not a window of downtime.
20:53:41 <cgoncalves> hmmm the first failover would fail over to the standby amp, ok, then we'd need to fail over once more to also rotate the now active amp
20:54:01 <johnsom> Correct, that is what the LB failover flow does
20:54:22 <johnsom> When you call the LB failover API
20:54:43 <cgoncalves> ok, you're right in saying that technically it's possible
20:55:06 <cgoncalves> I'd just like to avoid failing over active amps twice
20:55:23 <cgoncalves> low prio for now :)
20:56:22 <johnsom> I'm not following, how would you failover one of the amps twice?
20:57:02 <johnsom> It should be, amp A and amp B, each get one failover
20:58:04 <cgoncalves> right. I'm saying it would involve failing over twice inside the TCP timeout window as you mentioned before
20:58:44 <johnsom> Oh, that would be impressive if you could cycle your first amp inside one packet timeout
20:59:13 <cgoncalves> amp A (MASTER), amp B (BACKUP) running image X. failover amp A to image X+1. amp B (now ACTIVE) is still on image X. we'd need to fail it over to update it to image X+1 too
21:00:31 <nmagnezi> johnsom, cgoncalves, my connection went down a few minutes ago, sorry.
21:00:41 <nmagnezi> johnsom, cgoncalves, good night/day folks!
21:00:46 <johnsom> Correct, but amp failover (not vrrp failover) of A takes 10+ seconds on a good day, so B would be VRRP master and handling traffic long before you are ready to do amp B.
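(To make the rolling update concrete: the LB failover flow referred to above cycles the two amphorae one at a time. A hedged sketch of calling the Octavia v2 API directly is below; the endpoint URL, token, and LB ID are placeholders for illustration, and in practice the `openstack loadbalancer failover` client command drives the same call.)

    # Hedged sketch: trigger the load balancer failover flow via the Octavia
    # v2 API. Endpoint, token, and LB ID are placeholders, not real values.
    import requests

    OCTAVIA_ENDPOINT = "http://203.0.113.10:9876"  # hypothetical Octavia API endpoint
    TOKEN = "<keystone-token>"
    LB_ID = "<load-balancer-uuid>"

    resp = requests.put(
        "{}/v2/lbaas/loadbalancers/{}/failover".format(OCTAVIA_ENDPOINT, LB_ID),
        headers={"X-Auth-Token": TOKEN},
    )
    resp.raise_for_status()  # accepted with 202; the amps are then rebuilt one at a time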
21:00:59 <johnsom> Ah, yeah, out of time....
21:01:03 <johnsom> Thanks folks!
21:01:06 <johnsom> #endmeeting