17:01:43 <johnsom> #startmeeting Octavia
17:01:44 <openstack> Meeting started Wed Sep 27 17:01:43 2017 UTC and is due to finish in 60 minutes.  The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:48 <openstack> The meeting name has been set to 'octavia'
17:01:52 <johnsom> Hi folks
17:01:52 <rm_mobile> There we go
17:02:21 <sanfern> Hi  johnsom
17:02:28 <johnsom> Sorry, talking with the nova guys about the 404 I just reproduced locally.  NIC is not showing up in the instance
17:02:44 <pksingh> o/
17:02:47 <johnsom> #topic Announcements
17:03:02 <johnsom> Just a heads up, Zuul v3 is rolling out.
17:03:25 <johnsom> This will likely mean some turbulence in our gates for a bit.
17:03:50 <johnsom> #link https://docs.openstack.org/infra/manual/zuulv3.html
17:04:04 <rm_mobile> I thought it did already
17:04:06 <johnsom> That is the link for more details on zuul v3.
17:04:29 <johnsom> It's been dragging out as they keep finding bugs.  As of last night it was still not fully deployed
17:04:32 <rm_mobile> Seemed like the pep8 job I looked at the other day was run with v3
17:05:06 <johnsom> Quote: We've pretty much run out of daylight though for the majority of the team and there is a tricky zuul-cloner related issue to deal with, so we're not going to push things further tonight. We're leaving most of today's work in place, having gotten far enough that we feel comfortable not rolling back.
17:05:29 <johnsom> From Monty's email last night
17:05:53 <johnsom> Anyway, just a heads up that this is happening and may impact us.
17:06:12 <johnsom> In the end it will be a good thing as it will be much easier to update our gates
17:06:26 <johnsom> Any other announcements today?
17:06:56 <johnsom> #topic Meeting time revisit
17:07:15 <johnsom> I have got a lot of feedback that the new meeting time is not working out.
17:07:31 <rm_mobile> I'm dying right now
17:07:42 <johnsom> Some active contributors cannot make this time, plus that ^^^ ha
17:07:57 <rm_mobile> It's too early for me to read words
17:08:16 <johnsom> So, due to multiple requests I have put up another doodle to re-evaluate the time for the meetings
17:08:24 <johnsom> #link https://doodle.com/poll/p65x9xxkec52ecaw
17:08:52 <johnsom> I have put two times in, but we can add others.  I also picked the same day, but we have flexibility there too
17:09:30 <johnsom> Recently the TC approved that we can host our meetings in the lbaas channel, so we are no longer stuck with what slots are available on the main meeting channels
17:10:32 <johnsom> Any questions/comments about meeting times?  other proposals?
17:10:33 <tongl> When are we starting to use lbaas channel?
17:11:05 <johnsom> I will send out an e-mail and we will have at least one more meeting at this time/channel where I will announce it.
17:11:24 <tongl> sounds good
17:12:11 <johnsom> #topic Brief progress reports / bugs needing review
17:12:17 <johnsom> Ok, how are things going?
17:12:49 <johnsom> tongl Is this patch still being worked on? https://review.openstack.org/#/c/323645/
17:12:49 <patchbot> patch 323645 - neutron-lbaas - Add status in VMware driver
17:12:52 <rm_mobile> Couple of patches waiting to go in that are the results of the PTG
17:12:54 <johnsom> It has a -1 comment
17:13:35 <johnsom> Yeah, we have a bunch of patches with one +2 on them
17:13:42 <johnsom> https://review.openstack.org/#/q/(project:openstack/octavia+OR+project:openstack/octavia-dashboard+OR+project:openstack/python-octaviaclient+OR+project:openstack/octavia-tempest-plugin)+AND+status:open+AND+NOT+label:Code-Review%253C0+AND+NOT+label:Verified%253C%253D0+AND+NOT+label:Workflow%253C0
17:13:46 <johnsom> oops
17:13:47 <johnsom> #link https://review.openstack.org/#/q/(project:openstack/octavia+OR+project:openstack/octavia-dashboard+OR+project:openstack/python-octaviaclient+OR+project:openstack/octavia-tempest-plugin)+AND+status:open+AND+NOT+label:Code-Review%253C0+AND+NOT+label:Verified%253C%253D0+AND+NOT+label:Workflow%253C0
17:13:58 <tongl> johnsom: Let me have a look at it and resolve the comments. It is the nsxv driver.
17:14:03 <johnsom> Some great stuff coming in fixing octavia-dashboard issues
17:14:39 <tongl> nice
17:14:56 <johnsom> I have been trying to catch up on patch reviews. A number have failed when I go to test them out.
17:15:06 <johnsom> If you have open patches check to see if I have commented.
17:15:45 <johnsom> Any other patches/bugs to discuss today?
17:16:17 <pksingh> https://review.openstack.org/#/c/486499
17:16:18 <patchbot> patch 486499 - octavia - Add flavor, flavor_profile table and their APIs
17:16:56 <pksingh> recently submitted my first patch to octavia; one gate job is failing, but it seems unrelated to the code
17:17:08 <pksingh> please submit your reviews
17:17:14 <johnsom> Looks like that gate failure was the OVH bug with qemu crashing
17:17:32 <johnsom> Yeah, that one is an infra host issue and not your code.
17:17:39 <johnsom> #link http://logs.openstack.org/99/486499/13/check/gate-octavia-v1-dsvm-py3x-scenario-multinode/68dec49/logs/libvirt/qemu/instance-00000002.txt.gz
17:17:46 <johnsom> cirros doesn't even boot there
17:18:04 <pksingh> ok, thanks :)
17:18:21 <johnsom> Cool, glad to see that is ready for review!
17:18:23 <johnsom> Thanks
17:18:25 <tongl> Is this the flavor support we discussed during PTG?
17:19:02 <pksingh> implementation of https://review.openstack.org/#/c/392485/
17:19:02 <patchbot> patch 392485 - octavia - Spec detailing Octavia service flavors support (MERGED)
17:20:34 <pksingh> I got to know from jniesz that someone is working on provider support
17:20:52 <pksingh> I would like to help there too if any help is needed
17:21:01 <johnsom> I think some folks are working on writing up the spec.
17:21:25 <johnsom> longstaff Do you have an update on how that is going?
17:21:45 <longstaff> We've been delayed a bit but will be working on it next week
17:22:37 <pksingh> johnsom: should we wait for that spec to come up, or should we add flavor_profile metadata to the current octavia handler?
17:22:48 <johnsom> Ok, feel free to post what you have and let some of us hack on it too....
17:23:12 <pksingh> i did the same in https://review.openstack.org/#/c/484325/
17:23:12 <patchbot> patch 484325 - octavia - [WIP] Add provider Implementation
17:23:33 <johnsom> pksingh Well, we know it will change, but as long as we are ok re-working it
17:23:45 <johnsom> It might flush out any issues we missed, etc.
17:24:24 <pksingh> ok
17:24:33 <rm_mobile> The spec says it's dependent on providers, is that not really the case?
17:25:18 <pksingh> yes, it depends on the providers for validating the metadata part of flavor_profile
17:25:25 <pksingh> I have left that step out
17:25:42 <johnsom> It is, but the octavia driver handler is kind of like a provider (will need to be moved over)
17:26:29 <pksingh> johnsom: can I move ahead with treating the handler as a provider?
17:26:50 <johnsom> Well, the interface is totally going to change when we do providers
17:28:20 <pksingh> ok, then i will wait for provider spec to be merged
17:28:28 <johnsom> I would not spend too much time on the octavia handlers until we get farther with the provider spec
17:29:39 <johnsom> Ok, so I will work on reviewing the flavors work.  Thanks!
17:29:49 <pksingh> thanks
17:29:52 <johnsom> #topic Open Discussion
17:29:56 <johnsom> Other topics today?
17:30:42 <rm_work> Admin API stuff
17:31:22 <rm_work> I have a couple of Admin API type things that I'm going to look at tackling very soon (like, starting today or tomorrow probably)
17:31:23 <rm_work> Not sure if we want specs or if I should just show up with code
17:31:43 <rm_work> the Amphora Info endpoint is up and ready to merge: https://review.openstack.org/#/c/505404/
17:31:44 <patchbot> patch 505404 - octavia - Add admin endpoint for amphora info
17:31:49 <rm_work> the next couple I want to do are:
17:32:24 <rm_work> 1) A patch to clear out the spares pool, so when we push a new image, we can get rid of the old spares quickly / easily)
17:32:50 <johnsom> Spares pool is pretty straightforward, just an RFE is probably good for that
17:33:37 <rm_work> 2) Something to SYNC / retry LBs that are in bad states (ERROR, and possibly PENDING) because I have seen a number of LBs go to these states recently and it is ridiculous that there is no way to get a LB out of ERROR once it goes there
17:33:46 <rm_work> I think there are some things we could at least *try*
17:34:01 <johnsom> Let's talk about that one
17:35:05 <johnsom> Others?
17:35:21 <rm_work> also I'm tempted to have Housekeeping pop things from PENDING to ERROR if they've exceeded some timeout since last update
17:35:48 <rm_work> because when a LB has been in PENDING_UPDATE for 30 minutes, it's obviously stuck
17:35:56 <rm_work> and that state is immutable
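(For context, a minimal sketch of the housekeeping sweep rm_work is describing here: flip records that have sat in a PENDING_* provisioning status past a configurable timeout over to ERROR. The repository object, its methods, and the updated_at field are assumptions for illustration, not the actual Octavia housekeeping code.)

```python
# Hypothetical sketch only -- lb_repo, its methods, and the updated_at
# field are assumed for illustration.
import datetime

PENDING_STATES = ('PENDING_CREATE', 'PENDING_UPDATE', 'PENDING_DELETE')


class StuckRecordSweeper(object):
    """Periodically flip load balancers stuck in PENDING_* to ERROR."""

    def __init__(self, lb_repo, timeout_minutes=30):
        self.lb_repo = lb_repo
        self.timeout = datetime.timedelta(minutes=timeout_minutes)

    def sweep(self):
        cutoff = datetime.datetime.utcnow() - self.timeout
        for lb in self.lb_repo.get_all_in_states(PENDING_STATES):
            # Only touch records nothing has updated since the cutoff;
            # anything newer may still be owned by a live controller worker.
            if lb.updated_at and lb.updated_at < cutoff:
                self.lb_repo.update_provisioning_status(lb.id, 'ERROR')
```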
17:35:56 <jniesz> We have been hit with a couple of LBs going into ERROR, like when the amphora fails to get DHCP on boot
17:36:18 <johnsom> So, I take it there are not others.
17:36:22 <rm_work> yeah, I think possibly the correct approach for that is just to trigger failovers
17:36:35 <rm_work> johnsom: not off the top of my head yet
17:37:06 <jniesz> also when lb is in error, how can we be sure it cleans up all resources?
17:37:13 <johnsom> So, have you run to ground how these are getting stuck in PENDING_UPDATE?  That should not be happening.  Is it the controller process being restarted?
17:37:21 <rm_work> so in that case, should the "failover API call" just ... be the "sync" or should there be more logic?
17:37:37 <rm_work> johnsom: in at least one case i've seen, yeah, the worker restarts and leaves it hanging
17:38:22 <johnsom> Yeah, ok, so the whole job board thing.  Did we break the graceful shutdown of the process, or is this a host-failure situation?
17:39:22 <rm_work> jniesz: yeah, what i am thinking is we add an "attempt cleanup" method that tries to intelligently remove every piece (starting with VMs, then ports, then SGs) assuming it'll see a lot of 404s, and when it seems like everything is good, then do the failover path
17:39:23 <rm_work> or just fix the failover path to accept 404s in more places
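(A rough illustration of the "accept 404s" idea: wrap each delete call so a missing resource counts as already cleaned up. The not_found_exc parameter stands in for whatever not-found exception the nova/neutron client in use actually raises.)

```python
# Hypothetical sketch -- pass in the real client's not-found exception
# class via not_found_exc.
def delete_ignoring_404(delete_call, resource_id, not_found_exc):
    """Issue a delete and treat a missing resource as already cleaned up."""
    try:
        delete_call(resource_id)
    except not_found_exc:
        # The resource is already gone (deleted earlier or never created).
        pass
```

A failover or cleanup flow could then walk its list of compute and network resources with a wrapper like this instead of failing partway through when something has already been removed.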
17:39:24 <rm_work> johnsom: i think the graceful shutdown is borked
17:39:48 <jniesz> rm_work agreed that would make cleanup much better
17:39:49 <johnsom> jniesz That is exactly why the current state machine is ERROR->DELETED
17:40:05 <rm_work> yeah, because it was "easy" to start
17:40:25 <rm_work> but telling users "ah, i see your LB randomly went into error for you... time to delete it and start over!" is entirely unacceptable
17:40:58 <rm_work> the thing i've seen cause ERROR states the most often is actually failovers
17:41:12 <johnsom> Yeah, I agree with that.  I would really like to run to ground WHY they are going into provisioning state ERROR (we are not talking operation status here)
17:41:21 <rm_work> usually when something dumb happens like a network blip
17:41:34 <jniesz> or some other openstack component is having issues glance, neutron, etc...
17:41:38 <rm_work> right, yes
17:41:55 <johnsom> I'm hearing two things:
17:42:08 <rm_work> it's "access to external services" mostly
17:42:21 <rm_work> so again, is the answer to this maybe "to sync, trigger a failover"?
17:42:26 <johnsom> 1. We need to figure out why the worker is not gracefully shutting down (not finishing a workflow before exiting)
17:42:28 <rm_work> because we do have the failover API
17:42:45 <johnsom> 2. We need to evaluate adding more retry logic to the flows/tasks
17:42:51 <rm_work> yes, but also:
17:43:22 <rm_work> 3. We need to provide some way to clean up stuff that still manages to get into an undefined state
17:43:38 <rm_work> Because we are awesome but not awesome enough that I predict everything will always be bug-free
17:43:51 <johnsom> Yeah.  Sadly failing over an ERROR object could mean you get into a worse state
17:44:02 <rm_work> and us saying "well, we really should figure out *why*" when an operator has stuck LBs is not useful
17:44:42 <rm_work> so the ability to clean up PENDING state stuff (as an operator, like, FORCE-DELETE) would be nice
17:44:50 <johnsom> Like losing the VIP IP or having half an LB updated (one amp failover)
17:45:22 <rm_work> yes, these things are bad
17:45:23 <rm_work> and sometimes it happens
17:45:39 <rm_work> and our delete flows also need to be improved a bit I think, because about 50% of the time something goes to ERROR, a delete is just going to ERROR-loop
17:46:05 <johnsom> That is bad.  Delete should be able to clean up ERROR cleanly
17:46:08 <rm_work> things like the "security group in use" issue still bite us even in the gate sometimes
17:46:21 <rm_work> i've tried to fix that in my own driver
17:46:25 <rm_work> but it's tricky
17:46:53 <johnsom> Really?  I have ONLY seen that in the gate when the main process failed due to coding bugs
17:46:54 <tongl> I once had that ERROR-loop issue and ended up cleaning up the data to remove the lb resources.
17:47:06 <tongl> database
17:47:21 <rm_work> yeah i have to dig into the DB
17:47:24 <rm_work> quite often
17:47:32 <jniesz> same here
17:47:49 <johnsom> Yeah, so what I hope we can do is capture these as bugs, with logs, so we can understand the failure mechanism and work on good mitigation options.
17:48:37 <rm_work> so, my plan would be a call that: A) ignores the state, so you can do a delete in any state; B) tries to first catalogue every possible object that we need to clean up; C) attempts to carefully clean up all of them
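(A sketch of the shape rm_work describes: ignore the provisioning state, catalogue the resources believed to belong to the load balancer, then clean them up in dependency order while tolerating 404s at every step. The client objects, attribute names, and helper below are all assumptions for illustration, not a real Octavia API.)

```python
# Hypothetical three-phase force delete: (A) ignore provisioning status,
# (B) catalogue everything attached to the LB, (C) delete in dependency
# order (VMs, then ports, then security groups), treating "not found"
# as already cleaned up. All names here are illustrative.
def _delete_quietly(delete_call, resource_id, not_found_exc):
    try:
        delete_call(resource_id)
    except not_found_exc:
        pass  # Already gone -- that is fine for a force delete.


def force_delete_load_balancer(lb, compute, network, lb_repo, not_found_exc):
    # Phase A/B: build the catalogue up front, regardless of what state
    # the database record claims to be in.
    vm_ids = [amp.compute_id for amp in lb.amphorae if amp.compute_id]
    port_ids = [amp.vrrp_port_id for amp in lb.amphorae if amp.vrrp_port_id]
    sg_ids = [lb.vip_sg_id] if getattr(lb, 'vip_sg_id', None) else []

    # Phase C: careful cleanup, most-dependent resources first.
    for vm_id in vm_ids:
        _delete_quietly(compute.delete_server, vm_id, not_found_exc)
    for port_id in port_ids:
        _delete_quietly(network.delete_port, port_id, not_found_exc)
    for sg_id in sg_ids:
        _delete_quietly(network.delete_security_group, sg_id, not_found_exc)

    lb_repo.update_provisioning_status(lb.id, 'DELETED')
```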
17:48:41 <johnsom> It also lets us discuss the pros/cons, as some of these solutions have some really dangerous side effects
17:50:08 <johnsom> Like this one: if it's used while a controller is still working on the objects (still has the lock), you get into cases where your force-delete command cleans up parts, but the controller goes and creates more orphaned objects.
17:50:43 <johnsom> It's like it would need to check with the controllers to see if they are still "active" with that object
17:51:29 <rm_work> then what about the suggestion that HK flips things to ERROR after a configured timeout?
17:51:40 <johnsom> Our original plan for this was the job board implementation, where it passes the flow token to different controllers if one fails to check in and move it forward.
17:51:43 <rm_work> and we just assume you need to wait until that, and at that point anything is done
17:51:51 <rm_work> hmm
17:52:04 <rm_work> yeah i mean, jobboard would be great, we've been talking about it for 3 years
17:52:22 <johnsom> True.  Act/Act for 2ish
17:53:29 <johnsom> Are we ok with starting by capturing these scenarios in bugs (stories)?
17:53:47 <johnsom> I think we need to be capturing these and discussing solutions.
17:53:48 <tongl> I am ok with that
17:53:58 <jniesz> yes, I think looking at the specific use cases is a good start
17:54:24 <jniesz> specific issues
17:55:07 <johnsom> rm_work ?
17:55:22 <rm_work> ok. probably I will make this API in the meantime, but only use it downstream
17:55:23 <rm_work> and when we figure out what we need, i can tweak and push it up
17:55:54 <johnsom> Yeah, I just think it's going to get abused by folks not thinking about the situation enough.
17:56:21 <johnsom> That is the concern.
17:56:23 <rm_work> probably, but meanwhile i need runbooks to give to people who don't know octavia much, and that will keep me out of 24/7 on-call
17:56:31 <johnsom> Big red buttons are shiny
17:56:49 <rm_work> and as long as I design the big red button, i'd rather have them press that than the "call me at 2am on a weekend" button
17:57:25 <johnsom> Yeah, I just don't like spending days deleting orphaned objects and fixing corrupt DBs
17:57:33 <rm_work> i'm already DOING that
17:57:39 <rm_work> but I think I can find the orphaned stuff
17:57:43 <rm_work> with code
17:58:02 <johnsom> Ok, so looking forward to some bugs so we can understand the problems
17:58:10 <johnsom> Grin
17:58:30 <johnsom> We have two minutes, any other quick topics?
17:59:23 <jniesz> is the plan to use the failover API for changing flavors?
17:59:39 <johnsom> Oye, that is a topic isn't it
17:59:51 <johnsom> Can I put it on the agenda for next week?
17:59:57 <johnsom> Or in the channel.
18:00:00 <johnsom> We are out of time.
18:00:03 <jniesz> ok
18:00:13 <openstack> bh526r: Error: Can't start another meeting, one is in progress.  Use #endmeeting first.
18:00:15 <johnsom> #endmeeting