16:02:13 #startmeeting Octavia
16:02:14 Meeting started Wed Apr 24 16:02:13 2019 UTC and is due to finish in 60 minutes. The chair is rm_work. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:15 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:17 The meeting name has been set to 'octavia'
16:02:29 Hey folks!
16:02:34 o/
16:02:34 hello everyone
16:02:52 Hi
16:02:55 Sorry for the slightly late start, still getting the hang of this
16:03:07 #topic Announcements
16:03:09 hi
16:03:55 The summit and PTG is next week!
16:04:04 not sure how to make that an official announcement thing
16:04:13 Are you cancelling any of the weekly IRC meetings?
16:04:23 it is #topic
16:04:35 Oh, you got it
16:05:14 ah i guess there's no real sub-topic stuff
16:05:22 so anyway, yeah. next week: summit+ptg!
16:05:33 should we have a meeting? what do people think?
16:06:00 #startvote Should we have a meeting next week? Yes No
16:06:01 Begin voting on: Should we have a meeting next week? Valid vote options are Yes, No.
16:06:03 Vote using '#vote OPTION'. Only your last vote counts.
16:06:03 I vote to cancel next week at least
16:06:12 #vote No
16:06:24 #vote No
16:06:25 look at all of this democracy happening right now
16:06:31 it warms my heart
16:06:39 #vote No
16:06:44 Already with the votes... lol
16:06:57 this is what you get when you make me wake up at 9am :D
16:08:09 ok no more votes?
16:08:15 I think that's prolly clear anyway
16:08:27 #endvote
16:08:28 Voted on "Should we have a meeting next week?" Results are
16:08:30 No (3): rm_work, johnsom, cgoncalves
16:08:50 Ok. So, meeting next week is cancelled!
16:08:51 Shall we count the hanging chads?
16:09:09 Any other announcements?
16:09:19 #link https://opendev.org/explore/repos
16:09:34 If you haven't noticed, infra made some big changes last week.
16:09:40 Oh yes! Everything is officially moved to OpenDev.org
16:09:50 All of the openstack git repos have changed to opendev.org
16:10:08 You may need to update some of your .gitreview files.
16:10:11 Congrats to infra for a relatively smooth transfer
16:10:33 hello
16:10:33 Also, most importantly, Depends-On links to the old domain break and may need to be updated on open reviews.
16:11:00 assuming you used a full URL, yes
16:11:06 I don't think I ever did it that way...
16:11:10 Yeah, I am super happy about the gitea move. The old git web was horrible
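[Note: for anyone updating checkouts after the OpenDev migration discussed above, the fix is usually just repointing .gitreview at the new Gerrit host. A minimal sketch, assuming the project stayed in the openstack/ namespace (octavia did); running `git review -s` afterwards recreates the gerrit remote against the new host.]

    [gerrit]
    host=review.opendev.org
    port=29418
    project=openstack/octavia.git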
16:12:35 When are you going to set the schedule for the PTG based on the etherpad?
16:12:55 Hmmm, that is a great question
16:12:59 #link https://etherpad.openstack.org/p/octavia-train-ptg
16:13:08 I think I might delegate the official PTG planner role
16:13:14 Any takers? johnsom? :D
16:13:17 * johnsom forgot how fun it is to not be the PTL
16:13:39 Let's see if I can make you un-forget ^_^
16:14:04 I will work on that today/tomorrow
16:14:09 cool
16:14:22 #action johnsom to make PTG schedule from the planning etherpad
16:14:35 Oh, one more announcement. Stackalytics is in theory fixed.
16:14:43 it was broken?
16:14:58 Yeah, it was giving random results.
16:14:58 missed that
16:15:05 lol
16:15:18 At one point it showed I had contributed less code for Stein than I had in a single patch
16:17:08 Ok, I think that's it for announcements then
16:17:30 #topic Brief progress reports / bugs needing review
16:17:42 Anyone have anything for this?
16:18:03 I do, lol
16:18:11 link away!
16:18:16 I've got a request for review of my change
16:18:38 not much from my side. Easter break Friday to Tuesday
16:18:41 I created a feature matrix for provider drivers:
16:18:44 #link http://logs.openstack.org/74/651974/6/check/openstack-tox-docs/015a575/html/user/feature-classification/index.html
16:19:05 The styling will improve once my fixes to sphinx-feature-matrix are released
16:19:14 cool
16:19:20 nice thanks
16:19:27 Just not sure it is for this section or for open discussion - https://review.opendev.org/#/c/652953/
16:19:39 I also started a patch to help the "non-graceful shutdown" situation
16:19:41 #link https://review.opendev.org/653872
16:19:50 #link https://review.opendev.org/#/c/652953/
16:20:33 I think this is part one of a few patches in this space as an interim until we fix flow resumption.
16:20:36 I may agree with Ann there
16:20:48 depending on the timeline we have for real flow resumes
16:21:05 Yeah, let's talk about that patch in open discussion. I have some thoughts there
16:21:08 and whether or not there's a ton of additional setup for that, where this might help in a larger portion of installs
16:21:54 Other than those I have been working on our slides for the summit presentations.
16:22:31 we see resources in the state johnsom's patch describes from time to time so will be watching that one to see how discussion goes
16:22:35 it actually looks like the patch ataraday_ linked is an alternate to the "temp solution"
16:22:45 Yes
16:22:57 ie, until we get real task resume, just allow deleting the bad ones
16:23:04 might be less complexity
16:23:21 Well, my patch requires no manual intervention
16:23:24 well, does anyone else have anything for this topic? or can we move directly to open discussion and talk about this?
16:24:11 I have one other quick topic I would like to add to the agenda if I can.
16:24:23 sure, whatsit?
16:24:29 Train features
16:24:53 ah, sure -- though I figured we'd do that major discussion during the PTG
16:25:00 #topic Train Features
16:25:29 I am creating our "project update" slides for the summit.
16:25:53 This deck includes a slide on anticipated features for Train.
16:26:14 Right now all I have is: retire neutron-lbaas, log offloading, and VIP ACLs.
16:26:32 Does anyone else have any features they plan to work on for Train they would like included in this slide?
16:26:53 i'd still like to get "members as a base resource" done
16:27:02 got a good start on that already
16:27:12 not sure if it's really major enough to list on that slide tho
16:27:20 Ok, I can include that if you would like
16:27:28 * rm_work shrugs
16:27:50 If the work on taskflow is accepted, I can work on that
16:27:51 we're anticipating active/active and support for a container based driver. when is the release date for train?
16:28:01 Ok, I just wanted to ask if there were other features planned for Train
16:28:19 #link https://releases.openstack.org/train/schedule.html
16:28:39 Feature freeze for Train is the week of Sept 9th
16:28:42 I think officially, we'll be deciding the feature goals at the PTG, so maybe just say something to that effect -- "this is a preliminary list, we'll be discussing more at the PTG, please join us"
16:28:53 ok
16:29:12 Right, it always has the disclaimer
16:29:35 ataraday_ If you have resources to work on that, I will add it. I might also be able to help that effort.
16:29:36 then that's probably fine
16:30:09 ataraday_: yeah, what we prioritize is largely based on what people are able to commit time for -- we'll accept whatever you think you can do :)
16:30:11 colin- Are there lines you would like me to add for your efforts on Train?
16:30:34 the etherpad topics cover what we're interested in, was just reviewing that
16:31:05 oooo, neutron-lbaas deprecation is THIS CYCLE, really? for realsies?
16:31:09 This presentation happens before the PTG, so it's a bit "guesstimation"
16:31:11 we haven't organized around it internally but understand both those bodies of work need sponsors atm
16:31:17 what a time to be alive
16:31:20 and to be PTL :D
16:31:36 I'm the resource, and as I spend some time on this topic, I can do more to make it happen :)
16:32:07 colin-: yes, both are things we'd love to see, but both have had people come, do work, and disappear
16:32:21 so it's really hard to say -- right now we don't have people actively active/active-ing
16:32:28 and containers is ???
16:32:32 Cool, I will add flow resumption as a Train feature goal
16:32:41 whatever you can commit to doing is appreciated
16:33:00 understood, that aligns with where we thought those were. and johnsom has helpfully shared the relevant links to what was most recently done on active/active
16:33:04 Ha, we have a working lxd proof of concept (if you turn all of the container security off)
16:33:26 what about with zun?
16:33:37 nova-lxd i'm guessing
16:33:44 that sounds pretty cool
16:33:48 Yes, with nova-lxd.
16:34:08 #link https://review.opendev.org/636066
16:34:10 and
16:34:17 #link https://review.opendev.org/636069
16:34:56 thanks will check those out!
16:35:08 cool
16:35:14 There is no way I would actually use that stuff though. It is a bit messy
16:35:51 ok, should we move to open discussion then? did we cover this adequately?
16:36:01 +1 thanks for the feedback!
16:36:04 we did for my part ty
16:36:20 ok
16:36:24 #topic Open Discussion
16:36:55 I have something for this, but we can resume your thing first
16:38:21 johnsom?
16:38:28 My biggest concern with starting to add --force flags is that it overrides our object locking/ownership system. The only way that command could be used safely is if the operator checks that none of the controllers are actively working on the object before using the --force.
16:38:47 For example, HM failovers will set PENDING_*.
16:39:05 It also still requires operator intervention.
16:39:58 My concern is that right now the operator goes into the db and manually sets the status to delete; I think we should try to avoid this as much as possible...
16:40:05 Operators also can do this operation against the DB if it is really necessary. It seems like --force makes it too easy to abuse.
16:40:23 IIRC it all comes down to not trusting our admin users
16:40:41 well, would --force be allowed for admin only, or anyone?
16:40:45 hopefully admin?
16:40:53 why not trust? Admin should know stuff
16:41:03 The cases we know of that can lead to PENDING_ in a stuck state are "kill -9" or loss of a controller mid-flow.
16:41:05 yeah ok just re-read the commit message
16:41:05 ataraday_, "should" is the keyword ;)
16:41:13 ataraday_: agreed, we cannot sustain a model where our operators are required to update octavia's tables with any regularity
16:41:30 it's too risky
16:41:30 colin-: is this happening with regularity?!
16:41:39 That is why I approached it as, have the controller look for things it owns on startup and correct the status for those.
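[Note: the manual DB intervention mentioned above usually amounts to resetting the stuck object's provisioning status so the API will accept a delete or failover again. A rough sketch, assuming the usual Octavia schema (verify table and column names for your release), and only after confirming no controller is still acting on the object:]

    -- reset a load balancer stuck in PENDING_* so the API will act on it again
    UPDATE load_balancer
       SET provisioning_status = 'ERROR'
     WHERE id = '<load balancer UUID>';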
16:41:50 admins are doing `kill -9` to octavia services with regularity? lol
16:42:05 no, but that is only one way to induce the symptom the change describes
16:42:13 what are other ways?
16:42:15 at least, we're not killing processes that way
16:42:21 Yeah, I think the common case is mis-configured systemd service definitions where systemd gives up and does a kill -9
16:42:30 hmm
16:42:44 I added a check for time - how long the load balancer has not been updated; is that not enough to be safe?
16:42:46 rm_work, not 'kill -9' per se. I think it's more cloud updates, controller reboots (scheduled or not)
16:43:01 agree with cgoncalves
16:43:17 Still, a reboot should be a graceful shutdown if the systemd service is configured correctly.
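[Note: one illustration of the systemd point above. A drop-in override can give the controller time to finish in-flight flows before systemd escalates to SIGKILL; the unit name and timeout value here are placeholders, not a tested recommendation:]

    # /etc/systemd/system/octavia-worker.service.d/graceful-stop.conf
    [Service]
    # send the stop signal (SIGTERM) to the main process and allow it time to
    # wind down before systemd falls back to SIGKILL
    KillMode=mixed
    TimeoutStopSec=300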
16:43:24 personally I think ataraday_'s approach does make sense, but
16:43:39 do we need either of these if we think we can complete real job resumes within the cycle?
16:44:09 No, but that is a lot of work.
16:44:16 wouldn't it be better to just prioritize doing that, and not add a bunch of other extra complexity?
16:44:16 certainly the need for it is less (gone?) if we guarantee they can't get into that state ever
16:45:32 of course, I was also a fan of an admin "sync" type command in the past, so
16:45:34 There is also a failure point with oslo messaging/rabbit. If those are not set up correctly and the queue gets lost, we could end up with PENDING_*. That is another discussion.
16:45:42 personally I'd like to have an interim solution that is backportable but I understand if it cannot be done. if we agree it's not, I'd run to document the steps that should be taken to prevent killing flows unnecessarily (pre-maintenance window actions) and corrective actions if too late/unexpected controller reboots
16:45:52 oh i hadn't considered that but yeah
16:46:01 there'd be instructions octavia thinks are eventually going to complete still
16:46:04 I think we need --force, as the operator should not go into the db :) I don't think it will be finished in one cycle, at least only in experimental mode
16:46:09 Yeah, neither of these are backportable solutions.
16:47:20 I might be convinced to set up a periodic job as well. That would be backportable. We just need to figure out the right conditions for it.
16:47:40 when i brought up sync in the past, my argument had been "we're not going to catch all the cases, why can't we have a way to fix the stuff we can't predict" so I'd feel hypocritical arguing against ataraday_ here
16:48:00 What we really want is a consensus protocol such that all the controllers can say "I'm not working on that object".
16:48:19 yes, that would work if we could figure out a way to do it
16:48:35 would people be ok making this a topic for the PTG for a larger discussion? it's only a week away
16:48:46 yes, it seems complex enough
16:48:51 it's a small delay but we could get a better-researched and agreed consensus there
16:48:56 Sadly ataraday_ can't join us at the PTG
16:49:01 remotely?
16:49:10 we usually set up video-conf
16:50:27 BTW, I will give my normal caveat, I am not a hard now on the --force thing. I just want us to consider all the ramifications of going down that path before we add it and can't remove it.
16:50:39 s/now/no/g
16:51:22 yes, the "can't remove it" thing is my main worry
16:51:31 we could do that, yes. question is if ataraday_ would be available. since there's no agenda schedule fixed yet, we could also consider ataraday_'s timezone
16:51:39 we have to be fully committed to API changes
16:51:48 yep
16:51:53 Yes, I can schedule around people's availability
16:52:06 ataraday_: can you remotely attend part of the PTG?
16:52:33 whelp
16:53:09 i think we say "hopefully" and plan to discuss it at the PTG
16:53:45 or is ataraday present
16:54:02 I think she is in Europe, so I will try to put in an early morning timeslot if we don't hear otherwise.
16:54:11 +1
16:54:11 ok, sounds good to me
16:54:35 I had a topic too, though we're a bit short on time
16:55:28 We're discussing internally adding support (upstream-first) for Athenz authentication for amphora/control-plane communications (to replace the local cert generation)
16:55:45 #link https://www.athenz.io/
16:56:04 sorry, I got disconnected
16:56:25 I actually don't know enough about it personally yet to know if it'd be as simple as another driver like the Anchor thing, or if we'd need to modify things significantly
16:56:38 rm_work would it replace the local cert capability, or just be another driver option?
16:56:41 but it's on my roadmap to find out, and probably I'll have more for discussion at the PTG
16:57:04 I don't think it could actually replace local-cert-gen
16:57:14 ataraday_ We asked if there was a chance we could video conference you in for the discussion at the PTG? Is there a timeslot that would work for you?
16:57:23 since that is the most basic and we'd want that for simple deployments and testing stuff regardless
16:57:33 wouldn't want to make athenz a hard requirement
16:57:41 but it might be a good optional thing
16:57:46 agreed
16:57:55 Yeah, I think as another driver option I don't see any reason why not.
16:58:15 I was curious if anyone else uses any kind of in-house sshca system
16:58:35 It's Apache licensed, so not a concern there either
16:58:52 we use this internally at Verizon Media (formerly known as Oath, formerly known as Yahoo), as it was born here
16:59:40 but this is a fairly common thing (authz / authn) and if there are other similar things, it might be worth trying to make it a generic pattern
16:59:41 We still need to "retire" the anchor stuff from the octavia repo
16:59:54 yeah, I can do that around the same time
17:00:04 we can discuss this one more at the PTG also
17:00:18 well, we're out of time
17:00:24 thanks for coming everyone!
17:00:24 I'm in UTC+4 zone, sure I will try to connect for PTG discussion - just set up time and send me a link :)
17:00:36 cool, thanks ataraday_ :)
17:00:42 #endmeeting