20:00:01 #startmeeting Octavia
20:00:02 Meeting started Wed Jun 22 20:00:01 2016 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:04 O.o
20:00:07 The meeting name has been set to 'octavia'
20:00:08 o/
20:00:10 o/
20:00:12 #topic Announcements
20:00:19 hi
20:00:20 Howdy!
20:00:22 I'm back........ Grin
20:00:24 hey
20:00:24 o/
20:00:27 Yay!
20:00:34 hi
20:00:34 o/
20:01:01 He's not back... HE'S OURS
20:01:03 MUAH HA HA HA HA HA
20:01:05 Also, mid-cycle planning
20:01:09 he's to my right!
20:01:18 #link https://etherpad.openstack.org/p/lbaas-octavia-newton-midcycle
20:01:19 He's to my 4:30
20:01:22 hi
20:01:27 * mhayden stumbles in
20:01:38 Please update with your attendance, etc.
20:01:48 * johnsom Waves to people
20:01:56 he's to my that-a-way
20:02:00 also make sure to note if you can't physically attend but want to be on a conference
20:02:03 we can set that up here at Rax
20:02:04 Any other announcements?
20:02:27 have we finalized which days?
20:02:53 dougwig it's marked in the etherpad and everything man!
20:03:03 I plan to be there for Monday through sometime Friday. I see we added a section about maybe fewer days.
20:03:08 (said in "The Dude"'s voice)
20:03:40 That probably means: dougwig can't stand to look at us for a full week.
20:04:09 i think it's the other way around
20:04:13 Do we need a vote?
20:04:15 grin
20:04:26 Heh!
20:04:44 I'm not against having the space available for the week, but us calling it a 4-day
20:04:52 it just says 'week of', with some debate.
20:05:00 Other folks traveling, sbalukoff, any comments on how many days?
20:05:12 always better to reserve and not need vs not reserve and need. That's just my $0.02
20:05:17 i'll be there mon-thu night, regardless of what y'all decide, because i've learned i'm too much of a grouch on the 5th day. not fit company for man nor beast.
20:05:37 +1 on dougwig being intolerable by the 5th day.
20:05:40 ;)
20:05:43 I think Friday usually ends up kinda minimally useful
20:05:49 * dougwig blushes.
20:05:55 Oye, that isn't exactly what I meant, but ok....
20:05:56 And actually, I think productivity that last day is always pretty waning anyway
20:06:25 Yeah, we end up spending the morning touching up some reviews, and then everyone bailing after lunch anyway
20:06:27 Maybe Mon-Thurs with Monday as a kind of "get everyone set up and started" day, and Tues-Thurs the real work days
20:06:29 So I think that's fine, we'll call it 4 days
20:06:29 But I'm all for working Monday -> Thurs (and possibly part of Friday, eh.)
20:06:44 yeah and we'll have the room Friday for stragglers :P
20:06:50 Sounds good.
20:06:51 Yeah, sounds good to me.
20:06:55 o/
20:07:23 Yep, works for me too. I will likely have to leave mid-day Friday myself.
20:07:38 Yep, gotta catch a flight back home.
20:08:12 #topic Brief progress reports / bugs needing review
20:08:28 #link https://review.openstack.org/#/c/306084
20:08:31 I need eyes on this
20:08:32 it's ready
20:08:34 How are things going? I have started doing reviews again. Lots of good stuff going on!
20:08:54 "it's ready" == ready to -2?
20:08:58 JK
20:09:03 I'm still being diverted by internal stuff and recovery from surgery, so wasn't able to get to many reviews this week; probably won't get to many this week either. :/
20:09:13 TrevorV: looks like it just passed my CI, which is what i was waiting for to review it.
20:09:59 TrevorV It looks like sbalukoff has a comment to attend to on that
20:10:16 Anything else?
20:10:20 #link https://review.openstack.org/#/c/310490/
20:10:22 mostly working on internal stuff; I still have a few open reviews that need to be looked at
20:10:34 johnsom I just noticed.
20:10:39 #link https://review.openstack.org/#/c/306083/
20:10:42 I'll tinker
20:10:43 That's because I'm a jerk.
20:10:45 sbalukoff: default rise is 2, fall is 3
20:10:47 for HAProxy
20:10:48 #link https://review.openstack.org/#/c/308091/
20:10:55 Thanks eezhova
20:10:58 just a note
20:11:00 rm_work: Oh? Ok-- we should probably do that, then.
20:11:26 I've been mainly working on internal again, lol
20:11:33 I'll post that note on the CR
20:11:46 rm_work: Sounds good.
20:11:58 eezhova: done!
20:12:01 Ok, I will be spinning up the review engine over the next week, so hopefully get our velocity back up.
20:12:13 Yay!
20:12:27 blogan, thanks!
20:12:31 #topic Should amphorae be rebootable?
20:12:38 Not sure who added this, but good topic
20:12:42 #link https://bugs.launchpad.net/octavia/+bug/1517290
20:12:42 Launchpad bug 1517290 in octavia "Not able to ssh to amphora or curl the vip after rebooting" [High,Opinion]
20:12:57 why does it need rebooting in the first place?
20:13:10 I added it
20:13:15 I think the answer is no
20:13:19 I agree with the cattle mentality here, but I have also attempted to maintain reboot functionality.
20:13:22 My thought is: "No, amphorae should not be rebootable."
20:13:23 but, we just need to make a decision so i can kill the bug or not
20:13:31 It actually was a huge pain for network namespaces.
20:13:48 what are the cases a reboot is absolutely needed though?
20:13:51 johnsom: Can you think of a reason why we'd want to allow rebooting of amphorae?
20:14:08 plus without reboot we're able to use some system like tiny core linux
20:14:11 We should make a policy call on this. At least to remove my guilt. grin
20:14:19 diltram: +1
20:14:48 diltram Not sure how that makes a difference with tiny core linux....
20:14:49 yeah, i'm about to be dealing with minifying the amp image, so i don't want to have to deal with this :P
20:14:54 you always have your amps on UPS's, huh?
20:14:55 johnsom: Shall we vote on it? ;) Seriously, though, I've not heard any compelling arguments for needing to be able to reboot in any case.
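[Editor's note: the `rise 2` / `fall 3` defaults mentioned above refer to HAProxy's active health checks: two consecutive passing checks bring a down server back up, three consecutive failures mark it down. A minimal illustrative backend stanza follows; the backend name, server name, and address are hypothetical, not Octavia's generated configuration:]

```
backend example_pool
    # 'check' enables active health checks on this server.
    # rise 2 / fall 3 are HAProxy's defaults, written out explicitly:
    #   2 consecutive passing checks -> server marked UP
    #   3 consecutive failing checks -> server marked DOWN
    server member1 192.0.2.10:80 check inter 5s rise 2 fall 3
```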
20:15:25 it probably depends on what the amp is
20:15:28 I was leaving it functional mostly thinking of the situation where an amp could reboot faster than our health check timeout, thus reducing nova churn
20:15:29 what happens if the hypervisor bounces? downtime while we decide it's dead and spin up another? is that faster than just letting it boot?
20:15:30 I mean, if an amp goes down to reboot, it should probably just failover anyway
20:15:30 johnsom: TCL is a RAM-based system; you actually need to run a command to sync data to disk and create a configuration to save data on storage
20:15:34 if someone writes a hardware amp driver, then they'd have to support it
20:15:42 hmm
20:15:43 but since we only support VMs right now...
20:15:53 that's an interesting point i guess dougwig <_<
20:16:03 because that happens
20:16:29 I think I have a better question: what's the *harm* in letting it reboot?
20:16:34 rarely, but imagine having to apply a Xen patch for an XSA and then ...
20:16:41 Well, also consider that SSL keys are supposed to be in RAM only on amphorae. They'd be gone after a reboot.
20:16:52 ^^ that
20:16:54 hmm yeah that was in our design but never completed
20:16:55 That's a downside
20:17:05 if we reboot are we taking it out of health?
20:17:10 and I'm not sure if we actually decided to worry about that
20:17:14 We should complete that. It's better security. :/
20:17:19 depends on the monitoring frequency
20:17:19 Also if we failover and it comes back can we account for that
20:17:34 well if we failover it, we'll nova-delete it
20:17:37 I also have the question if anyone has tried this recently? I think I have rebooted and sshed in without issue during namespace testing.
20:17:46 there are too many corner cases to cover if we allow reboots
20:17:57 +1
20:18:01 Yep.
20:18:08 I'm strongly against allowing amphorae reboots.
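[Editor's note: for context on the "SSL keys in RAM only" point, one common way to keep TLS material off non-volatile storage on Linux is a RAM-backed tmpfs mount, whose contents are intentionally lost on reboot. A generic /etc/fstab sketch; the mount point and size are illustrative, not Octavia's actual layout:]

```
# RAM-backed mount for TLS keys/certs; contents vanish on reboot,
# so a rebooted amphora could not serve TLS until re-provisioned.
tmpfs  /var/lib/octavia/certs  tmpfs  size=16m,mode=0700  0  0
```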
20:18:13 +1
20:18:15 we can always not support it and tell people YMMV
20:18:19 sbalukoff: i'm not sure even just RAM is enough on a shared hypervisor. we're in smoke and mirrors land without a soft HSM involved.
20:18:23 Unless someone can come up with a really compelling reason to go to the trouble of allowing them.
20:18:29 depends on the amp driver doesn't it?
20:19:10 dougwig yeah, it was an encrypted RAM volume, but yeah, without an HSM it "would" be possible.
20:19:21 i mean, I feel like dougwig's point is valid, it'd be scary if we had to bounce every hypervisor momentarily for a security patch and that meant forced cycling of 100% of amps
20:19:26 dougwig: Well, RAM is still probably better than non-volatile storage-- at least it's apparent then that it's not supposed to be written to non-volatile storage (even if the back-end does this without our say-so.)
20:19:57 rm_work those bounces are done by first migrating VMs, then bouncing, etc.
20:19:59 I think it depends on the amp image actually. Octavia doesn't care today. It is either up or not, based on the health monitoring.
20:20:12 hmm yeah i guess i don't actually have experience with doing it
20:20:17 rm_work: Yep. This is a known (and solved) problem in the cloud.
20:20:20 I think people would stop paying you if you start rebooting VMs randomly
20:20:37 well i just remember my servers were "rebooted" when that happened
20:20:43 but, i guess it was from the original migration
20:20:43 one example of a simple-seeming decision like this that SUCKS for operators is cinder... have you ever tried to reboot a cinder-volume server? no? enjoy finding all the related instances and pausing them first. it's a nightmare.
20:20:52 So, before we go too far here, I think we should test this bug to validate.
20:20:55 but the ssl key argument does have some validity.
20:21:04 +1
20:21:07 dougwig: Because block storage is ugly in a cloud.
20:21:19 also in the world of containers you won't need to reboot
20:21:31 sbalukoff +1
20:21:38 sbalukoff: what part of openstack is not ugly, and don't we exist to make it less ugly?
20:21:49 also if you don't like cinder I can put you in touch with people who sell storage
20:22:15 ok so then we can't actually close this bug yet? T_T
20:22:35 dougwig: True enough. But some parts are just always going to be ugly. Doing block storage (i.e. using a very, very old interface to storage because you can't be bothered to update your damned legacy application) is always going to be ugly.
20:22:46 are we actually agreeing to test this before officially taking a vote, to see if we can avoid taking a vote? :P
20:23:13 I would like someone to try it and either mark it invalid or confirmed
20:23:25 Yes, basically.
20:23:54 Again, I think we should just make a policy decision that amphorae are not rebooted, and go with that until someone comes up with a really compelling case to do it. (Compelling enough to revisit code that depends on not allowing reboot functionality.)
20:24:01 If we decide to not support reboots, we should open a bug to make the amp NOT come back up, to enforce that decision.
20:25:01 sbalukoff So not churning nova isn't a basic enough reason to maintain reboot capability? This just seems like one of those decisions we will come back and wish we hadn't made.
20:25:39 johnsom: Realistically, which cases are you worried about churning nova in?
20:26:20 If an amp could come up before we health-fail it over, it saves a nova boot, etc.
20:26:35 since the meat of the argument here is around SSL termination security issues, what is the ratio of ssl termination to not in your big operators' VIPs?
20:26:38 but that is highly YMMV
20:26:52 johnsom: But I mean: that's still a service disruption even if it's smaller than your health check thresholds.
20:27:01 Agreed, just exploring the issue.
20:27:18 sbalukoff Not in act/standby or act/act
20:27:39 johnsom: Right, but why would you need to do that reboot in the first place?
20:28:06 sbalukoff I don't have a "need". reboots happen
20:28:37 johnsom: Yeah, but then I suspect you'd want to detect that it happened. Seeing the amphora get recycled is a good indicator that something hiccupped there.
20:28:53 that's a stretch. :)
20:29:06 dougwig: What is?
20:29:25 that a cycled amp is a feature that provides notification.
20:29:47 dougwig: Right. But then, random reboots that are "normal" is also a stretch. :P
20:29:49 sbalukoff I don't disagree that you would like to know it happened, but that is different from how the situation resolves itself.
20:29:53 hi
20:30:05 sbalukoff: touche.
20:30:29 I guess I still have a certain public cloud nightmare about nova issues
20:30:31 sbalukoff so you're saying it's better to have random failovers than random reboots?
20:30:42 queues full and instance issues.
20:30:54 I perelman
20:30:54 TrevorV: Probably, yes.
20:31:12 I can't say I agree there sbalukoff, especially when you increase the frequency of either event.
20:31:23 I think I'd rather have 100 reboots of an amphora than 100 failovers
20:31:24 Hmmm. Ok, so we can vote or we can test and potentially kick the can....
20:31:29 johnsom: If your reboot / failover frequency is that high, you've got some serious other problems.
20:32:02 Right, but the failover frequency is much, much more taxing to other systems and provides potentially more down-time than just the reboots
20:32:05 and is just as detectable
20:32:05 sbalukoff I don't disagree. However that issue was resolved in a different way....
20:32:16 TrevorV: If you've got a piece of faulty hardware that keeps rebooting amphora VMs, it's better that the instance get failed over to some other hardware host the very first time.
20:32:17 if i saw that some VM rebooted 100 times, i'd think "ok that VM is busted", IF i even noticed (how would I notice?)
20:32:18 TrevorV +1
20:32:55 sbalukoff My argument there is if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway
20:33:20 Ok. Let's hold this discussion here. I will try to test by next week. Then we can do a vote if it is an issue now. Work?
20:33:34 Sounds good.
20:33:35 TrevorV: Uh... that argument goes against the idea that reboots would take less than the health check interval. You've got no reason to assume that.
20:33:40 if I saw that an amp cycled 100 times, i'd think "oh shit"
20:33:41 and it'd be easier to see, I think
20:34:08 No, sbalukoff the reboots of an amphora are less taxing than failing over, that's what I said, not a check interval
20:34:42 I want to leave some time for other discussions (and catch a shuttle in 30 if I can)
20:34:48 now as everybody has their paintbrush, are there other sheds in need of coloring?
20:34:53 #topic Open Discussion
20:35:07 Sorry, I meant to say that when you say "if the faulty hardware is the cause, a failover event would be issued before the amphora came back online from a reboot anyway"-- that's a faulty argument. You have no reason to believe faulty hardware wouldn't reboot quickly.
20:35:22 an ubuntu reboot is like 5 seconds. i know a nova spawn and orchestrate is longer.
20:35:33 dougwig +1
20:35:45 +1
20:36:01 also I will skip next week's meeting…
20:36:02 sbalukoff I'll concede that, I was just considering that, let's say a server in a rack was having issues, it's unlikely that everything comes back up super quickly, but you're right, I don't know that 100%
20:37:07 TrevorV: And my argument is that if you're having random reboot problems, and faulty hardware is the cause (or faulty software, like an improperly-set-up compute node), then it's better to get off that host the first time anyway.
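[Editor's note: the timing argument above can be sketched numerically. All of the constants below are illustrative assumptions, not Octavia's real settings: if an amp reboots inside the health-check detection window, the outage is just the reboot time; a failover always pays the detection window plus a full nova boot and orchestration.]

```python
# Illustrative numbers only -- none of these are Octavia's actual defaults.
heartbeat_interval = 10      # seconds between amphora heartbeats (assumed)
missed_before_failover = 3   # missed heartbeats before declaring it dead (assumed)
reboot_time = 5              # "an ubuntu reboot is like 5 seconds"
replacement_time = 120       # nova boot + orchestration of a new amp (assumed)

detection_window = heartbeat_interval * missed_before_failover  # 30 s


def downtime(allow_reboot: bool) -> int:
    """Rough worst-case data-plane outage for a single amp event."""
    if allow_reboot and reboot_time < detection_window:
        return reboot_time  # amp is back before the health manager acts
    return detection_window + replacement_time  # failover path


print(downtime(allow_reboot=True))   # 5
print(downtime(allow_reboot=False))  # 150
```

This is the crux of the disagreement in the log: whether the left branch (a fast reboot beating the detection window) happens often enough in practice to justify supporting reboots at all.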
20:37:29 sbalukoff but do we have the logic that failover will definitely choose a different host?
20:37:36 What if it fails over to the same host constantly?
20:37:40 I guess the core of my argument is that random reboots should be a relatively rare occurrence. If not, your cloud is already having serious problems anyway.
20:38:07 TrevorV: I'm assuming you have a cloud that is somewhat large. And therefore, it's unlikely to be re-issued to the same host.
20:38:38 Only large clouds are going to need to worry about nova scheduler bottlenecks and whatnot when you have a lot of failovers near the same time. :/
20:38:38 I'm more worried about software issues than hardware. Memory leaks, kernel panics, etc. in the amp.
20:38:54 Yes, yes, shame on us, I know, but "stuff happens"
20:39:04 These are fast reboot situations
20:39:07 reboots are very concentrated on VMs
20:39:20 a container wouldn't reboot, and I'm not sure what we would do with hardware
20:39:21 xgerman: What do you mean?
20:39:33 hardware wouldn't have 'amps'.
20:39:35 well, for generalization's sake I would just not allow reboots
20:39:42 so can i argue that a nova instance being deleted is not going to happen very often and we shouldn't solve for that case?
20:39:56 and if it does get deleted by an admin there are other serious problems?
20:39:59 * johnsom glares at blogan
20:40:04 blogan nova instances are deleted every time a failover happens
20:40:09 But not every time a reboot would happen
20:40:19 hence my tax comment
20:40:23 TrevorV: i'm alluding to a specific bug we've discussed before
20:40:26 blogan is referencing the failover flow issue with deleted instances
20:40:35 Haha
20:40:53 the same arguments are being made for this that i made for not handling that case
20:41:08 In a real production scenario, then yes, I would anticipate that deliberate deletions of amphorae by administrators are going to be a relatively rare occurrence.
20:41:14 they're not apples to apples though
20:41:22 They are not
20:41:44 Hell, on our blocks load balancer product (that a good portion of Octavia was modeled after) we have instances that have been running continuously for 4 years...
20:42:05 nope, but if an operator wants to reboot they can deactivate health monitoring, reboot, and be back on their merry way
20:42:09 Alright, well then, I concede to having a test and a decision next week.
20:42:11 (They probably shouldn't be, but I'm not in charge of applying security updates, etc. on them anymore. ;) )
20:42:14 To me it comes down to layers of resiliency. "Stuff happens" and I would like clean, efficient ways to deal with those situations that minimize the impact.
20:42:19 It's not like it's keeping us from doing anything if that testing is done, right?
20:42:24 xgerman: +1
20:42:25 sbalukoff: are those 4-year instances running on IBM hardware?
20:42:37 HP probably :-)
20:42:53 sbalukoff: and how are those instances not vulnerable to about 8 dozen openssl bugs by now?
20:42:57 dougwig: Nope. Which means they're going down soon in any case because the Blue Box datacenters are being phased out. ;)
20:43:08 dougwig: No comment. ;)
20:43:13 lol
20:43:22 wow...
20:43:30 Ok, any other topics for today?
20:44:02 sbalukoff http://az616578.vo.msecnd.net/files/2015/09/19/635782305346788765-336606072_2905279.jpg
20:44:02 * Frito observes the crickets
20:44:09 dougwig: Actually, we'd typically patch the libraries and restart haproxy. This doesn't require a reboot.
20:44:30 good that you threw that in the logs there to cover yourself.
20:44:44 Haha
20:45:20 Ok, going once.....
20:45:27 Thanks folks!
20:45:47 Awesome. Thanks for joining!
20:45:51 #endmeeting