20:00:03 #startmeeting Octavia
20:00:04 Meeting started Wed Feb 27 20:00:03 2019 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:07 The meeting name has been set to 'octavia'
20:00:11 Hi folks
20:00:17 hi
20:00:27 o/
20:00:46 #topic Announcements
20:01:00 hi
20:01:07 The TC elections are on. You should have received an e-mail with your link to the ballot.
20:01:28 The octavia-lib feature freeze is now in effect.
20:01:47 I have also released version 1.1.0 for Stein with our recent updates.
20:02:07 nice
20:02:18 And the most important: NEXT WEEK IS FEATURE FREEZE FOR EVERYTHING ELSE
20:03:18 As usual, we are working against the priority list:
20:03:27 #link https://etherpad.openstack.org/p/octavia-priority-reviews
20:03:45 Any other announcements today?
20:04:22 #topic Brief progress reports / bugs needing review
20:04:58 I have mostly been focused on the TLS patch chains. The TLS client authentication patches have now merged. They work well in my testing.
20:05:30 I'm currently working on the backend re-encryption chain. I hope I can finish that up today, give it a test, and we can get that merged too.
20:06:39 If all goes well, I might try to help the volume-backed storage patch and see if we can get it working for Stein. I created a test gate, but the patch fails...
20:07:27 Any other updates?
20:07:32 I have been working on multiple fronts
20:07:44 o/
20:07:45 1. RHEL 8 DIB and amphora support (tempest tests passing)
20:07:48 #link https://review.openstack.org/#/c/623137/
20:07:49 appreciate the oslo merge, rebuilt and running at that point in master now
20:07:54 #link https://review.openstack.org/#/c/638581/
20:08:01 2. Allow ERROR'd load balancers to be failed over
20:08:06 #link https://review.openstack.org/#/c/638790/
20:08:17 3. iptables-based active-standby tempest test
20:08:18 #link https://review.openstack.org/#/c/637073/
20:08:36 4. general bug fix backports
20:09:08 Cool, thank you for working on the backports!
20:09:32 +1
20:09:54 The stable/rocky grenade job is sadly still broken. I apologize for not having invested much time in it yet.
20:10:10 That is next on the agenda, I wanted to check in on that issue.
20:10:36 #topic Status of the Rocky grenade gate
20:11:03 I just wanted to get an update on that. I saw your note earlier about a potential cause.
20:11:10 right
20:11:13 #link https://review.openstack.org/#/c/639395/
20:11:26 Are you actively working on that or is it an open item?
20:11:38 ^ this backport now allows us to see what is going wrong when creating a member
20:11:50 that is where the grenade job is failing
20:11:57 the error is: http://logs.openstack.org/49/639349/5/check/octavia-grenade/461ebf7/logs/screen-o-cw.txt.gz?level=WARNING#_Feb_27_08_32_43_986674
20:12:27 the rocky grenade job started failing between Dec 14-17, if I got that right
20:13:00 so I'm wondering if https://review.openstack.org/#/c/624804/ is what introduced the regression
20:13:30 the member create call still fails on queens, not rocky
20:13:38 with all those regressions it looks like we are lacking gates
20:14:17 xgerman, speaking of that, your VIP refactor patch partially broke active-standby in master :P
20:14:27 I put up a fix
20:14:41 Yeah, not sure how the scenario tests passed but grenade is not.
20:14:52 xgerman, I don't see it.
we can chat about that after grenade
20:15:17 xgerman It looks like in my rush I forgot to switch it off of amphorae....
20:15:22 lol
20:15:53 yeah, two small changes and it came up on my devstack
20:16:08 xgerman, ah, I see it now. you submitted a new PS to Michael's change
20:16:13 yep
20:16:13 Cool, I just rechecked my act/stdby patch which is set up to test that
20:16:23 #link https://review.openstack.org/#/c/638992/
20:16:42 #link https://review.openstack.org/#/c/584681
20:17:07 Ok, so cgoncalves you are actively working on the grenade issue?
20:17:37 johnsom, I will start actively tomorrow, yes
20:18:06 Ok, cool. Thanks. Just wanted to make sure we didn't each think the other was looking at it, when in reality none of us were....
20:18:20 #topic Open Discussion
20:18:42 I have one open discussion topic, but will open the floor up first to other discussions
20:18:44 I'm sure you'll be looking at it too, at least reviewing ;)
20:19:09 Other topics today?
20:19:29 Ok, then I will go.
20:19:32 would like to solicit guidance
20:19:34 very briefly
20:19:44 Sure, go ahead colin-
20:20:23 an increasing number of internal customers are asking about the performance capabilities of the VIPs we create with octavia, and we're going to endeavor to measure that really carefully in terms of average latency, connection concurrency, and throughput (as these all vary dramatically based on cloud hw)
20:21:06 Yes, I did a similar exercise last year.
20:21:08 so, aside from economies of scale with multiple tcp/udp/http listeners, does anyone have advice on how to capture this information really effectively with octavia and its amphorae?
20:21:38 and i'm hoping to use this same approach to measure the benefits of various nova flavors and haproxy configurations later in stein
20:23:00 Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware.http_proxy_to_wsgi https://review.openstack.org/639736
20:23:03 Yeah, so I set up a lab: three hosts for traffic generation, three for content serving, one for the amp
20:23:27 I used iperf3 for the TCP (L4) tests and tsung for the HTTP tests
20:23:47 Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware.http_proxy_to_wsgi https://review.openstack.org/639736
20:23:55 I wrote a custom module for nginx (ugh, but it was easy) that returned static buffers.
20:24:14 did you add any monitoring/observability tools for visualizing?
20:24:18 Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware http_proxy_to_wsgi https://review.openstack.org/639736
20:24:23 or was shell output sufficient for your purposes
20:24:28 I did one series where traffic crossed hosts, and one with everything on one host (eliminates the neutron issues).
20:24:45 tsung comes with reporting tools
20:25:00 oh ok
20:25:02 I also did some crossing a neutron router vs. all L2
20:25:31 Then it's just a bunch of time tweaking all of the knobs
20:25:39 good feedback, thank you
20:26:20 For the same-host tests, iperf3 with 20 parallel flows, 1 vCPU, 1GB RAM, 2GB disk did ~14Gbps
20:27:02 But of course your hardware, cloud config, butterflies flapping their wings in Tahiti, all impact what you get.
20:27:13 caveat, caveat, caveat.....
20:28:22 yeah indeed. if anyone else has done this differently or tested different hardware NICs this way please lmk! that's all i had
20:28:41 Yeah, get ready to add a ton of ******
20:28:49 for all the caveats
20:29:35 I can share the nginx hack code too if you decide you want it.
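(For reference, a minimal harness along the lines of the iperf3 runs described above: 20 parallel TCP flows against a VIP, repeated and averaged. This is only a sketch, not the lab setup from the meeting. The VIP address, port, and run counts are placeholders, and it assumes iperf3 is installed locally with iperf3 servers listening on the members behind the VIP.)

    import json
    import statistics
    import subprocess

    VIP = "203.0.113.10"   # placeholder VIP address
    PORT = 5201            # placeholder TCP listener port (iperf3 default)
    FLOWS = 20             # parallel flows, as in the run described above
    DURATION = 30          # seconds per run
    RUNS = 3

    def run_iperf3(vip, port, flows, duration):
        """Run one iperf3 client pass against the VIP, return received Gbit/s."""
        proc = subprocess.run(
            ["iperf3", "-c", vip, "-p", str(port), "-P", str(flows),
             "-t", str(duration), "-J"],
            check=True, capture_output=True, text=True)
        result = json.loads(proc.stdout)
        return result["end"]["sum_received"]["bits_per_second"] / 1e9

    gbps = [run_iperf3(VIP, PORT, FLOWS, DURATION) for _ in range(RUNS)]
    print("avg %.2f Gbit/s over %d runs (min %.2f, max %.2f)"
          % (statistics.mean(gbps), RUNS, min(gbps), max(gbps)))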
20:30:22 Ok, so we have this issue where if people kill -9 the controller processes we can leave objects in PENDING_*
20:30:31 also are you running the vip on an overlay? Or dedicated vlan, etc.
20:31:34 I have an idea for an interim solution until we do jobboard/resumption.
20:31:37 johnsom: that type of thing was supposed to get fixed when we adopt job-board
20:31:49 lol, yeah, that
20:31:54 our task(flow) engine should have a way to deal with that
20:31:59 that's why we went with an engine
20:32:21 It does, in fact multiple ways, but that will take some development time to address IMO
20:33:18 So, as a short-term, interim fix I was thinking that we could have the processes create a UUID unique to its instance, write that out to a file somewhere, then check it on startup and mark anything it "owned" as ERROR.
20:33:25 Thoughts? Comments?
20:33:26 if $time.now() > $last_updated_time+$timeout -> ERROR?
20:33:40 The hardest part is where to write the file....
20:34:41 It would require a DB schema change, which we would want to get in before feature freeze (just to be nice for upgrades, etc.). So I thought I would throw the idea out now.
20:36:15 I think the per-process UUID would be more reliable than trying to do a timeout.
20:36:41 Vlad Gusev proposed openstack/octavia master: Add support for the oslo_middleware http_proxy_to_wsgi https://review.openstack.org/639736
20:37:14 hmmm
20:37:57 what about then flipping status to PENDING_UPDATE? maybe only valid for certain resources
20:38:21 The only downside is we don't have a /var/lib/octavia on the controllers today, so it's an upgrade/packaging issue
20:38:41 and not backportable
20:39:01 Right, the "don't do that" still applies to older releases
20:39:23 I didn't follow the PENDING_UPDATE comment
20:40:15 nah, never mind. it prolly doesn't make any sense anyway xD (I was thinking along the same lines of allowing ERROR'd LBs to be failed over)
20:40:25 It would have to flip them to ERROR because we don't know where in the flow they killed it
20:41:15 Yeah, maybe a follow-on could attempt to "fix" it, but that is again logic to identify where it died. Which is starting the work on jobboard/resumption.
20:41:33 thinking of a backportable solution, wouldn't timeouts suffice?
20:42:48 I don't like that approach for a few reasons. We seem to have widely varying performance in the field, so picking the right number would be hard, short of making it an hour or something, which defeats the purpose of a timely cleanup
20:43:05 mmh, people would likely be happy if we just flip PENDING to ERROR with the housekeeper after a while
20:43:30 I mean we already have flows that time out after 25 minutes due to some deployments, so it would have to be longer than that.
20:43:49 some operators tend to trade resources for less work… so there's that
20:44:10 Yeah, the nice thing about the UUID too is it shames the operator for kill -9
20:44:22 We know exactly what happened
20:44:28 or for having servers explode
20:44:44 or power switch mistakes
20:44:48 also more and more clouds run services in containers, so docker restart would basically mean kill -9
20:44:55 yep
20:45:05 Yep, k8s is horrible
20:45:22 you don't need k8s to run services in containers ;)
20:45:23 stop, my eyes will roll out of my head
20:45:27 I mean openstack services!
20:45:46 yeah, we should rewrite octavia as a function-as-a-service
20:46:01 I know, but running the openstack control plane in k8s means lots of random kills
20:46:09 indeed
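(A minimal sketch of the per-process UUID idea above, not Octavia code: each controller process records its UUID in a state file and, on startup, flips anything still in PENDING_* that the previous instance "owned" to ERROR. The owner column on the load_balancer table, the state file path, and the DB URL are all assumptions; the real change would need the new schema column and the /var/lib/octavia location discussed above.)

    import os
    import uuid

    import sqlalchemy as sa

    STATE_FILE = "/var/lib/octavia/worker_id"  # hypothetical location (see above)
    DB_URL = "mysql+pymysql://octavia:secret@127.0.0.1/octavia"  # placeholder

    PENDING = ("PENDING_CREATE", "PENDING_UPDATE", "PENDING_DELETE")

    def recover_previous_run(engine):
        lb = sa.Table("load_balancer", sa.MetaData(), autoload_with=engine)
        old_id = None
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                old_id = f.read().strip()
        if old_id:
            with engine.begin() as conn:
                # Anything the dead process still "owned" in a PENDING_* state
                # can never complete, so mark it ERROR and release it.
                conn.execute(
                    lb.update()
                    .where(lb.c.owner == old_id,  # "owner" is an assumed new column
                           lb.c.provisioning_status.in_(PENDING))
                    .values(provisioning_status="ERROR", owner=None))
        # Record a fresh UUID for this instance of the process.
        new_id = str(uuid.uuid4())
        with open(STATE_FILE, "w") as f:
            f.write(new_id)
        return new_id

    worker_id = recover_previous_run(sa.create_engine(DB_URL))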
20:46:40 so how difficult is job board? did we ever look into the effort?
20:46:53 Anyway, this is an option, yes, may not solve all of the ills.
20:47:11 Yeah, we did, it's probably going to be a cycle's worth of effort to go full job board.
20:47:40 There might be a not-so-full job board that would meet our needs too, but that again is going to take some time.
20:48:15 I would rather start on the "right" solution than do kludges
20:48:19 I was unaware of jobboards until now. does it sync state across multiple controller nodes?
20:48:50 not really, but it accomplishes the same thing.
20:49:27 asking because if octavia worker N on node X goes down, worker N+1 on node X+1 takes over
20:49:43 So first it enables persistence of the flow data. It uses a set of "worker" processes. The main jobboard assigns and monitors the workers' completion of each task
20:50:02 Right, effectively that is what happens.
20:50:06 without a syncing mechanism, how would octavia know which pending resources to ERROR?
20:50:11 do we need a zookeeper for jobboard? Yuck!
20:50:25 Much of the state is stored in the DB
20:50:32 ok
20:50:53 jobboard = ?, for the uninitiated
20:50:55 Yeah, so there was a locking requirement I remember from the analysis. I don't think zookeeper was the only option, but maybe
20:50:59 is this a work tracking tool?
20:51:31 ah, disregard
20:51:55 https://docs.openstack.org/taskflow/ocata/jobs.html
20:52:00 #link https://docs.openstack.org/taskflow/latest/user/jobs.html
20:52:13 Anyway, I didn't want to go deep on the future solution.
20:52:45 What I am hearing is we would prefer to leave this issue until we have resources to work on the full solution and that an interim solution is not valuable
20:53:32 #vote?
20:53:39 I still didn't get why timeouts wouldn't be a good interim (and backportable) solution
20:53:56 What would you pick as a timeout?
20:54:24 whatever is in the config file
20:54:27 We know some clouds complete tasks in less than a minute, for others it takes over 20
20:54:36 if load balancer creation: build timeout + heartbeat timeout
20:54:53 otherwise, just heartbeat timeout. no?
20:54:56 So 26 minutes?
20:55:12 better than forever and ever
20:55:26 and not being able to delete/error
20:55:40 I don't think we can backport this even if it has a timeout really
20:56:16 The timeout would be a new feature to the housekeeping process
20:56:34 no API or DB schema changes. no new config option
20:56:41 The other thing that worries me about timeouts is folks setting it and not understanding the ramifications
20:56:42 yeah that's tricky, i too don't want to leave them (forever) in the state where they can't be deleted
20:56:44 it would be a new periodic in housekeeping
20:57:52 yeah, I am haunted by untuned timeouts almost every day
20:58:01 xgerman: thanks for the link
20:58:05 Yep. I think it breaks the risk-of-regression and self-contained rules
20:58:44 And certainly the "New feature" rule
20:59:24 Well, we are about out of time. Thanks folks.
20:59:27 "Fix an issue where resources could eternally be left in a transient state" ;)
20:59:41 If you all want to talk about job board more, let me know and I can put it on the agenda.
20:59:52 I will certainly read more about it
21:00:20 I just think it's a super dangerous thing in our model to change the state out from under other processes
21:00:25 #endmeeting
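(For completeness, a rough sketch of the timeout-based housekeeping periodic debated above, the interim approach that was ultimately not favored: flip any resource stuck in PENDING_* longer than a configured window to ERROR. The table and column names, DB URL, and the 30-minute window are assumptions; as noted in the discussion, the window would have to exceed the ~25-minute flow timeouts some deployments already hit.)

    import datetime

    import sqlalchemy as sa

    DB_URL = "mysql+pymysql://octavia:secret@127.0.0.1/octavia"  # placeholder
    STUCK_AFTER = datetime.timedelta(minutes=30)  # must exceed the ~25 minute
                                                  # flow timeouts noted above

    PENDING = ("PENDING_CREATE", "PENDING_UPDATE", "PENDING_DELETE")

    def expire_stuck_pending(engine):
        """One housekeeping pass: ERROR anything left in PENDING_* too long."""
        lb = sa.Table("load_balancer", sa.MetaData(), autoload_with=engine)
        cutoff = datetime.datetime.utcnow() - STUCK_AFTER
        with engine.begin() as conn:
            result = conn.execute(
                lb.update()
                .where(lb.c.provisioning_status.in_(PENDING),
                       lb.c.updated_at < cutoff)  # updated_at assumed present
                .values(provisioning_status="ERROR"))
            count = result.rowcount
        return count

    if __name__ == "__main__":
        print("expired %d stuck load balancers"
              % expire_stuck_pending(sa.create_engine(DB_URL)))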