16:00:33 #startmeeting Octavia
16:00:34 Meeting started Wed Jun 17 16:00:33 2020 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:37 The meeting name has been set to 'octavia'
16:00:56 Hello everyone
16:01:01 Hi
16:01:02 hi
16:01:05 Hello!
16:01:06 o/
16:01:24 Hi!
16:02:19 hi
16:02:40 Merged openstack/octavia master: Use uwsgi binary from path https://review.opendev.org/736137
16:02:43 #topic Announcements
16:02:58 Well, that was one announcement ^^^^
16:03:53 uwsgi was broken for devstack recently. That patch should resolve the master branch.
16:05:09 Does anyone have any other announcements this week?
16:06:16 aannuusshhkkaa and shtepanie are joining us for the summer as dev interns at vzm!
16:06:40 Yay! Welcome
16:06:40 nice, welcome!
16:06:45 welcome!
16:07:09 thank you!
16:07:09 Thank you!! :)
16:07:40 We're in the process of getting them up to speed, and we've got a topic later about what they'll be working on (metrics!)
16:08:03 Merged openstack/octavia master: add the verify for the session https://review.opendev.org/726567
16:08:31 Sounds good
16:08:40 #topic Brief progress reports / bugs needing review
16:09:44 I have been focused on catching up on reviews, getting the stable branches - sigh - stable, and cutting some stable branch releases.
16:10:05 I was a bit tied up with internal processes. Now I want to highlight two changes: adding retries and a pre-upgrade check for amphorav2
16:10:06 We got Ussuri and Stein out of the gate. Train is still broken on grenade issues
16:10:16 #link https://review.opendev.org/#/c/726084/
16:10:29 #link https://review.opendev.org/#/c/735556/
16:10:52 Oh was it just this last week that we EOL'd two branches? Or was that already announced
16:10:59 I'm losing track of time
16:11:07 Oh, yeah, in fact it was!
16:11:18 * johnsom is living in a time warp as well
16:11:43 We have officially EOL'd the Ocata and Pike releases of Octavia.
16:12:24 Thanks rm_work for leading that effort and navigating the process waters
16:12:41 +1, thank you
16:12:47 You mean breaking through the process wall like the Kool-Aid Man
16:12:58 Yes, that exactly, lol
16:13:06 Which is my preferred style of political negotiation :D
16:13:47 Well, it was a good thing, as we should have truth in advertising and really no one was maintaining those branches anymore.
16:14:57 I also spent some time looking at the centos amphora images to see if I could find any tricks to speed up boot under qemu tcg. I achieved a huge improvement of 17 seconds.
16:15:12 Which means it still takes four minutes to boot and is still a problem.
16:16:04 lol
16:16:10 Well that's something
16:16:36 Yeah, not worth the trouble really.
16:17:05 Ok, any other progress reports or updates?
16:17:13 ataraday_ Thanks for the patches!
16:17:29 I spent some time working on diskimage-builder to add CentOS 8 Stream support. CentOS Stream is the rolling pre-release of RHEL 8 and CentOS 8. There's a WIP patch on the Octavia side that builds an amphora and runs the tests, all successful
16:17:44 Nice
16:17:49 Cool
16:17:51 I also continued to review johnsom's monster patch, aka the failover refactor patch
16:18:04 Yeah we need to get that in :)
16:18:06 Yeah, I have done a few comment update spins on that
16:18:36 We've been running it in prod for over a month now? Multiple months maybe?
16:18:45 It's been good
16:18:55 Nice, that is good feedback.
16:19:21 For the most part, the comments have been minor issues. I think the biggest change was adding retry timeouts to the configuration file.
16:20:13 Based on the PTG feedback
16:20:31 Ok, if there are no more updates, we can move on to "metrics"
16:20:35 #topic metrics
16:20:53 rm_work You have the conn
16:21:18 Alright
16:21:35 So, we're picking up this task!
16:21:52 We discussed it briefly last night, and it seems it's essentially three parts
16:22:43 ... and my irc window doesn't want to scroll back that far, apparently
16:23:00 anyway, we think it's essentially:
16:23:15 1) Add new metrics at the system level (for example, RAM usage, CPU usage)
16:24:22 2) Transition to sending deltas instead of absolute totals, where it makes sense (for things like total connections and transfer bytes, but not for current active or the system stuff probably)
16:25:22 3) Rework/improve the driver layer to allow running multiple metrics storage drivers at once, and probably add at least one new driver for shipping metrics somewhere like influxdb
16:26:01 +1 That is the list I am aware of
16:27:00 We'll probably tackle 1 and 2 first
16:27:23 The discussion topic today though is basically -- can we brainstorm what we actually want for #1?
16:27:34 Yeah, those should go together nicely with a heartbeat protocol version bump
16:28:15 yeah, that is a good question.
16:28:54 i listed the two i can think of off the top of my head
16:29:16 My first thought is percentages. Simply because the agent would have the best information about the nova flavor of the instance.
16:29:36 yeah, definitely thinking percentages
16:29:56 Ah, you meant which metrics. Yeah, RAM and CPU are at the top of my list. I personally don't have any others.
16:29:58 which does mean those numbers would be absolute, not deltas
16:30:06 Correct
16:30:30 yeah is there anything else useful we could collect?
16:30:35 disk? local logs can pile up
16:30:38 hmm
16:31:28 as an admin, having an at-a-glance view of the disk might be useful in that specific situation
16:31:37 Personally I think we have other ways to address that (log offloading and hourly rotation), but we have seen one case where some other issue in the cloud filled the system log file with garbage.
16:31:49 but i don't know how generally useful that'd be in the 99% case for a user
16:32:23 ok, it was just a thought. we can add it later if we want to
16:32:26 i guess we should clarify the goal
16:32:53 I THINK what we're trying to do is add metrics that would allow one essential insight: how much "capacity" does my LB have left
16:33:20 Yeah, my goal for those is to get us a step closer to auto-scaling
16:33:25 and really, I am considering formulas that we could use to turn that into one easily digestible number
16:33:44 which is apparently what AWS does with ELB
16:34:41 Initially I'm not sure we should even add the "system" metrics to the API. Simply because they have little to no meaning for other provider drivers
16:35:15 Yeah I think I agree
16:35:20 And I don't want us to get in a strange situation when we enable active/active.
16:35:36 We should just collect at first
16:35:41 +1
16:36:07 which actually simplifies the task a bit :D
16:36:25 then we can handle what to do with those new metrics in step 3, when we ship them somewhere
16:37:27 AWS offers read/write bandwidth, idle time, latency on EBS. Do we want to offer features like that?
16:37:47 Correct. Maybe, if we want to give some indication to the end user, we could consider adding a "HIGH LOAD" operating status, but I would consider that #4 or #20 on the list.
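
(For illustration, a minimal sketch of the "percentages" idea above: the amphora agent sampling CPU, RAM, and disk usage as percentages. It assumes the psutil library is available in the image; the function name and result keys are hypothetical, not the agent's actual API.)

    # Hypothetical sketch, not the agent's real code: system-level
    # metrics reported as absolute percentages. Assumes psutil is
    # installed in the amphora image.
    import psutil

    def get_system_metrics():
        """Sample system utilization as percentages.

        Percentages (not deltas) make sense here because the agent
        runs inside the instance and can normalize against what the
        nova flavor actually provides.
        """
        return {
            # CPU use since the previous call; non-blocking sample.
            'cpu_percent': psutil.cpu_percent(interval=None),
            # Physical RAM currently in use.
            'ram_percent': psutil.virtual_memory().percent,
            # Root filesystem use, e.g. to spot local log pile-up.
            'disk_percent': psutil.disk_usage('/').percent,
        }
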
16:38:11 I wonder if we can actually have any idea what percentage of read/write bandwidth is actually being used
16:38:23 that would require an operator config setting, possibly
16:38:37 Right, that is a hard one given neutron can't usually come close to what we can handle.
16:39:04 yeah, and even if we know which NIC is in a HV, we don't know what bandwidth is like on the rest of the VMs that live there
16:39:08 We do have bytes in/out, and with deltas you could calculate the rate
16:39:37 ah, yeah... how do we do deltas, exactly? that is one of my major concerns
16:39:46 there are a few concerns there actually
16:40:02 firstly, HOW? do we *reset* haproxy's counters constantly?
16:40:12 do we keep an internal tracker in the agent?
16:40:32 I suppose that's just going to be some research
16:40:44 Yeah, my expectation is the agent will keep the previous value in memory and calculate the delta
16:41:01 also, since we use UDP, do we just... hope all the packets get there, and possibly under-report?
16:41:25 Yes, this would be a "may under-report in some cases" scenario
16:41:39 we have a sequence number, so on the control plane we could actually tell if we're missing packets and try to do some fill based on the points on either side... but that could be wrong too
16:41:51 and better to under-report than over-report i guess
16:42:19 also don't want to hugely increase the workload on the heartbeat ingestion
16:42:20 Right, and complex. We do already have a sequence number in the heartbeat message. We just don't use it for more than a nonce
16:42:34 right
16:43:24 FYI, here are the metrics haproxy can report:
16:43:26 #link http://cbonte.github.io/haproxy-dconv/1.8/management.html#9.1
16:43:39 However, the LVS UDP side cannot support most of those
16:44:17 yeah...
16:44:18 So until we can switch out the UDP engine, that may constrain what we report, or we need to call out the limitations.
16:44:49 I also want to make sure we are careful to not put in things that other drivers don't support. I.e. no haproxy-specific metrics.
16:45:00 ah, hrsp_4xx, hrsp_5xx, etc. were something that was mentioned
16:45:20 but I don't know if we want to try to ship those from haproxy, or allow those to be calculated by a user via log analysis
16:45:47 ereq and econ are also candidates
16:45:50 I believe other things should be able to report those for HTTP type stuff
16:46:04 we already ship ereq :)
16:46:17 Ah, ok, so .... grin
16:46:31 maybe hanafail?
16:46:35 "failed health checks details"
16:46:42 that is one other use-case
16:46:53 but also, can be handled by logs
16:47:20 Yeah, that is in the flow logs
16:47:52 We also need to keep in mind the heartbeat message size. I think it is limited to 64k at the moment. That includes both stats and status
16:48:16 Not that we can't change that, but just a consideration
16:48:35 lbtot for members would be interesting
16:48:50 "total number of times a server was selected, either for new sessions, or when re-dispatching"
16:49:16 Yeah, that is per-member hits
16:49:17 but anyway, I suppose we can move on, could be here all day :D
16:49:37 and also, the user could get that info *from their members* :D
16:49:54 Lots of goodies, but we need to be conservative
16:49:59 yeah
16:50:09 Or the flow logs, it's in there
16:50:35 alright, last thing would be, does anyone ELSE want to work on #3? since that also could probably be done in parallel
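
(To make "keep the previous value in memory and calculate the delta" concrete, a minimal sketch of an in-agent delta tracker for cumulative haproxy counters. The class and its reset policy are hypothetical; treating a counter that goes backwards as a reset is one way to honor "better to under-report than over-report", not a settled design.)

    # Hypothetical sketch of in-agent delta tracking for monotonic
    # counters such as total connections or bytes in/out.
    class CounterDeltaTracker:
        def __init__(self):
            self._previous = {}

        def delta(self, name, current_total):
            """Return the increase in a counter since the last sample."""
            last = self._previous.get(name)
            self._previous[name] = current_total
            if last is None:
                # First sample after agent start: no baseline yet.
                return 0
            if current_total < last:
                # Counter went backwards, e.g. haproxy reloaded and its
                # stats reset; report only what accrued since the reset
                # (under-reports rather than over-reports).
                return current_total
            return current_total - last

    # Example with haproxy's cumulative 'stot' (total sessions) counter:
    tracker = CounterDeltaTracker()
    tracker.delta('stot', 1500)  # 0: first sample sets the baseline
    tracker.delta('stot', 1620)  # 120: delta shipped in the next heartbeat
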
16:50:55 (updating to allow multiple drivers to be used at once, and adding one for influxdb or similar)
16:51:20 The hard work there is defining the interface really
16:51:26 if not, we can look at handling that after we wrap up 1+2
16:51:56 I think the interface is already defined -- technically it's already a driver layer?
16:52:07 and it takes "our health message" :D
16:52:14 unless you are saying you want to rework that
16:52:22 and actually do some level of pre-parsing first
16:53:05 Yeah, I was trying to remember what the content was. It's a de-wrapped heartbeat json, isn't it?
16:53:09 that would require a decent refactor -- it'd basically mean shifting 90% of the current "update_db" code up above the driver layer
16:53:16 which maybe should be done
16:53:21 because that doesn't really make sense
16:53:29 we should have all that pre-parsing outside of the drivers
16:53:44 and the "update_db" part should literally just be taking the final stats struct, and ... updating the DB
16:54:08 Yeah, we should be able to rev the message format version without requiring all of the drivers to respin IMO. If we can avoid it.
16:54:10 it's actually pretty badly organized
16:54:15 yeah
16:54:21 ok so maybe we MOVE the driver layer there
16:54:35 do you think it'd be ok to break our existing plugin agreement there?
16:54:46 i doubt anyone is using it?
16:55:10 It is not a published interface today. We don't document it.
16:55:12 it's internal to the amphora-driver
16:55:15 alright
16:55:24 so we'll probably reorganize that first
16:55:44 which i guess actually means parts of #3 will be #0
16:55:48 But do keep in mind, we have a stats interface for the provider drivers too
16:55:57 k
16:56:20 yeah but i believe it is already totally different from the interface i'm referring to
16:56:34 Yeah, it is a bit different
16:57:38 https://github.com/openstack/octavia/blob/master/octavia/amphorae/drivers/health/heartbeat_udp.py#L32-L47
16:57:45 I am referring to that one
16:58:08 because currently 100% of the logic that parses the packet lives in the "update_db" driver
16:58:10 Yeah, that will need improvement
16:58:12 which is ... not correct
16:58:39 that should all happen well before it passes to a driver, and what it should pass is a final structure with data
16:59:25 I am talking about:
16:59:27 #link https://github.com/openstack/octavia/blob/master/octavia/api/drivers/driver_agent/driver_updater.py#L139
17:00:27 yeah that's already closer to what an "update_db" driver SHOULD be tho
17:00:46 so right inside there, we can actually share the driver layer and the struct we pass, I think
17:00:49 Ok, we are out of time today. Thanks for the great discussion and work on metrics!
17:01:03 driver layer should be here: https://github.com/openstack/octavia/blob/master/octavia/api/drivers/driver_agent/driver_updater.py#L166-L167
17:01:06 o/ thanks everyone
17:01:17 #endmeeting
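
(A rough sketch of where the refactor discussed above could land: decode and parse the heartbeat exactly once, above the driver layer, then hand every configured stats driver the same final structure. All class, function, and key names below are illustrative, not Octavia's current interfaces.)

    # Hypothetical sketch only: pre-parsing lives outside the drivers,
    # and multiple stats drivers can run side by side.
    import abc

    class StatsDriverBase(abc.ABC):
        @abc.abstractmethod
        def update_stats(self, stats):
            """Receive a fully parsed stats structure; no packet decoding."""

    class DatabaseStatsDriver(StatsDriverBase):
        def update_stats(self, stats):
            # "update_db" reduced to what its name says: persist the
            # final numbers to the Octavia database.
            pass

    class InfluxDBStatsDriver(StatsDriverBase):
        def update_stats(self, stats):
            # Ship the same structure to a time-series backend.
            pass

    def parse_heartbeat_stats(decoded_message):
        # Hypothetical pre-parser: turn the de-wrapped heartbeat JSON
        # into the final stats structure the drivers consume.
        return decoded_message.get('listeners', {})

    def process_heartbeat(decoded_message, drivers):
        # Parsing happens once, above the driver layer, so a heartbeat
        # format revision does not force every driver to respin.
        stats = parse_heartbeat_stats(decoded_message)
        for driver in drivers:
            driver.update_stats(stats)
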