14:00:12 #startmeeting neutron_drivers
14:00:13 Meeting started Fri Aug 21 14:00:12 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:14 hi
14:00:16 The meeting name has been set to 'neutron_drivers'
14:00:24 o/ hi
14:00:50 Hello.
14:00:57 slaweq: I'm @hopem (from lp 1892200) fwiw
14:00:59 Launchpad bug 1892200 in neutron "Make keepalived healthcheck more configurable" [Wishlist,New] https://launchpad.net/bugs/1892200
14:01:15 o/
14:01:15 hi
14:01:21 hi
14:01:25 welcome dosaboy :)
14:01:31 slaweq, and I'm pprincipeza from lp 1891334. :)
14:01:32 Launchpad bug 1891334 in neutron "[RFE] Enable change of CIDR on a subnet" [Wishlist,New] https://launchpad.net/bugs/1891334
14:01:38 welcome pprincipeza_ :)
14:01:58 hi mlavalle haleyb and amotoki
14:02:02 :)
14:02:12 ralonsoh and njohnston are on pto this week
14:02:23 so lets just wait 2 more minutes for yamamoto
14:02:51 ACK!
14:04:35 ok, lets start
14:04:48 even without yamamoto we have quorum so we should be good to go
14:04:56 #topic RFEs
14:05:02 we have 2 RFEs today
14:05:04 first one:
14:05:11 #link https://bugs.launchpad.net/neutron/+bug/1891334
14:05:12 Launchpad bug 1891334 in neutron "[RFE] Enable change of CIDR on a subnet" [Wishlist,New]
14:07:05 I submitted that on behalf of a customer, who would like to have the ability to expand the subnet currently in use. He has already implemented the alternative (new subnets); his wish would be to avoid creating these distinct subnets and keep the servers under a single subnet (even though this is completely virtual and internal to OpenStack).
14:08:07 I understand the limitations that this would imply, as in having to repopulate existing Instances with new IP/Mask/GW information, and this would definitely need downtime. :/
14:10:09 I'd imagine this would be less "painful" on subnets with DHCP allocation, as this info *should* come with the lease renewal?
14:10:21 i assume this would only be expected to work as an increase and not a decrease of size
14:10:49 dosaboy, yes.
14:11:55 pprincipeza_: is there a particular use-case? would just adding an extra subnet to the network now solve the issue?
14:12:05 s/now/not
14:12:32 technically, it should be able to scale-down a subnet if the number of ports is small enough, just don't know why you'd want to do that
14:12:43 but how do You want to force changes of e.g. the mask in the existing instances?
14:13:18 dosaboy, the use-case is more on a "systems management" side than in functionality. He has implemented other subnets, and everything is working fine.
14:14:51 slaweq, my only thought there would be on doing the change with instances down, as I imagine the new information from the subnet would come in upon the lease renewal?
14:14:59 (And I expect that to happen at boot time?)
14:15:13 slaweq: scaling-up by just changing the mask might not cause a big disruption, but i can't see being able to do that if pools are being used, would need to change to a new subnet, bleck
14:15:23 so the system manager wants to save himself / herself management work and instead have the users reconfigure their vms?
14:15:36 haleyb: i guess either way i worry about the number of places that update would have to be applied to
14:16:52 mlavalle, yes, it does not sound very reasonable when thinking of end-users of the Cloud. :/
14:17:29 And the change on lb-mgmt for Octavia would also be a use-case, I believe?
14:17:49 i can see how this change would help with certain deployers too, for example where the undercloud was made too small, but don't know how the cloud admin could do this successfully, for a tenant it seems more doable
14:18:28 pprincipeza_: so is the use case more the end-user/tenant?
14:18:58 I understood the opposite, but I might be wrong
14:19:26 on neutron's side it would be a change in the dhcp agent and neutron db, right? other things are on the user who needs to e.g. reboot vms or force renewal of the lease, is that correct or am I missing something here?
14:19:46 if it's the octavia lb-mgmt net then it's the case where it runs out of addresses for amphora vms etc
14:19:47 haleyb, ^ thanks, slaweq. that sums it up.
14:20:08 mlavalle: it's not clear to me
14:20:09 pprincipeza_: you also mention Octavia, which is not mentioned in the RFE. Is there an additional conversation going on that we haven't seen in this meeting?
14:21:20 mlavalle, this was a use-case not initially added to the RFE I submitted, but discussing this with dosaboy, that use-case came up as an addition.
14:21:35 mlavalle: tbh we've seen the octavia issue with older deployments before we switched to using v6 networks for the lb-mgmt net
14:21:49 mlavalle, I can certainly add that mention to the LP Bug, if that's needed.
14:22:06 but i haven't yet tested if adding an extra subnet could fix that, planning to try that
14:22:59 Octavia also has an rfe to add multiple subnets for the lb-mgmt-net. It just hasn't come up that anyone needed it, so it is low priority.
14:23:13 i don't know if that's pprincipeza_'s original use-case though
14:23:22 johnsom: oh interesting, i was not aware of that
14:24:55 but pprincipeza_'s aim is to avoid adding subnets, isn't it?
14:24:55 I'm still not really convinced about that rfe and whether we should implement something which may in fact be painful for users later
14:25:21 I had trouble with my internet connection and am just following the discussion. As you already discussed, expanding a subnet CIDR leads to a subnet mask change. It sometimes leads to a problem between an existing vm and a newly booted vm if the existing vm does not update the mask. If all communications happen between the gateway and VMs, we will hit fewer problems.
14:25:51 mlavalle, yes, that's my initial aim.
14:26:25 in that case, the Octavia RFE mentioned above is not related, if I understand correctly
14:27:37 amotoki: and it might be you can't just change the mask, but need a new cidr due to overlap, right?
14:27:57 Yes.
14:28:03 haleyb: yes
14:28:27 In fact, using the AZ code in Ussuri, Octavia allows for multiple lb-mgmt-nets today.
14:28:28 so then it's a new subnet and live migration, etc
14:28:32 haleyb: I think the overlapping case can be covered on the API side. we can check overlapping of CIDRs
14:29:53 amotoki: right, i was just thinking of the simple case of changing from /26 to /24, same cidr, which causes little disruption, otherwise it's reboot everything
14:30:10 right
14:30:33 haleyb: thanks. I am on the same page.
14:30:57 And if rebooting is needed, it is needed, I don't see that feature being added without some "pain" for the Instances. :)
14:32:25 and the RFE refers to "changing the CIDR of a subnet", so it's not just going from /26 to /24
14:32:42 pprincipeza_: if you need to reboot, is a new subnet with new instances easier? no downtime if these instances are part of a pool?
14:33:16 mlavalle: exactly, how would it be if You e.g. changed from 10.0.0.0/24 to 192.168.0.0/16?
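As a side note to the overlap and /26-to-/24 discussion above, the "expand only" rule can be expressed directly with Python's ipaddress module. This is only an illustrative sketch with a hypothetical helper name, not the proposed neutron implementation:

```python
import ipaddress

def validate_cidr_expansion(current_cidr, new_cidr):
    """Hypothetical check: only allow replacing a subnet's CIDR with a
    larger CIDR that fully contains the existing one."""
    current = ipaddress.ip_network(current_cidr)
    new = ipaddress.ip_network(new_cidr)
    # The address family must stay the same (no v4 -> v6 change).
    if current.version != new.version:
        raise ValueError("address family cannot change")
    # The existing CIDR must sit inside the new, bigger one,
    # e.g. 10.0.0.0/26 -> 10.0.0.0/24 is fine.
    if not current.subnet_of(new):
        raise ValueError("new CIDR must contain the current CIDR")
    return new

validate_cidr_expansion("10.0.0.0/26", "10.0.0.0/24")       # allowed
# validate_cidr_expansion("10.0.0.0/24", "192.168.0.0/16")  # rejected
```

Checking for overlap with other subnets on the same network would still be a separate API-side check, as mentioned above.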
14:33:26 haleyb, the new subnet with new instances is already in place as a functional way out of the "expansion" limitation.
14:33:29 that may be much harder :)
14:34:37 this has a ripple effect through security groups as well
14:35:03 mlavalle, slaweq, my initial use-case covered a minor /26 to /24 scenario, but the bigger change (of the whole CIDR) was ultimately requested.
14:36:03 personally I can imagine that we allow extending the cidr, so the old cidr has to be inside the new, bigger one
14:36:08 does it work if the RFE is rephrased from "changing" to "expanding"?
14:36:12 but other use cases I'm not sure about
14:36:49 amotoki: so the /26 to /24 case?
14:36:54 mlavalle: yes
14:37:12 * with limitations
14:40:03 so should we vote on that rfe or do You want some more clarifications and discuss that again next week?
14:40:05 amotoki: it's more reasonable if just expanding, but i guess there will still be connectivity issues since the gateway will have changed?
14:40:23 haleyb: I think so
14:40:54 haleyb: the gateway address is already assigned so I don't think we need to change the gateway address.
14:40:59 haleyb: if we just expand, why would the gateway need to be changed?
14:42:17 slaweq: i just didn't know if the expansion changed the ".1" address to be different, i.e. 2.1 to 0.1 or something
14:42:48 or the gateway stays the same...
14:43:30 I thought the gateway stays the same. In my understanding, the current logic is applied only when a gateway is not specified.
14:43:47 amotoki: I think the same
14:44:46 amotoki: yes, i was just thinking out loud, shouldn't be an issue after thinking about it
14:45:18 haleyb: thanks. we are careful enough :)
14:45:55 so are we ok to approve this rfe as "expansion of subnet's cidr" and discuss details in the review of the spec?
14:46:19 +1 for expanding a subnet CIDR. This operation may require additional workarounds including instance reboot as we discussed. it is worth documenting in the API ref or somewhere.
14:46:36 If it is just about expanding CIDRs, +1
14:47:07 pprincipeza_: will that work for You?
14:47:29 slaweq, it works for me.
14:47:39 Thank you very much for considering it!
14:47:40 +1 from me, then
14:47:52 haleyb ?
14:48:43 +1 from me
14:48:57 thx, so I will mark this rfe as approved
14:49:05 with a note about "expanding cidr" only
14:49:16 ok, lets quickly look into the second rfe
14:49:23 #link https://bugs.launchpad.net/neutron/+bug/1892200
14:49:24 Launchpad bug 1892200 in neutron "Make keepalived healthcheck more configurable" [Wishlist,New]
14:49:27 Awesome, thank you very much slaweq haleyb mlavalle!
14:49:38 ok so,
14:49:43 1892200 is related to an issue that we have recently observed in an env using l3ha
14:49:47 the conditions to hit the issue are somewhat protracted and described in the LP but
14:49:50 long story short is that while the check was failing for a valid reason, the result of it failing ended up causing more problems
14:49:53 and since the original cause of the test failure was transient, there was no real need to failover
14:49:56 therefore a slightly more intelligent test than simply doing a single ping would be preferable
14:50:03 in terms of solutions I know that protocols like BFD are much better at dealing with this kind of thing and are available in OVN
14:50:06 but this is really for those users that will be stuck with L3HA for the foreseeable future
14:50:09 so adapting what we have, e.g. just trying more pings before we declare a failure, would be better than what we have imho
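To illustrate the "retry before declaring a failure" idea, a check could tolerate a single missed ICMP reply along the lines of the sketch below; the script, its argument and the retry counts are hypothetical, and this is not the actual check that neutron generates today:

```python
import subprocess
import sys
import time

def gateway_alive(target, retries=3, interval=1):
    """Return True if at least one of several pings succeeds, so one
    missed ICMP reply does not immediately trigger a failover."""
    for _ in range(retries):
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", target],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    # keepalived treats a non-zero exit code from the check as a failure.
    sys.exit(0 if gateway_alive(sys.argv[1]) else 1)
```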
14:50:12 on top of that, the suggestion to move the current code to use a template seems like a good idea
14:50:17 right now the healthcheck script is entirely built from code
14:50:20 if we used a template it would also provide the opportunity to make the path to the template configurable
14:50:24 thus allowing for it to be modified without changing neutron code
14:50:28 I've not looked too deeply into this yet but was thinking something along the lines of a jinja template
14:50:31 thoughts?
14:52:52 dosaboy: I was thinking about jinja2 too :)
14:52:58 The idea of a customizable template looks great, as long as there is a default one already in place.
14:53:26 rafaelweingartne: yes, my idea here was that we should basically provide a default template which will be the same as what we have now
14:53:27 rafaelweingartne: yeah absolutely, a default that can be overridden via a config path
14:53:29 I think that is the idea, rafaelweingartne
14:53:55 slaweq: yep
14:54:54 generally templating it sounds great. it potentially makes our bug triage complex. my question is whether we will keep the current configurations (though I haven't checked if we have them).
14:54:58 so i guess there are two ways to look at the request, either we "improve" the existing default test and/or make it templatised to allow user-override
14:55:35 amotoki: the only config currently iirc is the interval, i.e. how often to run the check
14:55:37 amotoki: are You asking about configuration for the specific process which is spawned for a router?
14:55:41 that can remain
14:56:09 slaweq: no process here fwiw, the check is run directly by keepalived
14:56:26 slaweq: what i mean is the keepalived config
14:56:33 neutron generates the keepalived conf with the test enabled and a path to the test
14:56:43 slaweq: i like your thought on keepalived options for a router, maybe an extension? instead of adding more config options? or were you thinking something else?
14:57:01 that was in the bug comments
14:57:27 haleyb: this isn't really about how keepalived drives the test though, not in my experience anyway
14:57:38 our problem was really the test itself
14:57:54 currently when neutron-l3-agent is configuring a new router, it generates the keepalived config file through https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py
14:58:16 and this config file is stored somewhere in /var/lib/neutron/ha_confs/ (IIRC)
14:58:31 ... that fails when a single icmp reply is missed within 1s
14:58:56 and my idea here was that we can somehow change the classes in this module so that it will generate the keepalived config file based on some template
14:58:56 slaweq: correct
14:59:06 right
14:59:10 so it will still set the correct interface names, ip addresses and other variables
14:59:41 but the user will be able to prepare other things in the template, like some timeouts, etc.
14:59:52 I'm totally +1 on that, and make the path to the template configurable so that users can modify it and put it somewhere else
14:59:56 I hope that it is clear and I hope it is correct with what dosaboy wants :)
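As a rough sketch of the templating idea described above: the l3 agent would keep filling in per-router values (interfaces, addresses, script paths) while an operator-editable template controls things like check intervals and retry thresholds. The template text, variable names and paths below are made up for illustration; today the config is built in code in neutron/agent/linux/keepalived.py:

```python
from jinja2 import Template

# Hypothetical operator-editable snippet; in the proposed scheme the
# agent would load it from a configurable path instead of hard-coding it.
VRRP_CHECK_TEMPLATE = Template("""
vrrp_script ha_health_check_{{ router_id }} {
    script "{{ check_script_path }}"
    interval {{ check_interval }}
    fall {{ check_fall }}
    rise {{ check_rise }}
}
""")

# The agent keeps supplying the per-router values it already knows about.
print(VRRP_CHECK_TEMPLATE.render(
    router_id="example-router",
    check_script_path="/var/lib/neutron/ha_confs/example/ha_check_script.sh",
    check_interval=5,
    check_fall=3,
    check_rise=2,
))
```

The fall/rise knobs shown here are standard keepalived vrrp_script options; whether they, or a retrying check script, end up in the default template would be decided in the spec review.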
14:59:59 slaweq: per-router or per-cloud?
15:00:08 i.e. an admin-controlled template
15:00:15 haleyb: the template would be "per l3 agent" in fact
15:00:16 slaweq: yeah i think that's it
15:00:29 but in practice it should be per cloud
15:00:31 yep, per l3agent
15:00:40 that's enough for us anyway
15:00:51 as You shouldn't have different configs on different network nodes
15:01:41 I am +1 for introducing a template (and also we can keep better default configs)
15:01:41 ok, we are out of time today
15:01:41 slaweq: yeah sorry, I'm confusing things, it would be the same everywhere but that's just because we would configure all l3-agents the same
15:01:51 ok thanks for reviewing
15:01:51 I think we need to get back to this next week
15:01:52 it now makes sense to me
15:02:10 or is it clear for You and do You want to vote on that quickly now?
15:02:47 what about others?
15:02:59 i don't think there's a meeting right after us...
15:02:59 it makes sense to me
15:03:33 i'm comfortable casting a +1
15:03:38 ok, so it seems that mlavalle amotoki haleyb and I are ok to approve it now
15:03:40 right?
15:03:49 I think so
15:04:33 haleyb: you are ok with that proposal, right?
15:04:36 yes, i'd +1, seems useful for cloud admins
15:04:42 ok, thx a lot
15:04:48 so I will mark this one as approved too
15:04:54 thx for proposing it dosaboy
15:05:07 thanks guys, much appreciated
15:05:08 and sorry that I kept You here longer than usual :)
15:05:17 have a great weekend and see You all next week
15:05:19 o/
15:05:23 #endmeeting