14:00:55 <mlavalle> #startmeeting neutron_drivers
14:00:56 <openstack> Meeting started Fri Jan 18 14:00:55 2019 UTC and is due to finish in 60 minutes.  The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:59 <openstack> The meeting name has been set to 'neutron_drivers'
14:01:02 <slaweq> hi
14:01:05 <mlavalle> hi
14:01:33 <cheng1> hi
14:01:34 <mlavalle> let's wait 2 min for people to congregate
14:02:11 <njohnston_> o/
14:02:34 <njohnston_> haleyb let me know he will miss the first half of the meeting; his previous appointment is running late
14:03:57 <mlavalle> so we need amotoki to have quorum
14:04:14 <mlavalle> thanks for the update njohnston_ :-)
14:04:38 <mlavalle> in the meantime: cheng1 is there something you want to bring up?
14:05:10 <cheng1> https://bugs.launchpad.net/neutron/+bug/1808731
14:05:13 <openstack> Launchpad bug 1808731 in neutron "[RFE] Needs to restart metadata proxy with the start/restart of l3/dhcp agent" [Undecided,Triaged] - Assigned to cheng li (chengli3)
14:05:31 <cheng1> can we have a look at this bug?
14:06:07 <mlavalle> I left a comment last night in that bug
14:06:24 <mlavalle> and I think slaweq agrees with me
14:06:33 <mlavalle> and haleyb as well
14:07:18 <cheng1> mlavalle: we may need the same implementation as other agents have, like dnsmasq
14:07:47 <slaweq> mlavalle: basically I agree with Your comment, but I also think that restarting haproxy only during agent start may work - it would be unavailable for only a short time
14:08:32 <slaweq> so in fact I think that in most cases the client will have a longer http timeout configured and will wait for the response
14:08:39 <cheng1> sure, it will
14:09:04 <cheng1> like with dnsmasq, we don't stop the metadata proxy when the l3/dhcp agent stops
14:09:30 <cheng1> we would just restart it when the l3/dhcp agent restarts/starts
14:10:29 <mlavalle> in the case of l3 agents, today haproxy is not tied to the start of the agent
14:10:30 <njohnston_> So if this is really an upgrade concern, doesn't this properly live in the realm of orchestration, such as what ansible or puppet provides?  Usually a tool like that is used to choreograph the sequence of events needed for upgrades, if I am not mistaken.
14:11:02 <mlavalle> njohnston_: that's the point of my comment of last night
14:11:18 <mlavalle> as usual, you articulated it better than me :-)
14:11:48 <cheng1> it's not only about upgrade
14:12:25 <cheng1> With the current implementation, the metadata proxy doesn't restart even when we restart the l3 agent
14:12:28 <cheng1> or dhcp agent
14:12:40 <njohnston_> why would you want it to?
14:12:41 <cheng1> this can result in issues
14:13:04 <mlavalle> but going back to my previous point, in the case of router based proxies, the trigger is the creation, update or deletion of routers
14:13:40 <cheng1> For example, if we change the metadata_port from default 9697 to 9699
14:14:17 <cheng1> the new 9699 port will not be used, because metadata proxy doesn't restart
14:14:30 <mlavalle> so, when we restart the agent, we need to restart the proxies in that agent
14:14:49 <mlavalle> all the proxies in all the routers in that agent^^^
14:14:53 <cheng1> yes, that's what I wanted to say
14:15:41 <mlavalle> could that be very disruptive?
14:15:57 <liuyulong> Such a change is not something a cloud administrator should be manipulating, IMO.
14:16:23 <mlavalle> liuyulong: what do you mean? can you clarify?
14:16:50 <njohnston_> what if the agents could reread the configuration when it changes, like what we implemented for neutron server?  That would be a way to nondisruptively load the change.  Since if you're changing the metadata proxy config file you are managing objects on that host, it should not be too much of an extra step to issue a signal to trigger the config file reload
14:17:23 <liuyulong> IIRC, such a proxy should not have any config change made during agent down time.
14:17:40 <mlavalle> liuyulong: yes, that's been my point
14:18:08 <slaweq> njohnston_: but we are talking here about reloading the haproxy config, not the l3 agent config
14:18:16 <slaweq> can haproxy reload config without restart?
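(For context: haproxy can reload its configuration without dropping the listening socket by starting a new process with -sf <old pid>; the old process soft-stops once its in-flight requests finish. A minimal sketch of how an agent could drive that inside a router namespace follows; the function name, paths and namespace handling are illustrative assumptions, not the actual neutron code.)

    # Sketch only: "-f", "-p" and "-sf" are real haproxy options; everything
    # else (names, paths, namespace handling) is an assumption for illustration.
    import subprocess


    def reload_metadata_proxy(netns, cfg_path, pid_path):
        """Start a replacement haproxy in the router namespace.

        The new process binds the metadata port and tells the old process
        (via -sf) to finish its in-flight requests and then exit.
        """
        with open(pid_path) as pid_file:
            old_pid = pid_file.read().strip()
        cmd = ['ip', 'netns', 'exec', netns,
               'haproxy', '-f', cfg_path, '-p', pid_path, '-sf', old_pid]
        subprocess.check_call(cmd)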
14:19:19 <mlavalle> cheng1: also, what are the use cases where you have seen the need for this? can you elaborate on that? we are having a theoretical discussion without understanding the need behind it
14:20:14 <mlavalle> I am hesitant to introduce changes in the data plane in principle, however minor, but maybe the need justifies it
14:20:33 <cheng1> mlavalle: there is a running openstack env
14:20:57 <cheng1> for some reason, I want to change metadata_port
14:21:22 <cheng1> the step is to change the config in neutron config files
14:21:29 <cheng1> then restart l3 agent
14:21:56 <cheng1> but restarting the l3 agent doesn't restart the metadata proxy
14:22:08 <cheng1> the current implementation reloads via 'kill', which doesn't re-generate the configuration from neutron.conf
14:22:14 <mlavalle> but that is a hypothetical: "for some reason". do you have an actual use case?
14:22:53 <cheng1> not really, but it could be a use case
14:23:09 <mlavalle> we can all go through the code and start hypothesizing about changes we could make to it
14:24:04 <mlavalle> but we have a huge installed base to worry about. we should introduce only changes where the actual reward justifies the risk we always run when we introduce changes
14:24:27 <cheng1> just would like to confirm this bug in the meeting
14:24:30 <slaweq> mlavalle: well said :)
14:26:07 <mlavalle> and I am known to be in favor of introducing more features and running the risk of doing so, but I want to know the reward in terms of the benefits we will deliver to ACTUAL use cases. Then I'm willing to run the associated risks
14:26:26 <liuyulong> Actually the l3-agent and dhcp-agent can be hosted on dedicated machines, and the port allocations for such key/bottleneck services should be planned in advance.
14:27:56 <slaweq> cheng1: mlavalle: what about writing somewhere in the docs that if You change something related to the metadata proxy in the L3 config, e.g. metadata_port, You should kill the haproxy services before restarting the L3 agent for the changes to take effect
14:27:57 <cheng1> besides metadata_port, there could be other parameters for metadata proxy.
14:28:07 <slaweq> that would be for sure less risky :)
14:28:19 <slaweq> and operators would be aware of it then IMO
14:29:17 <mlavalle> but today, if you kill the proxies and re-start the agent, you won't re-start the proxies
14:29:32 <mlavalle> proxies are associated with router events
14:30:15 <mlavalle> right?
14:30:39 <slaweq> mlavalle: the L3 agent will not check whether the proxy is running and start it if it's not?
14:30:49 <mlavalle> right
14:30:52 <slaweq> I think it will take care of it during restart
14:31:13 <liuyulong> slaweq, mlavalle, a router admin-state down/up action may work. But such an action will cause an inevitable data plane outage.
14:31:27 <mlavalle> right
14:32:18 <cheng1> I see that we restart dnsmasq when the dhcp agent restarts
14:32:46 <cheng1> why don't we restart the metadata proxy with the l3/dhcp agent?
14:32:50 <mlavalle> cheng1: yes, but that is not a valid analogy. it is a consequence of the way we use dnsmasq
14:33:11 <mlavalle> and the way dnsmasq works
14:33:23 <liuyulong> cheng1, DHCP can reload its config because normal users can set attributes for it
14:33:54 <liuyulong> meanwhile the metadata proxy is transparent to users
14:35:38 <mlavalle> slaweq: you spent a long time with a public cloud operator. In your experience, is this something relevant to that operator?
14:36:00 <mlavalle> same question to you liuyulong. you work today for a big operator
14:36:08 <slaweq> mlavalle: unfortunately in this case there is no L3 agent used at all
14:36:20 <mlavalle> slaweq: good answer... LOL
14:36:51 <liuyulong> mlavalle, yes?
14:37:41 <mlavalle> liuyulong: I was asking whether a change like the one cheng1 is proposing would be beneficial for the company you work for
14:39:54 <mlavalle> and I ask these questions because so far we don't have an actual use case. So I want to see what operators would think about it
14:40:55 <liuyulong> mlavalle, as I said before, such a change should not happen during agent down time. The config should be planned in advance, and it should then not change during the cloud's entire life cycle. Otherwise it will increase OP difficulties.
14:40:57 <slaweq> maybe it would be a good idea to send an email to the ML and ask operators for their feedback on this?
14:41:17 <slaweq> and then we can come back to this discussion here :)
14:41:25 <mlavalle> liuyulong: thanks
14:41:50 <liuyulong> mlavalle, np
14:42:27 <mlavalle> slaweq: I have expressed my opinion on this and I don't see the need to make changes for the sake of making changes. But I am willing to be overruled if:
14:42:59 <mlavalle> 1) other members of the drivers team reach consensus this is needed
14:43:34 <mlavalle> 2) we identify a set of actual operators who see the benefit of this. The ML is a good way to try to get that feedback. Great suggestion
14:44:52 <mlavalle> it can even be a topic of discussion for the next forum / ptg
14:45:13 <cheng1> mlavalle: got it, thanks
14:45:22 <slaweq> forum would be better IMO, and I agree with You mlavalle :)
14:45:45 <mlavalle> cheng1: do you want to start a thread in the ML and see if we get feedback from the community
14:45:47 <mlavalle> ?
14:47:20 <mlavalle> BTW, cheng1, slaweq, liuyulong, njohnston_: great discussion. This is what this meeting is for ;-)
14:47:45 <mlavalle> sorry I played the stubborn role this time around
14:48:13 <mlavalle> cheng1: and I thank you for your suggestions and arguments. They are always welcome
14:48:27 <slaweq> mlavalle++
14:48:38 <njohnston_> mlavalle++
14:49:41 <cheng1> mlavalle: Maybe in a few days I can try the ML
14:49:54 <mlavalle> cheng1: great. thanks again
14:51:08 <mlavalle> Since we don't have drivers quorum, and we have only 10 minutes left, I propose we look at https://bugs.launchpad.net/neutron/+bug/1811166
14:51:09 <openstack> Launchpad bug 1811166 in neutron "[RFE] Enforce router admin_state_up=False before distributed update" [Wishlist,New] - Assigned to Matt Welch (mattw4)
14:51:37 <mlavalle> after reading it last night, my thinking is that this is not an RFE. It's really a bug
14:52:07 <mlavalle> the code should do what the submitter proposes: https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvr_db.py#L86
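(The validation being asked for would look roughly like the sketch below: reject a change of the distributed flag while the router is administratively up. The helper name, the exact attribute access and the error message are assumptions, not the eventual patch.)

    # Sketch of the proposed check, not the actual fix.  neutron_lib's
    # BadRequest takes a resource name and a message; everything else here
    # is illustrative.
    from neutron_lib import exceptions as n_exc


    def _validate_router_migration(router_db, router_req):
        """Refuse a centralized<->distributed change on a router that is up."""
        distributed_changed = (
            'distributed' in router_req and
            router_req['distributed'] != router_db.extra_attributes.distributed)
        if distributed_changed and router_db.admin_state_up:
            raise n_exc.BadRequest(
                resource='router',
                msg='The distributed flag can only be changed when '
                    'admin_state_up is False.')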
14:52:27 <mlavalle> slaweq, liuyulong, njohnston_, cheng1: what do you think?
14:53:14 <davidsha> It might have been a mistaken use of the RFE tag, they don't seem to have any other bugs to their name.
14:54:15 <mlavalle> he was nudged by the bugs deputy of that week to classify it as an RFE
14:54:32 <mlavalle> that's my impression at least
14:54:55 <slaweq> it was marked as an RFE because it is trying to change existing API behaviour, I guess
14:55:13 <slaweq> other than that I agree that it should be like that
14:55:36 <slaweq> so it should require admin_state_up=False before migration
14:55:46 <davidsha> Ya, spotted that in the comments just there
14:56:38 <mlavalle> yes, and in fixing it we should properly document the change of behavior to what it should have been from the beginning
14:56:55 <slaweq> and now the question is: do we need a shim API extension to make this fix discoverable?
14:57:17 <liuyulong> agree with slaweq
14:57:21 <mlavalle> yes, slaweq, I think you are right
14:57:43 <mlavalle> as much as I dislike our extensions sprawl, in this case I think it is needed
14:57:50 <slaweq> :)
14:57:52 <mlavalle> good suggestion slaweq
14:57:53 <slaweq> yes
14:58:01 <slaweq> I agree that it's necessary
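(A shim extension in the neutron-lib api-definition style adds no resources or attributes; it only lets API clients discover that the server enforces the new behaviour. A rough sketch, where the name, alias and timestamp are placeholders rather than an agreed definition:)

    # Placeholder shim API definition; the alias, name and timestamp are not
    # final and only illustrate the discoverability mechanism.
    NAME = 'Router admin state down before DVR migration'
    ALIAS = 'router-admin-state-down-before-update'
    DESCRIPTION = ('The distributed flag of a router can only be changed '
                   'while admin_state_up is False.')
    UPDATED_TIMESTAMP = '2019-01-18T14:00:00-00:00'

    IS_SHIM_EXTENSION = True           # no new resources or attributes
    IS_STANDARD_ATTR_EXTENSION = False
    RESOURCE_ATTRIBUTE_MAP = {}
    SUB_RESOURCE_ATTRIBUTE_MAP = {}
    ACTION_MAP = {}
    REQUIRED_EXTENSIONS = ['router']   # the l3 extension alias
    OPTIONAL_EXTENSIONS = []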
14:58:16 <slaweq> also, I think it would be good to add scenario tests to https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_migration.py
14:58:24 <liuyulong> It seems the bug requests saving one step to accomplish that migration?
14:58:46 <slaweq> because now we are not testing whether migration is forbidden when the router doesn't have admin_state_up=False
14:58:53 <slaweq> and it would be useful IMHO
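(Something along these lines in test_migration.py could cover the missing negative case; the base class and client helper names mirror the neutron-tempest-plugin style but are assumptions, and the idempotent id is a placeholder.)

    # Sketch of a negative scenario test; helper names and the UUID are
    # placeholders, not existing code.
    from tempest.lib import decorators
    from tempest.lib import exceptions as lib_exc

    from neutron_tempest_plugin.api import base


    class RouterMigrationNegativeTest(base.BaseAdminNetworkTest):

        @decorators.attr(type='negative')
        @decorators.idempotent_id('00000000-0000-0000-0000-000000000000')
        def test_migration_rejected_while_admin_state_up(self):
            # A legacy router that is still administratively up.
            router = self.create_admin_router(distributed=False,
                                              admin_state_up=True)
            # Flipping the distributed flag must fail until the router is
            # set to admin_state_up=False.
            self.assertRaises(lib_exc.BadRequest,
                              self.admin_client.update_router,
                              router['id'], distributed=True)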
14:59:01 <mlavalle> good point slaweq
14:59:08 <mlavalle> so let's handle this as a bug
14:59:20 <mlavalle> and I will update it with your suggestions
14:59:25 <slaweq> thx
14:59:32 <njohnston_> +1
14:59:59 <mlavalle> thanks for your attendance. We'll do it all over again next week
15:00:07 <mlavalle> have a great weekend!
15:00:10 <slaweq> have a great weekend!
15:00:12 <slaweq> o/
15:00:14 <mlavalle> #endmeeting