#openstack-meeting log

22:04:44 <armax> #startmeeting neutron_drivers
22:04:45 <openstack> Meeting started Thu Jan 26 22:04:44 2017 UTC and is due to finish in 60 minutes.  The chair is armax. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:04:46 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:04:48 <openstack> The meeting name has been set to 'neutron_drivers'
22:04:52 <kevinbenton> NoneType has no attribute meeting_name
22:05:40 <njohnston> :)
22:06:10 <armax> hello folks
22:06:21 <armax> we’re no officially in feature freeze
22:06:25 <armax> as O-3 is out
22:06:37 <ihrachys> *now
22:06:44 <ihrachys> a slight different :)
22:06:46 <armax> ihrachys: that one
22:07:16 <armax> I’ll start preparing a postmortem document so that we can figure out what gets granted a FFE and what can’t
22:07:37 <armax> make sure the stuff you have looked at/reviewed/care about has up to date information, targets etc
22:07:49 <ihrachys> armax: also maybe starting looking at pre-release checklist would make sense, so that we avoid backports later
22:08:14 <armax> I’ll create RC1/Pike-1 milestone pages shortly and start rolling stuff over
22:09:06 <armax> I’ll also create an ocata-rc-potential bug tag if it doesn’t exist already so that we can flag what we want to make sure lands in time for ocata
22:09:14 <armax> ihrachys: excellent, we need that
22:10:18 <armax> anything else I am missing?
22:10:29 <ihrachys> gate-failure bugs, do they get FFE by default?
22:10:58 <armax> ihrachys: I’d say so
22:11:05 <ihrachys> good
22:11:13 <armax> about gate-failures, I wanted actually to talk about those briefly during this meeting
22:11:24 <armax> but before we do, anything else release related that comes to mind?
22:12:14 <ihrachys> how do we handle releases for e.g. ovn?
22:12:29 <ihrachys> does FFE apply to them?
22:13:25 <armax> ihrachys: the way this worked the last few times is that if we end up seeing blockers
22:13:38 <armax> especially introduced in neutron by kevinbenton, we’ll most likely know about them
22:13:42 <armax> and release accordingly :)
22:14:10 <ihrachys> no pain no gain :)
22:14:13 <kevinbenton> i'm not sure how i understand what we are talking about relates to OVN releases?
22:14:25 <kevinbenton> (terrible sentence)
22:14:35 <armax> but as for regular outstanding work, I’d defer to the team how they want to handle the merge process in the RC window
22:14:41 <kevinbenton> how are OVN releases related to me breaking the gate?
22:14:44 <armax> unless there’s something that shows up on the neutron dashboard
22:15:12 <armax> kevinbenton: remember when we broke a bunch of DHCP stuff last time?
22:15:28 <kevinbenton> yes
22:15:52 <ihrachys> kevinbenton: yeah I also try to figure. I was merely asking if they follow FFE rules, and if so, if we want stadium to follow the same procedures/timelines as we do for neutron core (btw is it now only neutron + fwaas?)
22:15:53 <armax> that led us to churn a few RCs for neutron
22:16:06 <kevinbenton> armax: for neutron, not for OVN
22:16:27 <armax> kevinbenton: they landed stop-gap in the meantime, remember?
22:16:46 <armax> that may lead to get a new release
22:16:51 <armax> but to ihrachys’ point
22:17:30 <armax> FFE process applies to anything that’s on the neutron launchpad dashboard
22:18:00 <armax> if there’s anything that pertains to networking-foo, that’ll have to show up on it for us to have a look at it
22:18:37 <ihrachys> meaning, we don't really enforce release process on e.g. networking-odl even though we seem to vouch on it
22:18:48 <ihrachys> due to them being in stadium
22:18:58 <armax> other than that, when we are about to push an RC release, we can look at the milestone-based projects and advance the hash for them
22:19:02 <armax> however...
22:19:18 <armax> for those projects that are currently on a release:independent schedule
22:19:50 <armax> I think we should make sure that those in the stadium end up being ready to cut a release soon-ish
22:20:14 <armax> rather than waiting until Pike-? for an ocata release
22:20:25 <armax> I’ll send a note out about this
22:20:30 <armax> ihrachys: does that make sense?
22:21:10 <ihrachys> yes. I would suggest we also try to advertise proper release practices to them, because otherwise they may continue landing disruptive patches only to learn later they need to stabilize in a week
22:21:42 <ihrachys> some are still under the delusion that they will have months+ after neutron final to prepare their release
22:22:10 <armax> ihrachys: I don’t believe they are that clueless, but we can surely remind that on the ML
22:22:40 <ihrachys> aye, I think we are good on stadium then for the most part
22:23:11 <armax> as I said, I’ll follow up on the ML, I noticed that dasm has already started the thread, and I’ll reply to his email
22:23:36 <armax> and provide more details on what I expect will happen between now and the time Ocata is officially released
22:23:40 <armax> sounds good?
22:24:08 <ihrachys> yes
22:24:12 <armax> sweet
22:24:25 <armax> kevinbenton, amotoki anything to add?
22:24:55 <armax> now about gate snafus
22:25:09 <kevinbenton> let's switch to pecan today. blogan wants to :)
22:25:53 <armax> can’t find the word to express the feeling that statement made me feel
22:26:06 <armax> :)
22:26:30 <ihrachys> lonely?
22:26:41 <armax> ihrachys: no, that’s not it
22:27:03 <armax> as for gate failures, I was looking at grafana right now
22:27:05 <armax> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
22:27:42 <armax> it seems most of the instability is in the xenial versions of the jobs
22:27:46 <armax> would you guys concur?
22:28:12 <armax> it’s fair that trusty runs on mitaka only at this point
22:28:29 <ihrachys> yeah, not many data points there
22:28:30 <armax> and we may not have many data points
22:28:34 <kevinbenton> i haven't been watching many mitaka runs
22:28:52 <armax> kevinbenton: they seemed to have landed fine from what I could tell
22:28:59 <ihrachys> xenial is probably a mix of libvirtd corruptions and oom-killer
22:29:22 <kevinbenton> nobody replied back about my swap observation with the oom thing
22:29:30 <armax> about the oom-killer, is there anything we can do there?
22:29:55 <ihrachys> kevinbenton: I think dasm was replying in the bug didn't he?
22:30:17 <armax> do we know if anyone in the infra team is looking at these?
22:30:57 <dasm> ihrachys: i was. didn't find anything yet
22:31:12 <ihrachys> for the record, the bug is https://bugs.launchpad.net/neutron/+bug/1656386
22:31:12 <openstack> Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
22:31:51 <ihrachys> dasm: "Now, I see it completely reversed." can you clarify what you mean? that you don't see swap used? or that it's different from what kevinbenton observed?
22:32:24 <dasm> i added this chart: https://imgur.com/a/59KYz
22:32:38 <kevinbenton> dasm: in that run swap never exceeds a couple of hundred MB
22:32:44 <ihrachys> armax: I dunno of anyone looking at it. does it suggest we are the only affected?
22:32:59 <dasm> first time when i've seen this issue, i've seen that swap was exceeding 2GBs
22:33:19 <armax> ihrachys: the entire gate is affected
22:33:31 <armax> ihrachys: the integrated gate is, I mean
22:33:36 <armax> we’re default now, remember? :)
22:33:40 <kevinbenton> dasm: we have up to 8GB though
22:33:49 <armax> looking at logstash
22:33:52 <dasm> ihrachys: i've found oom-killer just for tripleo jobs, but as comments, not cause of failure
22:33:54 <kevinbenton> oom killer shouldn't trigger with 6GB free
22:34:15 <armax> I see 39 events in neutron patches against 18 in nova, 15 in cinder and 10 in tempest
22:34:37 <dasm> kevinbenton: ram was consumed ~8GB, then swap rose to ~2GB and after a couple of minutes, oom-killer was triggered against mysql
22:34:58 <armax> ihrachys: it happened 21 times in the gate queue
22:35:01 <armax> and 110 times in check
22:35:21 <armax> it’s master only at this point
22:35:22 <kevinbenton> dasm: right, that's still bad behavior though. it just killed stuff with 6GB free in that case
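The swap figures being compared here (a couple of hundred MB in one run, about 2GB in another, out of up to 8GB) can be read from `/proc/meminfo` on a Linux node. The following is a minimal sketch, not something taken from the gate tooling; the function name `swap_usage_kib` is hypothetical, and it assumes the standard `SwapTotal` and `SwapFree` fields are present.

```python
def swap_usage_kib(meminfo_text):
    """Return (swap_total_kib, swap_used_kib) parsed from /proc/meminfo text.

    Hypothetical helper for illustration; assumes the usual
    'Key:   value kB' layout with SwapTotal and SwapFree present.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token is the numeric value; a trailing 'kB' unit is ignored.
            fields[key.strip()] = int(parts[0])
    total = fields["SwapTotal"]
    used = total - fields["SwapFree"]
    return total, used


if __name__ == "__main__":
    # Only works on Linux, where /proc/meminfo exists.
    with open("/proc/meminfo") as f:
        total, used = swap_usage_kib(f.read())
    print(f"swap used: {used} KiB of {total} KiB")
```

Sampling this periodically during a job would show whether the oom-killer fires while most of the swap is still free, which is the behavior kevinbenton describes.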
22:35:38 <armax> don’t see any occurrence in any stable branch
22:35:45 <dasm> kevinbenton: agree
22:36:00 <ihrachys> kevinbenton: something requested more than 6gb at once? :)
22:36:59 <kevinbenton> ihrachys: maybe. I would assume that would make it a prime target for the OOM killer
22:37:11 <ihrachys> nah that's not how oom-killer works :)
22:37:20 <ihrachys> I believe it's shooting processes at random
22:37:23 <kevinbenton> ihrachys: no
22:37:30 <kevinbenton> ihrachys: each process gets a score
22:37:54 <armax> ihrachys: as a matter of fact oom-killer is always picking up on mysql
22:37:57 <ihrachys> ok but the hungry doesn't necessarily get shot?
22:38:12 <kevinbenton> http://unix.stackexchange.com/questions/153585/how-the-oom-killer-decides-which-process-to-kill-first
22:38:18 <armax> a handful of times it picks something different and that’s why I see OOM kill traces with successful runs
22:38:51 <armax> making mysql less attractive for an oom-kill would be a way to try and mitigate the issue
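The per-process score kevinbenton mentions is exposed by the Linux kernel as `/proc/<pid>/oom_score`, and the adjustment that would make a process such as mysql less attractive is `/proc/<pid>/oom_score_adj` (range -1000 to 1000). The following is a minimal sketch for listing both, assuming a Linux `/proc` layout; the function name `read_oom_scores` and the `proc_root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return (pid, name, oom_score, oom_score_adj) tuples, highest score first.

    Hypothetical helper for illustration. Processes that exit while being
    read, or whose files cannot be read, are skipped.
    """
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a per-process directory
        try:
            name = (entry / "comm").read_text().strip()
            score = int((entry / "oom_score").read_text())
            adj = int((entry / "oom_score_adj").read_text())
        except (OSError, ValueError):
            continue
        rows.append((int(entry.name), name, score, adj))
    # The process with the highest oom_score is the most likely victim.
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for pid, name, score, adj in read_oom_scores()[:10]:
        print(f"{pid:>7} {name:<20} score={score} adj={adj}")
```

Writing a negative value to a process's `oom_score_adj` (which requires sufficient privileges) lowers its score. As kevinbenton notes next, that would shift the kill to another service rather than remove the memory pressure.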
22:39:30 <armax> however, if we want to do something within our control, all we can do is figure out how to reduce the neutron processes memory footprint
22:39:35 <kevinbenton> armax: well i assume in that case it would randomly just kill nova/neutron/cinder
22:39:38 <armax> kevinbenton and I looked at this
22:39:51 <armax> kevinbenton: it depends on the test run though
22:40:13 <armax> it could kill something that can either be restarted or not being used by the specific test at that specific time, no?
22:40:34 <kevinbenton> armax: who will restart it?
22:40:34 <smcginnis> fwiw, we chased oom issues in cinder for a while and never found a good way to isolate a root cause.
22:40:49 <smcginnis> Other than just things generally getting bigger and requiring more memory.
22:40:59 <kevinbenton> smcginnis: did you also notice that it's doing it way before swap space is exhausted?
22:41:04 <armax> kevinbenton: in neutron we have the processmonitor thingy
22:41:17 <kevinbenton> armax: the processmonitor thingy is for subprocesses of the agent
22:41:20 <smcginnis> kevinbenton: I can't remember specifics now, but there were some strange things with it.
22:41:26 <kevinbenton> armax: the neutron-server will not be restarted
22:41:31 <armax> kevinbenton: I am not saying that’s exactly how it happens
22:41:36 <armax> duh
22:41:38 <armax> :)
22:42:19 <armax> kevinbenton: I’ll find a successful run with an oom-kill to see what actually happened
22:42:27 <clarkb> apparently one potential reason for OOMkiller being invoked even with plenty of swap is if the kernel itself is requesting the memory
22:42:58 <kevinbenton> clarkb: i wonder if this kernel version is janky with our low swappiness setting that we put on the gate?
22:43:20 <armax> clarkb: do you think that trying to keep mysql memory footprint small would be a viable step, at the expense perhaps of running a bit slower in the gate?
22:43:27 <clarkb> or newer kernel in general?
22:43:44 <clarkb> armax: as I said in my email to the thread I think the ideal solution here is if we put openstack on more of a diet like heat did
22:44:00 <clarkb> they dropped memory consumption quite a bit over the last cycle or two
22:44:10 <armax> clarkb: yeah, trimming the memory footprint of the neutron processes is something we’re looking at
22:44:21 <armax> if we compound the effort across all projects we can definitely make a difference
22:44:25 <armax> but these things usually take time
22:44:42 <clarkb> kevinbenton: another possible cause is use of mlock
22:45:00 <kevinbenton> clarkb: would you be amenable to adjusting that swappiness setting upward to see if it improves things?
22:45:15 <clarkb> kevinbenton: ya I think devstack can do that
22:45:24 <armax> clarkb: we saw it in d-g
22:45:31 <kevinbenton> i will propose a change
22:45:38 <clarkb> oh I guess we already adjust in d-g ya
22:45:41 <armax> that swappiness is explicitly set towards the low end of the scale
22:45:45 <armax> and for a good reason too
22:46:07 <armax> http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/functions.sh#n432
22:46:13 <ihrachys> could it be nova doing something fancy with guest memory? like enabling mlock?
22:46:26 <armax> mriedem: ^
22:46:41 <armax> we’re brainstorming on the oom-kill failures we’re seeing lately
22:46:52 <armax> any idea about how nova might be hogging memory?
22:47:16 <ihrachys> (...and in such a way that swap is not fully utilized)
22:47:24 <mriedem> not really, no one has profiled it yet
22:47:30 <armax> so kevinbenton how high are you thinking of raising swappiness?
22:47:43 <armax> mriedem: ok
22:47:48 <armax> default is 60
22:48:09 <armax> clarkb: there are also a bunch of mysql settings that devstack sets
22:48:40 <armax> clarkb: what do you think of setting cache sizes as well?
22:48:56 <clarkb> mysql cache sizes? that's probably a better question for mordred or devananda
22:49:14 <clarkb> my guess is that they probably help quite a bit on instances with slower IO in the gate
22:49:44 <kevinbenton> https://review.openstack.org/425961
22:49:54 <armax> clarkb: I see we have lots of connections open
22:49:55 <armax> http://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/databases/mysql#n99
22:49:56 <kevinbenton> clarkb, armax: interesting comments above that
22:50:17 <armax> kevinbenton: you mean the IO one?
22:50:32 <kevinbenton> L426
22:50:57 <armax> anonymous-memory to file-backed mappings
22:51:44 <ihrachys> "kicking in on some processes despite swap being available;"
22:51:55 <dasm> kevinbenton: indeed, interesting. would fit with our use-case (although we're running ubuntu, afair)
22:52:23 <ihrachys> btw time check 8 mins
22:52:24 <clarkb> git blame should show us who ran into that but my guess is ianw when getting stuff running on fedora/centos
22:52:26 <kevinbenton> dasm: right, but setting swappiness to 10 here in the gate happens regardless of what we are running
22:52:26 <dasm> is it possible that something changed with the latest ubuntu?
22:52:27 <armax> ihrachys: agreed
22:52:47 <armax> even though I had a different agenda in mind, this turned out to be a good discussion I didn’t want to stop
22:53:13 <dasm> nah. if it would be problem with ubuntu, then we should probably start seeing this oom-killer earlier.
22:53:27 <kevinbenton> did you want to talk more about how we are going to switch to pecan and ovs firewall by default before feature freeze?
22:53:29 <armax> dasm: we started hammering on xenial only recently though
22:53:37 <kevinbenton> dasm: it could be related to a later kernel version in xenial
22:53:50 <armax> and if we are indeed increasing the memory of some services, it might be that it was just the last straw
22:54:13 <dasm> mhm
22:54:41 <armax> kevinbenton, clarkb: do we want to try and tweak mysql settings, propose the change and see what reactions people might have?
22:54:55 <clarkb> it's possible the weighting has changed on swappiness. Apparently such weighting has changed in the past too
22:55:03 <armax> or do we want to wait and see how the swappiness increase does?
22:55:15 <kevinbenton> armax: let's wait and see if this helps
22:55:17 <clarkb> armax: probably best to see if one toggle at a time makes a difference
22:55:31 <armax> clarkb: ok, I can start on the patch and put it in WIP
22:55:40 <kevinbenton> if it does, we've successfully swept the problem under the rug until the end of next cycle when the memory consumption of all of the services has doubled :)
22:55:50 <dasm> LOL
22:56:01 <armax> we’ll double the swappiness then!
22:56:05 <armax> no
22:56:21 <armax> but in all seriousness we should look at the memory footprints of our services
22:56:29 <armax> to see if in the long term we can trim some fat
22:57:27 <armax> ok, so we’re 3 mins until the top of the hour
22:57:31 <armax> I guess that’s about it
22:58:17 <ihrachys> oktnxbye
22:58:23 <dasm> o/
22:58:26 <armax> #endmeeting