22:04:44 #startmeeting neutron_drivers
22:04:45 Meeting started Thu Jan 26 22:04:44 2017 UTC and is due to finish in 60 minutes. The chair is armax. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:04:46 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:04:48 The meeting name has been set to 'neutron_drivers'
22:04:52 NoneType has no attribute meeting_name
22:05:40 :)
22:06:10 hello folks
22:06:21 we’re no officially in feature freeze
22:06:25 as O-3 is out
22:06:37 *now
22:06:44 a slight difference :)
22:06:46 ihrachys: that one
22:07:16 I’ll start preparing a postmortem document so that we can figure out what gets granted an FFE and what doesn’t
22:07:37 make sure the stuff you have looked at/reviewed/care about has up-to-date information, targets etc
22:07:49 armax: also maybe starting to look at the pre-release checklist would make sense, so that we avoid backports later
22:08:14 I’ll create RC1/Pike-1 milestone pages shortly and start rolling stuff over
22:09:06 I’ll also create an ocata-rc-potential bug tag if it doesn’t exist already so that we can flag what we want to make sure lands in time for ocata
22:09:14 ihrachys: excellent, we need that
22:10:18 anything else I am missing?
22:10:29 gate-failure bugs, do they get FFE by default?
22:10:58 ihrachys: I’d say so
22:11:05 good
22:11:13 about gate-failures, I actually wanted to talk about those briefly during this meeting
22:11:24 but before we do, anything else release related that comes to mind?
22:12:14 how do we handle releases for e.g. ovn?
22:12:29 does FFE apply to them?
22:13:25 ihrachys: the way this worked the last few times is that if we end up seeing blockers
22:13:38 especially introduced in neutron by kevinbenton, we’ll most likely know about them
22:13:42 and release accordingly :)
22:14:10 no pain no gain :)
22:14:13 i'm not sure how i understand what we are talking about relates to OVN releases?
22:14:25 (terrible sentence)
22:14:35 but as for regular outstanding work, I’d defer to the team how they want to handle the merge process in the RC window
22:14:41 how are OVN releases related to me breaking the gate?
22:14:44 unless there’s something that shows up on the neutron dashboard
22:15:12 kevinbenton: remember when we broke a bunch of DHCP stuff last time?
22:15:28 yes
22:15:52 kevinbenton: yeah I'm also trying to figure that out. I was merely asking if they follow FFE rules, and if so, if we want stadium to follow the same procedures/timelines as we do for neutron core (btw is it now only neutron + fwaas?)
22:15:53 that led us to churn a few RCs for neutron
22:16:06 armax: for neutron, not for OVN
22:16:27 kevinbenton: they landed a stop-gap in the meantime, remember?
22:16:46 that may lead to a new release
22:16:51 but to ihrachys’ point
22:17:30 FFE process applies to anything that’s on the neutron launchpad dashboard
22:18:00 if there’s anything that pertains to networking-foo, that’ll have to show up on it for us to have a look at it
22:18:37 meaning, we don't really enforce release process on e.g. networking-odl even though we seem to vouch for it
22:18:48 due to them being in stadium
22:18:58 other than that, when we are about to push an RC release, we can look at the milestone-based projects and advance the hash for them
22:19:02 however...
22:19:18 for those projects that are currently on a release:independent schedule
22:19:50 I think we should make sure that those in the stadium end up being ready to cut a release soon-ish
22:20:14 rather than waiting until Pike-? for an ocata release
22:20:25 I’ll send a note out about this
22:20:30 ihrachys: does that make sense?
22:21:10 yes.
I would suggest we also try to advertise proper release practices to them, because otherwise they may continue landing disruptive patches only to learn later they need to stabilize in a week
22:21:42 some are still under the delusion that they will have months+ after neutron final to prepare their release
22:22:10 ihrachys: I don’t believe they are that clueless, but we can surely remind them of that on the ML
22:22:40 aye, I think we are good on stadium then for the most part
22:23:11 as I said, I’ll follow up on the ML, I noticed that dasm has already started the thread, and I’ll reply to his email
22:23:36 and provide more details on what I expect will happen between now and the time Ocata is officially released
22:23:40 sounds good?
22:24:08 yes
22:24:12 sweet
22:24:25 kevinbenton, amotoki anything to add?
22:24:55 now about gate snafus
22:25:09 let's switch to pecan today. blogan wants to :)
22:25:53 can’t find the word to express the feeling that statement made me feel
22:26:06 :)
22:26:30 lonely?
22:26:41 ihrachys: no, that’s not it
22:27:03 as for gate failures, I was looking at grafana right now
22:27:05 #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
22:27:42 it seems most of the instability is in the xenial versions of the jobs
22:27:46 would you guys concur?
22:28:12 it’s fair that trusty runs on mitaka only at this point
22:28:29 yeah, not many data points there
22:28:30 and we may not have many data points
22:28:34 i haven't been watching many mitaka runs
22:28:52 kevinbenton: they seemed to have landed fine from what I could tell
22:28:59 xenial is probably a mix of libvirtd corruptions and oom-killer
22:29:22 nobody replied back about my swap observation with the oom thing
22:29:30 about the oom-killer, is there anything we can do there?
22:29:55 kevinbenton: I think dasm was replying in the bug, wasn't he?
22:30:17 do we know if anyone in the infra team is looking at these?
22:30:57 ihrachys: i was.
didn't find anything yet
22:31:12 for the record, the bug is https://bugs.launchpad.net/neutron/+bug/1656386
22:31:12 Launchpad bug 1656386 in neutron "Memory leaks on Neutron jobs" [Critical,Confirmed] - Assigned to Darek Smigiel (smigiel-dariusz)
22:31:51 dasm: "Now, I see it completely reversed." can you clarify what you mean? that you don't see swap used? or that it's different from what kevinbenton observed?
22:32:24 i added this chart: https://imgur.com/a/59KYz
22:32:38 dasm: in that run swap never exceeds a couple of hundred MB
22:32:44 armax: I dunno of anyone looking at it. does it suggest we are the only ones affected?
22:32:59 the first time i saw this issue, i saw that swap was exceeding 2GB
22:33:19 ihrachys: the entire gate is affected
22:33:31 ihrachys: the integrated gate is, I mean
22:33:36 we’re default now, remember? :)
22:33:40 dasm: we have up to 8GB though
22:33:49 looking at logstash
22:33:52 ihrachys: i've found oom-killer just for tripleo jobs, but as comments, not cause of failure
22:33:54 oom killer shouldn't trigger with 6GB free
22:34:15 I see 39 events in neutron patches against 18 in nova, 15 in cinder and 10 in tempest
22:34:37 kevinbenton: ram consumption reached ~8GB, then swap rose to ~2GB and after a couple of minutes, oom-killer was triggered against mysql
22:34:58 ihrachys: it happened 21 times in the gate queue
22:35:01 and 110 times in check
22:35:21 it’s master only at this point
22:35:22 dasm: right, that's still bad behavior though. it just killed stuff with 6GB free in that case
22:35:38 don’t see any occurrence in any stable branch
22:35:45 kevinbenton: agree
22:36:00 kevinbenton: something requested more than 6gb at once? :)
22:36:59 ihrachys: maybe.
I would assume that would make it a prime target for the OOM killer
22:37:11 nah that's not how oom-killer works :)
22:37:20 I believe it's shooting processes at random
22:37:23 ihrachys: no
22:37:30 ihrachys: each process gets a score
22:37:54 ihrachys: as a matter of fact oom-killer is always picking on mysql
22:37:57 ok but the hungriest doesn't necessarily get shot?
22:38:12 http://unix.stackexchange.com/questions/153585/how-the-oom-killer-decides-which-process-to-kill-first
22:38:18 a handful of times it picks something different and that’s why I see OOM kill traces with successful runs
22:38:51 making mysql less attractive for an oom-kill would be a way to try and mitigate the issue
22:39:30 however, if we want to do something within our control, all we can do is figure out how to reduce the neutron processes’ memory footprint
22:39:35 armax: well i assume in that case it would randomly just kill nova/neutron/cinder
22:39:38 kevinbenton and I looked at this
22:39:51 kevinbenton: it depends on the test run though
22:40:13 it could kill something that can either be restarted or is not being used by the specific test at that specific time, no?
22:40:34 armax: who will restart it?
22:40:34 fwiw, we chased oom issues in cinder for a while and never found a good way to isolate a root cause.
22:40:49 Other than just things generally getting bigger and requiring more memory.
22:40:59 smcginnis: did you also notice that it's doing it way before swap space is exhausted?
22:41:04 kevinbenton: in neutron we have the processmonitor thingy
22:41:17 armax: the processmonitor thingy is for subprocesses of the agent
22:41:20 kevinbenton: I can't remember specifics now, but there were some strange things with it.
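A minimal sketch of how the per-process OOM score mentioned above could be inspected, assuming a Linux host where the kernel exposes `oom_score`, `oom_score_adj` and `comm` under `/proc/<pid>/`. The function name `rank_oom_candidates` and its `proc_root` parameter are hypothetical, chosen for illustration only; nothing like this was run in the meeting.

```python
import os


def rank_oom_candidates(proc_root="/proc", limit=5):
    """Return up to `limit` (score, adj, pid, name) tuples, highest score first.

    Hypothetical helper: it only reads the per-process files the kernel
    already exposes; it does not change any OOM-killer behaviour.
    """
    candidates = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited mid-scan or a file was unreadable
        candidates.append((score, adj, int(entry), name))
    # The process with the highest score is the most likely victim.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:limit]
```

Calling `rank_oom_candidates()` on a gate node would show whether mysql really sits at the top of the list. Writing a negative value (down to -1000) into a process's `oom_score_adj` lowers its score, which is one way to make mysql a less attractive target.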
22:41:26 armax: the neutron-server will not be restarted
22:41:31 kevinbenton: I am not saying that’s exactly how it happens
22:41:36 duh
22:41:38 :)
22:42:19 kevinbenton: I’ll find a successful run with an oom-kill to see what actually happened
22:42:27 apparently one potential reason for OOMkiller being invoked even with plenty of swap is if the kernel itself is requesting the memory
22:42:58 clarkb: i wonder if this kernel version is janky with our low swappiness setting that we put on the gate?
22:43:20 clarkb: do you think that trying to keep mysql memory footprint small would be a viable step, at the expense perhaps of running a bit slower in the gate?
22:43:27 or newer kernel in general?
22:43:44 armax: as I said in my email to the thread I think the ideal solution here is if we put openstack on more of a diet like heat did
22:44:00 they dropped memory consumption quite a bit over the last cycle or two
22:44:10 clarkb: yeah, trimming the memory footprint of the neutron processes is something we’re looking at
22:44:21 if we compound the effort across all projects we can definitely make a difference
22:44:25 but these things usually take time
22:44:42 kevinbenton: another possible cause is use of mlock
22:45:00 clarkb: would you be amenable to adjusting that swappiness setting upward to see if it improves things?
22:45:15 kevinbenton: ya I think devstack can do that
22:45:24 clarkb: we saw it in d-g
22:45:31 i will propose a change
22:45:38 oh I guess we already adjust in d-g ya
22:45:41 that swappiness is explicitly set towards the low end of the scale
22:45:45 and for a good reason too
22:46:07 http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/functions.sh#n432
22:46:13 could it be nova doing something fancy with guest memory? like enabling mlock?
22:46:26 mriedem: ^
22:46:41 we’re brainstorming on the oom-kill failures we’re seeing lately
22:46:52 any idea about how nova might be hogging memory?
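A minimal sketch for checking the two numbers this part of the discussion keeps coming back to: the current swappiness value and how much swap is actually in use. It assumes a Linux host exposing `/proc/sys/vm/swappiness` and `/proc/meminfo`; the helper names `parse_meminfo`, `swap_used_kb` and `read_swappiness` are hypothetical and not part of devstack or devstack-gate.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {field: integer value}.

    Most fields are reported in kB; counters such as HugePages_Total
    carry no unit and are returned as plain integers.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(meminfo):
    """Swap currently in use, in kB, from a parsed meminfo dict."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value as an integer."""
    with open(path) as f:
        return int(f.read().strip())
```

On a node under test, `swap_used_kb(parse_meminfo(open("/proc/meminfo").read()))` sampled alongside `read_swappiness()` would give a rough way to compare swap usage before and after a swappiness change.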
22:47:16 (...and in such a way that swap is not fully utilized)
22:47:24 not really, no one has profiled it yet
22:47:30 so kevinbenton how high are you thinking of raising swappiness?
22:47:43 mriedem: ok
22:47:48 default is 60
22:48:09 clarkb: there are also a bunch of mysql settings that devstack sets
22:48:40 clarkb: what do you think of setting cache sizes as well?
22:48:56 mysql cache sizes? that's probably a better question for mordred or devananda
22:49:14 my guess is that they probably help quite a bit on instances with slower IO in the gate
22:49:44 https://review.openstack.org/425961
22:49:54 clarkb: I see we have lots of connections open
22:49:55 http://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/databases/mysql#n99
22:49:56 clarkb, armax: interesting comments above that
22:50:17 kevinbenton: you mean the IO one?
22:50:32 L426
22:50:57 anonymous-memory to file-backed mappings
22:51:44 "kicking in on some processes despite swap being available;"
22:51:55 kevinbenton: indeed, interesting. would fit with our use-case (although we're running ubuntu, afair)
22:52:23 btw time check 8 mins
22:52:24 git blame should show us who ran into that but my guess is ianw when getting stuff running on fedora/centos
22:52:26 dasm: right, but setting swappiness to 10 here in the gate happens regardless of what we are running
22:52:26 is it possible that something changed with the latest ubuntu?
22:52:27 ihrachys: agreed
22:52:47 even though I had a different agenda in mind, this turned out to be a good discussion I didn’t want to stop
22:53:13 nah. if it were a problem with ubuntu, then we should probably have started seeing this oom-killer earlier.
22:53:27 did you want to talk more about how we are going to switch to pecan and ovs firewall by default before feature freeze?
22:53:29 dasm: we started hammering on xenial only recently though
22:53:37 dasm: it could be related to a later kernel version in xenial
22:53:50 and if we are indeed increasing the memory of some services, it might be that it was just the last straw
22:54:13 mhm
22:54:41 kevinbenton, clarkb: do we want to try and tweak mysql settings, propose the change and see what reactions people might have?
22:54:55 it's possible the weighting has changed on swappiness. Apparently such weighting has changed in the past too
22:55:03 or do we want to wait and see how the swappiness increase does?
22:55:15 armax: let's wait and see if this helps
22:55:17 armax: probably best to see if one toggle at a time makes a difference
22:55:31 clarkb: ok, I can start on the patch and put it in WIP
22:55:40 if it does, we've successfully swept the problem under the rug until the end of next cycle when the memory consumption of all of the services has doubled :)
22:55:50 LOL
22:56:01 we’ll double the swappiness then!
22:56:05 no
22:56:21 but in all seriousness we should look at the memory footprints of our services
22:56:29 to see if in the long term we can trim some fat
22:57:27 ok, so we’re 3 mins until the top of the hour
22:57:31 I guess that’s about it
22:58:17 oktnxbye
22:58:23 o/
22:58:26 #endmeeting