16:00:54 <ihrachys> #startmeeting neutron_ci
16:00:55 <openstack> Meeting started Tue Feb  7 16:00:54 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:58 <openstack> The meeting name has been set to 'neutron_ci'
16:01:12 <ihrachys> hi everyone
16:01:15 <sindhu> hi
16:01:46 * ihrachys gives a minute for everyone to gather around a fire in a circle
16:01:58 <manjeets> hi
16:02:44 * haleyb gets some marshmallows for the fire
16:03:32 <ihrachys> ok let's get it started. hopefully people will get on board. :)
16:03:52 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:04:21 <ihrachys> I guess we can start with looking at action items from the previous meeting
16:04:30 <ihrachys> #topic Action items from previous meeting
16:04:47 <ihrachys> "armax to make sure periodic functional test job shows up in grafana"
16:05:07 <ihrachys> armax: I still don't see the job in periodic dashboard in grafana. any news on that one?
16:06:05 <ihrachys> ok, I guess armax is not up that early. I will follow up with him offline.
16:06:26 <ihrachys> #action ihrachys to follow up with armax on periodic functional job not showing up in grafana
16:06:33 <ihrachys> next is: "ihrachys to follow up on elastic-recheck with e-r cores"
16:07:18 <ihrachys> so I sent this email to get some answers from e-r folks: http://lists.openstack.org/pipermail/openstack-dev/2017-January/111288.html
16:08:43 <ihrachys> mtreinish gave some answers; basically, 1) check queue is as eligible for e-r queries as gate one so functional job can be captured with it; 2) adding cores is a matter of doing some reviews before getting a hammer; 3) there is a recheck bot that we can enable for neutron channel if we feel a need for that.
16:09:17 <ihrachys> so from our perspective, we should be fine pushing more queries into the repo, and we should probably try to get some review weight in the repo
16:10:15 <ihrachys> #action ihrachys to look at e-r bot for openstack-neutron channel
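For reference when writing those queries: an elastic-recheck query is essentially a Lucene query string run against the logstash Elasticsearch index, so a candidate fingerprint can be sanity-checked from Python before proposing it. A minimal sketch; the message text and the logstash endpoint URL below are illustrative assumptions, not an actual query from the repo:

    import requests

    # Hypothetical fingerprint for a functional job failure; both the message
    # string and the endpoint are assumptions used only to show the shape of a query.
    query = 'message:"ofctl request timed out" AND build_name:"gate-neutron-dsvm-functional"'
    resp = requests.get(
        "http://logstash.openstack.org/elasticsearch/_search",
        params={"q": query, "size": 0},
        timeout=30,
    )
    resp.raise_for_status()
    # The hit count indicates how often the fingerprint matches recent runs.
    print("hits:", resp.json()["hits"]["total"])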
16:10:18 <manjeets> ihrachys, question : do we have a dashboard for ci specific patches ?
16:10:55 <ihrachys> manjeets: we don't, though we have LP that captures bugs with gate-failure and similar tags
16:11:05 <mtreinish> ihrachys: I said check queue jobs are covered by e-r queries, but we normally only add queries for failures that occur in gate jobs. If a failure only shows up in the check queue, filtering out noise from broken patches makes it tricky
16:11:10 <armax> ihrachys: I am here
16:11:13 <armax> but in another meeting
16:11:20 <ihrachys> manjeets: we probably can look at writing something to generate such a dashboard
16:11:33 <mtreinish> ihrachys: I can give you a hand updating the e-r bot config, it should be a simple yaml change
16:11:41 <electrocucaracha> ihrachys: do we have a criteria for adding a new query to e-r, like more than x number of hits?
16:11:51 <ihrachys> mtreinish: that's the price of having the functional neutron job in check only. we may want to revisit that.
16:12:05 <manjeets> ihrachys, yes thanks
16:12:26 <ihrachys> electrocucaracha: I don't think we have it, but the guidelines would be - it's some high profile gate bug, and we can't quickly come up with a fix.
16:12:41 <ihrachys> manjeets: do you want to look at creating such a dashboard?
16:12:55 <manjeets> ihrachys, sure i'll take a look
16:13:04 <ihrachys> manjeets: rossella_s wrote one for patches targeted for next release, you probably could reuse her work
16:13:28 <mtreinish> electrocucaracha: if you want to get your feet wet everything here http://status.openstack.org/elastic-recheck/data/integrated_gate.html needs categorization
16:13:31 <electrocucaracha> ihrachys: regarding point 3, it seems like it only requires adding a new entry in the yaml file https://github.com/openstack-infra/project-config/blob/master/gerritbot/channels.yaml#L831
16:13:34 <manjeets> cool, thanks for the example
16:13:45 <ihrachys> manjeets: see Gerrit Dashboard Links at the top at http://status.openstack.org/reviews/
16:13:56 <electrocucaracha> thanks mtreinish
16:14:03 <rossella_s> ihrachys, manjeets I can help with that
16:14:17 <mtreinish> if you find a fingerprint for any of those failures thats something we'd definitely accept an e-r query for
16:14:44 <ihrachys> #action manjeets to produce a gerrit dashboard for gate and functional failures
16:14:47 <manjeets> thanks rossella_s  i'll go through and will ping you if any help needed
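For whoever picks up that dashboard action item: a Gerrit dashboard is just a specially encoded URL, so a small script can generate one from a set of named queries. A rough sketch, assuming the standard /#/dashboard/ URL format on review.openstack.org; the section queries themselves are placeholders to adapt to whatever CI tags we settle on:

    from urllib.parse import quote

    BASE = "https://review.openstack.org/#/dashboard/?"

    # Placeholder sections; the first two keys (title, foreach) are the
    # standard dashboard parameters, the rest become named sections.
    sections = [
        ("title", "Neutron CI patches"),
        ("foreach", "project:openstack/neutron status:open"),
        ("Gate failure fixes", "message:gate-failure"),
        ("Functional test changes", "file:^neutron/tests/functional/.*"),
    ]
    url = BASE + "&".join(
        "%s=%s" % (quote(k, safe=""), quote(v, safe="")) for k, v in sections
    )
    print(url)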
16:15:04 <mtreinish> electrocucaracha: that's actually not the correct irc bot. The elastic recheck config lives in the puppet-elastic_recheck repo
16:15:11 <mtreinish> I linked to it on my ML post
16:15:25 <ihrachys> armax: np, I will ping you later to see what we can do with the dashboard
16:15:59 <ihrachys> overall, some grafana dashboards are in bad shape, we may need to have a broader look at them
16:16:48 <ihrachys> ok next action was: "ihrachys to follow up with infra on forbidding bare gerrit rechecks"
16:17:01 <ihrachys> I haven't actually followed up with infra, though I checked project-config code
16:17:09 <mtreinish> ihrachys: if you have ideas on how to make: http://status.openstack.org/openstack-health/#/g/project/openstack~2Fneutron more useful that's something we should work on too
16:17:20 <ihrachys> basically, the gerrit comment recheck filter is per queue, not per project
16:18:03 <ihrachys> here: https://github.com/openstack-infra/project-config/blob/master/zuul/layout.yaml#L20
16:18:33 <ihrachys> so I believe it would require some more work on infra side to make it per project. but I will still check with infra to make sure.
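To make the "no bare rechecks" idea concrete: the zuul trigger is just a regex applied to the gerrit comment, so per-project enforcement would amount to requiring a reason (e.g. a bug reference) after the keyword. A toy sketch of that matching logic; the real regex zuul uses lives in the layout.yaml linked above, this one is deliberately simplified:

    import re

    # Accept "recheck bug 1643911" or "recheck <reason>", reject a bare "recheck".
    RECHECK_WITH_REASON = re.compile(
        r"^\s*recheck\s+(bug\s+#?\d+|\S.*)$", re.IGNORECASE | re.MULTILINE)
    BARE_RECHECK = re.compile(r"^\s*recheck\s*$", re.IGNORECASE | re.MULTILINE)

    def recheck_allowed(comment):
        """Return True only when the recheck comment carries a reason."""
        if BARE_RECHECK.search(comment):
            return False
        return bool(RECHECK_WITH_REASON.search(comment))

    assert not recheck_allowed("recheck")
    assert recheck_allowed("recheck bug 1643911")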
16:19:14 <ihrachys> mtreinish: what do you mean? is it somehow project specific? or do you mean just general improvements that may help everyone?
16:19:59 <mtreinish> you were talking about dashboards, and I would like to make sure your needs are being met with openstack-health. whatever improvements neutron needs would likely benefit everyone
16:20:14 <mtreinish> so instead of doing it in a corner, I just wanted to see if there was space to work on that in o-h
16:20:20 <ihrachys> you see -health as a replacement for grafana?
16:20:53 <mtreinish> I see it as something that can consume grafana as necessary. But I want to unify all the test results dashboards to a single place
16:21:01 <mtreinish> instead of jumping around between a dozen web pages
16:22:42 <ihrachys> makes sense. it's just that we've been set on grafana so far; it's probably time to consider -health first for any new ideas.
16:23:13 <ihrachys> ok, next action was "ihrachys check if we can enable dstat logs for functional job"
16:23:40 <ihrachys> that's to track system load during functional test runs that sometimes produce timeouts in ovsdb native
16:23:52 <ihrachys> I posted this https://review.openstack.org/427358, please have a look
16:25:05 <ihrachys> on a related note, I also posted https://review.openstack.org/427362 to properly index per-testcase messages in logstash; that should also help us with elastic-recheck queries, and overall with understanding the impact of some failures
16:25:07 <manjeets> ihrachys, where will it dump the logs? a separate screen window?
16:25:27 <manjeets> i mean separate file ?
16:25:34 <ihrachys> sorry, that's not the project-config patch; I meant to post https://review.openstack.org/430316 instead
16:25:48 <ihrachys> manjeets: yes, it should go in screen-dstat as in devstack runs
16:26:22 <manjeets> ohk
16:26:25 <ihrachys> finally, there is a peakmem-tracker service in devstack that I'm trying to enable here: https://review.openstack.org/430289 (not sure it will even work; I haven't found other repos that use the service)
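For context on what the dstat change buys us: devstack simply runs dstat in the background and logs its output to the screen-dstat window, which can then be attached to job artifacts. A minimal sketch of the same idea for a local functional-test run; the exact flags and the tox environment name are assumptions roughly matching what devstack does:

    import subprocess

    # Sample system stats every 5 seconds for the duration of a test run and
    # write them to a log file we can keep alongside the test results.
    logfile = open("dstat.log", "w")
    dstat = subprocess.Popen(
        ["dstat", "-tcmndrylpg", "5"],  # flags approximate devstack's dstat invocation
        stdout=logfile,
        stderr=subprocess.STDOUT,
    )
    try:
        subprocess.call(["tox", "-e", "dsvm-functional"])  # the run being observed
    finally:
        dstat.terminate()
        logfile.close()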
16:26:53 <ihrachys> finally, the last action item is "jlibosva to explore what broke scenario job"
16:27:01 <ihrachys> sadly I don't see Jakub
16:27:39 <ihrachys> but afaik the failures were related to bad ubuntu image contents
16:27:55 <ihrachys> so we pinned the image with https://review.openstack.org/#/c/425165/
16:28:24 <ihrachys> and Jakub also has a patch to enable console logging for connectivity failures in scenario jobs: https://review.openstack.org/#/c/427312/, that one needs second +2, please review
16:29:16 <ihrachys> sadly grafana shows that scenario jobs are still at 80% to 100% failure rate, something that did not happen even a month ago
16:29:29 <ihrachys> so there is still something to follow up on
16:29:39 <ihrachys> #action jlibosva to follow up on scenario failures
16:30:13 <ihrachys> overall, those jobs will need to go voting, or we'll end up with another fullstack-style job that breaks every once in a while :)
16:32:11 <ihrachys> ok let's have a look at bugs now
16:32:28 <ihrachys> #topic Known gate failures
16:32:39 <ihrachys> #link https://goo.gl/IYhs7k Confirmed/In progress bugs
16:33:05 <ihrachys> ok, so first is ovsdb native timeout
16:33:26 <ihrachys> otherwiseguy was kind to produce some patch that hopefully mitigates the issue: https://review.openstack.org/#/c/429095/
16:33:32 <ihrachys> and it already has +W, nice
16:34:01 <ihrachys> there is a backport for the patch for Ocata found in: https://review.openstack.org/#/q/I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e,n,z
16:34:30 <ihrachys> so far rechecks in gerrit show some good results
16:34:42 <ihrachys> we will monitor the failure rate after it lands
16:36:17 <ihrachys> another bug that lingers in our gates is bug 1643911
16:36:17 <openstack> bug 1643911 in OpenStack Compute (nova) "libvirt randomly crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd': malloc(): memory corruption:"" [Medium,Confirmed] https://launchpad.net/bugs/1643911
16:37:21 <ihrachys> the last time I checked, armax suspected it to be the same as the oom-killer spree bug in gate, something discussed extensively in http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html
16:38:02 <ihrachys> armax made several attempts to lower memory footprint for neutron, like the one merged https://review.openstack.org/429069
16:38:17 <ihrachys> it's not a complete solution, but hopefully buys us some time
16:38:48 <ihrachys> there is an action item to actually run a memory profiler against neutron services and see what takes up the most memory
16:39:12 <ihrachys> afaiu armax won't have time in the next few days for that, so, anyone willing to try it out?
16:41:03 <ihrachys> I may give some guidance if you are hesitant about tools to try :)
16:41:22 <ihrachys> anyway, reach out if you have cycles for this high-profile assignment :)
16:41:51 <manjeets> wouldn't enabling dstat help?
16:42:52 <ihrachys> it will give us info about how the system behaved while tests were running, but it won't give us info on which data structures use the memory
16:43:05 <manjeets> ohk gotcha
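For whoever picks up the profiling item: the stdlib tracemalloc module is one straightforward way to see which allocation sites dominate a process. A minimal standalone sketch of the general pattern, not wired into neutron; the workload line is just a stand-in for real API traffic:

    import tracemalloc

    tracemalloc.start(25)  # keep 25 frames per allocation for useful tracebacks

    # Stand-in workload; in practice this would be the service handling requests.
    data = [{"port": i, "name": "p%d" % i} for i in range(100000)]

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        # Top allocation sites by total size, with file:line attribution.
        print(stat)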
16:44:10 <armax> ihrachys: the devstack changes landed at last
16:44:26 <armax> preliminary logstash results seem promising
16:44:29 <ihrachys> armax: do we see rate going down?
16:44:33 <armax> but that only bought us a few more days
16:44:34 <ihrachys> mmm, good
16:44:55 <ihrachys> armax: few more days? you are optimistic about what the community can achieve in such a short time :P
16:45:03 <armax> ihrachys: I haven’t looked in great detail, but the last failure was yesterday lunchtime PST
16:45:10 <ihrachys> what was it? like 300 MB freed?
16:45:25 <armax> ihrachys: between 350 and 400 MB of RSS memory, yes
16:45:57 <armax> some might be shared, but it should be enough to push the ceiling a bit further up and help avoid oom-kills and libvirt barfing all over the place
16:46:10 <mtreinish> armax: fwiw, harlowja and I have been playing with tracemalloc to try and profile the memory usage
16:46:19 <armax> mtreinish: nice
16:46:29 <armax> mtreinish: for libvirt?
16:47:05 <mtreinish> well I started with turning the profiling on for the neutron api server
16:47:43 <mtreinish> https://review.openstack.org/#/q/status:open+topic:tracemalloc
16:48:20 <ihrachys> mtreinish: any results to consume so far? I see all red in neutron patch.
16:48:48 <mtreinish> the neutron patch won't work, it's just a DNM to set up stuff for testing. Look at the oslo.service patch's tempest job
16:48:52 <mtreinish> there are memory snapshots there
16:48:56 <mtreinish> if you want your browser to hate you, this kinda thing is the end goal: http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg
16:49:24 * ihrachys puts a life vest on and clicks
16:49:24 <mtreinish> we're still debugging the memory snapshot collection, because what we're collecting doesn't match what ps says the process is consuming
16:49:41 <mtreinish> ihrachys: heh, it's 26MB IIRC
16:50:19 <ihrachys> cool, let me capture the link in the notes
16:50:39 <ihrachys> #link https://review.openstack.org/#/q/status:open+topic:tracemalloc Attempt to trace memory usage for Neutron
16:51:01 <ihrachys> #link http://blog.kortar.org/wp-content/uploads/2017/02/flame.svg Neutron API server memory trace attempt
16:51:08 <armax> mtreinish: would applying the same patches to nova help correlate/spot a pattern?
16:51:54 <ihrachys> armax: and maybe also a service that does not seem to be as obese
16:51:56 <mtreinish> armax: that's the theory
16:52:19 <mtreinish> once we can figure out a successful pattern for collecting and visualizing where things are eating memory we can apply it to all the things
16:52:38 <armax> mtreinish: understood
16:52:43 <armax> mtreinish: thanks for looking into this
16:52:50 <ihrachys> is the tracer usable in gate? does it slow down/destabilize jobs?
16:52:59 <armax> I have been struggling to find time to go deeper into this headache
16:53:09 * mtreinish prepares his little vm for the traffic flood
16:53:33 <mtreinish> ihrachys: there didn't seem to be too much of an overhead, but I wasn't watching it closely
16:53:51 <mtreinish> it's still very early in all of this (I just started playing with it yesterday afternoon :) )
16:54:55 <ihrachys> gotcha, thanks a lot for taking it upon yourself
16:55:19 <ihrachys> on a related note, I am not sure we got to the root of why we don't use all the swap
16:55:37 <mtreinish> ihrachys, armax: oh if you want to generate that flame graph locally: http://paste.openstack.org/show/597889/
16:55:45 <ihrachys> we played with the swappiness knob to no effect, I believe: https://review.openstack.org/#/c/425961/
16:55:55 <mtreinish> just take the snapshots from the oslo.service log dir
16:58:03 <ihrachys> #action ihrachys to read about how swappiness is supposed to work, and why it doesn't in gate
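As a starting point for that swappiness follow-up: the relevant knobs are all readable from /proc, so a quick check on a held gate node could be as simple as the sketch below (standard Linux paths; the interesting comparison is the swappiness setting versus actual swap usage):

    def read(path):
        with open(path) as f:
            return f.read().strip()

    # vm.swappiness controls how aggressively the kernel swaps anonymous pages;
    # comparing it with the swap counters shows whether the setting has any effect.
    print("vm.swappiness =", read("/proc/sys/vm/swappiness"))
    for line in read("/proc/meminfo").splitlines():
        if line.startswith(("SwapTotal", "SwapFree", "MemAvailable")):
            print(line)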
16:58:25 <ihrachys> #topic Open discussion
16:58:43 <ihrachys> we are almost at the top of the hour. anything worth mentioning before we wrap up?
17:00:09 <ihrachys> ok thanks everyone for joining, and working on making the gate great again
17:00:09 <ihrachys> #endmeeting