17:00:17 <oomichi> #startmeeting qa
17:00:22 <openstack> Meeting started Thu Feb  2 17:00:17 2017 UTC and is due to finish in 60 minutes.  The chair is oomichi. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:26 <openstack> The meeting name has been set to 'qa'
17:00:33 <jordanP> hello !
17:00:38 <mtreinish> o/
17:00:38 <oomichi> hello, who can join today?
17:00:45 <afazekas> o/
17:00:50 <oomichi> oh, my typing is slow
17:00:52 <andreaf> o/
17:01:01 <jordanP> (I can only attend 40min)
17:01:05 <rodrigods> o/
17:01:13 <DavidPurcell> o/
17:01:21 <oomichi> cool, many people joined
17:01:27 <oomichi> ok, let's get started
17:01:34 <oomichi> #link https://wiki.openstack.org/w/index.php?title=Meetings/QATeamMeeting#Agenda_for_February_2nd_2017_.281700_UTC.29 is agenda
17:01:36 <tosky> o/
17:01:53 <oomichi> #topic PTG
17:02:17 <oomichi> per the openstack-dev ML, the room capacity and related details are not yet clear for the PTG
17:02:22 <luzC> o/
17:02:45 <oomichi> so we don't know how many topics can run in parallel
17:02:47 <oomichi> at this time
17:03:22 <oomichi> I feel we don't all need to concentrate on the same topic
17:03:22 <jordanP> we will see there, I guess
17:03:36 <oomichi> jordanP: yeah, I hope so
17:03:54 <jordanP> yeah, we can run several efforts in parallel, depending on who is interested in doing what
17:04:10 <oomichi> jordanP: yeah, that will be productive
17:04:22 <castulo> o/
17:04:26 <luzC> I was wondering if we will add time for the agenda topics - https://etherpad.openstack.org/p/qa-ptg-pike
17:04:29 <oomichi> #link https://etherpad.openstack.org/p/qa-ptg-pike is nice to write more ideas
17:05:00 <oomichi> luzC: yeah, +1 for schedule
17:05:19 <oomichi> it would be useful for getting attendees for specific topic
17:05:49 <oomichi> then we need the capacity info to make a schedule
17:05:57 <luzC> yes, especially if they are planning on attending other sessions
17:06:31 <oomichi> luzC: yeah, nice point
17:06:46 <andreaf> it would be nice to have a kind of agenda / priority lists so that we can go through the most important topics first
17:07:31 <jordanP> next topic ?
17:07:46 <andreaf> but I would not want to be too strict in terms of times, as we may find out that a topic takes very little time and another one longer than we expected, and we have more flexibility than in the usual design summit
17:07:55 <oomichi> andreaf: yeah, I hope debugging test failures would be first priority now
17:08:17 <luzC> andreaf +1
17:08:18 <oomichi> andreaf: yeah I prefer mid-cycle meetup style
17:08:22 <jordanP> yeah, let's gather with a list of topics and once we are all here, see how it goes
17:08:34 <oomichi> jordanP: +1
17:08:49 <oomichi> ok, can we move on if there are no more topics for PTG?
17:09:10 <oomichi> #topic Spec reviews
17:09:26 <oomichi> #link https://review.openstack.org/#/q/status:open+project:openstack/qa-specs,n,z
17:10:04 <oomichi> we have several specs and they are still under discussion
17:10:24 <oomichi> do we have topics for them today?
17:11:03 <oomichi> ok, lets move on
17:11:06 <oomichi> #topic Tempest
17:11:29 <jordanP> not strictly related to Tempest, but we have a lot of job failures today
17:11:48 <oomichi> jordanP: yeah, we are facing that
17:12:13 <rodrigods> i have a question about tempest, specific to https://review.openstack.org/#/c/427345/
17:12:16 <jordanP> is there something to do ? 3 days ago gate was fine and now it's in a really bad state
17:13:12 <oomichi> jordanP: I face different failures randomly
17:13:25 <jordanP> rodrigods, don't ask to ask a question, just ask it
17:13:27 <mtreinish> oomichi: have you helped try to categorize the failures; http://status.openstack.org/elastic-recheck/data/integrated_gate.html
17:13:32 <rodrigods> sure :)
17:13:45 <rodrigods> the problem is because this bug: https://launchpad.net/bugs/1590578
17:13:45 <openstack> Launchpad bug 1590578 in OpenStack Identity (keystone) "global role should not be able to imply domain-specific role" [Medium,Fix released] - Assigned to Mikhail Nikolaenko (mnikolaenko)
17:13:48 <rodrigods> of this bug*
17:14:02 <mtreinish> that's what bugs me it's clearly bad (you can look at o-h for the graphs) but not many people are working on categorizing the failures so we can track them
17:14:07 <rodrigods> the fix wasn't backported to Mitaka, so Mitaka job fails
17:14:39 <oomichi> mtreinish: oh, never saw that page. how do we categorize them? by LP?
17:14:48 <rodrigods> what is the best approach? use a feature flag?
17:14:50 <mtreinish> oomichi: submit an elastic recheck query
17:14:52 <jordanP> rodrigods, then either wait or introduce a feature flag
17:15:05 <mtreinish> rodrigods: so the api behaves differently between mitaka and newer versions? Yeah you either add a feature flag or just wait for mitaka eol
17:15:18 <rodrigods> cool
17:15:30 <oomichi> mtreinish: oh, that is a cool way. OK, I will do that from my patch list
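(For context: an elastic-recheck query is a small YAML file in the elastic-recheck repo, typically named after the Launchpad bug number, whose query field is an Elasticsearch/Logstash search matching the failure signature in gate logs. A hedged sketch — the bug number and message string below are illustrative, not a real signature:)

```yaml
# queries/1234567.yaml -- hypothetical bug number and failure signature
query: >-
  message:"example failure signature text"
  AND tags:"console"
```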
17:15:39 <rodrigods> thanks mtreinish jordanP
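(A hedged sketch of the feature-flag approach mtreinish and jordanP describe — the flag class, option name, and test body are stand-ins invented for illustration, not real tempest code; in tempest this would be an oslo.config option in a *-feature-enabled group that devstack sets per branch:)

```python
import unittest


class IdentityFeatureFlags:
    """Hypothetical stand-in for tempest's CONF object.

    On stable/mitaka this would be False, because the fix for
    bug 1590578 was never backported there.
    """
    forbid_global_implied_domain_role = False


CONF = IdentityFeatureFlags()


class ImpliedRolesTest(unittest.TestCase):
    def test_global_role_cannot_imply_domain_specific_role(self):
        # Skip on branches that lack the fix instead of failing the gate.
        if not CONF.forbid_global_implied_domain_role:
            self.skipTest("fix for bug 1590578 not present on this branch")
        # ...on newer branches, assert the keystone API returns 403 here...
```

The alternative mentioned above — just waiting for mitaka EOL — avoids carrying the flag at all.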
17:16:24 <andreaf> some of the failures are related to libvirt crashes, as jordanP pointed out - but something worse has been happening in the past one or two days
17:16:55 <andreaf> the libvirt ones are all tracked in e-r as far as I know, but no-one stepped up to look at the issue
17:17:19 <andreaf> do we have any libvirt person close to the openstack community that may help here?
17:17:20 <oomichi> #link http://lists.openstack.org/pipermail/openstack-dev/2017-February/111347.html
17:17:27 <oomichi> from jordanP
17:17:41 <jordanP> andreaf, we used to have Daniel Berrange
17:18:15 <dmellado> hey guys, I'm late today and I'll have to leave soon but o/
17:18:22 <jordanP> and we have kashyap also
17:18:48 <jordanP> but the libvirt thing is not the only thing we are seeing today
17:18:50 <jordanP> today is worse
17:19:09 <oomichi> yeah, hard to merge patches in the current condition
17:20:11 <oomichi> we are trying to fix bugs on tempest, but the patches could not be merged recently
17:20:21 <oomichi> the bug triage graph is #link https://github.com/oomichi/bug-counter#current-graph
17:20:46 <oomichi> the bug number is still decreasing, but progress is slow
17:21:20 <oomichi> the graph is a little odd anyways :)
17:21:32 <oomichi> due to lack of data
17:22:17 <oomichi> this week, I triaged bugs as assignee and the report is #link https://github.com/oomichi/bug-counter#current-graph
17:22:35 <oomichi> ops, that is #link https://etherpad.openstack.org/p/tempest-weekly-bug-report
17:23:39 <oomichi> I want to get eyes on https://review.openstack.org/#/c/304441 for fixing a bug
17:24:45 <andreaf> oomichi: sure I will have a look at it
17:25:06 <andreaf> oomichi: but I don't think it's Tempest bugs causing gate instability
17:25:45 <oomichi> andreaf: yeah, Tempest is just detecting bugs of different projects in general
17:25:45 <andreaf> still it's great to see the graph shrinking :)
17:26:22 <oomichi> andreaf: haha, yeah the number could be manageable
17:26:55 <oomichi> do we have more topics of Tempest today?
17:27:21 <andreaf> just one thing more from my side
17:27:48 <oomichi> andreaf: please go ahead
17:27:50 <andreaf> we have an issue with reboot_hard where sometimes the host keys are not flushed to disk
17:28:09 <andreaf> and the sshd server fails to start on reboot
17:28:34 <mtreinish> andreaf: I thought we landed a patch at some point that add a sync call over ssh before we rebooted
17:28:39 <oomichi> andreaf: yeah, 'sync' command doesn't work before hard-reboot?
17:28:45 <andreaf> there's a patch available in cirros to make the sshd server regenerate the host keys on reboot
17:28:46 <andreaf> mtreinish: yes we did
17:28:50 <andreaf> mtreinish: but apparently it's not enough
17:29:11 <andreaf> mtreinish: or something else is going on - but in either case at the moment we fail with not able to connect to ssh port
17:29:33 <andreaf> which is kind of misleading, as it's not a network problem, just the sshd daemon failing to start
17:29:40 <mtreinish> hmm, do we wait for a fresh prompt after the call? Because if we just send sync\n over ssh and then reboot right after there could be a race there
17:29:54 <andreaf> so a new release of cirros could fix that
17:30:46 <andreaf> mtreinish: well as far as I know our exec method waits for a return code
17:30:51 <mtreinish> ok
17:30:54 <mtreinish> I couldn't remember
17:30:58 <mtreinish> andreaf: have you talked to smoser about a new release?
17:31:17 <andreaf> but even a flush in the OS is no guarantee that the virtualisation layer will actually flush its own caches to the disk
17:31:56 <andreaf> mtreinish: yes I did I have a branch ready and I built an image and tested it here #link https://review.openstack.org/#/c/427456/
17:32:09 <andreaf> it seems to be working ok, even though now we have noise on the gate
17:32:46 <andreaf> I also tested this: https://review.openstack.org/#/c/421875/
17:32:47 <andreaf> force deleting host ssh keys before reboot
17:33:11 <andreaf> so I would now ask smoser for a new release and propose a devstack patch to adopt it
17:33:26 <andreaf> unless there is any concern with this plan
17:33:31 <mtreinish> andreaf: ok, that sounds like a good plan
17:33:40 <jordanP> andreaf did you open a pull request for cirros?
17:34:18 <andreaf> yes https://code.launchpad.net/~andrea-frittoli/cirros/+git/cirros/+ref/bug_1564948
17:34:48 <jordanP> good
17:34:50 <andreaf> it's on top of branch 0.3 so only 2 commits
17:35:02 <oomichi> andreaf: the problem has been fixed on the cirros side, so could this happen on other OSes? or is it cirros specific?
17:35:03 <andreaf> because there's a lot more on master
17:36:19 <andreaf> oomichi, mtreinish: yeah I feel a reboot hard test will never be 100% safe really but we can make it stable enough for the gate so we can test reboot
17:36:28 <jordanP> andreaf, good, as smoser expected
17:36:58 <oomichi> andreaf: yeah, that is a nice direction
17:37:07 <andreaf> the alternative would be to not test reboot at all in the integration gate, but that would be a pity since it's a feature which is probably quite used
17:37:54 <oomichi> andreaf: yes, reboot feature is used in common use cases and nice to keep the corresponding tests on the gate
17:38:39 <jordanP> (i have to go, sorry)
17:38:42 <oomichi> ok, lets move to the next topic if we can
17:38:58 <oomichi> #topic DevStack + Grenade
17:39:04 <afazekas> we might ask the vm to shut down properly, before hard reset, and wait for a power down message  on the serial console ..
17:39:45 <oomichi> afazekas: that could be soft-reboot?
17:40:20 <mtreinish> on devstack one topic to keep an eye on is in this ML thread:
17:40:22 <mtreinish> #link http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html
17:40:34 <mtreinish> it's not directly related to devstack, but just OOM issues in the gate
17:40:57 <afazekas> oomichi, On soft reboot the vm is expected to do it, but it doesn't. We would ask a booted vm to halt in a safe way over ssh, before actually cutting off the power with a big hammer.
17:42:49 <oomichi> afazekas: the soft-reboot tries to wait for a graceful vm reboot internally and hard-reboots after a timeout on the nova side IIUC
17:43:18 * afazekas I usually also drop the caches before hard reset,  but we can search for what is really sufficient
17:43:36 <castulo> also I submitted a fix for a bug in Grenade a few days back: https://review.openstack.org/#/c/424807/ in case you guys can take a look at it. It is for a bug that does not happen in the gate but happens when running Grenade locally
17:43:45 <oomichi> mtreinish: Interesting, the memory usage is still increasing across releases
17:43:57 <mtreinish> #link https://review.openstack.org/#/c/424807/
17:44:38 <mtreinish> oomichi: it's been a steady trend since juno (when we first started seeing oom in the gate)
17:44:50 <afazekas> oomichi, yes, it pushes the power button softly on the vm; the vm has an acpi signal handler which initializes the safe shutdown. if that does not stop the vm in time, nova uses the big hammer..
17:45:15 <oomichi> mtreinish: can we increase memory capacity on the gate? or do we need to try reducing the usage on each project's side?
17:45:48 <oomichi> afazekas: yeah, you are right. and the big hammer could erase the cache
17:46:24 <mtreinish> oomichi: we need to work on the memory footprint of the services. Requiring > 8GB per vm isn't really a good solution
17:46:52 <mtreinish> and it'll just be a matter of time until we hit whatever new amount of ram we allocate
17:47:08 <oomichi> mtreinish: it would be nice to see the memory footprint graph on o-h or something for each release
17:47:38 <mtreinish> oomichi: we don't collect that data anywhere. No one has done real memory profiling of the python
17:47:46 <mtreinish> if we have the data, making the graph is the easy part :)
17:48:13 <oomichi> mtreinish: nice point, just a rough idea :)
17:49:04 <oomichi> I guess this kind of memory footprint data could also be useful as a reference for production, and would be nice to share
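(As a starting point for the data collection mtreinish says is missing, a hedged Linux-only sketch that reads a process's resident set size from /proc; sampling this per service over time would yield the per-release graph. How to discover the service PIDs is left out:)

```python
import os


def rss_kib(pid):
    """Return the resident set size in KiB for a pid.

    Parsed from /proc/<pid>/status; Linux-only. Returns 0 if the
    VmRSS line is missing (e.g. for a kernel thread).
    """
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # line looks like "VmRSS:  12345 kB"
    return 0


# e.g. rss_kib(os.getpid()) gives the current process's own footprint
```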
17:49:22 <oomichi> OK, lets move on if we don't have more topic about them
17:49:36 <oomichi> #topic openstack-health
17:49:46 <oomichi> do we have topics about o-h today?
17:50:02 <oomichi> #topic Patrole
17:50:22 <oomichi> DavidPurcell: do we have topics about patrole?
17:50:28 <DavidPurcell> A couple if that's okay
17:50:39 <oomichi> DavidPurcell: please go ahead
17:50:57 <DavidPurcell> I've already added these to the PTG discussion stuff as they all involve a bit of work.
17:51:26 <DavidPurcell> First off, a lot of projects (glance, neutron, etc) don't return the correct response when a role is rejected
17:51:27 <oomichi> DavidPurcell: cool, thanks
17:51:46 <DavidPurcell> Usually it is a 404 NotFound instead of 403 Forbidden.
17:52:12 <oomichi> DavidPurcell: nice testing :)
17:52:20 <DavidPurcell> Thanks :)
17:52:39 <oomichi> is that different tenant case?
17:52:52 <DavidPurcell> It varies.
17:53:01 <DavidPurcell> Sometimes it is just because you're not admin
17:53:07 <afazekas> In some cases 404 is used when someone is just randomly guessing resources, so they can't tell whether a resource exists or not
17:53:08 <DavidPurcell> but they return 404 for whatever reason...
17:53:16 <oomichi> I guess 404 also could be an option because different tenant users cannot see the other tenant's resources
17:53:56 <oomichi> ok, this is a good chance to discuss the ideal behavior
17:54:03 <DavidPurcell> oomichi: Very possible, but some of them are just very obvious oversights.
17:54:24 <oomichi> DavidPurcell: yeah, really nice test for me anyways
17:54:38 <DavidPurcell> Another question we had involves tempest plugins for projects that want to use Patrole.  For example Murano wants to add Patrole tests.
17:55:14 <DavidPurcell> But that is currently not possible without either duplicating their clients in Patrole and putting the tests in Patrole
17:55:23 <DavidPurcell> or somehow enabling them to import Patrole
17:55:55 <oomichi> DavidPurcell: yeah, and patrole is at an early dev stage, so it's difficult to provide stable interfaces
17:56:38 <oomichi> but it is great that patrole attracts the different projects
17:56:40 <DavidPurcell> oomichi: and I'm not sure how I feel about a plugin importing another plugin.
17:56:56 <oomichi> DavidPurcell: that could be topic of PTG :)
17:57:06 <DavidPurcell> oomichi: already on the etherpad :)
17:57:27 <oomichi> DavidPurcell: cool, ok can we move on to the next topic: open discussion?
17:57:31 <DavidPurcell> sure
17:57:45 <oomichi> #topic critical review/open discussion
17:57:49 <oomichi> our time is almost up
17:58:08 <oomichi> so do we have patches or discussion topics?
17:59:30 <oomichi> ok, lets end the meeting. thanks all :)
17:59:34 <oomichi> #endmeeting