17:00:56 <mtreinish> #startmeeting qa
17:00:57 <openstack> Meeting started Thu Jun  5 17:00:56 2014 UTC and is due to finish in 60 minutes.  The chair is mtreinish. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:58 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:01 <openstack> The meeting name has been set to 'qa'
17:01:10 <mtreinish> Hi, who's here today?
17:01:13 <asselin> Hi
17:01:18 <andreaf> hi
17:01:33 <mtreinish> #link https://wiki.openstack.org/wiki/Meetings/QATeamMeeting
17:01:38 <mtreinish> ^^^ Today's agenda
17:02:15 <mlavalle> Hi
17:02:43 <mtreinish> dkranz, sdague, mkoderer, afazekas: are you guys around?
17:02:53 <asselin> I will add my spec review to the agenda
17:02:55 <dkranz> Hi
17:02:59 <mtreinish> well anyway let's get started
17:03:09 <mtreinish> #topic Mid-Cycle Meetup (mtreinish)
17:03:30 * afazekas o/
17:03:30 <mtreinish> So I just wanted to remind everyone about the qa/infra mid cycle we're going to be having
17:03:40 <mtreinish> the details can be found here https://wiki.openstack.org/wiki/Qa_Infra_Meetup_2014
17:04:07 <mtreinish> I expect that the schedule for that week will change a bit more over the next week
17:04:17 <mtreinish> #link https://wiki.openstack.org/wiki/Qa_Infra_Meetup_2014
17:04:50 <mtreinish> I really didn't have much else for this topic, unless anyone had any questions about it
17:05:22 <ylobankov> hi folks. Sorry, I am late
17:05:46 <vrovachev> hi all :)
17:05:52 <mtreinish> ok, well I guess we'll move to the next topic
17:06:06 <mtreinish> #topic Unstable API testing in Tempest (mtreinish)
17:06:19 <mtreinish> so this is something that came up the other day in a review
17:06:29 <mtreinish> and I know we talked about it with regards to v3 testing
17:06:55 <mtreinish> but since we've moved to branchless tempest I feel that we really can't support testing unstable apis in tempest
17:07:14 <andreaf> mtreinish: +1
17:07:28 <mtreinish> if the api doesn't conform to the stability guidelines because it's still in development then we really can't be testing it
17:07:43 <andreaf> mtreinish: Neutron's model of testing in tree and then promoting to tempest sounds like a good option for unstable APIs
17:08:08 <mtreinish> andreaf: yeah I'd like to see that, but we have yet to see it implemented
17:08:24 <mtreinish> it would definitely solve this problem once it's in full swing
17:08:32 <mtreinish> oh for reference the review this came up in was an ironic api change:
17:08:34 <mtreinish> #link https://review.openstack.org/95789
17:09:03 <mtreinish> I'm thinking we should codify this policy somewhere
17:09:17 <sdague> mtreinish: ++
17:09:27 <dkranz> mtreinish: If we are talking about a whole service api, like ironic, why not just set the enable flag to false in the gate?
17:09:47 <dkranz> mtreinish: I mean as a temporary measure, not a substitute for the other ideas.
17:09:59 <sdague> dkranz: well that doesn't really address the unstable thing
17:10:19 <dkranz> sdague: It means that for gating purposes, it is out of tree.
17:10:20 <andreaf> mtreinish: the alternative here would be micro-versions in ironic :)
17:10:32 <sdague> andreaf: they'd have to implement that
17:10:42 <sdague> which I don't think they've talked about yet
17:10:53 <dkranz> sdague: Same as if it were actually out-of-tree which was the other proposal
17:10:56 <mtreinish> andreaf: yeah, but this example is actually a field removal so it'd be a major version...
17:11:00 <andreaf> sdague: yes of course - but it seems to be problem common to everyone
17:11:44 <mtreinish> dkranz: what I'm proposing is just drawing the line in the sand now, and we can figure out how to deal with the unstable apis we have in tree after that
17:11:56 <mtreinish> there shouldn't be too many
17:12:08 <andreaf> sdague: perhaps micro versions is something every project should have - also it would bring a more consistent experience
17:12:08 <sdague> yeh, I agree
17:12:09 <mtreinish> dkranz: I'm also not sure it's the entire api surface for ironic
17:12:23 <sdague> andreaf: it's a good goal, we don't have any that do it yet
17:12:36 <mtreinish> so do we think a new section in the readme is good enough
17:12:45 <mtreinish> or should we start storing these things somewhere else
17:12:46 <mtreinish> ?
17:13:14 <sdague> new readme section is probably the right starting point
17:13:33 <mtreinish> ok, then I'll push out a patch adding that
17:13:47 <mtreinish> #action mtreinish to add a readme section about only testing stable apis in tempest
17:13:47 <sdague> if we have a consistent topic for just docs changes, we could pop that to the top of the dashboard
17:13:57 <sdague> to have people review docs changes faster
17:13:57 <andreaf> mtreinish, sdague: the problem I see when reviewing is how we define a stable API
17:14:14 <sdague> andreaf: example?
17:14:21 <andreaf> meaning until we have a test for something we don't know
17:14:22 <mtreinish> andreaf: if it's in tempest it's stable...
17:14:28 <mtreinish> there is no turning back
17:14:36 <dkranz> andreaf: Right
17:15:05 <mkoderer> hi
17:15:26 <andreaf> ok fine so we let an API become progressively stable test by test ... as long as something does not have a test it's not stable
17:15:56 <dkranz> andreaf: No, I don't think that's right.
17:15:57 <mtreinish> andreaf: that's a separate problem, which is really just api surface coverage. This is more to prevent projects that explicitly say their apis will be changing but they're adding tests to tempest
17:16:14 <sdague> right, I'm more concerned about the second thing
17:16:28 <andreaf> ok
17:16:30 <sdague> if a project doesn't believe the api is stable, and they tell us that, then we don't land those interfaces
17:16:59 <sdague> because they've said explicitly that they don't believe it to be a contract
17:17:03 <dkranz> sdague: marun's proposal is the only clean way to address this I think.
17:17:08 <sdague> dkranz: sure
17:17:49 <mtreinish> sdague: or they don't tell us, ignore the new readme section, something gets merged and then they're locked into the stability guidelines :)
17:17:59 <sdague> sure
17:18:00 <andreaf> I'm just thinking about an easy way to be consistent in reviews on this - so if there is a place where projects publish the fact that an API is stable, we can -2 directly any test for unstable stuff
17:18:51 <mtreinish> andreaf: well I think this only really applies to new major api versions and incubated projects
17:19:00 <sdague> yeh, I had some vague thoughts about all of that. But my brain is too gate focussed right now to take the context switch.
17:19:08 <sdague> andreaf: maybe propose something as a readme patch?
17:19:14 <sdague> and we can sift it there
17:19:32 <sdague> it honestly might be nice to have a REVIEWING.rst
17:19:44 <andreaf> sdague: +1
17:19:47 <sdague> or do more in it
17:20:02 <sdague> oh, I have a local file that I started for that
17:20:05 <mtreinish> yeah I was thinking about doing that as a wiki page like nova does
17:20:24 <sdague> I personally prefer it in the tree
17:20:35 <sdague> because then changing it is reviewed, and it's there when you check things out
17:20:41 <andreaf> mtreinish: or a link in tree to a wiki page
17:20:44 <andreaf> :D
17:20:46 <sdague> I find stuff in the wiki ends up logically too far away
17:20:52 <sdague> and no one notices when it changes
17:21:02 <sdague> so we mentally fork
17:21:06 <mtreinish> you can subscribe to a page...
17:21:07 <mkoderer> +1 for in the tree
17:21:11 <mtreinish> but I get what you're saying
17:21:23 <mtreinish> the location doesn't really matter as long as we have it
17:21:53 <mtreinish> sdague: do you want an action to start that?
17:22:16 <sdague> sure
17:22:27 <mtreinish> #action sdague to start a REVIEWING.rst file for tempest
17:22:37 <mtreinish> ok then is there anything else on this topic?
17:23:27 <sdague> when we get to open talk, I want to discuss a couple suggestions after debugging some gate things
17:23:30 <mtreinish> #topic Specs Review
17:23:32 <mtreinish> sdague: ok
17:23:52 <mtreinish> sdague: actually I'll make that a topic because I'm not sure we'll get to open
17:24:02 <mtreinish> dkranz: I think you posted the first one on the agenda
17:24:06 <mtreinish> #link https://review.openstack.org/#/c/94473/
17:24:14 <mkoderer> do we -1 patches that didn't have a merged spec?
17:24:27 <dkranz> mtreinish: Yes, boris-42 put in this spec but there is not enough detail
17:24:30 <asselin> (I added mine to the agenda https://review.openstack.org/#/c/97589/)
17:24:58 <mtreinish> mkoderer: yeah if the bp isn't approved yet we shouldn't merge the patches as part of it
17:25:03 <dkranz> mtreinish: He did not respond to the comment yet and I am not sure if he intends to move forward with this or not.
17:25:34 <mtreinish> dkranz: yeah it doesn't really have any design in it, it just explains what the script will be used for
17:25:59 <dkranz> mtreinish: The last comment from him was May 23
17:26:06 <mtreinish> dkranz: well I guess ping boris-42 after the meeting, and if he doesn't respond we can open a parallel one up
17:26:09 <mkoderer> mtreinish: yep we need to have a look at this... couldn't Jenkins do a -1 automatically?
17:26:10 <dkranz> mtreinish: I guess I will follow up with him
17:26:39 <sdague> mkoderer: I think that's over optimizing at this point
17:26:51 <mkoderer> sdague: ok :)
17:26:58 <sdague> especially as, as a team, we're kind of terrible about tagging commit messages
17:27:21 <mtreinish> I know I am...
17:27:35 <mtreinish> ok the next spec on the agenda is asselin's:
17:27:35 <sdague> we'll let another project build that infrastructure first, then see if we want to use it :)
17:27:47 <mtreinish> #link https://review.openstack.org/#/c/97589/
17:28:00 <mtreinish> asselin: so go ahead
17:28:14 <asselin> Hi, this is something that hemna and sdague talked about in Hong Kong.
17:28:31 <asselin> Where the API calls success and failures are automatically tracked during stress tests
17:28:55 <mkoderer> asselin: so you want to enhance the statistics?
17:29:18 <asselin> yes, there's an implementation out for review.
17:29:33 <mkoderer> asselin: ok cool I will have a look
17:29:36 <mkoderer> link?
17:29:36 <asselin> sample output is available here at the end: http://paste.openstack.org/show/73469/
17:29:48 <asselin> Stress Test API Tracking: https://review.openstack.org/#/c/90449/
17:30:09 <mkoderer> #link https://review.openstack.org/#/c/90449/
17:30:19 <asselin> in the above output, lines 18-32 were 'manually' added to the stress action.
17:30:44 <asselin> With the new code, lines 34-85 show which api calls were made, and how many passed, failed.
17:31:14 <mtreinish> asselin: so I haven't looked at this in detail yet, but the way you're tracking api calls
17:31:24 <mtreinish> could that also be used in the non stress case too?
17:31:38 <asselin> yes, I believe so
17:31:58 <mkoderer> asselin: I think we don't need the manual way..
17:32:11 <mtreinish> asselin: ok that may be useful as part of tracking leaks in general tempest runs too
17:32:20 <mtreinish> I'll look at your spec and comment there
17:32:20 <asselin> mkoderer, yes, that's exactly the point: no need to do the manual way anymore
17:32:20 <mkoderer> just tracking the api calls looks sufficient to me
17:32:29 <mkoderer> asselin: ok cool
17:32:52 <asselin> there's another patch for cinder tests here: Cinder Stress/CHO Test: https://review.openstack.org/#/c/94690/
17:33:01 <asselin> this one previously had the manual calls.
17:33:09 <asselin> they are all now removed
17:33:36 <mtreinish> ok, so take a look at this spec proposal to give it some feedback. It seems reasonable to me. :)
17:33:37 <asselin> and by previously, I meant before it was submitted for review.
17:33:43 <asselin> mtreinish, thanks!
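A minimal sketch of the kind of automatic call accounting asselin describes above; the actual implementation is in the review linked at 17:29:48, and the client/method names here are hypothetical placeholders:

    import collections
    import functools


    class ApiCallTracker(object):
        """Count successes and failures of wrapped API calls during a stress run."""

        def __init__(self):
            self.stats = collections.defaultdict(lambda: {"pass": 0, "fail": 0})

        def track(self, func):
            """Wrap a client method so every call is recorded as pass or fail."""
            name = func.__name__

            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    result = func(*args, **kwargs)
                except Exception:
                    self.stats[name]["fail"] += 1
                    raise
                self.stats[name]["pass"] += 1
                return result
            return wrapper

        def report(self):
            for name, counts in sorted(self.stats.items()):
                print("%s: %d passed, %d failed"
                      % (name, counts["pass"], counts["fail"]))


    # Hypothetical usage inside a stress action:
    #   tracker = ApiCallTracker()
    #   create_volume = tracker.track(volumes_client.create_volume)
    #   create_volume(size=1)
    #   tracker.report()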
17:33:53 <mtreinish> ok are there any other specs people want to bring up
17:33:56 <andreaf> yes
17:34:02 <mtreinish> andreaf: ok go ahead
17:34:02 <andreaf> #link https://review.openstack.org/#/c/86967/
17:34:08 <andreaf> non-admin one
17:34:20 <andreaf> I submitted an update on dkranz 's work
17:35:04 <andreaf> from the discussions at the summit it seems that the only kind of admin work we can avoid for now is tenant isolation
17:35:18 <sdague> andreaf: right, but I don't want to give that up
17:35:26 <sdague> which means I still feel like this is part 2
17:35:34 <sdague> and part 1 is preallocated ids
17:35:49 <andreaf> until we have hierarchical multitenancy we won't be able to do things like list all vms in a domain
17:35:52 <sdague> at which point we delete the non tenant isolation case
17:36:17 <sdague> thereby simplifying tempest in the process
17:37:03 <mtreinish> sdague: +1, although the non tenant isolation case would just be a list len of 1 for the preallocated ids
17:37:12 <sdague> mtreinish: sure
17:37:17 <andreaf> sdague: so I think we should remove user and user_alt and just have user provider that either uses tenant isolation or an array of configured users
17:37:27 <sdague> andreaf: yes exactly
17:37:30 <dkranz> andreaf: +1
17:37:48 <dkranz> I'm still not sure how we pass the user in from calls to testr in that case
17:37:51 <sdague> now we just need someone who wants to do that
17:38:06 <sdague> dkranz: there might be some trickiness there
17:38:12 <mtreinish> dkranz: yeah there is an interesting problem to determine how many threads we're safe for
17:38:20 <sdague> but, honestly, we have to solve it
17:38:31 <sdague> even if we decide the solution is a tempest binary
17:38:31 <mtreinish> or we could just allow overcommit and just lock on having available creds
17:38:34 <sdague> to wrap it
17:38:43 <dkranz> sdague: I agree, but feel a lack of testr expertise
17:38:50 <sdague> mtreinish: or just die
17:39:07 <mtreinish> heh, yeah that's probably a better idea :)
17:39:10 <sdague> if you try to run concurrency of more than your users, die
17:39:27 <mtreinish> sdague: the fuzziness is the alt user
17:39:32 <andreaf> we could do a config consistency check before starting the tests
17:39:39 <dkranz> I'm not sure how a list of users would actually work in terms of tempest deciding which one to use. It would have to be per-thread
17:39:41 <sdague> mtreinish: more than user +1
17:39:50 <afazekas> At class load time a process can lock on one demo and alt_demo user
17:39:53 <sdague> users[-1] is the alt user
17:40:00 <mtreinish> well it's user + n
17:40:14 <mtreinish> because if you need an alt user for all the workers at once
17:40:29 <sdague> I think we typically only need 1 alt user globally
17:40:30 <dkranz> mtreinish: Doesn't each worker need its own alt_user?
17:40:40 <mtreinish> but anyway we can figure that out later
17:40:44 <mtreinish> we're down to 20 min
17:40:50 <mtreinish> so let's move on
17:40:51 <sdague> ok, so who's spearheading this one?
17:41:07 <andreaf> mtreinish: so we need a spec, I can start a bp
17:41:14 <sdague> andreaf: cool
17:41:47 <andreaf> #action andreaf start bp on static users
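A rough sketch of the static credential pool being discussed above, using hypothetical names (nothing here is existing Tempest API): it fails fast when concurrency exceeds the configured users, per sdague's "just die" option, and otherwise locks one user per worker until the test releases it:

    import threading


    class StaticCredentialPool(object):
        """Hand out pre-allocated users to test workers, one at a time."""

        def __init__(self, credentials, concurrency):
            # Fail fast instead of overcommitting the configured users.
            if concurrency > len(credentials):
                raise RuntimeError(
                    "concurrency %d exceeds the %d configured users"
                    % (concurrency, len(credentials)))
            self._available = list(credentials)
            self._cond = threading.Condition()

        def acquire(self):
            """Block until a user is free, then reserve it for the caller."""
            with self._cond:
                while not self._available:
                    self._cond.wait()
                return self._available.pop()

        def release(self, creds):
            """Return a user to the pool and wake a waiting worker."""
            with self._cond:
                self._available.append(creds)
                self._cond.notify()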
17:41:57 <mtreinish> ok, then let's move on
17:42:05 <andreaf> mtreinish: I had another spec I wanted to mention
17:42:08 <andreaf> #link https://review.openstack.org/#/c/96163/
17:42:31 <andreaf> this is the test server, client, gui
17:42:33 <mtreinish> andreaf: ok, do you want to save that for next week when masayukig is around?
17:42:42 <andreaf> ok sure
17:42:57 <mtreinish> ok cool
17:43:00 <mtreinish> #topic how to save the gate (sdague)
17:43:14 <mtreinish> sdague: you've got the floor
17:43:23 <sdague> man, that's a much bigger thing than I was going for
17:43:25 <sdague> :)
17:43:47 <sdague> ok, so a couple of things when dealing with failures
17:44:05 <sdague> first off, we explode during teardown some times on resource deletes
17:44:26 <sdague> and whether or not the test author thought delete was part of their path, explode on teardown sucks
17:44:53 <sdague> so I think we need a concerted effort to make sure that our teardown paths are safe
17:45:02 <sdague> and just log warn if they leak a thing
17:45:03 <mkoderer> sdague: I proposed a safe teardown mechanism https://review.openstack.org/#/c/84645/
17:45:15 <mkoderer> but I need to write a spec before ;)
17:45:23 <mkoderer> not sure if this helps
17:45:25 <sdague> mkoderer: oh, cool
17:45:25 <mtreinish> sdague: but explode how, like if it's an unexpected error on a delete call shouldn't that be a big issue that causes a failure?
17:46:00 <sdague> mtreinish: honestly... I'm pretty mixed on that
17:46:30 <sdague> because I have a feeling that if we treated all deletes like that
17:46:33 <sdague> we'd never pass
17:46:42 <sdague> the only reason tempest passes is because we leak
17:46:47 <afazekas> explode on teardown is ok, and still fails the related test
17:47:04 <sdague> afazekas: honestly, I don't think it is ok
17:47:15 <sdague> if you want to test delete, do it explicitly
17:47:40 <dkranz> sdague: +1, but the leaks are still bugs
17:47:44 <sdague> dkranz: sure
17:47:57 <sdague> but we can solve those orthogonally
17:48:14 <mtreinish> sdague: but isn't kind of the same thing as a setup failure. We're not explicitly testing those calls but something failed
17:48:19 <andreaf> sdague afazekas failures in fixtures is something we should perhaps write warnings and collect stats
17:48:24 <dkranz> sdague: Yes, at this point I think we need to aggressively stop failures that are due to "known" bugs
17:48:26 <mtreinish> I just think moving on if a delete explodes is going to mask the real failure
17:48:27 <afazekas> sdague: explicitly by addCleanUp ?
17:48:33 <mtreinish> and cause something elsewhere
17:48:41 <sdague> afazekas: no, explicitly by callling delete
17:48:48 <dkranz> mtreinish: When the gate becomes reliable again we can add failures back
17:49:03 <sdague> the problem is there is too much implicit
17:49:22 <sdague> so even people familiar with the tempest code take a long time to unwind this
17:49:23 <afazekas> sdague: and where do we want to reallocate the resources allocated before the delete ?
17:49:23 <dkranz> afazekas: No, by calling delete not in addCleanUp
17:49:33 <afazekas> deallocate
17:49:44 <sdague> afazekas: in the test function
17:49:50 <mtreinish> dkranz: that's what I'm saying is I don't think this will necessarily make fixing the gate easier
17:49:52 <sdague> test_foo_* is where you make calls
17:50:11 <dkranz> mtreinish: It might not. But it might.
17:50:12 <afazekas> sdague: resource leak on failure is ok ?
17:50:26 <sdague> afazekas: if it has WARN in the log about it
17:50:41 <dkranz> I don't think it is ok but we are drowning in failures, right?
17:50:42 <sdague> then we are tracking it at least
17:50:54 <sdague> dkranz: yes, very much so
17:51:04 <sdague> every single tempest fix to remove a race I did yesterday
17:51:08 <sdague> was failed by another race
17:51:24 <dkranz> sdague: So I was just suggesting we stop "known" failures until we get it under control
17:51:39 <afazekas> Do we want to avoid this kind of issue? https://bugs.launchpad.net/tempest/+bug/1257641
17:51:41 <uvirtbot> Launchpad bug 1257641 in tempest "Quota exceeded for instances: Requested 1, but already used 10 of 10 instances" [Medium,Confirmed]
17:51:43 <sdague> dkranz: sure, I'm also suggesting a different pattern here
17:52:09 <dkranz> sdague: So make sure whatever is actually testing delete does so explicitly and don't fail on cleanup delete failures, just warn.
17:52:11 <mkoderer> I think this topic would be nice for the mid cycle meetup to work together on it
17:52:16 <dkranz> sdague: I think that is what you are saying, right?
17:52:21 <sdague> dkranz: yep, exactly
17:52:31 <sdague> the test_* should be explicit about what it tests
17:52:42 <dkranz> sdague: Yes, that was my review comment as well.
17:52:42 <sdague> and teardown is for reclamation and shouldn't be fatal
17:52:57 <dkranz> sdague: I would say more that it should be, but we suspend that for now.
17:52:59 <andreaf> sdague: +1
17:53:23 <andreaf> sdague: for api tests specifically - in scenario I would often include the cleanup in the test itself
17:53:24 <sdague> dkranz: well once tempest is actually keeping track of 100% of its resources, I might agree
17:53:41 <sdague> andreaf: sure, and that's fine
17:53:47 <sdague> just make it explicit
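A small sketch of the teardown pattern sdague is describing, with placeholder names rather than real Tempest client calls: deletes that are actually under test stay in the test body, while cleanup only reclaims resources and logs a warning on a leak instead of failing the run:

    import logging

    LOG = logging.getLogger(__name__)


    def cleanup_resource(delete_func, resource_id):
        """Best-effort delete for teardown: warn on failure, never raise."""
        try:
            delete_func(resource_id)
        except Exception:
            LOG.warning("Failed to delete resource %s during cleanup; "
                        "it may have leaked", resource_id, exc_info=True)


    # In a test class, cleanup is registered as best-effort:
    #   self.addCleanup(cleanup_resource,
    #                   self.servers_client.delete_server, server_id)
    # while a test that actually verifies deletion calls delete_server()
    # inside its own test_* method and asserts on the outcome there.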
17:53:47 <afazekas> How will we see if jobs randomly just WARN on a failed delete?
17:53:57 <sdague> afazekas: it's in the logs
17:54:01 <sdague> and the logs are indexed
17:54:22 <afazekas> sdague: nobody reads them if the test suite passed
17:54:25 <andreaf> sdague: I think we need tools to analyse the logs and
17:54:33 <sdague> sure, we need all those things
17:54:47 <sdague> but if we can't ever land a tempest patch again, then it's kind of useless :)
17:54:47 <andreaf> trigger warnings perhaps to the DL or on IRC
17:54:49 <dkranz> afazekas: I think the point is that there are race bugs around delete, and we know that, but we can't keep failing because of them
17:55:09 <afazekas> dkranz: bug link ?
17:55:16 <dkranz> afazekas: We now have to accept the risk of a regression around this issue to get things working.
17:55:27 <sdague> afazekas: and realistically people have been talking about tracking resources forever, and no one ever did that work
17:55:40 <dkranz> afazekas: There is no bug link because no one has any idea of why the deletes fail as far as I know.
17:56:05 <afazekas> dkranz: no bug, no issue to solve
17:56:06 <dkranz> Unless I am wrong about that.
17:56:28 <sdague> afazekas: sorry, some of us have been too busy fighting fires in real time to write up all the bugs
17:56:46 <andreaf> so step1 we need to "ignore" failed deletes and get the gate back and then we can go from there?
17:57:12 <sdague> andreaf: yeh
17:57:20 <sdague> though honestly, this was only one of 2 items
17:57:30 <dkranz> andreaf: What other choice is there?
17:57:32 <sdague> the other is we need to stop being overly clever and reusing servers
17:57:43 <sdague> so I proposed that patch for promote
17:57:46 <mtreinish> I'm still not convinced just switching exceptions to be log warns in the short term is going to make it easier to debug. Because of all the shared state between tests, but I'll defer to the consensus
17:57:54 <mtreinish> sdague: +1 on the second point
17:58:00 <sdague> https://review.openstack.org/#/c/97842/
17:58:10 <sdague> we do it in image snapshots as well
17:58:18 <dkranz> mtreinish: I don't think the point was about being easier to debug, just to be able to get patches through.
17:58:19 <sdague> I'll propose a patch for that later today
17:58:33 <mtreinish> dkranz: but that too, I think it'll just shift fails to other places
17:58:39 <sdague> mtreinish: it might
17:58:50 <dkranz> mtreinish: and we will see that if it happens and learn something
17:58:54 <afazekas> I would like to see several logs/jobs where we had just delete issues.
17:58:55 <sdague> the other thing that would be really awesome
17:59:15 <sdague> afazekas: well, dive in and fix gate bugs, and you'll see them
17:59:46 <andreaf> so I think we have 1 min left?
17:59:53 <mtreinish> yeah we're basically at time
17:59:57 <mlavalle> Before we go, Could I have some core reviews for https://review.openstack.org/#/c/83627 and https://review.openstack.org/#/c/47816. There was some overlap in these two patchsets, but I fixed it this past week. So they are good to go
18:00:00 <dkranz> We can move to qa channel
18:00:03 <sdague> yeh, so one parting thought
18:00:08 <afazekas> sdague: I frequently check the logs after failures, and I can't recall
18:00:28 <mtreinish> well there is a meeting after us I think
18:00:33 <mtreinish> so I'm going to call it
18:00:39 <sdague> ok, parting in -qa
18:00:41 <mtreinish> #endmeeting