17:01:54 <jaypipes> #startmeeting qa
17:01:55 <openstack> Meeting started Thu Sep  6 17:01:54 2012 UTC.  The chair is jaypipes. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:57 <openstack> The meeting name has been set to 'qa'
17:02:44 <davidkranz> jaypipes: I haven't seen anything back from Vish about all the problems we are having.
17:03:03 <davidkranz> jaypipes: What did you think of Jim's proposal last night?
17:03:08 <jaypipes> davidkranz: I'm still playing catchup on emails/list messages
17:03:15 <jaypipes> davidkranz: reading it now. :)
17:03:18 <davidkranz> jaypipes: OK
17:03:52 <jaypipes> davidkranz: gimme a few minutes to read, ok, then we discuss?
17:03:59 <davidkranz> jaypipes: Sure.
17:18:01 <jaypipes> davidkranz: responded to jeblair on ML
17:18:15 <davidkranz> jaypipes: I saw.
17:18:18 <jaypipes> davidkranz: I think #2 and #3 are good, I'm tentative on #1
17:19:00 <davidkranz> jaypipes: Yeah. I think we need the upside of that right now.
17:19:30 <davidkranz> jaypipes: For the downside, we could have an "advisory gate" on all of tempest.
17:19:54 <jaypipes> davidkranz: well, that's what jeblair's #3 was, and I support that.
17:20:12 <jaypipes> my only thing is I don't want to make the *tempest* project gate less than the full test suite
17:21:14 <davidkranz> jaypipes: The problem is that if someone breaks something in the "advisory" gate tempest will still be blocked if we enforce that.
17:21:38 <jaypipes> davidkranz: yes, I understand that.
17:22:10 <davidkranz> jaypipes: The way I have dealt with this in the past was to make it advisory but always have someone on the hook to fix a problem as their first priority.
17:22:11 <jaypipes> davidkranz: it sucks, but I actually think it's better to just deal with a broken tempest and work as closely as possible with the core projects when breaks occur
17:22:31 <jaypipes> davidkranz: works for me :) but it's not me you need to convince ;)
17:22:52 <davidkranz> jaypipes: Right. I think we could do what you are suggesting if Vish buys into it.
17:23:12 <davidkranz> jaypipes: I just hate to see important unrelated work in tempest getting pissed on all the time.
17:23:25 <dansmith> so, I'm confused..
17:23:37 <dansmith> I thought the point of not running full tempest every time was a matter of resources
17:23:53 <dansmith> running it every time and making it only advisory seems to get all the pain with none of the gain
17:23:59 <dansmith> (or fractional gain)
17:24:29 <davidkranz> dansmith: It is not just resources (bandwidth) but wait time (latency) as well I think.
17:24:57 <dansmith> davidkranz: okay, I thought Jim's comments were mostly around resources
17:25:22 <dansmith> seems like making the tempest test advisory during check and required during gate would help get a quick sniff before the reviewers have at it,
17:25:31 <davidkranz> dansmith: There is also the issue of the startup time which would be shared in his suggestion.
17:25:33 <dansmith> and then require them to fix up things before it can be merged
17:26:50 <dansmith> my opinion, which I think is shared by davidkranz, is that tempest gets shoved to the back of the line constantly if things are asymmetrical, which I think discourages folks from contributing to it
17:27:01 <dansmith> which is the opposite of what should be happening
17:27:18 <dansmith> I speak from experience of course :)
17:27:53 <davidkranz> We need to do what will
17:28:06 <davidkranz> 1. Get all tempest tests working again asap
17:28:19 <davidkranz> 2. Minimize failing tempest gates
17:28:56 <jaypipes> dansmith: agree with you.
17:29:06 <davidkranz> jaypipes: I agree that we should turn on the full tempest gate asap but perhaps for the moment we should turn it off until things are settled.
17:29:45 <jaypipes> davidkranz: what does that buy us exactly, though?
17:30:53 <davidkranz> jaypipes: If someone breaks a tempest test then we will be informed immediately but a core reviewer does not have to rush to figure out what it is and comment it out to unblock the queue.
17:31:22 <jaypipes> but unblocking the queue just means broken tempest goes into trunk..
17:31:58 <davidkranz> If we could run full tempest on everything that would be great but Jim doesn't want to do that.
17:32:20 <davidkranz> If we run the full gate only on tempest then what I described will happen. It's been happening.
17:33:06 <davidkranz> jaypipes: Sorry for mistakes from typing too fast...
17:33:38 <jaypipes> hehe, no worries :)
17:34:03 <dansmith> just brainstorming here,
17:34:16 <jaypipes> davidkranz: and what I am saying is that what has been happening is the better of the two worlds, IMHO, because it's not allowing broken code into tempest...
17:34:54 <dansmith> but could we introduce another batching mechanism for merging things into a core project, where five or ten patches get grouped for a single tempest run before landing?
17:35:15 <dansmith> jaypipes: could we make the full tempest gate smarter to just run tests above the smoke level that appear to have been touched by the patch in question?
17:35:43 <sdague> or is there a way to break up tempest into chunks that could be run in parallel, to address latency
17:35:47 <dansmith> jaypipes: that would be more symmetrical, not punishing the unrelated tests because of nova breakages, but without letting people check in things that are clearly broken
17:36:52 <jaypipes> sdague: unfortunately, not with the current architecture of tempest (using nosetests and its multiprocessing plugin, which is buggy to say the least)
17:37:17 <jaypipes> I'm just not sure :(
17:38:26 <davidkranz> This is a tough issue. I think if we could get the nova team to look at these things at high priority it would help.
17:38:43 <jaypipes> davidkranz: agreed.
17:39:00 <davidkranz> I am not in favor of spending a lot of time on countermeasures just to get around that.
17:39:06 <jaypipes> me neither.
17:39:24 <davidkranz> This hasn't been a big problem until now and probably won't be going forward.
17:39:49 <davidkranz> jaypipes: So who will "bell the cat"?
17:40:08 <dansmith> I'll be at the meeting anyway
17:40:11 <dansmith> to talk about XML stuff
17:40:23 <dansmith> so I can do it if nobody else is going to be there for other reasons
17:40:38 <davidkranz> dansmith: Good. I forgot the nova meeting is today.
17:40:55 <davidkranz> dansmith: Virtual Wednesday :)
17:41:02 <jaypipes> k
17:41:22 <sdague> it does seem like it's going to be impossible to handle the nova breaking tests in tempest that aren't run in the system, seems like we really need to figure out a way to get tempest run in the CI, otherwise we're just getting by by blind luck :)
17:42:14 <davidkranz> sdague: Right. We spent a bunch of time parallelizing tempest but fell down due to the multiprocessing nose issue AFAIK.
17:42:21 <jaypipes> sdague: well, jeblair's suggestion (#3) was to have the whole tempest suite run for all commits in the core projects, but only *gate* on the smoke + more tests
17:43:16 <dansmith> yeah, see, that's just expending the resources for little gain, IMHO, and doesn't really make sense
17:43:26 <dansmith> if the concern is checking something in at lightning speed,
17:43:27 <sdague> that does mean a test case for a nova bug, which we can confirm is a bug, can be checked in though, right?
17:43:38 <dansmith> then we could have another special jenkins command to "reverify, sans tempest"
17:45:43 <dansmith> perhaps the latency concern is because the gate tests are all run sequentially? certainly the tempest one could be run in parallel to the other ones and not increase the total time-to-commit by too much
17:45:44 <sdague> davidkranz, jaypipes: is the existing multiprocessing work in the tree, or a separate branch? maybe we can try to help there.
17:45:51 <davidkranz> I think the point of Jim's suggestion is that if a failure was directly tied to a particular change, the nova folks would react more quickly.
17:46:04 <jaypipes> sdague: one sec, looking for link...
17:46:28 <davidkranz> Where we are now, someone has to spend a lot of time figuring out which change broke things and no one has time for that.
17:46:39 <sdague> jaypipes: or break up the tempest sets manually and run them manually in parallel
17:46:45 <davidkranz> That is the situation we absolutely need to avoid.
17:47:02 <davidkranz> So I am in favor of Jim's idea until we can find something better.
17:47:40 <dansmith> davidkranz: but that doesn't help in the immediate short term, right? because things are flaky enough in tempest at the moment,
17:47:51 <dansmith> that people will learn to just ignore the advisory run because it "nearly always fails"
17:47:57 <sdague> sounds like it's also worth a summit session to try to throw around ideas on how to make the situation better, hopefully get folks from nova in the room for the discussion as well
17:48:02 <dansmith> I'd think the social laziness will route around that in no time :D
17:48:14 <davidkranz> dansmith: As of my change a little while ago things are working again.
17:48:35 <dansmith> davidkranz: for how long? :)
17:48:38 <dansmith> I'm just saying...
17:48:40 <davidkranz> The first priority is for the nova folks to fix the stuff I commented out.
17:48:56 <jaypipes> sdague: just run current tempest with --processes=4 and you'll see the issue.
17:49:04 <sdague> jaypipes: cool, will do
17:49:21 <davidkranz> As I said, I would be happy to turn on the full gate but they don't want to do it.
17:49:23 <jaypipes> sdague: has to do with the multiprocessing plugin not understanding module-level and package-level fixtures
17:50:15 <davidkranz> dansmith: I think we need to take immediate action to tie breakages to particular checkins whether they gate or not. And Jim agreed to that.
17:50:54 <davidkranz> So I suggest
17:50:54 <dansmith> davidkranz: understand, and I certainly won't argue that that helps.. however, I think it's likely that doing so will only help tempest folks sweep up after nova folks more efficiently :D
17:51:06 <davidkranz> 1. Ride on nova to fix the current problems
17:51:07 <dansmith> /helps/hurts/
17:51:17 <davidkranz> 2. Tie future breakage to checkins
17:51:27 <sdague> davidkranz: is there a link to the issues in nova that need to be fixed here? a tag or something for these issues?
17:51:33 <davidkranz> 3. Get full tempest to be reliable and fast enough to use as full gate.
17:51:48 <sdague> sorry, I too am coming back from a bunch of vacation, so still catching up :)
17:51:54 <davidkranz> sdague: I filed a nova bug today for the primary issue
17:52:08 <sdague> davidkranz: cool, url?
17:52:43 <davidkranz> sdague: https://bugs.launchpad.net/nova/+bug/1046870
17:52:44 <uvirtbot> Launchpad bug 1046870 in nova "Tempest showing nova compute failures" [Undecided,New]
17:53:50 <sdague> davidkranz: thanks
18:01:35 <davidkranz> Anything else? The time is up.
18:06:12 <davidkranz> #endmeeting
18:06:54 <clarkb> davidkranz: jaypipes will need to do that. If he doesn't get to it soonish I can end the meeting
18:07:14 <jaypipes> #endmeeting