21:03:32 <ttx> #startmeeting project
21:03:32 <dhellmann> o/
21:03:33 <markmcclain> o/
21:03:33 <openstack> Meeting started Tue Dec 17 21:03:32 2013 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:36 <openstack> The meeting name has been set to 'project'
21:03:44 <devananda> o/
21:03:44 <markwash> o/
21:03:46 <ttx> Agenda for today:
21:03:48 <ttx> #link http://wiki.openstack.org/Meetings/ProjectMeeting
21:03:51 <kgriffs> o/
21:03:51 <stevebaker> \o
21:03:58 <russellb> o/
21:04:05 <ttx> #topic Icehouse-2 roadmap
21:04:05 <SergeyLukjanov> o/
21:04:23 <ttx> All looks good from our 1:1s
21:04:43 <ttx> we'll skip the next two meetings
21:04:56 <ttx> and check back progress on the Jan 7th meeting
21:05:02 <hub_cap> bam
21:05:26 <ttx> #topic Gate checks (notmyname)
21:05:33 <notmyname> hello
21:05:36 <lifeless> hello!
21:05:40 <ttx> notmyname: hi! care to introduce topic ?
21:05:51 <notmyname> here's where we start:
21:05:53 <notmyname> I've been hearing (and experiencing) some major frustration with the amount of effort it takes to get stuff through the gate queue
21:06:14 <notmyname> in some cases, it takes days of rechecks. other times, it's merely a dozen hours or so
21:06:25 <notmyname> so I started using the stats to graph out what's happening
21:06:29 <notmyname> http://not.mn/gate_status.html
21:07:05 <notmyname> and the end result, as shown on the graph above, is that we've got about a 60-70% chance of failure for gate jobs, just based on nondeterministic bugs
21:07:20 <jog0> notmyname: we also wedged the gate twice in less than 3 months
21:07:25 <notmyname> this means that any patch that tries to land has a pretty poor chance of actually passing
21:07:52 <notmyname> note that over the last 14 days, there are 9 days where a coin flip would have given you better odds on the top job in the gate passing
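(A quick back-of-envelope sketch of how per-job flakiness compounds into the coin-flip odds notmyname describes; the 10% per-job failure rate and six-job count are illustrative assumptions, not figures taken from the graph.)

```python
# If each of six independent gate jobs falsely fails 10% of the time,
# a fully correct change passes the whole set only about half the time.
per_job_pass = 0.90
jobs = 6
print(per_job_pass ** jobs)  # 0.531441 -- roughly coin-flip odds
```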
21:08:04 <russellb> I feel like folks like jog0 and sdague have done a nice job watching this status and raising extra awareness for important issues
21:08:16 <russellb> there's plenty of room for more attention to some of the bugs, though, for sure
21:08:25 <russellb> notmyname: but do you have anything in particular you'd like to propose?
21:08:28 <notmyname> so I want to do 2 things
21:08:42 <notmyname> (1) raise awareness of the issue (now with real data!)
21:08:53 <notmyname> (2) propose some ideas to fix it
21:09:02 <russellb> i feel like everyone has been very aware already  :-)  ... but your graph is neat
21:09:03 <notmyname> which leads to other ideas, I hope
21:09:30 <notmyname> so for (1), I claim that a 60% pass chance for gate jobs is unacceptable
21:09:39 <dolphm> ++
21:09:43 <markwash> +1
21:09:48 <david-lyle_> +1
21:10:01 <russellb> i don't think anyone is going to argue with failures being bad
21:10:04 <notmyname> and I have 3 proposals of how we can potentially still move forward with day-to-day dev work
21:10:04 <dolphm> can we gate on pass chance? :P
21:10:12 <jog0> russellb: I would disagree with me doing a good job of raising awareness and watching status. we haven't been able to get the baseline low enough and get enough bugs fixed. we have been able to track how bad it is and prioritize, but that isn't enough
21:10:32 <russellb> jog0: OK, well just trying to give props where it's due for those working extra hard on things
21:10:34 <sdague> dolphm: only if we can take people's +2 away from them for a week when they push a 100% guaranteed to fail change to the gate :)
21:10:37 <russellb> your reports help me
21:10:49 <dolphm> sdague: where do we sign people up
21:11:04 <dhellmann> jog0: yeah, +1 to what russellb said, don't knock yourself for not having super powers
21:11:07 <notmyname> russellb: yes, I agree that the -infra team has done a great job triaging things when they get critical. but let's not stay there (as we have been)
21:11:09 <sdague> which was actually a huge part of the issue the last 4 days with all the grizzly changes
21:11:11 <notmyname> first idea: multi-gate-queue
21:11:25 <notmyname> in this case, instead of having one gate queue, have 3
21:11:34 <notmyname> Have N gate queues (for this example, let's use 3). In gate A, run all the patches like today. In gate B, run all but the top patch. In gate C, run all but the top 2. This way, if gate A fails, you already have a head start on the rechecks (and same for B->C). If gate A passes, then throw away the results of B and C.
21:11:46 <notmyname> this is a pessimistic version of what we have today
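(A minimal sketch of the multi-gate-queue idea as proposed above, assuming a simplified model where a queue is just an ordered list of changes and each queue's test results arrive as a list of booleans; this is illustrative pseudologic, not Zuul code.)

```python
def build_speculative_queues(changes, n_queues=3):
    """Gate A runs everything; gate B drops the top change; gate C
    drops the top two -- a head start in case the head fails."""
    return [changes[i:] for i in range(n_queues)]

def pick_result(queue_results):
    """Use the most optimistic queue that fully passed and throw
    away the rest, exactly as in the proposal."""
    for results in queue_results:
        if results and all(results):
            return results
    return None

# Example: three changes in the gate produce three speculative queues.
print(build_speculative_queues(["A", "B", "C"]))
# [['A', 'B', 'C'], ['B', 'C'], ['C']]
```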
21:12:25 <markwash> sdague: I would love to drill down on that past your warranted frustrations
21:12:34 <notmyname> idea two: cut down on what's tested
21:12:51 <jeblair> notmyname: i would be happy to have zuul start exploring alternate scenarios sooner, even ones heuristically based on observed conditions like job failure rates
21:13:10 <jeblair> notmyname: that's not a simple change, so it'd be great if someone wants to sign up to dev that.
21:13:16 <jog0> proposal 1 doesn't help get things to merge, it just gets them to merge faster
21:13:20 <notmyname> in this case, there is no need to test the same code for both postgres and mysql functionality (or normal and large ops) if the patch doesn't affect those at all
21:13:21 <jog0> or fail faster
21:13:40 <notmyname> jog0: correct. things eventually merge today
21:13:40 <jeblair> jog0: i agree with that.
21:13:50 <notmyname> where eventually is really long
21:14:03 <dolphm> jog0: faster dev cycle is always appreciated, at least
21:14:09 <portante> and seems too long
21:14:10 <notmyname> for idea two, I'm proposing that the set of things that are tested be winnowed down
21:14:30 <russellb> i'm -1 on testing things less in general ... if things fail, they're broken, and should just be fixed
21:14:41 <russellb> i don't think the answer to failures is do less testing
21:14:42 <jog0> notmyname: I am much more concerned about false gate failures than gate delay. if you fix false gate failures you fix gate delay too
21:14:50 <notmyname> eg why test postgres and mysql functionality in neutron for a glance client test?
21:14:55 <jeblair> notmyname: one of the benefits of running extra jobs -- even ones that don't seem to be needed (testing mysql/pg) is that we do hit nondeterministic failures more often
21:15:05 <markmcclain> I think testing less items is bad idea too
21:15:18 <notmyname> in all cases, the nondeterministic bugs need to be squashed
21:15:18 <mordred> the gate issues are actual openstack bugs
21:15:27 <jeblair> notmyname: neutron was in a bad state for a while because it only ran 1 test whereas everyone else ran 6; it was way more apt to fail changes
21:15:34 <jog0> I would rather make it harder to get the gate to pass than have these nondeterministic failures leak out into the releases for users to experience
21:15:37 <sdague> notmyname: yeh, we invented more jobs for neutron for exactly that case
21:16:05 <markwash> to notmyname's point, though. . we just recheck through those failures of actual nondeterministic bugs mostly, do we not?
21:16:08 <dolphm> jog0: so you're opposed to option 2?
21:16:14 <sdague> and I agree that race conditions need to be stomped out
21:16:19 <dolphm> jog0: err, idea 2
21:16:23 <markwash> rechecking is just *slow* ignoring
21:16:23 <jog0> dolphm: very much so, we need more tests
21:16:25 <notmyname> but the point is, if neutron jobs are still failing a lot, then they don't need to be run for every code repo
21:16:33 <notmyname> s/neutron/whatever/
21:16:37 <david-lyle_> all projects are gated on those failures, related or not
21:16:40 <lifeless> uhh
21:16:43 <jog0> markwash: that is a problem
21:16:44 <lifeless> I don't follow your logic
21:16:45 <sdague> markwash: you need to stop thinking about those as non deterministic, they are race conditions
21:17:00 <torgomatic> it's not a matter of "run less things because they fail", it's a matter of "run less things because they're not needed"
21:17:10 <markwash> sdague: agreed, both to me carry the same level of badness (high)
21:17:11 <jeblair> notmyname: neutron even got so bad that we pulled it out of the integrated gate -- it pretty much _instantly_ fully broke
21:17:17 <mordred> ++
21:17:29 <dolphm> what's the realistic maximum number of changes openstack has ever seen merge cleanly in succession?
21:17:34 <notmyname> torgomatic phrased it better than I was doing
21:17:36 <dolphm> 4? 5?
21:17:40 <sdague> dolphm: 20+
21:17:41 <dhellmann> torgomatic: but they *are* needed because the failures don't occur all of the time, so we need as many examples of failures as possible to debug
21:17:42 <jog0> dolphm: I saw 10 recently
21:17:42 <torgomatic> like, does keystone really need the gate job with neutron-large-ops? I don't think you can break Keystone in such a way as to only hose the large ops jobs
21:17:43 <dolphm> sdague: wow
21:17:48 <ttx> dolphm: I witnessed 25 myself
21:17:51 <lifeless> notmyname: if we don't run it, and there is any dependency in that thing on the other projects we let change, we have asymmetric gating.
21:17:52 <sdague> it's not been a good couple of weeks
21:17:53 <jeblair> notmyname: so we've learned that with no testing, real solid bugs (as opposed to transient ones) land almost immediately in repo.
21:17:59 <jog0> torgomatic: yes it does
21:18:03 <sdague> we also had a lot of external events in these 2 weeks
21:18:06 <lifeless> notmyname: asymmetric gating is a great way to wedge another project entirely, instantly.
21:18:06 <ttx> dolphm: granted, it was full moon outside.
21:18:18 <sdague> sphinx, puppetlabs repo, jenkins splode
21:18:19 <jog0> both nova and neutron use keystone so it can break neutron-large-ops
21:18:24 <mordred> yup. we've seen that almost every time we've had assymetric gating
21:18:29 <torgomatic> jog0: maybe a bad example, then, but there are other cases where the difference between two gate jobs has 0 effect on the patch being tested
21:18:34 <dhellmann> I would rather spend the effort it would take to figure out which subset of all our tests needs to run for any given change on fixing these race conditions themselves
21:18:48 <sdague> dhellmann: +1
21:18:48 <russellb> dhellmann: +100
21:18:53 <markmcclain> dhellmann: +1
21:19:07 <jeblair> dhellmann: +1
21:19:16 <jog0> dhellmann: amen!
21:19:16 * jd__ nods
21:19:22 <mordred> dhellmann: ++
21:19:26 <notmyname> ok, so option 3: enforce strong SAO-style interaction between projects
21:19:32 <sdague> hey, look we even have a reasonable curated list - http://status.openstack.org/elastic-recheck/ - (will continue to try to make it better)
21:19:33 <markwash> dhellmann: that's obviously better but I hope we *do* it
21:19:35 <notmyname> Embrace API contracts between projects. If one project uses another openstack project, treat it as any other dependency with version constraints and a defined API. Use pip or packages to install it. And when a project does gate checks, only check based on that project's tests.
21:19:45 <notmyname> This is consistent with what we do today for other dependencies. If there are changes, then we can talk cross-project. That's the good stuff we have, so let's not throw that out.
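(A hedged sketch of what "treat it as any other dependency" could look like in practice: declare a version-constrained contract on another project's client and verify it at startup. The project name and version range below are hypothetical, not real policy.)

```python
import pkg_resources

# The "API contract": this project claims to work with any release of
# the other project's client in this (hypothetical) range, and would
# gate only against released versions inside it.
CONTRACT = "python-keystoneclient>=0.4.1,<0.5"

def check_contract():
    try:
        pkg_resources.require(CONTRACT)
    except (pkg_resources.VersionConflict,
            pkg_resources.DistributionNotFound) as exc:
        raise RuntimeError("declared API contract not satisfied: %s" % exc)
```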
21:19:45 <dhellmann> notmyname: SOA?
21:20:08 <lifeless> dhellmann: service orientated archivetture
21:20:10 <lifeless> dhellmann: what we have
21:20:11 <notmyname> service oriented architecture. IOW, just have well defined APIs with the commitment to not break it and only use that
21:20:14 <lifeless> bah, architecture
21:20:19 <dhellmann> lifeless: I know SOA, I didn't know SAO
21:20:26 <jd__> you typed SAO :)
21:20:32 <lifeless> oh lol, my brain refused to notice that
21:20:40 <russellb> so, version pinning between openstack projects?
21:20:42 <notmyname> sdague: the problem with elastic recheck (which is good) is that it's hand-curated
21:20:49 <russellb> seems like we'd just be kicking the "find the breakage" can down the road
21:20:55 <mordred> russellb: ++
21:20:56 <mordred> in fact
21:21:05 <sdague> notmyname: it's 54% of all the fails, and super easy to add another one
21:21:07 <mordred> when you wanted to update the requirement, you would not have been testing the two together
21:21:14 <notmyname> well, what happens now for other dependencies? eg we don't run eventlet tests for every openstack patch and vice versa
21:21:16 <portante> so then why are we not pulling sphinx builds into our jobs?
21:21:16 <sdague> we approve them super fast
21:21:30 <markmcclain> I don't think we can use pip packages; for projects with strong integration we'd run into issues landing coordinated patches in the master branches
21:21:30 <notmyname> or sphinx, as portante stated
21:21:37 <mordred> notmyname: those are libraries, not things that do SDN
21:21:40 <dhellmann> the whole point of gating on trunk is to ensure that trunk continues to work so we can prepare the integrated release, right?
21:21:43 <sdague> because sphinx isn't openstack
21:21:51 <notmyname> markmcclain: that's exactly my point. it needs strong API contracts
21:21:56 <dhellmann> for other dependencies, we should be doing the same gate checks on the requirements project (if we're not already)
21:21:57 <mordred> it's more than an API
21:21:59 <notmyname> dhellmann: it still would
21:22:00 <lifeless> notmyname: we want to run eventlet tests on upstream pull requests actually.
21:22:08 <markmcclain> notmyname: those contracts evolve
21:22:13 <lifeless> notmyname: that's a test-the-world concept that infra have been kicking around
21:22:21 <lifeless> notmyname: so that we're not broken by things like sphinx 1.2
21:22:23 <notmyname> markmcclain: of course, that's where dependency versions come from
21:22:28 <mordred> the longer we diverge between these projects, the harder re-aligning is going to be
21:22:55 <mordred> it also makes it REALLY painful for folks running CD from master
21:23:03 <markmcclain> we do integrated releases, so the tests should be integrated
21:23:09 <ttx> yes, it's not as if dependencies did not break us badly in the past
21:23:14 <russellb> mordred: painful as in ... we stop testing that use case completely :(
21:23:18 <notmyname> mordred: yes. integration is hard, so it needs to be continually done. if something breaks, fix it. what I'm suggesting is treating the interdependencies as more decoupled things
21:23:21 <mordred> russellb: yup
21:23:26 <jog0> so one of the problems we have seen is that gate has so many false positives that it's very easy for more to sneak in
21:23:30 <mordred> notmyname: but they're not
21:23:31 <lifeless> mmm, from a CD perspective, I don't object to carefully versioned API transitions upstream
21:23:32 <jog0> we have a horrible base line to compare against
21:23:33 <mordred> they're quite interrelated
21:23:34 <lifeless> but
21:23:43 <lifeless> I strongly object to big step integrations
21:23:47 <mordred> lifeless: ++
21:23:48 <portante> mordred: how are they not?
21:23:57 <notmyname> mordred: again, that's why I'm here talking about this today. we've got a problem, and I'm throwing out ideas to help resolve it
21:24:02 <mordred> because these are things with side effects
21:24:03 <lifeless> if we bump the API a few times a day, that would be fine with me
21:24:16 <lifeless> but more than that and we'll start to see nasty surprises I expect
21:24:20 <portante> things with side effects sounds kinda general, no?
21:24:40 <mordred> there is a reason that side effects are a bad idea in well constructed code - they aren't accounted for in the API
21:24:48 <mordred> but
21:24:52 <portante> would notmyname's idea really make things worse than what we have today?
21:24:52 <mordred> sometimes they're necessary
21:24:57 <mordred> which is why Scheme isn't actually used
21:25:01 <mordred> yes
21:25:04 <mordred> it would make it worse
21:25:06 <mordred> unless
21:25:09 <markmcclain> portante: yes
21:25:10 <mordred> you happen to not care about integration
21:25:21 <portante> how will it make it worse from what we have today?
21:25:25 <mordred> if you don't care about integration, it would make your experience as a developer better
21:25:34 <mordred> portante: define "worse"
21:25:36 <notmyname> mordred: I didn't see portante say anything about not caring about integration
21:25:44 <notmyname> (ever in fact)
21:25:58 <russellb> point is, that's the case it's not worse
21:26:17 <dhellmann> russellb: ?
21:26:24 <mordred> notmyname: I'm saying that delaying integration until we have larger sets of things to integrate is going to make it more likely to introduce issues, and harder to track them down when they happen
21:26:30 <mordred> I believe that will be worse
21:26:31 <russellb> heh, mordred is saying it's worse, unless you don't care about integration
21:26:34 <jeblair> notmyname: because the proposal would mean we would perform integration testing less, essentially only once and on API bumps.
21:26:41 <mordred> however, doing such a delay
21:26:56 <jog0> we rarely change APIs
21:27:01 <portante> integration tests would still be run at the same rate
21:27:01 <mordred> will increase the pleasurability of folks doing development if those people are not concerned about the problems encountered in integration
21:27:13 <dhellmann> portante: how so?
21:27:17 <mordred> not against combinations that would show you that a patch introduced an issue
21:27:36 <mordred> which means that your patch against glance has no way of knowing that it breaks when combined with a recent patch to keystone
21:27:44 <mordred> when neither patches have landed yet
21:27:53 <portante> we would still run the same job sets as we do today, that would not change; it's just that we would be working with sets of changes from projects instead of individual commits
21:27:55 <mordred> which means you have to BUNDLE all of the possible new patches until there is a new release
21:28:10 <mordred> which means _hundreds_ of patches
21:28:20 <jeblair> and then bisect those out when you have a problem
21:28:28 <jog0> so I think this whole discussion is looking at things the wrong way. Gate is effectively broken, we don't trust it, and it's slowing down development.  The solution is to fix the bugs, not find ways of running fewer tests
21:28:30 <mordred> considering that it's hard enough to get it right when we're doing exact patch for patch matching
21:28:40 <russellb> jog0: +1
21:28:40 <markmcclain> jog0: +1
21:28:40 <portante> but why would my patch break something else without also breaking the API contract?
21:28:42 <jeblair> jog0: ++
21:28:48 <mordred> think about how much worse it will be when you only test every few hundred patches
21:28:50 <mordred> jog0: ++
21:29:02 <dhellmann> jog0: +1
21:29:03 <mordred> portante: because it can and will
21:29:07 <jog0> one thing that would help, is make sure we are collecting good data against master all the time
21:29:08 <ttx> jog0: I think notmyname's point is that it cannot ever be fixed so you need new ideas
21:29:10 <mordred> because that's the actual reality
21:29:29 <jog0> so if we have free resources, run gate against it so we get more data to analyze and debug with
21:29:32 <dolphm> mordred: ++; it's happened plenty in our history
21:29:36 <sdague> jog0: +1 ... so basically the ask back is what do we do (me & jog0 ... as I'm signing him up for this) to get better data in elastic recheck to help bring focus to the stuff that needs fixing
21:29:41 <jog0> ttx: I am not ready to accept that answer yet
21:29:42 <ttx> jog0: do you think we can get to the bottom of those issues ?
21:29:50 <notmyname> ttx: no, not that it can't be fixed, per se. but that openstack has grown to a scale where perhaps existing methods aren't as valuable
21:29:53 <jog0> ttx: yes, it may take a lot of effort but yes
21:29:57 <mordred> I think the methods are fine
21:30:04 <mordred> the main problem is getting people to participate
21:30:12 <sdague> yeh, agree with mordred
21:30:12 <markwash> I think we probably need some sort of painful freeze to draw attention to fixing these bugs
21:30:13 <mordred> introducing more slack into the system will not help that
21:30:33 <portante> it does not seem to be about adding more slack
21:30:33 <sdague> markwash: if only developers were feeling some pain.... ;)
21:30:34 <torgomatic> markwash: more pain as the answer to gate pain?
21:30:36 <mordred> the fact that we all know that jog0 and sdague have been killing themselves on this
21:30:38 <mordred> is very sad
21:30:46 <mordred> and many people should feel shame
21:30:51 <markwash> torgomatic: yeah, in one big dose, to reduce future gate pain
21:30:51 <portante> but targeting a finite set of resources on the point of integration
21:30:53 <mordred> because everyone should be
21:31:08 <mordred> portante: it's batching integration
21:31:16 <mordred> portante: which is the opposite of continuous integration
21:31:29 <dolphm> was the idea of prioritizing the gate queue ever shot down? (landing [transient] bug fixes before bp's, for example) or was that just an implementation challenge
21:31:32 <mordred> and which will be a step backwards and will be a nightmare
21:31:51 <portante> mordred: if the current system causes developers to assemble large patches unbeknownst to you, isn't that the same thing?
21:31:54 <jeblair> dolphm: we just added the ability to do that
21:32:00 <jog0> so we are tracking 27 different bugs in http://status.openstack.org/elastic-recheck/ and that  doesn't cover all the failures. Fixing these bugs takes a lot of effort
21:32:02 <sdague> dolphm: we have manual ways to promote now. We've used it recently
21:32:05 <dolphm> jeblair: oh cool - where can i find details?
21:32:34 <torgomatic> it seems like we're saying that we can leave the gate as-is if we would just stop writing intermittent bugs
21:32:35 <jeblair> dolphm: we've done it ~twice now; it's a manual process that we can use for patches that are expected to fix gate-blocking bugs, and are limiting it to that for now.
21:32:37 <notmyname> portante: that's actually my biggest fear. that current gate issues encourage people to go into corners to contribute to forks. which is bad for everyone
21:32:37 <sdague> this is the in progress data to narrow things down further - http://paste.openstack.org/show/55185/
21:32:46 <torgomatic> and if we can stop doing that, let's just stop writing bugs at all and throw the gate out
21:32:47 <mordred> notmyname: what forks?
21:32:51 <mordred> notmyname: what forks of openstack are there?
21:33:03 <dolphm> jeblair: is the process to ping -infra when we need to land a community priority change then?
21:33:11 <mordred> notmyname: and which developers are hacking on them?
21:33:15 <jeblair> dolphm: yes
21:33:25 <dolphm> jeblair: sdague: easy enough, thanks!
21:33:26 <hub_cap> mordred: maybe internal "forks" cuz patches take a while to land?
21:33:33 * hub_cap guesses
21:33:37 <markwash> mordred: I guess many companies run private forks
21:33:37 <portante> mordred: no names, don't want the NSA to take them out. ;)
21:33:46 <mordred> portante: ;)
21:33:48 <notmyname> hub_cap: yes. but to portante's point, it happens privately
21:33:52 <hub_cap> portante: the nsa knows already
21:33:53 <markwash> guesses the nsa runs a fork :-)
21:33:58 <mordred> well, those companies usually learn pretty quickly
21:33:58 <portante> it does!?
21:33:59 <creiht> what company doesn't have a fork of every openstack component as they try to get features in?
21:33:59 <russellb> private forks seem natural
21:34:04 <torgomatic> alternately, we can accept that bugs happen, including intermittent bugs, and restructure things to be less annoying when they do
21:34:11 <mordred> that getting out of sync significantly is super painful
21:34:12 * portante smashes laptop on the ground
21:34:15 <russellb> and honestly just seems like FUD
21:34:18 <notmyname> torgomatic: yes!
21:34:18 * jd__ smells FUD
21:34:21 <jog0> many of the bugs we see in gate are really bad ones
21:34:22 <russellb> jd__: jinx
21:34:26 <jd__> raaah
21:34:28 <markwash> portante: lol
21:34:44 <sdague> yeh, a lot of these races are pretty fundamental things
21:34:51 * mordred hands portante a new laptop that he promises has no malware on it
21:34:57 <sdague> where compute should go to a state... and it doesn't
21:35:17 * portante thankful for kind folks with hardware
21:35:19 <ttx> the tension is because some developers are slowed down by issues happening in other corners of the project and over which they have limited influence
21:35:25 <torgomatic> to that end, I think notmyname's first two suggestions are both good ones
21:35:42 <russellb> ttx: and the dangerous response is to continue not to care what's happening in the other corners
21:35:53 <lifeless> we're all in this together :)
21:35:54 <jeblair> ttx: they don't have limited influence though
21:35:56 <russellb> lifeless: yes!
21:35:59 <portante> can we at least run experiments with the suggestions to play them out?
21:36:03 <sdague> honestly, in the past we keep going in cycles where gate gets bad, pitch forks come out, people work on bugs, it gets better
21:36:05 <ttx> but if you take the viewpoint of openstack as a whole, some parts may be slowed down, but the result is better in the end
21:36:15 <sdague> this time... the number of folks working these bugs isn't showing up
21:36:25 <russellb> portante: which ones?  #2 and #3 there were fundamental disagreements from many people
21:36:27 <sdague> which is really the crux of the problem
21:36:31 <markwash> one policy that might help: as we triage a race-condition based failure in the gate, we need to require unit / lower level / faster tests that reproduce those failures to land in the projects themselves and fail every time
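(A hedged illustration of markwash's policy for races that are already understood: force the losing interleaving explicitly so the regression test is fast and fails every time. The instance/state names are invented for illustration, not taken from any real project.)

```python
class FakeInstance(object):
    def __init__(self):
        self.state = "BUILD"

    def set_active(self):
        # The fix under test: a late activation must not clobber a delete.
        if self.state != "DELETING":
            self.state = "ACTIVE"

    def delete(self):
        self.state = "DELETING"

def test_delete_wins_over_late_activation():
    inst = FakeInstance()
    inst.delete()      # deterministically force the bad ordering
    inst.set_active()  # the race's losing side arrives late
    assert inst.state == "DELETING"
```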
21:36:33 <jeblair> i hit a transient bug on a devstack-gate change, and with some help from sdague we tracked it down to a real bug in keystone, i filed the bug, wrote an er query and moved on
21:36:33 <russellb> #1 jeblair invited some help to zuul dev to add
21:36:48 <jeblair> i think that was beneficial to the project
21:36:53 <lifeless> so I proposed that gate affecting bugs be critical by default
21:37:04 <jog0> markwash: that won't work; many times we don't know why something is breaking
21:37:07 <lifeless> I think the stats we have here suggest that perhaps that isn't as bad an idea as folk thought :)
21:37:07 <portante> it is okay to disagree; can't hurt to try a few things to see if they pan out
21:37:09 <russellb> can someone ban d0ugal?  the join/parts are really annoying
21:37:10 <jeblair> and i was glad i could help even though i knew that my shell script change to devstack-gate didn't cause it.
21:37:19 <ttx> russellb: I use them as a clock
21:37:21 <jog0> take the http2 lib  file descriptor bug
21:37:25 <creiht> what if we just turn off the gate for a specific project until they fix the bugs that are clogging it?
21:37:30 <markwash> jog0: ah, okay... yeah, it's only for bugs where we understand the race but it's hard to fix
21:37:39 <dolphm> ttx: rofl. russellb: can your client hide join/parts?
21:37:46 <markwash> creiht: +1
21:37:56 <russellb> dolphm: probably, but i don't want to hide the non broken ones
21:38:02 <creiht> well prevent the project from any further patches until they fix gate critical bugs
21:38:02 <markmcclain> creiht: that is a bad idea… we've done this before and it caused more problems than it solved
21:38:09 <jeblair> lifeless: ++critical
21:38:25 <creiht> markmcclain: my first explanation wasn't as clear sorry
21:38:28 <russellb> heh, and now we have a pile of critical bugs that the same small number of people are looking at
21:38:31 <dolphm> creiht: not sure i follow - block that project from being tested or block that project from landing irrelevant changes?
21:38:39 <russellb> just saying, that alone doesn't get people to work on them :)
21:38:50 <creiht> block from landing any changes until the critical bugs are fixed
21:38:58 <lifeless> russellb: sure, but can't we also say 'when there are critical bugs, we won't be reviewing or landing anything else' ?
21:39:10 <lifeless> russellb: like, make it really crystal clear that these things are /what matters/
21:39:16 <russellb> lifeless: sure, something, just saying that labeling things critical doesn't do anything by itself
21:39:18 <ttx> creiht: I think we have that option, yes
21:39:20 <lifeless> russellb: ack, agreed.
21:39:27 <jeblair> markmcclain: you've done that once or twice, right?  prioritized critical fixes to the exclusion of other patches?
21:39:35 <dolphm> idea: can http://status.openstack.org/rechecks/ be redesigned so that you can see the most impactful bugs, grouped by the project the associated bugs are tracked against?
21:39:42 <ttx> creiht: if we can really identify a project that doesn't play ball
21:39:54 <russellb> dolphm: have you seen http://status.openstack.org/elastic-recheck/ ?
21:39:54 <dolphm> it's impossible for me to glance at that page and see where i can help
21:40:04 <sdague> dolphm: yes, moving towards eliminating it with the elastic recheck dashboard
21:40:04 <markmcclain> jeblair: yes.. we blocked approvals until fixes landed
21:40:13 <jog0> dolphm: keystone doesn't have any gate issues as far as I know
21:40:18 <dolphm> russellb: yeah, that's not what i want either
21:40:18 <sdague> it just... takes time
21:40:22 <dolphm> jog0: understood, but still
21:40:22 <jeblair> sdague, dolphm: ++
21:40:28 <clarkb> jog0: it does
21:40:31 <clarkb> the port issue
21:40:35 <sdague> jog0: that's not true
21:40:37 <creiht> ttx: it isn't about playing ball... if there are critical bugs blocking the gate, then your project gets no new patches in until that bug is fixed
21:40:38 <jog0> clarkb: link
21:40:39 <sdague> it bounced stuff this morning
21:40:50 <jeblair> dolphm, jog0: and the keystoneclient issue we found yesterday
21:41:00 <dolphm> jog0: actually we do have a couple issues ;)
21:41:02 <portante> creiht: if there are critical bugs blocking the gate from your project, then your project ....
21:41:13 <creiht> yes
21:41:15 <jog0> in that case I think most integrated projects have critical bugs
21:41:17 <jog0> if not all
21:41:34 <portante> great, so let's do that creiht thingy then
21:41:40 <creiht> lol
21:41:40 <markwash> I mean, maybe they all need to stop and fix those
21:41:51 <ttx> creiht: in some cases it's not as binary as that. Some bugs take time to investigate/reproduce, and blocking the project that makes progress on them is probably not very useful
21:42:02 <lifeless> ttx: so, I disagree
21:42:08 <torgomatic> that approach acknowledges that bugs happen, so it's got that going for it
21:42:11 <creiht> ttx: it seems more useful than just letting the status quo go on
21:42:21 <lifeless> ttx: when you make changes there is a chance you introduce new bugs right ?
21:42:26 <lifeless> ttx: or make the current ones worse!
21:42:29 <markwash> race condition bugs are a good situation for tough love
21:42:29 <portante> nothing changes if nothing changes
21:42:31 <notmyname> well, that brings up another point. elastic-recheck doesn't do any alerting to a project. maybe that should be added
21:42:46 <sdague> notmyname: agreed
21:42:46 <lifeless> ttx: so if you have critical issues, changing things that aren't fixing that issue, is just fundamentally a bad idea.
21:42:57 <jeblair> notmyname: sounds like a good idea
21:43:18 <russellb> or perhaps an openstack-dev email for each bug that gets added?  or would that be too much?
21:43:29 <portante> public flogging?
21:43:29 <lifeless> might be too little
21:43:32 <russellb> heh
21:43:33 <jog0> notmyname: so one issue is many times we don't know which project the bug is in
21:43:34 <sdague> we were talking about that, if we can determine the project, or set of projects where the bug is, it should alert those channels whenever it fails a patch
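(A small sketch of the alerting idea sdague describes, assuming a classified failure carries a bug number and a set of suspected projects; the bug-to-project mapping and the send_irc callable are hypothetical placeholders, not an existing elastic-recheck API.)

```python
BUG_PROJECTS = {1254890: ["nova", "neutron"]}  # hypothetical mapping

def alert_on_failure(bug_number, change_id, send_irc):
    """Ping each suspected project's channel when a known gate bug
    fails someone's patch."""
    for project in BUG_PROJECTS.get(bug_number, []):
        send_irc("#openstack-%s" % project,
                 "gate bug %d just failed change %s" % (bug_number, change_id))
```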
21:43:52 <ttx> ok, I think we are not maling anymore progress now
21:43:55 <ttx> or making
21:43:58 <sdague> so people are shamed by how often they are breaking things
21:44:05 <notmyname> so what's next, then?
21:44:07 <notmyname> ttx: ^
21:44:09 <lifeless> I don't think shame really helps
21:44:11 <creiht> status quo!
21:44:12 <creiht> :)
21:44:21 <lifeless> no one wanted to introduce these bugs
21:44:22 <markmcclain> the downside of public shaming is that sometimes the initial point of fault could be incorrect
21:44:28 <russellb> what's next?  how to get more people helping fix bus?
21:44:30 <russellb> bugs*
21:44:33 <lifeless> right!
21:44:41 <ttx> practical actions
21:44:43 <russellb> continued work to raise awareness of the most important things is part of it
21:44:43 <jog0> russellb: agreed
21:44:45 <markwash> yeah, not about shame, just about how do we progress when there are critical bugs
21:44:51 <russellb> and i think some ideas are being tossed around for that right now
21:45:14 <lifeless> is everyone raising their gate critical bugs in each weekly meeting ?
21:45:15 <russellb> and then what hammers are available when not enough progress is made, and when do we use them
21:45:16 <ttx> notmyname: I think everyone agreed your suggestion 1 was interesting, just missing dev manpower to make it happen
21:45:21 <russellb> and i'm not sure we have good answers for that part yet
21:45:26 <lifeless> Like as a dedicated section? And getting volunteers to work on them ?
21:45:29 <ttx> (the multigate thing)
21:45:34 <markmcclain> lifeless: it's the 1st real item in our meeting each week
21:45:35 <torgomatic> some of us are giant fans of suggestion 2 as well
21:46:00 <torgomatic> (suggestion 2 is removing redundant gate jobs)
21:46:24 <markmcclain> torgomatic: no, the extra data points are very helpful for diagnosing some of the race conditions
21:46:26 <lifeless> torgomatic: what redundant jobs?
21:46:27 <ttx> I think that one was far from having consensus
21:46:36 <markmcclain> it also helps us to prioritize based on frequency
21:46:42 <markwash> I think we should just have a post-gate master integration job that is wired up to a thermonuclear device. . when the failure rate hits 50% it blows
21:46:51 <lifeless> markwash: sweet
21:46:54 <russellb> ttx: if anything, more consensus on "no" for 2 and 3 IMO
21:47:03 <torgomatic> lifeless: like running devstack 5 times against every project, when there's not always a way for that project's patches to break stuff
21:47:11 <torgomatic> well, not only for one
21:47:14 <torgomatic> I meant to say
21:47:22 <lifeless> torgomatic: yes, your analysis is missing something
21:47:27 <lifeless> torgomatic: which we discussed
21:47:34 <jog0> https://bugs.launchpad.net/openstack/+bugs?search=Search&field.importance=Critical&field.status=New&field.status=Incomplete&field.status=Confirmed&field.status=Triaged&field.status=In+Progress&field.status=Fix+Committed
21:47:34 <russellb> don't want to rehash it
21:47:38 <lifeless> torgomatic: which is that the break relationship is often bidirectional, and transitive.
21:47:40 <torgomatic> as in, I'm sure I can write a Swift patch that breaks devstack for everything, but I cannot write one that only breaks devstack-neutron-large-ops
21:47:40 <jog0> 117 critical bugs
21:47:58 <jog0> torgomatic: yes you can
21:48:13 <torgomatic> jog0: great, please provide an existence proof in the form of a patch
21:48:21 <lifeless> lets get out of the rabbit hole
21:48:27 <jog0> put some timeouts in swift to make things super slow for glance
21:48:30 <lifeless> back to how do we get more people working on  critical bugs
21:48:45 <jeblair> btw, some projects have started tagging bugs with 'gate-failure' which can help folks searching for these bugs
21:48:47 <sdague> jog0: you probably want to remove git committed
21:49:02 <markwash> s/git/fix/
21:49:20 <russellb> which brings it to 44
21:49:27 <ttx> lifeless: suggestions ?
21:49:31 <lifeless> jog0: that includes non-integrated projects
21:49:43 <ttx> We shall soon move on to the rest of the meeting content
21:49:52 <jog0> lifeless: yeah, do you have a better link?
21:50:07 <lifeless> jog0: not in time for the meeting
21:50:11 <lifeless> jog0: LP limitation
21:50:14 <ttx> I see no reason why we can't continue to discuss this on the ML, btw
21:50:35 <ttx> Everyone agrees it's an issue
21:50:53 <ttx> Just absence of convergence on solutions
21:50:57 <russellb> let's fix it, and not by doing less testing of the continuous or the integrated varieties.
21:51:12 <ttx> except suggestion 1, which had pretty broad consensus
21:51:20 <ttx> just missing resources to make it happen
21:51:35 <sdague> yeh, that's going to require dev resources on zuul
21:52:00 <sdague> but jeblair said he'd be happy to entertain those adaptive algorithms
21:52:09 <jeblair> and it's worth remembering, that's just speeding up the failures.
21:52:09 <jog0> so I am not too keen on the first idea
21:52:10 <jog0> actually
21:52:24 <jog0> I think we can use the compute and human resources much better
21:52:27 <russellb> jog0: I don't think it hurts, while the others arguably do hurt
21:52:29 <jog0> if we fix the gate, issue one goes away
21:52:29 <sdague> well honestly, it also requires effort
21:52:32 <jeblair> russellb: ++
21:52:51 <sdague> so if someone is signing up for it, cool. If people are just "someone else should do it" then it won't happen
21:53:13 <markwash> it seems like idea #1 is just tuning the existing optimizations we have in place, not sure why it would be bad if someone showed up with a patch?
21:53:14 <russellb> like most things :)
21:53:26 <ttx> ok, 7 minutes left let's move on
21:53:35 <ttx> #topic Red Flag District / Blocked blueprints
21:53:39 <russellb> i like this new cross project meeting style :)
21:53:48 <russellb> we never had time for stuff like this before
21:53:52 <portante> exciting
21:54:00 <ttx> No blocked blueprint afaict
21:54:11 <ttx> russellb: yes, we used to put that dust under carpets
21:54:23 <ttx> at least we now voice the anger
21:54:28 <ttx> "put the dead fish on the table"
21:54:41 * markwash googles
21:54:47 <notmyname> ttx: I don't think "anger" is the right word
21:55:07 * jeblair thinks a failed patch in the queue should be called a dead fish
21:55:30 <russellb> jeblair: so the red circle in the zuul status page should be a dead fish instead?
21:55:31 <notmyname> I think there is frustration, but there is quite a bit of grace given to the current state of things by those who are frustrated
21:55:32 <ttx> we still have a conflict between heat and keystone around service-scoped-role-definition
21:55:44 <jeblair> russellb: with little stink lines
21:55:44 <ttx> notmyname: yes, frustration is a better term, sorry
21:56:05 <ttx> heat/management-api still needs keystone/service-scoped-role-definition
21:56:15 <ttx> stevebaker, dolphm: did you solve it ?
21:56:16 <stevebaker> ttx: that dep should be removed
21:56:20 <dolphm> i followed up on that last week - heat really shouldn't be blocked on that
21:56:31 <ttx> stevebaker: ah, great
21:56:32 <dolphm> although heat *could* take advantage of it- and i understand the desire to
21:56:36 <stevebaker> i thought I did that
21:56:47 <creiht> notmyname: well said
21:57:17 <ttx> stevebaker: yep it's removed now, thx
21:57:27 <ttx> Any other blocked work that this meeting could try to help unblock ?
21:58:06 <ttx> I'll take that as a "no"
21:58:09 <ttx> #topic Incubated projects
21:58:48 <ttx> devananda, kgriffs, SergeyLukjanov: around ? any question ?
21:59:01 <SergeyLukjanov> ttx, I'm here
21:59:08 <SergeyLukjanov> ttx, no questions atm
21:59:19 <kgriffs> no questions here
21:59:20 <devananda> aside from wondering how much slower development on ironic will be when we get integration testing .... nope :)
21:59:33 <kgriffs> +1 for raising the bar on code quality
21:59:34 <SergeyLukjanov> ttx, first working code of heat integration already landed, waiting for reviews on tempest patches
21:59:39 <ttx> kgriffs: had a question for you about when you wanted to switch to release management handling your milestones
22:00:02 <kgriffs> ah, great question
22:00:11 <kgriffs> tbh, I don't have a good feel for what that entails
22:00:12 <ttx> I see your i1 is still open
22:00:32 <ttx> kgriffs: we should talk. Will ping you tomorrow ?
22:00:33 <kgriffs> hmm. Thought I closed it.
22:00:34 * kgriffs hides
22:00:42 <kgriffs> ttx: sounds good
22:00:49 <ttx> kgriffs: it's inactive but it looks in progress :)
22:00:55 <kgriffs> I've been trying to move closer to tracking the i milestones, so this is timely
22:01:07 <kgriffs> ttx: oic
22:01:07 <ttx> kgriffs: awesome, talk to you tomorrow
22:01:10 <kgriffs> kk
22:01:14 <ttx> and.. time is up
22:01:16 <ttx> #endmeeting