15:01:18 #startmeeting XenAPI
15:01:19 Meeting started Wed Jun 11 15:01:18 2014 UTC and is due to finish in 60 minutes. The chair is johnthetubaguy. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:24 The meeting name has been set to 'xenapi'
15:01:30 hello all
15:01:59 hello all
15:02:03 how's things going?
15:02:07 Fine fine
15:02:08 you?
15:02:12 what topics do we have for today?
15:02:16 BobBall: good thanks
15:02:41 our CI is the best again
15:03:04 http://www.rcbops.com/gerrit/reports/nova-cireport.html
15:03:07 the stats have been updated
15:03:36 #topic CI
15:03:48 * johnthetubaguy is looking at stats
15:04:05 hehe, so we are better than jenkins, except for the yellow bit
15:04:25 how do we get the yellow bit down?
15:04:29 we can't
15:04:41 why not? is that a time-it-takes-to-run thing?
15:04:43 unless we get more people in different timezones working on the CI in the same way as infra does
15:05:01 once a patch is missed, it's yellow forever - so we'd need <2 hour responses on _everything_
15:05:15 right, but how often does it screw up?
15:05:22 in the last 30 days
15:05:59 very very rarely
15:06:08 but if it happens and we miss even 1 then our yellow bar is bigger
15:06:41 right, that's fine, just curious how we improve that, people watching it doesn't feel like the correct answer
15:07:01 that's how infra fixes it :)
15:07:20 people go to #openstack-infra and shout until it's fixed
15:07:23 no one does that for xs CI
15:07:45 … we could monitor things, and make it fix itself a little bit
15:07:59 but anyway, maybe what we need is a better measure
15:08:16 maybe
15:08:21 more automated emails would be nice
15:08:29 but quite honestly I'm not fussed about the yellow
15:08:33 that's a monitoring issue on our side right?
15:08:42 where our = xenserver ci
15:08:58 Sure - or even better on the nova-cireport.html side
15:09:09 "Hey - I think your CI is down because it hasn't voted on XYZ"
15:09:40 the reason I say this is at the summit there was agreement to reduce the yellow bar, and no one complained
15:09:41 I think that CI report is going into infra at some point, which makes it easier to add such an email
15:09:48 I complained
15:09:54 if we are not happy we need to complain and come up with a better idea
15:10:03 OK, so I was half asleep, what was the response?
15:10:03 I pointed out in the etherpad why it is not appropriate
15:10:12 I haven't followed it up yet
15:10:13 oh, so no one was reading that
15:10:16 oops
15:10:19 but neither has anyone else AFAIK
15:10:25 i.e. no formal proposal has been made
15:10:30 that I've seen anyone
15:10:32 anyway*
15:10:45 agreed, mostly as the gate is screwed right now
15:10:59 Fine - so if/when it's proposed I will definitely argue against it
15:11:12 https://etherpad.openstack.org/p/juno-nova-third-party-ci I think?
15:11:16 it's not loading for me
15:11:25 I was just trying to get a better idea as a rebuttal
15:11:33 Line 32
15:11:36 Everyone jumps when jenkins is down, but few people (other than those running the CI system) monitor 3rd party CIs with the same enthusiasm. If a 3rd party misses a patch (e.g. gerrit stream monitoring fails) and a new patch is then submitted, the old missed patch is forever held as a miss by the stats. IOW I imagine the Jenkins miss rate will always be lower than the 3rd party miss rate.
15:11:46 My suggestion....
15:11:49 Missed split: No vote vs late vote
15:11:52 disagreement stats (how often does it disagree with jenkins) - perhaps say 'jenkins fail' is only if all tempest failed in Jenkins, to avoid known gate instabilities? - why compare to Jenkins rather than some other known, desired state?
15:11:59 correlation % / overlay with infra-jenkins
15:12:10 Low disagreement stats would be the key metric IMO
15:12:19 maybe
15:12:20 No late votes would also be a key metric
15:12:40 No votes should be 'acceptable' in the case of CI downtime as long as the 'no votes' are not too high
15:12:46 I like the idea of % late and % missed being different
15:12:49 i.e. maybe 10% 'no votes' is acceptable for a 3rd party CI
15:12:56 yeah, that seems reasonable
15:13:15 disagreement is harder, we want them to find other bugs, which would be disagreement
15:13:17 but we did make it clear that reporting must be <2h, so no 'late votes' are acceptable (although I also think that's too strict)
15:13:32 let me find the link
15:13:34 Sure - it would all need to be on a scale
15:13:43 i.e. if you have 10% disagreements then we're happy
15:13:50 but we'll assume that 90% of all jobs should agree
15:14:02 if there are _ANY_ jenkins fails that you pass, then that's a massive red flag
15:14:07 https://wiki.openstack.org/wiki/HypervisorSupportMatrix/DeprecationPlan
15:14:12 hmm, it says four hours
15:14:33 but I don't like forcing a CI to match specific arbitrary numbers... the numbers should just give the PTL a feel for what is acceptable or not
15:14:37 I think an average below two hours is probably kinder
15:14:50 unacceptable --> warning; no fix/plan --> booting
15:14:56 right, the idea here was, how do we give a clear bar, rather than a gut feeling
15:15:03 let's all be reasonable here - we're all human after all :D
15:15:18 Sure - but the bar can't be set so low that none of the non-infra tests can match it
15:15:30 agreed
15:15:43 perhaps another metric that'd be useful is "CHANGES missed" rather than patchsets
15:15:46 just don't want people feeling like, hey we don't like you, so we don't approve your CI
15:15:57 if you miss patch 4 and patch 5 comes along, then you test 5, 4 shouldn't be a "miss"
15:15:59 ah, that is an interesting idea
15:16:18 because there is no point going back and testing 4 when the CI is back up and running, testing 5...
15:16:30 missed vs late etc
15:16:42 I think looking at the average reporting time is fine here, thinking about this more
15:17:11 anyways...
15:17:16 yes
15:17:18 digging out of that rat hole
15:17:27 rabbit hole. Definitely bigger than a rat hole.
15:17:32 but I think we understand what we want
15:17:33 lol
15:17:41 what else did you want to cover
15:17:55 uhhh... not sure
15:17:56 oh yeah
15:17:57 I am getting back to setting up a parallel setup to get out some funky stuff
15:18:01 there's a rubbish bug
15:18:24 if you give devstack/d-g a repo (i.e. the review.openstack.org repo) then it'll check out from there
15:18:31 which is correct - right?
15:18:43 BUT in some cases you want to merge, a-la-Zuul
15:18:56 (all cases are safer with merging of course)
15:18:57 oh, this rings a bell
15:19:08 So there was a change this last week that failed until it was rebased
15:19:09 I remember the turbo hipster guys talking about this one
15:19:11 now it passes
15:19:38 I'm still pushing to try and move more of the CI to an -infra base
15:19:45 which will let it use zuul
15:19:50 right
15:20:02 about cirros, did you move to the new image?
15:20:03 but I guess a short-term fix might be to add some more hacky flags in D-G to merge rather than checkout
15:20:13 not this week, no
15:20:25 OK, so no more tests enabled at this point?
15:20:30 correct
15:20:45 no worries, just checking
15:21:08 having another meeting this week about getting us more help for this CI
15:21:25 well I'm not sure what the focus would be ATM
15:21:30 so there is a little bit of progress
15:21:30 apart from adding the quark stuff I guess
15:21:48 (or replacing nova-net with neutron+quark)
15:21:49 yeah, adding quark, adding more tests
15:22:04 maybe adding cloudcafe
15:22:41 but anyways, that's part of the discussion
15:22:43 I guess
15:22:56 also, just help with the 24/7 maintenance thing
15:23:03 yeah
15:23:13 some US people would spread the curve a little further
15:23:20 and into peak patch creation times
15:23:38 cool, so we are done for CI I guess?
15:23:53 indeed... but there is a learning curve which might be too long given that we're not having many issues at all ATM
15:24:00 Done indeed
15:24:07 I need to update the nodepool patches with more docs
15:24:14 hoping to do that tomorrow I think
15:24:24 agreed, but it's probably needed; the other thing is moving to zuul via the turbo hipster people
15:24:31 cool
15:24:37 That's a long way off
15:24:38 #topic Open Discussion
15:24:43 anything more?
15:24:50 we need all of the upstreaming done first - which is the start of those nodepool changes ;)
15:25:03 Yeah... I keep meaning to test...
15:25:07 is HVM working?
15:25:13 BobBall: well, they can run modified branches of some stuff
15:25:16 HVM working?
15:25:20 what do you mean?
15:25:28 There was a suggestion on some list somewhere that tempest only worked with PV guests and not HVM
15:25:38 oh, no, think it was on IRC
15:25:55 probably worth switching to just run a full tempest on HVM at some point to prove it does
15:25:56 oh, no idea, I suspect they just set the image up wrongly
15:25:59 and/or run some specific tests
15:26:09 well does Cirros support running HVM?
15:26:18 oh, so volume attach will need PV tools right?
15:26:27 or something like that
15:26:58 right - does cirros include PV tools for that? or would it run fully HVM?
15:27:02 oh, I doubt cirros is the correct choice for HVM tests
15:27:06 * BobBall doesn't know...
15:27:08 ah ok
15:27:19 maybe that's the answer then
15:27:30 some of our PVHVM images are fairly small
15:27:37 they would probably do the trick
15:28:03 (if we turn caching on)
15:28:07 ok great
15:28:33 it certainly works in production, but it's a good point, better image coverage would help
15:28:40 like testing windows and linux
15:28:51 oh wait, that will fail, but whatever
15:28:58 nested HVM, not so good
15:29:11 anyways, we are all done I guess?
15:29:37 yeah, think so
15:29:44 cool, thanks BobBall
15:29:53 catch you next week I guess
15:30:16 probably earlier on IRC with this nodepool stuff :)
15:30:20 #endmeeting
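
A minimal sketch (not part of the log) of the stats proposed in the CI discussion above: missed votes split from late votes, plus a disagreement rate against Jenkins. The record format, field names, and counting scheme are assumptions for illustration; this is not an existing report script.

```python
# Hypothetical sketch of the per-CI stats discussed in the meeting above.
# The input format and field names are made up for illustration.
from dataclasses import dataclass
from typing import Optional

LATE_THRESHOLD_HOURS = 4.0  # the DeprecationPlan wiki page linked above says four hours


@dataclass
class PatchsetResult:
    change: int                  # Gerrit change number
    patchset: int
    ci_vote: Optional[bool]      # None means the CI never voted (a "missed" patchset)
    report_delay_hours: float    # time from patchset upload to the CI's vote
    jenkins_vote: bool           # simplified; real gate results are noisier than this


def ci_stats(results: list[PatchsetResult]) -> dict:
    total = len(results)
    voted = [r for r in results if r.ci_vote is not None]
    missed = total - len(voted)
    late = sum(1 for r in voted if r.report_delay_hours > LATE_THRESHOLD_HOURS)
    disagree = sum(1 for r in voted if r.ci_vote != r.jenkins_vote)
    return {
        # "missed" and "late" reported separately, as suggested above
        "missed_pct": 100.0 * missed / total if total else 0.0,
        "late_pct": 100.0 * late / total if total else 0.0,
        # a low disagreement rate with Jenkins was suggested as the key metric
        "disagree_pct": 100.0 * disagree / len(voted) if voted else 0.0,
        "avg_report_hours": sum(r.report_delay_hours for r in voted) / len(voted) if voted else 0.0,
    }


# To count missed *changes* rather than patchsets (the refinement suggested in
# the meeting), group results by change number and only treat a change as
# missed if its latest patchset has ci_vote set to None.
```

Fed records scraped from Gerrit comments, something like this could give the missed/late/disagreement percentages alongside the existing yellow bar, rather than folding every missed patchset into a single "miss" figure.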
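
Similarly only a sketch: the checkout-versus-merge point from the d-g discussion, as a couple of git calls. The repo URL and change ref are placeholders, and real devstack-gate/Zuul behaviour is more involved than this.

```python
# Rough illustration of "check out the patchset" vs "merge it a-la-Zuul".
# REPO and CHANGE_REF are placeholders, not real values.
import subprocess

REPO = "https://review.openstack.org/openstack/nova"  # placeholder
CHANGE_REF = "refs/changes/NN/NNNNNN/P"               # placeholder patchset ref


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


def checkout_patchset() -> None:
    # Test the patchset exactly as uploaded, on whatever parent it was written
    # against - roughly what the log describes happening when d-g is pointed
    # straight at the review repo.
    run("git", "fetch", REPO, CHANGE_REF)
    run("git", "checkout", "FETCH_HEAD")


def merge_patchset(branch: str = "master") -> None:
    # Zuul-style: merge the patchset onto the current tip of the target branch,
    # so a change that "failed until it was rebased" is tested against the
    # state it would actually merge into.
    run("git", "fetch", REPO, CHANGE_REF)
    run("git", "checkout", branch)
    run("git", "merge", "--no-edit", "FETCH_HEAD")
```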