15:00:40 #startmeeting XenAPI
15:00:41 Meeting started Wed Feb 19 15:00:40 2014 UTC and is due to finish in 60 minutes. The chair is johnthetubaguy. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:44 The meeting name has been set to 'xenapi'
15:00:53 hello, who is around this week?
15:01:11 I am here for a short time.
15:01:19 I am around.
15:01:26 o/
15:01:27 I am also here
15:01:43 OK, so let's jump to matel
15:01:53 #topic XenServer CI
15:02:03 matel BobBall: tell me good news :D
15:02:17 #link http://paste.openstack.org/show/67274/
15:02:18 … and any bad news
15:02:27 That's the good news
15:02:34 Okay, I guess I'll do it with Bob.
15:02:35 #link http://paste.openstack.org/show/67277/ is the bad news
15:02:50 So we are commenting on successful runs
15:03:05 {'Collected': 31, 'Finished': 142, 'Running': 14, 'Queued': 62} are the current queue stats
15:03:28 the difference between "Collected" and "Finished" is that collected has got the results, and finished has posted them
15:03:38 OK, cool, are we getting complete within the correct timeframe?
15:03:38 we are not posting about failures ATM
15:03:51 It's complete now.
15:03:55 BobBall: totally makes sense right now
15:03:56 I can post failures now if we want
15:04:07 well, do we know what they are yet?
15:04:19 but I personally want verification of most of the failures before I turn on auto failure posting
15:04:22 That was the bad news
15:04:27 #link http://paste.openstack.org/show/67277/
15:04:31 Those are the 30 failures
15:04:39 some are real (e.g. 73539)
15:04:43 right
15:04:45 some are not real
15:04:55 BUT all that we've looked at so far have corresponding defects in the gate
15:05:07 ah, OK
15:05:19 I guess it would be good to see that jenkins signature match
15:05:19 I do not know yet whether we are suffering a higher hit rate of those defects - and if we are, whether they are related to the environment
15:05:24 Was wondering if this was just the current stability ranking. Is that what you are saying, Bob?
15:05:46 sorry leifz? not sure I understand?
15:05:56 {'': 14, 'Failed': 30, None: 62, 'Passed': 139, 'Aborted: Unknown': 4}
15:05:57 Is the failure rate in line with the current gate failure rate?
15:06:00 is what Bob posted
15:06:05 ah yes - I don't know leifz
15:06:17 oh, it's usually under a few percent, good question though
15:06:19 I asked yesterday what the current rate was but didn't get an answer and haven't asked again / chased
15:06:35 Thanks.
15:06:45 There are other issues we've been fixing in the last couple of days to get the stability of the system fixed
15:07:04 Do we think it's just code issues at this point?
15:07:15 everything is just code john :)
15:07:42 We're certainly at the point where we should be trying to track down the failures that I've listed on that page
15:08:03 We seem to be hitting test_basic_scenario frequently
15:08:04 LOL, stepped into that one. How close do we need to be to be non-voting reporting?
15:08:28 I would rather we don't report crud; people just assume the thing is broken then
15:08:32 and while it's an acknowledged gate bug, I suspect we're hitting it more, possibly because of slower volume provisioning or something
15:09:03 OK, so, can we look at getting the gate bug patterns tested against a failed run?
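The queue stats quoted above ({'Collected': 31, 'Finished': 142, 'Running': 14, 'Queued': 62}) are a per-state tally of jobs. A minimal sketch of how such a summary might be produced — the job records and state names here are illustrative, not the CI system's actual data model:

```python
from collections import Counter

def queue_stats(jobs):
    """Tally jobs by state. Per the meeting: 'Collected' means results
    have been gathered, 'Finished' means they have also been posted."""
    return dict(Counter(job["state"] for job in jobs))

# Illustrative job list (states mirror the ones quoted in the meeting)
jobs = [
    {"change": 73539, "state": "Queued"},
    {"change": 73600, "state": "Running"},
    {"change": 73601, "state": "Finished"},
    {"change": 73602, "state": "Finished"},
]
print(queue_stats(jobs))  # {'Queued': 1, 'Running': 1, 'Finished': 2}
```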
15:09:11 see if we hit a gate bug signature, etc
15:09:25 I'm not going to look at automating that now
15:09:32 that's a nice-to-have
15:09:54 Manual inspection of the tests for why they are failing is what we should do to get the rates up
15:10:16 sure, for now it makes sense
15:10:37 so do we have a public source for this data you are generating?
15:10:38 Anyway - I'd be happy to argue that we've satisfied what's needed for I-3
15:10:50 All passed tests get voted on
15:10:53 all logs are public
15:11:00 but these lists are not public, no
15:11:04 OK, one more request...
15:11:18 We could easily create a cronjob to post them somewhere if you want the latest details
15:11:19 can we just report the errors as "hmm, we found a problem, we are checking it out"
15:11:28 until we have more confidence?
15:12:17 BobBall: a cron job of stats would be ideal, just so people can check the queue length / status
15:12:26 We can do that, but I think the current volume means that we will not be able to check out + post on each test - I suspect "we're checking it out" could be seen as a suggestion that they will get another update.
15:12:41 An alternative would be to automatically requeue failed jobs once or twice
15:12:48 but that'll take ages to report on failures
15:13:05 OK, I reckon, "hmm, we found a bug, we are rechecking"
15:13:10 Perhaps on the first failure we comment on it, then say we'll requeue and re-test.
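The "requeue failed jobs once or twice" idea above can be sketched as a per-change retry cap — the function, field names, and the limit of one requeue are all hypothetical, not the CI's actual implementation:

```python
MAX_ATTEMPTS = 2  # hypothetical cap: requeue a failed job at most once

def handle_result(job, passed):
    """Decide what to do with a finished job: post a pass, requeue a
    first failure for a re-test, or post the failure once the retry
    cap is exhausted."""
    if passed:
        return "post-pass"
    if job["attempts"] < MAX_ATTEMPTS:
        job["attempts"] += 1
        return "requeue"
    return "post-fail"

job = {"change": 73539, "attempts": 1}
print(handle_result(job, passed=False))  # first failure -> requeue
print(handle_result(job, passed=False))  # second failure -> post-fail
```

The downside Bob notes is visible in the sketch: a genuinely broken patch only gets its failure posted after the extra requeue cycle completes.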
15:13:16 "hmm, we still found a bug, we will look into this for you soon"
15:13:20 yeah
15:13:26 No - not the second one - that doesn't scale
15:13:35 the patch submitter must be the one who looks into any failures
15:13:44 well, I don't mind us not looking into any of those right now
15:13:46 on a patch-by-patch basis
15:13:56 I don't think we are offering any kind of service like "look into this for you soon"
15:13:58 we can look into more common failures
15:14:10 so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:14:12 One thing that I would like to add is "Patch failed tests XYZ" which should be easy to grep
15:14:25 OK
15:14:26 and if we get that info - even if it's just internal - then we can easily group failures.
15:15:00 So the big thing I would love is to just warn people we are still testing the system, and we found an error, just it might not be an error
15:15:02 I don't understand that sentence john? ^^
15:15:13 Dumb question: do we run current trunk (no patches) on any period?
15:15:16 I think, instead of throwing in ideas, we would need to really ask what needs to be done to protect XenAPI's place in the trunk.
15:15:27 leifz: who is we?
15:15:28 No leifz, not currently
15:15:36 but patches continue to pass
15:15:49 if all patches start to fail then it'll be a trunk thing
15:15:49 BobBall: let me try again
15:15:51 Any of the reporting tests.
15:16:36 So the big thing I would love is: warn people we are still testing the system, but still tell them we found an error, just it might not be an error, it could be the test system that is a bit funny
15:16:43 15:14 < johnthetubaguy> so the issue is, if we don't have the gate bot to tell us about errors, then we can't ask the patch submitter to do it yet
15:16:46 That sentence
15:16:51 I agree with matel on ^^
15:17:06 oh I see
15:17:14 Agreed with matel too - just didn't see his msg. Sorry matel.
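matel's suggestion of an easy-to-grep "Patch failed tests XYZ" line might look like the following — the exact comment format is illustrative, not what the CI actually posts:

```python
def failure_comment(failed_tests):
    """Build the review comment line listing failed tempest tests in a
    fixed, grep-able format so failures can be grouped later."""
    return "Patch failed tests: " + ", ".join(sorted(failed_tests))

print(failure_comment(["tempest.scenario.test_basic_scenario"]))
```

Grepping all posted comments for `^Patch failed tests:` and counting duplicate lines would then surface the "more common failures" worth investigating first.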
15:17:18 BobBall: it's the stuff that tells you which bug you hit
15:17:24 ATM we need to be focused on the minimum.
15:17:31 e-r makes people lazy
15:17:42 It's not unreasonable to expect people to look at logs without e-r.
15:17:48 It's OK, and I think it's fun to spend time with CI systems, but we really have to align our efforts with the requirements.
15:18:02 right, I am about setting expectations
15:18:07 we need to report errors
15:18:17 e-r doesn't comment on any other third party systems does it?
15:18:20 but I would rather we told people we are not sure about them
15:18:32 until the point where we are more sure it is an error
15:18:39 hang on, let me re-read the wiki page
15:18:49 Where does any other third party system do that?
15:19:09 #link http://ci.openstack.org/third_party.html#requirements for others
15:19:19 that's not the Nova one
15:19:36 https://wiki.openstack.org/wiki/HypervisorSupportMatrix/DeprecationPlan
15:20:06 So, to meet #1 you need to report errors
15:20:32 for #2 we need a cron job to show the status of our queue
15:20:33 Fine - so we can do all of that now.
15:20:38 No we don't
15:20:43 Cron job is extra
15:20:54 If it does it then that satisfies the requirement.
15:21:12 But I agree that we should have a cron job so you at RAX can monitor the queue too.
15:21:17 so, let's go through those requirements, just to check
15:21:36 The job need not be voting, but must be informational so that cores have an increased level of confidence in the patch
15:21:36 Results should come no later than four hours after patch submission at peak load
15:21:36 Tests should include a full run of Tempest at a minimum, but may include other tests as appropriate
15:21:36 Results should be accessible to the world and include log files archived for at least six months
15:21:38 The tempest configuration being used must be published
15:21:45 if we don't report errors, we don't meet #1, right?
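Checking the four-hour requirement "by looking at the times we reported on the patch" is just a timestamp subtraction; a minimal sketch, with the submission and report times below purely illustrative:

```python
from datetime import datetime, timedelta

FOUR_HOURS = timedelta(hours=4)  # deadline from the requirements list

def within_deadline(submitted, reported):
    """True if the CI result was posted within four hours of the patch
    being submitted."""
    return reported - submitted <= FOUR_HOURS

submitted = datetime(2014, 2, 19, 15, 0)
reported = datetime(2014, 2, 19, 18, 30)
print(within_deadline(submitted, reported))  # True: 3.5 hours elapsed
```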
15:22:02 I can turn on reporting of errors immediately
15:22:12 how can we prove #2 without some kind of health-of-queue status page?
15:22:23 by looking at the times we reported on the patch.
15:22:31 BobBall: that sounds good, just can we make a note saying we are not sure yet?
15:22:39 it's very obvious from the patch whether we met the 4 hours or not.
15:22:41 BobBall: that's a bit nuts though
15:23:03 do we publish our localrc config and list of tempest skips (assuming there are none)?
15:23:21 #link http://ca.downloads.xensource.com/OpenStack/xenserver-ci/refs/changes/00/73000/2/
15:23:29 We publish the same logs collected by the gate.
15:23:40 that's not what they mean
15:23:54 do we have our localrc and list of tempest tests anywhere?
15:24:05 Yes - check that URL I just posted.
15:24:05 oh, hang on
15:24:15 I am blind
15:24:44 OK, we just need a wiki page describing how we meet all those points
15:24:50 then we can remove that dodgy log message
15:25:00 then we are good
15:25:22 Before the nova meeting tomorrow would be awesome
15:25:36 BobBall matel: life savers by the way, this is awesome stuff
15:25:44 The thing that we really need is help looking into the failures
15:25:53 I don't want to say "We're not sure about the failures"
15:26:01 #help need help to look into the failures
15:26:11 BobBall: but it's true right? we are not sure?
15:26:34 I would rather say we are not sure for a few weeks while we prove the stability, so people don't just ignore the xenapi test results
15:26:37 I'm saying I don't know if the failures are more likely in XenAPI
15:26:39 they are all failures
15:26:53 but just like every gate failure isn't related to the patch, the same is true of XenAPI failures here.
15:27:04 Gate doesn't say "might not be your fault"
15:27:15 right, but it's not new
15:27:19 nor do other CIs that I've seen?
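The cron-job-of-stats request could be satisfied by a small script that dumps the queue tally plus a timestamp to a world-readable location; a sketch, with the file path and stats source both placeholders rather than the real system's:

```python
import json
import os
import tempfile
import time

def dump_status(stats, path):
    """Write current queue stats plus an update timestamp as JSON.
    A cron job could run this every few minutes and publish the file
    so anyone can check the queue length / status."""
    payload = {"updated": int(time.time()), "queue": stats}
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload

# Example numbers taken from the stats quoted earlier in the meeting
path = os.path.join(tempfile.gettempdir(), "xenserver-ci-status.json")
payload = dump_status({"Queued": 62, "Running": 14, "Finished": 142}, path)
print(payload["queue"]["Queued"])  # 62
```

A crontab entry on the CI host (for example every five minutes) pointing at this script, with the output copied alongside the already-public logs, would cover the monitoring request.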
15:27:26 sure
15:27:42 I just want to be sure they are not new false positives
15:27:48 anyway, go with what you think is best
15:28:01 Perhaps if I phrased it differently....
15:28:19 Every failure that we have should correspond to a bug in launchpad - and one that should be fixed.
15:28:35 I don't know if we are hitting more bugs, or bugs more often, than KVM or A.N.Other driver
15:28:40 but they are all real
15:29:21 I don't disagree with you, I just don't want people to start ignoring the XenAPI tests
15:29:37 Which they will if we put a comment saying "Might not be you"
15:29:53 right, my hope is we prove the system, then remove that phrase
15:30:11 People who get used to the phrase will not notice when it's gone
15:30:19 when we have more confidence that it's gate bugs, probably via using the gate bug signature thingy
15:30:23 OK
15:30:34 Okay, I need to go, sorry.
15:30:44 Thanks matel
15:30:44 no worries, top work
15:30:57 Have fun with people poking in your mouth.
15:31:41 nice
15:31:54 Anyway
15:32:01 I've asked Ant for an increase in our quota
15:32:03 OK, so help with the failures, I would love to jump on that asap
15:32:21 Oh, cool, he is the right guy for that, makes sense
15:32:23 we're currently restricted to 128GB RAM which, at 8GB instances, is 15 total (it's just under 128G I think)
15:32:42 I know that we've got a giant queue at the moment, but I've been keen to re-process jobs that failed
15:32:49 so I'm a long way from hitting the 4 hour rate
15:33:05 with 50% more or double the VMs we'll get back very quickly.
15:33:15 right, totally makes sense
15:33:27 and I think that while we're catching up ATM we won't cope with 15 VMs under peak load
15:33:43 certainly will not, we should increase that for you
15:34:05 are we spreading across regions yet?
15:34:14 that might help a little
15:34:16 No - but we've had that at one point
15:34:22 it's easy to do and I've suggested as much to Ant
15:34:28 currently all in IAD
15:34:34 I think your quota is per region, but I could be wrong
15:34:35 but we've also had it working with DFW
15:34:47 LON is the other good choice
15:34:48 Oh? in which case I might have misunderstood
15:34:56 I'll try setting up multi-region as another job for me
15:35:04 that might resolve our quota issue today
15:35:14 Can I access LON in the same way?
15:35:20 I know it's separate from the web interface...
15:35:36 oh, different account still, bummer, maybe ask for one of those from Ant too
15:35:58 Where are performance flavors currently?
15:36:00 I would do IAD, DFW, ORD as a starting point anyway
15:36:06 most places now
15:36:06 OK - I'll add ORD too.
15:36:20 not HKG and SYD
15:36:23 If it's per-region then adding DFW and ORD would more than make up for the issues
15:36:41 it's worth a whirl, I think it is per region, but I could be wrong
15:36:53 OK
15:37:31 So - tasks so far...
15:37:44 Bob: cron job, post -ve comments, multi-region
15:38:07 yep, that sounds good
15:38:08 BobBall: you should have gotten the quota increase btw
15:38:09 John: investigate some of the failures from http://paste.openstack.org/show/67277/ to match against bugs, or ideally propose fixes to reduce the failure rate
15:38:17 perfect! thanks ant!
15:38:22 I'll try to go multi-region first
15:38:26 think I shot a mail over yesterday
15:38:29 since that'll be lighter on you
15:38:39 if they only did it in one region, let me know and I'll get it set for the others
15:38:52 sorry - I may have missed it with the fun we've been having
15:38:58 yeah, more independent failures too, when we do a deploy, etc
15:38:59 no problem :)
15:39:28 Bob: write up wiki page with links to tempest config, etc
15:39:35 add wiki page into:
15:39:35 Failed deploys show up as "Aborted" :)
15:39:38 Good point.
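Spreading jobs across the regions discussed (IAD, DFW, ORD) could be as simple as a round-robin over them when launching each test VM, assuming quotas really are per-region; a sketch, where the scheduler itself is hypothetical and only the region names come from the discussion:

```python
import itertools

REGIONS = ["IAD", "DFW", "ORD"]  # starting set suggested in the meeting

def region_picker(regions=REGIONS):
    """Cycle through regions so each new test VM lands in the next one,
    roughly tripling capacity if the RAM quota applies per region."""
    return itertools.cycle(regions)

picker = region_picker()
print([next(picker) for _ in range(5)])
# ['IAD', 'DFW', 'ORD', 'IAD', 'DFW']
```

LON would need a separate account, as noted above, so it is left out of the example set.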
15:39:57 the above wiki
15:40:14 Will do.
15:40:23 awesome
15:40:28 so, sounds like we are almost there
15:40:44 I am going to dig into errors tomorrow I am afraid, got blueprints to sort out this afternoon
15:41:37 OK
15:41:51 well I won't have time to look at them based on the list of things I've got to do :D
15:42:00 indeed
15:42:11 OK - that's all for the CI I think
15:42:23 it's awesome to see it going
15:42:46 Real quick, is there a quick link to look at errors in general?
15:42:48 I actually found a team that might help maintain it in Rackspace, once we have it proven, if that's helpful
15:43:18 What do you mean leifz?
15:43:43 #action look into XenAPI build errors: http://paste.openstack.org/show/67277/
15:43:44 Should be easy to do johnthetubaguy - it's all up in github
15:43:53 you said you needed help looking at failures. Was curious if that's easy to look at.
15:44:07 Ah yes - the link that John gave includes the log files for the errors
15:44:27 OK, so let's move on..
15:44:32 #topic AOB
15:44:41 anyone else got anything to talk about?
15:44:48 We want to add to those log files to include the host logs (matel is working on this) as there is important info in those for some errors
15:44:59 sounds good
15:45:16 No AOB from me
15:45:42 And I have to jump away now
15:45:46 it's blueprint cut-off day, patch up today, else your blueprint gets deferred
15:45:51 OK, thanks BobBall
15:45:52 I'll be back in a few minutes
15:45:54 nothing from me
15:46:04 #endmeeting