15:00:30 #startmeeting XenAPI
15:00:31 Meeting started Wed Apr 15 15:00:30 2015 UTC and is due to finish in 60 minutes. The chair is BobBall. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:33 Howdy!
15:00:35 The meeting name has been set to 'xenapi'
15:00:39 johnthetubaguy - ping?
15:01:13 o/
15:01:14 hello
15:01:22 BobBall: hope you had a good holiday
15:01:24 Howdy. Good, good.
15:01:26 I did indeed, thanks!
15:01:33 Forgot that I was on holiday last meeting :)
15:01:57 South Africa is a wonderful place. Definitely recommend it. As long as you lock up tight at night and don't walk around by yourself...
15:02:00 Anyway
15:02:02 #topic CI
15:02:07 Let's start with the fun one - the CI
15:02:18 As you probably know the CI was disabled from voting late last week
15:02:27 The initial suggestion was that the CI was broken
15:02:34 That was unfortunately not the case...
15:02:36 BobBall: south africa, awesome
15:02:46 BobBall: ah, the code is broken?
15:02:49 A new, major, race condition has been introduced somewhere which is hitting XenAPI very badl
15:02:52 badly*
15:03:03 yes, you mentioned snapshot
15:03:06 We went from a 10% failure rate to 60+% (not quite sure what the rate was)
15:03:14 but its not affecting libvirt?
15:03:21 Presumably not
15:03:23 do you have the logs from the failure to look at?
15:03:23 which isn't surprising
15:03:27 since we have very different code paths
15:03:30 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/screen-n-cpu.txt.gz
15:03:43 Uhhh
15:03:49 I meant: http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/results.html
15:03:59 Which includes that n-cpu log file
15:04:13 have you found the offending log trace?
15:04:24 Anyway - the interesting thing is that Tempest has recently changed to add a whole bunch of useful identifiers that should help track it down
15:04:34 but I still didn't see anything helpfully obvious :(
15:05:16 It's slightly worrying, perhaps, that most of the failures seem to happen when test snapshot pattern is the last test
15:05:22 but maybe that's due to the timeout we're hitting
15:05:39 just pushing it to the last test executed
15:06:23 What I don't understand is how/why this is showing as a failure to connect to SSH
15:06:23 hmm
15:06:42 oh, thats odd, is that post boot from snapshot?
15:07:01 It really _is_ some form of race since even though the pass rate is low, some jobs at the same time are still passing
15:07:54 http://paste.openstack.org/show/203994/ is a list of some passes/fails
15:08:32 so you can see that even over a period of hours (so not just a temporary thing) we're getting a splattering of passes
15:09:16 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/06/155006/11/15027/results.html is an example of a pass-in-amongst-the-fails
15:09:46 Anyway - either of you have thoughts on where the issue might be from those logs?
15:10:32 sorry, distracted
15:10:46 agreed its a race
15:10:53 just seeing if we found anything in the log
15:11:24 Any thoughts on how it might be a race, since it's actually SSH connection to the guest that is broken
15:12:25 waiting for it to be active?
15:12:29 unsure
15:12:36 what is the test doing?
15:13:01 It's writing a pattern of bits to the guest then snapshotting I think
15:14:19 oh - no - it's not doing that any more
15:14:22 I thought it used to
15:14:27 hmm, odd
15:14:36 now it's just writing the timestamp to the guest, snapshotting, booting that snapshot and testing it
15:14:44 but it can't write the snapshot to the server initially
15:14:46 that's what times out
15:14:49 so it's pre-snapshot
15:15:04 Just booted a newly created image which then can't be accessed
15:15:10 hmm
15:15:21 so maybe its not waiting long enough for it to boot?
15:16:08 Not unless we take a substantially longer amount of time than libvirt to download + boot a cirros image
15:16:41 we do
15:16:47 like minutes longer?
15:16:48 from what I remember
15:16:55 I think SSH timeout is 120 seconds?
15:17:03 yeah, that should be OK
15:17:38 unless we hit some slow path I guess
15:17:39 hmm
15:17:50 console shows that the guest did boot
15:17:59 however 2015-04-10 15:45:20.980 | wget: can't connect to remote host (169.254.169.254): Network is unreachable
15:18:01 ah good point
15:18:06 how long did it take
15:18:11 Hmmmm... Was this not our fault? :/
15:18:22 Could this have been a RAX networking issue?
15:18:35 I don't think so, its on box networking right?
15:19:03 Infra tempest used to be just on HP cloud right?
15:19:04 hmmm
15:19:10 fair point
15:19:24 2015-04-10 15:45:20.911 | cloud-setup: failed to read iid from metadata. tried 30
15:19:24 not sure
15:19:27 2015-04-10 15:45:20.919 | WARN: /etc/rc3.d/S45-cloud-setup failed
15:19:56 BobBall: we don't run the metadata service, where is that log from?
15:20:11 http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/run_tests.log
15:20:16 Search for the Cirros boot
15:20:22 BobBall: the alternative is to try boot with config drive
15:20:26 Looks like the guest doesn't get an IP address
15:20:31 BobBall: do force config drive
15:20:37 see if that narrows it down
15:20:53 I think we must be forcing it normally for it to work at all
15:20:59 it could be a race in the ip tables writing
15:21:06 I wonder if there is an issue where the config drive might not be attached before we boot or something
15:21:09 ?
15:21:18 seems very unlikely
15:21:18 2015-04-10 15:45:20.396 | ### ifconfig -a
15:21:18 2015-04-10 15:45:20.402 | eth0 Link encap:Ethernet HWaddr FA:16:3E:FC:EC:8F
15:21:21 2015-04-10 15:45:20.408 | inet6 addr: fe80::f816:3eff:fefc:ec8f/64 Scope:Link
15:21:29 BobBall: is it using the correct image?
15:21:32 I assume this means the guest definitely didn't get an IP address
15:21:43 I guess we override the config option
15:21:51 * ijw spies on you
15:22:54 We must
15:23:00 * BobBall hides behind a tree
15:23:17 Seems you didn't get DHCP (which is only attempted when a config drive isn't mounted)
15:23:30 OK - so config drive wasn't there, but it _must_ be for RAX.
15:23:41 that's still right isn't it johnthetubaguy?
15:23:48 So the real question is why wasn't there a config drive...
15:23:51 And config drive can be enabled with --config-drive=true but equally can be enabled if the global 'always give a config drive' option is set
15:24:05 force_config_drive = always
15:24:12 from n-cpu
15:24:17 BobBall: I think you are getting confused, the metadata service used isn't the cloud one, its the one in your openstack setup right?
15:24:38 Oh, yes, I am getting confused.
15:24:40 BobBall: are you sure its the right image we are launching in this test?
15:24:46 Usually I end up poking around in the /opt/stack/data/nova/instances for the config_drive file when I'm using libvirt, you can use a loopmount to see if it's good
15:24:50 maybe the new test used the wrong config
15:25:13 But assuming your cloud is not completely buggered I suspect John's right and your image hates you
15:25:19 Bad image!
15:25:42 At this time of the morning it's probably just decided to down tools and get a coffee
15:26:01 Using CONF.compute.image_ref which is what lots of others use too
15:26:25 Late afternoon here ijw... Does that mean it's time to down tools and get a cup of tea and cake?
15:26:58 BobBall: hmm, that sounds reasonable enough
15:27:01 johnthetubaguy: But we were 'agreed' that it looked like a race...
15:27:09 because some runs definitely pass
15:27:32 oh hang on again sorry I'm getting fixated on config drive
15:27:37 which is _not_ the issue, right?
15:27:38 BobBall: it passes sometimes and not others, so it has to be right, it could be bad state shared between tests when they get reordered
15:27:50 but thats not xenapi specific
15:28:06 Agreed.
15:28:11 BobBall: unsure, coming up and not getting an ip could be an issue, but that assumes it actually started
15:28:45 BobBall: you're down the road from the pub, it's probably worth just popping over there to see if that's where it's gone
15:28:55 And you might as well have a pint while you're at it
15:29:24 Might as well.
15:30:46 OK
15:30:49 I'm going in circles here
15:31:35 Anyway....
15:31:37 Long story short
15:31:40 I disabled the test
15:31:44 so the CI is back commenting on changes
15:31:53 not voting though
15:32:45 Gets us over this initial hurdle
15:33:26 Tests can be re-run by removing it from the tempest_exclusion_list in stackforge/xenapi-os-testing and they get picked up so you can re-run multiple times
15:33:41 changes to xenapi-os-testing get picked up by the CI I mean
15:36:13 Any more on CI johnthetubaguy?
15:36:52 johnthetubaguy failed CI, sorry
15:37:03 not really I am afraid
15:37:14 we should find why it failed, and whats going on first
15:37:23 I can try help, ping me tomorrow morning
15:37:37 OK, will do.
15:38:02 I don't think I've got anything more to add
15:38:29 #topic AOB
15:38:57 Just one from me - if you know anyone in Nanjing who wants to work on OpenStack, tell them to apply to Citrix.
15:39:32 And that's all :)
15:39:39 johnthetubaguy - anything from you?
15:39:48 nothing big from me, except
15:40:46 lets look out for rc1 stuff
15:40:57 What do you mean, look out for it?
15:41:28 Review it, CI it, find more stuff that we want in RC1? ;)
15:41:50 I mean RC2 really
15:41:58 lets just keep an eye out for major issues
15:42:01 like this one :)
15:42:05 Yes
15:42:17 I am worried that this might go out in the release if we can't figure it out in time...
15:42:41 Unfortunately because it's just showing in our CI it's likely to just be us trying to fix it unless we can find a case that reproduces elsewhere
15:44:09 OK - anyway - I think we're done.
15:44:12 So I'll call it there.
15:44:31 Next meeting in a fortnight, so Thursday 30th April
15:44:34 no sorry
15:44:39 Wednesday 29th April!
15:44:42 cool, thanks
15:44:49 #endmeeting