15:00:30 <BobBall> #startmeeting XenAPI
15:00:31 <openstack> Meeting started Wed Apr 15 15:00:30 2015 UTC and is due to finish in 60 minutes.  The chair is BobBall. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:33 <BobBall> Howdy!
15:00:35 <openstack> The meeting name has been set to 'xenapi'
15:00:39 <BobBall> johnthetubaguy - ping?
15:01:13 <johnthetubaguy> o/
15:01:14 <johnthetubaguy> hello
15:01:22 <johnthetubaguy> BobBall: hope you had a good holiday
15:01:24 <BobBall> Howdy.  Good, good.
15:01:26 <BobBall> I did indeed, thanks!
15:01:33 <BobBall> Forgot that I was on holiday last meeting :)
15:01:57 <BobBall> South Africa is a wonderful place.  Definitely recommend it.  As long as you lock up tight at night and don't walk around by yourself...
15:02:00 <BobBall> Anyway
15:02:02 <BobBall> #topic CI
15:02:07 <BobBall> Let's start with the fun one - the CI
15:02:18 <BobBall> As you probably know the CI was disabled from voting late last week
15:02:27 <BobBall> The initial suggestion was that the CI was broken
15:02:34 <BobBall> That was unfortunately not the case...
15:02:36 <johnthetubaguy> BobBall: south africa, awesome
15:02:46 <johnthetubaguy> BobBall: ah, the code is broken?
15:02:49 <BobBall> A new, major race condition has been introduced somewhere which is hitting XenAPI very badly
15:03:03 <johnthetubaguy> yes, you mentioned snapshot
15:03:06 <BobBall> We went from a 10% failure rate to 60+% (not quite sure what the rate was)
15:03:14 <johnthetubaguy> but its not affecting libvirt?
15:03:21 <BobBall> Presumably not
15:03:23 <johnthetubaguy> do you have the logs from the failure to look at?
15:03:23 <BobBall> which isn't surprising
15:03:27 <BobBall> since we have very different code paths
15:03:30 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/screen-n-cpu.txt.gz
15:03:43 <BobBall> Uhhh
15:03:49 <BobBall> I meant: http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/results.html
15:03:59 <BobBall> Which includes that n-cpu log file
15:04:13 <johnthetubaguy> have you found the offending log trace?
15:04:24 <BobBall> Anyway - the interesting thing is that Tempest has recently changed to add a whole bunch of useful identifiers that should help track it down
15:04:34 <BobBall> but I still didn't see anything helpfully obvious :(
15:05:16 <BobBall> It's slightly worrying, perhaps, that most of the failures seem to happen when test_snapshot_pattern is the last test
15:05:22 <BobBall> but maybe that's due to the timeout we're hitting
15:05:39 <BobBall> just pushing it to the last test executed
15:06:23 <BobBall> What I don't understand is how/why this is showing as a failure to connect to SSH
15:06:23 <johnthetubaguy> hmm
15:06:42 <johnthetubaguy> oh, thats odd, is that post boot from snapshot?
15:07:01 <BobBall> It really _is_ some form of race since even though the pass rate is low, some jobs at the same time are still passing
15:07:54 <BobBall> http://paste.openstack.org/show/203994/ is a list of some passes/fails
15:08:32 <BobBall> so you can see that even over a period of hours (so not just a temporary thing) we're getting a splattering of passes
15:09:16 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/06/155006/11/15027/results.html is an example of a pass-in-amongst-the-fails
15:09:46 <BobBall> Anyway - either of you have thoughts on where the issue might be from those logs?
15:10:32 <johnthetubaguy> sorry, distracted
15:10:46 <johnthetubaguy> agreed its a race
15:10:53 <johnthetubaguy> just seeing if we found anything in the log
15:11:24 <BobBall> Any thoughts on how it might be a race, since it's actually the SSH connection to the guest that is broken?
15:12:25 <johnthetubaguy> waiting for it to be active?
15:12:29 <johnthetubaguy> unsure
15:12:36 <johnthetubaguy> what's the test doing?
15:13:01 <BobBall> It's writing a pattern of bits to the guest then snapshotting I think
15:14:19 <BobBall> oh - no - it's not doing that any more
15:14:22 <BobBall> I thought it used to
15:14:27 <johnthetubaguy> hmm, odd
15:14:36 <BobBall> now it's just writing the timestamp to the guest, snapshotting, booting that snapshot and testing it
15:14:44 <BobBall> but it can't write the timestamp to the server initially
15:14:46 <BobBall> that's what times out
15:14:49 <BobBall> so it's pre-snapshot
15:15:04 <BobBall> Just booted a newly created image which then can't be accessed
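[editor's note: the test flow described above, sketched in Python — the `cloud` helper object and its method names are hypothetical stand-ins, not tempest's real API; the point is where the CI runs fail]

```python
# Hedged sketch of the snapshot-pattern test flow described above.
# `cloud` and its methods are hypothetical, not tempest's actual code.

def snapshot_pattern(cloud, image_ref):
    server = cloud.boot(image_ref)
    # The CI failures happen here: SSH to the freshly booted guest never
    # connects, so the run times out before the snapshot step is reached.
    stamp = cloud.ssh_write_timestamp(server)
    snap = cloud.snapshot(server)
    server2 = cloud.boot(snap)
    # Verify the data written pre-snapshot survived into the new boot.
    assert cloud.ssh_read_timestamp(server2) == stamp
    return stamp
```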
15:15:10 <johnthetubaguy> hmm
15:15:21 <johnthetubaguy> so maybe its not waiting long enough for it to boot?
15:16:08 <BobBall> Not unless we take a substantially longer amount of time than libvirt to download + boot a cirros image
15:16:41 <johnthetubaguy> we do
15:16:47 <BobBall> like minutes longer?
15:16:48 <johnthetubaguy> from what I remember
15:16:55 <BobBall> I think SSH timeout is 120 seconds?
15:17:03 <johnthetubaguy> yeah, that should be OK
15:17:38 <johnthetubaguy> unless we hit some slow path I guess
15:17:39 <johnthetubaguy> hmm
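[editor's note: the 120-second SSH timeout under discussion is essentially a poll loop like the following — a minimal sketch assuming plain TCP reachability; tempest's real helper also authenticates and retries at the SSH layer]

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=120, interval=2):
    """Poll until the guest's SSH port accepts a TCP connection.

    A guest that never got an IP address (as in the failing runs) just
    burns the whole timeout here, and the test reports an SSH failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```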
15:17:50 <BobBall> console shows that the guest did boot
15:17:59 <BobBall> however 2015-04-10 15:45:20.980 |     wget: can't connect to remote host (169.254.169.254): Network is unreachable
15:18:01 <johnthetubaguy> ah good point
15:18:06 <johnthetubaguy> how long did it take
15:18:11 <BobBall> Hmmmm... Was this not our fault? :/
15:18:22 <BobBall> Could this have been a RAX networking issue?
15:18:35 <johnthetubaguy> I don't think so, its on box networking right?
15:19:03 <BobBall> Infra tempest used to be just on HP cloud right?
15:19:04 <BobBall> hmmm
15:19:10 <BobBall> fair point
15:19:24 <BobBall> 2015-04-10 15:45:20.911 |     cloud-setup: failed to read iid from metadata. tried 30
15:19:24 <johnthetubaguy> not sure
15:19:27 <BobBall> 2015-04-10 15:45:20.919 |     WARN: /etc/rc3.d/S45-cloud-setup failed
15:19:56 <johnthetubaguy> BobBall: we don't run the metadata service, where is that log from?
15:20:11 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/run_tests.log
15:20:16 <BobBall> Search for the Cirros boot
15:20:22 <johnthetubaguy> BobBall: the alternative is to try boot with config drive
15:20:26 <BobBall> Looks like the guest doesn't get an IP address
15:20:31 <johnthetubaguy> BobBall: do force config drive
15:20:37 <johnthetubaguy> see if that narrows it down
15:20:53 <BobBall> I think we must be forcing it normally for it to work at all
15:20:59 <johnthetubaguy> it could be a race in the ip tables writing
15:21:06 <BobBall> I wonder if there is an issue where the config drive might not be attached before we boot or something
15:21:09 <BobBall> ?
15:21:18 <johnthetubaguy> seems very unlikely
15:21:18 <BobBall> 2015-04-10 15:45:20.396 |     ### ifconfig -a
15:21:18 <BobBall> 2015-04-10 15:45:20.402 |     eth0      Link encap:Ethernet  HWaddr FA:16:3E:FC:EC:8F
15:21:21 <BobBall> 2015-04-10 15:45:20.408 |               inet6 addr: fe80::f816:3eff:fefc:ec8f/64 Scope:Link
15:21:29 <johnthetubaguy> BobBall: is it using the correct image?
15:21:32 <BobBall> I assume this means the guest definitely didn't get an IP address
15:21:43 <johnthetubaguy> I guess we override the config option
15:21:51 * ijw spies on you
15:22:54 <BobBall> We must
15:23:00 * BobBall hides behind a tree
15:23:17 <ijw> Seems you didn't get DHCP (which is only attempted when a config drive isn't mounted)
15:23:30 <BobBall> OK - so config drive wasn't there, but it _must_ be for RAX.
15:23:41 <BobBall> that's still right isn't it johnthetubaguy?
15:23:48 <BobBall> So the real question is why wasn't there a config drive...
15:23:51 <ijw> And config drive can be enabled with --config-drive=true but equally can be enabled if the global 'always give a config drive' option is set
15:24:05 <BobBall> force_config_drive             = always
15:24:12 <BobBall> from n-cpu
15:24:17 <johnthetubaguy> BobBall: I think you are getting confused, the metadata service used isn't the cloud one, it's the one in your openstack setup right?
15:24:38 <BobBall> Oh, yes, I am getting confused.
15:24:40 <johnthetubaguy> BobBall: are you sure its the right image we are launching in this test?
15:24:46 <ijw> Usually I end up poking around in the /opt/stack/data/nova/instances for the config_drive file when I'm using libvirt, you can use a loopmount to see if it's good
15:24:50 <johnthetubaguy> maybe the new test used the wrong config
15:25:13 <ijw> But assuming your cloud is not completely buggered I suspect John's right and your image hates you
15:25:19 <BobBall> Bad image!
15:25:42 <ijw> At this time of the morning it's probably just decided to down tools and get a coffee
15:26:01 <BobBall> Using CONF.compute.image_ref which is what lots of others use too
15:26:25 <BobBall> Late afternoon here ijw... Does that mean it's time to down tools and get a cup of tea and cake?
15:26:58 <johnthetubaguy> BobBall: hmm, that sounds reasonable enough
15:27:01 <BobBall> johnthetubaguy: But we were 'agreed' that it looked like a race...
15:27:09 <BobBall> because some runs definitely pass
15:27:32 <BobBall> oh hang on again sorry I'm getting fixated on config drive
15:27:37 <BobBall> which is _not_ the issue, right?
15:27:38 <johnthetubaguy> BobBall: it passes sometimes and not others, so it has to be right, it could be bad state shared between tests when they get reordered
15:27:50 <johnthetubaguy> but thats not xenapi specific
15:28:06 <BobBall> Agreed.
15:28:11 <johnthetubaguy> BobBall: unsure, coming up and not getting an IP could be an issue, but that assumes it actually started
15:28:45 <ijw> BobBall: you're down the road from the pub, it's probably worth just popping over there to see if that's where it's gone
15:28:55 <ijw> And you might as well have a pint while you're at it
15:29:24 <BobBall> Might as well.
15:30:46 <BobBall> OK
15:30:49 <BobBall> I'm going in circles here
15:31:35 <BobBall> Anyway....
15:31:37 <BobBall> Long story short
15:31:40 <BobBall> I disabled the test
15:31:44 <BobBall> so the CI is back commenting on changes
15:31:53 <BobBall> not voting though
15:32:45 <BobBall> Gets us over this initial hurdle
15:33:26 <BobBall> The test can be re-run by removing it from the tempest_exclusion_list in stackforge/xenapi-os-testing
15:33:41 <BobBall> changes to xenapi-os-testing get picked up by the CI, so you can re-run it multiple times
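[editor's note: a minimal sketch of that re-enable step — the list file name comes from the discussion above, but the exact repo layout and test name are assumed]

```shell
# Inside a checkout of stackforge/xenapi-os-testing (path assumed):
# drop the test's line from the exclusion list, then push the change
# so the CI picks it up. Test name here is illustrative.
LIST=tempest_exclusion_list
grep -v 'test_snapshot_pattern' "$LIST" > "$LIST.new" || true
mv "$LIST.new" "$LIST"
```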
15:36:13 <BobBall> Any more on CI johnthetubaguy?
15:36:52 <ijw> johnthetubaguy failed CI, sorry
15:37:03 <johnthetubaguy> not really I am afraid
15:37:14 <johnthetubaguy> we should find why it failed, and whats going on first
15:37:23 <johnthetubaguy> I can try help, ping me tomorrow morning
15:37:37 <BobBall> OK, will do.
15:38:02 <BobBall> I don't think I've got anything more to add
15:38:29 <BobBall> #topic AOB
15:38:57 <BobBall> Just one from me - if you know anyone in Nanjing who wants to work on OpenStack, tell them to apply to Citrix.
15:39:32 <BobBall> And that's all :)
15:39:39 <BobBall> johnthetubaguy - anything from you?
15:39:48 <johnthetubaguy> nothing big from me, except
15:40:46 <johnthetubaguy> lets look out for rc1 stuff
15:40:57 <BobBall> What do you mean, look out for it?
15:41:28 <BobBall> Review it, CI it, find more stuff that we want in RC1? ;)
15:41:50 <johnthetubaguy> I mean RC2 really
15:41:58 <johnthetubaguy> lets just keep an eye out for major issues
15:42:01 <johnthetubaguy> like this one :)
15:42:05 <BobBall> Yes
15:42:17 <BobBall> I am worried that this might go out in the release if we can't figure it out in time...
15:42:41 <BobBall> Unfortunately because it's just showing in our CI it's likely to just be us trying to fix it unless we can find a case that reproduces elsewhere
15:44:09 <BobBall> OK - anyway - I think we're done.
15:44:12 <BobBall> So I'll call it there.
15:44:31 <BobBall> Next meeting in a fortnight, so Thursday 30th April
15:44:34 <BobBall> no sorry
15:44:39 <BobBall> Wednesday 29th April!
15:44:42 <johnthetubaguy> cool, thanks
15:44:49 <BobBall> #endmeeting