20:00:47 <lifeless> #startmeeting tripleo
20:00:48 <openstack> Meeting started Mon Jun 17 20:00:47 2013 UTC.  The chair is lifeless. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:50 <lifeless> hi everyone
20:00:51 <openstack> The meeting name has been set to 'tripleo'
20:00:59 <dprince> lifeless: hi
20:01:00 <dkehn> hi
20:01:05 <NobodyCam> morning lifeless
20:01:07 <GheRivero> hi all
20:01:49 <lifeless> #topic agenda
20:01:50 <lifeless> bugs
20:01:51 <lifeless> Grizzly test rack status
20:01:51 <lifeless> CI virtualized testing progress
20:01:51 <lifeless> open discussion
20:01:55 <lifeless> #topic bugs
20:02:07 <lifeless> https://bugs.launchpad.net/tripleo/ as usual
20:02:19 <lifeless> we're down to 7 crits
20:02:47 <lifeless> SpamapS: you have 3 of them
20:02:57 <lifeless> SpamapS: care to give a brief status on them?
20:03:56 <dprince> lifeless: I think we might be seeing a lot of this w/ TOCI https://bugs.launchpad.net/tripleo/+bug/1166838
20:03:57 <SpamapS> sure let me catch up
20:03:58 <uvirtbot> Launchpad bug 1166838 in tripleo "rabbitmq does not start correctly on boot" [High,Triaged]
20:04:14 <lifeless> ok while SpamapS catches up
20:04:20 <SpamapS> https://bugs.launchpad.net/heat/+bug/1191931
20:04:21 <uvirtbot> Launchpad bug 1191931 in heat "AssertionError when creating a stack." [Critical,New]
20:04:23 <lifeless> I should get https://bugs.launchpad.net/tripleo/+bug/1191714 done today
20:04:27 <uvirtbot> Launchpad bug 1191714 in tripleo "400.Bad.Request..X-Instance-ID.header.is.mising.from.reque   " [Critical,Triaged]
20:04:27 <SpamapS> btw, I believe this is breaking heat right now
20:04:28 <lifeless> oh, he's up :)
20:04:37 * lifeless hands the mike to SpamapS
20:05:16 <SpamapS> Hm, I haven't checked, all 3 of those might be already merged, just needing better docs.
20:05:22 <lifeless> SpamapS: ok, and thus tripleo ?
20:05:49 <SpamapS> ok no https://bugs.launchpad.net/tripleo/+bug/1182249 is still ongoing
20:05:50 <uvirtbot> Launchpad bug 1182249 in tripleo "quantum configuration is overly hardcoded" [Critical,In progress]
20:06:04 <SpamapS> that one needs os-apply-config and/or os-refresh-config to have access to the ec2 metadata
20:06:20 <lifeless> dprince: so, several things we can do - we should move stuff out of first-boot and into orc calls
20:06:40 <lifeless> dprince: which we should *anyway* as rabbit is a service we may reconfigure.
20:06:40 <SpamapS> https://bugs.launchpad.net/tripleo/+bug/1183223 is a bit vague and I may need to work on the wording/split it out
20:06:43 <uvirtbot> Launchpad bug 1183223 in tripleo "nova-compute.yaml missing parameters" [Critical,In progress]
20:07:02 <SpamapS> https://bugs.launchpad.net/tripleo/+bug/1183442
20:07:04 <uvirtbot> Launchpad bug 1183442 in tripleo "Heat metadata updates do not work" [Critical,In progress]
20:07:06 <lifeless> dprince: that would ameliorate the toci issue even if we don't diagnose the root issue in Ubunty.
20:07:09 <dprince> lifeless: sure. Just wanted to get that marked as high priority (and potentially being worked on as Derek)
20:07:11 <lifeless> Ubuntu.
20:07:17 <lifeless> dprince: ack
20:07:28 <SpamapS> I think that one is fixed actually
20:07:50 <SpamapS> will need to test and verify, but all of the code is in place to fix it theoretically
20:08:21 <dprince> SpamapS: reference commit?
20:08:26 <lifeless> SpamapS: if that heat bug is hurting us, care to add a tripleo task?
20:10:25 <SpamapS> dprince: there are several, it will take me a while to dig it out
20:10:42 <SpamapS> lifeless: yes I'm doing so. It's basically stopping us dead in the water (might be stopping all users dead)
20:10:54 <dprince> SpamapS: no worries. I can review history myself. Thanks for the update.
20:11:22 <SpamapS> dprince: it was worked around in t-i-e and recently fixed in keystoneclient for good
20:12:14 <lifeless> dprince: a8c2ae7e1506defaa36f035377af2b7b04aaed87
20:12:36 <dprince> lifeless: thanks.
20:12:48 <lifeless> we had that listed as fixing 1183732
20:12:51 <lifeless> but
20:12:53 <lifeless> I think it's the same
20:13:46 <lifeless> ok, onto the others
20:13:57 <lifeless> bug 1184484 has provisional patches from quantum devs
20:13:58 <uvirtbot> Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged] https://launchpad.net/bugs/1184484
20:14:09 <lifeless> I tried to apply them to the HP POC environment
20:14:23 <lifeless> but it made all quantum APIs return empty responses.
20:14:27 <lifeless> Which was undesirable
20:15:06 <lifeless> so I reverted it. My plan is to get the current arc of 'get it up and working without fiddling' going, and then find a couple of spare machines in that environment and do a fresh build.
20:15:32 <mordred> o/
20:15:34 <lifeless> quantum folk have marked https://bugs.launchpad.net/tripleo/+bug/1189385 incomplete.
20:15:36 <uvirtbot> Launchpad bug 1189385 in tripleo "quantum-server hung up it's listening port" [Critical,Triaged]
20:15:43 <lifeless> I'm going to ping them asking what they are missing
20:15:58 <lifeless> #action lifeless to chase 1189385 diagnostics for quantum devs.
20:16:17 <lifeless> bug 1188301 - I think clint was tracking it?
20:16:18 <uvirtbot> Launchpad bug 1188301 in tripleo "keystone kvs driver causes process to grow indefinitely and spin on CPU with thousands of keys in a single python dict" [Critical,Triaged] https://launchpad.net/bugs/1188301
20:16:44 <SpamapS> lifeless: did you see the respond on bug 1184484 ?
20:16:45 <uvirtbot> Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged] https://launchpad.net/bugs/1184484
20:16:46 <SpamapS> response rather ?
20:16:58 <SpamapS> lifeless: it's the same old upper/lower problem again
20:17:15 <SpamapS> lifeless: oh, bug 1188301 is fixed, keystone defaults to sql now! \o/
20:17:49 <SpamapS> https://review.openstack.org/#/c/32970/
20:18:20 <SpamapS> I linked that to bug 1188378 though
20:18:21 <uvirtbot> Launchpad bug 1188378 in keystone "keystone.token.backends.sql uses a single delete command to flush expired tokens causing replication lag and potential deadlocks" [Medium,In progress] https://launchpad.net/bugs/1188378
20:18:35 <SpamapS> oh actually no I linked both
20:18:37 <lifeless> SpamapS: I'm confused.
20:18:44 <SpamapS> but the bot picked the other bug which I mentioned first
20:18:44 <lifeless> SpamapS: is it fixed, or is it pending review to be fixed?
20:18:53 <SpamapS> lifeless: keystone upstream defaults to sql
20:19:02 <SpamapS> lifeless: we have a config file that still says kvs though
20:19:15 <lifeless> ok, so not fixed for us, because we copied the files.
20:19:18 <SpamapS> lifeless: https://review.openstack.org/#/c/32970/ fixes that
20:19:31 <lifeless> yup
20:19:36 <lifeless> dprince: re percona toolkit
20:19:37 <SpamapS> Just need to work on that
20:19:45 <lifeless> dprince: what other options are there?
20:19:46 <jog0> lifeless: this is why I wanted to shrink the size of the config file
20:20:06 <dprince> lifeless: well... could we fix this in keystone?
20:20:06 <lifeless> jog0: yes, and as I said I'm with you in principle, we just need some care.
20:20:13 <jog0> lifeless: ++
20:20:25 <SpamapS> lifeless: I have a WIP fix in keystone but MySQL doesn't support LIMIT in the IN() sub-query clauses .. so I have to do something mysql specific.
20:20:46 <lifeless> dprince: right now we're broken by default. I'd like to unbreak us first, and get more portable, long term fixes second.
20:20:49 <SpamapS> pt-archiver has been cleaning out our PoC table for a week now
20:20:50 <lifeless> dprince: what do you think of that ?
20:21:27 <dprince> lifeless: this will certainly break our Fedora efforts. That is my main objection.
20:21:42 <lifeless> dprince: what if SpamapS puts a 'if ubuntu' thing around the pt cleaner.
20:21:45 <SpamapS> Frankly I think we should move to memcached eventually, but that is yet another 3rd party service to scale :p
20:21:50 <lifeless> dprince: fedora will be no worse off - broken is broken.
20:22:22 <lifeless> dprince: but it will build and install and run until you get too much contention with the upstream gc code.
20:22:34 <dprince> lifeless: go for it. I suppose I'd just like to see keystone support this with its database design.
20:22:40 <lifeless> dprince: Me too!
20:22:48 <lifeless> dprince: I just don't want to be hostage to educating them
20:22:50 <SpamapS> dprince: If you have some guidance on how to ask sqlalchemy if I'm using mysql, and then do a specially crafted query in sqlalchemy.. https://review.openstack.org/#/c/32044/ needs your comments :)
20:23:07 <dprince> lifeless: we should probably put a comment in to make note of this as well so that it doesn't confuse people, etc.
20:23:17 <lifeless> dprince: I'm not suggesting we stop caring about it, just that we get the move away from kvs in place
20:23:32 <lifeless> ok, so
20:24:15 <lifeless> #action spamaps to: - make the kvs->sql change still build and run on fedora; ensure there is a bug upstream in keystone about the bad sql behaviour, with medium priority task on tripleo.
20:24:30 <lifeless> dprince: ^ I think that meets all your concerns; if not please feel free to tweak it so that it does.
20:25:01 <lifeless> ok and bug 1191714 I am working on
20:25:02 <uvirtbot> Launchpad bug 1191714 in tripleo "400.Bad.Request..X-Instance-ID.header.is.mising.from.reque   " [Critical,Triaged] https://launchpad.net/bugs/1191714
20:25:16 <lifeless> this is fallout from the overcloud changes : it's a setting that has to be different in undercloud and overcloud.
20:25:36 <lifeless> right now any seed cloud/bootstrap cloud built with tripleo will fail metadata access from instances.
20:26:08 <lifeless> Any other bug stuff to discuss?
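Bug 1188378 above is about flushing expired tokens with one large DELETE. A minimal sketch of the batched alternative follows. It uses SQLite from the Python standard library purely for illustration: the table name `token`, its columns, and the function names are hypothetical and are not keystone's schema or code. SQLite accepts LIMIT inside the IN() sub-query; as noted in the discussion, MySQL does not, so a MySQL version would need a different query shape.

```python
import sqlite3


def flush_expired_tokens(conn, now, batch_size=1000):
    """Delete expired rows in small batches; return the total removed."""
    total = 0
    while True:
        # Each batch commits on its own, so no single statement holds
        # locks for the whole cleanup.
        with conn:
            cur = conn.execute(
                "DELETE FROM token WHERE id IN ("
                "SELECT id FROM token WHERE expires < ? LIMIT ?)",
                (now, batch_size),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def demo():
    """Build a throwaway in-memory table (hypothetical schema) and flush it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE token (id TEXT PRIMARY KEY, expires REAL)")
    now = 1000.0
    rows = [(f"expired-{i}", now - 1) for i in range(2500)]
    rows += [(f"live-{i}", now + 3600) for i in range(10)]
    with conn:
        conn.executemany("INSERT INTO token VALUES (?, ?)", rows)
    removed = flush_expired_tokens(conn, now, batch_size=1000)
    remaining = conn.execute("SELECT COUNT(*) FROM token").fetchone()[0]
    return removed, remaining
```

Calling `demo()` removes the expired rows in three batches and leaves the unexpired ones in place. The batch size is a tuning knob: smaller batches mean shorter transactions at the cost of more round trips.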
20:26:36 <lifeless> #topic grizzly rack status
20:26:46 <lifeless> so our POC has been getting hammered by some test users
20:26:50 <lifeless> which is great.
20:26:58 <lifeless> the only issue they have had so far is the quantum poolsize one.
20:27:25 <lifeless> With the comment spamaps pointed out, we can try switching to that quantum version again
20:27:46 <lifeless> #action lifeless to test the quantum deadlock fix on the POC again.
20:28:09 <lifeless> So, this is pretty good news, the path to near-production was a lot smoother than it might have been :)
20:28:23 <lifeless> anything else on the POC environment ?
20:28:51 <SpamapS> moar PoC racks plz
20:30:25 <lifeless> I have 2 requests to bring the tripleo love to prod racks
20:30:38 <lifeless> they know we're not finished, and caveats etc.
20:30:47 <lifeless> but they want to see how it flies.
20:31:07 <lifeless> so - that's pending other folks' bandwidth. Will keep everyone apprised as things eventuate.
20:31:26 <lifeless> #topic CI virtual testing progress
20:31:30 <lifeless> pleia2: tag
20:31:34 <pleia2> hello!
20:31:48 <pleia2> so testing on lxc is moving along
20:32:19 <pleia2> got most networking stuff sorted last week and openstack using boot-stack is mostly running within lxc, just working out some launching issues with some of the services
20:32:40 <lifeless> sweet
20:32:42 <pleia2> at this point I don't foresee us hitting any major blockers
20:32:47 <lifeless> do you need more eyeballs ?
20:33:12 <pleia2> not right at this moment
20:33:16 <pleia2> but soon
20:33:41 <jog0> will the CI virt testing just test the undercloud/boot-stack in LXC?
20:33:52 <pleia2> jog0: that's the plan
20:34:12 <jog0> so no overcloud testing in first pass
20:34:34 <lifeless> jog0: the primary goal is to get baremetal code path test coverage.
20:34:58 <lifeless> jog0: with a little bit of 'image builds properly' and 'tripleo config seems legit' built in.
20:35:07 <lifeless> jog0: -> baby steps.
20:35:20 * jog0 *nod*
20:35:31 <pleia2> yeah, and we're starting off simple just so we have a basic setup (this whole thing has somewhat stalled partially because I've been trying to do all-the-things)
20:35:40 <lifeless> pleia2: ok, please shout in #tripleo when you need someone to eyeball logs or whatever to help diagnose failure-to-startup.
20:35:50 <pleia2> lifeless: great, thanks :)
20:35:52 <lifeless> #topic open discussion
20:35:53 <dprince> pleia2: is this still using TOCI?
20:36:13 <pleia2> dprince: it's diverged quite a bit, but I hope to pull it back and submit some patches to TOCI
20:36:47 <dprince> pleia2: cool. I sort of went the other way... and we are close to having TOCI driving real bare metal.
20:37:05 <pleia2> dprince: it's doing a lot of the same things, so if we could have some switches built in to handle virtual+lxc it would be great
20:37:20 <pleia2> right now I'm running everything by hand though
20:37:21 <dprince> pleia2: we'll be more resource strapped there... but we are finding good things.
20:37:27 * pleia2 nods
20:37:50 <dprince> lifeless: I've got a couple things I'd like to run past you all
20:38:36 <lifeless> dprince: shoot!
20:39:30 <dprince> lifeless: Okay. First thing this troubleshooting thing.
20:40:09 <dprince> lifeless: I have a review up to make it so that we don't always have to hang the deployment process if something bad happens in the deploy ramdisk.
20:40:40 <dprince> lifeless: we *can't* hang the deploy process for CI. it will kill my resource pools and the failure rate is still way too high.
20:40:54 <dprince> So that is step one. (don't hang it)
20:41:12 <dprince> https://review.openstack.org/#/c/33076/
20:41:49 <lifeless> hanging bad
20:41:58 <lifeless> there is a timeout mechanism for nova-bm
20:42:02 <lifeless> it's off by default...
20:42:02 <dprince> Step two would be to have a simple err message tracking capability. I understand we are working on a proper agent... but in the meantime for the "H" release we need something. So maybe something like this: https://review.openstack.org/#/c/33341/
20:42:40 <dprince> lifeless: A timeout would be good as well. But it is hanging because we call a 'bash' shell inline. That is just plain bad IMO.
20:43:20 <dprince> lifeless: So with the second branch above ^^ we'd essentially just add a small blip to the nova-bare-metal-deploy helper so we can get and log the message.
20:43:32 <lifeless> +500
20:43:58 <dprince> lifeless: I feel like this is a bit home brew... but I gotta say I can't do much about automating this without these things.
20:44:08 <lifeless> I am totally in favour of this sort of thing; devananda has some reasonable concerns about not changing nova baremetal, but IMO leaving it totally broken is not feasible.
20:44:19 <dprince> lifeless: lastly, we need to get devananda's Nova branch to improve the IPMI power commands in.
20:44:46 <dprince> lifeless: the Nova change is really small. and totally backwards compatible. I'll push it by the end of the day too.
20:44:48 <lifeless> I don't think a super duper agent is needed in the short term : it's a nice thing to have, but not a necessary condition for any of this.
20:45:05 <dprince> lifeless: okay. I think we are on the same page.
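The two steps discussed above (don't hang on failure, and capture an error message that can be logged) can be sketched generically as follows. This is not the code in either review and not nova-baremetal's timeout mechanism; `run_step` and its message format are hypothetical names chosen for illustration. The idea is simply to run a step with stdin closed and a deadline, and to return the failure text instead of dropping into an interactive shell.

```python
import subprocess
import sys


def run_step(cmd, timeout_s):
    """Run one deploy step; return (ok, message) instead of blocking."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # never wait on an interactive prompt
            capture_output=True,
            text=True,
            timeout=timeout_s,         # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s: {cmd!r}"
    if proc.returncode != 0:
        # Hand the error text back so the caller can log or report it.
        return False, f"exit {proc.returncode}: {proc.stderr.strip()}"
    return True, ""


if __name__ == "__main__":
    # A stand-in for a step that would otherwise hang the deployment.
    hang = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_step(hang, timeout_s=1))
```

A CI driver could call this for each step and release the machine as soon as any step returns a failure, which keeps a failed deploy from tying up the resource pool.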
20:45:24 <dprince> lifeless: Okay. Slightly different topic.
20:45:24 <lifeless> yes, ack on that - I'll dig up that review and see where we are at.
20:45:31 <sdake> lifeless does dib now default to using tmpfs for its building magic?
20:45:32 <dprince> backticks. Can they go away.
20:45:44 <lifeless> dprince: `` -> $() ?
20:45:47 <dprince> I'd much prefer we use the more formal $()
20:45:56 <dprince> lifeless: it's a style thing... but yes.
20:46:04 <lifeless> fine by me; add it to HACKING or README or something so it's discoverable.
20:46:21 <lifeless> sdake: yes
20:46:28 <dprince> lifeless: Cool. We don't have a bash HACKING that I know of but I'll take a shot at that.
20:46:41 <sdake> lifeless I guess I'm a dummy but the command line option to run a fedora dib doesn't immediately stick out at me from the -h or readme.md
20:46:42 <lifeless> sdake: see README.md - Requirements.
20:46:56 <lifeless> sdake: disk-image-create fedora
20:47:08 <sdake> thanks I'll try that lifeless ;)
20:47:23 <sdake> sure is fast
20:47:33 <lifeless> sdake: :>
20:47:49 <SpamapS> will be nice with the official F19 and later images too
20:48:00 <sdake> be sweet if it had an api to go with it :)
20:48:19 <sdake> ya f17 lost cause at this point ;)
20:48:27 <lifeless> sdake: actually I think it's still too slow, need to add some parallel in there, as well as make it trivial to set up local pypi and openstack git mirrors; but folk like derekh (not in this channel atm) have that in-progress
20:48:32 <sdake> I plan to change all the heat instances to default to f19 when it comes out
20:49:06 <lifeless> sdake: an API would be nice; I think structurally we should layer that on top - separate concern.
20:49:14 <sdake> lifeless agree
20:51:25 <SpamapS> sounds like we're done :)
20:52:33 <lifeless> agreed
20:52:34 <lifeless> #endmeeting