19:01:59 #startmeeting tripleo
19:02:00 Meeting started Tue Jul 15 19:01:59 2014 UTC and is due to finish in 60 minutes. The chair is lifeless. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:02:01 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:02:03 The meeting name has been set to 'tripleo'
19:02:28 o/
19:03:10 o/
19:03:15 O/
19:03:29 o/
19:03:56 hi
19:04:16 G'mornin
19:04:31 evening...
19:04:49 morning
19:06:05 ok we seem to have a bunch of o/'s
19:06:08 #topic agenda
19:06:18 bugs
19:06:18 reviews
19:06:18 Projects needing releases
19:06:18 CD Cloud status
19:06:18 CI
19:06:21 Tuskar
19:06:23 Specs
19:06:32 Insert one-off agenda items here
19:06:32 Please start adding items to the mid-cycle agenda on the etherpad at https://etherpad.openstack.org/p/juno-midcycle-meetup
19:06:35 open discussion
19:06:37 #topic bugs
19:06:49 #link https://bugs.launchpad.net/tripleo/
19:06:50 #link https://bugs.launchpad.net/diskimage-builder/
19:06:50 #link https://bugs.launchpad.net/os-refresh-config
19:06:50 #link https://bugs.launchpad.net/os-apply-config
19:06:50 #link https://bugs.launchpad.net/os-collect-config
19:06:52 #link https://bugs.launchpad.net/os-cloud-config
19:06:54 o/
19:06:54 #link https://bugs.launchpad.net/tuskar
19:06:57 #link https://bugs.launchpad.net/python-tuskarclient
19:08:40 a couple of new ones in tripleo
19:09:03 is michael kerrin around ?
19:09:06 jp_at_hp: ^
19:09:14 bug 1263294
19:09:16 Launchpad bug 1263294 in tripleo "ephemeral0 of /dev/sda1 triggers 'did not find entry for sda1 in /sys/block'" [Critical,In progress] https://launchpad.net/bugs/1263294
19:09:24 GheRivero: how are you going on bug 1316985 ?
19:09:25 Launchpad bug 1316985 in tripleo "set -eu may spuriously break dkms module" [Critical,In progress] https://launchpad.net/bugs/1316985
19:09:45 hmm, where is rpodylka these days ? bug 1317056
19:09:48 Launchpad bug 1317056 in tripleo "Guest VM FS corruption after compute host reboot" [Critical,Triaged] https://launchpad.net/bugs/1317056
19:10:06 TheJulia: how is bug 1336915 progressing ?
19:10:07 Launchpad bug 1336915 in tripleo "We can start multiple mysql masters if mysql.nodes is undefined" [Critical,In progress] https://launchpad.net/bugs/1336915
19:10:19 and dprince - you have bug 1342101
19:10:19 https://review.openstack.org/#/c/104414/3
19:10:20 Launchpad bug 1342101 in tripleo "Cinder volumes fail: No section: 'Filters' errors" [Critical,In progress] https://launchpad.net/bugs/1342101
19:10:20 I thought rpodylka had planned to close that after the second-previous alternate-time meeting
19:10:22 patch for the mysql one
19:10:51 lifeless, it's 8pm, so I wouldn't expect him.
19:11:01 lifeless: Just needs some reviews https://review.openstack.org/#/c/104414/
19:12:13 lifeless: https://review.openstack.org/#/c/95151/ and its dependencies are in really good shape and waiting for some core review love :)
19:12:22 lifeless: bug 1342101 is an easy fix, just revert the original patch...
19:12:24 Launchpad bug 1342101 in tripleo "Cinder volumes fail: No section: 'Filters' errors" [Critical,In progress] https://launchpad.net/bugs/1342101
19:13:53 ok so
19:14:31 #info please review https://review.openstack.org/#/c/104414/3 for bug 1336915
19:14:32 Launchpad bug 1336915 in tripleo "We can start multiple mysql masters if mysql.nodes is undefined" [Critical,In progress] https://launchpad.net/bugs/1336915
19:15:00 #info please review https://review.openstack.org/#/c/95151/ for bug 1316985
19:15:01 Launchpad bug 1316985 in tripleo "set -eu may spuriously break dkms module" [Critical,In progress] https://launchpad.net/bugs/1316985
19:15:21 #info please review https://review.openstack.org/107041 for bug 1342101
19:15:22 Launchpad bug 1342101 in tripleo "Cinder volumes fail: No section: 'Filters' errors" [Critical,In progress] https://launchpad.net/bugs/1342101
19:15:55 This is my weekly reminder that anyone can use hash-info to add items to the minutes and save lifeless needing to summarise ;)
19:16:37 no other criticals
19:16:43 any other bug stuff to discuss?
19:17:02 lifeless: ci is pretty much banjaxed
19:17:21 derekh_: is that a banjo with an axe through it?
19:17:37 lifeless: yup, must be
19:17:55 lifeless: anyways, should we add tripleo to bug 1341420
19:17:59 Launchpad bug 1341420 in nova "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,Triaged] https://launchpad.net/bugs/1341420
19:18:06 lifeless: if that is the cause ?
19:19:47 also, could our recent switch to an HA controller (more instances being booted) or 2G memory be magnifying the symptoms ?
19:19:54 derekh_: I don't think so
19:20:07 seems like more nodes would magnify it
19:20:15 no
19:20:22 or rather
19:20:32 * derekh_ notes that the revert of control scale from dprince passed https://review.openstack.org/#/c/106852/
19:20:34 the same number of UC nodes is in play
19:20:43 ah
19:20:45 oh, I was seeing this in the overcloud
19:20:48 3 nodes to -> 5 nodes
19:21:01 at 3 nodes we physically cannot trigger it
19:21:07 derekh_: trying to get me fired up about that!
19:21:09 aha
19:21:11 lifeless: yup, the 3 to 5 is what I was talking about
19:21:39 I've triggered a recheck to see if it passes again
19:21:42 because with 3 nodes the 3rd retry of a node will fall on the 3rd node.
19:22:08 so basically, optimistic schedulers are hard?
19:22:19 actually this is a pessimistic scheduler
19:22:26 it presumes grants won't actually happen
19:22:42 dprince: I usually try to avoid it :-)
19:22:53 hrm, /me tries not to derail the meeting
19:23:48 derekh_: well, too late, you've gone and done it
19:23:57 ok so
19:24:00 we need a plan here
19:24:01 options
19:24:15 - revert to 1 ctl 2 hypervisor
19:24:44 - put a workaround in toci, e.g. the sleep in Ic9f625398b30bcdf81fb2796de0235713f4d4aa6
19:25:09 lifeless: I thought we'd established that that sleep wasn't sufficient?
19:25:21 lifeless: I'd like to revert. I think we are premature in our chase for HA.
19:25:28 Long term I think we should consider a reservation concept.
19:25:38 * dprince notes that he can't even reboot an undercloud node successfully anymore
19:26:01 * SpamapS drives by and then runs out again because of issues IRL
19:26:04 SpamapS: lock the resources in the scheduler?
19:26:24 not saying HA isn't important, but it falls well behind the simple case
19:26:35 greghaynes: something like that.
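For context on the second option above: the "sleep" refers to change Ic9f625398b30bcdf81fb2796de0235713f4d4aa6. A minimal sketch of the general shape such a pacing workaround could take in a toci-style script follows; the flavor, image, node count, and interval are illustrative assumptions, not the contents of that change.

    # Boot test nodes one at a time, pausing between requests so the nova
    # scheduler's resource claims settle before the next boot - narrowing
    # the race window described in bug 1341420. All names and values here
    # are illustrative.
    for i in $(seq 1 "${NODE_COUNT:-5}"); do
        nova boot --flavor baremetal --image overcloud-node "node-$i"
        sleep 60
    done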
19:27:03 yes, +1 - that is a lot easier type of scheduler to get correct IMO
19:27:06 dprince: mmm, I very much disagree about the relative importance there, but reverting this wouldn't be about HA, it would be about velocity.
19:27:06 greghaynes: basically have actual consumable flavor-sized things that the scheduler can consume as a queue, rather than spraying hope everywhere.
19:27:51 SpamapS: so the scheduler was like that; can I suggest folk wanting to design a fix head to #openstack-nova
19:27:59 since we can rabbit-hole here super easily
19:28:12 * SpamapS happily goes away
19:28:31 are there any other right-now options to unwedge CI ?
19:28:32 lifeless: HA is causing us too much churn at this point. We can't even deploy our CI overclouds without hacks yet!
19:28:35 So right now, if the revert fixes it I'm for that (unless there is a nova fix pretty much ready); once we're happy it's good to go again, I'm happy to turn the 3 x controllers back on
19:28:43 lifeless: much less run HA in them!
19:29:01 Seems like no other option :(
19:30:07 dprince: I'm interested in what churn you're seeing, but since HA is the current project focus on the path to removing the seed and really being fully deployed... let's talk outside the meeting timeline about that
19:30:16 so it seems like we need to revert
19:30:55 I'd like to suggest that we then put the 3-node HA patch back up again with an included hacky patch, and see if we can get some good runs with such a thing
19:31:19 lifeless: the recheck is still running, want to wait for a result to be sure it will help? probably an hour
19:31:46 derekh_: I'm absolutely sure it will, but yes. Also I want dprince to reword it, because the revert is IMO strictly about CI being broken.
19:32:00 lifeless: k
19:32:06 dprince: are you ok with that?
19:33:02 lifeless: I'm happy to *add* a word or two. But not cool with changing what I've already written
19:33:05 I think when we're adding HA back in as the default we've got to say something like X number of successful CI runs in a row or something, not keep rechecking until it passes
19:33:16 not saying that's what happened, I haven't looked
19:33:44 derekh_: I don't believe that's what happened; I think actually it may be that we turned it on and then whatever made this happen so much more landed.
19:34:08 derekh_: we weren't seeing this in Helion builds last month, for instance - something has made it massively more common very recently.
19:34:19 lifeless: k
19:34:23 derekh_: perhaps heat efficiencies, perhaps oslo.messaging, I dunno
19:35:04 dprince: so your revert patch has a bunch of editorial about resources and defaults that is IMO entirely incorrect in this context.
19:35:24 dprince: and the vibe I've got from the development community is out of sync with what you say
19:35:27 lifeless: unless of course we rename devtest to something else. My revert is really about devtest being for developers, who with the new defaults can't actually run it anymore!
19:36:02 dprince: 6 2GB VMs fit in a 16GB machine, and we've been saying 16GB machines for a long time
19:36:14 lifeless: put it to the test, with the new defaults let's see who can actually run the entire devtest: seed -> undercloud -> overcloud
19:36:17 lifeless: not if you have firefox open as well
19:36:33 lifeless: I don't have a 16G laptop
19:36:42 Seems like you're forcing either no revert to go through or a second revert to be made?
I don't think we should assume we got consensus on that revert simply because we need to get CI passing again
19:37:06 I can put up a revert that only talks about CI, but I do think getting some consensus around this is important
19:37:07 lifeless: I can run parts of devtest, sure. I know how to do that. But I think this is super unfriendly to the community. A barrier, even
19:37:12 I thought we had that from atlanta
19:37:17 I do have a 16Gb laptop, but with a browser and desktop environment open I only have about 10Gb usable for VMs
19:37:44 lifeless: If your revert is the same as mine you are just being silly
19:37:48 * dprince may -2 it
19:38:10 dprince: indeed, we'd be into silly territory
19:38:32 i don't see any consensus in the summit etherpad
19:38:41 is there some other record?
19:38:57 https://etherpad.openstack.org/p/juno-summit-tripleo-environment is the etherpad you're looking at?
19:39:01 the only "consensus" i recall about this topic was that there was disagreement
19:39:09 how about we revert with the comment that we need to revisit the default, and then try to reach minimum-requirements consensus on the mailing list ?
19:39:28 or at the sprint
19:39:32 lifeless: there was no consensus
19:40:03 my memory is that there was consensus that we needed to pursue several options to try to find something that worked
19:40:11 tchaypo: i was actually looking at the CI one. but yea, that one too
19:40:13 hence the tripleo-on-openstack work
19:40:28 Yeah, I don't know that we ever picked a number, although most people in the room were running on 16 GB boxes.
19:40:44 I think --no-undercloud might predate that session, but I remember it being suggested as one way to decrease the requirements, and I've been using it pretty much ever since
19:40:45 right, and 16gb is not sufficient for a default devtest
19:40:49 And the agreement seemed to be that we wanted to support multiple deployment configurations to accommodate people with less hardware.
19:40:53 --no-undercloud predates it
19:41:03 if I don't use --no-undercloud on my laptop I can't build the full stack
19:41:08 and it's a 16Gb laptop
19:41:21 so right, I remember the vibe being supporting scaled-down configs
19:41:25 which I'm totally in support of
19:41:34 I've given up using it though, I pretty much only use my 32Gb machine in hetzner these days
19:41:37 but this is about defaults - and look, it's a good thing we're seeing this in CI, it means it's *working*.
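For anyone hitting the memory limits discussed above, here is a minimal sketch of trimming devtest for a 16GB machine, combining the --no-undercloud flag mentioned in the meeting with a seeded rc file. The variable names and values are assumptions about devtest_variables.sh knobs, not recommended defaults.

    # Seed overrides before the first devtest run.
    cat >> ~/.devtestrc <<'EOF'
    export NODE_MEM=2048            # assumed knob: RAM per test VM, in MB
    export OVERCLOUD_CONTROLSCALE=1 # assumed knob: single, non-HA controller
    EOF
    # Skip building the undercloud VM entirely, as described above.
    ./devtest.sh --trash-my-machine --no-undercloud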
19:41:58 we're seeing actual things that matter
19:42:34 I think dprince's point is that he feels the defaults should be aimed at developers testing their own setups; if CI needs different options we should add them in toci
19:42:39 ./devtest.sh --minimal|--full (user must select one)
19:42:56 i guess i'm just used to always overriding *something*
19:43:06 hence why i only +1'd the revert
19:43:08 slagle: +1 :-)
19:43:12 we can't make everyone happy
19:43:19 and in doing so, we make no one happy
19:43:23 The reason I'm pushing on having the default be scaled up enough to trigger all these things is so that we trigger these things where possible in front of devs
19:43:32 not in CI where analysis is substantially harder
19:43:40 s/not in CI/not JUST in CI/
19:43:44 +1 slagle
19:43:58 I'm preparing a talk for pycon-au to introduce devs to tripleo
19:44:13 one line near the end that's very firmly in the talk is to tell them that tripleo isn't going to work for them
19:44:27 we know whether to use a single-machine-targeted config OOTB, because we'll have been told to use VMs not real hardware
19:44:28 it's designed to be production ready, but it's not designed for *their* production in particular
19:44:31 It seems pretty silly that as a developer with sufficient resources I need to override settings to test what we would consider our ideal deployment...
19:44:32 so dprince I'm fine with a commit message that says we don't have consensus on the default - we clearly don't, but we also don't have consensus on the status quo being better ;)
19:44:33 tbh, i think the devtest workflow should be: source devtest_variables, write-tripleorc, tweak tripleorc, etc
19:45:09 I'm going to be saying that they're going to need to look at where it breaks for them and start fiddling. To me this sounds like one of the ways they might expect it to break for them
19:45:41 lifeless: So my current commit says just that: "We spoke about this in Atlanta and there was not consensus"
19:46:11 dprince: ... about setting it to 3 - that implies there is consensus on the status quo, at least how I read it :)
19:46:22 dprince: I'm saying we don't seem to have any consensus at all
19:46:42 This sounds like something we _really_ need to discuss next week.
19:46:45 dprince: we've got folk expecting to have to change something, we've got folk expecting it to not work
19:46:52 * greghaynes added this to the meetup agenda
19:46:56 greghaynes: thanks
19:46:59 slagle: https://wiki.openstack.org/wiki/TripleO#Notes_for_new_developers slightly reverses that - it aims to get ~/.devtestrc seeded with settings before the first run :)
19:47:18 lifeless: okay. I'll remove the clarification and leave it at that
19:47:26 Who is most local to raleigh?
19:47:32 can we ask them to supply popcorn?
19:47:41 dprince: thank you, I've removed my -2 already
19:47:57 tchaypo: there are similar snacking foods in the cafeteria that will suffice for that :)
19:48:11 hah, so next week is going to be fun.
19:48:30 tchaypo: yea, my thing with that though is that i don't remember *what* to override, hence why i like the write-tripleorc that spits it all out for me nice and clean
19:48:34 * bnemec resists the urge to go off on a popcorn tangent
19:48:40 greghaynes: I think you need to add an experimental job with scale set to three.
I have a few infra patches pending so I'll put one up
19:48:41 we'll make sure to set up metal detectors at all the meeting room doors
19:48:42 while in reality, just copy the same tripleorc around
19:48:52 lifeless, if there are 2 configs, does CI have the resources to test both?
19:49:02 greghaynes: actually, I mean non-voting, not experimental
19:49:06 jp_at_hp: no, but I think we have to.
19:49:11 lifeless: sounds good
19:49:24 jp_at_hp: once I have SSL working again I think we'll have hp1 back up
19:49:35 slagle: i like your thinking.
19:49:55 which will bring online 12 96G hypervisors and 24 testenvs
19:50:05 derekh_: dprince: ^ that's about the right ratio, right ?
19:50:24 ok, let's move on since we have a plan
19:50:31 lifeless: the hp1 servers will have 8 TEs per TE host
19:51:33 #info CI breakage fix: revert the HA default patch, set up a non-voting job to track HA status while bug 1341420 is worked on
19:51:34 Launchpad bug 1341420 in nova "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,Triaged] https://launchpad.net/bugs/1341420
19:51:38 lifeless: could bring TE hosts down a little and bump compute nodes
19:51:57 hp2 is ready to be deployed as well; it has 20 hosts at the moment
19:52:02 and hopefully 80 more in a few weeks
19:52:14 sweet
19:52:19 folk wanting to fix this PLEASE talk to me about specific things you can do
19:52:33 #topic reviews
19:52:34 #info There's a new dashboard linked from https://wiki.openstack.org/wiki/TripleO#Review_team - look for "TripleO Inbox Dashboard"
19:52:37 #link http://russellbryant.net/openstack-stats/tripleo-openreviews.html
19:52:40 #link http://russellbryant.net/openstack-stats/tripleo-reviewers-30.txt
19:52:43 #link http://russellbryant.net/openstack-stats/tripleo-reviewers-90.txt
19:52:46 we're running very late
19:53:08 so SpamapS did a meta-review of a couple of HP folk and there were no -1s and plenty of +1s on the list
19:53:23 I'll add them to -core later today unless anyone here objects?
19:53:38 Stats since the last revision without -1 or -2:
19:53:38 Average wait time: 10 days, 17 hours, 49 minutes
19:53:38 1st quartile wait time: 4 days, 7 hours, 57 minutes
19:53:38 Median wait time: 7 days, 10 hours, 52 minutes
19:53:40 3rd quartile wait time: 13 days, 13 hours, 25 minutes
19:53:53 we're basically in a holding pattern on the 3rd quartile here. Let's talk more next week.
19:54:06 #topic projects needing releases
19:54:11 do we have a volunteer?
19:54:12 I volunteer as tribute!
19:54:18 thanks
19:54:25 #topic CD cloud status
19:54:36 rh1 is fine AFAIK - I don't think we've deployed the new machines yet though.
19:54:42 lifeless: the rh1 cloud is having trouble; there seem to be a lot of instances getting double IP addresses, and also failing to be deleted.
19:54:48 I've been keeping it ticking over by calling nova reset-state on instances that nodepool can't delete
19:54:56 ugh
19:55:01 this wasn't happening in the past (nothing's changed), so this may be getting progressively worse
19:55:20 ok, we probably need to turn on debug logging and get a trace to file a bug
19:55:21 anyways, that's all I have on it, so move on and we can discuss out of meeting
19:55:32 hp1 has 2 bad machines but is otherwise fine pending deployment
19:55:41 derekh_: yes, let's do. Is there a bug filed on this?
19:56:00 hp2 we haven't burnt in the machines yet; we're expecting it to be heterogeneous hardware, which will mean dev work is needed to use them.
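derekh_'s stop-gap for the stuck rh1 instances looks roughly like the sketch below, assuming admin credentials are sourced; the ERROR filter and UUID extraction are illustrative, since the stuck instances may equally be wedged in a deleting task state.

    # Reset instances that nodepool can't delete back to ACTIVE, then retry
    # the delete.
    for uuid in $(nova list --status ERROR | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'); do
        nova reset-state --active "$uuid"
        nova delete "$uuid"
    done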
19:56:09 #topic CI
19:56:17 dprince: nope, I haven't managed to find anything relevant in the logs, will turn on debugging later
19:56:21 lifeless: holding pattern on the third quartile, but average and median are blowing out
19:56:40 anyone who hasn't updated https://etherpad.openstack.org/p/juno-midcycle-meetup to confirm they're going to the dinner should do that *NOW*
19:56:43 #info discussed earlier: nova bug 1341420 is killing us
19:56:45 Launchpad bug 1341420 in nova "gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit" [High,Triaged] https://launchpad.net/bugs/1341420
19:56:53 #info please help with that!
19:56:58 #topic tuskar
19:57:02 any tuskar business ?
19:57:15 #topic specs
19:57:29 I propose we talk next week; let's try to get as much review consensus on them this week as we can
19:57:38 also I really want to see the approved specs actioned :)
19:57:49 #topic other business
19:58:11 #info anyone who hasn't updated https://etherpad.openstack.org/p/juno-midcycle-meetup to confirm they're going to the dinner should do that *NOW*
19:58:19 Jaromir asks that we put some effort into the agenda on https://etherpad.openstack.org/p/juno-midcycle-meetup and also there is a dinner thing there that you need to put your name on.
19:58:23 tchaypo: dude
19:58:29 tchaypo: like 10 seconds patience
19:58:36 #topic open discussion
19:59:00 sorry. suddenly realised I forgot to #tag it earlier
19:59:17 tchaypo: it was more that you put it in the CI section
19:59:24 tchaypo: which is a fairly big non sequitur
19:59:35 yeah, I got confused about which section we were at :(
20:00:10 CI - continuous ingestion
20:00:15 mmm breakfast time
20:00:17 streaky bacon
20:00:20 #endmeeting
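Following up on the action to turn on debug logging for the rh1 double-IP problem: a sketch of the nova side, assuming a 2014-era install with configuration in /etc/nova/nova.conf; the service names are assumptions and vary by deployment.

    # Enable debug logging so scheduler/network traces land in the nova logs,
    # then restart the services involved in IP allocation and deletion.
    sudo sed -i 's/^[# ]*debug *=.*/debug = True/' /etc/nova/nova.conf
    sudo service nova-network restart   # assumed service names; adjust to
    sudo service nova-compute restart   # match the deployment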