14:00:43 #startmeeting tripleo
14:00:44 Meeting started Tue Apr 5 14:00:43 2016 UTC and is due to finish in 60 minutes. The chair is dprince. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:48 The meeting name has been set to 'tripleo'
14:00:54 howdy
14:00:59 hi
14:01:05 hi everyone
14:01:15 \o
14:01:22 o/
14:01:26 o/
14:02:51 o/
14:02:55 o/
14:02:58 #topic agenda
14:02:58 * quickstart (one-off questions)
14:02:58 * bugs
14:02:58 * Projects releases or stable backports
14:02:58 * CI
14:03:00 * Specs
14:03:03 * open discussion
14:03:22 trown: I've added your quickstart one-off agenda item to the front of the agenda since we didn't get to it last week
14:03:33 I also just added a topic re summit sessions
14:03:48 dprince: k, I mostly resolved it :)
14:03:50 o/
14:03:57 trown: okay, so skip it?
14:04:15 trown: meeting times have been tight, so if it is handled we can swap it out of the schedule
14:04:15 but I have some other questions about documentation and image building
14:04:18 shardy: ack
14:04:34 I will send to the ML first and we can discuss in a later meeting
14:05:00 okay, so this:
14:05:02 #topic agenda
14:05:02 * bugs
14:05:02 * Projects releases or stable backports
14:05:02 * CI
14:05:05 * Specs
14:05:07 * summit sessions
14:05:10 * open discussion
14:05:33 oops, I meant to call that agenda2. Oh well
14:06:00 shardy: we'll talk about the summit sessions right after specs
14:06:04 let's get started
14:06:09 dprince: thanks
14:06:11 #topic bugs
14:06:25 lots of bugs last week
14:06:42 https://bugs.launchpad.net/tripleo/+bug/1566259
14:06:43 Launchpad bug 1566259 in tripleo "Upgrades job failing with ControllerServiceChain config_settings error" [Critical,Triaged]
14:06:56 this is currently breaking our CI the most, we think
14:06:59 So, as discussed in #tripleo, this is already fixed by a heat revert
14:07:14 we just have to figure out how to consume a newer heat, I think
14:07:14 shardy: yep, just wanted to mention it in case others weren't following
14:07:33 yup, my response was also just for info in case folks weren't following :)
14:08:14 any other bugs to mention?
14:08:26 https://bugs.launchpad.net/tripleo/+bug/1537898
14:08:28 Launchpad bug 1537898 in tripleo "HA job overcloud pingtest timeouts" [High,Triaged] - Assigned to James Slagle (james-slagle)
14:08:37 I wanted to mention we merged a patch to address that, hopefully
14:08:48 shardy: so the options are promote, temprevert, or whitelist? /me wasn't following; which are we going for?
14:09:05 but now I'm seeing timeouts from the heat stack timeout, which defaults to 360s
14:09:12 so we likely need another patch to bump that as well
14:09:34 derekh: Yeah, we need to decide which - my whitelist patch didn't pass all the jobs, so all options are still on the table I think
14:09:41 that error killed 81 jobs last week, fwiw
14:09:46 slagle: does this not fix it? https://review.openstack.org/#/c/298941/
14:09:53 81, ouch
14:09:58 dprince: partially, that's what I'm saying
14:10:10 we bumped the rpc timeout to 10 minutes, but the heat stack has a timeout of 6 minutes
14:10:11 sshnaidm, ^
14:10:34 so I was going to do another patch to bump the 6 minutes to 20 minutes
14:10:37 also one more CI-related fix is here https://review.openstack.org/#/c/301624/ -- we weren't getting logs for jobs where crm_resource --wait hung; this should fix it and perhaps allow us to debug stuck HA jobs
14:10:47 slagle: yes. let's do all these
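
A minimal sketch of the kind of stack timeout bump being discussed above, assuming the pingtest stack is created with the openstack CLI; the template file and stack name are hypothetical and the real tripleo-ci pingtest invocation may differ:

    # bump the pingtest stack-create timeout from the 6-minute default
    # to the 20 minutes mentioned in the discussion (value in minutes)
    openstack stack create --timeout 20 \
        --template tenantvm_floatingip.yaml \
        pingtest_stack
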
14:11:00 slagle: but seriously, I think our pingtest is taking way too long
14:11:24 weshay: it's about pingtest timeouts; I meant the overall timeout of the whole job
14:11:28 dprince: indeed. The only thing I've been able to deduce is that we are completely CPU bound in the testenvs
14:11:37 wow, so that's 6 minutes to create a few neutron resources and have the server go ACTIVE
14:11:44 it's not even including the time to boot it
14:11:59 slagle: yeah
14:12:20 slagle: will you push the patch to tune/bump the heat stack timeout then?
14:12:25 yes, will do
14:12:43 slagle: thanks
14:13:02 okay, so with shardy's and slagle's incoming patches I think we should be in better shape
14:13:08 any other issues this week for bugs?
14:13:15 dprince: I'm not sure if it could be considered a bug, but recently a lot of jobs are stopped in the middle and fail because of the timeout; they take more than 3 hours, especially ha and upgrades. Should it be reported as a bug?
14:13:34 the liberty upgrades job is broken too, but I have not looked into exactly why... more in the CI topic
14:13:43 sshnaidm: that can be a bug. We need to stop adding to the wall time of our CI jobs.
14:14:06 +1 to reversing our trend of adding to wall time
14:14:07 dprince: ok, then I will report it in Launchpad
14:14:16 sshnaidm: I think perhaps we even need to consider cutting things from the jobs. The HA jobs are just too resource intensive
14:14:23 well, we're also getting poor performance from the CI nodes, which gives rise to spurious timeouts
14:14:57 dprince: maybe it could run in periodic jobs or experimental...
14:15:20 sshnaidm: yep, we'd need a team focused on keeping those passing though
14:15:29 adarazs: ^ yeah, we can try to backfill the coverage on ha if needed
14:15:30 sshnaidm: but we could go at it that way
14:15:32 both options result in folks ignoring failures, unfortunately
14:15:46 which is a topic for the CI section ;)
14:15:54 yep, let's move on
14:16:04 #topic Projects releases or stable backports
14:16:24 anything needing to be discussed here this week?
14:16:29 the Mitaka releases are cut
14:16:36 dprince: are you planning to publish a "mitaka" release around the GA of the coordinated release?
14:16:40 branches are cut, rather
14:17:02 yeah, there have been some key fixes to land on mitaka after the branch was cut
14:17:07 Yeah, we've cut the branches; I wasn't sure if we were going to attempt to advertise the released versions via openstack/releases
14:17:17 shardy: sure, we can.
14:17:24 there was some discussion re the generated docs, e.g. for independent projects, on openstack-dev
14:17:29 thinking of the mariadb10 compatibility patch for tht, though I think there are others
14:17:35 in the context of kolla, but the same applies to us atm
14:17:47 http://releases.openstack.org/independent.html
14:18:01 shardy: thinking about releases, EmilienM made me use reno for one of my puppet patches.
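
For context, a minimal sketch of the reno workflow mentioned here; the note slug is hypothetical, and a project also needs the usual releasenotes sphinx/tox scaffolding before the notes are published:

    # install reno and create a release note stub in the repo root
    pip install reno
    reno new bump-pingtest-timeout    # writes releasenotes/notes/bump-pingtest-timeout-<random>.yaml
    # edit the generated YAML (features/upgrade/fixes/etc. sections), then preview the report:
    reno report .
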
14:18:18 It'd be good if we had a TripleO mitaka "release" on there, even if it's just the various repos
14:18:25 shardy: we should consider adopting it, as it would really help show what went into the release that is being cut
14:18:37 dprince: Yeah, agreed
14:18:40 +1 for reno
14:18:49 ironic has been using it very successfully
14:18:49 we need reno and probably either spec-lite bugs or blueprints
14:19:06 shardy: if we adopt it early in Newton that would be ideal
14:19:08 so we can track both the deliverables/completed stuff and the roadmap
14:19:55 shardy: getting the whole core team to buy into using it is important though, as we'd need to enforce it during reviews
14:20:01 but it is fairly painless I think
14:20:09 dprince: +1 - I'll try to take a first cut at describing the process and ping the list for feedback
14:20:26 shardy: sounds good :)
14:20:33 agreed, it shouldn't be much work at all for folks, but we need to get buy-in for sure
14:21:29 okay, any other stable branch topics? I think we can revisit the "when to tag" topic again next week
14:21:39 dprince: do we need a non-reno release notes document for Mitaka?
14:21:46 e.g. a wiki or a page in the TripleO docs?
14:22:01 shardy: so I sent out a wiki patch to the list a month ago and asked people to fill it in
14:22:09 shardy: that could be a start...
14:22:21 dprince: ack, I need to revisit that ;)
14:22:52 https://etherpad.openstack.org/p/tripleo-mitaka
14:23:03 #link https://etherpad.openstack.org/p/tripleo-mitaka
14:23:06 thanks jistr
14:23:21 cool, I was looking for that as well but email was slow
14:23:25 okay, let's move on
14:23:28 #topic CI
14:24:07 I started working on some aggregate reporting we could integrate into the CI status page
14:24:12 Two topics to mention: capacity (or lack thereof), and the state of the periodic job and promotion of tripleo-common
14:24:17 http://chunk.io/f/9b65dfaa09dd415d97859ea16bc117a2 is a really raw form of the output
14:24:51 I wanted to mention that I started to work on the liberty->mitaka upgrade; only the UC upgrade is passing atm
14:25:01 hrybacki: ^
14:25:07 trown: from RDO testing, do you have any list we can use as a basis for tripleo-current promotion blockers atm?
14:25:12 trown: cool. I wanted some general wall time metric graphs too https://review.openstack.org/#/c/291393/
14:25:15 o/
14:25:27 trown: if I can get that patch landed and start getting data then we could have graphs...
14:25:40 shardy: RDO does not have a promote job set up for newton yet
14:25:53 trown: Ok, thanks
14:26:20 I think the report I linked above shows we have some pretty serious issues with the upgrades and ha jobs
14:26:28 trown: what's the conclusion from your data? everything fails the majority of the time? :)
14:26:50 Yeah, I just wondered if anyone had made any progress on sifting through them yet
14:26:54 not that anyone didn't know that, but conditioning the success rate on the nonha job gives a pretty good approximation of the real pass rate
14:26:57 trown: yeah, the incoming patches from shardy and slagle should fix a lot of those this week, we think
14:27:27 Does anyone have any thoughts on how we improve the promotion latency in future?
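
A rough sketch of the "condition on the nonha job" idea described above, assuming recent job results were exported to a CSV; the file format and short job names are made up, and the real report linked above is generated differently:

    # results.csv: change_id,job_name,result -- hypothetical export of recent runs.
    # Pass 1 records changes where the nonha job passed; pass 2 measures the ha job
    # only on those changes, approximating the "real" ha pass rate.
    awk -F, 'NR==FNR { if ($2 == "nonha" && $3 == "SUCCESS") ok[$1] = 1; next }
             $2 == "ha" && ($1 in ok) { total++; if ($3 == "SUCCESS") passed++ }
             END { if (total) printf "ha pass rate given nonha passed: %.1f%%\n", 100 * passed / total }' \
        results.csv results.csv
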
14:27:30 dprince: yeah, it would be nice to have some metrics to back that up
14:27:35 trown: yeah, our existing CI reports already give you a feel for this
14:27:47 atm it appears we need a ton of folks working on fixes every time the periodic job breaks
14:28:00 The problem is that transient bugs will continue to creep in; until we have somebody keeping on top of it all the time, we'll keep getting back to a bad state
14:28:00 but I get the impression we don't even have the feedback for folks to know it's failed atm
14:28:07 trown: I wanted wall times because jobs that take hours are unacceptable. And as we tune things I wanted to see clear improvements
14:28:13 caching, etc.
14:28:22 dprince: it is pretty fuzzy though... there is a big difference between 20% success and 60% success, but I don't know that I can look at the status page and see that
14:28:24 derekh: would a nag-bot in #tripleo help at least raise visibility?
14:29:08 shardy: that would raise visibility; I don't know that it would encourage anyone to do anything about it
14:29:31 I think in addition to the overall statistics we also need per-reason statistics, like: timeouts - 20%, overcloud failures - 40%, infrastructure - 30%, etc.
14:29:31 slagle: I guess that's my question: how can we, and does anyone have time to help?
14:29:32 from experience, it requires a ton of work
14:29:43 shardy: we should aim for a success rate of 80%+ on all the jobs, and a nag bot that starts nagging if we drop below that might work
14:29:44 shardy: we could just have someone (say bnemec) go and -2 everyone's patches if CI is broken
14:29:49 shardy: that would get attention, I think
14:29:55 Hey now.
14:30:07 like, it's really bad that we fixed a heat issue ~2 weeks ago after it was reported by folks here, yet it's still biting us now :(
14:30:09 bnemec: I choose you randomly, sir :)
14:30:10 It's as if you think I'm a -2 bot. :-P
14:30:22 bnemec: but you do have the highest -2 count too, I think
14:30:58 shardy: yup, it's been 2 weeks since we last promoted; we need to be spending time keeping that below 2 days
14:31:19 I do think a nag bot that just says 'stop rechecking' if we fall below a certain success rate is a good idea
14:31:22 I guess if we can't keep the latency for promotion very low, then we have to figure out reinstating the temprevert functionality
14:31:34 as otherwise we've got no way to consume e.g. reverts from heat or wherever
14:31:39 though 80% is quite ambitious given the status quo
14:32:12 derekh: A lot of our failures appear capacity/performance related - what's the current plan re upgrading the hardware?
14:32:51 Or is it a question of reducing coverage to align with capacity?
14:33:11 trown: back in the day, when we used to keep on top of this, I used to start looking for problems if it fell below 90% (although admittedly it never stayed above 90% for long)
14:33:13 if we do that, I guess we'll need to rely on third-party CI for some tests
14:33:59 shardy: we're in the process of getting more memory for the machines, but I wouldn't expect it until June-ish
14:34:04 derekh: also, I see we have a bunch of incremental patches to cache artifacts... what if we just built an undercloud.qcow2 using tripleo-quickstart (with upstream repo setup) in the periodic job?
14:34:12 that would dramatically lower wall time
14:34:39 and could be completed much faster than redoing all of that work incrementally
14:34:40 derekh: the best effort is to go after the caching. Speeding up the test time helps us on all fronts, I think
14:35:13 it also solves a problem with the approved spec... how do I document using tripleo-quickstart if the only image is in RDO?
14:35:21 derekh: i.e. you are already working on the most important things, I think
14:35:51 keep in mind that as we keep bumping up timeouts, the job times are going to go up
14:35:56 hopefully they will pass though :)
14:36:09 trown: will take a look at tripleo-quickstart and see what could be done
14:36:14 right... band-aids on band-aids
14:36:15 derekh: regarding adding the extra memory... I'm not sure that helps all cases, as some things seem to be CPU bound
14:36:18 Cutting down image build time may actually make the CPU situation worse, though.
14:36:50 slagle: that is why, until we get some caching in place to significantly reduce the wall time, I think we might need to disable the delete and/or extra ping tests
14:36:51 slagle: that's the problem though, we're bumping individual timeouts, only to run into the infra global timeout
14:36:55 Not saying we shouldn't do it, but it may not be the silver bullet that makes everything pass all the time again.
14:36:55 bnemec: I see where you are going, but less wall time means less load overall
14:37:03 +1 on caching. Jobs that don't test a DIB or image-elements change don't need to build their own images IIUC. That would save a lot of wall time.
14:37:07 shardy: yeah, indeed
14:37:10 dprince: yup, we seem to be CPU bound, but we haven't always been; I think something has happened at some point demanding more CPU in general
14:37:21 bnemec: if it doesn't work, at least we'd know more quickly
14:37:31 derekh: did the disabling of turbo mode help with the throttling at all?
14:37:39 I haven't checked the nodes myself
14:38:03 derekh: both the HA and the upgrades jobs are running 3 controllers w/ pacemaker, right?
14:38:12 slagle: I checked yesterday and was still seeing throttling warnings
14:38:14 derekh: more HA controller nodes in the mix now...
14:38:35 dprince: yup
14:38:36 hmm... do we need both of those jobs then?
14:39:32 just regarding the differences between the two -- one is ipv4, the other is ipv6, I think. And the upgrades job will eventually really test upgrades.
14:39:39 What about a one-node pacemaker cluster for the upgrades job?
14:39:43 trown: we could consider cutting one of them.
14:40:07 the upgrades job has 3 nodes in total, HA has 4
14:40:17 dprince: we are in a pretty bad spot atm, so I think we should consider drastic measures
14:40:18 trown: and swap in a periodic job, but it'd probably fail a lot more then
14:40:33 trown: I'm not arguing :/
14:40:48 :)
14:41:04 trown: we've overstepped our bounds for sure. Even thinking about adding more to CI at this point (like tempest) is not where we are at
14:41:09 before we go zapping jobs and moving stuff around, I'd love it if we could first find out whether something in particular is hogging the resources
14:41:41 derekh: agreed, we are sort of poking about with our eyes closed here
14:41:42 +1
14:42:41 #action derekh is going to turn the lights on in the CI cloud so we can see what is happening
14:42:44 :)
14:43:01 looks like upgrades is 1 node
14:43:05 dprince: thanks
14:43:13 *1-node pacemaker
14:44:00 derekh: so are you suggesting we switch upgrades to 1 node?
14:44:29 dprince: Nope, I was just pointing out that it already is a one-node pacemaker cluster
14:44:40 by the way, just a random idea, I'm not sure if it's done or not, but maybe before the actual installation and tests are even started, there could be a quick sanity check for the environment?
14:44:43 dprince: bnemec suggested we try it
14:44:53 derekh: oh, I see.
14:45:04 derekh: I had forgotten that as well
14:45:08 sorry if that's not what you are talking about
14:45:44 Must be a good idea if we're already doing it. :-)
14:45:47 we have 2 more topics, though it seems we could have a weekly CI meeting that is an hour on its own
14:46:00 rdopiera: the environments are functionally fine, just under load
14:46:07 rdopiera: I'm not sure what we would sanity test
14:46:10 rdopiera: or so we think
14:46:11 not an awful idea, really... to have a CI subteam that just reports back to this meeting
14:46:22 dprince: I see, forget it then
14:46:57 dprince: my thinking was that then you would have fewer jobs even starting -- because the ones that would fail anyway wouldn't even start
14:46:58 trown: right, we can cut it off soon. I give CI the bulk of the time I think because it is blocking all of us. And we should all probably look at it more
14:47:31 dprince: yep, I agree, and I think it probably needs even more than half an hour a week
14:47:33 rdopiera: yeah, well, we could just scale back the number of concurrent jobs we allow to be running
14:47:44 +1 for a separate CI meeting
14:48:16 trown: I think this would be their brief: 1. the periodic job failed last night, find out why and fix it; 2. bug X is causing a lot of failures, find out why and fix it; 3. find places we're wasting time and make things run faster
14:48:48 derekh: yep, that is a good start
14:50:04 okay, let's move on to specs then. We can continue the CI talk in #tripleo afterwards...
14:50:11 #topic Specs
14:50:45 still very close to landing https://review.openstack.org/#/c/280407/
14:51:34 dprince: I'm +2 but it'd be good to see the nits mentioned fixed
14:51:50 shardy: yep, not rushing it or anything
14:52:25 any other specs this week?
14:52:56 slagle, shardy: do you guys want me to abandon the composable services spec? Or make another pass at updating it again?
14:53:52 dprince: Your call, I'm happy to focus on the patches at this point
14:54:11 shardy: yeah, that is what I've been doing anyway.
14:54:15 yeah, I'm fine either way
14:54:32 I would like to see us agree to get update testing of those patches in place in CI first
14:54:46 but I don't know about the feasibility of that given all the CI issues
14:55:11 so if specs aren't helpful (and it seems like this one provoked a lot of questions)
14:55:16 I feel these patches have the potential to break stack-updates, and we won't know until we try towards the end of the cycle
14:55:18 then we don't have to do them
14:55:22 then we're faced with a mound of problems
14:55:32 but having a spec to point to for people asking about high-level ideas is useful
14:56:05 For this particular thing... the documentation can go in tripleo-docs, however, because it creates a sort of "interface" that service developers can plug into.
14:56:08 I think a spec for something as big as composable roles makes a lot of sense
14:56:09 so I can document it there
14:56:33 slagle: agreed - hopefully we can get the upgrades job providing some of that coverage asap
14:56:36 trown: the issue with specs is that people want to get into implementation details, and the core team doesn't (often) agree on those :/
14:56:50 trown: and a review is a better place to hash out implementation details...
14:57:01 anyway.
14:57:11 #topic open discussion
14:57:24 shardy: do you want to mention the summit sessions etherpad?
14:57:37 Yeah, can folks please add summit topics here:
14:57:39 https://etherpad.openstack.org/p/newton-tripleo-sessions
14:57:54 we have 5 sessions in total, plus a half-day meetup
14:58:22 we need to decide the top 5 topics, and I'll report those so the official agenda can be updated
14:58:30 shardy: looks almost like what we talked about last time :)
14:58:37 Yeah :(
14:59:08 I'll ping the ML too; I'd suggest we work out consensus via the ML given we always run out of time in these meetings
14:59:20 yep, sounds good
14:59:29 for Newton I'd like to consider fewer repeated agenda items to make this a better sync point
14:59:37 thanks everyone, and let's continue the CI discussion in #tripleo to get it fixed ASAP
15:00:06 #endmeeting