16:02:05 #startmeeting cinder
16:02:06 Meeting started Wed Oct 2 16:02:05 2013 UTC and is due to finish in 60 minutes. The chair is jgriffith. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:09 The meeting name has been set to 'cinder'
16:02:14 Hey ho everyone
16:02:17 \o
16:02:20 hello
16:02:21 hello
16:02:24 Heyoooo!
16:02:25 hey
16:02:25 hello
16:02:28 hey
16:02:29 hey
16:02:39 Hi all
16:02:42 hello all
16:02:46 hi
16:02:51 hi
16:03:03 gooday
16:03:09 .o/
16:03:10 o/
16:03:20 What a crowd.
16:03:29 DuncanT: you've got a number of things on the agenda, you want to start?
16:04:03 DuncanT: you about?
16:04:09 jgriffith: Once I'd gone through them all, most of them ended up being fix committed. I can only find 2 taskflow bugs though
16:04:15 And last week suggested 3
16:04:27 did you log a bug/bugs?
16:04:52 jgriffith: These are all from last week's summary
16:05:06 DuncanT: which *These*
16:05:11 You mean the white-list topic?
16:05:19 #topic TaskFlow
16:05:20 Yeah
16:05:24 Ok..
16:05:31 so we had two bugs that are in flight
16:05:37 I've asked everybody to please review
16:05:53 as for the white-list issue, a number of people objected to reversing that
16:06:00 which reviews?
16:06:08 i just put a -0 on 49103, but i think it's ok
16:06:31 hemna_: go to https://launchpad.net/cinder/+milestone/havana-rc1
16:06:38 thnx
16:06:48 hemna_: anything that's "In Progress" needs a review if it's not in flight
16:07:30 All four seem to be in flight now
16:07:31 hemna_: There's actually only like 3 patches that I'm waiting on, one of them is yours :)
16:07:46 DuncanT: Oh yeah!!
16:07:47 I need your iscsi patch to land
16:07:51 My cry for help worked
16:08:02 :-)
16:08:02 +1
16:08:21 then I'll refactor mine (iser) to remove the volumes_dir conf entry
16:08:27 as it's a dupe
16:08:32 in both our patches
16:08:39 hemna_: k.. if you need to you can cherry pick and make a dep
16:08:48 hemna_: but hopefully gates are moving along still this morning
16:08:55 don't jinx it...
16:09:06 eeesssh... yeah, sorry :(
16:09:25 * jungleboyj is knocking on wood.
16:09:28 DuncanT: what else on TaskFlow did you have (think we got side-tracked)
16:09:50 jgriffith: My only question is that last week's summary said 3 bugs, and I could only find 2
16:10:09 If there are no more real bugs, I'll stop worrying
16:10:37 DuncanT: well, for H I *hope* we're good
16:10:50 DuncanT: For Icehouse I think we have some work to do
16:11:08 ie the white-list versus black-list debate :)
16:11:34 Sure. Hopefully somebody can take that debate to the summit?
16:11:36 jgriffith: i don't know if you want to discuss this now, but i was wondering what the policy would be for new features in Icehouse - taskflow only?
16:11:50 #topic Icehouse
16:12:04 avishay: not sure what you mean?
16:12:16 I hope that taskflow isn't the only thing we work on in I :)
16:12:23 the policy for new features? we add them no?
16:12:24 jgriffith: if i'm submitting retype for example, should it use taskflow?
16:12:30 although that seems to be everybody's interest lately
16:12:36 jgriffith: not me
16:12:38 avishay: OHHHH... excellent question!
16:12:39 api all the way
16:12:40 :P
16:12:43 thingee: :)
16:12:58 I think that favoring new features via taskflow would be a great idea.
16:13:03 avishay: TBH I'm not sure how I feel about that yet
16:13:21 avishay, so that kinda begs the question about taskflow, are we propagating it to all of the driver apis?
16:13:22 jgriffith: The goal is to eventually get everything there, right?
16:13:23 caitlin_56: perhaps, but perhaps not
16:13:27 I hope I'll have time to convert migration and retype to use taskflow for Icehouse, but can't promise
16:13:42 We shouldn't force things to be taskflows that aren't naturally.
16:13:49 TBH I wanted to have some discussions about taskflow at the summit
16:14:00 jgriffith, ok cool, same here.
16:14:03 I'd like to get a better picture of benefits etc and where it's going and when
16:14:05 hemna: i think for something simple like extend volume we don't need it, but for more complex things it could be a good idea
16:14:27 summit discussions are good
16:14:27 avishay, well I think there could be a case made for even the simple ones.
16:14:31 avishay: I think you're right, the trick is that "some here, some there" is a bit awkward
16:14:35 anyway, something to think about until hong kong
16:14:40 I'd certainly like a chance to discuss some of the weaknesses of the current taskflow implementation
16:14:50 avishay: yeah, so long as you don't mind the wait
16:15:04 I was kind of hoping that taskflowing most things would lead to safe restart of cinder and all of its services.
16:15:04 Ok, I think we all seem to agree here
16:15:16 a la safe shutdown/resume
16:15:17 hemna_: I think it will, that's the point
16:15:27 coolio
16:15:34 we need to get more educated and help harlow :)
16:15:38 yah
16:15:47 I'd also like to find out more about community uptake
16:15:49 anyway...
16:15:50 yep
16:15:50 already a session for what's next in taskflow: http://summit.openstack.org/cfp/details/117
16:15:55 I already have a long list of my wants for I :P
16:16:02 I've been working with harlow already.
16:16:11 I think we're still going that direction, we just need to organize. We don't want another Brick debacle :)
16:16:12 kmartin: nice!
16:16:19 hey now
16:16:27 hemna_: that was directed at ME
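
(For context on the TaskFlow discussion above: a minimal sketch of how a Cinder-style operation can be expressed as a flow of revertible tasks. The task names and the two-step "create volume" example are illustrative only, not Cinder's actual flows.)

    # Minimal taskflow sketch (illustrative names, not Cinder's real create_volume flow).
    # Each step is a Task with execute() and an optional revert(); if a later step
    # fails, the engine unwinds the steps that already ran.
    from taskflow import engines, task
    from taskflow.patterns import linear_flow


    class ReserveQuotaTask(task.Task):
        def execute(self, size):
            print("reserving %d GB of quota" % size)

        def revert(self, size, **kwargs):
            print("rolling back quota reservation for %d GB" % size)


    class CreateVolumeTask(task.Task):
        default_provides = "volume_id"

        def execute(self, size):
            print("creating %d GB volume" % size)
            return "vol-0001"


    # A linear flow runs the tasks in order; on failure the engine calls revert()
    # on everything that already completed instead of relying on ad hoc cleanup.
    flow = linear_flow.Flow("create-volume-sketch").add(
        ReserveQuotaTask(),
        CreateVolumeTask(),
    )

    results = engines.run(flow, store={"size": 10})
    print(results["volume_id"])

Driving operations through an engine like this is what makes the "safe shutdown/resume" idea mentioned above plausible: the engine, rather than hand-written cleanup code, knows which steps completed and how to undo or resume them.
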
16:16:51 #topic quota-syncing
16:17:02 DuncanT: you're correct, that's still hanging out there
16:17:25 DuncanT: I looked at it a bit but quite frankly I ran away screaming
16:17:41 jgriffith: It made my head hurt too
16:17:57 I'd like to just drop quotas altogether :)
16:18:05 ha
16:18:08 ;)
16:18:17 quota syncing?
16:18:24 guitarzan: yes
16:18:25 guitarzan: https://bugs.launchpad.net/cinder/+bug/1202896
16:18:27 Launchpad bug 1202896 in nova "quota_usage data constantly out of sync" [High,Confirmed]
16:18:33 ahh
16:19:02 every time I mess with quotas I want to die, but...
16:19:16 I also think that there are just fundamental issues with the design
16:19:20 No quotas are better than quotas enforced at the wrong locations.
16:19:29 Might be something worth looking at for I???
16:19:54 Don't all volunteer at once now!
16:20:02 i would seriously consider the suggestion in that bug to replace the usage table w/ a view if possible
16:20:29 that's an interesting idea, but it doesn't really tell you if the resource is being used or not
16:20:37 especially in the error cases
16:20:45 I'm not sure that scales with large numbers of volumes and users, unfortunately
16:20:50 DuncanT: +1
16:20:58 I think scale is the big concern with that
16:21:07 guitarzan: I agree. We need definitions that deal with real resource usage. Otherwise we're enforcing phony quotas.
16:21:08 However I think we could do something creative there
16:21:13 DB caching etc
16:21:36 anyway... I don't think we're going to solve it here in the next 40 minutes :)
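
(A rough illustration of the "replace the usage table with a view" suggestion from the bug: usage can be derived from the volumes table itself rather than tracked in quota_usages counters. The table and column names mirror Cinder's schema but the query is only a sketch, and, as noted above, the open question is whether this scales.)

    # Sketch of deriving quota usage instead of maintaining counters.
    # Assumes a local SQLite copy of the cinder DB; column names are based on
    # Cinder's volumes table but this is illustrative, not production code.
    import sqlite3

    conn = sqlite3.connect("cinder.sqlite")
    rows = conn.execute(
        """
        SELECT project_id,
               COUNT(id)              AS volumes_in_use,
               COALESCE(SUM(size), 0) AS gigabytes_in_use
        FROM volumes
        WHERE deleted = 0
        GROUP BY project_id
        """
    )
    for project_id, volume_count, gigabytes in rows:
        print(project_id, volume_count, gigabytes)
    conn.close()

A maintained counter can drift whenever an error path forgets to commit or roll back a reservation; a derived query cannot drift, but it also cannot easily distinguish states like error or deleting, which is the concern about error cases raised above.
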
16:21:40 I attempted to write a tool that checked the current quota looked valid, and ran it periodically while doing many ops in a tight loop, but couldn't provoke the out-of-sync issue
16:22:04 DuncanT: I think you just have to get something in an error state so you can delete it multiple times
16:22:23 guitarzan: maybe we should focus on fixing that instead?
16:22:26 guitarzan: Ah, ok, that I can provoke
16:22:28 go about it the other way
16:22:44 jgriffith: I think that's totally fixable
16:22:50 did somebody say State Machine (again)
16:22:57 guitarzan: Is there a specific bug for that scenario? Sounds like low hanging fruit....
16:23:01 I wasn't going to say that :)
16:23:10 DuncanT: I don't know, I'm just reading the bug
16:23:17 :)
16:23:34 I have been able to mess up quotas before, negative values
16:23:36 DuncanT: there is not, and it's not as low hanging as one would hope IMO
16:23:46 Bugger
16:23:59 There's a number of little *holes* that we can run into
16:24:14 anyway... quotas aside, those are things that I'd really like to see us work on for I
16:24:28 exceptions and exception handling fall in that category
16:24:36 Hmmm, I'm wondering if a runtime fault injection framework might make reproducing these issues easier?
16:24:37 jgriffith: +1
16:24:38 again, state machine
16:24:40 having a better picture of what happened back up at the manager
16:24:43 avishay: :)
16:24:49 I have seen several issues with deleting.
16:25:03 Also think the issue of exceptions goes along with the taskflow issue. :-)
16:25:06 DuncanT: perhaps, but you can also just pick random points in a driver and raise some exception
16:25:10 that works really well :)
16:25:16 jgriffith: ended up just writing something to correct the quota that we use internally
16:25:22 DuncanT: infectedmonkeypatch?
16:25:27 thingee: Oh?
16:25:35 jgriffith: that's just a bandaid fix though
16:25:37 med: Sounds promising. I'll have a google
16:25:59 thingee: might be something to pursue if DH is interested in sharing
16:26:02 * med_ made that up so google will likely fail miserably
16:26:07 thingee: if nothing else experience
16:26:12 jgriffith: I was pretty much thinking of formalising that approach so we can test it reproducibly
16:26:19 the *experience* you guys have would be helpful
16:26:47 DuncanT: Got ya.. if we just implement a State Machine it's covered :)
16:26:47 jgriffith: I think it just wasn't put upstream because it was a bad hack. but yeah we can take ideas from that.
16:26:51 Just sayin
16:27:02 * jgriffith promises to not say *State Machine* again
16:27:13 thingee: coolness
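
(The "pick random points in a driver and raise some exception" idea above is easy to prototype as a small fault-injection wrapper. The sketch below monkeypatches an arbitrary driver method so it fails some fraction of the time; the names are hypothetical and this is not an existing framework.)

    # Hedged sketch of runtime fault injection for reproducing error-path bugs
    # (for example quota drift on repeated failed deletes). FakeDriver stands
    # in for a real Cinder volume driver.
    import functools
    import random


    def inject_faults(cls, method_name, failure_rate=0.3):
        """Wrap cls.method_name so it randomly raises instead of running."""
        original = getattr(cls, method_name)

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError("injected fault in %s" % method_name)
            return original(*args, **kwargs)

        setattr(cls, method_name, wrapper)


    class FakeDriver(object):
        def delete_volume(self, volume_id):
            print("deleted %s" % volume_id)


    inject_faults(FakeDriver, "delete_volume", failure_rate=0.5)

    driver = FakeDriver()
    for i in range(5):
        try:
            driver.delete_volume("vol-%d" % i)
        except RuntimeError as exc:
            # The caller's cleanup and quota handling is what's under test here.
            print(exc)
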
16:27:49 okie dokie
16:28:05 DuncanT: what else you got for us?
16:28:13 I'm all out I think
16:28:15 * jgriffith keeps putting DuncanT on the spot
16:28:34 Most of my stuff is summit stuff now
16:28:41 Ok, I just wanted to catch folks up on the gating disaster over the past few days
16:28:47 #topic gating issues
16:28:50 ugh
16:28:57 so I'm sure you all noticed jobs failing
16:29:02 jgriffith, jenkins puked on https://review.openstack.org/#/c/48528/
16:29:03 wheeee!
16:29:08 but not sure how many people kept updated on what was going on
16:29:16 When were jobs failing?
16:29:34 There were a number of intermittent failures that were in various projects
16:29:47 was mainly broken neutron gate test no?
16:29:50 I also think that some bugs in projects exposed bugs in other projects etc etc
16:30:00 hemna: That looks like a straight merge failure, manual rebase should sort it
16:30:02 dosaboy: no
16:30:08 dosaboy: it was really a mixed bag
16:30:13 ack
16:30:19 Cinder, neutron, nova, keystone...
16:30:20 Apparently the one in Neutron was one that had been there for some time but it was a timing thing that was suddenly uncovered.
16:30:30 jungleboyj: +1
16:30:40 So anyway....
16:31:04 things are stabilizing a bit, but here's the critical takeaway for right now
16:31:20 the 'recheck bug xxx' comment is CRITICAL to track this stuff
16:31:48 and even though the elastic search recommendation that pops up is sometimes pretty good, other times it's wayyy off
16:32:08 we really need to make sure we take a good look at the recheck bugs and see if something fits, and if not log a new bug
16:32:20 if you don't know which project to log it against, log it against tempest for now
16:32:39 best way to create these is to use the failing test's *name* as the bug title
16:32:51 this makes it easier for people that encounter it later to identify
16:32:55 jgriffith: +1
16:33:11 so like "TestBlahBlah.test_some_stupid_thing Fails"
16:33:25 also, if something is already approved, make sure to do 'reverify bug xxx' and not recheck
16:33:30 include a link to the gate/log pages
16:33:37 avishay: +1
16:33:57 Yeah, sorry for the ones I rechecked before I learned that tidbit.
16:34:00 sucks when jenkins finally passes and you need to send it through again :)
16:34:14 also take a look at this: http://paste.openstack.org/show/47798
16:34:37 particularly the last one
16:34:44 Failed 385 times!!!
16:34:48 that's crazy stuff
16:34:58 ouch.
16:34:59 I wasn't even aware of it until it was at 300
16:35:03 doh
16:35:25 BTW that wasn't the worst one :)
16:35:28 anyway...
16:35:49 I did some queries last night on those and updated when last seen etc
16:36:23 that big one 1226337 pretty much died out a few days ago after the fix I put in (break out of the retry loop)
16:36:36 but it still hits occasionally
16:36:49 It's an issue with tgtd IMO
16:36:57 It's not as robust as one might like
16:37:12 so the follow up is a recovery attempt to create the backing lun explicitly
16:37:15 anyway...
16:37:26 the other item: 1223469
16:37:47 I wanted to point that out because I made a change that does a recovery but still logs the error message in the logs
16:38:12 this seemed like a good idea at the time, but the query writers grabbed on to that and still query on it
16:38:28 even though it recovers and doesn't fail it still gets dinged in the queries
16:38:49 so I think I should change it to a warning and change the wording to throw them off the scent :)
16:39:10 but I wanted to go through these to try and keep everybody informed of what was going on
16:39:30 I spent most of the last 3 VERY long days monitoring gates and poring over logs
16:39:42 cool, thanks for the update and the work!
16:40:07 Hoping that if/when we hit this sort of thing again we'll have a whole team working on it :)
16:40:12 avishay: +2
16:40:14 Ok, that's all I have...
16:40:18 anybody else?
16:40:23 #topic open-discussion
16:41:01 drinks are on the house!
16:41:01 going twice....
16:41:13 dosaboy: Your house? Ok, I'm in :)
16:41:17 :)
16:41:23 Yay!
16:41:25 going three times...
16:41:31 #endmeeting
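
(For the two gate bugs discussed above, 1226337 and 1223469, the gist of the fixes described was: stop retrying blindly, attempt one explicit recovery of the backing LUN, and log the recovered case at WARNING instead of ERROR so elastic-recheck style log queries don't count a recovered run as a failure. A loose sketch of that pattern with stand-in callables, not the actual Cinder patch:)

    # Loose sketch of the retry/recover/log-level pattern described above.
    # create_lun and verify_lun are stand-ins for the real tgtadm calls.
    import logging
    import time

    LOG = logging.getLogger("tgt-sketch")


    def ensure_backing_lun(create_lun, verify_lun, attempts=3, delay=1):
        for attempt in range(1, attempts + 1):
            create_lun()
            if verify_lun():
                if attempt > 1:
                    # Recovered: a WARNING, not an ERROR, so log-scraping
                    # failure queries don't flag a run that actually succeeded.
                    LOG.warning("backing LUN created on attempt %d", attempt)
                return
            time.sleep(delay)
        # Only the genuinely unrecovered case is an ERROR.
        LOG.error("backing LUN still missing after %d attempts", attempts)
        raise RuntimeError("backing LUN creation failed")


    # Trivial usage: succeeds on the first try, so nothing is logged.
    ensure_backing_lun(lambda: None, lambda: True)
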