15:59:51 #startmeeting cinder
15:59:52 Meeting started Wed Apr 16 15:59:51 2014 UTC and is due to finish in 60 minutes. The chair is jgriffith. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:59:54 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:59:57 The meeting name has been set to 'cinder'
16:00:15 Hey everyone
16:00:29 hello
16:00:33 o/
16:00:36 Just wanted to do a quick sync up with folks today
16:00:39 hi
16:00:43 https://wiki.openstack.org/wiki/CinderMeetings
16:00:56 #topic Release Status
16:01:15 We cut another RC yesterday morning
16:01:30 At this point we should be done unless something REALLY critical pops up
16:01:46 akerr: did note a problem with Glance API V2
16:01:53 jgriffith: there was an issue that came up with create from image
16:02:01 thingee: :)
16:02:03 with regards to checksum missing
16:02:05 yes
16:02:11 same issue I just mentioned
16:02:16 It's Glance API V2
16:02:25 There are two bugs associated with that....
16:02:36 https://bugs.launchpad.net/cinder/+bug/1308594
16:02:37 should that be fixed, if it's easy?
16:02:37 Launchpad bug 1308594 in cinder "upload-to-image fails with size error on glance v2 api" [High,Confirmed]
16:02:46 https://bugs.launchpad.net/cinder/+bug/1308058
16:02:47 Launchpad bug 1308058 in cinder "Cannot create volume from glance image without checksum" [Undecided,New]
16:02:55 kmartin: the problem is timing
16:03:13 yeah, down to the wire
16:03:28 So spinning another RC and resetting the package maintainers again this late is not good
16:03:31 Also....
16:03:36 and since we can't guarantee the timing with the state of things, I say we revert and default to None.
16:03:39 My view is that this is a V2 Glance API thing
16:03:50 and we default to V1
16:04:08 My vote/view is document it in the Release notes as a Known Issue (which is done) and roll
16:04:10 agree...should note it
16:04:15 But it does have an impact on a new NetApp feature requiring v2.
16:04:22 glenng: yes, correct
16:04:24 which sucks
16:04:29 agreed
16:05:14 it's too bad that this was reported pretty early yesterday too
16:05:18 I talked to akerr and he stated that a backport could be relatively easy for NetApp customers
16:05:26 and not noticed
16:05:37 Not the end of the world; documenting would be okay.
16:05:46 thingee: well since we default to V1 only NetApp uses V2 right now
16:05:59 just saying, I
16:06:05 thingee: FYI even yesterday morning was a bit late
16:06:07 think the cut off wasn't done right
16:06:13 jgriffith: our feature is optional as well. As glenng says, not the end of the world
16:06:14 if there are unverified issues like this
16:06:18 thingee: we cut/shipped yesterday AM at about 8:00
16:06:26 I think we're ok
16:06:28 It was earlier than that
16:06:37 thingee: well ok...
16:06:41 thingee: and the point is?
16:06:58 it was unverified. there was a cut off
16:07:05 what if it was critical?
16:07:33 thingee: I'm not sure what you're looking for here? Is this criticism, or something else?
16:07:50 I'm just unhappy with the cut off decision
16:07:58 on new unverified issues
16:08:04 thingee: what specifically do you mean?
16:08:05 be present
16:08:11 thingee: the "cut off" decision?
16:08:12 can't make it more clear than that
16:08:27 thingee: You're unhappy with me cutting the RC yesterday?
16:09:03 thingee: maybe we should talk after the meeting
16:09:17 thingee: You seem to be very unhappy with how things have been going lately
16:09:21 maybe we can fix it
16:09:55 Ok... so back to our regularly scheduled program
16:10:25 #topic summit sessions update
16:10:35 Yahoo!
16:10:39 I'll make another pass on those shortly
16:10:49 Have some good proposals
16:11:04 It's not too late if you have items you want to propose, but you need to do it today
16:11:12 jgriffith: is the iSCSI and FC clean-up work in brick needed for a session?
16:11:19 could probably just unconf for interested parties
16:11:32 will not just brick
16:11:43 thingee: unconf sessions are no more
16:11:47 let me find the blog
16:11:59 bswartz: I can invite anyone to a bar with me to discuss it
16:12:01 thingee: We have slots
16:12:04 and then bring it up in the ML
16:12:09 ;)
16:12:17 thingee: we should propose it
16:12:21 IMO
16:12:24 *is interested*
16:12:30 oh a REAL unconf session! are you buying?
16:12:32 ok, I had second thoughts about how productive it would be
16:12:46 bswartz: HA, I'm on a budget nowadays
16:13:00 jgriffith: ok I'll propose it
16:13:05 thingee: thanks
16:13:24 any questions/suggestions WRT summit sessions?
16:13:53 have we verified the people that have proposed these are going to be present, or have someone familiar with the subject present?
16:13:59 I don't want more david wang situations
16:14:01 haha
16:14:09 Who /is/ David Wang?
16:14:15 I've checked with each of them and they've "said" they'll be there
16:14:15 DuncanT-1: that's the new shirt
16:14:36 mornin
16:15:28 hemna: morning~
16:15:29 anything else from anyone?
16:15:44 I saw a couple of sessions marked incomplete
16:15:52 Does that mean they're out?
16:16:01 thingee, iSCSI/FC cleanup work?
16:16:01 DuncanT-1: nahh
16:16:10 hemna: yes, it's a mess
16:16:12 DuncanT-1: means they have a chance to come back with a more detailed focus
16:16:18 complicated code
16:16:32 jgriffith: how many slots do we have
16:16:38 DuncanT-1: but as the proposal stood there were concerns, or it wasn't clear what the objective was
16:16:47 thingee, ok fill me in offline
16:16:47 xyang1: 12
16:16:56 errr... 11
16:17:05 jgriffith: Ok, cool
16:17:24 jgriffith: do you know which days yet? Wed, Thu, Fri?
16:17:29 hemna: if you want to update multi-attach perhaps?
16:17:31 thingee, almost all of that is directly from nova. I have plans to refactor some of the initiator side, wrt multi-attach and rediscovery at detach time for iSCSI
16:17:39 xyang1: Friday last check
16:17:45 jgriffith, sure
16:17:46 xyang1: same as usual
16:18:12 multi-attach is coming along. I have the first set of patches as WIP in gerrit for nova, cinder, and cinderclient
16:18:21 hemna: awesome
16:18:23 but I'm working on changes to it as well as getting unit tests working
16:18:29 #topic open-discussion
16:18:42 * thingee added a topic last minute
16:18:58 I had to make a change to the current patches at detach time to pass in the attachment uuid instead of the instance uuid, because Cinder can attach to a host (no instance uuid in that case)
16:19:12 jgriffith: was asking about whether people are interested in immutable volume types, or in-place update for volumes when an admin makes changes to a type definition or type-QoS associations
16:19:16 thingee: I don't see it?
16:20:02 I came across an issue yesterday that I might need some help with
16:20:02 weird, I do
16:20:10 well it's 'cinder resource status'
16:20:15 * jgriffith refreshes
16:20:18 specifically the way we handle a status for an object
16:20:21 I am here now.
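
The detach change hemna mentions above (keying detach on the attachment uuid rather than the instance uuid) is easiest to see with a small illustration. This is a hypothetical sketch, not the actual Cinder or Nova API: the data shapes and the detach helper are invented, and only show why an instance uuid stops being a unique handle once a volume can have several attachments, including a host attachment with no instance at all.

# --- hypothetical illustration, not Cinder code ---
# With multi-attach, one volume can carry several attachments, and a host
# attachment has no instance uuid, so detach is keyed on an attachment id.

volume = {
    'id': 'vol-1',
    'status': 'in-use',
    'attachments': [
        {'attachment_id': 'att-1', 'instance_uuid': 'inst-a', 'host_name': None},
        {'attachment_id': 'att-2', 'instance_uuid': 'inst-b', 'host_name': None},
        {'attachment_id': 'att-3', 'instance_uuid': None, 'host_name': 'backup-host'},
    ],
}


def detach(vol, attachment_id):
    """Remove exactly one attachment, identified unambiguously by its id."""
    vol['attachments'] = [a for a in vol['attachments']
                          if a['attachment_id'] != attachment_id]
    if not vol['attachments']:
        vol['status'] = 'available'
    return vol


detach(volume, 'att-3')  # drops the host attachment without touching the instances
# --- end illustration ---
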
16:20:22 https://bugs.launchpad.net/cinder/+bug/1305550
16:20:23 Launchpad bug 1305550 in cinder "Failed retype with driver raised exception should set volume status to "error"" [Undecided,In progress]
16:20:42 this bug raised a thought that we make the status field too complicated
16:20:45 thingee: oh, topic to the agenda
16:20:49 not the summit
16:21:13 thingee: or not complicated enough?
16:21:19 #topic what to do on retype failure
16:21:37 I would like folks to think of cinder as trying to remove a lot of the intervention by ops and users.
16:21:38 winston-d: not complicated enough IMO
16:21:53 volume/manager.py _migrate_volume_generic has a call into nova to update_server_volume for an instance. since cinder can be attached to multiple instances, which one do I use in the nova_api.update_server_volume() call?
16:21:58 to be clear, I don't think it should be up to people to recover volumes if cinder can do it
16:22:03 or do I call it for every instance.
16:22:12 that's the only outstanding issue I have wrt multi-attach
16:22:13 I think the problem is bigger than just for three type. It seems like we should be able to provide the user with more information about a failure.
16:22:40 hemna: can we come back around to that
16:22:45 if something is in an error, just stop. it's done. and keep the status as *error*.
16:22:45 *retype
16:22:49 hemna: finish up the talk about retype first
16:22:52 so I'm going to try and get a second set of patches up in gerrit this week
16:22:55 jgriffith, ok
16:23:01 don't try to convey it with an 'it-failed-because-of-this-thing' status
16:23:17 have a separate field that explains why the status is 'error'
16:23:24 thingee: so the question however in that particular case is: is setting it to error status appropriate? I vote no
16:23:28 Fine with the separate field
16:23:39 jgriffith: ok, what do you gain from other statuses?
16:23:39 thingee: doesn't nova do something similar to that when an instance goes into error?
16:23:52 Can we do a separate field and still have the user be able to take actions, or will it only be the administrator?
16:23:54 akerr: +1 instance faults
16:24:01 DuncanT-1: I think thingee is saying the opposite
16:24:03 jgriffith: what should it do? Leave the volume at the old type?
16:24:15 my point is we should reserve error for: there is nothing cinder can do about it, nothing the user can do about it
16:24:17 DuncanT-1: well, I think so yes
16:24:18 it's up to ops
16:24:27 DuncanT-1: and the reason is because there's nothing "wrong" with the volume
16:24:35 jgriffith: That seems quite reasonable to me
16:24:41 and worse, the user doesn't have a mechanism to know what retypes are valid
16:24:45 so it's trial and error
16:24:53 jgriffith: so i agree with retype
16:24:57 and I think it's bad user experience to put it in error
16:25:07 and say "haha now you can't use your volume"
16:25:08 I'm saying in general, better conveying to the user what happened is what I'm advocating here
16:25:32 thingee: so that's another topic IMO
16:25:39 jgriffith: well, i think https://bugs.launchpad.net/cinder/+bug/1305550 here is more about something going wrong when retyping a volume
16:25:41 thingee: and we should propose sub-states
16:25:41 Launchpad bug 1305550 in cinder "Failed retype with driver raised exception should set volume status to "error"" [Undecided,In progress]
16:25:53 jgriffith: agreed. and if you look back to my original sentence, this topic brought on a thought for me
16:25:56 of this
16:26:10 winston-d: yeah, that one is different and that's no good
16:26:28 Thingee so you were saying we wouldn't put the volume in error?
16:26:31 jgriffith: we already have error_extending, but I'm not sure that's the best way to go
16:26:31 I think sub-states is also complicated. Again, error just means nothing can be done about it. Not cinder, not the user. Just ops
16:26:40 put that in a status description field
16:26:45 Just leave it as available with more information in the field?
16:26:53 make it so it's safe for users' eyes
16:26:58 akerr: yeah, it still blocks some things that look for "error_"
16:27:02 Nova has both 'task state' and 'instance fault'
16:27:06 ops can see a general idea from what the user sees and look at the logs for more information
16:27:11 winston-d: +1
16:27:13 we can at least have one
16:27:20 you need to be able to handle the case of multiple errors on one resource
16:27:21 or both
16:27:29 i don't want to only see info about the latest
16:27:54 ameade: so I talked about that in #openstack-cinder too
16:27:54 ameade: I don't agree with that but I think we're rat-holing a bit
16:28:07 ameade: use cases like multi-attaching?
16:28:20 the bottom line is right now we have ONE and only ONE method of conveying status
16:28:26 it seems that's not enough
16:28:36 jgriffith: +1
16:28:38 so we should at least start by implementing a task-state
16:28:41 ameade I think that is going further than we need right now.
16:28:44 and go from there
16:28:52 * thingee is talking to himself when he just brought up a second way of giving status
16:29:22 jgriffith sounds reasonable.
16:29:34 thingee: what did you want to say
16:29:39 thingee: floor is all yours
16:29:45 everybody listen to thingee
16:29:51 let me scroll back up and paste what I said earlier
16:30:04 * jungleboyj_ listens
16:30:24 This is also explained in the bug https://bugs.launchpad.net/cinder/+bug/1305550
16:30:25 Launchpad bug 1305550 in cinder "Failed retype with driver raised exception should set volume status to "error"" [Undecided,In progress]
16:30:46 thingee: ummm... sorry that doesn't help me
16:30:50 thingee: what did YOU say
16:30:51 (we should get that bot in openstack-cinder)
16:31:04 Reserve 'error' for when the resource is not recoverable by the user or cinder; it requires manual intervention by ops
16:31:07 jgriffith: sorry still typing
16:31:24 use a *second* field to give a description of the status
16:31:41 instead of 'it-failed-because-of-this-status' like we've been doing
16:32:22 thingee: sure
16:32:23 jgriffith I think he is pointing out that he wants to keep her for just the worst of situations.
16:32:31 thingee: as in my proposal in comment #2 of the bug
16:32:38 *error
16:32:52 thingee: so no 'error-extending' but just 'error' with a description field?
16:33:27 winston-d: i think not even "error" there because the volume is still usable, just not the new size
16:33:32 in order to promote better state setting, I would say instead of using the db api directly, we need some helper for setting state that would require things like the new state, e.g. available, error, in-use, and require a status description if it's something like an error state.
16:33:32 winston-d or available with a description field.
16:33:51 akerr: well, that depends.
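
As a rough sketch of the state-setting helper thingee describes just above, assuming invented names throughout (this is not an existing Cinder API): the helper refuses to set 'error' without a human-readable reason, and the reason lives in a separate field instead of being encoded into the status string itself.

# --- hypothetical illustration, not Cinder code ---
VALID_STATUSES = {'available', 'in-use', 'creating', 'deleting', 'error'}


def set_volume_status(volume, status, reason=None):
    """Update a volume's status; 'error' must always carry a reason."""
    if status not in VALID_STATUSES:
        raise ValueError('unknown status: %s' % status)
    if status == 'error' and not reason:
        # Reserve bare 'error' for unrecoverable situations, and force the
        # caller to say why, so ops have more to go on than a stack trace.
        raise ValueError("status 'error' requires a status reason")
    volume['status'] = status
    volume['status_reason'] = reason
    return volume


# A failed retype leaves the volume usable instead of flipping it to error:
vol = {'status': 'retyping', 'status_reason': None}
set_volume_status(vol, 'available', reason='retype failed: backend not supported')
# --- end illustration ---
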
16:33:58 jungleboyj_: ^^
16:34:01 thingee: yeah, we've been saying for a year we need defined and real states
16:34:50 i wish we could have the backend driver report some type of failure that actually doesn't hurt/touch the volume.
16:34:59 It seems like we would still have to add a state for the case where your command failed and you need to see the additional information.
16:35:03 jgriffith: I guess when I read comment #2 in that bug, I took it as another key being used for the sub-status, not a full text description.
16:35:06 winston-d: I agree
16:35:38 you could define something like a 'nonFatalError' exception that drivers could throw
16:35:41 winston-d +1
16:35:49 doesn't this fall under general state management of volume transactions? Wasn't taskflow supposed to help with this some?
16:36:07 akerr: yeah, but until we have that, an error could be an unrecoverable error
16:36:42 Ok... can I say something without hurting any feelings or pissing anybody off?
16:36:48 let's back up and focus a little
16:37:12 first: don't bring taskflow into the discussion; it doesn't do what we're talking about, regardless of whether that was a goal or not
16:37:24 Let's propose a summit session
16:37:41 First... let's agree on: adding a task-status entry
16:37:58 We can argue about verbosity, what it means, etc. later
16:38:03 jgriffith +1
16:38:31 At the same time, that means we have the opportunity to limit the status field we have today, as thingee pointed out
16:38:35 I think getting in a room together and talking about this is a good idea.
16:38:36 which I think is needed/good
16:38:48 There are a lot of opportunities here
16:38:55 Agreed.
16:39:15 but you can't throw in EVERYTHING all at once
16:39:35 does this sound reasonable to everyone?
16:39:44 are there any disagreements?
16:39:50 question
16:40:01 well I just think taskflow is relevant to the discussion of volume state. that's all.
16:40:14 what is the task-state accomplishing? what exactly does 'creating' currently mean, for example?
16:40:48 creating today is a status
16:40:54 correct
16:40:56 thingee it is telling the user what is happening that they can't see.
16:41:12 when you say task-state are you referring to the hypothetical yet-to-exist thing?
16:41:13 and it means you can't take other actions on the volume while it's in that status.
16:41:16 or something else?
16:41:17 ok, so again, it's explaining what the 'creating' status currently means.
16:41:20 more detailed
16:41:26 as an example
16:41:49 jgriffith: I'm referring to the 'task-state' that you just mentioned a few lines up
16:41:57 thanks
16:41:59 and NO
16:42:07 it's not to describe the status
16:42:15 I don't know what task-state means and I was just giving an example to understand.
16:42:17 it's not to describe what "attaching" means
16:42:24 thingee that is what I am thinking. That is what we need to talk about at the summit.
16:42:37 I'll try my proposal again....
16:42:42 For example:
16:42:49 You try to extend a volume
16:42:50 do you mean state=creating, task-state=in progress?
16:42:56 The volume/backend doesn't support extend
16:43:12 The volume is "fine", just not extended
16:43:23 DON'T put the volume in error status
16:43:39 Set a task-status of "extend-failed" or whatever
16:43:52 leave the volume as 'available' and the original size
16:44:01 Example 2:
16:44:09 retype from foo to baz
16:44:18 backend doesn't support baz, and migration is not enabled
16:44:28 DON'T set the volume to error status and make it unusable
16:44:41 Set the task-status to "error-retype" or whatever
16:44:49 Leave the status as "available"
16:45:01 thingee: is that clear?
16:45:05 yup
16:45:07 thingee: do I need another example?
16:45:10 jgriffith +2
16:45:10 shouldn't we include a tnx history, instead of just the last failure?
16:45:16 hemna: +1
16:45:17 txn
16:45:17 tnx?
16:45:25 sorry, I'm lazy....transaction
16:45:29 jgriffith +1
16:45:30 gotcha
16:45:36 hemna: maybe...
16:45:40 hemna one thing at a time.
16:45:42 hemna: 1. What would that be
16:45:48 hemna: 2. Do you need that in the first pass
16:45:55 hemna: 3. How do you manage it
16:45:56 hemna: like instance faults in Nova?
16:46:04 winston-d: yes
16:46:08 jgriffith: I think it's really important to consider this in the design now. If we change our mind later, it's going to be a pain to change on deployed
16:46:17 do we need it right now? I'd argue that yes, we could use it now :)
16:46:18 thingee: I'm not saying that it isn't
16:46:19 once deployed*
16:46:26 does it have to be done in the first pass? probably not.
16:46:29 hemna: Please answer the first question
16:46:33 maybe I'm getting ahead here again, but would we want a 3rd field with a more descriptive explanation of why the task failed?
16:46:34 hemna: 'what is it'
16:46:52 The last few states of that volume?
16:47:00 another table in the db that tracks transactions and their states/steps/failures
16:47:07 jungleboyj_: that's your interpretation... I want hemna's
16:47:25 hemna: pretty much what I thought too
16:47:35 hemna: for whose consumption?
16:47:40 the user
16:47:44 how does the user know there was an error at all (if the status isn't error)?
16:48:00 soo......that leads me to bring up taskflow again. Isn't there a built-in mechanism in taskflow that tracks the transaction state?
16:48:04 * hemna ducks
16:48:13 jgriffith, for admins
16:48:21 ameade: task status
16:48:30 hemna: Not really, no. There ought to be, but isn't
16:48:35 hemna: so this is why I'm asking "you"
16:48:42 winston-d: sure that could make sense maybe
16:48:46 hemna: you say admins, thingee says users
16:48:49 winston-d: I think the problem though is how do you know? say the task status already has a value
16:48:50 winston-d: that assumes the task-status would clear up after some time?
16:48:50 heh
16:48:52 how do you know it's new?
16:48:55 others may say "ops" etc
16:49:15 I dunno, I don't think users should need to see why retype failed, but admins do.
16:49:19 what if you get the same status? do you have to keep track of the old status to know a change has happened?
16:49:47 I really think that this is being made much more complex than it should be
16:49:55 akerr: well, clearing the task state/status doesn't help if you want to find out why the 'retype' request you invoked 3 days ago failed.
16:50:00 DuncanT-1, ok sounds like we should ping harlowja about adding it then.
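
A rough sketch of how the two examples above could look in code, combined with the 'nonFatalError' exception idea floated a little earlier in the discussion. Everything here is invented for illustration (the exception class, the field names, the driver interface); the point is only that a failure which doesn't harm the volume leaves the status at 'available' and is recorded in a task-status field instead.

# --- hypothetical illustration, not Cinder code ---
class NonFatalDriverError(Exception):
    """Raised by a driver when an operation fails without harming the volume."""

    def __init__(self, operation, reason):
        self.operation = operation
        self.reason = reason
        super().__init__('%s failed: %s' % (operation, reason))


def retype_volume(driver, volume, new_type):
    try:
        driver.retype(volume, new_type)
        volume['volume_type'] = new_type
        volume['task_status'] = None
    except NonFatalDriverError as exc:
        # Example 2 above: the backend can't do the retype, but nothing is
        # wrong with the volume, so it stays 'available' at the old type.
        volume['status'] = 'available'
        volume['task_status'] = 'error-%s' % exc.operation  # e.g. 'error-retype'
        volume['task_reason'] = exc.reason
    return volume
# --- end illustration ---
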
16:50:07 which is part of the problem I have with existing things (like taskflow)
16:50:10 jgriffith +2
16:50:23 hemna: Not simple, since taskflow currently isn't built in a way it can usefully track it
16:50:25 akerr: and after that you also did a bunch of new operations to the volume
16:50:34 jgriffith: I think the current thought is more simplified than it should be. I'm trying to figure out how people would use it.
16:50:41 how it would look in clients like horizon
16:50:54 thingee: the same as it looks in Nova for example
16:51:10 |Status|Task|
16:51:27 |available|unable-to-retype|
16:51:39 fwiw, i think typically in a RESTful api what is usually done is the user would create a new 'retype' resource and they can poll that to see the status of the task
16:51:55 but that of course makes no sense in our current design
16:52:00 add a timestamp perhaps?
16:52:21 jgriffith: so I'm totally in agreement with going back to available status. +1000. But if extend fails..the user tries twice...they get the same task state back. I guess that's fine, and maybe a timestamp of when that task state was updated?
16:52:25 just so you know something finished?
16:52:34 So Nova has |Status|Task|InstanceFaults|
16:52:53 thingee +2
16:53:13 |available|unable-to-retype|backend_not_supported|
16:53:48 winston-d: sure
16:54:09 backend_not_supported doesn't mean anything to an end user's tenant though
16:54:22 DuncanT-1: yeah, I'd suggest that field be admin
16:54:30 thingee: so I suppose a task history would come in handy there: cinder task-history -> | Task | Outcome | Timestamp |
16:54:31 but again I think we're getting ahead of ourselves a bit
16:54:34 DuncanT-1: instance fault is for admins
16:54:34 Ok, that makes sense
16:54:43 DuncanT-1, unless you want to portray it as that action is not available
16:54:51 since it will always fail
16:55:30 5 minute warning
16:55:43 akerr: try logstash with the request ID
16:56:04 Or stacky with the same
16:56:13 yeah
16:56:17 yeah, please don't suggest duplicating the log files in some API call
16:56:30 winston-d: I still don't think it helps in knowing if a task finished when you retry a failed task.
16:56:35 from the user's standpoint
16:56:37 or client
16:56:59 My suggestion was that running a new task 'always' clears the previous task-state
16:57:05 sets it to None at the onset
16:57:10 give the user as much info as possible. Eventually it helps the admin as well.
16:57:39 jungleboyj_, hey user, here is a nice fat stacktrace for you. good luck. :P
16:57:48 jungleboyj_: Disagree. Far too easy for the user to start guessing what the problem is and get completely the wrong end of the stick
16:57:54 hemna: and so much for the abstraction
16:57:57 Or confuses them. Seeing old error info may hinder when the current operation worked.
16:57:59 jgriffith: would that be obvious to someone new? I'm trying to remember if on certain operations we list the volume/snapshot or whatever details before doing certain actions
16:58:12 hemna ... Not that much.
16:58:21 :P good.
16:58:42 jgriffith: It's not an obvious thing to me that a field would be cleared on a new action.
16:58:45 thingee: it's a hell of a lot more obvious than silently not extending or setting the volume to error because something isn't supported
16:58:49 other than the current volume state, all the admin has now are log stacktraces...if that.
16:59:04 thingee: when you run an API cmd and see the field change it seems obvious to me
16:59:26 jgriffith: I agree it's better. I'm just saying if we're going to revamp this, let's be careful and consider these things so we're not repeating ourselves.
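
As a sketch of the fields being discussed above, with invented names and shapes (none of this is existing Cinder or Nova code): 'status' stays the user-facing state, a 'task_state' records the last operation and is cleared whenever a new one starts (so a lingering value is unambiguously about the most recent request), and an admin-only fault list plays roughly the role of Nova's instance faults.

# --- hypothetical illustration, not Cinder code ---
import datetime


def start_task(volume, task):
    # Starting a new operation always clears the previous task state, so a
    # leftover value can only refer to the most recent request.
    volume['task_state'] = None
    volume['current_task'] = task


def finish_task(volume, ok, detail=None):
    task = volume.pop('current_task')
    if not ok:
        volume['task_state'] = 'error-%s' % task
        # Admin-visible fault history, analogous to Nova's instance faults.
        volume.setdefault('faults', []).append({
            'task': task,
            'detail': detail,
            'created_at': datetime.datetime.utcnow().isoformat(),
        })


vol = {'status': 'available', 'task_state': None}
start_task(vol, 'retype')
finish_task(vol, ok=False, detail='backend not supported')
# status stays 'available', task_state == 'error-retype', one fault recorded
# --- end illustration ---
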
16:59:28 hemna backend_not_supported doesn't seem dangerous though.
16:59:30 hemna: Good drivers log lots of useful info of their own too... if yours doesn't, talk to your vendor
16:59:37 thingee: fair enough
16:59:45 jungleboyj_, +1
16:59:46 DuncanT-1: +1
16:59:59 Ok
16:59:59 DuncanT-1: +1
17:00:04 +1 for log spam
17:00:15 We've successfully burned our hour
17:00:19 DuncanT-1, ours does a good job of logging failures/reasons. I'm just saying in general though that's not overly useful to an admin
17:00:24 I'll get a session for this proposed
17:00:25 also with that, reviewers should be encouraging driver changes to give great logs to cinder users!
17:00:30 and have some code for ATL
17:00:34 because it takes for fricking ever to find the error in the log on a busy system.
17:00:41 ATL?
17:00:43 thanks everyone
17:00:45 atlanta
17:00:49 forcing admins to have to look in the log is the wrong approach IMO
17:00:50 Ah
17:00:52 thingee my favorite thing to do.
17:00:53 #endmeeting