16:00:00 #startmeeting cinder
16:00:01 Meeting started Wed Jan 14 16:00:00 2015 UTC and is due to finish in 60 minutes. The chair is thingee. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:05 The meeting name has been set to 'cinder'
16:00:14 hello everyone
16:00:14 o/
16:00:18 o/
16:00:20 hi
16:00:23 hello
16:00:24 \0
16:00:25 o/
16:00:32 o~~~
16:00:36 hi
16:00:43 hi
16:00:45 agenda for today:
16:00:47 #link https://wiki.openstack.org/wiki/CinderMeetings
16:00:55 hi
16:00:59 hi
16:01:00 lets get started!
16:01:01 hi
16:01:11 #topic Kilo Third Party CI Deadline Confusion
16:01:16 Hey
16:01:17 o/
16:01:22 #link http://lists.openstack.org/pipermail/openstack-dev/2015-January/054101.html
16:01:23 o/
16:01:26 Hi
16:01:44 hi
16:01:48 so this thread started by erlon, is presenting some confusion with deadlines we talked about
16:01:55 #link http://eavesdrop.openstack.org/meetings/cinder/2014/cinder.2014-11-19-16.00.html
16:02:23 back in november 19th according to the summary, we would set the k-2 for old drivers and k-3 for new
16:02:26 o/
16:02:38 So immediately after that meeting there was a big discussion in the channel
16:02:41 I wanted communication for both openstack-dev mailing list and individual maintainers
16:02:58 About whether CI was ready/easy enough/reliable enough for us to set a requirement for
16:03:19 Which didn't come to a conclusion
16:04:21 ok, regardless, we set a deadline. DuncanT you initially updated this in the wiki https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#Deadlines
16:04:43 so now we have confusion coming from maintainers that have had drivers before K dev
16:05:39 DuncanT: did you email maintainers as given in action item from the nov 19th?
16:06:19 No, because the discussion after was that that deadline was unreasonable and we should include better instructions first
16:07:07 I don't know why there is confusion.
We came out of the working session on Friday at the Summit with the direction in the Wiki set in place.
16:07:28 jungleboyj: not all maintainers were there.
16:08:01 thingee: Fair enough.
16:08:03 if we had emailed maintainers at the address we're supposed to reach a vendor at, and exhausted the mailing list as erlon pointed out, there would really be no excuse
16:08:22 there is no mention of the deadline on the mailing list
16:08:50 the wiki still stated the deadline regardless of this convo that took place in #openstack-cinder. My point is communication on this was horrible
16:08:52 thingee: So, we missed communication to the mailing list. Sounds like we can't fix that. What is the remediation plan then?
16:10:02 thingee: Can we take it to the mailing list now, note the mistake in communication and make the deadline K-3 ?
16:10:21 so here's my proposal. is 17 weeks enough time for maintainers with drivers before K to get a CI up
16:10:24 ?
16:10:42 not proposal, question
16:10:54 if so, my proposal would be new and old maintainers have until k-3
16:10:59 thingee: I hope so. I have gotten my teams to get systems set up between the Summit and Christmas.
16:11:02 I'd say so, yes, but then I keep being told I'm wrong about that
16:11:07 I think so. Especially since we were pushing in Juno that they had to have them.
16:11:08 * eikke has been working on setting up CI the last couple of days. It's non-trivial, at least
16:11:18 smcginnis: +1
16:11:26 thingee: to be honest, I spent most of the last 2 days trying to get my devstack instances updated.
16:11:29 +1
16:11:32 eikke, +1, we have been working on the new devstack plugin model and it's been non-trivial as well
16:11:46 things like oslo.concurrency and some other problems can waste a lot of time.
16:12:14 creating a new stackforge project, understanding how to get it done, creating new jobs are very time consuming (at least for the 1st time folks)
16:12:14 having CI would be "easy" (as in "requires only a bit of glue") if devstack would reliably work...
16:12:33 flip214: that's a BS statement
16:12:34 ok so when in k-3 is the cut off? When do we deprecate the driver if there is no CI?
16:12:48 flip214: not if you take the requirement a single CI system can only put 1 comment per changeset into account, and you have multiple drivers
16:12:48 but I had to do "rm -rf /opt/stack && ./stack.sh" after trying (and failing) for quite a few hours
16:13:09 thingee: that's a good question
16:13:13 jgriffith: sorry if you think so. I've wasted quite a lot of time with it, and not for the first time either.
16:13:32 for those having issues, there are 3rd party mentoring meetings to help out, in addition to asking directly in the irc channels
16:13:38 and mailing list
16:13:57 asselin +1
16:13:59 asselin: Thank you for all you and the 3rd party team have been doing to help!
16:14:01 they are very helpful
16:14:08 and have like 3 meetings a week now
16:14:15 #link https://wiki.openstack.org/wiki/Meetings/ThirdParty
16:14:17 asselin: +1, and also no offense to anybody, but there's a real lack of investment IMHO regarding people actually digging into OpenStack as a whole
16:14:37 even the documentation is outdated. the wiki still says that 2GB are enough; my VM reliably went OOM (as in killing eg. mysql) until I gave it 5GB. (see http://docs.openstack.org/developer/devstack/guides/single-vm.html#prerequisites-cloud-image)
16:15:18 last night in 3rd party meeting, there were discussions to get docs updated.
16:15:18 flip214: Have you updated the wiki?
16:15:22 Ok so if I don't see a CI reporting by k-3 that's 03-19, as far as I'm concerned, your driver might not work in Cinder.
I honestly don't know if it's worth deprecating
16:15:42 perhaps a mini sprint scheduled for next week, if ppl are available.
16:15:45 thingee: +1
16:15:46 Or they get put on the naughty list.
16:15:53 :)
16:16:00 thingee: +1
16:16:02 thingee: I think sadly it may not be. BUT I would argue that you better be able to at least do the old cert script
16:16:16 jgriffith: +1
16:16:19 https://wiki.openstack.org/wiki/Cinder/tested-3rdParty-drivers#Deadlines states end of K-3 for new drivers. I have set my schedule after that so I would really like that to stay that way
16:16:20 thingee, e-mails should be sent out still
16:16:28 DuncanT: no, at least not yet, because I don't even _know_ what is right or wrong there!
16:16:28 jgriffith: ok
16:16:29 thingee: question about FC drivers...
16:16:36 thingee: honestly... if you can't deploy devstack with your driver and run tests you have no business saying "hey my company xyz offers Cinder drivers and we're awwesome"
16:16:38 akerr: shoot
16:16:49 I do have to say, getting CI set up was a huge struggle. Things definitely need to improve. But I definitely learned a lot going through the process.
16:16:53 jgriffith: +1
16:17:06 thingee: FC has a lot of extra gotchas that normal IP based drivers don't that make it harder to get running CI, are they also included in this deadline?
16:17:11 wow my irc private messages are blowing up
16:17:15 ;)
16:17:32 my conclusion: ci is *good*; but march 19th is awfully close. YMMV.
16:17:34 jgriffith: that's why CI is needed
16:17:37 * jungleboyj hears opportunity for a new song.
16:17:39 I think HP guys, especially asselin can help with that
16:17:41 flip214: You have a new driver, right?
16:17:45 akerr: ^
16:17:47 smcginnis: right
16:17:51 jgriffith: it's not setting up devstack that was hard on our side, rather integration with gerrit and such (I used the gerrit-trigger plugin of jenkins, but now it turns out that should be changed most likely because of the 1-comment rule, so zuul and job-builder and whatnot are thrown in)
16:17:57 So the statement was any existing drivers. Is that correct?
16:17:59 akerr: they had the same problems from what I understand
16:18:14 New drivers have a little more time.
16:18:24 3-19 for new, I thought.
16:18:30 eikke: understood, you won't hear arguments against that from me
16:18:31 eikke, let's talk afterwards. there's likely some solutions ppl have already found.
16:18:39 smcginnis: to make it simple, all drivers must have a ci by k-3
16:18:43 smcginnis: new or old
16:18:45 asselin: would like that :)
16:18:49 that's march 19th
16:18:57 OK, got it.
16:19:20 so here's my proposal, as clear as I can make it to get a consensus here...
16:19:44 thingee: If you don't have it on 3/19 we submit patches to remove the driver?
16:20:15 thingee: No potential for fall back to running the Cert script?
16:20:49 All Cinder *volume* drivers must have a CI by March 19th, end of k-3. If not, your driver is removed, doesn't work with the Cinder in K. We will mention the removal in release notes for Cinder users. For people who are having issues with this deadline, you better be in good communication with me.
16:20:50 I think we have to say 'no fallback' or we'll be having this discussion again in the Z release
16:21:04 does anyone oppose that proposal ^^^
16:21:16 +1 from me
16:21:19 +1 with some leniency for FC drivers.
16:21:22 DuncanT: +1, just clarifying.
16:21:24 +1
16:21:26 I will email all maintainers, and email the openstack dev mailing list about this as well
16:21:26 so, just to make sure there isn't a grey area (and hopefully not to confuse the issue more), when we say "have a ci" this means a ci that runs and passes on every single check in? passes on 95%+ of them? I've noticed from personal experience that the ci's are not a set up once and forget kind of deal... did anyone have some kind of uptime requirement in mind?
16:21:37 I want drivers that work, +1
16:21:43 thingee: +1 ... Want to talk about FC.
16:22:06 patrickeast: just the tempest api.volume tests
16:22:31 thingee, patrickeast: I think
16:22:32 thingee: umm... one correction; 'tempest volume' tests
16:22:34 "most of the time"
16:22:34 thingee: right, but i guess my question is how reliably?
16:22:55 patrickeast: well if you're not reliable, you will be removed. just like today
16:23:01 Somewhat subjective, but we do miss some occasionally for various reasons.
16:23:01 patrickeast: Reliably enough that Infra doesn't disable your account.
16:23:02 your ci is false to patches.
16:23:04 how many concurrent check ins does the CI system have to handle?
16:23:07 ie do i need to go tell my manager we need 24/7 shift from dev opts for it?
16:23:08 jungleboyj: +1
16:23:11 jungleboyj: +1
16:23:31 FC folks, be in good communication with me.
16:23:35 about progress
16:23:40 patrickeast: I don't think we have that expectation.
16:23:47 thingee: will do
16:24:13 ok, are we ok to move on?
16:24:16 jungleboyj: what is the criteria for Infra to disable a 3rd party CI account? Just curious.
16:24:23 thingee: Will do. Need to check in with our FC people to see how things are going.
16:24:27 jungleboyj: ok, i guess maybe we can discuss more offline, my understanding is that the ci accounts are disabled when *we* decide to turn them off
16:24:40 winston-d: 5 or more disagreements with Jenkins in a row, without them being real issues
16:24:44 patrickeast, ++ it's a cinder team job now
16:25:00 winston-d: Is the 'guidance' criteria
16:25:00 DuncanT: That sounds right.
16:25:01 Bad votes, behaving like the CloudByte CI, etc. :)
16:25:06 #topic Fixing allocated_capacity in Cinder
16:25:10 DuncanT: can we disable CloudByte for now?
16:25:14 #link https://bugs.launchpad.net/cinder/+bug/1408763/comments/1
16:25:16 Launchpad bug 1408763 in cinder "allocated_capacity tracking is broken in manager" [Medium,Triaged]
16:25:23 #https://bugs.launchpad.net/cinder/+bug/1408763
16:25:24 * DuncanT put in a disable request for cloudbyte a few hours ago ;-)
16:25:29 #link https://bugs.launchpad.net/cinder/+bug/1408763
16:25:33 full thread ^
16:25:37 jgriffith: here?
16:25:42 :)
16:25:51 Ok... so I'm going to try and summarize
16:26:20 currently.... the manager code sets an allocated_capacity_gb
16:26:33 this is incremented on each volume create, and decremented on each volume delete
16:26:43 this number is used as our simple scheduler replacement
16:27:03 and is intended to represent the CINDER allocated GB on a host
16:27:11 Now....
16:27:15 the problem is taskflow
16:27:37 in the case of a failure, taskflow will retry
16:28:01 jgriffith: the problem is poor error handling / rollback , not taskflow :)
16:28:05 the way it works however is the retry is done in the tflow lib itself, and calls the scheduler-->manager multiple times
16:28:11 avishay: no, it's not that simple
16:28:15 avishay: nice try
16:28:22 let me finish please
16:28:28 Isn't quota reservation/commit going to have the exact same problem?
16:28:34 DuncanT: yes
16:28:42 so there are two problems here:
16:29:03 1.
If the task fails in the driver we raise and the allocation is never incremented
16:29:14 This is bad because the delete works and decrements still
16:29:25 so the count is completely inaccurate
16:29:41 (think error on create, we have it in DB but it's in error state)
16:29:57 so.. put a finally block in the run_flow
16:30:03 Nope... that doesn't work
16:30:10 because you call it multiple times
16:30:27 so... add a taskcompletion class in TFLOW
16:30:49 Not so easy, even though everybody says "Taskflow is easy" it's not trivial to make such changes
16:30:56 they're very far reaching and complex
16:30:59 who said taskflow is easy?
16:31:09 avishay: oh no
16:31:12 somebody in this room
16:31:13 :)
16:31:17 enough said LOL
16:31:19 ok....
16:31:21 jgriffith: :)
16:31:32 anyway, bottom line there's all sorts of caveats, traps and gotchas
16:31:50 to me that sounds as if the per-host value needs to be fetched from the database, where all the current volume states are known
16:31:51 in the bug I proposed driver reporting which I still think is something we want, but maybe not for this bug
16:31:51 jgriffith: why not do a DB query to get the number instead of trying to keep track of it?
16:32:06 winston-d: pointed out the intent being truly just Cinder allocated space
16:32:09 which makes sense
16:32:19 the patch I worked up is well....
16:32:27 I don't think the driver/storage should be keeping track of Cinder things
16:32:40 after each create or delete... I get all volumes for the host out of the db
16:32:40 avishay: +1
16:32:50 iterate through and tally up size for host and pool
16:32:56 jgriffith: did you see my comments in the bug?
I added provisioned_capacity as part of the over subscription patch
16:33:06 avishay: jungleboyj if you would let me finish, I said that in the next line :)
16:33:24 jgriffith: go ahead, sorry
16:33:24 xyang1: yes, and I think that's needed/good but not for this problem
16:33:36 so what I'd like to propose:
16:33:52 jgriffith: +1 :-)
16:34:11 * I change that name from allocated_capacity_gb, to something more indicative: "apparant_host_allocation" or something
16:34:22 and implement the db hack that I described
16:34:37 I'm concerned about push back from folks on doing that iteration
16:34:49 I don't think there's a big performance problem but I could be wrong
16:35:00 Can I offer an alternative?
16:35:03 It seems like there are two problems.
16:35:05 avishay: absolutely
16:35:06 If the volume is actually created on the array but the database isn't incremented then space is consumed but not reported.
16:35:09 If the volume isn't actually created then it's reported that space is consumed even though it's not.
16:35:12 Seems like the array (and therefore the driver) would be the source of truth.
16:35:36 smcginnis: Yep, and we don't care in this case
16:35:42 If the scheduler is the only one who needs this value, it can access the DB when it needs to. We can write a query that counts with SQL.
16:36:01 avishay: well, it's the manager that actually uses it
16:36:04 but maybe
16:36:09 provisioned_capacity is from the array. it is what is actually created on the array
16:36:16 avishay: that was how Simple Scheduler did it.
16:36:22 avishay: pools throws a new wrench in the works here though as well
16:36:41 jgriffith: why does the manager need it?
16:36:50 If the array is shared cinder and non-cinder, how is it supposed to know what to report?
16:36:59 avishay: well, maybe it doesn't, currently that's where scheduler gets it from
16:37:09 ideally things that we retry would be idempotent and we would properly roll back on failures, so we would never get in this mess, but that's bigger than this
16:37:27 avishay: well, that's what's funny
16:37:47 avishay: we're in the mess due to a lack of understanding of how taskflow responds to failures I think
16:38:05 avishay: so there's no try/except in the manager code
16:38:08 DuncanT: ya, array only knows the total, cinder and non-cinder. to get cinder only, it has to be from cinder db
16:38:21 jgriffith: honestly I haven't read the discussions closely between you and harlowja_away, but he was not able to resolve them?
16:38:24 so at first glance you think... "yeah, do this fall through this call this call etc"
16:38:59 thingee: he had some ideas but they involved some changes to the library itself (taskflow library)
16:39:00 jgriffith: you mentioned this taskcompletion class, that was just not intuitive from the docs he has on it? I assume he has docs anyways, because he writes a lot of them :)
16:39:15 jgriffith: did he offer to them in the library?
16:39:18 thingee: he has great docs
16:39:20 himself
16:39:36 I think the point is being missed regardless
16:39:53 We should not have something that ONLY harlow can support, maintain and understand
16:39:57 that's BAD
16:40:12 jgriffith: all I'm trying to understand is the author of a library we depend heavily on for volume creates is trying to work with us. If you remember from the K summit, it was in danger of removal.
16:40:22 we also IMO shouldn't have such a hard time trying to make such a simple change
16:40:40 thingee: he is ABSOLUTELY working with me
16:40:41 jgriffith: I agree with that statement about harlowja_away being the only one to understand it
16:40:55 thingee: harlowja_away is GREAT!!!
Make that clear, very responsive and helpful
16:41:30 thingee: my concern is what happens when he quits and goes on tour with Just Beber!
16:41:37 Justin
16:41:39 lol
16:41:39 :)
16:41:55 jgriffith: I guess I would argue the same thing about a lot of libraries. Have you ever worked in the eventlet or paramiko? we depend on contributors in those projects...
16:41:56 on the other hand, there is no alternative
16:41:56 oye vey.
16:42:22 I've spent a good deal of time trying to come up to speed on Tflow and honestly I don't feel very competent outside of adding to the existing manager
16:42:36 avishay: not true, but I'm not going there in this meeting :)
16:42:46 jgriffith: fair enough
16:42:48 all I wanted in this meeting was feedback on using the db hack
16:42:50 or...
16:43:03 as was pointed out maybe moving it somewhere else (back to scheduler)?
16:43:12 jgriffith: I think we're fine with it. I'm just worried about us avoiding the layers that use taskflow, because it's taskflow
16:43:18 or do we need it at all... do we rip the band-aid off and go back to driver reporting
16:43:21 people, two more points on the agenda, only 15mins left.
16:43:28 jgriffith: and I just haven't spent time with it to really understand the problem you're having.
16:43:32 thingee: to be fair, that's not what I'm doing here
16:43:50 thingee: I'm not avoiding it just because it's taskflow or difficult
16:43:51 jgriffith: i would really not like the drivers to report this. i think an SQL query with count should do the trick.
16:44:08 thingee: I spent quite a bit of time trying various things with taskflow, but there are some gotchas
16:44:14 jgriffith: if harlowja_away gave you a solution, this taskcompletion class, and you decide to go do this in the drivers themselves, that sounds like avoiding taskflow.
16:44:20 jgriffith: Sounds like people are ok with the SQL query.
16:44:22 thingee: particularly when you try and work with member variables
16:44:52 thingee: ok, I feel I'm not explaining things correctly
16:44:54 IMO the SQL is much easier in the short term; the "right" solution would be reporting by the drivers.
16:45:18 jgriffith: and I apologize for my lack of understanding too. I'm going off little bits of your long discussion with him.
16:45:20 jgriffith: thingee is this a good topic for the meet-up?
16:45:29 winston-d: my question wrt driver reporting however... is we want this to be just Cinder info no?
16:45:42 thingee: no worries
16:45:47 If you code the SQL version, I will benchmark it with my million fake volumes setup if you want, that should answer the scalability / performance concerns
16:45:59 jungleboyj: probably. also because some of the work I was expecting in K to do volume deletes with task flow, doesn't seem likely to happen. Unless eharney can give an update
16:46:00 DuncanT: good idea
16:46:22 jgriffith: The other approach is to say 'This isn't a sensible thing to schedule on, you can't do that any more'
16:46:32 jgriffith: yes, what we want is just cinder info
16:46:38 jgriffith: yeah regardless I think we need to get this fixed. I wanted to discuss other things with you this week, so maybe we can talk again about this
16:46:40 thingee: it's still on my radar but no progress so far
16:46:51 winston-d: yeah, I noticed *why* after your comment in the bug :)
16:47:11 ok... I'll look at a couple proposals and get them up
16:47:19 but real question is, does anyone actually use AllocatedCapacityWeigher (aka Simple Scheduler)?
16:47:40 winston-d: hmmmm
16:47:47 jgriffith: +1 then we can discuss it f-2-f at the meetup.
16:47:59 jgriffith: thanks for all your time on this issue and explaining it to us.
16:48:16 thingee: +1
16:48:30 #topic Kilo feature freeze
16:48:41 I don't think bot is working anymore =/
16:49:11 thingee: There was a big net split yesterday. Wonder if that broke it.
16:49:22 jungleboyj: yeah maintenance work on freenode
16:49:26 they had some announcement about it
16:49:31 ok so freeze!
16:49:37 march 19th is k03
16:49:39 k-3 rather
16:49:45 so I'm going to throw a date out
16:49:48 march 10th
16:50:11 9 days of bug fixing and nothing else can get merged in unless an exception is approved by consensus of core
16:50:21 anyone object?
16:50:47 sounds great to me
16:50:48 thingee: should we have all blueprints approved or spec/code merged before feature freeze?
16:50:58 Is there feature-proposal-freeze?
16:51:02 Sounds good. e0ne Good question.
16:51:04 +1
16:51:15 e0ne: yes
16:51:17 e0ne: merged I guess
16:51:27 Merged, I'd have thought
16:51:53 to answer dulek's question, when should we stop accepting specs for k-3
16:51:55 just to confirm: we need to get _code_ merged before ff
16:52:14 what about specs?
16:52:15 e0ne: Correct
16:52:17 code merged by march 10
16:52:18 So, specs have to be merged, blueprints proposed and code merged before 3/10 ?
16:52:38 is there any deadline for specs merge?
16:52:50 spec approval by...what do people think is fair?
16:52:56 feb 15?
16:53:20 thingee: That seems reasonable. Gives them a month from today. Roughly.
16:53:25 +1
16:53:27 feb 15 is k-2 and one week more. sounds good to me
16:54:09 #idea k-3 spec feature freeze, march 10th code freeze
16:54:18 thingee: Can you confirm this decision on the ML?
16:54:26 dulek: you bet
16:54:27 dulek: +1
16:54:38 thanks :)
16:54:38 ok thanks everyone
16:54:48 #topic Sharing volume snapshots
16:54:51 (feb 15 is sunday I think)
16:54:53 rushiagr: here?
16:54:59 thingee: yes
16:55:04 rushiagr: I never stop working
16:55:06 #link https://blueprints.launchpad.net/cinder/+spec/snapshot-sharing
16:55:13 thingee: :)
16:55:40 thingee: thanks for the link
16:56:01 4 min warning
16:56:12 so I just wanted to get an idea and if people have strong objections to it
16:56:52 * DuncanT does
16:56:53 i do :)
16:57:12 unfortunately I'm not a fan of sharing at all
16:57:20 volumes or snapshots, and certainly not ACL's
16:57:22 :(
16:57:33 I wouldn't want ACLs myself too
16:57:38 rushiagr: :)
16:57:42 rushiagr: is this a real customer need?
16:57:59 avishay: well, sort of yes
16:58:12 Somebody down the road /will/ want ACLs though, and pretty soon we've a reimplementation of glance sat in the cinder tree
16:58:23 rushiagr: sort of, or yes? :)
16:58:35 avishay: heh. I'll say yes :)
16:58:54 DuncanT: maybe
16:58:56 i think that implementing some sort of subset of ACLs will lead to poor APIs. If it's a real need, then implement the whole thing properly.
16:59:16 I can see the use case for snapshot sharing, but it seems like it would open a can of worms. That use case can probably be addressed other ways
16:59:17 DuncanT: I would say we can stop at what I'm proposing (as I too won't want ACL too)
16:59:18 Or make glance faster and get this usecase for fre....
16:59:19 Glance has public resources, isn't it solving the problem?
16:59:22 But I also feel they are not necessary
16:59:26 avishay: rushiagr DuncanT what about just doing public/private and drawing the line there
16:59:34 I say adoption of this solution.
16:59:44 dulek: +1 :)
16:59:58 jgriffith: fine by me, but make the API uniform for volumes as well - don't see why snapshots are special
17:00:01 rushiagr: sorry for little time on your topic, we will revisit it next week first thing, or you can submit a spec for review.
17:00:03 #endmeeting
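
Note on the allocated_capacity topic: the "SQL query" alternative raised in the discussion (recompute the per-host figure from the database instead of incrementing and decrementing a counter) could look roughly like the sketch below. This is only an illustration, not Cinder's code: the `volumes` table here is a simplified stand-in with assumed columns (`host`, `size`, `status`, `deleted`), sqlite3 is used so the example runs on its own, and the helper names `allocated_capacity_gb` and `demo` are hypothetical.

```python
import sqlite3


def allocated_capacity_gb(conn, host):
    """Sum the sizes of all non-deleted volumes recorded for one host.

    Hypothetical helper: table and column names are assumptions made
    for this sketch, not the real Cinder schema or API.
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM volumes "
        "WHERE host = ? AND deleted = 0",
        (host,),
    ).fetchone()
    return row[0]


def demo():
    # In-memory database standing in for the Cinder DB.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE volumes (id INTEGER PRIMARY KEY, host TEXT, "
        "size INTEGER, status TEXT, deleted INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO volumes (host, size, status) VALUES (?, ?, ?)",
        [
            ("host-a", 10, "available"),
            ("host-a", 5, "error"),  # failed create, row still in the DB
            ("host-b", 20, "available"),
        ],
    )
    before = allocated_capacity_gb(conn, "host-a")

    # Deleting the errored volume marks its row as deleted; the next
    # recomputation reflects that without any counter bookkeeping.
    conn.execute(
        "UPDATE volumes SET deleted = 1 "
        "WHERE host = 'host-a' AND status = 'error'"
    )
    after = allocated_capacity_gb(conn, "host-a")
    return before, after
```

Because the value is derived from the rows on every call, a create that fails in the driver and a later delete of the same volume cannot leave the total out of step, which is the drift described in the bug. Whether error-state volumes should count toward the total, and how pools are handled, are left open here, as they were in the meeting.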