16:00:04 #startmeeting cinder
16:00:04 Meeting started Wed Jul 1 16:00:04 2015 UTC and is due to finish in 60 minutes. The chair is thingee. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:07 The meeting name has been set to 'cinder'
16:00:10 hi!
16:00:17 hi!
16:00:19 hello!
16:00:20 o/
16:00:21 hi
16:00:21 hi
16:00:23 hi
16:00:27 hi
16:00:29 hi
16:00:30 o/
16:00:31 hi
16:00:31 hi
16:00:34 hello
16:00:35 hi
16:00:36 o/
16:00:47 hi everyone!
16:00:47 .
16:00:48 hello
16:01:02 hi
16:01:11 #topic announcements
16:01:40 #info Cinder is not accepting any more drivers for Liberty
16:01:44 #link http://lists.openstack.org/pipermail/openstack-dev/2015-May/064072.html
16:01:50 o/
16:01:51 Please stop harassing me in PMs about it
16:01:53 hi
16:01:55 LOL
16:02:18 :)
16:02:30 I will be approving RPC and version object work by thangp this week
16:02:43 #info spec for RPC compat is planned to be approved this week
16:02:45 \o/
16:02:45 #link https://review.openstack.org/#/c/192037/5
16:02:53 speak now or hold your peace
16:03:01 :)
16:03:21 #info Return request ID to caller cross-project spec will be approved this week
16:03:23 #link https://review.openstack.org/#/c/156508/
16:03:38 speak now or hold your peace. Cinder client will be receiving this treatment in Liberty
16:03:51 nice
16:03:57 :)
16:04:01 #topic Encrypted volumes
16:04:03 mriedem: hi!
16:04:11 hello
16:04:15 #info mailing list post
16:04:17 #link http://lists.openstack.org/pipermail/openstack-dev/2015-June/068117.html
16:04:37 so i just read over the latest on the cinder change https://review.openstack.org/#/c/193673/ to set the encrypted flag globally in connection_info
16:04:47 looks like everyone is in general agreement this is goodness so our heads aren't in the sand
16:04:52 o/
16:05:02 the tempest change to add the config option to disable the encrypted volume tests is merged
16:05:07 mriedem, does that also mean it will get saved in nova's BDM?
16:05:10 i pinged some -qa guys for the d-g and devstack changes
16:05:15 hemna: yeah
16:05:19 ok cool
16:05:24 tbarron was good enough to open some nova bugs
16:05:38 Is everyone fine with my decision here? http://lists.openstack.org/pipermail/openstack-dev/2015-June/068370.html
16:05:44 so those are on the radar, one looks simple (rootwrap filter update?) and the other i'm not sure about, some in-use race or something
16:06:04 we will just enable this and see which CIs surface with this problem and deal with them case by case
16:06:08 mriedem: there may of course be other bugs exposed once e.g. the rootwrap is fixed
16:06:14 thingee: +1
16:06:23 +1
16:06:26 tbarron: sure - this is not a very well exposed area,
16:06:30 so i expect more bugs
16:06:31 +1
16:06:32 +2
16:06:36 +1
16:06:46 tbarron: were you working on any of the nova fixes, or just reporting?
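[Editor's note: a minimal sketch of the tempest knob referred to above, assuming the option name used by the merged change (attach_encrypted_volume); verify against your tempest tree before relying on it. A CI whose backend cannot yet pass the encrypted-volume scenario tests can switch it off while the driver bugs below are worked out:]

```ini
# tempest.conf (assumed option name; check your tempest version)
[compute-feature-enabled]
# Set to False to skip the encrypted volume attach scenario tests.
attach_encrypted_volume = False
```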
16:06:54 I'd like other 3rd party CIs to check against these two bugs and not just assume their failures are the same
16:06:59 #agreed enable tempest, let CIs surface and work with maintainers to resolve issues in their backends/drivers
16:07:02 mriedem: thanks :)
16:07:12 np
16:07:12 mriedem: and for all your work on tracking this issue down
16:07:16 I've added comments there as our failures are similar to tbarron's
16:07:21 \o/
16:07:24 mriedem: I'd do the rootwrap one right now but I can't submit upstream this week (company is down this week, and I have to get approval)
16:07:34 tbarron: ok, i'll check it out today
16:07:35 FC error is the same as NFS because it uses rootwrap for iscsi
16:07:35 mriedem: the 'in use' bug is tricky I think
16:07:48 iSCSI error is exactly the same as tbarron's
16:07:53 tbarron: which two bugs, any links?
16:08:07 https://bugs.launchpad.net/nova/+bug/1470142
16:08:07 Launchpad bug 1470142 in OpenStack Compute (nova) "Nova volume encryptors attach volume fails for NFS and FC" [Medium,Triaged]
16:08:09 https://bugs.launchpad.net/nova/+bug/1470562
16:08:10 Launchpad bug 1470562 in OpenStack Compute (nova) "'in use' error when Nova volume encryptors format cinder volumes" [Medium,Confirmed]
16:08:20 tbarron: I modified the bug to add FC there
16:08:45 xyang2: yes, I saw, thanks! would you also add to/modify the in-use bug?
16:09:00 mriedem: anything else to add?
16:09:03 tbarron: sure, I didn't see it yesterday
16:09:11 thingee: nope
16:09:12 xyang2: that's because I just now filed it
16:09:17 #topic CI job changes
16:09:19 e0ne: hi
16:09:23 hi
16:09:27 mriedem, I wonder if my FC nova patch helps out at all? https://review.openstack.org/#/c/195350/
16:09:31 #idea make gate-rally-dsvm-cinder job voting
16:09:41 it's stable enough
16:09:42 mriedem, it's been sitting there for a bit w/o attention.
16:09:49 mriedem, it also helps live migration for FC
16:10:01 and it's the only job which tests cinder+python-cinderclient
16:10:17 also, it covers cases which are not tested by tempest
16:10:26 e.g. the latest bug with broken backups
16:10:29 thingee: e0ne I'm not completely sure about stability on that
16:10:30 e0ne: are there any other projects that have moved rally to voting?
16:10:41 o/
16:10:48 thingee: i don't know ;(
16:10:49 thingee: e0ne but I'm +1 if it is
16:10:57 nova doesn't have a rally job
16:11:00 fwiw
16:11:00 e0ne: do you have any stats on the stability?
16:11:18 or how you came to the conclusion of it being stable
16:11:21 thingee: we could get it using logstash
16:11:34 e0ne: i think you should check out graphite
16:11:39 http://graphite.openstack.org/
16:11:44 logstash only has 10 days of results
16:11:52 graphite has release cycles' worth
16:12:01 that's how we knew the ceph job stabilized
16:12:03 before making it voting
16:12:04 In previous decisions about making things voting, like Ceph, we have used stats, no problem. I don't see any harm in looking that over before we make a decision.
16:13:03 not sure how long this will take, and if we should just circle back
16:13:08 i'm sorry, i didn't get any stats right now :(
16:13:30 ok, I'm fine with it. I'm not sure who is opposed, but if it's stable, sure why not?
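[Editor's note: e0ne takes an action below to gather job stats. A sketch of how failure rates can be pulled from graphite's render API; the metric paths follow the zuul naming visible in the URL DuncanT is given later in the log, and the script itself is illustrative, not an official tool:]

```python
import requests

GRAPHITE = "http://graphite.openstack.org/render/"
JOB = "gate-rally-dsvm-cinder"  # the job proposed for voting above

def series_total(payload):
    # Graphite returns [{"target": ..., "datapoints": [[value, ts], ...]}];
    # null values mean no data for that interval, so skip them.
    if not payload:
        return 0
    return sum(v for v, _ in payload[0]["datapoints"] if v is not None)

counts = {}
for status in ("SUCCESS", "FAILURE"):
    target = "stats_counts.zuul.pipeline.check.job.%s.%s" % (JOB, status)
    resp = requests.get(GRAPHITE, params={"target": target,
                                          "from": "-90days",
                                          "format": "json"})
    counts[status] = series_total(resp.json())

runs = counts["SUCCESS"] + counts["FAILURE"]
if runs:
    print("%s: %d runs, %.1f%% failures"
          % (JOB, runs, 100.0 * counts["FAILURE"] / runs))
```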
16:13:51 Agreed - if it is stable then it sounds like a good idea
16:14:03 tbh, i didn't see any false-negative reports in the last 2 weeks
16:14:13 yfried was saying rally was broken in -infra yesterday
16:14:21 due to some jsonschema/functools32 stuff
16:14:36 i could get some stats and post them to a commit making the job voting
16:14:46 e0ne: that sounds like a good idea to move forward
16:14:55 mriedem: it was broken because some projects were broken
16:15:00 e0ne: what's the name of the job?
16:15:01 heh
16:15:11 "not my project, this other project was broken."
16:15:12 hi all, sorry i'm late
16:15:20 avishay: you're just in time
16:15:35 thingee: tbh, rally was broken but not in the cinder-related jobs :)
16:15:43 e0ne: is there a plan/maintainer if the job starts posting false negatives? point of contact?
16:16:00 * DuncanT can't find jenkins votes in graphite :-(
16:16:03 dannywilson: it could be me
16:16:09 DuncanT: +1
16:16:23 I can see gate-rally-dsvm-designate/ironic/manila/mistral/murano/zaqar and some more in graphite
16:16:25 DuncanT: i can find you stats in graphite
16:16:33 dannywilson: i have +2 for cinder-related patches to rally
16:16:35 DuncanT: graphite data loss yesterday due to the storage outage.
16:16:55 smcginnis: ha
16:17:12 smcginnis: wondering what kind of storage backend they were using
16:17:13 smcginnis: mriedem: I think I might just be looking in the wrong place... searches in graphite for 'cinder' and 'rally' both fail though
16:17:23 e0ne: okay, can we post that somewhere like a wiki page so others can find it too?
16:17:30 winston-d: Not mine! :)
16:17:45 dannywilson: we'll discuss this in the next meeting so people can find it
16:17:47 DuncanT: you'll need a URL like this: http://graphite.openstack.org/render/?width=600&height=344&_salt=1434709688.361&from=-1days&xFormat=%25b%25d-%25T&title=DRBD%20Cinder%2FDevstack%20stats&colorList=red%2Cgreen%2Cblue&target=stats_counts.zuul.pipeline.check.job.check-tempest-dsvm-full-drbd-devstack-nv.FAILURE&target=stats_counts.zuul.pipeline.check.job.check-tempest-dsvm-full-drbd-devstack-nv.SUCCESS
16:17:54 nice URL
16:17:57 sounds good
16:18:02 dannywilson: https://github.com/openstack/rally/blob/master/doc/source/project_info.rst
16:18:20 #agreed get stats, discuss in next week's meeting
16:18:33 e0ne: thanks
16:18:43 thingee: thanks!
16:18:46 #action e0ne to get stats and include in review comments for enabling voting job
16:18:48 e0ne: thanks
16:18:49 DuncanT: or, even better: go to logstash.openstack.org,
16:18:55 and use a search filter like this: project:"openstack/cinder" AND build_name:"check-tempest-dsvm-full-drbd-devstack-nv" AND "Detailed logs" AND build_status:"FAILURE"
16:19:03 #topic Remove LIO iSCSI helper from Cinder
16:19:05 avishay: hi
16:19:07 thingee: hey
16:19:13 #idea remove LIO iSCSI helper
16:19:19 so maybe the topic is a little too ... enthusiastic
16:19:24 could we perhaps start with the _problem_ rather than "remove it"
16:19:25 thingee: what about cinderclient functional tests job?
16:19:30 #info broken in Juno
16:19:33 but LIO in juno is currently broken and has no CI
16:19:54 other iSCSI targets (except tgt AFAIK) have no CI either
16:19:58 i am now taking up the effort for doing CI for LVM+LIO
16:19:59 avishay: FYI, we really had no CIs in juno
16:20:03 the nfs driver has no CI AFAIK
16:20:14 eharney: to be fair I raised this last cycle that it's "unknown" how to install and run on Ubuntu and has no CI
16:20:22 eharney: and isn't well tested.
16:20:32 eharney: IMO we should either get it fixed up and CI'd or remove it
16:20:39 jgriffith: i agree it's a problem. so let's fix it
16:20:45 eharney: +1
16:20:45 i will do CI
16:20:47 avishay: also LIO-iser is exactly what mellanox CI does
16:20:49 eharney: +1
16:20:54 however, it's down right now
16:20:56 according to stats
16:21:02 eharney: so I keep getting calls from EVERY customer that runs RHEL, and it breaks every other iSCSI dev
16:21:11 eharney: I'm with ya on that
16:21:15 fixing it
16:21:18 I'm fine with that
16:21:21 can we fix the LIO helper instead of removing it?
16:21:32 to avishay's point, the only way people care is when you threaten removal
16:21:33 hemna: +1
16:21:33 anyway, i love CI as a deployer, but setting up a test env with LVM or NFS has been rough
16:21:35 :)
16:21:37 eharney: Thanks for being willing to do that.
16:21:44 hemna: +1
16:21:45 hemna: keep up hemna !! :)
16:21:51 #info Mellanox CI does provide CI for LIO-iser
16:21:54 #link https://wiki.openstack.org/wiki/ThirdPartySystems/Mellanox_Cinder_CI
16:22:01 tbarron: you've +1'd that comment 3 times.. shouldn't it be +3 :)
16:22:05 eharney: Thanks for taking the work out of my hands :)
16:22:10 jgriffith: :-)
16:22:12 thingee: that doesn't work FWIW
16:22:23 i didn't mean to remove it today, but i think drivers (and iSCSI targets) that don't have CI should be removed for liberty
16:22:28 jgriffith: well it's also marked down :)
16:22:33 including nfs, including the iSCSI targets
16:22:34 :)
16:22:53 thingee: and IMO not really applicable if you're not running mellanox anyway :)
16:23:09 avishay, as a side note, we've been slowly getting folks to do CI on os-brick patches. it's kinda a similar setup
16:23:10 avishay: I am concerned by the number of people this would potentially affect.
16:23:26 jgriffith: maybe... I still think there is value in that code path being tested?
16:23:35 avishay: I think we need to work on addressing the problems.
16:23:35 but I think we need to have CI for each of the target objects.
16:23:45 thingee: for sure
16:23:58 jungleboyj: you're not concerned about the number of people running broken drivers? i'd rather have it not there than waste days trying to get it working.
16:24:18 thingee: I'm just saying we shouldn't rely on a vendor's impl of the target for the CI etc
16:24:20 hemna: I think we made a decision earlier this release about target driver CIs
16:24:21 hemna: CI for brick is great as well
16:24:57 thingee, ok sorry I don't remember what that was. sorry for the churn
16:25:01 jgriffith: so since things like LIO are open source, should we start working with infra to get a CI job in place?
16:25:02 avishay: CI takes time to set up... eharney is on the case for LIO
16:25:02 the duplication factor is something I still question but we can discuss offline
16:25:11 avishay: If you are a RedHat user I believe you have to use LIO. Alienate a whole customer base.
16:25:12 anyway, that's all i have, hope to see CIs soon
16:25:12 thingee: that's what I was thinking
16:25:14 yes
16:25:23 hemna: that was more of a question, I'm unsure :P ... can't keep track
16:25:25 jungleboyj: yes
16:25:27 oh :)
16:25:40 avishay: Nice.
16:25:42 RH7 has deprecated tgt and the cinder package is preconfigured to use LIO
16:25:57 thingee: What was the decision?
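[Editor's note: for context, the LIO helper under discussion is the target driver the LVM backend uses to export volumes over iSCSI. A sketch of how it is selected in cinder.conf, assuming the Kilo/Liberty-era option name (iscsi_helper); as noted just above, RH7 cinder packages ship preconfigured this way:]

```ini
# cinder.conf (option name as of the Kilo/Liberty era)
[DEFAULT]
# tgtadm is the tgt default; lioadm drives LIO via cinder-rtstool.
iscsi_helper = lioadm
```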
16:26:04 #info this is the second time LIO has been proposed to be removed I think
16:26:17 jungleboyj: I think I said they need CIs, but there was no follow-up communication
16:26:18 there is also iet and scst, which i don't know what they are
16:26:21 i'm not too familiar with the details from the first time
16:26:24 there needs to be CIs*
16:26:45 eharney: we had an angry bug about lio... and just setting things up in ubuntu. I'll leave it at that.
16:26:58 uh, ok
16:27:06 bottom line, this fell through the cracks, it sucks, let's get CIs ASAP please :)
16:27:10 for nfs too please
16:27:27 nfs and block device driver...
16:27:33 yes
16:27:39 NFS and block device won't pass currently
16:27:43 unfortunately the block device driver as I understand won't pass tempest today
16:27:45 nfs doesn't support snapshots, how could it pass the CI?
16:27:55 ^ exactly
16:27:59 i and some others have been slowly poking at getting snapshots into the NFS driver
16:28:08 current blocking point is a bug in Nova that i'm not clear really has an owner
16:28:12 block device driver doesn't meet minimum features... it is one of the reasons we have so many ABCs
16:28:14 block device driver can probably pass some subset, no?
16:28:18 eharney, which bug in nova?
16:28:35 oh darn we should talk about ABCs if we have time jgriffith
16:28:44 jordanP: looking
16:28:46 I'm tempted to say block device driver should just be pulled TBH
16:29:09 jordanP: https://bugs.launchpad.net/nova/+bug/1416132/
16:29:09 Launchpad bug 1416132 in OpenStack Compute (nova) "_get_instance_disk_info fails to read files from NFS due to permissions" [High,In progress] - Assigned to Eric Harney (eharney)
16:29:15 DuncanT: I think that harms projects like sahara though?
16:29:19 the block device driver can't support snapshots, right?
16:29:29 eharney: no
16:29:35 oh, ok
16:29:41 eharney: no
16:29:49 eharney: i guess it can 'dd', but that's pretty terrible
16:29:54 thingee: Not enough that anybody has stepped up to CI it... Infra will host it if somebody can get it even slightly close to working
16:30:00 DuncanT: I thought Mirantis was using BlockDeviceDriver for something.
16:30:03 eharney: Correct
16:30:04 e0ne: are you guys still using it? Or Sahara guys?
16:30:05 Maybe that was Sahara.
16:30:11 smcginnis: yup!
16:30:15 hey folks
16:30:17 winston-d: yes
16:30:22 eharney, "if processing a qcow2 backing file" -->>> workaround for the CI would be to run nfs_use_qcow2 = False
16:30:24 we're going to make CI for it
16:30:28 this driver is used by a bunch of sahara users
16:30:35 SergeyLukjanov: confirm, please, about CI
16:30:43 jordanP: that isn't sufficient because you still get qcow2 files if you use snapshots
16:30:53 it's the only way to make performant storage for big data processing
16:31:06 e0ne, yeah, we're working on making CI for it
16:31:13 SergeyLukjanov: great
16:31:19 ok
16:31:21 SergeyLukjanov: Do you have a benchmark .v. local LVM? I can't get more than a few percent difference
16:31:24 our plan now is to add support for it to devstack
16:31:49 we have customers who want to evaluate hadoop with the cinder block device driver
16:31:52 also, i'm a contact person for this driver if you need some maintenance or bugfixing or new feature requests
16:32:00 SergeyLukjanov: by "we're" do you mean mirantis? And if so, is the contact information here https://wiki.openstack.org/wiki/ThirdPartySystems ?
16:33:14 DuncanT, it's a recommendation from the Hadoop community to use JBOD and not LVM
16:33:30 eharney: what is the progress with snapshots with nfs? can we expect this for liberty?
16:33:47 and if so, can we begin having a non-voting job hosted by infra?
16:33:49 thingee, yup, we're now designing how it could be tested
16:34:06 thingee: my understanding is that the majority of the Cinder work can merge once we get the above Nova bug fixed
16:34:16 thingee: but how that Nova issue gets fixed is not clear at the moment
16:34:21 SergeyLukjanov: Yeah, but are there any numbers to back it up? I tried to benchmark it and found basically no difference .v. LVM thick, and it is missing major features that complicate cinder somewhat
16:34:27 SergeyLukjanov: great and I'm assuming for my second question the contact is e0ne or someone else?
16:34:34 :)
16:34:53 thingee: i posted an answer a few minutes earlier
16:34:55 eharney: that's right, I think the nova bug is linked with the blueprint
16:35:01 thingee: yes
16:35:06 great
16:35:08 avishay: there
16:35:12 thingee: yup
16:35:15 thingee, yes, e0ne will be the contact for it
16:35:21 SergeyLukjanov: thanks
16:35:58 eharney, we also need https://review.openstack.org/#/c/192736/ to get in I think
16:36:06 i know CI takes some time, but these drivers/targets fell through the cracks, and i think we should set some deadline for them too
16:36:10 DuncanT, hm, I have no numbers in mind, just everyone who is using Hadoop is asking for directly mapped disks, not lvm
16:36:14 jordanP: yes, and volume format tracking
16:36:18 #info nfs CI is blocked by not supporting snapshots, which is blocked by a nova bug
16:36:33 i obviously don't want LIO removed since that's what we use for our internal tests, but i'd like to know that it works :)
16:36:35 thingee: and https://review.openstack.org/#/c/192736/
16:36:37 #info block device CI is in progress by mirantis. e0ne is the point of contact
16:36:47 SergeyLukjanov, DuncanT: i believe we can produce some performance results for it
16:37:00 e0ne: thanks
16:37:05 e0ne: that will be great
16:37:16 #info nfs CI is also blocked by https://review.openstack.org/#/c/192736/
16:37:19 note: i didn't promise anything :)
16:37:36 eharney: I will make a point to talk to johnthetubaguy about it.
16:37:38 but i'll try to do it...
16:37:51 he reached out to me recently about syncing on some issues between nova and cinder
16:37:55 e0ne: we believe what you believe
16:38:03 avishay: anything else?
16:38:10 winston-d: :)
16:38:12 thingee: nope, thanks for the stage :)
16:38:20 #topic volume migration
16:38:28 jungleboyj: hi
16:38:49 #info current spec in review
16:38:50 #link https://review.openstack.org/#/c/186327/
16:39:19 so I'm just about fine with this. I did think it was weird that the only way to get progress on a migration is through ceilometer
16:39:25 thingee: Oh, I didn't know I was on the hook for this.
16:39:41 the only way around this is to store the progress in the db and have an api call to get it.
16:39:44 I was just looking at that review.
16:39:48 jungleboyj: sorry, vincent isn't around
16:39:58 thingee: No problem.
16:40:19 Yeah, I agree we should be able to get the status from within Cinder.
16:40:25 does anyone have thoughts/opinions on the migration progress being stored and accessible via the api?
16:40:27 thingee: having an api for that is not an option?
16:40:30 Are you ok with tracking the progress in the DB?
16:40:49 erlon: that was my suggestion earlier. vincent just added that in the updated patch of the spec
16:40:56 i think the overall spec is very good. i reviewed it a while ago and it may have changed slightly since, but it's definitely in the right direction.
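[Editor's note: the alternative floated just below (ask cinder-volume for live progress over RPC whenever the API is polled, rather than persisting a progress field in the DB) could look roughly like this. All names here are hypothetical, not the actual cinder RPC API:]

```python
import oslo_messaging as messaging

class VolumeRPCAPI(object):
    """Hypothetical API-side client for querying migration progress."""

    def __init__(self, transport):
        target = messaging.Target(topic="cinder-volume", version="1.0")
        self.client = messaging.RPCClient(transport, target)

    def get_migration_progress(self, ctxt, host, volume_id):
        # A synchronous call to the c-vol service that owns the volume,
        # so the caller always sees fresh progress, never a stale DB row.
        cctxt = self.client.prepare(server=host)
        return cctxt.call(ctxt, "get_migration_progress",
                          volume_id=volume_id)
```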
16:41:06 thingee: ok
16:41:38 jungleboyj: I'm fine, i wanted other people to raise concerns
16:41:50 i'm setting up an environment with all the backends I can to test vincent's patches; my idea is to start with LVM, HNAS NFS, HNAS iSCSI, HUS VM, HUS110
16:42:10 i'm not crazy about a progress bar in the DB because it's not persistent.
16:42:14 here's the diff from when I raised some comments in the spec; it introduces additional documentation for driver maintainers on how to develop this, and getting migration status from the db: https://review.openstack.org/#/c/186327/25..25/specs/liberty/volume-migration-improvement.rst,cm
16:42:34 then have a matrix of what works and the problems of backend integration
16:42:41 The two risks with progress in the db are stale data and too many updates
16:42:48 DuncanT: +1
16:43:02 any suggestions for tests? or other scenarios?
16:43:11 We currently don't do "progress" updates for anything else, do we need to for this?
16:43:25 maybe cinder-volume can periodically notify cinder-scheduler via RPC
16:43:32 If so, it could be "later" work, and def shouldn't be Ceilo dep IMHO
16:43:33 or even better
16:43:38 DuncanT: yeah think of cases where we have a mass migration happening because we need to switch from one pool to another. That would be a lot of updates happening. This won't be an everyday thing, but still something to consider.
16:43:48 if the API is called, just ask cinder-volume what the progress is right now
16:43:56 avishay: +1
16:43:56 jgriffith: I'd love to add progress to backups - I'd probably do it via an RPC to the backup service though, so the info is fresh
16:43:57 jgriffith: +1
16:44:11 avishay: +1
16:44:15 DuncanT: +1
16:44:15 avishay: +1
16:44:22 avishay: can you raise a comment with that suggestion?
16:44:29 thingee: sure
16:44:36 excellent
16:44:59 So assuming vincent has that updated, would people be fine with me approving this spec this week?
16:45:18 thingee: +1
16:45:21 assuming no one raises blocking concerns
16:45:34 thingee: +
16:45:36 1
16:45:38 and by this week I mean friday. gives you time to read things now
16:45:42 Having more progress info in Cinder would be great.
16:46:01 thingee: +1
16:46:14 thingee: +1
16:46:15 #idea have progress for volume migration come from api -> c-vol
16:46:23 thingee: aren't you guys on holiday this friday?
16:46:30 winston-d: I never rest
16:46:31 jungleboyj: What are the (potentially) slow operations? Backup, image stuff, migration. Maybe snap if you're rackspace?
16:46:40 thingee: :)
16:46:49 thingee: you just play
16:46:54 :)
16:47:02 #agreed spec will be approved this friday assuming no blocking concerns and vincent updates spec with idea for progress update
16:47:09 jungleboyj: thanks!
16:47:09 DuncanT: ++
16:47:18 thingee: Thank you!
16:47:25 #topic HA
16:47:27 geguileo: hi
16:47:32 thingee: Hi
16:47:39 Liberty is advancing and I believe right now most HA efforts are waiting on Cinder-nova interactions.
16:47:42 #link https://etherpad.openstack.org/p/CinderNovaAPI
16:48:11 We know there are a lot of problems in Nova-Cinder interaction
16:48:16 As can be seen in that list
16:48:18 You mean c-vol A/A probably.
16:48:23 geguileo: The atomic state change in the API is a breaking API change from the client PoV :-(
16:48:36 geguileo: I dropped a bunch of work on winston-d to work on the error handling with the cinder client. have an update, winston-d?
16:48:47 geguileo: the last time we talked about this we said we'd address that first.
16:48:55 thingee: Yes
16:49:07 thingee: But in that list I see a lot of things that should not be blocking HA work
16:49:17 https://etherpad.openstack.org/p/CinderNovaAPI
16:49:20 As I understand there are some interactions that need fixing
16:49:29 For HA work to be able to start
16:49:32 fwiw, that etherpad has a ton of issues called out between nova -> cinder.
16:49:34 And others that are generic
16:49:39 thingee: sorry, not much progress so far, busy separating the company.
16:49:54 hemna: Are they all preventing Cinder from moving to atomic state changes?
16:50:15 geguileo, this is just a note-taking etherpad that calls out all of the issues, and outstanding bugs
16:50:25 lots of them
16:50:33 geguileo: See above. atomic state change requires an API contract change with our clients, not just nova
16:50:39 I see the list is big and getting bigger, so
16:51:01 I also called out some live migration problems in that etherpad as well. it's not good.
16:51:07 DuncanT: Really? r:??
16:51:47 DuncanT: But even if we need to update cinderclient that should be fairly easy
16:51:52 DuncanT: Any workaround?
16:51:58 geguileo: We accept certain combinations of commands right now and effectively queue them up on the lock in the volume manager. If we go with atomic state changes, that no longer works
16:52:03 geguileo: DuncanT meant clients like client scripts.
16:52:22 this is a good topic for the meetup :)
16:52:29 As a first version of c-vol A/A we can go with Tooz locks for that.
16:52:30 geguileo: We can't just change cinder-client behaviour - and python-cinderclient is not the only client out there
16:52:33 hemna ++
16:52:42 8 minute warning
16:52:55 dulek: I can't figure out safe lock expiration in tooz
16:53:03 the first stage in fixing the local volume locks was to put 'ing' checks in the API and then report VolumeIsBusy, and Nova has to cope with that.
16:53:11 DuncanT: Hm? Locks get dropped when the service dies.
16:53:13 DuncanT: There are heartbeats to keep locks
16:53:26 dulek: Which leaves crap lying around... not good
16:53:28 so I thought someone was going to look at the Nova side to expect the volume-is-busy exception and cope with that as step 1.
16:53:28 No heartbeats, they die
16:53:36 I think that's a good idea regardless.
16:53:40 DuncanT: Why do you think that? What kind of "crap"?
16:53:57 dulek: Half-cloned volumes... half-done snapshots....
16:54:07 Taskflow could revert those operations
16:54:18 taskflow... cough.. cough
16:54:21 DuncanT: By 'service' I meant Cinder services. We have such situations already.
16:54:24 hemna: https://review.openstack.org/#/c/186742/
16:54:31 geguileo: Only with persistence and really good coding
16:54:41 hemna: problem is, currently cinder doesn't raise VolumeIsBusy, right? Not until the lock-free change gets in?
16:54:42 DuncanT: right.
16:54:53 winston-d, it's a catch-22
16:55:03 winston-d: Cinder can't raise it until the client handles it
16:55:06 DuncanT: With bad coding we would go nowhere anyway
16:55:10 winston-d, we can't put it in Cinder until Nova handles it, or CI will puke
16:55:13 DuncanT: +1 about taskflow
16:55:22 geguileo: We've gotten a long way with some very bad code....
16:55:24 taskflow -1 for me
16:55:36 we haven't decided as a team if we are sticking with it yet or not.
16:55:45 Ok, so we're basically saying that we can forget about HA?
16:55:45 Okay, let's get back to using tooz. What's wrong with that besides some performance issues?
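[Editor's note: a minimal sketch of the tooz-based distributed locking dulek proposes, using the standard tooz API with a hypothetical ZooKeeper URL (the backend jgriffith brings up below; Redis and others work too). Whether such locks expire safely when the holder dies is exactly what DuncanT is questioning here:]

```python
from tooz import coordination

# member_id identifies this c-vol instance in the cluster.
coordinator = coordination.get_coordinator(
    "zookeeper://127.0.0.1:2181", b"cinder-volume-node1")
coordinator.start()

# A cluster-wide lock replacing the local volume-manager lock.
lock = coordinator.get_lock(b"volume-<volume-id>")
with lock:
    # perform the snapshot/clone/state change here
    pass

coordinator.stop()
```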
16:55:50 aarefiev is working on performance improvements for taskflow persistence
16:55:53 Or wait for API v4?
16:55:54 geguileo, no.
16:56:01 dulek: it's zookeeper :)
16:56:03 hemna: ok. thx for clarification. i can move on with my change on my 'fixbug1458958' nova branch now.
16:56:04 The persistence we need seems to be at a coarser granularity than tasks...
16:56:04 If c-vol dies, we have half-done snapshots
16:56:05 hemna: Good to know
16:56:08 geguileo, we have to get Nova to expect VolumeIsBusy exceptions.
16:56:15 dulek: and the perf is a pretty big issue IMO
16:56:19 i think the current locks can be removed if we garbage collect volumes offline rather than delete them immediately
16:56:21 he will be able to show some code/reports this week or next one
16:56:23 hemna: Ok, only that?
16:56:36 If the tooz backend service dies, the node should get fenced by Pacemaker.
16:56:38 toox isn't necessarily zk
16:56:41 er, tooz
16:56:41 geguileo, after that, then we put 'ing' checks in the cinder API, and report VolumeIsBusy.
16:56:51 hemna: i'm on it - fixing nova to expect volumeisbusy
16:56:55 then we can remove all but 1 of the locks in the volume manager.
16:57:02 hemna: Ok, so the Nova support for VolumeIsBusy
16:57:03 jgriffith: ZooKeeper is one of the options. Ceilometer relies on Redis as the default backend - they say it's reliable.
16:57:05 greghaynes: fair, but it's the version that "works" and is most deployed IIUC
16:57:12 hemna: And performance will probably be better compared to running a single c-vol
16:57:21 hemna: Are there patches already submitted for that?
16:57:21 dulek: their problem space is different, but that's fair
16:57:27 we still have to deal w/ the lock in taskflow though
16:57:44 geguileo: not yet
16:57:46 hemna: lock in taskflow?
16:57:46 we need distributed locks for 1 lock in the code?
16:57:50 that will be the last one, as far as the volume manager is concerned
16:57:51 dulek: so what's the real advantage/reason for running multiple c-vol services?
16:57:53 dulek, yes
16:58:01 dulek: I'm not certain I think it's that great
16:58:01 yes, and that's because zk is the only system that is known to really 'work' for that problem - if you want those guarantees you'll need to use that system regardless of tooz, or if you don't need them then you can use something lighter weight
16:58:07 2 minutes
16:58:13 winston-d: So the 3 patches that are submitted don't fix our problems for HA?
16:58:20 dulek, I'll find it. it's not as easy to remove....
16:58:22 jgriffith: A/A HA and scaling probably.
16:58:26 dulek: I've also thought we'd be MUCH better off using something like containers with mesos or something else
16:58:30 dulek: disagree
16:58:30 Force iSCSI disconnect after timeout: https://review.openstack.org/#/c/167815/
16:58:32 Rollback if attach_volume times out: https://review.openstack.org/#/c/138664/
16:58:33 dulek: on both counts
16:58:34 Detach and terminate conn if Cinder attach fails: https://review.openstack.org/#/c/186742/
16:58:36 winston-d: ^
16:58:43 dulek: c-vol is nothing but an API interface
16:58:53 dulek: if it dies just respawn it
16:59:10 dulek: and A/A configuration is sort of... mmm... well weird
16:59:14 hemna: Point it to me please, I've become an expert on these flows in the last few weeks. ;)
16:59:17 dulek: when pointing to the same backend device
16:59:20 geguileo: no, i have one WIP patch for nova.
16:59:22 dulek, I'm looking..
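[Editor's note: a sketch of the "'ing' checks in the API" step hemna and winston-d describe above: make the status transition a compare-and-swap so only one of two racing requests proceeds, and raise VolumeIsBusy for Nova to handle. Table, session, and exception names are illustrative, not actual cinder code:]

```python
class VolumeIsBusy(Exception):
    """The volume is mid-operation (an '-ing' status)."""

def begin_operation(session, volume_id, new_status):
    # Atomic compare-and-swap: the UPDATE matches only while the volume
    # is still 'available', so exactly one racing caller wins and the
    # loser gets VolumeIsBusy instead of queueing on a local lock.
    rows = session.execute(
        "UPDATE volumes SET status = :new"
        " WHERE id = :id AND status = 'available'",
        {"new": new_status, "id": volume_id}).rowcount
    if rows == 0:
        raise VolumeIsBusy("volume %s is busy" % volume_id)
```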
16:59:28 jgriffith, the "name" (the host) of the service is "coded" in every volume 16:59:33 distributed locks are fixing the solution with a bulldozer. i'm sure we could figure out how to avoid the lock entirely. 16:59:39 geguileo, winston-d lets talk in the #openstack-cinder room after the meeting 16:59:40 jgriffith, so you can't respawn it everywhere 16:59:47 jordanP: sure you can 16:59:49 thingee: Ok, good idea 16:59:52 jgriffith: Hm... So DuncanT was tasked with making it A/A in the first place. What were the motivations? 17:00:01 jgriffith, we need to tweak the "host' config flag then 17:00:03 #endmeeting