17:01:29 <tjones> #startmeeting vmwareapi
17:01:30 <openstack> Meeting started Wed Aug  6 17:01:29 2014 UTC and is due to finish in 60 minutes.  The chair is tjones. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:33 <openstack> The meeting name has been set to 'vmwareapi'
17:02:01 <tjones> anyone here today?
17:02:10 * mdbooth waves
17:02:12 <vuil> o/
17:02:13 <arnaud> o/
17:02:43 <tjones> hey guys - so it's all about reviews and bug fixing i think for us.
17:03:34 <garyk> hi
17:03:59 <tjones> it looks to me that the only BP we will get in juno are spawn refactor, oslo.vmware, and v3 diags.  anyone think anything different? - like spbm and vsan??
17:04:20 <mdbooth> tjones: I don't see it happening
17:04:49 <garyk> tjones: we are working on spbm and ephemeral - all have code posted
17:04:51 <vuil> vsan bp was approved too I thought, but with all the logjam of patches still needed reviews, yeah.
17:04:55 <tjones> spbm is set to "good progress" - garyk what do you think?
17:05:15 <garyk> tjones: code was completed about 8 months ago. we just need to rebase - i will do that tomorrow
17:05:35 <garyk> yesterday we had the oslo.vmware updated so the spbm code can now be used
17:05:51 <garyk> it is all above the oslo.vmware integration patch
17:05:55 <tjones> garyk:  im spacing - where is ephemeral one ??  https://blueprints.launchpad.net/openstack?searchtext=vmware  (just got back from vacation and still fuzzy)
17:05:59 <garyk> which may land in 2021
17:06:10 <tjones> lol
17:06:19 <garyk> https://review.openstack.org/109432
17:06:40 <vuil> the few patches using oslo.vmware to provide streamOptimized/vsan support are being updated right now as well.
17:06:41 <mdbooth> garyk: Incidentally, I spent the afternoon looking at bdm
17:06:52 <mdbooth> And I agree with you
17:07:05 <mdbooth> about ephemeral, that is
17:07:08 <garyk> ah, that it is only relevant to libvirt?
17:07:18 <mdbooth> No, it's definitely relevant to us
17:07:30 <mdbooth> However, there's no need for it to be in this patch
17:07:42 <vuil> *missing context re bdm*
17:07:51 <mdbooth> block device mapping
17:07:52 <garyk> ah, ok. then next stuff i'll add in a patch after that
17:08:21 <mdbooth> garyk: See my big comment in spawn() about the driver behaviour being broken wrt bdm?
17:08:27 <mdbooth> I think it needs to be fixed along with that
17:08:48 <mdbooth> Probably quite an involved patch
17:08:51 <garyk> mdbooth: i think that the bdm support is broken in general
17:08:58 * mdbooth would be happy to write it, though
17:08:58 <garyk> but that is for another discussion -
17:09:03 <mdbooth> garyk: +1 :)
17:09:14 <tjones> ok so lets go through the BP 1 by 1 (we can revisit bdm later on)
17:09:15 <tjones> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/vmware-spawn-refactor,n,z
17:09:16 <vuil> I noted the same in my earlier phase 3 stuff too, so yeah this should be one of the first things to deal with after the refactor
17:09:24 <tjones> here's spawn
17:09:29 <garyk> after we chatted there was someone who wrote on irc that he did not manage to get it working in I…
17:10:21 <mdbooth> tjones: That's currently blocked on minesweeper
17:10:29 <tjones> arrggghhhhhh!!!
17:10:41 <mdbooth> Vui and I chatted just before this meeting about the phase 3 stuff
17:10:45 <tjones> down again?
17:11:03 <dansmith> tjones: hasn't it been down for a couple weeks now? or is it just really spotty?
17:11:07 <mdbooth> garyk had a handle on it, I believe
17:11:10 <tjones> is phase 3 in that list?  im losing track of
17:11:18 <dansmith> we've been asking about minesweeper status in -nova for at least two weeks
17:11:30 <garyk> dansmith: no, it has been up - it's just the last week, with a few problems in the last 2 days
17:11:39 <tjones> dansmith: spotty.  i literally got back from vacation just before this meeting.  i'll get an update after
17:11:45 <mdbooth> tjones: Yes. Basically all Vui's patches are phase 3.
17:11:48 <dansmith> garyk: hmm, okay I've recheck-vmware'd a few things and never get responses
17:12:08 <garyk> dansmith: that is due to the fact that the queue is very long because it was down for a while
17:12:24 <dansmith> okay, I wish we could see the queue so we'd know, but... okay
17:12:33 <vuil> Matt had some nice suggestions, I will be taking on those and posting an update to the phase 3 chain of patches.
17:12:36 <garyk> it was averaging about 14 patches a day due to infra issues
17:12:41 <mdbooth> +1
17:13:00 <mdbooth> I understand it's on internal infrastructure, though
17:13:57 <tjones> ok im assuming oslo.vmware also blocked on minesweeper
17:14:23 <vuil> and reviews obv
17:14:31 <garyk> tjones: all of those patches had +1's, then we needed to rebase, and it was at a time when ms was down. so back to square 1.
17:14:35 <dansmith> tjones: we talked last week about not approving any without them
17:14:50 <dansmith> them == votes
17:14:57 <tjones> yes - i agree we need minesweeper runs
17:15:02 <dansmith> I've been trying to come back to patches I've -2d regularly to check for minesweeper votes
17:15:27 <dansmith> I definitely don't want to be the guy that -2s for that and then holds us up after MS shows up :)
17:15:35 <tjones> :-D
17:15:52 <garyk> dansmith: it is understood. rules are rules
17:16:02 <tjones> ok here's the complete list of our stuff out for review
17:16:03 <garyk> we just need to get our act together with minesweeper
17:16:05 <tjones> #link https://review.openstack.org/#/q/status:open+project:openstack/nova+message:vmware,n,z
17:16:12 <mdbooth> Hmm, I have an actual -1 from dansmith
17:16:15 <garyk> but when it is up it would be nice if the patches could get some extra eyes
17:16:22 <dansmith> mdbooth: frame it!
17:16:26 <mdbooth> In phase 2 spawn refactor
17:16:32 <tjones> we still need our team to be reviewing like mad and so when MS comes back we are ready
17:16:32 <vuil> @dansmith: on a related note, even when minesweeper passes, we have seen -1 on xenserver CI quite a bit despite rechecks - does that -1 factor into the filtering for reviewable things?
17:16:34 <mdbooth> dansmith: I'm honoured :)
17:16:42 <dansmith> vuil: not to me
17:16:55 * mdbooth can address that tomorrow
17:16:57 <dansmith> vuil: everyone gets -1s from xen ci right now :)
17:17:10 <garyk> just note that ms does not run on patches in the test directory
17:17:11 <dansmith> vuil: they're working hard on that too
17:17:29 <garyk> so a patch like https://review.openstack.org/105454 should not be blocked
17:17:37 <vuil> ah got it
17:17:43 <dansmith> garyk: yeah, that makes sense to me
17:18:20 <tjones> anything else on BP ?
17:19:21 <tjones> *listening*
17:19:38 <tjones> ok lets talk about bugs
17:19:41 <tjones> #topic bugs
17:19:50 <tjones> #link http://tinyurl.com/p28mz43
17:19:53 <tjones> we have 59
17:20:04 <tjones> a number that is not going down
17:20:27 <tjones> we have a number of these in new
17:20:34 <tjones> or triaged state
17:21:01 <garyk> tjones: a lot of them have been triaged and a lot are in progress and a lot have been completed.
17:21:08 <garyk> we need to do a cleanup
17:21:12 <tjones> i filtered out completed
17:21:23 <garyk> some are also very concerning - basically the multi cluster stuff breaks a lot of things
17:21:34 <tjones> these are only new, in progress, triaged, and confirmed
17:21:45 <garyk> i am in favor of pushing rado's patch which drops the support as we discussed at the summit
17:22:44 <tjones> HP raised a lot of concerns about that as i recall
17:23:36 <mdbooth> The principal concern was memory usage on vsphere, right?
17:23:46 <garyk> yeah, i asked that they write to the list so that we can get some discussion going about it and there was nothing
17:23:54 <mdbooth> I saw that
17:23:57 <garyk> that was one - but it is something that can be addressed
17:23:57 <tjones> i thought they posted something
17:24:12 <garyk> my main concern is that each compute node has its own cache dir - that is very costly
17:24:18 <tjones> i thought it was spinning up an n-compute for each cluster that they did not like, and the image cache
17:24:41 <mdbooth> So, I happened to read a tripleo thing about deploying multiple novas per node earlier
17:24:48 <garyk> tjones: i do not recall seeing anything. if someone did can you please forward the mail message to me
17:24:59 <mdbooth> That would solve a provisioning issue, but not the memory usage thing
17:25:04 <tjones> https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg29338.html
17:25:38 <mdbooth> garyk: I think that's something we're going to have to live with until we get inter-node locking
17:26:36 <tjones> i  think the biggest issue is what to do with existing customers that have deployed the current way.  what is the upgrade path?
17:26:44 <mdbooth> garyk: Random thought: how likely is the backing store to do de-duplication?
17:26:49 <garyk> tjones: thanks
17:27:10 <vuil> @mdbooth storage-vendor specific
17:27:17 <mdbooth> Right, but is it common?
17:27:46 <garyk> mdbooth: sorry i do not understand
17:27:51 <vuil> *not entirely sure*
17:28:08 <mdbooth> Would suck to have to recommend it, for sure
17:29:19 <mdbooth> tjones: Is anybody looking at that, btw?
17:30:21 <tjones> mdbooth: looking at??
17:30:30 <tjones> upgrade?
17:30:33 <mdbooth> tjones: yeah
17:30:40 <tjones> not that i am aware of
17:30:58 <tjones> i think we have to tackle that before we do this change though
17:31:21 <mdbooth> Is anybody currently motivated to do it?
17:31:37 <mdbooth> s/motivated/motivatable/
17:31:47 <vuil> so the path we are talking about is provide inter-node locking then make image caching more efficient
17:31:47 <tjones> i dont see any way around saying "you have to reconfigure your n-compute"
17:32:11 <vuil> before taking out multi-cluster?
17:32:19 <mdbooth> tjones: Right. Presumably also db hacking.
17:32:24 <mdbooth> Much nastiness.
17:32:30 <tjones> yes
17:32:33 <garyk> mdbooth: there is no need for db hacking
17:32:40 <garyk> there is a patch for the esx migration
17:32:49 <garyk> an external utility
17:33:03 <garyk> https://review.openstack.org/101744
17:34:03 <tjones> so the issues with this are 1. soap takes too much memory / connection (could be solved with pyvmomi) 2.  image cache duplication 3.  upgrade path.
17:34:06 <tjones> right?
17:34:12 <garyk> btw there is also a patch for the actual esx deprecation
17:34:32 <mdbooth> tjones: I don't see how pyvmomi would solve 1
17:34:43 <tjones> doesn't it use a different transport?
17:34:47 <tjones> not soap
17:34:49 <mdbooth> If it can, then we can presumably solve it without pyvmomi
17:34:54 <mdbooth> Don't think so
17:35:07 <vuil> no still soap, but on a more lightweight stack
17:35:27 <mdbooth> More lightweight on the client side, no difference to the server
17:35:29 <vuil> as in we may save some memory usage by taking out suds
17:35:40 <vuil> *remains to be seen*
17:35:47 <mdbooth> Our problem is on the server, though, iiuc
17:36:02 <vuil> hp was concerned about multiple computes each taking up lots of resources.
17:36:35 <garyk> my concerns about the multi cluster support are edge cases - for example the resize issue
17:36:37 <mdbooth> vuil: That was resources on vcenter, though, right?
17:36:38 <vuil> I don't think server impact is going to be much
17:36:44 <tjones> i thought they were concerned on the client side - running multiple n-compute
17:36:57 * mdbooth might have misunderstood this
17:36:58 <vuil> no, actual python n-cpu processes taking up memory.
17:37:05 <garyk> https://review.openstack.org/108225
17:37:05 <tjones> yeah that is what i thought
17:37:17 <dansmith> I thought I heard somethign about server side too
17:37:27 <dansmith> because each connection to vcenter comes with a lot of overhead
17:37:30 <vuil> in terms of load on VC it is pretty much the same whether it comes from one ncpu managing N clusters or N ncpu managing one each
17:37:34 <mdbooth> ~140MB in the driver
17:37:37 <mdbooth> Got it
17:37:39 <dansmith> so one compute using one connection for lots of machines
17:37:44 <mdbooth> Ok, that's way more manageable
17:38:22 <vuil> @dansmith, addition of a couple more connections should not be too big of a deal.
17:38:31 <dansmith> vuil: okay
17:39:19 <mdbooth> So, tripleo were talking about deploying multiple novas per server in separate containers
17:39:57 <tjones> even if we can decrease the client side memory load we still have the duplicate cache and upgrade.  with the duplicate cache, we could solve it by using a shared datastore for glance - right?
17:40:03 <mdbooth> And in the grand scheme of things, anybody deploying 32 VMware clusters worth of openstack isn't going to notice the cost of 4GB of RAM
17:40:27 <mdbooth> Although it's inelegant to waste it, it's probably not a huge deal
17:41:11 <vuil> mdbooth: my thoughts as well.
17:42:06 <mdbooth> Duplicate cache:
17:42:18 <mdbooth> Only an issue for clusters sharing datastores
17:43:16 <mdbooth> Otherwise the cache would be duplicated anyway
17:43:47 <tjones> we need to get to a place where we can implement this change without screwing HP's existing customers…
17:44:13 <garyk> tjones: yes, they are in production and this would break an installation
17:44:24 <tjones> yep
17:44:51 <dansmith> so,
17:45:06 <dansmith> sounds like maybe we should put something into the juno release notes that such an arrangement is deprecated
17:45:17 <dansmith> to give time and notice so we can get something into kilo?
17:45:17 <tjones> i was just typing that very thing
17:45:25 <dansmith> cool
17:45:29 <tjones> we should deprecate this to give them some time
17:45:45 <dansmith> is there a config variable that would go away that we can also document as deprecated?
17:46:09 <dansmith> and, despite its limited usefulness, we should also log.warning("this is going away soon") if they have that configured
17:46:16 <dansmith> per usual protocol
17:46:22 <garyk> no, there is no specific config var
17:46:22 <mdbooth> The one which selects clusters, presumably
17:46:34 <tjones> yeah - i think it's a list
17:46:36 <garyk> we can identify if there is more than one cluster configured
17:46:48 <dansmith> so, iirc, the patch changed it from a list to a string, which we can't do anyway
17:46:52 <garyk> but i think that we should address the issue on the list with the guys from hp
17:47:16 <mdbooth> self._cluster_names = CONF.vmware.cluster_name
17:47:17 <dansmith> so: 1. document in release notes, 2. log.warning() if len(list)>1 ?
17:47:25 <mdbooth> Hmm
17:47:34 <mdbooth> So it's still going to be called 'cluster_name', presumably
17:47:44 <mdbooth> Except it's no longer going to accept a list
17:47:47 <garyk> https://github.com/openstack/nova/blob/master/nova/virt/vmwareapi/driver.py#L60
17:47:51 <mdbooth> That's ugly
17:47:59 <dansmith> mdbooth: yeah, that was my complaint, IIRC
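[A minimal sketch of the log.warning() check dansmith describes above, assuming the driver keeps reading the cluster list from CONF.vmware.cluster_name as quoted at 17:47:16; not code from any posted review.]

```python
import logging

LOG = logging.getLogger(__name__)


def warn_if_multi_cluster(cluster_names):
    """Warn when more than one cluster is configured for a single compute.

    In the driver this would be fed the CONF.vmware.cluster_name value
    (a MultiStrOpt, i.e. a list of cluster names) during initialization.
    """
    if cluster_names and len(cluster_names) > 1:
        LOG.warning("Managing %d clusters from one nova-compute is "
                    "deprecated; see the Juno release notes.",
                    len(cluster_names))
```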
17:48:06 <garyk> i think that this needs more discussion prior to us doing a deprecation warning
17:48:41 <tjones> we really should reply back on that thread
17:48:46 <dansmith> garyk: IMHO, we mark as deprecated to force the conversation, nothing says we *have* to yank it in kilo if not ready
17:48:53 <garyk> an existing customer will also need an upgrade patch for cached images.
17:48:58 <tjones> to keep kiran in the loop - this time is not very good for him
17:49:10 <garyk> dansmith: i think that is not a good approach
17:49:24 <garyk> why force something until we have thought it out properly
17:49:50 <dansmith> garyk: well, because I feel like we've decided that it's coming out, we just have to figure out how it's going to look when it's out
17:50:01 <dansmith> garyk: and because if we're going to deprecate, starting the timer limits when we can actually remove it
17:50:15 <garyk> no, we have not decided
17:50:28 <garyk> a few people have but the general community has some issues with this
17:50:30 <dansmith> we agreed at summit, no?
17:50:51 <garyk> at the summit we discussed it in a small room. after the summit people started to raise issues
17:51:04 <garyk> are we not allowed to change things after problems and issues are raised?
17:51:15 <garyk> is that going to build a healthy community discussion?
17:51:40 <dansmith> sure, we can.. I didn't think any of the discussion here was considering the option of not doing it
17:51:50 <dansmith> but that's fine, see how the ML thread goes
17:51:50 <garyk> some people who work on the driver were not able to attend the summit, and only after they were aware of the discussion did they raise their issues
17:52:08 <garyk> and i think that they have come with some valid arguments
17:52:09 <dansmith> but we should make sure to revisit before too late in juno, merely for the deprecation timer ... timing :)
17:52:27 <garyk> to be honest i am kicking myself for not having found the problems when we originally added the feature, but we all approved that
17:52:42 <tjones> ok the way we got to this discussion was because we were talking about bugs
17:52:48 <tjones> #link http://preview.tinyurl.com/kkyw9c4
17:52:50 <garyk> i will follow up on this list tomorrow about this
17:53:05 <tjones> of the 59 there are 20 that are not owned by anyone and some are high prio
17:53:27 <tjones> garyk thanks for following up
17:54:33 <tjones> so - 6 minutes left.  please do reviews, fix bugs, etc.
17:54:37 <tjones> #topic open discussion
17:54:40 <tjones> anything else?
17:54:48 <garyk> tjones: i honestly do not like the multi cluster support and would be happy to see it dropped, but we need to find something that works :)
17:55:10 <tjones> garyk: i don't disagree at all
17:55:38 <mdbooth> tjones: Any chance of getting more hardware for minesweeper?
17:55:55 <tjones> mdbooth: it is on order (and has been for a while).  it will get here in sept
17:56:03 <tjones> it's a slllllooooowwwww process
17:56:11 * mdbooth has been there
17:56:15 <garyk> mdbooth: my understanding is that there is a request for more hardware
17:56:24 <garyk> and we all know how long that can take
17:56:29 <tjones> dansmith: im going to see if we can figure out how to get external access to the minesweeper status and queue
17:56:48 <dansmith> tjones: just scraping it and POSTing to an external site would be enough
17:56:58 <dansmith> tjones: just so we have some indication on whether we should ping you, or wait, or... :)
17:57:03 <tjones> yes that is what i am thinking - put it on the same place as my bug list
17:57:07 <dansmith> yeah
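[A rough sketch of the scrape-and-POST idea dansmith suggests at 17:56:48; both URLs are hypothetical placeholders, since the real minesweeper status page is internal-only.]

```python
import requests

INTERNAL_STATUS_URL = "http://minesweeper.internal.example/status"  # hypothetical
PUBLIC_POST_URL = "https://status.example.org/minesweeper"          # hypothetical


def publish_status():
    # Fetch the queue/status page from the internal CI host...
    resp = requests.get(INTERNAL_STATUS_URL, timeout=30)
    resp.raise_for_status()
    # ...and push the raw body to an externally reachable site.
    requests.post(PUBLIC_POST_URL, data={"status": resp.text}, timeout=30).raise_for_status()


if __name__ == "__main__":
    publish_status()
```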
17:57:25 <tjones> if only i could get a free RAX vm....
17:57:33 <tjones> :-)
17:57:48 <mdbooth> :)
17:58:09 <tjones> ok i have nothing else - anyone??
17:58:14 <mdbooth> tjones: Email it somewhere?
17:58:18 <dansmith> heaven forbid, the multi-billion dollar company pay $12/mo for hosting :)
17:58:31 <tjones> lol
17:58:40 <mdbooth> dansmith: Do you know how much the lawyers to approve $12/mo hosting cost?
17:58:41 <tjones> it's $14 :-D
17:58:59 <dansmith> tjones: https://www.digitalocean.com/pricing/
17:59:20 <dansmith> tjones: $5 would be plenty for this :P
17:59:25 <tjones> nice!  cheaper that AWS
17:59:31 <dansmith> mdbooth: I used to work for IBM, I know all about this :)
18:00:04 <mdbooth> We're all the same :)
18:00:08 <tjones> ok i think we are done - thanks folks!
18:00:34 <garyk> have a good one
18:01:00 <mdbooth> g'night
18:01:53 <mdbooth> #endmeeting ?