21:00:00 <dansmith> #startmeeting nova_cells
21:00:01 <openstack> Meeting started Wed Nov 30 21:00:00 2016 UTC and is due to finish in 60 minutes.  The chair is dansmith. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:05 <openstack> The meeting name has been set to 'nova_cells'
21:00:26 <dansmith> *ahem*
21:00:26 <melwitt> o/
21:00:50 <mriedem> o/
21:01:15 <dansmith> well...
21:01:28 <dansmith> was really hoping for alaski to show up
21:01:32 <dansmith> because I have questions
21:01:36 <dansmith> concerns
21:01:43 <dtp> o/
21:01:54 <dtp> i made it
21:01:59 <dansmith> congrats :)
21:02:02 <dtp> thank you
21:02:09 <dansmith> #topic cells testing / bugs
21:02:22 <dansmith> anything on testing/bugs this week?
21:02:23 <mriedem> the only bug we had was that pg one
21:02:32 <mriedem> only took 5 days to fix it
21:02:39 <dansmith> true, I guess that was because of cellsv2
21:03:06 <mriedem> otherwise nada
21:03:08 <dansmith> #topic open reviews
21:03:16 <dansmith> so, I refreshed my set again just a bit ago:
21:03:20 <dansmith> https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/cells-scheduling-interaction
21:03:27 <dansmith> on top of melwitt's cell database fixture
21:03:38 <dansmith> the last patch is mostly good for unit tests at this point,
21:03:40 <mriedem> so https://review.openstack.org/#/c/396417/ got sorted out?
21:03:46 <dansmith> except for one ugly gotcha around cellsv1
21:04:02 <dansmith> mriedem: I need a fix to oslo.messaging for that, which I have proposed
21:04:08 <dansmith> mriedem: but their gate is fubar right now
21:04:17 <mriedem> do they know what's up?
21:04:22 <mriedem> like, are they fixing their stuff?
21:04:25 <dansmith> mriedem: so I have a hack in there to make it work, but I don't expect us to merge it
21:04:34 <dansmith> mriedem: it's zmq.. someone lobbed a patch at it claiming to fix it, but it didn't
21:04:42 <dansmith> mriedem: but yeah, I've been in there poking people about it
21:04:58 <dansmith> my fix has been +W for a few days but can't make it through gate
21:05:01 <mriedem> are you going to wip https://review.openstack.org/#/c/396417/ then?
21:05:09 <mriedem> we could also do the old depends-on dance
21:05:13 <dansmith> if I rechecked it harder I could maybe get it in
21:05:15 <mriedem> but will need to depend on a release and g-r bump
21:05:45 <dansmith> yeah
21:05:48 <dansmith> mriedem: this is the hack: https://review.openstack.org/#/c/396417/15/nova/rpc.py@77
21:05:54 <mriedem> yeah i saw it
21:06:01 <dansmith> yeah I guess I can -W it
21:06:18 <mriedem> and https://review.openstack.org/#/c/403924/ is what we want
21:06:26 <dansmith> yar
21:06:49 <mriedem> fun http://logs.openstack.org/24/403924/1/check/gate-oslo.messaging-dsvm-functional-py27-zeromq/35e71b1/console.html#_2016-11-29_19_18_53_458260
21:07:11 <dansmith> anyway, so yeah that is blocked at the moment
21:07:13 <mriedem> if it's not a regression and no one is fixing it we could skip the test...
21:07:20 <mriedem> but maybe that's extreme for right now
21:07:21 <dansmith> *shrug*
21:07:36 <mriedem> i guess we can light a fire when we've reviewed the entire series
21:07:40 <dansmith> yeah
21:07:48 <dansmith> the bottom couple of patches there can merge I think
21:07:58 <dansmith> at least
21:08:03 <dansmith> anyway, there is a bigger problem at the top of that set
21:08:18 <dansmith> and that is around cellsv1
21:08:27 <mriedem> which one? https://review.openstack.org/#/c/396775/ ?
21:08:43 <dansmith> no that one is out for the moment
21:08:52 <dansmith> https://review.openstack.org/#/c/396417
21:09:21 <dansmith> so the deal there is something I realized yesterday while trying to squash the last couple of unit test failures
21:09:48 <dansmith> in cellsv1, we do the compute/api bit, create in the api, then call down to the cell, which replays the compute/api part in the cell,
21:09:58 <dansmith> before finally casting to conductor to get things going
21:10:12 <dansmith> if we finish this move of the create from api to conductor,
21:10:49 <dansmith> then we're not going to create in the api cell, and we're going to call to conductor in the api cell,
21:11:05 <dansmith> which is then going to try to talk directly to the compute instead of calling through the cells rpc to get down there
21:11:10 <dansmith> which is ... scary
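[editor's note] The cells v1 boot flow dansmith walks through above can be sketched roughly as follows. This is a hypothetical simplification with made-up names, not Nova's real API: compute/api runs in the API cell and creates the instance there, the cells RPC layer replays the same compute/api call in the child cell, and only then does the child cell cast to conductor.

```python
# Illustrative sketch of the cells v1 boot flow described above.
# All names here are hypothetical stand-ins, not real Nova code.

def boot_cells_v1(api_cell_db, child_cell_db, events):
    # 1. compute/api in the API cell creates the instance record up top
    instance = {"uuid": "fake-uuid", "vm_state": "building"}
    api_cell_db.append(instance)
    events.append("api-cell:create")

    # 2. the cells RPC layer replays the compute/api call in the child
    #    cell, creating a second copy of the instance record down there
    child_cell_db.append(dict(instance))
    events.append("child-cell:replay-create")

    # 3. only then does the child cell cast to conductor to start the build
    events.append("child-cell:cast-to-conductor")
    return instance
```

The concern in the discussion is that moving the create from api to conductor breaks step 1: nothing gets created in the API cell, and the API-cell conductor would try to reach compute directly instead of going through cells RPC.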
21:11:29 <dansmith> because (a) I don't want to add any more cells calls, and certainly not for things that need to call to conductor
21:11:36 <dansmith> so I'm not really sure what to do
21:11:47 <dansmith> which is what I was hoping to poke alaski about today
21:11:58 <dansmith> does any of that ranting make sense?
21:12:01 <mriedem> and we don't want a bunch of if CONF.cells.enabled checks all over this flow
21:12:05 <dansmith> yeah
21:12:16 <dansmith> and we don't want to have to keep two instance create paths either
21:12:32 <mriedem> i won't profess to understand the entire create flow through the api for cells v1
21:12:43 <dansmith> well, it's complicated
21:12:52 <mriedem> so the parent api cell creates the instance in the parent api cell db?
21:12:54 <dansmith> but we do it twice, once in the api cell and once in the child cell
21:13:02 <mriedem> oh right the instance is in both places
21:13:04 <mriedem> hence the up calls to sync
21:13:11 <dansmith> and that replication happens at the compute/api layer,
21:13:30 <dansmith> by intercepting the compute/api call, doing some of it, and then re-playing it in the child cell
21:13:36 <mriedem> and when status changes in the child cell while building the instance, we send that up to the parent cell to update the instance there too right?
21:13:47 <dansmith> yeah
21:14:09 <mriedem> and cellsv1 doesn't know jack about build requests...
21:14:15 <dansmith> nor does it know about conductor
21:14:21 <dansmith> it's purely replication at the compute/api layer
21:14:55 <dansmith> we also *have* to create in the api cell before we create in the child because otherwise the api stops working,
21:15:06 <dansmith> so we can't even try to patch it up by making the first sync create the instance back at the top again
21:16:20 <dansmith> so yeah
21:16:22 <dansmith> pretty much the suck.
21:16:25 <melwitt> I was thinking there's already "if cell_type == 'api'" in the compute/api and maybe we could just do the old way in there ...
21:16:36 <melwitt> old create I mean
21:16:53 <dansmith> melwitt: well, it means we keep two paths
21:17:05 <mriedem> CellsScheduler is parent/api cell right?
21:17:11 <mriedem> which builds the instances in the api cell
21:17:20 <mriedem> and calls the compute api code to create the instance in the api cell db
21:17:21 <melwitt> gah pushed the wrong key
21:17:35 <dansmith> mriedem: I dunno tbh
21:17:50 <mriedem> i'm pretty sure it is b/c _create_instances_here calls instance = self.compute_api.create_db_entry_for_new_instance(
21:17:55 <melwitt> don't we have to keep the two paths though? I mean wherever it says things like "cell_type =='api'" has to stay because cells v1 needs it
21:18:08 <mriedem> non-cellsv1 create_db_entry_for_new_instance doesn't create the instance in the db
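[editor's note] The distinction mriedem points out can be boiled down to a toy version: in cells v1 the scheduler path really persists the instance via `create_db_entry_for_new_instance`, while in the modern flow no instance row is written at that point (only a build request exists). This is a simplified stand-in sharing only the function name with Nova's real code.

```python
# Toy illustration of the behavioral difference mriedem describes.
# Only the function name comes from Nova; the body is hypothetical.

def create_db_entry_for_new_instance(db, instance, cells_v1=False):
    if cells_v1:
        # cells v1 path: actually writes the instance row in the api cell db
        db.append(instance)
        return instance
    # non-cells-v1 path: no instance row is created here
    return instance
```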
21:18:34 <dansmith> melwitt: well, yes, but having instance created in two different *services* is uglier than just in two places in the same service, IMHO
21:19:00 <dansmith> melwitt: because then you get into all kinds of potentials for races I think, assuming that the instance is created by a certain point, when it might not be, etc
21:19:21 <dansmith> I mean, obviously the path out of this box is going to be some amount of "if cells1, else" sort of thing
21:19:29 <dansmith> but I'm just dreading it
21:19:39 <melwitt> yeah
21:20:38 <dansmith> anyway,
21:20:46 <mriedem> hmm, so have we hit the patch that sees this fail in cells v1 yet?
21:20:48 <dansmith> I was hoping that maybe someone had already thought about this and what the best plan would be
21:21:15 <dansmith> mriedem: it fails in unit tests, I'm still waiting for the run on cellsv1, but I might have other breakages to fix first
21:21:19 <dansmith> I just pushed it up like 30 minutes ago
21:21:48 <mriedem> and that's just https://review.openstack.org/#/c/396417 ?
21:21:59 <mriedem> i figured it would manifest in https://review.openstack.org/#/c/319379/
21:22:05 <dansmith> no, https://review.openstack.org/#/c/319379
21:22:10 <mriedem> ah ok
21:22:10 <mriedem> yeah
21:22:14 <mriedem> that makes sense
21:23:03 <mriedem> so, unit tests will fail, sure, but i think we should check out what the cells v1 job failures are and then start poking at it
21:23:05 <dansmith> anyway, so I will try to fix whatever that shakes out in the next day for the normal path and then see if I can start throwing things at it to make it still work
21:23:06 <melwitt> yeah, I guess I would be thinking to preserve what cells v1 is doing until we remove it, which is the two path thing. because the alternatives involve trying to sync the create upward to the API or something, right?
21:23:50 <mriedem> could we....take the build request that's created in the api now and use that to hydrate and create the instance in the api/parent cell db?
21:23:58 <dansmith> melwitt: yeah, I'm just concerned about other stuff, like bits we have moved out of compute api to conductor that won't get run for cellsv1, like bdm validation or something (not really, but something like that)
21:24:03 <mriedem> instead of the cells scheduler calling the compute api to create the instance?
21:24:28 <dansmith> mriedem: no, we have to create it at about the same place as we do now, or later, because of all the junk that compute/api does to the instance
21:24:32 <melwitt> I see
21:24:54 <mriedem> does to the instance how?
21:25:00 <mriedem> like figuring out the name and stuff?
21:25:16 <dansmith> mriedem: well, for one thing it handles the "num_instances" bit, as well as yeah names and stuff
21:25:17 <mriedem> i thought that was all done before the instance was serialized and the build request was created
21:25:31 <dansmith> it's spread all over the place
21:25:43 <dansmith> well, I should say,
21:25:59 <dansmith> I don't know where exactly the cell scheduler bit plugs in,
21:26:00 <dansmith> so maybe it's more in the middle than I think, I dunno
21:26:11 <mriedem> ok, so....sounds like maybe 2 options,
21:26:15 <dansmith> regardless, fixing at that layer seems like more new different code
21:26:18 <dansmith> which I'm afraid of
21:26:28 <mriedem> 1. if the build_request.instance is 90% what we need in the api cell for v1, then maybe we can use that to create it in the api cell for v1
21:26:46 <mriedem> 2. else we see if we can hack in a conditional here or there to do the dirty deeds done dirt cheap
21:26:56 <mriedem> until we can kill cells v1
21:26:59 <dansmith> I think #2 is the thing to try first
21:27:02 <dansmith> meaning,
21:27:02 <mriedem> sure
21:27:11 <melwitt> I like that my suggestion is the AC/DC one
21:27:12 <dansmith> right now we have a place where we just no longer call instance.create()
21:27:14 <mriedem> 3. alaski saves our asses
21:27:34 <dansmith> and so we'd just do "if cellsv1, do create like old times" but then we also have to not call the new conductor method I think
21:27:36 <melwitt> yeah, alaski, come and solve this for us
21:27:36 <dansmith> something
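[editor's note] Option 2 as dansmith sketches it — a narrow conditional at the spot where `instance.create()` was removed — might take roughly this shape. Purely illustrative, with hypothetical names; the real change would live in Nova's compute/api create path.

```python
# One possible shape of the "if cellsv1, do create like old times"
# conditional. All names are hypothetical stand-ins.

def maybe_create_instance(instance_create, cast_to_conductor, is_cells_v1):
    if is_cells_v1:
        # cells v1: create the instance in the api cell like old times,
        # and skip the new conductor method entirely
        instance_create()
        return "created-in-api-cell"
    # new flow: defer the DB create to conductor
    cast_to_conductor()
    return "deferred-to-conductor"
```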
21:27:47 <mriedem> dansmith: yeah that's not too terrible
21:27:49 <mriedem> if that's all it is
21:27:58 <dansmith> I think it's end-of-days worst possible thing
21:28:07 <mriedem> # NOTE(danms): this makes bon scott roll in his grave, but we have to do this...
21:28:19 <dtp> lol
21:28:19 <dansmith> which is probably "not too terrible" times standard dansmith inflation factor
21:28:20 <melwitt> haha
21:28:41 <mriedem> 4. dtp fixes this all for us
21:28:48 <dtp> i wish!
21:28:54 * dansmith reassigns to dtp
21:28:55 <mriedem> did laski mention any of this in his brain dump patch?
21:29:00 <dansmith> mriedem: no
21:29:01 <dansmith> mriedem: I looked
21:29:04 <dansmith> a lot. :)
21:29:05 <mriedem> dagnabbit
21:29:06 <mriedem> ha
21:29:06 <melwitt> I looked too
21:29:19 <melwitt> first thing I did :)
21:29:26 <dansmith> alright, anyway, enough dwelling on this
21:29:35 <mriedem> so,
21:29:37 <mriedem> quotas?!
21:29:38 <dansmith> I will keep plugging away
21:29:46 * dansmith hands the mic to melwitt
21:30:31 <melwitt> yeah, I'm working on it but nothing to show yet. haven't gotten as much done by now as I wanted to
21:31:12 <mriedem> the spec was amended :) https://review.openstack.org/#/c/399750/ that's something
21:31:38 <melwitt> oh, right. I did do that
21:32:32 <dansmith> okay, well,
21:32:42 <dansmith> #topic open dis-cush-ee-ohn
21:32:47 <dansmith> anything else?
21:32:55 <mriedem> well,
21:32:59 <mriedem> on the ci front,
21:33:06 <mriedem> we should be back to nova-net being gone by eod
21:33:10 <mriedem> except for cells v1
21:33:17 <dansmith> yeah, that's good
21:33:24 <mriedem> and on the bright side sdague us back to look at the grenade change to require cells v2 in ocata
21:33:27 <mriedem> so progress
21:33:32 <melwitt> yay
21:33:38 <mriedem> s/us/is/
21:34:14 <mriedem> there are 2 semi related things for cells v2
21:34:15 <mriedem> https://blueprints.launchpad.net/nova/+spec/prep-for-network-aware-scheduling-ocata
21:34:30 <mriedem> looks like that's moving slowly
21:34:39 <mriedem> john has been caught up in the multiattach stuff lately
21:34:46 <dansmith> boo
21:35:10 <mriedem> and https://review.openstack.org/#/c/393205/ which i've asked sdague to look at again, and i need to look at again
21:35:46 <mriedem> on the bright side alex has started on the json schema validation for query params https://review.openstack.org/#/q/topic:bp/consistent-query-parameters-validation
21:35:59 <dansmith> there are multiple bright sides?
21:36:09 <mriedem> there are 3 bright sides in this meeting
21:36:13 <dansmith> oh my
21:36:26 <mriedem> #action need to review https://review.openstack.org/#/c/393205/
21:36:44 <mriedem> #action review https://review.openstack.org/#/c/399750/
21:36:49 <melwitt> one thing I realized is the remaining consoleauth stuff was covered by a spec that I missed reproposing for ocata. so that has to wait until pike
21:37:24 <mriedem> does it block anything?
21:37:25 <mriedem> if not implemented
21:37:36 <dansmith> might block the upcall thing
21:37:44 <dansmith> but also not sure who is going to work on it
21:38:09 <mriedem> i was never really familiar with that series
21:38:14 <melwitt> yeah, upcall needed to talk to consoleauth service otherwise. so mq switch needed from a cell, I think in that case
21:38:15 <mriedem> would have to read up on it
21:38:15 <melwitt> https://review.openstack.org/#/q/topic:bp/convert-consoles-to-objects
21:38:46 <melwitt> I was thinking to pick it up, i.e. restore the abandoned patches
21:39:45 <mriedem> i think someone would have to explain the upcall bits in more detail to me
21:40:00 <mriedem> to see the relation to cells v2 here
21:40:08 <dansmith> mriedem: preventing a call from a compute node up to the api db
21:40:23 <mriedem> b/c the consoleauth stuff is in the api db now?
21:40:49 <mriedem> no it's not
21:41:04 <melwitt> oh wait, my memory is recalling alaski saying we could just change deployment assumptions to run a consoleauth service per cell
21:41:18 <melwitt> in the meantime
21:41:18 <dansmith> hmm, does that work?
21:41:29 <dansmith> I thought the problem was that we route the api request by token id,
21:41:36 <dansmith> which we couldn't resolve to a cell without more information
21:41:41 <mriedem> https://review.openstack.org/#/c/321636/ is the cells v2 spec amendment
21:41:58 <dansmith> unless maybe we change what we return for them to call or something, but that's an api change I *thought*
21:42:06 <melwitt> I was thinking of the message queue part. if consoleauth runs on the api host, the cell would need to be able to talk to the api message queue
21:42:29 <melwitt> to request auth of a token, IIUC
21:42:32 <dansmith> I dunno, I don't have it in my head for sure
21:42:52 <dansmith> maybe you two can go off and figure out what needs doing and decide if someone has time for it this cycle
21:42:56 <mriedem> "The consoleauth service will be retained for legacy compatibility but
21:42:56 <mriedem> in a deprecated status, supported for one release. After the
21:42:56 <mriedem> period the consoleauth service can be removed."
21:43:13 <dansmith> that's if we do paul's thing I think
21:43:15 <melwitt> if routing is by token id from the api then that's another problem. I'm not that familiar with consoleauth
21:43:28 <melwitt> yeah
21:43:33 <dansmith> melwitt: I dunno, I might be making that up, but I thought there was something about that
21:43:43 <mriedem> This can be resolved by adding the instance uuid to
21:43:43 <mriedem> the query string in the URL.
21:43:59 <mriedem> i think it just means adding something like &instance_uuid=foo to the url
21:44:02 <melwitt> dansmith: yeah, I take it as a possibility. I'll dig into it and find out what the deal is
21:44:03 <dansmith> right
21:44:04 <mriedem> and then we can map that to a cell
21:44:10 <dansmith> mriedem: correct
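[editor's note] The spec-amendment idea mriedem quotes — appending the instance uuid to the console URL's query string so the API can map the token request to a cell — is mechanically simple. A minimal sketch, assuming the `instance_uuid` parameter name from the discussion; the helper functions themselves are hypothetical.

```python
# Minimal sketch of adding instance_uuid to a console URL so the API
# can resolve the request to a cell. Helper names are hypothetical.
from urllib.parse import urlencode, urlparse, parse_qs

def add_instance_uuid(console_url, instance_uuid):
    # append &instance_uuid=... (or ?instance_uuid=... if no query yet)
    sep = "&" if urlparse(console_url).query else "?"
    return console_url + sep + urlencode({"instance_uuid": instance_uuid})

def instance_uuid_from_url(console_url):
    # the API side: pull the uuid back out to look up the cell mapping
    return parse_qs(urlparse(console_url).query)["instance_uuid"][0]
```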
21:44:34 <dansmith> anyway, we've brought it up a couple weeks in a row now, so maybe we can shoot for having a plan by next week
21:44:35 <dansmith> ?
21:44:45 <melwitt> okay
21:45:01 <mriedem> #action melwitt to figure out what's up with consoleauth changes wrt cells v2
21:45:04 <dansmith> #action melwitt to detangle the consoleauth stuff for next week
21:45:07 <mriedem> ooo
21:45:09 <dansmith> mriedem: gdi who is running this meeting?
21:45:15 <mriedem> force of habit
21:45:18 <melwitt> mriedem taking over the place
21:45:37 <mriedem> so on the quotas thing, if there were some steps to get started on that i could maybe take a crack
21:45:46 <mriedem> but would need some hand holding
21:46:32 <mriedem> anyway, i just don't do much code stuff these days on any priorities, so i'm open
21:46:55 <dansmith> #info mriedem's code chops are falling into disrepair
21:46:59 <dansmith> anything else?
21:47:01 <mriedem> hey
21:47:05 <dansmith> heh
21:47:12 <mriedem> see me triage that shelve race earlier?!
21:47:20 <melwitt> mriedem: okay, let me think about that. I was first going to do the object code for the quota tables we're keeping (that keep the limits)
21:47:20 <mriedem> i'm done
21:48:00 <melwitt> and then after that work on the resource counting and replacing reserve/commit/rollback calls
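[editor's note] The resource-counting approach melwitt describes — replacing reserve/commit/rollback reservations with counting actual usage at check time — can be sketched like this. Hypothetical names; the real work would replace Nova's quota reservation calls.

```python
# Hedged sketch of quota checking by counting resources instead of
# using reserve/commit/rollback reservations. Names are hypothetical.

def check_quota(count_in_use, limit, requested):
    # count actual usage at check time rather than tracking reservations
    in_use = count_in_use()
    if in_use + requested > limit:
        raise ValueError("over quota: %d in use + %d requested > limit %d"
                         % (in_use, requested, limit))
    return in_use + requested
```

The design point is that there is nothing to roll back on failure: the count is recomputed on every check, so a failed build never leaves a stale reservation behind.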
21:48:13 <mriedem> the latter part is what i thought sounded somewhat simpler
21:48:16 <mriedem> but maybe it's not
21:48:32 <mriedem> anyway, we can take that outside the meeting
21:49:04 <dansmith> I move we adjourn
21:49:17 <mriedem> second
21:49:47 * dansmith wields the gavel
21:49:59 <dansmith> #endmeeting