14:00:32 #startmeeting nova-scheduler
14:00:33 Meeting started Mon Dec 3 14:00:32 2018 UTC and is due to finish in 60 minutes. The chair is fried_rice. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:36 The meeting name has been set to 'nova_scheduler'
14:00:39 o/
14:00:41 o/
14:01:38 #topic last meeting
14:01:38 #link last minutes: http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:01:50 hm, sec...
14:01:52 \o
14:01:59 #chair efried
14:02:00 Warning: Nick not in channel: efried
14:02:01 Current chairs: efried fried_rice
14:02:08 #topic last meeting
14:02:08 #link last minutes: http://eavesdrop.openstack.org/meetings/nova_scheduler/2018/nova_scheduler.2018-11-26-14.00.html
14:02:11 o/
14:02:12 there we go.
14:02:29 Any old business?
14:03:03 o/
14:04:15 #link latest pupdate: http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000392.html
14:04:40 topic from the pupdate: os-resource-classes
14:04:43 * alex_xu waves late
14:04:54 without leakypipes here, not sure we're going to get to closure
14:05:36 My take, expressed on the ML, was to do *something* rather than continuing to discuss and do nothing.
14:05:40 efried: I think you're right: is basically a guess of just doing something
14:05:50 s/guess/case/
14:06:13 I haven't done it myself because a) i've been busy with other things, b) I don't want to do too much
14:06:19 Mm.
14:06:34 but if it comes to it, I'll reinstate my existing experiments and we'll just do that
14:06:50 If someone else would like to do so, the code is available on the -4 etherpad
14:06:55 cdent: do we have an etherpad or anything with the existing proposals?
14:07:03 #link -4 etherpad https://etherpad.openstack.org/p/placement-extract-stein-4
14:07:09 oh, does -4 have them all listed? /me looks...
14:07:25 no, it was assumed that other people would add theirs
14:07:30 but they didn't
14:07:34 so perhaps that's a sign
14:07:46 silent assent?
14:07:49 * mriedem joins late
14:07:51 I added Jay's
14:08:02 I'm in favor of "the simplest thing that could possibly work".
14:08:14 the ptg etherpads had more discussion
14:08:15 Which is basically to take the enum that we're using in the placement repo and stuff it in its own repo.
14:08:32 I think that's what cdent's thing does.
14:08:38 pretty much
14:09:13 talk of making an os-placement-artifacts library has only led to thrashing, so I say we abandon that route for now.
14:09:26 pretty much
14:09:36 so we'll have a super thing os-traits and a super thing os-resource-classes and they'll be separate and done.
14:09:46 s/thing/thin/g
14:09:55 pretty much
14:10:46 cool. Anyone want to volunteer to do the git/gerrit paperwork to seed a real openstack/os-resource-classes with https://github.com/cdent/os-resource-classes ?
14:11:00 I might have time for that
14:11:13 Nice.
14:11:33 #action edleafe to seed openstack/os-resource-classes from https://github.com/cdent/os-resource-classes
14:11:39 * efried ignores "might" :P
14:11:45 So not to be wishy-washy, I'll *make* time for that
14:11:50 Thanks edleafe
14:11:51 jinxish
14:12:00 Moving on.
14:12:04 Anything else from the pupdate?
14:12:20 or reviews/specs not mentioned therein?
14:12:51 I wanted to know from you, efried, if your provider tree stuff was in the glide path, or still had humps?
14:13:14 smooth sailing or rough seas
14:13:18 Just the gate, I think.
14:13:19 good metaphors or bad
14:13:21 one sec.
14:13:41 #link nova-to-placement traffic reduction series now starting at https://review.openstack.org/#/c/615646/
14:13:51 Bottom patch is +A, in recheck vortex.
14:14:44 Next few patches have been revised for minor issues from leakypipes and should be ready for his +2.
14:15:05 SchedulerClient evisceration patches have leakypipes +2
14:15:36 and then the top one is pretty fresh.
14:15:56 Note that the top three patches are really cleanup, not specific to the nova-to-placement traffic business
14:16:15 They're on top to obviate merge conflicts, because I'm a special kind of lazy.
14:16:20 cleanup++
14:17:12 I think I'm hoping mriedem will look at patches 2, 3, 4 in that series, since he's been involved in the bottom two.
14:17:27 (including https://review.openstack.org/#/c/615606/ which has merged)
14:18:13 yar
14:18:30 Thanks mriedem :*
14:18:51 Specs-wise, I should mention the cyborg/nova thing.
14:18:56 is cern running with any of this yet btw?
14:19:16 * bauzas is just following the convo here silently
14:19:17 mriedem: I don't think so. But tssurya and belmoreira have reviewed the patches and +1'd
14:19:24 at previous patch sets
14:19:40 since which not much of substance has changed
14:20:43 I don't get why the eviscerate patches are important, but meh, I'm +1 with them
14:20:51 I just need some time for reviewing them
14:21:09 tssurya: any chance y'all could pull down the series, say up to https://review.openstack.org/#/c/615705/ , and make sure a) they don't blow up the world, and b) they do what we expect to reduce traffic?
14:21:49 bauzas: Important, meh. They're just getting rid of an unnecessary abstraction layer that folks (mainly leakypipes and I) have been advocating ripping out for a while and just hadn't gotten around to.
14:22:08 well ok
14:22:36 that said, I'm still needing some comments for https://review.openstack.org/#/c/599208/ :)
14:22:56 If nothing else, having that SchedulerClient in the way made it hard to search for usages of those methods in PyCharm
14:23:23 bauzas: ack, that one's on my list for sure. It's actually on the agenda here in a minute...
14:24:36 #topic Extraction
14:24:36 #link Extraction etherpad https://etherpad.openstack.org/p/placement-extract-stein-4
14:25:01 libvirt/xenapi reshaper series <== need reviews!
14:25:01 #link libvirt reshaper https://review.openstack.org/#/c/599208/
14:25:01 #link xen reshaper (middle of series) https://review.openstack.org/#/c/521041
14:25:29 Last time we agreed to continue deferring work on the reshaper FFU framework.
14:25:44 Any other extraction-related topics or discussion that we haven't already covered?
14:26:24 the functional tests in nova, using external placement, are happy now
14:26:24 efried: https://review.openstack.org/#/c/617941/ now works but there is some controversy
14:26:37 cdent: ++ \o/
14:26:39 efried: about nova-status check test
14:26:42 #link https://review.openstack.org/#/c/617941/
14:26:57 yeah, that's the stickler we need to resolve
14:27:12 * efried hears nova-status, looks at mriedem
14:27:17 related to that, zzzeek has some cleaning up on the database fixtures:
14:27:34 #link data fixture cleanup https://review.openstack.org/#/c/621304/
14:27:35 that change doesn't drop the in-tree nova db objects for placement,
14:27:44 so i don't see why that change needs to drop the nova-status tests at all
14:27:55 they can go later when the in-tree placement code is dropped, right?
14:27:55 mriedem: it's child does?
14:27:59 its
14:28:04 so let the child worry about it
14:29:02 can do that if you like, but it seemed odd to get rid of a bunch of placement tests and then not get rid of that one
14:29:07 but whatever people like
14:29:11 it's not a placement test,
14:29:13 it's a nova test
14:29:20 it's just that it uses the db rather than the api
14:29:27 it's miscegeny
14:29:34 which must be purged!
14:29:52 * cdent seeks therapy
14:30:59 i'll put the test back this afternoon sometime
14:32:01 but we're still going to need to resolve how that command and the other "look at placement" commands that are in nova are going to be after these changes
14:32:03 regarding zzzeek's fixture cleanup, I'm running nova functional against it
14:32:48 which makes me wonder if we want to trigger nova functional for each placement change in CI
14:33:17 i can feel the bristling from here
14:33:42 if anything, maybe just an experimental queue job
14:33:51 i don't think placement changes want to gate on nova functional tests
14:34:02 that would be teh suck
14:34:19 for everyone
14:34:19 no?
14:34:24 experimental queue allows you to run it on-demand in case you worry about a breaking change to the fixture
14:34:42 but do we expect that to happen very often?
14:34:45 i wouldn't think so
14:35:03 nor me
14:35:16 OK, thanks for the feedback
14:35:41 Ready to move on?
14:35:45 yes
14:35:52 #topic bugs
14:35:52 #link Placement bugs https://bugs.launchpad.net/nova/+bugs?field.tag=placement
14:36:07 any bugs to highlight?
14:36:57 * mriedem takes kid to bus stop
14:37:00 #topic opens
14:37:02 I have one I find interesting
14:37:07 #undo
14:37:08 Removing item from minutes: #topic opens
14:37:15 cdent: go
14:37:24 #link ensure aggregates under load https://bugs.launchpad.net/nova/+bug/1804453
14:37:25 Launchpad bug 1804453 in OpenStack Compute (nova) "maximum recursion possible while setting aggregates in placement" [Undecided,New]
14:37:58 you have to really slam resource provider creation and aggregate manipulation to tickle that, but when you do the system gets very upset
14:38:58 wow. Why are we recursing there? I instinctively twitch at that.
14:39:09 it's not real recursion
14:39:44 well it is, it's a loop that if it gets stuck for long enough triggers "I've called this code too many times"
14:40:57 but yes, it needs a way to exit itself before then
14:41:05 and it's caused by real conflicts happening, or the same conflict getting hit a bunch of times in a row while the other thread is trying to finish up?
14:41:16 from a service mgt standpoint, the way to avoid the problem is to have higher concurrency in the service
14:41:43 a single thread gets stuck while other threads are succeeding
14:42:25 it's a very specific set of circumstances to trigger it, and placeload happens to have those circumstances
14:42:36 you also have to under-provision the server so that it has too few threads
14:43:15 but none of that really matters: what matters is that we have a place in the code that is insufficiently robust. it's not defensive against the unexpected
14:43:27 exactly because of the reasons that make you twitch
14:45:13 Have you identified the piece of code in question? I don't see it in the bug report. If you could drop a link in there, I'd like to have a look at it.
14:45:26 just _ensure_aggregate?
14:45:52 yes, ensure_aggregate calls ensure_aggregate
14:46:33 there's a race between checking for an agg id, and generating that agg new
14:48:40 anyway, I don't know that we need to belabor that now, I just wanted to point it out as a thing of interest. It was fun to see/experience
14:49:26 Seems like the simplest thing to get rid of the stack overflow is just to get rid of the recursion. At a glance, this could be turned into a loop very easily.
14:49:37 * cdent nods
14:49:42 If that happened, would it still be possible for it to spin "forever"?
14:49:59 I imagine we reach stack overflow fairly quickly (wallclock-wise) with the recursion.
14:50:11 Couple seconds of non-recursive looping and we come out the other side?
14:51:28 we probably want a combo of a limited loop (which errors if it reaches its end) and an exponential backoff or random sleep
14:51:29 This is also creating a new writer context every time it recurses. That seems... bad?
14:51:49 so yeah, it's code that needs to be fixed, we don't need to fix it here though
14:51:52 or is that necessary and part of the reason we're recursing rather than looping?
14:52:16 okay, yeah, let's take it offline.
14:52:19 #topic opens
14:52:24 anyone?
14:52:42 (hello, bot?)
14:52:47 (shrug)
14:53:02 fwiw i've thought about adding a "performance/scaling" doc into nova before when stuff like this comes up,
14:53:08 i.e. known ways things can go south
14:53:24 more as a dumping ground for not forgetting about known issues
14:53:32 but, you know
14:53:36 good idea. Seems to me like cdent has answered a lot of placement-related scaling questions before (in blogs maybe?). That info could go into docs very nicely.
14:54:30 a) that's true, b) most of the solutions are just standard "I'm a web app" solutions, c) where they are not it is because our code is bad and relatively easy to fix and we should. The current situation is c
14:55:01 from a nova pov, i was thinking more about the myriad config options we have for tweaking things when scaling,
14:55:11 e.g. the way cern restricts projects to a small number of cells
14:55:39 a place to document things one can run into and ways people have worked around those things
14:58:05 anywho
14:58:28 the only other thing I'd add to "opens" at the moment is to reinforce from the pupdate that there are multiple deployment projects doing placement bits and pieces that could use our eyes
14:59:38 okeydokey. Thanks all.
14:59:49 #endmeeting
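
Note on bug 1804453: the fix suggested at 14:51:28 (a limited loop that errors when it reaches its end, plus exponential backoff or a random sleep) could look roughly like the sketch below. This is only an illustration of the general pattern, not the placement code. `ConcurrentUpdateError`, `RetriesExhausted`, `retry_with_backoff` and all of its parameters are hypothetical names invented for the example, and the real conflict exception and database call in `_ensure_aggregate` may differ.

```python
import random
import time


class ConcurrentUpdateError(Exception):
    """Hypothetical stand-in for the conflict raised when two writers race."""


class RetriesExhausted(Exception):
    """Raised when the bounded loop runs out of attempts."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.01,
                       max_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    A plain loop replaces the recursion, so the call stack cannot grow,
    and the attempt limit guarantees the loop cannot spin forever.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConcurrentUpdateError:
            if attempt == max_attempts - 1:
                break  # out of attempts: fall through to the error below
            # Exponential backoff, capped at max_delay, with random jitter
            # so that competing threads do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise RetriesExhausted("gave up after %d attempts" % max_attempts)
```

The `sleep` parameter is only there so callers (and tests) can substitute a no-op. Whether each attempt should open a fresh writer context, as raised at 14:51:29, is left open here and would be decided inside `operation`.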