16:00:05 <gibi> #startmeeting nova
16:00:05 <openstack> Meeting started Thu May  7 16:00:05 2020 UTC and is due to finish in 60 minutes.  The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <openstack> The meeting name has been set to 'nova'
16:00:14 <gibi> o/
16:00:38 <artom> ~o~
16:00:42 <bauzas> \o
16:00:50 <gmann> o/
16:00:51 <dansmith> .
16:01:23 <melwitt> o/
16:01:35 <gibi> #topic Last meeting
16:01:41 <gibi> #link Minutes from last meeting: http://eavesdrop.openstack.org/meetings/nova/2020/nova.2020-04-30-16.00.log.html
16:01:52 <gibi> is there anything to bring back from the last meeting?
16:02:00 <dansmith> I keep seeing that topic and getting falsely excited that *this* is the last of these meetings :)
16:02:24 <gibi> :) no it is not
16:02:37 <gibi> #topic Bugs (stuck/critical)
16:02:42 <gibi> No Critical bugs
16:02:49 <gibi> #link 31 new untriaged bugs (-7 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
16:03:08 <bauzas> thanks gibi
16:03:08 <gibi> we are still on a downward trend but slowing down
16:03:16 <bauzas> I will help next week
16:03:26 <gibi> I want to reach 0 in the next couple of weeks if possible
16:03:33 <bauzas> we have a PTG discussion for this
16:03:35 <gibi> bauzas: thanks
16:04:07 <gibi> I'm not tracking any RC critical bug at the moment
16:04:13 <gibi> #link https://bugs.launchpad.net/nova/+bugs?field.tag=ussuri-rc-potential
16:04:28 <gibi> any bug we need to discuss today?
16:05:21 <gibi> #topic Release Planning
16:05:31 <gibi> We cut RC2 this week to include the fix https://review.opendev.org/#/q/topic:bug/1875418+(status:open+OR+status:merged)
16:05:59 <gibi> I don't see anything that is blocking a GA now so I assume RC2 will be the GA code
16:06:21 <gibi> please raise any issue with the ussuri release basically now as the RC deadline is today
16:06:53 <gibi> anything else to discuss about the release?
16:06:54 <bauzas> #link https://releases.openstack.org/ussuri/schedule.html
16:07:07 <bauzas> GA is next week
16:07:11 <gibi> yepp
16:07:21 <bauzas> so unless we have a very large regression, I think we can hold
16:07:29 <gibi> and next week there will be a community call to present Ussuri to the world
16:07:53 <gibi> I will talk 5 minutes about what we did in the last cycle, like a really mini project update
16:09:11 <gibi> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-May/014676.html
16:09:21 <gibi> these are the details of the community call ^^
16:09:35 <gibi> #topic Stable Branches
16:09:50 <gibi> I did not see any major event on the stable branch
16:10:00 <gibi> lyarwood: if you are around, do you have any news?
16:11:47 <gibi> I guess he is not around
16:12:01 <gibi> #topic Sub/related team Highlights
16:12:06 <gibi> API (gmann)
16:12:24 <gmann> i have not checked the API related specs for V cycle yet
16:12:32 <gmann> one thing going on is healthcheck #link https://review.opendev.org/#/c/724684/
16:12:59 <gibi> gmann: do we need a bp for that?
16:13:00 <gmann> i have added this to discuss in PTG also, discussion going in review too.
16:13:39 <gmann> i asked for a spec covering everything we can do now and later, so at least we know what we want to do later and can design this so it does not break when we add other things
16:13:55 <gmann> like unauth, enable/disable options ect
16:13:57 <gmann> etc
16:14:02 <gibi> spec is even better especially if there are multiple steps
16:15:14 <gmann> yeah. we can ship it a minimum things for now and i am checking if adding things is possible as config option or not
16:16:04 <gmann> main concern is when we add new things, we can add it in compatible way. like on-demand deeper checks
16:16:04 <artom> "we can ship it a minimum things for now" + 1 to that
16:16:45 <artom> Are we discussing this in detail now? One idea I had was make it authenticatable from the start, but for now just return the basic 200 OK for everything, authenticated or not
16:16:59 <gmann> but i have not checked with a poc yet whether that works with oslo.middleware or whether we need to add an extra filter for that.
16:17:00 <artom> And then we can spec out the "deep status" healthcheck
16:17:33 <melwitt> yeah I wanted to ask gmann if starting out unauth'ed and then upgrading to auth later, would that pose an issue from the API perspective?
16:17:53 <artom> And zigo makes a good point in the review that it needs to be fast, because haproxy will be hitting it every second
16:18:02 <dansmith> presumably this isn't going to be versioned as strictly as the rest of the API right?
16:18:13 <gmann> melwitt: it will, as many load balancers use it without auth and if they need a token then it will break them
16:18:16 <artom> So it's probably a bad idea to try authentication on every request
16:18:34 <artom> There should be a "were authentication headers sent? No --> quick 200 OK" mechanism
16:18:40 <gibi> artom: nothing heavy on the agenda so I think it is OK to have a sneak peek of the feature to draw attention
16:19:02 <bnemec> I don't think things like haproxy are going to be able to auth, so if we add auth we still need to have a basic healthcheck that is unauth'd.
16:19:18 <dansmith> we could pretty easily build the healthcheck data from authenticated requests
16:19:55 <gmann> true
16:19:55 <dansmith> unauth'd healthchecks include very coarse information, which may be up to date if there are auth'd requests keeping it fresh, and if not, it's no worse than a basic check
16:20:04 <zigo> Not even *one* haproxy hitting it every second, but in most case, 3, so 3 queries per second, constantly.
16:20:13 <artom> bnemec, almost like we need different URLs, one for load balancers, one for humans or other more advanced monitoring solutions
16:20:36 <gmann> zigo: yeah, the default of healthcheck can be a fast-responding thing.
16:20:47 <dansmith> zigo: ack, yeah and if we have three cells, that's five databases per check, three mqs per check, which is a good reason to build that information in a cache and just return it from healthchecks
16:20:48 <gmann> anyways all these things to discuss so spec can be better
16:20:49 <bnemec> artom: That would probably be the simplest.
16:21:33 <gibi> feels like we have plenty of things for the spec. lets continue there
16:21:45 <gibi> gmann: any other API related thing you want to mention?
16:22:01 <gmann> that's all for today from me
16:22:04 <gibi> cool, thanks
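The caching idea raised above (dansmith, 16:19 to 16:20) could look roughly like the sketch below. It is a minimal illustration, not code from nova or oslo.middleware: `HealthCache`, `record`, `report`, and the `max_age` parameter are hypothetical names. The assumption is that authenticated requests call `record()` as a side effect, while the unauthenticated check only reads the cached result, so a load balancer polling several times per second never touches a database or message queue.

```python
import threading
import time


class HealthCache:
    """Hypothetical in-memory cache of coarse health data (illustration only)."""

    def __init__(self, max_age=30.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock        # injectable so the cache can be tested
        self._max_age = max_age    # seconds before cached data counts as stale
        self._status = {}          # component name -> True (ok) / False (failing)
        self._updated = None       # time of the most recent record() call

    def record(self, component, ok):
        """Store the outcome of real work, e.g. a DB call made for an auth'd request."""
        with self._lock:
            self._status[component] = ok
            self._updated = self._clock()

    def report(self):
        """Return (status_code, body) without doing any I/O."""
        with self._lock:
            stale = (self._updated is None
                     or self._clock() - self._updated > self._max_age)
            if stale:
                # No fresh data: fall back to a plain liveness answer,
                # which is no worse than a basic check.
                return 200, "OK"
            failed = sorted(name for name, ok in self._status.items() if not ok)
        if failed:
            return 503, "DEGRADED: " + ", ".join(failed)
        return 200, "OK"
```

Whether stale data should answer 200 or something else is exactly the kind of question the spec would need to settle; the fallback here simply mirrors the "no worse than a basic check" remark.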
16:22:06 <gibi> Libvirt (bauzas)
16:22:46 <zigo> Do everyone agree that the current healthcheck can still be approved, in the mean while?
16:23:18 <zigo> *does
16:23:39 <gibi> zigo: we need to know that our future plans with the healthcheck as an extension of the current simple API
16:23:49 <gibi> are viable
16:24:17 <gmann> zigo: yeah so that we do not need to change the currently proposed healthcheck usage
16:24:36 <gmann> i mean discuss in spec first and then do current proposed one
16:24:46 <dansmith> definitely discuss in spec first
16:25:26 <bnemec> For reference, there was a previous healthcheck spec with a bunch of discussion: https://review.opendev.org/#/c/531456
16:25:54 <dansmith> yeah, I remember,
16:25:54 <gmann> bnemec: thanks that will be good ref to check too
16:26:12 <dansmith> plenty of fodder there for needing a wider discussion
16:26:59 <zigo> FWIW: the same type of patch has already been approved for Neutron, Heat and Cinder, so it's kind of weird that we aren't getting things cross-project this way.
16:27:00 <bnemec> Oh, this also has a great list of previous discussions: https://storyboard.openstack.org/#!/story/2001439
16:27:28 <dansmith> zigo: omg, I'm convinced.. best argument ever
16:27:43 <zigo> :)
16:27:46 <dansmith> :P
16:27:52 <bauzas> gibi: sorry was off
16:28:03 <gibi> bauzas: no worries I call you again
16:28:04 <bauzas> nothing to say, but aarents asked for some changes
16:28:11 <bauzas> https://etherpad.opendev.org/p/nova-libvirt-subteam
16:28:16 <bauzas> will try to review them soon
16:28:33 <gibi> bauzas: cool thanks
16:28:37 <bauzas> that's it
16:28:48 <bauzas> kashyap also has a point about q35 but he's not around
16:29:13 <gibi> lets quickly finish the agenda and then we can get back to the healthcheck discussion in the Open
16:29:21 <gibi> #topic Stuck Reviews
16:29:40 <gibi> nothing on the agenda. Does anybody have a stuck review to bring up?
16:30:53 <gibi> #topic Virtual PTG planning
16:31:00 <gibi> Current nova schedule is on the top of the etherpad #link https://etherpad.opendev.org/p/nova-victoria-ptg
16:31:09 <gibi> Cyborg also wants to talk with us about SmartNic and that discussion is now scheduled for June 5 Friday 14:00 UTC - 15:00 UTC
16:31:34 <gibi> anything else about the virtual PTG ?
16:32:10 <gmann> do we want to move healthcheck topic with oslo as cross project?
16:32:29 <gmann> i added at L 179 for now
16:33:30 <artom> Not sure it's oslo cross-project... It's already merged in other projects (ex: https://review.opendev.org/#/c/724676/), so if we want cross-project uniformity (which I think is important), our hands are kinda tied in that sense
16:33:33 <gibi> gmann: If you feel bnemec or other folks from oslo would be good to join to that discussion then lets try to have some dedicated time for an oslo-nova cross session
16:34:00 <gibi> bundled with the policy discussion
16:34:01 <artom> Like, making it authenticatable and future-proof are important, but it'd be bad form to go off and do our own thing entirely.
16:34:27 <gmann> ok
16:34:50 <bnemec> I think it's important to keep in mind that there are two things here: enabling the existing simple healthcheck, and designing the next-gen fancy healthcheck
16:34:57 <bnemec> The latter should not block the former IMHO.
16:35:51 <gibi> #topic Open discussion
16:36:07 <artom> bnemec, agreed. I guess the point is, if we want to have the same on the same URL (which is debatable in my mind), we need to build in things the latter might need from the start
16:36:08 <gibi> we can continue the healthcheck discussion now in the Open
16:36:20 <artom> *have them both on the same URL
16:36:20 <gibi> (as nothing else on the agenda for Open)
16:36:56 <dansmith> artom: yeah, that's the thing I'd want to know
16:37:13 <dansmith> I don't want to have /healthcheck, /useful_healthcheck, /no_serously_this_one, etc
16:37:34 <bnemec> If having them both on the same URL blocks having any healthcheck for the next two years then I think that's a bad approach.
16:37:36 <artom> dansmith, well, yeah, but realistically how many are we going to have?
16:37:53 <bnemec> I note that https://storyboard.openstack.org/#!/story/2001439 mentioned possibly different behavior for GET vs HEAD.
16:37:58 <artom> dansmith, one simple, unauthenticated, unversioned, one "fancy", authenticated, versioned
16:38:00 <dansmith> we've already identified several levels..
16:38:12 <bnemec> I have no idea if that's an API no-no though.
16:38:29 <bauzas> honestly, I co-contributed to this change, but I'm not opinionated a single bit.
16:38:33 <gmann> i think it should be doable with same url with extra 'backends' to check for oslo? but need to try
16:38:37 <dansmith> artom: honestly, what does the simple unauth'd one tell you? that apache and mod_wsgi is working right?
16:38:59 <dansmith> artom: is there any difference between hitting that check vs just the version manifest?
16:39:07 <artom> dansmith, there isn't
16:39:07 <gmann> extra configured 'backends'
16:39:22 <artom> dansmith, the argument from operators is having every project have a common URL for that
16:39:37 <artom> And not nova with /versions, neutron with /healthcheck, cinder with /status or whatever
16:40:02 <artom> (I made up the last one)
16:40:09 <bauzas> honestly, if we have different URLs between services, we don't need the healthcheck one
16:40:23 <dansmith> can't you hit the / on everyone's api and get the same result?
16:40:38 <artom> dansmith, I dunno, can you?
16:40:40 <gmann> not all services have a / (versions) url ?
16:40:46 <artom> zigo ^^ ?
16:40:54 * zigo reads the backlog
16:40:56 <dansmith> I don't really know what the oslo base bit gives us... I thought we could provide a function to generate the report or something. is that the case or not?
16:41:36 <dansmith> gmann: don't they all redirect to something like the version doc? anyway, I'm not really suggesting that as an alternative, I'm just saying a "hello world" seems pointless to me
16:41:51 <bauzas> dansmith: the only thing that would be nice for ops is that they can disable the healthcheck as they wish
16:42:05 <dansmith> bauzas: sorry, what?
16:42:15 <zigo> dansmith: You wont get the same result, no, you get a "300 multiple choice", that's not what operators need.
16:42:21 <zigo> We need a "200 ok" ...
16:42:25 <bauzas> dansmith: the healthcheck API can return 'sorry, 503' if a file is provided
16:42:32 <gmann> dansmith: we can implement extra plugins (beyond the default file existence check) to generate the report, and add something in oslo to check all plugins added for the healthcheck app
16:43:07 <dansmith> okay I don't understand either of those fully
16:43:07 <bauzas> that's the only single bit that can help HAProxy more than just checking a port
16:43:24 <gmann> dansmith: current default plugins are file checks.
16:43:28 <bauzas> but honestly, as a support engineer years ago, I wasn't trusting healthchecks
16:43:33 <zigo> And the idea behind the file is so one can turn off the API in a nice way: tell Haproxy, I'm going to turn off the API... then really do it.
16:43:35 <gmann> and yes, port with file
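The file-based switch described above (answer 503 when a file is present, so HAProxy drains the backend before the API is really stopped) can be sketched as a tiny WSGI app. This is an illustration under stated assumptions, not the oslo.middleware implementation: `make_healthcheck_app` and the `disable_file` parameter are hypothetical names, and the response bodies are arbitrary.

```python
import os


def make_healthcheck_app(disable_file):
    """Return a WSGI app that answers 200 unless `disable_file` exists.

    Hypothetical helper for illustration; the path is chosen by the operator.
    """

    def app(environ, start_response):
        if os.path.exists(disable_file):
            # Operator created the file: tell the load balancer to stop
            # sending traffic here before the service is shut down.
            status, body = "503 Service Unavailable", b"DISABLED BY FILE"
        else:
            status, body = "200 OK", b"OK"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]

    return app
```

In this sketch an operator would create the file, wait for the load balancer to mark the backend down, restart the API, and then remove the file again. The check costs one `stat` call per request, which matters given the polling rate mentioned earlier in the meeting.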
16:43:41 <dansmith> bauzas: right because they tell you nothing?:)
16:43:43 <bauzas> I preferred homemade checks based on logics
16:43:52 <bauzas> for my haproxy backends
16:44:24 <dansmith> if the goal is really to have a completely pointless not-really-health-related common url across all projects then whatever
16:44:51 <zigo> dansmith: The point is having something to query for haproxy, nothing more, nothing less.
16:45:18 <gmann> exactly, 'yes healthy' should mean your request should succeed (as per the general checks we did for the minimum required things)
16:45:24 <zigo> If we're capable of providing more than that, great, but this shouldn't wait for spec, design, doc, test, implementation, etc.
16:46:17 <zigo> My original patch barely activated a feature we already have...
16:46:18 <gmann> zigo: and if providing more leads to changing the one already in use, is that also fine ?
16:46:32 <zigo> Yeah, great too ! :)
16:46:38 <bnemec> Also worth noting that the /healthcheck endpoint is already enabled for some services, so even if we decide to completely redesign it we can't ignore the existing one.
16:46:40 <zigo> If it becomes more reliable, that's bonus points.
16:47:09 <dansmith> gmann: for zigo's use case, but people will write nagios plugins and other monitoring infra against this of course, so while zigo and others only care about the "200 OK" the devil is in the details, like it always is
16:47:53 <zigo> dansmith: Operators do know that this is not enough for monitoring.
16:48:03 <zigo> I could send you my scripts if you like! :)
16:48:09 <dansmith> super unfortunate that we called it /healthcheck don't you think?
16:48:16 <artom> dansmith, which why documenting what this actually is and its limitations is important, but I don't see that as a reason to not do it. It's an unobtrusive chance.
16:48:18 <artom> *change
16:48:40 <dansmith> artom: of course, the code isn't the obtrusive part :)
16:48:50 <zigo> We can call it "/my-http-api-server-is-alive-and-haproxy-can-query-it" but that's a bit long to type ...
16:48:58 <artom> What is? The time we're spending debating this? ;)
16:49:46 <artom> dansmith, plus, it means you'll get to write another massively influential blog about /healthcheck vs /ping vs /status, like your evacuate one ;)
16:50:17 <zigo> dansmith: For the monitoring, what we do with nova-api is actually querying https://${HOSTNAME}:8774/v2.1/servers and see if the monitoring instance is in the list for that project.
16:50:32 <zigo> That's much better than just checking /healthcheck of course.
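The deeper monitoring check zigo describes (list servers and look for a known monitoring instance) could be reduced to something like the sketch below. It only covers the response-parsing half and assumes the body has already been fetched from `/v2.1/servers` with a valid token; `monitoring_instance_listed` and the instance name are hypothetical, and this is not zigo's actual script.

```python
import json


def monitoring_instance_listed(response_body, instance_name):
    """Return True if a server called `instance_name` appears in the listing.

    `response_body` is the JSON text returned by GET /v2.1/servers, which
    wraps the results in a top-level "servers" list.
    """
    servers = json.loads(response_body).get("servers", [])
    return any(server.get("name") == instance_name for server in servers)
```

Unlike a plain 200 from /healthcheck, a check like this exercises authentication and the database path, which is why it suits monitoring rather than per-second load balancer polling.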
16:50:35 <dansmith> I think what artom is saying is that any change that has few lines of code isn't worth discussing regardless of the actual impact
16:50:49 <gibi> my opinion: consistency across services is good so I'm +1 on /healthcheck as of today returning a plain 200 OK. But we should have an agreement in a spec on how we extend the /healthcheck API if we want to extend that 200 OK with more information. I'm now OK to have the unauthed vs authed switch between simple 200 OK and complex healthcheck result
16:50:51 <artom> dansmith, that's completely false and you know :P
16:50:52 <artom> *know it
16:51:35 <artom> This is *adding* and *independent* thing that operators can use or not, at their leisure
16:51:40 <artom> *an *independent*
16:51:56 <bnemec> I guess I don't understand the huge drawback of having people write monitoring checks against a /healthcheck designed for such a thing versus them writing hacky checks against / that doesn't behave the way they want.
16:52:05 <artom> Though I'll grant that the concern about evolving it is a valid one
16:54:19 <melwitt> if it's called /healthcheck, operators are going to expect it to check health to some extent. and not just be a liveness check (like checking for an open port or something)
16:54:49 <melwitt> so if that was not the intention, I agree the name choice is unfortunatel
16:54:56 <melwitt> -l
16:54:57 <zigo> melwitt: In simple words: *no* ! :)
16:55:06 <gmann> yeah and that is what i thought it was when i first saw. i was not aware of previous oslo spec discussion.
16:55:30 <gmann> or until i saw the oslo code
16:55:32 <melwitt> zigo: what are you saying "no" about?
16:55:42 <zigo> As an operator, we do all sorts of things to check if everything is up, not just checking /healthcheck. If that is your concern, then we can further document that this is not (yet?) what it is for.
16:56:24 <artom> melwitt, so put a .. warning:: in the documentation saying this is just making sure that the HTTP service is operational
16:56:43 <zigo> artom: Right.
16:56:44 <zigo> :)
16:56:52 <gibi> 3 minutes left, lets try to wrap it up here but continue it on #openstack-nova and/or in a spec
16:56:54 <melwitt> right, and so what is /healthcheck giving you beyond other checks like whether something is listening on port 8774 or that nova-api responds to http request?
16:57:23 <melwitt> well, anyway, I think my point is clear. we can wrap it
16:57:40 <artom> melwitt, the '200 OK' status - / (or /versions?) is "300 multiple choice"
16:57:41 <zigo> melwitt: If you don't give haproxy some URL to query, it's going to connect to the port, then disconnect, which is very ugly.
16:58:08 <zigo> So we got to give it an URL, and that URL must reply "200 ok".
16:58:14 <zigo> That's what the /healthcheck is for ...
16:58:30 <melwitt> I understand that, just saying if this is not a healthcheck a different name would have been more appropriate
16:58:42 <melwitt> this is implying that health is being checked, obviously
16:59:05 <artom> melwitt, fair point
16:59:08 <gmann> how about /healthcheck -> all deeper checks and /healthcheck?https-only-check -> minimum check as proposed
16:59:09 <zigo> "Thu May  7 16:58:59 2020 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) !!!"
16:59:23 <zigo> That's what I get constantly in my logs if I don't activate healthcheck stuff.
16:59:53 <artom> melwitt, but in the interest of cross-project uniformity, and because we can't go back in time and other projects have merged this (for better or worse), our hands are kinda tied
17:00:09 <gmann> and default is the former one, do all deeper checks as this endpoint name suggests
17:00:16 <zigo> (in my case, that's when using uwsgi)
17:00:17 <gibi> OK. thank you folks. continue it on #openstack-nova
17:00:25 <gibi> #endmeeting