#openstack-meeting-4 log

20:01:20 <Rockyg> #startmeeting log_wg
20:01:20 <openstack> Meeting started Wed Jun 17 20:01:20 2015 UTC and is due to finish in 60 minutes.  The chair is Rockyg. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:01:23 <openstack> The meeting name has been set to 'log_wg'
20:01:30 <jokke_> Hello
20:02:15 <Rockyg> Hey.  So, my chat client is messed up and I can't see thefolks in the room to ping the ones we need.  Would you please?
20:02:53 <jokke_> give me just a sec, do we have the courtesy list somewhere?
20:03:12 <Rockyg> Ah.  got it.  ping for dhellmann, bknudson
20:03:19 <bknudson> hi
20:03:28 <jokke_> hi there
20:03:42 <Rockyg> didn't know we could make a courtesy list
20:04:14 <Rockyg> nkrinner
20:04:18 <Rockyg> whoe else?
20:06:09 <jokke_> I think we should be good to go
20:06:52 <Rockyg> OK.  we had an agenda for las week
20:07:17 <Rockyg> One item I saw on log about request-id
20:07:38 <Rockyg> jokke_: you couldn'tremember if you had replied to that thread
20:07:44 <Rockyg> I don't think you did.
20:08:18 <Rockyg> Great if you could
20:08:58 <Rockyg> I'd like to discuss some of the problems we are having on all of these "user facing" specs
20:09:18 <Rockyg> #topic problems in moving user facing specs forward
20:09:26 <jokke_> yes!
20:09:46 <Rockyg> I am seeing a pattern in API wg, Log wg and the request-id spec
20:10:10 <bknudson> we have problems getting reviews for specs in keystone, too.
20:10:21 <bknudson> maybe it's a general issue with specs
20:10:23 <Rockyg> The problem is that the devs do not understand why users want something and how they will use it
20:10:59 <Rockyg> bknudson: that's a very good possibility.  I am certainly confused between specs and bps
20:11:07 <bknudson> how about add a section to the spec that explains why users want something and how they will use it?
20:11:37 <Rockyg> I thought specs would be the what of a problem definition, with solution approaches, and the bps the how, but it seems like both are how
20:12:13 <Rockyg> bknudson: exactly.  so what I want to do is gather the use cases and have that section up front
20:12:33 <bknudson> that seems to be the general way that specs are used, they describe a solution.
20:12:58 <jokke_> bknudson: I'd say that's a separate issue ... we slightly touched this last week but the problem seems to be something we perhaps could influence with education.
20:13:17 <Rockyg> So, for instance:  use case: switch dies. Don't have map of hosts to vms.  How do I track and repair?  Then the steps an ops would want to go through to solve the issues
20:13:35 <bknudson> Rockyg: I think that would be useful.
20:14:53 <Rockyg> So, there are two things users need to come up with:  1) Philosphies of how users expect to interact with cloud  2)Specific use cases to help define need and characteristics of feature
20:14:56 <jokke_> TBH this is actually funny to see ... my personal take was that specs replaced bps and only reason why bps are still made is to link the spec to lauchpad
20:15:40 <bknudson> jokke_: that's how they're used in keystone.
20:15:49 <Rockyg> The philosphy stuff is why logs have to be human readable.  Why a mnemonic is better than a UUID, etc.
20:16:06 <bknudson> there's no launchpad for cross-project specs as far as I know.
20:16:11 <bknudson> so no bps
20:16:20 <Rockyg> The use case shows how a feature is useful
20:16:27 <jokke_> bknudson: you're probably right
20:16:40 <Rockyg> bknudson: there might be, but if there is, it's openstack
20:17:29 <jokke_> Rockyg: that sounds like a good approach ... I have been trying to think how we could educate our dev base to understand the ops view/side of certain specs
20:17:30 <Rockyg> So, no bps.  But, the xproject spec could link to project bps that implement at the project level
20:18:16 <bknudson> the bps are more dynamic than the spec. the bp has to link to the spec.
20:18:25 <Rockyg> I think we haven't gotten more ops participation because they think in cases that they usually call incidents
20:18:54 <Rockyg> bknudson: yeah.  I see that.
20:18:56 <bknudson> operators would probably rather not have the switch die to begin with.
20:19:30 <Rockyg> So, what I want to do is use an etherpad to collect specific use cases that highlight a single feature and how it is used.
20:19:59 <Rockyg> But, if we get other use cases, we can move them around to the proper etherpad
20:20:03 <bknudson> I don't think it even needs to be that complicated -- how about just the example where a user types the wrong password and doesn't understand the response?
20:20:12 <jokke_> Rockyg: is that current features or proposed features you're referring to?
20:20:17 <Rockyg> I think we can get ops and usergroup to put use cases on an etherpad
20:20:43 <Rockyg> jokke_: both.  Start with request ids and error codes and a writeup for consistency
20:20:59 <bknudson> I think it's worth trying out the use case etherpad
20:21:03 <Rockyg> Then, if it works, we can start focusing on message payloads
20:21:11 <bknudson> if ops aren't willing to contribute then maybe it's not that important after all
20:21:13 <jokke_> Rockyg: so were you thinking to pull those then from the etherpad to the specs?
20:21:59 <Rockyg> Yes.  In some form.  But likely capture them in the use case repository the product wg is building
20:22:16 <jokke_> I see the biggest problem here not being the ops & users but if we just link some use case etherpad to the specs we get TL;DR and again wasted effort
20:22:34 <jokke_> I mean from the dev side
20:23:09 <Rockyg> The product wg intends to take the use cases as the framework for the specs.  So build the spec around the use case.  Sort of a hand-off from users to devs
20:23:43 <jokke_> k
20:24:06 <Rockyg> So, the info will be in the spec for the feature.  If devs don't like the approach, they can suggest something that works for them but gets them to understand where the spec is coming from
20:25:12 <jokke_> fair enough
20:25:14 <Rockyg> a for instance use case for the error code is:  Ops gets ERROR  Invalid VM
20:25:24 <bknudson> Rockyg: do you already have a use case that you know of?
20:25:55 <Rockyg> Yeah.  The one I just mentioned, but I likely have the message part off
20:26:15 <bknudson> I'm not an operator so I don't know if my use case is something they really care about
20:26:17 <Rockyg> It happens much too often, and is generated by flow through from multiple places in the code
20:26:29 <bknudson> I'm just a guy who has to answer questions from operators.
20:26:30 <Rockyg> Do other operators care?
20:27:17 <Rockyg> So, you are a user.  In your case, a level two or three responder.  You have to be able to quickly dig into issues.
20:27:39 <bknudson> maybe need a place in the etherpad where other ops can weigh in on whether they like the use case
20:27:47 <Rockyg> Maybe we need to actually write up something about user roles.  What the do, what they expect, etc.
20:28:28 <Rockyg> That's why the etherpad.  they don't need anything but the link and a loging and the discussion ensues.  They don't need to know or understand gerritt
20:28:30 <bknudson> as I said, I like the etherpad idea as an experiment
20:28:47 <jokke_> I start to like it ;)
20:28:48 <bknudson> might want to seed it with your use case
20:29:02 <Rockyg> Or yours.  Or both.
20:29:32 <bknudson> I can add mine.
20:30:02 <Rockyg> #link https://etherpad.openstack.org/p/Operator_Use_Cases
20:30:24 <bknudson> there's probably more issues in my use case than just that you can't track the request through the system.
20:30:48 <jokke_> bknudson: even better
20:30:55 <bknudson> although tracking the request will be a requirement
20:31:24 <Rockyg> Go for it.  I'll add an intro and put a use case out there.  the invalid thing is a hot button for a number of ops  It the trunk of a many branched problem
20:31:26 <bknudson> I'm thinking about -- keystone doesn't send back adequate error responses, horizon doesn't display them.
20:31:27 <jokke_> ignoring amit213 if needed someone ping me to remove that
20:31:34 <jokke_> amit213: please fix your client
20:32:40 <Rockyg> bknudson: I think use cases will likely encompass a number of needs.  They tend to be an integrated process that sucks in everything they touch ;-)
20:33:28 <jokke_> bknudson: that's actually pretty common issue around and not even being only related to horizon
20:33:45 <bknudson> but if we could show how "nice" it can be compared to how it is...
20:34:36 <bknudson> that you look in the keystone log and see "this user tried to login at 3:30 and the password was wrong" and tell them, vs having no idea.
20:35:06 <Rockyg> Yeah.  I would like to see for use cases: how it *should* be, plus how it is now.  The advantages of the new approach have to be very obvious.
20:35:09 <bknudson> user calls in with the error message "login failed, request ID 12423"
20:35:25 <bknudson> vs, user calls in and says "login failed"
20:35:45 <jokke_> hmm-m perhaps we should just bite the bullet and make one example ... get those things locally implemented to one specific trace we can easily reproduce that is absolutely crap at the moment and provide before & after results
20:35:48 <bknudson> and all you have is a pile of logs that show users sometimes logging in and sometimes failing with no reason.
20:37:33 <jokke_> bknudson: but telling that user failed to log in due wrong password in the logs is security threat, you know ;P
20:37:43 <jokke_> login even
20:37:47 <Rockyg> So, that points out another issue.  RequestID can't be 32 characters long. It has to be something that is transmittable over a phone line in a reasonable length of time with little to no errors
20:38:11 <jokke_> ++
20:38:20 <Rockyg> bknudson: even if the logs aren't on any public facing network ;-)
20:38:40 <jokke_> numeral rather than uuid thinking it now in that light
20:38:51 <Rockyg> that's why I started the table of logs.  So we can demonstrate which logs are security threats and which aren't
20:40:36 <Rockyg> Whereas, if the message has login failed id 1234:this  message repeats 500 times in 10 seconds, then you know you have someone trying to break in.
20:40:55 <Rockyg> And, you can save 500 lines in the log file
20:41:09 <jokke_> ;)
20:41:51 <Rockyg> You could also identify a config issue if the login attempt is from a system rather than a human
20:42:42 <Rockyg> OK.  So let's summarize some of this:
20:44:15 * dhellmann apologizes for arriving late
20:44:16 <Rockyg> #action item: Put two prospective use cases into etherpad https://etherpad.openstack.org/p/Operator_Use_Cases
20:44:16 <Rockyg> #action item: Advertise use case etherpad to ops to get them to start collaborating on it
20:44:23 <Rockyg> hi dhellmann!
20:45:10 <Rockyg> #action item:  work with other xproject wg to write up philosphy of what the user wants out of this openstack "thing"
20:46:18 <Rockyg> #action item: include a use case for clarifying *why* Error codes are worthwhile implementing
20:47:08 <jokke_> and why request Id's needs to be traceable and easily communicated
20:47:10 <Rockyg> #action item: suggest same for request id spec and for API wg
20:48:24 <Rockyg> #info What do all these groups, requests have in common?  Humans want the information to use in tracking down problems!  So make them human! readable
20:49:47 <Rockyg> Anything else?
20:50:44 <dhellmann> some of those sound like really big action items
20:50:51 <Rockyg> #topic open discussion
20:51:18 <bknudson> dhellmann: we signed you up for the items since you weren't here.
20:51:35 <dhellmann> bknudson: join the club
20:51:41 <Rockyg> dhellmann: Yeah.  But small bites in some areas and big bites in others.  Need a way to educate developers on how users are trying to use their work.
20:52:10 <jokke_> The M summit talks are open. Should we propose panel discussion around this over there?
20:52:32 <dhellmann> Rockyg: even the spec for error codes is a "boil the ocean" sort of change that's going to be hard to make incrementally. I'm worried that we're trying to solve too much at once.
20:53:09 <bknudson> this ocean needs boiling
20:53:19 <jokke_> bknudson: ++
20:53:26 <jokke_> but I agree with dhellmann
20:54:27 <bknudson> dhellmann: what's your alternative for the error codes spec?
20:54:27 <Rockyg> Yeah.  I'm working on getting enough education into the front of the spec to get a firmer foundation.  The error that all the ops hate, about invalid VM is a good example of error actually coming from multiple places and if every place that generates a log message has a separate id, then at lease there is better info
20:54:42 <jokke_> dhellmann: just before you joined us we were talking about how to document the usecases and the benefits of the proposed change. Perhaps we should take one example where it's awfully wrong and provide before/after scenario for people to look at
20:55:35 <dhellmann> bknudson: I would like for someone to think about a solution that doesn't involve us touching every single error message by hand. Can we use a MD5 sum of the error message as a code? Can we use the file and line number? I don't know the answer, but I know that we're not going to succeed if we have to change all of OpenStack by hand.
20:55:36 <jokke_> dhellmann: would you think that help to open few vulcanos under the ocean?
20:56:18 <dhellmann> jokke_: my point is that no use case justification is going to counter the fact that the size of the change you're asking for is far too big
20:57:03 <Rockyg> dhellmann: so, md5 wouldn't work because it's hard for a human.  File and line number might, even if it eventually changes when stuff gets added/removed from file
20:57:54 <jokke_> yeah the biggest problem with the file & line is that if it's dynamic we get few codes changing every commit
20:58:19 <dhellmann> it's also not clear if we need them for every single log message, or just errors sent back through the API to callers
20:58:35 <Rockyg> jokke_:  we'd have to start with that, to get it going, but then *not* change it once it's set
20:58:38 <dhellmann> because those are 2 different things, and the latter is much smaller
20:59:10 <dhellmann> Rockyg: no, I mean, we need to compute these things at runtime. We can't go through N million lines of code making this change by hand.
20:59:15 <jokke_> dhellmann: that's actually really good point. Would you think we could succeed if it was the later?
20:59:34 <dhellmann> jokke_: you still have the problem of having a lot of error cases, so I don't know
20:59:47 <dhellmann> it's far far less than the number of log calls, though
21:00:14 <Rockyg> dhellmann: you can't go changing these at run time.  Ops remember the common ones and know exactly what to do.  If they are always changing, then again, they are pretty useless to ops
21:00:20 <jokke_> how about module+integer checksum of the message?
21:00:25 <bknudson> definitely not for debug logs
21:01:06 <dhellmann> Rockyg: I understand that, but have we thought about different ways to compute an id at runtime that is unlikely to change?
21:01:12 <jokke_> bknudson: we need to get the needed logs out of debug level logging before we can make that statement ;P
21:01:26 <bknudson> y, that's part of boiling the ocean
21:01:27 <Rockyg> So, One thing that can be done:  have the placeholder print for every message that doesn't have an error code.  Ops will file bugs on messages they see a lot that don't have a code and they think they need them
21:01:31 <dhellmann> jokke_: doing *that* work is something that the devs wouldn't fight
21:01:58 <dhellmann> Rockyg: what is an "error message" to you? a log message going to ERROR level, or an API error response?
21:02:21 <bknudson> dhellmann: you think devs won't complain about having to review a bunch of logging changes?
21:02:30 <jokke_> having proj-mod-numeral checksum autogenerated would definitely work!
21:02:30 <bknudson> I doubt they'll make time for it.
21:03:11 <jokke_> like GLA-Reg-3735576 ... the numeral needs to be bit longer than optimal for humans but that would work
21:03:20 <dhellmann> bknudson: if those changes are aligned with the logging standards, and in consumable chunks (not giant patch bombs), we can make incremental progress in some projects
21:03:23 <Rockyg> This needs to be a "while you're in there" and "low hanging fruit" approach.  Start with ERROR and if needed, work down
21:03:35 <dhellmann> jokke_: where do we get the "numeral" value?
21:03:49 <Rockyg> or up.  I never remember which way this level thing on logs goes
21:03:59 <jokke_> dhellmann: as mentioned above, just integer checksum of the message
21:04:10 <jokke_> and by message I mean the payload itself
21:05:25 <Rockyg> Integer checksum might work.  That's limited to 5 places, IIrC
21:06:10 <jokke_> http://stackoverflow.com/questions/16008670/python-how-to-hash-a-string-into-8-digits something like that
21:06:42 <bknudson> are we going to put the code in the call to log.error() ?
21:06:45 <Rockyg> dhellmann: and each module would have its own pool as I was thinking the breakdown would be by modules like nov-cnd (nova conductor) or some such.  The way the log files are broken out.
21:06:59 <jokke_> it would also automatically change the error code at the point where the log message changes (which fairly often means that the situation has changed as well)
21:07:25 <jokke_> bknudson: I think that would solve this for most of the projects at once
21:07:25 <Rockyg> Funny, I was just thinking about hash codes yesterday or Monday....
21:07:37 <dhellmann> bknudson: that's what I was thinking, if we can come up with a way to do it
21:09:22 <jokke_> taking hash out of log message should not be too heavy operation and having collisions within same module means that same message is being logged in multiple places which is already worth of raising a bug
21:10:21 <jokke_> only problem is that we would like to have the hash from the english message before any dynamic insertations
21:10:24 <Rockyg> True.  And if done right, we could always fall back on fixing specific error message error codes if they change to fast or cause collisions.
21:10:57 <jokke_> we're badly out of time, but THANKS dhellmann that actually could work!
21:10:58 <Rockyg> So, hash the source code line with the %s etc.  Then do the substitution?
21:11:20 <jokke_> Rockyg: yup
21:11:20 <Rockyg> Oh, wow!  Boy, this was just getting interesting!
21:12:02 <Rockyg> So.  Lots of food for thought.  Also, dhellmann, will you be in oslo room?  I need to talk about other stuff with you.
21:12:11 <dhellmann> Rockyg: yes, I'm there
21:12:16 <Rockyg> anyway, very productive, I think.
21:12:24 <Rockyg> Thanks everyone!
21:12:26 <jokke_> Thanks everyone!
21:12:30 <Rockyg> #endmeeting