20:01:20 #startmeeting log_wg
20:01:20 Meeting started Wed Jun 17 20:01:20 2015 UTC and is due to finish in 60 minutes. The chair is Rockyg. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:01:23 The meeting name has been set to 'log_wg'
20:01:30 Hello
20:02:15 Hey. So, my chat client is messed up and I can't see the folks in the room to ping the ones we need. Would you please?
20:02:53 give me just a sec, do we have the courtesy list somewhere?
20:03:12 Ah. got it. ping for dhellmann, bknudson
20:03:19 hi
20:03:28 hi there
20:03:42 didn't know we could make a courtesy list
20:04:14 nkrinner
20:04:18 who else?
20:06:09 I think we should be good to go
20:06:52 OK. we had an agenda for last week
20:07:17 One item I saw on log about request-id
20:07:38 jokke_: you couldn't remember if you had replied to that thread
20:07:44 I don't think you did.
20:08:18 Great if you could
20:08:58 I'd like to discuss some of the problems we are having on all of these "user facing" specs
20:09:18 #topic problems in moving user facing specs forward
20:09:26 yes!
20:09:46 I am seeing a pattern in API wg, Log wg and the request-id spec
20:10:10 we have problems getting reviews for specs in keystone, too.
20:10:21 maybe it's a general issue with specs
20:10:23 The problem is that the devs do not understand why users want something and how they will use it
20:10:59 bknudson: that's a very good possibility. I am certainly confused between specs and bps
20:11:07 how about adding a section to the spec that explains why users want something and how they will use it?
20:11:37 I thought specs would be the what of a problem definition, with solution approaches, and the bps the how, but it seems like both are how
20:12:13 bknudson: exactly. so what I want to do is gather the use cases and have that section up front
20:12:33 that seems to be the general way that specs are used, they describe a solution.
20:12:58 bknudson: I'd say that's a separate issue ... we slightly touched this last week but the problem seems to be something we perhaps could influence with education.
20:13:17 So, for instance: use case: switch dies. Don't have map of hosts to vms. How do I track and repair? Then the steps an ops would want to go through to solve the issues
20:13:35 Rockyg: I think that would be useful.
20:14:53 So, there are two things users need to come up with: 1) Philosophies of how users expect to interact with cloud 2) Specific use cases to help define need and characteristics of feature
20:14:56 TBH this is actually funny to see ... my personal take was that specs replaced bps and the only reason why bps are still made is to link the spec to launchpad
20:15:40 jokke_: that's how they're used in keystone.
20:15:49 The philosophy stuff is why logs have to be human readable. Why a mnemonic is better than a UUID, etc.
20:16:06 there's no launchpad for cross-project specs as far as I know.
20:16:11 so no bps
20:16:20 The use case shows how a feature is useful
20:16:27 bknudson: you're probably right
20:16:40 bknudson: there might be, but if there is, it's openstack
20:17:29 Rockyg: that sounds like a good approach ... I have been trying to think how we could educate our dev base to understand the ops view/side of certain specs
20:17:30 So, no bps. But, the xproject spec could link to project bps that implement at the project level
20:18:16 the bps are more dynamic than the spec. the bp has to link to the spec.
20:18:25 I think we haven't gotten more ops participation because they think in cases that they usually call incidents
20:18:54 bknudson: yeah. I see that.
20:18:56 operators would probably rather not have the switch die to begin with.
20:19:30 So, what I want to do is use an etherpad to collect specific use cases that highlight a single feature and how it is used.
20:19:59 But, if we get other use cases, we can move them around to the proper etherpad
20:20:03 I don't think it even needs to be that complicated -- how about just the example where a user types the wrong password and doesn't understand the response?
20:20:12 Rockyg: is that current features or proposed features you're referring to?
20:20:17 I think we can get ops and usergroup to put use cases on an etherpad
20:20:43 jokke_: both. Start with request ids and error codes and a writeup for consistency
20:20:59 I think it's worth trying out the use case etherpad
20:21:03 Then, if it works, we can start focusing on message payloads
20:21:11 if ops aren't willing to contribute then maybe it's not that important after all
20:21:13 Rockyg: so were you thinking to pull those then from the etherpad to the specs?
20:21:59 Yes. In some form. But likely capture them in the use case repository the product wg is building
20:22:16 I see the biggest problem here not being the ops & users but if we just link some use case etherpad to the specs we get TL;DR and again wasted effort
20:22:34 I mean from the dev side
20:23:09 The product wg intends to take the use cases as the framework for the specs. So build the spec around the use case. Sort of a hand-off from users to devs
20:23:43 k
20:24:06 So, the info will be in the spec for the feature. If devs don't like the approach, they can suggest something that works for them but gets them to understand where the spec is coming from
20:25:12 fair enough
20:25:14 a for instance use case for the error code is: Ops gets ERROR Invalid VM
20:25:24 Rockyg: do you already have a use case that you know of?
20:25:55 Yeah.
The one I just mentioned, but I likely have the message part off
20:26:15 I'm not an operator so I don't know if my use case is something they really care about
20:26:17 It happens much too often, and is generated by flow through from multiple places in the code
20:26:29 I'm just a guy who has to answer questions from operators.
20:26:30 Do other operators care?
20:27:17 So, you are a user. In your case, a level two or three responder. You have to be able to quickly dig into issues.
20:27:39 maybe need a place in the etherpad where other ops can weigh in on whether they like the use case
20:27:47 Maybe we need to actually write up something about user roles. What they do, what they expect, etc.
20:28:28 That's why the etherpad. They don't need anything but the link and a login and the discussion ensues. They don't need to know or understand gerrit
20:28:30 as I said, I like the etherpad idea as an experiment
20:28:47 I start to like it ;)
20:28:48 might want to seed it with your use case
20:29:02 Or yours. Or both.
20:29:32 I can add mine.
20:30:02 #link https://etherpad.openstack.org/p/Operator_Use_Cases
20:30:24 there's probably more issues in my use case than just that you can't track the request through the system.
20:30:48 bknudson: even better
20:30:55 although tracking the request will be a requirement
20:31:24 Go for it. I'll add an intro and put a use case out there. The invalid thing is a hot button for a number of ops. It's the trunk of a many-branched problem
20:31:26 I'm thinking about -- keystone doesn't send back adequate error responses, horizon doesn't display them.
20:31:27 ignoring amit213 if needed someone ping me to remove that
20:31:34 amit213: please fix your client
20:32:40 bknudson: I think use cases will likely encompass a number of needs.
They tend to be an integrated process that sucks in everything they touch ;-)
20:33:28 bknudson: that's actually a pretty common issue around and not even only related to horizon
20:33:45 but if we could show how "nice" it can be compared to how it is...
20:34:36 that you look in the keystone log and see "this user tried to login at 3:30 and the password was wrong" and tell them, vs having no idea.
20:35:06 Yeah. I would like to see for use cases: how it *should* be, plus how it is now. The advantages of the new approach have to be very obvious.
20:35:09 user calls in with the error message "login failed, request ID 12423"
20:35:25 vs, user calls in and says "login failed"
20:35:45 hmm-m perhaps we should just bite the bullet and make one example ... get those things locally implemented to one specific trace we can easily reproduce that is absolutely crap at the moment and provide before & after results
20:35:48 and all you have is a pile of logs that show users sometimes logging in and sometimes failing with no reason.
20:37:33 bknudson: but telling that the user failed to log in due to a wrong password in the logs is a security threat, you know ;P
20:37:43 login even
20:37:47 So, that points out another issue. RequestID can't be 32 characters long. It has to be something that is transmittable over a phone line in a reasonable length of time with little to no errors
20:38:11 ++
20:38:20 bknudson: even if the logs aren't on any public facing network ;-)
20:38:40 numeral rather than uuid thinking it now in that light
20:38:51 that's why I started the table of logs. So we can demonstrate which logs are security threats and which aren't
20:40:36 Whereas, if the message has login failed id 1234: this message repeats 500 times in 10 seconds, then you know you have someone trying to break in.
20:40:55 And, you can save 500 lines in the log file
20:41:09 ;)
20:41:51 You could also identify a config issue if the login attempt is from a system rather than a human
20:42:42 OK.
So let's summarize some of this:
20:44:15 * dhellmann apologizes for arriving late
20:44:16 #action item: Put two prospective use cases into etherpad https://etherpad.openstack.org/p/Operator_Use_Cases
20:44:16 #action item: Advertise use case etherpad to ops to get them to start collaborating on it
20:44:23 hi dhellmann!
20:45:10 #action item: work with other xproject wg to write up philosophy of what the user wants out of this openstack "thing"
20:46:18 #action item: include a use case for clarifying *why* Error codes are worthwhile implementing
20:47:08 and why request IDs need to be traceable and easily communicated
20:47:10 #action item: suggest same for request id spec and for API wg
20:48:24 #info What do all these groups, requests have in common? Humans want the information to use in tracking down problems! So make them human! readable
20:49:47 Anything else?
20:50:44 some of those sound like really big action items
20:50:51 #topic open discussion
20:51:18 dhellmann: we signed you up for the items since you weren't here.
20:51:35 bknudson: join the club
20:51:41 dhellmann: Yeah. But small bites in some areas and big bites in others. Need a way to educate developers on how users are trying to use their work.
20:52:10 The M summit talks are open. Should we propose a panel discussion around this over there?
20:52:32 Rockyg: even the spec for error codes is a "boil the ocean" sort of change that's going to be hard to make incrementally. I'm worried that we're trying to solve too much at once.
20:53:09 this ocean needs boiling
20:53:19 bknudson: ++
20:53:26 but I agree with dhellmann
20:54:27 dhellmann: what's your alternative for the error codes spec?
20:54:27 Yeah. I'm working on getting enough education into the front of the spec to get a firmer foundation.
The error that all the ops hate, about invalid VM, is a good example of an error actually coming from multiple places, and if every place that generates a log message has a separate id, then at least there is better info
20:54:42 dhellmann: just before you joined us we were talking about how to document the use cases and the benefits of the proposed change. Perhaps we should take one example where it's awfully wrong and provide a before/after scenario for people to look at
20:55:35 bknudson: I would like for someone to think about a solution that doesn't involve us touching every single error message by hand. Can we use an MD5 sum of the error message as a code? Can we use the file and line number? I don't know the answer, but I know that we're not going to succeed if we have to change all of OpenStack by hand.
20:55:36 dhellmann: would you think that would help to open a few volcanoes under the ocean?
20:56:18 jokke_: my point is that no use case justification is going to counter the fact that the size of the change you're asking for is far too big
20:57:03 dhellmann: so, md5 wouldn't work because it's hard for a human. File and line number might, even if it eventually changes when stuff gets added/removed from file
20:57:54 yeah the biggest problem with the file & line is that if it's dynamic we get a few codes changing every commit
20:58:19 it's also not clear if we need them for every single log message, or just errors sent back through the API to callers
20:58:35 jokke_: we'd have to start with that, to get it going, but then *not* change it once it's set
20:58:38 because those are 2 different things, and the latter is much smaller
20:59:10 Rockyg: no, I mean, we need to compute these things at runtime. We can't go through N million lines of code making this change by hand.
20:59:15 dhellmann: that's actually a really good point. Would you think we could succeed if it was the latter?
20:59:34 jokke_: you still have the problem of having a lot of error cases, so I don't know
20:59:47 it's far far less than the number of log calls, though
21:00:14 dhellmann: you can't go changing these at run time. Ops remember the common ones and know exactly what to do. If they are always changing, then again, they are pretty useless to ops
21:00:20 how about module+integer checksum of the message?
21:00:25 definitely not for debug logs
21:01:06 Rockyg: I understand that, but have we thought about different ways to compute an id at runtime that is unlikely to change?
21:01:12 bknudson: we need to get the needed logs out of debug level logging before we can make that statement ;P
21:01:26 y, that's part of boiling the ocean
21:01:27 So, One thing that can be done: have the placeholder print for every message that doesn't have an error code. Ops will file bugs on messages they see a lot that don't have a code and they think they need them
21:01:31 jokke_: doing *that* work is something that the devs wouldn't fight
21:01:58 Rockyg: what is an "error message" to you? a log message going to ERROR level, or an API error response?
21:02:21 dhellmann: you think devs won't complain about having to review a bunch of logging changes?
21:02:30 having proj-mod-numeral checksum autogenerated would definitely work!
21:02:30 I doubt they'll make time for it.
21:03:11 like GLA-Reg-3735576 ... the numeral needs to be bit longer than optimal for humans but that would work
21:03:20 bknudson: if those changes are aligned with the logging standards, and in consumable chunks (not giant patch bombs), we can make incremental progress in some projects
21:03:23 This needs to be a "while you're in there" and "low hanging fruit" approach. Start with ERROR and if needed, work down
21:03:35 jokke_: where do we get the "numeral" value?
21:03:49 or up.
I never remember which way this level thing on logs goes
21:03:59 dhellmann: as mentioned above, just integer checksum of the message
21:04:10 and by message I mean the payload itself
21:05:25 Integer checksum might work. That's limited to 5 places, IIRC
21:06:10 http://stackoverflow.com/questions/16008670/python-how-to-hash-a-string-into-8-digits something like that
21:06:42 are we going to put the code in the call to log.error() ?
21:06:45 dhellmann: and each module would have its own pool as I was thinking the breakdown would be by modules like nov-cnd (nova conductor) or some such. The way the log files are broken out.
21:06:59 it would also automatically change the error code at the point where the log message changes (which fairly often means that the situation has changed as well)
21:07:25 bknudson: I think that would solve this for most of the projects at once
21:07:25 Funny, I was just thinking about hash codes yesterday or Monday....
21:07:37 bknudson: that's what I was thinking, if we can come up with a way to do it
21:09:22 taking a hash of the log message should not be too heavy an operation, and having collisions within the same module means that the same message is being logged in multiple places, which is already worth raising a bug
21:10:21 only problem is that we would like to have the hash from the english message before any dynamic insertions
21:10:24 True. And if done right, we could always fall back on fixing specific error message error codes if they change too fast or cause collisions.
21:10:57 we're badly out of time, but THANKS dhellmann that actually could work!
21:10:58 So, hash the source code line with the %s etc. Then do the substitution?
21:11:20 Rockyg: yup
21:11:20 Oh, wow! Boy, this was just getting interesting!
21:12:02 So. Lots of food for thought. Also, dhellmann, will you be in oslo room? I need to talk about other stuff with you.
21:12:11 Rockyg: yes, I'm there
21:12:16 anyway, very productive, I think.
21:12:24 Thanks everyone!
21:12:26 Thanks everyone!
21:12:30 #endmeeting
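
The idea floated at the end of the meeting (hash the message template before the %s substitution, and prefix it with project and module, as in GLA-Reg-3735576) could look roughly like the sketch below. This is an illustration only, not something agreed in the meeting or present in any OpenStack library: the function names error_code and log_error, the choice of zlib.crc32 as the "integer checksum", and the 7-digit width are all assumptions.

```python
import logging
import zlib


def error_code(project, module, msg_template, digits=7):
    """Derive a code like 'GLA-Reg-0123456' from the unformatted template.

    Hypothetical helper: the checksum is taken over the template text
    (with its %s placeholders still in place), so the code does not
    depend on the values substituted at runtime.
    """
    checksum = zlib.crc32(msg_template.encode("utf-8")) & 0xFFFFFFFF
    numeral = checksum % (10 ** digits)  # keep it short enough to read aloud
    return f"{project}-{module}-{numeral:0{digits}d}"


def log_error(logger, project, module, msg_template, *args):
    """Log at ERROR level with the derived code prepended (hypothetical)."""
    code = error_code(project, module, msg_template)
    # Substitution happens only here, after the code has been computed.
    logger.error("[%s] " + msg_template, code, *args)
    return code


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    log = logging.getLogger("demo")
    # Same template, different users: both calls yield the same code.
    log_error(log, "GLA", "Reg", "Login failed for user %s", "alice")
    log_error(log, "GLA", "Reg", "Login failed for user %s", "bob")
```

As noted in the discussion, the code changes whenever the template text changes, and two templates in the same module can collide once the checksum is truncated, so a real scheme would probably need a way to pin or override individual codes.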