14:01:31 <rvasilets_> #startmeeting Rally
14:01:32 <openstack> Meeting started Mon Feb 15 14:01:31 2016 UTC and is due to finish in 60 minutes.  The chair is rvasilets_. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:35 <openstack> The meeting name has been set to 'rally'
14:01:48 <rvasilets_> Hi to all
14:03:07 <andreykurilin> hi!
14:03:09 <andreykurilin> o/
14:03:11 <rvasilets_> looks like we don't have a lot of topics for today
14:03:20 <andreykurilin> are you sure?)
14:03:26 <rvasilets_> Let's wait a bit
14:03:29 <ikhudoshyn> o/
14:03:38 <rvasilets_> Maybe someone else will join us
14:04:18 <ikhudoshyn> andreykurilin: around? you seemed to have a topic
14:05:00 <andreykurilin> yes, I'm here:) let's wait a bit for other attendees
14:05:23 <ikhudoshyn> i'd love to report 'bout my install_rally.sh refactoring (we want to install rally in a venv on gates) but i'm in the middle of testing
14:05:30 <andreykurilin> :)
14:05:47 <ikhudoshyn> ..so I won't take your time today))
14:06:02 <andreykurilin> let's start?
14:06:07 <rvasilets_> Okey
14:06:14 <andreykurilin> I have one topic
14:06:32 <andreykurilin> raised by saurabh__ in our main chat today
14:06:43 <rvasilets_> andreykurilin, lets start from you
14:06:48 <andreykurilin> ok ok
14:06:53 <rvasilets_> what is the topic?
14:07:15 <andreykurilin> keystone can kill rally
14:07:16 <andreykurilin> lol
14:07:26 <andreykurilin> sounds like a good topic :D
14:07:37 <andreykurilin> #topic keystone can kill rally
14:07:51 <andreykurilin> rvasilets, can you set a topic?
14:08:03 <rvasilets_> #topic keystone can kill rally
14:08:26 <andreykurilin> nice:)
14:08:39 <rvasilets_> this is my privilege)
14:09:36 <andreykurilin> In case of a "dead" keystone and a big number of parallel iterations, keystoneclient will open a lot of sockets
14:10:32 <andreykurilin> saurabh__ hit the issue when rally was unable to open the db file to write the task results
14:10:45 <andreykurilin> sqlite was used
14:11:03 <rvasilets_> Is that a problem of rally or sqlite for example?
14:11:43 <rvasilets_> This is a limitation of sqlite, not Rally
14:11:48 <rvasilets_> possibly
14:11:50 <rvasilets_> ?
14:11:57 <andreykurilin> it is a limitation of the system in general
14:12:31 <andreykurilin> the problem on the rally side is that we don't handle such cases
14:13:35 <andreykurilin> Maybe, we can check the limit before saving results and increase it if possible
14:13:38 <rvasilets_> Could we fix this somehow?
14:14:00 <andreykurilin> At least, we can catch the error and write user-friendly error
14:14:06 <rvasilets_> limit of what?
14:14:31 <andreykurilin> limit of "open files"
14:14:36 <ikhudoshyn> andreykurilin: sorry i kinda dont follow. how do lots of open sockets prevent us from writing to sqlite?
14:14:47 <ikhudoshyn> i see
14:15:26 <andreykurilin> ikhudoshyn: it depends on system settings
14:15:35 <ikhudoshyn> maybe just post a warning?
14:15:46 <andreykurilin> when?)
14:15:54 <ikhudoshyn> during parsing of scenario?
14:15:57 <rvasilets_> the biggest thing we could do is raise a user-friendly msg here
14:16:24 <andreykurilin> ikhudoshyn: each time? it will bother users
14:16:41 <andreykurilin> we already have 2 warnings (from boto and from requests)
14:16:48 <andreykurilin> and I want to remove them:)
14:16:49 <ikhudoshyn> like 'dear user, you are about to run lots of iterations, you might need many open sockets, pls make sure you can'
14:17:27 <ikhudoshyn> can you increase limits at runtime, without being root?
14:17:45 <andreykurilin> ikhudoshyn: I suppose we can check the system limit before launching task and print a warning
14:17:56 <andreykurilin> ikhudoshyn: I don't have such experience:)
14:18:26 <ikhudoshyn> andreykurilin: that's what i suggested, that did not seem to satisfy u
14:18:35 <rvasilets_> Did we file the bug?
14:18:51 <andreykurilin> ikhudoshyn: https://docs.python.org/2/library/resource.html#resource.setrlimit
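The check discussed here could look roughly like this (a minimal sketch using the stdlib `resource` module linked above; the helper name `raise_nofile_limit` is hypothetical, not an existing Rally function):

```python
import resource


def raise_nofile_limit():
    """Raise the soft "open files" limit toward the hard limit.

    No root is needed to raise the soft limit up to the hard limit,
    which partly answers ikhudoshyn's question; some platforms refuse
    values below the reported hard limit, hence the ValueError guard.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except ValueError:
            pass  # platform refused the requested value; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Rally could call such a helper before launching a task and print a warning if the resulting limit still looks too low for the configured concurrency.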
14:18:56 <ikhudoshyn> I mean we parse scenario, check limits, if they are too low -- we warn
14:19:03 <rvasilets_> this is a really bad thing
14:19:11 <andreykurilin> maybe, it is possible to change a limit
14:19:20 <andreykurilin> but we need to check
14:19:41 <andreykurilin> rvasilets_: no, we haven't filed a bug yet
14:19:57 <ikhudoshyn> well, I'm not sure this could be a good idea -- changing system settings quietly
14:20:11 <rvasilets_> agree
14:20:35 <rvasilets_> we should just show error or warning
14:20:44 <rvasilets_> and steps how to fix it
14:21:15 <ikhudoshyn> andreykurilin: what d'you think?
14:22:20 <andreykurilin> ikhudoshyn: It would be nice to have the check you proposed for the sqlite backend, and a user-friendly error in the db-layer
14:23:08 <ikhudoshyn> why db layer? I believe it's a somewhat wider issue
14:23:51 <ikhudoshyn> like we e.g. could run into an issue when we're unable to open sockets as well as files?
14:24:20 <andreykurilin> currently, we faced such an issue at the db-layer:) http://paste.openstack.org/show/486988/
14:24:42 <ikhudoshyn> so it is not just 'we'll possibly be unable to store results in db' but 'we'll possibly be unable to do any writes/reads'
14:25:17 <andreykurilin> yes
14:25:53 <ikhudoshyn> so db layer does not look like the very best place)
14:27:02 <andreykurilin> the task layer already tries to catch all errors and wrap them in a user-friendly exception
14:27:56 <andreykurilin> and only the db-layer is not wrapped with any try...except
14:29:11 <ikhudoshyn> it's not good) but it is not necessarily connected to system limits
14:29:30 <rvasilets_> we need a bigger count of reraising =)
14:30:49 <andreykurilin> ikhudoshyn: Are you talking about the paste posted above?
14:31:17 <ikhudoshyn> nope, i'm talking about the 'limits' issue in general
14:32:31 <andreykurilin> I know about only one limit - open files:) which relates to opening new files, new sockets and new threads:)
14:32:47 <ikhudoshyn> if we're sure that the 'pastebin' issue is related to 'limits' -- even then i dont think it is a good idea to catch the exception and print a warning like 'shit happens during operating with sqlite -- check limits'
14:33:18 <ikhudoshyn> andreykurilin: yes -- we are talking about THAT limit )
14:33:51 <rvasilets_> =)
14:35:43 <ikhudoshyn> so...
14:35:52 <andreykurilin> why don't you think that it is a good idea? imo, it would be nice to catch such errors and, maybe, execute "time.sleep()" and try again
14:36:19 <rvasilets_> where we would sleep() ?)
14:36:45 <ikhudoshyn> rvasilets_: at home)
14:36:54 <rvasilets_> reraising the error we could lose the trace
14:37:12 <rvasilets_> and not find the exact occurrence of the error
14:37:23 <rvasilets_> I'm for a warning
14:37:36 <ikhudoshyn> andreykurilin: from what I could see in the paste -- nothing gives any hint that the issue is related to the limit of open files
14:37:56 <andreykurilin> ikhudoshyn: It was not a full log
14:37:57 <andreykurilin> lol
14:38:03 <andreykurilin> http://paste.openstack.org/show/486959/
14:38:05 <andreykurilin> look here
14:38:07 <ikhudoshyn> hm.. nice,)
14:38:10 <andreykurilin> L3
14:38:53 <andreykurilin> rvasilets_: we have log.exception to store an original trace
14:39:05 <ikhudoshyn> Failed to consume a task from the queue: Unable to establish connection to https://192.169.123.50:5000/v2.0
14:39:22 <ikhudoshyn> see? we got lot's of issues here, not just db related
14:40:04 <andreykurilin> ikhudoshyn: i started this topic from the phrase "keystone can kill rally"
14:40:06 <andreykurilin> :)
14:40:13 <ikhudoshyn> so I strongly suggest to print a warning during scenario parsing/validation
14:40:23 <ikhudoshyn> ))
14:40:28 <rvasilets_> yea) +1 for warning)
14:40:41 <andreykurilin> so, keystone is dead -> keystoneclient continues to open new sockets -> rally fails to write the results
14:41:05 <ikhudoshyn> andreykurilin: it can indeed. but catching exception at db layer won't save us anyway))
14:41:56 <andreykurilin> we can print the results (json.dumps) to stdout if we are unable to save to the db
14:41:57 <andreykurilin> lol
14:42:05 <rvasilets_> lol
14:42:14 <ikhudoshyn> we don't want to 'sleep()' until all ks client connections close and release file handles, do we?
14:42:49 <andreykurilin> we can add one more try after several seconds
14:42:54 <andreykurilin> it can help
14:43:02 <andreykurilin> and it will not produce a big delay
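The "one more try in several seconds" idea could be sketched like this (an illustrative wrapper, not Rally code; `write_results` stands in for the db-layer save call):

```python
import sqlite3
import time


def save_with_retry(write_results, retries=1, delay=2):
    """Attempt a db write; on OperationalError, sleep briefly and retry.

    With retries=1 this adds at most one extra attempt, so it does not
    produce a big delay, but can survive a transient "unable to open
    database file" caused by an exhausted open-files limit.
    """
    for attempt in range(retries + 1):
        try:
            return write_results()
        except sqlite3.OperationalError:
            if attempt == retries:
                raise  # still failing: reraise for the user-friendly handler
            time.sleep(delay)
```

As ikhudoshyn notes below, this only helps when the cause is transient (sockets being released); it cannot help with a full disk or missing access rights.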
14:43:20 <ikhudoshyn> and what if we just run out of disk space))?
14:43:56 <ikhudoshyn> we still won't be able to store data, after 1 sec or after 100
14:44:18 <andreykurilin> yes, but it is another issue, which should be fixed separately
14:44:22 <andreykurilin> or not fixed:)
14:45:37 <ikhudoshyn> disagree. the issue is 'we can't write to the db'
14:46:08 <rvasilets_> the error would be sqlite3.OperationalError
14:46:28 <rvasilets_> the same as for many things)
14:46:39 <ikhudoshyn> what we could do is to list all possible reasons to the user (which i don't believe to be a good idea) or to describe the reasons in a runbook
14:46:43 <andreykurilin> ok, but this issue can appear in different cases. some of them can be processed and handled, others - not
14:46:55 <rvasilets_> only difference is the trace) and possibly the msg
14:47:05 <andreykurilin> +1 for a runbook
14:47:23 <ikhudoshyn> ))
14:47:43 <ikhudoshyn> and getting back to the separate warning -- do we need it?
14:48:34 <andreykurilin> I prefer a warning after the event :)
14:49:21 <ikhudoshyn> which event? sqlite3.OperationalError
14:49:23 <ikhudoshyn> ?
14:50:07 <rvasilets_> catching the error, checking the number of used vs available resources and writing a warning - possibly we can do that, I don't see the evil here
14:50:14 <ikhudoshyn> so to be consistent you should say 'db shit happens -- pls check limits, free space, access rights.. what else?'
14:51:21 <rvasilets_> we could catch 'too many open files'
14:51:24 <andreykurilin> we can check limits and free space in the "catch" code
14:51:35 <andreykurilin> and write a proper message
14:51:35 <rvasilets_> and check keystone
14:51:37 <rvasilets_> here
14:52:11 <rvasilets_> BECAUSE KEYSTONE FAILING UNDER LOAD IS A COMMON PROBLEM
14:52:15 <rvasilets_> sorry
14:52:29 <ikhudoshyn> ok i give up... you are going to create the whole recovery and diagnostic system for just one specific case
14:52:46 <rvasilets_> and all this stuff was caused by failed keystone
14:53:05 <rvasilets_> )
14:53:12 <ikhudoshyn> rvasilets_: yesssss, but we're talking about file limits and not ks at all
14:53:19 <rvasilets_> we could use the simple rule
14:53:22 <rvasilets_> 80/20
14:54:02 <ikhudoshyn> so what is 80/20 going to tell you in the case of sqlite3.OperationalError
14:54:04 <ikhudoshyn> ?
14:54:14 <ikhudoshyn> ks sux?
14:54:35 <andreykurilin> :)
14:54:49 <andreykurilin> ikhudoshyn: btw, we already try to catch something - https://github.com/openstack/rally/blob/master/rally/cli/cliutils.py#L567-L572
14:54:57 <ikhudoshyn> if we see a gazillion of 'Unable to establish connection to https://192.169.123.50:5000/v2.0'
14:55:26 <ikhudoshyn> we could say it is a ks issue, but if we can't write to the db -- why is ks involved at all?
14:56:09 <andreykurilin> ikhudoshyn: Can we reserve one "open file" for db-stuff?
14:56:24 <ikhudoshyn> andreykurilin: that is a sample of good warning. 'db issue -- pls check yr db'
14:56:57 <ikhudoshyn> but 'db issue -- pls check yr limits.. or check yr ks' -- it is a bad warning))
14:57:20 <andreykurilin> ikhudoshyn: ok, but the user will check the rights and disk space and will not find the reason for the issue
14:57:33 <ikhudoshyn> andreykurilin: I dont know. If we could -- it would be great
14:58:07 <andreykurilin> ikhudoshyn: free resources and disk space can be checked by us ;)
14:58:09 <ikhudoshyn> i was thinking about keeping it always open -- but it could be fragile
14:58:12 <andreykurilin> and even rights
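The diagnostics andreykurilin suggests (limits, free space, rights checked in the "catch" code) might be sketched like this; the function name `diagnose_db_error` and the hint wording are illustrative, not existing Rally code:

```python
import os
import resource
import shutil


def diagnose_db_error(db_path):
    """Gather hints for a user-friendly message after a failed db write.

    Checks the open-files limit, free disk space and write access to
    the directory holding the sqlite file, so the error message can
    point at the likely cause instead of a bare OperationalError.
    """
    db_dir = os.path.dirname(os.path.abspath(db_path))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    free_mb = shutil.disk_usage(db_dir).free // (1024 * 1024)
    hints = [
        "open files limit: soft=%s hard=%s" % (soft, hard),
        "free disk space in %s: %s MB" % (db_dir, free_mb),
    ]
    if not os.access(db_dir, os.W_OK):
        hints.append("no write access to %s" % db_dir)
    return hints
```

Such hints could be appended to the exception message in the except-block around the db save, which keeps the warning specific without listing every possible reason to the user.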
14:58:57 <rvasilets_> We don't have much time left - did we agree on anything?
14:58:59 <rvasilets_> )
14:59:09 <ikhudoshyn> andreykurilin: ^^ hm.. do you want to check everything in case of an issue?
14:59:14 <andreykurilin> maybe
14:59:18 <ikhudoshyn> btw we're out of time
14:59:19 <andreykurilin> why not?
14:59:34 <rvasilets_> Okey
14:59:37 <andreykurilin> let's move to our general chat
14:59:42 <andreykurilin> *main
14:59:45 <ikhudoshyn> lets continue in slack
14:59:47 <rvasilets_> #agree almost agreed)
15:00:01 <rvasilets_> See you next meeting
15:00:05 <andreykurilin> see you
15:00:09 <rvasilets_> #endmeeting