17:00:12 #startmeeting qa
17:00:13 Meeting started Thu Mar 2 17:00:12 2017 UTC and is due to finish in 60 minutes. The chair is andreaf. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:17 The meeting name has been set to 'qa'
17:00:27 hello, who's around today?
17:00:31 o/
17:00:39 andreaf: hello
17:00:52 Today's agenda: #link https://wiki.openstack.org/wiki/Meetings/QATeamMeeting#Agenda_for_March_2nd_2017_.281700_UTC.29
17:00:55 o/
17:00:55 o/
17:01:05 \o
17:01:45 mtreinish, oomichi, afazekas: around?
17:01:58 andreaf: hi
17:02:14 o/
17:02:42 ok we have quite a long dense agenda so let's get started
17:02:46 #topic PTG
17:03:26 About PTG I just wanted to thank everyone who attended
17:03:34 the team pictures are in the ML :)
17:03:43 o/
17:03:50 hehe, nice pic
17:03:57 we set priorities for Pike: #link https://etherpad.openstack.org/p/pike-qa-priorities
17:04:24 if you want to review what happened the list of all etherpads is at #link https://etherpad.openstack.org/p/qa-ptg-pike
17:05:27 that's all I had on PTG, so moving on
17:05:28 #topic Gate Status
17:06:03 We are at the beginning of the Pike cycle now, and we must make sure that the gate is helping our community and not stopping people from getting work done
17:06:32 The failure rate is currently too high in the gate, so we've been discussing how to get things back under control
17:06:42 Top recheck issues #link http://status.openstack.org/elastic-recheck/gate.html, http://lists.openstack.org/pipermail/openstack-dev/2017-February/113052.html
17:06:58 One issue was already fixed: #link https://review.openstack.org/#/c/439638/
17:07:27 it turns out Tempest did not close the connection on failure, which caused the ssh banner issue
17:08:06 but still we have our SUT running under high load for a large part of test runs and we suspect that may be the actual underlying issue behind a lot of the flakiness we've experienced
17:08:43 we discussed a lot in the past about pruning our scenario tests which we should not really be having in tempest
17:08:53 and now it's a good time to move ahead on that plan
17:09:04 we made an etherpad: #link https://ethercalc.openstack.org/nu56u2wrfb2b
17:09:27 proposing which scenarios to keep in the gate - for now we would simply skip a number of scenarios
17:09:38 not actually remove them
17:09:51 but we will need eventually to move some out of tempest
17:10:05 yeh, the concrete patch I have up there just moves them into the slow bucket
17:10:18 which means that you can run them under -e all if you want
17:10:25 #link https://review.openstack.org/#/c/439698/ the patch
17:10:50 and also sdague's patch runs scenario tests serially
17:11:11 andreaf: current above patch seems running scenario tests as serial, right?
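For reference, the "slow bucket" mentioned above works by attribute tagging: below is a hedged sketch of what that looks like on the Tempest side. The class and test names are invented, and the decorator import paths have moved between Tempest releases, so treat this as illustrative only.

```python
# Illustrative only: a scenario test tagged as 'slow'. Tests carrying this
# attribute are filtered out of the default integrated runs but are still
# selected when a job runs everything (e.g. via tox -e all).
from tempest.lib import decorators
from tempest.scenario import manager


class TestExampleScenario(manager.ScenarioTest):  # hypothetical class name

    @decorators.attr(type='slow')
    def test_boot_server_from_volume(self):  # hypothetical test name
        # A real test would also carry an idempotent_id decorator and the
        # actual orchestration steps, elided here.
        pass
```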
17:11:16 oomichi: yes
17:11:20 andreaf: yeah,
17:11:27 I would ask folks to review the list and if you have any concern with any specific test please put a comment on the ethercalc
17:11:35 we should try to get that patch approved by end of day tomorrow
17:11:49 it is roughly jordanP's list of tests from the ethercalc
17:12:10 one thing that that approach does not cover though is load by API tests - some of which can be quite heavy
17:12:25 so another patch I proposed reduces concurrency to three #link https://review.openstack.org/#/c/439879/
17:12:35 the inspiration came from some of the nova conversation with ceph folks on their failures, where they think a lot of it is load related, and we said they should just trim back to the important stuff, and run less parallel to get things under control
17:12:36 my proposal would be to actually combine the two things
17:12:48 and that seems like a reasonable starting point for us as well
17:13:16 andreaf: well, the concurrency drop should be in d-g right? not https://review.openstack.org/#/c/439879/
17:13:28 in parallel I'm working on identifying which tests are doing more resource allocation so we can verify if they really need to do so
17:13:31 can we switch the default concurrency level to 3 or 4 also in ostestr and tempest run?
17:13:59 this load problem has hit also other places (like the RDO CI)
17:14:13 yes, it can be patched in every place, but if the default is saner...
17:14:22 tosky: yeh, it's definitely worth the conversation to figure out how to back it down
17:14:43 sdague: yeah I guess concurrency is specific to the sizing of the SUT, which is managed by d-g in our case, so it would be more appropriate in there
17:14:46 tosky: especially if it turns out that load reduction really makes the world a lot better
17:14:51 sdague: only it will have to go in a number of jobs
17:15:03 I am ok to keep scenario tests with serial if we can get the gate status stable again with sdague patch
17:15:07 part of the current problem is the data set we can get off a patch is limited
17:16:02 so my feeling is to move forward with the serial scenario patch, see what the macro gate effect ends up being after a week, evaluate if it was effective, and if so, what other changes should be made with that data
17:16:03 so any concern anyone on this?
17:16:40 sdague: well the thing is that we've seen already a failure on API tests in your patch
17:16:50 andreaf: will we have heavy load tests like current job as non-voting?
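As an aside on the "which tests are doing more resource allocation" investigation: per-test run time is one cheap proxy that can be pulled straight from a job's subunit stream. A rough sketch follows, assuming python-subunit and testtools are installed; the script name and the `testr last --subunit` pipeline are just one way to feed it, and this is not the team's actual tooling.

```python
#!/usr/bin/env python
# Rough sketch: print the longest-running tests from a subunit v2 stream,
# e.g. `testr last --subunit | python slowest_tests.py` (file name invented).
import sys

import subunit
import testtools

TOP_N = 20


def main():
    durations = []

    def on_test(test):
        # testtools.StreamToDict hands us one dict per completed test,
        # including its start/stop timestamps.
        start, stop = test['timestamps']
        if start is not None and stop is not None:
            durations.append((stop - start, test['id']))

    stream = subunit.ByteStreamToStreamResult(
        sys.stdin.buffer, non_subunit_name='stdout')
    result = testtools.StreamToDict(on_test)
    result.startTestRun()
    try:
        stream.run(result)
    finally:
        result.stopTestRun()

    for elapsed, test_id in sorted(durations, reverse=True)[:TOP_N]:
        print('%s %s' % (elapsed, test_id))


if __name__ == '__main__':
    main()
```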
17:17:20 it would be helpful to investigate the root problem
17:17:25 I think the only other part of it is communication about the fact that the scenario tests got trimmed by default, so projects that want to test some of those need an -e all in their gate somewhere
17:17:30 oomichi: heh I think for tempest we should have an extra job non voting which runs all the tests still, maybe serially, so we don't break them
17:17:30 probably on experimental
17:17:55 andreaf: yeh, or honestly make an all scenarios job
17:18:06 andreaf: sdague: cool, I don't have any objection
17:18:09 sdague: yeah something like this
17:18:09 and just run all the scenarios in serial, but only those
17:18:21 so there isn't the 2.5 hour job issue
17:18:26 ok to summarize
17:18:34 - email the ML with the plan
17:18:52 - comment on the etherpad, and tomorrow we merge sdague's patch taking into account comments
17:19:27 - prepare a patch on d-g to reduce concurrency, on hold for now
17:19:30 - prepare a new scenario only job to run on tempest, to be merged tomorrow as well
17:19:41 andreaf: all seems reasonable
17:19:46 did I miss anything?
17:20:05 andreaf: good summary :)
17:20:45 ok, they are all pretty small tasks I can follow up tomorrow but I need people to look at the ethercalc and raise comments if appropriate
17:20:57 apart from that I have other two points
17:21:01 any effect on project tempest plugins?
17:21:20 bknudson_: good question
17:21:29 andreaf: are they typically run in -e full?
17:21:36 andreaf: this plan will also help to catch issues in tempest plugins
17:22:08 bknudson_, sdague: I'm not quite sure
17:22:21 and I think checking all jobs might take a bit
17:22:36 I mean I can grep for tox -e full across project-config to get an idea
17:22:42 bknudson_: I think if they are run under -e full by default, it means that the run times of those jobs will get longer if they add a number of scenario tests
17:22:59 andreaf: most plugins do not use the tox -efull job
17:23:04 in fact I can't think of any
17:23:13 so if they happened to be right up against their timer, they could go over
17:23:19 mtreinish: they are using -e all?
17:23:52 sdague: yeah -eall or -eall-plugin which has system site-packages enabled (hopefully I'll be able to remove that eventually)
17:23:56 the keystone tests are scenario tests (although shouldn't be a problem since not much tested)
17:24:25 mtreinish: and do they typically set their own concurrency?
17:24:26 oh, we've got both api and scenario
17:24:39 mtreinish: ok, would you mind checking on that just to be sure?
17:24:40 bknudson_, api are only to test the clients used in the scenario
17:24:44 bknudson_: so I think given that, there will be no impact
17:25:05 however we may collect data which suggests how you might want to tweak things
17:25:09 sdague: they all rely on the d-g variable for setting concurrency
17:25:25 octavia sets OS_TESTR_CONCURRENCY 1 as a sample
17:25:40 however, I don't expect that keystone is going to have the same load/iowait issues as jobs that have lvm volumes getting allocated
17:25:50 andreaf: I haven't checked every single job in project-config, but I'd be really surprised if any project used the full job on a plugin
17:25:53 heh right
17:26:04 mtreinish: heh I agree
17:26:14 I really think that volumes + qemu boots is where we get the really heavy load that's hurting us
17:27:04 #action everyone - review the ethercalc @ https://ethercalc.openstack.org/nu56u2wrfb2b
17:27:27 #action andreaf email ML with our plan on scenario tests
17:27:35 #action andreaf setup a scenario only job for tempest
17:27:55 anecdotally, in sahara we use _REGEXP (with -eall), and probably other plugins do the same
17:27:55 two more things related to gate instability
17:28:18 * dustins makes note to look at the Manila plugin
17:28:19 tosky: ok so that would not be affected which is good
17:28:31 thanks dustins
17:28:44 Of course!
17:29:27 I would like to propose a temp no-new-test merge policy until we are confident the gate is stable
17:29:32 of course we can discuss tests on a case by case basis
17:29:50 but in general we should be very careful about getting anything in until things settle down
17:30:15 no-new-test merge policy for scenario tests only na?
17:30:18 and then we should document criteria for new scenarios as jordanP proposed
17:30:40 chandankumar: well mostly for scenario, but there are API tests that can be pretty heavy
17:31:13 right, it would be nice to let the dust settle to the point where a gate fail of the full job on a random project is unexpected
17:31:14 andreaf: are there any draft of the criteria?
17:31:25 vs. just part of normal business
17:31:27 so I don't mind a negative API test or a keystone one, but a nova API test for migration would make me think
17:32:03 oomichi: we have something in docs already
17:32:33 oomichi: but I think we need to get something more specific in terms of resource utilisation and reviewing patches
17:33:02 oomichi: given the variance on run time and system load it's hard to judge from those but we can check how many servers / volumes are created :)
17:33:36 oomichi: if it's not something for interop and it could go into functional tests it would be nice to have it there
17:34:00 andreaf: yeah, I can see. but I mind negative ones anyways ;)
17:34:01 oomichi: but we can discuss these details on an etherpad or gerrit patch - it's not urgent
17:34:36 andreaf: can I see the link of current doc?
17:34:41 so can we agree on a "please be very careful about getting any new test in Tempest until things settle down" ?
17:34:47 as the criteria?
17:35:21 I could not find it in REVIEWING.rst
17:35:48 oomichi: so for instance #link https://docs.openstack.org/developer/tempest/field_guide/scenario.html
17:36:06 oomichi: as a temporary measure until the gate is back into shape
17:36:32 andreaf: I see, thanks. yeah, we need the detail more and nice to discuss it
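To make the resource-utilisation criterion being discussed a bit more concrete, here is a hedged sketch contrasting the two kinds of API test mentioned above (a cheap negative test versus one that boots a real guest). The class and test names are invented and the helpers are the ones Tempest's compute base class provides, so treat the details as illustrative rather than as a proposed test.

```python
from tempest.api.compute import base
from tempest.lib import decorators
from tempest.lib import exceptions as lib_exc


class LoadContrastExample(base.BaseV2ComputeTest):
    """Invented examples contrasting cheap and expensive API tests."""

    @decorators.attr(type=['negative'])
    def test_show_nonexistent_server(self):
        # Cheap: one API round trip, nothing gets allocated on the system
        # under test.
        self.assertRaises(lib_exc.NotFound,
                          self.servers_client.show_server,
                          'this-server-id-does-not-exist')

    def test_boot_real_guest(self):
        # Expensive: boots an actual guest and waits for it to go ACTIVE,
        # adding CPU, memory and I/O load for the rest of the run.
        # Migration- or volume-backed variants are heavier still.
        self.create_test_server(wait_until='ACTIVE')
```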
17:37:27 #agreed until further notice we shall think twice before letting any new test in Tempest (until the gate settles down)
17:37:40 ok I'm not even sure if that's a meeting bot command :D
17:37:53 heh
17:37:56 so one last thing is about versions we test
17:38:04 API versions
17:38:33 we took cinder v1 out of the gate and have a job to test those on demand I think
17:39:04 but we may need to review / document the API version that we want to exercise in the gate
17:39:21 andreaf: I noticed that volumes v1 admin actions didn't get pulled with that
17:39:30 I think because the tests are structured differently
17:39:33 sdague: oh ok
17:39:40 https://github.com/openstack/tempest/blob/13a7fec7592236a1b2a5510d819181d8fe3f694e/tempest/api/volume/admin/test_volumes_actions.py#L58
17:39:44 probably a good todo
17:39:58 any volunteer to look into that?
17:40:08 sdague?
17:40:11 it definitely feels like for per-commit pre gating we should only be testing the most recent major API
17:40:20 andreaf: I'll see what I can do
17:40:26 sdague: thanks!
17:41:03 testing deprecated APIs seems like the role of the project, or at least being done not on master per-commit pre-gating
17:41:27 sdague: yeah my only concern is whether it's enough to drive tests via an API version, or if we need to ensure that all services are talking that same version between them as well
17:41:54 andreaf: good question, I don't know
17:42:27 sdague: I agree. cinder v3 is current, and v2 is supported. So it would be nice to test v3 as priority on the gate
17:42:36 oomichi: yep
17:42:56 especially as nova is going to require v3 shortly
17:43:02 ok I guess there is some digging to be done to track which versions we are testing in which job and propose a plan on how we want things to look like
17:43:07 oomichi: just update the endpoint in the config, it should be the same
17:43:20 mtreinish: yep
17:43:25 mtreinish: yeah, v3 = v2 + microversions :)
17:43:33 #action: sdague to look into cinder v1 admin tests
17:43:40 but the base should just work sliding across
17:43:50 any volunteer to look at API versions planning?
17:44:31 andreaf: I can help
17:44:39 oomichi: great, thanks!
17:44:57 #action oomichi to look at API versions for test jobs
17:45:07 ok that's all I had on the gate issues
17:45:28 anything else anyone?
17:45:34 #topic Specs Reviews
17:45:43 #link https://review.openstack.org/#/q/status:open+project:openstack/qa-specs,n,z
17:46:10 anything on specs?
17:46:15 3...
17:46:16 2..
17:46:19 1.
17:46:20 I think we can probably dump the 2 grenade specs.
17:46:37 I had a good talk with luce? (sp) at PTG
17:46:52 and I think the new idea is to build a purpose-built tool for the zero downtime keystone testing
17:46:56 luzC
17:47:04 andreaf: that's it
17:47:31 sdague: yeah I'm not sure if that's going to be new specs or what, probably yes
17:47:43 I'll ping luzC about those
17:47:44 so, i'd just double check with her, and close those out unless there is a reason they want to keep them up
17:48:06 #action andreaf check with luzC about grenade specs
17:48:27 #topic Tempest
17:49:26 oomichi: did you add this? https://review.openstack.org/#/c/389318/3
17:49:40 andreaf: I put that on there
17:50:05 andreaf: I had a discussion the other day in the puppet channel about how that broke the ceilo plugin
17:50:11 that broke ceilometer gate
17:50:20 mtreinish: yeah, that is
17:50:56 I replaced service client code with tempest.lib after that in ceilometer repo to avoid it again
17:51:09 I just thought it was a good discussion point, since it's a private interface in lib
17:51:22 although we clearly document the lib stable contract on public interfaces: https://docs.openstack.org/developer/tempest/library.html#tempest-library-documentation
17:51:48 mtreinish: yeah, that is really private one
17:52:14 yes so that's clearly documented already
17:52:44 but it makes me wonder what the gap was in the stable interfaces
17:53:32 so... for some definition of clearly documented :)
17:54:08 I guess people never read the doc of the other projects even if we have clear doc.
17:54:10 honestly, until the documentation includes a bunch of example usage, and makes it so that it's not worth people's time to open the code, once they get in the code, they are going to find other methods they want to use
17:54:30 I would not, for instance, call https://docs.openstack.org/developer/tempest/library/cli.html clear doc
17:55:19 sdague: nor would I
17:55:27 sdague: uhm sure that's not the point under discussion though - I thought the part about stability is quite clear
17:55:28 sdague: but again I may be too involved in tempest to see the gap
17:55:57 andreaf: so, I think it's easy to say "if you click through these 20 pages you can find the functions we consider stable"
17:56:03 sdague: we have documentation as part of the high prio things in pike I agree we need examples or so
17:56:17 but, that's not really super clear or discoverable. You basically want an SDK doc
17:56:33 and SDK needs examples for every usage
17:56:46 yeah ok that's in the todo list
17:57:02 it might also be interesting to figure out if there was a way to emit some warnings from tempest run if stuff gets inherited in places that are unexpected
17:57:09 to help people realize they did the wrong thing
17:57:27 yeah that's the other topic I had on the agenda for today
17:57:35 but I guess we are running out of time
17:57:54 we can continue in the QA channel or next meeting
17:58:05 sdague: andreaf i would like to help on this.
17:58:17 sdague: I did try it with hacking, but difficult to get agreement
17:58:20 so it's a pity it takes two weeks to meet the same group again in a meeting
17:58:24 chandankumar: thank you!!
17:58:26 I've got a review that's stuck - https://review.openstack.org/#/c/388897/ -- I answered the comment but the reviewer seems to have left.
17:58:31 andreaf: the docs stuff? I did get a start to some of that yesterday: https://review.openstack.org/439830
17:58:44 mtreinish: thank you!
17:59:06 #link https://review.openstack.org/#/c/388897/ for review
17:59:10 castulo: did you have a follow up on bknudson_'s patch ^^^
18:00:06 ok thanks everyone!
18:00:11 #endmeeting
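On the stable-interface point discussed above: the contract covers the public tempest.lib clients rather than Tempest internals. Below is a minimal sketch of consuming a service client only through that interface, following the pattern in the Tempest library documentation; the credentials, auth URL and region are placeholders for an example cloud.

```python
from tempest.lib import auth
from tempest.lib.services.compute import servers_client

# Placeholder credentials; fill in values for a real cloud.
credentials = auth.KeystoneV3Credentials(
    username='demo',
    password='secret',
    project_name='demo',
    user_domain_name='Default',
    project_domain_name='Default')
auth_provider = auth.KeystoneV3AuthProvider(
    credentials, 'https://keystone.example.com/v3')

# Instantiate the stable compute servers client and make a call with it.
client = servers_client.ServersClient(
    auth_provider, 'compute', 'RegionOne')
servers = client.list_servers()
```

Anything imported from outside tempest.lib (private helpers, internal service client modules, and so on) carries no stability guarantee, which is what bit the ceilometer plugin in the change discussed here.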