17:00:17 #startmeeting qa
17:00:22 Meeting started Thu Feb 2 17:00:17 2017 UTC and is due to finish in 60 minutes. The chair is oomichi. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:26 The meeting name has been set to 'qa'
17:00:33 hello!
17:00:38 o/
17:00:38 hello, who can join today?
17:00:45 o/
17:00:50 oh, my typing is slow
17:00:52 o/
17:01:01 (I can only attend 40min)
17:01:05 o/
17:01:13 o/
17:01:21 cool, many people have joined
17:01:27 ok, let's get started
17:01:34 #link https://wiki.openstack.org/w/index.php?title=Meetings/QATeamMeeting#Agenda_for_February_2nd_2017_.281700_UTC.29 is the agenda
17:01:36 o/
17:01:53 #topic PTG
17:02:17 per the openstack-dev ML, the room capacity and other details are not yet clear for the PTG
17:02:22 o/
17:02:45 so we don't know how many topics can run in parallel
17:02:47 at this time
17:03:22 I feel we don't need everyone concentrating on the same topic
17:03:22 we will see there, I guess
17:03:36 jordanP: yeah, I hope so
17:03:54 yeah, we can run several efforts in parallel, depending on who is interested in doing what
17:04:10 jordanP: yeah, that will be productive
17:04:22 o/
17:04:26 I was wondering if we will add times for the agenda topics - https://etherpad.openstack.org/p/qa-ptg-pike
17:04:29 #link https://etherpad.openstack.org/p/qa-ptg-pike is a nice place to write more ideas
17:05:00 luzC: yeah, +1 for a schedule
17:05:19 it would be useful for getting attendees for specific topics
17:05:49 then we need the room capacity to make a schedule
17:05:57 yes, especially if they are planning on attending other sessions
17:06:31 luzC: yeah, nice point
17:06:46 it would be nice to have a kind of agenda / priority list so that we can go through the most important topics first
17:07:31 next topic?
17:07:46 but I would not want to be too strict on times, as we may find that one topic takes very little time and another longer than we expected, and we have more flexibility than at the usual design summit
17:07:55 andreaf: yeah, I hope debugging test failures will be the first priority now
17:08:17 andreaf +1
17:08:18 andreaf: yeah, I prefer the mid-cycle meetup style
17:08:22 yeah, let's gather with a list of topics and once we are all there, see how it goes
17:08:34 jordanP: +1
17:08:49 ok, can we move on if there are no more topics for the PTG?
17:09:10 #topic Spec reviews
17:09:26 #link https://review.openstack.org/#/q/status:open+project:openstack/qa-specs,n,z
17:10:04 we have several specs and they are still under discussion
17:10:24 do we have topics for them today?
17:11:03 ok, let's move on
17:11:06 #topic Tempest
17:11:29 not strictly related to Tempest, but we have a lot of job failures today
17:11:48 jordanP: yeah, we are facing that
17:12:13 i have a question about tempest, specific to https://review.openstack.org/#/c/427345/
17:12:16 is there something to do? 3 days ago the gate was fine and now it's in a really bad state
17:13:12 jordanP: I am seeing different failures randomly
17:13:25 rodrigods, don't ask to ask a question, just ask it
17:13:27 oomichi: have you helped try to categorize the failures? http://status.openstack.org/elastic-recheck/data/integrated_gate.html
17:13:32 sure :)
17:13:45 the problem is because of this bug: https://launchpad.net/bugs/1590578
17:13:45 Launchpad bug 1590578 in OpenStack Identity (keystone) "global role should not be able to imply domain-specific role" [Medium,Fix released] - Assigned to Mikhail Nikolaenko (mnikolaenko)
17:14:02 that's what bugs me: it's clearly bad (you can look at o-h for the graphs) but not many people are working on categorizing the failures so we can track them
17:14:07 the fix wasn't backported to Mitaka, so the Mitaka job fails
17:14:39 mtreinish: oh, I had never seen that page. how do we categorize them? by LP?
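For context on the categorization mtreinish mentions: elastic-recheck tracks known gate failures via small query files checked into the elastic-recheck repo, one per Launchpad bug. A hedged sketch of what such a file looks like (the bug number, message string, and log filename here are placeholders, not a real signature):

```yaml
# queries/1234567.yaml -- illustrative elastic-recheck query file;
# the query string matches log lines indexed in logstash
query: >
  message:"example failure signature text"
  AND tags:"screen-n-cpu.txt"
```

Once merged, the elastic-recheck bot can then annotate matching gate failures with the bug, which is what makes the categorization graphs on status.openstack.org useful.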
17:14:48 what is the best approach? use a feature flag?
17:14:50 oomichi: submit an elastic-recheck query
17:14:52 rodrigods, then either wait or introduce a feature flag
17:15:05 rodrigods: so the api behaves differently between mitaka and newer versions? Yeah, you either add a feature flag or just wait for Mitaka EOL
17:15:18 cool
17:15:30 mtreinish: oh, that is a cool way. OK, I will do that from my patch list
17:15:39 thanks mtreinish jordanP
17:16:24 some of the failures are related to libvirt crashes, as jordanP pointed out - but something worse has happened in the past one or two days
17:16:55 the libvirt ones are all tracked in e-r as far as I know, but no one has stepped up to look at the issue
17:17:19 do we have any libvirt person close to the openstack community who may help here?
17:17:20 #link http://lists.openstack.org/pipermail/openstack-dev/2017-February/111347.html
17:17:27 from jordanP
17:17:41 andreaf, we used to have Daniel Berrangé
17:18:15 hey guys, I'm late today and I'll have to leave soon but o/
17:18:22 and we have kashyap also
17:18:48 but the libvirt thing is not what we are seeing today, not only that
17:18:50 today is worse
17:19:09 yeah, it is hard to merge patches in the current condition
17:20:11 we are trying to fix bugs in tempest, but the patches could not be merged recently
17:20:21 the bug triage graph is #link https://github.com/oomichi/bug-counter#current-graph
17:20:46 the bug number is still decreasing, but progress is slow
17:21:20 the graph is a little odd anyway :)
17:21:32 due to lack of data
17:22:17 this week, I triaged bugs as assignee and the report is #link https://etherpad.openstack.org/p/tempest-weekly-bug-report
17:23:39 I want to get eyes on https://review.openstack.org/#/c/304441 which fixes a bug
17:24:45 oomichi: sure, I will have a look at it
17:25:06 but I don't think it's Tempest bugs causing the gate instability
17:25:45 andreaf: yeah, Tempest is just detecting bugs in other projects in general
17:25:45 still, it's great to see the graph shrinking :)
17:26:22 andreaf: haha, yeah, the number should be manageable
17:26:55 do we have more topics on Tempest today?
17:27:21 just one more thing from my side
17:27:48 andreaf: please go ahead
17:27:50 we have an issue with reboot_hard where sometimes the host keys are not flushed to disk
17:28:09 and the sshd server fails to start on reboot
17:28:34 andreaf: I thought we landed a patch at some point that added a sync call over ssh before we rebooted
17:28:39 andreaf: yeah, doesn't the 'sync' command work before a hard reboot?
17:28:45 there's a patch available in cirros to make the sshd server regenerate the host keys on reboot
17:28:46 mtreinish: yes we did
17:28:50 mtreinish: but apparently it's not enough
17:29:11 mtreinish: or something else is going on - but in either case at the moment we fail with being unable to connect to the ssh port
17:29:33 which is kind of misleading, as it's not a problem with the network, just the sshd daemon failing to start
17:29:40 hmm, do we wait for a fresh prompt after the call? Because if we just send sync\n over ssh and then reboot right after, there could be a race there
17:29:54 so a new release of cirros could fix that
17:30:46 mtreinish: well, as far as I know our exec method waits for a return code
17:30:51 ok
17:30:54 I couldn't remember
17:30:58 andreaf: have you talked to smoser about a new release?
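The feature-flag approach jordanP and mtreinish suggest for the Mitaka/Newton keystone difference usually means a boolean config option that the test checks before running. A minimal sketch of the pattern — the option and class names below are hypothetical, not actual Tempest configuration:

```python
# Sketch of the feature-flag pattern: the keystone fix (not backported
# to Mitaka) is gated behind a boolean option so the test can skip on
# clouds that still show the old behavior. Names are illustrative only,
# e.g. a [identity-feature-enabled] option in tempest.conf.

class FakeConf:
    """Stand-in for a parsed tempest.conf section (hypothetical option)."""
    forbid_global_implied_domain_role = False  # old (Mitaka) behavior


def skip_reason(conf):
    """Return a skip message when the deployment lacks the fixed behavior,
    or None when the test should run."""
    if not conf.forbid_global_implied_domain_role:
        return ("keystone does not reject global->domain implied roles "
                "(pre-Newton behavior)")
    return None


print(skip_reason(FakeConf()))
```

When Mitaka reaches EOL, the flag and the skip can simply be removed, which is the "just wait" half of the trade-off discussed above.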
17:31:17 but even a flush in the OS is no guarantee that the virtualisation layer will actually flush its own caches to disk
17:31:56 mtreinish: yes I did. I have a branch ready, and I built an image and tested it here #link https://review.openstack.org/#/c/427456/
17:32:09 it seems to be working ok, even though now we have noise in the gate
17:32:46 I also tested this: https://review.openstack.org/#/c/421875/
17:32:47 force-deleting the host ssh keys before reboot
17:33:11 so I would now ask smoser for a new release and propose a devstack patch to adopt it
17:33:26 unless there is any concern with this plan
17:33:31 andreaf: ok, that sounds like a good plan
17:33:40 andreaf did you open a pull request for cirros?
17:34:18 yes https://code.launchpad.net/~andrea-frittoli/cirros/+git/cirros/+ref/bug_1564948
17:34:48 good
17:34:50 it's on top of branch 0.3 so only 2 commits
17:35:02 andreaf: the problem has been fixed on the cirros side, so could this happen on other OSes too? or is it cirros-specific?
17:35:03 because there's a lot more on master
17:36:19 oomichi, mtreinish: yeah, I feel a hard-reboot test will never really be 100% safe, but we can make it stable enough for the gate so we can test reboot
17:36:28 andreaf, good, as smoser expected
17:36:58 andreaf: yeah, that is a nice direction
17:37:07 the alternative would be to not test reboot at all in the integration gate, but that would be a pity since it's a feature which is probably quite widely used
17:37:54 andreaf: yes, the reboot feature is used in common use cases and it is nice to keep the corresponding tests in the gate
17:38:39 (i have to go, sorry)
17:38:42 ok, let's move to the next topic if we can
17:38:58 #topic DevStack + Grenade
17:39:04 we might ask the vm to shut down properly before the hard reset, and wait for a power-down message on the serial console...
17:39:45 afazekas: wouldn't that be a soft-reboot?
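On the sync-before-hard-reboot race discussed above: the key detail is that the ssh exec must block until the remote command's exit status comes back, so "sync" has actually completed in the guest before power is cut. A minimal local sketch of that wait-for-exit-status behavior (in the gate the command would be run over ssh against the guest VM; this is not Tempest's actual exec method):

```python
# Sketch: run a command and block until its exit status is known.
# This mirrors andreaf's point that Tempest's exec waits for a return
# code -- firing "sync" and rebooting without waiting would race.
import subprocess


def run_and_wait(cmd):
    """Run cmd and return its exit status; subprocess.run() does not
    return until the process has exited, so the flush has finished
    (from the guest's point of view) when this returns."""
    result = subprocess.run(cmd, capture_output=True, timeout=60)
    return result.returncode


# Locally we can only demonstrate the blocking behavior; over ssh the
# command would be something like ["ssh", "cirros@<vm-ip>", "sync"].
print(run_and_wait(["sync"]))
```

As noted in the log, even a completed guest-side sync does not guarantee the virtualisation layer flushed its own caches, which is why the cirros host-key regeneration patch is the more robust fix.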
17:40:20 on devstack, one topic to keep an eye on is this ML thread:
17:40:22 #link http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html
17:40:34 it's not directly related to devstack, just OOM issues in the gate
17:40:57 oomichi, on soft reboot the vm is expected to do it, but no. We would ask a booted vm to halt in a safe way over ssh, before actually cutting the power with a big hammer.
17:42:49 afazekas: IIUC, on the nova side a soft-reboot tries to wait for the vm to reboot gracefully internally, and falls back to a hard-reboot after a timeout
17:43:18 * afazekas I usually also drop the caches before a hard reset, but we can search for what is really sufficient
17:43:36 also I submitted a fix for a bug in Grenade a few days back: https://review.openstack.org/#/c/424807/ in case you guys can take a look at it. It is for a bug that does not happen in the gate but happens when running Grenade locally
17:43:45 mtreinish: interesting, the memory usage is still increasing release over release
17:43:57 #link https://review.openstack.org/#/c/424807/
17:44:38 oomichi: it's been a steady trend since juno (when we first started seeing OOM in the gate)
17:44:50 oomichi, yes, it pushes the power button softly on the vm; the vm has an acpi signal handler which initiates the safe shutdown, and if that does not stop the vm in time, nova uses the big hammer..
17:45:15 mtreinish: can we increase memory capacity in the gate? or do we need to try reducing the usage on each project's side?
17:45:48 afazekas: yeah, you are right. and the big hammer could erase the cache
17:46:24 oomichi: we need to work on the memory footprint of the services. Requiring > 8GB per vm isn't really a good solution
17:46:52 and it'll just be a matter of time until we hit whatever new amount of ram we allocate
17:47:08 mtreinish: it would be nice to see a memory footprint graph on o-h or somewhere for each release
17:47:38 oomichi: we don't collect that data anywhere. No one has done real memory profiling of the python services
17:47:46 if we have the data, making the graph is the easy part :)
17:48:13 mtreinish: nice point, it was just a rough idea :)
17:49:04 I guess this kind of memory footprint could also be useful as a reference for production, and it would be nice to share
17:49:22 OK, let's move on if we don't have more topics about them
17:49:36 #topic openstack-health
17:49:46 do we have topics about o-h today?
17:50:02 #topic Patrole
17:50:22 DavidPurcell: do we have topics about patrole?
17:50:28 A couple, if that's okay
17:50:39 DavidPurcell: please go ahead
17:50:57 I've already added these to the PTG discussion stuff as they all involve a bit of work.
17:51:26 First off, a lot of projects (glance, neutron, etc) don't return the correct response when a role is rejected
17:51:27 DavidPurcell: cool, thanks
17:51:46 Usually it is a 404 NotFound instead of a 403 Forbidden.
17:52:12 DavidPurcell: nice testing :)
17:52:20 Thanks :)
17:52:39 is that the different-tenant case?
17:52:52 It varies.
17:53:01 Sometimes it is just because you're not admin
17:53:07 In some cases 404 is used when someone is just trying to randomly guess resources, without any idea whether they exist or not
17:53:08 but they return 404 for whatever reason...
17:53:16 I guess 404 could also be an option because users in one tenant cannot see another tenant's resources
17:53:56 ok, this is a good chance to discuss the ideal behavior
17:54:03 oomichi: Very possible, but some of them are just very obvious oversights.
17:54:24 DavidPurcell: yeah, a really nice test for me anyway
17:54:38 Another question we had involves tempest plugins for projects that want to use Patrole. For example, Murano wants to add Patrole tests.
17:55:14 But that is currently not possible without either duplicating their clients in Patrole and putting the tests in Patrole
17:55:23 or somehow enabling them to import Patrole
17:55:55 DavidPurcell: yeah, and patrole is at an early dev stage, so it is difficult to provide stable interfaces
17:56:38 but it is great that patrole attracts the different projects
17:56:40 oomichi: and I'm not sure how I feel about a plugin importing another plugin.
17:56:56 DavidPurcell: that could be a topic for the PTG :)
17:57:06 oomichi: already on the etherpad :)
17:57:27 DavidPurcell: cool, ok can we move on to the next topic: open discussion?
17:57:31 sure
17:57:45 #topic critical review/open discussion
17:57:49 time is running out
17:58:08 so do we have patches or discussion topics?
17:59:30 ok, let's end the meeting. thanks all :)
17:59:34 #endmeeting
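As a postscript on the 403-vs-404 discussion from the Patrole topic: the disagreement in the log is that a rejected role should normally yield 403 Forbidden, but 404 NotFound can be deliberate when it hides a resource's existence from another tenant. A small sketch of that decision logic — the function and exception names are illustrative, not Patrole's actual API:

```python
# Sketch of the RBAC response check discussed in the Patrole topic:
# 403 is the correct rejection; 404 is defensible only when the caller
# should not learn that the resource exists at all. Names are
# hypothetical, not Patrole code.

class RbacExpectationError(Exception):
    """Raised when a service returns the wrong rejection status."""


def check_rbac_response(status, resource_visible_to_caller):
    if status == 403:
        return "correct: Forbidden"
    if status == 404 and not resource_visible_to_caller:
        return "acceptable: NotFound hides the resource's existence"
    raise RbacExpectationError(f"expected 403, got {status}")


print(check_rbac_response(404, resource_visible_to_caller=False))
```

The "obvious oversights" DavidPurcell mentions are the cases where the resource is visible to the caller anyway, so a 404 rejection serves no secrecy purpose and is simply the wrong status code.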