16:01:19 <vkozhukalov> #startmeeting Fuel
16:01:20 <openstack> Meeting started Thu Jul 24 16:01:19 2014 UTC and is due to finish in 60 minutes.  The chair is vkozhukalov. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:23 <openstack> The meeting name has been set to 'fuel'
16:01:27 <vkozhukalov> #chair vkozhukalov
16:01:28 <openstack> Current chairs: vkozhukalov
16:01:30 <evgeniyl> Hi
16:01:32 <vkozhukalov> hi everyone
16:01:34 <angdraug> o/
16:01:35 <aglarendil> hi
16:01:38 <vkozhukalov> agenda as usual
16:01:41 <meow-nofer__> hi
16:01:49 <vkozhukalov> #link https://etherpad.openstack.org/p/fuel-weekly-meeting-agenda
16:01:55 <mattymo> hi boys and girls
16:02:04 <vkozhukalov> #topic Greetings
16:02:22 <vkozhukalov> #topic 5.1 soft code freeze
16:02:22 <mihgen> hey folks
16:02:33 <mihgen> we plan SCF for today folks
16:02:43 <vkozhukalov> mihgen im sure you have something to say on this topic
16:02:47 <mihgen> let's keep collaboratively working on our patches
16:03:05 <mihgen> we have a lot of bugs still in progress, many of them have at least one +1
16:03:31 <mihgen> for those patches which have at least one +1 but didn't get approval from core folks, I'm going to give 1 more day to be merged
16:03:45 <mihgen> I think otherwise we are good enough to call SCF
16:04:05 <mihgen> we will have to review bug by bug and move them to next milestone (only Low and Medium of course)
16:04:38 <aglarendil> I raised priority for some medium bugs to make sure they get into the release as they are more high than medium
16:04:39 <mihgen> if some bug is important or hits many people, we may want to increase priority
16:04:42 <mihgen> #link https://wiki.openstack.org/wiki/Fuel/5.1_Release_Schedule
16:05:12 <angdraug> please don't forget to add comments when you change bug priority
16:05:14 <mihgen> please don't forget about this important link, it has our schedule
16:05:28 <mihgen> sorry folks my Mac is crazy and slows me down today (
16:05:48 <mihgen> any objections on call for SCF today, and exception which I mentioned?
16:06:06 <mihgen> #link https://wiki.openstack.org/wiki/Fuel/Soft_Code_Freeze - about SCF
16:06:35 <mihgen> looks like no objections so far, thanks everyone for hard work with squashing bugs!
16:06:37 <xarses> mihgen: sounds good
16:06:38 <aglarendil> there are some fixes I would like to merge into library
16:06:39 <vkozhukalov> no objections
16:06:53 <aglarendil> but I guess they are attached to high-priority bugs
16:06:56 <aglarendil> so I do not have any
16:07:20 <mihgen> cool. thanks. vkozhukalov - let's go on
16:07:33 <vkozhukalov> ok, thanks mihgen, moving
16:07:42 <vkozhukalov> #topic 5.0.1 release status
16:07:48 <angdraug> Here's the current status of 5.0.1:
16:07:52 <angdraug> #link https://launchpad.net/fuel/+milestone/5.0.1
16:07:57 <angdraug> Some bugs are only tracked in mos, these can be found at:
16:08:02 <angdraug> #link https://launchpad.net/mos/+milestone/5.0.1
16:08:07 <angdraug> Primary blocker for 5.0.1 is still the RabbitMQ HA problem:
16:08:12 <angdraug> #link https://bugs.launchpad.net/mos/+bug/1340711
16:08:14 <uvirtbot> Launchpad bug 1340711 in mos/5.0.x "[OSCI] building oslo-messaging for rpm and deb" [Critical,In progress]
16:08:17 <angdraug> There are some new bugs in 5.0.1:
16:08:21 <angdraug> #link https://bugs.launchpad.net/bugs/1348166
16:08:22 <uvirtbot> Launchpad bug 1348166 in fuel/5.1.x "Upgrades, rollback for bootstrap brakes slaves bootstrapping" [High,In progress]
16:08:26 <angdraug> #link https://bugs.launchpad.net/bugs/1346093
16:08:27 <uvirtbot> Launchpad bug 1346093 in fuel "[System Tests] Dhcrelay didn't start on master in system tests" [High,Fix committed]
16:08:41 <angdraug> any comments on the above two? are they really blockers for 5.0.1?
16:09:04 <angdraug> I understand the temptation to squeeze in more fixes into 5.0.1 while mos team is dealing with the RabbitMQ problem
16:09:22 <evgeniyl> I'm testing the patch for 1348166 right now, and it looks good.
16:09:37 <angdraug> do we need any other bugs in 5.0.1?
16:09:42 <mihgen> evgeniyl: what if we don't merge it in 5.0.1 ?
16:10:04 <mihgen> angdraug: all < Critical should go away for sure
16:10:24 <mihgen> we just need to make sure they are not Criticals
16:10:25 <evgeniyl> mihgen: fater rollback user will have broken cobbler
16:10:35 <evgeniyl> after
16:10:46 <mihgen> evgeniyl: if I do 5.0 -> 5.0.1 and then rollback?
16:11:15 <angdraug> broken cobbler means can't deploy anything, isn't that Critical?
16:11:20 <angdraug> or is there a workaround?
16:11:34 <evgeniyl> mihgen: yep 5.0 -> 5.0.1 and if there is some problem during the upgrade, rollback will break cobbler
16:11:53 <mihgen> evgeniyl: ok, only if there was a problem with upgrade
16:11:53 <evgeniyl> user can fix it manually, it is the reason why I set High priority for the bug
16:12:01 <mihgen> still sounds pretty critical
16:12:11 <mihgen> upgrade can fail and we should be ready for it
16:12:21 <angdraug> one more incoming bug from mos team:
16:12:24 <angdraug> #link https://bugs.launchpad.net/mos/+bug/1348158
16:12:29 <uvirtbot> Launchpad bug 1348158 in mos "Murano-agent does not get rabbit settings" [Critical,New]
16:12:30 <mihgen> angdraug: if we test the patch, I think we should have the fix in 5.0.1
16:12:57 <angdraug> agreed
16:13:10 <mihgen> anyone from murano team to comment on that?
16:13:42 <aglarendil> iyozhikov: ^^
16:14:18 <vkozhukalov> doubt he is here
16:14:22 <mihgen> let's get it later
16:14:27 <mihgen> angdraug: let's move on
16:14:30 <angdraug> if it's as bad as it says on the tin, and there's a fix today or tomorrow, we should have it in 5.0.1 too
16:14:32 <aleksandr_null> yep.
16:14:41 <angdraug> all other bugs are not Critical
16:14:52 <angdraug> unless someone objects now, I'll move them to 5.0.2
16:14:58 <mihgen> ok, good. waiting for oslo.messaging to be fixed then
16:15:05 <angdraug> yes. that concludes 5.0.1 discussion
16:15:05 <mihgen> angdraug: let's do it
16:15:12 <vkozhukalov> #topic 6.0 blueprints status
16:15:13 <mihgen> angdraug: thanks!
16:15:38 <mihgen> Folks we have lots of blueprints there, many with higher statuses
16:15:49 <mihgen> we don't have such a velocity for sure :)
16:16:03 <aglarendil> let's schedule blueprint triage day
16:16:07 <mihgen> if you take a look at schedule again, we would like to have Juno support before the summit
16:16:10 <mihgen> aglarendil: agree
16:16:48 <aglarendil> and discuss in #fuel-dev, I guess right after 5.0.1 release
16:17:04 <angdraug> don't need to wait for 5.0.1 release
16:17:05 <vkozhukalov> and another very important feature is plugins
16:17:23 <angdraug> as discussed above, fuel team's work on 5.0.1 is mostly done, everyone should move on
16:17:25 <mihgen> So the most important things for 6.0 are: Juno support, HA fixes and other bugs, openstack upgrades building blocks and plugins, also UI refactorings to improve UX
16:17:55 <angdraug> +1 to ^
16:17:56 <mihgen> also, vkozhukalov with agordeev are doing image based provisioning for a while now
16:18:07 <mihgen> so I assume we gonna get it finally done in 6.0
16:18:12 <mihgen> vkozhukalov: are we?)
16:18:29 <vkozhukalov> yes
16:18:32 <mihgen> :)
16:18:40 <aglarendil> also granular deployment is critical for 6.0, I guess
16:18:47 <angdraug> no
16:18:49 <aglarendil> and upstream manifests merge
16:19:04 <angdraug> upstream manifests, yes -- we've done most in 5.1
16:19:06 <aglarendil> and testing coverage: rspecs and puppet module tests
16:19:16 <vkozhukalov> ok let's schedule bp triaging meeting for 6.0
16:19:20 <aglarendil> also fencing and logs management
16:19:23 <mihgen> sustaining stuff in puppet is very important
16:19:46 <akislitsky_> aglarendil, can plugins in library be implemented without granular deployment?
16:19:49 <mihgen> we are having most of the issues there in our process, so we definitely need to provide better test coverage and etc.
16:20:21 <aglarendil> akislitsky_: I would not hope so
16:20:28 <mihgen> well let's discuss the rest out of the scope of this meeting
16:20:37 <aglarendil> mihgen: I agree
16:20:46 <vkozhukalov> moving on
16:20:52 <mihgen> items I provided are the most important, but not complete, obviously
16:21:01 <mihgen> vkozhukalov: sure
16:21:14 <vkozhukalov> #topic Features
16:21:25 <vkozhukalov> #topic upstream neutron based ml2: missed feature freeze exception deadline
16:21:38 <mihgen> xarses: hi, did you see an email from aglarendil and me
16:21:48 <xarses> yes
16:21:57 <xarses> upstream ML2 has been moved to 6.0, Multi node HA testing is basically impossible due to rabbit, or galera issues compounding testing and resolving possible issues with multiple controllers present.
16:22:13 <mihgen> xarses: :(
16:22:28 <mihgen> well looks like we can't do anything here
16:22:45 <mihgen> and just need to get it merged in 6.0?
16:22:51 <angdraug> I think xarses should keep working on it and have it ready for when stable/5.1 is branched
16:23:01 <xarses> We need to fix ha
16:23:18 <angdraug> yes, that's our next topic I think
16:23:23 <aglarendil> xarses: we have working ha right now
16:23:25 <xarses> and some other CI issues, I added to the end of the agenda
16:23:42 <mihgen> wait aglarendil why are we mixing all here?
16:23:42 <aglarendil> xarses: it might be due to puppet ordering issues
16:23:46 <mihgen> ha, CI,
16:24:12 <angdraug> let's move to the next topic and then we won't be off topic
16:24:13 <xarses> moving on
16:24:14 <mihgen> stable/5.1? I'm lost, why do we want to have ready exactly by stable/5.1?
16:24:31 <vkozhukalov> ha and ci are certainly different topics
16:24:33 <angdraug> mihgen: because then master will be open for 6.0 stuff
16:24:44 <mihgen> angdraug: ohhh ok yeah
16:24:53 <vkozhukalov> #topic Bugs
16:24:56 <mihgen> now it's clear, thanks
16:25:04 <vkozhukalov> #topic 5.1 bvt tests failure trends: what breaks it most often lately
16:25:15 <mihgen> aglarendil: your turn on our smoke-HA
16:25:39 <angdraug> over the last several weeks, bvt for ubuntu (and to lesser extent centos) has been failing more often than passing
16:25:55 <aglarendil> angdraug: mihgen most of these tests were custom ISOs
16:26:10 <aglarendil> angdraug: mihgen we spoke with nurla and decided to separate these jobs
16:26:30 <aglarendil> angdraug: mihgen nightly tests were passing mostly
16:26:41 <mihgen> aglarendil: ok so then what's the current status?
16:26:44 <nurla> we created separate jobs, see mail "Custom iso builds and custom bvt tests at Jenkins Product"
16:27:06 <xarses> I think we need to stop doing single ha controller testing, it's possibly a large reason why we see a lot of failures
16:27:06 <mihgen> nurla: aglarendil I see smoke failed now..
16:27:10 <aglarendil> mihgen: we have bvt tests passed. the only problem I saw was OSTF test on centos
16:27:23 <aglarendil> mihgen: nurla we are waiting for an email from rvyalov
16:27:32 <aglarendil> mihgen: It should be fixed in next build
16:27:43 <vkozhukalov> xarses: great, if something does not work, let's stop testing it -)
16:27:44 <aglarendil> mihgen: some package mirror issue
16:28:02 <xarses> we need to at least do 2-controller HA testing so that primary_controller true|false modes can be tested
16:28:37 <aglarendil> xarses: I will talk about it with holser a little bit later related to galera improvements
16:28:38 <xarses> vkozhukalov: no, single controller HA passes to easily, while multiple controller ha fails a lot, we need to test the harder issue
16:29:09 <xarses> s/passes to easily/passes too easily/
16:29:18 <vkozhukalov> xarses: thanx, now i see
16:29:18 <mihgen> nurla: can we optimize builds so that master node is reused for all centos simple, centos HA & Ubuntu HA, so reducing time and resources?
16:29:33 <mihgen> having 3 deployments in parallel, we have multiple envs feature, right..
16:30:09 <nurla> unfortunately, we don't support parallel deployment in our tests
16:30:30 <mihgen> nurla: ok.. we will talk about it later
16:30:47 <vkozhukalov> ok, moving on then
16:31:00 <xarses> mihgen, nurla: rmoe and I have some ideas around that too
16:31:08 <nurla> ok, let's create a blueprint to improve this
16:31:10 <mihgen> folks but let's keep all attention on smoke tests and BVT
16:31:30 <mihgen> these are very important things now , the only things where we can see regression fast
16:31:33 <angdraug> yes, and please let's watch for trends
16:31:42 <vkozhukalov> #topic Fuel CI Python Unit Tests failures
16:31:53 <vkozhukalov> dpyzhov: around?
16:32:04 <mihgen> akislitsky_: ^^ ?
16:32:21 <meow-nofer__> this is a real issue these days
16:32:25 <akislitsky_> I have fixed all known locks and inconsistency issues in CI tests, on my local machine they pass fine, but after rebase on master they fail. I'm going to fix it tomorrow and merge into master
16:32:39 <mihgen> akislitsky_: sounds optimistic
16:32:52 <mihgen> looks like we had so many red builds because of these locks?
16:32:57 <meow-nofer__> akislitsky_: this will really help
16:33:19 <akislitsky_> I hope so; the patch is not small, we need to review it tomorrow
16:34:04 <vkozhukalov> here it is clear
16:34:06 <vkozhukalov> moving on
16:34:10 <akislitsky_> mihgen, yep, locks and inconsistency of the data in DB. we have fuzzy commits in DB in code
16:34:36 <vkozhukalov> #topic Ceph bugs status
16:34:39 <mihgen> akislitsky_: meow-nofer thanks folks, looking forward to see all green :)
16:34:45 <angdraug> The blocker issue that affected all ceph deployments with 5.1 is now fixed:
16:34:52 <angdraug> #link https://bugs.launchpad.net/fuel/+bug/1333814
16:34:54 <uvirtbot> Launchpad bug 1333814 in fuel "Environment deployment failed with ceph-deploy 1.5.2" [Critical,Fix committed]
16:34:58 <angdraug> This leaves another critical bug that only affects large-scale deployments:
16:35:04 <angdraug> #link https://bugs.launchpad.net/fuel/+bug/1341009
16:35:06 <uvirtbot> Launchpad bug 1341009 in fuel "[osci] obsolete ceph package in fuel-repository for 5.0.1 and 5.1" [Critical,In progress]
16:35:25 <angdraug> there's also medium and low priority bugs open, speak up if something must be fixed in 5.1
16:36:36 <xarses> #link https://bugs.launchpad.net/fuel/+bugs?field.tag=ceph
16:37:23 <aglarendil> what about
16:37:23 <angdraug> #1335628 seems to be automatically reopened by gerrit
16:37:28 <aglarendil> #link https://bugs.launchpad.net/fuel/+bug/1335880
16:37:30 <uvirtbot> Launchpad bug 1335880 in fuel "[library] ERR: ceph-deploy osd activate node-11:/dev/sdb2 node-11:/dev/sdc2 returned 1 instead of one of [0]" [High,Confirmed]
16:37:40 <aglarendil> it looks like it has some duplicates
16:38:29 <angdraug> yes, needs to be cleaned up
16:38:32 <angdraug> rmoe: ^
16:38:57 <vkozhukalov> rmoe: was not able to reproduce 1333779
16:39:06 <rmoe> I'll take a look at it but it looks like the same issue
16:39:14 <vkozhukalov> maybe it is not as critical as it is written
16:39:45 <aglarendil> #link https://bugs.launchpad.net/fuel/+bug/1266853
16:39:46 <uvirtbot> Launchpad bug 1266853 in fuel "[library] Adding Compute Node fails - ceph/manifests/conf.pp:44" [Medium,Confirmed]
16:39:51 <aglarendil> also this looks really high
16:41:06 <angdraug> should be fixed by new ceph-deploy 1.5.9, needs a retest
16:41:43 <vkozhukalov> summary: 1266853 is critical, so we need to raise its level
16:42:11 <angdraug> huh?
16:42:14 <vkozhukalov> 1333779 is not reproducible (waiting for diagnostic snapshot)
16:42:46 <angdraug> how can 1266853 be critical?
16:43:03 <vkozhukalov> 1335880 needs to be fixed
16:43:38 <aglarendil> angdraug: if it is confirmed - it breaks compute addition feature
16:43:46 <vkozhukalov> 1266853 compute node cannot be added
16:43:52 <angdraug> 1335880 needs to be confirmed not to be a duplicate of 1323343
16:44:26 <vkozhukalov> ok, let's look at it more carefully
16:44:30 <vkozhukalov> moving on
16:44:38 <xarses> angdraug: should you and rmoe and I just go through and triage all of them?
16:44:45 <xarses> and go from there?
16:44:50 <angdraug> yes
16:44:55 <angdraug> let's move on
16:44:57 <vkozhukalov> #topic Some announcements
16:44:59 <xarses> ok, then that's our action
16:45:10 <vkozhukalov> #topic rabbitmq HA tightening status
16:45:20 <aglarendil> here
16:45:35 <aglarendil> so, I have been working on some ha stuff for rabbitmq
16:45:47 <aglarendil> it turns out we can fix almost all rabbitmq issues
16:45:52 <aglarendil> I mean the server part
16:46:07 <aglarendil> #link https://review.openstack.org/#/c/108792/
16:46:22 <aglarendil> this is still a draft
16:46:37 <aglarendil> I hope to finish it during next week and finally deal with rabbitmq clustering stuff
16:46:47 <xarses> as of yesterday it was still taking a long time to assemble the cluster and sometimes would still be broken
16:47:05 <aglarendil> xarses: yep, that's what I am going to address by this
16:47:08 <xarses> rmoe was having issues with it taking very long after a cold start
16:47:24 <aglarendil> xarses: talking to low-level stuff in mnesia
16:47:33 <aglarendil> xarses: in OCF scripts
16:47:38 <aglarendil> this will help us a lot
16:47:50 <aglarendil> and kick dead controller out of the cluster quickly
16:47:57 <aglarendil> right now it takes 60 seconds
16:48:04 <aglarendil> for rabbitmq to declare a node dead
16:48:10 <aglarendil> I spoke to rabbitmq guys. it is configurable
16:48:15 <aglarendil> by the net_ticktime parameter
16:48:21 <aglarendil> but I am afraid of false positives
16:48:30 <aglarendil> so I am gonna still leverage information from corosync
16:48:32 <aglarendil> and kick dead nodes
16:48:41 <aglarendil> by low level operations
16:49:07 <angdraug> why do you have to go through mnesia?
16:49:17 <angdraug> can't remove a node from the cluster via rabbitmqctl?
16:49:23 <aglarendil> it cannot
16:49:29 <aglarendil> angdraug: it first asks status
16:49:35 <aglarendil> angdraug: and sends rpc:multicall to all the nodes
16:49:43 <aglarendil> that it thinks are alive
16:49:50 <aglarendil> and timeout is net_ticktime
16:49:57 <vkozhukalov> guys it is very detailed discussion
16:49:59 <aglarendil> this freezes all the common rabbitmqctl operations
16:50:12 <vkozhukalov> let's make it more abstract
16:50:19 <vkozhukalov> we have 10 minutes
16:50:21 <aglarendil> #link https://groups.google.com/forum/#!topic/rabbitmq-users/f5pVGtR5ct0
16:50:22 <angdraug> aglarendil: please add all these details to your commit message :)
16:50:28 <aglarendil> here is a link to discussion
16:50:34 <aglarendil> angdraug:  I will
16:50:55 <angdraug> let's move on
16:50:57 <vkozhukalov> #topic galera improvements status
16:51:21 <aglarendil> I will talk
16:51:31 <aglarendil> I and holser finished work on galera improvements
16:51:39 <aglarendil> currently we are testing a fix for a small bug
16:51:52 <aglarendil> also these improvements allow for parallel deployment of controllers
16:52:03 <aglarendil> we are running bvt tests now
16:52:12 <aglarendil> if they pass - we will be able to extend CI tests
16:52:15 <xarses> as of yesterday fuel-library master and 340 is, galera still has assembly issues. I had several controllers fail in making mysql work
16:52:17 <aglarendil> to run all the controllers
16:52:33 <aglarendil> xarses: there is an ubuntu issue
16:52:39 <aglarendil> let me show a link
16:53:01 <aglarendil> xarses: #link https://bugs.launchpad.net/fuel/+bug/1347007
16:53:03 <uvirtbot> Launchpad bug 1347007 in fuel "Pacemaker cannot assemble Galera Cluster on Ubuntu" [High,In progress]
16:53:15 <aglarendil> xarses: fix is being tested
16:53:40 <aglarendil> after that we will do what you want by CI coverage and decrease deployment time drastically
16:53:59 <aglarendil> I am done
16:54:18 <vkozhukalov> great
16:54:26 <vkozhukalov> thanx a lot aglarendil
16:54:40 <vkozhukalov> #topic Third Party CI testing
16:54:47 <xarses> We need to have CI bots for integration like MLNX, NSX, or vCenter which we can't do with our current jenkins slaves, there have been a couple of changes that I'm hesitant to review, or know would break these components, but we don't have the CI to show that it's a problem. I propose that we start setting up CI bots to do basic tests so that we can ensure that we don't introduce large breakages with each commit.
16:56:13 <vkozhukalov> what are they supposed to look like? i mean ci bots
16:56:37 <vkozhukalov> is there kinda scheme how it is supposed to work?
16:56:45 <vkozhukalov> flow etc.
16:56:53 <xarses> like what we see in openstack projects like vmware minesweeper
16:57:24 <nurla> but some features require specific hardware, we may check only regression for third party components
16:57:26 <xarses> they attach to gerrit triggers and post voting, or non-voting reviews
16:57:34 <vkozhukalov> ok, i just did not see them
16:58:13 <vkozhukalov> is there email discussion for that?
16:58:29 <xarses> nurla: that's the point of why they need to be "third party" bots, in that we know they run in the required environment
16:58:47 <xarses> vkozhukalov: no, there is no ML thread yet
16:58:56 <aglarendil> we have no time, let's start an ML thread
16:59:12 <xarses> yes, will do
16:59:16 <vkozhukalov> aglarendil:  +1
16:59:23 <vkozhukalov> thanx everyone
16:59:26 <vkozhukalov> great meeting
16:59:30 <vkozhukalov> closing
16:59:31 <aglarendil> bye
16:59:36 <vkozhukalov> #endmeeting