16:00:34 #startmeeting kolla
16:00:35 Meeting started Wed Nov 1 16:00:34 2017 UTC and is due to finish in 60 minutes. The chair is Jeffrey4l. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 The meeting name has been set to 'kolla'
16:00:58 #topic roll-call
16:01:02 o/
16:01:08 o/
16:01:23 hi chason ;D
16:01:41 Jeffrey4l Hahah
16:01:43 o/
16:01:47 hi vhosakot
16:01:50 \o/
16:01:51 o/
16:01:59 hi duonghq :)
16:02:37 let us wait another two minutes.
16:02:46 this will be a short meeting.
16:02:59 ya, I have 2 topics
16:03:08 duonghq, cool.
16:03:29 we have no scheduled topics today.
16:03:49 ya, I forgot add to our schedule
16:04:15 ok. let us start.
16:04:22 #topic Announcements
16:04:54 Sydney summit will be hold next week.
16:05:22 0/
16:05:26 ohhh, time flies too fast
16:05:37 so next meeting will be canceled.
16:05:45 duonghq, yes.
16:05:57 any other announcement from community?
16:06:52 let us start the open discuss directly.
16:07:04 #topic open discuss
16:07:11 duonghq, your call.
16:07:18 thank you Jeffrey4l
16:07:27 first one is quite simple
16:07:39 I get this bug many time: https://bugs.launchpad.net/kolla-ansible/+bug/1729246
16:07:40 Launchpad bug 1729246 in kolla-ansible "MariaDB cluster fails to start after upgrade" [Undecided,New]
16:07:58 o/
16:08:00 I run on 2 nodes with 6GB memory/node
16:08:03 hi pbourke
16:08:21 can somebody help me test this upgrade again
16:08:42 duonghq, mariadb changed some thing recently.
16:09:05 our upgrade process ( ansible roles ) should fix the gap.
16:09:19 so, can you triaged this bug?
16:09:28 the safe_to_bootstrap do not exist before.
16:09:29 sure.
16:09:39 I cannot find any bug related to this issue
16:09:44 thank Jeffrey4l
16:10:09 curiosity upgrade is failed.
16:10:25 anyway, i will check this.
16:10:36 thank you for pointing this out.
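[note] The safe_to_bootstrap flag mentioned at 16:09:28 is a field in Galera's grastate.dat state file; state files written by older Galera releases do not have it. The following is only a minimal sketch of how a check could tell the two cases apart. It is not taken from the kolla-ansible roles; the function names and the sample file contents are made up for illustration.

```python
# Hypothetical helper, not kolla-ansible code: inspect a Galera grastate.dat
# and report whether the safe_to_bootstrap flag is present and set.

# Illustrative sample contents (the uuid values are invented).
OLD_STATE = """# GALERA saved state
version: 2.1
uuid:    6f0c2a3e-bd4a-11e7-8f1a-0800200c9a66
seqno:   -1
"""

NEW_STATE = OLD_STATE + "safe_to_bootstrap: 0\n"


def parse_grastate(text):
    """Parse 'key: value' lines into a dict, skipping comments and blanks."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def safe_to_bootstrap(text):
    """Return True or False when the flag exists, None when it is absent."""
    value = parse_grastate(text).get("safe_to_bootstrap")
    if value is None:
        return None
    return value == "1"
```

A state file without the flag returns None here, which is the case an upgrade path would have to handle explicitly rather than treating it as "not safe".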
16:11:09 :)
16:11:40 btw, there is another old bug report that possible data loss during mariadb recovery.
16:11:52 https://bugs.launchpad.net/kolla-ansible/+bug/1682153
16:11:53 Launchpad bug 1682153 in kolla-ansible "mariadb_recovery is prone to data loss" [Critical,Confirmed]
16:12:15 sure, I saw that, seem that Sam proposed a fix for the bug
16:12:37 that bug has bitten me before, that articulate bug report helped narrow down and recover cluster
16:12:37 no patch right now
16:12:58 but he propose a possible solution way in the description.
16:13:34 we should implements his proposal
16:13:39 yes.
16:13:43 than let it roll for awhile
16:14:14 okay. please move on
16:14:30 sure,
16:14:45 so, my 2nd topic is about Kolla-ansible HA layer
16:15:13 do anybody know why we use haproxy/keepalived for HA layer?
16:15:23 but not pacemaker/corosync stack
16:15:41 duonghq, pacemaker/corosync is more complicated.
16:15:47 far more complicated
16:16:07 and iirc, pacemaker can not be containerized before ( now it should work )
16:16:29 but it provides some mechanism for react with the failure, like data plane evacuation
16:17:03 it is invaluable feature (IMO)
16:17:43 duonghq, yes. pacemaker is powerful than keepalived. but what kind of issue we are facing by using keepalived.
16:18:08 it just for add some functionality to our stack,
16:18:38 I'm thinking in implement pacemaker into Kolla and let user choose which HA stack they want to use
16:19:24 that will be cool. pacemaker can handle more than keepalived.
16:20:00 Jeffrey4l, so, I'll create a blueprint for this, is it ok?
16:20:24 and try to containerized pacemaker (again)
16:20:28 but i just afraid what it will take to Kolla. more better health check? or fail over?
16:21:03 sure. a blueprint is necessary for others to evaluate the possibility.
16:21:29 about healthcheck, I'm not sure only it can make Kolla better
16:21:48 pacemaker in kolla feels like a solution to a problem that doesn't yet exist; granted from a maturity standpoint eventually moving to pacemaker from keepalived is probably what needs to happen
16:21:49 but I'm certainly about failover
16:22:01 please write what you think and the benefit.
16:22:22 sure
16:23:50 rfxn, tbh, i will current keepalived is enough. ;)
16:24:22 i think*
16:24:27 but who know what duonghq will take
16:24:37 current, ya i think keepalived is more than enough -- time could be better spent on other areas of HA and Disaster Recovery instead of pacemaker atm
16:24:48 I think we can let user choose which stack they like,
16:25:04 rfxn, can you suggest some area?
16:25:37 MariaDB is treated emphemeral right now, we put it behind a galera cluster and smile
16:25:45 in reality, if you loose mysql data, your done
16:25:53 we need a reliable, backup strategy for mariadb
16:26:05 clustering = HA, backup = DR
16:27:08 rfxn, are u meaning use pacemaker as DR solution?
16:27:23 I'm thinking it is slightly out of scope of Kolla
16:28:48 Jeffrey4l, no, im all for keepalive (now) and pacemaker (later, as a maturity point); duonghq asked some area to suggest, my thought is lack of backup solution for mariadb is the largest, most volatile gap, currently
16:29:13 ah, got.
16:29:45 db backup is really necessary.
16:30:39 so please register a bp for this duonghq , and let us discuss base on the bp.
16:30:40 thanks
16:31:00 then any other topics?
16:31:05 Jeffrey4l, sure
16:31:15 happy to help discuss and lay out options in that bp
16:31:22 thank you rfxn
16:32:04 guess no topic.
16:32:06 https://blueprints.launchpad.net/kolla/+spec/database-backup-recovery <- related
16:32:33 rfxn, ah, it should be in kolla-ansible
16:32:36 how do you think, Jeffrey4l
16:32:44 sure.
16:32:51 and xtrabackup should be the best solution.
16:32:59 the backup should run periodic.
16:33:15 agreed xtrabackup/innobackupx with a few output options = win
16:33:56 two smaller items; updating haproxy from 1.5 to 1.7
16:33:57 and we can add this jobs into cron containers. and save the backups into a new docker volumes.
16:34:38 and letsencrypt for automagick issuance of external_fqdn ssl certs
16:35:03 refer to backup, there are other ways to backup. like ceph pg map, crush rule.
16:35:10 rfxn, what upgrade haproxy?
16:35:16 what/why
16:35:52 1.7 offers better ssl termination, http2, more advanced acl features, far more performant
16:36:22 rfxn, letencrypt may be hard. because during sign new certs, it requires network connective and a public domain.
16:37:01 rfxn, basically, package version in kolla based on the linux distro repo.
16:37:27 i dont think those are blockers, most production deployments are going to be on an external fqdn w/ internet access
16:37:52 i have a poc with haproxy that routes letsencrypt CA challenge/response to a dedicated listener on network group systems
16:38:00 so it never needs to touch the horizon container
16:38:41 letsencrypt is a nice to have, maybe not a need to have :)
16:39:06 yep
16:39:08 but would minimize the barrier to entry on new deployments imo
16:40:14 rfxn, can you elaborate your point? about new deployments
16:40:18 and we can implement this in "kolla-ansible certifications" command.
16:42:11 SSL certificates, valid browser recognized CA certificates, are the norm on any production ready deployment. Right now, managing SSL certificates is a pain, prone to human error and nothing should ever be internet facing without a valid cert.
16:42:23 letsencrypt allows us to automate the process entirely, in very trivial way
16:43:07 letsencrypt is a gread service ;p
16:43:11 rfxn, thank you for teach me this
16:43:26 great*
16:43:47 not trying to preach, just speak aloud that imo automagick ssl certificates would set kolla apart and remove a human error prone component for the stack
16:43:59 e.g azure going down cause of forgetting to renew api connector certs :P
16:44:13 lol
16:44:43 rfxn, will you add such feature into kolla?
16:45:56 i can bp it
16:46:00 and we go from there
16:46:10 cool. thanks.
16:46:26 and happy to share out my ugly code as poc :)
16:46:54 rfxn, thanks for sharing ;D
16:47:10 any other topics?
16:47:23 rfxn, poc is always has many idea
16:48:03 guess no. let us end the meeting.
16:48:08 thanks for all coming.
16:48:13 #endmeeting
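[note] On the certificate discussion (16:42:11 to 16:43:59): the human-error case raised there is a forgotten renewal. As a rough sketch only, an automated check could compare a certificate's notAfter timestamp against a renewal threshold. Nothing below comes from kolla or from the proof of concept mentioned in the meeting; the function names and the 30-day threshold are assumptions chosen for illustration.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now):
    """Days between `now` (epoch seconds) and a cert's notAfter string.

    `not_after` uses the format found in certificate dicts returned by the
    ssl module, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after, now, threshold_days=30):
    """True when fewer than `threshold_days` remain (threshold is assumed)."""
    return days_until_expiry(not_after, now) < threshold_days
```

In practice `now` would be `time.time()`, and the check would run periodically so that a renewal is triggered well before the certificate lapses.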