14:59:57 #startmeeting openstack-helm
14:59:58 Meeting started Tue Aug 21 14:59:57 2018 UTC and is due to finish in 60 minutes. The chair is mattmceuen. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:00 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:03 The meeting name has been set to 'openstack_helm'
15:00:04 #topic Rollcall
15:00:08 o/
15:00:13 GM / GE / GD everyone!
15:00:17 o/ srwilkers
15:00:20 hi
15:00:24 hey tdoc
15:01:14 o/
15:01:16 Here's our agenda: https://etherpad.openstack.org/p/openstack-helm-meeting-2018-08-21
15:01:27 Please go ahead and add anything you'd like to discuss today
15:01:35 Otherwise I'll give one more min for folks to filter in
15:02:42 o/ jayahn!
15:02:48 o/
15:03:02 o/
15:03:15 #topic LMA News
15:03:17 mattmceuen!!
15:03:23 Good to see you man :)
15:03:51 srwilkers has been hard at work testing our LMA stack in various labs of various sizes and workloads
15:03:56 yeah.. it was the independence holiday + summer vacation last week.
15:04:06 that is good.
15:04:26 That sounds awesome jayahn - hope it was an awesome vacation
15:04:28 well earned
15:04:43 we are analyzing what each exporter gathers, which are the significant ones to watch, and alarms..
15:05:12 have you ever deployed the current lma stack at scale in a working osh cluster?
15:05:13 srwilkers has been doing some of the same thing as part of his analysis
15:05:23 oh hello
15:05:28 at scale.. how big?
15:06:15 lets start small: >10 nodes, with active workloads?
15:06:48 i think we did a fairly good test on >10, for the logging part
15:07:24 did you run into any issues? OOMs or similar?
15:07:24 for prometheus, we are behind schedule.
15:07:46 on the elasticsearch side, i heard sungil experienced lots of OOMs
15:08:08 sungil had experienced..
15:08:15 are you running default values mostly, or have you started providing more fine-grained overrides for things like fluentbit and fluentd? my biggest takeaway from the logging stack was that it's better to leverage fluentd to provide smarter filters than just jamming everything into elasticsearch
15:08:43 once i started adding more granular filters and dumping specific entries, elasticsearch was much healthier in the long term
15:08:50 nope, i think we override values on es, fluent-bit.
15:09:09 i will ask sungil tomorrow on this
15:09:42 that would be great jayahn
15:09:45 he has been struggling with logging for the last two months..
15:09:48 i've done quite a bit of work on this and exposed it as part of the work to introduce an ocata based armada gate
15:09:49 https://review.openstack.org/#/c/591808/12/tools/deployment/armada/manifests/ocata/armada-lma.yaml
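As an illustration of the filtering approach described above (dropping noisy records in fluentd rather than indexing everything in Elasticsearch), here is a minimal sketch of a values override. The conf.fluentd.template path, the tag pattern, and the DEBUG match are assumptions for illustration only and may not match the chart's actual schema; the grep filter itself is a standard fluentd plugin.

    # Hypothetical fluent-logging values override -- the conf.fluentd.template
    # key is assumed for illustration and may not match the chart's schema.
    conf:
      fluentd:
        template: |
          # Exclude DEBUG-level container log records before they are
          # forwarded to Elasticsearch, using fluentd's core grep filter.
          <filter kube.var.log.containers.**>
            @type grep
            <exclude>
              key log
              pattern /DEBUG/
            </exclude>
          </filter>

Filtering at the fluentd stage keeps the noisy records from ever being indexed, which is where most of the Elasticsearch memory pressure described above tends to come from.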
15:10:00 did some test on federation as well.
15:10:49 okay.
15:11:43 prometheus is a whole different beast though
15:11:50 agreed
15:12:03 it graduated at least. :)
15:12:14 lol
15:12:17 srwilkers I believe you're moving more sane defaults into the charts so that operators can choose to let more logs through to elasticsearch if they need them, right?
15:12:43 we are urgently hiring a person to take on prometheus, really short-handed right now. :)
15:13:00 srwilkers: you will always be welcome here :)
15:13:12 hey hey save the poaching for after the team meeting
15:13:27 jayahn: lol
15:13:33 jayahn: I'm his agent
15:13:49 and I'm portdirect's agent
15:13:56 and alanmeadows is yours?
15:14:02 we all get a cut
15:14:13 hey.... pyramid organization...
15:14:19 :D
15:14:22 bernie madoff would be proud
15:14:24 anyway
15:14:26 lol
15:14:29 Anything else to share on the LMA front folks?
15:14:35 yeah
15:14:37 FYI it hurts corporate diversity :p
15:14:48 I know, I have been there :p
15:15:00 im starting to work on pruning out the metrics we actually ingest into prometheus by default
15:15:23 as we were consuming a massive amount of metrics that we arent actually using with grafana/nagios/prometheus by default
15:15:39 frankly speaking, we are currently guaranteeing "short-term usage" for elasticsearch, and asking all the operation teams to help us fine-tune this logging beast if they want to use it as more long-term logging storage
15:15:40 :)
15:16:04 cadvisor was the biggest culprit -- i've proposed dropping 41 metrics from cadvisor alone, and that reduced the total number of time series in a single node deployment from 18500ish to a little more than 3000
15:16:14 srwilkers: I think we can help on that
15:16:17 wow
15:17:15 node exporter is probably my next target
15:17:23 as there's some there we dont really need
15:17:30 srwilkers: good to know
15:18:08 srwilkers: is there a doc for the metrics that we are currently collecting?
15:18:12 srwilkers: were you able to get all we lost from cadvisor out of k8s itself, or too early to say?
15:18:13 it's something that needs some attention though, because ive been seeing prometheus fall over dead in a ~10 node deployment with 16GB memory limits
15:18:37 and it was hitting that limit after about 2 days without significant workloads running on top
15:18:45 portdirect: too early to say
15:19:03 jayahn: maybe your team could help there?
15:19:19 jamesgu: we currently gather everything available from every exporter we leverage. i dont have a list handy yet, but can provide a quick list of exporters we have
15:19:42 for every exporter, we are doing something like this. https://usercontent.irccloud-cdn.com/file/mtrps177/Calico%20Exporter.pdf
15:19:48 srwilkers: if you could document it, that would be nice :)
15:20:20 evrardjp: yeah, it's about that time :)
15:20:21 if we can set up a wiki page we can all use, I will certainly upload the information we have summarized so far, and work together.
15:20:34 srwilkers: that would be very nice.
15:20:44 a wiki or anything to put this massive document, or information to share
15:20:46 srwilkers: ping me for reviews when ready
15:20:53 evrardjp: nice, cheers
15:21:12 no need for the whole document, just saying how it works
15:21:19 have we run into disk issues too, besides memory?
15:22:01 jamesgu: yep. noticing 500gb PVCs filling up in ~7 days time on a similar sized deployment (~8-10 nodes)
15:22:15 which once again is due to the massive amount of time series we gather and persist
15:22:31 yeah pruning would be an important part :)
15:22:38 have any idea on the i/o reqs?
15:22:47 as well as raw capacity
15:23:15 and things like cadvisor are especially bad, because there's ~50 metrics that get gathered per container, so if you think about how many containers would be deployed in a production-ish environment, that gets out of hand quickly
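The pruning srwilkers describes maps to metric_relabel_configs rules in the Prometheus scrape configuration, which discard unwanted series at ingestion time. The sketch below is illustrative only: the job name and the regex are placeholders, not the 41-metric list actually proposed, and where this lands in the chart's values is not shown.

    # Illustrative Prometheus scrape job -- the job name and the dropped-metric
    # regex are placeholders, not the actual list proposed for the chart.
    scrape_configs:
      - job_name: kubernetes-nodes-cadvisor
        kubernetes_sd_configs:
          - role: node
        metric_relabel_configs:
          # Drop cadvisor metric families that are never charted or alerted on,
          # so they are discarded before being written to the TSDB.
          - source_labels: [__name__]
            regex: 'container_(tasks_state|memory_failures_total|network_tcp_usage_total)'
            action: drop

Because metric relabelling runs after the scrape but before storage, dropped series still cost scrape bandwidth but no longer consume Prometheus memory or disk.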
15:23:50 portdirect: not at the moment -- certainly something that would be nice to get multiple peoples' input on. would be awesome if you could help evaluate that too jayahn
15:24:42 I am missing the conversation flow..
15:25:17 sorry.. what awesome thing can I do?
15:25:29 could you kindly summarize?
15:25:30 jayahn: oh, sorry. just getting a better idea of the io requirements and storage capacity requirements for prometheus in a medium/large-ish deployment
15:25:48 ah.. okay
15:25:50 jayahn: in your env do you know how much pressure lma has been putting on the storage - both capacity wise, and IOPs/throughput?
15:26:36 for prometheus, we have a plan to test that on a 20 node deployment from next week
15:26:55 we are right now enabling every exporter.
15:27:22 so, i guess we can share something next month.
15:28:00 jayahn: my point (sorry to have disrupted the flow) was that your research can be documented https://docs.openstack.org/openstack-helm/latest/devref/fluent-logging.html
15:28:00 to get an idea on capacity planning / requirements for prometheus.
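For the capacity-planning question above, the usual back-of-envelope from Prometheus operational guidance is retention time multiplied by ingested samples per second and bytes per sample (roughly 1-2 bytes per sample with the 2.x TSDB). The worked numbers below are assumptions for illustration, not measurements from either lab:

    needed_disk ~= retention_seconds x samples_per_second x bytes_per_sample

    # Assumed: 10 nodes x ~3000 series per node (post-pruning), 15s scrape
    # interval, ~2 bytes per sample, 7 days retention.
    samples_per_second  = (10 x 3000) / 15          ~= 2,000
    needed_disk         = 604,800 x 2,000 x 2 bytes ~= 2.4 GB

In practice the series count is the term that grows fastest, since per-container metrics and additional exporters multiply with workload size, so deployments that scrape everything will land far above a sketch like this.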
15:28:07 would you plan to incorporate srwilkers' pruning work, or do you want all that data?
15:28:22 or elsewhere, as this is maybe not enough
15:29:28 evrardjp: a document would be a good place once we finalize all the contents, but WIP information sharing might be better with more flexible tools, like a wiki
15:29:43 evrardjp: that's largely my fault. it's no secret that the biggest documentation gap we have is the LMA stack
15:29:49 Sorry guys, great discussion but we need to move on unfortunately
15:29:54 mattmceuen: agreed
15:29:57 Let's touch base next week
15:30:01 mattmceuen: I will review srwilkers' pruning work, and try to leverage that
15:30:25 Thanks jayahn, hopefully it's a quick & easy win for you to learn from our pain :)
15:30:59 Ok speaking of this and going slightly out of order as it's probably related to this topic
15:31:07 #topic Korean Documentation
15:31:17 oh
15:31:40 so - we have some awesome work being done by korean speaking community members
15:31:53 and they have some fantastic docs
15:32:03 that's nice :)
15:32:09 is that linked to the i18n team?
15:32:15 not yet!
15:32:23 not yet. :)
15:32:25 sorry, go ahead :)
15:32:38 jayahn: can we work together to get korean docs up for osh
15:32:49 so your team can start moving work upstream?
15:32:54 that would be no problem.
15:33:11 so it would be korean docs? no need to translate to english?
15:33:29 in what would be an awesome bit of reversal, i think the other english speakers would be happy to help translate them into english
15:33:33 jayahn: I guess you still need to have upstream english, but that can go through the i18n process to publish korean docs
15:34:00 evrardjp: i think we need to work out how to handle this case prob a bit differently
15:34:01 if it's following the standard process :)
15:34:16 yeah I guess the first step would be to go the other way around?
15:34:17 as there are more korean docs than english....
15:34:22 i think so?
15:34:27 okay. I will talk to ian.choi, the previous i18n PTL
15:34:31 yeah, but I am not sure the tool is ready for that.
15:34:41 jayahn: can you loop me in on that please
15:34:41 jayahn: that's great, I was planning to suggest that :)
15:35:13 I think this is a great idea
15:35:22 we both (ian.choi and myself) will be at PTG, we can have a f2f discussion on this topic as well
15:35:24 if you need help on the english side, shoot. I think good docs are a good factor for community ramp-up.
15:35:38 could not agree more evrardjp
15:35:45 and seeing things like this: https://usercontent.irccloud-cdn.com/file/mtrps177/Calico%20Exporter.pdf
15:36:02 makes me sad, as this is such a great resource to have
15:36:02 jayahn: let's plan that PTG part in a separate channel :)
15:36:06 jayahn: youre coming to denver? time for more beer
15:36:41 I told the foundation that I only have a budget to do a single trip between PTG and Summit
15:36:50 so - mattmceuen can we get an action item to get this worked out at ptg
15:36:51 so I guess the question was: do we all agree to bring more docs from jayahn upstream, and how do we do things, right?
15:36:52 they kindly offered me a free hotel
15:36:59 I will add it to the agenda
15:37:07 evrardjp: 100%
15:37:13 oh that's awesome jayahn. #thanksOSF!!!
15:37:27 okay. doing upstream in korean is really fantastic!
15:37:29 that's cool indeed :)
15:37:58 Alrighty - anything else before we move on?
15:37:59 should we discuss the technicalities more at the PTG now that ppl are in agreement we should bring your things in?
15:38:13 mattmceuen: I guess we agree there :)
15:38:15 I think that'll be easier
15:38:25 we can move in that direction ahead of time
15:38:39 but lets plan on having things in good shape by the time PTG is over
15:38:53 I think Frank or Ian's input would be valuable here.
15:39:00 jayahn / evrardjp: quick question, to be able to report things like calico being unable to peer to prometheus, are you running prometheus and all scrapers in host networking mode?
15:39:49 unfortunately, i am not an expert on that, but I will ask hyunsun and get back to you. just put your question on the etherpad. :)
15:40:42 jayahn: is there a reason your team cant attend these? time/language etc?
15:40:57 time and language
15:41:05 lol - the double whammy
15:41:09 :)
15:41:28 dan, robert often attend these. they have english capability.
15:41:37 the other thing i'd like to discuss at the ptg is how to bridge that gap a bit better
15:41:47 but most of the others are not
15:42:22 i totally agree.. it has been a very difficult point for me as well.
15:42:54 lets start on the docs - and use that as a way to close the language barrier better
15:43:05 Next week let's revisit meeting timing -- we still haven't found a time that works well for everyone
15:43:22 But if we can try harder and find a good time that would be really valuable
15:43:43 Alright gotta keep movin'
15:43:45 #topic Moving config to secrets
15:43:53 oh hai
15:44:21 so - I'm working to move much of the config we have for openstack services to k8s secrets from configmaps
15:44:36 \o/
15:44:40 this should bring us a few wins
15:44:57 1) stop writing passwords/creds to disk on nodes
15:45:30 2) give us more granular control on rbac for ops teams*
15:45:50 3) let us leverage k8s secrets backends etc
15:46:26 * this will fully come in follow-up work, when we start to split out 'config' from 'sensitive config'
15:47:02 Just wanted to highlight this - as it will be a bit disruptive for some work in flight
15:47:13 but i think it moves us in the right direction.
15:47:26 it's positive disruption -- maybe using release notes would help :p
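The change described above essentially means rendering the generated service configuration into a Secret instead of a ConfigMap, so sensitive values live in the API server (and can later be delegated to a secrets backend) rather than in plain ConfigMap-backed files. A minimal hand-written sketch of the target object follows; the name, namespace, and file contents are placeholders, not what the in-flight patches actually generate.

    # Hypothetical example of service config carried as a Secret rather than
    # a ConfigMap; names and contents are placeholders, not chart output.
    apiVersion: v1
    kind: Secret
    metadata:
      name: keystone-etc
      namespace: openstack
    type: Opaque
    stringData:
      keystone.conf: |
        [database]
        # Credentials no longer land in a ConfigMap, and access can be
        # restricted with tighter RBAC rules on secrets resources.
        connection = mysql+pymysql://keystone:REDACTED@mariadb.openstack.svc.cluster.local/keystone

A Secret volume mounts into the pod the same way the ConfigMap volume did, so the consuming containers themselves should not need to change.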
15:47:44 Making sure I understand the last part: is this the path
15:47:44 1) None of the configs are secrets today
15:47:44 2) All configs that contain passwords etc will be secrets soon
15:47:44 3) More fine-grained split between the two in the future
15:47:44 ?
15:47:57 1) yup
15:48:08 2) yup
15:48:44 3) yeah
15:49:03 * three may take some time to implement, and frankly may not be possible
15:49:10 but thats the intent
15:49:13 #2 is my favorite
15:49:20 ++
15:49:21 But yeah - #3 would be nice
15:49:22 ++
15:49:30 That's awesome portdirect
15:50:20 Any questions on secrecy before we move on?
15:50:36 none, positive improvement, thanks portdirect
15:50:36 #topic Tempest
15:50:54 We have several colors of lavender in the etherpad, I think this may be you jayahn :)
15:51:23 just curious on tempest usage
15:51:37 Sharing the full question:
15:51:37 AT&T uses tempest? We found out that the "regex, blacklist, whitelist" part is not working well. tempest 19.0.0 is required for pike; the regex generation logic has changed from the "currently available tempest 13.0.0 on osh upstream". Just curious how gating or AT&T uses tempest. We think tempest needs to be fixed, similar to rally.
15:52:14 yeah.. that
15:52:16 We are still integrating tempest into our downstream gating
15:52:55 tempest 19.0.0 is required in rocky for keystone api testing, if you do it. 18.0.0 will not work.
15:53:07 and queens
15:53:28 the tempest chart we have today is very unloved :(
15:53:43 and could do with a blanket, and some cocoa.
15:53:46 so like the discussion we had about rally, we need to find a good way to keep a tempest version for each openstack release, and have corresponding values
15:53:48 i love it only enough to kick it every now and then
15:54:01 rough crowd!
15:54:21 jayahn: so, for OSA, we are using tempest 18.0.0 for everything until rocky.
15:54:37 that should work, as tempest is supposed to be backwards compatible
15:54:43 evrardjp: we should make that same shift then
15:55:20 if you point me to your whitelist/blacklist, I can help on which version should be required per upstream branch
15:55:41 but we ourselves are thinking of moving everything to smoke.
15:55:55 ++ this makes sense for community gates
15:56:11 we did manage to make it work.
15:56:23 what did you do to get it working jayahn?
15:56:35 jayahn: can you get a PS up with the changes you made?
15:57:36 portdirect: indeed, for community, I'd think that smoke tests are fine. You can do more thorough tests in periodics or internally.
15:58:19 ++
15:58:21 portdirect: okay
15:58:48 alright guys - we're at a couple minutes to time
15:58:53 #topic Roundtable
15:59:06 I will move the things we didn't get to today to next week, sorry for not hitting everything today
15:59:07 pls review PS. :)
15:59:13 Yes!
15:59:16 one big thing - helm 2.10 is here!
15:59:21 helm yeah!
15:59:32 so expect to see some tls related patches from ruslan and I ;)
15:59:36 yeah!
16:00:02 thanks everyone
16:00:43 By any chance did anyone go through this
16:00:55 https://storyboard.openstack.org/#!/story/2003507
16:01:27 portdirect: u said u will check yesterday, did u find anything??
16:01:40 Gotta shut down the meeting goutham1 - can we move this into #openstack-helm ?
16:01:47 Thanks all!
16:01:53 #endmeeting