15:00:47 <srwilkers_> #startmeeting openstack-helm
15:00:49 <openstack> Meeting started Tue Jun 20 15:00:47 2017 UTC and is due to finish in 60 minutes.  The chair is srwilkers_. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:52 <openstack> The meeting name has been set to 'openstack_helm'
15:01:01 <srwilkers_> happy tuesday everyone
15:01:15 <portdirect> \o/
15:01:16 <alraddarla> o/
15:01:27 <srwilkers_> we've got a few items on the agenda for today. can be found here: https://etherpad.openstack.org/p/openstack-helm-meeting-2017-06-20
15:01:48 <dulek> o/
15:02:07 <lrensing> o/
15:02:29 <srwilkers_> let's give it a minute or so to let others filter in
15:02:39 <srwilkers_> then we can tackle these topics, and then open up for discussion afterwards
15:03:21 <lamt> o/
15:03:36 * gagehugo lurks
15:04:09 <v1k0d3n> o/
15:04:28 <srwilkers_> alright, let's go ahead and start tackling these then
15:04:29 <srwilkers_> #topic Health Checks
15:04:45 <srwilkers_> portdirect, i think you raised this topic last week or this one -- can't recall
15:04:47 <srwilkers_> you've got the floor
15:06:02 <portdirect> yeah - I've got a few queries about these - as I'm not sure we want to be killing/restarting pods when comms with infra is broken - seems like a way to create a runaway train. but was wondering if dulek had thought of something I'd missed?
15:06:03 * dulek is here to answer any questions about the patches.
15:06:07 <portdirect> :)
15:06:37 <dulek> Right, that's something I was wondering about too.
15:07:06 <dulek> My intention with the patches is also to have some indication for the admin that a pod is unhealthy.
15:07:53 <dulek> Also I can imagine a pod being rescheduled to another kubelet, which resolves the problem if it was caused by physical network failure.
15:08:28 <portdirect> yeah - though i can also see a transient problem causing the entire cluster to go down?
15:08:37 <dulek> Please note that in case of Neutron and Nova, healthchecks are based on nova service-list/neutron agent-list commands. I was hoping to do the same with the rest of the charts but it isn't always possible.
15:09:24 <dulek> portdirect: A pod restart when a transient failure occurs isn't really that bad a thing IMO.
15:10:56 <portdirect> it is if they all restart simultaneously - that's gonna potentially put a huge load on your underlying services
15:11:30 <dulek> portdirect: Well, I cannot disagree with that.
15:11:51 <portdirect> but i get your point - would the visibility you're aiming for not be better achieved through a mechanism like centralised logging?
15:12:18 <dulek> portdirect: Though if your infra is suffering from transient failures, you should probably tweak liveness probe period.
15:13:25 <portdirect> the best thing about failure is it's like the Spanish Inquisition: https://www.youtube.com/watch?v=sAn7baRbhx4
15:13:34 <dulek> portdirect: RabbitMQ or DB disconnections could be addressed through that, I'm not sure about monitoring "nova service-list" though.
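[editor's note: for context, the liveness probes dulek describes would sit in a chart's pod spec. The sketch below is illustrative only, assuming a hypothetical health-probe.py helper that wraps a "nova service-list"-style check; the script name, path, and timing values are not taken from the patches under review.]

```yaml
# Hypothetical liveness probe for an agent-style pod (e.g. nova-conductor).
# The probe command is an assumed placeholder, not the actual implementation.
livenessProbe:
  exec:
    command:
      - python
      - /tmp/health-probe.py       # assumed helper checking the agent's health (service-list, RPC, DB)
  initialDelaySeconds: 30          # give the service time to start and register
  periodSeconds: 60                # the "probe period" knob mentioned above
  timeoutSeconds: 10
  failureThreshold: 3              # require several consecutive failures before kubelet restarts the pod
```

Lengthening periodSeconds and raising failureThreshold is the tuning dulek refers to: it makes the probe tolerant of brief RabbitMQ/DB blips at the cost of slower detection of genuinely dead pods.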
15:14:06 <portdirect> srwilkers_: will the work you're doing be able to monitor those things?
15:14:26 <srwilkers_> portdirect, yeah. it should be able to.
15:14:53 <srwilkers_> and once we actually get it into addons, we can start looking at specific exporters for prometheus and adjusting the rules as necessary
15:15:35 <portdirect> would you be able to work with dulek to get a list of the things we'd need to monitor in a spec?
15:15:48 <srwilkers_> sure, i can get that started today
15:16:27 <srwilkers_> #action srwilkers: document appropriate monitoring targets with dulek
15:16:49 <portdirect> dulek: would that work for you? I suspect it would need a combination of your work and srwilkers_'s to get all we need
15:16:50 <dulek> I'll be ending my office day just after the meeting, but I can definitely provide a list of alarms in logs/commands that we should look for.
15:17:21 <portdirect> awesome - cheers dude
15:17:23 <dulek> portdirect: Sure, I'll take a look at srwilkers_ work and see how we can use it.
15:18:02 <srwilkers_> okay cool. think we can move on then?
15:18:28 <srwilkers_> #topic Gating for Addons/Infra
15:18:53 <srwilkers_> so we've got openstack-helm-addons and openstack-helm-infra now, in addition to the primary repo
15:19:32 <srwilkers_> portdirect, lamt and StaceyF have been tossing around ideas for gating the three repos in a way that makes sense in how the repositories are expected to be used
15:20:39 <srwilkers_> the current idea is to explore zuul-cloner to see how we can run checks on the three repos without introducing any race conditions or overhead
15:20:42 <lamt> srwilkers_ -infra was created overnight, I will work to set up the linter gate there, also to clean up -addons
15:21:13 <srwilkers_> lamt, awesome. i can help as well if you need. think that should be higher priority so we can start getting stuff out of the queue in addons
15:21:18 <srwilkers_> there's already a bit backed up there
15:21:24 <lrensing> just for clarity, can we define where we draw the line for each of the repos?
15:21:27 <srwilkers_> lamt, any other thoughts?
15:22:04 <lamt> for now, either tarball or git clone - will need to play around with it
15:22:07 <srwilkers_> lrensing, sure. the expectation is that infra is anything required to run the openstack services on top of.  addons is anything ancillary that can be used in conjunction with them
15:22:36 <srwilkers_> that's my view at any rate
15:22:38 <portdirect> ^^ ++
15:22:44 <lrensing> sounds good :)
15:22:54 <srwilkers_> lamt, awesome
15:23:16 <srwilkers_> #action lamt to explore tarballs or git clone for multi-repo checks/gates
15:23:26 <portdirect> lamt: I'd like us to use git clone when not in infra - being able to run the gate scripts locally is pretty important :)
15:23:27 <lamt> the linter should work for -addons, please review it.  will refine the 3-node gate later
15:23:41 <lamt> portdirect sounds good
15:25:06 <portdirect> perhaps we should explore having an 'armada' check as well?
15:25:22 <portdirect> alanmeadows: ^?
15:25:45 <alanmeadows> it supports remote git urls and branch/tag targets
15:25:49 <alanmeadows> seems like the fit you're after
15:26:22 <alanmeadows> I have a working manifest for ~master at this point
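[editor's note: the remote git support alanmeadows mentions would let an armada manifest pull charts straight from the openstack-helm repos at a given branch or tag. The chart document below is only a sketch; the schema and field names are assumptions for illustration, not taken from alanmeadows' working manifest.]

```yaml
# Sketch of an armada chart document sourcing a chart from a remote git URL.
# Field names here are illustrative assumptions and may differ from the real schema.
schema: armada/Chart/v1
metadata:
  schema: metadata/Document/v1
  name: keystone
data:
  chart_name: keystone
  release: keystone
  namespace: openstack
  source:
    type: git
    location: https://git.openstack.org/openstack/openstack-helm   # remote git url
    subpath: keystone                                              # chart directory within the repo
    reference: master                                              # branch, tag, or commit target
  dependencies: []
```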
15:26:52 <portdirect> nice - if we got an additional 3 node check would you be able to submit a ps for it?
15:27:17 <alanmeadows> sure
15:28:08 <srwilkers_> alanmeadows, portdirect: awesome :)
15:28:22 <portdirect> sounds good - I'll try and get that up today
15:29:10 <srwilkers_> #action portdirect will look into 3 node check for armada check
15:29:41 <srwilkers_> anything else on gating?
15:30:09 <portdirect> I'll be working with lrensing to get ceph running in the gate today and start the road to dropping nfs :D
15:30:50 <srwilkers_> portdirect, that'd be really awesome :)
15:30:54 <portdirect> we'll also have vms running in the gate soon - i just need to tidy up my ps
15:31:29 <portdirect> https://review.openstack.org/#/c/474960/
15:31:35 <srwilkers_> that's great. let's get some visibility on those when they're ready for review.  would be nice to get those in for sure
15:32:06 <StaceyF> I put the 3rd party gate as a separate topic but can bring it up now?
15:32:10 <srwilkers_> oh nice
15:32:17 <srwilkers_> of course StaceyF, floor's yours
15:32:52 <srwilkers_> #topic Third Party Gate
15:33:36 <StaceyF> It's currently just a skeleton job but we have OpenStack-Helm deployed in our CI lab and will be utilizing the Jenkins openstack-cloud plugin to dynamically provision VMs to test OSH on patch sets / merges in Gerrit
15:34:03 <StaceyF> I'd like to get the go-ahead to make it a non-voting 3rd party gate instead of skipping, once we've stabilized the Jenkins jobs
15:35:21 <StaceyF> We will be putting the logs on an Apache server so they are accessible for reviewing failures/successes.
15:35:44 <srwilkers_> StaceyF, hmm. i'm okay with it becoming a non-voting 3rd party gate, given the logs are publicly available
15:37:17 <StaceyF> srwilkers thanks
15:37:19 <srwilkers_> portdirect, thoughts?
15:37:20 <portdirect> nice - will we be able to run rally properly inside the ATT gate?
15:37:33 <StaceyF> yes
15:37:56 <portdirect> srwilkers_: think this is pretty awesome :)
15:38:55 <srwilkers_> alright cool.
15:39:11 <StaceyF> then I'll move forward and will provide status as it moves along
15:39:24 <srwilkers_> StaceyF, great. keep us posted :)
15:40:13 <srwilkers_> #topic Monasca-Helm (https://github.com/monasca/monasca-helm)
15:40:24 <srwilkers_> i added this topic for jayahn as it was mentioned yesterday
15:40:42 <srwilkers_> and he brought up possibly touching base with the monasca folks to see what their intentions are for it
15:41:11 <srwilkers_> but i'm not sure if jayahn is present currently, so might want to follow up with him when i see him in the openstack-helm channel again.  i know the time difference is pretty drastic
15:41:52 <portdirect> yeah - they have some interesting things going on over there - looks like a full stack in a single chart?
15:42:02 <srwilkers_> portdirect, that's what it seems like at a glance
15:42:25 <srwilkers_> i haven't had a chance to play with it yet, but they're using monasca to grab cluster-level metrics i think and scraping prometheus endpoints
15:42:47 <portdirect> be good to chat to them for sure - you know what irc chan they're in?
15:42:59 <srwilkers_> i can dig it up and ask about it today
15:43:38 <srwilkers_> either way, i think i'm going to fiddle with it and profile it versus prometheus to see what the differences are just for the sake of comparison
15:44:37 <srwilkers_> #action srwilkers_ follow up with jayahn about monasca-helm
15:45:15 <srwilkers_> #topic open discussion
15:45:25 <srwilkers_> alright, that's all the topics we had in the etherpad for today
15:45:39 <srwilkers_> i'd like to open the floor for any other concerns/topics that weren't on the agenda
15:47:27 <srwilkers_> going once
15:48:07 <srwilkers_> going twice?
15:49:25 <srwilkers_> alright, going to give you all 10 minutes back
15:49:47 <srwilkers_> #endmeeting