15:01:50 #startmeeting openstack-helm
15:01:51 Meeting started Tue Oct 2 15:01:50 2018 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:52 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:54 The meeting name has been set to 'openstack_helm'
15:02:03 lets give it a few mins for people to roll in
15:02:09 #topic rollcall
15:02:14 o/
15:02:17 o/
15:02:27 \o
15:03:07 o/
15:03:59 agenda for today: https://etherpad.openstack.org/p/openstack-helm-meeting-2018-10-02, will give until 5 past and then kick off
15:05:26 oh hai gagehugo
15:05:37 o/
15:05:40 ok - lets get going
15:05:49 a wild gagehugo appears
15:06:00 #topic Libvirt restarts
15:06:30 so once again, we seem to have lost the ability to restart libvirt pods without stopping vms
15:07:23 as far as i can make out, the pid reaper of k8s is now (since 1.9) clever enough to kill child processes of pods, even when running in host pid mode
15:07:45 and uses cgroups to target the pids to reap
15:08:38 ahh
15:08:44 i think the solution to this is to get ourselves out of the k8s managed cgroups entirely
15:09:00 and so have proposed the following: https://review.openstack.org/#/c/607072/2/libvirt/templates/bin/_libvirt.sh.tpl
15:09:33 I thought we used to run libvirt in hostIPC--I'm not sure if I am misremembering or we stopped--this seems to re-enable that
15:09:57 we never did - though i remember the same
15:10:11 shouldn't hostIPC effectively *be* the flag we need to tell k8s to stop mucking with this? Are we seeing behavior we don't expect?
15:10:12 i re-enabled it as part of this - as it makes sense to have this
15:10:49 alanmeadows: its not - see the cgroups/pid reaper comment above
15:10:52 To be sure, there *should* be a k8s flag to effectively disable the cgroup, reaping, and other "helpers", and I thought hostPID and hostIPC were it
15:11:30 they no longer are; looking at the kubelet source, theres no way to disable this
15:11:51 this feels like a k8s gap
15:11:57 and for everyone but us - i think what it does is an improvement
15:12:04 no disagreement there
15:12:11 sure, just feel like there needs to be a "don't get smart" button
15:12:15 rkt stage 1 fly would offer this
15:12:21 libvirt is just one of several use cases
15:12:47 so - i think this suggests two things to me
15:13:03 1) we need a fix to this NOW - is the above the right way to do this?
15:13:34 2) lets use the fix we end up with, and get a bug opened with k8s to support "dumb" containers - just like the good ol' days
15:15:01 the thought behind what im doing above is that we essentially run libvirt as a transient unit on the host
15:15:11 the approach above seems acceptable to me, unless im missing something
15:15:39 so for pretty much the whole world - we get normal operation
15:16:05 the one thing being that we dont specify a name for the transient unit - so systemd assigns one
15:16:15 this allows the pod to be restarted
15:16:37 or even the chart to be removed, and qemu processes will be left running
15:17:27 and then when the pod/chart comes back - libvirt will start up in a new scope, but manage the qemus left in the old one just fine
15:17:36 seem sane?
15:18:21 we validated it can not only see them but can touch them?
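
A minimal sketch of the transient-unit idea described above, assuming the pod runs with hostPID and has the host's /run/systemd and /run/dbus mounted so systemd-run can reach the host's service manager; the change actually under review (607072) differs in detail, for example in how it selects and overrides cgroups around line 35 of _libvirt.sh.tpl:

    #!/bin/bash
    set -ex

    # Illustrative only - not the patch under review. Assumes:
    #  - the pod runs with hostPID: true
    #  - the host's /run/systemd and /run/dbus are mounted into the container
    #  - systemd-run is present in the image
    if [ -S /run/dbus/system_bus_socket ]; then
      # Register an unnamed transient scope with the host's systemd and run
      # libvirtd inside it. The scope lives outside the kubelet-managed cgroup
      # for this pod, so deleting the pod no longer reaps the qemu children;
      # a restarted pod starts a fresh scope and re-attaches to the running
      # domains.
      exec systemd-run --scope --slice=system -- libvirtd --listen
    else
      # Fall back to running inside the pod's own cgroup if the host's
      # service manager is not reachable.
      exec libvirtd --listen
    fi
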
15:18:21 i think so
15:18:49 alanmeadows: yes
15:19:08 though i do still need to check that this still works when using the cgroupfs driver for docker/k8s
15:19:32 and that also leads nicely into the next point
15:19:46 what are the interactions of this and the recommendation to disable the hugetlb cgroup in the boot parameters
15:19:56 are both still required?
15:20:04 no - this removes that requirement
15:21:09 we super need to gate this - once we have fixed this issue, I really want to get a lightweight gate in that just confirms that the libvirt chart can be deployed, start a vm, and then be removed and deployed again, with zero impact on the running vm
15:21:20 last question
15:21:22 the end of this would probably be initiating a reboot
15:21:36 I dont think openstack would be required for this gate
15:21:37 portdirect: yeah, was going to see if we could include that in the gate rework you're going to chat about later
15:21:41 if cgroup_disable=hugetlb is still leveraged, this doesn't care and operates fine?
15:21:48 yes
15:22:30 its why on l35 i get the cgroups to use/override manually and dynamically: https://review.openstack.org/#/c/607072/2/libvirt/templates/bin/_libvirt.sh.tpl
15:24:05 we ok here? ok to leave any further convo to review?
15:24:29 yeah, works for me
15:25:27 ok
15:25:29 #topic Calico V3
15:25:39 so i dont think anticw is here
15:25:58 but theres been a load of work done on updating our now long-in-the-tooth calico chart
15:26:03 adding v3 support
15:27:03 https://review.openstack.org/#/c/607065/
15:27:09 please review away
15:27:54 I'm super excited about this - as it offers a ray of hope for the future, that we can get out of the quagmire of iptables rules from the kube-proxy and move to ipvs
15:27:59 but baby steps...
15:29:22 hey anticw, we were just talking about you
15:29:24 cool. will review this properly later today
15:29:40 anything you'd like to point out re the calico v3 work?
15:30:17 it works
15:31:02 there are some cosmetic changes done to try to stay aligned with upstream
15:31:20 not all of those are required, but having them means a later upgrade should be easier
15:32:58 sounds great anticw
15:33:04 thx for your work on this
15:34:18 np, the other cleanups people brought up i've put on a list and we can decide which of those are needed
15:34:34 as you pointed out, some of them run counter to a uniform interface to other CNIs
15:35:29 sure - from what i have seen the core is good and solid, and the only real discussion may be around some of the config entrypoints
15:35:38 but i think we can hash that out in review
15:37:04 works for me
15:38:09 ok
15:38:12 #topic MariaDB
15:38:54 so I've got a wip up here: https://review.openstack.org/#/c/604556/ that i hope radically improves our galera clustering ability
15:39:08 ive been testing it reasonably hard
15:39:54 the biggest gaps atm that i'm aware of are the need to handle locks on configmaps better, so we get acid-like use out of them
15:40:04 and also to get xtrabackup working again
15:40:34 thankfully both of these are relatively simple, though the configmap mutex may require a bit of time
15:41:05 would be great to get people to run this through its paces, and report back any shortcomings
15:41:31 even if it does turn some up, i think this is a step in a better direction.
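
One common way to approximate the configmap mutex mentioned above is to lean on the atomicity of object creation in the Kubernetes API; a rough sketch of that pattern follows, using a hypothetical lock name and namespace - the WIP in review 604556 may well solve it differently (e.g. with resourceVersion-based optimistic locking):

    #!/bin/bash
    # Hypothetical names for illustration; not taken from the WIP.
    LOCK_CM="mariadb-galera-lock"
    NAMESPACE="openstack"

    acquire_lock() {
      # Object creation is atomic server-side: if another pod already created
      # the lock configmap, this fails with AlreadyExists, so only one writer
      # proceeds at a time.
      until kubectl -n "${NAMESPACE}" create configmap "${LOCK_CM}" \
          --from-literal=holder="${HOSTNAME}" >/dev/null 2>&1; do
        sleep 2
      done
    }

    release_lock() {
      kubectl -n "${NAMESPACE}" delete configmap "${LOCK_CM}" --ignore-not-found
    }

    acquire_lock
    # ... read-modify-write the shared cluster-state configmap here ...
    release_lock

A lock held this way would also need some form of expiry or liveness check so a crashed holder does not wedge the cluster, which is likely where the extra time mentioned above would go.
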
i've been playing with some of the changes for a bit now, and im pretty happy with it thus far
15:42:41 ok - so the last thing from me this week:
15:42:42 #topic Gate intervention
15:43:13 evardjp is planning on doing an extensive overhaul of the gates, and bringing some much-needed sanity to them
15:43:26 though hes away this week - boo!
15:43:50 that said, theres an urgent need to get our gates in a slightly better state than they are today
15:44:24 so after this meeting im planning on refactoring some of them to get us to a point where things can merge without one million retries
15:44:47 that'd be great
15:45:17 the main method to do this will be cutting out duplicate tests - and also potentially adding an extra gate, so we can split the load
15:45:38 not sure if it matters now, but do we want to consider moving some of the checks to experimental checks (where it makes sense), until we can get the larger overhaul started/completed?
15:45:42 as most failures seem to be the nodepool vms just being pushed harder than they can take
15:46:17 srwilkers: if by the end of day i've not made significant progress - i think that may be the short-term bandage we need
15:46:29 portdirect: yeah. i was playing around with some of the osh-infra gates just to see how things performed when the logging and monitoring charts were split into separate jobs
15:47:59 while on the subject of gates:
15:48:04 #topic Armada gate
15:48:12 srwilkers: you're up
15:48:27 i've got a few changes pending for the armada gate in openstack-helm
15:49:11 the first adds the Elasticsearch admin password to the nagios chart definition, as the current nagios chart supports querying elasticsearch for logged events
15:49:56 the second adds radosgw to the lma manifest, along with the required overrides to take advantage of the s3 support for elasticsearch
15:50:47 the third is more reactive, as it seems the rabbitmq helm tests fail sporadically in the armada gate. that change proposes disabling them for the time being
15:51:24 and the fourth is the most important in my mind. it's the introduction of an ocata armada gate. and the question becomes: do we sunset the newton armada gate?
15:51:29 for rabbitmq - we prob dont need to run as many as we do in the upstream gates
15:51:59 portdirect: probably not. i can update that patchset to instead reduce us down to one rabbit deployment
15:52:07 ++
15:52:47 we got consensus at the ptg to sunset newton totally
15:53:07 and move the default to ocata
15:53:43 thats why im leaning towards sunsetting the newton armada gate with the ocata armada patchset, along with avoiding adding another 5-node check to our runs
15:54:19 sounds good - though I think the 1st step would be to make ocata images the defaults in charts
15:54:25 are we sunsetting newton for just the armada job or all the jobs?
15:54:31 I volunteer to do that
15:54:36 lamt: nice :)
15:54:41 lamt: if you could, that would be awesome
15:55:08 will start - those newton images start to pain me anyway
15:55:11 please add a loci newton gate though
15:55:20 will do
15:55:24 unrelated, portdirect: we can take my last point wrt the values spec offline, so we have time for open discussion
15:55:37 ok - sounds good
15:55:37 we can handle that in the #openstack-helm channel
15:55:51 #topic open discussion / review needed
15:57:24 crickets :)
15:57:44 ok - lets wrap up then
15:57:54 #endmeeting
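
For the ocata-image-defaults work lamt volunteered for above, a chart can be exercised against ocata images ahead of the defaults changing by overriding its image tags at deploy time; the sketch below follows the usual images.tags layout in openstack-helm charts, but the key names, image references, and tags are illustrative and should be checked against each chart's values.yaml and the published images:

    # Illustrative overrides only - consult the chart's values.yaml for the
    # real keys and the registry for available tags.
    helm upgrade --install keystone ./keystone \
      --namespace=openstack \
      --set images.tags.keystone_api=docker.io/openstackhelm/keystone:ocata \
      --set images.tags.keystone_db_sync=docker.io/openstackhelm/keystone:ocata
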