15:00:54 #startmeeting openstack-helm
15:00:55 Meeting started Tue Apr 16 15:00:54 2019 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 The meeting name has been set to 'openstack_helm'
15:01:07 let's give it a few mins for people to turn up
15:01:12 #topic rollcall
15:01:28 the agenda is here: https://etherpad.openstack.org/p/openstack-helm-meeting-2019-04-16
15:01:36 o/
15:01:42 please feel free to add to it, and we'll kick off at 5 past
15:01:53 o/
15:02:02 o/
15:03:34 o/
15:04:02 o/
15:04:37 o/
15:05:14 ok - let's go
15:05:23 #topic Zuul testing doesn't recover job logs
15:05:32 itxaka the floor is yours
15:05:42 yay
15:05:58 so while testing tempest patches I found I was unable to query the logs
15:06:15 as the ttl for the job/pod expired, those logs were lost forever and ever and ever (not really)
15:06:40 was wondering if there was anything already in place to recover that kind of logs, as they seem to be valuable
15:07:00 we've hit this sort of thing before, esp with pods that crash
15:07:08 but I'm reading srwilkers' response in there, so it seems like there is not and we would need to deploy something ourselves for that
15:07:08 yep
15:07:18 srwilkers: you had any thoughts here, as i know you've noodled on it before?
15:07:32 i think it requires two parts
15:08:00 one: we really need to be able to override backoffLimits for the jobs in our charts
15:08:04 This is Prabhu
15:08:19 some jobs have this, others don't
15:09:14 also, we need to revisit having our zuul postrun jobs include `kubectl logs -p foo -n bar` for cases where we've got pods that are in a crashloop state, or a state where they fail once or twice before being happy
15:09:30 as insight into both of those types of situations is valuable
15:09:49 +1
15:10:05 but without the ability to override the backoffLimits for jobs, that doesn't do us much good if kubernetes just deletes the pod and doesn't have a pod for us to query for logs
15:10:08 previous or otherwise
15:10:22 the above would be great as a 1st step
15:11:02 i know we've also discussed 'lma-lite' here - would there be value in looking at that as well?
15:11:54 isn't there a "time to keep" for jobs that we could leverage as well, to leave those failed jobs around?
15:11:58 i think that's sane down the road, but it would also require us to deploy the logging stack with every job
15:13:19 ttlSecondsAfterFinished is what I was thinking about, but that is only for jobs, not sure how it would affect other resources
15:14:04 would it perhaps make sense to also think about modifying the `wait-for-pods.sh` to detect a crash
15:14:12 and then instantly fail if it does?
15:14:45 as really we should have 0 crashes on deployment in the gates, and this would also allow us to capture the logs before they go away
15:14:49 is it crashing right after deployment?
15:15:00 wouldn't that cause false positives when something goes temporarily wrong (i.e. quay.io down for a few seconds)?
15:15:56 that would not cause a crashloop itxaka
15:16:42 i'd say we could add that and see if anything falls out initially
15:16:45 it seems like a sane idea
15:16:50 really? Because I've seen it locally, a crashloopbackoff due to not being able to pull the image
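[Editor's sketch] A minimal illustration of the crash-detection and previous-log capture idea discussed above. This is not the project's actual tools/deployment/common/wait-for-pods.sh; the namespace argument and the jsonpath output handling are assumptions.

```bash
#!/bin/bash
# Hypothetical sketch only: fail fast if any pod in the namespace is crash-looping,
# and capture the previous container's logs before kubernetes cleans the pod up.
set -u
NAMESPACE="${1:-openstack}"

# List pods whose containers are waiting with reason CrashLoopBackOff
crashing=$(kubectl get pods -n "${NAMESPACE}" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | awk '/CrashLoopBackOff/ {print $1}')

if [ -n "${crashing}" ]; then
  for pod in ${crashing}; do
    echo "Pod ${pod} is in CrashLoopBackOff; previous container logs follow:"
    # --previous can fail if no prior container exists, so tolerate errors
    kubectl logs --previous --all-containers "${pod}" -n "${NAMESPACE}" || true
  done
  exit 1
fi
```

Running something like this alongside the existing wait in the gates would surface the logs while the failing pod still exists, rather than after the job ttl has expired.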
15:17:04 that doesn't sound right itxaka
15:17:16 you should see 'errimagepull'
15:18:07 errimagepull or imagepullbackoff are the only two i've seen related to pulling images -- if the pod's in a crashloop, it's indicative of an issue after the entrypoint's been executed
15:18:18 as far as im aware, anyway
15:18:32 umm, I might be wrong there
15:18:43 unfortunately my env is in a pristine status so I cannot check now :)
15:18:58 let's get a ps up for both - and we can see what the fallout is
15:19:16 srwilkers: did you have one for `logs -p` already? I can't remember?
15:19:25 yeah -- it's pretty dusty at this point
15:19:47 https://review.openstack.org/#/c/603229/4
15:20:39 the only change required there is just checking to see if there is in fact a previous pod?
15:21:09 or to jerry-rig it, just allow failure of --previous?
15:21:09 i think if there's not, it just returns nothing
15:21:18 nice!
15:21:28 this looks good actually
15:21:39 i lied
15:21:44 it'll require an update
15:22:37 with that - let's see how things go
15:22:40 ok to move on?
15:22:51 although with us setting ignore_errors: true there, might not be an issue
15:22:53 yep
15:23:11 ok
15:23:15 #topic Split 140-compute-kit.sh into nova+neutron
15:23:26 me again
15:23:45 we were just wondering about the reasons for having both nova+neutron together in the same script
15:23:56 and were wondering if there was a willingness to change it?
15:24:13 totally - the reason for it being together is just historical
15:24:18 We actually found out that with this compute kit, wait-for-pods can time out quite often
15:24:24 as the two have circular deps
15:24:35 well, even if you split them out you may run into timeout issues
15:24:37 so the idea was not just to split the deployment script, but also to have 2 calls of wait-for-pods
15:24:37 jsuchome: how many nodes are you attempting to deploy with these scripts?
15:24:45 jsuchome: that wouldn't work
15:24:49 as they have dependencies on each other
15:25:01 ah, that's what we feared
15:25:03 which is why we initially tied them to the same deployment script
15:25:17 would it be possible to make them less dependent on each other?
15:25:19 nova compute depends on neutron metadata...
15:25:30 itxaka: not that im aware of
15:25:32 not that many (3 actually), but we're running kubernetes on openstack already
15:25:37 oh well
15:25:49 well, then just neutron comes first, right?
15:26:12 no - as neutron metadata proxy depends on.... nova-metadata-api...
15:26:21 and the circle of joy continues :D
15:26:41 it's the circcccclllee, the circle of life
15:26:44 would separating nova compute out of the Nova chart help?
15:27:02 at the cost of huge complexity
15:27:26 why not just deploy, and then wait after?
15:27:51 yeah, seems like we just need to fix the probes to be faster at detecting readiness and increase our timeouts :)
15:28:12 increase our timeouts, the universal solution for everything...
15:28:23 it works... :-)
15:28:33 the simplest fix to a problem is just to wait a bit more for it to solve itself
15:28:49 and RUN as far away as possible :P
15:28:55 why is it timing out for you? image pulls or something else?
15:29:24 itxaka: you've got the right idea ;)
15:29:39 we did not do detailed profiling, it just happened several times
15:29:54 I guess a bit of everything? image pulls, insufficient IO, maybe not enough workers, probes take a bit of time, etc...
15:30:03 it's usually pretty close to the timeout so...
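[Editor's sketch] A rough illustration of the "deploy both, then wait once" approach suggested above, with an overridable timeout instead of a hard-coded one. The chart paths, override file names, environment variable name, and the wait-for-pods.sh arguments are assumptions, not the contents of the real 140-compute-kit.sh.

```bash
# Hypothetical sketch: install nova and neutron back to back (they depend on each
# other), then run a single wait whose timeout can be bumped per environment.
: "${OSH_DEPLOY_TIMEOUT:=900}"

helm upgrade --install nova ./nova \
  --namespace=openstack \
  --values=./overrides/nova.yaml

helm upgrade --install neutron ./neutron \
  --namespace=openstack \
  --values=./overrides/neutron.yaml

# One wait covers both charts, since their circular dependencies mean
# they only become ready together.
./tools/deployment/common/wait-for-pods.sh openstack "${OSH_DEPLOY_TIMEOUT}"
```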
15:30:10 yeah, database backend also
15:30:30 ok, seems good to me, not much from my side on this topic
15:30:30 nova-db-sync is certainly a monster on a clean deployment
15:30:50 ok to move on, and revisit this later if we need to?
15:30:55 +1
15:30:57 +1
15:31:07 yep
15:31:10 #topic: Differences in deployment scripts (component vs multinode)
15:31:21 omg, itxaka again? shut up already
15:31:26 itxaka: XD
15:31:34 tbh I don't remember why I wrote this down
15:31:38 jsuchome, do you?
15:31:39 lol - you're bringing a good set of topics dude
15:31:51 i mean, i think we're overdue for cleaning up our tools/deployment/foo
15:31:51 consolidate?
15:31:52 probably came out of that compute-kit discussion
15:32:01 because to be frank, it's getting messy. and part of that's my fault
15:32:26 we probably could do with a re-think, we certainly need to split out the over-rides
15:32:41 yep
15:32:48 (overrides coming up next ... :-) )
15:32:56 but do the scripts that drive them need to be split, or could they just call the appropriate over-rides dependent on an env var?
15:33:11 i think the latter is the more sane one there
15:33:52 so we would have things like ./tools/deployment/over-rides/[single|multi]/.yaml
15:34:04 that would be cool
15:34:08 ++
15:34:10 called by ./tools/deployment/scripts/.sh ?
15:34:34 will also make it much clearer what we are using where
15:34:49 not to get too far off track here, but i'd like to see something similar for the armada deployment foo too
15:35:37 srwilkers: perhaps pegleg would help there? though let's keep this to the scripted deployment for now
15:36:43 portdirect: sounds good
15:38:16 Can someone write the armada/pegleg stuff in the minutes? I have no idea what they do so I cannot really explain it properly in there
15:38:27 oh, I see portdirect is already on it, thanks!
15:38:48 maybe @jamesgu__ can help on that, gotta have some knowledge on those tools :)
15:39:34 ok - does someone want to give it a (and excuse the pun) bash at refactoring here?
15:40:21 if nobody else wants to, i can give it a go
15:40:33 but might be a good opportunity for someone else to get more familiar with our jobs
15:40:46 mhhh I would like to, but not sure if I'm the best at bash thingies, if only evrardjp was here...
15:40:52 as that's where i've been living more often than not lately
15:41:06 I can have a look at it, maybe bother srwilkers with patches and guidance
15:41:16 srwilkers - let's work together on getting a poc up - and see if anyone wants to take it from there?
15:41:30 sounds good to me
15:41:39 either of those sounds good :)
15:41:44 I could take a look at it as well, bash is nice ;-)
15:41:55 lol - everyone wants in
15:41:57 poc + me/jsuchome taking it from there sounds good to me
15:42:09 unless jsuchome wants to make the poc himself :P
15:42:11 nice :)
15:42:30 ah, no thanks, I'll leave it to the elders :-)
15:42:30 let's go for the poc and itxaka/jsuchome if that works?
15:42:44 +1
15:42:47 jsuchome: just 'cause our minds are rotted, don't mean we are old ;P
15:42:59 +1
15:43:08 ok - let's move on
15:43:18 #topic Tungsten Fabric Review
15:43:25 that's not me, yay :P
15:43:32 prabhusesha, the floor is yours :)
15:43:39 Hi
15:43:56 I want somebody to pick up the review and provide us a set of comments
15:44:22 I meant, a first set of comments
15:44:37 could you link to it?
15:44:42 I'm continuing the work that madhukar was doing
15:44:55 https://review.openstack.org/#/c/622573/2
15:45:10 prabhusesha_: it's fantastic to see this restarted
15:45:52 I want to get this thing merged soon. I need all of your support
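[Editor's sketch] Circling back to the over-rides split discussed earlier: a hypothetical illustration of per-chart override files under ./tools/deployment/over-rides/[single|multi]/, selected by the driving script through an environment variable. The variable name (OSH_DEPLOY_PROFILE) and the keystone example are assumptions, not an agreed design.

```bash
# Hypothetical sketch: pick the override profile via an env var rather than
# maintaining separate single-node and multinode scripts.
: "${OSH_DEPLOY_PROFILE:=single}"   # or "multi" for multinode jobs

OVERRIDES_DIR="./tools/deployment/over-rides/${OSH_DEPLOY_PROFILE}"

helm upgrade --install keystone ./keystone \
  --namespace=openstack \
  --values="${OVERRIDES_DIR}/keystone.yaml"
```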
15:46:01 I'm also kind of new to helm
15:46:14 but I'm picking up pretty fast
15:46:29 are any specific images required for this?
15:47:00 you need neutron, heat & nova plugin images
15:47:08 from TF
15:47:27 I'm drafting a blueprint
15:47:35 am i just missing them? as i don't see them here: https://review.openstack.org/#/c/622573/2/tools/overrides/backends/networking/tungstenfabric/nova-ocata.yaml
15:47:42 also how do you deploy TF itself?
15:48:25 there are TF specific charts
15:48:46 are they public?
15:49:13 currently they're private, work is happening on that front also
15:49:34 ok - i think that will largely block the efforts here
15:49:44 we can provide some high level feedback
15:49:54 but without public and open images
15:49:57 would we need to build extra images with modifications to support this?
15:49:57 as well as charts
15:49:59 that will be good
15:50:06 maybe that falls under the loci meeting instead :)
15:50:08 we can't do any more than that
15:50:21 let me get my stuff in the right place
15:50:28 and i'd be uncomfortable merging until that is resolved
15:50:35 sounds good prabhusesha_
15:50:38 I agree
15:50:55 and as itxaka points out, it would be great to work with loci for image building
15:51:04 high level comments will be helpful
15:51:33 itxaka: I can get back to you on that
15:51:39 +1
15:51:55 ok - let's move on
15:52:01 #topic: Office-Hours
15:52:32 so last week we decided to kick-start the office hours effort again
15:52:56 initially these will be from 20:00-21:00 UTC on Wednesdays in the openstack-helm channel
15:53:39 the above time is simply because it's when we can make sure that there is core-reviewer attendance
15:53:50 but i know it's rubbish for folks in the EU
15:54:14 i hope we can change that as soon as we can get some coverage for that timezone :)
15:54:54 that's all i got here really. ok to move on?
15:55:02 +1
15:55:16 #topic: Add internal tenant id in conf
15:55:25 hi
15:55:44 portdirect: thanks for giving comments on my review
15:55:49 LiangFang: your ps looks great - I've got one comment i need to add in gerrit, but at that point lgtm
15:56:23 https://review.openstack.org/#/c/647493/
15:57:14 thanks, one thing is that my environment is broken, so I have not verified in my environment
15:57:29 I don't know if CI is strong enough
15:57:45 looks ok in ci: http://logs.openstack.org/93/647493/7/check/openstack-helm-cinder/49e27c2/primary/pod-logs/openstack/cinder-create-internal-tenant-wkvrv/create-internal-tenant.txt.gz
15:58:02 but the patchset before 7 is verified in my environment
15:58:15 thought we should also add '--or-show' in the user/project management job, as otherwise it's just a single shot, and will fail on re-runs
15:58:50 ok
15:59:11 ok - we're about to run out of time, ok to wrap up?
15:59:22 #topic reviews
15:59:41 as per always, there are some reviews that would really appreciate some attention
15:59:46 Reviews:
15:59:46 https://review.openstack.org/#/c/651491/ Add OpenSUSE Leap15 testing - adds directories with value overrides + one test job
15:59:46 Should the job be voting from the start?
15:59:46 We plan to add more jobs as followups, but they could already be included as part of this one, if seen more appropriate
15:59:46 https://review.openstack.org/#/c/642067/ Allow more generic overrides for nova placement-api - same approach as with the patches already merged
15:59:47 https://review.openstack.org/#/c/644907/ Add an option to the health probe to test all pids - fixes broken nova health probes for >= rocky
15:59:47 https://review.openstack.org/#/c/650933/ Add tempest suse image and zuul job
15:59:48 https://review.openstack.org/#/c/647493/ Add internal tenant id in conf (irc: LiangFang)
15:59:48 https://review.openstack.org/#/c/622573/2 Tungsten Fabric plugin changes
16:00:04 20:00-21:00 UTC seems to be the middle of the night in Asia :)
16:00:21 LiangFang: we need to improve there as well for sure
16:00:39 ok, thanks
16:00:44 thanks everyone!
16:00:49 #endmeeting
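[Editor's sketch] As a footnote to the internal-tenant discussion above, a hypothetical example of the '--or-show' suggestion, which makes the create calls idempotent so the job can be re-run without failing. The project and user names here are illustrative only.

```bash
# --or-show returns the existing resource instead of erroring on re-runs
openstack project create --or-show --description "Cinder internal tenant" cinder-internal
openstack user create --or-show --project cinder-internal cinder-internal-user
```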