22:03:04 #startmeeting zuul
22:03:05 Meeting started Mon Feb 6 22:03:04 2017 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:03:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:03:08 The meeting name has been set to 'zuul'
22:03:13 #link agenda https://wiki.openstack.org/wiki/Meetings/Zuul
22:03:22 #link previous meeting http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-01-30-22.00.html
22:03:36 o/
22:04:06 i'd like to reserve at least the last 20 minutes to talk about the ptg
22:04:11 so with that
22:04:20 #topic Status updates: Nodepool Zookeeper work
22:04:35 o/
22:05:20 o/
22:06:04 Shrews is continuing work on having nodepool actually return nodes
22:06:53 428428 makes our integration job pass
22:06:57 that would be useful
22:07:09 Shrews: nice work
22:07:16 Shrews: now might be a good time for someone to jump in and update the nodepool cli commands to use zookeeper? if you think so, and no one else does that soon, i may...
22:07:46 :)
22:08:24 jeblair: i think someone could begin poking at that. i won't get to it anytime soon
22:08:42 If we are ready to start using more zookeeper in nodepool, I don't mind poking into that again. It went quite well last time
22:08:59 jeblair: is there a story for that yet?
22:09:01 * SpamapS can make one
22:09:16 SpamapS: don't think so, and thanks
22:09:31 * SpamapS makes it
22:10:36 #topic Status updates: Devstack-gate roles refactoring
22:11:04 rcarrillocruz, clarkb: were you talking about that earlier?
22:11:04 FYI: https://storyboard.openstack.org/#!/story/2000856
22:11:22 jeblair: yes, I think the current patchset on the first change is good to go
22:11:41 yeah, passing tempest tests now in zuul
22:11:44 just needs a +A
22:11:46 #link nodepool zk cli work can begin https://storyboard.openstack.org/#!/story/2000856
22:11:48 the second still has a -1 from previous reviews that will need addressing (I think rcarrillocruz may be trying to reduce the number of iterations and focus on one at a time)
22:11:56 y
22:12:33 rcarrillocruz, clarkb: so 403732 is ready?
22:12:58 yes I think so
22:13:01 imho yeah
22:13:02 #link devstack-gate roles change ready for approval https://review.openstack.org/403732
22:13:48 #link next devstack-gate roles change https://review.openstack.org/404243
22:14:56 #topic Status updates: Zuul test enablement
22:15:17 a bunch of those just showed up recently! :)
22:15:25 i've started to pick up some low hangers again between doing other things..
22:15:52 i noticed test_dependent_behind_dequeue, which was recently reenabled, doesn't seem to be too stable. i had to recheck against it a few times and noticed others had to as well
22:15:58 also -- reminder that we merged a change that requires a playbook for every test job now. there's a make_playbooks.py script in zuul/tests to help automate that.
22:16:03 I've fixed my conflicts today, and started on the conflicting-project tests today
22:16:39 adam_g: yeah, i recently made it more stable by extending the timeouts (it's a very busy test), but there have now been a few failures of it since then, so there's still something going on
22:17:07 ah
22:17:20 we also just merged a change which attaches full debug logs on test failures, so as long as it doesn't manifest as a timeout (which this one, unfortunately, often does) we can actually fix them.
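[Editorial note: the make_playbooks.py helper mentioned at 22:15:58 is not shown in the log. The following is only a hedged sketch of what such a generator might do -- scan a list of test job names and write a trivial placeholder playbook for any job that lacks one. The function name, playbook layout, and placeholder content here are hypothetical illustrations, not Zuul's actual script.]

```python
# Hypothetical sketch of a make_playbooks.py-style helper: for every
# job name supplied, ensure a trivial playbook file exists so the
# "every test job needs a playbook" requirement is satisfied.
# The directory layout and placeholder content are assumptions.

import os

PLACEHOLDER = "- hosts: all\n  tasks: []\n"


def make_playbooks(playbook_dir, job_names):
    """Create an empty placeholder playbook for each job missing one.

    Returns the list of job names for which a playbook was created;
    existing playbooks are left untouched.
    """
    os.makedirs(playbook_dir, exist_ok=True)
    created = []
    for job in job_names:
        path = os.path.join(playbook_dir, job + '.yaml')
        if not os.path.exists(path):
            with open(path, 'w') as f:
                f.write(PLACEHOLDER)
            created.append(job)
    return created
```

Run against a fixture's playbook directory, a helper like this would report which playbooks it created and skip jobs that already have one.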
22:17:31 https://review.openstack.org/#/c/393887/ is particularly easy :)
22:18:16 (the fact that it times out now is likely not because it's slow, but rather an error that just manifests as never reaching the stable condition)
22:18:42 jeblair: does our test zookeeper make use of tmpfs? That might help.
22:19:00 oh that
22:19:04 SpamapS: good point; i don't think so.
22:19:35 SpamapS: oh, but you know what, zk is usually pretty fast on tests in our cloud providers....
22:19:40 yeah
22:19:51 With so much RAM
22:20:06 I'd expect it to mostly just buffer. Though ZK can be sync-happy
22:20:18 because journals
22:20:31 yeah... maybe our clouds either have battery backed caches or just turn on data-eating.
22:20:41 probably the former for most.
22:20:50 hah
22:20:53 and the latter for infra-cloud, iirc...
22:20:53 yeah
22:20:59 either way, we could look at io wait if we were concerned
22:21:10 set value eat_data
22:21:12 true
22:21:21 eatmydata is a thing you know :)
22:21:30 world's best LD_PRELOAD library
22:21:43 we may want to collect the zk logs from tests...
22:21:48 I like to load it with libhostile and let them fight it out
22:21:49 jeblair: ++
22:22:09 * fungi smells a new theme show in the making
22:22:09 #topic Status updates: Zuul Ansible running
22:22:44 my patch series to enable pre and post playbooks is making its way in (i have some random test failures to debug -- see earlier topic :)
22:23:08 mordred has a change built on that to start securing the insecure playbooks
22:23:21 #link playbook security https://review.openstack.org/428798
22:23:23 yes. and then we found a whole new set of ways in which playbooks can be insecure
22:23:29 * mordred glares at roles
22:23:38 funtimes
22:23:41 yah
22:23:47 but also sketched out some solutions for that, yeah?
22:23:50 yah
22:23:56 ohmy
22:24:10 mordred: just reading the commit message on that, it seems like you'll have to audit and patch after every ansible release?
22:24:18 Is this where we ask how this happened and somebody goes tower
22:24:31 jeblair: it may be worth mentioning that due to security lockdown, we may also want to develop a stdlib role that knows how to run ansible on the remote host as if it was the local host job content
22:24:56 SpamapS: actually - not really, it's more that the idea of running untrusted ansible code isn't a use case they really focus on
22:25:08 Oh, joy, this also means Zuul gets to be partially GPLv3
22:25:25 yup. this, of course, causes me to have a warm and fuzzy feeling
22:25:26 mordred: I'd worry about missing things and further complicating the "ansible has pushed a security update, hurry and fix/upgrade" situation
22:25:53 is it possible we could have the idea of secure / insecure zuul-launchers? I know that doesn't scale well
22:25:54 mordred: would a simpler thing be to just run it in a throw-away container?
22:25:56 mordred: you mean like push the inventory over and run something? that sounds helpful.
22:26:14 so - yes, I agree with clarkb, although from ansible core we really only need to worry about new action plugins (not very likely) or entirely new types of plugins (also not very common)
22:26:19 we don't have to look at every patch
22:26:40 SpamapS: ALSO looking at using some container tech here - but no, I do not think container == security yet
22:26:40 (I guess now, since our zuul-launchers are in the control plane)
22:26:41 pabelanger: that's not really the problem here as much as the fact that jobs need to run some secure things and some insecure things.
22:26:44 not*
22:26:50 jeblair: ya
22:27:05 SpamapS: I think container + careful code can together be better than either one in isolation
22:27:17 actual defense in depth :)
22:27:43 so specifically looking at giftwrap, which allows for construction and execution of unprivileged containers - so that we don't have to escalate zuul-launcher to root before adding in the containment :)
22:27:58 mordred: bubble wrap?
22:28:11 gah. bubblewrap. yes.
22:28:14 https://github.com/projectatomic/bubblewrap
22:28:42 it needs a fairly new kernel though - so the support for it will need to be opt-in for operators I think
22:28:49 * mordred needs to write up some thoughts on this for folks
22:28:53 oh, new things to look at
22:29:33 jeblair: and yes to "push the inventory over and run something?" ... jlk was asking about using zuul to test ansible that relies on plugins that we don't allow people to run with
22:29:45 mordred: that is a neat idea
22:29:48 jeblair: and that's _totally_ possible by writing a playbook that does a shell call to ansible
22:29:57 but ... you know ... we can likely make that experience a little better :)
22:29:57 mordred: mostly I don't want to replace one security issue with another via upgrade of ansible by ops that don't understand the caveats here
22:30:32 clarkb: yup. it's definitely an area where we need WAY more prose about what's going on for all of us, and then make sure that we're happy with how we're covering it
22:30:43 if the class of objects that are an issue is small maybe we can do terrible nasty python to intercept all dispatches to them and sanitize appropriately
22:31:03 rather than having hard coded sanitization for known issues today
22:31:21 yah - so - there are 2 prongs we need to deal with
22:31:35 clarkb: well, we don't use ansible as a library, so the solution has to be in ansible configuration...
22:31:37 one is ansible in-tree action-plugin based modules - these do exceptional things, like the copy module
22:31:47 and execute code on purpose on the calling host
22:31:56 but there is a fixed set of them and it's easy to vet those
22:32:18 the _other_ is that roles can ship with plugins (action plugins, filter plugins, etc) that will run python code on the calling machine
22:32:35 (i'm going to call time on this at 22:35, btw)
22:32:39 in that case, the approach we've discussed so far is to scrub roles when we fetch them for plugin directories (known set of names)
22:32:47 and if a role has a plugin dir with content, just fail hard
22:32:57 so doing those two things AND adding in containment
22:33:34 could also neuter ansible's plugin loading
22:33:36 should hopefully get us fairly decent coverage ... we could also potentially talk to our friends at ansible and request they warn us if they're going to release new local-execution action plugins
22:33:47 SpamapS: yah - bcoca talked a bit about that
22:33:50 mordred: right, my concern is ansible adds a new action module or changes one arbitrarily
22:33:58 Just have a plugin that literally overrides the plugin loader with a pass.
22:34:02 yeah, so we're going to be as general as we can be (eg the plugins in roles), but we don't have a good general way to stop the in-tree plugins atm.
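[Editorial note: the "scrub roles for plugin directories and fail hard" approach described above could be sketched roughly as follows. This is a hypothetical helper for illustration, not Zuul's actual implementation, and the "known set of names" is an assumption based on common Ansible role plugin locations.]

```python
# Hypothetical sketch of the role-scrubbing check discussed above:
# reject any fetched role that ships a plugin directory with content,
# since such plugins would run python code on the calling host.
# The directory-name list is an assumed approximation of the
# "known set of names", not an authoritative set.

import os

PLUGIN_DIRS = (
    'action_plugins', 'lookup_plugins', 'filter_plugins',
    'callback_plugins', 'connection_plugins', 'strategy_plugins',
    'vars_plugins',
)


class UnsafeRoleError(Exception):
    """Raised when a role ships plugin code we refuse to load."""


def scrub_role(role_path):
    """Fail hard if the role at role_path ships any plugin code."""
    for name in PLUGIN_DIRS:
        plugin_dir = os.path.join(role_path, name)
        if os.path.isdir(plugin_dir) and os.listdir(plugin_dir):
            raise UnsafeRoleError(
                '%s contains a %s directory with content'
                % (role_path, name))
```

A launcher would run a check like this after fetching each role and before handing it to Ansible: a role with no plugin directories (or only empty ones) passes, and any file inside a plugin directory fails the whole role.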
22:34:04 SpamapS: and also mused about the possibility of adding a neuter-plugins option to ansible itself
22:34:08 mordred: and then next zuul update and now you are vulnerable (and that would be a much larger target if/when people are using zuul with ansible)
22:34:16 Or a "splodey splode, no plugins allowed"
22:34:26 clarkb: yes, i agree with your concern
22:34:40 clarkb: yah - that's one where we're going to need to connect with ansible release management in addition to doing defensive coding on our part
22:34:47 mordred: having contributed to one of those action modules recently I don't think it's terribly hard to change the behavior of them in such ways (as people don't seem to grok how they work very well)
22:34:49 clarkb: definitely a concern
22:34:49 i definitely hadn't thought of that, but i can see it as a possibility
22:35:11 but we're hoping that's a small load due to the rarity of adding such new modules.
22:35:16 clarkb: I share your concern, and think that one also has to be wrapped up in a system-level protection of some kind.
22:35:20 (but yeah, this is why I think adding container wrapping to the mix will give us buffer too)
22:35:23 yup
22:35:25 SpamapS: ++
22:35:25 * SpamapS will look at bubblewrap
22:35:37 great segue
22:35:41 SpamapS: it needs yakkety on ubuntu, fwiw - needs a new kernel
22:35:56 SpamapS: or, needs that to be able to run without sudo stuff
22:35:56 #topic Progress summary
22:36:21 * jeblair hands link baton to SpamapS
22:40:09 ahoy
22:40:24 sorry I got alt-tabbed and tried to refresh and fell off the earth
22:40:33 #link https://storyboard.openstack.org/#!/board/41
22:40:33 let's come back to this if we have time at the end
22:40:37 flat earth will do that to you
22:40:42 Not much to say anyway
22:40:45 Progress continues.
22:40:51 that works out then :)
22:40:56 #topic PTG prep (jeblair)
22:40:59 * fungi is a fan of progress
22:41:08 so we've got a thing coming up soon
22:41:14 like, real soon
22:41:23 2 weeks?
22:41:29 right at, yes
22:41:31 yay travel
22:41:43 crikey
22:41:49 so much travel :P
22:41:57 i think at this point, we probably have a good idea what's feasible
22:42:11 ++
22:42:20 we should hopefully have nodepool at least able to hand out some nodes, even if it still doesn't do a lot of things
22:42:39 and we should have zuul able to run some jobs, even if it doesn't do a lot of things
22:43:05 so i think it's well within the realm of possibility that we can set up a v3 nodepool and zuul, and have them run some hello world jobs
22:43:12 I'd also like to have all tests re-enabled/refactored/done by the time I fly out Thursday night.
22:43:30 are our current puppet-zuul/puppet-nodepool modules up to the task of deploying what's in the feature branches yet?
22:43:34 to that end, there are probably some things we can do to prepare for that
22:43:42 (while we focus on the pragmatic thing first, I want to make a real push while I have your brains in view)
22:43:52 fungi: probably close, but probably not.
22:43:56 just curious if hello world is going to involve a lot of manual deployment
22:44:14 SpamapS: i would support that as a very worthy secondary goal :)
22:44:22 what is left to do for nodepool zookeeper production? I am assuming mordred shim? (CLI commands?)
22:44:27 or if we should try to work out the adjustments to puppet necessary to hello world it as part of the task
22:44:27 It's a stretch goal for sure.
22:44:37 pabelanger: i don't think we need the shim for this
22:44:44 ack
22:45:02 (the shim is for zuul v2 -> nodepool v3)
22:45:12 got it
22:45:25 pabelanger: actual node launches (not just record entries in ZK) need to be completed
22:45:44 basically set up the demo environment by hand and take notes, vs deploying the demo env using (patched) puppet modules so we can more directly translate that to the changes we'll need to make
22:45:55 Shrews: thanks for the info
22:45:55 (fortunately, there's a body of code that does launches, so we're not starting from zero)
22:46:16 right
22:46:23 fungi: we may well end up doing some manual deployment, but otoh, maybe in the intervening 2 weeks, we could do some puppet work and have at least some of that codified
22:47:10 let's start an etherpad: https://etherpad.openstack.org/p/pike-ptg-zuul
22:47:17 jeblair: thanks, just wondering if anyone has a feel for where we can strike that balance of effort vs expediency
22:47:48 we didn't need to change puppet-nodepool too much for zookeeper things the first time. But agree, we should try to land patches at the same time
22:48:10 i do want to make sure we can have something viable we can at least feel good about by the end of tuesday, so if that means config management changes get mostly punted to later i'm cool with that
22:48:55 would be awesome to say "zuul v3 ran a job"
22:48:59 ++
22:49:00 yep
22:49:17 jeblair: the plan is to have nl01.o.o eventually? (nodepool-launcher)
22:49:25 pabelanger: sounds reasonable
22:49:26 maybe we can begin making notes for any documentation that may need to be written
22:49:31 k
22:49:33 Shrews: ++
22:50:15 okay, take a look at that etherpad and let me know if there is anything else we should prep beforehand to increase our chances of success
22:50:31 obviously the first two are very important
22:51:13 the next few about deployment and setting up a server are things that would be really good to do ahead of time so we don't spend 2 days watching someone boot a server
22:51:26 ++
22:51:35 i'd love it if someone would volunteer to take the lead on prepping a platform for us to work from at the ptg
22:51:39 I can start doing some prep tomorrow for that
22:51:53 i think pabelanger just volunteered for that :) thanks
22:51:58 "platform" meaning server instances?
22:52:02 yeah
22:52:22 oh, and i guess a tenant/namespace/whatever for the test nodes
22:52:23 so, new servers so that we don't touch any of the current system
22:52:40 ++
22:52:49 fungi: i think at our scale, we can just steal some quota from our current nodepool tenants
22:53:10 (maybe bump the production quota down a little bit on one of them?)
22:53:17 wfm. we do have unique identifiers implemented for nodepool's alien-cleanup instance metadata, right?
22:53:54 i know we discussed having that so two could coexist on the same tenant was preferred but can't remember if it ever got implemented
22:54:22 fungi: ya, nodepool should only delete leaks that it booted
22:54:31 (and if it doesn't we should fix that too)
22:54:39 just want to make sure bringing up a demon nodepool pointed at one of our production tenants won't start sniping the production nodepool nodes
22:54:40 let's check on that
22:54:51 yep. won't delete an image unless the DIB is local
22:54:53 s/demon/demo/ (fun typo though)
22:54:58 fungi: oh, that probably won't happen because we probably won't have cleanup in v3 implemented
22:55:03 fungi: i think the other direction is a possibility
22:55:23 ah, so worst case nodepool v0.x production might blow away our demon nodes before they run anything
22:55:28 it's certainly the intent of the leak cleanup code to only delete things that it once booted and knew about
22:56:05 i thought we had talked about adding a config option where you could put a unique string for each nodepool scheduler so it could differentiate its own node metadata from someone else's
22:56:07 the last item i put on the list is something i'll volunteer for -- to write up what we all need to know about the current and future state of both pieces of software in order to productively work on a hello-world job at the ptg
22:56:14 grr, yeah nodes, not images. doubtful cleanup will be implemented by then
22:56:47 wow, i keep typing demon instead of demo. what is up with that finger memory?
22:57:10 fungi: i'm living proof brains break after 5pm
22:57:13 mordred: it would probably be good if we have a handle on some of the security stuff by then, otherwise we may not be able to publish logs for our hello world job
22:57:22 jeblair: ++
22:57:54 "we ran a job, but its logs were too insecure to fit in this margin"
22:57:58 basically
22:58:20 i put some names on the etherpad; if you would like names added or removed, let me know
22:58:52 also, if you think of anything else we need to do before then so we're not sitting on our thumbs at the ptg, add it / let me know
22:59:19 fungi: think this is probably worth a mention at the infra meeting tomorrow?
22:59:36 i think it's definitely worth mentioning, yes
22:59:41 #link actions to prepare for pike ptg https://etherpad.openstack.org/p/pike-ptg-zuul
22:59:44 will do
22:59:56 given the timing, we should spend a good chunk of tomorrow on ptg topics
23:00:01 thanks!
23:00:10 thanks everyone!
23:00:12 #endmeeting