22:07:51 <jeblair> #startmeeting zuul
22:07:52 <openstack> Meeting started Mon Nov 20 22:07:51 2017 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:07:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:07:56 <openstack> The meeting name has been set to 'zuul'
22:08:12 <jlk> The meeting should be stored in UTC, and then display is adjusted by local display zone info
22:08:15 <jeblair> #link agenda: https://wiki.openstack.org/wiki/Meetings/Zuul
22:08:22 <jlk> but people fail at thinking like that, so some events are stored actually in the local timezone, so that the "time" doesn't change when the offset does
22:08:24 <jeblair> #link last meeting: http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-11-13-22.10.log.html
22:08:39 <jeblair> Thanks to Shrews for chairing the last meeting!
22:08:46 <jeblair> it was nice and short, so i just linked to the transcript
22:08:57 <fungi> i found it helpful, even if short
22:09:36 <jeblair> i will only add that the infra+zuul team did not have rotten vegetables thrown at us so i consider it a success (in fact, many nice things were said about v3)
22:09:51 <jeblair> it=summit
22:10:04 <pabelanger> yay
22:10:11 <fungi> yes, none of the vegetable they threw were rotten
22:10:19 <pabelanger> a lot of excitement around zuulv3 at summit
22:10:20 <jlk> woo
22:10:41 <jeblair> #topic Add support for shared ansible_host in inventory (pabelanger)
22:11:18 <pabelanger> so, this is something I found when trying to convert a zuulv2 job to native zuulv3
22:11:32 <pabelanger> it is more an optimization I think on CI resources and hopefully something we want to consider
22:11:38 <jeblair> #link https://review.openstack.org/521324/
22:12:10 <pabelanger> right now, if I use our current nodeset stanza, I'd have to request 6 nodes from nodepool, when in fact, the way the playbooks are written, I really only need 1.
22:12:20 <jeblair> pabelanger: i take it the zuulv2 job was running ansible within the job, and you're trying to move that ansible up to top-level zuulv3 internal ansible?
22:12:29 <pabelanger> also, host groups doesn't work in this case,  because of the way variable scoping is handled
22:12:36 <pabelanger> jeblair: correct
22:13:12 <dmsimard> I'm not sure I'm following, have a bad case of mondays
22:13:28 <jeblair> pabelanger: cool.  so this isn't a regression from v2, more of an impedence mismatch with zuulv3 internal ansible and native ansible.  which is cool -- we want to make it as transparant as possible.
22:13:30 * dmsimard reads review
22:13:39 <mordred> if there are differences in how variable scoping works, I could see that being something other folks would run in to should they attempt to do what pabelanger is trying to do too
22:13:49 <mordred> yah - what jeblair said
22:13:53 <dmsimard> pabelanger: oh, different hosts which lead back to the same host
22:14:07 <pabelanger> jeblair: Yup, in fact, requesting the 5 nodes from nodepool work fine.  I just didn't want to land it and eat up a bunch of nodes for each patch
22:14:31 <pabelanger> dmsimard: right
22:14:32 <mordred> "I want 3 different ansible hosts, but I only need one node"
22:14:34 <clarkb> wouldn't you put the single node in different groups with zuulv3
22:14:39 <clarkb> then have your playbooks operate on the various groups?
22:14:46 <clarkb> but one node could be in say 6 groups
22:14:52 <mordred> clarkb: yah - that apparently behaves differently in some ways
22:14:57 <dmsimard> clarkb: yes and no, you can do both
22:15:11 <mordred> clarkb: (that was my first suggestion as well)
22:15:15 <fungi> pabelanger: will it run ~5x as fast if scheduled across 5 nodes? if so, the larger node size doesn't sound too terrible
22:15:16 <pabelanger> clarkb: yes, that is possible but it would require a rewrite in my case
22:15:28 <dmsimard> clarkb: like, technically, there's nothing that prevents "keystone.domain.tld" "nova.domain.tld" to ultimately resolve to the same IP address while also being different "hosts" in ansible
22:15:56 <dmsimard> the problem here is that we use IP addresses, not hostnames
22:16:10 <jeblair> pabelanger: what's the deficiency with using groups?
22:16:30 <jeblair> i know "something about var scoping" but is there something more specific i can reference?
22:16:30 <dmsimard> pabelanger: I think we might break some assumptions in roles if we do this, especially multinode roles
22:16:46 <dmsimard> jeblair: play host filtering is an example
22:17:03 <mordred> so if we want to support the story, as much as possible, of "run your existing ansible as part of your testing" - then if there is a semantic distinction between 2 hosts with the same IP and a single host in two groups, then I think we need to allow a user to express which they want
22:17:09 <pabelanger> jeblair: it is likely better is I work up simple playbook example, because I'm likley not going to explain it very well.
22:17:27 <dmsimard> jeblair: if you want to do an "all in one" but your playbooks/roles are built to target different hosts
22:17:27 <jeblair> dmsimard: isn't that an anti-pattern?  (ie, shouldn't you filter plays based on groups or roles anyway?)
22:17:30 <mordred> (also, yah, having a little clarity on the things that are different between the two would likely be helpful for all of our deeper understanding)
22:17:43 <dmsimard> I have a (bad) example, one sec
22:18:06 <dmsimard> https://github.com/rdo-infra/rdo-container-registry/blob/master/hosts
22:18:32 <dmsimard> it so happens that I'm installing everything on the same host, but the playbooks are made to target specific groups to install specific things
22:18:45 <dmsimard> pabelanger: ^ does that make sense ?
22:19:35 <jeblair> dmsimard: iiuc, that case should be handled currently with our group support (aside from the openshift_node_labels, but that's only because we don't do any host-specific inventory vars right now.  we could, but that's a different change)
22:19:40 <clarkb> dmsimard: right but we have different groups ability so your think shoudl work right?
22:19:42 <dmsimard> and then there's an example of var scoping in that example, if you look at the nodes group
22:19:54 <pabelanger> right, but more specifically ansible_host seems to create a new SSH connection (which resets variable scope) where using groups doesn't. It will just run everything from start to finish using 1 connection. Based on my local testing
22:20:19 <mordred> pabelanger: yah - that's, I think, the most important distiction
22:20:21 <dmsimard> pabelanger: that strangely rings me a bell
22:20:29 <jeblair> pabelanger: multiple ssh connections make sense, how that's connected to variable scoping i don't understand
22:20:32 <pabelanger> but, I'll setup a simple playbook / inventory demonstrate the issue
22:20:47 <mordred> pabelanger: ++
22:21:06 <dmsimard> jeblair: hostvars can arguably be workaround by supplying a host_vars directory so it's not a big deal I think
22:21:15 <dmsimard> (or group_vars)
22:21:28 <dmsimard> I don't think we need to support providing them in the nodesets (unless we really want to)
22:21:53 <jeblair> dmsimard: either way, based on my current understanding, i think it's orthogonal to this question, so we can set it aside for now
22:22:00 <mordred> yah
22:22:01 * dmsimard nods
22:22:18 <dmsimard> a reproducer would indeed help
22:22:35 <pabelanger> okay, will get that done for tomorrow and we can discuss more
22:23:14 <jeblair> okay, my personal summary here is: i'm not opposed to this on principle, but before we proceed, i'd like to understand it a bit more; pabelanger will help by suppling more examples and details.  and if we do need to proceed, we should evaluate dmsimard's concern about assumptions in multinode jobs.
22:23:31 <jeblair> that jive for folks?
22:23:33 <pabelanger> ++
22:23:35 <mordred> I agree
22:23:38 <mordred> from a philosophical point of view, I'd prefer to minimize the number of times we have to say to someone "to use your existing ansible in zuul, you need to rewrite it"
22:23:39 <dmsimard> yeah.
22:23:47 <mordred> there will be some cases in which that is unavoidable, of course
22:24:15 <fungi> sgtm
22:24:19 <jeblair> mordred: ++
22:24:23 <pabelanger> yah, this was the closes way I could reproduce an existing inventory file I was testing with in v2
22:24:26 <jeblair> #agreed pabelanger will help by suppling more examples and details before we proceed with this.  if we do need to proceed, we should evaluate dmsimard's concern about assumptions in multinode jobs.
22:25:31 <jeblair> i -1d the change with a quick summary too, so we don't lose it
22:25:40 <pabelanger> ack
22:25:50 <jeblair> #topic Allow run to be list of playbooks (pabelanger)
22:26:27 <pabelanger> So, this was actually the first way I solved above, but kept it alive because it might be usefully.  Having a job run multiple playbooks
22:26:31 <jeblair> this seems to touch on similar issues...
22:26:33 <jeblair> ah :)
22:26:36 <pabelanger> yup
22:26:38 <jeblair> that was my question
22:26:41 <jeblair> #link https://review.openstack.org/519596
22:26:53 <jeblair> so this is a semi-alternate to the other change
22:27:25 <pabelanger> Yah, gave the option to run back to back playbooks with specific hosts
22:27:34 <pabelanger> kinda like we do on puppetmaster today with ansible
22:27:51 <pabelanger> so, not sure if we want to consider supporting it or leave until later
22:27:55 <clarkb> considering it is something we already do elsewhere it seems to make sense as a feature
22:28:51 <jeblair> actually, why do we do that on puppetmaster?
22:29:35 <pabelanger> I know we wrap each ansible-playbook with timeout, did we break it out due to memory issues?
22:30:12 <jeblair> is it something about parallelism?  or exiting on failure?
22:30:26 <mordred> jeblair: I think it's acutlaly just historical
22:30:53 <mordred> we had a run_puppet.sh - and we started using ansible to run it by modifying that script one bit at a time
22:31:39 <clarkb> jeblair: mordred the big reason for it today is decoupling infracloud from everything else
22:31:42 <jeblair> pabelanger: how would this have solved your problem?  even if zuul ran multiple playbooks in sequence, it would still have the same vars?
22:31:54 <clarkb> because infracloud is more likely to fail and adds a significant time to the round robin
22:31:59 <mordred> jeblair: vars set by tasks in the plays get reset across playbook invocations
22:32:03 <jeblair> clarkb: right, but those just operate completely in parallel, right?
22:32:12 <clarkb> jeblair: yes
22:32:14 <jeblair> mordred: oh i see
22:32:20 <clarkb> so ya I guess in the context of a job you'd just have two jobs for that
22:32:23 <mordred> yah
22:32:31 <pabelanger> jeblair: it would be same vars, but multiple ssh connection attempts.  That seems to be the key to resetting them to how I expect them in the playbooks
22:32:53 <pabelanger> again, I think a more detailed example playbook might help here
22:33:01 <pabelanger> and happy to write up
22:33:05 <jeblair> i'm still questioning the connection between ssh connections and variables
22:33:16 <jeblair> i'm pretty sure those are two independent concepts
22:33:57 <pabelanger> these are group_vars if that helps
22:34:21 <jeblair> does zuul set group vars?
22:34:50 <dmsimard> I don't think so
22:34:54 <pabelanger> it doesn't, ansible is loading them based on the playbooks/group_vars folder
22:35:11 <dmsimard> there's either inventory-wide vars or extra-vars which both apply to everything
22:35:29 <jeblair> pabelanger: so you're getting different variables because you're running playbooks in different paths?
22:36:22 <pabelanger> jeblair: I get different variables if I switch to groups in my inventory file
22:36:32 <pabelanger> well
22:36:38 <pabelanger> groups of groups
22:36:56 <dmsimard> I haven't tested group_vars and host_vars, it'd be interesting to test actually.. typically you'd have host_vars and group_vars inside {{ playbook_dir }}, but in our case those paths aren't "merged". However, I believe you can set group_vars and host_vars inside roles, and that would be more likely to work.
22:37:11 <jeblair> pabelanger: er, i'm trying to understand how this change solves your variable problem from earlier
22:37:13 <pabelanger> for exmaple: http://git.openstack.org/cgit/openstack/windmill/tree/playbooks/inventory is my working v2 inventory file
22:38:00 <pabelanger> jeblair: bascially, it allows me to stop doing http://git.openstack.org/cgit/openstack/windmill/tree/playbooks/site.yaml
22:38:17 <pabelanger> and create a run: statement for each playbook
22:38:28 <jeblair> pabelanger: why is that preferable?
22:38:41 <jeblair> pabelanger: don't you feel like you're moving too much logic *into* zuul?
22:38:49 <dmsimard> seems equivalent to me as well
22:39:25 <dmsimard> FWIW that's exactly what we're doing with the base and multinode integration jobs, we're running "one" playbook that includes multiple playbooks
22:39:33 <dmsimard> I don't currently see that as a hindrance
22:39:35 <pabelanger> jeblair: yes, this is a work around, because I created 521324, which I'd much rather have
22:39:42 <pabelanger> before*
22:41:14 <pabelanger> I'll have to run here in about 5 minutes, but don't want to leave people hanging.  I'm happy if we want to contiune this topic in #zuul too
22:41:39 <jeblair> pabelanger: to be clear, i'm, again, not permanently opposed to 519596, but before we merge changes like that, i'd like to have a really clear idea of why they are necessary, or what problem they solve, or what situation they improve.  so far we've got the "include list" as one thing, but that seems like an anti-pattern and a mild argument against merging 519596
22:41:47 <jeblair> if there's a variable aspect to this, i still don't understand it
22:42:35 <pabelanger> sure, I'll get some working examples that better show the issues I ran into converting v2 job to native v3.  These were both my attempts to address some issues I was having
22:42:44 <jeblair> pabelanger: okay, thanks
22:43:06 <jeblair> #agreed pabelanger will provide more examples and explanation for this
22:43:25 <pabelanger> I have to run now, will catch up on minutes when I return
22:43:31 <jeblair> #topic open discussion
22:44:11 <jlk> I have a topic...
22:44:18 <dmsimard> For open discussion, I just wanted to point out that we formally started looking at what it means to run a Zuul v3 that is not the one in OpenStack
22:44:18 <clarkb> I too have one, but go for it jlk
22:44:35 <dmsimard> jlk won first :P
22:44:38 <jlk> dmsimard: who's "we" ?
22:44:45 <jlk> IIRC there is one at BMW is there not?
22:45:06 <dmsimard> We is RDO Infra (analogous to openstack-infra) and Software Factory developers
22:45:08 <jeblair> there are a few in fact
22:45:10 <jlk> (and for a hot minute, there was Bonny. Sigh.)
22:45:23 <jlk> dmsimard: neat!
22:45:43 <dmsimard> Software Factory had arguably been running Zuul v3 for a while
22:46:25 <dmsimard> But there's some interesting questions and design challenges in thinking how we want to share configuration between zuuls (zuul-jobs, openstack-zuul-jobs, project-config, and specific project)
22:47:09 <jeblair> i think SpamapS was also trying out zuul-jobs sharing
22:47:28 <jlk> jamielennox had some thoughts in this space as well
22:47:43 <dmsimard> I started a thread about it in the context of TripleO http://lists.openstack.org/pipermail/openstack-dev/2017-November/124733.html and we also started hunting down issues we come across in zuul-jobs here: https://etherpad.openstack.org/p/downstream-zuul-jobs-issues
22:47:46 <clarkb> we probably want to focus on sharing zuul-jobs first and not the others right (they aren't generally supposed to be reconsumable so figuring it out for zuul-jobs where it is is a good start)
22:47:48 <jeblair> sharing between instances is definitely a design goal for zuul-jobs.
22:48:34 <dmsimard> clarkb: it's funny that you mention that, because one of the ideas that has been floating around is to centralize the playbooks/roles/jobs/etc for TripleO in tripleo-ci and then use that across all Zuuls
22:48:36 <jeblair> i think openstack-zuul-jobs and individual openstack projects may be useful for openstack third-party-ci, but that's less of an explicit goal, and i think large amounts of 'shadow' and 'include'/'exclude' may be needed.
22:48:41 <mordred> yes. I could also imagine that in-repo jobs and possibly openstack-zuul-jobs might be things that OpenStack Third Party Zuuls will want to consume
22:48:50 <dmsimard> jeblair: yes, in the context of third party CI and such.
22:49:02 <mordred> jeblair: yup
22:49:37 <jeblair> project-config definitely isn't meant to be shareable -- *however* -- we do want to have at least a stub/example base job.  that should end up either in zuul-jobs or zuul-base-jobs at some point soon.
22:49:44 <mordred> ++
22:49:51 <mordred> starting by figuring out sharing of zuul-jobs and getting it right will go a long way
22:50:17 <jeblair> ++
22:50:32 <clarkb> mordred: ya thats what I'm thinking. Those are the bits that should be reconsumbale so lets start there and learn what we learn
22:50:46 <jeblair> dmsimard: so thanks for diving in and thanks in advance for your patience :)
22:51:19 <jeblair> (cause we just *might* have gotten some things wrong in the first pass)
22:51:25 <dmsimard> no stress
22:51:26 <clarkb> last week we merged a couple changes that broke Zuuls config and zuul didn't catch them upfront. The one I remember off the top of my head is parenting to a final job. I know that pabelanger ran into something else when he had to restart zuul due to OOM as well.
22:51:31 <clarkb> ^^ is my item
22:51:53 <clarkb> it would be good if we could address those config issues pre merge
22:51:53 <dmsimard> a couple?
22:51:56 <jeblair> clarkb: by broke zuul's config... what do you mean?
22:52:01 <dmsimard> how did more than one breaking change merge ?
22:52:08 <mordred> jlk: was thatyour topic? or did we talk over you?
22:52:10 <jeblair> how did *one* breaking change merge?
22:52:11 <clarkb> jeblair: in pabelanger's restart zuul case zuul would not start up again
22:52:18 <jlk> mordred: I can wait through clarkb's :)
22:52:37 <jeblair> is there a bug report?
22:52:41 <clarkb> jeblair: in the job parented to final job zuul kept running but the new jobs that merged would not run
22:52:50 <clarkb> jeblair: I believe both were added to the zuulv3 issues etherpad
22:53:14 <jeblair> clarkb: the second sounds like an expected run-time error; the first sounds like a bug.
22:53:30 <clarkb> ok we can followup on them after meeting
22:53:35 * clarkb gets out of jlk's way
22:53:35 <jeblair> ack
22:53:51 * dmsimard has an easy item after jlk
22:54:04 * jeblair has a quick item after dmsimard
22:54:21 <jlk> hi there! So tristanC landed a nodepool driver for k8s, and I'm asking if the group collective has the appetite to discuss/debate k8s (and container in general) driver approaches for Zuul/Nodepool.
22:54:30 <jlk> "is now the time" or should we table this for later?
22:54:43 <jlk> and by now, I don't mean in this channel now, but on list and in #zuul and whatnot
22:55:07 <mordred> s/landed/proposed/
22:55:14 <jlk> sorry that's what I meant :(
22:55:16 <jlk> silly fingers
22:55:37 <mordred> jlk: yah - just wanted to make sure nobody else was confused :)
22:55:38 <dmsimard> it's worth discussing not just because it's important but because it'll be a precedent from which the design of other "drivers" will be built from IMO
22:55:57 <mordred> jlk: I was actually just writing an email to the list on this meta topic actually
22:55:59 <jeblair> i'd really like to get a v3 release out before we dive into this, because i think it will help get and retain other folks into our community.  also, it's just embarrasing not to release.
22:56:25 <dmsimard> +1
22:56:30 <clarkb> considering the lack of such features isn't a regression I'm on board with that too
22:56:45 * mordred agrees with jeblair - but does think there is at least one facet of the discussion that is worth having pre-release to make sure we don't back ourselves into a corner when we release
22:56:49 <jlk> same, a release would be good, so long as we don't have to wait for zuul4 to add container support
22:56:51 <jeblair> my preferred approach would be to get the release out the door quickly, then start on the next dev cycle.  there are things on the roadmap for release that others can pitch in on (some of them have my name next to them, but they don't have to be me and that would speed things up)
22:57:38 <dmsimard> ok mordred you will start a topic on the ML ?
22:57:44 <jeblair> i would be really surprised if we are unable to release by februrary, and think earlier is likely (though i caution, there really aren't that many work weeks left in this year)
22:57:49 <mordred> the thing I think is worth sanity-checking ourselves on pre-release is making sure we aren't doing anything thatwould fundamentally block the addition of a nodepool driver that produces build resources that do not use ssh
22:57:53 <mordred> dmsimard: yes, I shall
22:58:02 <dmsimard> let's pick up the discussion there
22:58:16 <jlk> works for me, thanks!
22:58:56 <jeblair> this dovetails into my topic, which is in next week's meeting, i'd like to check in on the release roadmap items
22:58:56 <dmsimard> There's only 1:30 left.. but Ansible 2.4.1 is out and addresses most/all of the regressions/issues introduced in 2.4. How do we upgrade without breaking people that have written <2.4 things ?
22:59:25 <jeblair> i'll add that to the angenda for next, and i expect we'll have the roadmap items in storyboard soon
22:59:33 <dmsimard> Alternatively, how do we potentially allow different version of Ansible ? Because some things works in one, some in others, etc.
22:59:35 <dmsimard> Ack
22:59:44 <mordred> jeblair: ++
23:00:11 <jlk> I really hope that we don't get too deep in "what version of ansible" land for Zuul users.
23:00:13 <jlk> at least not "choose your own"
23:00:19 <mordred> dmsimard: most of the 2.3 to 2.4 breakages were in the python layer for us, right?
23:00:32 <dmsimard> mordred: for ara and zuul themselves, perhaps
23:00:40 <dmsimard> mordred: there's still some "bug fixes" that broke things
23:00:58 <dmsimard> some behavior changes in includes/imports
23:01:05 <dmsimard> also variable scopes
23:01:16 <dmsimard> they don't really follow semver :(
23:01:27 <mordred> jlk: I actually have some thoughts on how we might consider doing that without death - but as of right now I agree with that sentiment
23:01:31 <jeblair> jlk, dmsimard: some folks have expressed a use case for multiple ansible support, and it may be in keeping with our philosohpy of trying to be as transparent as possible with ansible.  for instance, kolla tests all their stuff with all current supported versions of ansible.
23:02:11 <jlk> sure, but I thought at that point, you write a first job that installs the ansible you want to test with, then....
23:02:12 <dmsimard> this is a cool topic which I'll continue in #zuul :)
23:02:13 <mordred> dmsimard: I'm not sure there is a great story yet today for how to upgrade zuul's ansible from 2.3 to 2.4 and verify that people's job ansible doesn't break
23:02:18 <jeblair> so it may not be out of the question for zuul to maintain compatibility with current supported versions.  but i think that needs a design proposal.  :)
23:02:26 <jeblair> i think the quick answer to dmsimard's question is --
23:02:36 <jeblair> we try using 2.4.1 and fix what's broken :)
23:02:39 <mordred> yah
23:02:43 <dmsimard> yuck
23:02:53 <dmsimard> anyway, we're over time :)
23:03:01 <jeblair> Shrews has done much work on that already, we should be in pretty good shape
23:03:10 <jeblair> thanks all!
23:03:13 <jeblair> #endmeeting