22:02:44 <jeblair> #startmeeting zuul
22:02:44 <openstack> Meeting started Mon Dec  5 22:02:44 2016 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:02:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:02:47 <openstack> The meeting name has been set to 'zuul'
22:02:57 <jeblair> #link agenda https://wiki.openstack.org/wiki/Meetings/Zuul
22:03:02 <rattboi> o/
22:03:07 <jhesketh> o/
22:03:14 <jeblair> #link previous meeting http://eavesdrop.openstack.org/meetings/zuul/2016/zuul.2016-11-28-22.02.html
22:03:22 <auggy> o/
22:03:27 <jeblair> #topic Actions from last meeting
22:03:47 <jeblair> jeblair work with Shuo_ to document roadmap location / process
22:03:56 <jeblair> i just pushed this up
22:03:59 <jamielennox> o/
22:04:02 <jeblair> #link roadmap: https://review.openstack.org/407213
22:04:08 <SpamapS> o/
22:04:18 <nibalizer> o/
22:04:23 <jeblair> i'll take the "work with Shuo_" part offline
22:04:55 <jeblair> i think that roadmap captures the direction we're heading
22:05:20 <jeblair> i don't want to do task management in README files or anything, but just having a list of "yeah these are things we know we're going to do" will be helpful i think
22:05:42 <pabelanger> looks good
22:06:27 <jeblair> #topic Status updates (Nodepool Zookeeper work)
22:06:29 <SpamapS> I think that's a fine way to manage an epic-epic story, which is basically what this is.
22:06:47 <jamielennox> yea, README is good for a really high level, can do task management elsewhere
22:07:01 <jeblair> #undo
22:07:02 <openstack> Removing item from minutes: <ircmeeting.items.Topic object at 0x7f63d9f80bd0>
22:07:35 <jeblair> cool
22:07:40 <jeblair> i guess we're ready to move on :)
22:07:41 <jeblair> #topic Status updates (Nodepool Zookeeper work)
22:07:51 <SpamapS> (TIL there's a #undo)
22:08:29 <fungi> (i use it all the time to fix my typos)
22:08:52 <jeblair> i think we're hitting the last few things on the punch list for the nodepool builder
22:09:23 <Shrews> yeah. bunch of bug fixes/improvements since last week.
22:10:11 <pabelanger> agree, think we are getting close
22:11:25 <jeblair> nb01.o.o has been mostly working, yeah?  i think we lost a few builds this weekend due to a bug whose fix is in progress
22:11:45 <jeblair> and we actually achieved image quorum pretty quickly -- like, within 1 day?
22:12:11 <pabelanger> yes, things have looked good on the upload front for sure
22:12:48 <jeblair> is there anything outstanding we need to have in place before we switch infra production over to use it?
22:12:53 <Shrews> pabelanger: have we tried disaster scenarios yet?
22:12:56 <Shrews> jeblair: yes! ^^^
22:13:00 <Shrews> let's kill ZK
22:13:04 <Shrews> see what happens
22:13:27 <clarkb> might also be handy if we could write down a few of the manual debug and fixing steps for zk that have been used
22:13:32 <pabelanger> we did see https://storyboard.openstack.org/#!/story/2000809 over the weekend, I had to manually delete some data in zk to fix it
22:13:34 <clarkb> since it's relatively new to most of us
22:14:09 <jeblair> clarkb: yes -- but any time we run a manual zk command it's a bug that we must fix
22:14:13 <Shrews> pabelanger: that json thing might be fixed by the open review
22:14:14 <fungi> yeah, if that hasn't found its way into our system-config docs, now would be the time
22:14:22 <jeblair> fungi: i don't think that's right
22:14:39 <pabelanger> Shrews: yes, we should land your patch and see if it happens again for sure
22:14:44 <fungi> system-config shouldn't mention that nodepool is using zk to coordinate tasks?
22:14:55 <SpamapS> I'd say if you find yourself having to dig around in ZK, that should definitely end up as a story.
22:15:07 <jeblair> i think we should send out an email to the infra list letting folks know how to run zk shell and a bit of overview
22:15:14 <clarkb> SpamapS: jeblair I agree, but chances are we will end up doing it at some point so we should tell people how
22:15:17 <clarkb> ya that works
22:15:24 <clarkb> (then they can submit the resulting bug report)
22:15:26 <pabelanger> SpamapS: 2000809 is the only time we've had to so far
22:15:27 <jeblair> but we should not document zk shell commands in the same way that we don't document mysql currently
22:15:32 <fungi> i guess it's too deep of an implementation detail for sysadmins to care about? do we not need to make sure the service is running et cetera et cetera?
22:15:44 <jeblair> fungi: oh that's fine :)
22:16:00 <jeblair> fungi: i just meant that we should not be writing: "if something goes wrong, here's the zk command to fix it" in system-config :)
22:16:12 <fungi> yeah, i'm less interested in knowing how to deep-dive zk internals. for that there's its own documentation
22:16:37 <fungi> zk presumably has plenty of documentation on how to interact with it should that become necessary
22:16:38 <Shrews> https://review.openstack.org/406342 has a very hard to find (thanks jeblair!) bug fix that needs to land for the builder, fyi
22:17:07 <fungi> though linking to the zk docs from system-config could be handy
22:17:30 <jeblair> i just want to make sure we're all on the same page that zk is an internal implementation detail that we should all be aware of as developers of this thing, and operators should not really be aware of it beyond seeing it in 'ps' output
22:17:41 <SpamapS> It should be noted that the interface between Zuul and Nodepool should be documented via inline docs for whoever owns it, so that it can be understood well by developers trying to fix those bugs.
22:17:47 <jlk> hrm.
22:17:49 * SpamapS has not looked, maybe it is.
22:18:16 <fungi> are we expecting to configure it to listen on a particular socket/port, or does it use a standard iana assignment?
22:18:22 <jlk> Well, ops should know 1) it's required, 2) how to ensure that its health is as-expected, 3) how to safely kick it should it not be healthy, 4) how to reach out to devs for more help if it's fallen over hard.
22:18:24 <jeblair> SpamapS: it should be, once it exists (that's coming up later in agenda)
22:18:30 <jeblair> jlk: ++
22:18:44 <fungi> what jlk describes is exactly what i would want too. yes
22:19:20 <jlk> and maybe "where does it log, is it sensitive, should I keep them around"
22:19:28 <SpamapS> ZK's client libs expect ZK on a particular port.
22:19:41 <SpamapS> Dunno if it's been registered with IANA.. but seems likely to be.
22:19:48 <jeblair> much of that of course should come from the zk docs, which we can link to from system-config
22:19:57 <pabelanger> Shrews: as for disaster scenarios, I think our next steps are to do some ops things to our images. eg: manual deletes, builds, uploads etc. see what happens
22:20:04 <jlk> "Where does it run" which leads to SpamapS's thing of "what port does it run" so that firewalls can be adjusted.
22:20:14 <SpamapS> Right that's all in ZK's docs
22:20:18 <jlk> cool
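One way to cover jlk's "how to ensure that its health is as-expected" item is ZooKeeper's `ruok` four-letter command, which a healthy server answers with `imok`. The sketch below is a minimal illustration and not something the meeting agreed on: the function name `zk_is_ok` is hypothetical, port 2181 is ZooKeeper's default client port and may differ in a given deployment, and newer ZooKeeper releases may require the command to be explicitly enabled in the server configuration.

```python
import socket


def zk_is_ok(host="localhost", port=2181, timeout=2.0):
    """Return True if a ZooKeeper server answers 'imok' to 'ruok'.

    Hypothetical helper for illustration. The defaults assume a local
    server on ZooKeeper's default client port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The expected reply is four bytes; stop early on EOF.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
            return reply == b"imok"
    except OSError:
        # Connection refused, timeout, DNS failure, etc.
        return False
```

Calling `zk_is_ok("zk.example.org")` (a placeholder hostname) returns a boolean that a monitoring check could act on. It only shows that the server process is responding, not that nodepool's data in it is sane.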
22:20:18 <Shrews> pabelanger: ++
22:20:39 <SpamapS> what you want to put in nodepool/zuul's docs is "Where do I look to make sure zuul and nodepool are active in a particular ZK."
22:21:24 <SpamapS> Which can be as simple as "Nodepool reservations are stored under /nodepool in ZK. The contents are documented in code at nodepool/zk_interface.py"
22:21:24 <pabelanger> another thing, do we want to try adding nb02.o.o too, see if things work as expected?
22:21:31 <Shrews> pabelanger: but we definitely need to hard kill ZK several times, too (during uploads, during builds, etc)
22:21:32 <jeblair> #action jeblair update nodepool system-config docs with zk info
22:21:39 <pabelanger> Shrews: ++
22:21:56 <jeblair> #action pabelanger test zk disaster scenarios thi nodepool-builder
22:22:00 <jeblair> #undo
22:22:01 <openstack> Removing item from minutes: <ircmeeting.items.Action object at 0x7f63d984a850>
22:22:04 <jeblair> #action pabelanger test zk disaster scenarios with nodepool-builder
22:22:20 <jeblair> any other production blockers?
22:22:36 <Shrews> https://review.openstack.org/406342
22:22:39 <jeblair> pabelanger suggested an nb02, do we want that?
22:22:41 <Shrews> as mentioned
22:23:05 <clarkb> if we are able to get through an entire build and upload cycle in less than a day I don't think it is necessary
22:23:27 <clarkb> but may want to test it before we productionize it
22:23:58 <jeblair> #action jeblair merge https://review.openstack.org/406342
22:24:08 <Shrews> clarkb: agreed. i think we need the multiple node scenario before production
22:24:12 <pabelanger> Ya, a single builder is working very well right now.
22:24:22 <clarkb> https://review.openstack.org/#/c/406411/8 is another change worth getting in to fix leaking hash files
22:24:24 <jeblair> who wants to launch nb02?
22:24:54 <jeblair> #action jeblair merge https://review.openstack.org/406411
22:25:23 <pabelanger> I mean, I can do nb02, but a good time for somebody to see how the donuts are made :)
22:26:00 <jeblair> #action pabelanger launch nb02
22:26:05 <jeblair> pabelanger: feel free to delegate :)
22:26:11 <pabelanger> ack
22:26:22 <Shrews> pabelanger: i'll volunteer, but only because I have NO idea how it's done
22:26:57 * Shrews imagines something-something-system-config
22:27:03 <pabelanger> ++
22:27:12 <pabelanger> we can work through it
22:27:16 <Shrews> cool
22:27:45 <jeblair> maybe we can get those things done this week and switch nodepool.o.o over to the v3 branch and start using the zk images at the end of this week / early next?
22:28:22 <Shrews> depends on how disaster testing goes, tbh
22:29:31 <jeblair> yeah, i'm assuming if it goes poorly we won't be in a rush to start using it :)
22:29:36 <fungi> that sounds ideal if all goes as planned
22:29:38 <jeblair> we can always keep the mysql table around and switch back if something goes very wrong.  ;)
22:29:43 <SpamapS> I'd say that shouldn't matter too much. Running in prod is a form of disaster testing. ;)
22:29:57 <fungi> would be excellent timing. activity volume is likely to be somewhat low but people will still be around
22:30:27 <jeblair> i put this on the infra meeting agenda, so i'll say that tomorrow at that meeting.  :)
22:30:36 <fungi> excellent
22:31:57 <jeblair> #agreed attempt to use zk images in infra production nodepool pending resolution of blockers (including positive results from disaster testing), ideally late this week / early next
22:32:18 <jeblair> any other nodepool zk status updates?
22:32:36 <Shrews> not from me
22:32:47 <pabelanger> same
22:33:08 <jeblair> #topic Status updates (Zuul test enablement)
22:34:01 <jeblair> anything we need to talk about here?
22:34:16 <SpamapS> I submitted 3 patches
22:34:21 <SpamapS> lots of others in flight
22:34:30 <SpamapS> still a long long way to go
22:34:40 <SpamapS> But progress is definitely ongoing
22:34:44 <jeblair> ++
22:34:46 * jhesketh will try and catch up on reviews for those today
22:34:55 <adam_g> https://review.openstack.org/#/c/406361/ moves some stuff around, adds a couple new tests and re-enables the merge-mode project config option
22:35:06 <jeblair> i know i've been answering a lot of questions, so hopefully this is working as intended and folks are finding it useful
22:35:14 <adam_g> jeblair: definitely, thanks for that
22:35:46 <jeblair> (and i love talking about this stuff)
22:35:52 <SpamapS> I do want to say that a lot of what worried me about v3's long-lived-feature-branch status is mitigated by the amount of functional testing present.
22:36:04 <jamielennox> if there's a basic hit list for this it would be useful
22:36:16 <jamielennox> i know to stay away from cloner, but otherwise i'm kind of picking at random
22:36:39 <SpamapS> By seeing a test that is marked 'skip' for what it was meant to test in pre-v3 .. it helps me to reason about what is meant to change without losing some of the battle hardening of pre-v3.
22:36:54 <SpamapS> jamielennox: grep -r '@skip' tests
22:37:10 <jamielennox> so maybe less hit list, vs this has drastically changed
22:37:27 <SpamapS> I've kind of been targeting scheduler.
22:37:29 <jamielennox> like i messed with the template tests for a while before realizing they didn't make sense anymore :)
22:37:55 <SpamapS> jamielennox: It's good to lean on jeblair. I mean it. He's a font of information. Dunno how he keeps it all under that fedora. ;)
22:38:13 <fungi> pocket dimension
22:38:20 <SpamapS> fungi: right.. obviously
22:38:45 <pabelanger> i believe I still have 1 or 2 tests in merge conflict, I'll work to fix that
22:39:05 <jamielennox> anyway, no big deal, there's plenty to find
22:39:05 <SpamapS> and of course, always check story 2000773 to make sure nobody else is working on a test yet.
22:39:12 <jamielennox> yep
22:39:17 <SpamapS> (and add your own task for any tests you want to work on)
22:39:32 <SpamapS> It's been going quite well.. I see tons of stuff merged and submitted.
22:40:00 <jeblair> and yeah, if something seems fishy, feel free to check in with me early
22:41:12 <jeblair> okay, let's move on to the progress summary
22:41:21 <jeblair> #topic Progress summary
22:41:33 <SpamapS> #link https://storyboard.openstack.org/#!/board/41
22:41:33 <jeblair> #link https://storyboard.openstack.org/#!/board/41
22:41:35 <SpamapS> haha
22:41:37 <SpamapS> doh
22:41:41 <jeblair> we did that last meeting too
22:41:42 <SpamapS> jeblair: you made us a new toy!
22:41:45 <jeblair> SpamapS: all yours :)
22:41:59 <jlk> pretty
22:42:10 <SpamapS> so, you may notice that as of about 45 minutes ago, the new "New" lane was filled with tasks.
22:42:30 <SpamapS> jeblair made a script that fills that with any tasks that aren't already on the board and are attached to a story tagged 'zuulv3'
22:42:53 <SpamapS> So, this should ensure that we have a complete view of #allTheTHings
22:43:21 <mordred> \o/
22:43:25 <SpamapS> it will also remove completed things
22:43:30 <SpamapS> so I removed the Done lane
22:43:41 <pabelanger> what is the difference between new and backlog?
22:43:53 <SpamapS> New is stuff nobody has manually classified yet
22:44:01 <pabelanger> okay
22:44:08 <SpamapS> Backlog is "yep, that's something we want to do some day" and Todo is "Pick this up now"
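The board behaviour described above (unplaced tasks from 'zuulv3'-tagged stories land in New, completed tasks drop off the board) can be sketched as a pure function. This is not jeblair's script and does not use the StoryBoard API. The data shapes (`lanes` as a dict of task-id lists, tasks as dicts with `id`, `story_tags` and `completed` keys) and the name `sync_board` are all hypothetical.

```python
def sync_board(lanes, tasks, tag="zuulv3"):
    """Return a new lane mapping after one sync pass.

    lanes: dict mapping lane name -> list of task ids (hypothetical shape)
    tasks: list of dicts with 'id', 'story_tags', 'completed' (hypothetical)
    """
    completed = {t["id"] for t in tasks if t["completed"]}

    # Completed tasks are removed from every lane, so no Done lane is needed.
    synced = {
        lane: [tid for tid in ids if tid not in completed]
        for lane, ids in lanes.items()
    }

    on_board = {tid for ids in synced.values() for tid in ids}
    new_lane = synced.setdefault("New", [])

    # Open tasks on tagged stories that nobody has placed yet go into New.
    for t in tasks:
        if tag in t["story_tags"] and not t["completed"] and t["id"] not in on_board:
            new_lane.append(t["id"])
            on_board.add(t["id"])
    return synced
```

Manual triage then consists of moving ids out of New into Backlog or Todo. A later sync leaves those placements alone because the ids are already on the board.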
22:44:10 <clarkb> for example I think one of the new items is in progress
22:44:23 <clarkb> (the handle catastrophic build failures)
22:44:23 <SpamapS> clarkb: is it assigned to anyone?
22:44:39 <clarkb> SpamapS: no
22:44:47 <SpamapS> Then how can it be in progress? :)
22:44:49 <clarkb> (shrews is the one with changes for it though I think)
22:44:52 <jeblair> it will auto move things into in-progress if the task is updated to be in progress
22:44:57 <jeblair> clarkb: which one?
22:45:04 <clarkb> Card #493 - Set image/upload states to deleting after failure
22:45:07 <SpamapS> oh I didn't even remember that.
22:45:28 <clarkb> SpamapS: it is in progress because someone is doing the work?
22:45:38 <SpamapS> clarkb: I was being facetious :)
22:45:40 <jeblair> clarkb: you're thinking of the other change next to that
22:45:51 <clarkb> ah ok
22:45:54 <jeblair> i'm sorry i don't have numbers right now
22:46:00 <jeblair> but i filed two stories one after the other
22:46:05 <jeblair> they *might* end up being the same bug
22:46:07 <clarkb> oh right this was that one
22:46:11 <jeblair> but Shrews picked up one of them
22:46:13 <jeblair> and that's the other
22:46:27 <clarkb> where it was "maybe same bug but maybe not so let's be careful" iirc
22:47:05 <jeblair> clarkb: yeah
22:47:06 <jeblair> (and i'm sure i saw a story or task header on the change for it)
22:47:43 <Shrews> jeblair: does my outstanding review cover both SB items you created last week? seems like it might
22:48:00 <jeblair> Shrews: we should check
22:48:38 <Shrews> i couldn't find them on the current workboard
22:48:59 <jeblair> i'll dig into it after the meeting
22:49:30 <Shrews> oh, they're there now
22:50:13 <jeblair> i want to spend our remaining time talking about the nodepool-zuul spec -- anything else here?
22:50:44 <SpamapS> so anyway, if you see a task that you think _is_ being done, it would help me if you ping me or just assign/update it. Thanks.
22:51:15 <jeblair> ++
22:51:19 <jeblair> #topic Zuul v3: use Zookeeper for Nodepool-Zuul protocol
22:51:19 <jeblair> #link Zuul v3: use Zookeeper for Nodepool-Zuul protocol https://review.openstack.org/305506
22:51:42 <jeblair> i revised this spec update based on comments, and things we've learned about zk so far
22:51:55 <jeblair> i think it's about time to start working on it
22:52:13 <jeblair> ideally around the time we put the zk builder into production
22:52:29 <jeblair> i have it on tomorrow's infra team meeting agenda
22:52:42 <jeblair> so please take a look at it soon
22:53:02 <clarkb> jeblair: Shrews do the comments near the top imply step 0 is maybe patching kazoo?
22:53:11 <jeblair> if there are any big red flags we can delay it...
22:53:14 <Shrews> clarkb: nope
22:53:37 <Shrews> clarkb: writing our own implementation, as jeblair points out in a later response
22:53:49 <jeblair> yeah, it's really simple
22:54:01 <Shrews> although I'd LOVE for someone to implement read locks in kazoo, if they have time  :)
22:54:07 <clarkb> gotcha using more fundamental aspects of zk to manage it
22:54:08 * Shrews sidetracks the convo
22:55:10 <jeblair> Shrews has volunteered to continue to head this effort, and i think he's well placed to do so
22:55:33 <jeblair> mordred has talked about making a shim so we can use this immediately with zuul v2
22:55:52 <mordred> yes - I will work on that
22:56:24 <mordred> tl;dr - daemon that behaves like current nodepool talking to the v2 launchers but requests nodes over zk from new nodepool
22:56:37 <fungi> nifty!
22:57:16 <jeblair> i think it's a swell idea that helps us make the v3 migration as stepwise as possible
22:58:21 <jeblair> i can work on the zuul (requesting) side of this (which i think will be pretty simple)
22:58:35 <fungi> the potential "big bang" aspect of the v3 cutover is the scariest bit, to me
22:58:45 <Shrews> *phew*, was hoping i didn't have to touch the zuul sie
22:58:48 <Shrews> side
22:58:53 <jeblair> fungi: yeah, this will mean that it's "only" zuul.  :)
22:58:54 <fungi> so anything incremental eases my mind
22:59:05 <jeblair> Shrews: i'll ask you to review though :)
22:59:36 <jeblair> anyway, if folks can review that asap, that would be great
22:59:50 * clarkb reviews
22:59:56 <jeblair> and heads up that i'd like to do the same for https://review.openstack.org/381329 next week
23:00:02 <Shrews> SpamapS: given your past zk experience, i think you reviewing that would be good
23:00:02 <jeblair> #link https://review.openstack.org/381329
23:00:15 <jeblair> thanks everyone!
23:00:17 <jeblair> #endmeeting