22:02:44 #startmeeting zuul
22:02:44 Meeting started Mon Dec 5 22:02:44 2016 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:02:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:02:47 The meeting name has been set to 'zuul'
22:02:57 #link agenda https://wiki.openstack.org/wiki/Meetings/Zuul
22:03:02 o/
22:03:07 o/
22:03:14 #link previous meeting http://eavesdrop.openstack.org/meetings/zuul/2016/zuul.2016-11-28-22.02.html
22:03:22 o/
22:03:27 #topic Actions from last meeting
22:03:47 jeblair work with Shuo_ to document roadmap location / process
22:03:56 i just pushed this up
22:03:59 o/
22:04:02 #link roadmap: https://review.openstack.org/407213
22:04:08 o/
22:04:18 o/
22:04:23 i'll take the "work with Shuo_" part offline
22:04:55 i think that roadmap captures the direction we're heading
22:05:20 i don't want to do task management in README files or anything, but just having a list of "yeah these are things we know we're going to do" will be helpful i think
22:05:42 looks good
22:06:27 #topic Status updates (Nodepool Zookeeper work)
22:06:29 I think that's a fine way to manage an epic-epic story, which is basically what this is.
22:06:47 yea, README is good for a really high level, can do task management elsewhere
22:07:01 #undo
22:07:02 Removing item from minutes:
22:07:35 cool
22:07:40 i guess we're ready to move on :)
22:07:41 #topic Status updates (Nodepool Zookeeper work)
22:07:51 (TIL there's a #undo)
22:08:29 (i use it all the time to fix my typos)
22:08:52 i think we're hitting the last few things on the punch list for the nodepool builder
22:09:23 yeah. bunch of bug fixes/improvements since last week.
22:10:11 agree, think we are getting close
22:11:25 nb01.o.o has been mostly working, yeah?
i think we lost a few builds this weekend due to a bug whose fix is in progress
22:11:45 and we actually achieved image quorum pretty quickly -- like, within 1 day?
22:12:11 yes, things have looked good on the upload front for sure
22:12:48 is there anything outstanding we need to have in place before we switch infra production over to use it?
22:12:53 pabelanger: have we tried disaster scenarios yet?
22:12:56 jeblair: yes! ^^^
22:13:00 let's kill ZK
22:13:04 see what happens
22:13:27 might also be handy if we could write down a few of the manual debug and fixing steps for zk that have been used
22:13:32 we did see https://storyboard.openstack.org/#!/story/2000809 over the weekend, I had to manually delete some data in zk to fix it
22:13:34 since it's relatively new to most of us
22:14:09 clarkb: yes -- but any time we run a manual zk command it's a bug that we must fix
22:14:13 pabelanger: that json thing might be fixed by the open review
22:14:14 yeah, if that hasn't found its way into our system-config docs, now would be the time
22:14:22 fungi: i don't think that's right
22:14:39 Shrews: yes, we should land your patch and see if it happens again for sure
22:14:44 system-config shouldn't mention that nodepool is using zk to coordinate tasks?
22:14:55 I'd say if you find yourself having to dig around in ZK, that should definitely end up as a story.
22:15:07 i think we should send out an email to the infra list letting folks know how to run zk shell and a bit of overview
22:15:14 SpamapS: jeblair I agree, but chances are we will end up doing it at some point so we should tell people how
22:15:17 ya that works
22:15:24 (then they can submit the resulting bug report)
22:15:26 SpamapS: 2000809 is the only time we've had to so far
22:15:27 but we should not document zk shell commands in the same way that we don't document mysql currently
22:15:32 i guess it's too deep of an implementation detail for sysadmins to care about?
do we not need to make sure the service is running et cetera et cetera?
22:15:44 fungi: oh that's fine :)
22:16:00 fungi: i just meant that we should not be writing: "if something goes wrong, here's the zk command to fix it" in system-config :)
22:16:12 yeah, i'm less interested in knowing how to deep-dive zk internals. for that there's its own documentation
22:16:37 zk presumably has plenty of documentation on how to interact with it should that become necessary
22:16:38 https://review.openstack.org/406342 has a very hard to find (thanks jeblair!) bug fix that needs to land for the builder, fyi
22:17:07 though linking to the zk docs from system-config could be handy
22:17:30 i just want to make sure we're all on the same page that zk is an internal implementation detail that we should all be aware of as developers of this thing, and operators should not really be aware of it beyond seeing it in 'ps' output
22:17:41 It should be noted that the interface between Zuul and Nodepool should be documented via inline docs for whoever owns it, so that it can be understood well by developers trying to fix those bugs.
22:17:47 hrm.
22:17:49 * SpamapS has not looked, maybe it is.
22:18:16 are we expecting to configure it to listen on a particular socket/port, or does it use a standard iana assignment?
22:18:22 Well, ops should know 1) it's required, 2) how to ensure that its health is as-expected, 3) how to safely kick it should it not be healthy, 4) how to reach out to devs for more help if it's fallen over hard.
22:18:24 SpamapS: it should be, once it exists (that's coming up later in agenda)
22:18:30 jlk: ++
22:18:44 what jlk describes is exactly what i would want too. yes
22:19:20 and maybe "where does it log, is it sensitive, should I keep them around"
22:19:28 ZK's client libs expect ZK on a particular port.
22:19:41 Dunno if it's been registered with IANA.. but seems likely to be.
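Editor's sketch for the health-check point raised above (ops should know how to ensure that ZK's health is as expected): a minimal Python probe using ZooKeeper's four-letter `ruok` command, to which a healthy server replies `imok`. Assumptions not stated in the meeting: the server listens on ZooKeeper's default client port 2181, four-letter commands are enabled (newer ZooKeeper releases may require whitelisting them), and the function name `zk_is_ok` is hypothetical.

```python
import socket


def zk_is_ok(host="127.0.0.1", port=2181, timeout=5.0):
    """Return True if the ZooKeeper server answers 'imok' to 'ruok'.

    Assumes the default client port (2181) unless told otherwise, and
    that four-letter commands are enabled on the server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server closes the connection after replying, so read to EOF.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Refused connections, timeouts and resets all count as "not healthy".
        return False
    return b"".join(chunks) == b"imok"
```

Calling `zk_is_ok("zk.example.org")` (a placeholder hostname) returns a boolean that a monitoring check could act on; it says nothing about whether nodepool or zuul are actually using that ZooKeeper.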
22:19:48 much of that of course should come from the zk docs, which we can link to from system-config
22:19:57 Shrews: as for disaster scenarios, I think our next steps are to do some ops things to our images. eg: manual deletes, builds, uploads etc. see what happens
22:20:04 "Where does it run" which leads to SpamapS's thing of "what port does it run" so that firewalls can be adjusted.
22:20:14 Right that's all in ZK's docs
22:20:18 cool
22:20:18 pabelanger: ++
22:20:39 what you want to put in nodepool/zuul's docs is "Where do I look to make sure zuul and nodepool are active in a particular ZK."
22:21:24 Which can be as simple as "Nodepool reservations are stored under /nodepool in ZK. The contents are documented in code at nodepool/zk_interface.py"
22:21:24 another thing, do we want to try adding nb02.o.o too, see if things work as expected?
22:21:31 pabelanger: but we definitely need to hard kill ZK several times, too (during uploads, during builds, etc)
22:21:32 #action jeblair update nodepool system-config docs with zk info
22:21:39 Shrews: ++
22:21:56 #action pabelanger test zk disaster scenarios thi nodepool-builder
22:22:00 #undo
22:22:01 Removing item from minutes:
22:22:04 #action pabelanger test zk disaster scenarios with nodepool-builder
22:22:20 any other production blockers?
22:22:36 https://review.openstack.org/406342
22:22:39 pabelanger suggested an nb02, do we want that?
22:22:41 as mentioned
22:23:05 if we are able to get through an entire build and upload cycle in less than a day I don't think it is necessary
22:23:27 but may want to test it before we productionize it
22:23:58 #action jeblair merge https://review.openstack.org/406342
22:24:08 clarkb: agreed. i think we need the multiple node scenario before production
22:24:12 Ya, a single builder is working very well right now.
22:24:22 https://review.openstack.org/#/c/406411/8 is another change worth getting in to fix leaking hash files
22:24:24 who wants to launch nb02?
22:24:54 #action jeblair merge https://review.openstack.org/406411
22:25:23 I mean, I can do nb02, but a good time for somebody to see how the donuts are made :)
22:26:00 #action pabelanger launch nb02
22:26:05 pabelanger: feel free to delegate :)
22:26:11 ack
22:26:22 pabelanger: i'll volunteer, but only because I have NO idea how it's done
22:26:57 * Shrews imagines something-something-system-config
22:27:03 ++
22:27:12 we can work through it
22:27:16 cool
22:27:45 maybe we can get those things done this week and switch nodepool.o.o over to the v3 branch and start using the zk images at the end of this week / early next?
22:28:22 depends on how disaster testing goes, tbh
22:29:31 yeah, i'm assuming if it goes poorly we won't be in a rush to start using it :)
22:29:36 that sounds ideal if all goes as planned
22:29:38 we can always keep the mysql table around and switch back if something goes very wrong. ;)
22:29:43 I'd say that shouldn't matter too much. Running in prod is a form of disaster testing. ;)
22:29:57 would be excellent timing. activity volume is likely to be somewhat low but people will still be around
22:30:27 i put this on the infra meeting agenda, so i'll say that tomorrow at that meeting. :)
22:30:36 excellent
22:31:57 #agreed attempt to use zk images in infra production nodepool pending resolution of blockers (including positive results from disaster testing), ideally late this week / early next
22:32:18 any other nodepool zk status updates?
22:32:36 not from me
22:32:47 same
22:33:08 #topic Status updates (Zuul test enablement)
22:34:01 anything we need to talk about here?
22:34:16 I submitted 3 patches
22:34:21 lots of others in flight
22:34:30 still a long long way to go
22:34:40 But progress is definitely ongoing
22:34:44 ++
22:34:46 * jhesketh will try and catch up on reviews for those today
22:34:55 https://review.openstack.org/#/c/406361/ moves some stuff around, adds a couple new tests and re-enables the merge-mode project config option
22:35:06 i know i've been answering a lot of questions, so hopefully this is working as intended and folks are finding it useful
22:35:14 jeblair: definitely, thanks for that
22:35:46 (and i love talking about this stuff)
22:35:52 I do want to say that a lot of what worried me about v3's long-lived-feature-branch status is mitigated by the amount of functional testing present.
22:36:04 if there's a basic hit list for this it would be useful
22:36:16 i know to stay away from cloner, but otherwise i'm kind of picking at random
22:36:39 By seeing a test that is marked 'skip' for what it was meant to test in pre-v3 .. it helps me to reason about what is meant to change without losing some of the battle hardening of pre-v3.
22:36:54 jamielennox: grep -r '@skip' tests
22:37:10 so maybe less hit list, vs this has drastically changed
22:37:27 I've kind of been targeting scheduler.
22:37:29 like i messed with the template tests for a while before realizing they didn't make sense anymore :)
22:37:55 jamielennox: It's good to lean on jeblair. I mean it. He's a font of information. Dunno how he keeps it all under that fedora. ;)
22:38:13 pocket dimension
22:38:20 fungi: right.. obviously
22:38:45 i believe I still have 1 or 2 tests in merge conflict, I'll work to fix that
22:39:05 anyway, no big deal, there's plenty to find
22:39:05 and of course, always check story 2000773 to make sure nobody else is working on a test yet.
22:39:12 yep
22:39:17 (and add your own task for any tests you want to work on)
22:39:32 It's been going quite well.. I see tons of stuff merged and submitted.
22:40:00 and yeah, if something seems fishy, feel free to check in with me early
22:41:12 okay, let's move on to the progress summary
22:41:21 #topic Progress summary
22:41:33 #link https://storyboard.openstack.org/#!/board/41
22:41:33 #link https://storyboard.openstack.org/#!/board/41
22:41:35 haha
22:41:37 doh
22:41:41 we did that last meeting too
22:41:42 jeblair: you made us a new toy!
22:41:45 SpamapS: all yours :)
22:41:59 pretty
22:42:10 so, you may notice that as of about 45 minutes ago, the new "New" lane was filled with tasks.
22:42:30 jeblair made a script that fills that with any tasks that aren't already on the board and are attached to a story tagged 'zuulv3'
22:42:53 So, this should ensure that we have a complete view of #allTheTHings
22:43:21 \o/
22:43:25 it will also remove completed things
22:43:30 so I removed the Done lane
22:43:41 what is the difference between new and backlog?
22:43:53 New is stuff nobody has manually classified yet
22:44:01 okay
22:44:08 Backlog is "yep, that's something we want to do some day" and Todo is "Pick this up now"
22:44:10 for example I think one of the new items is in progress
22:44:23 (the handle catastrophic build failures)
22:44:23 clarkb: is it assigned to anyone?
22:44:39 SpamapS: no
22:44:47 Then how can it be in progress? :)
22:44:49 (shrews is the one with changes for it though I think)
22:44:52 it will auto move things into in-progress if the task is updated to be in progress
22:44:57 clarkb: which one?
22:45:04 Card #493 - Set image/upload states to deleting after failure
22:45:07 oh I didn't even remember that.
22:45:28 SpamapS: it is in progress because someone is doing the work?
22:45:38 clarkb: I was being facetious :)
22:45:40 clarkb: you're thinking of the other change next to that
22:45:51 ah ok
22:45:54 i'm sorry i don't have numbers right now
22:46:00 but i filed two stories one after the other
22:46:05 they *might* end up being the same bug
22:46:07 oh right this was that one
22:46:11 but Shrews picked up one of them
22:46:13 and that's the other
22:46:27 where it was "maybe same bug but maybe not so let's be careful" iirc
22:47:05 clarkb: yeah
22:47:06 (and i'm sure i saw a story or task header on the change for it)
22:47:43 jeblair: does my outstanding review cover both SB items you created last week? seems like it might
22:48:00 Shrews: we should check
22:48:38 i couldn't find them on the current workboard
22:48:59 i'll dig into it after the meeting
22:49:30 oh, they're there now
22:50:13 i want to spend our remaining time talking about the nodepool-zuul spec -- anything else here?
22:50:44 so anyway, if you see a task that you think _is_ being done, it would help me if you ping me or just assign/update it. Thanks.
22:51:15 ++
22:51:19 #topic Zuul v3: use Zookeeper for Nodepool-Zuul protocol
22:51:19 #link Zuul v3: use Zookeeper for Nodepool-Zuul protocol https://review.openstack.org/305506
22:51:42 i revised this spec update based on comments, and things we've learned about zk so far
22:51:55 i think it's about time to start working on it
22:52:13 ideally around the time we put the zk builder into production
22:52:29 i have it on tomorrow's infra team meeting agenda
22:52:42 so please take a look at it soon
22:53:02 jeblair: Shrews do the comments near the top imply step 0 is maybe patching kazoo?
22:53:11 if there are any big red flags we can delay it...
22:53:14 clarkb: nope
22:53:37 clarkb: writing our own implementation, as jeblair points out in a later response
22:53:49 yeah, it's really simple
22:54:01 although I'd LOVE for someone to implement read locks in kazoo, if they have time :)
22:54:07 gotcha using more fundamental aspects of zk to manage it
22:54:08 * Shrews sidetracks the convo
22:55:10 Shrews has volunteered to continue to head this effort, and i think he's well placed to do so
22:55:33 mordred has talked about making a shim so we can use this immediately with zuul v2
22:55:52 yes - I will work on that
22:56:24 tl;dr - daemon that behaves like current nodepool talking to the v2 launchers but requests nodes over zk from new nodepool
22:56:37 nifty!
22:57:16 i think it's a swell idea that helps us make the v3 migration as stepwise as possible
22:58:21 i can work on the zuul (requesting) side of this (which i think will be pretty simple)
22:58:35 the potential "big bang" aspect of the v3 cutover is the scariest bit, to me
22:58:45 *phew*, was hoping i didn't have to touch the zuul sie
22:58:48 side
22:58:53 fungi: yeah, this will mean that it's "only" zuul. :)
22:58:54 so anything incremental eases my mind
22:59:05 Shrews: i'll ask you to review though :)
22:59:36 anyway, if folks can review that asap, that would be great
22:59:50 * clarkb reviews
22:59:56 and heads up that i'd like to do the same for https://review.openstack.org/381329 next week
23:00:02 SpamapS: given your past zk experience, i think you reviewing that would be good
23:00:02 #link https://review.openstack.org/381329
23:00:15 thanks everyone!
23:00:17 #endmeeting