22:02:57 #startmeeting zuul
22:02:57 Meeting started Mon Nov 28 22:02:57 2016 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:02:58 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:03:00 The meeting name has been set to 'zuul'
22:03:01 o/
22:03:06 o/
22:03:12 #link previous meeting http://eavesdrop.openstack.org/meetings/zuul/2016/zuul.2016-11-21-22.01.html
22:03:30 #link only slightly inaccurate agenda https://wiki.openstack.org/wiki/Meetings/Zuul
22:03:40 o/
22:03:44 * morgan_ lurks harder
22:03:54 #topic Actions from last meeting
22:04:02 #action jeblair work with Shuo_ to document roadmap location / process
22:04:10 ETHANKSGIVING
22:04:18 exactly
22:04:30 o/
22:04:34 #topic Status updates (Nodepool Zookeeper work)
22:04:51 we didn't *quite* get the new builder into production
22:05:22 with the unofficial day-before-thanksgiving holiday, we really only had 2 days last week
22:05:33 but we still made a lot of progress regardless
22:05:39 nb01.openstack.org does exist now
22:05:44 I heard some disturbing news btw
22:05:47 that only one ZK was running
22:06:08 I want to point out that this will present significant operational challenges.
22:06:11 yes, that is on nodepool.o.o today
22:06:22 yes, we had that conversation in this meeting last week: http://eavesdrop.openstack.org/meetings/zuul/2016/zuul.2016-11-21-22.01.log.html
22:06:25 ZK is not really good at recovering with only one node.
22:06:47 * fungi wonders what applications are really good at recovering from the loss of a spof
22:07:02 no no no.. it's worse than everything else I've dealt with that has on-disk state.
22:07:03 again aiui it's the same situation as today with gearman...
22:07:10 no just igbire recovery
22:07:13 and move on
22:07:29 Unless you're running it in a ramdisk that you clear every time the process starts, it's going to be a _beast_.
22:07:30 i would like to know what igbire was a typo for
22:07:38 because it's an awesome typo
22:07:44 *ignore
22:07:47 o/
22:07:54 I'd also be concerned if we couldn't get ZK working with a single node too, since all of our testing now is single ZK
22:07:55 fungi: lol i was wondering the same thing
22:08:00 clarkb: okay, sense made. thanks!
22:08:14 basically it's not a regression to "fall back" on that behavior
22:08:23 if zookeeper unexpectedly dies for any reason, you'll be left replaying transactions from the last time it successfully gracefully stopped/started.
22:08:31 and you can have more resiliency if you choose to run more
22:08:39 SpamapS: so basically avoid "dirty start" scenarios and make sure if state is lost then it's really completely lost at start?
22:09:06 SpamapS: it has no checkpoint function?
22:09:14 fungi: correct. If the process is killed in any violent way (VM sudden death, segfault, SIGKILL, etc.), you need to clear the on-disk store entirely, or be prepared to wait.
22:09:26 wow. that's awesome
22:09:28 jeblair: It did not 4 years ago.
22:09:33 It may have grown one. I don't know.
22:09:40 The authors explicitly said "Oh, don't do that."
22:09:44 Run 3.
22:09:53 I guess the difference is we don't also store the info in mysql anymore
22:09:53 Or restart a lot.
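For context on the "Run 3" advice above: a three-member ZooKeeper ensemble is configured by listing every member in each server's zoo.cfg (plus a matching myid file in each dataDir). The sketch below is illustrative only; hostnames, ports, and paths are assumptions, not the actual openstack-infra deployment.

    # zoo.cfg -- minimal sketch of a three-member ensemble (hostnames/paths assumed)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # quorum members; a single-node deployment simply omits these lines
    server.1=zk01.example.org:2888:3888
    server.2=zk02.example.org:2888:3888
    server.3=zk03.example.org:2888:3888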
22:09:55 * fungi wonders if harlowja has more recent experiences with such scenarios
22:10:00 who what
22:10:08 well, if it's not possible to run with one, then we probably need to drop zk and use something else
22:10:19 recovering modern versions of zk from a dirty shutdown
22:10:21 because all-in-one is an explicit design goal
22:10:47 yah. I thought the risk of "one" was just "if you crash, the system won't be up because you crashed" - which is fine for one node
22:11:05 Ah they added snapCount
22:11:09 but if the failure case is "after all crashes in single node you can expect to wait for a complete transaction log replay" - that is not fine for one node
22:11:11 ok, so set snapCount low for single-server
22:11:18 fungi no such experience from me :-P
22:11:29 harlowja: darn. thanks for jumping in anyway!
22:11:32 (apologies, my information is from 2012.
22:11:32 np
22:11:33 ha
22:11:33 )
22:11:35 SpamapS: woot!
22:11:42 SpamapS: I'm _very_ glad your info is out of date
22:11:46 me too
22:11:54 because that was a long 9 hours to recover the juju database for UDS Copenhagen.
22:11:55 SpamapS: is snapCount in the zookeeper config?
22:11:55 Yay for no rewrite
22:11:57 yay we don't have to start over (yet) :)
22:12:03 jeblair: ++
22:12:03 mordred: it is
22:12:10 SpamapS: cool. also - yay 9 hours
22:12:15 SpamapS: can I assume you were ... not happy ? :)
22:12:18 perhaps u guys want to email the zookeeper ML
22:12:23 #link https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_configuration
22:12:25 2012 was a while ago :-P
22:12:29 i also wonder if we'll be stashing nearly the amount of raw state or churn into nodepool zk as the uds juju db had
22:12:46 fungi: not at first, but possibly later on
22:12:49 http://zookeeper.apache.org/lists.html :)
22:12:58 mordred: I was meh, but elmo was very.. very sad.
22:13:17 fungi: once we put nodes into it, and later, zuul builds
22:13:19 anyway, n/m ignore me
22:13:28 single server should be fine with lowish snapcount
22:13:40 100,000 appears to be the default
22:13:48 This says 10,000
22:13:53 jeblair: okay, i still have no basis for comparison to know if those are in a similar order of magnitude to whatever uds was doing unfortunately
22:13:55 well, we learned something we should pay attention to when we build all-in-one deployment tooling
22:14:04 But I'd say let's play with it a bit
22:14:19 +
22:14:22 ++
22:14:29 fungi: er, yeah, let's assume i revise my statement to somehow drop the comparison part and just express relative growth of our use of zk. :)
22:14:36 "dirty shutdown" will certainly be a fun scenario to test
22:14:48 set it to 1k and we'd still only snapshot once an hour on average with test instances in zk
22:14:51 my local testing is all-in-one right now, I can try setting snapcount and killing things
22:15:35 i think we need to set up some sacrificial servers running it and then take a hatchet to their innermost circuits
22:15:48 just to be really, really sure
22:15:54 fungi: i suggested that last week :)
22:16:04 clearly i'm channeling you
22:16:15 * fungi has a side job channeling the living
22:16:25 fungi: i am mostly dead
22:16:29 pretty easy to automate. kill -9 is about as dirty as you can get without offending somebody. ;)
22:16:57 clarkb: that's probably fine. the number of transactions potentially being replayed is the real problem, not the frequency of snap
22:17:00 SpamapS: explain "without offending somebody" ... I've never accomplished that in real life
22:17:24 mordred: I'm offended by that.
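snapCount is a server-side setting in the same zoo.cfg documented at the link above; it bounds how many transactions ZooKeeper logs before writing a snapshot, which in turn bounds how much log must be replayed after a dirty shutdown. A minimal sketch of the tweak being discussed, using the "set it to 1k" figure floated in the meeting rather than a tested recommendation (the 10,000 and 100,000 defaults quoted above appear to come from different ZooKeeper versions):

    # zoo.cfg -- snapshot after fewer transactions so a dirty restart
    # replays a shorter log (value from the meeting discussion, tune to taste)
    snapCount=1000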
22:17:33 having a transaction-based checkpoint option rather than time-based might be nice
22:17:46 but we can always calibrate
22:17:51 SpamapS: I am sure there is some trade-off to be made depending on performance requirements but I just don't think we are in such a situation
22:17:57 mordred someday u will
22:18:01 re: nb01.o.o, it would be great to land https://review.openstack.org/#/c/403869/ today, then we should be ready to run nodepool-builder on the server. I've added the cinder volume already
22:18:03 worst case you start without data, and repopulate from cloud api
22:18:10 so nb01.o.o exists but isn't quite running yet -- pabelanger has kindly agreed to take over driving that so i can make sure i'm available to review zuul patches
22:18:32 clarkb: no, _worst_ case you start without data and let it clean up all the leaked alien nodes/images
22:19:00 fungi: no, it _is_ transaction based. So setting it 10x lower is the right solution.
22:19:14 SpamapS: oh! i misread. so yes, it is what i was hoping for
22:19:27 clarkb: agreed. If we get to high-perf it might also make more sense to have 3 since downtime will likely be costing us more too.
22:19:50 and the clients are really good at detecting and failing over.
22:19:58 SpamapS: yeah, i think we do want to move to 3 eventually, but we want to dog-food one while we still can (and we don't care about the spof issue)
22:20:23 by the time nodepool itself is no longer a spof, even i will want to run 3 :)
22:20:25 yes, having a resilient cluster for large/high-volume deployments sounds fine
22:20:50 shouldn't be much to stand up the other 2 servers too, the puppet-zookeeper module looks to support it
22:21:14 but being unable to effectively set up an all-in-one deployment for "small" or test sites is also something we want to be possible
22:21:33 s/unable/able/
22:21:34 fungi: i think i agree with what you were trying to say there :)
22:21:44 * fungi spliced sentences in his head again
22:22:38 ++
22:22:56 so if folks can heed pabelanger's request to quickly review deployment-blocking changes, we should be able to start running this soon and get actual experience with it
22:23:09 Shrews, pabelanger: anything else about nodepool-zk?
22:23:42 jeblair: it would be good to finish our pause build / upload logic this week
22:23:47 if possible
22:23:54 is there a topic to focus on?
22:24:00 pabelanger found a json exception failure that disturbs me greatly. i have no explanation for it as it should not be possible
22:24:05 (a gerrit topic I mean)
22:24:26 should be the one indicated in the spec.
checking
22:24:43 fungi: well, we switched to just using feature/zuulv3 branch, specwise
22:24:54 so we can set a topic for deployment things if we want
22:24:58 oh, right-o
22:25:14 and http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-zookeeper-workers.html doesn't actually have the part from the template where a topic is documented
22:25:17 but right now, it's just one change i think
22:25:39 fungi: was replaced with http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-zookeeper-workers.html#gerrit-branch
22:25:53 Ya, just 403869 right now
22:25:53 yep, thanks
22:26:01 (which already has one +2 so it's close :) )
22:26:24 branch:feature/zuulv3
22:26:39 yeah, i mostly wanted to make sure people were aware that pabelanger may come with further requests like that :)
22:26:44 ...is what we have in our priority efforts query
22:26:49 400970 should also land before a production run
22:27:30 Shrews: probably a good idea, yeah :)
22:27:38 #link https://review.openstack.org/403869
22:27:38 * clarkb adds that to the list
22:27:40 Shrews: ack, will look
22:27:43 #link https://review.openstack.org/400970
22:28:07 I like the shade error on the integration test for that
22:28:17 wee floating IPs
22:28:21 yah
22:28:30 clarkb: that's happening more frequently now
22:28:41 #link https://review.openstack.org/#/q/status:open+AND+branch:feature/zuulv3
22:28:41 clarkb: like, frequently enough that we may need to investigate it for real
22:28:59 mordred: awesome
22:29:15 clarkb: yah. that's one word for it
22:29:43 mordred: like, a problem crept into nova/neutron?
22:30:18 oh and now it's apparently in merge conflict
22:30:20 Shrews: ^
22:30:20 s/crept/stumbled drunkenly while carrying a battleaxe/
22:30:30 clarkb: fixing
22:30:50 well, let's move on...
22:30:51 #topic Status updates (Zuul test enablement)
22:31:27 there are many patches! i *think* i'm caught up on reviews for these now
22:31:31 yay!
22:31:58 if i missed something, or anyone needs me to pitch in on something, please let me know
22:33:10 jeblair: https://review.openstack.org/400836
22:33:49 Shrews: yeah, i'm almost, but not quite, caught up on nodepool patches
22:33:52 jeblair: an opinion on https://review.openstack.org/#/c/400003/ - but it's not urgent
22:33:56 just needs a +2. we can figure out the positive alien test case later
22:34:42 yay for patches merging
22:35:07 I still have a few in merge conflict, I'll try and clean them up tonight / tomorrow
22:36:16 jamielennox: yeah, i can do that -- that's also similar to another thing that came up recently -- i think it was the path to clouds.yaml so that the cli commands could work correctly...
22:36:31 jamielennox: is there a reason you added that on the master branch though, instead of zuulv3?
22:36:42 also pabelanger has comments on it
22:36:44 (that is the reason i did not see the change)
22:37:28 jeblair: not specifically, it applies to both and figured it would get merged in but i probably should have done it on v3
22:37:40 Ya, could have used that patch recently :) have diskimage-builder in a different venv, but ended up writing a wrapper script to properly source things
22:38:09 but, like the idea of defining the location of disk-image-create
22:38:11 * clarkb uses symlinks to solve this problem fwiw
22:38:15 ours is similar but we're running nodepool from systemd via the ../venv/bin/ path and so it has no PATH to dib
22:38:17 works great for virtualenv and git-review
22:38:30 jamielennox: yup that's exactly the solution ^
22:38:40 jamielennox: I do the same, we should compare things :)
22:38:42 i do exactly the same for _everything_ i pip install
22:39:05 yea, can always symlink it into /bin or currently we're modifying the PATH in the unit, but this just seemed easier
22:39:19 heck, i have ~/bin/pip as a symlink to ~/pyenvs/pip/bin/pip where the latest version of pip is installed
22:39:23 We could also expose things using update-alternatives
22:39:41 there's a bunch of ways :) i figured i'd float this and see what people thought
22:39:51 (which puts things in the path)
22:40:09 * mordred likes the jamielennox patch - but that's probably clear because of the +2
22:40:18 (though the example makes more sense with ~/bin/virtualenv symlinked to ~/pyenvs/virtualenv/bin/virtualenv which i use to create all the other virtualenvs)
22:40:20 jamielennox: huh, dib should be in the venv, I must be missing something...
22:40:22 i definitely think we should be able to configure things like this. i think the ongoing tension is whether it should be in nodepool.yaml or a different file.
22:40:30 but I can check that out later
22:40:38 jeblair: ++
22:41:04 greghaynes: the venv isn't activated, we're just running the python out of the venv directly and dib is being invoked as an application, not a python module
22:41:32 yah. that would do it for sure
22:41:37 ah. There's a thought that in the (very near) future dib will have a python api
22:41:44 it's part of v2
22:41:48 jlk: ya, that would be good too. I should try that in my local env
22:41:50 I guess my only concern is that we don't bake in a bunch of functionality that already exists in the OS (basically avoid redundant tooling)
22:41:51 it's worth noting that in openstack's case, we have a configuration/content separation by way of the system-config and project-config repos. project-config repo reviewers review 'content' like what things are installed in what diskimages, and what clouds are in use.
22:42:03 so yes I agree you should be able to configure this, and you can via $PATH
22:42:10 clarkb: I have to agree with you there. Setting PATH is a pretty standard thing.
22:42:29 yeah, with dib v2 you could conceivably "import diskimage_builder" and run the main() from python
22:42:35 yep
22:42:42 That we have PATH insanity because of virtualenvs is a relatively new idea.
22:42:57 clarkb: Agree, if people are opposed to adding it to nodepool.yaml, symlinks or PATH is a great option too
22:43:20 but I don't feel strongly enough to prevent anyone from adding that to nodepool
22:43:43 yep, there's a bunch of deploy-specific ways to solve this - i don't mind what we do, just thought i'd propose it
22:43:44 Same
22:43:51 nodepool has so little configuration that isn't content that nearly everything is in nodepool.yaml.
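The two deploy-side workarounds being compared above (symlinks versus setting PATH in the unit) look roughly like the sketch below. All paths and the service/drop-in names are hypothetical examples, not the actual deployments being described.

    # Option 1: symlink the venv's disk-image-create onto the default PATH
    ln -s /opt/nodepool/venv/bin/disk-image-create /usr/local/bin/disk-image-create

    # Option 2: extend PATH in the systemd unit that starts nodepool-builder,
    # e.g. via a drop-in such as /etc/systemd/system/nodepool-builder.service.d/path.conf
    #   [Service]
    #   Environment="PATH=/opt/nodepool/venv/bin:/usr/local/bin:/usr/bin:/bin"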
i'm okay with adding non-secret configuration to nodepool.yaml. but likely the more of it that is more "system" focused rather than "project" focused may push me toward moving that to its own file.
22:44:48 jamielennox: there is sort of the question of why dib can't be installed in the same virtualenv as your nodepool-builder ... it's kind of odd to have them split?
22:44:49 but even today, we have the zmq and zk servers in there, so it's already a mix of the two.
22:45:04 jeblair: ya
22:45:18 could just have nodepool take a list of conffiles and merge the yaml-parsed dict (what to do with duplicate keys is the main concern there)
22:45:33 seems like the patch deserves discussion in the review
22:45:40 ianw: (indeed it should be already -- it's a dependency)
22:45:45 that would allow anyone to split up their configuration along whatever lines make sense
22:46:11 (not that I'm not enjoying this discussion.. but this does feel like an IRC review of the patch. :)
22:46:13 though this is all straying pretty far from the topic of reenabling zuul tests
22:46:36 ianw: I have tested nodepool and diskimage-builder in the same venv, the issue arises if you don't source the venv first and just call ./venv/bin/nodepool-builder, diskimage-create not in path
22:47:18 any other zuul test enablement status updates?
22:47:52 #topic Progress summary
22:47:58 pabelanger: ok ... let's #zuul this
22:48:18 SpamapS: what did you have in mind for this part of the agenda?
22:48:33 i don't think we've actually exercised this since our agenda-brainstorm
22:49:01 jeblair: A quick rundown of the board and a chance for people to review it and speak up if they want to move things around.
22:49:06 https://storyboard.openstack.org/#!/board/41
22:49:22 jeblair: yeah I have been dealing with meatspace things. ;)
22:49:27 #link https://storyboard.openstack.org/#!/board/41
22:49:48 SpamapS: my thing in progress is actually done
22:49:56 So, if I can ask everyone to just take a look at that board, and consider whether anything needs to be added, removed, or moved.
22:49:59 Shrews: woot
22:50:10 Shrews: moved
22:50:17 i'll move the devstack-gate roles refactoring to in-progress
22:50:24 i have a long list of dependent changes now
22:50:33 and pabelanger also did some stuff on that iirc
22:51:19 rcarrillocruz: I just added you as a user of the board, so you should be able to move things now.
22:51:34 cool, thx
22:51:59 rcarrillocruz: Yes, I've seen your patches. Want to do some reviews on that, maybe work with clarkb to see how we can run them today
22:52:03 Shrews: i think phschwartz is 'in-progress' on 2000770
22:52:24 rcarrillocruz: pabelanger random scan of that shows they fail a lot
22:52:24 feels like the general story of "nodepool changes" needs to be fleshed out and maybe moved to in progress?
22:52:37 jeblair: I am. I have implemented the base of a DAG locally and will be pushing a WIP up soon.
22:52:49 I guess that's further up the stack
22:52:58 reviewing the state of the "Zuulv3 Operational" board seems like an excellent way to so the progress summary portion of the agenda. great idea
22:52:59 jeblair: that's for SpamapS, i guess
22:53:07 s/so/do/
22:53:10 yeah, working on them, i'll ping you later on what is good to review for now
22:53:13 jeblair: which one is 2000770 .. it's hard to find a number on that board. ;)
22:53:21 Shrews: yep, I got S'd
22:53:31 SpamapS: i think phschwartz is 'in-progress' on 2000770
22:53:37 rcarrillocruz: one quick comment, these changes don't actually seem to use the new playbooks, can you organize it so that every change is self-testing? I don't want to review and merge a bunch of dead code
22:53:42 SpamapS: could you add me to the board as well please so I can track the branch merging progress
22:54:11 rcarrillocruz: or am I missing something important?
22:55:07 SpamapS: well, story 768 is referring to the next phase of zuul-nodepool work which we are not yet ready to start
22:55:07 SpamapS: it is the dependency graph work.
22:55:33 jeblair: OH.. so the stuff going on now isn't that? Ok, I'll move it back to backlog.
22:56:04 jhesketh: added
22:56:11 thanks :-)
22:56:19 clarkb: i started doing roles in independent changes, then created the 'ansibly' changes, that actually depend on those role changes and replace code from d-g bash
22:56:25 phschwartz: I need a title
22:56:33 rcarrillocruz: I'd prefer we don't do it that way, it's too hard to review
22:56:34 but i can do everything self-testing by merging them
22:56:34 or was it not even in the board yet?
22:56:48 rcarrillocruz: I would make each thing its own change that adds the playbook and uses it
22:56:56 SpamapS: phschwartz dag work is titled "Forward port..."
22:57:01 SpamapS: should probably be retitled :)
22:57:06 d-g is self-testing so you should be able to see upfront what does and doesn't work
22:58:04 Ah ok
22:58:25 SpamapS: and yeah, the stuff now is nodepool-builder. the next thing is nodepool-launcher along with updated zuul-nodepool protocol. next step in that is to refresh/approve this spec: https://review.openstack.org/305506 but we want to really run nodepool-builder first so we have a chance to make any changes based on real-world use of zookeeper
22:58:26 phschwartz: I assigned the task to you and marked it in progress. It would help if you can reference the story: and task: in commit messages. :)
22:58:44 SpamapS: will do.
22:58:48 rcarrillocruz: ok I see how this works, I think it would be easier to grok if we made each thing enable + new playbook
22:59:12 jeblair: ok I'll try and update that story a bit to explain what it is.
23:00:14 SpamapS: i think 767 is the story for current nodepool work
23:00:18 jeblair: I added "Make job trees into graphs" to 'todo'.
23:00:24 jeblair: k I'll add that too
23:00:26 we're running out of time
23:00:30 anything else urgent?
23:00:35 I want to let peple go
23:00:36 people
23:00:43 thanks everyone!
23:00:47 #endmeeting