19:04:34 #startmeeting infra
19:04:35 Meeting started Tue Feb 12 19:04:34 2013 UTC. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:04:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:04:38 The meeting name has been set to 'infra'
19:05:00 Wiki: Meetings/InfraTeamMeeting (last edited 2013-01-29 05:30:39 by JamesBlair)
19:05:10 i feel like i've seen this episode before
19:05:24 * fungi pulls up the action items list
19:06:02 clarkb start discussion on long term log archival options when jeblair gets back
19:06:10 i guess that can happen now
19:06:31 yes, sort of started it yesterday but not in much detail
19:06:32 o/
19:06:41 it's a jeblair!
19:06:42 should we have that discussion here and now?
19:07:09 up to you guys. the itinerary is short and mostly checkboxish
19:07:17 may as well then
19:07:30 our test log growth is not linear
19:07:37 to say the least
19:08:01 so i was wondering how much logstash could be a complete replacement for statically storing logs
19:08:09 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=309&rra_id=all you can see the curve at the bottom of that page
19:08:12 and clarkb said it may not be a good one
19:08:21 clarkb: can you elaborate on why?
19:08:37 jeblair: logstash's purpose is to index and give quick access to your logs
19:08:49 this makes it a bad way to archive your logs
19:09:03 clarkb: but doesn't it have a complete copy of them and present them to you through a web ui?
19:09:14 so it's only useful for recent/small log quantities i guess?
19:09:43 yes, however if we were to keep 100GB of logs in logstash I get the feeling we would need a much larger elasticsearch cluster to handle the increased workload
19:10:07 I don't think we want to sacrifice index speed and usability for archival functionality
19:10:37 clarkb: ok, so you're suggesting we maintain a smaller logstash system for searching the last x days of logs
19:10:43 o/
19:10:44 yeah
19:10:49 oh, guess i forgot to /topic
19:10:57 #topic log archival
19:11:05 clarkb: and keep 6 mos of logs in a static setup?
19:11:18 clarkb: (it's easy to delete >x days of logs from logstash?)
19:11:20 the way logstash is configured to index by default suggests that this is the normal operating behavior (each day gets its own index and you want to keep the number of indexes down)
19:11:50 jeblair: yes and yes. To make deleting logs from logstash easy you set it to use a timebased index of some sort
19:12:02 jeblair: then you have a cron or similar look for old indexes and delete them
19:12:05 gotcha
19:12:49 clarkb: so for the zuul reports in gerrit, we'd probably need to link to the static archiving solution
19:12:52 o/
19:12:59 clarkb: since the logstash one will disappear
19:13:17 jeblair: I think so. But probably with a note that you can search recent job runs in logstash and link to logstash
19:13:18 but maybe we can link to logstash from the static reports
19:13:21 and so we'll still need some separate interface to anonymously browse older archived logs. does cloudfiles have that built in?
19:13:49 yeah, we're relying on the apache mod_index for that now
19:14:08 i don't think there's such a thing for cloud files
19:14:14 maybe notmyname has something ...
19:14:17 notmyname, mordred?
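To make the index-rotation approach clarkb describes above concrete, here is a minimal sketch of the kind of cleanup cron that could sit alongside logstash. It assumes logstash's default daily index naming (logstash-YYYY.MM.DD) and an elasticsearch HTTP endpoint; ES_URL and RETAIN_DAYS are placeholders rather than the settings the team actually used.

```python
#!/usr/bin/env python
# Hypothetical cleanup cron for time-based logstash indexes.
# Assumes daily indexes named logstash-YYYY.MM.DD and an elasticsearch
# HTTP endpoint; ES_URL and RETAIN_DAYS are placeholders.

import datetime

import requests

ES_URL = 'http://localhost:9200'
RETAIN_DAYS = 14

cutoff = datetime.date.today() - datetime.timedelta(days=RETAIN_DAYS)

# /_aliases returns a JSON object keyed by index name.
for index in requests.get('%s/_aliases' % ES_URL).json():
    if not index.startswith('logstash-'):
        continue
    try:
        day = datetime.datetime.strptime(index, 'logstash-%Y.%m.%d').date()
    except ValueError:
        continue  # not a dated logstash index; leave it alone
    if day < cutoff:
        # Dropping a whole dated index is far cheaper than expiring
        # individual documents, which is the point of daily indexes.
        requests.delete('%s/%s' % (ES_URL, index))
```

Deleting an entire dated index is much cheaper than expiring individual documents, which is why the one-index-per-day convention mentioned above makes retention easy.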
19:14:23 just saw my name
19:14:45 notmyname: last 8 lines of scrollback - discussing putting build logs in swift - wondering about indexes
19:15:02 notmyname: http://logs.openstack.org/21691/1/check/gate-tempest-devstack-vm-full/2081/
19:15:23 if you use the pseudo directory naming structure I suggested yesterday (to clarkb?), then you have listing support
19:15:56 combined with staticweb (which cloud files supports), you can even get "pretty" web pages
19:16:04 notmyname: oh, so if you GET a .../job-id/ url, you get an index page?
19:16:19 jeblair: ya, it can be configured to do so
19:16:29 notmyname: cool, that sounds like exactly what we need then
19:16:32 thx!
19:16:35 perfect
19:17:03 yay cloud
19:17:17 so we should be able to use jclouds to put these things in cloud files...
19:17:18 have we exhausted that topic for the moment? action items coming from it?
19:17:22 the second thing we need to sort out is having jenkins use swift for the logs. in theory we can rely on jclouds for that. in practice clarkb had a bad experience last week
19:17:42 ahh, right
19:17:48 then clarkb can watch the zeromq stream to fetch those things and shove them in logstash
19:17:52 you said there was a patch forthcoming from jclouds
19:18:01 clarkb: oh :(, what's the skinny on that?
19:18:05 the current release of the jclouds plugin is broken for blob storing. this has been fixed in master.
19:18:30 second, that fixed commit dumps your cloud key/password to the console log when it gets a 401 from the cloud provider
19:18:47 there is a potential fix for that at the tip of master but i have yet to find time to test it
19:18:52 my experience suggests that errors from cloud providers are frequent.
19:18:59 oh, right. lots of notfun in that
19:19:21 so we will just need to be careful and defensive about how we test the jclouds blobstore
19:19:32 abayer in #jclouds has been super helpful though
19:20:06 clarkb: yeah, we don't need to rush this. :)
19:20:35 we now have more space at least
19:21:51 okay, so next action items, or was there more on that one?
19:22:09 fungi: i'm good
19:22:20 I think that is it
19:22:30 #topic wiki stuffs
19:22:35 moin sucks
19:22:39 * annegentle waves
19:22:45 * annegentle agrees with mordred
19:22:52 next action items were the date change to saturday and annegentle sending an updated announcement?
19:23:09 fungi: I wasn't sure if the date really changed so I didn't send anything :)
19:23:19 to answer annegentle's question in email, no, i'm not critical. :)
19:23:34 jeblair is so critical
19:24:04 so then it's sticking with sunday after all?
19:24:58 ryan confirmed saturday was ok
19:25:41 (i'm so sorry!)
19:25:43 yeah we are saturdaying
19:26:00 annegentle: oh, you didn't get the email from ryan? I completely missed that this didn't go out
19:26:07 my bad
19:26:14 so if it's saturday, then we need a last-minute update announcement i guess
19:26:17 clarkb: ohh ok
19:26:23 yeah that's how I missed it
19:26:27 sure, I'll send now
19:26:32 annegentle: thanks!
19:27:06 so who all is planning to be around for the cut-over on saturday then?
19:27:14 * jeblair plans to be around
19:27:21 yeah. me too
19:27:30 should be
19:27:39 not me (I'll be out of town for the holiday weekend)
19:27:45 * ttx will probably be jetlagged but ~present
19:28:10 o/
19:28:15 #action jeblair, clarkb, olaph, Ryan_Lane, mordred, annegentle, ttx, fungi work on wiki upgrade 2013-02-16
19:28:53 any other wiki-related notes while we're on the topic?
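Backing up to the Cloud Files piece above: a hypothetical sketch of the container setup notmyname describes, using python-swiftclient. Pseudo directories fall out of '/' in object names, and directory-style listings come from the staticweb middleware's container metadata; the auth URL, credentials, and container name are placeholders.

```python
# Hypothetical sketch: anonymous, browsable log container via staticweb.
# Auth URL, credentials, and container name are placeholders.

from swiftclient.client import Connection

conn = Connection(authurl='https://swift.example.com/auth/v1.0',
                  user='account:user', key='secret')

# Allow anonymous reads and listings, and turn on staticweb listings.
conn.post_container('logs', {
    'X-Container-Read': '.r:*,.rlistings',
    'X-Container-Meta-Web-Listings': 'true',
})

# An object named like this shows up under .../job-id/ in the listings,
# with the '/' separators acting as pseudo directories.
with open('console.html', 'rb') as f:
    conn.put_object(
        'logs',
        '21691/1/check/gate-tempest-devstack-vm-full/2081/console.html',
        contents=f,
        content_type='text/html')
```

With anonymous reads and listings enabled, a GET on a .../job-id/ style prefix can return an index page, which is the behavior jeblair was asking about.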
we're missing a Ryan_Lane in here i guess
19:29:34 * mordred just saw Ryan_Lane an hour ago - he was leaving on his way to work ...
19:29:40 we can catch up with him later in the week if there are last-minute issues i suppose
19:29:48 * zaro is out of town.
19:30:13 #topic python-swiftclient upload
19:30:17 mordred to upload python-swiftclient 1.3.0
19:30:20 that happened, right?
19:30:25 yup
19:30:33 okay. that's it for old business
19:30:36 I used the jenkins jobs too - so they have been tested
19:30:42 new business...
19:30:57 # cla
19:31:02 er
19:31:05 #topic cla
19:31:30 funny - we're just now talking about CLAs at the board meeting
19:31:38 no real news on the cla stuff. basically ready, pending last minute reviews (i'll hit people up in the next week for reviewing after a rebase)
19:31:51 at LCA we chatted with notmyname about it, and realized he was unaware of the change...
19:31:58 still on track for 2013-02-24 cut-over
19:32:00 mordred: are we getting rid of CLA? ;)
19:32:05 jeblair: nope
19:32:12 jeblair: fungi should we send another announcement to the mailing list?
19:32:15 so we discussed that we should do some extra communication...
19:32:20 probably in a different form
19:32:21 yes, agreed
19:32:21 yeah. more specific
19:32:27 less explanatory
19:32:36 "you will need to re-sign the CLA on X"
19:32:43 ttx: can you make sure the PTLs are all aware of this at the project meeting?
19:32:57 jeblair: sure
19:32:59 #action fungi draft and send more explicit/cautionary announcement about cla cut-over
19:33:10 #action ttx discuss cla at next project meeting
19:33:19 we can ask lauren/stef to use blog/community newsletter/etc to disseminate it as well
19:33:27 fungi: any chance you can send that before the meeting, so that I can reference your post ?
19:33:35 #action fungi hit up infra core team for reviews
19:33:40 ttx: sure
19:33:51 remind me when the project meeting happens?
19:33:57 i can look it up if necessary
19:34:11 i can join in when that topic comes up too
19:34:15 fungi: then maybe pass that announcement to lauren and stef for them to process
19:34:39 jeblair: awesome. sounds like a great idea
19:34:46 finally, there's one other option we should consider: we _could_ email everyone in gerrit.
19:35:00 spamtastic
19:35:07 i'm a little edgy about that
19:35:19 i _think_ the other things we've discussed should be enough...
19:35:29 they'll notice when they go to upload something
19:35:59 so i'd propose that we do those, and only if we still think no one knows about it a week before the cutover, should we spam.
19:36:15 (regardless, we should send at least one more announcement closer to the cutover)
19:36:28 (ml, i mean, not spam)
19:36:33 wiki tells me the project/release meeting is Tuesdays at 2100 UTC so i'll see about getting that announcement to the -dev ml in the next hour after this meeting
19:36:45 if that's what ttx was suggesting
19:37:05 yep
19:37:11 mordred: indeed, i think that's a reasonably mitigating factor. the error has its own instructions for rectification.
19:37:15 otherwise I'll just point to the date
19:37:29 yes. the error messages are quite explicit
19:37:38 urls and all
19:38:04 okay, anything else on cla stuff?
19:38:14 nope
19:38:20 #topic jenkins slave operating systems
19:38:50 i think we covered a couple items under this topic umbrella previously... jclouds and quantal
19:39:25 quantal is working for static slaves but i'm having trouble with jclouds'ing it.
asked on #jclouds but nothing too helpful yet
19:40:17 short story is i can launch slaves from jclouds with the ubuntu version specified as 11.10 or 12.04, but if i change that to 12.10 i get an immediate java exception about not matching any images
19:40:36 so i think it's an image metadata issue in rackspace at this point
19:40:47 YAY!
19:41:08 also jclouds-related, clarkb: you had some issues with slaves not deleting right?
19:41:21 maybe
19:41:30 fungi@ci-puppetmaster:~$ nova list|grep -c jclouds
19:41:32 16
19:41:39 so yeah
19:41:48 i don't see anywhere near that many in jenkins (like maybe only 1)
19:41:57 fungi: 1h20min from now
19:42:21 ttx: yep. thanks! i checked the wiki pretty much immediately anyway
19:42:25 * ttx is answering asynchronously
19:42:31 probably the same thing we work around in devstack-gate...
19:42:46 nova returns a 200 for the delete api call and then does not delete the server
19:43:14 that's pretty neat
19:43:20 somehow this is apparently not a bug
19:43:35 s/bug/money making opportunity for providers/
19:43:45 except in our case. :/
19:43:50 indeed
19:44:42 anyway, if there's nothing much else on that, we should probably jump into devstack/tempest/gating changes et cetera
19:45:04 #topic devstack, tempest and gating changes
19:45:05 notabug: "Empowering the ecosystem"
19:45:14 ttx: heh
19:45:23 enriching the ecosystem
19:45:57 okay, so there's been some more improvements to test scope, runtimes and also proposed efficiency improvements for the gating pipeline?
19:46:09 anyone want to discuss high points on that?
19:46:56 fungi: is 'efficiency improvements' https://review.openstack.org/#/c/20948/ ?
19:47:17 jeblair: that looks like one of them
19:47:25 fungi: another?
19:47:29 clarkb: also has a wip one i think
19:47:51 and we just put something through to make expensive tests dependent on cheap ones
19:48:04 though only for the check pipeline, not gate
19:48:08 fungi, clarkb, link?
19:48:13 to clarkb's change
19:48:14 * fungi is looking
19:48:24 i saw mordred's pep8 change
19:48:35 #link https://review.openstack.org/21723
19:49:51 mordred: for https://review.openstack.org/#/c/21267/
19:49:57 mordred: what problem does that solve?
19:50:35 jeblair: https://bugs.launchpad.net/zuul/+bug/1123329
19:50:37 Launchpad bug 1123329 in zuul "Zuul should remove unmergable changes from queue as soon as possible." [Undecided,New]
19:51:31 clarkb: i like the sound of that
19:51:31 jeblair: with the increased gate time per change unmergable changes end up serializing zuul more than is necessary
19:51:51 i believe the idea behind 21267 was to avoid burning slave time on tempest tests (which are upwards of an hour or two at this point) if the tests which take <5 minutes don't pass anyway
19:52:48 fungi: yeah, but we're not running low on devstack/tempest slaves, yeah? and with jclouds, it seems like we shouldn't be running low on regular slaves either, in general...
19:53:01 fungi: so what that change does is optimize for the case where pep8 fails
19:53:36 it means that if your change fails pep8, the devs are notified very early
19:53:50 a lot of changes fail pep8
19:53:56 but if your change passes pep8, it now takes runtime(tempest)+runtime(pep8) to be notified
19:54:09 or pyflakes in some cases, but yes.
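As an aside on the slave-deletion issue earlier in this topic ("nova returns a 200 for the delete api call and then does not delete the server"), the devstack-gate style workaround amounts to re-issuing the delete and polling until the server is really gone. A hypothetical sketch with python-novaclient follows; the v1_1 client constructor is version-dependent and the credentials are placeholders.

```python
# Hypothetical workaround for servers that survive an acknowledged delete:
# keep re-issuing the delete and polling until nova reports the server gone.
# The client constructor is version-dependent; credentials are placeholders.

import time

from novaclient import exceptions
from novaclient.v1_1 import client

nova = client.Client('USERNAME', 'API_KEY', 'TENANT',
                     'https://identity.example.com/v2.0/')


def delete_until_gone(server_id, attempts=10, delay=30):
    """Return True once the server is confirmed deleted."""
    for _ in range(attempts):
        try:
            nova.servers.delete(server_id)
        except exceptions.NotFound:
            return True  # already gone
        time.sleep(delay)
        try:
            nova.servers.get(server_id)
        except exceptions.NotFound:
            return True  # delete finally took effect
    return False  # still listed; needs manual cleanup
```

A False return is the case where stray nodes (like the 16 leftover jclouds servers counted above) pile up and need manual cleanup.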
more generally it was to avoid long-running tests if the patch isn't syntactically correct python (style checks are just a bonus there)
19:54:43 yeah, i'm just wondering why that's desirable -- is it because we want to notify people that changes fail pep8 fast (at the cost of making successful changes take _longer_ to notify)
19:54:54 or is it to solve a slave resource contention problem
19:55:16 and we're still running short on static slaves for the moment, until jclouds is in better shape
19:55:42 the devs who were chatting with us (sdague was an active one) didn't seem to mind the extra 5ish minutes compared to getting the reject quicker
19:55:47 I see it as reducing unnecessary load on the slaves and to encourage devs to run tox -epep8 before pushing
19:55:59 jenkins is becoming everyone's personal test box
19:56:22 alternately ... it might be interesting to add a feature to zuul to cancel and dequeue remaining jobs if one of them fails
19:56:41 yeah, though looking at the zuul queue, the static slave tests are running ahead of the devstack tests, which is the main thing
19:56:43 which could get us to a place where we get the canary benefit without the serialization concern from jeblair
19:57:03 mordred: yeah, that's kind of where i was heading
19:57:06 we wind up starving the gate pipeline of available static slaves under heavy activity periods right now, and at least a significant percentage are spinning on failed unit tests for the check pipeline
19:57:45 so yes, anything to help that situation would be an improvement
19:58:08 mordred: we wanted to return as much info to devs as quickly as possible
19:58:17 ++
19:58:18 mordred: i think that's still desirable
19:58:23 I agree
19:58:43 mordred: but clarkb has a good point that people are just throwing shit at jenkins and seeing what sticks
19:58:57 admittedly, he didn't quite put it like that. but i will. :)
19:59:02 :)
19:59:26 and we're about out of time
19:59:39 right. which is why I think early fail in check queue is helpful
19:59:48 so anyway, yeah, let's think about short-circuiting the whole test run if some tests fail
19:59:51 real quick, I made a wiki page: http://wiki.openstack.org/InfraTeam
19:59:52 but still start them all in parallel
20:00:04 jeblair: +1
20:00:09 at least for gate
20:00:16 #topic general
20:00:20 #link http://wiki.openstack.org/InfraTeam
20:00:26 jeblair: agree
20:00:31 thanks fungi
20:00:43 thank you pleia2!
20:01:00 okay, i'll go ahead and shut this down so ttx can have the channel for tc
20:01:05 #endmeeting