19:02:59 #startmeeting infra
19:03:01 Meeting started Tue Oct 15 19:02:59 2013 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:04 The meeting name has been set to 'infra'
19:03:39 <- lurking
19:03:57 o/
19:04:06 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:04:26 #link http://eavesdrop.openstack.org/meetings/infra/2013/infra.2013-10-08-19.01.html
19:04:59 o/
19:05:11 o/
19:05:15 #topic Actions from last meeting
19:05:25 #action jeblair move tarballs.o.o and include 50gb space for heat/trove images
19:05:49 ok, so i still didn't do that. :( sorry. the 'quiet' periods where i think i'm going to do that keep being not quiet
19:06:06 but at least i think it's still low priority; i don't think it's impacting anything
19:06:14 agree
19:06:22 clarkb announced and executed the etherpad upgrade!
19:06:34 clarkb: thank you so much for that
19:06:42 \o/
19:06:58 it seems to be holding up so far too
19:06:59 i'm unaware of any issues, other than needing to remind people to force-reload occasionally
19:07:02 it's awesome! i can make text permanently monospace now and don't have to keep resetting my personal prefs instead
19:07:14 ++
19:07:17 also we theoretically have bup backups for that host now, but I haven't done a recovery yet
19:07:21 * mordred enjoys our new etherpad overlords
19:07:40 yes, also headings now enabled, which makes organizing pads nicer
19:08:37 #topic Trove testing (mordred, hub_cap)
19:09:13 hello
19:09:17 mordred, hub_cap: what's the latest here, and are you blocked on any infra stuff?
19:09:19 i just shot a msg to the room
19:09:22 I have done no additional work in this area.
19:09:31 im going to take over the caching work
19:09:35 so, looking @ the "image caching" job for the heat/trove images.
i was wondering if it'd make sense to go the easy route, and just put a few more #IMAGE_URLS= into stackrc, and let the job automagically grab them
19:09:45 ^ ^ sent to -infra, we can discuss offline if u want
19:09:47 I think that's probably a great first step
19:10:03 since these ARE images that you're intending on using as part of a d-g run
19:10:08 its a simple fix, and your nodepool stuff already grabs it
19:10:15 mordred, hub_cap: do we have an etherpad plan for this somewhere?
19:10:35 * mordred feels like we did, but is not sure where
19:10:43 clarkb: had one
19:10:48 * hub_cap thought
19:10:58 i can't recall whether our current thinking is that job runs on devstack-precise nodes, or are we making a new nodepool image type...
19:11:10 mordred: i think devstack-precise, right?
19:11:40 I think that devstack-precise was the current thinking - until proven otherwise
19:12:05 I started one with the notes that were on my whiteboard /me finds a link
19:12:06 hub_cap: i think that would be fine, but it's also easy to throw them into the nodepool scripts. so either way; probably depends on what devstack core thinks is appropriate.
19:12:19 #link https://etherpad.openstack.org/p/testing-heat-trove-with-dib
19:12:35 ill ping sdague / dtroyer on the matter and see what they think
19:12:46 its definitely the path of least resistance in terms of nodepool
19:12:57 hub_cap: ok. know where the nodepool scripts are if the answer swings the other way?
19:13:00 yer already wget'ing the IMAGE_URL's and caching them
19:13:05 hub_cap: didn't we just merge a big chunk of trove code?
19:13:17 sdague: yes this is not trove specific per se
19:13:22 * ttx <- lurking too
19:13:26 its diskimage-builder image locations
19:13:37 ok
19:13:38 jeblair: itd be pretty simple scripting too
19:13:43 talked w/ lifeless
19:14:04 he says its not worth our while to do more than a wget (as in, reading from the dib scripts like mordred and i talked about)
19:14:18 not yet anyhow. Crawl. Walk. Run.
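The "just wget whatever is in IMAGE_URLS" approach discussed above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the actual nodepool caching script; the function name, cache directory default, and the assumption that IMAGE_URLS is a comma-separated list (as in devstack's stackrc) are mine:

```python
# Sketch of the IMAGE_URLS caching idea: download each image listed in a
# comma-separated IMAGE_URLS string into a cache directory, skipping
# anything already cached. Hypothetical, not the real nodepool script.
import os
import urllib.request

def cache_images(image_urls, cache_dir="~/cache/files"):
    """Fetch each URL in a comma-separated list into cache_dir."""
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    cached = []
    for url in (u.strip() for u in image_urls.split(",")):
        if not url:
            continue
        dest = os.path.join(cache_dir, os.path.basename(url))
        if not os.path.exists(dest):      # already cached? skip the download
            urllib.request.urlretrieve(url, dest)
        cached.append(dest)
    return cached
```

Since the job only wgets base images (per lifeless's "crawl, walk, run" advice above), adding more entries to IMAGE_URLS is all the hook a caching job like this would need.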
19:14:21 k. great. then I think getting motion at all is great
19:14:22 fall
19:14:37 hub_cap: oh, so you're actually talking about the part where we get a dib image that has previously been published to tarballs.o.o, right?
19:14:46 oh no im not
19:15:08 hub_cap: ok, so you're talking about the job to create that image?
19:15:15 im talking about caching the dib images that would normally be downloaded
19:15:15 hub_cap: which is the first bullet point on etherpad
19:15:20 yes correct jeblair
19:15:47 hub_cap: what are the images you're talking about putting in IMAGE_URLS then?
19:15:50 then ill be working on some of the other bullet points
19:15:53 jeblair: sec
19:15:59 the base ubuntu and fedora images
19:16:11 jeblair: the dib image build process starts with an upstream base cloud image
19:16:29 ah, ok. i don't think that changes any of the things we've said, except i understand them better now. :)
19:16:31 example: http://cloud.fedoraproject.org/fedora-19.x86_64.qcow2
19:16:42 jeblair: I agree :)
19:17:20 cool, anything else on this topic?
19:17:24 nosah
19:17:27 <3
19:17:34 hub_cap: thanks
19:17:43 word
19:17:44 #topic Tripleo testing (mordred, clarkb, lifeless, pleia2)
19:18:02 so I have a test nodepool up per mordred's instructions yesterday, debugging :)
19:18:07 woot!
19:18:39 pleia2: awesome; if you feel like committing those instructions to the repo, that'd be cool. :)
19:18:55 at the moment it's erroring with `No current image for tripleo-precise on tripleo-test-cloud` so I need to dig around a bit
19:19:07 pleia2: so - that's normal
19:19:16 it'll then run for a while and try to build a new image
19:19:26 jeblair: great, will do, also found a list of dpkg dependencies needed if installing it on a new 12.04 vm
19:19:27 not an error, just a debug message
19:19:38 pleia2: I'm assuming you have the creds for the grizzly cloud in your yaml file?
19:19:49 oh, it kept repeating so I thought it was spinning
19:19:52 mordred: yeah
19:20:00 so I should just start it and let it run?
19:20:04 yes
19:20:05 mordred: ++ but also, if you end up shutting it down while it's doing that, if there's still a 'building' record in the db for that image, it _won't_ start trying to build a new one on restart, you'll need to 'nodepool image-delete' that record.
19:20:11 * pleia2 makes so
19:20:15 jeblair: very true
19:20:31 pleia2: so make sure that ^ case doesn't apply, otherwise it really might not be doing anything
19:20:32 pleia2: nodepool image-list and nodepool image-delete are your friends
19:20:49 great, thanks
19:20:50 my instructions may not be full docs
19:21:05 pretty much all of nodepool-tabtab is full of awesome
19:21:16 all the patches have merged, so it was nice not to have to apply those at least
19:21:23 pleia2, mordred: also worth knowing -- once you get to the stage where it's actually running the scripts, the stderr from the ssh commands is output _after_ the stdout
19:21:35 jeblair: thanks
19:21:40 excellent
19:21:44 if anyone figures out the right way to get those interleaved correctly out of paramiko, i'll give them a cookie.
19:22:10 i sort of gave up and said "at least it's recorded" and moved on.
19:22:13 I like cookies
19:22:45 there's some serious magic going on in paramiko with those streams.
19:23:12 related to that, btw, now that I know how to nodepool test things, I'm going to nodepool against some of our potential other clouds
19:23:23 that may not be released or in prod yet
19:23:57 mordred: nice. isn't it exciting how each new cloud requires source changes?
19:24:16 hah
19:24:18 exciting is not the word i'd choose
19:24:32 jeblair: yah. although - to be fair, I think that the grizzly cloud mostly just found issues for us - we didn't have to put in new behavior forks
19:24:57 mordred: ok cool
19:25:11 State: building :)
19:25:18 pleia2: w00t
19:25:30 anything else related to tripleo?
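The interleaving problem jeblair offers a cookie for is not really paramiko-specific: if you read stdout to EOF and then stderr, you always get them in that order. The usual fix is to select() on both streams and drain whichever has data. Here's a minimal sketch of that pattern demonstrated with a local subprocess (so it runs without an ssh server); with paramiko you'd poll channel.recv_ready() and recv_stderr_ready() in the same loop. The function name and return shape are assumptions for illustration:

```python
# Sketch: interleave stdout and stderr in arrival order by select()ing
# on both pipes, instead of reading one stream to EOF before the other.
# Shown with a local subprocess; the same loop shape applies to a
# paramiko channel via recv_ready()/recv_stderr_ready(). Unix-only.
import select
import subprocess

def run_interleaved(cmd):
    """Run cmd, returning [(stream_name, line), ...] in arrival order."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    streams = {proc.stdout: "stdout", proc.stderr: "stderr"}
    output = []
    while streams:
        ready, _, _ = select.select(list(streams), [], [])
        for pipe in ready:
            line = pipe.readline()
            if not line:                  # EOF on this stream
                del streams[pipe]
                continue
            output.append((streams[pipe], line.decode().rstrip("\n")))
    proc.wait()
    return output
```

Note the ordering is only as good as the OS delivers it: two writes that land between two select() calls can still come out in either order, which may be part of the "serious magic" observed in paramiko's buffering.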
19:25:34 I think that's it
19:25:44 the tripleo cloud is deploying now
19:26:02 which means we're getting closer to it being non-destructively updating - at which point infra should be able to consume vms from it
19:26:06 or start thinking about doing that
19:26:08 mordred: yay
19:26:16 just for those who haven't been following along
19:26:34 #topic Next bug day: Tuesday October 22nd at 1700UTC (pleia2)
19:26:49 just a reminder, next week!
19:26:59 i'm going to be at linuxcon eu then
19:27:11 me too
19:27:24 i will try to show up if i can, but i may have to flake out this time
19:27:33 me too
19:27:36 i'll be here and bugtastic
19:27:43 * clarkb wonders if jeblair and mordred find conferences that conflict with bug days on purpose :{
19:27:46 * :P
19:27:47 (i doubt rescheduling would change that)
19:27:48 hehe
19:27:49 I will be around
19:28:01 jeblair: yeah, the next week gets mighty close to summit
19:28:34 i will be on my way to jenkins conf
19:28:41 clarkb: s/bug days/work/ ? :)
19:29:00 zaro: good luck with that! I've been sending folks your way :)
19:29:01 zaro: where you will be speaking, yeah?
19:29:30 yes
19:29:39 #topic Open discussion
19:30:02 this morning I started looking at upgrading a bunch of our logstash stuff
19:30:21 my schedule for the next month is roughly: linuxcon eu for a week, vacation for a week, summit for a week, vacation for a week.
19:30:34 basically want to upgrade logstash to 1.2.1 which requires an elasticsearch upgrade to 0.90.3 and may require an upgrade to kibana3
19:30:37 so i'm going to be in and out
19:30:52 jeblair: when are you thinking about laying out the infra summit plan? there were a couple of sessions which could be either qa / infra and I wanted to figure out if I should add them to a track or you were
19:30:55 i *will* be somewhat unavailable later in the week though.
allthingsopen is wednesday and thursday and i want to catch at least some of it since it's local
19:31:08 the week==next week
19:31:16 sdague: was planning on doing that next week, will that work for you?
19:31:17 there is a lot to change and while I *think* I can do it non-destructively, I would like to be able to just do it more organically and if we lose data oh well
19:31:18 my schedule is similar to jeblair's, except I'm doing many more conferences in that stretch
19:31:21 thankfully I'm home for pretty much the rest of the year aside from summit
19:31:22 jeblair: sure
19:31:24 sdague: jog0: do you have opinions on that attitude?
19:31:24 I will see you there fungi
19:31:25 and by many, I mean 2x
19:31:37 fallenpegasus: looking forward to it!
19:31:44 clarkb: we can always rebuild logstash, right?
19:31:48 so consider me useless as usual
19:31:59 but locally I'm speaking at balug tonight on 'code review for sysadmins' (same as oscon talk)
19:32:00 so I consider temp data loss to be an "oh well"
19:32:06 sdague: yup and great
19:32:19 sdague: well its temp in that indexes may go away
19:32:25 clarkb: the havana release is sched for oct 17
19:32:26 and I don't feel like reindexing the data
19:32:39 clarkb: my feeling is maybe wait till next week, then go for it?
19:32:51 clarkb: well for elastic-recheck, we'd want to reindex that last 7 days of data regardless
19:32:55 jeblair: yup, that was what I was thinking. After release is the best time for this sort of thing
19:33:07 so we can figure out bug trends
19:33:08 sdague: yeah, I am asserting that I would like to not do that :)
19:33:08 pleia2: ++
19:33:19 clarkb: can't we do it as a background job after?
19:33:34 sdague: how critical is that after the release?
19:33:35 clarkb: ok, I just assumed temp outage :)
19:33:49 sdague: we can, but if I start worrying about stuff like that I am worried I won't get this done in a timely manner
19:34:04 clarkb: ok, so lets take the hit now
19:34:14 sdague: right I figured after release was the best time for that hit
19:34:29 but in future it would be nice to have a reindex process for the data
19:34:33 this: https://review.openstack.org/#/q/status:open+project:openstack-infra/config,n,z is a terrifying list
19:34:40 sdague: also, future upgrades will hopefully be less painful. logstash is doing a schema change and elasticsearch is changing major versions of lucene. It is a bit of a perfect storm
19:34:52 because - http://status.openstack.org/elastic-recheck/ even in its current form, is super useful in understanding trending
19:34:56 so we'll have a blind spot
19:35:18 clarkb: yeh, though honestly, if they broke like that this time, I expect they'll break in the future
19:35:33 so bulk reindex process is probably in order
19:35:59 sdague: it would be nice, but we've always said this stuff is transient
19:36:23 and given our current staffing levels vs workload, i think we're going to have to accept that some things like this will have rough edges
19:36:35 as mordred just pointed out
19:36:39 sdague: they might. How about this: if the old indexes don't derp due to the lucene upgrade (they shouldn't, but it is a warning they give) I will work on reindexing after the upgrade
19:36:49 clarkb: that would be awesome
19:36:54 if they do derp, we move on
19:37:04 yeh, I'm fine on that for now
19:37:10 jeblair: ++
19:38:04 jeblair: I get it, just logstash has a lot of consumers now :) you'll hear it on irc if it comes back empty
19:38:58 sdague: yep, and their contributions to the maintenance of the system will be welcome.
:)
19:39:15 fair enough
19:39:17 yeah, I will make a best effort, but I think doing it perfectly will require far too much time
19:39:26 clarkb: yeh, don't stress on it
19:39:38 now is as good a time as ever to take the hit
19:39:40 and in theory since all of the upstreams are working together now this sort of pain will be less painful in the future (I really hope so)
19:40:30 any idea how borky it's going to make e-r? like if there are enough data structure changes that we're going to need to do some emergency fixes there?
19:40:47 sdague: it won't be too bad, I will propose an updated query list
19:40:54 ok, cool
19:41:01 the metadata adds going to go in as part of this?
19:41:10 sdague: the schema is being flattened and silly symbols are being removed. so @message becomes message and @fields.build_foo is just build_foo
19:41:18 cool
19:41:36 sdague: yeah, I was planning on looking at that as part of this giant set of changes :)
19:41:42 great
19:41:59 I am also planning on trying the elasticsearch http output so that we can decouple elasticsearch upgrades from logstash
19:42:11 but we need to upgrade elasticsearch anyways
19:42:48 cool, well just keep me in the loop as things upgrade, I'll see what I can do to hotfix anything on the e-r to match
19:43:15 sdague: thanks
19:43:18 and will do
19:46:28 there's a thread on the infra list about log serving
19:46:34 Subject: "Log storage/serving"
19:47:20 it seems like the most widely accepted ideas are to store and statically serve directly from swift, and pre-process before uploading
19:48:03 if anyone else wants to weigh in, that would be great
19:48:18 sdague: ^ mentioned this the other day, just a reminder
19:48:36 jog0: ^ may be of interest to you as well
19:49:03 oh, yeh, actually
19:49:17 can I get reviews for https://review.openstack.org/#/c/47928/ that is step one in this whole process of upgrading stuff?
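The schema flattening clarkb describes (@message becomes message, and everything under @fields is promoted to the top level) can be illustrated with a small translation helper. This is a hypothetical sketch of what migrating a stored event would look like, not code from the actual deployment, and it only handles the two changes named in the discussion:

```python
# Sketch of the logstash 1.2 schema flattening described above:
# "@message" -> "message", and keys under "@fields" are promoted to the
# top level ("@fields.build_foo" -> "build_foo"). Hypothetical helper
# for migrating stored events; other keys are passed through untouched.
def flatten_event(event):
    """Translate an old-schema logstash event dict to the flat schema."""
    flat = {}
    for key, value in event.items():
        if key == "@fields" and isinstance(value, dict):
            flat.update(value)        # @fields.build_foo -> build_foo
        elif key == "@message":
            flat["message"] = value   # @message -> message
        else:
            flat[key] = value
    return flat
```

A translation like this is also roughly what "propose an updated query list" implies for elastic-recheck: the stored queries need the same renames applied.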
19:49:19 so my experience so far on the filters, is the dynamic nature is kind of important
19:49:52 so pre-process is something I'd actually tend to avoid if we could (though we could maybe build filters out of swift?)
19:50:27 sdague: well, we won't be running our own swift, so i don't believe we can write a middleware to do it
19:50:33 jeblair: nice, what about the issues about how swift doesn't fit this use case exactly
19:51:06 sdague: i agree, i like being able to process them as we serve them -- but considering that we tend to focus on the most recent logs...
19:51:08 jeblair: so that being the case, I'd kind of lean on the dynamic filters against an FS model.
19:51:43 sdague: if we keep upgrading the pre-processing, the logs we're looking at most will very shortly have those updates
19:51:49 jeblair: right, but if we change a filter output, to do something like link req ids in, then we have to go back and bulk process everything
19:52:08 sdague: or just accept that older logs aren't as featureful
19:52:36 jeblair: that also means processing them multiple ways, as we do things like dynamic level selectors
19:52:51 we doing a summit session on this?
19:53:01 might be good to do it there
19:53:01 why is it that we didn't want to do a log-serving app like jeblair was suggesting originally?
19:53:22 mordred: because then we have to run the thing. If we use swift and mostly just swift someone else deals with it :)
19:53:28 it seems like storing the logs in swift, with an entry in a db that tells you pointers to the data blobs
19:53:29 mordred: I think I'm arguing exactly for the log serving app approach
19:53:38 so that if you want to get complex view, you go through the log view app
19:53:39 sdague: i guess i'm saying that it's worth weighing the benefit of being able to add new processing to old logs against the simplicity of being able to use swift more or less straight up.
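The "dynamic level selector" sdague is defending works by keeping only the raw log in storage and filtering it per request, so an improved filter immediately applies to old logs too. A minimal sketch of that idea (a hypothetical illustration, not the actual filter code; the level list and oslo-style line format are assumptions):

```python
# Sketch of a dynamic level selector: filter a raw oslo-style log at
# request time instead of pre-processing it. Lines carrying no level
# token (e.g. traceback continuations) stay with the preceding record.
# Hypothetical illustration, not the real log-serving filter.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR"]

def filter_log(raw_text, min_level="INFO"):
    """Return only the records at or above min_level."""
    threshold = LEVELS.index(min_level)
    out = []
    keep = False
    for line in raw_text.splitlines():
        level = next((l for l in LEVELS if " %s " % l in line), None)
        if level is not None:            # a new log record starts here
            keep = LEVELS.index(level) >= threshold
        if keep:                         # continuation lines ride along
            out.append(line)
    return "\n".join(out)
```

The cost jeblair points out is that the app doing this has to be run and maintained; the benefit sdague points out is that changing this one function upgrades the view of every log ever stored.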
19:53:42 yeh, we could still put raw in swift
19:53:47 but if you just want raw data, you pull the data directly from swift
19:54:27 sdague: you could get dynamic features by doing the filtering in javascript (and by encoding tags in the file, still make filtering easily available to logstash pusher)
19:54:42 jeblair: with the size of these files... you really can't
19:55:06 sdague: ?
19:55:15 n-cpu logs, htmlized, cripple anything but chrome
19:55:22 35MB html uncompressed
19:55:43 sdague: aren't they already htmlized?
19:55:52 no, that's the point of the filter
19:56:09 sdague: we sometimes don't convert them to html?
19:56:17 jeblair: right, for logstash
19:56:20 or if you wget
19:56:30 sdague: but that doesn't cripple a non-chrome browser
19:56:33 we're doing content negotiation, so you can get html or text/plain
19:56:36 i don't think i'm following
19:56:52 the overhead of a 35 MB html dom kills most browsers
19:57:10 sdague: so you're saying it only works because it doesn't default to DEBUG?
19:57:12 javascript manipulating it would be even worse
19:57:24 sdague: and if you click debug, it'll kill your browser
19:57:44 we actually default to debug, and most people use chrome when it kills firefox
19:58:02 sdague: ok, so as far as the html goes, there would be no difference
19:58:11 but a future enhancement I was going to add was detecting browser, and defaulting to a lower level
19:58:37 anyway, face to face, maybe we can sort the various concerns
19:58:45 sdague: well, if you have a minute, could you add these thoughts to the thread?
19:58:59 sure
19:59:14 sdague: because so far, the idea of running a log serving app, which i originally suggested, has very few supporters
19:59:53 sdague: and yeah, i'll propose a summit session on this
20:00:11 and i think we're at time
20:00:18 thanks everyone
20:00:19 i suspect that's mainly because the main use cases for swift are directly serving files rather than using it purely as a storage backend
20:00:30 #endmeeting