19:01:35 <jeblair> #startmeeting infra
19:01:35 <openstack> Meeting started Tue Jul 29 19:01:35 2014 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:38 <openstack> The meeting name has been set to 'infra'
19:01:44 <jeblair> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:01:48 <krtaylor> o/
19:01:50 <jeblair> agenda ^
19:01:55 <jeblair> #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-07-15-19.05.html
19:01:57 <ianw> o/
19:01:59 <jeblair> last meeting ^
19:02:04 <mordred> o/
19:02:08 <jeblair> #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-07-08-19.01.html
19:02:18 <jeblair> last non-beer meeting ^
19:02:43 <fungi> every meeting should be a beer meeting
19:02:47 <jeblair> fungi: ++
19:02:54 * Ajaeger1 hands out beer for everybody ;)
19:03:01 <jeblair> #topic  Centos7 jobs
19:03:11 <ianw> hi
19:03:20 <jesusaurus> o/
19:03:25 <ianw> i've gone about as far as i can with the centos7 jobs out-of-tree
19:03:42 <ianw> there's three reviews, linked in the agenda
19:03:50 <ianw> 1) add puppet support
19:03:54 <jeblair> #link https://review.openstack.org/#/c/103458/
19:03:56 <ianw> 2) add rax image to nodepool
19:04:02 <jeblair> #link https://review.openstack.org/#/c/109906/
19:04:13 <ianw> 3) add an experimental devstack job
19:04:14 <jeblair> #link https://review.openstack.org/#/c/110178/
19:04:21 <ianw> (thanks jeblair)
19:04:29 <ianw> any objections to this approach?
19:04:52 <ianw> i have the centos7 disk-image-builder changes merged too
19:05:20 <ianw> but i would rather just get things going in rackspace and sort out those initial bugs i'm sure are there
19:05:23 <jeblair> ianw: nope; we'll probably consider it experimental until we either have a base image in hpcloud or dib/glance working there
19:05:37 <ianw> jeblair: yes, that's fine
19:05:44 <jeblair> ianw: but i see no reason not to keep moving ahead in rax
19:05:48 <fungi> seems a sane outline to me, having not reviewed the changes myself yet
19:05:56 <ianw> similar to the f20 job, we won't promote until we get redundancy
19:05:56 <mordred> ++
19:06:38 <ianw> once it is working, i can start more serious conversations about where we want to use centos v fedora, etc
19:07:59 <ianw> so yeah, at this point, just need those reviews to make it through.  that's all for that topic
19:08:01 <fungi> pretty sure we can't drop centos, nor can we use fedora for long-term stable support branches
19:08:20 <jeblair> ianw: i'm assuming you've been exposed to our rationale for using centos (tc distro support policy says "try not to break rhel", plus, it has a support lifetime that exceeds openstack's for stable branches)
19:08:23 <clarkb> we can't drop centos6 for juno and icehouse
19:08:38 <jeblair> ianw: if not, we can expand on those points now if it would help
19:08:40 <fungi> correct
19:08:53 <clarkb> centos6 won't be part of kilo testing aiui
19:09:05 <ianw> yeah, that's all cool, i spend time keeping devstack working on rhel6 :)
19:09:20 <fungi> getting centos7 working before juno release would prepare us well for dropping centos6 in kilo
19:09:30 <ianw> but yeah, with kilo we can finally move to python 2.7
19:09:34 <jeblair> ++
19:09:48 <ianw> that will be a nice patch to devstack with a lot of "-" lines
19:10:06 <ianw> fungi: yep, exactly the idea :)
19:10:28 <jeblair> cool, so hopefully the core team can get around to reviewing things again (i see some activity happening; sadly, i have not yet been able to really join in)
19:10:40 <clarkb> I am trying when not distracted by meetup
19:10:42 <mordred> yah. sorry - been WAY behind on reviews
19:11:15 <fungi> i think we all have
19:11:17 <jeblair> we've been like flying around and talking about important things :)
19:11:42 <jeblair> #topic  nodepool history tracking (ianw)
19:11:54 <fungi> right now i'm just trying to find all of the balls i've dropped
19:12:03 <ianw> so the history allocation changes were reverted in nodepool
19:12:28 <ianw> again some linked reviews, the main one being
19:12:32 <ianw> #link https://review.openstack.org/109185
19:12:33 <fungi> ianw: you had new reviews up to revise the algorithm there, right?
19:12:37 <fungi> ahh, yep
19:13:00 <ianw> the only way i could make it go crazy was to have a negative available count passed into the allocator
19:13:22 <ianw> it starts subtracting negatives, and ends up returning more than requested
19:13:27 <ianw> which seems to match with what was seen
19:13:40 <ianw> however, this wasn't specific to the history allocation changes
19:13:48 <fungi> that does sound basically like what we witnessed in production
19:13:58 <jeblair> ianw: was the negative available count logged?  if so, we can check the logs and see if that happened
19:14:09 * ttx lurks
19:14:21 <ianw> jeblair: no, i don't think that would show up in logs.  that change puts in an error message for that situation
19:14:50 <ianw> along with avoiding the negative
19:15:03 <fungi> and we definitely were able to recreate the issue twice in production on consecutive restarts, then reverted the allocation history feature and it subsided on the next restart
19:15:06 <ianw> so yeah, i'm open to ideas, but that was all i came up with
19:15:59 <jeblair> ianw: hrm, i think it's worth digging deeper to find either an error in the new code, or a change related to it that could have caused the issue -- the correlation is pretty strong
19:16:22 <jeblair> ianw: also, did you note the fact that it showed up on the second pass through the allocator (not the first)?
19:16:22 <ianw> https://review.openstack.org/#/c/109890/1/nodepool/allocation.py <- makeGrants() is the suspect here
19:16:41 <fungi> i've also restarted nodepool at least once more since, again under heavy volume, and didn't see the problem come back
19:17:02 <ianw> w.grant(min(int(w.amount), self.available))
19:17:22 <ianw> that really should clamp things to never grant more than available
19:18:37 <jeblair> i'm worried that we may have changed something fundamental with the allocator patch, and this would just mask that
19:19:04 <jeblair> i'll try to think deeply about it when i review it
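(A minimal sketch of the failure mode ianw describes, using a deliberately simplified allocator rather than the real nodepool code: if a negative available count reaches the grant path, the min() clamp passes the negative value straight through, and subtracting that negative grant pushes the available count back up -- the "subtracting negatives" behaviour mentioned above.)

    # simplified illustration only -- not the actual nodepool allocator
    class ToyAllocator(object):
        def __init__(self, available):
            # 'available' comes from the caller; nothing here stops it being negative
            self.available = available

        def grant(self, requested):
            granted = min(requested, self.available)  # negative available -> negative grant
            self.available -= granted                 # subtracting a negative *adds* capacity
            return granted

    alloc = ToyAllocator(available=-3)
    print(alloc.grant(5))    # -3: a nonsensical negative grant
    print(alloc.available)   # 0: capacity "recovered" from nowhere for later passes
    # a guard such as self.available = max(self.available, 0) (or the error logging
    # proposed in review 109185) keeps the clamp from ever seeing a negative value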
19:19:26 <asselin> o/
19:19:49 <fungi> would it make sense to add some logging around there, unrevert the allocation history change, then try to get it to break in production again?
19:20:13 <ianw> fungi: i can take that approach, currently the allocator doesn't really do any debug logging
19:20:13 <fungi> assuming we can't come up with a legitimate way to recreate the conditions under which we saw it happen
19:20:45 <fungi> it would probably be a somewhat more manageable condition when we're not sitting in the hallway at oscon
19:21:22 <jeblair> i don't think we need logging in the allocator, but if we're missing logging of inputs to it, we should fix that
19:21:44 <fungi> i'm not generally a fan of "testing in production" but sometimes there aren't expedient alternatives
19:22:09 <ianw> jeblair: in that case, I think https://review.openstack.org/#/c/109185/ is probably ok as is
19:22:26 <ianw> it will error log if this negative condition comes about
19:22:42 <ianw> we could run with that and see if it comes up
19:23:02 <jeblair> ianw: it also masks the problem.  i may not be being clear.
19:23:17 <ianw> maybe we're actually over-allocating but don't notice at the moment?
19:23:18 <jeblair> we _should_ be able to reproduce the problem entirely from the production logs that we have
19:23:34 <jeblair> if we are unable to do so, we should correct the logging so that it logs the necessary information
19:24:18 <jeblair> and if that's the case (we are missing log info (remember, i think this is unlikely)), then i'd be okay with a testing-in-production approach to collect the missing data
19:24:42 <ianw> alright, let me go through nodepool.py and i'll send a change later to bump up the logging so we can see if this is true or not
19:24:43 <jeblair> but otherwise, i think we should dig deeper into reproducing it out of production
19:25:12 <ianw> where are the production nodepool logs, can i see them externally?
19:25:20 <fungi> ianw: if you need additional info out of our production logs i can get them for you
19:26:03 <ianw> ok, thanks for the discussion.  we can move on and i'll look at at the log angle
19:26:11 <jeblair> cool, thanks
19:26:28 <fungi> right now only the image update logs are published externally, though we've talked about making service/debug logs available too
19:26:44 <jeblair> #topic Puppet 3 Master (nibalizer)
19:27:08 <ianw> fungi: thanks, that's what i thought.  if i can help with publishing the logs somehow, let me know
19:27:28 <jeblair> nibalizer: last i recall, there was some serious weirdness about what version of puppet was actually getting installed
19:27:44 <mordred> jeblair: I had a radical thought re: puppet 3 and puppetmasters...
19:28:02 <mordred> but I can also hold off if you don't want radical thoughts in this meeting
19:28:39 <jeblair> mordred: the floor appears to be yours :)
19:28:42 <mordred> jeblair: neat!
19:28:49 <mordred> jeblair: what if we stop using puppetmaster altogether
19:29:01 <mordred> the only benefit we really get from it is secret data
19:29:07 <mordred> but we're driving puppet runs from ansible now
19:29:15 <mordred> so we could have ansible run puppet apply passing in secret data
19:29:22 <mordred> no more care about puppet2 v. puppet3 masters
19:29:25 <mordred> only puppet apply
19:29:29 <mordred> which is easier to test too
19:29:47 <jeblair> how does it supply the secret data?
19:30:21 <fungi> the secret data filtering to only hand out the bits a given slave actually needs is a fairly significant feature
19:30:34 <mordred> drop in a runme.pp with just the host's chunk of site.pp in it with values populated, then run puppet apply on that
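(A rough sketch of the idea mordred is describing, with hypothetical file names and paths -- the real layout would come out of a POC: ansible renders each host's slice of site.pp with its secret values filled in, copies it over, and runs puppet apply, so no puppetmaster is involved.)

    # hypothetical playbook sketch; the modules are stock ansible, the paths are made up
    - hosts: puppet_nodes
      tasks:
        - name: render this host's chunk of site.pp with its secrets filled in
          template:
            src: manifests/{{ inventory_hostname }}.pp.j2
            dest: /opt/system-config/runme.pp
            mode: "0600"

        - name: apply the manifest locally
          command: puppet apply --detailed-exitcodes /opt/system-config/runme.pp
          register: puppet_run
          failed_when: puppet_run.rc not in [0, 2]   # exit code 2 just means changes were applied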
19:30:39 <jeblair> and do the reports go through the puppetmaster, or do they go directly to puppetdb?
19:31:00 <mordred> puppetdb is a thing I'd need to come up with an answer for
19:31:33 <mordred> question is - should I bother POC-ing an answer?
19:31:59 <jeblair> mordred: that sounds like a fair response to the compartmentalized secret data issue, though there are also secret data that are shared across many/all hosts
19:32:22 <jeblair> mordred: so we'd probably want something more sophisticated than "edit 72 files to add someone to sysadmins"
19:32:25 <mordred> yes
19:32:28 <nibalizer> hai
19:32:30 <mordred> I believe I'm never going to want to do that
19:32:34 <nibalizer> sorry was cleaning kitchen
19:32:42 * mordred throws wet cat at nibalizer
19:32:48 <nibalizer> jeblair: yes there is whitespace in your preferences file
19:32:56 <jeblair> mordred: i'm intrigued
19:32:59 <jeblair> nibalizer: it's not mine :)
19:32:59 * nibalizer throws the wet cat at apt
19:33:05 <jesusaurus> mordred: also what does this do to the "public hiera" plans for not-secret data?
19:33:11 <nibalizer> jeblair: haha, ThePreferencesFile
19:33:15 <mordred> jesusaurus: that shold work identically
19:33:31 <mordred> since the hard problem there was solving for making sure puppet apply worked
19:33:41 <fungi> yeah, we can still stick hiera files on machines and have them refer to the contents
19:33:42 <mordred> in this case, we'd move back to _only_ puppet apply needing to work
19:33:44 <jeblair> nibalizer: is that fixed in config yet?  and have you spun up a node based on that to verify it works?
19:34:01 <nibalizer> i spun up a trusty node and did the needful to it
19:34:06 <nibalizer> verified the error you had
19:34:08 <mordred> jeblair: ok. I'll sketch up something more than 3 seconds of babbling in a meeting to see if it can work sanely
19:34:12 <nibalizer> then fixed it by fixing the white space
19:34:21 <nibalizer> mordred: puppetdb and puppet apply work pretty well together
19:34:30 <mordred> nibalizer: sweet
19:34:32 <nibalizer> puppet apply nodes can report directly to the puppetdb server on what they've been doing
19:34:34 <mordred> that's even easier then
19:34:49 <jeblair> nibalizer: okay, so we're probably ready to take another stab at this.  i'm swamped now, but maybe later in the week.
19:34:51 <nibalizer> also you asked about hitting puppetdb api from the puppet master
19:34:55 <nibalizer> which is super easy
19:35:05 <nibalizer> since you can use the certs that the puppet master already has for cert auth
19:35:16 <mordred> ah - interesting
19:35:24 <mordred> nibalizer: I will talk to you after meeting about that
19:35:28 <jeblair> do we currently allow anything to connect to puppetdb?
19:35:38 <mordred> I'd like for our inventory plugin to talk to puppetdb instead of running puppet cert list
19:36:01 <nibalizer> i think puppetdb port 8081 (https) is open to just about everyone
19:36:15 <nibalizer> of course, if you dont have a cert, well ... "no ticket"
19:36:37 <jeblair> nibalizer: so puppetdb does cert verification against the puppet ca of nodes wanting to submit reports?
19:36:44 <nibalizer> yup
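(For reference, a hedged example of what such a cert-authenticated query could look like -- the hostname, ssldir and API path here are assumptions, and the endpoint differs between puppetdb versions:)

    # query puppetdb over SSL, re-using the certs a puppet node already has
    curl --cacert /var/lib/puppet/ssl/certs/ca.pem \
         --cert   /var/lib/puppet/ssl/certs/$(hostname -f).pem \
         --key    /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
         https://puppetdb.openstack.org:8081/v3/nodes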
19:36:54 <jeblair> mordred: so your idea would probably not eliminate the need for a puppet ca
19:37:00 <nibalizer> plus ^^ if we're talking dangerous ideas we could actually shell out to hiera to collect the data to send to ansible to send to the nodes
19:37:25 <nibalizer> running the puppetCA isn't a huge overhead, though
19:37:32 <fungi> so we'd still need a puppet ca, but the server where it resides would no longer need to talk to other servers except to hand out signed certs when we launch new machines
19:37:38 <nibalizer> point is, it's less overhead than running our own CA?
19:38:11 <nibalizer> jeblair: the clients will verify that puppetdb's cert has been signed by the CA as well
19:38:16 <fungi> we could possibly even just move the puppet ca to the puppetdb server
19:38:45 <jeblair> fungi: i think it basically reduces the potential vulnerability of having the puppetmaster protocol network accessible
19:39:00 <nibalizer> fungi: what problem are you trying to solve by moving the puppetca?
19:39:07 <jeblair> fungi: probably best to keep it separate, and on the higher-security (less-services-running) but inaccurately named "puppetmaster"
19:39:08 <nibalizer> at the end of the day the puppetca is just a folder
19:39:13 <nibalizer> we could move it to jeblair's laptop
19:39:14 <fungi> nibalizer: only if we wanted to ditch the extra server
19:39:34 <nibalizer> oh, i don't think thats possible in how im interpreting mordreds plan
19:39:37 <mordred> we still need the server there to run the out-bound ansible connections ...
19:39:44 <nibalizer> plus a place to keep hiera data
19:39:48 <mordred> yup
19:39:57 <nibalizer> and somewhere to figlet SCIENCE | wall
19:40:06 <nibalizer> critical services
19:40:23 <jeblair> nibalizer: great, now we need to change the master password
19:40:24 <fungi> jeblair: well, if the puppet ca is only used any longer to identify slaves to the puppetdb service and vice versa, and doesn't actually get trusted to do things via the agent itself, then i'm not sure what we're protecting at that point besides a reporting channel
19:40:27 <mordred> not solving for less servers - just _possibly_ giving us a slightly more flexible way to deal with things like puppet 2 -> puppet 3 -> puppet 4
19:40:43 <mordred> btw - puppet 4 is coming ...
19:40:57 <jeblair> fungi: true, not much, but we need the other server anyway, so may as well keep it there
19:40:58 <nibalizer> dun dun dun
19:41:13 <jeblair> mordred: it also eliminates a potential vulnerability
19:41:16 <fungi> jeblair: yep, if we keep the server, might as well leave the ca on it
19:41:19 <mordred> jeblair: ++
19:41:50 <jeblair> mordred: (the connect to puppetmaster and trick it into giving you secrets vulnerability)
19:42:04 <mordred> jeblair: bah. it would never do that! :)
19:42:29 <jeblair> which is, incidentally, why we changed all of our creds due to heartbleed
19:42:42 <fungi> also known as "release an openssl vulnerability poc and use up weeks of infra rood admin time"
19:42:52 <fungi> s/rood/root/
19:43:05 <jeblair> okay, end of topic?
19:43:48 <jeblair> also, is pleia2 still away?
19:44:25 <jeblair> #topic  Replacement for docs.openstack.org (AJaeger)
19:44:32 <jeblair> Ajaeger1: around?
19:44:39 <Ajaeger1> yes, je
19:44:43 <Ajaeger1> yes, jeblair
19:44:50 <jeblair> so we brainstormed at the qa/infra sprint
19:44:57 <Ajaeger1> thanks for talking about this in Darmstadt!
19:45:02 <jeblair> and we came up with 2 ideas
19:45:50 <mordred> I prefer one of them
19:45:51 <jeblair> 1) docs jobs publish artifacts to swift (like we are starting to do with logs), then we write a simple zuul worker that fetches them and rsyncs them into location on a static webserver
19:46:03 <nibalizer> jeblair: when you get free time just ping me and we'll beat on p3 some more
19:46:03 * annegentle waves too
19:46:52 <jeblair> that's a super simple system that gets us the ability to automatically delete files, but otherwise isn't too different from what we have now
19:47:44 <jeblair> it also should be fairly easy to implement once we work out the kinks in the log publishing
19:47:51 <jeblair> (which we made substantial progress on in darmstadt)
19:48:06 <Ajaeger1> So, for the rsync we need to do this on a directory by directory basis - with the projects publishing at separate times, e.g. infra-manuals to http://docs.openstack.org/infra/manual/ - this needs to be fine-grained enough.
19:48:35 <jeblair> Ajaeger1: yes, we'll be able to do that
19:49:00 <jeblair> so basically, the infra manuals rsync job will take what the build job built and rsync it to the correct subdir
19:49:15 <jeblair> similarly, stable branch docs jobs will rsync to the stable branch subdir
19:49:43 <jeblair> so we get the same kind of multi-version publishing we have now
19:49:47 <mordred> jeblair: (we'll need to not delete branch subdirs when we rsync --delete)
19:49:48 <mordred> but yeah
19:50:02 <jeblair> mordred: yeah, i think the root of the rsync will be at the branch subdir
19:50:10 <mordred> but for the master rsync
19:50:11 <Ajaeger1> We currently publish to e.g. /trunk/install-guide and /arch-design at the same time
19:50:23 <jeblair> mordred: yeah, we'll need to special case that
19:50:26 <mordred> ++
19:50:33 <mordred> yup. not hard - just mentioning it
19:51:09 * Ajaeger1 doesn't remember enough about rsync options to see whether this handles all special cases but let's discuss that separately
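(To make the granularity point concrete, a hedged sketch with made-up paths: each publish job rsyncs only into its own subtree, and the root/master job protects the branch subdirectories from --delete, as mordred notes above.)

    # per-project job: only touches its own subdirectory
    rsync -a --delete build/html/ docs-site:/srv/docs/infra/manual/

    # root/master docs job: remove stale files but never the branch subtrees
    rsync -a --delete --filter='protect trunk/' --filter='protect icehouse/' \
          build/html/ docs-site:/srv/docs/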
19:51:58 <Ajaeger1> jeblair: what's option 2?
19:52:39 <jeblair> option 2) is to use afs.  building and publishing actually get even simpler (it's just an rsync directly on the build host), but it requires an afs infrastructure which we don't have yet
19:53:05 <jeblair> we want that for several other reasons (including mirrors), so hopefully we'll have it at some point
19:53:05 <mordred> one benefit of option 2 is that there are several _other_ things we'd also like to use that infrastructure for
19:53:08 <mordred> jinx
19:54:12 <portante> morganfainberg: here
19:54:15 <jeblair> i'm not entirely sure we should hang docs publishing on that just yet though; i think we're probably closer to idea 1) in implementation, and there's actually a pretty good path from idea 1 to idea 2
19:54:23 <mordred> ++
19:54:28 <morganfainberg> portante, ah can i snag you in #openstack-keystone real quick?
19:54:31 <annegentle> what's afs?
19:54:37 <fungi> andrew filesystem
19:54:42 * morganfainberg doesn't want to bug -infra types too much
19:54:43 <Ajaeger1> annegentle: similar to NFS
19:54:48 <annegentle> ah ok
19:54:53 <fungi> #link http://en.wikipedia.org/wiki/Andrew_File_System
19:54:56 <mordred> annegentle: it's a global distributed filesystem and it's teh awesome
19:55:18 <annegentle> named after andrew, heh
19:55:39 <fungi> named after two andrews
19:55:41 <jeblair> both of them :)
19:56:20 <jeblair> anyway, i think the way to proceed is to let the log publishing settle out a bit, then work on the rsync worker and set up some test jobs on a server
19:56:33 <Ajaeger1> So, what would be the next steps? I suggest that I write up a few examples (should have done that before Darmstadt) so that you see how we publish today to check that rsync can be set up correctly for this.
19:56:36 <jeblair> i'm not sure if we want to use static.o.o or make a new one (since static.o.o is getting fairly full)
19:56:54 <annegentle> and the rsync solution, how do you solve the problem of "there are no files to serve, even if only for a second"?
19:57:17 <Ajaeger1> annegentle: rsync can first sync new content and then remove old files.
19:57:33 <Ajaeger1> annegentle: it will not delete a whole directory at once and then recreate and publish again...
19:57:35 <annegentle> Ajaeger1: ah, great, so ordered steps.
19:57:45 <fungi> right, rsync is actually much safer than the current "overwrite files via ftp" model
19:57:51 <mordred> annegentle: also, once we're on AFS, there is a really neat atomic volume publication feature that will avoid link moving race condition issues
19:57:56 <Ajaeger1> annegentle: we can even run it twice: First sync over new content, second delete.
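(The two-pass variant Ajaeger1 describes, sketched with the same made-up paths: new and changed files land first, and stale files are removed only once the new content is already in place; rsync's --delete-after flag gets close to this in a single run.)

    rsync -a build/html/ docs-site:/srv/docs/infra/manual/            # pass 1: add and update only
    rsync -a --delete build/html/ docs-site:/srv/docs/infra/manual/   # pass 2: remove stale files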
19:58:03 <annegentle> And then, do we have any storage cost considerations? And what happens if it gets full?
19:58:04 <mordred> Ajaeger1: ++
19:58:18 <mordred> annegentle: so far none of our clouds have said anything about cost ever
19:58:24 <Ajaeger1> annegentle: Since we delete files directly, we should have less space issues ;)
19:58:47 <fungi> at the moment we've got a 15tb cinder volume quota from rax, which we'll be emptying out soonish as we transition log storage to swift
19:58:55 <fungi> er, 25tb
19:59:04 <annegentle> mordred: well that's a relief :) just don't want to be an outlier, since images and html may be bigger than log files? (not by much really though)
19:59:17 <jeblair> i bet they'll be much smaller, actually :)
19:59:20 <fungi> annegentle: trust me, they won't be ;)
19:59:28 <annegentle> jeblair: fungi: ok good!
19:59:44 <fungi> we have jobs uploading logfiles which are in the `00mb compressed range
19:59:50 <fungi> er, 100mb
19:59:57 <mordred> fungi: I think you were right the first time
19:59:59 <jeblair> sometimes in the 00mb range too
20:00:03 <annegentle> I think it sounds better than what we have! Which is the goal for starters. Thanks a bunch for all this, team infra.
20:00:05 <jesusaurus> 0mb, now theres some compression!
20:00:08 <Ajaeger1> jeblair: what time frame are we talking about and what would be the next steps besides my examples ?
20:00:25 <anteaya> oh look we are out of time
20:00:28 <annegentle> jeblair: and can it be well before release crunch time :)
20:00:41 <Ajaeger1> shall we continue discussing on #openstack-infra?
20:00:51 <jeblair> i hope; let's regroup on the current log publishing status with jhesketh when he's around, then we can probably make an estimate
20:00:59 <jeblair> thanks everyone!
20:01:02 <jeblair> #endmeeting