19:01:35 #startmeeting infra
19:01:35 Meeting started Tue Jul 29 19:01:35 2014 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:38 The meeting name has been set to 'infra'
19:01:44 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:01:48 o/
19:01:50 agenda ^
19:01:55 #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-07-15-19.05.html
19:01:57 o/
19:01:59 last meeting ^
19:02:04 o/
19:02:08 #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-07-08-19.01.html
19:02:18 last non-beer meeting ^
19:02:43 every meeting should be a beer meeting
19:02:47 fungi: ++
19:02:54 * Ajaeger1 hands out beer for everybody ;)
19:03:01 #topic Centos7 jobs
19:03:11 hi
19:03:20 o/
19:03:25 i've gone about as far as i can with the centos7 jobs out-of-tree
19:03:42 there's three reviews, linked in the agenda
19:03:50 1) add puppet support
19:03:54 #link https://review.openstack.org/#/c/103458/
19:03:56 2) add rax image to nodepool
19:04:02 #link https://review.openstack.org/#/c/109906/
19:04:13 3) add an experimental devstack job
19:04:14 #link https://review.openstack.org/#/c/110178/
19:04:21 (thanks jeblair)
19:04:29 any objections to this approach?
19:04:52 i have the centos7 disk-image-builder changes merged too
19:05:20 but i would rather just get things going in rackspace and sort out those initial bugs i'm sure are there
19:05:23 ianw: nope; we'll probably consider it experimental until we either have a base image in hpcloud or dib/glance working there
19:05:37 jeblair: yes, that's fine
19:05:44 ianw: but i see no reason not to keep moving ahead in rax
19:05:48 seems a sane outline to me, having not reviewed the changes myself yet
19:05:56 similar to the f20 job, we won't promote until we get redundancy
19:05:56 ++
19:06:38 once it is working, i can start more serious conversations about where we want to use centos v fedora, etc
19:07:59 so yeah, at this point, just need those reviews to make it through. that's all for that topic
19:08:01 pretty sure we can't drop centos, nor can we use fedora for long-term stable support branches
19:08:20 ianw: i'm assuming you've been exposed to our rationale for using centos (tc distro support policy says "try not to break rhel", plus, it has a support lifetime that exceeds openstack's for stable branches)
19:08:23 we can't drop centos6 for juno and icehouse
19:08:38 ianw: if not, we can expand on those points now if it would help
19:08:40 correct
19:08:53 centos6 won't be part of kilo testing aiui
19:09:05 yeah, that's all cool, i spend time keeping devstack working on rhel6 :)
19:09:20 getting centos7 working before juno release would prepare us well for dropping centos6 in kilo
19:09:30 but yeah, with kilo we can finally move to python 2.7
19:09:34 ++
19:09:48 that will be a nice patch to devstack with a lot of "-" lines
19:10:06 fungi: yep, exactly the idea :)
19:10:28 cool, so hopefully the core team can get around to reviewing things again (i see some activity happening; sadly, i have not yet been able to really join in)
19:10:40 I am trying when not distracted by meetup
19:10:42 yah.
sorry - been WAY behind on reviews
19:11:15 i think we all have
19:11:17 we've been like flying around and talking about important things :)
19:11:42 #topic nodepool history tracking (ianw)
19:11:54 right now i'm just trying to find all of the balls i've dropped
19:12:03 so the history allocation changes were reverted in nodepool
19:12:28 again some linked reviews, the main one being
19:12:32 #link https://review.openstack.org/109185
19:12:33 ianw: you had new reviews up to revise the algorithm there, right?
19:12:37 ahh, yep
19:13:00 the only way i could make it go crazy was to have a negative available count passed into the allocator
19:13:22 it starts subtracting negatives, and ends up returning more than requested
19:13:27 which seems to match with what was seen
19:13:40 however, this wasn't specific to the history allocation changes
19:13:48 that does sound basically like what we witnessed in production
19:13:58 ianw: was the negative available count logged? if so, we can check the logs and see if that happened
19:14:09 * ttx lurks
19:14:21 jeblair: no, i don't think that would show up in logs. that change puts in an error message for that situation
19:14:50 along with avoiding the negative
19:15:03 and we definitely were able to recreate the issue twice in production on consecutive restarts, then reverted the allocation history feature and it subsided on the next restart
19:15:06 so yeah, i'm open to ideas, but that was all i came up with
19:15:59 ianw: hrm, i think it's worth digging deeper to find either an error in the new code, or a change related to it that could have caused the issue -- the correlation is pretty strong
19:16:22 ianw: also, did you note the fact that it showed up on the second pass through the allocator (not the first)?
19:16:22 https://review.openstack.org/#/c/109890/1/nodepool/allocation.py <- makeGrants() is the suspect here
19:16:41 i've also restarted nodepool at least once more since, again under heavy volume, and didn't see the problem come back
19:17:02 w.grant(min(int(w.amount), self.available))
19:17:22 that really should clamp things to never grant more than available
19:18:37 i'm worried that we may have changed something fundamental with the allocator patch, and this would just mask that
19:19:04 i'll try to think deeply about it when i review it
19:19:26 o/
19:19:49 would it make sense to add some logging around there, unrevert the allocation history change, then try to get it to break in production again?
19:20:13 fungi: i can take that approach, currently the allocator doesn't really do any debug logging
19:20:13 assuming we can't come up with a legitimate way to recreate the conditions under which we saw it happen
19:20:45 it would probably be a somewhat more manageable condition when we're not sitting in the hallway at oscon
19:21:22 i don't think we need logging in the allocator, but if we're missing logging of inputs to it, we should fix that
19:21:44 i'm not generally a fan of "testing in production" but sometimes there aren't expedient alternatives
19:22:09 jeblair: in that case, I think https://review.openstack.org/#/c/109185/ is probably ok as is
19:22:26 it will error log if this negative condition comes about
19:22:42 we could run with that and see if it comes up
19:23:02 ianw: it also masks the problem. i may not be being clear.
19:23:17 maybe we're actually over-allocating but don't notice at the moment?
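
For context on the clamp being discussed above, a minimal Python sketch of the guard that the proposed change adds: log and zero out a negative available count before it reaches the grant math, and never grant more than was requested or than exists. This is not the actual nodepool allocator; the function name and structure are invented for illustration.

    # Rough sketch, not the real nodepool code: guard against a negative
    # available count being fed into the grant calculation.
    import logging

    log = logging.getLogger("allocation")

    def grant_for(requested, available):
        """Return how many nodes to grant for one request."""
        if available < 0:
            # The suspect condition: a negative available count passed into
            # the allocator. Log it loudly instead of letting it skew the
            # arithmetic for every subsequent request.
            log.error("negative available count %s passed to allocator",
                      available)
            available = 0
        # Never grant more than was asked for, nor more than exists.
        return min(int(requested), available)

In the review itself the equivalent clamp is the w.grant(min(int(w.amount), self.available)) line quoted above.
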
19:23:18 we _should_ be able to reproduce the problem entirely from the production logs that we have
19:23:34 if we are unable to do so, we should correct the logging so that it logs the necessary information
19:24:18 and if that's the case (we are missing log info (remember, i think this is unlikely)), then i'd be okay with a testing-in-production approach to collect the missing data
19:24:42 alright, let me go through nodepool.py and i'll send a change later to bump up the logging so we can see if this is true or not
19:24:43 but otherwise, i think we should dig deeper into reproducing it out of production
19:25:12 where are the production nodepool logs, can i see them externally?
19:25:20 ianw: if you need additional info out of our production logs i can get them for you
19:26:03 ok, thanks for the discussion. we can move on and i'll look at the log angle
19:26:11 cool, thanks
19:26:28 right now only the image update logs are published externally, though we've talked about making service/debug logs available too
19:26:44 #topic Puppet 3 Master (nibalizer)
19:27:08 fungi: thanks, that's what i thought. if i can help with publishing the logs somehow, let me know
19:27:28 nibalizer: last i recall, there was some serious weirdness about what version of puppet was actually getting installed
19:27:44 jeblair: I had a radical thought re: puppet 3 and puppetmasters...
19:28:02 but I can also hold off if you don't want radical thoughts in this meeting
19:28:39 mordred: the floor appears to be yours :)
19:28:42 jeblair: neat!
19:28:49 jeblair: what if we stop using puppetmaster altogether
19:29:01 the only benefit we really get from it is secret data
19:29:07 but we're driving puppet runs from ansible now
19:29:15 so we could have ansible run puppet apply passing in secret data
19:29:22 no more care about puppet2 v. puppet3 masters
19:29:25 only puppet apply
19:29:29 which is easier to test too
19:29:47 how does it supply the secret data?
19:30:21 the secret data filtering to only hand out the bits a given slave actually needs is a fairly significant feature
19:30:34 drop in a runme.pp with just the host's chunk of site.pp in it with values populated, then run puppet apply on that
19:30:39 and do the reports go through the puppetmaster, or do they go directly to puppetdb?
19:31:00 puppetdb is a thing I'd need to come up with an answer for
19:31:33 question is - should I bother POC-ing an answer?
19:31:59 mordred: that sounds like a fair response to the compartmentalized secret data issue, though there are also secret data that are shared across many/all hosts
19:32:22 mordred: so we'd probably want something more sophisticated than "edit 72 files to add someone to sysadmins"
19:32:25 yes
19:32:28 hai
19:32:30 I believe I'm never going to want to do that
19:32:34 sorry was cleaning kitchen
19:32:42 * mordred throws wet cat at nibalizer
19:32:48 jeblair: yes there is whitespace in your preferences file
19:32:56 mordred: i'm intrigued
19:32:59 nibalizer: it's not mine :)
19:32:59 * nibalizer throws the wet cat at apt
19:33:05 mordred: also what does this do to the "public hiera" plans for not-secret data?
19:33:11 jeblair: haha, ThePreferencesFile
19:33:15 jesusaurus: that should work identically
19:33:31 since the hard problem there was solving for making sure puppet apply worked
19:33:41 yeah, we can still stick hiera files on machines and have them refer to the contents
19:33:42 in this case, we'd move back to _only_ puppet apply needing to work
19:33:44 nibalizer: is that fixed in config yet?
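
To make the masterless proposal above concrete, a rough Python sketch of the idea: render a per-host runme.pp containing just that host's chunk of site.pp with its secret values filled in, then run "puppet apply" on it. In practice ansible would drive this; the class name, parameter, and file handling below are all invented for illustration and are not actual infra tooling.

    # Rough sketch only: masterless puppet apply driven from a generated
    # per-host manifest with secrets populated.
    import subprocess
    import tempfile

    # Hypothetical stand-in for one host's chunk of site.pp.
    MANIFEST_TEMPLATE = """
    class {{ 'openstack_project::example_service':
      database_password => '{database_password}',
    }}
    """

    def apply_host_manifest(secrets):
        """Write a one-host runme.pp with secrets populated and apply it."""
        manifest = MANIFEST_TEMPLATE.format(**secrets)
        with tempfile.NamedTemporaryFile(mode="w", suffix=".pp",
                                         delete=False) as f:
            f.write(manifest)
            path = f.name
        # Masterless run: no puppetmaster in the loop, so no puppet 2 vs
        # puppet 3 master compatibility to care about.
        subprocess.check_call(["puppet", "apply", path])
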
and have you spun up a node based on that to verify it works?
19:34:01 i spun up a trusty node and did the needful to it
19:34:06 verified the error you had
19:34:08 jeblair: ok. I'll sketch up something more than 3 seconds of babbling in a meeting to see if it can work sanely
19:34:12 then fixed it by fixing the white space
19:34:21 mordred: puppetdb and puppet apply work pretty well together
19:34:30 nibalizer: sweet
19:34:32 puppet apply nodes can report directly to the puppetdb server on what they've been doing
19:34:34 that's even easier then
19:34:49 nibalizer: okay, so we're probably ready to take another stab at this. i'm swamped now, but maybe later in the week.
19:34:51 also you asked about hitting puppetdb api from the puppet master
19:34:55 which is super easy
19:35:05 since you can use the certs that the puppet master already has for cert auth
19:35:16 ah - interesting
19:35:24 nibalizer: I will talk to you after meeting about that
19:35:28 do we currently allow anything to connect to puppetdb?
19:35:38 I'd like for our inventory plugin to talk to puppetdb instead of running puppet cert list
19:36:01 i think puppetdb port 8081 (https) is open to just about everyone
19:36:15 of course, if you don't have a cert, well ... "no ticket"
19:36:37 nibalizer: so puppetdb does cert verification against the puppet ca of nodes wanting to submit reports
19:36:37 ?
19:36:44 yup
19:36:54 mordred: so your idea would probably not eliminate the need for a puppet ca
19:37:00 plus ^^ if we're talking dangerous ideas we could actually shell out to hiera to collect the data to send to ansible to send to the nodes
19:37:25 running the puppetCA isn't a huge overhead, though
19:37:32 so we'd still need a puppet ca, but the server where it resides would no longer need to talk to other servers except to hand out signed certs when we launch new machines
19:37:38 point is, it's less overhead than running our own CA?
19:38:11 jeblair: the clients will verify that puppetdb's cert has been signed by the CA as well
19:38:16 we could possibly even just move the puppet ca to the puppetdb server
19:38:45 fungi: i think it basically reduces the potential vulnerability of having the puppetmaster protocol network accessible
19:39:00 fungi: what goal are you trying to solve by moving the puppetca?
19:39:07 fungi: probably best to keep it separate, and on the higher-security (less-services-running) but inaccurately named "puppetmaster"
19:39:08 at the end of the day the puppetca is just a folder
19:39:13 we could move it to jeblair's laptop
19:39:14 nibalizer: only if we wanted to ditch the extra server
19:39:34 oh, i don't think that's possible in how i'm interpreting mordred's plan
19:39:37 we still need the server there to run the out-bound ansible connections ...
19:39:44 plus a place to keep hiera data
19:39:48 yup
19:39:57 and somewhere to figlet SCIENCE | wall
19:40:06 critical services
19:40:23 nibalizer: great, now we need to change the master password
19:40:24 jeblair: well, if the puppet ca is only used any longer to identify slaves to the puppetdb service and vice versa, and doesn't actually get trusted to do things via the agent itself, then i'm not sure what we're protecting at that point besides a reporting channel
19:40:27 not solving for less servers - just _possibly_ giving us a slightly more flexible way to deal with things like puppet 2 -> puppet 3 -> puppet 4
19:40:43 btw - puppet 4 is coming ...
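
A minimal sketch of the inventory idea touched on above: query puppetdb's HTTPS API on port 8081 for the known nodes, authenticating with the certs the puppetmaster already holds, instead of shelling out to "puppet cert list". The host name, ssl paths, API endpoint, and field names here are assumptions and depend on the puppetdb version in use.

    # Rough sketch only: list nodes known to puppetdb using cert auth.
    import requests

    PUPPETDB_URL = "https://puppetdb.example.org:8081/v3/nodes"  # endpoint varies by version
    SSL_DIR = "/var/lib/puppet/ssl"
    HOSTNAME = "puppetmaster.example.org"  # hypothetical cert name

    def list_nodes():
        """Return the node names puppetdb knows about."""
        resp = requests.get(
            PUPPETDB_URL,
            # Client cert auth: puppetdb verifies these against the puppet CA.
            cert=("%s/certs/%s.pem" % (SSL_DIR, HOSTNAME),
                  "%s/private_keys/%s.pem" % (SSL_DIR, HOSTNAME)),
            # And we verify puppetdb's cert against the same CA.
            verify="%s/certs/ca.pem" % SSL_DIR,
        )
        resp.raise_for_status()
        # The node field is "name" in older puppetdb APIs, "certname" in newer ones.
        return [n.get("name") or n.get("certname") for n in resp.json()]
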
19:40:57 fungi: true, not much, but we need the other server anyway, so may as well keep it there
19:40:58 dun dun dun
19:41:13 mordred: it also eliminates a potential vulnerability
19:41:16 jeblair: yep, if we keep the server, might as well leave the ca on it
19:41:19 jeblair: ++
19:41:50 mordred: (the connect to puppetmaster and trick it into giving you secrets vulnerability)
19:42:04 jeblair: bah. it would never do that! :)
19:42:29 which is, incidentally, why we changed all of our creds due to heartbleed
19:42:42 also known as "release an openssl vulnerability poc and use up weeks of infra rood admin time"
19:42:52 s/rood/root/
19:43:05 okay, end of topic?
19:43:48 also, is pleia2 still away?
19:44:25 #topic Replacement for docs.openstack.org (AJaeger)
19:44:32 Ajaeger1: around?
19:44:39 yes, je
19:44:43 yes, jeblair
19:44:50 so we brainstormed at the qa/infra sprint
19:44:57 thanks for talking about this in Darmstadt!
19:45:02 and we came up with 2 ideas
19:45:50 I prefer one of them
19:45:51 1) docs jobs publish artifacts to swift (like we are starting to do with logs), then we write a simple zuul worker that fetches them and rsyncs them into a location on a static webserver
19:45:56 jeblair: when you get free time just ping me and we'll beat on p3 some more
19:46:03 * annegentle waves too
19:46:52 that's a super simple system that gets us the ability to automatically delete files, but otherwise isn't too different from what we have now
19:47:44 it also should be fairly easy to implement once we work out the kinks in the log publishing
19:47:51 (which we made substantial progress on in darmstadt)
19:48:06 So, for the rsync we need to do this on a directory by directory basis - with the projects publishing at separate times, e.g. infra-manuals to http://docs.openstack.org/infra/manual/ - this needs to be fine-grained enough.
19:48:35 Ajaeger1: yes, we'll be able to do that
19:49:00 so basically, the infra manuals rsync job will take what the build job built and rsync it to the correct subdir
19:49:15 similarly, stable branch docs jobs will rsync to the stable branch subdir
19:49:43 so we get the same kind of multi-version publishing we have now
19:49:47 jeblair: (we'll need to not delete branch subdirs when we rsync --delete)
19:49:48 but yeah
19:50:02 mordred: yeah, i think the root of the rsync will be at the branch subdir
19:50:10 but for the master rsync
19:50:11 We currently publish to e.g. /trunk/install-guide and /arch-design at the same time
19:50:23 mordred: yeah, we'll need to special case that
19:50:26 ++
19:50:33 yup. not hard - just mentioning it
19:51:09 * Ajaeger1 doesn't remember enough about rsync options to see whether this handles all special cases but let's discuss that separately
19:51:58 jeblair: what's option 2?
19:52:39 option 2) is to use afs. building and publishing actually get even simpler (it's just an rsync directly on the build host), but it requires an afs infrastructure which we don't have yet
19:53:05 we want that for several other reasons (including mirrors), so hopefully we'll have it at some point
19:53:05 one benefit of option 2 is that there are several _other_ things we'd also like to use that infrastructure for
19:53:08 jinx
19:54:12 morganfainberg: here
19:54:15 i'm not entirely sure we should hang docs publishing on that just yet though; i think we're probably closer to idea 1) in implementation, and there's actually a pretty good path from idea 1 to idea 2
19:54:23 ++
19:54:28 portante, ah can i snag you in #openstack-keystone real quick?
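
A rough Python sketch of option 1's publish step as described above: rsync what a docs build job produced into the right per-project subdirectory, with stable branches going to their own subdir and the master run excluding branch subdirs so --delete cannot remove them. The paths, names, and branch layout are invented, and the swift-fetch step is omitted; this shows only the shape of the mapping, not actual infra tooling.

    # Rough sketch only: publish a finished docs build to its subdirectory.
    import subprocess

    DOCS_ROOT = "/srv/static/docs"  # hypothetical docroot on the static webserver

    def publish(build_dir, project_path, branch_subdir=None):
        """Rsync a docs build into its per-project (and per-branch) subdir."""
        src = build_dir.rstrip("/") + "/"
        cmd = ["rsync", "-a", "--delete"]
        if branch_subdir:
            # Stable branch builds publish under their own subdir, keeping the
            # multi-version layout we have today.
            target = "%s/%s/%s/" % (DOCS_ROOT, project_path, branch_subdir)
        else:
            # Master publishes at the project root, e.g. /infra/manual/, and
            # must not wipe out the branch subdirs while running with --delete.
            target = "%s/%s/" % (DOCS_ROOT, project_path)
            cmd.append("--exclude=/icehouse/")  # one exclude per published branch
        subprocess.check_call(cmd + [src, target])
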
19:54:31 what's afs?
19:54:37 andrew filesystem
19:54:42 * morganfainberg doesn't want to bug -infra types too much
19:54:43 annegentle: similar to NFS
19:54:48 ah ok
19:54:53 #link http://en.wikipedia.org/wiki/Andrew_File_System
19:54:56 annegentle: it's a global distributed filesystem and it's teh awesome
19:55:18 named after andrew, heh
19:55:39 named after two andrews
19:55:41 both of them :)
19:56:20 anyway, i think the way to proceed is to let the log publishing settle out a bit, then work on the rsync worker and set up some test jobs on a server
19:56:33 So, what would be the next steps? I suggest that I write up a few examples (should have done that before Darmstadt) so that you see how we publish today to check that rsync can be setup correctly for this.
19:56:36 i'm not sure if we want to use static.o.o or make a new one (since static.o.o is getting fairly full)
19:56:54 and the rsync solution, how do you solve the problem of "there are no files to serve even if for a second"
19:57:17 annegentle: rsync can first sync new content and then remove old files.
19:57:33 annegentle: it will not delete a whole directory at once and then recreate and publish again...
19:57:35 Ajaeger1: ah, great, so ordered steps.
19:57:45 right, rsync is actually much safer than the current "overwrite files via ftp" model
19:57:51 annegentle: also, once we're on AFS, there is a really neat atomic volume publication feature that will avoid link moving race condition issues
19:57:56 annegentle: we can even run it twice: First sync over new content, second delete.
19:58:03 And then, do we have any storage cost considerations? And what happens if it gets full?
19:58:04 Ajaeger1: ++
19:58:18 annegentle: so far none of our clouds have said anything about cost ever
19:58:24 annegentle: Since we delete files directly, we should have less space issues ;)
19:58:47 at the moment we've got a 15tb cinder volume quota from rax, which we'll be emptying out soonish as we transition log storage to swift
19:58:55 er, 25tb
19:59:04 mordred: well that's a relief :) just don't want to be an outlier, since images and html may be bigger than log files? (not by much really though)
19:59:17 i bet they'll be much smaller, actually :)
19:59:20 annegentle: trust me, they won't be ;)
19:59:28 jeblair: fungi: ok good!
19:59:44 we have jobs uploading logfiles which are in the `00mb compressed range
19:59:50 er, 100mb
19:59:57 fungi: I think you were right the first time
19:59:59 sometimes in the 00mb range too
20:00:03 I think it sounds better than what we have! Which is the goal for starters. Thanks a bunch for all this, team infra.
20:00:05 0mb, now theres some compression!
20:00:08 jeblair: what time frame are we talking about and what would be the next steps besides my examples ?
20:00:25 oh look we are out of time
20:00:28 jeblair: and can it be well before release crunch time :)
20:00:41 shall we continue discussing on #openstack-infra?
20:00:51 i hope; let's regroup on the current log publishing status with jhesketh when he's around, then we can probably make an estimate
20:00:59 thanks everyone!
20:01:02 #endmeeting
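
A minimal sketch of the two-pass rsync mentioned at 19:57:56 above, so readers are never left without files to serve: first copy new and changed content over the existing tree, then make a second pass that removes files the new build no longer contains. Paths and names are illustrative only, not actual infra tooling.

    # Rough sketch only: two-pass rsync publish (sync first, delete second).
    import subprocess

    def publish_two_pass(build_dir, target_dir):
        src = build_dir.rstrip("/") + "/"
        # Pass 1: add and update files only. Nothing is removed here, so every
        # page currently published keeps being served during the sync.
        subprocess.check_call(["rsync", "-a", src, target_dir])
        # Pass 2: now remove leftovers the new build no longer contains. The
        # new content is already in place, so there is no window with missing
        # documents.
        subprocess.check_call(["rsync", "-a", "--delete", src, target_dir])
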