19:02:53 #startmeeting infra
19:02:53 Meeting started Tue Aug 27 19:02:53 2013 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:02:54 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:02:56 The meeting name has been set to 'infra'
19:03:08 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:03:08 \o
19:03:12 #link http://eavesdrop.openstack.org/meetings/infra/2013/infra.2013-08-20-19.01.html
19:03:26 #topic Operational issues update (jeblair)
19:03:55 so i figured let's start with updates on all the exciting things from last week
19:04:18 nodepool is now running the code in git master
19:04:23 and the config file in puppet
19:04:39 so system administration on that host has returned to normal (fully puppet managed)
19:04:54 and it seems to be doing a pretty good job
19:05:16 there is one new change that went into a restart this morning that hasn't seen production testing, and that's fixing the image cleanup code
19:05:35 so that will either work, do nothing, or delete all the images nodepool uses and everything will stop.
19:05:49 we'll know soon. :)
19:05:58 let's hope for one of the first two options
19:06:06 jeblair: so much confidence :)
19:06:27 jeblair: can you link to the change?
19:07:01 #link https://review.openstack.org/#/c/43623/
19:07:48 clarkb: after we fixed git.o.o, i think the last lingering issues we know about were unstable jobs due to static.o.o and lost jobs due to jenkins not being able to talk to slaves
19:07:54 sound about right?
19:07:59 jeblair: yup
19:08:13 both of which should be addressed as of this morning right?
19:08:17 i believe we have worked around the lost jobs issue by having zuul detect that situation and re-launch the job
19:08:37 that change has been in production for a bit
19:09:03 long enough that i believe i've seen it work (it takes a bit to track down because it does try to be invisible to the user)
19:09:31 and then for static, we moved our intensive filesystem maintenance (compressing and deleting logs) to the weekend
19:09:43 which is a stopgap, but a good one, i think.
19:10:05 and you also spun up a new larger static node with working ipv6 and grew the filesystems that store data
19:10:06 it should be fine until we have a smarter log receiving/publishing service
19:10:24 did the new node get a AAAA record in DNS?
19:10:26 clarkb: yes, the additional cpus on static should help if we see contention there
19:10:40 clarkb: yes, that happened too
19:11:09 so the status and logs servers are now reachable via ipv6
19:11:14 and pypi
19:12:00 i think all of the bottlenecks we saw last week have been addressed, and so we're pushing further up the stack
19:12:12 the current bottleneck is zuul preparing to merge changes
19:12:30 it takes 1 minute to process a change for nova before it even starts tests
19:12:43 we just merged some patches to zuul to make that much, much smaller
19:13:01 and i plan on restarting zuul this afternoon to pick it, and a bunch of other small bugfixes, up
19:13:21 it will be a disruptive restart, because the graceful shutdown is currently broken, but that's one of the bugfixes
19:13:31 so hopefully it'll be better
19:13:53 it will be nice to get those fixes in
19:14:09 and i think after that, we'll probably be pretty selective about zuul upgrades as we approach h3
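[As a concrete illustration of the weekend filesystem-maintenance stopgap discussed above, a crontab along the lines of the sketch below would compress week-old logs and prune very old ones only on Saturdays. The paths, retention windows, and times here are assumptions for illustration, not the actual static.openstack.org configuration.]

    # Hypothetical weekend log-maintenance entries (paths and retention are guesses).
    # Running only on Saturday keeps the heavy gzip/delete I/O off weekday test runs.
    0 6 * * 6   root  find /srv/static/logs -type f -name '*.txt' -mtime +7 -exec gzip -9 {} \;
    0 12 * * 6  root  find /srv/static/logs -type f -mtime +180 -delete
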
19:14:57 clarkb: want to describe the current git server config?
19:15:04 jeblair: I think we are getting really close to being selective about all changes
19:15:07 sure
19:15:22 http://ci.openstack.org/git.html has super basics
19:15:28 (might want to update it now :))
19:15:52 one of the bottlenecks we ran into last week was fetching git refs from review.openstack.org. It caused load averages >200 on review.o.o frequently which was bad for tests and reviewers
19:16:38 to work around this we pushed pleia2's cgit server into production quickly but it quickly got bogged down as well. To work around that we put an haproxy load balancer in front of 4 identical cgit servers
19:17:18 today we have one haproxy node balancing 4 git servers. The git servers are running git-daemon, apache, and cgit for all of your git needs (git:// http(s) cgit browsability)
19:17:51 In getting that going we discovered that having a lot of loose refs files made git on centos very slow. So we are packing all refs once per day and it makes a major difference
19:18:18 #link http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=2
19:18:21 and that's only on the mirror; the repos in gerrit still have their refs unpacked
19:18:37 (which has come in handy in the past, when we removed all the zuul refs)
19:18:47 #link https://git.openstack.org/cgit
19:19:21 thank you pleia2 for getting the base puppet stuff for that going. It ended up being quite flexible when we needed haproxy in front of it
19:19:21 yeah, that graph suggests we're a little overprovisioned, but i think that's a good place to be for h3, so i'm in favor of leaving it as is and seeing how those 4 servers perform
19:19:29 jeblair: ++
19:20:20 oh, according to rackspace there could be network disruption tomorrow
19:21:11 August 28th from 12:01 - 4:00 AM CDT
19:21:36 reading the announcement it didn't appear like it would be serious
19:21:50 but the possibility for network outages of up to 10 minutes is there
19:22:22 anything else about operational issues?
19:22:59 ttx: ^ there's an update to catch you up
19:23:22 #topic Backups (jeblair)
19:23:31 this may be more of a clarkb topic at this point
19:23:51 (but i'll just add that with groups-dev, we may have our first trove database that we want to backup)
19:24:12 we have a puppet module that adds a define to mysqldump mysql servers and gzip that dump allowing us to do our own backups
19:24:17 jeblair: ack
19:24:44 it is currently running on etherpad and etherpad-dev. A change to make the cron quiet merged Sunday so I think it is ready to go onto review-dev and review
19:25:03 clarkb: and maybe add it to wiki (which is already being backed up)?
19:25:05 it will need a little work to backup trove DBs but nothing major (use username and password instead of a defaults file)
19:25:24 jeblair: oh yes. Are there any other hosts running mysql that need backups?
19:25:28 clarkb: on what host do you think we should do the mysqldumps for trove?
19:25:39 clarkb: etherpad?
19:26:01 jeblair: I was thinking that running the trove backups on the server consuming the trove DB would be easiest to keep the DB backups with the backups for that server
19:26:14 jeblair: but that assumes one trove DB per app and not multitenancy
19:26:14 clarkb: sounds reasonable
19:26:34 clarkb: i think we can do that.
19:26:49 that way you don't have to think too hard in a recovery situation
19:27:46 list of things that need backups: review(-dev), wiki, etherpad(-dev)
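[For reference, the mysqldump-and-gzip backup define discussed above reduces to a nightly cron entry roughly like the sketch below. The destination path, schedule, and defaults-file location are assumptions for illustration, not the actual values from the puppet module.]

    # Hypothetical nightly dump of all databases, gzipped in place (paths and times are assumed).
    0 3 * * *  root  mysqldump --defaults-file=/etc/mysql/debian.cnf --all-databases | gzip -9 > /var/backups/mysql/all-databases.sql.gz
    # For a trove database, the defaults file would be replaced with explicit
    # --host/--user/--password options, as noted in the meeting.
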
19:27:50 #topic Tarballs move (jeblair)
19:28:03 i think we decided to defer this for a while, maybe till after h3, yeah?
19:28:21 yeah, it isn't super important but is definitely nice to have
19:28:28 #topic Asterisk server (jeblair, pabelanger, russelb)
19:28:36 * russellb perks up
19:28:39 so this took a back seat to everything blowing up last week
19:29:01 but i'm thinking i should be able to spin up those other servers today
19:29:12 cool, sounds good
19:29:23 so we can test if the latency is better from a couple different network points
19:29:33 jeblair: will that include hpcloud server(s)?
19:29:49 need to identify some sort of ... metric for how to compare the different test systems
19:29:57 really what we're after is audio quality
19:30:09 system load didn't seem to be a big concern
19:30:28 but we don't really have a good tool other than our perceived quality of the call
19:30:35 clarkb: sure; it'll be good to collect data. if we _love_ them at hpcloud then we'll have to deal with the question of hpcloud's SLA, but we can kick that down the road
19:31:02 russellb: i agree, especially since actual network latency between the voip provider and pbx was minimal
19:31:13 yeah, i don't think that was it ...
19:31:41 russellb: so we may need to have a series of calls and do a subjective test?
19:31:56 yeah
19:32:25 i'm planning on varying the size and data center of the servers i spin up
19:32:33 where can we post the agreed upon times for testing the calls? can we put them on the wiki page for now?
19:33:07 #link https://wiki.openstack.org/wiki/Infrastructure/Conferencing
19:33:25 anteaya: or we could send out another email to the -infra list
19:33:31 okay
19:34:06 anything else on this topic?
19:34:43 #topic open discussion
19:35:02 This is a long weekend for those of us in the US
19:35:32 i will be on vacation starting tomorrow until 9/4
19:36:08 new patch was submitted to upstream gerrit.
19:36:13 #link https://gerrit-review.googlesource.com/#/c/48254/8
19:37:27 zaro: neat, looks like david pursehouse is working with you on it
19:38:25 jeblair: yes, looking good so far, got one +1, and one -1 since new patch. -1 was just a nit pick.
19:39:48 anything else?
19:40:05 great work this week jeblair and clarkb
19:40:16 *applause*
19:40:40 nothing from me
19:40:45 anteaya: thanks for your help :)
19:40:49 thanks jeblair
19:40:51 welcome
19:40:57 :D
19:41:02 #endmeeting