19:01:22 #startmeeting infra
19:01:23 Meeting started Tue Mar 4 19:01:22 2014 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:25 o/
19:01:26 The meeting name has been set to 'infra'
19:01:37 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:01:39 agenda ^
19:01:45 #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-02-25-19.02.html
19:01:48 last meeting ^
19:01:56 morganfainberg: I've got split attention - can you ping me later about whatever you're talking about re: manpage above?
19:02:11 #topic Actions from last meeting
19:02:17 mordred send new-project service degradation announcement
19:02:17 o/
19:02:18 mordred, absolutely
19:02:20 i think he did that
19:02:33 I saw it
19:02:34 I did something
19:02:43 mordred: you sent an e-mail
19:02:43 and i think on friday some projects were created...
19:02:50 how did all that work out?
19:02:55 they were thanks to fungi and anteaya's hard work
19:03:06 (a) do we need to alter the process?
19:03:20 clarkb has a patch
19:03:23 (b) did we learn anything that will help mordred in his work on manage_projects?
19:03:25 spotted some bugs, fixes have been proposed (i don't have links handy)
19:03:38 the process from last time worked well, i think
19:03:40 fungi: ah cool
19:03:42 that will be applied next time, that may mean we don't need to change the process, if it works
19:03:43 #link https://review.openstack.org/#/c/77314/
19:03:55 we should repeat, clarkb has a pending patch we didn't test last time because it came moments too late
19:04:12 and I think it may fix the bulk of the non-orchestration-related trouble
19:04:17 the other proposed patches were hand-applied on review.o.o and worked well
19:04:37 we still need some ordering around processes on different nodes, but the bugs outside of that seem to be getting killed \o/
19:04:44 i think i +2'd them all with comments if i used them during the run
19:04:45 awesome!
19:05:08 since i think the orchestration bits were what mordred had the most concrete ideas about, this should help a lot
19:05:08 all jeepyb patches
19:05:24 agree
19:05:39 I have a request for next round
19:06:01 if we can start earlier I can be here for it, my taxi picks me up at 6pm on friday EST
19:06:04 i have a couple of bugs flagged i need to pick up for initial group members for the projects we created on friday, but those will get wrapped up today
19:06:40 the next round will probably go faster. a lot of the time was spent assembling the list and doing last-minute reviewing
19:06:47 agreed
19:06:55 fungi: I will try to review and approve those that were tested
19:07:03 and we had a lot of changes waiting in the queue for that one since we'd put them off for some weeks
19:07:08 do we have an etherpad started for next round?
19:07:34 zaro: ping
19:07:58 we should probably copy the "needs work" section from the last one to seed the next
19:08:02 #link https://etherpad.openstack.org/p/new-projects-2014-02-28
19:08:02 ohh yeah. here
19:08:11 fungi: k
19:08:20 I can mix up a new etherpad
19:08:30 anteaya, fungi: thanks
19:08:34 let's move on...
19:08:39 #topic Convert gerrit db tables to UTF8 (zaro)
19:08:50 this time around it will be faster to build the list, since anybody who isn't on the old etherpad doesn't have an excuse for not setting an appropriate topic
19:08:53 zaro: what's the latest thinking on this?
19:08:54 ok. i think all the info is in the bug
19:09:00 let me find it
19:09:24 #link https://bugs.launchpad.net/openstack-ci/+bug/979227
19:10:07 so i believe we left off on jeblair wanting some more info on where the dups are in the conversion.
19:10:30 #link https://launchpadlibrarian.net/165584391/case_insensitive_dups.txt
19:11:18 oh - I think those are pretty easy to fix by hand :)
19:11:28 mordred: how can you fix them?
19:12:01 did you see the last sentence? "I'm not sure why line 1590340 is a duplicate because I could not find a duplicate entry for it but there are many more like this one."
19:12:01 the emails can just be fixed - you're right though - the usernames are a bit ugh
19:12:26 mordred: the emails can't be fixed, actually -- the localpart in emails is case sensitive
19:12:38 it seems that the problem is that utf8 is not case sensitive and there is not a general cs utf8 collation
19:13:29 right. so, the local part is case-sensitive - but I don't believe that gmail behaves that way. that leaves us with just Daviey who might be a problem
19:13:40 mordred: or others in the future
19:13:50 there has to be a right solution to this
19:14:25 what's the deal with the utf8_general_cs collation? http://bugs.mysql.com/bug.php?id=65830
19:14:43 how does that manifest in current and next ubuntu lts?
19:15:40 is there a problem with using utf8_bin ?
19:16:00 mordred suggested that before
19:16:37 mordred: what do you think about that?
19:16:40 bin doesn't understand sorting of international characters properly - but it should at least keep the things separate
19:16:52 I don't think username sorting is very important to us
19:17:26 mordred: i suspect you are right; i'm finding it difficult to come up with a place in gerrit that could bite us
19:17:58 zaro: do you want to try that on review-dev and see if any problems manifest?
19:18:21 zaro: (thanks for reminding me)
19:18:22 (i still think knowing the answer to whether utf8_general_cs will be available in next ubuntu lts would be useful)
19:18:44 we might be able to move to it later if it is
19:19:00 well, Davi Arnaut seems to believe that it's an experimental collation anyway
19:19:00 i believe that i was testing with review-dev data before testing with review data.
19:19:14 mordred: oh, so bad idea anyway?
19:19:15 i didn't have a problem with either when using utf8_bin
19:20:27 https://github.com/svagner/MM-Percona-Server/blob/master/config/ac-macros/character_sets.m4#L365
19:20:29 i mean no errors popped up during the conversion.
19:21:09 _cs has troubling comments around it in the build files :)
19:21:15 so I vote for just doing utf8_bin
19:22:02 unfortunately, I don't see any problems with utf8_bin, should I?
19:22:03 i wonder what gerrit actually expects?
19:22:53 it seems very strange to have case-sensitive usernames. but it also seems strange to have case-insensitive change messages and emails...
19:23:02 perhaps there is no coherent intent. :/
19:23:11 anyone object to utf8_bin?
19:23:30 seems okay to me
19:23:42 it's really only relevant for sorting and for unique keys
19:23:47 i suspect they just didn't put much thought into it
19:23:57 utf8-bin will still work for unique keys
19:24:05 sorting might be weird in some edge-case contexts
19:24:18 but would work in most cases as long as the fields are consistentish?
19:24:19 because sorting will be essentially done numerically by underlying hex code
19:24:30 ah, so anything ascii would be fine
19:24:34 pretty much
19:24:42 and non-text fields would still sort normally
19:24:47 yes
19:24:48 clarkb: though it's asciibetical not alphabetical
19:24:53 ABC,abc
19:25:03 * fungi prefers c sort order anyway ;)
19:25:15 yeah
19:25:17 but
19:25:26 we don't REALLY show alpha-sorted lists
19:25:47 right, most sorts in gerrit are changenumber/changeid or date based
19:25:49 so it seems like it's worth a try, and if sorting strangely does show up some place, we can consider going back to latin1 or breaking the case-sensitive fields
19:26:27 i have to wonder whether any part of gerrit which cares about text sorting actually asks mysql to sort the results and uses that straight in the ui anyway
19:26:39 clarkb: changeid (eg hexsha) sorts could be affected, but sorting a uuid seems weird.
19:26:44 in all probability they perform their own sorting on the requery results
19:26:48 fungi: would not surprise me
19:26:49 query results
19:27:07 #agreed convert gerrit tables to utf8_bin collation
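
For reference, the kind of duplicate check and conversion agreed above might look roughly like the following. This is a sketch only: the account_external_ids table and external_id column are assumptions based on the Gerrit 2.x ReviewDB schema, pymysql is just one client library choice, and the real migration would need to cover every table, during a maintenance window with a fresh backup.

    # Sketch: find rows that would collide under a case-insensitive unique key,
    # then convert a table to the case-sensitive (binary) utf8 collation.
    # Table/column names assume the Gerrit 2.x ReviewDB schema; verify first.
    import pymysql

    conn = pymysql.connect(host='localhost', user='gerrit',
                           password='secret', database='reviewdb')
    try:
        with conn.cursor() as cur:
            # Rows whose external_id differs only by case would violate a
            # unique key under utf8_general_ci, but not under utf8_bin.
            cur.execute("""
                SELECT LOWER(external_id), COUNT(*)
                FROM account_external_ids
                GROUP BY LOWER(external_id)
                HAVING COUNT(*) > 1
            """)
            for lowered, count in cur.fetchall():
                print('potential collision: %s (%d rows)' % (lowered, count))

            # The conversion itself (repeated per table):
            cur.execute("""
                ALTER TABLE account_external_ids
                CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin
            """)
        conn.commit()
    finally:
        conn.close()

As discussed above, utf8_bin keeps unique keys working and case-distinct values separate; the trade-off is that any string sort done by MySQL becomes byte-order ("asciibetical") rather than locale-aware.
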
19:27:19 #topic Upgrade gerrit (zaro)
19:27:32 zaro: i'm guessing you're blocked on the buck stuff, yeah?
19:27:50 nope. although az2 has been a pain in the butt
19:28:04 no kidding
19:28:16 oh right, I was going to try and look into that more today
19:28:18 anyways, i believe all patches that are required are ready for review and just waiting for you gents to review them
19:28:54 okay. nice.
19:29:20 #topic Removing openstack-ci-admins ML from LP (fungi)
19:29:23 i've had shell loops running for the past 24 hours trying to get image rebuilds in az2 to stick
19:29:34 once all patches are merged then we can re-puppet review-dev to see if it all works.
19:29:35 i should have more accurately made that "deactivating openstack-ci-admins ml on lp"
19:30:03 since the creation of the openstack-infra list on lists.o.o we have 4 messages in the archive for the openstack-ci-admins list
19:30:25 but we keep getting people caught in moderation e-mailing it about gate failures and requests for assistance on gerrit accounts
19:30:48 so it's an attractive nuisance. i think we should disable it (the archives would still be published for historical purposes)
19:30:50 we might have that in some docs somewhere... :/
19:30:58 but wanted to see whether there are objections to that
19:31:11 that works for me. the infra list seems a reasonable place for that now.
19:31:15 fair point. i'll search the wiki and git repos
19:31:18 I don't object but we should grep for places we may be advertising it
19:31:42 nibalizer: ping? (i think you said you had to run...)
19:31:46 #action fungi check for remaining recommendations of openstack-ci-admins
19:31:55 #action fungi disable openstack-ci-admins list
19:32:51 #topic Monitoring of Infra Resources / Systems (morganfainberg)
19:32:55 o/
19:33:12 morganfainberg: what's on your mind?
19:33:47 jeblair: im about
19:33:58 There was a brief discussion that we might want to start adding monitoring of Infra resources (e.g. bots) and possibly some aggregation alarms (not pager duty, but at-a-glance we've hit a threshold) that take into account more than the individual cacti graphs
19:34:01 nibalizer: cool, we'll come back to you in a bit
19:34:09 pleia2 said she had some thoughts on this as well
19:34:42 nothing has been hashed out yet, but I'm inclined to say we should set up Nagios for some monitoring
19:34:48 this was added as an introduction to the additional monitoring - unfortunately I don't have much more at this point.
19:34:58 with the milestone my focus is a little split :)
19:35:48 I see this potentially as supplementing the status pages
19:35:56 clarkb, ++
19:35:59 we don't actually have a bug for setting up monitoring beyond cacti
19:36:07 okay, let me share my thoughts -- i'm not opposed to monitoring; i think it's very important (i set up cacti and graphite after all)...
19:36:12 instead of needing humans to ping us and say "is X broken" 500 times during an outage, have automated checks that update a state page that everyone can check
19:36:12 however, i think it would be a pretty significant time sink to tune, groom and polish
19:36:19 i also don't want PagerDuty-esque stuff
19:36:25 we're volunteers, we don't need that
19:36:41 but i'm a little skeptical about the traditional nagios style monitoring..
19:36:46 morganfainberg: agreed about pager duty
19:36:48 fungi, it won't happen overnight. and i wouldn't expect it to
19:36:49 a monitoring system which is half red and 95% is from false negative results is of use to no one
19:36:54 fungi: so I think we start out with a pretty minimal setup, checking disk space and ping kind of thing
19:36:57 i also just deleted 15,000 emails from our servers that i have not read
19:37:01 disk space has bitten us more than once
19:37:33 sure--we already have all that information published and available, but not enough people to sit and stare at it
19:37:35 there are some nagios bots for IRC, so instead of email it could alert to IRC
19:37:42 and my previous experience with things like nagios is that you spend a _lot_ of time adjusting paging thresholds for things like disk space and dealing with false positives..
19:37:56 * anteaya continues to stare at zuul status
19:38:11 i've worked jobs where those sorts of systems were a great advantage. we also had a noc with 50+ people staffed around the clock and a department to keep things from firing incorrectly
19:38:28 last I used it I was managing over 100 servers, but we don't have that many and we can target specific ones to monitor that we might be concerned about
19:39:01 well, we actually have way more than 100 servers, but we may not care about much in the way of live metrics on most of them
19:39:07 heh
19:39:20 right, well, "static" servers :)
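
The "start super small" checks mentioned a few lines up (disk space and ping-style reachability) need very little machinery. A minimal sketch, where the host, port, and the 90% threshold are placeholders and the disk check only covers the machine the script runs on:

    # Sketch: minimal disk-usage and reachability checks of the kind discussed
    # above. Hosts, ports and the 90% threshold are placeholders.
    import shutil
    import socket

    def disk_ok(path='/', threshold=0.90):
        """Return False if the filesystem holding 'path' is over threshold."""
        usage = shutil.disk_usage(path)
        return (usage.used / usage.total) < threshold

    def reachable(host, port=22, timeout=5):
        """TCP-connect check as a stand-in for ping (no raw sockets needed)."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == '__main__':
        checks = {
            'local disk': disk_ok('/'),
            'review.openstack.org gerrit ssh': reachable('review.openstack.org', 29418),
        }
        for name, ok in sorted(checks.items()):
            print('%-35s %s' % (name, 'OK' if ok else 'PROBLEM'))

Checking disk on remote hosts would need an agent or the existing cacti data, which is part of the tuning cost raised in the discussion.
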
19:39:49 here is my question, what will change as a result of this info?
19:40:03 anteaya: we get alerts if a server goes offline, or disk fills up
19:40:07 fungi, i think this is something we start super small with, hit the bigger ticket items and use it in addition to status pages to help us identify issues a bit earlier than "oops" or "hey X is broken" (from 1000 people)
19:40:22 morganfainberg: yeah, that's what I'm thinking
19:40:32 and yes, we do have actual disk space utilization data, network interface error stats, and so on trended and accessible in raw form from cacti, which could be consumed by anyone wanting to give us a heads up on broken things too, which might be a good place to start
19:40:35 but if a server goes offline, someone posts in infra
19:40:41 anteaya: sadly now, we sometimes find out when someone joins channel and complains about something not working :\
19:40:48 and as a team, we have fairly good channel coverage
19:40:52 i like monitoring, but i'm not keen on alerting in our environment. i'd be more happy with status pages that anyone can check. i'm less keen on email or irc alerts.
19:40:57 anteaya: that's kind of embarrassing
19:41:07 pleia2: why? they usually tell us before nagios would anyway
19:41:09 so the issue is there are some normal fail modes, which mostly I find at 6am over coffee. So sean-nagios is something I'd like to stop doing
19:41:10 I don't see how that is going to change
19:41:19 sdague, ++
19:41:28 sdague: normal failure modes should be corrected...
19:41:29 since someone telling us will probably happen at the same time as the alert anyway
19:41:32 sdague, not that we want you to stop being you...or stop enjoying coffee
19:41:32 take the logs as an example
19:41:45 jeblair: hurts my sysadmin feelings, I should know what my servers are up to, not have users tell me
19:41:49 jeblair, i think this can be used to help also identify the normal fail modes over a longer period of time.
19:41:52 sdague: you would prefer to find those failures neatly organized on a status page over coffee instead?
19:41:53 jeblair, with the right tool.
19:42:03 sdague, morganfainberg: we have had the log server fill up on disk space before. our solution to that is to _stop using the log server and put logs in swift_.
19:42:07 fungi: yeh, so I don't need to spend brain power deducing them
19:42:18 how many times has X failed?
19:42:28 or - hey... is my irc bot dead?
19:42:40 sdague, morganfainberg: that's a big project and is going to take some time, but putting our _very_ limited resources into writing and reviewing and implementing that change is WAY better in my opinion than investing in monitoring it
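
For context on the logs-in-swift point: the proposed change amounts to having jobs publish their logs to an object store instead of a single static log server. A rough sketch with python-swiftclient, where the auth details, container, and object names are all placeholders (a real deployment would more likely authenticate via Keystone and be driven from the job itself):

    # Sketch: upload a job's console log to a Swift container instead of
    # copying it to the static log server. All credentials are placeholders.
    from swiftclient.client import Connection

    conn = Connection(authurl='https://swift.example.org/auth/v1.0',
                      user='account:logs-uploader',
                      key='secret')

    conn.put_container('logs')
    with open('console.html', 'rb') as f:
        conn.put_object('logs', 'check/job-name/1234/console.html',
                        contents=f, content_type='text/html')
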
19:42:41 debug irc bot, apply fixes, repeat
19:42:45 an irc bot for alerting could squawk in a different channel than the -infra channel
19:42:52 that way it's very opt-in
19:43:08 nibalizer: i especially love the circular meta concept of an irc bot warning you that your irc bots are broken ;)
19:43:10 i would be more inclined to check that, especially it being event driven, than a status webpage
19:43:33 fungi: hehe
19:43:38 fungi: no worse than the nagios email to tell you email is down
19:43:50 jeblair, this is why i brought it up. i know we have limited resources, but it's worth considering
19:44:01 jeblair, even if the answer is "not now, maybe later"
19:44:12 nibalizer: like jeblair, i basically already have no time to read all the e-mails our systems send me. so sure, no worse than that ;)
19:44:19 jeblair, or "let's do something else and see if we need it still down the line"
19:44:29 fungi: honestly, I don't want it as alerts, I want a status page
19:44:39 sdague, ++ that was my initial thought
19:44:42 nagios can do a public-ish status page
19:45:03 pleia2, we could use any number of tools for it.
19:45:05 we already have all sorts of data we put in zuul status to let us know when things are crazy
19:46:02 and I think there is a class of other things where knowing if something just busted ends up saving me 2-3 hrs of debugging before fungi gets up and can check on something
19:46:12 pleia2: will it offend your inner sysadmin if that page is red all the time? how much time will you spend adjusting filesystem usage thresholds?
19:46:31 sdague: let's try to identify what those are and if we can expose them specifically...
19:46:38 jeblair: I guess I haven't had the same experiences as you, my Nagios is quite green
19:46:49 i think if there are people who want to spear-head adding a nagios server and doing the tuning necessary to get it usable, then i'm not directly opposed... but that's probably lots and lots of little reviews to tweak the configuration accordingly
19:46:58 jeblair, pleia2, i've had both experiences
19:47:12 jeblair: sure, my instinct would be this is incremental, in the same way that something like er was
19:47:22 jeblair, pleia2, the "everything red" experience is usually because you try and add everything at once and never get any of them right
19:47:27 morganfainberg: yeah
19:47:37 jeblair, or no time to spend on it at all
19:47:49 jeblair, some orgs are like that :P
19:47:56 most of the problem is that there are things which are easy to monitor with very low false negative rates, but those are also the things which just about never have a problem. the things which are more useful to find out about are also the things which need a lot of thought around thresholds
19:48:22 fungi: agreed (with the last 2 things you said)
19:48:38 fungi: yeh, I don't care about the easy to monitor things. I care about the things that I bug you about. Like er bot.
19:48:49 sdague, ++
19:49:18 sdague: i suspect what you want to know about er-bot is almost impossible to monitor in nagios...
19:49:28 sdague: and i think that's an example of "let's figure out why this is broken" while finding out a little more quickly that it broke isn't quite as beneficial over the long term
19:49:31 jeblair: out of the box, for sure
19:49:33 yeah, I have bot monitors but they're all "see if this process is running"
19:50:27 pleia2: they are almost always running. sometimes they are netsplit, sometimes they are stuck due to an irc library bug. i just yesterday started upgrading irclib to see if we can fix that...
19:50:31 yeh, I could write a plugin for this. The biggest thing right now is there was really no infrastructure to register against.
19:50:35 jeblair: yeah, that's what I was afraid of
19:50:39 but all that came from reading log files
19:50:44 * pleia2 nods
19:51:16 sdague: and the solution to "we hit an irclib bug" isn't to write a nagios plugin
19:51:18 but I mostly consider this unit testing for some services
19:51:28 jeblair: sure
19:51:41 sdague: maybe we should create a wishlist bug to brainstorm on?
19:51:51 then we can come back to this and evaluate
19:52:18 pleia2, sdague, i like that.
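
A status-page flavour of the bot checks described above, rather than alert e-mails, could start as a cron job that writes check results to a static page. A sketch, with the process names and output path as placeholders; as noted in the discussion, a running process does not prove an IRC bot is actually healthy (netsplits, stuck irclib), so this only catches the simplest failure mode:

    # Sketch: "status page, not alerts" -- run a few process-liveness checks
    # and write the results to a static HTML page anyone can look at.
    # Process names and the output path are placeholders.
    import datetime
    import subprocess

    CHECKS = {
        'elastic-recheck bot': 'elastic-recheck',
        'gerritbot': 'gerritbot',
        'statusbot': 'statusbot',
    }

    def process_running(name):
        """True if pgrep finds at least one process matching 'name'."""
        return subprocess.call(['pgrep', '-f', name],
                               stdout=subprocess.DEVNULL) == 0

    def write_status(path='/srv/static/status/checks.html'):
        rows = []
        for label, proc in sorted(CHECKS.items()):
            ok = process_running(proc)
            rows.append('<tr><td>%s</td><td>%s</td></tr>'
                        % (label, 'OK' if ok else 'DOWN'))
        html = ('<html><body><h1>Service checks</h1>'
                '<p>Generated %s UTC</p><table>%s</table></body></html>'
                % (datetime.datetime.utcnow().isoformat(), ''.join(rows)))
        with open(path, 'w') as f:
            f.write(html)

    if __name__ == '__main__':
        write_status()
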
19:52:57 okay, i don't think we have consensus, but i think you know the concerns. and i'd like to move on to the next topic
19:52:59 well, in some organizations "monitor to let us know every time this fails" is normal operating procedure, which stems from never having time to debug failures and just accepting that things break and you're going to fight fires, bring them back up asap and maybe sometimes discover why they broke in the process
19:53:15 #topic Infrastructure Priorities
19:53:16 fungi: yeh, I don't think we are that org though
19:53:25 #link https://etherpad.openstack.org/p/infrastructure-priorities
19:53:33 there are a lot of new people around...
19:54:01 and when we say things like something is or isn't a priority for our limited resources, it's possible it seems like we're making it up as we go along...
19:54:04 but we aren't. :)
19:54:19 ++
19:54:21 it turns out we set priorities fairly clearly actually at the summits
19:54:42 but we're really bad about communicating those between summits
19:54:51 i'm hoping storyboard will make that better
19:54:58 I was just about to say storyboard
19:55:07 yay storyboard!
19:55:11 mordred: it's there!
19:55:14 but for now, here's a list of things based on the last (approximately) 2 summits
19:55:32 * ttx sighs
19:55:38 so if you're looking for something to work on in infra, check the bugs, but these are our highest priority items
19:55:47 * mordred hands ttx some wine and cheese and a sausage
19:55:55 * mordred agrees
19:55:57 and if you're looking to prioritize reviews, this same list will help
19:56:10 * mordred supports everyone who is in channel working on the list
19:56:26 * mordred will consider kindly anyone who does
19:56:57 and understand that things not on the list may well be good ideas, but i'm personally always thinking about this when i'm writing and reviewing
19:57:06 ok
19:57:16 and would love it if we could focus on completing some of these before we get too distracted
19:57:47 fwiw puppetboard 1,2 are done and 3 is in review
19:57:53 \o/
19:58:22 jeblair: thank you for the reminder, I need to hunker down on the last backup bits and close that bug
19:58:24 nibalizer: yeah, i think that's going to unblock a lot of work
19:58:29 fungi: did you ever have a patch up on this? "Write Jenkins Job which sends the salt command from salt-trigger slave"
19:58:42 fungi: after feature freeze do you think we can sit together and get the sensitive stuff backups sorted?
19:59:01 anteaya: that part is a) extremely trivial at this point and b) useless until we can have the reactor-based dependencies implemented
19:59:05 with FF this week, I'm also hoping we can do another bug day next week - Tuesday March 11th at 17:00 UTC
19:59:09 fungi: k
19:59:18 clarkb: sure thing
20:00:11 thanks everyone, sorry we didn't get to nibalizer's thing, but i think that topic will be more relevant next week; hopefully we'll have a puppetboard by then!
20:00:19 #endmeeting