Wednesday, 2015-05-13

*** anteaya has quit IRC05:26
*** anteaya has joined #openstack-infra-incident05:29
*** jhesketh has quit IRC10:25
*** jhesketh has joined #openstack-infra-incident10:31
*** fungi has joined #openstack-infra-incident21:12
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots"21:12
*** clarkb has joined #openstack-infra-incident21:15
fungisame bat time, different bat channel21:15
clarkbhow do we want to do this, just claim nodes off the etherpad?21:16
fungiprobably, after a little planning21:16
pleia2clarkb: unrelated to dns, gid woes means I still don't have an account on git-fe01 & 0221:16
clarkbpleia2: oh darn21:16
pleia2we'll fix that up later21:17
fungioh, right, we've not cleaned up the uids/gids on those two have we?21:17
clarkbapparently not which makes it hard for pleia2 to do the load balancing dance for git* safe reboots21:17
pleia2right21:17
clarkbwhy don't I go take a look and see how bad the gid situation is right now21:18
clarkbmaybe we can fix that real quick21:18
mordredhey all21:18
pleia2welcome to the party, mordred21:18
lifelessso what is venom ? I can't read the ticket21:18
mordredlifeless: it's the latest marketing-named CVE21:19
clarkblifeless: its VMs can break into the hypervisor via floppy drive code in qemu21:19
pleia2lifeless: guest-executable buffer overflow of the kerney floppy thing21:19
lifelessclarkb: LOOOOOOOL21:19
pleia2kernel too21:19
*** zaro has joined #openstack-infra-incident21:19
anteayahttp://seclists.org/oss-sec/2015/q2/42521:19
clarkblifeless: so we can reboot gracefully or we get rebooted forcefully in ~24 hours21:19
lifelessanteaya: thanks21:19
anteayawelcome21:19
lifelessmuch brilliant, such wow21:19
pleia2apparently floppy stuff is hard to remove from the kernel (I was surprised it was included at all in base systems)21:20
fungiclarkb: yeah, remapping the uids/gids on those should be relatively trivial (i hope)21:20
fungipretty sure she's just conflicting with some unused account called "admin"21:21
pleia2admin is the fomer, only used to-be-backwards compatible, sudo group on ubuntu21:21
*** SpamapS has joined #openstack-infra-incident21:21
pleia2former21:21
pleia2unlikely that we're using it21:21
clarkbfungi: group admin has gid 2015 which is pleia2's group gid21:21
clarkbfungi: but sudo group needs to be moved too21:22
clarkbso basically I need to find where group owner is 2015 or 2016, make sure I am root so I can chown files after the regiding (as I may break sudo) then run puppet21:22
fungik. neither of those should actually own any files, i don't think21:22
clarkbfungi: I am going to find out shortly :)21:22
*** nibalizer has joined #openstack-infra-incident21:22
fungithat sounds like a good plan21:23
* mordred isn't QUITE back online yet - just got back from taking sandy to the airport - will be useful in a few ...21:23
fungiif nobody disagrees that the entries marked with * can be rebooted now, i'll go ahead and start in on those. take note that we need to not just reboot them from the command line. we need to halt them and then nova reboot --hard21:24
clarkbsudo find / -gid 2015 and sudo find / -gid 2016 don't find any files on git-fe01, so I guess its time to become root, change gid, then puppet21:25
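
A minimal sketch of the re-gid procedure being planned here, assuming the conflicting groups are admin (gid 2015) and sudo (gid 2016); the replacement gids 3015/3016 are placeholders, and this should be run from a real root shell since moving the sudo group can break sudo part-way through:

    # move the conflicting gids out of the way so puppet can create the user's groups
    groupmod -g 3015 admin
    groupmod -g 3016 sudo
    # re-own anything still referencing the old gids (the find above turned up nothing)
    find / -xdev -gid 2015 -exec chgrp -h admin {} +
    find / -xdev -gid 2016 -exec chgrp -h sudo {} +
    # let puppet (re)create the user and groups with the now-free gids
    puppet agent --test
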
clarkbfungi: go for it, maybe put rebooting process on etherpad for ease of finding21:26
pleia2I think stackalytics.openstack.org should be safe to reboot too21:26
fungiwill do21:26
fungioh, right. that's not actually production anything21:26
jeblairso for design-summit-prep, i'm not sure what to do... i think we can/should reboot it, but none of us actually knows anything about the app on there, so i don't know what happens when we reboot it and it doesn't come up21:28
jeblairwe should really stop just handing root to people21:29
mordred++21:29
*** greghaynes has joined #openstack-infra-incident21:29
fungithat one's been in a transitional state waiting on ttx to work with someone on writing puppet modules for the apps on there21:30
clarkbpleia2: can you hop on git-fe01 and check that it works for you? sudo too?21:30
jeblairtbh, my inclination is to reboot it and since there is nothing described by puppet running on it, nor any documentation, call our work done.21:30
pleia2mordred: is test-mordred-config-drive deletable? (see pad)21:30
lifelesslive migration should mitigate it... if they had that working21:30
fungii've pinged him in #-infra in case he's around21:30
jeblairit's pretty late for ttx21:30
fungilifeless: yep! too bad21:30
pleia2clarkb: I'm in, thanks!21:30
pleia2(with sudo, yay)21:31
jeblairi'm equally okay with "do nothing and let rax take care of it"21:31
clarkbpleia2: awesome, I am working on fe02 now21:31
fungii'm starting down the easy reboots list in alpha order21:31
clarkbpleia2: so first thing we need to do is take fe01 or fe02 out of the DNS round robin, then we can take one backend out of haproxy at a time on the other frontend and reboot the backend, put it back into service, rinse and repeat21:32
clarkbpleia2: then when that is all done add fe02 back to DNS round robin, remove 01, reboot 0121:32
clarkbpleia2: and the only people that should see any downtime are those that hardcode a git frontend21:32
pleia2clarkb: ok, how are we interacting with dns for this?21:32
* mordred is going to start on the easy reboots in reverse alpha order21:33
mordredwill meet fungi in the middle21:33
fungithanks mordred21:33
fungipleia2: clarkb: there is a rax dns client, but probably webui is easier for this21:33
pleia2fungi: if you could toss the exact instructions you're using for actually-effective reboots in the pad, it would help us be consistent21:33
clarkbmordred: keep in mind a normal reboot is not good enough21:34
fungiwell, maybe not actually. since we just need to delete and then create a and aaaa records. cli client may be easier21:34
clarkb(so lets all know how to reboot before we reboot)21:34
mordredyah21:34
mordredwe apparently need to halt. then reboot --hard21:34
mordredyah?21:34
jeblairhow can you do anything after halting?21:34
clarkbmordred: thats what fungi said, but ^21:34
clarkbjeblair: I think you have to nova reboot it21:35
mordredyah21:35
clarkbso in instance do shutdown -h now21:35
jeblairoh, "nova reboot --hard"21:35
mordredyah21:35
clarkbthen go to nova client and reboot it21:35
mordredsorry21:35
lifelesslist(parse_requirements('foo==0.1.dev0'))[0].specifier.contains(parse_version('0.1.dev0'))21:35
lifelessTrue21:35
lifelessman, copypasta all over the place today21:35
mordredlifeless: wrong channel21:35
clarkbpleia2: git-fe02 is ready for you to test there21:35
fungipleia2: i logged into puppetmaster, sourced our nova creds for openstackci in dfw, then `sudo ssh $hostname` and run `halt`, then when i get kicked out start pinging it, and then `nova reboot --hard $hostname`21:35
lifelessmordred: I know, it was my belly on my mouse pad21:35
fungionce it no longer responds to ping21:35
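
Restated as a rough shell sketch of the procedure fungi describes (run from puppetmaster with the openstackci DFW credentials sourced; $host is a placeholder). The hard reboot from the API side is the important part, since an in-guest reboot alone does not restart the qemu process on the hypervisor:

    host=example01.openstack.org            # placeholder
    sudo ssh "$host" halt                   # clean shutdown from inside the guest
    while ping -c1 -W2 "$host" >/dev/null 2>&1; do sleep 5; done
    nova reboot --hard "$host"              # recreate the instance's qemu process
    # later: ssh back in and check `uptime` to confirm it really rebooted
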
mordredoh good21:37
mordredwe have 2 zuul-devs21:37
mordredshould I delete the one that is not the one dns resolves to?21:37
fungiyeah, should be safe at this point21:38
jeblairmordred: yes21:38
mordredand we have more than one translate-dev21:40
mordredpleia2: ^^ delete the one that is not in DNS? or reboot it?21:40
jeblairis the old one the pootle one?21:40
mordredmaybe?21:40
jeblairmordred: paste both ips?21:41
pleia2the old pootle one is deleteable21:41
mordred afd4a8d9-98a7-4a21-a827-33106abeeb8a | translate-dev.openstack.org      | ACTIVE | -          | Running     | public=104.130.243.78, 2001:4800:7819:105:be76:4eff:fe04:4758; private=10.209.160.236   |21:41
mordred| f1103432-ae29-4ec2-87e0-39920429ac50 | translate-dev.openstack.org      | ACTIVE | -          | Running     | public=23.253.231.198, 2001:4800:7817:103:be76:4eff:fe04:545a; private=10.208.170.81    |21:41
pleia2new translate-dev server is 104.130.243.7821:41
mordred104. is the one in dns21:41
pleia2yeah, can kill the 23. one afaic21:41
mordredalso wiki.o.o is not on the list - I think it's a "can reboot any time" yeah?21:41
jeblair104.130.243.78 does not respond for me21:42
fungimordred: it's likely not pvhvm21:42
mordredfungi: ah - we only have to delete pvhm?21:42
mordredreboot?21:42
fungimordred: pvhvm is affected, pv is not21:42
mordredgotcha21:42
* mordred removes from list21:42
fungibasically this is the list which rax put in the ticket21:42
mordredah21:43
mordredso - translate.openstack.org is not pvhvm?21:43
pleia2jeblair: oh dear, maybe zanata went sideways when I wasn't looking21:43
mordredtranslate is a standard - that's ok - we can think about that later - if we're happy with performance, then it's likely fine :)21:44
fungipuppetmaster root has cached the wrong ssh host key for ci-backup-rs-ord21:47
clarkbpleia2: I think we should remove git-fe02 records from the git.o.o name to start. git.o.o A 23.253.252.15 and git.o.o AAAA 2001:4800:7818:104:be76:4eff:fe04:707221:47
fungishould i correct it, or work around it (to avoid ansible doing things to it)?21:47
jeblairfungi: work around it for now; not sure what the state is with that21:47
pleia2clarkb: yep, sounds good21:47
fungijeblair: thanks, will do21:47
clarkbpleia2: also note the TTL (I think its the minimum of 5 minutes for round robining but we will have to add that back in when we make the record for it again)21:48
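
A quick way to confirm the records and their remaining TTL before and after the change, using dig (addresses as quoted above):

    dig +noall +answer git.openstack.org A
    dig +noall +answer git.openstack.org AAAA
    # the numeric column after the name is the TTL; wait at least that long after
    # deleting the git-fe02 records before assuming traffic has drained away
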
fungimordred: i'll skip hound since i'm not sure what your plan is with that or if it needs special care21:48
pleia2my favorite part is how their interface doesn't show me the whole ipv6 address in the list21:48
clarkbpleia2: :(21:48
fungirax dns is a special critter, to be sure21:48
pleia2modify record lets me see it and cancel ;)21:49
clarkbpleia2: once you have those records removed you will want to hop on git-fe02 and sudo tail -f /var/log/haproxy.log and wait for connections to stop coming in21:50
clarkbpleia2: at that point we can do the reboot procedures21:50
mordredAND - in a fit of consistency - we have 2 review-dev21:50
clarkbmordred: is one trusty the other precise?21:51
clarkbmordred: if so you can probably remove the precise node21:51
clarkbpleia2: once you are about at that point let me know and I can dig up my haproxy knowledge21:51
mordredyup21:51
pleia2clarkb: busy servers these ones21:51
fungijeblair: there are a jeblairtest2 and jeblairtest3 as well which aren't on the list but shall i delete them while i'm here?21:52
mordredwell, that's exciting21:53
mordredI halted review dev. it's not returning pings - BUT - nova won't reboot it because it's in state "powering off"21:53
jeblairfungi: please21:53
fungiwill do21:53
anteayacan someone with ops spare a moment to kick a spammer from -infra?21:56
mordredfungi: hound done. no special care - it's all happy and normal21:57
fungii've also been sshing back into each and checking uptime to make sure it really rebooted21:58
clarkbpleia2: its been > TTL now ya?21:58
pleia2clarkb: down to a trickle, so should be ready soon21:58
mordredso - anybody have any ideas what to do about a vm stuck in nova powering-off state?21:58
clarkbpleia2: ya and if these don't go away after another minute or two I think we blame their bad DNS resolving21:58
pleia2clarkb: nods21:59
clarkbmordred: I want to say in hpcloud when that happened you had to contact support, unsure if rax is different21:59
mordredif there isnt' a quick answer from john - I'm going to leave it - because it's review-dev and it'll get hard-rebooted tomorrow22:03
clarkbpleia2: found the magic socat commands for haproxy control on my fe02 scrollback22:03
clarkbmordred: wfm22:03
pleia2clarkb: cool, I'll have a peek22:03
fungiokay, got the all-clear from reed to reboot groups and ask so doing those next22:03
clarkbpleia2: getting a paste up for you22:03
pleia2clarkb: thanks22:03
mordredfungi: I am not using puppetmaster - so happy for you to reboot it any time22:03
clarkbpleia2: http://paste.openstack.org/show/222205/22:04
clarkbpleia2: basically haproxy is organized as frontends eg balance_git_daemon and backends for each frontend so we have to disable the full set of pairs for all of those per backend22:05
clarkbpleia2: once that is done it should be safe to reboot the backend22:05
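
The paste itself is not reproduced in the log, but the drain being described looks roughly like the following; the stats socket path and the exact backend and server names are assumptions (only balance_git_daemon is named above):

    SOCK=/var/lib/haproxy/stats             # assumed haproxy admin socket path
    for be in balance_git_http balance_git_https balance_git_daemon; do
        echo "disable server $be/git01.openstack.org" | sudo socat stdio "$SOCK"
    done
    # halt git01, nova reboot --hard it, wait for it to come back, then:
    for be in balance_git_http balance_git_https balance_git_daemon; do
        echo "enable server $be/git01.openstack.org" | sudo socat stdio "$SOCK"
    done
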
pleia2clarkb: ok, so this series of commands + confirm they come back for all the git0x servers?22:06
clarkbpleia2: then off to the next backend, when all backends are done we add git-fe02 back to dns, remove git-fe01 and then reboot git-fe0122:06
clarkbpleia2: yup and run that on git-fe01 since its the only haproxy balancing traffic right now22:06
pleia2clarkb: sounds good, on it22:07
clarkbpleia2: but basically go host by host, rebooting then reenabling and we shouldn't have any downtime22:07
* pleia2 nods22:07
clarkbwe should probably ansible this for the general case, not sure we want to ansible it for the hard reboot case since one already broke on us22:07
pleia2heh, right22:07
jeblairthinking ahead to the harder servers; we have enough SPOFS that we may just want to take an outage for all of them at once; maybe i'll start working on identifying what we should group together22:09
mordredjeblair: ++22:09
clarkbjeblair: sounds good22:09
mordredclarkb: yah - I was thinking we should maybe have some ansibles to handle "we need to reboot the world" - I'm sure it'll come up again22:09
SpamapSkernel updates come to mind22:10
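
For the ordinary kernel-update case an ad-hoc ansible run would cover most of it; a sketch only, with the host pattern and fork count as placeholders (it would not replace the per-host nova reboot --hard needed for this particular CVE):

    ansible all --sudo -f 5 -m command -a "shutdown -r +1 'rebooting for updates'"
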
fungimordred: are you rebooting openstackid.org? if not, i'll get it next22:10
mordredfungi: yeah - I got it22:10
fungithanks22:11
*** reed has joined #openstack-infra-incident22:11
jeblairfungi, clarkb: er, i see the chat in etherpad; what should we do with the ES machines?22:12
clarkbjeblair: fungi we can roll through them right now if we want, we can have rax just reboot all of them, or we can reboot all of them22:12
fungii'm inclined to halt and hard reboot them ourselves since no idea if rax halts these or just blip!22:13
clarkbjeblair: fungi: I am tempted to go ahead and reboot one then see how long recovery takes, then based on that either reboot the rest all at once or reboot them one by one22:13
pleia2clarkb: 01 down, 4 to go!22:13
mordredclarkb: sounds reasonable22:13
fungialso we have some opportunity to look carefully at the systems as they come back up and identify obvious issues immediately rather than when we happen to notice later22:13
clarkbfungi: ++22:13
mordredsubunit-worker01 should be able to just be rebooted too, no?22:13
jeblairclarkb: okay, we'll be taking "the ci system" down anyway, so if it's easier to do all at once, we will have that opportunity22:13
clarkbmordred: yup22:13
* mordred does subunit-worker22:14
jeblairwell, it may lose info22:14
clarkbjeblair: well recovery from reboot all at once is a many hours thing too I think22:14
jeblairwhich is why i marked it with [1]22:14
jeblairmordred: ^22:14
clarkbjeblair: and logstash.o.o will just queue up the work for us so ES can be in that state whenever and we should be fine22:14
clarkbjeblair: biggest impact is the delay on indexing until we catch back up again22:14
jeblairclarkb: the system is not busy and slowing, so we may be in luck there22:15
fungiokay, so logstash workers and elasticsearch group together with [1] as well?22:15
clarkbwfm22:16
jeblairoh, logstash workers can go at any time, right?22:16
clarkbjeblair: actually ya, let me just go do those right now22:16
fungithey used to go at any time they wanted, so i suppose so ;)22:16
jeblairi vote we keep them out of [1] if that's the case;  there are 16 others that we aren't rebooting anyway22:16
fungiagreed22:16
clarkbyup yup I am doing logstash workers now22:16
jeblairso probably can actually just do all 4 right now22:16
jeblair(at once)22:16
mordredjeblair: sorry. IRC race-condition subunit-worker01 rebooted22:17
jeblairmordred: i did have the [1] in the etherpad before you did that22:17
clarkbpleia2: I think you can reboot git-fe02 whenever you have time, its down to like 2 requests per minute22:17
fungiis the plan to put zuul into graceful shutdown long enough to quiesce jobs and then stop the zuul mergers and bring zuul back up?22:18
fungithat would avoid running jobs while we do the pypi mirrors too22:18
fungithen once the rest of the [1] group is done, bring the mergers back up22:18
pleia2git02 seems stuck after the halt, not pingable, but nova shows it as running, for git01 I waited  to run the nova reboot --hard until it was in shutdown mode22:18
jeblairfungi: we have to take review.o.o down, i think we should just do it without quiescing22:19
fungijeblair: oh, right i forgot it was in that set22:19
fungipleia2: i guess give it a few minutes. if we have to open a ticket for it, we can limp along down one member until rax fixes it for us22:19
jeblairyeah, i'd think about a way to do it more gracefully, but there's no hiding that, so we might as well make it easy on ourselves and just reboot it all at once22:20
fungithe joys of active redundancy22:20
jeblairwe can save the zuul queues22:20
fungifair enough22:20
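
Saving the queues generally meant dumping the list of in-flight changes from the zuul status JSON before stopping it, then re-enqueuing them afterwards with the zuul client; a hedged sketch of the re-enqueue side (project and change numbers are placeholders):

    zuul enqueue --trigger gerrit --pipeline gate \
        --project openstack/example --change 123456,7
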
clarkbpleia2: I have nothing better than what fungi suggests22:20
pleia2fungi: oh good, looks like all it needed was 5 minutes to die, on to 03!22:20
fungiheh22:20
clarkbdoing logstash-worker17 now22:20
fungisometimes rax likes to just sit on api calls too. it's an openstack thing22:21
mordredooh! review-dev finally rebooted22:22
clarkbthey are just teaching us patience22:22
mordredawesome22:22
fungii have a feeling rax didn't tune their environment to expect all customers to want to reboot everything in one day22:22
pleia2indeed22:23
mordredweird22:23
fungipleia2: clarkb: let me know when you expect to be idle on the puppetmaster for a few minutes and i'll reboot it too22:23
clarkbfungi: let me get through the logstash workers, will let you know22:23
fungithere's no hurry22:24
pleia2fungi: will do, just going to finish up this git fun22:24
jeblairokay, i laid out a plan for the group[1] reboots22:26
clarkbafter halting 18,19,20 they all show active according to nova show, are we waiting for them to be shutdown before hard rebooting?22:26
mordredjeblair: I agree with your plan22:27
fungiyeah, lgtm22:27
mordredclarkb: I have been22:27
clarkbok I will wait then22:27
fungimight want to also wait for the pypi mirrors to boot before starting zuul?22:27
fungiotherwise there could be job carnage22:28
clarkb++22:28
fungialso waiting to bring up the zuul mergers until most worker types have registered in zuul again could avoid some not_registered results?22:28
jeblairfungi: true, i was mostly thinking of waiting for the services though; we should probably wait until the reboots have been issued for all of the vms22:28
fungithough i suppose just not readding the saved queues for a bit would work as well22:29
fungiworst case a couple of changes get uploaded to gerrit and their jobs are kicked back because it was too soon22:30
jeblairor waiting until nodepool has ready nodes of each type (which it may immediately since the system is not busy)22:30
fungiyeah, i expect that to be quick22:31
jeblair200 concurrent jobs running is "not busy"22:31
fungiheh22:32
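
Checking whether nodepool has ready nodes of each label again can be done from the nodepool CLI; a minimal sketch, with the output column layout assumed:

    nodepool list | grep -c ' ready '       # rough count of ready nodes
    nodepool list | grep bare-precise       # confirm a specific label has capacity again
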
jeblairdo folks want to claim nodes in the group[1] by adding their names to the ep?22:32
fungialso cinder seems pretty broken22:32
fungiyep, will do22:32
jeblairi'll do the gerrit/zuul/nodepool bits22:32
mordredI'll just watch I guess22:33
jeblair(and btw, not stopping nodepool is intentional; it'll reduce number of aliens)22:33
jeblairmordred: want the zuul mergers?22:33
mordredoh - sure!22:34
jeblairstatus alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability.22:35
jeblair?22:35
fungihalting nodepool.o.o will likely send a term to nodepoold as init tries to gracefully stop things22:35
jeblairfungi: yup22:35
fungijeblair: that looks good22:35
jeblairclarkb, fungi, mordred: are you standing by?22:36
mordredjeblair: yup22:36
fungion hand22:36
clarkbI am here22:37
clarkbalmost done with the last logstash worker it took forever to stop22:37
jeblairi'll send the announcement then wait a bit, then give you the go ahead when i've stopped zuul and gerrit22:37
jeblair#status alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability.22:37
openstackstatusjeblair: sending alert22:37
*** reed has left #openstack-infra-incident22:37
-openstackstatus- NOTICE: Gerrit and Zuul are going offline for reboots to fix a security vulnerability.22:39
*** ChanServ changes topic to "Gerrit and Zuul are going offline for reboots to fix a security vulnerability."22:39
clarkbthats exciting, the logstash indexer refuses to keep running on 19 and doesn't log why it failed22:40
clarkbbut we are super redundant there so I can switch to the [1] group whenever22:40
openstackstatusjeblair: finished sending alert22:42
jeblairclarkb, fungi, mordred: gerrit is stopped, clear to reboot22:42
mordredrebooting22:42
clarkball the ES's are halted22:43
clarkbwaiting for them to not be ACTIVE in nova show before rebooting22:44
fungimy 4 are back up, i'm checking their services now22:44
clarkbfungi: did you wait for them to not be active?22:44
clarkbI am trying to decide if that is necessary22:44
fungii did not22:44
fungizuul is already stopped22:45
clarkbok I won't wait either then22:45
fungiso it's not like job results will matter at this point22:45
funginone that are running will report22:45
mordredmine are all rebooted and services are running22:45
jeblair(nodepool is up and running)22:46
fungiyep, all mine are looking good22:46
fungirebooting gerrit now22:46
pleia2clarkb: rebooting git-fe02 now, once that's back up I'll readd to dns, remove git-fe01 and wait for that to trickle out (and tell fungi to reboot puppetmaster)22:46
mordredhour and a half isn't terrible for an emergency reboot the world22:48
clarkbelasticsearch is semi up, its red and 5/6 nodes are present22:48
fungigerrit's on its way back up22:49
pleia2fungi: I'm done with puppetmaster for now, just have a git-fe01 reboot to do, but will wait on dns for that so reboot away22:49
fungireview.o.o looks like it's working22:49
mordredI agree that review.o.o looks like it's running22:49
fungijeblair: should be set to start zuul?22:49
anteayafungi: slow but working...22:50
jeblair[2015-05-13 22:50:02,782] WARN  org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@34ee175d{l(/127.0.0.1:58159)<->r(/127.0.0.1:8081),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@242369e7,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@10fdcf3a22:50
clarkbes02 ran out of disk again22:50
clarkbpleia2: ^ same issue as last time, /me makes a note to logrotate for it better22:50
jeblairwow there were a lot of those errors in the gerrit log22:50
jeblairbut it seems happier now22:50
fungiit was probably getting slammed by people desperately reloading while it was starting up22:51
jeblairokay i will restart zuul22:51
clarkbwait hrm22:52
fungihrm?22:52
jeblairclarkb: i've already started zuul and begun re-enqueing22:52
clarkbcan someone else log into es0222:52
fungion it22:52
clarkbjeblair: sorry I think you are ok to start zuul22:52
jeblairok, will continue22:52
clarkbfungi: tell me if mount looks funny22:52
clarkbjeblair: but it looks like es02 came up without a /22:53
fungium... wow!22:53
clarkbI wonder if this is what hit worker1922:53
fungii mean, that can happen, technically, as / is mounted at boot and might not be in mtab22:53
clarkbin any case we should probably do an audit of this22:53
pleia2clarkb: oops, I think I made a bug/story about that but never managed to cycle back to it22:53
jeblairokay, zuul is up and there are a very small handful of not_registered; i think we can ignore them22:53
fungioverflow on /tmp type tmpfs (rw,size=1048576,mode=1777)22:53
clarkbpleia2: np22:53
jeblairi will status ok?22:54
fungiwhazzat22:54
mordredjeblair: ++22:54
fungijeblair: go for it22:54
clarkbjeblair: do we want to check the mout table everywhere first?22:54
clarkbwe may need to reboot more if we don't see those devices within a VM?22:54
jeblairclarkb: review looks ok (it has a / and an /opt)22:55
clarkbjeblair: cool thats probably the most important one22:55
clarkbfungi: so short of rebooting, any ideas on fixing/debugging this?22:55
fungiclarkb: i'm looking at dmesg. just a sec22:55
jeblair#status ok Gerrit and Zuul are back online.22:55
openstackstatusjeblair: sending ok22:55
jeblairnodepool looks sane22:56
clarkbES is yellow despite this trouble22:56
clarkbso it is recovering in the right direction (started red)22:57
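
Cluster recovery can be watched from any of the elasticsearch nodes; a minimal check, assuming the default HTTP port:

    curl -s http://localhost:9200/_cluster/health?pretty
    # "red"    -> some primary shards still unassigned
    # "yellow" -> primaries allocated, replicas still recovering
    # "green"  -> fully recovered
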
fungidmesg looks like it properly remounted xvda1 rw... maybe mtab is corrupt22:57
jheskethMorning22:57
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots"22:57
-openstackstatus- NOTICE: Gerrit and Zuul are back online.22:57
* jhesketh catches up on incident(s) 22:58
anteayajhesketh: https://etherpad.openstack.org/p/venom-reboots if you haven't found it yet22:58
fungialso what is this tmpfs called "overflow" mounted at /tmp? it's 1mb in size, which seems dysfunctional22:58
Clintfungi: that's a "feature" for when / is full on bootup22:59
fungianteaya: i also put it in the channel topic22:59
clarkband / was full22:59
fungiClint: thank you!22:59
anteayafungi: thank you22:59
clarkbbecause ES rotates logs but doesn't have a limit on how far back to keep22:59
fungiso, yes, that's interesting and i think explains some of this then22:59
fungii wonder if that is why root isn't actually showing up mounted then23:00
openstackstatusjeblair: finished sending ok23:00
clarkbfungi: seems likely23:00
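
What Clint describes is Ubuntu's mountall fallback: if / is full at boot, a 1MB tmpfs called "overflow" gets mounted on /tmp so boot can continue. A few quick checks for that state on a suspect host (nothing here is es02-specific):

    mount | grep -E '^overflow | on / '     # is /tmp the 1MB overflow tmpfs? is / shown at all?
    df -h / /tmp                            # how full is the root filesystem?
    grep ' / ' /proc/mounts                 # authoritative even when /etc/mtab is stale
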
fungiclarkb: i guess just clean up the root filesystem and reboot again and it will probably go back to "normal"23:01
fungiClint: you gave me something new to read about23:01
clarkbfungi: ya I have already killed the log files taking up all the disk23:01
Clintfungi: enjoy23:01
clarkbso I will go ahead and reboot it23:01
fungisounds good23:01
clarkbthen make it more of a priority to have logrotate clean up after ES23:01
jheskethLooks like good progress on the reboots has been made23:01
jheskethLet me know if I can help23:01
fungibut that being the case, i think this probably is not some mysterious endemic issue we should go hunting on our other servers looking for23:02
jeblairjhesketh: thanks, i think we're nearing the end of it23:02
fungiit's gone pleasantly quickly23:02
jheskethGreat stuff23:02
fungimordred: clarkb: so you're idle on puppetmaster for a bit. looks like i'm clear to reboot that. speak soon if not23:03
pleia2fungi: git-fe01 just slowed down enough for me to reboot it, if you can wait 5 minutes I might as well finish this up23:03
clarkbfungi: I am done23:03
mordredfungi: I am doing nothing there23:03
fungipleia2: no problem, go for it23:03
fungii'll wait23:03
clarkbpleia2: fe01 or fe02?23:03
pleia2clarkb: fe02 is all done23:04
clarkbpleia2: woot23:04
clarkbes02 is rebooting now23:04
clarkblooks like the logstash workers may be leaking logs too23:04
clarkbI find it somewhat :( that our logging system is bad at logging (most likely my fault)23:04
fungilogs are hard (and made of wood)23:05
clarkband its up, cluster is still recovering and yellow so that all looks good23:06
mordredfungi: more witches!23:07
pleia2fungi: all clear puppetmaster23:07
fungipleia2: thanks! restarting it next23:07
fungipuppet master is back up now23:08
pleia2updated dns & load is coming back to git-fe01, so all is looking good23:09
clarkbok confirmed that es doesn't do rotations with deletes properly until release 1.523:10
jeblairare all servers complete now?23:10
clarkbbut I think logrotate can delete based on age so will just set it up to kill anything >2 weeks old or whatever23:10
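
A minimal sketch of the age-based cleanup clarkb describes, assuming elasticsearch's self-rotated logs live under /var/log/elasticsearch; a daily cron job with find is the simplest route (logrotate with maxage is another option):

    #!/bin/sh
    # /etc/cron.daily/elasticsearch-log-cleanup (sketch; path and retention are assumptions)
    find /var/log/elasticsearch -name '*.log.*' -mtime +14 -delete
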
pleia2jeblair: I think so23:10
clarkbthe list seems to make it look that way23:10
fungigate-heat-python26 NOT_REGISTERED23:11
fungisame for gate-pbr-python2623:11
jeblairthere are bare-centos6 nodes ready, maybe we should delete them23:12
fungilooks like a lot of jobs expecting bare-precise and baer-centos6 are not registered, so yeah23:12
fungilet's23:12
jeblairdoing23:12
fungithanks23:12
jeblairoh, sorry, my query was wrong; no ready nodes there, only hold, building, and used (possibly from before reboot)23:13
jeblairbuilding might be before reboot too23:13
jeblairanyone using those held nodes?23:14
fungii am no longer23:14
clarkbjeblair: I have one I would like to keep23:14
fungijust finished yesterday23:14
fungii'll delete mine23:14
clarkb2536877 would be nice for me to keep23:14
fungii kept a list23:14
clarkbbeen using it as part of the nodepool + devstack work23:14
jeblairokay, i deleted building/used; leaving hold to you23:15
jeblairthat should be enough to correct the registrations once those are built23:15
jeblairshould be clear out of this room now?23:15
jeblair/be/we/23:16
pleia2sounds good, see you on the other side23:16
*** clarkb has left #openstack-infra-incident23:16
fungiadios23:17
*** fungi has left #openstack-infra-incident23:17
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident"23:17
* jeblair puts the sheets back over the control consoles23:17
*** nibalizer has left #openstack-infra-incident23:18
