Thursday, 2017-04-13

06:08 *** AJaeger is now known as AJaeger_
07:30 *** hjensas has joined #openstack-infra-incident
08:23 *** hjensas has quit IRC
08:51 -openstackstatus- NOTICE: zuul was restarted due to an unrecoverable disconnect from gerrit. If your change is missing a CI result and isn't listed in the pipelines on http://status.openstack.org/zuul/ , please recheck
09:34 *** hjensas has joined #openstack-infra-incident
12:28 *** Daviey has quit IRC
12:55 *** Daviey has joined #openstack-infra-incident
13:00 <jroll> https://nvd.nist.gov/vuln/detail/CVE-2016-10229#vulnDescriptionTitle "udp.c in the Linux kernel before 4.5 allows remote attackers to execute arbitrary code via UDP traffic that triggers an unsafe second checksum calculation during execution of a recv system call with the MSG_PEEK flag."
13:00 <jroll> not sure if infra listens for UDP off the top of my head, but thought I'd drop that here
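
For reference, a quick way to see which UDP ports a host is actually listening on looks roughly like this (a minimal sketch assuming a Linux host with iproute2; the ad-hoc ansible invocation is illustrative, not a copy of infra tooling):

    # List listening UDP sockets along with the owning process
    ss -lunp

    # Older hosts without iproute2's ss can use netstat instead
    netstat -lnup

    # Or ask every host at once from a management node
    ansible all -m command -a 'ss -lun'
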
14:57 *** hjensas has quit IRC
15:07 *** lifeless_ has joined #openstack-infra-incident
15:10 *** mordred has quit IRC
15:10 *** lifeless has quit IRC
15:10 *** EmilienM has quit IRC
15:10 *** mordred1 has joined #openstack-infra-incident
15:13 *** 21WAAA2JF has joined #openstack-infra-incident
15:46 *** mordred1 is now known as mordred
16:10 <clarkb> jroll: thanks, we have an open snmp port we might want to close
16:10 <clarkb> pabelanger: fungi ^
16:10 <pabelanger> ack
16:12 <pabelanger> pbx might be one
16:12 <pabelanger> since we use UDP for RTP
16:12 *** hjensas has joined #openstack-infra-incident
16:13 <jroll> clarkb: np
16:13 <clarkb> actually snmp is source specific so fairly safe
16:13 <clarkb> afs
16:13 <clarkb> is udp
16:13 <clarkb> mordred: ^
16:15 <pabelanger> ya, AFS might be our largest exposure
16:17 <mordred> oy. that's awesome
16:17 <pabelanger> https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html
16:18 <pabelanger> Linux afs01.dfw.openstack.org 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
16:18 <pabelanger> so ya, might need a new kernel and reboot?
16:19 <fungi> ugh
16:19 <fungi> any reports it's actively exploited in the wild?
16:21 <pabelanger> I am not sure
16:22 <pabelanger> looks like android is getting the brunt of it, however
16:23 <fungi> keep in mind source filtering is still a lot less effective for udp than tcp
16:23 <fungi> easier to spoof (mainly just need to guess a source address and active ephemeral port)
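
The source filtering discussed above is an iptables rule of roughly this shape (a sketch with placeholder addresses, not the actual infra rules); as noted, it is weaker protection for UDP than TCP because a forged source address is enough to get past it:

    # Allow SNMP queries only from a known monitoring host, drop everything else
    iptables -A INPUT -p udp --dport 161 -s 203.0.113.10 -j ACCEPT
    iptables -A INPUT -p udp --dport 161 -j DROP
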
16:33 *** openstack has joined #openstack-infra-incident
16:34 *** openstackstatus has joined #openstack-infra-incident
16:34 *** ChanServ sets mode: +v openstackstatus
17:05 *** 21WAAA2JF is now known as EmilienM
17:05 *** EmilienM has joined #openstack-infra-incident
18:57 <clarkb> looking at https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html it says xenial is not affected?
18:59 <clarkb> I also don't see an ubuntu security notice for it yet
19:06 <clarkb> it looks like we may be patched in many places?
19:06 <clarkb> trying to figure out what exactly is required, but if I read that correctly xenial is fine despite being 4.4? trusty needs kernel >=3.13.0-79.123
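
A per-host check along those lines looks roughly like this (a sketch; 3.13.0-79.123 is the fixed trusty version mentioned above, and dpkg is assumed to be available on the hosts being checked):

    # What kernel is running right now?
    uname -r

    # Compare the installed kernel package against the first fixed trusty version
    fixed="3.13.0-79.123"
    current="$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")"
    if dpkg --compare-versions "$current" ge "$fixed"; then
        echo "kernel $current >= $fixed: patched, reboot only if not yet running it"
    else
        echo "kernel $current < $fixed: needs upgrade and reboot"
    fi
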
19:12 <clarkb> pabelanger: fungi ^ does that sound right to you? if so maybe the next step is to generate a list of kernels on all our hosts via ansible, then produce a needs-to-be-rebooted list
19:13 <pabelanger> clarkb: ya, xenial isn't affected from what I read.
19:13 <pabelanger> ++ to ansible run
19:48 <clarkb> pabelanger: Linux review 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
19:48 <pabelanger> ++
19:48 <clarkb> on review.o.o, which is newer than 3.13.0-79.123
19:48 <clarkb> so I think just restart the service?
19:49 <pabelanger> ya, looks like just a restart then
19:51 -openstackstatus- NOTICE: The Gerrit service on http://review.openstack.org is being restarted to address hung remote replication tasks.
19:58 <fungi> sorry for not being around... kernel update sounds right, too bad we didn't take the gerrit restart as an opportunity to reboot
20:00 <clarkb> fungi: we don't need to reboot it
20:00 <clarkb> fungi: gerrit's kernel is new enough I think ^ you can double check above.
20:00 <fungi> oh
20:00 <fungi> yep, read those backwards
20:00 <clarkb> pabelanger: puppetmaster.o.o:/home/clarkb/collect_facts.yaml has a small playbook thing to collect kernel info, want to check that out?
20:02 <clarkb> that is incredibly verbose, is there a better way to introspect facts?
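
For fact introspection without a full playbook, ansible's ad-hoc mode is much terser (a sketch of the alternative being asked about here; collect_facts.yaml itself is not reproduced, and these commands assume the default inventory on the puppetmaster):

    # Print just the kernel fact for every host
    ansible all -m setup -a 'filter=ansible_kernel'

    # Or skip fact gathering entirely and shell out
    ansible all -m command -a 'uname -r'

    # Limit to a single host while testing, as with the --limit runs mentioned above
    ansible all --limit 'review.openstack.org' -m command -a 'uname -r'
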
20:05 <pabelanger> clarkb: sure
20:06 <clarkb> pabelanger: if it looks good to you I will run it against all the hosts and stop using my --limit commands to test
20:06 <clarkb> it's verbose but works, so just going to go with it I think
20:06 <pabelanger> clarkb: looks good
20:07 <clarkb> pabelanger: ok I will run it and redirect output into ~clarkb/kernels.txt
20:08 <clarkb> it's running
20:10 <pabelanger> clarkb: only seeing the ok: hostname bits
20:10 <clarkb> pabelanger: ya, it's gathering all the facts before running the task I think
20:11 <pabelanger> Ha, ya
20:11 <pabelanger> gather_facts: false
20:11 <clarkb> well we need the facts
20:11 <pabelanger> but, we need them
20:11 <pabelanger> ya
20:11 <clarkb> I guess I could've uname -a'd it instead :)
20:11 <pabelanger> okay, let me get some coffee
20:12 <pabelanger> also, forgot about infracloud
20:12 <pabelanger> that will be fun
20:16 <clarkb> hrm, why does the mtl01 internap mirror show up? I thought I cleaned that host up a while back
20:16 <mgagne> mtl01 is the active region, nyj01 is the one that is now unused
20:16 * mgagne didn't read backlog
20:18 <clarkb> oh I got them mixed up, thanks
20:27 <clarkb> looks like right now ansible is timing out trying to get to things like jeblairtest
20:27 <clarkb> I'm just gonna let it time out on its own, does anyone know how long that will take?
20:36 <fungi> maybe 60 minutes in my experience
20:54 <clarkb> fwiw most of my spot checking shows our kernels are new enough
20:54 <clarkb> so I don't expect to need to reboot much once ansible gets back to me
21:17 <clarkb> it's been an hour and they haven't timed out yet...
21:20 <clarkb> done waiting, going to kill the ssh processes and hope that doesn't prevent the play from finishing
21:26 <clarkb> pabelanger: https://etherpad.openstack.org/p/infra-reboots-old-kernel
21:28 <clarkb> I'm just going to start picking off some of the easy ones
21:28 <clarkb> mordred: are you around? how do we do the afs reboots? make afs01 rw for all volumes, reboot afs02, make 02 rw, reboot 01?
21:28 <clarkb> then do each of the db hosts one at a time? what about kdc01?
21:34 <clarkb> doing etherpad now, so the etherpad will be temporarily unavailable
21:37 <clarkb> for proposal.slave.openstack.org and others, is the zuul launcher going to gracefully pick those back up again after a reboot or will we have to restart the launcher too?
21:37 <clarkb> I guess I'm going to find out?
21:38 <clarkb> rebooting proposal.slave.o.o now as it's not doing anything
22:09 <clarkb> I'm going to try grabbing all the mirror update locks on mirror-update so that I can reboot it next
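
The mirror update jobs serialize themselves with flock, so "grabbing the locks" amounts to holding the same lock files so the periodic jobs cannot start (a sketch; the npm lock path is the one mentioned later in this log, and holding a sleep under the lock is just one way to keep it taken):

    # Hold the npm mirror lock until this background job is killed
    flock -n /var/run/npm/npm.lock sleep infinity &

    # Check that no update scripts are still running under k5start
    ps -elf | grep k5start
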
22:26 <clarkb> pabelanger: gem mirroring appears to have been stalled since April 2nd, but there is a process holding the lock. Safe to grab it and then reboot?
22:33 <pabelanger> clarkb: ya, we'll need to grab the lock after reboot
22:34 <pabelanger> just finishing up dinner, and need to run an errand
22:34 <pabelanger> I can help with reboots once I get back in about 90mins
22:34 <clarkb> pabelanger: sounds good, we can leave mirror-update and afs for then. I will keep working on the others
22:39 <clarkb> afs didn't survive the reboot on the gra1 mirror. working to fix now
22:40 <clarkb> oh maybe it did and it's just slow, things are cd-able now
22:43 <clarkb> pabelanger: for when you get back: grafana updated on the grafana server, not sure if it matters or not? hopefully I didn't derp anything
22:44 <clarkb> my apologies if it does :/
22:44 <clarkb> the web ui seems to be working though, so going to reboot the server now
22:50 <clarkb> and it's up and happy again
22:50 <clarkb> now for the baremetal00 host for infracloud, running bifrost
23:00 <pabelanger> looks like errands are pushed back a few hours
23:00 <pabelanger> clarkb: looks like we might have upgraded grafana.o.o too
23:00 <pabelanger> checking logs to see if there are any errors
23:00 <pabelanger> but so far, seems okay
23:00 <clarkb> pabelanger: yes it upgraded, sorry, I didn't think to hold that back until it was done
23:00 <clarkb> but ya, the service seems to work
23:01 <pabelanger> 2.6.0
23:01 <pabelanger> should be fine
23:01 <clarkb> I have been running apt-get update and dist-upgrade before reboots to make sure we get the new stuff
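
That pre-reboot routine is roughly the following (a sketch of the sequence being described, not a copy of any infra script):

    # Refresh package lists and pull in pending updates, including kernel packages
    sudo apt-get update
    sudo apt-get -y dist-upgrade

    # Confirm a new enough kernel image is actually installed before rebooting
    dpkg -l 'linux-image-*' | grep ^ii

    sudo reboot
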
23:01 <pabelanger> we'll find out soon if grafyaml has issues
23:01 <clarkb> :)
23:01 <clarkb> baremetal00 and puppetmaster are the two I want to do next
23:01 <clarkb> then we are just left with afs things
23:01 <pabelanger> k
23:01 <clarkb> I think baremetal should be fine to just reboot
23:02 <pabelanger> ya
23:02 <clarkb> for puppetmaster we should grab the puppet run lock and then do it, so we don't interrupt a bunch of ansible/puppet
23:02 <pabelanger> okay
23:02 <pabelanger> which do you want me to do
23:02 <clarkb> I want you to do mirror-update if you can
23:02 <pabelanger> k
23:02 <clarkb> since you have the gem lock? I should have all the other locks at this point, you can ps -elf | grep k5 to make sure nothing else is running
23:03 <clarkb> I'm logged in but don't worry about it, the only processes I have are holding locks
23:03 <clarkb> I'm going to reboot baremetal00 now
23:03 * clarkb crosses fingers
23:03 <pabelanger> ya, k5start processes are not running
23:03 <pabelanger> so I think mirror-update.o.o is ready
23:04 <clarkb> pabelanger: cool, go for it
23:04 <clarkb> then grab whatever locks you need
23:04 <clarkb> since they shouldn't survive a reboot
23:04 <pabelanger> rebooting
23:05 <clarkb> then when that's done and baremetal comes back, let's figure out puppetmaster, then figure out the afs servers
23:06 <clarkb> baremetal still not back. Real hardware is annoying :)
23:08 <pabelanger> mirror-update.o.o good, locks grabbed again
23:08 <clarkb> and still not up
23:08 <pabelanger> ya, will take a few minutes
23:08 <clarkb> pabelanger: can you start poking at puppetmaster maybe, see about grabbing the lock for the puppet/ansible rotation?
23:08 <clarkb> remember there are two now iirc
23:09 <pabelanger> yup
23:09 <clarkb> tyty
23:10 <clarkb> at what point do I worry the hardware for baremetal00 is not coming back? :/
23:10 <clarkb> oh it just started pinging
23:10 <clarkb> \o/
23:12 <pabelanger> okay, have both locks on puppetmaster.o.o
23:12 <pabelanger> and ansible is not running
23:12 <clarkb> I don't need puppetmaster, if you want to go for it
23:12 <pabelanger> k, rebooting
23:13 <clarkb> baremetal is up now, and ironic node-list works
23:13 <clarkb> well, that's interesting
23:13 <clarkb> it's running its old kernel
23:14 <pabelanger> puppetmaster.o.o online
23:14 <clarkb> I'm going to keep debugging baremetal and may have to reboot it again :(
23:15 <pabelanger> ansible now running
23:15 <clarkb> as for afs, can we reboot the kdc01 server safely? we just won't be able to get kerberos tokens while it's down?
23:16 <clarkb> and can we reboot the db servers one at a time without impacting the service?
23:16 <clarkb> then we just have to do the fileservers in a synchronized manner, ya?
23:16 <clarkb> mordred: corvus ^
23:18 <clarkb> I manually installed linux-image-3.13.0-116-generic on baremetal00, I do not know why dist-upgrade was not pulling that in
23:18 <clarkb> but it's in there and in grub, so thinking I will do a second reboot now
23:18 <clarkb> pabelanger: ^ any ideas on that, or concerns?
23:19 <pabelanger> nope, go for it. We have access to iLO if needed
23:20 <clarkb> we don't have to go through baremetal00 to get to the iLO? we can go through any of the other hosts, ya?
23:20 <clarkb> that's my biggest concern
23:20 <pabelanger> I think we can do any now
23:20 <pabelanger> they are all on the same network
23:20 <clarkb> ok rebooting
23:23 <clarkb> I put some notes about doing the afs related servers on the etherpad. Does that look right to everyone?
23:23 <clarkb> pabelanger: maybe you can check if one file server is already rw for all volumes and we can reboot the other one?
23:23 * clarkb goes to grab a drink while waiting for baremetal00
23:25 <pabelanger> clarkb: vos listvldb shows everything in sync
23:26 <pabelanger> clarkb: http://docs.openafs.org/AdminGuide/HDRWQ139.html
23:26 <pabelanger> we might want to follow that?
23:29 <clarkb> vldb is what we run on afsdb0X?
23:30 <pabelanger> I did it from afsdb01
23:30 <pabelanger> I think we use bos shutdown
23:30 <pabelanger> to rotate things out
23:30 <clarkb> gotcha, that's the way you signal the other server to take over duties?
23:30 <pabelanger> I think so
23:30 <clarkb> definitely seems like what you are supposed to do according to the guide
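
Per the OpenAFS admin guide linked above, a graceful fileserver restart looks roughly like this (a sketch only; -localauth assumes running as root on a machine holding the AFS key, and the ord fileserver is used as the example since it is the one chosen below):

    # Check volume location/replication state from one of the db servers
    vos listvldb -localauth

    # Cleanly stop the server processes so clients fall back to RO replicas elsewhere
    bos shutdown afs01.ord.openstack.org -wait -localauth

    # ...reboot the host, then confirm everything came back
    bos status afs01.ord.openstack.org -localauth
    vos listvol afs01.ord.openstack.org -localauth
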
23:31 <pabelanger> maybe start with afs02.dfw.openstack.org
23:31 <clarkb> does afs02 only have ro volumes?
23:31 <pabelanger> yes
23:31 <clarkb> and afs01 is all rw? if so then ya, I think we do that one first
23:31 <pabelanger> right
23:31 <clarkb> (still waiting on baremetal00)
23:31 <pabelanger> afs01 has rw and ro
23:31 <pabelanger> err
23:32 <pabelanger> afs01.dfw.openstack.org RW RO
23:32 <pabelanger> afs01.dfw.openstack.org RO
23:32 <pabelanger> afs02.dfw.openstack.org RO
23:32 <pabelanger> afs01.ord.openstack.org
23:32 <pabelanger> is still online, but not used
23:32 <pabelanger> maybe we do afs01.ord.openstack.org first
23:32 <clarkb> right, ok. Then once afs02 is back up again we transition all the volumes to swap the RW and RO
23:33 <pabelanger> npm volume locked atm
23:33 <clarkb> ord's kernel is old too, but not in my list
23:34 <clarkb> maybe we skipped things in the emergency file? may need to double check that after we are done
23:34 <clarkb> (I used hosts: '*' to try and get them all, and not !disabled)
23:34 <pabelanger> odd, okay, we'll need to do 3 servers it seems. afs01.ord.openstack.org is still used by a few volumes
23:35 <clarkb> still waiting on baremetal00 :/
23:35 <pabelanger> okay, so which do we want to shut down first?
23:36 <clarkb> I really don't know :( my feeling is the non-fileservers may be the lowest impact?
23:36 <pabelanger> right, afs01.ord.openstack.org is the least used FS
23:37 <clarkb> ok, let's do that one first of the fileservers
23:37 <clarkb> then the question is, do we want to do kdc and afsdb before the fileservers or after?
23:37 <clarkb> also still no baremetal pinging. This is much longer than the last time
23:38 <clarkb> pabelanger: does the ord fs have any RW volumes?
23:38 <pabelanger> mirror.npm is still locked
23:39 <pabelanger> clarkb: no
23:39 <pabelanger> just RO
23:40 <clarkb> ok, so what we want to do then maybe is grab all the flocks on mirror-update so that things stop updating volumes (like npm)
23:40 <clarkb> then reboot the ord fileserver first?
23:40 <clarkb> see how that goes?
23:40 <pabelanger> sure
23:41 <clarkb> ok, why don't you grab the flocks. I am working on getting ilo access to baremetal00
23:42 <pabelanger> ha
23:42 <pabelanger> puppet needs to create the files in /var/run still
23:42 <pabelanger> since they are deleted on boot
23:42 <clarkb> the lock files are deleted?
23:43 <pabelanger> /var/run is tmpfs
23:43 <pabelanger> so /var/run/npm/npm.lock
23:43 <pabelanger> will fail, until puppet creates /var/run/npm
23:44 <clarkb> I can't seem to hit the ipmi addrs with ssh
23:45 <clarkb> I am able to hit the compute hosts' own ipmi but it's slow, maybe baremetal is slower /me tries harder
23:49 <clarkb> ok I'm on the ilo now. I guess being persistent after multiple connection timeouts is the trick
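
Out-of-band access of this sort can also be driven with ipmitool when the ssh interface to the iLO is slow to answer (a sketch; the address and password file are placeholders, not the real infracloud values):

    # Check power state over the BMC/iLO network interface
    ipmitool -I lanplus -H 10.0.0.99 -U admin -f /root/ipmi_pass power status

    # Attach to the serial-over-LAN text console to watch the boot
    ipmitool -I lanplus -H 10.0.0.99 -U admin -f /root/ipmi_pass sol activate
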
23:50 <pabelanger> k, have all the locks on mirror-update
23:56 <clarkb> so I can see the text console. The server is running
23:56 <clarkb> but no ssh
23:57 <clarkb> I think I am going to reboot with the text console up
23:57 <pabelanger> k
23:58 <pabelanger> like, I'm ready to bos shutdown afs01.ord.openstack.org
23:58 <clarkb> pabelanger: I think if you are ready and willing then let's do it :)
