Friday, 2017-04-14

00:00 <pabelanger> bos status afs01.ord.openstack.org
00:00 <pabelanger>     Auxiliary status is: file server shut down.
00:00 <pabelanger> think we are ready for reboot
00:00 <pabelanger> rebooting
00:01 <clarkb> ok
00:02 <clarkb> I'm watching boot of baremetal00 on console
00:02 <pabelanger> okay, back online
00:02 <pabelanger> bos status afs01.ord.openstack.org
00:02 <pabelanger>     Auxiliary status is: file server running.
00:02 <pabelanger> kernel is also correct
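For reference, bos talks to the OpenAFS basic overseer (bosserver), and the status checks above confirm the file server processes stopped cleanly before the reboot. The cycle used on each file server in this log looks roughly like this (a sketch; assuming root on the server so -localauth works):

    # stop the fileserver processes cleanly and wait for them to exit
    bos shutdown afs01.ord.openstack.org -localauth -wait
    bos status afs01.ord.openstack.org -localauth   # "file server shut down."
    reboot
    # after boot, bosserver restarts the configured instances on its own
    bos status afs01.ord.openstack.org -localauth   # "file server running."
    uname -r                                        # confirm the new kernel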
00:02 <clarkb> MLNX initializing devices
00:03 <clarkb> it says Link: Down Link status: Not connected
00:03 <clarkb> and the mac addr there matches what is in hiera :(
00:04 <pabelanger> odd
00:04 <clarkb> sorry, I take that back
00:04 <clarkb> the macaddr doesn't match
00:04 <clarkb> it finally moved on to net1, which appears to be the one that matches, and that one is dhcping
00:05 <pabelanger> I think we can give afs02.dfw.openstack.org a shot now
00:06 <clarkb> gah, missed whether it dhcped or not
00:06 <clarkb> pabelanger: cool, I say go for it then
00:06 <clarkb> now just a bunch of spam about usb devices flapping and that's it
00:06 <clarkb> I'm not sure it's getting network
00:07 <pabelanger> bos status afs02.dfw.openstack.org
00:07 <pabelanger>     Auxiliary status is: file server shut down.
00:07 <pabelanger> rebooting
00:08 <clarkb> and VSP isn't working for getting a login prompt
00:09 <pabelanger> bos status afs02.dfw.openstack.org
00:09 <pabelanger>     Auxiliary status is: file server running.
00:09 <clarkb> cool
00:10 <clarkb> pabelanger: now how do we make afs02 the RW instead of just RO? or does it matter, since we aren't going to do any writes with all the locks held. EG can we just do afs01 next?
00:10 <clarkb> thinking about that, I bet we could get away with no RW temporarily
00:10 <pabelanger> right, if we are holding locks, we might be able to get away with it
00:11 <pabelanger> unless somebody else does a write from some other location than mirror-update.o.o
00:11 <clarkb> like one of the wheel builders?
00:11 <clarkb> but otherwise that should be it, ya?
00:11 <clarkb> and in that case they should just fail and try again later I think
00:11 <pabelanger> ya
00:11 <pabelanger> or docs job
00:12 <clarkb> oh right, docs
00:12 <pabelanger> let me read quickly how to move a RW volume
00:13 <pabelanger> vos move seems to be the command
00:14 <clarkb> and maybe just move docs?
00:14 <pabelanger> ya, let me see if I can do that
00:16 <clarkb> I'm about to call it a day on baremetal00 for now and maybe see if cmurphy or rcarillocruz can take a look in their morning time
00:16 <clarkb> I don't know enough about the env to debug much further, but it appears to be no networking on that nic
00:18 <pabelanger>     Volume is locked for a move operation
00:18 <pabelanger> okay, move looks to be happening
00:18 <pabelanger> was sure to do it in screen too
00:18 <clarkb> cool
00:18 <pabelanger> vos move -id docs -fromserver afs01.dfw.openstack.org -frompartition vicepa -toserver afs01.ord.openstack.org -topartition vicepa
00:18 <pabelanger> was the syntax
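vos move relocates the read/write (RW) instance of a volume to another server/partition while any read-only (RO) replicas stay where they are. A sketch of how the move can be watched (assuming admin rights, or -localauth on a server):

    # show where the RW and RO sites for the volume live now,
    # including the "locked for a move operation" state
    vos listvldb -name docs -localauth
    # detailed per-volume info (size, status)
    vos examine docs -localauth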
00:21 <clarkb> I take it it isn't an instantaneous promotion?
00:21 <pabelanger> doesn't look like it
00:31 <clarkb> still going?
00:32 <pabelanger> yup
00:33 <pabelanger> have screen running on afs01.dfw.o.o
00:33 <clarkb> I'm trying mnaser's suggestion of booting the old kernel on baremetal00 now
00:33 <clarkb> waiting for it to boot back up again
00:35 <pabelanger> k
00:45 <clarkb> ok, baremetal00 is back up on its old kernel
00:45 <clarkb> gonna leave it alone until tomorrow
00:46 <clarkb> pabelanger: with that out of the way, do we want to try doing kdc01 or the db servers?
00:46 <clarkb> I guess that may impact the move so probably not?
00:46 <pabelanger> ya, I think we should wait until the move is done
00:47 <pabelanger> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2264&rra_id=all
00:47 <pabelanger> looks like 11 Mb cap
00:47 <clarkb> how big is the volume?
00:48 <pabelanger> 536870991
00:48 <pabelanger> 500GB?
00:48 <clarkb> ya
00:48 <clarkb> so this is gonna take a while
00:48 <clarkb> also that seems like a lot of docs
00:49 <pabelanger> so, going to be a few hours
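Rough arithmetic behind that estimate, hedged on whether the cacti cap is megabytes or megabits per second and on how full the volume actually is (only used data transfers):

    # at an 11 MB/s cap:   500,000 MB / 11 MB/s   ≈ 45,000 s  ≈ 12.6 hours
    # at an 11 Mbit/s cap: 4,000,000 Mbit / 11 Mbit/s ≈ 364,000 s ≈ 4+ days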
00:50 <clarkb> maybe pick this up tomorrow morning?
00:50 <clarkb> I too have errands that need to be done
00:50 <pabelanger> sounds like a great idea
00:50 <clarkb> like a new phone. My current one just died :(
15:45 <clarkb> I am going to reattempt rebooting baremetal00. I diffed the linux packages on it against compute000 and found it was missing linux-image-generic and linux-image-extra-$version-generic, so pulled those in, and I think that will get us future kernel updates (and the extra package has kernel modules we likely need). The old kernel is still there so can fall back on it again via ilo + grub if I
15:45 <clarkb> need to
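A sketch of that package comparison (the exact commands and temp paths are illustrative; the extra-modules package follows Ubuntu's linux-image-extra-<version>-generic naming):

    # on each host, list installed kernel-related packages
    dpkg -l 'linux-image*' 'linux-generic*' | awk '/^ii/ {print $2}' | sort > /tmp/pkgs-$(hostname -s)
    # compare the two lists
    diff /tmp/pkgs-baremetal00 /tmp/pkgs-compute000
    # pull in the metapackages so future kernels (and their extra modules) arrive
    apt-get install linux-image-generic linux-image-extra-$(uname -r)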
15:55 <clarkb> success! uname reports new kernel, ssh works, and I can ironic node list the ironic cluster
15:57 <pabelanger> yay
15:58 <pabelanger> also
15:58 <pabelanger> Volume 536870991 moved from afs01.dfw.openstack.org /vicepa to afs01.ord.openstack.org /vicepa
15:59 <clarkb> cool, I really need caffeine, but will be back shortly
15:59 <clarkb> guessing next step is to reboot afs01.dfw?
15:59 <pabelanger> ya, going to grab locks first again
16:01 <pabelanger> okay, decided to remove the crontab this time and place mirror-update into the emergency file
16:05 <pabelanger> so, I don't think we have done a vos release on npm for some time
16:06 <pabelanger> so, once the current flocks expire, we can start bos shutdown
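The mirror-update jobs serialize themselves with flock(1), so "grabbing locks" means holding the lock files so no new update run can start. A minimal sketch of the idea (the lock path and job name here are hypothetical):

    # hold the lock indefinitely so the cronned update job cannot start;
    # any run already in flight keeps the lock until it finishes
    flock /var/run/npm-mirror.lock -c 'sleep infinity' &
    # a cron entry of this shape then simply skips its run (-n = non-blocking)
    # 0 */2 * * * flock -n /var/run/npm-mirror.lock npm-mirror-update.sh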
16:08 <pabelanger> # bos status afs01.dfw.openstack.org
16:08 <pabelanger>     Auxiliary status is: file server shut down.
16:08 <pabelanger> rebooting
16:09 <clarkb> kk
16:10 <pabelanger>     Auxiliary status is: file server running.
16:10 <clarkb> and on new kernel?
16:11 <clarkb> before we allow things to run on mirror-update again, let's decide on a plan for the dbs and kdc
16:11 <pabelanger> yup
16:11 <clarkb> I think kdc can be done safely as long as there are no writers getting kerberos tokens
16:11 <clarkb> which means I think we can do that one now?
16:11 <clarkb> oh right, docs
16:11 <clarkb> hrm
16:12 <clarkb> maybe just go for it and rerun any docs jobs that may fail?
16:12 <pabelanger> docs is cron based
16:12 <pabelanger> isn't it?
16:12 <pabelanger> vos release is crontab
16:12 <clarkb> the vos release is, but I think the jobs write to the RW volume directly
16:12 <pabelanger> ya
16:12 <clarkb> both things require kerberos
16:13 <pabelanger> ya, guess we should set up redundant kerberos next :)
16:13 <clarkb> we can stop the cron from doing releases
16:13 <clarkb> then just rerun any jobs that fail
16:13 <pabelanger> are docs jobs on a static node?
16:13 <pabelanger> we could stop zlstatic01 for now
16:14 <clarkb> doesn't look like it, at least going off of the project-config docs job running now (it's in osic)
16:14 <clarkb> mordred: ^ you set up this stuff, any thoughts?
16:15 <pabelanger> wheel mirror also
16:16 <pabelanger> afsdb01 and afsdb02 we should be able to rotate then for shutdown
16:17 <clarkb> ya, I think the link you posted yesterday said it was the same steps as the FS too? bos shutdown then reboot?
16:17 <clarkb> I'm taking notes on things we have to watch out for on the kdc01 restart on the etherpad
16:17 <pabelanger> ya, afs docs say same process for db servers
16:17 <clarkb> Why don't we go ahead with db0X rotations. Then we can quiesce the afs writers as best as possible, then restart kdc01
16:17 <pabelanger> I'm assuming openafs is set up to fail over between the databases
16:18 <pabelanger> k
16:18 <clarkb> pabelanger: let's hope, that's why we have two of them :)
16:18 <pabelanger> will do afsdb02.o.o first
16:18 <clarkb> ok
16:18 <clarkb> and are you running that as root? and does bos shutdown require kerberos auth?
16:19 <pabelanger> using localauth
16:19 <pabelanger> and root
16:19 <pabelanger> # bos status afsdb02.openstack.org -localauth
16:19 <pabelanger> Instance ptserver, temporarily disabled, currently shutdown.
16:19 <pabelanger> Instance vlserver, temporarily disabled, currently shutdown.
16:19 <pabelanger> rebooting
16:20 <pabelanger> # bos status afsdb02.openstack.org -localauth
16:20 <pabelanger> Instance ptserver, currently running normally.
16:20 <pabelanger> Instance vlserver, currently running normally.
16:21 <pabelanger> kernel good too
16:21 <clarkb> yay
16:21 <pabelanger> moving on to afsdb01
16:21 <clarkb> ok
16:22 <pabelanger> # bos status afsdb01.openstack.org -localauth
16:22 <pabelanger> Instance ptserver, temporarily disabled, has core file, currently shutdown.
16:22 <pabelanger> Instance vlserver, temporarily disabled, currently shutdown.
16:22 <clarkb> I annotated the etherpad with thoughts on how we can quiesce the various pieces for the kdc01 restart
16:23 <pabelanger> # bos status afsdb01.openstack.org -localauth
16:23 <pabelanger> Instance ptserver, has core file, currently running normally.
16:23 <pabelanger> Instance vlserver, currently running normally.
16:23 <pabelanger> and kernel is good
16:23 <clarkb> woot
16:25 <clarkb> where does the docs vos release cron run?
16:25 * clarkb greps puppet
16:25 <pabelanger> okay, I can gracefully shut down zlstatic
16:25 <pabelanger> k, not sure myself
16:26 <pabelanger> 2017-04-14 16:25:44,636 DEBUG zuul.LaunchServer: Stopped
16:26 <clarkb> looks like maybe afsdb01
16:27 <pabelanger> ya
16:27 <pabelanger> I see it
16:27 <clarkb> pabelanger: you want me to put that host in the emergency file?
16:27 <pabelanger> sure
16:28 <pabelanger> I don't see a lock, so we'll have to remove the crontab
16:28 <clarkb> ya, I just added afsdb01.openstack.org to the emergency file, so I think you can remove the cron entry now
16:28 <pabelanger> crontab removed
16:29 <clarkb> I'm double checking packages on kdc01 now
16:29 <clarkb> looks good
16:29 <pabelanger> kk
16:29 <clarkb> shall I reboot it?
16:29 <pabelanger> sure
16:30 <clarkb> actually /me checks what services are running on it first so we can confirm they are up before releasing locks and things
16:30 <clarkb> oh wait, there is a kdc02
16:31 <pabelanger> oh
16:31 <clarkb> maybe
16:31 <clarkb> I can't ssh to it
16:31 <clarkb> pabelanger: are you able to get onto it?
16:31 <pabelanger> same, not responding
16:32 <pabelanger> http://www.tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/server-replication.html
16:32 <pabelanger> seems straightforward
16:32 <clarkb> kdc01 is trying to propagate the principals database to kdc02 using kprop right now
16:32 <clarkb> that's how I noticed it
16:32 <pabelanger> once we finish this, we can do ^
16:32 <pabelanger> let me check cacti
16:33 <clarkb> well, I think we may already do ^, but kdc02 is not working. So maybe we first sort out kdc02, then do the propagation, then do kdc01?
16:33 <pabelanger> seems we lost it back in Oct
16:33 <pabelanger> according to cacti
16:34 <pabelanger> sure, we can try and bring it back online
16:34 <clarkb> ya, let's do that
16:34 <clarkb> then I think we can do this using the process above
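For reference, the standard MIT Kerberos master-to-replica push that this propagation uses looks roughly like the following (the dump path is the Debian/Ubuntu default and is an assumption here):

    # on the master (kdc01): dump the principals database and push it
    kdb5_util dump /var/lib/krb5kdc/slave_datatrans
    kprop -f /var/lib/krb5kdc/slave_datatrans kdc02.openstack.org
    # the replica (kdc02) must be running kpropd to accept the transfer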
16:34 <pabelanger> okay, you grabbing console?
16:35 <clarkb> I was tempted to just try rebooting it with the nova api... but we can try console first
16:35 <clarkb> looks like it's an ord server
16:35 <pabelanger> okay, I'll let you drive :)
16:35 <clarkb> oh right, you can't use the normal console log api with rax?
16:36 <clarkb> confirmed...
16:37 <clarkb> nova api says server is active and running
16:39 <clarkb> pabelanger: was it you saying there was an ssh option for this?
16:41 <clarkb> I got the web thing working, there is a login prompt, but I can't login because ENOPASSWD
16:43 <clarkb> should we go ahead and do a reboot via the nova api?
16:44 <pabelanger> clarkb: ya, ssh
16:44 <pabelanger> you place the server into emergency mode, it will use the same IP
16:45 <pabelanger> you then get a new root password and can SSH
16:45 <clarkb> ah, I think we can just try a reboot before emergency
16:45 <pabelanger> kk
16:45 <clarkb> going to do that now via server reboot kdc02.openstack.org
16:46 <clarkb> I see it booting in the console
16:46 <pabelanger> ssh works
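The "server reboot" wording suggests openstackclient; a sketch of the calls involved (assuming credentials for the account are sourced in the environment):

    # soft reboot the instance through the compute API
    openstack server reboot kdc02.openstack.org
    # URL for the provider's web console ("the web thing" above)
    openstack console url show kdc02.openstack.org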
16:47 <clarkb> the kprop runs every 15 minutes
16:48 <pabelanger> we should do kdc02 first for kernel
16:48 <pabelanger> looks like it might need an update
16:48 <clarkb> pabelanger: let's let the 1700UTC kprop run, then update the kdc02 kernel, then kprop again, then do kdc01
16:48 <clarkb> yup
16:48 <pabelanger> wfm
16:49 <clarkb> I'm gonna make sure packages on kdc02 are up to date since it may not have had networking for a while
16:49 <pabelanger> maybe just kick it from puppetmaster.o.o?
16:49 <clarkb> package updates are run out of apt, not the puppet master
16:50 <clarkb> I'm just manually doing an update, and sure enough there are things that need updating
16:50 <clarkb> apt updates around 0600UTC daily iirc
16:50 <pabelanger> ah, right
16:50 <clarkb> hrm, got a message about setting up a kerberos realm.. I wonder if that happens regardless or if we haven't configured this server properly :/
16:52 <pabelanger> ya, I am unsure actually
16:53 <clarkb> /etc/krb5.conf says default realm is OPENSTACK.ORG so I think we are good once we kprop
16:54 <pabelanger> ansible is also running on kdc02 now
17:00 <clarkb> ok, I didn't catch the kprop happen. maybe I should run it in the foreground just to make sure it is happy?
17:01 <clarkb> doing that now
17:01 <clarkb> Database propagation to kdc02.openstack.org: SUCCEEDED
17:01 <clarkb> pabelanger: ready to reboot kdc02?
17:01 <pabelanger> sure
17:01 <clarkb> ok, doing it now
17:02 <clarkb> ok, kdc02 is back up again. I'm going to rerun propagation manually
17:03 <clarkb> then I think we can reboot 01?
17:03 <pabelanger> think so
17:03 <clarkb> kprop: Connection refused while connecting to server <-
17:03 <pabelanger> docs say things will fail over
17:03 * clarkb waits patiently
17:03 <pabelanger> oh
17:03 <pabelanger> Apr 14 16:57:04 kdc02 puppet-user[18536]: (/Stage[main]/Kerberos::Server/Service[krb5-kpropd]/ensure) ensure changed 'stopped' to 'running'
17:04 <pabelanger> think we need to wait for puppet to start it
17:04 <clarkb> oh, maybe
17:04 <pabelanger> let me kick.sh
17:04 <clarkb> kk
17:04 <pabelanger> if it comes online, then we can patch puppet
17:05 <pabelanger> kicking
17:06 <pabelanger> clarkb: try now
17:06 <clarkb> Database propagation to kdc02.openstack.org: SUCCEEDED
17:06 <clarkb> that must've been it
17:06 <pabelanger> working on a patch
17:06 <clarkb> I'm gonna double check packages on kdc01 now
17:07 <clarkb> says it's up to date, so ready to reboot it whenever you are
17:07 <pabelanger> go for it
17:07 <clarkb> ok, doing it now
17:08 <clarkb> it's back
17:08 <clarkb> and kernel is updated
17:08 <pabelanger> Yay
17:11 <pabelanger> clarkb: remove servers from emergency and bring zlstatic online?
17:11 <clarkb> pabelanger: yes, I think so
17:12 <pabelanger> zuul started
17:12 <pabelanger> servers removed from emergency
17:12 <pabelanger> will make sure crontabs are recreated
17:12 <clarkb> ok
17:15 <pabelanger> mirror-update.o.o good
17:16 <pabelanger> afsdb01.o.o good too
17:17 <clarkb> so now we just monitor that things are updating as expected, ya?
17:17 <clarkb> also, do we want to vos move the docs volume back to 01?
17:17 <pabelanger> ya, I'm going to hold the lock on npm and figure that out
17:17 <pabelanger> we haven't released in a month or so
17:17 <clarkb> oh wow
17:18 <clarkb> I'm tempted to try and write down the "how to reboot an entire afs cluster without downtime" procedure in system-config
17:18 <clarkb> let me start on that draft so that we don't forget
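A condensed outline of the procedure as exercised in this session, as a starting point for that draft (a sketch reconstructed from the log above, not the final doc):

    # 1. quiesce writers: hold mirror-update's flocks, remove the vos release
    #    crontab, add the hosts to the ansible emergency file
    # 2. vos move any RW volumes off a file server before rebooting it
    # 3. per file server: bos shutdown -localauth -wait; reboot;
    #    bos status until "file server running"; verify kernel
    # 4. reboot afsdb02 then afsdb01, one at a time (clients fail over)
    # 5. reboot kdc02 (replica) first, kprop, then kdc01 (master)
    # 6. restore crontabs, release locks, remove hosts from the emergency file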
17:21 <pabelanger> kk
17:21 <pabelanger> oh, we should also vos move docs back
17:21 <pabelanger> I can start that shortly
17:24 <clarkb> ok
17:27 <pabelanger> vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs01.ord.openstack.org -frompartition vicepa --localauth
17:27 <pabelanger> running now
17:28 <clarkb> pabelanger: and you are running that on afsdb01?
17:28 <pabelanger> yes
17:28 <pabelanger> from screen
17:45 <clarkb> pabelanger: fwiw, next time I think we want to move to afs02, which is local to the same datacenter (will be faster)
17:45 <clarkb> oh, I thought we had a second server in dfw; doesn't look like we do
17:45 <pabelanger> clarkb: agree. I thought that afs01.ord.o.o was actually not used any more
17:46 <clarkb> listvldb says it's the same
17:46 <pabelanger> we should confirm with jeblair next week
17:46 <clarkb> er, rather, RW and RO are colocated on the same server?
17:47 <pabelanger> Ya, RW RO on the same server
17:47 <pabelanger> with a backup RO
17:48 <clarkb> but the backup RO is in ord, right?
17:49 <pabelanger> not any more
17:49 <pabelanger> I thought we had stopped using ord, because network was a bottleneck
17:49 <clarkb> I'm looking at mirror specifically and I see afs01.dfw is RW and RO and afs01.ord is RO
17:49 <clarkb> I thought we added a second server in dfw to be RO too?
17:50 <clarkb> but not seeing that (could just be blind)
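For reference, what "listvldb says" here is vos listvldb output; for a volume in this state it looks roughly like the following (the volume IDs are illustrative):

    $ vos listvldb -name mirror
    mirror
        RWrite: 536871005    ROnly: 536871006
        number of sites -> 3
           server afs01.dfw.openstack.org partition /vicepa RW Site
           server afs01.dfw.openstack.org partition /vicepa RO Site
           server afs01.ord.openstack.org partition /vicepa RO Site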
17:50 <pabelanger> ya, I've seen that too. I _think_ we need to fix some things next week. And make sure everything is set up for afs01.dfw and afs02.dfw
17:51 <pabelanger> jeblair likely knows more
17:51 <clarkb> ++
