Friday, 2018-01-26

00:24 *** panda|bbl is now known as panda|off
00:50 *** rlandy|biab is now known as rlandy
00:51 <dmsimard> fsck 70% phase 1
01:17 <dmsimard> 71% phase 2 !!!
01:19 <pabelanger> progress
01:20 <dmsimard> pabelanger: also, the increase in swap is because the scratch files are larger now.. it's above 7GB now, earlier it was between 5 and 6
01:21 <dmsimard> s/is because/is likely because/
02:33 *** SergeyLukjanov has quit IRC
02:35 *** SergeyLukjanov has joined #openstack-infra-incident
03:08 *** rosmaita has quit IRC
03:50 <dmsimard> Stepping away for a while, we're at 74.2%
03:50 <dmsimard> load, swap and disk usage fairly stable
04:56 *** openstackstatus has quit IRC
04:57 *** openstackstatus has joined #openstack-infra-incident
04:57 *** ChanServ sets mode: +v openstackstatus
06:57 *** rlandy has quit IRC
11:01 *** panda|off is now known as panda
11:54 *** rosmaita has joined #openstack-infra-incident
12:57 <dmsimard> fsck @85.3%, temporary volume at 36% (I just bumped the timeout since we have room).. swap at 4GB but swapping activity is low
13:34 *** rlandy has joined #openstack-infra-incident
13:55 <mordred> dmsimard: \o/ maybe it'll finish today
13:55 *** panda is now known as panda|lunch
13:56 <dmsimard> it's a shame the gate is backlogged due to the integrated gate resets
13:57 <mordred> yah
14:39 *** efried is now known as fried_rice
14:41 *** dansmith is now known as superdan
14:51 <rosmaita> i second that emotion
14:54 <dmsimard> We went to great lengths to keep the jobs running and prevent impact on the gate as much as possible but here we are :/
14:57 <rosmaita> dmsimard: appreciate all you and mordred are doing, just wish thursday had been 72 hours long this week
14:58 <mordred> rosmaita: yah. I could use another 72 hours myself
14:59 <dmsimard> rosmaita: just the fact that people are understanding is already awesome in itself :)
15:05 *** panda|lunch is now known as panda
16:09 *** myoung|pto has quit IRC
16:13 *** myoung has joined #openstack-infra-incident
17:07 <dmsimard> fsck 88.3%
17:15 *** fried_rice is now known as fried_rolls
18:02 *** rlandy is now known as rlandy|brb
18:28 *** rlandy|brb is now known as rlandy
18:43 *** weshay is now known as weshay|ruck|brb
19:11 <corvus> i continue to favor the reformat option.  my feeling is that the utility of new logs exceeds that of the old ones, and so if an outage extends beyond the normal 8 hours, it's better to just start from scratch so that people can find and fix the current bugs.
19:11 *** zaneb has joined #openstack-infra-incident
19:12 <corvus> given that we're almost at the weekend, if folks wanted to try alternatives, that's probably okay.
19:13 <corvus> however, i'd suggest that we start monday morning with a fully functional log volume and all log uploads enabled, regardless.
19:13 <dmsimard> I'm losing hope that the server will become responsive in a timely fashion given the trend of the last two hours
19:14 <corvus> so, if folks wanted to spend the weekend either waiting for the current system to finish, or if they wanted to restart it without the scratch space and use only ram, i think that's okay.  but we should set the deadline that if it isn't done by 00:01 utc monday, we reformat.
19:14 <dmsimard> We've gone through a large percentage of the fsck at this point.. with a bit of luck if we reboot we could just mount it successfully, but I wouldn't count on that
19:15 <dmsimard> Is resizing the logserver to 16GB of ram entirely out of the question? I don't know what the constraints are
19:16 <dmsimard> 16GB of ram sounds perfectly appropriate for 13TB of storage
19:17 <corvus> dmsimard: an online resize?  it's possible.  the server would be offline for an unknown amount of time while it ran.  a replacement is also possible.
19:18 <dmsimard> corvus: a resize implies a hard reboot iirc
19:18 <corvus> dmsimard: that server only uses, at most, 2GB of ram normally, there's no reason for it to be that large, except to fsck.  it would be better to create a server merely to perform the fsck, and shift the volumes to and from it.
19:18 <corvus> dmsimard: yes, it culminates with a hard-reboot after an unknown period of downtime
19:19 <dmsimard> corvus: yeah I thought about a temporary server for that purpose too -- the problem is making sure we mount/remount the volumes in the right order and re-create the LVM on the other end.. could prove tricky
19:19 <corvus> dmsimard: i don't believe order matters.
19:20 <dmsimard> it sure does when dealing with physical disks :D
19:20 <corvus> the lvm superblocks should take care of that
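
For context on why the attach order should not matter: LVM records a UUID in each physical volume's on-disk metadata and reassembles the volume group from that, not from device names. A minimal sketch of what reassembly on a temporary fsck server could look like, assuming the volume group is named "main" (as the /dev/main/* paths later in this log suggest):

    pvscan                  # discover every attached physical volume by its on-disk UUID
    vgscan                  # rebuild volume group metadata from the discovered PVs
    vgchange -ay main       # activate the logical volumes in the "main" volume group
    lvs main                # confirm the LVs are visible before running fsck against them
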
19:20 <dmsimard> ah.. perhaps.
19:24 <pabelanger> Yah, I think at this point in our outage, reformat might be best to move forward.  I'm not sure I'd want to start rebuilding / resizing to get 16GB for fsck.
19:24 <dmsimard> corvus: the thing that bothers me about all of this is that we're not making the situation any better or doing anything to prevent this from re-occurring. I know pabelanger mentioned Vexxhost volume sizes were not as restrictive... At this point, is a bare metal server even out of the question?
19:24 <corvus> dmsimard: we'll move to swift.
19:24 <pabelanger> well, we know this is a point of failure and have some discussion to fix it
19:24 <pabelanger> yah
19:24 <corvus> dmsimard: or the vexxhost thing.
19:25 <corvus> one of those.
19:25 <pabelanger> I don't think we can fix that now
19:25 <corvus> indeed.  this is the least appropriate time to change the system.
19:27 <pabelanger> I'm still not able to SSH into the server, do we still have a connection up at this point?  If we do reboot, the volume is still removed from /etc/fstab, so we shouldn't mount
19:27 *** weshay|ruck|brb is now known as weshay|ruck
19:27 <dmsimard> Ok.. so how does this sound.. 1) Reboot 2) See if we can mount the log volume 3) If not, consider reformatting? Or run fsck without scratch files and fully disable log upload (which still won't impact the gate since only failed jobs would upload logs)
19:28 <dmsimard> pabelanger: even the console is unresponsive, there is no password prompt after typing the username
19:29 <dmsimard> but http is working flawlessly, go figure
19:29 <corvus> i'm not in favor of mounting without an fsck.  i have no confidence it would not fail randomly at any point later.
19:29 <pabelanger> yah, I don't think we can do #2 without fsck
19:29 <dmsimard> ok so it boils down to reformat or fsck without scratch files (which is what we usually end up being able to do)
19:30 <pabelanger> so, reformat (lose 4 weeks of logs) or fsck for 6 hours
19:30 <dmsimard> hmmm... are we able to clone volumes? Like, cinder create --source-volid
19:30 <corvus> in order of preference, i suggest: (1) reformat (2) fsck without scratch files (3) fsck on temporary larger host (4) allow to continue
19:31 <corvus> you could probably convince me to swap 2<->3 if you really felt like spending your weekend doing that
19:31 <dmsimard> I am thinking perhaps we could clone the volumes before formatting -- see if we can fsck them elsewhere
19:32 <corvus> but in all cases, i suggest we maintain the sunday/monday midnight reformat deadline.
19:33 <pabelanger> Right, agree with the order. I'm not sure if I am around much this weekend.
19:33 <corvus> dmsimard: i'm not certain if our quota would permit that, or how long it would take.
19:33 <corvus> i am sure i am not around this weekend.
19:33 <dmsimard> I don't know what their storage backend is -- with ceph, even with large volume sizes, it's near instantaneous
19:35 <pabelanger> we've also had the release team extend the queens-3 milestone due to CI issues too
19:36 <dmsimard> I'm not sure I know how to check what our quotas are, the different CLI commands are returning 400's or 404's and the bulk of our servers aren't showing up in the rax interface
19:36 <corvus> dmsimard: all the servers should be there...  are you using the right account?  openstackci?
19:36 <dmsimard> ah I was using openstack
19:42 <pabelanger> dmsimard: you're looking to see if we can clone volumes?
19:42 <dmsimard> yeah, sec
19:51 <dmsimard> finally managed to get quotas...
19:52 <dmsimard> So we have 51200 SATA and 25600 SSD.. I'm going to guess those are gigabytes. We're not using any SSDs and for SATA we're at 35591 out of 51200.
19:53 <dmsimard> We'd have just enough room in the sata pool and plenty in the ssd pool.
19:53 <dmsimard> I don't suspect it's possible to clone volumes across volume types
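
One way the usage-versus-quota check above is commonly done from the CLI (hedged, since the exact behavior of these commands against Rackspace's older volume endpoint is an assumption here, and may be related to the 400s/404s mentioned earlier):

    # assumes the openstackci credentials are already sourced in the environment
    cinder absolute-limits | grep -i gigabytes   # e.g. maxTotalVolumeGigabytes vs. totalGigabytesUsed
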
19:56 <corvus> dmsimard: cloning sounds okay to me, as long as we're okay with falling back on just losing the data.
19:56 <corvus> so we'd clone, reformat originals, attach clones to new 16G host, fsck clones, rsync data back, delete new host and clones?
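
A rough sketch of the clone step being proposed, using the `cinder create --source-volid` form dmsimard mentioned above. The volume IDs, names, and the 1024 GB size are placeholders, and `--display-name` (rather than `--name`) assumes the older v1 volume API noted further down in this log:

    # clone each source volume; the new volume must be at least the size of its source
    for src in 11111111-aaaa-placeholder 22222222-bbbb-placeholder; do
        cinder create --source-volid "$src" --display-name "clone-of-$src" 1024
    done

The clones would then be attached to the temporary 16GB host, reassembled with LVM as sketched earlier, fsck'd, and the recovered data rsynced back onto the reformatted originals.
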
19:58 <pabelanger> and we don't currently know how long a clone would take
19:58 <corvus> (i suppose rsyncing the other direction, and swapping the volumes out again is possible, but that means more downtime, whereas the plan above means no downtime after the reformat, and we just progressively fill in old data)
19:58 *** fried_rolls is now known as fried_rice
19:58 <corvus> pabelanger: true.  if it's cow, it could be instantaneous.  if not, it could take 8 hours.
19:59 <dmsimard> trying a clone now, it doesn't seem instantaneous and there's actually a sort of progress indicator: clone-progress='1.29%'
19:59 <corvus> i'm hoping/assuming that if things go wrong, we can just delete all the volumes and start over (thereby falling back on the 'just reformat' option)
20:00 <corvus> dmsimard: can you extrapolate that?  (also, this is an on-line clone?)
20:00 <pabelanger> yah, 8 hours to clone, another 6 for fsck, then x hours for rsync.  Say another day to round it off
20:01 <corvus> pabelanger: right.  my main concern though is when the 'upload logs' service starts working again.  so with the clone plan, that's determined by how long it takes to clone.
20:01 <corvus> if the other stuff takes longer, i'm not worried.
20:01 <dmsimard> corvus: That's online -- some Cinder backends provide the capability, some don't.. apparently they do, but it's not a snapshot/cow
20:02 <pabelanger> corvus: agree
20:02 <dmsimard> it's at 6% now, eh
20:02 <corvus> dmsimard: what's the start time?
20:03 <dmsimard> 2018-01-26T19:57:39.000000 clone-progress='6.96%'
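
A small sketch for watching that progress, assuming the clone-progress value dmsimard pasted is exposed in the volume details returned by `cinder show` (the volume ID is a placeholder). At roughly 7% in the first five or so minutes, a single clone extrapolates to a bit over an hour, which lines up with corvus's estimate below:

    VOLUME_ID=22222222-bbbb-placeholder
    while true; do
        cinder show "$VOLUME_ID" | grep -Ei 'created_at|clone-progress'
        sleep 60
    done
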
20:03 <dmsimard> Seems slow considering we have 13 of these to do.
20:03 <corvus> dmsimard: can you do them all in parallel?
20:03 <dmsimard> Probably nothing preventing me from doing them in parallel
20:04 <corvus> dmsimard: i'm estimating about an hour if that rate holds.  if that works in parallel, this is probably doable.
20:04 <dmsimard> Let me see
20:04 <dmsimard> 9.25% now :)
20:05 <corvus> i'm going to lunch now.
20:05 <corvus> dmsimard: if you want to proceed with this, i'd recommend getting all of them cloning, and also start spinning up a new temporary 16G server for the fsck.
20:05 <corvus> hopefully that means in about an hour we can get moving on all aspects.
20:06 <pabelanger> also looks like I'm going out for dinner tonight, so I won't be online much in the next 60 mins
20:06 <pabelanger> corvus: ++
20:19 <dmsimard> clone in progress: http://paste.openstack.org/raw/653933/
20:19 <dmsimard> brb
20:23 <dmsimard> hmm, I think the fsck is hitting one of the volumes pretty hard, one of the clones is lagging behind a lot
20:23 <dmsimard> Should we reboot since we're not going to let the fsck finish anyway?
20:25 *** srwilkers has joined #openstack-infra-incident
20:30 *** mrhillsman has joined #openstack-infra-incident
20:31 <pabelanger> yes, if the new plan is to clone / fsck and/or format, we likely reboot now
20:33 <dmsimard> pabelanger: ok, I will attempt ctrl+alt+delete in the hope that it kills the fsck in a way that is slightly more gentle, and fall back to an API reboot. Ack?
20:34 <pabelanger> okay
20:37 <dmsimard> no go on ctrl+alt+delete, trying a soft reboot
20:39 <dmsimard> I have a ping going, it's probably going to fall back to a hard reboot.. it's not rebooting.
20:41 <dmsimard> so rax is still on cinder api v1, openstackclient defaults to v3 :/
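
If the client defaulting to a newer version is the issue, one workaround (hedged; exact support depends on the client versions installed where the commands are run) is to pin the volume API version explicitly:

    export OS_VOLUME_API_VERSION=1   # picked up by the cinder and openstack CLIs (version-dependent)
    openstack volume list            # or: cinder --os-volume-api-version 1 list
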
20:43 <dmsimard> hard reboot went through
20:45 <dmsimard> logs.o.o is back
20:45 <pabelanger> yup, I'm able to SSH again
20:46 <pabelanger> dmsimard: any improvement in the cinder clone?
20:47 <dmsimard> pabelanger: Hard to tell without timed data points after rebooting, I'll be able to tell in a few minutes but here's the current status (main10 was the one lagging behind) http://paste.openstack.org/raw/653942/
20:50 <pabelanger> ok
20:57 *** rosmaita has quit IRC
21:07 <dmsimard> Looks like main10 is struggling a bit despite the reboot, update from now: http://paste.openstack.org/raw/653973/
21:07 <dmsimard> Maybe it's on a slower storage node or something
21:10 <corvus> dmsimard: so that's maybe 4-5 more hours to finish the clone.
21:24 <dmsimard> we got a few volumes that finished cloning
21:26 <dmsimard> main10 got about ... 4% in 15 minutes. Bleh.
21:36 *** rosmaita has joined #openstack-infra-incident
21:41 <dmsimard> main10 just got 15% progress in about 3 minutes \o/
21:43 <dmsimard> I lied, bad timestamping, it's still at 38% now though. Most of the other volumes are finished cloning.
21:49 *** rlandy has quit IRC
22:03 <fungi> in another airport for a few minutes and caught up on scrollback in here... an instance resize wouldn't have been an option anyway (i think it might be possible with older non-pvhvm flavors if they even still have those, but not supported on the modern flavors we've been using)
22:04 <dmsimard> infra-root: Have something I need to take care of and need to afk for a few hours. I started a screen on the puppetmaster with the status for the main10 volume, which is the last one left to clone. It's at 61% right now. Once the clone is finished we can go ahead with the reformat and I'll take care of attempting the recovery (which might be tomorrow)
22:04 <fungi> poor i/o to main10 is likely why the fsck was going slowly
22:04 <dmsimard> maybe
22:07 <dmsimard> infra-root: note that there's no loop right now deleting data from /srv/static, it's at 46% so we're more than likely going to have enough time to switch things around.
22:07 <dmsimard> taking off now, sorry
22:07 <fungi> reformatting the "slow" main10 may be a poor choice
22:15 *** panda is now known as panda|off
22:22 <corvus> fungi: agreed; we may want to roll the dice on a new volume while we're at it, if we have the quota.
22:31 <fungi> could lvremove the logs volume, vgreduce off of main10 (pvmove any other logical volumes if they're on there, but odds are they aren't), detach the old main10, cinder create a new main10 to attach, vgextend back onto it and then lvcreate the replacement logs volume and format that
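
Roughly, fungi's plan at the command level. The volume group and logical volume names ("main"/"logs") come from elsewhere in this log, while the device paths and extents below are placeholders, so treat this as a sketch rather than the exact procedure that was run:

    lvremove /dev/main/logs              # drop the corrupted logs LV
    pvmove /dev/xvdk                     # only if other LVs have extents on the old main10
    vgreduce main /dev/xvdk              # remove the old main10 PV from the VG
    # detach the old cinder volume, create and attach a replacement main10, then:
    pvcreate /dev/xvdl
    vgextend main /dev/xvdl              # add the replacement PV back into the VG
    lvcreate -l 100%FREE -n logs main    # recreate the logs LV (extent count is a placeholder)
    # the mkfs.ext4 invocation actually used is recorded later in this log
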
22:32 <fungi> if the others cloned quickly then there's likely no need to replace them
22:41 <dmsimard> temporarily stepped by the laptop, main10 has finished cloning and everything else is finished as well: http://paste.openstack.org/raw/654070/
22:41 <dmsimard> feel free to proceed
22:41 * dmsimard off
22:49 <corvus> infra-root: i will execute fungi's plan now
22:51 <corvus> (though i'm going to create the new volume first, to reduce the chance that the old gets immediately re-used)
23:02 <corvus> infra-root: oh, since we're making a new fs, what settings should we use (inodes, etc)?
23:06 <fungi> i would consider kicking up the inode ratio a notch
23:06 <corvus> were we at the default before?
23:07 <fungi> to my knowledge, yes
23:07 <fungi> the graph said we were doing okay on inode count last time i looked, but it's frustrating to hit an inode cap when you still have room for more blocks
23:08 <corvus> do we graph inode count?
23:08 <fungi> oh, right, i may have been looking at df actually :/
23:08 <corvus> ah
23:08 <corvus> i don't see anything in https://wiki.openstack.org/wiki/Infrastructure_Status about inodes, so assuming default
23:09 <fungi> so if we don't graph it (i can't easily check at the moment) then it'll be hard to know
23:09 <fungi> but yeah, i don't remember us setting it to a non-default value when we switched to ext4 way back whenever
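
Two quick checks that would answer both questions, assuming shell access to the server once it is responsive (the paths are the ones used elsewhere in this log):

    df -i /srv/static/logs                        # inode usage vs. capacity, analogous to df -h for blocks
    tune2fs -l /dev/main/logs | grep -i inode     # reports the inode count and size the filesystem was created with
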
23:11 <corvus> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-10-11.log.html#t2017-10-11T20:25:23
23:12 <corvus> there's some information
23:12 <fungi> i suppose if we want to reformat it a second time later in the weekend after we fsck the cloned copies, we might be able to tell for sure
23:14 <corvus> i'm not sure we need to double it... should we maybe do 1.5x?
23:15 <corvus> default is 16kB per inode.
23:17 <corvus> well, our *current* use, based on only the failed logs uploaded in the last few hours, is 28kB per inode.
23:17 <corvus> and of course, a lot of that stuff hasn't been gzipped yet
23:18 <fungi> did you mean should we maybe do something <1x?
23:18 <fungi> like 12kB/inode (0.75x)
23:19 <corvus> fungi: yes (i was thinking 1.5x inode count at the time i wrote that, which would be 0.75x inode ratio)
23:19 <fungi> 1.5x the old bytes per inode would mean we run out of inodes faster
23:20 <fungi> okay, cool
23:20 <fungi> same page then ;)
23:20 <corvus> right, i think no higher inode ratio than the default, but maybe less.
23:20 <corvus> i'm trying to run a gzip pass real quick to get slightly better data from the admittedly very small sample size.
23:21 <fungi> yeah, i'm pretty sure we were close to 1:1 block % vs inode % a few weeks ago when i took a look
23:22 <fungi> which suggests that the ratio was sufficient, but not generous in terms of absorbing unanticipated inode consumption spikes
23:23 <corvus> fungi: maybe even 0.625: 10240?
23:23 <corvus> just a small tweak
23:24 <corvus> or strike that, that's not what i meant to do
23:25 <corvus> 0.875: 14336 is what i meant
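
For reference, the figures being tossed around here are all fractions of the ext4 default of 16384 bytes per inode mentioned above (a lower bytes-per-inode value means more inodes for the same volume size):

    # 0.625 * 16384 = 10240
    # 0.750 * 16384 = 12288
    # 0.875 * 16384 = 14336   <- the -i value settled on below
    echo $(( 16384 * 875 / 1000 ))   # prints 14336
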
23:27 <corvus> okay, the gzip pass finished and it's still about 28k/inode used
23:27 <fungi> sure, that's probably plenty of headroom until we execute a plan to drop the old logserver in favor of something better (sharded across volumes, stashed in swift, whatever)
23:28 <corvus> fungi: i'm inclined to only make a small change, so my gut instinct based on remembered data would be to do -i 14336
23:28 <corvus> how's that sound?
23:29 <fungi> wfm
23:29 <corvus> #status log cloned all volumes from static.openstack.org for later fsck; replaced main10 device because it seemed slow and recreated logs logical volume.
23:29 <openstackstatus> corvus: finished logging
23:29 <fungi> matches what i would expect per the manpage
23:31 <fungi> i need to disappear again shortly to board yet another flight
23:31 <corvus> #status log created logs filesystem with "mkfs.ext4 -m 0 -j -i 14336 -L $NAME /dev/main/$NAME" http://paste.openstack.org/show/654140/
23:31 <openstackstatus> corvus: finished logging
23:31 <corvus> fungi: does that look reasonable? ^
23:32 <fungi> not sure you need -j with mkfs.ext4 but it's likely fine? the rest is definitely sane
23:34 <corvus> okay, rsyncing the accumulated data
23:35 <fungi> yeah, based on my reading of the -t option (which is implied by mkfs.ext3 and 4) you'd have a journal created regardless (though you could override to omit the journal under ext3 by adding a -O exclusion)
23:42 <fungi> i'll likely be offline again until fairly late tomorrow, sorry i can't be more help
23:42 <corvus> fungi: bon voyage!
23:43 <fungi> thanks! and also thanks to you and others for getting the logserver back on track
23:43 <fungi> i'll try to check back in again as soon as i can
