Thursday, 2018-01-25

*** rlandy|bbl is now known as rlandy02:58
*** rlandy has quit IRC03:31
*** rosmaita has quit IRC04:06
*** panda is now known as panda|lunch12:31
*** rosmaita has joined #openstack-infra-incident12:56
*** panda|lunch is now known as panda13:20
*** rlandy has joined #openstack-infra-incident13:37
*** weshay|rover is now known as weshay|ruck13:58
*** ChanServ sets mode: +o dmsimard14:19
*** frickler has joined #openstack-infra-incident14:19
*** dmsimard changes topic to "logs.openstack.org filesystem is read only | OpenStack Infra team incident handling channel"14:19
*** dmsimard sets mode: -o dmsimard14:19
*** efried has joined #openstack-infra-incident14:19
dmsimardinfra-root: Running a fsck *now* would be quite expensive for the gate which is already quite backlogged14:20
dmsimardDo we have any ability to spin off a new logs.o.o server somehow, FSCK the previous one in the backlog and do redirects as appropriate ? I had a plan for something like this to phase out an old RDO logserver but I never got around to doing it14:21
pabelangerlikely not, we have 14 1TB volumes attached to logs.o.o14:21
dmsimardpabelanger: right -- do we have the ability to do anything better than that ?14:22
pabelangerat this moment, no14:23
dmsimardI know it's kind of a stupid question because if there was anything better we would probably be doing it already14:23
pabelangerwe've talked about moving logs.o.o to vexxhost14:23
pabelanger[Thu Jan 25 13:37:15 2018] end_request: I/O error, dev xvdh, sector 106971257614:23
dmsimardpabelanger: but for example, does any other provider (OVH, Vexxhost) have bigger volumes14:23
pabelangerit does seem limited to xvdh right now14:23
dmsimardyeah any one of the volume being even remotely unavailable will thrash the entire lvm14:23
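For reference, the layout being discussed can be inspected with standard LVM tooling; a quick sketch (the "main" VG / "logs" LV names are inferred from /dev/mapper/main-logs elsewhere in this log, and the xvd* device names are assumptions):

    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT   # the ~14 cinder-backed xvd* block devices
    pvs && vgs && lvs                    # PVs feeding the "main" VG and its "logs" LV
    dmsetup ls --tree                    # map /dev/mapper/main-logs back to the xvd* devices
    dmesg | grep -i 'I/O error'          # which backing device is actually throwing errors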
AJaegerpabelanger: unmount it, log to local file system?14:24
AJaegeror attach temporary disk, log to it, fix big volume, reattach and move logs over?14:25
-openstackstatus- NOTICE: We're currently experiencing issues with the logs.openstack.org server which will result in POST_FAILURE for jobs, please stand by and don't needlessly recheck jobs while we troubleshoot the problem.14:25
pabelangerright, looking for disk space now14:25
pabelangerI'm going to remove jenkins SSH key for now to stop jobs from uploading14:27
pabelangeralso going to put logs.o.o into emergency14:27
dmsimardI will also put max-servers to zero everywhere across the board so we don't fail all the jobs needlessly14:28
pabelangerno, shouldn't need that14:29
fungii'm not really here, but if you just unmount it there's ~100gb free in the parent volume to get by on until you can mount it again and then (optionally) move those accumulated logs back in from a secondary mount14:29
dmsimardpabelanger: why ? we're starting jobs we know will fail14:30
fungie.g., logs end up accumulating in the /srv/static volume instead of the /srv/static/logs volume14:30
fungiwhich is already its own volume with nearly 100gb free14:31
pabelangerfungi: okay, can do that14:31
fungi(hopefully that volume isn't spanning the impacted block device)14:31
* fungi has to disappear to another meatspace meeting now14:32
dmsimardnl02.o.o: -bash: cannot create temp file for here-document: No space left on device14:34
dmsimardlooking..14:34
dmsimardwow14:35
dmsimard# du -sh /var/log/nodepool/14:35
pabelanger/srv/static/logs:    root     kernel mount /srv/static/logs14:35
dmsimard37G    /var/log/nodepool/14:35
pabelangerlooks like kernel is holding mount14:35
pabelangernot sure how best to force the umount14:35
*** corvus has joined #openstack-infra-incident14:36
dmsimardpabelanger: looking14:37
pabelangerumount -l worked14:37
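For context, a typical sequence for detaching a busy mount like this, ending with the lazy unmount pabelanger used; a sketch, not necessarily the exact commands run here:

    fuser -vm /srv/static/logs    # show processes (and kernel users) holding the filesystem
    lsof +f -- /srv/static/logs   # open files on that filesystem
    umount /srv/static/logs       # try a normal unmount first
    umount -l /srv/static/logs    # lazy unmount: detach now, clean up when the last user lets go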
pabelangerssh access enabled again14:38
pabelangerokay, think we are accepting logs again14:38
dmsimardI'm fixing nl0214:38
mordredpabelanger, dmsimard: morning! oh -  just missed the fun didn't I?14:39
pabelangermordred: almost, I think we need to fsck our logs volume14:39
pabelangerand maybe change out a cinder volume?14:39
pabelangernot sure yet14:39
pabelangerbut /srv/static is our new logs folder14:40
dmsimardpabelanger: that's probably going to let log uploads work but likely the vhost isn't looking there14:40
pabelangerthat is fine14:40
dmsimardoh actually it does look there? I dunno http://logs.openstack.org/ seems to be working14:40
corvusthe vhost is fine14:40
corvuslogs are being written to the same location, it's just on a different volume now14:41
dmsimardok I free'd up 37GB on nl02.o.o14:41
corvusdmsimard: what was taking up space?14:41
dmsimardcorvus: 37GB worth of logs14:41
dmsimardI had to delete some of them because we were really at 100%, and I gzipped the most recent one14:42
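The nl02 cleanup was essentially compressing the rotated-but-uncompressed logs and pruning the oldest ones; roughly something like this (exact file names are an assumption):

    cd /var/log/nodepool
    ls -lhS                    # rotated *.log.N files left uncompressed were most of the 37G
    gzip *.log.[0-9]*          # compress rotated logs; they compress surprisingly well
    # then remove the oldest .gz rotations if space is still tight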
pabelangerokay, how would we like to handle readonly volume of /srv/static/logs? I've not done the repair of this before14:42
dmsimardcorvus: logs were rotated but not gzipped -- also we're logging in double because we're logging debug to both launcher-debug.log and nodepool-launcher.log ?14:42
pabelangerhttp://logs.openstack.org/38/533638/7/gate/tox-py35/8ebe16a/ just uploaded, so jobs are okay now14:42
dmsimardcorvus: but yes, logs.o.o is more important14:43
corvuslet's first check if xvdh is okay now14:43
corvusi'm running a dd to read it, just to make sure we're not getting any more io errors14:43
corvusdd if=/dev/xvdh of=/dev/null bs=409614:44
dmsimardpabelanger: from the looks of the history it's an fsck on /dev/mapper/main-logs14:44
dmsimardjust gotta make sure to start that in a screen but let's wait if xvdh is fine14:44
Shrewsdmsimard: the nl02 problem is interesting. nl01 and nl03 seem fine14:44
corvusdmsimard: any previous fsck commands have any interesting arguments?14:44
dmsimardcorvus: http://paste.openstack.org/raw/653419/14:45
dmsimardShrews: yes14:45
dmsimardShrews: now that we have free space on it I'm trying to understand what's going on :)14:46
dmsimardbecause it's still logging like crazy14:46
pabelangerokay, so is 'fsck -v -y /dev/mapper/main-logs' what we want to run, or fsck /dev/xvdh directly?14:47
Shrewsdmsimard: oh my, yes. something is wrong there.14:47
dmsimardwe want to fsck the aggregate volume14:47
dmsimardsince it's not mounted, we want to fsck /dev/mapper/main-logs instead of /srv/static/logs14:48
pabelangerright14:48
pabelangerokay, I have screen up14:48
pabelangerand will be running14:48
pabelangerfsck -v -y /dev/mapper/main-logs14:48
pabelangerwe are okay with that?14:48
dmsimardShrews, corvus: this is madness (nl02) http://paste.openstack.org/raw/653420/14:48
Shrews'nl02-5138-PoolWorker.infracloud-chocolate-main' is being logged like crazy... probably in the declined_by attribute?14:48
dmsimardpabelanger: wait for corvus to ack that xvdh is fine14:49
Shrewsyeah that14:49
mordredpabelanger, Shrews: it seems like our logic for handling clouds that no longer exist is not doing the right thing14:49
Shrewsi've got a guess at what may be happening14:49
mordredShrews: oh, good. cause I was just shooting in the dark14:49
dmsimardShrews: reproducing this would probably be non-trivial so let's troubleshoot it live :/14:50
Shrewssince chocolate is gone, i'm betting it keeps declining requests over and over, forever adding to the declined_by attr14:50
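A quick way to sanity-check that hypothesis from the shell (log paths from the pastes above; the grep pattern is an illustration, adjust to whatever the runaway lines actually contain):

    du -sh /var/log/nodepool/             # confirm the debug log is what is growing
    ls -lhS /var/log/nodepool/ | head
    tail -n 100000 /var/log/nodepool/launcher-debug.log \
      | grep -c 'PoolWorker.infracloud-chocolate-main'   # one worker dominating = runaway declines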
corvusthe dd is slow, so i'm going to have it skip to closer to where the io error is logged.14:50
Shrewsbut just a guess14:50
pabelangercorvus: ack14:50
corvusdd if=/dev/xvdh of=/dev/null bs=512 skip=106971257014:51
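The skip value lines up with the kernel error from 13:37 (dev xvdh, sector 1069712576): with bs=512, dd's skip= is counted in the same 512-byte sectors the kernel logs, so skip=1069712570 starts a few sectors before the previously failing spot. A bounded variant for spot-checking just that region (the count is an arbitrary choice, not what corvus ran):

    # kernel: end_request: I/O error, dev xvdh, sector 1069712576
    dd if=/dev/xvdh of=/dev/null bs=512 skip=1069712570 count=1000000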
dmsimard the /srv/static mount point is already at 12% usage and is probably going to run out of space before the fsck ends, do we have a plan B ?14:51
corvusdmsimard: continuously delete things there, i guess.14:52
Shrewshttp://paste.openstack.org/raw/653422/14:52
dmsimardShrews: there you go14:52
corvusit'll take 6+ hours to fcsk, so we'll probably just need to run a find in a while loop14:52
Shrewsnow what changed recently to break that14:52
dmsimardShrews: we split the nl configuration14:53
dmsimardShrews: also https://review.openstack.org/#/c/536953/14:53
dmsimardShrews: https://github.com/openstack-infra/project-config/commits/master/nodepool14:53
corvusi think the dd has gone over the trouble spot from the logs without error. i think we can fsck now.14:53
pabelangerokay14:54
pabelangerfsck -v -y /dev/mapper/main-logs14:54
pabelangerwe are good with that?14:54
Shrewsdmsimard: yes. we've done that before. this is new14:54
dmsimardpabelanger: in a screen, yes14:54
mordredcorvus, dmsimard: we could also update the log upload job to ignore upload errors, so that the gate doesn't flap14:54
corvuspabelanger: hang on just a sec14:54
pabelangeryes, screen is running as root14:54
pabelangercorvus: sure14:54
Shrewsi'm guessing https://review.openstack.org/533372 might be causing this14:54
dmsimardmordred: hmmmm I like that but I wonder if we would have a clever way of ignoring log upload errors only for successful jobs14:55
mordreddmsimard: we have zuul_success variable14:55
dmsimardmordred: or even disable log upload for successful jobs altogether, I dunno -- we can just continuously discard logs too.. none of these ideas are awesome but hey14:56
corvuspabelanger: i wanted to verify that it was '-y' we wanted, and i have.  http://eavesdrop.openstack.org/irclogs/%23openstack-infra-incident/%23openstack-infra-incident.2017-03-31.log.html14:57
corvusthat was the time i remember where we used the wrong option and had to do it again.14:57
corvusmordred: maybe we could even just stop uploading logs?14:58
dmsimardcorvus: that's what I was suggesting but only for successful jobs14:58
mordredyah. or - at least don't upload logs for gate jobs or successful jobs14:58
mordredgah14:58
pabelangercorvus: okay, just to confirm. command posted is correct14:58
corvuspabelanger: yep.14:58
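Pulling the agreed-on commands together (the e2fsck -C variant is only noted as the thing that would later have made the SIGUSR1 dance unnecessary; it was not part of the command actually run):

    screen -S fsck                              # keep the repair alive across SSH drops
    fsck -v -y /dev/mapper/main-logs            # verbose, answer yes to all repairs
    # e2fsck -v -y -C 0 /dev/mapper/main-logs   # same, plus a completion bar on stdout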
pabelangergreat14:58
AJaegernote that we upload logs and docs to the same volume - so, what to do with docs jobs?14:58
mordredAJaeger: don't docs go to afs?14:59
pabelangerHmm, says /dev/mapper/main-logs in use14:59
AJaegermordred: published ones go to afs - draft ones go to logs14:59
pabelangerlet me see why that is14:59
AJaegermordred: and I care about the draft docs from check pipeline15:00
dmsimardmordred, corvus: ok I'll try to send a patch to do the logs only on failure thing15:01
mordredAJaeger: k. lemme get the no-log patch written, then let's figure out docs-draft15:01
mordreddmsimard: already working on it15:01
AJaegeryeah, let's do that...15:01
dmsimardmordred: ok I'll review then -- from the state of the gate we might want to accelerate merging that or it won't be effective for a while15:01
pabelangercorvus: mind helping with /dev/mapper/main-logs inuse error15:02
corvuspabelanger: it looks like apache is holding files open15:02
corvuspabelanger: i will restart it15:02
* AJaeger needs to go offline for probably rest of day and sends virtual cookies and coffee...15:02
pabelangerah15:02
corvuspabelanger: any better now?15:03
mordreddmsimard, corvus: remote:   https://review.openstack.org/537929 Only upload logs when jobs fail15:03
pabelangercorvus: no, I still see 'in use' from fsck15:03
pabelangercorvus: I did use umount -l /srv/static/logs I wonder if that is an issue15:03
dmsimardpabelanger: I don't see anything with lsof |grep mapper15:03
corvuspabelanger: hrm, i wonder if the lazy unmount from earlier is inhibiting our ability to find out what's using it15:04
pabelangeryah, I fear that too15:04
dmsimardhang on15:04
dmsimardwhat is that process ? root     26026  0.1  0.0      0     0 ?        S    Jan15  16:10 [jbd2/dm-0-8]15:05
corvusdmsimard: the kernel15:05
dmsimardit's holding dm-0 which is the block device for main-logs15:05
dmsimardlet me try and understand what that process is and if we can kill it safely ..15:05
pabelangeroh, I see a kernel panic in dmesg too15:06
mordredEXCELLENT15:06
pabelanger[Thu Jan 25 13:34:48 2018] jbd2/dm-0-8     D ffff88020f893ac0     0 26026      2 0x0000000015:06
dmsimardhung tasks15:06
dmsimardlive migration ?15:06
dmsimardhttps://bugzilla.redhat.com/show_bug.cgi?id=886030 "Unmounting an ext3 filesystem result in a stuck jbd2 process that holds the device in use and prevents the same filesystem to be checked (fsck) or mounted again."15:07
openstackbugzilla.redhat.com bug 886030 in kernel "Device in use by a jbd2 process after unmounting an ext3 filesystem" [Unspecified,Closed: duplicate] - Assigned to kernel-maint15:07
*** cmurphy has joined #openstack-infra-incident15:07
dmsimardhttps://bugzilla.redhat.com/show_bug.cgi?id=851970 systemd private temp ?15:07
openstackbugzilla.redhat.com bug 851970 in systemd "Unable to umount partition as long there are running services that have PrivateTmp set that where started after the partition mount" [High,Closed: currentrelease] - Assigned to mschmidt15:07
dmsimardbut there's nothing in /tmp15:07
dmsimardthis is 14.04 so not systemd15:08
pabelangerare we able to unload / load that kernel module again or does this mean we need a reboot?15:09
dmsimardprobably can't reload that live, it's likely in use by the other mounts15:09
Shrewsdmsimard: i think this is the fix for the launcher: remote:   https://review.openstack.org/537932 Handle missing request during a decline.15:10
pabelangerokay, I need to grab a coffee real quick. back in a moment15:10
dmsimardShrews: anything to address it in the meantime ? a restart of nodepool-launcher or something ?15:11
dmsimardShrews: if you want to live test it, nl02 is already in emergency file15:11
Shrewsi think we should restart the launcher on nl0215:11
Shrewsi can handle that15:11
Shrewswe can't really test it live b/c the restart will clear it out15:12
Shrewsnodepool-launcher restarted on nl0215:12
pabelangerback15:13
dmsimardmordred: comment on https://review.openstack.org/#/c/537929/115:13
Shrewscitycloud seems to be at quota too15:14
Shrewsoh, that's disabled too15:14
dmsimardpabelanger: the thing is I'm not sure a reboot would even work -- it's likely a process that isn't killable due to i/o block operation so we'd need a hard reboot ? If we're going to reboot might as well try to kill it first, it won't make matters worse than a failed reboot/hard reboot15:15
corvusi would just reboot15:15
corvusi would not recommend trying to kill a kernel process15:15
corvusrun reboot and if it fails, use the console to reset15:15
pabelangerright, we maybe also want to remove the entry from /etc/fstab so we don't mount it on first boot?15:15
dmsimardcorvus: kernel i/o block processes typically can't be killed so we'll need a hard reboot15:16
dmsimardyeah15:16
pabelangerokay, let me edit /etc/fstab first15:16
dmsimarddo we have a root password for the console in case we need it ?15:16
corvusdmsimard: sure, just don't do that first.  :)  reboot, then 'reset' which is what you're calling a hard reboot.15:16
pabelangerdmsimard: no root password, we'd need to use emergency console I think15:17
dmsimardcorvus: reset is equivalent to init 0 ? I've never actually used reset before but what I was thinking is a nova reboot --hard15:17
dmsimardcorvus: which is the equivalent of a reset button on the chassis :p15:17
corvusdmsimard: now we're on the same page.15:17
dmsimardcorvus: oh wait we're saying the same thing.. sorry I'm slow this morning15:17
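So the plan, roughly: a clean reboot from inside the guest, with a hard reset via the API as the fallback if the hung jbd2 task wedges the shutdown (server name is an assumption):

    reboot                                            # from the static.o.o session
    # if it hangs, hard-reset from a cloud client, e.g.:
    openstack server reboot --hard static.openstack.org
    # (equivalently: nova reboot --hard static.openstack.org)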
mordreddmsimard: thanks. you're totally right15:17
pabelangerokay, I've commented out /srv/static/logs in /etc/fstab now15:17
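A sketch of the fstab change, so a clean boot comes up without trying to mount (or auto-fsck) the damaged volume; the exact line, filesystem type and options are assumptions based on the device and mountpoint seen elsewhere in this log:

    # /etc/fstab -- commented out for the duration of the repair:
    # /dev/mapper/main-logs  /srv/static/logs  ext4  defaults  0  2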
corvusdmsimard: i agree, now would be a good time to set a root password in case something goes wrong15:18
mordreddmsimard: and fixed  https://review.openstack.org/#/c/537929/15:18
mordredcorvus: ++15:18
corvusdmsimard, pabelanger: how about i set the password and add it to the passwords file?15:18
pabelangersure15:18
dmsimardPermitRootLogin is No in /etc/ssh/sshd_config so we're good on that front15:18
pabelangerdmsimard: corvus: will also defer to one of you to do reboot15:19
corvusokay, password set, saved in file, and i used it to 'su' to root, to verify.15:20
dmsimardcorvus: where is that password file ? puppetmaster ?15:21
dmsimardpabelanger: let me open up the console before we try the reboot, hang on15:22
dmsimardit'll be a way to tell if the reboot gets stuck15:22
corvusdmsimard: yes... did no one walk you and frickler through that?15:23
dmsimardcorvus: not familiar with that yet, no -- but we can walk through it later :D15:23
corvusdmsimard: yes we should schedule a new root onboarding session :)15:24
dmsimardok I have console on static0115:24
dmsimardwe good for reboot ?15:24
corvusdmsimard: yes from me15:24
dmsimardpabelanger: you good too ?15:25
pabelangeryes, reboot when ready15:25
dmsimardok attempting a reboot now15:25
dmsimardlost console so this is a good sign15:25
dmsimardit's back15:26
dmsimard$ uptime15:27
dmsimard 15:27:06 up 0 min,  1 user,  load average: 0.27, 0.08, 0.0315:27
dmsimardI started a screen called "fsck", starting the fsck15:27
pabelangergreat15:27
pabelangerdmsimard: be sure to use the right flags for the fsck command15:28
dmsimardfsck -v -y /dev/mapper/main-logs is running15:28
dmsimardI detached from the screen15:28
pabelangerthanks15:28
pabelangerthis will take some time15:29
dmsimardneed some more reviews on https://review.openstack.org/#/c/537929/ and likely to force submit that15:30
dmsimardShrews: nl02 looks good now, thanks15:30
dmsimardShrews: it's probably worth looking into why we're double logging also15:31
corvusi've modified a copy of the log maint script in ~corvus/log_archive_maintenance.sh15:31
corvusi'm proposing we delete logs after 120 minutes15:32
corvusthough it looks like the growth rate is slowing... for some reason?15:32
dmsimardShrews: The DEBUG logging ends up in both nodepool-launcher.log and launcher-debug.log -- we probably either want to take the DEBUG out of nodepool-launcher.log or merge launcher-debug.log into nodepool-launcher.log15:33
pabelangercorvus: 120 mins seems fine to me15:34
dmsimardcorvus: we were at 12% @09:51:43, we're now at 21% @10:33:1215:34
dmsimard(my time)15:34
dmsimard~10% every 30 minutes ?15:34
pabelangerdmsimard: nodepool-launcher should only be INFO, would need to check logging.conf15:34
corvusdmsimard: so we could double retention to 240 minutes and target 80%15:35
dmsimardcorvus: sure, there's also nothing preventing us from looking every now and then and change our mind -- that's just a safety net15:36
corvusokay, i've written a wrapper script that loops over my modified script and runs it every 60 seconds15:38
corvusi've also disabled the normal crontab15:38
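A minimal sketch of what such a wrapper might look like; the real thing is corvus's modified copy of log_archive_maintenance.sh in ~corvus (which also handles gzipping), so only the retention values below come from this discussion:

    #!/bin/bash
    # expire interim logs out of /srv/static/logs every minute until the real volume is back
    while true; do
        find /srv/static/logs -type f -mmin +240 -delete
        find /srv/static/logs -mindepth 1 -type d -empty -delete
        sleep 60
    done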
dmsimardShrews: btw if we want to keep those logs somewhere safe while we iron this out, I gzipped things in /var/log/nodepool on nl0215:38
corvusi'm going to run my looping script in another window of the fsck screen15:39
dmsimardShrews: it compresses surprisingly well15:39
dmsimardcorvus: wfm15:39
pabelangerwe also need to remember to update /etc/fstab15:39
pabelangerI can do that now, if we want15:39
dmsimardlet's keep that for when we need to do the remount/mv dance15:39
dmsimard#status log (dmsimard) fsck started running in a screen on logs.o.o for /dev/mapper/main-logs at around 15:30UTC, logs are being sent straight to /srv/static15:42
openstackstatusdmsimard: finished logging15:42
*** mnaser has joined #openstack-infra-incident15:43
dmsimard#status log (dmsimard) We're running a modified log_archive_maintenance.sh from ~/corvus and https://review.openstack.org/#/c/537929/ as safety nets to keep us from running out of disk space15:43
openstackstatusdmsimard: finished logging15:43
dmsimardinfra-root: if there's no objections I'll submit https://review.openstack.org/#/c/537929/15:44
pabelangerdmsimard: we should be okay for now to let zuul merge it, right?15:45
corvusoh interesting.  we're gzipping the docs-draft builds.15:45
corvusi suspect our normal log processing script is doing that as well, but just slowly enough that we don't notice it15:46
dmsimardpabelanger: if we don't submit through gerrit, it means the patch has to go through both check and gate which could be a few hours15:46
corvusmaybe they'll still work though because of the gzip handling we have on logs.o.o.  i guess something to look out for.15:46
dmsimardcorvus: there's a couple "age" lookups in the find, one for gzip and one for deletion15:46
corvusthe log processing script still does a *lot* of gzipping, so we may end up with a significant reduction in space15:46
pabelangerdmsimard: looks like it might start running shortly, looking at zuul.o.o15:47
corvusdmsimard: yeah, the gzip is low, it's 10 minutes.  basically enough time to make sure that zuul isn't still uploading files15:47
dmsimardpabelanger: there's nothing in the check queue that will tell us if this works or not, it's a trusted playbook15:47
pabelangerright, but some of the pressure is off currently and should be able to merge on its own15:48
pabelangerdmsimard: also, we can remove nodepool-launcher from emergency?15:49
dmsimardpabelanger: maybe, I guess this isn't in the tripleo or integrated gates15:49
pabelangerdmsimard: right, once in gate should get nodes right away15:49
dmsimardpabelanger: yeah, I wound up not touching the max-servers but I was getting ready to15:49
dmsimardpabelanger: I'll remove nl* from emergency (except yours, 03 I think)15:49
corvusthe current run of the gzip/rm script is printing out files.  the next runs will be quiet.15:50
pabelangerdmsimard: no, you can also remove that now. Was going to this morning15:50
dmsimardpabelanger: 04 is commented by you actually15:50
dmsimardpabelanger: ok15:50
pabelangerthanks15:50
dmsimardpabelanger: nl* removed15:50
dmsimardpabelanger: it's a good idea to add comments in the emergency file so we know why things are in there, +1 for that idea15:50
-openstackstatus- NOTICE: logs.openstack.org is stabilized and there should no longer be *new* POST_FAILURE errors. Logs for jobs that ran in the past weeks until earlier today are currently unavailable pending FSCK completion. We're going to temporarily disable *successful* jobs from uploading their logs to reduce strain on our current limited capacity. Thanks for your patience !16:01
fungikeep in mind that /srv/static was already 8% full before this (there are a handful of docroots which lack dedicated volumes), though i agree that it's accumulating faster than the fsck is likely to complete (up to 29% now)16:16
dmsimardpabelanger: logs.o.o responds a bit erratically.. took me a few seconds to get dmesg back16:38
pabelangerload is up16:38
dmsimardlooks like load from... apache ? fsck has some load16:39
dmsimardbut apache is the one causing the load16:39
pabelangersomething hammering server?16:40
dmsimardI'm seeing semrush bot, the crawler I banned from RDO but there's not that many hits16:40
dmsimardthere's a lot of hits like those: 2001:4800:7817:103:be76:4eff:fe04:38d1 - - [25/Jan/2018:16:40:32 +0000] "GET /99/537199/1/gate/trove-scenario-mysql-multi/172fa3c/logs/screen-s-object.txt.gz?level=INFO HTTP/1.1" 200 151 "-" "Python-urllib/2.7"16:40
dmsimard... logstash/elastic-recheck ?16:41
*** openstackstatus has quit IRC16:41
dmsimardyeah those python-urllib are all logstash workers16:41
*** openstackstatus has joined #openstack-infra-incident16:42
*** ChanServ sets mode: +v openstackstatus16:42
dmsimardlots of i/o wait but I'm trying to narrow it down16:43
pabelangergiven how busy the gate is, likely too much I/O for the local HDD to keep up with16:44
dmsimardpabelanger: we're heavily swapping16:44
dmsimardroot@static:/var/log/apache2# free -m16:45
dmsimard             total       used       free     shared    buffers     cached16:45
dmsimardMem:          7979       7755        224          0        117         3116:45
dmsimard-/+ buffers/cache:       7605        37316:45
dmsimardSwap:         7811       4167       364416:45
pabelangerpastebin :)16:45
dmsimardyeah sorry I meant to only paste the swap line16:45
dmsimardthe fsck is eating all the ram16:45
dmsimard.... why ?16:45
pabelangernot sure16:46
dmsimardhttps://serverfault.com/questions/9218/running-out-of-memory-running-fsck-on-large-filesystems "Alternately, there's now an option in e2fsck that'll tell it to store all of it's intermediate results in a directory instead of in RAM, which helps immensely. "16:46
dmsimardIt's buffering the results in ram16:46
dmsimardinfra-root ^16:46
dmsimardIt might explain why the previous FSCKs were so long actually, we were unknowingly heavily swapping16:47
corvusyeah, would be nice to work out how to use that16:48
dmsimardAlthough if you compare swapping to disk vs buffering to disk (instead of ram) the difference is probably not that big.. but it'd make that ram available for apache and.. ssh16:48
pabelangeryah, load is up to 30.0016:48
pabelangerso we should consider doing something16:48
dmsimardI'm going to interrupt the fsck16:49
dmsimardor try to16:49
pabelangerwait16:49
pabelangermaybe just stop apache?16:49
corvusplease don't stop the fsck16:49
dmsimardsorry, too late :/16:49
corvusseriously?16:49
corvusokay this is all you.16:49
*** corvus has left #openstack-infra-incident16:51
pabelangerokay, we need to come up with a plan. To get fsck running again16:52
pabelangerdmsimard: ^16:53
dmsimardfsck is running again with scratch files in /opt/fsck16:54
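The knob in question lives in /etc/e2fsck.conf; presumably something along these lines was put in place before restarting (directory path taken from the message above):

    mkdir -p /opt/fsck
    printf '[scratch_files]\ndirectory = /opt/fsck\n' > /etc/e2fsck.conf
    fsck -v -y /dev/mapper/main-logs    # restarted inside the existing screen session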
pabelangerokay, next time we should likely wait until everybody is on board with the plan. Can't fix that now, but for the future16:55
mordreddmsimard: awesome16:55
mordreddmsimard: and TIL about fsck scratchfiles16:56
dmsimardyes, I agree that I should have waited for an ack -- I really apologize for not waiting on a consensus before moving forward16:56
dmsimardI've dealt with something like this before, it is familiar territory but I won't let that happen again16:56
mordreddmsimard: and yah - the performance of the fsck isn't likely to be  much different than it would via just swapping - but the  memory pressure competition would be much better16:57
mordredhopefully16:57
dmsimardI have a watch du -sh running inside a window of the screen16:58
dmsimardfor the /opt/fsck directory16:58
dmsimardWe have 66GB available in /opt, it should be largely sufficient16:58
mordreddmsimard: if it's not sufficient I'm going to be very sad :)16:58
pabelangerokay, which flags did we use for fsck now?17:04
pabelangerwant to better understand the change made17:05
mordreddmsimard: yah. agree. also - should we consider putting an /etc/e2fsck.conf on files01 with that setting in it?17:08
dmsimardmordred: It's probably worth putting in puppet for servers where we have large volumes17:11
dmsimardpabelanger: the change was to configure e2fsck to send scratch files on disk, it defaults to RAM which is faster but is not appropriate for large volumes or block devices17:12
dmsimardmordred: I'll send a patch now17:12
pabelangeryah, puppet change sounds great17:14
dmsimardpabelanger: Do we have any other servers that have very large partitions ?17:15
pabelangerdmsimard: openstack volume list would be a way to check17:22
pabelangerstatic.o.o is the largest17:22
mordredI don't think we have anything else that's crazypants like static17:22
dmsimardok let's enable it for this one for now with the ability to enable it elsewhere on a need basis17:23
mordreddmsimard: ++17:30
*** panda is now known as panda|bbl17:33
*** weshay|ruck is now known as weshay|ruck|brb18:09
*** weshay|ruck|brb is now known as weshay18:27
*** myoung is now known as myoung|biab18:58
*** myoung|biab is now known as myoung19:20
*** myoung is now known as myoung|cheesebur19:34
*** myoung|cheesebur is now known as myoung|food19:34
*** myoung|food is now known as myoung20:02
dmsimardinfra-root: Status update... We're hovering around 55% logs and we've started deleting things with a 240m timeout, we might want to increase that to try and delete less things. Thoughts ?20:18
dmsimardmordred: also I suspect we might have been mistaken on the performance impact of on-disk scratch files, we're still not in phase 2 of the fsck going on 3 hours.. we had reached phase 2 within ~1h30 before restarting20:20
dmsimardAlso, the e2fsck man mentions that we can send a SIGUSR1 to the process to display a progress bar (and  SIGUSR2 to stop displaying it)20:22
dmsimardSIGUSR1    This signal causes e2fsck to start displaying a completion bar or emitting progress information. (See discussion of the -C option.)20:22
dmsimardShould we enable that to see how we're doing ?20:22
pabelangerdmsimard: how long is it taking to delete stuff?20:26
dmsimardpabelanger: not sure, this is the output we get in the screen http://paste.openstack.org/raw/653497/20:27
dmsimardI guess anywhere between near instantaneous and 20 minutes ?20:27
pabelangeryah, lets give it a little longer, since we are at 55%20:28
pabelangerI've never tried SIGUSR1 before with fsck, maybe others know20:29
pabelangerbut, not being in phase2 yet is troublesome20:29
dmsimardI raised it to 300 up from 24020:30
dmsimardthe server is swapping just a tiny bit but is otherwise healthy20:31
dmsimardwithout scratch files we were 4GB into swap20:31
dmsimardReally the worst timing for this to be happening at all, the other times we were lucky it happened during the night or the weekend and we just left fsck running with logs.o.o offline..20:33
dmsimardpabelanger: I have to step away for a few minutes but I don't think sigusr1 could help us, it seems to imply that the -C <file> argument was already specified.. I might try to test it on a local server to see if it works20:34
dmsimardhttps://www.youtube.com/watch?v=cS4eztFtS0U makes it look like it's possible, I'll test and confirm20:36
dmsimardbrb20:37
*** rlandy is now known as rlandy|brb20:47
*** myoung is now known as myoung|pto20:53
dmsimardok, testing this locally on a ubuntu 14.04 vm..20:59
*** rlandy|brb is now known as rlandy21:12
ianwthe progress options to fsck do work, i've used them before21:17
dmsimardianw: do you have the usage handy ? I just don't want to kill the ongoing fsck mistakenly21:18
dmsimardalmost done trying to test it locally..21:19
dmsimardyeah I'm not able to reproduce locally, I don't have a big enough disk that gives me enough time to even try it.. fsck completes near instantaneously :/21:21
ianwi'm 99% sure i've sent it sigusr1 before21:21
ianwdon't worry, you only start it once without the -C , next time you always remember :)21:21
dmsimardianw: we didn't see any -C in the history :/ http://paste.openstack.org/raw/653419/21:22
ianwanyway, send it to the fsck.ext4 process, not the parent process, i think that got us once before21:24
dmsimardianw: so we have http://paste.openstack.org/show/653506/ and what we want is... kill -SIGUSR1 27665 ?21:27
dmsimardUsing a kill command to do this kind of stuff makes me paranoid, whoever thought that was a good idea...21:28
ianwdmsimard: yep21:29
dmsimardianw: ok, trying now.21:29
dmsimardthat worked21:29
dmsimardwe're at 48% of phase 1.. ouch21:29
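For the record, the progress toggle ianw describes (send the signal to the fsck.ext4 child doing the work, not the fsck wrapper):

    pgrep -a fsck.ext4        # find the child process
    kill -USR1 <pid>          # start printing completion/progress information
    kill -USR2 <pid>          # stop printing it again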
ianwthat said, it doesn't really give you a timeline21:30
ianwif you look back through the logs, fungi and i had to do it a little while ago?21:30
dmsimardianw: so to give you some context21:30
ianwi was watching it pretty closely, so you could probably correlate the start/finish from the irc logs21:30
dmsimardianw: today is release day and it was kind of a downer to keep logs.o.o down for like 8 hours -- so we're currently sending logs to /srv/static directly and we'll want to do a mv/remount dance once the fsck has finished21:31
ianwyep, last time i just bind mounted /srv/static somewhere and rsync'd the intermediate logs back in21:32
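ianw's bind-mount trick, sketched out for after the fsck finishes: a bind mount of /srv/static does not carry submounts, so the interim logs written to the parent filesystem stay reachable even once the repaired volume is mounted back over /srv/static/logs (paths assumed from earlier in this log):

    mount /dev/mapper/main-logs /srv/static/logs      # after fsck completes and fstab is restored
    mkdir -p /mnt/static-parent
    mount --bind /srv/static /mnt/static-parent       # parent filesystem, without the submount
    rsync -a /mnt/static-parent/logs/ /srv/static/logs/
    rm -rf /mnt/static-parent/logs/*                  # reclaim space on the root volume
    umount /mnt/static-parent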
dmsimardianw: We realized when running the usual logs.o.o workload side by side with the fsck that it was going to be difficult because the fsck process was eating all the ram (swapping >4GB and causing log uploads to keep failing)21:33
ianwthat is definitely a new one, i can't remember that happening any of the times we've had to recover before21:34
dmsimardianw: so I had fsck send the scratch files to disk instead of ram -- the patch for it is here https://review.openstack.org/#/q/topic:e2fsck21:34
dmsimardI jumped the gun on that one, there might've been another avenue but I had already interrupted the fsck21:35
ianwprobably worth understanding if that's an artefact of certain types of corruption that we just happened to hit, or something else21:37
ianwpublic holiday here and duty calls, good luck :)21:37
dmsimardo/21:38
dmsimardFor the record, there seems to be on average between 5GB and 6GB of scratch files per batch, ex: http://paste.openstack.org/show/653511/21:56
dmsimardIf we were working with 16GB of RAM instead of our existing 8GB, we might be able to work without writing to disk21:56
dmsimardWe're currently at 51% of phase 1 :(21:57
pabelangerthat's about 4 hours in right?22:01
pabelangerdmsimard: mordred: do we want to discuss switching back to original method again? 4 hours into fsck and 51% phase1, feels like we are in for a long haul22:03
dmsimard5 hours in22:04
pabelangerlogs look stable at 50% /srv/static22:04
dmsimardpabelanger: yeah I bumped the timer again22:04
dmsimardpabelanger: can't deny it's going very slowly but the flipside is that it allows jobs to complete ? If we go back to RAM we'll start getting load and failures again -- an alternative would be to resize to 16GB RAM and only then resume it back in ram22:06
pabelangerwell, we've done this before while jobs have been running. Mind you, not during the release week22:07
dmsimardyeah the gate is running at max capacity :(22:07
pabelangerif we go back to normal fsck, we could consider reducing number of jobs running at once in nodepool. In an effort to keep things moving and bring volume back online22:08
pabelangerinfra-root: ^ do you have any thoughts?22:08
dmsimardGetting any post failures would yield gate resets and at 100 patches in the gate that's crazy expensive, I would do everything we can to avoid that22:11
mordredpabelanger: reading22:12
dmsimard5 hours in at 52% of phase 1 fsck is beyond awful.. I would consider bumping ram to 16GB and resuming without on disk scratch files but I don't have any other ideas22:14
pabelangerwell, we are swapping now: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=143&rra_id=all22:21
pabelangerslowly growing22:21
pabelangerso, I feel we might just be delaying the unhappiness of swapping22:22
pabelangerI'm unsure about resizing logs.o.o either, until others are okay with it.22:23
dmsimardOk I'm going to start a pad to summarize our options22:24
pabelangerok22:27
dmsimardworking on https://etherpad.openstack.org/p/2tAC4qI2eR22:29
dmsimardmordred: do you have an opinion ?22:40
clarkbisn't option 5 'don't fsck'?22:41
dmsimardclarkb: we admittedly didn't test if the volume was mountable in the first place22:42
clarkbits data with a retention period of 4 weeks22:43
clarkbif we lose data I dont think its a huge deal as long as fs is functional22:43
clarkbbut unsure how risky that is22:43
dmsimardclarkb: right, I know we don't care that much about the data but it might not let you mount it at all22:43
dmsimardBut I agree it's worth considering22:43
clarkbok I have to afk again, just wanted to call that out22:44
pabelangerI'm not sure how else to check it, outside of fsck22:45
*** corvus has joined #openstack-infra-incident22:54
dmsimard56% of phase 1.. but most jobs are green when looking at the gate in http://zuul.openstack.org/builds.html .. we're increasingly swapping but load is stable -- it's like there's no good solution to this problem but we're probably slowly running into a wall if we don't do anything22:56
dmsimarddamned if you do, damned if you don't ?22:56
dmsimardThere's 5 phases to go through, it's unrealistic to keep going at this rate22:57
dmsimardpabelanger: how would you want to reduce the pressure ? reduce max-servers ?22:57
dmsimardoh crap, hang on22:58
*** rlandy is now known as rlandy|biab23:26
dmsimardfsck 65% phase 123:52
mnaserfsck progress is weirdly non linear23:55
