Friday, 2017-03-31

-openstackstatus- NOTICE: Jobs in gate are failing with POST_FAILURE. Infra roots are investigating07:43
*** ChanServ changes topic to "Jobs in gate are failing with POST_FAILURE. Infra roots are investigating"07:43
-openstackstatus- NOTICE: logs.openstack.org has corrupted disks, it's being repaired. Please avoid rechecking until this is fixed08:23
*** ChanServ changes topic to "logs.openstack.org has corrupted disks, it's being repaired. Please avoid rechecking until this is fixed"08:23
mordredfungi: JOY13:09
fungioh! my ssh retries have finally paid off13:09
fungithe server is sort of responsive again13:10
mordredoh good13:10
fungithough free -m is still taking a while13:10
mordredfungi: should we stop zuul to stop things from trying to upload logs?13:10
funginot sure yet13:11
mordredor maybe pause all the nodepool providers instead, so that zuul still grabs the event queues13:11
* mordred stands by13:11
fungiyeah, free confirms all 8gb ram and about 4gb out of 8gb of swap in use13:15
fungino find commands in the process table at least13:15
*** yolanda has joined #openstack-infra-incident13:15
yolandahi fungi, so you are going to restart the server?13:16
fungii'm trying to see if i can get the top memory offender processes13:16
fungiyolanda: no, since it seems to sort of be working again13:16
yolandawe went with that this morning, but it was causing us to lose access to all the other static sites, so we switched to a manual fsck13:16
yolandafungi, i got a POST failure now, even with log volume not being mounted. I wonder if the logs that are written to the root volume are exhausting it?13:17
fungii'm going to put it in the emergency list on the puppetmaster to start13:18
fungiyeah, so i think the apache wsgi for the logs site13:18
fungiworking on disabling that site in apache13:20
fungitons of very high memory using apache processes in the table, so probably os-loganalyze killing us here13:22
fungithe fsck itself is using about 6gb ram and we need every bit we can muster13:22
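A quick way to surface the top memory offenders fungi is hunting for; a minimal sketch (the column choice is just one reasonable option):

    # fifteen largest processes by resident memory
    ps aux --sort=-rss | head -n 15
    # overall memory and swap pressure
    free -m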
yolandafungi, i wonder if we could split those volumes somehow13:26
yolandaif having such a big volume is a problem13:26
*** AJaeger_ is now known as AJaeger13:28
fungiokay, i disabled that vhost but also stopped apache temporarily13:31
fungimemory pressure has reduced _greatly_13:32
fungiand system load has dropped from over 100 to around 213:32
fungistarting apache again now with the logs vhost still out of the config set13:33
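Disabling a single vhost while leaving the rest of apache alone looks roughly like this on Ubuntu; the site name below is a guess, not the real config name on the server:

    sudo a2dissite 50-logs.openstack.org   # hypothetical vhost name for logs.openstack.org
    sudo service apache2 stop              # stop entirely to relieve memory pressure
    sudo service apache2 start             # bring it back with the logs vhost still disabled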
yolandaapache killing us!13:33
fungiyeah, os-loganalyze is a resource hog, but that's not normally a problem when it's the only thing really running13:34
fungibut when we need most of the available ram for fsck, it gets ugly13:35
funginext step, we're probably going to need to ditch the logs we've been spooling to /srv/static while /srv/static/logs is offline13:38
fungi/srv/static is almost full now13:38
fungithis is why i've usually just intentionally broken log uploading in the past since we don't have enough room to buffer them for the duration of the fsck13:40
fungiany opinions? should we try moving them into /opt temporarily instead?13:40
yolandasometimes when i was hitting that, i was just letting logs upload on the root volume, then copying them to the right one. But yes, it looks like disk space is a risk now. It never took that long for me to recover from an fsck13:41
fungiwell, we probably have 30-60 minutes before it fills up depending on how quickly activity ramps up13:41
fungii'm going to take the opportunity for a quick break to finish my morning routine and ingest some sustenance13:42
fungii think puppet must have reenabled the logs site before it noticed the server is in emergency disabled state14:07
fungiworking to undo it again14:07
fungiif my shell prompt ever comes back14:08
*** jeblair has joined #openstack-infra-incident14:25
fungistill waiting for my shell to return to responsiveness14:26
*** EmilienM has joined #openstack-infra-incident14:29
*** jroll has quit IRC14:32
*** jroll has joined #openstack-infra-incident14:34
*** yolanda has quit IRC14:43
fungiand still waiting. it's been stuck waiting to return the results of a df -h for over 35 minutes now14:43
mordredfungi: wow14:45
fungionce i regain control of that shell, sudo service apache2 stop is happening14:46
*** bnemec has joined #openstack-infra-incident14:50
jeblairsomehow snmp is still getting data14:51
fungiyeah, starting to wonder if df was hanging because of disk i/o so i just spammed a bunch of ctrl-c to see if it would stop trying15:02
fungihasn't yet though15:07
fungiokay, it just came back and i stopped apache for a moment15:16
funginext highest memory user besides fsck.ext4 is rsyslogd15:17
fungiroughly 25 sshd processes eating up quite a bit though15:17
fungiand we're at about 90% used on /srv/static now (88g out of 99g)15:18
jeblairmaybe about 2/3 jenkins ssh. 1/3 us.15:18
fungii've started apache back up again and am going to watch it to see if i can tell if it's really the culprit15:19
jeblairfungi: with or without the logs vhost?15:19
fungiturns out the logs.o.o vhost hadn't gotten reenabled15:19
jeblairoh huh15:19
fungiso i don't think it was responsible for the second load blow-up (which was more sustained but also the load average didn't climb as high)15:19
fungiright now 5-minute load has dropped below 2.0 even with apache back up (sans logs.o.o vhost still)15:20
jeblairfungi: maybe that was just fsck?15:22
fungientirely possible, though i see a bunch of 00-header and sshd and xe-update-something spiking way up in the load now15:28
jeblairit's going unresponsive for me15:29
fungitrying to get a better idea of what all this is cropping up in the process list suddenly (and why)15:29
fungiyeah, my ps is hanging again15:29
jeblairfungi: i think 'xe-' is xen something15:30
fungiyeah, i saw several of these pop up relatively high on cpu percentage in top just as the system load began climbing again15:31
*** yolanda has joined #openstack-infra-incident15:32
mordredsuch a fan of in-guest utilities doing things15:32
fungixe-update-guest-attrs: Tool for writing arbitrary values to the Xen Store15:33
fungiWrites arbitrary values to the Xen Store for hypervisor communication.15:33
jeblairthough to be fair, bash and sshd processes had a higher cpu percentage than even fsck, likely because they were stuck on io.15:33
fungiright, those are likely popping up due to high iowait15:33
fungimy terminay just got a byte in edgewise15:35
fungier, terminal15:35
fungifsck is at 84.8% complete at this point, btw (in case i'm the only one with a working screen -x there)15:36
jeblairthat seems about what it was at when i looked 15 minutes ago15:37
yolandait has only progressed about 10% over the last few hours15:38
jeblair85.2% now15:41
*** clarkb has joined #openstack-infra-incident15:45
clarkbfungi: so for the spool disk you've just got the logs fs unmounted so the "base" fs is covering the deeper repos?15:49
clarkbs/repos/dirs/15:49
fungiclarkb: yeah, normally i've disabled the jenkins ssh key to prevent log uploads (breaks jobs for the duration of course), but this time the volume was just unmounted and logs have been let to accumulate in the parent directory's filesystem15:51
fungii'm still not sure how we're going to go about cleanly merging them into the logs volume (rsync?), and we're likely to run out of room to keep doing this pretty soon anyway15:52
clarkbya rsync is what we use for the copy and mount back in old location dance for test instances. Granted this is a fair bit more data15:53
fungihanging for me again. it just began to spike back up on load and i tried to catch what/where these run-parts scripts are15:58
jeblairfungi: /etc/update-motd.d?16:09
jeblairhttp://manpages.ubuntu.com/manpages/xenial/man5/update-motd.5.html16:10
jeblairExecutable scripts in /etc/update-motd.d/* are executed by pam_motd(8) as the root user at each login, and this information is concatenated in /var/run/motd. The order of script execution is determined by the run-parts(8) --lsbsysinit option (basically alphabetical order, with a few caveats).16:10
jeblairfungi: so every time jenkins logs in, that runs16:10
fungiwow16:10
fungiyep, those are the scripts i was seeing the names of popping up over and over16:11
fungiand also explains all the lsb_release calls i was seeing16:11
jeblairone of the scripts tries to see which filesystems need fscking, which it occurs to me could be kinda slow right now.  though i think it only looks at mounted filesystems.16:12
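To see which scripts pam_motd would run at each login, without actually running them, run-parts has a dry-run mode; a minimal sketch:

    # list the update-motd scripts in the order run-parts would execute them
    run-parts --lsbsysinit --test /etc/update-motd.d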
fungii may temporarily purge the update-motd package when i get control of my terminal again and see if that helps at all16:13
jeblairfungi: or just move the scripts out of the way?16:14
fungior just move the scripts out of that dir temporarily16:14
fungiheh, yeah, that16:14
fungii've pushed that through my tty, so when my shell wakes back up those should temporarily move to my homedir16:18
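The move-aside fungi queued up amounts to something like this; the holding directory here is illustrative (he used his homedir):

    sudo mkdir -p /root/update-motd.d.disabled
    sudo mv /etc/update-motd.d/* /root/update-motd.d.disabled/
    # to restore later:
    # sudo mv /root/update-motd.d.disabled/* /etc/update-motd.d/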
fungithere it goes! (well, at least the shell prompt updated, the mv hasn't returned just yet)16:34
fungiand now it's applied16:36
fungimaybe that'll cut down a little bit of the performance degradation at each iowait spike16:37
clarkbfor a second I thought you were going to say fsck was done16:40
jeblairerm, something is going on with the screen session that i don't understand.16:46
fungias far as how it's updating?16:48
fungialso 5-minute load average is back up over 40 again16:49
fungitons of defunct sshd processes16:49
jeblairi guess it was something with my terminal.  i've dropped and logged in again.16:50
fungiokay, most of the defunct sshds have cleared again16:50
jeblair87.4%16:50
fungiand system load seems to be dropping again16:50
*** pabelanger has joined #openstack-infra-incident16:50
fungiand now it's freezing on me again. seems to just be all over the place16:51
jeblairhas anyone done the math yet?  at current rate it looks to me like 6.7 hours till fsck is finished?17:05
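The 6.7 hour guess is consistent with the progress figures earlier in the log (85.2% at 15:41, 87.4% at 16:50, so roughly 2% per hour); a back-of-the-envelope check, assuming the rate stays constant:

    # remaining percent divided by the observed percent-per-hour over that 69-minute window
    echo '(100 - 87.4) / ((87.4 - 85.2) / (69/60))' | bc -l
    # => about 6.6 hours to go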
pabelangerdoes it make sense to disable log publishing from jobs themselves?17:22
jeblairfungi: it looks like /srv/static is at 100%17:24
fungiyep. should we relocate the current content into /opt? or just delete?17:25
jeblairfungi: opt isn't big enough17:25
jeblairfungi: what about docs-draft?17:25
fungiahh, yeah, 68gb free17:25
fungioh, yep docs-draft has lots free17:26
fungii can get started doing that now17:26
fungionce the terminal is responding again17:26
jeblairthat's going to be a problem :|17:26
fungionce my shell wakes up, it will create a dir called /srv/static/docs-draft/logs-tmp and mv /srv/static/logs/* into there17:29
fungihopefully the extra i/o doesn't just make matters worse17:29
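The relocation fungi describes would be something along these lines; a sketch only, since the exact invocation isn't shown in the log:

    sudo mkdir -p /srv/static/docs-draft/logs-tmp
    sudo mv /srv/static/logs/* /srv/static/docs-draft/logs-tmp/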
clarkbdo we know what caused the underlying volume issues? was that scheduled rax maintenance?17:33
fungiyes, the scheduled maintenance17:37
fungihours of iscsi connectivity outage for one of the cinder volumes17:38
fungithe mv is (ostensibly) under way now17:42
fungiwatch us get to 99% complete and then fanatical support "helpfully" reboots the instance for us because they see it heavily swapping17:44
clarkb(thinking out loud here) if the osuosl stuff works out, maybe we want to carve off some of that to run a ceph cluster and host logs there?17:46
clarkbor even just load up a node with 10TB drives and lvm it like we do today17:46
fungii still expect moving it to afs without a replica and sharding the tree across different servers would work17:47
clarkbya, we'd just need to figure out a sharding scheme that scales as we go (whereas with ceph the theory at least is you just keep adding disks and the underlying object store nature of ceph figures that out for you)17:49
clarkbyou'd also get replication with ceph17:49
jeblairi think either would be an improvement :)17:49
clarkbyes indeed. Just trying to think beyond the "must run on cloud" constraint and if that helps anything17:52
clarkbbuild a few backblaze type fileservers, call it a day17:53
clarkb(of course then the problem becomes that when a drive fails we have to drive to corvallis and replace it)17:53
jeblairthat's a lovely drive for me!17:54
fungiscenic, at least?17:55
jeblairyes, past vacaville17:57
clarkbmy drive is mostly flat and full of cows17:57
clarkbbut its not too long and I've done it before17:57
fungivaca is latin for cow17:58
clarkbjeblair: side roading around the rogue river is awesome17:58
clarkbespecially in late spring before it gets too hot17:58
*** tbarron has joined #openstack-infra-incident17:58
fungiouch, my mv is getting plenty of "cannot create hard link" failures out of kolla-kubernetes-deploy job logs18:00
fungihopefully they're in the minority18:00
jeblairfungi: i think the throughput has slowed considerably :/18:44
fungijeblair: i suspect that's because of the mv which is still proceeding18:44
jeblairfungi: do you have a read on available space in /srv/static?18:45
fungino, we don't seem to be tracking that in cacti that i could tell18:45
jeblairoh i meant progress on the mv18:45
fungii'll see if the screen session is responsive enough to get me another window and run df18:45
fungii figured i could infer the mv progress from cacti if we had been tracking it there18:46
fungioh, though maybe i can figure it out from the docs-draft in cacti18:46
fungilooks like we're maybe 60% complete18:48
jeblaircool, so we're at least able to keep above water18:48
fungier, sorry, 60% to go, probably 40% complete18:49
pabelangerquestion, is there any volume we could move to files.o.o? which would leave us with a spare volume on static.o.o for maintenance rotation moving forward?18:49
fungihard to be sure since there are some gaps in the cacti responses18:49
fungipabelanger: when i looked before, adding all the lvms other than the one for logs together came out to less than one cinder volume18:50
fungioh, except for docs-draft. moving that might do it18:50
fungii forgot i had ruled it out because of basically also being logs18:50
clarkbwe may consider it just for simplicity of management18:51
jeblairpabelanger: by files.o.o you mean afs?18:51
pabelangerjeblair: not in this cause, just adding another cinder volume to the server. As I understand it, we are at a hard limit for attaching another volume on static.o.o?18:52
pabelangercase*18:52
jeblairpabelanger: well, files.o.o is principally an afs server18:52
fungidf -h says /srv/static is 11% free at the moment18:52
jeblairi don't think we should start attaching cinder volumes to what is designed to be a load-balanced web head.18:53
pabelangerjeblair: okay. I wasn't sure which other server would make sense to migrate volumes from static.o.o to18:53
fungiprobably a new server18:53
pabelangerack18:54
pabelangerstatic02?18:54
fungimaybe even a server in vexxhost? ;)18:54
fungii hear they have some awesome storage18:54
jeblairyeah.  either a new server, or afs.  doc-draft might be a nice smaller test case for that.18:54
fungihow about a new server _and_ afs?18:54
jeblair(or ceph)18:55
fungior ceph-backed afs ;)18:55
jeblairany of those 3 options gives us a volume slot to pvmove in the future18:55
fungiyup18:55
jeblair(though, i do wonder about our ability to actually execute a pvmove of that scale while running and still pruning directories)18:56
fungimoving some of the other stuff from static.o.o to files01 (by way of afs) probably does make sense though... tarballs, static content for sites like governance, specs and security...18:56
pabelangerwe've talked about optimizing our directory structure, as everybody is likely aware, to help with pruning18:57
jeblairmost of those could be done like docs and get RO redundancy (not tarballs probably)18:57
fungijeblair: probably to pvmove a logs.o.o volume i'd kill/disable the pruning and compression cronjob, then do the pvmove at a time when activity is lower (e.g., a weekend)18:57
jeblairya18:58
fungimight still be a little slowish, but would probably work out fine18:58
clarkbI wonder if ceph backed volumes need to take as many downtimes18:58
clarkb(thinking of vexxhost option)18:58
pabelangerspeaking of pmove, reviews welcome: https://review.openstack.org/#/c/449232/18:58
clarkbin theory since you have at least one replica for all ceph things you just replace as you go18:58
fungipvmove is how i originally replaced all our 0.5tb cinder volumes with 1tb volumes on static.o.o18:58
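For context, the volume-swap dance fungi is referring to is the standard LVM one; a sketch with made-up device names and volume group name:

    sudo pvcreate /dev/xvdf            # the newly attached, larger cinder volume
    sudo vgextend main /dev/xvdf       # add it to the (hypothetical) "main" volume group
    sudo pvmove /dev/xvde /dev/xvdf    # migrate extents off the old volume while still mounted
    sudo vgreduce main /dev/xvde       # then drop the old volume from the group
    sudo pvremove /dev/xvde            # and release it so it can be detached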
clarkband then we wouldn't need to worry on our end18:59
pabelangerdocs I followed for eavesdrop.o.o18:59
fungiclarkb: mnaser was talking to me about that actually, said they do seamless/uninterrupted maintenance on their ceph storage18:59
clarkbfungi: ya thats what I expected, its one of the reasons why ceph is so shiny to ops19:00
clarkbthe cost is lower IO throughput but seems well worth the cost19:00
fungipabelanger: yeah, irc logs might also make sense in afs, though they're not really that huge anyway (11gb currently in our meetbot volume)19:01
jeblairis vexxhost offering enough space for logs?19:01
fungijeblair: i didn't mention a number to him, but i wouldn't be surprised if that was a fine amount19:01
clarkbalso we may be able to get even more disk than we currently have there19:02
jeblair(because all of these ideas are good and doable, but take engineering time; a straight move to vexxhost would be simpler)19:02
clarkb+++++++++++++++19:02
clarkbthat could be an excellent intermediate stopgap19:03
jeblair"rsync and check back in 3 weeks :)"19:03
clarkbor even just put a line in the sand19:03
clarkband document how to modify urls by hand if you need old things19:03
fungii'm asking him the hypothetical with actual numbers right now19:03
clarkb(and in the mean time double copy)19:03
jeblairi'm pretty sure we can have mod_rewrite do the transition for us seamlessly19:04
fungiif we were already sharding by date, we could just do apache redirects until the old server content ages out on its own19:04
jeblair(if file not exist on new server, proxy from old server)19:04
clarkbjeblair: good point19:04
fungioh, or maybe even if we aren't i guess19:04
fungias you described there19:04
clarkband then if old server gives 404 it really doesn't exist. I like that its simple19:04
pabelangerthat would be cool actually19:05
jeblairyeah, rewritecond has a "-f is regular file" test19:05
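The "serve locally if it exists, otherwise proxy to the old server" idea maps onto mod_rewrite roughly as below; this is an apache config fragment rather than a shell command, it would live inside the logs vhost on the new server, the old-server hostname is made up, and mod_rewrite plus mod_proxy_http would need to be enabled:

    RewriteEngine On
    # if the requested path is not a regular file or directory here...
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
    # ...proxy the request through to the old logs server
    RewriteRule ^/(.*)$ http://old-logs.openstack.org/$1 [P,L]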
fungihe said shouldn't be a problem (we do need to update the infra donors list to include vexxhost with planet.o.o going in there anyway)19:05
fungialso he says 12tb could be done as a single block storage volume in their environment, no problem19:06
jeblairnice :)19:06
clarkbwould they be comfortable with doing more?19:06
clarkbbecause thats the other scaling issue we face right now19:06
clarkb(its not a regression if not so not deal breaker)19:07
jeblairwell, i'm hoping with zuulv3 we can start scaling back down again.  once we can start putting caps on how much logs jobs store.19:07
clarkbthe "looming" thing I worry about is with containers becoming a popular packaging format it means that packages are now 300MB large instead of 3MB and that adds up fast19:07
clarkbbut yes, if we can rein that in it becomes less of a worry19:07
pabelangercontainer problems19:08
jeblairfungi: er, did fsck stop?19:08
jeblairdogs: |=================================================       - 87.2%19:09
jeblairUnconnected directory inode 222462491 (/46/437546/4/experimental/gate-tripleo-ci19:09
jeblairlogs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)19:09
jeblairwhat's "-a"?19:09
clarkbmy man page doesn't say19:09
jeblairautomatic repair19:11
fungiclarkb: i expect they would, i think what we should do though is get them on the donors page as soon as planet01.o.o is running in production, start working on a transition of logs.o.o to there, and see what kind of quota they're happy with pushing that up to. even if we start at just enough quota for what we're storing now, i think it's an improvement19:11
fungijeblair: yikes19:12
clarkbfungi: that makes sense19:12
fungijeblair: argh, so we weren't running with -y?19:12
pabelangeris the redirect rule on our current logs.o.o server, or the new one?19:12
jeblairfungi: before we start again...19:12
pabelangerwhich way would we proxy things?19:12
clarkbpabelanger: new one19:12
jeblairfungi: maybe we should take a moment and think if there's anything we can do to improve the situation while fsck isn't killing the system19:13
clarkbpabelanger: because over time traffic to the old one would die off19:13
pabelangerclarkb: ya, going to play with rules. since I've already spent a few hours on it today19:13
pabelangerclarkb: ya, make sense19:13
jeblairfungi: maybe finish that mv for one...  then maybe symlink logs into the docs-draft volume?19:13
fungijeblair: yeah, i'm looking at the fsck.ext4 manpage for ideas19:14
fungijeblair: also, sure the symlink sounds like a swell idea19:14
fungijeblair: i was reading earlier about setting fsck.ext4 to use a scratch space on disk instead of ram19:15
fungithough that might be just as bad (or worse?) than swapping19:15
jeblairi need to grab lunch, biab.19:17
fungil19:17
fungier, k19:17
clarkbfungi: ya I'd worry about that especially since we aren't OOMing right?19:17
fungi`man e2fsck.conf` has some interesting options19:17
clarkbso worst case we hit disk with swap anyways or we get speed of memory19:18
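The on-disk scratch space fungi mentions is configured via a [scratch_files] stanza in e2fsck.conf; a minimal sketch, with the scratch directory path being illustrative:

    sudo mkdir -p /opt/e2fsck-scratch
    # tell e2fsck to keep some of its large in-memory structures on disk instead of in RAM
    printf '[scratch_files]\n\tdirectory = /opt/e2fsck-scratch\n' | sudo tee -a /etc/e2fsck.conf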
fungiokay, so after reading the e2fsck and e2fsck.conf manpages cover to cover, i don't have any great ideas other than we should just run with -y instead of -a so it will fix what can be fixed and if we lose some files, oh well19:22
clarkbthat works for me,19:23
fungii've pretty much always fsck'd this volume -y anyway because what else am i going to do to any of the corruption options it prompts about besides hit y and hope for the best?19:23
clarkbya19:23
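Re-running the check the way fungi describes would look roughly like this; the device path is a guess:

    # -f force a full check, -y answer yes to every repair prompt, -C 0 show a progress bar
    sudo fsck.ext4 -f -y -C 0 /dev/main/logs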
pabelangeris the plan to create a symlink for logs now? to help reduce pressure on volume?19:25
fungiand of course finishing the data relocation off of /srv/static into /srv/static/docs-draft yeah19:25
pabelangerthanks19:27
fungii'm going to abort the current mv of files from /srv/static/logs/* into /srv/static/docs-draft/docs-tmp/, and then in a single command line rename /srv/static/logs to /srv/static/logs.0 and symlink /srv/static/logs to docs-draft/logs, then start rsync'ing the contents of /srv/static/logs.0 into /srv/static/docs-draft/logs-tmp19:31
fungiafter which we can delete /srv/static/logs.0 and start the fsck back up with -y19:32
fungialso this will make it easier to get the spooled logs back into the logs volume once we mount it again19:33
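As a single command line, the rename-and-symlink step fungi outlines would be roughly:

    sudo mkdir -p /srv/static/docs-draft/logs && sudo mv /srv/static/logs /srv/static/logs.0 && sudo ln -s /srv/static/docs-draft/logs /srv/static/logs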
pabelangerack19:34
jeblairfungi: ++19:37
fungirsync -av /srv/static/logs.0/ /srv/static/docs-draft/logs-tmp19:37
fungithat look right?19:37
fungii always second-guess myself on trailing-slash situations with rsync19:38
jeblairfungi: i want a / on the end of logs-tmp.  so gimme a min to check myself.19:38
fungiheh19:38
fungii'm going by the `rsync -av /src/foo /dest` is equivalent to `rsync -av /src/foo/ /dest/foo` example in the rsync manpage19:39
clarkbya thats my reading. (btw the keyword to search for is "trailing")19:39
clarkb(its easier for me to remember to search for trailing than to just remember the actual behavior apparently)19:39
jeblairwhat's the diff for trailing slash on dest?19:40
clarkbiirc dest doesn't matter19:40
jeblairyeah, that's what i'm seeing.19:41
jeblairso, lgtm.  :)19:41
jeblair("we're both right!")19:41
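The trailing-slash rule they are double-checking: a slash on the source means "the contents of" that directory, while a trailing slash on the destination makes no difference; for example:

    # contents of logs.0 land directly inside logs-tmp/
    rsync -av /srv/static/logs.0/ /srv/static/docs-draft/logs-tmp
    # without the source slash, this would create logs-tmp/logs.0/ instead
    rsync -av /srv/static/logs.0 /srv/static/docs-draft/logs-tmp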
fungiwas looking real quick to see if there was a good middle ground between normal execution and -v19:42
fungibecause that may slow it down if it outputs every single filename to the console19:42
fungiany opinions?19:42
fungishould i redirect it to a file instead?19:42
fungi(best of both worlds since you can tail the file if you want to without terminal buffering slowing down the application)?19:43
jeblairfungi: i'm okay just omitting it... we can df and strace19:44
fungiwfm19:44
clarkbwrite to file is still gonna block19:44
clarkbso ya19:44
fungithere goes19:44
fungiit's in the same root screen session19:44
fungii'll keep an eye on this so i can clear out the source tree and then get the fsck started back up with -y this time as quickly as possible19:45
-openstackstatus- NOTICE: lists.openstack.org will be offline from 20:00 to 23:00 UTC for planned upgrade maintenance19:58
fungii think the rsync is nearing completion since the used space on the docs-draft volume is about 110gb higher than it was earlier20:09
fungiwe'll find out shortly i guess20:10
fungiand done!20:13
fungistill over a tb available on the docs-draft volume too20:14
pabelangernice20:14
fungirunning it once more in case there were outstanding writes under it when i initially started20:15
jeblairw00t20:15
fungihopefully this goes much more quickly20:15
fungiand then i'll delete everything out of logs.0 before restarting the fsck20:15
fungiand that's done too20:30
fungicleaning up20:30
fungistarting fsck back up now, this time with -y20:41
fungistill in the same root screen session20:42
fungii'm not holding my breath that this will go much faster20:42
jeblairnow that's past, i'd like to lighten the mood by pointing out that the stray characters in the screen window made it look like we were fscking the "dogs" filesystem.20:42
jeblairi'm still giggling about that.20:43
jeblair19:09 < jeblair> dogs: |=================================================       - 87.2%20:43
*** bnemec is now known as beekneemech20:49
fungiyup, i saw the same20:55
fungii think someone hit their d key at some point during the earlier fsck (maybe even me)20:55
fungiprobably the result of a mistyped screen detach attempt20:55
-openstackstatus- NOTICE: The upgrade maintenance for lists.openstack.org has been completed and it is back online.21:50
clarkbfungi: how goes the current fsck?22:17
clarkband would you be offended if I did go do yardwork while still fscking?22:18
jeblairPass 2: Checking directory structure22:19
jeblairi guess we're looking at this finishing tomorrow morning?22:19
fungiclarkb: not much point in sticking around for now22:21
fungiyeah, i'll keep an eye on it while i'm awake, maybe see if jhesketh or ianw don't mind keeping tabs on it on their saturday, but will otherwise probably not reenable things until i wake up tomorrow22:22
*** ianw has joined #openstack-infra-incident22:24
clarkbI will be sure to check in tomorrow when I wake22:25
ianwsorry i did not think my screen session was active so that's why there's ~'s in there22:28
ianwi thought it would be finished!22:28
fungiwell, it was originally restarted with -p (-a really, but same thing)22:29
fungiso when it hit an error it couldn't safely fix, it aborted (around 90% complete)22:29
