Sunday, 2020-06-14

*** rchurch has quit IRC00:25
*** rchurch has joined #opendev00:27
*** DSpider has quit IRC04:02
*** sgw has joined #opendev04:06
openstackgerritMatthew Thode proposed openstack/diskimage-builder master: update grub cmdline to current kernel parameters  https://review.opendev.org/73544505:41
AJaegerinfra-root, seems we lost opensuse and centos mirrors, there's no such directory at https://mirror.bhs1.ovh.opendev.org/05:51
openstackgerritMatthew Thode proposed openstack/diskimage-builder master: add more python selections to gentoo  https://review.opendev.org/73544806:16
yoctozeptoAJaeger: ay-yay, I was about to report that07:34
yoctozeptoI've got another question too: is it possible to get nodes with nested virtualization? as in being able to run kvm in them?07:37
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:03
AJaegeryoctozepto: see https://review.opendev.org/#/c/683431/ - but we have only a limited number of these08:33
AJaeger#status notice The opendev specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken.08:35
openstackstatusAJaeger: sending notice08:35
-openstackstatus- NOTICE: The opendev specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken.08:35
openstackstatusAJaeger: finished sending notice08:38
yoctozeptoAJaeger: thanks, that will do - any reason for debian being missing?08:54
*** DSpider has joined #opendev09:09
*** calcmandan has quit IRC09:36
*** calcmandan has joined #opendev09:37
*** iurygregory has joined #opendev10:07
yoctozeptoAJaeger: same for c8 - if I just proposed a patch adding them, would it work? or does it need more setting up on the providers' side?10:10
*** tosky has joined #opendev10:59
*** sshnaidm_ has joined #opendev12:34
*** sshnaidm|afk has quit IRC12:34
fungilooks like afs01.dfw.openstack.org is down for some reason12:50
fungior at least entirely unreachable12:51
fungiresponds to ping but not ssh12:51
fungiactually it gets as far as the key exchange, so i suspect something's up with its rootfs or process count12:52
fungiwe stopped getting snmp responses from it just before 20:00z12:53
fungicpu utilization and load average show a significant spike just before we lost contact too12:54
fungii'll check its oob console first for any sign of what's wrong, then i guess start following our steps in https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver12:59
fungiit's too bad our fileserver outages always seem to be of the pathological sort where afs failover doesn't actually kick in and clients continue trying to contact the unresponsive server instead of the other one13:00
fungihard to tell for sure from the console, but looks like it may have experienced an unexpected reboot since i see the end of fsck cleaning orphaned inodes from /dev/xvda113:10
fungithough it's followed by a bunch of "task ... blocked for more than 120 seconds" kmesgs13:11
fungii'm going to try to reboot it as gracefully as possible, but in order to reduce the risk of additional write activity to the rw volumes i'm going to shut down the mirror-update servers first13:12
fungithat's easier than trying to hold individual flocks for a bunch of volumes, commenting out cronjobs, adding hosts to the emergency disable list...13:13
fungi#status log temporarily powered off mirror-update.opendev.org and mirror-update.openstack.org while working on afs01.dfw.openstack.org recovery process13:14
openstackstatusfungi: finished logging13:14
fungithe "send ctrl-alt-del" button in the oob console has no effect (unsurprisingly) so i'm trying a soft reboot via cli. odds are i'll have to resort to --hard though13:16
*** tosky has quit IRC13:17
*** tosky has joined #opendev13:17
fungiyeah, doesn't seem to be having any effect. trying to openstack server reboot --hard now13:18
fungiconsole says fsck was able to proceed normally13:19
fungii can ssh into it again13:20
fungi#status log performed hard reboot of afs01.dfw.openstack.org13:20
openstackstatusfungi: finished logging13:20
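For reference, a minimal sketch of the reboot sequence described above, using standard openstackclient commands; cloud/credential selection is omitted here and assumed to already be configured:

```sh
# try a graceful (ACPI) reboot first
openstack server reboot --soft afs01.dfw.openstack.org
# check whether the instance comes back up
openstack server show afs01.dfw.openstack.org -c status -f value
# fall back to a hard reset if it stays hung
openstack server reboot --hard afs01.dfw.openstack.org
```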
fungiyeah, syslog indicates the server was either completely hung shortly after 19:35z or lost the ability to write to its rootfs13:23
yoctozeptofungi: hi; do we really need to mirror the whole repos? would not it make more sense to focus on caching proxies?13:23
fungiyoctozepto: how about we debate mirror redesign another time13:24
fungii've got a lot of cleanup to do here13:24
yoctozeptofungi: sure, no problem, I feel you ;-)13:24
fungithough in short, the reason for creating mirrors rather than using caching proxies is that distro package repositories need indices which match the packages served, and we had endless pain trying to rely on other mirrors of debian/ubuntu packages because they often served mismatched indices causing jobs to break (and proxying would just proxy that same problem)13:25
fungiafs and generating indices from the packages present in the mirror was the solution we found to keep package and index updates atomic13:26
funginow that afs01.dfw has been rebooted, i am once again able to browse the centos and opensuse mirrors13:28
fungiif anyone wants to double-check the stuff which they saw failing before is back to normal, that would help13:29
fungii'll work on making sure all the rw volumes are back to working order now13:29
fungi`bos getlog -server afs01.dfw.openstack.org -file SalvageLog` tells me "Fetching log file 'SalvageLog'... bos: no such entity (while reading log)"13:36
yoctozeptofungi: thanks for insights, I feared that may have been the cause13:36
fungilooks like FileLog contains mention of some salvage operations though13:37
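A sketch of pulling the FileLog instead and filtering for salvage activity, assuming this is run on the fileserver with -localauth (or anywhere with admin tokens):

```sh
# fetch the fileserver's FileLog and look for salvage scheduling messages
bos getlog -server afs01.dfw.openstack.org -file FileLog -localauth | grep -i salvag
```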
fungi#status notice Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked13:39
openstackstatusfungi: sending notice13:39
-openstackstatus- NOTICE: Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked13:40
fungiso the FileLog for afs01.dfw says it scheduled salvage for the following volumes: 536870915, 536871029, 536870921, 536871065, 536870994, 53687093713:41
openstackstatusfungi: finished sending notice13:43
fungivos status says there are no active transactions on afs01.dfw.openstack.org so that's a good sign13:46
fungithe volumes mentioned as getting a salvage scheduled are (in corresponding order): root.cell, project, service, mirror.logs, docs-old, mirror.git13:49
fungii'm not super worried about those as their file counts should be low (several are entirely unused and could even stand to be deleted)13:49
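One way to map those numeric volume IDs back to names, sketched here assuming admin tokens (or -localauth on a db/file server):

```sh
# vos examine accepts numeric volume IDs as well as names
for id in 536870915 536871029 536870921 536871065 536870994 536870937; do
    vos examine "$id" | head -1    # first line shows the volume name and status
done
```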
fungii'll move on to performing manual releases of all the volumes to make sure they're releaseable13:50
fungi55 rw volumes13:53
fungioh, only 50 with replicas though13:56
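A rough sketch of how those manual releases can be driven; the exact enumeration opendev actually uses may differ, and volumes without RO replicas will simply error out of the release step:

```sh
# list VLDB entries with a site on afs01.dfw, then release each volume in turn
vos listvldb -server afs01.dfw.openstack.org -quiet | awk 'NF==1 {print $1}' |
while read -r vol; do
    vos release "$vol" -verbose    # push RW contents out to the RO replicas
done
```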
fungiunfortunately the outage seems to have caught mirror.centos mid-release, so it's getting a full release now which will likely take a long time. i'll try to knock out the rest in parallel13:59
fungisame for mirror.epel14:03
fungiand mirror.fedora14:04
fungimirror.gem is taking a while to release, but that may be due to mirror.centos, mirror.epel and mirror.fedora being simultaneously underway14:10
fungiworth noting, it seems afs01.dfw spends basically all its time at max bandwidth utilization these days: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2362&rra_id=all14:13
auristorfungi: it would be worth checking one or more of clients reading from afs01.dfw logging anything to dmesg during the outage14:13
fungiauristor: good idea, will take a look now, thanks!14:14
auristorsorry, the brain isn't functioning yet.  there are some missing words in that sentence that didn't make it to the fingers.14:14
funginah, i understood ;)14:15
fungi[Sat Jun 13 19:56:49 2020] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server)14:15
auristorsalvaging will be scheduled when the volume is first attached by the fileserver, so it would also be worth attempting to read from each volume14:15
fungi[Sat Jun 13 19:57:00 2020] afs: file server 104.130.138.161 in cell openstack.org is back up (code 0) (multi-homed address; other same-host interfaces may still be down)14:16
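On a client, a quick sketch of pulling out those kernel messages (dmesg -T for human-readable timestamps requires the util-linux dmesg):

```sh
dmesg -T | grep -E 'afs: (Lost contact with file server|file server .* is back up)'
```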
mordredfungi: oh good morning. anything I can help with?14:16
fungimordred: probably not at this point, just double-checking that all the volumes with replicas get back in sync14:17
auristoris there any periodic logging of the "calls waiting for a thread" count from rxdebug for afs01.dfw.openstack.org ?14:18
fungiunfortunately the fact that every release of the centos/epel/fedora mirrors seems to require hours to complete basically guarantees they're caught mid-release if the server with the rw volume goes offline14:18
fungiauristor: none that i see in dmesg (this is using the lkm from openafs 1.8.5 with linux kernel 4.15, if it makes a difference)14:19
mordredfungi: you know - when there is time next to breath - it's possible the difference in how yum repos work compared to apt repos might make yoctozepto's suggestion of considering caching proxies for those instead of full mirrors reasonable (there's less of an apt-get update ; wait ; apt-get install pattern with that toolchain)14:20
fungioh, rxdebug... that's one of the cli tools. just a sec14:20
mordredbut - I agree - let's circle back around to that later :)14:20
auristorsomeone might have set up a periodic run of "rxdebug <host> 7000 -noconn" and logged the "<n> calls waiting for a thread" number or used it as an alarm trigger.14:22
fungiauristor: "0 calls waiting for a thread; 244 threads are idle; 0 calls have waited for a thread" so i think that's a no14:22
mordredsounds like a good thing to add as a periodic thing14:23
auristorwhen that number is greater than 0 it means that all of the worker threads have been scheduled an RPC to process.14:23
mordredcould grep out the 0 and send it to a graphite gauge14:23
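A sketch of what that periodic check might look like; the graphite host and plaintext port 2003 here are assumptions, and a real deployment would presumably feed the existing statsd path instead:

```sh
#!/bin/sh
# cron-able sketch: sample the fileserver's "calls waiting for a thread" count
# and push it to graphite's plaintext listener (host/port are assumptions)
waiting=$(rxdebug afs01.dfw.openstack.org 7000 -noconn |
          awk '/calls waiting for a thread/ {print $1; exit}')
echo "afs.afs01_dfw.calls_waiting ${waiting} $(date +%s)" |
    nc -w 1 graphite.openstack.org 2003
```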
fungiin this case i think it's just bandwidth-bound... the service provider caps throughput to 400mbps on this server instance14:23
auristorwhat I suspect happened is that the vice partition disk failed in such a way that I/O syscalls from the fileserver never completed.14:23
auristorEventually all ~250 worker threads are scheduled an RPC that never completes and then incoming RPCs get placed onto the "waiting for a thread" queue.   Since no workers complete, the waiting number goes up and up.14:24
fungioh, when it was offline, yes probably. i noticed that in the state the server was in, it responded to ping and i could complete ssh key exchange, but my login process either never forked or hung indefinitely14:24
fungiour snmp graphs indicate that in the minutes leading up to the server becoming unresponsive it maxed out cpu utilization and load average had started spiking way up14:25
fungialso the out of band console did not produce a login prompt. there were kernel messages present on the console complaining about hung tasks, but no clue how old those were since they were timestamped by seconds since boot (and i didn't think to jot them down so i could try to calculate the offset from logs later)14:27
fungiunfortunately there was nothing interesting in syslog immediately before it went silent. i expect either the logger got stuck or it ceased to be able to write to its rootfs14:29
fungianyway, i'm going to leave these volume releases in progress for now, i don't feel comfortable starting any more in parallel until at least one of them completes (hopefully mirror.gem won't take too much longer to finish)14:36
AJaegerthanks, fungi!14:40
smcginnisCould these AFS server issues be the root of the release job POST_FAILURES I posted about earlier?14:57
smcginnisI did another one this morning without thinking to check on the status of that, and got another failure.14:58
auristorusing mirror.centos.readonly as an example.   It has two replicas, one on afs01 and one on afs02.   During a release, the one on afs01 is available and the one on afs02 is offline.  If afs01 dies, there are no copies available for clients to use.15:01
fungismcginnis: yes, looking at the timestamps, i expect rsync failed to write to the rw volume on afs01.dfw.o.o because the server was hung at that point15:31
fungiunfortunately all we got out of rsync was a nonzero exit code and no helpful errors15:31
*** icarusfactor has joined #opendev16:08
*** factor has quit IRC16:10
*** icarusfactor has quit IRC16:27
openstackgerritMohammed Naser proposed opendev/system-config master: uwsgi-base: drop packages.txt  https://review.opendev.org/73547317:31
mnasermordred: ^ of your interest17:32
mordredmnaser: ++20:00
*** sgw has quit IRC20:00
openstackgerritMohammed Naser proposed openstack/project-config master: Add vexxhost/atmosphere  https://review.opendev.org/73547820:06
*** rchurch has quit IRC21:00
*** rchurch has joined #opendev21:02
*** sgw has joined #opendev22:48
*** tkajinam has joined #opendev22:57
*** iurygregory has quit IRC22:59
ianwfungi: thanks for looking in on that23:02
ianwin news that seems unlikely to be unrelated, graphite is down too23:03
ianwit has task hung messages and is non-responsive on the console23:06
ianwit also drops out of cacti at 9am on the 13th23:07
fungihrm, i wonder if it could be related to the trove db outage, though that was back on, like, the 9th23:10
fungiand yeah, nearly done with afs volume manual releases. the only ones running now are mirror.fedora, mirror.opensuse and mirror.yum-puppetlabs23:11
fungithe mirror-update servers are still powered down23:11
ianw#status log rebooted graphite.openstack.org as it was unresponsive23:12
openstackstatusianw: finished logging23:12
ianwfungi: we've never got to the bottom of why the rsync mirrors (i think we can probably group them) take so long to release23:14
ianwthis bit on sleeping and clock skew was one attempt: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L12623:16
ianwnote i have scripts for setting up logging @ https://opendev.org/opendev/system-config/src/branch/master/tools/afs-server-restart.sh23:17
ianwbut as auristor says, if the volume is releasing there's no redundancy23:18
ianwand when you look at http://grafana.openstack.org/d/ACtl1JSmz/afs?viewPanel=12&orgId=1&from=now-30d&to=now23:18
ianwbasically the mirrors take as long to release every time as their next pulse; i.e. they're basically always in the release process23:19
ianwwhich also probably results in the network being flat out 100% of the time23:19
*** DSpider has quit IRC23:22
*** tosky has quit IRC23:26
fungiand i guess we've already ruled out simple causes like rsync updating atime on every file or something23:48
fungilooks like we mount /vicepa with the relatime option, i suppose we could set it to noatime (openafs doesn't utilize the atime info from the store anyway, from what i'm reading), though no idea if that will make any difference for the phantom content changes rsync seems to cause23:56
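A minimal sketch of that change; the device name and filesystem type in the fstab entry are assumptions:

```sh
# /etc/fstab entry for the vice partition (device and fs type are guesses):
# /dev/xvdb1  /vicepa  ext4  defaults,noatime  0  2

# apply without a reboot
mount -o remount,noatime /vicepa
```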
fungimirror.yum-puppetlabs release finished, so we're down to just fedora and opensuse now23:58
ianwfungi: from my notes; https://lists.openafs.org/pipermail/openafs-info/2019-September/042864.html23:59
