Monday, 2023-12-11

corvussome post_failures showing up.  1st one i checked failed on upload to ovh_gra15:26
corvusyeah 3/3 failed there15:28
opendevreviewJames E. Blair proposed opendev/base-jobs master: Temporarily disable uploads to ovh_gra  https://review.opendev.org/c/opendev/base-jobs/+/90335115:30
SvenKieskecorvus: I guess this is also related? https://zuul.opendev.org/t/openstack/build/322e58959af645229a7e387686c6cab815:35
fungihttps://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn15:36
fungidoes it look auth related15:36
fungi?15:36
corvusSvenKieske: yes it is15:36
corvusfungi: can't tell, no_log=true15:36
fungii'm late for an errand, but can take a closer look in about an hour. also i approved 903351 but someone may need to bypass gating to merge it15:36
opendevreviewMerged opendev/base-jobs master: Temporarily disable uploads to ovh_gra  https://review.opendev.org/c/opendev/base-jobs/+/90335115:40
fricklerseems to have been lucky15:41
corvus#status notice Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked.15:43
opendevstatuscorvus: sending notice15:43
-opendevstatus- NOTICE: Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked.15:43
opendevstatuscorvus: finished sending notice15:46
clarkbI've been thinking about the best way to force gitea09 to use ipv4 to talk to the vexxhost backup server. I think .ssh/config AddressFamily inet is the proper configuration but then I wonder if we should apply that to /root/.ssh/config on all servers running backups? Or should I constrain it to gitea09? If we want to do that on all backed up servers I could see that one day we might16:17
clarkbhave conflicting /root/.ssh/config configs for different needs/services though that isn't the case today16:17
clarkbanyone else have an opinion or ideas on how best to tackle this?16:17
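For reference, a minimal sketch of the ssh_config stanza being discussed; the host pattern is an illustrative assumption, not the actual OpenDev backup host name:

    # /root/.ssh/config (sketch; host pattern is hypothetical)
    Host backup*.opendev.org
        # Force IPv4 only, sidestepping the broken IPv6 path to the backup server
        AddressFamily inet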
clarkbseparately but related: https://review.opendev.org/c/opendev/system-config/+/902842 should remove the old gerrit replication key from gitea. Do we want to go ahead and approve/land that now or fix backups first?16:19
fungihttps://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn now indicates they believe the problem in gra was resolved 14 minutes after 903351 merged16:23
fungilooks like they believe it happened from 15:05 to 15:54 utc16:24
clarkboh looks like we already manage .ssh/config for backups16:29
corvusfungi: i haven't checked to see if they're ovh, but i see post_failures going back to 13:xx utc.  (more before that, but those jobs all have "docker-image" in their names so i suspect they are unrelated)16:30
fungiis that the time the jobs started, or when the tasks failed?16:31
fungibut yeah, some projects also have perpetually broken image uploads16:31
corvusfungi: oh start time, good point16:31
corvusyeah, and spot checking the end times of some of the 13:xx they ended at 15:xx16:32
corvusfungi: then i think we have high correlation with their outage times :)16:32
fungiagreed16:33
corvusclarkb: sgtm.  i haven't followed 100%, but i take it the issue is something like streaming big stuff on these hosts over ipv6 is bad?16:33
opendevreviewClark Boylan proposed opendev/system-config master: Force borg backups to run over ipv4  https://review.opendev.org/c/opendev/system-config/+/90335616:33
clarkbcorvus: yup ipv6 connectivity seems to be having problems between vexxhost sjc1 gitea09 and the mtl01 backup server16:34
corvusmaybe 2024 will be the year of ipv6 and the linux desktop16:34
clarkbha. I finally got around to trying to rma my laptop but lenovo said the turnaround time isn't quick enough for this trip I'm taking so I'm delaying. In the meantime I discovered that if I boot with nomodeset that basically disables fancy gpu things and rendering "works". I just get a lower resolution than native with a different aspect ratio so things look weird and can't lower the16:36
clarkb(full) brightness. Oh and suspending doesn't actually save as much battery as it should16:36
clarkbbut I can limp along on that for a little bit longer16:36
clarkbbut my brother has the same laptop and while I can reproduce the problem on an ubuntu live image his laptop cannot. So I'm fairly certain the problem is device specific.16:37
clarkbfungi: corvus: it is pretty easy to test if that region is working again by forcing that region in base-test16:38
fricklermight not be related, but I'm also having no route to vexxhost via IPv6 from my local DSL provider again (had that some years ago and took a long time to resolve)16:42
opendevreviewClark Boylan proposed opendev/system-config master: Add hints to borg backup error logging  https://review.opendev.org/c/opendev/system-config/+/90335716:43
opendevreviewJames E. Blair proposed opendev/base-jobs master: Force base-test to upload to ovh_gra  https://review.opendev.org/c/opendev/base-jobs/+/90335816:45
opendevreviewMerged opendev/base-jobs master: Force base-test to upload to ovh_gra  https://review.opendev.org/c/opendev/base-jobs/+/90335816:51
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: DNM: exercise base-test  https://review.opendev.org/c/zuul/zuul-jobs/+/90336216:56
opendevreviewJames E. Blair proposed opendev/base-jobs master: Revert "Temporarily disable uploads to ovh_gra"  https://review.opendev.org/c/opendev/base-jobs/+/90336517:11
corvusall but one job in the test set has completed successfully; i think we can return to standard condition now.17:11
corvusupdate: all jobs completed successfully17:12
clarkb+2 from me. I won't fast approve this one though and hopefully someone else can check it too17:13
clarkbfrickler: fyi a fix for the "WIP changes show as merge conflicted" issue in gerrit change listings has merged against stable-3.918:10
fricklerclarkb: oh, nice, I wasn't aware that they agreed about this being a bug18:16
clarkbfrickler: https://gerrit-review.googlesource.com/c/gerrit/+/396899/ is the change18:17
clarkbI'm going to take advantage of the lack of pineapple express rain to go on a bike ride in a bit. but still happy to be around to watch any of those changes linked above (restore ovh gra logs, remove ssh key from gitea, force backups on ipv4) either before or after that happens19:06
*** elodilles is now known as elodilles_pto21:01
JayFreview.opendev.org is unreachable for me locally22:09
JayFresolves to review02.opendev.org (199.204.45.33)22:09
NeilHanlonsame here (and via v6)22:10
JayFrouting inside level3 according to this MTR looks bananas22:10
JayFhttps://home.jvf.cc/~jay/review-mtr-20231211.png22:11
JayFI wonder if there's some kind of weird BGP thing going on22:12
JayFbecause that feels like I'm being routed to the wrong area of the internet22:12
JayFinfra-root: FYI seemingly non-actionable incident appears to be ongoing with review.opendev.org at least with a portion of the internet,22:12
NeilHanloni'm checking my ripe atlas, but agree22:13
JayFI'm confirming with other folks around the world review.opendev.org is down but generally other internet things aren't; I don't know what network that is on tho22:17
NeilHanlonhttps://atlas.ripe.net/measurements/64730794#probes22:18
clarkbI'm not sure it's a network thing yet. The mirror in that cloud region, which has an IP addr (ipv4 anyway) in the same /24 range, is reachable22:19
clarkbthe server reports it is shutoff according to the nova api22:20
JayFclarkb: I can tell you generally my route to this server doesn't go through bell canada :) but maybe that's just something else weird happening simultaneously22:20
clarkbfungi: corvus frickler tonyb should I try to start it again via the nova api? or do we want to investigate further first?22:21
clarkbserver show against the server doesn't indicate any in progress tasks22:21
clarkbOS-EXT-STS:task_state               | None <- I think that would tell us if they were doing a live migration for example22:22
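A sketch of the nova API checks being referenced here, using the openstack CLI; the server name argument is illustrative:

    # Show power state and any in-progress task (e.g. a live migration) for the instance
    openstack server show review02.opendev.org -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state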
NeilHanlonyeah, looking again it appears traffic arrives where it needs to22:22
fungiupdated_at=2023-12-11T21:28:20Z status=SHUTOFF vm_state=stopped22:22
fungii'm guessing that's when it went down22:22
clarkbthat seems to align with cacti losing connectivity too22:23
fungiyeah, i don't see anything else to investigate without trying to reboot it22:23
fungiexpect a lengthy wait for fsck to run22:23
clarkbfwiw I don't see anything in cacti indicating that something snowballed out of control prior22:24
clarkbfungi: should I server start it or do you want to do it?22:24
fungiand we may need to connect to the console if fsck requires manual intervention22:24
fungigo for it22:24
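The start command itself would look roughly like this (UUID as reported to the provider later in this log):

    # Ask nova to power the stopped instance back on, then re-check its status
    openstack server start 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a
    openstack server show 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a -c status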
clarkbthe server came right up22:25
fungiguilhermesp: mnaser: ^ heads up we found 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a (review02.opendev.org) spontaneously shutdown in ca-ymq-1 at 21:28:20 utc according to the nova api22:26
fungireboot   system boot  Mon Dec 11 22:25   still running      0.0.0.022:26
corvuso/ sorry i missed excitement22:27
fungithe previous entry from last was me logging in for 43 minutes on 2023-12-0622:27
fungiso looks like there was nobody logged in at the time that occurred22:27
clarkbcorvus: well there may still be exicitement22:27
NeilHanlonJayF: fwiw, it appears that Level3 junk you're seeing is 'normal' -- fsvo normal. 22:28
clarkbdocker reports gerrit failed to start around when I booted the server if I'm reading the docker ps -a output correctly but the last logs recorded by docker appear to be from when the server went down22:28
clarkbfungi: corvus: should I docker-compose down && docker-compose up -d?22:28
corvuscouldn't a live migration have "finished" and not shown up in the task state?22:28
clarkbcorvus: maybe?22:28
corvushrm lemme look at docker22:28
JayFNeilHanlon: interesting; I'm a relatively newish centurylink fiber customer so maybe I'm just not so used to that particularly quirky route22:28
clarkbcorvus: k22:28
fungilast entries in syslog are from snmpd at 21:25:03, which was a few minutes before the shutdown22:29
fungiskimming syslog leading up to the outage, i don't see anything amiss22:29
clarkbI guess the other thing is whether or not we want to force a fsck22:30
clarkbsince it seems that no fsck was done22:30
ianwis it possible it shutdown cleanly?22:30
corvusi would generally trust ext4 on that... unless our paranoia level is 11?22:30
fungiianw: i would have expected a clean shutdown to leave some trace in syslog22:31
clarkbcorvus: ack22:31
NeilHanlonJayF: yeah, the thing to look for is whether the packet loss is consistent between ASes -- i.e., if you have loss that continues from hop N all the way to the destination (or close to it), with no loss-free hops in between. in short: if the loss isn't consistent from point A to B, it's likely noise from that network's devices not liking to respond. you can22:31
NeilHanlonsometimes get them to treat your traceroute traffic better if you send over tcp (mtr -P 443 --tcp review.opendev.org)22:31
clarkbI wonder if docker the container manager is recording that the containers failed when it came back up, hence the timestamp confusion, but it didn't actually try to restart them at that time22:32
fungithe only fsck message in syslog is this one (about the configdrive i think?):22:32
fungiDec 11 22:25:19 review02 kernel: [    6.735306] FAT-fs (vda15): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.22:32
fungiaha, it's /boot/efi22:33
corvusclarkb: oh good theory.  i don't have any better ideas.  i don't see any docker logs suggesting it tried to start any containers.22:33
corvus            "StartedAt": "2023-12-06T21:39:27.061492779Z",22:34
corvus            "FinishedAt": "2023-12-11T22:25:24.402771369Z"22:34
fungican anyone confirm that /etc/fstab is set to not fsck any of our filesystems?22:34
corvusclarkb: ^ i think those timestamps from `docker inspect ac1d7b309848` support your theory22:34
fungi2021-06-22 was the last modified date for /etc/fstab btw22:35
fungiso it's been like this for 2.5 years22:35
clarkbfungi: ya it seems the last field is 922:35
clarkbs/9/0/22:35
corvusclarkb: i release my debugging hold, and i think it's okay to down/up (or maybe even just up; it will probably dtrt) once the fsck question is resolved.22:35
JayFNeilHanlon: how, in context of an mtr/tracert, do you know where you swap AS22:35
clarkbcorvus: ack thanks for looking22:35
ianwfungi: it is defaults 0 0 22:35
ianwon the gerrit partition22:35
fungispot checking other servers, we do set fsck passno to 1 or 2 for non-swap filesystems22:36
clarkbfungi: so ya should we set 1 on cloudimg-rootfs and /boot/efi and then 2 on /home/gerrit2?22:36
clarkband then reboot?22:36
fungiclarkb: i think so, yes22:36
corvus++ to the pass fstab change22:36
NeilHanlonJayF: DNS (if it's available), and modernish mtr has a `-z` flag which will do AS lookups22:36
fungiianw: right, "default" for the fsck passno field is 0, which means "don't fsck at boot"22:36
JayFNeilHanlon: oh, nice :) I'm on gentoo so I better have the flag or else I can go bump the ebuild :D22:37
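Putting NeilHanlon's suggestions together, a hedged example of the traceroute invocation (exact output will vary by vantage point):

    # TCP probes to port 443 so routers that deprioritize ICMP don't read as loss,
    # plus -z to print the AS number owning each hop
    mtr --tcp -P 443 -z review.opendev.org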
clarkbfungi: I'll let you drive that22:37
NeilHanlon:D22:37
clarkbI suppose we could manually fsck /home/gerrit2 first without a reboot if we wanted22:37
NeilHanlonJayF: this is a good listen (or read w/ linked slides) https://youtu.be/L0RUI5kHzEQ that taught me everything I've now forgotten about traceroute :D22:38
clarkbthe updated /etc/fstab looks correct to me22:38
fungiinfra-root: i've edited /etc/fstab on review02 now so that non-swap filesystems will get a fsck at boot22:38
fungirootfs and efi on passno 1, gerrit home volume on passno 222:39
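A sketch of what the corrected fstab entries would look like; device identifiers and options here are illustrative, not copied from the real server:

    # <device>              <mountpoint>   <type> <options> <dump> <passno>
    LABEL=cloudimg-rootfs   /              ext4   defaults  0      1
    LABEL=UEFI              /boot/efi      vfat   defaults  0      1
    /dev/mapper/gerrit      /home/gerrit2  ext4   defaults  0      2
    # passno 0 disables boot-time fsck; 1 is checked first (root), 2 after root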
fungishall i reboot the server now?22:39
clarkbfungi: corvus: should we down the containers before we reboot? just to avoid any unexpected interactions?22:39
fungiprobably, yes22:39
clarkbthat is the only other thought I have before rebooting22:39
ianw(i likely as not added that entry manually for the gerrit home ~ 2021-07 when we upgraded the host and afaik it wasn't an explicit choice to turn off fsck, i probably just typed 0 0 out of habit)22:40
clarkbfungi: I think you should docker-compose down the containers to prevent them trying to start until we are ready, then do a reboot22:41
fungiianw: well, the rootfs was also set to not fsck at boot22:41
corvusclarkb: yes to downing22:41
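A sketch of the sequence being agreed on here; the compose directory path is an assumption, not necessarily where it lives on review02:

    # Stop the gerrit containers so nothing races the reboot
    cd /etc/gerrit-compose && docker-compose down
    # Reboot so the new fstab passno values take effect and the filesystems get checked
    reboot
    # Once the fsck results look clean, bring gerrit back up
    cd /etc/gerrit-compose && docker-compose up -d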
fungidowned now22:41
fungirebooting22:41
fungii also have the vnc console connected22:42
fungiso i can watch the boot progress22:42
fungiit's already up22:42
clarkbyup, is there a good way to check if it fscked? I guess your boot console would tell you?22:43
fungiDec 11 22:42:23 review02 systemd-fsck[816]: gerrit: clean, 494405/67108864 files, 113090725/268434432 blocks22:43
fungifrom syslog22:43
clarkbit did not fsck /; is the implication that the fs was not dirty and thus could be skipped?22:44
fungii can't tell, still looking22:45
clarkbI guess that isn't too surprising since most of the server state is on the gerrit volume. One exception is syslog/journald though22:45
fungiopenstack console log show22:45
fungiBegin: Will now check root file system ... fsck from util-linux 2.3422:45
fungi[/usr/sbin/fsck.ext4 (1) -- /dev/vda1] fsck.ext4 -a -C0 /dev/vda122:46
clarkboh I wonder if systemd-fsck can only fsck non-/ filesystems22:46
clarkband you need fsck before systemd for /22:46
clarkbthat could explain the logging being missing for /22:46
fungiyep22:46
clarkbfungi: does the console log show any complaints from fsck for / if not I think we're ok?22:47
tonybI think you can do something like tune2fs -l /dev/$device to see when it was last fsck'd22:47
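tonyb's suggestion spelled out; a minimal sketch assuming an ext4 root device like the one above:

    # Report when the filesystem was last checked and its mount counts
    tune2fs -l /dev/vda1 | grep -Ei 'last checked|mount count'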
fungii did not find any errors in the console log from fsck22:47
fungijust messages about systemd creating fsck.slice and listening on the fsckd communication socket22:47
clarkbin that case I guess we can proceed with a docker-compose up -d?22:47
fungiagreed, i'll do that now22:48
fungiit's on its way up now22:48
clarkbwe didn't move the waiting/ queue dir aside so those exceptions are "expected"22:48
clarkbthere is also a very persistent ssh client that is failing to connect. But otherwise I think that startup log in error_log looked good22:49
clarkbthe web ui is up for me and reports the version we were running prior22:49
clarkbso we didn't update gerrit (expected, we haven't made any image updates iirc)22:49
clarkbmaybe we should approve https://review.opendev.org/c/opendev/base-jobs/+/903365 and use that as a good canary of the whole approval -> CI -> merge/submit process?22:50
JayFsomething is def. wrong22:52
JayFhttps://review.opendev.org/c/openstack/governance/+/902585 does not have any comments loaded, for example22:53
NeilHanlonmaybe they're just in invisible ink now?22:53
JayFit looks weirdly spooky, all the comment spots are there but empty22:53
tonybJayF: I found review.o.o to be very slow after the reboot22:53
clarkbyes it has to reload caches22:53
clarkbif it persists after 5 or 10 minutes then we should check again. Unfortunately this is "normal" which makes it hard to say if something is wrong22:54
tonybJayF: and I see many comments on that change FWIW22:54
JayFack; makes sense. First time I've been here to see when it first gets restarted, I think :D 22:54
clarkbyou'll see it struggle to load diffs as well22:54
tonybclarkb: I'm happy to +2+A 90336522:54
fungii've approved 903365 now22:54
tonybLOL22:54
NeilHanloncomments do load, but takes a few seconds22:54
NeilHanlonhttps://drop1.neilhanlon.me/irc/uploads/b238bf77f8924b48/image.png 22:55
fungi903365 is showing on https://zuul.opendev.org/t/opendev/status with builds in progress22:55
fungieta 3 minutes22:55
clarkbdid we lose the bot?22:58
clarkbthe change merged but the bot doesn't seem to be connected (not surprising I guess)22:59
fungiand it's already replicated to https://opendev.org/opendev/base-jobs/commit/ddb313722:59
fungii'll restart the bot22:59
clarkbthanks!22:59
fungiyeah, container log indicates the last event the bot saw was at 21:27:13 utc, right when the server probably died23:03
clarkbthe ip spamming us with ssh connection attempts belongs to IBM according to whois23:04
clarkbanyone know anyone at IBM?23:04
tonybNot that could help with that :/23:05
clarkbit's probably some ancient jenkins that everyone forgot about23:05
clarkbI suspect the errors are due to its age23:05
tonybfungi: If you get a moment can you share a redacted mutt.conf for accessing the infra-root mail?23:05
fungistatus log Started the review.opendev.org server which appeared to have spontaneously shut down at 21:28 UTC, also corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bot so they would start seeing change events again23:06
fungikinda wordy, look okay?23:06
tonybSure, I think you can drop the passno text to shrink it a little23:07
fungiwould rather not forget that we fixed it to actually fsck on boot23:08
clarkblgtm23:08
fungi#status log Started the review.opendev.org server which spontaneously shut down at 21:28 UTC, corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bots so they'll start seeing change events again23:08
opendevstatusfungi: finished logging23:08
fungiwhittled it down a smidge23:09
tonybUmmm I didn't actually get that message anywhere23:10
tonyband it finished logging very quickly23:11
fungitonyb: that's what status log does23:11
fungias opposed to notice or alert or okay which notify irc channels23:12
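For reference, the statusbot commands being contrasted here, with illustrative message text; per the discussion, "log" only records the entry (status page and the opendevinfra Mastodon account) while "notice" also broadcasts to the IRC channels:

    #status log Started the review.opendev.org server which spontaneously shut down ...
    #status notice Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers ...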
ianwit appeared on mastodon which i was scrolling getting a tea :)23:12
tonybooooo my mistake23:12
fungiwe usually try to avoid pestering every irc channel if there's no action required23:12
JayFmy hilight bar in weechat appreciates you :)23:12
fungiand yeah, following https://fosstodon.org/@opendevinfra/ will still get them23:13
fungiunrelated, all pypi.org accounts will require 2fa (and so also upload tokens) starting on 2024-01-0123:16
*** dmellado2 is now known as dmellado23:16
fungihttps://discuss.python.org/t/announcement-2fa-requirement-for-pypi-2024-01-01/4090623:20
clarkbgithub is like 2024-01-28 ish23:25
tonybI get that Jan-1st is a really nice line in the sand, but it really sucks because holiday season :/23:27
NeilHanlonnew year, same problems 🙃23:27
