Monday, 2020-10-12

ianw[Fri Oct  9 23:08:49 2020] Buffer I/O error on dev dm-11, logical block 526344, lost async page write00:14
ianwi think this fs is unhappy generally; trying to find out where this space is, is not working00:14
ianwi can't fsck /dev/mapper/main-main00:21
ianwsays it's in use, but i've killed everything and unmounted /opt00:21
ianwi'll reboot, stop the containers, and try and fsck this partition00:23
clarkbdo you need a vgchange -a n ?00:30
clarkbthat should disable the vg00:30
ianwmaybe i could have done something like that, a reboot has allowed me to fsck it00:32
ianwonce it's finished, hopefully i can df it to find what's going on00:32
ianwotherwise i guess i can just format it and start again, but i'd prefer not to00:33
clarkbI'm guessing we've leaked partial image builds. At least that is what it has been in the past00:34
fungiodds are there were unlinked inodes with open file handles in process, so you won't see them reflected as "used" in any particular part of the tree but they'll still count toward the used blocks for the fs itself00:41
ianwyeah, fsck gave back some space, but i'm still not sure if /opt/dib_tmp is just being really slow, or actually causing issues00:42
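fungi's point about unlinked inodes can be reproduced in a few lines. This is a minimal sketch, assuming a POSIX system such as Linux. The function name `unlinked_but_open_demo` is made up for illustration and has nothing to do with the builder's actual tooling. It shows a file whose name is gone from the tree while its blocks stay allocated until the last open handle is closed.

```python
import os
import tempfile


def unlinked_but_open_demo(size=1024 * 1024):
    """Write a file, unlink it while it is still open, and report what remains."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * size)
        os.unlink(path)                 # the name disappears from the directory tree
        visible = os.path.exists(path)  # tools that walk the tree cannot find it
        st = os.fstat(fd)               # but the inode is still alive through the open fd
        return visible, st.st_nlink, st.st_size
    finally:
        os.close(fd)                    # only after this can the fs release the blocks


if __name__ == "__main__":
    print(unlinked_but_open_demo())
```

While the handle is open, a tree walk like du does not count the file, but the filesystem totals still include it, which matches the mismatch described above.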
ianwi'm clearing /opt/dib_tmp, very slowly01:04
ianwok, i guess it's back02:02
openstackgerritIan Wienand proposed opendev/system-config master: borg : match install-borg role to run deploy job
ianw#status log cleared full storage on nb01 and rebooted02:46
openstackstatusianw: finished logging02:46
ianw#status log cleared openafs cache on and rebooted02:46
openstackstatusianw: finished logging02:46
ianwit seems to be serving again02:46
ianwand listening to En Français by Pomplamoose for good measure02:49
clarkboh was it a bad cache? you rm'd /var/cache/afs contents?02:59
ianwclarkb: yeah, i've reported issues with a similar backtrace to the afs list before03:10
ianwit seems when it shuts down hard, it's very likely to corrupt itself03:11
clarkbalso is openafs-client enabled in systemd? I had disabled it in order to try manually starting it after normal boot had finished03:12
openstackgerritClark Boylan proposed opendev/system-config master: DNM Forcing a gitea job failure to test gerrit replication
clarkbthat never ran the job I was trying to hold on friday03:15
clarkbhopefully that gets a held node I can use tomorrow morning03:15
openstackgerritMerged opendev/system-config master: borg : match install-borg role to run deploy job
openstackgerritIan Wienand proposed opendev/system-config master: install-borg: bump to latest version
openstackgerritIan Wienand proposed opendev/system-config master: install-borg: bump to latest version
ianwclarkb: ahh, i wondered why that was disabled.  i've enabled it03:51
openstackgerritIan Wienand proposed opendev/system-config master: install-borg: bump to latest version
openstackgerritMerged opendev/system-config master: install-borg: bump to latest version
openstackgerritIan Wienand proposed opendev/system-config master: borg backup : add ethercalc02
ianwclarkb: ^ as discussed for first borg backup host05:33
*** marios has joined #opendev05:36
*** ysandeep|away is now known as ysandeep05:38
*** slaweq has joined #opendev05:55
*** slaweq has quit IRC06:08
*** marios has joined #opendev06:10
*** rpittau|afk is now known as rpittau07:35
dirkclarkb: it is a bit awkward that this lsb installation issue slipped through. I submitted a fix today08:28
*** ysandeep|lunch is now known as ysandeep09:10
frickler#status log restarted gerritbot on eavesdrop once again09:17
openstackstatusfrickler: finished logging09:17
fricklerinfra-root: ^^ this seems to be becoming an almost daily issue, anything we can do about this?09:19
AJaegerfrickler: was that really needed? I saw it reporting earlier...09:21
AJaegerNo problems on #zuul as far as I can see09:21
fricklerAJaeger: I missed a report for a patch I submitted for devstack and there were two complaints in #-infra, so at least something was wrong.09:25
fricklerI didn't spot any issue in the docker log, but that also seems to go back for a couple of hours, so likely isn't very helpful when issues happened some time ago09:26
jrosserits certainly now doing something, when previously it wasnt, for the things i'm interested in09:30
danpawlikHi. Is everything ok with AFS mirror?09:39
danpawlikmostly related to mirror.{fedora,centos,epel}09:39
fricklerdanpawlik: likely not, looks like they are three days old09:54
fricklerfungi: ^^ also while mirror.ubuntu is recent, mirror.ubuntu-ports and mirror.debian still seem to be 7 days old, did you unlock those latter ones, too?09:56
danpawlikfrickler: exactly ;)10:03
*** ysandeep is now known as ysandeep|afk10:32
*** ysandeep|afk is now known as ysandeep11:26
*** priteau has quit IRC11:59
fungifrickler: i did not do any other volumes, just ubuntu. i'll start those now12:04
fungii've removed stale vos release locks for mirror.ubuntu-ports and mirror.debian12:07
fungibut i need to go run errands for the next few hours and won't be on hand to check their updates12:07
fungithere were stale locks for mirror.fedora, mirror.centos and mirror.epel which i've also removed now12:09
danpawlikcool fungi++12:19
fungii figure centos is probably the most urgent one to complete first, so i've held the flock for it in a root screen session on and am performing a manual vos release with -localauth for it in a root screen session on afs01.dfw.openstack.org12:25
fungihoping to get it a headstart before the fileserver is starved for bandwidth12:25
fungistepping out now, should hopefully be back by 16:00 utc12:38
*** slaweq_ is now known as slaweq14:09
clarkbre gerritbot according to the logs it thinks it is still connected14:43
clarkbwhich makes me think this isn't another python3 conversion issue but instead some sort of problem interfacing with the freenode network14:44
clarkbfungi: can we reenable bhs1 in nodepool now too? (just remove nl04 from the emergency file?)14:46
fungiclarkb: yeah i think so15:08
fungias for gerritbot, my suspicion has been something times out the connection to the irc server and the irc client module doesn't notice, so it keeps sending messages to a dead socket15:09
fungiwe see evidence of socket timeout reported by the server in channels15:10
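One common way to catch the dead-socket situation fungi describes is a liveness watchdog: record when traffic was last received from the server and treat the connection as stale once that gap exceeds a timeout. The sketch below is illustrative only. `LivenessWatchdog` and its methods are hypothetical names, not part of gerritbot or of any IRC library, and the reconnect step itself is left out.

```python
import time


class LivenessWatchdog:
    """Tracks inbound traffic and flags a connection that has gone quiet."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()

    def mark_traffic(self):
        """Call whenever anything arrives from the server (e.g. a PING or PONG)."""
        self.last_seen = self.clock()

    def is_stale(self):
        """True when nothing has been heard for longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout


if __name__ == "__main__":
    now = [0.0]
    dog = LivenessWatchdog(timeout=30, clock=lambda: now[0])
    now[0] = 31.0
    print(dog.is_stale())
```

A bot using something like this would check `is_stale()` before sending and reconnect instead of writing to the old socket.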
clarkbI'm in meetings right now, but then can remove nl04 from emergency if that isn't done already15:12
clarkbthen I'm going to work on getting my caught gitea99 into shape for replication15:12
clarkbI've already updated the ssh key for that but realize I need to tweak iptables rules as well as formatting and remounting xvde onto /var/gitea to have enough disk space15:13
clarkbI've removed from the emergency file which should restore its config on the next hourly pass15:47
*** rpittau is now known as rpittau|afk16:01
*** ralonsoh has quit IRC16:05
*** priteau has joined #opendev16:20
clarkbpaladox: are you around? if so do you know if the gerrit serverId config setting identifies a logical gerrit install or a specific server? seems like instanceId is for the specific server?16:38
clarkbI've just discovered that notedb relies on this apparently but the migration doc doesn't mention it :( anyway want to figure out what we should set ours to16:39
paladoxyou can really set it to any, you can even set up a new install and just use that value16:40
clarkbya on my test server it is a uuid16:41
clarkbwhich was auto set, but I don't want ansible to delete it and then cause us problems with notedb after16:41
clarkblooks like wikimedia is using the uuid value in config mgmt16:42
clarkbI guess we can do that too then16:42
clarkbin that case I think we may just set it after the migration is done to whatever value is chosen16:44
*** openstackgerrit has joined #opendev17:04
openstackgerritClark Boylan proposed opendev/system-config master: Disable change.move and enableSignedPush in gerrit
openstackgerritClark Boylan proposed opendev/system-config master: Stop blocking /p/ in the gerrit apache vhost
openstackgerritClark Boylan proposed opendev/system-config master: Update gerrit container image to 3.2
openstackgerritClark Boylan proposed opendev/system-config master: Switch to zuul's default gerrit auth type
openstackgerritClark Boylan proposed opendev/system-config master: Clean up old Gerrit html theming and commentlinks
openstackgerritClark Boylan proposed opendev/system-config master: Remove reviewdb config from Gerrit
openstackgerritClark Boylan proposed opendev/system-config master: Post 2.16 upgrade config updates
openstackgerritClark Boylan proposed opendev/system-config master: Switch to zuul's default gerrit auth type
openstackgerritClark Boylan proposed opendev/system-config master: Update gerrit container image to 3.2
openstackgerritClark Boylan proposed opendev/system-config master: Clean up old Gerrit html theming and commentlinks
openstackgerritClark Boylan proposed opendev/system-config master: Remove reviewdb config from Gerrit
clarkbsorry noticed a slight ordering bug in the previous stack17:07
clarkbone thing that isn't clear to me is how all those changes will work out with our zuul cd stuff17:08
clarkbI think we'll be ok because they mostly reflect the end state just different portions of each end state. As long as we don't restart gerrit between the beginning and end of the application of those we should be fine17:09
clarkbwe can also squash them all together on the day of and land a single change, but I think for now this makes review simpler17:09
clarkbactually zuul will probably be off17:09
clarkbso we can land the stack then manually update zuul's auth config for gerrit, start zuul, then have it run with the end of the stack17:10
*** qchris has joined #opendev17:42
fungiinfra-root: things are not looking great for the centos volume... "Failed to end transaction on rw volume: Possible communication failure"17:53
clarkbfungi: that is from a vos release?17:54
fungiyeah, it apparently got stuck when afs02.dfw hung late last week and had to be ungracefully rebooted18:01
fungilooks like there's a transaction which may need to be ended, will likely need the full replica on 02 replaced i'm guessing18:02
AJaegerconfig-core, please review and
clarkbI've just started a review-test replication to my held gitea node. iptables is blocking ports 3081 and 3000 currently so we can review what is replicated before exposing it18:14
clarkbit does look like the changes meta refs are being replicated which will increase the disk usage of our replica by quite a bit (I think it roughly doubles the size of the git repos)18:14
clarkbif anyone else has time to look at replication config options to determine if excluding the refs/changes/XY/ABCXY/meta ref is possible that may be useful to avoid very expensive replication18:15
clarkb(I've looked and can't figure out a way to do that)18:15
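To make the ref pattern in question concrete, here is a small sketch that recognises NoteDb change meta refs by name. It is not replication plugin configuration and says nothing about whether the plugin can exclude them. It assumes the `refs/changes/XY/ABCXY/meta` layout quoted above, where the two-digit shard is the last two digits of the change number. `is_change_meta_ref` is a hypothetical helper name.

```python
import re

# refs/changes/<2-digit shard>/<change number>/meta
META_REF = re.compile(r"^refs/changes/(\d{2})/(\d+)/meta$")


def is_change_meta_ref(ref):
    """Return True if ref names a change's NoteDb meta ref."""
    match = META_REF.match(ref)
    if not match:
        return False
    shard, change = match.groups()
    # Assumption: the shard is the change number's last two digits, zero padded.
    return change.zfill(2)[-2:] == shard


if __name__ == "__main__":
    for ref in ("refs/changes/45/12345/meta",
                "refs/changes/45/12345/1",
                "refs/notes/review"):
        print(ref, is_change_meta_ref(ref))
```

Patchset refs such as `refs/changes/45/12345/1` and the `refs/notes` namespace do not match, so a filter like this would single out only the new NoteDb metadata.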
clarkbAJaeger: I'll take a look after lunch18:16
openstackgerritMerged zuul/zuul-jobs master: Use ansible_distribution* facts instead of ansible_lsb
clarkblooking at replication we do already replicate refs/notes (I think these make sense as it summarizes completed reviews) and refs/users/ (this is where your drafts in the web editor go)18:45
clarkbthat means the major addition is refs/changes/XY/ABCXY/meta18:46
fungiyes, i make a lot of use of the refs/notes content, i'm probably not the only one18:46
fungii'm hoping we can come up with a way to configure gitea to display gerrit's notes (it can display notes but hard-codes the notes base last i checked)18:47
clarkbfungi: yup definitely useful but I wonder if the refs/change/..../meta content supersedes it18:48
clarkbthe big difference is the notes are a summary but the meta ref is a full history aiui18:48
fungioh, it may. i mainly use its notes to find the change url, who approved and what date and time it merged18:49
clarkb2196 tasks to go .... doing this full replica is not fast18:49
clarkbbut at least it seems to be working and the only major change from what we have today is the addition of the notedb meta refs18:49
clarkb(that I can see so far)18:49
clarkbalso need to wait and see what disk use looks like to determine if we have to rebuild the gitea servers :/18:51
clarkbthinking out loud here you can tell the replication plugin to not replicate hidden projects18:53
clarkbI half wonder if we should consider setting the hidden flag on certain subsets of repos (the deb package repos come to mind)18:53
openstackgerritMerged openstack/project-config master: Update neutron stable grafana dashboards
fungiwell, it's not like they're changing, so after initial replication there won't really be any additional replication churn for them18:59
clarkbcorrect, but its more than doubling the size of our git repos I think18:59
clarkbbasically what was once a 15GB database in a single mysql instance is now in all the git repos and copied 9 times19:00
clarkbcurrent gitea01 repo size post packing is 12GB, the repo growth on review-test after the migration and pre packing was ~15GB19:00
clarkband now that I think of it I'm comparing packed vs unpacked sizes so it isn't quite doubling it19:01
clarkbI think the packed review-test size growth was ~5GB19:01
clarkbwe're adding about 50% disk overhead19:01
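The estimate above can be checked with quick arithmetic. This sketch uses only the rough figures quoted in the channel (12GB packed on gitea01, ~15GB unpacked growth, ~5GB packed growth). The helper name `overhead_ratio` is made up for illustration.

```python
def overhead_ratio(base_gb, growth_gb):
    """Growth expressed as a fraction of the existing size."""
    return growth_gb / base_gb


PACKED_GITEA_GB = 12    # current gitea01 repo size after packing
UNPACKED_GROWTH_GB = 15  # review-test growth before packing
PACKED_GROWTH_GB = 5     # review-test growth after packing (approximate)

if __name__ == "__main__":
    # Packed vs unpacked: the misleading comparison that suggested "more than doubling".
    print(overhead_ratio(PACKED_GITEA_GB, UNPACKED_GROWTH_GB))
    # Packed vs packed: the like-for-like comparison.
    print(round(overhead_ratio(PACKED_GITEA_GB, PACKED_GROWTH_GB), 2))
```

Comparing packed to packed gives a little over 40% extra disk, in line with the rounded "about 50%" figure.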
openstackgerritMerged openstack/project-config master: Add ansible-role-refstack-client under x namespace
*** mordred has quit IRC19:16
clarkbdown to 1979 tasks now20:01
clarkbonce I'm happy with the status of replication the next thing I want to look at is manage-projects and the delete project plugin20:01
clarkbbasically I'll check that I can create a new project and then delete it20:01
clarkbin quickly checking jeepyb for basic validation of ^ I've noticed that while manage-projects doesn't use the db a few other jeepyb gerrit integrations do (update spec, update bug, and welcome message)20:02
clarkbany opinions on whether or not we should try and fix those, disable them proactively, or just let them eventually fail?20:04
fungiall things which could become zuul jobs in opendev/base-jobs or openstack/project-config20:04
fungithey need creds to either authenticate to lp or gerrit20:04
fungibut i think they can be sufficiently generalized20:04
clarkbya also I think it's harmless to leave them in place for now. When we migrate to notedb they'll just look at stale content in the db then we'll drop reviewdb and they'll error setting up a connection to reviewdb20:04
clarkbbut we can also remove them from hooks since we know they will stop working20:05
clarkbI expect that manage-projects will work because it is tested in gerritlib and jeepyb against a more up to date gerrit20:06
clarkbbut I want to double check it on review-test too20:06
clarkbgives me an excuse to test the delete-project plugin as well :)20:06
clarkbbut trying not to get too far ahead of myself as I'm about to enter the ansiblefest then summit then ptg period of time where I'll have less time for this20:09
openstackgerritsebastian marcet proposed opendev/system-config master: OpenstackID v3.0.15
ianwclarkb/fungi: you ok with ethercalc being the test-case for borg backups with ?20:58
clarkb+2 yes20:59
openstackgerritClark Boylan proposed opendev/system-config master: Remove reviewdb config from Gerrit
clarkbhopefully that change will pass CI testing now21:10
openstackgerritMerged opendev/system-config master: OpenstackID v3.0.15
fungiianw: sounds great, thanks!21:33
openstackgerritMerged opendev/system-config master: Add gerrit static files that were lost in ansiblification
openstackgerritMerged opendev/system-config master: Stop replicating to local git mirror on gerrit
clarkbwhen we do project renames in gerrit we rely on online reindexing right?22:20
clarkbI've realized that we should test a project rename too which should be mv the git repo, delete caches, and reindex but wondering if that needs to be offline22:21
clarkb(I'll test this between project create and delete)22:21
clarkbalso I bet we'll orphan accountPatchReviewDb data that way but it's probably fine22:22
*** slaweq has quit IRC22:22
openstackgerritMerged opendev/system-config master: borg backup : add ethercalc02
fungiclarkb: online, yes22:53
clarkbdown to 1400 tasks on the replication, this will likely run overnight22:56
ianwi've rebooted nb03 that was off ... :/23:07
