Thursday, 2024-02-29

*** tkajinam is now known as Guest134901:24
*** dhill is now known as Guest136805:03
opendevreviewSumanth Kumar Batchu proposed opendev/gerritlib master: Added a comment  https://review.opendev.org/c/opendev/gerritlib/+/91056806:59
noonedeadpunkclarkb: from our gut feeling, what indeed takes most of the time - connection and forking. And I'm fully clueless about what can be done there, except optimize SSH settings, like using lightweight ciphers, disabling GSSAPI/Kerberos, tuning persistent connections, and then disabling things like dynamic motd which is part of the default ubuntu setup08:12
noonedeadpunkAnd another thing we were focusing on - trying to reduce the amount of variables for the runtime. Like disabling INJECT_FACTS_AS_VARS gives quite a significant performance improvement, but it's largely negated by external dependency roles.08:14
noonedeadpunkBut the task engine is indeed something we didn't touch....08:14
noonedeadpunkI pretty much hoped that their aggressive move/deprecation of python versions was exactly to drop legacy code and improve performance of forking08:15
noonedeadpunkand then indeed we came to a point of discussing replacing whole roles that we call multiple times (like for creating a systemd.service) with plain ansible modules. This would improve runtime speed dramatically, but basically it's a way to Jinja Charms...08:18
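A rough sketch of the variable-reduction side of that tuning, expressed via Ansible's environment-variable config; the exact values are illustrative assumptions, not OSA's actual settings:

    # keep gathered facts namespaced under ansible_facts instead of top-level vars
    export ANSIBLE_INJECT_FACTS_AS_VARS=False
    # on managed Ubuntu hosts, dynamic motd generation can be switched off to shave per-login time
    sudo chmod -x /etc/update-motd.d/*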
*** elodilles_pto is now known as elodilles08:29
opendevreviewArtem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project  https://review.opendev.org/c/openstack/project-config/+/91058008:32
opendevreviewArtem Goncharov proposed openstack/project-config master: Add OpenAPI related repos to OpenStackSDK project  https://review.opendev.org/c/openstack/project-config/+/91058008:33
opendevreviewLukas Kranz proposed zuul/zuul-jobs master: Make prepare-workspace-git fail faster.  https://review.opendev.org/c/zuul/zuul-jobs/+/91058208:49
*** tosky_ is now known as tosky11:48
funginoonedeadpunk: is pipelining an option, keeping a persistent ssh connection open for the duration of the play and reusing it rather than reconnecting for each task? or does it already do some of that?13:44
noonedeadpunkYup, we do have pipelining enabled - using `-C -o ControlMaster=auto -o ControlPersist=300` for ssh arguments13:47
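Those settings roughly translate to the following, shown here as environment variables rather than ansible.cfg entries; tacking on the GSSAPI option mentioned above is an illustrative extension, not necessarily what OSA ships:

    export ANSIBLE_PIPELINING=True
    export ANSIBLE_SSH_ARGS="-C -o ControlMaster=auto -o ControlPersist=300 -o GSSAPIAuthentication=no"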
*** ralonsoh_ is now known as ralonsoh13:59
fungiso it's probably more fork initialization and cpython interpreter startup time i suppose14:03
Clark[m]fungi: noonedeadpunk: yes I think it is python overhead due to inefficient process management. And I think it occurs both on the controller (no preforking of the -f threads) and on the remote side (starting a new python for each task)15:34
Clark[m]But I haven't looked in the code in a long time. There was a huge performance regression after a refactor of this stuff though15:35
Clark[m]Way back when I mean. And it hasn't really improved since15:35
fungijust add more ram. it's the solution to every performance regression15:36
noonedeadpunk+ overhead for copying the module I assume15:45
noonedeadpunkbut yes, I agree. and unfortunately, the way of dealing with it is slightly beyond my scope/time constraints15:47
opendevreviewBrian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group  https://review.opendev.org/c/openstack/project-config/+/91064115:49
clarkbthe debian mirror update is currently running (and holding the lock). I'll try to grab it as soon as it finishes then I'll approve the change for debian mirror cleanup16:12
fungithanks!16:16
fricklerfungi: clarkb: could one of you have a look at https://review.opendev.org/c/openstack/project-config/+/904837 please?16:29
clarkbfungi probably has more context on that than I do but I can take a look too16:30
fricklerI was actually thinking the same, but then I didn't want to make you feel excluded ;)16:32
clarkbI have the debian reprepro lock. I'm approving https://review.opendev.org/c/opendev/system-config/+/910032 now16:33
clarkboh double checking the change looks like i need locks for all the debian things16:33
clarkbdebian, debian-security, and debian-ceph-octopus16:34
clarkbgrabbing the other two before I approve16:34
clarkbthose were not held so I got them immediately16:36
fungithat change's mix of grep and bash pattern manipulation is mind-bending. i'm not confident i know, for example, why the script uses more toothpicks for grep ${NEW_BRANCH/\//-}-eol than for ${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol16:45
clarkbfungi: I copy pasted into bash locally and the outputs looked good fwiw16:45
clarkbbut I think it is because it is doing a regex replace vs a dumb replace16:45
clarkband bash is heavy on the syntax16:46
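For reference, a quick illustration of the two expansions being discussed (both are glob-style replacements; the second needs extglob, and the NEW_BRANCH value here is a made-up example):

    shopt -s extglob
    NEW_BRANCH=unmaintained/2023.1
    echo "${NEW_BRANCH/\//-}-eol"                          # first / becomes -: unmaintained-2023.1-eol
    echo "${NEW_BRANCH//@(stable\/|unmaintained\/)}-eol"   # prefix stripped: 2023.1-eol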
opendevreviewMerged openstack/project-config master: Adapt make_branch script to new 'unmaintained/<series>' branch  https://review.opendev.org/c/openstack/project-config/+/90483716:54
opendevreviewMerged opendev/mqtt_statsd master: Revert "Retire this repo"  https://review.opendev.org/c/opendev/mqtt_statsd/+/90514216:55
opendevreviewMerged opendev/system-config master: Remove debian buster package mirrors  https://review.opendev.org/c/opendev/system-config/+/91003217:18
clarkbthe updated reprepro configs appear to have been applied. I'm running the cleanup for ceph octopus first17:52
clarkbI completed the steps documented in https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components but https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/dists/buster/ still exists.17:55
clarkbBefore I drop the lockfile I think I should manually delete the dists/buster dir in that repo?17:55
clarkbfungi: ^17:55
clarkbthen I can rerun the vos release and drop the lock17:55
clarkbthere are also a few files to remove from https://mirror.bhs1.ovh.opendev.org/ceph-deb-octopus/lists/17:56
clarkbif that looks correct to you I'll do that after eating some breakfast, then I'll write a docs update change and then finally continue on with debian and debian-security17:56
clarkbactually I can delete the files then rerun a regular sync. If they come back then I know they should stick around. If they don't then it is proper cleanup. I'll proceed with that plan18:08
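That plan, as a rough shell sketch; the AFS read-write path is an assumption for illustration, and the stale lists/ files would be removed the same way:

    # delete the leftover dist from the read-write volume
    rm -rf /afs/.openstack.org/mirror/ceph-deb-octopus/dists/buster
    # then rerun the normal reprepro sync; if nothing reappears, the deletion was correct cleanup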
fungiclarkb: yeah, maybe delete the dir *and* rerun the mirror script again to make sure it doesn't recreate anything18:08
fungiright, what you also just said18:08
clarkbI did a vos release after the cleanup just to see it reflected on the mirrors. Will run a regular reprepro sync now18:16
clarkbok rerun didn't readd the files so I think this is correct to do18:19
clarkbI'll write up a quick docs change and then proceed with debian-security and debian18:19
fungiperfect18:23
opendevreviewClark Boylan proposed opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup  https://review.opendev.org/c/opendev/system-config/+/91065518:33
clarkbsomething like that18:33
clarkbnow doing debian security.18:34
clarkband now finally debian proper18:45
clarkbI note that stretch still has those extra files hanging around. I know we also have ubuntu ports (and probably ubuntu releases?) with similar issues. I think I'm going to focus on cleaning up buster then we can do a broader cleanup of these extra files for other releases later18:47
clarkbtonyb may want to dig into that after setting up openafs creds? It has a lot of interactions with interesting afs things18:48
clarkbthere are a lot of files to clean up in buster proper :)18:55
clarkbonly into the r packages 18:56
clarkbError in vos release command.19:08
clarkbVolume needs to be salvaged19:08
clarkbthat is unfortunate and annoying19:08
clarkbI'm not sure I understand what this means yet either19:10
clarkbok afs01.dfw.openstack.org has ext4 errors according to dmesg. I suspect this is the cause19:11
clarkbI may need help here. I think xvdb disappeared on us19:13
clarkbinfra-root ^ fyi19:13
fungiargh!19:14
clarkbafs02 seems fine19:14
clarkbso do we reboot or just try to remount so it isn't remounted ro?19:15
fungi[Thu Feb 29 18:48:13 2024] INFO: task jbd2/dm-0-8:541 blocked for more than 120 seconds.19:15
fungithat looks like the start of the incident19:15
clarkbI think the order of operations here is going to be: make afs01 happy with xvdb again, and then we have to make afs happy19:15
fungiprobably first we check for rackspace tickets about hardware drama19:16
clarkbfungi: I didn't see any emails at least19:16
fungik19:16
clarkbbut feel free to double check19:16
clarkbI'm wondering if we need to grab locks for everything so this problem doesn't spread?19:17
clarkbbut it may be too late for that?19:17
clarkband yes argh19:17
fungiwe'll probably want to make sure that all active volumes are being served from one of the other two fileservers19:17
clarkbfungi: you mean move RW to afs02 if it is on afs01?19:18
clarkbthe RO stuff should already be on both?19:18
fungiyeah19:18
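For reference, moving a read-write volume off the affected fileserver would look roughly like this; the volume name is illustrative and this step ultimately wasn't needed here:

    vos move -id mirror.debian -fromserver afs01.dfw.openstack.org -frompartition vicepa \
             -toserver afs02.dfw.openstack.org -topartition vicepa -localauth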
clarkbthats a lot of volumes... I can start by holding mirror-update locks I guess19:19
clarkbthen we move things then we cry?19:19
clarkb:)19:19
fungilooks like we have this too: https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver19:20
fungidoesn't seem to recommend failing over volumes like in https://docs.opendev.org/opendev/system-config/latest/afs.html#afs0x-openstack-org19:21
fungibut it does say we should pause writes19:21
clarkbI've grabbed all the lockfiles on mirror update19:25
clarkbthis doesn't deal with docs/tarballs/etc19:25
clarkblooking for that now19:25
fungigrabbing a cup of tea real quick since the rest of our afternoons just flashed before my eyes19:25
clarkbI can never find where we run that crontab19:26
clarkbbut I'm trying to find it now19:26
fungiclarkb: it's on mirror-update19:26
fungi*/5 * * * * /opt/afs-release/release-volumes.py -d >> /var/log/afs-release/afs-release.log 2>&119:27
fungiin root's crontab19:27
fungithat's what does tarballs, docs, etc. i think if we just comment out all the cronjobs and put mirror-update in emergency disable we'll be fine19:27
clarkbya and in that script the lockfile is /var/run/release-volumes.lock19:27
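One way to hold such a lock by hand while working, using flock from util-linux (the lock is held until the spawned shell exits):

    sudo flock /var/run/release-volumes.lock bash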
clarkbfungi: ok I grabbed all the locks. The publish logs thing doesn't appear to do a lockfile19:29
clarkbwe need to put the server in the emergency file before editing the crontab19:29
fungiyou're just holding all the locks instead of commenting out the cronjobs. that works and i guess keeps us from needing to worry about putting mirror-update in the emergency file19:29
fungior we can do both if you like19:30
clarkbI'm doing both because of the logging publishing19:30
clarkbok we should be idled now19:31
fungiThis message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main01, '1c4c3a46-2571-4442-a85a-5603ff91a68d' at 2024-02-29T19:16:47.073638. We are currently investigating the issue and will update you as soon as we have additional information regarding the alert. Please do not access or modify19:32
fungi'1c4c3a46-2571-4442-a85a-5603ff91a68d' during this process.19:32
fungiPlease reference this incident ID if you need to contact support: CBSHD-5cf24b4319:32
fungifound that in the rackspace dashboard just now when i logged in19:32
fungithat's from Thursday, February 29, 2024 at 7:16 PM UTC19:32
fungiso about 15 minutes ago19:32
fungiwe should probably refrain from rebooting the server until they've given the all-clear signal?19:33
clarkb++19:33
clarkbreading our docs and upstream salvage docs I'm not fully sure I understand the implications of this situation19:33
clarkblike will we replace the content on afs01 with afs02 content automagically? Or maybe it juist comes up and is happy?19:34
fungiopenafs should try to auto-salvage the volumes i think19:34
clarkbgotcha19:34
fungiif it can't for some reason, then we take additional steps to replace/recover the volume19:34
fungiseparately, we'll likely have hung/lost transactions from vos release commands that occurred during the incident19:35
fungiand we'll need to cancel them manually19:35
clarkbfungi: ya I think my concern is if we do a vos release will it potentially overwrite good RO content on afs02 with bad content from afs01. But sounds like it should do consistency checks and in theory we'd do a manual promotion to afs02 then potentially redo my debian cleanup19:35
clarkbfungi: how do we clean those up?19:36
clarkbthough maybe we should avoid cleaning those up until afs01 is happy?19:36
clarkbotherwise we may create io load we don't want19:36
fungithe tasks? i have to refresh my memory on the exact command, but i would wait until after we get the underlying filesystem fixed19:36
clarkb++19:37
clarkbthis doesn't give me a lot of confidence that centos 7 and xenial cleanups will go smoothly...19:37
clarkbthough it could be coincidence19:37
fungilooks like in the past we've used `vos status -server afs01.dfw.openstack.org -localauth` to check the transactions, then `vos endtrans -v -localauth afs01.dfw.openstack.org <transaction_id>` to clean them up, and maybe manually unlocked the volumes with `vos unlock -localauth some.volume` if necessary19:40
clarkbI've brought up if we should ask openstack release to idle too (but docs won't be idled by that alone)19:41
clarkbcorvus: you too may be interested in this as it may impact zuul stuff19:41
fungior perhaps `vos endtrans -server localhost -transaction <transaction_id> -localauth -verbose` looking at shell history19:41
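Pulling those together, the stuck-transaction cleanup from past shell history looks roughly like this (the transaction id and volume name are placeholders):

    vos status -server afs01.dfw.openstack.org -localauth                        # list in-flight transactions
    vos endtrans -server localhost -transaction <transaction_id> -localauth -verbose
    vos unlock -localauth some.volume                                             # only if the volume stays locked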
clarkbya there are post failures in tarball jobs19:44
clarkbfungi: an email from rax says "pending customer" is that an indication that maybe they want us to check if it is happy again?19:45
clarkbtalking out loud here but I wonder if we should disable afs services on afs01 before rebooting it (once we get to that point) that way we can enable services manually and check them post boot?19:47
clarkbwould allow us to fsck too I think19:47
clarkbfungi: if you are still logged into the dashboard can you check the ticket status?19:48
fungii'm refreshing now19:50
fungiThis message is to inform you that your Cloud Block Storage device,afs01.dfw.opendev.org/main01, 1c4c3a46-2571-4442-a85a-5603ff91a68d has been returned to service.19:50
fungiThursday, February 29, 2024 at 7:47 PM UTC19:51
clarkbabout 4 minutes ago19:51
fungiso we should be clear to reboot the server now19:51
clarkbfungi: did we want to disable services first so that we can fsck or whatever first?19:51
clarkbfirst post reboot I mean19:51
clarkbor just follow the doc and see what happens?19:51
fungii would just follow the doc from here19:51
fungii'll pull up the server console though in case it wants to fsck for a while at boot19:52
clarkbok it says "fix any filesystem errors" which I'm not sure we'll fsck to detect, which is why I ask19:52
clarkbfungi: sounds good let me know when I should issue a reboot command on the server19:52
fungi/dev/main/vicepa  /vicepa ext4  errors=remount-ro,barrier=0  0  219:52
fungibut i expect the lvm2 layer shields us here since the volume was put into a non-writeable state as soon as errors cropped up on the underlying pv19:53
clarkbI see19:53
clarkbthe 2 should cause a fsck to happen though right?19:53
fungiyes, once the lvm volume is activated, before its filesystem is mounted19:54
clarkbalright then ready for me to reboot?19:55
clarkbor you can do it when you are ready with the console if you prefer19:55
fungii have the console up now19:55
clarkbI see you have a shell too19:55
fungiit doesn't seem to accept input from my keyboard, but there is a "send ctrlaltdel" button19:55
fungibut yeah, i can reboot it from the shell anyway19:56
clarkbya I think we resort to the other thing if that doesn't work19:56
clarkbsince this should be the most graceful option19:56
fungiprevious uptime 312 days19:56
fungiserver is rebooting19:56
clarkbstill no ping responses from there19:58
clarkbs/there/here/19:58
fungiyeah, console is blank at the moment19:59
fungihaven't seen it start booting yet19:59
fungimay still be in the process of shutting down19:59
clarkbya could be19:59
clarkbsystemd makes sshd go away fast but other things may still be slowly shutting down19:59
fungisure is taking a while though20:01
clarkb~5 or 6 minutes now?20:02
clarkbit just started pinging20:02
fungiyeah, i see a fsck of xvda1 ran20:02
clarkbxvdb is the device that had a sad20:03
clarkbfwiw pvs,lvs,vgs looks ok20:03
clarkbI see afs and bos services running20:03
fungiboot.log says it did fsck /vicepa20:04
fungiso seems that was fine20:04
clarkbI think the next thing to find is the salvager logs?20:04
fungii've got a root screen session on afs01.dfw20:04
* clarkb joins20:04
clarkblooks like it did some salvaging. It isn't clear to me if salvaging is still in progress though20:06
clarkbit also looks like it primarily needed to salvage where I couldn't get things idled fast enough (mirror.logs for example)20:08
fungilooks like it salvaged mirror.logs and project.readonly20:08
fungithe latter is probably due to some docs jobs20:08
fungilooks like it didn't complain about anything else20:09
* clarkb will update docs for recovering a file server to add notes about where to find the salvager logs20:09
clarkbya so next step is looking for stuck volume transactions (lets get the docs for that updated too)20:09
fungithe manpage claims it will be SalvageLog but seems it's actually SalsrvLog20:10
clarkbagreed20:11
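On this deployment the salvager output lands in the fileserver's log directory, so checking it is along the lines of:

    sudo tail -n 50 /var/log/openafs/SalsrvLog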
fungiokay, process of elimination, we need to use the global v4 address like `vos status -server 104.130.138.161 -localauth`20:11
clarkbfungi: we should also check transactions on the other two servers20:12
clarkbjust in case they "own" the transaction that may be stuck20:12
clarkbcool no transactions to worry about20:13
fungino active transactions for any of the three, so i think we lucked out?20:13
clarkbyup, I guess it cancelled my vos release properly20:13
clarkbso now we need to vos release the many many volumes.20:13
fungiwe'll find out when we try to manually vos release each volume, which is our next step20:13
clarkb++20:14
clarkbshould we start with the smaller volumes and get them all out of the way?20:14
clarkbthen we can leave the big ones to sit for a while (and I can grab lunch)20:14
fungiyeah20:15
clarkbI'll make a list in an etherpad20:15
fungijust looking over the `vos listvldb` output now to make sure they look okay20:15
fungido you need a raw volume list for the pad or do you have one handy?20:16
clarkbI have one20:17
clarkbfungi: I notice a few of them look "weird" like docs old and root.cell and root.afs20:17
clarkbI assume we treat them like any other volume though20:17
fungidocs-old we didn't bother making replicas for i think20:17
fungiand the others are internal in some way, but yeah we can still try to vos release them20:18
fungilemme know the pad url when you're ready20:20
clarkbhttps://etherpad.opendev.org/p/cPfI3Q3fsexFvhTXbhkp20:21
clarkbtrying to organize them now. Sorry they ended up with a - suffix that needs to be trimmed20:21
clarkbsome may have had their names truncated too?20:21
corvusi picked a good time to afk!20:21
clarkboh no its just ubuntu-cloud not ubuntu-cloud-archive20:21
clarkbcorvus: heh yes20:21
corvusroot.X are technically just like any other volume, only their name is magic20:21
opendevreviewBrian Rosmaita proposed openstack/project-config master: Add more permissions for 'glance-ptl' group  https://review.opendev.org/c/openstack/project-config/+/91064120:21
clarkbcorvus: got it20:23
fungiwe need to release parent volumes after their children for thoroughness, right?20:23
fungior is that just when adding/removing volumes?20:23
corvusorder shouldn't matter20:25
fungiokay, cool20:25
fungii think it's when adding new volumes that it's bit me20:26
fungiclarkb: should i just start from the top then?20:26
corvus(the mounts are just pointers by name, so in this case, they'll always resolve to something.  if there's no read-only copy of a volume though you can't mount it, so that's probably what you're thinking of)20:27
fungiyeah20:27
clarkbfungi: ya I think we can start at the top20:27
clarkbI've almost got the list organized20:27
fungiwill do it in that root screen session20:27
fungivos release project -localauth -verbose20:28
fungithat command look right?20:28
corvuslgtm20:29
fungii guess i can stick the volume name on the end to make subsequent ones easier20:29
fungiReleased volume project successfully20:29
fungii'll proceed down the list and raise concerns if i see any error20:30
clarkbalso I count 58 in the etherpad (41, 14, 3) which seemed to match the count a previous listvldb had20:32
clarkbjust as a sanity check I didn't lose any in the shuffle20:32
fungiagreed20:32
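A sketch of working through such a list non-interactively, with volumes.txt standing in for the etherpad contents:

    while read -r vol; do
        vos release "$vol" -localauth -verbose || echo "FAILED: $vol"
    done < volumes.txt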
fungiaha, no point in doing a vos release of non-replicated volumes20:35
fungiso `vos release service` errors accordingly20:35
clarkbyup says it is meaningless20:36
fungiVolume 536870921 has no replicas - release operation is meaningless!20:36
clarkbfungi: when you get a chance can you fetch the transaction listing command out of your command history?20:36
clarkbI'm writing a docs update to capture this20:36
fungipresumably we just ignore that condition unless it happens with a volume we know should have replicas20:36
fungiclarkb: vos status -server 23.253.73.143 -localauth20:37
clarkbthanks20:37
fungiet cetera, one for each server20:37
fungiif memory serves, the raw address is required when running on the server due to quirks of name resolution?20:38
fungiif you run it from elsewhere i think you can get by with the normal dns names20:38
clarkbinteresting. I wrote it down using ip addrs20:38
clarkbnot a big deal20:38
fungiroot@afs01:~# vos release -localauth -verbose starlingx.io20:39
fungiRW Volume is not found in VLDB entry for volume 53687105620:39
clarkbfungi: I'm guessing that starlingx volume was unused due to project.starlingx existing?20:39
fungimaybe we meant to delete that?20:39
clarkband ya it only has two RO volumes20:39
clarkbfungi: probably20:39
clarkbwe should make sure we know what content is in it first though20:39
clarkbtest.corvus is also RO only20:40
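One way to double-check those RO-only volumes before deciding anything, assuming the standard .readonly naming for the replica:

    vos listvldb -name starlingx.io -localauth      # confirm it really has no RW site
    vos examine starlingx.io.readonly -localauth    # size, last update, replica locations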
fungiyeah, i'm bolding ones that don't release for some reason, and striking through those which do20:41
fungiwe can revisit later20:42
opendevreviewClark Boylan proposed opendev/system-config master: Add more info to afs fileserver recovery docs  https://review.opendev.org/c/opendev/system-config/+/91066220:43
clarkbanyone have a sense for whether or not I should rerun my debian mirror reprepro commands and vos release?20:43
clarkbThat was the volume I was working with when things went south and I'm half worried that we may have incomplete cleanup there if I don't start over?20:44
fungican't hurt, but probably wait until we've released the other volumes20:44
clarkbfor sure20:44
fungijust because they can fight for bandwidth20:45
clarkbya I want everything to be as happy as possible before trying this again :)20:45
clarkbalso "Volume needs to be salvaged" is what vos release reported previously, which makes me wonder if we will need to take some extra intervention against mirror.debian despite the salvager not complaining about it20:49
fungigetting ready to start on the "slow" volumes now20:52
clarkbI'm waiting for them to start taking longer than a few seconds then I can go eat lunch :) do we want to start a few in parallel or is that too risky we think?20:52
fungimirror.logs has no replicas20:52
fungishall i save mirror.debian for last?20:55
clarkbfungi: ++20:55
clarkbthese others are going quickly but that one may need to sit and churn a bit20:55
fungimy thoughts exactly20:56
clarkblol fedora is a complete release20:56
clarkbour optimizing by skipping debian may not have saved much time20:56
clarkbI guess now the question becomes do we want to run a few in parallel?20:56
clarkbI think I'm ok with that but am happy to be cautious if we prefer20:57
fungiwonder if we should have deleted this volume some time ago20:57
clarkboh right it should be mostly empty?20:57
clarkbya only 9kb according to grafana maybe it won't be too slow to do a complete release?20:57
fungimirror.epel was "a complete release" too and only took a few seconds20:58
fungiVolume needs to be salvaged20:58
fungimaybe we should flag this one and come back to it20:58
clarkbinteresting. both afs01 and afs02 in dfw don't show disk issues currently20:58
clarkbso ya I don't think this is a recurrence of our previous issue. ++ to skipping20:58
fungientirely possible this one was broken before20:59
fungimirror.openeuler is taking a while, wonder if it had a bunch of unreleased updates21:03
clarkboh it may be because it is at quota21:03
clarkbits possible we're releasing an inconsistent mirror state there21:03
clarkbwhereas previously we wouldn't release because it would error in the rsync21:03
clarkbI'm not going to worry about that too much though21:04
clarkbreading bos salvage docs I think it will clean up the RO volume(s) if they are discovered to be corrupted21:05
clarkbin the case of fedora we may want to just clean the whole thing up as you say.21:05
clarkbwe may need to update our mirror configs for that though so maybe just creating an empty RO volume is easiest for now (if that is the case)21:05
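Had a volume stayed broken, a targeted salvage can also be requested explicitly; the volume name here is just an example:

    bos salvage -server afs01.dfw.openstack.org -partition vicepa -volume mirror.fedora -localauth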
clarkbI'm going to eat something while we wait for openeuler21:05
fungiwell, it finished, but you don't need to stick around and watch if there's food with your name on it21:06
Clark[m]I have a sandwich21:08
fungimirror.ubuntu-ports needs to be salvaged21:09
clarkbubuntu ports is also at quota21:12
fungimirror.debian is all that's left other than the ones that reported they needed to be salvaged21:12
fungishould i proceed?21:12
clarkbyes I think if it is all that remains we should go ahead21:12
clarkbthen we'll figure out the salvaged needed volumes21:12
fungik21:12
clarkbfwiw still nothing in dmesg indicating disk trouble currently causing the need for salvaging21:13
clarkbI wish it would indicate what is wrong requiring it to be salvaged21:14
clarkbok debian also needs to be salvaged21:14
clarkbfungi: I think we should start with fedora since it is unused21:14
funginot entirely surprised21:15
clarkbfungi: oh look in /var/log/openafs21:15
clarkbthere is a new salvage log and it appears to have auto salvaged debian after deciding it needed to be salvaged21:15
clarkboh it's gone but the other file was updated21:16
clarkbI think it did fedora too. Let's list the vldb entries for them to see if we have all the expected RW and RO replicas then maybe try rerunning vos release?21:17
clarkbya vos listvldb shows the three entries for all three volumes that reported needing salvaging21:18
fungiReleased volume mirror.fedora successfully21:18
clarkbya so I guess it autodetects the need for salvaging but then bails on the release21:18
clarkbbut I think all three are worth trying to release since they have the replicas we expect21:18
fungiopeneuler and ubuntu-ports might need a quota bump first?21:19
clarkbfungi: openeuler completed right?21:19
fungier, which was the other one that needed a quota increase?21:20
clarkbbut ya they both need quota bumps generally. I think we should make sure they release manually then bump them then we can remove cron locks and let cron jobs update them normally21:20
clarkbfungi: those two need quota bumps but only one needed salvaging and that one was ubuntu ports21:20
fungiah okay21:20
fungiReleased volume mirror.debian successfully21:21
clarkbassuming ubuntu-ports releases properly I think the next step is for me to manually finish the debian mirror cleanup. While I do that you can bump quotas on ubuntu-ports and openeuler (because that needs locks held) then with that all done we can reenable crons on mirror-update and remove the node from emergency21:23
fungisounds good21:23
clarkband by manually finish I actually mean manually start all the steps over again just to be sure everything applied properly21:24
fungiplanning to raise openeuler from 300gb to 350gb and ubuntu-ports from 550gb to 600gb... that work?21:27
clarkb++21:28
clarkbthe debian and opensuse cleanup is on the order of 350GB I think? maybe 300GB so should be plenty of headroom21:28
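For reference, the bump itself is roughly a one-liner per volume; quota is given in kilobytes, so the figures below approximate the sizes discussed:

    vos setfields -id mirror.openeuler -maxquota 367001600 -localauth      # ~350 GB
    vos setfields -id mirror.ubuntu-ports -maxquota 629145600 -localauth   # ~600 GB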
clarkbubuntu ports released successfully21:28
clarkbI'll proceed with redoing the debian buster cleanup21:28
fungiyep, i'll proceed with the aforementioned quota bumps and re-release the two affected volumes21:29
clarkbthe two reprepro commands appear to have nooped but I'm redoing a vos release for debian for good measure21:30
fungiokay, volume increases and rereleases are done21:30
clarkbthen I'll do the manual file deletions and run reprepro-mirror-update21:30
fungiawesome, and then we can release locks/take stuff out of the emergency list and i can start cooking dinner21:31
clarkbreprepro is running against debian for a sanity check sync now and that will include a vos release then ya I can start undoing the lock holds and so on21:34
fungii'll go ahead and close out the rackspace ticket21:36
* clarkb looks at the day's todo list and has a sad21:37
fungimy todo list just said "fight fires"21:37
fungi(i wish)21:37
clarkbif you get bored you can review my doc updates :)21:38
clarkbthough probably best to approve those after we remove the held locks and reenable cron jobs21:38
fungiand after i cook/eat dinner21:38
clarkb++21:39
clarkbif anything makes this better it's a cold rainy windy miserable day here21:39
clarkbso I have nothing better to do than sit at my desk21:39
clarkbcheckpool fast is running21:39
clarkband now the last vos release is running21:41
clarkbok thats done. I'm going to drop all my held locks now then uncomment the cronjobs21:42
clarkbfungi: would be good if you can double check the crontab when this is done21:42
fungican do, though also so will ansible21:43
clarkbfungi: thats done21:45
clarkbI have removed mirror-update02 from the emergency file21:46
fungiroot crontab on mirror-update lgtm21:46
fungiand i let #openstack-release know we're done21:46
clarkbI disconnected from the root screen if you want to drop it21:47
clarkband thank you for checking21:47
clarkbIn order to help me not forget we need to investigate cleanup of RO only volumes, investigate removal of the fedora volume, and do manual dists/ and lists/ cleanup for old distro releases in debuntu reprepro mirrors21:48
clarkbbut all of that can wait until after we've had a break and dinner :)21:49
fungiscreen session gone21:52
clarkbwe should also maybe make note of the volumes that have problems. We may want to remove them21:52
clarkbxvdb on afs01.dfw.openstack.org in this case if future me looks in logs21:52
clarkbfungi: thank you for helping me get through that21:55
fungithank you for the same!21:57
clarkbthe afsmon crontab should fire in a couple of minutes and that should update our grafana graphs21:58
clarkboh nope I read it as every half hour but it runs on the half hour22:01
clarkbso 29 minutes away now22:01
clarkbgraphs updated and show the changes we made (deletions and quota bumps both)22:37
ianw(i'm just following along being glad that you're handling it, but if it's getting late and you want me to watch anything/kick off anything etc. i'm around)22:40
clarkbianw: thanks! I think everything appears to be back to normal now22:47
clarkbianw: maybe you want to review the changes I wrote related to this? otherwise I think we're good22:47
clarkbhttps://review.opendev.org/c/opendev/system-config/+/910662 and parent22:48
ianwall lgtm!22:53
clarkbI think we did end up clearing about 400GB total between debian and opensuse22:53
clarkbcentos 7 and xenial should make a big dent too22:53
opendevreviewMerged opendev/system-config master: Update reprepro cleanup docs to cover dists/ and lists/ cleanup  https://review.opendev.org/c/opendev/system-config/+/91065523:15
