Friday, 2023-11-17

fungiif tonyb doesn't beat me to it, i'll send the one-hour status notice at 14:30 utc, add the indicated hosts to the disable list and pre-create the root screen session on review02 per https://etherpad.opendev.org/p/gerrit-upgrade-3.813:08
fungii'm headed out now but have all that prepped so i can do it from my phone if i'm not home by then13:08
Clark[m]Thanks! I'm working on being awake now. If possible a larger than 80x24 screen window would be good :)13:44
tonyblies!  all terminals are 80x24 ;P13:45
tonybdisabled list updated13:45
tonybAnd the correct nodes are in groups["disabled"]13:48
fungithanks!13:52
fungiturns out the cell signal in this parking lot is nearly nonexistent13:53
tonybeeek13:54
fungimy phone claims to be doing cellular data with a single bar of 2g signal13:55
fungioccasionally switches to 4g for a minute and then drops back to 2g again13:56
tonybscreen session created in a 100x50 terminal13:57
tonybThat's super frustrating.13:57
fungiattached13:57
tonybHopefully the geometry works out okay.13:59
tonybIs there some way to verify I can use statusbot ahead of time?14:00
fungiyeah, american cell providers have gotten really terrible about dropping roaming agreements, which makes it extra bad if you live in a remote location with spotty coverage14:00
fungitonyb: i suppose you could #status log something, but might as well just try to do the notice and see what happens14:01
fungiwe're 90 minutes from start anyway14:02
tonyb#status notice Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/14:05
opendevstatustonyb: sending notice14:05
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/14:05
opendevstatustonyb: finished sending notice14:08
tonybOkay Looks like we're good for clarkb to carry-on from https://etherpad.opendev.org/p/gerrit-upgrade-3.8#L10914:09
fungiawesome14:10
tonybI verified that the screen logging is working ... mostly to check I did it right ;P14:12
* corvus yawns14:15
clarkbI've realized that we run gerrit init with the mysql db stopped. In this upgrade that isn't a problem because there are no schema changes, but for avoiding problems in the future I'm going to add a command to step 11 to start the db before running init14:27
clarkband once I've got tea I'll load ssh keys and hop into screen and get ready for the rest of the fun14:28
fungigood call14:28
*** blarnath is now known as d34dh0r5314:38
clarkbI confirm that screen logging appears to be working14:41
clarkband am attached to the screen14:41
clarkband emergency file looks good. Thank you for taking care of that14:42
fungihome again with time to spare14:50
fungiyeah, i checked the emergency list from the car and it looked right14:50
tonybNice.14:51
fungialso notified the openstack release team during their meeting14:51
tonybI really like the ansible that corvus shared.14:51
corvusme too!  hope it's accurate!  :)14:52
tonybfungi: awesome 14:52
corvushttps://etherpad.opendev.org/p/Av9otg2ML-52q2Nxiyi9 is my plan for zuul15:01
clarkbcorvus: do you need an inline gzip on the mysql dump to keep file sizes down? (not sure how big that will be and if zuul01's disk is large enough)15:03
corvusclarkb: good point; as of 1 month ago it was 18g uncompressed.  i'll probably dump it into /opt uncompressed which has plenty of space then compress it later15:04
clarkbsounds good15:05
corvusa month ago it was 2.6G compressed15:06
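(Editorial aside: a minimal Python sketch of the "dump uncompressed, compress later" approach corvus describes above. The function name and file layout are hypothetical and do not come from the Zuul plan etherpad. It only shows streaming an existing dump file through gzip in fixed-size chunks, so an 18G dump never has to fit in memory.)

```python
import gzip
import shutil
from pathlib import Path


def compress_dump(src: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Gzip src into a sibling file named src + '.gz' and return that path.

    The source file is left in place; removing it is up to the caller.
    """
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks rather than reading the whole dump at once.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Keeping the uncompressed file until the compressed copy has been verified costs extra disk for a while, which fits the plan of writing into /opt where space is plentiful.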
corvusi'm running the docker-compose pulls now15:07
corvuspulls complete15:17
corvushrm, we don't have a root mysql user for zuul do we?15:23
clarkbheh now both backups servers are at or above 90%15:23
corvus(it's not important, but the zuul user lacks some privileges to inspect what the innodb engine is doing, which can be useful for monitoring progress)15:23
clarkbthat shouldn't affect our backups for the gerrit upgrade but we'll want to address that soon15:23
corvus(yet another reason to just run our own db server)15:24
tonybOh yeah.  I was going to prune some older backups.15:24
clarkb5 minutes until we start15:25
tonyb++15:25
clarkband so its clear I plan to "drive"15:27
clarkbI'm awake and here so may as well :)15:27
fungigreat. i'm standing by to help test and troubleshoot15:28
* tonyb is watching from the cheap seats15:28
tonyband is happy to help as directed15:28
clarkbyup I'll be sure to mention things in here if I need help15:29
clarkbMy clock has ticked over to 1530. I'm proceeding now15:30
corvusstopping zuul15:30
clarkbgiving mariadb a few seconds to start up before I proceed with backups which talk to it15:31
clarkbfs backups are complete and exited 015:32
clarkbdb backups are in progress15:32
clarkbdb backups also report rc 0 so I'm proceeding15:33
clarkbI'm up to the point where I pull images. The edits to the docker-compose.yaml file lgtm so I am proceeding15:36
clarkbThe hash under RepoDigests near the top of the screen window seems to match the one I've got in the etherpad (and I checked that version was up to date yesterday)15:38
clarkbfungi: tonyb: next step is the actual upgrade. Any reason to not proceed?15:38
tonybclarkb: Not that I can see.15:38
funginone i'm aware of15:38
clarkbok proceeding15:38
clarkb[2023-11-17T15:41:00.681Z] [main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.8.2-53-gfbcfb3e1e5-dirty ready15:41
clarkbI'm going to check reindexing maybe yall can look at web stuff?15:41
fungiyep, looking15:42
clarkbI see reindexing is in progress according to `gerrit show-queue`15:42
fungiPowered by Gerrit Code Review (3.8.2-53-gfbcfb3e1e5-dirty) 15:42
clarkbweb is up for me and I see the expected version15:42
fungipoking around the webui and i don't see anything out of the ordinary yet15:43
tonybUI looks good, logout login seem to work15:43
clarkbthanks. The config diff has the one email template that updated as expected so thats good15:44
clarkbpushing a change/patchset and then checking it replicates is probably the last remaining functionality thing we can do until zuul is with us again?15:44
clarkbmaybe one of you can do that?15:44
fungiyeah, i have a dnm change i think15:45
clarkbtask queue has dropped by ~600 items since I first checked15:45
corvuszuul db backup is complete and migration has started15:45
SvenKieske+1 reviews work as well :)15:45
opendevreviewJeremy Stanley proposed opendev/bindep master: DNM: Test bindep with PYTHONWARNINGS=error  https://review.opendev.org/c/opendev/bindep/+/81867215:46
fungithat's commit 2a9a8b0751f27e97820c2444aa2df6df6fb8f54b15:46
clarkbhttps://opendev.org/opendev/bindep/commit/2a9a8b0751f27e97820c2444aa2df6df6fb8f54b I see it15:47
clarkbpush and replication seem good. I think the next major item then is waiting for reindexing to complete and then verifying zuul interactions once zuul is back15:47
tonybYou guys are so quick :)15:47
fungiyeah, https://opendev.org/opendev/bindep/commit/2a9a8b07 comes up for me too15:47
fungitonyb: it's not the first time we've done this ;)15:47
clarkbtonyb: we are practiced :)15:47
tonyb:)15:48
clarkbgerrit's error log doesn't show anything unexpected. I see the expected exception from the plugin manager and reindexing updates in the log are at 33% complete15:49
clarkbI do note that at least one user has invalid project watches. I believe this is a known thing and not new15:49
corvusis it me?15:50
clarkbcorvus: no15:50
clarkbits a relatively new account which surprises me.15:50
corvuswe're at the "copy the build table" portion of the migration; i want to say that's like 8 minutes...15:51
corvusno, 11 minutes locally according to my notes15:52
clarkbjust over halfway done on the reindex according to the log file15:53
fungiwe'll probably also want a status notice at the end to remind people some changes may need to be rechecked? maybe something like...15:55
fungistatus notice The Gerrit upgrade is now complete; be aware we had Zuul offline in parallel for a lengthy schema migration, so any events occurring prior to XX:XX UTC may need a recheck to trigger jobs.15:55
clarkb++15:56
clarkbside note: I think we upgraded to 3.7 about a week before the 3.8 release. This upgrade is about a week before the 3.9 release. Its cool to see we're keeping up. Also they are crazy for releasing over thanksgiving15:58
tonybThe release should be fine, it's the consumers of the release that may disagree ;P16:01
corvusoh cool, i didn't quite catch the end, but i'm pretty sure the table copy took no longer than my local run, meaning my time/performance estimate should be pretty close to prod16:03
clarkbcorvus: nice16:04
corvuswe are 2 steps away from the point of no return on the db migration16:04
clarkbreindexing completed and gerrit reports it is using the new gerrit index version. There were three errors against two changes being reindexed. Both changes have id's <20k16:04
* fungi grabs holds onto his seat16:04
clarkbI think we've seen that before and these are problems with old changes that we've basically accepted because what are you going to do16:05
fungiyes, we have a handful of "bad" changes that can't be indexed16:05
clarkb(if we want those to go away we could possibly try deleting the changes)16:05
fungiall very old16:05
clarkbshow queue output looks good too. I'm marking that step done now16:05
fungii can't remember now, but vaguely recall they're unreachable in the ui too16:05
clarkbya16:06
corvuswe are past the point of no return on the zuul migration (if there is an error, we will need to fix it or restore from backup)16:07
clarkback16:07
funginoted16:08
clarkbfwiw on the gerrit side if infra-prod-service-review runs before we want it to at this point thats mostly safe. It will only update the docker-compose.yaml file to use the 3.7 image but on gerrit we don't let it manage service restarts16:10
clarkbso as long as we get the change to update to 3.8 landed quickly we should be fine. All that to say I think we'll defer to corvus on when he is ready to clean up the emergency file and take it from there16:10
fungiwfm16:11
tonybSounds good.16:11
corvusthere was an error, i'm trying to sort out the logs16:13
corvus2023-11-17 15:43:47,859 DEBUG zuul.SQLConnection: Current migration revision: 151893067f9116:19
corvus2023-11-17 16:10:03,931 DEBUG zuul.Scheduler: Configured logging: 9.2.1.dev4716:19
corvusit appears the scheduler restarted at that time; i don't see any indication why16:20
clarkbagreed `docker ps -a` shows the container running for 10 minutes16:20
clarkbdmesg doesn't report OOMKiller kicking in16:21
corvusi wonder if there's a way to get the docker output from the previous container16:21
fungidid we have it copying to syslog?16:21
clarkbhttps://paste.opendev.org/show/bPTXJSq4qhmM81MofoBU/ I see this from docker in syslog16:23
clarkbI don't see any ansible tasks16:24
clarkb(so no unexpected ansible triggered this as far as I can tell)16:24
corvusso it could be a zuul crash where the logs only go to stderr16:24
clarkbI don't see anything in /var/log/containers which is where we've done our other docker log redirect to syslog output16:25
fungiyeah, directory is entirely empty16:26
corvus /var/lib/docker/containers only has the current container with no log from a previous run16:26
clarkbwe probably just haven't set it up for the zuul services16:26
fungiright, nothing for log settings in /etc/zuul-scheduler/docker-compose.yaml16:26
corvusokay i think the best thing we can do now is shut down the scheduler and then i'll try to figure out where in the migration it was and see if i can reconstruct the error, assuming there was one16:27
fungithat sounds reasonable to me16:28
corvusany objections to shutting down the running scheduler (which is in a loop trying to redo the migration but it can't)?16:28
clarkbno objection from me16:28
tonybnone16:28
fungiplease do16:28
corvusin case it's useful in the future, the current (reconstituted) container is b6d98a4420b035c1eab11088d2764849afc6f36d8096ef91525f1a83b134638016:28
corvus| 13869276 | zuul | 10.223.160.47:42130 | zuul | Query   |  166 | altering table | CREATE INDEX zuul_build_uuid_buildset_id_idx ON zuul_build (uuid, buildset_id) |16:29
corvusthat's the last thing i saw; trying to see if it proceeded past that16:29
clarkbshould we do something like #status notice The Gerrit upgrade to 3.8 is complete but Zuul remains offline due to a problem with database migrations in a Zuul upgrade that was being performed in the same outage window. We will notify when both Gerrit and Zuul are happy again.16:29
fungistatus notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete.16:30
corvusyes, but you might want to indicate whether or not you think ppl should use gerrit16:30
fungihah, i was just typing something similar16:30
corvusi have updated the zuul etherpad with the dump of all the sql statements i'm working from16:31
clarkbcorvus: at this point I think it is fine to use gerrit to post reviews. The only major item we haven't confirmed is working is the zuul integration16:31
clarkbbut good point. Maybe something along the lines of "it should be safe to post changes and reviews to Gerrit but you will not get CI results"16:31
clarkbI think fungi's message covers that actually16:32
fungifeel free to reword mine if you prefer16:32
clarkbfungi: no I think yours is good if you want to send it16:32
fungi#status notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete.16:32
opendevstatusfungi: sending notice16:32
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete.16:33
corvusi think we're at line 192 in the etherpad16:34
clarkbcorvus: are you thinking roll forward from there and see if you can reproduce an error?16:35
opendevstatusfungi: finished sending notice16:35
corvusyes; and also, if able to proceed without an error, then just finish the migration manually16:36
clarkbok16:36
fungimakes sense to me16:36
corvusthe statement that presumably failed involves fk constraints; we should have disabled them already, but i wonder if this old database/server behaves differently16:36
corvusi'm going to set fk checks off and then run line 19216:37
corvusbtw i do have a screen session on zuul02 (second window) if anyone wants to join16:38
corvusokay, we're running mysql 5.7 and it does not support "alter table drop constraint"16:39
fungizuul01?16:40
corvusyep16:40
corvussorry16:40
funginp, attached now16:40
fungiwe're still using a trove instance in rax for this, right?16:41
corvusyep16:41
fungii guess we should be thinking about upgrading that instance and/or setting up our own db cluster instead16:44
corvusyep.  i'd like to try the statement on line 20316:45
clarkbthat looks fine to me (but I'm not monty sql wizard)16:46
fungilooks the same as the one at line 192. does it differ in ways i'm not spotting?16:46
clarkbfungi: it moves from dropping constraint to dropping foreign key16:46
corvuss/constraint/foreign key/16:46
clarkbseems to be a syntax thing16:46
fungioh! right16:47
corvusQuery OK, 0 rows affected (0.06 sec)16:47
fungibut yeah, i'm a bit out of my depth when it comes to foreign key constraints16:47
corvusshow create table on that lgtm now16:48
corvusshall i continue running the statements in the etherpad manually?16:48
fungiyes please16:48
clarkbcorvus: I think if you are comfortable doing that (you wrote the migration so should be pretty knowledgeable about what is needed from here) I would say go for it16:48
clarkbI think I would be more concerned if this was software we weren't super familiar with just because the chance of missing something is high16:49
corvusthe scheduler in its attempt to re-run the migration failed at step1, so i don't think it did any damage16:49
clarkback16:50
fungistroke of luck there, i suppose16:50
corvus(it's sort of symmetrical; same table is at the beginning and end of the migration)16:52
corvusi'm double checking the etherpad statements with the python code16:52
corvus(just to make sure nothing changed)16:52
corvuswhile we're waiting on this alter table; i think to fix zuul we can try just making this change and let testing tell us if that works in current mysql/postgres16:54
clarkb++16:54
corvusokay migration is complete16:56
corvusshall we startup the executor again now?16:56
fungii think so16:56
fungischeduler you mean?16:56
corvusha yes lets do that one :)16:56
clarkbwfm16:56
fungicool, yes then ;)16:56
corvusseems to be happy and not doing any sql things16:57
fungiyay!16:57
fungithanks!!!16:57
corvusi will proceed with restarting the rest16:57
fungionce it's up i can recheck that dnm change from earlier and see if it gets enqueued16:57
clarkbI guess let us know when you think we should recheck fungi's bindep DNM change to check the zuul + gerrit intercommunication16:58
clarkb++16:58
corvusrebooting all other hosts16:58
corvusstarting zuul-web on zuul0116:59
corvusclarkb: looks like we *are* going through the github branch listing17:00
corvusbut it was relatively fast this time17:00
clarkbhuh17:00
fungihopefully won't trigger any api rate limits17:00
clarkbthe updates we made should avoid that now17:01
clarkbfingers crossed anyway17:01
corvusyeah it's done17:01
corvusstarting up zuul0217:01
corvusstarting mergers and executors17:02
fungidashboard is returning content now17:03
fungii guess all of the queue items will end up retrying their prior builds?17:04
corvusthe builds and buildsets tabs produce data in a reasonable amount of time17:04
corvusyep, it's firing them off now17:04
fungioh, nice, it's just the builds which were in progress that are being retried, all the ones which had completed remain so17:04
corvusyep17:05
corvusa neat side effect of that is that we immediately have new build database records (for the "RETRY" results)17:05
clarkbshould we recheck the bindep change now?17:05
fungishould i go ahead and recheck our test change?17:05
corvusyep i think it's gtg17:06
fungidone17:06
clarkbI see it enqueued17:06
clarkbone reason we explicitly test rechecks is that they have changed the stream event format before around comment data17:06
clarkbbut that seems good17:06
clarkband jobs are starting17:06
clarkbI've marked the two zuul related items on step 14 as done based on the bindep change17:07
clarkbthat takes us to step 17 which is to quit the screen and save the log. Are we ready for that?17:07
fungiyeah, https://zuul.opendev.org/t/opendev/status lgtm17:07
fungii suppose we'll want to see it successfully comment in gerrit too, those jobs should hopefully be relatively fats17:08
fungier, fast17:08
clarkbI've also removed my WIP on https://review.opendev.org/c/opendev/system-config/+/899609 and think we can approve that whenever corvus is ready17:08
clarkbfungi: ++17:08
fungisome builds for 818672 have already succeeded17:08
corvusready for 89960917:09
corvusall zuul components are running now17:09
clarkbtonyb: fungi: you cool with me closing the screen now and saving the log?17:09
corvusand i think it's fine to remove zuul from emergency now17:09
tonybclarkb: Yup.17:09
clarkbcorvus: do you want to do the emergency file cleanup? I think you can remove review related stuff as well17:09
corvuscon do17:10
corvuscan do17:10
fungiclarkb: go for it17:10
fungistatus notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs.17:10
clarkbstep 17 to stop screen and move the log file is done17:11
fungidoes that cover what we want folks to know?17:11
clarkbfungi: lgtm17:11
corvusi have removed todays maintenance entries from emergency.17:11
fungithanks!17:11
corvusdo we want to also remove the unrelated things we think we can clean up from that file, or leave that for another day?17:11
clarkbI've approved https://review.opendev.org/c/opendev/system-config/+/89960917:11
fungi#status notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs.17:11
opendevstatusfungi: sending notice17:11
clarkbcorvus: I'm happy to do that another day :)17:11
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs.17:11
clarkbcorvus: I can make a note to myself to do that monday17:11
corvusack17:11
fungiyeah, i'm beginning to get a smidge peckish17:12
clarkbok to recap where we are on the gerrit side of things: The upgrade is done, the checks we have performed have all checked out. Nothing crazy or unexpected in the gerrit error_log and we got things we expected which is double good. We have since removed services from the emergency file and approved the change to set the 3.8 image in docker-compose.yaml on review02. We need to confirm17:13
clarkbthat the file looks good after infra-prod-service-review runs17:13
clarkbinfra-prod-service-review does not start and stop gerrit though so it should be very safe even if we got something wrong there17:13
opendevstatusfungi: finished sending notice17:14
tonybclarkb: I didn't know that infra-prod-service-review does not stop/start gerrit but everything else matches my understanding17:16
clarkbtonyb: ya, many of our services we let ansible automatically do that stuff. Gerrit is special enough and has lots of rules that change between versions about whether or not you need to init or reindex or both and its also disruptive to restart even when we do updates within a single version. All that means we let ansible write the configs then we manually restart things17:17
tonybclarkb: makes sense.17:19
clarkbcorvus: fwiw the builds search feature in zuul works for me. As does buildsets. 17:20
clarkbhttps://review.opendev.org/c/starlingx/update/+/898850 is a post upgrade zuul comment against gerrit17:20
clarkbit lgtm17:21
* clarkb takes a break while waiting for job results17:22
fungithe test change is still waiting on two nodes17:28
fungiand is in a failing state (build log shows the reason)17:29
fungiso looks like it's working the way it should so far17:29
* tonyb goes afk for a bit17:30
fungiyeah, christine's being very patient waiting for me to take her to lunch, but i may have to assume this will work and check back on it after i return17:31
funginode request backlog is down to around 65 now17:32
clarkbzuul is busy today. I didn't expect a friday before a major holiday for at least some contributors to be so busy17:33
fungiare those "Will not fetch project branches as read-only is set" errors for the opendev tenant expected?17:33
clarkbfungi: yes/no They are not new. But we do need to debug them17:34
fungiah, okay. thanks17:34
fungii have a feeling the node request for that remaining opendev-nox-docs build got accepted by a provider that's repeatedly timing out booting an instance for it17:35
clarkbhttps://review.opendev.org/c/starlingx/stx-puppet/+/900806 is a merged change after the upgrade fwiw17:36
fungiwe're at about 30 minutes since the change was enqueued, and all its builds have completed except that one17:36
clarkbfungi: really we have enough other changes that have done stuff that I think its fine17:36
corvusftr, no that alter table syntax does not work with postgres, but it does work with mysql 8.x so i've updated the patch with a conditional.  i do expect it to pass tests now.17:36
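(Editorial aside: a minimal Python sketch of the kind of dialect conditional corvus mentions. It is not the actual Zuul patch. The function name and the constraint name in the usage note are hypothetical. It assumes only what the log states: MySQL 5.7 rejects "alter table ... drop constraint" but accepts "drop foreign key", while postgres does not accept the "drop foreign key" form.)

```python
def drop_fk_statement(dialect: str, table: str, constraint: str) -> str:
    """Build an ALTER TABLE statement that drops a foreign key constraint.

    dialect is a short backend name such as "mysql" or "postgresql".
    """
    if dialect == "mysql":
        # MySQL-specific spelling; per the log it works on both 5.7 and 8.x.
        return f"ALTER TABLE {table} DROP FOREIGN KEY {constraint}"
    # Generic spelling, used here for postgres and anything else.
    return f"ALTER TABLE {table} DROP CONSTRAINT {constraint}"
```

For example, drop_fk_statement("mysql", "zuul_build", "some_fk_name") yields the "DROP FOREIGN KEY" form that succeeded by hand on the 5.7 instance, and any other dialect gets "DROP CONSTRAINT".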
fungicool, i'm going to step out for an hour, back soon17:36
clarkbcorvus: the fix gets a +2 from me17:37
clarkboh the tooz thing is causing the openstack gate to thrash which could explain why it is busy17:40
clarkbmore coincidence than anything else I think17:40
corvusi wonder if we should start promoting the regex-based early failure detection17:41
corvusseems to be working pretty well in the zuul-project jobs17:42
corvusmaybe i should send an email next week17:42
clarkb++17:43
clarkbhttps://opendev.org/starlingx/stx-puppet/commits/branch/master shows the above merged change replicated and the master branch updated properly17:47
clarkbjust more sanity checks of replication17:47
corvusi added another zuul change to address the missing error log problem17:51
clarkbgood idea +2 there as well17:51
clarkbas a heads up the opendev hourly jobs have enqueued. They will run against zuul but not review18:02
clarkbthe hourly jobs should wrap up in just a couple of minutes. Then a few minutes after that the change to set the image version in the docker-compose file should merge and apply (which should noop)18:20
opendevreviewMerged opendev/system-config master: Upgrade Gerrit to Gerrit 3.8  https://review.opendev.org/c/opendev/system-config/+/89960918:32
tonybgerrit-compose on review02 looks good to me (still contains 3.8)18:34
clarkbyup the job is still running though according to zuul18:34
clarkbnot sure when ansible will try to modify it so we should double check after the job completes18:34
clarkbjob is complete now and the file wasn't modified according to the timestamp18:35
clarkbyup looks good too. I think we are basically done at this point18:35
tonybOh so it is, I waited for it to merge but of course that isn't the "important" run I needed to wait for the deploy pipeline18:35
clarkbI made a note to myself to do the autohold cleanup Monday18:35
clarkband then in a few days / a week we can work to clean up the 3.7 image stuff if we don't revert between now and then18:36
clarkbtonyb: the other things I've got on my list for today are to swap in the new mirror and to do db pruning. I think fungi mentioned he would help with the db pruning. Do you have a change up for the dns update to swap in the new mirror yet?18:37
tonybI don't.  Gimme 5 and I will ;P18:37
clarkbok I can review it :)18:37
opendevreviewTony Breeds proposed opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node  https://review.opendev.org/c/opendev/zone-opendev.org/+/90133218:46
tonybclarkb: I'd be keen to shadow you as you do the db purging.18:46
clarkber sorry not db pruning. Backup pruning18:48
clarkbtonyb: I was going to defer that to you and fungi since fungi mentioned the other day being willing to do that with you18:48
tonybSorry that's what I meant18:48
tonybclarkb: Okay cool18:48
clarkbya I typoed backups as db earlier :) just wanted to be clear18:48
clarkbtonyb: https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#managing-backup-storage is the relevant documentation18:49
tonybThanks18:51
clarkbwhen fungi returns from lunch he can double check the dns update and then let us know if backup pruning isn't in the cards for today18:53
clarkbif not I can refresh on it18:53
tonybOkay.  Sounds good.  I have a few things to do in a couple of hours but if that conflicts I'll shadow next time.18:55
clarkbfyi it has been reported in #openstack-infra that editing files in the web UI isn't working. Specifically you can enter edit mode, but when you open a file to edit a specific file it never loads the file contents in the editor. You can then exit edit mode successfully18:59
fungiokay, backl19:07
fungiback too19:07
fungitonyb: i can work with you on that now if you like19:07
fungior later, either works19:07
tonybfungi: If now works for you it works for me19:08
fungistarlingx folks are asking about errors from pip install... i'm looking into the logs now19:08
tonybfungi: where?  I suspect that's another venue I should hang out/monitor19:09
fungipinged me directly in the starlingx general matrix room19:09
tonybAh okay19:10
clarkbfungi: can you weigh in on the editor being broken first?19:12
clarkbI want to make sure we're comfortable with that not working for now19:12
fungioh, i missed the broken editor19:13
clarkbI'm putting notes in the etherpad. I don't think we need to rollback for this. Its annoying but not vital19:13
tonybShoot it's working for me now19:14
clarkbwut19:15
clarkbtonyb: it == editor in web ui?19:15
tonybclarkb: Yup.19:15
* clarkb retests19:15
tonybclarkb: I reproduced exactly what was seen on 900435.  I was poking around in the console/developer tools19:16
clarkbtonyb: it still doesn't work for me.19:16
tonybclarkb: I closed the window by mistake and now I have a functional editor19:16
clarkbmaybe we need to hard refresh because something is cached?19:16
clarkbya maybe that is it /me tries19:16
clarkbyup that did it19:16
clarkbI think the plugin html must not be versioned like the main gerrit js/html/css is so we don't get the auto eviction stuff19:18
fungistrangely, starlingx is seeing the reverse of https://review.opendev.org/c/openstack/project-config/+/897545 in this job, i think: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f19:18
* clarkb updates the etherpad. That gives us something to send upstream19:18
fungithey're trying to access wheels under debian-11.8 instead of debian-1119:18
clarkbfungi: we updated zuul maybe ansible reverted that behavior/ swapped it around19:19
fungiwell, i think it's that we ended up with newer libs in the ansible venvs19:19
clarkbfungi: yes we changed the mirror stuff because ansible updated and changed the behavior of those vars. I'm wondering if they swapped back to the old behavior19:20
clarkband did it since last Friday because we last upgraded zuul around then and just upgraded it a few hours ago19:20
fungiwhat we fixed in 897545 was the jobs that build the wheel mirrors19:21
fungithe problem they're seeing is that jobs are now looking for wheels in the location that the broken wheel mirror jobs wanted to publish to19:21
fungiso i don't think this is a revert19:21
fungiit looks more like a delayed reaction, where the playbook setting up the mirror urls in jobs is now exhibiting similar behavior to how we saw wheel publication break before we fixed it19:22
clarkbthey use the same ansible versions though19:22
clarkbit should all be the ansible version in zuul's venv for ansible19:22
clarkbunless maybe they are doing nested ansible?19:23
fungiit's happening in that job when opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml is invoked, doesn't look nested19:24
fungiit's parented to tox-py3919:25
clarkbmaybe we needed to update the client side and didn't realize it was broken after we fixed the generation side19:26
clarkband its just now getting bubbled up?19:26
fungithat's what it seems like, i'm just trying to find where that happens19:26
fungiit's possible all jobs using debian nodes are exhibiting this now19:26
opendevreviewClark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars  https://review.opendev.org/c/opendev/system-config/+/90037919:27
fungihttps://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml#L12 is where it's coming from, i think, we want ansible_distribution_major_version instead of ansible_distribution_version on debian now19:30
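(Editorial aside: a small Python illustration of the mismatch fungi describes. It does not reproduce the configure-mirrors role. The helper name and the sample fact values are assumptions chosen to match the "debian-11.8" and "debian-11" paths mentioned earlier in the log. The fact names ansible_distribution, ansible_distribution_version and ansible_distribution_major_version are the ones named in the discussion.)

```python
def wheel_dir(facts: dict, version_key: str) -> str:
    """Build a '<distro>-<release>' directory name from Ansible-style facts."""
    distro = facts["ansible_distribution"].lower()
    return f"{distro}-{facts[version_key]}"


# Hypothetical facts for a Debian 11 point release.
facts = {
    "ansible_distribution": "Debian",
    "ansible_distribution_version": "11.8",
    "ansible_distribution_major_version": "11",
}

# Full version: the path the failing job asked for.
full = wheel_dir(facts, "ansible_distribution_version")
# Major version only: the path the wheels are published under.
major = wheel_dir(facts, "ansible_distribution_major_version")
```

Under these assumed facts, the first call produces debian-11.8 and the second produces debian-11, which is why switching the variable is the proposed fix.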
fungiodd that centos is working though19:32
corvusany idea why it's only showing up now?  is it maybe the case that this mirror doesn't get used often?19:32
corvus(it's unfortunate that job doesn't save tox logs so we can compare to successful runs)19:32
clarkbcorvus: I'm beginning to suspect its been broken for a while and noone noticed19:32
clarkbwe fixed the mirror generation side and the consumers didn't test it or if they did didn't check back in with us to say it doesn't work19:33
clarkbbut maybe zuul's build history can confirm19:33
corvusthe most recent runs succeeded19:33
corvushttps://zuul.opendev.org/t/openstack/builds?job_name=sysinv-tox-py39&project=starlingx/config19:33
fungiyeah, i just noticed that too19:33
corvusfailure we're looking at is 3rd newest19:33
fungiall for the same change19:34
corvusthat's what made me think that the "use mirror" path may not be used often? but we can't tell on the successful jobs19:34
tonybThe successful one seemed to use a mirror: https://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#74719:36
tonyba non RAX mirror19:36
fungihttps://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#747 is in a build of that job that succeeded19:36
tonybsnap19:36
corvusthat mirror url is constructed with major.minor19:37
fungiyeah, that's what i'm saying, not sure why it didn't cause a problem in that build19:37
corvusso it seems like we are getting consistent behavior from the jobs in using that variable.19:37
corvusmaybe it's not using that particular wheel mirror; either getting it from somewhere else or building it?19:38
clarkbhttps://etherpad.opendev.org/p/tvAyWLRV07MNayX3Bbc3 this is my draft to the gerrit repo-discuss list about cache stuff19:39
fungiright, the wheel mirror may be unavailable/incorrect in both builds, the error for it in the failing build might be secondary and the real problem could be elsewhere19:39
clarkbpip is supposed to fallback when it can't find additional indexes to the indexes it does fine19:39
clarkb*it does find19:39
corvusanyway, it seems like that probably means we can exclude all of today's/yesterday's changes from the list of suspects; the fix is to update the variable; and the remaining mystery is why it only sometimes manifests (and i'm pretty sure saving the tox logs would help with that too, so i'd recommend that)19:40
clarkb++19:40
clarkbI'm going to need to stop and find food soon. I skipped breakfast and it is now almost lunch time and my body is protesting. I think remaining todos are to fire off that email to repo-discuss and then other tasks unrelated to gerrit like backup pruning and ord mirror swap out via dns19:46
corvusclarkb: etherpad lgtm19:48
fungiokay, it looks like this was the actual cause of the starlingx job failure: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f/log/job-output.txt#61245-6128620:07
fungior immediately above there anyway... Could not fetch URL https://mirror-int.ord.rax.opendev.org/pypi/simple/pygments/: connection error: HTTPSConnectionPool(host='mirror-int.ord.rax.opendev.org', port=443): Read timed out. - skipping20:09
clarkbI'm going to send that email to repo-discuss now. Thank you all for reading it20:23
fungisorry, just read it now but lgtm20:25
clarkbfungi: I think we can land https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 when you are ready20:27
clarkbthis will swap in the new ord mirror20:28
clarkbfungi: and were you still planning to do backup pruning with tonyb today?20:28
clarkbI'm about to eat lunch and expect to be afk for a bit20:28
clarkbhttps://groups.google.com/g/repo-discuss/c/DTrYQtY0j1k/m/7riBbIa5BwAJ20:32
tonybheading to the gym now.  back in about 90mins20:34
fungioh, yep. i'll be around when tonyb is back from the gym21:05
fungialso approved 90133221:06
opendevreviewMerged opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node  https://review.opendev.org/c/opendev/zone-opendev.org/+/90133221:12
fungideploy succeeded21:54
fungi$ host mirror.ord.rax.opendev.org21:55
fungimirror.ord.rax.opendev.org is an alias for mirror02.ord.rax.opendev.org.21:55
fungihttps://mirror.ord.rax.opendev.org/ has a working ssl cert and i can browse it as expected21:57
tonybAwesome.22:05
fungitonyb: so when you have a moment, take a look at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#backups if you haven't already22:31
fungicurrently the two backup servers are backup01.ord.rax.opendev.org and backup02.ca-ymq-1.vexxhost.opendev.org22:32
fungithe active backup volume on both of those is /opt/backups-20201022:32
fungiwhen the active backup volume exceeds 90% used, we try to prune it22:33
fungiwhich basically boils down to running /usr/local/bin/prune-borg-backups as root on the relevant backup server22:34
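The 90% rule of thumb above could be checked with something like the following. This is only a sketch: the helper names are made up, and the real decision is made by an admin looking at the backup server, not by this code.

```python
import shutil

# From the discussion: prune once the active backup volume exceeds 90% used.
PRUNE_THRESHOLD = 0.90


def exceeds_threshold(total, used, threshold=PRUNE_THRESHOLD):
    """Return True when used/total is strictly above the threshold."""
    return used / total > threshold


def needs_prune(path, threshold=PRUNE_THRESHOLD):
    """Check the filesystem holding `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return exceeds_threshold(usage.total, usage.used, threshold)


# On a backup server the path would be the active volume,
# e.g. /opt/backups-202010; "." is used here so the sketch runs anywhere.
print(needs_prune("."))
```

Note that `df` may report a slightly different percentage than `used / total` because of blocks reserved for root.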
tonybfungi: Just got back22:34
tonybAhh got it.  I misread the docs and thought it was on the backup client.22:35
fungiit will take a while (upwards of an hour or two) so safest to run it in a screen session in case your ssh connection gets interrupted22:35
tonybOkay22:35
fungiyou'll probably also want to crack open the prune-borg-backups script to see the goodness within22:35
tonybOkay so I'll start 2 80x24 terminals (one on each server) with sudo -s ; su - ; screen <CTRL>-H22:36
fungiit does log the output, so hardcopy of the screen session, while it doesn't hurt, isn't all that necessary22:36
tonybAh okay22:36
fungiif you look in the script, you'll see it logs to a file in /opt/backups/ called prune-<timestamp>.log22:37
fungiultimately, the real command is a few lines from the end of the script, it's calling `/opt/borg/bin/borg prune ...`22:38
fungithe rest is so much window dressing to save us from having to fiddle options22:39
tonybOkay screen sessions created22:39
tonybI'm looking at the prune-borg-backup script on backup02.ca-ymq-1.vexxhost.opendev.org22:39
fungioh, and you might also want to take a peek at one of the prune logs just so you know what they include22:40
tonybWill do22:41
fungi(they're extremely verbose, you won't really get much on stdout in your terminal other than whether it succeeded or, rarely and hopefully not, failed)22:41
fungiyou'll see it records the exit code of each prune command it runs too22:42
fungiideally rc 0 obviously22:42
fungithere is a dry run option you might want to try first22:42
tonybSounds good.22:43
fungi`/usr/local/bin/prune-borg-backups noop` is the dry run syntax22:44
fungi`/usr/local/bin/prune-borg-backups prune` is the actually do it syntax22:44
tonybOkay, that's going to take a small amount to digest, as the script does a 'read' so I was expecting it to prompt and wait etc but you're passing the mode as an argument22:47
fungioh, actually yes22:47
fungiyou're right22:47
fungi`/usr/local/bin/prune-borg-backups` and then enter "noop"22:48
tonybOkay22:48
fungiif you enter anything other than "noop" or "prune" it will lol at you22:48
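The mode prompt described above amounts to a small validation step. The following is a Python sketch of that idea only; the real prune-borg-backups is a shell script, and the function name and error message here are invented for illustration.

```python
VALID_MODES = ("noop", "prune")


def parse_mode(answer):
    """Validate the text typed at the prompt (hypothetical helper).

    Accepts exactly "noop" (dry run) or "prune" (really prune);
    anything else is rejected.
    """
    mode = answer.strip()
    if mode not in VALID_MODES:
        raise ValueError(f"expected one of {VALID_MODES}, got {mode!r}")
    return mode


# A trailing newline, as left by reading a line from a terminal, is tolerated.
print(parse_mode("noop\n"))
```

Reading the mode interactively, rather than taking it as an argument, makes it harder to start a real prune by recalling the wrong line from shell history.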
tonybI thought it was some mode of read I didn't know about22:48
fungiyes, the special kind that i forget about when it's something i only run every few months ;)22:48
tonybLOL22:49
tonybSo I understand the process as is. I need to do a little more reading to really get the way (and reasons) borg is set up, but so far it looks super neat22:53
tonybfungi: Should I serialize the servers or do them in parallel?  It seems like parallel is "safe"22:54
fungiyes, completely safe22:56
fungiclients push similar backups to two different servers every day, we have them in different service providers just as an insurance policy22:57
fungithey're purely for redundancy, not performance reasons22:58
fungiusually we don't simply because they tend not to fill up in the same week22:58
fungiso there's almost only ever one that needs pruning at any given point in time anyway22:58
tonybhttps://borgbackup.readthedocs.io/en/stable/usage/prune.html says "Important: Repository disk space is not freed until you run borg compact." I don't see anywhere there is a compact run22:59
fungipossible it's implied somehow23:00
tonybOkay.23:02
fungii mean, it visibly reduces disk space on the volume when we run prune, so it's happening somehow i guess23:03
tonybYeah, Just trying to understand as much as I can.23:04
Clark[m]We are still borg 1.6 iirc and those docs may be for 2.0 and maybe it changes?23:18
tonybOh okay.  I'll check that23:18
tonybThese are the 1.2.6 docs23:23
tonybOkay so even the noop run takes $some_time23:29
tonybIt looks like both servers are at > 90% so I'll prune them both23:30
fungiyep, thanks!23:31
tonybThe noop on backup01 returned success so starting the actual prune in 5mins unless someone says "NO!"23:32
fungiwhen i do it, i just leave it running and then check back on it after a few hours or the next morning23:32
fungii say go for it23:32
tonybThat's my plan.23:32
tonybIt's in a screen session as described23:33
tonybbackup01 is pruning, screen:0 is where it's running, screen:1 is a tail -f of the log .... in case anyone wants to check in23:34
tonybditto for backup0223:35
tonybSo WRT the mirror updates, mirror02.ord.rax is now the mirror for that region23:38
Clark[m]I trust the script 23:39
tonybIIUC Assuming there are no issues I need to remove mirror01.ord.rax from DNS, and then from system-config inventory and LE handlers and then delete the server23:39
tonybOnce I've done that I can do the OVH ones (doing them in serial is just about caution, nothing technical)23:40
tonybas discussed mirror.dfw.rax will be last because extra careful23:41
tonybSound about right?23:41
Clark[m]Yup23:41
Clark[m]We will also need to delete the volume attached to it23:42
Clark[m]Otherwise I think that is complete and correct23:42
tonyb<emote character="Mr Burns"> Excellent </emote>23:42
tonybThat's a good point.23:43
* fungi nods in agreement23:43
tonybCool.  I can do that next week.23:44
tonybI'll also reach out to rosmaita and the i18n SIG about the future of translate23:46
tonybI was looking at the wiki, I think with a little work we can use the mediawiki container images in the same way we do for many other services; that would at least decouple the OS upgrades and give us some testing.  There is an image for the version we're running which *hopefully* will make it easy to switch to.  From there we can look at rolling updates forward to get to a newer version.23:48
Clark[m]I think the main thing is getting all the plugins and stuff going but maybe we don't care about the theming so much anymore23:50
Clark[m]Re prune vs compact I'm confused. Maybe compact does extra cleanup beyond what prune does? In particular prune cleans up incomplete archives automatically maybe that's all we clean up?23:52
tonybIt is pruning other things23:52
tonybeg Pruning archive: review02-filesystem-2022-10-31T05:46:02 Mon, 2022-10-31 05:46:02 [SNIP] (39/39)23:53
tonybso I think it's more than incomplete archives23:53
tonybhttps://borgbackup.readthedocs.io/en/stable/usage/compact.html indicates it's useful after a prune.23:55
tonybSo I'm going to suggest that early next week (perhaps right after the team meeting), we try running a compact on a repo and see what happens to the disk utilisation.23:56
tonybI don't think that Friday afternoon is a good time to play with that kind of thing ;P23:57
fungiright there with you23:59