Friday, 2024-01-05

clarkbI've deleted the etherpad autohold but not the gitea one00:09
opendevreviewGhanshyam proposed openstack/project-config master: [QA Acls] Allow Review-Priority for non core member also  https://review.opendev.org/c/openstack/project-config/+/90480900:26
tonybclarkb: I'm done with the held node, so feel free to drop the autohold whenever you're free00:54
tonybclarkb: I don't follow your comment here: https://review.opendev.org/c/openstack/project-config/+/904809/comment/7d6a2af4_d2ba1044/01:07
tonybclarkb: everything I see matches 'grenade-core', which group is empty?01:07
opendevreviewTony Breeds proposed openstack/project-config master: [QA Acls] Allow Review-Priority for non core member also  https://review.opendev.org/c/openstack/project-config/+/90480901:12
Clark[m]tonyb: the line I highlighted is greande-core01:14
opendevreviewGhanshyam proposed openstack/project-config master: [QA Acls] Allow Review-Priority for non core member also  https://review.opendev.org/c/openstack/project-config/+/90480901:15
gmannClark[m]: ^^ updated. thanks for catching that01:15
opendevreviewMerged opendev/system-config master: Add hints to borg backup error logging  https://review.opendev.org/c/opendev/system-config/+/90335702:28
opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/90481102:59
opendevreviewDr. Jens Harbott proposed openstack/project-config master: [QA Acls] Allow Review-Priority for non core member also  https://review.opendev.org/c/openstack/project-config/+/90480906:05
opendevreviewMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/90481106:09
opendevreviewJan Marchel proposed openstack/project-config master: Add new NebulOuS projects: overlay-network-manager, security-manager  https://review.opendev.org/c/openstack/project-config/+/90479208:55
opendevreviewXavier Coulon proposed openstack/diskimage-builder master: Replace OpenSUSE Leap 15.3 to OpenSUSE Leap 15.5  https://review.opendev.org/c/openstack/diskimage-builder/+/90482109:56
opendevreviewElod Illes proposed openstack/project-config master: WIP: Adapt branch creation to Unmaintained state  https://review.opendev.org/c/openstack/project-config/+/90483713:57
opendevreviewMerged openstack/project-config master: Deprecate cinderlib  https://review.opendev.org/c/openstack/project-config/+/90326014:20
fungiinfra-root: elodilles pointed out that we've got another round of deleted branches from 2023-12-21 where zuul seems to have missed or ignored some of the removals and still thinks they're present14:35
fungian example is https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/automaton persisting with a stable/train branch that's no longer there14:35
fungii suspect an online reconfigure (smart? full?) would clear that out, but am wondering if we're hitting some sort of race in zuul's event processing or whether gerrit is failing to actually send the events14:36
fungisince elodilles is preparing to do another batch of deletions shortly, it's possible we'll wind up with more14:37
fungimanage-projects just failed on 903260 because the rootfs on gitea09 is full. looking into why now14:45
fungilooks like /var/gitea/data/gitea/repo-archive is where almost all of it is14:47
fungi73% (113gb) of the 155gb rootfs is used by the contents of that directory14:49
fungithe other gitea servers range from 528mb to 5.7gb in that directory, so gitea09's is orders of magnitude bigger14:51
fungiinfra-root: anyone happen to know what gitea uses that directory for?14:51
fungii'll start going through its documentation, but looks like some sort of git object cache14:51
fungii'm going to put the server in our emergency disable list and take it out of the haproxy pools temporarily14:52
elodillesfungi: sorry, i've started the clean up script in the meantime, will that interfere with the above? ^^^14:55
fungielodilles: no, it should be fine14:55
elodillesACK14:55
fungi# status log Temporarily disabled gitea09 from the load balancer pools while investigating a full rootfs on it14:56
fungi#status log Temporarily disabled gitea09 from the load balancer pools while investigating a full rootfs on it14:56
opendevstatusfungi: finished logging14:56
fungiand now i've downed the containers15:01
fungibrowsing gitea's issues, seems like people have been seeing repo-archive grow wildly for no apparent reason, including after upgrades15:06
fungihttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=71123&rra_id=all suggests it jumped up by quite a bit in march, but then went crazy around the end of november and started to grow out of control. now it's been hovering near 100% for several weeks15:09
elodillesfungi: these were deleted this time: https://paste.opendev.org/show/bn5tNwhl35pq7SFT5aAV/15:17
fungithanks elodilles!15:17
elodillesnp15:17
Clark[m]Repo archive cleanup is supposed to happen for archives older than 24 hours automatically 15:18
fungiyeah, i found the internal cron for it. maybe it's either not running or broken on gitea09?15:18
Clark[m]Our config file doesn't override any of the cron settings so it should be running at least15:20
fungihttps://github.com/go-gitea/gitea/issues/25992 seems to indicate that it can be disabled entirely in configuration now, though i'm not finding the corresponding commit or docs to confirm15:21
fungibut also, reading through commits related to the repo archive, it seems like it's used as a cache for performance reasons, so we might still want to have it?15:22
Clark[m]It's caches of repo archives which I'm not sure are super important 15:23
Clark[m]But also reading other posts there are multiple crons and maybe only some are run by default?15:24
fungioh, this is specifically for when someone requests an "archive" tarball of a particular commit?15:24
Clark[m]Yes15:25
Clark[m]And apparently they pre build them for tags15:26
Clark[m]Looks like the cron subsystem may be disabled by default15:26
Clark[m]I think we should use the manual admin UI task to cleanup repo archives now (and maybe do that on all the nodes). Then do a followup change to enable the cron subsystem which we should take care to ensure doesn't run other jobs we don't want15:28
fungiyeah, i found references to a "Delete all repositories' archives (ZIP, TAR.GZ, etc..)" task/button which should be in the admin dashboard15:29
Clark[m]But we have to sort out this complicated cron configuration. It isn't clear to me if we enable a specific cron job if that will work when top level cron is disabled. I think enabling only the specific jobs we want is preferable15:30
Clark[m]fungi: ya I think we want to find that and click it or better yet would be something more aligned with "run the equivalent of the cron job"15:31
Clark[m]https://github.com/go-gitea/gitea/issues/6689 seems to make it clear that button deletes git archive objects of repos and not archived repos so that is good 15:32
Clark[m]fungi: maybe before deleting things check to see if any of the archive files are more than 24 hours old? The cleanup cron won't touch them if they are newer than that and we may have a different issue if so15:34
fungiyeah, calling switching a repo to read-only "archiving" was a poor design choice on their part15:34
fungi$ sudo find /var/gitea/data/gitea/repo-archive -type f -mtime +7|wc -l15:36
fungi568715:36
fungiso thousands of files more than a week old15:37
fungibut none more than a month old15:37
fungithe oldest ones are 29 days old, around 2023-12-0615:38
fungiwhich would suggest something is (or was) cleaning them up15:38
fungimaybe it's cleaning up archives older than 30 days by default?15:39
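
A minimal Python sketch of the age check fungi ran with find, for anyone who wants to bucket the archive files by several thresholds at once. The function name and the `now` parameter are illustrative and do not come from the log. Note that `find -mtime +7` rounds file age down to whole days, so its count can differ slightly from the strict cutoff used here.

```python
import os
import stat
import time


def count_older_than(root, days, now=None):
    """Count regular files under root whose mtime is more than `days` days old."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))
            # Only regular files, like `find -type f`; symlinks are skipped.
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                count += 1
    return count
```

Calling `count_older_than("/var/gitea/data/gitea/repo-archive", 7)` and again with 30 would show the same "thousands older than a week, none older than a month" picture described above, assuming the process can read that directory.
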
Clark[m]Ya maybe they changed the default but didn't update the docs 15:39
fungigitea10 has none older than a day15:40
clarkbweird15:41
fungisame for gitea11, but gitea12 has some as old as 19 days15:41
clarkbmaybe it is atime and not mtime? so it acts more like a cache?15:42
fungi11 days old on gitea1315:42
fungi-atime and -mtime counts seem to match up, spot-checking15:43
clarkbas a quick sanity check `cron` doesn't appear in the app.ini written to gitea10 or gitea09 so they should use the same cron defaults15:45
clarkbfungi: I'm looking at the source and I think it is using db entries for the olderthan content15:49
clarkbrather than disk times15:49
clarkbit also seems to short circuit if it errors rather than trying to continue to delete things15:50
clarkbcould be that we've got some archive that fails to delete for whatever reason and that short circuits everything else15:51
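
The short-circuit theory can be illustrated with a small Python sketch. This is not Gitea's code (Gitea is written in Go), and every name here is hypothetical. It only shows why a cleanup loop that returns on the first error leaves every later entry behind, while one that records the failure and continues does not.

```python
def cleanup_short_circuit(entries, delete):
    """Stop at the first failure; entries after it are never attempted."""
    removed = []
    for entry in entries:
        try:
            delete(entry)
        except OSError:
            return removed
        removed.append(entry)
    return removed


def cleanup_continue(entries, delete):
    """Record failures and keep going, so one bad entry cannot block the rest."""
    removed, failed = [], []
    for entry in entries:
        try:
            delete(entry)
        except OSError:
            failed.append(entry)
            continue
        removed.append(entry)
    return removed, failed
```

With three entries where the second one fails, the first function removes only the first entry, while the second removes the first and third and reports the second as failed.
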
fungiif so, still odd that the oldest one has been modified as recently as a month ago, but nothing older than that15:52
clarkbI asked in the gitea general room and they say that robots can generate the files and their example robots.txt apparently asks crawlers to not do that and we can look at https://gitea.com/robots.txt as an example15:52
fungioh, nice15:52
fungilooks like we have a bit of stuff in https://opendev.org/robots.txt currently15:53
clarkbya we disallow */archive/ they disallow /*/*/archive15:54
clarkbnot sure if those are equivalent15:54
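
One way to compare the two patterns is a simplified matcher. The sketch below assumes the common wildcard semantics, where `*` matches any run of characters, a trailing `$` anchors the end, and everything else is a prefix match. Real crawlers may differ, especially for a pattern that does not start with `/`. The function names and the sample paths are made up for illustration.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if the Disallow pattern covers the given URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None


OURS = "*/archive/"
THEIRS = "/*/*/archive"

for path in (
    "/openstack/nova/archive/master.tar.gz",
    "/archive/foo",
    "/openstack/nova/archived-notes",
):
    print(path, is_blocked(OURS, path), is_blocked(THEIRS, path))
```

Under these assumptions both patterns cover the usual `/org/repo/archive/...` download URLs, but they are not strictly equivalent: only `*/archive/` matches a top-level `/archive/` path, and only `/*/*/archive` matches repo paths that merely begin with `archive`.
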
fungii have a feeling we got a robots.txt from gitea but it's outdated compared to the one they're using15:54
clarkbbut /*/tarball/ and /*/zipball/  may be useful15:54
fungiaha, yeah looks like https://review.opendev.org/803231 may have copied gitea's ~2.5 years ago15:56
clarkbmariadb isn't running on gitea09 but I think we want to look at something like `select * from repo_archiver where created_unix < 1704240000 limit 10;`16:01
clarkbthat's the unix timestamp for roughly two days ago I think, and since cron only runs once a day and we clean up things older than 24 hours, things in there up to about 2 days old could still be valid?16:01
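
For reference, the cutoff in that query can be checked or recomputed with a few lines of Python. The helper names are illustrative, and the `created_unix` column is assumed to hold UTC epoch seconds, as the query implies.

```python
from datetime import datetime, timedelta, timezone


def cutoff_unix(now, older_than=timedelta(hours=24)):
    """Epoch seconds for `now - older_than`; `now` must be timezone-aware."""
    return int((now - older_than).timestamp())


def describe(ts):
    """Render epoch seconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


print(describe(1704240000))  # 2024-01-03T00:00:00+00:00
```

So 1704240000 is midnight UTC on 2024-01-03, a little over two days before the query was run. To get a cutoff relative to the current time instead, pass `datetime.now(timezone.utc)` and `timedelta(days=2)` to `cutoff_unix`.
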
clarkbthat query returns no results on gitea1016:02
clarkbthis is expected based on the filesystem inspection fungi did. Now to check gitea1216:02
clarkbthat query does return results on gitea1216:03
fungii can start the containers back up, but would feel more comfortable if we could free up a little space on gitea09's rootfs first. clarkb: you have some old db dumps from almost a year ago in your homedir, is that still needed?16:03
clarkbI think the old db backups can be cleaned up they were used to bootstrap the other new servers iirc16:04
clarkbfungi: I'm starting to think that it may be the short circuiting issue given the gitea12 query results16:04
fungiclarkb: yeah, the dumps in your homedir are called gitea09_transplant_db.sql{,.gz}16:05
fungii'll delete those which will free up a few hundred mb16:05
fungiwe now have 474M available on the rootfs which should be sufficient, but i'd also like to reboot the server to make sure it's in good shape before doing anything else16:06
fungiclarkb: you okay with me doing a quick reboot, or will that interrupt anything you're checking?16:07
clarkbfungi: I'm using gitea12 now since it has leaked archives. I'll jump off of 0916:08
fungicool, rebooting 09 now16:08
fungiit's back up now, and freed a bit of additional space (now right at half a gb)16:10
fungii've started the containers on 09 again16:10
fungiinterestingly, that freed even more space, now around 0.75gb16:11
clarkb/tmp content maybe?16:12
clarkbthe dir structure of the repo archive is repoid/firsttwoofsha/sha.filesuffix16:13
clarkbthis is useful when looking at the db contents and trying to map to the on disk contents16:13
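
A small Python sketch of that mapping, assuming the repo id is the integer id from the database, the sha is the full commit id, and the suffix is something like `zip` or `tar.gz`. The root path is the one from earlier in the log. The function names are hypothetical.

```python
from pathlib import PurePosixPath

ARCHIVE_ROOT = PurePosixPath("/var/gitea/data/gitea/repo-archive")


def archive_path(repo_id, commit_sha, suffix, root=ARCHIVE_ROOT):
    """Build repoid/firsttwoofsha/sha.filesuffix under the archive root."""
    return root / str(repo_id) / commit_sha[:2] / f"{commit_sha}.{suffix}"


def parse_archive_path(path, root=ARCHIVE_ROOT):
    """Split an on-disk archive path back into (repo_id, sha, suffix)."""
    repo_id, prefix, filename = PurePosixPath(path).relative_to(root).parts
    sha, _, suffix = filename.partition(".")
    if prefix != sha[:2]:
        raise ValueError(f"unexpected layout: {path}")
    return int(repo_id), sha, suffix
```

Going from a leaked file on disk to `(repo_id, sha, suffix)` gives the values to look for in the `repo_archiver` table, and going the other way predicts where a row's file should live.
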
clarkbfungi: I think we should try the admin delete all archives on one server and see if it errors16:18
clarkbthat would give more weight to the short circuit problem16:18
fungiyep, i'll look into that now16:18
clarkbgitea logs some of these errors at a trace level (which is lower than debug) and we're logging at info (higher than debug)16:23
fungiroot login is taking a while, seems like something might be wrong with it16:26
funginever mind, it finally went through16:26
fungiarchive deletion is in progress16:27
fungiit's freed a ton of space on the rootfs already16:28
fungi/var/gitea/data/gitea/repo-archive is now only 29mb16:29
clarkbstartup isn't instantaneous as the db has to come up first and gitea will wait for the db to be communicable16:29
clarkbassuming this was on gitea0916:29
fungiyeah, it wasn't at startup though. i started the containers about 20 minutes ago16:29
clarkbhrm16:30
fungiprobably just the amount of data it needed to read to put the root user dashboard together or something16:31
clarkbgitea09's repo_archiver table is empty now16:31
clarkbthe disk usage in the archives dir appears to just be for the dir structure it doesn't delete dirs just files I guess16:32
clarkbfungi: I guess the next step is to force replication to gitea09 to catch up its git content (which may create archives if tags are pushed iirc)16:33
clarkband then followup with an updated robots.txt and monitor?16:33
clarkbI don't see a smoking gun in the code for why this is happening and I'm wary of enabling the most verbose log level to get more logs out of gitea16:33
fungisounds good16:33
fungishould we force full replication for all the backends just for completeness?16:34
clarkbI guess it doesn't hurt16:34
fungidone16:35
fungiwell, started16:35
fungii did a full `replication start`16:35
clarkbfungi: and next week we can run that query and see if we have results16:35
clarkber I guess we want to do the data collection after robots.txt is updated16:35
fungiaround 14k tasks16:35
clarkbso update robots.txt, rerun cleanup task, then check a week later or so16:36
fungiworking on the updated robots.txt next16:36
fungialso need to take the host out of the emergency disable list and reenable it in the haproxy pool still, and then i have a change i procedurally blocked i need to reapprove16:37
fungii had initially approved it just before i saw the manage-projects deploy failure16:37
clarkback16:39
fungiclarkb: any idea why we commented out the disallow lines for /avatars and /user/* >?16:39
clarkbnope16:40
fungidoesn't seem like search indexes pulling those would make much sense, and gitea.com's robots disallows them16:40
clarkb++16:42
clarkbhttps://forum.gitea.com/t/how-to-configure-cron-task-for-delete-all-repositories-archives-zip-tar-gz-etc/4848/2 points to a fun undocumented cron job option which basically runs that admin task automatically16:42
clarkbwe could set that up to run say monthly. Let the daily run do its best daily and then come by once a month and clear out everything?16:42
fungilooks like we also commented out /raw/* but maybe the reason they have it in theirs is to avoid duplicate search results for different views of the same content?16:43
clarkbfungi: that would make sense I guess. Also the upstream robots.txt has two raw entries16:44
clarkboops three16:44
fungiseems to be lots of copy-pasta16:57
clarkblooking at gitea12 we have db entries for leaked disk entries. Reading the code this implies the error is occurring either when listing/finding entries to delete or when deleting the db record for the archive. The last thing that is done is deleting the content on disk which is present as is the db record17:00
opendevreviewJeremy Stanley proposed opendev/system-config master: Update our Gitea robots.txt from gitea.com's  https://review.opendev.org/c/opendev/system-config/+/90486817:07
fungigerrit replication has caught up17:07
fungii'm going to reenable 09 in ansible and haproxy now17:07
clarkbfungi: you don't happen to still be logged into gitea09 do you? I think admin/monitor/cron should show running tasks and maybe gives us info on last results?17:07
clarkbI don't think that data is persisted to the db though so it is probably pretty empty now due to the restart. Maybe at midnight utc we can check it and see it running17:08
fungipulling it up now17:08
fungiwhich task are you interested in? there are 21 listed17:10
clarkbfungi: in the robots.txt I thought you were going to uncomment the disallows for avatar and users17:10
clarkbfungi: the delete archive one17:10
clarkblet me find the exact name17:10
fungi"Delete all repositories' archives (ZIP, TAR.GZ, etc..)" isn't scheduled17:10
fungiit shows the "previous time" as when i clicked the button in the ui17:11
clarkb"archive_cleanup" is the name in the code17:11
clarkbya delete all repositories is disabled by default17:11
fungiaha, "Delete old repository archives"17:11
clarkbthat does the full clear which is the same thing you did by clicking the button. However we should once a day run a cleanup of older archives17:11
clarkbya that one17:11
fungischedule to run @midnight17:11
fungiprevious time was Jan 5, 2024, 4:10:35 PM17:12
fungiwhich was when i restarted the container17:12
funginext time is Jan 6, 2024, 12:00:00 AM17:12
fungiso it thinks it ran that task when the container started, which would explain the small reduction in disk utilization i observed at that time17:12
clarkbyup by default that cron is set to run on startup as well17:13
clarkbthat at least confirms it is running which is good17:13
fungibut doesn't explain why it didn't remove most of the archives17:13
clarkbbecause now we can focus on why it isn't doing what we want it to do rather than figuring out if it even executes17:13
fungiyep17:13
opendevreviewJeremy Stanley proposed opendev/system-config master: Update our Gitea robots.txt from gitea.com's  https://review.opendev.org/c/opendev/system-config/+/90486817:15
fungigitea09 is out of the emergency disable list now17:15
fungiand enabled in haproxy again17:16
clarkbcross checking on gitea12 might be good too to ensure it is running daily17:17
clarkbmaybe it stopped running on the servers after being up for $time17:17
fungilast ran on gitea12 Jan 5, 2024, 12:00:00 AM17:19
fungiexecutions count is 217:19
fungiwhich i suppose is since the gitea upgrade yesterday? (one at container start, and then one as scheduled)17:19
clarkbya that makes sense17:20
clarkbso now we know it is definitely not running and also not clearing out entries we expect to be cleared out17:20
clarkbI think the two main possibilities there are either an error listing/finding entries to delete or errors removing db rows causing short circuits in the entire process17:20
fungier, well definitely thinks it's running, but yeah not cleaning up what we want17:20
fungimmm, yeah db locking contention?17:21
clarkbmaybe? I guess deletions could hit lock problems (the listing should be fine though?)17:21
fungigood point17:21
clarkbfungi: I'm definitely not seeing any smoking guns. I made note of the next record to be deleted by gitea12 if I read things correctly and can check that record on monday to see if one of the next three midnight runs get it by then17:29
clarkbuntil then I think things are manageable and this workaround of just deleting the entire archive seems fine. Honestly I think we should consider adding that cron job to gitea too on a monthly basis17:30
fungisounds good17:30
fungikeep in mind though, we had just shy of a month's archives on 09 from the look of things, and that filled up the rootfs. maybe weekly?17:31
clarkboh wow was it only a month. I guess so based on your timestamps17:31
clarkbya might have to be weekly17:31
clarkbI've gone ahead and deleted my autohold for gitea 1.21.3 testing. I don't think it is useful for this archive stuff and we're upgraded now17:35
clarkbfungi: also https://review.opendev.org/c/opendev/system-config/+/904777 is more gitea related testing for unrelated problems17:37
opendevreviewMerged openstack/project-config master: Add new NebulOuS projects: overlay-network-manager, security-manager  https://review.opendev.org/c/openstack/project-config/+/90479217:42
fungideploy of that ^ succeeded, so gitea09 is no longer breaking the manage-projects job17:54
fungi(which was how i initially noticed the issue with that server)17:55
clarkbwoot and that should've also resolved the earlier failure?17:55
fungiyes17:55
clarkbpretty sure it would since we refresh all of gitea each time and then run gerrit only if gitea succeeds17:55
* clarkb should go do normal morning things that got neglected like eating breakfast17:58
fungiyes, do that. when you're caught up, opinion on my earlier comments about zuul holding onto deleted branches? should i ask it for a reconfigure?18:00
clarkbyes I suspect that is the known issue with not batching up deletions of branches and instead doing them all serially in a short period of time (it results in lost events)18:03
clarkbthe fix for that is to reconfigure the tenant that is affected18:03
clarkbfungi: `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` from scrollback in #openstack-infra18:10
opendevreviewMerged opendev/system-config master: Check for gitea template rendering errors  https://review.opendev.org/c/opendev/system-config/+/90477718:52
fungiclarkb: it doesn't seem like a tenant-reconfigure is sufficient in this case, at least the deleted branch is still lingering in the dashboard18:58
fungihttps://review.opendev.org/admin/repos/openstack/automaton,branches and https://opendev.org/openstack/automaton/branches don't have stable/train but https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/automaton still does19:05
clarkbfungi: did you wait for the reconfiguration to complete? it takes like 20 ish minutes19:11
clarkbnot sure when it ran relative to you checking the list19:11
fungi2024-01-05 18:57:42,514 INFO zuul.GerritConnection: Got branches for openstack/automaton19:13
fungiand then looks like it loaded configuration from automaton's branches at 19:08:4419:13
fungino mention of loading configuration from stable/train though19:14
fungiis there a later pass to clean up dropped branches?19:14
clarkbfungi: I'm not sure of the order but I don't think zuul uses any of the new content until the process is fully complete (the zk db is versioned)19:15
fungimaybe it's not done yet19:15
clarkbif I grep for `reconfiguration` I see 2024-01-05 17:51:53,251 DEBUG zuul.Scheduler: Smart reconfiguration triggered but no finished message19:19
clarkboh hrm maybe I needed to grep -i19:19
clarkbok I was looking at the wrong scheduler and needed -i19:21
clarkb2024-01-05 19:18:06,648 INFO zuul.Scheduler: Reconfiguration complete (smart: False, tenants: ['openstack'], duration: 1251.448 seconds)19:21
clarkbfungi: I think it is done now. Maybe refresh and check?19:21
fungiclarkb: yep, now it's gone from https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/automaton so i was just impatient19:22
clarkband ya it took almost exactly 21 minutes19:23
fungielodilles: ^ all cleaned up19:25
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818120:31
opendevreviewClark Boylan proposed opendev/system-config master: Enable gitea delete_repo_archives cron job  https://review.opendev.org/c/opendev/system-config/+/90487420:31
clarkbI put an autohold in place for that so we can check the admin/monitor/cron (or whatever the path was) to see that it fires as expected after Sunday20:32
fungiyep, that's the path20:36
clarkbfwiw I got that config block out of the example app.ini content in the gitea repo20:38
clarkbso while undocumented I believe it to be valid (especially after finding the cron job in the source code)20:38
clarkbhttps://104.130.4.31:3081/opendev/system-config is the held node. I manually downloaded the three archive file types for that repo which put them in the repo-archives dir on the held node23:36
clarkbwe should see them get cleaned up in ~24 minutes23:36
clarkbI also realize the @weekly definition might conflict with the 24h definition of the daily cleanup because they'll both run at midnight23:36
clarkbit might be better for us to specify a time like weekly at 0200 or whatever23:36
clarkbor maybe they will handle that properly with locking I don't know23:37
clarkboh the daily won't cleanup actually23:37
clarkbbecause they are less than 24 hours old23:37
clarkbso these should be good for checking the cleanup tomorrow sunday at midnight utc (since the time in 22 minutes is saturday midnight?)23:38
clarkbanyway as long as this doesn't completely explode over the weekend on the held node I think we can deploy it to prod and then monitor more long term behavior23:38
fungisounds right to me23:55
