Monday, 2023-08-07

<opendevreview> Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Fix baseurl for Fedora versions before 36  https://review.opendev.org/c/openstack/diskimage-builder/+/890650  [12:38]
*** amoralej is now known as amoralej|lunch  [13:30]
*** amoralej|lunch is now known as amoralej  [13:54]
<fungi> looks like we've got image upload (or build) issues. last successful uploads of ubuntu-jammy anywhere were early utc on wednesday  [14:07]
<fungi> yeah, that was the last time we successfully built it  [14:08]
<fungi> df says /opt is at 100% of 2tb used on nb01, nb02. nb03 is unreachable (no longer exists?), and nb04 (our arm64 builder) is the only one currently able to build new images  [14:10]
<fungi> i'll work on cleaning up nb01 and nb02 now  [14:10]
<frickler> seems all x86 builds have been failing since that date, all possibly interesting build logs have been rotated away since then, so we'll have to watch new logs after the cleanup  [14:23]
<fungi> probably all hitting enospc  [14:24]
<fungi> cleanup is slow. so far only freed about 2% of /opt on each server  [14:25]
<frickler> yes, I think that may have been taking a couple of hours on earlier occasions  [14:25]
<fungi> earlier occasions were probably also before we bumped the volume to 2tb  [14:26]
<fungi> i'll have to take a look at the graphs, but this may be the first time we've filled them since increasing the available space  [14:26]
<frickler> ah, right, that's very likely the case  [14:27]
<frickler> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=68032&rra_id=all  [14:28]
<frickler> first incident in almost a year  [14:28]
<fungi> so we have a leak of something in there, albeit slow  [14:28]
<frickler> but it also went very fast, 2-3 days from roughly 50% to 100%  [14:29]
<frickler> seems there's also some slow leak happening, but that wouldn't have filled the disk for another couple of years  [14:30]
<fungi> so something likely went sideways around the end of july, i guess we'll have a better idea what that might be once we're building images again  [14:30]
<frickler> that might coincide with the F36 repo archival, but yes, we need new logs to be sure  [14:31]
<fungi> yeah, maybe f36 image builds started failing in a loop leaving trash behind  [14:32]
<noonedeadpunk> actually, f36 builds are failing for sure, though I've got a patch for it: https://review.opendev.org/c/openstack/diskimage-builder/+/890650  [14:37]
<frickler> hmm, looking at dib-image-list and image-list, the hypothesis of failing image uploads seems more likely to me. on some providers images are 9 days old, others only 5-6. so if nodepool had to store more than 2 iterations of each image, that may also cause an increase in disk usage  [14:38]
<fungi> yeah, i noticed the uploads in inmotion-iad3 were a couple of days older than in rax-dfw  [14:41]
<frickler> for inmotion I see a lot of delete image failures, but also these for rax-iad:  [14:43]
<frickler> 2023-08-01 04:07:36,069 ERROR nodepool.builder.UploadWorker.6: Failed to upload build 0000011493 of image rockylinux-8 to provider rax-iad  [14:43]
<frickler> so what's happening in iad? where is that, anyway? ;)  [14:44]
<fungi> iad is the airport code for dulles international in virginia (suburb of washington dc). it's notable as the location of the mae east telecommunications switching hub that traditionally handled a majority of trans-atlantic lines  [14:46]
<frickler> nodepool-builder.log.2023-07-30_12:2023-07-30 18:18:19,636 ERROR nodepool.builder.UploadWorker.4: Failed to upload build 0000000039 of image debian-bookworm to provider rax-iad  [14:47]
<frickler> this is where it started  [14:47]
<frickler> hundreds of similar failures in the days after that  [14:48]
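
A minimal sketch of how those failures could be tallied from the rotated nodepool-builder logs, assuming a /var/log/nodepool location and uncompressed rotations (compressed ones would need decompressing first); the error format matches the lines quoted above:

    #!/usr/bin/env python3
    # Count "Failed to upload" errors per provider across rotated builder logs.
    # The log directory and file name pattern are assumptions and may differ.
    import collections
    import glob
    import re

    LOG_GLOB = '/var/log/nodepool/nodepool-builder.log*'  # assumed location
    pattern = re.compile(
        r'Failed to upload build (\S+) of image (\S+) to provider (\S+)')

    counts = collections.Counter()
    for path in sorted(glob.glob(LOG_GLOB)):
        with open(path, errors='replace') as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    counts[m.group(3)] += 1  # provider name

    for provider, count in counts.most_common():
        print(f'{provider}: {count} failed uploads')
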
<fungi> i wonder if we've reached our glance quota there?  [14:48]
<fungi> or if rax-dfw where the builders reside is having communications issues  [14:48]
<fungi> okay, clearing out /opt/dib_tmp only freed up about 200-300gb on each builder  [15:10]
<fungi> which implies we have a ton of data that's not leftover tempfiles from dib  [15:11]
<fungi> on nb01, /opt/nodepool_dib has 1.6tb of content, 1.7tb on nb02  [15:13]
<fungi> looks like that's where we store the built images  [15:13]
<fungi> so i guess that's all spoken for (each format for a couple of versions of each distro/version)  [15:15]
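
A minimal sketch of this kind of check, using only the standard library; it reports overall /opt usage plus totals for the two directories mentioned above:

    #!/usr/bin/env python3
    # Summarize /opt usage and the size of the big consumers on a builder.
    # The paths match what was discussed here but are otherwise assumptions.
    import os
    import shutil

    def du(path):
        """Return total size in bytes of all files under path."""
        total = 0
        for root, _dirs, files in os.walk(path, onerror=lambda e: None):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    pass
        return total

    usage = shutil.disk_usage('/opt')
    print(f'/opt: {usage.used / 1e12:.2f}TB used of {usage.total / 1e12:.2f}TB')
    for path in ('/opt/dib_tmp', '/opt/nodepool_dib'):
        if os.path.isdir(path):
            print(f'{path}: {du(path) / 1e12:.2f}TB')
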
<frickler> can we create a ticket with rax for them to check the upload issues with iad?  [15:16]
<fungi> are they still happening?  [15:18]
<frickler> I think you stopped the containers now? last failure on nb01 was: 2023-08-07 13:52:29,041 ERROR nodepool.builder.UploadWorker.7: Failed to upload build 0000067622 of image debian-buster to provider rax-iad  [15:19]
<fungi> #status log Cleared out /opt/dib_tmp on nb01 and nb02, restoring their ability to build new amd64 images for the first time since early on 2023-08-02  [15:19]
<opendevstatus> fungi: finished logging  [15:19]
<fungi> frickler: yeah, i've started them again now that i'm done checking  [15:20]
<fungi> i'll see if there's anything useful in the traceback from that one  [15:20]
<frickler> we could try to manually delete all or some images on rax-iad, if those are failing anyway like noonedeadpunk said, that shouldn't matter much and would avoid future failures, even if it would make that region temporarily unusable  [15:21]
<fungi> openstack.exceptions.ResourceTimeout: Timeout waiting for Task:27aebd60-0c80-4461-af96-a6dfb4bda21c to transition to success  [15:21]
<fungi> that's what the sdk is reporting, i think  [15:21]
<frickler> yes, I saw that, is that the glance task or a nodepool task?  [15:22]
<fungi> it's coming from openstacksdk so i take that to mean glance task (the backend "tasks api" glance has for things like image conversions)  [15:24]
<frickler> both "openstack image task list" and "openstack image list" seem to take a very long time for rax-iad from bridge, not sure what the actual timeout will be, but no response in a couple of minutes  [15:30]
<frickler> in the ci tenant both work fine, so it must have something to do with the jenkins tenant being overloaded somehow  [15:32]
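
For reference, a minimal openstacksdk sketch of the same glance task listing; the clouds.yaml entry name and region are assumptions and would need adjusting to match what bridge actually uses:

    #!/usr/bin/env python3
    # List recent glance tasks for one region via openstacksdk, roughly what
    # `openstack image task list` does. Cloud/region names are assumptions.
    import openstack

    conn = openstack.connect(cloud='openstackjenkins', region_name='IAD')

    # glance's tasks API is what image imports/conversions run through; a pile
    # of long-stuck tasks here would line up with the upload timeouts
    for task in conn.image.tasks():
        print(task.id, task.type, task.status, task.created_at)
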
* frickler needs to step away for a bit, will be back later  [15:33]
<fungi> it's possible that we're bogging it down with old leaked images or something, i'll see if there's anything we should clean up  [16:23]
<ajaiswal> any idea how to change company affiliation?  [17:11]
<JayF> ajaiswal: in stackalytics?  [17:13]
<JayF> if that's what you're trying, a PR that looks like this-ish: https://review.opendev.org/q/project:x%2Fstackalytics%20status:open  [17:14]
<JayF> but they don't get merged often at all, literally months and months of delay if not more  [17:14]
<JayF> stackalytics is not an official opendev/openinfra thing  [17:14]
<ajaiswal> Thanks @Jayf  [17:19]
<fungi> ajaiswal: if you want to change your affiliation with the openinfra foundation and project activity tracked in bitergia rather than in stackalytics, you do that in your openinfra profile: click the log in button in the top-right corner of https://openinfra.dev/ and then after you're done logging in click the profile button in the top-right corner, then scroll down to the bottom where it says  [17:27]
<fungi> "affiliations" and add/remove/edit them as desired, clicking the update button when you're done  [17:27]
<fungi> you need to make sure that your preferred email address in your gerrit account is one of the (primary, second or third) addresses in your openinfra profile, since that's how they get associated with one another  [17:28]
<frickler> oh, nice, I didn't know about that either, seems my account didn't have any affiliations by default  [18:23]
<frickler> fungi: image list works for DFW and ORD, but it seems that we have a high number of leaked images in those two regions, too, like about 10 images per distro/version instead of the expected 2, old things like f29 and upward  [18:56]
<frickler> I cannot list image tasks in those two regions, either, so it seems like something is messed up in that regard  [18:56]
<fungi> i hadn't found a chance yet to look at the dashboard, but am checking it now  [19:29]
<fungi> sorting the images in rax-iad by creation date, there are tons going back as far as 2019  [19:31]
<fungi> i'm going to bulk delete any that aren't from this month, as a start  [19:31]
<fungi> in good news, we've already built and uploaded new centos-8-stream and centos-9-stream images since the builder cleanup earlier  [19:39]
<fungi> unfortunately, `openstack image list --private` takes something like 10 minutes to return an empty response from rax-iad, presumably an internal timeout of some kind there  [19:48]
<fungi> i won't be surprised if my image deletion attempts through their web dashboard meet with a similar fate  [19:48]
<fungi> trying to delete 680 images from prior to july 27, except for the two gentoo images we have that are approximately a year old, since gentoo hasn't built successfully any more recently than that  [19:55]
<fungi> i picked july 27 as the cut-off because we have at least some images mentioned in nodepool image-list in rax-iad from 10 days ago  [19:56]
<fungi> but nothing listed older than that aside from the two gentoo images  [19:57]
<fungi> given our average image sizes, this is ~4.5tb of images i'm deleting  [19:58]
<frickler> wow, that's quite a bit  [20:02]
<fungi> ugh, the dashboard reports 23 of 680 images deleted, 657 errors  [20:02]
<fungi> the errors are all "Service unavailable. Please check https://status.rackspace.com"  [20:03]
<fungi> i think i may have caused a problem trying to delete too many images at once  [20:04]
<fungi> at least it gives me a "retry failed actions" button, so i'll give that a shot after i see some of these pending deletions switch to deleted  [20:04]
<JayF> I will attest that, at least when I worked there, the dashboard was unaware of and did not consider rate limiting at all  [20:05]
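
A minimal sketch of doing the same cleanup through the API instead of the dashboard, throttled so as not to trip rate limits; the cutoff date and gentoo exception mirror what was described above, while the cloud and region names are assumptions:

    #!/usr/bin/env python3
    # Delete private images created before a cutoff, skipping the two old
    # gentoo images, pausing between deletes as a crude rate limit.
    # Cloud/region names are assumptions.
    import time
    import openstack

    CUTOFF = '2023-07-27'  # ISO date strings compare lexicographically

    conn = openstack.connect(cloud='openstackjenkins', region_name='IAD')

    for image in conn.image.images(visibility='private'):
        if image.created_at >= CUTOFF:
            continue
        if 'gentoo' in (image.name or ''):
            continue  # last good gentoo builds are ~a year old; keep them
        print('deleting', image.id, image.name, image.created_at)
        conn.image.delete_image(image, ignore_missing=True)
        time.sleep(5)  # crude throttle between delete calls
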
<fungi> huh, actually my `openstack image list` that took forever was for our control plane tenant, not our nodepool tenant, so i'm trying that again. maybe it will return actual results  [20:21]
<fungi> indeed, it returns... 1210 entries  [20:21]
<fungi> so worth noting, ~50% of the images there are from the past 10 days. i have a feeling the glance task timeouts are causing us to leak images because the upload eventually completes but nodepool doesn't know it needs to delete anything there  [20:22]
<fungi> i'll do a couple of `openstack image list --private` listings about an hour apart, then find any uuids which are present in both lists but not present in `nodepool image-list`  [20:25]
<fungi> i'll try to delete those slowly  [20:25]
<fungi> that should prevent us from deleting any images nodepool actually knows about  [20:26]
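
A minimal sketch of that cross-check, assuming the listings are saved to (hypothetically named) files first and that `nodepool image-list` output includes the provider-side image uuids:

    #!/usr/bin/env python3
    # uuids present in both glance listings (taken ~an hour apart) but absent
    # from nodepool's own records are candidates for leaked uploads. Assumes
    # the listings were saved beforehand, e.g.:
    #   openstack image list --private -f value -c ID > glance-1.txt
    #   nodepool image-list > nodepool.txt
    import re

    UUID_RE = re.compile(r'[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}')

    def uuids(path):
        with open(path) as f:
            return set(UUID_RE.findall(f.read()))

    leaked = (uuids('glance-1.txt') & uuids('glance-2.txt')) - uuids('nodepool.txt')
    print('\n'.join(sorted(leaked)))
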
<fungi> looking into the *-goaccess-report job failures, i think the problem arose when we replaced static01 with the newer static02 in april. the job is failing on the changed host key  [21:21]
<fungi> https://opendev.org/opendev/system-config/src/branch/master/playbooks/periodic/goaccess.yaml#L16  [21:22]
<fungi> yeah, ip addresses there are wrong too  [21:23]
<fungi> i'll push up a change  [21:24]
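
A minimal sketch of gathering the data for that change, assuming the job connects to static.opendev.org (the exact hostname the playbook uses is an assumption); it prints the current host keys and addresses to paste into the hardcoded known_hosts entry:

    #!/usr/bin/env python3
    # Fetch the server's current ssh host keys and addresses so the hardcoded
    # known_hosts entry in the playbook can be updated.
    import socket
    import subprocess

    HOST = 'static.opendev.org'  # assumed target of the goaccess job

    # equivalent of running `ssh-keyscan static.opendev.org`
    keys = subprocess.run(['ssh-keyscan', HOST], capture_output=True, text=True)
    print(keys.stdout.strip())

    # current addresses, to compare against what the playbook hardcodes
    for family, _, _, _, sockaddr in socket.getaddrinfo(
            HOST, 22, proto=socket.IPPROTO_TCP):
        print(sockaddr[0])
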
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Correct static known_hosts entry for goaccess jobs  https://review.opendev.org/c/opendev/system-config/+/890698  [21:29]
<fungi> infra-root: ^  [21:29]
<ianw> fungi: https://review.opendev.org/c/opendev/system-config/+/562510/1/tools/rax-cleanup-image-uploads.py might help, but iirc that was also working around a shade bug with leaked object bits that is no longer an issue  [22:10]
<fungi> noted, thanks!  [22:21]
<opendevreview> Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/890365  [22:42]
<fungi> johnsom: ^ it probably also is in merge conflict with the current branch state  [22:45]
<fungi> at least i think that's the reason for the latest error comment  [22:45]
<johnsom> Yep, looks like it  [22:45]
<fungi> and we need gtema to +1 that and the cyborg one  [22:46]
<opendevreview> Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/890365  [22:48]
<opendevreview> Merged openstack/project-config master: Fix app-intel-ethernet-operator reviewers group  https://review.opendev.org/c/openstack/project-config/+/890569  [23:46]
