Monday, 2023-08-07

<opendevreview> Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Fix baseurl for Fedora versions before 36  https://review.opendev.org/c/openstack/diskimage-builder/+/890650  [12:38]
*** amoralej is now known as amoralej|lunch  [13:30]
*** amoralej|lunch is now known as amoralej  [13:54]
<fungi> looks like we've got image upload (or build) issues. last successful uploads of ubuntu-jammy anywhere were early utc on wednesday  [14:07]
<fungi> yeah, that was the last time we successfully built it  [14:08]
<fungi> df says /opt is at 100% of 2tb used on nb01, nb02. nb03 is unreachable (no longer exists?), and nb04 (our arm64 builder) is the only one currently able to build new images  [14:10]
<fungi> i'll work on cleaning up nb01 and nb02 now  [14:10]
<frickler> seems all x86 builds have been failing since that date, all possibly interesting build logs have been rotated away since then, so we'll have to watch new logs after the cleanup  [14:23]
<fungi> probably all hitting enospc  [14:24]
<fungi> cleanup is slow. so far only freed about 2% of /opt on each server  [14:25]
<frickler> yes, I think that may have been taking a couple of hours on earlier occasions  [14:25]
<fungi> earlier occasions were probably also before we bumped the volume to 2tb  [14:26]
<fungi> i'll have to take a look at the graphs, but this may be the first time we've filled them since increasing the available space  [14:26]
<frickler> ah, right, that's very likely the case  [14:27]
<frickler> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=68032&rra_id=all  [14:28]
<frickler> first incident in almost a year  [14:28]
<fungi> so we have a leak of something in there, albeit slow  [14:28]
<frickler> but it also went very fast, 2-3 days from roughly 50% to 100%  [14:29]
<frickler> seems there's also some slow leak happening, but that wouldn't have filled the disk for another couple of years  [14:30]
<fungi> so something likely went sideways around the end of july, i guess we'll have a better idea what that might be once we're building images again  [14:30]
<frickler> that might coincide with the F36 repo archival, but yes, we need new logs to be sure  [14:31]
<fungi> yeah, maybe f36 image builds started failing in a loop leaving trash behind  [14:32]
<noonedeadpunk> actually, f36 builds are failing for sure, though I've got a patch for it: https://review.opendev.org/c/openstack/diskimage-builder/+/890650  [14:37]
<frickler> hmm, looking at dib-image-list and image-list, the hypothesis of failing image uploads seems more likely to me. on some providers images are 9 days old, others only 5-6. so if nodepool had to store more than 2 iterations of each image, that may also cause an increase in disk usage  [14:38]
<fungi> yeah, i noticed the uploads in inmotion-iad3 were a couple of days older than in rax-dfw  [14:41]
<frickler> for inmotion I see a lot of delete image failures, but also these for rax-iad:  [14:43]
<frickler> 2023-08-01 04:07:36,069 ERROR nodepool.builder.UploadWorker.6: Failed to upload build 0000011493 of image rockylinux-8 to provider rax-iad  [14:43]
<frickler> so what's happening in iad? where is that, anyway? ;)  [14:44]
<fungi> iad is the airport code for dulles international in virginia (suburb of washington dc). it's notable as the location of the mae east telecommunications switching hub that traditionally handled a majority of trans-atlantic lines  [14:46]
<frickler> nodepool-builder.log.2023-07-30_12:2023-07-30 18:18:19,636 ERROR nodepool.builder.UploadWorker.4: Failed to upload build 0000000039 of image debian-bookworm to provider rax-iad  [14:47]
<frickler> this is where it started  [14:47]
<frickler> hundreds of similar failures in the days after that  [14:48]
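
A minimal sketch of how those failures could be tallied from the rotated nodepool-builder logs, assuming a /var/log/nodepool location and uncompressed rotations (compressed ones would need decompressing first); the error format matches the lines quoted above:

    #!/usr/bin/env python3
    # Count "Failed to upload" errors per provider across rotated builder logs.
    # The log directory and file name pattern are assumptions and may differ.
    import collections
    import glob
    import re

    LOG_GLOB = '/var/log/nodepool/nodepool-builder.log*'  # assumed location
    pattern = re.compile(
        r'Failed to upload build (\S+) of image (\S+) to provider (\S+)')

    counts = collections.Counter()
    for path in sorted(glob.glob(LOG_GLOB)):
        with open(path, errors='replace') as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    counts[m.group(3)] += 1  # provider name

    for provider, count in counts.most_common():
        print(f'{provider}: {count} failed uploads')
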
<fungi> i wonder if we've reached our glance quota there?  [14:48]
<fungi> or if rax-dfw where the builders reside is having communications issues  [14:48]
<fungi> okay, clearing out /opt/dib_tmp only freed up about 200-300gb on each builder  [15:10]
<fungi> which implies we have a ton of data that's not leftover tempfiles from dib  [15:11]
<fungi> on nb01, /opt/nodepool_dib has 1.6tb of content, 1.7tb on nb02  [15:13]
<fungi> looks like that's where we store the built images  [15:13]
<fungi> so i guess that's all spoken for (each format for a couple of versions of each distro/version)  [15:15]
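
A minimal sketch of this kind of check, using only the standard library; it reports overall /opt usage plus totals for the two directories mentioned above:

    #!/usr/bin/env python3
    # Summarize /opt usage and the size of the big consumers on a builder.
    # The paths match what was discussed here but are otherwise assumptions.
    import os
    import shutil

    def du(path):
        """Return total size in bytes of all files under path."""
        total = 0
        for root, _dirs, files in os.walk(path, onerror=lambda e: None):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    pass
        return total

    usage = shutil.disk_usage('/opt')
    print(f'/opt: {usage.used / 1e12:.2f}TB used of {usage.total / 1e12:.2f}TB')
    for path in ('/opt/dib_tmp', '/opt/nodepool_dib'):
        if os.path.isdir(path):
            print(f'{path}: {du(path) / 1e12:.2f}TB')
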
<frickler> can we create a ticket with rax for them to check the upload issues with iad?  [15:16]
<fungi> are they still happening?  [15:18]
<frickler> I think you stopped the containers now? last failure on nb01 was: 2023-08-07 13:52:29,041 ERROR nodepool.builder.UploadWorker.7: Failed to upload build 0000067622 of image debian-buster to provider rax-iad  [15:19]
<fungi> #status log Cleared out /opt/dib_tmp on nb01 and nb02, restoring their ability to build new amd64 images for the first time since early on 2023-08-02  [15:19]
<opendevstatus> fungi: finished logging  [15:19]
<fungi> frickler: yeah, i've started them again now that i'm done checking  [15:20]
<fungi> i'll see if there's anything useful in the traceback from that one  [15:20]
<frickler> we could try to manually delete all or some images on rax-iad, if those are failing anyway like noonedeadpunk said, that shouldn't matter much and would avoid future failures, even if it would make that region temporarily unusable  [15:21]
<fungi> openstack.exceptions.ResourceTimeout: Timeout waiting for Task:27aebd60-0c80-4461-af96-a6dfb4bda21c to transition to success  [15:21]
<fungi> that's what the sdk is reporting, i think  [15:21]
<frickler> yes, I saw that, is that the glance task or a nodepool task?  [15:22]
<fungi> it's coming from openstacksdk so i take that to mean glance task (the backend "tasks api" glance has for things like image conversions)  [15:24]
<frickler> both "openstack image task list" and "openstack image list" seem to take a very long time for rax-iad from bridge, not sure what the actual timeout will be, but no response in a couple of minutes  [15:30]
<frickler> in the ci tenant both work fine, so it must have something to do with the jenkins tenant being overloaded somehow  [15:32]
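
For reference, a minimal openstacksdk sketch of the same glance task listing; the clouds.yaml entry name and region are assumptions and would need adjusting to match what bridge actually uses:

    #!/usr/bin/env python3
    # List recent glance tasks for one region via openstacksdk, roughly what
    # `openstack image task list` does. Cloud/region names are assumptions.
    import openstack

    conn = openstack.connect(cloud='openstackjenkins', region_name='IAD')

    # glance's tasks API is what image imports/conversions run through; a pile
    # of long-stuck tasks here would line up with the upload timeouts
    for task in conn.image.tasks():
        print(task.id, task.type, task.status, task.created_at)
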
* frickler needs to step away for a bit, will be back later  [15:33]
<fungi> it's possible that we're bogging it down with old leaked images or something, i'll see if there's anything we should clean up  [16:23]
<ajaiswal> any idea how to change company affiliation?  [17:11]
<JayF> ajaiswal: in stackalytics?  [17:13]
<JayF> if that's what you're trying, a PR that looks like this-ish: https://review.opendev.org/q/project:x%2Fstackalytics%20status:open  [17:14]
<JayF> but they don't get merged often at all, literally months and months of delay if not more  [17:14]
<JayF> stackalytics is not an official opendev/openinfra thing  [17:14]
<ajaiswal> Thanks @Jayf  [17:19]
<fungi> ajaiswal: if you want to change your affiliation with the openinfra foundation and project activity tracked in bitergia rather than in stackalytics, you do that in your openinfra profile: click the log in button in the top-right corner of https://openinfra.dev/ and then after you're done logging in click the profile button in the top-right corner, then scroll down to the bottom where it says  [17:27]
<fungi> "affiliations" and add/remove/edit them as desired, clicking the update button when you're done  [17:27]
<fungi> you need to make sure that your preferred email address in your gerrit account is one of the (primary, second or third) addresses in your openinfra profile, since that's how they get associated with one another  [17:28]
<frickler> oh, nice, I didn't know about that either, seems my account didn't have any affiliations by default  [18:23]
<frickler> fungi: image list works for DFW and ORD, but it seems that we have a high number of leaked images in those two regions, too, like about 10 images per distro/version instead of the expected 2, old things like f29 and upward  [18:56]
<frickler> I cannot list image tasks in those two regions, either, so it seems like something is messed up in that regard  [18:56]
<fungi> i hadn't found a chance yet to look at the dashboard, but am checking it now  [19:29]
<fungi> sorting the images in rax-iad by creation date, there are tons going back as far as 2019  [19:31]
<fungi> i'm going to bulk delete any that aren't from this month, as a start  [19:31]
<fungi> in good news, we've already built and uploaded new centos-8-stream and centos-9-stream images since the builder cleanup earlier  [19:39]
<fungi> unfortunately, `openstack image list --private` takes something like 10 minutes to return an empty response from rax-iad, presumably an internal timeout of some kind there  [19:48]
<fungi> i won't be surprised if my image deletion attempts through their web dashboard meet with a similar fate  [19:48]
<fungi> trying to delete 680 images from prior to july 27, except for the two gentoo images we have that are approximately a year old, since gentoo hasn't built successfully any more recently than that  [19:55]
<fungi> i picked july 27 as the cut-off because we have at least some images mentioned in nodepool image-list in rax-iad from 10 days ago  [19:56]
<fungi> but nothing listed older than that aside from the two gentoo images  [19:57]
<fungi> given our average image sizes, this is ~4.5tb of images i'm deleting  [19:58]
<frickler> wow, that's quite a bit  [20:02]
<fungi> ugh, the dashboard reports 23 of 680 images deleted, 657 errors  [20:02]
<fungi> the errors are all "Service unavailable. Please check https://status.rackspace.com"  [20:03]
<fungi> i think i may have caused a problem trying to delete too many images at once  [20:04]
<fungi> at least it gives me a "retry failed actions" button, so i'll give that a shot after i see some of these pending deletions switch to deleted  [20:04]
<JayF> I will attest that, at least when I worked there, the dashboard was unaware of and did not consider rate limiting at all  [20:05]
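
A minimal sketch of doing the same cleanup through the API instead of the dashboard, throttled so as not to trip rate limits; the cutoff date and gentoo exception mirror what was described above, while the cloud and region names are assumptions:

    #!/usr/bin/env python3
    # Delete private images created before a cutoff, skipping the two old
    # gentoo images, pausing between deletes as a crude rate limit.
    # Cloud/region names are assumptions.
    import time
    import openstack

    CUTOFF = '2023-07-27'  # ISO date strings compare lexicographically

    conn = openstack.connect(cloud='openstackjenkins', region_name='IAD')

    for image in conn.image.images(visibility='private'):
        if image.created_at >= CUTOFF:
            continue
        if 'gentoo' in (image.name or ''):
            continue  # last good gentoo builds are ~a year old; keep them
        print('deleting', image.id, image.name, image.created_at)
        conn.image.delete_image(image, ignore_missing=True)
        time.sleep(5)  # crude throttle between delete calls
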
<fungi> huh, actually my `openstack image list` that took forever was for our control plane tenant, not our nodepool tenant, so i'm trying that again. maybe it will return actual results  [20:21]
<fungi> indeed, it returns... 1210 entries  [20:21]
<fungi> so worth noting, ~50% of the images there are from the past 10 days. i have a feeling the glance task timeouts are causing us to leak images because the upload eventually completes but nodepool doesn't know it needs to delete anything there  [20:22]
<fungi> i'll do a couple of `openstack image list --private` listings about an hour apart, then find any uuids which are present in both lists but not present in `nodepool image-list`  [20:25]
<fungi> i'll try to delete those slowly  [20:25]
<fungi> that should prevent us from deleting any images nodepool actually knows about  [20:26]
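
A minimal sketch of that cross-check, assuming the listings are saved to (hypothetically named) files first and that `nodepool image-list` output includes the provider-side image uuids:

    #!/usr/bin/env python3
    # uuids present in both glance listings (taken ~an hour apart) but absent
    # from nodepool's own records are candidates for leaked uploads. Assumes
    # the listings were saved beforehand, e.g.:
    #   openstack image list --private -f value -c ID > glance-1.txt
    #   nodepool image-list > nodepool.txt
    import re

    UUID_RE = re.compile(r'[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}')

    def uuids(path):
        with open(path) as f:
            return set(UUID_RE.findall(f.read()))

    leaked = (uuids('glance-1.txt') & uuids('glance-2.txt')) - uuids('nodepool.txt')
    print('\n'.join(sorted(leaked)))
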
<fungi> looking into the *-goaccess-report job failures, i think the problem arose when we replaced static01 with the newer static02 in april. the job is failing on the changed host key  [21:21]
<fungi> https://opendev.org/opendev/system-config/src/branch/master/playbooks/periodic/goaccess.yaml#L16  [21:22]
<fungi> yeah, ip addresses there are wrong too  [21:23]
<fungi> i'll push up a change  [21:24]
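
A minimal sketch of gathering the data for that change, assuming the job connects to static.opendev.org (the exact hostname the playbook uses is an assumption); it prints the current host keys and addresses to paste into the hardcoded known_hosts entry:

    #!/usr/bin/env python3
    # Fetch the server's current ssh host keys and addresses so the hardcoded
    # known_hosts entry in the playbook can be updated.
    import socket
    import subprocess

    HOST = 'static.opendev.org'  # assumed target of the goaccess job

    # equivalent of running `ssh-keyscan static.opendev.org`
    keys = subprocess.run(['ssh-keyscan', HOST], capture_output=True, text=True)
    print(keys.stdout.strip())

    # current addresses, to compare against what the playbook hardcodes
    for family, _, _, _, sockaddr in socket.getaddrinfo(
            HOST, 22, proto=socket.IPPROTO_TCP):
        print(sockaddr[0])
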
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Correct static known_hosts entry for goaccess jobs  https://review.opendev.org/c/opendev/system-config/+/890698  [21:29]
<fungi> infra-root: ^  [21:29]
<ianw> fungi: https://review.opendev.org/c/opendev/system-config/+/562510/1/tools/rax-cleanup-image-uploads.py might help, but iirc that was also working around a shade bug with leaked object bits that is no longer an issue  [22:10]
<fungi> noted, thanks!  [22:21]
<opendevreview> Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/890365  [22:42]
<fungi> johnsom: ^ it probably also is in merge conflict with the current branch state  [22:45]
<fungi> at least i think that's the reason for the latest error comment  [22:45]
<johnsom> Yep, looks like it  [22:45]
<fungi> and we need gtema to +1 that and the cyborg one  [22:46]
<opendevreview> Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/890365  [22:48]
<opendevreview> Merged openstack/project-config master: Fix app-intel-ethernet-operator reviewers group  https://review.opendev.org/c/openstack/project-config/+/890569  [23:46]
