Tuesday, 2023-08-08

opendevreviewMerged opendev/system-config master: Correct static known_hosts entry for goaccess jobs  https://review.opendev.org/c/opendev/system-config/+/89069805:13
ajaiswalhow can i change my gerrit username ?06:52
fricklerajaiswal: the username is immutable07:10
opendevreviewMerged openstack/project-config master: Allow cyborg-core to act as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/89047507:43
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/89036508:18
*** dhill is now known as Guest828911:37
opendevreviewMerged openstack/project-config master: Allow designate-core as osc/sdk service-core  https://review.opendev.org/c/openstack/project-config/+/89036512:15
*** amoralej is now known as amoralej|lunch12:19
*** amoralej|lunch is now known as amoralej12:53
fricklerinfra-root: seems we are not only having issues with image upload for rax, but also with instance deletion. both IAD and DFW have like 90% of instances stuck deleting14:10
fricklerfrom https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1&from=now-6M&to=now it looks like for dfw it started at the end of may; some instances I checked have that creation date14:11
frickleralso nb01+02 disks seem to be full again, I guess we need to either fix iad uploads or disable them14:14
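On the stuck-deleting instances frickler mentions, a quick way to gauge how widespread the problem is (a sketch; the "rax" cloud name and region spellings are assumptions about the local clouds.yaml, and the Task State column name may vary by client version):

    # count servers whose task state is stuck in "deleting", per region
    for region in IAD DFW; do
        echo -n "$region: "
        openstack --os-cloud rax --os-region-name "$region" \
            server list --long -f value -c "Task State" | grep -ci deleting
    done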
fungiit may be exacerbated by the image count. i'm going to go back to working on that between meetings today14:16
fungii'm slow-deleting 1172 leaked images in rax-iad now, with a 10-second delay between each request. that's roughly 3h15m worth of delays so will take at least that long to complete15:11
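A minimal sketch of that kind of throttled cleanup loop, assuming the leaked image ids have already been collected into a file (leaked-image-ids.txt is a hypothetical name):

    # delete leaked glance images one at a time, sleeping 10s between
    # requests to avoid hammering the API
    while read -r image_id; do
        openstack --os-cloud rax --os-region-name IAD image delete "$image_id"
        sleep 10
    done < leaked-image-ids.txt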
fricklerdoes deleting an image also delete the task(s) associated with it?15:14
fungiglance tasks? no idea really. though a lot of these images are years old so i would be surprised if there are still lingering tasks15:15
fungifor many of them anyway15:15
fungiit deleted ~17 and is now sitting15:17
fungipresumably one request is not returning in a timely manner15:18
fungithat one request has been waiting for approximately 15 minutes now. if a lot of them are like this then it's going to take many orders of magnitude longer than i expected15:25
fungiit eventually started moving again, but only did a few more and has stalled once more15:35
fungiseems like image deletions might be expensive operations and they're getting backlogged15:36
fungior hitting internal timeouts with services not communicating reliably with one another15:37
fungialso it may be fighting with image-related api calls from nodepool, so i agree we should probably pause image uploads to rackspace regions until we can clean things up15:40
fungiwe haven't successfully uploaded anything to rackspace since the builders were fixed yesterday15:40
fungier, to iad that is15:41
fungiwe have uploaded successfully to dfw and ord since then15:41
opendevreviewJeremy Stanley proposed openstack/project-config master: Temporarily pause image uploads to rax-iad  https://review.opendev.org/c/openstack/project-config/+/89080715:49
fungiinfra-root: ^15:49
fungithanks frickler!15:53
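For reference, pausing uploads in the nodepool builder config is done per diskimage under the provider; a sketch of what such a change could look like (the provider and diskimage names here are illustrative, not the actual contents of 890807):

    # nodepool.yaml (excerpt)
    providers:
      - name: rax-iad
        region-name: 'IAD'
        diskimages:
          - name: ubuntu-jammy
            pause: true   # builder keeps building the image but stops uploading it here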
fungiseems just over 40 of my `openstack image delete` calls have returned so far15:54
fungigoing to take a while at this pace15:54
fricklerbut at least there is progress ;)15:55
fungiwell, i don't even know that it actually deleted anything, all i know is that openstackclient isn't reporting back with any errors15:56
fungiafter a while i'll check the image list from that region and see if it's shrinking at all15:56
fungibut it will be easier to tell once that nodepool config change deploys15:56
fricklerif you have a specific image id, "image show" may be faster than list16:07
fricklerlike to verify if at least the first one got deleted16:08
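Both checks frickler suggests can be done straight from osc; a sketch, assuming the same cloud/region names as above:

    # verify a specific deletion: a 404-style error means the image is gone
    openstack --os-cloud rax --os-region-name IAD image show "$IMAGE_ID"

    # watch the overall count shrink (add --private to exclude public images)
    openstack --os-cloud rax --os-region-name IAD image list -f value -c ID | wc -l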
opendevreviewMerged openstack/project-config master: Temporarily pause image uploads to rax-iad  https://review.opendev.org/c/openstack/project-config/+/89080716:12
fungisince that has deployed, i'm taking a baseline image count now16:31
fungi1236 images in rax-iad according to the image list that completed a few minutes ago. i'll check again in an hour16:45
fungii do actually see one image delete error in my terminal scrollback, a rather nondescript 503 response16:46
fungi"service unavailable"16:46
tkajinamI'll check the status tomorrow (it's actually today) but it seems this one is stuck after +2 was voted by zuul... https://review.opendev.org/c/openstack/puppet-openstack-integration/+/89067216:49
fungitkajinam: it looks like it succeeded but never merged. also interesting is that zuul commented twice in the same second about the gate pipeline success for that buildset16:52
fungii'll probably need to look in the debug logs on the schedulers to find out what went wrong16:53
tkajinamah, yes. I didn't notice that duplicate comment.16:53
tkajinamno jobs in the status view. wondering if recheck works here or we should get it submitted forcefully.16:54
fungia recheck might work, unless there's something wrong with the change itself that's causing gerrit to reject the merge. hopefully i'll be able to tell from the service logs, but also i have to get on a conference call (and several back-to-back meetings) so can't look just yet16:57
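One way to see what gerrit currently thinks of the change, without waiting on the service logs, is its anonymous REST API; a sketch (note the )]}' prefix gerrit prepends to JSON responses has to be stripped before parsing):

    # fetch change detail and check the "status" field (NEW vs MERGED)
    curl -s 'https://review.opendev.org/changes/890672/detail' \
        | tail -c +6 | python3 -m json.tool | grep -m1 '"status"'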
tkajinamack. I've triggered recheck... it seems the patch now appears both in check queue and gate queue. I'm leaving soon and will recheck the patch during the day tomorrow (today) but I might need some help if recheck still behaves strangely17:00
fungilooks like decent progress on the image deletions in rax-iad. count is down to 937, so about 200 fewer in roughly an hour17:57
fungimultitasking during the openstack tc meeting, found this in gerrit's error log regarding tkajinam's problem change:18:05
fungi[2023-08-08T15:04:29.860Z] [HTTP POST /a/changes/openstack%2Fpuppet-openstack-integration~stable%2F2023.1~I544fa835ee86a41bb4ba4bf391857b8a64750af2/ (zuul from [2001:4800:7819:103:be76:4eff:fe04:42c2])] WARN  com.google.gerrit.server.update.RetryHelper : REST_WRITE_REQUEST was attempted 3 times [CONTEXT project="openstack/puppet-openstack-integration" request="REST18:05
fungi/changes/*/revisions/*/review" ]18:05
fungii wonder if that means it failed to write the merge to the filesystem?18:05
fungi[Tue Aug  8 14:47:45 2023] INFO: task jbd2/dm-0-8:823 blocked for more than 120 seconds.18:06
fungithat's from dmesg18:06
fungianother at 14:49:4618:07
fungii think that's the cinder volume our gerrit data lives on18:08
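A couple of commands that help correlate those hung-task warnings with the underlying block devices (a sketch, run on the gerrit server itself):

    # hung-task messages with human-readable timestamps
    sudo dmesg -T | grep 'blocked for more than'

    # map dm-0 / vda1 back to device-mapper names, sizes and mountpoints
    lsblk
    sudo dmsetup ls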
fungiinfra-root: just a heads up that we may be seeing communication issues with the cinder volume for our gerrit data18:10
fungii'll do a quick check over the rackspace tickets/status page18:11
fungino, wait, we moved it to vexxhost a while back18:11
fungiguilhermesp_____: mnaser: just a heads up we may be seeing cinder communication issues in ca-ymq-118:13
fungiso the two events in dmesg were for dm-0 (the gerrit data volume) at 14:47:45 utc, and vda1 (the rootfs) at 14:49:46 utc18:23
fungithese are the timestamps from dmesg though, which are notoriously inaccurate, so easily a few minutes off from the actual events that were logged18:23
fungiside note, 890672 did end up merging successfully after tkajinam rechecked it18:24
fungifor vexxhost peeps who may see this in scrollback, the server is 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a and the volumes it logged problems writing to are de64f4c6-c5c8-4281-a265-833236b78480 and then d1884ff4-528d-4346-a172-6c9abafb8cdf in chronological order18:26
fungidown to 780 images in rax-iad now, so still chugging along18:35
Clark[m]frickler: fungi: pretty sure deleting images on glance does not delete any related import tasks or related swift data. This is my major complaint with that system after the "it's completely undocumented and non standard" part. Basically the service imposes massive amounts of state tracking on the end user and you don't really know it is necessary either20:12
fungifwiw, i can't tell how to get osc to list glance tasks20:27
fungibut i haven't gone digging in the docs yet, just context help20:27
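For what it's worth, the Images v2 API does define a tasks endpoint even though osc has no obvious command for it; a sketch of querying it directly, assuming the deployment exposes it at all (the endpoint URL is a placeholder):

    # list glance tasks via the raw v2 API
    TOKEN=$(openstack --os-cloud rax --os-region-name IAD token issue -f value -c id)
    curl -s -H "X-Auth-Token: $TOKEN" 'https://<glance-endpoint>/v2/tasks' | python3 -m json.tool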
fungirax-iad image count is down to 241, so my initial cleanup pass will hopefully finish within the hour20:32
fungidone, leaving 171 images in the list. nodepool says it knows about 34, so i'll put together a new deletion list for the other 13720:55
fungioh, actually that wasn't --private so the count included public images too. actual private image count is 130 meaning we have 96 still to clean up20:58
fungier, 92 actually leaked21:00
fungislow-deleting those now21:01
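A sketch of how such a deletion list can be rebuilt by diffing glance's view against nodepool's (the awk field index for nodepool's external image id column is a guess and would need checking against the real table layout):

    # image ids glance reports for the project
    openstack --os-cloud rax --os-region-name IAD image list --private -f value -c ID \
        | sort > glance-ids.txt

    # external image ids nodepool is tracking for this provider
    nodepool image-list | grep rax-iad \
        | awk -F'|' '{gsub(/ /, "", $7); print $7}' | sort > nodepool-ids.txt

    # anything glance has that nodepool does not know about is a leak candidate
    comm -23 glance-ids.txt nodepool-ids.txt > leaked-image-ids.txt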
Clark[m]I don't know if you can list tasks. The whole glance task system exists in a corner of no documentation and little testing21:09
fungialso frustration and gnashing of teeth21:10
fungiokay, after the latest pass, `openstack image list` for iad has 39 entries in rax-iad compared to the 34 from `nodepool image-list`21:23
fungii'll see if i can work out what's up with the other 521:23
fungino, i keep counting the decorative lines21:24
fungiit's really 35, so only 1 that nodepool has no record of21:24
fungium, it's a rockylinux-9 image uploaded 2023-08-08T20:13:42Z (a little over an hour ago)21:27
fungidid the upload pause not actually work?21:27
fungithough i guess the good news is that the builders are successfully uploading images to rax-iad now21:27
fungithe bad news is that i may have deleted some recently uploaded images21:27
Clark[m]If the upload was in progress when you paused I think it is allowed to finish21:27
Clark[m]But I'm not sure of that21:28
fungithe deploy pipeline reported success on the pause change at 16:25:18, so almost 4 hours before that was uploaded21:29
fungii guess worst case we'll have some boot failures in rax-iad for the next ~day21:30
fungioh, actually that image is also "leaked" (it doesn't appear in nodepool image-list)21:31
fungibut still suggests that the builders are continuing to upload images21:32
fungiand yeah, no successful uploads to rax-iad still according to nodepool image-list21:34
fungiokay, i think these are deferred image uploads being processed by glance tasks. the last image uploads logged were at 17:4021:41
fungiso i guess i'll watch for a while to see if the image count goes up any more21:42
fungiyeah, now a bionic image just showed up, created timestamp 2023-08-08T20:48:04Z updated at 2023-08-08T21:37:55Z21:44
fungiimage name on that one was ubuntu-bionic-1691511227 and `date -d@1691511227` reports "2023-08-08T16:13:47 UTC"21:48
fungiso yes, seems like glance tasks are running on the uploads, but are severely backlogged, and so nodepool gives up after the timeout response thinking the upload failed, but then the upload actually appears hours later21:48
fungiand since nodepool assumes it was never there, a leak ensues21:49
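Since nodepool image names appear to embed the build's unix timestamp, the lag between build and the image actually showing up in glance can be estimated like this (a sketch using the example name above):

    # decode the epoch suffix of a nodepool image name
    name=ubuntu-bionic-1691511227
    date -u -d "@${name##*-}" +%Y-%m-%dT%H:%M:%SZ   # 2023-08-08T16:13:47Z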
fungiif the backlog is predictable, then we should hopefully cease seeing new ones appear after roughly 23:00 utc21:52
fungiso about another hour21:53
fungii'll try to check back then and do another cleanup pass, or may not get to it until tomorrow21:53
*** dviroel_ is now known as dviroel22:13
fungithe count seems to have stabilized with only a few stragglers. i'll clean them up in a bit23:01
