Thursday, 2023-08-10

<frickler> so (some of) the image uploads look successful in the "image list" output, lots of new images there, but somehow that status doesn't end up in nodepool. I guess I'll revert the revert for now until we have some better idea. maybe increase the timeout to 2h?  07:23
<opendevreview> Dr. Jens Harbott proposed openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad""  https://review.opendev.org/c/openstack/project-config/+/890850  07:24
<frickler> so as an example the last one in the list says "created_at: 2023-08-10T02:28:04Z", "updated_at: 2023-08-10T03:59:58Z", if that is the relevant timespan, it would explain the timeout on the nodepool side  07:27
<opendevreview> Merged openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad""  https://review.opendev.org/c/openstack/project-config/+/890850  07:39
<opendevreview> gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms  https://review.opendev.org/c/openstack/project-config/+/890921  08:12
<opendevreview> gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms  https://review.opendev.org/c/openstack/project-config/+/890921  09:30
<fungi> frickler: what i was observing yesterday is the "updated at" times on the images correspond to the "expiration" times on their import tasks. not sure why a task expiration results in an update to the image record though  11:51
<fungi> when i get a break between today's meetings i'll try manually un-pausing one image at a time and see if they upload successfully when there's not a bunch all going at once  11:52
<frickler> fungi: I'm currently trying that, I also added timeout=7200 to the create_image call, will see how that goes. last attempt failed but maybe there were still other uploads in progress after the revert  12:11
<fungi> ah okay  12:12
<fungi> what i ended up doing was checking once a minute in a loop until the image with the expected name appeared in the output of `openstack image list` because none of the timestamps associated with the image or import task seemed to correspond to when that happens  12:13
<fungi> date ; time sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image create --disk-format=vhd --file=/opt/debian-bookworm-0000000167.vhd test-image-create-`date +%s` ; while ! ( sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image list | grep -q test-image-create ) ; do sleep 60 ; done ; date  12:14
<fungi> that's basically how i checked it  12:15
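A rough Python equivalent of that shell loop, assuming openstacksdk's cloud-layer create_image and reusing the cloud/region/file names from the command above (all of them illustrative here):

    import time

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
    name = 'test-image-create-%d' % int(time.time())

    # Kick off the upload; wait=False returns once the upload/import is
    # started instead of blocking until glance reports the image active.
    conn.create_image(name, filename='/opt/debian-bookworm-0000000167.vhd',
                      disk_format='vhd', wait=False)

    # Poll the image list once a minute until an image with that name shows
    # up, since none of the task/image timestamps seem to mark availability.
    start = time.time()
    while not any(img.name == name for img in conn.image.images()):
        time.sleep(60)
    print('image appeared after %.0f seconds' % (time.time() - start))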
<frickler> I'll need to check how the sdk actually does the waiting. too bad the osc image create doesn't simply have a --wait like server create does  12:57
*** amoralej is now known as amoralej|lunch  13:05
<fungi> from what i can tell, glance doesn't know at upload what the image id will be, it returns the import task id instead, and then you have to poll that to find out what the image id is once it completes  13:08
<frickler> fungi: yes, but IMO that whole thing could/should be wrapped by sdk+osc, no need for each user to reinvent the wheel like in your script  13:10
<fungi> oh, yes i agree there  13:11
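For reference, a sketch of the polling fungi describes, assuming openstacksdk's image proxy and glance's tasks API; the task id below is only a placeholder for whatever id the upload hands back:

    import time

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')

    # The upload returns an import task id rather than an image id.
    task = conn.image.get_task('b233e83c-ecd6-4360-8a2b-d294668c2792')
    while task.status not in ('success', 'failure'):
        time.sleep(30)
        task = conn.image.get_task(task.id)

    if task.status == 'success':
        # For glance import tasks the result payload carries the new image id.
        print('image id:', (task.result or {}).get('image_id'))
    else:
        print('import failed:', task.message)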
<frickler> #status log manually updated the jitsi web cert on meetpad01 and restarted the jitsi web container  13:12
<opendevstatus> frickler: finished logging  13:12
<fungi> thanks for doing that frickler! hopefully it renews normally next time  13:26
*** amoralej|lunch is now known as amoralej  13:45
<frickler> 2023-08-10 15:29:13,173 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 8549.978593384847  15:35
<frickler> openstack.exceptions.ResourceTimeout: Timeout waiting for Task:b233e83c-ecd6-4360-8a2b-d294668c2792 to transition to success  15:36
<frickler> still no success with 2h timeout :(  15:36
<frickler> I'm also getting a 503 now for image list  15:37
<frickler> hmm, weird, the 503 only happens when I do "openstack image list | grep something", without the pipe, it returns within a couple of seconds  15:40
<Clark[m]> are all image uploads to rax failing?  18:13
<Clark[m]> I'm not quite sure I understand the situation there  18:13
<fungi> Clark[m]: mainly just rax-iad, seems like  18:17
<fungi> but the problem is that every task timeout seems to result in another leaked image  18:17
<fungi> so the builders are just leaking lost images into that region in a loop because they continually retry the uploads  18:18
<fungi> probably we could address this somehow on the builder side, if there's a way to get it to recognize images it made even though it thought the image creation never succeeded  18:19
<Clark[m]> So the upload ultimately succeeds but the API calls fail?  18:19
<fungi> at its worst, what i observed is that the builder would upload the image, give up waiting for the import task to complete in the cloud, and then ~5 hours later the image shows up in the image list there  18:19
<Clark[m]> I wonder if this is something we should engage rax on and fix it on their end  18:19
<fungi> i don't think the api calls fail  18:20
<Clark[m]> Oh is the timeout in nodepool polling for the image to go active?  18:20
<fungi> the way it seems to work is that the image upload returns a task id (not an image id) and then the builder watches that task to see when it completes, at which point its details include an image id  18:20
<fungi> but also the import task has an expiration time on it, and the images are appearing (sometimes hours) after that expiration  18:21
<fungi> so yes, from an api perspective there seems to be some sort of disconnect, since the task state goes to success even though the image still doesn't exist for a while even after that, implying there's some additional background process happening after the import task but before it shows up as available  18:22
<frickler> I don't think the task itself has a timeout  18:22
<Clark[m]> Does make me wonder if there is some sort of network issue between swift and glance in that region if the data move is taking that long  18:22
<frickler> it's the sdk that times out after 1h by default  18:22
<fungi> frickler: not a timeout, but it has an "expiration" time (which i think is supposed to be when it ceases to be included in the task list, except it's not actually getting purged from the task list after that expiration)  18:23
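A small sketch of how that comparison could be pulled from the API, assuming openstacksdk's task objects expose expires_at and that import task results carry the image id as described above; the type filter is an assumption:

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')

    # Index current images by id, then line each import task's expiration
    # up against the updated_at on the image it produced.
    images = {img.id: img for img in conn.image.images()}
    for task in conn.image.tasks(type='import'):
        img = images.get((task.result or {}).get('image_id'))
        print(task.id, task.status,
              'task expires_at:', task.expires_at,
              'image updated_at:', img.updated_at if img else 'not listed')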
<frickler> the upload I triggered earlier seems to have finished after about 5h  18:23
<frickler> so maybe I'll make another try with a 12h timeout in nodepool  18:23
<frickler> that might not be a permanent solution if we want to continue to rotate images every 24h, but it would give another datapoint  18:24
<fungi> fwiw, my test uploads from bridge01 with nodepool uploads to that region paused took around 35 minutes from start of upload until it appeared in the image list. i have a feeling two builders continuously trying to upload 15 images is contributing to the performance problem there, thundering herd style  18:24
<frickler> well we stopped these for now  18:25
<fungi> yeah, but also it took 5+ hours before i was getting reasonable test upload results  18:25
<fungi> so i think stopping the uploads doesn't immediately solve it, because there's still a bunch of lingering load the cloud is trying to catch back up from  18:25
<fungi> my first test upload shortly after we paused that region in the builders took around 5 hours, but once that one was done my subsequent (one at a time) uploads took a little over half an hour  18:27
<frickler> | 0000232814 | 0000000003 | rax-iad           | ubuntu-bionic             | ubuntu-bionic-1691694393             | 018c7205-83de-4eeb-a645-6f2259ce3710 | ready     | 00:00:38:36  |  20:31
<frickler> fungi: the patched nodepool is running on nb01, if you want you can try some other uploads there, otherwise I'll continue tomorrow  20:33
<fungi> frickler: what was the patch? just the increased timeout?  20:35
<frickler> yes, though this one even took much less than 1h, so would have worked with default nodepool, too: 2023-08-10 19:52:07,018 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 2733.277339382097  20:36
<fungi> frickler: right, i think that supports my theory. if the builder had a per-provider concurrency setting so we could tell it "only upload one image at a time to rax-iad" then it might also work  20:38
<Clark[m]> Is that timeout hard coded or is it configurable?  20:38
<Clark[m]> I wonder if we should make it configurable if it isn't already  20:39
<frickler> it is configurable in sdk, but currently not in nodepool  20:39
<frickler> https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L561-L569  20:40
<frickler> sdk default is timeout=3600, I added timeout=36000 on nb01  20:40
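As an illustration of that change (not the actual nb01 patch), assuming the builder ultimately lands in openstacksdk's cloud-layer create_image as the linked adapter lines do; the image name and file path are placeholders:

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
    image = conn.create_image(
        'ubuntu-bionic-1691694393',        # placeholder image name
        filename='/path/to/image.vhd',     # placeholder path
        disk_format='vhd',
        wait=True,
        timeout=36000,  # sdk default is 3600 seconds; 36000 matches the nb01 bump
    )
    print(image.id)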
<frickler> Clark[m]: not sure if you saw that, we also have a lot of nodes stuck in deleting in iad and dfw  20:41
<frickler> and with that I'm out for today, cu  20:42
<fungi> yeah, seems like the options around image uploads in rax-iad are 1. increase timeout to >5 hours, 2. add a max concurrency for uploads to the provider, 3. trickle images into it and try to avoid creating a thundering herd, or 4. disable that region entirely until it gets fixed  20:43
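Option 2 is hypothetical here (nodepool has no per-provider upload concurrency setting as discussed in this log); a minimal sketch of the general idea, with an invented max_concurrent_uploads knob:

    import threading

    class ProviderUploadThrottle:
        """Allow at most N concurrent image uploads to a single provider."""

        def __init__(self, max_concurrent_uploads=1):
            self._sem = threading.Semaphore(max_concurrent_uploads)

        def upload(self, do_upload, *args, **kwargs):
            # Blocks until a slot frees up, so a provider like rax-iad only
            # ever sees max_concurrent_uploads uploads in flight at once.
            with self._sem:
                return do_upload(*args, **kwargs)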
<fungi> thanks frickler!  20:43
<Clark[m]> I suspect adding max uploads per provider isn't too difficult as the total uploads are already configurable and limited. That said I think we should talk to rax before doing that as 8 image uploads shouldn't make things fall over, particularly since we upload to swift first and glance is doing an in-cloud transfer  20:46
<fungi> well, it was 15 at the same time, but yes. regardless it's something that needs fixing on their end ultimately  20:47
<fungi> i find it hard to believe they're unaware, since it looks like this has been going on for a long time and got considerably worse around 2 weeks ago  20:48
<fungi> but a big part of why i cleaned up those 1200-odd leaked images is that i wanted one less thing for them to balk at if we open a ticket  20:49
<fungi> and a lot of the testing was to better quantify the symptom we're seeing  20:49
<fungi> i have a feeling if we open a ticket that says "when we try to simultaneously upload 15 images around 25gb each, your service inexplicably gets slow" we're going to get a response like "please stop"  20:50
<fungi> also there's still a bunch of leaked images that need cleaning up in the other two regions though only about a third each what we had in iad  20:51
<Clark[m]> ++ to cleanup and testing. I just don't think reducing 15 uploads at a time to one is something we should prioritize on our end as it indicates a cloud problem first  20:54
<fungi> sure, i agree  20:54
<Clark[m]> Aiui the whole point of using the task system with a swift upload was to make multiple image uploads of any size a trivial action  20:55
<Clark[m]> Since swift would handle the data loads and is already designed to do so  20:55
<fungi> to me there are two questions: what more should we do/collect before opening a ticket, and should we also do anything to try to continue using that region in the meantime or just turn it off for now?  20:55
<fungi> if we do want to try to get the images there updated in the meantime so we won't need to turn it off (they're approaching two weeks old in some cases), we have a few possible options for that  20:57
<Clark[m]> Maybe pausing the region entirely. Benchmarking a single upload of a representative image. Then reporting that is all we can do for reasonable bug reporting? That way our noisy neighbor problem is minimized  20:58
<Clark[m]> (and I think we did that, it's 5 hours?)  20:58
<fungi> well yes and no. if we benchmark a manual upload right after pausing the mass uploads it's around 5.5 hours. if we pause everything and wait 5 or 6 hours for things to get back to normal, it's a little over half an hour  20:59
<fungi> we've independently done this a couple of times over the past two days, and it seems to be consistent. basically it's like there's a lingering backlog glance is still working through after we pause our uploads, and that takes hours to catch back up  21:00
<Clark[m]> I see  21:03
<fungi> so unfortunately the single upload timing under normal circumstances is really only a half hour delay, which isn't all that indicative of a problem on its own  21:05
<fungi> it just seems to get much worse if we try to upload a bunch at once (and stays terrible for a while even after we stop)  21:05
<Clark[m]> 15*0.5 is suspiciously close to 5 hours too. I wonder if things are getting serialized  21:08
<Clark[m]> On the cloud side I mean  21:08
<fungi> that's certainly what it seems like to me, yes  21:08
<clarkb> given that (hey I managed to connect to irc) I feel like engaging the cloud is probably a good idea since that seems like a problem  22:31
<fungi> agreed, it's going to take some careful phrasing though so that we describe our experience in ways that don't seem outside realistic expectation, and i've burnt my available creative writing capacity for the day on openstack security bugs instead  22:43
<JayF> -.-  22:49
<clarkb> I didn't start today with creative writing ability  22:50
<clarkb> the jet lag is real  22:50
<JayF> ty for the help on all that fungi; not how I wanted the day to go for sure either lol  22:50
<fungi> JayF: oh, it's not just that one! much as i wish it were  22:51
<JayF> =|  22:52
<clarkb> fungi: only iad is affected? seems like since we can point out ord and dfw don't suffer this it is reasonable that iad be like the others  22:52
<fungi> clarkb: i can't say *only* iad is affected, but certainly it seems much worse than the other two we use  22:52
<fungi> and is the only one misbehaving badly enough to have become unusable for us now  22:53
<clarkb> ya so maybe that is the best way to frame it. IAD suffers compared to the others  22:53
<fungi> without being party to the underlying details of the problem, it may merely be a matter of degree  22:54
<clarkb> and that way we can show this behavior is at least unexpected  22:54
