Thursday, 2023-08-10

<frickler> so (some of) the image uploads look successful in the "image list" output, lots of new images there, but somehow that status doesn't end up in nodepool. I guess I'll revert the revert for now until we have some better idea. maybe increase the timeout to 2h?  07:23
<opendevreview> Dr. Jens Harbott proposed openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad""  https://review.opendev.org/c/openstack/project-config/+/890850  07:24
<frickler> so as an example the last one in the list says "created_at: 2023-08-10T02:28:04Z", "updated_at: 2023-08-10T03:59:58Z", if that is the relevant timespan, it would explain the timeout on the nodepool side  07:27
<opendevreview> Merged openstack/project-config master: Revert "Revert "Temporarily pause image uploads to rax-iad""  https://review.opendev.org/c/openstack/project-config/+/890850  07:39
<opendevreview> gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms  https://review.opendev.org/c/openstack/project-config/+/890921  08:12
<opendevreview> gnuoy proposed openstack/project-config master: Add OpenStack K8S Telemetry charms  https://review.opendev.org/c/openstack/project-config/+/890921  09:30
<fungi> frickler: what i was observing yesterday is the "updated at" times on the images correspond to the "expiration" times on their import tasks. not sure why a task expiration results in an update to the image record though  11:51
<fungi> when i get a break between today's meetings i'll try manually un-pausing one image at a time and see if they upload successfully when there's not a bunch all going at once  11:52
<frickler> fungi: I'm currently trying that, I also added timeout=7200 to the create_image call, will see how that goes. last attempt failed but maybe there were still other uploads in progress after the revert  12:11
<fungi> ah okay  12:12
<fungi> what i ended up doing was checking once a minute in a loop until the image with the expected name appeared in the output of `openstack image list` because none of the timestamps associated with the image or import task seemed to correspond to when that happens  12:13
<fungi> date ; time sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image create --disk-format=vhd --file=/opt/debian-bookworm-0000000167.vhd test-image-create-`date +%s` ; while ! ( sudo openstack --os-cloud=openstackjenkins-rax --os-region-name=IAD image list | grep -q test-image-create ) ; do sleep 60 ; done ; date  12:14
<fungi> that's basically how i checked it  12:15
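A rough Python equivalent of that shell loop, assuming openstacksdk's cloud-layer create_image and reusing the cloud/region/file names from the command above (all of them illustrative here):

    import time

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
    name = 'test-image-create-%d' % int(time.time())

    # Kick off the upload; wait=False returns once the upload/import is
    # started instead of blocking until glance reports the image active.
    conn.create_image(name, filename='/opt/debian-bookworm-0000000167.vhd',
                      disk_format='vhd', wait=False)

    # Poll the image list once a minute until an image with that name shows
    # up, since none of the task/image timestamps seem to mark availability.
    start = time.time()
    while not any(img.name == name for img in conn.image.images()):
        time.sleep(60)
    print('image appeared after %.0f seconds' % (time.time() - start))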
<frickler> I'll need to check how the sdk actually does the waiting. too bad the osc image create doesn't simply have a --wait like server create does  12:57
*** amoralej is now known as amoralej|lunch  13:05
<fungi> from what i can tell, glance doesn't know at upload what the image id will be, it returns the import task id instead, and then you have to poll that to find out what the image id is once it completes  13:08
<frickler> fungi: yes, but IMO that whole thing could/should be wrapped by sdk+osc, no need for each user to reinvent the wheel like in your script  13:10
<fungi> oh, yes i agree there  13:11
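For reference, a sketch of the polling fungi describes, assuming openstacksdk's image proxy and glance's tasks API; the task id below is only a placeholder for whatever id the upload hands back:

    import time

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')

    # The upload returns an import task id rather than an image id.
    task = conn.image.get_task('b233e83c-ecd6-4360-8a2b-d294668c2792')
    while task.status not in ('success', 'failure'):
        time.sleep(30)
        task = conn.image.get_task(task.id)

    if task.status == 'success':
        # For glance import tasks the result payload carries the new image id.
        print('image id:', (task.result or {}).get('image_id'))
    else:
        print('import failed:', task.message)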
<frickler> #status log manually updated the jitsi web cert on meetpad01 and restarted the jitsi web container  13:12
<opendevstatus> frickler: finished logging  13:12
<fungi> thanks for doing that frickler! hopefully it renews normally next time  13:26
*** amoralej|lunch is now known as amoralej  13:45
<frickler> 2023-08-10 15:29:13,173 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 8549.978593384847  15:35
<frickler> openstack.exceptions.ResourceTimeout: Timeout waiting for Task:b233e83c-ecd6-4360-8a2b-d294668c2792 to transition to success  15:36
<frickler> still no success with 2h timeout :(  15:36
<frickler> I'm also getting a 503 now for image list  15:37
<frickler> hmm, weird, the 503 only happens when I do "openstack image list | grep something", without the pipe, it returns within a couple of seconds  15:40
<Clark[m]> are all image uploads to rax failing?  18:13
<Clark[m]> I'm not quite sure I understand the situation there  18:13
<fungi> Clark[m]: mainly just rax-iad, seems like  18:17
<fungi> but the problem is that every task timeout seems to result in another leaked image  18:17
<fungi> so the builders are just leaking lost images into that region in a loop because they continually retry the uploads  18:18
<fungi> probably we could address this somehow on the builder side, if there's a way to get it to recognize images it made even though it thought the image creation never succeeded  18:19
<Clark[m]> So the upload ultimately succeeds but the API calls fail?  18:19
<fungi> at its worst, what i observed is that the builder would upload the image, give up waiting for the import task to complete in the cloud, and then ~5 hours later the image shows up in the image list there  18:19
<Clark[m]> I wonder if this is something we should engage rax on and fix it on their end  18:19
<fungi> i don't think the api calls fail  18:20
<Clark[m]> Oh is the timeout in nodepool polling for the image to go active?  18:20
<fungi> the way it seems to work is that the image upload returns a task id (not an image id) and then the builder watches that task to see when it completes, at which point its details include an image id  18:20
<fungi> but also the import task has an expiration time on it, and the images are appearing (sometimes hours) after that expiration  18:21
<fungi> so yes, from an api perspective there seems to be some sort of disconnect, since the task state goes to success even though the image still doesn't exist for a while even after that, implying there's some additional background process happening after the import task but before it shows up as available  18:22
<frickler> I don't think the task itself has a timeout  18:22
<Clark[m]> Does make me wonder if there is some sort of network issue between swift and glance in that region if the data move is taking that long  18:22
<frickler> it's the sdk that times out after 1h by default  18:22
<fungi> frickler: not a timeout, but it has an "expiration" time (which i think is supposed to be when it ceases to be included in the task list, except it's not actually getting purged from the task list after that expiration)  18:23
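A small sketch of how that comparison could be pulled from the API, assuming openstacksdk's task objects expose expires_at and that import task results carry the image id as described above; the type filter is an assumption:

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')

    # Index current images by id, then line each import task's expiration
    # up against the updated_at on the image it produced.
    images = {img.id: img for img in conn.image.images()}
    for task in conn.image.tasks(type='import'):
        img = images.get((task.result or {}).get('image_id'))
        print(task.id, task.status,
              'task expires_at:', task.expires_at,
              'image updated_at:', img.updated_at if img else 'not listed')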
<frickler> the upload I triggered earlier seems to have finished after about 5h  18:23
<frickler> so maybe I'll make another try with a 12h timeout in nodepool  18:23
<frickler> that might not be a permanent solution if we want to continue to rotate images every 24h, but it would give another datapoint  18:24
<fungi> fwiw, my test uploads from bridge01 with nodepool uploads to that region paused took around 35 minutes from start of upload until it appeared in the image list. i have a feeling two builders continuously trying to upload 15 images is contributing to the performance problem there, thundering herd style  18:24
<frickler> well we stopped these for now  18:25
<fungi> yeah, but also it took 5+ hours before i was getting reasonable test upload results  18:25
<fungi> so i think stopping the uploads doesn't immediately solve it, because there's still a bunch of lingering load the cloud is trying to catch back up from  18:25
<fungi> my first test upload shortly after we paused that region in the builders took around 5 hours, but once that one was done my subsequent (one at a time) uploads took a little over half an hour  18:27
<frickler> | 0000232814 | 0000000003 | rax-iad           | ubuntu-bionic             | ubuntu-bionic-1691694393             | 018c7205-83de-4eeb-a645-6f2259ce3710 | ready     | 00:00:38:36  |  20:31
<frickler> fungi: the patched nodepool is running on nb01, if you want you can try some other uploads there, otherwise I'll continue tomorrow  20:33
<fungi> frickler: what was the patch? just the increased timeout?  20:35
<frickler> yes, though this one even took much less than 1h, so would have worked with default nodepool, too: 2023-08-10 19:52:07,018 DEBUG nodepool.OpenStackAdapter.rax-iad: API call create_image in 2733.277339382097  20:36
<fungi> frickler: right, i think that supports my theory. if the builder had a per-provider concurrency setting so we could tell it "only upload one image at a time to rax-iad" then it might also work  20:38
<Clark[m]> Is that timeout hard coded or is it configurable?  20:38
<Clark[m]> I wonder if we should make it configurable if it isn't already  20:39
<frickler> it is configurable in sdk, but currently not in nodepool  20:39
<frickler> https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L561-L569  20:40
<frickler> sdk default is timeout=3600, I added timeout=36000 on nb01  20:40
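As an illustration of that change (not the actual nb01 patch), assuming the builder ultimately lands in openstacksdk's cloud-layer create_image as the linked adapter lines do; the image name and file path are placeholders:

    import openstack

    conn = openstack.connect(cloud='openstackjenkins-rax', region_name='IAD')
    image = conn.create_image(
        'ubuntu-bionic-1691694393',        # placeholder image name
        filename='/path/to/image.vhd',     # placeholder path
        disk_format='vhd',
        wait=True,
        timeout=36000,  # sdk default is 3600 seconds; 36000 matches the nb01 bump
    )
    print(image.id)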
<frickler> Clark[m]: not sure if you saw that, we also have a lot of nodes stuck in deleting in iad and dfw  20:41
<frickler> and with that I'm out for today, cu  20:42
<fungi> yeah, seems like the options around image uploads in rax-iad are 1. increase timeout to >5 hours, 2. add a max concurrency for uploads to the provider, 3. trickle images into it and try to avoid creating a thundering herd, or 4. disable that region entirely until it gets fixed  20:43
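Option 2 is hypothetical here (nodepool has no per-provider upload concurrency setting as discussed in this log); a minimal sketch of the general idea, with an invented max_concurrent_uploads knob:

    import threading

    class ProviderUploadThrottle:
        """Allow at most N concurrent image uploads to a single provider."""

        def __init__(self, max_concurrent_uploads=1):
            self._sem = threading.Semaphore(max_concurrent_uploads)

        def upload(self, do_upload, *args, **kwargs):
            # Blocks until a slot frees up, so a provider like rax-iad only
            # ever sees max_concurrent_uploads uploads in flight at once.
            with self._sem:
                return do_upload(*args, **kwargs)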
<fungi> thanks frickler!  20:43
<Clark[m]> I suspect adding max uploads per provider isn't too difficult as the total uploads are already configurable and limited. That said I think we should talk to rax before doing that as 8 image uploads shouldn't make things fall over, particularly since we upload to swift first and glance is doing an in-cloud transfer  20:46
<fungi> well, it was 15 at the same time, but yes. regardless it's something that needs fixing on their end ultimately  20:47
<fungi> i find it hard to believe they're unaware, since it looks like this has been going on for a long time and got considerably worse around 2 weeks ago  20:48
<fungi> but a big part of why i cleaned up those 1200-odd leaked images is that i wanted one less thing for them to balk at if we open a ticket  20:49
<fungi> and a lot of the testing was to better quantify the symptom we're seeing  20:49
<fungi> i have a feeling if we open a ticket that says "when we try to simultaneously upload 15 images around 25gb each, your service inexplicably gets slow" we're going to get a response like "please stop"  20:50
<fungi> also there's still a bunch of leaked images that need cleaning up in the other two regions though only about a third each what we had in iad  20:51
<Clark[m]> ++ to cleanup and testing. I just don't think reducing 15 uploads at a time to one is something we should prioritize on our end as it indicates a cloud problem first  20:54
<fungi> sure, i agree  20:54
<Clark[m]> Aiui the whole point of using the task system with a swift upload was to make multiple image uploads of any size a trivial action  20:55
<Clark[m]> Since swift would handle the data loads and is already designed to do so  20:55
<fungi> to me there are two questions: what more should we do/collect before opening a ticket, and should we also do anything to try to continue using that region in the meantime or just turn it off for now?  20:55
<fungi> if we do want to try to get the images there updated in the meantime so we won't need to turn it off (they're approaching two weeks old in some cases), we have a few possible options for that  20:57
<Clark[m]> Maybe pausing the region entirely. Benchmarking a single upload of a representative image. Then reporting that is all we can do for reasonable bug reporting? That way our noisy neighbor problem is minimized  20:58
<Clark[m]> (and I think we did that, it's 5 hours?)  20:58
<fungi> well yes and no. if we benchmark a manual upload right after pausing the mass uploads it's around 5.5 hours. if we pause everything and wait 5 or 6 hours for things to get back to normal, it's a little over half an hour  20:59
<fungi> we've independently done this a couple of times over the past two days, and it seems to be consistent. basically it's like there's a lingering backlog glance is still working through after we pause our uploads, and that takes hours to catch back up  21:00
<Clark[m]> I see  21:03
<fungi> so unfortunately the single upload timing under normal circumstances is really only a half hour delay, which isn't all that indicative of a problem on its own  21:05
<fungi> it just seems to get much worse if we try to upload a bunch at once (and stays terrible for a while even after we stop)  21:05
<Clark[m]> 15*0.5 is suspiciously close to 5 hours too. I wonder if things are getting serialized  21:08
<Clark[m]> On the cloud side I mean  21:08
<fungi> that's certainly what it seems like to me, yes  21:08
<clarkb> given that (hey I managed to connect to irc) I feel like engaging the cloud is probably a good idea since that seems like a problem  22:31
<fungi> agreed, it's going to take some careful phrasing though so that we describe our experience in ways that don't seem outside realistic expectation, and i've burnt my available creative writing capacity for the day on openstack security bugs instead  22:43
<JayF> -.-  22:49
<clarkb> I didn't start today with creative writing ability  22:50
<clarkb> the jet lag is real  22:50
<JayF> ty for the help on all that fungi; not how I wanted the day to go for sure either lol  22:50
<fungi> JayF: oh, it's not just that one! much as i wish it were  22:51
<JayF> =|  22:52
<clarkb> fungi: only iad is affected? seems like since we can point out ord and dfw don't suffer this it is reasonable that iad be like the others  22:52
<fungi> clarkb: i can't say *only* iad is affected, but certainly it seems much worse than the other two we use  22:52
<fungi> and is the only one misbehaving badly enough to have become unusable for us now  22:53
<clarkb> ya so maybe that is the best way to frame it. IAD suffers compared to the others  22:53
<fungi> without being party to the underlying details of the problem, it may merely be a matter of degree  22:54
<clarkb> and that way we can show this behavior is at least unexpected  22:54
