Sunday, 2020-12-20

Alex_GaynorI'm seeing arm64 builds not starting again :-/01:50
Alex_GaynorAnd two just started, maybe it's just very slow.01:52
*** zbr|rover1 has joined #opendev06:47
*** zbr|rover has quit IRC06:49
*** zbr|rover1 is now known as zbr|rover06:49
*** sboyron has joined #opendev08:26
*** sboyron has quit IRC08:26
*** whoami-rajat__ has joined #opendev08:41
*** zbr|rover is now known as zbr09:43
*** hamalq has joined #opendev10:39
*** whoami-rajat__ has quit IRC11:21
*** tosky has joined #opendev11:48
*** DSpider has joined #opendev11:49
*** ysandeep|away is now known as ysandeep12:03
*** hamalq has quit IRC12:39
*** hamalq has joined #opendev13:15
*** hamalq has quit IRC13:20
fungiroughly how long of a delay were you seeing? (was it time from trigger to build starting? or elapsed time from trigger to build completing?)14:14
Alex_Gaynorfungi: from triggering it to the build starting; it was >30 minutes14:17
fungiouch, yeah that doesn't sound good. i'll trace one in the logs and see if i can come up with a timeline14:22
fungias a point of corroboration, the ttr graph at https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1&from=now-24h&to=now does seem to show nodes often taking up to 30 minutes to boot and become reachable in that cloud14:42
Alex_GaynorDoes the "available" state refer to nodes that are booted and can be assigned jobs without delay?14:44
fungiyes, but that environment is configured to only keep one node pre-booted for each configured image type so as to not eat into the quota there too much14:44
fungii also did an initial pass over the logs from around that timeframe to sample the time ranges for each node request from when the launcher first took the lock on it until the node reported ready: http://paste.openstack.org/show/80119814:45
fungithe rows with only one timestamp i think represent nodes which did not become reachable or perhaps failed to boot entirely14:45
fungigoing to look at a few in closer detail and figure out where the delay is being incurred14:46
fungii think 3 minutes may be the launcher's wait timeout too... if so that explains why we don't see any nodes taking longer than 30 minutes to reach a ready state but we do see some which have failed to become ready14:47
fungier, 30 minutes14:47
fungiaha, so looking at a sample failure, it's a 10-minute timeout with three retries14:49
fungiboot request is initiated to the provider, two minutes later the provider indicates the node is in a ready state, but by 10 minutes it's still not reachable via ssh and so the launcher drops it and starts another attempt. after three times of the same it gives up trying to boot that and unlocks the request (so that another launcher could try it, except this is the only provider we have for the relevant node14:51
fungitypes so it just winds up locking it again and starting over)14:52
*** slaweq has joined #opendev14:53
fungispot checking a few more of the failures, that seems to be a consistent pattern14:54
fungiand yeah, looking at some of the nodes which took around 30 minutes the cause seems to have been a combination of taking up to 5 minutes for the cloud to indicate the node was active and then timing out 10 minutes waiting to be able to reach it over ssh and retrying14:58
fungiso i'm going to say it looks like the reason node allocations are taking to long to be satisfied there is that many boot calls are creating servers which take longer than 10 minutes to become reachable over the network (or perhaps are never going to become reachable)14:59
fungikevinz: ^ if you're around, do you have any insight into why that might be happening?15:00
fungiin the case of frequent wait timeouts, our "error node launch attempts" graph is misleading, because it represents node requests where the launcher tried three times consecutively before giving up15:03
*** hamalq has joined #opendev15:09
*** slaweq has quit IRC15:09
*** hamalq has quit IRC15:14
*** hamalq has joined #opendev16:25
*** hamalq has quit IRC16:29
*** hamalq has joined #opendev17:45
*** mrunge_ has joined #opendev18:23
*** mrunge has quit IRC18:24
*** mrunge has joined #opendev18:27
*** mrunge_ has quit IRC18:28
*** mrunge_ has joined #opendev18:32
*** mrunge has quit IRC18:33
*** hamalq has quit IRC18:49
*** slaweq has joined #opendev19:48
*** slaweq has quit IRC21:05
*** slaweq has joined #opendev21:08
*** slaweq has quit IRC21:41

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!