Saturday, 2021-07-03

00:00 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub  https://review.opendev.org/c/zuul/zuul-jobs/+/799338
00:25 <corvus> fungi, clarkb: apparently that xenial node is failing to boot in bhs1
00:25 <clarkb> oh hrm
00:25 <corvus> it's on attempt #2
00:26 <fungi> timeouts?
00:26 <clarkb> corvus: do you have the node id so we can see the traceback?
00:26 <corvus> yeah
00:26 <corvus> 2021-07-03 00:17:16,437 INFO nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Node 6aae1b30-8d51-4c9f-95bb-ee96be8bd1e3 scheduled for cleanup
00:26 <corvus> from nl04
00:26 <corvus> that was attempt #1
00:26 <fungi> we unfortunately see frequent timeouts in a number of our providers, i wonder if we need to tune the launchers to wait longer
00:27 <clarkb> openstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up.
00:27 <corvus> fungi: i'm not sure i want to wait >10 minutes for a node to boot
00:27 <corvus> though tbh, i don't know why we wait 10 minutes 3 times
00:28 <corvus> i would like it to just fail immediately so we can actually fall back on another provider
00:28 <clarkb> we retry 3 times in the code regardless of failure type and then timeouts can cause failure
00:28 <clarkb> but ya in the case of timeouts falling back to another provider may be a good idea if you have >1 able to fulfill the request
00:28 <corvus> yeah, i just think we've had a lot of push-pull over the years about failure conditions
00:28 <fungi> i agree treating timeout failures differently makes sense
00:29 <clarkb> you do want to retry if it is your only provider remaining I think
00:29 <corvus> it doesn't make sense to set a timeout then wait 3 times, because it would (as fungi suggested initially) be better to wait 30m once than 10m 3 times :)
00:29 <fungi> though it does probably increase the chances that we return node_failure if we don't have a given label in many providers
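A rough sketch of the launch behaviour being discussed: a per-attempt boot timeout retried a fixed number of times, versus failing fast on a timeout so the request can fall back to another provider. This is a hypothetical illustration, not Nodepool's launcher code; the create_server(wait=True, timeout=...) call and ResourceTimeout exception are openstacksdk (matching the traceback clarkb pasted), while the attempt count, timeout value, and fail_fast_on_timeout flag are made-up parameters for the example.

    import openstack
    from openstack.exceptions import ResourceTimeout

    def launch_with_retries(conn, server_args, attempts=3, boot_timeout=600,
                            fail_fast_on_timeout=False):
        # With attempts=3 and boot_timeout=600 this is the "10 minutes, 3 times"
        # pattern discussed above; fail_fast_on_timeout=True models giving up
        # immediately so another provider can try to satisfy the request.
        for attempt in range(1, attempts + 1):
            try:
                return conn.create_server(
                    wait=True,             # block until ACTIVE or timeout
                    timeout=boot_timeout,  # seconds to wait for the server
                    **server_args,         # name, image, flavor, ...
                )
            except ResourceTimeout:
                # "Timeout waiting for the server to come up."
                if fail_fast_on_timeout:
                    raise  # surface the failure so another provider can try
                # otherwise retry within the same provider
        raise RuntimeError("all %d launch attempts timed out" % attempts)

    # usage (cloud, image, and flavor names are assumptions):
    # conn = openstack.connect(cloud="ovh-bhs1")
    # launch_with_retries(conn, {"name": "xenial-test",
    #                            "image": "ubuntu-xenial",
    #                            "flavor": "some-flavor"})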
00:29 <clarkb> ok I need to check on dinner plans and related activities. I'll watch the zuul status
00:30 <corvus> 2021-07-03 00:27:18,229 ERROR nodepool.NodeLauncher: [e: 56c26ac0e5cf4a35a42ab59289ef7fcf] [node_request: 200-0014656534] [node: 0025380180] Launch attempt 2/3 failed:
00:30 <corvus> so only another 8 minutes of waiting until we can move on
00:30 <fungi> another possibility is that something has happened with our xenial images and they're going to timeout everywhere
00:30 <corvus> yeah, or xenial on bhs
00:30 <fungi> i guess we'll find out when it switches to another provider/region
00:31 <corvus> oh hey it's running
00:31 <corvus> attempt #3 worked
00:31 <fungi> third try's a charm? ;)
00:31 <fungi> yeesh
00:31 <corvus> maybe 3x 10m timeout isn't as terrible as i thought? i dunno
00:32 <corvus> i'm happy to be proven wrong if it means the job starts running ;)
00:35 <opendevreview> Merged zuul/zuul-jobs master: Ignore errors when deleting tags from dockerhub  https://review.opendev.org/c/zuul/zuul-jobs/+/799338
01:06 <corvus> i'm restarting all of zuul again to pick up today's bugfixes
01:09 <corvus> #status log restarted all of zuul on commit 10966948d723ea75ca845f77d22b8623cb44eba4 to pick up stats and zk watch bugfixes
01:09 <opendevstatus> corvus: finished logging
01:16 <corvus> restoring queues
01:19 <clarkb> corvus: a few jobs error'd in the openstack periodic queue
01:19 <clarkb> but check seems happy
01:19 <corvus> clarkb: i'm guessing that's an artifact of the re-enqueue?
01:19 <fungi> are we going to need to clean up any leaked znodes from before?
01:20 <clarkb> corvus: ya I'm wondering if periodic jobs don't reenqueue cleanly
01:20 <corvus> fungi: no znodes leaked afaik, only watches, which have already disappeared due to closing the connections they were on
01:20 <corvus> i think i saw that from the last re-enqueue
01:21 <fungi> oh, got it, so watches aren't represented by their own znodes, they're a separate structure?
01:21 <clarkb> ya a watch is a mechanism on top of the znodes (that may or may not be there)
01:22 <corvus> yep, it's entirely an in-memory construct on a single zk server and associated with a single client connection; as soon as that connection is broken, it's gone
01:22 <fungi> right, for some reason i had it in my head that watches were also znodes, pointing at znode paths which may or may not exist
01:22 <fungi> makes more sense now
01:23 <fungi> so restarting zuul would have also temporarily cleaned up the leaked watches
01:23 <clarkb> yes
01:23 <fungi> even before the fix
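For context on the watch discussion: Zuul talks to ZooKeeper through the kazoo library, and a watch is just a one-shot callback the server keeps in memory against the current client session, not a znode. A minimal sketch; the hostname and the znode path are illustrative assumptions, not taken from the deployment config:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk04.opendev.org:2181")  # hostname is an assumption
    zk.start()

    def on_change(event):
        # A plain watch fires at most once, then must be re-registered.
        print("watch fired:", event.type, event.path)

    # Registering the watch stores (session, path) in the server's memory only;
    # no znode is created for it.
    zk.exists("/nodepool/requests", watch=on_change)

    # Closing the connection discards every watch tied to it on the server,
    # which is why restarting Zuul also cleared the leaked watches.
    zk.stop()
    zk.close()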
01:24 <clarkb> zk04 went from 720 to 715 watches according to graphite
01:24 <clarkb> that is promising
01:27 <corvus> also, i don't know how much a watch really "costs"; we may have been able to handle considerable growth before it became a problem
01:27 <corvus> i'm just not in the mood to find out that way :)
01:28 <clarkb> ya the admin docs warn against large numbers of watches (but don't specify a scale against hardware) when running the wchp command, but the command returned instantly for us in the 10-20k range without issue
01:28 <clarkb> I suspect "large numbers" in this case means something much larger; ++ to not finding out the hard way though
01:29 <corvus> none of zk04's cacti graphs show any kind of linear growth during the day, nor does the response time or anything like that, so they're probably relatively cheap
01:29 <corvus> clarkb: yeah, seems like the 10k order of magnitude is experimentally okay :)
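As a quick way to reproduce the kind of check clarkb describes, watch counts can be read with ZooKeeper's four-letter-word commands (wchs for a summary, wchp for the per-path listing mentioned above). A small sketch; the hostname is an assumption, and on ZooKeeper 3.5+ these commands must be enabled via the 4lw.commands.whitelist server setting:

    import socket

    def zk_four_letter(host, port, cmd):
        # Send a four-letter-word command ("wchs", "wchp", "stat", ...) and
        # return the server's plain-text response.
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(cmd.encode("ascii"))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", "replace")

    # hostname is an assumption; 2181 is ZooKeeper's default client port
    print(zk_four_letter("zk04.opendev.org", 2181, "wchs"))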
01:54 <opendevreview> Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: defined DIB_FEDORA_SUBRELEASE for f3{3,4}  https://review.opendev.org/c/openstack/diskimage-builder/+/799339
02:24 *** odyssey4me is now known as Guest1321
02:27 <opendevreview> Gonéri Le Bouder proposed openstack/diskimage-builder master: fedora: reuse DIB_FEDORA_SUBRELEASE if set  https://review.opendev.org/c/openstack/diskimage-builder/+/799340
02:27 <opendevreview> Gonéri Le Bouder proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34  https://review.opendev.org/c/openstack/diskimage-builder/+/799341
23:44 *** cloudnull7 is now known as cloudnull
