Thursday, 2023-12-14

opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add san support to growvols  https://review.opendev.org/c/openstack/diskimage-builder/+/903265 01:08
amorinhello, do you have an idea why this is not merged? https://review.opendev.org/c/openstack/mistral/+/899245 07:11
frickleramorin: yes, needs a rebase, since a newer version of the patch below it was merged. see the red color of the "merged" text in the relation chain07:49
amorinAh! Gerrit UI is sometimes giving me headaches ;)08:24
fricklerinfra-root: a couple of things I noticed while checking the grafana AFS page: mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months08:55
fricklerubuntu mirrors are also getting close, but we might have another couple of months time there08:56
fricklermirror.centos-stream seems to have a steep increase in the last two months and might also run into quota limits soon08:56
fricklerproject.zuul with the latest releases is getting close to its tight limit of 1GB (sic), I suggest we simply double that08:57
fricklerthen the wheel builds for centos >=8 seem broken, with nobody maintaining these it might be better to drop them?08:59
frickler(context was the recurring discussion of whether we'd have enough space to mirror rocky repos)09:00
fricklerI guess I'll add all of these topics to the meeting agenda so we can follow up after the holidays or so09:01
opendevreviewJames Page proposed openstack/project-config master: sunbeam: retire all single charm repositories  https://review.opendev.org/c/openstack/project-config/+/903666 11:04
opendevreviewJames Page proposed openstack/project-config master: Fix the ACL associated with charm-keystone-ldap-k8s  https://review.opendev.org/c/openstack/project-config/+/903667 11:14
fungiyeah, for openeuler it might be that we simply need to add some filters for things jobs won't need, like we do with other rsync mirrors14:34
*** jamesdenton_ is now known as jamesdenton15:37
fricklerinfra-root: I don't have time to dig now, but we're seeing 100% node_failures in kolla for arm nodes currently15:58
fungifrickler: i'll take a look, probably both providers are offline or full of leaked nodes15:59
fungii like that node failures now indicate the node request id that failed to be satisfied. saves having to hunt it down in the scheduler logs16:05
fungi2023-12-14 15:40:30,787 INFO nodepool.driver.NodeRequestHandler[nl03.opendev.org-PoolWorker.osuosl-regionone-main-0a52d0ebcb6146c2aaf61729723e3ffa]: [e: abd9cd79035c43e3a1f6a20313ffa157] [node_request: 300-0022991976] Not enough quota remaining to satisfy request16:07
fungi2023-12-14 15:41:06,147 INFO nodepool.driver.NodeRequestHandler[nl03.opendev.org-PoolWorker.linaro-regionone-main-ec6303cd8d4e4785a3ada55c7d750d53]: [e: abd9cd79035c43e3a1f6a20313ffa157] [node_request: 300-0022991976] Not enough quota remaining to satisfy request16:08
fungiso both providers were tried for https://zuul.opendev.org/t/openstack/build/68f8cea90b234953a4942a444d027a13 and neither had sufficient quota even after multiple retries16:09
fungii'll see if we can get things cleaned up16:09
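Editor's sketch: the nodepool log lines quoted above share a regular layout (timestamp, level, logger, event id, node request id, message), so entries for one node request can be pulled out of a launcher log with a small parser. The pattern below is inferred only from the lines quoted in this log, not from nodepool's logging configuration, and the names `LOG_RE`, `parse_line` and `group_by_request` are hypothetical. The optional `[node: ...]` field is included because launcher error lines in this log carry it.

```python
import re
from collections import defaultdict

# Layout inferred from the log lines quoted in this channel log (an assumption,
# not taken from nodepool's logging configuration).
LOG_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[^:\s]+): "
    r"\[e: (?P<event>[0-9a-f]+)\] "
    r"\[node_request: (?P<request>[0-9-]+)\] "
    r"(?:\[node: (?P<node>\d+)\] )?"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def group_by_request(lines):
    """Map each node request id to the (level, message) pairs logged for it."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            grouped[record["request"]].append((record["level"], record["message"]))
    return dict(grouped)


if __name__ == "__main__":
    sample = (
        "2023-12-14 15:40:30,787 INFO nodepool.driver.NodeRequestHandler"
        "[nl03.opendev.org-PoolWorker.osuosl-regionone-main-"
        "0a52d0ebcb6146c2aaf61729723e3ffa]: "
        "[e: abd9cd79035c43e3a1f6a20313ffa157] "
        "[node_request: 300-0022991976] "
        "Not enough quota remaining to satisfy request"
    )
    print(group_by_request([sample]))
```

Grouping by request id rather than by provider keeps the INFO-level quota pauses and any ERROR-level launch failures for the same request side by side.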
funginodepool reports 8 arm64 nodes in use in osuosl and 7 in linaro16:13
fungiopenstack server list shows that many active nodes in each provider too16:15
fungineither provider has nodes in other states besides active16:15
fungii'm at a loss to explain16:15
fungiopenstack limits show --absolute also doesn't indicate either one is anywhere near capacity16:18
funginot all arm jobs are failing, https://zuul.opendev.org/t/openstack/build/7d18e32410704563ad81c2ba28181a8b just succeeded a few minutes ago16:19
fungiaha, these seem to be what's causing the actual failures:16:21
funginodepool.exceptions.LaunchStatusException: Server in error state16:21
fungiseeing them come from both osuosl and linaro16:22
fungithe linaro ones seem to be this:16:25
fungi2023-12-14 15:40:45,416 ERROR nodepool.StateMachineNodeLauncher.linaro-regionone: [e: 4c20fc054eaa4392ae50aec65c7bb6e4] [node_request: 300-0022991970] [node: 0036041570] Error in creating the server. Compute service reports fault: No valid host was found.16:25
fungii think what's happening is that it's trying osuosl and getting a softfail (insufficient quota) so then it goes on to linaro and gets a hardfail (no valid host was found)16:26
fungicorvus: ^ does that sound right? if you have two providers, one says "not now" because it has insufficient quota and then the other gets api errors back for all its retries, the result is node_failure not just waiting for available quota in the first provider?16:28
fungii can propose a patch to temporarily lower max-servers in linaro until someone has time to investigate what's happened to some of its compute nodes16:30
fungihttps://grafana.opendev.org/d/391eb7bb3c/nodepool3a-linaro seems to show it topping out at 16 in-use so that seems like a good number for now16:32
opendevreviewJeremy Stanley proposed openstack/project-config master: Temporarily lower max-servers for linaro  https://review.opendev.org/c/openstack/project-config/+/903708 16:36
fungiinfra-root: ^16:36
fungii've gone ahead and self-approved it, since i'm probably the only sysadmin around today16:58
corvusfungi: looking into your q now17:00
fungicorvus: no rush, mainly just making sure i understand the process flow that leads to that situation17:02
corvusfungi: close -- the "not enough quota" is not an error (note the log is at info level); that's explaining why it's about to pause request handling without attempting a launch.  once there is quota available, it proceeds to attempt a launch, and that fails.  this repeats for both providers (yes, both providers were at quota, they waited, they launched, they failed), then request is deemed node_failure17:08
corvusfungi: it looks like one provider is failing with "no valid host" and the other is failing with "server in error state".17:09
corvusfungi: and yeah, lowering the max-servers is a reasonable way to compensate for the cloud lying to us about its capacity :)17:10
fungioh, i got confused and thought the "no valid host" was the api detail for the "server in error state"17:10
fungithanks17:10
corvusfungi: maybe something similar should be done for the other provider?17:10
fungiyeah, i'll see if i can tell what's going on there. the error state nodes may be random and not related to capacity17:11
fungithanks again!17:11
corvusfungi: yeah, i think neither of us completely characterized the error messages -- let's try again :)  it looks like they both put servers in error state, but additionally linaro says "no valid host found" but osuosl doesn't give us the extra info17:12
corvushere's an excerpt from each: https://paste.opendev.org/show/bqiZJwIbOpcDFeJHQ3w6/ 17:12
opendevreviewMerged openstack/project-config master: Temporarily lower max-servers for linaro  https://review.opendev.org/c/openstack/project-config/+/903708 17:12
fungicorvus: aha, that explains my confusion. thanks17:15
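Editor's sketch: the flow corvus describes above (a provider at quota pauses at INFO level without attempting a launch, then launches once quota frees up, the launch fails, the next provider goes through the same steps, and the request ends as node_failure) can be modelled with a toy simulation. This is not nodepool code; `Provider`, `handle_request`, `polls_until_quota` and the retry count of 3 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    polls_until_quota: int  # hypothetical: quota checks that find no room
    launch_ok: bool         # hypothetical: whether the cloud can boot a server


def handle_request(providers, launch_retries=3):
    """Toy model of one node request being offered to each provider in turn."""
    log = []
    for provider in providers:
        # Waiting for quota is not a failure: it is logged at INFO and
        # no launch is attempted until quota is available.
        for _ in range(provider.polls_until_quota):
            log.append(("INFO", provider.name, "not enough quota, pausing"))
        # Quota is available now, so launches are attempted (retry count assumed).
        for attempt in range(1, launch_retries + 1):
            if provider.launch_ok:
                log.append(("INFO", provider.name, "launch succeeded"))
                return "fulfilled", log
            log.append(("ERROR", provider.name, f"launch attempt {attempt} failed"))
    # Every provider waited, launched, and failed.
    return "node_failure", log


if __name__ == "__main__":
    status, log = handle_request(
        [Provider("osuosl", 2, False), Provider("linaro", 1, False)]
    )
    print(status)
    for entry in log:
        print(entry)
```

In this model the quota pauses never contribute to the failure; only the launch errors do, which matches the point that the "not enough quota" message is informational.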
fungiRamereth: if you're around, any idea why server creation in the osuosl openstack arm cloud is sometimes ending with instances in an error state?17:16
fungiwill reducing our utilization help?17:16
fungior is it unrelated to available resources/capacity?17:16
Clark[m]frickler: for stream mirror growth it looks like some packages get new versions but they don't clean up the old packages. This leads to growth. Some of the packages are quite large too iirc. I want to say things like thunderbird?17:36
corvusdockerhub appears to be having intermittent issues again.  just fyi.18:59
opendevreviewMerged openstack/diskimage-builder master: Remove cloud-init when using simple-init  https://review.opendev.org/c/openstack/diskimage-builder/+/899885 19:05
Ramerethfungi: do you have some uuids and timestamps that I can look at? I made a recent change which might be related19:08
fungiRamereth: sorry, disappeared for a late lunch, but i can dig some samples up for sure, just a sec20:50
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add san support to growvols  https://review.opendev.org/c/openstack/diskimage-builder/+/903265 20:56
fungiRamereth: these are the uuids of error-state nodes we created between 15:29:38 and 18:07:27 utc today: https://paste.opendev.org/show/bM1nBxM62nsDeEYuF95H/ 21:19
fungijust a sample, i haven't looked to see how far back this goes but can if it's relevant21:20
Ramerethfungi: thanks, I'll take a look later and get back to you21:27
fungiRamereth: at your convenience, it's not at all urgent. thanks!21:27
*** dmellado2 is now known as dmelladoo21:55
*** dmelladoo is now known as dmellado21:58
