Friday, 2021-10-08

clarkbopenstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up. <- ya I think that the node request may be hitting that and we're just seeing cloudiness00:02
clarkb[e: 17c1c039c9f94bcbaeaaecf81ca119ec] [node_request: 199-0015701818] [node: 0026846253] Launch attempt 1/3 failed on nl0100:02
clarkbyup there it goes, second launch went ready and the job has started00:03
clarkbI will attempt to be less impatient while sorting out dinner00:04
fungii can't figure out why this puppet deploy job failed: https://zuul.opendev.org/t/openstack/build/b0e2fecdbfb8477184ea0f7c833d24f800:22
fungilog on bridge says puppet ended with an exit code of 6 so reported the task failed00:22
fungisyslog on logstash01 doesn't say much about the ansible puppeting00:22
ianwhrm indeed it does go very quiet01:14
opendevreviewMerged opendev/system-config master: Update ICLA to reference OpenInfra  https://review.opendev.org/c/opendev/system-config/+/81305501:14
opendevreviewMerged opendev/system-config master: Update gerritbot-matrix version to include change number in notifications  https://review.opendev.org/c/opendev/system-config/+/81304001:14
ianwError: /Stage[main]/Kibana::Js/Vcsrepo[/opt/kibana/v3.1.2]: Could not evaluate: Execution of '/usr/bin/git fetch --tags origin' returned 1:01:18
ianwit looks like some on-disk git trees are not happy updating01:19
ianwfsck reports "dangling commit 85ddfd6cfbf337ab6f5408bc23aa1faae93c37cf"01:23
ianwerror: cannot lock ref 'refs/remotes/origin/ilm/rollup-v2-action': 'refs/remotes/origin/ilm' exists; cannot create 'refs/remotes/origin/ilm/rollup-v2-action'01:24
ianwlooks like there was a branch called "ilm" and now there's one called "ilm/rollup-v2-action"01:25
ianwi did a "git remote prune origin" and a lot of branches went, "ilm" included01:26
ianw * [pruned] origin/ilm01:27
ianwi feel like this might have fixed it.  there is probably something systematic about the puppet and what it's cloning, but i doubt anyone wants to dive too deep into that01:27
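A minimal sketch of the recovery ianw describes above, assuming a checkout whose remote replaced a plain "ilm" branch with branches nested under "ilm/" (the path is the one from the puppet error earlier in the log):

    # the stale remote-tracking ref "origin/ilm" blocks creation of
    # "origin/ilm/rollup-v2-action", so fetches fail with "cannot lock ref"
    cd /opt/kibana/v3.1.2
    # drop remote-tracking refs that no longer exist on the remote
    git remote prune origin
    # the fetch puppet was attempting should now succeed
    git fetch --tags origin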
fungiahh, thanks! i clearly failed to spot that in the log01:30
fungiprobably a sign it's too late at night for me to be trying to pick apart logs01:31
ianwit really doesn't help that it's not prefixed.  i feel like we had some sort of output filter change to help with that, but i can't remember01:38
opendevreviewMerged zuul/zuul-jobs master: ensure-rust: rework global install  https://review.opendev.org/c/zuul/zuul-jobs/+/81227201:40
*** ysandeep|out is now known as ysandeep03:27
*** ysandeep is now known as ysandeep|afk04:08
*** ykarel|away is now known as ykarel05:15
*** ysandeep|afk is now known as ysandeep05:26
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303405:33
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303405:46
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303405:51
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303406:07
fricklerinfra-root: seems ubuntu images are failing to build with some cert issue. that would also explain why the devstack failure I'm seeing with local testing doesn't show up in CI yet06:27
fricklerE: Failed to fetch https://mirror.dfw.rax.opendev.org/ubuntu/dists/bionic/universe/binary-amd64/Packages  Certificate verification failed: The certificate is NOT trusted. The certificate chain uses expired certificate.  Could not handshake: Error in the certificate verification. [IP: 2001:4800:7819:105:be76:4eff:fe04:9b8a 443]06:28
fricklerlikely some LE fallout still?06:28
frickleryep, the images nodepool uses are 8d old06:32
frickler(for bionic and focal, xenial seems unaffected)06:32
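A quick way to look at what that apt error is complaining about, from any host with openssl; a diagnostic sketch using the mirror host from frickler's paste:

    # dump the full certificate chain the mirror serves; with the Let's Encrypt
    # root changeover, the expired "DST Root CA X3" cross-sign in the chain is
    # what older gnutls/ca-certificates combinations refuse to accept
    echo | openssl s_client -showcerts \
        -connect mirror.dfw.rax.opendev.org:443 \
        -servername mirror.dfw.rax.opendev.org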
opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/81312706:45
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303406:46
ysandeephey folks o/ to me it looks like it's taking longer to get nodes compared to other days.. https://zuul.openstack.org/status#tripleo06:47
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303406:57
ianwfrickler: hrm, that works for me.  but there was something recently about LE changing part of its cert right ...07:12
ianwIdenTrust DST Root CA X3 07:13
fricklerianw: yes, seems to be something in dib or debootstrap not handling that correctly07:15
ianw2021-10-08 07:12:20.889 | I: Validating ca-certificates 20190110ubuntu107:15
ianwi guess that debootstrap doesn't use the updates repo...07:16
ianwwow that is super annoying07:18
fricklerianw: maybe add something with "--extra-suites"?07:23
fricklerbionic certs are even older07:23
opendevreviewIan Wienand proposed openstack/project-config master: nodepool: drop https for ubuntu  https://review.opendev.org/c/openstack/project-config/+/81313507:25
ianwfrickler: ^ i think that is the sanest solution.  otherwise we'd have to do something like a wget of an updates package (that is sure to change anyway)07:26
opendevreviewIan Wienand proposed openstack/project-config master: nodepool: drop https for ubuntu  https://review.opendev.org/c/openstack/project-config/+/81313507:27
ianwusually it uses gpg signed repos and http.  we don't have the signing07:29
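The change itself isn't quoted here, but its effect is roughly the following; a hypothetical env-var sketch (the real change 813135 edits the nodepool builder configuration, and diskimage-builder's ubuntu-minimal element reads its mirror from DIB_DISTRIBUTION_MIRROR):

    # hypothetical: point the image build at an http mirror URL so the chroot's
    # 2019-era ca-certificates never has to verify a TLS chain; as ianw notes
    # above these mirrors are unsigned either way, so https was the only check
    export DIB_DISTRIBUTION_MIRROR="http://mirror.dfw.rax.opendev.org/ubuntu"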
*** jpena|off is now known as jpena07:32
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303407:32
*** odyssey4me is now known as Guest218407:34
fricklerianw: yeah, that looks better already. now let's see whether we trigger https://bugs.launchpad.net/cinder/+bug/1946340 with new images :-S07:40
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303407:42
*** ykarel is now known as ykarel|lunch07:44
*** ysandeep is now known as ysandeep|away07:52
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303407:57
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303408:08
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563808:12
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563808:24
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303408:26
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303408:35
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303408:55
frickler2021-10-08 08:55:32.531 | Build completed successfully08:57
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303409:01
fungiysandeep|away: daily periodic jobs start at ~06:25, roughly 20 minutes before you commented, and they tend to exhaust our available quota and create a bit of a backlog for node requests. see the graphs here: https://grafana.opendev.org/d/5Imot6EMk/zuul-status09:03
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303409:14
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303409:19
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303409:47
fricklerthis also looks like fallout from the new ubuntu image, pip resolver taking ages trying to install things https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_7fd/813149/1/check/openstack-tox-docs/7fd2bee/job-output.txt09:49
*** ysandeep|away is now known as ysandeep09:54
ysandeepfungi: thanks!09:56
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303410:04
fungifrickler: usually it's correctable by a constraints update10:19
fungidoes octavia maybe have some unconstrained dependencies?10:27
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303410:31
fungifrickler: aha, looking, one of the delays there is bandit, which is intentionally unconstrained10:32
fungithat seems to be the main one it complains about taking too long to satisfy10:32
fungioh, i think pylint is another cause there (surfacing through its dep on astroid), same situation though with being unconstrained10:34
fungianyway, the real problem is the kitchen sink approach to having one test-requirements.txt and trying to install it in every tox env even when most of it is unneeded10:37
fungiyou don't need tempest and bandit and pylint installed to do a docs build10:37
fungialso don't need tempest installed to run linters10:38
fungiand don't need linters installed to run tempest10:38
*** ykarel is now known as ykarel|afk10:39
fungiif those jobs used separate sets of deps, this would really be a non-issue because pip would not have to spend so much time rendering a dependency set which satisfies them all10:40
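To illustrate fungi's point, here is roughly what each env actually needs to install; a hedged sketch using common OpenStack file layouts, not necessarily what the project in question does today:

    # docs builds only need the sphinx toolchain, pinned by upper-constraints
    pip install -c https://releases.openstack.org/constraints/upper/master \
        -r doc/requirements.txt
    # a linters env only needs the (deliberately unconstrained) linters
    pip install pylint bandit
    # installing the whole test-requirements.txt in every env instead forces the
    # resolver to reconcile tempest, bandit and pylint together, which is where
    # the backtracking time goes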
*** dviroel|out is now known as dviroel11:10
*** jpena is now known as jpena|lunch11:34
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303411:36
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303411:45
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303411:57
opendevreviewDong Zhang proposed zuul/zuul-jobs master: WIP: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303412:09
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563812:10
*** ykarel|afk is now known as ykarel12:11
*** sshnaidm is now known as sshnaidm|afk12:17
*** ysandeep is now known as ysandeep|brb12:21
*** jpena|lunch is now known as jpena12:21
opendevreviewYuriy Shyyan proposed openstack/project-config master: Disabling inmotion cloud scheduling for upgrades.  https://review.opendev.org/c/openstack/project-config/+/81318112:22
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303412:32
*** ysandeep|brb is now known as ysandeep|afk12:34
opendevreviewMerged openstack/project-config master: Disabling inmotion cloud scheduling for upgrades.  https://review.opendev.org/c/openstack/project-config/+/81318112:41
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303412:58
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:12
*** ysandeep|afk is now known as ysandeep13:19
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:23
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:46
*** lbragstad_ is now known as lbragstad13:51
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:59
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303414:21
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303414:33
fricklerinfra-root: I'd like to go over the nodes with exim paniclogs and clean them up to reduce the amount of daily mails, any objections to that?14:39
fricklerfrom spot checking most seem to stem from when the node was being installed. maybe we should automate some cleaning at the end of the installation14:39
Clark[m]frickler: seems fine with me. If you can confirm the contents are from node creation adding a clean step to our launch node tooling is probably a good idea14:41
fungifrickler: thanks, that would be great. i keep meaning to do it and keep getting distracted by other emergencies14:44
fungiand i agree, when looking at the content of the paniclogs they have generally been about pathological states during server bootstrapping14:45
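A minimal sketch of the cleanup step Clark suggests for the launch-node tooling; hypothetical, since the real tooling might prefer doing this as an Ansible task:

    # exim4's daily cron mails whenever the paniclog is non-empty; if the only
    # entries are from the bootstrap phase, rotate it away at the end of launch
    if [ -s /var/log/exim4/paniclog ]; then
        mv /var/log/exim4/paniclog \
            "/var/log/exim4/paniclog.launch.$(date +%Y%m%d)"
    fi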
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303414:52
opendevreviewGonéri Le Bouder proposed zuul/zuul-jobs master: build-container-image: improve the reliabilty  https://review.opendev.org/c/zuul/zuul-jobs/+/81320315:17
clarkbGoneri: ^ is that related to the retries you were looking at in the zuul matrix room?15:19
clarkbif so I worry that retrying aggressively like that will only make the network less stable15:19
clarkbThe buildset repository is local to the same cloud as the image builds. Those uploads should be on the most stable network segment we have15:20
clarkbThis is me thinking out loud here and wondering if we need to look at this from a different angle15:21
Goneri10 times is a bit aggressive indeed. But 2 or 3 would be enough to save us a couple of jobs every week.15:30
clarkbright, but only if it doesn't make things worse15:31
clarkbwhat has me concerned is that pushing to the buildset registry should be the most reliable thing we do on the network since it is local network15:31
clarkbnow maybe that means we have local network problems in some clouds or maybe the zuul registry cannot handle the level of writes that are happening etc15:31
clarkbbut identifying the underlying issue might be appropriate here rather than asking the system to do more. I believe the buildset registry jobs collect logs for the buildset registry, was there anything in there indicating the registry might have been at fault?15:32
GoneriIn this case, I checked today and we don't have anything in the logs. I understand your point. But it may also be useful to give another chance to a job before canceling everything.15:35
GoneriI suggest moving from 10 retries to 3.15:35
clarkbright I just want to make sure we're understanding the issue before we simply retry15:35
clarkbif this was uploads to docker hub I'd say yes retry a bunch :)15:35
clarkbbut this is retrying to a local service on the same local network and should be very reliable15:36
Gonerioverall our failure rate is pretty low: https://dashboard.zuul.ansible.com/t/ansible/builds?job_name=network-ee-build-container-image-stable-2.9&project=ansible-collections/cisco.iosxr15:37
clarkband the buildset registry logs didn't indicate any unexpected trouble there?15:38
opendevreviewGonéri Le Bouder proposed zuul/zuul-jobs master: build-container-image: improve the reliabilty  https://review.opendev.org/c/zuul/zuul-jobs/+/81320315:38
clarkbif not I guess we can go with the retries15:38
GoneriIt's been like that all 3 times I've checked the registry log; I don't see anything suspicious.15:39
clarkbok +2'd15:39
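For reference, the behaviour being discussed is a bounded retry around the push to the buildset registry; the real role is Ansible in zuul-jobs, so this shell loop is only an illustration, with placeholder image and registry names:

    # try the push a small number of times before failing the job, with a short
    # pause between attempts so transient hiccups can clear
    pushed=0
    for attempt in 1 2 3; do
        if docker push "${BUILDSET_REGISTRY}/example/image:latest"; then
            pushed=1
            break
        fi
        echo "push attempt ${attempt} failed, retrying" >&2
        sleep 10
    done
    [ "${pushed}" -eq 1 ] || exit 1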
*** ysandeep is now known as ysandeep|away15:43
*** marios is now known as marios|out15:55
*** ykarel is now known as ykarel|away16:19
*** jpena is now known as jpena|off16:30
opendevreviewClark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement  https://review.opendev.org/c/opendev/infra-specs/+/80412216:58
clarkbnow with more node-exporter16:58
clarkbcorvus: ianw: frickler: fyi since it was your feedback that drove a lot of that change of opinion in my head too16:58
corvusclarkb: cool, one question inline; also the jobs fail17:18
clarkbcorvus: responded and looking into the test issue now17:27
opendevreviewClark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement  https://review.opendev.org/c/opendev/infra-specs/+/80412217:29
corvusclarkb: one quick test: does it run on xenial17:31
corvusor even trusty17:31
clarkbya that would be a good check, let me see on one of the logstash workers17:47
clarkbit runs on xenial17:52
clarkbcorvus: from what I can see latest glibc's minimal kernel appears to still be 3.2 (which is what the blog post I linked stated it was a while back too)18:04
clarkbI believe it will run on trusty as a result, but will work with fungi to double check when he gets back. I don't want to touch any trusty machines without him around :)18:04
corvuscool, makes sense to me.18:05
clarkband it's a good thing to sanity check. Thank you for bringing that up as I hadn't even considered the possibility initially18:06
clarkbalso there is apparently a way to have golang build with its own internal glibc replacement and then you avoid these issues entirely. But reading node-exporter docs they seem to really want glibc proper (hence the glibc-static on rhel/centos requirement in the build section)18:07
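The "internal glibc replacement" clarkb mentions is Go's pure-Go fallback for the bits that otherwise call into libc; a general sketch of that kind of build, not the node-exporter project's documented process:

    # with cgo disabled the toolchain links nothing against libc, so the binary
    # only depends on the kernel ABI and runs regardless of the host glibc
    CGO_ENABLED=0 go build -o node_exporter ./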
clarkbis anyone else having trouble getting to gerrit?18:22
clarkbseems i can get to it via both ssh ports but not https?18:22
clarkbnow it is there again18:23
clarkbI'm going to assume local networking trouble if others didn't notice similar18:27
clarkbfungi: the mailman3 spec lgtm but I did leave a few notes. Nothing urgent but wanted to make sure you saw them for next week18:49
fricklerclarkb: fungi: btw. https://review.opendev.org/c/openstack/project-config/+/813135 is what ianw and I tested on nb01 earlier, please have a look whether you agree with that or would prefer a more secure, but likely more tedious solution18:58
clarkbhrm, the reason we don't use proper mirrors for those builds is that sometimes our mirror will be behind, and then when we push up new images they are ahead and can't install any packages against our mirrors19:00
fricklerclarkb: iiuc the issue isn't whether we use mirrors or not, but that we don't use -updates and thus only get the ca-certs from the original release19:03
clarkbfrickler: ya I'm just pointing out why we couldn't point it at official mirrors then verify gpg signatures19:04
clarkbit's unfortunate19:04
clarkbcan we do --extra-suites updates? 19:04
fricklerthat's the idea I had earlier, before ianw came up with that simpler solution19:05
fricklerif you want to give that a try, I'd support it19:05
clarkbit looks like dib has a DIB_DEBOOTSTRAP_EXTRA_ARGS var we could use to pass in --extra-suites I think19:06
* frickler won't do much more today19:06
clarkbfrickler: ya enjoy your weekend. I'll push up a change that tries to use ^ and we can compare the two I guess19:06
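What that alternative would look like, roughly; DIB_DEBOOTSTRAP_EXTRA_ARGS is the diskimage-builder knob clarkb names above, but whether --extra-suites actually fixes this is exactly what the follow-up below is about (suite name shown is for bionic):

    # hypothetical: ask debootstrap to also pull from the -updates suite so the
    # initial chroot starts with a current ca-certificates/gnutls stack
    export DIB_DEBOOTSTRAP_EXTRA_ARGS="--extra-suites=bionic-updates"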
clarkbhrm actually ca-certificates is the same in both base bionic/focal and updates19:09
clarkbI wonder instead if they need an updates openssl or gnutls19:09
clarkboh I see ianw notes in the commit message that you can't use the updates repo when doing the initial chroot. That is fun19:11
clarkbI +2'd the change but didn't approve it in case there were some better ideas, but think we can approve it if we are stumped19:12
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.15.4  https://review.opendev.org/c/opendev/system-config/+/81324319:13
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563820:03
opendevreviewAnanya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream  https://review.opendev.org/c/opendev/elastic-recheck/+/81325020:10
opendevreviewDouglas Viroel proposed zuul/zuul-jobs master: WIP - Add FIPS enable role to multi-node job  https://review.opendev.org/c/zuul/zuul-jobs/+/81325320:36
ianwclarkb / frickler: hrm, i may have been hitting https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=991625 ; perhaps that can work but will need some more testing21:07
ianwclarkb: i also didn't see a gerrit point release21:08
*** dviroel is now known as dviroel|afk21:21
clarkbianw: ya I'm subscribed to their mailing list and they usually post releases there22:30
clarkbwe'll be fine to deploy what we've got I guess, then can sneak in a quick restart later to pick up the latest release if it doesn't show up soon22:30
clarkbI've just checked gerrit.googlesource.com and no 3.3.7 tag there either22:31
clarkbhttps://review.opendev.org/c/opendev/system-config/+/813243 is a good one to land on monday to upgrade gitea. Looks like it passed testing happily today, but I'm not sure I'll have the focus to watch it should it have problems upgrading in prod at this point before the weekend23:01
