Monday, 2022-06-27

opendevreviewMerged openstack/diskimage-builder master: Revert "Temporarily stop running OpenSUSE functtests"  https://review.opendev.org/c/openstack/diskimage-builder/+/84032900:14
opendevreviewIan Wienand proposed opendev/system-config master: [wip] devel job git perms  https://review.opendev.org/c/opendev/system-config/+/84770001:55
*** ysandeep|out is now known as ysandeep02:04
opendevreviewIan Wienand proposed opendev/system-config master: [wip] devel job git perms  https://review.opendev.org/c/opendev/system-config/+/84770002:21
opendevreviewIan Wienand proposed opendev/system-config master: [wip] devel job git perms  https://review.opendev.org/c/opendev/system-config/+/84770002:38
BlaisePabon[m]1No worries.03:08
BlaisePabon[m]1I hang out with the openfaas folks and those guys have no resources, so I'm always happy to help where I can.03:08
BlaisePabon[m]1It looks like there's more corporate largesse to go around in this group.03:08
opendevreviewIan Wienand proposed opendev/system-config master: [wip] devel job git perms  https://review.opendev.org/c/opendev/system-config/+/84770003:18
ianwhttps://pypi.org/project/diskimage-builder/ is showing 3.21.1 as the latest, but https://pypi.org/project/diskimage-builder/3.22.0/ is there (and the job worked: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_85f/787dcaac0a9a446f2a5b3b442601baaf634d11a0/release/release-openstack-python/85fcaf3/job-output.txt)03:28
ianwsomeone else has mentioned a similar issue on libera #pypa03:32
ianwlooks like it comes back to https://github.com/pypa/warehouse/issues/1039003:36
ianwi seem to remember the -X purge thing working, but doesn't seem to now03:37
opendevreviewIan Wienand proposed opendev/system-config master: [wip] devel job git perms  https://review.opendev.org/c/opendev/system-config/+/84770003:40
opendevreviewBlaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions  https://review.opendev.org/c/opendev/infra-manual/+/84559303:45
BlaisePabon[m]1Sorry for the delay. I'm getting the hang of this, and rST is more familiar than Yaml.03:48
BlaisePabon[m]1This is a good doc, thank you Clark .03:55
BlaisePabon[m]1Would you like me to prepare a change with `s/OpenStack/OpenDev/g`?03:55
opendevreviewIan Wienand proposed opendev/system-config master: Allow git to operate in checked-out source directories  https://review.opendev.org/c/opendev/system-config/+/84770004:16
opendevreviewBlaise R Pabon proposed opendev/infra-manual master: Convert OpenStack terms over to OpenDev  https://review.opendev.org/c/opendev/infra-manual/+/84771004:21
*** ysandeep is now known as ysandeep|afk05:15
*** ysandeep|afk is now known as ysandeep05:56
fricklerfrickler@bridge:~$ cat /var/log/ansible/zuul_reboot.log.106:28
fricklerflock: cannot open lock file /var/run/ansible/zuul_reboot.lock: No such file or directory06:28
*** frenzyfriday|PTO is now known as frenzy_friday07:13
*** jpena|off is now known as jpena07:43
opendevreviewlikui proposed openstack/diskimage-builder master: Drop python3.6/3.7 support in testing runtime  https://review.opendev.org/c/openstack/diskimage-builder/+/84771808:15
opendevreviewlikui proposed openstack/diskimage-builder master: Stop testing lower-constraints  https://review.opendev.org/c/openstack/diskimage-builder/+/84771908:17
opendevreviewMerged zuul/zuul-jobs master: [tox] Update readme regarding default values  https://review.opendev.org/c/zuul/zuul-jobs/+/84765608:59
*** rlandy|out is now known as rlandy09:29
rlandyfungi: hi - when you come online - we're still seeing a lot of post_failures and not just in tripleo jobs. This is really impacting our ability to merge patches09:46
rlandyI saw your discussion with akahat|ruck - but it's not just tripleo jobs09:47
rlandywe do have a deep log collection structure - but it would seem there is some new issue?09:47
*** ysandeep is now known as ysandeep|afk09:50
*** ysandeep|afk is now known as ysandeep10:38
jrosser_rlandy: we see the same thing on openstack-ansible jobs10:50
rlandyjrosser_: ack - killing our gate10:50
jrosser_for now we have disabled generating the HTML ARA report which has helped a lot, but still there is a low level of POST_FAILURE which means we also can't really merge anything10:50
jrosser_there seem to be periods (days to weeks) when this doesn't happen, and then similar periods when it's pretty regular10:51
rlandyjrosser_: we suspected it was one provider at first - but now it seems to be on multiple11:00
jrosser_do you have a way to tell where the logs are going?11:01
rlandyjrosser_: no - no way when logs don't have a link11:01
rlandyarxcruz: ^^ anything we can remove from our log collection to help here?11:02
rlandywe are going to have to force merge some patches if we can't get them through gate11:02
fungirlandy: jrosser_: the vast majority do still seem to be for tripleo and openstack-ansible changes looking at https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&voting=1&skip=011:35
rlandylooking at what we can cut from collect logs - but this is a recent issue11:36
fungithings i suggested might help would be if someone wants to work on splitting log uploads so that we push up the console log/json and zuul-info files with a separate task first so that we have *some* logs and can see what swift providers the builds are destined to, and then adjust the log upload role to provide a count and size of what's going to be uploaded before proceeding with the upload11:36
fungithat will help us ascertain whether 1. there's any correlation to a specific swift provider and 2. whether there's something different about the size or quantity of logs in the builds which timeout their uploads11:37
jrosser_i made a patch to openstack-ansible to check there was nothing silly happening in the log structure https://zuul.opendev.org/t/openstack/build/c449b1162be6482c858eab60aff183c8/log/job-output.txt#1944311:38
jrosser_though getting size/file count at the same time with `tree` didn't look possible11:38
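
A rough sketch of the count-and-size reporting suggested above, as Ansible tasks that could sit in the base job's post-logs playbook just before the upload; the task names are illustrative and zuul.executor.log_root is assumed to be the same tree the upload-logs-swift role consumes.

    # Illustrative only: report how many files and how much data are about to
    # be uploaded, so upload timeouts can be correlated with log volume.
    - name: Count files queued for log upload
      shell: find "{{ zuul.executor.log_root }}" -type f | wc -l
      register: log_file_count

    - name: Measure size of logs queued for upload
      command: du -sh "{{ zuul.executor.log_root }}"
      register: log_tree_size

    - name: Report log volume before uploading
      debug:
        msg: "{{ log_file_count.stdout }} files, {{ log_tree_size.stdout }}"
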
rlandychandankumar added https://review.opendev.org/c/openstack/tripleo-ci/+/847771 to try to reduce log capture size11:47
rlandymay help us get in under the radar11:47
jrosser_fungi: do you know where the code for this module is? https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-swift/tasks/main.yaml#L2111:52
fungijrosser_: it's in zuul/zuul-jobs: https://zuul-ci.org/docs/zuul-jobs/log-roles.html#role-upload-logs-swift11:58
fungioh, the module, just a sec11:58
fungijrosser_: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base11:58
fungithat has the python modules11:59
*** dviroel|out is now known as dviroel12:01
fungifor splitting the uploads, i expect we'd add a new task in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml before the existing upload-logs-swift inclusion and specify the important subset we want uploaded first, then either delete those from the local tree or find a way to exclude them from the general upload12:01
jrosser_you don't mind that the important things would be duplicated?12:02
jrosser_oh well right12:02
jrosser_thats why i was attempting to look at the code12:02
jrosser_and i'm immediately completely out of my depth unfortunately12:03
fungimuch of the logic in the lib is delegated to openstacksdk, from what i saw looking over the weekend12:03
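
A minimal sketch of the split being described, for playbooks/base/post-logs.yaml; the files override used to limit the first pass is hypothetical (it is roughly what the change proposed just below adds), and the real invocations would still need the cloud credentials, partitioning and no_log handling the existing playbook already has.

    # First pass: only the small, high-value files, so a console log and
    # zuul-info exist even if the bulk upload later times out.
    - include_role:
        name: upload-logs-swift
      vars:
        files:                                  # hypothetical override
          - "{{ zuul.executor.log_root }}/job-output.txt"
          - "{{ zuul.executor.log_root }}/zuul-info"

    # Second pass: everything else under the log root, as today.
    - include_role:
        name: upload-logs-swift
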
opendevreviewJonathan Rosser proposed zuul/zuul-jobs master: Allow upload-logs-swift files list to be overriden  https://review.opendev.org/c/zuul/zuul-jobs/+/84777512:20
fungiianw: seems the pypi cache issue may be more widespread, i saw people referring to something similar on the python discourse today: https://discuss.python.org/t/python-package-not-updating/1685612:25
*** arxcruz is now known as arxcruz|brb12:42
opendevreviewJonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task  https://review.opendev.org/c/opendev/base-jobs/+/84778012:50
opendevreviewJonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task  https://review.opendev.org/c/opendev/base-jobs/+/84778012:52
opendevreviewJonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task  https://review.opendev.org/c/opendev/base-jobs/+/84778012:55
fungiianw: also https://discuss.python.org/t/cannot-make-the-new-release-to-be-the-latest-version-for-installation/1685913:02
*** arxcruz|brb is now known as arxcruz13:48
Clark[m]fungi: jrosser_: it does look like the roles assume zuul.executor.log_root is the sole source of logs and we don't make that configurable via the role vars. I think that means the first step would be making that a parameter.13:50
jrosser_Clark[m]: did you see my patch?13:51
jrosser_i took a first stab at it here https://review.opendev.org/c/zuul/zuul-jobs/+/84777513:51
Clark[m]Yes, that is a different problem13:52
jrosser_oooh13:52
Clark[m]Oh wait there are two patches?13:52
Clark[m]I saw the random choice move13:52
jrosser_right13:52
jrosser_that was a bit of a precursor to making two uploads13:53
jrosser_as it would need to send them to the same place13:53
Clark[m]Right. I missed the 775 change. Ya I think that is basically what is needed. Then in the base job we switch to uploading two different sources of content13:55
jrosser_yeah, i wasn't really sure what the best approach was to do two uploads13:55
Clark[m]The other issue is if we'll partition deterministically but I'm pretty sure that is based on build uuid so we should13:55
jrosser_as the files: parameter just grabs everything beneath it as far as i can see13:56
Clark[m]Oh except we have to generate index files too. That might be the biggest issue13:58
Clark[m]I think what will happen is the second upload will overwrite the previous index effectively hiding the first set of files (they will be there just not in the index.html)13:58
Clark[m]And maybe similarly for the zuul manifest which the zuul log browser uses13:59
Clark[m]Ya this isn't super straightforward now that I think about the details13:59
Clark[m]We would need the role to handle a single tree uploaded over multiple passes, rather than two separate passes14:00
jrosser_is it only the top level index that is difficult?14:01
jrosser_upload 1 -> zuul-info, upload 2 -> everything else14:01
jrosser_so the only inconsistency would be right at the top there14:01
Clark[m]Yes, it depends on the structure of the data but currently it would just be for the top level (both in index.html and the manifest)14:03
Clark[m]Thinking out loud here: I wonder if the data is actually present behind those failures but since manifest is likely the last thing we upload zuul is ignorant to what is there14:05
Clark[m]fungi: it may be worth trying to browse directly to the paths and see what if anything is there. We may not need to change anything to have access to sufficient info if we're failing after uploading much content but before the manifest14:06
fungiClark[m]: well, longer term i think it's still very useful to have the console log/json and zuul info available from the build results page even if something goes wrong uploading everything else the job has said to archive14:07
fungibut yes i expect the files are there in swift, at least the ones which got uploaded14:08
jrosser_there is a more brute force approach, which is to upload essential stuff, then upload everything and accept some duplication14:08
jrosser_then at least there is something rather than nothing if a large upload fails, and avoids the index issue14:08
Clark[m]fungi: yes, but it is very complicated and unlikely to be refactored quickly 14:08
Clark[m]jrosser_: considering we suspect too many files to be the source or at least part of the issue I'd be wary of that approach. We'll just make it worse if this is the problem14:09
fungiClark[m]: i mean in order to make it not necessary to get a zuul admin to dig the upload url out of the executor logs, since the current build dashboard doesn't supply a log link at all14:09
fungimaking these sorts of failures at least a little more possible for users to debug would save our limited sysadmins some time14:10
jrosser_that's really it - i'm completely blind to what's happening here14:10
fungii can do it for a few representative builds, but doing it at a larger scale is much more time-consuming14:11
Clark[m]fungi: it's there somewhere because the js has to load the manifest14:11
Clark[m]I don't think admin is required though it may require opening the js debug console14:12
fungiyeah, but the dashboard gives you "no log available"14:12
clarkbhrm it looks like zuul-web proxies the manifest request then after it has the manifest the data is fetched directly from swift for everything else?14:15
jrosser_if i look in the zuul API for one of these failed jobs i get "log_url": null14:16
fungiyeah, i'm trying to fish the log url for one of these out of the executor log14:16
clarkbya I think we may not record the log_url because that happens after the upload because the upload provides the location back14:17
clarkbthe data very likely is there, we just aren't exposing it because the upload process was cut short :/14:17
rlandyfungi: clarkb: we are going to need to force promote two patches as we can't get them through the gate due to post_failures ... I checked with slage (tripleo PTL) and he will approve. Can you force merge:14:17
rlandyhttps://review.opendev.org/c/openstack/tripleo-common/+/847437/14:17
rlandyhttps://review.opendev.org/c/openstack/tripleo-upgrade/+/847680/14:17
rlandyboth backports14:18
clarkbrlandy: we usually only force merge fixes for the gate issues.14:18
fungiand generally only when they're unable to merge because of the problem they're fixing14:20
rlandyhttps://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-scenario001-standalone&branch=stable%2Fwallaby&skip=014:20
rlandyonly pass is on https://review.opendev.org/c/openstack/tripleo-common/+/847437/14:20
rlandyif we can fix the post_failure14:21
rlandythen we "should" be able to get these through14:21
fungiyou can fix the post failure by not running the jobs which are unstable because of post_failure results14:21
jrosser_imho it is not quite like that for my failing jobs14:22
jrosser_it seems random14:23
rlandyall multinode - checking if we can merge https://review.opendev.org/c/openstack/tripleo-ci/+/847771/1/toci-quickstart/config/collect-logs.yml14:23
fungirlandy: work on making those easier to diagnose is most welcome, and we've been discussing options here, but please don't mistake your desire to ignore the current spate of ci problems you're experiencing as being any sort of priority for me. if you want to ignore some jobs in order to merge changes you can already do that by disabling those jobs14:24
clarkbjrosser_: I doubt it is random. It is more likely some combo of test node cloud and/or swift provider (the way log collection works is everything is copied from test nodes to executors, then from executors to swift in one of 5 locations. if either or both of those is slow then you can run up against the timeout)14:24
jrosser_clarkb: oh absolutely - i should have qualified random better14:24
clarkbjrosser_: one approach may be to increase the timeout and see if we can collect the logs that way and identify the source of the slowness14:24
jrosser_what i really meant was it wouldn't be possible for me to not run a failing job because i can't tell which one will fail14:25
jrosser_clarkb: is that something i can do?14:25
clarkbjrosser_: yes, I think so. Let me dig up docs14:26
clarkbjrosser_: https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.post-timeout there may be an upper bound on that too looking further14:26
clarkbour default post-timeout is 1800 seconds (half an hour)14:27
clarkbI'm not finding where we might set a max timeout. Maybe bump that to an hour and a half in your jobs and collect what we can and take it from there?14:28
jrosser_sure i can try that, thanks for the pointer14:29
clarkboh we must rely on the default zuul max-job-timeout of 10800?14:29
clarkbwhich is 3 hours which seems to line up14:29
clarkband I think it applies to both the regular job timeout and the post timeout as two separate blocks. So max post timeout is 3 hours as well?14:30
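
For reference, a sketch of what temporarily raising the post-run budget on one job would look like; post-timeout is the documented job attribute linked above, the job name is only an example, and per the discussion here the value still has to fit under the tenant's max-job-timeout (10800 seconds by default).

    # Example only: give post-run playbooks 1.5 hours instead of the 1800
    # second default used here, so slow log uploads can finish and be measured.
    - job:
        name: openstack-ansible-deploy-aio_lxc-centos-8-stream
        post-timeout: 5400
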
fungirevisiting https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&skip=0 all the gate post_failures so far today (since utc midnight) have been tripleo multinode jobs, though there was a tripleo "standalone" job (i guess that means single-node?) which caught one just before utc midnight14:30
rlandycorrect - most are multinode14:31
clarkbjrosser_: that should theoretically give us better info on what actual max runtime for the uploads is, where the time cost is (pulling logs to executor or pushing to swift or both), and how close we are to the timeout. It may just be that you've been riding the edge for a while and a few new log files are tipping you over or something (though removal of ara not completely fixing this14:31
clarkbindicates this is unlikely)14:31
jrosser_tbh i was surprised it failed at all after removing ARA14:32
clarkbrlandy: are there changes to stop uploading /etc content that never changes, and stop uploading duplicated content, and changes to reduce the nesting? Those sorts of things I think we could force merge given our current understanding of the issue14:32
jrosser_i will have to check again but i think that was the largest count of files by a long way14:33
fungithat may indicate that ara files weren't a significant contributing factor to the upload time14:33
rlandyclarkb: there is one patch in flight - on a call now, will discuss with the team more options and dirs we can remove14:34
clarkbfrickler: I was going to look into that. Will flock -n not create a file if it doesn't exist? Anyway thanks for checking I'll try to followup a bit later after breakfast14:34
fungii'm trying to see how much of the logs from https://zuul.opendev.org/t/openstack/build/bae8ea4276ad495792e320631486cbf9 got uploaded. in the executor debug log i can see that zuul_log_path is bae/847362/2/gate/openstack-ansible-deploy-aio_lxc-centos-8-stream/bae8ea4 so i can probably try to fetch the console log from random providers and brute-force discovery of the actual url14:36
clarkbfungi: that won't work with rax beacuse the urls are not consistent14:36
clarkbfungi: but I thought it logged where it was uploading earlier in the logs? If not that may be something we can land to zuul-jobs quickly (explicitly log the full log path prefix at the beginning of uploading)14:37
clarkbanyway I need to boot my morning. Back in a bit14:37
fungii think it may only log that if the task is successful? but anyway not having luck finding information on the actual log url in the executor debug log14:38
fungiyeah, "TASK [upload-logs-swift : Upload logs to swift]" includes a "url" parameter in the ansible output normally, but not if the task times out14:40
fungiso the executor doesn't actually even log the url in these cases14:41
jrosser_fungi: i never see that even if that task doesn't time out14:43
jrosser_isn't it chicken/egg with what is in the log file at the point it is uploaded?14:44
jrosser_oh - you mean the log on the executor i guess, not the uploaded one14:46
fungiyes14:48
fungii'm digging in the executor's debug log14:48
fungibut have basically confirmed that even from the current debug logs we can't reliably work out where the build logs ended up14:49
fungiso yes, increasing the upload timeout temporarily and then looking for successful jobs which exceed the default post-logs playbook timeout seems like the next logical step14:51
jrosser_i've made a set of patches which should hopefully catch some of those14:51
jrosser_unfortunately this does seem to come and go somewhat14:52
*** pojadhav is now known as pojadhav|afk15:04
clarkbfungi: I think we can add a logging.info() for this value https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L14415:18
clarkbfungi: separately we do log the file list https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L283-L285 but only at a debug level so I think we need some amount of ansible -v's to emit that15:19
clarkb(also I'm not sure if ansible logs INFO level by default it may be that logging.info() will need more verbosity too)15:19
fungiclarkb: but also we no_log it in order to avoid leaking credentials: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml#L1215:24
clarkboh ha ok15:25
clarkbIn that case it really does seem like we need to work from successful runs with longer timeouts and go backward from there to try and identify problems :/15:25
*** dviroel is now known as dviroel|lunch15:27
fungihttps://status.python.org/ "14:20 UTC: A delay in our job queue is affecting CDN purges causing public pages to be stale and delays in email sending for notifications, email validation, password resets, etc. 15:05 UTC: We've identified specific jobs in our queue that are causing processing delays and are working to remove them."15:36
fungilooks like they finally noticed15:36
*** marios is now known as marios|out15:39
clarkbtalking out loud again. Another option we might consider is temporarily increasing the base jobs post-timeout15:50
clarkbto cast a wider net and hopefully make it easier to debug. The problem with that is the revert back to half an hour is likely to be painful15:51
fungii think we'd need a good means of identifying which builds exceeded the old timeout under the new temporary one16:00
clarkbfrickler: ok the issue is there is no /var/run/ansible so the file can't be created in the non-existent dir. It appears flock will create the file if the dir exists. I think I may just stick it under /var/run then16:02
clarkbchange incoming16:02
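
A hedged sketch of the shape of that fix, written as an Ansible cron task: point flock at a lock file directly under /var/run (which exists) rather than under /var/run/ansible (which does not), since flock creates the lock file but not its parent directory. The schedule and playbook path shown are placeholders, not the actual values in system-config.

    - name: Run the zuul_reboot playbook under a lock file flock can create
      cron:
        name: zuul_reboot
        weekday: '6'            # illustrative schedule
        hour: '0'
        minute: '0'
        job: >-
          flock -n /var/run/zuul_reboot.lock
          ansible-playbook /path/to/zuul_reboot.yaml
          >> /var/log/ansible/zuul_reboot.log 2>&1
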
opendevreviewClark Boylan proposed opendev/system-config master: Fix zuul reboot playbook flock location  https://review.opendev.org/c/opendev/system-config/+/84780816:05
*** ysandeep is now known as ysandeep|PTO16:09
clarkbone crazy idea I had that won't work is to write the log url to a file then have a followup task read and echo it. That followup task wouldn't need to be no_log. But that won't work because we are timing out and exiting in the middle of the upload task16:12
jrosser_i was also thinking about that16:13
jrosser_looking at the uploader code it might be possible to call the role with an empty filelist16:13
jrosser_and the url should be returned - though i was pretty unclear if the url would be constant16:14
clarkbI think the url is going to be constant for a specific build uuid. The build uuid determines the sharded container location and the rax cdn different urls are consistent per container16:14
clarkbbasically as long as that uuid is fixed we should end up with the same url for logs across multiple uploads16:15
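
A speculative sketch of the idea being discussed: run one (near-)empty upload pass just to learn the destination, then echo the URL from a task that is not marked no_log. Both the empty files override and a zuul_log_url fact exported by the role are hypothetical pieces that would need to be added, and the credentials/vars the role actually requires are omitted.

    - include_role:
        name: upload-logs-swift
      vars:
        files: []               # hypothetical: upload nothing yet, just pick the container/URL

    - name: Record where this build's logs will live
      debug:
        msg: "Logs for this build: {{ zuul_log_url | default('unknown') }}"
      # zuul_log_url is a hypothetical fact the role would have to export
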
BlaisePabon[m]Oh! I wonder if that is why I didn't get an email when I added another address to my Pypi profile....?16:15
BlaisePabon[m](rhetorical, no reply required)16:15
fungiBlaisePabon[m]: it does say "pypi.org - Email: Degraded Performance"16:16
fungiso maybe so16:16
clarkbbut ya I'm still trying to brainstorm better debugging methods here. It is a "fun" challenge. Back when we had the jenkins scp log uploader we had similar chicken and egg problems. Eventually we forked a separate java thread to upload the console log so that it could run beyond the job thread lifetime16:16
fungiBlaisePabon[m]: and the incident name is "Queue delay affecting CDN purges and emails"16:17
fungiso i guess the two are related16:17
clarkbjrosser_: I'm shifting tactics to look at successful runs to see if I can infer anything from them. I still see ara reports on them for openstack-ansible-deploy-infra_lxc-ubuntu-focal on stable/wallaby fwiw16:22
clarkb(stable/wallaby of that job was the most recent POST_FAILURE)16:22
jrosser_yes, we have only managed to merge the change to disable ARA on master16:23
clarkbgot it16:23
jrosser_ironically the cherry-pick for that to yoga failed..... due to POST_FAILURE16:23
clarkbThere is also some larger binary journal content. You might consider exporting that like devstack does. That allows you to export the range you want (and potentially other bits you care about) to a more text-friendly format that compresses with eg xz very well16:26
clarkbThe apt trusted keys are also logged many times from many locations (host and all the containers)16:27
jrosser_yep - we place a lot of those ourselves16:27
jrosser_dropping the binary log journals is a good idea16:28
jrosser_i've never really had a lot of success grabbing them from one system and parsing them on another16:28
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging  https://review.opendev.org/c/openstack/diskimage-builder/+/84760016:29
clarkbjrosser_: ya journalctl has a specific export command that serializes the info and allows you to import and view it elsewhere. Copying the binary data directly isn't meant to work that way aiui16:29
clarkblet me see if I can find a link to how devstack does/did it16:29
jrosser_we have this to post-process them https://github.com/openstack/openstack-ansible/blob/master/scripts/journal_dump.py16:30
clarkbhttps://opendev.org/openstack/devstack/src/branch/master/roles/export-devstack-journal/tasks/main.yaml#L34-L48 Looks like they switched to using gz :/16:31
clarkbwe did xz back when because it was like 10x better and 99% of the time people don't need the raw journals oh well16:31
jrosser_it's ok, we already have a list of "interesting services" and dump out what you would have had with journalctl16:31
jrosser_the binary ones are pretty useless now16:31
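
An illustrative Ansible task along the lines suggested above: export the units you care about as text with journalctl and compress with xz, rather than copying the binary journal files. The unit list and output directory are examples only.

    - name: Export selected journal units as compressed text
      become: true
      shell: |
        mkdir -p /var/log/journal-export
        journalctl -o short-precise -u "{{ item }}" > "/var/log/journal-export/{{ item }}.log"
        xz -9 "/var/log/journal-export/{{ item }}.log"
      loop:
        - nova-compute          # example units only
        - neutron-server
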
clarkbanyway other than the journals and ara content in the wallaby job I'm not seeing much that stands out to me. There is some nesting but it isn't too bad. And some duplication (of the apt trusted keys). But on the whole far less content than tripleo. Though the timeout is shorter16:32
clarkbThe other thing that just occurred to me is that the post-timeout is for all the post-run playbooks. It could be that the uploads themselves are fine and the timeouts are just fallout from the other playbooks16:33
clarkbara itself takes almost 3 minutes to generate the report in the successful job16:34
clarkbabout 1/10th the timeout value16:34
clarkbThen journal collection from the container collection takes about 20 seconds and journal collection and eth interface info collection another 20 seconds. Nothing too crazy yet16:35
clarkbCopying logs to the executor another 15 seconds16:36
opendevreviewBlaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions  https://review.opendev.org/c/opendev/infra-manual/+/84559316:36
clarkbthe upload of logs to swift took about 9 minutes16:37
*** dviroel|lunch is now known as dviroel16:37
clarkbIn total about 13 minutes of post-run time. About 43% of the timeout value.16:38
clarkbAnyway that does point to the upload to swift being the likely issue just based on successful time spent16:38
clarkbAnd that upload occurs from the executor to swift. Which means the source side of that is pretty consistent. One of 12 instances in a single cloud16:39
clarkbNow the successful case is the one I would have expected to have the most issues as it is across the atlantic16:39
clarkbhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_554/periodic/opendev.org/openstack/openstack-ansible-repo_server/stable/wallaby/openstack-ansible-deploy-infra_lxc-ubuntu-focal/554c3a2/ is what I'm looking at fwiw16:39
clarkbThat doesn't rule out some other internet/routing/ipv6/etc problem though16:40
clarkbI also see uploads to both ovh swifts from tripleo jobs that are successful as well as uploads that are hosted by the rax cdn. It's more difficult to determine which backends those rax cdn uploads targeted, though. But that seems to rule out a consistently bad location16:42
fungii suppose something like bandwidth contention or significant packet loss on an executor resulting in much longer upload times16:47
fungiwould be another possibility16:48
clarkbyup it could be network contention between jobs too16:48
clarkbI got excited I saw that the system-config-run job that runs against latest ansible had a POST_FAILURE and thought maybe it is related to latest ansible. but no this error is a regular error in post-run not a timeout16:59
clarkbit has logs and all that16:59
*** Guest3279 is now known as rcastillo17:20
jrosser_clarkb: fungi i have a few jobs taking suspiciously long to upload logs right now, it may be nothing unusual but i can give you the details if it's helpful17:23
clarkbjrosser_: ya can you link to the live console?17:23
clarkbI'd be happy to take a look at that17:23
jrosser_https://zuul.openstack.org/stream/d29af85f2a144f68b2eeb849ccb8e8ad?logfile=console.log17:24
clarkbLooks like that one just succeeded17:25
jrosser_yeah i figured that would happen17:25
clarkbwhich is good because it means we can go look at the executor logs and gather more info about it17:25
jrosser_that was about ~20 mins17:26
clarkbFirst thing I notice is the pre upload tasks seem to take about 3 minutes there which is consistent with the other successful uploads. Either the size of the data doubled causing it to take twice as long as other uploads or there is some other issue17:26
jrosser_this one is similar https://zuul.openstack.org/stream/62be39711bb6402d90d27f99fd4a96c3?logfile=console.log17:26
jrosser_about 20 mins in as well17:26
clarkbhttps://e08e29492ebe0e6f5609-f5e06fd9232d0500a75b50f2bf1ad7e1.ssl.cf5.rackcdn.com/847797/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/d29af85/ is the first one. It uploaded to one of the rax backends (have to check the executor log to get more specific than that I suspect)17:28
clarkbIt ran on ze09.17:29
jrosser_i think that they might have both completed now17:29
clarkbhttps://a482a10c2a931478cad4-f031700b2c1819893a02a03206c89279.ssl.cf5.rackcdn.com/847668/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/62be397/ is the second it ran on ze1217:29
clarkbyup they both ran successfully but took a bit more than 20 minutes to upload17:29
jrosser_but that seems to be the pattern we get, that there's ~15 or so long running jobs on that change and just one or maybe two of those would POST_FAILURE17:30
jrosser_it certainly wasn't all of them at all17:30
opendevreviewMerged opendev/system-config master: Fix zuul reboot playbook flock location  https://review.opendev.org/c/opendev/system-config/+/84780817:30
clarkbBoth jobs ran in iweb. I suppose it is theoretically possible that the test node cloud provider is impacting logging in some way that causes this (producing much larger system logs due to something in the provider?)17:30
clarkbjust checking the journal logs they do seem larger in that one. But I would need to add them all up to be sure of that17:31
clarkbeach of them have a 96MB journal file (in addition to all the other journal files)17:32
fungiyeah, doing a side-by-side comparison of a complete download of the saved logs from one that took longer and one that didn't for the same job would be helpful17:32
clarkbhttps://zuul.opendev.org/t/openstack/build/fdd24d45c34a43fc8171baf79e5e5323/logs was a short one17:34
clarkbhttps://zuul.opendev.org/t/openstack/build/ce8c876dca6b49e0adc05eb098fb0b61/log/job-output.txt was longer but still short (~4 minutes, first short example was about 2 minutes)17:35
jrosser_oh wow the log upload time was only 2 mins?17:36
clarkbone of the long ones has an ara report. The other long one doesn't have an ara report17:36
clarkbjrosser_: double check me on that but I'm looking at the timestamp for start of upload and comparing it against the job completion time in the zuul build dashboard17:36
jrosser_yeah, that's the only way i've found to get a handle on the upload time17:36
*** pojadhav|afk is now known as pojadhav17:36
clarkbat least one quick one had a larger journal file too17:37
jrosser_this is slightly unscientific as the patches we're looking at are trying to reduce the log upload burden17:38
jrosser_so comparisons are tricky17:38
clarkbah17:38
jrosser_it would be reasonable to look at different jobs in the same change though17:38
clarkbyup. And probably the post-timeout increase jobs are the best to look at. Recheck them a few times then compare runtimes17:39
clarkbsince in theory we're least likely to actually hit a post_failure and not be able to compare there17:40
clarkbone other thing to note. A short run ran on ze09 as well17:41
clarkbbut I agree hard to say how useful that info is until we're comparing across the same changes17:41
fungiyeah, things to possibly correlate: consistency for builds run from the same executor, for builds run on nodes from the same provider, for builds uploading logs to the same swift endpoint17:43
*** rlandy is now known as rlandy|afk18:40
*** jpena is now known as jpena|off19:06
*** rlandy|afk is now known as rlandy19:23
TheJuliaanyone around to put a VM hold in real quick?19:40
clarkbTheJulia: I can do it, just need job name and change ideally19:41
TheJulia847601,3 in flight, ironic-tempest-wholedisk-bios-snmp-pxe19:41
TheJuliashould fail shortly....19:41
clarkbthe autohold command has been run19:43
* TheJulia crosses fingers19:43
clarkbianw: ^ in doing that I notice your commands in history there aren't prefixed with the timestamp. I wonder why19:43
TheJuliaoh, it is *soooo* close19:43
TheJuliait is getting logs now19:43
TheJuliaclarkb: looks like the job just exited19:46
clarkbbut it was successful? the hold should be good for your recheck19:47
TheJuliafailed19:47
clarkboh /me looks again19:47
TheJuliawhich is what I'm hoping to capture the state of19:47
clarkbTheJulia: https://zuul.opendev.org/t/openstack/build/67ac58dd2a2443eaa76061b884e61faf it passed19:48
TheJuliagaaaah19:48
TheJulia:(19:48
* TheJulia goes off in the corner and cries :)19:48
clarkbthe hold is still good though and should catch it if/when it does fail19:49
TheJuliaclarkb: off to issue a recheck then, thanks19:49
clarkbI'm getting our meeting agenda together for tomorrow. Anything missing?20:09
clarkbianw: I +2'd the grafana testing changes but didn't approve them as I wasn't sure if fungi wanted to look at your response prior to approval too21:06
fungithanks, i did see the responses. i was +2 on it anyway21:24
fungijust wishing i had the bandwidth to move more of our opendev jobs out of ozj21:25
*** dviroel is now known as dviroel|out21:40
ianwyeah, it feels like maybe a sprint thing where we sit down and just bang it out, dedicate a day where things might get broken but just work iteratively21:41
ianwclarkb: umm, commands on zuul02?21:42
ianwopenstack/project-config is probably the other thing that stands out a bit too in terms of naming21:43
clarkbianw: re zuul02 if you check roots `history` commands are prefixed with a timestamp there. But the autohold commands you have run do not have that. I think you must be su'ing somehow without getting a new login shell? But I thought the history bit was system wide so unsure how it gets omitted21:45
ianwfor some reason i seem to have been doing "sudo bash ./zuul-client autohold ..." where "zuul-client" is a small script in my homedir21:47
ianwfrom jan 17.  maybe we had quoting issues with that wrapper script or something at the time?21:47
clarkboh ya there were quoting issues once upon a time. I believe they are resolved now21:47
ianwheh, that same day i proposed https://review.opendev.org/c/opendev/system-config/+/82501921:47
clarkbok last call on meeting agenda content. I'll send that out at the top of the hour or so21:48
ianwso that might explain it.21:48
ianwoh i was going to add something about bridge.  will do in a sec21:48
clarkbk21:49
clarkbianw: let me know when it is posted and I'll get this sent out22:00
ianwclarkb: thanks, done22:01
ianwreloading the dib page shows the right release now, so hopefully pypi has fixed that22:03
clarkboh I got an email about it too. alright, working on sending that out22:03
fungiianw: i agree moving more out of openstack/project-config to opendev/project-config would be great as long as it's stuff that doesn't still need to be in the openstack zuul tenant in order for us to test it22:04
clarkbianw: seemed like it was a queue problem. Not sure if it is just that they got to dib in the queue or if the issue itself is corrected22:04
opendevreviewAmy Marrich proposed opendev/irc-meetings master: Updated meeting info for Diversity WG  https://review.opendev.org/c/opendev/irc-meetings/+/84785922:04
clarkband sent22:05
clarkbI'm starting to look at the work in progress == merge conflict issue and https://review.opendev.org/q/is:wip does seem to show this equivalence is universal22:12
*** dasm is now known as dasm|off22:16
opendevreviewMerged opendev/irc-meetings master: Updated meeting info for Diversity WG  https://review.opendev.org/c/opendev/irc-meetings/+/84785922:17
clarkbafter fiddling with it via the rest api I strongly suspect that the issue is mergeable:false and work_in_progress:true are both set for these changes22:23
clarkband I suspect that mergeable:false is picked up by the web ui as meaning merge conflict22:23
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Fix BLS entries for /boot partitions  https://review.opendev.org/c/openstack/diskimage-builder/+/84683822:27
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Parse block device lvm lvs size attributes  https://review.opendev.org/c/openstack/diskimage-builder/+/83982922:28
opendevreviewSteve Baker proposed openstack/diskimage-builder master: WIP Support LVM thin provisioning  https://review.opendev.org/c/openstack/diskimage-builder/+/84014422:28
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback  https://review.opendev.org/c/openstack/diskimage-builder/+/84786022:28
clarkbyup I think that is exactly the issue22:30
clarkbhttps://gerrit.googlesource.com/gerrit/+/refs/tags/v3.5.2/polygerrit-ui/app/utils/change-util.ts#16022:31
opendevreviewMerged openstack/project-config master: grafana: import graphs and take screenshots  https://review.opendev.org/c/openstack/project-config/+/84712922:32
clarkbLooking at the history of that ts file I don't think that logic has changed recently though which makes me suspect that it is the mergeable state that has changed. I wonder if we need to send a null value in the case of being mergeable but wip to distinguish from unmergeable wip22:33
ianwclarkb: ahh, you were correct -> Unable to freeze job graph: Job project-config-grafana depends on opendev-buildset-registry which was not run. :/22:37
TheJuliaclarkb: looks like the job failed so there should be a held node22:37
ianwthe project-config gate might be borked now, as i don't think we can commit around that22:37
clarkbTheJulia: looking22:38
clarkbianw: we should be able to force merge a revert at least?22:38
clarkbTheJulia: yup I'm on the host. have a pubkey you can share for me to add?22:39
ianwmaybe it should be a soft dependency in the o-z-j side?22:39
TheJuliaclarkb: one moment22:40
clarkbianw: that would also fix it I think22:40
ianwclarkb: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/84786222:42
clarkbTheJulia: root@158.69.68.15422:42
clarkbhrm the mergeop check for mergeability that checks work in progress seems to have not changed since 2018 either. Do we know if this behavior is new?22:46
corvusclarkb: (just a reminder that google added the thing where they don't do merge checks by default any more; so it seems conceivable issues may go unnoticed for longer upstream)22:52
corvus(could maybe somehow be related)22:52
clarkbcorvus: yup22:55
clarkbI'm finding evidence that they do attempt to set mergeable to null in the wip case22:56
clarkbso the mystery deepens22:56
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging  https://review.opendev.org/c/openstack/diskimage-builder/+/84760023:02
ianwTheJulia: ^ is this suggesting a glean issue?23:03
TheJuliaianw: it is suggesting networkmanager may have a fairly nasty bug in it23:03
TheJuliawhich causes stream-9 to break network connectivity shortly after booting23:04
ianw:/ ... i'd probably prefer to debug glean, and that's saying something :)23:04
TheJuliayeah :)23:04
TheJuliaunfortunately this is in the IPA ramdisks on boot23:04
opendevreviewMerged openstack/project-config master: Retire openstack-helm-deployments repo  https://review.opendev.org/c/openstack/project-config/+/84741423:05
TheJuliaso things get started... and a little while later all network connectivity seems to go *poof*23:05
opendevreviewMerged openstack/project-config master: Add Cinder Dell EMC PowerStore charm  https://review.opendev.org/c/openstack/project-config/+/84688823:05
ianwurgh.  our ipv6 RA kernel v NetworkManager race was somewhat similar ... but only for ipv623:07
ianwi'm struggling to think of what else might jump in there23:08
clarkbit looks like if you try to submit a change then gerrit will more explicitly check mergeability. Except you can't submit wip changes so I can't really test if that is a viable method to unsetting the stale data23:08
ianwohhh, that job has use-buildset-registry role in it23:18
ianwwhich fails without a buildset registry.23:19
ianwshould that role skip if no registry, or should we only run the role if we see the buildset registry var, or is this an incorrect usage of the whole thing and we should unconditionally have a buildset registry?23:20
clarkbianw: I think it may be an unexpected use of the role. The idea is if you are using it then you want to do the depends on support that I described in my review. But I wasn't sure it would break so badly23:28
clarkbI've filed https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 which outlines where I've debugged and a couple of ideas for a fix. I'll see if I can manage a change that we can test tomorrow using one of those approaches.23:28
ianwyeah, i had in my mind the job would work either way, but i guess that's not really how to write it.  otherwise we have to conditionally include the use-buildset-registry on the registry variables, which kind of breaks the abstraction23:30
opendevreviewIan Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry  https://review.opendev.org/c/openstack/project-config/+/84786623:32
ianwclarkb: ^23:32
clarkbianw: one minor clarification but ya I think that will solve it23:34
opendevreviewIan Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry  https://review.opendev.org/c/openstack/project-config/+/84786623:35
ianwyep, check: :)23:35
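
For context, the rough shape of the two fixes discussed here, with details elided; the first is the direction 847866 takes (declare the registry job as a dependency) and the second is what 847862 proposed (mark the dependency soft on the openstack-zuul-jobs side).

    # Option 1: depend on a registry job so use-buildset-registry always has
    # a registry to talk to.
    - job:
        name: project-config-grafana
        dependencies:
          - opendev-buildset-registry

    # Option 2: keep the dependency but mark it soft, so the job can still run
    # when the registry job is not part of the buildset.
    - job:
        name: project-config-grafana
        dependencies:
          - name: opendev-buildset-registry
            soft: true
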
TheJuliaianw: well, it looks like they removed a configuration option at some point, which is not helping... and behavior changed23:35
clarkbI'm going to go help with dinner now.23:35
TheJulia*and* its behavior is not ideal if you have physical switches which can make "bad things happen"23:36
ianwsigh.  if i can help beyond "random probably unhelpful suggestions" lmn :)23:39
TheJuliathanks :)23:42
TheJuliaI suspect I'm going to have to bang my head against it a little since there is inherently a race with the dhcp timeout occurring... and now is not a good time to try :\23:43
TheJuliaugh, looks like I'm going to need a noisy neighbor to fully simulate this :\23:54
