Friday, 2022-07-15

opendevreviewClark Boylan proposed openstack/diskimage-builder master: Add Rockylinux 9 build configuration and update jobs for 8 and 9  https://review.opendev.org/c/openstack/diskimage-builder/+/84890100:03
clarkbfungi: I'm going to need to pop out for evening things momentarily. Happy to help with the cinder stuff in the morning. I think we'll be fine until then00:04
fungiyeah, i'm contemplating adding the new volume this evening or waiting for friday00:09
Clark[m]Tomorrow works for me00:11
fungiyeah, that's probably better anyway00:14
*** rlandy|bbl is now known as rlandy|out01:10
*** dviroel|rover|afk is now known as dviroel|rover01:17
*** ysandeep|out is now known as ysandeep01:37
*** join_subline is now known as \join_subline01:47
opendevreviewMerged opendev/system-config master: production-playbook logs : don't use ansible_date_time  https://review.opendev.org/c/opendev/system-config/+/84978401:54
opendevreviewMerged opendev/system-config master: production-playbook logs : move to post-run step  https://review.opendev.org/c/opendev/system-config/+/84978501:54
opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/84990902:25
ianwi don't think run-production-playbook-post.yaml is working; but it also doesn't error02:33
opendevreviewIan Wienand proposed opendev/system-config master: run-production-playbook-post: add bridge for playbook  https://review.opendev.org/c/opendev/system-config/+/84991902:48
*** ysandeep is now known as ysandeep|afk03:45
opendevreviewMerged opendev/system-config master: run-production-playbook-post: add bridge for playbook  https://review.opendev.org/c/opendev/system-config/+/84991903:48
ianwhrm, i guess the periodic jobs that are running have checked themselves out on a repo prior to ^05:20
ianwi'll merge the limnoria change and watch that.  it's quiet now, no meetings05:21
*** ysandeep|afk is now known as ysandeep06:16
ianwoh interesting, system-config-run-eavesdrop timed_out in gate06:21
ianwperhaps it was actually during log upload?06:24
ianwit looks like it ran ok06:24
ianwhttps://zuul.opendev.org/t/openstack/build/bcc13270b9f14dd28af70882949b6579/logs06:24
ianwone data point ... this has an ara report too.  the one we looked at the other day did as well06:27
ianwthat's a lot of small files ...06:27
ianwthat wouldn't really describe the infra-prod timeouts though; they don't have ara reports06:41
ianw(only the system-config-run ones do)06:41
*** chandankumar is now known as chkumar|rover06:59
opendevreviewMerged opendev/system-config master: Install Limnoria from upstream  https://review.opendev.org/c/opendev/system-config/+/82133107:41
opendevreviewlixuehai proposed openstack/diskimage-builder master: Fix build rockylinux baremetal image error  https://review.opendev.org/c/openstack/diskimage-builder/+/84994707:46
*** rlandy|out is now known as rlandy10:31
opendevreviewTobias Henkel proposed zuul/zuul-jobs master: Allow overriding the buildset registry image  https://review.opendev.org/c/zuul/zuul-jobs/+/84998911:04
opendevreviewMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/84990911:21
fungiianw: if they were timing out during log collection/upload, the build result would have been post_timeout11:25
*** dviroel|rover is now known as dviroel11:28
*** ysandeep is now known as ysandeep|afk11:56
ianwfungi: yeah, this is true.  it seems to be something else going on with the ansible on the bastion host just ... stopping12:04
opendevreviewJames Page proposed openstack/project-config master: Add OpenStack K8S charms  https://review.opendev.org/c/openstack/project-config/+/84999612:18
opendevreviewJames Page proposed openstack/project-config master: Add OpenStack K8S charms  https://review.opendev.org/c/openstack/project-config/+/84999612:19
opendevreviewJames Page proposed openstack/project-config master: Add OpenStack K8S charms  https://review.opendev.org/c/openstack/project-config/+/84999612:21
fungitrying to dig the reason for the post_failure on https://zuul.opendev.org/t/openstack/build/69dc9e01301848a9a8a3f45fd952a24b out of the debug log on ze10 and not having much luck13:27
fungino ansible tasks are returning failed in the post phase13:28
Clark[m]fungi: very likely related to splitting the log management into post-run for those jobs. But I'm not sure why it would report failure if no tasks fail13:31
Clark[m]Perhaps related to the inventory fix here: https://review.opendev.org/c/opendev/system-config/+/849919 though as noted there if no nodes match inventory you just skip successfully 13:31
*** dasm|off is now known as dasm|ruck13:33
fungioh. perhaps. i'll look for that signature14:02
fungi2022-07-15 12:22:17,931 DEBUG zuul.AnsibleJob.output: [e: 3ae9a107a3364e86bbfc5d0c7e59c499] [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible output: b'changed: [localhost] => {"add_host": {"groups": [], "host_name": "bridge.openstack.org", "host_vars": {"ansible_host": "bridge.openstack.org", "ansible_port": 22, "ansible_python_interpreter": "python3", "ansible_user": "zuul"}}, "changed":14:04
fungitrue}'14:04
fungiseems like it ran (that was the output for "TASK [Add bridge.o.o to inventory for playbook name=bridge.openstack.org, ansible_python_interpreter=python3, ansible_user=zuul, ansible_host=bridge.openstack.org, ansible_port=22]")14:04
Clark[m]And did the subsequent tasks in that playbook manage to run with the updated inventory?14:08
fungithe play recap only mentions localhost14:15
fungimaybe you can't update a playbook's inventory from within the playbook itself?14:16
*** ysandeep|afk is now known as ysandeep14:16
fungithough that would be silly since it would mean there's no point in ansible's add_host task14:17
fungithere are some "no hosts matched" warnings toward the end of the log14:20
Clark[m]That should be how the run playbook functions 14:21
Clark[m]Maybe something about the new add host is different and missing bits?14:21
fungiafter "TASK [Register quick-download link]" is recapped showing it applied to localhost, it starts to run the cleanup playbook and says "[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'"14:23
fungii think the failure must be prior to that though14:25
fungii just can't seem to find it14:26
fungias soon as the openstack release meeting wraps up in a few minutes, i need to run a couple of quick errands (probably 30 minutes) and then i'll get started adding the new cinder volume to review.o.o and migrating onto it14:38
Clark[m]Sounds good. I can look at that post failure more closely after I've fully booted my morning14:38
fungiit's probably not urgent, the job seems to have done the deployment things it was supposed to and collected/published logs, just ended with an as-of-yet inexplicable post_failure14:41
fungipopping out for quick errands now, bbiab14:45
clarkbI wonder if these timeouts are related to the problem albin is debugging with ansible and glibc14:49
clarkbhttps://github.com/ansible/ansible/issues/78270 is the upstream bug report albin filed14:49
clarkbwe wouldn't have seen them before with ansible 2.9 I think but now we default to ansible 514:49
corvusclarkb: i think Albin Vass said it was happening in older versions too?  (iirc, an earlier disproved hypothesis was that ansible 5 might fix it)15:09
corvusclarkb: but the recent container bump to 3.10 almost certainly changed our glibc, right?15:10
clarkbcorvus: no the 3.10 image and the 3.9 image are both based on debian bullseye and should have the same glibc15:11
corvusah15:12
*** dviroel is now known as dviroel|lunch15:16
clarkbfungi: [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible timeout exceeded: 180015:19
clarkbit's the timeout issue that ianw was looking at. In post-run we run every playbook I think, so the failure happened before the point you were looking at. It happened in the encrypt steps. Maybe a lack of entropy?15:20
clarkbI think one possibility is the glibc problem albin discovered (maybe ansible 5 trips it far more often?) or something related to the leaking zuul console log files in /tmp15:24
clarkbcorvus: ^ re the leaking console log files is that a side effect of not running the console log streamer daemon?15:24
clarkbcorvus: we should be able to safely delete those files I think, but also i wonder if we can have a zuul option to not write them in the first place?15:24
clarkbjust doing a random sampling of those files, they all appear to be ansible task text that would end up in the console log. I'm pretty confident we can just delete the lot of them15:26
*** ysandeep is now known as ysandeep|out15:28
AlbinVass[m]clarkb: we had two different issues, one in Ansible 2.9 getting stuck in the pwd library (i believe), and Ansible 5 getting stuck in grantpty with glibc 2.31 which is installed in the official zuul images (v6.1)15:35
clarkbAlbinVass[m]: ok, that is good to know. I'm not sure that this problem is related to what you've seen but the symptoms seem similar. The job just seems to stop and then eventually timeout15:36
clarkbAlbinVass[m]: I'd be willing to land your updated glibc change and see if it helps :)15:37
clarkblet me go and review it now while it is fresh15:38
AlbinVass[m]The only reason we detected it was because we had jobs hanging forever in cleanup :)15:38
clarkbya that is similar to what we are seeing now15:39
fungiclarkb: thanks! i always forget to search for task timeouts, which should be the first thing i think of when i can't find any failed tasks (because the timed out task never gets a recap)15:39
*** rlandy is now known as rlandy|brb15:42
clarkbAlbinVass[m]: ok left some comments on that change. I think it is basically there though and worth a try15:46
clarkbOtherwise we may want to undo the ansible 5 default update? or override it for system-config jobs at least15:46
clarkbit looks like just about every infra-prod job is hitting those post failures now15:48
clarkbI don't think that is the end of the world while we work through the gerrit thing. But be aware of that if landing changes to prod as they may not deploy as expected15:48
clarkbI think we should finish up the gerrit thread then look at cleaning up /tmp on bridge to see if that helps then possibly do AlbinVass[m]'s zuul image update or force ansible 2.9 on the infra-prod jobs. I'll push up a change for infra-prod jobs now so that it is ready if we wish to do that15:49
clarkbactually nevermind lets try the /tmp cleanup first15:49
fungiclarkb: you noted the timeout was after the file encryption tasks, maybe we're consistently starving that machine for entropy when repeatedly running those?15:51
clarkbfungi: yes that was another thought15:51
fungicould be entirely unrelated to our ansible version15:51
clarkbis there a way to check that?15:51
fungiwe could report the size of the entropy pool before tasks we think are likely to try to use it15:52
fungior we could check it with cacti or something15:52
clarkb`find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30` for a listing of what we might want to delete15:53
clarkb/proc/sys/kernel/random/entropy_avail appears to report good numbers currently15:54
fungifungi@bridge:~$ cat /proc/sys/kernel/random/entropy_avail 15:54
fungi341315:54
fungiyeah, not bad for now15:54
fungibigger question is whether we're doing more key generation and similar tasks on there during jobs recently15:55
fungithe server seems to have a reasonable amount of available entropy, but the jobs could have started demanding an unreasonable amount15:56
clarkbI don't think the total has changed. We only shifted where it ran (that was in response to this problem too so the issue was preexisting)15:57
*** marios is now known as marios|out15:57
clarkbthat is the main reason why I suspected ansible 5 + glibc since it was a recent change15:57
clarkbbasically if we look at what changed ansible 5 is it and AlbinVass[m] has evidence of a very similar problem happening for them15:57
fungiyeah, i agree that's a stronger theory15:58
clarkbI think we can try something like `find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` if others agree those files look safe to prune. But don't expect that to help15:59
clarkb(good cleanup either way)16:00
clarkbBut also removing the old index backup on review and expanding available disk there might be the higher priority? There are too many things and I don't want to try and do too many things at once  :)16:00
clarkbactually you know what why don't we just flip infra-prod to ansible 2.9 again. it's a dirt cheap thing to change and we can toggle it back too16:03
clarkbbut that would give us a really good idea if it is related to ansible 5. change incoming16:03
fungisure16:03
fungisounds good to me16:03
*** rlandy|brb is now known as rlandy16:05
opendevreviewClark Boylan proposed opendev/system-config master: Force ansible 2.9 on infra-prod jobs  https://review.opendev.org/c/opendev/system-config/+/85002716:07
fungii count 172663 console logs matching the glob pattern suggested above16:08
clarkbya there are just over 200k if you drop the mtime filter16:08
fungiall the glob matches also match the regex ^/tmp/console-[^/]*-bridgeopenstackorg.log$16:09
fungiso i agree that looks safe16:09
clarkbfungi: do you agree the content of those files isn't worth keeping after 30 days?16:10
clarkbI'm not even sure it is worth keeping 30 days of them but figure we can start there16:10
clarkbI guess the other thing we can do is limit the depth on the search so that subdirs that happen to have the same filename pattern don't get files deleted but that also seems unlikely16:12
fungiyeah, spot-checking some at random, the contents don't really seem that useful to begin with16:12
clarkb`find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` ?16:13
fungiyes, that still works16:16
fungiand is slightly safer, i suppose16:16
clarkbya as there are some subdirs. I guess I'll spin up a screen and get that ready to go?16:16
clarkbfungi: screen started if you want to join and help double check16:17
clarkbshould I add a -print?16:17
clarkbI think I've decided that is going to be too much noise with -print16:18
AlbinVass[m]mind that we had other issues with ansible 2.9, I'm not sure why I haven't seen those before because the root cause for the deadlocks is the same (multithreaded process using fork)16:26
fungiclarkb: sorry, stepped away for a moment, but i'm in the screen session now and that looks great16:28
AlbinVass[m]* seen those when running zuul with ansible 2.9 before because16:28
fungiclarkb: and i agree, -print is worthless for this unless you're just trying to see if it's hung16:28
clarkbfungi: ok I'll run that now16:30
fungithanks16:31
clarkbit's done. Still 38k entries for the last 30 days but much cleaner16:31
clarkband ya I don't expect that to help. But maybe it will16:31
fungiyeah, that looks reasonable16:31
*** dviroel|lunch is now known as dviroel16:32
clarkbAlbinVass[m]: ya at this point I think it's fine for us to do some process of elimination. We ran under 2.9 for a while and it was fine there for us iirc16:33
clarkbAlbinVass[m]: could be a timing issue or one environment being more likely to trip specific issues than others16:33
corvusclarkb: nothing deletes the zuul console log files; so if we're talking bridge, we should have some kind of tmp cleaner16:59
fungicorvus: bridge, correct17:00
fungithese are in /tmp on bridge.o.o17:00
fungialso /tmp was nowhere near filling up from a block or inode standpoint17:01
opendevreviewJeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept  https://review.opendev.org/c/opendev/git-review/+/85005417:28
clarkbcorvus: thank you for confirming I'll work on putting something together for that I guess17:35
opendevreviewClark Boylan proposed opendev/system-config master: Add cronjobs to cleanup Zuul console log files  https://review.opendev.org/c/opendev/system-config/+/85005917:48
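A hedged sketch of the sort of cron entry change 850059 proposes; the schedule and the root ownership shown here are assumptions, and the find invocation mirrors the manual cleanup agreed above:

    0 4 * * * root find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete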
clarkbfungi: I'm going to pop out soon for lunch, but let me know if I can be helpful re gerrit and I'll adjust timing as necessary18:20
opendevreviewJeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept  https://review.opendev.org/c/opendev/git-review/+/85005418:26
opendevreviewJeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default  https://review.opendev.org/c/opendev/git-review/+/85006118:26
fungiclarkb: thanks, i'm switching to that now, since i'm done with the git-review distraction for the moment18:27
fungi[Fri Jul 15 18:42:54 2022] virtio_blk virtio5: [vdc] 2147483648 512-byte logical blocks (1.10 TB/1.00 TiB)18:46
fungi/dev/vdc1       lvm2 ---  <1024.00g <1024.00g18:47
fungii have `pvmove /dev/vdb1 /dev/vdc1` running in a root screen session on review.o.o now18:48
fungithat's after creating and attaching the cinder volume, creating a partition table on it with one large partition, writing a pv signature to the partition, and adding that pv to the existing "main" vg on the server18:50
fungii did make sure it was set as --type=nvme too18:50
fungi(openstack volume show confirms it's the case too)18:51
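A rough sketch of the cloud-side preparation fungi describes above; the new volume's name, the 1024 GB size, and the device letter are assumptions rather than a transcript of the actual session:

    openstack volume create --size 1024 --type nvme review02.opendev.org/gerrit02
    openstack server add volume review02.opendev.org review02.opendev.org/gerrit02
    # then on review02: one large partition and a pv signature, ready to join the vg
    parted /dev/vdc -- mklabel gpt mkpart primary 0% 100%
    pvcreate /dev/vdc1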
fungiblock migration is already 13% done, so it's going quickly18:51
fungihopefully the additional i/o isn't impacting gerrit performance too badly18:51
clarkbcool. I think where I always get confused is the relationship between physical volumes, volume groups, and logical volumes18:51
clarkbphysical volumes are aggregated into volume groups to provide logical volumes that may be larger than any one physical volume. Is that basically it?18:52
fungi[lvx][--lvy--][-lvz]18:52
fungi[-------vg0--------]18:53
fungi   [---pva---][--pvb--]18:53
clarkbok ya that helps18:53
fungier, except for my indentation on that last line18:53
fungibut yeah, a vg is just an aggregation of whatever block devices the kernel knows about, then you create logical volumes within a vg18:54
clarkband in this case we do the pvmove so that we can remove the other physical volume later. But then we have to expand the lv to stretch it out to match the larger size. Then after that the fs itself18:54
fungioptional but yes18:55
fungithe process is basically 1. add a new pv to the vg, 2. move the extents for the lv from the old pv to the new pv, 3. remove the old pv from the vg18:55
fungithen later you can also 4. increase the size of the lv, 5. resize the fs on that lv, 6. profit18:56
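Collected into shell commands, a sketch of that sequence using this migration's device and vg/lv names; treat it as an outline of the steps rather than the exact session:

    vgextend main /dev/vdc1                    # 1. add the new pv to the vg
    pvmove /dev/vdb1 /dev/vdc1                 # 2. move the lv's extents onto the new pv
    vgreduce main /dev/vdb1                    # 3. drop the old pv from the vg
    pvremove /dev/vdb1                         #    and wipe its pv signature
    lvextend --extents=+100%FREE main/gerrit   # 4. grow the lv into the freed space
    resize2fs /dev/main/gerrit                 # 5. grow the filesystem to match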
clarkbfungi: are vdb and vdc pvs or lvs?18:56
clarkb(they are pvs I think?)18:57
fungiwe could also instead have done 1. add a new pv to the vg, 2. increase the size of the lv, 3. resize the fs18:57
fungipvs (so-called "physical volumes") are the kernel block devices for the disks, so vdb and vdc in this case18:57
clarkbbut then we would have to keep both cinder volumes attached (which I think is less ideal long term)18:57
fungilv is the "logical volume" which spans one or more physical volumes within the volume group18:58
clarkband they are located under the /dev/mapper paths18:59
fungiand yes, the second option i described would have been more if we just wanted to incrementally add to the existing physical volume set rather than migrating data from one to another. in the real world where you may be plugging another actual hunk of spinning rust into a server that's more typical, but with the virtual block devices in a cloud it usually makes more sense to migrate data to a19:00
fungilarger pv where possible19:00
fungialso some people simply don't know about pvmove and so don't realize it's possible to do this19:00
clarkbfungi: last question, why do you need to add it to the volume group if we're essentially cloning an existing member of the vg onto the new pv? I guess just a bookkeeping activity?19:02
fungiit's just how lvm is designed. the extents of an lv can be mapped to any pv in their vg. when you do a pvmove it's taking advantage of the copy-on-write/snapshot capabilities in lvm to update the address for an extent to a new location and then cleaning up the old one, so during any arbitrary point in the pvmove that lv is spread across the old and new pvs19:04
clarkbgot it19:05
fungisince an lv can already span multiple pvs, it's almost a side effect of the other features19:05
clarkbright19:05
fungibut to have an lv spread across multiple pvs, they have to be in its vg, hence the initial step of adding the new pv to the vg19:06
fungitechnically we're "extending" the vg to include the new pv, and then later "reducing" the vg off of the old pv once there are no longer any lv extents in use on it19:07
clarkbmakes sense19:07
fungibut since it's all happening underneath the lv abstraction, it's essentially invisible to the filesystem activity19:08
fungijust (possibly lots of) added i/o activity for those devices while in progress19:09
clarkbonce the pvmove is done we could theoretically expand to ~1.25TB on the lv since the other pv will still be in the vg right? So you need to be careful with maths or remove the other pv first?19:10
fungiyeah, that's why i suggested removing the old pv first before resizing the lv19:10
fungiat the moment we have .25tb in use and 1tb free out of ~1.25tb in the vg, so as long as i don't tell the lv to use more space before i remove the old pv, we'll be down to .25tb used and .75tb free out of 1tb in the vg, then i'll lvextend the lv to use the additional space and resize the fs after that19:12
fungipvmove is done now, and if you look at the pvs output /dev/vdb1 has 250.00g of 250.00g free, while /dev/vdc1 has 774.00g of 1024.00g free19:13
clarkbI'll trust you on that (as I'm juggling lunch right now)19:13
fungiso this is the point where i vgreduce off of /dev/vdb119:13
clarkbbut ya that all makes sense. The hard part is retaining this info for the future19:13
fungiwe have most of the commands documented at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management19:14
fungibut also the lvm manpage is helpful19:14
clarkbWhen we do the fs resize does that add more inodes (in theory we need them?)19:15
clarkbiirc it will add them at the same ratio of blocks to inodes that the existing fs has and we do get new ones but we can't change the ratio?19:15
fungii think that depends on the fs, but ext4 should increase the inode count19:15
fungiif memory serves, ext3 did not increase inode count relative to block count19:15
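If one wanted to confirm, a quick way to read the filesystem's inode and block counts before and after a resize; assumes ext4 tooling, which is what's in play here:

    tune2fs -l /dev/mapper/main-gerrit | grep -Ei 'inode count|block count|inode size'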
funginow i've done `vgreduce main /dev/vdb1` to take the old pv out of our "main" vg19:16
fungiand `pvremove /dev/vdb1` to wipe the pv signature from the partition19:16
fungiso `pvs` currently reports only one pv on the server, which is a member of the main vg, with 774.00g free19:17
fungiand `vgs` similarly reports the vg has 774.00g available19:18
funginow to detach the old volume, which is always the iffy part of this process19:18
*** tobias-urdin is now known as tobias-urdin_pto19:19
fungii used `openstack server remove volume "review02.opendev.org" "review02.opendev.org/gerrit01"` and after a minute it returned. now `openstack volume list` reports review02.opendev.org/gerrit01 is available rather than in-use19:21
clarkband /dev on review02 probably doesn't show that device anymore?19:21
fungicorrect, we have vda and vdc but nothing between19:22
fungiinterestingly, nothing gets logged in dmesg during that hot unplug action19:22
fungiand now i've done `openstack volume delete review02.opendev.org/gerrit01` so we're cleaned up on the cloud side of things19:23
funginext is the lv and fs resizing19:23
fungii've run `lvextend --extents=100%FREE main/gerrit` which tells it to grow the gerrit lv into the remaining available extents of the main vg19:25
fungi"Size of logical volume main/gerrit changed from <250.00 GiB (63999 extents) to 774.00 GiB (198144 extents)."19:26
clarkbshouldn't it be 1TB instead of 774?19:26
fungid'oh, yep!19:27
fungii missed a +19:27
fungithanks for spotting that19:27
fungilvextend --extents=+100%FREE main/gerrit19:27
fungiSize of logical volume main/gerrit changed from 774.00 GiB (198144 extents) to <1024.00 GiB (262143 extents).19:27
clarkbout of curiosity would it have errored if the 100%FREE was less than the current consumed size?19:28
clarkb(or did we just get really lucky?)19:28
fungiit would have errored, yes. lvextend can only increase. you have to use lvreduce to decrease19:28
clarkbthe command is extend so ya19:28
clarkbmakes sense19:28
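For the record, the distinction that bit here, with the sizes from this vg in the comments:

    lvextend --extents=100%FREE main/gerrit    # sets the lv to the vg's free space (774g) -- not what was wanted
    lvextend --extents=+100%FREE main/gerrit   # grows the lv by the free space, to ~1024g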
funginow pvs and vgs show 0 free blocks, and lvs shows the gerrit lv is 1tb19:28
corvusglad they retired lvbork19:29
fungithe swedish chef was bummed though19:29
fungiresize2fs /dev/main/gerrit19:30
fungiThe filesystem on /dev/main/gerrit is now 268434432 (4k) blocks long.19:30
fungiFilesystem               Size  Used Avail Use% Mounted on19:30
fungi/dev/mapper/main-gerrit 1007G  220G  788G  22% /home/gerrit219:30
clarkbany idea if the inode count changed with that expansion?19:31
fungi#status log Replaced the 250GB volume for Gerrit data on review.opendev.org with a 1TB volume, to handle future cache growth19:31
opendevstatusfungi: finished logging19:32
clarkb(I don't really expect inodes to be a problem considering why we expanded. Mostly just curious)19:32
fungidf -i says we have 1% of inodes in use on that fs19:32
clarkbcool. I feel like I have learned things19:32
fungiso while i think it did increase available inodes x4, it's unlikely to matter19:33
funginow i just need to do two more of these on the ord backup server and one on the ord afs server before the end of the month19:33
fungi(for upcoming announced storage maintenance in that region)19:34
clarkbcacti reflects the changes too fwiw19:34
fungiawesome19:34
fungii've closed out my root screen session on review.o.o now19:38
opendevreviewJeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default  https://review.opendev.org/c/opendev/git-review/+/85006119:39
clarkband now we monitor the cache disk usage19:39
clarkbI marked the change to disable the caches as WIP so we know that is a fallback we aren't ready for yet19:39
fungithanks19:43
*** dviroel is now known as dviroel|biab20:09
opendevreviewJeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default  https://review.opendev.org/c/opendev/git-review/+/85006120:23
*** dviroel|biab is now known as dviroel21:08
*** dasm|ruck is now known as dasm|off21:37
*** dviroel is now known as dviroel|out21:37
clarkbfungi: dpawlik1 can we abandon https://review.opendev.org/c/opendev/system-config/+/833264 ?21:40
fungiyeah, there's info at https://governance.openstack.org/sigs/tact-sig.html#opensearch and https://docs.opendev.org/opendev/infra-manual/latest/developers.html#automated-testing has been cleaned up. https://docs.openstack.org/project-team-guide/testing.html#automatic-test-failure-identification probably does need updating though21:45
fungii've abandoned it, thanks for the reminder21:47
clarkbthanks21:47
fungisure thing21:50
ianwclarkb: thanks for looking into that.  the 2.9 check would probably be good 22:24
clarkbianw: if 2.9 is better then I strongly suspect AlbinVass[m] did all the work :)22:25
ianwi've approved it and will check in on it after the weekend22:25
ianwdeadlock sounds like it.  i poked at all the processes i could see, and nothing seemed to be doing or waiting for anything obvious22:26
clarkbOne thing I wonder about is whether or not we should revert the changes to the run playbook. But I think this is still strictly an improvement to separate the logging from the actions22:26
ianwi guess that was a response to the timeouts i had already randomly seen -- so it was happening before that22:27
clarkbyes I think it was happening before, just more reliably now22:27
ianwi'm not sure the encryption jobs really require entropy, as they're using pre-generated keys.  but it's a thread to pull22:28
ianwi'd also believe something about gpg agents, sockets, <insert hand wavy actions here>...22:29
clarkbianw: my suspicion is that the new order of operations just happens to trip the deadlock more reliably22:29
clarkbit was happening before less reliably22:30
clarkbbut now we've managed to align timing/whatever to make it happen 100% of the time22:30
opendevreviewMerged opendev/system-config master: Force ansible 2.9 on infra-prod jobs  https://review.opendev.org/c/opendev/system-config/+/85002722:30
ianwoh annoying, the POST failure is different22:32
corvusclarkb: to be clear -- 850027 is temporary for data collection?22:32
clarkbcorvus: yes22:32
ianwhttps://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&skip=022:32
clarkbcorvus: if that makes it better then I think we should strongly consider AlbinVass[m]'s docker image update22:33
corvuswhen will we have data?22:33
clarkbcorvus: jobs will be triggered in 27 minutes in the deploy pipeline22:33
corvusand one buildset is sufficient?22:34
clarkbcorvus: yes I think so. we had a 100% failure on the last hour's buildset22:34
clarkber sorry it's the opendev-prod-hourly pipeline not deploy22:34
clarkbbut ya runs hourly. Last hour was 100% failure. In ~26 minutes the first job that indicates whether or not this is happier should run22:35
opendevreviewIan Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars  https://review.opendev.org/c/opendev/system-config/+/85007722:36
clarkbianw: oh does that need to land before we get consistent results?22:36
clarkband/or should we just revert back to the way it was for simplicity then work from there?22:37
ianwi dunno, reverting back means we get no logs, which is also unhelpful22:37
clarkbianw: well if 2.9 fixes it we would get logs22:37
clarkbianw: I'm not sure your fix will fix it since this is a separate playbook run22:38
clarkbis _log_timestamp cached?22:38
ianwnow i've sent that i'm wondering if the way we dynamically add bridge might make that not work22:38
clarkbya I'm not sure that will work22:38
ianwit would work across playbooks, but i'm not sure with the dynamic adding of the host22:39
ianwhrm22:39
clarkbI think you can just get a new timestamp value in the post-run playbook22:41
clarkbsince the two uses in the run playbook and the post-run playbook are detached from each other (though them lining up is a bit nice it isn't necessary)22:42
ianwyeah i can do that for now, and try passing the variable around later when it's all working22:42
clarkbI think if the jobs fail because that var is undefined that is a good indication that ansible 2.9 fixes the othe rproblem at least22:42
opendevreviewIan Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars  https://review.opendev.org/c/opendev/system-config/+/85007722:45
ianwright, because they all got far enough to try renaming the file, rather than hanging22:45
ianwwell actually, all those ran under 2.12?22:46
clarkbthe example I looked at earlier today didn't get that far. It failed due to a timeout right around encrypting things and it ran under 2.1222:47
clarkbI guess I haven't looked at the others; we may have 100% failure due to either one of the things22:47
clarkbcorvus: ^ related, we seem to report POST_FAILURE when we time out in post-run22:47
clarkbianw: ah yup seems at least some of the failures are due to the missing var. corvus I was mistaken I assumed 100% of the failures were due to the timeouts but that isn't the case so we won't know as quickly as I hoped22:48
clarkb(sorry I was juggling like 4 different issues this morning...)22:49
ianwright, i think we a) fix the missing var and b) i'll check in on it on monday with fresh eyes and see if we're still seeing timeouts22:49
ianwhttps://zuul.openstack.org/build/304cacd67c2e459cbb69e0af5b24963f was the job i was looking at last night that looked stuck (at the time, and eventually timed out)22:49
ianwfor reference, https://paste.opendev.org/show/bvelHlomXJwkNpCmKO6r/ is the console log which i had open on it22:51
ianwi agree the last thing before the timeout was22:51
ianw2022-07-15 09:36:45.327346 | TASK [encrypt-file : Check for existing key]22:51
ianwthis means it actually ran the whole playbook and has done everything up to setting up logs for storage22:52
ianwthat is not a very exciting task -> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/encrypt-file/tasks/import-key.yaml#L3 ; just calls gpg --list-keys in a command: | block22:53
opendevreviewIan Wienand proposed opendev/system-config master: run-production-playbook-post: generate timestamp  https://review.opendev.org/c/opendev/system-config/+/85007722:54
ianwclarkb: ^ fixed that change typo to avoid confusion22:54
clarkb+2 thanks22:55
clarkband ya once that one lands we should get much clearer info22:55
ianwi've run gpg --list-keys in a loop on bridge as "zuul" and haven't seen anything hang, so doesn't seem as obvious as gpg being silly22:56
ianwand last night, when i was poking, i did not see a gpg process attached to any of the ansible; all that was running was an "ansiballz" python blob thing afaics22:57
corvusclarkb: i'm not sure reporting post_failure when a post playbook times out is wrong?23:11
corvusthe idea of post_failure is that it's saying "the main thing worked, but the post-run stuff did not"23:12
corvusie, "your change is not broken, but your job cleanup is"23:13
clarkbcorvus: would POST_TIMEOUT make sense though?23:13
clarkbsimilar to how we have TIMEOUT and POST_FAILURE23:13
corvusi guess if that's useful?23:13
clarkbI think for me it's mostly to reduce confusion over an actual failure or a timeout in post since we have that in run23:13
corvushrm23:13
clarkbin this case it would have distinguished between the two different failures we are currently seeing23:14
corvusi sort of disagree -- a post-run script should not timeout23:14
corvusyeah, but i mean that's just a happenstance of these 2 failures23:14
corvusa test run timeout tells you maybe your unit tests run too long23:14
clarkbya I'll have to think about it a bit more. Current thought is that it is mostly surprising considering the differentiation elsewhere23:15
clarkbI think it may have tripped fungi up too when trying to determine why the job we debugged earlier today had failed23:16
corvusit does have some signal to it -- it tells you something like "your job broke in post run because your system is completely hosed" vs "your job broke in post run because it didn't produce a  file that was expected"23:16
corvusbut i personally am not convinced it's enough signal to make it a result so that you can scan the build reports for it23:17
fungiyeah, we've had a few different chronic situations recently which caused some task to timeout in post-run. i should just be more diligent about mining executor logs for "timeout" in addition to failed=[^0]23:34
fungii think to check for task failures, but then forget that if there aren't any it could be because a task timed out and so we didn't get output from it23:35
fungior even a play recap23:35
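A hedged example of the two searches described there; the executor debug log path is an assumption:

    grep -E 'failed=[^0]' /var/log/zuul/executor-debug.log
    grep 'Ansible timeout exceeded' /var/log/zuul/executor-debug.log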
