Sunday, 2021-10-03

20:16 <stevebaker> Hey I've got a change which has been "running" a CI job for 72 hours, is there any way it can be killed manually? https://zuul.openstack.org/status#804000 dib-functests-bionic-python3-extras
21:19 <fungi> stevebaker: i'll take a quick look to see why it might be stuck first, but sure
21:19 <stevebaker> fungi: much appreciated
21:21 <fungi> the dib-functests-bionic-python3-extras build for it seems to still have a functional console, but all it logged was:
21:22 <fungi> 2021-09-30 18:55:05.598946 | Job console starting...
21:22 <fungi> 2021-09-30 18:55:05.613329 | Updating repositories
21:22 <fungi> 2021-09-30 18:55:05.883662 | Preparing job workspace
21:23 <fungi> and it's just been sitting there ever since
21:23 <stevebaker> yeah I noticed that
21:23 <fungi> checking the executor log once i work out which one it got farmed out to
21:25 <fungi> this is the incomplete build page for it, for future reference: https://zuul.opendev.org/t/openstack/build/f51e9d26fcd3458b9da5fa3f934e4aa6
21:28 <fungi> ze04 is the executor it ended up on
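
For anyone retracing this step, tracking a build down to its executor is just a matter of searching each executor's debug log for the build UUID; a rough sketch follows, where the host list and log path are assumptions rather than anything stated in the log:

    # assumed hosts and log path; search each executor for the stuck build's UUID
    for host in ze01 ze02 ze03 ze04; do
      ssh "$host" grep -q f51e9d26fcd3458b9da5fa3f934e4aa6 /var/log/zuul/executor-debug.log \
        && echo "build found on $host"
    done
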
21:28 <fungi> it's been looping this over and over in its log:
21:28 <fungi> 2021-10-03 21:26:30,843 DEBUG zuul.ExecutorServer: Finishing Job: 76fc8aac5f444f3e89998ad6697caae9, queue(5): {'f51e9d26fcd3458b9da5fa3f934e4aa6': <zuul.executor.server.AnsibleJob object at 0x7f48706bfdf0>, '08bf701e3811444f9945626d92210591': <zuul.executor.server.AnsibleJob object at 0x7f48507251c0>, '7c7b642ba1004a49988226f29bd3a9f5': <zuul.executor.server.AnsibleJob object at
21:28 <fungi> 0x7f487036dfd0>, '5aeadf4a4c9e4738a0a6bdb266b09b6c': <zuul.executor.server.AnsibleJob object at 0x7f48706d6670>, 'a451017c33674dba909a0cdfc3b2d473': <zuul.executor.server.AnsibleJob object at 0x7f4810798ca0>}
21:31 <fungi> the enqueue event id was 840d69f7aacc4f59bdff85271e8dfdb3
21:35 <fungi> the last thing about it in the debug executor log seems to be this:
21:36 <fungi> Cloning gerrit/openstack/diskimage-builder
21:36 <fungi> 2021-09-30 18:55:05,884 DEBUG zuul.AnsibleJob: [e: 840d69f7aacc4f59bdff85271e8dfdb3] [build: f51e9d26fcd3458b9da5fa3f934e4aa6] Cloning gerrit/openstack/diskimage-builder
21:38 <fungi> this is everything it logged for that combination of build id and event id: https://paste.opendev.org/show/809751
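
Reconstructing that filter is straightforward; something along these lines (log path assumed) pulls every line tagged with both identifiers:

    # assumed log path; match lines carrying both the event id and the build id
    grep 'e: 840d69f7aacc4f59bdff85271e8dfdb3' /var/log/zuul/executor-debug.log \
      | grep 'build: f51e9d26fcd3458b9da5fa3f934e4aa6'
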
21:39 <stevebaker> so it's just stalled on cloning
21:39 <stevebaker> or whatever happens after that
21:40 <fungi> yeah, i suspect if i can find the node it's using there will be a hung git process, but what i can't figure out is why the playbook timeout didn't kick in
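
The check fungi is describing would look roughly like this on the suspect host (illustrative only, not quoted from the session):

    # list any long-running git clone processes with their elapsed time
    ps -eo pid,etime,args | grep '[g]it clone'
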
21:43 <stevebaker> maybe the timeout isn't applied this early in the build
21:47 <fungi> ahh, yeah the node which got assigned is still in a ready state, i guess that git clone was in the workspace on the executor prior to being synced to the node, though i'm surprised we don't mark the node in-use before then too
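
Confirming the node's state would normally be done with the nodepool CLI on a launcher; a minimal sketch, assuming that tooling is at hand:

    # assumed: nodepool CLI on the launcher; NODE_ID is a placeholder for the assigned node
    nodepool list | grep "$NODE_ID"
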
21:50 <fungi> according to https://wiki.openstack.org/wiki/Infrastructure_Status the networking issues in vexxhost impacting the gerrit server started around 19:10 utc that day, but maybe they were impacting things a few minutes prior to that?
21:51 <fungi> would explain how git got hung
21:54 <stevebaker> I see
21:56 <fungi> looks like /var/lib/zuul/builds/f51e9d26fcd3458b9da5fa3f934e4aa6/work/logs/job-output.txt has open file descriptors from a zuul-executor fork on ze04, but i don't see any child processes of that
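
A plausible reconstruction of that file-descriptor check on ze04 (not a quoted command):

    # show which processes hold the build's console log open
    lsof /var/lib/zuul/builds/f51e9d26fcd3458b9da5fa3f934e4aa6/work/logs/job-output.txt
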
21:57 <fungi> anyway, i've probably extracted about as much info as i can about the situation, i'll go ahead and dequeue that change
21:57 <stevebaker> fungi: ok, thanks
21:58 <fungi> stevebaker: it's gone now, if you want to recheck
21:58 <stevebaker> sweet
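
The manual dequeue itself is done with the zuul client against the scheduler; roughly the following, where the pipeline name and patchset are assumptions on the editor's part:

    # assumed pipeline; PATCHSET stands in for the change's current patchset number
    zuul dequeue --tenant openstack --pipeline check \
      --project openstack/diskimage-builder --change "804000,$PATCHSET"
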
21:59 <fungi> i'll confer with other zuulfolk on that and see if there are ways we could catch similar cases in the future
21:59 <stevebaker> fungi: ok, thanks for your help
22:17 <fungi> stevebaker: any time, and thanks for pointing out the issue
