Thursday, 2020-06-18

00:05 *** rfolco|rover has joined #softwarefactory
00:46 *** apevec has quit IRC
00:47 *** rfolco|rover has quit IRC
01:23 *** rfolco|rover has joined #softwarefactory
07:31 *** harrymichal has joined #softwarefactory
07:43 *** harrymichal has quit IRC
07:57 *** jpena|off is now known as jpena
11:38 *** jpena is now known as jpena|lunch
12:38 *** jpena|lunch is now known as jpena
13:23 *** sshnaidm is now known as sshnaidm|mtg
14:01 *** rfolco|rover is now known as rfolco
15:06 *** harrymichal has joined #softwarefactory
15:11 *** harrymichal has quit IRC
15:41 *** sshnaidm|mtg is now known as sshnaidm|ruck
15:53 *** harrymichal has joined #softwarefactory
16:03 *** harrymichal has quit IRC
16:03 *** harrymichal has joined #softwarefactory
16:52 *** sshnaidm|ruck is now known as sshnaidm|off
17:21 *** jpena is now known as jpena|off
18:29 *** sduthil has joined #softwarefactory
18:30 <sduthil> hi there! I'm running software factory for the wazo platform project http://wazo-platform.org. I'll describe my problem below:
18:30 <sduthil> I have 1 VM running zuul and nodepool and another VM hosting the runc containers started by the zuul VM
18:31 <sduthil> I sometimes have a job that gets stuck in the building state for the runc container
18:32 <sduthil> after investigating, the runc container is correctly started, but the task to check that the container is running never finishes (ssh <container> echo okay)
18:32 <sduthil> I have one such job right now that is stuck, and I'm digging to understand what's happening
18:33 <sduthil> I do see the ansible-playbook ...create.yml running, but I don't see the ssh echo okay client process
18:34 <sduthil> stracing the ansible-playbook ...create.yml gives me something like this, in a fast loop:
18:34 <sduthil> clock_gettime(CLOCK_MONOTONIC, {tv_sec=769509, tv_nsec=56756942}) = 0
18:34 <sduthil> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout)
18:34 <sduthil> ...
18:35 <sduthil> do you have any idea what I could try to understand what is stuck?
18:35 <sduthil> if I run ssh container echo okay in another shell, it works perfectly (did that in a loop 1000 times)
18:36 <sduthil> the jobs get stuck randomly about 1 time out of 100, it's rather rare
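
The check described above boils down to retrying `ssh <container> echo okay` until the container answers. A rough Python sketch of that kind of readiness probe, with placeholder host, port, and retry values rather than the actual create.yml task, assuming plain OpenSSH and the subprocess module:

    import subprocess
    import time

    def wait_for_ssh(host, port, attempts=3, delay=5):
        # Retry "ssh <host> echo okay" until the container answers, or give up.
        for _ in range(attempts):
            try:
                result = subprocess.run(
                    ["ssh", "-p", str(port), host, "echo", "okay"],
                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=30)
            except subprocess.TimeoutExpired:
                continue
            if result.returncode == 0 and b"okay" in result.stdout:
                return True
            time.sleep(delay)
        return False
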
18:38 <tristanC> sduthil: that's odd, perhaps it happens when the test is performed too early?
18:39 <sduthil> tristanC, yes, that's what I thought too. In this scenario, either the ssh client command would fail, return, and get re-run by ansible, or the ssh client command would be stuck somewhere and I would be able to see it in the running processes
18:39 <sduthil> but I don't see the ssh client command anywhere
18:39 <tristanC> sduthil: we are actually in the process of removing the runc driver... perhaps we could help you set up the kubernetes based replacement?
18:40 <sduthil> ah ok, I didn't know runc was being removed :)
18:41 <nhicher> sduthil: it will be removed in sf-3.5. Are you on 3.4?
18:41 <tristanC> sduthil: the ssh client command should be running on the nodepool-launcher host, it is triggered by: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/playbooks/create.yml@32
18:41 <sduthil> I've never played with kubernetes yet, so I'll need to read up a bit on kube before making that kind of transition
18:42 <tristanC> sduthil: you don't have to, sf comes with a role to set up a small service that can be used in place of k8s
18:43 <sduthil> nhicher, I don't know how to check the version of software-factory, what's the command?
18:43 <tristanC> sduthil: cat /etc/sf-release
18:43 <sduthil> I'm running SF 3.3
18:45 <sduthil> tristanC, I read that playbook thoroughly, and I totally agree that I should see an ssh client command on the nodepool-launcher host, but I don't
18:47 <sduthil> also, in an earlier investigation, I killed the ansible-playbook ...create.yml and saw "Wait for ssh access" in the nodepool log, then nothing. The symptoms back then were exactly the same as now, so most likely it is stuck at the same place as before.
18:47 <tristanC> sduthil: could it be the task before that one that is stuck?
18:48 <tristanC> sduthil: hum, if it said "Wait for ssh access", then no :)
18:48 <sduthil> tristanC, I doubt it, since the runc container is correctly running and usable, and the earlier investigation showed the log "Wait for ssh access"
18:48 <sduthil> yes, I agree
18:50 * tristanC scratching head
18:51 <sduthil> yep, same here :)
18:51 <sduthil> I have those two processes:
18:51 <sduthil> nodepool  9831  1.0  0.9 523032 38328 ?        Sl   13:22   3:25 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604
18:51 <sduthil> nodepool 10170  0.0  0.8 524936 34080 ?        S    13:24   0:00 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604
18:51 <sduthil> the first is the parent, the second is the child
18:51 <sduthil> the parent shows the strace mentioned earlier
18:52 <sduthil> the child shows the following strace, when attaching:
18:52 <sduthil> strace: Process 10170 attached
18:52 <sduthil> futex(0x7f02facee930, FUTEX_WAIT_PRIVATE, 2, NULL
18:52 <sduthil> ...
18:53 <sduthil> (no loop here, it's just waiting) so not very helpful
18:53 <tristanC> sduthil: it looks like an ansible bug, the task should either have an ssh child process, or it should fail after 3 attempts (with a default 5 second delay)
18:53 <sduthil> would upgrading ansible be a good idea? I'm running ansible 2.6.9
18:54 <tristanC> sduthil: that can be attempted, though upgrading doesn't always solve the issue at hand, and there is the risk it introduces more
18:55 <tristanC> sduthil: since that driver is no longer supported, may I suggest another band-aid solution?
18:55 <sduthil> are there known compatibility issues for software-factory with higher versions of ansible?
18:56 <tristanC> sduthil: let me check
18:59 <tristanC> sduthil: it should be fine, the main risk would be the zuul executor, but your version already has the `multi-ansible-version` feature where zuul uses a custom venv for ansible
18:59 <tristanC> sduthil: however, i think a better solution would be to add here: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@80
19:00 <tristanC> sduthil: a ["timeout", "256s", "ansible-playbook"] at the beginning of the argv list
19:01 <tristanC> sduthil: such a timeout would be a safeguard to prevent the ansible process from getting stuck, and when it kicks in, it should properly propagate as a start failure and nodepool will retry creating the node
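
A minimal sketch of that safeguard, assuming the provider builds the ansible-playbook command as a plain argv list; the variable name, the placeholder playbook arguments, and the subprocess call are illustrative, not the actual provider.py code around the linked line:

    import subprocess

    # argv roughly as the runc provider would build it (arguments are placeholders)
    argv = ["ansible-playbook", "create.yml", "-i", "inventory"]
    # prepend coreutils `timeout` so a hung playbook is killed after 256 seconds
    argv = ["timeout", "256s"] + argv
    # a non-zero exit (including timeout's exit code 124) raises CalledProcessError,
    # which surfaces as a node start failure and triggers a retry
    subprocess.check_call(argv)
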
19:02 <sduthil> ah thank you :) I was looking for a timeout feature in ansible, but didn't find any
19:04 <sduthil> would it behave differently than killing the ansible-playbook ...create.yml directly? Because I already tried that, and the container was never removed, so when nodepool tried to recreate the container, it failed saying "the container with id ... is already started"
19:05 <tristanC> hmm, adding such a timeout should have the same effect as killing the ansible-playbook process
19:06 <tristanC> e.g. it should propagate through https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@199
19:07 <tristanC> and then https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@45
19:07 <tristanC> which should result in a new attempt using a new hostid
19:08 <tristanC> however, you may have found another issue where the leftover runc process may result in a conflict...
19:09 <sduthil> yeah, I see that the hostid does not include the retry number, so it will be identical to the previous try
19:10 <tristanC> sduthil: thus, you may have to also add a 'try: self.handler.manager.cleanupNode(hostid) \n except: pass' in https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@46
19:11 <tristanC> yeah you are right, the hostid is not updated
19:11 <tristanC> sduthil: i think it's safe to attempt a delete when the create fails, just have to wrap it in another try/except to avoid escaping the retry loop too soon
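
A rough sketch of the handler-side change being discussed, assuming a simple retry loop around node creation; the cleanupNode call mirrors the snippet quoted above, while the wrapper function, the create callable, and the retry count are placeholders:

    def launch_with_cleanup(handler, hostid, create, retries=3):
        # Retry node creation; on failure, best-effort removal of the leftover
        # container so the next attempt does not hit "container with id ... is
        # already started".
        for attempt in range(retries):
            try:
                create(hostid)                           # placeholder for the real create step
                return
            except Exception:
                try:
                    handler.manager.cleanupNode(hostid)  # the cleanup call quoted above
                except Exception:
                    pass                                 # never let cleanup abort the retry loop
                if attempt == retries - 1:
                    raise
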
19:12 <sduthil> ok, I'll apply this patch then
19:12 <sduthil> thank you very much for your help
19:18 <tristanC> sduthil: you're welcome, please let us know how it went
19:20 <sduthil> sure, but it will take a few weeks to confirm that it doesn't happen again :)
19:23 <sduthil> I'd like to clone the source code to make the patch, but I can't find the files to be patched when I clone with git clone https://review.opendev.org/zuul/nodepool
19:23 <sduthil> I can't find the nodepool/driver/runc directory
19:24 <sduthil> how do I git clone the runc driver?
19:24 <tristanC> sduthil: that's because the driver is not merged. have a look at https://review.opendev.org/#/c/535556/28
19:24 <tristanC> sduthil: at the top right, there is a `download` button that contains the git commands to fetch the driver locally
19:26 <sduthil> ah! thank you!
20:07 <sduthil> seems like a well-hidden bug, still unsolved: https://github.com/ansible/ansible/issues/30411
20:36 *** sduthil has quit IRC
20:52 *** rfolco has quit IRC
21:33 *** harrymichal has quit IRC
21:35 *** harrymichal has joined #softwarefactory
23:58 *** rfolco has joined #softwarefactory
