Sunday, 2020-12-13

01:19 *** DSpider has quit IRC
01:41 *** iurygregory has quit IRC
02:27 *** ysandeep|sick is now known as ysandeep
08:40 *** ykarel has joined #opendev
08:54 *** DSpider has joined #opendev
09:16 *** ykarel has quit IRC
09:34 *** ykarel has joined #opendev
11:27 *** ykarel has quit IRC
12:14 *** tosky has joined #opendev
12:48 *** hamalq has joined #opendev
13:17 *** knikolla has quit IRC
13:19 *** knikolla has joined #opendev
14:49 *** hamalq has quit IRC
15:27 *** hamalq has joined #opendev
16:02 *** fressi has joined #opendev
16:15 *** hamalq_ has joined #opendev
16:19 *** hamalq has quit IRC
16:20 *** fressi has quit IRC
16:29 *** tosky has quit IRC
16:43 *** fressi has joined #opendev
17:06 *** Alex_Gaynor has joined #opendev
17:07 <Alex_Gaynor> Hey, on pyca, we're seeing intermittent (but somewhat frequent) network errors from arm64 machines, for example https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a27/5410/ae72d6b91239c5262ed0b28792f76c7449a42ec6/check/pyca-cryptography-ubuntu-bionic-py36-arm64/a27b5f5/job-output.txt (search "Clone wycheproof")
17:28 *** hamalq_ has quit IRC
17:28 *** tosky has joined #opendev
17:32 <fungi> Alex_Gaynor: mmm, yeah, the nodes in that cloud only have unique global ipv6 addresses, so they share an ipv4 nat for reaching v4-only sites like github.com. in the past we've seen similar issues when the nat table got overrun by too many simultaneous tracked states. i wonder if it could be the same situation this time
17:33 <fungi> one workaround would be to declare those repositories as required-projects in the zuul jobs, so that our zuul executors cache and push them onto the job nodes; then you're only cloning locally from that copy on the filesystem
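
For reference, a rough sketch of the required-projects approach fungi describes, assuming google/wycheproof is already known to the relevant Zuul tenant (the job layout here is an assumption, not the actual pyca configuration):

    - job:
        name: pyca-cryptography-ubuntu-bionic-py36-arm64
        required-projects:
          # Zuul's executors cache this repo and push it onto the job node,
          # so the job can clone from the local copy instead of github.com.
          - github.com/google/wycheproof
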
17:34 <fungi> we've also seen that wrapping remote network operations like that in a retry helps if the problem is reasonably random
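
The retry idea would look roughly like this as an Ansible task (the repo URL and destination path are assumptions for illustration):

    - name: Clone wycheproof test vectors
      git:
        repo: https://github.com/google/wycheproof
        dest: "{{ ansible_user_dir }}/wycheproof"
      register: clone_result
      retries: 3          # re-run the task a few times before giving up
      delay: 10           # seconds to wait between attempts
      until: clone_result is succeeded
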
17:35 <fungi> also, cloning those in a pre phase playbook instead of the run phase would cause zuul to just automatically and silently retry the build (up to three times by default) if that failed
17:36 <Alex_Gaynor> How do I put something in a pre-phase playbook?
17:36 * fungi looks at that job definition real fast
17:37 <Alex_Gaynor> Ah, `pre-run` key it looks like. Let me try this
17:39 <fungi> yeah, looks like that's all happening in .zuul.playbooks/playbooks/tox/main.yaml, called from the run phase of pyca-cryptography-base, but you could move a lot of those tasks into a different playbook called in pre-run
17:40 <fungi> basically any job setup should generally be done in pre-run, and then the only tasks you put in the run phase are the ones you'd actually expect a bad patch to make fail
17:41 <fungi> that way things like network blips hitting the pre-run setup for the build would just cause it to be retried
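
A sketch of the job layout fungi is suggesting, assuming a new pre.yaml playbook alongside the existing main.yaml (the pre.yaml path is hypothetical; only main.yaml is mentioned above):

    - job:
        name: pyca-cryptography-base
        # Setup tasks (cloning test vectors, installing dependencies) move
        # here; if a pre-run playbook fails, zuul silently retries the whole
        # build, up to three times by default.
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml
        # Only tasks a bad patch should legitimately fail stay in run.
        run: .zuul.playbooks/playbooks/tox/main.yaml
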
17:41 <Alex_Gaynor> Ok, PR here https://github.com/pyca/cryptography/pull/5644 let's see if it helps!
17:41 <fungi> looks like ianw set up the initial jobs there so may have additional input once he's awake and around (should be nearly his monday morning now, though i don't recall if he was planning to be at the computer this week)
17:43 <Alex_Gaynor> Good news, the retries appear to work, at least 😬
17:44 <fungi> so, yeah, that ought to make the builds more robust (lather rinse repeat for other stuff you want going on in pre-run instead of run), but we also need to look into what's going on with that cloud
17:44 <fungi> we've got a static node in a separate tenant there so i ought to at least be able to tell whether their ipv4 routing on the whole is busted
17:44 <Alex_Gaynor> Yeah, I'm a bit concerned that what we're going to learn is that the failure rate on this clone is high enough that even 3 retries doesn't make it robust. But hopefully this at least helps.
17:45 <fungi> and yes, if all three retries fail, the build will report a "retry_limit" failure result
17:46 <Alex_Gaynor> fungi: can `pre-run` playbooks access things from `vars`?
17:46 <fungi> should be able to, yes
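
To illustrate fungi's answer: variables declared under a job's vars are available to pre-run, run, and post-run playbooks alike. A minimal sketch (the variable name is made up for illustration):

    - job:
        name: pyca-cryptography-base
        vars:
          wycheproof_ref: master    # usable as {{ wycheproof_ref }} in pre.yaml
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml
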
17:46 *** fdegir has quit IRC
17:46 <Alex_Gaynor> Great, thanks much!
17:47 <fungi> on saturday they noticed keepalived had stopped across all of their control plane, which killed the cloud apis so we weren't booting anything there. possible there's something else generally broken there at the moment too
17:47 *** fdegir has joined #opendev
17:49 <Alex_Gaynor> Ooof
17:51 <fungi> so the good news is that our mirror node there (which has a 1:1 ipv4 "floating ip" nat assigned) is able to clone from github
17:51 <fungi> so it's not total ipv4 routing failure there at least
17:51 <fungi> leading me to increasingly suspect the many:1 nat shared by the job nodes
17:52 <fungi> kevinz takes care of that environment, but isn't in here at the moment (it's also very early in his part of the world right now)
17:55 <fungi> our max-servers there is only 40, so in theory there are at most that many nodes sharing the same v4 nat, but at the moment utilization is low and it looks like you're probably the only one using nodes there so an overload of the nat table seems unlikely: https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1
17:57 <fungi> i'll e-mail kevinz now so he'll hopefully see it once he wakes up
18:00 <fungi> #status log e-mailed kevinz about apparent nat problem in linaro-us cloud, cc'd infra-root inbox
18:00 <openstackstatus> fungi: finished logging
18:02 <fungi> we could take that provider offline in our nodepool config for now, but it's the only one providing arm64 nodes (and those are all it provides) so any arm64 builds would just queue indefinitely until it's returned to service
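
Taking a provider out of rotation, as fungi describes, usually amounts to zeroing its quota in the nodepool launcher config; a rough sketch with illustrative names (not the actual opendev configuration):

    providers:
      - name: linaro-us
        pools:
          - name: main
            max-servers: 0   # stop booting new nodes; running ones simply drain away
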
18:03 <Alex_Gaynor> From our perspective that'd be strictly worse; things are succeeding at a high enough rate that we can get PRs passing with the retries.
18:06 *** fdegir has quit IRC
18:41 *** cgoncalves has quit IRC
19:40 <openstackgerrit> James E. Blair proposed openstack/project-config master: Add google/wycheproof to pyca Zuul tenant  https://review.opendev.org/c/openstack/project-config/+/766864
19:43 <corvus> Alex_Gaynor, fungi, ianw: i agree https://github.com/pyca/cryptography/pull/5644 should help (and is objectively the better way to write the job anyway). i think we can go a step further with https://review.opendev.org/c/openstack/project-config/+/766864 and actually have zuul do all the cloning ahead of time; that would reduce the amount of public internet traffic from the job, which may avoid retries
19:43 <corvus> (to be clear, that's step 1; if that lands, i'll write the step 2 change and propose it to pyca/cryptography)
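
The "step 2" change corvus describes would presumably let the pre-run playbook use the checkout zuul has already pushed onto the node rather than cloning over the network; a hedged sketch, not the actual follow-up change (the fact variable name is made up):

    - name: Point the tests at the zuul-prepared wycheproof checkout
      set_fact:
        # zuul.projects is keyed by canonical project name; src_dir is
        # relative to the job's work/home directory in opendev's base jobs.
        wycheproof_src: "{{ ansible_user_dir }}/{{ zuul.projects['github.com/google/wycheproof'].src_dir }}"
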
19:53 <Alex_Gaynor> I'm also doing https://github.com/pyca/cryptography/pull/5645
19:54 <corvus> Alex_Gaynor: ++ those are all great to have in a pre-run
21:04 <ianw> o/ i'll have to go through my notes but i have a vague feeling we might have seen ipv4 issues in that cloud before
21:06 <ianw> corvus: it seemed everyone was positive about the new zuul summary plugin repo, what's the next step?
21:29 *** slaweq has quit IRC
21:29 <openstackgerrit> Ian Wienand proposed opendev/system-config master: WIP: initalize gerrit in testing  https://review.opendev.org/c/opendev/system-config/+/765224
21:29 <openstackgerrit> Ian Wienand proposed opendev/system-config master: system-config-run-review: remove review-dev server  https://review.opendev.org/c/opendev/system-config/+/766867
21:44 *** cgoncalves has joined #opendev
22:50 *** DSpider has quit IRC
23:00 *** tkajinam has joined #opendev
23:46 *** prometheanfire has quit IRC
23:54 *** Green_Bird has joined #opendev
