Sunday, 2021-10-31

01:18 <corvus> i'm going to restart zuul now
01:18 <corvus> with just one scheduler
01:44 <Alex_Gaynor> Is zuul experiencing some sort of known outage ATM?
01:45 <corvus> Alex_Gaynor: i'm restarting it
01:45 <Alex_Gaynor> corvus: 👍
01:46 <Alex_Gaynor> How long does that typically take?
01:47 <fungi> up to now, something around 15-20 minutes, longer depending on what else we might be doing as part of the restart
01:47 <corvus> Alex_Gaynor: something like 20 minutes or so for a full restart
01:47 <fungi> soon though, restarts should go unnoticed
01:47 <corvus> (i think i'm going to have to clear state and start again)
01:48 <Alex_Gaynor> Got it. Thanks. (Is there a better place to follow along than here?)
01:48 <corvus> this is the place
01:48 <corvus> Alex_Gaynor: did you notice an issue before about 20m ago?
01:49 <corvus> (i'm wondering if this is prompted by the restart, or if there was another issue before the restart that i didn't notice)
01:51 <Alex_Gaynor> Looks like I started seeing issues about 30 minutes ago
01:51 <Alex_Gaynor> In the form of various errors loading https://zuul.opendev.org/t/pyca/status/
01:52 <corvus> okay, that's as expected then :)
01:52 <Alex_Gaynor> 👍
01:59 <corvus> i've cleared state and am starting the scheduler again
02:09 <ianw_pto> Alex_Gaynor: while talking about zuul and pyca things, let me know if you have any thoughts on https://github.com/pyca/pynacl/issues/601 for arm64 wheels for pynacl; i imagine it could be very similar to what we have
02:11 <ianw_pto> what prompted me to think about it again was recent work we were doing to upgrade our containers to bullseye; the buildx process cross-compiles the dependencies and pynacl was one of the more painful bits
02:11 <Alex_Gaynor> ianw_pto: 👍 I ping'd reaperhulk, pynacl makes me sad these days so I don't think about it much.
02:11 <ianw_pto> ok, can do, thanks :) don't want to make anyone sad
02:13 <Alex_Gaynor> Hehe, not remotely your fault. (Once upon a time pynacl was a library with misuse-resistant cryptography, and now it's mostly cryptography that's too hipster to be in openssl)
02:22 <corvus> apparently we're waiting on github rate limits again
02:23 <corvus> i think that may mean it could be a few hours before zuul is able to start
02:28 <corvus> ah, i think the timeout may have expired, it's moving along now
13:25 <Alex_Gaynor> Is zuul having a problem ATM? I'm seeing the events queue not going down
13:41 <fungi> Alex_Gaynor: i'm taking a look
13:41 <Alex_Gaynor> 🙇‍♂️
13:43 <fungi> it's at 0 for most tenants, but yes it looks like the pyca tenant is reporting an event queue length of 20 at the moment
13:43 <fungi> the vexxhost queue length is also 20
13:44 <fungi> event queue length
13:44 <fungi> and the zuul tenant's event queue is at 9
13:44 <fungi> all the others are at 0
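A minimal sketch of the per-tenant queue check being discussed, assuming Zuul's public /api/tenants endpoint reports a queue count for each tenant (the endpoint shape and field names here are assumptions, not a verified client for this deployment):

    import json
    import urllib.request

    ZUUL_API = "https://zuul.opendev.org/api/tenants"  # assumed endpoint

    with urllib.request.urlopen(ZUUL_API) as resp:
        tenants = json.load(resp)

    for tenant in tenants:
        # Report any tenant whose queue length is non-zero.
        if tenant.get("queue", 0) > 0:
            print(f"{tenant['name']}: queue length {tenant['queue']}")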
13:45 <fungi> i'll see if there's any clues in the scheduler logs
13:45 <Alex_Gaynor> FWIW the pyca one has been at 20 for many minutes, it's not just transitory
13:48 <fungi> the vexxhost tenant only has a gerrit source connection, so seems unlikely to be limited to github events
13:50 <fungi> the most recent event logged by the scheduler for the pyca tenant seems to be this one:
13:50 <fungi> 2021-10-31 13:17:38,143 DEBUG zuul.GithubConnection: [e: ea4f4296-3a4c-11ec-9cac-f345a58a0adc] Scheduling event from github: <GithubTriggerEvent 0x7fd7d80c0c70 pull_request pyca/cryptography refs/pull/6504/head status github.com/pyca/cryptography 6504,6f45c6d2e0978d1521718ed5e97eda6a4d97d763 delivery: ea4f4296-3a4c-11ec-9cac-f345a58a0adc>
13:51 <fungi> and no mention of that event id past that log entry
13:52 <fungi> so that does seem to support the impression that it's still hanging out in the event queue
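A rough sketch of the log check described above: counting how many scheduler debug log lines mention the event/delivery id, to confirm it never progressed past the "Scheduling event" entry. The log path is an assumption about this deployment:

    from pathlib import Path

    EVENT_ID = "ea4f4296-3a4c-11ec-9cac-f345a58a0adc"
    LOG_PATH = Path("/var/log/zuul/debug.log")  # assumed scheduler debug log location

    with LOG_PATH.open(errors="replace") as log:
        matches = [line.rstrip() for line in log if EVENT_ID in line]

    print(f"{len(matches)} line(s) mention event {EVENT_ID}")
    for line in matches:
        print(line)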
13:53 <fungi> the most recent builds to start in that tenant were at 09:04:37 utc, so it was clearly processing the event queue at least up to that time
13:54 <fungi> implying it wasn't immediately stuck when the scheduler was restarted
13:58 <fungi> looks like the vexxhost tenant processed a potential trigger event as recently as 10:09:06 utc
14:10 <corvus> fungi: i see the same change cache error. i think we should revert to 4.10.4
14:11 <fungi> okay, i saw it too, but it's not the only exception, so i was trying to perform some rudimentary statistical analysis to see which ones were more common before and after updating
14:11 <corvus> that one will stop queue processing at least
14:12 <fungi> AttributeError: 'NoneType' object has no attribute 'cache_key'
14:12 <fungi> that one?
14:12 <corvus> yep
14:12 <fungi> yeah, 5300 of those today since the debug log rotated
14:12 <fungi> 3135 in yesterday's log
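A small sketch of the "rudimentary statistical analysis" mentioned above: counting occurrences of the traceback line in the current and previous day's debug logs. The file names and rotation scheme are assumptions:

    from pathlib import Path

    NEEDLE = "AttributeError: 'NoneType' object has no attribute 'cache_key'"
    LOGS = [
        Path("/var/log/zuul/debug.log"),    # current log (assumed path)
        Path("/var/log/zuul/debug.log.1"),  # yesterday's rotated log (assumed name)
    ]

    for log in LOGS:
        if not log.exists():
            continue
        with log.open(errors="replace") as fh:
            count = sum(1 for line in fh if NEEDLE in line)
        print(f"{log}: {count} occurrence(s)")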
14:14 <fungi> what was the trick you did for the mass downgrade last time? ansible playbook to locally tag 4.10.4 as "latest" on all the servers?
14:14 <corvus> i'll work on a manual revert
14:14 <corvus> yep
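For context, a minimal sketch of the re-tag trick being discussed: pull the pinned release and point each host's local "latest" tag at it, so the next container restart runs the older version. In practice this was driven by an Ansible playbook across all servers; the image names and this standalone form are assumptions for illustration only:

    import subprocess

    PINNED = "4.10.4"
    IMAGES = [
        "zuul/zuul-scheduler",  # image names are assumptions for illustration
        "zuul/zuul-web",
        "zuul/zuul-executor",
        "zuul/zuul-merger",
    ]

    for image in IMAGES:
        # Pull the pinned release and re-point the local "latest" tag at it.
        subprocess.run(["docker", "pull", f"{image}:{PINNED}"], check=True)
        subprocess.run(["docker", "tag", f"{image}:{PINNED}", f"{image}:latest"], check=True)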
14:17 <corvus> stopping zuul
14:17 <fungi> confirmed, before yesterday we did not seem to log that exception
14:18 <fungi> the other one i see starting yesterday and continuing today is...
14:18 <fungi> AttributeError: 'str' object has no attribute 'change'
14:18 <corvus> i'm deleting the zk state
14:18 <fungi> from the rpc listener
14:19 <fungi> though that may have been related to manual reenqueuing of changes
14:20 <corvus> starting zuul
14:21 <corvus> https://zuul.opendev.org/api/components looks like good versions
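A brief sketch of the version check mentioned above, assuming the /api/components endpoint returns a mapping of component kind to entries carrying hostname, state, and version fields (the field names are assumptions):

    import json
    import urllib.request

    URL = "https://zuul.opendev.org/api/components"

    with urllib.request.urlopen(URL) as resp:
        components = json.load(resp)

    # Print each registered component and the version it reports.
    for kind, entries in components.items():
        for entry in entries:
            print(f"{kind}: {entry.get('hostname', '?')} is {entry.get('state', '?')} "
                  f"at version {entry.get('version', '?')}")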
14:21 <fungi> nevermind, those exceptions don't look like they were clustered at times where reenqueuing was underway
14:23 <Alex_Gaynor> Question: Does the restart mean the event queue was lost, or will those jobs still happen?
14:24 <fungi> since the state in zookeeper was cleared, the queued trigger events will be lost i think
14:24 <corvus> right (though items already processed and running jobs are saved, but there weren't any in pyca)
14:25 <fungi> note these upgrades/restarts are working toward persistent state for purposes of being able to run multiple schedulers, so restarts in the (hopefully near) future will be hitless
14:26 <corvus> we're almost there. unfortunately this was a bug in our state persistence :/
14:37 <fungi> i need to run out on a couple of quick errands, but should return within the hour hopefully
14:37 <corvus> fungi: i'll be leaving for the day soon
14:38 <fungi> thanks, i'll keep an eye on things once i'm back, but we ran smoothly enough on that release so i don't expect it will give us trouble
14:38 <corvus> ++
14:41 <corvus> it's back up; i'm re-enqueuing items
14:41 <corvus> Alex_Gaynor: and i see something running in pyca
14:42 <Alex_Gaynor> corvus: Yeah, I kicked the job
14:42 <corvus> re-enqueue is done
14:43 <corvus> #status log restarted zuul on 4.10.4 due to bugs in master
14:43 <opendevstatus> corvus: finished logging
15:49 <fungi> i'm around again and will keep an eye on irc/mailing lists in case anyone notices something still awry
