Friday, 2023-12-01

05:12 <ianw> tonyb: https://paste.opendev.org/show/bJeCVfdS9SUfEmRdWpun/ is the failure for 902008 ... looks like cacti failed trying to build the certcheck domains for mirror03.gra1.ovh.opendev.org
05:15 <ianw> it's worked in the next run
05:19 <ianw> oh i think i see it
05:19 <ianw> TASK [letsencrypt-request-certs : Run acme.sh driver for certificate issue] ****
05:19 <ianw> fatal: [mirror01.iad.rax.opendev.org]: FAILED! => {
05:19 <ianw>     "changed": false,
05:19 <ianw>     "rc": -13
05:19 <ianw> }
05:20 <ianw> it looks like that failed.  that aborted the letsencrypt run on mirror01, so it never got to the next step, where it populates the hostvar with the certificates to put into the certcheck list
05:21 <ianw> so it was looping around making that list -- mirror03.gra1.ovh.opendev.org was the one *before* this ... it asked for hostvars[mirror01.iad...]['letsencrypt_certcheck_domains'] which didn't exist because letsencrypt-request-certs bailed out before it got set
05:23 <ianw> why acme.sh failed -13 i don't know :)
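
(For context, the certcheck list is built by looping over the letsencrypt hosts and reading a hostvar each of them is expected to have set; a rough sketch of the pattern, with placeholder variable and group names rather than the exact system-config code:)

    # Illustrative sketch only -- names are placeholders, not the real role.
    - name: Build the certcheck domain list from every letsencrypt host
      set_fact:
        certcheck_domains: "{{ certcheck_domains | default([]) + hostvars[item]['letsencrypt_certcheck_domains'] }}"
      loop: "{{ groups['letsencrypt'] }}"
    # If letsencrypt-request-certs aborted on one host (the rc -13 on mirror01
    # above), that host never set letsencrypt_certcheck_domains, so the
    # hostvars lookup for that item is undefined -- even though the last item
    # Ansible printed was the previous, successful host (mirror03).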
06:24 <tonyb> ianw: thanks.  that makes sense; the fact the error looks like mirror03 rather than mirror01.iad is a little confusing.
06:25 <tonyb> I'll grab acme.sh tomorrow and see what exits with -13
06:25 <tonyb> .... where I guess tomorrow is actually later today
07:50 <ykarel> frickler, fungi: i have reported https://bugs.launchpad.net/neutron/+bug/2045383 to track that xena job issue
10:24 <opendevreview> Steve Brasier proposed openstack/diskimage-builder master: allow passing arbitrary build options in containerfile element  https://review.opendev.org/c/openstack/diskimage-builder/+/902366
13:51 *** ykarel is now known as ykarel|away
14:51 <fungi> ykarel|away: it should be fixed when zuul restarts tomorrow, since https://review.opendev.org/902316 has already merged
16:08 <clarkb> Looks like my message about that ansible thing didn't make it to irc
16:08 <clarkb> tonyb: ianw: https://review.opendev.org/c/opendev/system-config/+/898475 is related to the certcheck domains list building issue
16:09 <clarkb> I thought it was an ansible bug that may have been fixed by ansible 8 because it seemed to go away after we upgraded, but I guess not
16:09 <clarkb> we can resurrect that change if we think the extra debug info would still be useful
16:09 <clarkb> tonyb: frickler: thoughts on proceeding with https://review.opendev.org/c/opendev/system-config/+/902315/ and its child to rotate the replication key?
16:11 <clarkb> I'm pretty sure I can be around for a gerrit restart at some point today if we want to try and land both of them (we should ensure the gitea one lands and applies first though, just to avoid any errors there)
16:11 <fungi> same
16:35 <clarkb> the new/replacement/fixed gerrit 3.9.x release should be out sometime today
16:35 <clarkb> I'm hoping it will be based on rc6 this time around; we already ran tests against that rc and they were good
16:41 <clarkb> fungi: half wondering if we should proceed at least with the gitea update since the risk for that one is lower? (though technically if I got the key cleanup code wrong it could remove the key we are relying on, I suppose. This seems unlikely because it didn't remove the key yet)
16:47 <tonyb> clarkb: reviewing 902315 now
16:48 <clarkb> thanks!
17:03 <tonyb> clarkb: I'm happy with both of those.  I agree we should land the gitea one first; pause/check and then land the gerrit one
17:03 <tonyb> I'm going to be in and out a bit today
17:06 <tonyb> clarkb: I think restoring https://review.opendev.org/c/opendev/system-config/+/898475 is a good idea.
17:23 <clarkb> tonyb: thanks, I approved the gitea change and restored the ansible debugging change
17:26 <tonyb> Awesome
17:27 <tonyb> doing a little more digging.  the -13 that ianw logged is coming from ansible, NOT acme.sh.  the shell module had an error
17:27 <tonyb> MODULE FAILURE
17:27 <tonyb> See stdout/stderr for the exact error
17:28 <tonyb> looking at the logs on mirror01.iad, indeed LE never ran there
17:30 <tonyb> so STDOUT would go to the LE log that I've been looking at.  do we lose STDERR?
17:39 <clarkb> tonyb: this is why I suspected an ansible bug
17:40 <clarkb> basically nothing ansible is doing there makes any sense. They don't log the iteration value, they don't log any outputs, it's just a -13 error
17:40 <clarkb> my hunch is that we aren't loading the host vars for the other hosts properly due to some weird logic bug
17:41 <clarkb> tonyb: you might want to look at the periodic run from overnight to see if it managed to succeed
17:41 <clarkb> part of the issue is we don't know what host is actually causing the error because it doesn't log failed loop iterations, only successful ones (this definitely seems like an ansible bug in my opinion)
17:44 <tonyb> I think in this case, specifically for the building of the certcheck list, I'm pretty confident that we do know, because earlier mirror01.iad failed with -13 (which means that it didn't run driver.sh on that node, and therefore also didn't create the hostvar), which caused the ultimate failure
17:45 <clarkb> oh I see, so this is a variant of the case I ran into I think. Unless I had simply missed that there was an earlier failure
17:45 <tonyb> in run-production-playbook we do 'ansible-playbook ..... > log'
17:45 <clarkb> in the situation I ran into I thought the only error was in the iteration itself, not previously, but it is entirely possible I simply missed that
17:45 <clarkb> still, ansible should log the iteration item
17:45 <tonyb> where would stderr from that command go?
17:46 <tonyb> Yes, I agree totally.  I think the change you restored is a good thing regardless
17:46 <clarkb> it should be in the job log
17:47 <clarkb> https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/log/job-output.txt#139
17:47 <tonyb> Okay.  I was hoping to see more.
17:47 <clarkb> there is no stderr there or at https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/console#2/1/2/bridge01.opendev.org
17:48 <clarkb> we may need more -v's to get that info, but we probably also want to redirect stderr to stdout if we do that, since that may expose lots of inner workings we don't want out in the clear
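
(A rough sketch of the kind of redirect being discussed -- the actual run-production-playbook wrapper and log paths may differ:)

    # Illustrative only; the playbook and log paths are placeholders.
    # Keep verbose output out of the public job log by sending stderr to the
    # same private log on bridge that already captures stdout.
    ansible-playbook -v playbooks/service-letsencrypt.yaml \
        >> /var/log/ansible/service-letsencrypt.log 2>&1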
17:49 <clarkb> tonyb: and on the host do you have an acme.sh log? I guess the -13 implies that ansible broke and didn't run the shell at all though
17:49 <clarkb> since acme.sh should return 0 1 or 2 iirc
17:50 <tonyb> Hmm okay.
17:50 <tonyb> Correct, acme.sh didn't run at that time.
17:50 <tonyb> it has since and worked just fine
17:51 <tonyb> https://github.com/ansible/ansible/issues/57323#issuecomment-498732320 is about the file module but seems to say that the shell module got a SIGPIPE before running on mirror01.iad
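
(The -13 itself is just how a process killed by a signal shows up through Python's subprocess layer, which Ansible uses to run ssh and modules: the return code is the negative signal number, and signal 13 is SIGPIPE. A small illustration:)

    import signal
    import subprocess

    # subprocess reports "terminated by signal N" as returncode -N, which is
    # where an Ansible task rc of -13 comes from: the process died on SIGPIPE.
    proc = subprocess.Popen(["sleep", "60"])
    proc.send_signal(signal.SIGPIPE)
    proc.wait()
    print(proc.returncode)        # -13
    print(int(signal.SIGPIPE))    # 13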
17:54 <clarkb> https://github.com/ansible/ansible/issues/81777
17:56 <clarkb> I'm betting it is this issue
17:56 <tonyb> Yup
17:58 <tonyb> So I think this is something to watch for, but there isn't anything we can do right now.
17:59 <clarkb> agreed. I think the one thing we might be able to do is increase the 60s control persist timeout to extend the amount of time things can be idle? That doesn't seem like a great workaround though, a minute is an awfully long time
17:59 <corvus> do i understand correctly that this bug would be triggered if the time between tasks on a given host is 60 seconds (ie, the control persist timeout)?  and that this does not happen when tasks are running?
18:00 <clarkb> corvus: that was my reading of it. basically if you run a task and then try to run a task exactly 60 seconds later, the new task execution races with connection persistence in ssh
18:01 <clarkb> my hunch is that statistically the delta between task executions is most likely in the 0-5 second range with a long, long tail that extends out beyond there
18:02 <clarkb> if we set the persistence timeout to 120s instead of 60s we're probably several orders of magnitude less likely to hit this race?
18:02 <corvus> yeah, agreed.  so the question is whether it's worth it to bump that to, say, 120 for the nested ansible running on bridge.
18:02 <corvus> heh, we picked the same number.  :)
18:03 <corvus> yeah, i sort of lean towards that since it's been characterized in our environment that 60s intervals are apparently common enough to be occasionally noticed.
18:03 <clarkb> I'd be fine with that. I don't know that I'd go much beyond, say, 5 minutes because we could end up with a lot of useless ssh processes that way. but 120s is probably safe
18:04 <corvus> (and to be clear, i think this is an okay band-aid for nested ansible on bridge but my feelings about the similar setting in zuul are very different)
18:04 <tonyb> I agree.
18:05 <tonyb> Yeah, very different for zuul itself
18:08 <clarkb> I think 60s might be ansible's default? I don't see where we set that explicitly today
18:08 <fungi> i'm going to go grab lunch real quick, but available to help with gitea/gerrit in about an hour
18:11 <corvus> i don't see it being set either.  but zuul sets it in ansible.cfg with ssh_args = -o ControlMaster=auto -o ControlPersist=60s
18:23 <clarkb> https://docs.ansible.com/ansible/latest/network/getting_started/network_connection_options.html seems to be the more ansible way?
18:24 <clarkb> unless that is just a regular timeout unrelated to ssh persistence
18:25 <clarkb> ya I think that is unrelated to control persistence, just a happy google collision
18:29 <opendevreview> Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s  https://review.opendev.org/c/opendev/system-config/+/902436
18:29 <clarkb> I believe that should be self testing
18:29 <clarkb> and it is easy to revert (by hand if we have to) if it goes wrong
18:31 <corvus> clarkb: want to put the ansible bug in a code comment?  that could be helpful for unwinding it
18:31 <clarkb> sure, one sec
18:32 <opendevreview> Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s  https://review.opendev.org/c/opendev/system-config/+/902436
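
(The change amounts to roughly the following in the ansible.cfg used by the nested Ansible on bridge -- a sketch of the idea, not the exact patch in 902436:)

    # Sketch only; the real file and surrounding options may differ.
    [ssh_connection]
    # Bump ControlPersist from Ansible's 60s default to widen the window before
    # an idle master connection tears itself down. This is a band-aid for
    # https://github.com/ansible/ansible/issues/81777, where a task starting
    # right as the 60s timer expires races the teardown, gets SIGPIPE, and the
    # module fails with rc -13.
    ssh_args = -o ControlMaster=auto -o ControlPersist=120s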
18:55 <opendevreview> Merged opendev/system-config master: Rotate the new Gitea replication key into Gitea config  https://review.opendev.org/c/opendev/system-config/+/902315
18:56 <clarkb> I've got a tail -F going on /var/log/ansible/service-gitea.log so I should see when that adds the second key
19:17 <clarkb> I'm pretty sure the key was properly added to gitea09 based on the logs
19:17 <clarkb> we no_log a bunch of stuff so I'm going by "ok" and "skipping" output
19:18 <clarkb> assuming the job succeeds I think we can proceed with approving the gerrit side change
19:25 <clarkb> gitea14 also looks good. not sure we need to check all of them; if those two are fine, the chance something went wrong on another and the job still passed is low
19:25 <clarkb> I'll let someone else approve the gerrit change if they agree and think we are ready. I'm going to figure out breakfast/lunch things nowish
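
(The no_log mentioned above is Ansible's per-task output suppression, which is why the service-gitea log only shows ok/changed/skipping status rather than the key material; roughly the pattern looks like this -- the task name, endpoint and variables are placeholders, not the actual gitea role:)

    # Illustrative only -- not the real task from the gitea role.
    - name: Add the replication SSH key to gitea
      uri:
        url: "{{ gitea_url }}/api/v1/admin/users/{{ gitea_user }}/keys"
        method: POST
        body_format: json
        body:
          title: gerrit-replication
          key: "{{ replication_ssh_public_key }}"
      no_log: true   # hide the request/response (and the key) from logs; only
                     # the task's ok/changed/skipping status gets printed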
19:36 <fungi> i'm looking now
19:36 <fungi> back for the remainder of the day, so we can take this as slowly as we like
19:42 <tonyb> clarkb: 902436 looks good.  holding off +W until the gitea/Gerrit key rotation is done
19:46 *** dviroel__ is now known as dviroel
19:52 <fungi> deploy succeeded about 30 minutes ago (19:23:31)
19:53 <fungi> for 902315 i mean
19:54 <clarkb> and looks like the gerrit update has been approved
19:54 <fungi> i went ahead and approved 902169, so once it deploys successfully we'll arrange a quick gerrit restart and make sure replication is still working
19:55 <tonyb> I was thinking of 902169 and the associated Gerrit restart
19:55 <fungi> also probably worth triggering a full re-replication once we're sure it is
19:55 <clarkb> ++ also I think we can temporarily move id_rsa aside when we restart gerrit to ensure the new key is being used. I think that is the easiest way. If it doesn't get used we'll need to move it back and restart gerrit again but that is better now than later I suspect
19:55 <fungi> since we keep saying we want to do that and then i keep forgetting to actually do it
19:56 <clarkb> as I'm not sure there is any logging on either side that would be clearly definitive about which key is used
19:56 <fungi> a very good point
19:56 <fungi> at least if gerrit no longer has access to the old key, we'll know
19:57 <tonyb> yeah.  we could ssh -vv but that isn't guaranteed to be the same
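
(A rough outline of the verification being discussed; the key path, compose directory and admin account are assumptions about the production layout, and "replication start" is the replication plugin's standard ssh command for kicking off a full re-sync:)

    # Sketch only -- paths and account names are placeholders.
    sudo mv ~gerrit2/.ssh/id_rsa ~gerrit2/.ssh/id_rsa.old   # take the old key out of play
    cd /path/to/gerrit/compose
    sudo docker-compose down && sudo docker-compose up -d   # restart gerrit
    # once replication to the giteas is confirmed working with the new key,
    # trigger a full re-replication:
    ssh -p 29418 <admin-account>@review.opendev.org replication start --all --wait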
19:58 <clarkb> the issue is gerrit uses the mina ssh client
19:58 <clarkb> and its logging is in the depths of java logging somewhere. We probably can figure out a way to get debug logs out of it but I'm not confident even that would identify the specific key
19:58 <fungi> i blame java
19:59 <fungi> could probably even blame log4j
20:55 <fungi> looks like system-config-run-review-3.8 encountered some testinfra failures
20:56 <fungi> the thread plugin-manager-preloader exceptions in error_log are expected, right?
20:56 <Clark[m]> Yes
20:57 <fungi> oh, looks like maybe the sshd on the test node started closing connections
20:57 <fungi> "Connection closed by remote host\r\nConnection closed by 104.130.74.118 port 22"
20:57 <fungi> https://c495624553d21df6092e-b1e6fa2ff0876057a2da9deb6ccb589b.ssl.cf5.rackcdn.com/902169/3/gate/system-config-run-review-3.8/63d072e/bridge99.opendev.org/test-results.html?sort=result
20:59 <fungi> nothing related jumping out at me in syslog though
21:00 <Clark[m]> Could it be the arp fight we've seen?
21:03 <fungi> possibly, though it's just connection closed during key exchange rather than a hostkey mismatch
21:06 <Clark[m]> Maybe recheck and see if it persists? The ssh config I wrote out should only affect the gerrit user, not zuul and testinfra
21:07 <Clark[m]> And even then it's scoped to gitea* which aren't in play for testinfra on review02
21:07 <tonyb> yeah, seems unlikely the change is responsible
21:09 <fungi> yeah, i agree it's more likely the rogue vm problem
21:13 <fungi> once zuul reports on the change in a sec i'll recheck it
21:13 <fungi> looks like the last job is wrapping up now
21:20 <fungi> and rechecked now
22:02 <opendevreview> Clark Boylan proposed opendev/system-config master: Add gerrit 3.9 image builds  https://review.opendev.org/c/opendev/system-config/+/901468
22:02 <opendevreview> Clark Boylan proposed opendev/system-config master: Add gerrit 3.8 to 3.9 upgrade testing  https://review.opendev.org/c/opendev/system-config/+/901469
22:03 <clarkb> infra-root: Gerrit 3.9.1 has been released to address the problems with the pulled 3.9.0 release. I expect the above changes will pass testing and they should be mergeable as-is now
22:03 <clarkb> note that since this doesn't affect production (only our testing of future upgrades) we don't need any special monitoring either
22:55 <clarkb> looks like we are an hour away from the gerrit ssh config update landing
22:55 <clarkb> I'll still be around and can do a restart if others are
22:57 <tonyb> I won't be around at that point (not that I'm really around now)
23:19 <fungi> i'll be around for a while still, sure
23:28 <clarkb> fungi: it's even later where you are :) maybe we should just do it first thing next week
23:29 <clarkb> up to you :)
23:34 <clarkb> the gerrit 3.9.1 changes did pass, so that's good
