Tuesday, 2023-10-17

tkajinamclarkb corvus, fyi: I tried recheck and it works now.01:39
*** ramishra_ is now known as ramishra04:37
fungisorry, left my phone behind when heading out to the conference yesterday and was too exhausted to check in once i got back to the rental last night. i may not be able to catch up on all the scrollback, but let me know if there's still anything urgent needing my attention13:30
fungiran into pleia2 and olaph here so far. pabelanger is apparently around here somewhere too, trying to track him down14:17
Clark[m]fungi: mostly just struggles getting the mm3 exim update in place but that has happened.14:30
Clark[m]I'm going to follow up on some of the container updates/cleanups today and try to upgrade zk to 3.8 as well cc corvus14:31
NeilHanlonfungi: really wish I could've made it to ATO this year :\ hope you're having a good time!14:31
fungiClark[m]: cool, thanks for that!14:32
fungiNeilHanlon: it's great, but also changed a lot since the very first one, which was the only other time i was able to make it14:33
fungithey started running it while i was still living in raleigh, which was super convenient14:34
NeilHanlonthat does sound very convenient heh14:37
fungiyeah, now it's a ~4hr drive from the beach for me, not terrible but does still require more planning14:50
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/898475 and its parent are the main things I tripped over yesterday trying to get the exim update deployed15:10
clarkbfungi: landing the parent or something like it is probably the most important fix since it will prevent us from landing some updates to system-config https://review.opendev.org/c/opendev/system-config/+/898502/215:10
clarkbcorvus: zk05 is our zk leader. I think the rough plan is: put zk04,05,06 in the emergency file, manually edit the docker-compose.yaml on zk04 to use the :3.8 label, docker-compose pull, docker-compose down, docker-compose up -d. Check that the node is a follower again. Repeat on zk06, then zk05 and check we have a new leader. Finally land15:17
clarkbhttps://review.opendev.org/c/opendev/system-config/+/897985 and take the nodes out of the emergency file15:17
corvusclarkb: sounds good15:19
clarkbcorvus: I'm good to start that now if you think now is a good time for it15:20
clarkbzk nodes are in the emergency file15:27
clarkbthe tripleo jobs are hitting retry limits due to the galaxy api updates...15:30
corvusclarkb: sounds good; i'm around15:32
clarkbok I'll proceed with zk04 now15:33
clarkbzk04 is now 3.8.315:35
clarkbas far as I can tell things are still working15:36
corvusi'm just reviewing the log now15:36
clarkback let me know when you are happy for me to do zk0615:36
clarkbgrafana graphs look good though zk04 is very idle (I think this is normal as zk06 was in a previous position)15:37
clarkber was in that position previously15:37
corvuslooks okay.  looks like it took a few attempts to get re-synced, but it seems happy now15:40
clarkbalright I'll proceed with zk06 now15:40
corvusthat one recovered more quickly15:42
clarkball of the active connections appear to have ended up on the leader (none went to zk04)15:42
corvusthat is not great15:42
corvusi think it might be worth restarting some mergers or something to see if they connect to 4 or 615:43
corvusi'm not comfortable stopping 5 without knowing more15:43
clarkbok15:43
clarkbI'll start with zm01 and work my way up from there. We can do graceful stops then restart15:44
corvussounds good15:44
clarkbI believe zm01 connected to zk0415:45
clarkbbut I'm happy to do a couple more since it is low impact and helps build confidence15:46
clarkbI did `sudo docker-compose exec -T merger zuul-merger stop` then `sudo docker-compose start merger` fwiw15:46
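One way to confirm which server a restarted merger landed on, assuming the cons four letter word is allowed on the zk hosts (it may not be):

    # run on each zk host; lists the client sessions connected to that server
    echo cons | nc localhost 2181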
corvusyeah... also i just noticed something in the logs15:46
corvus2023-10-17 15:45:35,043 [myid:]15:46
corvusthere's no "id" on 4 and 615:46
corvus2023-10-17 15:45:39,074 [myid:5]15:46
clarkbhrm15:47
corvuscompare to 5 ^15:47
clarkbthat comes out of the config file or should iirc15:47
clarkbbut let's figure that out before restarting more mergers15:47
corvusit's in /var/zookeeper/data/myid15:49
clarkbcorvus: which does show up in the container as well and contains 4 on zk04. So maybe we're not putting it where the new containers expect it15:50
clarkbit is just /data/myid within the container15:50
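A quick comparison of the host and in-container copies of the id file (the compose service name zk is an assumption):

    cat /var/zookeeper/data/myid                 # on the host
    sudo docker-compose exec zk cat /data/myid   # inside the container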
opendevreviewMerged opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test  https://review.opendev.org/c/opendev/system-config/+/89850215:52
clarkbhttps://github.com/31z4/zookeeper-docker/blob/master/3.8.3/docker-entrypoint.sh#L45-L48 that's where the upstream image manages the id, but only if we don't set it, and it sets it to 1 (ours is still 4, implying that didn't fire)15:53
clarkbdataDir=/data is set in the config too so we're pointing at the directory containing the file at least15:53
corvuswhat 3.7 version are we running?15:53
clarkbcorvus: 3.7.215:54
clarkb`echo stat | nc localhost 2181` shows the full version and build info on each host15:54
corvusthey migrated logging frameworks between those versions15:55
corvuslog4j->logback15:55
corvusmaybe they missed something15:55
clarkbya I've been looking for a four letter command that will report the myid value back to us independent of logging and haven't found one yet but I think that is the next thing to sort out15:56
clarkbmaybe logging is broken15:56
corvusoh interesting, in my local test container (3.8.1) i have some lines with myid:1 and some are myid:15:56
corvusyeah and i see similar on zk0615:57
corvusif we go back a ways in the log, there are some myid:6 entries15:57
clarkbphew15:57
clarkbfwiw I think `conf` might be the command we need but it isn't in our list of allowed four letter commands. But I'm much happier with your report that some log lines include the correct value15:58
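If conf were wanted for checks like this, it would have to be added to the four letter word whitelist in zoo.cfg; a hypothetical example (our actual whitelist likely differs):

    # zoo.cfg
    4lw.commands.whitelist=ruok,stat,srvr,mntr,conf

    # this should then dump the running config, which should include the server id
    echo conf | nc localhost 2181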
* clarkb makes a note to test for myid values in the server log in our ci job15:58
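That check amounts to something like the following grep, assuming the server logs go to the container's stdout and the compose service is named zk (adjust if the logs land in a file under the data dir instead):

    sudo docker-compose logs zk | grep -E '\[myid:[0-9]+\]'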
corvusyeah, i think we can call this a red herring, and just proceed with merger restarts and observe the connection distribution15:58
clarkbok proceed with zm02 now15:58
clarkber I'm proceeding with15:59
corvus(this is a behavior change; because zk05 is reporting the mntr log entries with myid:5 and the others are not)15:59
clarkbzm02 appears to have reconnected to zk0516:00
clarkbI'm going to stop and start it again and see if we can get it to connect elsewhere16:01
corvus++16:02
clarkbnow it is connected to zk06 I think16:02
* clarkb continues with the rest of the mergers since this is easy16:02
corvusagreed16:02
clarkbzm03 is now attached to zk0616:04
corvusi verified that /var/lib/zuul/zuul-keys-backup.json is current, btw.16:04
clarkbthanks16:04
clarkbzm04 is connected to zk06 as well16:05
corvuslikewise /var/log/nodepool/nodepool-image-backup.json on nb01 is relatively current (from yesterday)16:05
clarkbzm05 connected to zk0416:06
clarkbzm06 connected to zk0616:07
clarkbzm07 connected to zk05 so I'm redoing it16:07
corvusi also note that the number of watches has grown considerably; i don't know what to make of that.16:08
clarkbzm07 really likes zk05... I'll skip it and go to zm0816:08
clarkbzm08 connected to zk0416:10
corvusokay i reckon we restart zk05 now?16:10
clarkbcorvus: ya I think so. zm07 is still connected to it but the other 7 mergers connected to a different one and seem to be working?16:11
clarkbdo we want to check the operating logs of a merger first?16:11
corvusi think i saw zm01 run jobs; but let's double check16:11
clarkbzm01 did work at 16:07 which is after I restarted it at 15:45 or so16:12
clarkba refstate job16:12
corvusyeah it's run a lot of jobs since the restart.  i think we're good16:12
clarkbok I'll proceed with the upgrade of zk05 (the leader) from 3.7.2 to 3.8.3 now16:12
corvus++16:13
clarkbzk06 is the new leader16:13
clarkbzk05 reports it is a follower16:14
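The leader/follower states reported above can also be read from each node's mntr output, e.g.:

    echo mntr | nc localhost 2181 | grep zk_server_state   # prints leader or follower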
corvuszm07 reconnected to something and is happy16:14
corvusthings look reasonable to me16:18
clarkbcool16:18
clarkbcorvus: want to approve https://review.opendev.org/c/opendev/system-config/+/897985 so that our config matches the new reality? I can pull the nodes out of the emergency file once that lands16:19
corvusit looks like we're running more builds than before we started, so increased activity may explain the increase in watches16:19
clarkbah16:19
corvus+316:19
clarkbthanks! I'll look into updating our test for zk deployment to check for the myid value in the logs as that seems like a good check16:20
corvussounds cool16:20
corvusthanks for driving this!  i'll check back a bit later and see if the graphs still look good16:21
clarkb#status log Upgraded our Zookeeper cluster to Zookeeper 3.8.316:21
opendevstatusclarkb: finished logging16:21
clarkbcorvus: and thank you for being an extra set of eyeballs. I always find that helpful, as a different perspective tends to find more things to be cautious with16:22
fungithanks for working on that! sorry i've16:25
fungibeen out of touch16:25
clarkbfungi: it just occurred to me you are in raleigh, maybe you want to do a pitstop at RH HQ and fix their mailservers for them >_<16:31
opendevreviewClark Boylan proposed opendev/system-config master: Add zk test to check myid is set in service  https://review.opendev.org/c/opendev/system-config/+/89861416:32
JayFJust wander near the IBM campus in RTP with a suit on and say you're looking for a solution, they'll let you right in ;)16:32
clarkbI think ^ that will ensure we've got myid showing up in the logs properly16:32
*** ralonsoh is now known as ralonsoh_ooo16:33
clarkbfungi: if you get a chance, another less urgent but easy review is https://review.opendev.org/c/opendev/system-config/+/898479 - want to get that in before we start removing older container image builds16:35
fungiclarkb: JayF: yeah, the rh building is just a few blocks from here16:41
clarkbfungi: print out a copy of the dns rfc section that covers ttls :)16:41
clarkbfwiw it's getting annoying because rh people are replying to the list and other thread members. Then thread members who aren't at rh reply and we get half the email chain16:42
clarkbAnd it isn't any of the involved rh people's fault but their email systems seem to be sad16:42
fungithe other possibility is that rackspace's dns servers are sometimes returning old records, i guess?16:44
clarkbfungi: that seems unlikely given that only rh seems affected so far?16:45
clarkbbut maybe cdns or anycast are involved16:45
JayFI would believe that is possible. 16:45
JayFBut any personal experiences I have informing that are years old.16:45
fungii checked both authoritative nameservers are returning correct addresses at least16:47
clarkbfrom my home all three of the major dns forwarders (google, cloudflare, and quad 9) return the correct record. As do dns1 and dns2 at stabletransit16:48
clarkbfungi: one thing I'll note is that you used an A record instead of a CNAME, probably because an A record serves as the fallback when no MX record exists. Maybe we want explicit MX records?16:49
clarkbI think what you did is correct, but maybe whatever resolver/mail server out there is having trouble isn't happy with that16:49
clarkbalso we'll want to bump the ttl up to an hour at some point16:51
clarkbmaybe after this issue is resolved though in case we have to make changes16:51
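For reference, the sort of lookups being described, with lists.example.org standing in for the actual list host:

    dig +noall +answer MX lists.example.org   # explicit MX records, if any exist
    dig +noall +answer A  lists.example.org   # fallback A record, showing its current TTL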
clarkbcorvus: there is a spike in event processing time. Possibly related to openstack deleting EOL branches?17:03
clarkbya I see zuul02 handling a bunch of ref updated events for stable/stein17:06
corvusclarkb: yeah, from what i saw yesterday, the release jobs are trickling a bunch of branch creates (so today, deletes?) which puts the openstack tenant in more or less a continual reconfiguration loop.  the events get deduplicated, but if by the time it finishes a reconfig, there's another batch of events to trigger another one, then it starts again17:07
corvusthat behavior would cause event processing delays17:07
clarkbcool just making sure we're comfortable with it. And that seems to match what I see in the logs17:07
corvus(the faster that the release team/jobs can issue branch ops, so they are more closely clustered in time, the better)17:07
clarkbelodilles: ^ fyi17:08
opendevreviewMerged opendev/system-config master: Bump zookeeper from 3.7 to 3.8  https://review.opendev.org/c/opendev/system-config/+/89798517:10
clarkbI'm going to remove zk04,05, and 06 from the emergency file as soon as the hourly run for zuul finishes. This way we get the deploy run for ^ applying and we can check it all looks good after17:30
clarkbemergency file is updated. The zookeeper job for 897985 will run and should noop due to matching configs, not due to skipping hosts17:35
clarkbzk04 is done being "updated" and it nooped as expected17:41
clarkband the other two look good as well.17:43
opendevreviewClark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge  https://review.opendev.org/c/opendev/system-config/+/89850521:18
clarkbok that should run many jobs. Maybe too many. But will give good feedback on how ansible 8 does with our existing playbooks and roles21:19
clarkbhttps://review.opendev.org/c/opendev/system-config/+/898505 passed even when running all those extra jobs. I can't think of much else to test before we take the ansible 8 plunge so probably we just go for it when we've got a day to monitor it22:51
