19:05:19 #startmeeting infra
19:05:19 Meeting started Tue Oct 17 19:05:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:05:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:05:19 The meeting name has been set to 'infra'
19:05:21 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/DBCAGOTUNVOC2NLM4FATGKZK6GTZRJQ5/ Our Agenda
19:05:30 #topic Announcements
19:05:50 The PTG is happening next week from October 23-27. Please keep this in mind as we make changes to our hosted systems
19:06:26 #topic Mailman 3
19:06:40 All of our lists have been migrated into lists01.opendev.org and are running with mailman 3
19:06:50 thank you fungi for getting this mostly over the finish line
19:07:01 There are two issues (one outstanding) that should be called out
19:07:21 The first is that we had exim configured to copy deliveries to openstack-discuss to a local mailbox. We had done this on the old server to debug dmarc issues
19:07:43 Exim was unable to make these local copy deliveries because the dest dir didn't exist. Senders were getting "not delivered" emails as a result
19:08:01 The periodic jobs at ~02:00 Monday should have fixed this as we landed a change to remove the local copy config in exim entirely
19:08:24 I sent email today and can report back tomorrow if I got the "not delivered" message for it (it arrived 24 hours later last time)
19:08:57 The other issue is that RH corp email appears to not be delivering to the new server. As far as we can tell this is because they use some service that is resolving lists.openstack.org to the old server, which refuses smtp connections at this point
19:09:48 Not much we can do for that other than bring it to the attention of others who might engage with this service. This is an ongoing effort. I just brought it up with the openstack TC which is made up of some RHers
19:10:05 I think the last remaining tasks are to plan for cleaning up the old server
19:10:13 well we could enable exim on the old server again and make it forward to the new one
19:10:14 and maybe consider adding MX records alongside our A/AAAA records
19:10:16 that's super strange
19:10:21 at least for some time
19:10:26 frickler: I think I would -2 that.
19:11:03 we shouldn't need to act as a proxy for email in that way. And it will just prolong the need to keep the 11.5 year old server that sometimes fails to boot around
19:11:03 I'm not saying we should do that, but it is something we could do to resolve the issue from our side
19:12:13 adding MX records doesn't sound wrong, either
19:12:16 I think working around it will give the third party an excuse to not resolve it. Best we tackle it directly
19:12:48 ya if we add MX records we should do it for all of the list domains for consistency. I don't think it will help this issue but may make other senders happy
19:13:38 unfortunately fungi is at a conference so can't weigh in. Hopefully we can catch up on all the mailing list fun with him later this week
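A quick way to see what the list domains currently publish is a short dnspython loop. This is a minimal sketch, assuming the third-party dnspython package is installed; the domain list is illustrative rather than the authoritative set of hosted list domains.

    # Sketch: compare A/AAAA/MX records across the list domains.
    # Assumes dnspython is installed; the domain list below is an
    # example, not the complete set of hosted list domains.
    import dns.resolver

    DOMAINS = ["lists.opendev.org", "lists.openstack.org", "lists.zuul-ci.org"]

    for domain in DOMAINS:
        for rtype in ("A", "AAAA", "MX"):
            try:
                answers = dns.resolver.resolve(domain, rtype)
            except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
                print(f"{domain:25} {rtype:4} (no records)")
                continue
            for rdata in answers:
                print(f"{domain:25} {rtype:4} {rdata.to_text()}")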
19:14:17 #topic LE certcheck failures in Ansible
19:14:41 While I was trying to get the exim config on lists01 updated to fix the undeliverable local copy errors I hit a problem with our LE jobs
19:14:42 I definitely see it as a RH bug and once it's been raised as an issue inside RH that's up to their infra teams to fix
19:15:15 tonyb: fwiw hberaud and ralonsoh have been working it as affected users already aiui
19:15:28 When compiling the list of domains to configure certcheck with we got `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`
19:15:31 okay
19:15:44 this error does not occur 100% of the time so I suspect some sort of weird ansible issue
19:16:08 digging through the info in the logs I wasn't able to find any nodes in the letsencrypt group that didn't have letsencrypt_certcheck_domains applied to them
19:16:12 #link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging
19:16:26 I don't have a fix but did write up ^ to add more debugging to the system to hopefully make the problem more clear
19:16:30 (mx records should not be required; they kind of only muddy the waters; i agree they're not technically wrong though, and could be a voodoo solution to the problem. just bringing that up in case anyone is under the impression that they are required and we are wrong for omitting them)
19:17:02 annoyingly ansible does not report the iteration item when a loop iteration fails. It does report it when it succeeds....
19:17:17 that makes loop failures like the one building the certcheck list difficult to debug
19:17:27 my change basically hacks around that by recording the info directly
19:18:07 reviews welcome as is fresh eyes debugging if someone has time to look at the logs. I can probably even paste them somewhere after making sure no secrets are leaked if that helps
19:18:27 #topic Zuul not properly caching branches when they are created
19:18:58 This doesn't appear to be a 100% of the time problem either. But yesterday we noticed after a user reported jobs weren't running on a change that zuul seemed unaware of the branch that change was proposed to
19:19:22 corvus theorized that this may be due to Gerrit emitting the ref-updated event that zuul processes before the git repos have the branch in them on disk (which zuul reads to list branches)
19:19:40 the long term fix for this is to have zuul query the gerrit api for branch listing which should be consistent with the events stream
19:19:55 in the meantime we can force zuul to reload the affected zuul tenant which fixes the problem
19:20:01 Run `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` on the scheduler to fix
19:20:17 i might nitpick the topic title here and point out that it may not actually be a zuul bug; there's a good chance that the issue might be described as "Gerrit doesn't report the correct set of branches over git under some circumstances". but i agree that it manifests to users as "zuul doesn't have the right branch list" :)
19:20:18 I did this yesterday and it took about 21 minutes but afterwards all was well
19:20:28 when someone reports it right?
19:20:36 tonyb: yup
19:20:45 and yes, i think the next step in fixing is to switch to the gerrit rest api to see if it behaves better
19:21:25 fair point. The cooperation between services is broken by data consistency expectations that don't hold :)
19:21:39 yes! :)
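For reference, the suggested next step of asking Gerrit over REST instead of trusting the on-disk repo looks roughly like the sketch below. It assumes the requests library and uses an example project name; Gerrit prepends a )]}' line to its JSON responses, which has to be stripped before parsing.

    # Sketch: list a project's branches via the Gerrit REST API rather
    # than the on-disk git repo. Assumes the requests library; the
    # project name here is only an example.
    import json
    import urllib.parse

    import requests

    GERRIT = "https://review.opendev.org"
    project = urllib.parse.quote("opendev/system-config", safe="")

    resp = requests.get(f"{GERRIT}/projects/{project}/branches/", timeout=30)
    resp.raise_for_status()
    # Gerrit prepends )]}' to JSON responses to defeat XSSI; strip it.
    body = resp.text.removeprefix(")]}'").lstrip()
    for branch in json.loads(body):
        print(branch["ref"], branch["revision"])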
19:21:53 #topic Server Upgrades
19:22:08 No new server upgrades. Some services have been upgraded though. More on that later
19:22:21 #topic InMotion/OnMetal Cloud Redeployment
19:22:26 #undo
19:22:26 Removing item from minutes: #topic InMotion/OnMetal Cloud Redeployment
19:22:32 #topic InMotion/OpenMetal Cloud Redeployment
19:22:39 I always want to type OnMetal because that was Rax's thing
19:23:17 After discussing this last week I think I'm leaning towards doing a single redeployment early next year. That way we get all the new goodness with the least amount of effort
19:23:35 the main resource we tend to lack is time so minimizing time required to use services and tools seems important to me
19:24:04 +1
19:24:37 +1
19:24:40 we can always change our mind later if we find a new good reason to deploy sooner. But until then I'm happy as is.
19:26:05 #topic Python Container Updates
19:26:11 #link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open
19:26:48 The end of this process is in sight. Everything but zuul/zuul-operator and openstack/python-openstackclient is now on python3.11. Everything on python3.11 is on bookworm except for zuul/zuul-registry
19:27:12 I have a change to fixup some of the job dependencies (something we missed when making the other changes) and then another change to drop python3.9 entirely as nothing is using it
19:27:28 Once zuul-operator and openstackclient move to python3.11 we can drop the python3.10 builds too
19:27:41 Nice
19:28:02 And then we can look into adding python3.12 image builds, but I don't think this is urgent as we don't have a host platform outside of the containers for running things like linters and unittests. But having the images ready would be nice
19:29:32 #topic Gita 1.21 Upgrade
19:29:32 +1
19:29:41 #undo
19:29:41 Removing item from minutes: #topic Gita 1.21 Upgrade
19:29:44 #topic Gitea 1.21 Upgrade
19:29:47 I cannot type today
19:30:04 Nothing really new here. Upstream hasn't produced a new rc or final release so there is no changelog yet
19:30:27 Hopefully we get one soon so that we can plan key rotations if we deem that necessary as well as the gitea upgrade proper
19:30:49 #topic Zookeeper 3.8 Upgrade
19:31:22 This wasn't on the agenda I sent out because updates happened in docker hub this morning. I decided to go ahead and upgrade the zookeeper cluster to 3.8.3 today after new images with some bug fixes became available
19:31:35 This is now done. All three nodes are updated and the cluster seems happy
19:31:46 #link https://review.opendev.org/c/opendev/system-config/+/898614 check myid in zookeeper testing
19:32:03 This change came out of one of the things corvus was checking during the upgrade. Basically a sanity check that the zookeeper node recognizes its own id properly
19:32:50 The main motivation behind this is that 3.9 is out now, which means 3.8 is the current stable release. Now we're caught up and getting all the latest updates
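The content of change 898614 isn't reproduced here, but the general shape of a "does this node know its own id" check is sketched below. The dataDir and zoo.cfg paths and the hostname matching are assumptions for illustration, not necessarily what the system-config change does.

    # Sketch: verify a ZooKeeper node's myid matches its own server.N
    # entry in zoo.cfg. The paths and hostname handling are assumptions.
    import re
    import socket

    MYID_PATH = "/var/zookeeper/data/myid"  # assumed dataDir location
    ZOO_CFG = "/etc/zookeeper/zoo.cfg"      # assumed config location

    with open(MYID_PATH) as f:
        myid = f.read().strip()

    hostname = socket.getfqdn()
    with open(ZOO_CFG) as f:
        for line in f:
            m = re.match(r"server\.(\d+)=([^:]+):", line.strip())
            if m and m.group(2) == hostname:
                assert m.group(1) == myid, (
                    f"myid {myid} != zoo.cfg server id {m.group(1)}")
                print(f"ok: {hostname} is server.{myid}")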
19:33:02 #topic Ansible 8 Upgrade for OpenDev Control Plane
19:33:14 Another topic that didn't make it on the agenda. This was entirely my fault as I knew about it but failed to add it
19:33:30 #link https://review.opendev.org/c/opendev/system-config/+/898505 Update ansible on bridge to ansible 8
19:33:54 I think it is time for us to catch up on ansible releases for our control plane. Zuul has been happy with it and the migration seemed straightforward
19:34:28 I do note that change did not run many system-config-run-* jobs against ansible 8 so we should modify the change to trigger more of those to get good coverage with the new ansible version before merging it. I've got that as a todo for later today
19:35:08 assuming our system-config-run jobs are happy it should be very safe to land. Just need to monitor after it goes in to ensure the upgrade is successful and we didn't miss any compatibility issues
19:35:15 #topic Open Discussion
19:35:20 Anything else?
19:35:37 coming back to lists.openstack.org, one possible issue occurred to me
19:36:04 for the old server, rdns pointed back to lists.openstack.org, now we have lists01.opendev.org
19:36:31 so in fact doing an MX record pointing to the latter might be more correct
19:36:41 frickler: you think they may only accept forward records that have matching reverse records?
19:37:11 I know some people do when receiving mail, not sure how strict things are when sending
19:37:46 ya may be worth a try with MX records I guess then. Though I'd like fungi to weigh in on that before we take action since he has been driving this whole thing
19:37:57 also in the SMTP dialog the server identifies as lists01
19:37:59 hrm, i have received 2 different A responses for lists.openstack.org. it's possible the old one was cached
19:38:20 lists.openstack.org. 30 IN A 50.56.173.222
19:38:24 lists.openstack.org. 21 IN A 162.209.78.70
19:38:33 corvus: from where did you receive those? first is the old IP
19:38:55 just a local lookup; so it's entirely possible it's some old cache on my router
19:39:09 i'll keep an eye out and let folks know if it flaps back
19:39:47 thanks. I have only received the new ip from my local resolver, the authoritative servers, google, cloudflare, and quad9 so far
19:39:58 if that is more consistent it may be a thread to pull on
19:40:25 java is famous for not respecting ttls; so if rh has some java thing involved, that could be related
19:40:48 I pushed an update to https://gerrit-review.googlesource.com/c/plugins/replication/+/387314 earlier today. It still doesn't pass all tests but passes my new tests and I'm hoping I can get feedback on the approach before doing the work to make all test cases pass and fix one known issue
19:41:15 corvus: yup java 5 and older ignored ttls by default, using only the first resolved values. Then after that this behavior became configured
19:41:18 *configurable
19:42:07 if a server is unhappy about forward/reverse dns matching, an mx record probably won't help that. the important thing is that the forward dns of the helo matches the reverse dns
19:42:27 I'll check with the affected users and make sure an internal ticket is raised
19:42:51 (and that the A record for the name returned by the PTR matches the incoming IP)
19:43:24 corvus: I feel like this needs pictures :)
19:43:32 with lots of arrows
19:43:48 lol
19:44:03 imagine a cat playing with a ball of yarn
19:45:00 sounds like that may be it. We can talk dns and smtp with fungi when he is able and take it from there
19:45:08 thank you for your time everyone and sorry I was a few minutes late
19:45:19 thank you clarkb :)
19:45:32 I think we will have a meeting next week during the PTG since our normal meeting time is outside of PTG times and I don't think I'm going to be super busy with the PTG this time around but that may change
19:45:41 #endmeeting
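To put the arrows into code, here is a minimal sketch of the forward-confirmed reverse DNS check described in the discussion above, using only the Python standard library; the IP address and HELO name are examples only.

    # Sketch of the forward/reverse DNS check: the PTR for the sending IP
    # should name a host whose forward records include that IP, and the
    # HELO name should also resolve to it. IP and name are examples.
    import socket

    sending_ip = "162.209.78.70"        # example sending IP
    helo_name = "lists01.opendev.org"   # example HELO/EHLO name

    ptr_name, _, _ = socket.gethostbyaddr(sending_ip)
    ptr_forward = {ai[4][0] for ai in socket.getaddrinfo(ptr_name, None)}
    helo_forward = {ai[4][0] for ai in socket.getaddrinfo(helo_name, None)}

    print("PTR name:", ptr_name)
    print("PTR name resolves back to sending IP:", sending_ip in ptr_forward)
    print("HELO name resolves to sending IP:", sending_ip in helo_forward)
    print("HELO name matches PTR name:", helo_name == ptr_name)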