19:05:19 #startmeeting infra
19:05:19 Meeting started Tue Oct 17 19:05:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:05:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:05:19 The meeting name has been set to 'infra'
19:05:21 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/DBCAGOTUNVOC2NLM4FATGKZK6GTZRJQ5/ Our Agenda
19:05:30 #topic Announcements
19:05:50 The PTG is happening next week from October 23-27. Please keep this in mind as we make changes to our hosted systems
19:06:26 #topic Mailman 3
19:06:40 All of our lists have been migrated into lists01.opendev.org and are running with mailman 3
19:06:50 thank you fungi for getting this mostly over the finish line
19:07:01 There are two issues (one outstanding) that should be called out
19:07:21 The first is that we had exim configured to copy deliveries to openstack-discuss to a local mailbox. We had done this on the old server to debug dmarc issues
19:07:43 Exim was unable to make these local copy deliveries because the dest dir didn't exist. Senders were getting "not delivered" emails as a result
19:08:01 The periodic jobs at ~02:00 Monday should have fixed this as we landed a change to remove the local copy config in exim entirely
19:08:24 I sent email today and can report back tomorrow if I got the "not delivered" message for it (it arrived 24 hours later last time)
19:08:57 The other issue is that RH corp email appears to not be delivering to the new server. As far as we can tell this is because they use some service that is resolving lists.openstack.org to the old server, which refuses smtp connections at this point
19:09:48 Not much we can do for that other than bring it to the attention of others who might engage with this service. This is an ongoing effort. I just brought it up with the openstack TC which is made up of some RHers
19:10:05 I think the last remaining tasks are to plan for cleaning up the old server
19:10:13 well we could enable exim on the old server again and make it forward to the new one
19:10:14 and maybe consider adding MX records alongside our A/AAAA records
19:10:16 that's super strange
19:10:21 at least for some time
19:10:26 frickler: I think I would -2 that.
19:11:03 we shouldn't need to act as a proxy for email in that way. And it will just prolong the need to keep the 11.5 year old server that sometimes fails to boot around
19:11:03 I'm not saying we should do that, but it is something we could do to resolve the issue from our side
19:12:13 adding MX records doesn't sound wrong, either
19:12:16 I think working around it will give the third party an excuse to not resolve it. Best we tackle it directly
19:12:48 ya if we add MX records we should do it for all of the list domains for consistency. I don't think it will help this issue but may make other senders happy
19:13:38 unfortunately fungi is at a conference so can't weigh in. Hopefully we can catch up on all the mailing list fun with him later this week
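A quick way to see what the list domains currently publish is a short dnspython loop. This is a minimal sketch, assuming the third-party dnspython package is installed; the domain list is illustrative rather than the authoritative set of hosted list domains.

    # Sketch: compare A/AAAA/MX records across the list domains.
    # Assumes dnspython is installed; the domain list below is an
    # example, not the complete set of hosted list domains.
    import dns.resolver

    DOMAINS = ["lists.opendev.org", "lists.openstack.org", "lists.zuul-ci.org"]

    for domain in DOMAINS:
        for rtype in ("A", "AAAA", "MX"):
            try:
                answers = dns.resolver.resolve(domain, rtype)
            except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
                print(f"{domain:25} {rtype:4} (no records)")
                continue
            for rdata in answers:
                print(f"{domain:25} {rtype:4} {rdata.to_text()}")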
19:14:17 #topic LE certcheck failures in Ansible
19:14:41 While I was trying to get the exim config on lists01 updated to fix the undeliverable local copy errors I hit a problem with our LE jobs
19:14:42 I definitely see it as a RH bug and once it's been raised as an issue inside RH that's up to their infra teams to fix
19:15:15 tonyb: fwiw hberaud and ralonsoh have been working it as affected users already aiui
19:15:28 When compiling the list of domains to configure certcheck with we got `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`
19:15:31 okay
19:15:44 this error does not occur 100% of the time so I suspect some sort of weird ansible issue
19:16:08 digging through the info in the logs I wasn't able to find any nodes in the letsencrypt group that didn't have letsencrypt_certcheck_domains applied to them
19:16:12 #link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging
19:16:26 I don't have a fix but did write up ^ to add more debugging to the system to hopefully make the problem more clear
19:16:30 (mx records should not be required; they kind of only muddy the waters; i agree they're not technically wrong though, and could be a voodoo solution to the problem. just bringing that up in case anyone is under the impression that they are required and we are wrong for omitting them)
19:17:02 annoyingly ansible does not report the iteration item when a loop iteration fails. It does report it when it succeeds....
19:17:17 that makes loop failures like the one building the certcheck list difficult to debug
19:17:27 my change basically hacks around that by recording the info directly
19:18:07 reviews welcome as is fresh eyes debugging if someone has time to look at the logs. I can probably even paste them somewhere after making sure no secrets are leaked if that helps
19:18:27 #topic Zuul not properly caching branches when they are created
19:18:58 This doesn't appear to be a 100% of the time problem either. But yesterday we noticed after a user reported jobs weren't running on a change that zuul seemed unaware of the branch that change was proposed to
19:19:22 corvus theorized that this may be due to Gerrit emitting the ref-updated event that zuul processes before the git repos have the branch in them on disk (which zuul reads to list branches)
19:19:40 the long term fix for this is to have zuul query the gerrit api for branch listing which should be consistent with the events stream
19:19:55 in the meantime we can force zuul to reload the affected zuul tenant which fixes the problem
19:20:01 Run `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` on the scheduler to fix
19:20:17 i might nitpick the topic title here and point out that it may not actually be a zuul bug; there's a good chance that the issue might be described as "Gerrit doesn't report the correct set of branches over git under some circumstances". but i agree that it manifests to users as "zuul doesn't have the right branch list" :)
19:20:18 I did this yesterday and it took about 21 minutes but afterwards all was well
19:20:28 when someone reports it right?
19:20:36 tonyb: yup
19:20:45 and yes, i think the next step in fixing is to switch to the gerrit rest api to see if it behaves better
19:21:25 fair point. The cooperation between services is broken by data consistency expectations that don't hold :)
19:21:39 yes! :)
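For reference, the suggested next step of asking Gerrit over REST instead of trusting the on-disk repo looks roughly like the sketch below. It assumes the requests library and uses an example project name; Gerrit prepends a )]}' line to its JSON responses, which has to be stripped before parsing.

    # Sketch: list a project's branches via the Gerrit REST API rather
    # than the on-disk git repo. Assumes the requests library; the
    # project name here is only an example.
    import json
    import urllib.parse

    import requests

    GERRIT = "https://review.opendev.org"
    project = urllib.parse.quote("opendev/system-config", safe="")

    resp = requests.get(f"{GERRIT}/projects/{project}/branches/", timeout=30)
    resp.raise_for_status()
    # Gerrit prepends )]}' to JSON responses to defeat XSSI; strip it.
    body = resp.text.removeprefix(")]}'").lstrip()
    for branch in json.loads(body):
        print(branch["ref"], branch["revision"])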
19:21:53 #topic Server Upgrades
19:22:08 No new server upgrades. Some services have been upgraded though. More on that later
19:22:21 #topic InMotion/OnMetal Cloud Redeployment
19:22:26 #undo
19:22:26 Removing item from minutes: #topic InMotion/OnMetal Cloud Redeployment
19:22:32 #topic InMotion/OpenMetal Cloud Redeployment
19:22:39 I always want to type OnMetal because that was Rax's thing
19:23:17 After discussing this last week I think I'm leaning towards doing a single redeployment early next year. That way we get all the new goodness with the least amount of effort
19:23:35 the main resource we tend to lack is time so minimizing time required to use services and tools seems important to me
19:24:04 +1
19:24:37 +1
19:24:40 we can always change our mind later if we find a new good reason to deploy sooner. But until then I'm happy as is.
19:26:05 #topic Python Container Updates
19:26:11 #link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open
19:26:48 The end of this process is in sight. Everything but zuul/zuul-operator and openstack/python-openstackclient is now on python3.11. Everything on python3.11 is on bookworm except for zuul/zuul-registry
19:27:12 I have a change to fixup some of the job dependencies (something we missed when making the other changes) and then another change to drop python3.9 entirely as nothing is using it
19:27:28 Once zuul-operator and openstackclient move to python3.11 we can drop the python3.10 builds too
19:27:41 Nice
19:28:02 And then we can look into adding python3.12 image builds, but I don't think this is urgent as we don't have a host platform outside of the containers for running things like linters and unittests. But having the images ready would be nice
19:29:32 #topic Gita 1.21 Upgrade
19:29:32 +1
19:29:41 #undo
19:29:41 Removing item from minutes: #topic Gita 1.21 Upgrade
19:29:44 #topic Gitea 1.21 Upgrade
19:29:47 I cannot type today
19:30:04 Nothing really new here. Upstream hasn't produced a new rc or final release so there is no changelog yet
19:30:27 Hopefully we get one soon so that we can plan key rotations if we deem that necessary as well as the gitea upgrade proper
19:30:49 #topic Zookeeper 3.8 Upgrade
19:31:22 This wasn't on the agenda I sent out because updates happened in docker hub this morning. I decided to go ahead and upgrade the zookeeper cluster to 3.8.3 today after new images with some bug fixes became available
19:31:35 This is now done. All three nodes are updated and the cluster seems happy
19:31:46 #link https://review.opendev.org/c/opendev/system-config/+/898614 check myid in zookeeper testing
19:32:03 This change came out of one of the things corvus was checking during the upgrade. Basically a sanity check that the zookeeper node recognizes its own id properly
19:32:50 The main motivation behind this is that 3.9 is out now, which means 3.8 is the current stable release. Now we're caught up and getting all the latest updates
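The content of change 898614 isn't reproduced here, but the general shape of a "does this node know its own id" check is sketched below. The dataDir and zoo.cfg paths and the hostname matching are assumptions for illustration, not necessarily what the system-config change does.

    # Sketch: verify a ZooKeeper node's myid matches its own server.N
    # entry in zoo.cfg. The paths and hostname handling are assumptions.
    import re
    import socket

    MYID_PATH = "/var/zookeeper/data/myid"  # assumed dataDir location
    ZOO_CFG = "/etc/zookeeper/zoo.cfg"      # assumed config location

    with open(MYID_PATH) as f:
        myid = f.read().strip()

    hostname = socket.getfqdn()
    with open(ZOO_CFG) as f:
        for line in f:
            m = re.match(r"server\.(\d+)=([^:]+):", line.strip())
            if m and m.group(2) == hostname:
                assert m.group(1) == myid, (
                    f"myid {myid} != zoo.cfg server id {m.group(1)}")
                print(f"ok: {hostname} is server.{myid}")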
19:33:02 #topic Ansible 8 Upgrade for OpenDev Control Plane
19:33:14 Another topic that didn't make it on the agenda. This was entirely my fault as I knew about it but failed to add it
19:33:30 #link https://review.opendev.org/c/opendev/system-config/+/898505 Update ansible on bridge to ansible 8
19:33:54 I think it is time for us to catch up on ansible releases for our control plane. Zuul has been happy with it and the migration seemed straightforward
19:34:28 I do note that change did not run many system-config-run-* jobs against ansible 8 so we should modify the change to trigger more of those to get good coverage with the new ansible version before merging it. I've got that as a todo for later today
19:35:08 assuming our system-config-run jobs are happy it should be very safe to land. Just need to monitor after it goes in to ensure the upgrade is successful and we didn't miss any compatibility issues
19:35:15 #topic Open Discussion
19:35:20 Anything else?
19:35:37 coming back to lists.openstack.org, one possible issue occurred to me
19:36:04 for the old server, rdns pointed back to lists.openstack.org, now we have lists01.opendev.org
19:36:31 so in fact doing an MX record pointing to the latter might be more correct
19:36:41 frickler: you think they may only accept forward records that have matching reverse records?
19:37:11 I know some people do when receiving mail, not sure how strict things are when sending
19:37:46 ya may be worth a try with MX records I guess then. Though I'd like fungi to weigh in on that before we take action since he has been driving this whole thing
19:37:57 also in the SMTP dialog the server identifies as lists01
19:37:59 hrm, i have received 2 different A responses for lists.openstack.org. it's possible the old one was cached
19:38:20 lists.openstack.org. 30 IN A 50.56.173.222
19:38:24 lists.openstack.org. 21 IN A 162.209.78.70
19:38:33 corvus: from where did you receive those? first is the old IP
19:38:55 just a local lookup; so it's entirely possible it's some old cache on my router
19:39:09 i'll keep an eye out and let folks know if it flaps back
19:39:47 thanks. I have only received the new ip from my local resolver, the authoritative servers, google, cloudflare, and quad9 so far
19:39:58 if that is more consistent it may be a thread to pull on
19:40:25 java is famous for not respecting ttls; so if rh has some java thing involved, that could be related
19:40:48 I pushed an update to https://gerrit-review.googlesource.com/c/plugins/replication/+/387314 earlier today. It still doesn't pass all tests but passes my new tests and I'm hoping I can get feedback on the approach before doing the work to make all test cases pass and fix one known issue
19:41:15 corvus: yup java 5 and older ignored ttls by default, using only the first resolved values. Then after that this behavior became configured
19:41:18 *configurable
19:42:07 if a server is unhappy about forward/reverse dns matching, an mx record probably won't help that. the important thing is that the forward dns of the helo matches the reverse dns
19:42:27 I'll check with the affected users and make sure an internal ticket is raised
19:42:51 (and that the A record for the name returned by the PTR matches the incoming IP)
19:43:24 corvus: I feel like this needs pictures :)
19:43:32 with lots of arrows
19:43:48 lol
19:44:03 imagine a cat playing with a ball of yarn
19:45:00 sounds like that may be it. We can talk dns and smtp with fungi when he is able and take it from there
19:45:08 thank you for your time everyone and sorry I was a few minutes late
19:45:19 thank you clarkb :)
19:45:32 I think we will have a meeting next week during the PTG since our normal meeting time is outside of PTG times and I don't think I'm going to be super busy with the PTG this time around but that may change
19:45:41 #endmeeting
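To put the arrows into code, here is a minimal sketch of the forward-confirmed reverse DNS check described in the discussion above, using only the Python standard library; the IP address and HELO name are examples only.

    # Sketch of the forward/reverse DNS check: the PTR for the sending IP
    # should name a host whose forward records include that IP, and the
    # HELO name should also resolve to it. IP and name are examples.
    import socket

    sending_ip = "162.209.78.70"        # example sending IP
    helo_name = "lists01.opendev.org"   # example HELO/EHLO name

    ptr_name, _, _ = socket.gethostbyaddr(sending_ip)
    ptr_forward = {ai[4][0] for ai in socket.getaddrinfo(ptr_name, None)}
    helo_forward = {ai[4][0] for ai in socket.getaddrinfo(helo_name, None)}

    print("PTR name:", ptr_name)
    print("PTR name resolves back to sending IP:", sending_ip in ptr_forward)
    print("HELO name resolves to sending IP:", sending_ip in helo_forward)
    print("HELO name matches PTR name:", helo_name == ptr_name)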