Monday, 2022-11-21

opendevreviewMerged opendev/system-config master: Use prod_bastion group in gate bootstrap  https://review.opendev.org/c/opendev/system-config/+/86499100:12
*** rlandy is now known as rlandy|out00:25
opendevreviewIan Wienand proposed opendev/system-config master: rax-dns-backup: fix parsing  https://review.opendev.org/c/opendev/system-config/+/86508300:50
ianw^ we should probably do another cleanup purge through openstack.org and others to get rid of old stuff too00:53
opendevreviewIan Wienand proposed opendev/system-config master: bridge: Disable writing known_hosts files  https://review.opendev.org/c/opendev/system-config/+/86509204:04
*** yadnesh|away is now known as yadnesh04:10
opendevreviewIan Wienand proposed opendev/system-config master: bridge: Disable writing known_hosts files  https://review.opendev.org/c/opendev/system-config/+/86509204:30
*** ysandeep|out is now known as ysandeep04:59
*** ysandeep is now known as ysandeep|ruck04:59
opendevreviewIan Wienand proposed opendev/system-config master: launch-node : make into a small package  https://review.opendev.org/c/opendev/system-config/+/86128405:30
StutiArya[m]Hi, I am trying to set-up a new stack with the below commands in python 3.9.15 and Ubuntu 20.04.4 LTS. I created a stack user and cloned devstack using 'git clone https://opendev.org/openstack/devstack' and checkout yoga and granted all permissions to user with help of chmod 755. Creating the local.conf file using below configuration.... (full message at <https://matrix.org/_matrix/media/r0/download/matrix.org/xucoaZnhyzuOlxTsWPaMIfEG>)06:11
*** yadnesh is now known as yadnesh|afk07:43
*** yadnesh|afk is now known as yadnesh08:34
*** jpena|off is now known as jpena08:36
*** ysandeep|ruck is now known as ysandeep|ruck|lunch09:46
*** pojadhav is now known as pojadhav|afk10:22
*** ysandeep|ruck|lunch is now known as ysandeep|ruck10:26
*** gthiemon1e is now known as gthiemonge11:00
*** rlandy|out is now known as rlandy|rover11:06
*** dviroel|out is now known as dviroel11:23
*** pojadhav- is now known as pojadhav11:44
*** ysandeep|ruck is now known as ysandeep|ruck|brb12:00
*** dviroel_ is now known as dviroel12:12
*** ysandeep|ruck|brb is now known as ysandeep|ruck12:12
fungiStutiArya[m]: this is the channel for discussing collaboration services run as part of the opendev collaboratory, you're probably looking for the #openstack-qa channel (primary channel for the team which maintains devstack)12:34
ysandeep|ruckfolks o/ We build tripleo containers in a content-provider job, and this job is hitting a transient issue during container builds: unable to resolve the mirror. Appreciate any pointers.13:04
ysandeep|ruckhttps://c6e907431413b276a63b-23a7172f5dc1e58445390ec61883d218.ssl.cf2.rackcdn.com/864994/8/gate/tripleo-ci-centos-9-content-provider/9d99392/logs/container-builds/522fec05-8e3f-4cc7-8901-45f0ca049eed/base/os/horizon/horizon-build.log13:05
ysandeep|ruck~~~13:05
ysandeep|ruckhttps://c6e907431413b276a63b-23a7172f5dc1e58445390ec61883d218.ssl.cf2.rackcdn.com/864994/8/gate/tripleo-ci-centos-9-content-provider/9d99392/logs/container-builds/522fec05-8e3f-4cc7-8901-45f0ca049eed/base/os/horizon/horizon-build.log13:05
ysandeep|ruck[MIRROR] python3-PyMySQL-0.10.1-6.el9.noarch.rpm: Curl error (6): Couldn't resolve host name for http://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/AppStream/x86_64/os/Packages/python3-PyMySQL-0.10.1-6.el9.noarch.rpm [Could not resolve host: mirror.bhs1.ovh.opendev.org]13:05
ysandeep|ruck~~~13:05
ysandeep|ruckDifferent mirror/cloud provider: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_2ea/864469/1/gate/tripleo-ci-centos-9-content-provider/2ea04fe/logs/container-builds/3b86b709-7cd1-4929-bbc8-7ce5c35137a4/base/os/glance-api/glance-api-build.log13:05
fungisounds like dns resolution issues. are you using the local caching resolver which runs on the loopback on our test nodes, or directly querying some remote resolver?13:06
ysandeep|ruckfungi, https://b61c6185702ea5ab9478-5863b3c6c7dfee81e4a8aea3295e1b03.ssl.cf2.rackcdn.com/864814/2/check/tripleo-ci-centos-9-content-provider-zed/7653d78/logs/undercloud/etc/resolv.conf .. looks like remote server13:07
fungii wonder if that could be made to use the local resolver instead. we intentionally cache dns responses on the test node in order to make dns resolution a little less likely to break at random13:08
ysandeep|ruckEven though the unbound service is running, I don't see a 127.0.0.1 entry in resolv.conf13:08
ysandeep|ruckfungi: Should we use unbound? 13:10
* ysandeep|ruck needs to cross-check whether the config here can work directly https://b61c6185702ea5ab9478-5863b3c6c7dfee81e4a8aea3295e1b03.ssl.cf2.rackcdn.com/864814/2/check/tripleo-ci-centos-9-content-provider-zed/7653d78/logs/undercloud/etc/unbound/index.html13:10
fungijust double-checking a devstack job for comparison, and it does: https://zuul.opendev.org/t/openstack/build/f4ed66a491334120a6e3a95c3dcb983b/log/controller/logs/resolv_conf.txt13:12
fungii think we normally set it up in our base job so should be inherited by all jobs unless they undo/overwrite it somehow13:13
fungiyeah, it's set up in the base pre playbook here: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/pre.yaml#L48-L5513:16
ysandeep|ruckokay looks like we are running that pre in our job: https://zuul.openstack.org/build/14fb711246c542a6842edf4e04db5b03/log/job-output.txt#147-182 , maybe overriding the resolv.conf entry somewhere in our job.13:20
*** mrunge_ is now known as mrunge13:24
ysandeep|ruckfungi: thanks for the pointer, We will try local resolver if issue continues.13:25
fungiysandeep|ruck: for the record, it appears we bake that resolv.conf into our diskimages here: https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/nodepool-base/finalise.d/89-boot-settings#L15913:26
fungi(sorry for the delay, took a bit of digging to refresh my memory of where that happens)13:27
fungirather, we bake in an rclocal script which should overwrite it at boot13:28
fungiysandeep|ruck: are the issues you're seeing primarily (or only) on centos-9-stream nodes?13:29
ysandeep|ruckfungi: yes we use centos-9-stream nodes for tripleo test and that's where we are seeing these errors.13:30
fungichecking https://nb01.opendev.org/centos-9-stream-0000007862.log for the image build, it seems there should be an /etc/rc.local on those images which overwrites /etc/resolv.conf with an entry for nameserver 127.0.0.113:31
fungiso it might be that /etc/rc.local no longer gets executed at boot on stream 913:32
fungithe nameserver in the example you pasted is one which would have been set at boot by the cloud provider instead, making me suspicious13:33
fungiysandeep|ruck: do you happen to know if stream 9 executes /etc/rc.local at all?13:33
fungilooks like maybe it relies on having the rc-local service enabled in systemd13:35
ysandeep|ruckfungi, I am not sure about that13:35
fungii don't see anywhere we explicitly enable it, but ianw did write a brief essay in code comment form about it: https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/nodepool-base/finalise.d/89-boot-settings#L117-L13213:37
ysandeep|ruckfrom the service logs, rc-local service is enabled13:40
ysandeep|ruck● rc-local.service - /etc/rc.d/rc.local Compatibility13:40
ysandeep|ruck     Loaded: loaded (/usr/lib/systemd/system/rc-local.service; enabled-runtime; vendor preset: disabled)13:40
ysandeep|ruckhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_14f/852485/5/check/tripleo-ci-centos-9-content-provider/14fb711/logs/undercloud/var/log/extra/services.txt13:40
fungii checked a centos-9-stream node which is currently running a job for a ceilometer change and it has an /etc/resolv.conf with nameserver 127.0.0.113:46
ysandeep|ruckfungi: thanks for digging in, I will check further in case we are modifying the resolv.conf somewhere in our steps.13:47
fungii checked another one which is running a job for tripleo-heat-templates and it has entries for both 127.0.0.1 and 1.1.1.1 with a comment line at the top that says "Generated by NetworkManager"13:47
fungi(the example from ceilometer did not have that comment nor the extra entry for 1.1.1.1)13:48
fungii checked a node running a tripleo-ansible job and it just has "nameserver 127.0.0.1" at the moment13:49
fungiso i'm reasonably sure the nodes are starting out configured to query the local unbound cache and then something is overwriting them (either directly in your code or a service you're enabling/restarting overwrites it as a side effect?)13:50
ysandeep|ruckfungi: agreed, I will let you know once I find what's overwriting it. thanks again!13:51
fungimy pleasure13:52
fungimy hamfisted approach at debugging it would probably start by catting resolv.conf into logger at various points in the code and then searching through the collected syslog at the end of the build to see when it changes, which would probably help narrow it down13:54
fungianother possibility is that this is varying by provider, and we see different behavior in providers which rely on dhcp for network configuration vs those which rely on configdrive13:56
fungibut we'll need more data points to rule that out13:56
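A minimal sketch of the snapshot approach fungi describes above: capture resolv.conf at several points in a build, then find the first point where the nameserver list changes. The function names and snapshot labels are hypothetical and not part of any opendev or tripleo tooling.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments such as "# Generated by NetworkManager".
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def first_change(snapshots):
    """Given (label, resolv_conf_text) pairs in time order, return the label of
    the first snapshot whose nameserver list differs from the initial one,
    or None if the list never changes."""
    if not snapshots:
        return None
    baseline = parse_nameservers(snapshots[0][1])
    for label, text in snapshots[1:]:
        if parse_nameservers(text) != baseline:
            return label
    return None
```

Feeding it the snapshots collected from syslog, in order, should narrow down which step of the job rewrote the file.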
*** dasm|off is now known as dasm14:48
*** ysandeep|ruck is now known as ysandeep|ruck|dinner15:11
Clark[m]Note that the containers themselves (in which container images are built) may also have a different network config. This is because 127.0.0.1 in a container network namespace is different than the host and no unbound is listening there15:13
Clark[m]In this case because the problem is in a container (build) the host networking may be completely ignored 15:13
*** dviroel is now known as dviroel|lunch15:14
*** yadnesh is now known as yadnesh|away15:18
fungiwell, in this case it also sounds like something has reverted the node's resolv.conf contents, or prevented rc-local from running the script to overwrite them15:24
*** jpodivin_ is now known as jpodivin15:34
fungiClark[m]: a couple of new django warnings on the latest mailman migration test run:16:02
fungi?: (2_0.W001) Your URL pattern '^$' has a route that contains '(?P<', begins with a '^', or ends with a '$'. This was likely an oversight when migrating to django.urls.path().16:02
fungi?: (2_0.W001) Your URL pattern '^admin/' has a route that contains '(?P<', begins with a '^', or ends with a '$'. This was likely an oversight when migrating to django.urls.path().16:02
funginot sure what to make of that, but maybe latest django needs a full match already?16:03
fungialso these warnings which i hadn't seen before either but are likely benign:16:05
fungiUnable to convert mailing list attribute: msg_footer with value "...some data..."16:05
fungiUnable to convert mailing list attribute: digest_footer with value "...some data..."16:06
fungii logged the full output so will scour for more occurrences since it doesn't say why it can't import them. not sure yet if it's for every list or just a few16:06
fungifor the django.urls.path() warnings, looks like we're setting those in docker/mailman/web/mailman-web/urls.py so maybe i need to double-check that matches latest16:08
fungimmm, nope, identical16:10
fungimatches the equivalent file in the tip of the main branch16:10
clarkbfungi: I'm guessing ?P< is the control code for match entire string?16:12
fungino idea, our routes don't have that but they do have starting with a ^ or ending with a $16:12
fungiit's warning that the entry has at least one of those three characteristics, but not what to do about it16:13
fungithis may be related to why the upstream docker config is still pinning to older django and mailman16:13
*** ysandeep|ruck|dinner is now known as ysandeep|ruck16:13
clarkbwe're pinning the same way though?16:14
clarkbour dockerfile should be an exact replica16:14
clarkblooks like django 2.0's path method dropped regex support16:16
clarkbbut we're on django 4.something and were on 3.something previously iirc16:16
clarkbyou need to use re_path() for regex routes16:16
clarkbseems like this may be an actual upstream bug? But I would've expected it to be present regardless of the current or previous django versions16:17
*** dviroel|lunch is now known as dviroel16:18
clarkbfungi: looking at both of those routes in that file I suspect that we can replace the first one with empty string and the second with admin/. Or replace path() with re_path()16:21
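A rough stdlib-only sketch of the condition the 2_0.W001 warning text describes, plus the anchor-stripping fix clarkb suggests. It mirrors the wording of the warning quoted above, not Django's actual implementation, and both helper names are hypothetical.

```python
def looks_like_regex_route(route):
    """True if a route has any of the three traits the warning lists."""
    return "(?P<" in route or route.startswith("^") or route.endswith("$")


def to_plain_route(route):
    """Strip a leading '^' and trailing '$' so the route suits path().

    Only safe for routes with no other regex syntax; anything using named
    groups is rejected here and would need re_path() instead.
    """
    if "(?P<" in route:
        raise ValueError("route uses a named group; keep it as a regex route")
    if route.startswith("^"):
        route = route[1:]
    if route.endswith("$"):
        route = route[:-1]
    return route
```

Applied to the two flagged patterns, this yields an empty string for '^$' and 'admin/' for '^admin/', which matches the replacement proposed above.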
*** ysandeep|ruck is now known as ysandeep|out16:23
clarkbyes I've confirmed that django is pinned <4.1. I'm not seeing where hyperkitty is explicitly installed16:23
clarkbfrom the requirements.txt and that lists pinned versions of the mailman side as well16:24
clarkbso I think we're pinning everything?16:24
fungiahh, okay right it wasn't django i unpinned for mailman 3.3.7 it was sqlalchemy16:31
clarkbya I'm like 99% certain this is an upstream bug16:31
fungithe update to urls.py in the latest patchset replaced a bunch of url() entries with path() but didn't alter the parameters, so maybe they should have?16:32
fungiif you load change 860157 and compare patchsets 14 and 15 on the urls.py file, that shows it pretty clearly16:33
clarkbyup, I think that is it. For ^admin/ I think we can simply drop the ^ and that matches accounts/ on the line above? Except maybe it depends on what is in admin.site.urls and whether or not it expects a regex?16:33
fungioh, in fact they did change some of the entries just not all of them16:33
fungiurl(r'^postorius/', became path(r'postorius/', for example16:33
fungiokay, i can push up that edit and hold another node, just want to see if there's anything else i spot in the logs which also needs fixing up while at it16:34
clarkbupstream mailman may have a urls file we can look at too16:34
clarkblet me see if I can find that really quickly before we push an update16:34
fungisure, much obliged16:35
fungioh, one thing worth noting... this would result in two entries for path(r'', ...)16:36
fungione does a RedirectView.as_view() and the other does an include()16:37
clarkbfungi: https://gitlab.com/mailman/mailman-web/-/blob/master/mailman_web/urls.py16:37
funginice, thanks!16:37
clarkband ya that source file has two '' entries16:37
fungiperfect16:38
clarkbfungi: assuming that makes things happy we can do a PR to sync that up in maxking's repo too16:38
fungiinterestingly, the upstream one in mailman-web uses "mailman3" instead of "postorius" and "archives" instead of "hyperkitty" for the paths16:39
funginot sure why they're different in the docker versions16:39
clarkbI wonder if that is done to allow them to change the components out in the future16:40
clarkbmight be worth raising with maxking?16:40
fungioh, could be that's in order to support side-by-side installs of 2.x and 3.x16:40
fungiwell, no because almost all 2.x deployments put "pipermail" in the archive path16:41
fungii'm stumped16:41
fungilooks like it's been that way since the file was introduced in 2017 (commit c110bb1)16:43
fungimaybe upstream changed their naming convention for those in the past 5 years16:44
clarkbI suspect this is something that is semi configurable and you can set to your liking16:44
fungiindeed, 6d7de87 changed it in 2020 for mailman-web16:45
clarkbianw: I +2'd https://review.opendev.org/c/opendev/system-config/+/861284 but didn't approve it due to the minor thing called out inline. I mostly wanted to make sure that was intentional before we land the change16:45
clarkbfungi: I suspect maxking can't change the docker images without breaking all of the existing docker image users16:45
clarkbianw: but feel free to approve or fixup or do a followup change I'm happy however that happens16:46
fungiclarkb: right, that's what i was thinking too. so leaves us with the dilemma: do we want to be more like the docker version's config or more like upstream mailman's?16:46
fungii'm leaning toward changing it to be like upstream, since it looks like tech debt/baggage for existing docker users16:47
fungii'll probably have to adjust our redirects and tests to match though16:47
clarkbfungi: ya I think worst case for us down the road we have apache do some redirecting to match both sets16:47
*** marios is now known as marios|out16:57
clarkbysandeep|out: fungi: one other thing I notice after pulling up the launchpad bug is that multiple domains are failing to resolve which means it is unlikely to be a dns server issue for that domain17:11
clarkb(basically points to the local systems reinforcing suspicion there)17:11
clarkbalso it looks like people are confusing where we store logs with where the jobs run https://bugs.launchpad.net/tripleo/+bug/1997202/comments/317:12
fungiyeah, i have a feeling the resolvers maintained by one or more of our donor cloud providers are overloaded/broken, but in theory more local caching on the test nodes should mostly shield us from that17:12
clarkb++17:13
fungicombined with the fact that we tell unbound to forward to the larger bulk resolver services, completely bypassing the resolvers run by our cloud donors17:14
fungiin particular, in the past we saw rackspace running an attack mitigation solution which would block abusers' ip addresses from querying the resolvers they were providing, but the blocking didn't react to those ip addresses getting recycled to other customers, so we found that some test nodes would end up being completely unable to do dns resolution through them17:16
clarkbI left a comment to clarify that log location isn't the same as build location17:19
fungithanks17:19
clarkbanother interesting tidbit is they are hitting this in vexxhost but I don't think vexxhost is currently running any normally sized jobs in opendev17:22
fungiwhat node label?17:22
clarkbthat makes me think the issue is less provider specific and more in their builds themselves17:22
clarkbfungi: oh I mean rdoproject's CI system hits this in vexxhost. But opendev shouldn't run any tripleo jobs in vexxhost currently17:22
fungioh, there's something i didn't check. so the example resolv.conf file which was collected contained a rackspace resolver address, but i simply assumed that meant the job ran in rackspace... what if it was an embedded resolv.conf in the image which was lingering from the image getting built in rackspace, and something kept the rc.local script from running to replace it?17:24
fungidigging back to see if i can find it again17:24
fungiwhoa. get this...17:26
fungijob ran in ovh-gra1 but has "nameserver 213.186.33.99" in the resolv.conf17:26
fungino, nevermind, that's an ovh resolver17:27
clarkbthat is an ovh ya17:27
clarkbI think we still expect that to all get replaced though17:27
clarkbso could be that the underlying issue is not using the local caching resolver and that happens across the providers and they are all just flaky enough to not resolve reliably17:27
fungithe resolv.conf also contains this comment "Generated by NetworkManager"17:27
fungimaybe something is restarting nm and that causes it to overwrite resolv.conf?17:27
clarkbgiven that comment that seems very possible17:28
fungior maybe nm races with rc-local on centos for some reason and it's a dice roll as to which wins17:28
clarkbwe could update our base job to dump /etc/resolv.conf contents early in jobs17:32
clarkbI think we should wait on tripleo's debugging before we do that though as that will affect all jobs17:33
clarkb(slow them down by a few seconds)17:33
fungiyeah, i thought we splatted it with the rest of the host info role17:34
clarkbcurrently tripleo jobs have resolv.conf recorded but only for the end of the job after we've failed17:34
fungibut we don't seem to17:34
fungiclarkb: oh! we do have it in the ansible host info record though17:35
clarkboh neat /me looks17:35
fungiansible_dns: nameservers: - 127.0.0.117:36
*** jpena is now known as jpena|off17:36
clarkbhttps://zuul.opendev.org/t/openstack/build/22500697c5244d31b0687057040cf1af/log/zuul-info/host-info.primary.yaml#187-189 then later we record resolv.conf without a resolver: https://zuul.opendev.org/t/openstack/build/22500697c5244d31b0687057040cf1af/log/logs/undercloud/etc/resolv.conf17:36
fungiso yes, at the start of the job ansible thought the dns resolver list was just 127.0.0.117:36
clarkbso ya almost certainly this is a bug in their jobs not our images17:36
fungiright, seems it had to have changed at some point during job runtime17:37
clarkbif I had to guess why that job had no resolver in it is because it is in rax where we don't do dhcp and NM is doing dhcp and failing17:37
fungiit was correct when the job started17:37
clarkbysandeep|out: rlandy|rover ^ fyi17:37
* rlandy|rover reads back17:38
clarkbrlandy|rover: the tldr is that in a failing job we see 127.0.0.1 as a resolver in the host info file recorded at the start of the job. Then after things fail your job records resolv.conf as having no resolvers. This indicates that it is almost certainly an issue in the build itself updating the config17:39
rlandy|roverclarkb: ok - we'll look into this17:40
clarkbthis would also better explain why it happens across providers and CI systems since it isn't related to the provider or image but instead the build workload (but still no hard confirmation of this just hints towards it)17:40
rlandy|roverthere are a few places we set this17:40
rlandy|roverthanks for looking into it17:40
clarkbrlandy|rover: ok, you might consider not setting it at all. Our CI system attempts to set it correctly for you17:40
rlandy|roverclarkb: yeah - the playbooks are used across various providers and zuul systems17:41
rlandy|roverso it could be we needed to set the value for other system17:41
rlandy|roverin opendev, we should skip that17:41
clarkbeven then it is probably better to update those systems to boot with correct settings (what opendev does) and not manage it in job payload17:41
clarkbunless your jobs are specifically testing DNS having the system set a system appropriate setting is probably the best option17:41
rlandy|roveronly the ipa should do that17:42
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/862152 has been on my backlog too long. I'd like to go ahead and approve it now as I should be able to monitor image builds using our python images this week (until the holiday anyway). Any objection to that ?17:58
clarkbcorvus: ^ fyi this may also impact zuul and nodepool though I did my testing with nodepool in particular and it seemed fine17:59
corvusclarkb: ack, thanks18:00
fungiclarkb: yep, i agree that would be good to get merged18:03
clarkbok approved. Please call out any problems if you see them. I have no issues reverting either if that is easiest18:04
clarkbianw: I'm digging through my backlog of changes and https://review.opendev.org/c/opendev/system-config/+/857239 seems relevant to the bridge work once we start upgrading ansible on bridge. Either we can abandon it by going straight to ansible 6 or we keep it because we use ansible 5.18:11
clarkbinfra-root and for openafs https://review.opendev.org/c/opendev/system-config/+/857520 is something I wrote a while back in response to some logging we had on mirror nodes18:11
opendevreviewMerged opendev/system-config master: Switch python-builder/python-base to pip wheel  https://review.opendev.org/c/opendev/system-config/+/86215218:35
clarkbfungi: were you going to push an update to the mm3 stack to update the urls file or were you just going to modify that locally on the test node you already have held?20:04
fungiyes, i'm just going over the import logs looking for anything else that might need fixing first20:05
fungiand then i'll push it up and do another hold and import test20:05
clarkbinfra-root I've updated the meeting agenda. Please add content or remove it as appropriate.20:08
ianwclarkb: i was actually thinking yesterday we go straight to 6.  we're running the ansible devel job which is working, so there seems to be no reason not to get as close to that as possible20:23
clarkbok, the main thing would be double checking our use of shebang lines in modules, but otherwise I agree shouldn't be an issue20:24
ianwclarkb: dropped a comment on the openafs client flags; maybe a few tests would help 20:24
clarkbianw: ya looks like the gitea-git-repos module at least is broken with the shebang line20:25
clarkbhrm ya that may only work for the initial install as written and need a restart for existing nodes20:26
opendevreviewIan Wienand proposed opendev/system-config master: [wip] Bump bridge ansible to 6.6.0  https://review.opendev.org/c/opendev/system-config/+/86519520:27
ianw^ i think that should trigger everything20:27
ianwclarkb: i think even on initial install the client will be started before it writes that out20:28
clarkbianw: but we don't install the package until after that file is written?20:29
ianwoh sorry yes i agree.  sorry i had it in my head this was in a mirror role20:31
ianwprobably still worth a quick test infra to make sure it applies20:31
ianwspeaking of openafs; https://review.opendev.org/c/opendev/system-config/+/864148 is a quick one to just grab the make logs from dkms in our test builds20:32
ianwmakes it quite a bit easier to debug if the build stops working20:32
clarkbianw: re ansible 6 note some modules will run with the usr/bin/env python lines because they don't depend on any deps outside of stdlib but it is still wrong. I think we should clean those up either way20:32
ianwanother quick one is https://review.opendev.org/c/opendev/system-config/+/864600 which adds some links to the statusbot on the homepage.  that will give us a green tick on the mastodon account20:33
ianwclarkb: ++ i'll go through and add it 20:33
ianwclarkb: re 861284 ... i think it's two lines between classes -- most of the other files don't have the extra space?  just looked weird as i added stuff to the file20:35
ianwit looks like bridge01 /root/.ssh/known_hosts hasn't come back.  i was kind of wondering if any part of what we do might result in it being written out, but i guess everything it needs is in the /etc/ssh/known_hosts now, which is good20:37
ianwi still think https://review.opendev.org/c/opendev/system-config/+/865092 to disable it is good belt-and-suspenders, but i'm not so worried we're doing anything weird now20:38
clarkbianw: I thought it was two lines between any top level file entries. But I agree other files don't do that20:39
opendevreviewClark Boylan proposed opendev/system-config master: Up openafs client -stat value  https://review.opendev.org/c/opendev/system-config/+/85752020:40
ianw^^ not sure what thoughts are on doing that globally.  for example we could add the backup known host keys to /etc too (instead of root user), and pretty much disable it everywhere.  i'd have to think about review/gerrit20:40
clarkbsomething like ^ for the afs test20:40
opendevreviewMerged opendev/system-config master: rax-dns-backup: fix parsing  https://review.opendev.org/c/opendev/system-config/+/86508320:46
clarkbianw: I went ahead and approved the afs dkms logs change since it is straightforward. I did leave a note about being careful about deep log dirs though20:47
fungiout of the full import of all lists from 7 sites, only 5 mailing lists generated complaints about "Unable to convert mailing list attribute: (digest_footer|msg_footer)"20:48
clarkbre 865092 I'm trying to check the assertion it should never ssh to anything else20:50
fungier, i take that back. lots of lists on lists.openinfra.dev did, but not any on lists.starlingx.io for example20:50
clarkbI guess the possibility of that would require we do things outside of the ansible inventory. SSH'ing by hand to random nodes maybe?20:50
clarkbfungi: maybe there is no digest_footer/msg_footer equivalent in mm3 and the old import process just didn't warn us?20:51
fungithat was my first thought, but all the old lists seem to have that set so that can't be why it only complains about some20:52
fungimy next guess is that some of these have outdated replacement macros20:52
fungiso i'm comparing them20:53
fungiunfortunately, comparing one it complained about and one it didn't side-by-side, i don't see any difference in the field content20:56
ianwclarkb: yeah, that's the type of thing we *shouldn't* be keeping a cache of imo (sshing to random nodes via bridge).  it just means if we write anything to make it a less-random node, we have a window to forget we added it21:01
clarkbfungi: does the error say "with bad replacements" or "with value" ?21:02
clarkbfungi: looking at the source code it seems those are the only cases and that should hopefully give a better indication of what is failing21:02
clarkbianw: but it will also force us to accept host keys every time which might be less secure? I guess the fact that it is unlikely to happen means we should not worry about it and deal with it if it happens21:03
ianwtrue i guess, but that OTOH can also be a good indication that "this is not a host under management, so think about that" :)  tomato tomato :)21:07
fungiclarkb: they're all "with value"21:10
fungino occurrences of "bad replacements"21:10
ianwlooks like a couple of hard failures with the ansible 6 upgrade in the gate -- which is good :)  i'll debug them this morning21:11
clarkbif expanded_text and '%' in expanded_text <- do you possibly have % in the footer?21:11
clarkbIt seems to use that as a flag for unreplaced values but maybe if there are actual % signs it would trip the message21:11
fungiyes, the expansion macros are all things like %(real_name)s and %(cgiext)s21:11
clarkbnote in that situation it doesn't seem to continue so would proceed with writing out the file21:11
clarkbthe other case is when decorate_template raises a KeyError which does cause it to continue and not write the file21:12
opendevreviewMerged opendev/system-config master: openafs: copy dkms log directory  https://review.opendev.org/c/opendev/system-config/+/86414821:12
clarkbI think we can narrow this down by checking for output files post migration and checking for actual % in the input21:12
fungithe error messages, unhelpfully, seem to use the post-conversion replacement macro syntax like ${display_name} and ${listname}21:13
clarkbfungi: yes it is outputting the result of the conversion21:13
clarkber wait no, it's a half conversion21:14
clarkbtext = text.replace(oldph, newph) <- it does that before emitting it to you21:14
fungiand just being sure, this is when running the import21 utility that the errors seem to be emitted21:14
clarkbI suspect oldph and newph are things like % and $21:14
clarkbya those strings only show up in the one file in mailman-core21:14
clarkbbut ya I would check to see if it ended up writing a file anyway (in which case % was in the output and it may be a non issue) and if not then there is a key error21:15
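A simplified sketch of the behaviour being discussed: old-style placeholders are replaced one by one, and any '%' left afterwards is treated as a sign of an unconverted macro. This is not the code from mailman's importer.py. The mapping below is an illustrative assumption built from the macro names mentioned in this conversation, and the real importer checks the expanded template rather than the raw text.

```python
# Hypothetical mapping for illustration only; the real importer has its own table.
PLACEHOLDERS = {
    "%(real_name)s": "${display_name}",
}


def convert_footer(text, placeholders=PLACEHOLDERS):
    """Return (converted_text, has_leftover).

    has_leftover is True when a '%' survives the replacements, which is the
    kind of condition that appears to trigger the "Unable to convert mailing
    list attribute" message.
    """
    for oldph, newph in placeholders.items():
        text = text.replace(oldph, newph)
    return text, "%" in text
```

Note that a literal percent sign in a footer trips the same flag as an unknown macro such as %(cgiext)s, which is consistent with clarkb's guess that real '%' characters could produce the message.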
clarkbmailman/src/mailman/utilities/importer.py if cloned from https://gitlab.com/mailman/mailman21:15
fungio21:18
fungii'm not sure where to find the output file21:18
fungialso, i don't recall seeing these errors when importing with mm 3.3.621:19
clarkbdfa72b74beb2e65649b551acd0603227de41acdd this commit from april added the if % in text check21:20
clarkbthat was not in 3.3.5 but was in 3.3.6. However 3.3.6 is from the end of october and I don't recall if we tested on it or not21:20
clarkbI strongly suspect that we've got extra %s in the input text and we're tripping this check and it might be fine because we write the files anyway21:21
fungihttps://gitlab.com/mailman/mailman/-/commit/dfa72b721:22
fungiyeah, just found it myself21:22
clarkbfungi: var/templates/lists/<LIST_ID>/list:member:regular:footer.txt and var/templates/lists/<LIST_ID>/list:member:digest:footer.txt21:22
fungiso yes, i think we went from 3.3.5 to 3.3.721:22
clarkbthose are the types of output dirs21:22
fungi/var/lib/mailman/core/var/templates/lists/ does seem to have a small subset of our total set of lists, but i see some in there which complained and some which did not21:24
clarkbfungi: it's possible that some tripped the keyerror and some tripped this issue. If you open those files that do exist it should hopefully be clearer why those failed to convert21:25
fungiindeed, /var/lib/mailman/core/var/templates/lists/airship-announce.lists.airshipit.org/en/list:member:regular:footer.txt has some old % macros still unconverted in it21:25
fungi/var/lib/mailman/core/var/templates/lists/zuul-announce.lists.zuul-ci.org/en/list\:member\:regular\:footer.txt exists but is entirely empty21:26
fungiso i'll try to dissect an example21:27
clarkbhrm I didn't expect any completely empty file since it seems to only open and write the file if it doesn't continue in the loop21:27
clarkband the only case where it doesn't do that with that error is when % is in the text and that would result in non empty output21:28
fungiso this is the mm2 version of the field for the airship-announce ml: https://paste.opendev.org/show/bS0VoH2UztFt2p3EI9S3/21:29
fungiand this is what ended up in the template file on the mm3 host: https://paste.opendev.org/show/bprYIACe00uV8Ueqhv4j/21:30
fungithis is what the error logged during conversion looked like: https://paste.opendev.org/show/bq3SsusO00dvqyKsA6JN/21:31
fungihttps://gitlab.com/mailman/mailman/-/blob/master/src/mailman/utilities/importer.py#L49521:32
fungithat should be replacing the problem line with an empty string21:32
fungii wonder if some of these are lacking a trailing \n21:33
clarkbfungi: your example input didn't have the newline at the end. I bet that is it21:35
clarkbAnd previously it didn't complain but now does since 3.3.621:35
fungithough that doesn't appear to be it, because it didn't error about openstack-announce and its msg_footer is identical21:38
fungioddly, there is no /var/lib/mailman/core/var/templates/lists/openstack-announce at all though21:38
fungimaybe it only writes templates there under certain conditions?21:39
fungiand for whatever reason it didn't try to parse the msg_footer for openstack-announce?21:39
fungiwow, you can't search gitlab.com issues and merge requests without having an account21:43
Clark[m]They could be separate issues. No newline in this case. Something else for the others21:45
fungiwell, my point is that no errors were emitted when importing the openstack-announce ml, and its msg_footer field is identical to that of airship-discuss which generated these errors21:47
fungibut the more i look into this, the more i think we should just blow away the offending fields. if we had done the import with 3.3.5 they wouldn't have been copied into postorius at all21:48
fungi3.3.6 and later try to import them, but then break when they can't21:48
fungifiddling with these configs in bulk is a pain because they're stored on disk as python pickle files21:53
fungii may be better off just updating them all with the webui to remove that url line21:53
fungiin the meantime, i'll get the other fix pushed up and an autohold set so we can have a new one ready for another test import21:53
opendevreviewJeremy Stanley proposed opendev/system-config master: Fork the maxking/docker-mailman images  https://review.opendev.org/c/opendev/system-config/+/86015721:55
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM force mm3 failure to hold the node  https://review.opendev.org/c/opendev/system-config/+/85529221:55
*** dasm is now known as dasm|off22:01
clarkbfungi: ya maybe we should reset all footers to the mm3 default22:03
*** rlandy|rover is now known as rlandy|rover|biab22:07
fungiwell, some of our lists intentionally blank out the footers, so i don't want to interfere with that choice22:09
clarkbfungi: fwiw I think that is why openstack-announce didn't get a file. It produced a default footer which caused the importer to skip it22:21
clarkbwhich would make me think that maybe the two inputs are different in some subtle way22:21
fungiwell, openstack-announce had that line in its msg_footer but didn't generate an error (or even a template)22:30
fungiopenstack-discuss has an empty string for its msg_footer, which generated an empty template file22:31
fungianyway, in order not to burn too much more time on it, i'm just removing that url line from any list footers which have it, since that's what the importer wants to do anyway22:32
clarkbfungi: right if the resulting output of the conversion is the same as the mm3 default it doesn't write a template at all22:33
clarkbI'm suggesting that it seems likely that is what happened to openstack-announce and explains the lack of an mm3 file22:34
clarkbit also suggests there is some difference between that list and the one that generated the error22:34
clarkbthe empty to empty conversion also makes sense since the default is non empty I guess22:34
fungiright, maybe it sets something else which would have caused it to need a template22:34
fungiproblem is, how does it determine that the resulting output would be the same as the default if it can't parse the input?22:35
clarkbsince the replacement does seem to explicitly match on a newline after that line though I strongly suspect that is at least part of the problem for the one with the error22:35
clarkbfungi: it doesn't in that case.22:35
fungigetUtility(ITemplateLoader).get(newvar, mlist)22:36
fungii guess that's what causes it to be skipped22:36
clarkbhttps://gitlab.com/mailman/mailman/-/blob/master/src/mailman/utilities/importer.py#L551-554 that checks if the result is the same as the default and it continues otherwise22:37
clarkber continues if it is the same which doesn't allow the file to be written further down in that function22:37
fungier, nevermind, i was looking at the wrong loop22:37
opendevreviewMartin Kopec proposed opendev/irc-meetings master: Update Interop meeting details  https://review.opendev.org/c/opendev/irc-meetings/+/86520122:37
clarkbbut that strongly implies to me that there is a difference between the two list footers. Or maybe with the available text replacements22:37
fungiyeah, the odd thing is that the mm2 tool for parsing and displaying the pickle contents doesn't show a trailing newline on any of them. but when i pull some of them up in the webui they have a trailing newline, while others don't22:39
fungiso maybe it's the config_list util in mm2 not being entirely truthful and stripping strings before outputting them22:40
fungione tool thinking newlines are irrelevant and not showing them to you, another tool insisting on their presence22:40
fungihooray for consistency22:41
clarkbit might be a valid update to the importer to check for newline or EOF22:41
clarkbI wonder why only that replacement gets the newline suffix22:41
clarkbmight just be a bug22:42
fungiyes, i expect it is, but i wouldn't want to hold up our migration efforts waiting for that to merge, nor necessarily dirty-patch the import21 utility in our deployments in the meantime22:42
clarkbya I agree, it's minor enough especially if we can just add a newline to the source data and have it be happy22:44
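[Editor's sketch, not part of the conversation.] The "check for newline or EOF" idea could look something like the following. It is an assumption-laden illustration, not a patch against the importer: strip_url_line is a hypothetical helper, and the only detail taken from the discussion is the footer line itself and the suspicion that the existing replacement requires a trailing newline.

```python
import re

# The Mailman 2 footer line discussed in the channel.
URL_LINE = "%(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s"

# Accept the line whether it is followed by a newline or ends the text.
_URL_LINE_RE = re.compile(re.escape(URL_LINE) + r"(?:\n|\Z)")


def strip_url_line(footer):
    """Remove the listinfo URL line, with or without a trailing newline."""
    return _URL_LINE_RE.sub("", footer)


if __name__ == "__main__":
    no_newline = "footer text\n" + URL_LINE
    # A plain replace that insists on the newline leaves this input untouched.
    print(no_newline.replace(URL_LINE + "\n", "") == no_newline)
    # The tolerant version removes the line in both cases.
    print(repr(strip_url_line(no_newline)))
    print(repr(strip_url_line(no_newline + "\n")))
```

Removing the line from the source data by hand, or appending a newline to it, sidesteps the problem without touching the importer at all.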
*** dviroel is now known as dviroel|afk22:45
opendevreviewIan Wienand proposed opendev/system-config master: [wip] Bump bridge ansible to 6.6.0  https://review.opendev.org/c/opendev/system-config/+/86519522:45
opendevreviewIan Wienand proposed opendev/system-config master: borg-backup-server: build borg users betterer  https://review.opendev.org/c/opendev/system-config/+/86520222:45
opendevreviewIan Wienand proposed opendev/system-config master: letsencrypt-install-txt-record: build txt record list betterer  https://review.opendev.org/c/opendev/system-config/+/86520322:45
fungiwell, i'm just removing the "%(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s" line from all the msg_footer and digest_footer fields22:47
fungisame end result22:47
clarkbis it?22:48
clarkboh! because the line gets replaced with empty string22:48
clarkbok ya22:48
fungiyep22:49
fungithe only reason the importer tries to match that line at all is so it can remove it22:49
fungiso i'm just saving it the trouble22:50
opendevreviewIan Wienand proposed opendev/system-config master: borg-backup-server: build borg users betterer  https://review.opendev.org/c/opendev/system-config/+/86520223:11
opendevreviewIan Wienand proposed opendev/system-config master: letsencrypt-install-txt-record: build txt record list betterer  https://review.opendev.org/c/opendev/system-config/+/86520323:11
opendevreviewIan Wienand proposed opendev/system-config master: [wip] Bump bridge ansible to 6.6.0  https://review.opendev.org/c/opendev/system-config/+/86519523:11
opendevreviewIan Wienand proposed opendev/system-config master: system-config-run-gitea: use standard bridge host  https://review.opendev.org/c/opendev/system-config/+/86520423:11
opendevreviewMerged opendev/system-config master: launch-node : make into a small package  https://review.opendev.org/c/opendev/system-config/+/86128423:19
fungiokay, i have them all cleaned up23:27
fungiand 104.130.140.226 is the latest held node23:31
fungiit's in rax though, so i'll need to move /var/lib/mailman to the ephemeral disk23:32
clarkbfungi: you cleaned them up in prod? or in your test copy?23:34
fungiin prod, so we won't have this issue on future actual imports23:35
fungisimilar to some of the other fields we ran into which were too large to import into the database23:35
fungii'm shifting gears to the rsync from prod to the test node now23:36
fungijust need to make sure there's enough space for all the production data23:36
fungiokay, /var/lib/mailman is now the 80gb ephemeral disk23:39
fungion the held node23:39
fungiand rsync from prod servers to the held node is now in progress23:43
clarkbfungi: assuming that the urls.py update makes things happier did you want to do a PR for maxking or should I work on that once confirmed?23:53
fungii can probably find time for it, and we should also likely do a mr on gitlab for the importer23:56
