Friday, 2023-08-25

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/89272603:06
opendevreviewMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/89272611:43
*** TheJulia is now known as needs_brains_and_sleep13:04
*** needs_brains_and_sleep is now known as TheJulia13:23
fricklerinfra-root: kolla just saw four jobs failing in gate in parallel https://zuul.opendev.org/t/openstack/buildset/9a2af787cb62474c88536484142d607d , three of them ran on rax-ord. I'm kind of EODing, maybe you can have a look? mnasiadka might still be around a bit to answer possible kolla related questions14:02
fungilooking14:12
fungiduring `kolla-ansible -i /etc/kolla/inventory -vvv bootstrap-servers` the ssh connection to the node was prematurely closed, in all 4 cases, looks like?14:13
fungisince one of those four happened in ovh, i think we can rule out a provider-specific problem impacting the nodes themselves14:16
fungithough it's possible something provider side is impacting the executors14:16
Clark[m]Was it zuul's connection or Kolla ansible's connection? The latter is comms all within the same cloud14:19
fungiwith all four builds, the connection error occurred within 1-2 minutes of starting to run the setup_gate.sh script14:22
fungiprimary | ok: 19 changed: 15 unreachable: 0 failed: 1 skipped: 14 rescued: 0 ignored: 014:23
fungii think that's from the nested ansible?14:23
fungiit's possible that stderr line about the shared connection being closed is not related to the cause. ultimately the task failed because the setup script exited 214:30
fungiokay, here we go: https://zuul.opendev.org/t/openstack/build/72774530166546c1a96284f3312e0e36/log/primary/logs/ansible/bootstrap-servers#74914:31
fungihttps://zuul.opendev.org/t/openstack/build/f7e6f2677ed9440db9f8a3c1b04f1868/log/primary/logs/ansible/bootstrap-servers#69914:32
fungifailing in different spots in that script14:33
fungi"Failed to download metadata for repo 'docker': Yum repo downloading error: Downloading error(s): repodata/8e89c445039a4ff75bb98ab62bee6b6ae7c4c8ae853a61cab75de5e30c39d0bf-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/abe464de7c144654302f1b3b46042d88f1d6550b46527f15a2cef794091f2b3c-filelists.xml.gz - Cannot download, all mirrors were already tried14:33
fungiwithout success"14:33
fungi"E:Failed to fetch https://download.docker.com/linux/debian/dists/bookworm/stable/binary-amd64/Packages.bz2  File has unexpected size (11933 != 12572). Mirror sync in progress?"14:36
frickler 14:37
fungiseems like different mirror problems (for different distros, in different parts of the world), but maybe there is some relationship14:37
fungiafs fileservers and database servers don't have anything new in dmesg for the past two months, so they don't seem to think they're in distress at least14:43
fungioh! i should have paid closer attention. those are direct download errors for things hosted on download.docker.com, nothing to do with our mirrors at all i don't think?14:44
fungii should have looked closer14:45
fungiso my best guess is that docker.com is blowing up their package repositories14:45
fungimnasiadka: ^ let me know if that explanation doesn't match your observations14:45
frickleroops, looks like that ssh session wasn't as dead as it looked, sorry15:09
fricklerfungi: thanks for debugging, seems I was misled by the non-fatal warnings about our mirror host in the same block15:10
fungifrickler: some of the confusion is probably due to the fact that zuul is now separating display of stdout and stderr streams, which made a normal connection close jump out as if it were the cause15:11
fungisimply because it was the only thing that task sent to the stderr stream15:12
fungiit misled me at first too15:12
fricklerI thought that that feature wasn't enabled yet?15:13
fricklerkolla explicitly redirects logs for those tasks I think15:13
fungiright, i think the connection closed was coming from the task run by the executor, looking at how it's split up. but maybe i'm misinterpreting that15:15
Clark[m]Stdout and stderr shouldn't be split yet. And when the first change happens it will be opt in15:20
fungithat "Shared connection to 23.253.56.132 closed." et cetera is clearly called out as being from stderr though15:27
fungihttps://zuul.opendev.org/t/openstack/build/72774530166546c1a96284f3312e0e36/console15:27
fungihas separate boxes for stderr and stdout15:27
fungithey're separate in the ansible json15:28
fungiand so also in the task summary15:28
Clark[m]It's not a command or shell task. I think it is a https://docs.ansible.com/ansible/latest/collections/ansible/builtin/script_module.html task15:33
Clark[m]Only command and shell have the combined outputs15:33
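For illustration, a minimal sketch of the distinction Clark is pointing at (hypothetical tasks, not kolla's actual playbook): per the discussion, Zuul shows stdout and stderr separately for an ansible.builtin.script task, so the routine SSH "Shared connection ... closed." teardown notice stands alone in the stderr box, while shell and command output is displayed combined.

    # Hypothetical playbook excerpt; not the real kolla job tasks.
    - hosts: primary
      tasks:
        # ansible.builtin.script copies a local script to the node and runs
        # it; stdout and stderr come back as separate result fields, so
        # Zuul's console renders them in separate boxes and the SSH
        # "Shared connection ... closed." notice appears alone under stderr.
        - name: Run gate setup via the script module
          ansible.builtin.script: setup_gate.sh

        # By contrast, shell (and command) task output is shown combined,
        # per the discussion above.
        - name: Equivalent shell invocation for comparison
          ansible.builtin.shell: ./setup_gate.sh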
fungioh, i see15:35
clarkbfungi: if you're still ready for mailman3 I think we can send it in15:41
fungiyeah, i'm braced and ready for impact15:42
clarkbshould I +A or do you want to do it?15:43
fungigo for it15:44
fricklerthe "Shared connection ... closed." is a red herring, you can also see it for that task when it (and the whole job) is passing https://zuul.opendev.org/t/openstack/build/89e86e76b0df4af1a19511f98a4fb323/console#2/1/28/primary15:44
clarkbdone15:44
fungithanks!15:47
fungifrickler: yeah, i realized that was the case after i went hunting for the bootstrap-servers log and saw the errors in it15:48
fungithe upload image job has been sitting queued for a while even though we've got tons of available capacity. i wonder if we're running into significant boot failures again16:08
clarkbit is running now16:13
clarkb~5 minutes isn't abnormal for a node boot in some clouds. However I think it ended up being longer than that16:13
fungiyeah, it was close to 30 minutes wait for the node after the registry job paused16:23
fungiit's wrapping up now16:26
opendevreviewMerged opendev/system-config master: Upgrade to latest Mailman 3 releases  https://review.opendev.org/c/opendev/system-config/+/86921016:27
clarkblooks like it is promoting the image but not triggering the lists3 job16:29
clarkbya that job doesn't trigger on mailman3 docker updates. Only on the playbook side16:30
clarkbfungi: you could add the docker paths to the files list for the infra-prod-service-lists3 job and add a rebuild trigger comment to one or several of the images to have it run through again16:30
clarkbor we could manually trigger the infra-prod-service-lists3 playbook on bridge instead16:30
fungisure, on it16:30
clarkbprobably better long term to have docker image builds trigger the job though16:31
clarkbthinking back to when we built this out I think it wasn't clear if upgrades needed intervention like gerrit or not so we left that out16:34
clarkbthe current indication is that this should typically be automated so making it happen automatically makes sense to me. But if you think that isn't the case we can leave it as is and trigger the playbook manually16:34
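As a rough sketch, the fix clarkb describes amounts to widening the job's files matcher in the Zuul config (the path regexes here are illustrative; the exact ones in change 892807 may differ):

    # Hypothetical job stanza; the real definition lives in system-config's
    # zuul.d/ and already carries other attributes not shown here.
    - job:
        name: infra-prod-service-lists3
        files:
          - playbooks/roles/mailman3/.*
          # added so mailman container image changes also trigger the deploy:
          - docker/mailman/.*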
opendevreviewJeremy Stanley proposed opendev/system-config master: Trigger mm3 deployment when containers change  https://review.opendev.org/c/opendev/system-config/+/89280716:36
fungiclarkb: like that ^?16:36
clarkbyup looks similar to etherpad for example16:37
clarkbI've approved it16:37
fungithanks16:37
fungionce again system-config-build-image-mailman is taking a surprisingly long time to get a single ubuntu-jammy node assigned16:48
fungithere it finally goes16:48
fungithat time was about 10 minutes after opendev-buildset-registry paused16:49
funginot as bad at least16:49
fungijust surprising when we have so much available capacity at the moment16:49
fungithis time it got a node the instant the registry paused17:09
fungierror node and time to ready spikes on https://grafana.opendev.org/d/6c807ed8fd/nodepool suggest we may have some intermittent issues17:13
fungii think the errors are predominantly in ovh regions: https://grafana.opendev.org/d/2b4dba9e25/nodepool%3a-ovh17:14
clarkbshould merge soon17:22
fungiyup17:24
opendevreviewMerged opendev/system-config master: Trigger mm3 deployment when containers change  https://review.opendev.org/c/opendev/system-config/+/89280717:24
clarkbI just cleaned up my etherpad and gitea autoholds in zuul17:25
fungiinfra-prod-service-lists3 is waiting in deploy this time, so looks like it worked17:25
fungii'll clean up the mm3 held node once we're sure the prod upgrade is good, just in case we need it for a comparison or something17:26
fungideployment is now in progress17:26
fungicontainers are restarting17:27
clarkbya I left those other two around for a bit for the same reason17:28
fungihttps://lists.opendev.org/ is up and still seems to be working17:28
fungi"Postorius Version 1.3.8" on https://lists.opendev.org/mailman3/lists/17:29
clarkband hyperkitty 1.3.7 on archive pages17:29
fungi"HyperKitty version 1.3.7" on https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/ yep17:29
clarkbI can read emails in the archive. vhosting still seems good.17:30
clarkbMain thing we're missing is an email going through17:30
fungidocker/mailman/web/requirements.txt contains postorius==1.3.8 and hyperkitty==1.3.717:31
fungii'm planning to send something to service-discuss next17:31
fungii've received my copy17:35
fungihttps://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EUE5GZNFTH22QAG5D2BMF3R56IEAXE4R/17:36
clarkbI got it too17:36
fungii think we're set17:36
clarkb++17:36
fungii'm going to head out to a late lunch pretty soon, but will check back in once i get back17:36
clarkbenjoy. I'm going to try and sneak a bike ride in before it gets super hot17:36
clarkbwe had thunderstorms overnight so temperatures never really dropped17:36
clarkbwas warm and humid.17:37
fungihot and muggy. sounds like here17:37
fungii mostly just want to get outside to escape the paint fumes17:37
funginow that they're done for the day17:37
clarkbif you breathe deeply it is its own form of escape17:37
fungitouché17:37
fungiseems the universe didn't implode while i was out at the bar. good19:46
Clark[m]I didn't expect fireworks. I'm eating lunch but wanted to follow up on whether or not we can unfork that config file now. Then maybe approve a bookworm container update or two20:01
fungioh, yep sounds good20:08
fungiClark[m]: which config file specifically were you thinking we can un-fork?20:13
Clark[m]fungi: https://review.opendev.org/c/opendev/system-config/+/869210/8/docker/mailman/web/mailman-web/settings.py that one maybe20:22
fungii'll double check it against upstream20:23
Clark[m]https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mailman3/files/web-settings.py is the forked version20:23
Clark[m]the one in the change is/should be in sync with upstream. Then in our role we bind mount over it20:23
fungiwe do force SITE_ID = 0 in it intentionally20:25
clarkbya so there are a couple of extra things that we would need to address upstream first20:27
clarkbin that case we can't unfork and that's fine. This has worked well enough so far20:27
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/892702 is a low risk bookworm update20:28
fungiwe can reset ours to what's in maxking/docker-mailman except for the SITE_ID override, i think20:28
fungiyes, i agree. the rest seem like basically no-op changes20:31
fungiat least in our case20:31
clarkbI think the gethostbyname("mailman-web") hardcoded in the list of hosts is still a problem too20:35
clarkbwe use host networking so that name doesn't end up in magical dns for us20:36
fungiah, so we do still need the custom 127.0.0.1 entry?20:39
opendevreviewJeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py  https://review.opendev.org/c/opendev/system-config/+/89281720:42
clarkbfungi: I thought 127.0.0.1 came from upstream and I put the differences under my comment. We may not need it since localhost is there20:42
fungithe 127.0.0.1 had a comment explicitly claiming to be an opendev edit20:43
clarkbah it did20:43
clarkbfungi: note the mailman-web stuff needs to be commented out so I don't think ^ will work20:43
fungii just tried to clarify the comment a bit20:43
clarkbit will fail on mailman web startup trying to resolve that name20:43
fungiah, i'll comment it out again then20:44
clarkbboth the lines20:44
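Putting the thread together, a rough Python sketch of the fork points being discussed in opendev's copy of settings.py (lines are illustrative, not the exact file contents; the vhosting rationale for SITE_ID is an inference from the surrounding conversation):

    import os

    # opendev edit: SITE_ID = 0 is forced intentionally (per fungi above),
    # presumably so Django resolves the site per request host (vhosting).
    SITE_ID = 0

    ALLOWED_HOSTS = [
        "localhost",
        "127.0.0.1",  # opendev edit, possibly redundant with "localhost"
        # upstream's container-name entries are commented out in the fork:
        # opendev uses host networking, so "mailman-web" never lands in
        # container DNS and gethostbyname() would fail at web startup.
        # "mailman-web",
        # socket.gethostbyname("mailman-web"),
    ]

    # upstream now splits DJANGO_ALLOWED_HOSTS on ","; see the
    # docker-compose template discussion below for converting opendev's
    # ":"-separated mm_domains value before it reaches this point.
    ALLOWED_HOSTS.extend(os.environ.get("DJANGO_ALLOWED_HOSTS", "").split(","))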
opendevreviewJeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py  https://review.opendev.org/c/opendev/system-config/+/89281720:45
fungiyep20:45
fungididn't know if it would just get ignored when lookups failed20:46
clarkbfungi: oh you need to edit our DJANGO_ALLOWED_HOSTS value too20:49
clarkbI didn't notice upstream changed the separator20:49
fungithe new upstream separator won't work for us?20:50
clarkbfungi: our values are separated by : not , so the split won't split things in a meaningful way for us20:50
clarkbcurrently we reuse the exim mm_domains variable and exim wants : iirc20:50
clarkbbut we can define a new value or convert it in ansible before writing it out into the config for docker-compose20:50
fungioh, so we can't just change to ,20:50
clarkbcorrect20:51
clarkbwe need something slightly smarter. But still doable20:51
fungiwe still need the conditional there too?20:51
fungiand use the ansible var instead of the envvar?20:52
clarkbfungi: we don't need the condition. That was there to make the change more likely to be upstreamable. But they condensed it down in a safe way for us (even if they didn't it would work because we always set the value)20:52
clarkbI wouldn't use the ansible var. I would keep using the envvar to stay in sync with upstream there20:53
clarkbwhat we need to change is where we set the env var value20:53
clarkbwhich is set in playbooks/roles/mailman3/templates/docker-compose.yaml.j220:53
opendevreviewJeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py  https://review.opendev.org/c/opendev/system-config/+/89281720:53
clarkband we can do something like mm_domains | split(:) | join (,)20:53
clarkbnot valid ansible20:53
fungimmm, okay. but basically the only thing we could un-fork was to update the TIME_ZONE assignment?20:54
clarkbwe can unfork the DJANGO_ALLOWED_HOSTS code too. We just have to change how we set the DJANGO_ALLOWED_HOSTS value in docker-compose.yaml20:54
fungiseems like we're not really un-forking the config, though it was a good exercise to confirm basically all the differences there were needed20:54
clarkbbasically upstream splits on , we split on : so we can change the input to the split and unfork that way20:55
fungiah, i guess the docker-compose is a jinja2 template so we can manipulate values there20:57
clarkbyup exactly. I think doing that is worthwhile if we can pretty easily use jinja filters to change the separator value20:57
clarkbthat way it's less divergence from upstream in the settings file20:57
opendevreviewJeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py  https://review.opendev.org/c/opendev/system-config/+/89281721:02
fungiso like that?21:02
clarkbin DJANGO_ALLOWED_HOSTS={{ mm_domains.split(':') | join(,) }} I don't think mm_domains.split() is valid. You need to use | filter() syntax?21:08
clarkbbut yes from a pseudo code perspective21:08
clarkbhuh google says I'm wrong21:12
clarkbI guess it isn't clear to me which things are functions and which things are filters then21:12
clarkbfungi: you might need quotes around the , ? otherwise I guess that may work21:13
fungioh, i thought i did, sorry. will fix21:14
opendevreviewJeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py  https://review.opendev.org/c/opendev/system-config/+/89281721:15
fungii stole the foo.split() invocation from other templates we have in system-config, fwiw21:15
clarkbya so some string methods exist as directly invocable?21:16
clarkbjinja is weird21:16
clarkbI think that will work. CI should confirm21:16
fungiplaybooks/roles/base/exim/templates/exim4.conf.j2 roles/set-hostname/templates/hosts.j2 roles/set-hostname/templates/mailname.j221:17
fungiwere the examples i found of split() methods in templates21:17
fungicargo cult ftw21:18
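For reference, the resulting entry in playbooks/roles/mailman3/templates/docker-compose.yaml.j2 would look something like this (surrounding keys are illustrative; only the DJANGO_ALLOWED_HOSTS line is the point):

    # Sketch of the templated env entry: mm_domains is the ":"-separated
    # variable shared with exim, converted here to the ","-separated form
    # that upstream's settings.py now splits on. Jinja2 exposes Python's
    # str.split() as a method, and join() as a filter over the result.
    services:
      mailman-web:
        environment:
          - "DJANGO_ALLOWED_HOSTS={{ mm_domains.split(':') | join(',') }}"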
opendevreviewMerged opendev/system-config master: Update zookeeper-statsd image to bookworm  https://review.opendev.org/c/opendev/system-config/+/89270221:21
clarkbthe mm3 change passed testing. I was hoping we recorded the docker-compose.yaml file but it seems we don't22:02
clarkbit's probably fine22:02
clarkblooks like zookeeper-statsd won't update until our daily run later. Not a big deal, it's low impact if anything goes wrong (we lose zk stats until we fix it)22:03
