Wednesday, 2023-10-25

opendevreviewDmitriy Rabotyagov proposed zuul/zuul-jobs master: Add role for uploading Ansible collections to Galaxy  https://review.opendev.org/c/zuul/zuul-jobs/+/89923008:12
pcheliHello, I'm setting up ThirdParty CI with jenkins and gerrit-trigger plugin. 08:45
pcheliGenerally, it works. However, results posting fails with Too many concurrent connections (96) - max. allowed: 96.08:45
pcheliCan anybody help with this?08:45
zigoI'm really not sure what to do to re-trigger the puppet-nova release job and get the release notes in order ... can someone help?08:48
zigohttps://docs.openstack.org/releasenotes/puppet-heat/2023.2.html <--- 404 as well ...08:57
fricklerzigo: did you check the release jobs as clarkb suggested earlier?09:24
fricklerpcheli: seems you need to limit the number of connections your setup uses, no idea how to help with that. also please don't ask the same question in multiple channels if possible09:25
zigofrickler: I'm really not sure how to do this ... :/09:30
zigoAltogether, we have puppet-{heat,nova,octavia} that have broken release notes.09:31
tkajinamfrickler clarkb zigo, hmm it's strange that the promote job succeeded without any error after https://review.opendev.org/c/openstack/puppet-nova/+/898384 was merged10:45
zigoAh, thanks for looking into it! :)10:46
zigoI had the same thinking and didn't get it too...10:46
tkajinamI subscribe to the release-job-failures list but I've not seen any failures about these puppet repos, either10:47
tkajinam(I mean release-job-failures@lists.openstack.org)10:47
tkajinamhttps://zuul.opendev.org/t/openstack/builds?job_name=publish-openstack-releasenotes-python3&project=openstack%2Fpuppet-nova&skip=010:48
tkajinamit looks like we have to trigger the job to build release notes based on the latest master content to reflect the change in the index made by that 898384 but I don't clearly understand why it hasn't been done10:50
tkajinamsorry I have be disconnected for a while, but I'll check the status later (or tomorrow)10:52
tkajinamI have to be *10:52
fricklerseems the above publish job ran at the same time as the job for the update of the 2023.2 branch, which did not have the 2023.2 reno update yet https://review.opendev.org/c/openstack/puppet-nova/+/89838311:24
fricklerso that may have overwritten the content from the master patch. I'm not sure whether we can simply reenqueue the promote job, another - maybe safer - solution would be to commit any new update on the release notes, like just a typo or formatting fix, which should cause the whole site to be republished in the correct form11:26
fungiyes, the problem with those release notes jobs for different branches sharing the same file tree is that changes for different branches can race one another and publish content out of sequence compared to the order in which they were built/merged12:03
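The race fungi describes can be illustrated with a toy model. This is only a sketch: the `publish_all` helper, the `merged` ordering field, and the file contents are hypothetical and do not reflect how the actual publish jobs are implemented. It assumes only that builds for different branches write into one shared tree and that the last writer wins.

```python
def publish_all(builds):
    """Apply builds to one shared tree in the given order; last writer wins."""
    site = {}
    for build in builds:
        site.update(build["files"])
    return site


# Hypothetical builds: the master change merged after the stable-branch one.
stable = {"merged": 1, "files": {"index.html": "index without 2023.2"}}
master = {"merged": 2, "files": {"index.html": "index with 2023.2"}}

# Publishing in merge order leaves the newest index in place.
in_order = publish_all(sorted([master, stable], key=lambda b: b["merged"]))

# If the stable build happens to publish last, it overwrites the newer index.
raced = publish_all([master, stable])
```

In this model, republishing from the latest master content (for example after a trivial release notes fix, as frickler suggests) restores the expected index.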
*** d34dh0r5- is now known as d34dh0r5312:20
*** Guest4496 is now known as diablo_rojo13:09
clarkbpcheli: I would use netstat/ss/lsof to determine how many connections you've got to gerrit from the Jenkins host. If it is a high number (near 96) then you'll need to debug the Jenkins server. If it is much smaller and you are traversing NAT then you may need to identify other sources of connections.13:57
clarkbpcheli: however, I suspect they will be from the Jenkins server because the 96 connections limit is per username iirc and not per IP. We have a separate slightly higher limit for IPs13:57
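A minimal sketch of the client-side check clarkb suggests, assuming the text comes from `ss -tn` with its usual column layout (State, Recv-Q, Send-Q, local address, peer address). The function name `count_established` is hypothetical; port 29418 is the Gerrit SSH port that appears in the netstat line quoted in this log.

```python
GERRIT_SSH_PORT = 29418  # port seen in the netstat output quoted in this log


def count_established(ss_output: str, port: int = GERRIT_SSH_PORT) -> int:
    """Count ESTAB rows in `ss -tn` output whose peer address uses `port`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if (
            len(fields) >= 5
            and fields[0] == "ESTAB"
            and fields[4].endswith(f":{port}")
        ):
            count += 1
    return count
```

Feeding it the output of `ss -tn` captured on the Jenkins host gives the number of connections the client believes are open. A small number there, combined with a high count on the server, points at connections that were dropped without being closed cleanly.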
fungiyes, also it's likely you have a bug with something not correctly closing ssh sessions13:58
pcheliclarkb: I've found only one connection. tcp6       0      0 xxxx:34254    199.204.45.33:29418     ESTABLISHED 9210/java13:58
pchelithat's why I'm asking :)13:59
fungi96 open connections to gerrit's ssh api is unlikely to represent normal behavior13:59
fungipcheli: is it possible you have a firewall in front of your jenkins server that is uncleanly dropping "idle" ssh connections? if it doesn't cleanly terminate the connection by sending a tcp/rst or fin on behalf of the client, then the gerrit server will assume those old connections are still open14:00
fungiwe can manually close them, but they'll just pile back up again if the problem isn't addressed14:00
clarkbhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/templates/gerrit.config.j2#L56 this is where the limit comes from and it is configured by user account14:01
clarkb(just to be sure the 96 limit wasn't our IP limit)14:01
fungiyeah, the connections per ip address limit we set with conntrack in iptables is 100, and if you hit that you'll start getting icmp port-unreachable errors rather than error messages from the api itself14:04
pcheliI've found the same issue in mailing list resolved by Clark Boylan by killing stale connections. May I ask you to do the same?14:04
fungilike i said, doing that may temporarily stop the errors, but unless you know what caused you to end up with so many unclosed connections (like a poorly-configured firewall, for example) then it will start happening again at some point14:05
pchelifungi: I've updated gerrit trigger plugin. Hopefully, it will resolve the issue.14:16
fungipcheli: if it has an ssh keepalive option, or dead peer detection feature, make sure those are turned on14:17
pcheliHm, nothing like this. 14:19
fungilooks like the only account with 96 established ssh sessions is a/3374614:21
pcheliyep, this is mine14:23
fungii've got a loop going telling gerrit to close all those now14:23
fungithis will take a few minutes to complete14:24
fungi#status log Manually closed 96 stale SSH connections to Gerrit for account 3374614:25
opendevstatusfungi: finished logging14:25
fungipcheli: there's just 1 established session for that account now14:25
pchelifungi: can you check again pls?14:27
pchelijust to be sure that everything is fine14:27
fungipcheli: still only 1 session for that account at the moment14:27
pcheliGreat14:28
pcheli#thanks fungi14:28
fungii'll check again later in the day and see if the count starts to climb14:28
opendevstatuspcheli: Added your thanks to Thanks page (https://wiki.openstack.org/wiki/Thanks)14:28
opendevreviewMerged opendev/system-config master: Stop building python3.9 container images  https://review.opendev.org/c/opendev/system-config/+/89848014:52
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/898989 is ready for review and there is a link in the comments to a held test node where you can see that the conversion appears to be working in comments of the linked change14:52
clarkbfungi: and I've marked the secondary email lookup thing in gerrit as a non issue as the tools only use primary emails14:53
clarkbfungi: if you are back today https://review.opendev.org/c/opendev/system-config/+/898505 might be a good one to try and get in. I've intentionally been waiting until more people are around so will defer on others' availability15:15
fungiyep, i'm around enough today, parents are headed home but i have a repair tech coming to try to fix my washing machine15:20
opendevreviewJeremy Stanley proposed opendev/system-config master: Add OpenInfra EU mailing lists  https://review.opendev.org/c/opendev/system-config/+/89884615:33
clarkbfungi: for the ansible 8 change do you want to review it?15:52
clarkbfungi: there are notes about the testing done in comments there as well15:52
fungiclarkb: yep, i just approved it15:53
fungihoping it will also fix infra-prod-run-cloud-launcher15:53
clarkbcool so the thing to check is that the virtualenv updates properly (it should)15:53
clarkband then monitor jobs15:53
fungiyep15:53
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Cap ruamel.yaml install for ARA"  https://review.opendev.org/c/opendev/system-config/+/89928316:05
clarkbtesting if that cap is no longer necessary after some updates were made to ruamel.yaml16:05
fungioh, did they roll some stuff back or fix regressions?16:09
clarkbfungi: they replaced a sys.exit() call with an exception throw16:10
clarkbapparently they were hard crashing things previously by exiting 1 in the library...16:11
fungiouch16:12
fungiyeah, sys.exit() is really never appropriate in a library16:13
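A short sketch of why this matters, using hypothetical functions rather than ruamel.yaml's real code: `sys.exit()` raises `SystemExit`, which derives from `BaseException`, so an application's ordinary `except Exception` handler does not catch it and the whole process stops. Raising a normal exception leaves the decision to the caller.

```python
import sys


class ParseError(Exception):
    """Hypothetical library error; not ruamel.yaml's real exception type."""


def load_exiting(text):
    # Anti-pattern: the library decides to terminate the whole process.
    if not text:
        sys.exit(1)
    return text


def load_raising(text):
    # Preferred: report the problem and let the caller decide what to do.
    if not text:
        raise ParseError("empty document")
    return text


def run(loader):
    """Call loader the way an application would, handling ordinary errors."""
    try:
        loader("")
    except Exception as exc:
        return f"recovered: {exc}"
    return "ok"
```

With `load_raising`, `run` recovers and carries on; with `load_exiting`, `SystemExit` passes straight through the handler, which matches the playbook stopping instead of continuing.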
opendevreviewMerged opendev/system-config master: Update to Ansible 8 on bridge  https://review.opendev.org/c/opendev/system-config/+/89850516:25
clarkbansible==8.5.016:31
clarkbI believe the upgrade of ansible in the venv worked16:31
fungithat was fast!16:31
clarkbfungi: the merge for the list creation will probably be the first thing that runs under ansible 8 just fyi16:32
clarkbI can execute ansible-playbook --version successfully as well so the install seems to be good16:33
clarkbhttps://zuul.opendev.org/t/openstack/build/d095cf5cd898428982a71742f30a7c74/log/bridge99.opendev.org/ansible/install-root-key.2023-10-25T16:17:50.log this log shows the ruamel thing is no longer fatal (the rest of the playbook runs rather than stopping)16:36
clarkband we get an ara report https://44e79568cedacd253db2-e38ecce2b4446ed6b5d96caa6af2a2c7.ssl.cf2.rackcdn.com/899283/1/check/system-config-run-base/d095cf5/bridge99.opendev.org/ara-report/16:36
fungioh nice16:36
clarkbso ara is still working. I guess that isn't a super critical piece of code?16:36
clarkb(I think it is in the ara server path which we don't really use maybe)16:36
clarkbso ya https://review.opendev.org/c/opendev/system-config/+/899283 should be safe to merge16:39
opendevreviewMerged opendev/system-config master: Add OpenInfra EU mailing lists  https://review.opendev.org/c/opendev/system-config/+/89884616:42
clarkbfungi: the lists playbook is running now16:59
fungithanks! looks like it worked17:03
clarkbya I see the public list that was created17:04
clarkbthere are a number of gerrit 3.8 changes that affect theming plugins and general ui plugins. https://217.182.143.183/c/x/test-project/+/3?tab=change-view-tab-header-zuul-results-summary looks fine though17:47
clarkbI'll do some grepping of the removed/renamed methods across the two plugins we run to see if there are any hits but I suspect all that is a non issue based on the held node's behavior17:48
clarkbfungi: can you check my notes for 358975 in https://etherpad.opendev.org/p/gerrit-upgrade-3.8? I think this is something we don't really care about but it's a big enough change that I want another set of eyeballs on it. I tried to summarize the behavior change as well as my interpretation for why this doesn't affect us18:40
clarkbIf we can cross that one off then the commentlinks change is the only one out of that list to take action on. I'll have to look at the other changes listed next (the non breaking but still called out changes)18:42
fungiclarkb: yeah, i think it'll be fine. if anything, tooling we have that queries such things may be able to drop some error checks because now they'll get well-formed empty responses18:43
clarkbthanks I've struck it out. Leaving just commentlinks so far as something we need to address pre upgrade18:46
opendevreviewJeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases  https://review.opendev.org/c/opendev/system-config/+/89930019:39
opendevreviewJeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs  https://review.opendev.org/c/opendev/system-config/+/89930419:46
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars  https://review.opendev.org/c/opendev/system-config/+/89930519:52
fungiinfra-root: ^ more post-migration changes for mailman v319:53
funginot urgent, just trying to make sure they didn't fall off my plate while it's still fresh in my mind19:54
opendevreviewJeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs  https://review.opendev.org/c/opendev/system-config/+/89930420:11
opendevreviewJeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars  https://review.opendev.org/c/opendev/system-config/+/89930520:16
clarkbfungi: I'm not seeing any special upgrade steps between these versions of mm3 components. is that your read too?20:25
clarkbbasically we stop the containers, then start the containers which will run db migrations as necessary and that should be it? (those steps are automated too iirc)20:25
fungiright20:25
fungijust like last upgrade20:26
opendevreviewJeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs  https://review.opendev.org/c/opendev/system-config/+/89930420:47
clarkbfungi: looks like the upgrade change is failing on the db check for the auth user table being present20:51
clarkbI wonder if that table has a new name20:51
fungithe db container log shows auth errors20:52
fungistill digging20:52
fungihard to tell in the console log what the timestamps would really be for starting and stopping trying to check for that table20:57
fungithese are suspicious: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/lists99.opendev.org/docker/mailman-compose_database_1.txt#87-11620:59
clarkbfungi: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/job-output.txt#17764 that is connecting as the mailman user21:04
clarkbfungi: looking in ara it seems to be saying we never get any stdout which implies to me that the database table just doesn't exist21:07
clarkbcould be something isn't creating it because there is an error or the db table was renamed21:07
fungiyeah, i'll probably have to hold a node and inspect the db, or add a mysqldump21:27
TheJuliao/ Regarding glean, is the testing just image builds, or do we try to boot the image with say, static network config via configuration drive?21:52
opendevreviewJeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs  https://review.opendev.org/c/opendev/system-config/+/89930422:01
clarkbTheJulia: the integration testing with nodepool and dib does a full build and boot and ssh into the node test22:02
fungidib-nodepool-functional-openstack-centos-9-stream-src et cetera22:02
clarkbTheJulia: the unittests simply rely on that os detection library to mock out /etc/os-release stuff and then we check output results for the config files22:02
TheJuliaclarkb: but do those nodes operate with full static metadata, or are we just doing dhcp? I ask because at least on centos9, I've noticed I'm not getting static config applying necessarily on an instance boot, which has me raising my eyebrow22:08
clarkbTheJulia: oh is the question whether or not dhcp is used or static config? I'm not sure. It could be default dhcp. We would need to look at the nodepool config for the provider22:11
clarkbalso I think openstack actually makes it difficult to not do dhcp. Which makes the fact that multiple public clouds fail at dhcp all the more surprising22:11
TheJuliaOkay, I ask because I have been working on an advanced ironic job without dhcp22:11
TheJuliaand expecting simple-init/glean to just work, and it thinks it does things, but doesn't seem to22:11
TheJuliaAt least, with the instance image, which is still a bit curious.22:12
clarkbfwiw glean does work without dhcp on our images because they all boot in rackspace22:12
TheJuliaYeah, that is a good data point22:12
TheJuliaI know this worked in the past, but maybe something changed. Dunno. It is also weird it just works with the ramdisk I boot, but not again when I reboot22:13
TheJuliaI can see it doing what it expects, I might just have to reproduce it locally22:13
clarkbwith centos 9 you have to use network manager with glean but I thought that was automatic when using simple-init22:16
TheJulia... yeah, that is what I was thinking as well.22:17
TheJuliaI might be grazing upon some problematic case22:17
TheJuliaso in my stack of changes, I can see where I explicitly re-run glean to extract the configuration, and then trigger networkmanager to refresh and it does the needful, it is an instance image though that fails22:21
TheJuliawhich is built very similarly22:21
TheJuliahmmmmm22:21
TheJuliaI wonder if this is centos vs centos-minimal...22:23
TheJuliaerr, that makes no sense22:23
* TheJulia will look deeper tomorrow22:23
clarkbdiablo_rojo: tonyb: the ptgbot etherpad for tomorrow doesn't have any agenda. IIRC that was a session frickler was interested in but requested a meetpad location instead of zoom?22:59
clarkbI was planning to be there but wanted to call that out to make sure everyone could attend23:00
tonybI think it's on Friday sometime?23:04
diablo_rojo_phoneHeh I guess i don't remember signing up for that time but okay lol. 23:05
diablo_rojo_phoneYes we can definitely do meetpad instead. 23:05
diablo_rojo_phoneI am happy to meet there instead. 23:05
tonybhttps://meetpad.opendev.org/oct2023-ptg-ptgbot  registered for tomorrow23:08
diablo_rojo_phonePerfect. 23:09
diablo_rojo_phonefrickler: should we do an hour earlier so you don't miss tc stuff? 23:11
diablo_rojo_phoneAssuming that works for you clarkb and you tonyb 23:11
diablo_rojo_phonefungi: too. 23:12
clarkbthat is fine with me. But I'm not sure if frickler is attending tc things due to zoom?23:12
clarkbI don't mind either way23:12
tonybI thought the TC agreed to use meetpad rather than zoom23:16
clarkbif they did it isn't in the schedule. The previous tc sessions were on zoom not meetpad23:16
tonybBut that's not what's in the bot23:16
tonybso I guess I imagined it23:16
tonybdiablo_rojo_phone: an hour earlier would be good for me as I'd like to be in the "leaderless projects retro/discussion"23:17
fungian hour earlier will conflict with openstack qa rather than tc, not sure if frickler wanted to attend both23:20
clarkbalso apologies if I misremembered frickler's interest in that session. I swear that was one that frickler said would be attended if held on meetpad though23:21
fungitwo hours earlier wouldn't conflict with either one, but might be early for folks in pdt23:21
