Thursday, 2022-03-24

02:31 <opendevreview> Merged openstack/project-config master: Add puppet-manila-core  https://review.opendev.org/c/openstack/project-config/+/834318
06:45 *** ysandeep|out is now known as ysandeep
08:28 *** pojadhav- is now known as pojadhav|rover
08:38 *** jpena|off is now known as jpena
08:45 *** ysandeep is now known as ysandeep|afk
09:04 *** ysandeep|afk is now known as ysandeep
09:16 *** pojadhav- is now known as pojadhav|rover
09:44 <opendevreview> Michal Nasiadka proposed opendev/irc-meetings master: kolla: move meeting one hour backward (DST)  https://review.opendev.org/c/opendev/irc-meetings/+/835020
10:21 *** rlandy|out is now known as rlandy
11:10 *** ysandeep is now known as ysandeep|afk
11:14 *** ysandeep|afk is now known as ysandeep
11:21 <opendevreview> Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job  https://review.opendev.org/c/zuul/zuul-jobs/+/821247
11:26 <opendevreview> Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job  https://review.opendev.org/c/zuul/zuul-jobs/+/821247
11:29 <opendevreview> Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job  https://review.opendev.org/c/zuul/zuul-jobs/+/821247
11:52 <opendevreview> Merged opendev/irc-meetings master: kolla: move meeting one hour backward (DST)  https://review.opendev.org/c/opendev/irc-meetings/+/835020
12:25 <opendevreview> Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job  https://review.opendev.org/c/zuul/zuul-jobs/+/821247
12:58 *** ysandeep is now known as ysandeep|afk
13:31 <frickler> infra-root: regarding iweb, I checked the "nodepool list" output and there is a high number of nodes in a deleting state there, and checking some of those I found more nodes with an uptime >10d. those don't necessarily correlate with the IP conflicts, but still, maybe we just want to shut that cloud down right now instead of doing further debugging?
13:32 <fungi> nodes stuck in a deleting state simply subtract from the available quota and cause the launchers to make additional delete calls
13:33 <fungi> it would be nice if we could safely ride it out through next wednesday so we have that additional capacity for openstack yoga release day
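
A rough sketch of the kind of check frickler describes above, assuming shell access to the launcher and to the nodes; the grep pattern, login user, and uptime threshold are illustrative rather than the exact commands used:

    # Nodes the launcher keeps retrying (and failing) to delete:
    nodepool list | grep -w deleting

    # For a node that should have been recycled long ago, see how long it has
    # really been up (10+ days is suspicious for a single-use CI node):
    ssh root@198.72.124.130 uptime
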
13:40 *** ysandeep|afk is now known as ysandeep
14:12 *** pojadhav|rover is now known as pojadhav|afk
14:20 <frickler> I think "safely" is the key point here. iiuc there is no one left operating that cloud, and without the stuck nodes we have no control over how many further IP conflicts may happen. and the risk of jobs falsely passing because they run on the wrong node type can't be excluded, which I consider pretty dangerous, in particular for release day
14:20 <frickler> *without cleaning up the stuck nodes
14:21 <fungi> yep, i guess we can take a closer look at our current utilization as a projection for next week, and if we decide we're unlikely to be overloaded then just plan for the release to take longer to run all the jobs it will need
14:53 *** Guest86 is now known as diablo_rojo_phone
14:59 <mgagne> frickler: fungi: I can take a look
15:01 <fungi> thanks mgagne! didn't know you still had access
15:02 <mgagne> I 100% have access and our team "owns" the infra.
15:02 <fungi> to summarize, there are a bunch of undeletable nodes, and also some which nova seems to have lost track of, so they're still up but their addresses are also getting assigned to newly-booted nodes
15:04 <fungi> the undeletable nodes are mainly just impacting available capacity slightly; it's the rogue servers which are presenting more of a problem, since we're getting some builds which end up connecting to them and either run on the wrong, already-used node or fail with host key mismatches
15:07 <mgagne> I'll take care of the undeletable nodes. I'll take a look at the rogue vms too.
15:10 <fungi> thanks so much!
15:17 <frickler> mgagne: that's great, if you let me know when you're done, I can double-check the nodepool list. did you see the reference to the two rogue nodes we powered off in #openstack-infra?
15:18 <mgagne> Was an IP or UUID provided? I guess I missed it in the backlog.
15:18 <frickler> mgagne: I can get the IPs for you, one moment
15:21 <frickler> mgagne: ubuntu-focal-iweb-mtl01-0028848617 198.72.124.130 and ubuntu-focal-iweb-mtl01-0028797665 198.72.124.111
15:22 <fungi> those are the ones we're aware of so far, anyway. there may be others we haven't spotted
15:22 <mgagne> thanks, looking into it
15:34 <mgagne> I'm not able to find any "rogue" VMs, everything matches between Nova and the compute nodes.
15:56 <fungi> maybe they got cleaned up after we logged in and issued shutdown commands to stop them from interfering with new nodes
15:58 <frickler> yeah, that's likely. or you didn't find them because they are shut down? anyway, we can ping you once we find another instance
15:59 <frickler> fungi: another idea I just had: nodepool could possibly detect such a situation by checking for the correct hostname when it confirms ssh connectivity. do you think that would be worthwhile to add as a feature?
16:00 <fungi> frickler: i don't think nodepool currently logs into the nodes it boots, just checks to see that ssh is responding?
16:01 <frickler> fungi: it should at least verify that ssh access is working, shouldn't it? but I must admit I'm not sure about the details either
16:01 <fungi> also that doesn't really address the issue, since the situation you get into is generally an arp conflict, where 50% of connections go to one node and 50% to the other depending on timing and the state of the gateway's arp table
16:02 <fungi> so it would at best discover 50% of them
16:03 <frickler> depends on timing I guess, if the older node is faster at responding, it could win 100% of the time. at least with the current incident I've always been connected to the older node only
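
For reference, a duplicate-IP situation like this can in principle be confirmed from a machine on the same layer-2 segment (which the CI side usually is not); the interface name below is an assumption, and from outside the provider network the main symptom remains ssh host key mismatches:

    # ARP duplicate address detection: any reply means another machine is
    # already answering for the address
    arping -D -c 3 -I eth0 198.72.124.130

    # Watch whether the cached MAC for the address flaps between two values
    ip neigh show 198.72.124.130
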
16:06 <frickler> maybe the better option would be for nova to finally learn to allow ssh hostkey verification to happen via the API somehow. again I'm not sure how that would be possible, but it sure would be a cool feature
16:06 <corvus> nodepool does not log into the node.  but zuul does, so if you wanted to check that, you could do so in zuul.
16:07 <corvus> (by "in zuul" i mean in a pre-run playbook in a job)
16:27 <fungi> right, we have some similar checks which happen in our base job already, so we could extend that
16:27 <fungi> but still, it'll only catch situations where that playbook happens to hit the "wrong" server, so it won't catch them all
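
A minimal sketch of what such a pre-run check could look like, not the actual opendev base job: the expected_node_hostname variable is hypothetical and would have to be supplied from whatever record the launcher keeps of the node it booted.

    - hosts: all
      tasks:
        # Ask the node what it thinks it is called.
        - name: Read the hostname the node reports
          command: hostname
          register: reported_hostname
          changed_when: false

        # Abort early if we appear to have reached a different (rogue) server.
        - name: Fail on hostname mismatch
          fail:
            msg: >-
              Expected {{ expected_node_hostname }} but the node reports
              {{ reported_hostname.stdout }}; possibly a duplicate-IP server.
          when: reported_hostname.stdout != expected_node_hostname
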
16:35 <dpawlik> fungi, clarkb: hey, I added a few visualizations in OpenSearch. Right now they are just simple visualizations, but maybe they will be helpful for someone. Also added some new logs to be pushed: https://review.opendev.org/c/openstack/ci-log-processing/+/833624/12/logscraper/config.yaml.sample . Others will be added soon!
16:36 *** ysandeep is now known as ysandeep|out
16:38 <dpawlik> fungi: I will add the whitespace characters tomorrow to the PS https://review.opendev.org/c/opendev/system-config/+/833264 . Need to go now (I'm just leaving a message because I saw a ping on the openstack-tc channel)
16:46 *** rlandy is now known as rlandy|PTO
17:28 *** jpena is now known as jpena|off
19:43 <opendevreview> Merged openstack/project-config master: Add Istio app to StarlingX  https://review.opendev.org/c/openstack/project-config/+/834896
20:16 *** diablo_rojo_phone is now known as Guest250
20:45 *** dviroel is now known as dviroel|pto
21:34 <priteau> Good evening. I am investigating a doc job failure in blazar, seen in stable/wallaby but I assume master is affected too. The failure is caused by the latest Jinja2 3.1.0 being installed, which is incompatible with Sphinx 3.5.2 from wallaby upper constraints.
21:35 <priteau> Actually master may not be affected, since the Sphinx u-c is different
21:36 <priteau> Jinja2 3.x comes from Flask, which is a blazar requirement. I am wondering what could be causing it to be installed at that version despite upper constraints?
21:36 <priteau> Failure log: https://8abbfecd00d9996fdb0c-5c4643ce01a9b304087712e3e08013b4.ssl.cf1.rackcdn.com/835057/1/check/openstack-tox-docs/8c5e9c0/job-output.txt
21:40 <priteau> I can actually reproduce it locally, it doesn't affect master or stable/xena
21:43 <corvus> the ara package has a similar error: cannot import name 'Markup' from 'jinja2'
21:45 <priteau> Workaround is to pin Jinja2<3.1.0
21:45 <priteau> If you use Jinja2 directly, the proper fix is to update usage, see https://jinja.palletsprojects.com/en/3.1.x/changes/
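
Concretely, Jinja2 3.1.0 removed the Markup and escape names that older code (including the Sphinx and ara versions mentioned above) still imports from the jinja2 namespace; where Jinja2 is used directly, the fix is roughly to import them from markupsafe instead:

    # Before (only works on Jinja2 < 3.1):
    #   from jinja2 import Markup, escape
    # After (these have always been provided by MarkupSafe):
    from markupsafe import Markup, escape

    safe_fragment = Markup("<b>already safe</b>")
    print(escape("<script>alert(1)</script>"))  # prints the HTML-escaped form
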
21:46 <corvus> ah looks like they released 3.1.0 today
21:46 <priteau> Anyway, I can work around it in blazar, but I am curious why tox -e docs is installing blazar without using u-c?
21:52 <priteau> It is the develop-inst step which does this
22:02 <priteau> Sorry for thinking out loud, I found the solution
22:02 <priteau> docs tox env should be using `skip_install = True`
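
For reference, the resulting tox.ini change is roughly the following; the section name and requirements paths follow the usual OpenStack layout rather than blazar's exact files. With skip_install enabled, tox skips the develop-install of the project itself, so nothing gets installed outside the constrained deps list:

    [testenv:docs]
    # Building docs does not need the project installed, and the develop-install
    # step would pull in dependencies without applying upper-constraints.
    skip_install = True
    deps =
      -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/wallaby}
      -r{toxinidir}/doc/requirements.txt
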
22:30 <opendevreview> Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task  https://review.opendev.org/c/zuul/zuul-jobs/+/835156
22:49 <fungi> setuptools 61.0.0 was released roughly 2 hours ago. i haven't looked through the changelog yet
23:14 <corvus> i'm going to start a rolling restart of zuul
23:18 <corvus> first 6 executors are gracefully stopping
23:21 <fungi> thanks! i'm around if i can be of help
23:29 <opendevreview> Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task  https://review.opendev.org/c/zuul/zuul-jobs/+/835156
23:29 <opendevreview> Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf  https://review.opendev.org/c/zuul/zuul-jobs/+/835162
23:41 <corvus> fungi: the executor restart is running in screen on bridge; i think whenever they're all back online we can do the schedulers/web
23:43 <fungi> ahh, awesome, connecting
23:47 <opendevreview> Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf  https://review.opendev.org/c/zuul/zuul-jobs/+/835162
23:47 <opendevreview> Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task  https://review.opendev.org/c/zuul/zuul-jobs/+/835156
23:48 <fungi> looks like there's a second screen session with the results of a merger restart in it. i'll leave it for now in case someone's still doing something with that
