Tuesday, 2021-08-03

00:00 * corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/QJXNXwCbiAfkqpfrwWRKgcUw/message.txt >
00:01 <corvus> tristanC: i verified that the ssh private key is the same as what's currently being used with irc
00:01 <corvus> tristanC: could it be something about key types and the ssh library being used?
00:03 <corvus> i also verified that the key is present and readable in the container
00:05 <corvus> tristanC: oh interesting, i can reproduce that inside the container with plain ssh, but not on the host.
00:05 <corvus> tristanC: did you by any chance build this on a fedora container base?
00:05 <corvus> yes, appears to be f34
00:06 <corvus> tristanC: i've heard something about f34 and ssh.  i don't know the details, but i'm sure others in this channel do.
00:07 <corvus> tristanC: anyway, i bet you can reproduce the problem locally with an rsa key and the container image
00:17 <ianw> yeah that will need a PubkeyAcceptedKeyTypes ssh-rsa
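(For context: Fedora 34's default crypto policy disables SHA-1 ssh-rsa signatures in the OpenSSH client, so RSA auth from the f34-based container fails against a server that doesn't explicitly advertise SHA-2 support. A minimal sketch of the workaround ianw points at, runnable from inside the container -- the key path and account name are hypothetical, and PubkeyAcceptedKeyTypes is the older OpenSSH spelling of what later releases call PubkeyAcceptedAlgorithms:)

```python
# Sketch of the client-side workaround (hypothetical key path and account).
import subprocess

subprocess.run(
    [
        "ssh",
        "-o", "PubkeyAcceptedKeyTypes=+ssh-rsa",  # re-allow SHA-1 RSA signatures
        "-i", "/var/lib/gerritbot/id_rsa",        # hypothetical key path
        "-p", "29418",                            # Gerrit's standard ssh port
        "gerritbot@review.opendev.org",           # hypothetical account
        "gerrit", "version",
    ],
    check=True,
)
```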
00:40 <tristanC> corvus: https://review.opendev.org/c/opendev/system-config/+/803125/3 uses another container image which i think should work better
01:11 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
01:13 <tristanC> corvus: i've updated the image tag to a new version which uses libssh2 directly, the image is now only 20.9MB
01:17 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
01:31 <Clark[m]> Really seems like fedora should update their default fallback hash type in the client
01:42 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add script to import gerritbot configuration  https://review.opendev.org/c/opendev/system-config/+/802983
01:43 <Clark[m]> https://bugzilla.redhat.com/show_bug.cgi?id=1940201 is the issue basically
01:43 <Clark[m]> Except in our case it's not a really old server, just one that doesn't indicate support for sha2 explicitly
01:44 <Clark[m]> However in that case it is succeeding because it's an old server
01:44 <Clark[m]> We want it to fail and force sha2
02:00 <Clark[m]> https://github.com/openssh/openssh-portable/blob/master/sshconnect2.c#L1191 I think the code that needs modifying is somewhere near there
04:45 *** ykarel|away is now known as ykarel
05:50 *** marios is now known as marios|ruck
06:54 *** jpena|off is now known as jpena
07:27 *** sshnaidm|afk is now known as sshnaidm
07:30 *** rpittau|afk is now known as rpittau
08:35 <opendevreview> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro  https://review.opendev.org/c/openstack/diskimage-builder/+/784363
08:46 *** ykarel is now known as ykarel|lunch
09:02 *** diablo_rojo__ is now known as diablo_rojo
10:03 *** ykarel|lunch is now known as ykarel
10:04 <opendevreview> Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/803190
11:18 *** dviroel|out is now known as dviroel
11:20 *** jpena is now known as jpena|lunch
11:23 *** jpena|lunch is now known as jpena
11:30 *** jpena is now known as jpena|lunch
11:42 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure  https://review.opendev.org/c/opendev/system-config/+/803328
11:44 <tristanC> clarkb: with ^ the bot no longer uses openssh, which seems to fix the current issue
11:45 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
11:47 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
12:33 *** jpena|lunch is now known as jpena
12:33 <opendevreview> Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/803190
13:49 <fungi> python v3.10.0rc1 has been tagged
14:39 *** Guest3154 is now known as notcorvus
14:40 *** notcorvus is now known as corvus
14:40 *** corvus is now known as Guest3197
14:42 *** Guest3197 is now known as corvus
14:42 *** corvus is now known as notcorvus
14:42 *** notcorvus is now known as corvus
14:43 *** corvus is now known as Guest3200
14:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Update our python-base and -builder images  https://review.opendev.org/c/opendev/system-config/+/803357
14:51 <clarkb> it occurred to me that we're looking at gitea, gerrit, and zuul deployment updates and it seemed like a good idea to update our base images as a result
14:51 <clarkb> fungi's note about new python also had me thinking about it a bit
14:52 *** ykarel is now known as ykarel|away
14:55 *** ksambor is now known as ksambor|away
15:06 *** jpena is now known as jpena|off
15:15 <Guest3200> there are 2 matrix fixes at https://review.opendev.org/803294 and https://review.opendev.org/803328 that would be nice to merge soon
15:17 *** Guest3200 is now known as corvus
15:17 *** corvus is now known as notcorvus
15:17 *** notcorvus is now known as corvus
15:41 <fungi> rackspace says afsdb03 was rebooted on sunday around 16:45 utc due to a host issue
15:42 <fungi> seems to be up and running at least
15:47 <fungi> also i've discovered that lists.o.o has gotten its ipv6 address listed in the spamhaus xbl, i'm waiting for the delisting request to be reviewed
15:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Use the gitea api in the gitea renaming playbook  https://review.opendev.org/c/opendev/system-config/+/803366
15:50 <opendevreview> Clark Boylan proposed opendev/system-config master: Update gitea project creation to only use the REST API  https://review.opendev.org/c/opendev/system-config/+/803367
15:50 <fungi> ahh, there it goes
15:51 <fungi> #status log Requested Spamhaus XBL delisting for the IPv6 address of the lists.openstack.org server
15:51 <opendevstatus> fungi: finished logging
15:51 <clarkb> I'm trying to separate good hygiene (above) from the gitea 1.15.0 stuff
15:52 <clarkb> and I split things up because the second change is quite a bit more involved than the first, but we really only need the first for the v1.15.0 upgrade
15:52 <clarkb> I'll rebase once it is clear things are working
15:53 <clarkb> tristanC: corvus: I guess libssh2 wants both a private key file and a pubkey file unlike openssh?
15:53 <clarkb> oh ha, the next file comments explain that
15:54 <fungi> clarkb: spot-checking the next proposed batch of 73 gerrit account cleanups i don't see anything concerning. comfortable with proceeding when convenient
15:54 <clarkb> fungi: excellent. Probably run the scripts tomorrow
15:56 <clarkb> corvus: I've approved the gerritbot change but have left the eavesdrop change in case fungi wants to review it. I think that change is more than small enough to go ahead and approve if fungi doesn't have time though
15:58 <clarkb> infra-root: https://review.opendev.org/c/opendev/system-config/+/803357 is a change to update our base python images. If you think that is a good idea I'd like to land it before I land any changes like https://review.opendev.org/c/opendev/system-config/+/803265, as it will ensure we're deploying up to date images as we go through and update images for other reasons
15:58 <tristanC> clarkb: corvus: thanks for the quick review!
15:59 <corvus> tristanC: you're welcome and thanks!
16:00 <fungi> i'll try to look at the eavesdrop change in a bit, after i get some lunch down
16:01 <tristanC> clarkb: about the bot monitoring, would it be ok to expose the prometheus port? and if so, what's the proper place to define the desired port number?
16:02 <clarkb> tristanC: it's fine, but we don't have a prometheus server yet so I'm not sure it will do much good. Writing a spec for that is on the todo list
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Rebuild gerrit images to include fixes  https://review.opendev.org/c/opendev/system-config/+/803372
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Use the mariadb jdbc connector in gerrit  https://review.opendev.org/c/opendev/system-config/+/803373
16:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Remove the mysql connector from our gerrit images  https://review.opendev.org/c/opendev/system-config/+/803374
16:13 <clarkb> ianw: ^ fyi that's a followup on our upstream bug fixes. They are present in the stable branches we build now.
16:13 *** rpittau is now known as rpittau|afk
16:14 <clarkb> stable-3.4 doesn't include them yet. There is a merge change in flight there that I want to land, then I can propose a merge change that includes our two fixes
16:14 <clarkb> I think it is ok for us to proceed now that stable-3.2 and stable-3.3 include them though
16:16 <corvus> tristanC: you can probably grep 'iptables' in system-config to find out how to open the port
16:17 <clarkb> fungi: https://review.opendev.org/c/opendev/system-config/+/803357 is another good one, then I'd like to go ahead and land https://review.opendev.org/c/opendev/system-config/+/803265 and https://review.opendev.org/c/opendev/system-config/+/803266
16:18 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring  https://review.opendev.org/c/opendev/system-config/+/803125
16:20 <clarkb> the internet says the air outside is breathable again. I'm going to go get some exercise now before the meeting. Back in a bit
16:23 <JayF> are there known issues with the mirrors? DIB jobs are 404'ing on a mirror: Status code: 404 for http://mirror.mtl01.inap.opendev.org/fedora/updates/32/Everything/x86_64/repodata/repomd.xml
16:25 <clarkb> JayF: I want to say ianw was cleaning up fedora 32
16:26 <clarkb> it is an eol release now
16:26 <JayF> That's incredibly unfortunate, it's hard-breaking the DIB gate
16:26 <clarkb> JayF: are there changes in DIB to remove fedora 32 testing?
16:27 <JayF> I haven't checked; I don't work on DIB usually, just trying to get a change thru
16:27 <clarkb> I suspect that ianw would be on board with removing those tests given he also cleaned up fedora 32 mirroring
16:27 <JayF> there are zero CI-related changes in the DIB pipeline
16:28 <clarkb> ah, it's just the func test that needs to be bumped forward. I expect this was unexpected fallout
16:28 <opendevreview> Merged opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure  https://review.opendev.org/c/opendev/system-config/+/803328
16:29 <JayF> I'll try to find a fix for it, just a little frustrated because the breakage of the mirror coincided almost perfectly with the mirror being broken
16:30 <gibi> hi! I don't know if it is a known issue but it seems that we don't have the bot resolving launchpad bugs in #openstack-nova
16:30 <JayF> ***coincided almost perfectly with getting +W
16:30 <JayF> lol
16:30 *** marios|ruck is now known as marios|out
16:31 *** ssbarnea[m] is now known as zbr
16:31 <fungi> gibi: that was a plugin we had in our old supybot-based meetbot implementation. it's not present in the new limnoria-based meetbot
16:37 <opendevreview> Jay Faulkner proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34  https://review.opendev.org/c/openstack/diskimage-builder/+/799341
16:37 <JayF> going to see if that is the fix it needs; that's an old patch that needed a rebase
16:41 <clarkb> I know there are/were problems with fedora 34 + debuntu rpm. That may work depending on how in depth those func tests get. I think if that fails you can set it to 33
16:41 <JayF> ack; thank you for the pointer
16:41 <clarkb> ok really popping out now. Back in a bit
17:23 <opendevreview> Merged opendev/system-config master: matrix-eavesdrop: fix initial room path creation  https://review.opendev.org/c/opendev/system-config/+/803294
17:23 <opendevreview> Merged opendev/system-config master: Update our python-base and -builder images  https://review.opendev.org/c/opendev/system-config/+/803357
17:24 <nst__> i have installed openstack using kolla-ansible and am trying to test the freezer project. after finishing the installation and trying a backup there is an error like this: "CRITICAL freezer.main [-] Critical Error: can only concatenate str (not "bytes") to str". anyone have experience with this? thank you
17:34 *** dviroel is now known as dviroel|brb
17:37 <fungi> nst__: you might have more luck asking in the #openstack-kolla channel or on the openstack-discuss@lists.openstack.org mailing list (i don't see any dedicated irc channel for the freezer project)
17:37 <fungi> #opendev is where we discuss the services which comprise the opendev collaboratory (so gerrit, zuul, etherpad/meetpad, mailman, et cetera)
17:58 <opendevreview> Merged opendev/system-config master: Use nodejs 14  https://review.opendev.org/c/opendev/system-config/+/803265
18:17 <clarkb> corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca does that indicate the merger failed to fetch and update the repo?
18:18 <corvus> clarkb: i think so.  i think that's worth a search of the uuid in the executor logs
18:19 <JayF> ianw: cgoncalves: clarkb: https://review.opendev.org/c/openstack/diskimage-builder/+/799341 appears to fix the DIB gate
18:20 <clarkb> corvus: ze07 appears to be the executor
18:20 <clarkb> corvus: 2021-08-03 18:03:24,008 ERROR zuul.ExecutorServer: [e: 6411f294b08945e1ba71d755335dbea1] [build: 70e992a5ba9c49bd8026b1bfa2a24aca] Got exception while updating repo gerrit/opendev/system-config
18:21 <clarkb> grepping by the event id doesn't seem to show more of an error unfortunately
18:22 <fungi> clarkb: the change to improve repo rename functional testing failed oddly on run-review-3.2 with a rather opaque-looking selenium error
18:23 <fungi> not sure if that's prone to nondeterminism, but it's the first time i've noticed that particular error anyway
18:24 <clarkb> fungi: ya I see you rechecked it. fwiw I'm not sure 803265 will use the new image from 803357. But that is ok, the v1.15 gitea update will get the new image. More important that we're generally updating things
18:25 <corvus> clarkb:   stderr: 'ssh: Could not resolve hostname review.opendev.org: Temporary failure in name resolution
18:25 <corvus> clarkb: (from the traceback, without the log line prefix)
18:25 <clarkb> corvus: neat! I think we still have a short ttl for that name. Let me push a change to bump it back out to an hour
18:25 <clarkb> while that won't totally fix things it should improve our ability to cache dns records
18:27 <clarkb> unfortunately I think zuul uses the python 3.8 images so we may need to reenqueue that?
18:27 <clarkb> or land another change to cause things to rebuild
18:28 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour  https://review.opendev.org/c/opendev/zone-opendev.org/+/803390
18:28 <clarkb> corvus: ^ there is the ttl bump
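(A quick way to verify the bump once 803390 deploys -- a sketch assuming dnspython >= 2.0 is available; note a caching resolver reports the remaining TTL, so expect a value counting down from 3600 rather than exactly 3600:)

```python
# Check the TTL resolvers now see for the record (assumes dnspython >= 2.0).
import dns.resolver

answer = dns.resolver.resolve("review.opendev.org", "CNAME")
print(answer.rrset.name, answer.rrset.ttl)  # should be <= 3600 after the change
```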
18:29 <clarkb> fungi: corvus: https://review.opendev.org/c/opendev/system-config/+/803366/1 should also be mergeable. Looks like the child of that one passed testing, but it likely deserves in depth review
18:30 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade  https://review.opendev.org/c/opendev/system-config/+/803231
18:30 <clarkb> And there is the v1.15.0 change rebased on top of the other cleanups
18:42 *** dviroel|brb is now known as dviroel
18:44 <opendevreview> Merged opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour  https://review.opendev.org/c/opendev/zone-opendev.org/+/803390
18:46 <clarkb> infra-root: thinking about the 3.8 promote failure a bit more, I'm worried that the builder and base image pairing being out of sync like that may be a problem for our wheel builds?
18:47 <clarkb> I guess it depends if debian buster has updated any packages since our last image build with incompatible library updates?
18:48 <clarkb> all that to say I think we should try to correct this problem if we can. Do you think reenqueuing the buildset is a good idea or should we land another change to rebuild the images and go through it all again?
18:49 <fungi> so the promote job failed and the image wasn't tagged?
18:50 <clarkb> fungi: correct, but python-builder:3.8 was tagged. This means we'll be building wheels on newer debian buster than where we install and execute those wheels
18:50 <fungi> is it python-base:3.8 we're missing then?
18:51 <clarkb> yes, that is the one that had its promote job fail
18:51 <clarkb> I think zuul and hound are the two images that use python3.8 primarily. Most things are still 3.7
18:51 <clarkb> Zuul has functional testing and would catch these issues with the images if it becomes a problem I think
18:52 <clarkb> which means we're probably fine until we manage to promote that image one way or another
18:52 <fungi> i don't think reenqueuing that change into promote should present a problem
18:52 <clarkb> fungi: it's deploy not promote
18:52 <fungi> oh, i misunderstood. thought you said python-base:3.8 wasn't tagged on dockerhub
18:53 <clarkb> fungi: yes that is correct, but the jobs run in the deploy pipeline
18:53 <fungi> i had no idea we were doing that in deploy instead of promote. interesting
18:53 <clarkb> which is where I think it gets confusing around whether or not we can reenqueue the buildset, as it will not only rerun all the promote jobs but also run infra-prod-service-refstack and infra-prod-service-gitea
18:54 <clarkb> fungi: we do that so that the images update before we do the pull, as we can't sequence across pipelines aiui
18:55 <fungi> i guess with it being a change-merged trigger, we already expect that changes could trigger out of sequence there, so at least we don't have to worry that reenqueuing an older change would roll back the latest branch state in the context of the deployment
18:56 <fungi> so probably still safe to do
18:56 <clarkb> fungi: well that is what I'm not totally sure of. I think in this specific case the refstack and gitea jobs won't do anything unexpected though, and in this specific case we should probably be ok
18:59 <clarkb> ya I think we're good to reenqueue
19:29 <tristanC> corvus: clarkb: did 803328 restart the compose, or should we add a force-recreate when the compose file changes?
19:30 <clarkb> tristanC: iirc the role does a docker-compose pull && docker-compose up -d, and since the image changed that should cause it to restart
19:30 <clarkb> basically up -d will restart if there is a new image present
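(Roughly what that amounts to -- a sketch of the sequence clarkb describes, not the actual Ansible role; the compose directory here is hypothetical:)

```python
# Pull-then-up sequence described above (hypothetical compose directory;
# the real steps live in an Ansible role in system-config).
import subprocess

compose_dir = "/etc/gerritbot-matrix-compose"  # hypothetical location
subprocess.run(["docker-compose", "pull"], cwd=compose_dir, check=True)
# "up -d" only recreates containers whose image or config changed, so
# pulling a new image tag is enough to get the bot restarted.
subprocess.run(["docker-compose", "up", "-d"], cwd=compose_dir, check=True)
```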
19:32 <tristanC> clarkb: alright, there it posted its first notification to the test channel!
19:43 <clarkb> fungi: only the hourly deploys are currently queued up if we want to enqueue the deploy jobs for https://review.opendev.org/c/opendev/system-config/+/803357
19:51 <clarkb> corvus: ^ re that, do you know of any reason it wouldn't be safe to rerun the docker image promote jobs?
19:53 <corvus> clarkb: it could "rewind" production?  do you have a build/buildset link?
19:53 <corvus> we can see what jobs will run
19:53 <clarkb> corvus: the buildset for the name resolution failure I linked to previously
19:53 <clarkb> corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca
19:54 <clarkb> corvus: as described above I'm slightly worried that having a newer python-builder image than our python-base image could result in errors building wheels if there are binary differences in a way that matters
19:54 <corvus> clarkb: so it could rewind refstack and gitea -- has anything changed with those since then?
19:54 <clarkb> However, only zuul and hound would be affected so it's not super critical (as zuul testing should catch that if this is the case for zuul)
19:55 <clarkb> corvus: gitea got https://review.opendev.org/c/opendev/system-config/+/803265 applied after that change with the failure
19:55 <clarkb> corvus: when you say rewind do you mean it could rewind the docker images or just system-config on bridge or both?
19:55 <clarkb> the alternative is to land another base image update change
19:55 <clarkb> I'm happy to propose that if we want to approve that instead
19:55 <corvus> system-config and whatever the ansible there does.  i don't think that will affect the gitea server
19:55 <corvus> i think it would have to be an ansible change, not a docker image change
19:56 <clarkb> corvus: I agree I don't think it will affect the gitea server
19:56 <corvus> oh wait
19:56 <fungi> do we actually use the change in the deployment tasks though, or the branch state?
19:56 <corvus> will that retag :latest?
19:56 <clarkb> corvus: I wonder if it will
19:57 <clarkb> corvus: you know I think it may
19:57 <clarkb> because it just looks at the current change's image and says you are latest now
19:57 <corvus> if it did not delete the old tags, then it probably will.  if the old tags are deleted, then it will try and fail.
19:57 <clarkb> fungi: we use the branch state for system-config but project-config should be latest iirc
19:57 <corvus> based on our earlier experience, that's a crap shoot.  we probably can't even tell because of the dockerhub ui/api difference.
19:57 <clarkb> corvus: ya I think we should do a new change to bump the image again
19:58 <corvus> clarkb: i think that's safest and most predictable :)
19:58 <clarkb> I'll push that momentarily
19:58 <fungi> that's fine by me. it's low-effort, maybe even lower (just going to take a little longer to go through)
19:58 <fungi> i'll be around for at least a few more hours anyway
20:01 <corvus> tristanC, clarkb, fungi: and eavesdrop is working too! https://meetings.opendev.org/irclogs/%23test/latest.log.html
20:01 <fungi> oh nice
20:01 <opendevreview> Clark Boylan proposed opendev/system-config master: Update the python docker images again  https://review.opendev.org/c/opendev/system-config/+/803394
20:01 <clarkb> corvus: fungi: ianw: ^ ok that should get us sorted
20:02 <clarkb> the ttl update is in place too so dns should be much more reliable now
20:03 <fungi> i need to disappear for a bit to make dinner but should probably be back by the time that's done
20:03 <clarkb> ya I need to get lunch myself
20:03 <clarkb> ianw: https://review.opendev.org/c/opendev/system-config/+/803372 is a stack that will probably interest you as it attempts to incorporate our upstream gerrit fixes
20:04 <clarkb> ianw: but I don't think we should approve anything out of that stack until 803394 promotes properly. That way we get shiny python:3.7 images to work with
20:04 <clarkb> and now lunch
20:13 <opendevreview> Merged opendev/system-config master: Improve repo rename functional testing  https://review.opendev.org/c/opendev/system-config/+/803266
20:14 <fungi> yay, it merged on retry. that selenium error was... weird
20:23 <opendevreview> Merged openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34  https://review.opendev.org/c/openstack/diskimage-builder/+/799341
20:31 <ianw> fungi: yeah, i think it can still be a bit racy, and because of the shadow-dom walking the queries are really weird
20:35 <opendevreview> Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix identity lookup configuration  https://review.opendev.org/c/opendev/system-config/+/803396
20:35 <ianw> clarkb: thanks, responded; it's pulling in a different variables file that should hopefully stop pip.conf adding wheel mirrors
20:35 <ianw> that's the theory, anyway
20:38 <clarkb> ianw: I'm not sure that will work since the defaults define wheel_mirror
20:38 <clarkb> ianw: also we want the pypi mirror to be set, just not the wheel mirror
20:40 <ianw> i was modelling on fedora, but it's possible that's also wrong
20:41 <clarkb> ya I suspect that we may be writing an incomplete pip.conf where it is extra-index-url = and then nothing
20:41 <clarkb> or if it is something it is https:///centos-8/wheels/ or whatever the path is
20:41 <clarkb> since we don't guard against the vars being undefined or empty
20:42 <ianw> hrm, i guess https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml applies
20:43 <ianw> i think we should rework this to more clearly delineate between platforms we build wheels for and not
20:43 <ianw> a role to generate the wheel slug would also be useful, as i could re-use that in the publishing jobs
20:47 <clarkb> ianw: in the centos-8-stream "make a wheel mirror" change, what ends up doing the afs vos release?
20:48 <clarkb> oh hrm, these are just check jobs to see if we can build the wheels, not publish them
20:48 <clarkb> then when that is working we add in publication I guess
20:48 <ianw> yeah, the publish jobs are in project-config
20:48 <clarkb> +2
20:49 <ianw> that's where i need to sort it out writing to the correct afs locations
20:49 <clarkb> fungi: ianw: heads up, I have approved https://review.opendev.org/c/opendev/system-config/+/803263
20:49 <clarkb> the newlist command fix
20:54 *** dviroel is now known as dviroel|out
21:00 <tristanC> corvus: for gerritbot matrix identity lookup, we would need to run this procedure; would you mind giving it a try and letting me know if these instructions work: https://review.opendev.org/c/opendev/system-config/+/803396/1/playbooks/roles/matrix-gerritbot/README.rst
21:06 <clarkb> 803394 is waiting for one more node in check. I wonder if we've got some sort of failure or slow boot somewhere /me tries to be patient
21:19 <opendevreview> Merged opendev/system-config master: Skip notification prompt for newlist command  https://review.opendev.org/c/opendev/system-config/+/803263
21:40 <clarkb> system-config-build-image-gitea-init had its node request fulfilled over an hour ago according to the logs on nl04
21:40 <clarkb> but the job hasn't started. Is this the bad locks in the nodepool launcher situation? should I restart nodepool-launcher on nl04?
21:40 <clarkb> fungi: ^
21:40 <clarkb> I think you've debugged a few of these before
21:41 <corvus> tristanC: done and added to private hostvars.  worked fine
21:41 <clarkb> https://paste.opendev.org/show/bMM7UPkSF7g1bqrXTYYX/ is what that looks like on the scheduler and the nodepool launcher
21:42 <clarkb> I'm going to restart nl04's launcher in a couple of minutes if I don't hear objections. I suspect this is that issue
21:48 <clarkb> restarting nl04's launcher does not seem to have changed the state of that job
21:48 <clarkb> corvus: ^ fyi this seems like some sort of zk related issue in job scheduling
21:49 <clarkb> comparing the logs for the request associated with the stuck job to logs for other requests, it seems that the nodepool side believes its work to be done
21:50 <corvus> clarkb: if the scheduler never logged an update to that node request then it's probably not due to the recent zk work
21:50 <corvus> i think that's more or less the same as before
21:51 <clarkb> 2021-08-03 20:28:17,876 INFO zuul.nodepool: [e: 7e9b6a5b1e2f41989b016e4b4699346a] Request 299-0014954444 no longer exists, resubmitting
21:51 <clarkb> ok that is curious
21:51 <corvus> oh yeah there's more logs than the paste
21:51 <corvus> that does point toward recent changes
21:52 <clarkb> ya sorry I realized i could grep by the request id in the scheduler
21:52 <clarkb> nodepool does not see that request id show up in any of its logs after the first pass of request handling
21:52 <clarkb> basically the resubmitting appears to not have happened
21:53 <clarkb> I suspect 802362,1 may be in a similar state?
21:53 <corvus> it'll get a new request id
21:54 <clarkb> ah ok let me dig more then
21:54 <corvus> there's a good chance that the new request may be ignored by zuul since it also acked the original
21:54 <corvus> still reading code
21:54 <clarkb> 299-0014954458 is the updated request id. I'm looking for it now
21:55 <clarkb> that one did the same thing on nl04. It was fulfilled and unlocked so nl04 removed its request handler for it
21:57 <clarkb> the scheduler only logs this: Updating node request <NodeRequest 299-0014954458 <NodeSet ubuntu-focal [<Node 0025752729 ('ubuntu-focal',):ubuntu-focal>]>> for that request id
21:57 <clarkb> it doesn't say anything about the request going away and resubmitting like the first one
22:04 <corvus> hrm, the updated request in zk looks correct
22:04 <corvus> {"state": "fulfilled", "state_time": 1628022502.423527, "declined_by": [], "node_types": ["ubuntu-focal"], "nodes": ["0025752729", "0025752729"], "reuse": true, "requestor": "zuul02.opendev.org", "requestor_data": {"build_set_uuid": "b397fc88821b4b398469a227dfde24f7", "tenant_name": "openstack", "pipeline_name": "check", "job_name": "system-config-build-image-gitea-init"}, "provider": "ovh-gra1", "relative_priority": 0, "event_id": "7e9b6a5b1e2f41989b016e4b4699346a"}
22:05 <corvus> though i wonder why the node id shows up twice
22:05 <corvus> oh, i wonder if that's due to the re-request
22:06 <clarkb> is https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L464 sufficient to tell the scheduler where the new request is?
22:06 <corvus> clarkb: i don't think the scheduler should use that any more -- it should rely on the requestor_data in the request itself now
22:07 <clarkb> gotcha
22:08 <corvus> in fact, that may be nearly dead code at this point
22:08 <clarkb> did _updateNodeRequest(), which is the watcher for the request, possibly return false so we stopped watching it, but the scheduler didn't/doesn't know about it somehow?
22:08 <fungi> clarkb: yes, we've been seeing this off and on since at least april
22:08 <clarkb> fungi: this one seems different fwiw
22:09 <clarkb> fungi: I tried the restart and no luck. corvus is digging in above
22:09 <fungi> restarting the launcher frees the node request locks it may be holding and allows other launchers to try to satisfy them
22:10 <fungi> more recently i've noticed that after some time zuul reissues a new node request id, and i've seen the same launcher keep grabbing and sitting on it
22:10 <fungi> okay, caught up reading now
22:11 <fungi> it seems like the stuck node requests we used to see have more recently been replaced by the re-request behavior where the same launcher which was sitting on the old request also grabs and sits on the new one
22:11 <fungi> the other day i traced a log where that happened three times in sequence for the same build
22:11 <clarkb> corvus: reading the code around where I just linked and _updateNodeRequest, I notice we are reusing the request object and setting some flags on it rather than creating a new one. Is it possible some status like request.canceled is set and when the node request comes back we essentially ignore it
22:12 <corvus> 2021-08-03 20:28:05,916 INFO kazoo.client: Zookeeper connection lost
22:12 <corvus> so it was correct to resubmit the request; the error here is that the resubmitted request did not work correctly
22:13 <clarkb> corvus: ya that is what I'm leaning towards. However I don't think it was canceled unless the zk connection does that implicitly without logging it somehow
22:13 <clarkb> neither the original request id nor the new request id logs that it was canceled
22:13 <corvus> clarkb: we shouldn't get to the point of resubmitting a request if it's canceled
22:13 <corvus> clarkb: requests are ephemeral, so zk will automatically delete it
22:19 <clarkb> corvus: I'm just going through the watcher callback and trying to rule branches out. "Request .* is unknown" doesn't show up in the log file at all so we didn't hit that branch.
22:19 <clarkb> 'Resubmitting lost node request' doesn't show up for these requests so we don't hit that branch either
22:19 <corvus> clarkb: since we saw one "Updating node request" from the scheduler, but nothing after that, it does sort of seem like the callback may have returned false
22:20 <clarkb> and "Node request 123-foo FULFILLED" doesn't show up so we don't hit that branch
22:20 <clarkb> leaving the canceled branch, but I don't think we hit that either
22:20 <clarkb> corvus: it's like the watcher went away entirely
22:21 <clarkb> I don't think the request was canceled because when we cancel a request that produces other logging which doesn't show up, though we don't explicitly log for that in updateNodeRequest()
22:22 <corvus> clarkb: a tricky thing to look out for: request.uid != request.id -- so the "Request %s is unknown" log may not show up in our greps.  however, "is unknown" doesn't show up at all, so we know that didn't hit
22:23 <clarkb> corvus: yup I ended up grepping for `Request .* is unknown` for that reason
22:26 <corvus> clarkb: fwiw, i think the instigator of this incident is the event size issue that we merged fixes for yesterday
22:26 <corvus> that doesn't make this okay -- we still have a problem.  but as an aside, the underlying zk disconnect issue should already be addressed.
22:27 <clarkb> that's good
22:27 <clarkb> corvus: does it make sense to abandon and restore this change for now?
22:27 <clarkb> and have zuul retry from scratch?
22:27 <clarkb> or do we maybe want to restart on the changes from yesterday?
22:28 <clarkb> (and queue restore will reenqueue)
22:28 <corvus> clarkb: yes, but if you can hold just a bit longer that'd be great
22:28 <JayF> If I have a job running, with a bunch of `nodepool` jobs that aren't starting, would this be a similar symptom of this issue
22:28 <clarkb> corvus: yup not in a rush. I already waited a bunch :)
22:28 <JayF> this one is pretty fresh if that makes a difference to you :)
22:28 <clarkb> JayF: if it's fresh then it can just be normal delay
22:29 <JayF> ack, just thought I'd offer up an example if there was a need for one
22:29 <clarkb> JayF: if this is for your dib change I think it is waiting on docker images to build, and that is slow because arm64 images are built via qemu or something
22:29 <clarkb> I don't think that change is in a bad spot yet
22:29 <JayF> I'm just judging based off how long the other DIB change I landed this afternoon took to run them
22:29 <corvus> clarkb: i agree the only way to exit the callback returning False with no log lines is if the req was canceled (or maybe an exception would stop the watcher?)
22:29 <JayF> which is highly subjective anyway :)
22:29 <clarkb> JayF: specifically nodepool-build-image-siblings needs to succeed, then those waiting jobs will all grab the built image from there
22:29 <JayF> oh wow, that's cool, I didn't realize you could seed jobs that way
22:30 <JayF> I might go look at that config instead of doomscrolling zuul status :D
22:30 <corvus> JayF: you can also mouseover "waiting" on the status page
22:30 <corvus> zuul will tell you what it's waiting for
22:30 <JayF> :-O
22:30 <clarkb> corvus: my suspicion is that something related to the carried-over state is leaking
22:30 <JayF> how have I been using openstack/zuul for this long and never realized that! Thanks!
22:31 <clarkb> corvus: since we reuse the request object and modify it when resubmitting rather than creating a new object
22:31 <corvus> JayF: it might be because we added it 2 months ago..... but then again... it might not be.  ;)
22:31 <clarkb> corvus: and somehow that is causing updateNodeRequest() to take an unexpected path
22:32 <corvus> clarkb: but the only possible path that could be is canceled, right?
22:32 <corvus> and that's not possible without a "Canceling node request" line
22:32 <clarkb> corvus: yes and yes
22:32 <corvus> (because that's the only place canceled=True is set)
22:33 <corvus> clarkb: i think i should fire up the repl and see if i can find the request object(s)
22:33 <clarkb> corvus: ok let me know if I can help more
22:33 <clarkb> I'll hold off on changing anything about the change itself until you give the all clear
22:33 <corvus> thanks
22:36 <clarkb> JayF: you'll notice those jobs have flipped from waiting to queued. Now they are queued up and looking for nodes to run on
22:36 <corvus> the replacement request is in memory, the original is not
22:37 <clarkb> corvus: aren't they the same object?
22:37 <corvus> good point :)
22:37 <corvus> req.canceled = False
22:37 <clarkb> https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L462-L463 because of that
22:37 <corvus> the uid is '47debb1d7af84d87918de66a7d3c9649'
22:38 <clarkb> we don't seem to log that uid at all
22:38 <corvus> req.state = 'requested'
22:38 <corvus> nope, only on that one "is unknown" line.
22:39 <clarkb> if state is requested and canceled is false then we should've fallen through and returned true
22:39 <clarkb> because we know it wasn't deleted (lack of logs for deleted)
22:40 <clarkb> is it possible we've somehow missed a watcher event or the watcher is no longer watching?
22:40 <clarkb> oh wait
22:40 <clarkb> 2021-08-03 20:28:22,423 DEBUG nodepool.driver.NodeRequestHandler[nl04.opendev.org-PoolWorker.ovh-gra1-main-8e061a04c2b24871b6e337ad89b66404]: [e: 7e9b6a5b1e2f41989b016e4b4699346a] [node_request: 299-0014954458] Fulfilled node request
22:40 <clarkb> that should've set req.state to fulfilled not requested
22:41 <corvus> clarkb: right, the zk node reflects that, the object in zuul's scheduler memory says requested
22:41 <clarkb> ah
22:41 <corvus> so that tells us that the watcher didn't fire for the fulfilled event for the re-request (the logs already told us that)
22:41 <clarkb> yup
22:41 <clarkb> it did fire at some point because we got the single log line, but it didn't fire for the fulfilled state
22:42 <clarkb> and we're running out of possibilities for a return False there
22:43 <corvus> it's possible the first callback is just the kazoo datawatch firing on creation, and that we never actually got a zk watch notification
22:44 <clarkb> good point
22:44 <clarkb> I forgot it does that
22:44 <clarkb> in fact it would have to be, right? since we got the log line and it does call on creation of the watch
22:45 <corvus> i think so
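(For reference, kazoo's DataWatch does behave the way corvus describes: the callback runs once immediately when the watch is registered, and it stops receiving events if it returns False. A minimal sketch, assuming a reachable test ZooKeeper; the znode path is illustrative:)

```python
# Minimal kazoo DataWatch sketch (assumes a test ZooKeeper on localhost).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

@zk.DataWatch("/nodepool/requests/299-0014954458")  # illustrative path
def watch(data, stat):
    # Fires once at registration time, before any actual change -- which
    # would account for the single "Updating node request" log line.
    print("callback fired:", stat)
    if data is None:
        return False  # znode is gone: returning False stops the watch
```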
22:48 <corvus> interesting seeing a node request in 'requested' state with nodes in its nodeset
22:49 <corvus> (because the nodeset was updated by the callback from the previous request)
22:49 <clarkb> oh, is that a bug because we don't clear out the nodes from the request and then reuse it?
22:49 <clarkb> ya
22:49 <clarkb> and then when it went back to nl04 it gave the same node back again
22:51 <clarkb> JayF: your change is running builds for all jobs now. It should end up being fine
22:51 <corvus> the zk data that it has (from the original request) only has one node id (which is correct).  the current data in zk has 2.
22:51 <corvus> i suspect that's a second (minor?) bug
22:52 <corvus> that's probably something we need to clear out but don't.
22:52 <corvus> fwiw, i think if we can drop the nodepool.requests dict, we can use fresh noderequest objects
22:53 <JayF> clarkb: I see that, thank you :D
23:04 <clarkb> corvus: thinking out loud a bit more. Is it possible that because we lost the zk connection there was some associated watcher cleanup on the scheduler side, but since we submitted a new request we got the initial call for a watch against a dead watch again?
23:04 <clarkb> I think that wouldn't be the case because from the zk perspective the node request did change
23:05 <clarkb> since the node request id in zk is the path entry
23:05 <corvus> clarkb: i've been thinking along similar lines, but yeah, from zk/kazoo they're just operating on paths, so it looks like 2 different paths, nodes, watchers
23:05 <corvus> (it's just that the watcher closures happen to have the same first argument)
23:05 <clarkb> yup
23:06 <corvus> now that does mean that a zombie watch callback for the first path would affect the shared request object, but we don't have a log entry indicating that
23:07 <clarkb> and it is an ephemeral node so in theory zk cleaned that up before we reconnected
23:07 <clarkb> right?
23:07 <corvus> yep
23:07 <clarkb> it would be weird if ephemeral nodes could make it across connections in that way
23:07 <corvus> maybe we should ask zk what watches it has
23:08 <corvus> that's going to be a bit longer than it used to be
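(ZooKeeper's four-letter admin commands can answer this: wchp lists watches grouped by znode path. A sketch -- wchp has to be whitelisted via 4lw.commands.whitelist on the server, and the host/port here are assumptions:)

```python
# Query ZooKeeper's "wchp" four-letter command to list watches by path.
import socket

def four_letter_word(cmd: str, host: str = "127.0.0.1", port: int = 2181) -> str:
    with socket.create_connection((host, port)) as sock:
        sock.sendall(cmd.encode())
        chunks = []
        while data := sock.recv(4096):  # server closes the connection when done
            chunks.append(data)
    return b"".join(chunks).decode()

print(four_letter_word("wchp"))  # watches grouped by znode path
```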
23:11 <corvus> nl01, nl02, and two sessions from nl04 hold watches on that path
23:12 <clarkb> but none from the scheduler
23:12 <clarkb> which would explain the behavior we see but not explain how or why it went wrong
23:12 <corvus> appears to be the case
23:12 <corvus> yep
23:13 <clarkb> nl01 and nl02 both yielded the request iirc
23:13 <opendevreview> Merged openstack/diskimage-builder master: Permit specification of extra bootstrap packages  https://review.opendev.org/c/openstack/diskimage-builder/+/802592
23:13 <clarkb> but nl03 didn't
23:17 <corvus> 2021-08-03 20:28:45,369 ERROR kazoo.recipe.watchers: list index out of range
23:17 <corvus> 2021-08-03 20:28:45,374 ERROR kazoo.handlers.threading: Exception in worker queue thread
23:18 <corvus>   File "/usr/local/lib/python3.8/site-packages/zuul/zk/nodepool.py", line 503, in updateNodeRequest
23:18 <corvus>     request_nodes[i].id = nodeid
23:18 <corvus> i think it's actually the mismatch in node lengths
23:18 <clarkb> oh neat
23:18 <clarkb> I guess the obvious weird thing was the thing to look at
23:18 <corvus> it looks like an exception there will cause it to stop watching
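(Putting the pieces together, a toy reproduction of the suspected failure mode, with simplified stand-ins rather than zuul's actual classes: the resubmitted request reuses the old object, whose node list still holds the node from the first fulfillment; when the re-request's znode reports the node id twice, the callback indexes past the stale list, raises, and the watch stops delivering events:)

```python
# Toy stand-ins for zuul's NodeRequest and the updateNodeRequest callback.
class Node:
    def __init__(self):
        self.id = None

class NodeRequest:
    def __init__(self):
        # carried-over state: one node left from the original fulfillment
        self.nodes = [Node()]

def update_node_request(request, zk_node_ids):
    # mirrors zuul/zk/nodepool.py line 503: copy ids onto the existing list
    for i, nodeid in enumerate(zk_node_ids):
        request.nodes[i].id = nodeid  # IndexError when zk lists more ids

req = NodeRequest()
try:
    # the re-request's znode ended up with the node id listed twice
    update_node_request(req, ["0025752729", "0025752729"])
except IndexError:
    print("callback raised -> watch stops -> the request looks stuck")
```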
23:20 <corvus> clarkb: so i think your hunch of "we're not clearing something out" was right -- and the thing we're not clearing out is the node list
23:20 <clarkb> that method gets attached to the callback we were looking at previously
23:20 <clarkb> what isn't clear to me is why the callback raising would cause the watch to go away? I thought you had to return False for that
23:22 <clarkb> in any case that should be an easy fix. We need to more aggressively clean up the previous state?
23:22 <corvus> clarkb: yeah, me neither.  i think we'll need to check the kazoo watch callback to fully understand that (i think it's outside the scope of the datawatch)
23:23 <corvus> clarkb: yes -- though i think that avoiding re-use of that object will solve this problem, and we can do that if we drop the internal dict
23:23 <corvus> if that's not a trivial change, then yeah, let's just clear out the node list for now
23:23 <corvus> i will finish up the work i have going on the merger then do that
23:24 <clarkb> corvus: ok, am I good to abandon and restore?
23:24 <corvus> clarkb: one sec
23:24 <clarkb> looking at the code it appears we use that dict to count things for statsd
23:25 <clarkb> but otherwise it's just internal accounting that we maybe don't need anymore
23:25 <corvus> oh yep, it's the current request length
23:26 <clarkb> corvus: we could use a simple counter for that though?
23:26 <clarkb> increment and decrement instead of adding and removing from a dict
23:26 <clarkb> and that integer won't drag state along with it and cause unexpected things to happen
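(A sketch of that idea -- a plain thread-safe counter for the statsd accounting instead of a dict that keeps NodeRequest objects alive; names here are illustrative, not zuul's actual attributes:)

```python
# Illustrative replacement for the internal requests dict (hypothetical names).
import threading

class OutstandingRequestCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def inc(self):
        with self._lock:
            self._count += 1

    def dec(self):
        with self._lock:
            self._count -= 1

    def value(self):
        with self._lock:
            return self._count
```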
23:27 <clarkb> Anyway I'll stand by on abandon/restore until you give the go ahead
23:27 <corvus> clarkb: okay, i cleaned up the repl and shut it off.  feel free to proceed
23:28 <clarkb> proceeding, thanks
23:28 <clarkb> there are a couple other changes that exhibit similar behavior that I will do this to as well
23:30 <clarkb> oh ha, I cannot abandon and restore in some repos because we changed how admin works
23:32 <fungi> so you don't think restarting the launcher holding those node requests will work this time? it seemed to address the instances i saw last week, some of which had multiple reissued requests
23:33 <corvus> in this case, the launchers are not holding the node request.  it's done with it.
23:33 <clarkb> fungi: I did restart the launcher and it did not help
23:33 <corvus> nothing will cause zuul to see an update to this particular node request
23:33 <clarkb> doing that and not changing the situation is what led to me doing the deeper dive debug above
23:33 <fungi> ahh, okay
23:34 <fungi> and zuul dequeue won't do it either?
23:35 <clarkb> fungi: that will probably work. It's easier to click buttons in the gerrit web ui though :P We can do dequeue/enqueue for the starlingx/ha change I guess
23:35 <clarkb> but my ssh keys are telling me I shouldn't :P
23:47 <timburke_> huh. anything change with paste.opendev.org lately? i just made https://paste.opendev.org/show/807865/, but it doesn't show the content -- yet when i go to https://paste.opendev.org/raw/807865/, it's all there :-/
23:59 <Clark[m]> timburke_: the server was upgraded and the db server changed as a result
23:59 <Clark[m]> I wonder if this is a weird interaction resulting from that
