Friday, 2022-05-27

*** rlandy|biab is now known as rlandy|out00:41
Clark[m]It is on to ze08 now01:18
fungiprobably still won't be through the remaining executors before i get to sleep01:20
ianwBlaisePabon[m]: congratulations :)  do you have to work in with existing deployment/IT things, or could you start out with a Zuul-based CI/CD for your control-plane from day 1?01:31
fungialready stopping ze10 now, so definitely speeding up02:03
Clark[m]Ya but periodic jobs just happened to slow us down again :)02:08
fungibah02:13
fungihopefully the periodics are mostly short-runners and don't involve any paused builds02:14
BlaisePabon[m]<ianw> "Blaise Pabon: congratulations :)..." <- I can do whatever I want!!02:37
BlaisePabon[m]In fact, the expectation is that I should start from a clean slate... and in fact, I don't have a choice because there is no CI and no CD at present.02:40
BlaisePabon[m](as in, 90's style, take the server offline, ssh as the user called `root` and then proceed to `yum install ...` and `npm build` for 90 mins)02:40
BlaisePabon[m]So if ever anyone wanted to set up an exemplary zuul-ci configuration, this would be it.02:41
BlaisePabon[m]fwiw, I'm rather comfortable with Docker, git, python and Kubernetes02:41
BlaisePabon[m]btw, I figured out how to set up reverse proxies for the servers in my garage. A while back I had offered to make them available to the nodepool, so the offer still stands.02:45
BlaisePabon[m]oh, and, in full disclosure, I'm not sure I know what you mean by `Zuul-based CI/CD for your control-plane` but whatever it is, I can do it.02:47
ianwi mean that what we do is have Zuul actually deploy gerrit (and all our other services, including zuul itself :)02:59
opendevreviewKenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from  https://review.opendev.org/c/zuul/zuul-jobs/+/84352703:02
opendevreviewKenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from  https://review.opendev.org/c/zuul/zuul-jobs/+/84352703:04
opendevreviewKenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from  https://review.opendev.org/c/zuul/zuul-jobs/+/84352703:10
ianwi'm just spinning up a test in ovh and noticed a bunch of really old leaked images.  i'll clean them up04:00
ianwno servers; image uploads.  they have timestamps starting with 1504:01
*** ysandeep|out is now known as ysandeep04:12
ianwthere's still a few in there in state "deleted" that don't seem to go away.  don't think there's much we can do client side on that04:14
*** marios is now known as marios|ruck05:01
*** bhagyashris is now known as bhagyashris|ruck05:08
ykarelfrickler, can you please check https://review.opendev.org/c/openstack/devstack-gate/+/84314805:55
mnasiadkafungi, clarkb: a lot of jobs are being run only on changes to particular files, ceph jobs are non-voting because they are sometimes failing due to their complexity (multinode, ceph deployment, openstack deployment, etc). We moved to use cephadm in Wallaby (and are working to improve the failure rate), since Victoria is EM - we could remove those jobs (I was not aware they are failing so much).06:23
*** ysandeep is now known as ysandeep|afk06:31
opendevreviewMerged openstack/project-config master: Add ops to openstack-ansible-sig channel  https://review.opendev.org/c/openstack/project-config/+/84349207:03
*** ysandeep|afk is now known as ysandeep07:22
opendevreviewIan Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration  https://review.opendev.org/c/opendev/glean/+/84324307:25
ianwfungi/clarkb: ^ i've validated that on an OVH centos-9 node.  more testing to do, but i think that is ~ what it will end up like07:25
opendevreviewIan Wienand proposed opendev/glean master: _network_info: refactor to add ipv4 info at the end  https://review.opendev.org/c/opendev/glean/+/84336707:30
opendevreviewIan Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration  https://review.opendev.org/c/opendev/glean/+/84324307:30
*** marios|ruck is now known as marios|ruck|afk08:44
*** marios|ruck|afk is now known as marios|ruck09:38
*** rlandy|out is now known as rlandy10:03
*** soniya29 is now known as soniya29|afk10:07
*** ysandeep is now known as ysandeep|afk10:30
*** ysandeep|afk is now known as ysandeep10:43
yoctozeptoinfra-root: (cc frickler) hi! may I request a hold of the node used for the job that runs for change #84358310:58
yoctozeptothe job name is kolla-ansible-ubuntu-source-zun-upgrade10:58
*** ysandeep is now known as ysandeep|afk11:06
yoctozeptomy ssh key https://github.com/yoctozepto.keys11:07
*** soniya29|afk is now known as soniya2911:08
*** rlandy is now known as rlandy|PTOish11:09
frickleryoctozepto: no need to double-highlight me ;) will set it up now11:14
*** dviroel|out is now known as dviroel11:15
fricklerand done11:16
fungias corvus predicted, we have merger graceful stopping problems. i'll leave the playbook in its present hung state for now, but essentially we upgraded all the executors and have been waiting for many hours for zm01 to gracefully stop (which it obviously has no intention of doing)11:34
fungiwe can probably manually swizzle the processes on the mergers in order to make the playbook think it stopped them so it will proceed through the rest of the list, but i'll sit on that idea until we have more folks on hand to tell me i'm crazy11:35
yoctozeptofrickler: thanks; the more, the merrier, or so they say :-)11:37
*** dviroel is now known as dviroel|afk13:07
yoctozeptofrickler: somehow I cannot get onto that node, my key is rejected13:23
fungiyoctozepto: as root?13:30
yoctozeptofungi: root@23.253.20.188: Permission denied (publickey).13:30
Clark[m]fungi: I'm not properly at a keyboard for a while yet but I think once the merger no longer shows up in the components list you can manually docker-compose down on that server to kill the container which will cause the playbook to continue. I can do that in about an hour and a half myself once at the proper keyboard13:37
fungiClark[m]: yeah, that's what i was thinking of doing, just didn't want to proceed until more folks are around since it'll churn through the remaining services rather quickly13:38
fungiyoctozepto: i've added your ssh key to the held node now13:39
yoctozeptothanks fungi, it works (cc frickler)13:44
corvusfungi: Clark yes that's what i would suggest13:49
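A minimal sketch of the manual workaround being discussed, assuming the merger containers are managed with docker-compose like other OpenDev services (the compose directory path here is an assumption, not the actual layout on the merger hosts):

    # Once zm01 no longer appears on the Zuul components page, stop its
    # container by hand so the reboot playbook stops waiting and moves on.
    ssh root@zm01.opendev.org
    cd /etc/zuul-merger   # assumed compose directory
    docker-compose down

The same steps would then be repeated for each merger as it drops out of the components list.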
frickleryoctozepto: ah, I was about to add your key now, seems fungi already did that, thx14:10
yoctozeptofungi, frickler: thx, I powered off that machine, it can be returned to the pool14:31
fungiyoctozepto: thanks! i've released it14:35
*** ysandeep|afk is now known as ysandeep14:46
fungiClark[m]: see the #openstack-cinder channel log for some discussion of more meetpad audio strife... digging around i ran across these which have a potential solution for chrom*'s autoplay permission and might also work around the problem in ff? https://github.com/jitsi/jitsi-meet/issues/10633 https://github.com/jitsi/jitsi-meet/issues/952814:47
fungispecifically, adding a pre-join page means users click on/enter something before the call starts, which is enough of a signal for the browser to consider that the user has granted auto-play permission for that session14:48
fungiour config dumps people straight into the call without them needing to interact before the audio stream starts, which seems to maybe be the problem14:49
fungialso ran across another comment buried in an issue suggesting to switch media.webrtc.hw.h264.enabled to true in about:config on ff14:53
fungi(it's still false by default even in my ff 100)14:54
Clark[m]Enabling the join page looks like a simple config update at least. Asking users to edit about:config is probably best avoided14:54
fungiyeah, that was separate, for improving streaming performance on ff14:56
fungithough it looks like it's probably a bad idea to switch on unless you've got at least ff 96 when they merged an updated libwebrtc14:56
fungibut yes, i'm in favor of trying to add a pre-join page and seeing if that helps. i'll propose a change14:57
fungialso more generally, it looks like the lack of simulcast support between jitsi-meet and firefox is likely to still create additional load on all participants the more firefox users join the call with video on14:58
fungisince for firefox it ends up falling back on peer-to-peer streams14:59
fungior at least that's how i read the discussions14:59
Clark[m]I think we explicitly disable peer to peer15:00
fungiahh, okay15:00
fungithen maybe not for our case15:00
Clark[m]The problem aiui is webrtc is expensive for video and just adding video bogs things down. Zoom web client which isn't webrtc does the same thing15:00
Clark[m]Add in devices that thermal throttle (MacBooks) and problems abound :(15:01
fungifor sure15:01
fungiand yes, even on my workstation i end up setting zoom's in-browser client to disable incoming video15:02
fungiokay, since it's getting to the point in the day where more people are going to be around, i'll start manually downing the docker containers for each merger as they disappear from the components page, one by one15:03
fungistarting with zm01 now15:03
fungias soon as i did that, the system kicked me out for a reboot and the playbook progressed15:04
fungilooks like it came back and zm02 is down now so doing the same for it15:04
fungii'll wait when it gets to zm08, so everybody's got warning when the scheduler/web containers are going down15:05
Clark[m]++ I should be home soon15:06
fungino need to rush15:07
fungiall done except for zm08, and i've got the docker-compose down queued for that so ready to proceed when others are15:14
clarkbI'm here now just without ssh keys loaded yet15:17
clarkband now that is done. We can probably proceed unless you wanted corvus to ack too15:18
ykarelHi is there some known issue with unbound on c9-stream fips jobs15:22
clarkbykarel: there is a race where ansible continues running job stuff after the fips reboot but before unbound is up and running15:23
fungiykarel: the fips setup reboots the machine, which seems to result in unbound coming undone. i think there was some work in progress to make the unbound startup wait for networking to be marked ready by systemd first15:23
clarkbyup I think the idea was to encode all of that into a post reboot role in zuul-jobs. Then whether or not you are doing fips you can run that in your jobs to ensure the test node is ready before continuing15:23
fungiand yeah, it's basically that the job proceeds after the reboot when dns resolution isn't working yet15:23
ykarelclarkb, fungi okk so it's something known15:24
ykarelThanks15:24
ykarelmaybe after reboot the job can wait for some time until unbound is up15:24
fungiclarkb: i thought the idea was to change the service unit dependencies in the centos images to make sure sshd isn't running until unbound is fully started15:24
clarkbfungi: no I suggested against that because then you need our images to run the tests successfully15:25
clarkbI suggested that the test jobs themselves become smart enough to handle the distro behavior15:25
fungiwell, you need our images to run the tests successfully if you're using unbound for a local dns cache (which is a decision we made in our images)15:26
clarkbyou have to do a couple things post reboot like starting the console logger anyway so encoding all of that into an easy to use role makes sense15:26
clarkbfungi: its a decision we made in our images but we just install the normal distro package for it15:26
clarkbit's not like this is a bug in our use of unbound. Distro systemd is allowing ssh connections before dns resolvers are up15:27
clarkband systemd is sort of designed to do that15:27
clarkb(speed up boots even if you end up on a machine that can't do much for a few extra seconds and all that)15:27
fungibut yes, i can see the logic in forcing the job to wait until the console stream is running again, so checking that dns resolves successfully somehow is reasonable to do at the same time15:27
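A rough sketch of the kind of post-reboot readiness check being proposed (the role does not exist yet at this point, so the task layout, hostnames, and timeouts below are all assumptions rather than the eventual zuul-jobs implementation; the real role would presumably also restart the zuul console log streamer, per the discussion above):

    # Hypothetical post-reboot tasks: wait for the node to come back, then
    # wait until local DNS resolution works before the job continues.
    - name: Wait for the test node to come back after the reboot
      wait_for_connection:
        timeout: 300

    - name: Wait until DNS resolution works again
      command: python3 -c "import socket; socket.getaddrinfo('opendev.org', 443)"
      register: dns_check
      until: dns_check.rc == 0
      retries: 30
      delay: 5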
fungiokay, downing the container on zm08 now at 15:30z15:30
fungiand it's rebooting15:30
fungiand the containers on zuul01 have stopped and it's rebooting now15:31
johnsomade_lee Has a patch been proposed for the zuul task to wait for DNS?15:31
fungicontainers are starting on zuul0115:32
fungiclarkb: is this benign? "[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: {{ components.status == 200 and components.content | from_json | json_query(scheduler_query) | length == 1 and components.content | from_json | json_query(scheduler_query) | first == 'running' }}"15:32
clarkbfungi: yes I made note of that when I was testing this15:34
clarkbit was the only way I could get the ? in the query var to not explode as a parse error15:34
fungithanks, looked familiar15:35
clarkbfungi: I suspect this is a corner case of ansible's jinja parsing: ansible really wants you to use the less syntax-heavy form because it makes ansible look better, but that form isn't as expressive and can have issues, as I found15:35
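For reference, a hypothetical reconstruction of the kind of task that produces this warning; the URL and JMESPath query are assumptions, but the expression is the one quoted in the warning above:

    # Ansible prefers a bare conditional (until: components.status == 200 and ...),
    # but the '?' inside the json_query expression did not parse cleanly that
    # way, hence the explicit {{ ... }} templating and the resulting warning.
    - name: Wait for the scheduler component to report running
      uri:
        url: https://zuul.opendev.org/api/components
        return_content: true
      register: components
      vars:
        scheduler_query: "scheduler[?hostname=='zuul02.opendev.org'].state"
      until: "{{ components.status == 200 and
                 components.content | from_json | json_query(scheduler_query) | length == 1 and
                 components.content | from_json | json_query(scheduler_query) | first == 'running' }}"
      retries: 60
      delay: 30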
clarkbyoctozepto: note my response on https://review.opendev.org/c/openstack/kolla-ansible/+/84353615:36
clarkbfungi: corvus: one thing I wonder is if having web and scheduler fight over initializing in the db may cause the whole thing to be slower? I guess they might be slower individually but since we run them concurrently wall time should be less?15:38
clarkbfungi: unrelated did you see https://storyboard.openstack.org/#!/story/2010054 I'm having a hard time understanding that one since all of our repos have / in their names too. I wonder if the actual repo dir has a / in it. We do openstack dir containing nova repo dir. Maybe they are doing something like openstack/nova is the repo dir (how you would convince a filesystem of that I15:40
clarkbdon't know)15:40
clarkboh you know what? I wonder if they need to edit their gerrit config15:40
clarkbthere is a way to tell it to encode nested slashes iirc15:41
fungii think we do that, yeah15:42
yoctozeptoclarkb: im on mobile atm but your comment looks reasonable, the other case is something we were not aware of, we will amend our ways then, thanks15:42
fungiclarkb: we set it in the apache config actually15:43
clarkbfungi: aha15:43
fungii'll link the example15:44
clarkbyoctozepto: ya it's always retriable if it happens in pre-run regardless of the reason. But then in any phase it is retriable if ansible reports a network error (and for reasons filling the remote disk results in network errors)15:44
clarkbcorvus: 2022-05-27 15:55:47,506 ERROR zuul.Scheduler:   voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version'] <- I think zuul01 is unhappy with the zuul tenant config15:56
clarkbzuul01 still shows as initializing, but I think it is up?15:57
clarkbcould it be related to that error?15:57
*** marios|ruck is now known as marios|out15:57
opendevreviewClark Boylan proposed openstack/project-config master: The default ansible version in zuul config is a str not int  https://review.opendev.org/c/openstack/project-config/+/84365015:59
clarkbI think ^ that will fix things based on the error message. However, I'm not sure if initializing as the current state is ideal for zuul to report if it is running otherwise. Maybe "degraded" ?15:59
clarkbanyway I suspect that if we land 843650 zuul will switch over to running and the playbook will proceed but that is just a hunch16:01
clarkband we've got about 2.5 hours to do it before the playbook exits in error16:02
*** dviroel|afk is now known as dviroel16:08
corvusclarkb: approved 65016:09
clarkblooking at zuul01's scheduler log more closely I think degraded is not really accurate either16:09
clarkbthe process is up and running but it isn't processing pipelines16:09
clarkbmaybe an ERROR state would be best then?16:09
clarkbit is just logging side effects caused by zuul02's operation if I am reading this correctly16:10
corvusclarkb: i think it's restarted 2x16:13
clarkbcorvus: hrm is that something docker would've helpfully done for us?16:14
clarkbit exited with error maybe so docker started it?16:14
corvus2022-05-27 15:32:06,508 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev3416:14
corvus2022-05-27 15:55:47,506 ERROR zuul.Scheduler:   voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version']16:14
corvus2022-05-27 15:55:51,679 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev3416:14
corvusmaybe?16:15
corvusand yeah, it's getting data from zk now16:15
corvusso i think we're in a 30m long startup loop16:15
corvuswhich is great, actually; it means that 30m after 843650 lands it should succeed maybe hopefully?16:15
fungineat16:15
clarkbcorvus: that would be my expectation. If that comes in under the 3 hour total timeout wait period then zuul02 should get managed by the automated playbook too16:16
corvusoh -- but only if ansible puts that file in place16:16
clarkbcorvus: the regular deploy job should do that16:16
corvuscool16:16
fungiand we haven't blocked deployments so that should happen16:16
corvuswasn't sure how much was disabled (but i'm glad that isn't -- i don't think we need to)16:16
fungiand i think we've got plenty of time before the timeout is reached, yeah16:17
clarkbcorvus: nothing is currently disabled16:17
corvus(it should be fine to do a tenant reconfig during a rolling restart)16:17
corvus(it would slow stuff down but shouldn't break)16:17
clarkbif 02 doesn't get automatically handled I can take care of it after the fix lands. Then we can retry the automated playbook after merger stop is fixed and with https://review.opendev.org/c/opendev/system-config/+/843549 if we think that is a good idea16:18
corvusclarkb: i +2d 549 ... will leave to you to +w16:19
clarkbcorvus: thanks.16:19
corvuszuul01 just restarted again16:20
clarkbif the timing estimates on the dashboard are accurate then the next restart should be happy16:20
corvusso assuming 650 lands soon, probably a successful restart around 16:4516:20
clarkb(I expect the fix will land in a couple of minutes and then the hourly zuul deploy should run shortly after that)16:20
clarkbthen after hourly is done the deploy for 650 will run and noop16:21
opendevreviewMerged openstack/project-config master: The default ansible version in zuul config is a str not int  https://review.opendev.org/c/openstack/project-config/+/84365016:22
fungiyeah, i'm happy approving 843549 any time, since we're manually running this anyway for now and the current run won't pick that up even if it merges in the middle since the playbook has already been read16:24
fungiand i don't expect to run it again until we at least think we have clean merger stops16:24
clarkbyup. We may need to restart the mergers after they are fixed to pick up the fix, but then we can run the automated playbook again and it should roll through without being paused16:25
clarkbcorvus: is the issue a thread that isn't exiting or isn't marked daemon?16:25
corvusclarkb: unsure -- i'm planning on taking a look at that tomorrow.  i'm sure it'll be something simple like that.16:26
clarkbok, no rush. I don't expect I'll be running this playbook over the weekend :)16:26
clarkbgood news is if we have another config error like this happen when we are all sleeping the playbook should timeout and error without proceeding to the second scheduler16:27
ade_leejohnsom, yeah - not yet -- I've been trying to find time to create it 16:31
ade_leejohnsom, hopefully by early next week16:32
johnsomade_lee Ack, thanks for the update16:32
ade_leefungi, clarkb - do you guys know anything about this error here? https://zuul.opendev.org/t/openstack/build/041ccac8861442a192beaabb7c9ca50016:32
ade_leefungi, clarkb something about oslo.log not being set correctly from upper constraints in train?16:33
clarkbade_lee: there are two problems that cause that. The first is trying to install a version of a library that sets a required python version that isn't compatible with the current python. The other is if pypi's CDN falls back to their backup backend and serves you a stale index without the new package present16:34
clarkbade_lee: in this case oslo.log==5.0.0 requires python>=3.8 and you appear to be using 3.616:35
clarkbwhich means it is the first issue16:35
ade_leeah16:35
fungiall installdeps: -chttps://releases.openstack.org/constraints/upper/master, -r/opt/stack/new/tempest/requirements.txt16:35
clarkbfungi: corvus: the hourly zuul deploy did not update to the fixed version as I thought it might. We have to wait for the normal deployment to happen which should happen soon enough16:35
fungiade_lee: coming from /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L89016:36
fungisudo -H -u tempest UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master TOX_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master tox -eall -- barbican --concurrency=416:36
clarkbnewer pip will actually tell you why it failed in this case rather than giving you a convoluted message16:36
fungiso yes, i think this is the first case clarkb mentioned16:37
fungitempest's virtualenv is built with the python3 from ubuntu-bionic and is trying to install the master branch constraints16:37
ade_leeclarkb, fungi thanks - so I need to switch the barbican gate to py3816:37
clarkband stop using devstack-gate16:37
clarkboh wait you said train though so the actual fix may be more stable branch specific16:38
clarkblike using an older tempest or something16:38
fungiade_lee: you might check with #openstack-qa, i think they noted some breakage to old stable branches from tempest et al dropping support for old python16:38
ade_leefungi, ack - will do16:38
fungibasically you're supposed to install a tagged version of tempest i think, at least that's how it's been handled in the past16:38
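A sketch of that usual remedy for an old stable branch like train: point the constraints at the branch rather than master and install a tempest release that still supports the branch's python (the tempest tag below is a placeholder, not the actual pin used for train):

    # Use train's upper-constraints instead of master, plus an older tempest.
    export UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/train
    export TOX_CONSTRAINTS_FILE=$UPPER_CONSTRAINTS_FILE
    pip install -c "$UPPER_CONSTRAINTS_FILE" 'tempest==26.1.0'  # placeholder tag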
clarkbwe just missed this restart so we have to wait for the next one16:43
clarkbproject-config is now updated. The next restart should work I hope16:46
clarkbso about 35 minutes away?16:46
mgariepyhello, can i have a hold on :  --project=opendev.org/openstack/openstack-ansible --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 --ref=refs/changes/17/823417/31 16:51
mgariepyto investigate a bootstrap issue on rocky ?16:51
fungimgariepy: sure, i'm curious to see this work while the schedulers are in the middle of a rolling restart. it could be an interesting test16:52
mgariepyhehe :D16:54
fungimgariepy: it seems to be set successfully16:54
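Roughly what setting that hold looks like with zuul-client; the flag values are taken from the request above, but the reason string and count are assumptions rather than the exact command that was run:

    zuul-client autohold --tenant=openstack \
      --project=opendev.org/openstack/openstack-ansible \
      --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 \
      --change=823417 \
      --reason="investigate rocky bootstrap issue" \
      --count=1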
mgariepyhopefully it will be ok to get it, the job just failed its 3rd attempt :/16:55
fungiif the build failed after i added the autohold at 16:54 utc then it should, otherwise it'll need a recheck16:56
mgariepycompleted at 2022-05-27 16:54:34 .16:56
fungizuul returns the nodes yep! just in time16:56
mgariepylol16:56
mgariepyit was close !16:57
fungiwhat's the link to your ssh public key again?16:57
mgariepyhttps://paste.openstack.org/show/bmEEcIcyQre3D8rn76hz/16:57
fungimgariepy: ssh root@104.239.175.23016:58
mgariepythanks 16:58
fungiyw16:58
fungiclarkb: zuul-web seems to have come up on zuul0117:00
clarkbfungi: ya it doesn't care about the tenant configs17:01
fungioh, so it came up earlier i guess17:01
clarkbit came up after the first restart17:01
fungigot it17:01
clarkbI think the current restart will fail then the next one will succeed since this current one started just before the fix was put in place on the server17:01
clarkbbut it does fail late in the process maybe that means it loads the config late enough to see the fix? I don't think so17:02
fungistill tons of time left in the timeout window anyway17:02
clarkbok its restarting zuul02 now so it did actually load late enough17:08
clarkbHowever I'm seeing a new error which may or may not be a problem for actual functionality17:08
clarkbhttps://paste.opendev.org/show/bHNqoDi2M2f9ExF9s1NH/17:10
clarkbI think this is a zuul model upgrading problem17:10
clarkbya I think this has effectively paused zuul job running :/17:11
clarkbya I see the issue17:12
clarkbhttps://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L2008 that attribute is added unconditionally17:13
opendevreviewMerged openstack/diskimage-builder master: Fix grub setup on Gentoo  https://review.opendev.org/c/openstack/diskimage-builder/+/84285617:13
clarkbbut it is part of the latest zuul model update so we've got old job content without that attribute and new trying to use it17:13
clarkbI think new jobs are happy and old old jobs are happy17:14
clarkbit's just the jobs that were started in the interim period that are broken17:14
clarkbconsidering that I'm somewhat inclined to let things roll for a bit. I don't think we'll get any worse. Then we should be able to evict and reenqueue any jobs that were caught in the middle?17:15
clarkbthat seems less impactful overall than doing a full restart and rollback to v617:15
fungiyeah, agreed17:17
fungithis is something other continuous deployments of zuul may need to be aware of17:18
mgariepythanks fungi you can remove the hold.17:18
mgariepyand kill the instance :)17:18
clarkbfungi: yes just left notes in the zuul matrix room17:19
fungimgariepy: done17:19
clarkbnote that a rollback to v6 may not actually be necessary17:19
clarkbas long as we start on the new model api. What may become necessary, depending on whether or not we can dequeue changes, is stopping zuul, deleting zk state, then starting zuul again17:19
fungiclarkb: do you still need the autohold labeled "Clarkb debugging jammy on devstack" or shall i clean it up while i'm in there?17:19
clarkbfungi: you can clean it up17:20
fungidone. thanks!17:20
clarkbfungi: another possible option available to us is modifying those jobs in zk directly. But that seems extra dangerous17:21
fungimmm, yeah17:22
clarkbthe two affected jobs I see regularly in the log are our infra-prod-service-bridge from the hourly jobs and tripleo-ci-centos-9-undercloud-containers from I don't know what yet17:23
clarkbfungi: if you are still in there do you want to try dequeuing our hourly deploy buildset?17:23
clarkbI'm going to try and identify where that tripleo-ci job is coming from so that we can evaluate if that is possible for it too17:23
clarkbbut I worry we won't be able to dequeue either due to this error17:24
clarkband we may need to stop the cluster and clear zk state to fix it17:24
clarkb843382,3 is the tripleo source of the problem I think17:25
fungishould we wait to dequeue things until zuul02 is fully up?17:25
clarkbso ya I wonder if we can dequeue that change and then reenqueue it and we'll be moving again17:25
fungior is that blocking the startup?17:25
clarkbfungi: I don't think that will affect startup17:25
clarkbthis is an issue in pipeline processing17:26
clarkbwhich happens after startup17:26
fungiso only 843382,3 needs to be dequeued and enqueued again, or are there others?17:26
clarkbthat's the only one I've identified that needs to be dequeued and enqueued again. our hourly buildset needs to just be dequeued and we'll let the next hour enqueue it17:27
clarkbThen if we still have trouble starting jobs we need to consider a full cluster shutdown, zk wipe, startup, reenqueue17:27
clarkb(it isn't clear to me if we're starting any new jobs currently fwiw)17:28
fungii did `sudo zuul-client dequeue --tenant=openstack --pipeline=check --project=openstack/puppet-tripleo --change=843382,3`17:29
fungithough it doesn't seem to have been processed yet17:29
fungithere's a management event pending for the check pipeline according to the status page17:30
clarkband the zuul01 debug log is quite idle right now17:31
fungiwhich i guess is this one?17:31
clarkbzuul01 is the up one 17:31
fungithere it went17:32
fungiokay, enqueuing it again now17:33
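The enqueue counterpart to the earlier dequeue (a sketch; the exact invocation wasn't pasted into the channel):

    sudo zuul-client enqueue --tenant=openstack --pipeline=check \
      --project=openstack/puppet-tripleo --change=843382,3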
clarkblooks like we're processing jobs too17:33
clarkbfungi: can you do the same for our hourly deploy?17:33
clarkbthe playbook completed and looks like it succeeded17:34
clarkbfungi: I think the pause was zuul01 and zuul02 synchronizing on the config as zuul02 came up17:35
clarkbI also suspect that if we remove our hourly deploy then the upgrade issue with the deduplicate attribute will be gone in our install17:35
clarkbbut also that other jobs seem to be running our deployment so if we just leave it that way for corvus to inspect later we're probably good17:35
clarkbthough we also have logs of the problem and I pasted them above too so that seems overkill17:36
fungii did `sudo zuul-client dequeue --tenant=openstack --pipeline=opendev-prod-hourly --project=opendev/system-config --ref=refs/heads/master`17:36
fungithat seems to have cleared it17:36
clarkb#status log Upgraded all of Zuul to 6.0.1.dev34 b1311a590. There was a minor hiccup with the new deduplicate attribute on jobs that forced us to dequeue/enqueue two buildsets. Otherwise seems to be running.17:37
opendevstatusclarkb: finished logging17:37
fungialso 843382,3 is back in check and running new builds17:37
clarkbfungi: ya so I think it was just those two jobs that had the mismatch in attributes as they raced the model update17:38
clarkbclearing them out and reenqueuing allowed the tripleo buildset to reenqueue under the new model api version and it is happy17:38
clarkbzuul itself will want to fix that for other people doing upgrades, but overall the impact was fairly minor once we took care of those17:39
clarkbfungi: I think all of the problems the rebooting playbook ran into were external to itself17:41
clarkband those problems should be fixable which is great17:42
fungiyep!17:43
clarkbI think I've convinced myself that a revert to zuulv6 is not necessary if we continue to have problems. We're more likely to need to do a zk state clear and then starting on the current version is fine17:58
clarkbsince the problem is consistency of the ephemeral jobs in zk between different versions of zuul. Starting on a single version of zuul with clear zk state should be fine17:58
fungimakes sense, yes17:59
fungii mean, that's what all zuul's own functional tests do anyway17:59
clarkbI think https://review.opendev.org/c/openstack/tempest/+/843542/ is a good canary. It is about 20 minutes out from merging in the gate if its last build passes.18:02
clarkbIt would've started before the problem was introduced18:02
clarkbI've got a small worry that jobs that started aren't as happy as they appear to be; however, I don't have real evidence of that yet18:03
clarkbhttps://review.opendev.org/c/openstack/openstack-ansible/+/843483/ too18:03
clarkbBut if they do fail due to this they should get evicted in the gate and all their children will be reenqueued and fine18:04
clarkbso again impact should be slight18:04
clarkbhttps://review.opendev.org/c/openstack/openstack-ansible/+/843483/ merged, so I think my fears are unfounded18:20
fungilgtm, yep18:20
fungiany reason to keep the screen session on bridge around now?18:21
fungiif not, i'll shut it down18:21
fungiwall clock time for that playbook was 1798m11.450s18:22
clarkbfungi: the time data probably isn't very useful after the merger pause. Also it probably ended up in the log file18:22
fungiright18:22
clarkbI think we can stop the screen18:22
fungiand done18:22
clarkbif we guesstimate how long it took without the merger pause and without the config error, probably about a day. Just over a day?18:26
clarkbThat is better than I anticipated18:26
clarkband even before it is fully automated we can run it manually when appropriate18:27
fungiclarkb: looking at the example in https://github.com/jitsi/jitsi-meet/issues/10633 it's setting an enable flag inside a prejoinConfig array and the comment says it replaces prejoinPageEnabled, but our settings-config.js uses config.prejoinPageEnabled18:32
fungiare we using an outdated config file format?18:32
clarkbwe copy the config out of their container image file and then edit it iirc18:33
clarkbit is possible the content we copy out is out of date18:33
fungiokay, and maybe we haven't done that in a while18:33
clarkbhttps://review.opendev.org/c/openstack/tempest/+/843542/ has merged now too along with a whole stack of changes \o/18:34
clarkbhttps://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js I believe that is the upstream file18:35
fungilooks like https://review.opendev.org/781159 added that playbooks/roles/jitsi-meet/files/settings-config.js file over a year ago (march 2021), so was probably copied from the container around that time i would guess18:35
clarkbhttps://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L26518:35
clarkband https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L1118:36
fungiyeah, that looks like what we have18:36
clarkbso I think you just need to modify https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/jvb-env.j2 and https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/meet-env.j2 to set that flag18:36
fungiagreed, that's the commit i've drafted, but i started to question it after looking back at the example in the issue18:37
fungii only did meet-env.j2 but i can also add it to jvb-env.j2 if you think it's necessary18:38
clarkbIt isn't strictly necessary since we don't run the web service on the jvbs18:38
clarkbmaybe best to leave it out to avoid confusion18:39
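For context, the meet-env.j2 addition being drafted is presumably a single flag along these lines; ENABLE_PREJOIN_PAGE follows docker-jitsi-meet's env-var conventions, but treat the exact variable name as an assumption rather than what the proposed change actually uses:

    # meet-env.j2: show a pre-join page so users interact before media starts
    ENABLE_PREJOIN_PAGE=true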
fungilooks like they've updated the settings-config.js to default that to true though, so maybe we've diverged there after all18:40
fungiours defaults to false still18:40
clarkbya maybe we want to resync? compare the delta to make sure we haven't overridden anything in the settings-config.js (we should rely on the .env files for overrides) and then update?18:41
fungiso might be better to re-sync their files to our repo, right18:41
fungii'll diff and see what's changed18:41
clarkbfungi: I want to say we did a copy because there were some things we couldn't override via their config18:43
clarkbanother option is to stop supplying the overridden config entirely and rely on upstream's in the image if we have everything we need in the file now18:43
clarkbbut I'd need to look at file/git history to remember what exactly it was that was missing18:43
clarkbuseRoomAsSharedDocumentName and openSharedDocumentOnJoin according to c1bb5b52cfb00cb80555348614ee6ff1136c2f5218:44
fungiyep, gonna18:49
fungiclarkb: any idea where the playbooks/roles/jitsi-meet/files/interface_config.js came from?18:55
fungii can't seem to find it in the docker-jitsi-meet repo18:55
clarkbfungi:  I think interface_config.js is the config for the app on the browser side18:59
fungianyway, for the settings-config.js, this is the diff from ours to theirs: https://paste.opendev.org/show/bOI75ISjM1Zr4nKTxVbw/18:59
clarkbhttps://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L35-L37 upstream serves it by default18:59
fungiahh19:00
clarkbhttps://github.com/jitsi/docker-jitsi-meet/issues/275 I may have even fetched it out of the running container?19:02
fungioh neat19:05
fungii was mainly wondering if we should try to resync it from somewhere too19:05
clarkbiirc they use some templating engine on container startup that writes out files like that19:05
clarkbbut I'm not seeing them in that repo19:05
clarkbfungi: https://github.com/jitsi/jitsi-meet/blob/master/interface_config.js19:08
clarkbI think the main jitsi source contains that19:08
fungioh okay19:08
fungiand yeah, the difference is substantial: https://paste.opendev.org/show/bet8dgBO9tlXNaX0L9JX/19:16
fungilooks like maybe i should tell diff to be a bit smarter though19:16
fungipatiencediff didn't do much better: https://paste.opendev.org/show/bedtf6bUc7W4hgNVVIfX/19:24
fungilooking at the git history for that file, it seems we edited it in order to disable the watermark which was overlapping the etherpad controls, took firefox out of the list of recommended browsers, and took out the background blur feature19:26
fungithe nice thing is that a recent update has added a comment block indicating that file is deprecated and config options should move to config.js eventually. i'll make a note to see if those things we changed are configurable there now19:27
*** dviroel is now known as dviroel|afk19:50
*** rlandy|PTOish is now known as rlandy20:01
clarkbfungi: we can probably sync the file then add in those extra bits too20:13
fungiyeah, that's sort of where i'm headed, though i also want to update the env configs from the example in the upstream repo as it's also got new stuff in it corresponding to the service configs20:39
clarkbI've approved https://review.opendev.org/c/opendev/system-config/+/843549 so that it is ready for us when we are ready to rerun that playbook next20:40
fungithanks!20:41
opendevreviewMerged opendev/system-config master: Perform package upgrades prior to zuul cluster node reboots  https://review.opendev.org/c/opendev/system-config/+/84354921:01
johnsomHi infra neighbors. I think there might be something wrong with the log storage.21:42
johnsomhttps://zuul.opendev.org/t/openstack/build/554a978fa1f346ddb89aea349cd4d76b21:42
johnsomIs saying it has no logs, but the job just ran: https://review.opendev.org/c/openstack/designate-tempest-plugin/+/83718021:42
jrosser_i am also seeing the same sort of thing here https://zuul.opendev.org/t/openstack/build/0c4ec03005f94771ad426ace70e869a421:44
johnsomThe interesting thing is the "download all logs" works21:44
johnsomYeah, the "View log" link works also, so it must be a zuul issue21:47
clarkbwhich view log link?22:57
clarkboh there it is22:58
clarkbok so the raw data is there, but the web viewer isn't finding/rendering it22:58
clarkb"Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing)"22:59
clarkbit is a CORS issue22:59
clarkbwe should be setting CORS headers when we upload the objects to swift22:59
clarkblooking at the response headers there are no CORS headers at all.22:59
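A quick way to reproduce that observation from any machine, using the object URL from the CORS error quoted above (a minimal check, assuming curl is available):

    curl -sI https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json \
      | grep -i 'access-control'
    # no output means no CORS headers on the response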
clarkblogs uploaded to ovh swift are fine. It appears related to rax swift23:01
clarkbwhich makes me think it isn't something that changed on our side, but let me double check zuul-jobs to be sure23:01
clarkbI don't see any changes to zuul-jobs' log uploading. It may be an update to whatever swift client we use as well23:02
clarkbwe use openstack sdk23:03
clarkbopenstack sdk did make a release on May 20 that we may have picked up with this latest restart23:03
clarkbwould specifically be the executors23:04
clarkbI think this is either rax side or openstacksdk23:04
clarkbI'm not seeing any likely changes in openstacksdk unless some very low level system is filtering out the headers we attempt to set (seems unlikely because the ovh containers seem fine? we do have https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L226-L227 which is likely non standard so maybe that gets filtered?)23:14
clarkbConsidering it's late friday on a holiday weekend and you can click the view raw logs button for now, I may punt on this23:15
clarkbAnyway if someone else ends up looking at this my suspicion is either something cloud side (maybe we can mitm ourselves and verify what sdk ends up sending to the cloud?) or a change in openstacksdk that filters out the non-standard headers that we need via ^23:17
clarkbI suppose we could test this by using sdk 0.99.0 and 0.61.0 and see if the behavior changes23:17
clarkband hopefully we don't need to deploy a proxy to fix it23:20
clarkbthat would be annoying23:20
