Monday, 2023-05-01

ianwit's just weird that one happened for https://review.opendev.org/c/opendev/system-config/+/880579 and then https://review.opendev.org/c/opendev/system-config/+/88071000:44
ianwone added the jammy servers and the other removed them00:45
ianwremoved the old ones00:45
ianwok: [codesearch01.opendev.org]00:45
ianwit gathered facts ok00:45
ianwi dunno; happened well before anything relating to nameservers happened.  might be a big coincidence00:50
ianwhttps://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&pipeline=deploy&skip=0 it is perhaps semi-common i guess00:53
fungimight be a misbehaving middlebox somewhere in that cloud region01:02
ianwfungi/clarkb: not for now ... but i wasn't sure where we got to on AAAA glue records for opendev.org.  If we want to add them, we probably need to ask (my preference) but if we're ok with not having them, we can cross it off https://etherpad.opendev.org/p/2023-opendev-dns01:16
fungiianw: related, we may want to ask vexxhost to add ipv6 reverse dns for ns0401:19
ianwyeah, i don't think irc is effective for that01:20
ianwok i logged a low priority ticket01:27
ianw#status log shutdown ns1.opendev.org, ns2.opendev.org and adns1.opendev.org that have been replaced with ns03.opendev.org, ns04.opendev.org and adns02.opendev.org02:13
opendevstatusianw: finished logging02:13
opendevreviewIan Wienand proposed openstack/project-config master: project-config-grafana: filter opendev-buildset-registry  https://review.opendev.org/c/openstack/project-config/+/84787003:44
opendevreviewMerged opendev/system-config master: Add logging During Statup for haproxy-statsd  https://review.opendev.org/c/opendev/system-config/+/88190104:26
clarkbit is so quiet today15:14
clarkbI'm going to get a gitea 1.19.2 change up after local system updates. They didn't fix the header issue, but that has been a long-standing problem so I think we can proceed with 1.19.215:21
clarkbI've just spot checked zuul and nodepool services and believe that we are running quay images at this point. Restarts over the weekend appear successful.15:35
clarkbOne thing it looks like we will need to do is manually prune out the old docker hub images since our regular pruning hangs onto them15:35
clarkbcc corvus not sure if that is worth warning zuul users about15:36
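A sketch of that manual prune, assuming the stale images are the ones still known by their registry-less docker hub names (the reference filter pattern is illustrative, not taken from the log):

    # list image IDs for anything still pulled from docker hub under zuul/* and remove them
    docker images --filter=reference='zuul/*' -q | xargs -r docker rmi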
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.19.2  https://review.opendev.org/c/opendev/system-config/+/87754115:46
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818115:46
clarkbcleaned up my old hold and put another in place for ^ but I anticipate we can upgrade soon15:48
clarkbthe centos 9 stream mirror is broken. repomd.xml points to files that don't exist. This problem originates in our upstream mirror16:26
clarkb(throwing that out there so that when everyone is back to work tomorrow they can short circuit the debugging)16:26
clarkbI'm getting my quay.io change for zookeeper-statsd together and will be regenerating the robot account's token just to ensure we are starting fresh. Nothing should be using it yet anyway, but it's a good extra step for safety.16:49
clarkbs/token/docker cli passwd/16:49
clarkbcorvus: ^ for that I am having to press ^D twice for it to emit a password entered on the command line. Is this expected? I'm worried the first control character may end up in the input somehow. I'll use echo -n 'value' | zuul-client encrypt instead I guess16:54
corvusclarkb: re pruning -- i don't think that's something we need to warn people about16:55
corvusclarkb: yes, 2 ctrl-d's is expected when not immediately following a newline16:55
corvusthat's a shell thing16:56
clarkbTIL. fwiw echo -n '' seems to work. Just prefix it with a space to prevent it from going into history16:57
corvusyep it's nice to see what you're doing :)16:59
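For background, the double ^D is standard terminal behavior: read() on a tty only returns end-of-file when the line buffer is empty, so the first ^D flushes the unterminated line and the second actually signals EOF. Piping sidesteps it; a minimal sketch of the workaround discussed above (the leading space assumes a shell configured to keep space-prefixed commands out of history, e.g. bash's HISTCONTROL=ignorespace):

    # 'value' is a placeholder secret; -n avoids feeding a trailing newline into the ciphertext
     echo -n 'value' | zuul-client encrypt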
corvusclarkb: i went to go check exactly which image is running for zuul... and i see this:17:06
corvus4a9793f4f9fa   0021610b5ea6   "/usr/bin/dumb-init …"   2 days ago   Up 2 days             zuul-scheduler_scheduler_117:06
corvusso then i run  docker inspect 0021610b5ea617:06
corvusand i see:                 "org.zuul-ci.change_url": "https://review.opendev.org/873012"17:06
corvusand that does not look right to me17:07
opendevreviewClark Boylan proposed opendev/system-config master: Base jobs for quay.io image publishing  https://review.opendev.org/c/opendev/system-config/+/88128517:08
corvusdo you think something about how we're building images is broken and not attaching those labels correctly now?  maybe that's the most recent layer that has a label?17:08
corvusi get that with docker inspect quay.io/zuul-ci/zuul-scheduler locally too17:09
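For the record, a one-liner like this (docker inspect with a Go-template --format, using the image ID from the paste above) pulls out just that label:

    docker inspect --format '{{ index .Config.Labels "org.zuul-ci.change_url" }}' 0021610b5ea6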
clarkbcorvus: I think the component-reported version looks fairly accurate17:09
clarkbso I suspect this has to do with metadata and not with pushing stale content17:09
clarkbcorvus: does the most recent docker hub image look better?17:10
corvusyeah, the build date looks correct too17:10
corvusyep17:10
corvuspoints to https://review.opendev.org/88065817:11
clarkbinfra-root I think https://review.opendev.org/c/opendev/system-config/+/881285 is ready for review now. Should be safe to land whenever we are ready to debug it. zookeeper-statsd has not had any new images since I synced docker hub to quay.io, so we don't need to sync that before we switch either17:11
corvusclarkb: i think i see the issue17:12
clarkbok I'm still trying to figure out where we set the value. Must be in the jobs somewhere17:12
corvusworking on a change17:12
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Add labels to build-container-image  https://review.opendev.org/c/zuul/zuul-jobs/+/88191917:14
corvusclarkb: ^17:15
corvuslooks like another case of the build-container-image bitrotting between when we made it originally and when we finally started using it.17:15
clarkbcorvus: heh ya the buildx tasks have them and they were copied over more recently17:16
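The gist of the fix is passing the same metadata through at build time on the non-buildx path; a rough sketch, not the exact zuul-jobs change (the label value shown is illustrative):

    # build-container-image needs to attach the same org.zuul-ci.* labels the buildx tasks do
    docker build \
      --label "org.zuul-ci.change_url=https://review.opendev.org/881919" \
      -t example/image .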
fungii guess 881285 is going to need 881919 too?17:18
clarkbfungi: not strictly necessary but very nice to have yes17:18
clarkbprobably worth waiting on then we can be sure the label fix works too17:18
corvusyeah, it's super hard to map images back to what they're running without it, so i'd support waiting for 919 before doing any more builds :)17:18
corvusi'd like to restart zuul again to catch the changes that merged over the weekend; any objections?17:20
corvusi'll just run the zuul_reboot playbook17:20
fungisounds good to me, thanks17:20
clarkbcorvus: it's quiet today (holidays elsewhere in the world) and the zookeeper content should make it even less of an impact. I'm good with this17:21
clarkboh ya zuul_reboot is the graceful one. Should go quickly anyway17:21
corvusrunning now in screen on bridge17:23
clarkbcorvus: there is a periodic job that may need to be dealt with in openstack now that I look17:23
clarkbit is queued though which means it isn't on an executor yet so maybe it is fine17:23
corvusyeah, probably how the last reboot made it through17:24
corvusi wish there were a way to copy the tooltip to get the node request id17:24
clarkb++17:24
opendevreviewMerged zuul/zuul-jobs master: Add labels to build-container-image  https://review.opendev.org/c/zuul/zuul-jobs/+/88191917:27
corvusclarkb: found a nodepool bug causing that stuck request17:33
clarkback17:34
corvuschange linked in #zuul:opendev.org 17:37
corvusi believe a restart of nl01 will correct the immediate problem; maybe we should just land that change and let the subsequent automatic restart handle that.17:38
clarkbyup I've approved the change and the auto hourly deployment should handle it automatically17:40
clarkbI'm going to pop out for lunch soon so won't approve the quay.io change in system-config yet. Happy for someone else to if they can watch it otherwise I'll aim to +A it when I get back18:01
fungiclarkb: 881285 won't actually upload a new image though, right? we need another change to merge for that?18:03
clarkbfungi: I think it will because those jobs are added/modified so they should run?18:04
fungioh, yeah i guess the upload happens in gate18:04
fungiokay, i'll check it once it merges18:04
clarkboh but ya the post may not. we can push up a noop dockerfile change if we need it to trigger more stuff18:05
clarkba number of our dockerfiles have a timestamp comment for this purpose now18:05
fungiright, that's what i was assuming we'd need to test, but we can do that if necessary after you finish your lunch18:07
clarkb++18:07
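The timestamp comment is a no-op edit that still touches the Dockerfile, so zuul's file matchers queue the image build and promote jobs for the change; a sketch (the base image line is illustrative):

    # Dockerfile
    # Rebuild trigger: bump this date to force new image jobs: 2023-05-01
    FROM docker.io/opendevorg/python-base:3.11-bullseye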
rlandyfungi: hi ... has anyone reported anything wrt centos9 mirrors? Looks like all jobs (not just tripleo related) are failing with "error: Status code: 404 for https://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz" or some similar error. Example log: https://zuul.opendev.org/t/openstack/build/683d1d11236441d48c2b181b7ce193e818:30
rlandyanother example: https://zuul.opendev.org/t/openstack/build/371a7f56326b4eb5877aafd600ed0a8518:31
fungirlandy: first i've heard of it, but we mirror from other mirrors so i guess it's worth checking those to see if they're stale18:35
fungihttps://mirror.bhs1.ovh.opendev.org/centos-stream/timestamp.txt indicates it was updated at the top of the hour18:37
fungihttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update#L44 says we're pulling from mirror.rackspace.com18:38
fungiand i don't see a 3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz at https://mirror.rackspace.com/centos-stream/9-stream/BaseOS/x86_64/os/repodata/18:39
fungiso my guess is that their mirror is behind18:39
fungior somehow got rolled back18:40
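A quick way to confirm the gap is upstream rather than local, reusing the hash from the failing job above:

    # HEAD request; a 404 means the upstream mirror itself lacks the file repomd.xml points at
    curl -sI https://mirror.rackspace.com/centos-stream/9-stream/BaseOS/x86_64/os/repodata/3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz | head -1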
rlandyok - so we're back to that mirror caching fun18:40
fungiyes, our mirror is only ever as reliable as the mirrors we copy from18:40
fungiand apparently the mirror network for centos is not all that reliable18:41
rlandythank you18:41
fungirlandy: https://review.opendev.org/868392 switched us from mirror.facebook.net to mirror.rackspace.com in december because facebook's mirrors stopped updating, according to the commit message18:43
rlandyyep - I remember we switched a few times last year18:45
rlandygoing to give it a few hours to see if the mirrors sync up18:45
opendevreviewMerged opendev/system-config master: Base jobs for quay.io image publishing  https://review.opendev.org/c/opendev/system-config/+/88128519:04
Clark[m]I posted about the mirror thing earlier today. I confirmed our upstream mirror has the same issue19:18
fungiahh, i missed that, thanks19:22
clarkbfungi: corvus: the quay thing failed in deploy on the zuul job. Likely due to the ongoing zuul restart? I didn't think about that interaction. It did push a change tag but did not update latest. I think because we do need an image change to trigger the promote job19:41
clarkbI'm working on that change now19:41
clarkbit did restart the zk statsd service on zk04. Image is identical to the one running before so that was just a docker bookkeeping change19:42
opendevreviewClark Boylan proposed opendev/system-config master: Force zookeeper-statsd rebuild  https://review.opendev.org/c/opendev/system-config/+/88192419:43
clarkbnl01's launcher restarted ~36 minutes ago19:44
fungiyeah, gate did run system-config-upload-image-zookeeper-statsd successfully, promote doesn't seem to do any tagging19:44
clarkband the stuck job is running19:44
clarkbfungi: yup, and if you visit the image location on quay you'll see the gate change tag but latest is still old19:44
clarkbnothing unexpected that I see so far. just what we anticipated might be an issue which is good19:45
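Promote here is effectively a server-side retag of the image the gate job already uploaded; if it ever had to be done by hand, something like this would work (the source tag is a guess at the gate-upload tag format, not taken from the log):

    skopeo copy \
      docker://quay.io/opendevorg/zookeeper-statsd:<gate-change-tag> \
      docker://quay.io/opendevorg/zookeeper-statsd:latest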
clarkbhrm my gitea 1.19.2 change failed presumably on lack of authentication. Implying that authentication is required?19:49
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.19.2  https://review.opendev.org/c/opendev/system-config/+/87754119:50
clarkbno_log removed (which we can do because this request shouldn't need authentication) so we can see more info19:51
clarkbthe infra-prod-run-zuul job failed due to zm02 failing to copy the project config. We no_log that task, so I don't know what happened to make it fail (there's plenty of disk space)19:59
clarkbI guess keep an eye on it for recurrences and we can dig in if necessary19:59
clarkbfungi: corvus  want to review (and hopefully approve) https://review.opendev.org/c/opendev/system-config/+/881924 so that we can see zookeeper-statsd go end to end with container publishing20:46
clarkboh heh the gitea thing I know what it is. pebkac21:07
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.19.2  https://review.opendev.org/c/opendev/system-config/+/87754121:09
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818121:09
clarkbthe good news is that having that issue made me realize we don't need to no_log that request since it isn't privileged. I expect this to work now and have put a hold in place21:11
corvusfyi, due to the low load, i paused a handful of executors ahead of the reboot script to reduce the overall upgrade time21:39
corvusthat seems to be working as expected so far21:39
clarkbI saw that. A couple of executors are done too21:41
opendevreviewMerged opendev/system-config master: Force zookeeper-statsd rebuild  https://review.opendev.org/c/opendev/system-config/+/88192421:43
corvusyeah, i'm continuing my rolling window of keeping "about half" paused ahead of the script21:43
corvusincidentally, the zookeeper persistent recursive watches change has had a noticeable impact on the zk latency and outstanding requests metrics.21:45
corvusthat merged on april 11 (and if there's any doubt, you can see it in the zk watches graph)21:46
corvushttps://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-90d&to=now21:46
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy  https://review.opendev.org/c/openstack/project-config/+/88193021:48
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy  https://review.opendev.org/c/openstack/project-config/+/88193021:49
clarkbhttps://quay.io/repository/opendevorg/zookeeper-statsd?tab=tags boom! Our first promoted image on quay.io. The hourly zuul jobs should deploy that for us (we didn't trigger the zuul job after the image build)21:50
fungiw00t21:51
clarkbI'll work on getting a few more of those queued up21:51
clarkbnow to find some easy-to-swap images that haven't updated on dockerhub since I did the sync21:55
clarkbthis should reduce the amount of syncing we/I end up needing to do21:55
clarkbas I look at this I'm realizing that there is going to be a bit to do to get things moved. Stuff like our base python images create dependency issues. I think for now I'm going to ignore that though. If we get the leaf images moved then we can rebuild once the base images move too21:59
corvusclarkb: why not start with base?22:08
clarkbcorvus: I guess I can. My main concern with doing that is that if we need to update base for some reason urgently we may not be ready to consume it from its new location everywhere22:09
clarkbthe risk of that is low though and might be good motivation :)22:09
corvusok fair.  no strong opinion here22:09
clarkbI think doing it leaf-image-first is probably more effort but also "safer" from that perspective22:09
corvusbtw, friendly reminder https://zuul.opendev.org/t/openstack/project/opendev.org/opendev/system-config?branch=master&pipeline=check exists in case it's helpful :)22:10
clarkbI'm working on two changes at the moment. One to update the base jobs and one to update ircbot22:17
clarkbWe can use review to decide which approach we prefer22:17
opendevreviewClark Boylan proposed opendev/system-config master: Move ircbot to quay.io  https://review.opendev.org/c/opendev/system-config/+/88193122:22
ianwnice!22:25
ianwgitea change looks good.  i don't think we use any authenticated endpoints now?22:25
clarkbianw: we do, to create repos and orgs and stuff22:26
opendevreviewClark Boylan proposed opendev/system-config master: Move python builder/base images to quay.io  https://review.opendev.org/c/opendev/system-config/+/88193222:56
clarkbI'll work on a third change that updates our Dockerfiles to consume ^22:56
clarkbmy concern with that is a lot of images will update all at once... we can hash that out in review if we want to split things up22:56
opendevreviewClark Boylan proposed opendev/system-config master: Consume python base images from quay.io  https://review.opendev.org/c/opendev/system-config/+/88193323:03
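The consuming change itself is mechanical: each Dockerfile's FROM moves registries, e.g. (the tag shown is illustrative):

    # before
    FROM docker.io/opendevorg/python-builder:3.11-bullseye as builder
    # after
    FROM quay.io/opendevorg/python-builder:3.11-bullseye as builder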
clarkbThere are a lot of moving pieces here. I think we can pause here since the general thing has been shown to work. Think about the approach we want to take / discuss it in the meeting tomorrow. Write down a plan/todo list and then get it done23:04
clarkbI'm going to shift gears here and check up on the gitea upgrade then get a meeting agenda sent out23:05
clarkbat first glance gitea 1.19.2 seems to be working https://158.69.65.228:3081/opendev/system-config23:07
clarkbthinking a bit about the quay.io work. It might make sense to try and do a "sprint" for that. Pick a couple of days in the near future and just focus on getting as much of that done as possible. Then we ideally don't end up with stale images for very long and can have people around to double check services are happy with their new images23:08
clarkbMeeting agenda has been updated. Probably with too much detail. Please let me know if there is anything else to add/change/edit23:22
ianwclarkb: i think the gerrit acl indent, etc. all got merged23:33
clarkboh neat, I'll double check and clean that up23:34
ianwand the renames at the bottom are done, right?23:34
clarkboh yup. Thanks23:35
ianwnameserver status is accurate; i might just remove the old servers later today as there's been no problem after i shut them down yesterday morning (my time)23:35
clarkbthe acl updates did merge. Any idea if we applied them specially to ensure they all got updated?23:36
ianwthat's a good point, i'll go back and check the deploy23:36
opendevreviewIan Wienand proposed opendev/zone-opendev.org master: Remove old DNS servers  https://review.opendev.org/c/opendev/zone-opendev.org/+/88193523:40
ianwhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_95b/879906/7/deploy/infra-prod-manage-projects/95b29cf/manage-projects.yaml.log23:42
ianwhrm23:43
ianwTo ssh://review.opendev.org:29418/x/gearman-plugin23:43
ianw ! [remote rejected] HEAD -> refs/meta/config (prohibited by Gerrit: project state does not permit write)23:43
clarkbthat is probably a read only project23:43
ianwi didn't think of that23:43
clarkbI think that is fine. If we ever make it not read only we'll sync a current good config23:43
ianwyeah, the r/o projects all failed like that23:43
ianwall the errors were the "doesn't permit write" ones23:45
clarkbhow long did it take (that could be good info)23:46
ianw55 minutes23:47
clarkbagenda sent23:47
clarkbianw: maybe we should increase the timeout of that job (assuming it is 60 minutes) and then we can just merge changes and not worry about manual runs23:47
ianw  name: infra-prod-manage-projects23:52
ianw    parent: infra-prod-playbook23:52
ianw    timeout: 480023:52
ianwprobably enough headroom (4800 seconds is 80 minutes, versus the 55-minute run)23:53
clarkbperfect23:53
