Monday, 2023-04-17

opendevreviewMerged opendev/system-config master: launch: further DNS cleanups  https://review.opendev.org/c/opendev/system-config/+/880400 00:40
ianwthe try.gitea.io cert has expired, which is a bit annoying for testing against it01:13
ianwok; scoped access tokens have this written all over it.  i found that by tracing it back to the top-level api router and "git blame" -> https://github.com/go-gitea/gitea/commit/de484e86bc495a67d2f122ed438178d587a92526 01:24
ianwfiled a couple of issues about this; notes in https://review.opendev.org/c/opendev/system-config/+/877541 03:23
*** Trevor is now known as Guest11302 04:06
opendevreviewIan Wienand proposed opendev/zone-opendev.org master: Add DNS servers for Ubuntu Jammy refresh  https://review.opendev.org/c/opendev/zone-opendev.org/+/880576 05:55
opendevreviewIan Wienand proposed opendev/zone-opendev.org master: Add Jammy refresh NS records  https://review.opendev.org/c/opendev/zone-opendev.org/+/880577 06:07
*** amoralej|off is now known as amoralej 06:16
opendevreviewIan Wienand proposed opendev/system-config master: inventory : add Ubuntu Jammy DNS refresh servers  https://review.opendev.org/c/opendev/system-config/+/880579 06:16
ianw^ that's getting closer; the hosts are up.  i need to think through a few things so we can have two adns servers 06:22
opendevreviewIan Wienand proposed opendev/system-config master: dns: abstract names  https://review.opendev.org/c/opendev/system-config/+/880580 06:31
ianw^ that's a start06:31
ianwi'll think about it some more too06:31
dpawlikdansmith: hey, soon I would like to make a release for ci-log-processing, but I still see some leftovers that are not sent to opensearch. There are just a few, but they show up each week. Almost all of the logs from this week were because of parsing the performance.json file - 08:13
dpawlikhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1c2/periodic/opendev.org/x/networking-opencontrail/master/noc-tempest-neutron-plugin/1c21c82/controller/logs/performance.json 08:13
dpawlikthe "MemoryCurrent": 18446744073709551615 seems to be "too big" to the Opensearch field 08:14
dpawlikdpawlik is it correct? Are you using the performance index in Opensearch?08:14
dpawlikI see that most of the errors in the performance log come from the x/networking-opencontrail project. Is it still used? Can we remove the periodic job "noc-tempest-neutron-plugin"? 08:18
opendevreviewDaniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh  https://review.opendev.org/c/zuul/zuul-jobs/+/880517 09:01
opendevreviewDaniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh  https://review.opendev.org/c/zuul/zuul-jobs/+/880517 09:18
gthiemongeHi Folks, there are multiple jobs stuck in zuul12:14
gthiemongeex: https://zuul.openstack.org/status/change/880435,1 12:15
fungigthiemonge: thanks for the heads up. i was on vacation last week so need to catch up on what might have changed, but it looks like our weekend upgrade stopped a quarter of the way through the executors too: https://zuul.opendev.org/components 12:51
fungii wonder if ze04 is the culprit12:51
fungiit has what look to be a bunch of hung git processes dating back to thursday12:53
fungii think the lingering git cat-file --batch-check processes are a red herring. executors from both before and after container restarts seem to have a bunch of them too13:09
fungi2023-04-17 13:09:57,978 DEBUG zuul.ExecutorServer: Waiting for 2 jobs to end13:10
fungiso for some reason there are two builds on ze04 it can't seem to terminate13:10
fungi2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: [e: b54117e7ad044144b1d1cce0bd252f19] [build: c4157ad90b3c4db383d3ac5fb6ce9707] Stop job13:10
fungi2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: Unable to find worker for job c4157ad90b3c4db383d3ac5fb6ce9707 13:10
fungiwe have "Unable to find worker for job" messages dating back over a week though, basically all the way back to the start of our log retention. and not just on ze04 either, executors from before and after the restart seem to have them, so probably not related?13:14
fungii think the executors are just being very, very, very slow to gracefully stop, like taking about a day each13:19
fungilooks like ze01 started around 2023-04-15 00:00z on schedule13:19
fungistarted stopping i mean13:20
fungithen ~48 minutes later ze02 began its graceful stop13:20
fungiand that took over a day; ze03 began to stop around 10:30z today13:21
fungiand ze04 a little after 11:50z so that was less than 1.5 hours13:23
fungiso i guess the delay was really just ze02 for some reason13:23
fungize04 will probably finish stopping soon13:24
fungiso maybe the builds that have been queued for so long are unrelated to the restart slowness. certainly quite a few of them are stuck in check since before the restarts were initiated anyway13:25
fungigthiemonge: i notice a disproportionately large number of these jobs waiting for node assignments are for octavia changes, and i seem to remember octavia has relied heavily on nested-virt node types in the past. what are the chances all these waiting jobs want nested-virt nodes? maybe i should start looking at potential problems supplying nodes from some specific providers13:30
fungithe octavia-v2-dsvm-scenario-ipv6-only build for 879874,1 has been waiting for a nested-virt-ubuntu-focal node since 2023-04-14 05:36:32z 13:31
fungithat was node request 300-0020974848, so i guess i'll look into where that ended up13:31
dansmithdpawlik: no, tbh, I didn't realize the performance index was actually ingesting those values now, but I see that it is13:32
dansmithdpawlik: I agree that memory value must be wrong, but I'd have to go dig to figure out why.. can you ignore those logs for now?13:32
gthiemongefungi: yeah, these jobs are using nested-virt nodes, and I think that most of them (but not all) are centos-9-stream-based jobs13:36
fungidpawlik: dansmith: apropos of nothing in particular, 18446744073709551615 is 2^64-1, so looks like something passed a -1 as an unsigned int13:36
fungiunsigned 64-bit wide int specifically13:37
fungilikely whatever was being measured didn't exist/had no value and attempted to communicate that with a -113:37
dansmithack13:37
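
A quick illustration of the arithmetic above: reinterpreting -1 as an unsigned 64-bit integer yields exactly the MemoryCurrent value from that performance.json. A minimal Python sketch, nothing OpenDev-specific:

    import struct

    # Pack -1 as a signed 64-bit integer, then reinterpret the same bytes as unsigned.
    # This is the classic "sentinel -1 read back as an unsigned counter" pattern.
    as_unsigned = struct.unpack("<Q", struct.pack("<q", -1))[0]
    print(as_unsigned)               # 18446744073709551615
    print(as_unsigned == 2**64 - 1)  # True
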
fungilooks like nl03 took the lock for nr 300-0020974848 at 05:36:38 and logged boot failures in vexxhost-ca-ymq-1, then nl04 picked it up at 05:39:52 and tried to boot in ovh-bhs1 but failed and marked the last node attempt for deletion at 05:50:03, but i see no further attempts by any providers to service the request after that point nor was it released as a node error13:45
fungiever since 2023-04-14 05:50:03 there's nothing about it13:46
fungii need to step away for a few to run a quick errand, but if any other infra-root is around and wants to have a look, i think the first example's breadcrumb trail ends at nl0413:49
*** dviroel__ is now known as dviroel 13:53
Clark[m]Executors logging that they are unable to find workers for builds is normal when you have more than one executor. Basically the executor is finding that it can't process a build because it is running on another executor.14:03
Clark[m]The issue with builds being stuck seems similar to the issue corvus and I looked into last week. https://review.opendev.org/c/zuul/nodepool/+/880354 is expected to make that better and I think restarting breaks the deadlock so landing that change and deploying new images should get stuff moving again14:05
Clark[m]This should be independent of slow zuul restarts since executor stops wait on running builds and Nodepool deadlocking happens before builds begin14:06
fungimakes sense14:48
opendevreviewDaniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh  https://review.opendev.org/c/zuul/zuul-jobs/+/880517 15:01
clarkbI've approved that change just now15:10
clarkbinfra-root Wednesday starting at about 20:00 UTC looks like a good time for an etherpad outage and data migration/server swap for me. Any objections to this time? If not I'll go ahead and announce it to service-announce15:19
fungiclarkb: thanks, it looked good to me but i wasn't clear whether there was a reason it sat unapproved and was trying to skim the zuul channel for additional related discussion15:20
clarkbI don't think there was any particular reason. If i had noticed that it wasn't merged on friday I would've approved it then (though I was also afk due to kids being out of school)15:22
fungimakes sense15:23
fungii guess once updated images are available we can pull and restart the launchers?15:23
clarkbfungi: ansible will automatically do that for us in the opendev hourly job runs15:23
fungioh, right15:23
fungiwhich should hopefully also get all those deadlocked node requests going again15:24
clarkbfungi: https://review.opendev.org/c/opendev/zone-gating.dev/+/880214 may interest you. Makes a 1 hour TTL default consistent across the DNS zone files we manage (the others have already been updated)15:24
clarkbcorrect since the deadlock is due to in process state15:24
fungiinfra-prod-service-nameserver hit RETRY_LIMIT in deploy for 880214 just now15:36
fungiansible said bridge01.opendev.org was unreachable15:37
fungi"zuul@bridge01.opendev.org: Permission denied (publickey)."15:37
fungii guess we haven't authorized that project key?15:38
fungiintentionally?15:38
fungipresumably our periodic deploy will still apply the change15:39
*** amoralej is now known as amoralej|off 15:42
clarkbfungi: gating.dev is the one you had a change up to add jobs for right? I suspect that yes we need the project key to be added to bridge15:44
clarkband yes the daily job should get us in sync15:44
clarkb(that is what happened with the static changes to gating.dev just had to wait for the daily run)15:44
fungiyeah, that was https://review.opendev.org/879910 which merged 10 days ago, so i guess that's when it started15:48
fungiwe have these keys authorized so far: zuul-system-config-20180924 zuul-project-config-20180924 zuul-zone-zuul-ci.org-20200401 zuul-opendev.org-20200401 15:50
opendevreviewJeremy Stanley proposed opendev/system-config master: Allow opendev/zone-gating.dev project on bridge  https://review.opendev.org/c/opendev/system-config/+/880661 15:56
fungiclarkb: ^ like that i guess15:56
clarkbyes that looks right15:57
fungiftr, i obtained the key with `wget -qO- https://zuul.opendev.org/api/tenant/openstack/project-ssh-key/opendev/zone-gating.dev.pub`15:58
opendevreviewClark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup  https://review.opendev.org/c/opendev/system-config/+/880672 16:23
clarkbI think ^ is a reasonable workaround for the gerrit replication issue we discovered during the recent gerrit upgrade16:23
clarkbfungi: does wednesday at 20:00 UTC for a ~90 minute etherpad outage and server move work for you?16:26
fungiyeah, sgtm16:27
clarkbthanks. you tend to be on top of project happenings and are a good one to ask for that sort of thing16:27
fungimmm. actually not project-related but i may not be around at that time... i can do later though, like maybe 22:00 or 23:00z16:28
fungiwedding anniversary and we were looking at going up the island to a place that doesn't open until 19:00z so probably wouldn't be back early enough to make 20:00 maintenance16:29
clarkblater times also work for me16:30
clarkbre zuul restart slowness I think it is related to the nodepool node stuff after all. In particular I think ze04 is "running" a paused job that is waiting on one or more of the jobs that are queued to run16:31
clarkbthis will in theory clear up automatically with the nodepool deployment but we should keep an eye on the whole thing16:31
clarkblooks like some of the queued jobs are running?16:32
fungiinteresting... i wonder if that's why ze02 took almost 1.5 days to gracefully stop16:32
clarkball four launchers did restart just over 15 minutes ago which should've pulled in that latest image (it was promoted ~33 minutes ago)16:33
clarkband the swift change at the top of the queue just started its last remaining job16:34
clarkb* top of the check queue16:34
fungiperfect16:35
fungize04 is still waiting on one of those to complete, looks like16:47
clarkbya it will likely take 3 or more hours if it is one of the tripleo buildsets16:48
clarkbfungi: it's the paused job in the gate for 879863 16:58
clarkbfungi: you can look on ze04 in /var/lib/zuul/builds to get the running build uuids. Then grep that out of https://zuul.opendev.org/api/tenant/openstack/status 16:59
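
A rough sketch of that lookup: take a build UUID (a directory name under /var/lib/zuul/builds on the executor) and search the public status JSON for it. The traversal below deliberately walks the whole document instead of assuming a schema; the "uuid" key on job entries is the only field it relies on, and even that is an assumption rather than documented API:

    import json
    import sys
    import urllib.request

    STATUS_URL = "https://zuul.opendev.org/api/tenant/openstack/status"

    def find_uuid(node, uuid, path=()):
        """Yield (path, dict) for every dict in the JSON tree whose 'uuid' matches."""
        if isinstance(node, dict):
            if node.get("uuid") == uuid:
                yield path, node
            for key, value in node.items():
                yield from find_uuid(value, uuid, path + (key,))
        elif isinstance(node, list):
            for i, value in enumerate(node):
                yield from find_uuid(value, uuid, path + (str(i),))

    build_uuid = sys.argv[1]  # e.g. a directory name copied from /var/lib/zuul/builds
    with urllib.request.urlopen(STATUS_URL) as resp:
        status = json.load(resp)

    for path, entry in find_uuid(status, build_uuid):
        print("/".join(path), "->", json.dumps(entry)[:120])
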
clarkbhrm I expected https://review.opendev.org/c/opendev/system-config/+/880672/1/playbooks/zuul/gerrit/files/cleanup-replication-tasks.py#25 to trigger in https://zuul.opendev.org/t/openstack/build/da3c4879c4ec47ab938665020cdfc2fe/log/review99.opendev.org/docker/gerrit-compose_gerrit_1.txt but it isn't in there17:21
clarkboh we docker-compose down to do renames and that will only trigger on the first startup but we don't collect logs from before17:25
opendevreviewClark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup  https://review.opendev.org/c/opendev/system-config/+/880672 17:32
clarkbthat mimics wait-for-it a bit in its logging output17:32
clarkbI think ze04 should restart in about an hour17:38
clarkbI've just updated the meeting agenda with what I'm aware of as being current. Please add content or let me know what is missing and I'll send that out later today18:16
fungithanks!18:17
clarkbwhile I sorted out lunch it looks like ze04 was restarted.19:24
fungiyes, it's working on 5 now19:25
clarkbI think we are in good shape to finish up the restart now. We can probably check it tomorrow to ensure it completes19:25
fungiagreed19:25
clarkbgoing to send an announcement for the etherpad outage now. I'll indicate 22:00 UTC to 23:30 UTC Wednesday the 19th19:28
johnsomE: Failed to fetch https://mirror.ca-ymq-1.vexxhost.opendev.org/ubuntu/pool/universe/v/vlan/vlan_2.0.4ubuntu1.20.04.1_all.deb  Unable to connect to mirror.ca-ymq-1.vexxhost.opendev.org:https: [IP: 2604:e100:1:0:f816:3eff:fe0c:e2c0 443]19:32
johnsomhttps://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96 19:33
johnsomLooks like there is an issue at vexxhost19:33
clarkbif I open that link I get a download.19:34
clarkbof course I'm going over ipv4 from here19:35
clarkbhitting it via ipv6 also works. So whatever it is isn't a complete failure19:35
clarkbcould be specific to the test node too19:35
johnsomIt would not be the first time there was an IPv6 routing issue19:36
opendevreviewClark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup  https://review.opendev.org/c/opendev/system-config/+/880672 19:37
clarkbhttps://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#3315 there it seems to indicate it tried both ipv4 and ipv6 19:38
clarkbhttps://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#821 and it fails quite early in the job too.19:40
clarkbwhich means it is unlikely that job payload caused it to happen19:40
clarkbdefinitely seems like a test node that couldn't route internally to the cloud but generally had network connectivity (otherwise the job wouldn't run at all). But let's see the mirror side19:41
fungithe boot failures in vexxhost which were contributing to the deadlocked node requests did mostly look like unreachable nodes too, so i wonder if there are some network reachability problems19:42
clarkbno OOMs or unexpected reboots of the mirror node. and the apache process isn't new either19:42
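
For reference, a small sketch of the dual-stack check being done by hand here: attempt a TCP connection to the mirror's HTTPS port over each address family separately, which helps distinguish an IPv6-only routing problem from the mirror being down outright (hostname and port taken from johnsom's error; it only tells you something when run from the affected network):

    import socket

    HOST = "mirror.ca-ymq-1.vexxhost.opendev.org"
    PORT = 443

    for family, label in ((socket.AF_INET, "ipv4"), (socket.AF_INET6, "ipv6")):
        try:
            # Resolve within a single address family, then do a plain TCP connect.
            sockaddr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
            with socket.create_connection(sockaddr[:2], timeout=5):
                print(f"{label}: connect to {sockaddr[0]} ok")
        except OSError as exc:
            print(f"{label}: failed: {exc}")
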
clarkbianw: when you're around can you clarify whether or not you think we need to wait for upstream gitea to fix those api interaction things you posted bugs for before we upgrade? You are -1 on the change and I'm not sure if that means you think this is a big enough problem to hold off upgrading for now21:41
ianwclarkb: umm, i guess i'm not sure.  they've put both issues in the 1.19.2 target tracker21:57
clarkbya I think the main issue is if anyone is using the APIs as an unauthenticated user21:58
ianwthe external thing would be that the organisation list is now an authenticated call.  i mean, i doubt anyone is using that though21:58
clarkbthe basic auth 401 problem is minor since you can force it with most tools seems like21:58
ianwyeah, there may be other bits that have fallen under the same thing, i didn't audit them21:58
clarkbI guess we can wait to be extra cautious. I'm mostly worried about letting it linger and then forgetting. But it is still fresh at this point22:00
ianwi could go either way; i'm not -1 now that we understand things, although i doubt we'll forget as we'll get updates on those bugs22:02
ianwit might be a breaking change for us with the icon setting stuff?  i can't remember how that works, but that may walk the org list from an unauthenticated call?22:03
clarkbit does it via the db actually22:04
clarkband it seems to work in the held node (there are logos iirc)22:04
ianwit does hit it anonymously -> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-set-org-logos/tasks/main.yaml#L1 22:07
clarkbya that's the task my change updated https://review.opendev.org/c/opendev/system-config/+/877541/6/playbooks/roles/gitea-set-org-logos/tasks/main.yaml which sent us down the rabbit hole22:09
ianwoh doh, right22:10
ianwfor some reason i had in my head that was on the test path22:12
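
For context on the API change being discussed: under Gitea 1.19 the organisation listing that the logo-setting task fetches anonymously now appears to require authentication. A hedged sketch of the authenticated call, with the server URL and token purely hypothetical and only the standard /api/v1/orgs endpoint assumed:

    import urllib.request

    GITEA_URL = "https://gitea.example.org:3000"  # hypothetical backend URL
    TOKEN = "REPLACE_WITH_API_TOKEN"              # a (scoped) API token; anonymous requests now get 401

    req = urllib.request.Request(
        f"{GITEA_URL}/api/v1/orgs?limit=50",
        headers={"Authorization": f"token {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read()[:200])
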
ianwclarkb: do you think we should run that cleanup script in a cron job?22:13
ianwthe gerrit replication cleanup script, sorry, to be clear22:13
clarkbianw: I think we could run it there as well. The files don't seem large and there are only a "few" thousand of them right now so we can probably get away with just doing it at container startup22:14
clarkbthe upside to doing it at startup is that it prevents a race in generating those errors in the logs at startup. The downside is we'd only run it at startup and we might not see if there are other types of files that leak or if it stops/doesn't work for some reason22:15
ianwi guess startup and a cron job?22:16
clarkbI'm definitely open to feedback on that. I was thinking about artificially injecting some of the leaked files into the test nodes too but it gets weird because ideally we would write real replication tasks that should replicate and those that shouldn't and check that the ones we want to be removed are removed and that the ones we want to replicate are replicated but we don't test22:17
clarkbreplication in the test nodes22:17
clarkbbasically testing this properly got really complicated quickly and I decided to push what I had early rather than focus on making it perfect22:17
ianwfair enough.  i guess we could just do an out-of-band test type thing with dummy files and make sure it removes what we want22:18
clarkbya that might be the easiest thing22:18
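
A very rough sketch of that dummy-file test idea, purely illustrative: the on-disk task format, the "project" field, and the prune function below are invented stand-ins, not what the script in 880672 actually does.

    import json
    import tempfile
    from pathlib import Path

    def prune_invalid_tasks(task_dir: Path, valid_projects: set) -> list:
        """Stand-in pruner: delete task files whose project is no longer known."""
        removed = []
        for task_file in sorted(task_dir.iterdir()):
            project = json.loads(task_file.read_text()).get("project")
            if project not in valid_projects:
                task_file.unlink()
                removed.append(task_file.name)
        return removed

    with tempfile.TemporaryDirectory() as d:
        task_dir = Path(d)
        (task_dir / "keep").write_text(json.dumps({"project": "opendev/system-config"}))
        (task_dir / "drop").write_text(json.dumps({"project": "example/renamed-away"}))

        assert prune_invalid_tasks(task_dir, {"opendev/system-config"}) == ["drop"]
        assert [p.name for p in task_dir.iterdir()] == ["keep"]
        print("dummy-file pruning behaves as expected")
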
clarkbianw: re cronjob I think we may not have cron in the container images. We'd have to trigger a cronjob that ran docker exec? This should work fine just trying to think of the best way to write it down essentially22:22
ianwyeah that's what i was thinking; cron from review that calls a docker exec22:22
clarkband maybe run it hourly or daily?22:23
clarkbI'll do it in a followup change since we don't want the cron until the script is in the image running on the host22:24
ianwi'd say daily is enough22:25
opendevreviewClark Boylan proposed opendev/system-config master: Run the replication task cleanup daily  https://review.opendev.org/c/opendev/system-config/+/880688 22:41
clarkbSomething like that maybe. I tried to capture some of the oddities of this change in the commit message. We don't actually have anything like this running today. Not sure if reusing the shell container is appropriate. Again feedback very much welcome22:41
ianwi think the mariadb backups are fairly similar22:58
ianwclarkb: dropped a comment on run vs exec and using --rm with run, if we want to use that23:02
clarkbianw: re --rm we aren't rm'ing that container today23:08
clarkbthat might make a good followup but I think we should leave it as is until we change it globally23:08
ianwbut that will create a new container on every cron run?  why do we need to keep them?23:14
clarkbit doesn't create a new container. It never deletes the container so it hangs around. You can see it if you run `sudo docker ps -a` on review0223:17
clarkbI don't think we need to keep them but I don't know that there is a good way to manage `docker-compose up -d` and also somewhat atomically remove the shell container it creates23:17
opendevreviewClark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup  https://review.opendev.org/c/opendev/system-config/+/880672 23:22
opendevreviewClark Boylan proposed opendev/system-config master: Run the replication task cleanup daily  https://review.opendev.org/c/opendev/system-config/+/880688 23:22
clarkbianw: ^ that adds testing23:22
opendevreviewClark Boylan proposed opendev/system-config master: Explicitly disable offline reindexing during project renames  https://review.opendev.org/c/opendev/system-config/+/880692 23:27
clarkband that is something I noticed when working on the previous change23:28
clarkbianw: fwiw on the --rm thing I don't know that this was an anticipated problem when the shell pattern was used. I do kinda like having an obvious place to run things with less potential for impacting the running services though. However, maybe it is simpler to have fewer moving parts and we should try to factor out the shell container. This would affect our upgrade processes though23:31
clarkbas they rely on this container for example23:31
clarkbok last call for meeting agenda topics as I'm running out of time before I need to find dinner23:31
clarkband sent23:41
