Thursday, 2021-04-01

00:02 *** hamalq has quit IRC
00:10 *** mlavalle has quit IRC
00:22 <TheJulia> corvus: is it down? Looks like connections are timing out and I got a blank page load right before that
00:23 <corvus> TheJulia: it's extremely degraded, but hasn't lost queue state or events.  it should eventually recover and restart all the jobs.
00:25 <TheJulia> thanks corvus!
00:26 <corvus> TheJulia: thanks for not throwing vegetables at me :)
00:26 <TheJulia> corvus: I've been there myself long ago
00:41 *** ysandeep|away is now known as ysandeep
00:47 <fungi> zuul-scheduler process still has a cpu completely pegged and the rest api is unresponsive, but the debug log does indicate it's still dispatching builds to executors (albeit slowly)
00:52 *** diablo_rojo has quit IRC
00:57 <fungi> WARNING kazoo.client: Connection dropped: socket connection error: EOF occurred in violation of protocol (_ssl.c:1125)
00:57 <fungi> is that what the zk connection timeouts look like?
00:57 <fungi> seeing them go by in the debug log every few minutes
00:58 <fungi> roughly 2-4 minutes apart for a while
01:01 <fungi> corvus: are we likely generating retry events because of zookeeper disconnects faster than we can process them, or do you still expect it to recover on its own without restarting?
01:06 <fungi> i'm happy to work on a scheduler restart to get things moving again and try to reenqueue everything from the periodic queue backups. looks like we have one from 23:41 utc
01:06 <fungi> which is roughly the time everything seems to have ground to an almost-halt
01:08 <corvus> fungi: i have a very large query running
01:08 <corvus> i'd like to let it finish
01:08 <fungi> no worries, wasn't sure if you were done yet
01:09 <corvus> i'll make sure it's running before i go to bed
01:23 <weshay|ruck> something going.. seeing a ton of retry_limits on centos-8 jobs
01:24 * weshay|ruck reads
01:44 <fungi> weshay|ruck: yeah, we're trying to get to the bottom of a recent memory leak in zuul
01:46 *** mfixtex has quit IRC
02:04 *** brinzhang_ is now known as brinzhang
03:07 <johnsom> Is there an ETA on zuul coming back?
<openstackgerrit> Ian Wienand proposed opendev/system-config master: Planet OPML file
03:16 <corvus> i'm going to restart it now
03:23 <corvus> #status log restarted zuul after freeze while debugging memleak
03:24 <corvus> should be up now
03:49 *** gothicserpent has quit IRC
03:50 *** ykarel|away has joined #opendev
03:51 *** tkajinam has quit IRC
03:51 *** tkajinam has joined #opendev
03:52 *** tkajinam has quit IRC
03:53 *** tkajinam has joined #opendev
03:54 *** ykarel|away is now known as ykarel
04:17 *** whoami-rajat has joined #opendev
05:25 *** marios has joined #opendev
05:47 *** rosmaita has joined #opendev
05:47 *** sboyron has joined #opendev
06:02 *** tkajinam has quit IRC
06:03 *** tkajinam has joined #opendev
06:03 *** tkajinam has quit IRC
06:03 *** tkajinam has joined #opendev
06:10 *** bandini has joined #opendev
06:30 *** lpetrut has joined #opendev
06:39 *** hashar has joined #opendev
06:58 *** gibi_away is now known as gibi
<openstackgerrit> Merged opendev/system-config master: Explicitly create empty reprepro dists
07:24 *** CeeMac has quit IRC
<openstackgerrit> Merged opendev/system-config master: Correct debian-security repo codename for bullseye
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Fix generate two grub.cfg files
07:45 *** tosky has joined #opendev
<openstackgerrit> Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible
07:55 <Tengu> hello there! is the "job retry_limit/pause" issue solved? or may I help on it if it's still relevant?
08:03 *** ykarel has quit IRC
08:05 *** ykarel has joined #opendev
08:27 *** ysandeep is now known as ysandeep|lunch
08:33 *** jaicaa has quit IRC
08:36 *** jaicaa has joined #opendev
08:44 *** dtantsur|afk is now known as dtantsur
08:58 *** ykarel is now known as ykarel|lunch
<openstackgerrit> Irene Calderón proposed opendev/storyboard master: Esto es una prueba
10:01 *** elod is now known as elod_afk
10:05 *** ysandeep|lunch is now known as ysandeep
10:10 *** ykarel|lunch is now known as ykarel
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro
10:27 <zbr|rover> do we use links to logs instead of the zuul build page in zuul comments on purpose or by accident? I kinda prefer being sent to the zuul page instead of the logs page.
10:28 <zbr|rover> i would personally find it more convenient if the links in comments were the same as the ones inside the new "zuul summary" tab.
10:30 <zbr|rover> funny, trying to load managed to crash chrome.
<openstackgerrit> xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro
11:02 *** hashar is now known as hasharLunch
11:35 *** ysandeep is now known as ysandeep|afk
11:58 *** hasharLunch is now known as hashar
12:28 *** bhagyashris has quit IRC
12:29 *** bhagyashris has joined #opendev
12:29 *** hrw has quit IRC
12:37 *** ysandeep|afk is now known as ysandeep
12:55 *** stand has joined #opendev
12:57 <weshay|ruck> fungi, and all.. thanks!!
13:21 *** gothicserpent has joined #opendev
13:24 *** roman_g has joined #opendev
13:25 *** gothicserpent has quit IRC
13:46 *** mailingsam has joined #opendev
13:50 <fungi> zbr|rover: which comments are you talking about?
13:52 <zbr|rover> fungi: nevermind. i think it was PBKAC on that, when i checked the url manually they were identical.
13:53 <zbr|rover> "lost between browser tabs" would describe it better
14:04 *** gothicserpent has joined #opendev
14:04 *** gothicserpent has quit IRC
14:07 <fungi> happens to me too, sure
14:07 <fungi> Tengu: solved (or at least gone for now)
14:08 <fungi> we've been trying to get to the bottom of a new memory leak in the zuul scheduler, but interactively debugging the live process was slowing it considerably and causing side effects like spurious mass job retries
14:08 <fungi> the memory leak is not gone yet, we're still collecting data
14:12 <Tengu> fungi: ah, thanks for the info!
<openstackgerrit> Jeremy Stanley proposed opendev/system-config master: Temporarily serve tarballs site from AFS R+W vols
14:14 <fungi> infra-root: expedited approval of that ^ is appreciated so we can get back to serving current content on the tarballs site until the ord replication is finished
14:29 *** elod_afk is now known as elod
14:30 *** ysandeep is now known as ysandeep|away
14:32 <corvus> fungi: i'd like to try to do another data collection pass; hopefully not as terrible as last night, but still almost certainly disruptive
14:36 *** lbragstad has quit IRC
<openstackgerrit> Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-kubernetes: remove dns resolvers hack
14:38 <fungi> corvus: probably the earlier the better
14:40 <fungi> lots of openstack teams are under a lot of stress since next week is final release candidates for wallaby
14:40 <fungi> so there's been quite a bit of scrambling to get final fixes merged, as usual
14:45 *** roman_g has quit IRC
14:45 *** lpetrut has quit IRC
<openstackgerrit> Sorin Sbârnea proposed openstack/diskimage-builder master: WIP: Add freebash disk image
<openstackgerrit> Sorin Sbârnea proposed openstack/diskimage-builder master: WIP: Add freebsd disk image
15:01 *** chkumar|ruck is now known as raukadah
15:02 *** tkajinam has quit IRC
15:06 *** dtantsur is now known as dtantsur|afk
15:11 <corvus> this current query is proving to be quite disruptive; i have a copy of the queues saved from before i started it though; so if we decide to abort it, i can re-enqueue
15:12 <corvus> i believe i have thought of a way to make objgraph nicer though; if we do abort/restart, i'll work on that
15:27 <corvus> fungi: no result yet; i think we should restart :(
15:29 *** zbr|rover is now known as zbr
<openstackgerrit> Jeremy Stanley proposed zuul/zuul-jobs master: Document algorithm var for remove-build-sshkey
15:30 <fungi> corvus: okay, do you need help with the restart or want me to do it?
15:30 <corvus> fungi: i won't do any more debugging today; i'll resume tonight or tomorrow, and do so with a process which is hopefully nicer and can be aborted.
15:30 <corvus> fungi: nah, i got it
15:33 *** mlavalle has joined #opendev
15:36 <corvus> fungi, clarkb: i wonder if running the objgraph query in a fork would be effective?
15:36 <corvus> tobiash: ^
15:36 <fungi> corvus: that's an interesting idea
15:36 <fungi> it would get its own copy of memory i guess
15:37 <corvus> yeah, i'm assuming all the objects would be there and leaked; we'd want to make sure all the tcp connections are closed.
15:38 <tobiash> corvus: a fork should work
15:38 <corvus> my first idea is to just modify the objgraph methods to add in a sleep between each call to gc.get_referrers, and to check for a stop flag; but if we can do the work in a forked process, we would have an entire cpu available.
15:39 <corvus> cool; i'll prototype the fork idea on a local zuul and if it works, try that out in the next debug session tonight/tomorrow.
15:39 <fungi> that does seem like it's worth trying anyway
15:40 <fungi> and the fork is still using all the same pointers, so shouldn't increase actual memory utilization significantly, right?
15:40 <fungi> assuming we don't memcpy everything
15:40 <corvus> fungi: yeah, i think mem usage would increase moderately slowly as pages get cow'd
15:41 <fungi> cool, that's what i was hoping
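[Editor's note: the fork idea discussed above can be sketched in a few lines. This is a hypothetical helper, not Zuul's actual implementation; the only grounded detail is that objgraph's traversal is built on `gc.get_referrers`, which corvus mentions. After `os.fork()` the child holds a copy-on-write snapshot of the heap and contains only the calling thread, so the expensive walk cannot stall or race the scheduler's threads.]

```python
import gc
import os

def count_referrers_in_fork(obj):
    """Run an expensive gc traversal against a copy-on-write snapshot.

    The forked child sees the parent's heap as it was at fork time, and
    only the calling thread exists in it, so the walk can peg a whole
    CPU without blocking the parent process.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: walk the snapshot and report the count over the pipe.
        os.close(read_fd)
        try:
            count = len(gc.get_referrers(obj))
            os.write(write_fd, str(count).encode())
        finally:
            os.close(write_fd)
            # _exit() skips atexit handlers and buffered-IO flushes that
            # belong to the parent process.
            os._exit(0)
    # Parent: read the child's answer, then reap it.
    os.close(write_fd)
    data = b""
    while chunk := os.read(read_fd, 4096):
        data += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(data or b"0")
```

As corvus notes above, a real version would also need to close inherited TCP connections (zookeeper, gearman) in the child so it cannot interfere with the live scheduler.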
15:41 *** diablo_rojo has joined #opendev
15:43 <diablo_rojo> fungi, clarkb I assume you're already aware the zuul status site is not loading?
15:43 *** ykarel is now known as ykarel|away
15:44 <fungi> diablo_rojo: yeah, was just talking about that in #openstack-infra with some other folks, probably we need to restart zuul-web now that zuul-scheduler has been restarted
15:44 <fungi> corvus: shall i? or do you think it will recover on its own?
15:44 <corvus> fungi: it's up
15:44 <diablo_rojo> fungi, ah okay cool. Thanks!
15:44 <fungi> oh, perfect. thanks!
15:45 <diablo_rojo> Way ahead of me :)
15:45 <diablo_rojo> Thanks fungi and corvus!
15:47 <corvus> #status log restarted zuul after going unresponsive during debugging
15:47 *** whoami-rajat has quit IRC
15:47 <corvus> fungi: restart and re-enqueue is complete
15:48 <corvus> fungi: i'm done debugging for the day
15:48 <fungi> thanks again!
15:48 <fungi> i'll keep an eye on the memory graph
15:49 *** bandini has quit IRC
15:53 *** ykarel|away has quit IRC
15:58 *** hashar has quit IRC
15:59 *** fressi has joined #opendev
15:59 *** fressi has left #opendev
16:20 *** sshnaidm is now known as sshnaidm|afk
16:31 *** marios is now known as marios|out
<openstackgerrit> Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu
16:40 *** hamalq has joined #opendev
16:41 *** ysandeep|away is now known as ysandeep
<openstackgerrit> Merged opendev/system-config master: Temporarily serve tarballs site from AFS R+W vols
17:20 *** marios|out has quit IRC
17:38 <clarkb> corvus: re a fork, the risk there is we'll have two schedulers fighting to do the same work? I guess the thing you'll be POCing is convincing the child to be inactive while running?
17:38 <fungi> yeah, i took that as a given. the fork needs to explicitly do nothing, i think
17:39 <fungi> close all inherited file descriptors, maybe even just go into a busywait
17:39 <corvus> clarkb: the only thread in the fork should be the one where fork is called, yeah?  so that would be the repl server thread becoming the main thread of the new process, i would think.
17:39 <clarkb> infra-root I'm going to try and dig into the vexxhost ipv6 stuff after lunch today as I suspect that is impacting job runtimes in that cloud as well as our new review server. I think a good next step there will be to jump on some in-use test nodes and check their ipv6 networking configs to see if any patterns emerge and go from there, since ianw seems to have the tcpdumping covered on review
17:39 <corvus> (ie, the scheduler thread would not exist)
17:40 <fungi> ahh, yeah the repl as the only thread would solve it
17:41 <fungi> as long as you don't execute functions in the repl to make that no longer true, but the answer there is to just not do that
17:42 <clarkb> ah yup
17:43 <clarkb> fungi: have we landed the second pair of openedge cleanups yet? that was next on my list to check on from yesterday
17:44 * clarkb finds change links
17:45 <clarkb> looks like that hasn't merged yet. Any reason to not do that now (sounds like zuul things are settling for the moment?)
17:46 <fungi> clarkb: no, haven't yet
17:46 <fungi> but should be safe now
17:46 <clarkb> ok I'll +A it now
17:57 *** ysandeep is now known as ysandeep|away
18:13 *** ykarel|away has joined #opendev
18:25 *** ykarel|away has quit IRC
<openstackgerrit> Merged opendev/system-config master: Clean up OpenEdge configuration
18:48 <fungi> clarkb: i suppose 784086 can go in now too since that's merged
18:52 <clarkb> fungi: ++ do you want to +A or should I?
18:52 <fungi> feel free
18:59 *** gothicserpent has joined #opendev
19:01 *** gothicserpent has quit IRC
<openstackgerrit> Merged opendev/ master: Clean up OpenEdge configuration
<openstackgerrit> Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu
19:12 *** mailingsam has quit IRC
19:17 <clarkb> I've started looking at a vexxhost test node to see what is going on with its networking and try to work from that. One thing I checked since I was here is that the svm cpu flag is present, and it is (this means amd nested virt is a possibility)
19:22 <clarkb> dmesg also confirms nested virt is enabled. What i am not seeing is glean or network config at first boot at all
19:30 <clarkb> checking some of the older hosts that exist in nodepool's list, none of them seem to have more than one globally routable address and 2 default routes (2 default routes are expected on the public ipv6 interface iirc)
19:30 <clarkb> so that is all looking good from the test node side. Makes me wonder if they aren't typically sticking around long enough to have trouble
19:36 *** gothicserpent has joined #opendev
19:37 <fungi> which test node?
19:38 <fungi> oh, "a vexxhost test node" i see
19:38 <fungi> trying to find an explanation for the replacement gerrit server's ipv6 madness?
19:39 <clarkb> yup, and also the pip installation slowness in johnsom's example from yesterday which I suspect is also related
19:39 <clarkb> is the modified file that fixed this problem there
19:40 <clarkb> we set dhcp6 and accept-ra to false then manually set routes and addr based on the values that we had previously accepted via RA, which mnaser confirmed should be stable
19:40 <clarkb> however we never really got any more info from the cloud side why this was happening
19:40 <clarkb> assuming things are expected to continue to be stable cloud side, we could do similar for review02, but that seems really clunky and we should consider documenting/automating that configuration for vexxhost nodes if we do
19:41 *** rosmaita has left #opendev
19:41 <fungi> yeah, and bug 1844712 is still basically getting no traction
<openstack> bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete]
19:41 <johnsom> Just throwing a wild idea out, are you accidentally over-restricting icmpv6?
19:42 <johnsom> If routers don't get all of the neighbor discovery goodness they can prune routes.
19:42 <clarkb> johnsom: the problem is we're getting RAs for networks we aren't on
19:42 <johnsom> oops, ?->. I see this on my comcast IPv6
19:42 <johnsom> Ah, well, that is a whole different issue. lol
19:42 <fungi> johnsom: not filtering icmpv6, no. we're wondering if it's that we're getting announcements from gateways which aren't really valid gateways, in addition to the correct ones
19:42 <clarkb> so then when the host tries to talk from that source addr over that route the packets end up in the bitbucket
19:43 <clarkb> we solved that on the mirror node by disabling dynamic configuration, which is less than ideal when nova says use ipv6_slaac
19:44 <clarkb> johnsom: but I suspect that may explain some of the pip installation slowness in your timeout on vexxhost example. pip may wait for ipv6 to time out then fall back to ipv4 (particularly notable is that the time spent is consistently ~60 seconds every time)
19:44 <fungi> johnsom: the bug report above, we've seen it happen both in vexxhost and limestone, where it's leaking between tenants even, but i can imagine it's even more likely to occur within a tenant (some job sets up routing on a vnic, begins spewing ra packets onto the network, other nodes see those and add prefixes/routes)
19:45 <fungi> in theory neutron filters that, but it seems that sometimes that doesn't actually happen
19:45 <fungi> and as of yet, nobody's come up with a sound theory on why
19:46 <clarkb> and we set up a bunch of nodes once and tried to inject RAs ourselves and they never showed up on other hosts (as expected)
19:46 <fungi> yeah, in the past there have been races around things like port creation/deletion, et cetera, where filtering had gaps
19:47 <clarkb> fungi: I think my next step is to boot a vexxhost test node manually and see if I can reproduce there if the node hangs around long enough (say check it tomorrow)
19:47 <clarkb> but otherwise on the test node side I didn't see anything amiss after checking about 10 instances
19:47 <fungi> i have a feeling it could happen in bursts, and relies on some specific set of circumstances
19:47 <clarkb> and maybe we set up review02 to mimic mirror01 when ianw returns
19:48 <fungi> you have to catch it when the right job has run there recently and misbehaved in that way while the other node was up and running
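[Editor's note: clarkb's pip theory above — the v6 attempt burning a timeout before falling back to v4 — can be checked by hand. This is a hypothetical diagnostic sketch, not part of any tooling mentioned in the log; it simply times a connect for one address family at a time, so a node with a poisoned RA-learned route shows a slow or failed AF_INET6 attempt next to an instant AF_INET one.]

```python
import socket
import time

def time_connect(host, port, family, timeout=75.0):
    """Time a single TCP connect attempt for one address family.

    Returns (elapsed_seconds, error_or_None).  On a host that accepted a
    bogus RA, the AF_INET6 attempt sits in a long timeout while the
    AF_INET attempt to the same service connects in milliseconds.
    """
    start = time.monotonic()
    try:
        # Resolve only addresses of the requested family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(infos[0][4])
        return time.monotonic() - start, None
    except OSError as exc:
        return time.monotonic() - start, exc
```

Usage on a suspect node would be something like comparing `time_connect('pypi.org', 443, socket.AF_INET6)` against `socket.AF_INET`; a v6 result that takes tens of seconds (or errors out) while v4 returns almost instantly is consistent with the ~60s stalls described above.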
<openstackgerrit> Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job
19:51 *** slaweq_ has joined #opendev
19:52 *** slaweq has quit IRC
19:54 *** CeeMac has joined #opendev
<openstackgerrit> Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu
<openstackgerrit> Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu
<openstackgerrit> Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu
21:07 *** gothicserpent has quit IRC
21:36 <clarkb> fungi: have you had a chance to look at those gerrit account cleanup proposals? (I know it's been a busy week or longer of fires) if possible would be nice to go through those tomorrow
21:38 *** diablo_rojo has quit IRC
21:44 *** osmanlicilegi has quit IRC
21:44 <fungi> not yet, but i can take a look now
21:44 <clarkb> cool and thank you
21:44 <fungi> they're in your homedir on review?
21:45 <clarkb> yes let me find the exact path for you
21:45 <clarkb> fungi: ~/gerrit_user_cleanups/notes.20210315
21:45 <fungi> 784424 merged almost 5 hours ago and still hasn't deployed
21:46 <fungi> and the tarballs release is still running
21:46 <clarkb> are we waiting on available executor slots?
21:46 <clarkb> those don't use normal nodes so we shouldn't be queued up behind nodepool usage
21:47 <fungi> checking to see if it's still in the queue
21:47 <clarkb> ya they are all in the queues still
21:47 <clarkb> as waiting
21:48 <fungi> ahh, yep
21:48 <fungi> probably blocked on the periodics?
21:48 <clarkb> as well as a large number of tag jobs :/
21:48 <clarkb> well but periodic is doing the same thing, it is just waiting as well
21:48 <clarkb> is this possibly a side effect of corvus' debugging?
21:49 <fungi> yeah, nothing's actually running for those items
21:50 <clarkb> the infra-prod jobs do have a semaphore
21:50 <clarkb> do you know if those tag jobs do too? (just wondering if this is more semaphore weirdness)
21:50 <fungi> the times on all those items line up with the reenqueue
21:50 <fungi> so maybe something is weird about how they were reenqueued
21:51 <corvus> clarkb, fungi: i had a fleeting thought that we may have leaked actual semaphores in the crash
21:51 <corvus> we have semaphore cleanup as a todo item
21:51 <fungi> oh! right, so maybe leaked semaphores still sitting in znodes?
21:52 <corvus> yep; is this only affecting infra, or is it wider?  how urgent?
21:52 <clarkb> corvus: it is affecting a bunch of openstack tag jobs
21:53 <clarkb> infra + those tag jobs are the only ones I've seen so far
21:53 <fungi> those tag jobs are all release note publication though
21:53 <fungi> so maybe not urgent
21:53 <clarkb> ah yup
21:53 <fungi> and the rest, yeah, just opendev infra deployment and config management runs
21:54 <fungi> so not a huge deal, i can manually apply 784424 in production, that's the only thing really causing lingering pain as far as i know
21:54 <corvus> how does "resolve within 4 hours" sound for priority on this?
21:54 <fungi> oh that's plenty soon as far as i'm concerned
21:55 <corvus> ok.  if it's more urgent, i can increase that; but all things being equal, that's convenient for me.
21:55 <fungi> i've manually applied 784424 in production now, so nothing else urgent i know about
21:55 <fungi> also happy to try manual znode surgery, betting it's safe to delete semaphore znodes older than the restart
22:00 <corvus> fungi: i think it's actually a znode edit
22:00 <fungi> ahh, okay
22:00 <corvus> i think a semaphore is now a json list of jobs which hold it
22:01 <corvus> in the case of a semaphore max of 1, however, a delete should be ok
22:01 <fungi> so the semaphore itself is persistent, but may be empty
22:02 <corvus> my guess is we have a semaphore that looks like "/path/to/semaphores/infra-prod-something" and its contents are "['build-uuid-from-before-restart']"
22:02 <corvus> if that's the case, and the max is 1, we can delete that znode
22:02 <fungi> clarkb: on the account cleanup topic, i'm in the process of adapting openstack's election tooling to work around the lack of anonymous access to the emails method in the rest api, and i'm finding there are at least some accounts which have contributed changes recently but have no preferred e-mail. wonder if we can (or should) do anything about those
22:02 <corvus> but if it were "['build-uuid-from-before-restart', 'build-uuid-from-after-restart']" with a max of 2, then editing would be required.
22:03 <fungi> corvus: right, and i have a feeling we have the latter because in this case there wouldn't be builds waiting otherwise
22:03 <clarkb> fungi: those users can simply go in via the web ui and set up a preferred email
22:03 <corvus> fungi: i think we have the former because i think our max is 1?
22:03 <fungi> clarkb: if we can figure out how to contact them ;)
22:03 <clarkb> fungi: they likely have external ids with emails in them
22:03 <clarkb> or the git commits they have pushed
22:04 <fungi> corvus: oh, okay, i must have misunderstood. so waiting builds don't get added to the data structure in the semaphore znode
22:04 <corvus> fungi: correct, only builds which hold the lock
22:04 <fungi> anyway, i'll stop distracting you
22:05 <fungi> clarkb: excellent point, they will probably have a committer address on the change even if the owner account has no preferred address
22:05 <fungi> i can probably use that as a fallback even
22:08 <fungi> clarkb: looking at your list, i wonder if we can also identify accounts with invalid openids and no ssh keys and no password set (regardless of whether they have a username)?
22:09 <clarkb> fungi: checking if a password is set is hard because you have to dig into the git repo directly
22:09 <clarkb> it is doable though
22:09 <clarkb> fungi: maybe take the set that meets the other criteria as a sublist then check the git repo directly for that?
22:09 <clarkb> (no apis expose that, essentially)
22:09 <fungi> ahh, nevermind. they're not usable, though they may have been used previously since we did wipe all the passwords after the incident
22:10 <fungi> yeah, i'm good with the stuff in your proposed list. i spot-checked some from each category
22:10 <fungi> also i'll be around tomorrow to help with the cleanup on these if you want
22:12 <fungi> aha! i just realized most of these changes owned by an account with no preferred e-mail are from "OpenStack Proposal Bot"
22:13 <clarkb> silly bot
22:13 <clarkb> thank you for checking and I'll need to get back up to speed on running my scripts again :)
22:39 <clarkb> ok I've reviewed the gerrit db change
22:50 *** sboyron has quit IRC
22:54 *** eharney has quit IRC
22:54 <clarkb> fungi: should we warn the release team about the tag jobs? I assume those tags were pushed by them? but I guess they could be independent?
22:57 *** tkajinam has joined #opendev
22:57 *** tkajinam has quit IRC
22:58 *** tkajinam has joined #opendev
23:04 *** auristor has quit IRC
23:17 <corvus> (CONNECTED [localhost:2181]) /zuul/semaphores/openstack> get publish-releasenotes
23:17 <corvus> (CONNECTED [localhost:2181]) /zuul/semaphores/openstack> get infra-prod-playbook
23:17 <corvus> those are the 2 semaphores currently held
23:18 <corvus> this is pretty cool; i like this level of visibility :)
23:18 <corvus> that's uuid-jobname
23:19 <corvus> oh, those are queue item uuids
23:20 <corvus> (thus the job name addition to make it unique)
23:20 <corvus> i think that's so that if the build uuid changes, we keep the semaphore
23:20 <corvus> last restart was at 20:32
23:21 <corvus> baaab4cfbc074796b5be235775754aaf last appeared in the log at 14:47
23:21 <corvus> wait, that restart time doesn't look right
23:22 <corvus> 15:47 was last restart
23:22 <corvus> looks like my last log entry didn't make it to the wiki :/
23:22 <corvus> anyway, that entry is confirmed as stale
23:23 <corvus> says max is 1
23:23 <corvus> so i will remove the entry
23:26 <corvus> the same is true for infra-prod-playbook, but it's even older.  removed
23:27 <corvus> top releasenotes job is queued now; top infra-prod job is running
23:27 <clarkb> infra-prod job is in the periodic queue if anyone has trouble finding it
23:28 <clarkb> corvus: when you say remove the entry, did you remove the znode entirely or make the znode content [] ?
23:28 <corvus> clarkb: removed entirely
23:28 <corvus> shortcut valid for max=1 semaphores only
23:28 <clarkb> for max>1 you would edit the json to remove invalid job entries?
23:29 <corvus> but i'm going to write code so no one ever has to do that :)
23:40 *** tosky has quit IRC
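[Editor's note: the cleanup rule worked out in the conversation above — delete the znode when no live holder remains, otherwise edit the JSON list down to the live holders — reduces to a small pure function. This is a hypothetical sketch, not Zuul's eventual cleanup code; it assumes, per corvus's description, that a semaphore znode's body is a JSON list of holder handles (queue-item uuid plus job name).]

```python
import json

def prune_semaphore(znode_body, live_handles):
    """Decide what to do with one semaphore znode after a restart.

    znode_body is the znode's JSON payload (a list of holder handles);
    live_handles is the set of handles the restarted scheduler still
    knows about.  Returns the new JSON body to write back, or None when
    every holder is stale and the znode can simply be deleted -- the
    max=1 shortcut used above generalizes to "no holders, no znode".
    """
    holders = json.loads(znode_body)
    kept = [h for h in holders if h in live_handles]
    if not kept:
        return None          # safe to delete the znode outright
    return json.dumps(kept)  # edit: keep only the live holders
```

Against the `publish-releasenotes` entry above, where the sole holder uuid predated the 15:47 restart, this would return None and the znode could be removed, matching the manual surgery corvus performed.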

Generated by 2.17.2 by Marius Gedminas - find it at!