Friday, 2021-03-05

fungiianw: looks like afs01.dfw was rebooted roughly 2 hours ago after the upgrade completed. all went fine i guess?00:04
ianwfungi: i think so, no complaints so far :)00:05
fungilooks like the mirror-update crontab is active too, so all clear hopefully00:07
*** tosky has quit IRC00:11
*** yoctozepto has quit IRC00:13
*** yoctozepto has joined #opendev00:13
clarkbze03 and ze04 have a number of zuul owned processes but none of them appear to be ansible related anymore00:21
clarkbze02 is still running ansible stuff though. I'll go ahead and stop ze03 and ze04 zuul-executors now00:21
clarkbcorvus: ^ related to that I think we may be leaking git processes, is that something you would prefer I leave things paused so you can look at?00:22
clarkblooks like multiprocessing is involved00:23
clarkbthe process tree seems to be zuul-executor -> some multiprocessing thing -> git processes00:23
corvusclarkb: which host(s)?00:25
clarkbcorvus: through easier to see on 03 and 04 because there isn't ansible stuff running too00:26
corvusk i'll look on 300:26
corvuszuul is running du00:26
corvusoh zuul is still running but paused.  sorry, i'm caught up now00:26
*** hamalq has quit IRC00:26
clarkbyup I think the du is expected. It's the git processes with multiprocessing parents that are not00:27
corvusthey're all cat-files00:27
clarkband some are from 2 days ago00:28
corvusat random 21254 is from build df2dfafb9d18432bb667001630ca8c4400:28
corvus(so says proc/21254/cwd)00:28
clarkbwhich ls notes as (deleted)00:29
corvusit ran the job; the job was aborted, but well past the repo setup stage00:30
corvusi see no errors related to that00:32
corvusno broken process pool errors around that time that i see00:34
clarkbwhat is odd (at least to me) is wouldn't you expect the git processes to finish and exit? maybe they are hanging on lack of a fd to write out to?00:35
clarkbthe git processes appear to have their stdin/out/err attached to pipes on the multiprocess python processes00:37
clarkbstrace says 21254 is reading off of fd 0 which is a pipe to the parent00:39
clarkbI guess it expects some input?00:40
clarkbaha the batch flags mean it takes input on stdin for the things to cat00:40
clarkbcorvus: could this happen if we cancel things before reading all the files and stopping the process? it will just sit there waiting on stdin for more input?00:42
corvusthat job finished its git prep; i'm not sure what would have been canceled00:42
corvusclarkb: hypothesis: these aren't leaked00:43
clarkbcorvus: would they be waiting for new inputs legitimately and the pause just exposes they are all there?00:44
corvusclarkb: it looks like maybe gitpython runs this as a long-running process.  the cwd may just happen to be the first build dir for a newly started process worker00:44
corvusclarkb: yeah that's what i'm thinking00:44
corvusi'm trying to figure out how long gitpython expects these to last00:44
corvusi think that should stick around as long as the git.Repo object is around00:48
clarkbah interesting00:48
corvushowever, due to bad experiences with gitpython, we try really hard to deref those objects immediately00:49
corvusso zuul shouldn't be keeping them around after use00:50
*** brinzhang0 has joined #opendev00:54
corvusalso we're certainly not supposed to have 139 of these.00:55
fungithat is rather a few00:56
corvusthen we should have 800:56
clarkbthere are 8 multiprocessing parents00:56
corvusok, so maybe we're leaking git.Repo objects00:57
*** brinzhang_ has quit IRC00:57
clarkbI need to go help with dinner. I can leave those servers as is and pick up their cleanup in the morning. Let me know if you'd like to preserve them longer01:01
corvusclarkb: i don't think i'll have time to dig further :(01:02
clarkbok, I don't right now either :), but I can also try to write up a bug at least before I shut things down tomorrow so we have hints for the future01:02
corvusi think the only thing that could help now would be an objgraph from one of those subprocesses; and i don't think we can get one?01:02
clarkbI'm not sure how we would get one at least01:03
clarkbsince that is outside of the zuul stuff to objgraph01:03
corvusyeah, i mean, we can start a repl on the executor i think01:03
corvusbut i very much doubt we could start one on the subprocess01:04
corvusit's worth noting however that if the executor multiprocessing thing is leaking git repos, it may be leaking other things, which could be exacerbating our oom issues01:04
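Editor's sketch (not from the log): the behaviour discussed above, a batch-mode helper that sits blocked on stdin until its parent closes the pipe, can be reproduced without git. This is a minimal illustration under assumptions: the child is a small Python stand-in for `git cat-file --batch`, and `start_helper`, `ask` and `stop_helper` are hypothetical names, not GitPython or Zuul APIs.

```python
import subprocess
import sys

# Stand-in for a batch-mode helper such as `git cat-file --batch`:
# it answers one request per input line and only exits at EOF on stdin.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('seen ' + line.strip(), flush=True)\n"
)


def start_helper():
    """Start the long-running helper with pipes for stdin and stdout."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def ask(proc, name):
    """Send one request line and read back the single reply line."""
    proc.stdin.write(name + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().strip()


def stop_helper(proc):
    """Close stdin so the helper sees EOF, then wait for it to exit."""
    proc.stdin.close()
    exit_code = proc.wait(timeout=10)
    proc.stdout.close()
    return exit_code


helper = start_helper()
reply = ask(helper, "HEAD")
# The helper is idle but alive: it is blocked reading its stdin pipe.
still_running = helper.poll() is None
exit_code = stop_helper(helper)
```

If the parent keeps such a helper referenced and never closes its stdin, the child stays in the process list indefinitely, which would look like the idle processes described above.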
*** mlavalle has quit IRC01:19
*** hemanth_n has joined #opendev01:44
*** Eighth_Doctor has quit IRC01:45
*** Eighth_Doctor has joined #opendev02:14
*** gothicserpent has quit IRC04:22
*** gothicserpent has joined #opendev04:22
*** redrobot has quit IRC04:27
*** redrobot has joined #opendev04:30
*** redrobot has quit IRC04:35
*** redrobot has joined #opendev04:35
*** ykarel has joined #opendev04:38
openstackgerritIan Wienand proposed opendev/system-config master: [wip] kerberos ansible
*** dviroel has quit IRC05:10
*** ykarel has quit IRC05:50
*** ykarel has joined #opendev05:53
*** ykarel_ has joined #opendev06:08
*** marios has joined #opendev06:08
*** ykarel has quit IRC06:10
*** ykarel_ is now known as ykarel06:10
openstackgerritMerged openstack/project-config master: Add an config
*** slaweq has joined #opendev06:59
*** redrobot has quit IRC07:00
*** sboyron has joined #opendev07:02
*** ralonsoh has joined #opendev07:03
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
*** eolivare has joined #opendev07:30
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI
*** lpetrut has joined #opendev08:00
*** rpittau|afk is now known as rpittau08:21
*** hashar has joined #opendev08:50
*** jpena|off is now known as jpena08:54
*** tosky has joined #opendev09:23
*** toomer has joined #opendev09:25
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
*** DSpider has joined #opendev09:54
*** fressi has joined #opendev09:55
*** DSpider has quit IRC09:56
*** zoharm has joined #opendev10:17
*** dviroel has joined #opendev10:18
*** hashar has quit IRC11:03
*** hashar has joined #opendev11:04
*** artom has quit IRC11:16
*** ykarel_ has joined #opendev11:19
*** ykarel has quit IRC11:22
*** ykarel_ is now known as ykarel11:23
*** ykarel_ has joined #opendev12:07
*** ykarel has quit IRC12:09
*** jpena is now known as jpena|lunch12:34
*** tkajinam has quit IRC12:35
*** tkajinam has joined #opendev12:35
*** hashar is now known as hasharLunch12:39
*** whoami-rajat has joined #opendev12:58
*** redrobot has joined #opendev13:06
*** artom has joined #opendev13:10
*** hasharLunch is now known as hashar13:14
*** jpena|lunch is now known as jpena13:28
*** hemanth_n has quit IRC13:43
*** fressi has left #opendev13:45
openstackgerritJeremy Stanley proposed opendev/system-config master: Add the Gerrit reviewers plugin to Gerrit builds
*** ykarel_ is now known as ykarel14:32
TheJuliais working?14:44
TheJuliaHmm, can't even load the base webpage14:44
fungiloads for me... connecting over ipv4 or ipv6?14:46
ykarelfungi, hi, we have cleaned the open reviews for stable/ocata and pike can u do the cleanup when u get chance
fungiykarel: yep, i saw the ml post, thanks for following up! we have a very large list of branch deletions to process, and have been working with elod and the release team on integrating it with the rest of release automation similar to how branch creation is currently handled (the manual process is painfully slow)14:49
fungii still owe elod a review on his proposed script change, but worst case i'll handle the current backlog by hand14:50
ykarelfungi, ok Thanks for update, and yes automating it would be very helpful14:50
*** smekala has joined #opendev14:51
*** smekala has quit IRC14:56
fungiTheJulia: looking over resource graphs for our gitea backends, it appears gitea08 got slammed briefly (and is still under some fairly heavy load). my guess is you got load-balanced to that one before it popped out of the pool14:57
fungiseems the server has recovered now (5-minute load average is back around 3)14:58
fungilooks like it has git processes chewing up 100% of a processor each14:59
fungi/usr/lib/git-core/git pack-objects --revs --thin --stdout --delta-base-offset14:59
fungichild of: /usr/bin/git upload-pack --stateless-rpc /data/git/repositories/openstack/nova.git14:59
fungiit ate memory until the oom killer knocked it out15:01
fungicausing lots of swap thrash before that15:02
fungiall the processing capacity was saturated with iowait15:02
fungias tends to happen under such circumstances15:03
*** lpetrut has quit IRC15:03
TheJuliafungi: most likely, looks like it is working again15:03
fungiin fact the oom killer had to kill three processes within the span of a few seconds15:03
*** rpittau is now known as rpittau|afk15:03
fungiwe suspect a ci system or some other automated process is trying to clone multiple copies of repos all at once from the same ip address, so gets balanced to the same backend15:05
fungiwe haven't been able to narrow it down to a particular source yet though15:05
corvusfungi: to remind myself: we're using source ip balancing because the backends can be slightly different (in terms of git object structure) because they don't have shared storage, right?15:32
*** ykarel has quit IRC15:37
fungicorvus: yes, though also we're terminating ssl/tls on the gitea servers at the moment, so layer 4 is the deepest haproxy can see at the moment15:38
fungiif we wanted to do layer 7 inspection and/or fancy distribution mechanisms like cookie injection, we'd need to move the cert to the lb15:39
* fungi doubts cookie injection would actually help this case, it was simply an example15:39
fungicorvus: also if we did distribute these requests across the entire pool, there's a chance we'd simply oom on all the backends and take the entire service offline instead of just slamming a single server, but that could probably be mitigated by growing the cluster even more15:41
corvusor the other way: least-conn/round-robin we'd need shared storage15:41
*** zoharm has quit IRC15:58
clarkbya specifically I think the issue is that clients if they connect to a new backend in the middle of some "transaction" end up not seeing the objects they expect16:06
clarkbhowever, I think that was largely for older git clients, it is possible this will improve with newer clients being smarter?16:07
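Editor's sketch (not from the log): source-IP balancing, as discussed above, pins every connection from one address to one backend, which keeps a client on a consistent set of git objects but also means a burst of clones from a single address lands on a single server. The snippet is a simplified illustration under assumptions: the pool size, the `gitea01`..`gitea08` names and the SHA-256 based hash are hypothetical and are not haproxy's actual algorithm or the real configuration.

```python
import hashlib

# Assumed pool; only gitea08 is named in the log, the rest is illustrative.
BACKENDS = [f"gitea{n:02d}" for n in range(1, 9)]


def pick_backend(client_ip, backends=BACKENDS):
    """Map a client address to a backend deterministically by hashing it."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Five simultaneous clones from one (documentation-range) address
# all resolve to the same backend.
clones = [pick_backend("203.0.113.7") for _ in range(5)]
```

A least-connections or round-robin policy would spread those five clones out, at the cost of a client possibly seeing different backends mid-operation.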
clarkbin my excitement to get the new executors up yesterday I failed to configure reverse dns for them. This has now been corrected16:12
clarkbI'm going to work on cleaning up the extra servers now16:12
clarkbthe zuul-executor processes are now stopped on the 3 old servers. I'll check grafana seems happy with that in a bit then delete the servers entirely16:17
marioselod: o/ hey can you please check again when you have some time for reviews please thank you16:17
fungiheh, you're not my web browser!16:26
*** hashar is now known as hasharAway16:27
clarkbfungi: also if you've got time to look over the ~70 entries identified by the latest run of the audit script as having no username, an invalid openid, no review or code pushes I think I'd like to run the script on that set today then followup with external id cleanups on them early next week if there is no screaming16:30
fungioof, you can tell the openstack release freeze is looming... node request backlog is well over 1k even with the 25% additional quota we got this week16:30
clarkbfungi: yup I noticed zuul is busy when double checking the executor status on grafana16:30
clarkbfungi: re the accounts, I don't necessarily expect people to check every one of them but maybe review the latest version of the audit script and quickly skim the list to ensure nothing stands out as blatantly wrong16:31
fungilist is in your homedir on review.o.o again?16:32
clarkbyes should have a date suffix of yesterday16:32
fungithat one i guess16:32
clarkbthat sounds right16:32
elodmarios: sorry, I found some new commits there again :S16:33
clarkbfungi: then I think I'm going to try and finish up the preferred email addr has no external id errors next since I expect the remainder of the external id issues to be much more painful :)16:34
clarkbmight have another small batch of account retirements today due to that if I can make sense of the ~17 remaining there16:35
marioselod: thank you for checking I just replied, both commits (by me) was update to zuul.d/layout to remove some deprecated templates16:36
marioselod: if it really needs to be updated then I will do that. thanks for checking so carefully16:36
clarkb#status log Deleted as they have been replaced with new hosts.16:38
openstackstatusclarkb: finished logging16:38
elodmarios: for the sake of completeness, please update, then I'll +2 immediately :X16:40
elodmarios: and please do not allow in new patches to those branch until the branch is not deleted16:41
marioselod: ack OK waiting for tox validate to finish and will post update thanks16:44
marioselod: wrt stopping patches... any ideas how i can do that other than asking folks not to post them ?16:44
clarkbI don't know how I've missed this but new gerrit seems to show you changes that were parts of rebases and not relevant to the current diff when doing inter patchset diffs16:45
marioselod: updated when you next get a chance16:47
elodmarios: maybe discuss with stable cores to -W if such patches arrive?16:47
marioselod: ack yeah OK I will socialise that some more now it is actually close to happening and ask folks to help me block patches (there should be very few, if any)16:48
marioselod: thank you for your time, i am going end of day in a few minutes. if there is anything else about the review i will deal with it next week16:48
fungiclarkb: i thought gerrit always showed you the diffs dragged in from rebases when looking at inter-patchset diffs?16:52
elodmarios: just +2'd it, and I've pinged hberaud. thanks for your patience & have a nice weekend o:]16:54
fungiclarkb: heh, i randomly stumbled across one of the google openids. i suppose they're all included in this set16:59
mariosthank you elod no need to apologise for doing a good job ;)16:59
marioselod: have a good one yrself happy friday ;)16:59
clarkbfungi: it does, I mean that it differentiates it from the diff you care about16:59
clarkbfungi: you get red, green, and purple diff text now16:59
fungioh, neat!17:00
funginode request backlog is not really shrinking, we nearly passed 1.5k a few minutes ago17:01
clarkbfungi: I'm going to quickly check the ~17 accounts with preferred email addr issues for invalid openids and see if I can narrow that list down that way too17:08
*** marios is now known as marios|out17:09
fungigood idea17:10
fungiclarkb: spot checking the 70 entries for "Users without username, ssh keys, valid openid, and no changes or reviews" and it looks right to me17:12
fungiran through a bunch of them querying by hand17:12
fungiincluding double-checking the openid urls 404 for them17:12
clarkbgreat, you're comfortable with running on those today then removing their conflicting external ids next week?17:14
clarkbthe time delay helps ensure we haven't missed anything. If we feel strongly about it we can probably just remove the external ids now (though undoing external id removals is more difficult than undoing the retire changes)17:15
*** eolivare has quit IRC17:15
clarkbusing the invalid openid or no openid approach against the preferred email issues identifies 4 more that can be retired (these don't need external id cleanups)17:20
clarkband then I've got 3 more on top of all that that I've sort of manually identified as cleanable. One is the account, another is an account whose openid says "foo-unused", and the third is for hubcap who is long gone and if I have to make amends will bribe with whiskey17:21
*** marios|out has quit IRC17:22
fungiclarkb: yeah, comfortable retiring that set whenever you're ready17:24
fungiwhiskey seems like a fine approach17:25
clarkbcool, I'll proceed with those now17:26
clarkbweshay|ruck: I've set the account inactive. the os-tripleo-ci account is untouched. Let us know if this causes any problems. I won't do the more extensive cleanup until next week17:32
weshay|ruckclarkb, nice.. thanks! happy friday17:33
funginode request backlog is well over 1.5k now. i rechecked an openstackclient change and it took nearly 3 hours to get nodes assigned17:41
fungilooks like there's a slew of neutron changes in the gate pipeline for the openstack tenant, i expect that's a big part of it given the number of node-hours those burn and the odds of gate resets in a deep queue there17:42
fungi>90% of the changes in the gate are for neutron17:43
fungier, in the integrated gate queue i mean17:44
fungithat said, things are moving quickly. oldest change in the gate has only been there for 3 hours17:45
fungiand we're logging fairly steady merge events17:46
*** mlavalle has joined #opendev17:47
fungioldest change in check is about to report and has only been in there for 4.5 hours. so really not too bad17:47
fungiexecutors are pretty choked though, we're spending a fair amount of time with no executors accepting jobs17:49
fungilooks like it's probably the ram governor17:50
fungithat said, we're not under-utilizing our node quota so i don't expect it's a problem17:51
*** jpena is now known as jpena|off17:58
*** irclogbot_3 has joined #opendev18:04
clarkbok has been run against all of those accounts. There was one account without an account.config file so my sed failed to update the file. I'm going to look at that next (and possibly retire it via the api?). Then rerun the audit script and all of these accounts should show up in the top list saying they can have their external ids cleaned up18:04
clarkboh let me upload the log really quickly first though18:04
clarkblog is uploaded18:05
clarkbI set the odd account inactive via the api and that seems happy18:08
fungiinteresting. i wonder why it was missing an account.config18:10
clarkband I have started the reaudit to ensure we get these accounts classified properly as external ids removable because one account is inactive18:10
clarkbfungi: looks like no full name or preferred email was ever set so gerrit didn't write out the config file I guess18:10
fungii mean, makes our decision even easier. too bad there aren't more of those?18:11
fungiprobably it was the result of an almost but not quite complete merger/retirement before the 3.2 upgrade18:12
clarkbthe audit script isn't particularly quick due to its checking of the openids. Once we've fully cleaned up this invalid openid set I'll update the script to remove that (at least by default)18:13
fungiyeah, banging launchpad with those is probably not terribly polite. sort of surprised they haven't thrown a yellow card18:15
clarkbfungi: well our gerrit is slow enough that we sort of have a built in sleep between requests I think. Also I'm only doing HEADs18:15
fungiheh, good point. and yeah head is sufficient and much nicer18:15
clarkbI thought about adding a sleep but realized that the gerrit queries in between are likely long enough to spread things out18:15
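Editor's sketch (not from the log): checking whether an OpenID URL still resolves needs only the status code, so a HEAD request is enough, as noted above. The audit script itself is not shown in the log; `head_status` and the local demo server below are hypothetical stand-ins, and the demo paths are made up.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def head_status(url, opener=None, timeout=10):
    """Return the HTTP status of a HEAD request; no body is downloaded."""
    opener = opener or urllib.request.build_opener()
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; report their code.
        return error.code


class DemoHandler(BaseHTTPRequestHandler):
    """Local stand-in for an identity provider: only /~known exists."""

    def do_HEAD(self):
        self.send_response(200 if self.path == "/~known" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# Bypass any proxy settings from the environment for the local demo only.
direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
base = f"http://127.0.0.1:{server.server_port}"
statuses = {path: head_status(base + path, direct) for path in ("/~known", "/~gone")}
server.shutdown()
server.server_close()
```

Against a real provider, a short `time.sleep` between calls would be the polite default when nothing else spaces the requests out.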
*** irclogbot_3 has quit IRC18:24
*** hasharAway has quit IRC18:27
*** irclogbot_2 has joined #opendev18:27
clarkband my audit crashed due to a name resolution failure. I blame my terrible wireless18:27
clarkbI'm going to trim the input list down to the accounts we just modified as that should run much quicker and still double check things18:31
*** ralonsoh has quit IRC18:35
*** toomer has quit IRC18:40
fungii suspect the holes in our executors accepting graph represent deep gate queue resets for the openstack tenant18:46
funginow that the queue is stabilizing, the executors are mostly all back to accepting18:47
fungispoke too soon, another neutron change just blew18:49
fungiand right on cue, executors accepting takes a nosedive to 018:52
fungiso yeah i think that's what's going on18:52
clarkbya they get busy handling resets18:54
clarkbthe audit script runs much more quickly when you reduce the problem space. Found one minor bug in reporting for accounts that are inactive cleanups though (note I don't think this would cause problems for any of the actions we've previously taken as it was underreporting inactive accounts. It would only report an inactive account if there was a conflicting active account)18:59
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies
clarkbbug is fixed in ^19:06
*** elod has quit IRC19:12
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
*** elod has joined #opendev19:13
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
*** sboyron has quit IRC19:30
clarkbfungi: that shows you what I meant about the diffs earlier19:42
clarkbze05-08 replacements are being launched now19:50
openstackgerritClark Boylan proposed opendev/ master: Add replacement ze05-08 servers to dns
clarkbinventory change coming up next20:05
openstackgerritClark Boylan proposed opendev/system-config master: Replace with
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
clarkbinfra-root ^ I'm around today to get those in and stop old servers if you have a moment to review those changes20:09
fungii went ahead and approved those, neither should have direct production impact20:14
fungijust adding/correcting dns records, and switching configuration management to start running against the new servers instead of the old ones (and also flipping names around in cacti)20:15
clarkbyup, and we've done it now for 4 other hosts so I don't expect much trouble. The real fun happens when asking zuul-executor to gracefully stop20:16
openstackgerritMerged opendev/ master: Add replacement ze05-08 servers to dns
fungiso much grace it's all over your screen (and process list)20:17
*** slaweq has quit IRC20:17
corvusretro +220:19
openstackgerritClark Boylan proposed opendev/system-config master: Fix sshfp record printing
clarkbthe zone file edits pointed out ^ is a thing we should do20:20
clarkblinus says don't run linux 5.12-rc1 because it has swapfile problems that will write to all the wrong places in your filesystem20:23
clarkbthough we're probably some of the only people that swapfile because cloud ci stuff (and none of our hosts will have a brand new kernel)20:23
fungiyeah, i'm still on 5.10 because of the debian bullseye release freeze20:31
fungilooks like our node request backlog may finally be on the downward slide into the weekend20:34
fungithough a neutron change just reset the entire openstack integrated gate queue again a few minutes ago20:35
openstackgerritMerged opendev/system-config master: Replace with
clarkbfungi: looks like a big tripleo reset just happened too. Definitely not out of the woods yet, but likely to catch up before next week starts it over again20:52
fungiat least feature freeze will end next week20:54
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
*** hamalq has joined #opendev21:14
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Zuul Cache role with s3 implementation.
*** iurygregory has quit IRC22:57
clarkbansible has found the new servers, now I wait23:11
*** iurygregory has joined #opendev23:22
*** elod has quit IRC23:25
clarkbstarting new executors now23:34
clarkband the old executors have been asked to gracefully stop23:39
fungiawesome23:40 is particularly busy but it should be easing off now23:42
*** hamalq has quit IRC23:44
