Wednesday, 2021-02-03

*** tosky has quit IRC00:15
clarkbianw: left some comments on ^ note the inline comments aren't the reason for the -1, the top level comment is00:27
clarkblet me know if I've missed something obvious and I can amend my review00:27
ianwcool, replied.  i may have missed gitea, will check in a sec00:31
ianwbasically i think the script should work by looping through and trying to backup everything, and if one part fails, the whole thing will exit with !000:32
clarkbianw: re your reply on the pipefail: I mention it because you are doing `bash foo | something else` and that will only exit non zero if the something else fails00:32
clarkbI agree with your plan. I'm just worried we'll ignore if bash foo fails00:32
clarkbI think if we set -o pipefail we'll get both things00:33
clarkbthis is distinct from set -e00:33
ianwoh yes i see.  we can check PIPEFAIL or whatever that is, that's a good idea for robustness00:33
clarkbah ya if we can check it directly too that would work00:34
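The difference between the default pipeline status, PIPESTATUS, and `set -o pipefail` discussed above can be sketched as follows. This is a minimal illustration assuming bash; `false` and `cat` are hypothetical stand-ins for the `bash foo` producer and the consumer, not the actual backup script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: `false` plays the failing producer, `cat` the consumer.
set +o pipefail

# By default a pipeline's exit status is that of its last command only,
# so the producer's failure is hidden.
false | cat
default_status=$?

# PIPESTATUS holds the exit code of every stage. Copy it immediately,
# because the next command overwrites it.
false | cat
stage_status=("${PIPESTATUS[@]}")

# With pipefail the pipeline fails if any stage fails.
set -o pipefail
pipefail_status=0
false | cat || pipefail_status=$?
set +o pipefail

echo "default=$default_status stages=${stage_status[*]} pipefail=$pipefail_status"
```

Checking PIPESTATUS lets a script report which stage failed, while pipefail is the simpler option when any failure should fail the whole step. Both are independent of `set -e`.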
openstackgerritIan Wienand proposed opendev/system-config master: borg-backup: implement saving a stream, use for database backups
ianwclarkb: nice ideas, thanks, implemented with ^00:41
clarkbianw: one little formatting thing that yaml will be sad about. Otherwise that lgtm00:42
openstackgerritIan Wienand proposed opendev/system-config master: borg-backup: implement saving a stream, use for database backups
ianwindeed, i was actually just playing with yamllint wrt to
clarkbside note: would it be worth testing starting review-test up against an empty accountPatchReviewDb?00:44
clarkband if that works with the only loss being the little check marks on the ui next to things you've reviewed maybe we just stop backing up the review database entirely?00:45
clarkbwhat I'm not sure about is if there are tendrils of that data in the notedb. I don't think there are, but it is possible00:45
ianwpossibly, but i don't think it's a major size concern00:45
ianwi mean, it's not going to be atomic with any gerrit state saved in backups anyway, so if they do communicate ... i guess it's likely to be corrupt00:46
ianwor at least ... have corruption00:46
clarkbthat change lgtm now too00:47
clarkbI still think docs telling people to set up db backups separately would be a good addition :)00:49
ianwclarkb: yes, i've started :)  it got all tangled up in my modifying the stuff we have there about rotation, which now i'm not sure what to do.  i'll separate it out00:50
clarkb++ to separating the two things and we can update them as we get to each piece00:50
clarkbre the gerrit testing my change seemed to have worked but post failured on some unrelated issues that appear to be network related getting logs. I rechecked it00:50
ianwit was all going to be so simple... :)00:50
clarkbianw: I used your screenshots to confirm the gerrit versions were as expected too :)00:51
clarkbI thought that would end up in the container logs but didn't find them there00:51
clarkb(I think because it logs that info to disk, the container log just gets stdout/stderr, which is fairly short)00:51
ianwyeah with some sleep() and adjusting the viewport the screenshots seem to be pretty good now00:52
clarkband now it is time to go figure out some dinner00:52
*** diablo_rojo has quit IRC00:56
ianw#status log afsdb01/02 restarted with afs 1.8 packages01:15
openstackstatusianw: finished logging01:15
openstackgerritMerged openstack/diskimage-builder master: Install last stable version of script
*** dviroel has quit IRC01:54
*** mlavalle has quit IRC01:55
openstackgerritMerged opendev/system-config master: Manage afsdb servers with Ansible
*** lbragstad_ has joined #opendev02:14
*** ysandeep|away is now known as ysandeep02:15
*** lbragstad has quit IRC02:17
*** hemanth_n has joined #opendev02:19
*** DSpider has quit IRC02:55
openstackgerritMerged opendev/system-config master: borg-backup: implement saving a stream, use for database backups
*** ykarel has joined #opendev03:18
*** ykarel has quit IRC03:27
*** d34dh0r53 has quit IRC03:47
*** d34dh0r53 has joined #opendev03:48
*** d34dh0r53 has quit IRC03:48
*** d34dh0r53 has joined #opendev03:49
*** d34dh0r53 has quit IRC03:49
*** lbragstad_ is now known as lbragstad03:51
*** d34dh0r53 has joined #opendev03:53
*** d34dh0r53 has quit IRC03:55
*** d34dh0r53 has joined #opendev03:56
*** d34dh0r53 has quit IRC03:56
*** d34dh0r53 has joined #opendev03:57
*** d34dh0r53 has joined #opendev03:58
*** d34dh0r53 has joined #opendev03:59
*** brinzhang has quit IRC04:52
*** brinzhang has joined #opendev04:53
*** ykarel has joined #opendev05:01
*** brinzhang_ has joined #opendev05:02
*** brinzhang has quit IRC05:05
ianwclarkb/kopecmartin : re 705258 i left review comments, but i've started to try and flesh out the steps we'll use to bring up the host and various other things.  as mentioned, i think it would be good to validate the db migration procedure on a test host before we start production host05:49
*** ykarel_ has joined #opendev05:50
*** ykarel has quit IRC05:53
*** ykarel_ is now known as ykarel05:53
*** ykarel_ has joined #opendev05:58
*** marios has joined #opendev05:59
*** ykarel has quit IRC06:00
*** ykarel_ is now known as ykarel06:15
*** dirtygiraffe has joined #opendev06:58
*** dirtygiraffe has quit IRC07:02
*** brinzhang_ has quit IRC07:04
*** brinzhang_ has joined #opendev07:04
*** eolivare has joined #opendev07:28
*** slaweq has joined #opendev07:28
*** ralonsoh has joined #opendev07:28
*** hashar has joined #opendev07:58
*** hashar has quit IRC08:01
*** hashar has joined #opendev08:01
*** sboyron_ has joined #opendev08:04
*** fressi has joined #opendev08:04
*** ysandeep is now known as ysandeep|lunch08:18
*** andrewbonney has joined #opendev08:19
*** valery_t has joined #opendev08:21
*** ykarel is now known as ykarel|lunch08:21
valery_tI need a reviewer for my review
*** valery_t has quit IRC08:32
fricklerwow, that one was really hasty08:33
cgoncalveshey folks. not sure if this issue has been reported or not, apologies in advance. is super slow, CI jobs timing out08:36
cgoncalves(HTTP 443, connection timed out)08:38
*** tosky has joined #opendev08:40
*** rpittau|afk is now known as rpittau08:41
openstackgerritMerged openstack/diskimage-builder master: Remove the deprecated ironic-agent element
*** valery_t has joined #opendev08:49
fricklercgoncalves: works fine for me, do you have some logs? is this our CI or downstream?08:51
cgoncalvesI also hit HTTP 443 locally08:52
*** valery_t has quit IRC08:55
*** jpena|off is now known as jpena08:57
*** brinzhang_ has quit IRC08:59
*** brinzhang_ has joined #opendev09:00
fricklerhmm, seems to be a bit of a load spike, but I don't see anything wrong locally
fricklerthere also seems to be a regular peak in io load starting every day at 6, not sure if that is our periodic jobs or possibly the backup, ianw?09:04
cgoncalvesfrickler, FYI 2m11s
cgoncalvesand thanks for checking!09:05
*** DSpider has joined #opendev09:07
*** valery_t_ has joined #opendev09:14
*** ysandeep|lunch is now known as ysandeep09:38
priteauGood morning. tarballs.o.o is extremely slow for me today. I remember it happened some time ago and someone restarted apache (IIRC) which fixed it09:53
priteauYeah, that was on 2020-11-2709:55
priteau16:31 fungi: #status log restarted apache2 on in order to troubleshoot very long response times09:56
priteaucgoncalves: I see the same problem09:57
priteaufrickler: See quote from fungi above ^^^09:58
*** wanzenbug has joined #opendev10:00
*** wanzenbug has quit IRC10:04
ttxYes, affects too10:10
*** CeeMac has joined #opendev10:20
frickler#status log restarted apache2 on in order to resolve slow responses and timeouts10:20
openstackstatusfrickler: finished logging10:20
fricklerttx: priteau: cgoncalves: infra-root: ^^ looks better to me currently, please let us know if you see any further issues10:21
cgoncalvesfrickler, functional now. thanks a lot!10:22
priteauThank you frickler! upper constraints fetched in 1 to 2 seconds10:23
*** ykarel|lunch is now known as ykarel10:29
*** sshnaidm|afk is now known as sshnaidm|ruck10:35
*** hashar has quit IRC10:45
*** dtantsur|afk is now known as dtantsur10:49
openstackgerritMartin Kopec proposed opendev/system-config master: Deploy refstack with ansible docker
*** dviroel has joined #opendev11:14
openstackgerritDinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos
*** hrw has joined #opendev12:18
hrwcan someone review/approve patch? it adds centos 8 stream for aarch64 nodes12:19
*** jpena is now known as jpena|lunch12:41
openstackgerritPedro Luis Marques Sliuzas proposed openstack/project-config master: Add Metrics Server App to StarlingX
*** hemanth_n has quit IRC13:00
*** hrw has quit IRC13:19
openstackgerritMerged openstack/project-config master: CentOS 8 Stream initial enablement for AArch64
openstackgerritPedro Luis Marques Sliuzas proposed openstack/project-config master: Add Metrics Server App to StarlingX
*** jpena|lunch is now known as jpena13:37
*** ykarel_ has joined #opendev13:51
*** ykarel has quit IRC13:54
*** whoami-rajat__ has joined #opendev13:55
*** lbragstad has quit IRC13:57
*** ykarel_ is now known as ykarel13:59
*** brinzhang_ has quit IRC14:17
*** brinzhang_ has joined #opendev14:17
*** zoharm has joined #opendev14:33
*** akahat|rover is now known as akahat14:34
*** brinzhang_ has quit IRC14:35
*** lbragstad has joined #opendev14:35
*** brinzhang_ has joined #opendev14:36
*** ysandeep is now known as ysandeep|afk14:48
*** bcafarel has quit IRC14:58
*** d34dh0r53 has quit IRC15:01
*** d34dh0r53 has joined #opendev15:01
*** fressi has quit IRC15:23
*** ykarel_ has joined #opendev15:30
*** ysandeep|afk is now known as ysandeep15:31
*** ykarel has quit IRC15:32
*** alfred188 has joined #opendev15:50
*** ykarel_ is now known as ykarel16:00
clarkbhrw isn't here anymore, but that arm64 centos 8 stream image has me wondering if maybe the centos 8 image should be removed? I don't know if anything is using it currently though16:04
*** ysandeep is now known as ysandeep|away16:06
openstackgerritClark Boylan proposed opendev/system-config master: Run gerrit 3.2 and 3.3 functional tests
*** ykarel has quit IRC16:17
*** d34dh0r53 has quit IRC16:18
*** d34dh0r53 has joined #opendev16:19
openstackgerritMatt McEuen proposed openstack/project-config master: New Project Request: airship/gerrit-to-github-bot
*** mlavalle has joined #opendev16:34
*** hashar has joined #opendev16:45
fungiclarkb: given the concern over centos 8 vs centos stream 8 it seemed like projects were going to want to have both available at least for a bit so they can make sure stream still works the same for them16:49
clarkbfungi: yup, but I'm not sure if anything used the arm64 centos 8 image?16:49
clarkbseemed like most of the work there was done on debuntu, but I am probably also just working on out of date info16:50
fungioh, the arm64 images specifically. right, that may be16:50
*** sshnaidm|ruck is now known as sshnaidm16:52
*** zoharm has quit IRC16:52
*** marios is now known as marios|out17:24
*** ralonsoh has quit IRC17:31
clarkbusing codesearch kolla uses centos-8-arm64 in the kolla-centos8-aarch64 nodeset and that is used in kolla-build-centos8-source-aarch64. Opendev also uses it to build the arm64 centos 8 wheel cache17:36
clarkbmy hunch is that hrw is adding the new image for kolla so that kolla-build-centos8-source-aarch64 job is likely to get replaced with a stream job. Once that happens we can drop the centos-8 wheel cache in favor of a stream wheel cache and then drop the image I bet17:36
clarkbbut we can't just drop it today17:36
fungiyeah, that sounds about right17:37
clarkbinfra-root I think the stack at is ready for review now. These are housekeeping changes to add gerrit 3.3 image builds and testing17:40
clarkbI figured out why those jobs were post failuring and it was because the run playbook was short circuiting due to an error which caused a log file copy to fail since the file wasn't present17:41
clarkbtl;dr best to look at post failures as if they are actual failures first17:41
fungi#status log Requested Spamhaus SBL delisting for the IPv6 address17:48
openstackstatusfungi: finished logging17:48
fungiinfra-root: i checked all the addresses and hostnames for and they're still clean17:49
fungijust as a heads up17:49
*** valery_t_ has quit IRC17:56
*** jpena is now known as jpena|off17:59
clarkbiurygregory: I have approved to allow ironic project cores to edit hashtags on the appropriate projects. I would be curious to hear how that goes18:04
iurygregoryclarkb, awesome thanks! after it merges I will give a try18:05
corvusi'd be in favor of allowing that for all auth'd users18:06
clarkbiurygregory: note there will be a delay while we sync the acl update, you can follow along in the deploy pipeline on zuul status for that change (it will be the manage-projects job)18:07
*** eolivare has quit IRC18:07
fungicorvus: yeah, i think we mostly wanted to see how it played out for volunteer test projects before we turned it on globally18:07
iurygregoryclarkb, ack18:08
fungimain concern is that any user can remove a hashtag, so some projects may find that they want to override our global access for it and restrict it to a core reviewer group18:08
*** rpittau is now known as rpittau|afk18:08
fungibut honestly, there are so many ways someone can vandalize a change in gerrit, i'm not too concerned about rampant hashtag deletion18:08
*** fbo is now known as fbo|off18:09
corvusyep, that's my thought.  a measured introduction with clear guidelines would probably help.  maybe a standard place (CONTRIBUTING?) to describe a project's "reserved" hashtags18:10
*** zimmerry has joined #opendev18:10
openstackgerritMerged openstack/project-config master: Update ACLs of Ironic Projects to allow Edit Hashtags
clarkbfungi: looking at project-config changes I notice that you've got a revert to reenable gentoo image builds again. I presume that means they are off now? should we reenable them at this point?18:12
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
*** dtantsur is now known as dtantsur|afk18:16
fungiclarkb: prometheanfire had some fixes into dib, we probably need to check that those appear in a release before we try again18:20
clarkbgot it18:20
*** d34dh0r53 has quit IRC18:22
openstackgerritMerged openstack/project-config master: Remove anachronistic jobs from scciclient
*** d34dh0r53 has joined #opendev18:24
fungii see at least one gentoo-related entry in `git log --no-merges --oneline 3.6.0..origin/master`18:28
fungiianw: frickler: how do you feel about tagging another dib release?18:28
fungilooks like that'll pull in the change too18:29
fungiand a fix for centos stream18:29
prometheanfirefungi: clarkb yep, the gentoo update would be nice, iirc it may help fix the build issues for gentoo18:33
openstackgerritMerged openstack/project-config master: Add Metrics Server App to StarlingX
corvusi'm seeing gerrit http response times > 30s in gertty18:53
clarkbload is a bit high right now, but not drastically so. I've been having decent luck through the web ui. I wouldn't say its super fast, but also hasn't been terribly slow doing project-config and zuul-job reviews18:54
*** marios|out has quit IRC18:54
clarkbdansmith is doing things with the api to get zuul comments (based on conversation in -infra)18:56
clarkbI wonder if that could be related, or if its just another researcher18:56
clarkbload appears to be falling off now18:59
iurygregoryclarkb, worked =)19:01
iurygregory I can add hashtag for a change that I'm not the owner/uploader19:02
clarkbiurygregory: cool, if you end up with some examples of how you are using it, I would be interested in seeing those19:02
iurygregorythe idea is that we will use to track priorities for review19:02
clarkbiurygregory: you could tag changes "urgent" I guess19:03
clarkband then core reviewers start reviewing anything tagged urgent when they review sort of thing?19:03
iurygregoryand probably for backports also (we are thinking in add the backport-candidate label) and maybe try to use the gerrit api to automatically add a hashtag that would tell we need to have backport in some patches19:04
iurygregoryclarkb, with the hashtag we can have a simple search in gerrit19:04
clarkbcorvus: I wonder too if possibly updating acls slows things down (maybe there are locks involved in that?)19:04
iurygregoryfor example19:04
iurygregoryso maybe we will have specific ironic hashtags we want to use to make things easier for us and have a dashboard that would help the community19:05
clarkbright, in that example "bifrost" is implied because it is the bifrost repo. But I can see how other values for things like backports and urgency would help out19:06
*** andrewbonney has quit IRC19:17
*** psliuzas has joined #opendev19:18
psliuzasHey folks, My commit just got merged and I would like to be the first core reviewer for the repo starlingx/metrics-server-armada-app , could someone help me with that? thanks!19:24
openstackgerritMatt McEuen proposed openstack/project-config master: New Project Request: airship/gerrit-to-github-bot
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
fungipsliuzas: sure, taking care of it now, just a moment19:32
fungipsliuzas: oh, our deployment automation hasn't run for that yet, i'll check it again in a few minutes19:34
fungiinfra-prod-manage-projects TIMED_OUT in 30m 39s19:43
fungii guess that's why it wasn't created19:43
fungilooking into it now19:43
fungi"Failed to set desciption for: openstack/puppet-openstack_extras 500 Server Error: Internal Server Error for url: https://localhost:3000/api/v1/repos/openstack/puppet-openstack_extras"19:47
fungilooks like gitea01 may be having a bad day19:47
fungiit errored about setting descriptions on a bunch of projects19:48
openstackgerritMerged openstack/project-config master: New Project Request: airship/gerrit-to-github-bot
fungii'll keep an eye on that one ^ and see if the problem persists19:51
clarkbfungi: thanks19:57
clarkbI want to say we considered making the project description update failures non fatal?19:58
clarkbit happens in a different spot than the initial project setup iirc, so we could separate those two concerns and get the project update on the next pass whenever that happens19:58
fungiyeah, it's not 100% clear to me from the log that's why it didn't run tasks for gerrit, but seems likely19:59
clarkbfungi: I think the whole job short circuits if the gitea stuff fails because we don't want to create a repo in gerrit that will fail to replicate19:59
clarkbI'll look into that after my bike ride as that seems like a good improvement20:04
openstackgerritMerged zuul/zuul-jobs master: bindep: remove set_fact usage when converting string to list
*** hashar has quit IRC20:11
*** klonn has joined #opendev20:16
funginow gitea08 is returning "Internal Server Error for url: https://localhost:3000/api/v1/orgs/pypa/teams?limit=50&page=2" according to the latest log20:23
fungiand gitea04 said "401 Client Error: Unauthorized for url: https://localhost:3000/api/v1/user/orgs?limit=50&page=1"20:24
fungii wonder if something is going sideways in gitea20:24
clarkbcacti shows significant new cpu demand on 0120:25
clarkb04 was in a similar situation until recently but seems to have subsided20:26
clarkb07 and 08 exhibit similar20:27
fungiyeah, seeing that. maybe we're getting slammed by something/someone20:27
fungiif it hasn't subsided by the time my kettle reaches a boil, i'll start digging into apache access logs and looking at blocking abusive client addresses20:29
clarkbfungi: remember that you need to map the connecting port in apache to the haproxy logs in the lb syslog20:30
clarkbfungi: since from apache's perspective all connections originate from the load balancer20:30
clarkbhrm apache may not be logging that :/20:31
clarkbfungi: maybe we set up something like 'LogFormat "%h:%{remote}p %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined' and then tell CustomLog to use that format?20:40
clarkbI'll push that change up now and we can iterate on it if necessary20:40
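With a format that starts with `%h:%{remote}p`, the first field of each access-log line becomes `host:port`, and the port is what gets matched against the haproxy entries on the load balancer. A minimal sketch of pulling that pair out of one line follows; the sample line, address (documentation range), and request are made up for illustration and are not taken from the real logs.

```shell
# Hypothetical access-log line in the proposed format.
line='203.0.113.10:51234 - - [03/Feb/2021:20:45:00 +0000] "GET /opendev/system-config HTTP/1.1" 200 512 "-" "git/2.25.1"'

# The first whitespace-separated field is "host:port".
peer=${line%% *}

# Split on the LAST colon so an IPv6 host, which itself contains colons, survives.
peer_host=${peer%:*}
peer_port=${peer##*:}

echo "host=$peer_host port=$peer_port"
```

Because every connection reaches apache from the load balancer, the host part is expected to be the balancer's own address. The port is the piece that identifies the individual client connection in the haproxy log.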
openstackgerritClark Boylan proposed opendev/system-config master: Add remote port info to gitea apache access logs
fungiahh, yep20:45
fungiestablished tcp connections through the lb shot waaay up around 19z20:48
*** sshnaidm is now known as sshnaidm|afk20:52
clarkbianw: for the last little bit the giteas are doing the dance similar to the thing you wrote the apache vhost for20:54
clarkbianw: we've noticed that apache isn't logging source ports so hard to map to the load balancer logs so I pushed
ianwinteresting, and it seems like they must be coming from separately hashed addresses if multiple gitea's are feeling it20:55
fungiseems the lb is strafing problem traffic to different backends until they oom20:56
clarkbI think too what can happen is one gets overwhelmed and the lb takes it out of the pool and then the addrs shift20:56
fungiso it could be just one or a small handful of client addresses20:56
fungibut memory consumption seems to be the predominant symptom, we're reaching oom conditions on backend servers20:57
fungihuh, any idea why we've got afs set up on the gitea servers?20:57
clarkbI don't see afs on gitea08, but I may not be looking properly20:58
fungid'oh, my bad. i should be on gitea08 not ze0820:59
fungiyeah, definite oom there20:59
fungikilled a gitea process20:59
fungiclarkb spotted one ipv4 address making a ton of requests which were getting directed to gitea01 where the current memory crisis seems to be unfolding. i've temporarily blocked it in iptables on the lb to see what happens21:03
clarkbalready load seems better fwiw21:03
fungimnaser: seems we may be getting spammed by very heavy git clone operations in volume from which looks like a vexxhost customer (but isn't us as far as i can tell). i've temporarily blocked access from that address to the git servers21:05
clarkbthat is a bit of an imperfect correlation without the port details21:05
clarkbwe can get the logging improvement in then open things up and see what we can infer from there21:06
fungiyeah, and i'll watch the logs here for a bit, then try to remove the block rule and see if the problem resumes21:06
ianwthere is something cloning /vexxhost/* with an odd UA "GET /vexxhost/helm-charts/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/1.0"21:10
*** psliuzas has quit IRC21:10
ianw# cat gitea-ssl-access.log | grep 'git/1.0' | awk '{print $7}' | sort | uniq -c21:11
ianw   1229 /vexxhost/helm-charts/info/refs?service=git-upload-pack21:11
ianw    297 /vexxhost/openstack-operator/info/refs?service=git-upload-pack21:11
ianw    297 /vexxhost/rbac-helm/info/refs?service=git-upload-pack21:11
ianwit has very particular interest21:11
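The tally above can also be done in a single awk pass that matches the user agent as its own field rather than grepping the whole line. The sketch below runs on made-up sample lines in Apache combined format (addresses from the documentation range); it assumes the user agent is the last quoted field, as in the request quoted earlier.

```shell
# Made-up sample lines; only the request paths and the git/1.0 UA mirror the log above.
sample_log=$(cat <<'EOF'
198.51.100.7 - - [03/Feb/2021:21:00:01 +0000] "GET /vexxhost/helm-charts/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/1.0"
198.51.100.7 - - [03/Feb/2021:21:00:02 +0000] "GET /vexxhost/helm-charts/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/1.0"
198.51.100.9 - - [03/Feb/2021:21:00:03 +0000] "GET /vexxhost/rbac-helm/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/1.0"
198.51.100.8 - - [03/Feb/2021:21:00:04 +0000] "GET /opendev/system-config/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/2.25.1"
EOF
)

# Split on double quotes: field 2 is the request line, field 6 the user agent.
# Count request paths only for lines whose user agent is exactly git/1.0.
counts=$(printf '%s\n' "$sample_log" |
  awk -F'"' '$6 == "git/1.0" { split($2, req, " "); n[req[2]]++ }
             END { for (p in n) print n[p], p }' |
  sort -rn)

printf '%s\n' "$counts"
```

Matching on the field avoids counting lines where `git/1.0` merely appears somewhere else, for example in a referer.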
fungiyeah, the potentially problematic requests i was seeing all had git/1.0 as the us21:11
fungigah, ua21:11
fungilooks like gitea01 also reached oom conditions21:12
fungi[Wed Feb  3 20:54:22 2021] Killed process 29676 (gitea) total-vm:30048404kB, anon-rss:7604728kB, file-rss:0kB, shmem-rss:0kB21:13
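To put the kernel message above in more readable units, the anon-rss figure can be extracted and converted. A minimal sketch, assuming the kernel's kB values are multiples of 1024 bytes; the variable names are illustrative only.

```shell
# Kernel message as quoted in the log above.
oom_line='[Wed Feb  3 20:54:22 2021] Killed process 29676 (gitea) total-vm:30048404kB, anon-rss:7604728kB, file-rss:0kB, shmem-rss:0kB'

# Pull out the number between "anon-rss:" and "kB".
rss_kb=$(printf '%s\n' "$oom_line" | sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p')

# Convert to GiB with one decimal place.
rss_gib=$(awk -v kb="$rss_kb" 'BEGIN { printf "%.1f", kb / 1048576 }')

echo "gitea anonymous resident memory at kill time: ${rss_gib} GiB"
```

For this line the gitea process was holding roughly 7.3 GiB of anonymous memory when it was killed.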
fungiproblem client(s) may have gotten punted by the lb to a fresh backend after that21:14
ianwgitea01 seems to have no "git/1.0" UA requests?21:15
fungiso far the problem seems to have hit 01, 04, 07 and 0821:16
fungi01 looks reasonably healthy again in past 10-15 minutes21:19
fungii don't see any indication the load has shifted to another backend21:20
fungithe secondary symptom of established tcp connection count on the lb also seems to have subsided around the same timeframe21:21
fungiin a few more minutes i'll try removing the firewall rule blocking
ianwi can't pick any common themes from the logs like on gitea08 with the git/1.0 thing.  although git/1.0 seems to be a pretty common thing used in a few git libraries.  all it really indicates is whatever is cloning isn't actually a basic git client, but something using a library21:23
fungii've approved another project creation change, in hopes that might flush the incompletely applied changes from earlier21:25
fungigonna pop out to check the mail while that grist churns through the mill, brb21:27
*** hamalq has joined #opendev21:27
*** klonn has quit IRC21:32
*** whoami-rajat__ has quit IRC21:34
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
openstackgerritMerged openstack/project-config master: Add ansible-role-pki repo
fungiand just waiting for that to deploy now21:37
fungiseems to be in progress, tailing the log on bridge.o.o21:41
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
fungiso far it's only gitea06 which hasn't reported any task result21:46
fungiload average there is pretty high21:47
fungilike around 10 right now21:47
fungilooks like swap usage is spiking on 06 just in the last poll interval21:48
fungipossible it's just manage-projects briefly running through all the descriptions21:49
*** sboyron_ has quit IRC21:49
fungibut the other 7 backends completed far faster21:49
fungiyeah, swap is getting exhausted quickly there21:51
fungialready basically no ram available and half the swap in use21:51
fungiokay, seems to be subsiding now21:52
fungino oom (yet anyway)21:52
fungii need to start working on dinner but will try to keep one eye on my terminals21:55
ianwok sorry back now22:05
ianwseems like we don't really have a smoking gun22:05
*** openstackgerrit has quit IRC22:11
funginot so far, no22:15
clarkbhave we gotten better logging in place?22:39
ianwumm i +2'd it but it hadn't finished testing22:40
clarkbI think last time we weren't able to pinpoint anything until we had similar in the gitea logs22:40
ianwlooks like it's still moving through22:40
clarkbI expect that to be a big help given previous experiences22:40
*** slaweq has quit IRC22:41
*** slaweq has joined #opendev22:43
fungilooks like the manage-projects run actually completed without timing out22:45
clarkblots of connections from a single vexxhost ip to gitea06 according to the lb22:45
clarkbI wonder if its just bouncing around a few IPs there?22:45
fungipsliuzas is gone, but i've added them to starlingx-metrics-server-armada-app-core22:46
fungidoes look like i caught that right as load was ramping up for gitea0622:46
fungican see it probably hit an oom condition a few minutes ago now22:47
clarkbas far as I can tell these IPs from vexxhost are not part of our gitea cluster or in our nodepool logs (so not ours)22:47
fungi[Wed Feb  3 22:30:41 2021] Killed process 14724 (gitea) total-vm:22863900kB, anon-rss:7564180kB, file-rss:0kB, shmem-rss:0kB22:47
fungiyeah so oom on gitea06 ~47 minutes ago22:47
*** slaweq has quit IRC22:47
fungier, ~1722:47
clarkbload was high as of a few minutes ago22:48
clarkbI didn't think it was that long ago :)22:48
clarkbI feel like the key is to catch whichever one is next now22:49
clarkbbefore it goes completely sad22:49
fungii've reset iptables on gitea-lb01 now so is no longer blocked22:49
fungias it didn't seem that one (or that one alone anyway) was the problem22:50
clarkb06 has the highest system load of the set, the rest look quite happy actually22:50
fungisystem load average is back down around 1 now22:51
fungion gitea0622:51
clarkbthat vexxhost IP seems to have continuously made requests that hit 06 for hours and hours and hours22:53
clarkbwhich is interesting, but maybe an indication it isn't to blame22:54
clarkbhowever that vexxhost IP made far and away the most requests to gitea06 while cacti reports it as being under high load22:58
fungieyeballing the overall impact, it's possible these two ip addresses together are the cause23:01
fungisince it looks like blocking one of them may have roughly halved the effect23:01
fungibut it's also possible utilization is trailing off in general for the day, and is no longer compounding the problem23:02
fungimildly amusing, the address i blocked earlier, when stuffed into my web browser, reveals that it's actually trunk-centos8.rdoproject.org23:04
fungiand the other one seems to be trunk-primary.rdoproject.org23:05
fungiso maybe we need to reach out to rdo folks and make sure everything is okay on their end?23:05
fungiwe probably even have some rdo people in here or at least in #openstack-infra who can check on things23:07
fungiand would probably be faster than having vexxhost support act as a relay for the discussion23:08
*** openstackgerrit has joined #opendev23:24
openstackgerritMerged opendev/system-config master: Add remote port info to gitea apache access logs
clarkbI was able to use the new logging from ^ on gitea06 to correlate some requests to the rdo host fungi pointed out above.23:39
clarkbThat is the host I identified as making the bulk of the requests to gitea06 via haproxy logs23:39
clarkbstill not an indication that what they did is wrong (and in fact they seem to regularly poll repos for ref updates)23:39
clarkbbut was a good test case for: do our logs give us what we need to correlate things now and I think they do.23:39
clarkbWe might consider logging the apache source port on the connection to gitea so that we can correlate between apache and gitea too?23:40
clarkbactually I don't know how to expose that with apache logging23:41
clarkb%{format}p doesn't seem to have a format for that23:41
clarkbianw: fungi fwiw I think at this point we largely need to see it happening again so that we log it with the data necessary to correlate things then go from there23:42
corvusi wonder if we can get metrics from gerrit on certain operations (like how long a push takes)23:42
corvusi was wondering that as i just pushed a change and it seemed to take a good 10-15 seconds23:43
clarkbcorvus: for replication to gitea? oh this is separate23:43
corvusyeah, sorry, separate23:43
clarkbcorvus: I think that you can probably get that out of the ssh logs23:43
clarkbI want to say there is timing info there and there should be enough info to split out the git operations23:43
corvusmight be a nice thing to track in a dashboard as opposed to anecdata23:43
clarkbbut it's been a while since I looked at that log file23:43
clarkbfwiw I noticed that pushing to gerrit's gerrit is similarly slow (but I've only pushed a handful of times to there recently)23:44
clarkbalso that is over http not ssh23:44
corvusclarkb: yeah; though i always chalked that up to their backend (i assume a lot of distributed locking is involved)23:44
corvusi sort of assumed they had a high cost for each push, but that they could scale out to a lot of simultaneous pushes (to different repos at least)23:45
corvusbut that's totally just assumption/inference on my part23:45
*** tosky has quit IRC23:54
clarkbfungi: it appears that updating project descriptions is already a best effort attempt and shouldn't cause things to fail23:59
clarkbfungi: I think the implication there is that something failed when trying to create the new project in gitea and that was a valid failure23:59
