Saturday, 2023-12-16

Clark[m]Deployment concluded successfully and a requirements change merged and replicated successfully 02:10
*** dhill is now known as Guest10533 03:02
qwebirc35203Is there a known problem or maintenance causing opendev.org to be down?05:02
tonybqwebirc35203: can you be more specific?  I'm not seeing an issue with review.opendev.org05:08
qwebirc35203Some subdomains seem to work, but base domain as in https://opendev.org/ is not opening for me (tested a few different providers and few website testing sites and all give timeouts)05:09
tonybqwebirc35203: hmm okay.  I see it.  I'll look into it05:11
qwebirc35203tonyb: Thank you.05:11
tonyb#status notice Web, and possibly others, services on opendev.org appear to be down.  Admins are investigating.05:18
opendevstatustonyb: sending notice05:18
-opendevstatus- NOTICE: Web, and possibly others, services on opendev.org appear to be down. Admins are investigating.05:18
tonybOoops that wasn't supposed to be notice :/05:20
opendevstatustonyb: finished sending notice05:21
tonybqwebirc35203: Try again now.  It's working for me05:46
tonybqwebirc35203: and stalled again05:56
qwebirc35203tonyb: Ok, I probably wasn't fast enough on testing, because I didn't get a successful load.05:57
qwebirc35203Routing seems ok as traceroute completes to the ip that resolves from opendev.org05:58
tonybYup, and the systems behind the LoadBalancer seem fine, but not the LB itself.05:59
tonybit was fine after a process restart05:59
qwebirc35203Someone session flooding the lb?06:00
tonybqwebirc35203: Something like that.  I'm struggling to verify that :/06:10
qwebirc35203It you are running haproxy, I'm assuming you've checked hatop?06:12
qwebirc35203*If you are...06:12
tonybqwebirc35203: we don't seem to have that.06:14
qwebirc35203You might have a stats web-endpoint, that's usually on a different port so it should be accessible06:16
qwebirc35203Or you can get the raw stats with " echo "show stat" | nc -U /var/run/haproxy.stat" (your stat socket location may differ)06:19
tonybI can certainly do the last one.  I'll fess up that I'm "new" here so I'm struggling to filter the normal from the "problem pointers"06:20
qwebirc35203echo "show info" | nc -U /var/run/haproxy.stat06:21
qwebirc35203That should give you the generic stats06:21
qwebirc35203If CurrConns is hitting Hard_maxconn, then something is hogging the connections06:22
tonybOkay.  That is the current state. 06:23
tonybMaxconn: 4000 06:23
tonybHard_maxconn: 4000 06:23
tonybCurrConns: 4000 06:23
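A minimal version of the check qwebirc35203 describes, assuming the stats socket is at /var/run/haproxy.stat (the real path comes from the "stats socket" line in haproxy.cfg and differs between deployments):

    echo "show info" | sudo nc -U /var/run/haproxy.stat \
        | grep -E '^(Maxconn|Hard_maxconn|CurrConns|MaxconnReached|Ulimit-n):'

CurrConns equal to Hard_maxconn, as in the output above, means the frontend has hit its session ceiling; new connections sit in the listen backlog until a slot frees or the client gives up.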
qwebirc35203Ok, then that's a bit of an advanced topic to find out what is doing that. I would recommend forwarding the issue to someone who is used to solving that06:24
tonybqwebirc35203: Thanks for your help.06:25
qwebirc35203Could be someone intentionally doing abuse, or just automations hammering the site after some downtime06:25
qwebirc35203Increasing logging or using tcpdump may be required to figure out what is happening and how to correct the situation. If it is single source, that can be firewalled or rate limited from haproxy06:26
qwebirc35203Correct solution heavily depends on what is actually happening06:27
qwebirc35203Increasing max connections shouldn't be a problem, but I wouldn't do that without consulting someone who has experience with that specific system06:29
tonybSure, but given how quickly the connections filled up last time I suspect any additional connections will also just fill up06:31
qwebirc35203Yes, that is likely. And more connections on load balancer may cause issues to the servers behind it at normal operations. That's why changing the number must be informed decision with knowledge of the system.06:32
qwebirc35203I would probably do tcpdump on port 443 with a limited packet count to check whether some ip dominates the packets received. If you don't limit the packet count on a remote system you might have trouble stopping the capture. But every admin has their preferred way, some would add verbosity to haproxy logging.06:36
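One way to do the limited-count capture described here and rank source addresses, assuming the public interface is eth0 (interface name and packet count are placeholders):

    sudo tcpdump -nn -i eth0 -c 5000 'tcp dst port 443 or tcp dst port 80' 2>/dev/null \
        | awk '{print $3}' | sed 's/\.[0-9]*$//' | sort | uniq -c | sort -rn | head

A single address dominating the output is the smoking gun; a long flat tail points at something more distributed.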
tonybI'm confident I can up the maxconns and the load balancer and backend will handle it, even without knowing if this is innocent or not06:37
qwebirc35203Tcpdump has the advantage of having no effect on logging (if the log is processed or forwarded somewhere, changing verbosity may cause some trouble; same goes if disk space is at a premium)06:37
tonybqwebirc35203: Yeah I've looked at tcpdump and there isn't a visible hog.06:37
qwebirc35203You should be able to do x2 or x4 connections without issues at least on haproxy side. If you do that make sure Ulimit-n value in the output increases accordingly06:40
qwebirc35203Because haproxy has the front connection and the back connection, you need double the sockets relative to maxconn06:41
tonybYup06:42
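For reference, the relationship being described: haproxy sizes its own file-descriptor limit from maxconn, since every proxied session consumes a frontend and a backend socket. A sketch of the relevant global settings (values are illustrative, not opendev's):

    global
        maxconn 8000     # ceiling on concurrent sessions
        # haproxy raises its ulimit to roughly 2 x maxconn plus a margin for
        # listeners, health checks and the stats socket, so Ulimit-n and
        # Maxsock in "show info" should grow when maxconn is raised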
qwebirc35203Just came in for a quick check from webirc, switching to a better client as TNX.06:57
tonybKk06:59
TNXGood luck with resolving the issue, I'll see how it went later on. I hit the site issue while finishing some update tasks for the night; for me it's late for the work shift and early for the day.07:04
TNXI got everything important done anyway, but thought I'd give notice about the issue07:05
tonybThanks07:05
tonybI appreciate it07:05
tonyb#status log The gitea load balancer on opendev.org is saturated and therefore new connections are timing out.  Admins are investigating.07:21
opendevstatustonyb: finished logging07:21
tonybinfra-root: gitea-lb02 is "flooded" CurrConns == Maxconn == 4000, I'm basically out of ideas.08:41
tonybhttps://etherpad.opendev.org/p/gitea-debugging is a summary of the last couple of hours.  I'll be back online after dinner etc08:43
eanderssonI think something is broken again? e.g. this does not load for me at all https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt 12:15
eanderssonoh or maybe the issue is just ongoing12:16
fungieandersson: yes, we suspect something has gone awry with internal networking in vexxhost's sjc1 region13:11
fungithough not for everything... i'm able to reach https://mirror.sjc1.vexxhost.opendev.org/ with no problem13:13
fungihttps://gitea09.opendev.org:3000/ loads too13:14
fungii can ssh to the opendev.org load balancer with no problem, seems like it's just haproxy itself that's getting overloaded13:15
fungihttp://cacti.openstack.org/ graphs for gitea-lb02 are pretty striking. user cpu and load average both spike to fill the two available vcpus right at 04:00 utc13:18
fungiat the same time, network traffic drops to almost imperceptible levels13:19
fungi/var/log/haproxy.log has been silent since 07:35 utc, nearly 6 hours ago13:21
fungilooks like haproxy was restarted at roughly 07:30, so maybe it was briefly logging traffic until it fell over again13:24
fungidoing some analysis of packet capture samples to see if there's any common sources at the moment13:36
fungiat the moment there's one host that seems to account for around 30% of all inbound packets for the ports haproxy is listening on13:38
fungii've blocked the two highest volume sources in iptables and am restarting the haproxy container to see if it regains its senses13:45
fungithe site is loading for me at the moment, but i'm not sure that will last13:46
fungisource addresses have changed now, majority of high-volume senders are known parties (red hat, suse, microsoft...)13:48
funginow the site's back to not responding for me13:50
fungiyeah, 13:49:08 was the last request in the haproxy log13:51
Clark[m]Haproxy should log to syslog iirc. You get better data about the requests on the gitea side since that shows you l7 URL paths and such. My first thought is this is the web crawler ddos we've seen before.13:52
Clark[m]Haproxy logs the conclusion of connections so if we get no logs whatever it is may simply be holding connections open13:52
Clark[m]Gitea will log the start of a request and the completion of a request13:53
Clark[m]I think we want to look for started but not concluded requests in gitea and work back from there13:53
fungiyeah, do you know how it differentiates them in the log?13:54
Clark[m]It puts a string like started and finished in the log lines 13:54
Clark[m]Request started /foo/bar or something along those lines 13:55
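A rough sketch of the started-but-never-completed search Clark suggests, assuming gitea's router log uses "Started"/"Completed" lines containing the method and path (log location, exact wording and field positions vary by gitea version; on a containerized backend the same lines come out of the container logs):

    grep -ioE '(started|completed) (GET|POST|HEAD|PUT|DELETE) [^ ]+' gitea.log \
        | awk '{k=$2" "$3; if (tolower($1)=="started") s[k]++; else c[k]++}
               END {for (k in s) if (s[k] > c[k]) print s[k]-c[k], k}' \
        | sort -rn | head

Paths with far more starts than completions are the ones holding connections open.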
fungithing is, /var/log/haproxy.log only has 13 entries between when i started it and when it stopped logging13:56
Clark[m]Ya haproxy only logs when connections end iirc13:56
Clark[m]On the lb side netstat/ss to list tcp conn state may be more helpful13:57
Clark[m]Basically something appears to be making 4k connections and never closing them if I interpret this correctly. But it is early for me so take that with a grain of salt13:59
Clark[m]Restarting helps things because you manually kill those connections and let other stuff reconnect until whatever is misbehaving comes back and grabs all the connections again13:59
fungiyep14:00
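A quick way to get the connection-state picture Clark is describing, run on the load balancer itself (assumes haproxy's frontends listen on 80 and 443):

    # distribution of TCP states on the frontend ports
    ss -tan '( sport = :443 or sport = :80 )' | awk 'NR>1 {print $1}' | sort | uniq -c

Thousands of sessions sitting in ESTABLISHED with barely any traffic behind them would fit the "held open and never closed" theory.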
Clark[m]But also I think haproxy may log to syslog so double check there?14:04
Clark[m]Sorry I can't really look myself without waking the entire house on a Saturday morning 14:04
fungino worries, i've blocked the top 5 ip sources with open connections according to netstat and am restarting the container again14:06
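A sketch of the block-and-restart step fungi describes, with a placeholder address standing in for whatever the counts point at (compose directory and service name are assumptions):

    # rank peers with connections to the frontend ports (IPv4 shown; adjust for v6)
    netstat -tn | awk '$4 ~ /:(80|443)$/ {split($5,a,":"); print a[1]}' \
        | sort | uniq -c | sort -rn | head
    # drop the worst offender and recycle haproxy
    sudo iptables -I INPUT -s 203.0.113.5 -p tcp -m multiport --dports 80,443 -j DROP
    cd /etc/haproxy-docker && sudo docker-compose restart haproxy   # path and service name assumed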
Clark[m]And ya cross checking against gitea web logs may give more insight into what the clients are attempting to accomplish 14:08
fungiif things stabilize, then i can slowly remove the firewall rejects one by one until the problem comes back too14:10
funginow it's no longer loading for me again14:13
Clark[m]Whatever it is probably requires characterization beyond simply the worst IPs.14:13
Clark[m]Because there are more behind them14:14
fungiyeah, i added a couple more, the new top two with open sockets shortly after things hung, so it's possible one of them is causing it14:14
fungithey were several times higher than the runners-up14:15
Clark[m]And we're reasonably happy that this isn't a failure of the backends because those work?14:15
fungii'll retest them all directly to be sure, i only spot-checked some14:16
Clark[m]Ya backend responds immediately for me14:16
fungiall 6 return the expected content for me on 3000/tcp14:17
fungione more try. this time blocked a few /24 networks that had a large number of open connections spread across many addresses14:23
Clark[m]If that doesn't help I would check gitea logs as the next step and try to understand if clients are making it that far14:23
fungiyeah, that will be my next stop, but trying to correlate those is a lot more time consuming so wanted to rule out the easy things first14:24
Clark[m]Sometimes it's pretty apparent just because the patterns are "odd" but if not then ya14:24
fungiat the moment, all /24 networks with clients that have sockets open through the load balancer are under 25 each14:26
fungicpu on the load balancer is still maxxed out though14:27
fungi~350 total sockets open for 80/tcp or 443/tcp14:28
fungi432 sockets open now and the service is back to not responding again14:30
Clark[m]There must be more SYN'ing then to hit 4k?14:31
fungii forgot we have https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer 14:32
fungimaybe i'll find more clues there14:32
fungidefinitely reports ~4k concurrent sessions14:33
fungilooks like tons of concurrent sessions to gitea1214:36
Clark[m]There is a clue in those graphs. Gitea12 has the most conns so it is hashing more things (or one thing) to there14:36
Clark[m]I would look at logs on gitea1214:36
fungiright, so classifying the traffic it's seeing will help narrow this down14:36
Clark[m]And work backward14:36
Clark[m]++14:36
fungiearlier in the day it was mostly hitting gitea11, looks like14:37
Clark[m]The average session time graph is interesting: it basically says the average session is 0 seconds on 12? Could this be an old-fashioned tcp ddos?14:40
Clark[m]Oh wait we probably need session to end to report that data14:40
Clark[m]Same issue with logging not happening until sessions close 14:40
fungiapache access log doesn't seem to dereference the forwarded-for client, where were we recording that?14:44
fungioh, i guess we only do the source port14:45
Clark[m]You have to rely on ports. Ya it's a bug in gitea14:45
Clark[m]They have the info but their logging mangles it so we can't get it last I tried14:45
fungiso figuring out the client address(es) is... hrm14:46
Clark[m]It should be possible but also annoying14:47
Clark[m]I think the haproxy command socket will give you mappings on the haproxy side since we don't have logs yet14:48
Clark[m]And then align the port numbers with the backend?14:49
Clark[m]But also the urls and user agents may be sufficient clues for proceeding if it is the typical crawler dos we've seen14:49
Clark[m]Oh I think I remember the forwarded for issue. We go haproxy to Apache to gitea and for some reason gitea can't parse out the top level forwarding info. The Apache logs may be easier to work with?14:54
fungiyeah, i was looking in the apache logs but they show the haproxy host's address as the client14:55
fungii thought we had protocol-level proxy client signalling turned on (whatever that's called)14:55
fungiso that haproxy can communicate the original client address to apache14:55
fungibut if so, that's not being reflected in apache's access log14:56
fungii'm fumbling around with the haproxy command socket right now, but no luck figuring out how to get it to dump a list of clients/sessions yet. the context help is quite terse, i'll probably need to crack open the manual and digest haproxy's terminology14:57
fungiaha, show sess14:58
fungiit has 2391 sessions matching be=balance_git_https srv=gitea12.opendev.org14:59
fungi2214 of those are from codesearch :/15:01
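For the record, the kind of counting that produces those numbers from the admin socket (same caveat about the socket path as earlier; "show sess" prints one line per session with src=, be= and srv= fields):

    echo "show sess" | sudo nc -U /var/run/haproxy.stat \
        | grep 'srv=gitea12.opendev.org' \
        | grep -oE 'src=[^ ]+' | sed 's/^src=//; s/:[0-9]*$//' \
        | sort | uniq -c | sort -rn | head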
Clark[m]Maybe block code search and see if it helps?15:02
Clark[m]If code search is struggling we can always fix that later15:03
corvusfyi, hound has not recently restarted so seems unlikely to be a sudden behavior change there15:03
Clark[m]Hrm15:03
fungiagreed, that was the first thing i checked15:04
fungiokay, restarted the haproxy container again, this time with codesearch blocked15:04
corvushound started reporting errors on dec 1415:05
fungiit was connecting over ipv6, but i've blocked its ipv4 address too just to make sure it doesn't fall back on that15:05
corvus(hound had intermittent problems dec 14 and 15; looks much worse today)15:06
corvusthat may suggest a longer time window for whatever issue15:06
fungioddly the cpu usage and system load spiked up at precisely 04:00, which seems like suspicious timing15:07
fungihaproxy is still basically maxing out cpu utilization 15:08
fungibut maybe it will settle after a bit15:08
corvus(apropos of nothing, gitea11's ipv4 has a ptr record of test.example2.com)15:09
fungilovely15:09
funginb02 has a bunch of open sessions through haproxy at the moment15:12
fungialso something in osuosl, maybe a test node15:12
fungisite's still responding though15:13
fungihttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer shows current sessions have exceeded 3k, climbing steadily but less rapidly than before15:14
fungiyeah, the site's unresponsive again15:16
Clark[m]Codesearch was probably failing due to whatever the real issue is and making it worse by quickly trying to fetch things15:16
fungiipv6 address of nb02 currently has the largest number of active sessions through the lb15:16
Clark[m]Did the gitea or Apache logs show many commit requests by weird UAs?15:16
funginot that i spotted, lots of odd agent strings (Blackbox Exporter, SemrushBot, Faraday, ...) but i'll see if i can classify based on that15:18
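A typical way to do that classification from an Apache combined-format access log (the filename is an assumption; the user agent is the sixth double-quote-delimited field):

    awk -F'"' '{print $6}' /var/log/apache2/gitea-ssl-access.log \
        | sort | uniq -c | sort -rn | head -20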
fungilargest count of agents on gitea12 since the last restart is python-requests/2.28.115:20
fungii'll check the other backends, but if it's a ddos it would in theory be hitting them all pretty equally15:21
Clark[m]The UAs that typically gives us trouble are for old cellphones and Chinese android browsers15:21
fungigitea10 had 1599 connections for git/2.39.215:22
fungithat stands out15:22
fungiothers are mostly search bots (yandex, bing, openai)15:23
fungiyeah, after the last restart, grafana also says the bulk of current sessions were for gitea1015:24
fungii wonder if that's where nb02 is getting sent15:25
Clark[m]The pattern we've seen cause issues is requesting many commit urls in short periods of time. But I suspect that isn't an issue here because I can directly get urls like that without trouble talking to a backend15:25
fungisrc=2001:4800:7818:104:be76:4eff:fe02:5608:56772 fe=balance_git_https be=balance_git_https srv=gitea10.opendev.org 15:26
fungiso nb02 (presumably git cache dib element for an image build) made ~1.6k git requests in the few minutes haproxy was working15:27
fungishould we block it and try again?15:28
Clark[m]I feel like that is a symptom not the cause. But maybe it is git level operations not web operations that are the problem 15:29
Clark[m]Can you try cloning directly from a backend?15:29
fungii agree it seems unlikely15:29
eanderssonWould it be worth clearing the zuul queue? There are hundreds of jobs in the queue15:30
fungicloning https://gitea10.opendev.org:3000/openstack/nova is working, slowly but not unusually so15:31
corvuseandersson: zuul is unlikely to be contributing to the problem15:31
corvuseandersson: (it does not communicate with gitea)15:31
fungigit clone is averaging around 1-1.5 MiB/s from gitea10 for me15:32
Clark[m]Really seems like the issue is in the lb somehow. Maybe we block everything then run local requests against it and see if we can reproduce hanging connections. That seems like a shot in the dark though15:33
fungimy nova clone from gitea10 just completed, so took about 4 minutes15:33
fungisimilar test against gitea09 seems to be running about as fast15:34
Clark[m]The failed jobs in zuul thrashing the queues seem to be trying to fetch global requirements constraints from OpenDev. So that explains the fallout there but doesn't necessarily mean it is the problem15:35
Clark[m]But perhaps it snowballed and now we're behind due to that thundering herd15:36
Clark[m]Do periodic jobs start around 0400?15:37
eanderssonYea that is what I was thinking. Seems like a safe option to at least restart zuul?15:37
TNXThat "python-requests/2.28.1" sounds lot like Openstack ansible jobs, I first noted the issue running an upgrade15:38
fungiperiodic triggers at 02:00 utc, periodic-weekly doesn't trigger until tomorrow15:38
Clark[m]Restarting zuul doesn't dump the jobs but we could try that using the API maybe15:38
fungiTNX: yeah, in the past we've also seen thundering herds from openstack-ansible users who try to upgrade a large network and don't maintain a central cache, so wind up flooding our git server farm from multiple servers all trying to clone the same things simultaneously15:39
fungibut recent openstack-ansible versions are supposed to set a custom user agent string now in order to make that more obvious15:40
corvustime wget -O - https://gitea10.opendev.org:3081/openstack/requirements/raw/branch/master/upper-constraints.txt 15:41
corvusreal 0m0.035s 15:41
fungiyeah, we should be able to satisfy thousands of those sorts of requests15:41
corvusit doesn't look like fetching a constraints file is very taxing15:42
fungithey don't (or at least shouldn't) stay open for any time at all15:42
fungiso anyway, we don't have any periodic zuul pipelines that trigger at 04:00 on any day of the week, much less today (except our hourly deploy pipeline which fires every hour of the day of course)15:43
fungiand judging from grafana it doesn't seem likely to be something that started in the periodic pipeline at 02:00 and gradually built up, the sessions chart shows it going from a steady ~30 sessions just before 04:00 to 4k within a couple of minutes15:45
fungii suppose we could try bisecting the entire internet. wouldn't be all that many steps15:47
corvushaproxy is supposed to log every request on completion, right?  and so far, it's logged a handful from the last restart15:48
fungiyes, around 2015:48
eanderssonhttps://github.com/openstack/project-config/blob/master/zuul.d/pipelines.yaml#L212 Isn't this starting at 2AM everyday?15:48
corvuswhich means that as far as its concerned, 39xx sessions haven't completed?15:48
fungieandersson: correct, and the problem started at exactly 04:00, two hours later15:49
TNXHave you checked what that high cpu usage consists of? User, system, interrupts? Might give a pointer if there is something wrong with the system and not with the amount of connections per se.15:49
fungiTNX: user15:49
fungiit's entirely user, and attributed to the haproxy process15:50
corvusso what's it take for a session to complete? in general... i'm seeing FINs on the small amount of traffic that goes between lb02 and gitea10, so it's not like those are being completely dropped15:50
corvusaccording to strace, the haproxy process looks to be busy-waiting around 2 different epoll objects15:51
corvuslb02 has 302 connections to gitea12 in CLOSE_WAIT; gitea12 has 0 connections to lb0215:53
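Those counts can be reproduced from either end with ss state filters; the addresses below are placeholders for gitea12 and the load balancer:

    # on lb02: sessions toward gitea12 stuck half-closed
    ss -tn state close-wait 'dst 203.0.113.12' | tail -n +2 | wc -l
    # on gitea12: anything still open back toward lb02
    ss -tn 'dst 203.0.113.2' | tail -n +2 | wc -l

CLOSE_WAIT on the LB side with nothing left on the gitea side means gitea closed its end and haproxy never closed its own, which squares with the busy-wait strace.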
funginot seeing any packet loss between them at least15:54
Clark[m]Did haproxy update maybe and change its TLS connection settings by default?15:55
Clark[m]Or I supposed that could be packet loss of some sort15:56
Clark[m](hence fungi's comment)15:56
corvusyeah, i got a pcap from both ends and am looking at a single session; i'm not seeing any red flags. i see a fin,ack from each side followed by an ack from each side and that's the end15:57
fungioh, good point. i looked at dpkg.log but haproxy is run from a container image15:57
fungihaproxy                             latest    b7b500699a22   22 hours ago   121MB15:57
corvusthat seems suspicious15:58
fungiwe don't seem to have old images preserved on the server either15:58
corvusthere is an image from 8 hours ago15:59
corvuswe could try rolling forward15:59
corvushttps://www.mail-archive.com/haproxy@formilux.org/msg44428.html 16:00
corvusat a guess: maybe our current image is 2.9.0 and the newer one is 2.9.1?16:01
Clark[m]That seems like a good guess. I suspect the image can tell us somehow. If that doesn't work we can probably rollback too. Haproxy isn't very stateful across hard restarts16:02
fungihaproxy_1         | 2023-12-16T15:04:08.174923451Z [NOTICE]   (1) : haproxy version is 2.9.1-f72603c16:02
fungifrom docker-compose logs16:02
corvusoh :(16:03
corvusthen we may have to go back to < 2.9.016:03
corvusmaybe 2.8.5?16:04
fungisounds worth a try16:04
corvusit's an lts stable release :)16:04
fungialso the "22 hours ago" is probably when the image was built, not when it was published nor when we downloaded and restarted onto it16:04
Clark[m]I think 2.8.5 is fine as long as we down and up and don't do a graceful restart attempt16:05
corvusyep16:05
Clark[m]Our configs haven't been using fancy new stuff16:05
corvuswe could also use the tag "lts"16:05
Clark[m]++16:05
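A minimal sketch of the manual change under discussion (compose file location and service name are assumptions):

    # docker-compose.yaml fragment on the load balancer
    services:
      haproxy:
        image: haproxy:lts   # was haproxy:latest, which had just moved to 2.9.1

    # then recreate outright rather than attempting a graceful reload
    sudo docker-compose down && sudo docker-compose up -d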
corvusfungi: you want to manually make that change?  want me to?16:05
fungii can in a moment, trying to see if the 04:00 deploy was when we pulled the new haproxy image16:06
corvus++ i will make breakfast16:06
fungibingo16:09
fungiPulling haproxy        ... status: downloaded newer image fo...16:09
fungidocker-compose up -d ran at 03:59:27.85710916:09
fungiworking on the change now16:09
Clark[m]I'm still not reaching the service. Did you down the container first?16:11
fungino, that was from the ansible log on bridge16:11
fungii'm working on downgrading and will then push a similar change for review16:12
Clark[m]Oh I see16:12
fungii was just confirming that we upgraded to haproxy 2.9.1 at 04:00 on the nose, pretty much16:12
Clark[m]++16:13
fungihaproxy_1         | [NOTICE]   (1) : haproxy version is 2.8.5-aaba8d016:15
fungii've added the lb to the emergency disable list too so we won't roll that back accidentally while reviewing16:16
corvusi see connections growing and shrinking16:16
corvusnone staying in close_wait16:17
Clark[m]Yay. I guess we check the haproxy issue tracker to see if this is known. But that can probably wait until Monday 16:18
Clark[m]https://github.com/haproxy/haproxy/issues/2387 maybe16:19
opendevreviewJeremy Stanley proposed opendev/system-config master: Downgrade haproxy image from latest to lts  https://review.opendev.org/c/opendev/system-config/+/903805 16:20
fungii'm going to work on unwinding all the iptables drop rules now16:21
fungithat's done16:24
fungihttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer is still looking healthy16:24
Clark[m]Haproxy has been so solid I didn't even consider it was a regression there 16:26
fungistatus notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now16:26
funginot sending yet until we're sure, but does that work?16:26
Clark[m]Yes lgtm16:27
fungithis is definitely the longest it's been in working order without session count climbing since the incident began, so looks like we're in the clear16:28
fungistill solid. load average has settled around 0.116:33
fungihaproxy is using about 7% cpu now16:33
fungiinfra-root: any objection to me sending the status notice in a few minutes?16:34
fungistill looking happy 30 minutes later, so sending it now16:44
fungi#status notice Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now16:45
opendevstatusfungi: sending notice16:45
-opendevstatus- NOTICE: Service for Git repository hosting on https://opendev.org/ has been restored by rolling back an haproxy upgrade; Zuul jobs which failed with connection timeouts occurring between 04:00 and 16:15 UTC today can be safely rechecked now16:45
opendevstatusfungi: finished sending notice16:47
fungiokay, switching to the stuff i was supposed to get done this morning, but i'll keep an eye on the graphs for a while just in case16:51
clarkbfungi: corvus: I'm going to +2 but not approve the haproxy change. The reason is I remembered that zuul also uses haproxy and I'm not sure if we need manual intervention there as well18:36
jrosserTNX: did you ask the openstack-ansible team for help with your upgrade?19:07
jrossermany releases ago, lots of requests to opendev.org would happen if the upgrade guide was not followed19:08
jrosserbut there is now a specific circuit-breaker in the code to prevent that and halt the deployment19:09
TNXjrosser: I'm fine, I have latest version running now, my repo just failed one request against https://opendev.org/ for upper constraint while the issue was ongoing.19:32
fungialso note that you can switch your base url to pull from the mirror at github.com/openstack for those repos if there's a problem (though we try to be fairly responsive to outages)19:37
fungithis one was unusual because who expects a regression like that in a minor release bump for haproxy?19:38
fungii guess we do, now anyway19:38
TNXThanks for the tip, everything worked out fine in the end, I did the previous upgrade last night and the second today. I got a nice rest while you were looking into the issue19:38
TNXHaproxy is certainly one of the software pieces you wouldn't expect to be the problem. But over a long enough run even that will happen.19:40
jrosserTNX: just FYI there is now a single variable you can set in OSA to swing all the git urls over to anywhere you like, including the GitHub mirrors19:40
TNXjrosser: I'm probably getting too comfy with everything "just working" most of the time nowadays with OpenStack Ansible. I've been running it for quite a long time and I remember things being quite a bit rougher at every upgrade.19:43
* fungi is happy to hear people say openstack upgrades aren't as painful as in the bad old days19:44
tonybThanks for getting to the bottom of the haproxy issue.  I learnt a lot from reading the scrollback21:30
eanderssonIs it just me? I am still having some issues, e.g. zuul.opendev.org isn't loading for me :'(21:42
fungieandersson: i'll put money on it being the same haproxy problem... will check shortly but i agree it's not loading for me now (and as clarkb pointed out, it's the one other place we use haproxy)21:43
tonybI'm looking at it now21:44
tonybtonyb@zuul-lb01:~$  echo "show info" | sudo nc -U /var/haproxy/run/stats | grep -E '(Ulimit-n|Maxsock|Maxconn|Hard_maxconn|CurrConns|MaxconnReached)' | cat 21:46
tonybUlimit-n: 8037 21:46
tonybMaxsock: 8037 21:46
tonybMaxconn: 4000 21:46
tonybHard_maxconn: 4000 21:46
tonybCurrConns: 4000 21:46
tonybMaxconnReached: 2042 21:46
tonybecho "show info" | sudo nc -U /var/haproxy/run/stats 21:46
eandersson<321:49
tonybI've manually applied the same change as on gitea21:50
fungitonyb: thanks! it's working for me now21:51
fungican you confirm that 903805 is going to apply to both gitea-lb and zuul-lb? pretty sure they use the same ansible role21:52
tonybOh okay I can double check that21:52
fungitonyb: also, make sure to add zuul-lb01 to the emergency disable list on bridge or it will just get undone21:52
tonybI just added zuul-lb01 to emergency21:53
tonybI figured we'd need time to get a change landed to fix it21:53
tonybCurrConns on zuul-lb01 are staying low21:54
fungiyeah, my expectation is that 903805 will fix both gitea and zuul lbs because they should use the same role21:56
tonybThat makes sense, I'm just confirming that21:57
tonybYup looks good to me.21:59
fungiawesome. assuming everything's working as expected, let's worry about it next week (or this week on your end of the rock)22:00
tonybCool Beans.22:01
fungimy guess is that the zuul lb doesn't get nearly the volume of connections that the gitea lb does (probably multiple orders of magnitude fewer), so took longer to fill up22:05
tonybthat sounds plausible 22:06
