Wednesday, 2023-02-08

ianwhttps://gerrit-review.googlesource.com/c/gerrit/+/357694 is more channel logging.  i'll update github if/when that passes00:29
ianw(it builds for me, after i figured out bazelisk, again :)00:29
opendevreviewIan Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/87280601:10
*** rcastillo|rover is now known as rcastillo03:59
*** yadnesh|away is now known as yadnesh04:02
*** JasonF is now known as JayF04:12
*** ysandeep|out is now known as ysandeep05:12
ysandeepfolks o/ looking for reviews on https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873056 06:03
ysandeepTo fix py27 jobs, last night's fix had an issue in the playbook06:04
ianwthanks for that; sorry -- shouldn't have posted it before breakfast :)06:22
*** jpena|off is now known as jpena08:29
*** amorin_ is now known as amorin09:09
*** ysandeep is now known as ysandeep|food09:27
yoctozeptohi folks; any idea why zuul has not picked up https://review.opendev.org/c/opendev/sandbox/+/873096 ? I got no feedback from it; I was aiming to present how zuul picks up and runs new jobs; maybe it's disabled somehow for the sandbox but I could not find info on that either10:23
yoctozeptohmm, seemingly it has happened with a large delay10:36
yoctozeptoalthough I expected the job on the status page in the meantime10:37
yoctozeptooh, I get it, it's the openstack tenant10:38
yoctozeptomystery solved10:38
yoctozeptoI just assumed that opendev/ would run in opendev tenant10:39
yoctozeptosorry for troubling the channel :-)10:39
*** ysandeep|food is now known as ysandeep11:10
*** ysandeep is now known as ysandeep|ruck11:10
fungiyoctozepto: most of the opendev/ git namespace projects are still in the "openstack" zuul tenant because that used to be the only tenant and we've just not found time to move them yet or they're complicated to disentangle from testing of other projects in that tenant12:56
yoctozeptofungi: thanks for clarifying!13:01
yoctozeptobtw, there is an issue with a broken submodule in that opendev/sandbox repository if someone cares to take a look :-) 13:02
yoctozeptoat least cloning using git clone produces a broken state13:03
yoctozepto(.gitmodules is missing)13:03
yoctozepto(and /sandbox is a submodule path)13:03
fungineat. i expect somebody added that trying to test submodule support13:12
yoctozeptofungi: if you were so kind as to cross-review and merge: https://review.opendev.org/c/opendev/sandbox/+/873097 :-)13:18
yoctozeptothis seems to have introduced the issue: https://review.opendev.org/c/opendev/sandbox/+/43255613:18
*** yadnesh is now known as yadnesh|away13:24
ZaphodBhi there, our developers have noticed that there might be a rate limit imposed on our IP(v4) ranges when cloning from https://opendev.org/openstack/keystone.git . Is there such a thing or am i looking at a connectivity issue? If there is a rate limit how would i best address the issue? ranges would be 31.172.115.0/24 and 31.172.116.0/23 .13:35
*** dasm|off is now known as dasm|rover13:51
dasm|roverfungi: ianw thanks for doing a great job! (continuing yesterday's conversation :) )13:52
*** tosky_ is now known as tosky14:19
fungiZaphodB: no inherent rate limit, but the servers providing that git interface are easily overwhelmed. please don't repeatedly clone the same things from there, clone once and then just fetch updates14:24
fungiand if you have a lot of machines in your network trying to use copies of the repositories we host, please make a local cache and distribute it to your machines rather than having each one clone separate copies14:25
fungiremember we're a volunteer open source community relying on generously donated resources, so please try to help conserve our bandwidth usage and system resources14:26
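A minimal sketch of the pattern fungi describes above, with illustrative paths and repo choice: clone once, fetch afterwards, and let internal machines clone from a single shared mirror rather than each pulling from opendev.org.
    # one-time clone, then cheap incremental updates from then on
    git clone https://opendev.org/openstack/keystone.git
    cd keystone && git fetch origin
    # or maintain one bare mirror on an internal host (path is illustrative)...
    git clone --mirror https://opendev.org/openstack/keystone.git /srv/git/keystone.git
    git --git-dir=/srv/git/keystone.git remote update --prune
    # ...and point every other machine at that mirror instead
    git clone /srv/git/keystone.git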
jrosserfungi: the terrible connectivity i had yesterday is still bad today, if not worse than it was yesterday15:25
jrosserstill feels like transit trouble15:25
fungimnaser: are connectivity issues through one or more backbones (possibly cogent or zayo) on your radar? no idea if you normally prepend your ebgp(6) announcements to work around such situations15:31
mnaserfungi: hrm, i know there was a maintenance a few days ago15:31
mnaserlet me check15:32
fungimnaser: some users are reporting significant packet loss and out-of-order deliveries, most had traceroutes through your cogent peer, one said zayo, but we haven't backtraced them so it's possible the issue is on an asymmetric return route15:33
mnaserfungi: is this sjc1?15:35
*** dasm|rover is now known as dasm|afk15:36
*** ysandeep|ruck is now known as ysandeep|out15:36
fungimnaser: review.o.o so i think montreal? unless you're backhauling, my traceroutes come in through your 800SquareVictoriaSt01.YUL.beanfield.com peer15:37
mnaserah yes in montreal, i thought opendev.org cause that's in sjc1 afaik15:37
fungii'm not having any trouble at all through the beanfield peer, fwiw15:38
mnaserjrosser: i see you're having issues, whats a traceroute to 38.122.103.106 look like?15:43
mnaser(or mtr if possible)15:44
jrossermnaser: one moment, let me look15:45
jrossermnaser: this is from yesterday when it was ~100kbits/s throughput https://paste.opendev.org/show/bWexwiKnupXkaawY4IUA/15:46
mnaserlooks like zayo through cogent, hrm15:47
mnaseralso this is opendev.org so this is actually sjc115:47
fungioh fun, so maybe the earlier packet loss through cogent to review.o.o a day or so ago was unrelated15:50
jrosserbut that exact same host just now is quick and looks like the same route https://paste.opendev.org/show/b6pHfSyE0IsYfyDAb2JG/15:50
jrosserthough tbh something odd is going on as we have people getting pretty desperate in #openstack-ansible setting up local mirrors to be able to do any meaningful deployment15:51
jrosserand instinctively i feel they are suffering (at times) from the same thing, really low throughput15:51
fungiyeah, if this isn't the issues reported for reaching gerrit earlier, then we should look to see if someone is overwhelming the gitea servers. this may not be network related at all15:52
jrosserunfortunately i'm not a good canary for this sort of trouble as i have local mirrors anyway15:53
fungii'm digging through the cacti graphs for all the gitea backends now15:54
Clark[m]Gitea08 had trouble recently and tonyb first reported it.15:55
Clark[m]Given history chances are it's OSA doing it too :)15:55
Clark[m]By the time I had time to look at it things had calmed down15:55
Clark[m]But logs should still be there15:55
jrosserwell there is user-agent in there so you can be definitive about that15:56
jrosserit'll turn into folklore otherwise15:56
jrosserand we now have a hard failure in the code for this situation too15:56
mnaserfungi: please let me know whenever you get some further details and if you still feel it's network or potentially vm/backend related15:56
Clark[m]jrosser: also you can check what backend you are talking to via the SSL cert. One of the names in the cert is for the backend. And backends can be addressed directly at https://gitea0X.opendev.org:3081 for values of X 1 through 815:57
fungimnaser: will do. i'll also get JayF to check whether he's still seeing connectivity issues to gerrit15:57
Clark[m]Unfortunately git cannot be safely load balanced in a least connections manner due to state on the client side that persists across multiple tcp connections. This forces us to load balance based on source hashing15:59
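A sketch of how the direct backend access Clark describes can be used to compare timings; gitea03 is just an example value of X, and openstack/nova is picked only because it is a large repo.
    # via the load balancer (source-hashed, so your IP always lands on the same backend)
    time git clone https://opendev.org/openstack/nova.git nova-lb
    # straight at one numbered backend, bypassing haproxy
    time git clone https://gitea03.opendev.org:3081/openstack/nova.git nova-direct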
fungiClark[m]: jrosser: yeah, you can see on the gitea08 graphs it was having a very unhappy night around 00-02 utc, and had some new spikes around 13-14 utc though not quite as terrible: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=886&nodeid=node1_886&host_group_data=15:59
jrosserso for that host i have which was slow yesterday but quick today it hits gitea0716:00
fungialso gitea01 has been serving a very significant but steady amount of data, doesn't look like it's struggling so the access pattern is likely different: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=879&nodeid=node1_879&host_group_data=16:00
fungimuch higher traffic volume than the other servers though16:01
jrosserhow is the loadbalance done? source IP?16:03
fungiyes16:03
fungissl is terminated on the backend servers16:03
fungigetting users to identify the subject on the server cert they're seeing for https://opendev.org/ from the affected machines and the time they saw slowness could help us correlate logs16:04
fungiecho|openssl s_client -connect opendev.org:https|openssl x509 -text|grep gitea16:04
fungithat's the easy way i know to check from the command line, but there are likely simpler alternatives16:05
jrosseror `curl -vvI https://opendev.org` for those encumbered with http proxy :/16:05
ZaphodBcurl -v - i seem to be getting 6 mostly but 8 is in the mix as well16:05
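One of the possibly simpler alternatives fungi alludes to, assuming openssl 1.1.1 or newer (the backend name shows up in the certificate's subjectAltName list):
    openssl s_client -connect opendev.org:443 </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName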
fungiresource utilization on gitea06 seems nominal16:08
fungii'm going to switch to checking dmesg on the backends for oom killer events16:08
fungiwe often see those if systems repeatedly clone large repos16:08
fungithe only one to hit that this year was gitea08, last occurrence was 02:40:19 utc today16:11
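Roughly the kind of check being described; exact kernel message wording varies, so the pattern below is a guess that covers common variants:
    sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'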
fungiZaphodB: you're seeing slow downloads from gitea06? i'll try cloning some large repos from it to get comparative timing16:12
fungii'm seeing what looks like slow behavior cloning https://gitea06.opendev.org:3000/openstack/nova from home16:17
fungii'll try to compare to other backends, but git is reporting around 100-125 KiB/s at the moment16:18
ZaphodBfungi: yes, it's fluctuating wildly around 40KiB/s, have ruled out MTU issue, ping -s 1472 -M do works.16:18
fungiand it often dips down to around 70 KiB/s for me16:18
funginot going to wait for this nova clone from gitea06 to finish. aborting and trying gitea05 next16:21
fungiseeing similar behavior from gitea0516:24
fungitrying 04 now, but it does from outward appearances look like the problem is upstream of the backends, so the local network between them and the lb or the lb itself or the uplink from the lb or...16:25
clarkbyou could do a local clone to the vexxhost region or on the same hosts to rule out the application16:26
jrosseris there a decent sized thing that can be wget from the same network to decide if it is git specific or just general network trouble16:26
clarkbpossibly something hosted on our mirror in the same cloud region, but then you're involving afs which isn't the best for this kind of debugging16:27
clarkbI think fungi can simply check the git operations locally to the cluster to rule in/out the network16:27
clarkb(sorry I'm not really here today, but have time over breakfast tea to look at the computer)16:28
fungiyeah, i plan to repeat some testing from additional locations16:29
fungibut 04 is equally as slow as 05 and 06 so i don't think the backends themselves are the problem this time16:29
fungialso my traceroute is through zayo (my isp and vexxhost both peer with it, so that's the only backbone provider)16:30
clarkbI'm through cogent and I get >1MiB/s to gitea0416:31
clarkbwhich is quite a bit better than 70KiB/s16:31
fungicloning from https://opendev.org/openstack/nova from a shell on mirror01.sjc1.vexxhost.opendev.org i see very fast response16:32
fungi9.69 MiB/s16:32
clarkbya so very likely network not host limited here16:32
clarkband it's possible the gitea08 issues are related if slow operations cause more ops to pile up all at once16:32
fungii'll move to another remote host with a different route and see what i get16:32
fungimirror.ca-ymq-1.vexxhost.opendev.org is also fast at cloning nova: 8.34 MiB/s16:35
fungitraceroute goes through cogent16:35
fungireturn route back to it also through cogent16:36
fungiso seems like symmetric cogent routes are not experiencing the slowness16:37
fungimirror.dfw.rax.opendev.org reaches vexxhost sjc1 via zayo, and the return route is symmetrical. let's see if it gets terrible performance16:38
funginot great but comparatively reasonable: 1.87 MiB/s16:40
fungiit did slow up quite a bit at the end though16:40
clarkbfungi: and with review.o.o  we thought cogent might be the issue? Two different locations so entirely possible. But we probably need to be extra careful not to create confusion over what is what as a result16:44
fungiyes16:45
fungiwell, also it was stated that there was maintenance on cogent's network a couple of days ago, so possible that performance issue went away16:45
fungii retested from dfw.rax and got a similar 1.16 MiB/s result16:46
fungitesting from ord.rax it's not slow at all (9.21 MiB/s). i'll continue looking for a mirror server with significantly worse performance reaching gitea16:48
ZaphodBwe and other german isps experienced something like this with vodafone in munich recently, where rerouting via frankfurt as a temporary measure helped. it was suggested that it might have been one of several links in a bundle experiencing loss. icmp was not affected. sadly they only acknowledge the issue not specifics.16:48
fungiyeah, if i can correlate from enough locations, i should be able to isolate some common backbone segments where the problem arises16:49
ZaphodB 21. AS174    te0-0-0-8.nr01.b051790-0.sjc01.atlas.cogentco.com (154.24.37.62)       0.0%   100  163.0 162.8 162.4 164.7   0.316:50
ZaphodBmy path at home via zayo is not affected16:51
ZaphodB31.172.115.2 is pingable if you're interested in the reverse16:52
fungiyeah, return path to you goes through zayo as well16:59
fungithen peering with twelve99.net in nyc and across to london17:00
fungipart of the problem with zayo though is they're more of a conglomerate than many of the backbones since they picked up telia, abovenet and others17:01
fungibut as i understand it those different parts of their network are still mostly independently operated like they were formerly17:01
ZaphodBhttps://www.peeringdb.com/net/541 (ex AboveNet, MFNX) seems to be what you're dealing with here17:04
fungitrying a european host, our gra1.ovh mirror, i see fine performance too17:07
fungiold abovenet (now zayo) equinix peer in paris17:08
fungimnaser: just to update you, it does appear to be a network problem, but seems likely somewhere inside one of zayo's transit networks, since some routes through zayo are fine17:10
fungistill trying to pin down something a little more exact though17:11
mnaserfungi: thanks for digging into this, i would really appreciate a reproducible use case as much as possible so that i can talk to them about this, cause it seems like it's maybe EU related only from looking at logs17:14
fungimnaser: my home connection (charter cable through zayo to vexxhost sjc1) is also experiencing a problem, not positive it's the same one yet17:19
fungiboth ipv4 and ipv6 are orders of magnitude slower for me than they should be17:21
fungicomparatively, i can clone very quickly from review.opendev.org (ca-ymq-1)17:22
fungiso i've effectively ruled out my home network and local isp segment at least17:22
fungimy isp is peering with zayo in atlanta, return path peers with zayo in dallas17:24
fungiinterestingly, performance from our rackspace mirror in dallas was also slowish through zayo, but not nearly as slow as it is for me17:24
fungiwhich gives me the feeling atlanta is involved17:25
fungii'm not observing packet loss, just excruciatingly slow transfer rate17:27
fungiso i'm not sure mtr will provide anything useful that a normal traceroute doesn't17:27
fungiyeah, icmp echo to opendev.org from home, 100 packets transmitted, 100 received, 0% packet loss17:28
fungioh, though rtt is pretty high17:28
fungi107ms average17:28
fungiping time to montreal is about half that, though it's also geographically closer to me but not by nearly enough to account for the difference17:29
mnaseryeah 107ms is quite high for anywhere across US17:29
fungiso i guess if mtr shows a significant jump in latency between a couple of hops, that could be an indicator17:30
fungiof course, that assumes the latency is connected to the throughput issue, which isn't certain17:31
fungiand the asymmetric nature of my route to sjc1 is going to make that harder to nail down too17:31
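For anyone reproducing this, a report-mode mtr along these lines is the sort of run that makes a per-hop latency jump visible; the -b and -z flags (show hostnames plus IPs, and AS numbers) depend on the mtr version installed:
    mtr -rwbzc 100 opendev.org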
fungilatency definitely jumps for me in atlanta though, roughly triples17:32
*** jpena is now known as jpena|off17:32
mnaserwell it looks like nb02.opendev.org is happily pushing a fair bit of traffic through to upload images :)17:33
fungior that might be a cross-country hop exiting their atl pop17:34
funginope, the jump in latency looks like it's inside atlanta now that more routers are responding to me17:34
mnaseroh i have an idea17:35
fungiup through ae67.zayo.ter1.atl10.us.zip.zayo i see ~33ms and then the next hop is ae9.cs1.atl10.us.eth.zayo.com where it jumps to >100ms for the rest of the path out17:35
mnaserfungi: i think there is a mirror by any chance in sjc1 where we can try to download 1 big file?17:36
mnaserthat will help determine latency vs throughput :>17:36
fungiyeah, i can slap a file on there. i've been relying on git's throughput reports but http is likely cleaner17:36
mnaseri think git does a lot of small requests rather than 1 big one17:38
mnaserbut that's just my guessing17:38
fungiyeah, in that case it could be the latency killing it if the small requests are serialized17:38
fungistill trying to get a clean mtr report for the return path, but i think it's starting to show low latency all the way up to the point in my outbound path where the routes diverge, which would also point to atlanta being the actual problem17:39
clarkbgit should do a couple small ones to negotiate then one or more larger ones to fetch the necessary pack files17:46
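If anyone wants to see that split directly, git's trace variables expose the small negotiation requests and the single large pack transfer; this is a sketch and the output is very verbose:
    GIT_TRACE_PACKET=1 GIT_CURL_VERBOSE=1 git clone https://opendev.org/openstack/nova.git /tmp/nova-trace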
fungihttp://mirror.sjc1.vexxhost.opendev.org/urandom-gb.bin is a gig of /dev/urandom served from the server's local filesystem17:49
fungii'm floating around 70-100KB/s download rate according to wget17:49
fungii also added a urandom-mb.bin and urandom-kb.bin for testing smaller sizes17:50
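Presumably the test files were generated with something like dd (the output location is a guess), and curl can report raw HTTP throughput from an affected client without git in the picture:
    # on the mirror, roughly
    dd if=/dev/urandom of=urandom-gb.bin bs=1M count=1024
    # from an affected client
    curl -o /dev/null -w 'avg download: %{speed_download} bytes/s\n' http://mirror.sjc1.vexxhost.opendev.org/urandom-gb.bin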
mnaserjrosser: hows that file looking for you ^ ?18:17
fungii've had mtr going for about an hour on the gitea haproxy lb trying to trace back to my home address, and everything after zayo-charter.ter1.dfw2.us.zip.za until it reaches my house (~9 additional hops) is just "waiting for reply" so not going to tell us much other than the return path is definitely going through a different part of zayo's network than outbound18:19
fungithough i guess that's the serial on their peer with my isp, so the missing hops are inside my isp's network18:20
jrossermnaser: just travelling can check later18:21
fungiso anyway, outbound route from my house peers with zayo in atlanta and then shows a 3x jump in latency one hop into zayo's network (still in atl), return route zayo peers with my isp in dallas with decent rtt but i don't get any additional info past there18:22
fungimnaser: mtr reports, so much as they are... https://paste.opendev.org/show/bjPfo1zPcy5HDJNENgXb/18:24
slittle1_what happened here ? ... https://zuul.opendev.org/t/openstack/build/5649ab1e204b4d328a1196222371fb36/logs18:51
clarkbslittle1_: https://zuul.opendev.org/t/openstack/build/5649ab1e204b4d328a1196222371fb36/log/job-output.txt#781 you've hit a tox v4 incompatibility18:52
fungitox stopped allowing you to run anything outside the venv unless it's in the allowlist for that testenv in your tox.ini18:53
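The usual tox 4 fix is to list the external commands in the affected testenv's allowlist in tox.ini; the env name and commands below are placeholders, not necessarily what this particular job runs:
    [testenv:linters]
    allowlist_externals =
        bash
        find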
slittle1_some sort of log would be helpful18:55
fungiis the log not showing up for you?18:55
slittle1_'This build does not provide any logs'18:56
fungithe "manifest" of logs is fetched by javascript in that dashboard, to get a link to an object storage service which serves that manifest. possible it's getting blocked by a browser extension or a proxy?18:57
fungislittle1_: it should be pulling https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/873166/1/check/openstack-tox-linters/5649ab1/zuul-manifest.json to generate the list of logs there18:59
fungiare you unable to retrieve that file?18:59
fungicould also be a routing problem i guess, if that service is unreachable19:00
slittle1_An error occurred during a connection to storage.bhs.cloud.ovh.net. PR_CONNECT_RESET_ERROR19:04
fungigot it. so something is rejecting/resetting connections between you and that part of the internet19:05
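A quick way to test the same endpoint from the command line on the affected network, assuming curl is available there; -v shows where in the TCP/TLS exchange the reset happens:
    curl -v -o /dev/null https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/873166/1/check/openstack-tox-linters/5649ab1/zuul-manifest.json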
slittle1_tried from a machine on a non-WindRiver network.... couldn't view https://review.opendev.org/c/starlingx/manifest/+/873166 at all.  Just a blank screen19:32
fungiwow19:33
fungii'm not having trouble getting there, but it's the internet19:34
slittle1_My iphone can view both, but that's hardly ideal19:34
fungikeep in mind that review.opendev.org is in vexxhost's montreal canada data center, while storage.bhs.cloud.ovh.net is in ovh's montreal area data center19:35
fungitwo different providers, but in relatively close geographical proximity to one another19:35
fungiit's possible you have a common route to both of them19:35
funginot that i'm saying quebec has bad networking, but it's an interesting coincidence19:37
fungii'm also not super familiar with geography, but i think beauharnois is on the opposite side of montreal from vexxhost's pop19:40
fungior maybe just closer to the city19:40
fungithough also, mnaser will be interested in problems connecting to review.opendev.org (in vexxhost's ca-ymq-1 network)19:41
fungiamorin may similarly be interested in connectivity issues to ovh's bhs1 region19:42
slittle1_gerrit is giving me 'kex_exchange_identification: read: Connection reset by peer' again after removal of 'ServerAliveInterval 60'20:24
fungi"Connection reset by peer" can often be coming from any router between you and the final destination20:26
fungirouters often send a tcp/rst packet on behalf of a destination in order to forcibly close a connection20:27
fungislittle1_: given the network connection issues you're seeing to multiple networks, is it possible someone had made changes in packet filtering or connection proxying in a network local to your client machine?20:35
fungiback in the days when i used to work in an office, we had firewalls which closed "idle" ssh connections after a couple of minutes, and the way a firewall closes a connection is usually via spoofed tcp reset on behalf of the remote peers20:37
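If an intermediate firewall is dropping idle connections, the client-side keepalive slittle1_ mentioned removing is the usual workaround, e.g. in ~/.ssh/config (a sketch):
    Host review.opendev.org
        ServerAliveInterval 60
        ServerAliveCountMax 3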
JayFfungi: btw I saw your note about try again ... I had CenturyLink installed yesterday (a long-delayed maintenance coming to pass)20:37
JayFfungi: so I am not able to reproduce from my now-defunct comcast20:37
fungiJayF: i have no idea whether to congratulate or console you20:37
fungii guess congratulations are in order20:38
JayFIt depends on if the squirrels find fiber as tasty as coax...20:38
JayF(it overall seems to be nicer, from my zero quantitative data)20:38
fungisquirrels are indiscriminate20:38
jrossermnaser: fungi:  https://paste.opendev.org/show/btcIlXecuuI3njHmNq6h/20:41
fungijrosser: looks like it's spiking latency at the old abovenet uk border into zayo's transit network. return path may indicate where the actual jump in latency is coming from20:43
ianwspeaking of the ssh dropouts; thomas responded that https://gerrit-review.googlesource.com/c/gerrit/+/357694 looks OK20:46
ianwso i'll see about patching it into our build20:46
slittle1_'kex_exchange_identification: read: Connection reset by peer' is pretty consistently hitting about 25% of my git review transactions.   By any chance would review.opendev.org be load balanced over 4 servers? 20:56
ianwi'll also get back to the docker upgrades20:56
slittle1_I did open a ticket with my internal IT as well20:56
ianwslittle1_: review.opendev.org isn't; it's a single server20:59
fungiwith no load balancer or proxy. it's just a big vm with publicly routable ipv4 and v6 addresses directly on the internet21:00
fungiwell, to be more precise, that ssh interface is within a jvm running in a linux vm21:01
fungiso there is an additional (java) tcp/ip stack21:01
fungithe gerrit service embeds a java-based sshd21:02
ianwpicking off the zk servers for upgrade, i should be right to just cycle those sequentially right?21:18
fungiyes. as long as two remain up and the third has a chance to re-synchronize after it returns to service before you take down the next one21:19
ianw++21:19
fungiwhat you want to avoid is having only one cluster member with current data21:20
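A sketch of the kind of check that confirms a restarted member has rejoined before moving to the next one, assuming the ZooKeeper four-letter-word commands are whitelisted on these servers:
    # on each zk host: Mode should report leader or follower once rejoined
    echo srvr | nc localhost 2181 | grep Mode
    # on the leader: zk_synced_followers should return to 2 in a three-node cluster
    echo mntr | nc localhost 2181 | grep -E 'zk_server_state|zk_synced_followers'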
*** dasm|afk is now known as dasm|offline22:05
