Wednesday, 2023-02-08

ianwhttps://gerrit-review.googlesource.com/c/gerrit/+/357694 is more channel logging.  i'll update github if/when that passes00:29
ianw(it builds for me, after i figured out bazelisk, again :)00:29
opendevreviewIan Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/87280601:10
*** rcastillo|rover is now known as rcastillo03:59
*** yadnesh|away is now known as yadnesh04:02
*** JasonF is now known as JayF04:12
*** ysandeep|out is now known as ysandeep05:12
ysandeepfolks o/ looking for reviews on https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873056 06:03
ysandeepTo fix py27 jobs, last night's fix had an issue in the playbook06:04
ianwthanks for that; sorry -- shouldn't have posted it before breakfast :)06:22
*** jpena|off is now known as jpena08:29
*** amorin_ is now known as amorin09:09
*** ysandeep is now known as ysandeep|food09:27
yoctozeptohi folks; any idea why zuul has not picked up https://review.opendev.org/c/opendev/sandbox/+/873096 ? I got no feedback from it; I was aiming to present how zuul picks up and runs new jobs; maybe it's disabled somehow for the sandbox but I could not find info on that either10:23
yoctozeptohmm, seemingly it has happened with a large delay10:36
yoctozeptoalthough I expected the job on the status page in the meantime10:37
yoctozeptooh, I get it, it's the openstack tenant10:38
yoctozeptomystery solved10:38
yoctozeptoI just assumed that opendev/ would run in opendev tenant10:39
yoctozeptosorry for troubling the channel :-)10:39
*** ysandeep|food is now known as ysandeep11:10
*** ysandeep is now known as ysandeep|ruck11:10
fungiyoctozepto: most of the opendev/ git namespace projects are still in the "openstack" zuul tenant because that used to be the only tenant and we've just not found time to move them yet or they're complicated to disentangle from testing of other projects in that tenant12:56
yoctozeptofungi: thanks for clarifying!13:01
yoctozeptobtw, there is an issue with a broken submodule in that opendev/sandbox repository if someone cares to take a look :-) 13:02
yoctozeptoat least cloning using git clone produces a broken state13:03
yoctozepto(.gitmodules is missing)13:03
yoctozepto(and /sandbox is a submodule path)13:03
fungineat. i expect somebody added that trying to test submodule support13:12
yoctozeptofungi: if you were so kind as to cross-review and merge: https://review.opendev.org/c/opendev/sandbox/+/873097 :-)13:18
yoctozeptothis seems to have introduced the issue: https://review.opendev.org/c/opendev/sandbox/+/43255613:18
*** yadnesh is now known as yadnesh|away13:24
ZaphodBhi there, our developers have noticed that there might be a rate limit imposed on our IP(v4) ranges when cloning from https://opendev.org/openstack/keystone.git . Is there such a thing or am i looking at a connectivity issue? If there is a rate limit how would i best address the issue? ranges would be 31.172.115.0/24 and 31.172.116.0/23 .13:35
*** dasm|off is now known as dasm|rover13:51
dasm|roverfungi: ianw thanks for doing a great job! (continuing yesterday's conversation :) )13:52
*** tosky_ is now known as tosky14:19
fungiZaphodB: no inherent rate limit, but the servers providing that git interface are easily overwhelmed. please don't repeatedly clone the same things from there, clone once and then just fetch updates14:24
fungiand if you have a lot of machines in your network trying to use copies of the repositories we host, please make a local cache and distribute it to your machines rather than having each one clone separate copies14:25
fungiremember we're a volunteer open source community relying on generously donated resources, so please try to help conserve our bandwidth usage and system resources14:26
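A minimal sketch of the pattern fungi describes above, with illustrative paths and repo choice: clone once, fetch afterwards, and let internal machines clone from a single shared mirror rather than each pulling from opendev.org.
    # one-time clone, then cheap incremental updates from then on
    git clone https://opendev.org/openstack/keystone.git
    cd keystone && git fetch origin
    # or maintain one bare mirror on an internal host (path is illustrative)...
    git clone --mirror https://opendev.org/openstack/keystone.git /srv/git/keystone.git
    git --git-dir=/srv/git/keystone.git remote update --prune
    # ...and point every other machine at that mirror instead
    git clone /srv/git/keystone.git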
jrosserfungi: the terrible connectivity i had yesterday is still bad today, if not worse than it was yesterday15:25
jrosserstill feels like transit trouble15:25
fungimnaser: are connectivity issues through one or more backbones (possibly cogent or zayo) on your radar? no idea if you normally prepend your ebgp(6) announcements to work around such situations15:31
mnaserfungi: hrm, i know there was a maintenance a few days ago15:31
mnaserlet me check15:32
fungimnaser: some users are reporting significant packet loss and out-of-order deliveries, most had traceroutes through your cogent peer, one said zayo, but we haven't backtraced them so it's possible the issue is on an asymmetric return route15:33
mnaserfungi: is this sjc1?15:35
*** dasm|rover is now known as dasm|afk15:36
*** ysandeep|ruck is now known as ysandeep|out15:36
fungimnaser: review.o.o so i think montreal? unless you're backhauling, my traceroutes come in through your 800SquareVictoriaSt01.YUL.beanfield.com peer15:37
mnaserah yes in montreal, i thought opendev.org cause that's in sjc1 afaik15:37
fungii'm not having any trouble at all through the beanfield peer, fwiw15:38
mnaserjrosser: i see you're having issues, whats a traceroute to 38.122.103.106 look like?15:43
mnaser(or mtr if possible)15:44
jrossermnaser: one moment, let me look15:45
jrossermnaser: this is from yesterday when it was ~100kbits/s throughput https://paste.opendev.org/show/bWexwiKnupXkaawY4IUA/15:46
mnaserlooks like zayo through cogent, hrm15:47
mnaseralso this is opendev.org so this is actually sjc115:47
fungioh fun, so maybe the earlier packet loss through cogent to review.o.o a day or so ago was unrelated15:50
jrosserbut that exact same host just now is quick and looks like the same route https://paste.opendev.org/show/b6pHfSyE0IsYfyDAb2JG/15:50
jrosserthough tbh something odd is going on as we have people getting pretty desperate in #openstack-ansible setting up local mirrors to be able to do any meaningful deployment15:51
jrosserand instinctively i feel they are suffering (at times) from the same thing, really low throughput15:51
fungiyeah, if this isn't the issues reported for reaching gerrit earlier, then we should look to see if someone is overwhelming the gitea servers. this may not be network related at all15:52
jrosserunfortunately i'm not a good canary for this sort of trouble as i have local mirrors anyway15:53
fungii'm digging through the cacti graphs for all the gitea backends now15:54
Clark[m]Gitea08 had trouble recently and tonyb first reported it.15:55
Clark[m]Given history chances are it's OSA doing it too :)15:55
Clark[m]By the time I had time to look at it things had calmed down15:55
Clark[m]But logs should still be there15:55
jrosserwell there is user-agent in there so you can be definitive about that15:56
jrosserit'll turn into folklore otherwise15:56
jrosserand we now have a hard failure in the code for this situation too15:56
mnaserfungi: please let me know whenever you get some further details and if you still feel it's network or potentially vm/backend related15:56
Clark[m]jrosser: also you can check what backend you are talking to via the SSL cert. One of the names in the cert is for the backend. And backends can be addressed directly at https://gitea0X.opendev.org:3081 for values of X 1 through 815:57
fungimnaser: will do. i'll also get JayF to check whether he's still seeing connectivity issues to gerrit15:57
Clark[m]Unfortunately git cannot be safely load balanced in a least connections manner due to state on the client side that persists across multiple tcp connections. This forces us to load balance based on source hashing15:59
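A sketch of how the direct backend access Clark describes can be used to compare timings; gitea03 is just an example value of X, and openstack/nova is picked only because it is a large repo.
    # via the load balancer (source-hashed, so your IP always lands on the same backend)
    time git clone https://opendev.org/openstack/nova.git nova-lb
    # straight at one numbered backend, bypassing haproxy
    time git clone https://gitea03.opendev.org:3081/openstack/nova.git nova-direct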
fungiClark[m]: jrosser: yeah, you can see on the gitea08 graphs it was having a very unhappy night around 00-02 utc, and had some new spikes around 13-14 utc though not quite as terrible: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=886&nodeid=node1_886&host_group_data=15:59
jrosserso for that host i have which was slow yesterday but quick today it hits gitea0716:00
fungialso gitea01 has been serving a very significant but steady amount of data, doesn't look like it's struggling so the access pattern is likely different: http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=879&nodeid=node1_879&host_group_data=16:00
fungimuch higher traffic volume than the other servers though16:01
jrosserhow is the loadbalance done? source IP?16:03
fungiyes16:03
fungissl is terminated on the backend servers16:03
fungigetting users to identify the subject on the server cert they're seeing for https://opendev.org/ from the affected machines and the time they saw slowness could help us correlate logs16:04
fungiecho|openssl s_client -connect opendev.org:https|openssl x509 -text|grep gitea16:04
fungithat's the easy way i know to check from the command line, but there are likely simpler alternatives16:05
jrosseror `curl -vvI https://opendev.org` for those encumbered with http proxy :/16:05
ZaphodBcurl -v - i seem to be getting 6 mostly but 8 is in the mix as well16:05
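One of the possibly simpler alternatives fungi alludes to, assuming openssl 1.1.1 or newer (the backend name shows up in the certificate's subjectAltName list):
    openssl s_client -connect opendev.org:443 </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName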
fungiresource utilization on gitea06 seems nominal16:08
fungii'm going to switch to checking dmesg on the backends for oom killer events16:08
fungiwe often see those if systems repeatedly clone large repos16:08
fungithe only one to hit that this year was gitea08, last occurrence was 02:40:19 utc today16:11
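Roughly the kind of check being described; exact kernel message wording varies, so the pattern below is a guess that covers common variants:
    sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'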
fungiZaphodB: you're seeing slow downloads from gitea06? i'll try cloning some large repos from it to get comparative timing16:12
fungii'm seeing what looks like slow behavior cloning https://gitea06.opendev.org:3000/openstack/nova from home16:17
fungii'll try to compare to other backends, but git is reporting around 100-125 KiB/s at the moment16:18
ZaphodBfungi: yes, it's fluctuating wildly around 40KiB/s, have ruled out MTU issue, ping -s 1472 -M do works.16:18
fungiand it often dips down to around 70 KiB/s for me16:18
funginot going to wait for this nova clone from gitea06 to finish. aborting and trying gitea05 next16:21
fungiseeing similar behavior from gitea0516:24
fungitrying 04 now, but it does from outward appearances look like the problem is upstream of the backends, so the local network between them and the lb or the lb itself or the uplink from the lb or...16:25
clarkbyou could do a local clone to the vexxhost region or on the same hosts to rule out the application16:26
jrosseris there a decent sized thing that can be wget from the same network to decide if it is git specific or just general network trouble16:26
clarkbpossibly something hosted on our mirror in the same cloud region, but then you're involving afs which isn't the best for this kind of debugging16:27
clarkbI think fungi can simply check the git operations locally to the cluster to rule in/out the network16:27
clarkb(sorry I'm not really here today, but have time over breakfast tea to look at the computer)16:28
fungiyeah, i plan to repeat some testing from additional locations16:29
fungibut 04 is equally as slow as 05 and 06 so i don't think the backends themselves are the problem this time16:29
fungialso my traceroute is through zayo (my isp and vexxhost both peer with it, so that's the only backbone provider)16:30
clarkbI'm through cogent and I get >1MiB/s to gitea0416:31
clarkbwhich is quite a bit better than 70KiB/s16:31
fungicloning from https://opendev.org/openstack/nova from a shell on mirror01.sjc1.vexxhost.opendev.org i see very fast response16:32
fungi9.69 MiB/s16:32
clarkbya so very likely network not host limited here16:32
clarkband it's possible the gitea08 issues are related if slow operations cause more ops to pile up all at once16:32
fungii'll move to another remote host with a different route and see what i get16:32
fungimirror.ca-ymq-1.vexxhost.opendev.org is also fast at cloning nova: 8.34 MiB/s16:35
fungitraceroute goes through cogent16:35
fungireturn route back to it also through cogent16:36
fungiso seems like symmetric cogent routes are not experiencing the slowness16:37
fungimirror.dfw.rax.opendev.org reaches vexxhost sjc1 via zayo, and the return route is symmetrical. let's see if it gets terrible performance16:38
funginot great but comparatively reasonable: 1.87 MiB/s16:40
fungiit did slow up quite a bit at the end though16:40
clarkbfungi: and with review.o.o  we thought cogent might be the issue? Two different locations so entirely possible. But we probably need to be extra careful not to create confusion over what is what as a result16:44
fungiyes16:45
fungiwell, also it was stated that there was maintenance on cogent's network a couple of days ago, so possible that performance issue went away16:45
fungii retested from dfw.rax and got a similar 1.16 MiB/s result16:46
fungitesting from ord.rax it's not slow at all (9.21 MiB/s). i'll continue looking for a mirror server with significantly worse performance reaching gitea16:48
ZaphodBwe and other german isps experienced something like this with vodafone in munich recently, where rerouting via frankfurt as a temporary measure helped. it was suggested that it might have been one of several links in a bundle experiencing loss. icmp was not affected. sadly they only acknowledge the issue not specifics.16:48
fungiyeah, if i can correlate from enough locations, i should be able to isolate some common backbone segments where the problem arises16:49
ZaphodB 21. AS174    te0-0-0-8.nr01.b051790-0.sjc01.atlas.cogentco.com (154.24.37.62)       0.0%   100  163.0 162.8 162.4 164.7   0.316:50
ZaphodBmy path at home via zayo is not affected16:51
ZaphodB31.172.115.2 is pingable if you're interested in the reverse16:52
fungiyeah, return path to you goes through zayo as well16:59
fungithen peering with twelve99.net in nyc and across to london17:00
fungipart of the problem with zayo though is they're more of a conglomerate than many of the backbones since they picked up telia, abovenet and others17:01
fungibut as i understand it those different parts of their network are still mostly independently operated like they were formerly17:01
ZaphodBhttps://www.peeringdb.com/net/541 (ex AboveNet, MFNX) seems to be what you're dealing with here17:04
fungitrying a european host, our gra1.ovh mirror, i see fine performance too17:07
fungiold abovenet (now zayo) equinix peer in paris17:08
fungimnaser: just to update you, it does appear to be a network problem, but seems likely somewhere inside one of zayo's transit networks, since some routes through zayo are fine17:10
fungistill trying to pin down something a little more exact though17:11
mnaserfungi: thanks for digging into this, i would really appreciate a reproducible use case as much as possible so that i can talk to them about this, cause it seems like it's maybe EU related only from looking at logs17:14
fungimnaser: my home connection (charter cable through zayo to vexxhost sjc1) is also experiencing a problem, not positive it's the same one yet17:19
fungiboth ipv4 and ipv6 are orders of magnitude slower for me than they should be17:21
fungicomparatively, i can clone very quickly from review.opendev.org (ca-ymq-1)17:22
fungiso i've effectively ruled out my home network and local isp segment at least17:22
fungimy isp is peering with zayo in atlanta, return path peers with zayo in dallas17:24
fungiinterestingly, performance from our rackspace mirror in dallas was also slowish through zayo, but not nearly as slow as it is for me17:24
fungiwhich gives me the feeling atlanta is involved17:25
fungii'm not observing packet loss, just excruciatingly slow transfer rate17:27
fungiso i'm not sure mtr will provide anything useful that a normal traceroute doesn't17:27
fungiyeah, icmp echo to opendev.org from home, 100 packets transmitted, 100 received, 0% packet loss17:28
fungioh, though rtt is pretty high17:28
fungi107ms average17:28
fungiping time to montreal is about half that, though it's also geographically closer to me but not by nearly enough to account for the difference17:29
mnaseryeah 107ms is quite high for anywhere across US17:29
fungiso i guess if mtr shows a significant jump in latency between a couple of hops, that could be an indicator17:30
fungiof course, that assumes the latency is connected to the throughput issue, which isn't certain17:31
fungiand the asymmetric nature of my route to sjc1 is going to make that harder to nail down too17:31
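For anyone reproducing this, a report-mode mtr along these lines is the sort of run that makes a per-hop latency jump visible; the -b and -z flags (show hostnames plus IPs, and AS numbers) depend on the mtr version installed:
    mtr -rwbzc 100 opendev.org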
fungilatency definitely jumps for me in atlanta though, roughly triples17:32
*** jpena is now known as jpena|off17:32
mnaserwell it looks like nb02.opendev.org is happily pushing a fair bit of traffic through to upload images :)17:33
fungior that might be a cross-country hop exiting their atl pop17:34
funginope, the jump in latency looks like it's inside atlanta now that more routers are responding to me17:34
mnaseroh i have an idea17:35
fungiup through ae67.zayo.ter1.atl10.us.zip.zayo i see ~33ms and then the next hop is ae9.cs1.atl10.us.eth.zayo.com where it jumps to >100ms for the rest of the path out17:35
mnaserfungi: i think there is a mirror by any chance in sjc1 where we can try to download 1 big file?17:36
mnaserthat will help determine latency vs throughput :>17:36
fungiyeah, i can slap a file on there. i've been relying on git's throughput reports but http is likely cleaner17:36
mnaseri think git does a lot of small requests rather than 1 big one17:38
mnaserbut that's just my guessing17:38
fungiyeah, in that case it could be the latency killing it if the small requests are serialized17:38
fungistill trying to get a clean mtr report for the return path, but i think it's starting to show low latency all the way up to the point in my outbound path where the routes diverge, which would also point to atlanta being the actual problem17:39
clarkbgit should do a couple small ones to negotiate then one or more larger ones to fetch the necessary pack files17:46
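If anyone wants to see that split directly, git's trace variables expose the small negotiation requests and the single large pack transfer; this is a sketch and the output is very verbose:
    GIT_TRACE_PACKET=1 GIT_CURL_VERBOSE=1 git clone https://opendev.org/openstack/nova.git /tmp/nova-trace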
fungihttp://mirror.sjc1.vexxhost.opendev.org/urandom-gb.bin is a gig of /dev/urandom served from the server's local filesystem17:49
fungii'm floating around 70-100KB/s download rate according to wget17:49
fungii also added a urandom-mb.bin and urandom-kb.bin for testing smaller sizes17:50
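Presumably the test files were generated with something like dd (the output location is a guess), and curl can report raw HTTP throughput from an affected client without git in the picture:
    # on the mirror, roughly
    dd if=/dev/urandom of=urandom-gb.bin bs=1M count=1024
    # from an affected client
    curl -o /dev/null -w 'avg download: %{speed_download} bytes/s\n' http://mirror.sjc1.vexxhost.opendev.org/urandom-gb.bin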
mnaserjrosser: hows that file looking for you ^ ?18:17
fungii've had mtr going for about an hour on the gitea haproxy lb trying to trace back to my home address, and everything after zayo-charter.ter1.dfw2.us.zip.za until it reaches my house (~9 additional hops) is just "waiting for reply" so not going to tell us much other than the return path is definitely going through a different part of zayo's network than outbound18:19
fungithough i guess that's the serial on their peer with my isp, so the missing hops are inside my isp's network18:20
jrossermnaser: just travelling can check later18:21
fungiso anyway, outbound route from my house peers with zayo in atlanta and then shows a 3x jump in latency one hop into zayo's network (still in atl), return route zayo peers with my isp in dallas with decent rtt but i don't get any additional info past there18:22
fungimnaser: mtr reports, so much as they are... https://paste.opendev.org/show/bjPfo1zPcy5HDJNENgXb/18:24
slittle1_what happened here ? ... https://zuul.opendev.org/t/openstack/build/5649ab1e204b4d328a1196222371fb36/logs18:51
clarkbslittle1_: https://zuul.opendev.org/t/openstack/build/5649ab1e204b4d328a1196222371fb36/log/job-output.txt#781 you've hit a tox v4 incompatibility18:52
fungitox stopped allowing you to run anything outside the venv unless it's in the allowlist for that testenv in your tox.ini18:53
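The usual tox 4 fix is to list the external commands in the affected testenv's allowlist in tox.ini; the env name and commands below are placeholders, not necessarily what this particular job runs:
    [testenv:linters]
    allowlist_externals =
        bash
        find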
slittle1_some sort of log would be helpful18:55
fungiis the log not showing up for you?18:55
slittle1_'This build does not provide any logs'18:56
fungithe "manifest" of logs is fetched by javascript in that dashboard, to get a link to an object storage service which serves that manifest. possible it's getting blocked by a browser extension or a proxy?18:57
fungislittle1_: it should be pulling https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/873166/1/check/openstack-tox-linters/5649ab1/zuul-manifest.json to generate the list of logs there18:59
fungiare you unable to retrieve that file?18:59
fungicould also be a routing problem i guess, if that service is unreachable19:00
slittle1_An error occurred during a connection to storage.bhs.cloud.ovh.net. PR_CONNECT_RESET_ERROR19:04
fungigot it. so something is rejecting/resetting connections between you and that part of the internet19:05
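A quick way to test the same endpoint from the command line on the affected network, assuming curl is available there; -v shows where in the TCP/TLS exchange the reset happens:
    curl -v -o /dev/null https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_564/873166/1/check/openstack-tox-linters/5649ab1/zuul-manifest.json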
slittle1_tried from a machine on a non-WindRiver network.... couldn't view https://review.opendev.org/c/starlingx/manifest/+/873166 at all.  Just a blank screen19:32
fungiwow19:33
fungii'm not having trouble getting there, but it's the internet19:34
slittle1_My iphone can view both, but that's hardly ideal19:34
fungikeep in mind that review.opendev.org is in vexxhost's montreal canada data center, while storage.bhs.cloud.ovh.net is in ovh's montreal area data center19:35
fungitwo different providers, but in relatively close geographical proximity to one another19:35
fungiit's possible you have a common route to both of them19:35
funginot that i'm saying quebec has bad networking, but it's an interesting coincidence19:37
fungii'm also not super familiar with geography, but i think beauharnois is on the opposite side of montreal from vexxhost's pop19:40
fungior maybe just closer to the city19:40
fungithough also, mnaser will be interested in problems connecting to review.opendev.org (in vexxhost's ca-ymq-1 network)19:41
fungiamorin may similarly be interested in connectivity issues to ovh's bhs1 region19:42
slittle1_gerrit is giving me 'kex_exchange_identification: read: Connection reset by peer' again after removal of 'ServerAliveInterval 60'20:24
fungi"Connection reset by peer" can often be coming from any router between you and the final destination20:26
fungirouters often send a tcp/rst packet on behalf of a destination in order to forcibly close a connection20:27
fungislittle1_: given the network connection issues you're seeing to multiple networks, is it possible someone had made changes in packet filtering or connection proxying in a network local to your client machine?20:35
fungiback in the days when i used to work in an office, we had firewalls which closed "idle" ssh connections after a couple of minutes, and the way a firewall closes a connection is usually via spoofed tcp reset on behalf of the remote peers20:37
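If an intermediate firewall is dropping idle connections, the client-side keepalive slittle1_ mentioned removing is the usual workaround, e.g. in ~/.ssh/config (a sketch):
    Host review.opendev.org
        ServerAliveInterval 60
        ServerAliveCountMax 3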
JayFfungi: btw I saw your note about try again ... I had CenturyLink installed yesterday (a long-delayed maintenance coming to pass)20:37
JayFfungi: so I am not able to reproduce from my now-defunct comcast20:37
fungiJayF: i have no idea whether to congratulate or console you20:37
fungii guess congratulations are in order20:38
JayFIt depends on if the squirrels find fiber as tasty as coax...20:38
JayF(it overall seems to be nicer, from my zero quantitative data)20:38
fungisquirrels are indiscriminate20:38
jrossermnaser: fungi:  https://paste.opendev.org/show/btcIlXecuuI3njHmNq6h/20:41
fungijrosser: looks like it's spiking latency at the old abovenet uk border into zayo's transit network. return path may indicate where the actual jump in latency is coming from20:43
ianwspeaking of the ssh dropouts; thomas responded that https://gerrit-review.googlesource.com/c/gerrit/+/357694 looks OK20:46
ianwso i'll see about patching it into our build20:46
slittle1_'kex_exchange_identification: read: Connection reset by peer' is pretty consistently hitting about 25% of my git review transactions.   By any chance would review.opendev.org be load balanced over 4 servers? 20:56
ianwi'll also get back to the docker upgrades20:56
slittle1_I did open a ticket with my internal IT as well20:56
ianwslittle1_: review.opendev.org isn't; it's a single server20:59
fungiwith no load balancer or proxy. it's just a big vm with publicly routable ipv4 and v6 addresses directly on the internet21:00
fungiwell, to be more precise, that ssh interface is within a jvm running in a linux vm21:01
fungiso there is an additional (java) tcp/ip stack21:01
fungithe gerrit service embeds a java-based sshd21:02
ianwpicking off the zk servers for upgrade, i should be right to just cycle those sequentially right?21:18
fungiyes. as long as two remain up and the third has a chance to re-synchronize after it returns to service before you take down the next one21:19
ianw++21:19
fungiwhat you want to avoid is having only one cluster member with current data21:20
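A sketch of the kind of check that confirms a restarted member has rejoined before moving to the next one, assuming the ZooKeeper four-letter-word commands are whitelisted on these servers:
    # on each zk host: Mode should report leader or follower once rejoined
    echo srvr | nc localhost 2181 | grep Mode
    # on the leader: zk_synced_followers should return to 2 in a three-node cluster
    echo mntr | nc localhost 2181 | grep -E 'zk_server_state|zk_synced_followers'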
*** dasm|afk is now known as dasm|offline22:05
