Thursday, 2020-08-13

*** mlavalle has quit IRC  00:30
*** DSpider has quit IRC  00:43
<ianw> clarkb: could you look at https://review.opendev.org/#/c/744038/ for additional quay.io mirror bits  00:55
*** openstackgerrit has joined #opendev  01:23
<openstackgerrit> Merged opendev/system-config master: Redirect UC content to TC site  https://review.opendev.org/744497  01:23
*** qchris has quit IRC  01:53
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  02:00
*** qchris has joined #opendev  02:05
<openstackgerrit> Ian Wienand proposed openstack/project-config master: A pyca/cryptography to Zuul tenant  https://review.opendev.org/745990  02:10
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  02:14
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  02:36
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  02:45
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:07
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:12
<openstackgerrit> Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support  https://review.opendev.org/746000  03:23
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:29
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:33
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:44
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:49
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds  https://review.opendev.org/745989  03:59
<openstackgerrit> Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support  https://review.opendev.org/746000  04:41
<openstackgerrit> Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support  https://review.opendev.org/746000  04:50
*** ysandeep|away is now known as ysandeep  06:06
*** jaicaa has quit IRC  06:07
*** jaicaa has joined #opendev  06:10
<ianw> infra-root: seems opendev.org is having ... issues  06:45
<ianw> hard to say  06:45
<ianw> the gitea container on gitea04 restarted just recently  06:45
<ianw> gitea03 is under memory pressure, but no one thing  06:46
<ianw> http://paste.openstack.org/show/796802/  06:46
<ianw> http://cacti.openstack.org/cacti/graph.php?action=properties&local_graph_id=66680&rra_id=0&view_type=tree&graph_start=1597299176&graph_end=1597301076  06:48
<ianw> 06:25 it started going crazy  06:48
<ianw> http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=66797&rra_id=0&view_type=tree&graph_start=1597298351&graph_end=1597300943  06:54
<ianw> it seems all the gitea hosts dropped off from ~6:08 -> ~6:35  06:55
<ianw> i think what might have happened here is some sort of progressive outage on the gitea servers; the load balancer noticed some of them not responding and cut them out  06:58
<ianw> but that then started to overload whatever was left  06:58
<ianw> gitea03 and 05 maybe  07:00
*** ryohayakawa has quit IRC  07:04
*** tosky has joined #opendev  07:42
<openstackgerrit> Ian Wienand proposed openstack/project-config master: Create pyca/infra  https://review.opendev.org/746014  07:49
*** moppy has quit IRC  08:01
*** moppy has joined #opendev  08:02
*** hashar has joined #opendev  08:03
*** ryohayakawa has joined #opendev  08:09
*** DSpider has joined #opendev  08:15
*** ysandeep is now known as ysandeep|lunch  09:20
*** ysandeep|lunch is now known as ysandeep  09:35
*** hashar has quit IRC  09:46
<mnaser> ianw, infra-root: anything from our side?  12:12
<mnaser> looks like it's more in the vms..  12:12
*** hashar has joined #opendev  12:12
*** ryohayakawa has quit IRC  12:18
*** marios|ruck has joined #opendev  12:35
*** andrewbonney has joined #opendev  13:17
*** qchris has quit IRC  13:24
*** tkajinam has quit IRC  13:32
*** qchris has joined #opendev  13:53
*** qchris has quit IRC  13:56
<clarkb> that was basically what our china source ip ddos looked like. I wonder if we've got another ddos  14:07
<clarkb> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all  14:11
<clarkb> that shows a connection spike but not like the ddos which hit the haproxy connection limits  14:11
<clarkb> still possible those were costly requests that backed things up  14:11
*** qchris has joined #opendev  14:14
<clarkb> thinking out loud here: we may want to reboot each of the backends in sequence to clear out any OOM fallout then do a gerrit full sync replication (there are reports some repos are not in sync)  14:18
<clarkb> this assumes the issue isn't persistent and was related to that spike in requests  14:18
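(For reference, the "gerrit full sync replication" mentioned above can be triggered through the Gerrit replication plugin's SSH interface; a minimal sketch, with the account name and host as placeholders:)

    # ask Gerrit to replicate every project to all configured remotes and wait
    # for completion; "admin" and the hostname are illustrative assumptions
    ssh -p 29418 admin@review.opendev.org replication start --all --wait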
<fungi> the timing suggests daily cron jobs  14:20
<clarkb> based on when gaps in cacti graph data happened we seem to have largely recovered. The gaps also correlate well to that spike in connections except for gitea05  14:27
<clarkb> http://cacti.openstack.org/cacti/graph_view.php  14:27
<clarkb> her  14:27
<clarkb> *er  14:27
<clarkb> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66728&rra_id=all  14:27
<clarkb> I can never figure out linking to top level for a host but that shows it in a specific graph  14:28
<fungi> yeah, i agree rolling reboots followed by full replication is probably warranted  14:28
<clarkb> I can work on that in about half an hour  14:29
<clarkb> looking at gitea05 more, the early blank spot doesn't correlate to any major increase in network connections or traffic  14:30
<clarkb> is it possible there was networking trouble there like we saw yesterday that caused the servers that were reachable to take on more load?  14:31
<clarkb> different cloud regions though aiui  14:31
<fungi> yeah, yesterday's v6 routing problem was in ca-ymq-1 and the gitea servers are in sjc1  14:33
<fungi> i can start on reboots... do we need to down the servers in haproxy first?  14:34
<fungi> not sure how graceful others have tried to be with these in the past  14:34
<clarkb> looking at syslog on gitea05 we seem to just be OOMing in a loop  14:35
<clarkb> that stopped about 5 hours ago  14:35
<clarkb> but also started before those gaps in time  14:36
<clarkb> 02:51:23 is when that started  14:36
<clarkb> oh that's actually when we have the first gap on gitea05  14:37
<clarkb> there are a bunch of git GETs for charms around when the OOM first started there  14:42
*** hashar has quit IRC  14:45
<clarkb> yes a canonical IP is the second biggest requestor of gitea05 between 02:00 and 03:00  14:47
<clarkb> not surprising that charms show up given that. However there is a much more request-happy IP I'm trying to figure out next  14:48
*** qchris has quit IRC  14:48
<clarkb> according to our logs, at about 01:04 gitea logs a request from this particular IP as a GET for a charm repo, then haproxy reports it was in a CD state at 02:51  14:51
<clarkb> and all but one request from this IP (of which there are thousands) ends in a CD state  14:52
<clarkb> which is a closed disconnected error state from haproxy iirc  14:52
<clarkb> looking at our top 10 requestors to the load balancer, only those two charms requestors show up in gitea05's log during that time span. The abundance of CD state connections and the amount of time that they seem to be held open is somewhat suspicious  14:58
<clarkb> a full 99.84% of requests from that particular IP end up in that state  14:59
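(Numbers like these can be pulled straight from the haproxy log; a rough sketch only, since the awk field position depends on the configured log format:)

    # count connections per client IP that ended in the CD termination state;
    # in a default syslog-style haproxy TCP log line, $6 is client_ip:port
    grep ' CD ' /var/log/haproxy.log | awk '{print $6}' | cut -d: -f1 | \
        sort | uniq -c | sort -rn | head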
<clarkb> I wonder if this is a client issue?  15:00
<clarkb> in any case it does seem to have subsided  15:00
<clarkb> and I think the rolling reboots are worthwhile to clear out any issues. I'll start that now and will take hosts out of the rotation in haproxy before I reboot them  15:00
*** qchris has joined #opendev  15:01
<clarkb> there are also a couple of really chatty vexxhost IPs that we will want to cross check against our nodepool logs (they don't seem to have reverse dns)  15:03
<clarkb> but they don't seem to correlate to when problems start  15:03
*** mlavalle has joined #opendev  15:10
<clarkb> reboots are done  15:17
<clarkb> I'll start gerrit replication momentarily  15:17
<fungi> i was willing to take care of the reboots but wanted to know if we gracefully down them in the haproxy pools one at a time and how long we wait before rebooting to make sure requests aren't still in progress  15:18
<clarkb> fungi: oh sorry, yes I gracefully downed them then tailed /var/gitea/logs/access.log and waited for requests from the load balancer IP to trail off  15:19
<clarkb> there are internal requests from 127.0.0.1 that get made and some web crawler is also crawling them, which I ignored  15:19
<clarkb> I missed your messages earlier, I was so heads down on this (early morning blinders)  15:19
<clarkb> https://docs.opendev.org/opendev/system-config/latest/gitea.html#backend-maintenance has docs on the haproxy manipulation  15:20
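(The backend maintenance procedure in those docs amounts to driving the haproxy admin socket; a sketch, with the backend/server names and socket path as assumptions for illustration:)

    # take one gitea backend out of rotation before rebooting it
    echo "disable server balance_git_https/gitea05.opendev.org" | \
        sudo socat stdio /var/haproxy/run/stats
    # wait for load balancer traffic in the backend's access log to trail off,
    # reboot, then re-enable it
    echo "enable server balance_git_https/gitea05.opendev.org" | \
        sudo socat stdio /var/haproxy/run/stats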
<fungi> no worries, just didn't want you stuck shouldering it all  15:22
<clarkb> I've also pinged mnaser about the chatty vexxhost IPs in case they are doing something unexpected. I don't really think they were doing anything to trigger the problems though  15:25
<clarkb> it really does seem like the IP interested in charms that couldn't successfully finish a connection is related  15:25
<clarkb> I wonder if all those connections failed because it was doing something that caused gitea to fail (OOM?)  15:26
<clarkb> making that correlation is likely to be more difficult, though we could try making the requests it was making I suppose  15:26
<clarkb> in trying to correlate the requests that IP is making more accurately I'm discovering the 65k limit on port numbers means we recycle them often :/  15:29
<clarkb> ah ok I see more things. The data transferred values seem to be important here  15:31
<clarkb> sometimes we transfer nothing and the gitea backend never sees it  15:31
*** ysandeep is now known as ysandeep|away  15:35
<fungi> also further investigation of a suspicious client ip address has turned up what appears to be a socket proxy to our git hosting running on a vm in hetzner  15:44
<fungi> very bizarre  15:45
<clarkb> my current hunch is that that proxy undoes any load balancing from those sources since we balance on source IP. That then allowed it to bounce between backends as they failed under the load associated with those connections  15:45
<clarkb> it's possible other connections were responsible though, don't have a strong enough correlation to that yet  15:46
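(The hunch rests on the balancing algorithm: backends are chosen by hashing the client's source address, so a proxy funneling many users through one IP lands entirely on a single backend and gets rehomed wholesale whenever that backend is marked down. Roughly the shape of the relevant haproxy configuration, with server names and addresses as placeholders:)

    listen balance_git_https
        bind *:443
        mode tcp
        balance source            # backend chosen by hashing the client IP
        option tcp-check
        server gitea05.opendev.org 198.51.100.5:3081 check
        server gitea06.opendev.org 198.51.100.6:3081 check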
*** priteau has joined #opendev  15:46
<openstackgerrit> Clark Boylan proposed openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change  https://review.opendev.org/746168  15:52
<clarkb> fungi: ^ that's one of two gerritbot job tie-ins we need (but should wait for gerrit replication to finish before merging that)  15:53
<clarkb> the other is to run when gerritbot's docker images update  15:53
<fungi> ahh, thanks, i had already forgotten about that. it was fairly late last night  15:53
<fungi> though in good news, my patch seems to have solved the regression we saw  15:53
<clarkb> excellent  15:53
<fungi> frickler: ^ it was your excellent eye which spotted the cause, so thanks!  15:54
<clarkb> actually hrm. For the image update causing things to update we'd need to add gerritbot's project ssh key to bridge  15:55
<fungi> it bears remembering that if you're iterating over a dict's keys() iterable, and then you change the dict in that loop, you change what you're iterating over  15:55
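(The gotcha fungi describes, reproduced in a quick python3 session:)

    $ python3
    >>> d = {'a': 1}
    >>> for key in d.keys():
    ...     d['b'] = 2   # mutating the dict while iterating over its keys view
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: dictionary changed size during iteration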
<clarkb> I'm thinking that maybe it is better to simply run that hourly or daily instead?  15:55
<fungi> clarkb: yeah, i was getting sleepy but was trying to figure out how this was any different from other stuff we're deploying where the image builds don't happen in the context of system-config changes  15:56
<fungi> i think last time this came up we concluded that we'd need to rely on periodic jobs for now  15:56
<clarkb> wfm, I'll get that patch up shortly now  15:56
<openstackgerrit> Clark Boylan proposed opendev/system-config master: Run service-eavesdrop hourly  https://review.opendev.org/746181  15:59
*** diablo_rojo has joined #opendev  16:20
*** marios|ruck is now known as marios|out  16:26
*** tosky has quit IRC  16:43
*** marios|out has quit IRC  16:44
*** JayF has quit IRC  17:15
*** andrewbonney has quit IRC  17:31
*** priteau has quit IRC  17:36
*** priteau has joined #opendev  17:44
*** priteau has quit IRC  17:53
<AJaeger> clarkb, just saw your comment on 746168 and removed my +A, please self-approve once ready  18:30
<clarkb> AJaeger: replication is done, I'll reapprove. Thanks  18:33
<AJaeger> clarkb: great  18:34
<clarkb> mnaser: osc reports 'Certificate did not match expected hostname: compute.public.mtl1.vexxhost.net. Certificate: {'subject': ((('commonName', '*.vexxhost.net'),),), 'subjectAltName': [('DNS', '*.vexxhost.net'), ('DNS', 'vexxhost.net')]}' trying to show instance details  18:34
<clarkb> fungi: ^ do glob certs only do a single level of dns?  18:34
<openstackgerrit> Merged openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change  https://review.opendev.org/746168  18:36
<fungi> good question, i thought they covered anything within that zone  18:37
<clarkb> and now we should be able to land smcginnis' change to update the gerritbot channel config and be good to go  18:37
<fungi> but if you delegate subdomains to other zones they won't  18:37
<fungi> wildcard records aren't returned as dns responses, they're a shorthand instruction to the authoritative nameserver to match any request, but they're zone-specific  18:38
<fungi> oh! though this isn't wildcard dns records, this is wildcard subject (alt)names  18:39
<clarkb> ya it's ssl cert verification  18:39
<smcginnis> \o/  18:39
<fungi> clarkb: confirmed, apparently you can't wildcard multiple levels of subdomains with a single subject (alt)name  18:43
<clarkb> mnaser: ^ I think that may be something you'll want to fix  18:45
<fungi> "Names may contain the wildcard character * which is considered to match any single domain name component or component fragment. E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com." https://www.ietf.org/rfc/rfc2818.txt §3.1¶4  18:45
<fungi> also finding a number of kb articles from certificate authorities and questions at places like serverfault agreeing this is the case  18:46
<fungi> apparently some browsers did at one point treat the wildcard as matching any subsequent levels, but most (all?) have ceased doing so as it was a blatant standards violation  18:47
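(For anyone debugging the same thing, the names a served certificate actually covers can be inspected with standard openssl client tooling:)

    echo | openssl s_client -connect compute.public.mtl1.vexxhost.net:443 \
        -servername compute.public.mtl1.vexxhost.net 2>/dev/null | \
        openssl x509 -noout -text | grep -A1 'Subject Alternative Name'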
<clarkb> I've approved smcginnis' gerritbot config change  18:47
<clarkb> we should see gerritbot reconnect when that gets applied  18:48
<fungi> this will be a good test  18:48
*** JayF has joined #opendev  18:55
<openstackgerrit> Merged openstack/project-config master: Gerritbot: only comment on stable:follows-policy repos  https://review.opendev.org/744947  18:59
*** openstackgerrit has quit IRC  19:02
<mnaser> clarkb: it should be ok again now  19:05
*** hashar has joined #opendev  19:10
<diablo_rojo> In thinking about the ptg, it's probably good to 'de-openstack' the irc channel. Any qualms with my making a new one just called '#ptg'?  19:53
<clarkb> diablo_rojo: we get some management simplification by namespacing on freenode  19:55
<clarkb> basically freenode knows who to go to for all #openstack- prefixed channels  19:55
<clarkb> not a reason to avoid #ptg but something to keep in mind  19:55
<diablo_rojo> Makes sense. We have the #openstack-ptg channel already obviously, but I figured it might be more inclusive to other projects to make one without the prefix  19:57
*** tosky has joined #opendev  20:10
<mnaser> clarkb: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 -- is it possible to maybe rekick nodepool as it may be using a cached service catalog?  20:19
<fungi> diablo_rojo: if osf is going to have a bunch of those sorts of channels (this already came up wrt the #openstack-diversity channel for example) maybe we want an #osf prefix or something  20:21
<mnaser> cc infra-root ^  20:28
<fungi> looking  20:29
<corvus> fungi: i'm around if you need help  20:32
<fungi> mnaser: it's the ssl cert problem clarkb noted earlier  20:33
<mnaser> fungi, corvus: the endpoint has changed and i think nodepool has the value cached  20:33
<fungi> the cert you're serving is not valid for compute-ca-ymq-1.vexxhost.net  20:33
<fungi> oh, i get it  20:33
<mnaser> right, but the url in the service catalog is compute.public.mtl1.vexxhost.net  20:33
<mnaser> :)  20:34
<fungi> yeah, the launcher will need a restart for that  20:34
<fungi> just a sec  20:34
<fungi> #status manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1)  20:36
<openstackstatus> fungi: unknown command  20:36
<fungi> d'oh!  20:36
<fungi> #status log manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1)  20:36
<openstackstatus> fungi: finished logging  20:36
<fungi> there we go  20:36
<fungi> mnaser: thanks, that seems to be spewing a lot fewer errors in its logs now  20:37
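("Rekicking" the launcher here is just a container restart on the launcher host; a sketch only, since the compose directory and service name are assumptions:)

    # on nl03.opendev.org
    cd /etc/nodepool-docker
    sudo docker-compose restart nodepool-launcher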
*** openstackgerrit has joined #opendev  21:05
<openstackgerrit> Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support  https://review.opendev.org/746000  21:05
<ianw> fungi/clarkb: thanks for looking at gitea; do we think the reboot has done it?  21:55
<clarkb> ianw: I think it sorted itself out; the reboots were mostly to ensure there wasn't any bad fallout from the OOMs  21:57
<clarkb> ianw: it looked like that proxy may have contributed to the problem, possibly because it had a bunch of things behind it all hitting a single backend due to the proxy having a single IP  21:57
<clarkb> then when one server was sad the haproxy lb took it out of the rotation, pointing that proxy at a new backend, and rinse and repeat  21:58
<ianw> ahh, yes that sounds likely  21:58
<ianw> infra-root: if i could get some eyes on creating a pyca/infra project @ https://review.opendev.org/#/c/746014/ that would help me continue fiddling with manylinux wheel generation  22:02
<ianw> my hopes that we'd sort of drop in manylinux support are probably dashed ... for example cryptography does a custom builder image on top of the upstream builder images that pulls in openssl and builds it fresh  22:03
<ianw> which is fine, but not generic  22:04
<ianw> one thing is though, if i build custom manylinux2014_aarch64 images speculatively using buildx, i unfortunately can't run them on arm64 speculatively  22:06
<ianw> because we can't mix architectures  22:06
<clarkb> fwiw with buildx things that do IO seem fine but not cpu (like compiling)  22:07
<clarkb> expect compiling openssl to take significant time. Though we can certainly test it to find out how much  22:07
<clarkb> ianw: I think you can do speculative testing without buildx though  22:07
<clarkb> then run both the image build and the use of the image in the linaro cloud  22:08
<ianw> hrm, yes i wasn't sure of the state of native image builds  22:09
<ianw> native container builds i should probably say  22:09
<clarkb> I think they work fine, though the manifest info might assume x86 by default?  22:09
<ianw> maybe ... https://review.opendev.org/#/c/746011/ is sort of the framework, but i don't want to put it in pyca/project-config because that's a trusted repo  22:12
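(The buildx path being discussed is the qemu-emulated cross-build; the flags below are standard buildx usage, with the registry and tag as placeholders:)

    # build an arm64 image on an x86 node via emulation and push it to a registry
    docker buildx build --platform linux/arm64 \
        -t registry.example.org/pyca/manylinux2014_aarch64:test --push .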
<diablo_rojo> fungi, yeah was thinking about an osf prefix too. If that's easier to manage, I am totally cool with that.  22:15
<diablo_rojo> (waaaaaaay late in my response, got sucked into other things)  22:15
<ianw> when you see how the sausage is made with all this ... it does make you wonder a little bit if you still like sausages  22:15
<corvus> yeah, i think we avoided native builds in the general case because we don't want the zuul/nodepool gate to stop if we lose the linaro cloud; that's probably less worrisome for an arm-only situation  22:17
<ianw> i can try it and see what happens :)  22:20
<openstackgerrit> Merged openstack/project-config master: Create pyca/infra  https://review.opendev.org/746014  22:29
<corvus> ianw: ^ deploy playbook is done  22:51
<ianw> corvus: thanks, already in testing :) https://zuul.opendev.org/t/pyca/status  22:52
<ianw> from what i can tell of upstream, ISTM that the wheels get generated and published as an artifact by github actions  22:53
<ianw> i can not see that they are uploaded via that mechanism though, although i may have missed it  22:53
<ianw> (uploaded to pypi)  22:54
<ianw> https://github.com/pyca/cryptography/actions/runs/176310608/workflow if interested  22:56
*** tkajinam has joined #opendev  22:58
*** gema has quit IRC  23:05
*** mlavalle has quit IRC  23:08
*** tosky has quit IRC  23:09
*** sgw1 has quit IRC  23:15
<ianw> heh, the manylinux container build decided to use http://mirror.facebook.net/ ... who knew  23:18
<fungi> ew  23:22
<ianw> building openssl ... https://zuul.opendev.org/t/pyca/stream/d35730a2d4fa4121985b01692cc45c9d?logfile=console.log  23:34
<ianw> slowly  23:34
<ianw> corvus: so the theory is if i run this on an arm64 node, it might "just work"?  i guess the intermediate registry also needs to run there?  23:36
*** DSpider has quit IRC  23:38
<ianw> ... ok, to answer my own question -- the intermediate registry seems happy to run in ovh  23:52
<ianw> however, ensure-docker is failing on arm64  23:53
<ianw> no package "docker-ce"  23:53
*** hashar has quit IRC  23:53
*** hashar has joined #opendev  23:55
*** diablo_rojo has quit IRC  23:59
