Wednesday, 2020-08-19

*** DSpider has quit IRC00:12
*** markmcclain has quit IRC00:13
*** markmcclain has joined #opendev00:15
*** markmcclain has quit IRC00:24
ianwi've dug a bit on the manylinux arm page size issue and collated a few things into
*** ysandeep|away is now known as ysandeep01:46
openstackgerritIan Wienand proposed openstack/project-config master: Move github reporting to checks API
*** Dmitrii-Sh9 has joined #opendev02:04
*** cloudnull8 has joined #opendev02:04
*** cloudnull has quit IRC02:04
*** cloudnull8 is now known as cloudnull02:04
*** smcginnis has quit IRC02:05
*** weshay|ruck has quit IRC02:05
*** weshay has joined #opendev02:06
*** Dmitrii-Sh has quit IRC02:06
*** prometheanfire has quit IRC02:06
*** Dmitrii-Sh9 is now known as Dmitrii-Sh02:06
*** smcginnis has joined #opendev02:06
*** prometheanfire has joined #opendev02:07
*** shtepanie has quit IRC03:16
prometheanfireianw: ta03:31
ianwlgtm, i hope to release after that and we can have some new things03:35
prometheanfireI have been using it on a packet host, hope to build an ironic image soon there03:43
prometheanfirealso, osuosl03:43
*** raukadah is now known as chkumar|rover04:25
*** cmurphy is now known as cmurphy_afk04:32
openstackgerritMerged openstack/diskimage-builder master: update gentoo to allow building arm64 images
*** cloudnull is now known as kecarter05:19
*** kecarter is now known as cloudnull05:19
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] json edit
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] json edit
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] json edit
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] json edit
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] json edit
*** DSpider has joined #opendev06:17
*** lpetrut has joined #opendev06:51
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
openstackgerritMerged opendev/elastic-recheck master: Bumped flake8
*** dtantsur|afk is now known as dtantsur08:16
openstackgerritMerged zuul/zuul-jobs master: terraform: Add parameter for plan file
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Resolve unsafe yaml.load use
*** tosky has joined #opendev08:28
*** hipr_c has quit IRC08:31
openstackgerritMerged opendev/elastic-recheck master: Replace pep8 jobs with linters
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Resolve unsafe yaml.load use
*** ryo_hayakawa has joined #opendev08:58
*** ryohayakawa has quit IRC09:01
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: WIP: Create elastic-recheck container image
zbrsomething is unclear to me, why and builder seem not to have python installed on them by default.09:19
zbrevery time I build, it seems to install python3..., twice.09:21
*** ysandeep is now known as ysandeep|lunch09:35
*** ysandeep|lunch is now known as ysandeep09:55
*** ysandeep is now known as ysandeep|brb10:27
*** Ankita2910-contr has joined #opendev10:49
*** ysandeep|brb is now known as ysandeep10:49
*** Ankita2910-contr has quit IRC10:50
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Enable configuration via environment variables
toskyI see a relevant number of failures on rax nodes because they can't find glance-store 2.2.011:13
*** ryo_hayakawa has quit IRC11:28
*** ryohayakawa has joined #opendev11:29
*** ryohayakawa has quit IRC11:44
*** hashar has joined #opendev12:01
openstackgerritMerged opendev/elastic-recheck master: Resolve unsafe yaml.load use
*** Gyuseok_Jung has quit IRC12:06
*** sshnaidm is now known as sshnaidm|afk12:26
toskynow gra1 mirror doesn't find neutron-lib 2.5.012:37
*** weshay is now known as weshay|interview12:47
fricklertosky: we only proxy from pypi, so likely some issue in their cdn12:59
toskyfrickler: I thought so13:01
toskyis there a way to tell them about it? I've been hit by the glance-store issues for the last 2 days, and I'm probably not the only one13:02
toskyI guess it may solve by itself, but...13:02
*** weshay|interview is now known as weshay13:10
openstackgerritEmilien Macchi proposed openstack/project-config master: Disable E208 for now
openstackgerritEmilien Macchi proposed openstack/project-config master: Re-introduce puppet-tripleo-core group
*** weshay is now known as weshay|ruck13:21
*** priteau has joined #opendev13:31
fricklertosky: do you have some link to your failures? seems unlikely that it should always hit the same pkg, so maybe there's some other cause after all13:32
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Enable configuration via environment variables
toskyI've seen also a few of them related to other packages (oslo-messaging 12.1.2, and neutron-lib 2.5.0)13:41
toskyI may dig into them13:41
toskyprobably not worth it if tomorrow everything works13:41
toskysee also the report on #openstack-infra just now13:42
*** dulek has joined #opendev13:44
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Filter bugs shown in graphs based on regex
Open10K8SHi infra team13:52
openstackgerritEmilien Macchi proposed openstack/project-config master: Disable E106 & E208 for now
Open10K8SHi team.  The zuul checking is failing now. WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=60.0)",)': /pypi/simple/cffi/13:53
openstackgerritEmilien Macchi proposed openstack/project-config master: Re-introduce puppet-tripleo-core group
Open10K8Sseems is unreachable or the connection is unstable13:53
fricklerinfra-root: the only thing I can note about these pypi issues is that the not-found wheels don't appear in our self-built wheel list, neither in nor e.g.13:53
fricklerOpen10K8S: yes, our 2001:db8 issue has reappeared, on it13:54
Open10K8SWhat is the reason and what can I do? :)13:54
Open10K8Sinfra-root: frickler: recently, it occurred frequently13:55
fricklermnaser: maybe you can track that mac?
fricklerOpen10K8S: mnaser seemed to assume some neutron bug, not sure if you have internal access to vexxhost, too?13:57
mnaseri think we'll have to chase this down :\13:57
fricklerOpen10K8S: I removed the stray IPs from that mirror host, so it should be working again now until the next incident13:57
mnaserthat mac address is useful, it will help us start narrowing it down13:58
Open10K8Sfrickler: so you mean you removed extra ipv6 address?13:59
Open10K8Sok, let me recheck PSs13:59
fricklerOpen10K8S: yes14:02
fricklermnaser: it started a bit earlier with the other prefix . I'd not be surprised if that was one of our neutron jobs14:03
* frickler needs to afk, bbl14:03
*** hashar has quit IRC14:06
*** Ankita2910-contr has joined #opendev14:24
*** ysandeep is now known as ysandeep|away14:29
Open10K8Sfrickler: ok, good now14:32
Open10K8Sthank you14:32
*** tkajinam has quit IRC14:33
*** hashar has joined #opendev14:33
*** Ankita2910-contr has quit IRC14:33
*** sshnaidm|afk is now known as sshnaidm14:42
*** mlavalle has joined #opendev14:44
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: WIP: Enable configuration via environment variables
*** qchris has quit IRC14:56
clarkbzbr: can you expand on what you are seeing with the python images? we base on the python images which have python3 preinstalled, but they aren't installed from the distro so it could be something is pulling it in15:03
zbrclarkb: ahh, that explains why it gets installed. Now my question is why not from distro?15:04
*** larainema has joined #opendev15:04
clarkbzbr: because we are interested in using the official python images and that isn't how they distribute python15:04
zbrand this means that i cannot benefit from precompiled packages, unless they publish wheels for them.15:05
clarkbyes, those images are intent on python source installs not for use with distro packaging15:06
zbrseems a bit weird but i guess i can live with it.15:06
clarkbI think there is an apt override file that is supposed to make this better15:06
clarkbbasically it tells apt that python is already installed /me is looking for it15:06
zbrfor development it is a very inconvenient way to produce containers (slow)15:06
clarkbzbr: well the whole system is built that way. That is why we have a builder and a base image15:07
clarkbif you don't want that you can just FROM debian and go from there15:07
clarkbthe idea is the builder image is used to create wheels for everything in a throwaway image, then on the base side we install all the wheels which keeps the images smaller15:07
clarkbbut the whole thing assumes a from source installation15:07
zbri will try to follow the same practices and diverge only if really needed15:07
clarkbzbr: looks like the apt override is for python3-dev only15:08
zbrclarkb: mtreinish:  i need your help with a very weird unittest failure unique to py27 at
clarkbI don't see a problem with also adding a python3 if that helps15:08
zbrit is a bit mind-blowing because I see an error related to an exception that was supposed to be caught15:09
zbri wonder if there is some magic stestr issue....15:09
clarkbzbr: that is the override for python3-dev15:09
*** qchris has joined #opendev15:09
zbrok i will look at this later, when i am back at container building. now i have some preparatory steps to finish.15:10
clarkbzbr: it's a namespace problem. The error is ConfigParser.NoOptionError but you import NoOptionError from configparser15:11
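The namespace mismatch described above can be sketched as follows. This is a minimal Python 3 illustration, not the elastic-recheck code: the helper `get_option`, the `gerrit` section and the `user` key are hypothetical. The idea is that the `except` clause should name the exception classes from the same module that built the parser. On Python 2.7, as the discussion suggests, the stdlib `ConfigParser` module and a separately installed `configparser` module each carry their own `NoOptionError`, so catching one does not catch the other.

```python
import configparser


def get_option(parser, section, option, default=None):
    """Return an option value, or `default` when it is missing.

    The exception classes come from the same module that provides the
    parser, so the except clause matches what parser.get() raises.
    """
    try:
        return parser.get(section, option)
    except (configparser.NoSectionError, configparser.NoOptionError):
        return default


# Hypothetical configuration used only for this illustration.
parser = configparser.ConfigParser()
parser.read_string("[gerrit]\nuser = example\n")

print(get_option(parser, "gerrit", "user"))
print(get_option(parser, "gerrit", "key", default="unset"))
```

Keeping the parser and its exceptions bound to one module name avoids the situation where a handler silently fails to match.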
clarkbfrickler: tosky: ya we seem to be seeing that more often. Basically pypi isn't serving the requires python info on the indexes15:12
zbrwhat i used in the past with other python projects was alpine plus all system packages available and wheels for others.15:12
zbrhaha, lol. thanks15:12
*** chkumar|rover is now known as raukadah15:14
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Enable configuration via environment variables
zbrclarkb: when do you think we could remove py27 support because the sooner the better. Can we remove it before we have the container? AFAIK e-r is not redeployed with CD so we could live for a while without supporting py27 on master anymore?15:16
zbrasking because there is a high amount of py27 junk i would like to scrap...15:17
clarkbthis is why I originally suggested we just change the pip install in the puppet to pip3 ...15:18
clarkbI think its literally a one line change to get on python3.5 then you can drop most python2 stuff15:18
clarkbI think a better end goal is the container so didn't argue against it, but the original goal was just to get off python2 and that is much simpler15:19
zbrclarkb: i remember trying this but i went into a chain of issues related to puppet. ideally we should switch to py3615:21
clarkbzbr: yes a simple linter failure15:22
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Enable configuration via environment variables
clarkbI've rechecked the change so we can get logs again15:22
zbri see. if we fix this it will ease the other steps.15:23
openstackgerritMerged openstack/project-config master: Disable E106 & E208 for now
fungiOpen10K8S: mnaser: frickler: catching up on scrollback now, but i see there are still stray v6 global prefixes on the ca-ymq-1 mirror. should i go ahead and delete them?15:28
mnaserfungi: i think we're hitting it a lot more often sadly15:28
clarkbmwhahaha: weshay|ruck: fwiw I still see single IPs in roughly the same timeframe (so single job node) requesting the same docker shas multiple times15:29
clarkbits possible caching the layers and reusing them without another network round trip would also help with request limits15:29
mnaserfungi: but yes, please do15:31
fungimnaser: looks like we have nexthop via fe80::ce2d:e0ff:fe0f:74af and fe80::ce2d:e0ff:fe5a:d84e currently15:31
mnaserfungi: those are the correct next hops though15:32
mnaserfungi: the only addrs that should be there are 2604:e100:...15:32
fungiokay, so we just need to delete the 2001:db8:0:3::/64 and 2001:db8:1::/64 aliases, got it15:32
fungimnaser: oh! i think i see the issue. we've deleted the interface aliases but not the associated host routes15:33
mnaseroh heck, so there's.. 4 default routes?15:34
funginope, not default routes, interface local routes15:34
fungiso maybe those aren't particularly disruptive as long as the mirror never needs to talk to some address within one of those cidrs15:35
fungiwhich it shouldn't, those are from a reserved aggregate anyway15:35
fungifrickler: on the "missing wheels" those are pure python projects with existing wheels on pypi, so we don't copy them into our pre-built wheels cache15:38
fungiwe expect them to be fetched from pypi instead15:38
fungithe current wheel mirror job basically installs all of openstack/requirements upper-constraints.txt into a virtualenv for the relevant python version and platform, then scrapes the pip log to find out which wheels were not downloaded from pypi and only copies those into our cache15:40
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck
fungiinfra-root: (and anyone else) what do you think about publishing the apache access logs on our mirror servers? would that be reasonably safe?
*** shtepanie has joined #opendev15:49
clarkbfungi: the request was for stats too. Maybe we start with some processed logs to avoid exposing too much? Its also possible that our existing goaccess stuff would be helpful here15:49
fungioh, maybe. can it do mod_proxy cache hit/miss tracking?15:50
clarkbI dont know15:50
clarkbits mod_cache in this case not mod_proxy, but it has a bunch of features and that may be one of them15:50
fungioh, right15:50
clarkbthere is a cache status panel15:52
clarkband you need to have %C in your log format to enable it (I'm not sure if the default apache log format has %C)15:52
clarkbI don't think %C is in the default combined log format16:01
clarkbwe would have to define our own log format but that should work once done16:01
clarkband if that doesn't work we can probably do similar with some simple scripting to scrape the logs for this specific purpose16:02
zbrclarkb: fixing the puppet linter broke the testing job... what a joy16:17
*** hashar has quit IRC16:18
clarkbzbr: I think its unrelated16:19
zbrclarkb: it did pass before removing the ::16:19
clarkbzbr: in may16:20
clarkb is the error and it's well before it tries to parse any puppet I think16:20
clarkbits trying to load ssh keys?16:21
zbrno idea. when you have time, please also take a look at -- i already know it is not perfect but i think it is a step in the right direction.16:26
zbronce we drop py27, i will be able to simplify it even more16:26
clarkbI wonder if the issue is related to ssh key format that whole -m pem thing16:29
clarkb shows there is already a key present16:30
clarkband if that key was generated by zuul it may have the newer format?16:30
clarkbanyone know how /home/zuul/.ssh/id_rsa is being generated in jobs?16:30
clarkb is where we try to generate a key on the server itself16:32
clarkb I think that may be how we get the key in there16:33
clarkbya is gonna be with new format key file since we run executors in containers now16:34
clarkbzbr: ^ I think that is what changed since may16:34
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Generate build ssh keys in PEM format
clarkbzbr: ^ I think that will fix it16:38
clarkbfungi: looking at today's proxy_8082 access log on mirror.iad.rax we have about a 79% cache hit rate16:38
clarkbnot great, but not terrible either16:39
fungiany idea if the authenticated requests which are passed through as uncacheable count as cache misses?16:46
clarkbfungi: no I'm only counting cacheable things (eg an explicit cache miss or hit is logged)16:48
fungiahh, okay16:49
*** hashar has joined #opendev16:50
*** lpetrut has quit IRC16:50
clarkbit seems there are in the basic case 5 requests per blob (I think that number goes down if there are more layers associated with a manifest)16:52
clarkbwe can cache 1 of them the actual data not metadata transfer16:52
clarkbthe other 4 are metadata and not cacheable16:52
clarkbso if we request the same thing 3 times we go from 5 to 15 requests, 12 of which are to docker hub16:52
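The request arithmetic above can be written out as a short sketch. The constants come from the discussion (5 requests per blob in the basic case, 1 of them cacheable); the function name and the `blob_already_cached` flag are hypothetical. The figure of 12 upstream requests counts only the uncacheable metadata requests, which corresponds to the blob data already being in the proxy cache; with a cold cache the first data fetch would also go upstream.

```python
REQUESTS_PER_BLOB = 5    # basic case described in the log
CACHEABLE_PER_BLOB = 1   # only the blob data transfer can be cached


def request_counts(pulls, blob_already_cached=True):
    """Return (total requests, requests forwarded to the upstream registry)."""
    total = REQUESTS_PER_BLOB * pulls
    # Metadata requests are never cached, so every one goes upstream.
    metadata = (REQUESTS_PER_BLOB - CACHEABLE_PER_BLOB) * pulls
    # The data request goes upstream once at most, and only on a cold cache.
    data_misses = 0 if blob_already_cached else CACHEABLE_PER_BLOB
    return total, metadata + data_misses


print(request_counts(3))                             # (15, 12)
print(request_counts(3, blob_already_cached=False))  # (15, 13)
```

Under these assumptions the proxy cache removes at most one request in five per pull, which is why repeated pulls of the same layer still count against upstream request limits.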
fungiwas this specific to the dockerhub v2 cache, or across all our caches for that server?16:52
clarkbjust port 8082 which is docker hub v216:56
clarkblooking at this one ip during the time range I was using to find duplicate requests it seems there are ~69 layers 30 of which are requested multiple times16:58
*** dtantsur is now known as dtantsur|afk17:01
johnsomHi all, I just wanted to mention we have another case where the python package cache seems to have gone wrong:
johnsomCould not find a version that satisfies the requirement glance-store===2.2.017:11
johnsomThough it was posted on pypi on the 13th (according to pypi).17:11
johnsomJust wanted to mention it. We are rechecking, so should be fine, just another oddity case.17:11
clarkbjohnsom: ya others have been reporting similar. We may also have pypi serving indexes without requires python metadata so sometimes python2 or python3.5 install a too new package17:12
fungi#status log ethercalc service restarted following exportCSV crash at 16:42:59 utc17:18
openstackstatusfungi: finished logging17:18
johnsomLooks like the same with neutron-lib 2.5.0, which pypi says was out Aug 6th. hmmm17:18
fungiinfra-root: the backtrace on that looks basically the same as the last one17:18
johnsomHere is the link to the new neutron-lib failure:
johnsomDifferent "mirror" instance than the last one.17:19
clarkbjohnsom: ya its all proxies to pypi17:20
clarkbjohnsom: it seems that pypi's CDN has problems17:20
clarkbfungi: I can't find a way to do a csv export17:21
fungii wish pip freeze recorded the installed version of pip like pip list does17:21
clarkbat least not through the ui. I wonder if that is an api call?17:21
fungilooks like that job is using the distro-packaged pip17:22
clarkbI think devstack installs with get-pip.py17:23
fungii don't think pip 9.0 knows how to check that metadata on pypi17:23
johnsomI'm not sure which cache package is being used, but does it maybe have a negative-cache setting that is too long?17:23
clarkbjohnsom: indexes are cached for 10 minutes17:24
clarkbfungi: yes that job log shows it installing pip 20.2.2 with get-pip17:24
fungiahh, yeah later it's using after purging the python3-pip distro package17:24
clarkbjohnsom: you can check all that with your own browser too fwiw17:26
clarkbthough in this case it didn't give you an index url link which is annoying (depending on how it fails you'll get that)17:26
fungiand i don't see the job reinstalling the distro package later, so i guess it is failing that with 20.2.217:27
johnsomIt's not these:, ?17:27
clarkbjohnsom: its
clarkbthe wheel mirror won't have neutron-lib since we publish wheels to pypi17:28
clarkb2.5.0 is there now17:28
johnsomHmm, clicking those links don't get me the file, but this stuff is in my "just enough time between needing to know" purged window. grin17:29
johnsomAh, need to drop the sha17:30
clarkbjohnsom: you don't get a download clicking on those links? (I do)17:30
clarkbbut also the issue is with the index not having the requested version not the version itself 404'ing17:30
clarkb(you get a different error for that)17:30
johnsomHmm, so I wonder if the first call will "warm" the cache. Maybe we need a pip wrapper that retries a few times. I don't see a native option on pip to do that.17:31
clarkbI mean ideally pypi would fix their broken indexes :/17:31
clarkbI do think we get a CDN backend identifier in the headers17:32
johnsomIt's just a bummer for us, the first failure led to recheck that led to another different failure. Wasting zuul resources, etc.17:32
clarkbmaybe we can request that pip include that info on failed index lookups and then we can take that back to them17:33
fungiaha, if you view-source: you can see the requires_python metadata in some of the newer urls17:33
fungier, newer package versions17:33
clarkbfungi: yup, I double checked that job is using python3.6 where it fails so shouldn't be a python version mismatch17:34
fungiright, the most we specify is >=3.617:34
clarkb(because that's the other similar way we see the problem. you have the wrong version of python and pip says the package doesn't exist)17:34
fungior it downloads too new of a package for your version of python and then setuptools throws a fit on install17:35
clarkbfungi: maybe we should collect the x-served-by values from mirrors that have hit this and give that to pypa?17:41
clarkbI don't know that there is much more debugging we can do other than to tell them something is up and likely with these cdn caches?17:41
fungimaybe we could somehow collect the actual page content for those indices?17:42
funginot sure how to go about doing that though17:42
fungiunless pip caches them maybe17:42
clarkbsometimes if you refresh you get lucky and get the bad content17:42
clarkbbut that was months ago when this last happened17:43
johnsomThis error seems to also happen when the internal resolver fails. I wonder if it would be worth adding a -v or two to our pip calls to get more information.  Based on this open issue report:
clarkbjohnsom: but it sees the other packages17:43
johnsomThat said, I just got a varnish 503 error by digging in their docs, so...17:44
fungiahh, yeah, so suggesting that fastly has a mix of good and problem endpoints in the same pool and is distributing requests across them all17:44
clarkb(it lists out all the valid ones it sees in the error)17:44
clarkbfungi: yes17:44
fungiso in theory if i smash a bunch from i should sometimes get output which is missing neutron_lib-2.5.0-py3-none-any.whl17:48
clarkbya, but it has a 10 minute timeout so it may take a while17:49
fungiwell, i'm not going through our cache to do that17:49
clarkbanother option is to wget/curl from that server to pypi?17:49
clarkboh ya that17:49
fungii'm running `wget -SvO-|grep 'neutron_lib-2.5.0-py3-none-any.whl'` from a shell on
fungi-S gives us the X-Served-By: header17:50
fungi(well, all headers, but including that one)17:50
mnaserout of curiosity, how has opendev 'forced' image rebuilds when changing the source images inside ci?17:50
fungimnaser: usually with an empty change17:50
mnaserwe use a similar model, we just made an update in the base image and we want to force a rebuild of an image (we use files: ..)17:50
johnsomOye, another one:
mnaserright, so whats the empty change y'all go for, just to try and mimic :P17:51
fungimnaser: ugly, but we haven't needed to do it often17:51
clarkbmnaser: a comment change in dockerfile iirc17:51
mnaserlike new line in dockerfile?  touching the job config in zuul?17:51
clarkbas we filter on the files changed17:51
mnaserah, that's reasonable17:51
fungialso allows us to date the trigger for the new build17:52
fungisince there's basically a record in the git history17:52
fungijohnsom: so "ERROR: No matching distribution found for libvirt-python===6.6.0" for that one, and via mirror.dfw.rax.opendev.org17:54
*** hashar is now known as hasharWineMusic17:55
fungihere's an interesting twist though... are these failures all centos-8?17:57
johnsomJust that last one17:57
fungilast two17:57
fungiahh, no, i was looking at the same log twice17:57
johnsomYeah, the first two were bionic17:58
fungiso unrelated, but that most recent one is also looking for the wrong wheel cache url17:58
fungii expect all recent centos-8 builds may be doing that17:59
fungithat does not exist17:59
fungishould be
funginot sure if we need to tweak the pip.conf on those nodes, or add redirects on the mirrors17:59
johnsomThere is a proposal on the table to switch our main testing to centos as there is no help from Canonical on the project and focal is having networking issues inside our service vms. Most of the main contributors are running centos. But this is an aside and an up coming PTG topic.18:01
clarkbfungi: we use an ansible fact for the distro name I think18:02
fungilooks like libvirt-python-6.6.0.tar.gz was uploaded 17 days ago (2020-08-02) so not including it in the simple api means either a *very* stale response, or truncated/empty response18:02
clarkbdid it list valid versions?18:03
clarkbif so that implies it wasn't an empty response18:03
clarkbsorry I'm just about to sit on the bike and head out. back in a bit18:03
fungigo get your bike on18:04
fungii'll keep digging18:04
fungithe log does indeed list available versions, but only up to 6.4.0, so missing 6.5.0 and 6.6.0 which are included in the responses i'm seeing18:05
fungiso yeah, maybe a truncated response?18:06
fungior else a very, *very* stale cache18:06
johnsomHmm, ok, something else is going on. I did another recheck on that earlier glance-store 2.2.0 issue, got the same failure again from the same "mirror":
johnsomSo, these two: (latest) and (earlier today)18:08
fungii'll switch back to trying to repro that one instead of the dfw one18:09
johnsomIf I click into it is in fact not there18:09
*** lpetrut has joined #opendev18:10
fungiit wouldn't be18:10
fungiwe don't copy wheels there if they exist on pypi18:10
fungiwe only add wheels which aren't on pypi, so generally not anything pure-python and not anything which already publishes appropriate manylinux1 wheels18:11
fungithere are going to be some older glance-store wheels in there from before we started filtering what we copy into it18:11
fungibut basically if they exist on pypi we want the pypi cache to be used18:11
fungiso far i can't reproduce, but they do seem to have a bunch of caches there in the montreal area18:14
fungithough right now all the responses i get there are still chained back to the same cache in baltimore18:15
fungi(many second-level caches in montreal serving content from the same cache backend in baltimore)18:16
*** lpetrut has quit IRC18:17
fungiworth noting, the last time this cropped up in the montreal area, we saw similar errors from builds in both ca-ymq-1.vexxhost and as they seemed to be hitting the same local fastly endpoints18:22
fungibut so far i'm not able to trip it from a shell on the mirrors there18:23
fungii've set up a loop to hit it every 10 seconds and log the cache backends to a file18:33
fungisave the wear on my up arrow18:33
fungialso this way i can continue trying to trap a failure while also chopping onions for dinner18:34
fungiaha, got one!18:35
fungicache-bwi5145-BWI, cache-yyz4537-YYZ18:35
fungiso far that's the only hit against cache-yyz4537-YYZ but cache-bwi5145-BWI shows for all the successful copies, so it's likely a problem at the second level18:36
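The polling loop described above (fetch the index every 10 seconds, record the cache backends, check whether the expected file is listed) can be sketched like this. It is an illustration only, not the loop that was actually run, which used wget and grep from a shell. The function names are hypothetical, and the index URL is left as a parameter because it is not given in the log. The fetcher and the sleep function are injectable so the logic can be exercised without network access.

```python
import time
import urllib.request


def fetch(url):
    """Return (X-Served-By header value, page body) for one request."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        served_by = resp.headers.get("X-Served-By", "")
        body = resp.read().decode("utf-8", "replace")
    return served_by, body


def probe(url, needle, fetcher=fetch):
    """Fetch once and report which caches served it and whether needle is listed."""
    served_by, body = fetcher(url)
    return served_by, needle in body


def poll(url, needle, interval=10, rounds=3, fetcher=fetch, sleep=time.sleep):
    """Probe `rounds` times, waiting `interval` seconds between probes."""
    results = []
    for i in range(rounds):
        results.append(probe(url, needle, fetcher))
        if i + 1 < rounds:
            sleep(interval)
    return results


# Example call (index_url is whatever simple-index page is being checked):
# for served_by, found in poll(index_url, "neutron_lib-2.5.0-py3-none-any.whl"):
#     print(served_by, "ok" if found else "MISSING")
```

Logging the header value next to each miss is what allows a bad response to be tied to a specific pair of cache endpoints.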
johnsomNeat, yeah, I just needed to step away to make lunch. grin18:36
fungii'll hopefully catch a few more and if they're all cache-yyz4537-YYZ then i expect it's the problem there18:37
fungii guess i should set up a similar loop on the dfw.rax mirror server18:37
fungilooking for stale libvirt18:38
fungiso far it's seen 21 different second-level endpoints reported18:39
fungiso that puts the failure rate at being fairly low, i expect18:39
johnsomSeems like we are unusually "lucky" then?18:40
fungiwell, the problem is at our third-level cache. we get unlucky once and then serve that bad copy for 10 minutes before we refresh it18:41
fungii mean, as far as amplifying the error. that'll make it come in spurts of 10 minutes of bad followed by roughly 200 minutes of good18:42
fungias opposed to a random ~5% failure rate18:43
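One way to read the numbers above, offered as an assumption rather than as the actual reasoning: if one of the 21 second-level endpoints seen so far serves bad content, and each refresh of the 10 minute cache lands on one endpoint uniformly at random, then each refresh is bad with probability 1/21 (about 5%) and the expected run of good refreshes between bad ones is 20, or about 200 minutes. The function name and the uniform-selection model are hypothetical.

```python
from fractions import Fraction


def burst_profile(endpoints, bad_endpoints, ttl_minutes):
    """Return (probability a refresh is bad, mean minutes of good service
    between bad refreshes) under uniform random endpoint selection."""
    p = Fraction(bad_endpoints, endpoints)
    # Expected number of good refreshes before a bad one is (1 - p) / p;
    # each cached copy is then served for ttl_minutes.
    mean_good_minutes = ttl_minutes * (1 - p) / p
    return p, mean_good_minutes


p_bad, good_minutes = burst_profile(endpoints=21, bad_endpoints=1, ttl_minutes=10)
print(float(p_bad), good_minutes)
```

The long-run failure rate is the same as hitting the endpoints directly, but caching concentrates the failures into 10 minute bursts, which is why several jobs on one mirror can fail together.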
funginow i've got a similar loop in dfw.rax looking for missing libvirt-python-6.6.0.tar.gz18:44
fungiboth running in detached screen sessions under my user18:45
fungii'll check back in on them again in a bit18:45
fungijohnsom: have you noticed these failures anywhere besides dfw.rax, and ca-ymq-1.vexxhost?18:46
johnsomNo, but I haven't looked very hard for other either18:46
fungino worries, i expect this will at least be a good start18:46
johnsomI will poke around our other patches18:46
fricklermnaser: infra-root: as a workaround for the vexxhost mirror v6 issue, how about we configure its addr and default route statically and disable its use of RAs? would assume that the two routers are static enough. or can one keep RAs for default route and just disable address autoconfig? I can do some tests for that tomorrow18:55
johnsomfungi as well.
openstackgerritPaul Belanger proposed openstack/diskimage-builder master: Revert "source-repositories: git is a build-only dependency"
fungijohnsom: okay, interesting. that one's in france19:26
fungii've added a loop there checking for neutron_lib-2.5.0-py3-none-any.whl19:29
fungirevisiting we've gotten a number of good results from cache-yyz4537-YYZ now so i'm really not sure what's going on there19:31
fungiand no bad results yet from the loop for mirror.dfw.rax.opendev.org19:32
*** priteau has quit IRC19:44
johnsomWell, I have one job left to finish then I will recheck this quota change patch for the third time and see what happens.19:49
mnaserfrickler, infra-root: the static routes should be very reliable and extremely stable19:56
mnaserinfra-root: i am seeing a lot of ceph jobs failing @ vexxhost -- is it possible mirrors are stale there?19:59
clarkbmnaser: frickler: that plan seems reasonable, though looking at the server really quickly I'm not sure how ubuntu expects that to be configured these days19:59
clarkbmnaser: mirrors should be the same everywhere due to afs19:59
clarkbmnaser: can you point to what specifically is breaking?19:59
mnaser was a ceph failure20:00
clarkb/etc/netplan/50-cloud-init.yaml seems to be the thing that configures networking on the vexxhost mirror20:00
clarkbdo we get to learn netplan now :/20:00
mnaserit sounds like we're actually pulling from ..20:00
clarkbya you aren't using our mirrors20:01
mnaser also failed20:01
mnaseryeah this seems like a devstack-plugin-ceph issue then20:01
fungiErr:19 bionic/main amd64 Packages20:01
fungithat's... not us20:01
mnaseryeah, i'm just noticing this, i have no idea what that is to be honest20:01
mnasersorry for that noise, i'm checking the plugin20:03
clarkbalso re the RA thing when limestone had this problem I tried to set up radvd myself and send RAs then catch them across tenants and failed20:03
clarkbbut that may be worth checking here as well (then we don't have to figure out what nested neutron is doing)20:03
mnaserclarkb: i actually think frickler had a tcpdump going that caught them20:03
mnaserit's just.. pretty non trivial to figure out :\20:04
mnaseralso for those who are curious:
clarkbmnaser: oh ya I mean I'm sure they're happening but if we can send them ourselves when we want to then we (really you) can trace it through the neutron firewalls20:04
mnaserclarkb: yeah at the time the use of finding out the mac address of the system which sent them so we can trace it, so yeah20:04
fungiif logstash is up to date and we're indexing the network sanity checks in job logs, the mac address can possibly be found with a logstash search to identify the build/job20:08
fungiassuming we were the source of the stray announcements, which i expect we were20:08
clarkbya but those shouldn't be a problem20:09
clarkbneutron on the host side should drop them20:09
* clarkb is reading netplan docs fwiw20:12
fungioh, i completely agree, just hoping if we can get some correlation between ra leakage and specific jobs maybe it'll give us a clue as to how to replicate the problem20:15
mnaserfungi: yeah, im hoping that will help me get the host that did it and i can find some useful error message20:17
clarkbfwiw I don't think our logstash data is anywhere near current (and ironically due to neutron logs being very large)20:17
fungimnaser: also if you can map it to an instance name (or uuid probably) then i can likely track back through nodepool to zuul and work out which build used it20:19
mnaseryeah, i mean i wouldn't be surprised if it was just a normal devstack gate change20:19
fungithat's my fear20:20
fungiwhich means the chance of being able to regularly reproduce the problem is low and so a very rare condition20:20
mnaseri mean logan- ran into it only once20:21
mnaserwhat can i say though20:21
mnaseri'm much more lucky :)20:22
clarkbmnaser: well it persisted there20:22
clarkbwe turned the cloud off for a while and eventually we couldn't reproduce and neutron couldn't do anything so we tried turning it back on again and things were happy20:22
mnaseryeah i suspect this won't be fun to find..20:22
logan-whoa.. it happened again mnaser?20:23
mnaserlogan-: like 3 times already :)20:23
clarkb2604:e100:1:: is the gateway for the vexxhost mirror?20:24
mnaserlogan-: and the weird thing is that.. no migrations are happening or anything, so it would only be something that somehow cleared firewall rules and neutron .. readded them20:24
mnaserclarkb: the gateways are link local addresses, you should see them right now20:25
clarkbmnaser: ok thats why I was confused20:25
clarkb something like that maybe?20:25
clarkbthen restart networking with systemd?20:26
clarkbI love that we change these things every release20:26
clarkbfungi: re reproducing the fastly thing, its possible that each cdn "endpoint" is really many many servers and we only get lucky if we hit one of them20:30
clarkbwhat I remember from when this has happened before is we don't get the html formatting for the index list. Everything is listed in one line that wraps and none of the requires python metadata is included and the list is stale missing newer releases20:38
johnsomFYI, the third recheck of that quota change patch seems to have got past the pypi cache issue and appears to be running this time.20:41
clarkbheh its also possible that fungi hitting the CDN all over is ensuring that we keep that data fresh20:42
johnsomHey, if it works! grin20:42
fungiyeah, not sure, since that one case i caught from ca-ymq-1.vexxhost i haven't had any more problem responses20:46
fungiand i'm testing cases from each of the mirrors where we've seen it every ten seconds for at least a couple hours now20:46
fungialso saw non-problematic responses from cache-yyz4537-YYZ after the one bad response20:48
fungipossible fastly had some cache propagation issues and they've cleared up since20:48
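A minimal sketch of how the symptoms clarkb lists above (every link on one wrapped line, no requires-python metadata, newer releases missing) could be checked on a fetched index page. This is not the check fungi was actually running; the function name `index_looks_degraded` and the three heuristics are assumptions made for illustration. The only outside fact relied on is that PEP 503 simple index pages carry Requires-Python as a `data-requires-python` attribute on the links.

```python
def index_looks_degraded(html, expected_version):
    """Return the reasons a simple-index page resembles the bad responses.

    Hypothetical helper: the heuristics mirror the symptoms described in
    the channel and are assumptions, not a definitive test.
    """
    problems = []
    # Healthy pages list one link per line; the bad ones were a single
    # wrapped line containing every link.
    if html.count("<a ") > 1 and html.strip().count("\n") == 0:
        problems.append("all links on one line")
    # PEP 503 exposes Requires-Python as a data-requires-python attribute.
    if "data-requires-python" not in html:
        problems.append("no requires-python metadata")
    # A stale page will not mention a release known to exist upstream.
    if expected_version not in html:
        problems.append("missing expected release")
    return problems
```

Run against each mirror on a timer, an empty list means the page looks healthy; anything else is worth logging together with the response's cache node identifier.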
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Generate build ssh keys in PEM format
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Add test-add-build-sshkey role
openstackgerritClark Boylan proposed opendev/base-jobs master: Use test-add-build-sshkey in base-test
corvus is unhappy for me21:03
clarkbcorvus: the web ui?21:04
clarkbI'm able to browse it from here21:04
corvussigh.  i ended up with an https link, sorry21:04
clarkbplenty of free memory on the server so we aren't OOMing again due to bot crawlers21:04
clarkbah ok21:05
corvuswe should do something about that someday :/21:05
*** dmsimard7 has joined #opendev21:14
openstackgerritEmilien Macchi proposed openstack/project-config master: Re-introduce puppet-tripleo-core group
*** dmsimard has quit IRC21:15
*** dmsimard7 is now known as dmsimard21:15
fungii need to pick back up the mm3 poc, i figured we'd ssl that since it has actual accounts and not just a password it e-mails you21:28
fungialso the first time i tried we weren't really deploying containers so i tried it from distro packages, but there are official container images i want to take out for a spin21:29
* clarkb remembers gerritbot unit was never disabled on review.o.o21:31
clarkbdoing that now21:31
fungiooh, yep21:31
fungiwe could just uninstall it21:31
funginuke it from orbit, the only way to be sure21:31
clarkb`sudo systemctl disable gerritbot.service` has been run21:32
clarkbI'm not sure we need to do more than that21:32
fungiat this point i think we don't have any more obvious/critical problems with the new deployment so i doubt we'd roll back regardless21:32
fungibut yeah, as long as nothing ever reenables the unit, should be fine21:33
*** tosky has quit IRC21:47
*** hasharWineMusic is now known as hahsar21:59
*** hahsar is now known as hashar21:59
ianwfyi on the arm64 issue, it looks like patchelf, rather than compile-time/build platform :
ianwfungi: catching up ... lmn if i can help with the cache issues.  tl;dr seems to be the weird "don't get the right index" but we then cache it, making it worse?22:00
fungilooks like i've finally caught a couple more hits in montreal, both involved cache-yyz4537-YYZ22:01
fungiso far no occurrences in the other regions i'm checking22:02
ianwclarkb: the ssh key thing ... i have deja vu22:12
ianw was it ... which turned out to be trailing whitespace being stripped, not the key format22:13
clarkbianw: wasn't that fixed though?22:14
ianwyeah but i found that the format change had been incorporated a long time ago -- where is it failing?22:15
ianwiirc, it's fuzzy, but i determined everything we run should understand the new format keys22:16
clarkbianw: in beaker testing22:16
clarkbit passed back in may but fails now and I believe we switched from xenial openssh on the executors generating the keys to debian new whatever zuul docker images are built on openssh generating them22:16
clarkbit may be the ruby lib there not necessarily the system openssh that is failing22:17
clarkbianw: is the sort of traceback thing we get from ruby22:17
ianw /home/zuul/.bundled_gems/gems/net-ssh-2.9.4/lib/net/ssh/authentication/methods/publickey.rb:19:in `authenticate'22:17
ianwyeah ... ok.  because yeah, i'm pretty sure i determined xenial even could handle it22:18
clarkband earlier in the job we try to generate a key on xenial using xenial openssh but it fails because we've already written an id_rsa22:18
clarkbthats fine if we can use that key though I think22:18
clarkbbut it seems we can't22:18
*** shtepanie has quit IRC22:19
ianwyeah my only concern is that we'll be using the old format forever because we'll never remove it, to work around a xenial ruby issue22:19
clarkbthats fair, but we keep hitting this all over22:19
clarkbI had to address it with my gerrit upgrade testing22:19
clarkb(gerrit init uses ssh-keygen fork but can only read PEM format so breaks :( )22:20
clarkbI guess the common attribute there is its other ssh tools not openssh itself22:20
clarkbianw: reading xenial's ssh-keygen manpage I think you are right that it supports the RFC 4716/SSH2 public or private key format22:22
clarkbbut I think xenial must default to PEM22:22
clarkbwhich is why the ruby stuff worked until we switched to containerized executors22:22
clarkbthe other place we've seen this is with paramiko22:23
clarkbso ya I think it is largely third party ssh tools that trip over it22:24
ianwyeah, that sounds about like what i found, i'm not sure if i wrote it down22:24
ianwdropped a comment; one idea is that we could have a flag to turn it on?  i'm trying to think of somewhere that's a good hook point to enable that ...22:24
ianwwould it be "if there's a xenial node in the inventory"?22:25
clarkb is the paramiko bug22:25
clarkbwell I think it has more to do with the tools you use (gerrit's ssh lib, paramiko, ruby net-ssh) than the distro22:25
clarkbis there a concern with using PEM if it is more universally understood?22:26
ianwnot really, other than over time we fall into the other category of "using the less tested path" as everything else moves22:28
clarkbianw: I guess as an alternative we can modify our python jobs to convert the zuul generated key or delete it and regenerate22:30
clarkbbasically have zuul do the up to date thing and if you need something else push into specific jobs22:31
clarkblet me work on a change to do that22:31
ianwup to you if you think it's good.  i'm happy enough with a comment at least pointing us in the direction of "it's not ssh, but other stuff"22:31
ianwThis will be out in Paramiko 2.7 (only one more ticket after this one before that's cut!)22:32
ianw bitprophet commented on Dec 4, 201922:32
ianwthat's not that long ago, unlike the openssh changes22:33
*** hashar has quit IRC22:34
ianwOpenSSH 5.6 was released on 2010-08-23 ... paramiko merged a fix for it in 2019-12 ... so i'll see you back here in ~2030 and we can switch :)22:36
openstackgerritClark Boylan proposed opendev/system-config master: Convert ssh keys for ruby net-ssh if necessary
clarkbianw: hahaha22:37
clarkbianw: fwiw I'm happy doing ^ instead if that works22:37
clarkbianw: and ya judging by when the paramiko bug shows up I think ssh-keygen on xenial produces a working format by default but then the default output format changes on bionic22:44
clarkbxenial and bionic both support both formats, its more a concern of which they output if you don't specify22:44
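A rough sketch of the "convert the key if necessary" idea discussed above: look at the private key's first line to see which format it is in, and rewrite it only when it is the newer OpenSSH format that tools such as ruby net-ssh or older paramiko may not read. This is not the content of the proposed system-config change. The helper names `private_key_format` and `convert_to_pem` are hypothetical, and the conversion assumes an unencrypted key and an `ssh-keygen` recent enough to accept `-m PEM` together with `-p`.

```python
import subprocess

OPENSSH_HEADER = "-----BEGIN OPENSSH PRIVATE KEY-----"


def private_key_format(key_text):
    """Classify a private key by its armor header (hypothetical helper)."""
    first = key_text.lstrip().splitlines()[0] if key_text.strip() else ""
    if first == OPENSSH_HEADER:
        return "openssh"
    # e.g. "-----BEGIN RSA PRIVATE KEY-----" or "-----BEGIN EC PRIVATE KEY-----"
    if first.startswith("-----BEGIN ") and first.endswith("PRIVATE KEY-----"):
        return "pem"
    return "unknown"


def convert_to_pem(path):
    """Rewrite the key at `path` in place in PEM format.

    Assumes an empty passphrase before and after, and an ssh-keygen that
    supports -m with -p.
    """
    subprocess.run(
        ["ssh-keygen", "-p", "-m", "PEM", "-P", "", "-N", "", "-f", path],
        check=True,
    )
```

A job would read the key file, call `private_key_format` on its contents, and call `convert_to_pem` only when the result is `"openssh"`, so keys already in PEM format are left untouched.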
ianwpabelanger: are you going to need a point release for the git thing?22:47
openstackgerritClark Boylan proposed opendev/puppet-elastic_recheck master: Use py3 with elastic-recheck
clarkbzbr: ianw ^ added a depends on to test the change above22:49
clarkbianw: want to WIP my zuul-jobs change until we get results from ^22:51
clarkbI can do it too. Just wondering how concerned we are about that landing early22:51
clarkbianw: the parent
clarkb(technically both but gerrit will enforce it if its on the parent alone)22:53
*** tkajinam has joined #opendev22:58
*** DSpider has quit IRC23:00
*** mlavalle has quit IRC23:03
openstackgerritIan Wienand proposed openstack/project-config master: Move github reporting to checks API
openstackgerritClark Boylan proposed opendev/system-config master: Convert ssh keys for ruby net-ssh if necessary
