Wednesday, 2021-03-31

*** artom has quit IRC00:14
*** artom has joined #opendev00:14
clarkbfungi: it looks like maybe the zuul restart caught the periodic jobs for requirements in a weird spot? they are all retry limits and the ref is 00000000:18
clarkb(just noting it, I expect that the 0600 enqueuing of those jobs will be fine)00:18
fungii meant to follow up on that, i got several tracebacks from reenqueues00:20
fungichecking to see if i can tell which ones00:20
fungiso this is what i ran for it: zuul enqueue-ref --tenant openstack --pipeline periodic --project openstack/requirements --ref refs/heads/master00:21
fungino errors from that00:22
fungithe ones which did throw errors were pyca/cryptography00:23
fungihttp://paste.openstack.org/show/80406400:24
fungiclarkb: the retry limit may be separate from the reason for the restarts00:29
fungiretries00:29
fungier, let me retry00:29
fungithe reason for the retries may be unrelated to the 0 ref00:30
openstackgerritMerged opendev/system-config master: review01.openstack.org: add key for gerrit data copying  https://review.opendev.org/c/opendev/system-config/+/78377800:30
fungilast zuul scheduler restart we saw a similar situation with an hourly deploy item enqueued with a 0 ref, but it grabbed the branch anyway00:30
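(For reference, a hedged sketch of the re-enqueue form under discussion: the zuul client's enqueue-ref accepts an explicit --newrev, so passing the branch tip sha should keep the item from being recorded with the all-zeros ref; omitting it appears to be what produces the zero ref seen above. The sha placeholder is purely illustrative.)

    zuul enqueue-ref --tenant openstack --pipeline periodic \
        --project openstack/requirements --ref refs/heads/master \
        --newrev <sha-of-refs/heads/master-tip>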
*** hamalq_ has quit IRC00:49
openstackgerritJeremy Stanley proposed zuul/zuul-jobs master: Document algorithm var for remove-build-sshkey  https://review.opendev.org/c/zuul/zuul-jobs/+/78398800:56
*** iurygregory has quit IRC01:16
*** iurygregory has joined #opendev01:17
*** iurygregory has quit IRC01:18
*** iurygregory has joined #opendev01:18
*** osmanlicilegi has joined #opendev01:20
openstackgerritJeremy Stanley proposed opendev/base-jobs master: Clean up OpenEdge configuration  https://review.opendev.org/c/opendev/base-jobs/+/78398901:44
openstackgerritJeremy Stanley proposed openstack/project-config master: Clean up OpenEdge configuration  https://review.opendev.org/c/openstack/project-config/+/78399001:44
openstackgerritJeremy Stanley proposed opendev/system-config master: Clean up OpenEdge configuration  https://review.opendev.org/c/opendev/system-config/+/78399101:45
*** iurygregory has quit IRC02:08
*** iurygregory has joined #opendev02:09
ianwour new review02.opendev.org can't ping review01.openstack.org via ipv6, but the other way (review01 -> review02) *does* work02:19
ianwi'm taking suggestions on how i might have messed this up :)02:20
ianwi am connecting to review02 via ipv6.  i can also ping it locally here.  so it's not ipv6 in general02:25
ianwi know nobody is around, but dumping some debugging info in #vexxhost channel02:38
*** diablo_rojo has quit IRC02:40
fungii can take a look when i wake up too02:48
ianwfungi: :) thanks, you'll probably have more cross-over with vexxhost people02:49
fungiit does seem on the face of it to be similar to some of the ipv6 oddness we've seen with rackspace in the past, so i wouldn't assume it has anything to do with how things are set up in vexxhost02:50
fungianyway, passing out now, will sleep on it02:51
ianwyeah; i agree.  however in that case we usually did see the packets coming *into* the host, which responded, but the packets never found their way back02:51
fungitrue02:51
ianwin this case, a tcpdump doesn't show the ping packets making it to the host02:51
ianwalso, of the major ipv6 destinations i can think of, it can't seem to ping most of them, but it can ping google02:52
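(A hedged sketch of the capture being described: watching on the instance for any inbound ICMPv6, which covers both the echo requests and the neighbour discovery traffic they depend on. The interface name is an assumption.)

    # on review02, watch for inbound ICMPv6 (pings and neighbour discovery)
    sudo tcpdump -ni ens3 icmp6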
ianwanyway, i have review02 now syncing via ipv403:07
ianw~45MB/s so not too shabby03:08
*** akahat has quit IRC03:08
*** kopecmartin has quit IRC03:09
*** fbo has quit IRC03:09
*** kopecmartin has joined #opendev03:13
*** fbo has joined #opendev03:14
*** akahat has joined #opendev03:22
*** ykarel|away has joined #opendev04:20
*** ykarel|away is now known as ykarel04:39
*** marios has joined #opendev05:03
*** zbr|rover4 has joined #opendev05:04
*** zbr|rover has quit IRC05:06
*** zbr|rover4 is now known as zbr|rover05:06
*** whoami-rajat has joined #opendev05:17
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Add upload-logs-azure role  https://review.opendev.org/c/zuul/zuul-jobs/+/78200405:27
*** auristor has quit IRC05:27
ianwok, https://review02.opendev.org has some content05:41
*** ysandeep|away is now known as ysandeep05:59
openstackgerritIan Wienand proposed opendev/system-config master: AFS documentation : add notes on replication  https://review.opendev.org/c/opendev/system-config/+/78400206:01
openstackgerritIan Wienand proposed opendev/system-config master: review02 : bump heap limit to 96gb  https://review.opendev.org/c/opendev/system-config/+/78400306:01
ianwtime docker-compose run shell java -jar /var/gerrit/bin/gerrit.war reindex -d /var/gerrit --threads 3206:04
ianwreal    50m8.443s06:04
*** ralonsoh has joined #opendev06:10
*** slaweq has joined #opendev06:10
*** sboyron has joined #opendev06:21
*** eolivare has joined #opendev06:30
*** hashar has joined #opendev06:45
openstackgerritHervĂ© Beraud proposed openstack/project-config master: Use publish-to-pypi on barbican ansible roles  https://review.opendev.org/c/openstack/project-config/+/78401106:52
ianw24 threads was "real    52m10.284s" fyi07:19
ianwi just rebooted the review02.opendev.org host, and all ipv6 seems to work now07:25
*** tosky has joined #opendev07:33
*** ysandeep is now known as ysandeep|lunch07:47
*** ykarel has quit IRC08:01
openstackgerritMerged opendev/irc-meetings master: Remove Automation SIG meeting  https://review.opendev.org/c/opendev/irc-meetings/+/78387808:06
*** dpawlik0 is now known as dpawlik08:08
*** hrw has joined #opendev08:11
hrwmorning08:11
hrwcan someone help me get centos-8-stream-arm64 node running?08:14
hrwproject-config has such08:14
hrwhttps://zuul.openstack.org/nodes does not08:14
openstackgerritMerged openstack/project-config master: Use publish-to-pypi on barbican ansible roles  https://review.opendev.org/c/openstack/project-config/+/78401108:16
ianwhrw: this build doesn't look good08:21
ianwhttps://nb03.opendev.org/centos-8-stream-arm64-0000001549.log08:22
hrwianw: let me look08:22
ianw2021-03-31 07:44:32.339 | + /usr/sbin/grub2-install '--modules=part_msdos part_gpt lvm' --removable --force /dev/loop608:22
ianw2021-03-31 07:44:32.341 | /usr/sbin/grub2-install: error: this utility cannot be used for EFI platforms because it does not support UEFI Secure Boot.08:22
ianwit may be possible we fixed this and we haven't either done a dib release or included a new dib release in the nodepool container...08:22
ianwdiskimage-builder version 3.7.008:22
hrwgrub--08:23
ianwlooks like https://review.opendev.org/c/openstack/diskimage-builder/+/77910608:24
ianwit seems we are due a release08:24
hrwlooks like08:25
ianwok, i pushed 3.8.0, but we'll have to pull into nodepool and then deploy to builders.  sorry, no quick route there :/08:27
hrwno problem, happens08:28
ianwok, https://review.opendev.org/c/zuul/nodepool/+/784026 will start the process08:29
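(Once a new nodepool image is deployed, a hedged way to confirm the builder actually picked up the newer diskimage-builder; the container name is an assumption about the compose service layout on nb03.)

    # check the dib version inside the running builder container
    sudo docker exec nodepool-builder pip3 show diskimage-builder | grep Version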
*** ykarel has joined #opendev08:43
*** ykarel is now known as ykarel|lunch08:43
jrosserdebian-bullseye-updates and debian-bullseye-backports don't seem to be being mirrored, the logs are zero length here https://files.openstack.org/mirror/logs/reprepro/09:05
*** ysandeep|lunch is now known as ysandeep09:07
*** klonn has joined #opendev09:11
*** klonn has quit IRC09:50
*** gibi is now known as gibi_away09:52
*** hashar has quit IRC09:52
hrwjrosser: are there such repos upstream already?09:52
*** klonn has joined #opendev09:52
jrosserhrw: they seem to be here http://ftp.uk.debian.org/debian/dists/09:52
hrwo, they are. nice09:52
hrwjrosser: note that https://files.openstack.org/mirror/logs/reprepro/debian-buster-updates.log.1 has content so perhaps there was nothing to mirror last time reprepro ran09:54
hrwoops, wrong version09:55
jrosserthere's a patch to build images for bullseye which fails because it tries to add -updates and -backports repos and apt update is upset that there's no Release file09:56
*** ykarel|lunch is now known as ykarel10:04
mordredbullseye is current testing - it's not going to have -updates or -backports yet10:32
mordredit won't grow working versions of those until it is actually released10:32
mordredhrm. I take that back - I agree that -updates exists for real10:33
chkumar|ruckHello Infra, we are seeing a few retry limits on one patch https://zuul.opendev.org/t/openstack/status#782187 , please have a look, thanks!10:34
chkumar|ruckjoin name : tripleo-ansible-centos-8-molecule-tripleo-modules10:35
chkumar|ruck*job10:35
mordredsame with -backports - how weird (although I gotta say it makes automation nice)10:35
* mordred goes back to morning caffeine10:35
chkumar|ruckit is the earlier retry_limit job https://zuul.opendev.org/t/openstack/build/d28be58628484f92a36bd8ab87279d6e10:35
*** klonn has quit IRC10:45
*** mugsie__ is now known as mugsie11:01
*** dtantsur|afk is now known as dtantsur11:34
*** lpetrut has joined #opendev11:37
fungijrosser: hrw: mordred: catching up, but the problem is that reprepro won't create empty repositories, even if they exist empty at the source end. there is a set of commands we can run to create the empty indices, documented in the reprepro manpage i think, i vaguely recall doing that for buster11:41
fungichkumar|ruck: zbr|rover was also asking about those retries in #openstack-infra, seemed like it could be related to a specific job or node type, i can help get an autohold set up for it in a bit and then we can try to retrigger the failure and investigate the resultant state of the vm after the failure and also try to extract a vm console log from it11:43
mordredAhhh right11:45
chkumar|ruckfungi: thanks :-)11:50
hrwfungi: maybe the scripts which call reprepro should take care of creating empty RELEASE-{backports,updates} ones when a new release gets added?11:56
hrwfungi: so in 2 years' time we will not get into the same discussion again ;)11:56
fungihrw: maybe, but how to add that requires some thought in declarative configuration management. we don't have a lot of tasks which are run-once-on-setup11:56
*** sshnaidm|off is now known as sshnaidm11:57
hrwfungi: understood11:58
fungithough maybe our script which runs reprepro could run it if the suites are missing at the end or something11:59
fungibasically "create these empty if they don't exist at completion"12:00
hrw;)12:01
openstackgerritGuillaume Chauvel proposed opendev/gear master: Update SSL exceptions  https://review.opendev.org/c/opendev/gear/+/78408212:01
openstackgerritGuillaume Chauvel proposed opendev/gear master: WIP: Client: use NonBlockingConnection to allow TLSv1.3  https://review.opendev.org/c/opendev/gear/+/78408312:01
*** auristor has joined #opendev12:05
openstackgerritJeremy Stanley proposed opendev/zone-opendev.org master: Clean up OpenEdge configuration  https://review.opendev.org/c/opendev/zone-opendev.org/+/78408612:10
openstackgerritGuillaume Chauvel proposed opendev/gear master: WIP: Client: use NonBlockingConnection to allow TLSv1.3  https://review.opendev.org/c/opendev/gear/+/78408312:14
fungizbr|rover: chkumar|ruck: i've set an autohold for the failing job on https://review.opendev.org/782187 so feel free to recheck it and we can take a closer look at the node once it fails again12:35
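(A hedged sketch of what such an autohold looks like; the flags are the standard zuul client ones, and the reason string is illustrative. Dropping --change broadens the hold to any failure of the job, which is what happens later in the day.)

    zuul autohold --tenant openstack --project openstack/tripleo-ansible \
        --job tripleo-ansible-centos-8-molecule-tripleo-modules \
        --change 782187 --reason "debugging retry_limit failures" --count 1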
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066212:42
fungijrosser: hrw: mordred: according to the reprepro manpage, i think something like `reprepro export buster-updates` is what we want, but i'll get some caffeine in me before attempting12:42
hrwfungi: s/buster/bullseye/ and also -backports but yes, it looks like it12:44
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066212:45
jrosserfungi: thanks for taking a look at that :)12:45
*** smcginnis has quit IRC12:53
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:04
fungijrosser: hrw: mordred: apparently i did it almost a year ago for focal-backports: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-04-24.log.html#t2020-04-24T00:36:0013:13
hrw"we just have to remember to do that whenever adding a new release i guess"13:14
hrw;)13:14
fungiwell, or like i said, maybe if we can detect in our script that no indices were created for a configured dist, we make it run that command at the end13:18
fungii'm going to work on that angle here in a bit13:18
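(A rough sketch of that angle, assuming the mirror script knows the reprepro basedir; the path and suite names are illustrative, not the actual opendev configuration.)

    BASEDIR=/afs/.openstack.org/mirror/debian
    for suite in bullseye-updates bullseye-backports; do
        # if reprepro left no indices behind for a configured (empty) suite, export it
        if [ ! -f "$BASEDIR/dists/$suite/Release" ]; then
            reprepro --basedir "$BASEDIR" export "$suite"
        fi
    done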
*** ralonsoh has left #opendev13:20
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:30
zbr|roverfungi: it's probably going to happen with https://zuul.opendev.org/t/openstack/stream/5ebc409e1d554b89b5569c6fbbfcc1f7?logfile=console.log too13:32
zbr|roveralready >12 min without any reply, it's probably already stuck.13:33
openstackgerritDaniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible  https://review.opendev.org/c/zuul/zuul-jobs/+/78066213:34
zbr|roverfungi: yep, it did fail too.13:36
*** darshna has joined #opendev14:09
*** ykarel is now known as ykarel|away14:40
clarkbchkumar|ruck: zbr|rover fungi  note that if the problem is network connectivity, holding the node may not be helpful14:46
fungiclarkb: well, the idea is that i may at least be able to capture a vm console log from the held node, or reboot it with the nova api, or even boot it on a rescue image to get at the logs14:48
fungibut we've also seen similar cases where something during the job drove system load or disk i/o up so high that sshd ceased responding fast enough to beat ansible's timeout14:49
clarkbfair enough. The last time tripleo had network setup issues the console logs did help in that it showed that network manager was undoing our static ip config14:49
fungiand after a time the vm recovers and can be reached again14:49
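(A hedged sketch of the console-log capture mentioned above, pulled via the nova api with openstackclient against the held node; the server id is a placeholder.)

    openstack console log show <held-node-uuid> > console.log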
zbr|roveri am now trying to manually run the same tests that are happening inside that job, i may be able to identify if there is an issue with these tests or not.14:56
zbr|roveri will find out soon, already passed 1/614:57
*** mfixtex has joined #opendev15:14
*** lpetrut has quit IRC15:22
*** Dmitrii-Sh4 has joined #opendev15:26
*** Dmitrii-Sh has quit IRC15:26
*** Dmitrii-Sh4 is now known as Dmitrii-Sh15:26
*** noonedeadpunk has quit IRC15:27
*** noonedeadpunk has joined #opendev15:28
*** ykarel|away has quit IRC15:37
*** hashar has joined #opendev15:41
*** ysandeep is now known as ysandeep|away15:47
*** spotz has joined #opendev15:50
*** diablo_rojo has joined #opendev16:06
*** hamalq has joined #opendev16:18
*** hamalq_ has joined #opendev16:19
*** hamalq has quit IRC16:22
*** Dmitrii-Sh has quit IRC16:23
*** Dmitrii-Sh has joined #opendev16:24
*** hamalq_ has quit IRC16:41
*** hamalq has joined #opendev16:41
*** eolivare has quit IRC16:43
*** marios is now known as marios|out16:47
*** marios|out has quit IRC16:59
*** dtantsur is now known as dtantsur|afk17:04
corvusi'm getting started on looking into the zuul memory leak now17:23
zbr|roverfungi: clarkb: re stuck job, we are currently making it nv and we have another patch that may fix the root cause, but we are not sure yet.17:29
clarkbzbr|rover: it's failing in a loop though, right? ideally we wouldn't just set it to nv in that case17:29
fungizbr|rover: thanks for the update, i also realized there was no reason to limit the autohold to a single change so i broadened it to any tripleo-ansible-centos-8-molecule-tripleo-modules failure for openstack/tripleo-ansible17:30
zbr|roveri am almost sure is not zuul or infra issue here, is a genuine bug.17:30
fungialso, clarkb is right, setting it nv means we'll eat 3x the node count for that job anyway and just throw away the results17:30
zbr|roverwe do want to see the impact of the real fix first. i do not expect it to stay nv more than a day.17:30
fungithe change with the fix can readd the job to the check and gate pipelines17:31
fungithat way you still see the effects of the fix on the fix change and any queued after it17:31
fungijust not on the changes where you expect it to fail17:31
zbr|roverto be clear it does not always fail. in fact we do not really know what introduced the issue.17:32
fungi<100% failure would also be a possible explanation for how the bug itself got merged if it wasn't an outside shift17:32
clarkbif it is a similar network issue to last time it had to do with the job forcing dhcp in regions without dhcp17:35
clarkball the regions that used dhcp said ok whatever and kept running, but those that use static IPs immediately broke17:35
clarkbcorvus: I need to take a break, but let me know if I can help with the memory leak and I can dive into that after17:36
corvusclarkb: thx.  i'm still at step 1: waiting for the first sigusr2 objgraph most common types report to finish17:36
corvusabout 15 minutes into that17:37
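(Context: the zuul scheduler dumps thread stacks on SIGUSR2 and, when objgraph is installed, writes an objgraph most-common-types / growth report into its debug log. A hedged sketch of triggering it; the pgrep pattern is an assumption about the process name.)

    sudo kill -USR2 $(pgrep -f zuul-scheduler)
    # then watch the scheduler debug log for the objgraph report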
zbr|roverdone, updated the patch to disable job17:40
zbr|roverfungi: clarkb thanks for helping on that issue. time for me to go offline now.17:42
fungihave a good evening zbr|rover!17:42
*** avass has quit IRC17:56
*** yourname has joined #opendev17:57
*** yourname is now known as avass17:58
*** avass has quit IRC18:00
*** yourname has joined #opendev18:01
*** yourname is now known as avass18:02
*** avass has quit IRC18:07
*** yourname has joined #opendev18:10
*** yourname is now known as avass18:10
fungipoking at the missing empty debian bullseye dists a bit more, i'm starting to think it may make the most sense to just reprepro export unconditionally after every reprepro update. testing that theory now18:13
fungiif it's reasonably quick even on nonempty dists, then the cost is low enough to warrant the one-line fix rather than lots of unnecessary config parsing and conditionals18:15
*** avass has quit IRC18:17
*** yourname has joined #opendev18:18
*** yourname is now known as avass18:18
*** avass has quit IRC18:19
*** yourname has joined #opendev18:19
*** yourname has quit IRC18:21
*** avass has joined #opendev18:21
openstackgerritClark Boylan proposed opendev/system-config master: Update Gerrit to 3.2.8  https://review.opendev.org/c/opendev/system-config/+/78415218:22
clarkbthat isn't urgent but i noticed they made a new release so i figure we should try and keep up if we can18:22
fungithe dstat effort is not yet to the point where we have enough data to decide on a timeline for 3.3 i guess18:22
clarkbya I think we really want to improve the gatling git stuff for 3.318:23
clarkbthat said, the new bigger server gives us a lot of headroom and I think we can be less cautious (early data says 3.3 uses more memory but is faster)18:23
johnsomHmm, things are queuing in an odd way. I posted a patch fifteen minutes ago and it still isn't listed in the check pipeline18:33
johnsomhttps://review.opendev.org/c/openstack/octavia-tempest-plugin/+/76046518:33
clarkbjohnsom: we're (mostly corvus) doing object introspection on zuul to try and root cause this memory leak and that slows stuff down18:34
clarkbhopefully just temporary until we get data we need18:34
johnsomAh, ok. Just thought I would give a heads up18:34
*** DSpider has joined #opendev18:39
*** DSpider has quit IRC18:39
fungiokay, so adding an explicit reprepro export on an otherwise noop update added 5.5 minutes for the debian repo (9 dists covering hundreds of thousands of packages)19:01
fungii'll see what it adds for debian-security19:01
clarkbnot too bad considering the runtime of a full sync19:01
fungiso the fix to debian-security is going to have to be different, i think19:02
fungiaptmethod error receiving 'http://security.debian.org/dists/bullseye/updates/Release':19:02
clarkbbecause it doesn't exist at all upstream yet or ?19:02
fungi'404  Not Found19:02
clarkbya19:02
mordredclarkb: *amazing* that upstream released a point release and there is an opendev patch up to maybe run it19:03
clarkbmordred: ya we've managed to keep up with point releases19:03
clarkb3.3 scares me a bit simply because a few people on the repo-discuss list reverted19:03
mordredyah19:04
clarkbbut I've been trying to add better testing of it when I can. We added dstat for system level stats and also have a clunky gatling git thing up19:04
mordredfungi: deb http://security.debian.org/debian-security bullseye-security main is what's in the bullseye docker image for security - is security of *updates* the thing that doesn't exist until it exists that I was thinking about earlier?19:04
mordredoh ...19:05
mordredfungi: http://security.debian.org/debian-security/dists/bullseye-security/updates/19:05
mordredfungi: it's security.d.o bullseye-security - not security.d.o bullseye19:06
mordred(even though the other releases do not have a -security suffix)19:06
*** hashar has quit IRC19:08
fungioh, righto, that's changing in bullseye19:09
fungitotally forgot about that announcement19:09
fungiso we probably need the fix i'm considering *and* some config change for bullseye's security repo19:10
fungiokay, lemme get this pushed up first then, it seems to solve the first issue19:12
fungiafter that i can meditate on the reprepro config a bit19:12
openstackgerritJeremy Stanley proposed opendev/system-config master: Explicitly create empty reprepro dists  https://review.opendev.org/c/opendev/system-config/+/78415819:27
fungijrosser: hrw: ^ that's part of the solution19:27
funginow to work out the config change we need for bullseye-security19:27
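(Sketch of the shape of that change: starting with bullseye, debian publishes security updates under the suite name bullseye-security on security.debian.org/debian-security, rather than the older codename/updates layout. The stanza below uses the standard reprepro conf/updates field names, but the entry itself is illustrative, not our actual config.)

    Name: bullseye-security
    Method: http://security.debian.org/debian-security
    Suite: bullseye-security
    # older releases keep the previous style, e.g. Suite: buster/updates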
*** sboyron has quit IRC20:12
*** d34dh0r53 has quit IRC20:35
*** d34dh0r53 has joined #opendev20:40
*** whoami-rajat has quit IRC20:47
*** dhellmann_ has joined #opendev21:12
*** dhellmann has quit IRC21:12
*** dhellmann_ is now known as dhellmann21:14
fungiokay, so i've worked out the fix for bullseye-security, unfortunately it meant that we haven't been updating the debian-security volume since the bullseye addition went in because it was breaking that reprepro invocation, so we're some days behind on security repo state for stretch/buster and only just now starting to mirror it for bullseye21:14
fungigood news is it's a one-line fix21:15
fungithough i had to run reprepro clearvanished on it to clean up old incomplete references to the wrong bullseye security repo21:15
fungiwhich was part of what was preventing it from running21:15
*** dhellmann has quit IRC21:16
*** dhellmann has joined #opendev21:17
*** dhellmann has quit IRC21:23
*** dhellmann has joined #opendev21:25
fungiso one down-side here is i think i need to put the mirror-update server in the emergency disable list until i get the config patch deployed21:27
fungiwell and merged21:27
fungiand uploaded21:27
fungiand written21:27
fungione thing at a time ;)21:27
openstackgerritJeremy Stanley proposed opendev/system-config master: Correct debian-security repo codename for bullseye  https://review.opendev.org/c/opendev/system-config/+/78416921:32
fungiinfra-root: ^ appreciate an expedited review of that and the parent change since i have the server in emergency disable with the latter fix applied manually to keep ansible from re-breaking it and requiring additional manual cleanup21:33
clarkblooking21:34
fungijrosser: hrw: ^ that's the remaining fix, but at this point i've also applied and run it on the mirror-update server so we should be ready to move forward and recheck the dib job addition now21:36
fungiand assuming that passes, approve the nodeset addition too21:37
ianwfungi: lgtm, thanks21:38
hrwfungi: cool, thanks!21:40
*** artom has quit IRC21:44
*** artom has joined #opendev21:45
fungii'm confused by the error on https://zuul.opendev.org/t/openstack/build/dbe8af6f6b054f0eb85401a70f74b18822:12
fungii wonder if that test has bitrotted22:12
fungivos examine exiting 25522:14
fungifamous last words, but i don't think my change is causing that, seems to be arising in a wholly separate script22:14
fungithe last time system-config-run-mirror-update succeeded was two days ago22:16
fungibut these changes are the first to run it since22:16
clarkbfungi: could it just be a fluke related to udp and internets?22:19
clarkband/or did it run on an ipv6-only cloud which might be more sensitive to problems?22:19
fungimaybe, but both changes hit the same error a couple of hours apart22:20
fungiaha!22:21
fungiexternal cause22:21
fungiVolume does not exist on server afs01.ord.openstack.org as indicated by the VLDB22:21
fungijust tried it from my workstation22:21
fungii guess that will clear up once ianw's vos releases finish22:22
fungiianw: would it be safe to go ahead and replicate project.zuul.readonly to ord ahead of the others?22:22
fungisince we explicitly reference it in that test, it can't pass currently22:22
ianwoh, i guess alphabetically that came last in the loop22:26
clarkbI'm going through johnsom's list of CI issues and seeing if I can provide any help/feedback/fixes22:26
ianwi'm releasing it now22:26
clarkbhttps://etherpad.opendev.org/p/wallaby-RC1-ci-challenges <- is the list22:27
fungithanks ianw! lmk when it completes and i'll approve those debian mirror fixes22:27
ianwfungi: Released volume project.zuul successfully22:29
clarkbfungi: ianw: if you get a chance can you look at my orange ish notes on item 3 in that etherpad and tell me if that looks like the pip solver to you?22:34
clarkbI wonder if it is really slow on amd/vexxhost for some reason22:35
TheJuliado we have mor general ci grumpiness? a lot of jobs just went to 2nd retry22:35
TheJuliaAt least, looking at https://zuul.opendev.org/t/openstack/status#ironic22:36
clarkbTheJulia: there was a zk reconnection aboutan hour ago? something like that22:36
clarkbcorvus is actively debugging whihc at times has impact on zuul performance which can trigger that (even though last I checked memory use and thus swap was fine)22:36
TheJuliaLooks fairly recent-ish :(22:36
TheJuliaokay22:36
fungicacti claims we're not back into memory pressure on the scheduler yet at least, but maybe the repl work is stalling zk connections out22:37
clarkbjohnsom: out of curiousity are you enabling nested virt on any of these that have libvirt/cpu trouble?22:38
clarkbjohnsom: yes I think so as the label being used is explicitly the nested virt label22:39
johnsomAll of them, but the errors in in nova<->libvirt. The qemu/kvm layer has no errors22:39
clarkbjohnsom: well the cpu lockup was in the kernel/cpu/etc22:40
johnsomIt seems related to bionic as well, they are all stable jobs that I have seen22:40
clarkbI have a ver ystrong suspicion that that one is related to nested virt22:40
johnsomIt always goes through the "try CPU type", that is not unusual. The speculation is it is a bug in libvirt/glib combo22:43
clarkbjohnsom: sure, but all of your examples are vexxhost so far :)22:44
clarkbmaybe it is a bug with libvirt/glib + amd :)22:44
clarkbjohnsom: also reading the qemu log I'm not sure the amd nested virt flag is being set properly on those hosts22:44
fungikeeping in mind that amd nested virt accel is different than intel nested virt accel too22:44
clarkbshould be svm but it doesn't seem to be in the opteron_g2 flag list22:44
clarkbfungi: yup though I'm not convinced it is properly enabled, but if it is that could be another factor22:45
johnsomYeah, it always whines about that stuff too. Not unusual22:45
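(A hedged pair of checks that could be run on one of these nodes to confirm whether nested virt is actually exposed to the guest; the paths and flags are the standard linux/kvm ones.)

    # inside the guest: does the advertised cpu include the amd svm (or intel vmx) flag?
    grep -c -E 'svm|vmx' /proc/cpuinfo
    # and is a kvm module loaded, so qemu can use hardware acceleration rather than tcg?
    lsmod | grep -E '^kvm'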
clarkbfungi: in the stackviz one that failed on name resolution (the one you checked against unbound), it is running /tmp/stackviz/bin/pip3 install -u file://path/to/stackviz.tar.gz22:55
clarkbthat then does a python setup.py egg_info somewhere which does the fetch against pypi directly22:55
clarkbI suspect that somehow we are tripping over easy install?22:55
clarkbI wonder if an explicit install of pbr into the virtualenv first would help22:56
fungiyeah, i was more wondering why stackviz install is being done that way22:56
fungiwe could unpack the stackviz tree and then just pip install /the/path/to/it22:57
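(A hedged sketch of that alternative: unpack the sdist and point pip at the resulting directory instead of a file:// url. The paths are illustrative, not the actual job layout.)

    mkdir -p /tmp/stackviz-src
    tar -xzf stackviz-latest.tar.gz -C /tmp/stackviz-src
    /tmp/stackviz/bin/pip3 install /tmp/stackviz-src/stackviz-*/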
clarkbjohnsom: another thing to consider is your jobs are running on a reduced set of clouds due to the nested virt request. Limestone which I guess sometimes has dns failures, vexxhost which may have amd weirdness and also pip SAT solver slowness?, and ovh which I haven't seen any specific issues against yet22:57
clarkbcalling that out because if those clouds have problems your jobs will notice much more than background22:57
clarkbalso simply turning off the problematic clouds wn't help much if they are the only ones that can run the flavors you want22:58
johnsomHa, well, do we have other clouds? I'm ignoring RAX as it has it's own set of Xen problems22:58
fungiand internap22:58
fungiwe have lots of nodes there22:58
clarkbjohnsom: rax and inap are the other two clouds currently used for x86. Neither does nested virt22:58
clarkbbut they provide a majority of resources iirc22:59
fungier, right, they're inap now not internap22:59
johnsomHmm, internap did at one point22:59
clarkbjohnsom: we don't put the nested-virt label there as we don't get the same attention to debugging the nested virt problems22:59
fungipossible we just don't create a special node type there to add it22:59
clarkbso even if it is enabled we won't put the special label there22:59
johnsomAdmittedly, this sampling is very small. It is all from just one patch and not the normal day-to-day22:59
fungimgagne may be able to suggest someone who can help with more low-level investigation of nested virt issues there, but he's not in here at the moment23:00
clarkbI think it is worth investigating further if the amd cpus are having trouble with pip solving and/or nested virt23:00
johnsomI am 90% sure we used to just "turn it on" there in the past, before the nodeset existed.23:00
clarkbthe pip install timing on those is reall weird23:01
fungibut if we want to consider exposing a nested-virt label for inap i agree that would be a prerequisite23:01
clarkbjohnsom: right but that isn't how we are exposing the label23:01
johnsomYeah, I know23:01
clarkbjohnsom: for the label we've gotten those clouds to minimally buy into helping debug things when we can attribute them to nested virt23:01
fungiclarkb: oh, speaking of clouds, i did get the openedge cleanup pushed under topic:openedge23:02
clarkbfungi: do you know if pip solving slowness looks like https://zuul.opendev.org/t/openstack/build/d35cc616da1744e98c2d5b081866d541/log/job-output.txt#6209-6211 ?23:02
clarkbthe reason I don't really suspect network slowness is that after the first package everything else is much quicker23:03
clarkbI kind of expect an oddity in how pip logs things, where the solving just isn't logged and you jump ahead a minute later with the downloads starting, but I'm not sure23:03
clarkbfungi: is there an order to those openedge cleanup changes?23:04
fungidns depends on system-config which depends on the others23:04
fungibase-jobs and project-config can merge first23:05
fungisudo -H LC_ALL=en_US.UTF-8 SETUPTOOLS_USE_DISTUTILS=stdlib http_proxy= https_proxy= no_proxy= PIP_FIND_LINKS= SETUPTOOLS_SYS_PATH_TECHNIQUE=rewrite python3.8 -m pip install -c /opt/stack/requirements/upper-constraints.txt etcd3gw23:05
fungii guess that would be worth testing23:05
clarkb(a gerrit plugin showing depends on chains in list of changes would be neat but probably difficult to do in a way that performance isn't terrible since the gerritdb knows nothing about depends on)23:05
fungiclarkb: i agree it could be dep solver slowness if the pip version being used is new enough to have it23:06
clarkbfungi: I think devstack upgrades pip very early23:06
clarkbbut not sure23:07
fungilike 4175 claims pip 20.0.223:08
clarkbdep solver is 21?23:09
fungi20.0.3 i think23:10
clarkbI wonder if devstack pinned pip pre-solver, but that would also rule out that theory23:10
clarkbmaybe we should boot one of those vexxhost nodes and profile it?23:11
clarkbianw: ^ possibly related to gerrit things23:11
clarkbit definitely seems like it just goes out to lunch every time it needs to install something23:11
clarkbbut then catches up after the first dep is pulled23:11
ianwi'm seeing ipv6 weirdness, many sites unavailable.  so possibly it gives up on something and falls back?23:13
fungiclarkb: sorry, 20.323:13
fungibut yes, not new enough to be the new solver23:13
fungiclarkb: the log looks like it's using distro python version from focal23:14
clarkbianw: oh that is a good theory, ya it could be that23:15
clarkbianw: and then it remembers to use ipv4 for everything subsequent23:15
fungi"pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)"23:15
openstackgerritMerged opendev/base-jobs master: Clean up OpenEdge configuration  https://review.opendev.org/c/opendev/base-jobs/+/78398923:15
* clarkb updates the etherpad23:16
fungiand yes, ipv6 connection timeout could explain the long delay23:17
clarkband why it is so consistent23:17
clarkbof about a minute exactly23:17
fungithat would definitely make setup take a long time given how many different pip install commands devstack likes to break up into23:18
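(A hedged way to test the ipv6-timeout theory from one of those nodes: compare connect times to pypi over each address family. The curl options used here are standard; the url is just a convenient index page.)

    for flag in -4 -6; do
      curl $flag -s -o /dev/null --max-time 90 \
        -w "pypi $flag: connect %{time_connect}s total %{time_total}s\n" \
        https://pypi.org/simple/etcd3gw/
    done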
ianwi'm tracking things somewhat between #vexxhost channel and https://etherpad.opendev.org/p/gerrit-upgrade-202123:19
openstackgerritMerged openstack/project-config master: Clean up OpenEdge configuration  https://review.opendev.org/c/openstack/project-config/+/78399023:22
clarkbfungi: I +2'd ^ the changes in the stack but didn't approve the later two as I can't watch the big inventory change go in23:23
clarkbI'm going to need to sort out dinner and enjoy this 70F march day shortly23:23
fungigo enjoy it, was a great day here too. had the windows open all day23:24
*** tosky has quit IRC23:30
TheJuliasigh, 3rd retry on multiple jobs :(23:34
corvusTheJulia: i'm sorry :(23:53
TheJuliac'est la vie23:53
TheJuliaAll I can do is wait it out23:53
