Wednesday, 2021-10-13

opendevreviewShnaidman Sagi (Sergey) proposed openstack/diskimage-builder master: Improve DIB for building CentOS 9 stream  https://review.opendev.org/c/openstack/diskimage-builder/+/80681900:22
ianwclarkb / fungi : I have put in a draft note at the end of https://etherpad.opendev.org/p/gerrit-upgrade-3.3 about dashboards and attention sets as discussed in the meeting.  please feel free to edit and send as you see fit 02:53
Clark[m]ianw: I made two small edits but lgtm if you want to send it02:58
ianwClark[m]: were you thinking openstack-discuss or just service-discuss?03:03
Clark[m]service-discuss. More people seem to be watching our lists and getting it out there will hopefully percolate through the places03:05
ianwwill do.  going to take a quick walk before rain and will come back to it :)03:06
*** ysandeep|out is now known as ysandeep04:04
*** ykarel|away is now known as ykarel04:32
opendevreviewIan Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream  https://review.opendev.org/c/openstack/diskimage-builder/+/80681904:53
ianwsshnaidm: i have updated the change with the testing we should be doing, and tried to explain more clearly what's going on in https://review.opendev.org/c/openstack/diskimage-builder/+/806819/comment/87f51fd0_0ee1505c/05:04
ianwhopefully that can get us on the same page05:05
*** bhagyashris is now known as bhagyashris|rover05:21
opendevreviewIan Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream  https://review.opendev.org/c/openstack/diskimage-builder/+/80681906:49
opendevreviewBhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages  https://review.opendev.org/c/zuul/zuul-jobs/+/81374906:50
opendevreviewBhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages  https://review.opendev.org/c/zuul/zuul-jobs/+/81374906:59
opendevreviewBhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages  https://review.opendev.org/c/zuul/zuul-jobs/+/81374907:09
opendevreviewDong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size  https://review.opendev.org/c/zuul/zuul-jobs/+/81303407:18
opendevreviewBhagyashri Shewale proposed zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages  https://review.opendev.org/c/zuul/zuul-jobs/+/81374907:25
*** jpena|off is now known as jpena07:29
louroto/ has anyone got a moment for https://github.com/openstack-charmers/test-share/pull/21 ? thanks!07:45
lourotwrong channel, sorry07:46
fricklerinfra-root: lots of post_failures. I've heard rumors of OVH having issues, but can't dig right now. might be log uploads failing08:05
ianwyep, OVH08:20
ianwWARNING:keystoneauth.identity.generic.base:Failed to discover08:20
ianwavailable identity versions when contacting https://auth.cloud.ovh.net/.08:20
opendevreviewIan Wienand proposed opendev/base-jobs master: Disable log upload to OVH  https://review.opendev.org/c/opendev/base-jobs/+/81378008:24
ianwstatus page gives "Active issue"08:24
ianwhttps://status.us.ovhcloud.com/08:25
fricklerdidn't we use to have a third log provider? maybe we should actively try to get some more redundancy again08:28
ttxLooks like OVH is having network issues right now08:29
ianwfrickler: think we should fast merge it?08:29
ttxI was disconnected for an hour, just came back08:29
fricklerianw: actually I think the indentation might be broken with your patch?08:29
ttx(my bouncer is on a OVH node)08:29
ianwfrickler: the lines are commented, i think the other lines remain the same?08:30
fricklerianw: but don't comments need to match the indentation of their surroundings?08:31
ianwi don't believe so, but if you'd prefer i can delete the lines and we can just put a revert in08:32
frickleranyway, it seems to be working again just now, at least accessing the cloud from bridge is now returning things again08:32
ianwyeah, i can get to auth.cloud.ovh.net:5000 too08:33
opendevreviewIan Wienand proposed opendev/base-jobs master: Disable log upload to OVH  https://review.opendev.org/c/opendev/base-jobs/+/81378008:33
opendevreviewIan Wienand proposed opendev/base-jobs master: Revert "Disable log upload to OVH"  https://review.opendev.org/c/opendev/base-jobs/+/81378308:33
ianwthere's the stack if we have issues08:34
ianwi'm afraid i'm rapidly reaching burnout point here08:34
fricklerianw: np, thanks for your help, I can check from time to time and see if it stays stable08:35
*** arxcruz is now known as arxcruz|rover08:48
opendevreviewyatin proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB  https://review.opendev.org/c/openstack/diskimage-builder/+/81139209:22
*** ykarel is now known as ykarel|lunch09:23
*** ysandeep is now known as ysandeep|mtg09:25
fricklerhmm, still seeing failures, checking logs09:48
fricklerI expected to see errors in the executor logs, but can't find anything there. also zuul didn't vote on 813780 but I'm also not seeing any job for that09:57
fricklerand while looking for strange things, https://review.opendev.org/800445 seems to be stuck in check for 44h09:58
frickleralso some tobiko periodic jobs for even longer10:10
fricklerI also don't see any current POST_FAILURES, so will leave the upload config as is for now10:15
opendevreviewShnaidman Sagi (Sergey) proposed zuul/zuul-jobs master: Include podman installation with molecule  https://review.opendev.org/c/zuul/zuul-jobs/+/80347110:18
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Changing no of days for query from 14 to 7  https://review.opendev.org/c/opendev/elastic-recheck/+/81379510:27
*** ykarel|lunch is now known as ykarel10:37
*** ysandeep|mtg is now known as ysandeep10:57
*** dviroel|out is now known as dviroel11:17
*** jpena is now known as jpena|lunch11:24
*** ysandeep is now known as ysandeep|afk11:31
*** ysandeep|afk is now known as ysandeep12:01
*** jpena|lunch is now known as jpena12:24
*** ykarel__ is now known as ykarel12:37
ysandeepFolks o/ Is there a way to set the hashtag(s) via the CLI? For example, we can pass the topic with -t <topic> to git-review13:06
*** dviroel is now known as dviroel|rover13:33
*** arxcruz|rover is now known as arxcruz13:37
*** bhagyashris|rover is now known as bhagyashris13:39
fungiysandeep: feel free to push up a change implementing that in git-review, though you can probably also do it as a second command straight to gerrit's ssh api... checking the documentation13:40
fungiysandeep: i'm not finding it, looks like they didn't implement any controls for hashtags in the ssh cli, at least not yet13:47
ysandeepfungi: ack, no worries, thank you for checking 13:48
fungiysandeep: looks like it could be set at push, similar to how git-review does topics at push: https://review.opendev.org/Documentation/user-upload.html#hashtag13:51
Clark[m]They are part of the push ref options13:51
fungiyeah, it's just that the ssh cli also has a set-topic command, so i was hoping there might be a similar set-hashtag13:52
* ysandeep checking documentation 13:56
fungiyeah, in git_review/cmd.py you could probably just extend the command line options with one for hashtags and then append to push_options like happens with --topic13:58
fungioh, interesting, it looks like you can only set one hashtag at push time, not a list of them13:58
ysandeepfungi, thanks i was able to set hashtag with push ref options14:09
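(For reference, the push-ref form ysandeep used presumably looks something like the following, per the Gerrit user-upload documentation fungi linked above; "gerrit", "master" and the hashtag value are placeholders, and per fungi's note only a single hashtag can be set this way:)

```sh
# Upload a change with a hashtag set as a Gerrit push option
git push gerrit HEAD:refs/for/master%hashtag=my-hashtag

# Push options can be combined, e.g. topic plus hashtag in one upload
git push gerrit HEAD:refs/for/master%topic=my-topic,hashtag=my-hashtag
```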
ysandeepfungi, I will give a shot implementing that in git-review 14:12
Tenguysandeep: cool! thanks :)14:13
Tengufungi: thanks as well :)14:13
fungiysandeep: feel free to ping me here when you push up the git-review feature and i'll be happy to review it14:21
ysandeepfungi++ thanks! I will try to implement that as my weekend python project.. So will probably ping you in week after PTG14:23
fungiwhenever, have fun!14:23
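(A hypothetical, self-contained sketch of the git-review feature fungi describes above: add a --hashtag option and forward it as a push option, mirroring how --topic is handled. None of the names below are taken from the real git_review/cmd.py; they are illustrative only:)

```python
import argparse

# Hypothetical sketch: accept --hashtag and build it into the push refspec
# the same way git-review forwards --topic.
parser = argparse.ArgumentParser()
parser.add_argument("-t", "--topic", default=None)
parser.add_argument("--hashtag", default=None)
options = parser.parse_args(["--hashtag", "my-feature"])

push_options = []
if options.topic:
    push_options.append("topic=%s" % options.topic)
if options.hashtag:
    push_options.append("hashtag=%s" % options.hashtag)

refspec = "HEAD:refs/for/master"
if push_options:
    refspec += "%" + ",".join(push_options)
print(refspec)  # HEAD:refs/for/master%hashtag=my-feature
```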
opendevreviewAnanya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream  https://review.opendev.org/c/opendev/elastic-recheck/+/81325014:23
johnsomFYI, the ptg page is down. I'm getting a cloudflare error 523 "origin is unreachable" page going to openstack.org/ptg page.14:36
fungijohnsom: yes, there's some network incident in vexxhost impacting some systems there, but ptg.opendev.org is still up14:39
johnsomAh, bummer, I wish them luck!14:40
fungii'm sure they'll have it cracked shortly14:40
*** ysandeep is now known as ysandeep|dinner14:55
*** ykarel is now known as ykarel|away15:12
clarkbI guess the OVH stuff corrected itself before we had to worry about landing and then reverting any changes15:26
fungiseems so15:27
clarkbapparently I'm somehow still identified with oftc too. Neat15:29
fungidid you set up cert auth?15:29
fungii never have to identify on reconnect15:29
fungiit's just done as part of the tls setup with the client key15:30
clarkbI don't think I did based on my nickserv status15:30
clarkbbut also it seems to have been magically handled by weechat? so meh?15:31
*** marios is now known as marios|out15:33
opendevreviewAnanya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream  https://review.opendev.org/c/opendev/elastic-recheck/+/81325015:48
clarkbfungi: thinking out loud here: should we hold off on https://review.opendev.org/c/opendev/system-config/+/813534/ and children until after renaming is done, just to avoid any issues with config when gerrit starts up again in that process? Or go for it and maybe restart gerrit today/tomorrow?15:49
clarkbsimilar question for https://review.opendev.org/c/opendev/system-config/+/81371615:49
*** ysandeep|dinner is now known as ysandeep16:10
clarkbfungi: https://review.opendev.org/c/opendev/gerritlib/+/813710 is a super easy review too (adds python39 testing to gerritlib as we run jeepyb which uses gerritlib on python39 now)16:11
fungijohnsom: problem seems to have been fixed, if you needed to get to the site for something16:13
johnsomfungi Thank you!16:13
fungiclarkb: i like the quick restart later today or tomorrow idea, just to make sure we're as prepped as we can be16:14
gthiemongeHey Folks, one of my patches is stuck in zuul https://zuul.openstack.org/status#69845016:14
gthiemongeis there any ways to kill it?16:14
fungialso i just realized i booked an appointment for a vehicle inspection friday after the openstack release team meeting, but i should be back well before we're starting the rename maintenance16:15
clarkbfungi: in that case feel free to carefully review and approve those changes I guess :)16:16
clarkbgthiemonge: hrm we should probably inspect why that happened first16:16
clarkbcorvus: ^ fyi there are two changes in openstack's check pipeline that have gotten stuck. Likely due to lost node requests? I feel like that is what happened before. I'll start trying to find logs for them16:17
gthiemongeoctavia-v2-dsvm-scenario-centos-8-stream is still queued, and my patch updates this job16:17
fungiclarkb: frickler noted some stuck changes earlier16:17
clarkbfungi: I'm guessing it is these changes since they are old enough to have been seen back when frickler's work day was happening16:18
fungi800445,16 has an openstack-tox-py36 build queued for 50 hours and counting16:18
fungialso there's a few periodic jobs waiting since days16:19
clarkb810f631ad5494c9ba7bc892d1c3f430f is the event associated with that enqueue I think16:21
clarkbthere are two other events for child changes16:21
clarkb2021-10-13 07:41:31,801 DEBUG zuul.Pipeline.openstack.check: [e: 810f631ad5494c9ba7bc892d1c3f430f] Adding node request <NodeRequest 300-0015752422 ['nested-virt-centos-8-stream']> for job octavia-v2-dsvm-scenario-centos-8-stream to item <QueueItem 7f41db5ea67847b887881460f0b7b2b5 for <Change 0x7f62645d9e80 openstack/octavia-tempest-plugin 698450,19> in check> <- is the last thing the16:22
clarkbscheduler logs for that job on that event16:22
clarkbnow to hunt down that node request16:22
clarkbnested-virt-centos-8-stream <- the job uses a special node type...16:23
fungii notice all the long-queued builds in periodic are for fedora16:24
fungiso we might be looking at multiple causes16:24
clarkbya gthiemonge's issue is caused because only ovh can build nested-virt-centos-8-stream and that request happened during ovh's outage. I think the reason we haven't node failured is we must've leaked launcher registrations in zookeeper again so nodepool thinks there are other providers still to decline that request16:27
clarkblet me check on the registrations really quickly, but in gthiemonge's case I think the easiest thing is to abandon/restore and push a new patchset16:27
clarkbor wait, I think if I restart the node with the registrations it will notice and node failure it16:28
fungiwe seem to not be unable to launch fedora-34 nodes, but i have a feeling something similar befell the three periodic builds which wanted one16:28
clarkbhrm I don't see any extra registrations16:29
fungisimilarly that openstack-tox-py36 job would have wanted an ubuntu-bionic node and we're probably booting fewer of them these days16:29
fungistatistically speaking, as none of the 5 stuck examples use our most popular node label, it's possible they're all representatives of a similar problem16:30
*** jpena is now known as jpena|off16:30
clarkbthe linaro provider has not declined that request yet16:32
clarkbit cannot provide that node type so it should decline it. Now to look at why it hasn't yet16:33
clarkblinaro reports not enough quota to satisfy request which will cause it to enter pause mode and not decline requests.16:36
clarkbI think that may be starving its ability to get through and decline requests it cannot satisfy at all due to being the wrong label type16:36
clarkbthere are leaked instances in that cloud which I am trying to clean up now. We'll see what happens16:37
clarkbfungi: the tobiko changes have been pathological for a while. I suspect some weird configuration issue as they have a ton of errors iirc16:40
clarkbfungi: neutron is probably related to whatever is causing fedora issues16:40
fungii looked at a few and it was suds-jurko failing to build16:40
clarkbfungi: but that wouldn't cause them to be stuck in zuul?16:40
fungino, talking about your suggestion that the tobiko jobs have been pathologically going into retry_limit16:41
clarkbgthiemonge: I think your best bet may be to push a new patchset or abandon and restore the current patchset. The issue is nodepool isn't declining the request because it can't get to those requests because a cloud is failing very early :/16:41
clarkbfungi: ah16:41
clarkbI'm going to write an email to kevinz about cleaning up these leaked instances in the linaro cloud now16:42
fungithe fedora-34 situation is a little odd too. there's one in airship-kna1 which has been in a "ready" state for more than 3 days16:44
fungiit should have gotten assigned to a build long before now16:44
clarkblinaro email sent16:45
clarkbI think that we can set the linaro quota to 0 if this persists and we notice more stuck changes due to it16:46
fungii wonder if there's a good way to determine why nl02 hasn't assigned 0026879768 to a node request yet16:46
clarkbfungi: that cloud is also probably near or at its quota most of the time so it has a hard time filing through requests16:48
fungioh, maybe16:48
clarkbnodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 149.202.161.123 on port 22 <- seems to be the general issue here16:48
clarkbwith fedora-34 launches I mean16:48
clarkbwe'll probably need to launch one out of band and inspect it. /me starts trying to do that16:48
fungiyeah, but 0026879768 has been in a ready state for days there16:49
fungiand grepping the debug log, the last mention was when it came ready and was unlocked (2021-10-10 08:52:00,927)16:49
clarkbfungi: yes, but if the airship cloud is perpetually paused it won't ever get a chance to scan all the requests and find the few fedora requests to assign that node16:50
clarkbthe process here is: when the provider is not doing an action, it proceeds to grab the next request, locks it, checks quota; if at quota it pauses, and when no longer at quota it attempts to launch the node.16:51
clarkbit can shortcut that if it already has the node but that depends on it finding a random request for a fedora-34 node16:51
gthiemongeclarkb: ok thanks, i'll try16:51
fungigot it, so the problem is that we want the pause to only pause cloud provider api interactions, not everything16:52
fungipausing declining nodes, or assigning nodes which are available and ready, results in a deadlock16:53
fungibut i guess nodepool performs a server list, which is a cloud provider api call16:53
fungiso is that where it's blocking?16:54
clarkbit does an internal wait for a node to be deleted iirc as it knows there will be free quota after that16:54
opendevreviewMerged opendev/gerritlib master: Add python39 testing  https://review.opendev.org/c/opendev/gerritlib/+/81371016:54
clarkbfungi: and ya it seems like we could have it continue to process and decline things it has no hope of ever fulfilling as well as fulfilling things it already has resource allocated to like the fedora-34 node16:55
clarkbclarkb-test-fedora-34 is booting in ovh bhs1. that was a region I noticed fedora-34 boot problems in16:56
clarkbfungi: I think it is failing in rax too, otherwise I would blame potentially bad uploads due to ovh's network problems16:57
clarkbthe byte count for this image seems to match up with what we have on nb02 at least16:58
clarkbthis is spicy the console log is just one giant kernel panic16:59
fungii'm looking at a build which succeeded on a fedora-34 node in airship-citycloud yesterday, so apparently we booted one there for it even though there was one ready for a couple days17:00
fungiin the same provider17:00
clarkbfungi: were they different pools?17:00
clarkbanyway I think fedora-34 is completely hosed based on the console log on the test instance i booted in ovh17:00
fungiprovider: airship-kna117:00
clarkbI'm going to try and boot in other clouds and see if we get different results17:01
clarkbfungi: ya but that provider has multiple pools. I don't think it will share across pools17:01
fungiso same pool17:01
clarkbfungi: we can do multiple pools per provider now17:03
funginl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 logged for both nodes 0026879768 (the one that's been ready for days) and 0026901606 (the one booted yesterday for node request 300-0015737905)17:05
fungiwhy would the pool worker build a new node for an incoming node request when it already had one with the same label available?17:06
clarkbI do not know. That seems like a bug17:06
clarkbthe iweb boot doesn't seem to panic. That makes me think potentially corrupt image in ovh. However the iweb test node doesn't seem to allow me in via ssh either17:07
fungicould be cpu flags?17:07
fungior something hypervisor-related?17:07
clarkbfungi: well the image upload happened around ovh's crisis. I really suspect it is as simple as a bad image upload there17:09
clarkbon the iweb side of things it appears that we only configure the lo interface with glean?17:09
clarkbat least I don't see anything logged for other interfaces + glean in the console log17:09
clarkbI think we should consider pausing fedora-34 image builds, then delete today's build17:10
clarkband then hopefully those that understand fedora can look into why glean + network manager seem to be unhappy with it17:10
opendevreviewMerged zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages  https://review.opendev.org/c/zuul/zuul-jobs/+/81374917:11
opendevreviewClark Boylan proposed openstack/project-config master: Pause fedora-34 to debug network problems  https://review.opendev.org/c/openstack/project-config/+/81387617:12
clarkbI'm going to boot a test on yesterday's image to see if it acts different17:12
fungii guess it's just provider launches we can pause from the cli? i always forget17:13
clarkbI didn't realize there is a command line option. I'll check after I test yesterday's image17:14
fungii'm looking it up in the docs now17:14
clarkbhrm yesterday's image may be no better17:15
clarkbin which case we're in a more roll forward state17:15
fungihttps://zuul-ci.org/docs/nodepool/operation.html#command-line-tools "image-pause: pause an image"17:16
clarkbya I think rolling back to the previous image isn't going to help us17:16
opendevreviewMerged opendev/system-config master: Replace testing group vars with host vars for review02  https://review.opendev.org/c/opendev/system-config/+/81353417:16
clarkbfungi: https://review.opendev.org/c/zuul/zuul-jobs/+/813749 that merged which needed a fedora-34 node. So there must be working fedora-34 somewhere /em looks17:19
clarkbha that ran in airship. did it use your old node?17:20
clarkbfedora-34-airship-kna1-0026921358 was the hostname17:20
clarkb"Failed to start Network Manager Wait Online" then "See 'systemctl status NetworkManager-wait-online.service' for details." on the iweb images17:21
clarkbovh kernel panics but could just be a bad image17:21
fungiairship-kna1 for the check pipeline build too17:22
fungiso it's like we're only getting fedora-34 nodes in airship-kna1, which also has a fedora-34 ready node it's been ignoring for days17:23
clarkbmaybe its image is just old17:23
fungii wonder if image uploads have been failing there and it's got an old...17:23
fungiyeah17:23
clarkbdoesn't appear to be old17:24
clarkbit could be luck that whatever is causing NM + glean to fail in iweb isn't an issue in airship17:24
clarkbI am going to try and delete the image in ovh that is panicking to force a reupload17:24
clarkbthen maybe we'll see ovh do what iweb is doing or function like airship17:24
fungiyeah, airship is currently using image 7900 uploaded 13.5 hours ago17:25
fungier, airship-kna1 is17:25
fungimaybe network setup in citycloud is different than everywhere else we try to boot fedora-34?17:26
fungidifferent virtual interface type which the f34 image's kernel is actually recognizing?17:27
clarkbI think it uses dhcp like many clouds. rax is static17:27
clarkbya that might explain it17:28
clarkbhttps://bodhi.fedoraproject.org/updates/FEDORA-2021-ffda3d6fa1 is a recent update and they already have https://bodhi.fedoraproject.org/updates/FEDORA-2021-385f3aebfd proposed too17:29
fungiens3 is the detected interface on the successful builds there17:29
fungialso citycloud is using rfc-1918 addressing with floating-ip for global access17:30
clarkbthis is curious: booting the previous image in ovh produces no console log. But I also cannot ssh in17:31
fungiand no global ipv617:31
clarkbat least it isn't kernel panicking?17:31
clarkbI'm going to clean up my test instances in ovh and iweb and boot some on rax and vexxhost and see if they are any different17:34
clarkbvexxhost can boot the fedora-34 image successfully too, but we don't launch the fedora image there because we only do the special larger instances in vexxhost now17:45
opendevreviewMerged opendev/system-config master: Switch test gerrit hostname to review99.opendev.org  https://review.opendev.org/c/opendev/system-config/+/81367117:47
fungiso something about the image works in vexxhost and citycloud, is unreachable in iweb, crashes during boot in ovh...17:49
clarkbfungi: yes though the crashes during boot in ovh may be unrelated and a result of us trying to upload images there during their network crisis17:50
clarkbrax test node also appears to be sad. This is noteworthy because rax uses static configuration and not dhcp. Implies the issue is independent of dhcp17:51
fungi"sad" as in boots but is unreachable over the network, or crashing like in ovh?17:51
clarkbunreachable over network. They don't support cli console logs so I didn't bother to check that17:52
clarkbit is possible that it crashes but that requires me to dig out credentials and do more work, but I need to context switch to other stuff17:52
fungiyeah, you can console url show and then stick that in a browser17:52
clarkboh neat17:52
fungishouldn't need credentials, the url is just meant to be unguessable17:52
clarkbok let me relaunch in rax and see what it says17:53
fungithat's been my fallback on the providers who don't implement console log show17:53
fungiannoying, but better than nothing17:53
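(The fallback fungi describes amounts to roughly the following; the server name is a placeholder for whichever test instance is being debugged:)

```sh
# For providers without "openstack console log show" support, ask for a
# console URL and open it in a browser to watch the boot.
openstack console url show clarkb-test-fedora-34
```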
clarkbbut I think we're fast approaching the bit where I say "people who care about this platform and understand it should really take a look" because I'm still arguing we should delete all fedora and use stream which seems to be a fair bit more stable for our purposes17:54
fungiif memory serves, ianw did a fair bit of work on the f34 networking stuff, so may have a better idea of where we should be looking for root cause17:55
clarkbya my hunch is something to do with ordering of services. Like udev isn't finding the device properly before we run glean or similar17:55
clarkbIt wouldn't be so problematic if fedora didn't update so frequently with so many big explosions :)17:56
clarkbbasically the reason we don't do intermediate ubuntu releases17:56
clarkbI'm going to abandon the f34 pause change since that won't help17:56
fungii suppose we could temporarily add f34 labels to vexxhost and remove them from everywhere else except citycloud17:57
clarkbya the risk there is the flavor there is huge so the fedora jobs might end up needing that memory17:58
clarkbbut if it is a short lived change the risk of that should be low17:58
fungialso not sure if i should delete this f34 ready node in citycloud which nodepool seems to just be ignoring and wasting quota with17:58
clarkbmight be worth keeping around if anyone has time to dig into why the launcher isn't using it but instead booting new nodes17:58
fungiyeah, also i have a feeling that if i do delete it, a new ready node will be booted and ignored instead17:59
clarkboh ya since it is the only cloud that can satisfy the min-ready of 1 for f34 right now17:59
clarkbeverything else will fail and eventually the airship cloud should get it18:00
clarkbfungi: the rax boot enters an emergency shell and I can't seem to get any scrollback to understand why that happens better18:09
clarkb"unknown key released" when I hit page up18:10
clarkbcertainly seems that something to do with fedora 34 is new or different and causing some clouds problems18:10
fungii wonder when this started18:11
clarkbI'll try to emergency boot this and see if the initramfs sos report is present18:11
clarkb(I kinda doubt it will be there because I don't think it is on persistent storage but doesn't hurt to check)18:12
fungiyeah, if it didn't get far enough to pivot from the initramfs to the real rootfs18:14
clarkbLE job failed for https://review.opendev.org/c/opendev/system-config/+/813534 so the jobs behind it didn't run18:17
*** ysandeep is now known as ysandeep|out18:19
clarkbfatal: unable to access 'https://github.com/Neilpang/acme.sh/': Failed to connect to github.com port 443: Connection timed out <- that repo redirects to https://github.com/acmesh-official/acme.sh now but is generally accessible. I guess this is just a random "internet is sad" occurrence18:19
clarkbbut this means that service-review didn't run after that change landed so not sure if we want to manually run it18:19
fungithough we could stand to update the url anyway, i guess18:20
fungialso retry downloads maybe18:20
clarkb++18:21
clarkbfor that ord f34 instance I can't get to the rescue instance18:21
clarkbdoes it rescue with its own image by default?18:21
fungiyes18:21
clarkbugh18:22
fungithat's probably not going to work18:22
clarkbya18:22
fungifor... reasons which should previously have been obvious to me, sorry18:22
clarkbI didn't expect rescuing to give us any new info anyway. I'll just unrescue and delete the instance18:22
clarkbpeople who have had trouble booting f34 VMs have pointed to https://fedoraproject.org/wiki/Changes/UnifyGrubConfig on the internets. I'm booting a new instance in vexxhost so that I can check its grub configs18:27
clarkbfungi: when the current set of deploy jobs finishes maybe we should run service-review manually to be sure there are no unexpected updates?18:28
opendevreviewJeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning  https://review.opendev.org/c/opendev/system-config/+/81388018:28
fungino idea if that's the way to do it, was reading random examples and trying to piece together from the docs18:28
clarkbfungi: I left a comment on it, close, but not quite18:29
fungithanks18:30
opendevreviewJeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning  https://review.opendev.org/c/opendev/system-config/+/81388018:32
clarkb[Wed Oct 13 18:29:05 2021] Unknown command line parameters: nofb BOOT_IMAGE=/boot/vmlinuz-5.14.10-200.fc34.x86_64 gfxpayload=text18:34
clarkbthe vexxhost node's dmesg reports that. I half wonder if it isn't able to find the kernel as a result on some system18:34
clarkbthough maybe not18:35
clarkbsince the kernel is already running at this point18:35
clarkband we're just telling the kernel about itself18:35
fungiyeah, it's got to be finding the kernel if it's into the initrd18:36
clarkb/boot/efi/EFI/fedora/ exists but is empty. We do all of our x86 images as grub images. I half wonder if the other clouds might be seeing the efi dir and attempting efi, failing due to the lack of an efi config and not falling back to grub?18:37
clarkbI don't know how that all works with openstack, nova, kvm, and qemu18:37
clarkbthe grub config and fstab and the device label all lgtm18:42
clarkbThe actual grub menu entry uses the device uuid not its label, but both the label in the grub /etc/default/grub config and the uuid in the /boot/grub2/grub.cfg menu entry match /dev/vda118:43
clarkbI don't think vexxhost's kvm had to do any magic to properly boot this18:43
clarkbok I really need to page out the f34 stuff. I'm going to delete the vexxhost test node as it didn't show me anything super useful other than "it should work". But then i need to do lunch and then we should manually run service-review.yaml, check that didn't make any unexpected changes to gerrit. Then test node exporter on trusty. Then prep stuff for the project renaming18:47
fungilooking into the gitea metadata automation, it's the gitea_create_repos.Gitea.update_gitea_project_settings() method we want to call, and that already takes a project as a posarg, we're calling it from a loop in the make_projects() method18:48
clarkbfungi: iirc there is a force flag18:48
clarkband if the project is new or the force flag is set then the metadata is updated18:48
fungithough looking closer at how we call it from ansible, it may be simpler to add a project filter as a library argument and filter it that way18:48
clarkbmaybe we can make the force flag a list of names to force?18:49
clarkbif force is not empty then if project in force type deal18:49
fungiyeah, we already parameterize that18:49
fungialways_update: "{{ gitea_always_update }}"18:49
fungiright now we just set it to true or let it default to false18:50
fungibut we could overload it as a trinary?18:50
fungior make it a regex18:50
clarkbwell you could set it to structured data like a list18:50
clarkbfalse/[] don't force update or [ foo/bar, bar/foo] force update those projects18:51
fungithough if we also still want a way to be able to force it to do all projects, we'd have to list thousands19:00
fungibut yeah, a trinary of falsey/[some,list]/true could fit and remain backward-compatible19:01
fungioh, though ansible seems to like these arguments to be declared with only one type19:13
fungioh, maybe not declaring a type in the AnsibleModule argument_spec is fine19:14
fungiwe don't seem to declare it for all of them19:15
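(A minimal sketch of the falsey/list/true "trinary" being discussed; this is an assumption about how change 813886 might treat the value, not the actual module code:)

```python
def should_update(project_name, always_update):
    """Decide whether to push gitea metadata for a project.

    True updates everything (the current force behaviour), a list updates
    only the named projects, and anything falsey skips the update.
    """
    if always_update is True:
        return True
    if isinstance(always_update, (list, tuple, set)):
        return project_name in always_update
    return bool(always_update)


# Examples:
print(should_update("opendev/system-config", True))                       # True
print(should_update("opendev/system-config", ["openinfra/openstackid"]))  # False
print(should_update("openinfra/openstackid", ["openinfra/openstackid"]))  # True
print(should_update("opendev/system-config", False))                      # False
```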
clarkbI've quickly consumed some food. fungi I'll start a root screen on bridge and run the service-review.yaml playbook?19:18
opendevreviewJeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list  https://review.opendev.org/c/opendev/system-config/+/81388619:19
fungiclarkb: sounds good, thanks19:19
fungialso there's a start on the metadata update project filtering, though i haven't touched the testing yet19:19
clarkbalright starting that playbook now19:20
fungii'm attached to the root screen19:20
clarkbreview02.opendev.org       : ok=66   changed=0    unreachable=0    failed=0    skipped=9    rescued=0    ignored=019:23
fungilooks good19:23
clarkbit looked as we hoped no changes19:23
clarkbyup, I think we're good, the var movement didn't cause any unexpected updates19:23
clarkbI'll go ahead and exit the screen?19:23
fungiyep, go ahead19:23
clarkbcool, I think we should still do a restart because we haven't done one since the quoting changes happened19:24
clarkbbut this is good news on its own19:24
fungii should be around for a gerrit restart later today if you want19:25
clarkbok, let's see where the day continues to go :) I am still planning to get the rename input file pushed and review the related changes and start an etherpad19:26
clarkboh and I wanted to test node exporter on wiki.19:26
clarkbfungi: for ^ it is basically `wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz` then confirm the sha256, then extract and run the binary to see that it starts and doesn't crash19:27
clarkbfungi: any objections for me doing that on wiki now?19:27
clarkbit all runs as my own user19:27
clarkbI went ahead and fetched it, checked the hash and extracted it since that is all pretty safe19:30
clarkbfungi: ^ I await your ACK before running the binary out of an abundance of caution19:31
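(Spelled out, the manual check clarkb describes amounts to roughly the following; run as an unprivileged user, nothing installed system-wide, and the checksum compared against the checksums file published with the release:)

```sh
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
sha256sum node_exporter-1.2.2.linux-amd64.tar.gz   # compare against the upstream release checksums
tar -xzf node_exporter-1.2.2.linux-amd64.tar.gz
./node_exporter-1.2.2.linux-amd64/node_exporter    # runs in the foreground, exporting metrics on port 9100 until killed
```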
clarkbfungi: re testing of the gitea project stuff you should be able to hack the existing system-config-run-gitea job for that  since it creates projects and then does another pass of them to ensure it noops (but in this case we could hack it to force updates for some projects)19:33
fungiclarkb: no objections19:34
clarkbcool it ran successfully. I ran it in the foreground and killed it. But if you want to double check it didn't fork into a daemon it was listening on port 9100 and process was called node_exporter19:37
clarkbI can mark that done and we should be good to land that spec tomorrow19:37
fungiyep, lgtm. nothing listening on 9100 (though it does have listeners on 9200 and 9300/tcp on the loopback)19:51
clarkbthose ports are for ES that is used to do text search19:54
clarkbthey are expected iirc19:54
fungiyep19:59
ianwclarkb: sorry, reading scrollback now20:38
clarkbianw: I don't think it is super urgent but it's super weird and going to be a pain to resolve I bet :/20:39
ianwthe kernel starting and things going blank could very well be a sign that root=/dev/... is missing, i have seen that before20:39
ianwthat said, i think it is passing in the devstack boot tests ... it should hit there too if that's it20:40
clarkbwell it works in citycloud and vexxhost20:40
clarkbwhich is why I suspect this is an odd one20:40
ianwhrm, yeah i have no immediate thoughts :/20:44
ianwbib just have to sort out some things20:45
opendevreviewAndrii Ostapenko proposed zuul/zuul-jobs master: Add retries for docker image upload to buildset registry  https://review.opendev.org/c/zuul/zuul-jobs/+/81389420:49
clarkbdo we know ^'s IRC nick?21:44
clarkbsimilar to goneri's related update we should make sure there aren't problems with the registry or local networking since that upload should always be local to the cloud21:45
clarkbIt's a huge warning flag to me that people are retrying those requests and points to an underlying issue that we should probably fix instead21:45
clarkbfungi: the openinfra renames are renaming projects like openstackid which need to be retired. I guess we retire them in the new name location?21:48
fungithere's a foundation profile associated with the gerrit account e-mail, but it doesn't have any irc field filled in21:48
fungiclarkb: yeah, retiring them in the new location is fine, also gets rid of the old namespace that way21:49
clarkbwell the old namespace will stick around for redirects but it empties it21:49
fungiright, that21:50
fungithe project list will no longer include the old namespace21:50
fungidoes https://zuul.opendev.org/t/openstack/build/a758b4b433b7433aa3574ebbd3d77c21 look to anyone else like our conftest is bitrotten?21:55
opendevreviewJeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list  https://review.opendev.org/c/opendev/system-config/+/81388621:58
opendevreviewJeremy Stanley proposed opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py  https://review.opendev.org/c/opendev/system-config/+/81390021:58
opendevreviewClark Boylan proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/  https://review.opendev.org/c/openstack/project-config/+/76578721:58
clarkbPushed that to resolve a conflict between two of the renaming changes21:59
fungithanks21:59
clarkbfungi: looks like pyyaml updated and we need to update to match?22:00
clarkbsafe flag?22:00
opendevreviewClark Boylan proposed opendev/project-config master: Record renames being performed on October 15, 2021  https://review.opendev.org/c/opendev/project-config/+/81390222:03
clarkband there is our input file and recording of the changes22:03
fungiclarkb: yeah, for now i updated the remaining call in that script to match the other one which was already using safe_load22:07
fungibut there are probably a bunch more which need changing22:07
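(The gist of the conftest fix, assuming the bitrot is the newer PyYAML behaviour where yaml.load() without an explicit Loader is no longer accepted; the filename here is a placeholder, not the file the real conftest reads:)

```python
import yaml

with open("hosts.yaml") as f:
    data = yaml.safe_load(f)  # was a bare yaml.load(f), which newer PyYAML rejects without a Loader

# The explicit equivalent, if spelling out the loader is ever preferred:
# data = yaml.load(f, Loader=yaml.SafeLoader)
```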
corvusi'd like to restart zuul now.  any objections?22:17
clarkbcorvus: looking22:20
fungishould we time the gerrit restart to coincide?22:20
clarkbfungi: if you'd like. I don't see any release activity but will warn the release team. The tripleo team may appreciate waiting that 14 minutes to see if those changes at the top of their queue end up merging22:21
clarkbI've warned the release team22:21
corvusi can afk for 20 minutes and try again if you like22:21
fungii can handle the gerrit restart in the middle of the zuul down/up22:22
clarkbcorvus: considering how long their changes can take that might be a good thing22:22
clarkbjust to avoid another set of 4 hour round trips for each of them22:22
clarkbinfra-root I've put https://etherpad.opendev.org/p/project-renames-2021-10-15 together for the rename on Friday22:22
corvusclarkb: okay.  my own thought is that the last time we waited 5 minutes it took an hour.  there's no good time and therefore no bad time to restart.22:23
clarkbcorvus: fair enough, I'm happy to proceed now too22:23
corvusbut it's fine.  i have something else to do that takes 20m so it's no big deal to me.22:23
clarkbI'll let you decide if you'd rather do it now or in 20 minutes :)22:23
clarkbI'll be around for both22:23
corvuslet's come back in 20m.  (mostly just don't want to establish too much of a precedent :)22:24
clarkbfungi: re gerrit restart the big thing it will be checking is the gerrit.config quoting updates22:24
fungiyep22:25
fungiif i need to hand edit the config to get it to restart, i can do that really quickly too22:25
clarkbfungi: /home/gerrit2/tmp/clarkb/gerrit.config.20211013.pre-group-mangling <- is a copy I made of that file on review02 earlier today when those other changes were merging22:25
clarkbwe can use that to compare delta post restart22:25
fungistatus notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again22:40
fungithat work?22:40
clarkblgtm22:40
fungii have a root screen session started on the gerrit server in case we need to coordinate anything there, and the docker-compose down command is queued22:41
clarkbI'll join it22:41
fungithat tripleo job has been uploading logs for almost 5 minutes, so should end any time hopefully22:43
clarkbbut also we gave it a chance we can proceed when ready I think22:44
fungiyep22:44
clarkbcorvus: you'll do a stop, then we can restart gerrit, then a start?22:44
fungithat's how we did it last time, at least22:45
fungioh, the test job wrapped up, now the paused registry job is closing out22:47
corvussounds good22:47
corvusare we waiting still, or calling it good enough?22:48
clarkbI'm happy calling it good enough. We gave it a real chance.22:48
fungigood enough22:48
fungi#status notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again22:49
opendevstatusfungi: sending notice22:49
corvusokay.  i'm re-pulling images to get ianw's 400 change22:49
-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again22:49
corvuswill take just a few seconds longer22:49
corvusstopping zuul22:50
corvusfungi: you can proceed with gerrit restart22:50
fungidowning gerrit22:50
fungiupping22:51
corvuswaiting for signal from fungi to start zuul22:51
fungiwebui is loading for me22:51
clarkbyup loads for me too and the config diff is empty22:52
fungi[2021-10-13T22:51:33.954Z] [main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.3.6-44-g48c065f8b3-dirty ready22:52
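(For the record, the restart fungi performed amounts to roughly the following; the compose directory is an assumption about where review02 keeps its docker-compose.yaml, not a confirmed path:)

```sh
cd /etc/gerrit-compose     # assumed location of the gerrit docker-compose.yaml
docker-compose down        # stop gerrit
docker-compose up -d       # start it again, detached
docker-compose logs -f     # watch for the "Gerrit Code Review ... ready" line quoted above
```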
fungicorvus: all clear22:52
clarkb++22:52
corvusstarting zuul22:53
clarkbfungi: I detached from the screen. feel free to close it whenever you like22:53
fungithanks, don22:53
fungie22:53
clarkbits ok you can call me Don22:54
corvusjust don call me shirley22:54
fungii picked the wrong day to stop sniffing glue22:55
clarkbwhat continues to try and get a kata-containers tenant? we removed it right? Maybe the cronjobs to dump queues?22:58
clarkbThat's a not-today question I think22:58
opendevreviewIan Wienand proposed opendev/system-config master: ptgbot: have apache cache backend  https://review.opendev.org/c/opendev/system-config/+/81391023:01
ianwfungi: ^ i'd probably consider you domain expert in that -- i'd not really intended the little static server to be demand-facing, so having apache cache it would be good for reliability i think23:01
fungioh, yep23:03
corvusre-enqueing23:05
corvus#status log restarted all of zuul on commit 3066cbf9d60749ff74c1b1519e464f31f213211423:05
opendevstatuscorvus: finished logging23:05
clarkband in an hour we should see the znode count fall again?23:07
clarkbI think we expect it in the 80-90k range?23:07
corvusyeah.  it's hard for me to say if 110 might be okay though -- so i probably wouldn't assume we have a leak until it's over 120k sustained.23:10
clarkbcorvus: a zuul/zuul change showed up in the openstack tenant release-approval pipeline briefly. I'm surprised we evaluate zuul changes in openstack at all?23:11
corvusit's in the projects list23:12
clarkbhuh I didn't expect that but that is expected behavior then23:12
clarkboh i bet it is there for the zuul_return testing in system-config/project-config/etc?23:12
clarkbwe might be able to clean that up now as zuul_return has a mock or something now iirc23:12
corvusi think it may have been to try to trigger opendev deployments on zuul changes or something.23:13
corvusnot sure if currently used23:13
corvusbut it looks like jobs are loaded too, so may be some job inheritance going on23:13
corvusre-enqueue complete23:15
clarkbfungi: I approved the safe_load fix23:18
clarkbianw: if we ignore the kernel panic in ovh because maybe that was due to their outage coinciding with our upload we're left with two failure modes. The rax emergency initramfs shell and the iweb failure to get network23:24
clarkbIt might be easier to debug the iweb failure case first? like maybe do a build that hard-codes dhcp without glean or something and see if that boots and work back from that23:24
clarkband maybe if we get lucky fixing that will give us clues to fixing the rax problem23:25
ianwclarkb: yeah, i think all will be revealed if we can get a serial output23:28
clarkbfungi: re the gitea metadata. I'm thinking we can just do a copy of playbooks/sync-gitea-projects.yaml but then replace the gitea_always_update var with our list and then be good? You should be able to test this by calling that copy of sync-gitea-projects.yaml in the system-config-run-gitea job23:30
clarkbthat job runs playbooks/test-gitea.yaml <- should be easy to run the playbook from there?23:30
clarkbnote the import_playbook in test-gitea.yaml you should be able to run sync-gitea-projects that way23:31
fungiyup, will give it a shot tomorrow between meetings23:33
ianwclarkb: sorry my attention is slightly divided, i'm just trying to see if we can get these 9-stream image-based builds testing in 80681923:35
clarkbianw: ya no worries, I don't think this is urgent yet. Worst case we can add f34 to vexxhost like fungi suggested and that will give us enough capacity for the label to limp along while we debug further23:36
clarkbianw: https://zuul.opendev.org/t/openstack/build/e336bb93987042a18a5acc44fb818b1e/log/logs/centos_8-build-succeeds.FAIL.log#933-935 that seems odd considering the other centos builds succeeded in that job. Has epel already started removing centos 8 stuff?23:41
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] testing centos 8 image builds  https://review.opendev.org/c/openstack/diskimage-builder/+/81391223:43
ianwclarkb: ^ i hope to find out :)23:43
clarkbha ok23:44
ianwi don't like these image-based jobs but clearly people still use them23:45
clarkbianw: ya I'll admit I didn't even consider that that might be what people were trying to do there23:45
clarkbthe minimal builds are far more reliable because you don't have the upstream image changing daily on you in the case of ubuntu for example23:45
opendevreviewMerged opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py  https://review.opendev.org/c/opendev/system-config/+/81390023:46
clarkbbig znode drop from ~140k to 108k. corvus' estimate of ~110k may have been spot on23:53
corvus90 may be idle and 110 may be busy-ish ?23:56
