Wednesday, 2022-07-13

*** ysandeep is now known as ysandeep|afk00:09
*** dviroel|rover is now known as dviroel|out00:10
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958900:29
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Build a nodepool image  https://review.opendev.org/c/opendev/system-config/+/84879200:34
ianwhrm, not sure how to test the upload-pypi role00:46
ianwi can make a limited api key that can only update one project on test.pypi.org and we can assume that is public, and use it in zuul-jobs00:47
ianwi mean to say test it automatically, rather than just a one-off manual approach00:48
ianwi think the best approach might be to test upload-pypi in zuul jobs with an api key separately and manually, before committing.  then we can have the switch ready in project-config and merge it just before we do something that will push to pypi like a dib release, and monitor it closely00:51
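As background, a token-based upload with twine generally looks like the sketch below: the username is the literal string __token__ and the password is the API token itself. The token value and dist contents here are placeholders, not anything used by the actual role.

    # Sketch: uploading to test.pypi.org with a scoped API token (placeholder values)
    export TWINE_USERNAME=__token__
    export TWINE_PASSWORD=pypi-EXAMPLE-TOKEN-VALUE
    twine upload --repository-url https://test.pypi.org/legacy/ dist/*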
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959301:04
fungimaybe uploads of opendev/sandbox?01:11
fungithough we may be missing a lot of the necessary bits for that01:12
*** ysandeep|afk is now known as ysandeep01:16
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958901:20
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959301:20
ianwyeah, it would be a bit of a pain to make something that increases its version number on every gate check01:21
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959301:38
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959301:47
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959701:52
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959301:57
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959701:57
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959801:57
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959702:04
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959702:32
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958902:50
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959802:50
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959302:50
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959702:50
*** ysandeep is now known as ysandeep|afk03:19
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958904:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959804:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959304:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959704:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958904:27
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959804:27
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959304:27
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959704:27
*** ysandeep|afk is now known as ysandeep04:44
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958905:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959805:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959305:02
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959705:03
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84958905:18
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed  https://review.opendev.org/c/zuul/zuul-jobs/+/84959805:18
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing  https://review.opendev.org/c/zuul/zuul-jobs/+/84959305:18
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959705:18
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959706:07
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959706:20
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959706:27
*** ysandeep is now known as ysandeep|afk06:44
*** soniya is now known as soniya|ruck06:48
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959707:01
*** ysandeep|afk is now known as ysandeep07:40
*** ysandeep is now known as ysandeep|lunch08:25
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959708:34
opendevreviewIan Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload  https://review.opendev.org/c/zuul/zuul-jobs/+/84959708:53
*** anbanerj is now known as frenzy_friday09:17
*** soniya|ruck is now known as soniya|ruck|lunch09:41
*** soniya|ruck|lunch is now known as soniya|ruck10:09
*** soniya|ruck is now known as soniya|ruck|afk10:11
*** rlandy|out is now known as rlandy10:26
*** ysandeep|lunch is now known as ysandeep10:40
ianwfungi: https://review.opendev.org/q/topic:upload-pypi-api is the base work for pypi api upload10:57
*** soniya|ruck|afk is now known as soniya|ruck11:07
*** rlandy is now known as rlandy|rover11:15
*** dviroel is now known as dviroel|rover12:12
*** rlandy|rover is now known as rlandy12:23
*** ysandeep is now known as ysandeep|afk12:59
*** ysandeep|afk is now known as ysandeep13:31
mnaserinfra-root: https://tarballs.opendev.org is returning forbidden14:27
fungilooking14:28
fungimay be an afs outage14:28
mnaserthank you fungi !14:28
fungiapache throwing lots of kernel oopses in dmesg14:29
*** dasm|off is now known as dasm14:29
fungi[Wed Jul 13 13:15:38 2022] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server)14:29
mnaserthat'll do it14:30
mnaser104.130.138.161 is not pingable14:30
fungitime reported by dmesg may also not be accurate so that may be more recent than an hour ago14:30
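(A quick way to re-check those timestamps, assuming a util-linux dmesg, is to ask for wall-clock times directly; the grep pattern is only illustrative.)

    # -T prints human-readable local times; they can drift if the kernel's boot-time estimate is off
    dmesg -T | grep -iE 'afs|oops'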
mnaserso maybe afs could be the real issue here (unless that ip is not pingable by icmp)14:30
fungiyeah, that's afs01.dfw.openstack.org14:30
fungitrying to ssh into it now but it's hanging14:31
fungii'll check the oob console14:31
mnaseris afs02 a replica for afs01 ?14:31
fungiyes, for most things anyway14:31
mnaserim wondering why it didnt fall back to that14:31
fungiit did for some volumes, but doesn't seem to have for tarballs14:32
fungipossible something is wrong/stuck with the replica for it14:32
fungifor now i'm going to dig into what's happening to afs01.dfw though14:33
fungiinfra-root: ^ heads up, and also i have a conference call i have to jump to in 25 minutes, just fyi14:33
fungiticket from rackspace: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on alerted our monitoring systems at 2022-07-13T13:29:01.300633. We are currently investigating the issue and will update you as soon as we have additional information regarding what is causing the alert.14:36
mnaserah14:36
fungifollowup: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts.14:36
fungithat followup was stamped roughly an hour ago14:37
fungiso i guess the instance didn't come back when the host rebooted14:37
fungiyeah, the api reports it in an "error" state14:38
fungifault | {'message': 'Storage error: Reached maximum number of retries trying to unplug VBD OpaqueRef:6d2337f7-aa1d-46b3-5da6-209ac49fd06b', 'code': 500, 'created': '2022-04-28T20:06:53Z'}14:40
mnaserthe date of that fault seems to show that its unrelated14:41
mnaser(also wow what a throwback to see 'OpaqueRef', old school xenserver code)14:41
mnaserafaik the nova api will let you hard reboot if vm is in error state14:41
fungiafs01.dfw has four volumes in cinder, all in-use, none of which match that uuid14:43
mnaserthat is an internal uuid used by xenserver14:43
fungiahh14:44
fungiso no clue which cinder volume it might be14:44
fungianyway, yeah, i'll try a hard reboot and hope we don't corrupt any filesystems14:44
fungifault | {'message': 'Failure', 'code': 500, 'created': '2022-07-13T14:45:08Z'}14:46
fungithat's less than helpful14:46
fungiputting it into shutoff for a minute and then trying a server start14:47
fungiit went into shutoff state fine, but server start seems to be getting ignored now14:49
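The sequence described above corresponds roughly to the following OpenStack client calls; this is a sketch rather than a record of the exact commands run.

    # Show the fault recorded for an instance stuck in ERROR state
    openstack server show afs01.dfw.openstack.org -f value -c fault
    # Try a hard reboot first
    openstack server reboot --hard afs01.dfw.openstack.org
    # Fall back to an explicit stop followed by a start
    openstack server stop afs01.dfw.openstack.org
    openstack server start afs01.dfw.openstack.org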
mnaseri think the hypervisor feels borked :(14:49
fungiyeah, i'll follow up on the ticket they opened about the host reboot14:50
fungiin the meantime we can see if the read-only replica for tarballs can be brought online14:50
fungi#status notice Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery14:51
opendevstatusfungi: sending notice14:51
-opendevstatus- NOTICE: Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery14:52
*** dviroel|rover is now known as dviroel|rover|biab14:53
opendevstatusfungi: finished sending notice14:54
fungiinfra-root: i've updated the ticket (220713-ord-0002114) and am awaiting further response from rackspace support14:55
fungii probably don't have time to dig into what's preventing failover for the tarballs volume to the read-only replica before my call in a few minutes, but can try to poke at it some. also we should disable afs volume releases in the meantime and work on doing a full switchover to afs02.dfw14:57
*** soniya is now known as soniya|ruck15:01
*** ysandeep is now known as ysandeep|out15:04
Clark[m]I'm getting my morning started but need to do quick system updates first.15:12
Clark[m]fungi: are we serving the RW path on static?15:12
jrosserwould this be related? https://mirror-int.dfw.rax.opendev.org/ubuntu/dists/bionic/universe/binary-amd64/Packages  403  Forbidden [IP: 10.209.161.66 443]15:16
clarkbjrosser: likely yes15:21
clarkbstatic's dmesg has a number of tracebacks involving afs after losing contact with the server. mirror.dfw does now15:24
clarkbs/does now/does not/15:24
clarkball three afsdb0X servers report they are running happily according to bos status so I'm not sure why failover wouldn't have happened except for maybe talking to RW paths instead of RO paths15:27
clarkbor maybe the kernel tracebacks crashes afs hard enough to prevent failover on the client side?15:27
clarkblooking at /var/www/mirror on mirror.dfw I think some volumes failed over and others did not15:29
clarkbhttps://mirror.ord.rax.opendev.org/ubuntu/dists/bionic/universe/ has content so ya this may be ~luck of the draw on individual clients for handling failovers.15:29
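For reference, the checks and the client-side nudge being described can be done with the OpenAFS command-line tools; the volume name below is illustrative rather than the exact one in use.

    # Confirm the bosserver on a db/file server reports its processes healthy
    bos status afsdb01.openstack.org
    # Show which servers hold the RW volume and its RO replica sites
    vos listvldb -name mirror.ubuntu
    # On a client, re-probe fileservers believed down and flush cached state for a path
    fs checkservers
    fs flushvolume /afs/openstack.org/mirror/ubuntu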
clarkbI'm trying to restart openafs on mirror.iad3.inmotion15:31
clarkbit isn't going very quickly15:31
clarkbok that was slow but it didn't seem to break anything. I'll try that on mirror.dfw now15:33
clarkbbefore I did that I simply navigated to the path on the fs and now it seems happy on dfw?15:35
clarkbI wonder if we cached a failed lookup in apache and the napache stopped trying to hit the fs to refresh the failover?15:35
fungiyeah, apache restarts might help, i suppose15:36
fungiand sorry, confcall is pretty distracting15:36
fungibut should be free again in ~25 minutes15:36
clarkbok ya I think mirror.dfw is good now simply by manually traversing the path on afs15:36
clarkbI'll check the other mirrors first (I know that static is probably what people want more but I feel like I'm learning and mirrors are far less stressful)15:37
rlandyhi ... Failed to fetch https://mirror.iad3.inmotion.opendev.org/ubuntu/dists/focal/main/binary-amd64/Packages  403  Forbidden [IP: 173.231.253.126 443]15:37
rlandymirror.iad3.inmotion.opendev.org seems to be the failing mirror now for us15:38
clarkbrlandy: it is working now I think. Note timestamps and also links to failures are always useful. But ya I think that particular mirror as well as dfw is happy now15:38
clarkb(it could be that failure occurred when I restarted openafs)15:38
rlandyclarkb: thanks - will watch that15:39
clarkbif we see failures after this point in time for mirror.dfw and mirror.iad3 let us know. And now I'm looking at the others15:39
rlandyfailures are probably from an hour back15:39
clarkbmirror.mtl01.iweb appears happy15:40
clarkbmirror.ord and mirror.iad as well. None of them have the tracebacks like static does15:40
clarkbboth ovh mirrors are similarly happy from  what I see. No tracebacks either15:42
*** marios is now known as marios|out15:44
clarkbya all the mirrors appear ok now based on filesystem listings against /var/www/mirror15:45
clarkbnone contain the dmesg tracebacks that static shows15:45
rlandythanks for checking15:47
clarkblooks like tarballs is up too? I wonder if taking the unhappy fileserver down was what we needed to fail over15:47
clarkbfungi: ^15:47
fungii can check in a few15:48
clarkbI *think* we're in a good state now via failover to RO volumes on afs02.dfw15:48
clarkbI think the next steps are likely going to be disabling any vos releases so that we don't possibly replicate corrupted RW volumes on afs01 to RO volumes on afs02 when 01 comes back (openafs likely protects against this but I'm not sure)15:49
clarkbthen we can bring back afs01 and convert its volumes to RO and switch 02 to RW then enable releases in the other direction?15:50
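A rough sketch of the vos operations involved, assuming admin credentials on a fileserver (the volume name is illustrative):

    # Compare the RW volume's update time with the last release to its RO sites
    vos examine tarballs -localauth
    # Push RW contents out to the RO replicas (what the cron jobs normally trigger)
    vos release tarballs -localauth
    # If afs01's copy could not be trusted, an RO site on afs02 could instead be promoted
    vos convertROtoRW -server afs02.dfw.openstack.org -partition vicepa -id tarballs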
fungialso possible 01 came back up, i haven't checked yet15:54
clarkbit doesn't ping and there is no ssh15:55
clarkbanyway I didn't really have to do anything on the servers other than navigate their /afs/openstack.org/mirror and /afs/openstack.org/project paths and that seemed to make things happy. Either that or the shutdown of afs01 caused the afs db servers to finally notice it is down and fail over15:56
clarkbI believe we are in an RO state with content being served. I've notified the release team to not do releases and updated the mailing list thread with this info15:56
clarkbI'm going to take a break now as I haven't had breakfast yet and I have a bunch of email to catch up on after being out for a few days15:57
fungithanks! i'm freeing up again now for a bit, but will have an errand to run soon as well, so will see what i can get done on this in the meantime16:00
clarkbfungi: I think holding locks/commenting out vos release cron jobs so that we control how, when and what syncs when afs01 is back is the next thing16:01
clarkband then it is probably just a matter of monitoring and seeing what rax says? I guess we could try booting a recovery instance to inspect why it is failing16:01
clarkbBut I really need food16:02
fungigo eat!16:04
*** dviroel|rover|biab is now known as dviroel|rover16:11
fungii've added mirror-update02.opendev.org to the emergency disable list16:15
fungii've also temporarily commented out all lines in the root crontab on that server16:16
clarkbfungi: I think docs and tarballs etc are released via a cronjob elsewhere? Worth double checking16:23
fungithose are handled by the release-volumes.py cronjob on that server, as far as i'm aware16:24
fungiwhich runs every 5 minutes16:24
fungior did, until i commented it out16:24
fungiwe had separate mirror-update servers between which reprepro and rsync mirroring were split for a while, but that's been consolidated onto the newer server more recently16:25
clarkbaha16:26
clarkblooks like there is an update to our ticket? I'm not in a good place to login and check that yet16:26
clarkb(I've got a post road trip todo list a mile long too :( ...)16:26
*** dviroel__ is now known as dviroel|rover|biab16:42
fungii can only imagine16:44
fungithe ticket updates were "The query regarding unable to boot afs01.dfw.openstack.org has been received. I am currently reviewing this ticket and I will update you with more information as it becomes available." followed by "I will now escalate to appropriate team for further review."16:46
fungiso i guess we're waiting for an appropriate team16:47
clarkbgood to know it has been seen at least16:47
fungii need to go run some errands, but will make them as quick as possible. shouldn't be more than an hour i hope16:47
clarkbI think we've done what we can until we hear back from them short of booting a recovery instance16:48
clarkband it is probably better to let them poke at it now that they have seen it16:48
fungiyep16:50
*** dviroel|rover|biab is now known as dviroel|rover17:21
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Build a nodepool image  https://review.opendev.org/c/opendev/system-config/+/84879217:33
fungiracker todd is my new hero! "the volume afs01.dfw.opendev.org/main03  eafb4d8d-19e2-453e-8657-013c4da7acb6 lost it's iscsi connection to the Compute host... Detaching and reattaching it did the trick."18:08
fungireboot   system boot  Wed Jul 13 18:0818:08
fungii think afs01.dfw is back in business now, but need to double-check all the volumes to make sure everything's copacetic before i can say with any certainty18:10
fungii've gone ahead and closed out the ticket with much thanks, since we can at least take it from here18:12
clarkbexcellent18:12
fungifor future reference, i suppose we can try detaching/reattaching through cinder18:13
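That detach/reattach can be driven from our side with the OpenStack client along these lines; the volume UUID is the one from the ticket and the rest is a sketch.

    # List attached volumes on the instance
    openstack server show afs01.dfw.openstack.org -c volumes_attached
    # Detach and reattach the affected volume
    openstack server remove volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6
    openstack server add volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6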
fungii've got a narrow window to try and catch up on yardwork, but may be able to poke at checking those over on breaks or once i finish18:17
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Build a nodepool image  https://review.opendev.org/c/opendev/system-config/+/84879218:20
fungiper discussion in #openstack-infra a zuul job successfully wrote to the docs rw volume, so i'm going to uncomment the vos release cronjob for that next and see if we have any new problems there19:10
fungii'm tailing /var/log/afs-release/afs-release.log on mirror-update and should hopefully see it kick off in ~2 minutes19:13
clarkbthanks19:15
fungilooks like all releases were successful, including tarballs19:16
fungiwe dodged a bullet there19:16
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Build a nodepool image  https://review.opendev.org/c/opendev/system-config/+/84879219:16
fungiclarkb: any objections to me uncommenting the other cronjobs and taking mirror-update02 out of the emergency disable list now?19:18
clarkbfungi: no, I probably would've done the mirrors first myself since they are all upstream data :)19:19
clarkbI think if tarballs et al are happy then mirrors are good to go19:19
fungifair, but there was a request to rerun a docs job so i took the opportunity19:19
clarkbya19:19
fungiokay, undoing the rest19:19
fungiand done19:19
fungii'll hold my hopes until we see if there are mirror volumes remaining stale, but i think we can status log a conclusion (i only did status notice earlier, not alert)19:20
fungi#status log The afs01.dfw server is back in full operation and writes are successfully replicating once more19:21
opendevstatusfungi: finished logging19:21
fungii'll let #openstack-release know too19:21
opendevreviewClark Boylan proposed opendev/system-config master: WIP Update to Gitea 1.17.0-rc1  https://review.opendev.org/c/opendev/system-config/+/84720420:45
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.16.9  https://review.opendev.org/c/opendev/system-config/+/84975420:45
clarkbThere is a new gitea bugfix release. I put that update between the testing update and the 1.17.0 rc change20:45
clarkbHopefully we can land the testing update and the 1.16.9 update soon. But as always please review the changelog and template updates20:45
clarkbgit diff didn't show me any template changes between .8 and .9 for the three templates we override20:47
ianwo/ ... is the short story one of the afs volumes went away for a bit?20:56
ianwit seems we didn't need to fsck, which is good20:57
clarkbianw: the entire fileserver went away due to one of the cinder volumes going away20:57
clarkbI think that may have impacted all of the afs volumes due to lvm?20:58
clarkbbut ya it seems to have come back20:58
clarkbianw: while we are doing catchup thank you for updating our default ansible version in zuul (I should've set myself a calendar reminder for that and just spaced it). Also looks like we updated ansible to v5 on bridge too?21:00
ianwumm, i didn't touch the ansible version on bridge, i don't think21:00
ianwi guess /vicepa reports itself as clean ... how it survived that I don't know :)21:04
clarkbianw: oh maybe I misread something over the last week21:04
clarkbI may have just smashed together the zuul update and bridge update in my head21:05
corvusianw: clarkb fungi for any who are interested, https://review.opendev.org/848792 is an image-build-in-zuul-job which has 2 successful runs -- one at 1 hour, one in 38 minutes. i believe that further improvement in runtime is possible with better use of the cached data already on the nodes.  it does use the existing git repo cache (but then fetches updates, which is a little slow.  it also copies it twice, and i feel like we should be able to21:13
corvusavoid that somehow, but that requires some detailed thought about what's mounted where and when).  it doesn't use any of the devstack/tarball/blob cache on the host, so those files are all being fetched each time; that could obviously be improved.  anyway, i think that's a useful starting point, and it could be used to test out the containerfile stuff ianw was looking at.   i'm currently working on a new spec for nodepool/zuul, and i wanted to21:13
corvusget an idea of what a job like that would look like.  feel free to take that change and modify it or copy it or whatever if you have any ideas you want to explore; i'm basically done with that for right now (it answered my questions).21:13
clarkbcorvus: re caching off the host I think the existing dib caching knows how to check for updates to those files we just have to copy/link them into the right locations in the dib build path?21:15
corvusclarkb: maybe -- but it also has some shasum hash thing it does and i think that's only in the /opt/dib_cache dir, so i don't think we have all that data on the host (which in this case is one of our built images)21:16
clarkbya the dib_cache dir isn't copied into the zuul runtime images21:17
clarkbbut we could probably update things to leak that across assuming it isn't very large and is also useful21:17
clarkbI'd have to think about that a bit more21:17
corvusyeah21:17
corvusat least, the theoretical problem of "we have foo.img, let's update it iff it needs updating" seems solvable :)21:18
corvus(i went ahead and put a bit of effort into the git repo cache already though since i knew that was the big thing)21:18
fungiianw: clarkb: a more accurate summary would be the primary afs server went away because the hypervisor host went away, but then we couldn't boot it back up for hours because the host got confused when it lost contact with the iscsi backend for one of the attached volumes21:19
clarkbfungi: thanks21:19
fungiso it was a bit of a cascade failure21:19
*** dmitriis is now known as Guest493421:20
fungialso we didn't manage to automatically fail over serving the ro replica for something (tarballs volume at least) and needed to intercede21:20
clarkbfungi: was the server off for all those hours then? If so then I think the idea that shutting it down caused failover to happen is unlikely (and more likely that my manual navigation of paths made it happier)21:20
fungithe server was offline until 18:08 yes21:20
fungiand the outage started around 13:something21:21
clarkbok that helps. For some reason i had it in my head that the server was up but with sad openafs and that may have confused the afs dbs21:21
fungitarballs.o.o didn't start serving content until somewhere in between those times21:21
clarkbya I suspect more that my manual navigation of afs paths on static forced openafs there to try again and it started working?21:22
fungipossibly, though i also did that earlier in the outage21:22
clarkbor maybe we cached the bad results for a couple of hours and that timing just lined up where the caching timed out21:22
fungijust as part of inspecting things to see what was actually down21:23
clarkbfungi: if you have time can you take a look at https://review.opendev.org/c/opendev/system-config/+/849754 you've already reviewd its parent.21:58
clarkbianw: ^ if you get a chance to look too that would be great21:59
clarkbCI results on the child should be back momentarily21:59
opendevreviewClark Boylan proposed opendev/system-config master: Install Limnoria from upstream  https://review.opendev.org/c/opendev/system-config/+/82133122:01
clarkbinfra-root ^ is a change that keeps ending up stale because there is never a good time to land it :/ I think Fridays are generally quiet with meetings if we want to try and land it this friday (seems like the last time I picked a day there was a big fire that distracted me)22:02
*** dasm is now known as dasm|off22:08
fungiclarkb: lgtm. unrelated, a review of 849576 and its child would be awesome when you have time22:13
clarkbfungi: I've +2'd both but didn't approve in case you wanted to respond to ianw first. Feel free to self approve22:15
ianwoh i assume that it is all in order22:16
clarkbI've approved the update to gitea testing. I think I'll hold off on gitea upgrade proper until tomorrow though as I'm still getting distracted by all the "home after a week away" problems22:18
clarkbfeel free to land the gitea upgrade if you're able to monitor it, but I'm happy to do that tomorrow22:18
ianwi can monitor it, can merge in a few hours when it all slows down22:18
opendevreviewIan Wienand proposed openstack/project-config master: Remove testpypi references  https://review.opendev.org/c/openstack/project-config/+/84975722:19
fungiianw: did i not respond to ianw? maybe i missed something22:20
fungier clarkb ^22:20
ianwoh you did, about the handbook v the guide v the open way v the four opens etc.22:21
fungilooking back, i left a review comment in reply to an inline comment, rather than replying with an inline comment, sorry!22:23
fungiand yeah, they're intentionally distinct22:24
fungi(we debated the option of putting them together or not at great length)22:24
fungii was personally in favor of fewer repos, but one more repo wasn't that great of a cost to appease those who disagreed with my position on the matter22:25
opendevreviewIan Wienand proposed openstack/project-config master: twine: default to python3 install  https://review.opendev.org/c/openstack/project-config/+/84975822:27
clarkbfungi: hrm this is the problem of not responding to the inline comment directly so it doesn't show up as a response on the file22:30
clarkbhttps://review.opendev.org/c/openstack/project-config/+/849576/1/gerrit/projects.yaml basically that doesn't show a response22:31
clarkbbut ya I see it now22:31
fungiwell, in this case i missed that it was an inline comment so i made a normal review comment instead. was my bad22:39
clarkbwith the web ui if you click reply it automatically attaches it to the correct place. I wonder if gertty could grow a similar functionality22:45
clarkbor maybe it does and I just haven't used it in long enough to have forgotten22:45
*** dviroel|rover is now known as dviroel|rover|Afk22:58
jrosseri might have a very long running logs upload in progress here https://zuul.opendev.org/t/openstack/stream/915484832105431892e804fb86abc2d3?logfile=console.log23:08
clarkbhrm doesn't look like we've ported the base-test updates to log the target to the production base job?23:09
clarkbor if we have I'm not seeing it in that log yet23:09
jrosserit's from 84799123:09
opendevreviewMerged opendev/system-config master: Move gitea partial clone test  https://review.opendev.org/c/opendev/system-config/+/84817423:09
jrosserno i don't think we have merged that yet23:09
* clarkb makes a note to catch back up on that tomorrow23:10
jrosseri have a patch to do that but it needs updating23:10
jrosseri saw one POST FAILURE earlier, and just noticed that one apparently stuck23:10
ianwlsof on that shows connections to ...23:19
ianw142.44.227.10223:19
ianw OVH Hosting Inc.23:20
ianwlooking at it in strace, it doesn't seem to be doing anything23:21
fungiclarkb: in this case it wasn't a gertty failing, i replied to the review comment which contained the inline comment rather than replying to the inline comment itself23:21
clarkbianw: is it waiting on a read or a write? (might point to which side is idling)23:23
ianwhttps://paste.opendev.org/show/bzXL1q1f2G0e4d4dQgvA/23:23
ianwlooks to me stuck in a bunch of reads23:24
clarkbto me that implies something about the remote end being unhappy23:26
clarkbwe're waiting for ovh to respond to us?23:26
clarkbcould be something on the network between as well23:26
ianwpinging it from ze02 seems fine23:27
ianwit really just looks like those threads are sitting there waiting for something23:27
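The kind of inspection being described is roughly the following, with <pid> standing in for the stuck upload process:

    # Which remote endpoints do the process's TCP sockets point at?
    sudo lsof -nP -p <pid> -a -i TCP
    # Attach to every thread and watch whether any syscalls actually complete
    sudo strace -f -tt -p <pid> -e trace=read,write,poll,connect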
clarkbmight be something amorin could help with23:28
clarkb(to check on the ovh side to see if there is any obvious reason for the pause)23:28
ianwi've had that under strace for a while and nothing has got any data or timed out either23:30
fungiwhen was the connection initiated?23:31
ianwhttps://paste.opendev.org/show/bdRYt3Lbz7PZZoEuovxE/23:33
ianwit would indicate Jul 13 23:33 i guess23:34
ianwalthough, that's 1 minute ago?23:34
ianw... and it's gone ...23:35
ianwdid it get killed?23:35
ianw2022-07-13 20:34:38.552391 | TASK [upload-logs-swift : Upload logs to swift]23:37
ianw2022-07-13 23:34:33.029640 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master]23:37
ianwyes23:37
jrosserstill html ara report there which needs to be got rid of23:38
ianwi guess the time of the file in /proc/<pid>/fd is the time that the kernel made the virtual file in response to the dirent or whatever (i.e. when you "ls" it), not the file creation time.  not sure i've ever considered the timestamp of it before23:40
ianwanyway, that's a data point i guess?  it was ovh, and it was all the thread stuck in read() calls23:41
clarkb++ someone like timburke might know what portion of the upload is doing reads too. Though it may just be waiting for a status result from the http server23:46
clarkband the problem is processing/storing the data on the remote23:46
ianwmy next thought was a backtrace on one of those threads, but they disappeared23:48
opendevreviewIan Wienand proposed openstack/project-config master: pypi: use API token for upload  https://review.opendev.org/c/openstack/project-config/+/84976323:54
ianwdoes "Job publish-service-types-authority not defined" in project-config ring any bells?  23:56
clarkbservice types authority is the thing that is published for keystone ? I think its the static json blob23:57
*** dviroel|rover|Afk is now known as dviroel|rover23:57
ianwhttps://review.opendev.org/c/openstack/project-config/+/708518 removed the job in feb 202023:58
clarkbianw: did you get that error from the zuul scheduler log?23:58
clarkball it said in gerrit was the change depends on a change with invalid config23:58
ianwwe have a reference @ https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L501823:59
clarkbhttps://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5017-L5018 ya I just found that too23:59
clarkbmaybe we need to clean that up?23:59
ianwhttps://review.opendev.org/c/openstack/project-config/+/849757 gives me a zuul error23:59
