Thursday, 2022-04-28

00:05 <opendevreview> Merged openstack/diskimage-builder master: Fix dhcp-all-interfaces on debuntu systems  https://review.opendev.org/c/openstack/diskimage-builder/+/839080
00:49 <opendevreview> Merged openstack/diskimage-builder master: Switch to release-notes-jobs-python3  https://review.opendev.org/c/openstack/diskimage-builder/+/839599
00:59 *** rlandy|bbl is now known as rlandy
01:00 *** rlandy is now known as rlandy|out
01:23 *** ysandeep|out is now known as ysandeep
01:35 <opendevreview> Merged openstack/diskimage-builder master: Set machine-id to uninitialized to trigger first boot  https://review.opendev.org/c/openstack/diskimage-builder/+/837251
03:16 *** ysandeep is now known as ysandeep|breakfast
04:19 *** ysandeep|breakfast is now known as ysandeep
06:06 *** ysandeep is now known as ysandeep|afk
06:41 <frickler> infra-root: hrw: it looks like the c9s wheel job is broken and thus blocking wheel updates. https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/836829 https://zuul.opendev.org/t/openstack/builds?job_name=publish-wheel-cache-centos-9-stream
06:41 <ianw> frickler: yeah, looking at it now.  it comes back to the openafs publishing
06:42 <ianw> the promote job has failed so we don't have the rpms, i think we need to untangle that first
06:42 <ianw> https://zuul.opendev.org/t/openstack/build/a0c2266464534e2b8267559564f2f609/logs was the last run
06:42 <ianw> but all logs are gone
06:44 *** pojadhav- is now known as pojadhav
06:46 <frickler> ianw: ah, o.k., I just noticed that this looked weird when checking the AFS graphs, go ahead, then ;)
06:46 <ianw> yeah i noticed yesterday when looking at how much space we'd saved with the src file removal too
06:48 *** ysandeep|afk is now known as ysandeep
06:58 *** jpena|off is now known as jpena
07:01 <ianw> 2022-04-28 06:59:10.630370 | centos-7 | error: Failed build dependencies:
07:01 <ianw> 2022-04-28 06:59:10.630649 | centos-7 | kernel-devel-x86_64 = 3.10.0-1160.59.1.el7 is needed by openafs-1.8.8.1-1.el7.x86_64
07:02 <ianw> i bet we have build issues with centos-7 images; this happens when our images get out of sync with the mirror
07:03 <frickler> Cannot find a valid baseurl for repo: base/$releasever/x86_64
07:03 <frickler> https://nb01.opendev.org/centos-7-0000263671.log
07:05 <frickler> centos-7 images look to be 22 days old, quite similar to the wheel age ...
07:06 <ianw> that's definitely part of it.  the images get out of sync, and then the kernel version the images have disappears from the mirrors, and then we can't find the -devel packages for the kernel it's running, and so can't build openafs, so can't publish wheels ...
07:09 <ianw> this is the first yum call in the chroot
07:12 <ianw> https://3312f1bac6b015e072cd-6f4fdaa50c9ffb2ee70643e96aea629f.ssl.cf1.rackcdn.com/838863/7/check/dib-nodepool-functional-openstack-centos-7-src/0cbbe24/nodepool/builds/test-image-0000000001.log
07:12 <ianw> is a good gate build
07:13 <ianw> it looks pretty much the same.  i'm going to have to investigate this more tomorrow
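[A minimal sketch of the failure mode ianw describes above: the image's running kernel has aged out of the mirror, so the matching kernel-devel needed to build openafs can no longer be installed. The version strings are hard-coded examples taken from the pasted build error; on a real node they would come from `uname -r` and a yum query.]

```shell
#!/bin/bash
# Hypothetical check for the stale-image case: the openafs RPM build needs
# kernel-devel matching the *running* kernel, which vanishes from the
# mirror once the distro publishes a newer kernel.
check_kernel_devel() {
    local running="$1" available="$2"
    if [ "$running" = "$available" ]; then
        echo "ok: kernel-devel $available matches running kernel"
        return 0
    fi
    echo "mismatch: running $running but mirror only has kernel-devel $available" >&2
    return 1
}

# Example values from the error above; live values would be "$(uname -r)"
# and the result of something like "yum list kernel-devel".
check_kernel_devel "3.10.0-1160.59.1.el7" "3.10.0-1160.62.1.el7" \
    || echo "image is out of sync with the mirror; rebuild needed"
```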
08:00 <hrw> have you considered moving building from VM to containers? that way the host can run one OS, has AFS mounted, and then in a container you have the other OS with AFS already mounted
08:13 *** ysandeep is now known as ysandeep|lunch
08:21 *** pojadhav is now known as pojadhav|lunch
08:34 <ianw> hrw: yeah, a lot of this was written before containers were really a consideration :)  the other thing we could do that's potentially less impact is to copy through the zuul executor.  i've had that on my todo list for a long time, there might even be a spec about it
08:35 <ianw> but yes, in 2022 it's a good option :)
08:35 <hrw> ianw: less OS related stuff to handle as you can run infra on one distro and use other ones in containers to build stuff
08:37 <hrw> but then it would be to decide 'move from DIB to dockerfiles' or 'add container building into DIB' probably?
09:11 *** pojadhav|lunch is now known as pojadhav
09:40 *** ysandeep|lunch is now known as ysandeep
10:25 *** rlandy|out is now known as rlandy
10:56 *** ysandeep is now known as ysandeep|afk
11:03 <frickler> Warning: Change 826541 in project zuul/nodepool does not share a change queue with 826543 in project openstack/openstacksdk
11:03 <frickler> that's a weird way for zuul to say that the dependency hasn't merged yet
11:05 <frickler> also seems that that nodepool quota unit test has developed some persistent failure
11:23 *** dviroel|rover|out is now known as dviroel|rover
11:38 <opendevreview> Merged openstack/project-config master: Add the cinder-three-par to Openstack charms  https://review.opendev.org/c/openstack/project-config/+/837782
11:42 *** pojadhav is now known as pojadhav|afk
11:45 *** ysandeep|afk is now known as ysandeep
12:45 *** pojadhav|afk is now known as pojadhav
13:42 *** pojadhav is now known as pojadhav|afk
14:10 *** ysandeep is now known as ysandeep|out
14:14 *** jpena is now known as jpena|off
15:17 *** pojadhav|afk is now known as pojadhav
15:21 <clarkb> hrw: ianw: note I'm not sure containers would help in this case because the issue is a mismatch between expected kernel dev headers in userspace and the running kernel of the system. Containers would only make that problem worse as ubuntu and centos and debian all run completely different kernels
15:22 <clarkb> also the problem is orthogonal to building the VM images with DIB (but dib does support the containerfile element now if you wish to use that). This is about building an openafs client for the running kernel to write out the wheel packages to the filesystem
15:22 <hrw> clarkb: host OS loads openafs kernel module, mounts afs volumes. then container runs and gets afs volume already mounted by host os
15:23 <clarkb> sure, but that doesn't solve the problem of having an openafs client, you just externalize it
15:23 <clarkb> the problem continues to exist either way
15:24 <hrw> clarkb: run hostOS vm, mount afs, run container with guest OS, do builds etc, exit, sync afs in hostOS, shutdown VM?
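[hrw's split could look roughly like this sketch: the host keeps the openafs client and /afs mount, and the guest-OS build container just bind-mounts it. The image name and build command are illustrative placeholders, not what opendev actually runs; the command is printed rather than executed so the shape is clear.]

```shell
#!/bin/bash
# Illustrative only: the host OS owns the openafs kernel module and the
# /afs mount; the container sees the already-mounted tree via a bind mount.
afs_container_cmd() {
    local image="$1"; shift
    # Print, rather than run, the docker invocation.
    printf 'docker run --rm -v /afs:/afs:rw %s %s\n' "$image" "$*"
}

afs_container_cmd "quay.io/centos/centos:stream9" "make wheels"
```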
15:24 <clarkb> hrw: yes but where does the hostOS vm get its openafs client?
15:24 <clarkb> that's the problem here :) and it's an issue on all the platforms
15:24 <hrw> clarkb: you already have that covered for several host OSes
15:24 <clarkb> they just break at different times due to the different pace of the different host platforms
15:24 <hrw> just choose one where it works
15:24 <clarkb> hrw: only because we've solved this problem
15:24 <clarkb> with a fair bit of effort is my point, and containers don't solve that
15:25 <hrw> this way you sort it once per 2 years and have it done
15:25 <clarkb> well, once per $afs_breaks_time
15:25 <clarkb> but ya it would reduce the problem space.
15:26 <hrw> grab ubuntu 20.04 for example as host OS, run it for 5y and then migrate to 24.04 for another 5y?
15:26 <clarkb> that doesn't work because the ubuntu openafs client doesn't work, which is my point
15:26 <clarkb> we solve this same problem on ubuntu and debian too
15:26 <clarkb> they just don't break at the same time as centos
15:26 <hrw> or any other host OS supported by AFS upstream
15:26 <clarkb> there are none
15:27 <clarkb> this is the biggest drawback to using openafs
15:27 <hrw> move to git with lfs?
15:27 <clarkb> I'm just saying it's easy to say "use a container" when the real problem is we have to build our own openafs packages for the kernels we run against
15:27 <hrw> clarkb: understood
15:27 <clarkb> git + lfs is not a globally distributed filesystem
15:28 <clarkb> it's unfortunate that there aren't any more modern alternatives to afs because it solves the mirroring problem quite elegantly
15:28 <clarkb> it's basically a filesystem with a built in CDN
15:29 <clarkb> We get to maintain one (really two) copies of the data, then all readers see that content cohesively at roughly the same time when updates are made. This means we don't need 2TB of disk in every cloud to manage mirrors; we need 10% of that for caches
15:30 <hrw> still - having one host OS to worry about instead of several ones?
15:30 <clarkb> we could definitely engineer something like a 404 handler that queries against the pristine copy and manage caching more directly. But so far afs has done well enough
15:30 <fungi> i think the issue at hand had to do with wheel builds, and the suggestion was to have openafs in an ubuntu vm but then perform wheel builds in a centos container/chroot?
15:31 <clarkb> fungi: yes. I'm saying that just punts the problem of the openafs packaging
15:31 <clarkb> I don't really think running the centos build on a centos machine to make centos wheels is significantly more effort than managing the PPA for ubuntu packages then coming up with an indirection layer to run stuff in containers on top of that
15:32 <clarkb> (and also I wanted to make sure that it was clear DIB isn't involved in this, as the two things got conflated. Dib supports the functionality that has been suggested there)
15:32 <fungi> yes, i agree there. i wasn't sure where dib's containerfile backend was coming into the discussion, but i'm trying to follow three conversations simultaneously so i probably skimmed poorly
15:34 <clarkb> copying through the executor and not thinking too hard about where the wheels are actually built is probably the most resilient thing long term, and ianw made that suggestion
15:34 <clarkb> since the executors already do the data shuffling all day long
15:35 <clarkb> it's a well exercised method that doesn't require any new tooling be built. Just a switch to the existing tooling
15:36 <fungi> yeah, we've been talking about that as an improvement for a while now
15:36 <fungi> though it does involve additional data copying
15:36 <clarkb> for wheels specifically I still think it would be helpful to have our upstream deps publish wheels. One of the biggest offenders is libvirt-python and we know the people there but apparently pypi license terms are unfavorable to them? I've never understood the argument
15:36 <fungi> much less now that we don't unnecessarily include wheels which are already available on pypi
15:37 <clarkb> The problem with our wheel cache is that others don't have it and we semi frequently discover that people outside of CI running our software are broken as a result
15:37 <clarkb> (I don't understand the libvirt-python pypi concern because they do publish sdists)
15:39 <fungi> wheels would need to have libvirt itself vendored in
15:39 <fungi> since it's not a base lib for manylinux
15:39 <fungi> so it may be that they object to distributing built libvirt binaries on pypi
15:40 <clarkb> wouldn't it only need the linker info to be present? I suppose if that changes drastically between libvirt versions then you'd be in the same boat
15:41 <clarkb> but iirc they said they could do it except for some licensing terms they didn't like
15:41 <clarkb> which confused me because they are ok with the sdists
15:41 <fungi> it would only need the linker info if you were installing libvirt from some separate place
15:42 <fungi> but generally, wheels embed pre-built c libraries if they're outside the set expected to be available on most systems
15:42 <clarkb> right, if you install libvirt-python they could assume you have libvirt installed separately rather than bundling it. But that likely gets into the trouble of libvirt proper changing its interfaces
15:42 <clarkb> fungi: is that what cryptography does?
15:42 <clarkb> (they seem to be at the forefront of python linking to external resources and building wheels for all the things)
15:43 <fungi> yes
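[For the vendoring fungi describes, the usual mechanism is auditwheel grafting non-manylinux shared libraries into the wheel after the build. This sketch just prints the shape of such a pipeline; the package glob is hypothetical and this is not a statement about what libvirt-python actually publishes.]

```shell
#!/bin/bash
# Hypothetical wheel-bundling pipeline: build a wheel linked against the
# system C library, then let auditwheel copy that library into the wheel
# and retag it for a manylinux platform. Printed, not executed, here.
wheel_pipeline_cmds() {
    cat <<'EOF'
python -m build --wheel --outdir dist/
auditwheel repair dist/libvirt_python-*.whl --wheel-dir wheelhouse/
EOF
}
wheel_pipeline_cmds
```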
16:02 *** dviroel|rover is now known as dviroel|rover|lunch
16:41 *** dviroel|rover|lunch is now known as dviroel|rover
16:42 <clarkb> TIL about pinky(1)
16:44 <clarkb> Luca has warned people about upgrading to 3.5 and to carefully monitor performance and heap utilization
16:44 <clarkb> I think the change I pushed may just be one piece of that puzzle
16:45 <clarkb> Luca sent me a gerrithub outage analysis doc, talking about the issues they have seen, that I need to read through when I've got time to digest it
16:45 <clarkb> Just a heads up that we might need to take the 3.5 upgrade carefully and with extra testing
16:57 <clarkb> ianw: if timing works out today I think I'd like to land https://review.opendev.org/c/opendev/system-config/+/839621 and then can query you for info on the reprepro stuff that needs to be done?
16:59 <clarkb> ianw: in scrollback two different commands are mentioned, `clearvanished` and `deleteunreferenced`. Is the process roughly: grab the volume lock on mirror-update, then run the reprepro clearvanished and deleteunreferenced commands against that repo using the krb credentials, then finally vos release?
17:04 <fungi> huh, i have pinky installed courtesy of coreutils
17:05 <fungi> seems functionally similar to who/w
17:15 <opendevreview> Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm-docs  https://review.opendev.org/c/openstack/project-config/+/839103
17:15 <fungi> infra-root: rackspace support ticket #220427-ord-0001314 is warning us that there will be a block storage maintenance 2022-05-11 03:00-04:00 utc impacting these volumes: afs01.ord.openstack.org/main02 backup01.ord.rax.opendev.org/main02 mirror01.ord.rax.opendev.org/main01
17:17 <fungi> it looks like they've set some anti-affinity scheduling, so if we clone or otherwise replace volumes in advance we won't be impacted
17:17 <fungi> that might be a good idea at least for afs01.ord
17:17 <clarkb> the ord afs server doesn't serve that much stuff
17:17 <opendevreview> Gage Hugo proposed openstack/project-config master: Retire openstack-helm-docs repo, step 3.3  https://review.opendev.org/c/openstack/project-config/+/839427
17:17 <clarkb> but if that helps prevent fallout with vos releases it may be worthwhile
17:17 <fungi> i'm more worried about the afs sync time if we have to rebuild its filesystem from scratch due to corruption
17:18 <clarkb> for the backup server we can just unmount it I think?
17:18 <fungi> yeah, or shut the server down temporarily
17:19 <fungi> we could shut down afs01.ord temporarily as well, and let things fail over
17:20 <opendevreview> Gage Hugo proposed openstack/project-config master: Retire openstack-helm-docs repo, step 3.3  https://review.opendev.org/c/openstack/project-config/+/839427
17:39 <opendevreview> Merged openstack/project-config master: End project gating for openstack-helm-docs  https://review.opendev.org/c/openstack/project-config/+/839103
17:44 <fungi> one up-side to replacing those cinder volumes ahead of the maintenance is that we don't have to remember to turn anything off right before. it'll be happening well into the evening for most of us, so could result in hung afs volumes, corrupt backups, job failures because the mirror went away, et cetera which we won't necessarily spot until the next day. and it's taking place in the middle of the week (tuesday night/wednesday morning my time)
17:45 <clarkb> ya swapping them out definitely seems advantageous and lvm makes it easy
17:47 <fungi> i'll get started on that for the afs server in a bit, then backup and mirror as time permits
19:05 *** rlandy is now known as rlandy|mtg
19:23 *** rlandy|mtg is now known as rlandy
19:32 *** artom__ is now known as artom
20:31 *** dviroel|rover is now known as dviroel|rover|brb
20:39 <ianw> clarkb: yep, what you said is about it, you need --nokeepunreferenced i think to deleteunreferenced just ... because
20:47 <clarkb> ianw: ok cool. Back from lunch and going to look into that if we can land that change
20:48 <clarkb> oh looks like it is approved, perfect
20:52 <clarkb> I've grabbed the debian-docker flock on mirror-update in a new window in the root screen you started
20:52 <clarkb> once the change lands I'll run those commands
20:54 <clarkb> I seem to have crashed firefox
20:54 <clarkb> and now it won't start again
20:54 <clarkb> and `firefox --ProfileManager` isn't helping
20:57 <clarkb> ok it's because / remounted ro due to ext4 errors :/
20:57 <clarkb> I'm going to reboot and hope it is ok
21:03 <opendevreview> Merged opendev/system-config master: Stop mirroring source packages for debian-docker  https://review.opendev.org/c/opendev/system-config/+/839621
21:12 <fungi> note that the debian-docker cleanup is likely a no-op
21:12 <fungi> i think they didn't serve any source packages for us to mirror
21:13 <fungi> though i'll admit i didn't look too hard either
21:16 <clarkb> ya but I don't know if reprepro will complain.
21:16 <clarkb> I figured I'd go through the steps and figure them out
21:16 <clarkb> also re sad ext4, I gave up trying to fsck as I can't find a way to do that with my Tumbleweed install without booting off a usb drive
21:16 <clarkb> if it happens again I guess I should debug harder
21:20 <clarkb> ianw: the command history in window 0 of the screen doesn't seem to have those commands?
21:25 <clarkb> I guess reprepro-mirror-update doesn't need to aklog? Anyway I see the reprepro commands using k5start there so I should be able to run reprepro in the same way but passing the clearvanished and deleteunreferenced commands. Then run reprepro-mirror-update to do a vos release
21:26 <clarkb> just waiting for the config update to deploy then I'll do that
21:32 <clarkb> `k5start -t -f /etc/reprepro.keytab service/reprepro -- reprepro --confdir /etc/reprepro/debian-docker-xenial clearvanished`
21:32 <fungi> yeah, all the krb auth is baked into the script
21:32 <clarkb> I will run that then replace clearvanished with deleteunreferenced then run the script
21:34 <clarkb> then do bionic then focal
21:34 <fungi> sounds great
21:42 <clarkb> ok looks like deploy is done, let me quickly check the config
21:43 <clarkb> yup lgtm, running that command above with the lock held
21:47 <clarkb> fungi: it appears to have been a very large noop for xenial
21:47 <clarkb> for all of the commands. But I'll proceed with bionic and focal just to be sure they aren't different
21:50 <clarkb> and done
21:50 <clarkb> do we want to proceed with landing the ubuntu change? it will definitely not noop
21:51 <clarkb> I've released the debian docker flock
21:52 <clarkb> https://review.opendev.org/c/opendev/system-config/+/839622 that change. Not sure how long the clearvanished and deleteunreferenced steps are expected to take in a large set of mirrors like that
22:16 <fungi> i'm up for it
22:18 <fungi> #status log Replaced block storage volume afs01.ord.openstack.org/main02 with main03 in order to avoid service disruption from upcoming provider maintenance activity
22:18 <opendevstatus> fungi: finished logging
22:20 <clarkb> fungi: ok I approved it
22:21 <clarkb> a currently running ubuntu reprepro run has the flock
22:21 <clarkb> I'll try to grab it
22:21 <clarkb> Not sure how long those runs typically take. It started ~6 minutes ago
22:26 <fungi> is there any reason why the vgs on the borg servers used raw block instead of partitions?
22:27 <clarkb> fungi: does the vicepa system need raw block?
22:27 <fungi> for borg?
22:27 <clarkb> oh sorry, I read it as afs servers for some reason
22:27 <fungi> this is all at the lvm layer
22:27 <clarkb> I don't think borg cares, it is all on top of your fs, not doing direct device manipulation
22:29 <fungi> right, i'm following https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management and it has us partition the attached cinder volume and then add the partition as a pv in the vg, but wondering if there was a specific reason that was avoided when the borg server was built
22:29 <clarkb> ianw may know as he did most of the setup on those
22:29 <clarkb> as far as why use partitions, I think it makes it clear that the volume has been used
22:30 <clarkb> but when it is raw it is harder to make that distinction?
22:30 <clarkb> though lsblk may tell you either way
22:30 <fungi> right, partitioning lets you set the partition type to lvm, which may aid in scanning, i don't know if it still does these days
22:32 <fungi> and also possibly with block alignment
22:32 <fungi> anyway, one of the pvs in the main-202010 vg is now a partition while the other two are raw devices. lvm doesn't really care
22:33 <fungi> we can replace the others over time as opportunity arises
22:33 <fungi> for consistency with our documented process and our other existing servers
22:33 <clarkb> ++
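[The documented cinder-volume process fungi links boils down to roughly this sequence: partition the new volume, flag the partition as LVM, and extend the volume group. The commands are printed rather than executed here, and the device and vg names are placeholders, not the real servers' values.]

```shell
#!/bin/bash
# Sketch of partition-then-pv, per the linked sysadmin doc. Placeholder
# device/vg names; real values come from the attached cinder volume.
pv_add_cmds() {
    local dev="$1" vg="$2"
    cat <<EOF
parted -s ${dev} mklabel gpt mkpart primary 1MiB 100% set 1 lvm on
pvcreate ${dev}1
vgextend ${vg} ${dev}1
EOF
}
pv_add_cmds /dev/xvdc main-202010
```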
22:41 <clarkb> ubuntu flock is still held. /me tries to be patient
22:45 *** rlandy is now known as rlandy|out
22:55 <clarkb> it is vos releasing now so hopefully soon it will be done
22:55 <opendevreview> Merged opendev/system-config master: Stop mirroring source packages for ubuntu  https://review.opendev.org/c/opendev/system-config/+/839622
22:56 <opendevreview> Steve Baker proposed openstack/diskimage-builder master: Parse block device lvm lvs size attributes  https://review.opendev.org/c/openstack/diskimage-builder/+/839829
22:56 <opendevreview> Steve Baker proposed openstack/diskimage-builder master: Make centos reset-bls-entries behave the same as rhel  https://review.opendev.org/c/openstack/diskimage-builder/+/839830
22:57 <clarkb> woot, I have the lock now
22:58 <ianw> sorry, had to run around, back now
22:59 <ianw> i guess i must have done them as raw partitions.  i usually make a partition and set its type to lvm, so i'm not sure what happened
22:59 <clarkb> ianw: no worries, pushing along with the ubuntu source repo removal. The change just merged and I have the lock. I'll run clearvanished and deleteunreferenced then the main script in that order once the configs update
22:59 <ianw> clarkb: hrm, i guess the bash history disappears after you run bash under flock?  those are the bits missing from history
22:59 <clarkb> ianw: let me know if you think we should wait for some reason
23:00 <clarkb> ianw: ya I was wondering if it was in a subshell, once I exit'd my subshell that held the debian docker lock
23:01 <ianw> i don't see any reason to wait, from what i could tell of codesearch i couldn't see anyone setting them up
23:01 <clarkb> ok the config update is done. Running clearvanished now
23:02 <clarkb> lots of stuff like 'There are still packages in 'focal-updates|main|source', not removing (give --delete to do so)!'
23:03 <clarkb> ianw: fungi: Do I want to pass --delete to clearvanished?
23:03 <clarkb> or maybe do the deleteunreferenced first then clearvanished again?
23:04 <clarkb> I'll try deleteunreferenced first
23:06 <ianw> yes, you want delete
23:06 <clarkb> heh, 'Error: packages database contains unused 'bionic-backports|main|source' database.' and clearvanished said 'There are still packages in 'bionic-backports|main|source', not removing (give --delete to do so)!' so ya I need to --delete
23:06 <ianw> then after clearvanished, we want the deleteunreferenced
23:06 <clarkb> ianw: is it reprepro --delete clearvanished or reprepro clearvanished --delete? maybe it doesn't matter. I find this tool quite obtuse. at least it yells at you and gives you hints :)
23:06 <ianw> i feel like i put it last (after clearvanished) but i also think it doesn't matter
23:07 <clarkb> it does matter. It has to go in the front
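[Putting the pieces of this exchange together, the per-distribution cleanup ends up looking roughly like the sketch below, with --delete in front of the clearvanished subcommand as clarkb worked out. The keytab path and confdir layout are taken from the command pasted earlier in the log and are site-specific; the commands are printed rather than executed.]

```shell
#!/bin/bash
# Sketch of the reprepro source-package cleanup, per distribution confdir.
reprepro_cleanup_cmds() {
    local confdir="$1"
    local kinit="k5start -t -f /etc/reprepro.keytab service/reprepro --"
    # Note: --delete is a global option and goes before the subcommand.
    cat <<EOF
${kinit} reprepro --confdir ${confdir} --delete clearvanished
${kinit} reprepro --confdir ${confdir} deleteunreferenced
EOF
}
reprepro_cleanup_cmds /etc/reprepro/ubuntu
```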
23:07 <ianw> yes, it is certainly "interesting" to interact with
23:07 <clarkb> but it yells again
23:07 <clarkb> so I guess the good news is eventually it yells enough that we figure it out
23:07 <ianw> that could be right too :)
23:08 <ianw> oh did i just kill your screen?
23:08 <clarkb> ianw: no, you belled me but I'm in another window
23:08 <clarkb> window 2 for ubuntu work
23:09 <ianw> sorry, i should have exited the session now
23:10 <clarkb> ok clearvanished is done and now deleteunreferenced is running. it has like 280k files to prune
23:11 <ianw> excellent
23:12 <ianw> ... ok, so back to last night's issue ... i have no idea why centos7 is failing on the builders but not in the dib gate
23:14 <clarkb> ianw: does one use our mirror and the other not?
23:14 <clarkb> and our mirror is stale because $reason?
23:15 <ianw> "Cannot find a valid baseurl for repo: base/$releasever/x86_64"
23:15 <ianw> it does have shades of being a mirror issue, but I *think* that in the chroot here we're not using our mirror in both
23:16 <clarkb> ianw: http://mirror.ord.rax.opendev.org/centos/7/os/x86_64/ base/$releasever/x86_64 doesn't seem to align with that
23:16 <clarkb> it's centos/$releasever/$content/x86_64
23:18 <ianw> yeah i'm wondering if we've got a different .repo file in there, or we've sed-ed something the wrong way
23:18 <clarkb> ok deleteunreferenced is done. Running the regular sync now
23:34 <clarkb> there were errors related to deb packages for libreoffice things so we didn't vos release
23:35 <clarkb> I'm going to rerun the script now to see if those are not persistent
23:49 <clarkb> ok it is still failing
23:50 <clarkb> 'Unable to forget unknown filekey'
23:50 <clarkb> should I try running clearvanished and deleteunreferenced again?
23:50 <clarkb> I guess it can't hurt any more than we've already done and it may help /me tries it
23:51 <clarkb> oh I wonder if this is due to our state tracking to remove packages?
23:52 <clarkb> ya, 2022-04-28 23:48:50  | Cleaning up files made unreferenced on the last run
23:52 *** dviroel|rover|brb is now known as dviroel|rover
23:52 <clarkb> so I think the error is that we're telling it to forget files that the deleteunreferenced already cleared out
23:53 <clarkb> can I just move /var/run/reprepro/mirror.ubuntu.ubuntu.unreferenced-files aside?
23:54 <clarkb> ya I think so. We generate a new version of that file on the next pass
23:54 <clarkb> I'm going to move that file aside into my homedir on the server
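[The recovery step being described, parking the stale unreferenced-files state so the mirror script regenerates it on its next pass, could be sketched like this. The state-file path is the one mentioned above; the destination and the ".stale" suffix are assumptions for illustration.]

```shell
#!/bin/bash
# Move a stale reprepro state file out of the way; the mirror script will
# write a fresh list of files-to-forget on its next run.
move_state_aside() {
    local state="$1" dest="$2"
    if [ -f "$state" ]; then
        mv "$state" "${dest}/$(basename "$state").stale"
        echo "moved $(basename "$state") aside"
    else
        echo "no state file present; nothing to do"
    fi
}

# e.g. move_state_aside /var/run/reprepro/mirror.ubuntu.ubuntu.unreferenced-files "$HOME"
```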
23:55 <clarkb> and now rerunning the main script
23:55 <ianw> yeah, that sounds familiar
23:56 <ianw> there might be something about it in the afs recovery docs, but it's been a long time (thankfully!)
23:59 <clarkb> ya, reading the logs and the script, we basically have the old pre source removal list of things to remove, which doesn't work because I removed them and more
23:59 <clarkb> but moving that file aside should get things moving
23:59 <ianw> "2022-04-28 23:56:11.052 | base/$releasever/x86_64             CentOS-$releasever - Base             0"
23:59 <fungi> sorry, stepped away to make dinner, but back around now. sounds like you've got it worked out?

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!