Monday, 2023-03-06

opendevreviewIan Wienand proposed openstack/diskimage-builder master: [wip] f37  https://review.opendev.org/c/openstack/diskimage-builder/+/87648200:22
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [wip] f37  https://review.opendev.org/c/openstack/diskimage-builder/+/87648200:50
clarkbianw: thank you for the reviews on the gitea stack. I'll fix that last one tomorrow morning and try to review the acl stack again while I'm letting those changes make their way through to production00:58
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35  https://review.opendev.org/c/opendev/system-config/+/87648603:49
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37  https://review.opendev.org/c/opendev/system-config/+/87648703:49
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35  https://review.opendev.org/c/opendev/system-config/+/87648604:27
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37  https://review.opendev.org/c/opendev/system-config/+/87648704:27
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: stop mirroring old atomic version  https://review.opendev.org/c/opendev/system-config/+/87648804:27
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: drop Fedora 35  https://review.opendev.org/c/opendev/system-config/+/87648604:31
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37  https://review.opendev.org/c/opendev/system-config/+/87648704:31
*** jpena|off is now known as jpena08:06
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927010:55
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927011:19
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927011:27
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927011:45
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927011:47
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Add swap support  https://review.opendev.org/c/openstack/diskimage-builder/+/86927011:54
fungilooking at the nodepool graphs for rax-ord, there's something pretty wrong in there13:49
fungifrom what i can piece together, nodepool is getting a bunch of launch failures with "Timeout waiting for instance creation" and then proceeds to ask the cloud to delete the node and immediately deletes the znode, so nodepool is no longer tracking those, but they're hanging around in an active state in the cloud consuming quota for ages13:50
fungii'm watching one which had its deletion requested over half an hour ago and is still in an active state according to openstack server show13:51
fungianyway, the end result is that we're averaging something like 5% effective utilization of the quota we have there13:52
fungiand since it's the largest quota of any region we have access to, that's a huge chunk of our aggregate quota we can't use (more than 25%)13:54
funginode 0033377524 is one of the examples i'm looking at13:55
fungi2023-03-06 13:19:22,943 INFO nodepool.StateMachineNodeDeleter.rax-ord: [node: 0033377524] Deleting ZK node id=0033377524, state=deleting, external_id=None13:55
fungicorresponding server instance c8f2f797-004a-4f26-8883-798ed0561926 finally disappeared moments ago, roughly 37 minutes after nl01 issued the delete13:57
fungialso nl01 is logging a bunch of tracebacks checking the quota13:58
fungiFile "/usr/local/lib/python3.11/site-packages/nodepool/driver/utils.py", line 355, in estimatedNodepoolQuotaUsed13:59
fungiif node.type[0] not in provider_pool.labels:13:59
fungiIndexError: list index out of range13:59
fungii don't see any other launchers logging that exception, so could be something specific to rackspace's api responses i suppose13:59
fricklerthis also doesn't look good https://paste.opendev.org/show/bZ1r1HWRmIQacGhpVhN8/14:33
fricklernot sure whether we have created a busy loop of lots of creations leading to long startup times leading to lots of timeouts14:34
fricklerwe could consider bumping the launch-timeout. or lower the quota for some time to see if it recovers14:35
fricklermaybe there's also another bug in the new state machine code. seems we don't have good test coverage for that14:37
fungiyeah, i suppose if we're deleting the node from zk immediately and relying on the quota checking to keep us honest, but then can't actually check the utilization because of that exception and fall back on max-servers and the assumption that the nodes it knows about are the only ones that exist, then we could be trying to boot over quota too14:41
fungimight even be leading us to hammer the api, kicking api rate limits into effect, slowing our calls down even more and creating a vicious cycle14:50
fungineed to go run some errands, but should be back within the hour14:57
clarkbthe rax ord thing appears to have been going on for months.16:14
clarkbI suspect it is something to do with the cloud itself given that and the lack of issues for the other two regions16:14
clarkbI would probably start by increasing boot timeout?16:16
clarkbsince it appears to be booting things successfully but deciding they aren't coming up fast enough?16:16
*** gthiemon1e is now known as gthiemonge16:24
clarkboof launch timeout is already 10 minutes16:25
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: Correct boot path to cover FIPS usage cases  https://review.opendev.org/c/openstack/diskimage-builder/+/87619216:31
opendevreviewClark Boylan proposed opendev/system-config master: Switch borg backup from gitea01 to gitea09  https://review.opendev.org/c/opendev/system-config/+/87647116:35
clarkbinfra-root if you get a chance to look at the new gitea servers I think https://review.opendev.org/c/opendev/system-config/+/876448 is ready to go16:36
clarkbre rax ord maybe the thing to do is set max-servers to 0 and let it clean up after itself16:52
clarkbthen increase the number slowly and see if the timeouts persist16:53
clarkbthe number == max-servers16:53
fungiyeah, i'm looking through the config, we already set boot-timeout: 120 and launch-timeout: 600 across all rackspace regions16:54
fungiwhich is pretty lengthy16:54
clarkbfungi: do we know which timeout we are hitting?16:54
fungithough this is boot-timeout it's running into i think16:55
fungi"Timeout waiting for instance creation"16:55
clarkbboot-timeout is the timeout waiting for openstack to report a node ready iirc. and then launch timeout is time to be able to ssh in16:55
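One way to confirm which of the two timeouts is actually firing is to grep the launcher debug log on nl01 for the message quoted above; a rough sketch, with the log path being an assumption about the usual launcher layout:

    # count boot-timeout failures ("Timeout waiting for instance creation") for rax-ord; log path assumed
    grep 'Timeout waiting for instance creation' /var/log/nodepool/launcher-debug.log | grep -c rax-ord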
fungiyeah, so maybe i'll up that to 300 and see if it helps16:55
fungijust in ord16:55
clarkbwfm16:55
opendevreviewJeremy Stanley proposed openstack/project-config master: Increase boot-timeout for rax-ord  https://review.opendev.org/c/openstack/project-config/+/87659216:58
fungiin theory the launcher should clean up after itself anyway within an hour, if everything else is working. toggling max-servers to 0 and back is probably not going to change that16:59
clarkbwell it was mostly an idea to start small and ramp up with clean data to see if we're our own worst enemy there17:00
fungigranted my samples were random, but it seemed like the launcher was cleaning up behind itself anyway17:01
clarkbfungi: if you have time to review https://review.opendev.org/c/opendev/system-config/+/876448/ I'd love to get that in today.17:07
clarkbI'm thinking I will also drop 01-04 from haproxy manually to see what load looks like on the four new servers17:08
fungisure, sounds great17:12
*** jpena is now known as jpena|off17:13
opendevreviewdaniel.pawlik proposed zuul/zuul-jobs master: Provide deploy-microshift role  https://review.opendev.org/c/zuul/zuul-jobs/+/87608117:22
opendevreviewMerged openstack/project-config master: Increase boot-timeout for rax-ord  https://review.opendev.org/c/openstack/project-config/+/87659217:38
fungithat deployed about 10 minutes ago, so hopefully we'll see the graph there smooth out by 19:00z17:55
clarkbianw: I left some comments on the acl stack. Let me know what you think about the submit-requirement implied mapping problem17:58
opendevreviewMerged opendev/system-config master: Replace gitea05-07 with gitea10-12 in haproxy  https://review.opendev.org/c/opendev/system-config/+/87644818:30
fungiclarkb: mirror.iad3.inmotion.opendev.org seems to be offline again, powered off since 2023-03-03T18:15:28Z (3 days ago), but non-impacting since we zeroed max-servers for that provider. what's the best way to go about figuring out what's broken in there?18:40
fungithis is three times in two weeks, so something is definitely repeatedly killing the instance18:40
fungii guess i can ssh into the nova controller and look at the service logs for clues?18:41
clarkbfungi: the first thing I would check is if the nova api (server show) lists any errors18:41
clarkbyou can run that as our normal user and as admin. I think admin may get more info18:41
fungiserver show doesn't report an error condition, no18:42
clarkbok. In that case I'd probably find the hypervisor and see what nova compute logs and virsh/libvirt/qemu have to say about it18:42
clarkbthere should be instance logs in /var/run/libvirt/something/or/other iirc18:42
fungijust that the power_state is Shutdown, vm_state is stopped, status is SHUTOFF18:42
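For reference, a sketch of a server show narrowed to just those state fields; the cloud and server names here are assumptions, adjust to whatever is in clouds.yaml:

    # non-admin view of the mirror's state (names assumed)
    openstack --os-cloud <inmotion-cloud> server show mirror.iad3.inmotion.opendev.org \
      -c status -c OS-EXT-STS:power_state -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state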
fungiah, as admin. i'll see if we have credentials for that in clouds.yaml already18:43
clarkbfungi: well if it doesn't show an error then admin probably won't 18:44
clarkbfungi: we don't have them in bridge clouds.yaml but they are in a clouds.yaml or an openrc on the hosts themselves18:44
fungino, no error whatsoever, just looks as though someone logged into the server and issued a poweroff18:44
clarkbya so unlikely to be any different listing things as admin18:44
fungii vaguely remember something similar happening to our mirror in the older linaro deployment18:45
fungior might have been the builder18:45
clarkbya I would look in the libvirt/qemu/nova compute logs18:45
clarkbthose are actually two separate things but I would see if they have any hints18:46
jrosseryou would get something like that if the OOM killer terminated the VM?18:47
clarkbjrosser: yup or if the VM hit some sort of nested virt failure (we have nested virt enabled but that vm doesn't do virt, but maybe its tripping it anyway)18:49
fungisure, i suppose dmesg on the compute hosts would be a good first thing to check18:49
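A minimal sketch of that first check, run as root on each compute host (standard kernel OOM messages assumed):

    # look for OOM killer activity with human-readable timestamps
    dmesg -T | grep -Ei 'out of memory|oom-killer'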
fungihow do i find the names/addresses of all the compute hosts?18:49
clarkbfungi: I think you login to the control panel with the secrets file infos and that gives you a listing. Its also the first three IPs after the api endpoint iirc18:50
clarkbthere are a couple of extra hypervisors now too and I don't recall if they are in order too or not18:50
fungioh, there's a control panel? i probably knew that at one point and then forgot18:51
clarkbya there is the baremetal control panel which is separate from the openstack horizon stuff. The details for both are in secrets iirc18:52
clarkbbut that control panel should list all the hypervisors and their IPs18:52
fungithanks. i'll see if i have an opportunity to take a look there in a bit18:52
fungii guess it's been a while since we tried that. the credentials we have on file are giving me "Invalid e-mail address and/or password"18:55
clarkbhrm18:55
fungiunrelated, looks like infra-prod-base failed on deploy for the gitea-lb update, likely due to the inmotion mirror being offline again18:56
fungiso maybe not so unrelated i guess18:56
clarkbya I'm not too worried about that18:56
clarkbit applied the lb update anyway18:56
clarkbfungi: maybe try resetting the password? it should go to the shared email inbox. I just looked and there is/was email sent there18:57
clarkbotherwise we may need to reach out to them for help. For logging into hypervisors they are the three ips after the api endpoint though18:57
clarkb(also you can ssh to the api endpoint and you get load balanced to a random one)18:57
clarkbjimmy might be of some help there too since I think things got split into two companies?18:58
fungiyuris was in here for a while too, i thought18:59
clarkbI think yuris didn't maintain a persistent irc client18:59
clarkbbut ya has been in and out but isn't here now18:59
fungiokay, so the old login url we had in our notes goes to a completely different system now. using the correct (new) url, i'm able to get into the webui there19:01
fungimanage->assets shows ip addresses for servers, though it's unclear what roles they play as they have generic types and randomly generated names19:03
fungii'm guessing the three with more ram are the compute hosts?19:03
clarkbfungi: the way the deployment was made was a converged set of three hosts doing everything. Then to get the max-servers count out I think a couple of compute-only hosts were added19:04
clarkbfungi: but ya the automated deployment system doesn't name things with helpful hints19:04
fungithe assets list has 3x servers with 16 cores and 128gb ram, also 3x servers with 40 cores and 510gb ram19:05
fungiso yeah, i suppose the smaller servers were the ones added to get the additional /28 netblock assignments19:05
clarkbI'm not sure which are which. You'll probably need to use the nova api to help you sort out which host that vm was on19:05
clarkbserver show will give you a host id19:06
clarkbthen there is some nova services listing that will give you the host ids mapped to useful things19:06
fungidoes horizon have admin bits, or is it strictly for non-admin features?19:06
clarkbfungi: `openstack compute service list`19:06
clarkbI don't know. I avoid using horizon as much as possible >_>19:07
fungii was assuming i'd need to use it to get the correct clouds.yaml values, but i guess those don't change for admin context anyway so i can reuse most of what's in our existing clouds.yaml19:08
clarkbfungi: the clouds.yaml is on those hosts19:08
clarkbif you ssh into one of them you'll have access to what is necessary to use the openstack client19:08
fungicatch 22, i don't have ssh keys for them19:09
clarkbyou should. Pretty sure we added your keys when this was deployed19:09
fungioh? i didn't even think to try19:09
fungihah19:09
fungimmm, not as fungi, but root worked!19:10
clarkbyes we don't have specific users on these hosts19:10
clarkb(because all of the deployment is automated by that control panel)19:10
fungi[Fri Mar  3 18:03:51 2023] Out of memory: Killed process 4017237 (qemu-kvm) total-vm:10455440kB, anon-rss:8495744kB, file-rss:0kB, shmem-rss:4kB, UID:42436 pgtables:17752kB oom_score_adj:019:12
clarkbcool. I don't think we are oversubscribing memory in openstack. But we are hyperconverged or whatever19:12
clarkbthat means that maybe the openstack services are using more memory and that is impacting our VM?19:12
fungijrosser wins a cookie19:12
jrosser\o/19:13
clarkbWe might be able to deal with that by tuning max-servers down a bit and also telling openstack to use less resources? Oh also maybe we need to look for leaked resources19:13
clarkbleaked VMs just hanging out might be consuming too much memory or something19:13
jrosseri saw something that felt similar here when we tried to fit exactly 2 giant GPU VMs per host19:13
jrosserand there was not quite enough memory spare with 2 instances plus $everything-else19:14
fungiclarkb: so of the 6 servers in the assets list, one of the smaller type has a slew of oom errors in dmesg and the other 5 are clean19:14
jrosserso the second instance to boot killed the first one19:14
clarkbfungi: ya I think the smaller ones are the control plane19:14
clarkbfungi: another option may be to move the mirror to one of the larger nodes19:14
clarkbsince they are just compute nodes iirc19:14
jrosserif there are things running on some nodes that nova doesnt know about you can use `reserved_host_memory_mb`19:16
clarkbjrosser: I think this deployment is already doing that, but ya memory needs may have expanded beyond that existing value19:16
fungichecking memory utilization, right now mysqld, glance, nova, cinder, ceph et al are taking up around 2/3 of the available memory on this machine. i don't see any qemu processes (unsurprising since we're not booting nodes there and the mirror is presently down), but i think that rules out lost virtual machines taking up excess ram19:17
clarkb++19:17
clarkbwe could also try rebooting/restarting services so that they give back to the operating system19:18
clarkbbut then we run the risk of rabbit getting angry19:18
clarkbbut the cloud isn't in use so the risk to us if that happens is basically nil19:18
fungii wonder if we can exclude this server from use for job nodes?19:21
fungiit has sufficient memory to run the mirror vm, but if we boot more than a few job nodes it won't19:21
clarkbfungi: we could run the mirror there and then set the value jrosser pointed to19:21
clarkbfungi: that seems like a good thing to try19:22
clarkbbasically set it so that nova won't try to run anything else there because the mirror is already consuming those resources19:22
jrosseri wonder how much value there is actually in trying to run VM on those smaller nodes at all19:24
clarkb#status log Manually disabled gitea01-04 in haproxy to force traffic to go to the new larger gitea servers. This can be undone if the larger servers are not large enough to handle the load.19:26
opendevstatusclarkb: finished logging19:26
clarkbinfra-root ^ fyi19:26
clarkbjrosser: well we're severely resource constrained. So the value is in getting as much as we can out of the system19:27
fungithe smaller servers each represent 1/15 (approximately 7%) of our overall memory capacity19:29
fungii'm guessing this one is where most of the central openstack services are parked though, hence much of the overhead being not virtual machines19:30
fungithough oddly no, looking through each of the servers, free reports approximately 24gb available on each of the smaller servers, and 432gb available on each of the larger servers19:32
fungiso there's around 80-100gb occupied by running services on each of those 6 servers19:33
clarkbya so best thing may be to simply edit the reservation that nova avoids19:34
clarkband possibly even prevent nova from launching test vms on those nodes altogether19:34
fungii guess if i look at strictly used (no shmem or buffers/cache) it's more like 50-90gb overhead on each server19:35
fungiwell, strangely it's only the server with the mirror booted on it which was getting into an oom situation19:35
clarkbfungi: probably because it is the only one with a long lived VM on it19:35
clarkbso its going to end up with more memory pressure over time on average?19:36
fungiprobably. also i wonder if adding a gb or two of swap to these would be a terrible idea, just so the kernel can page out infrequently accessed data to free up more for cache19:36
fungiright now none of them has any swap memory at all19:37
fungiclarkb: do you happen to know whether there's a specific reason that was avoided?19:38
clarkbfungi: no, its all done by their tooling19:38
clarkbthe only thing we select is the number of nodes. We don't select partition layouts, IPs, hostnames, etc19:38
fungii suppose they tried adding small swap partitions and determined they weren't much help19:39
fungior created some sort of problem19:40
clarkbor they subscribe to swap is bad and never swap19:40
fungiin good news, the rax-ord graph  looks a bit healthier since 19:00z, but won't really know until the next round of daily periodic jobs kicks off19:43
clarkbthe giteas seem to be working after I reduced their number to 419:48
clarkbI'll leave things like this. If this keeps up I think we should consider reducing total giteas to 4 or maybe 619:49
fungiagreed, we can observe and see if they end up being more/less loaded19:54
ianwclarkb: that's a good point about label-name != submit-requirement, it could be confusing.  i agree we should match on the label in the s-r.  i'll rework that20:17
opendevreviewSteve Baker proposed openstack/diskimage-builder master: A new diskimage-builder command for yaml image builds  https://review.opendev.org/c/openstack/diskimage-builder/+/87624520:26
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Switch run_functests.sh from disk-image-create to diskimage-builder  https://review.opendev.org/c/openstack/diskimage-builder/+/87647920:26
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Document diskimage-builder command  https://review.opendev.org/c/openstack/diskimage-builder/+/87663320:26
clarkbthe vast majority of the gitea demand is on 09 right now. It seems to be keeping up which is cool20:33
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/876449 is the next gitea task. This removes 05-07 from gerrit replication. If you're happy with the new servers so far I think we can land this one20:34
fungiyeah, seems they're managing the current load20:43
clarkbhttps://gerrit-review.googlesource.com/c/gerrit/+/362054 is the change I promised the gerrit community meeting I would write20:49
fungiclarkb: so if we were to set reserved_host_memory_mb in the scheduler config, i guess that would go on whichever server is running nova's controller service? or just on all of them? any idea if the deployment tooling for this has somewhere to register/persist config overrides like that so they survive redeployment?20:51
fungior is there a scheduler configured on each hypervisor host?20:52
JayFclarkb: I don't have an account there, but s/safetey/safety/ on LN 542720:52
clarkbfungi: you would apply it to the server that runs the mirror so that nova doesn't schedule too much workload on that host causing OOMs. I suspect that will go into the nova scheduler/placement databases and no there isn't anything to persist that20:52
clarkbJayF: thanks20:52
JayFclarkb: and thank you <3 documenting undocumented things20:52
fungiokay, so each compute host has a scheduler config? i'm a little lost wading through the nova docs20:53
clarkbfungi: oh does it go into the config file?20:53
clarkbI expected that would have been a runtime thing :/20:53
clarkbfungi: the way the cloud is deployed there is a three node hyperconverged set of nodes. This means they run everything including the control plane, ceph, nova compute and VMs20:54
clarkbthen there are the additional nodes that only run the compute services20:54
fungiwell, web searching turned up references in newton documentation, i'll try to refine my searching20:54
clarkbya looks like it is a compute (not scheduler) config option20:55
clarkbto modify that I think what you are supposed to do is edit the kolla config and do a kolla deployment20:55
fungiso it probably moved to be host-specific after newton: https://docs.openstack.org/newton/config-reference/compute/schedulers.html20:56
opendevreviewMerged opendev/system-config master: Remove gitea05-07 from Gerrit replication  https://review.opendev.org/c/opendev/system-config/+/87644920:56
clarkbI don't want to have to page all that in. I think instead what we might be able to do is turn on the mirror then disable the compute service on the node the mirror runs on20:56
clarkbhttps://docs.openstack.org/nova/latest/admin/scheduling.html#compute-disabled-status-support20:57
fungii guess otherwise we need different host aggregates for the two accounts?20:57
clarkbno the memory reservation should be global. But deploying it requires editing the kolla deployment there and while doable that could lead to a whole bunch of work20:58
fungiwe could set one of the small servers in one aggregate and everything else in the other, then make it so the control plane account only creates servers in the small dedicated aggregate and the nodepool account uses the aggregate that contains the other 5 servers20:58
fungiis what i meant20:58
clarkboh ya that would be another option20:59
fungithat way we'd never try to boot job nodes on the (smaller) server where the mirror is booted20:59
clarkbfungi: https://wiki.openstack.org/wiki/OpsGuide-Maintenance-Compute#Planned_Maintenance20:59
fungiunfortunately i only know about enough to throw that word salad together, not actually how it's done20:59
clarkbI feel like simply disabling the nova compute there is the easiest thing21:00
clarkbits a bit hacky but it should work21:00
clarkbjust the first step there is what we would need21:01
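The first step there boils down to a single service-disable call once the mirror is up; a minimal sketch, with the compute host name left as a placeholder:

    # stop nova from scheduling new instances on the host that runs the mirror
    openstack compute service set --disable \
      --disable-reason "reserved for long-lived mirror VM" <compute-host> nova-compute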
fungiagreed. it's also reminding me why we prefer not to run our own openstack clouds21:01
clarkbfungi: I think you need to start the mirror node before you do that though21:01
fungiif we set the host into maintenance mode, will that block us from (re)booting the mirror there?21:01
clarkbbasically start the mirror node, disable the compute on that node, then tell nodepool it can use things again21:01
fungiyeah, that's what i was wondering21:02
clarkbfungi: maybe? I don't know if it completely breaks the ability to manage a running instance21:02
clarkbthe docs there show migrate commands being valid so maybe not21:02
clarkbmelwitt: ^ would likely know21:02
fungibut if it stops the instance later for some reason, starting it again may involve toggling that21:02
clarkbya worst case we just turn it back on I guess21:02
clarkbI don't think we should overthink this since we should likely consider redeploying that cloud for other reasons anyway. Do the simple thing that works, then try to incorporate what we've learned when we start over21:03
clarkbfungi: another option would be to migrate it to one of the bigger nodes then see if the test nodes are ok on that host21:09
clarkbbut that likely requires more observation. The plan to just dedicate that node to the mirror seems simplest21:09
clarkboh darn the gerrit replication update failed to update due to the base job failures. I think thats fine and we can wait for our daily run to update it and then make sure things are happy tomorrow21:14
clarkbassuming we fix the mirror in the meantime. Otherwise maybe we add the mirror to the emergency file so that it is skipped for now21:14
melwittclarkb: not sure I got the exact question but disabling a compute service should only prevent any new instance being scheduled to it. it should not affect instances that are running there already21:15
clarkbmelwitt: that is what we needed to know. Thanks! Basically we've got a compute + control plane node that has limited memory due to the control plane. We want to run a single long lived VM there and force other things to boot elsewhere. Disabling the compute service there seems like an easy straightforward way to do that21:15
clarkbthere are other more elegant tools but they are all a bit more involved and require us to understand running and deploying the cloud better21:16
melwittclarkb: ah gotcha. I think that should work21:16
fungiawesome. i'll see what i can do to make that happen21:18
fungimmm, i get "Failed to set service status to disabled" but it doesn't give me a reason21:43
fungii tried with the hostname listed in the assets, and also with its public ip address21:44
fungipossible i'm not specifying it correctly, since if i put random garbage in place of the hostname i get the same error21:45
fungiand i can't `compute service list` as it returns a policy rejection21:45
fungi(this is all with the admin creds listed in our notes)21:46
fungi`compute agent list` is similarly rejected by policy21:46
fungimight be due to clouds.yaml including a project_name, project_id and user_domain_name which are probably irrelevant for admin? but commenting them out i get a message that the service catalog is empty21:49
fungihypervisor list also disallowed by policy21:51
clarkbfungi: have you tried using the creds on the host? Maybe they differ in some way that is important22:10
clarkbalso check the history on the nodes for what I've done in the past? I seem to recall some things are weird about admin and you have to be explicit about it22:11
clarkbfungi: `source /etc/kolla/admin-openrc.sh`22:16
clarkband `source /opt/kolla-ansible/victoria/bin/activate` to get the built in openstack client22:16
clarkbI'm able to run `openstack compute service list` after doing that22:16
ianwclarkb: if you have a sec to double check the syntax in https://review.opendev.org/q/topic:f37 system-config stuff, i can monitor.  pretty mechanical, just swapping the mirroring from 35/36 to 36/3722:33
clarkbianw: re https://review.opendev.org/c/opendev/system-config/+/876488/1 I feel like every time we look this up someone is still using it?22:34
clarkbI mean they really shouldn't but...22:35
ianwclarkb: i think that the problem used to be old branches, but now victoria isn't even using it 22:35
ianwit's switched to coreos or whatever it's called now22:35
clarkbright but ussuri and stuff still exist? 22:36
clarkbfwiw I want to delete it because its one of my biggest complaints with magnum as a user. It relies on ancient tools by the time you actually get deployed in production22:36
ianwno i think it's all retired now, at least the branches aren't there in gitea22:36
clarkboh huh I guess the openstack branch cleanups finally got rid of those22:37
clarkbok ya if the older branches are gone then this should be fine. And really if we end up forcing the issue for any stragglers that's probably a good thing at this point22:37
ianwso i don't think it's an issue for CI at least22:37
fungioh, got it, i was using the creds from our notes. i'll revisit with what's on the servers22:38
clarkbfungi: its possible what is in our notes was from an early iteration of the cloud that got wiped and replaced? THat happened a couple of times and I don't recall when those creds were written relative to that22:39
fungimakes sense, yeah22:39
clarkbianw: related: I think we can drop xenial-* from the ubuntu ports mirror pretty safely22:42
ianwyeah, i'm not sure we ever published a xenial image22:42
clarkbfor some reason we have like 1.4GB of thunderbird packages in the centos 8 stream repo22:45
clarkbshouldn't it be clearing out old versions of that?22:45
clarkbhttps://mirror.bhs1.ovh.opendev.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/ ?22:45
clarkboh its closer to 3GB due to aarch64 almost doubling it22:46
clarkbdebian-security needs a quota bump22:47
ianwit is the same as upstream, at least, http://mirror.centos.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/22:47
clarkbweird22:47
clarkbre debian-security stretch is still in there. I think we can drop that now too?22:48
fungiyeah, should be able to, maybe it missed a manual cleanup step22:48
clarkbI'm trying to remember what the process is to drop things from reprepro. You pull them from the config then do manual syncing?22:48
fungialso keep in mind we're probably a little over a month from the bookworm release22:48
clarkbstretch is still in the regular debian repo too22:49
clarkbI think we removed stretch from our reprepro configs but then didn't remove it from the mirror?22:50
clarkbthough its not clear to me if the packages are in the pool or not22:51
clarkbspot checking the 0ad package I don't see stretch's version in the pool. It's just listed under dists so I think that is not going to reclaim a bunch of space22:52
ianwi do feel like that got cleared ...22:52
clarkbianw: ya I think this is just stale since the reprepro cleanup won't remove the indexes22:52
clarkbI'm checking the -security side next22:52
ianw2021-11-12  debian-stretch : merge and babysit removal changes22:52
ianwmanual cleanup reprepro for debian/debian-security22:53
ianwfrom my notes, so it would have happened about then22:53
clarkbyup both seem to lack stretch packages in the pools22:54
clarkbits just the index stubbed out22:54
clarkbwe're probably going to need to bump the quota for debian-security in that case22:54
clarkbI guess packages that go into -security eventually end up in the regular repos but don't get cleaned out of -security?22:58
clarkbfungi: ^22:58
clarkbianw: hrm I think xenial in ubuntu-ports is in the same situation23:00
clarkbthe indexes are stubbed out but we don't seem to configure reprepro to mirror it and if I cross check Packages and pool the packages are not present23:01
ianw2021-10-29 : remove ubuntu-ports xenial23:01
ianw    `<https://review.opendev.org/c/opendev/system-config/+/815914/>`__23:01
clarkbdo we just need to rm the pool/ dirs?23:01
clarkber sorry the dists/ dirs23:01
ianwthis is what i would have done ... https://review.opendev.org/c/opendev/system-config/+/815920/2/doc/source/reprepro.rst23:02
clarkbit won't save a lot of space but could cut down on confusion23:02
ianwit may well be that the clearvanished doesn't remove those 23:02
clarkbya I think I ran into this with something else23:02
clarkbI think clearvanished only cleans up pool/ but not dists/. fungi do you recall and is simply rm'ing those dirs out of the dists/ dir the right step?23:03
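If we do go the manual route, a sketch of what it might look like (the confdir and AFS paths are assumptions, and the dists/ removal is just deleting the stale index stubs left behind):

    # drop leftover reprepro database entries for releases removed from conf/distributions
    reprepro --confdir /etc/reprepro/debian-security clearvanished   # confdir assumed
    # then remove the orphaned index directories by hand
    rm -r /afs/.openstack.org/mirror/debian-security/dists/stretch*  # path assumed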
clarkbseparately the growth on these mirrors over the last 3 months is crazy for what are not quite static but also fairly stable distros23:03
ianwclarkb: did you have thoughts on https://paste.opendev.org/show/bmy36m7TO2b4cpndO5gO/ , which is the All-Projects update?  23:08
fungiclarkb: i don't know, not sure we've ever tried that23:08
clarkbianw: sorry I missed that because it isn't in a change. Was there a docs change though? maybe not in that stack?23:09
fungias for growth, i wonder if it's divergence between sid and bookworm since the freeze started? but we don't mirror either of those to my knowledge, so likely unrelated23:10
ianwclarkb: yeah, basically the same thing to the bootstrap docs -> https://review.opendev.org/c/opendev/system-config/+/876237/ & https://review.opendev.org/c/opendev/system-config/+/876236/123:10
clarkbianw: copyCondition = changekind:NO_CODE_CHANGE OR changekind:TRIVIAL_REBASE OR is:MIN <- is the only thing that jumps out since previously we were only copying min and trivial rebases. Not no code change. I think the difference is that we'll preserve votes if the commit message changes and we probably shouldn't do that?23:11
clarkbianw: also All-Projects is probably something to update after the openstack release since it will have broad impact?23:12
ianwwell i guess the theory is it either has no impact or we roll it back immediately23:12
ianwcertainly not a push and walk away thing23:13
clarkbI left a note on the change about the copycondition23:14
clarkbya thats a good point23:14
ianwgood point i'm wondering why i put no_code_change23:14
clarkbI'm putting the agenda together. ACL things are on there (though I may need to update some links). fungi do you want me to add the rax-ord and inmotion mirror stuff?23:15
opendevreviewIan Wienand proposed opendev/system-config master: doc/gerrit : update copyCondition  https://review.opendev.org/c/opendev/system-config/+/87623623:23
opendevreviewIan Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements  https://review.opendev.org/c/opendev/system-config/+/87623723:23
clarkbianw: oh heh I just left a comment about the overrides on https://review.opendev.org/c/opendev/system-config/+/876237 too23:24
clarkbis it only infra-specs that does that? If so I think we can "break" things for infra-specs and force us to leave a code review and a rollcall23:24
ianwi think there's two that do that -- but i think it's copied from infra-specs, let me check23:25
ianwyeah -> https://review.opendev.org/c/openstack/project-config/+/875804/4/gerrit/acls/openstack/governance.config23:25
clarkbianw: if its just those two I think we can ask the openstack tc to see if they are ok leaving a +2 as well23:26
clarkbI like not having overrides for unnecessary special behaviors :)23:27
opendevreviewMerged opendev/system-config master: mirror-update: stop mirroring old atomic version  https://review.opendev.org/c/opendev/system-config/+/87648823:27
ianwif we want to override, I think the submit-requirement needs to be called "Code-Review"23:27
clarkbI'm not sure the actual submit-requirement names mean anything?23:28
clarkbbut maybe thats what they mean by override in this case is replacing a named submit-requirement and not changing the submittableIf?23:28
ianwonce again the docs are a bit unclear23:28
clarkbagreed :)23:28
ianwhttps://gerrit-review.googlesource.com/Documentation/config-submit-requirements.html#inheritance23:28
ianw"administrators can redefine the requirement with the same name in the child project and set the applicableIf expression to is:false"23:29
clarkbaha so ya I guess the name is used to know what to override23:29
clarkbX is replaced by X'23:29
ianwthat was what made me think it looks at the name and basically overwrites23:29
clarkbbut you could have two different submit requirements that look at code-review submittableif conditions and only override one or the other23:30
ianwi'm not sure it would match that.  i feel like it would treat each s-r totally separately?23:32
clarkbya I think so23:32
clarkbits just confusing because labels and submit requirements don't have a 1:1 relationship but overrides do23:33
ianwwhat i can do is send a doc update change that explains it the way i think it works.  which can either be accepted or rejected, assuming anyone wants to review it23:33
clarkband the docs don't really go into a ton of depth here :/23:33
ianwI think i'm in agreement that the best way to avoid problems overriding code-review is to just not play the game.  i'll double check what's doing that, and i think we can probably propose to change those ACL's as a first step23:34
clarkb++23:35
clarkbianw: we control acls through centralized code review which makes overrides pretty safe. But ya I agree23:40
fungialso we have a linter written in a turing-complete language which can essentially enforce whatever policies we want to enact around that23:44
ianwthis is true, but i wondered how far to go with the normalization script with this23:45
ianwi mean i could make it convert what we have to submit-requirements, but it seems like overkill 23:45
clarkbif its really only a small number of situations then I think simplifying is a good thing23:45
clarkbits not the end of the world to leave a +2 and a +123:46
clarkbfungi: re inmotion I'm still ssh'd in there; should I disable the compute service on the node running the mirror, then we can start it again?23:46
ianwthere's actually 4 that do it23:48
ianwopendev/infra-specs.config openinfra/transparency-policy openstack/governance.config starlingx/governance.config23:48
fungiclarkb: oh, if you want to please go for it. i hadn't freed back up sufficiently to revisit it yet and my evening is encroaching23:49
fungii can probably get to it in my morning tomorrow otherwise23:49
clarkbI'm hoping to get it done today so that the base job runs successfully23:49
clarkblet me give it a go23:49
fungioh, i didn't think about it blocking the base job, we can also add the mirror there to the emergency disable list temporarily so it will be skipped23:50
clarkbya but this should go quickly once I identify the compute node hosting the mirror23:51
fungi.130 (parakeet) was the one with all the oom events23:51
ianwhttps://review.opendev.org/c/openstack/project-config/+/185785 has the thinking behind it23:52
ianwgiven the context there, it being a thought-out approach to comments on TC issues, i'm not sure i'd want to argue for not overriding Code-Review 23:55
clarkbis there no good way to go from a hostId to the compute list?23:56
clarkbthis seems like it should be trivial and yet23:56
fungihostid is hashed by the tenant/project for privacy reasons23:57
fungiso the same host has a different hostid depending on which project the user querying it belongs to23:58
clarkbhow are you supposed to use it then?23:59
fungii remember this coming up when we originally asked the nova devs to expose a host identifier to normal users23:59
fungii have to believe there's an admin function to convert it or look it up23:59
fungibut i've never been on that end of the situation23:59
clarkbI can't show the instance as admin because the project stuff is all wrong too it seems like23:59
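For what it's worth, if the admin scoping can be sorted out, the extended server attributes sidestep the hashed hostId entirely; a sketch:

    # admin-only fields that name the actual compute host for an instance
    openstack server show <server-uuid> \
      -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:hypervisor_hostname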
