Monday, 2023-09-11

*** dtantsur_ is now known as dtantsur00:08
opendevreviewDr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking  https://review.opendev.org/c/openstack/project-config/+/89428505:03
fricklergtema: ^^ fyi amended to cover all repos, which I assume is what you intended?05:04
fricklerI also noticed that ansible-collections-openstack formally still belongs to the ansible sig, which iiuc has effectively disbanded. would it make sense to move it to sdks proper?05:05
opendevreviewDr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking  https://review.opendev.org/c/openstack/project-config/+/89428505:26
gtemafrickler: thanks. Wrt ansible-collections-openstack: I guess yes, the move makes sense05:46
opendevreviewMartin Magr proposed openstack/project-config master: Add python-observabilityclient  https://review.opendev.org/c/openstack/project-config/+/89454113:47
clarkbfungi: midday ish my time may be good for https://review.opendev.org/c/opendev/system-config/+/894382 to update gitea to the latest version? I have a hopefully short optometrist visit (no pupil dilation) in about an hour, so I'm thinking after that15:16
clarkbfungi: also today is the day we said we would do https://review.opendev.org/c/openstack/project-config/+/893963/1 and child to clean up fedora15:19
clarkbmaybe start with those if you have a moment to review them ?15:20
opendevreviewBernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects  https://review.opendev.org/c/zuul/zuul-jobs/+/88791715:59
opendevreviewBernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects  https://review.opendev.org/c/zuul/zuul-jobs/+/88791716:33
fungiclarkb: yep, sounds good. taking a look shortly16:47
clarkbI'm back from getting my eyeballs examined. Happy to keep an eye on any of those three changes if they get approved17:34
clarkbor address review comments if changes need to be made17:34
opendevreviewMerged openstack/project-config master: Remove fedora-35 and fedora-36 from nodepool providers  https://review.opendev.org/c/openstack/project-config/+/89396317:34
clarkbwe should approve https://review.opendev.org/c/openstack/project-config/+/893964/ after ^ looks good though17:34
fungii need to pick up a few things from the hardware store and grab lunch while i'm out, but can help test new gitea in a bit once i'm back (hour-ish) if that works?17:39
fungithat'll probably also be enough time to know if we're ready to proceed with the fedora image removal step17:40
fungiokay, headed out, back in about an hour17:41
clarkbsounds good17:55
clarkbsorry I got distracted by stuff around the house but I'm not going anywhere so that plan sounds good17:55
clarkbI think 893963 has applied and nodepool is continuing to run happily with the new config. There are no fedora nodes either. Now to check if the images have been cleaned up from the cloud providers18:25
clarkbon the nodepool side of things there is a single inmotion fedora image that appears to be failing to delete. I think we can probably just remove that image from the zk db and then figure out cleaning it up from the cloud another time18:27
clarkbspot checking rax regions there are a handful of fedora images that show up there still that aren't in nodepool18:28
clarkbso ya I think the next step is to clean up fedora-36-1662540204 for inmotion on the nodepool side, then we can merge the next change safely18:28
clarkbthen we can do manual cleanups of any remaining fedora images18:28
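A rough sketch of that spot check, assuming clouds.yaml credentials are already configured (the cloud and region names here are illustrative):

    # look for leftover fedora images in a given provider/region
    openstack --os-cloud rax --os-region-name DFW image list | grep -i fedora

    # compare against what nodepool still tracks
    nodepool image-list | grep -i fedora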
fricklerinmotion had a big bunch of very old images failing to delete last time I looked, maybe check those, too, when you're done with fedora18:31
clarkback18:32
clarkbchances are we have to login as admins and forcefully delete some things then nodepool will notice they are gone18:32
clarkb/nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 <- that appears to be the znode to remove from the zk db18:33
fricklerif you tell me how I can do that I can have a look tomorrow18:33
fricklernodepool image-list|grep deleting|wc => 19, all in inmotion, some > 1y old18:33
clarkbfrickler: are you interested in the zookeeper bit or the inmotion thing? For zookeeper you login to one of the three nodes and then use the zk-shell tool (I have it installed in a venv called venv in my homedir) to connect to the zk server. Then you can use simple commands like ls, cd, get, rm to manipulate the db18:35
fricklerI was talking about inmotion, sorry for the overlap18:35
clarkbin this case what I've done is ls and cd around to find what looks like the correct znode, ending up at the path above. Then I ran `get that_path` on it to confirm the data inside the node.18:35
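For reference, the zk-shell session being described looks roughly like this (a sketch; the venv path and znode path follow the ones mentioned above):

    # from one of the zookeeper nodes, connect with zk-shell
    ~/venv/bin/zk-shell localhost:2181

    # inside the shell: navigate and inspect before removing anything
    ls /nodepool/images/fedora-36/builds
    cd /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images
    get 0000000001    # confirm the data matches the image we expect to remove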
clarkbah18:35
clarkbfor inmotion we have ssh keys on the servers (I can check that yours is there) and we login, then it's a kolla setup. The kolla vars give us account details and there is an openrc to source if you use the cli tools18:36
clarkbUsually what I do there is login, source the appropriate admin bits then start poking around and learning because I'm not a real openstack admin :)18:37
clarkbin this case I suspect it will be doing openstack image/glance commands as admin to delete the image18:37
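A sketch of that admin-side cleanup, assuming the kolla host provides an admin openrc (the path below is the kolla-ansible convention and may differ):

    # on the inmotion kolla host, load admin credentials
    source /etc/kolla/admin-openrc.sh

    # locate the stuck image and try deleting it as admin
    openstack image list | grep -i fedora
    openstack image delete fedora-36-1662540204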
clarkbfungi: I have not deleted that znode yet. I'm going to eat lunch soon but maybe you can look to see if it seems correct, then we can remove it18:38
fricklerwell kolla is daily business for me, so if I can login, I hope that should be manageable18:38
clarkbfrickler: cool I see you aren't in the authorized keys list yet. I'll add you to the servers (this is openstack as a service so outside our normal ansible) and PM you the IP list18:39
clarkbI'll use the same key that you have in system-config18:39
fricklerthat should work, thx18:40
fungiokay, back, sorry19:12
fungitook a few minutes longer than i projected19:12
fungii agree nodepool is looking no worse after the label removal19:13
clarkbfungi: I think the main thing is confirming that znode should be deleted then deleting it. Then we can merge the second change to remove the diskimage config19:14
fungiyep, looking now19:15
fungizk-shell json_cat that znode does indeed indicate that it's trying to delete an image called fedora-36-166254020419:18
fungiso i'm good with manually removing that19:18
clarkbfungi: ok do you want to do it or should I?19:19
fungii'm happy to19:19
clarkbI think once that single znode is removed nodepool should clean up the other znodes related to that image?19:19
clarkbgo for it19:20
clarkbunless we want corvus to weigh in first19:20
clarkbI suppose there is some risk we break locking or something19:20
fungii did /zk-shell rmr /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/000000000119:20
fungifingers crossed that didn't break anything19:21
clarkbit's probably fine19:21
clarkbI seem to recall doing this in the past for the same reason19:21
fungijson_cat says "Path /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 doesn't exist" so it's definitely gone now19:21
clarkbfungi: what about ls /nodepool/images/fedora-36/builds/000000002219:22
clarkbsince that's the bit that should go away once nodepool cleans up the image as a whole19:22
clarkbfwiw nl02 does the launcher for inmotion and seems to be running happily19:22
fungi"Path /nodepool/images/fedora-36/builds/0000000022 doesn't exist"19:22
clarkbperfect that is what we want19:22
fungi`zk-shell ls /nodepool/images/fedora-36/builds/` returns two uuids and a lock19:23
clarkbya I think those uuids may be really old?19:23
clarkbthey don't seem to hurt anything and image-list shows no images19:24
clarkbI think we can proceed with the next change19:24
fungi/nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/ has most of our provider regions listed but may just be cruft19:24
clarkbfungi: I think fedora-34 etc still have entries at /nodepool/images/fedora-34 too19:25
clarkbbut we haven't had those images in a while. I think this may just be stuff nodepool doesn't fully clear out?19:25
fungibut /nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/inmotion-iad3/images/ just has a lock in it, apparently19:25
fungitaken as an example19:25
fungiyeah, we even have fedora-31 there still19:26
clarkbshould we approve https://review.opendev.org/c/openstack/project-config/+/893964/ ?19:31
fungiyeah, i think that's safe19:37
fungiclarkb: how about 894382? i'm around to watch it19:38
clarkbfungi: ya I think we can approve that one too19:38
clarkbI'm around as well19:38
fungidone19:38
opendevreviewMerged openstack/project-config master: Remove fedora image builds  https://review.opendev.org/c/openstack/project-config/+/89396419:49
opendevreviewClark Boylan proposed opendev/system-config master: Cleanup the Fedora 36 mirror content  https://review.opendev.org/c/opendev/system-config/+/89457520:00
opendevreviewClark Boylan proposed opendev/system-config master: Remove ara from source install option  https://review.opendev.org/c/opendev/system-config/+/89457620:05
opendevreviewClark Boylan proposed openstack/project-config master: Remove ara from Zuul config  https://review.opendev.org/c/openstack/project-config/+/89457720:08
clarkbI don't actually know if released ara can work with dev ansible, which may be why that is done20:08
clarkbbut I figure we can remove it anyway and if it is a problem we can install from source using a shallow clone or something along those lines (won't have depends-on integration but I don't think we need that for ara)20:08
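If a source install is ever needed again, the shallow-clone approach could look like this (a sketch; the upstream repository URL should be double-checked):

    # shallow clone avoids pulling the full history
    git clone --depth 1 https://github.com/ansible-community/ara.git
    pip install ./ara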
clarkbthere are no fedora disk images listed by nodepool dib-image-list now as well. I think that clean up is happy20:14
fungiyeah, that looks right to me20:15
fungi#status log Requested delisting for lists.katacontainers.io IPv4 address from SpamHaus PBL21:04
opendevstatusfungi: finished logging21:04
opendevreviewMerged opendev/system-config master: Update to gitea 1.20.4  https://review.opendev.org/c/opendev/system-config/+/89438221:15
fungiwatching for the deploy now21:16
clarkbhttps://gitea09.opendev.org:3081/opendev/system-config is the url to watch and then in order up to gitea1421:17
clarkband the first one (gitea09) is done21:21
clarkblooks good at first glance21:21
clarkball are done now and look good21:34
clarkbthe deployment job reported success as well21:35
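A quick way to spot check all six backends after an upgrade like this (a sketch, assuming the version string still appears in each gitea page footer):

    # confirm each backend reports the new version
    for n in 09 10 11 12 13 14; do
      echo -n "gitea${n}: "
      curl -s "https://gitea${n}.opendev.org:3081/" | grep -o 'Version: [0-9.]*' | head -1
    done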
clarkbdoes anyone know if we've got a change to set the nodepool image upload timeout now that it is configurable?21:38
clarkbI've just made updates to the team meeting agenda. Please add anything that is missing and I'll send that out later today21:39
fungiyeah, they're working for me21:44
clarkbI've manually cleaned up fedora images across rax, ovh, inmotion, and vexxhost regions. There were no arm64 fedora images21:56
clarkbthere are three images that I couldn't remove. Two in ovh gra1 that are in a deleted state so can't transition to a deleting state and the one in inmotion that we identified earlier21:57
clarkbthe one in inmotion appears to fail because glance says it is in use so we may have a leaked node that we need to cleanup too21:57
clarkbfrickler: ^ fyi. Fwiw that node doesn't show up in nodepool listings so we should be able to more forcefully remove it, then the image, on the inmotion side21:58
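Tracking down whatever still holds that in-use image could look like this (a sketch, run with admin credentials; the image name follows the one identified earlier):

    # confirm the image and get its id
    openstack image show fedora-36-1662540204 -f value -c id

    # list servers in any project still booted from that image
    openstack server list --all-projects --image fedora-36-1662540204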
fungiwould be great if glance had something like a "flag for cleanup" option so that images used for bfv (boot from volume) could be automatically deleted once their reference count drops to 022:02
fungiand then some way of indicating in image listings that the image will be cleaned up as soon as it is no longer in use22:02
clarkbit would also be cool if openstack grew the idea of applying alerts to resources from the user side22:09
clarkbthen instead of needing to file a ticket nodepool could, after say 10 failed attempts to do $X, apply an alert to the resource and then the cloud could sweep through them periodically and take appropriate action22:10
clarkbopenstack server alert foo. then cloud looks at foo and sees it is in a deleting state and 10 deletion requests have been made that all failed so go ahead and make that happen somehow22:10
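None of this exists in OpenStack today; a purely hypothetical CLI for the idea might look like:

    # hypothetical: user flags a resource stuck in deleting for operator attention
    openstack server alert set <server-uuid> --reason "delete failed 10 times"

    # hypothetical: operators sweep flagged resources periodically
    openstack server alert list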
JayFclarkb: we had a downstream patch at [former purple employer] to nova, called 'breakfix'. If you issued a 'breakfix' against your instance, it filed a ticket in our systems to fix it and maintenance'd the underlying Ironic node with the reason you provided22:13
JayFclarkb: so there is absolutely an audience for that kind of feedback mechanism22:13
JayFclarkb: that's in the same spirit of how Ironic is starting to hook up project information to allow folks to self-serve some maintenance tasks from Ironic's (formerly admin-only) API22:13
clarkbya I think the tricky bit in designing that would be coming up with something general enough that it can reasonably and effectively tie into remediation systems that already exist within orgs22:15
clarkbmaybe that is as simple as a flag on the resource then you can do an api query or sql query to generate a list22:15
JayFArguably we already have the plumbing for this to be done as a sidecar in oslo.messaging notifications support22:17
JayFwe used that extensively at a couple of places to do reporting and failure detection22:17
fungia big counterargument to this is: these are cases where openstack has broken down, users shouldn't have to inform the cloud's operations team of that22:21
JayFfungi: that's more of an argument for the notification-based approach22:22
fungiimages or servers in deleting+error state are quite clearly broken22:22
JayFfungi: in either event, I think there's "space" for a sidecar project to try and help manage operations of OpenStack; I know because we've built one independently literally everywhere I've worked that's run it22:22
fungithe user has asked to delete something, the cloud didn't refuse the delete request and rather went into an error state when the deletion failed22:22
JayFbut the question is basically whether or not such a sidecar project can be generally useful, or if they are useful *because* they were bespoke with business logic baked in22:23
fungiyeah, that i don't know. it's more that shit clearly broke, waiting for users to tell ops that something broke is sort of backwards22:24
fungiparticularly when it's me opening a ticket that says something like "your api is telling me this broke, can you please do something to unbreak it"22:25
fungiideally the services would just notify them directly of these things and not have to wait for the user to pass that message along22:26
fungior, better still, not break. but i know that's probably asking a lot ;)22:27
JayF> fungi | ideally the services would just notify them directly of these things and not have to wait for the user to pass that message along22:28
JayFthis is not the hard part22:28
JayFwe have those hooks *today* with oslo.messaging notifications22:28
fungiit's one thing when it's our community infrastructure we don't actively monitor and volunteers patch up problems on a best effort basis, but something else entirely when it's a commercial product and their paying customers are having to reach out to let them know about a problem the software should have told them about before the customer even noticed22:28
JayFthe hard part is helping the cloud operator sort through those notifications and surface the 'real problems' over the noise22:29
fungiyeah, makes sense22:29
JayFit's just not clear to me (yet) if there is enough commonality to actually attack that problem versus it being a shop-to-shop sorta thing22:29
JayFbecause I can tell you, what that looked like at Rackspace didn't look like it did at Yahoo, and neither looks like it does downstream here22:29
fungisure, i get that22:30
fungino two deployments are the same22:30
fungi"there is no such thing as vanilla openstack"22:30
JayFI think it's even deeper than that22:31
JayFfailure tolerance is a good example; in some use cases failures are "tolerated" by just papering over them with more infrastructure/redundancy elsewhere22:32
JayFsome places have a strong sense of "this has to run here" and try to enforce that even when we try to disallow it22:32
JayFe.g. at Rackspace, a single provisioned ironic node going down was a big deal as an outage for a customer; but at an HPC shop it might be noise unless the failures get over a certain %22:32
fungii'm thinking more in terms of flagged error states for unusable-but-undeletable resources occupying the customer's quota22:33
fungiif the user tells the cloud to delete something, and then the deletion fails and the resource remains in an indefinite "deleting" state, i'm not sure what decisions there are for the user to make at that point other than wait for the ops to notice or open a ticket asking they fix whatever resulted in the error state so that the resource deletion can proceed22:36
fungimaybe that's an uncommon corner case, but it seems to happen a lot for us (with servers, images, sometimes fips or networks)22:36
JayFThat is a particularly painful case for many (and has some security-related badness around it); I think that solving the general case of surfacing anomalies has value more so than pointing at obvious broken cases22:37
clarkbfungi: ya I agree that the ideal state is that customers don't need to be pushing it but maybe that sort of signal is a compromise between reality and ideal22:50
JayFclarkb: fungi: one thing that pushes even more in that direction is that not all failure cases are avoidable or fixable by openstack-the-software (e.g. network failures in a portion of the datacenter) .... but they still tend, IME, to be *blamed* on openstack-the-software22:52
JayFI tell folks who operate OpenStack at scale that OpenStack becomes the messenger for all bad news in your environment. Unless you are doing perfect operational monitoring, you will find a large number of outages will be first seen by a user in an openstack error message.22:52
JayFPushing back against the negative perceptions that can create is difficult, too.22:53
fungiyeah, it just makes it more likely for users to blame the software if they have to report persistent error states to the operators22:53
fungiwell, either blame the software or blame the people operating it, anyway22:54
clarkbmaybe openstack should provide tools for monitoring these unexpected state changes22:54
clarkbkinda like what we did with the logstash rules and queries once upon a time22:54
clarkbthat is still being done but the idea you could run the same set of queries against real clouds has largely died out I think22:55
opendevreviewGoutham Pacha Ravi proposed openstack/project-config master: Add manila-core to osc/sdk repo config  https://review.opendev.org/c/openstack/project-config/+/89460523:21
clarkbfor ^ I wonder if we shouldn't make a new group called openstacksdk-reviewers, give that group +/-2 and then that group can add subgroups as necessary23:25
clarkbthen we don't need to be gatekeepers of those changes for sdks23:25
fungithat was the original suggestion, but gtema expressed a preference for there being an audit log in public git23:48
fungiand said something like "there won't be that many"23:49
