Wednesday, 2023-03-01

03:50 <opendevreview> Matthew Oliver proposed openstack/swift master: Proxy: restructure cached updating shard ranges  https://review.opendev.org/c/openstack/swift/+/870886
03:51 <opendevreview> Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support  https://review.opendev.org/c/openstack/swift/+/874721
05:32 <opendevreview> Matthew Oliver proposed openstack/swift master: updater: add memcache shard update lookup support  https://review.opendev.org/c/openstack/swift/+/874721
05:32 <opendevreview> Matthew Oliver proposed openstack/swift master: POC: updater: only memcache lookup deferred updates  https://review.opendev.org/c/openstack/swift/+/875806
07:35 <opendevreview> Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses  https://review.opendev.org/c/openstack/swift/+/875819
15:02 <opendevreview> Alistair Coles proposed openstack/swift master: sharder: show path and db file in info and debug logs  https://review.opendev.org/c/openstack/swift/+/875220
15:02 <opendevreview> Alistair Coles proposed openstack/swift master: sharder: show path and db file in warning and error logs  https://review.opendev.org/c/openstack/swift/+/875221
16:22 <reid_g> Hello, I recently did some OS upgrades (18.04 > 20.04) and now one of my nodes is spitting out tons of reconstructor messages "Unable to get enough responses (1/N) to reconstruct non-durable" followed by "Unable to get enough responses (X error responses) to reconstruct durable" for the same object. It seems like maybe there is some old data on this server. Now all of the servers in the cluster are showing thousands of handoffs. Any thoughts?
16:30 <reid_g> It seems like these one-off fragments are being pushed around to other hosts for some reason.
17:07 <opendevreview> Tim Burke proposed openstack/swift master: Add --test-config option to WSGI servers  https://review.opendev.org/c/openstack/swift/+/833124
17:07 <opendevreview> Tim Burke proposed openstack/swift master: Add a swift-reload command  https://review.opendev.org/c/openstack/swift/+/833174
17:07 <opendevreview> Tim Burke proposed openstack/swift master: systemd: Send STOPPING/RELOADING notifications  https://review.opendev.org/c/openstack/swift/+/837633
17:07 <opendevreview> Tim Burke proposed openstack/swift master: Add abstract sockets for process notifications  https://review.opendev.org/c/openstack/swift/+/837641
19:30 <opendevreview> Alistair Coles proposed openstack/swift master: WIP: Allow internal container POSTs to not update put_timestamp  https://review.opendev.org/c/openstack/swift/+/875982
20:52 <mattoliver> reid_g: has the crc library changed? https://bugs.launchpad.net/swift/+bug/1886088
21:00 <timburke> #startmeeting swift
21:00 <opendevmeet> Meeting started Wed Mar  1 21:00:50 2023 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00 <opendevmeet> The meeting name has been set to 'swift'
21:01 <timburke> who's here for the swift meeting?
21:01 <zaitcev> o/
21:01 <indianwhocodes> o/
21:01 <mattoliver> i'm kinda here, have the day off today so that means I'm on getting kids ready for school (however that works) :P
21:02 <timburke> i didn't get around to updating the agenda, but i think it's mostly going to be a couple updates from last week, maybe one interesting new thing i'm working on
21:03 <timburke> #topic ssync, data with offsets, and meta
21:03 <acoles> o/
21:03 <timburke> clayg's probe test got squashed into acoles's fix
21:03 <timburke> #link https://review.opendev.org/c/openstack/swift/+/874122
21:04 <timburke> we're upgrading our cluster now to include that fix; we should be sure to include feedback about how that went on the review
21:05 <timburke> being able to deal with metas with timestamps is still a separate review, but acoles seems to like the direction
21:05 <timburke> #link https://review.opendev.org/c/openstack/swift/+/874184
21:06 <acoles> timburke persuaded me that we should fix a future bug while we had this all in our heads
21:06 <timburke> the timestamp-offset delimiter business still seems a little strange, but i didn't immediately see a better way to deal with it
21:07 <timburke> #topic http keepalive timeout
21:08 <timburke> so my eventlet patch merged! gotta admit, seemed easier to get merged than expected :-)
21:08 <timburke> #link https://github.com/eventlet/eventlet/pull/788
21:09 <timburke> which means i ought to revisit the swift patch to add config plumbing
21:09 <timburke> #link https://review.opendev.org/c/openstack/swift/+/873744
21:10 <timburke> are we all ok with turning it into a pure-plumbing patch, provided i make it clear in the sample config that the new option kinda requires new eventlet?
21:12 <acoles> what happens if the option is set without new eventlet?
21:13 <timburke> largely, existing behavior: keepalive is turned on, and with the general socket timeout (ie, client_timeout)
21:13 <timburke> it would also give the option of setting keepalive_timeout to 0 to turn off keepalive behavior
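[editor's note: a minimal sketch of what the plumbing discussed above might look like in proxy-server.conf — the `keepalive_timeout` option name is taken from the patch under review and could change before merge; values are illustrative only:]

```ini
[DEFAULT]
# Existing general socket timeout; per the discussion above, keepalive
# falls back to this (pre-change behavior) when running against an
# eventlet that predates the merged PR.
client_timeout = 60

# Proposed option: how long an idle keep-alive connection is held open
# before the server closes it. Requires an eventlet with
# https://github.com/eventlet/eventlet/pull/788; set to 0 to turn off
# keep-alive behavior entirely.
keepalive_timeout = 5
```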
21:13 <mattoliver> Yup, do it
21:14 <acoles> ok
21:15 <timburke> all right then
21:15 <timburke> #topic per-policy quotas
21:15 <timburke> thanks for the reviews, mattoliver!
21:16 <timburke> test refactor is now landed, and there's a +2 on the code refactor
21:16 <timburke> #link https://review.opendev.org/c/openstack/swift/+/861487
21:16 <timburke> any reason not to just merge it?
21:17 <timburke> i suppose mattoliver's busy ;-) i can poke him more later
21:18 <timburke> the actual feature patch needs some docs -- i'll try to get that up this week
21:18 <timburke> #link https://review.opendev.org/c/openstack/swift/+/861282
21:19 <timburke> other interesting thing i've been working on (and i should be sure to add it to the PTG etherpad)
21:19 <acoles> I just glanced (not reviewed) and the refactor looks nicer than the original
21:20 <timburke> thanks -- there were a couple sneaky spots, but the existing tests certainly helped
21:20 <timburke> #topic statsd labeling extensions
21:21 <mattoliver> Yeah it can probably just land
21:21 <timburke> when swift came out, statsd was the basis for a pretty solid monitoring stack
21:22 <timburke> these days, though, people generally seem to be coalescing around prometheus, or at least its data model
21:23 <timburke> we at nvidia, for example, are running https://github.com/prometheus/statsd_exporter on every node to turn swift's stats into something that can be periodically scraped
21:24 <mattoliver> I've been playing with otel metrics, put it as a topic on the ptg etherpad. Got a basic client to test some infrastructure here at work. Maybe I could at least write up some doc on how that works for extra discussions at the ptg?
21:25 <mattoliver> By that i mean how open telemetry works
21:25 <timburke> that'd be great, thanks!
21:26 <timburke> as it works for us today, there's a bunch of parsing that's required -- a stat like `proxy-server.object.HEAD.200.timing:56.9911003112793|ms` doesn't have all the context we really want in a prometheus metric (like, 200 is the status, HEAD is the request method, etc.)
21:27 <timburke> which means that whenever we add a new metric, there's a handoff between dev and ops about what the new metric is, then ops need to go update some yaml file so the new metric gets parsed properly, and *then* they can start using it in new dashboards
21:28 <timburke> which all seems like some unnecessary friction
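[editor's note: for context, the "yaml file" ops maintain is a statsd_exporter mapping config; a sketch of the sort of rule needed to recover labels from a dotted name like the one above — metric and label names here are illustrative, not Swift's actual mapping:]

```yaml
mappings:
  # Pull the positional path segments out of the dotted statsd name and
  # re-emit them as Prometheus labels on one shared metric name.
  - match: "proxy-server.*.*.*.timing"
    name: "swift_proxy_request_timing"
    labels:
      layer: "$1"     # e.g. "object"
      method: "$2"    # e.g. "HEAD"
      status: "$3"    # e.g. "200"
```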
21:29 <timburke> fortunately, there are already some extensions to add the missing labels for components, and the statsd_exporter even already knows how to eat several of them: https://github.com/prometheus/statsd_exporter#tagging-extensions
21:30 <timburke> so i'm currently playing around with emitting metrics like `proxy-server.timing,layer=account,method=HEAD,status=204:41.67628288269043|ms`
21:30 <timburke> or `proxy-server.timing:34.14654731750488|ms|#layer:account,method:HEAD,status:204`
21:30 <timburke> or `proxy-server.timing#layer=account,method=HEAD,status=204:5.418539047241211|ms`
21:30 <timburke> or `proxy-server.timing;layer=account;method=HEAD;status=204:34.639835357666016|ms`
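[editor's note: the four wire formats above correspond to statsd_exporter's influxdb, dogstatsd, librato, and graphite tagging extensions; a quick sketch rendering all four — the helper is hypothetical, not Swift code:]

```python
def format_timing(name, value_ms, labels, dialect):
    """Render one statsd timing line in a given tagging dialect."""
    if dialect == "influxdb":    # name,k=v,...:value|ms
        tags = ",".join(f"{k}={v}" for k, v in labels.items())
        return f"{name},{tags}:{value_ms}|ms"
    if dialect == "dogstatsd":   # name:value|ms|#k:v,...
        tags = ",".join(f"{k}:{v}" for k, v in labels.items())
        return f"{name}:{value_ms}|ms|#{tags}"
    if dialect == "librato":     # name#k=v,...:value|ms
        tags = ",".join(f"{k}={v}" for k, v in labels.items())
        return f"{name}#{tags}:{value_ms}|ms"
    if dialect == "graphite":    # name;k=v;...:value|ms
        tags = ";".join(f"{k}={v}" for k, v in labels.items())
        return f"{name};{tags}:{value_ms}|ms"
    raise ValueError(f"unknown dialect: {dialect}")

labels = {"layer": "account", "method": "HEAD", "status": 204}
print(format_timing("proxy-server.timing", 41.7, labels, "dogstatsd"))
# proxy-server.timing:41.7|ms|#layer:account,method:HEAD,status:204
```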
21:31 <timburke> (really, "proxy-server" should probably get labeled as something like "service"...)
21:31 <timburke> my hope is to have a patch up ahead of the PTG, so... look forward to that!
21:32 <acoles> nice!
21:32 <acoles> "layer" is a new term to me?
21:32 <timburke> idk, feel free to offer alternative suggestions :-)
21:33 <acoles> vs tier or resource (I guess tier isn't clear)
21:33 <acoles> haha it took us < 1 second to get into a naming debate :D
21:33 <acoles> let's save that for the PTG
21:34 <mattoliver> Oh cool, I look forward to seeing it!
21:34 <timburke> if it doesn't mesh well with an operator's existing metrics stack, (1) it's opt-in and they can definitely still do the old-school vanilla statsd metrics, and (2) most collection endpoints (i believe) offer some translation mechanism
21:34 <acoles> I'm hoping we might eventually converge this "structured" stats with structured logging
21:35 <mattoliver> +1
21:35 <timburke> yes! there's a lot of context that seems like it'd be smart to share between stats and logging
21:35 <acoles> e.g. build a "context" data structure and squirt it at a logger and/or a stats client and you're done
21:36 <timburke> that's all i've got
21:36 <timburke> #topic open discussion
21:36 <timburke> what else should we bring up this week?
21:36 <acoles> on that theme, I wanted to draw attention to a change i have proposed to sharder logging
21:37 <timburke> #link https://review.opendev.org/c/openstack/swift/+/875220
21:37 <timburke> #link https://review.opendev.org/c/openstack/swift/+/875221
21:37 <acoles> 2 patches currently: https://review.opendev.org/c/openstack/swift/+/875220 and https://review.opendev.org/c/openstack/swift/+/875221
21:37 <acoles> timburke is so quick!
21:38 <mattoliver> Oh yeah, I've been meaning to get to that.. but off for the rest of the week, so won't happen now until next week.
21:38 <acoles> I recently had to debug some sharder issue and found the inconsistent log formats very frustrating
21:38 <acoles> e.g. sometimes we include the DB path, sometimes the resource path, sometimes both... but worst of all, sometimes neither
21:39 <acoles> So the patches ensure that every log message associated with a container DB (which is almost all) will consistently get both the db file path and the resource path (i.e. 'a/c') appended to the message
21:40 <acoles> I wanted to flag it up because that includes WARNING and ERROR level messages that I am aware some ops may parse for alerts
21:41 <mattoliver> Sounds good, and as we eventually worker up the sharder it gets all the more important.
21:42 <acoles> IDK if we have precedent for flagging up such a change, or if I am worrying too much (I tend to!)
21:43 <mattoliver> You're making debugging via log messages easier.. and that's a win in my book
21:43 <timburke> there's some precedent (e.g., https://review.opendev.org/c/openstack/swift/+/863446) but in general i'm not worried
21:44 <acoles> ok so I could add an UpgradeImpact to the commit message
21:45 <timburke> if we got to the point of actually emitting structured logs, and then *took that away*, i'd worry. but this, *shrug*
21:46 <timburke> fwiw, i did *not* call it out in the changelog
21:46 <acoles> well if there's no concerns re. the warnings then I will squash the two patches
21:47 <acoles> and then I can look forward to the next sharder debugging session 😜
21:47 <timburke> sounds good
21:49 <timburke> all right, i think i'll call it
21:49 <timburke> thank you all for coming, and thank you for working on swift!
21:49 <timburke> #endmeeting
21:49 <opendevmeet> Meeting ended Wed Mar  1 21:49:23 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
21:49 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.html
21:49 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.txt
21:49 <opendevmeet> Log:            https://meetings.opendev.org/meetings/swift/2023/swift.2023-03-01-21.00.log.html
21:52 <reid_g> @mattoliver - we did change the library, but we added overrides to the systemd services before anything starts: Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1". We did this for ~20 different clusters without issues. Also there were no quarantines generated in the cluster.
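[editor's note: for readers hitting the same upgrade, the override reid_g describes is typically delivered as a systemd drop-in; a sketch — the unit name and path are illustrative, use whatever units run your object services, and apply it before they first start on the new OS:]

```ini
# /etc/systemd/system/swift-object.service.d/legacy-crc.conf
[Service]
# Keep liberasurecode writing the legacy CRC so fragments remain
# compatible across the upgrade (see lp bug 1886088 linked above).
Environment="LIBERASURECODE_WRITE_LEGACY_CRC=1"
```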
21:53 <opendevreview> Alistair Coles proposed openstack/swift master: sharder: show path and db file in logs  https://review.opendev.org/c/openstack/swift/+/875220
22:10 <reid_g> Gotta head out. Will check chat logs tomorrow if you reply
22:10 <timburke> reid_g, what versions of swift were involved? sounds like maybe https://bugs.launchpad.net/swift/+bug/1655608 -- were any object disks out of the cluster for a while, then brought back in? it's a bit of an old bug, but we've seen patches in relation to it as recently as a couple years ago. if your new swift is >= 2.28.0, you might consider setting quarantine_threshold=1 for the reconstructor -- see https://github.com/openstack/swift/commit/46ea3aea
22:11 <reid_g> We are on ussuri 2.25.2
22:12 <reid_g> I don't think any disks were out for a while since we have monitoring for missing disks and get those taken care of quickly.
22:14 <reid_g> What is kind of odd is if I run swift-object-info on the fragment on the host we think the issues are with, the fragment belongs to that host according to the ring. 1 particular object has a FS date of jan 2022 but it looks like it was pushed to another node on feb 18 2023. The other node appears as a handoff according to the ring
22:16 <reid_g> Right now I have a bunch of fragments on all the disks on 1 host that are being pushed around as handoffs to all other hosts (based on the filesystem dates of the files).
22:16 <timburke> are some disks unmounted? or maybe full?
22:17 <timburke> https://github.com/openstack/swift/commit/ea8e545a had us start rebuilding to handoffs in 2.21.0 if a primary responds 507
22:18 <reid_g> no, they are all mounted and ~45% used. I don't think that they were unmounted previously.
22:19 <reid_g> I will check that link tomorrow. I have to head home before my wife gets on me.
22:19 <timburke> of course -- good luck!
22:20 <reid_g> TBC. Thanks for your input!
22:58 <opendevreview> Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses  https://review.opendev.org/c/openstack/swift/+/875819
23:02 <opendevreview> Tim Burke proposed openstack/swift master: proxy: Reduce round-trips to memcache and backend on info misses  https://review.opendev.org/c/openstack/swift/+/875819

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!