Tuesday, 2020-08-25

01:04 *** tkajinam_ is now known as tkajinam
01:14 *** rcernin has quit IRC
01:16 *** rcernin has joined #openstack-swift
01:36 *** baojg has joined #openstack-swift
01:48 *** gyee has quit IRC
03:31 *** josephillips has quit IRC
03:52 *** josephillips has joined #openstack-swift
04:32 *** rcernin has quit IRC
04:33 *** evrardjp has quit IRC
04:35 *** evrardjp has joined #openstack-swift
04:51 *** m75abrams has joined #openstack-swift
04:58 *** dsariel has joined #openstack-swift
05:03 *** rcernin has joined #openstack-swift
05:41 *** rcernin has quit IRC
06:05 <openstackgerrit> Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession  https://review.opendev.org/747496
06:57 *** rcernin has joined #openstack-swift
07:03 *** rcernin has quit IRC
07:06 *** rcernin has joined #openstack-swift
07:14 *** rcernin has quit IRC
07:58 *** baojg has quit IRC
07:59 *** baojg has joined #openstack-swift
08:59 *** dosaboy has quit IRC
08:59 *** dosaboy has joined #openstack-swift
09:21 *** ianychoi__ has joined #openstack-swift
09:24 *** ianychoi_ has quit IRC
10:49 *** abelur has quit IRC
10:51 *** lxkong has quit IRC
10:57 *** rcernin has joined #openstack-swift
11:02 *** abelur has joined #openstack-swift
11:02 *** lxkong has joined #openstack-swift
12:08 *** hoonetorg has quit IRC
12:21 *** hoonetorg has joined #openstack-swift
12:41 *** hoonetorg has quit IRC
12:54 *** hoonetorg has joined #openstack-swift
14:03 *** gyee has joined #openstack-swift
14:45 *** rcernin has quit IRC
15:41 *** baojg has quit IRC
15:42 *** baojg has joined #openstack-swift
16:17 *** m75abrams has quit IRC
16:42 <timburke> good morning
17:59 *** baojg has quit IRC
18:52 *** abelur has quit IRC
18:53 *** abelur has joined #openstack-swift
19:49 <timburke> so i noticed the proxy server in my home swift going a little squirrelly on occasion -- running https://github.com/swiftstack/python-stack-xray/blob/master/python-stack-xray against it, i found a particularly strange stack: http://paste.openstack.org/show/797139/
19:49 <timburke> what in the world is up with that _get_response_parts_iter frame?? i guess maybe there's a generator exit getting raised?
19:54 <timburke> fwiw, i'm fairly certain my troubles are the two frames in enumerate() -- most/all of the other stacks are also waiting on _active_limbo_lock :-/
19:57 <timburke> the reference to threading._active reminds me of https://github.com/eventlet/eventlet/pull/611 ... i need to double check whether i applied that fix here or not...
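
(For context on the "generator exit" guess: in Python, closing a generator -- explicitly, or via garbage collection -- raises GeneratorExit at the yield where it is paused, so a stack dump taken at that moment can show a frame seemingly stuck inside the generator function. A minimal sketch; the names are illustrative, not Swift's actual code:)

    def response_parts_iter():
        # stand-in for a proxy-side response-body iterator
        try:
            while True:
                yield b"chunk"
        except GeneratorExit:
            # cleanup runs here on close; if it blocks (e.g. on a lock),
            # the closing thread's stack shows a frame inside this generator
            raise

    it = response_parts_iter()
    next(it)    # advance to the first yield
    it.close()  # raises GeneratorExit inside the generator at that yield
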
19:58 <openstackgerrit> Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession  https://review.opendev.org/747496
20:27 <openstackgerrit> Tim Burke proposed openstack/swift master: ssync: Tolerate more hang-ups  https://review.opendev.org/744270
20:58 <ormandj> something interesting we've noticed with swift - when we load a drive, single drive on a server (let's assume these servers have 56 drives for swift, which they do) - such as running a patrolread on them
20:58 <ormandj> our throughput for the entire cluster goes waaaaaay down
20:58 <ormandj> we have dedicated SSDs for container/account dbs, and those are not being touched
21:00 <ormandj> we've also noticed when taking a server down (but not removing from ring) the same degradation happens
21:00 <ormandj> is that a function of a three server cluster w/ triple replication?
21:01 <timburke> "load a drive" as in, there's one drive that seems especially hot, or there's one drive that's particularly full?
21:01 <timburke> could be part of it. i'd expect the down server to get error-limited fairly quickly, though
21:02 <timburke> and then traffic should shed to the remaining servers
21:03 <timburke> are you seeing performance tank for reads, writes, or both?
21:04 <timburke> when you're taking a node down, how quickly can you get the proxy out of rotation for your load balancer?
21:09 <ormandj> timburke: load as in cause to slow down
21:10 <ormandj> timburke: no proxies go out of rotation, we're taking down storage nodes, not swift proxies. lb -> swift proxies -> storage nodes. swift proxies scale independently of storage nodes
21:11 <ormandj> it looks like they continue to try and contact the 'down' node the whole time it's down
21:12 <ormandj> timburke: so on the disk load thing, for example, we kick off a patrolread which 'invisibly' hits the disk with enough IOPs to increase await significantly for that one drive
21:12 <timburke> might want to look at https://github.com/openstack/swift/blob/master/etc/proxy-server.conf-sample#L161-L166 values
21:12 <ormandj> i think we have those at defaults (verifying)
21:12 <ormandj> it'll keep trying the entire time they are down
21:13 <timburke> :-/
21:13 <ormandj> yep, both commented out in the config
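
(The settings timburke links above are the proxy's error-limiting knobs. For reference, they ship commented out in proxy-server.conf-sample at roughly these defaults -- shown as upstream defaults, not as a tuning recommendation:)

    [app:proxy-server]
    # how long without an error before a node's error count resets; also how
    # long a node stays error-limited once the limit is hit
    error_suppression_interval = 60
    # how many errors can accumulate before a node is temporarily skipped
    error_suppression_limit = 10
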
21:13 <timburke> separately from proxy configs, how's the object server configured? servers per port, or all disks going over the one port? how many workers?
21:15 <ormandj> workers is set default (auto according to config) and servers per port is default (0 according to config)
21:27 <timburke> so auto should give you a worker per core -- how many cores do the nodes have? i'd worry a bit about all workers for that node getting hung up trying to service requests for that disk and getting stuck in some uninterruptible sleep
21:27 <ormandj> 48 cores
21:28 <ormandj> 56 data drives
21:28 <ormandj> 256 gigs ram
21:28 <ormandj> 4xssd for account/container db
21:29 <ormandj> what you're describing would make sense based on what we see
21:29 <timburke> i'd think about giving each disk its own port in the ring and setting servers_per_port to 2 or so
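
(A sketch of the object-server side of that suggestion, with hypothetical values: servers_per_port lives in the object server's [DEFAULT] section, and each disk must also be given its own port in the object ring for it to take effect:)

    # /etc/swift/object-server.conf (hypothetical)
    [DEFAULT]
    bind_ip = 0.0.0.0
    # default is 0, which disables the feature and falls back to `workers`;
    # 2 means two server processes per ring port, i.e. per disk
    servers_per_port = 2
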
21:30 <ormandj> safe to do without blowing up existing data?
21:32 <timburke> yeah, it's a matter of updating the ring with swift-ring-builder's set_info command. the transition may still be a bit disruptive, though; you'd likely want to announce a maintenance window
21:32 <timburke> let me see if i can find some docs on it...
21:36 <ormandj> thanks tim. the docs have been a bit... not clear on some of the implications in the past
21:36 <ormandj> so we try to be careful when it comes to ring operations heh
21:37 <ormandj> manpage had info on that option
21:37 <ormandj> set_info <search-value> <ip>:<port>/<device_name>_<meta>
21:38 <timburke> https://docs.openstack.org/swift/latest/deployment_guide.html#running-object-servers-per-disk
21:38 <timburke> (though it doesn't give an example of how to switch between modes :-/)
21:38 <ormandj> awesome, we'll look into implementing that, we'll test in our dev cluster first in case we hose all the data, which is likely :p
21:39 <ormandj> looks like just updating the port fields, which is straightforward enough
21:39 <ormandj> "When migrating from normal to servers_per_port, perform these steps in order:
21:39 <ormandj> "
21:39 <ormandj> it has that section below the output
21:41 <timburke> oh good -- i clearly didn't skim well enough!
21:42 <ormandj> it doesn't give guidance on servers_per_port for the hypothetical, but looking at the default options, it seems to suggest '4' as giving complete i/o isolation
21:43 <ormandj> so we'd end up at 56*4 processes, effectively, if we did that
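
(A hypothetical sketch of the ring edit under discussion, following the set_info syntax from the manpage line quoted above -- IPs, ports, and device names are invented. set_info only rewrites device info and doesn't reassign partitions, so the builder should just need a write_ring rather than a rebalance:)

    # every disk on hypothetical node 192.168.1.10 currently shares port 6200;
    # give each disk its own port
    swift-ring-builder object.builder set_info -192.168.1.10:6200/d1 192.168.1.10:6201/d1
    swift-ring-builder object.builder set_info -192.168.1.10:6200/d2 192.168.1.10:6202/d2
    # ...one unique port per disk, repeated for every node...
    swift-ring-builder object.builder write_ring
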
21:43 <timburke> fwiw, the commit that introduced it had some nice benchmarks referenced: https://github.com/openstack/swift/commit/df134df901a13c2261a8205826ea1aa8d75dc283
21:43 <timburke> https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md in particular seems relevant
21:44 <ormandj> wonder how this got overlooked when people were setting up this cluster
21:44 <ormandj> seems like it's a best-practice kind of thing
21:44 <ormandj> also, curious that it's not default
21:45 <timburke> yeah, i was just about to say that we should probably look at updating docs/deployment guides to default to servers per port
21:45 <ormandj> those benchmarks are pretty telling, you weren't kidding
21:46 <ormandj> little struggle to understand the chart though, haha
22:01 *** rcernin has joined #openstack-swift
22:01 *** rcernin has quit IRC
22:02 *** rcernin has joined #openstack-swift
22:33 <openstackgerrit> Tim Burke proposed openstack/swift master: docs: Clean up some formatting around using servers_per_port  https://review.opendev.org/748043
22:38 <clayg> there's a lot to digest in p 745603 - but I feel like I'm getting the hang of it!
22:38 <patchbot> https://review.opendev.org/#/c/745603/ - swift - Bind a new socket per-worker - 4 patch sets
22:39 <timburke> sorry; i maybe shouldn't have moved to get rid of PortPidState in the same patch
22:41 <clayg> fwiw, i'll probably spend some time with the graceful worker shutdown patch before I loop back around to the per-worker-socket
22:42 <clayg> it'll be nice to gather informal feedback in the meeting tomorrow as well
22:42 <clayg> but it looks great timburke - incredible work
22:44 <timburke> clayg, we may want to make it somewhat configurable -- someone (timur i think?) pointed out https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/ that had some more thoughts on the matter
22:45 <clayg> that might be a good reason to keep the sockets in the parent 🤔
22:45 <timburke> might be worth trying to do something like 4 listen sockets each with 6 workers or something like that
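
(A toy sketch of the per-worker-socket idea under review -- not the actual patch. Each forked worker binds its own listen socket with SO_REUSEPORT, so the kernel spreads incoming connections across workers instead of all workers contending on one shared accept queue; the Cloudflare post linked above covers the trade-offs of that balancing. The address and worker count are made up, and this is Linux-specific as written:)

    import os
    import socket

    ADDR = ("0.0.0.0", 6200)  # hypothetical bind address
    NUM_WORKERS = 4

    def bind_per_worker_socket():
        # every worker binds its OWN socket; SO_REUSEPORT lets them all
        # share the address while the kernel balances connections
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(ADDR)
        sock.listen(128)
        return sock

    for _ in range(NUM_WORKERS):
        if os.fork() == 0:  # child process: bind after fork, then serve
            listener = bind_per_worker_socket()
            conn, addr = listener.accept()  # kernel picks which worker wins
            # ... hand the connection to the WSGI server's request loop ...
            os._exit(0)
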
