#ceph IRC Log


IRC Log for 2011-01-04

Timestamps are in GMT/BST.

[2:27] <cmccabe> wido: hey, I implemented that max_open_files thing
[2:27] <cmccabe> wido: since I already spent some time thinking about it
[16:38] <stingray> when osds are resyncing, client is crawling
[16:58] <greglap> stingray: how much of a performance drop-off?
[17:00] <greglap> things certainly slow down during OSD recovery — we haven't implemented serious QOS or anything — but it generally shouldn't stop unless you've lost data
[17:48] <stingray> gregaf: client just stalls and [ 1341.264064] ceph: tid 104 timed out on osd3, will reset osd
[17:49] <stingray> osd3 is a new osd, I just added 4 of those, pushed new crushmap and increased replication to 3 for both data and metadata
[19:37] <gregaf> stingray: if I understand you correctly, I think this is a present unfortunate consequence of adding too many OSDs at once
[20:28] <stingray> gregaf: I fail :(
[20:28] <stingray> let's wait until it converges
[20:28] <stingray> only 6% left
[20:29] <stingray> what's funny is that writes work
[20:29] <stingray> only reads get stuck, and those are reads from a particular osd
[20:29] <gregaf> hmm
[20:29] <stingray> is there any way to figure this out except tcpdump
[20:29] <gregaf> we do have some reports that writes starve reads
[20:30] <gregaf> and the rate-limiting in recovery isn't well tested, so it's possible that the OSD just has too much of a write workload so reads aren't going through/are taking forever
[20:30] <stingray> they don't usually do that, they seemed to work before I did the reshuffle
[20:30] <stingray> the kernel client doesn't have much of the useful instrumentation - at least the old one I have
[20:30] <stingray> maybe a client bug, after all
[20:31] <stingray> I'll do more tests
[20:31] <gregaf> just to make sure I understand, you just added 4 OSDs to your cluster and increased replication from 2 to 3?
[20:31] <stingray> is ods thread-per-request?
[20:31] <stingray> osd
[20:31] <stingray> gregaf: yeah
[20:31] <gregaf> how big was your cluster prior to that?
[20:31] <stingray> it then told me it's about 50% degraded and started doing stuff
[20:32] <stingray> 3 osds, 3.6T each, 1.5T of data (3T with replication)
[20:32] <gregaf> okay
[20:32] <stingray> I added 4 osds about 3.4T each
[20:32] <gregaf> the OSDs have a lot of threads, but it's not one per request
[20:33] <stingray> and it doesn't prioritize peering versus clients
[20:33] <gregaf> they have a setup based on thread pools and workqueues that the requests get routed through, in order of arrival but modified by what's available in the system
[20:33] <gregaf> there are some limited attempts to do it with some throttling
[20:34] <gregaf> and rudimentary logic to try and keep PGs active if they're moved to a new OSD with no data
[20:34] <stingray> it does a lot of "journal throttle" here, not sure what it means now
[20:34] <gregaf> but I think that by simultaneously doubling the cluster and increasing replication you just gave it more than our metrics can handle
[20:35] <stingray> heh
[20:35] <gregaf> it's an area that needs further development
[20:36] <gregaf> what's the exact journal throttle message you're getting?
[20:37] <stingray> both
[20:37] <stingray> waited for bytes
[20:37] <stingray> and for ops
[20:37] <stingray> 2011-01-04 21:47:45.497892 7f56c0ff9700 journal throttle: waited for ops
[20:37] <stingray> 2011-01-04 21:47:45.584344 7f56abfff700 journal throttle: waited for ops
[20:37] <stingray> 2011-01-04 22:14:49.906795 7f56aaffd700 journal throttle: waited for ops
[20:37] <stingray> 2011-01-04 22:14:49.974128 7f56aaffd700 journal throttle: waited for ops
[20:37] <stingray> 2011-01-04 22:14:50.056533 7f56aaffd700 journal throttle: waited for ops
[20:37] <stingray> 2011-01-04 22:14:50.124184 7f56aaffd700 journal throttle: waited for ops
[20:38] <stingray> they usually come in 2+ screens
[20:38] <gregaf> ah
[20:39] <gregaf> it isn't the sole cause of any trouble, but that means your OSD journal can't keep up with the rate at which it's receiving data
[20:41] <gregaf> not anything broken but if you can somehow get a faster journal it will help performance, or at least minimize issues during repeering, etc
[20:41] <stingray> ok
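
Since the throttle messages arrive in bursts of several screens, tallying them per minute may make it easier to see when the journal falls behind. The following is a minimal sketch, not a Ceph tool: the function name `tally_throttle` and the regular expression are illustrative, and it assumes that "waited for bytes" lines have the same layout as the "waited for ops" lines pasted above.

```python
import re
from collections import Counter

# Sample lines copied from the paste above.
SAMPLE = """\
2011-01-04 21:47:45.497892 7f56c0ff9700 journal throttle: waited for ops
2011-01-04 21:47:45.584344 7f56abfff700 journal throttle: waited for ops
2011-01-04 22:14:49.906795 7f56aaffd700 journal throttle: waited for ops
"""

# Assumption: "waited for bytes" lines share this exact layout.
THROTTLE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ \S+ "
    r"journal throttle: waited for (ops|bytes)$"
)

def tally_throttle(lines):
    """Count throttle messages per (minute, kind); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

if __name__ == "__main__":
    for (minute, kind), n in sorted(tally_throttle(SAMPLE.splitlines()).items()):
        print(minute, kind, n)
```

To run it against a real OSD log, pass an open file object to `tally_throttle` instead of the sample lines.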
[21:58] <stingray> 2011-01-04 23:58:12.300505 pg v108271: 804 pgs: 7 active, 797 active+clean; 1491 GB data, 4459 GB used, 19195 GB / 24921 GB avail; 31623/1151964 degraded (2.745%)
[21:58] <stingray> doesn't go below this
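
To watch whether the degraded figure is really stuck, the status line above can be parsed and the percentage recomputed from the two counts. This is a rough sketch: `parse_pg_status` is a hypothetical helper, and the regular expression is fitted only to the single line pasted here, so other versions or states may need adjustments.

```python
import re

# The status line pasted above, without the IRC prefix and timestamp.
STATUS = (
    "pg v108271: 804 pgs: 7 active, 797 active+clean; 1491 GB data, "
    "4459 GB used, 19195 GB / 24921 GB avail; 31623/1151964 degraded (2.745%)"
)

def parse_pg_status(line):
    """Extract PG state counts and the degraded ratio from one status line."""
    match = re.search(r"(\d+) pgs: ([^;]+);.*?(\d+)/(\d+) degraded", line)
    if match is None:
        raise ValueError("unrecognized pg status line")
    states = {}
    for part in match.group(2).split(", "):
        count, state = part.split(" ", 1)
        states[state] = int(count)
    degraded, total = int(match.group(3)), int(match.group(4))
    return {
        "total_pgs": int(match.group(1)),
        "states": states,
        "degraded": degraded,
        "total": total,
        "degraded_pct": 100.0 * degraded / total,
    }

if __name__ == "__main__":
    info = parse_pg_status(STATUS)
    print(info["states"], round(info["degraded_pct"], 3))
```

Comparing `degraded` across successive status lines shows whether recovery is still moving, even when the rounded percentage does not change.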
[22:02] <gregaf> stingray: hmmm, what version are you running and how long has it been stuck there?
[22:05] <stingray> .24
[22:05] <stingray> doesn't seem to make any progress
[22:06] <stingray> and, the chunks are still unreadable
[22:06] <stingray> I'm trying to debug osd
[22:07] <gregaf> sjust: you dealt with symptoms like this recently, didn't you?
[22:10] <sjust> gregaf: Yeah, I don't remember specifically what was causing it though. I would need to look at the logs.
[22:14] <stingray> sjust: something fixable?
[22:16] <sjust> stingray: the problem before was with scrubbing, but I now remember fixing that bug
[22:16] <sjust> could you get me the logs?
[22:16] <stingray> if only I knew which osd
[22:17] <stingray> sjust: I'll try to gather more info for you, worst case tomorrow
[22:17] <stingray> now I've got to go - need to meet someone
[22:17] <stingray> cu
[22:17] <sjust> ok
