#ceph IRC Log


IRC Log for 2010-10-28

Timestamps are in GMT/BST.

[0:00] <gregaf> do you have 1x or 2x replication?
[0:00] <gregaf> *other local benchmarks
[0:01] <gregaf> osd bench is entirely local but it's not going to report specific kinds of issues
[0:02] <jantje_> gregaf: default
[0:02] <gregaf> so 2x
[0:03] <gregaf> I ask because that means that osd1 is getting hit with 1/3 of the writes
[0:04] <gregaf> which I think is enough to slow you down to the speeds you're seeing
[0:06] <jantje_> Hmm, interesting
[0:06] <jantje_> lets put it off
[0:06] <jantje_> ;osd pool default size = 0
[0:06] <jantje_> ;osd pool default crush rule = 0
[0:07] <jantje_> that should do the trick, right?
[0:07] <gregaf> you need to leave the default crush rule set
[0:07] <gregaf> you had default size set to 0?
[0:08] <gregaf> I don't know what that does... it should be >= 1
[0:10] <jantje_> no it wasn't set
[0:12] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[0:12] <gregaf> ah
[0:12] <gregaf> to turn off replication you'll need to set it explicitly off
[0:13] <gregaf> osd pool default size = 1
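gregaf's suggestion corresponds to a one-line ceph.conf change. A minimal sketch — the [global] placement is an assumption (it could also live under [osd]), and per gregaf's advice the crush rule line is deliberately left out so it stays at its default:

```ini
; Sketch only. "osd pool default size" is the replica count for newly
; created pools: 1 stores each object once (no replication); the
; default is 2. Do not override "osd pool default crush rule" -- leave
; it at its default.
[global]
        osd pool default size = 1
```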
[0:17] <jantje_> ok, i'm going to try it
[0:17] <jantje_> just wondering, isn't replication a lower priority than writing/reading the data?
[0:19] <gregaf> there's not really a way to do that while maintaining the replication level
[0:20] <jantje_> it would just depend on how important you think it is
[0:20] <jantje_> 6291456000 bytes (6.3 GB) copied, 35.9358 s, 175 MB/s
[0:20] <jantje_> better!
[0:20] <gregaf> yep!
[0:21] <gregaf> that's with 1x replication?
[0:21] <jantje_> yes
[0:21] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[0:21] <gregaf> k
[0:21] <jantje_> on friday i'll put in some extra machines to do parallel reads
[0:22] <jantje_> anyway, it would be nice if someone could give 'weights' on how important replication is
[0:22] <gregaf> that's still not the bandwidth I'd expect from 5 drives doing 80MB each, but that might just be the dd workload or something odd
[0:23] <cmccabe> have you tried a blocksize other than 4k with dd?
[0:23] <jantje_> running now
[0:23] <jantje_> bs=1M
[0:23] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[0:24] <jantje_> (mtu is just 1500, so really doing lots of fragmentation)
[0:24] <darkfader> jantje_: can you tell me what a ideal blocksize would be for ceph?
[0:25] <darkfader> I kept thinking about that but no success :)
[0:25] <jantje_> 8589934592 bytes (8.6 GB) copied, 51.4922 s, 167 MB/s
[0:25] <jantje_> with bs=1M for dd
[0:25] <darkfader> s/blocksize/mtu
[0:25] <jantje_> darkfader: depends ... first you have to make sure your disks can keep up
[0:26] <darkfader> do you have the mds and journals on different disks? from my understanding both kills your performance
[0:26] <jantje_> darkfader: just set it to the max mtu, if your switch and nic's support it ofcourse
[0:26] <jantje_> it won't do any harm I guess
[0:26] <darkfader> okay
[0:26] <jantje_> unless the network layer waits for the payload to fill up
[0:27] <darkfader> on ethernet it shouldnt
[0:27] <darkfader> dunno what it'd be like with ipoib
[0:27] <darkfader> i have 64k mtu there
[0:27] <jantje_> sounds fast :)
[0:28] <darkfader> dont know - i didnt have an extra ssd for the journals and didnt crack 100MB/s over dual gigabit
[0:28] <darkfader> so i didn't move on to the faster links yet
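The MTU back-and-forth above can be quantified: per-packet header overhead is what jumbo frames save, and on a link already sustaining ~2 Gbit/s the saving is modest. A sketch using standard Ethernet/IPv4/TCP header sizes — this deliberately ignores TSO/GRO and IPoIB specifics, and the 65520-byte figure is only an IPoIB-style stand-in for darkfader's "64k mtu":

```python
# Fraction of wire bytes that is TCP payload at a given MTU.
# Overheads: Ethernet framing (preamble 8 + header 14 + FCS 4 +
# interframe gap 12 = 38 bytes) plus 40 bytes of IPv4+TCP headers
# per packet. Only meant to show the shape of the curve.

ETH_OVERHEAD = 38
IP_TCP_HEADERS = 40

def tcp_efficiency(mtu):
    """Payload bytes per wire byte for one full-sized frame."""
    payload = mtu - IP_TCP_HEADERS
    wire = mtu + ETH_OVERHEAD
    return payload / wire

for mtu in (1500, 9000, 65520):  # standard, jumbo, ~64k (IPoIB-style)
    print(mtu, round(tcp_efficiency(mtu), 4))
# 1500  -> 0.9493
# 9000  -> 0.9914
# 65520 -> 0.9988
```

Going from a 1500 to a 9000-byte MTU buys only about 4% of raw efficiency, consistent with cmccabe's suspicion below that jumbo frames may not matter much; the larger practical win of bigger frames is fewer packets (and interrupts) per byte, which NAPI also addresses.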
[0:28] <jantje_> I just created journals in memory
[0:29] <darkfader> and then i broke the lab. now i'm still sulking ;)
[0:29] <jantje_> and will replace them by SSDs later on
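jantje_'s setup (journal in RAM now, SSD later) maps to the osd journal option. A sketch with hypothetical paths — a tmpfs journal disappears on reboot or power loss, so this is strictly a benchmarking configuration:

```ini
; Illustrative only -- both paths below are hypothetical.
[osd]
        ; benchmarking: journal on tmpfs (lost on reboot/power failure)
        osd journal = /dev/shm/osd$id.journal
        osd journal size = 100    ; MB
        ; for production, point at a raw SSD partition instead, e.g.
        ; osd journal = /dev/sdb1
```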
[0:29] <darkfader> good idea
[0:29] <jantje_> gregaf: it might not be what we expect with 6 drives ! :)
[0:30] <jantje_> 6*80 = 480MB/sec
[0:30] <jantje_> ofcourse there is the network limit
[0:30] <jantje_> but I would also expect my journal not to throttle
[0:31] <darkfader> jantje_: i wonder how much faster it would get with more replicas
[0:31] <darkfader> or if at all
[0:31] <jantje_> since it should be able to write less data than the physical network connection can handle
[0:31] <jantje_> (did you get my point?)
[0:33] <cmccabe> jantje_: I think the network is more likely to be a bottleneck than the drives
[0:34] <cmccabe> I did some benchmarks with a 1 gigabit card on a 2-machine LAN and was getting between 75 MB/s and 90 MB/s
[0:34] <cmccabe> It partly depends on how good the ethernet driver is. It definitely has to be NAPI, or else it will suck a lot.
[0:35] <cmccabe> that was just passing data over a TCP connection
[0:36] <cmccabe> I think we were using some zero-copy stuff too at the time
[0:36] <cmccabe> On the other hand, most 3.5" drives can get pretty near to 125 MB/s, no problem on I/O to a raw partition
[0:36] <cmccabe> (the benchmark I was talking about earlier didn't involve ceph, just gigE speeds in general)
[0:37] <cmccabe> (no idea how bonding/trunking performs; I haven't used it)
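The bandwidth reasoning in this exchange (6 drives at ~80 MB/s each vs. a ~2 Gbit/s network path) can be sketched as a toy bottleneck model. All figures are illustrative assumptions taken from the conversation, not measurements, and the replication term is deliberately crude:

```python
# Toy model: sustainable client write rate is capped by the slower of
# the aggregate disk rate and the network path. With n-way replication
# every logical byte is written n times across the cluster, and the
# links carry the extra replica traffic too (modeled crudely here as a
# straight division).

def expected_write_mbps(n_disks, disk_mbps, net_mbps, replication=1):
    """Rough ceiling on client write throughput, in MB/s."""
    disk_limit = n_disks * disk_mbps / replication
    net_limit = net_mbps / replication
    return min(disk_limit, net_limit)

# 6 drives at 80 MB/s each over a ~250 MB/s (2 Gbit/s) path:
print(expected_write_mbps(6, 80, 250))     # network-bound: 250.0
print(expected_write_mbps(6, 80, 250, 2))  # 2x replication: 125.0
```

Under these assumptions the observed 167-175 MB/s at 1x replication sits below the ~250 MB/s network ceiling, which fits gregaf's comment that something besides raw disk speed (the dd workload, the journal) is in the way.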
[0:39] <darkfader> cmccabe: what is NAPI?
[0:40] <jantje_> cmccabe: i measured my raw bandwith with iperf
[0:40] <darkfader> I have some broadcom, some intel chips. didn't have expected there would be any driver issues worth mentioning left in 2010 ;)
[0:40] <cmccabe> NAPI is an API inside the kernel that drivers are written for
[0:41] <jantje_> [ 3] 0.0-10.0 sec 2.37 GBytes 2.03 Gbits/sec
[0:41] <cmccabe> the newer drivers use NAPI to get better performance
[0:41] <jantje_> its an intel e1000, should be decent cards
[0:42] <darkfader> jantje_: what options did you use?
[0:42] <cmccabe> jantje: yeah the e1000 is a good card
[0:42] <jantje_> darkfader: default iperf options
[0:42] <darkfader> and how many e1000 per host?
[0:43] <jantje_> cmccabe: too bad the chip I have is made of 1x 82573E and 3x 82573L, the 82573E does not support jumbo frames
[0:43] <darkfader> ah that was you, i remember that
[0:43] <cmccabe> jantje_: yeah, too bad. It would be interesting to see if jumbo frames made a difference or not.
[0:46] <jantje_> I can enable jumbo frames on 3 devices
[1:00] <johnl> hey gregaf, how do I compile cfuse? I cloned ceph, configure and make
[1:01] <johnl> but no cfuse executable
[1:01] <cmccabe> it's a configure script thing
[1:01] <gregaf> johnl: it should go in automatically, do you have FUSE installed?
[1:01] <johnl> go in?
[1:02] <gregaf> sorry, it should build automatically if you have the dependencies
[1:02] <johnl> ah didn't have fuse-dev installed
[1:02] <cmccabe> johnl: yeah, I think it's enabled by default
[1:02] <johnl> thought configure would fail.
[1:02] <cmccabe> johnl: but you should check the configure output to see if it detected the fuse libraries on your system
[1:02] <johnl> configure
[1:02] <johnl> ok, ta.
[1:03] <jantje_> I think my switch doesn't support jumbo frames as well
[1:03] <gregaf> johnl: configure only fails if it can't build the core daemons, all the extra stuff (cfuse, hadoop, using tcmalloc, etc) will auto-detect and defaults to on but won't cause failures if you can't do it
[1:04] <gregaf> jantje_: was that 175 MB/s with osd1 still enabled?
[1:04] <johnl> yer sorry. installed libfuse-dev and reconfigured and now it's building
[1:04] <jantje_> gregaf: yes
[1:05] <gregaf> k
[1:05] <jantje_> gregaf: i should try it with osd1 out, but currently my client is dead, and i'm at home
[1:05] <gregaf> I'm trying to work out the theoretical bandwidth but it's a pain to do
[1:05] <darkfader> bbl, you made me play with the mtu's
[1:05] <darkfader> need to reboot to reinit some bridges
[1:06] <jantje_> darkfader: :-)
[1:06] <cmccabe> darkfader: it was just an idea. I don't really know if it will make a difference or not.
[1:07] <jantje_> gregaf: network bandwidth? disk ?
[1:07] <gregaf> Ceph bandwidth given the disk and network bandwidths :)
[1:07] <cmccabe> darkfader: the fact that iperf is already giving you pretty good performance over TCP makes me think jumbo frames may not matter
[1:07] <jantje_> Ok :P
[1:07] <cmccabe> darkfader: but I could be wrong
[1:09] * MarkN (~nathan@ has joined #ceph
[1:16] * jantje_ is going to sleep
[1:18] <jantje_> and going to get the smallest one: http://www.alcatel-lucent.com/wps/PA_1_A_8VH/images/Images/7750_SR_Family.jpg to try real LAG with jumbo frames
[1:21] <cmccabe> cool
[1:32] <jantje_> :)
[1:32] <jantje_> we manufacture those, so it shouldn't be a problem :P
[1:51] <jantje_> cool, graphical monitoring :)
[1:51] * jantje_ &
[1:52] <cmccabe> yeah. I'm still working out the bugs
[1:53] <cmccabe> original code by michael mcthrow
[1:53] <cmccabe> He based his code on a somewhat older version (which was current at the time) so I had a bit of forward-porting to do
[1:54] <cmccabe> also integrated it with automake and the ./ceph tool
[1:54] <cmccabe> I didn't want to duplicate ceph.cc basically
[1:55] <cmccabe> brb
[2:11] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[3:01] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:15] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[3:15] * deksai (~deksai@ has joined #ceph
[3:23] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:23] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[3:38] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:58] * greglap (~Adium@ has joined #ceph
[4:01] * henry_c_chang (~chatzilla@59-124-35-221.HINET-IP.hinet.net) has joined #ceph
[4:11] <henry_c_chang> Hi sage, are you there?
[4:12] <greglap> henry_c_chang: he's at home now, although he might be on again for a bit in several hours
[4:13] <henry_c_chang> ok..thx
[4:44] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[5:00] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[5:58] * deksai (~deksai@ Quit (Ping timeout: 480 seconds)
[6:10] <sage> henry_c_change: here now
[6:11] <sage> er, henry_c_chang :)
[6:11] <henry_c_chang> hi
[6:12] <sage> did you see my email? does the seq vs issue_seq difference make sense?
[6:12] <henry_c_chang> not yet..let me check the email
[6:13] <henry_c_chang> but yesterday getattr hanged again...so my fix doesn't fully work
[6:13] <sage> yeah, i think the problem is actually a bad change i made in september
[6:14] <sage> d91f2438
[6:15] <henry_c_chang> ok...I will do as your email said.
[6:16] <henry_c_chang> and you can move the discussion to ceph-devel..I don't mind
[6:17] <sage> ok thanks. there is probably still a problem somewhere (i thought I was fixing something with that patch, but unfortunately can't remember what the workload was)...
[6:28] <henry_c_chang> I have to leave for a while...come back later
[6:55] <henry_c_chang> sage: I reverted d91f2438 and my mds modifications....it works fine.
[6:56] <sage> yay!
[6:56] <henry_c_chang> :)
[6:56] <sage> ok, i reverted in the ceph-client.git master branch. hopefully our testing will turn up whatever the original problem was I thought i was fixing, and we can find a real solution for whatever that is :)
[6:59] <sage> from my description it sounds like an mds bug :/
[6:59] <henry_c_chang> do you think Jim's case (on ceph-devel) is the same with mine?
[7:00] <sage> i'm hoping so.. what do you think?
[7:01] <henry_c_chang> no idea..but I'll do more tests
[7:01] <sage> ok thanks
[7:09] * terang (~me@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:15] * cmccabe (~cmccabe@adsl-76-202-119-74.dsl.pltn13.sbcglobal.net) Quit (Quit: Leaving.)
[7:19] <henry_c_chang> sage: in our mail loop, you sent me a diff with a minor change. My client code still has it. do you think that matters?
[7:21] <henry_c_chang> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
[7:21] <henry_c_chang> index 98ab13e..9082d3c 100644
[7:21] <henry_c_chang> --- a/fs/ceph/caps.c
[7:21] <henry_c_chang> +++ b/fs/ceph/caps.c
[7:21] <henry_c_chang> @@ -1429,12 +1429,14 @@ static int try_nonblocking_invalidate(struct inode *inode)
[7:21] <henry_c_chang> if (inode->i_data.nrpages == 0 &&
[7:21] <henry_c_chang> invalidating_gen == ci->i_rdcache_gen) {
[7:22] <henry_c_chang> /* success. */
[7:22] <henry_c_chang> - dout("try_nonblocking_invalidate %p success\n", inode);
[7:22] <henry_c_chang> + dout("try_nonblocking_invalidate %p gen %d success\n", inode,
[7:22] <henry_c_chang> + invalidating_gen);
[7:22] <henry_c_chang> ci->i_rdcache_gen = 0;
[7:22] <henry_c_chang> ci->i_rdcache_revoking = 0;
[7:22] <henry_c_chang> return 0;
[7:22] <henry_c_chang> }
[7:22] <henry_c_chang> - dout("try_nonblocking_invalidate %p failed\n", inode);
[7:22] <henry_c_chang> + dout("try_nonblocking_invalidate %p gen %d failed, now %d, %lu pages\n",
[7:22] <henry_c_chang> + inode, invalidating_gen, ci->i_rdcache_gen, inode->i_data.nrpages);
[7:22] <henry_c_chang> return -1;
[7:22] <henry_c_chang> }
[7:22] <henry_c_chang> @@ -2304,10 +2306,10 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant,
[7:22] <henry_c_chang> } else {
[7:22] <henry_c_chang> /* there were locked pages.. invalidate later
[7:22] <henry_c_chang> in a separate thread. */
[7:22] <henry_c_chang> - if (ci->i_rdcache_revoking != ci->i_rdcache_gen) {
[7:22] <henry_c_chang> - queue_invalidate = 1;
[7:22] <henry_c_chang> - ci->i_rdcache_revoking = ci->i_rdcache_gen;
[7:22] <henry_c_chang> - }
[7:22] <henry_c_chang> + dout(" will queue invalidate, gen %d was revoking %d\n",
[7:22] <henry_c_chang> + ci->i_rdcache_gen, ci->i_rdcache_revoking);
[7:22] <henry_c_chang> + queue_invalidate = 1;
[7:22] <henry_c_chang> + ci->i_rdcache_revoking = ci->i_rdcache_gen;
[7:22] <henry_c_chang> }
[7:22] <henry_c_chang> }
[7:23] <sage> pretty sure not.. let me double-check
[7:25] <sage> yeah the current code is fine. it just avoids queueing invalidate work if it already appears to be queued.
[7:25] <henry_c_chang> ok..I'll revert it
[7:34] * lidongyang_ (~lidongyan@ Quit (Remote host closed the connection)
[7:38] * lidongyang (~lidongyan@ has joined #ceph
[7:53] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[8:01] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:13] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[8:30] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[8:43] * allsystemsarego (~allsystem@ has joined #ceph
[8:56] * xilei (~xilei@ has joined #ceph
[9:16] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:35] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[9:50] * henry_c_chang is now known as henrycc
[10:43] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[10:57] * Yoric (~David@ has joined #ceph
[12:12] <darkfader> my jumbo frame test was a bit horrible
[12:13] <darkfader> the switches went down to 300mbps ;)
[12:13] <darkfader> (does that tell anything about huawei? yes)
[12:20] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[12:25] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:33] <jantje_> huawei sucks :P
[12:34] <jantje_> cheap stuff
[12:34] <jantje_> you're lucky if it works
[12:34] <jantje_> if it doesn't work .. you're off for an endless trip
[12:34] <jantje_> no support I heard
[12:35] <jantje_> oh well, I'm at the other side :)
[12:50] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[12:50] <jantje_> darkfader: what speeds do you get with iperf and ip over infiniband
[12:50] <jantje_> ?
[12:55] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[12:55] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[13:00] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[13:24] <darkfader> ip is slow, 2.7gbps maybe (sdr infiniband)
[13:24] <darkfader> there is a offload protocol named sdp but it doesnt bump up the iperf throughput
[13:37] <jantje_> i see
[13:37] <jantje_> i dont know IB, but I would expect more :)
[14:26] * Meths_ (rift@ has joined #ceph
[14:32] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[14:36] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[14:50] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:20] * xilei (~xilei@ Quit (Quit: Leaving)
[15:33] * Meths_ is now known as Meths
[15:54] * growler (growler@dog.thdo.woaf.net) has joined #ceph
[16:09] * Yoric_ (~David@ has joined #ceph
[16:09] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[16:09] * Yoric_ is now known as Yoric
[16:35] * todinini (tuxadero@kudu.in-berlin.de) Quit (Read error: Connection reset by peer)
[16:41] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:42] * todinini (tuxadero@kudu.in-berlin.de) has joined #ceph
[16:50] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[16:53] * Meths_ (rift@ has joined #ceph
[16:58] * Meths__ (rift@ has joined #ceph
[17:00] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[17:00] * Meths__ is now known as Meths
[17:02] * Meths_ (rift@ Quit (Ping timeout: 480 seconds)
[17:38] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[17:51] * greglap (~Adium@ has joined #ceph
[18:22] <greglap> johnl: I went through that cfuse crash and while I know the immediate cause I can't figure out how it got into that state to begin with
[18:23] <johnl> my ceph filesystem is in an unusual state?
[18:23] <greglap> and your server didn't seem to be up so I couldn't look at it in more detail
[18:23] <johnl> should be up, will resend details
[18:24] <greglap> well, either there's a path it's following that I didn't find in the code or else the changes to cfuse have busted up the metadata a little bit
[18:43] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:47] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:47] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:03] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:21] <sagewk>
[19:26] * Yoric (~David@ Quit (Quit: Yoric)
[19:45] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[21:38] * julienhuang (~julienhua@ has joined #ceph
[22:26] * julienhuang (~julienhua@ Quit (Quit: julienhuang)
[23:22] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[23:22] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:28] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[23:29] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[23:41] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[23:53] <darkfader> jantje_: with the rdma feature the throughput is much closer to line rate (i think 7.8gbit raw for sdr, which is still lame, but sdr is ages old, qdr will be around 34gbit)
[23:54] <darkfader> rdma (verbs) is about the only good thing still in favor of gluster

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.