#ceph IRC Log


IRC Log for 2011-11-03

Timestamps are in GMT/BST.

[0:00] <psomas> journal _open /dev/sda2 fd 14: 4008222720 bytes, block size 4096 bytes, directio = 1
[0:01] <psomas> Does 'block size 4096' mean that IO will be done in 4096-byte chunks? Is there a way to change that?
[0:02] * fronlius1 (~Adium@e182092135.adsl.alicedsl.de) Quit (Quit: Leaving.)
[0:03] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[0:03] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:03] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[0:03] <gregaf> psomas: the block size is the size of the root device, though I don't recall exactly how that size interacts with the journaling — sjust?
[0:04] <gregaf> I believe it just means that the OSD will align each write on 4k boundaries
[0:04] <sjust> gregaf, psomas: I think you are right
[0:08] <psomas> I guess I need to understand better how journalling works, but I was trying to figure out if that block size reported (4K) can affect performance, i.e. would increasing it help as far as throughput is concerned (I guess latency would be increased too)
[0:09] <sjust> changing the block size would affect performance, I don't know precisely how in your case
[0:10] <gregaf> I don't think it would make a big impact, actually?
[0:10] <sjust> I wouldn't think so
[0:10] <gregaf> mostly just in the case of very small updates you will have more unused space in your journal, so you'd be better off with smaller block sizes (but getting smaller than 4k is hard)
[0:10] <psomas> because it uses directio, i thought it could make some impact
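(The direct-I/O alignment being discussed can be sketched as follows; this is an illustrative model of the padding cost, not Ceph's actual journal code.)

```python
# Hypothetical sketch: with O_DIRECT, write lengths and offsets must be
# multiples of the device block size, so each journal entry is padded
# up to the next block boundary before it hits the disk.

BLOCK_SIZE = 4096  # block size reported at journal _open

def padded_len(entry_len, block_size=BLOCK_SIZE):
    """Round an entry length up to the next block boundary."""
    return (entry_len + block_size - 1) // block_size * block_size

# A 100-byte update still consumes a full 4K block in the journal --
# this is the "unused space" cost of larger block sizes that gregaf
# mentions for very small updates.
print(padded_len(100))    # -> 4096
print(padded_len(4096))   # -> 4096
print(padded_len(5000))   # -> 8192
```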
[0:13] <nwatkins> gregaf: from lstat, is st_blksize the stripe unit or object size?
[0:14] <gregaf> nwatkins: where at? is it inconsistent or something?
[0:14] <gregaf> for hadoop I would really expect the stripe unit and object size to be the same
[0:15] <gregaf> but I guess that strictly speaking it should probably be the stripe unit
[0:16] <nwatkins> You calculate block locations using stripe unit so it should handle the common case. I was cleaning stuff up and saw some places to avoid multiple JNI crossings, but it depended on the interpretation of st_blksize
[0:19] <gregaf> ah
[0:19] <nwatkins> ^common=general case
[0:19] <gregaf> well, as I look at the Client::fill_stat function…it looks like st_blksize is set to the larger of fl_stripe_unit and 4096
[0:20] <gregaf> so, it's the stripe unit
[0:20] <gregaf> I don't know why there's a max check in there, maybe to prevent upper layers from barfing on a block size <4K?
[0:21] <nwatkins> Not sure about that--in the kernel I believe stripe unit is forced to be a multiple of the page size
[0:22] <nwatkins> that's all i know about that
[0:22] <gregaf> yeah, probably
[0:22] <gregaf> anyway, st_blksize is set to the stripe unit in Client and that should be the interpretation everywhere else too
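(The Client::fill_stat behaviour described above reduces to a simple rule; the function names below are illustrative, not the actual C++ code.)

```python
# st_blksize is reported as the stripe unit, floored at 4096 -- a
# sketch of what gregaf reads out of Client::fill_stat.

def fill_stat_blksize(stripe_unit):
    return max(stripe_unit, 4096)

# A Hadoop-style block-location lookup then divides by that value:
def block_index(offset, st_blksize):
    return offset // st_blksize

print(fill_stat_blksize(65536))  # -> 65536 (stripe unit wins)
print(fill_stat_blksize(1024))   # -> 4096 (the max() floor)
```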
[0:24] <nwatkins> gregaf: thanks. a couple questions about logistics: I have one more bug to take care of before creating the new working patch to Hadoop, but it all still depends on wip-getdir. I'd like to write documentation against master. Any idea on the timeline of merging that branch?
[0:25] <gregaf> oh, heh
[0:25] <gregaf> I will get my ass in gear and write those tests I wanted and merge it sometime tonight
[0:25] <gregaf> should've done that a while ago but been distracted/engaged in other things, sorry!
[0:26] <nwatkins> that'd be awesome, thanks!
[0:27] <gregaf> :)
[0:30] <gregaf> oh, I do love Hadoop's unit and integration testing
[0:31] <nwatkins> haha... i've found that ant test-commit checks significantly fewer things, and does FS-related stuff first.
[0:32] <gregaf> I just remember the inconsistency between LocalFileSystem and HDFS contract tests from two years ago — I filed a bug then and it ended up resolved as duplicate because a similar report was a year older
[0:32] <gregaf> so now it's been…3+ years and they still have tests that don't match each other
[0:55] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[0:56] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:56] <nwatkins> gregaf: yeh, I just confirmed one inconsistency on the mailing list (for listStatus assuming alphabetical ordering) -- the devs say this isn't really part of the contract, so I think I'll just submit a patch to Hadoop to see if that fixes things.
[0:57] <gregaf> that's what brought it up in my mind :)
[0:59] <nwatkins> The other issue is for mkdirs and will go away after a pass on separating libcephfs out and changing some of the interfaces to return exceptions properly.
[1:00] <gregaf> is that bugged anywhere?
[1:02] * jclendenan (~jclendena@ has joined #ceph
[1:03] <nwatkins> It's the first failure in #1656 -- the test harness expects -ENOTDIR for a deep path. The localFS emulation layer will handle this case, but the CephFS interface doesn't allow exceptions to be propagated back up
[1:03] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[1:05] <gregaf> ah — bad oversight on my part there :/
[1:06] <gregaf> (how on earth did that not get noticed when compiling?)
[1:07] <nwatkins> It's not a compile error--errors are communicated using the integer return value. The problem is that since localFS.mkdirs is wrapped in a try block, the CephFaker loses information about what the problem is
[1:08] <gregaf> oh, so the CephFaker actually doesn't throw any errors
[1:08] <gregaf> that's what I was confused about
[1:09] <gregaf> So CephFileSystem regenerates the Exceptions it should be throwing (or tries to), presumably?
[1:09] <gregaf> (been a while since I checked out these kinds of details)
[1:09] <gregaf> and I think I know why the interface is like that — throwing exceptions in JNI code is going to hurt… ;)
[1:09] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:09] <nwatkins> Yeh, it'll take some work
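(The errno-to-exception pattern being discussed can be sketched like this; all names here are hypothetical, not the actual libcephfs or Hadoop-binding API.)

```python
# A libcephfs-style call reports errors as negative errno return values;
# a wrapper layer turns them into typed exceptions so callers like the
# Hadoop test harness can distinguish ENOTDIR from other failures.
import errno

class NotADirectoryErr(Exception):
    """Hypothetical exception for -ENOTDIR."""

ERRNO_TO_EXC = {
    errno.ENOTDIR: NotADirectoryErr,
}

def check_rc(rc):
    """Raise a typed exception for a negative errno-style return code."""
    if rc < 0:
        exc = ERRNO_TO_EXC.get(-rc, OSError)
        raise exc(-rc)
    return rc

# mkdirs on a path with a file in the middle would return -ENOTDIR:
try:
    check_rc(-errno.ENOTDIR)
except NotADirectoryErr:
    print("caught ENOTDIR")
```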
[1:11] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[1:15] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:29] * adjohn is now known as Guest15613
[1:29] * Guest15613 (~adjohn@ Quit (Read error: Connection reset by peer)
[1:29] * adjohn (~adjohn@ has joined #ceph
[1:39] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:45] <nwatkins> My ceph cluster has been up for a few days, and all of a sudden my client started receiving connection refused when connecting.
[1:51] <nwatkins> here is the ceph.log for: http://pastebin.com/wc8V1YuR
[1:54] <nwatkins> Looks like this might be relevant from mds.log: 2011-11-02 17:37:50.924858 7ff186feb700 -- accepter no incoming connection? sd = -1 errno 24 Too many open files
[1:58] <joshd> nwatkins: that sounds like an fd leak, or something is keeping a lot of files open
[1:59] <nwatkins> joshd: yeh, I just restarted the mds and it seems to have resolved itself, but still seemed like a problem since the cluster has been lightly loaded
[2:01] <joshd> sounds more like a leak then - gregaf might have a better idea of where
[2:01] * adjohn (~adjohn@ Quit (Quit: adjohn)
[2:52] * cp (~cp@ Quit (Quit: cp)
[3:01] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:17] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:18] * stass (stas@ssh.deglitch.com) Quit (Read error: Connection reset by peer)
[3:26] * stass (stas@ssh.deglitch.com) has joined #ceph
[3:36] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[3:36] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit ()
[3:55] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[4:41] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has left #ceph
[4:46] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[5:44] * eternaleye_____ is now known as eternaleye
[6:40] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[7:29] * alexxy (~alexxy@ has joined #ceph
[7:38] * lhg_ (~lhg@ has joined #ceph
[9:49] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[9:49] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[10:17] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:20] * gregorg (~Greg@ Quit (Read error: No route to host)
[10:20] * gregorg (~Greg@ has joined #ceph
[10:25] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[10:40] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[10:48] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[11:07] <psomas> do multiple osds per physical server require different journals or can they use the same bdev as the journal?
[11:26] * lhg_ (~lhg@ Quit (Quit: Leaving)
[11:51] * mgalkiewicz (~mgalkiewi@ has joined #ceph
[11:51] <mgalkiewicz> hello
[11:52] <mgalkiewicz> Is there any manual, documentation or wiki where authentication for rbd is described?
[11:53] <mgalkiewicz> I have found some wiki but it was incomplete
[12:11] * slb|afk (~slb@gateway.ash.thebunker.net) Quit (Quit: leaving)
[13:08] <mgalkiewicz> do I have to configure cephx, or is rbd authentication something different?
[15:09] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[15:11] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[16:00] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[16:07] * mgalkiewicz (~mgalkiewi@ has left #ceph
[16:27] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[16:35] <gregaf> psomas: they all need their own journal — but of course if you're using an SSD or something and want to share you can partition it or give them each a file that lives on it to use
[16:35] <gregaf> mgalkiewicz: rbd just uses cephx authentication
[16:36] * tserong (~tserong@124-168-227-175.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:36] <psomas> so, I must use the osd identifier when defining the journal to use? or else, define it in the [osd.x] conf section?
[16:38] <gregaf> yes, if I understand your question correctly
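(A ceph.conf sketch of what gregaf describes: one journal per OSD, here as separate partitions of a shared SSD. The device paths are illustrative, and the exact option syntax may differ between Ceph versions.)

```ini
; Each OSD needs its own journal; a shared SSD can be partitioned
; so every [osd.N] section points at its own piece of it.
[osd.0]
    osd journal = /dev/sdb1
[osd.1]
    osd journal = /dev/sdb2
```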
[16:39] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[16:45] * tserong (~tserong@58-6-131-50.dyn.iinet.net.au) has joined #ceph
[17:08] * nwatkins` (~user@kyoto.soe.ucsc.edu) has joined #ceph
[17:10] * testing222 (~80723424@webuser.thegrebs.com) has joined #ceph
[17:10] * testing222 (~80723424@webuser.thegrebs.com) Quit ()
[17:14] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:37] * bchrisman (~Adium@ has joined #ceph
[17:39] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:00] <Tv> oh and now we get "blocked for more than 120 seconds" on xfs too? lovely
[18:00] <Tv> let's all switch to reiserfs
[18:00] <gregaf> yeah, I'm hoping something else is going on there
[18:05] * votz__ is now known as grun
[18:06] * grun is now known as votz
[18:45] * cp (~cp@ has joined #ceph
[18:52] * grape (~grape@c-76-17-80-143.hsd1.ga.comcast.net) has joined #ceph
[19:13] <yehudasa_> oh, great
[19:13] <yehudasa_> Tv: I think I solved the chunked read issue
[19:14] <yehudasa_> was able to dig deep enough in apache to find out that the test is sending bad format
[19:14] <Tv> oh
[19:14] <Tv> yehudasa_: details please?
[19:14] <Tv> yehudasa_: ohh swift test not s3test?
[19:15] <Tv> then i don't care as much ;)
[19:15] <yehudasa_> in chunked PUT the chunks are sent like this: first size in hex, then the data
[19:15] <yehudasa_> the swift test sends the size in hex prefixed with 0x
[19:15] <yehudasa_> which apache translates into 0
[19:15] <Tv> ahaha
[19:16] <yehudasa_> and I blamed it on mod_rewrite..
[19:16] <Tv> http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1 http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
[19:16] <Tv> patch to swift tests should be trivial
[19:16] <yehudasa_> Tv: yeah, patched it, it works now
[19:17] <yehudasa_> Tv: I did have to patch our mod_fastcgi too though
[19:17] <Tv> the f in fcgi is for fail :(
[19:17] <Tv> (and the cgi is for can't get improvements)
[19:18] <yehudasa_> heh.. but the patch isn't as intrusive as I feared originally.. I ended up rewriting the part that reads from the client, updating it to the apache2 API
[19:18] <yehudasa_> but I think I'll drop that and just keep the one liner change
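(The framing bug described above, sketched: per RFC 2616 section 3.6.1, each chunk is the size in bare hex, CRLF, the data, CRLF. The swift test's "0x" prefix is what apache parsed as a zero-length chunk.)

```python
# Correct chunked transfer encoding for one chunk: size in hex with NO
# "0x" prefix, then CRLF, data, CRLF.

def encode_chunk(data: bytes) -> bytes:
    # right: b"400\r\n..." for 1024 bytes; wrong: b"0x400\r\n..."
    return b"%x\r\n%s\r\n" % (len(data), data)

# A full body ends with a zero-size chunk and a blank line.
body = encode_chunk(b"A" * 1024) + b"0\r\n\r\n"
print(body[:5])
```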
[20:14] * cp (~cp@ Quit (Quit: cp)
[20:32] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[22:17] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[22:23] * Tv (~Tv|work@aon.hq.newdream.net) has left #ceph
[22:24] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[22:48] * tserong (~tserong@58-6-131-50.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[22:52] <gregaf> nwatkins`, can you tell me any more about that fd problem from yesterday?
[22:54] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[22:55] <todin> hi, how could I change the object size on the osd store from the 4MB to e.g. 8MB?
[22:56] * tserong (~tserong@124-168-227-41.dyn.iinet.net.au) has joined #ceph
[22:56] <joshd> todin: if you're using rbd, you specify the object size when creating the image
[22:57] <todin> joshd: yes, I do, how could I do that?
[22:57] <joshd> rbd create -s 1000 --order 23 imgname
[22:58] <joshd> order is the number of bits shifted to get the object size - default is 22
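(The --order arithmetic joshd describes, as a small sketch: object size = 2**order bytes, so the default order 22 is 4 MB and order 23 is 8 MB. Helper names are illustrative.)

```python
# rbd's --order flag: object size is 2 raised to the order, in bytes.

def order_to_object_size(order):
    return 1 << order

def object_size_to_order(size_bytes):
    order = size_bytes.bit_length() - 1
    assert 1 << order == size_bytes, "object size must be a power of two"
    return order

print(order_to_object_size(22))          # -> 4194304 (4 MB, the default)
print(object_size_to_order(8 * 2**20))   # -> 23
```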
[22:59] <todin> joshd: oh, that's nice, should be documented in the rbd help
[23:00] <joshd> indeed - I'll fix that
[23:01] <todin> cheers
[23:12] <Tv> http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=%2Fapis%2Ffcntl.htm
[23:23] <todin> joshd: 25 is the highest shift I can do?
[23:24] <joshd> it's the highest the command line tool lets you
[23:25] <joshd> yehudasa might know why that's the limit
[23:27] <todin> ok, I want to use rbd for virtual disk images, therefore I think a large object size is an advantage, any thoughts about that?
[23:29] <joshd> generally the guest or hypervisor will be doing smaller I/Os (like 512 bytes), so you probably won't see much difference in performance
[23:30] <gregaf> writes over 4MB are…exceedingly rare and if you get them then you're already going to be bound by wirespeed, so I don't think larger blocks will help you
[23:32] <todin> gregaf: that's bad news, I like the idea of ceph, but the performance in a kvm hosting environment is not good
[23:33] <joshd> todin: have you enabled the rbd_writeback_window?
[23:35] <todin> joshd: yes I did, the problem is more a btrfs problem, btrfs has some sort of ageing: after a while the io ops rate rises and the performance drops
[23:36] <Tv> sjust, gregaf: http://kerneltrap.org/mailarchive/linux-kernel/2008/5/30/1980764 has some of the history.. it's consistently talked about as a hint to the OS for better performance
[23:37] <Tv> as in, hint not a strong requirement
[23:38] <Tv> todin: i recall seeing that "aging" discussion with the btrfs upstream recently -- if that wasn't you, you should probably see if there was a resolution there
[23:40] <todin> Tv: there wasn't a real solution, the point was, none of the filesystems really works with ceph, we tested ext3/4 and btrfs
[23:40] <Tv> todin: we are getting some love from the xfs upstream too, you might want to try that if you didn't already
[23:42] <Tv> todin: but i'm sorry to hear about your troubles; we have a known bug http://tracker.newdream.net/issues/213 that affects non-btrfs, outside of that i don't expect many problems.. hopefully that'll be resolved in a near-future release (we pretty much *have to*)
[23:42] <todin> Tv: I will try
[23:43] <Tv> todin: also worth noting, we run heavy loads on ext4 every night and those tend to work ok; #213 can only bite you when recovering the journal
[23:45] <gregaf> some people have managed to break it, though :(
[23:45] <gregaf> although I think the only known bug is fixed in ext4 upstream now
[23:45] <todin> Tv: my problem with ext4 is, when I cleanly shut down the osd and unmount the store, and do a fsck on it, there are broken inodes
[23:45] <todin> I cannot tell you how severe that is, but I cannot trust my data on it
[23:46] <nwatkins`> gregaf: thanks for merging wip-getdir!
[23:46] <gregaf> nwatkins: yeah, sorry I didn't get it earlier — took a bit longer than I expected to massage the tests and sleep won out ;)
[23:46] <gregaf> thanks for the hadoop changes, those are in now too
[23:49] <Tv> todin: sounds like Ted Ts'o or someone would like to hear about those inodes..
[23:50] <gregaf> Tv, todin: I believe that's the bug that got fixed already
[23:50] <Tv> ah, good
[23:50] <gregaf> pretty sure it was non-fatal and related to our use of large xattrs and sandeen (not here right now) fixed it up
[23:51] <Tv> oh right, that conversation
[23:52] <todin> gregaf: where could I find that fix for ext4?
[23:52] <gregaf> …upstream? :p
[23:54] <todin> gregaf: I tested mainline v3.1.0 and it wasn't fixed
[23:54] <gregaf> here's the conversation about it: http://patchwork.ozlabs.org/patch/121441/
[23:54] <gregaf> it's only a week or 4 days old
[23:55] <gregaf> looks like it's commit 6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 in the ext4 git tree
[23:56] <gregaf> so it should be in the initial 3.2 judging by it being in a couple of for_linus branches
[23:57] * mgalkiewicz (~maciej.ga@ has joined #ceph
[23:57] <todin> ok, thanks, I will test it, would you guys like to have feedback on this?
[23:59] <gregaf> we always want to know what FSes work and what problems they have :)
[23:59] <mgalkiewicz> how to set authentication for each rbd volume?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.