#ceph IRC Log


IRC Log for 2012-01-23

Timestamps are in GMT/BST.

[0:44] * mrjack (mrjack@office.smart-weblications.net) Quit ()
[0:47] * MarkDude (~MT@ has joined #ceph
[0:49] * MarkDude (~MT@ Quit ()
[2:02] * lollercaust (~paper@67.Red-88-11-191.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[3:21] * aa (~aa@r190-64-67-132.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[3:29] * aa (~aa@r190-64-67-132.dialup.adsl.anteldata.net.uy) has joined #ceph
[3:35] * andresambrois (~aa@r186-52-130-198.dialup.adsl.anteldata.net.uy) has joined #ceph
[3:35] * andresambrois (~aa@r186-52-130-198.dialup.adsl.anteldata.net.uy) Quit (Remote host closed the connection)
[3:42] * aa (~aa@r190-64-67-132.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[5:04] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:10] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[5:31] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[6:32] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Ping timeout: 480 seconds)
[7:59] * alexxy[home] (~alexxy@ has joined #ceph
[8:04] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[8:55] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[10:15] * BManojlovic (~steki@93-87-148-183.dynamic.isp.telekom.rs) has joined #ceph
[10:22] * lollercaust (~paper@67.Red-88-11-191.dynamicIP.rima-tde.net) has joined #ceph
[10:47] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[10:51] * lollercaust (~paper@67.Red-88-11-191.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[11:03] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[11:08] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[11:55] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[12:01] * bchrisman1 (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[12:32] * alexxy[home] (~alexxy@ Quit (Remote host closed the connection)
[12:37] * alexxy (~alexxy@ has joined #ceph
[13:24] * lollercaust (~paper@67.Red-88-11-191.dynamicIP.rima-tde.net) has joined #ceph
[13:29] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[14:25] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) has joined #ceph
[14:34] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:03] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:06] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:14] * BManojlovic (~steki@93-87-148-183.dynamic.isp.telekom.rs) Quit (Quit: Ja odoh a vi sta 'ocete...)
[15:15] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:18] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:38] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:47] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:48] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[15:52] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:13] * hijacker (~hijacker@ Quit (Quit: Leaving)
[16:31] * hijacker (~hijacker@ has joined #ceph
[16:32] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[16:35] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[17:12] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[17:28] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Remote host closed the connection)
[17:28] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[17:30] * gregaf1 (~Adium@aon.hq.newdream.net) has left #ceph
[17:31] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[17:37] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:40] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:42] * fred_ (~fred@80-219-180-134.dclient.hispeed.ch) has joined #ceph
[17:42] <fred_> yehudasa, you around?
[17:43] <gregaf1> fred_: he's not here yet, but if you can remind me what you were doing I might be able to help
[17:43] <fred_> gregaf1, ok thanks
[17:43] <fred_> gregaf1, in fact I made a quick fix for rgw/rgw_rest.cc
[17:44] <fred_> gregaf1, it works for me but... I'm sure you would do it better because I completely ignored everything related to swift, which I don't know and don't use
[17:44] <fred_> gregaf1, dive in the source ?
[17:45] <gregaf1> did you send it to the list?
[17:45] <gregaf1> I remember you guys were talking but I don't remember what problem you were seeing :)
[17:46] <fred_> no, didn't send to the list, my @yahoo.com address gets unsubscribed every time without notice, I'm tired of re-re-re-re-re-subscribing
[17:47] <fred_> from irclogs --> Basically, if my fcgi param SCRIPT_NAME is unset radosgw crashes, if it is '/', the only way to pass bucket name seems to be via hostname (rgw_swift_url_prefix is unset). Looking at how rgw_rest.cc is written, it seems strange that at line 443 you check the content of SCRIPT_NAME, and that if it is '/' you goto done, bypassing setting bucket name which happens before the done label, at line 521 (every line number refers to ceph git tag v0.4)
[17:47] <gregaf1> ah right
[17:48] <fred_> my proposal (which works fine) was http://pastebin.com/FNeHEXhp
[17:49] <fred_> yehudasa's proposal was http://pastebin.com/Uw05TEgG
[17:49] <gregaf1> did you try that out?
[17:49] <gregaf1> this is sufficiently detailed that I think we're better off letting him discuss it with you :)
[17:51] <fred_> didn't try because it seemed to me that it would not work
[17:51] <fred_> ok, I'll see him another day
[17:52] <fred_> from what I understand SCRIPT_NAME is a kind of prefix where your s3gw is reachable
[17:53] * adjohn is now known as Guest251
[17:53] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[17:54] * Guest251 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[17:54] * jojy (~jvarghese@ has joined #ceph
[17:55] <fred_> it seems to me that init_entities_from_header should start by parsing away this prefix, and then get bucket and object names
[17:55] <gregaf1> yeah; I'm afraid I don't know
[17:55] <gregaf1> I'm reasonably conversant with how rgw works but the apache interface stuff I have no idea about
[17:55] <fred_> sure, I hope he will look at his irc backlog :)
[17:56] * jojy (~jvarghese@ Quit ()
[17:57] <gregaf1> I'll poke him when I see him :)
[17:58] <fred_> I'm sending this to him via privmsg, thanks
[18:16] <ajm> http://adam.gs/osd.9.log
[18:16] <ajm> anyone seen something like this? 4 OSD on the same box won't start up after a semi-unclean shutdown of that box
[18:20] <Tv|work> os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
[18:20] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[18:20] <Tv|work> 2012-01-23 12:04:14.729322 7fc7b1dd1780 filestore(/data/osd.9) error (22) Invalid argument not handled on operation 12 (op num 1, counting from 1)
[18:21] <Tv|work> so your journal is telling it to do something that results in EINVAL
[18:22] <gregaf1> that's the −1 truncate bug
[18:22] <gregaf1> argh
[18:22] <gregaf1> I think it's fixed in v0.40, let me check
[18:22] <Tv|work> static const int OP_TRUNCATE = 12; // cid, oid, len
[18:22] <Tv|work> yeah
[18:22] <gregaf1> or at least hacked away; I don't know what happens once it's in the journal
[18:22] <gregaf1> …yeah, once journaled we're stuck
[18:22] <gregaf1> crap
[18:23] <ajm> hrm, fixed in 0.40 ? I'm on 0.40
[18:24] <ajm> this was a new array, its always been 0.40
[18:24] <Tv|work> hmm
[18:24] <Tv|work> greg just walked out in search of answers
[18:29] <ajm> :) ok
[18:29] <gregaf1> ajm: hrm, maybe the patch is newer than I thought
[18:29] <gregaf1> oh, yep, it's not in v0.40
[18:30] <ajm> aha
[18:31] <ajm> are there any issues if i run trunk osd on an otherwise 0.40 cluster?
[18:31] <gregaf1> there's a problem with the MDS sending truncate_seq 1 truncate_size −1 to the OSD, which is supposed to mean "don't truncate" and the OSD is interpreting it as "truncate now!", and it errors out as too large a file size
[18:31] <ajm> or do you know the commit so I can backport?
[18:31] <sagewk> elder: around?
[18:31] <gregaf1> the commit right now is just a hack in the OSD to drop that particular set of values; it's id 0ded7e4dac9fb9357afe6bd2fa9f02d0a96ed06c
[18:31] <yehudasa> fred_: looking into the issue again
[18:32] <gregaf1> ajm: unfortunately it won't fix your journals; so your best option there is to zero out the journal and go backwards in time a little bit
[18:32] <ajm> gregaf1: if its already present in my journal though, isn't it just going to happen even if I patch this, or does it happen from when the osd reads the journal?
[18:32] <ajm> lol, ok
[18:32] <gregaf1> yeah :/
[18:33] <ajm> just overwrite the journal file with \0 ?
[18:34] <gregaf1> there's an osd command on startup; ceph-osd --help should tell you
[18:35] <gregaf1> or I can look it up if you need :)
[18:37] <ajm> ah
[18:37] <ajm> mkjournal ?
[18:41] <gregaf1> ajm: yes
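A hedged sketch of the journal-reset procedure gregaf1 points at (the `-i` id and `--mkjournal` flags match ceph-osd of this era; the init-script invocation is an assumption about this particular box, and osd id 9 comes from the pasted osd.9.log):

```shell
# Recreate (zero out) a corrupt OSD journal instead of replaying it.
# WARNING: as discussed above, this discards journaled-but-unsynced
# writes, rolling the OSD back slightly in time.
/etc/init.d/ceph stop osd.9    # stop the affected daemon first (path assumed)
ceph-osd -i 9 --mkjournal      # reinitialize the journal for osd.9
/etc/init.d/ceph start osd.9
```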
[18:42] <ajm> adam # ceph-osd --help
[18:42] <ajm> global_init: unable to open config file.
[18:42] <ajm> thats annoying :/
[18:44] <sagewk> tv|work: 5 or 29 i think
[18:44] <Tv|work> sepia29 has a busted disk
[18:44] <Tv|work> apt-cache policy build-essential hits a block that gets read errors
[18:45] <gregaf1> ajm: bugged! (http://tracker.newdream.net/issues/1960)
[18:45] <ajm> gregaf1: that fixed it, thanks
[18:45] <yehudasa> fred_: SCRIPT_NAME is the part of the url following the host name, and preceding any extra parameters (e.g., in foo.com/bucket/object?acl it'll be bucket/object). It's also url-decoded.
[18:45] <gregaf1> oh good
[18:46] <yehudasa> fred_: so basically I think both patches aren't going to work for anything other than get/put of objects/buckets
[18:46] <yehudasa> fred_: setting ACLs for example wouldn't work as it requires extra params
[18:47] <Tv|work> sagewk: i put sepia29 down in the teuthology runner db, it should get skipped
[18:47] <yehudasa> fred_: in any case, my patch is just a spin of yours, making it less implicit
[18:47] <sagewk> tv|work: yay thanks
[18:47] <Tv|work> sagewk: i'm also just gonna shutdown -h the machine, to avoid stupid issues with it
[18:48] <sagewk> k
[18:53] <gregaf1> iggy: you figure out the journal stuff?
[18:54] <gregaf1> the size of the journal should be enough to absorb all inbound writes while the main store is syncing, so probably 2-5 seconds worth of writes
[19:03] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:05] * bchrisman (~Adium@ has joined #ceph
[19:06] <iggy> gregaf1: thanks, that helps a little
[19:12] <gregaf1> all right, let us know if there's something else…not really sure what you're after :)
[19:18] <elder> sage, Yes, sorry. I didn't notice your post.
[19:19] <iggy> I'm looking at nvram/ssd sizing for a journal
[19:19] <iggy> nvram devices all appear to be fairly small (512M, 1G, 2G range)
[19:22] <gregaf1> iggy: oh, yeah
[19:24] <gregaf1> that ought to work fine; the system defaults are 100MB, and that might be a bit small
[19:24] <gregaf1> but 1GB will handle 2-3 seconds for a pretty fast array
[19:25] <iggy> I was thinking the 2G for a 12 drive system
[19:26] <gregaf1> I suspect that would work fine (it's 3 seconds at 56MB/s per drive), but we haven't tested a wide range of these things yet
[19:27] <iggy> hmm, yeah, hadn't thought of thinking of it in that regard
[19:28] <gregaf1> that's the way I'd do it — the journal has to get all the writes the main store does, but it can toss them out as soon as the main store does a sync/completes a snapshot :)
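gregaf1's sizing rule can be sanity-checked with shell arithmetic, using the illustrative figures from the conversation (12 drives, 56 MB/s each, a 3-second sync window):

```shell
# Back-of-envelope journal sizing: the journal must absorb all inbound
# writes while the main store syncs. All figures are illustrative.
drives=12          # spindles behind one box
mb_per_sec=56      # sustained write rate per drive, MB/s
sync_secs=3        # worst-case sync interval to cover, seconds
echo "$(( drives * mb_per_sec * sync_secs )) MB"   # prints: 2016 MB
```

So a 2G NVRAM device is in the right ballpark for a 12-drive system, as the thread concludes.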
[19:31] * lollercaust (~paper@67.Red-88-11-191.dynamicIP.rima-tde.net) Quit (Read error: Connection reset by peer)
[19:55] * sage (~sage@ Quit (Ping timeout: 480 seconds)
[20:04] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[20:15] <ajm> would anyone say the fuse cephfs client is better/worse than the in-kernel client?
[20:16] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[20:16] <nhm> ajm: what do you mean by better?
[20:17] <ajm> more stable, faster
[20:18] * johnl_ (~johnl@2a02:1348:14c:1720:24:19ff:fef0:5c82) has joined #ceph
[20:18] <nhm> ajm: I doubt it's faster, but it might be less likely to take a client down when something unexpected happens.
[20:19] * dwm__ (~dwm@2001:ba8:0:1c0:225:90ff:fe08:9150) has joined #ceph
[20:19] <nhm> ajm: One of the other guys could probably answer me more concretely though.
[20:19] * johnl (~johnl@2a02:1348:14c:1720:24:19ff:fef0:5c82) Quit (Ping timeout: 480 seconds)
[20:21] * dwm_ (~dwm@2001:ba8:0:1c0:225:90ff:fe08:9150) Quit (Read error: Connection reset by peer)
[20:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:43] <ajm> gregaf1: ceph fuse does similar things, do you think I should append that to the bug or make another one?
[21:13] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[21:34] <gregaf1> ajm: yeah, just stick it in there; if it becomes appropriate to split out we can do so
[21:35] <gregaf1> regarding ceph-fuse versus kernel, Sage thinks it's a bit slower but I'm not sure what benchmarks that's based on; I've seen it be faster in some situations (because eg its metadata caching is actually turned on; unlike some of the kernel's) and it certainly is easier to get updates rolled and applied
[21:35] <ajm> i think the fuse stuff actually requires a ceph.conf ? or i'm not figuring out how to specify the options right
[21:36] <ajm> "ceph mount failed with (1) Operation not permitted"
[21:37] <gregaf1> well, it requires all the same data as a kernel mount
[21:38] <gregaf1> but getting FUSE set up properly can be annoying
[21:38] <gregaf1> you probably have it set up so that only the root user can create FUSE mounts
[21:40] <ajm> hrm, i'm running that as root and I don't think its dropping privs or anything
[21:41] <gregaf1> okay, what command are you running?
[21:42] <ajm> tried a number of iterations but something as simple as: ceph-fuse -m a.b.c.d:6789 /ceph
[21:43] <ajm> -o name=name,secret=whatever as well
[21:44] <gregaf1> ah, did you specify the part of the Ceph tree to mount?
[21:44] <gregaf1> "ceph-fuse -m a.b.c.d:6789:/ /ceph -o name=name,secret=whatever"
[21:46] <ajm> # ceph-fuse -m a.b.c.d:6789:/ /ceph -o name=name,secret=$(</etc/ceph-keyfile)
[21:46] <ajm> server name not found: a.b.c.d:6789:/ (Success)
[21:46] <ajm> usage: ceph-fuse [-m mon-ip-addr:mon-port] <mount point>
[21:48] <ajm> hrm, i'm not sure its expecting that there? there's a -r option to mount a subdir
[21:49] <gregaf1> oh, hrm, never mind that
[21:49] <gregaf1> I guess the format changed without my noticing at some point!
[21:50] <gregaf1> all right, well, we can dump out a bunch more debugging to see what it's getting caught up on
[21:50] <ajm> like "add some printfs" or there's stuff in there i can enable
[21:50] <gregaf1> add "--debug_client 10 --debug_ms 1" to that line and see what it prints out
[21:50] <gregaf1> ^ command-line options :)
[21:51] <ajm> hrm
[21:51] <gregaf1> oh, except we need some stupid log-to-stdout thing
[21:51] <gregaf1> sagewk, do you remember what it is to get debug to standard out?
[21:52] <sagewk> --log-to-stderr and 2>&1
[21:52] <ajm> derr is fine
[21:52] <ajm> oh ok
[21:52] <ajm> so its an auth issue
[21:52] <ajm> ==== auth_reply(proto 2 -1 Operation not permitted) v1 ==== 24+0+0 (1078721237 0 0) 0x1ab1800 con 0x1ab88c0
[21:55] * vodka (~paper@34.Red-88-11-190.dynamicIP.rima-tde.net) has joined #ceph
[21:57] <gregaf1> does it work properly if you specify the keyring file instead of echoing it?
[21:57] <gregaf1> (I don't do much setup stuff so I'm trying to page in what's good and bad)
[21:57] <gregaf1> Tv thinks you can't specify secret on the command-line like that
[21:57] <gregaf1> but if your config file is good it should just work
[21:57] <Tv|work> as far as i know, secret= was a kludge for kernel mounting only
[21:57] <Tv|work> ceph-fuse should behave just like the other ceph command line tools, same basic auth mechanism (read ceph.conf, read keyring, find key matching client.$name, name defaults to "admin")
[21:58] * edwardw`away is now known as edwardw
[21:58] <ajm> hrm
[21:58] <ajm> let me try to add the info in the config, perhaps name on the command line doesn't work or something
[21:58] <ajm> i tried secretfile and it didn't work either
[21:59] <ajm> name= in the [global] section should specify this?
[22:00] <gregaf1> you should be able to specify "name=admin" (or whatever user you set up) on the command-line or in the global config section
[22:01] <ajm> hrm, name=thename doesn't seem to work for the tools
[22:01] <ajm> --name client.thename does work on the command line however
[22:02] <Tv|work> ajm: you need the config file to have "keyring" set
[22:02] <Tv|work> ajm: not secretfile, that's all kernel stuff, kernel mounting is more limited
[22:02] <ajm> i have that as well
[22:02] <ajm> --name client.thename works for ceph.fuse now
[22:05] <ajm> actually same thing for the other ceph tools as well, ceph -s only works if i --name=client.thename
[22:07] <gregaf1> yeah, they're all using the same code; I must have gotten backwards whether you needed the "client" bit or not
[22:07] <ajm> nothing i specify in the config though seems to set the name
[22:07] * vodka (~paper@34.Red-88-11-190.dynamicIP.rima-tde.net) Quit (Ping timeout: 480 seconds)
[22:09] <gregaf1> I think it defaults to client.admin if you don't specify one; perhaps we've gotten confused by that
[22:09] <gregaf1> sorry :/
[22:09] <ajm> but only possible to specify on the command line then I guess?
[22:10] <Tv|work> uhhh, $name is supposed to consist of $type.$id, and type in this case is hardcoded to "client"
[22:10] <joshd> the confusing thing is that the kernel-related tools used name=admin, since they don't use ceph conf files etc, but the others use --name=client.admin
[22:10] <Tv|work> perhaps the argument parsing is silly and by "--name=" means id
[22:10] <ajm> reading ./src/common/config_opts.h i don't see anything that would be name
[22:10] <gregaf1> yeah, I think it can't be conf-specified
[22:10] <ajm> oh is it id= in the config perhaps ?
[22:11] <Tv|work> yeah kernel is special crufty magic
[22:11] <Tv|work> don't read kernel mounting docs if you're using fuse
[22:11] <Tv|work> kernel is special, ceph-fuse is like all the other ceph tools
[22:11] <ajm> I wasn't reading them as much as i'm just influenced by them :)
[22:11] <ajm> i've been using the kernel mounts and just trying out fuse right now
[22:11] <Tv|work> $name = $type.$id
[22:11] <Tv|work> set either name or id
[22:11] <Tv|work> but name is expected to have the type in it
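Pulling the thread together, a hypothetical working setup consistent with what Tv|work describes (the client name, keyring path, and monitor address placeholder are illustrative, not from a tested configuration):

```shell
# /etc/ceph/ceph.conf fragment -- ceph-fuse reads the "keyring" option,
# not the kernel client's secretfile=/secret= mount options:
#   [client.thename]
#       keyring = /etc/ceph/keyring.thename
#
# Then mount, passing the full $type.$id form as --name:
ceph-fuse -m a.b.c.d:6789 /ceph --name client.thename
```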
[22:12] <ajm> (id= in [global] doesn't work)
[22:12] <Tv|work> ajm: oh that would be a bad idea..
[22:12] <Tv|work> ajm: when you read the config file, you already know your $type.$id
[22:12] <ajm> ah
[22:12] <Tv|work> ajm: and only pay attention to those sections, etc
[22:12] <ajm> hrm
[22:12] <Tv|work> usually, you use ceph-osd -i 42 etc
[22:13] <ajm> oh ok
[22:13] <ajm> this makes more sense then
[22:13] <ajm> its mildly annoying for the simple ceph -s case or whatever
[22:13] <ajm> having to specify the id there every time
[22:14] <gregaf1> it may be something that's appropriate to revisit; we're all just using the default client.admin so we don't actually have to specify….
[22:14] <joshd> ajm: you can set it in the CEPH_ARGS environment variable (any args can be put there)
[22:14] <ajm> thats useful :)
[22:14] <ajm> i just like to keep things in such a way that every box has its own key
[22:14] <Tv|work> ajm: yeah, with most of those tools you just use the default client.admin
[22:15] <ajm> it tweaks my nature otherwise
[22:15] <Tv|work> CEPH_ARGS in environment tweaks my nature..
[22:15] <Tv|work> i hate magic environments
[22:15] <Tv|work> mostly because when one day the var is not there, you can't figure out why things are breaking
[22:17] <gregaf1> yeah, I got hit with that recently…then I put the export I needed into my .profile *sigh*
[22:17] * vodka (~paper@179.Red-88-11-190.dynamicIP.rima-tde.net) has joined #ceph
[22:18] <ajm> yeah this one is definitely a big personal preference thing :)
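joshd's CEPH_ARGS tip in shell form (the client name is hypothetical; note Tv|work's caveat above that a magic environment variable makes failures puzzling the day it goes missing):

```shell
# Hypothetical per-box identity: the ceph tools read CEPH_ARGS from the
# environment and treat its contents as extra command-line arguments,
# so a plain "ceph -s" no longer needs --name repeated each time.
export CEPH_ARGS="--name client.thename"
# ceph -s    # would now authenticate as client.thename
```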
[22:21] * daemonik (~Adium@static-173-55-114-2.lsanca.fios.verizon.net) has joined #ceph
[22:23] <daemonik> According to this http://learnitwithme.com/?p=303 CephFS (POSIX) does not perform nearly as well as NFS. Has CephFS performance significantly improved since 0.32?
[22:23] * fronlius (~fronlius@f054112216.adsl.alicedsl.de) has joined #ceph
[22:23] <darkfader> well, nfs replication is still RFC-only, right?
[22:23] <darkfader> ceph has a little more work to do
[22:23] <Tv|work> daemonik: so much depends on the workload
[22:24] <darkfader> Tv|work: are there any recent numbers by chance?
[22:24] <Tv|work> darkfader: nope, we'll start benchmarking more once the new hardware is fully operational
[22:25] <Tv|work> now we're still more concerned with stress than benchmark
[22:25] <gregaf1> I think we got some about 3 months ago, right?
[22:25] <Tv|work> from a 3rd party, yeah
[22:25] <gregaf1> that somebody else generated I mean, mostly for rbd but not all
[22:26] <gregaf1> oh, that might be those numbers, n/m
[22:27] <Tv|work> there's also a lot of detail in what the set up is, e.g. Christian Brunner's btrfs slowdown email
[22:27] <gregaf1> anyway, I think a few things have gotten better, but you're still going to find that on the same hardware Ceph is slower than NFS — it is after all doing twice the writes, etc
[22:29] * Tv|work (~Tv|work@aon.hq.newdream.net) has left #ceph
[22:29] * Tv|work (~Tv|work@aon.hq.newdream.net) has joined #ceph
[22:33] <daemonik> Tv|work: A Dovecot cluster (using director proxies) with 32mb mdbox files is something I'd like to use CephFS for. Shouldn't CephFS outperform NFSv4 if metadata is aggressively cached? Using FreeBSD and ZFS with the ZIL on a ramdisk and L2ARC on 2.5" SSDs would speed things up quite a bit no?
[22:36] * edwardw is now known as edwardw`away
[22:36] * edwardw`away is now known as edwardw
[22:38] * edwardw is now known as edwardw`away
[22:44] <Tv|work> daemonik: i honestly don't know, off hand -- benchmark, benchmark, benchmark
[22:44] <daemonik> Tv|work: In theory, CephFS should eventually run circles around NFSv4 the way that RBD runs circles around iSCSI?
[22:45] <Tv|work> daemonik: especially once the multi-mds stuff is stable, nfs will hit a bottleneck ceph won't
[22:45] <Tv|work> rbd is embarrassingly distributable
[22:45] <Tv|work> winning over a single-backend system in that space is way easier
[22:46] <Tv|work> and the single-backend bottleneck is hit earlier, too
[22:47] <dwm__> Oooh, a RADOS backend for Dovecot could work most nicely.
[22:49] <dwm__> Would require adding code to Dovecot itself, of course.
[22:50] <daemonik> Are there any disadvantages to using CephFS on FreeBSD? ZFS makes FreeBSD much more preferable to Linux.
[22:50] <dwm__> Suspect it might be less effort to just use the higher-level Ceph directly.
[22:51] <gregaf1> the only problem with assuming that Ceph will eventually be faster than NFS is that it does a lot more work (regardless of implementation, it provides stronger data safety guarantees, and is real POSIX), but…hopefully!
[22:52] <gregaf1> we haven't tried running Ceph on FreeBSD, so no idea how it's doing; and to really optimize it might require some work to take advantage of ZFS
[22:56] <daemonik> gregaf1: I was very excited to see the announcement that Ceph compiles on FreeBSD. This may no longer be accurate, but the consensus is that ZFS maturity and known stability makes it highly desirable over btrfs. ZFS self-healing gives it a lot of appeal. I'm sure that many sys admins would happily dedicate a massive amount of RAM to caching POSIX metadata and an SSD if that's what it takes. Like dwm__ just said, it would be less e
[22:58] <gregaf1> yeah, it ought to work; I'm just saying that the OSD's "FileStore" does a lot of special-casing for btrfs to take advantage of its snapshots and things, whereas running on zfs is going to go down the normal filesystem code paths
[23:14] <daemonik> gregaf1: Ceph does not take advantage of ZFS' snapshots in the same way at the moment?
[23:14] <gregaf1> not yet; doing so requires some filesystem-specific code and nobody's done it yet for ZFS
[23:15] <gregaf1> contributions welcome ;)
[23:17] <darkfader> daemonik: one thing from the sidelines: nfs4 is a vague term i think - pnfs is hard to beat, and so is nfs over rdma. if you suggest putting ZIL on a ramdisk you kinda imply putting up something fully synchronous (nfs) versus a ceph that's being bent to be asynchronous
[23:17] <darkfader> that's like the ext3 vs. everyone else benchmarks were
[23:18] <darkfader> async and unstable vs more sync / replayable in that case
[23:18] <darkfader> and like TV said it mostly matters what workload you got
[23:19] <darkfader> pNFS can't (at least i think so) scale out over many systems
[23:19] <darkfader> so for stable bandwidth with many streams into one client, ceph (imo) would win
[23:19] <darkfader> since it can feed that from many many ods
[23:19] <darkfader> +s
[23:20] <darkfader> if i said anything wrong, correct me please :)
[23:21] <daemonik> darkfader: Does NFSv4 include pnfs and nfs over rdma? It is correct to say there are no stable/available implementations of pnfs for FreeBSD / Linux?
[23:22] <darkfader> i think pnfs is not available as stable in freebsd or linux. oracle has a stable in-kernel client and netapp a stable server. those combined are *quite* fast. and nfs/rdma is not in stock linux nfs i think, but stable enough as far as i know. on freebsd i'd assume this is something "someone might wanna add that as a GSoC project"
[23:24] <darkfader> *bigsmile* i think the people who are using glusterfs in backend and nfs/rdma in frontend would say that nfs is giving them less headache than the other bit
[23:24] <darkfader> i'll better shut up now
[23:24] <daemonik> darkfader: Yeah, there's no stable-and-popular version of Solaris out there. When we have FreeBSD with ZFS, a lot of us can avoid Netapp. Something like Ceph is the missing piece. I care about availability more than anything. Ceph's architecture looks much better than GlusterFS'.
[23:24] <darkfader> let me add i don't see a point in avoiding netapp
[23:24] <darkfader> for scaleout storage, yes i'd pick ceph
[23:25] <darkfader> as a nfs server i'd rather go and jump off something than decide for something instead of a filer
[23:25] <darkfader> (except for *cough* budgetary reasons)
[23:25] <daemonik> darkfader: I used GlusterFS, the documentation was sparsely distributed, not centralized, outright inaccurate in certain places, error messages were not very helpful, and performance (despite much tuning) was terrible and performance bottlenecks were opaque. I have joined the camp of people who don't want to touch it again.
[23:25] <darkfader> haha
[23:25] <darkfader> have a beer :)
[23:26] <darkfader> that could be my summary too
[23:26] <daemonik> darkfader: Netapp is expensive. FreeBSD is open source. ZFS is open source. Putting together a FreeBSD box with an ACARD 9010 and 2.5" SSDs is fast and easy. =)
[23:26] <darkfader> well anyway i also hope the guy with the freebsd ceph port makes headway
[23:26] <darkfader> yes
[23:27] <darkfader> will still not remotely touch a filer
[23:27] <darkfader> put, say, 500 active clients on both and then compare
[23:27] <darkfader> with 1-10 the zfs box will of course run circles around it
[23:28] <darkfader> and of course you should not have any ha requirements ;)
[23:28] <daemonik> I know that Ceph was developed with btrfs, but ZFS is still much more welcomed by the world.
[23:28] <daemonik> darkfader: Should not have any HA requirements? =\
[23:29] <darkfader> daemonik: ceph can handle the failover, sorry
[23:29] <Tv|work> daemonik: if you define world as in "not linux", then perhaps ;)
[23:29] <darkfader> i was stuck in nfs land
[23:30] <daemonik> darkfader: During a graceful shutdown of a node in a two-node cluster, there would be no "failover", no disruption, does Ceph allow for this?
[23:30] <Tv|work> also, do remember Sun was bought by Oracle, and Oracle pays for btrfs, *and* has committed to using it in their linux distro
[23:30] <darkfader> Tv|work: world as in already deployable CoW filesystems> giggle
[23:30] <Tv|work> now, i don't expect much clue to transfer over from Sun to Oracle, but.. the tap for ZFS funding is probably not flowing as well as it used to.
[23:31] <darkfader> Tv|work: not sure, oracle is not broke, sun was broke
[23:31] <darkfader> they might have better financing for zfs now
[23:31] <daemonik> Tv|work: Does Unbreakable 6.x use btrfs as the default filesystem? I don't trust Oracle, and I don't see the kind of communication from the btrfs team that I have from the ZFS team. If there are docs or posts on btrfs that can help instill confidence in btrfs please link to it.
[23:31] <Tv|work> daemonik: we just sat in a talk by Mason in Saturday where we said the Oracle linux distro *will* use btrfs by default
[23:32] <gregaf1> *he said :)
[23:32] <darkfader> Tv|work: but at which release?
[23:32] <Tv|work> yes, *he
[23:32] <daemonik> You guys are in the bay area huh?
[23:32] <gregaf1> Los Angeles
[23:32] <Tv|work> darkfader: i think it was "next"
[23:32] <darkfader> 6.x has ext4 and ocfs2
[23:32] <gregaf1> yes, "next" as in "mid-February"
[23:32] <gregaf1> god help him
[23:32] <Tv|work> darkfader: as in, his benefactors are holding him to that, too
[23:32] <darkfader> gregaf1: by rhel's release cycle it would be 2014
[23:32] <darkfader> for "next"
[23:33] <darkfader> Tv|work: uh oh.
[23:33] <darkfader> btrfs is like... years from production grade?
[23:33] <daemonik> What if ZFS / btrfs could offload filesystem operations that deal with checksums and compression to a GPU? Has any one else thought about that?
[23:34] <gregaf1> daemonik: I'm sure somebody has…but you'll run into bandwidth issues; CPUs are fast enough that you'd probably spend more time transferring data in and out than you'd save calculating checksums :)
[23:34] <Tv|work> daemonik: GPUs tend to have issues accessing data not specifically handed to them
[23:34] <darkfader> daemonik: checksum can be done in cpu offloaded and fast
[23:34] <gregaf1> plus ewww gpu code in the kernel
[23:34] <darkfader> and compression algorithms suck at parallelism
[23:34] <daemonik> Does btrfs re-silver as efficiently as ZFS does yet?
[23:35] <darkfader> i.e. pigz can only compress in multiple threads, but not decompress
[23:35] <darkfader> daemonik: since when is it efficient on zfs? ;)
[23:35] <Tv|work> daemonik: re-silver?
[23:36] <darkfader> Tv|work: scrubbing for normal storage
[23:36] <gregaf1> for RAID volumes, I assume
[23:36] <daemonik> Perhaps sha-256 checksums could be done better by a GPU though? I haven't used ZFS with sha-256 block checksums, but I imagine the overhead is high. In ZFSland a mirror/raidz recover is called resilvering. Unlike Linux softraid, ZFS only repairs the bits that need to be repaired.
[23:37] <Tv|work> darkfader: scrubbing to verify checksums, or what? Mason just demoed that this weekend, I don't remember how long it's been there.
[23:37] <darkfader> daemonik: reread what greg said about bw
[23:37] <Tv|work> daemonik: btw there's a dm module for linux that keeps track of dirty blocks, can do just the needed raid recovery
[23:38] <darkfader> Tv|work: yes.. i'll go bitter admin mode again about that... checksumming is cool now since it is finally no longer a feature of enterprise hardware that no one ever needed... errr...
[23:38] <Tv|work> but Mason made pretty convincing noises about the feature set of btrfs's internal raid replacement.. the only non-feature there was that it reads from a random copy, so the replicas should have similar performance
[23:38] <daemonik> Tv|work: Do you remember what it's called?
[23:39] <Tv|work> daemonik: let me google a bit..
[23:39] <Tv|work> daemonik: probably this: http://alinux.tv/Kernel-2.6.34/device-mapper/dm-log.txt
[23:39] <darkfader> Tv|work: the thing about same speed replicas - how much does it still apply to ceph nowadays?
[23:39] <daemonik> Stacking bcache and other device-mapper pieces together . . could get ugly.
[23:39] <Tv|work> darkfader: with ceph, you can use the osd weights to adjust for that
[23:40] <daemonik> No mention of hashes on that page. =(
[23:40] <darkfader> ah, thanks. i'll put some re-reading on my list for travelling
[23:42] <darkfader> daemonik: hashes are pointless unless they're tracked somewhere. if you got hash abcd on raiddisk 1 and defg on raiddisk2 it's over (unless it is a block where you still know the last writer). zfs can sort since it can track the checksums
[23:42] <darkfader> hashes
[23:42] <darkfader> whatever
[23:42] <darkfader> but in mirror, that is only possible if you still have it in the log
[23:42] <darkfader> and then you don't really need a hash imho
[23:43] <darkfader> err anyway. i think there's a good use for most software and good night :)
[23:44] <daemonik> With btrfs / Ceph one wouldn't be using RAID any way in production, correct?
[23:44] <Tv|work> daemonik: in general, we prefer JBOD, unless you have ridiculously many disks in one box
[23:44] <Tv|work> daemonik: and then we say use several small bundles of disks
[23:45] <daemonik> Tv|work: Are there builds of btrfs that one dares use in production?
[23:45] <Tv|work> daemonik: btrfs doesn't change any of that, but using its internal raid-like features instead of raid may make it perform better, or give you nicer features, but stability is more of an unknown
[23:45] <Tv|work> daemonik: latest stable kernel, in general
[23:46] <Tv|work> daemonik: but realize your osds will crash every now and then -- many copies keep stuff safe
[23:46] <daemonik> Tv|work: Would it work against Ceph to have two zpools on separate hosts?
[23:46] <Tv|work> daemonik: zpool as in zfs? i don't know zfs real world behavior well enough to recommend anything
[23:46] <Tv|work> i only read the papers, i haven't used it in anger
[23:46] <daemonik> Anger?
[23:47] <Tv|work> for real
[23:47] <Tv|work> http://www.phrases.org.uk/bulletin_board/55/messages/494.html
[23:48] <Tv|work> actually, this is even better http://english.stackexchange.com/questions/30939/is-used-in-anger-a-britishism-for-something
[23:48] <Tv|work> the difference between fiddling in the lab, and actually needing it to work
[23:55] * fronlius (~fronlius@f054112216.adsl.alicedsl.de) Quit (Quit: fronlius)
[23:58] <daemonik> ZFS is worth checking out. All of the RAID/LVM stuff went out the window. Write caches and layer-2 read caches can be casually added. The RAID5 write-hole is closed. RAIDZ replaced RAID5, RAIDZ2 replaced RAID6, there is also RAIDZ3. There is no need for fsck. ZFS is effectively always online (weekly patrol scrubs check blocks against their checksums). One of the most compelling features of ZFS is that it self-heals.
[23:59] * The_Bishop (~bishop@cable-89-16-138-109.cust.telecolumbus.net) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.