#ceph IRC Log


IRC Log for 2012-05-04

Timestamps are in GMT/BST.

[0:00] <yehudasa> sagewk: looking at http://fpaste.org/P1yG/ there are a few cases where the log has 10-20 ms gaps. That follow _do_op completion.
[0:01] <joao> nhm, just curious, what do you mean by "generating movies"?
[0:02] <nhm> joao: seekwatcher lets you create movies showing where writes to the disk are happening over time
[0:02] <nhm> joao: http://nhm.ceph.com/movies/wip-throttle/
[0:03] <joao> cool :)
[0:04] <joao> wow, this is more entertaining than watching watchmen
[0:17] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[0:19] <dmick> Wow. So, one of the processes in a single-job build has a 113G RSS
[0:19] <dmick> now to see which one...
[0:19] <Tv_> holy mother
[0:20] <Tv_> err how can that be RSS? VSZ i'd believe
[0:20] <dmick> yes, this confuses me as well
[0:20] <Tv_> and even then that'd be mighty impressive
[0:23] <dmick> oops. 1.13G. but still.
[0:25] <Qten> morning ceph,
[0:26] <Qten> can anyone explain or point me to something that explains why btrfs doesn't need a ceph journal file?
[0:27] <Qten> reason i ask is i'm trying to work out my performance issues or lack of :)
[0:27] <gregaf> Qten: thanks to snapshots (and other fine-grained hooks into the filesystem) it's possible to guarantee a consistent disk image without the use of a journal when using btrfs
[0:28] <gregaf> but it's not recommended; things will still be a lot faster with one
[0:28] <Qten> i thought the journal was more of a "cache" to collect up the writes, so instead of writing 100 times it would send 1 write?
[0:29] <gregaf> that's one of the things it does
[0:29] <Qten> so i suppose my question is how does snapshots affect that?
[0:29] * lofejndif (~lsqavnbok@19NAAIJ8M.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[0:29] <gregaf> the other is that with a non-btrfs filesystem, if the OSD crashes then it doesn't know whether the disk is in a consistent state or not, so it replays the journal
[0:30] <gregaf> this is the same function a traditional filesystem journal plays
[0:30] <Qten> ahh
[0:30] * sagelap (~sage@aon.hq.newdream.net) has joined #ceph
[0:30] <Qten> seems logical
[0:30] <yehudasa> sagelap: looking at http://fpaste.org/P1yG/ there are a few cases where the log has 10-20 ms gaps. That follow _do_op completion (I think)
[0:31] <Qten> so in a scenario if i was using a non raid setup how would i handle the journal?
[0:31] <sagelap> yehudasa: i grepped for ' = ' in that dump, so you don't see when things start and end, only end
[0:31] <Qten> on a raided pair of ssd?
[0:32] <gregaf> Qten: you could put it wherever you wanted to
[0:32] <Qten> i guess is there any need to protect the journal?
[0:32] <yehudasa> sagelap: hmm.. yeah, I assumed it's the end, do you have the full dump?
[0:32] <gregaf> ah, not any more strictly than the OSD itself
[0:32] <gregaf> if you lose the journal you'll lose the OSD
[0:33] <sagelap> it's on ceph.com in nhm's dir
[0:33] <gregaf> but I wouldn't bother doing RAID or anything; that's expensive!
[0:33] <Qten> so when a write gets sent to the osd the replication takes place in the ceph cluster, does the write go to server 1 -> 2,3 or Server 1 -> 2 -> 3? or ?
[0:34] <gregaf> 1->2, 3
[0:34] <gregaf> the primary OSD sends it to each of the replicas in turn
[0:34] <Qten> at the same time?
[0:35] <gregaf> that's up to the networking; it's not strictly defined ;)
[0:35] <Qten> i guess i'm trying to work out the pros/cons of raid/network through put
[0:35] <Qten> and how it affects the write speed of the client
[0:35] <gregaf> each replica needs to receive and commit to disk the write before the client gets told it was written
[0:36] <Qten> so the write is only successful if it's been written to the goal level on the servers?
[0:37] <gregaf> in the absence of failures, yes
[0:37] <yehudasa> nhm: sagelap: sadly under nhm.ceph.com I can only find movies, any other place?
[0:37] <gregaf> it's possible the cluster is running in degraded mode if a node went down and hasn't been booted yet
[0:37] <sagelap> there's a dir in there somewhere, lemme look
[0:37] <Qten> does ceph have built-in handling so far, so that if an osd was to fail in the middle of a write it "fails over" or ?
[0:37] <nhm> ugh, I just spent the last hour and a half dealing with Century Link and still don't have my problem resolved.
[0:37] <gregaf> but otherwise with replication level 3 you will get 3 on-disk replicas
[0:37] <nhm> yehuda: nhm.ceph.com/results
[0:37] <gregaf> Qten: of course :)
[0:38] <yehudasa> nhm: thanks
[0:38] <nhm> yehudasa: sorry, I should stick everything in one place probably.
[0:38] <gregaf> you will get higher latency for the in-progress operations but they will resolve cleanly
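The write path gregaf describes (client to primary, primary to each replica, acknowledgement only after every copy has committed) can be sketched roughly as below. This is a toy model, not Ceph code: the `Osd` class, `commit` method, and `client_write` function are hypothetical names made up for illustration, and the sketch ignores the degraded-mode and failure handling mentioned in the discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Osd:
    """Hypothetical stand-in for an OSD: a name plus a dict acting as its disk."""
    name: str
    up: bool = True
    disk: Dict[str, bytes] = field(default_factory=dict)

    def commit(self, obj: str, data: bytes) -> bool:
        """Pretend to commit the write to disk; a down OSD never commits."""
        if not self.up:
            return False
        self.disk[obj] = data
        return True


def client_write(primary: Osd, replicas: List[Osd], obj: str, data: bytes) -> bool:
    """Primary-copy write: the client talks only to the primary (1 -> 2, 3).

    The primary commits locally, forwards the write to each replica, and the
    client is told the write succeeded only once every copy has committed.
    """
    if not primary.commit(obj, data):
        return False
    acks = [replica.commit(obj, data) for replica in replicas]
    return all(acks)


if __name__ == "__main__":
    osds = [Osd("osd.0"), Osd("osd.1"), Osd("osd.2")]
    print(client_write(osds[0], osds[1:], "demo-object", b"payload"))
```

Because the client sends the data once and the primary fans it out, the replication traffic lands on the cluster network rather than on the client's link, which is the contrast with client-side replication that Qten raises next.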
[0:38] <nhm> yehudasa: will have btrfs results up soon.
[0:38] <Qten> gregaf: ok :) was trying to work out why the likes of gluster do the writes from the client side not server side replication
[0:39] <nhm> ok, bbl, dinner time
[0:39] <Qten> i see it as an issue because your server connecting to the cluster needs 10gbe :)
[0:39] <Qten> or better
[0:39] * jeffp (~jplaisanc@net66-219-41-161.static-customer.corenap.com) has joined #ceph
[0:40] <gregaf> Qten: gluster has a much different, much simpler model
[0:40] <Qten> don't get me wrong, we're pretty much sold on ceph, just trying to suss out how it compares
[0:40] <gregaf> it's allowed them to get something stable faster
[0:40] <Qten> thats pretty much what i figured
[0:42] <Qten> so if i have a server with 24 disks is a shared journal ssd going to be enough for speed? i would think i'd need to raid them otherwise it would be the bottleneck?
[0:42] <gregaf> Qten: ah, yes, you'll want your journals to be able to keep up with your network bandwidth
[0:42] <jeffp> hi i'm getting an error trying to start my osd's on a new cluster (ceph 0.46)
[0:42] <jeffp> -1 global_init_daemonize: BUG: there are 1 child threads already started that will now die!
[0:42] <jeffp> full error at: http://pastebin.com/ak57DhEs
[0:43] <gregaf> (I assume with that many disks you're using 10GigE)
[0:43] <Qten> yerp
[0:43] <Qten> or IB
[0:43] <Qten> maybe
[0:44] <Qten> IB=infiniband
[0:44] <gregaf> jeffp: yikes, I haven't seen that error in a long time
[0:44] <gregaf> sjust: does that look familiar to you? maybe a bad path you get into with a weird OSD config?
[0:44] <joao> gregaf, I saw it just a few days ago, but it went away by rerunning ./ceph-osd
[0:44] <Qten> gregaf: thanks for the info shall go and simmer on it :)
[0:44] <gregaf> np :)
[0:45] <sjust> never seen it
[0:45] <jeffp> it's been intermittent but now it's happening every time
[0:45] <joao> granted, I was testing on my desktop with a single osd
[0:45] <joao> so my case is probably of no use
[0:45] <gregaf> oh dear, we've got a race somewhere then...
[0:45] <jeffp> yeah that's what i was thinking
[0:46] <jeffp> i'm running a 5 node cluster running on centos inside virtualbox all on the same machine
[0:46] <jeffp> maybe the slowness is exacerbating it
[0:48] * adjohn is now known as Guest480
[0:48] * adjohn (~adjohn@ has joined #ceph
[0:48] <gregaf> ah
[0:48] * Guest480 (~adjohn@ Quit (Read error: No route to host)
[0:48] <yehudasa> sagewk, nhm: 2012-05-02 14:45:10.213709 7f661ee58700 2 filestore(/srv/osd.0) op_queue_reserve_throttle waiting: 50 > 1000 ops || 209822252 > 209715200
[0:49] <yehudasa> nhm: did you look at op_queue_reserve_throttle?
[0:49] <yehudasa> or whatever that config param is
[0:50] <jeffp> http://pastebin.com/Zjmi6Bhw is my ceph.conf, i don't think there's anything too out of the ordinary there
[0:50] <sjust> joao: can you make a bug for that?
[0:51] <joao> sure
[0:51] <yehudasa> nhm: filestore queue max bytes, filestore queue max ops
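One way to read the log line yehudasa quotes ("50 > 1000 ops || 209822252 > 209715200") is as a two-sided limit: the queue blocks when either the op count or the byte count exceeds its cap. The sketch below only mirrors that reading of the log message; it is not the filestore implementation, and `should_wait` and its parameter names are hypothetical. The two limits are assumed to correspond to the "filestore queue max ops" and "filestore queue max bytes" settings named above.

```python
def should_wait(queued_ops: int, queued_bytes: int,
                max_ops: int, max_bytes: int) -> bool:
    """Return True when the queue is over either limit (ops OR bytes)."""
    return queued_ops > max_ops or queued_bytes > max_bytes


if __name__ == "__main__":
    # Values taken from the quoted log line.
    max_bytes = 200 * 1024 * 1024      # 209715200, i.e. 200 MiB
    print(should_wait(50, 209822252, 1000, max_bytes))
    # How far over the byte limit the queue was at that moment.
    print(209822252 - max_bytes)
```

With these numbers the op count (50) is far below its limit of 1000, so under this reading it is the byte limit alone that makes the queue wait.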
[0:52] * dennis (~chatzilla@p5DE4A247.dip.t-dialin.net) has joined #ceph
[0:53] * dennis is now known as dennisj
[0:54] <dennisj> hi, i've just set up a small 2-system ceph cluster which seems to work fine so far but apparently i cannot create rbd images
[0:54] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) has joined #ceph
[0:54] <CristianDM> Hi
[0:55] <dennisj> what i get is this: librbd: failed to assign a block name for image
[0:55] <dennisj> create error: (5) Input/output error
[0:55] <dennisj> on this command "rbd create --size 1000 demo"
[0:55] <CristianDM> I have a doubt about the performance with small files
[0:56] <gregaf> dennisj: josef was having that problem earlier and I think it turned out to be his OSD not starting up, can you paste the output of "ceph -s"?
[0:56] <gregaf> CristianDM: what can we tell you about?
[0:57] <Tv_> dennisj: perhaps this: http://ceph.com/docs/master/dev/osd-class-path/
[0:57] <dennisj> 2012-05-04 00:56:54.346572 pg v185: 200 pgs: 1 creating, 199 active+clean; 8730 bytes data, 549 MB used, 7885 MB / 9999 MB avail
[0:57] <dennisj> 2012-05-04 00:56:54.347010 mds e9: 1/1/1 up {0=0=up:active}
[0:57] <dennisj> 2012-05-04 00:56:54.347033 osd e20: 2 osds: 2 up, 2 in
[0:57] <dennisj> 2012-05-04 00:56:54.347082 log 2012-05-04 00:43:39.121193 mon.0 7 : [INF] mds.0 up:active
[0:57] <dennisj> 2012-05-04 00:56:54.347596 mon e1: 1 mons at {0=}
[0:57] <gregaf> all right, probably Tv_'s thing then
[0:57] <dennisj> tv: yes, i already found that via google and added "osd class dir = /usr/lib64/rados-classes" to the osd section
[0:58] <sagelap> dennisj: was the default value (determined by autoconf) wrong?
[0:59] <dennisj> how do i see the default value ceph is using?
[0:59] <sagelap> ceph --show-config | grep class
[0:59] <sagelap> or ceph-osd -i 123 --show-config|grep class ... whatever
[1:01] <joao> is it relevant to fill the "Source" box when creating a new issue?
[1:01] <dennisj> both of these commands don't work.
[1:01] <sagelap> dennisj: what version are you running?
[1:01] <dennisj> 0.45
[1:01] <joao> sjust, jeffp, http://tracker.newdream.net/issues/2382
[1:01] <joao> sjust, should I dive into it?
[1:02] <sjust> if you want, sure
[1:02] <dennisj> i see that the tracker bug mentions "cls_rbd.so"
[1:02] <sagelap> oh, that was added in 0.46
[1:02] <sjust> greg's guess is probably right, our wayward thread is most likely in the filestore
[1:02] <dennisj> yet the file here is called "/usr/lib64/rados-classes/libcls_rbd.so.1.0.0"
[1:02] <sagelap> could do strings /usr/bin/ceph | grep /var/lib :)
[1:02] <nhm> yehudasa: tried that a while back with little improvement. I didn't dig into what was going on though as the results were more or less the same.
[1:02] <dmick> and the winner is: g++ of test/encoding/ceph_dencoder.cc, clocking in at 1.1GB RSS
[1:03] <joao> sjust, will take a look then, and report back; I'm going to bed early-ish today :p
[1:03] <sjust> ok
[1:03] <jeffp> joao, sjust, gregaf, thanks for looking into it
[1:03] <nhm> joao: so you are going to bed before me tonight? ;)
[1:04] <joao> we'll see :p
[1:04] <sagelap> dmick: that's not surprising :)
[1:04] <sagelap> dmick: it basically links against eeeeverything
[1:04] <Tv_> sagelap: but that's .c -> .o
[1:04] <Tv_> .cc i mea
[1:04] <Tv_> n
[1:05] <Tv_> sagewk: it sounds to me the include files are just messier than they need to be
[1:05] <sagelap> only that it wins :)
[1:05] <sagelap> exactly
[1:07] * adjohn is now known as Guest484
[1:07] * adjohn (~adjohn@ has joined #ceph
[1:07] * Guest484 (~adjohn@ Quit (Read error: Connection reset by peer)
[1:08] <dmick> next runner up is MDCache.cc at 812M
[1:08] <dmick> OSD.cc, 688M
[1:10] * adjohn is now known as Guest485
[1:10] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[1:14] * steki-BLAH (~steki@bojanka.net) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:14] <dennisj> looks like i found the culprit
[1:15] <dennisj> "libcls_rbd.so.1" and "libcls_rbd.so.1.0.0" exist but "libcls_rbd.so" is missing
[1:16] <dennisj> shouldn't it actually try to open libcls_rbd.so.1 though?
[1:17] <CristianDM> Hi
[1:17] <dennisj> _load_class could not open class /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed): /usr/lib64/rados-classes/libcls_rbd.so: cannot open shared object file: No such file or directory
[1:17] * Guest485 (~adjohn@ Quit (Ping timeout: 480 seconds)
[1:17] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) has left #ceph
[1:18] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) has joined #ceph
[1:18] <CristianDM> Hi. I need to use ceph for web hosting
[1:18] <sagelap> dennisj: are you working from a .deb? .rpm? make install?
[1:19] <CristianDM> So, I don't know if the best method is to use RBD with qemu or the FS directly
[1:19] <CristianDM> Is there any system that manages qemu and supports ceph?
[1:20] <gregaf> not that I'm aware of, although you could always present it as a local FS
[1:20] <gregaf> but RBD is getting a lot more testing and support than the filesystem right now, so you should use that if it's appropriate to your use case
[1:21] <gregaf> (if you're hosting KVM virtual disks, it's very appropriate)
[1:21] <CristianDM> I'm hosting web services and email services.
[1:21] <dennisj> rpm
[1:22] <dennisj> i took a fedora rpm and rebuilt it for centos 6.2
[1:22] <CristianDM> And is there any system such as cloudstack, proxmox, etc that can use the ceph RBD as storage?
[1:22] <sagelap> dennisj: it is probably a ceph.spec error..
[1:23] <sagelap> it should be packaged as .so
[1:23] <gregaf> cloudstack (/Xen) can't use it directly yet, although I think josh might be working on it soon
[1:24] <gregaf> though you can mount it in the host kernel and use it as a regular block device for anything you like
[1:24] <dennisj> hm, i guess so
[1:24] <CristianDM> Cool. And another question. Is it possible to use it as RBD and FS at the same time?
[1:25] <Tv_> CristianDM: yes
[1:25] <CristianDM> Wow, ceph is perfect :D
[1:28] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[1:28] <CristianDM> So, use with qemu is all manual, there isn't any manager that supports ceph?
[1:30] <dennisj> oddly enough "libcls_rgw.so" exists. only "libcls_rbd.so" hasn't been symlinked
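The failure dennisj tracked down (the loader asks for `libcls_rbd.so`, but only the versioned names were packaged) is easy to check for mechanically. The sketch below is a hypothetical helper, not part of Ceph or its packaging: `missing_unversioned` and `check_dir` are made-up names, and the assumption is simply that every versioned `lib*.so.N...` file should have an unversioned `lib*.so` sibling in the same directory.

```python
import os
import re
from typing import List

# Matches versioned shared-object names such as libcls_rbd.so.1 or
# libcls_rbd.so.1.0.0 and captures the unversioned part (libcls_rbd.so).
_VERSIONED = re.compile(r"^(lib.+\.so)(\.\d+)+$")


def missing_unversioned(filenames: List[str]) -> List[str]:
    """List unversioned .so names that are implied by versioned files but absent."""
    present = set(filenames)
    wanted = set()
    for name in filenames:
        match = _VERSIONED.match(name)
        if match:
            wanted.add(match.group(1))
    return sorted(wanted - present)


def check_dir(path: str) -> List[str]:
    """Run the same check against a real directory listing."""
    return missing_unversioned(os.listdir(path))


if __name__ == "__main__":
    # File names as reported in the discussion above.
    listing = [
        "libcls_rbd.so.1", "libcls_rbd.so.1.0.0",
        "libcls_rgw.so", "libcls_rgw.so.1", "libcls_rgw.so.1.0.0",
    ]
    print(missing_unversioned(listing))
```

Pointing `check_dir` at the class directory (for example the `/usr/lib64/rados-classes` path quoted earlier) would list which symlinks the package failed to create.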
[1:36] <CristianDM> So, use with qemu is all manual, there isn't any manager that supports ceph?
[1:45] * sagelap (~sage@aon.hq.newdream.net) Quit (Quit: Leaving.)
[1:47] <CristianDM> Any?
[1:49] <gregaf> CristianDM: not sure what you mean...
[1:51] <CristianDM> I am selecting a cloud management platform, and I don't know if there is any that supports ceph rbd
[1:51] <gregaf> qemu/kvm can use rbd natively
[1:51] <Tv_> CristianDM: openstack, libvirt, anything using kvm if you're willing to put in glue code (like most service providers will end up doing anyway)
[1:51] <gregaf> libvirt supports it
[1:52] <Tv_> CristianDM: xen can use the linux kernel rbd support too, if you put in the effort to configure it
[1:52] <CristianDM> My idea is use kvm
[2:02] * joshd (3f6e330b@ircip4.mibbit.com) has joined #ceph
[2:03] <joao> sjust, found nothing and wasn't able to reproduce it; am now heading to bed, and will give it another go tomorrow
[2:07] <Tv_> uhh http://gitbuilder-oneiric-deb-amd64.ceph.newdream.net/log.cgi?log=627761f87c0c397112528766f2307fff66a593f8
[2:07] <Tv_> it's been sick for a while now
[2:08] <Tv_> elder: this one seems to be on your shoulders ;)
[2:08] <Tv_> i'll fix it in the morning if nobody's gotten to it -- i got to head out soon
[2:08] <dmick> man, that is an awfully verbose way to say very little
[2:09] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[2:09] <Tv_> well it's a log of what happened, without a good way to summarize it
[2:09] <nhm> sage: btrfs results (with flusher on) are up.
[2:10] <nhm> next up will be btrfs results with flusher off.
[2:15] <elder> nhm, finally back online, sorry it took so long. Any highlights with btrfs that you've already noticed?
[2:16] <elder> I see there's a nice sequential pattern on the first one I'm looking at for btrfs. But then again, I think it's a log-based filesystem and that's just the way it works.
[2:17] <elder> Always writing to new space, always continuously appending. I'm sure it would look less pretty in an aged and/or full file system.
[2:17] <nhm> elder: Yeah, I haven't looked at it closely enough to have any comments yet
[2:18] <elder> Well that's just my immediate reaction, having watched it play twice (osd0 4m 2threads btrfs)
[2:18] <dmick> elder, did you notice Tv_'s message?
[2:18] <nhm> elder: gotta put the kids to bed soon. Overall btrfs performance seemed a bit better but there are still some weird gaps.
[2:19] <elder> dmick, I saw the "on your shoulders" but hadn't gone back to see what it was he was referring to.
[2:19] <dmick> build problem
[2:20] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[2:20] <nhm> elder: even with btrfs though, I can dd to the btrfs partition on osd0 at over twice the speed that we are seeing in those movies.
[2:21] <nhm> elder: so it's a bit better, but something is still really wrong.
[2:21] <elder> nhm, I do not suspect XFS.
[2:22] <elder> So therefore I expect btrfs should also exhibit some sort of trouble.
[2:23] <nhm> elder: I wasn't really suspecting xfs, but I was wondering if maybe we were hitting multiple problems, one of which is the way we interact with xfs.
[2:23] <elder> True.
[2:23] <nhm> ok, time to put the kids to bed. Might be back tonight, but more likely tomorrow morning.
[2:24] <joao> elder, nhm, what are you guys chasing after?
[2:25] * joshd (3f6e330b@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[2:26] * Tv_ (~tv@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:28] <Qten> hey guys, anyone know if its possible to backup a rbd device to another machine/ceph cluster?
[2:28] <elder> Sorry, I've been talking with my wife... OK, let me see, where was I... Joao: Mark is trying to find the source of some odd delays or maybe slow performance numbers while running some sort of continuous write test.
[2:29] <elder> So he's been capturing detailed info about I/O completion while running a test, and he has plots here--movies!--of the I/O activity for various scenarios: http://nhm.ceph.com/movies/wip-throttle/
[2:29] <elder> Today he tried some tests using btrfs as a backing store behind the OSD's rather than XFS to see how things changed.
[2:29] <elder> dmick, I can look at a build problem but I have no access to any build machines that I'm aware of.
[2:30] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[2:30] <dmick> elder: How....are you building software to test?...
[2:30] <elder> joao, you were supposed to be heading to bed.
[2:30] <joao> elder, do you know if those writes are always over the same file, on a clean fs?
[2:31] <elder> dmick, I mean if there's a problem with the build machine proper, I can't get in and kick things.
[2:31] <elder> joao, I don't know for sure.
[2:31] <dmick> looks like this is a coding error, but, yeah, dunno for sure
[2:31] <elder> nhm is the one to ask. I'm sort of an excited lurker that is offering theories about what the plots say.
[2:31] <elder> Is it on ceph-client:?
[2:31] <dmick> was just seeing if I could reproduce
[2:32] <dmick> can you not see the gitbuilder output?
[2:32] <elder> I can. Just hunting down the link right now.
[2:32] <dmick> ok
[2:32] <dmick> it seems to be ceph, not client
[2:32] <dmick> and seems to be an automake/configure issue
[2:33] <joao> elder, can you enlighten me on what does the plot's X and Y represent?
[2:33] <joao> I find it amusing to look at the dots being plotted, but haven't figured out how to read the plot yet :x
[2:33] <elder> The graph is actually a rectangle representing blocks on the drive.
[2:33] <joao> oh
[2:33] <joao> okay
[2:33] <elder> So the plot sort of wraps repeatedly
[2:34] <joao> that explains how most of the writes are contiguous
[2:34] <elder> Highest strip across the rectangle is the highest block offsets in the device.
[2:34] <elder> Yes.
[2:35] <elder> Both graphs are plots over time though. So the upper rectangle represents I/O across the whole disk. The lower boxes show seek rate and aggregate bandwidth
[2:35] <joao> yeah, thanks
[2:35] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[2:35] <dmick> it appears that the intensity of the color is inversely proportional to the age of the request
[2:35] <elder> I believe so.
[2:35] <joao> the first movie on btrfs is cool
[2:35] <elder> It hangs around for a bit so you can visualize the sequence a bit better.
[2:36] <elder> Color fades over a few time steps.
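elder's description of the upper plot (a rectangle standing for the whole device, with block offsets wrapping row by row and the highest offsets in the top strip) amounts to a simple index mapping. The sketch below illustrates that description only; it is not taken from seekwatcher, and `block_to_cell`, the grid size, and the device size are all hypothetical.

```python
from typing import Tuple


def block_to_cell(offset: int, device_blocks: int,
                  rows: int, cols: int) -> Tuple[int, int]:
    """Map a block offset to (row, col) in a rows x cols grid.

    Row 0 holds the lowest offsets and row rows - 1 (the top strip when drawn
    bottom-up) holds the highest, so sequential writes sweep along a row and
    then wrap to the next one.
    """
    if not 0 <= offset < device_blocks:
        raise ValueError("offset outside device")
    index = offset * (rows * cols) // device_blocks
    row, col = divmod(index, cols)
    return row, col


if __name__ == "__main__":
    # A made-up 1000-block device drawn on a 10 x 10 grid.
    for offset in (0, 500, 999):
        print(offset, block_to_cell(offset, 1000, 10, 10))
```

Under this mapping a purely sequential, append-style workload shows up as a steadily advancing line of cells, which matches the contiguous pattern discussed for the btrfs run.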
[2:36] * lxo (~aoliva@lxo.user.oftc.net) Quit (Read error: No route to host)
[2:36] <elder> dmick, I see a lot of errors, but nothing has changed on the ceph-client branch.
[2:36] <dmick> the thing Tv pointed at was not from a ceph-client
[2:36] <elder> Oh.
[2:37] <elder> OK, where do I need to look?
[2:37] <dmick> HEAD is now at 627761f Merge remote-tracking branch 'gh/wip-ceph-kdump-copy'
[2:37] <dmick> make[2]: Entering directory `/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-0.46-122-g627761f/src' make[2]: *** No rule to make target `init-ceph.in', needed by `init-ceph'. Stop.
[2:37] <dmick> seem to be the relevant things
[2:37] <elder> Is this for ceph.git?
[2:37] <dmick> but it's not failing for me
[2:38] <dmick> given that it's got ceph-object-corpus and leveldb at the top, I think so
[2:38] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:38] <dmick> I'm new to this
[2:39] <elder> OK, let me update my local tree. I have spent very little time in the ceph tree but I'll see if I notice anything...
[2:39] <dmick> I think Tv pointed at you because of wip-ceph-kdump-copy
[2:39] <elder> Wait a minute, why was that merged?
[2:40] * joshd (3f6e330b@ircip4.mibbit.com) has joined #ceph
[2:40] <elder> In fact, I don't see that merged.
[2:40] <dmick> why do you think it's merged, and where?
[2:41] <dmick> the gitbuilders build all the branches, right?
[2:41] <elder> Wait, I have to make sure my local tree is sane then I'll speak again.
[2:41] <joao> wth just happened here: http://nhm.ceph.com/movies/wip-throttle/4m-flusher-2threads-btrfs-osd0.mpg
[2:41] <joao> 20-30s mark
[2:42] <joshd> Qten: you can use 'rbd export' to save rbd images to a local file
[2:42] <elder> What are you seeing joao?
[2:42] <joao> I'm seeing a complete gap of activity
[2:43] <elder> You mean the drop to 0 of seeks and MB/sec?
[2:43] <joao> yes
[2:43] <elder> I think it's missing data.
[2:43] <elder> The movie seems to slip ahead there when you play it too.
[2:44] <joao> yes, but I assumed it would do that if, say, the whole period was just a bunch of zeros
[2:44] <joshd> CristianDM: wido has been adding rbd support to cloudstack as well
[2:44] * mkampe (~markk@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:50] <elder> OK, dmick I can't figure out what I'm supposed to be looking at. Granted, I'm pretty distracted.
[2:50] * joshd (3f6e330b@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[2:54] <elder> gregaf, how did you get to be #42?
[2:54] <dennisj> hm, i've just installed 0.46, rebuilt the cluster and now when i start osd.0 I get "global_init_daemonize: BUG: there are 1 child threads already started that will now die!"
[2:55] <jeffp> dennisj: i ran into this earlier today, http://tracker.newdream.net/issues/2382
[2:56] <jeffp> dennisj, when i run '/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf' manually it starts up
[2:57] <jeffp> if i use init.d i get that error
[2:57] <jeffp> sometimes
[2:58] <dennisj> doesn't seem to help here
[2:58] <dennisj> get the same error 100% of the time
[2:59] <dennisj> same with mds and mon
[3:00] <joao> the bug clearly isn't restricted to the osd
[3:00] <dennisj> ah, finally
[3:00] <joao> I've just triggered it twice, once with an mds and another with a monitor
[3:00] <dennisj> after re-running this like 20 times very quickly in the shell the daemon started
[3:01] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[3:02] <Qten> josh: ok i'll try rbd export out :)
[3:04] <Qten> thanks
[3:04] <dennisj> finally got all the daemons running
[3:09] <joao> updated the issue
[3:09] <joao> going to bed
[3:09] <joao> later
[3:09] <nhm> back
[3:10] <joao> oh, nice
[3:10] <nhm> joao: that gap may actually be a gap
[3:10] <joao> a gap in the results?
[3:10] <joao> or a gap in the writes issued by the test?
[3:10] <nhm> joao: the osd logs, blktrace results, and collectl output are in http://nhm.ceph.com/results
[3:11] <nhm> joao: there are times where I've seen 1-1.5s gaps in the logs with debugging maxed.
[3:11] <joao> also, does the test issue contiguous writes onto one single file, or over multiple files?
[3:12] <nhm> joao: it's rados bench, issuing 16 concurrent requests at the defined request size.
[3:13] <nhm> joao: I confess I haven't looked very closely at the src yet.
[3:14] * adjohn (~adjohn@ has joined #ceph
[3:14] <joao> nhm, I can't recall ever looking into it either
[3:14] <nhm> joao: typically it seems like slow periods on the data disks are associated with no journal and network activity, but I still don't know why exactly.
[3:15] <nhm> joao: I think there may be multiple things going on.
[3:15] <joao> oh...
[3:15] <joao> on btrfs?
[3:15] * adjohn is now known as Guest496
[3:15] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[3:16] <nhm> joao: well, on xfs there are the nasty seek spikes. On both there are some times when there are a bunch of small writes periodically. Both are also far slower than they should be, and both also seem to have strange gaps where the writes just stop for a while.
[3:16] <nhm> joao: so I'm thinking this may not all be one problem.
[3:16] <joao> nhm, the small writes can be the xattrs
[3:17] <joao> and occasionally the on-disk metadata updates
[3:17] <nhm> joao: sounds plausible. Apparently Sage found that at least during one of the stalls the only thing that was happening was sync
[3:18] <joao> btrfs's sync?
[3:18] <nhm> that was on xfs
[3:19] <joao> I don't know xfs well enough to think of anything then :\
[3:20] <nhm> joao: yeah. I need to look through the logs from the btrfs tests.
[3:20] <nhm> joao: See what's happening there.
[3:20] <nhm> Anyway, I should let you go to bed. ;)
[3:20] <joao> we'll talk more about this tomorrow :)
[3:21] <joao> later
[3:23] * joao (~JL@89-181-148-121.net.novis.pt) Quit (Quit: Leaving)
[3:23] <dennisj> is it possible to set the replication size to 1 for a cluster with 2 osd's and get a sort of naive striping that way?
[3:24] * _are__ (~quassel@vs01.lug-s.org) Quit (Remote host closed the connection)
[3:24] * lofejndif (~lsqavnbok@83TAAFIAM.tor-irc.dnsbl.oftc.net) has joined #ceph
[3:24] * Guest496 (~adjohn@ Quit (Ping timeout: 480 seconds)
[3:24] * lofejndif (~lsqavnbok@83TAAFIAM.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[3:24] * _are_ (~quassel@vs01.lug-s.org) has joined #ceph
[3:26] <gregaf> elder: I wrote that down; why'd you change away from Big Shot Software Developer? :p
[3:26] <elder> I thought those were jokes and we were getting HR approved names.
[3:26] <nhm> dennisj: I've been told that's possible, but not really recommended...
[3:27] <elder> Is 42 just a HHGTTG reference?
[3:27] <nhm> gregaf: me too, I figured they were just going to delete them all. :)
[3:27] <gregaf> elder: yeah
[3:27] <nhm> gregaf: otherwise I would have put down something creative
[3:27] <gregaf> word I heard was we got a humorific
[3:27] <elder> Apparently it went all the way to proof.
[3:27] <gregaf> of course I think now I'm the only one who has one
[3:28] <elder> Well, word doesn't spread across the country very well apparently.
[3:28] <dennisj> you probably don't want to run it in a production env. like that but it could be fun for performance testing
[3:28] <gregaf> but it obviously had, because you had a great one!
[3:29] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) Quit (Ping timeout: 480 seconds)
[3:29] <gregaf> I actually use "Greg's 42" as my blog, although I haven't written anything in forever; it will live Some Day Soon ;)
[3:29] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) has joined #ceph
[3:29] <nhm> gregaf: I think someone sent out an email that we didn't get to make up our own titles, so I just assumed that was the law. :)
[3:29] <elder> And I changed mine, to be a law abiding citizen.
[3:29] <gregaf> yeah, there was more discussion around it that I guess didn't get around to everybody :/
[3:30] <gregaf> I'm sure it won't be a problem to get them changed for the next run
[3:30] <elder> Too bad, I maybe would have liked having the other title for this initial run.
[3:30] <nhm> gregaf: new rule: all decisions made on IRC.
[3:30] <gregaf> but for now, I have ALL THE ANSWERS and you guys are just software engineers :p
[3:31] <nhm> I vote that free booze will be provided to all staff during remote employee visits.
[3:31] <nhm> Yays?
[3:31] <nhm> yay
[3:31] <nhm> Nays?
[3:32] <nhm> The yays have it.
[3:33] * dennisj (~chatzilla@p5DE4A247.dip.t-dialin.net) Quit (Quit: ChatZilla [Firefox 12.0/20120424092743])
[3:34] <elder> What about the answer to life?
[3:34] <elder> Yay!
[3:43] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[3:58] * renzhi (~renzhi@ has joined #ceph
[4:51] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) has joined #ceph
[5:11] <elder> Ahaaaa! Now I see what all the fuss is about.
[5:11] <elder> Sage merged my branch and should not have. I probably should have cleaned it up.
[5:11] * CristianDM (~CristianD@host119.190-138-237.telecom.net.ar) Quit ()
[5:17] <elder> sage, sagewk, in case you happen to see it here first... The wip-kdump-copy branch was I believe committed so it could be looked at and so I could sort of checkpoint the work while Tv got to a better place for integrating the kdumps into a VM framework on plana nodes.
[5:18] <elder> I haven't looked at it since then and in my mind it wasn't ready to commit.
[5:18] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[5:18] <elder> Now it's been merged into ceph/master and it is apparently the cause of some build errors.
[5:19] <elder> So it needs to be reverted, or the head needs to be backed up and rebased without it.
[5:22] * izdubar (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[5:23] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[5:26] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) has joined #ceph
[5:28] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[5:30] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) Quit ()
[5:30] <Qten> can i make a file or folder have more replicas/faster reads or is it for the whole object store only?
[5:31] <gregaf> Qten: you have to change number of replicas on the level of a pool
[5:31] <gregaf> you can store Ceph filesystem data in more than one pool though
[5:32] <gregaf> so you can (before you write any actual data) change a folder to be in a different pool, and then it and all its children will be in the newly-specified pool
[5:33] <gregaf> sorry, I mean that you can't change the pool of a file which already has data, but you can set the pool of a file without any data written yet
[5:33] <gregaf> and if you set the pool of a folder then all its newly-created children will be in that newly-specified pool
[6:08] * cattelan is now known as cattelan_away
[6:17] <Qten> so you mean you can increase or decrease the replicas for the whole pool, just not at a file or folder level
[6:18] <gregaf> yes, and if you have a great deal of patience/automation or fairly coarse-grained replication requirements you can map those files & folders into different pools to let you make those sorts of changes
[6:19] <Qten> ahh
[6:19] <gregaf> (eg, put one folder hierarchy in the 3x replication pool, while the system defaults to a 2x replication pool)
[6:19] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[6:19] <gregaf> or if it's just occasional files which you regard as very important you can manually (or with a tool) touch them, say "place this file in this pool", and then write to the file
[6:20] * chutzpah (~chutz@ Quit (Quit: Leaving)
[6:25] <gregaf> gotta go
[6:30] <Qten> np thanks
[6:37] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) has joined #ceph
[6:59] <Qten> lo, i'm trying to work out why my write speeds are a bit slow. 0.45 RBD, I have 4 servers with 2 disks each, i'm using xfs with noatime,nodiratime,relatime,nobarrier,logbufs=8, on both disks, i have the journal on a separate disk (sda) to the data disk (sdb) on the 4 servers and iperf says i'm getting over 920mbit/s between boxes, i'm running dual GBE Bonded
[6:59] <Qten> and with dd if=/dev/zero of=ddfile1 bs=16k i'm getting 42.2mb/s
[6:59] <Qten> each disk in the cluster is getting over 80mb/s natively to the disks as well
[7:00] <Qten> 8k is 43mb/s
[7:00] <Qten> rep level x 2
[7:47] * Theuni (~Theuni@dslb-088-066-111-066.pools.arcor-ip.net) has joined #ceph
[9:01] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) Quit (Remote host closed the connection)
[9:15] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:56] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[9:56] * The_Bishop (~bishop@cable-86-56-102-91.cust.telecolumbus.net) Quit (Ping timeout: 480 seconds)
[10:06] * The_Bishop (~bishop@cable-86-56-102-91.cust.telecolumbus.net) has joined #ceph
[10:09] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[10:22] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:33] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) has joined #ceph
[10:33] * adjohn (~adjohn@70-36-139-109.dsl.dynamic.sonic.net) Quit ()
[10:41] * joao (~JL@ has joined #ceph
[11:28] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[11:29] * Theuni (~Theuni@dslb-088-066-111-066.pools.arcor-ip.net) has left #ceph
[11:30] * Theuni (~Theuni@dslb-088-066-111-066.pools.arcor-ip.net) has joined #ceph
[11:45] * yoshi (~yoshi@p3167-ipngn3601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:47] * renzhi (~renzhi@ Quit (Quit: Leaving)
[11:58] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:36] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[12:46] <filoo_absynth> anyone awake?
[12:52] <joao> hey filoo_absynth
[12:52] <joao> sup?
[13:04] <filoo_absynth> wondering if someone here does tech stuff for dreamhost (not inktank) still
[13:14] * oliver1 (~oliver@p4FFFE8CE.dip.t-dialin.net) has joined #ceph
[13:16] <nhm> Qten: Do you have a way to monitor the write performance to the underlying OSDs during your tests?
[13:32] * BManojlovic (~steki@ has joined #ceph
[13:35] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[13:56] * jluis (~JL@ has joined #ceph
[14:02] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[14:10] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[14:49] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[15:00] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[15:09] * aliguori (~anthony@c-76-116-233-168.hsd1.nj.comcast.net) has joined #ceph
[15:24] * cattelan_away is now known as cattelan
[15:30] * jluis is now known as joao
[15:31] * Wtrhead (~Wtrhd@bas1-montreal46-2925426220.dsl.bell.ca) has joined #ceph
[15:31] * Wtrhead (~Wtrhd@bas1-montreal46-2925426220.dsl.bell.ca) has left #ceph
[15:32] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[15:39] * alexxy (~alexxy@ has joined #ceph
[16:13] * enrylla (~enrylla@83TAAFI4N.tor-irc.dnsbl.oftc.net) has joined #ceph
[16:14] * enrylla (~enrylla@83TAAFI4N.tor-irc.dnsbl.oftc.net) has left #ceph
[16:39] * hijacker__ (~hijacker@ Quit (Quit: Leaving)
[16:43] * Theuni (~Theuni@dslb-088-066-111-066.pools.arcor-ip.net) Quit (Quit: Leaving.)
[17:20] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:28] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:33] * stackevil (~stackevil@ has joined #ceph
[17:34] * jlogan (~chatzilla@2600:c00:3010:1:2d42:b8ed:c5ac:1af9) has joined #ceph
[17:38] * jlogan (~chatzilla@2600:c00:3010:1:2d42:b8ed:c5ac:1af9) Quit (Remote host closed the connection)
[17:38] * lofejndif (~lsqavnbok@83TAAFI83.tor-irc.dnsbl.oftc.net) has joined #ceph
[17:42] * jlogan (~chatzilla@2600:c00:3010:1:2d42:b8ed:c5ac:1af9) has joined #ceph
[17:43] * dennisj (~chatzilla@p5DE4A247.dip.t-dialin.net) has joined #ceph
[17:43] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[17:45] <dennisj> hi, is there a way to find out how a rbd image maps to osd's?
[17:46] <dennisj> i.e. how can i find out where exactly the first 4MB block of the image is located in the osd cluster?
[17:47] * danieagle (~Daniel@ has joined #ceph
[18:02] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[18:11] * chutzpah (~chutz@ has joined #ceph
[18:14] <sjust> dennisj: one moment, let me find the naming scheme
[18:16] * Tv_ (~tv@aon.hq.newdream.net) has joined #ceph
[18:17] <sjust> let's see, the first block name should be <image_name>.000000000001
[18:20] <dennisj> hm, where would that file be located?
[18:20] <sjust> one sec
[18:20] <sjust> ok, run:
[18:21] <sjust> ceph osd getmap -o /tmp/map ; osdmaptool --test_map_object <image_name>.000000000001 /tmp/map
[18:21] <sjust> where <image_name> is the name of your rbd image
[18:25] <dennisj> thanks what i get is: object 'testimg.000000000001' -> 0.a6d7 -> [0,3]
[18:25] <dennisj> does the [0,3] refer to osd.0 and osd.3?
[18:26] <sjust> ok, 0.a6d7 is the pg (you can ignore that), [0,3] means that it should have osd.0 as its primary and osd.3 as it's replica
[18:26] <sjust> *its replica
[18:27] * jeff (~413144bd@webuser.thegrebs.com) has joined #ceph
[18:27] * User (~User@ has joined #ceph
[18:27] * jeff is now known as Guest565
[18:28] * User (~User@ Quit ()
[18:28] <dennisj> hm, apparently i can replace "000000000001" with any random string
[18:28] <sjust> that is true
[18:28] <sjust> you are using the osdmaptool, which works on rados objects
[18:29] <sjust> the rbd layer sits on top of rados and translates device blocks into rados objects
[18:29] <sjust> the string I gave you should be what rbd names the first block of rbd image <image_name>
[18:30] <yehudasa> sjust: shouldn't first block be .000000000000?
[18:30] <sjust> yehudasa: ah, yes, my bad
[18:30] <sjust> also, I forgot to specify the pool
[18:31] <dennisj> so that is supposed to be an integer?
[18:31] <dennisj> basically an offset?
[18:31] <sjust> yeah, it's the block offset
[18:31] <dennisj> because i can use "testimg.whatever" and osdmaptool tells me the location is "0.1382 -> [1,0]"
[18:32] <sjust> yehudasa: do you know how to specify a pool with osdmaptool?
[18:32] <yehudasa> sjust: by default it's rbd
[18:32] <sjust> with osdmaptool?
[18:32] <yehudasa> ah.. I don't know how to specify the pool there
[18:33] <sjust> yeah, I'm looking at it and I don't think we can
[18:33] * Guest565 (~413144bd@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC (Ping timeout))
[18:33] <yehudasa> I'd assume that it uses 'data' as the default
[18:33] <sjust> ugh, that should be fixed
[18:33] <sjust> appears to be just using pool 0
[18:33] <sjust> dennisj: in summary, that was actually wrong
[18:34] <sjust> yehudasa: do we have a tool to get the location of an rbd block?
[18:34] <yehudasa> sjust: no
[18:35] <sjust> is there debugging output somewhere that would do it?
[18:35] <yehudasa> sjust: for debugging purposes you can rados -p pool stat <objname> --debug-ms=1
[18:35] * stackevil (~stackevil@ has left #ceph
[18:38] <sjust> rados -p rbd stat rbd.000000000000 --debug-objecter=20 --debug-to-stderr 2>&1 | grep send_op
[18:38] <sjust> dennisj: try that ^ it'll actually do a stat on the object and the output should give you the location
[18:39] <sjust> sorry for the lack of a ready tool for this
[18:41] * bchrisman (~Adium@ has joined #ceph
[18:41] <dennisj> hm, how do I know that I actually provide the correct input? I can stat "XYXYXYXY" and the output looks just the same as with "rbd.000000000000"
[18:42] <yehudasa> nhm: are there any noflusher results for btrfs yet?
[18:42] <nhm> yehudasa: creating the movies right now
[18:42] <nhm> yehudasa: should have them up before the meeting
[18:42] <yehudasa> nhm: to my untrained eyes the btrfs results look better than xfs
[18:42] <yehudasa> nhm: what btrfs version is that?
[18:42] * oliver1 (~oliver@p4FFFE8CE.dip.t-dialin.net) has left #ceph
[18:43] <sjust> dennisj: try just rados -p rbd stat rbd.000000000000
[18:43] <sjust> that should perform an actual stat on the object, if it returns successfully, the first block is indeed there
[18:43] <yehudasa> sjust: I think you're missing something there
[18:44] <nhm> yehudasa: this is kernel 3.3.0-ceph without patches
[18:44] <sjust> yehudasa: ?
[18:44] <yehudasa> I think the block number should be separated into two numbers: high and low
[18:44] <sjust> hmm?
[18:44] <sjust> I'm looking at librbd.cc:get_block_oid
[18:45] <dennisj> hm, "error stat-ing rbd/rbd.000000000000: No such file or directory"
[18:45] <nhm> yehudasa: The btrfs results do look better, but still have some unexplained stalls, and overall throughput is still low relative to a dd on the underlying btrfs filesystem.
[18:45] <sjust> dennisj: ah
[18:45] <sjust> have you written anything to the disk?
[18:45] <dennisj> nope, only created it this far
[18:46] <sjust> ah, there aren't any blocks yet
[18:46] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[18:46] <sjust> I assumed the first block would be there since most filesystems write something to the beginning of the disk during mkfs
[18:46] <nhm> yehudasa: I should do an osd bench with more data on btrfs to see how the FS performs.
[18:47] <sjust> dennisj: also, the object name should have been testimg.000000000000 I think
[18:47] <yehudasa> sjust: do rados -p <pool> ls first
[18:47] <dennisj> i didn't do mkfs either. so far i have only created the image.
[18:48] <sjust> dennisj: 'rados -p rbd ls' will list the objects in the rbd pool
[18:48] <yehudasa> sjust: the name should be <imgname>.rbd
[18:49] <yehudasa> for the head object
[18:49] <dennisj> ok, i can see the testimg.rbd
[18:49] <sjust> dennisj: as yehudasa just said, that's the head object, the actual data blocks won't exist until they are written to
[18:50] <dennisj> so basically this is using a form of thin provisioning?
[18:50] <sjust> yep
[18:51] <dennisj> given that only those objects of the image exist that actually had data written to them how do i find out which objects actually exist at any given time?
[18:52] <sjust> the pool ls you did above would do it, but it will return all blocks/head objects in the pool
[18:52] <sjust> we don't have a better tool at this time
[18:52] <sjust> though I will make a bug
[18:53] <dennisj> ah, so once i write to the image the entry "testimg.000000000000" should appear?
[18:55] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:56] <sjust> yeah, bug #2383
[18:58] <dennisj> hm, so now I've written some data and the object "rb.0.0.000000000000" appeared
[18:58] <joshd> dennisj: the data objects are actually named by id, but 'rbd info <imagename>' will tell you the prefix for each object in the image
[18:58] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[18:58] <dennisj> indeed: block_name_prefix: rb.0.0
[18:59] <joshd> only the header object is named after the image name, so you can rename easily
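
The naming scheme discussed above can be sketched as follows: a minimal illustration, assuming data objects are named `<block_name_prefix>.<object number>` with the object number written as 12 hexadecimal digits (as in `rb.0.0.000000000000`) and assuming the 4MB object size dennisj mentions. The function name and its parameters are hypothetical and not part of any Ceph API.

```python
def rbd_data_object_name(block_name_prefix, byte_offset, object_size=4 * 1024 * 1024):
    """Return the assumed RADOS object name holding a given byte of an RBD image.

    block_name_prefix: value reported by 'rbd info <imagename>', e.g. "rb.0.0"
    byte_offset:       logical offset into the image, in bytes
    object_size:       assumed object size (4MB in this log)
    """
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    object_number = byte_offset // object_size
    # Assumption: the suffix is the object number as 12 zero-padded hex digits.
    return f"{block_name_prefix}.{object_number:012x}"


print(rbd_data_object_name("rb.0.0", 0))
print(rbd_data_object_name("rb.0.0", 40 * 1024 * 1024))
```

Under these assumptions, the first 4MB of the image lands in the object ending in `.000000000000`, which matches what yehudasa pointed out earlier.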
[18:59] <dmick> cool. oneiric update updated my Ceph packages.
[19:02] * aliguori (~anthony@c-76-116-233-168.hsd1.nj.comcast.net) Quit (Remote host closed the connection)
[19:02] <dennisj> ok, so now i can enumerate the objects for an image
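
Enumerating an image's objects, as described above, amounts to filtering the pool listing by the image's block name prefix. The sketch below assumes the listing (for example the output of 'rados -p rbd ls') has already been read into a list of names, and that data objects carry a 12-hex-digit suffix; `image_data_objects` is a hypothetical helper, not an existing tool.

```python
import re


def image_data_objects(pool_listing, block_name_prefix):
    """Return (object_number, object_name) pairs belonging to one image, sorted.

    pool_listing:      iterable of object names from the pool
    block_name_prefix: prefix from 'rbd info <imagename>', e.g. "rb.0.0"
    """
    # Match the whole name so that "rb.0.1" does not also match "rb.0.10.*".
    pattern = re.compile(re.escape(block_name_prefix) + r"\.([0-9a-f]{12})")
    found = []
    for name in pool_listing:
        match = pattern.fullmatch(name)
        if match:
            found.append((int(match.group(1), 16), name))
    return sorted(found)


listing = ["testimg.rbd", "rb.0.0.000000000000", "rb.0.0.00000000000a"]
for number, name in image_data_objects(listing, "rb.0.0"):
    print(number, name)
```

Because the image is thinly provisioned, gaps in the returned object numbers simply mean those blocks have never been written.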
[19:03] <dennisj> when i do "osdmaptool --test_map_object rb.0.0.000000000000 /tmp/map"
[19:03] <dennisj> i get "object 'rb.0.0.000000000000' -> 0.ee2 -> [3,0]"
[19:03] <yehudasa> nhm: can you make it that reads and writes show in different colors in the movies?
[19:03] <yehudasa> nhm: also it'd be interesting seeing btrfs on the patched kernel
[19:03] <dennisj> but when i do "rados -p rbd stat rb.0.0.000000000000 --debug-objecter=20 --debug-to-stderr 2>&1"
[19:03] <dennisj> i see amongst other things "recalc_op_target tid 1 pgid 2.da680ee2 acting [3,2]"
[19:04] <sjust> dennisj: osdmaptool can't be used here because it's defaulting to pool 0 rather than rbd
[19:04] <dennisj> that's what i thought, just wanted to be sure
[19:05] <joshd> sage added a way to do it from the ceph command recently - I think it's 'ceph osd map <pool> <object>'
[19:08] <dennisj> so is [3,2] referring to the actual osd's the object is located on then? (rather than the [3,0] from the osdmaptool)
[19:08] <nhm> yehudasa: yeah, alex asked about that the other day and I forgot. I'll take a look at that.
[19:09] <joshd> dennisj: yeah
[19:13] <dennisj> cool, so now i see a file "/var/lib/ceph/osd/ceph-3/current/2.62_head/rb.0.0.000000000000__head_DA680EE2" on osd.3
[19:13] <dmick> nhm: looks like it's hardcoded in seekwatcher 226,227
[19:13] <dennisj> how can i map the ""recalc_op_target tid 1 pgid 2.da680ee2 acting [3,2]" to this location?
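
One plausible reading of the relation between `pgid 2.da680ee2` and the directory `2.62_head` is that the 32-bit object hash (the `DA680EE2` in the file name) is folded down to the pool's number of placement groups. The sketch below assumes a "stable mod" folding and a pg_num of 128 for the rbd pool; the log states neither, so both are assumptions, and the function names are illustrative only.

```python
def stable_mod(x, b, bmask):
    """Fold x into the range [0, b) using a power-of-two mask (assumed scheme)."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pg_dir_name(pool_id, object_hash, pg_num):
    """Return the assumed on-disk PG directory name for an object hash."""
    # Smallest all-ones mask that covers pg_num - 1.
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    seed = stable_mod(object_hash, pg_num, bmask)
    return f"{pool_id}.{seed:x}_head"


# pg_num=128 is a guess that happens to reproduce the directory seen on osd.3.
print(pg_dir_name(2, 0xDA680EE2, 128))
```

With these assumptions the low seven bits of `0xda680ee2` are `0x62`, which gives `2.62_head`. Other pg_num values between 99 and 128 would give the same result, so this does not pin down the pool's actual setting.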
[19:14] <dmick> nhm: http://www.scipy.org/Cookbook/Matplotlib/Show_colormaps for alternatives
[19:14] <dmick> "Reds" would be good for writes IMO
[19:14] <elder> dmick, that's what I suggested too.
[19:15] <dmick> better yet, contribute a patch that takes that as an argument :)
[19:15] <joshd> dennisj: sjust can help you there, I'm not sure if there's a tool that exposes that mapping easily
[19:16] <joshd> dennisj: why are you interested in tracking exactly where the objects are stored?
[19:16] * lofejndif (~lsqavnbok@83TAAFI83.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[19:17] <dennisj> I'm trying to get a better overview of the cluster and understand the topology
[19:19] <dennisj> I'd like to write a script/tool that allows me to inspect the cluster in the same way i can use e.g. raid controller tools to describe the topology of virtual and physical disks and their relation
[19:23] <dennisj> if this mapping is only exposed via some c api functions that would be ok for me too
[19:28] * izdubar (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[19:34] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[19:36] <Tv_> dennisj: the big difference here is, objects will normally move around the cluster all the time
[19:37] <Tv_> (on a big enough cluster)
[19:42] <dennisj> ok, but wouldn't it still be useful to know at least for debugging purposes? In general it would help to know a) to which physical byte X (server, disk, file, offset) a given logical byte Y maps to and b) to which logical byte X (image, offset) a given physical byte Y maps to
[19:42] <Tv_> dennisj: that level of debugging requires plenty more understanding of the internals
[19:43] <Tv_> dennisj: e.g. your latest write might not be reflected in that file yet, because it's only in the journal
[19:44] <Tv_> dennisj: we're trying to slowly populate http://ceph.com/docs/master/dev/ with that sort of content, but it's a complex system
[19:46] <dennisj> i imagine it is and i only started playing with ceph yesterday so i don't expect to get a thorough understanding of it right away
[19:47] <dennisj> i'm looking at it from an operations point of view right now
[19:47] <Tv_> dennisj: typical ops view wouldn't care about placement of individual objects
[19:48] <Tv_> care about the aggregate
[19:48] <Tv_> evenness of load
[19:48] <Tv_> etc
[19:49] <dmick> it'd be nice in the same way things like seekwatcher are nice, but yeah, couldn't be authoritative, and would only really give you an overview/trend sort of thing. But even that might be valuable.
[19:49] <dennisj> on a high level if everything works fine yes but once things don't work as they should it would be good to have some introspective features
[19:49] <Tv_> dmick: metrics and visualization yes
[19:50] <dennisj> granted maybe knowing the individual files is only necessary in rare cases
[19:50] <dmick> could let you spot pathologies more quickly
[19:51] <dennisj> but at least i'd like to know a) for a given image X which osd's contain the blocks for that image and b) if I tinker with osd X which images might potentially be affected?
[19:53] <Tv_> dennisj: b) none
[19:53] <Tv_> dennisj: or we screwed up
[19:55] <dennisj> and in order to determine that you need the ability for introspection
[19:55] <Tv_> it's not that simple
[19:55] <dennisj> nobody said it is
[19:55] <Tv_> most of the screwups are about peering, messaging etc, not what files are on disk
[19:56] <dennisj> you cannot predict what kind of screwups are going to happen in the future
[19:56] <Tv_> i can predict that thinking the underlying disk files are enough to debug or recover is false..
[19:57] <dennisj> again, nobody said that "the underlying disk files are enough to debug or recover"
[20:05] * stxShadow (~jens@ip-78-94-238-69.unitymediagroup.de) has joined #ceph
[20:05] <yehudasa> nhm: are these movies for the aggregated system io? or for a specific process/device?
[20:07] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Ping timeout: 480 seconds)
[20:09] <nhm> yehudasa: the movies are constructed from blktrace dumps of /dev/sdb.
[20:09] <nhm> which is the device used for the OSD data disk on each node.
[20:18] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[20:19] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[20:24] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[20:29] <dmick> hm. github review commentary still coming to @dreamhost.com
[20:30] <dmick> right. profile.
[20:33] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[20:34] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[20:34] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit ()
[20:34] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[20:47] <yehudasa> nhm: is that just a single osd process?
[20:53] <nhm> yehudasa: one osd per node, 2 filestore op threads (afaik that's the default)
[20:59] <nhm> remade movies and noflusher btrfs results should be up in a while. Had to remake all of the movies/archives and my upstream throughput is slow. :/
[21:00] <dmick> yay cable internet
[21:01] <nhm> dmick: actually municipal wireless
[21:01] <dmick> oh, god, you *have* upstream? :)
[21:01] <nhm> dmick: though theoretically I'll have 5Mb up if the clowns at century link ever get my new connection working right.
[21:01] <dmick> what was the URL again?
[21:01] <nhm> dmick: for movies?
[21:01] <dmick> ya
[21:02] <nhm> nhm.ceph.com/movies
[21:02] <nhm> results are in nhm.ceph.com/results
[21:02] <nhm> all the movies will be blue/red instead of blue/green soonish.
[21:03] <dmick> such fun.
[21:04] <nhm> Indeed, this is the kind of thing I love doing. :)
[21:04] <sagewk> sjust: actually it can get an error
[21:05] <sagewk> - int r = insert_item(cct, item, weight, name.c_str(), loc);
[21:05] <sagewk> - if (r == 0)
[21:05] <sagewk> - ret = 1;
[21:05] <sagewk> + ret = insert_item(cct, item, weight, name.c_str(), loc);
[21:05] <sagewk> + if (ret == 0)
[21:05] <sagewk> + ret = 1; // changed
[21:14] * imjustmatthew (~imjustmat@pool-96-228-59-72.rcmdva.fios.verizon.net) has joined #ceph
[21:16] <imjustmatthew> Is the ceph-fuse client more or less mature than the kernel client for cephfs?
[21:17] <nhm> ok, recolored movies are up. Red=Writes, Blue=Reads
[21:17] <nhm> new results are being uploaded and will be placed shortly.
[21:19] <elder> Ooh Red!
[21:19] <elder> It's actually a little brick red.
[21:19] <elder> And it fades to orange and then yellow.
[21:19] <elder> Can we go with jewel tones instead?
[21:19] <elder> (:
[21:20] <nhm> elder: you can have whatever matplotlib will give you. ;)
[21:20] <nhm> elder: all of the blktrace results are there for your amusement.
[21:23] <elder> Big dead zone again here at 50 seconds or so: http://nhm.ceph.com/movies/wip-throttle/64m-noflusher-2threads-xfs-ag4-osd1.mpg
[21:24] <elder> I believe that means XFS is not being asked to do anything (or it would be initiating I/O)
[21:25] <nhm> elder: Yeah, it seems like those kinds of deadzones are happening on occasion.
[21:26] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[21:26] <nhm> elder: on 64m-noflusher-2threads-xfs-ag4-osd0.mpg there is a large one at the start.
[21:28] <elder> Well that's interesting. I wonder if that means there's an interaction between the two osd's somehow.
[21:28] <elder> nhm, :)
[21:30] <nhm> Well, there's definitely interaction as they'll be replicating data to each other.
[21:30] <nhm> Theoretically it should only wait on the journal write, not the OSD write though.
[21:31] <elder> Replicating data to each other should not preclude other activity.
[21:31] <elder> There's no reason a client's incoming data can't be writing while previously-sent data gets copied to an OSD neighbor.
[21:33] <yehudasa> nhm: I want to run some test on btrfs
[21:34] <yehudasa> nhm: want to check theory that the pg logs slow us down due to seeks
[21:35] <nhm> yehudasa: ok, what would you like the ceph.conf to look like?
[21:35] <nhm> yehudasa: I can set it up if you'd like, or you can just play on the nodes yourself if you'd like...
[21:35] <yehudasa> nhm: just tell me how to run the test and on which node
[21:36] <nhm> yehuda: client node is plana05, currently ceph is stopped on all of the nodes.
[21:37] <nhm> I'm just running rados -p data bench 120 write -t 16 -b <bytes>
[21:37] <yehudasa> nhm: the idea is to mount the osd meta directory on a different partition, that's hacky but will do the trick
[21:38] <yehudasa> nhm: on what setup? I can't seem to find the setup configuration you sent me
[21:38] <nhm> yehudasa: ?
[21:38] <yehudasa> nhm: you sent me all the clusters allocations
[21:38] <yehudasa> which machines for which cluster
[21:39] <nhm> yehudasa: the google doc?
[21:39] <yehudasa> yeah
[21:39] <yehudasa> oh, here it is
[21:40] * nhm watches stuff slowly upload
[21:41] <nhm> 115KB/s, awesome
[21:42] <yehudasa> nhm: are you sure you're tracing /dev/sdb? there's nothing there ..
[21:43] <yehudasa> I mean, that's probably the journal
[21:43] <nhm> yehudasa: what node are you looking at?
[21:43] <yehudasa> burnupi02/03
[21:44] <nhm> yehudasa: it should be burnupi28/29
[21:44] <nhm> yehudasa: let me check the google doc
[21:46] <nhm> yehudasa: it's under the "SuperNode Test" heading.
[21:46] <yehudasa> nhm: ah, cool, thanks
[21:47] <yehudasa> nhm: is the cluster down on purpose?
[21:48] <nhm> yehudasa: I shut it down after the tests to stop logging stuff.
[21:49] <nhm> yehudasa: I can turn it back on if you'd like.
[21:49] <yehudasa> nhm: no worries, doing it
[22:00] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[22:00] <joao> does dout() serialize output? if so, is it blocking?
[22:01] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[22:06] <Tv_> joao: dout() puts things into a queue that is flushed to disk by a dedicated thread
[22:06] <joao> do you know if it blocks on that queue?
[22:08] <Tv_> there's a mutex or something, probably
[22:08] <Tv_> sage sent an email about it a while ago
[22:08] * UnixDev (~user@c-98-242-186-177.hsd1.fl.comcast.net) has joined #ceph
[22:08] <UnixDev> hi all
[22:08] <joao> okay, thanks
[22:08] <Tv_> joao: submitters don't wait for disk writes
[22:09] <UnixDev> will ceph work well for a share that has many small files and is interconnected via WAN (i.e., higher latency than local machines)?
[22:10] <joao> Tv_, I was thinking more about the possibility of submitters waiting for each other when queuing
[22:10] <joao> given we have the whole code filled with dout's, I wonder the implications that may have on the overall performance
[22:12] <Tv_> joao: we went through all that a while ago; we were ready to do something better but Sage ended up convinced that the lock isn't too busy
[22:13] <joao> okay then
[22:13] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[22:13] <joao> just a little thing I came up to think about during dinner :)
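
The pattern Tv_ describes (submitters enqueue log lines, a dedicated thread flushes them) can be sketched generically as below. This is not Ceph's dout() implementation; the class and method names are hypothetical, and the sketch only illustrates why submitters contend briefly on the queue's lock rather than waiting for disk writes.

```python
import queue
import threading


class QueuedLogger:
    """Log lines go into a queue; one background thread writes them out."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # callable that performs the slow write
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def submit(self, line):
        # Only the queue's internal lock is held here, and only briefly;
        # the caller never waits for the sink.
        self._queue.put(line)

    def _flush_loop(self):
        while True:
            line = self._queue.get()
            if line is None:  # sentinel: stop flushing
                break
            self._sink(line)

    def close(self):
        self._queue.put(None)
        self._thread.join()


lines = []
log = QueuedLogger(lines.append)
for i in range(3):
    log.submit(f"message {i}")
log.close()
print(lines)
```

joao's concern maps onto `submit`: many threads calling it at once serialize on the queue lock, which matters only if that lock is held long enough or often enough to become busy.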
[22:14] * BManojlovic (~steki@ has joined #ceph
[22:24] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[22:26] <nhm> yehudasa: how are your tests going?
[22:40] <yehudasa> nhm: I'll have to test it a bit differently
[22:43] <nhm> how are you going about mounting the meta directory on another partition?
[22:46] * dennisj (~chatzilla@p5DE4A247.dip.t-dialin.net) Quit (Quit: ChatZilla [Firefox 12.0/20120424092743])
[23:04] * Oliver1 (~oliver1@ip-176-198-97-69.unitymediagroup.de) has joined #ceph
[23:08] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[23:11] * stxShadow (~jens@ip-78-94-238-69.unitymediagroup.de) Quit (Quit: bye bye !! )
[23:16] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[23:24] <jeffp> with rbd is there any sort of mechanism to prevent two machines from attaching the same block device at the same time?
[23:28] <jeffp> one way i can think of for doing that would be setting an owner in the xattrs on each object in the block device and checking that you are still the owner on every write, and then on failover changing the owner on all the objects
[23:28] * Oliver1 (~oliver1@ip-176-198-97-69.unitymediagroup.de) Quit (Quit: Leaving.)
[23:28] <Tv_> jeffp: that check would hurt i think
[23:28] <Tv_> but the watch-notify mechanism might be able to do that cheaper
[23:28] <jeffp> yeah it probably would
[23:30] <jeffp> i don't think you can do it with watch-notify without getting into a race condition
[23:31] <jeffp> alternatively you could do all modifications to a snapshot and on failover merge the snapshot back into the main copy then start a new snapshot
[23:38] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[23:53] <jeffp> actually yeah watch-notify could do that as long as the notify happens and is checked as a condition of the write, i was thinking the notify is asynchronous which wouldn't work
[23:58] * s[X]_ (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.