#ceph IRC Log


IRC Log for 2013-03-05

Timestamps are in GMT/BST.

[0:00] <nhm> noob21: it wasn't fun, but some bugs got fixed when I went through it so it may not be quite as terrible now, just sort of terrible. ;)
[0:00] <noob21> lol
[0:00] <noob21> which gcc did you use?
[0:01] <nhm> hrm, it appears I built 4.7.2
[0:03] <noob21> alright i'm a little older than that. 4.6.2 it looks like depending on the tool
[0:04] <wer> man this sucks. It seems that during my load testing I lost an osd... but ever since then the cluster has been super flaky. I got the blocking problem above... and many slow writes. Now I am getting multiple accounts of wrong node in the osd logs. I am starting to wonder if I did something incorrectly, because the cluster just seems confused.
[0:04] <nhm> that might be fine. I was trying to use 4.1 before and it broke in very strange ways.
[0:06] <wer> when that osd died we took it out... but maybe not out enough. Outer morer maybe? The kernel being blocked by an osd is "scary" to me... and now the general unhappiness makes santa cry again.
[0:07] <wer> Is anyone up for digging?
[0:07] <noob21> nhm: i may give compiling it a shot once i get setup
[0:08] <nhm> wer: sorry, I'd like to help but I've gotta run to dinner. Good luck. :/
[0:08] <wer> nhm: yes. ty. I will luck it :)
[0:12] <dmick> wer: those messages appear to correlate with the OSD making little or no progress?
[0:13] <wer> dmick: well.... not sure.
[0:13] <wer> We had a unclean health.... and waited until everything was clean before starting load testing again.
[0:14] <wer> We left the dead osd/drive out... but now these wrong node messages have shown up..... and the throughput on writes is down to about 2/3 of what it was this weekend. ~600MB/s now.
[0:15] <wer> damn. Now it is degraded.... hrm.
[0:16] <wer> got a few stuck unclean pgs that are all on osd 25 and 139... hrm.
[0:18] <wer> dmick: I don't really know how to measure the impact of those messages. is num_ops a good correlation?
[0:22] * diegows (~diegows@ has joined #ceph
[0:23] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Quit: Leaving.)
[0:24] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[0:29] <wer> got 4 inactive pgs now... and one of our osd's says got old message 12 <= 2995 and appears to have stopped logging. I am going to shave my eyebrows off.
[0:30] <wer> osd 25 is in charge of all the inactive pgs.... so I guess I should kick it?
[0:31] <dmick> that log message *seems* to indicate that an OSD died, but we (the logger) don't know about it officially yet, so we're trying to connect to what we thought was the right connection
[0:31] <dmick> that doesn't help you much
[0:32] <wer> heh. Yeah that was my assessment too... it doesn't know it is dead yet :)
[0:33] <dmick> cue Python
[0:34] <wer> well hrm. I will kick it then. Is there a way to prevent further degradation, such as should I noout it or anything? Seems like the last 0.2% always takes the longest, and right now it is just the 4 pgs that are inactive. Nothing else is screwed up (other than the aforementioned slowness / wrong node stuff....)
[0:34] <wer> do what dmick?
[0:34] * gucki (~smuxi@HSI-KBW-095-208-162-072.hsi5.kabel-badenwuerttemberg.de) Quit (Ping timeout: 480 seconds)
[0:35] <dmick> (talking about Bring Out Your Dead from Holy Grail. Don't mind me.)
[0:35] <wer> oh! lol
[0:35] <wer> yeah that is about how I feel.
[0:36] <dmick> I'm not sure what would be best. I guess it'd be good to know if the OSD in that message appears to be all right or if it's dead, and what's in its log
[0:36] <wer> do I find the node and count ports to get the osd number?
[0:37] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:39] <wer> dmick: that osd is reporting wrong node in the log..... osd.19.... other than that he is running. He reports several other wrong nodes.
[0:39] <wer> I would grab them all...seems like a bunch.
[0:39] <dmick> so, someone is trying to connect to that address and is complaining that the addr is wrong
[0:39] <wer> 2013-03-04 23:39:46.655567 7f91b5d4f700 0 -- >> pipe(0xb090000 sd=30 :52064 s=1 pgs=1172 cs=4 l=0).connect claims to be not - wrong node!
[0:40] <dmick> but you're saying the daemon *on* that address is also complaining about wrong addrs? (i.e. there are two complaining?)
[0:40] * rinkusk (~Thunderbi@ has joined #ceph
[0:40] <wer> yup. There are multiple osd's complaining on multiple nodes. Let me check the line above from 19's complaint.
[0:44] <wer> 79 is the one that 19 says is wrong... he is just chillin in standby I think.
[0:44] <dmick> so yeah, someone's trying to connect to what they believe is the right 'entity' (basically ip/port/nonce) from the osdmap
[0:44] <dmick> but the daemon that answers ain't him
[0:44] <wer> well I didn't do that :)
[0:44] <dmick> so probably a daemon died and restarted
[0:44] <dmick> but for some reason the map updates aren't getting around to where they should be
[0:44] <dmick> if this is reproducible we'd love to see logs from its inception, but
[0:44] <dmick> meanwhile the workaround is to restart the guys who have the wrong addresses; they'll get fresh maps on startup
[0:45] <dmick> for minimal invasion, I'd probably do them one at a time and let things settle down before continuing
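dmick's staged-restart workaround, sketched as a script. The osd ids and the upstart restart command are assumptions for illustration (adjust to your init system), and the health grep is a crude settle check:

```shell
#!/bin/sh
# Bounce each complaining OSD one at a time; each restarted daemon pulls a
# fresh osdmap on boot, forgetting the stale peer addresses.
for id in 19 25; do                      # ids of the complainers (example)
    initctl restart ceph-osd id=$id      # upstart-managed OSD on Ubuntu
    sleep 5
    # crude settle check: wait until health stops mentioning peering/down
    while ceph health | grep -qE 'peering|down'; do
        sleep 10
    done
done
```

Restarting the complainers (not the complained-about OSDs) is the point: it is their cached maps that are stale.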
[0:45] <wer> ahh. can I check the map? 25 was complaining about getting old requests.... that is the guy that also has the inactive pgs.
[0:45] <dmick> you can certainly look at the map
[0:45] <wer> I have all the logs btw if you want them :)
[0:46] <dmick> ceph osd dump will show what the mon believes the current OSD entity addresses are
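A sketch of that check; the exact dump format varies by version, so the grep below is an assumption:

```shell
# The monitors' current entity address for a given OSD; compare by eye
# with the address in that OSD's "wrong node" complaints.
ceph osd dump | grep '^osd\.19 '
```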
[0:48] <wer> omg. So do I have to take that output and see if the ones complaining are in disagreement with osd dump?
[0:48] * markbby (~Adium@ Quit (Quit: Leaving.)
[0:49] <dmick> well, no, not unless you're interested
[0:50] <wer> lol. So just restart the osd's that are complaining? Or the ones that are being complained about? I am unsure. I know 25 needs a restart cause it has inactive pgs :) The rest is magic.
[0:50] <dmick> the ones that are complaining
[0:50] <wer> ok. and you said slowlyish.... one at a time.
[0:50] <dmick> the theory is that the complainers are complaining because the other ends died and restarted, and they weren't told
[0:50] <dmick> so if they come up fresh, they'll have no knowledge of the old corpses
[0:51] <dmick> and yeah, I'd stage it just in case
[0:51] <wer> ok. Well why does this have to happen?
[0:51] <wer> :)
[0:51] <wer> I love ceph until it acts all weird.
[0:52] <dmick> "son, let me tell ya a little story about software development..."
[0:52] <wer> lol
[0:52] <wer> About wizards and unicorns?
[0:56] <dmick> Bugs happen; we've squashed several in this area but maybe one remains
[0:58] <Cotolez> dmick: hi
[0:59] <joelio> any idea why one OSD would have be maxxed out at 100% usage when the other OSDs are nowhere near that util?
[0:59] <Cotolez> i've tried to reboot the storage host, increase the osd scrub thread timeout to 3 minutes, make a new journal on the problematic osd's
[0:59] <Cotolez> ....nothing
[1:01] <wer> well dmick as always, thanks for the walk through the strangeness. Me going at it alone just does nothing. Do you want any of these logs? I can put them somewhere. I have the kernel block thing too that happened and I can't relate it to memory exhaustion or anything I have been able to find :(
[1:01] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Read error: Connection reset by peer)
[1:03] <dmick> Cotolez: if the theory is that the host is just overloaded, I don't think any of those would really help
[1:03] <dmick> joelio: usage as in "disk space"?
[1:04] <dmick> wer: sure, hopefully bouncing the daemons will help, let us know if it doesn't
[1:04] <wer> want logs?
[1:04] <dmick> logs: maybe, if you think you have them from the time when the connections started going wrong
[1:05] <wer> yeah... I have them from the kernel thing all the way through now. The last 12ish hours have been the strange times.
[1:05] <Cotolez> dmick: I don't think that is "just overloaded". I think that something in the deep scrub procedure is wrong: In this moment no clients are connected to the cluster. The problem start when the deep scrub starts
[1:05] <wer> I didn't catch anything in the logs when the kernel blocking happened... but I was still poking for that timeframe.
[1:06] <dmick> Cotolez: yes, but the deep scrub adds lots of work for the OSDs to get at their filesystems, so if there are bottlenecks there, you'd expect the problem to be triggered by scrub, right?
[1:06] <Cotolez> dmick: Ceph waits with Health OK status until load drop under 0.50 - then starts to deep scrub and boom
[1:06] <dmick> yes
[1:07] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:07] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:08] <Cotolez> dmick: what can I try to lower the priority? (or something similar?)
[1:09] <dmick> does it get better if you take one or two OSDs down?
[1:11] <Cotolez> dmick: i'm tryin
[1:13] * Lennie`away (~leen@524A9CD5.cm-4-3c.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[1:14] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[1:15] <Cotolez> dmick: i give "ceph osd down 0" but upstart is bringing the ceph-osd process up
[1:15] <MrNPP> what would cause an osd_op_reply of Operation not permitted, i tried wiping ceph and starting over and i still have nothing
[1:16] <MrNPP> qemu-img, and rbd seems to work fine
[1:17] <dmick> Cotolez: it's actually ceph itself in that case; "down" doesn't mean the process is stopped, and so since it continues to be reachable, ceph orders it to come back up. You can set a flag to stop that, but that's probably not what you want; I'm not certain whether the OSD in "down" state will still consume resources in a bad way
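The flags in question, for reference; a hedged sketch (noup suppresses the automatic mark-up dmick describes, noout suppresses the down-to-out transition and the rebalance it triggers):

```shell
ceph osd set noup       # mons won't mark booting OSDs back up
ceph osd set noout      # down OSDs won't be marked out (no re-replication)

# ...and undo them when done, or the cluster stays frozen in that state:
ceph osd unset noup
ceph osd unset noout
```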
[1:19] <Cotolez> dmick: sorry but, to take one osd down, how can I do it?
[1:21] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[1:22] <dmick> MrNPP: lots of things; bad/missing keys, bad settings for auth
[1:22] <dmick> did you say before it was failing to access from qemu? do you have the right auth settings in the qemu invocation line?
[1:22] <dmick> Cotolez: you could kill one or more osd procs. I'm not sure that would help tho
[1:22] <MrNPP> dmick: here is full debug
[1:22] <MrNPP> http://paste.scurvynet.com/?fd480626aed1cc80#ZKORLy+rcfPpHPdHvfi0gehuR4oKdjK7dJhvoJFJLgM=
[1:22] <dmick> it seems like something's wrong with your disk access
[1:23] <dmick> and it shows up with the heavier load brought on by deep-scrub
[1:23] * jlogan2 (~Thunderbi@ has joined #ceph
[1:23] * esammy (~esamuels@host-2-99-4-178.as13285.net) has joined #ceph
[1:23] <dmick> I assume there are no dmesg issues with the drive(s)? How many drives/filesystems are involved?
[1:24] <dmick> MrNPP: oh:
[1:24] <dmick> 2013-03-04 14:09:24.034624 7fafdecd7900 -1 librbd::ImageCtx: error finding header: (1) Operation not permitted
[1:24] <dmick> that's specific and may be different
[1:25] <dmick> what does ceph auth list show for that key?
[1:25] <dmick> i.e. what caps?
[1:25] <Cotolez> dmick: nothing... ceph restarts the process even if I use kill
[1:25] <MrNPP> caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool
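For reference, the caps inspection dmick asked for; `client.libvirt` is a guessed entity name from the libvirt docs page, substitute whatever key qemu is actually configured to use:

```shell
ceph auth list                     # every entity and its caps
ceph auth get client.libvirt       # just the one key qemu uses (name assumed)
```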
[1:26] * Cube1 (~Cube@ has joined #ceph
[1:26] <gregaf> Cotolez: you'll need to stop them through upstart
[1:27] <gregaf> "initctl emit ceph-osd stop id=x", maybe? where x is the id of the OSD to kill
[1:27] <gregaf> something like that anyway, you can look in /init/ceph/ for the upstart jobs
[1:28] * jlogan1 (~Thunderbi@2600:c00:3010:1:217f:2c08:a1d4:e762) Quit (Ping timeout: 480 seconds)
[1:28] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[1:28] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[1:29] <MrNPP> dmick, the error you are talking about occurred after the failure was returned from the osd, so i'm guessing that it occurred because of the first set of osd errors
[1:29] <dmick> gregaf: initctl ceph-osd stop id=x ought to be enough?...(and you probably mean /etc/init/ceph*, right?)
[1:29] <gregaf> ah yeah, sorry
[1:29] <dmick> MrNPP: I don't see the other errors?...
[1:29] <gregaf> and I really don't remember the upstart job syntax
[1:30] <Cotolez> tryin'...
[1:30] <MrNPP> 2013-03-04 14:09:24.032837 7fafd3fff700 1 -- <== osd.6 1 ==== osd_op_reply(1 gentoo-vm.rbd [stat] = -1 (Operation not permitted)) v4 ==== 112+0+0 (112968780 0 0) 0x7fafc00009a0 con 0x7fafe0962ff0
[1:30] <dmick> same error, different log, but ok
[1:30] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[1:31] <MrNPP> same log
[1:31] <dmick> different logging entity
[1:31] <joelio> dmick: it's ok, think I made a rookie mistake when creating the pool, left PG's default value
[1:32] <dmick> that shouldn't drastically affect osd "usage", and I'm still only assuming you mean disk space
[1:32] <joelio> no, usage via iostat
[1:32] * Cube1 (~Cube@ Quit (Read error: Operation timed out)
[1:33] <dmick> oh. well we'd need to look at your workload
[1:33] <dmick> MrNPP: does rados -p rbd ls show gentoo-vm.rbd?
[1:34] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[1:34] <MrNPP> no, it does not
[1:34] <MrNPP> blank
[1:35] <Cotolez> this is the correct syntax: initctl stop ceph-osd id=X
[1:36] <Cotolez> phew
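Collected in one place, the upstart incantations that worked for Cotolez (instance jobs live under /etc/init/ on Ubuntu; the id is the OSD number):

```shell
initctl stop ceph-osd id=0      # stop just osd.0 without upstart respawning it
initctl start ceph-osd id=0     # bring it back
initctl list | grep ceph        # see what ceph jobs upstart is tracking
```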
[1:36] * buck (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[1:37] <dmick> so, the problem would appear to be that the rbd image isn't present. I'll grant you that's not the best error message
[1:38] <dmick> wait, sorry, that's the v1 look
[1:38] <dmick> how about rbd_id.gentoo-vm?
[1:39] <MrNPP> not sure what you mean about that value
[1:39] <dmick> does that object exist in the ls output
[1:39] <sagewk> joshd, dmick: can you check wip-prepare for python idiocy?
[1:39] <MrNPP> nothing outputs when i run the rados command
[1:40] <MrNPP> running rbd -p libvirt-pool ls does show the image
[1:40] <dmick> sigh, sorry, it's that poolname
[1:40] <dmick> how about rados -p libvirt-pool ls, for either gentoo-vm name?
[1:40] <MrNPP> oh yeah
[1:40] <MrNPP> haha
[1:40] <MrNPP> sorry
[1:40] <MrNPP> hyp03 qemu # rados -p libvirt-pool ls
[1:40] <MrNPP> gentoo-vm.rbd
[1:40] <dmick> ok. so it really is EPERM trying to read the objects
[1:43] <Cotolez> dmick: gregaf: for me it's time to sleep. Tomorrow I'll go further
[1:43] <Cotolez> thanks for help
[1:44] <dmick> ok Cotolez, gl
[1:44] <MrNPP> i was able to map it just fine, i'm assuming libvirt does that for me
[1:44] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:44] <dmick> mapping is for the kernel; libvirt makes a userland connection with librbd
[1:45] <dmick> researching caps, that's a form i haven't seen yet
[1:45] * Cotolez (~aroldi@ Quit (Quit: Sto andando via)
[1:45] <MrNPP> i pulled it right from the ceph rbd libvirt page
[1:47] <dmick> yeah, more recent than last I looked. I don't usually limit very far in my test clusters, but it looks plausible
[1:47] * rinkusk (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[1:47] <dmick> OH.
[1:48] <dmick> http://tracker.ceph.com/issues/4287 looks relevant
[1:48] <dmick> maybe quickly try making another pool lacking the - in its name?
[1:48] <dmick> and see if that fixes it
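A quick sketch of that experiment; pool and image names are placeholders, and note the client's caps (which name the pool) would need to cover the new pool too:

```shell
ceph osd pool create libvirtpool 128         # no '-' in the name; 128 pgs is arbitrary
rbd cp libvirt-pool/gentoo-vm libvirtpool/gentoo-vm   # copy the image over
rbd -p libvirtpool info gentoo-vm            # then retry the qemu access
```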
[1:48] <dmick> (which version are you running, btw?)
[1:50] <MrNPP> 56.3
[1:50] <sagewk> this has hit a few other people with the same problem.. the fix will be in 56.4
[1:51] <sagewk> in the meantime, it is in the bobtail branch, and there are autobuilt packages you can use for that
[1:51] <sagewk> i think.. let me double check the patch
[1:52] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:52] <MrNPP> ok, let me test it without the -
[1:53] <sagewk> mrnpp: yeah, it's in bobtail. so if it works with no '-' you just need to update to the latest bobtail build
[1:56] <MrNPP> yup
[1:56] <MrNPP> that did it
[1:57] <MrNPP> thank you everyone
[1:57] <MrNPP> i'll just patch my build with it for now
[1:57] <MrNPP> thanks again
[1:59] <dmick> sorry it took me so long to remember but glad we got it!
[2:03] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[2:04] * xiaoxi (~xiaoxiche@ has joined #ceph
[2:14] <wer> dmick: I updated 4006 since it seemed closest to my wrong node issue.... I have a copy of the logs and don't know what to do with them :) And also wanted to address the osd blocking issue... which is in the same logs.... hrm.
[2:18] * Lennie`away (~leen@lennie-1-pt.tunnel.tserv11.ams1.ipv6.he.net) has joined #ceph
[2:22] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[2:25] * esammy (~esamuels@host-2-99-4-178.as13285.net) Quit (Ping timeout: 480 seconds)
[2:28] * alram (~alram@ Quit (Ping timeout: 480 seconds)
[2:34] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:46] * esammy (~esamuels@ has joined #ceph
[2:49] * xiaoxi (~xiaoxiche@ Quit (Ping timeout: 480 seconds)
[2:49] * noob21 (~cjh@ has left #ceph
[2:51] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[2:52] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[2:57] * xiaoxi (~xiaoxiche@ has joined #ceph
[3:00] * sage (~sage@ Quit (Quit: Leaving.)
[3:01] * sage (~sage@ has joined #ceph
[3:10] * esammy (~esamuels@ Quit (Ping timeout: 480 seconds)
[3:14] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) has joined #ceph
[3:16] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[3:17] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[3:24] * rturk is now known as rturk-away
[3:36] * rinkusk (~Thunderbi@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) has joined #ceph
[3:41] * dpippenger (~riven@ Quit (Ping timeout: 480 seconds)
[3:42] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[3:45] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[3:45] * jlogan2 (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[3:52] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[3:56] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:57] * rinkusk1 (~Thunderbi@cmr-208-97-77-198.cr.net.cable.rogers.com) has joined #ceph
[3:59] * rinkusk (~Thunderbi@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[4:00] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:00] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[4:03] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) has joined #ceph
[4:03] * tryggvil (~tryggvil@2a02:8108:80c0:1d5:a195:a4a5:4c86:e4e7) Quit (Quit: tryggvil)
[4:11] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) Quit (Ping timeout: 480 seconds)
[4:15] * alram (~alram@cpe-75-83-127-87.socal.res.rr.com) Quit (Quit: leaving)
[4:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:23] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit ()
[4:32] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:42] * rinkusk1 (~Thunderbi@cmr-208-97-77-198.cr.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[4:58] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[5:08] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) Quit (Ping timeout: 480 seconds)
[5:08] * stefunel (~stefunel@static. Quit (Ping timeout: 480 seconds)
[5:09] * Tribaal (uid3081@hillingdon.irccloud.com) Quit (Ping timeout: 480 seconds)
[5:13] * The_Bishop_ (~bishop@f052102141.adsl.alicedsl.de) has joined #ceph
[5:13] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:20] * The_Bishop (~bishop@e177089229.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[5:20] * joelio (~Joel@ Quit (Ping timeout: 480 seconds)
[5:28] * stefunel (~stefunel@static. has joined #ceph
[5:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:47] * stefunel (~stefunel@static. Quit (Ping timeout: 480 seconds)
[5:48] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:51] * stefunel (~stefunel@static. has joined #ceph
[5:53] * joelio (~Joel@ has joined #ceph
[6:01] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[6:03] * ScOut3R (~scout3r@1F2EAE22.dsl.pool.telekom.hu) has joined #ceph
[6:42] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:52] * Tribaal (uid3081@hillingdon.irccloud.com) has joined #ceph
[6:57] * ScOut3R (~scout3r@1F2EAE22.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[7:01] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) has joined #ceph
[7:02] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[7:13] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[8:09] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[8:12] <MrNPP> anyone awake?
[8:17] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) has joined #ceph
[8:21] * esammy (~esamuels@host-2-103-102-78.as13285.net) has joined #ceph
[8:25] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:26] * tryggvil (~tryggvil@95-91-243-120-dynip.superkabel.de) has joined #ceph
[8:27] * tryggvil (~tryggvil@95-91-243-120-dynip.superkabel.de) Quit ()
[8:33] * gucki (~smuxi@HSI-KBW-095-208-162-072.hsi5.kabel-badenwuerttemberg.de) has joined #ceph
[8:35] * sstan_ (~chatzilla@modemcable016.164-202-24.mc.videotron.ca) Quit (Ping timeout: 480 seconds)
[8:35] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:36] * loicd (~loic@magenta.dachary.org) has joined #ceph
[8:43] * leseb (~leseb@ has joined #ceph
[8:43] * sstan_ (~chatzilla@modemcable016.164-202-24.mc.videotron.ca) has joined #ceph
[8:47] * janeUbuntu (~jane@2001:3c8:c103:a001:f940:6bf6:60ac:ff19) has joined #ceph
[8:48] <Qten> hey guys, will 1 rbd device be able to use the same amount of iops etc, as several devices?
[9:14] * leseb_ (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:14] <joelio> I suppose it depends on the size of the device, but I (personally - may not be the same as others) found greater cumulative IOPS across multiple devices. YMMV
[9:15] <joelio> Qten: ^
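One way to measure this instead of guessing: run the same random-I/O job against one RBD image and then against several in parallel, and compare aggregate IOPS. This assumes a fio build with the rbd engine available; pool, client, and image names are placeholders:

```shell
# Baseline: one image
fio --ioengine=rbd --pool=rbd --clientname=admin \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based \
    --name=one --rbdname=test1

# Aggregate: three images driven concurrently (options before the first
# --name are shared; each --name gets its own rbdname)
fio --ioengine=rbd --pool=rbd --clientname=admin \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based \
    --name=a --rbdname=test1 --name=b --rbdname=test2 --name=c --rbdname=test3
```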
[9:16] * Morg (b2f95a11@ircip4.mibbit.com) has joined #ceph
[9:23] * leseb_ (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Remote host closed the connection)
[9:23] * leseb_ (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:24] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Relax, its only ONES and ZEROS!)
[9:25] * gerard_dethier (~Thunderbi@ has joined #ceph
[9:29] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:35] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[9:35] * rinkusk (~Thunderbi@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) has joined #ceph
[9:38] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[9:43] * eschnou (~eschnou@234.90-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[9:49] * ScOut3R (~ScOut3R@ has joined #ceph
[10:02] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[10:04] * eschnou (~eschnou@234.90-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[10:05] * l0nk (~alex@ has joined #ceph
[10:07] * xiaoxi (~xiaoxiche@ Quit (Ping timeout: 480 seconds)
[10:11] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:13] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[10:17] * BillK (~BillK@124-169-54-248.dyn.iinet.net.au) has joined #ceph
[10:18] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Read error: Connection reset by peer)
[10:19] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[10:23] * The_Bishop_ (~bishop@f052102141.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[10:34] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[10:35] * Cotolez (~aroldi@ has joined #ceph
[10:36] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Ping timeout: 480 seconds)
[10:37] <Cotolez> Hi all
[10:39] * jtangwk1 (~Adium@2001:770:10:500:6d37:758b:ea66:61bb) Quit (Quit: Leaving.)
[10:40] * jtangwk (~Adium@2001:770:10:500:6d37:758b:ea66:61bb) has joined #ceph
[10:41] <Cotolez> I'm dealing with a test cluster that has 6 osds out of a total of 11 (all on one single host) that suddenly hang when deep scrub starts. Yesterday I was chatting with dmick and gregaf, who pointed the finger at a slow response from the disks. I've collected a log from one osd with all debug parameters set to maximum
[10:41] <Cotolez> It's here: https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
[10:43] <fghaas> Cotolez: knee-jerk response, have you read http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals?
[10:44] <Cotolez> no, I'm reading
[10:44] <fghaas> where are your journals for those OSDs? all on one physical device?
[10:44] <fghaas> that would explain your hang right then and there
[10:44] <Cotolez> the journals are on a separate partition of the storage disk
[10:46] <Cotolez> 10Gb each journal
[10:46] <Cotolez> 1 1049kB 1990GB 1990GB xfs ceph data
[10:46] <Cotolez> 1 1049kB 1990GB 1990GB xfs ceph data
[10:46] <Cotolez> 2 1990GB 2000GB 10.5GB ceph journal
[10:46] <fghaas> so you've partitioned all of your disks such that you have one partition that you use as a block device for the journal, and another partition for the filestore?
[10:47] <Cotolez> that's right
[10:47] <fghaas> i.e. one osd per disk, but the disk is split into a journal and a filestore partition?
[10:47] <Cotolez> yes
[10:48] <fghaas> well that would cause an awful lot of disk seeks
[10:48] <fghaas> (I'm assuming they're all spinners, correct me if you're using SSDs)
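The layout being described, and the usual alternative, as config fragments. Paths are illustrative; the point is that a journal partition on the same spinner as the filestore forces a seek between the two regions on every write:

```ini
; Cotolez's layout: journal partition on the same spindle as the filestore
[osd.0]
    osd data    = /var/lib/ceph/osd/ceph-0   ; xfs on /dev/sda1
    osd journal = /dev/sda2                  ; ~10G partition, same disk

; the usual fix is to point "osd journal" at a partition on a separate
; (ideally SSD) device instead, e.g. /dev/sdk1 -- path illustrative
```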
[10:48] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) has joined #ceph
[10:48] <Cotolez> all spinners, 6Gb/s SAS drives
[10:50] <fghaas> any controller write cache involved?
[10:50] <Cotolez> yes, There is a 1Gb cache on the controller, with BBU
[10:50] <fghaas> I/O scheduler?
[10:50] <Cotolez> deadline
[10:51] <fghaas> on all devices?
[10:51] <Cotolez> yes
[10:51] <Cotolez> the cluster has run for 35 days flawlessly
[10:52] <fghaas> confirmed to be in writeback mode? (if your battery just died, the controller may have switched itself into write-through mode)
[10:52] <Cotolez> 0.56.3 on ubuntu 12.04
[10:52] <Cotolez> The battery status is OK
[10:53] <fghaas> did you double check for WB also?
[10:53] <Kdecherf> oh god, a kernel panic
[10:54] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[10:54] <fghaas> Kdecherf: last I checked, god couldn't help with kernel panics, but you might ask Chuck Norris
[10:54] <Cotolez> fghaas: wait please
[10:55] <Kdecherf> fghaas: the kernel panic was caused by a crashdump of an osd
[10:56] <fghaas> Kdecherf: crashdump as in lkcd? you mean you got one panic, then lkcd kicked in, and then another panic?
[10:57] <fghaas> Cotolez: also, iostat -x 1 reporting any device near 100% util during the deep scrub?
[11:00] <Cotolez> fghaas: the devices are between 72% and 82%
[11:00] <Kdecherf> fghaas: nope, ceph crashed and dumped the last 220k events
[11:00] <Kdecherf> and a kernel panic occurred during the dump
[11:01] <fghaas> Kdecherf: lkcd and a crashkernel still would have helped to pinpoint the problem
[11:01] <Kdecherf> fghaas: http://i.imgur.com/jtvfvaR.png
[11:01] <fghaas> Cotolez: that doesn't strike me as particularly problematic
[11:02] <fghaas> Kdecherf: yeah, you're proving my point -- a tiny snippet like that is close to useless :)
[11:02] <Cotolez> fghaas: sorry, i was wrong. I have to wait the next deep scrub. Please wait
[11:02] <fghaas> wrong about what?
[11:02] <Cotolez> about the percentage
[11:02] <Kdecherf> fghaas: http://imgur.com/uLdc47y ;-)
[11:04] <fghaas> that can hardly be the full stack trave
[11:04] <fghaas> trace
[11:05] <fghaas> Kdecherf: and since you're running a 3.7 kernel, I assume debugging it is part of the fun, no? :)
[11:07] <Kdecherf> fghaas: yep :) but actually we have no problem with this kernel (if we exclude ceph)
[11:07] <Kdecherf> and we will move to 3.9rc1
[11:08] * Philip__ (~Philip@hnvr-4d07bfa9.pool.mediaWays.net) has joined #ceph
[11:08] * rinkusk1 (~Thunderbi@cmr-208-97-77-198.cr.net.cable.rogers.com) has joined #ceph
[11:08] <fghaas> if your ceph-osd (which is a 100% userspace process) causes a kernel panic, then yes you are having a problem with this kernel :)
[11:09] * rinkusk (~Thunderbi@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[11:09] <TMM> Kdecherf, it's a kernel's job to deal with whatever userspace does, it is there to protect the system if you will. If a userspace process can crash your kernel then it is either a kernel bug or a hardware problem. by definition :)
[11:10] <fghaas> Kdecherf: TMM and I seem to think alike
[11:11] <TMM> fghaas, could still be an application bug triggering a kernel bug though. It's the kernel's fault for its own crash, but the OSD may still be misbehaving :)
[11:12] <fghaas> sure
[11:12] <fghaas> but exposing a kernel issue is different from causing one
[11:12] <TMM> of course
[11:13] <TMM> the kernel is responsible for not crashing
[11:13] <TMM> like any application is
[11:13] <TMM> an application crashing is always its own fault :P
[11:14] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[11:19] <Qten> joelio: thanks
[11:19] <Cotolez> fghaas: sorry, now the load is still over 0.50 and the deep scrub doesn't start. The cluster is up and running. The problems starts only when deep scrub starts. I use nmon instead of iostat, and yesterday, during deep scrub, I saw the devices at 100%
[11:20] * rinkusk1 (~Thunderbi@cmr-208-97-77-198.cr.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[11:21] <fghaas> huh, load? as in load average?
[11:21] <Cotolez> yes
[11:22] <fghaas> well there's a number of things that contribute to load, only one of them being processes waiting for I/O
[11:22] <fghaas> why don't you kick off a deep scrub manually, to reproduce the issue?
[11:22] <Cotolez> ok, I try
[11:25] <fghaas> and then, like I said, iostat -x 1 would be interesting to see if the %util column goes anywhere near 100% -- if it is, then yes most likely it's just that your disks are saturated
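The check fghaas describes can be scripted; the 90% threshold and the canned two-device sample below are illustrative (real `iostat -x 1` output has more columns, with %util last):

```shell
# Flag any device whose %util (last column) is near saturation.
# Feeding a made-up sample here instead of live iostat output.
sample='sda 12.00 340.0 98.00 99.60
sdb  3.00  80.0 20.00 41.20'

printf '%s\n' "$sample" | awk '$NF > 90 { print $1 " looks saturated (" $NF "%)" }'
```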
[11:29] <Cotolez> it's weird: I issue the pg scrub command
[11:29] <Cotolez> then %util is 0 and the process is 100% on cpu
[11:30] <Cotolez> then after 30 seconds it's at 100% util and 100% cpu
[11:30] <Cotolez> then the osd is marked out
[11:30] <Cotolez> and after a minute it is "wrongly marked out" and put in the cluster again
[11:31] <Cotolez> Do I have to start over from scratch?
[11:32] <Cotolez> I'd like to record my desktop and show it
[11:43] * thelan (~thelan@paris.servme.fr) has joined #ceph
[11:43] <thelan> Hello
[11:44] * diegows (~diegows@ has joined #ceph
[11:47] <thelan> I've got some inconsistent pgs. When i'm using ceph pg repair on theses i've got the following error:
[11:48] <Cotolez> fghaas: This is what happens:
[11:48] <thelan> repair 2.c7 9a785ec7/rb.0.614c.2ae8944a.00000000040c/head//2 on disk size (4001792) does not match object info size (4184064)
[11:48] <Cotolez> fghaas: 1- I issue the pg scrub and the %util on the 2 osds of that pg is 0 - the corresponding ceph-osd process is at 100% cpu
[11:49] <Cotolez> fghaas: 2- Ceph marks the osd down - the ceph-osd process is still at 100% cpu and %util=0
[11:50] <Cotolez> fghaas: 3- Ceph marks the osd "wrongly down" - the ceph-osd process floats from 100 to 90% - the %util goes to 100%
[11:53] <fghaas> that's a pretty nasty combination
[11:54] * thomas (~thomas@LLagny-156-35-38-195.w217-128.abo.wanadoo.fr) has joined #ceph
[11:55] <fghaas> as for ceph-osd hitting 100%cpu, my guess would be an osd bug, also for the "wrongly down" issue. sorry, gregaf or sjust might know more, but I'd say it'd be hard to explain this with "just" a slow OSD
[11:56] <Cotolez> just as a reminder, I've collected an osd log with all debug levels set to maximum, here: https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
[12:00] * gucki_ (~smuxi@HSI-KBW-095-208-162-072.hsi5.kabel-badenwuerttemberg.de) has joined #ceph
[12:00] <gucki_> hi there
[12:00] * gucki_ (~smuxi@HSI-KBW-095-208-162-072.hsi5.kabel-badenwuerttemberg.de) Quit (Remote host closed the connection)
[12:05] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[12:07] * gerard_dethier1 (~Thunderbi@ has joined #ceph
[12:08] * gerard_dethier (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[12:10] <gucki> is there a config option to make "ceph -w" also display read throughput (right now it seems only write throughput is displayed, like wr ..., ops ..)
[12:13] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Read error: Connection reset by peer)
[12:13] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[12:28] * gaveen (~gaveen@ has joined #ceph
[12:39] * Philip__ (~Philip@hnvr-4d07bfa9.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[12:42] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Read error: Connection reset by peer)
[12:43] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[12:57] * tryggvil_ (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[13:02] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:02] * The_Bishop (~bishop@2001:470:50b6:0:5026:e6f0:2177:24d5) has joined #ceph
[13:05] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Ping timeout: 480 seconds)
[13:05] * tryggvil_ is now known as tryggvil
[13:10] * tryggvil_ (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[13:10] * tziOm (~bjornar@ has joined #ceph
[13:13] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Ping timeout: 480 seconds)
[13:13] * tryggvil_ is now known as tryggvil
[13:21] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[13:22] * rinkusk (~Thunderbi@ has joined #ceph
[13:35] <elder> Weird. Snow day for my son today.
[13:47] * rinkusk (~Thunderbi@ Quit (Read error: Connection reset by peer)
[13:49] * rinkusk (~Thunderbi@ has joined #ceph
[13:50] * diegows (~diegows@ has joined #ceph
[13:55] * tziOm (~bjornar@ Quit (Ping timeout: 480 seconds)
[13:56] * gregorg (~Greg@ has joined #ceph
[13:58] * BillK (~BillK@124-169-54-248.dyn.iinet.net.au) Quit (Quit: Leaving)
[14:04] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:07] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Quit: tryggvil)
[14:09] * rinkusk (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[14:10] * ScOut3R_ (~ScOut3R@ has joined #ceph
[14:16] * SkyEye (~gaveen@ has joined #ceph
[14:16] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[14:16] * Morg (b2f95a11@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[14:19] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[14:20] * eschnou (~eschnou@ has joined #ceph
[14:28] * rinkusk (~Thunderbi@CPE00259c467789-CM00222d6c26a5.cpe.net.cable.rogers.com) has joined #ceph
[14:30] * tziOm (~bjornar@ has joined #ceph
[14:36] * markbby (~Adium@ has joined #ceph
[14:36] <Cotolez> Hi again, I've captured a video to show you the behavior of my issue. http://youtu.be/708AI8PGy7k
[14:37] * fred1 is now known as flepied
[14:38] * josef (~seven@li70-116.members.linode.com) has left #ceph
[14:39] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:41] <janos> Cotolez: what's the screen on the right running?
[14:42] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:42] <Cotolez> iostat -x 1
[14:47] * ScOut3R (~ScOut3R@ has joined #ceph
[14:50] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[14:50] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:53] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[14:53] * ScOut3R_ (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[14:59] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[15:07] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[15:07] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[15:13] * andrew_ (~andrew@ip68-231-33-29.ph.ph.cox.net) has joined #ceph
[15:18] * yanzheng (~zhyan@jfdmzpr04-ext.jf.intel.com) has joined #ceph
[15:25] <andrew_> how are we to interpret an absurdly large exit code from rados get?
[15:30] <jluis> 'large exit code'?
[15:31] <elder> Like a zillion?
[15:31] <elder> That would be CRAZY
[15:32] <elder> What kind of exit code, andrew_?
[15:33] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[15:33] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:35] <andrew_> 1464856576
[15:36] <elder> That's an exit code?
[15:37] <janos> it exited and slammed the door on the way out
[15:37] <elder> As in, it was the status returned by the shell command "rados get ..."
[15:37] <elder> ?
[15:37] <andrew_> you know, you're right. it is the error code reported by rados, rather than the actual unix exit code. my bad. but the question still remains
[15:37] <andrew_> how to interpret this error code.
[15:37] <elder> Do you have the message that you can paste here, or to a pastebin or something?
[15:37] <elder> (For a little more context)
[15:39] <andrew_> zeno-26_agh=; rados get -p tom dump-home-200601-v2-55caa5194ad1b41e81f1119c429931d8284b5b3f.cpio.8bb69dd4373aa70eccf4d9752c378794
[15:39] <andrew_> error getting tom/dump-home-200601-v2-55caa5194ad1b41e81f1119c429931d8284b5b3f.cpio.8bb69dd4373aa70eccf4d9752c378794: Unknown error 1464856576
[15:41] * vata (~vata@2607:fad8:4:6:c5f3:e0ab:5c04:da5a) has joined #ceph
[15:42] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:44] <elder> Just for more complete information, can you tell us what version of ceph you're running? "rados --version" I think will give that to you.
[15:44] * bmjason (~bmjason@static-108-44-155-130.clppva.fios.verizon.net) has joined #ceph
[15:44] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:45] <andrew_> 0.56.3
[15:46] * dmner (~tra26@tux64-13.cs.drexel.edu) has joined #ceph
[15:46] <andrew_> it is likely significant that this is the first get i am trying; i've "put" 2-3TB's worth of objects.
[15:48] <dmner> Anyone familiar with disk utilization, ceph reports it is using 6G with only 10M of data (I used to have more previously and removed all of it)
[15:48] <dmner> also it caused a negative degraded state which is fun
[15:48] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[15:49] * mikedawson_ is now known as mikedawson
[15:49] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:50] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) has joined #ceph
[15:51] * SkyEye is now known as gaveen
[15:51] <jluis> hmm
[15:52] <jluis> somehow it feels as if a negative number is being returned by some unsigned int function or something of the sorts
[15:52] <jluis> I'll take a look
[15:52] <elder> andrew_, that comment from jluis was directed at you...
[15:53] <andrew_> okay. great!! thanks
[15:54] <jluis> ah, I see
[15:54] <jluis> I'll open a bug for this
[15:56] <andrew_> jluis: can you give a quick summary of what you found?
[15:56] <jluis> yeah, just making sure
[15:57] <jluis> no, scratch that
[15:57] <jluis> I don't see it after all
[15:57] * capri (~capri@ Quit (Quit: Verlassend)
[16:00] * rturk-away is now known as rturk
[16:03] * Cube (~Cube@ has joined #ceph
[16:04] <jluis> ah
[16:05] <andrew_> smells like a size-related problem to me
[16:05] <jluis> andrew_, I'm pretty sure what you're hitting is something that neither elder nor I would be able to see in current code, as that particular portion changed sometime before 0.57
[16:06] <jluis> that should be an error from trying to read the object, checking if 'ret < 0 then return ret'
[16:07] <jluis> err
[16:07] <jluis> I need more coffee
[16:07] <jluis> no, cuz the return value is being passed to strerror_r(-err)
[16:07] <jluis> -_-
[16:10] <andrew_> is this one of those cases where an underlying routine is passing back more info (like number of bytes to read) and the upper routine is forgetting to translate it into an errno?
[16:10] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[16:15] <jluis> andrew_, not really; the function is returning the error appropriately, and the error value is being properly handled before being handed off to 'strerror_r()', so I'm missing how the hell the number turns from negative to positive in the process
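For what it's worth, the mystery number itself can be sanity-checked; this sketch (value copied from the paste above) shows it is neither a plausible errno nor a negated 32-bit errno that lost its minus sign:

```shell
# Value copied from the "Unknown error" message above; real errnos are tiny.
err=1464856576
plausible_errno=$(( err > 0 && err < 4096 ))
# Reinterpret the same 32 bits as signed: still positive, so this is not
# simply a negated errno whose sign got dropped.
as_signed_32=$(( (err & 0xFFFFFFFF) >= 2147483648 ? err - 4294967296 : err ))
echo "plausible errno: $plausible_errno, as signed 32-bit: $as_signed_32"
```

Whatever produced the number, it does not decode to an errno either way, which is consistent with jluis not finding the bug in the error-handling path.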
[16:22] * Cube (~Cube@ Quit (Read error: Connection reset by peer)
[16:22] * Cube1 (~Cube@ has joined #ceph
[16:33] * Philip__ (~Philip@hnvr-4d07bfa9.pool.mediaWays.net) has joined #ceph
[16:35] <andrew_> is there a size limit on objects?
[16:40] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[16:41] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[16:45] <elder> andrew_, I believe the size limit should be 2^64, but it's possible it's 2^63 and it's possible there are corner cases that don't handle sizes properly.
[16:46] <elder> I don't expect bugs like that, I just know you're seeing a problem so it's a caveat...
[16:47] <andrew_> well, all my action is taking place through the rados program; it succeeded on a 2GB file and failed on a 6GB file.
[16:48] <elder> That is a very good clue, isn't it?
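The 2GB-works / 6GB-fails split is consistent with a 32-bit length field somewhere in the path; a quick arithmetic check (purely illustrative, not a claim about where any truncation actually happens):

```shell
# Illustrative only: a 6GB size cannot be represented in 32 bits,
# while 2GB (just barely) can be, as an unsigned value.
two_gb=$((2 * 1024 * 1024 * 1024))
six_gb=$((6 * 1024 * 1024 * 1024))
max_u32=$((0xFFFFFFFF))
echo "2GB fits in 32 bits: $(( two_gb <= max_u32 ))"
echo "6GB fits in 32 bits: $(( six_gb <= max_u32 ))"
```

Note that 2GB is already past the signed 32-bit limit (2^31 - 1), so a signed 32-bit length would have failed on both files; the clue points at something unsigned or at a boundary above 4GB.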
[16:48] <jluis> andrew_, here's the bug: http://tracker.ceph.com/issues/4349
[16:48] <elder> There may be others online soon (West coast) that might have more insights.
[16:48] * rturk is now known as rturk-away
[16:48] <jluis> I tried to look into it, but I just can't find anything wrong
[16:48] <andrew_> i was going to do a binary search to find the key number but i am now getting the dreaded old requests diagnostics from ceph -w
[16:48] <jluis> and this is eating way too much of my time
[16:48] <andrew_> sorry!
[16:50] <t0rn> when doing continuous writes to a rbd volume within a qemu process (using qemu-rbd, over 24 hour period), would it be surprising to see the qemu process itself use less memory when rbd cache is true (all other rbd cache settings at default values), and more memory when rbd cache is false?
[16:50] <t0rn> With rbd cache off, i have seen my qemu process use almost a gig over my instance allocated amount, but with caching enabled, I have only seen it use (so far) up to ~300mb over the instance allocated amount. I'm using 0.56.3, qemu 1.1.2
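For reference, the rbd cache setting t0rn is toggling can be passed in the qemu drive string; a hedged sketch of the invocation (pool, image name, and memory size are made up, not from this log):

```shell
# Illustrative only: enable the librbd cache for a qemu 1.1-era guest by
# appending rbd_cache=true to the rbd: drive spec (names hypothetical).
qemu-system-x86_64 \
  -m 512 \
  -drive file=rbd:mypool/myimage:rbd_cache=true,if=virtio
```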
[16:51] <jluis> andrew_, np, the fault is all mine: I tend to obsess over this kind of thing, and I tend to spend much more time than I should :)
[16:53] <andrew_> i empathise
[16:58] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[16:59] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit ()
[17:01] <todin> t0rn: there is a memory leak in librbd
[17:02] * gerard_dethier1 (~Thunderbi@ Quit (Quit: gerard_dethier1)
[17:07] * aliguori (~anthony@ has joined #ceph
[17:10] * scuttlemonkey_ is now known as scuttlemonkey
[17:13] <t0rn> todin: is there a bug i can track for the leak?
[17:14] <todin> t0rn: not that I know of, but joshd knows it, maybe we should open a bug in the tracker, I will ask him
[17:15] <PerlStalker> ceph -s is reporting that I have a bunch of pgs stuck in "active+remapped" state after adding an OSD. Is this something I need to worry about?
[17:20] <thelan> PerlStalker: see http://ceph.com/docs/master/rados/operations/pg-states/
[17:22] * aliguori_ (~anthony@ has joined #ceph
[17:23] * aliguori__ (~anthony@bi-01.bluebird.ibm.com) has joined #ceph
[17:23] <thelan> PerlStalker: ceph is moving some pg to your new osd
[17:24] * Gugge_47527 (gugge@kriminel.dk) has joined #ceph
[17:25] * jlogan1 (~Thunderbi@2600:c00:3010:1:a073:d626:ddc8:2b2b) has joined #ceph
[17:27] * Gugge-47527 (gugge@kriminel.dk) Quit (Ping timeout: 480 seconds)
[17:27] * Gugge_47527 is now known as Gugge-47527
[17:27] * yanzheng (~zhyan@jfdmzpr04-ext.jf.intel.com) Quit (Remote host closed the connection)
[17:28] * aliguori (~anthony@ Quit (Ping timeout: 480 seconds)
[17:29] * aliguori (~anthony@ has joined #ceph
[17:30] * aliguori_ (~anthony@ Quit (Ping timeout: 480 seconds)
[17:31] <PerlStalker> thelan: So, basically, I should just be patient and these active+remapped pgs should take care of themselves.
[17:31] * aliguori__ (~anthony@bi-01.bluebird.ibm.com) Quit (Ping timeout: 480 seconds)
[17:39] <thelan> PerlStalker: in my case yes. Basically I had to wait about 20-30 minutes for ceph to "balance" the data (about 200GB of test data)
[17:41] <PerlStalker> I have 761 GB of data but this osd is only weighted at .2.
[17:42] * gregaf (~Adium@2607:f298:a:607:215d:9748:6e65:fe68) Quit (Quit: Leaving.)
[17:42] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:43] * gregaf (~Adium@2607:f298:a:607:3177:339c:34e6:e9b) has joined #ceph
[17:47] * jtangwk (~Adium@2001:770:10:500:6d37:758b:ea66:61bb) Quit (Quit: Leaving.)
[17:47] * jtangwk (~Adium@2001:770:10:500:9d7f:aa39:c3c1:4b7c) has joined #ceph
[17:50] * l0nk (~alex@ Quit (Quit: Leaving.)
[17:53] * jantje (~jan@paranoid.nl) Quit (Read error: Connection reset by peer)
[17:54] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:57] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[17:57] <thelan> I've reweighted one of my osds from 1 to .7 and now I have the same
[17:58] * jantje (~jan@paranoid.nl) has joined #ceph
[18:00] <thelan> PerlStalker: see progression http://i46.tinypic.com/2vw62wl.jpg
[18:00] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[18:02] <thelan> i've made some scripts to monitor ceph through zabbix https://github.com/thelan/ceph-zabbix
[18:04] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[18:19] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[18:23] <thomas> exit
[18:23] <thomas> sorry, bye ;)
[18:23] * thomas (~thomas@LLagny-156-35-38-195.w217-128.abo.wanadoo.fr) Quit (Quit: leaving)
[18:25] <joshd1> t0rn: you see more memory used with no caching enabled? that's the opposite of the leak that todin saw
[18:25] * bmjason (~bmjason@static-108-44-155-130.clppva.fios.verizon.net) Quit (Read error: Connection reset by peer)
[18:25] * bmjason (~bmjason@static-108-44-155-130.clppva.fios.verizon.net) has joined #ceph
[18:29] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[18:35] * Cotolez (~aroldi@ Quit (Quit: Sto andando via)
[18:36] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Connection reset by peer)
[18:37] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[18:39] <t0rn> joshd1: that is correct, with rbd cache set to true (other settings default) i see less memory usage of the qemu process. For the test I just start a VM, with this loop: 'while true; do dd if=/dev/zero of=/dev/vdb bs=1M ; done' and watch the memory usage climb. Doing that I see more usage when caching is off
[18:40] <t0rn> where vdb is the rbd volume
[18:41] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[18:42] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[18:42] <joshd1> t0rn: does memory usage continue to grow unbounded, or does it stop after a while?
[18:42] * ScOut3R (~scout3r@1F2EAE22.dsl.pool.telekom.hu) has joined #ceph
[18:44] <t0rn> so far i've only run the test for 24 hours each, and it does appear to flatline somewhat (http://tinypic.com/view.php?pic=2q3dkav&s=6) that's a graph showing the results of the above test, but with rbd cache disabled (argonaut vs bobtail)
[18:44] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[18:46] <t0rn> i'm not sure what memory overhead i should expect under normal circumstances from qemu-rbd setups over the instance allocated amounts. That graph was from a vm granted 512mb through qemu
[18:47] <joshd1> ok, that doesn't look like a leak, but it would be good to investigate why it's using more memory in bobtail vs argonaut, and without cache vs with.
[18:48] <joshd1> t0rn: those are with the same size cluster with images in the same pool?
[18:48] <t0rn> same cluster and pool yes. The only difference was the client librdb versions
[18:48] <joshd1> perfect. which distro and packages are you using?
[18:50] <t0rn> centos 6.2 , compiled from source: 1.) qemu 1.1.2 2.) librbd from v0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5) 3.) libiscsi 1.5.0 4.) libvirt 1.0.1
[18:51] <t0rn> if you mean client side
[18:51] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[18:53] <joshd1> yeah, just the client side. did you build librbd with tcmalloc (from google-perftools) and nss (as opposed to crypto++)?
[18:54] <joshd1> using tcmalloc generally helps memory use a lot for ceph
[18:55] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[18:55] * chutzpah (~chutz@ has joined #ceph
[18:57] <yehuda_hm> thelan: looks cool, maybe send a message to ceph-devel, will get more exposure
[18:57] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[18:58] <t0rn> i built with gperftools-devel,gperftools-libs 2.0-3.el6.2 from http://code.google.com/p/gperftools i used nss-devel-3.12.10-17.el6_2 (stock centos package)
[18:58] <infernix> nhm: any chance to play with rbdbench.py?
[18:59] * leseb_ (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[19:02] <jlogan1> Hi. I'm trying to mount an RBD on a Centos 6.3 host, but I'm having some cephx authentication issues.
[19:02] <jlogan1> This is how I made the key on a Ceph host:
[19:02] <jlogan1> ceph auth get-or-create client.it02-sef osd 'allow rwx pool=it02-sef' mon 'allow r' > /etc/ceph/it02-sef.keyring
[19:02] <jlogan1> I then copy the key to my client and run the following:
[19:02] <jlogan1> [root@it02-sef ceph]# rbd --keyfile /etc/ceph/it02-sef.keyring --id it02-sef --pool it02-sef list
[19:02] <jlogan1> 2013-03-05 10:02:16.739485 7f9b85567760 -1 auth: failed to decode key '
[19:02] <jlogan1> [client.it02-sef]
[19:02] <jlogan1> SNIP
[19:02] <jlogan1> 2013-03-05 10:02:16.739505 7f9b85567760 0 librados: client.it02-sef initialization error (22) Invalid argument
[19:03] <jlogan1> rbd: couldn't connect to the cluster!
[19:03] <jlogan1> am I missing a setup on the ceph servers, or the client?
[19:04] <joshd1> jlogan1: you probably want --keyring instead of --keyfile; keyfile expects only the base64 key without the keyring stuff around it
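joshd1's distinction can be shown concretely: a keyring file wraps the secret in an INI-style section, while `--keyfile` wants only the bare base64 value. A small extraction sketch (the key material below is fake, for illustration only):

```shell
# A fake keyring; the key value here is made-up, not a real secret.
keyring='[client.it02-sef]
        key = AQFakeKeyMaterialForIllustrationOnly0123456789ab=='
# --keyring takes the whole file as-is; --keyfile wants just this value:
secret=$(printf '%s\n' "$keyring" | awk '$1 == "key" { print $3 }')
echo "$secret"
```

So either pass the file to `--keyring`, or strip it down to the base64 line for `--keyfile`.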
[19:04] <joshd1> t0rn: thanks, that should be enough to try to reproduce and see what's going on
[19:05] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[19:05] <jlogan1> joshd1: Closer:
[19:05] <jlogan1> [root@it02-sef ceph]# rbd --keyring /etc/ceph/it02-sef.keyring --id it02-sef --pool it02-sef list
[19:05] <jlogan1> rbd: list: (1) Operation not permitted
[19:06] * rturk-away is now known as rturk
[19:06] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[19:07] <t0rn> joshd1: I can recompile the stack I have easily enough if you want me to test different compile options. When I built from the ceph repo I just installed that nss-devel package and the google perftools rpms detailed above, then ran configure with default args
[19:07] <t0rn> (so no args to configure)
[19:08] <joshd1> t0rn: can you make sure it picked up tcmalloc (config.log should say something about it)
[19:08] <t0rn> sure i will recompile so i can then go back and check that log
[19:11] <t0rn> actually, found the log from when i compiled it joshd1 . Here is the grep for tcmalloc on it: http://paste.debian.net/239978/ it appears to have picked it up
[19:11] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[19:12] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[19:12] <joshd1> jlogan1: if you're using 0.56.x, make sure all your osds are on that version and not argonaut. you could also be hitting http://tracker.ceph.com/issues/4287
[19:12] * jtangwk raises an eye brow about feedback for cephfs
[19:13] <joshd1> t0rn: yeah, looks like it did, thanks
[19:14] <jlogan1> I'm on 0.56.1-1quantal for the server and ceph-0.56.3-0.el6.x86_64 for the client.
[19:14] <jlogan1> The client is centos 6.3, so I need to get a new kernel with RBD support.
[19:15] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[19:16] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[19:17] <jtangwk> im kinda interested in the no-fsck statement in http://ceph.com/community/blog/
[19:17] <jlogan1> Are there any suggested Centos 6.3 Kernel .rpm to use?
[19:17] <jtangwk> does that mean that you will be doing scrubs/checks on blocks with in the system with checksums?
[19:18] <jtangwk> or is it going to be file based?
[19:18] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:19] <jtangwk> the over all proposal looks appealing for us at our site, though we're not using cephfs (yet)
[19:19] <jtangwk> but just the rbd component for now
[19:20] <jtangwk> still, not having a fsck makes me feel uncomfortable having experience failures in the past before on other distributed file systems
[19:22] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[19:22] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit ()
[19:25] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[19:28] * taber (~taber@pool-72-90-77-216.syrcny.fios.verizon.net) has joined #ceph
[19:29] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:31] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[19:31] <taber> hi, has anyone else tried compiling on mac os x lion and gotten "configure: error: libresolv not found"?
[19:32] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[19:33] <sstan> no, but do you have libresolv? what's the path to it?
[19:34] <taber> i believe so, /usr/lib/libresolv.9.dylib and /usr/lib/libresolv.dylib
[19:36] <sstan> might be able to tell configure where libresolv is .. but idk how. Perhaps someone else here could help :/
[19:36] * leseb (~leseb@ Quit (Read error: Connection reset by peer)
[19:36] * leseb (~leseb@ has joined #ceph
[19:37] <taber> that'd be awesome, not sure if i should have two in there or what
[19:38] <joshd1> todin: could you try reproducing the memory leak with one vm and debug_ms=1:debug_objectcacher=30:log_file=/path/to/file after updating from the wip-objectcacher-leak-bobtail branch?
[19:38] <joshd1> todin: I have a theory, and added some extra debugging that will help in any case
[19:39] <joshd1> todin: it would be easier to read the log from format 1, since there'd only be one cache too
[19:39] <MrNPP> should i keep using virtio for the rbd driver, or should i be using something else? i read somewhere that using the scsi was better
[19:43] <todin> joshd1: yep, I can do that, give me a few minutes
[19:53] * danieagle (~Daniel@ has joined #ceph
[19:54] * benner (~benner@ Quit (Quit: leaving)
[19:56] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) has joined #ceph
[19:57] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) Quit ()
[19:57] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) has joined #ceph
[19:59] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has left #ceph
[20:04] * jjgalvez (~jjgalvez@ has joined #ceph
[20:04] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[20:05] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[20:06] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has left #ceph
[20:06] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[20:06] * miroslav (~miroslav@ has joined #ceph
[20:11] * dpippenger (~riven@ has joined #ceph
[20:12] <fghaas> joshd1: silly question, the rule still stands that kernel-mapping RBDs on nodes that are themselves OSDs is a bad idea, correct? (just like mounting kernel cephfs would be)
[20:14] <joshd1> fghaas: yeah, it's not likely to ever be a good idea
[20:16] <fghaas> joshd1: thanks
[20:18] * Cube1 (~Cube@ Quit (Read error: Connection reset by peer)
[20:18] * Cube (~Cube@ has joined #ceph
[20:19] <jlogan1> joshd1: I think I hit 4287. I redid the test without a - and it works:
[20:19] <jlogan1> [root@it02-sef ~]# rbd --keyring /etc/ceph/it02sef.keyring --id it02sef --pool it02sef list
[20:19] <jlogan1> cobblerit02sef
[20:19] <jlogan1> postgresit02sef
[20:21] * eschnou (~eschnou@234.90-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:22] * leseb (~leseb@ Quit (Read error: Connection reset by peer)
[20:22] * leseb (~leseb@ has joined #ceph
[20:23] * sjustlaptop (~sam@ has joined #ceph
[20:23] <joshd1> jlogan1: good to know that it hits centos too. others had just seen it on ubuntu quantal. the fix is already in the bobtail branch, which will be 0.56.4
[20:24] <jlogan1> My server is Ubuntu, client is Centos. So possibly a different setup then what you have seen before.
[20:25] <jlogan1> Is there a way to have rbd mounted on boot? Or do I need to put the map commands into rc.local?
[20:25] * leseb (~leseb@ Quit (Remote host closed the connection)
[20:25] * bstillwell (~bryan@ has joined #ceph
[20:26] * lkeijser (~me@j184015.upc-j.chello.nl) has joined #ceph
[20:27] <joshd1> ah, that explains it. it's the server boost library that matters
[20:27] <joshd1> no way to map it on boot, no
[20:27] * dmick idly wonders if we have an upstart event for when the cluster is up and healthy
[20:27] <bstillwell> I'm trying to replace an osd in my cluster and I'm running into a cephx issue that I'm hoping someone can help with.
[20:27] <dmick> or if we should
[20:27] <lkeijser> hi, small question about ceph-mon. I'm looking at 4 ceph nodes (just doing support for a customer - I have very little experience with ceph). Can I just start another instance on a new node and kill/stop the other one?
[20:27] <bstillwell> I'm trying to run:
[20:27] <bstillwell> ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-2/keyring
[20:28] <bstillwell> but I'm getting:
[20:28] <bstillwell> monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
[20:28] <lkeijser> cause it's taking up a lot of memory
[20:28] <bstillwell> Seems like Greg ran into this before as well:
[20:28] <bstillwell> http://tracker.ceph.com/issues/3894
[20:29] <jlogan1> ok. I'll give rc.local a try. Thanks for the help and the - issue!
[20:29] <jlogan1> For the record elrepo.org kernel-ml is working on Centos to mount RBD.
[20:30] <scuttlemonkey> jlogan1: it's worth noting that, depending on your startup sequence, doing the various steps in an rc script may fail depending on the state of your ceph cluster when it runs
[20:30] <scuttlemonkey> (ie. if this is all on one test box, make it wait until ceph health is ok before running the mount script)
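scuttlemonkey's wait-for-health advice might look like this in rc.local; a hedged sketch only (pool, image, client, and mountpoint names are made up, and it assumes the keyring is already in place):

```shell
#!/bin/sh
# Hypothetical rc.local fragment: block until the cluster reports healthy,
# then map and mount an RBD image (all names illustrative).
until ceph health 2>/dev/null | grep -q HEALTH_OK; do
  sleep 5
done
rbd map mypool/myimage --id myclient --keyring /etc/ceph/myclient.keyring
mount /dev/rbd/mypool/myimage /mnt/myimage
```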
[20:30] * leseb (~leseb@ has joined #ceph
[20:31] <scuttlemonkey> bstillwell: you have the key present on the osd machine?
[20:31] <bstillwell> using strace it looks like my osd nodes might need /etc/ceph/ceph.client.admin.keyring, /etc/ceph/ceph.keyring, /etc/ceph/keyring, or /etc/ceph/keyring.bin
[20:31] <jlogan1> scuttlemonkey: The Ceph cluster is on 3 Ubuntu hosts. This server is just a user of that cluster. I don't think there will be any issues unless the whole site has lost power...
[20:31] <scuttlemonkey> jlogan1: that's fine, just thought it was worth mentioning since several folks have been putting everything on a single host for testing
[20:32] <scuttlemonkey> and just wanting the whole shebang to come up on boot
[20:32] <scuttlemonkey> with your setup it wont matter
[20:33] <bstillwell> scuttlemonkey: that's better, I copied /etc/ceph/keyring.bin to all my OSDs
[20:33] <scuttlemonkey> bstillwell: right on
[20:35] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[20:36] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[20:44] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has left #ceph
[20:44] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[20:44] * ChanServ sets mode +o scuttlemonkey
[20:49] <joshd1> t0rn: one detail I forgot to confirm, is that using format 1 images (the default)?
[20:53] * lkeijser (~me@j184015.upc-j.chello.nl) Quit (Quit: I'm So Meta, Even This Acronym)
[20:53] * markbby (~Adium@ Quit (Quit: Leaving.)
[20:54] <t0rn> joshd1 i'm using format 1
[20:54] <janos> is format 2 considered production ready?
[20:56] <todin> joshd1:
[20:56] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[20:56] * markbby (~Adium@ has joined #ceph
[20:57] <joshd1> janos: yeah, but it's still librbd-only
[20:58] <janos> any idea if that's going to remain userspace only?
[20:58] * miroslav (~miroslav@ Quit (Quit: Leaving.)
[20:58] <jmlowe> last I heard it was slated for the 3.9 kernel
[20:58] <janos> party
[20:59] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit (Ping timeout: 480 seconds)
[20:59] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[21:00] <joshd1> jmlowe: I don't think it'll make 3.9, but maybe 3.10
[21:00] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) Quit ()
[21:03] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[21:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:05] <janos> wouldn't that have to be 4?
[21:06] <janos> 3.10 is less than 3.9
[21:06] <janos> ah dang
[21:06] <janos> nm
[21:06] <janos> i'm not thinking of how they do 3.10.0-blah
[21:06] <joshd1> todin: shoot, I didn't have the extra debugging there yet. I thought I did, but now it'll be updated once it finishes building http://gitbuilder.sepia.ceph.com/gitbuilder-precise-deb-amd64/#origin/wip-objectcacher-leak-bobtail
[21:06] <janos> math getting in my way
[21:07] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[21:07] <joshd1> todin: sorry, but could you redo it again once in ~10 min when that's finished building?
[21:08] <todin> joshd1: so it should be bb70437952?
[21:08] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[21:10] * markbby (~Adium@ Quit (Quit: Leaving.)
[21:10] * markbby (~Adium@ has joined #ceph
[21:12] <joshd1> todin: yeah
[21:13] * Cube1 (~Cube@ has joined #ceph
[21:13] * Cube (~Cube@ Quit (Read error: Connection reset by peer)
[21:14] * markbby (~Adium@ Quit ()
[21:14] * markbby (~Adium@ has joined #ceph
[21:19] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:21] * taber (~taber@pool-72-90-77-216.syrcny.fios.verizon.net) Quit (Quit: bbl)
[21:23] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[21:24] <dmick> bstillwell: the problem was probably that you were missing the client.admin key
[21:25] <dmick> (on the host where you were running the ceph command, which is a client, by default named admin)
[21:25] <todin> joshd1:
[21:25] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:28] * diegows (~diegows@ has joined #ceph
[21:28] * Cube (~Cube@ has joined #ceph
[21:28] * Cube1 (~Cube@ Quit (Read error: Connection reset by peer)
[21:28] * Cube (~Cube@ Quit ()
[21:29] * Cube (~Cube@ has joined #ceph
[21:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:32] * Cube (~Cube@ Quit ()
[21:32] * Cube (~Cube@ has joined #ceph
[21:38] <ShaunR> I'm poking around in the init script for ceph on RHEL/CENTOS servers and i see it looks like the script will attempt to mount xfs volumes with the inode64 attribute, what's the reasoning for that, and should that be done normally?
[21:39] <dmick> ShaunR: ISTR it was advice from the xfs developers to help spread out the metadata
[21:40] <dmick> ceph xfs inode64 is a good thing to Google
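In practice the init-script behavior being discussed boils down to mounting each OSD's XFS data volume with inode64. A minimal sketch as an fstab entry, with the device and mount point as placeholders:

```
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,inode64  0 0
```

Without inode64, XFS confines all inodes to the low (32-bit addressable) part of the volume; with it, inodes can be allocated anywhere, so on large disks metadata can live near its data.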
[21:42] * drokita (~drokita@ has joined #ceph
[21:43] <drokita> Netelligent1!
[21:44] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[21:44] * drokita (~drokita@ has left #ceph
[21:44] * dynamike (~dynamike@ext.nat.phx.lindenlab.com) has left #ceph
[21:47] <dmick> oh.....kay....
[21:49] <ShaunR> i'm currently doing the following for mounting my xfs filesystems... rw,noexec,nodev,noatime,nodiratime,barrier=0
[21:51] <ShaunR> I see that Christoph Hellwig was saying to also increase the default log size and use large directory blocks.
[21:51] * miroslav (~miroslav@ has joined #ceph
[21:54] * Cube (~Cube@ Quit (Quit: Leaving.)
[21:54] * Cube (~Cube@ has joined #ceph
[21:55] * jskinner (~jskinner@ has joined #ceph
[22:04] * garyy (~gary@pool-72-90-77-216.syrcny.fios.verizon.net) has joined #ceph
[22:07] <bstillwell> dmick: that'd be my guess too
[22:09] <infernix> if you just have 4MB objects inode64 is probably not needed
[22:10] <infernix> i've only once run out of inodes on an xfs volume of several TBs
[22:10] <nhm> ShaunR: It may help, haven't gotten to test yet. Please feel free to try it! :D
[22:10] <infernix> but it doesn't hurt
[22:10] <dmick> infernix: I don't think it was about running out, but about having the inode blocks and data blocks close to one another
[22:11] <dmick> Christoph said it's now the default in 3.7, even, as it's always been a perf win
[22:11] <infernix> without inode64 all inodes are stored under 1TB iirc
[22:11] <dmick> right
[22:11] <infernix> so for 1TB disks it doesn't matter
[22:11] <dmick> 1TB is so 2010
[22:11] <dmick> :)
[22:12] <infernix> NFS used to have an issue with inode64
[22:12] <dmick> all this is in that post
[22:12] <nhm> dmick: A while back Sandeen I think was saying it's not always so clear cut, but it seems to be a win for Ceph since you spread out metadata writes over all the AGs afaik.
[22:12] <infernix> nhm: any numbers from rbdbench.py yet? :>
[22:12] <elder> When did NFS have an issue with inodet4?
[22:12] <elder> 64?
[22:13] <elder> inode64 does indeed spread the inodes out over an entire volume, regardless of size.
[22:13] <infernix> couple of months ago i ran into it on centos 5 where a centos 6 box was exporting it over nfs
[22:13] <elder> inode32 keeps all inodes under 32 bits.
[22:14] <dmick> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
[22:14] <dmick> <shrug>
[22:14] <nhm> infernix: sorry, I've been fighting with IOR to figure out why it's broken when directly testing against RBD block devices.
[22:14] <nhm> infernix: fio not, ior, sorry.
[22:14] <infernix> nhm: so use my tool :D
[22:14] <infernix> this is kernel rbd i take it
[22:15] <nhm> infernix: patience young skywalker! ;) Also, email me a link at mark.nelson@inktank.com so I've got it archived.
[22:15] <nhm> infernix: I *do* want to test it.
[22:16] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[22:19] <ShaunR> nhm: my tests are looking promising so i'm building out a new test cluster, 3 servers each running a single mon and 4 osds (SAS 7200 RPM 1TB), LSI 9670 controllers, raid0 per drive and a single OCZ revo drive x2 in them for journals. Those older cards still perform really well and you can get the 240GB models for a whopping $150 bucks
[22:20] <ShaunR> If all goes well next will be 24 disk servers, with 2 of those OCZ revodrives (one for read cache, one for write)
[22:20] <nhm> wow, $150? how fast is each card?
[22:21] <ShaunR> These are the older cards, but they still are fairly fast
[22:21] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[22:22] <ShaunR> let me get the specs
[22:22] <ShaunR> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227959&Tpk=OCZSSDPX-1RVDX0240RF
[22:23] <ShaunR> only disadvantage to the card is the drives that i have seen so far.
[22:24] <infernix> ShaunR: found any info on how many TB you can write to them before the nand will begin to wear out?
[22:24] <ShaunR> 240G space, 120,000 IOPS 4K random write (aligned), 740MB/s read / 720MB/s write
[22:25] <ShaunR> no, the MTBF claims 2,000,000 hours but that really doesnt say much
[22:26] <nhm> that's a great price. Too bad it's refurbished.
[22:26] <ShaunR> even the newer revodrive cards are around $500-$600
[22:26] <ShaunR> which isnt bad when compared to the price of a good SSD
[22:26] <ShaunR> nhm: eh, my experience with refurbished stuff has been good.
[22:27] <nhm> ShaunR: I've had mixed. Some good, some bad
[22:27] <ShaunR> If we did this for a production setup i think we'll be going with the newer cards anyway
[22:27] <infernix> ShaunR: you'll want to be careful with it though
[22:27] <janos> refurb is ok with ceph - ceph was made with the knowledge that hardware failure IS an option!
[22:27] <infernix> with 2 or 3 way replication you're gonna have a bad time two or three times as fast
[22:27] <janos> ;)
[22:28] <ShaunR> infernix: uhh, not sure i follow?
[22:28] <infernix> well you're writing to two or three cards, and if the log resides on them, two times for every write
[22:28] <ShaunR> janos: well thats kind of what i was thinking too... if they fail, then ceph should live and we can just replace them..
[22:29] <infernix> so basically quite a bit of write amplification
[22:29] <janos> yeah. esp. for a test environment - failure is a good test to run through
[22:29] <ShaunR> hell, we've had plenty of SSD's brick on us here... not like those are bullet proof
[22:31] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[22:31] <ShaunR> infernix: i see, you're still referring to how many writes the nand will take
[22:31] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[22:31] <ShaunR> i wish the fusionIO cards wernt thousands of dollars :)
[22:33] <infernix> assuming they have 512kb flash nands and you can do 5000 cycles per nand, you can write ((((240*1024*1024)/512)*5000)*512)/1024/1024/1024 = 1171TB before you see failures
[22:34] <infernix> but my late night math could be way off
[22:34] <infernix> :>
[22:34] <infernix> and that assumes ideal write distribution
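The back-of-the-envelope estimate above works out as follows; the 512 KB erase-block size, 5000 P/E cycles, and perfectly even wear leveling are all assumptions from the conversation, not vendor numbers:

```python
# NAND endurance estimate for a 240 GB card, under assumed parameters.
capacity_kb = 240 * 1024 * 1024   # card capacity in KB
block_kb = 512                    # assumed erase-block size
cycles = 5000                     # assumed P/E cycles per block

blocks = capacity_kb // block_kb          # number of erase blocks
total_tb = blocks * cycles * block_kb / 1024**3  # total writable KB -> TB

print(round(total_tb))  # prints 1172
```

Note the chain of multiplications cancels down to simply capacity x cycles (240 GB x 5000 = ~1172 TB), which is why the block size drops out under ideal wear leveling.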
[22:35] <ShaunR> Maybe i'll give ocz a call and see if they can shed some light
[22:35] <infernix> but excludes the 16gb space capacity
[22:35] <infernix> *spare
[22:35] <ShaunR> I wonder if the device keeps track of that sort of thing
[22:36] <infernix> it does but the quality of wear leveling has greatly improved in recent years
[22:36] <infernix> look at the controller specs
[22:36] <ShaunR> do you know how i would access that data, smart maybe?
[22:37] <infernix> some devices support smart readouts
[22:37] <ShaunR> i'll attempt to kill one of these, doesnt really matter to me :)
[22:37] <infernix> but just google the type of sandforce controller and look up how they did wear leveling in that controller
[22:37] <infernix> for what it's worth we run with eMLC only
[22:37] * Cube (~Cube@ Quit (Quit: Leaving.)
[22:37] <ShaunR> infernix: who's we?
[22:38] * Cube (~Cube@ has joined #ceph
[22:38] <infernix> company i do work for
[22:38] * Cube (~Cube@ Quit ()
[22:38] * Cube (~Cube@ has joined #ceph
[22:38] <ShaunR> last i looked the sandforce chip was LSI, if thats what your asking
[22:38] * Cube (~Cube@ Quit ()
[22:39] * Cube (~Cube@ has joined #ceph
[22:39] <infernix> look at the technology they put in there, and compare to, say, intels most recent
[22:39] <infernix> the s3700 is really the first MLC drive i'd consider worthy of production in servers, assuming their endurance numbers are correct
[22:40] <sstan> missed part of the conversation; so did you guys figure out what's the best ssd solution?
[22:40] * hybrid5121 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[22:41] <ShaunR> On the revodrive 3x2 the sandforce chip has the numbers sf-2281vb1-sdc on it
[22:41] * Cube (~Cube@ Quit ()
[22:41] * Cube (~Cube@ has joined #ceph
[22:42] <nhm> infernix: the S3700 looks like a great product.
[22:42] <lurbs> Yeah, the Intel S3700 would be my pick at the moment.
[22:42] <infernix> for the price it is
[22:42] <infernix> there is better but at higher cost
[22:43] <lurbs> There always is. :)
[22:43] <infernix> significantly better though
[22:43] <nhm> I'm hoping the price on the 200-400GB models comes down some.
[22:43] <ShaunR> here's the specs on it...
[22:43] <ShaunR> http://www.ocztechnology.com/ocz-revodrive-x2-pci-express-ssd-eol.html
[22:43] <infernix> i've been playing with the idea of building a box with 16x800gb in raid 10
[22:43] * markbby (~Adium@ Quit (Quit: Leaving.)
[22:44] <infernix> ceph is nice but it would not be fast enough to keep up
[22:44] <infernix> least not yet
[22:45] <infernix> ideal circumstances that'd yield 6.4tb at 600k read iops and 290k write iops
[22:45] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[22:45] <infernix> i don't know if raid 6 would be better. could be.
[22:46] <ShaunR> infernix: NAND components say Multi-Level Cell (MLC)
[22:47] <nhm> infernix: what networking?
[22:47] <ShaunR> 4 x sandforce 1222 controllers
[22:47] <infernix> ShaunR: yeah, so the controller plays the biggest part in endurance. if it requires TRIM to do it, then you better use a filesystem that TRIMs or you'll kill them faster. it should all be documented in the specs for that controller, or for drives that use similar
[22:47] <infernix> nhm: infiniband, srp
[22:48] <infernix> probably some overhead but even if its 50k overhead that's still plenty IOPS
[22:48] * eschnou (~eschnou@234.90-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[22:48] <nhm> infernix: still need to test rsockets. Gotta see if I can get Mellanox to send me some ConnectX3 cards.
[22:48] <ShaunR> ECC, 27 bytes of redundancy per 512 bytes data. Up to twelve 9-bit symbols correctable
[22:48] <infernix> nhm: if you code it i'll send you some
[22:48] <infernix> :>
[22:48] <ShaunR> not sure what that means
[22:48] <infernix> nhm: do you have a switch though?
[22:48] <nhm> infernix: I figure direct connect the QSFP+
[22:48] * dmner (~tra26@tux64-13.cs.drexel.edu) Quit (Ping timeout: 480 seconds)
[22:48] <infernix> hm
[22:49] <phantomcircuit> infernix, problem with TRIM is that it clears the command queue
[22:49] <nhm> infernix: it's what I do for bonded 10GbE right now.
[22:49] <phantomcircuit> it has much worse performance effects than a simple flush
[22:49] <infernix> phantomcircuit: which is why i'd prefer the hardware to not care.
[22:49] <infernix> it's better overall, and it's what we use
[22:50] <infernix> but consumer controllers, especially older ones, really do benefit from trim over time
[22:50] <ShaunR> owww, here's something i overlooked... Looks like that older card only has windows drivers
[22:51] <infernix> nhm: i probably have two spare connectx 2 cards
[22:51] <ShaunR> I may have to do my testing with the newer cards
[22:51] <infernix> if you're able to dedicate the time i can probably ship them
[22:52] <nhm> infernix: Not sure how the time plays out unfortunately. :/
[22:52] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[22:52] * leseb (~leseb@ Quit (Remote host closed the connection)
[22:52] <nhm> infernix: roadmaps and such.
[22:52] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:52] <infernix> nhm: ping or mail me when you do then
[22:53] <nhm> infernix: Will do. Hopefully we can at least try to get rsockets working.
[22:53] <infernix> they're in use as secondary cards in some of my test boxes, but i can live with 1
[22:53] <nhm> infernix: RDMA is probably a while out still.
[22:53] <infernix> will happily ship them once you have time to use them
[22:59] <ShaunR> in the ceph conf, when giving each osd an id.. you should always use numbers right? so osd.1, osd.2, etc..? Nobody does like osd.a or osd.joe-1 or anything like that right?
[23:00] * garyy (~gary@pool-72-90-77-216.syrcny.fios.verizon.net) Quit (Quit: Leaving)
[23:01] <infernix> nhm: there's some MHQH19B-XTR on ebay for like $120
[23:01] <infernix> qrd cable should go for around $30
[23:01] <infernix> *qdr
[23:02] <sstan> ShaunR : ceph-osd create creates osds and assigns them numbers
[23:03] <sstan> I don't think there's a way to ceph-osd create an arbitrarily named osd
[23:03] <ShaunR> ok, just checking, i'm going to fix an annoying problem that probably just bothers me with ceph-conf :)
[23:04] <nhm> infernix: nice, any idea if those can do 40GbE too?
[23:04] <infernix> they won't
[23:04] <infernix> and EoIB needs hardware
[23:04] * sagelap (~sage@2600:1010:b11f:163c:492:e6c9:797:ff5b) has joined #ceph
[23:04] * alram (~alram@ has joined #ceph
[23:04] <sagelap> slang: there?
[23:05] <slang> sagelap: yep
[23:05] <infernix> nhm: http://bit.ly/12s3LCr 10+ available
[23:05] <sagelap> (or anyone else interested in teuthology architectures vs installs)
[23:05] <sagelap> i'm inclined to move the install bits in ceph-fuse back into install.py
[23:05] <sagelap> it's problematic now because we aren't picking the right branch to install.
[23:05] <sagelap> making it look at the ceph task's config to see which branch/sha1 is ugly and hacky to implement..
[23:06] <sagelap> and adding fields to ceph-fuse task config means restructuring that (tho that may be a good idea anyway)
[23:06] <infernix> update the firmware on them (with the mellanox tools, not distro firmware tools, killed a card with that) and you're good to go
[23:06] <sagelap> but we are moving toward having a ceph-deploy and possibly other install/configure methods (like chef), so i'm not sure hardcoding a particular install method is the right thing to do here. i.e., maybe it is an existing install that we are testing against.
[23:07] <nhm> infernix: I might hold out and see if I can get Mellanox to give me a pair of X3s. ;)
[23:07] <sagelap> ..but looking for other opinions!
[23:08] <sagelap> i'm thinking the install task can take arguments like flavor, specific package names, etc., and those can be used to build appropriate job descriptions (or not, if something other than install.py is going to be used)
[23:08] <slang> sagelap: what about setting the branch/sha1 in the ceph-fuse task config
[23:08] <sagelap> i started to implement that.. it would work, but then what happens when we use ceph-deploy to set up the cluster?
[23:08] <slang> sagelap: otherwise we can't test different versions between ceph-fuse and mds
[23:08] <sagelap> then ceph-fuse.py is calling into install.py and maybe stepping on apt config rules
[23:09] <sagelap> i'm thinking if we want different versions, we do something like
[23:09] <infernix> nhm: afaik the 40gbe parts are not the same as IB parts. you either get an IB card or an EN card
[23:09] <sagelap> install:
[23:09] <sagelap> branch: foo
[23:09] <sagelap> install.something:
[23:09] <sagelap> packages: [ceph-fuse]
[23:09] <sagelap> branch: bar
[23:09] <infernix> it's just the interconnect that's the same (QSFP+)
[23:09] <sagelap> or who knows what.
[23:09] <sagelap> then ceph-fuse.py is concerned *just* with running ceph-fuse, and something else is responsible for having (some version of it) installed
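Collected into one place, the YAML sketch Sage spells out above would look something like this (install.something, foo, and bar are placeholders from the discussion, not real task or branch names):

```yaml
install:
  branch: foo
install.something:
  packages: [ceph-fuse]
  branch: bar
```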
[23:10] <slang> sagelap: yeah
[23:10] <slang> sagelap: seems reasonable
[23:10] <sagelap> someday soon maybe it will be chef.py :)
[23:10] <joshd1> sagelap: I think it makes sense to separate from ceph-fuse, but the explicit package list seems too low level
[23:10] <sagelap> yeah, esp since package names vary between distros
[23:11] <sagelap> not sure what the best approach there is.
[23:11] <infernix> nhm: e.g. MCX313A-BCBT Single 40GbE QSFP, or the one we have which is the MCX353A-FCBT Single FDR 56Gb/s or 40GbE. huh. well ill be.
[23:11] <sagelap> in the meantime, though, ceph-fuse is always running against master, which is not ideal :)
[23:11] <sagelap> i'll start with this tho, and add ceph-fuse to install.py's default list
[23:12] <joshd1> sagelap: maybe do what the kernel task does, and make branch/sha1/etc configurable per role
[23:12] <nhm> infernix: yeah, I thought I remembered reading you could mix-and-match.
[23:12] <sagelap> as part of install.py?
[23:12] <sagelap> yeah
[23:12] <joshd1> but for now, just installing ceph-fuse on all client nodes makes sense too
[23:12] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[23:12] <infernix> nhm: actually we have the qdr which is only good for 10GbE. i'll be damned if I know how to make it do 10GbE though, you probably need the mellanox proprietary drivers for it - which aren't out for Ubuntu, only RHEL and derivatives
[23:12] <sagelap> the other oddity is that currently the branch/sha1/etc are ceph task properties instead of install task properties.. that should be changed too, i think.
[23:13] <sagelap> it works ok now only because install.py is looking at the ceph overrides
[23:13] <joshd1> yeah
[23:13] <sagelap> for that i'd like to see where the rpm/yum stuff is going before settling on the syntax etc for install.py
[23:15] <sagelap> and we can revamp/simplify the flavor stuff, too. i think it should all be explicit overrides for install.py instead of magically detected via the presence of, say, valgrind args.
[23:15] <joshd1> we could just install everything a client could need on client nodes all the time
[23:15] <joshd1> yeah, flavor: valgrind would be better
[23:15] <infernix> nhm: looks like it is autosensing based on what it's plugged into, you might be able to force it to ethernet
[23:15] <sagelap> in the meantime, this is less broken than before
[23:15] <nhm> infernix: crazy
[23:15] * ScOut3R (~scout3r@1F2EAE22.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[23:16] <joshd1> instead of just client.* for everything, we could specialize roles more if we wanted to only install e.g. java sometimes
[23:16] <sagelap> yeah
[23:17] <sagelap> on a related note, i was telling tamil: we should move the client.* key creation and install out of ceph.py, and make a generic teuthology.something helper that will create/get a key for client.foo and install on a given remote
[23:18] <sagelap> and call that method from any task that needs it. then ceph-deploy (and ceph.py) don't need to worry about key creation and install
[23:18] <sagelap> currently the ceph-deploy task won't work against ceph-fuse etc.
[23:18] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:19] <joshd1> yeah, other things could be extracted from the ceph task too
[23:19] <sagelap> joshd1: .. unles you have a better idea? :)
[23:19] <sagelap> yeah
[23:19] <joshd1> the cluster() method in particular has about 10 different things in it
[23:21] <infernix> so a generic question; i'll be writing many TBs daily to ceph (40-60TB range) in the form of rbd devices which are backups. is it better to a) delete the rbd disk images and write to a new one, or b) overwrite data on an existing rbd disk image?
[23:21] <infernix> does it even matter to ceph?
[23:25] <joshd1> infernix: doesn't really matter, overwrite uses less space and might be a tiny bit more efficient
[23:26] * bmjason (~bmjason@static-108-44-155-130.clppva.fios.verizon.net) Quit (Quit: Leaving.)
[23:27] * dmner (~tra26@tux64-10.cs.drexel.edu) has joined #ceph
[23:28] <dmner> So running bonnie on an rbd is causing the client to kernel panic, anyone experience anything similar?
[23:29] <dmner> on a side note, should I email about it to the users list and the devel or just users or what
[23:31] <gregaf> dmner: ceph-users is good, although there's still some overlap as we try and find the right balance as a community :)
[23:32] <scuttlemonkey> dmner: wrt lists, troubleshooting I usually say ceph-user, unless it's an identified bug that the devs are gathering details on
[23:32] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[23:33] <dmner> scuttlemonkey, gregaf that was my original thought as well, but then again hosing a system seems serious to me; wasn't sure of the line between the two
[23:33] * miroslav (~miroslav@ Quit (Quit: Leaving.)
[23:34] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:35] <scuttlemonkey> dmner: yeah, the same people on our end read both lists...I think it's just a question of focus for future archival
[23:36] <scuttlemonkey> if it's a ceph bug someone will probably /cc the dev list at some point
[23:36] <scuttlemonkey> appreciate the asking though!
[23:36] <dmner> scuttlemonkey: thanks for the info, didn't want to bother the devs too early in the process
[23:37] <dmner> now off to send that email
[23:37] <scuttlemonkey> hehe, cool
[23:40] * miroslav (~miroslav@ has joined #ceph
[23:41] <MrNPP> what's the recommended driver to use with libvirt/qemu? virtio?
[23:42] <lurbs> virtio doesn't (yet?) support passing discard through, but other than that it's my preference.
[23:43] <MrNPP> so i'm running 6 osd's and started a vm, did a 16gig dd test and came up with 30mb write, and 96 read
[23:43] <MrNPP> seems pretty low
[23:44] <dmick> 30 millibits is pretty slow
[23:47] <Vjarjadian> might be caused by the writing of multiple copies?
[23:48] <MrNPP> hmmm, i have a replica level of 2
[23:50] <nhm> MrNPP: try doing a couple of them at once.
[23:50] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) Quit (Ping timeout: 480 seconds)
[23:50] <MrNPP> a few dd tests?
[23:50] <nhm> MrNPP: yeah, see if you get any scaling effects.
[23:51] <MrNPP> i also ran into an issue with deleting, took about 15 minutes to delete a 100gb image
[23:51] <nhm> MrNPP: also, what size, and direct IO or buffered?
[23:51] <MrNPP> pgmap v1491: 1472 pgs: 1472 active+clean; 112 MB data, 523 MB used, 5023 GB / 5024 GB avail
[23:52] <nhm> sorry, I meant what size IOs?
[23:52] <MrNPP> bs=8k
[23:52] <MrNPP> 16gb total
[23:52] <MrNPP> i think i'm using buffered as i don't have anything in my ceph config for direct
[23:53] <nhm> how fast is it with bs=4M?
[23:53] <MrNPP> let me test
[23:53] <Qten> So I had this crazy idea to run 2 ceph clusters using RBD (one at each dc) then running a VM at each DC with ceph inside a VM for Rados/object files access which then would be replicated across two sites?
[23:53] <nhm> also, make sure to use conv=fdatasync
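Put together, the suggested test would look something like this; TARGET is a placeholder and should point at a file on the RBD-backed filesystem, and a real run would use a much larger count:

```shell
# dd write benchmark along the lines suggested above.
TARGET=${TARGET:-/tmp/ceph-dd-test}
# 4 MB blocks, 16 MB total here (scale count up for a real test).
# conv=fdatasync makes dd flush to disk before reporting throughput,
# so the number reflects actual write-out, not just the page cache.
dd if=/dev/zero of="$TARGET" bs=4M count=4 conv=fdatasync
```

The fdatasync matters because a buffered dd without it mostly measures how fast the kernel can absorb dirty pages, which is why small buffered runs can look unrealistically fast.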
[23:54] <Qten> too crazy or maybe an easier way to do it? :)
[23:55] <sstan> Qten : http://www.sebastien-han.fr/blog/categories/ceph/
[23:55] <Vjarjadian> Qten, so basically using Ceph as a raid 1?
[23:55] <MrNPP> nhm: ok, testing that now
[23:56] <Qten> Vjarjadian: yep
[23:56] <Vjarjadian> everything i've seen says that Ceph is for LAN use at the moment... unless you have 'scary' fast WAN...
[23:56] <Vjarjadian> and chances are... you would get the same effect with more replicas on one cluster
[23:57] <Qten> apart from act of god
[23:57] <Qten> so to speak
[23:57] <Qten> sstan: i've seen the GEO rep idea looks pretty cool
[23:58] <Vjarjadian> act of god? you mean some unknown being saying 'Let this guy's ceph cluster blow up'
[23:58] <Qten> Vjarjadian: fire flood etc who knows
[23:58] <sstan> meteor
[23:58] <sstan> sinkhole
[23:58] <Qten> sstan: well thats becoming more common ;)
[23:59] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[23:59] <Vjarjadian> more replicas...
[23:59] <Qten> i don't want to go climbing into a sink hole or meteor crater to pick out servers

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.