#ceph IRC Log

IRC Log for 2012-01-03

Timestamps are in GMT/BST.

[0:28] * jojy (~jvarghese@108.60.121.114) Quit (Quit: jojy)
[1:07] <s[X]> Are there any decent tutorials around for Ceph? I've been reading the wiki and having a bash around on a dev machine, but I'm not having much luck.
[1:32] <dwm_> mgalkiewicz: Periodic releases tend to occur every few weeks.
[2:27] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[3:25] * ottod (~ANONYMOUS@li127-75.members.linode.com) has joined #ceph
[3:45] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[3:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:43] * mgalkiewicz (~maciej.ga@85.89.186.247) Quit (Quit: Ex-Chat)
[8:58] * MarkDude (~MT@c-67-170-237-59.hsd1.ca.comcast.net) has joined #ceph
[9:04] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) Quit (Remote host closed the connection)
[9:17] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[9:21] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[9:44] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[9:47] * MarkDude (~MT@c-67-170-237-59.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[10:44] <s[X]> hey all
[11:06] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[11:35] <dwm_> Re: http://tracker.newdream.net/issues/1759 -- I've seen there have been some updates to track this down. Should I fast-forward to latest master and retry?
[11:35] <dwm_> (Which might be a bit sticky, as the cluster has offlined itself as it ran out of space.)
[11:45] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) has joined #ceph
[12:11] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Ping timeout: 480 seconds)
[12:24] * tjikkun (~tjikkun@82-169-255-84.ip.telfort.nl) has joined #ceph
[12:42] <s[X]> hey all, having a bit of trouble compiling ceph on centos6. anybody have any idea why, when i execute git checkout -b rc origin/rc, i get "fatal: git checkout: updating paths is incompatible with switching branches."?
[12:49] <wonko_be> seems more like a git problem
[12:49] <wonko_be> would be git checkout --track -b rc origin/rc
[12:50] <s[X]> do u need to switch to the rc branch ?
[12:51] <s[X]> i went ahead and ran the compile without switching and so far it seems to be okay
[12:51] <wonko_be> afaik not
[12:51] <s[X]> mmm
[12:51] <wonko_be> rc is release candidate afaik
[12:51] <s[X]> I'm very, very new to compiling etc., so excuse the ignorance
[12:52] <s[X]> Want to play with Ceph / Btrfs to replace my media storage because I'm sick to death of Windows
[12:52] <wonko_be> if compiling is a problem, switch to ubuntu or debian and use the prepackaged debs
[12:53] <s[X]> yeah, might have to; will see how far i get with this
[12:53] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:53] <wonko_be> i'm not into centos/fedora, but there are some instructions here: http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/
[12:53] <wonko_be> note that there is a newer version than .35 available
[12:54] <s[X]> what do u run ?
[12:54] <wonko_be> debian
[12:54] <s[X]> mmm might give it a whirl
[12:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:55] <fghaas> s[X]: it seems you're mixing up git checkout -b with git branch --track
[12:56] <s[X]> thanks fghaas, i was just following the instructions on http://ceph.newdream.net/wiki/Installing_on_RedHat_or_CentOS
[12:57] <wonko_be> take my link, it has newer info I think
[13:00] <s[X]> thanks wonko_be i think ima go down the debian route
[13:00] <s[X]> Just for simplicity
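For the `git checkout -b rc origin/rc` error discussed above, here is a minimal sketch of the form wonko_be suggested. It assumes the usual cause of that message, namely that `origin/rc` was not yet known as a remote-tracking ref in the clone, so git treated it as a path. The throwaway `upstream` and `clone` repositories and the demo identity are made up for illustration; only the `rc` branch name and the final checkout command come from the conversation.

```shell
#!/bin/sh
# Sketch only: builds a throwaway "upstream" repo with an rc branch,
# clones it, and checks out a local rc branch that tracks origin/rc.
set -eu

work=$(mktemp -d)

# Hypothetical upstream repository standing in for the real ceph repo.
git init -q "$work/upstream"
cd "$work/upstream"
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial commit"
git branch rc

# Clone it, then make sure the remote-tracking refs are up to date.
git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"
git fetch -q origin

# List remote branches; origin/rc must appear here before it can be
# used as a start point.
git branch -r

# Create a local rc branch tracking origin/rc and switch to it.
git checkout -q --track -b rc origin/rc

# Show the branch now checked out.
git rev-parse --abbrev-ref HEAD
```

If `origin/rc` is missing from the `git branch -r` listing in a real clone, the branch may simply not exist upstream, which would fit wonko_be's remark that switching to rc is not needed.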
[13:01] <s[X]> How do you guys find using ceph? fairly stable?
[13:01] <wonko_be> euh... sometimes
[13:01] <wonko_be> don't put data on it you don't want to lose
[13:01] <s[X]> Yeah, i won't for now
[13:01] <dwm_> s[X]: Unless you've got spare hardware to use for experiments, I wouldn't try using it for your media library just yet.
[13:02] <s[X]> dwm_: i was going to pick up a few HP MicroServers and throw some 2TB drives into them and use them to play with
[13:02] <s[X]> Can never have enough Hard Drives
[13:02] <s[X]> :P
[13:03] <s[X]> I was just curious: for redundancy, can you specify how you want the failover to act?
[13:03] <s[X]> ie: if you had 2 OSDs, each with 2 x 2TB drives
[13:03] <wonko_be> you can say where the redundant copy should be placed
[13:04] <s[X]> and one OSD failed
[13:04] <wonko_be> you can force it to not be on the same osd, or the same host, or in the same rack...
[13:04] <dwm_> s[X]: In general, you specify policy, and Ceph works out some suitable layout.
[13:04] <s[X]> ah ok
[13:05] <dwm_> So you can say, "keep three copies of everything, at least two copies on different machines" or similar.
[13:05] <s[X]> Does it keep entire copies, or is it possible to keep two copies + parity?
[13:05] <dwm_> No, Ceph works with replicas.
[13:06] <dwm_> However, you can give Ceph MD arrays to work with.
[13:06] <dwm_> So you could configure a RAID6 array on a host, and then give that over to Ceph to use.
[13:07] <s[X]> The reason for the whole plunge into this is that I've got a duplicated drive pool in Windows atm. The master copy became corrupt without the system realising, and I had to manually go in and pull the duplicate copy, remove the master, and then recopy the duplicate back in for replication
[13:08] <fghaas> s[X]: so all you want at this point is some storage replicated over two hosts?
[13:08] <s[X]> yup pretty much
[13:09] <fghaas> easy option is iscsi with drbd then
[13:09] <wonko_be> i had some good experience with MooseFS on that matter, exported as SMB drive
[13:09] <fghaas> (OT for this channel, I realize that, but just throwing that in)
[13:09] <wonko_be> fghaas: help is help :-)
[13:09] <s[X]> nah, i appreciate it
[13:09] <s[X]> The reason i liked ceph is the ability to expand
[13:12] <s[X]> I might be wrong, but with 2 OSDs what's the chance of one being corrupted?
[13:12] <wonko_be> the question today would be... what is the chance of hitting a bug / feature
[13:14] <s[X]> I understand it's still in its infancy, but i thought Ceph was similar to ZFS in preventing corruption of data
[13:15] <s[X]> maybe I'm wrong
[13:16] <s[X]> This is obviously with the intention of using Btrfs with Ceph
[13:16] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 481 seconds)
[13:19] * gregorg (~Greg@78.155.152.6) has joined #ceph
[13:19] * gregorg_taf (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[13:22] * gregorg_taf (~Greg@78.155.152.6) has joined #ceph
[13:22] * gregorg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[13:32] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:45] * elder (~elder@c-71-193-71-178.hsd1.mn.comcast.net) has joined #ceph
[14:16] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[14:29] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:41] <dwm_> Right, time to try out latest HEAD to see if I can advance http://tracker.newdream.net/issues/1759 at all..
[16:11] <dwm_> Drat, first pass didn't work because I didn't manage to make git fast-forward properly. Let's try that package build again..
[16:12] * stickith (~mmeuleman@206.83.236.154) has joined #ceph
[16:15] <dwm_> (Though thank you for including commit ids in crashlogs.)
[16:32] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[16:32] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Read error: Connection reset by peer)
[16:34] * stickith (~mmeuleman@206.83.236.154) has left #ceph
[16:41] <dwm_> Yeah, I think I'm going to have to blow my old ceph cluster away and reinitialize -- it's hit ENOSPC and won't recover.
[16:53] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[17:03] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[17:21] * aliguori (~anthony@32.97.110.59) has joined #ceph
[17:46] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:09] * aliguori (~anthony@32.97.110.59) Quit (synthon.oftc.net oxygen.oftc.net)
[18:09] * elder (~elder@c-71-193-71-178.hsd1.mn.comcast.net) Quit (synthon.oftc.net oxygen.oftc.net)
[18:09] * sage (~sage@76.89.180.250) Quit (synthon.oftc.net oxygen.oftc.net)
[18:09] * Meths (rift@2.25.193.184) Quit (synthon.oftc.net oxygen.oftc.net)
[18:09] * edwardw (~edward@ec2-50-19-100-56.compute-1.amazonaws.com) Quit (synthon.oftc.net oxygen.oftc.net)
[18:10] * sage (~sage@76.89.180.250) has joined #ceph
[18:10] * aliguori (~anthony@32.97.110.59) has joined #ceph
[18:10] * elder (~elder@c-71-193-71-178.hsd1.mn.comcast.net) has joined #ceph
[18:10] * Meths (rift@2.25.193.184) has joined #ceph
[18:10] * edwardw (~edward@ec2-50-19-100-56.compute-1.amazonaws.com) has joined #ceph
[18:19] * MarkDude (~MT@c-67-170-237-59.hsd1.ca.comcast.net) has joined #ceph
[18:28] * BManojlovic (~steki@212.200.241.4) has joined #ceph
[18:36] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:54] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:02] * jojy (~jvarghese@108.60.121.114) has joined #ceph
[19:02] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[19:03] <sage> working from home today (sick)
[19:03] <sage> i'll skype in for the standup
[19:04] <gregaf> k, we'll call you
[19:15] * MarkDude (~MT@c-67-170-237-59.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[19:28] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[19:29] * _Tassadar (~tassadar@tassadar.xs4all.nl) Quit (Ping timeout: 480 seconds)
[19:31] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[19:33] * _Tassadar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[19:43] * steki-BLAH (~steki@212.200.243.100) has joined #ceph
[19:48] * slang (~slang@chml01.drwholdings.com) Quit (Quit: Leaving.)
[19:50] * BManojlovic (~steki@212.200.241.4) Quit (Ping timeout: 480 seconds)
[19:58] * MarkDude (~MT@64.134.225.94) has joined #ceph
[20:29] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[20:34] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:44] * steki-BLAH (~steki@212.200.243.100) Quit (Remote host closed the connection)
[21:01] <dwm_> Hmm, interesting test-case that seems to regularly, quickly fail: git gc on a bare repo.
[21:02] <dwm_> kernel mount, fresh ceph cluster (which has had some git repositories copied into it, as well as a few 10s of GB of random source trees rsync'd into it separately).
[21:03] <dwm_> `git gc` starts running, returns Bus error, while the kernel reports: libceph: get_reply unknown tid <ID> from osdN.
[21:03] <dwm_> Seems to be non-deterministic.
[21:06] <dwm_> Oh wow, I seem to have just kicked ceph-mds into a memory-inhalation loop..
[21:06] <joshd> dwm_: sounds like you found a good test case
[21:07] * MarkDude (~MT@64.134.225.94) Quit (Quit: Leaving)
[21:08] <joshd> are the unknown tids weird numbers, or are they sequential?
[21:10] <dwm_> The two I just had are sequential; recently rebooted client (3.1.5), tid 558, 559.
[21:11] <dwm_> ... and it seems things are slowly unblocking.
[21:11] <dwm_> (Terminal blocked in disk-wait while I was trying to spawn a new window with `pwd` in the ceph mount.)
[21:14] <dwm_> Right, I'm going to have to chase this later -- it's past 2000hrs here.
[21:17] <joshd> ok, if you can get osd logs for the unknown tids and put them in a bug later that'd be great
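To collect what joshd asks for, the tid and OSD number can be pulled out of the client's kernel log first. The sketch below matches the message form dwm_ quoted ("libceph: get_reply unknown tid <ID> from osdN"). The function name `extract_unknown_tids` and the bracketed timestamps in the sample lines are hypothetical, made up for illustration.

```shell
#!/bin/sh
# Sketch only: print "osd.N tid ID" for each kernel log line of the form
#   libceph: get_reply unknown tid 559 from osd3
# Reads log lines on stdin; non-matching lines are dropped.
extract_unknown_tids() {
    sed -n 's/.*libceph: get_reply unknown tid \([0-9][0-9]*\) from osd\([0-9][0-9]*\).*/osd.\2 tid \1/p'
}

# Illustrative sample input (the timestamps are invented).
printf '%s\n' \
    '[ 1234.5] libceph: get_reply unknown tid 558 from osd3' \
    '[ 1234.6] some unrelated kernel message' \
    '[ 1240.1] libceph: get_reply unknown tid 559 from osd3' \
    | extract_unknown_tids
```

On a live client the same function could be fed from the kernel ring buffer with `dmesg | extract_unknown_tids`, and the resulting list tells you which OSD logs to look at for each tid.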
[21:25] <bugoff> :32
[22:40] * s[X]_ (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[22:47] <fghaas> wonder if dwm_ is seeing that problem on opensuse 12.1 ... has given me these "bus error" heisenbugs rather frequently lately
[22:48] <dwm_> No, I'm sitting on Ubuntu 10.04 with a linux-stable kernel.
[22:51] <dwm_> Details of cluster are as in tail of http://tracker.newdream.net/issues/1759, except it's now running master HEAD as of earlier today (commit: a1252463)
[22:52] <dwm_> Huh; since I last looked at it, osd.0 has crashed and my ceph-mds process is taking up 80% of the machine's RAM..
[22:52] <fghaas> dwm_: real hardware or virtual box?
[22:52] <dwm_> Real hardware, though old.
[22:53] <dwm_> Just 2GB of physical RAM, 64-bit processor but not hardware VM capable.
[22:54] <fghaas> uname -r ?
[22:55] <dwm_> 3.1.5, upstream stable.
[22:56] <joshd> dwm_: what's the osd backtrace?
[22:56] <fghaas> I'm on 3.1.0-1.2-default, producing occasional random bus errors when _building_ ceph... NFI if these issues are in fact related, but at least it's the same kernel minor release and a bus error
[22:56] <dwm_> fghaas: Are you building ceph inside a ceph FS?
[22:56] * s[X] (~sX]@eth589.qld.adsl.internode.on.net) has joined #ceph
[22:57] <dwm_> fghaas: I'm seeing these bus-errors directly corresponding to transaction errors on the ceph client.
[22:57] <fghaas> dwm_: no sir, my build env is a boring old ext3
[22:57] <dwm_> ('Bus error' is what's being printed to the terminal running the git tools, at any rate.)
[22:57] <s[X]> hey fghaas, hey dwm_
[22:57] <fghaas> yup, that's what I'm getting from gcc
[22:57] <dwm_> fghaas: Sounds orthogonal, then.
[22:58] <dwm_> In my case, each instance of a bus error directly corresponds to a dmesg entry like: "libceph: get_reply unknown tid 559 from osd3"
[23:01] <dwm_> There don't seem to be any directly corresponding entries in the logs on osd3
[23:02] <dwm_> fghaas: My natural inclination is to think either software bug or hardware error.
[23:02] <dwm_> fghaas: If you're using ccache and/or distcc, might be worth disabling one or t'other to see if that makes a difference.
[23:02] <fghaas> dwm_: what I was suggesting is that we may be running into a related kernel issue that ceph can't do jack about
[23:03] <dwm_> fghaas: I have seen similar issues on previous versions of the kernel.
[23:03] <fghaas> looking into my dmesg suggests that it was GNU as randomly GPing
[23:03] <dwm_> If memory serves, I was also seeing similar results on 2.6.38.2.
[23:03] <fghaas> yay :)
[23:04] <dwm_> Hmm: "journal throttle: waited for ops" -- denotes that journal FS can't keep up?
[23:08] <joshd> dwm_: yeah
[23:09] <dwm_> Hmm, possibly not too surprising. Was throwing a lot of rsync traffic at it, and it's only got a pair of SATA disks.
[23:09] <dwm_> (One for data, one for OS+journal)
[23:10] <dwm_> Certainly explains why I was seeing performance stalls on the client. :-)
[23:12] <gregaf> in fairness, it might also just mean that you should adjust the number of in-flight ops that are allowed; we don't really know what the proper limit is for your hardware
[23:12] <dwm_> Ah, that's a tuneable?
[23:13] <dwm_> (I'm also seeing 'waited for bytes' in addition to ops.)
[23:13] <joshd> dwm_: you can change journal_queue_max_ops and journal_queue_max_bytes
[23:14] <dwm_> Right, will investigate that tomorrow.
[23:14] <dwm_> (It's about 2215hrs local-time!)
[23:18] <gregaf> but if you're seeing waited for bytes that's probably a good indicator, since that means it's waiting for actual journal disk space (whereas ops is really a pretty arbitrary limit)
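Following gregaf's point that "waited for bytes" is the more telling of the two messages, a rough way to compare them is to count each in an OSD log. This is a sketch under assumptions: the "waited for ops" string is quoted from dwm_ above, the "waited for bytes" string is assumed to follow the same pattern, and the function name and sample file are hypothetical.

```shell
#!/bin/sh
# Sketch only: count the two journal throttle messages in one log file.
set -eu

count_journal_waits() {
    # $1: path to an OSD log file.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    ops=$(grep -c 'journal throttle: waited for ops' "$1" || true)
    bytes=$(grep -c 'journal throttle: waited for bytes' "$1" || true)
    printf 'waited_for_ops=%s waited_for_bytes=%s\n' "$ops" "$bytes"
}

# Illustrative sample log (contents invented for the demo).
sample_log=$(mktemp)
printf '%s\n' \
    'journal throttle: waited for ops' \
    'journal throttle: waited for bytes' \
    'journal throttle: waited for ops' \
    > "$sample_log"

count_journal_waits "$sample_log"   # prints: waited_for_ops=2 waited_for_bytes=1
rm -f "$sample_log"
```

If the bytes count dominates, raising journal_queue_max_bytes (or using a faster journal device) looks like the first thing to try; if only the ops count is high, journal_queue_max_ops is the limit gregaf describes as fairly arbitrary.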
[23:45] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.