#ceph IRC Log


IRC Log for 2010-09-29

Timestamps are in GMT/BST.

[0:06] <cmccabe> so what's the usual procedure for running a test on a cluster
[0:07] <cmccabe> it begins with autogen, configure, make, make install. Then what?
[0:19] <cmccabe> ok I think I figured this out
[2:56] * greglap (~Adium@ has joined #ceph
[3:40] * greglap1 (~Adium@ has joined #ceph
[3:48] * todinini (tuxadero@kudu.in-berlin.de) Quit (Read error: Connection reset by peer)
[3:48] * greglap (~Adium@ Quit (Ping timeout: 480 seconds)
[3:50] * todinini (tuxadero@kudu.in-berlin.de) has joined #ceph
[3:53] * greglap1 (~Adium@ Quit (Quit: Leaving.)
[4:11] * todinini_ (tuxadero@kudu.in-berlin.de) has joined #ceph
[4:11] * todinini (tuxadero@kudu.in-berlin.de) Quit (Read error: Connection reset by peer)
[4:33] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has left #ceph
[5:49] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[5:49] * gregorg (~Greg@ has joined #ceph
[5:50] * jantje_ (~jan@paranoid.nl) has joined #ceph
[5:53] * alexxy (~alexxy@ Quit (Read error: Connection reset by peer)
[5:53] * alexxy (~alexxy@ has joined #ceph
[5:56] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[6:29] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[6:52] * f4m8_ is now known as f4m8
[7:53] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:06] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[8:29] * allsystemsarego (~allsystem@ has joined #ceph
[8:29] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:20] * Yoric (~David@ has joined #ceph
[9:27] * Yoric_ (~David@ has joined #ceph
[9:27] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[9:27] * Yoric_ is now known as Yoric
[9:29] * Yoric_ (~David@ has joined #ceph
[9:29] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[9:29] * Yoric_ is now known as Yoric
[9:32] <yuravk> Hi, is anybody here ?
[9:39] * Yoric_ (~David@ has joined #ceph
[9:39] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[9:39] * Yoric_ is now known as Yoric
[9:40] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[9:40] * Yoric (~David@ has joined #ceph
[9:44] <wido> yuravk: yes
[9:45] <yuravk> :-) and no devs ...
[9:45] <wido> nope :-) But i might be able to help
[9:45] <wido> you won't find a dev here during this time of day
[9:45] <phreak-> are you? :)
[9:45] <wido> hi phreak- ;)
[9:46] <phreak-> hey wido
[9:46] <wido> i'm no dev yuravk, but got a lot of experience with Ceph
[9:46] <phreak-> wido: how does ceph perform on an active/active platform?
[9:46] <wido> how do you mean, active/active?
[9:47] <phreak-> currently i run drbd in combination with gfs2, both with write
[9:47] <phreak-> but that performs terribly
[9:47] <wido> ah, ok :-) Well, you don't use Ceph in combination with DRBD ;)
[9:47] <phreak-> Ceph is on its own, right?
[9:47] <wido> but, I did some tests: 12 OSDs, each with 1Gbit, and a client with a 2Gbit trunk, gave me about 150MB/sec write
[9:48] <wido> yes, Ceph is on its own
[9:48] <phreak-> with the block device driver on top of it?
[9:48] <wido> reads got stuck at 67MB/sec, but that needs further investigation
[9:48] <wido> phreak-: No, Ceph is a fully distributed filesystem which has its own daemons, etc, etc. It uses BTRFS for its low level storage, but on top of that there are various daemons
[9:49] <phreak-> hmk
[9:49] <phreak-> and it's stable?
[9:49] <wido> well, not yet, it needs more testing and development
[9:50] <phreak-> is there a roadmap available?
[9:50] <wido> phreak-: http://tracker.newdream.net/projects/ceph/roadmap
[9:50] <phreak-> ty
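The throughput figures wido quotes above (about 150MB/sec write and 67MB/sec read over a 2Gbit client trunk) can be put next to the nominal line rate. This is only a rough sketch: it assumes decimal units (1 Gbit/s = 125 MB/s), ignores protocol overhead, and assumes the trunk can be fully used by one client, none of which is confirmed in the log.

```python
def gbit_to_mb_per_s(gbit):
    """Convert a nominal link speed in Gbit/s to MB/s (decimal units)."""
    return gbit * 1000 / 8

# Figures taken from the conversation above.
client_trunk_gbit = 2
write_mb_s = 150
read_mb_s = 67

# Nominal ceiling of the client link, before any protocol overhead.
line_rate = gbit_to_mb_per_s(client_trunk_gbit)

write_fraction = write_mb_s / line_rate
read_fraction = read_mb_s / line_rate

print(f"nominal line rate: {line_rate:.0f} MB/s")
print(f"write uses {write_fraction:.0%} of it, read uses {read_fraction:.0%}")
```

Under these assumptions the write result sits well above what a single 1Gbit link could carry, while the read result does not, which fits wido's remark that reads need further investigation.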
[10:00] <yuravk> wido: I'm not able to get the kernel client to build/work on Debian 5.0, Ubuntu 10.04 and CentOS 5.5
[10:05] <wido> you mean the standalone?
[10:05] <wido> I built the standalone just yesterday on Ubuntu 10.04, but I'm using 2.6.36
[10:05] <wido> yuravk: try http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/
[10:06] <wido> the master branch of the standalone client should build against that
[10:07] <yuravk> as I understand, the ceph client kernel module should be included into the kernel already
[10:07] <yuravk> but, I'll also try to build mine ...
[10:07] <yuravk> thanks
[10:13] <wido> yuravk: yes, it has been included since .34, but when testing, I would advise using the latest git checkout, just to be sure
[10:13] <wido> development goes fast
[10:16] <yuravk> So, I'm cloning ceph-client-standalone.git and do not switch to master-backport
[10:16] <yuravk> ?
[10:22] <yuravk> I have built the module for kernel you provide.
[10:22] <wido> correct, just stay in the master branch
[10:22] <yuravk> but it also failed to load: ceph: Unknown symbol crc32c (err 0)
[10:22] <wido> make sure, you remove the ceph.ko which comes with the kernel
[10:22] <wido> and run
[10:22] <wido> depmod -a
[10:23] <wido> seems that the module crc32c isn't loaded
[10:24] <wido> "depmod - program to generate modules.dep and map files."
[10:27] <yuravk> crc32c is loaded
[10:28] <wido> ok, but did you clean up the ceph.ko which comes with the 2.6.36? After your module compile and make modules_install there should only be a ceph.ko in the "extra" directory
[10:28] <wido> then run depmod -a and try to load ceph again
[10:29] <yuravk> yep, it loads
[10:31] <wido> ok, cool :)
[10:31] <yuravk> Thanks, do you know if somebody has built Ceph for CentOS 5.* ?
[10:32] <wido> there is a ceph.spec file, which should give you RPM files for the daemons
[10:32] <wido> or do you mean the kernel client?
[10:44] <yuravk> both
[10:45] <yuravk> Will I be able to build the cluster without the Ceph client kernel module ?
[10:49] <wido> yes, you don't need the kernel module
[10:49] <wido> the daemons are only userspace
[10:55] <yuravk> so the kernel module should be installed into each client which uses the cluster ?
[10:59] <wido> yes, indeed
[10:59] <wido> the servers do not need the kernel client
[11:29] <yuravk> so I need minimum 2 boxes for the cluster and 1 node for client ?
[11:39] <wido> yes
[11:39] <wido> well, you can run a cluster on one box if you want to
[11:43] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[11:45] * Yoric (~David@ has joined #ceph
[11:49] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[11:50] * Yoric (~David@ has joined #ceph
[11:59] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[12:03] * Yoric (~David@ has joined #ceph
[12:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[12:10] <yuravk> wido: well, I have set up my first test cluster
[12:11] <wido> yuravk: great :-)
[12:12] <yuravk> Now, I need to know whether it is possible to get the client kernel module for CentOS 5.* ...
[12:13] <wido> if you can get an up to date kernel, you can
[12:14] <yuravk> no, I think it will not be possible to run anything other than 2.6.18 on CentOS 5.*
[12:17] <wido> well, it should be :)
[12:17] <wido> but 2.6.18 is way too old
[12:18] <yuravk> yep, but it is default for CentOS 5
[12:18] <wido> you won't get the Ceph client (kernel) running there
[12:18] <wido> the FUSE client might, but I never used that one
[12:55] <yuravk> wido: is there any other way for clients (except mounting ceph FS) to work with the cluster ?
[13:15] <yuravk> wido: and, what is the recommended hardware for Ceph cluster members ? Do you know the recommended RAM, CPU, disk params
[13:43] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Quit: Ex-Chat)
[13:52] <wido> yuravk: well, if you want to use the filesystem, no
[13:52] <wido> you have to mount it
[13:52] <wido> but, below Ceph there is RADOS, the object store, you could use RBD (Rados Block Device) or phprados: http://ceph.newdream.net/wiki/Phprados
[13:53] <wido> and for a OSD, the specs really depend on your workload
[13:53] <wido> but more RAM is better :)
[14:00] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:25] <yuravk> wido: to store some backups: files from 250Mb up to 20 Gb
[14:41] <wido> yuravk: hmm, ok. If you don't stress them too much, say, 4GB RAM per OSD
[14:41] <wido> lets say, that OSD has a 1TB disk
[14:44] <yuravk> cool, thanks
[14:57] <wido> yuravk: you are in GMT +2 too?
[14:57] <yuravk> yes
[14:57] <wido> ah, ok
[15:00] * Yoric_ (~David@ has joined #ceph
[15:00] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:00] * Yoric_ is now known as Yoric
[15:01] * Yoric_ (~David@ has joined #ceph
[15:01] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:01] * Yoric_ is now known as Yoric
[15:02] * Yoric (~David@ Quit ()
[15:02] * Yoric (~David@ has joined #ceph
[15:10] * Yoric_ (~David@ has joined #ceph
[15:10] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[15:10] * Yoric_ is now known as Yoric
[15:20] * Yoric_ (~David@ has joined #ceph
[15:20] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:20] * Yoric_ is now known as Yoric
[15:49] * f4m8 is now known as f4m8_
[17:50] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:10] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:07] <sagewk> wido: there?
[19:16] * greglap (~Adium@ has joined #ceph
[19:17] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[19:24] * Yoric (~David@ Quit (Quit: Yoric)
[19:38] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[19:38] <wido> sagewk: yes
[19:38] <sagewk> the remaining pgs are recovering because the objects are really missing. i'm guessing there was only 1 copy of them on the osd that got rsynced/wiped/recopied?
[19:39] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[19:39] <wido> hmm, that might be
[19:39] <wido> my replication level is set to two, but I indeed had to wipe one OSD
[19:40] <wido> well, I assume the pg's just have to be removed?
[19:42] <sagewk> well, i think we need a way for an administrator to tell the osd to give up on those objects and declare them lost. somehow.
[19:44] <wido> can't I remove them with the "rados" tool?
[19:44] <wido> rados -p rbd rm <obj>
[19:44] <wido> rados -p metadata rm <obj>
[19:45] <sagewk> yeah the osds need to be told at a lower level to give up first. working on something now.
[19:47] <wido> what will this do to the filesystem?
[19:47] <wido> since the metadata pool is missing an object now
[19:47] <sagewk> i was wrong, it's the data and rbd pools that are missing stuff. the metadata pool is fine.
[19:47] <sagewk> so some file data will be missing, and some rbd image data.
[19:48] <wido> ok, so one image will be corrupt
[19:48] <wido> I'm in favour of repl 3 btw, 2 is still a bit tricky.
[19:48] <sagewk> yeah me too :)
[19:48] <wido> and the file (data pool) will simply be having a gap of 4MB
[19:48] <sagewk> right
[19:49] <sagewk> the missing objects will show up in the monitor log for future reference once i fix this up.
[19:49] <wido> yes, great. But isn't this linked to a "scrub" and/or http://tracker.newdream.net/issues/425
[19:51] <sagewk> the scrub will make sure that any (completed) recovery is correct. it won't run on a degraded/peering/recovering pg
[19:52] <wido> ok, but you could make a flag for a pg, where the OSD gives up, don't remove anything (just notify), but keep the cluster running
[20:03] <sagewk> it's not removing anything either way, since the objects are already missing. but yeah, i think it'll be something like
[20:03] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[20:04] <sagewk> 'ceph pg give_up_on_lost_objects 0.232' or something along those lines
[20:04] <sagewk> then the pg will go active, and the missing objects will just be lost.
[20:07] <sagewk> actually, that should probably happen automatically, if the primary queries all osds that may have had the data and the objects are still lost. it doesn't by default currently to catch any bugs ("fail-safe"). once you give up there's no going back. i'll add a config option too.
[20:13] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:15] <wido> yes, that would be great
[20:15] <wido> if it simply isn't there, it isn't, so no point in waiting for manual intervention
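The "gap of 4MB" that wido and sagewk agree on above can be pictured as a byte range inside the affected file. A minimal sketch, assuming the file is split into fixed 4 MiB objects in order (the exact object size and layout are assumptions, not stated in the log); `gap_for_missing_object` and the object index used are hypothetical names for illustration only.

```python
# Assumed object size: the log only says "4MB", so 4 MiB is a guess.
OBJECT_SIZE = 4 * 1024 * 1024

def gap_for_missing_object(index, file_size, object_size=OBJECT_SIZE):
    """Return the (start, end) byte range a lost object would leave empty.

    index is the zero-based position of the object within the file.
    The last object of a file may be shorter than object_size.
    """
    start = index * object_size
    if index < 0 or start >= file_size:
        raise ValueError("object index lies outside the file")
    end = min(start + object_size, file_size)
    return start, end

# Hypothetical example: a 10 MiB file that lost its third object.
start, end = gap_for_missing_object(2, 10 * 1024 * 1024)
print(f"bytes {start}..{end - 1} would be missing")
```

Everything outside that range stays readable, which is why only part of a file or a single RBD image is affected once the objects are declared lost.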
[20:17] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[20:17] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:17] <Mark22> wido: when do you think it will be stable enough for production?
[20:18] <wido> Mark22: that question gets asked a lot around here
[20:18] <wido> depends on your demands for production
[20:18] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:18] <wido> but imho, no, not yet. I think Ceph is getting close, but for example, the multi-MDS isn't stable yet
[20:19] <wido> but, RADOS (which is also very, very, very cool!) is pretty stable
[20:20] <wido> RADOS is the object store below Ceph
[20:20] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[20:22] <wido> Mark22: the problem is, right now you will run into a problem someday, you won't lose data, but due to a bug your cluster can be down for a few days
[20:23] <wido> During development that's not a problem at all, but in a production env? I wouldn't want that to happen
[20:23] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:23] <Mark22> that is not good enough for our demands (for production), however it won't be a big problem for testing things
[20:24] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:24] <Mark22> wido: there is a reason we currently pay for some 24/7 service for our storage needs
[20:24] <wido> nope, and that's just what Ceph needs, more and more people testing it and submitting bugs
[20:25] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[20:25] <Mark22> but in a few days I'll start testing with it (as I just need to learn things it is a good option to also look for bugs to submit)
[20:25] <wido> Mark22: Storage is always (and is getting) a bigger and hotter issue, so for that, Ceph will really be a solution, but it needs time.
[20:26] <wido> ah, great :-) you can run it on a few simple machines (even VM's)
[20:27] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:27] <Mark22> nice :)
[20:27] <Mark22> as I also would like to test something else and in that case I could combine it :)
[20:29] <Mark22> currently I only have 2 things on my list to test for as far as it is related to storage (ceph + a setup with drbd/openiscsi/heartbeat/etc.)
[20:30] * oksamyt (~sana@ has joined #ceph
[20:30] * Mark22 wants to understand what pcx is maintaining for the company where I work ;)
[20:32] <wido> well, I think you can't compare them
[20:32] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[20:32] <wido> where Ceph is fully distributed and aimed at multiple TBs, an iSCSI env with DRBD will hit some limitations pretty fast. You are stuck with the performance that hardware delivers, where Ceph keeps scaling
[20:33] <Mark22> I don't really want to compare them, but I want to look for a good storage solution for backups (and ceph sounds good)
[20:33] <wido> Well, depends on how safe your backups need to be ;)
[20:33] <Mark22> however currently it isn't stable enough :(
[20:33] <wido> But creating some backups to it should be pretty safe (though there is no guarantee!)
[20:34] <wido> keep your replication level set to 3
[20:34] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[20:35] <Mark22> I'll test it in the test situation first (where it isn't a problem when I can't get my data back)
[20:37] <wido> that would be a good point to start at
[20:40] <oksamyt> hello all
[20:42] <gregaf> hi oksamyt
[20:42] <wido> oksamyt: hi
[20:43] <oksamyt> i'm having a problem with several different releases: http://ceph.newdream.net/download/ceph-0.21.2.tar.gz and http://ceph.newdream.net/download/ceph-0.21.3.tar.gz
[20:43] <oksamyt> i have one monitor, one mds and 3 osds
[20:43] <oksamyt> but the only osd which seems to work is at the same machine as mon and mds
[20:44] <wido> what is your monitor IP in the ceph.conf?
[20:44] <wido> is it set to a IP which the OSD's can reach?
[20:44] <gregaf> seems to work?
[20:44] <oksamyt> mon addr =
[20:44] <gregaf> are the daemons all being started?
[20:45] <oksamyt> no, they all can see each other and everything worked about two weeks ago when i used the current code from git
[20:45] <oksamyt> it's just that i wanted to try a release
[20:46] <oksamyt> which version would you advise to use?
[20:46] <gregaf> .21.3 is the newest tagged version from a few days ago
[20:46] <wido> oksamyt: what makes you say only one OSD works? Does "ceph osd dump -o -" flag them as down?
[20:48] <oksamyt> max_osd 2
[20:48] <oksamyt> osd0 in weight 1 up (up_from 9 up_thru 9 down_at 8 last_clean 5-7)
[20:49] <oksamyt> ceph osd stat
[20:49] <oksamyt> 10.09.29_21:49:13.375581 mon <- [osd,stat]
[20:49] <oksamyt> 10.09.29_21:49:13.376092 mon0 -> 'e3: 1 osds: 1 up, 1 in' (0)
[20:50] <gregaf> can you check the other OSD daemons are running on their servers?
[20:50] <gregaf> and are in the configuration?
[20:50] <oksamyt> yes, the config file is the same everywhere, and the osd daemons are running
[20:50] <oksamyt> maybe i'll post some log lines?
[20:54] <wido> oksamyt: your dump says: max_osd 2
[20:55] <wido> and your mon says 1 OSD
[20:55] <wido> so it seems that the other two OSD's aren't "registered" at the cluster
[20:55] <wido> oksamyt: I think you need: http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction
[20:56] <oksamyt> can i just create everything from scratch using mkcephfs?
[20:57] <wido> oh, yes. If all the OSD's are in your ceph.conf, then mkcephfs will format them too
[20:59] <oksamyt> ok, let me try to repeat all the steps and see if it works
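The mismatch wido spots above (three OSDs configured, but the monitor reporting `1 osds: 1 up, 1 in`) can be checked mechanically. A minimal sketch, assuming the monitor reply has exactly the shape pasted in the log; `parse_osd_stat` is a hypothetical helper, not part of any Ceph tool, and other Ceph versions may print a different format.

```python
import re

# Pattern for the summary as it appears in the log above:
#   'e3: 1 osds: 1 up, 1 in'
OSD_STAT_RE = re.compile(r"e(\d+): (\d+) osds: (\d+) up, (\d+) in")

def parse_osd_stat(line):
    """Extract epoch and OSD counts from a 'ceph osd stat' reply line."""
    match = OSD_STAT_RE.search(line)
    if match is None:
        raise ValueError("line does not look like an osd stat summary")
    epoch, total, up, in_ = (int(g) for g in match.groups())
    return {"epoch": epoch, "total": total, "up": up, "in": in_}

# Line copied from oksamyt's paste.
reply = "10.09.29_21:49:13.376092 mon0 -> 'e3: 1 osds: 1 up, 1 in' (0)"
stat = parse_osd_stat(reply)

expected_osds = 3  # oksamyt said the setup has 3 osds
unregistered = expected_osds - stat["total"]
print(f"{unregistered} OSD(s) are not registered with the cluster")
```

A non-zero difference here points at the OSD map rather than at the daemons, which is the distinction wido and gregaf were trying to draw.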
[21:12] <oksamyt> i have run mkcephfs on all three machines and executed the init-ceph script (monitor and first osd, then two remaining osds)
[21:12] <oksamyt> max_osd 3
[21:12] <oksamyt> osd0 in weight 1 up (up_from 2 up_thru 2 down_at 0 last_clean 0-0)
[21:12] <wido> yes, your max_osd is correct now
[21:12] <oksamyt> the monitor says 2 others are down
[21:12] <wido> but are the other OSD's running?
[21:12] <oksamyt> 10.09.29_22:13:12.631109 7fd271b1a710 osd1 2 ERROR: Got non-matching FSID from trusted source!
[21:13] <oksamyt> this is in the log at two other osds
[21:13] <wido> ok, never seen that :)
[21:13] <oksamyt> earlier i was advised to create all the nodes with mkcephfs -a
[21:15] <oksamyt> is it the only way?
[21:15] <wido> it isn't the only, but it's the easiest for now
[21:15] <wido> I have to go afk, be back later
[21:16] <oksamyt> ok
[21:17] * physical (3f4cdeea@ircip3.mibbit.com) has joined #ceph
[21:30] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[21:47] <gregaf> oksamyt: every time you run mkcephfs it generates a new fsid, so you need to run mkcephfs -a or the nodes think they're in separate filesystems
[21:47] <gregaf> there's a feature request in the tracker to change this but it's not done yet
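gregaf's explanation of the "non-matching FSID" error can be modelled in a few lines. This is a toy model, not Ceph's actual code: `mkfs_run` stands in for one mkcephfs invocation, a random UUID stands in for the fsid, and the node names are made up for illustration.

```python
import uuid

def mkfs_run(nodes):
    """Toy stand-in for one mkcephfs run: every node covered by the
    same run is stamped with the same freshly generated fsid."""
    fsid = uuid.uuid4()
    return {node: fsid for node in nodes}

def same_cluster(stamps):
    """Nodes only accept each other if they all carry one fsid."""
    return len(set(stamps.values())) == 1

nodes = ["mon0", "osd1", "osd2"]  # hypothetical node names

# One run per machine: each run invents its own fsid.
separate = {}
for node in nodes:
    separate.update(mkfs_run([node]))

# One run covering all machines, as with 'mkcephfs -a'.
together = mkfs_run(nodes)

print("separate runs agree:", same_cluster(separate))
print("single run agrees:  ", same_cluster(together))
```

In this model the per-machine runs end up with three different identifiers, which mirrors why the two remote OSDs refused to join until the cluster was created in one pass.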
[21:52] * alexxy (~alexxy@ Quit (Ping timeout: 480 seconds)
[22:10] <wido> sagewk: i'm going afk, if there is a fix to purge/remove the pg's, just post it somewhere, i'll read the commits tomorrow morning and test it
[22:10] <sagewk> ok. about to push it to your machine to test now.
[23:03] * physical (3f4cdeea@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[23:20] <oksamyt> well, everything seems to be fine after i have run mkcephfs -a
[23:20] <oksamyt> we tried setting up a cluster in the students' internet centre last sunday, and the client hung while recording a file
[23:20] <oksamyt> and the cosd became a zombie
[23:33] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.