#ceph IRC Log


IRC Log for 2010-07-20

Timestamps are in GMT/BST.

[1:21] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Quit: Leaving.)
[2:57] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Server closed connection)
[2:57] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[3:24] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Server closed connection)
[3:25] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:55] * todinini (tuxadero@kudu.in-berlin.de) Quit (Server closed connection)
[3:55] * todinini (tuxadero@kudu.in-berlin.de) has joined #ceph
[4:11] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Read error: Connection reset by peer)
[4:12] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[5:04] * Guest126 (~bbigras@bas11-montreal02-1128535472.dsl.bell.ca) Quit (Server closed connection)
[5:05] * bbigras (~bbigras@bas11-montreal02-1128535472.dsl.bell.ca) has joined #ceph
[5:05] * bbigras is now known as Guest494
[5:08] * Osso_ (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) has joined #ceph
[5:13] * Osso (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) Quit (Remote host closed the connection)
[5:13] * Osso_ is now known as Osso
[5:20] * sakib (~sakib@ has joined #ceph
[5:24] * Gary (~syslxg@ has joined #ceph
[5:38] * Gary (~syslxg@ Quit (Quit: Leaving)
[6:15] * sakib (~sakib@ Quit (Quit: leaving)
[6:45] * f4m8_ is now known as f4m8
[6:49] <f4m8> gregaf: thanks for the explanation. This helps a lot
[7:09] * Osso (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) Quit (Quit: Osso)
[7:09] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Server closed connection)
[7:09] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[8:57] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[10:41] <Anticimex> hm, for testing out ceph, is the .debs of ceph-unstable problematic for data integrity?
[10:41] <Anticimex> or will there be other non-data-corrupting issues?
[10:42] <wido> Anticimex: what are you experiencing?
[10:44] <Anticimex> i've not installed it yet
[10:45] <Anticimex> but I'm about to, and am considering which version to choose
[10:48] <wido> oh, i would choose unstable
[10:51] <wido> but, note, Ceph is not stable yet! So do not use it in production
[10:51] <wido> your data is not safe
[10:56] <wido> but Anticimex testing is really needed
[10:56] <wido> so if you have hardware to spare and time, please feel free to test and report whatever you find
[10:56] <Anticimex> ok wido, thanks
[10:57] <Anticimex> that's what i was wondering, if "not for production use" today meant that data was unsafe, and not just that you could get weird hangs and other lockups that needed manual tinkering
[11:04] <wido> well, i never really lost all my data
[11:04] <wido> but i had some crashes, hangs, etc, etc that it could not recover from
[11:04] <wido> so i formatted the cluster again to keep on going with testing
[13:01] * allsystemsarego (~allsystem@ has joined #ceph
[13:24] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Read error: No route to host)
[14:23] <darkfader> Anticimex: it still means the data is at a certain risk, and more testing is needed until at some point we can say "it's safe now". don't forget btrfs is still experimental, too.
[14:24] <darkfader> i feel ok saving some data on it, data i got backups of and that doesn't need to be online constantly
[14:25] <darkfader> i mean people even save stuff on ext3 *eg*
[14:34] <Anticimex> yeah, i'm not looking to store my most valued data on it :)
[14:34] <Anticimex> that is heavily backed up (I think I have 4 copies)
[14:51] <darkfader> ok ;)
[14:52] <darkfader> i have filled it with the distro mirror tree so far
[14:52] <darkfader> and an rsync of some pr0n, too.
[14:53] <darkfader> worst is, the second part definitely has an fs issue, ls -l will hang, as will du -ks. but i feel i shouldn't use that to file a bug report
[14:54] <darkfader> one note: when you set up the filesystem, take care to include a journal setting. I didn't have that at the start because it's not in the "installing on debian" part of the wiki, and it was painfully slow
[14:54] <darkfader> (which at least proves it handles the data carefully)
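As darkfader notes, the OSD journal has to be configured explicitly. A minimal sketch of what that might look like in ceph.conf; the path and size below are illustrative examples, not values taken from this log:

```ini
; illustrative [osd] section -- path and size are examples only
[osd]
        ; put the journal on a fast (ideally SSD-backed) device
        osd journal = /mnt/ssd/osd$id/journal
        ; journal size in MB
        osd journal size = 1000
```

Placing the journal on a separate fast device is the same trick wido describes later in this log (an Intel X25-M SSD holding the journals for osd0-osd3).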
[15:06] * deksai (~chris@71-13-57-82.dhcp.bycy.mi.charter.com) has joined #ceph
[15:45] * Osso (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) has joined #ceph
[16:36] * f4m8 is now known as f4m8_
[18:46] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[18:47] * sagewk (~sage@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:20] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) has joined #ceph
[20:50] * Osso (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) Quit (Quit: Osso)
[20:52] * Osso (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) has joined #ceph
[20:53] * Osso_ (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) has joined #ceph
[20:53] * Osso (osso@AMontsouris-755-1-18-109.w90-46.abo.wanadoo.fr) Quit (Read error: Connection reset by peer)
[20:53] * Osso_ is now known as Osso
[21:09] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) Quit (Server closed connection)
[21:09] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) has joined #ceph
[22:31] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[22:52] <wido> hi sagewk
[22:52] <sagewk> wido: hey
[22:52] <sagewk> fixed one thing that may have contributed to the osd up/down flapping
[22:52] <wido> ah, cool
[22:52] <wido> and i see you're logged on to node13
[22:53] <wido> my mon's and mds'es are acting strange today.
[22:53] <sagewk> yeah i was looking at the mon0 problem.
[22:54] <sagewk> and node13 / is full
[22:54] <wido> what i noticed is that when they were all running, the cluster still seemed to hang.
[22:54] <wido> oh yes, that's the coredumps and logfiles
[22:54] <wido> they are equipped with a 4GB flash drive, a bit small..
[22:54] <sagewk> hmm yeah
[22:55] <sagewk> go ahead and wipe those cores, don't think i need them
[22:55] <wido> but what i saw, nothing was happening, but the mds'es were still saying the OSD's were laggy, while there was no load on them
[22:55] <sagewk> they were all up?
[22:55] <wido> ok, they are gone. I would still really like syslog :-)
[22:55] <wido> yes, they were.
[22:55] <wido> all 30, up and running, load at 0.00
[22:55] <wido> was no network issue for sure
[22:57] <wido> but for now the rsync of kernel.org keeps killing my cluster
[22:57] <wido> mon's and mds'es are going down when i try to do a rsync, starts to compare the bits and then something will go down
[22:58] <sagewk> i'll try to reproduce, what are you rsyncing?
[22:58] <wido> rsync -avr --stats --progress rsync://rsync.eu.kernel.org/pub/* /mnt/ceph/static/kernel/
[22:59] <wido> the first sync will go fine, but then try to sync again, that goes wrong
[23:00] <sagewk> thanks
[23:01] <wido> and from there, when a MDS tries to recover, it crashes again
[23:01] <wido> btw, i equipped node01 with an Intel X25-M SSD and placed the journal on it for osd0 - osd3, that works really nice
[23:01] <wido> OSD performance went from 35MB/sec to 70MB/sec
[23:03] <wido> oh btw, mon0 is still crashing, it can't be started, but you already noticed that i think
[23:03] <sagewk> oh
[23:04] <wido> you closed the issue with Can't reproduce, but it's still happening on my cluster
[23:06] <sagewk> same issue as the crash before.. but no log. i'll fix it again and next time it crashes don't restart it (unless/until you build latest, which won't clobber the log)
[23:07] <wido> ok, i'm going afk now, i'll do a build tomorrow and leave it for the night like this
[23:07] <wido> there is a commit for the logs i saw? which appends
[23:07] * alexxy (~alexxy@ has joined #ceph
[23:08] <sagewk> i used ios::ate before, but that doesn't work as advertised.. ios::app does. fixed as of this morning.
[23:08] <wido> pull, build, retry and save log, core + binary?
[23:08] <sagewk> yeah, or just wait for crash, and don't restart
[23:09] <wido> you can start mon0 right now on node13, will crash in 5 secs
[23:09] <wido> or am i misunderstanding something?
[23:10] <wido> ok, now i see, it's running again.
[23:10] <sagewk> i fixed up logm/last_commited and then it starts.
[23:10] <sagewk> the question is why that got a bad value in the first place
[23:10] <wido> ah, ok. I'm not sure whether i'll hit this bug again
[23:10] <sagewk> that's the second time, so hopefully!
[23:11] * alexxy (~alexxy@ Quit ()
[23:11] <wido> ok, one last question before i'm off
[23:11] * alexxy (~alexxy@ has joined #ceph
[23:11] <wido> is there a way to set a default crush rule for new pools?
[23:11] <wido> ones that get created via the RADOS Gateway, for example
[23:12] <wido> so their replication is set to 3 by default
[23:12] <sagewk> the replication level is a pool property, not a crush property (the crush rule is valid for a range of pg sizes).
[23:12] <sagewk> not sure where the default comes from..
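sagewk's point above, that the replication level is a per-pool property rather than part of the CRUSH rule, can be illustrated with the ceph CLI. A hedged sketch only: the exact syntax has varied across Ceph releases (the 2010-era builds discussed here may differ), and the rule name is hypothetical:

```
# set the replication level ("size") of an existing pool to 3
ceph osd pool set data size 3

# point the pool at a custom CRUSH rule (rule name is hypothetical)
ceph osd pool set data crush_rule replicated_hosts
```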
[23:13] <todinini> hi, do you have a test suite to test and benchmark ceph? my cluster is now up and running and most of the time idle
[23:13] <wido> ok, sagewk but since i have some custom crush rules, could you set a default for that? for example that a new pool inherits the rule from the "data" pool
[23:14] <wido> so my data gets replicated over the right osd's, not the same replica on the same physical machine
[23:14] <sagewk> todinini: there is a qa/ directory in ceph.git that has some random workloads we run for stability testing
[23:14] <wido> todinini: try stressing with bonnie++, postmark, simple (large) rsync's
[23:16] <todinini> wido: I tried that, but those tools never check if the data is correct, how about some md5 summing of the synced files, then we know the data is correct?
[23:16] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[23:17] <wido> todinini: you can give rsync some extra options to do so
[23:17] <wido> or write something in bash to verify some sums
[23:18] <todinini> wido: ok, I thought there would be some standard tests
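The "something in bash" wido suggests could look like the sketch below: compute sorted md5 lists for a source tree and its copy on the Ceph mount, then diff them. The directory paths in the usage comment are hypothetical examples, not taken from this log:

```shell
#!/bin/sh
# Checksum-verify a copied tree against its source: build a sorted
# md5sum list for each side and diff the lists. Exits non-zero if
# any file differs or is missing on either side.
verify_tree() {
    src=$1
    dst=$2
    (cd "$src" && find . -type f -exec md5sum {} + | sort -k 2) > /tmp/src.md5
    (cd "$dst" && find . -type f -exec md5sum {} + | sort -k 2) > /tmp/dst.md5
    diff /tmp/src.md5 /tmp/dst.md5
}

# example usage (paths are hypothetical):
#   verify_tree /srv/mirror /mnt/ceph/mirror && echo "checksums match"
```

This re-reads every byte on both sides, so on a distributed filesystem it doubles as a crude read-stability test as well as a correctness check.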
[23:18] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:19] <wido> sagewk: i'm going afk now. If you could answer my question about the default CRUSH rules for new pools, i'll give it a try tomorrow and try to put something in the Wiki about it. Until then, it's in the IRC logs :-)
[23:19] <wido> ttyl
[23:24] * jbl (~jbl@ Quit (Server closed connection)
[23:24] * jbl (~jbl@charybdis-ext.suse.de) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.