#ceph IRC Log

IRC Log for 2010-09-23

Timestamps are in GMT/BST.

[0:03] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[2:11] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[3:04] * cockroach (~stefan@dhcp-130-92-99-103.vpn.unibe.ch) Quit (Quit: this was not the cockroach you were looking for)
[4:22] * yehudasa_hm (~yehuda@adsl-69-225-137-176.dsl.irvnca.pacbell.net) has joined #ceph
[4:23] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Remote host closed the connection)
[4:25] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[4:49] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[4:53] * Spudz76 (~spudz76@dc.gigenet.com) has joined #ceph
[4:55] * sagelap1 (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[4:55] <Spudz76> how long should a mkcephfs take
[4:55] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Read error: No route to host)
[4:56] <Spudz76> mine is camping out on osd.0 for quite a while now, 100% cpu
[5:00] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[5:01] * sagelap2 (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[5:01] * sagelap1 (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Read error: No route to host)
[5:06] <Spudz76> last thing it says (with debug on) is "filestore(/data/osd0) mount found snaps <>"
[5:08] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[5:15] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Remote host closed the connection)
[5:15] * sagelap2 (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[5:16] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[5:34] <Spudz76> oop needed osd journal size = 100
[5:34] * Spudz76 (~spudz76@dc.gigenet.com) has left #ceph
[6:16] * eternaleye (~eternaley@195.190.31.135) has joined #ceph
[6:40] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[7:15] * f4m8_ is now known as f4m8
[8:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:16] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[8:27] * kblin_ (~kai@h1467546.stratoserver.net) Quit (Remote host closed the connection)
[8:41] * allsystemsarego (~allsystem@188.27.166.252) has joined #ceph
[9:12] * lidongyang (~lidongyan@222.126.194.154) Quit (Remote host closed the connection)
[9:38] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:55] * lidongyang (~lidongyan@222.126.194.154) has joined #ceph
[9:56] * lidongyang (~lidongyan@222.126.194.154) Quit (Remote host closed the connection)
[10:28] * Yoric (~David@213.144.210.93) has joined #ceph
[11:09] * hijacker (~hijacker@213.91.163.5) Quit (Read error: Connection reset by peer)
[11:28] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[11:31] * toothkit (~betch2k@66.87.2.111) has joined #ceph
[12:01] * allsystemsarego (~allsystem@188.27.166.252) Quit (Quit: Leaving)
[14:29] * Yoric_ (~David@213.144.210.93) has joined #ceph
[14:29] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[14:29] * Yoric_ is now known as Yoric
[14:29] * toothkit (~betch2k@66.87.2.111) Quit (Ping timeout: 480 seconds)
[14:30] * morse (~quassel@supercomputing.univpm.it) has joined #ceph
[14:39] * toothkit (~betch2k@66.87.1.74) has joined #ceph
[14:49] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[14:49] * Yoric (~David@213.144.210.93) has joined #ceph
[14:55] * Yoric_ (~David@213.144.210.93) has joined #ceph
[14:55] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[14:55] * Yoric_ is now known as Yoric
[14:59] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[14:59] * Yoric (~David@213.144.210.93) has joined #ceph
[15:07] * Yoric_ (~David@213.144.210.93) has joined #ceph
[15:07] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[15:07] * Yoric_ is now known as Yoric
[15:16] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[15:17] * Yoric (~David@213.144.210.93) has joined #ceph
[15:19] * Yoric (~David@213.144.210.93) Quit ()
[15:36] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[16:08] * allsystemsarego (~allsystem@188.27.166.252) has joined #ceph
[16:15] * Yoric (~David@213.144.210.93) has joined #ceph
[16:44] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[17:37] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[17:40] <wido> sagelap: about the corrupt PGs. Some time ago I was doing some tests and I removed some random files on an OSD, and that didn't get noticed
[17:40] <wido> is that the expected behaviour for now?
[17:43] <sagewk> random objects, or random pg logs?
[17:44] <sagewk> removing random objects won't be noticed until a scrub, which is only triggered manually at the moment
[17:46] <wido> random objects
[17:46] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[17:46] <wido> ok, but an automatic test could be a feature for later, a continuous process for checking data integrity
[17:46] <wido> while btrfs will do a lot for you, there could be scenarios
[17:51] * greglap (~Adium@166.205.138.48) has joined #ceph
[17:53] <sagewk> yeah, the scrubs should trigger automatically on a regular basis.
[17:56] <sagewk> http://tracker.newdream.net/issues/425
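
The exchange above says that objects removed directly from an OSD go unnoticed until a scrub compares the replicas. A minimal sketch of that idea, assuming a toy in-memory model: the `scrub` function, the OSD names, and the dictionary layout are all hypothetical and are not Ceph's implementation or API.

```python
import hashlib


def scrub(replicas):
    """Compare the objects held by each replica of one placement group.

    `replicas` maps a hypothetical OSD name to {object_name: content_bytes}.
    Returns {object_name: {osd_name: digest_or_None}} for every object whose
    replicas disagree; None marks a replica where the object is missing.
    """
    # Collect every object name seen on any replica.
    all_names = set()
    for objects in replicas.values():
        all_names.update(objects)

    inconsistent = {}
    for name in sorted(all_names):
        # Digest of the object on each replica, or None if it is absent there.
        digests = {
            osd: hashlib.sha256(objects[name]).hexdigest() if name in objects else None
            for osd, objects in replicas.items()
        }
        # More than one distinct value means the replicas disagree.
        if len(set(digests.values())) > 1:
            inconsistent[name] = digests
    return inconsistent
```

Calling `scrub` on two replicas where one has lost an object reports only that object, with `None` recorded for the replica that no longer has it. Nothing flags the loss until the comparison actually runs, which is why a manually triggered scrub leaves such deletions unnoticed in the meantime.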
[18:40] * greglap (~Adium@166.205.138.48) Quit (Quit: Leaving.)
[18:48] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[19:19] * NoahWatkins (~jayhawk@kyoto.soe.ucsc.edu) Quit (Quit: leaving)
[19:20] * NoahWatkins (~jayhawk@kyoto.soe.ucsc.edu) has joined #ceph
[19:26] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[19:55] <wido> yehudasa_hm: http://pastebin.com/9wsmzEsN
[19:55] <wido> did you get those messages too?
[19:59] <yehudasa> wido: I probably added some debug info to your kernel
[19:59] <yehudasa> I mean, to your btrfs module
[20:00] <yehudasa> I'll remove that
[20:03] <yehudasa> wido: did you upgrade to a newer kernel?
[20:03] <wido> yes, built the module against that kernel
[20:03] <wido> there was a new kernel from Ubuntu
[20:04] <yehudasa> ok, I removed that offending line
[20:04] <wido> you mean: "/root/source/fs/btrfs/ioctl.c:1534 ordered=ffff8800a093b9c0 test=0" ?
[20:04] <yehudasa> yeah
[20:04] <wido> but when you scroll down the pastebin
[20:04] <yehudasa> I just noticed that
[20:04] <wido> there are some more serious messages too
[20:08] <yehudasa> wido: other than these warnings, does anything else happen?
[20:09] <wido> no, the OSDs seem to run fine
[20:09] <wido> well, right now my cluster hangs at 1.75% degraded, but that will be a different issue
[20:10] <yehudasa> anything specific that you did that triggered this warning? was it still recovering?
[20:11] <wido> yes, I had the crashes of course, so it was down for some time
[20:11] <wido> the power failures gave me some hardware issues too; those were fixed today
[20:11] <wido> and then I started to get it all up and running again
[20:14] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:23] <wido> btw, this is something I saw multiple times. Somehow the cluster doesn't recover and hangs around the last 2%, which causes access to various pools to block
[20:23] <wido> for example, my filesystem isn't working now (hangs), while I can fetch objects from the "thesimpsons" pool.
[20:25] <wido> I've got 11 of the 12 OSDs up, osd4 is down because rsync didn't transfer the xattrs, so that one has to be wiped
[20:28] <sagewk> yeah
[20:31] <wido> Is there a logical explanation? Since the replication level is set to 2, I should be able to cope with the loss of one OSD
[20:31] <sagewk> it sounds like a problem in the recovery code. i can look later this afternoon
[20:31] <wido> I've seen this multiple times in the past
[20:32] <sagewk> yeah
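
wido's reasoning above is that with a replication level of 2, losing one OSD should leave at least one copy of everything. A minimal sketch of that arithmetic, assuming a toy placement map: `classify_pgs`, the PG ids, and the three status labels are hypothetical simplifications, not Ceph's real PG states.

```python
def classify_pgs(pg_map, up_osds):
    """Label each placement group by how many of its replicas are reachable.

    `pg_map` maps a hypothetical PG id to the list of OSD ids holding it;
    `up_osds` is the set of OSD ids currently up.
    """
    status = {}
    for pg, osds in pg_map.items():
        alive = [osd for osd in osds if osd in up_osds]
        if len(alive) == len(osds):
            status[pg] = "clean"        # every replica reachable
        elif alive:
            status[pg] = "degraded"     # at least one replica reachable
        else:
            status[pg] = "unavailable"  # no replica reachable
    return status
```

With 12 OSDs, two replicas per PG on distinct OSDs, and only osd4 down, no PG in this model ever comes out as "unavailable". That matches wido's expectation and suggests the hang comes from something else, which sagewk suspects is the recovery code.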
[22:12] * allsystemsarego (~allsystem@188.27.166.252) Quit (Quit: Leaving)
[22:37] <wido> sagewk: i'm going afk
[22:37] <wido> If you could take a look later today, would be great
[22:37] <wido> it's still hanging at 1.7% degraded
[22:37] <sagewk> will do
[22:37] <wido> tnx!
[22:37] <sagewk> np
[23:15] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.