#ceph IRC Log

IRC Log for 2010-09-20

Timestamps are in GMT/BST.

[0:30] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Remote host closed the connection)
[2:04] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[3:02] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:17] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[3:46] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[3:47] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[3:53] * deksai (~deksai@96.35.100.192) has joined #ceph
[4:18] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[4:18] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has left #ceph
[6:09] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[6:10] * deksai (~deksai@96.35.100.192) Quit (Ping timeout: 480 seconds)
[6:25] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[6:59] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[7:01] * f4m8_ is now known as f4m8
[7:22] * bbigras (quasselcor@bas11-montreal02-1128531099.dsl.bell.ca) has joined #ceph
[7:23] * bbigras is now known as Guest113
[7:27] * Guest1244 (~bbigras@bas11-montreal02-1128531099.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[7:36] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:22] * xilei (~xilei@61.135.165.172) Quit (Ping timeout: 480 seconds)
[8:33] * xilei (~xilei@61.135.165.180) has joined #ceph
[8:50] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[8:51] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[8:58] * xilei (~xilei@61.135.165.180) Quit (Ping timeout: 480 seconds)
[9:06] * xilei (~xilei@61.135.165.180) has joined #ceph
[9:53] * Yoric (~David@213.144.210.93) has joined #ceph
[9:59] * Yoric_ (~David@213.144.210.93) has joined #ceph
[9:59] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[9:59] * Yoric_ is now known as Yoric
[10:17] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[10:17] * Yoric (~David@213.144.210.93) has joined #ceph
[13:28] * Yoric_ (~David@213.144.210.93) has joined #ceph
[13:28] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[13:28] * Yoric_ is now known as Yoric
[13:34] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[13:36] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:55] * checkers (~alex@abraxo.bluebottle.net.au) has left #ceph
[14:09] * Yoric (~David@213.144.210.93) has joined #ceph
[15:09] * xilei (~xilei@61.135.165.180) Quit (Ping timeout: 480 seconds)
[15:22] * deksai (~deksai@96.35.100.192) has joined #ceph
[15:29] * allsystemsarego (~allsystem@188.27.166.252) has joined #ceph
[15:55] * hijacker (~hijacker@213.91.163.5) Quit (Read error: Connection reset by peer)
[16:00] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[17:06] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[17:51] * greglap (~Adium@166.205.137.217) has joined #ceph
[18:13] * greglap1 (~Adium@166.205.137.79) has joined #ceph
[18:17] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[18:20] * greglap (~Adium@166.205.137.217) Quit (Ping timeout: 480 seconds)
[18:25] <yehudasa> wido: are you there now?
[18:41] * greglap1 (~Adium@166.205.137.79) Quit (Quit: Leaving.)
[18:44] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:11] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[19:12] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:13] * sagelap (~sage@12.248.40.138) has joined #ceph
[19:21] * sagelap (~sage@12.248.40.138) Quit (Ping timeout: 480 seconds)
[20:09] <wido> yehudasa: yes
[20:10] <yehudasa> wido: I compiled a new btrfs module on node08
[20:10] <wido> ok, did you test it?
[20:10] <yehudasa> it seemed to get through the point where it crashed before, but now it crashes on a different issue, probably because of my fix
[20:11] <wido> right now i'm rsyncing all the data to the logger machine, removing the btrfs stripe and going to LVM
[20:11] <wido> is it an OSD crash or still a kernel panic?
[20:11] <yehudasa> it gets a kernel oops
[20:11] <wido> it's due to the multi-disk btrfs, isn't it?
[20:12] <yehudasa> not sure.. previous code was just broken
[20:12] <yehudasa> don't really know why we didn't hit it before
[20:13] <wido> right now only one OSD is up, so we'll see if it recovers when all the OSDs are back :-)
[20:14] <wido> if my data is still there
[20:15] <yehudasa> did you reboot node08 recently?
[20:19] <wido> we had a power failure in the DC, two nights in a row...
[20:19] <yehudasa> oh, ok
[20:19] <yehudasa> well.. the problem that I saw still happens
[20:19] <wido> genset refused to start
[20:20] <yehudasa> actually it's not a kernel oops, just a lockup
[20:21] <wido> ok
[20:21] <wido> for now I want the cluster online, so i'll keep rsyncing the data
[20:21] <wido> I've got some RBD things I want to test
[20:22] <wido> but I could leave node08 like it is
[20:22] <wido> so you can debug it further
[20:24] <yehudasa> thanks
[20:26] * sagelap (~sage@12.248.40.138) has joined #ceph
[20:29] <yehudasa> sagelap: I'm getting http://pastebin.org/989504 on wido's node after patching with the latest clone ioctl fix, any idea?
[20:41] <yehudasa> actually, might be related to the fd leakage during recovery
[20:41] <sagelap> yehudasa: looks like maybe an extent lock is getting leaked somehow?
[20:41] <sagelap> i'd audit the lock/unlock extent calls in that function and make sure everything looks ok. of course it could be leaked from elsewhere too, but that would be more likely to be noticed by others
[20:41] <yehudasa> sagelap: it hangs on the kmem_cache_alloc
[20:43] <yehudasa> and there are like 250 opened fds currently by that process
[20:44] <sagelap> oh, hmm. in the write func there is an option that queues up the fds to be flushed and closed later.. that can be turned off though. try that?
[20:44] <sagelap> see _write() in FileStore.cc
[20:44] <sagelap> should just be an option that can be turned off.
[20:45] <yehudasa> g_conf.filestore_flusher
[20:46] <sagelap> yeah. there are two of them tho
[20:47] <yehudasa> I wonder if we're hitting some deadlock that is triggered by the local system config, e.g., max fds per process
[20:48] <sagelap> yeah just turning those off will eliminate that possibility. the performance advantage is dubious anyway
[20:50] <yehudasa> what about 'filestore sync flush'?
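The exchange above raises the possibility that the process is running into a per-process file descriptor limit (about 250 fds were reported open). One way to check that on a Linux host is sketched below, assuming the standard /proc/<pid>/fd and /proc/<pid>/limits layout. The function names are hypothetical helpers written for this illustration; they are not part of Ceph.

```python
import os


def count_open_fds(pid):
    """Count the descriptors a process holds by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_max_open_files(limits_text):
    """Extract (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    A value of 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: 'Max', 'open', 'files', <soft>, <hard>, 'files'
            fields = line.split()
            soft, hard = fields[3], fields[4]

            def to_int(value):
                return None if value == "unlimited" else int(value)

            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def fd_headroom(open_fds, soft_limit):
    """Return how many more descriptors fit under the soft limit (None if unlimited)."""
    if soft_limit is None:
        return None
    return soft_limit - open_fds


def report(pid):
    """Return (open fds, soft limit, headroom) for a process, read from /proc."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    used = count_open_fds(pid)
    return used, soft, fd_headroom(used, soft)
```

Calling report() with the pid of the OSD process (reading another user's /proc entries typically needs root) would show whether the open-fd count is close to the soft limit. If there is plenty of headroom, the fd limit is an unlikely cause of the lockup.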
[20:54] <yehudasa> wido: I'm trying to restart node08, doesn't go down
[21:01] * sagelap (~sage@12.248.40.138) Quit (Ping timeout: 480 seconds)
[21:19] <wido> oh yehudasa, just force it
[21:20] <wido> echo b > /proc/sysrq-trigger
[21:21] <yehudasa> oh, ok
[22:02] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[22:03] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[22:09] * NoahWatk1ns (~jayhawk@kyoto.soe.ucsc.edu) has joined #ceph
[22:13] * sagelap (~sage@12.248.40.138) has joined #ceph
[22:20] * sagelap1 (~sage@32.175.89.250) has joined #ceph
[22:24] * sagelap (~sage@12.248.40.138) Quit (Ping timeout: 480 seconds)
[22:30] * sagelap1 (~sage@32.175.89.250) Quit (Ping timeout: 480 seconds)
[22:49] * NoahWatk1ns (~jayhawk@kyoto.soe.ucsc.edu) Quit (Quit: leaving)
[23:02] * cowbar (bd2ac46502@dagron.dreamhost.com) Quit (Read error: Connection reset by peer)
[23:08] * cowbar (b4e6d78044@dagron.dreamhost.com) has joined #ceph
[23:42] * MarkN (~nathan@59.167.240.178) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.