#ceph IRC Log

IRC Log for 2010-09-15

Timestamps are in GMT/BST.

[0:16] * Brock (~berwin@66-189-196-132.dhcp.yakm.wa.charter.com) Quit (Ping timeout: 480 seconds)
[0:20] * deksai (~deksai@dsl093-003-112.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[0:59] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:19] * Brock (~berwin@66-189-196-132.dhcp.yakm.wa.charter.com) has joined #ceph
[2:27] * greglap (~Adium@166.205.138.112) has joined #ceph
[2:39] * Brock (~berwin@66-189-196-132.dhcp.yakm.wa.charter.com) Quit (Ping timeout: 480 seconds)
[3:16] * greglap (~Adium@166.205.138.112) Quit (Read error: Connection reset by peer)
[3:37] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[3:55] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[4:21] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Remote host closed the connection)
[4:26] * xilei (~xilei@61.135.165.172) has joined #ceph
[4:26] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[4:31] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Quit: Leaving)
[4:31] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[4:39] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[4:43] * Brock (~berwin@66-189-196-132.dhcp.yakm.wa.charter.com) has joined #ceph
[5:37] * Osso (osso@AMontsouris-755-1-7-230.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[5:38] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[5:38] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[6:39] <sage> wido: can you see if the btrfs bug is reproducible? if we're lucky, the offending clone op is triggered by the osd journal replay when cosd starts and it will crash every time. if so, can you point us to the osd(s) and where the machine's kernel sources are so we can debug it?
[7:05] * xilei (~xilei@61.135.165.172) Quit (Ping timeout: 480 seconds)
[7:18] * xilei (~xilei@61.135.165.180) has joined #ceph
[7:31] * allsystemsarego (~allsystem@188.27.166.252) has joined #ceph
[7:52] * xilei (~xilei@61.135.165.180) Quit (Ping timeout: 480 seconds)
[7:56] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:16] * xilei (~xilei@61.135.165.172) has joined #ceph
[8:30] * hijacker (~hijacker@213.91.163.5) has joined #ceph
[8:47] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:36] * darktim (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[9:40] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:49] * Yoric (~David@213.144.210.93) has joined #ceph
[10:00] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[11:37] * Yoric (~David@213.144.210.93) Quit (Remote host closed the connection)
[11:37] * Yoric (~David@213.144.210.93) has joined #ceph
[12:05] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[12:05] * Yoric (~David@213.144.210.93) has joined #ceph
[12:15] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[12:22] * Yoric (~David@213.144.210.93) has joined #ceph
[12:23] * Yoric_ (~David@213.144.210.93) has joined #ceph
[12:23] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[12:23] * Yoric_ is now known as Yoric
[12:25] * Yoric_ (~David@213.144.210.93) has joined #ceph
[12:25] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[12:25] * Yoric_ is now known as Yoric
[12:31] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[12:31] * Yoric (~David@213.144.210.93) has joined #ceph
[13:40] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Quit: WeeChat 0.2.6)
[13:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[13:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[13:59] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[13:59] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[13:59] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:00] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:01] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:01] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:02] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:02] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:02] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:04] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:04] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:05] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:05] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:06] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (reticulum.oftc.net venus.oftc.net)
[14:06] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:07] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Remote host closed the connection)
[14:16] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:51] * ezgreg (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[14:51] * ezgreg (~Greg@78.155.152.6) has joined #ceph
[15:03] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) has joined #ceph
[15:10] * alexxy[home] (~alexxy@79.173.82.178) has joined #ceph
[15:11] * alexxy (~alexxy@79.173.82.178) Quit (Ping timeout: 480 seconds)
[15:33] * xilei (~xilei@61.135.165.172) Quit (Remote host closed the connection)
[16:00] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:03] * f4m8_ (~f4m8@lug-owl.de) has left #ceph
[16:08] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:08] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:16] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:16] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:24] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:24] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:32] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:32] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:40] * allsystemsarego (~allsystem@188.27.166.252) Quit (Quit: Leaving)
[16:40] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:40] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:48] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:48] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:56] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[16:56] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[16:58] * f4m8 (~f4m8@lug-owl.de) has joined #ceph
[16:58] * f4m8 (~f4m8@lug-owl.de) Quit ()
[17:00] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:04] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:04] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:04] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:06] * deksai (~deksai@96-35-100-192.dhcp.bycy.mi.charter.com) Quit (Ping timeout: 480 seconds)
[17:08] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:12] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:13] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:13] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:16] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:16] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:21] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:21] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:24] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:24] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:29] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:29] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:32] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:32] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:33] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[17:33] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[17:33] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit ()
[17:44] * Osso (osso@AMontsouris-755-1-7-230.w86-212.abo.wanadoo.fr) has joined #ceph
[17:53] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[18:08] * Yoric_ (~David@213.144.210.93) has joined #ceph
[18:08] * Yoric (~David@213.144.210.93) Quit (Read error: Connection reset by peer)
[18:08] * Yoric_ is now known as Yoric
[18:25] * Osso (osso@AMontsouris-755-1-7-230.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[18:25] * Osso (osso@AMontsouris-755-1-7-230.w86-212.abo.wanadoo.fr) has joined #ceph
[18:25] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[18:32] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[18:39] * hijacker (~hijacker@213.91.163.5) Quit (Remote host closed the connection)
[18:42] * Yoric (~David@213.144.210.93) Quit (Quit: Yoric)
[18:56] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[18:59] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:22] <yehudasa> wido: are you there?
[19:22] * eternaleye_ (~eternaley@195.190.31.135) has joined #ceph
[19:23] * eternaleye (~eternaley@195.190.31.135) Quit (Read error: Connection reset by peer)
[19:40] <wido> yehudasa: yes!
[19:41] <yehudasa> great!
[19:41] <yehudasa> I want to debug this btrfs issue, but can't log in to your cluster
[19:42] <wido> oh, let me check your key
[19:42] <wido> what node are you trying?
[19:42] <yehudasa> 05, 08
[19:43] <wido> could you send me your public key again?
[19:47] <yehudasa> sent
[19:52] <wido> yehudasa: loaded into 05 and 08
[19:52] <wido> i'll do the rest now
[19:52] <yehudasa> thanks
[19:52] <wido> /proc/sys/kernel/panic is set to 60
[19:52] <wido> so if the machine panics, it reboots
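The setting wido mentions, kernel.panic, decides whether a machine reboots on its own after a panic. The sketch below is not from the log; it only illustrates how a value read from /proc/sys/kernel/panic is interpreted (positive: reboot after that many seconds, zero: no automatic reboot, negative: reboot immediately). The function name describe_panic_timeout is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the log): explain a kernel.panic value.
describe_panic_timeout() {
    timeout="$1"
    if [ "$timeout" -gt 0 ]; then
        echo "reboot ${timeout}s after panic"
    elif [ "$timeout" -eq 0 ]; then
        echo "no automatic reboot"
    else
        echo "reboot immediately"
    fi
}

# Report the current setting when the file is readable (Linux only).
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
fi
```

Called with the value 60 quoted in the log, the helper prints "reboot 60s after panic".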
[19:53] <wido> ok, node12 is down now, i'll let someone reboot it. You could access "logger", "client01" or "client02", those have client.admin permissions
[19:55] <yehudasa> wido: how can I reproduce the problem?
[19:55] <yehudasa> just starting up the osd?
[19:56] <wido> yes, but you should load btrfs first: modprobe btrfs; btrfsctl -a; mount /srv/ceph/osd.4; service ceph start
[19:56] <wido> it's a multi-device btrfs stripe
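The bring-up order wido gives above could be wrapped in a small script, roughly as follows. This is a sketch under assumptions, not something from the log: the run and start_osd helpers and the DRY_RUN switch are hypothetical, and the single-argument mount relies on an /etc/fstab entry for the mount point. Only the four commands and the path /srv/ceph/osd.4 come from the conversation.

```shell
#!/bin/sh
# Sketch of the manual OSD bring-up order quoted in the log.
# DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop at the first step that fails.
start_osd() {
    mountpoint="$1"
    run modprobe btrfs || return 1       # load the filesystem module
    run btrfsctl -a || return 1          # scan for the devices of the multi-device stripe
    run mount "$mountpoint" || return 1  # assumes an /etc/fstab entry for this path
    run service ceph start || return 1   # start the Ceph daemons on this node
}

start_osd /srv/ceph/osd.4
```

Stopping at the first failing step matters here because mounting a multi-device btrfs volume before the device scan has run is likely to fail or find only part of the stripe.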
[19:57] <yehudasa> yeah, I think I crashed node05
[19:57] <wido> problem is, I have a remote syslog, but you can't access it. That's where I found it was a btrfs bug
[19:58] <wido> yes, node05 seems down, but nothing about btrfs in the syslog now :(
[19:59] <wido> yehudasa: node05 is back
[20:00] <yehudasa> great
[20:00] <yehudasa> did that start to happen for no reason, or did you install a new kernel?
[20:01] <yehudasa> some other setup change?
[20:13] <wido> it happened for no reason
[20:13] <wido> osd7 was down for a long time due to a bug, but sage couldn't find it. So I formatted the OSD and brought it up again
[20:13] <wido> the cluster then started to recover, but the OSDs then started to panic
[20:14] <wido> yehudasa: http://pastebin.com/RfitcJfh
[20:15] <yehudasa> yeah, you can actually google the problem and this pastebin will come up
[20:15] <wido> As you can see, node02 had the issue too. But right now it's staying up, except i'm getting "BUG: soft lockup - CPU#1 stuck for 61s!" on that node now
[20:18] <yehudasa> ok, now both 5 and 8 are down
[20:18] <yehudasa> oh, 8 is up again
[20:19] <wido> yes, they reboot after a kernel panic, takes a few minutes
[20:30] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:52] <wido> yehudasa: node12 is back
[20:53] <yehudasa> alright
[21:30] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[21:53] <wido> yehudasa: i'm going afk
[21:53] <wido> any idea yet?
[21:57] <yehudasa> wido: not yet
[22:11] <wido> yehudasa: ok, no problem. The machines should all reboot on panic
[22:11] <wido> kernels are from Ubuntu's kernel PPA
[22:11] <yehudasa> yeah
[22:11] <wido> http://kernel.ubuntu.com/~kernel-ppa/info/kernel-version-map.html
[22:12] <yehudasa> I was distracted with something else, hopefully I'll be able to continue debugging it now
[22:12] <wido> you could do some remote syslogging like I did in /etc/rsyslog.d/60-remote-syslog.conf
[22:13] <yehudasa> ok thanks, I'll look at it
[22:14] <wido> oh, just fixed it :-) logger.ceph.widodh.nl now also accepts remote syslog
[22:15] <wido> i'll push it to the other machines, their messages should appear in /var/log/syslog then on logger.ceph.widodh.nl
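The contents of wido's /etc/rsyslog.d/60-remote-syslog.conf are not shown in the log. As a rough illustration only, a forwarding rule in rsyslog's legacy selector syntax could be generated like this; the forward_rule helper, port 514 and the UDP default are all assumptions, and only the host name logger.ceph.widodh.nl comes from the conversation.

```shell
#!/bin/sh
# Hypothetical helper (not from the log): print an rsyslog forwarding rule.
# In the legacy syntax, "@host" forwards over UDP and "@@host" over TCP.
forward_rule() {
    host="$1"
    port="${2:-514}"    # assumed default syslog port
    proto="${3:-udp}"   # assumed transport
    case "$proto" in
        udp) prefix="@" ;;
        tcp) prefix="@@" ;;
        *) echo "unknown protocol: $proto" >&2; return 1 ;;
    esac
    # "*.*" selects every facility and priority.
    echo "*.* ${prefix}${host}:${port}"
}

forward_rule logger.ceph.widodh.nl
```

A line like the one printed would go into a file under /etc/rsyslog.d/ on each sending node, with the receiving host configured separately to accept remote messages.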
[22:15] <yehudasa> great
[22:16] <wido> ok, done :-) And remember, all the machines use a btrfs stripe; after a reboot the btrfs module should be loaded and btrfsctl -a should be run before mounting
[22:17] <wido> that's all, i
[22:17] <wido> i'm afk
[22:17] <wido> tnx!
[22:17] <yehudasa> thank you.. night!
[22:50] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:20] * Brock (~berwin@66-189-196-132.dhcp.yakm.wa.charter.com) Quit (Read error: Operation timed out)
[23:24] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[23:51] * deksai (~deksai@dsl093-003-018.det1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.