#ceph IRC Log

IRC Log for 2011-10-07

Timestamps are in GMT/BST.

[20:33] -magnet.oftc.net- *** Looking up your hostname...
[20:33] -magnet.oftc.net- *** Checking Ident
[20:33] -magnet.oftc.net- *** No Ident response
[20:33] -magnet.oftc.net- *** Found your hostname
[20:33] * CephLogBot (~PircBot@rockbox.widodh.nl) has joined #ceph
[20:33] <damian_> coo
[22:14] * jojy (~jojyvargh@108.60.121.114) Quit (Quit: jojy)
[22:25] * jojy (~jojyvargh@108.60.121.114) has joined #ceph
[22:32] * verwilst (~verwilst@dD576F54C.access.telenet.be) has joined #ceph
[22:39] * Meths (rift@2.25.193.40) Quit (Quit: )
[22:44] * Meths (rift@2.27.73.252) has joined #ceph
[22:53] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:53] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[22:54] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has left #ceph
[23:09] <df__> Oct 7 18:46:12 vc-fs2 osd.1[5128]: 7f5561277700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f5553059700' had timed out after 60
[23:10] <df__> i'm getting a lot of that, any cause/concern?
[23:10] <df__> Oct 7 18:46:27 vc-fs1 mon.0[20126]: 7f3e42875700 mon.0@0(leader).osd e2170 OSDMonitor::handle_osd_timeouts: last got MOSDPGStat info from osd 1 at 2011-10-07 18:31:25.785995.
[23:10] <df__> It has been 902.028973, so we're marking it down!
[23:10] <df__> then a load of:
[23:10] <df__> Oct 7 18:46:55 vc-fs2 osd.1[5128]: 7f5552858700 osd1 2170 heartbeat_check: no heartbeat from osd0 since 2011-10-07 18:46:27.849379 (cutoff 2011-10-07 18:46:35.100453)
[23:11] <df__> Oct 7 18:46:55 vc-fs2 osd.1[5128]: 7f5552858700 osd1 2170 heartbeat_check: no heartbeat from osd2 since 2011-10-07 18:46:27.959135 (cutoff 2011-10-07 18:46:35.100453)
[23:11] <df__> as if osd.1 doesn't quite know it has been fenced
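(A minimal sketch of the two liveness checks visible in the paste above, with illustrative names and grace values, not Ceph's actual code; the real logic lives in OSDMonitor::handle_osd_timeouts and the OSD heartbeat code. The monitor marks an OSD down after going too long without PG stats from it, while each OSD separately complains about peers it hasn't heard a heartbeat from since a cutoff. The OSD-side check runs on the OSD's own local view, which is why osd.1 can keep logging heartbeat_check lines after the monitor has already marked it down:)

    import time

    MON_OSD_REPORT_GRACE = 900.0  # assumed; consistent with the ~902s age in the log
    OSD_HEARTBEAT_GRACE = 20.0    # assumed OSD-to-OSD heartbeat grace

    def mon_handle_osd_timeouts(last_pgstat_time, now=None):
        # Monitor side: mark down any OSD whose last MOSDPGStat is older
        # than the grace period.
        now = time.time() if now is None else now
        for osd, last in last_pgstat_time.items():
            age = now - last
            if age > MON_OSD_REPORT_GRACE:
                print("last got MOSDPGStat info from osd %d ... "
                      "It has been %.6f, so we're marking it down!" % (osd, age))

    def osd_heartbeat_check(peer_last_heard, now=None):
        # OSD side: warn about any peer not heard from since the cutoff.
        now = time.time() if now is None else now
        cutoff = now - OSD_HEARTBEAT_GRACE
        for peer, last in peer_last_heard.items():
            if last < cutoff:
                print("heartbeat_check: no heartbeat from osd%d since %f "
                      "(cutoff %f)" % (peer, last, cutoff))
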
[23:12] <gregaf> sjust and sagewk probably want to see that
[23:13] <sjust> hmm
[23:13] <sjust> anything in dmesg?
[23:13] <df__> from around the time of the "event", no
[23:14] <sjust> you probably didn't have filestore debugging on, right?
[23:14] * jojy (~jojyvargh@108.60.121.114) Quit (Read error: Connection reset by peer)
[23:14] * jojy (~jojyvargh@108.60.121.114) has joined #ceph
[23:15] <df__> no extra debugging beyond the normal messages
[23:15] <sjust> can you post the log?
[23:16] <df__> sure, just filtering it
[23:25] <df__> ftp://ftp.kw.bbc.co.uk/davidf/priv/ceph/emaiT5Sa.txt
[23:25] <df__> you may want to filter the 'journal throttle: waited for bytes' message
[23:27] <df__> the log starts with the first instance of the timed out message. however, everything is working at that point
[23:27] <df__> 3 osd's up. osd.2 had been cleared and the whole system was resyncing
[23:28] <df__> cleared = formatted ext4
[23:29] <sjust> what sort of workload was running?
[23:29] <df__> s/osd.2/osd.1
[23:30] * df__ grumbles at vc-fs$N hosting osd.$N-1
[23:30] <sjust> heh
[23:30] <df__> no client workload, just the resync/rebuild of osd.1
[23:31] <sjust> ok, I see
[23:31] <df__> what term do you use for that (eg, resilver for zfs)
[23:31] <sjust> recovery
[23:31] <df__> ok, osd.1 was recovering at a mean speed of 160MB/sec
[23:32] <df__> (main limit is the journal, which peaks at about 180MB/sec to 200MB/sec)
[23:32] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) Quit (Read error: Connection reset by peer)
[23:33] <df__> need to get some more SSDs for that (since the underlying array can do 1.5GB/sec raw, 1GB/sec md raid6 and 700-800MB/sec ext4, [streaming writes])
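(The "journal throttle: waited for bytes" messages mentioned earlier are the OSD journal's byte throttle at work: writers block once the bytes queued for the journal exceed a cap, which is why the ~180-200MB/sec journal, rather than the much faster underlying array, sets the ceiling for recovery. A minimal illustrative sketch, not Ceph's actual Throttle code:)

    import threading

    class ByteThrottle:
        # Illustrative byte throttle; max_bytes stands in for the cap on
        # journal bytes in flight.
        def __init__(self, max_bytes):
            self.max_bytes = max_bytes
            self.current = 0
            self.cond = threading.Condition()

        def take(self, nbytes):
            # Writer side: block until nbytes fit under the cap. Each wait
            # here corresponds to a "journal throttle: waited for bytes"
            # style log line.
            with self.cond:
                while self.current + nbytes > self.max_bytes:
                    self.cond.wait()
                self.current += nbytes

        def put(self, nbytes):
            # Journal-commit side: release bytes and wake blocked writers.
            with self.cond:
                self.current -= nbytes
                self.cond.notify_all()
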
[23:35] * Dantman (~dantman@S010600259c4d54ff.vs.shawcable.net) has joined #ceph
[23:35] <sjust> monitor messages indicate that one OSD is marked down, was that OSD1 (fs2)?
[23:35] * jojy (~jojyvargh@108.60.121.114) Quit (Read error: Connection reset by peer)
[23:35] * jojy (~jojyvargh@108.60.121.114) has joined #ceph
[23:43] * sjust (~sam@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[23:44] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[23:45] <df__> yes, afaict
[23:45] <df__> btw, is there a status command that lists the known state of everything, ie which osd's are in, which monitors, etc.?
[23:45] <sjust> it would be helpful to have logs from before the disk_tp timed out message started
[23:46] <sjust> ceph osd dump -o -
[23:46] <sjust> will dump the current osd state
[23:46] <sjust> ceph pg dump -o - will dump current pg state
[23:46] <sjust> not sure about monitors
[23:46] <df__> osd1 down
[23:46] <sjust> though ceph -s will tell you how many are up
[23:46] <sjust> I think
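(For reference, the status commands from this exchange, in the old-style ceph CLI of the time; "-o -" should write the dump to stdout:)

    ceph osd dump -o -   (current OSD map: which OSDs exist and are up/down, in/out)
    ceph pg dump -o -    (current PG state)
    ceph -s              (one-shot cluster status summary, including how many OSDs are up)
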
[23:55] <df__> ok, just grabbing all of the log for today
[23:55] <sjust> cool
[23:56] <df__> nb, that will include a load of kicking things this morning, from when i reinitialised osd.1
[23:56] <sjust> that's ok
[23:59] <df__> and vc-fs3 had died of its own accord overnight, i restarted it at 09:39:02, ftp://ftp.kw.bbc.co.uk/davidf/priv/ceph/Xaitegh7.txt
[23:59] <df__> 09:21:55 is when i mounted the filesystem for osd.1

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.