#ceph IRC Log

Index

IRC Log for 2010-07-19

Timestamps are in GMT/BST.

[5:20] * pcish (dc8729c2@ircip1.mibbit.com) has joined #ceph
[5:49] <pcish> hello, we're getting an assert failed on our OSDs: osd/PG.cc:1833: FAILED assert(recovering_oids.count(soid) == 0)
[5:50] <pcish> judging from the surrounding debug code, it looks like someone ran into the same problem before
[7:00] * f4m8_ is now known as f4m8
[7:13] * Osso_ (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) Quit (Quit: Osso_)
[7:14] * Osso (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) has joined #ceph
[7:15] * Osso (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) Quit ()
[7:37] * DJCapelis (~djc@capelis.dj) Quit (Remote host closed the connection)
[7:46] * DJCapelis (~djc@capelis.dj) has joined #ceph
[9:02] * andret (~andre@pcandre.nine.ch) has joined #ceph
[9:25] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[9:44] * allsystemsarego (~allsystem@188.27.164.220) has joined #ceph
[10:16] * s15y2 (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[11:36] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[12:51] <f4m8> now my setup of rbd on 2.6.35-rc3+ is running.
[12:52] <f4m8> but doing a simple "rbdtool --list" fills the screen with "10.07.19 12:52:13.908381 b52d2b70 client6603.objecter pg 3.8c1c on [] is laggy: 1"
[12:53] <f4m8> this should have something to do with the output of "# ceph -s" -> "10.07.19 12:52:42.223546 mds e96: 1/1/1 up, 3 up:standby(laggy or crashed), 1 up:standby, 1 up:replay"
[12:59] <f4m8> # ceph -s | grep -v " -- " leads to: http://paste.debian.net/80871
[13:01] <f4m8> What does "mds e96: 1/1/1 up, 3 up:standby(laggy or crashed), 1 up:standby, 1 up:replay" exactly mean with "replay"?
[13:06] <f4m8> there is only one monitor defined, two mds and two osds.
[13:12] <f4m8> my ceph.conf: http://paste.debian.net/80873/
[13:53] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:54] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit ()
[15:46] * f4m8 is now known as f4m8_
[16:20] * greglap1 (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[16:38] * Osso (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) has joined #ceph
[17:16] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[17:34] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[18:08] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) has joined #ceph
[18:08] * pfo (~pfo@srv.gmi.oeaw.ac.at) has joined #ceph
[18:17] <gregaf> f4m8: it looks like something's been messing with your MDS daemons
[18:17] <gregaf> "mds e96: 1/1/1 up, 3 up:standby(laggy or crashed), 1 up:standby, 1 up:replay"
[18:18] <gregaf> means it's on MDS map epoch 96 (so there's been a lot of changes happening)
[18:18] <gregaf> 1/1/1 up means that you have 1 up (operational) and 1 in (held responsible for data) out of 1 mds max
[18:18] <gregaf> 3 up:standby(laggy or crashed) means you have 3 instances that have probably died
[18:19] <gregaf> 1 up:standby means you have one that's waiting to take over as needed
[18:19] <gregaf> 1 up:replay means you have one replaying logs from a previous crash
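As a rough illustration of the explanation above (this is not Ceph code, just a hypothetical parser for the status line format), the "eNN" epoch and the "a/b/c up" triple can be pulled apart like this:

```python
import re

# Sketch only: decode the MDS summary line from "ceph -s", e.g.
#   "mds e96: 1/1/1 up, 3 up:standby(laggy or crashed), 1 up:standby, 1 up:replay"
# Per the discussion above, the a/b/c triple is:
#   a = MDSes up (operational), b = MDSes "in" (held responsible for metadata),
#   c = max_mds (the configured maximum number of active MDSes).
def parse_mds_status(line):
    m = re.search(r"mds e(\d+): (\d+)/(\d+)/(\d+) up", line)
    if not m:
        raise ValueError("not an mds status line: %r" % line)
    epoch, up, inn, max_mds = (int(g) for g in m.groups())
    return {"epoch": epoch, "up": up, "in": inn, "max_mds": max_mds}

status = parse_mds_status(
    "mds e96: 1/1/1 up, 3 up:standby(laggy or crashed), 1 up:standby, 1 up:replay")
# e.g. status["epoch"] is 96; up, in, and max_mds are all 1 here,
# so one MDS is active and holds all metadata.
```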
[18:19] <gregaf> my guess is that your MDSes have been crashing and something's been restarting them (though I don't know what)
[18:20] <gregaf> but they've been getting different addresses so they're regarded as different instances by the system (which should be fine)
[18:21] <gregaf> if you actually want the two MDSes you have defined to both get metadata, you'll need to run "ceph mds set_max_mds 2"
[18:21] <gregaf> otherwise one will hold all the data and the other will watch for it to crash and then take over :)
[18:22] <gregaf> if you're getting messages that pgs are laggy, though, that's from OSDs being slow to respond to requests
[18:23] <gregaf> either they're dead or they're getting too overloaded, maybe as a result of the MDS logs being read off?
[18:23] <gregaf> f4m8_: let me know if you need more info/help :)
[18:24] <gregaf> pcish: can you give us more info about the workload and what's happening in the system when you hit that error?
[18:26] <gregaf> darkfader: journals can use up space inside the FS if they're on the same partition
[18:26] <gregaf> plus the MDS does journaling and that goes on the OSDs, but it's capped at a maximum size
[18:43] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:47] * fzylogic (~fzylogic@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[19:15] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[20:48] * pfo (~pfo@srv.gmi.oeaw.ac.at) Quit (Ping timeout: 480 seconds)
[21:05] * pfo (~pfo@chello084114049188.14.vie.surfer.at) has joined #ceph
[21:15] * pfo (~pfo@chello084114049188.14.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[21:33] * Osso_ (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) has joined #ceph
[21:33] * Osso (osso@AMontsouris-755-1-9-31.w90-46.abo.wanadoo.fr) Quit (Remote host closed the connection)
[21:33] * Osso_ is now known as Osso
[22:30] * allsystemsarego (~allsystem@188.27.164.220) Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.