#ceph IRC Log

IRC Log for 2011-11-23

Timestamps are in GMT/BST.

[1:14] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Read error: Connection reset by peer)
[1:26] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[2:04] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:13] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Read error: Connection reset by peer)
[2:34] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[2:36] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[2:37] * yanzheng (~zhyan@134.134.139.76) has joined #ceph
[2:38] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[2:57] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Read error: Operation timed out)
[3:01] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[3:13] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[4:10] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[4:19] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:01] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[8:03] * yanzheng (~zhyan@134.134.139.76) Quit (Remote host closed the connection)
[8:16] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[9:08] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:57] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Remote host closed the connection)
[10:02] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[10:10] * yanzheng (~zhyan@134.134.139.76) has joined #ceph
[11:25] * yanzheng (~zhyan@134.134.139.76) Quit (Remote host closed the connection)
[11:25] <psomas> what's the difference between a pg name (ie 2.2, 2.1c etc) and pgid i see in the logs?
[11:50] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Ping timeout: 480 seconds)
[12:47] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[12:47] <psomas> sjust: i went through the logs, and for the stuck pg i see this
[12:47] <psomas> "wait for cleanup"
[12:47] <psomas> for every other pg scrub, i get clean up scrub
[12:48] <psomas> then i get a sub_op_scrub_map
[12:48] <psomas> it gets the peer osd scrub map, but while all the other replicas enqueue scrub_finalize, in the stuck replica this doesn't seem to happen
[12:48] <psomas> s/replicas/pg
[12:51] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:52] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:53] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:53] * Nightdog (~karl@190.84-48-62.nextgentel.com) has joined #ceph
[12:53] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:54] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:55] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:55] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:55] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:58] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:58] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:58] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[12:59] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:07] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[13:07] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:24] * gregorg_taf (~Greg@78.155.152.6) Quit (Read error: Connection reset by peer)
[14:00] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[14:14] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:15] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit ()
[14:23] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[14:30] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[14:32] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[14:47] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[14:49] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:23] * Olivier_bzh (~langella@xunil.moulon.inra.fr) has joined #ceph
[15:24] <Olivier_bzh> hi everybody
[15:24] <Olivier_bzh> I am a beginner with ceph
[15:24] <Olivier_bzh> i've installed it last week
[15:25] <Olivier_bzh> and played a little, it sounds great
[15:25] <Olivier_bzh> but I've made some mistakes :
[15:25] <Olivier_bzh> On my 3 servers, I have stopped 2 of them
[15:26] <Olivier_bzh> then I tried to restart and I have :
[15:26] <Olivier_bzh> ceph health
[15:26] <Olivier_bzh> 2011-11-23 15:26:28.593636 mon <- [health]
[15:26] <Olivier_bzh> 2011-11-23 15:26:28.593980 mon.2 -> 'HEALTH_WARN 402 pgs degraded, 748651/1497302 degraded (50.000%)' (0)
[15:26] <Olivier_bzh> it is bad I guess...
[15:27] <Olivier_bzh> I've read about degraded pgs on the mailing list
[15:27] <Olivier_bzh> but I don't really understand what is going on...
[15:27] <Olivier_bzh> If someone could help me, I would be very happy
[15:29] <failbaitr> I'm guessing the last live ceph machine considers the stack degraded, as not all files were online anymore
[15:29] <failbaitr> not sure how to fix that though
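
The health line pasted above carries its own arithmetic: 748651 out of 1497302 is exactly one half. A minimal Python sketch that pulls the two counts out of a string shaped like that one and recomputes the percentage follows. The function name `parse_degraded` is hypothetical, and reading the denominator as a count of object copies is an assumption, not something the log confirms.

```python
import re


def parse_degraded(health_line):
    """Return (degraded, total, fraction) from a line like the one pasted above."""
    # Matches the "748651/1497302 degraded" part, not the "402 pgs degraded" part.
    m = re.search(r"(\d+)/(\d+) degraded", health_line)
    if m is None:
        raise ValueError("no degraded counts found")
    degraded, total = int(m.group(1)), int(m.group(2))
    return degraded, total, degraded / total


line = "HEALTH_WARN 402 pgs degraded, 748651/1497302 degraded (50.000%)"
degraded, total, fraction = parse_degraded(line)
print(f"{degraded} of {total} degraded ({fraction:.3%})")
```

An exact 50% would be consistent with two copies of every object and one copy of each unavailable, but that is only one possible reading.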
[15:34] <Olivier_bzh> ok
[15:35] <Olivier_bzh> for info, a mount -t ceph 192.168.4.1:6789:/ /mnt/osd -vv
[15:35] <Olivier_bzh> works well
[15:35] <Olivier_bzh> I can browse directories
[15:35] <Olivier_bzh> get files...
[15:35] <Olivier_bzh> but df -h
[15:36] <Olivier_bzh> gives inconsistent sizes for /mnt/osd
[15:52] <Olivier_bzh> looking at the wiki, monitor commands section, I've tried :
[15:52] <Olivier_bzh> ceph osd repair *
[15:52] <Olivier_bzh> but I get :
[15:52] <Olivier_bzh> 2011-11-23 15:49:27.432035 mon <- [osd,repair,client.admin.log,lost+found,mds.gorgone0.log,mds.gorgone1.log,mds.gorgone2.log,mon.gorgone0.log,mon.gorgone1.log,mon.gorgone2.log,osd.0.log,osd.1.log,osd.2.log,osd.admin.log,osd.-c.log]
[15:52] <Olivier_bzh> 2011-11-23 15:49:27.433005 mon.2 -> 'unknown command repair' (-22)
[15:52] <Olivier_bzh> perhaps the syntax has changed
[15:53] <Olivier_bzh> I'm using ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
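
The argument list the monitor echoed back (osd, repair, then a run of *.log names and lost+found) suggests that the shell expanded the unquoted `*` against the current directory before ceph ever saw it. The sketch below imitates that expansion in Python so the effect can be seen in isolation. `expand_like_shell` is a hypothetical helper, the file names are a subset of those in the paste, and it only approximates real shell globbing.

```python
import glob
import os
import tempfile


def expand_like_shell(words, cwd):
    """Expand glob words against cwd, roughly as an unquoted shell word would be."""
    expanded = []
    for word in words:
        matches = sorted(glob.glob(os.path.join(glob.escape(cwd), word)))
        if matches:
            # A pattern that matches is replaced by the matching file names.
            expanded.extend(os.path.basename(m) for m in matches)
        else:
            # A word that matches nothing is passed through unchanged.
            expanded.append(word)
    return expanded


with tempfile.TemporaryDirectory() as log_dir:
    for name in ("osd.0.log", "osd.1.log", "mon.gorgone0.log"):
        open(os.path.join(log_dir, name), "w").close()
    unquoted = expand_like_shell(["osd", "repair", "*"], log_dir)

print(unquoted)
```

Quoting the asterisk on the command line would keep the shell from expanding it. Whether this ceph version accepts the command at all is a separate question that the 'unknown command repair' reply leaves open.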
[16:25] * The_Bishop (~bishop@port-92-206-183-175.dynamic.qsc.de) Quit (Ping timeout: 480 seconds)
[16:51] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[16:51] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[17:05] <Olivier_bzh> hi again,
[17:06] <Olivier_bzh> I am trying to understand what "out" and "down" mean for osds, but I don't see it in the wiki or documentation, is there a document about that ?
[17:07] <Olivier_bzh> or does someone have a clue ?
[17:07] * grape (~grape@108-69-70-124.lightspeed.mdsnwi.sbcglobal.net) has joined #ceph
[17:10] * The_Bishop (~bishop@port-92-206-183-175.dynamic.qsc.de) has joined #ceph
[17:21] <psomas> Olivier_bzh: down means, well down, not accessible, out means that the osd is not part of data 'redistribution'
[17:22] <psomas> if i'm not mistaken, if you mark an osd out, then the pgs mapped to that osd will be remapped or sth
[17:23] <Olivier_bzh> thank you for this info... I'm stuck there with 2 osd out and down
[17:23] <Olivier_bzh> I've tried :
[17:23] <Olivier_bzh> ceph osd up 0
[17:24] <psomas> is ceph-osd daemon running on the nodes for those two osds?
[17:24] <Olivier_bzh> ps -ea | grep ceph
[17:24] <Olivier_bzh> 6143 ? 00:00:00 ceph-watch-noti
[17:24] <Olivier_bzh> 8157 ? 00:00:01 ceph-mon
[17:24] <Olivier_bzh> 8304 ? 00:00:00 ceph-mds
[17:24] <Olivier_bzh> it seems not
[17:25] <Olivier_bzh> I've tried to restart with :
[17:25] <psomas> when ceph-osd 'dies', it's marked down, and after some time, it'll get marked out, and its data will be remapped to different osds
[17:25] <Olivier_bzh> ok...
[17:25] <psomas> but if you stop two of them, and then they both are marked as out, if you use replication level 2, then the data replicated between those two osds, will be lost probably
[17:26] <psomas> anyway, service ceph start osd on each node, will bring them up probably
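
The down and out states psomas describes can be pictured as two independent flags plus a timer. The following is a toy Python model of that description only, not of Ceph's actual implementation: the class `Osd`, its methods, and the 300-second `DOWN_OUT_INTERVAL` are all hypothetical, and what happens when a daemon comes back is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

DOWN_OUT_INTERVAL = 300.0  # hypothetical grace period, in seconds


@dataclass
class Osd:
    up: bool = True                     # daemon is reachable
    in_map: bool = True                 # osd takes part in data placement
    down_since: Optional[float] = None  # time the daemon was seen to die

    def daemon_died(self, now):
        """The daemon stops responding: the osd is marked down at once."""
        self.up = False
        self.down_since = now

    def tick(self, now):
        """After the grace period a down osd is also marked out."""
        if not self.up and now - self.down_since >= DOWN_OUT_INTERVAL:
            self.in_map = False  # its pgs would now be remapped elsewhere


osd = Osd()
osd.daemon_died(now=0.0)
osd.tick(now=10.0)
early = (osd.up, osd.in_map)   # down, but still in
osd.tick(now=300.0)
late = (osd.up, osd.in_map)    # down and out
print(early, late)
```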
[17:26] <Olivier_bzh> ok, that's the weird thing I've made ;-)
[17:27] <Olivier_bzh> unfortunately I've tried :
[17:27] <Olivier_bzh> /usr/local/src/ceph-0.38/src/init-ceph -c /etc/ceph/ceph.conf start
[17:27] <Olivier_bzh> on my out nodes
[17:27] <Olivier_bzh> but ceph-mon is still down and out
[17:27] <Olivier_bzh> ceph-osd sorry
[17:29] <psomas> hm, i've never used init-ceph
[17:31] <Olivier_bzh> well thank you for the info
[17:31] <Olivier_bzh> I'll keep struggling a bit to get my data back, but it is hopefully not too precious
[17:32] <Olivier_bzh> and it was my mistake
[17:32] <psomas> hm, now that i think it over, if you bring up the osds, you'll probably get your data back
[17:33] <Olivier_bzh> looking at the osd.0.log
[17:34] <Olivier_bzh> I've bad news : os/FileStore.cc: 2426: FAILED assert(0 == "unexpected error")
[17:34] <Olivier_bzh> ceph-osd tries to start but it fails
[17:35] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:37] <Olivier_bzh> I've to leave, thanks and see you tomorrow ;-)
[17:37] <Olivier_bzh> bye everybody
[17:37] * Olivier_bzh (~langella@xunil.moulon.inra.fr) has left #ceph
[17:50] * grape (~grape@108-69-70-124.lightspeed.mdsnwi.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[17:57] * ee (~Guest@81-178-167-247.dsl.pipex.com) has joined #ceph
[17:57] * ee (~Guest@81-178-167-247.dsl.pipex.com) has left #ceph
[18:02] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[18:05] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:37] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:50] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has left #ceph
[18:50] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[18:55] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[19:04] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:08] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:23] * sageph (~yaaic@mf32736d0.tmodns.net) has joined #ceph
[19:26] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[19:49] * verwilst (~verwilst@dD576F767.access.telenet.be) has joined #ceph
[20:06] <damoxc> sagewk: hey are you about?
[20:09] <gregaf> damoxc: he's left for vacation, although it looks like he's online as sageph right now...
[20:12] <damoxc> gregaf: ah okay, was just going to give him the logs from my cluster that's trying to communicate on port 0
[20:16] <gregaf> cool
[20:16] <gregaf> I think he managed to reproduce locally after a lot of effort, actually, but more is always better ;)
[20:16] * sageph (~yaaic@mf32736d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[20:16] <damoxc> always :-) I've uploaded them to http://damoxc.net/osd.{0,1,2,3,4}.log.1.gz
[20:18] <damoxc> do you know if yehudasa_ had any luck with my rbd issue?
[20:33] <yehudasa_> damoxc: I was sidetracked, won't have much time for it in the next few days
[20:38] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[20:47] <damoxc> yehudasa_: that's cool, just wanted to know if you needed any more info / logs
[20:47] <Tv> joshd: this is what made me think that state machine in teuth is suspect, before i saw your problem: http://tracker.newdream.net/issues/1744
[20:48] <Tv> joshd: it's as if an osd was writing objects.. note how it says nothing after "Shutting down osd"
[20:48] <Tv> it's as if it's not in the data structure
[20:49] <joshd> yeah, that looks like exactly the same problem
[20:50] * grape (~grape@108-69-70-124.lightspeed.mdsnwi.sbcglobal.net) has joined #ceph
[21:03] * cp (~cp@74.85.19.35) has joined #ceph
[21:04] * cp_ (~cp@206.15.24.21) has joined #ceph
[21:11] * cp (~cp@74.85.19.35) Quit (Ping timeout: 480 seconds)
[21:11] * cp_ is now known as cp
[21:24] * cp_ (~cp@74.85.19.35) has joined #ceph
[21:30] * cp (~cp@206.15.24.21) Quit (Ping timeout: 480 seconds)
[21:30] * cp_ is now known as cp
[21:41] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:48] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:55] * grape (~grape@108-69-70-124.lightspeed.mdsnwi.sbcglobal.net) Quit (Ping timeout: 480 seconds)
[22:15] <Tv> gregaf: python -c 'import os; print os.statvfs(".")'
[22:16] * cp_ (~cp@206.15.24.21) has joined #ceph
[22:17] <Tv> gregaf: care to copy-paste your variable def and ioctl call?
[22:17] <gregaf> if (ioctl(fd, BLKGETSIZE, &size_blks) != 0) {
[22:17] <gregaf> perror("getting blocksize");
[22:17] <gregaf> exit(1);
[22:17] <gregaf> }
[22:17] <gregaf> fprintf(stderr, "Got blocksize %i\n", size_blks);
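
For comparison with the C snippet above, here is a rough Python sketch of the same ioctl. On Linux, BLKGETSIZE reports the device size as an unsigned long count of 512-byte sectors, so the value is a size rather than a block size and may not fit a plain int. The variable definition Tv asked for is not shown in the log, so the type gregaf used is unknown. The function names are hypothetical, and `device_size_bytes` needs a real block device and enough privileges to open it.

```python
import fcntl
import os
import struct

BLKGETSIZE = 0x1260  # _IO(0x12, 96) from <linux/fs.h>
SECTOR_SIZE = 512    # BLKGETSIZE counts 512-byte sectors


def sectors_to_bytes(sectors):
    """Convert a BLKGETSIZE sector count to bytes."""
    return sectors * SECTOR_SIZE


def device_size_bytes(path):
    """Ask the kernel for the size of the block device at path (Linux only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Pass a buffer the size of a native unsigned long; the kernel fills it.
        buf = fcntl.ioctl(fd, BLKGETSIZE, struct.pack("L", 0))
    finally:
        os.close(fd)
    (sectors,) = struct.unpack("L", buf)
    return sectors_to_bytes(sectors)
```

A call would look like `device_size_bytes("/dev/sdX")`, with the device path being a placeholder.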
[22:17] * cp (~cp@74.85.19.35) Quit (Read error: Operation timed out)
[22:17] * cp_ is now known as cp
[23:14] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[23:40] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[23:41] * ajm (adam@adam.gs) has left #ceph
[23:44] * verwilst (~verwilst@dD576F767.access.telenet.be) Quit (Quit: Ex-Chat)
[23:49] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[23:59] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.