#ceph IRC Log


IRC Log for 2011-11-02

Timestamps are in GMT/BST.

[0:55] * cp (~cp@ Quit (Quit: cp)
[1:02] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:50] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:02] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:08] * yehudasa_ (~yehudasa@aon.hq.newdream.net) has joined #ceph
[2:08] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[2:09] * joshd1 (~joshd@aon.hq.newdream.net) has joined #ceph
[2:13] * joshd (~joshd@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:14] * gregaf (~Adium@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:14] * sagewk (~sage@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:15] * sjust (~sam@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:15] * yehudasa (~yehudasa@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:15] * sagewk (~sage@aon.hq.newdream.net) has joined #ceph
[2:19] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[2:22] * joshd1 (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:15] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has left #ceph
[3:21] * sagelap (~sage@mc85536d0.tmodns.net) has joined #ceph
[5:08] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) Quit (Quit: Leaving)
[5:24] * sagelap (~sage@mc85536d0.tmodns.net) Quit (Read error: Connection reset by peer)
[5:26] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[6:13] * grape (~grape@ Quit (Read error: Connection reset by peer)
[6:13] * grape (~grape@ has joined #ceph
[6:58] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[7:10] * aneesh (~aneesh@ Quit (Quit: Coyote finally caught me)
[7:31] * aneesh (~aneesh@ has joined #ceph
[7:31] * aneesh (~aneesh@ Quit (Remote host closed the connection)
[7:31] * aneesh (~aneesh@ has joined #ceph
[7:59] * tserong (~tserong@58-6-102-149.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[8:09] * tserong (~tserong@124-171-112-21.dyn.iinet.net.au) has joined #ceph
[8:30] * gregorg (~Greg@ has joined #ceph
[10:14] * fronlius (~Adium@testing78.jimdo-server.com) has joined #ceph
[10:59] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:24] * nhm (~mark@penguin.msi.umn.edu) Quit (Remote host closed the connection)
[11:30] * tserong (~tserong@124-171-112-21.dyn.iinet.net.au) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * grape (~grape@ Quit (synthon.oftc.net larich.oftc.net)
[11:30] * gregaf1 (~Adium@aon.hq.newdream.net) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * eternaleye_____ (~eternaley@ Quit (synthon.oftc.net larich.oftc.net)
[11:30] * Meths (rift@ Quit (synthon.oftc.net larich.oftc.net)
[11:30] * nolan (~nolan@phong.sigbus.net) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * conner (~conner@leo.tuc.noao.edu) Quit (synthon.oftc.net larich.oftc.net)
[11:30] * psomas (~psomas@inferno.cc.ece.ntua.gr) Quit (synthon.oftc.net larich.oftc.net)
[11:31] * eternaleye_____ (~eternaley@ has joined #ceph
[11:34] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[11:35] * grape (~grape@ has joined #ceph
[11:35] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[11:36] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[11:37] * conner (~conner@leo.tuc.noao.edu) has joined #ceph
[11:38] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[11:42] * tserong (~tserong@124-171-112-21.dyn.iinet.net.au) has joined #ceph
[11:47] * Meths (rift@ has joined #ceph
[13:24] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[14:45] * sagelap (~sage@ has joined #ceph
[14:49] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[15:05] * sagelap1 (~sage@ has joined #ceph
[15:05] * sagelap (~sage@ Quit (Read error: Connection reset by peer)
[15:18] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Quit: Ex-Chat)
[15:30] * sagelap1 (~sage@ Quit (Ping timeout: 480 seconds)
[15:35] * Zipo34 (~lpro@iut.iutbeziers.univ-montp2.fr) has joined #ceph
[15:37] * Zipo34 (~lpro@iut.iutbeziers.univ-montp2.fr) Quit ()
[15:41] * Iribaar (~Iribaar@ Quit (Ping timeout: 480 seconds)
[15:45] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:52] * grape (~grape@ Quit (Remote host closed the connection)
[15:53] * Iribaar (~Iribaar@ has joined #ceph
[16:06] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) has joined #ceph
[16:11] * grape (~grape@ has joined #ceph
[16:37] * gregaf (~Adium@aon.hq.newdream.net) has left #ceph
[16:37] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[16:42] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[17:00] * adjohn (~adjohn@70-36-139-78.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[17:23] * grape (~grape@ Quit (Read error: Connection reset by peer)
[17:29] * sagelap (~sage@ has joined #ceph
[17:45] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:01] * votz_ (~votz@pool-108-52-121-103.phlapa.fios.verizon.net) has joined #ceph
[18:05] * Iribaar (~Iribaar@ Quit (Ping timeout: 480 seconds)
[18:07] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:08] * votz (~votz@pool-108-52-121-103.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[18:17] * Iribaar (~Iribaar@ has joined #ceph
[18:21] * sagelap (~sage@ has left #ceph
[18:22] * fronlius (~Adium@testing78.jimdo-server.com) Quit (Quit: Leaving.)
[18:28] * votz__ (~votz@pool-108-52-121-103.phlapa.fios.verizon.net) has joined #ceph
[18:35] * votz_ (~votz@pool-108-52-121-103.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[18:50] * adjohn (~adjohn@ has joined #ceph
[19:39] * fronlius (~Adium@e182095035.adsl.alicedsl.de) has joined #ceph
[19:45] * slb (~slb@gateway.ash.thebunker.net) has joined #ceph
[19:48] <slb> afternoon. looking for some advice on how to recover a small test cluster which I seem to have broken.
[19:48] <Tv> slb: depends a lot on how it broke
[19:49] <slb> symptoms are it never recovers to the point of us being able to mount it. and 3 of the 6 ceph-osd processes get stuck using 100% CPU.
[19:49] <Tv> slb: do the log files have anything interesting?
[19:54] <slb> i tried increasing the osd verbosity to 20. it didn't report anything that looked broken to me. just some osd_ping and osd map distribution stuff. just trying again now.
[19:55] * adjohn (~adjohn@ Quit (Quit: adjohn)
[19:55] <Tv> slb: what does "ceph -s" say?
[19:55] <sjust> slb: what version are you running?
[19:56] <slb> sjust: 0.37-1~bpo70+1
[19:57] <slb> Tv: http://pastebin.com/jzm1n6qA
[19:58] <Tv> slb: so two of your osds aren't getting up; what do their logs say?
[19:58] <sjust> could you paste the output of 'ceph osd dump -o -'?
[19:58] <slb> Tv: the number of osds is 4 now as I just restarted one of the 100% cpu ones, it will go back down to 3 in a few minutes
[19:59] <slb> sjust: http://pastebin.com/CYSm2Kf6
[19:59] <slb> tell me if you'd rather I paste directly rather than via pastebin
[20:00] <sjust> either way, can you ramp osd and filestore debugging on osd.2 up to 25 and restart it?
[20:03] <Tv> i think this irc network will kick you if you paste too much in a channel, so pastebin is safer
[20:07] <slb> sjust: done, it's now at 100% CPU and looping through what look to be the same 8-10 log entries
[20:07] <sjust> ah...
[20:07] <sjust> can you post some of the log?
[20:10] <slb> at the beginning there are some:
[20:10] <slb> 2011-11-02 19:04:53.131884 7f7df47e5720 osd.2 1062 pg[0.7( v 499'7542 (497'7538,499'7542] n=120 ec=2 les/c 1061/1062 1060/1060/1060) [] r=0 (info mismatch, log(497'7538,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log 0 499'7541 (499'7540) m 10000000012.0000013f/head by client.6923.0:353904 2011-10-21 11:27:18.189730
[20:10] <slb> actually, quite a lot of them
[20:11] <Tv> "mismatch" is never a nice word :(
[20:12] <psomas> Is there a way to set-up a local network which will be used only for osd<->osd traffic (for the replication etc traffic), and use another interface for the client<->osd traffic?
[20:12] <slb> sjust: http://pastebin.com/LmHV2RE2 (last 25 lines of osd log)
[20:12] <Tv> psomas: yes, hold on looking for the name of the right config option
[20:12] <psomas> Tv: if it's in the docs, i can look it up
[20:12] <Tv> psomas: you should find it in the wiki
[20:14] <Tv> psomas: public_addr, cluster_addr
[20:14] <psomas> btw, for a test setup, we'll be running the mon servers on the same machines as some of the osds, should we expect any performance impact?
[20:14] <Tv> psomas: it doesn't look like it's documented, sorry about that..
[20:14] <Tv> psomas: those take ip addresses, in the [osd.foo] section in the config
[20:15] <Tv> psomas: mon is very lightweight
[20:15] <psomas> kk, thanks :0
[20:15] <psomas> :)
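
Tv's answer about separating client and replication traffic can be sketched as a config fragment. The Python below is only an illustration under stated assumptions: it renders an [osd.N] section containing the two option names Tv gave (public_addr, cluster_addr). The helper name render_osd_section, the section name osd.0, and both IP addresses are made-up placeholders, not values from this conversation or from the Ceph documentation.

```python
import configparser
import io


def render_osd_section(osd_id, public_ip, cluster_ip):
    """Render a ceph.conf-style [osd.N] section as text (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg[f"osd.{osd_id}"] = {
        # Option names as given by Tv in the channel.
        "public_addr": public_ip,    # assumed: address used for client<->osd traffic
        "cluster_addr": cluster_ip,  # assumed: address used for osd<->osd traffic
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Placeholder addresses, chosen only for illustration.
print(render_osd_section(0, "192.0.2.10", "10.0.0.10"))
```

Each OSD would get its own section with its own pair of addresses; whether the monitors need matching settings is not covered in the discussion above.
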
[20:18] <psomas> and one more question, if we use ext4 as the osd fs, is there any specific journalling mode we should choose?
[20:19] <Tv> psomas: outside of bugs, defaults should be good; but there is an open bug that's pretty gnarly, looking for more info...
[20:20] <Tv> http://tracker.newdream.net/issues/213
[20:21] <Tv> psomas: so i think i overheard that #213 is avoidable with a specific journaling mode
[20:21] <Tv> psomas: but we need a proper fix, and you might choose to ignore the bug for now..
[20:23] <psomas> i've seen that, i think there was a thread at the ml too
[20:23] <Tv> yeah we've been trying to figure out a design that gives us good performance
[20:26] <sjust> slb: you have osd and filestore debugging on?
[20:26] * fronlius1 (~Adium@e182092135.adsl.alicedsl.de) has joined #ceph
[20:27] <slb> sjust: I believe so, added to config file [osd.2] section and bounced
[20:27] <slb> is there a way to confirm from ceph command line?
[20:27] <sjust> can you post the config file?
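
One rough way to double-check what slb describes (debug settings added under [osd.2]) is to parse the config file and print the debug levels for that section. This is a sketch, not a Ceph tool: the sample text is invented (slb's real config is only in the pastebin), the option names "debug osd" and "debug filestore" are assumed to be the ones used, and Python's configparser may not accept every ceph.conf layout. It only shows what is written in the file, not what a running daemon has actually loaded.

```python
import configparser

# Invented sample; not slb's actual config.
SAMPLE_CONF = """\
[osd.2]
debug osd = 25
debug filestore = 25
"""


def debug_levels(conf_text, section):
    """Return {option: level} for every 'debug ...' option in one section."""
    cfg = configparser.ConfigParser()
    cfg.read_string(conf_text)
    return {
        name: int(value)
        for name, value in cfg[section].items()
        if name.startswith("debug ")
    }


print(debug_levels(SAMPLE_CONF, "osd.2"))
```
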
[20:28] <psomas> Tv: clones happen when you make an object snapshot?
[20:29] <slb> sjust: sure. http://pastebin.com/4A9SqwzU
[20:30] * conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[20:30] * Iribaar (~Iribaar@ Quit (Ping timeout: 480 seconds)
[20:30] <sjust> that looks right, can you post the log for that machine since it restarted?
[20:30] <sjust> also, does ceph osd dump -o - indicate that osd.2 is up?
[20:32] * fronlius (~Adium@e182095035.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[20:33] <Tv> psomas: yes; the mds might be doing that even when "you" are not..
[20:33] <slb> sjust: osd dump does say it is up, yes
[20:34] <psomas> Tv: y, but if we use the object store only, and not the fs we're ok i guess :)
[20:35] <Tv> psomas: yeah if you're talking RADOS, you're low level enough that i don't think there are many surprises
[20:35] <Tv> psomas: i'm not the expert on osd, maybe sjust can confirm
[20:36] <Tv> sjust: resend: are there ever clone ops (as per bug #213) in the journal if you use RADOS directly and don't do snapshots yourself
[20:36] <sjust> Tv: not that I can think of
[20:36] <psomas> well, rbd actually, which may use clones for the read-only snapshots, or for the image copy i guess
[20:37] <sjust> rbd does use clones for snapshots
[20:37] <Tv> psomas: yeah rbd is a really thin layer.. just don't snapshot (or take the risk of triggering a bug.. it's only possible if you crash at the wrong time)
[20:38] * conner (~conner@leo.tuc.noao.edu) has joined #ceph
[20:38] <psomas> kk, thanks again
[20:41] <sjust> slb: what is the current output of ceph -s
[20:41] <sjust> ?
[20:42] * Iribaar (~Iribaar@ has joined #ceph
[20:42] <slb> sjust: http://pastebin.com/88DaeQrR
[20:42] <NaioN> I'm still hitting the suicide bug (http://tracker.newdream.net/issues/1624) with the GIT master
[20:42] <NaioN> does somebody know if the bug is really solved?
[20:45] <slb> sjust: strangely osd.2 is running okay now and osd.3 is using 100% cpu
[20:45] <sjust> slb: does it have debugging on?
[20:46] <sjust> If we can catch one of the 100% cpu cases with filestore and osd debugging turned up, we should be able to figure out what's going on
[20:49] * adjohn (~adjohn@ has joined #ceph
[20:50] <slb> okay... will turn it on on all of them, null the logs and restart
[20:56] * Iribaar (~Iribaar@ Quit (Ping timeout: 480 seconds)
[21:01] <slb> sjust: embarrassingly, although two of the osds are using 100% CPU, I can now mount the ceph volume which we haven't been able to do for about a week
[21:04] * Iribaar (~Iribaar@ has joined #ceph
[21:17] <slb> sjust: luckily it broke again. the first osd marked down by osd dump was osd.3
[21:17] <sjust> cool
[21:17] <slb> sjust: http://www.box.net/shared/gtm7xdryzvnhp7krq5gs - a .gz of the log file since startup of osd.3
[21:17] <sjust> cool, thanks
[21:18] <slb> sorry for the naff drop area, first place I found hopefully will do the job
[21:25] * nwatkins (~nwatkins@kyoto.soe.ucsc.edu) has joined #ceph
[21:25] <sjust> slb: no problem, looking now
[21:34] <sjust> slb: you seem to have hit #1530 or something related. The logs should help enormously.
[21:36] <slb> oooh. glad it's helpful.
[21:36] <slb> sjust: do you think there's any hope for my data?
[21:37] <sjust> slb: yeah, I think so. Not completely sure though.
[21:37] <slb> ok. I won't zap it and start again in that case and will follow that bug report you mentionedf
[21:41] * slb_ (~slb@gateway.ash.thebunker.net) has joined #ceph
[21:41] * slb (~slb@gateway.ash.thebunker.net) Quit (Quit: leaving)
[21:41] * slb_ is now known as slb
[21:41] * slb is now known as slb|afk
[22:17] <nwatkins> gregaf: I'm seeing a problem with ceph_mkdirs in the error case when the parent directory already exists as a file. I'm seeing a return value of -1, but the Java code is expecting ENOTDIR.
[22:17] <gregaf> nwatkins: checking
[22:20] * tserong (~tserong@124-171-112-21.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[22:21] <gregaf> nwatkins: you're getting -1 back when it's a file?
[22:21] <gregaf> not -17?
[22:22] <gregaf> oh, parent directory, you mean not the end of the path then
[22:23] <nwatkins> The case is: mkdir(/path/to/file/dir) but this exists: /path/to/file
[22:24] <gregaf> hmmm, it looks fine to me
[22:25] <gregaf> are you sure you actually have permissions to look at the file?
[22:25] <nwatkins> That should be fine. This is through the test harness CephFaker
[22:25] <nwatkins> oops
[22:25] <gregaf> oh, god only knows what's going on then
[22:25] <nwatkins> lol, I wasn't even thinking
[22:26] <nwatkins> Sorry about that
[22:26] <gregaf> np :)
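
The two error cases discussed above can be reproduced against a local POSIX filesystem. This sketch does not call ceph_mkdirs or the Java bindings; it uses Python's os module on a temporary directory, so it shows the ordinary POSIX errno values only, and the function name mkdir_errnos is made up for illustration. On Linux, EEXIST is 17, which matches the -17 gregaf asks about.

```python
import errno
import os
import tempfile


def mkdir_errnos():
    """Return (errno for mkdirs below a regular file, errno for mkdir on that file)."""
    with tempfile.TemporaryDirectory() as root:
        file_path = os.path.join(root, "file")
        with open(file_path, "w"):
            pass  # create an empty regular file at <root>/file

        # Case nwatkins describes: mkdir(/path/to/file/dir) where /path/to/file is a file.
        try:
            os.makedirs(os.path.join(file_path, "dir"))
            under_file = 0
        except OSError as exc:
            under_file = exc.errno

        # Case gregaf first assumed: the final path component itself is an existing file.
        try:
            os.mkdir(file_path)
            on_file = 0
        except OSError as exc:
            on_file = exc.errno

        return under_file, on_file


under_file, on_file = mkdir_errnos()
print(errno.errorcode[under_file], errno.errorcode[on_file])
```
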
[22:30] * tserong (~tserong@124-168-227-175.dyn.iinet.net.au) has joined #ceph
[22:50] * adjohn is now known as Guest15601
[22:50] * Guest15601 (~adjohn@ Quit (Read error: Connection reset by peer)
[22:50] * adjohn (~adjohn@ has joined #ceph
[23:22] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:26] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[23:26] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:27] * cp (~cp@ has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.