#ceph IRC Log

IRC Log for 2011-12-15

Timestamps are in GMT/BST.

[0:15] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[0:17] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[0:28] * NightDog (~karl@52.84-48-58.nextgentel.com) has joined #ceph
[0:28] * NightDog (~karl@52.84-48-58.nextgentel.com) Quit ()
[0:58] * adjohn is now known as Guest20427
[0:58] * adjohn (~adjohn@208.90.214.43) has joined #ceph
[0:59] * The_Bishop (~bishop@port-92-206-76-165.dynamic.qsc.de) has joined #ceph
[0:59] * Nightdog_ (~karl@52.84-48-58.nextgentel.com) Quit (Remote host closed the connection)
[0:59] * Guest20427 (~adjohn@208.90.214.43) Quit (Read error: Operation timed out)
[1:03] <sjust> Amon Ott had a question a few days ago on the mailing list about returning useful errors from the kernel client when we hit problems. Anyone have any insight?
[1:06] <gregaf> oh, this was the ESTALE one?
[1:07] <sjust> gregaf: yeah
[1:08] <gregaf> well we don't return ESTALE because we don't want to lose the data and we have the expectation that it'll come back up Real Soon Now
[1:08] <sjust> gregaf: well, yeah...
[1:08] <gregaf> returning ESTALE would be a lot farther from POSIX semantics so anybody wanting that to happen instead needs to come up with an argument which gets past that hurdle
[1:09] <ajm> a mount option would be nice
[1:09] <gregaf> I don't know how difficult it would be technically or if we could set it up as a mount option to be lazy about data
[1:09] <ajm> even non-default
[1:09] <Tv> ESTALE is pure evil
[1:09] <Tv> but the real reason nfs is evil is it doesn't give the admin control
[1:09] <Tv> you get either hard or soft timeouts
[1:10] <Tv> what you really want is block waiting for recovery, until told otherwise
[1:10] <Tv> nfs doesn't give you the "until told otherwise" part
[1:10] <Tv> neither does ceph, currently
[1:11] <Tv> then you can get the "you'll never get ESTALE in normal operation" guarantee, but when the admin flags something as terminally broken, you can still recover
[1:12] <gregaf> a useful ESTALE in this context I think has to not rely on admin intervention — you can already cancel those IO requests, right?
[1:12] <gregaf> (ie Ctrl-C)
[1:12] <Tv> at some point, i wished the linux force umount feature would have done this; it doesn't
[1:12] <Tv> and even then, with ceph it's probably not the whole fs that is failing, just some files/blocks
[1:14] <Tv> gregaf: at least with nfs, you can't always control-c out of IO
[1:14] * gregaf (~Adium@aon.hq.newdream.net) has left #ceph
[1:14] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[1:14] <Tv> "dammit", he said
[1:15] <gregaf> I believe that Ceph IO is interruptible, or at least it was once
[1:15] <ajm> it is
[1:26] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:29] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) has joined #ceph
[1:30] * aa (~aa@r190-135-29-205.dialup.adsl.anteldata.net.uy) Quit (Read error: Operation timed out)
[2:19] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) Quit (Quit: Konversation terminated!)
[2:19] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) has joined #ceph
[2:25] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[2:26] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[2:32] * sageph (~yaaic@md52736d0.tmodns.net) has joined #ceph
[2:37] * votz (~votz@pool-108-52-122-97.phlapa.fios.verizon.net) Quit (Quit: Leaving)
[2:39] * cp (~cp@76-220-17-197.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[2:41] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[2:43] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[2:51] * dwm_ (~dwm@vm-shell4.doc.ic.ac.uk) has joined #ceph
[2:52] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:55] * sageph (~yaaic@md52736d0.tmodns.net) Quit (Quit: Yaaic - Yet another Android IRC client - http://www.yaaic.org)
[2:58] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) Quit (Quit: Konversation terminated!)
[2:59] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) has joined #ceph
[3:03] * adjohn (~adjohn@208.90.214.43) Quit (Quit: adjohn)
[3:05] * cp (~cp@76-220-17-197.lightspeed.sntcca.sbcglobal.net) Quit (Quit: cp)
[3:22] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:03] * fourty52myhead (~fourty52m@19NAAFO52.tor-irc.dnsbl.oftc.net) has joined #ceph
[4:07] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:24] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[4:49] * fourty52myhead (~fourty52m@19NAAFO52.tor-irc.dnsbl.oftc.net) has left #ceph
[5:20] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[5:29] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit (Quit: cp)
[5:58] * ssedov (stas@ssh.deglitch.com) Quit (Read error: Connection reset by peer)
[5:59] * stass (stas@ssh.deglitch.com) has joined #ceph
[6:00] * elder (~elder@c-71-193-71-178.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[6:17] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:51] * The_Bishop (~bishop@port-92-206-76-165.dynamic.qsc.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[7:16] * andresambrois (~aa@r190-64-67-231.dialup.adsl.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[7:26] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[7:32] * yhager (~yhager@173.180.85.48) has joined #ceph
[8:37] * votz (~votz@pool-108-52-122-97.phlapa.fios.verizon.net) has joined #ceph
[8:54] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[9:01] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[9:02] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[9:12] * sjustlaptop (~sam@96-41-121-194.dhcp.mtpk.ca.charter.com) has joined #ceph
[9:16] * MikeP (~Talan@208.72.101.82) Quit ()
[9:31] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:39] * sjustlaptop (~sam@96-41-121-194.dhcp.mtpk.ca.charter.com) Quit (Ping timeout: 480 seconds)
[9:41] * yhager (~yhager@173.180.85.48) Quit (Ping timeout: 480 seconds)
[10:11] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[10:14] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Remote host closed the connection)
[10:14] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[10:19] * colon_D (~colonD@173-165-224-105-minnesota.hfc.comcastbusiness.net) Quit ()
[12:06] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[12:45] <chaos_> in what units is latency measured now?
[12:45] <chaos_> "op_latency"=>{"sum"=>109209, "avgcount"=>500491},
[12:45] <chaos_> i hope it isn't in seconds ;-)
[13:08] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:08] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[13:08] * fronlius_ is now known as fronlius
[13:08] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:14] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[13:16] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[13:18] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[13:19] * fronlius (~fronlius@testing78.jimdo-server.com) Quit ()
[13:42] <wonko_be> is there any logic in the "suffixes" for the naming of the osd/mds/mon? mon.a, mon.b but osd.0, osd.1 ...
[13:42] <wonko_be> or could this be anything
[13:42] <wonko_be> osd.host1
[13:43] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[14:10] * andresambrois (~aa@r190-135-25-156.dialup.adsl.anteldata.net.uy) has joined #ceph
[14:13] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[14:35] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) has joined #ceph
[14:51] <dwm_> IIRC, OSDs must be numbered from zero, as their position in the global ordering is significant.
[14:52] <dwm_> MONs (and MDSs?) must simply be unique, so have fewer constraints.
[15:03] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[15:04] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[15:27] <wonko_be> if I number them all, zero based, that should end up okay then
[15:38] * andresambrois (~aa@r190-135-25-156.dialup.adsl.anteldata.net.uy) Quit (Quit: Konversation terminated!)
[15:38] * andresambrois (~aa@r190-135-25-156.dialup.adsl.anteldata.net.uy) has joined #ceph
[16:02] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[16:02] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[16:02] * fronlius_ is now known as fronlius
[16:18] * andresambrois (~aa@r190-135-25-156.dialup.adsl.anteldata.net.uy) Quit (Remote host closed the connection)
[16:43] <wido> wonko_be: You should not number your MON / MDS
[16:44] <wido> better use mon.alpha, mon.beta, mon.charlie
[16:44] <wido> or mon.a, mon.b, mon.c
[16:55] <dwm_> BTW, in case you haven't seen, I've updated bug #1759 with some debug logs -- ran into the same failure mode on my testing cluster.
[16:57] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) has joined #ceph
[17:05] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[17:05] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[17:05] * fronlius_ is now known as fronlius
[17:20] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) Quit (Ping timeout: 480 seconds)
[17:38] <sjust> dwm_: did you mean that you did hit an assert in the MDS?
[17:39] <wido> I just opened a bug with Ubuntu to get Qemu built with RBD support: https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/904834
[17:40] <wido> hopefully the upcoming LTS can ship with an RBD-enabled Qemu package, librbd (0.38) is already in the repos
[17:55] <guido> Is RBD implemented entirely on the client side?
[17:56] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:57] <sjust> guido: essentially, yes
[18:00] <dwm_> sjust: No, the crash I'm getting is in the OSD.
[18:01] <dwm_> sjust: I tried running the wip-truncate branch of the MDS to see if it would trip an assert, and it did not.
[18:01] <sjust> dwm_: hmm, ok
[18:01] <guido> sjust: ah, thx
[18:02] <dwm_> (The cluster has since offlined itself -- it tried to reacquire the desired number of object replicas across the remaining OSDs, ran out of space and they all started tumbling.)
[18:03] <dwm_> Presumably a later revision of the OSDs will stop trying to replicate objects / discard expendable replicas in the event of a low-capacity condition.. :-)
[18:08] <sjust> dwm_: presumably
[18:09] <dwm_> But yes, I know -ENOSPACE handling is something that'll Happen Later.
[18:12] <chaos_> sjust, who is responsible for osd performance sockets?
[18:13] <sjust> chaos_: osd performance socket?
[18:13] <chaos_> the osd daemon has a socket that you can query for performance data, latency and other stuff ;)
[18:13] <sjust> chaos_: ah, yes!
[18:14] <chaos_> something changed somewhere about 0.38 and now latencies have a different unit of measure?
[18:14] <chaos_> "op_latency"=>{"sum"=>109209, "avgcount"=>500491},
[18:14] <sjust> chaos: hmm, looking
[18:14] <chaos_> i hope it isn't in seconds;p
[18:15] <chaos_> earlier sum was just latency in seconds
[18:15] <sjust> hmm, that would be around 200ms latency, seems a bit high
[18:20] <chaos_> erm.. how did you get 200ms?:p
[18:20] <sjust> chaos_: "sum" appears to be total latency in seconds, "avgcount" appears to be total number of ops
[18:20] <sjust> so sum/avgcount
[18:20] <chaos_> oh
[18:21] <chaos_> well.. it's quite obvious ;p
[18:21] <sjust> chaos_: nah, I had to find the code :P
[18:21] <chaos_> :D
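Note: the sum/avgcount arithmetic sjust describes above can be sketched as below. This is only an illustration under his reading of the counter ("sum" appears to be total latency in seconds, "avgcount" appears to be the number of ops); the function name mean_latency_seconds is hypothetical and not part of Ceph, and the dictionary simply mirrors the values chaos_ pasted.

```python
def mean_latency_seconds(counter):
    """Mean per-op latency from a {"sum": ..., "avgcount": ...} counter.

    Assumes "sum" is total latency in seconds and "avgcount" is the
    number of ops, as sjust inferred from the code. Returns None when
    no ops have been counted, to avoid dividing by zero.
    """
    count = counter["avgcount"]
    if count == 0:
        return None
    return counter["sum"] / count


# Values copied from the op_latency counter pasted in the channel.
op_latency = {"sum": 109209, "avgcount": 500491}

mean = mean_latency_seconds(op_latency)
print(f"mean op latency: {mean * 1000:.0f} ms")
```

With the pasted values this comes out a little over 200 ms, in line with sjust's estimate. A monitoring plugin would more likely want the difference of both fields between two samples, so that it reports recent latency rather than the average since the daemon started.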
[18:21] <guido> Hm, why do I need an existing monmap to format a new OSD?
[18:22] <chaos_> thanks sjust, I'll fix my monitoring plugins tomorrow
[18:22] <sjust> chaos_: np
[18:23] <sjust> guido: at the least, I think we get the fsid from the monmap
[18:30] <guido> When I expand my osd cluster, according to what I've read so far, I should also increase the number of PGs, right? But both of these things will cause a lot of data to be moved around. Should I wait until it has calmed down again before increasing PGs, or should I do both simultaneously?
[18:38] <sjust> guido: increasing the number of pgs does not currently work
[18:46] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:47] * adjohn (~adjohn@70-36-197-80.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[19:01] * yhager (~yhager@173.180.85.48) has joined #ceph
[19:10] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[19:20] * NightDog (~karl@52.84-48-58.nextgentel.com) has joined #ceph
[19:49] * adjohn (~adjohn@208.90.214.43) has joined #ceph
[20:12] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) has joined #ceph
[21:23] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[21:45] * fronlius (~fronlius@g231139059.adsl.alicedsl.de) has joined #ceph
[22:15] * fghaas (~florian@85-127-155-32.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[22:16] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Remote host closed the connection)
[22:32] * adjohn is now known as Guest20534
[22:32] * adjohn (~adjohn@208.90.214.43) has joined #ceph
[22:32] * Guest20534 (~adjohn@208.90.214.43) Quit (Read error: Connection reset by peer)
[22:36] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has left #ceph
[22:47] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[22:47] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit ()
[22:47] * aa (~aa@r190-135-25-156.dialup.adsl.anteldata.net.uy) has joined #ceph
[22:47] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[23:23] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[23:23] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has left #ceph
[23:54] * sjustlaptop (~sam@aon.hq.newdream.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.