#ceph IRC Log

IRC Log for 2011-11-25

Timestamps are in GMT/BST.

[1:02] * fronlius (~fronlius@f054114240.adsl.alicedsl.de) Quit (Quit: fronlius)
[1:31] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:28] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Remote host closed the connection)
[2:44] * The_Bishop (~bishop@port-92-206-183-175.dynamic.qsc.de) Quit (Quit: Who the hell is this Peer? If I catch him, I'll reset his connection!)
[4:45] * Tv__ (~Tv__@cpe-76-168-227-45.socal.res.rr.com) has joined #ceph
[5:13] * gohko (~gohko@natter.interq.or.jp) Quit (Quit: Leaving...)
[5:17] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[5:19] * gohko_ (~gohko@natter.interq.or.jp) has joined #ceph
[5:19] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[5:23] * gohko_ (~gohko@natter.interq.or.jp) Quit ()
[5:42] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[6:56] * Tv__ (~Tv__@cpe-76-168-227-45.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[7:57] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Remote host closed the connection)
[7:58] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[8:03] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Remote host closed the connection)
[8:05] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[8:34] * tsuzuki (~tsuzuki@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[8:38] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[9:05] * verwilst (~verwilst@d51A5B195.access.telenet.be) has joined #ceph
[10:30] * yoshi (~yoshi@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[10:56] * colomonkey (~r.nap@188.205.52.204) Quit (Quit: leaving)
[10:56] * rosco (~r.nap@188.205.52.204) has joined #ceph
[12:07] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[12:39] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[12:52] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:08] * Nightdog (~karl@190.84-48-62.nextgentel.com) has joined #ceph
[13:56] * tsuzuki (~tsuzuki@p9224-ipngn1601marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[14:22] * fronlius_ (~fronlius@testing78.jimdo-server.com) has joined #ceph
[14:22] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[14:22] * fronlius_ is now known as fronlius
[14:32] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Ping timeout: 480 seconds)
[14:55] * fronlius (~fronlius@testing78.jimdo-server.com) has joined #ceph
[15:14] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[15:57] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[16:20] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[16:29] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) Quit (Remote host closed the connection)
[17:36] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[17:54] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) has joined #ceph
[17:59] * grape (~grape@216.24.166.226) has joined #ceph
[18:06] * The_Bishop (~bishop@port-92-206-183-175.dynamic.qsc.de) has joined #ceph
[18:06] * Nightdog (~karl@190.84-48-62.nextgentel.com) Quit (Read error: Connection reset by peer)
[18:07] * Nightdog (~karl@190.84-48-62.nextgentel.com) has joined #ceph
[18:19] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:22] * fronlius (~fronlius@testing78.jimdo-server.com) Quit (Quit: fronlius)
[18:26] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[18:26] * verwilst (~verwilst@d51A5B195.access.telenet.be) Quit (Quit: Ex-Chat)
[18:36] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[18:52] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[18:58] * fronlius (~fronlius@e176059249.adsl.alicedsl.de) has joined #ceph
[19:16] * darkfaded (~floh@188.40.175.2) Quit (Remote host closed the connection)
[19:17] * adjohn (~adjohn@70-36-139-247.dsl.dynamic.sonic.net) has joined #ceph
[19:56] <grape> After running mkcephfs, which appears to complete without a problem, I get:
[19:56] <grape> monclient(hunting): build_initial_monmap
[19:56] <grape> -- :/0 messenger.start
[19:56] <grape> monclient(hunting): init
[19:56] <grape> monclient(hunting): MonClient::init(): Failed to create keyring
[19:56] <grape> ceph_tool_common_init failed.
[19:58] <grape> From what I gather it is a permissions issue, but I haven't found any clues that lead me past that situation.
[19:58] <grape> Does anyone have a solution to this?
[20:29] * greglap (~Adium@cpe-24-24-170-80.socal.res.rr.com) has joined #ceph
[20:31] <greglap> grape: what code are you running, is cephx enabled, and does the machine have a keyring that the process can access?
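One of greglap's questions, whether the machine has a keyring the process can access, can be checked by hand. The following is only a minimal sketch of such a check: the helper name `keyring_status` is made up for illustration, and the log does not say which keyring path grape's setup uses, so the demonstration runs against a scratch file instead of a real keyring. It does not reproduce what MonClient itself does.

```python
import os
import tempfile


def keyring_status(path):
    """Classify a candidate keyring path as missing, unreadable, or readable.

    Hypothetical helper for illustration; it only inspects the file as the
    current user and knows nothing about Ceph itself.
    """
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "readable"


# Demonstration on a throwaway file rather than a real keyring.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "keyring")
    print(keyring_status(demo))  # the file has not been created yet
    with open(demo, "w") as f:
        f.write("")
    print(keyring_status(demo))  # now it exists and belongs to us
```

Run it as the same user that starts the ceph tool, since `os.access` answers for the calling user only.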
[20:33] <greglap> lxo: I don't recognize that particular error from the description you gave — can you post some of the log? (and why is the osd replaying its journal anyway?)
[20:35] <lxo> greglap, it had crashed just before, probably from the very same truncate request
[20:37] <greglap> okay, truncate -1 is unusual but I don't think impossible; if you post a log/backtrace we can check it out at some point
[20:38] <greglap> (everybody's on vacation right now for Thanksgiving)
[20:38] <lxo> greglap: 2011-11-24 07:00:07.771972 7fa8fe7fc700 filestore(/etc/ceph/osd0) error error 22: Invalid argument not handled
[20:38] <lxo> os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '7fa8fe7fc700'
[20:38] <lxo> os/FileStore.cc: 2426: FAILED assert(0 == "unexpected error")
[20:38] <lxo> ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
[20:38] <lxo> 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x1c2d) [0x6ee70d]
[20:38] <lxo> 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6f0a16]
[20:38] <lxo> 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x1ba) [0x6e3eca]
[20:38] <lxo> 4: (ThreadPool::worker()+0x701) [0x5f92e1]
[20:38] <lxo> 5: (ThreadPool::WorkThread::entry()+0xd) [0x57377d]
[20:38] <lxo> though that doesn't look very useful to me :-(
[20:39] <greglap> no...where do you see the truncate -1?
[20:39] <lxo> I started osd within a debugger
[20:40] <lxo> set a conditional breakpoint before processing the op that was failing, and then single-stepped into it
[20:41] <lxo> I had to force-reboot the ceph.ko client that was triggering this, and then reset all osd journals in order to recover
[20:41] <greglap> okay
[20:41] <greglap> so -1 is an invalid truncate, that's why it's getting an error code, and the assert is a generic one
[20:41] <lxo> I steered away from ceph.ko after that
[20:42] <lxo> yup. the syscall returns -EINVAL, and we don't handle this
[20:42] <greglap> I don't have the slightest idea where it would be coming from, though
[20:42] <greglap> probably uninitialized arguments being read and acted on for some reason :/
[20:43] <lxo> certainly from the kernel client. I was rsyncing a phone root filesystem into it with --inplace
[20:43] <lxo> eek. kernel or osd?
[20:43] <lxo> (if we're in a guessing mood :-)
[20:44] <greglap> oh, I'm sure it's coming from the kernel client, but I don't know if the naughty bit that's acting on it is coming from the OSD or the kernel client
[20:44] <lxo> I vaguely recall a long nonsensical syslog from one of the ceph.ko clients at about the same time. lemme dig it up
[20:45] <greglap> heh
[20:47] <lxo> it was one of those things that ceph.ko logs when it gets some message it can't understand, you know?
[20:48] <lxo> aah, and guess what?, it was on the same server I had to force-reboot (I thought it was a different one)
[20:49] <greglap> I'm not sure what log entry you're talking about, actually :/
[20:50] <lxo> ok, maybe it doesn't happen very often. it's an interesting scenario, now that I look into it
[20:50] <lxo> there's a kernel page allocation failure right before
[20:51] <greglap> see now you're worrying me that you ran out of memory and horrible things happened when it tried to page to disk
[20:52] <lxo> either one of those deadlock scenarios we discussed on e-mail the other day, or maybe a mon running astray eating up all RAM, which often happens to me when one of the mons' servers slows down
[20:52] <lxo> I think it's simpler than that; it may have just failed to allocate space for an incoming message, and gone down the hill from there
[20:53] <greglap> okay
[20:53] <lxo> right after the page allocation failure, I have: Nov 24 06:59:09 freie kernel: [301942.754560] ceph: problem parsing dir contents -12
[20:53] <lxo> Nov 24 06:59:09 freie kernel: [301942.754566] ceph: mds parse_reply err -12
[20:53] <lxo> Nov 24 06:59:09 freie kernel: [301942.754572] ceph: mdsc_handle_reply got corrupt reply mds0(tid:999)
[20:53] <lxo> Nov 24 06:59:09 freie kernel: 0 00e 65 6c 2d 6d 6c 69 61 2e 700 00 00 00f ff ff ff 0004 0d 07 00 00 00 00 00 00 00 40 00 f 07 00 0 00 00 [lots of other hex pairs omitted]
[20:53] <greglap> don't run the kernel client and ceph-osd on the same node!
[20:53] <greglap> yeah, it's probably running into invalid memory accesses right there
[20:53] <lxo> yeah, I figured I'd give it a try now that I had syncfs :-)
[20:54] <greglap> that doesn't actually solve the memory deadlock problems :(
[20:54] <lxo> so have I just confirmed ;-)
[20:55] <lxo> one of these days I'll move my home gateway into a virtual machine (with root backed by ceph :-) and then I'll (hopefully) be able to use the kernel client from there
[20:56] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Remote host closed the connection)
[20:56] <lxo> though I'm finding ceph-fuse to be quite reliable, fast and much easier to deal with in case of failure ;-)
[20:56] <greglap> heh, yeah
[20:56] <lxo> so, sorry about the noise
[20:57] <lxo> and thanks for the support (even if I'm one day too late for thanks giving ;-)
[20:57] <lxo> I thought yesterday was the day USAmericans would attack Turkey ;-D
[20:59] <greglap> :)
[20:59] <lxo> my favorite geogastronomic joke is that turkey the bird translates to peru in Portuguese, but Peru and Turkey are on different continents!
[20:59] <greglap> yeah, but nobody wants a day off in the middle of the week, so we take an extra day ;)
[20:59] <lxo> to go discount shopping, I'm told :-D
[21:01] <lxo> some Brazilian shops seem to be picking up the Black Friday tradition. it's odd because the natural translation of black friday sounds sinister, scary even
[21:02] <lxo> anyhow, I went somewhat quiet on ceph lately mainly because I haven't run into new problems; most of the ones I've been hitting lately are btrfs bugs, and I've had more luck investigating and fixing those than ceph's, for some reason
[21:04] <lxo> still a couple of serious performance problems in btrfs to track down for ceph to run efficiently for me...
[21:06] <lxo> though it's *much* better than it was when I started. clustered block allocation speeds things up some, but it came with a couple of problems that degrade performance over time. and the btrfs inconsistent disk flush problem that we recently tracked down caused me a number of filesystem losses; none after the fix!
[21:08] <greglap> good to hear it's getting better!
[21:09] <greglap> I'm heading off for now, enough productivity on a lazy day ;)
[21:09] <lxo> :-D
[21:09] <lxo> have a great one!
[21:09] <lxo> thanks again
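The two numeric codes in the conversation above (the OSD's "error 22" and the kernel client's "-12") can be decoded with the standard errno table, and the rejected truncate can be imitated on a scratch file. This is a rough sketch, not Ceph code: `describe` and `truncate_error` are illustrative names, and the scratch file stands in for the object the OSD was asked to truncate.

```python
import errno
import os
import tempfile


def describe(code):
    """Map a negative kernel-style return value such as -12 to its errno name."""
    return errno.errorcode.get(abs(code), "unknown")


print(describe(-22))  # the code behind the OSD's "error 22: Invalid argument"
print(describe(-12))  # the code in "mds parse_reply err -12"


def truncate_error(length):
    """Call ftruncate() on a scratch file.

    Returns the errno name if the call fails, or None if it succeeds.
    """
    with tempfile.TemporaryFile() as f:
        try:
            os.ftruncate(f.fileno(), length)
        except OSError as e:
            return errno.errorcode.get(e.errno, "unknown")
    return None


# A negative length is what lxo saw arriving at the OSD; on Linux the
# system call is expected to reject it.
print(truncate_error(-1))
```

Code 12 decoding to ENOMEM is consistent with the page allocation failure lxo found in the syslog just before the corrupt-reply messages.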
[21:23] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) has joined #ceph
[21:26] * cp (~cp@c-98-234-218-251.hsd1.ca.comcast.net) Quit ()
[21:42] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:49] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:50] * adjohn (~adjohn@70-36-139-247.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[21:50] * adjohn (~adjohn@70-36-139-247.dsl.dynamic.sonic.net) has joined #ceph
[21:50] * adjohn (~adjohn@70-36-139-247.dsl.dynamic.sonic.net) Quit ()
[22:08] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[22:14] * adjohn (~adjohn@70-36-139-247.dsl.dynamic.sonic.net) has joined #ceph
[22:33] * johnl (~johnl@109.107.34.22) has joined #ceph
[22:33] * johnl (~johnl@109.107.34.22) Quit ()
[22:35] * johnl (~johnl@johnl.ipq.co) has joined #ceph
[22:36] * johnl (~johnl@johnl.ipq.co) Quit ()
[22:38] * johnl (~johnl@2a02:1348:14c:1720:24:19ff:fef0:5c82) has joined #ceph
[22:41] * johnl (~johnl@2a02:1348:14c:1720:24:19ff:fef0:5c82) Quit ()
[22:43] * johnl (~johnl@2a02:1348:14c:1720:24:19ff:fef0:5c82) has joined #ceph
[22:45] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) has joined #ceph
[22:47] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[23:40] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[23:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:46] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.