#ceph IRC Log


IRC Log for 2011-11-25

Timestamps are in GMT/BST.

[19:56] <grape> After running mkcephfs, which appears to complete without a problem, I get:
[19:56] <grape> monclient(hunting): build_initial_monmap
[19:56] <grape> -- :/0 messenger.start
[19:56] <grape> monclient(hunting): init
[19:56] <grape> monclient(hunting): MonClient::init(): Failed to create keyring
[19:56] <grape> ceph_tool_common_init failed.
[19:58] <grape> From what I gather it is a permissions issue, but I haven't found any clues that lead me past that situation.
[19:58] <grape> Does anyone have a solution to this?
[20:31] <greglap> grape: what code are you running, is cephx enabled, and does the machine have a keyring that the process can access?
[20:33] <greglap> lxo: I don't recognize that particular error from the description you gave — can you post some of the log? (and why is the osd replaying its journal anyway?)
[20:35] <lxo> greglap, it had crashed just before, probably from the very same truncate request
[20:37] <greglap> okay, truncate -1 is unusual but I don't think impossible; if you post a log/backtrace we can check it out at some point
[20:38] <greglap> (everybody's on vacation right now for Thanksgiving)
[20:38] <lxo> greglap: 2011-11-24 07:00:07.771972 7fa8fe7fc700 filestore(/etc/ceph/osd0) error error 22: Invalid argument not handled
[20:38] <lxo> os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '7fa8fe7fc700'
[20:38] <lxo> os/FileStore.cc: 2426: FAILED assert(0 == "unexpected error")
[20:38] <lxo> ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
[20:38] <lxo> 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x1c2d) [0x6ee70d]
[20:38] <lxo> 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6f0a16]
[20:38] <lxo> 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x1ba) [0x6e3eca]
[20:38] <lxo> 4: (ThreadPool::worker()+0x701) [0x5f92e1]
[20:38] <lxo> 5: (ThreadPool::WorkThread::entry()+0xd) [0x57377d]
[20:38] <lxo> though that doesn't look very useful to me :-(
[20:39] <greglap> no...where do you see the truncate -1?
[20:39] <lxo> I started osd within a debugger
[20:40] <lxo> set a conditional breakpoint before processing the op that was failing, and then single-stepped into it
[20:41] <lxo> I had to force-reboot the ceph.ko client that was triggering this, and then reset all osd journals in order to recover
[20:41] <greglap> okay
[20:41] <greglap> so -1 is an invalid truncate, that's why it's getting an error code, and the assert is a generic one
[20:41] <lxo> I steered away from ceph.ko after that
[20:42] <lxo> yup. the syscall returns -EINVAL, and we don't handle this
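
The failure mode described above can be reproduced outside Ceph. This is a minimal sketch, not FileStore code: it shows that a truncate to a negative length fails with EINVAL (errno 22, as in the filestore log line), and that a caller can report that one error instead of treating it as unexpected. The names `try_truncate` and `apply_truncate` are hypothetical, introduced only for this illustration.

```python
import errno
import os
import tempfile


def try_truncate(fd, length):
    """Call ftruncate; return 0 on success or a negative errno on failure."""
    try:
        os.ftruncate(fd, length)
        return 0
    except OSError as exc:
        return -exc.errno


def apply_truncate(fd, length):
    """Hypothetical handler: report an invalid length rather than aborting."""
    result = try_truncate(fd, length)
    if result == -errno.EINVAL:
        return "rejected: invalid length %d" % length
    if result < 0:
        # Any other error is still surfaced to the caller.
        raise OSError(-result, os.strerror(-result))
    return "ok"


with tempfile.TemporaryFile() as tmp:
    fd = tmp.fileno()
    valid = apply_truncate(fd, 4096)    # ordinary truncate succeeds
    invalid = apply_truncate(fd, -1)    # negative length is rejected
    raw = try_truncate(fd, -1)          # raw result, kernel-style
    print(valid, "|", invalid, "|", raw)
```

Whether rejecting the request is the right behaviour for an OSD replaying its journal is a separate question; the sketch only illustrates where the error code comes from.
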
[20:42] <greglap> I don't have the slightest idea where it would be coming from, though
[20:42] <greglap> probably uninitialized arguments being read and acted on for some reason :/
[20:43] <lxo> certainly from the kernel client. I was rsyncing a phone root filesystem into it with --inplace
[20:43] <lxo> eek. kernel or osd?
[20:43] <lxo> (if we're in a guessing mood :-)
[20:44] <greglap> oh, I'm sure it's coming from the kernel client, but I don't know if the naughty bit that's acting on it is coming from the OSD or the kernel client
[20:44] <lxo> I vaguely recall a long nonsensical syslog from one of the ceph.ko clients at about the same time. lemme dig it up
[20:45] <greglap> heh
[20:47] <lxo> it was one of those things that ceph.ko logs when it gets some message it can't understand, you know?
[20:48] <lxo> aah, and guess what? it was on the same server I had to force-reboot (I thought it was a different one)
[20:49] <greglap> I'm not sure what log entry you're talking about, actually :/
[20:50] <lxo> ok, maybe it doesn't happen very often. it's an interesting scenario, now that I look into it
[20:50] <lxo> there's a kernel page allocation failure right before
[20:51] <greglap> see now you're worrying me that you ran out of memory and horrible things happened when it tried to page to disk
[20:52] <lxo> either one of those deadlock scenarios we discussed on e-mail the other day, or maybe a mon running astray eating up all RAM, which often happens to me when one of the mons' servers slows down
[20:52] <lxo> I think it's simpler than that; it may have just failed to allocate space for an incoming message, and gone down the hill from there
[20:53] <greglap> okay
[20:53] <lxo> right after the page allocation failure, I have: Nov 24 06:59:09 freie kernel: [301942.754560] ceph: problem parsing dir contents -12
[20:53] <lxo> Nov 24 06:59:09 freie kernel: [301942.754566] ceph: mds parse_reply err -12
[20:53] <lxo> Nov 24 06:59:09 freie kernel: [301942.754572] ceph: mdsc_handle_reply got corrupt reply mds0(tid:999)
[20:53] <lxo> Nov 24 06:59:09 freie kernel: 0 00e 65 6c 2d 6d 6c 69 61 2e 700 00 00 00f ff ff ff 0004 0d 07 00 00 00 00 00 00 00 40 00 f 07 00 0 00 00 [lots of other hex pairs omitted]
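
The -12 in the kernel lines above is a negated errno value, as is the 22 in the earlier filestore error. As a small sketch (the helper name `decode_kernel_errno` is hypothetical), the standard `errno` module can map such codes to their symbolic names; on Linux, 12 is ENOMEM, which fits the page allocation failure mentioned just before, and 22 is EINVAL.

```python
import errno
import os


def decode_kernel_errno(code):
    """Map a kernel-style (possibly negative) errno to its name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# Codes taken from the log lines quoted in this conversation.
for code in (-12, -22):
    name, text = decode_kernel_errno(code)
    print(code, name, text)
```
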
[20:53] <greglap> don't run the kernel client and ceph-osd on the same node!
[20:53] <greglap> yeah, it's probably running into invalid memory accesses right there
[20:53] <lxo> yeah, I figured I'd give it a try now that I had syncfs :-)
[20:54] <greglap> that doesn't actually solve the memory deadlock problems :(
[20:54] <lxo> so have I just confirmed ;-)
[20:55] <lxo> one of these days I'll move my home gateway into a virtual machine (with root backed by ceph :-) and then I'll (hopefully) be able to use the kernel client from there
[20:56] <lxo> though I'm finding ceph-fuse to be quite reliable, fast and much easier to deal with in case of failure ;-)
[20:56] <greglap> heh, yeah
[20:56] <lxo> so, sorry about the noise
[20:57] <lxo> and thanks for the support (even if I'm one day too late for thanks giving ;-)
[20:57] <lxo> I thought yesterday was the day USAmericans would attack Turkey ;-D
[20:59] <greglap> :)
[20:59] <lxo> my favorite geogastronomic joke is that turkey the bird translates to peru in Portuguese, but Peru and Turkey are in different continents!
[20:59] <greglap> yeah, but nobody wants a day off in the middle of the week, so we take an extra day ;)
[20:59] <lxo> to go discount shopping, I'm told :-D
[21:01] <lxo> some Brazilian shops seem to be picking up the Black Friday tradition. it's odd because the natural translation of Black Friday sounds sinister, scary even
[21:02] <lxo> anyhow, I went somewhat quiet on ceph lately mainly because I haven't run into new problems; most of the ones I've been hitting lately are btrfs bugs, and I've had more luck investigating and fixing those than ceph's, for some reason
[21:04] <lxo> still a couple of serious performance problems in btrfs to track down for ceph to run efficiently for me...
[21:06] <lxo> though it's *much* better than it was when I started. clustered block allocation speeds things up some, but it came with a couple of problems that degrade performance over time. and the btrfs inconsistent disk flush problem that we recently tracked down caused me a number of filesystem losses; none after the fix!
[21:08] <greglap> good to hear it's getting better!
[21:09] <greglap> I'm heading off for now, enough productivity on a lazy day ;)
[21:09] <lxo> :-D
[21:09] <lxo> have a great one!
[21:09] <lxo> thanks again
