#ceph IRC Log


IRC Log for 2011-06-17

Timestamps are in GMT/BST.

[0:05] * arken420 (~mike@75-138-193-69.static.snfr.nc.charter.com) Quit (Quit: Leaving)
[0:06] <Tv> apparently sepia works again
[0:07] <Tv> err, some of them
[0:20] * mib_765z2b (5138106c@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[0:21] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:32] * fred_ (~fred@185-187.78-83.cust.bluewin.ch) has joined #ceph
[0:34] <fred_> sjust, could you have a look at #1191?
[0:34] <sjust> fred_: sorry, I got side tracked
[0:35] <fred_> sjust, do you think you'll get some time to look at it today?
[0:35] <sjust> fred_: yeah, looking now
[0:36] <sjust> have you tried bringing the cluster back up?
[0:36] <fred_> sjust, ok so I'm here if you need anything
[0:36] <sjust> fred_: does the problem reoccur if you bring the cluster back up?
[0:36] <fred_> sjust, no stopped it all as soon as my 5 osd died
[0:36] <fred_> will try
[0:38] * gregorg_taf (~Greg@ has joined #ceph
[0:38] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[0:41] <sjust> fred_: ceph -w/s said that all pgs were active+clean before the crash?
[0:41] <fred_> yes
[0:42] <fred_> I was monitoring with ceph -w
[0:42] <fred_> restarted it now, and reached the all active+clean state
[0:43] <sjust> did some of the pgs stay in acting for a bit longer before going active+clean?
[0:43] <sjust> or, did you see the number of degraded objects go from non-zero to zero?
[0:45] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[0:46] <fred_> sjust, well the recovery was not that long and the number of degraded PGs went down quite continuously if that is what you mean
[0:46] <sjust> ah, ok
[0:47] <sjust> the assert failure indicates that the primary marked the pg as clean before all of the replicas had been recovered (the offending pg's missing set was not yet empty)
[0:47] <sjust> did it crash this time?
[0:48] <sjust> if it still crashed, we should be able to crank up the debugging and figure out how it's happening
[0:50] <fred_> it is running fine now
[0:50] <fred_> it crashed very quickly before
[0:51] <sjust> yeah, there must be a flaw in the recovery code on the primary
[0:51] <sjust> you don't have a core dump, do you?
[0:51] <fred_> I've got 5 of them!
[0:52] <sjust> ah, now we are talking
[0:52] <fred_> fine
[0:52] <sjust> can you post them with the cosd binary?
[0:52] <fred_> sure, could you give me the sftp account again please?
[0:55] <fred_> sjust, you want 1 core dump or all of them ?
[0:55] <sjust> fred_: if you give me a public key, I can set you up on ceph.newdream.net
[0:55] <sjust> fred_: all of them, probably
[0:55] <fred_> ok
[0:59] <sjust> fred_: the transfer doesn't seem to be working, you could email it to me at samuel_just@dreamhost.com
[1:01] <fred_> sent
[1:02] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Quit: Ex-Chat)
[1:04] <sjust> oh, sorry, samuel.just@dreamhost.com
[1:04] <sjust> wrong address
[1:07] <fred_> sent
[1:13] <fred_> got it ?
[1:13] <sjust> yeah, trying to get the account working
[1:13] <sjust> sorry for the delay
[1:14] <fred_> no problem
[1:14] <sjust> ok, cephdrop@ceph.newdream.net should work now
[1:16] <fred_> scp or sftp ?
[1:16] <sjust> sftp
[1:16] <fred_> ok
[1:17] <sagewk> bchrisman: for that nfs stale thing, can you attach an mds log with debugging on?
[1:18] <bchrisman> ahh sure...
[1:22] <fred_> sjust, done: 1194.tgz (yes should be 1191.tgz... sorry for that)
[1:23] <sjust> heh, no problem
[1:23] <fred_> ah could rename .... 1191.tgz
[1:24] <fred_> sjust, I included the binary and the debugging symbols which are in a separate file
[1:25] <fred_> sjust, I think that last time greg and I had trouble dealing with that. If you wanted to open the core file, you had to put the binary at /usr/bin/cosd and the debug syms at /usr/lib/debug/usr/bin/cosd
[1:25] <sjust> ah...ok
[1:25] <sjust> thanks
[1:31] * MattCampbell (~matt@ppp-70-130-44-76.dsl.wchtks.swbell.net) Quit (Quit: ircII EPIC5-1.1.2 -- Are we there yet?)
[1:33] <sjust> fred_: it looks like my libraries are too different
[1:34] <fred_> of course
[1:34] <fred_> packaging them also
[1:37] <fred_> but which one do you need as cosd does not link with other ceph libs?
[1:37] <sjust> one sec
[1:45] <sjust> fred_: I'm trying to make sense of the gdb errors
[1:45] <sjust> Error while mapping shared library sections:
[1:45] <sjust> s [%lu] required by file %s [%lu]
[1:45] <sjust> : No such file or directory.
[1:45] <sjust> that's where it's failing
[1:46] <fred_> are you running 32 or 64 bits os ?
[1:46] <sjust> 64
[1:47] <sjust> squeeze
[1:51] <fred_> ok, 64 natty here
[1:51] <sjust> hmm, ok, one sec
[1:53] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:13] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:26] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:29] <sjust> fred_: gcc 4.4 or 4.5?
[2:31] <fred_> 4.5.2
[2:31] * yoshi (~yoshi@p24092-ipngn1301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:33] <sjust> I'm trying to get at the core files using a natty 64 laptop, it seems to be unable to find the libstdc++ debugging symbols
[2:34] <fred_> I didn't install them, is it really needed ?
[2:36] <fred_> maybe they are in the ddebs archives: http://ddebs.ubuntu.com/
[2:38] <fred_> echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" > /etc/apt/sources.list.d/ddebs.list
[2:47] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:49] <sjust> ah, it seems that it's more likely to be ld-linux-x86-64.so.2's debugging symbols
[2:49] <sjust> can you open the core file on your machines?
[2:50] <sjust> if so, where does it say that it's getting the symbols from? (should be something like: Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.11.2.so...done.)
[2:51] <fred_> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
[2:51] <fred_> Loaded symbols for /lib64/ld-linux-x86-64.so.2
[2:51] <fred_> as I said, didn't install them
[2:51] <sjust> can you get a meaningful backtrace?
[2:52] <fred_> yes, the one I posted on the tracker
[2:52] <sjust> right, forgot
[2:52] <sjust> must not be ld then
[3:16] * fred_ (~fred@185-187.78-83.cust.bluewin.ch) Quit (Quit: Leaving)
[3:30] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[5:12] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Read error: Operation timed out)
[5:12] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[5:54] * cmccabe1 (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has left #ceph
[8:18] * lidongyang (~lidongyan@ Quit (Read error: Connection reset by peer)
[8:46] <lx0> wheee, just upgraded to 0.29.1, and 0.29 was looking pretty great already! thanks, folks!
[8:54] * lidongyang (~lidongyan@ has joined #ceph
[9:57] * lidongyang (~lidongyan@ Quit (Remote host closed the connection)
[10:37] * allsystemsarego (~allsystem@ has joined #ceph
[10:38] * gregorg_taf (~Greg@ Quit (Quit: Quitte)
[10:38] * gregorg (~Greg@ has joined #ceph
[10:45] * lidongyang (~lidongyan@ has joined #ceph
[11:03] * yoshi (~yoshi@p24092-ipngn1301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:38] * pombreda (~Administr@ has joined #ceph
[11:45] * johnl (~johnl@johnl.ipq.co) Quit (charon.oftc.net solenoid.oftc.net)
[11:45] * stefanha (~stefanha@yuzuki.vmsplice.net) Quit (charon.oftc.net solenoid.oftc.net)
[11:45] * dwm (~dwm@vm-shell4.doc.ic.ac.uk) Quit (charon.oftc.net solenoid.oftc.net)
[11:46] * stefanha (~stefanha@yuzuki.vmsplice.net) has joined #ceph
[11:46] * johnl (~johnl@johnl.ipq.co) has joined #ceph
[11:46] * dwm (~dwm@vm-shell4.doc.ic.ac.uk) has joined #ceph
[13:35] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[13:39] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[13:59] * pombreda (~Administr@ Quit (Quit: Leaving.)
[14:07] * pombreda (~Administr@ has joined #ceph
[14:15] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[14:36] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) has joined #ceph
[14:53] * yoshi (~yoshi@KD027091032046.ppp-bb.dion.ne.jp) Quit (Remote host closed the connection)
[15:21] * pombreda (~Administr@ Quit (Quit: Leaving.)
[17:31] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[17:33] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit ()
[17:37] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:50] * greglap (~Adium@ has joined #ceph
[17:54] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[17:55] <jmlowe> I'm seeing mind numbingly slow qemu-img convert, is this expected?
[17:56] <greglap> jmlowe: some applications do better with rbd than others; I don't remember if qemu-img convert is one of them
[17:56] <greglap> rbd currently doesn't do any false acks so every write has to go out to the network and hit disk before it says the write is done
[17:58] <greglap> people who have actually used rbd will be on in an hour or so though, if you want more info :)
[18:00] <stefanha> jmlowe: Are you using qemu-img's builtin rbd protocol support or the kernel rbd block device?
[18:00] <jmlowe> looks like ~500kB/s from iostat, mtu 9k, slowest link line rate 1GigE, slowest disk 20MB/s
[18:01] <jmlowe> qemu-img convert -p -f qcow2 -O rbd
[18:02] <jmlowe> I do have a disk out in that raid6 array, but <1MB/s is a little much
[18:02] <stefanha> jmlowe: I wonder if using the kernel rbd block device support would help...a lot. That way qemu-img can use the page cache on your machine and will write back to the rbd device in the background, avoiding many expensive acks as greglap mentioned.
[18:03] <stefanha> http://ceph.newdream.net/wiki/Rbd
[18:03] <jmlowe> so is using the kernel block device the current preferred way to do things until qemu/librados gets streamlined?
[18:03] <stefanha> I haven't tried this and my ceph experience is very limited so it may be a blind alley.
[18:05] <greglap> probably? sorry, my practical knowledge of rbd is pretty limited
[18:05] <greglap> yehudasa or joshd will be able to tell you more when they get in
[18:05] <jmlowe> I was reasonably happy with the kernel device in my limited tests, I was just trying it out to make sure I had rbd working before I layered on qemu, I was thinking using the user space librados/qemu would save me a context switch or two
[18:06] <stefanha> jmlowe: The problem is that on the qemu side there is no page cache when you use qemu-img. It's not an issue when you run a VM because the guest has its own page cache.
[18:06] <stefanha> But for qemu-img it means every I/O is going across ceph/rbd.
[18:07] <stefanha> BTW "there is no page cache when you use qemu-img" is only true for protocols like rbd, not local image files.
[18:09] <greglap> a lot of the other network protocols issue an ack as soon as they get the write, and handle the network stuff in private
[18:09] <greglap> this is coming to rbd but we've been focusing on other things
[18:10] <jmlowe> right, I'd rather have eventually than maybe
[18:17] <jmlowe> there's no restriction on multiple clients adding the same rbd device? I'm asking with an eye to online kvm migration where the same bits have to be on both the source and destination
[18:18] <greglap> rbd doesn't provide any synchronization for you and you'll need to properly handle invalidation of any caching layers that are in-between them
[18:18] <greglap> but you can mount it on multiple machines without problem
[18:19] <greglap> (and it will do the synchronization required for snapshot creation all on its own)
[18:19] <jmlowe> one would hope that sync() would be called as part of the migration process
[18:20] <greglap> I'm just letting you know, some people have very strange expectations :)
[18:26] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:41] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:42] <sagewk> slang: there?
[18:42] <slang> sagewk: yes
[18:43] <sagewk> did you get a core for mds.alpha too? can you tell what line it's on in 9: (EMetaBlob::fullbit::update_inode(MDS*, CInode*)+0x1cc) [0x804376]
[18:45] <slang> sorry no
[18:45] <slang> I'm recompiling with -g
[18:45] <sagewk> np
[18:46] <sagewk> for the other part.. if you know which one crashed in sessionmap::decode and dump that object that will help
[18:49] <slang> #0 0x00000000007b6780 in inodeno_t::operator _inodeno_t (this=0x21) at ../../src/include/types.h:299
[18:49] <slang> #1 0x0000000000821a41 in std::less<inodeno_t>::operator() (this=0xeb69e8, __x=..., __y=...) at /usr/include/c++/4.5/bits/stl_function.h:230
[18:49] <slang> #2 0x0000000000839a46 in std::_Rb_tree<inodeno_t, std::pair<inodeno_t const, inodeno_t>, std::_Select1st<std::pair<inodeno_t const, inodeno_t> >, std::less<inodeno_t>, std::allocator<std::pair<inodeno_t const, inodeno_t> > >::_M_lower_bound (this=0xeb69e8, __x=0x1, __y=0xebe300, __k=...) at /usr/include/c++/4.5/bits/stl_tree.h:1020
[18:49] <slang> #3 0x0000000000831761 in std::_Rb_tree<inodeno_t, std::pair<inodeno_t const, inodeno_t>, std::_Select1st<std::pair<inodeno_t const, inodeno_t> >, std::less<inodeno_t>, std::allocator<std::pair<inodeno_t const, inodeno_t> > >::lower_bound (this=0xeb69e8, __k=...) at /usr/include/c++/4.5/bits/stl_tree.h:767
[18:49] <slang> #4 0x0000000000827235 in std::map<inodeno_t, inodeno_t, std::less<inodeno_t>, std::allocator<std::pair<inodeno_t const, inodeno_t> > >::lower_bound (this=0xeb69e8, __x=...) at /usr/include/c++/4.5/bits/stl_map.h:754
[18:49] <slang> #5 0x000000000081ea3e in interval_set<inodeno_t>::find_inc (this=0xeb69e0, start=...) at ../../src/include/interval_set.h:179
[18:49] <slang> #6 0x00000000008173ae in interval_set<inodeno_t>::contains (this=0xeb69e0, i=...) at ../../src/include/interval_set.h:260
[18:49] <slang> #7 0x00000000009ce890 in InoTable::replay_alloc_id (this=0xeb6970, id=...) at ../../src/mds/InoTable.cc:103
[18:49] <slang> #8 0x0000000000806bbd in EMetaBlob::replay (this=0xebb0f0, mds=0xeb2f20, logseg=0xebd8b0) at ../../src/mds/journal.cc:663
[18:49] <slang> #9 0x00000000008092e4 in EUpdate::replay (this=0xebb0c0, mds=0xeb2f20) at ../../src/mds/journal.cc:942
[18:49] <slang> #10 0x00000000009fdf1c in MDLog::_replay_thread (this=0xeb65f0) at ../../src/mds/MDLog.cc:555
[18:49] <slang> #11 0x00000000007df302 in MDLog::ReplayThread::entry (this=0xeb6658) at ../../src/mds/MDLog.h:86
[18:49] <slang> #12 0x00000000009fe9c5 in Thread::_entry_func (arg=0xeb6658) at ../../src/common/Thread.h:41
[18:49] <slang> #13 0x00007ffff764fd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
[18:49] <slang> #14 0x00007ffff6bfa04d in clone () from /lib/x86_64-linux-gnu/libc.so.6
[18:50] <slang> that's not the same one..
[18:53] <slang> these look like memory errors that are just hitting different parts of the code..
[18:53] <slang> this one is different too
[18:53] <slang> #0 0x00007ffff6b47d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #1 0x00007ffff6b4bab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #2 0x00007ffff6b80d7b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #3 0x00007ffff6b8bfb1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #4 0x00007ffff6b8d472 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #5 0x00007ffff6b917b4 in calloc () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #6 0x00007ffff6b7f7ad in open_memstream () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #7 0x00007ffff6bf4deb in __vsyslog_chk () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #8 0x00007ffff6bf544c in syslog () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <slang> #9 0x0000000000ab0738 in DoutStreambuf<char, std::char_traits<char> >::overflow (this=0xe8c010, c=-1) at ../../src/common/DoutStreambuf.cc:195
[18:53] <slang> #10 0x0000000000ab0845 in DoutStreambuf<char, std::char_traits<char> >::sync (this=0xe8c010) at ../../src/common/DoutStreambuf.cc:370
[18:53] <slang> #11 0x00007ffff73d3a6e in std::basic_ostream<char, std::char_traits<char> >::flush() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
[18:53] <slang> #12 0x0000000000805bdd in EMetaBlob::replay (this=0xec0190, mds=0xeb2f20, logseg=0xeda6f0) at ../../src/mds/journal.cc:579
[18:53] <slang> #13 0x00000000008092e4 in EUpdate::replay (this=0xec0160, mds=0xeb2f20) at ../../src/mds/journal.cc:942
[18:53] <slang> #14 0x00000000009fdf1c in MDLog::_replay_thread (this=0xeb65f0) at ../../src/mds/MDLog.cc:555
[18:53] <slang> #15 0x00000000007df302 in MDLog::ReplayThread::entry (this=0xeb6658) at ../../src/mds/MDLog.h:86
[18:53] <slang> #16 0x00000000009fe9c5 in Thread::_entry_func (arg=0xeb6658) at ../../src/common/Thread.h:41
[18:53] <slang> #17 0x00007ffff764fd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
[18:53] <slang> #18 0x00007ffff6bfa04d in clone () from /lib/x86_64-linux-gnu/libc.so.6
[18:53] <sagewk> slang: hrm yeah, which version is this?
[18:54] <sagewk> it's not out of memory right?
[18:55] <slang> v0.29 stable branch
[18:55] <slang> doesn't look like it's out of memory
[18:56] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:56] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:57] <slang> valgrind reports some invalid reads and invalid frees..
[18:57] <sagewk> hmm, we should run this through valgrind and see if anything is awry. we haven't seen those problems
[18:57] <slang> ==10009== Invalid free() / delete / delete[]
[18:57] <slang> ==10009== at 0x4C27FFF: operator delete(void*) (vg_replace_malloc.c:387)
[18:57] <slang> ==10009== by 0x804375: EMetaBlob::fullbit::update_inode(MDS*, CInode*) (journal.cc:434)
[18:57] <slang> ==10009== by 0x80598E: EMetaBlob::replay(MDS*, LogSegment*) (journal.cc:571)
[18:57] <slang> ==10009== by 0x8092E3: EUpdate::replay(MDS*) (journal.cc:942)
[18:57] <slang> ==10009== by 0x9FDF1B: MDLog::_replay_thread() (MDLog.cc:555)
[18:57] <slang> ==10009== by 0x7DF301: MDLog::ReplayThread::entry() (MDLog.h:86)
[18:57] <slang> ==10009== by 0x9FE9C4: Thread::_entry_func(void*) (Thread.h:41)
[18:57] <slang> ==10009== by 0x53AAD8B: start_thread (pthread_create.c:304)
[18:57] <slang> ==10009== by 0x5E4904C: clone (clone.S:112)
[18:57] <slang> ==10009== Address 0x6d22fa0 is 0 bytes inside a block of size 28 free'd
[18:57] <slang> ==10009== at 0x4C27FFF: operator delete(void*) (vg_replace_malloc.c:387)
[18:57] <slang> ==10009== by 0x8115B3: EMetaBlob::fullbit::~fullbit() (EMetaBlob.h:101)
[18:57] <slang> ==10009== by 0x818B53: void decode<EMetaBlob::fullbit>(std::list<EMetaBlob::fullbit, std::allocator<EMetaBlob::fullbit> >&, ceph::buffer::list::iterator&) (encoding.h:293)
[18:57] <slang> ==10009== by 0x8121D8: EMetaBlob::dirlump::_decode_bits() (EMetaBlob.h:315)
[18:57] <slang> ==10009== by 0x80504A: EMetaBlob::replay(MDS*, LogSegment*) (journal.cc:521)
[18:57] <slang> ==10009== by 0x8092E3: EUpdate::replay(MDS*) (journal.cc:942)
[18:57] <slang> ==10009== by 0x9FDF1B: MDLog::_replay_thread() (MDLog.cc:555)
[18:57] <slang> ==10009== by 0x7DF301: MDLog::ReplayThread::entry() (MDLog.h:86)
[18:57] <slang> ==10009== by 0x9FE9C4: Thread::_entry_func(void*) (Thread.h:41)
[18:57] <slang> ==10009== by 0x53AAD8B: start_thread (pthread_create.c:304)
[18:57] <slang> ==10009== by 0x5E4904C: clone (clone.S:112)
[18:58] <slang> I can send the whole output of valgrind if that's helpful
[18:58] <sagewk> yes please
[19:01] * cmccabe (~cmccabe@ has joined #ceph
[19:02] <stefanha> jmlowe: if you try the kernel rbd driver I would be interested in the result :)
[19:04] <sagewk> is there a matching mds log for the replay that triggered that valgrind warning?
[19:04] <slang> http://pastebin.com/raw.php?i=kBbvYemz
[19:05] <lx0> oh, speaking of problems... I've observed a slightly annoying problem with 0.29, maybe 0.28 too: after service ceph stop osd (i.e., not a system crash), attempts to restart osd sometimes fail because of zero-sized pginfo files in current/meta
[19:05] <lx0> is this a known problem?
[19:05] <slang> sagewk: it doesn't actually crash when run in valgrind
[19:06] <sagewk> is that the first warning?
[19:06] <lx0> I could always work around it by discarding that copy of the PG (removing its _head, pginfo and pglog) and letting it resync, but that's a bit of a pain
[19:06] <sagewk> oh i see
[19:07] <lx0> ideally cosd should detect the zero-sized (or otherwise incomplete) pginfo file without crashing, and try to get the info from earlier snapshots (I wasn't using a journal then)
[19:07] <sagewk> lx0: if you're seeing those at all there is something wrong with the osd commit. are you running on btrfs or extN?
[19:07] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:07] <lx0> btrfs
[19:08] <slang> looks like the mds might be looping on something, stuff in the log looks like this and seems to be repeating: http://pastebin.com/raw.php?i=E6NcmtTY
[19:09] <lx0> no journal, 2.6.39-libre kernel with many debug options enabled. that made for a slooow ceph, I found out as I switched back to 2.6.38.* :-)
[19:10] <lx0> not sure this could be a bug in linux-2.6.39 though
[19:11] <lx0> I haven't had much reason to restart osds after switching back, at about the same time I upgraded to ceph 0.29.1, so I can't tell whether the problem is still there, latent or not
[19:11] <sagewk> slang: pushed fix for the Esession part to stable branch
[19:11] <sagewk> slang: that is normal idle noise
[19:12] <lx0> I ended up re-creating my cluster today, after scrubbing found an inconsistency in one of the PGs that probably resulted from a creative attempt to recover from the problem
[19:12] <sagewk> slang: can you build latest stable and retry, and see if that bad delete in EMetaBlob::fullbit::update_inode(MDS*, CInode*) still comes up?
[19:12] <lx0> (I copied the pginfo for that PG from an earlier snap_, but that was detected as too-old/corrupted)
[19:13] <sagewk> it looks impossible from the code :/
[19:14] <lx0> and somehow the inconsistency propagated to all 3 copies of the PG. I could recover one of the missing files it reported, but I couldn't figure out how to correct the stat mismatch
[19:14] <lx0> I still have the mon and osd data for one of the members of the cluster, if that would help
[19:15] <lx0> unfortunately, I accidentally wiped out the logs
[19:16] <gregaf1> what do you mean the inconsistency propagated?
[19:16] <slang> sagewk: yes
[19:16] <gregaf1> lx0: do you mean it couldn't resolve the inconsistency, or that the wrong data ended up on each OSD?
[19:17] <lx0> I haven't noticed any wrong data, but it certainly couldn't resolve the reported inconsistency (and I couldn't either)
[19:18] <lx0> by "propagated", I mean I tried keeping a single OSD up at a time and telling it to scrub the inconsistent PG to see if it would fix things up, to no avail
[19:19] <lx0> the plan was to find one that hadn't been affected and wipe the PG out from the others so that it would resync from scratch, but since all of them reported the inconsistency, I was at a loss
[19:20] <gregaf1> lx0: well the check for inconsistency is a check against each other, so if one of them is different from the others it will report inconsistent
[19:21] <lx0> (I had x3 replication)
[19:21] <lx0> hmm... it still reported a stat mismatch when a single PG was up
[19:21] <sagewk> standup brb
[19:21] <gregaf1> scrub repair isn't actually that useful right now though -- it can handle missing objects and objects with the wrong time stamps, but if the actual PG metadata is inconsistent between OSDs it can't repair that
[19:22] <gregaf1> and once a PG is flagged inconsistent it doesn't go to consistent without active work -- taking OSDs down can't lose that or bad things might happen
[19:22] <gregaf1> back in a bit
[19:22] <lx0> aah, it was two different problems, then, probably caused by my failed attempt to recover without resyncing
[19:22] <lx0> maybe if I had kept two OSDs up at a time I could have found a consistent pair, and propagated from that to the third. anyway, too late now
[19:23] <lx0> hmm, I've seen inconsistent PGs go back to consistent by restarting OSDs before. can't quite place how long ago that was, though
[19:24] <lx0> or rather lose the inconsistent bit, at least
[19:27] <lx0> biab myself too ;-)
[19:40] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:57] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[20:00] <gregaf1> lx0: well it can repair certain kinds of inconsistencies and you might have triggered that somehow, but I don't think it should have "lost" the inconsistent setting
[20:08] <yehudasa> Tv: I pushed my python s3 tests to some branch, can you take a look?
[20:08] <Tv> will do
[20:13] <lx0> the "loss" of the inconsistent bit was long ago. it seemed to be recovered upon scrubbing, that's why I thought scrubbing might help me get rid of it this time. anyway, now I know how *not* to try to bring a crashing osd back up ;-)
[20:14] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[20:17] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[20:18] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[20:19] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:21] <sjust> lx0: "inconsistent" is set when scrubbing discovers an inconsistency among the pg replicas. The reason it appears to go away when you restart is that I don't think it gets written to disk. You can tell the osd to repair the affected pg, that will in many cases take care of the problem.
[20:23] <gregaf1> wait, inconsistency is transient?
[20:23] <sjust> gregaf1: well, no
[20:23] <sjust> but it gets stashed in the pg state
[20:23] <sjust> like active or clean
[20:23] <sjust> or peering
[20:23] <sjust> etc
[20:23] <gregaf1> so how does it get lost?
[20:24] <sjust> so I don't think that the marker persists across reboots
[20:25] <gregaf1> so it is transient...
[20:25] <gregaf1> in the "not in permanent storage anywhere"
[20:25] <gregaf1> sense
[20:26] <gregaf1> I thought we needed to keep track of it so we didn't accidentally remove the right version
[20:26] <sjust> the marker is transient, the status is not (restarting the cluster does not actually make the pg consistent) /pedantic
[20:26] <cmccabe> sjust: um, based on my memory of what the code used to look like
[20:26] <cmccabe> sjust: the lost bit is in the object something or other _t
[20:26] <cmccabe> sjust: and does get serialized and written out
[20:26] <cmccabe> object_locator_t?
[20:26] <sjust> cmccabe: lost is different
[20:27] <cmccabe> sjust: oh, I thought you were talking about lost, since I saw "gregaf: so how does it get lost?"
[20:27] <sjust> cmccabe: nope :)
[20:27] <cmccabe> sjust: k
[20:27] <sjust> gregaf1: the marker should probably persist, but that's a problem for another day
[20:27] <cmccabe> sjust: yeah PG inconsistency is different, and I don't remember it getting stored anywhere
[20:28] <gregaf1> ungh, now I feel like I should make a bug for it
[20:29] <cmccabe> I mean it really comes out of reconciling the PG::Log and the objects which are present with the files in the filesystem
[20:29] <cmccabe> gregaf: avoiding a recovery storm is a much bigger problem; make a bug for that first if it doesn't exist :)
[20:30] <gregaf1> different kinds of bugs
[20:30] <gregaf1> losing inconsistency is a correctness bug, recovery storms are availability bugs
[20:30] <gregaf1> you can make bugs for whatever you like, though ;)
[20:33] <cmccabe> I have a vague sense that we're already creating and saving too much state
[20:33] <cmccabe> which just adds to the complexity of testing all these cases
[20:34] <cmccabe> but I can see why you might think preserving the knowledge that the PG is inconsistent across restarts would be good
[20:34] <gregaf1> your vague sense is meaningless in the face of potential data corruption
[20:35] <gregaf1> nah, we might be saving more state than we need to, I haven't audited it
[20:35] <gregaf1> but stuff like inconsistent flags is stuff we need
[20:35] <cmccabe> I forget how we were supposed to resolve inconsistent states
[20:36] <cmccabe> I think it had something to do with the primary just winning?
[20:37] <gregaf1> not sure what we do now (and it's probably less correct than it could be)
[20:37] <cmccabe> I mean most of the recovery process is written around the idea that the primary knows the objects it wants, and is just trying to get them all into place
[20:38] <gregaf1> but if we haven't resolved the inconsistency then forgetting it happened means that we can lose potentially-correct differing data
[20:38] <cmccabe> there is a part where we merge histories, but what can you do if they are inconsistent?
[20:39] <gregaf1> all kinds of things, though I don't think we do any of them right now
[20:39] <gregaf1> checksum-and-vote, checksum-and-compare-to-old-checksum
[20:39] <gregaf1> present to administrator for decision
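
The "checksum-and-vote" option mentioned above, with "present to administrator" as the fallback, can be sketched roughly as follows. This is only a toy illustration and not how cosd resolves anything: the function names, the use of SHA-256, and the OSD labels are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def checksum(data):
    """Hex digest of one replica's object contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def vote(replicas):
    """Pick the digest held by a strict majority of replicas.

    `replicas` maps a hypothetical OSD name to the bytes it holds for one object.
    Returns (digest, agreeing_osds, dissenting_osds). If no strict majority
    exists, digest is None and every OSD is listed as dissenting, which is the
    case an administrator would have to decide.
    """
    if not replicas:
        return None, [], []
    digests = {osd: checksum(data) for osd, data in replicas.items()}
    winner, count = Counter(digests.values()).most_common(1)[0]
    if count * 2 <= len(replicas):
        return None, [], sorted(digests)
    agree = sorted(osd for osd, d in digests.items() if d == winner)
    dissent = sorted(osd for osd, d in digests.items() if d != winner)
    return winner, agree, dissent
```

With three replicas where one differs, the two matching copies win and the third is flagged; with two differing replicas there is no majority and nothing is chosen automatically.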
[20:40] <cmccabe> checksums can help with individual objects, but not so much with histories that are different
[20:40] <gregaf1> that's generally how we get inconsistent though
[20:41] <cmccabe> I think there is some kind of history merge algorithm going on there
[20:41] <gregaf1> we have different contents and not enough history to know which is right
[20:41] <gregaf1> or the modification times don't match
[20:41] <gregaf1> or whatever
[20:41] <cmccabe> like if history 1 talks about creating A and B, and history 2 talks about creating C and D, we can construct a history where we create ABCD
[20:41] <gregaf1> yeah, that's not inconsistent though, that's just incomplete
[20:41] <cmccabe> I think we can generally all agree that that is "correct"
[20:41] <gregaf1> I'm pretty sure
[20:41] <cmccabe> oh, yeah
[20:42] <cmccabe> you're right
[20:42] <cmccabe> inconsistent would be something like history 1 deletes A and history 2 modifies A
[20:42] <cmccabe> I think by definition you can't merge inconsistent histories in any way which is intuitively "correct"
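
The distinction drawn above between incomplete histories (which merge cleanly) and inconsistent ones (which do not) can be illustrated with a small sketch. It assumes a deliberately simplified model in which a history is just a mapping from object name to the last operation applied; the real PG log is far richer, and `merge_histories` is a hypothetical name, not a Ceph function.

```python
def merge_histories(h1, h2):
    """Merge two toy histories, each a dict of object name -> last operation.

    Objects seen in only one history are simply carried over (the
    "incomplete" case). An object whose last operation differs between
    the two histories is reported as a conflict (the "inconsistent" case),
    and no merged history is produced.
    Returns (merged, conflicts); merged is None when conflicts exist.
    """
    merged = dict(h1)
    conflicts = []
    for obj, op in h2.items():
        if obj in merged and merged[obj] != op:
            conflicts.append(obj)
        else:
            merged[obj] = op
    if conflicts:
        return None, sorted(conflicts)
    return merged, []
```

Merging a history that creates A and B with one that creates C and D yields all four objects, while a history that deletes A cannot be merged with one that modifies A.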
[20:42] <gregaf1> I don't think we handle inconsistent histories at all, because they really can't happen
[20:43] <cmccabe> I guess what I'm getting at is that inconsistent histories may lead to data corruption in all cases, whether we save the state to disk or not
[20:43] <gregaf1> if we actually get one maybe we'll do something, but unless we have multiple histories we can compare we're just going to have to assume the primary for each epoch was correct
[20:43] <gregaf1> yes, but the point is that however we resolve it, we should resolve it with all the data and not move forward by forgetting that we were inconsistent
[20:44] <cmccabe> so saving the state to disk helps you realize you're in trouble earlier, but there's still nothing you can really do
[20:44] <cmccabe> I guess you could somehow make the administrator handle it
[20:44] <cmccabe> but imagine the complexity of even visualizing what was wrong
[20:44] <cmccabe> we'd have to write some kind of history manipulation tool suite
[20:45] <cmccabe> would make administering ZFS look like farmville
[20:45] <lx0> heh
[20:45] <wido> hi guys
[20:46] <wido> sjust: Are you around?
[20:47] <cmccabe> wido: sam was here a minute ago, it's getting to lunchtime though
[20:47] <wido> ah, ok :)
[20:47] <sjust> wido: here
[20:48] <wido> sjust: hi
[20:48] <wido> You might have noticed my cluster won't work; the cosds started asserting again
[20:49] <sjust> wido: yeah, one sec
[20:50] <wido> In the past I've seen my cluster run into so many corner cases that I was hitting crashes nobody would ever run into again
[20:51] <sjust> wido: I've been trying to reproduce that particular error using our cluster
[20:51] <wido> Is it useful to keep my cluster "running" and report every crash that I see?
[20:53] <sjust> wido: the problem, I think, is that we have logging off; turning it back on causes recovery to go way too slowly to be useful
[20:54] <sjust> wido: are you seeing any asserts other than the incorrect query assert?
[20:54] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) has joined #ceph
[20:54] <fred_> hi
[20:54] <sjust> fred_: hi
[20:54] <wido> sjust: Yes, a lot, about 12 OSDs crashed
[20:55] <wido> from what I see now, about 5 or 6 went down with the same assert
[20:55] <fred_> sjust, do you need something from me about #1191?
[20:56] <sjust> fred_: I'll take a closer look at the cores later today, but I think I've got all the information we can get
[20:56] <fred_> ok
[20:56] <sjust> wido: I'm going to take a quick look at the cluster now
[20:57] <wido> sjust: Sure, np. But I want to avoid us continuing to work on a cluster that is so severely damaged that we are wasting our time
[20:58] <sjust> wido: the query assert one is probably important, so I'm hoping it'll come back up in other testing with logging on
[20:58] <fred_> is there a way to check the integrity of rbd images? i.e., are all the objects available for a given image...
[20:59] <sjust> if all the pgs in the rbd pool read active+clean, you're probably good
[20:59] <fred_> the fact is that I probably deleted some objects by mistake using "rados rm "
[21:00] <joshd> fred_: I'm not sure there's much you can do in that case, except rollback to a snapshot
[21:01] <joshd> which objects did you delete?
[21:01] <fred_> no idea :)
[21:01] <sjust> wido: the logs on the machines seem to have been rotated out, would I find the assert failure backtraces in /srv/ceph/remote-syslog?
[21:02] <fred_> I tried to rbd export my images and it worked; does that mean no object was deleted?
[21:03] <sjust> wido: you're probably right about the cluster being too badly messed up, though
[21:03] <wido> sjust: Oh, no, the 'logger' machine is in Amsterdam, where my second (smaller) cluster is
[21:03] <sjust> oh
[21:03] <wido> The big cluster is at the office, the logs are stored on noisy.ceph.widodh.nl
[21:04] <wido> you'll find them in /var/log/remote/ceph
[21:04] <sjust> ah, gotcha
[21:04] <joshd> fred_: if export worked, it was able to read all the objects from that image
[21:04] <wido> If you guys say to do a fresh mkcephfs with the latest stable, I'll do it. But if the cluster will still provide a lot of valuable information, I'll leave it in this state
[21:06] <sjust> wido: I'm going to take a look at those asserts first, anyway
[21:06] <wido> Sure, I
[21:06] <wido> I'll be right back
[21:06] <sjust> I'm going to lunch in a moment, I'll be back on later
[21:30] * fred_ (~fred@80-219-183-100.dclient.hispeed.ch) Quit (Quit: Leaving)
[21:37] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:43] * mib_0teqg5 (5138106c@ircip2.mibbit.com) has joined #ceph
[22:07] * jmlowe (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[22:25] * monrad-51468 (~mmk@domitian.tdx.dk) Quit (Quit: bla)
[22:25] * monrad-51468 (~mmk@domitian.tdx.dk) has joined #ceph
[22:30] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[23:23] * lxo (~aoliva@ has joined #ceph
[23:30] * lx0 (~aoliva@1GLAAB9ZF.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:32] * mib_0teqg5 (5138106c@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[23:57] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.