#ceph IRC Log


IRC Log for 2011-08-29

Timestamps are in GMT/BST.

[0:08] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[0:13] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[0:14] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[0:33] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[0:35] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[0:56] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[2:12] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:33] * yoshi_ (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:33] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Read error: Connection reset by peer)
[4:10] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[4:42] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[8:27] * cp (~cp@m8d0536d0.tmodns.net) has joined #ceph
[8:35] * cp (~cp@m8d0536d0.tmodns.net) Quit (Ping timeout: 480 seconds)
[8:54] * eternaleye_ (~eternaley@ has joined #ceph
[8:57] * eternaleye (~eternaley@ Quit (Remote host closed the connection)
[8:58] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * rsharpe (~Adium@70-35-37-146.static.wiline.com) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (synthon.oftc.net charm.oftc.net)
[8:58] * todin (tuxadero@kudu.in-berlin.de) Quit (synthon.oftc.net charm.oftc.net)
[8:59] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[8:59] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[8:59] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[8:59] * rsharpe (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[8:59] * yehuda_hm (~yehuda@99-48-179-68.lightspeed.irvnca.sbcglobal.net) has joined #ceph
[8:59] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[8:59] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[11:04] * yoshi_ (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:04] * lx0 (~aoliva@19NAADDN8.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[14:52] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:54] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[15:40] * eternaleye__ (~eternaley@ has joined #ceph
[15:40] * eternaleye_ (~eternaley@ Quit (Read error: Connection reset by peer)
[16:33] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) Quit (Ping timeout: 480 seconds)
[16:48] * The_Bishop (~bishop@dslb-188-103-205-222.pools.arcor-ip.net) has joined #ceph
[16:59] * mtk (WqFAQ1zgh3@panix2.panix.com) has joined #ceph
[17:01] * mtk (WqFAQ1zgh3@panix2.panix.com) Quit ()
[17:02] * mtk (lDuACEgRML@panix2.panix.com) has joined #ceph
[17:04] * ajm (adam@adam.gs) has joined #ceph
[17:05] <ajm> anyone seen anything like this: http://pastebin.com/rcaAbfY0
[17:05] <ajm> cosd appears to crash a few moments after it starts up
[17:18] * andret (~andre@pcandre.nine.ch) Quit (Remote host closed the connection)
[17:19] * mtk (lDuACEgRML@panix2.panix.com) Quit (Remote host closed the connection)
[17:31] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[17:35] * mtk (nrt2niVPD1@panix2.panix.com) has joined #ceph
[17:43] * The_Bishop (~bishop@dslb-188-103-205-222.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[17:51] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[17:53] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) has joined #ceph
[18:07] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) Quit (Ping timeout: 480 seconds)
[18:09] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) has joined #ceph
[18:12] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[18:12] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[18:19] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[18:25] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[18:27] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[18:38] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:42] * gregaf (~Adium@aon.hq.newdream.net) Quit (Remote host closed the connection)
[18:43] * gregaf (~Adium@aon.hq.newdream.net) has joined #ceph
[18:59] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:59] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:01] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[19:30] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:36] * adjohn (~adjohn@ has joined #ceph
[19:38] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[19:43] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:45] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:03] <cmccabe> I got an interesting backtrace with the latest radosgw
[20:03] <cmccabe> #0 0x00007f72e49a3e2b in raise () from /lib/libpthread.so.0
[20:03] <cmccabe> (gdb) bt
[20:03] <cmccabe> #0 0x00007f72e49a3e2b in raise () from /lib/libpthread.so.0
[20:03] <cmccabe> #1 0x00000000009a7a93 in reraise_fatal (signum=6) at global/signal_handler.cc:59
[20:03] <cmccabe> #2 0x00000000009a7c8d in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
[20:03] <cmccabe> #3 <signal handler called>
[20:03] <cmccabe> #4 0x00007f72e3187165 in raise () from /lib/libc.so.6
[20:03] <cmccabe> #5 0x00007f72e3189f70 in abort () from /lib/libc.so.6
[20:03] <cmccabe> #6 0x00007f72e31802b1 in __assert_fail () from /lib/libc.so.6
[20:03] <cmccabe> #7 0x00000000009c6a93 in FileStore::sync_entry (this=0x21c6000) at os/FileStore.cc:3176
[20:03] <cmccabe> #8 0x00000000009d2ce6 in FileStore::SyncThread::entry() ()
[20:03] <cmccabe> #9 0x000000000084ff99 in Thread::_entry_func (arg=0x21c6698) at common/Thread.cc:45
[20:03] <cmccabe> #10 0x00007f72e499b8ba in start_thread () from /lib/libpthread.so.0
[20:03] <cmccabe> #11 0x00007f72e322402d in clone () from /lib/libc.so.6
[20:03] <cmccabe> #12 0x0000000000000000 in ?? ()
[20:03] <sjust> what was the line of the assert?
[20:04] <gregaf> cmccabe: oh, forgot to mention, you're on support watch this week if you didn't check your calendar :)
[20:04] <gregaf> (or even if you did check it)
[20:04] <cmccabe> 3176
[20:04] <cmccabe> yeah, I know
[20:04] <gregaf> coolio
[20:04] <sjust> oh, snap creation failed
[20:05] <cmccabe> yeah, it probably is just a btrfs hiccup
[20:05] <sjust> dmesg?
[20:05] <cmccabe> [5974569.432078] INFO: task btrfs-transacti:696 blocked for more than 120 seconds.
[20:05] <cmccabe> [5974569.435230] [<ffffffffa0345aa1>] ? find_first_extent_bit+0x2b/0x74 [btrfs]
[20:05] <cmccabe> [5974569.435236] [<ffffffff8105a6a0>] ? lock_timer_base+0x26/0x4b
[20:05] <cmccabe> [5974569.435240] [<ffffffff8105a728>] ? try_to_del_timer_sync+0x63/0x6c
[20:05] <cmccabe> [5974569.435251] [<ffffffffa032dea4>] ? wait_current_trans+0x9a/0xe5 [btrfs]
[20:05] <cmccabe> [5974569.435255] [<ffffffff81064d2a>] ? autoremove_wake_function+0x0/0x2e
[20:05] <cmccabe> [5974569.435260] [<ffffffff812fb175>] ? mutex_lock+0xd/0x31
[20:05] <cmccabe> [5974569.435270] [<ffffffffa032e50b>] ? start_transaction+0x67/0x126 [btrfs]
[20:06] <cmccabe> [5974569.435280] [<ffffffffa032a7c2>] ? transaction_kthread+0x160/0x1ea [btrfs]
[20:06] <cmccabe> [5974569.435285] [<ffffffff8103a9dd>] ? __wake_up_common+0x44/0x72
[20:06] <cmccabe> [5974569.435296] [<ffffffffa032a662>] ? transaction_kthread+0x0/0x1ea [btrfs]
[20:06] <cmccabe> [5974569.435298] [<ffffffff81064a5d>] ? kthread+0x79/0x81
[20:06] <cmccabe> [5974569.435303] [<ffffffff81011baa>] ? child_rip+0xa/0x20
[20:06] <cmccabe> [5974569.435306] [<ffffffff810649e4>] ? kthread+0x0/0x81
[20:06] <cmccabe> [5974569.435309] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[20:06] <cmccabe> I guess I need to upgrade my kernel, otherwise I'm just wasting time reproducing already-found bugs
[20:19] * bchrisman (~Adium@ has joined #ceph
[20:41] <sagewk> slang: there?
[20:43] <ajm> http://pastebin.com/rcaAbfY0 anyone seen anything like this? cosd crashes immediately on startup on some nodes for me w/that
[20:52] <sagewk> cmccabe: i wonder if the testlibrbd should be redone to use gtest (and look like the rados-api tests)
[20:53] <cmccabe> sagewk: yeah, it definitely should
[20:53] <sagewk> ajm: can you crank up the osd log to 20 and reproduce?
[20:53] <cmccabe> sagewk: I need to have something to test the new api though
[20:53] <sagewk> that should point you at which object/file has bad data on disk.
[20:53] <sagewk> my guess is a missing xattr or 0-length object
[21:30] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[21:36] <ajm> sagewk: http://adam.gs/osd.8.log
[21:42] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[21:42] <Tv> sagewk: any objections to tearing down the libvirt migration demo setup?
[21:55] <Tv> sagewk: you made gitbuilder red
[22:02] <gregaf> fyi he's out at lunch
[22:02] <Tv> i'm in no rush
[22:20] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[22:44] <cmccabe> joshd: what would a typical block size be for librbd?
[22:45] <joshd> cmccabe: 4MB by default
[22:45] <cmccabe> joshd: so it seems pretty reasonable to just call a progress callback after each chunk
[22:46] <joshd> most likely that will be fine
[22:46] <cmccabe> joshd: in copy()
[22:46] <cmccabe> joshd: if one is provided
[22:48] <joshd> if you have a large image (like terabytes) you might not care about every block
[22:49] <cmccabe> 1TB = 256 4MB chunks
[22:49] <cmccabe> er wait
[22:49] <cmccabe> 262144
[22:50] <cmccabe> anyway, we already do a callback on each 4MB chunk in copy()
[22:50] <joshd> maybe each block, or every 1000th of the size, whichever is larger?
[22:50] <cmccabe> if we invoke a user-supplied callback, the most expensive part is the probable icache miss
[22:51] <cmccabe> because it's calling through a function pointer
[22:52] <cmccabe> it might be best to leave "skip every N progress bar updates" to the user
[22:54] <joshd> yeah, I think you're right. leaving it at just every block and letting the user decide whether to actually do something with it would be fine
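
The design settled on above (call the progress callback after every chunk, and leave any skipping of updates to the caller) can be sketched roughly as follows. This is only an illustration: `copy_with_progress` and `EveryNth` are hypothetical names, not the librbd API, and the loop only counts chunks instead of copying data. The 4 MB chunk size and the 1 TB image size are the figures from the discussion.

```python
CHUNK = 4 * 1024 * 1024  # 4 MB default block size mentioned above


def copy_with_progress(total, chunk=CHUNK, progress=None):
    """Hypothetical stand-in for a chunked copy loop (not the librbd API).

    Calls progress(done, total) after every chunk if a callback is given
    and returns the number of chunks processed.
    """
    chunks = 0
    for off in range(0, total, chunk):
        done = min(off + chunk, total)
        # ... the real copy of bytes [off, done) would happen here ...
        chunks += 1
        if progress is not None:
            progress(done, total)
    return chunks


class EveryNth:
    """Caller-side throttle: act only on every Nth update and on the last."""

    def __init__(self, n):
        self.n = n
        self.calls = 0
        self.updates = 0

    def __call__(self, done, total):
        self.calls += 1
        if self.calls % self.n == 0 or done == total:
            self.updates += 1  # a real caller would redraw its progress bar


one_tb = 1 << 40
throttle = EveryNth(1000)
chunks = copy_with_progress(one_tb, progress=throttle)
print(chunks, throttle.updates)
```

For a 1 TB image this makes 262144 callback invocations, matching the corrected figure in the conversation, while the caller-side throttle only acts on a small fraction of them. The copy loop itself stays simple because it never has to guess how often a given caller wants updates.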
[23:08] <sagewk> tv: you can tear it down yeah
[23:12] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[23:14] <sagewk> ajm: 0.1fb looks like the culprit. can you ls -al on current/meta/pginfo_0.1fb* ?
[23:15] <sagewk> ajm: are you using ext4 or btrfs or something else?
[23:15] <Tv> oooh neat i can get doxygen docs to show up as part of the "big ceph doc"
[23:15] <Tv> with the same layout & template & etc
[23:15] * lxo (~aoliva@9KCAAAP54.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:15] <Tv> did someone already figure out the right config for doxygen?
[23:16] <joshd> Tv: I think sjust did
[23:16] <sjust> more or less kinda
[23:16] <Tv> hehe processing mds.c takes ages
[23:17] <Tv> sjust: i'm interested in any exclude etc stuff if you managed to figure out the right settings.. basically, this thing will slurp the xml output from doxygen and publish it very neatly
[23:18] <sjust> ah...
[23:18] <sjust> I didn't have to do any exclude stuff
[23:18] <Tv> ah ok
[23:19] <sjust> I just changed the base config slightly
[23:19] * slang (~slang@chml01.drwholdings.com) Quit (Remote host closed the connection)
[23:19] <Tv> i changed like 3 settings and it seems to work, just slowly
[23:19] <sjust> sounds about right
[23:19] <sjust> I also only tried to run it against the src/os directory
[23:19] <Tv> i guess that's because i have it still writing out the full source, and i included the undocumented stuff
[23:20] <Tv> hurr my config says INLINE_SOURCE=NO etc
[23:20] <Tv> i wonder what it's doing
[23:21] <ajm> sagewk: it's btrfs, that file is 0-bytes: -rw-r--r-- 1 root root 0 Aug 26 10:36 current/meta/pginfo_0.1fb_0
[23:21] <Tv> ohh it's consuming the vstart logs
[23:21] <Tv> GAH
[23:21] <Tv> "mds.c" !!!
[23:22] <sjust> :)
[23:22] <sagewk> tv: i was wondering if that was a typo.. :)
[23:22] <Tv> sagewk: i hate you for not being consistent with .log suffix... :-/
[23:23] <sagewk> tv: feel free to patch vstart.sh. just minimizing typing...
[23:24] <sagewk> ajm: is this an old cluster you've only just upgraded, or has it been running reasonably recent code up until this last restart?
[23:24] <sagewk> ajm: i guess the question is, what version was it last running before you restarted it?
[23:24] <ajm> it was working at 0.33
[23:25] <ajm> since right after the 0.33 release, then broke over the weekend
[23:25] <sagewk> hrm ok. basically, that file is only ever written with at least 8 bytes of content. so something is breaking in filestore/filejournal or in btrfs. wido was seeing the same issue.
[23:27] <ajm> interesting, any idea how to fix temp? can I just rm / move that ?
[23:28] <sagewk> rename the collection directory (snap_highesternumberyousee/1.1fb_head) somewhere else and restart. it will recover as long as there is a full copy of that collection on another osd
[23:42] <ajm> did you mean move snap_/0.1fb_head ?
[23:44] <sagewk> it should be snap_somebignumber .. whichever the largest one in your osd data dir is
[23:45] <ajm> I mean 1.1fb vs 0.1fb
[23:46] <sagewk> oh.. yeah 0.1fb
[23:51] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[23:51] <ajm> :)
[23:51] <ajm> I assume you identified that as the issue since it was the last collection to be read before it broke?
[23:51] <ajm> I have another OSD to fix
[23:54] * jim (~chatzilla@c-71-202-13-33.hsd1.ca.comcast.net) has joined #ceph
[23:55] <sagewk> ajm: exactly
[23:59] <ajm> oops, i managed to break 1.1fb somehow too :)
[23:59] <ajm> but that fixed it after I removed 1.1fb as well
[23:59] <ajm> thanks sagewk
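
The two lookups described in the exchange above (finding the zero-length `pginfo_*` file under `current/meta`, and picking the highest-numbered `snap_` directory in the OSD data dir) could be scripted along these lines. This is a read-only sketch under assumptions: the helper names are hypothetical, the directory layout is inferred only from the paths quoted in the conversation, and the demo runs against a throwaway temporary tree, not a real OSD. It does not rename or remove anything.

```python
import os
import re
import tempfile


def zero_length_pginfo(meta_dir):
    """Return sorted names of 0-byte pginfo_* files in meta_dir."""
    return sorted(
        name for name in os.listdir(meta_dir)
        if name.startswith("pginfo_")
        and os.path.isfile(os.path.join(meta_dir, name))
        and os.path.getsize(os.path.join(meta_dir, name)) == 0
    )


def highest_snap_dir(osd_data_dir):
    """Return the snap_<number> entry with the largest number, or None."""
    best = None
    for name in os.listdir(osd_data_dir):
        m = re.fullmatch(r"snap_(\d+)", name)
        if m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), name)
    return best[1] if best else None


# Demo against a temporary tree that mimics the paths quoted in the log.
with tempfile.TemporaryDirectory() as osd:
    meta = os.path.join(osd, "current", "meta")
    os.makedirs(meta)
    # One empty pginfo file (the bad case) and one with 8 bytes of content.
    open(os.path.join(meta, "pginfo_0.1fb_0"), "w").close()
    with open(os.path.join(meta, "pginfo_0.1fc_0"), "w") as f:
        f.write("12345678")
    for n in (9, 10, 120):
        os.mkdir(os.path.join(osd, "snap_%d" % n))
    empty = zero_length_pginfo(meta)
    newest = highest_snap_dir(osd)
    print(empty, newest)
```

The snapshot directories are compared numerically, not as strings, so `snap_120` ranks above `snap_9`. Whether moving a collection aside is safe still depends on another OSD holding a full copy, as sagewk notes.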

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.