#ceph IRC Log

IRC Log for 2011-09-01

Timestamps are in GMT/BST.

[0:02] <johnl_> sjust: np. was a test cluster, no real data.
[0:02] <sagewk> damoxc: sigh.. well, there are a few different potential offenders, but it's hard to tell without knowing what happened before. i'll push a patch that will clean it up for you (and complain loudly about the inconsistency)
[0:03] <sjust> johnl_: good to hear, I'll have a fix pushed shortly
[0:03] <damoxc> sagewk: okay cool, thank you very much
[0:03] <johnl_> super
[0:03] * votz (~votz@pool-72-94-171-89.phlapa.fios.verizon.net) has joined #ceph
[0:08] <sagewk> damoxc: hmm, why did this osd stop originally? did it crash, or did you shut it down for an upgrade or something?
[0:09] <sjust> johnl_: pushed
[0:09] <damoxc> sagewk: I'm assuming they crashed, I noticed 5 osds out of 14 had stopped
[0:09] <sjust> thanks for the bug report!
[0:10] <damoxc> sagewk: yeah crashed, same stack trace by the looks of things, but I only have pre-assert fix stacks
[0:11] <damoxc> sagewk: and debug logging was switched off so it's not too useful I'm afraid
[0:11] <sagewk> was logging on prior to the original crash?
[0:13] <damoxc> sagewk: http://dpaste.com/606289/ best I can do
[0:13] <sagewk> oh well. i pushed a wip-rmdir branch that has a patch to clean this up for you. you'll need to pass --filestore-cleanup-rmdir-notempty (or put filestore cleanup rmdir notempty = true in your .conf)
[0:13] <sagewk> k thanks
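The workaround sagewk describes can go either on the osd command line or into ceph.conf; a minimal sketch, assuming the option belongs in the [osd] section (the flag spelling is exactly as given above, everything else is illustrative):

    # one-off, on the osd command line:
    #   --filestore-cleanup-rmdir-notempty
    # or persistently, in ceph.conf:
    [osd]
        filestore cleanup rmdir notempty = true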
[0:17] <damoxc> sagewk: an off-topic question, are there any parameters you can change to improve rbd performance inside qemu?
[0:18] <sagewk> cache=writeback? that's cheating tho
[0:19] <damoxc> heh yeah
[0:20] <sagewk> there's some stuff on the road map that will help... no immediate knobs to turn tho
[0:20] <sagewk> what performance are you concerned about?
[0:21] <damoxc> I guess mostly latency, although throughput seems to be quite low atm
[0:21] <sagewk> what workload?
[0:21] <johnl_> sjust: ta.
[0:21] <Tv> damoxc: also, read or write
[0:22] <damoxc> using Windows 7x64 as a guest, running performancetest from passmark, the disk benchmark showed approx 50MB/s sequential read, 6MB/s sequential write
[0:23] <damoxc> random write was fairly dire, barely over 1MB/s if memory serves
[0:25] <damoxc> strange really as I've been able to get sustained sequential writes of over 90MB/s using the unix filesystem
[0:25] <sagewk> care to capture a log for us? if you can feed debug_ms=1 and log_file=/some/thing in the qemu option string (e.g. rbd:pool/image:debug_ms=1:log_file=/some/path ) that will tell us more
[0:25] <Tv> damoxc: yeah so right now rbd basically flushes after every write; linux kernel buffer cache will paper over that pretty well, but benchmarks just don't
[0:25] <sagewk> it may be that windows is doing sync writes and expecting the hard disk cache to make it all better
[0:26] <damoxc> sagewk: I did spy something about that in the disk management stuff
[0:26] <damoxc> I'll try turning that off and see how things change, and also capture a log for you tomorrow
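A sketch of how the logging options sagewk mentions, plus the cache=writeback "cheat" from earlier, might be combined in a qemu drive specification; the pool name, image name, and log path are placeholders, not values from this discussion:

    # illustrative qemu -drive fragment; the rbd options ride inside the file= string
    -drive file=rbd:pool/image:debug_ms=1:log_file=/some/path,if=virtio,cache=writeback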
[0:27] <Tv> sagewk: btw just double-checking; making rbd do acks right went on the roadmap, right?
[0:27] <gregaf> Tv: I actually didn't see it on there, just trivial and good layering...
[0:27] <sagewk> by "right" you mean "lie the way a normal disk cache does"?
[0:28] <Tv> sagewk: yes, that's what's expected these days
[0:28] <sagewk> yeah
[0:28] * lxo (~aoliva@82VAADKKR.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[0:29] <Tv> "right" as in acceptable, not "right" as in morally right ;)
[0:29] <sagewk> :)
[0:31] <sagewk> slang: a workload we can reproduce easily will make it much much easier to find/fix
[0:31] <slang> yeah
[0:35] <Tv> that's http://tracker.newdream.net/issues/947
[0:36] <slang> sagewk: if the servers are still running, what prevents a response from getting returned to the client?
[0:37] <sagewk> for 1472 actually 2 of the 3 hangs look like the mds is hung up on something
[0:38] <sagewk> the first one looks like a hung osd. but a hung osd will also make the mds get blocked up... so it's a question of which of those hangs you are still seeing with healthy osds
[0:40] <slang> I should note, I can kill and restart cfuse and everything looks fine
[0:41] <sagewk> the client session teardown is often enough to knock the mds out of whatever corner it got stuck in.
[0:41] <sagewk> even running stat on another client can sometimes wake it up.
[1:20] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:48] <damoxc> sagewk: thanks for the fixes :-)
[1:49] <sagewk> damoxc: no problem. that fixed it up for you?
[1:59] <damoxc> I'll test it properly tomorrow and report back then, bed time now, just wanted to show my appreciation
[2:03] <sagewk> damoxc: np. thanks for testing!
[2:05] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:09] <sjust> is it the low numbered sepia machines that have a second disk?
[2:18] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[2:22] <sagewk> sjust: i thought they all did.. but maybe not?
[2:24] <sjust> sagewk: I found some that do, but at least 90, 91, 92, and 96 only have sda
[2:26] <sagewk> sjust: crap
[2:26] <sagewk> hmm, can you do a quick count of how many have/don't have disks?
[2:26] <sjust> ok
[2:35] <bchrisman> gregaf: was wondering whether you put together a test for that locking condition or whether we should provide one?
[2:41] <sjust> 35 machines have /dev/sdb
[2:42] <sagewk> k, i'll talk to patrick about it
[2:42] <sagewk> did you by chance make a list of those that are up but don't?
[2:43] <sjust> I have all that are up but do, one sec
[2:48] <sjust> alright, have a list of machines without sdb
[2:49] <sagewk> jabber it to me?
[2:50] <sagewk> tnx
[3:02] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[3:15] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[3:20] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit ()
[3:35] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[4:09] * cmccabe (~cmccabe@69.170.166.146) has left #ceph
[4:18] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:20] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[4:20] * jojy (~jojyvargh@75-54-231-2.lightspeed.sntcca.sbcglobal.net) Quit ()
[4:30] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[4:38] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[4:46] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[4:53] * Dantman (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[5:15] * lxo (~aoliva@19NAADG1Y.tor-irc.dnsbl.oftc.net) has joined #ceph
[5:56] * lxo (~aoliva@19NAADG1Y.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[7:31] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[8:36] * Meths (rift@2.25.189.91) has joined #ceph
[8:38] * Meths_ (rift@2.25.189.91) Quit (Ping timeout: 480 seconds)
[8:38] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Read error: Connection reset by peer)
[8:41] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[9:19] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[10:03] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * pzb (~pzb@gw-ott1.byward.net) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * u3q (~ben@uranus.tspigot.net) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * ajm (adam@adam.gs) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * todin (tuxadero@kudu.in-berlin.de) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * rsharpe (~Adium@70-35-37-146.static.wiline.com) Quit (synthon.oftc.net charm.oftc.net)
[10:03] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) has joined #ceph
[10:03] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[10:03] * pzb (~pzb@gw-ott1.byward.net) has joined #ceph
[10:03] * u3q (~ben@uranus.tspigot.net) has joined #ceph
[10:03] * ajm (adam@adam.gs) has joined #ceph
[10:03] * rsharpe (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[10:03] * sage (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[10:03] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[10:21] * Juul (~Juul@port80.ds1-vo.adsl.cybercity.dk) has joined #ceph
[10:52] * gregorg (~Greg@78.155.152.6) has joined #ceph
[11:37] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[12:44] * Juul (~Juul@port80.ds1-vo.adsl.cybercity.dk) Quit (Ping timeout: 480 seconds)
[13:06] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[13:20] * yoshi (~yoshi@p10166-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[14:36] * lxo (~aoliva@659AAD0GV.tor-irc.dnsbl.oftc.net) has joined #ceph
[14:46] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[15:41] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:06] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) has joined #ceph
[16:08] <jmlowe> I've been running some vm's backed by files on a ceph fs, I'm beginning to suspect that I've found a bug, the filesystems in the vm's are losing consistency
[16:08] <jmlowe> any comments, tips, hints, suggestions?
[16:13] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:20] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[16:27] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) has joined #ceph
[16:48] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Remote host closed the connection)
[17:03] * huangjun (~root@221.234.37.227) has joined #ceph
[17:18] * huangjun (~root@221.234.37.227) Quit (Remote host closed the connection)
[17:22] * lxo (~aoliva@659AAD0GV.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[17:24] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[17:25] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) has joined #ceph
[17:30] * jmlowe1 (~Adium@140-182-209-230.dhcp-bl.indiana.edu) has joined #ceph
[17:30] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) Quit (Read error: Connection reset by peer)
[17:42] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:48] * The_Bishop (~bishop@port-92-206-251-64.dynamic.qsc.de) Quit (Quit: Who the hell is this Peer? When I catch him I'll reset his connection!)
[17:54] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:08] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) has joined #ceph
[18:22] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) has joined #ceph
[18:22] * jmlowe1 (~Adium@140-182-209-230.dhcp-bl.indiana.edu) Quit (Read error: Connection reset by peer)
[18:23] * jmlowe (~Adium@140-182-209-230.dhcp-bl.indiana.edu) Quit ()
[18:33] * adjohn (~adjohn@50-0-92-177.dsl.dynamic.sonic.net) Quit (Quit: adjohn)
[18:34] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[18:38] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[18:42] <slang> sagewk: I'm not seeing the inode in the dumpcache output
[18:42] <slang> (gdb) p **fhp
[18:42] <slang> $5 = {inode = 0xe00000030, pos = 461919, mds = 46997, mode = 256, append = 237, pos_locked = 3,
[18:42] <sagewk> slang: the ino numbers are in hex.. maybe that's it?
[18:43] <slang> I'm looking for 0xe00000030
[18:43] <sagewk> oh.. that looks wrong
[18:43] <sagewk> mds = 46997?
[18:43] <slang> oh this might be a different hang
[18:44] <slang> this is mds = 20378
[18:44] <slang> wait
[18:45] <slang> is that supposed to be a 0?
[18:45] <sagewk> i think that's supposed to be an mds number, yeah.
[18:45] <sagewk> i wonder if there is a ref count error and that is bad memory
[18:45] <sagewk> append is a bool, for instance
[18:45] <sagewk> as is pos_locked
[18:45] <slang> right ok
[18:45] <slang> I can try to run cfuse in valgrind
[18:46] <sagewk> yeah
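For readers following along: the fields slang printed correspond roughly to the client-side file handle structure, something like the sketch below (field names come from the gdb output; the exact types are assumptions, not quoted from Client.h). The point sagewk is making is that several of the printed values are impossible for their types:

    // approximate shape of the client Fh being inspected above (a sketch, not the real definition)
    struct Fh {
      Inode  *inode;      // 0xe00000030 is not a plausible heap pointer
      loff_t  pos;
      int     mds;        // 46997 is not a plausible mds rank
      int     mode;
      bool    append;     // printed as 237; a bool should only ever be 0 or 1
      bool    pos_locked; // printed as 3; likewise
    };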
[18:49] <slang> there isn't by chance a suppressions file for known/spurious errors from valgrind?
[18:49] <sagewk> there are no known ceph-related spurious errors
[18:49] <sagewk> usually they come from the environment (libs etc).
[18:50] <sagewk> one of my dev boxes is unusable with valgrind, the other runs clean... bleh
[18:50] <slang> oh dera
[18:50] <slang> dear
[18:50] <slang> ==10068== Process terminating with default action of signal 11 (SIGSEGV)
[18:50] <slang> ==10068== Access not within mapped region at address 0x7FFF7BDD9000
[18:50] <slang> ==10068== at 0x5BAC2E0: base::VDSOSupport::ElfMemImage::Init(void const*) (in /usr/lib/libtcmalloc.so.0.0.0)
[18:50] <slang> ==10068== by 0x5BAC932: base::VDSOSupport::Init() (in /usr/lib/libtcmalloc.so.0.0.0)
[18:50] <sagewk> fortunately ubuntu 10.10 (which our qa cluster runs) is valgrind clean
[18:50] <sagewk> yeah that's what i see on my box
[18:51] <sagewk> oh, wait.. that's tcmalloc.
[18:51] <sagewk> you can build without that, --without-tcmalloc
[18:51] <sagewk> we do that for running valgrind in qa :)
[18:51] <slang> and hope there aren't any tcmalloc bugs :-)
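A sketch of the rebuild-and-run sequence being described; only --without-tcmalloc comes from the discussion, the valgrind flags and mount arguments are illustrative:

    # rebuild without tcmalloc so valgrind's memcheck output stays usable
    ./configure --without-tcmalloc && make
    # then run the fuse client under valgrind (arguments are examples only)
    valgrind --tool=memcheck --leak-check=full ./cfuse -m <mon-host> /mnt/ceph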
[18:52] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:04] <gregaf> bchrisman: I did, but I'm having some unrelated troubles getting it going... today, I hope!
[19:05] <gregaf> jmlowe: if you can narrow it down at all we're interested in it! ...but you might want to try rbd instead ;)
[19:06] <bchrisman> gregaf: yeah.. my installation is now failing on something in regards to mkcephfs.. I'm going to track that down a bit
[19:07] <gregaf> bchrisman: I've been getting a lot of EINTR returns from fcntl on wait locking tests that naive implementations fail out on... have you seen that at all?
[19:08] <bchrisman> ahh no.. you have a test we can add to our automated process?
[19:08] <gregaf> there's a locktest.c in xfstests that I've been hacking up...
[19:09] <bchrisman> ahh okay.. we were thinking about including that in our automation already.
[19:11] * gregaf1 (~Adium@aon.hq.newdream.net) has joined #ceph
[19:11] * gregaf (~Adium@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[19:12] <sagewk> slang: which inode is blocked in that log fragment?
[19:13] <slang> uh
[19:13] <slang> I've already restarted cfuse
[19:13] <gregaf1> bchrisman: you can look at my repo at https://github.com/gregsfortytwo/xfstests-ceph
[19:13] <sagewk> no worries.
[19:13] <slang> sagewk: I think it was that one I pasted earlier
[19:13] <gregaf1> but I wouldn't recommend pointing anything at it directly; I've been rebasing and tossing out commits and stuff a lot lately
[19:14] <sagewk> slang: btw the logs are easier to read if you also include --debug-ms 1
[19:14] <slang> yeah
[19:14] <gregaf1> hacked in support for wait locks and I've got some for zero-length locks too
[19:14] <slang> I tried to add that but was too late
[19:15] <bchrisman> cool
[19:16] <slang> looks like it hung in valgrind, but I'm not getting any errors output from valgrind
[19:27] <slang> sagewk: I'm seeing the same thing with this new hang:
[19:27] <slang> (gdb) p **fhp
[19:27] <slang> $1 = {inode = 0xe00000030, pos = 400723, mds = 46997, mode = 256,
[19:27] <slang> append = 237, pos_locked = 3,
[19:32] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Remote host closed the connection)
[19:34] <slang> it looks like fhp just gets set at some point later
[19:36] <slang> looking at *req->inode is more productive
[19:40] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:42] <sagewk> slang: hrm, i'm not seeing any obvious way Fh could get corrupted like that.
[19:42] <slang> yeah I think it gets set later
[19:43] <slang> so it just uninitialized memory when I print it out
[19:43] <slang> I have the inode from a hang using *req->inode
[19:43] <sagewk> oh.. what function are you in?
[19:43] <slang> for **fhp I was in _open
[19:44] <sagewk> called by ll_open from fuse?
[19:44] <slang> I backed up to make_request and am using req->inode now
[19:44] <slang> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
[19:44] <slang> #1 0x00000000006a4ce1 in Cond::Wait (this=0x7fffa5d31460, mutex=...) at ../../src/common/Cond.h:46
[19:44] <slang> #2 0x000000000066f604 in Client::make_request (this=0x252a320, request=0x2af0770, uid=1005, gid=1006, ptarget=0x0, use_mds=-
[19:44] <slang> 1, pdirbl=0x0) at ../../src/client/Client.cc:1090
[19:44] <slang> #3 0x000000000068f5a2 in Client::_open (this=0x252a320, in=0x7fe0d44cf050, flags=557569, mode=0, fhp=0x7fffa5d31678, uid=100
[19:44] <slang> 5, gid=1006) at ../../src/client/Client.cc:4894
[19:44] <slang> #4 0x000000000069f3bb in Client::ll_open (this=0x252a320, vino=..., flags=557569, fhp=0x7fffa5d31678, uid=1005, gid=1006) at
[19:44] <slang> ../../src/client/Client.cc:6684
[19:44] <slang> #5 0x000000000066087a in ceph_ll_open (req=0x2aef580, ino=1099511674773, fi=0x7fffa5d316c0) at ../../src/client/fuse_ll.cc:3
[19:44] <slang> 21
[19:45] <sagewk> ok i see.
[19:45] <sagewk> will initialize that to NULL in fuse_ll.cc to avoid this confusion next time :)
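A minimal sketch of the kind of initialization fix being described, reconstructed from the backtrace above rather than quoted from fuse_ll.cc; the libfuse callback and Client::ll_open signatures match the trace, the rest (helper names, globals) is assumed:

    // in fuse_ll.cc's open handler: start with a NULL Fh so that, while the request
    // is still pending, gdb (or anything else) never sees stale stack garbage
    static void ceph_ll_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
    {
      const struct fuse_ctx *ctx = fuse_req_ctx(req);
      Fh *fh = NULL;  // previously left uninitialized until _open filled it in
      client->ll_open(vino(ino), fi->flags, &fh, ctx->uid, ctx->gid);
      // error handling omitted in this sketch
      fi->fh = (uint64_t)fh;
      fuse_reply_open(req, fi);
    }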
[19:45] <slang> so the inode appears to be 1000000b795
[19:45] <slang> sagewk: yeah sorry for the goose chase
[19:45] <slang> [inode 1000000b795 [2,head] file.ext auth v30708 s=0 n(v0 1=1+0) (iversion lock) | dirty waiter 0x252f160
[19:45] <sagewk> oh np. can you post the matching log?
[19:45] <slang> dumpcache outputs that
[19:46] <slang> sagewk: yeah one sec
[19:46] <sagewk> f 2
[19:46] <sagewk> p request->tid
[19:47] <slang> 15536
[19:49] <gregaf1> sagewk: do we have a gitbuilder listing somewhere?
[19:49] <gregaf1> I wasn't aware we had a no-atomic-ops one
[19:52] <Tv> gregaf1: http://ceph.newdream.net/git/?p=autobuild-ceph.git;a=blob;f=fabfile.py;hb=HEAD
[19:52] <Tv> hard to get more authoritative than that
[19:52] <gregaf1> ah, that's how we're setting them up?
[19:53] <gregaf1> I thought it was all manual
[19:53] <Tv> manual is for sports cars
[19:54] <sagewk> gregaf1: oh did i set that bug to resolved? i was going quickly and probably confused it with notcmalloc?
[19:54] <gregaf1> that's what I was wondering, yeah
[19:55] <sagewk> i guess the question is, do we care about that flavor?
[19:55] <slang> http://dl.dropbox.com/u/18702194/log
[19:55] <gregaf1> sagewk: well, if we don't care about it we should get rid of the spinlock atomics
[19:55] <slang> that's the log within the time period of the hang
[19:55] <gregaf1> but that'll bump out... some versions of arm
[19:56] <gregaf1> anybody running Debian packages
[19:56] <gregaf1> of which I believe there is at least one person who contacted me when we were having troubles with debian armel
[19:58] * adjohn (~adjohn@50.0.103.34) has joined #ceph
[19:58] * jmlowe (~Adium@140-182-133-137.dhcp-bl.indiana.edu) has joined #ceph
[19:59] <sagewk> slang: hmm, i don't see the reply for client4533:15535.. and 15536 never arrives. was that a typo?
[20:00] <slang> (gdb) f 2
[20:00] <slang> #2 0x000000000066f604 in Client::make_request (this=0x252a320, request=0x2af0770, uid=1005, gid=1006, ptarget=0x0, use_mds=-
[20:00] <slang> 1, pdirbl=0x0) at ../../src/client/Client.cc:1090
[20:00] <slang> (gdb) p request->tid
[20:00] <slang> $18 = 15536
[20:03] <gregaf1> slang: I'm late to this party... single-mds, right?
[20:03] <sagewk> slang: do you have more log preceding this? wondering where the wrlock on 1000000b795 came from (the w=1 part of (ifile excl w=1)
[20:03] <slang> sagewk: maybe I trimmed the log after the messages, let me push the entire log when debugging was enabled
[20:07] <sagewk> slang; k, let me know
[20:08] * aliguori (~anthony@32.97.110.59) has joined #ceph
[20:14] <slang> http://dl.dropbox.com/u/18702194/log2
[20:14] <slang> sagewk: that's about 20 secs before I saw the hang
[20:16] <jmlowe> arr, working on this campus would be great if only the students weren't here, they make all rf communication flaky, especially wifi
[20:17] <slang> maybe less...the hang occurred between 12:20:37 and 12:20:40
[20:17] <jmlowe> I had asked "I've been running some vm's backed by files on a ceph fs, I'm beginning to suspect that I've found a bug, the filesystems in the vm's are losing consistency, any comments, hints, tips, suggestions?"
[20:17] <jmlowe> but I wouldn't have seen any replies
[20:18] <gregaf1> jmlowe: jmlowe: if you can narrow it down at all we're interested in it! ...but you might want to try rbd instead ;)
[20:18] <gregaf1> there is somebody... somewhere... I forget who, trying to use Ceph-backed VM images for Xen
[20:18] <gregaf1> but they don't talk to us about it much
[20:19] <jmlowe> I abandoned rbd because it took so loooong to convert the images
[20:20] <gregaf1> what kind of consistency issues are you running into?
[20:20] <jmlowe> I did have some success in running off the rbd device in the host, ie plumbing the /dev/rbdN from the host machine directly into the vm as its block device
[20:21] <jmlowe> I'm not sure yet, and I haven't narrowed down the conditions under which it happens
[20:22] <slang> sagewk: I can just post the entire log for today -- sorry should have done this in the first place...
[20:22] <gregaf1> in that case I think all the advice I can give is to make sure your settings are all correct, ie that you haven't deliberately loosened consistency for performance somewhere :/
[20:22] <jmlowe> I've had 4 copies of the image file, 3 were fine, ran yum upgrade on all of them (slightly out of date), 3 rebooted fine the 4th went single user for a fsck
[20:23] * Juul (~Juul@port80.ds1-vo.adsl.cybercity.dk) has joined #ceph
[20:24] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[20:24] * jmlowe1 (~Adium@129-79-195-139.dhcp-bl.indiana.edu) has joined #ceph
[20:25] * jmlowe (~Adium@140-182-133-137.dhcp-bl.indiana.edu) Quit (Read error: Connection reset by peer)
[20:26] <sagewk> slang: yeah it looks like it came from even earlier than log2.. probably a leaked lock in the mds.
[20:27] <sagewk> slang: oh wait
[20:27] <sagewk> slang: nevermind
[20:29] <sagewk> slang: ok i see the problem
[20:36] <slang> sagewk: still need the log before log2?
[20:36] <slang> it's over 2 gigs and I'm having trouble uploading to dropbox
[20:36] <sagewk> nope
[20:36] <slang> cool
[20:38] <sagewk> slang: can you reproduce with this applied? http://fpaste.org/x6Vo/
[20:38] <sagewk> slang: 'waiting for pending truncate from 0 to 0 to complete on' is the badness, but its not clear how that can happen
[20:40] <slang> trying now
[20:40] <slang> sagewk: what debugging should I enable?
[20:40] <slang> mds 20 ms 1?
[20:54] <sagewk> slang: yeah
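Those levels, written out as a config sketch (the section placement is an assumption; the equivalent --debug-mds 20 --debug-ms 1 can also go on the command line):

    [mds]
        debug mds = 20
        debug ms = 1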
[20:59] <jmlowe1> so how solid is the clustered mds these days?
[21:00] <jmlowe1> are we talking water, pudding, or jello?
[21:01] <sagewk> pudding?
[21:01] <sagewk> it should be pretty stable in non-failure scenarios
[21:03] <jmlowe1> ok, last blog made some reference to the single mds getting pretty good, begging the question of the clustered mds
[21:04] * jmlowe1 (~Adium@129-79-195-139.dhcp-bl.indiana.edu) Quit (Quit: Leaving.)
[21:04] * jmlowe (~Adium@140-182-133-137.dhcp-bl.indiana.edu) has joined #ceph
[21:12] * verwilst (~verwilst@d51A5B24D.access.telenet.be) has joined #ceph
[21:12] * jmlowe (~Adium@140-182-133-137.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[21:21] * jmlowe (~Adium@140-182-208-30.dhcp-bl.indiana.edu) has joined #ceph
[21:24] * lxo (~aoliva@9KCAAATEC.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:25] <slang> sagewk: I can't seem to reproduce it now with the patch in place
[21:28] <sagewk> hmm.. maybe its the logging level? try turning debug mds down?
[21:29] <slang> down to...
[21:31] <sagewk> 10?
[21:31] <sagewk> or off? just to see
[21:37] * Juul (~Juul@port80.ds1-vo.adsl.cybercity.dk) Quit (Quit: Leaving)
[21:37] <slang> still nothing
[21:37] <slang> could it be tcmalloc?
[21:38] <slang> I didn't recompile with tcmalloc enabled after disabling it
[21:38] <sagewk> hmm.. unlikely, but may as well try
[21:39] <slang> oh wait
[21:42] <slang> looks like its hung now
[21:42] <slang> no assert on the mds though
[21:43] <sagewk> logs?
[21:43] <bchrisman> curious... what is supposed to create /var/run/ceph... mkcephfs?
[21:45] <sagewk> your .deb/rpm, or make install, o
[21:45] <bchrisman> that's what I was figuring..
[21:45] <bchrisman> it's not in the spec file by default.
[21:45] <bchrisman> not sure why I haven't seen this issue before.
[21:45] <sagewk> ah, yeah.. send a patch! :)
[21:46] <bchrisman> okie
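A sketch of the sort of one-line spec-file change being discussed, so the package owns and creates the runtime directory at install time; this is illustrative and may differ from the patch bchrisman actually sent:

    # in ceph.spec's %files section (illustrative)
    %dir /var/run/ceph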
[21:46] <slang> debugging wasn't enabled
[21:58] <sagewk> slang: hrm. are you trying again with debugging enabled?
[21:58] <slang> sagewk: I was trying to track down the inode in the log for the moment
[22:00] <sagewk> slang: if you get a chance this patch might help catch it http://fpaste.org/uN81/
[22:00] <slang> (gdb) printf "0x%llx\n", request->inode.ino.val
[22:00] <slang> 0x1000000b681
[22:00] <slang> (gdb) p request->tid
[22:00] <slang> $10 = 124925
[22:00] <slang> (gdb)
[22:00] <slang> ok
[22:00] <slang> I have to disappear for a few hours, but will try that soon
[22:01] <sagewk> slang: ok
[22:15] * verwilst (~verwilst@d51A5B24D.access.telenet.be) Quit (Quit: Ex-Chat)
[22:29] <Tv> new docs content pushed: http://ceph.newdream.net/docs/latest/architecture/
[22:41] * alekibango (~alekibang@ip-94-113-34-154.net.upcbroadband.cz) has joined #ceph
[22:42] <alekibango> hi. is there any time estimate for ceph 1.0 release?
[22:44] <gregaf1> alekibango: not until it's done! :)
[22:44] <alekibango> i can't wait to put it in use... but i am afraid to take a pre-1.0 version
[22:45] <gregaf1> depends on what parts you're looking to use
[22:45] <alekibango> gregaf1: so, more than month? few months? year?
[22:45] <alekibango> i would like to have reliable storage for qemu/kvm virtual machines
[22:45] <alekibango> and generic reliable file system storage for samba server
[22:45] <gregaf1> rbd (which doesn't use the posix filesystem, but is fully redundant/replicated/etc) is much more stable than the full FS is
[22:46] <gregaf1> and is going to be getting more work within the next couple of months that should put it at a 1.0 sort of place
[22:46] <gregaf1> full Ceph is going to be longer
[22:46] <alekibango> would you use rbd for production servers?
[22:47] * jmlowe (~Adium@140-182-208-30.dhcp-bl.indiana.edu) Quit (Ping timeout: 480 seconds)
[22:47] <gregaf1> hmm, depends on the kind of production, I guess?
[22:47] <gregaf1> maybe somebody who isn't a Ceph dev would be better to ask about that ;)
[22:48] <alekibango> heh
[22:49] <alekibango> i mean, i want paying customers to use virtual servers -- having storage in some reliable system...
[22:49] <alekibango> i like sheepdog, but also that one is not very stable
[22:49] <alekibango> yet
[22:49] <alekibango> maybe nbd/xdb/iscsi - but i would like to have it easy to scale up...
[22:49] <gregaf1> yeah
[22:50] <alekibango> xnbd*
[22:50] <gregaf1> well, rbd mostly just doesn't have enough QA on it right now
[22:50] <gregaf1> it will get full-scale QA and fixes before the fs gets full QA and fixes (because it's a lot simpler!)
[22:50] <gregaf1> there aren't any outstanding issues that I'm aware of on it, though, so if you wanted to start testing... :p
[22:50] <alekibango> i would suggest automated testing...
[22:52] <gregaf1> well, yes, but it takes some time to build the infrastructure for that and to get the tests
[22:52] <alekibango> ... i would like to start using it :) not testing
[22:52] <alekibango> gregaf1: it will be worth the job, believe me
[22:52] <gregaf1> sorry, but I'm afraid we aren't going to have any hard timeframes for you :(
[22:52] <alekibango> ok, thanks anyway for not promising what you cannot deliver
[22:53] <alekibango> that hurts more
[22:53] <alekibango> gregaf1: and maybe look at agile development, test driven development - it might help a lot
[22:54] <alekibango> extreme programming etc
[22:54] <alekibango> i really started to like test driven development... i am doing it and it helps tons
[22:55] <gregaf1> yeah, it's a little harder for monolithic systems but we're definitely doing agile
[22:55] <alekibango> ic
[22:56] <alekibango> i will keep an eye on ceph... like i have for a year now... please stop making new features, just make a stable release :))
[22:56] <alekibango> i think it has bright future...
[22:57] <gregaf1> we hope so!
[23:05] * lxo (~aoliva@9KCAAATEC.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:07] * lxo (~aoliva@1RDAAAMVE.tor-irc.dnsbl.oftc.net) has joined #ceph
[23:36] <damoxc> sagewk: are you about?
[23:36] <sagewk> yep
[23:37] <damoxc> regarding #1486
[23:37] <damoxc> you say it's due to 0 sized pg files?
[23:39] <sagewk> yeah
[23:39] <sagewk> 2011-09-01 10:54:05.538315 7f081f4bc720 filestore(/srv/osd13) read meta/pginfo_1.78/0 0~0
[23:39] <sagewk> that file should never by 0 bytes
[23:39] <sagewk> be
[23:40] <damoxc> would removing it be enough to get it starting again?
[23:40] <sagewk> sjust is going to throw something together to try to reproduce it.
[23:40] <damoxc> ah cool
[23:40] <sagewk> not currently. you can remove the pg directory tho (assuming the same pginfo file isn't damaged on other nodes) and let the cluster recover
[23:43] <damoxc> okay
[23:43] <damoxc> http://dpaste.com/607130/
[23:43] <damoxc> is that bad?
[23:43] <damoxc> that's run at the root of the osd dir
[23:47] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[23:52] * lxo (~aoliva@1RDAAAMVE.tor-irc.dnsbl.oftc.net) Quit (Quit: later)
[23:57] <sagewk> only necessarily bad if they start with pginfo
[23:57] <sagewk> it's just meta/pginfo* that specifically triggers the bug
[23:57] <sagewk> it may be that the same underlying problem is causing other corruptions, but not sure what those look like yet
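A hedged sketch of how one might scan an osd data directory for the zero-byte pginfo objects described above, before deciding which pg directories to remove; run from the root of the osd dir as damoxc did, and note that per sagewk only names starting with pginfo necessarily indicate this bug:

    # list empty files whose names start with pginfo (layout details vary by version)
    find . -type f -size 0 -name 'pginfo*' -print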
[23:58] * DanielFriesen (~dantman@S0106001731dfdb56.vs.shawcable.net) Quit (Read error: Connection reset by peer)
[23:58] <damoxc> hmm 126 corrupt ones on one osd
[23:59] <damoxc> 16 on another
[23:59] <damoxc> and 6 on another

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.