#ceph IRC Log


IRC Log for 2011-03-19

Timestamps are in GMT/BST.

[0:11] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[0:14] <lxo> gregaf: mkcephfs; mount .../ceph; rsync -aHSv --append-verify /lots/of/stuff .../ceph/
[0:15] <lxo> three-node cluster reduced to two nodes for the time being
[0:15] <lxo> btrfs froze every now and again, particularly on the most heavily-accessed filesystems, rebooted as possible, often resorting to SysRq-U+S+B
[0:16] <lxo> host running rsync is one of the ceph nodes
[0:16] <gregaf> lxo: you did this for both your initial syncs and subsequent syncs?
[0:16] <lxo> so some reboots took out ceph clients as well
[0:16] <lxo> after some iterations, I noticed it was syncing slowly, and it appeared to be directories it had synced before
[0:17] <gregaf> ahhh, you lost clients and servers at the same time?
[0:17] <gregaf> I think that would do it
[0:18] <gregaf> the client would have created the inode (which is an MDS op, and thus properly journaled [making it durable])
[0:18] <lxo> investigating, I found out tons of zero-sized files, among others that were ok. the zero-sized files were timestamped 1970-01-01 (epoch time)
[0:18] <gregaf> and then started writing to it
[0:18] <gregaf> but the client would have buffered the writes to those files
[0:18] <gregaf> so if you lost the client then the data would have been lost but the inode's creation would be a durable event that lived on
[0:19] <gregaf> I do not know if that is allowed behavior (or behavior we want to avoid if it is allowed) or not
[0:19] <lxo> there was a large gap between some of the zeroed files and those that were ok, in directories thousands of files ahead. although it could be that they got written ahead of time by the client, or that multiple rsync attempts left multiple holes (hmm, no, the latter doesn't make much sense without the former)
[0:20] <lxo> isn't it odd that the file doesn't get a current timestamp, though?
[0:20] <gregaf> I don't recall off-hand how the timestamps are initialized
[0:20] <lxo> (it does help find them, though ;-)
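
The epoch timestamp on the zero-sized files makes them easy to pick out afterwards. A minimal sketch of one way to list them, assuming GNU find (for -newermt); the function name find_epoch_empties and the mount point in the usage line are hypothetical, not something lxo posted:

```shell
#!/bin/sh
# Hypothetical helper: list regular files that are both empty and dated
# before 1970-01-02, i.e. the zero-sized, epoch-stamped files described above.
# Assumes GNU find; -newermt is not available in every find implementation.
find_epoch_empties() {
    find "$1" -type f -size 0 ! -newermt '1970-01-02 00:00:00' -print
}
```

Usage would look like `find_epoch_empties /mnt/ceph`, where /mnt/ceph stands in for wherever the ceph filesystem is mounted.
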
[0:20] <gregaf> but you're saying that there are full files much later than the zeroed files?
[0:20] <lxo> yep
[0:20] <lxo> *much* later
[0:20] <gregaf> and the rsync was done from a single client?
[0:20] <lxo> yup
[0:21] <gregaf> hmm, probably not then
[0:21] <gregaf> although I don't know too well how rsync works so maybe the files are in order of rsync's internal ordering somehow?
[0:21] <lxo> well, I'll keep my eyes open after further reboots and see if I can figure out more details, now that I have some insights into how the files may come about
[0:22] <gregaf> there may be something else going on, but that's the only thing that occurs to me right now
[0:22] <Tv> gregaf: i think rsync goes in readdir order, recursing as it goes along
[0:22] <Tv> oh except.. the existence check might sort the names
[0:22] <gregaf> yeah, so that wouldn't be ordered by name, would it?
[0:23] <gregaf> heh
[0:23] <Tv> it basically compares file listings on both sides, and to do that it probably sorts
[0:23] <lxo> rsync writer has two processes: one looks for files, creates dirs and links ahead of time, tells the other process to deal with files, then when files are done either of them adjusts containing directories, not sure which
[0:23] <gregaf> might it not do a sort when it knows it's doing the initial sync?
[0:23] <lxo> it doesn't matter much if it sorts, for the directories in question are already sorted anyway
[0:24] <lxo> one of the affected trees had one file per date in directories named after IRC channels
[0:25] <lxo> several years worth of such logs, with tens of directories per server, a handful of IRC servers
[0:26] <lxo> some tips I thought I'd share to speed up an initial rsync:
[0:26] <lxo> 1. use --append-verify, it cuts out a lot of the metadata log size by avoiding renames
[0:27] <gregaf> lxo: I admit I'm hardly knowledgeable about this level of working of the vfs or individual filesystems, but I don't think that creation order implies the ordering in the directory
[0:27] <lxo> 2. prime the directories with (cd /source; find -type d ! -type l -print0) | (cd /target; xargs -0 mkdir) so that rsync doesn't take forever to getdents of a just-created directory
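
Tip 2 can be wrapped in a small function. This is a sketch under assumptions rather than lxo's exact command: prime_dirs is a hypothetical name, mkdir -p is swapped in so that reruns and the leading "." entry do not produce errors, and the "! -type l" test is dropped because find without -L never reports a symlink as -type d.

```shell
#!/bin/sh
# Hypothetical helper: recreate the directory tree of $1 under $2 before the
# first rsync, so rsync does not have to list freshly created directories.
# Assumes find and xargs support -print0 / -0 (GNU and BSD versions do).
prime_dirs() {
    src=$1
    dst=$2
    # Paths are emitted relative to $src and recreated relative to $dst.
    (cd "$src" && find . -type d -print0) | (cd "$dst" && xargs -0 mkdir -p)
}
```

For example, `prime_dirs /source /target` mirrors lxo's paths; the && after each cd keeps find or mkdir from running in the wrong directory if a cd fails.
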
[0:27] <gregaf> Sage is out right now but I'll ask him about this scenario when I see him next
[0:27] <Tv> gregaf: directory ordering is pretty much undefined in general
[0:27] <lxo> it doesn't, all I'm saying is that it is the case. it might be that earlier rsync made it that way, but that's the way it is now
[0:28] <Tv> gregaf: ext3/4 will be bucketed by hash value etc; ceph will be frag-by-frag with each frag sorted
[0:29] <Tv> gregaf: lxo's case sounds like repeat crashes/restarts, both in btrfs and in ceph, lead to some corruption.. if there was a ceph fsck, that might be useful for him
[0:29] <lxo> 3. if you have to restart rsync after restarting the mds, it helps to bring dirs into the mds in advance, especially if they haven't been committed from the mds journal yet: for d in /target/*; do find $d > /dev/null & done, and recurse if useful
[0:29] <gregaf> lxo: wait, now I'm not sure what you're saying -- I thought you were arguing that the directory was already ordered so if the issue were what I described above (with lost client buffered data) then the zero-length files would all be at the end of the directory (as returned by ls)
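
The loop in tip 3 can be written out as a function that also waits for the background walks to finish. A minimal sketch, with warm_dirs as a hypothetical name; it only reads directory entries and discards the output, on the assumption (from lxo's description) that the traversal itself is what brings the directories into the mds:

```shell
#!/bin/sh
# Hypothetical helper: walk every top-level directory under $1 in parallel,
# discarding the listing, then wait for all the walks to complete.
warm_dirs() {
    for d in "$1"/*; do
        # Skip plain files and the unexpanded glob of an empty directory.
        [ -d "$d" ] || continue
        find "$d" > /dev/null 2>&1 &
    done
    wait
}
```

Calling `warm_dirs /target` before restarting rsync matches the one-liner above; to recurse as lxo suggests, it could be called again on the larger subdirectories.
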
[0:30] <lxo> I'm saying it doesn't matter. there were zero-sized-epoch-dated files in multiple target directories that also contained non-zero files
[0:30] <lxo> so however rsync traverses any single directory wouldn't have run into this
[0:31] <lxo> still, watching rsync with strace I see one process lstat files in order, and the other open&write in order too
[0:31] <gregaf> ah, gotcha
[0:32] <lxo> now, I'm running into so many btrfs issues that I'm wondering if I should switch back to ext4
[0:33] <lxo> it might also help pinpoint this particular issue (unless it disappears ;-)
[0:34] <lxo> although I'm already addicted to btrfs's multiple mount points per filesystem by now ;-) easy to emulate with bind mounts, but not so much when one of the filesystems is your root ;-)
[0:34] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[0:41] <lxo> 4. sometimes a write() hangs. I thought I had to restart the rsync to get past it, but I found out just stat()ing that file from some other process will get the write() moving again
[0:42] <lxo> ceph 0.25.1 on 2.6.38, rsyncing on one of the mon/mds/osd nodes. is this the deadlock you mentioned to me the other day?
[0:44] <gregaf> lxo: I don't think that's the deadlock, no -- can you give us a scenario that will reproduce your write() hanging?
[0:44] <lxo> no, these things seem to be totally random. it happened just a handful of times since I started playing with ceph a few weeks ago
[0:46] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:46] <lxo> another thing I noticed, particularly after I switched to 0.25.1, was that mdses crash throwing an exception from Objecter::C_Stat::finish(int). it calls verbose terminate, and then the fatal signal handler catches it and crashes within memcpy
[0:47] <lxo> sadly, I haven't kept debug info (stupid accidental removal of the build tree)
[0:47] <gregaf> lxo: oh, Sage says that hanging write()s might be an MDS bug, so if you see it again and have MDS logging on then keep a copy of the log for us, please :)
[0:49] <gregaf> lxo: don't think we've seen the C_Stat crash -- let us know if you collect any logs/core files from that
[0:49] <lxo> I tried turning on verbose logging, but it filled up the 10GB I had free on my / filesystem, so I kind of got scared of bumping it up again ;-)
[0:50] <lxo> I got a core file. let me rebuild (and hopefully get the same binary) to dig up more info
[0:50] <lxo> it might be correlated with the write hang: it looks like when I got the write going again, the mds died
[1:00] <lxo> 5. if your machine “hangs” so that not even ps will complete, strace ps and kill the process for which ps hangs opening /proc/<pid>/cmdline. this will restore some sanity without requiring a reboot.
[1:03] <lxo> (then restart the osds :-)
[1:05] <lxo> @&#$@&@&#$ incomplete debug info
[1:10] <lxo> so, it looks like we're finishing an Objecter::C_Stat ack, but bl has length zero (?!?), and advance throws while trying to decode the utime_t m
[1:11] <lxo> more specifically, the tv_nsec member thereof
[1:13] <lxo> oh, well, after the recovery tip #5 above it won't last long... time to reboot my gateway. biab
[1:14] <gregaf> lxo: we have seen some issues with the OSDs which we believe are the same exception
[1:14] <gregaf> not sure where they came from but sjust is looking into it (and he's the only one who changed bufferlist recently so they're probably his fault :P)
[1:24] <lxo> heh
[1:30] <Tv> whee 10 tests of 6 machines each scheduled
[1:35] * cmccabe (~cmccabe@ has left #ceph
[1:38] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[1:45] <lxo> ugh, now 6 pgs are in crashed (4 +replay and 2 +peering) and it seems they won't get out of it. help?
[1:46] <lxo> I've already restarted all 5 osds, one at a time, to no avail
[2:02] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[2:11] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[2:22] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[2:29] * rajeshr (~Adium@ Quit (Quit: Leaving.)
[2:40] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Quit: Leaving)
[2:56] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) has joined #ceph
[3:06] * lxo (~aoliva@ Quit (resistance.oftc.net weber.oftc.net)
[3:12] * lidongyang_ (~lidongyan@ has joined #ceph
[3:14] * cclien_ (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) has joined #ceph
[3:14] * johnl_ (~johnl@johnl.ipq.co) has joined #ceph
[3:14] * promethe1nfire (~mthode@mx1.mthode.org) has joined #ceph
[3:14] * DanielFriesen (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (synthon.oftc.net larich.oftc.net)
[3:14] * prometheanfire (~mthode@mx1.mthode.org) Quit (synthon.oftc.net larich.oftc.net)
[3:14] * johnl (~johnl@johnl.ipq.co) Quit (synthon.oftc.net larich.oftc.net)
[3:14] * lidongyang (~lidongyan@ Quit (synthon.oftc.net larich.oftc.net)
[3:14] * cclien (~cclien@ec2-175-41-146-71.ap-southeast-1.compute.amazonaws.com) Quit (synthon.oftc.net larich.oftc.net)
[3:14] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (synthon.oftc.net larich.oftc.net)
[3:18] * lxo (~aoliva@ has joined #ceph
[3:19] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[3:25] * DanielFriesen (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[3:25] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[3:29] * greglap (~Adium@ has joined #ceph
[3:58] * greglap (~Adium@ Quit (Quit: Leaving.)
[3:58] * MarkN (~nathan@ has joined #ceph
[3:59] * MarkN (~nathan@ has left #ceph
[4:21] <neurodrone> I had a small question. I have just completed starting the entire ceph system (for the first time :)) but when I do a "ceph -w" or a "ceph -s" I am bombarded by the errors which look like these: http://pastebin.com/DCqv3x2s . Any idea on what might be going wrong? Do I need to configure anything specifically to take care of the "first faults"?
[4:24] <neurodrone> I think I might need to add some things within init-ceph.in to make it run properly. Any idea what the variables "bindir" and "libdir" should contain? I have my "ceph.conf" in "/etc/ceph/conf" ..do I need to add the "/etc/conf" path somewhere in that file?
[4:25] <neurodrone> "/etc/ceph" I mean*
[5:16] * promethe1nfire is now known as prometheanfire
[5:36] * DanielFriesen is now known as Dantmsn
[5:36] * Dantmsn is now known as Dantman
[5:37] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Quit: http://daniel.friesen.name or ELSE!)
[5:37] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[6:54] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[7:16] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) Quit (Quit: Leaving.)
[7:47] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Ping timeout: 480 seconds)
[7:58] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[9:17] * monrad (~mmk@domitian.tdx.dk) Quit (Quit: bla)
[9:18] * monrad-51468 (~mmk@domitian.tdx.dk) has joined #ceph
[10:24] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[10:46] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[10:46] * lxo (~aoliva@ has joined #ceph
[13:54] * prometheanfire (~mthode@mx1.mthode.org) Quit (Quit: leaving)
[14:19] * Meths_ (rift@ has joined #ceph
[14:25] * allsystemsarego (~allsystem@ has joined #ceph
[14:26] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[14:28] * hijacker (~hijacker@ has joined #ceph
[14:28] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[14:28] * Meths_ is now known as Meths
[14:38] <darkfaded> one blog reader gave me that link http://orionvm.com.au/blog/Cloud-Performance-It-s-all-about-Storage/ (it's from orionvm they currently use gluster but one of them was in here last year because they just need more scalability)
[14:39] <darkfaded> i think they have the fastest "cloud thing" on earth now
[14:39] <darkfaded> but also spent literally many months on glusterfs tuning
[14:46] <Meths> They need to tune their rss feed to be valid to thunderbird.
[14:47] <darkfaded> maybe THATs why i didnt have it in my google reader
[14:47] <darkfaded> but i had found the benchmarks a few days before the post hehe
[15:25] * hijacker (~hijacker@ has joined #ceph
[15:26] * hijacker__ (~hijacker@ has joined #ceph
[15:26] * hijacker__ (~hijacker@ Quit (Remote host closed the connection)
[15:26] * hijacker (~hijacker@ Quit ()
[16:21] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[17:57] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) has joined #ceph
[18:45] * tuhl (~tuhl@p4FFB0A43.dip.t-dialin.net) has joined #ceph
[18:48] * lxo (~aoliva@ Quit (Ping timeout: 480 seconds)
[20:22] * rajeshr (~Adium@99-7-122-114.lightspeed.brbnca.sbcglobal.net) Quit (Quit: Leaving.)
[20:44] * lxo (~aoliva@ has joined #ceph
[20:57] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) has joined #ceph
[20:57] * joshd (~jdurgin@adsl-75-28-69-238.dsl.irvnca.sbcglobal.net) Quit ()
[22:10] * tuhl (~tuhl@p4FFB0A43.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[23:10] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.