#ceph IRC Log


IRC Log for 2011-03-25

Timestamps are in GMT/BST.

[0:01] <gregaf> hi johnl
[0:02] <johnl> I did a talk on ceph today at a conference
[0:02] <johnl> lots of people excited about it :)
[0:02] <gregaf> yay
[0:03] <johnl> I sold loads of copies of the source code anyway
[0:03] <johnl> I'm pretty rich now
[0:04] <gregaf> haha
[0:04] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[0:07] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) has joined #ceph
[0:07] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[0:14] * bchrisman (~Adium@sjs-cc-wifi-1-1-lc-int.sjsu.edu) Quit (Quit: Leaving.)
[0:27] <Tv> so i can reliably crash kclient in uml, but it didn't crash on sepia just now
[0:28] <Tv> i'm still not convinced whether it's about uml or not
[0:28] <gregaf> what's the trace look like?
[0:28] <Tv> tends to vary a bit it seems, i'll pastebin this round
[0:29] <Tv> http://pastebin.com/jT3PwiSK
[0:29] <Tv> it always seems to have sig handlers somewhere, which makes me think uml might be at fault too
[0:29] * lxo (~aoliva@ Quit (Quit: later)
[0:31] <gregaf> interesting
[0:31] <gregaf> that's not one I recognize, though
[0:31] <Tv> maybe nobody bothered to stress mount/umount before
[0:31] <Tv> it's a pretty reliable crash on uml
[0:32] <Tv> and a mount, umount, rmmod loop on sepia makes mount fail with EIO
[0:33] <Tv> which is not quite the same
[0:33] <Tv> but it's very different hardware too
[0:33] <gregaf> uml does have issues from time to time, for varying reasons
[0:33] <gregaf> so I suspect it is a uml issue but I wouldn't declare that until knowing where it's coming from :/
[0:37] * lxo (~aoliva@ has joined #ceph
[0:43] <Tv> gregaf: the reason i'm hunting it so intently is that the symptoms are somewhat similar to things i can trigger via other means on sepia
[0:43] <Tv> this would be an easy trigger to debug
[0:44] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[0:44] <sagewk> tv: looks like a problem with the msgr teardown.
[0:44] <Tv> sagewk: yeah somewhere in there is what i dug out with the dout's
[0:44] <sagewk> tv: do you need rmmod in the loop or is mount/umount sufficient?
[0:45] <Tv> sagewk: uml crashes with just mount/umount, sepia doesn't crash kernel either way but mount gets -EIO
[0:45] <Tv> i mean, mount gets -EIO if rmmod is in the loop
[0:45] <Tv> without rmmod, sepia just won't fail
[0:45] <sagewk> k, i'll start with mount/umount.
[0:46] <Tv> sagewk: i got on this path because *i* had a bug in the mount/umount path.. except it kept crashing after i removed my changes..
[0:49] <Tv> sagewk: here's the debug output from two earlier runs: http://pastebin.com/Xwg8baC4
[0:49] * Administrator__ (~samsung@ has joined #ceph
[0:49] <Tv> those look like a double-free maybe
[0:50] <sagewk> yeah
[0:53] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[1:09] <cmccabe1> I'm getting some unit test failures here on head-of-line
[1:09] <cmccabe1> test/crypto.cc:176: Failure
[1:09] <cmccabe1> Expected: (duration_ns) < (1000000000u), actual: 1154259814 vs 1000000000
[1:11] <cmccabe1> it really doesn't seem like we should have stuff like this in the code
[1:11] <cmccabe1> I mean unit tests should not fail when you run them on a low-end machine
[1:14] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[1:55] * cmccabe1 (~cmccabe@ has left #ceph
[2:25] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:09] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) has joined #ceph
[4:13] * votz (~votz@dhcp0020.grt.resnet.group.upenn.edu) Quit (Quit: Leaving)
[7:45] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) Quit (Quit: neurodrone)
[7:45] * allsystemsarego (~allsystem@ has joined #ceph
[10:02] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[10:03] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Operation timed out)
[10:04] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[10:08] * sum-it (~sum-it@ has joined #ceph
[10:08] * sum-it (~sum-it@ has left #ceph
[10:11] * Plnt (~someone@rhea.pwn.cz) Quit (Remote host closed the connection)
[11:34] * Plnt (~someone@rhea.pwn.cz) has joined #ceph
[12:38] * Administrator__ is now known as sankung
[12:50] * sankung (~samsung@ Quit (Ping timeout: 480 seconds)
[13:38] * ghaskins (~ghaskins@66-189-113-47.dhcp.oxfr.ma.charter.com) has left #ceph
[16:13] * greglap (~Adium@cpe-76-170-84-245.socal.res.rr.com) Quit (Quit: Leaving.)
[16:36] * neurodrone (~neurodron@cpe-76-180-162-12.buffalo.res.rr.com) has joined #ceph
[16:48] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:17] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:38] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[17:51] * cmccabe1 (~cmccabe@ has joined #ceph
[17:54] <sage> still waiting for someone to show up at the house... may be a bit late :(
[18:10] * joshd1 (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:28] * Guest938 (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[18:37] * bbigras (quasselcor@bas11-montreal02-1128535815.dsl.bell.ca) has joined #ceph
[18:37] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:37] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit ()
[18:37] * bbigras is now known as Guest123
[18:39] <bchrisman> gregaf: Posted additional debug output for that cfuse issue on the bug… I've still got the logfile hanging around for a bit.
[18:41] <gregaf> oh yeah, I'll take a look
[18:47] <gregaf> bchrisman: I think I'm going to need the whole client log
[18:47] <gregaf> to see when the tid acks are coming back and getting set
[18:50] <bchrisman> gregaf: hmm… okay.. I'll either need to look through it on your behalf or get you access to that… cuz it's huge..
[18:50] <gregaf> bchrisman: they usually compress pretty well if you have a place to host it?
[18:51] <bchrisman> 37G compressed is probably still too big..
[18:51] <bchrisman> I can compress though and see.
[18:52] <bchrisman> tid is transaction id.. do we know which one we're looking for?
[18:52] <gregaf> well, the incoming tid is 24067
[18:52] <gregaf> "2011-03-24 20:30:19.744085 7f5fe9c4c710 client4156.objectcacher bh_write_ack 10000005dcb.00000009/head tid 24067 303104~8192"
[18:53] <gregaf> and for some reason the bufferhead thinks that it's already gotten an ack for a later tid
[18:53] <bchrisman> http://pastebin.com/mEzTgCcX
[18:54] <gregaf> okay, those are definitely later tids coming in first
[18:54] <bchrisman> I'm wondering how far back I can go with the logs to send ya...
[18:55] <bchrisman> I can look for that original tid 24067.. will it be referenced far upside?
[18:56] <gregaf> I don't remember how this works offhand
[18:56] <gregaf> let me go over it a bit more and I'll get back to you
[18:56] <bchrisman> okay… I'm looking at it a bit too.
[18:57] <bchrisman> This looks like it's supposed to be a healthy transaction: http://pastebin.com/v9RdDDGT
[18:58] <bchrisman> If that's healthy, I can grep through and try to follow the 24067
[18:58] <bchrisman> at least figure out how much of the log is between when that transaction starts and the end of the file.
[18:58] <bchrisman> might still not be enough to debug.
[19:01] <gregaf> that's probably all we'd need
[19:02] <bchrisman> tail -200000 didn't pick up the whole transaction.. so I'm doing a full log scan.. will take a couple minutes… then I'll pull that section, compress it, and see if it's small enough to send across.
[19:08] <bchrisman> gregaf: here's just that transaction: http://pastebin.com/mU25BR63
[19:09] <bchrisman> however, it starts at an 8-digit line no in the log file, which is 9-digit lines long… so probably 4GB+...
[19:11] <bchrisman> err, I mean 30GB+
[19:12] <bchrisman> interesting thing is the transaction plays out over 1.5hr+
[19:13] <bchrisman> err just over 2hr .. more accurately.. but long long time
[19:33] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:36] <bchrisman> curious about what 'kickbacks' and 'recalcs' are… is this something happening during some sort of failure/rebuild?
[19:37] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:37] <gregaf> bchrisman: actually you're getting two separate tids there
[19:37] <gregaf> the first one is the posix client tid for MDS requests
[19:37] <gregaf> we're only interested in the stuff after "op_submit oid 10000005dcb.00000009@0 [write 303104~8192] tid 24067 osd1"
[19:38] <bchrisman> ahh separate sequences
[19:38] <gregaf> yeah
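
Note: the filtering gregaf describes above (keep only what follows the `op_submit` line for tid 24067) could be sketched roughly as below. This is a minimal illustration, not a Ceph tool: the function name `lines_from_marker` and the file name in the usage note are hypothetical, and the sample lines are made up, loosely modeled on the line quoted in the log. It streams the input, so it should also work on a log too large to load into memory.

```python
from typing import Iterable, Iterator


def lines_from_marker(lines: Iterable[str], marker: str) -> Iterator[str]:
    """Yield every line starting at the first one that contains `marker`.

    Lines before the first match are skipped; nothing is yielded if the
    marker never appears.
    """
    found = False
    for line in lines:
        if not found and marker in line:
            found = True
        if found:
            yield line


# Made-up sample lines for illustration only.
sample = [
    "client request tid 24067 (MDS request, a separate tid sequence)",
    "op_submit oid 10000005dcb.00000009@0 [write 303104~8192] tid 24067 osd1",
    "later line about the same write",
]

# Including the trailing " osd1" avoids matching the MDS request tid
# and avoids matching longer numbers that merely start with 24067.
for line in lines_from_marker(sample, "tid 24067 osd1"):
    print(line)
```

On a real log, the same function can be fed an open file object, for example `with open("client.log", errors="replace") as log:` followed by a loop over `lines_from_marker(log, "tid 24067 osd1")`; the path here is an assumption.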
[19:39] <gregaf> recalc_op_target determines the target -- it's actually called on op creation, not just when it thinks the target has changed
[19:39] <gregaf> kickback iirc is generally used to start up blocked functions which are waiting on an async op
[19:39] * morse (~morse@supercomputing.univpm.it) Quit (Read error: Connection reset by peer)
[19:40] <bchrisman> so 'recalc' is called that because we're recalculating something that's been already calculated elsewhere?
[19:40] <gregaf> the objectcacher tid records go off the objecter tid, btw
[19:40] <bchrisman> okay.. I'll get that context from that point you mention on.
[19:40] <gregaf> bchrisman: well actually it's just the function for calculating op targets, I think
[19:40] <bchrisman> (the op_submit)
[19:41] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:42] <gregaf> recalc_op_target calculates the OSD responsible for a given operation
[19:42] <gregaf> I think the name is a legacy one from when we implemented target calculation in two places
[19:42] <gregaf> but at this point it's just the function for calculating targets
[19:42] <gregaf> and it's capable of recalculating if we think the target has changed after initial submission
[19:42] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:42] <bchrisman> cool
[19:43] <gregaf> if you look at op_submit it calls recalc_op_target (line 490 in my checkout)
[19:43] <gregaf> unless op_submit got given an OSD session to send the op to
[19:43] <gregaf> and I don't think anybody takes advantage of that, so it's always called
[19:45] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:49] <bchrisman> that's a local client calc right? so that's not terribly expensive I guess?
[19:51] <bchrisman> gregaf: attached compressed logfile from tid 24101+
[19:59] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[20:02] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:03] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[20:24] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:28] * Plnt_ (~someone@rhea.pwn.cz) has joined #ceph
[20:29] * Plnt (~someone@rhea.pwn.cz) Quit (reticulum.oftc.net kilo.oftc.net)
[20:33] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[20:36] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:53] <sagewk> created a 'next' branch for 0.26, direct bugfixes there!
[20:53] <sagewk> (or stable)
[20:54] <gregaf> bchrisman: yeah, targets are determined by who owns the object -- it's the miracle of CRUSH!
[21:21] <sagewk> tv: pushed a fix for the mount/umount crash
[21:22] <sagewk> (at least I can't reproduce it anymore)
[21:22] <Tv> sweet, i'll rerun asap
[21:24] <gregaf> bchrisman: oh, we need more log than that -- it doesn't show the submission of tid 24067 :)
[21:24] <gregaf> we really need at least that much, and this might be an osd issue
[21:25] <gregaf> the problem is that the operation responses are coming back in a different order from how they're submitted, and operations on a single object should always be ordered
[21:33] <Tv> sagewk: confirming, cannot reproduce mount/umount loop bug anymore
[21:33] <sagewk> tv: yay!
[21:34] <bchrisman> gregaf: ahh.. I had the wrong id stuck in my head there.. np.. one sec
[21:50] <bchrisman> gregaf: alright.. that should do it…
[21:51] <gregaf> hmm, that's all writes then...
[21:51] <gregaf> looks like we've found an osd bug
[21:52] <gregaf> bchrisman: do you still have the osd logs for that?
[21:52] <bchrisman> don't think I had much osd logging turned on...
[21:53] <bchrisman> but can rerun pretty quickly.
[21:53] <gregaf> what version are you running right now?
[21:53] <bchrisman> 0.25.1
[21:53] <gregaf> okay
[21:53] <bchrisman> I cannot reproduce this issue with the kernel client
[21:54] <bchrisman> same set of operations runs correctly over kernel client...
[21:54] <gregaf> well if you look at the sequence of osd_op_reply messages coming back
[21:54] <gregaf> you'll see acks for 24066, 24068, 24069 all in quick sequence
[21:54] <gregaf> and then at the end of the log file you see commits for 24066 and 24067
[21:55] <gregaf> now the protocol allows for the server to return just a commit instead of an ack+commit
[21:55] <gregaf> but in that case it's also serving as the ack
[21:55] <gregaf> and ack responses need to be ordered with respect to writes
[21:55] <gregaf> or, rather, operations and responses on the same object need to be ordered with respect to writes
[21:56] <gregaf> based on what we're seeing I'm sure if we just took out the assert the client would handle it fine
[21:56] <gregaf> but the server responses are concerning
[21:56] <bchrisman> ahh
[21:56] <gregaf> so I expect the kclient just isn't performing this debug check
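
Note: the ordering rule discussed above (acks for writes to a single object should come back in submission order) can be checked against a client log with something like the sketch below. It assumes, as a simplification, that tids on one object are submitted in increasing order, and that ack lines look like the `bh_write_ack <object> tid <n>` line quoted earlier in this log. The function name `find_out_of_order_acks` and the sample lines are hypothetical; this is not the check the client itself performs.

```python
import re

# Matches the object name and tid in lines like:
#   ... objectcacher bh_write_ack 10000005dcb.00000009/head tid 24067 303104~8192
ACK_RE = re.compile(r"bh_write_ack (\S+) tid (\d+)")


def find_out_of_order_acks(lines):
    """Return (object, tid, highest_tid_seen) for each late ack.

    An ack is reported when a higher tid has already been acked on the
    same object. Acks on different objects are tracked independently.
    """
    highest = {}      # object name -> highest tid acked so far
    violations = []
    for line in lines:
        match = ACK_RE.search(line)
        if match is None:
            continue
        obj, tid = match.group(1), int(match.group(2))
        prev = highest.get(obj)
        if prev is not None and tid < prev:
            violations.append((obj, tid, prev))
        else:
            highest[obj] = tid
    return violations


# Made-up ack sequence for illustration: 24067 arrives after 24069.
sample = [
    "objectcacher bh_write_ack 10000005dcb.00000009/head tid 24066 294912~8192",
    "objectcacher bh_write_ack 10000005dcb.00000009/head tid 24068 311296~8192",
    "objectcacher bh_write_ack 10000005dcb.00000009/head tid 24069 319488~8192",
    "objectcacher bh_write_ack 10000005dcb.00000009/head tid 24067 303104~8192",
]
print(find_out_of_order_acks(sample))
```

Running this over only the `bh_write_ack` lines of a log is usually enough to spot the first late ack; the surrounding `osd_op_reply` lines would then show whether it was an ack or a commit.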
[21:57] <bchrisman> yeah
[21:57] <bchrisman> checking on osd logs for that… looks like osd1
[21:57] <gregaf> yeah
[21:57] <gregaf> btw, what load is this again?
[21:57] <gregaf> this is while doing the export via samba?
[21:58] <bchrisman> export samba… and a build of our C++ code on top of that.
[21:59] <bchrisman> (as a test for the filesystem)
[22:06] <bchrisman> think I'll need to turn up osd debugging for this one
[22:06] <bchrisman> http://pastebin.com/ck4GqpSm
[22:06] <bchrisman> that's all I've got right now.
[22:08] <gregaf> yeah, that's…somewhat less than helpful
[22:09] <gregaf> we'll want everything pretty high up
[22:10] <gregaf> vstart's debugging settings should be good:
[22:10] <gregaf> debug ms = 1
[22:10] <gregaf> debug osd = 25
[22:10] <gregaf> debug monc = 20
[22:10] <gregaf> debug journal = 20
[22:10] <gregaf> debug filestore = 10
[22:11] <bchrisman> can I set debug filename, or will it allow me to symlink underneath?
[22:12] <gregaf> ?
[22:12] <gregaf> --log-file=file.log, if that's what you're after
[22:12] <bchrisman> gonna be a big debug file.. right now that's on my root partition.
[22:12] <bchrisman> hmm… ok..
[22:12] <gregaf> that lets you set the location
[22:21] <bchrisman> I assume that if I use injectargs for it, and it responds affirmatively, it'll be okay: http://pastebin.com/E2S2k4pt
[22:24] <cmccabe1> bchrisman: you have to send SIGHUP to get it to reread the logging configuration
[22:24] <cmccabe1> bchrisman: so injectargs, then send SIGHUP
[22:24] <bchrisman> ahh
[22:25] <gregaf> you don't need sighup for the change in log levels
[22:25] <gregaf> but you do for the change in log file
[22:26] <bchrisman> good point.. I need to change that logfile.. thx
[22:39] <bchrisman> hmm.. I guess I don't know that the error will crop up in the same osd… so I should probably put logging on all of them I guess.
[23:01] * allsystemsarego (~allsystem@ Quit (Ping timeout: 480 seconds)
[23:37] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[23:51] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.