#ceph IRC Log


IRC Log for 2011-08-19

Timestamps are in GMT/BST.

[0:29] <Tv> FYI borked sepia machines: 54, 41, 50, 58, 89, 36, 68, 27, 56 (trying to add ssh keys to all)
[0:35] <sagewk> are those the ones commented out in fabfile?
[0:35] <Tv> nope
[0:36] <Tv> soo... teuthology-nuke requires me to write down a yaml with targets: ? no way to do that more automagically, currently?
[0:36] <gregaf> Tv: unfortunately not
[0:36] <Tv> sagewk: i can commit the fabfile change, but i don't like taking machines out if all they need is a reinstall
[0:36] <sagewk> tv: ok, the reinstall is easy enough
[0:36] <gregaf> at least some of those have been out for a while -- they're not in the list of sepia keys I've got, anyway
[0:38] <Tv> they used to work..
[0:38] <Tv> they used to work a month ago, that is
[0:38] <sagewk> futzing with it now
[0:40] * lxo (~aoliva@9KCAAAC6H.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[0:40] <Tv> so i actually need to put the ssh key in nuke.yaml targets: line too
[0:41] <Tv> crap
[0:41] <gregaf> it's just the same as regular teuthology
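(For context, a teuthology-style targets stanza -- which the nuke.yaml discussed here borrows -- is just a YAML mapping of user@host entries to their SSH host keys. A rough sketch with hypothetical hostnames and a truncated key, not an actual file from this run:

    targets:
      ubuntu@sepia54.example.com: ssh-rsa AAAA...
      ubuntu@sepia41.example.com: ssh-rsa AAAA...

The host-key value is what Tv means by having to put the ssh key on the targets: line.)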
[0:43] <Tv> nukes should be easier to launch ;)
[0:43] <Tv> even the US missiles had a launch code of 0000000 for a long time
[0:43] <sagewk> tv: i forgot to push the new console ips; working now
[0:44] <sagewk> tv: just need to run ./reimage.sh N as root@cephbooter for each node (did 54)
[0:45] <Tv> sagewk: i can run those, thx
[0:46] <gregaf> Tv: I've thought it would be nice to add targets as command-line parameters, but I'm not sure how much it actually adds since if you're using auto-locking you're going to have to jump through hoops anyway, and if you're not you've got them in a config file somewhere...
[0:46] * lxo (~aoliva@19NAAC63B.tor-irc.dnsbl.oftc.net) has joined #ceph
[0:47] <Tv> gregaf: nuke should interact with locking i think
[0:47] <gregaf> and nuke everything you have locked? That'll be pleasant when you kill a run in progress while debugging your other thing
[0:48] <Tv> not necessarily all
[0:48] <Tv> just saying, ensure you have it locked, get ssh pubkey, nuke, ...
[0:48] <Tv> just need to design for the whole workflow
[0:48] <Tv> this is a remnant from before the locks existed
[0:49] <sagewk> tv: iirc it does verify locks before nuking..
[0:49] <Tv> powercycling
[0:49] <Tv> sepia41 not found in map file ceph-qa-deploy/machine_power
[0:50] <sagewk> k
[0:51] <sagewk> tv: try now
[0:51] <Tv> SSH got permission denied at ./ceph-qa-deploy/_pdu_helper.pl line 28.
[0:51] <Tv> harrumph
[0:52] <sagewk> on 41?
[0:52] <Tv> and others
[0:53] <sagewk> 41 worked for me
[0:53] <Tv> cephbooter:~# ./ceph-qa-deploy/powercycle sepia41
[0:53] <Tv> perhaps you have agent forwarding from your own account..
[0:53] <sagewk> yeah
[0:55] <sagewk> try now?
[0:55] <Tv> same
[1:07] <slang> sagewk: thanks for those commits
[1:07] <sagewk> slang: np, let me know how far that gets you
[1:10] <slang> sagewk: started up everything again and things seem to be recovering
[1:11] <sagewk> cool beans.
[1:11] <slang> number of peering pgs is dropping gradually
[1:11] <sagewk> i'm still not 100% sure about the 'wrongly marked down' we saw.. if you see any of that again, let us know.
[1:11] <slang> sagewk: ok
[1:11] <sagewk> not sure if 'debug ms = 1' is a low enough level of logging to leave on? that is usually enough to track down the heartbeat issues
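(For reference, 'debug ms = 1' is an ordinary ceph.conf option; a minimal sketch of leaving it enabled for the OSDs -- the section placement is illustrative, it could equally go under [global]:

    [osd]
        debug ms = 1
)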
[1:23] * Tv (~Tv|work@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[1:27] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Ping timeout: 480 seconds)
[2:06] * slang1 (~Adium@chml01.drwholdings.com) has joined #ceph
[2:07] <slang1> osd crash: http://pastebin.com/raw.php?i=A08FRV15
[2:09] <slang1> stack trace: http://pastebin.com/raw.php?i=Hn8LWu78
[2:10] <gregaf> slang1: most everybody's out for the day now...
[2:10] <gregaf> but how did you come down -- was your crushmap good or might you have actually lost objects?
[2:11] <sjust> yes
[2:11] <slang1> gregaf: may have lost objects
[2:11] <gregaf> I'm guessing that's what caused this, but sjust knows way more about it than me...*consults*
[2:11] <sjust> lost objects would do it
[2:11] <sjust> unfortunately
[2:12] <sjust> to be honest, that bit of code should probably attempt to recover from that case
[2:12] <slang1> yes
[2:12] <sjust> it's failing because the on-disk set of objects does not match the set of clones that the xattr on the head object says should be there
[2:13] <sjust> normally, that's a sign of a bug, hence the assert
[2:14] <sjust> when you said there were lost objects, was there a sequence of failures from earlier?
[2:14] <slang1> I can see it being difficult to tell if the data got lost/corrupted or if you hit a bug
[2:14] <sjust> actually, now that I think about it, scrub will only happen if the PG deems itself healthy
[2:15] <slang1> but in general ceph should not assert fail just to catch bugs, no?
[2:15] <gregaf> sjust: he lost some OSDs off a single machine and marked them down/out to force recovery, but it turned out his crushmap wasn't set up like he thought it was :(
[2:15] <sjust> indeed
[2:15] <gregaf> and did the whole thing to mark as lost
[2:15] <sjust> slang1: the notion is that corrupt osd state could cause data loss, so early failure is often preferable
[2:15] * huangjun (~root@113.106.102.8) has joined #ceph
[2:17] <slang1> sjust: that means data corruption (by some other path) could cause an osd to crash
[2:18] <sjust> ok, just talked to Greg, the problem is that marking the objects lost doesn't clean up the metadata on the head objects
[2:22] <slang1> ok
[2:24] <sjust> slang1: yeah, it's a flaw in the way we mark objects as lost, we'll need to fix it
[2:25] <slang1> I can report an issue in the bug tracker if desired
[2:25] <sjust> slang1: actually, were you using any snapshots?
[2:25] <slang1> sjust: we had created some snapshots, yes
[2:25] <sjust> slang1: ok, good, theory intact
[2:26] <sjust> slang1: I'll make the bug, thanks for the help
[2:26] <slang1> sjust: thank you!
[2:29] <sagewk> are we sure there were really lost objects?
[2:32] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[2:48] * cmccabe (~cmccabe@69.170.166.146) Quit (Quit: Leaving.)
[2:56] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) Quit (Quit: jojy)
[2:58] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[3:08] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:10] * greglap (~Adium@166.205.141.126) has joined #ceph
[3:41] * macana (~ml.macana@159.226.41.129) Quit (Remote host closed the connection)
[3:57] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Remote host closed the connection)
[4:00] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[4:05] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) Quit (Quit: Who the hell is this Peer? If I ever catch him I'll reset his connection!)
[4:12] * greglap (~Adium@166.205.141.126) Quit (Read error: Connection reset by peer)
[4:16] * jojy (~jojyvargh@75.54.231.2) has joined #ceph
[4:16] * jojy (~jojyvargh@75.54.231.2) Quit ()
[5:19] * slang1 (~Adium@chml01.drwholdings.com) Quit (Read error: Connection reset by peer)
[5:56] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:38] * slang1 (~Adium@chml01.drwholdings.com) has joined #ceph
[6:38] <slang1> sagewk: I'm not sure there were, no
[6:40] <slang1> I just hit another crash on another osd
[6:40] <slang1> http://pastebin.com/raw.php?i=7jGT1czn
[6:40] <slang1> this one may have been caused by one of the other nodes going down (due to an unrelated kernel panic)
[6:42] <slang1> http://pastebin.com/raw.php?i=AeHhnwR8
[6:42] <slang1> that's the backtrace
[7:21] * slang1 (~Adium@chml01.drwholdings.com) Quit (Read error: Connection reset by peer)
[8:02] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Remote host closed the connection)
[8:17] * The_Bishop (~bishop@port-92-206-21-65.dynamic.qsc.de) has joined #ceph
[8:49] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[9:11] * Meths_ (rift@2.25.212.121) has joined #ceph
[9:16] * Meths (rift@2.25.189.115) Quit (Read error: Operation timed out)
[9:18] * gregorg (~Greg@78.155.152.6) Quit (Quit: Quitte)
[9:19] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Ping timeout: 480 seconds)
[9:44] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[10:07] * lxo (~aoliva@19NAAC63B.tor-irc.dnsbl.oftc.net) Quit (Quit: later)
[10:23] * lxo (~aoliva@19NAAC7BN.tor-irc.dnsbl.oftc.net) has joined #ceph
[13:45] * gregorg (~Greg@78.155.152.6) has joined #ceph
[13:56] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) Quit (Ping timeout: 480 seconds)
[14:28] * huangjun (~root@113.106.102.8) Quit (Remote host closed the connection)
[15:38] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) Quit (Remote host closed the connection)
[16:08] * Hugh (~hughmacdo@soho-94-143-249-50.sohonet.co.uk) has joined #ceph
[16:13] * mtk (KkDCxodNrp@panix2.panix.com) Quit (Remote host closed the connection)
[16:18] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[16:22] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) Quit (Remote host closed the connection)
[16:26] * mtk (~mtk@ool-182c8e6c.dyn.optonline.net) has joined #ceph
[16:43] * ladyg (~ladyg@212-198-248-35.rev.numericable.fr) has joined #ceph
[16:45] * ladyg (~ladyg@212-198-248-35.rev.numericable.fr) Quit ()
[16:49] * greglap (~Adium@166.205.142.20) has joined #ceph
[17:02] <slang> frequent crashing of monitor node, log: http://pastebin.com/raw.php?i=GvcbZNxW, bt: http://pastebin.com/raw.php?i=ER7xiBEL
[17:07] <greglap> slang: weird, were you messing around with your monitor stores at any point?
[17:07] <slang> no
[17:08] <greglap> and can you try and reproduce it with debug_mon = 20?
[17:16] <slang> greglap: here you go: http://pastebin.com/raw.php?i=9g1K3FW9
[17:21] <greglap> hmmm, so we somehow got a zero-length mdsmap on disk
[17:23] <greglap> did any of your monitor nodes get rebooted?
[17:37] * greglap (~Adium@166.205.142.20) Quit (Quit: Leaving.)
[17:38] <slang> oh
[17:39] <slang> is greglap still in here on a different handle?
[17:39] <slang> two nodes running monitors crashed and got rebooted
[17:47] * Tv (~Tv|work@aon.hq.newdream.net) has joined #ceph
[17:55] <gregaf> that's me, slang :)
[17:55] <gregaf> work computer, lap is laptop, pad is ipad, etc ;)
[18:08] <slang> ah
[18:09] <slang> I figured, but I thought maybe gregs were just attracted to ceph :-)
[18:09] <sagewk> slang: it sounds like an issue with the order things are fsync'd to disk on the monitor.
[18:09] <slang> sagewk: ok
[18:11] <gregaf> well, there's some attraction; I still have no idea who gregorg is -- reliable lurker but I dunno if he has ever said anything :)
[18:12] <gregaf> sagewk: looks like xlock_done and put_xlock are separate
[18:12] <gregaf> early_reply calls xlock_done but not put_xlock; that's left for later and I bet something in there is why it's broken
[18:13] <sagewk> gregaf: iirc done means the update is done locally and we can pipeline changes, put means it is visible to other clients
[18:14] <gregaf> okay, but when early_reply goes it's also visible to at least that one client...
[18:14] <gregaf> who in this case sends along another request that tries to xlock before the xlock has been put
[18:15] <sagewk> gregaf: i see. let's look at the log
[18:17] * ajm (adam@adam.gs) Quit (Remote host closed the connection)
[18:18] * ajm (adam@adam.gs) has joined #ceph
[18:20] <ajm> hi, i'm curious if anyone has seen this with 0.33: http://pastebin.com/Fbsymf5b
[18:21] <gregaf> ajm: augh
[18:21] <gregaf> looks like our atomic_t implementations have diverged depending on if you're using libatomic_ops or not
[18:23] <ajm> hrm, let me try with libatomic on
[18:23] <gregaf> simple patch should fix it:
[18:23] <gregaf> diff --git a/src/include/atomic.h b/src/include/atomic.h
[18:23] <gregaf> index 293b86a..5d4a9a3 100644
[18:23] <gregaf> --- a/src/include/atomic.h
[18:23] <gregaf> +++ b/src/include/atomic.h
[18:23] <gregaf> @@ -76,6 +76,11 @@ namespace ceph {
[18:23] <gregaf> ~atomic_t() {
[18:23] <gregaf> pthread_spin_destroy(&lock);
[18:23] <gregaf> }
[18:23] <gregaf> + void set(size_t v) {
[18:23] <gregaf> + pthread_spin_lock(&lock);
[18:23] <gregaf> + val = v;
[18:23] <gregaf> + pthread_spin_unlock(&lock);
[18:23] <gregaf> + }
[18:23] <gregaf> int inc() {
[18:23] <gregaf> pthread_spin_lock(&lock);
[18:23] <gregaf> int r = ++val;
[18:24] <sagewk> slang: can you pastebin an ls -al in that directory?
[18:24] <sagewk> $mon_data/mdsmap
[18:27] <slang> http://pastebin.com/raw.php?i=rKruxwqD
[18:28] <sagewk> slang: ok, i found the bug. quick fix for you is to just delete those 0-length files (363-368) and restart
[18:28] <sagewk> will push a patch shortly
[18:28] <slang> awesome
[18:29] <slang> sagewk: thanks!
[18:29] <ajm> gregaf: is it better to use libatomic ?
[18:29] <gregaf> ajm: if you don't then it implements atomic_t using spinlocks
[18:29] <gregaf> so yes, but by how much? dunno
[18:30] <ajm> interesting, if there's no negative i'll just do that, easier to adjust a use flag than it is to patch sources :)
[18:31] <gregaf> ajm: well it should be defaulting to building with atomic_ops, so you probably don't have the library installed
[18:32] <gregaf> what arch/distribution are you using?
[18:32] <ajm> its gentoo, it didn't have it installed, easy enough to add the use flag
[18:33] <gregaf> what I'm saying is that if the library exists the configure script should be setting it to use that without you needing to set any flags :)
[18:33] <gregaf> and if you don't have it installed then it will just fail since it won't be able to find things
[18:34] <ajm> i think without the use flag it does --without-libatomic
[18:35] * jojy (~jojyvargh@70-35-37-146.static.wiline.com) has joined #ceph
[18:36] <gregaf> yes, that's why it's busted -- you'll need to apply the patch if you don't want to install the library
[18:37] <ajm> already installed it :)
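(For comparison, when libatomic_ops is installed the same atomic_t interface is typically backed by the AO_* primitives instead of a pthread spinlock. A rough sketch under that assumption, not the exact ceph source:

    #include <atomic_ops.h>

    // Sketch: counter backed by libatomic_ops primitives (AO_store,
    // AO_load, AO_fetch_and_add1 operate lock-free on an AO_t word).
    class atomic_t {
      AO_t val;
    public:
      atomic_t(AO_t i = 0) : val(i) {}
      void set(AO_t v) { AO_store(&val, v); }             // plain atomic store
      AO_t inc() { return AO_fetch_and_add1(&val) + 1; }  // returns the new value
      AO_t read() { return AO_load(&val); }
    };

The spinlock fallback in the patch above only matters when the library is absent; with it installed these calls compile down to single atomic instructions on most platforms.)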
[18:40] <slang> sagewk: deleting the zero-length files fixed the crash
[18:42] <gregaf> the bigger question is how exactly they got there :/
[18:43] <sagewk> gregaf: fixing that bug now.
[18:44] <gregaf> just bad/missing sync ordering?
[18:46] <sagewk> the sync(2) optimization was unsafe.
[18:48] <gregaf> oh good, I was hoping it was simple
[18:56] * cmccabe (~cmccabe@c-24-23-254-199.hsd1.ca.comcast.net) has joined #ceph
[19:18] * lxo (~aoliva@19NAAC7BN.tor-irc.dnsbl.oftc.net) Quit (Remote host closed the connection)
[19:34] * lxo (~aoliva@9YYAAAZLX.tor-irc.dnsbl.oftc.net) has joined #ceph
[20:00] * sjust (~sam@aon.hq.newdream.net) Quit (Read error: Connection reset by peer)
[20:04] * sjust (~sam@aon.hq.newdream.net) has joined #ceph
[20:05] * st-15308 (foobar@89.214.230.142) has joined #ceph
[20:06] * st-15308 (foobar@89.214.230.142) Quit ()
[20:08] <slang> is it common to see 'ms_handle_reset' messages?
[20:09] <slang> or is that just me?
[20:10] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[20:21] <gregaf> slang: pretty sure that's just what happens when sockets go idle for long enough
[20:37] * jim (~chatzilla@astound-69-42-16-6.ca.astound.net) has joined #ceph
[20:55] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:01] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[21:01] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:08] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[21:10] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:23] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[21:24] <slang> I'm seeing client (cfuse) hangs, and I'm not sure what to check
[21:24] <slang> (or if I have the right commits)
[21:25] * bchrisman (~Adium@64.164.138.146) has joined #ceph
[21:25] <slang> if I try to restart the client with debugging enabled, it seems to just sit there calling renew_caps() every few seconds
[21:26] <slang> is that something that's already been seen/fixed?
[21:27] <gregaf> slang: that usually means the MDS isn't replying to a client request
[21:27] <gregaf> I'm not aware of any such problems in the latest code, but I'm sure there still are some
[21:31] <slang> http://pastebin.com/raw.php?i=VWTeZgL0
[21:31] <slang> that's with debug client = 20, debug ms = 20
[21:32] <slang> I don't even have to pass any args to cfuse, so this is happening before arg parsing
[21:32] <gregaf> that's the entire log?
[21:32] <slang> for that run, yes
[21:32] <slang> until I killed it
[21:33] <slang> I can let it run longer, but it just keeps iterating on the same pattern
[21:33] <gregaf> oh, it can't connect to the MDS for some reason
[21:33] <slang> renew_caps()....writer sleeping
[21:33] <gregaf> is it actually running?
[21:33] <slang> yes
[21:34] <gregaf> what status is it in
[21:34] <gregaf> ?
[21:34] <slang> I would enable debugging on the mds, but ceph mds tell 0 injectargs '--debug-mds 20' doesn't seem to work
[21:34] <slang> active
[21:35] <slang> mds e2077: 1/1/1 up {0=delta=up:active}, 1 up:standby-replay, 2 up:standby
[21:36] <gregaf> and delta is at address 192.168.101.15:6800?
[21:37] <gregaf> oh, sorry, you mean it's calling renew_caps and won't satisfy any requests, right?
[21:38] <gregaf> err, any FS accesses, I mean
[21:38] <slang> gregaf: it chooses the port, but yes should be at the ip
[21:38] <slang> gregaf: right -- the clients I still have running are hung
[21:38] <slang> gregaf: the ones I try to restart, cfuse just hangs
[21:38] <gregaf> I don't think I understand
[21:39] <gregaf> you have frozen mounts and when you try to restart them, they hang before restart?
[21:39] <gregaf> or after?
[21:39] <slang> its the restart ones that I have enabled logging for and can see calling renew_caps()
[21:39] <gregaf> and you restarted them with debugging because they hung?
[21:39] <slang> gregaf: I can unmount fine, but the cfuse process hangs when I try to start it again (mount)
[21:40] <gregaf> when you unmount are you sure the process actually exited?
[21:40] <slang> gregaf: I restarted one with debugging because they are hung, yes
[21:40] <slang> gregaf: yes
[21:40] <gregaf> hmmm
[21:40] <gregaf> how long ago did you restart them?
[21:41] <gregaf> it sounds to me like they hung on an MDS request that got stuck in the MDS somewhere
[21:41] <gregaf> and then when you restarted them they're waiting on caps because the MDS has already issued those caps
[21:41] <gregaf> or possibly because the lost request is holding some locks on the MDS
[21:42] <slang> gregaf: can I flush the caps on an mds?
[21:42] <gregaf> no, there's just the timeout
[21:42] <gregaf> you can restart the MDS though, that should get it unstuck and the clients that are still alive will replay any requests that didn't get committed to disk
[21:46] <slang> maybe it's just stuck in a boot state
[21:47] <slang> http://fpaste.org/uUd3/
[21:47] * Meths_ is now known as Meths
[21:49] <slang> http://fpaste.org/5xgv/raw/
[21:49] <slang> that's the log with debugging for the mds that seems to be stuck in up:boot
[21:49] <gregaf> it'll time out and then the standby-replay will take over and the old active will probably go standby-replay
[21:51] <slang> or at least, the mds thinks it's in standby, but the monitor thinks it's in boot
[21:52] <gregaf> no, the up:boot is a log message notifying you that it booted, then the monitor told it to go to standby since it didn't have anything specific for it to do
[21:52] <gregaf> you're just waiting now for the monitors to conclude that the MDS died, although I don't think it should take this long, more like 30 seconds
[21:54] <slang> gregaf: this isn't normal behavior
[21:55] <gregaf> what's ceph -s say now?
[21:55] <slang> same
[21:56] <gregaf> do you have any weird config changes for the monitor or mds?
[22:08] <slang> mon.alpha@0(leader).mds e2080 preprocess_beacon on fsid 00000000-0000-0000-0000-000000000000 != 12b23328-79c8-09a0-6be5-d154d246232b
[22:10] <gregaf> that's part of the MDS' first beacon?
[22:10] <gregaf> that's normal on daemon startup since they don't know the fsid (unless they've got local storage and have started up previously)
[22:11] <gregaf> what mds and mon config settings are you using?
[22:17] <slang> http://fpaste.org/5dqC/raw/
[22:22] <gregaf> oh, how many of your monitors are running?
[22:22] <slang> all of them
[22:22] <gregaf> that status only has 3 of them up
[22:22] <gregaf> but there are 4 in your config
[22:23] <gregaf> and if another one went down, that's it, map updates are stuck
[22:24] <slang> it seems to always just report 3
[22:24] <slang> I'm assuming that's because it has to have an odd number of monitor nodes
[22:25] <gregaf> well, you want an odd number, I'm not aware of any code that's been added to enforce that though...sagewk?
[22:34] <sagewk> it's not enforced.
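(Worth noting for the stuck-map symptom above: the monitors need a strict majority to form quorum, so with 4 configured the majority is 3 -- the 3 that are up can still make progress, but losing any one more drops the cluster below quorum and map updates stall, which matches what gregaf describes.)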
[22:34] <sagewk> slang: you said cfuse was hung? it was broken for a while during this sprint.. try v0.33 or latest master for best results
[22:34] <gregaf> slang: can you check your nodes and make sure the monitor processes are all still running?
[22:35] <slang> gregaf: they're still running
[22:36] <slang> sagewk: I'm running stable (v0.33) with some cherry-picked commits
[22:37] <slang> sagewk: v0.33 + d44eda35393308474ee70c1fad59e6c80732345e + 2839e71a0a8839bc2bd9e1a70cf8cf8cc6b78544
[22:38] <slang> (I'd rather not run master if I can help it)
[22:38] <sagewk> slang: yeah
[22:50] <sagewk> slang: did you post the cfuse log before? the old pastebin seems to be erroring out
[22:51] <slang> yeah let me repost on fpaste
[22:54] <slang> http://fpaste.org/ZJ8J/raw/
[22:54] <slang> that has a few attempts
[22:55] <slang> http://fpaste.org/r3r9/raw/
[22:55] <slang> similar result on most recent attempt
[22:55] <slang> (different client)
[22:57] <slang> also after killing the charlie monitor
[22:57] <slang> (at 114)
[22:57] <sagewk> sorry if i'm repeating, but: did you see the client_session arriving in the mds log?
[23:00] <sagewk> arg i have a meeting, can't follow up right now.
[23:00] <sagewk> what is surprising to me is that the client_session is sent to the mds but there is no reply. either the mds isn't getting the message, is dropping it, or can't journal to the osds (and reply) bc the osds are misbehaving.
[23:01] <sagewk> watching for the client_session arrival in the mds log should tell you which
[23:05] * bchrisman (~Adium@64.164.138.146) Quit (Quit: Leaving.)
[23:06] <slang> http://fpaste.org/t6jr/raw/
[23:08] <slang> client_session is there
[23:39] <slang> it looks like it's sending a write to an osd after it gets the request_open
[23:39] <slang> a write, and then a writefull
[23:40] <slang> it's not clear to me how those pg ids (1.20b7 and 1.3494) work though
[23:40] <slang> there are pgs with ids 1.20b and 1.349
[23:40] <slang> and the primaries for those pgs seem fine..
[23:46] <slang> killing the primary osd for 1.20b7 does cause this to be spit out: [ERR] 0.678 log bound mismatch, info (83'845,6131'847] actual [16'1,6131'847]
[23:46] <slang> and others like
[23:46] <slang> it
[23:46] <slang> along with: [ERR] 0.745 scrub stat mismatch, got 129/104 objects, 26/26 clones, 310848146/315042450 bytes, 303565/307661 kb
[23:59] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.