IRC Log for 2010-08-27

Timestamps are in GMT/BST.

[0:10] * MarkN (~nathan@ Quit (Remote host closed the connection)
[0:12] * MarkN (~nathan@ has joined #ceph
[0:33] <sagewk> markn: http://tracker.newdream.net/issues/384
[0:33] <sagewk> markn: can you upload your ceph.ko and .config?
[0:40] <MarkN> sure, my kernel .config?
[0:40] <sagewk> markn: were you doing anything funny with multiple mounts of the same fs, or deep mounts (mount -t ceph server:/some/sub/dir /mnt/foo) ?
[0:40] <sagewk> yeah
[0:40] <MarkN> yes wrt the deep mounts
[0:41] <sagewk> aha, ok. do you have the sequence of operations that led to the crash?
[0:42] <MarkN> it seems as though, if i add another client, then do a stat system call, it oopses
[0:42] <MarkN> so for example cd /ceph/mount
[0:43] <MarkN> then do 'ls start_of_file' then tab complete it will fail
[0:43] <sagewk> (where /ceph is a mount of server:/some/subdir ?)
[0:43] <sagewk> it's crashing during readdir (triggered by tab completion it sounds like)
[0:43] <MarkN> yes
[0:43] <sagewk> how did you mount in that scenario? server:/some/subdir on /ceph?
[0:44] <MarkN> "mount -t ceph /data/server/public"
[0:45] <sagewk> and it's crashing on something like 'ls /data/server/public' ?
[0:45] <sagewk> (i.e. root of mounted dir?)
[0:46] <MarkN> yes
[0:46] <sagewk> ok thanks, i'll see if i can reproduce that
[0:47] <MarkN> thanks. One quick question now after rebooting all machines in the cluster / clients i am getting a can't read super block error when trying to mount, what is the best way to diagnose these issues ?
[0:51] <sagewk> dmesg|tail.. if there's nothing useful there, you can crank up debugging (echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control) and repeat
[1:19] <sagewk> markn: pushed a fix to ceph-client.git master branch, commit ce4d6eab
[1:19] <sagewk> (at least, it fixed my method of hitting the bug.) can you let me know if it fixes your problem?
[1:20] <MarkN> sure - i will get it sorted this morning
[1:20] <sagewk> thanks
[1:59] <MarkN> so sage I am trying to remount the filesystem and keep getting the can't read superblock issue. dmesg shows nothing, syslog shows nothing, only :
[1:59] <MarkN> Aug 27 09:42:21 devgold051 kernel: ceph: client4734 fsid 081d1a85-8b2a-39ed-47e3-a4f423017857
[1:59] <MarkN> Aug 27 09:42:21 devgold051 kernel: ceph: mon0 session established
[2:00] <MarkN> so no errors. any other ideas on mounting the fs?
[2:08] <gregphone> MarkN: you tried rebooting?
[2:08] <gregphone> if that doesn't fix it you're going to need to email the list
[2:08] <MarkN> yeah, all clients and cluster nodes, all nodes are up with the correct processes running on them
[2:08] <MarkN> no worries RE list email
[2:09] <gregphone> it's just because sage and yehudasa are both out of the office until after Labor Day now
[2:10] <MarkN> what is labour day date in the US?
[2:11] <gregphone> September 6
[2:11] <gregphone> week and a half from now
[2:13] <MarkN> ah OK no problems - i will do some more digging around anyway and send it off to the list
[2:20] <MarkN> hmm after trying for an hour it has decided to mount OK after me going to get a tea and biscuits :)
[2:26] <gregphone> hmm
[2:27] <gregphone> my WAG was that one of the server addresses or the raid wasn't updating properly, guess it finally flushed out due to a timeout or something
[2:27] <gregphone> *raid -> fsid
[2:29] <MarkN> any way to check this in the future if it happens again ?
[2:30] <gregphone> probably if you enable debug output it'll tell you what's going wrong when the mount fails
[2:31] <gregphone> not sure how thorough that coverage is, though
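
Editor's sketch of the debugging step suggested at [0:51] and [2:30], assuming a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug. The function names and the CONTROL variable are hypothetical helpers introduced here for illustration; only the 'module ceph +p' line comes from the log itself.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ceph) for toggling ceph kernel debug output.
# CONTROL defaults to the dynamic debug control file quoted in the log.
CONTROL="${CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

enable_ceph_debug() {
    # "+p" turns on printing of the ceph module's debug messages.
    if [ -w "$CONTROL" ]; then
        echo 'module ceph +p' > "$CONTROL"
    else
        echo "cannot write $CONTROL (need root, debugfs, CONFIG_DYNAMIC_DEBUG)" >&2
        return 1
    fi
}

disable_ceph_debug() {
    # "-p" turns the debug messages back off once the failure is captured.
    if [ -w "$CONTROL" ]; then
        echo 'module ceph -p' > "$CONTROL"
    else
        return 1
    fi
}
```

A possible sequence, run as root: call enable_ceph_debug, repeat the failing mount, inspect the output with dmesg|tail, then call disable_ceph_debug. As gregphone notes, how much the debug output covers the mount path is not certain.
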
[6:19] <wido> sagewk: still there? any idea how long class loading should take?
[14:09] <todinini> wido: the rbd-support.patch does not work, applies cleanly but the function error_report is not defined
[15:07] <wido> todinini: oh, might be an old patch then
[15:07] <wido> let me check
[15:09] <wido> oh, yes, old patch, adding a new one right now
[15:09] <wido> should use printf instead of error_report
[15:11] <wido> todinini: http://tracker.newdream.net/issues/341
[15:11] <todinini> wido: ok, I will try again
[15:24] <wido> todinini: http://tracker.newdream.net/issues/381
[15:24] <wido> are you still hitting that too?
[15:30] <todinini> wido: at one point you call the volume delta and further down alpha, maybe that is the problem?
[15:33] <wido> no, alpha - charlie exist, so i try to ls the snapshots
[15:33] <wido> and i try to create the "delta" volume
[15:40] <todinini> wido: that's weird because it works for me
[15:42] <wido> todinini: yeah, i'm still trying to figure it out...
[15:47] <todinini> hmm I can't compile the ubuntu libvirt-0.7.5 package, even the original source .deb is failing
[17:45] <wido> todinini: why not apt-get my packages?
[17:59] <wido> yehudasa: you there? I just tried to load RBD again, but right now it won't even load
[18:00] <wido> restarted my whole cluster, cclass -a, waited for some time, but RBD never shows, only sync 1.0
[19:47] <gregaf> wido: your recent MDS crash is actually a different issue from #312, involving the distributed lock manager
[19:48] <gregaf> are your MDSes just refusing to come up now, or is your cluster working again?
[19:50] <gregaf> and what version of the code were you running when it crashed the first time?
[19:58] <wido> gregaf: my MDS'es will start, but crash after some time
[19:58] <gregaf> with that same backtrace?
[19:58] <wido> and i upgraded from yesterday's unstable to the one of this morning
[19:59] <wido> yes, those backtraces are from the crashes i saw today (i preserved the timestamps)
[20:04] <gregaf> all right, I'll look at it a bit more and see if I can work out what's going on or if it's safe to just nix the assert, but if you could make a new issue it'd be good since a final resolution will probably have to wait on Sage
[20:05] <wido> gregaf: any suggestions for an issue subject?
[20:06] <gregaf> failed assertion in Locker::scatter_nudge
[20:07] <wido> ok, i'll do that in a minute
[20:07] <gregaf> I can see it in the code easily enough but it looks like it's just catching an issue in/with the distributed lock manager that occurred earlier
[20:38] <wido> gregaf: http://tracker.newdream.net/issues/385
[22:39] <kblin> evening folks
[22:40] <gregaf> hey
