#ceph IRC Log


IRC Log for 2010-11-05

Timestamps are in GMT/BST.

[0:41] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[0:41] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) Quit (Read error: Connection reset by peer)
[0:41] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[0:44] * darkfader (~floh@host-82-135-62-109.customer.m-online.net) has joined #ceph
[0:51] <jantje_> cmccabe: git unstable from last week, I can give you the numbers tomorrow
[0:52] <jantje_> cmccabe: and I need a kick-start introduction to gdb :-)
[0:52] <cmccabe> just doing git show HEAD would be good
[0:52] <cmccabe> actually, though, the backtrace is probably just at the end of the log file mon.a or whatever
[0:53] <cmccabe> so that would be the best thing to start with
[0:53] <jantje_> I probably restarted it a dozen times before I noticed the core dump
[0:53] <jantje_> I should watch more closely
[0:53] <cmccabe> I created a script called ps-ceph.sh
[0:53] * greglap1 (~Adium@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[0:54] <jantje_> (and I also need to configure my core-dump-naming!)
[0:54] <cmccabe> which tells you what ceph-related processes are running on your local machine
[0:54] <jantje_> ok, i'll check it out tomorrow
[0:54] <jantje_> (i'm not at work, can't access those things)
[0:54] <cmccabe> k
[0:54] <jantje_> time differences are so annoying :)
[0:55] <cmccabe> ps-ceph.sh is just for convenience. It's nothing you couldn't do yourself with just ps and grep.
[0:55] <cmccabe> But I was doing it a lot, so...
[0:55] <jantje_> :-)
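
Note: ps-ceph.sh itself is not shown in this log. As a rough sketch of the "ps and grep" idea cmccabe describes, the snippet below filters a process listing for ceph-related commands. The daemon names in CEPH_NAMES are an assumption (only "cmds" is mentioned elsewhere in this log), and both helper functions are hypothetical, not part of ceph.

```python
import subprocess

# Assumed process names; adjust to whatever your build actually installs.
CEPH_NAMES = ("cmon", "cmds", "cosd", "cfuse")


def filter_ceph(ps_lines, names=CEPH_NAMES):
    """Return the 'PID COMMAND...' lines whose executable is in `names`."""
    hits = []
    for line in ps_lines:
        fields = line.split(None, 1)  # -> [pid, rest of command line]
        if len(fields) < 2:
            continue
        # First word of the command, with any leading directory removed.
        exe = fields[1].split()[0].rsplit("/", 1)[-1]
        if exe in names:
            hits.append(line.rstrip("\n"))
    return hits


def ceph_processes():
    """Run `ps -eo pid,args` and keep only the ceph-related rows."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return filter_ceph(out.splitlines()[1:])  # skip the header row
```

Matching on the executable name rather than grepping the whole line avoids the usual problem of the grep process matching itself.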
[1:22] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[1:24] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Remote host closed the connection)
[1:25] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[1:27] * greglap (~Adium@ has joined #ceph
[1:33] * cmccabe (~cmccabe@adsl-76-199-101-63.dsl.pltn13.sbcglobal.net) Quit (Quit: Leaving.)
[2:18] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[2:25] * alexxy (~alexxy@ has joined #ceph
[2:31] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) has joined #ceph
[2:37] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[3:01] * lidongyang (~lidongyan@ Quit (Remote host closed the connection)
[4:28] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[4:39] * Jiaju (~jjzhang@ has joined #ceph
[5:17] * henrycc (~henry_c_c@59-124-35-221.HINET-IP.hinet.net) has joined #ceph
[5:32] <sage> henrycc: any luck with that patch?
[6:33] * terang (~me@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[6:37] <henrycc> sage: I have not tried it yet. I am looking at the other issue about direct i/o.
[6:38] <henrycc> will try that patch later.
[6:38] <sage> ok
[6:38] <sage> thanks
[6:40] <henrycc> when opening a file with the direct i/o flag, it seems that seeking to offset 6656 and reading 512 bytes returns wrong data.
[6:40] <sage> hmm, that's interesting..
[6:41] <sage> i wonder if it isn't handling non-page aligned io properly. its been a while since i looked at that code.
[6:41] <sage> is the data shifted, i wonder.. read into a different part of the page or something?
[6:42] <henrycc> yes
[6:42] <henrycc> it's about page alignment
[6:43] <sage> i see. that will be a tricky fix. i think the messenger code currently uses the alignment provided by the message payload (data_off in the header). that should probably be provided by the client side (so that an ill-behaved osd can't cause trouble).
[6:44] <sage> probably needs to be worked into the alloc_msg callback somehow. maybe add a ceph_msg field for the actual alignment to use when reading in the data pages (instead of using data_off in the header).
[6:45] <henrycc> ok.. doesn't look like something I can fix :)
[6:45] <sage> do you need 512-byte aligned direct io?
[6:46] <henrycc> yes, we use vhd-util.
[6:46] <henrycc> it uses 512-byte aligned i/o
[6:46] <sage> i shouldn't be too complicated to fix. can you enter an issue in the tracker?
[6:46] <sage> s/i/it/
[6:46] <henrycc> sure
[6:48] <henrycc> is this a feature (or bug)?
[6:48] <sage> bug
[6:52] * f4m8_ is now known as f4m8
[6:57] <henrycc> ok...bug #546
[6:57] <sage> thanks. will try to look at it tomorrow
[6:57] <henrycc> thanks
[6:58] <sage> np
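
Note: to make the numbers in the direct i/o discussion concrete, offset 6656 is a multiple of 512 but not of the page size, so the read starts partway into a page. A minimal sketch, assuming 4096-byte pages (the log does not state the page size); describe_read is a hypothetical helper used only for illustration.

```python
PAGE_SIZE = 4096   # assumption: typical x86 page size, not stated in the log
SECTOR_SIZE = 512  # the alignment vhd-util uses, per henrycc


def describe_read(offset, length, page_size=PAGE_SIZE, sector_size=SECTOR_SIZE):
    """Report how a read lines up with sector and page boundaries."""
    return {
        "sector_aligned": offset % sector_size == 0 and length % sector_size == 0,
        "page_aligned": offset % page_size == 0 and length % page_size == 0,
        "page_index": offset // page_size,      # which page the read starts in
        "offset_in_page": offset % page_size,   # where in that page it starts
    }


info = describe_read(6656, 512)
print(info)
```

Under that assumption the read is sector-aligned but lands 2560 bytes into the second page, which is the kind of non-page-aligned case sage suspects is mishandled.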
[7:16] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[8:41] * allsystemsarego (~allsystem@ has joined #ceph
[10:06] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[10:49] <failboat> so, while I was asleep my cluster did turn into a pumpkin
[10:53] <failboat> but it was my fault this time
[10:57] <failboat> and now client is stuck
[10:57] <failboat> echo b to the rescue
[10:59] * failboat (~stingray@stingr.net) Quit (Quit: fuck this shit, I'm getting the vuvuzela)
[11:09] * stingray (~stingray@stingr.net) has joined #ceph
[11:09] <stingray> [stingray@stingr ceph]$ ll
[11:09] <stingray> total 0
[11:09] <stingray> drwxr-xr-x. 1 stingray wheel 420854555388 Sep 28 12:48 mirrors
[11:09] <stingray> drwxr-sr-x. 1 stingray wheel 41130366397 Nov 3 17:28 test
[11:09] <stingray> drwxrwxr-x. 1 stingray wheel 379566618559 Dec 21 2009 video
[11:09] <stingray> [stingray@stingr ceph]$ ls mirrors/
[11:09] <stingray> ls: cannot open directory mirrors/: Permission denied
[11:09] <stingray> boom
[11:24] * fract (~lbz@fw1.aspsys.com) Quit (Ping timeout: 480 seconds)
[11:43] * julienhuang (~julienhua@pasteur.dedibox.netavenir.com) has joined #ceph
[11:43] <stingray> 2010-11-05 13:24:41.092523 7fc529618700 mds0.server missing 10000000542 #1/mirrors/rpmfusion (mine), will load later
[11:43] <stingray> 2010-11-05 13:24:41.092535 7fc529618700 mds0.server missing 10000000543 #1/mirrors/tigro (mine), will load later
[11:44] <stingray> dunno wtf is that but it looks like my problem
[12:10] * asllkj (~lbz@fw1.aspsys.com) has joined #ceph
[13:11] <stingray> mds/Dumper.cc:15:24: fatal error: mds/Dumper.h: No such file or directory
[13:11] <stingray> fffuuu
[13:22] <jantje_> sage: the backout of the commit you mentioned did not fix the bonnie++ problem
[14:04] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[14:35] * sentinel_e86 (~sentinel_@ Quit (Quit: sh** happened)
[14:37] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:39] * sentinel_e86 (~sentinel_@ Quit ()
[14:40] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:41] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[14:42] * sentinel_e86 (~sentinel_@ Quit ()
[14:44] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:46] * sentinel_e86 (~sentinel_@ Quit ()
[14:47] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:49] * sentinel_e86 (~sentinel_@ Quit ()
[14:51] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:53] * sentinel_e86 (~sentinel_@ Quit ()
[14:54] * sentinel_e86 (~sentinel_@ has joined #ceph
[14:57] * sentinel_e86 (~sentinel_@ Quit ()
[14:58] * sentinel_e86 (~sentinel_@ has joined #ceph
[15:00] * sentinel_e86 (~sentinel_@ Quit ()
[15:01] * sentinel_e86 (~sentinel_@ has joined #ceph
[15:12] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[15:12] * alexxy (~alexxy@ has joined #ceph
[15:13] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[15:22] * alexxy (~alexxy@ has joined #ceph
[15:24] <stingray> ls: cannot open directory ./test/asplodeth: Permission denied
[15:24] <stingray> ugh
[15:24] <stingray> maybe it's a client problem?
[15:40] <jantje_> what kernel version?
[15:40] <jantje_> I always try to use some recent version
[15:40] <jantje_> client / server are running 2.6.36
[15:41] <jantje_> and ceph is unstable from git
[15:47] * f4m8 is now known as f4m8_
[15:56] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[16:06] * Meths_ (rift@ has joined #ceph
[16:06] <stingray>
[16:13] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[16:14] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[16:14] * ghaskins_mobile (~ghaskins_@ Quit (Remote host closed the connection)
[16:18] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[16:24] <stingray> so, I kill all mds, then I start one, wait for it to settle, then mount&ls - same permission denieds
[16:24] <stingray> either data is corrupted or it's a client problem
[16:24] <stingray> or both
[16:24] <stingray> let's try newer kernel
[16:31] * greglap (~Adium@cpe-76-90-74-194.socal.res.rr.com) Quit (Quit: Leaving.)
[16:44] * Meths_ is now known as Meths
[16:50] * greglap (~Adium@ has joined #ceph
[16:53] * ghaskins_mobile (~ghaskins_@ Quit (Ping timeout: 480 seconds)
[16:54] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[16:59] <sagewk> stingray: if you add 'debug ms = 1' to the [mds] section and restart, or "ceph mds tell 0 injectargs '--debug-ms 1'" to the running mds, then reproduce the permission denied, you can see from the mds log if the error is coming from the mds or the client
[16:59] <sagewk> (if it's the MDS, you'll see the error message go by in the log)
[17:08] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[17:25] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:26] <stingray> 2010-11-05 19:26:01.973369 7fcd2136e700 -- <== client4647 11 ==== client_request(client4647:5 lookup #1/test) ==== 118+0+0 (3347128120 0 0) 0x7fccd8001390
[17:26] <stingray> 2010-11-05 19:26:01.973431 7fcd2136e700 -- --> -- client_reply(???:5 = 0 Success) v1 -- ?+0 0x7fcd18a21590
[17:26] <stingray> not very informative, I guess
[17:26] <sagewk> so the error is on the client side. are you root?
[17:26] <stingray> except for the "success" bit
[17:26] <stingray> sagewk: yep.
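
Note: the check sagewk describes (does the MDS log show an error reply or not?) can be scripted. The sketch below scans 'debug ms = 1' output for client_reply lines with a non-zero result. It assumes error replies use the same "tid = result text" layout as the Success line pasted above, with a negative errno; that format is inferred from the one sample in this log, not confirmed. failed_replies is a hypothetical helper.

```python
import re

# Matches e.g. "client_reply(???:5 = 0 Success)"; the text part is optional.
REPLY_RE = re.compile(r"client_reply\(\S+:(\d+) = (-?\d+)(?: ([^)]*))?\)")


def failed_replies(log_lines):
    """Return (tid, result, text) for every client_reply with a non-zero result."""
    failures = []
    for line in log_lines:
        m = REPLY_RE.search(line)
        if m and int(m.group(2)) != 0:
            failures.append((int(m.group(1)), int(m.group(2)), m.group(3) or ""))
    return failures
```

An empty result while the client still reports "Permission denied" points at the client side, which is the conclusion reached here.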
[17:27] <sagewk> it might be a garbled file mode. can you do 'csyn --syn walk' and see what the file mode looks like there?
[17:27] <stingray> maybe it's selinux - but I have label per mountpoint
[17:28] <stingray> 10000056b8a drwxr-xr-x 1 501 10 41130366397 Wed Nov 3 17:28:36 2010 /test
[17:28] <stingray> I'll try turning off selinux now
[17:28] <sagewk> hmm yeah
[17:28] <stingray> it's on the same vm as this chat so I have to reboot :)
[17:28] <stingray> brb
[17:29] * stingray (~stingray@stingr.net) Quit (Quit: fffuuu)
[17:32] * stingray (~stingray@stingr.net) has joined #ceph
[17:32] <stingray> guess what
[17:32] <stingray> it's selinux!
[17:32] <stingray> fffuuu
[17:32] <stingray> even with context per mountpoint
[17:32] <stingray> somewhere is leaking or is referencing something else
[17:32] <sagewk> weird. well we should figure out what is going wrong there at some point.
[17:32] <stingray> I'm going to kill myself, then someone
[17:32] <stingray> oh wait
[17:33] <stingray> the question now is, I guess, how I can make it work without turning selinux off
[17:36] <sagewk> right. i don't know much about selinux or the LSM stuff. someone needs to take a close look at which syscalls are doing EPERM and where they are returning from.
[17:37] <stingray> type=AVC msg=audit(1288899055.231:26227): avc: denied { 0x100000 } for pid=24221 comm="rsync" name="RPMS.classic" dev=ceph ino=1099511676364 scontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r
[17:37] <stingray> :user_home_t:s0 tclass=lnk_file
[17:37] <stingray> this is my audit
[17:40] <stingray> #============= unconfined_t ==============
[17:40] <stingray> # src="unconfined_t" tgt="user_home_t" class="file", perms="{ 0x400000 0x200000 }"
[17:40] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[17:40] <stingray> # comm="ls" exe="" path=""
[17:40] <stingray> allow unconfined_t user_home_t:file { 0x400000 0x200000 };
[17:40] <stingray> # src="unconfined_t" tgt="user_home_t" class="lnk_file", perms="{ 0x100000 0x400000 0x200000 }"
[17:41] <stingray> # comm="rsync" exe="" path=""
[17:41] <stingray> allow unconfined_t user_home_t:lnk_file { 0x100000 0x400000 0x200000 };
[17:41] <stingray> strange
[17:41] <stingray> I need dan walsh to interpret this
[17:41] <stingray> perms="{ 0x100000 0x400000 0x200000 }" <- wtf
[17:42] <sagewk> yeah sorry, meaningless to me :)
[17:43] * yehudasa_hm (~yehuda@ppp-69-228-129-75.dsl.irvnca.pacbell.net) has joined #ceph
[17:47] * julienhuang (~julienhua@pasteur.dedibox.netavenir.com) Quit (Ping timeout: 480 seconds)
[17:47] * julienhuang (~julienhua@pasteur.dedibox.netavenir.com) has joined #ceph
[17:54] * cmccabe (~cmccabe@dsl081-243-128.sfo1.dsl.speakeasy.net) has joined #ceph
[17:57] * julienhuang (~julienhua@pasteur.dedibox.netavenir.com) Quit (Ping timeout: 480 seconds)
[17:58] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[18:01] * greglap (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:09] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:14] * ghaskins_mobile (~ghaskins_@ Quit (Ping timeout: 480 seconds)
[18:18] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[18:19] <stingray> ok, so apparently those are not defined for files
[18:19] <jantje_> can it be that ./configure --without-gtk2 isn't working?
[18:20] <sagewk> it can
[18:20] <jantje_> I really dont want to install all those additional packages
[18:20] <jantje_> ok, i'll investigate ;p
[18:20] <cmccabe> hi jantje
[18:20] <cmccabe> what is the message you are getting from configure?
[18:21] <jantje_> ./configure: line 15531: syntax error near unexpected token `GTKMM,'
[18:21] <jantje_> ./configure: line 15531: ` PKG_CHECK_MODULES(GTKMM, gtkmm-2.4 >= 1.0.0,'
[18:22] <cmccabe> so there was a bug earlier where --without-gtk had no effect
[18:22] <cmccabe> what revision are you on
[18:22] <stingray> #define DIR__OPEN 0x00400000UL
[18:23] <sagewk> cmccabe: oh, i saw that one several times too before i had packages installed. i wasn't passing --without-gtk2 tho
[18:23] <cmccabe> jantje, sagewk: can you run automake --version
[18:24] <sagewk> 1.10.1
[18:24] <cmccabe> I'm on 1.11.1
[18:24] <cmccabe> I'll try uninstalling gtkmm and see if I can reproduce.
[18:24] <jantje_> cmccabe: i just updated my git tree and i'm on unstable
[18:24] <cmccabe> jantje: thanks
[18:24] <jantje_> automake (GNU automake) 1.11.1
[18:25] <cmccabe> jantje_: ah, that's my version
[18:26] <cmccabe> jantje_: so I just tried --with-gtk=no and it did not check for gtk2
[18:27] <cmccabe> jantje_: also tried --without-gtk2, same effect
[18:27] <jantje_> weird!
[18:27] <cmccabe> jantje_: after uninstalling gtkmm, I got:
[18:27] <cmccabe> > checking pkg-config is at least version 0.9.0... yes
[18:27] <cmccabe> > checking for GTKMM... no
[18:27] <cmccabe> > Sorry, a usable version of gtkmm was not found. We will build without it.
[18:27] <cmccabe> > configure: creating ./config.status
[18:28] <cmccabe> jantje_: I think we're going to have to play the automake versions game now
[18:31] * gregoryhaskins_ (~ghaskins_@ has joined #ceph
[18:31] <jantje_> ?
[18:32] <cmccabe> jantje_: I am trying to reproduce this on some machines we have
[18:32] * ghaskins_mobile (~ghaskins_@ Quit (Ping timeout: 480 seconds)
[18:35] <jantje_> write(1, "configure: WARNING: unrecognized"..., 53configure: WARNING: unrecognized options: --with-gtk
[18:36] <jantje_> did I update my git tree correctly?
[18:36] <cmccabe> jantje: I think you are passing --with-gtk rather than --with-gtk2
[18:36] <cmccabe> jantje: try running --help to see what the options are
[18:37] <yehudasa_hm> cmccabe: maybe we should accept both?
[18:37] <jantje_> I was trying to pass --with-gtk=no and --with-gtk2=no and --without and ...
[18:37] <jantje_> (because you just told me you tried --with-gtk=no )
[18:38] <cmccabe> jantje: try --with-gtk2=no and --without-gtk2
[18:40] <cmccabe> yehudasa_hm: there are some projects that do support both GTK and GTK2. I'm a little worried that having both options might confuse some users.
[18:41] <jantje_> cmccabe: same error with -both-
[18:41] <cmccabe> jantje: so you see "./configure: line 15531: syntax error near unexpected token `GTKMM,'"
[18:43] <jantje_> if test "x$try_with_gtk2" != "xno"; then :
[18:43] <jantje_> PKG_CHECK_MODULES(GTKMM, gtkmm-2.4 >= 1.0.0,
[18:43] <jantje_> AC_DEFINE([HAVE_GTK2], [1], [we have gtk2]),
[18:43] <jantje_> try_with_gtk2=no
[18:43] <jantje_> AC_MSG_RESULT([Sorry, a usable version of gtkmm was not found. We will build without it.]))
[18:43] <jantje_> fi
[18:44] <jantje_> I never get the result
[18:46] <cmccabe> jantje: ok, I just tried this on a system with automake 1.10.1. No errors.
[18:46] <cmccabe> jantje: what is the exact error message that you see
[18:46] <sagewk> i assume ./autogen.sh has been run recently?
[18:47] <wido> jantje_: you should install pkg-config
[18:48] <wido> that's the issue I had last week, when you have the latest unstable, it's added to the build dependency (Debian/Ubuntu)
[18:48] <cmccabe> wido: interesting! Perhaps our configure script should check for pkg-config
[18:49] <wido> cmccabe: yes :)
[18:49] <jantje_> i just removed my ./configure , how do I get it back? (cvs checkout equivalent for git?)
[18:49] <cmccabe> jantje: run ./autogen.sh
[18:50] <jantje_> oh, yes, stupid me :-)
[18:50] <cmccabe> jantje, wido: bingo. Without pkg-config, you get this error: syntax error near unexpected token `GTKMM,
[18:51] <jantje_> works !
[18:51] <jantje_> :-)
[18:52] <jantje_> thanks wido
[18:53] <cmccabe> wido: yeah, thanks. Now I just need to figure out how to make automake realize that it needs pkg-config.
[18:53] <jantje_> sagewk: fyi, the backout did not fix the bonnie++ problem
[18:53] <sagewk> which bonnie++ problem was that?
[18:55] <jantje_> Stat files in sequential order...Expected 16384 files but only got 0
[18:55] <jantje_> kernel client commit efa4c120
[18:55] <jantje_> (at least ... if my backout succeeded)
[18:55] * jantje_ has to go
[18:55] <sagewk> ok thanks
[18:56] <sagewk> was there a bug open for this?
[18:56] <jantje_> no, i just reported it to the mailinglist
[18:56] <sagewk> ok
[18:56] <jantje_> thanks
[18:57] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[19:00] <cmccabe> jantje: bye
[19:01] <sagewk> jantje_: http://tracker.newdream.net/issues/549 fyi
[19:11] <wido> cmccabe: can't you simply search for "pkg-config" in PATH?
[19:11] <cmccabe> wido: I've been investigating a bunch of different solutions. Apparently there are some macros that test the pkg-config version
[19:12] <cmccabe> wido: but, "helpfully", they're completely undefined if there is no pkg-config
[19:12] <cmccabe> wido: so, yeah, probably it will come down to trying to find that binary
[19:13] <wido> cmccabe: it's something I see in a lot of configure/m4 scripts, just searching for binaries
[19:14] <wido> when I was building phprados, I saw the same technique being used in a lot of configure scripts
[19:15] <wido> sagewk: I saw that Henry on the ml was getting a hang on a getattr too, could it be I hit the same issue?
[19:16] <yehudasa_hm> wido: getattr is usually just a symptom of some other mds problem
[19:17] <cmccabe> wido: I guess pkg-config is supposed to replace searching for binaries. But to find pkg-config itself you have to do something similar.
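
Note: the "search for the binary in PATH" idea wido suggests can be sketched as below. This is only an illustration of the technique in Python, not the fix cmccabe later committed (the log does not show it); missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(required=("pkg-config",)):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # Fail early with a readable message instead of a configure syntax error.
        raise SystemExit("missing build tools: " + ", ".join(missing))
```

Checking up front turns the confusing "syntax error near unexpected token `GTKMM,'" into a direct statement of what needs installing.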
[19:22] <wido> yehudasa_hm: ok, tnx
[19:26] <wido> something else I'm seeing: "libceph: mds0 (unknown sockaddr family 0) connect error", my MDS'es are running, but seem hung. logging is at 20, but nothing is being logged
[19:26] <wido> both cmds are running, but seem stalled or something, same goes for FS access
[19:27] <wido> a strace shows (on both): "futex(0x11e258c, FUTEX_WAIT_PRIVATE, 1, NULL"
[19:28] * gregoryhaskins_ (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[19:33] <cmccabe> wido: do you have a backtrace of the MDS'es
[19:34] <cmccabe> just attach with gdb -p <pid> and run "thread apply all bt"
[19:35] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[19:43] <wido> cmccabe: http://pastebin.com/BNMMA1BY
[19:43] <wido> load isn't high either
[19:45] <cmccabe> try doing kill -SIGSTOP <mds-pid>
[19:45] <cmccabe> followed by kill -SIGCONT <mds-pid>
[19:47] <cmccabe> wido: I know it sounds strange but I have seen a similar issue and would like to know if they're related
[19:47] <wido> yes, the MDS came alive again
[19:47] <wido> trying the other now
[19:48] <cmccabe> wido: ah!
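
Note: the two kill commands above can be wrapped in a small helper when the same nudge has to be repeated on several daemons. A minimal sketch; nudge is a hypothetical name, and the pause between the two signals is arbitrary since the log gives no timing.

```python
import os
import signal
import time


def nudge(pid, pause=0.1):
    """Send SIGSTOP, wait briefly, then send SIGCONT to `pid`."""
    os.kill(pid, signal.SIGSTOP)   # same as: kill -SIGSTOP <pid>
    time.sleep(pause)              # arbitrary short gap, an assumption
    os.kill(pid, signal.SIGCONT)   # same as: kill -SIGCONT <pid>
```

This only works around the hang; it says nothing about the underlying cause, which cmccabe suspects is a race in SimpleMessenger.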
[19:48] <wido> hmm, the second one isn't logging anything
[19:49] <wido> other MDS keeps saying: get_load no root, no load
[19:49] <wido> handle_mds_beacon up:standby seq 81 rtt 0.116031
[19:49] <wido> seems to be standby?
[19:51] <cmccabe> I haven't dealt with the mds stuff too much
[19:51] <cmccabe> but it would be interesting to know if the first MDS is usable again after the signals
[19:53] <cmccabe> basically I think there's a race condition in SimpleMessenger. It's possible you're experiencing that, or it could be something specific to the MDS
[19:54] <wido> cmccabe: I could give you remote access? Just send me your public key
[19:54] <cmccabe> wido: it would be best if sage looked at it. I work mostly with the OSD code.
[19:55] <wido> ok
[19:56] <wido> btw, after signaling the MDS, a monitor (I have three) crashed: http://pastebin.com/VPwy41qt and my dmesg is showing that "mds0" is working again, client seems to be reconnected
[19:57] <wido> FS is working again too
[19:58] <cmccabe> wido: looks like the monitor crashed decoding a message
[19:59] <cmccabe> wido: yeah, I think the logs are going to be very interesting for this one
[19:59] <wido> i'll run cdebugpack
[20:00] <wido> And open an issue
[20:00] <cmccabe> k
[20:03] <wido> doesn't seem to be related, monitor crashed 3 minutes before I ran the kill
[20:05] <cmccabe> the MDS message decoding crash is definitely some kind of bug
[20:06] <cmccabe> it is possible that there was another bug that was exposed once that MDS went down
[20:06] <wido> the MDS was already down for about 24 hours
[20:06] <cmccabe> anyway, no point in speculating too much about it. Sage will check the log and try to reconstruct what happened.
[20:06] <wido> but my monitor crashed at 19:43, while I ran the kill at 19:47
[20:07] <cmccabe> yeah, I am agreeing with you...
[20:07] <cmccabe> :)
[20:07] <wido> pure coincidence
[20:09] <cmccabe> wido: oh btw, I put in a fix for that pkg-config thing.
[20:09] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[20:09] <wido> yes, I noticed :)
[20:09] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[20:10] * ghaskins_mobile (~ghaskins_@ Quit (Remote host closed the connection)
[20:27] <wido> sagewk: If you got time later on, could you check the MDS issue? mds0 is running again after cmccabe's tip, but mds1 is still refusing to get back to life. If you look back, I posted a pastebin with a backtrace of the MDS.
[20:27] <wido> I have two clients running (running same kernels, latest master branch), one recovered, but the other one gave:
[20:27] <wido> libceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6818 protocol version mismatch, my 32 != server's 32
[20:38] * ghaskins_mobile (~ghaskins_@ has joined #ceph
[21:36] * ghaskins_mobile (~ghaskins_@ Quit (Quit: This computer has gone to sleep)
[21:58] <sagewk> cmccabe: heads up, we're going to have a meeting monday morning (via video skype for you i guess) to plan out the next two weeks
[21:58] <cmccabe> ok
[21:59] <cmccabe> I'll see what the camera situation is like over here
[21:59] <cmccabe> I do have a headset
[21:59] <stingray> Nov 5 23:58:57 cephclient kernel: [ 3350.648203] ceph: problem parsing dir contents -5
[21:59] <stingray> Nov 5 23:58:57 cephclient kernel: [ 3350.649508] ceph: mds parse_reply err -5
[21:59] <stingray> Nov 5 23:58:57 cephclient kernel: [ 3350.650753] ceph: mdsc_handle_reply got corrupt reply mds0
[21:59] <stingray> oops
[21:59] <stingray> fail
[22:10] * terang (~me@pool-173-55-24-140.lsanca.fios.verizon.net) has joined #ceph
[22:22] <stingray> looks like kernel oplocks = no is in order
[22:26] <wido> stingray: running the latest unstable?
[22:26] <wido> and which kernel?
[22:28] <wido> and fail? well, ceph is still in development :-)
[22:28] <stingray> I'm not complaining
[22:28] <stingray> I'm exploring various failure modes
[22:28] <stingray> cmccabe may be interested in this one:
[22:28] <stingray> osd/PG.cc: In function 'void PG::build_scrub_map(ScrubMap&)':
[22:28] <stingray> osd/PG.cc:2634: FAILED assert(r == 0)
[22:28] <stingray> ceph version 0.23~rc (commit:62716aa7c9a264c7a575bbccde0d8a7002563210)
[22:28] <stingray> 1: (PG::sub_op_scrub(MOSDSubOp*)+0x155) [0x547965]
[22:28] <stingray> 2: (OSD::dequeue_op(PG*)+0x374) [0x4eed34]
[22:28] <stingray> 3: (ThreadPool::worker()+0x16a) [0x60b27a]
[22:32] <cmccabe> hmm
[22:33] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:33] <cmccabe> is that all the backtrace?
[22:36] <cmccabe> also, do you have a core file for that one
[22:37] <cmccabe> stingray: if you can run gdb on the core file and verify that it crashed in PG::build_scrub_map that would be helpful
[22:41] <cmccabe> stingray: I glanced at the bug tracker and it looks like you found a new one. Definitely file it
[22:41] <sagewk> stingray: the scrubs should only trigger manually.. did you do 'ceph osd scrub' or something? (incidentally, that code is mostly rewritten, queued for 0.24)
[22:42] <sagewk> as for the parsing dir contents problem, i haven't seen that one before
[22:42] <stingray> cmccabe: yes I did manually tell it to scrub
[22:42] <sagewk> (i wouldn't worry about the scrub one for now)
[22:42] <cmccabe> sagewk: I didn't realize that wasn't merged yet
[22:42] <stingray> cmccabe: however it should scrub, not segfault, right :)
[22:43] <sagewk> not yet, will need a full cycle of testing.
[22:43] <sagewk> stingray: that's the idea :)
[22:43] <stingray> I'm installing the -debuginfo rpm and then will gdb core
[22:43] <cmccabe> stingray: I guess we should probably hold off on filing bugs on that until the rewritten scrub gets in
[22:43] <sagewk> yeah, don't sweat the scrub one for now
[22:44] <stingray> sagewk: I've set up samba and tried to write files to it from windows
[22:44] <stingray> sagewk: so I guessed it's the lock problem and the guess was correct as by turning kernel oplocks in samba off the corrupt replies are gone
[22:44] <stingray> I can paste the hexdump for you if you want
[22:44] <stingray> (the one that's in dmesg)
[22:44] <sagewk> weird. i'm not sure why the oplocks would affect things.
[22:45] <sagewk> oh, maybe samba is setting xattrs on files?
[22:45] <stingray> in general, yes, if you mount with xattr
[22:45] <stingray> I didn't
[22:45] <cmccabe> sagewk: samba was one of the original xattr users I think
[22:45] <sagewk> stingray: if you don't mind can you put it in a new bug at tracker.newdream.net? under the kernel client project
[22:46] <stingray> and, as I said, if I turn off locks it is gone
[22:46] <cmccabe> sagewk: also selinux puts xattrs on things. I'm not sure if anyone has run ceph under selinux yet...
[22:47] <sagewk> stingray tried and failed this morning
[22:47] <cmccabe> hehe
[22:47] <cmccabe> I'm not surprised... selinux tends to take a dim view of processes doing things unless someone writes a policy that lets them do them
[22:51] <stingray> my problem isn't the xattr one
[22:51] <stingray> I specifically mounted it as mountpoint labeling
[22:51] <stingray> however something goes wrong later
[22:52] <stingray> as it tries to acquire permissions for non-existent actions
[22:52] <stingray> for example for class file there's no permission with code 0x400000
[22:52] <stingray> I need someone with a big hammer to investigate, I guess
[22:52] <stingray> :)
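
Note: stingray's reading of the audit output is that the requested permission bits have no name in the target class, which would explain why they appear as raw hex. A small sketch of that lookup follows. Only the dir/open entry (DIR__OPEN 0x00400000) comes from this log; the other class tables are left empty on purpose rather than guessed, and decode_perms is a hypothetical helper.

```python
# Per-class permission names. Only DIR__OPEN is taken from the log above;
# the empty tables are placeholders, not a statement about real policy.
KNOWN_PERMS = {
    "dir": {0x00400000: "open"},
    "file": {},
    "lnk_file": {},
}


def decode_perms(tclass, bits):
    """Map each permission bit to its name, or to hex if the class has none."""
    table = KNOWN_PERMS.get(tclass, {})
    return [table.get(bit, hex(bit)) for bit in bits]


print(decode_perms("dir", [0x400000]))             # named for directories
print(decode_perms("file", [0x400000, 0x200000]))  # falls back to raw hex
```

With a table like this, a directory-class bit checked against a file-class object has nothing to resolve to, matching the "perms={ 0x400000 0x200000 }" lines in the audit2allow output.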
[22:52] <stingray> anyway
[22:52] <stingray> filed a bug
[22:53] <stingray> brb - need to watch new episode of Fringe :)
[22:57] <yehudasa_hm> sagewk: for the sys/class/rbd stuff, should we move all device specific related operations under the specific device?
[22:58] <sagewk> yeah
[22:58] <sagewk> i think so at least: echo 1 > /sys/class/rbd/0/refresh
[22:58] <yehudasa_hm> specifically there are all the snapshots related (create, rollback), but also there's remove of the device
[22:58] <sagewk> actually, maybe let's first write a doc for the proposed interface, and review that on the list
[22:59] <sagewk> before spending time coding something up that we'll need to change
[22:59] <yehudasa_hm> ahmm..
[23:01] <sagewk> what i think we should do is send a doc to ceph-devel and lkml saying i'm unsure the original interface got much attention, or that the osdblk approach scales well. here is what we think we should do instead, let's get it right before 2.6.37
[23:01] <yehudasa_hm> right
[23:01] <sagewk> then it's define, review, code. instead of code, document, review, repeat
[23:02] <yehudasa_hm> at this point I made it create the rbd/<dev> and under that you can see all the related device info
[23:07] <sagewk> yeah
[23:08] <sagewk> did you decide if there an existing inmemory per-snap object to piggyback the sysfs stuff onto?
[23:12] <yehudasa_hm> do you mean whether we should have a per-snap sysfs entity?
[23:12] <yehudasa_hm> or whether it's feasible?
[23:13] <sagewk> whether it's feasible.
[23:13] <sagewk> i guess the larger question is which is the more straightforward interface :)
[23:13] <yehudasa_hm> there's no readily available infrastructure but I don't think we should actually map it to the sysfs that way
[23:14] <sagewk> k
[23:21] <sagewk> yehudasa_hm: you think be in monday?
[23:21] <sagewk> er, do you think you'll be in monday?
[23:21] <yehudasa_hm> that's my goal, yeah
[23:22] <yehudasa_hm> I think it'll be ok by monday, though hopefully I haven't caught anything
[23:22] <sagewk> :) hopefully everyone is feeling better by then!
[23:23] <yehudasa_hm> yep!
[23:23] <cmccabe> yehudasa_hm: yeah, get well soon
[23:23] <yehudasa_hm> cmccabe: thanks! I'll pass it on to the affected entities
[23:24] <yehudasa_hm> (infected entities)
[23:25] <cmccabe> effectively, they're affected by infection
[23:25] <yehudasa_hm> heh

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.