#ceph IRC Log


IRC Log for 2010-07-28

Timestamps are in GMT/BST.

[0:18] <AnthonyOIT1> I'll try that out in a bit. I was just looking at my current crushmap and found out that it differs from the example on the wiki. It's not showing the individual hosts (I have two nodes, each storing 4 SCSI disks)
[0:19] <AnthonyOIT1> could it be because i'm using 0.20.2?
[0:19] <AnthonyOIT1> src
[0:19] <gregaf> oh, you've started up an osd for each disk
[0:20] <AnthonyOIT1> yes
[0:20] <AnthonyOIT1> was i not supposed to do that?
[0:20] <gregaf> you definitely can
[0:20] <gregaf> it's just that if you want to do that you'll actually need to work on your crushmap a little more, or you'll get "replicas" on the same box
[0:21] <gregaf> you'll need to modify your rules so that it selects at least one replica on each computer, but somebody else will need to help you with that
[0:38] <AnthonyOIT1> hmm
[0:39] <AnthonyOIT1> i found this when i was reading my crushmap
[0:39] <AnthonyOIT1> # types
[0:39] <AnthonyOIT1> type 0 device
[0:39] <AnthonyOIT1> type 1 domain
[0:39] <AnthonyOIT1> type 2 pool
[0:39] <AnthonyOIT1> it's missing the type # host
[0:39] <AnthonyOIT1> can i just add that in? (sorry for the newbie question)
[0:39] <AnthonyOIT1> i'm comparing it with the wiki example
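The concern gregaf raises above (replicas landing on the same box when every disk runs its own OSD) can be illustrated with a small sketch. This is not CRUSH and does not use crushmap syntax; it is a toy placement function under stated assumptions: the host names `node1`/`node2`, the OSD numbering, and the function `place_replicas` are all hypothetical, chosen only to match the "two nodes, four disks each" layout described in the conversation.

```python
from collections import defaultdict

# Hypothetical layout (assumption): two hosts, four OSDs each.
OSD_HOST = {
    0: "node1", 1: "node1", 2: "node1", 3: "node1",
    4: "node2", 5: "node2", 6: "node2", 7: "node2",
}


def place_replicas(object_id, replicas, osd_host=OSD_HOST):
    """Pick `replicas` OSDs for an object, never two on the same host.

    Toy stand-in for "first choose distinct hosts, then one device per host".
    """
    # Group the OSDs by the host they live on.
    by_host = defaultdict(list)
    for osd, host in sorted(osd_host.items()):
        by_host[host].append(osd)

    hosts = sorted(by_host)
    if replicas > len(hosts):
        raise ValueError("not enough hosts for that many replicas")

    chosen = []
    for i in range(replicas):
        # Step 1: walk to a different host for every replica.
        host = hosts[(object_id + i) % len(hosts)]
        # Step 2: pick one device inside that host.
        osds = by_host[host]
        chosen.append(osds[object_id % len(osds)])
    return chosen


if __name__ == "__main__":
    print(place_replicas(5, 2))  # prints [5, 1]: one OSD on node2, one on node1
```

Picking the host first and the device second is the point of the sketch: a placement that only ever chooses among the eight devices has no way to know that OSDs 0 to 3 share a machine.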
[1:54] * AnthonyOIT1 (~anthony@dhcp-v025-008.mobile.uci.edu) Quit (Quit: Leaving)
[2:16] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[2:46] <MarkN> some of my cmons are crashing after being up for a couple of seconds http://pastebin.org/423649
[2:47] <MarkN> i am running unstable from about an hour ago
[4:16] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[4:21] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) has joined #ceph
[4:44] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[6:35] * MarkN1 (~nathan@ has joined #ceph
[6:40] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[6:42] * bbigras (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[6:43] * bbigras is now known as Guest44
[6:47] * Guest494 (~bbigras@bas11-montreal02-1128535472.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[7:11] * f4m8 (~drehmomen@lug-owl.de) has left #ceph
[7:12] * f4m8 (~drehmomen@lug-owl.de) has joined #ceph
[8:06] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) has joined #ceph
[8:10] * Guest44 (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[8:20] * bbigras (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[8:21] * bbigras is now known as Guest47
[8:26] * Guest47 (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Remote host closed the connection)
[8:28] * bbigras_ (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[8:43] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[8:54] * allsystemsarego (~allsystem@ has joined #ceph
[9:01] * bbigras__ (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[9:04] <wido> MarkN1: could you create an issue for this in the tracker?
[9:04] <wido> devs need the coredump, logfile + binary to backtrace where this came from
[9:06] <MarkN1> sure, will do tomorrow when I am in at work
[9:06] * bbigras_ (~bbigras@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Ping timeout: 480 seconds)
[9:06] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) Quit (Ping timeout: 480 seconds)
[9:08] <wido> ok, great
[9:09] <wido> corefiles are pretty large most of the time
[9:09] <wido> so if you have some webhosting somewhere?
[9:09] <MarkN1> i should manage i think :)
[9:09] <wido> great :)
[9:10] <wido> just curious, did your mon run out of diskspace?
[9:10] * akhurana (~ak2@c-98-232-30-233.hsd1.wa.comcast.net) Quit (Quit: akhurana)
[9:11] <MarkN1> no, plenty of room available on all my drives
[9:11] <wido> ok, i had a similar crash when my mon ran out of diskspace and some files got a zero filesize
[9:12] <MarkN1> hmm, i may test just to confirm disc space is OK
[10:54] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[11:48] * wido_ (~wido@fubar.widodh.nl) has joined #ceph
[11:48] * pruby (~tim@leibniz.catalyst.net.nz) Quit (reticulum.oftc.net kilo.oftc.net)
[11:48] * andret (~andre@pcandre.nine.ch) Quit (reticulum.oftc.net kilo.oftc.net)
[11:48] * wido (~wido@fubar.widodh.nl) Quit (reticulum.oftc.net kilo.oftc.net)
[11:59] * andret (~andre@pcandre.nine.ch) has joined #ceph
[12:01] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[13:56] * Osso (osso@AMontsouris-755-1-7-241.w86-212.abo.wanadoo.fr) has joined #ceph
[15:10] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Quit: Verlassend)
[15:41] * deksai (~chris@71-13-57-82.dhcp.bycy.mi.charter.com) has joined #ceph
[16:15] * deksai (~chris@71-13-57-82.dhcp.bycy.mi.charter.com) Quit (Quit: Leaving.)
[16:46] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Ping timeout: 480 seconds)
[16:48] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[17:13] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[19:09] <wido_> yehudasa: i've just submitted a patch in #313. But there is one piece of output which i can't find, see: http://www.pastebin.org/425717 There you see three lines with: #
[19:09] <wido_> user.rgw.acl, #
[19:09] <wido_> #
[19:09] <wido_> user.rgw.content_type and user.rgw.etag
[19:09] <wido_> somewhere the attribute names seem to be printed, but i really can't find where this is done.
[19:10] <yehudasa> i'll search for it
[19:11] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) has left #ceph
[19:12] <wido_> great
[19:13] * wido_ is now known as wido
[19:33] <sagewk> wido: on #312.. do you have older rotated copies of the mds logs?
[19:34] <sagewk> nm, found it
[19:37] <yehudasa> wido: there's some linkage problem with your patch
[19:39] <wido> oh, what?
[19:39] <wido> it's against unstable
[19:39] <yehudasa> getting rgw_log_level undefined
[19:40] <wido> could you pastebin it?
[19:40] <yehudasa> something about C vs C++ extern declaration probably
[19:40] <yehudasa> http://pastebin.org/425783
[19:41] <yehudasa> adding extern "C" to it would probably fix it
[19:42] <wido> i didn't get that error, weird?
[19:42] <wido> compiles and runs fine
[19:42] <yehudasa> which compiler are you using?
[19:42] <wido> just using: make radosgw
[19:42] <wido> uses g++
[19:43] <yehudasa> oh
[19:43] <yehudasa> make radosgw actually works
[19:43] <yehudasa> oh.. forget it
[19:43] <yehudasa> it's radosgw_admin that fails
[19:43] <yehudasa> so probably need to add this rgw_log_level to a new rgw_common.cc
[19:44] <yehudasa> we actually already have rgw_common.cc
[19:44] <wido> yes, but radosgw_admin doesn't need to use this
[19:45] <yehudasa> it links against rgw_acl.cc so it does need it
[19:45] <yehudasa> even if it doesn't use it
[19:45] <wido> yes, i see now. Didn't think about that
[19:45] <yehudasa> anyway, rgw_common is the place to put it anyway
[19:46] <wido> won't be a problem
[19:47] <yehudasa> yeah, that fixes it
[19:50] <wido> btw, the shutdown of the gateway doesn't seem to be fixed fully. Are u using a s3gw.fcgi script?
[19:50] <yehudasa> yeah
[19:50] <wido> i just symlinked /var/www/s3gw.fcgi to /usr/bin/radosgw, that works better, shutdown is clean and fast now
[19:50] <wido> with the script in between sometimes only the script exits, but the gw keeps running
[19:51] <yehudasa> hmm.. yeah, well apache doesn't really know which process to shut down
[19:51] <yehudasa> so it just kills the script
[19:52] <yehudasa> that explains why I didn't get a certain signal
[19:52] <wido> there is no benefit (only the config patch) of running with the wrapper in between
[19:52] <wido> patch = path
[19:52] <wido> we could even make the config file an environment variable too :)
[19:54] <yehudasa> not sure if it'd pass suexec if you just symlink
[19:54] <wido> oh, i'm not using suexec, just running under www-data
[19:55] <wido> but suexec doesn't take symlinks if i remember right, so that's where you are right
[20:15] <yehudasa> wido: I pushed your commit, also fixed that extra logging that came from librados
[20:16] <yehudasa> oh, and I increased the default log level to 20
[20:17] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[20:17] <yehudasa> would make it easier getting feedback from users, you can set it to whatever you want
[20:18] <wido> ah, cool!
[20:19] <wido> i found some more code cleanups, like SERVER_NAME in rgw_common.h and CGI_PRINTF being declared twice
[20:20] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) Quit ()
[21:06] <wido> gregaf: i'm building right now with your commit for #312, but since this "band-aids" the crash, how could i notice if it happens again?
[21:07] <gregaf> your central log will get that message I mention
[21:07] <wido> "MetaBlob.replay FIXME had dentry linked to wrong inode"
[21:07] <gregaf> yeah
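Watching for that message can be scripted. The sketch below is a minimal example, assuming the central log is a plain-text file readable line by line; the function name `find_warnings` is hypothetical, and the log's location is deliberately left to the caller because the conversation does not state it. Only the message text itself comes from the discussion above.

```python
# Message text quoted in the conversation above.
MDS_WARNING = "MetaBlob.replay FIXME had dentry linked to wrong inode"


def find_warnings(lines, needle=MDS_WARNING):
    """Return (line_number, text) pairs for every line containing `needle`.

    `lines` can be any iterable of strings, for example an open file object.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]
```

Passing an open file handle for the central log to `find_warnings` returns an empty list when the band-aided condition has not recurred, and the matching line numbers and text when it has.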
[21:07] <wido> ok, i'll keep an eye out for it. But in my experience, i only experience crashes once...
[21:07] <gregaf> we're not sure how the dentry and inode got linked incorrectly
[21:08] <wido> do you know on which OSD it was?
[21:08] <gregaf> but there's a fairly simple fix which was already in place, it just wasn't careful about cleaning up the "wrong" inode and that caused the crash
[21:09] <gregaf> it wasn't an OSD problem, it was on the MDSes
[21:09] <wido> true, thought it might be a file which got corrupted on the OSD, which then caused the MDS to crash, but nvm. Building right now
[21:10] <gregaf> yeah, probably not
[21:19] * mtk (~mtk@host251.diamondbackcap.com) has joined #ceph
[21:35] * eternaleye (~quassel@184-76-53-210.war.clearwire-wmx.net) has joined #ceph
[21:40] * mtk (~mtk@host251.diamondbackcap.com) Quit (Quit: Leaving)
[21:41] <wido> gregaf: i'm seeing an OSD merge_log crash right now after the last commit, this can't be related to the MDS crash i assume? If not, i'll then open a new issue for it
[21:44] <gregaf> umm, I don't know
[21:44] <gregaf> can you pastebin the end of the log?
[21:46] <wido> gregaf: http://www.pastebin.org/426122
[21:46] <wido> i'm actually seeing two crashes, different OSDs though
[21:49] <gregaf> better give us the other one
[21:50] <gregaf> Sage says this assert is just wrong, he's about to push a fix
[21:51] <wido> ok, the next one, seeing it on 2 OSDs now: http://www.pastebin.org/426140
[21:53] <wido> as you can see, the related files are on logger in "/srv/ceph/issues/osd_crash_sub_op_pull"
[21:55] <sagewk> is that the current cluster state?
[21:56] <wido> yes, it is
[21:56] <wido> just wanted to upgrade to the latest unstable for the MDS crash
[21:56] <wido> osd24 and osd25 are down due to a kernel panic (btrfs bug), can't reboot the machine now
[21:57] <sagewk> do you have oops msg for those btrfs errors, btw?
[21:59] <wido> yes: http://www.pastebin.org/426149 those are the last lines in my remote syslog, i assume it got a panic afterwards since it's down
[22:01] <wido> i'm going afk, i see you're in already (logger & node08). ttyl!
[22:02] <wido> btw, osd2 is running fine now (merge log crash)
[23:07] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) has joined #ceph
[23:09] * deksai (~chris@dsl093-003-018.det1.dsl.speakeasy.net) Quit ()
[23:28] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.