#ceph IRC Log


IRC Log for 2010-12-03

Timestamps are in GMT/BST.

[0:02] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) has joined #ceph
[0:14] * johnl (~johnl@cpc3-brad19-2-0-cust563.barn.cable.virginmedia.com) Quit (Quit: bye)
[0:14] * johnl (~ubuntu@ has joined #ceph
[0:37] * ajnelson (~Adium@dhcp-128-22.cruznetsecure.ucsc.edu) Quit (Quit: Leaving.)
[1:08] <johnl> hrm, another crasher
[1:08] <johnl> upsetting all my osds!
[1:09] <johnl> changed my crushmap on a cluster in a major way, waited for several hours watching "degraded" drop
[1:09] <johnl> (I assumed degraded meant "data not where it should be")
[1:10] <cmccabe> degraded PGs are operating without as many OSDs as they would like
[1:10] <johnl> k. well, finally dropped very low (0.8% or so)
[1:10] <johnl> got lots of pools with about 10k objects each
[1:11] <johnl> "rados df" showed a bunch of them had degraded objects
[1:11] <johnl> but also "unfound"
[1:11] <johnl> assuming that means lost (since all my osds were up)
[1:12] <cmccabe> unfound means it doesn't know where to find it
[1:12] <cmccabe> lost means that you have requested it to give up looking
[1:13] <johnl> k. didn't look like it was going to find them. was stuck at 0.8 degraded and not changing
[1:13] <johnl> so I decided to delete the pools with the degraded and unfound objects
[1:13] <johnl> at which point my osds started stopping
[1:14] <johnl> no crashes logged. no backtraces.
[1:14] <cmccabe> how many pools did you have in total?
[1:14] <johnl> about 500
[1:14] <johnl> deleted 10 or 20
[1:15] <johnl> each pool had 10k-ish objects
[1:15] <cmccabe> gregaf or someone might contradict me on this, but I don't think pool deletion is very well tested
[1:15] <gregaf> well, it was working at one point...
[1:16] <cmccabe> so what do you mean the osds started stopping
[1:16] <gregaf> it wouldn't surprise me if it didn't work when the PGs weren't settled or something though
[1:16] <cmccabe> I can see you wrote that there were no logs, but what about core files?
[1:16] <johnl> cmccabe: the processes are no longer running. the other osds decided they're failed.
[1:16] <johnl> but there is nothing in the logs about a crash, which there is usually
[1:17] <johnl> started them back up again and they rejoined fine.
[1:17] <johnl> down to 0.2% degraded now, presumably due to less of the pools existing :)
[1:18] <johnl> I'll put all the osds in debug and see if I can reproduce
[1:19] <cmccabe> johnl: the first thing to do is to set your /proc/sys/kernel/core_pattern
[1:19] <cmccabe> and use ulimit -c unlimited to make sure core dumps are enabled
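The setup cmccabe describes can be sketched as a couple of shell commands. This is a minimal sketch, not what was actually run in the channel: the `/var/crash` path and the `core.%e.%p` pattern are example choices, and writing `core_pattern` needs root, so that step is shown as a comment.

```shell
# Enable core dumps for a daemon like cosd (a sketch; path/pattern are
# examples). Writing core_pattern requires root, e.g.:
#   echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern
# (%e = executable name, %p = pid, per core(5))
ulimit -c unlimited   # lift the per-process core-size limit
ulimit -c             # confirm: should print "unlimited"
```

Note that `ulimit -c` only affects the current shell and its children, so it has to be applied in the environment the daemon is started from.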
[1:20] <johnl> ta.
[1:20] <johnl> mssh rocks :)
[1:22] <cmccabe> I've used dsh a lot in the past too
[1:22] <cmccabe> I have no idea how they compare
[1:24] <bchrisman> looks like dsh uses a machines.list file that's probably the same used by mpirun and such?
[1:24] <bchrisman> I haven't used either before.
[1:25] <johnl> mssh doesn't seem to have any session config. so I have to specify the servers each time I run it. will checkout dsh
[1:26] <johnl> right, got a stopped cosd
[1:26] <johnl> a debug log. and weird looking core files
[1:26] <johnl> monmap.pid and osdmap.pid
[1:26] <cmccabe> is there a backtrace?
[1:26] <johnl> no backtrace no.
[1:26] <johnl> last line in osd log is 2010-12-03 00:24:05.864579 7fa1bc364710 osd2 5106 _remove_pg 386.5 removing final
[1:27] <johnl> I'll file a bug and attach the files
[1:27] <cmccabe> if it's not too much trouble
[1:27] <cmccabe> can you run gdb and get backtraces from the threads
[1:27] <cmccabe> should just be
[1:27] <cmccabe> $ gdb ./cosd
[1:27] <cmccabe> (gdb) core <core-path>
[1:27] <cmccabe> (gdb) thread apply all bt
[1:27] <johnl> not too much trouble. easier for me to do this than me try fix the problem :)
[1:28] <johnl> will do.
[1:28] <cmccabe> we usually don't have too much luck decoding core files sent by users-- library mismatches, compiler version skew, etc. make it fairly unlikely to work
[1:30] <johnl> gah sorry, I'm being a clown. these aren't core files. they're something else from 2 days ago
[1:31] <johnl> just from roughly the same time at night, in the same dir as I set core files to go in, with a pid as the extension!
[1:31] <johnl> so, um, kinda looks like cosd didn't crash.
[1:31] <cmccabe> the ctime should match
[1:31] <johnl> it just exited.
[1:31] <johnl> yeah, no core files, sorry, my mistake.
[1:32] <johnl> I'll run all the osds from gdb...
[1:33] <gregaf> are they exiting simultaneously?
[1:33] <gregaf> like I said this isn't well-tested, so maybe we do have some in-memory arrays or something that aren't behaving well
[1:33] <johnl> just one or two exiting
[1:34] <johnl> different ones
[1:34] <johnl> I'm only deleting pools with degraded or unfound objects. an osd exits every time I do it (so far)
[1:34] <cmccabe> gregaf: if it were arrays, he'd be getting a SIGSEGV and a core dump.
[1:35] <gregaf> cmccabe: I was thinking OOM
[1:35] <cmccabe> gregaf: what signal does OOM send anyway?
[1:36] <gregaf> it doesn't
[1:36] <gregaf> either you get a NULL returned from malloc, or the OOM-killer runs and kills your program silently
[1:36] <gregaf> I think I've only actually seen the second option
[1:38] <cmccabe> gregaf: looks like OOM sends SIGKILL
[1:38] <cmccabe> gregaf: but I don't think you can handle sigkill really at all, so same thing really
[1:38] <gregaf> oh, really?
[1:38] <gregaf> didn't know it even told you, but yeah I don't think you can do jack with it
[1:38] <cmccabe> http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2010-01/msg00110.html
[1:39] <gregaf> it stops executing and there's no clue why in the logs
[1:39] <gregaf> I've seen it a few times with runaway loops or whatever
[1:39] <cmccabe> I think sigkill can't be caught, so yeah, you can't do much.
[1:40] <cmccabe> one thing we might consider doing is something like this:
[1:40] <cmccabe> ( ./cosd || logger "exited with status $?" ) &
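The wrapper cmccabe sketches can be written as a small function. In this sketch, `false` stands in for the real `./cosd` binary and `echo` stands in for logger(1), so the behavior is visible without touching syslog; `run_and_report` is a made-up name for illustration.

```shell
# Run a command; if it exits non-zero, report the status. In production
# the echo would be `logger -t cosd "..."` so the line lands in syslog.
run_and_report() {
    # $? is expanded after "$@" has failed, so it holds that exit status
    "$@" || echo "$1 exited with status $?"
}

run_and_report false    # failing stand-in for ./cosd; reports status 1
```

The `||` is what makes `$?` safe here: the right-hand side only runs after the command has failed, so the expansion sees the daemon's exit status.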
[1:41] <johnl> ah bum. I'm all setup with gdb on all nodes now, but I'm out of degraded pools!
[1:42] <johnl> no ooms anyway.
[1:43] <cmccabe> yeah, the syslog should have it
[1:43] <cmccabe> do you see anything from dmesg | grep -i "out of memory"
[1:44] <cmccabe> johnl: if you manually kill a cosd process somewhere, some PGs will become degraded.
[1:44] <johnl> I checked kernel log, no ooms on any box.
[1:45] <cmccabe> johnl: you could also try attaching to the cosd processes with gdb while they're running using gdb -p `pidof cosd`
[1:45] <cmccabe> johnl: then hit c to continue
[1:45] <cmccabe> johnl: although their behavior isn't really the same under gdb in many cases
[1:46] <johnl> cmccabe: I've started them from gdb, using gdb --args cosd -D -c ...
[1:46] <johnl> just trying to reproduce the exit/crash now
[1:46] <cmccabe> johnl: oh yeah, that's probably best
[1:48] <johnl> killed one cosd process, waited for it to become degraded, then deleted 3 pools with degraded objects... no exits
[1:48] <johnl> not got any "unfound" objects any more though
[1:49] <johnl> whoop! segfaults!
[1:49] <johnl> two cosd processes segfaulted just after that same log line
[1:50] <cmccabe> backtrace?
[1:50] <johnl> thread apply all bt ?
[1:50] <cmccabe> yeah
[1:50] <cmccabe> also regular bt probably would be a good start
[1:50] <cmccabe> is there a bug number yet?
[1:52] <bchrisman> I'm compiling a kernel with the ceph option (this is the client included in the kernel, I presume). Is that the best version.. has it changed in .37-rc4?
[1:52] <johnl> got like 112 threads. no bug number yet, I'll create one now.
[1:54] <gregaf> bchrisman: yehuda says the kclient hasn't changed too much since then
[1:56] <bchrisman> thanks
[1:57] <yehudasa> bchrisman: most of the .37 changes are code refactoring, but nothing really interesting in terms of actual functionality of the ceph fs
[2:02] <johnl> cmccabe: ticket #629
[2:03] <bchrisman> cool
[2:05] <johnl> right, I need to get to bed. nnight.
[2:07] <cmccabe> johnl: good night
[3:01] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[3:26] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:28] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[3:58] * sjust (~sam@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[4:16] * pruby (~tim@leibniz.catalyst.net.nz) Quit (Ping timeout: 480 seconds)
[4:19] * pruby (~tim@leibniz.catalyst.net.nz) has joined #ceph
[4:42] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[5:09] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:26] * ajnelson (~Adium@dhcp-128-22.cruznetsecure.ucsc.edu) has joined #ceph
[5:32] * ajnelson (~Adium@dhcp-128-22.cruznetsecure.ucsc.edu) has left #ceph
[5:34] * greglap (~Adium@ has joined #ceph
[5:37] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[6:52] * ijuz_ (~ijuz@p4FFF5EB0.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[6:57] * greglap (~Adium@ Quit (Quit: Leaving.)
[7:00] * greglap (~Adium@ has joined #ceph
[7:01] * ijuz_ (~ijuz@p579997FB.dip.t-dialin.net) has joined #ceph
[8:03] * atg (~atg@please.dont.hacktheinter.net) Quit (Remote host closed the connection)
[8:03] * atg (~atg@please.dont.hacktheinter.net) has joined #ceph
[8:08] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[8:50] * todinini (tuxadero@kudu.in-berlin.de) Quit (Remote host closed the connection)
[9:27] * ijuz_ (~ijuz@p579997FB.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[9:52] * todinini (tuxadero@kudu.in-berlin.de) has joined #ceph
[10:03] <wido> johnl: Since you're interested in RADOS
[10:04] <wido> isn't phprados something? Or rados4j?
[10:07] * allsystemsarego (~allsystem@ has joined #ceph
[10:17] * Yoric (~David@ has joined #ceph
[10:53] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[11:28] <johnl> hi wido. phprados is definitely something. what do you mean?
[11:50] * Meths_ (rift@ has joined #ceph
[11:54] * Meths (rift@ Quit (Read error: Operation timed out)
[13:16] * Yoric_ (~David@ has joined #ceph
[13:16] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[13:16] * Yoric_ is now known as Yoric
[15:11] * Meths_ is now known as Meths
[15:49] * f4m8 is now known as f4m8_
[16:53] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Read error: Operation timed out)
[16:56] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[17:04] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[17:05] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[17:21] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[17:23] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[17:35] * greglap (~Adium@ Quit (Quit: Leaving.)
[17:43] * sentinel_e86 (~sentinel_@ Quit (Remote host closed the connection)
[17:44] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[17:47] * sentinel_e86 (~sentinel_@ has joined #ceph
[17:50] * MarkN (~nathan@ has joined #ceph
[17:50] * greglap (~Adium@ has joined #ceph
[17:53] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:14] * sentinel_e86 (~sentinel_@ Quit (Remote host closed the connection)
[18:14] * greglap (~Adium@ Quit (Read error: Connection reset by peer)
[18:18] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:19] * sentinel_e86 (~sentinel_@ has joined #ceph
[18:31] <sagewk> johnl: there have been a handful of bug fixes merged for .37, including a revert of a bad change that made it into .36. i would use a .37 rc if possible.
[18:33] * sentinel_e86 (~sentinel_@ Quit (Remote host closed the connection)
[18:35] * sentinel_e86 (~sentinel_@ has joined #ceph
[18:36] * failboat (~stingray@stingr.net) Quit (Remote host closed the connection)
[18:36] * stingray (~stingray@stingr.net) has joined #ceph
[18:38] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:50] * Yoric (~David@ Quit (Quit: Yoric)
[18:56] * sjust (~sam@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:15] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[19:23] * morse (~morse@supercomputing.univpm.it) Quit (Ping timeout: 480 seconds)
[19:27] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[19:28] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:25] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:31] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:35] <wido> johnl: Since you are using RADOS that much, I thought i'd promote my PHP extension ;)
[20:36] <wido> I see a lot of potential in RADOS too, playing with it much more than with the filesystem
[21:57] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[21:58] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[21:59] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[22:32] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[22:38] <wido> sagewk: Did you get my e-mail yesterday?
[22:41] <wido> http://pastebin.com/DDbDdsMY could that be due to the disk being at 945?
[22:42] <wido> 94%?
[22:42] <wido> "ENOSPC handling not implemented"
[22:53] <cmccabe> wido: sage is in meetings today
[22:54] <wido> cmccabe: ok, np!
[22:54] <wido> Do you know what goes wrong in the bt? I've just switched my cluster from unstable (2 days ago) to RC, got that crash on a OSD
[22:54] <wido> Note, the disk is at 94%
[22:55] <cmccabe> wido: what's going on here is that your disk is too full, and it's causing us to abort in the object store code
[22:55] <cmccabe> wido: we're looking at some of these issues for future releases
[22:56] <cmccabe> wido: by disk, I mean local OSD disk
[22:57] <wido> cmccabe: I thought so! But then I won't create a issue for it
[23:01] <cmccabe> wido: yeah, it's a known issue
[23:02] <cmccabe> wido: I kind of wish that it was possible to rebalance OSDs while one of them was in the ENOSPC state
[23:02] <wido> cmccabe: no problem! But then I know what it is
[23:02] <wido> cure for now, make sure all the OSD's are the same size
[23:02] <cmccabe> wido: I'm not sure though, Sage might have something in the roadmap that addresses that, that I don't know of
[23:02] <wido> or make sure your crushmap is setup right from the beginning
[23:04] <cmccabe> wido: yeah, it's avoidable. But it would still be a nifty feature to have a way to fix it easier
[23:05] <wido> I'm not sure what the problem is, but make it possible to make a "temp" dir, where the objector can do it's work
[23:05] <wido> so the OSD can rebalance
[23:06] <cmccabe> wido: one approach that I've seen in the past is to allocate some space on disk... a small amount, like a megabyte or so
[23:07] <cmccabe> wido: then when you hit the first ENOSPC, you can free that and have some breathing room for doing rebalancing
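The reserve-then-free approach cmccabe describes can be sketched in shell. This is purely illustrative: the ballast path and the 1 MB size are made up, and a real implementation would live inside the OSD's ENOSPC handling rather than in a script.

```shell
# At OSD startup: pre-allocate a small "ballast" file so some space is
# held in reserve (path and size are illustrative).
ballast=/tmp/osd.ballast
dd if=/dev/zero of="$ballast" bs=1024 count=1024 2>/dev/null   # ~1 MB

# ... later, on the first ENOSPC: free the ballast to get breathing
# room for rebalancing before the disk is truly full.
rm -f "$ballast"
```

The point of the trick is that `rm` needs no free space to succeed, so the reserve can always be reclaimed even on a 100%-full filesystem.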
[23:07] <cmccabe> wido: oh, I wanted to ask you
[23:07] <wido> cmccabe: I get it
[23:07] <cmccabe> wido: so you'd like to see logs go to syslog... in addition to being visible with ceph -w, or instead of
[23:08] <cmccabe> wido: just out of curiousity... I think we can support both
[23:08] <wido> Oh, no, I would like the daemon logs to go to syslog
[23:08] <wido> In large envs you can then use remote syslog servers, so you can gather all your logs on one machine
[23:08] <wido> Turning on debugging won't eat disk space on the local machine then, nor needing you to log on to the machine
[23:09] <cmccabe> wido: yeah
[23:09] <wido> so debug osd, debug mds, etc, would go to syslog
[23:10] <wido> And you could then use something like a loganalyzer, which searches for "dangerous" lines
[23:10] <wido> there are such programs, like logwatch
[23:11] <wido> But I'd like to run my OSD's with just a simple small disk for the OS
[23:11] <wido> then there wouldn't be diskspace for a lot of logging
[23:11] <cmccabe> wido: on a semi-related note, we have this command called ./ceph health that is intended to be used by nagios
[23:12] <wido> cmccabe: Yes, ofcourse :-) But I mean, there could be some cephx messages which end up in your logfiles, or a OSD which starts complaining about specific things
[23:12] <wido> I'd like to monitor my logs for such things
[23:12] <cmccabe> wido: yeah, log analysis would probably be pretty interesting
[23:12] <wido> but the major advantage, have your logs on on place, let syslog rotate the logs, no need to SIGHUP the daemon
[23:12] <wido> nor having the daemon stall when a disk fills up in your logger
[23:14] <cmccabe> wido: yeah
[23:15] <wido> I would like to have a remote syslog machine with a few TB of diskspace, just have my cluster running on max debug, without performance impact on the machines
[23:15] <wido> just gather it all on one place and process / read it there
[23:16] <cmccabe> wido: well, sending syslogs over the network will always have some impact
[23:16] <cmccabe> wido: but probably less than an NFS-mounted log directory
[23:16] <wido> yes, indeed
[23:16] <wido> and syslog is designed for sending log messages, it's UDP, so async
[23:17] <wido> and when the syslog goes down, you don't stall, your logs simply vanish
[23:17] <cmccabe> wido: yeah, that's what syslog is designed to do
[23:17] <wido> indeed
[23:18] <wido> btw, why did you choose to implement native syslog and not make a pipe?
[23:19] <wido> there is the "logger" executable on Linux
[23:19] <wido> really easy, I use it for Apache, sending my logs to syslog
[23:19] <wido> CustomLog "|/usr/bin/logger -t httpd" combined
[23:20] <cmccabe> I think the pipe option should probably be in addition to the syslog option
[23:20] <cmccabe> because it's slightly more complex for users to setup, for one thing
[23:20] <wido> Imho, when you're able to setup Ceph, you know how to get a pipe working
[23:20] <cmccabe> for another thing, although I haven't thought about it too deeply, I think it probably involves an additional context switch-- at least if the pipe is synchronous
[23:21] <wido> ok, that's not something I know about
[23:21] <wido> I see there is a syslog branch, is it ready for testing yet?
[23:21] <cmccabe> wido: no, it's still ongoing
[23:22] <cmccabe> wido: I'll continue on it until we hit another RC issue though :)
[23:22] <wido> Ok, then i'll hunt for issues with the RC branch
[23:22] <wido> ;)
[23:22] <cmccabe> wido: or until Fred sends the logs for 590 :)
[23:24] <wido> ok, RC seems to run fine here now, got some VM's running on it
[23:25] <wido> but i'm sure I'll hit something soon
[23:25] <wido> but, enough for today, going afk!
[23:25] <wido> cmccabe: ttyl!
[23:26] <cmccabe> wido: bye!

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.