#ceph IRC Log


IRC Log for 2011-02-01

Timestamps are in GMT/BST.

[0:18] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[0:18] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:32] <gregaf> cmccabe1: a signal 11 segfault should always trigger a backtrace dump, right?
[0:32] <cmccabe1> gregaf: as long as it's not the result of heap corruption
[0:33] <gregaf> why does that matter?
[0:33] <cmccabe1> gregaf: unfortunately the current code calls malloc in the signal handler, so you can go boom
[0:33] <gregaf> ah
[0:33] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[0:33] <cmccabe1> gregaf: also we don't use altstack currently, so if you trash the stack pointer, you're probably toast as well
[0:33] <cmccabe1> those two things could probably be fixed eventually.
[0:33] <gregaf> I've got a core dump from segfault, a null dereference actually, and the log doesn't have the backtrace
[0:34] <cmccabe1> hmm
[0:34] <gregaf> not a big deal in this case since I have a valid core but I think it should have triggered?
[0:34] <Tv|work> honestly, the only way to make backtraces reliable is to dump a core & analyze it
[0:34] <cmccabe1> gregaf: is this reproducible?
[0:34] <gregaf> dunno, haven't tried yet
[0:34] <gregaf> it took a while too, I was rsyncing the ceph-client tree into a ceph mount under uml
[0:35] <gregaf> (to look at if there was a problem with caps or messages or whatever under that scenario)
[0:35] <gregaf> and then hit this unexpected bug I'm looking at now
[0:35] <cmccabe1> tv: backtraces can be made reliable in higher-level languages such as python
[0:35] <cmccabe1> tv: python will always give a backtrace when terminated by a signal (unless it's SIGKILL)
[0:35] <gregaf> which triggered the segfault, which I'm more concerned about than the backtrace atm ;)
[0:36] <cmccabe1> tv: in C++, you can close off most of the ways that programmers nuke the backtrace code by making the backtrace code as short and simple as possible
[0:40] <cmccabe1> gregaf: anyway, let me know if you find a case that's reproducible. gdb might allow me to find where it went wrong
[0:42] <gregaf> will do
[0:45] <Tv|work> cmccabe1: in the face of memory corruption, nothing inside the process itself is safe; your "higher-level languages" claim is only true if they're 100% safe = no use of C libraries
[0:45] <Tv|work> cmccabe1: but yeah, short = probabilistics, most likely not hit
[0:45] <cmccabe1> tv: well, you can always encounter a bug in some part of the software stack you're not developing
[0:45] <Tv|work> just saying, greg should not be surprised that backtrace doesn't always trigger
[0:46] <Tv|work> shit will happen, until you compartmentalize enough
[0:46] <cmccabe1> tv: but it tends to not happen much compared with finding your own bugs :)
[0:46] <cmccabe1> tv: you're not guaranteed to ever get a useful backtrace in C, even with gdb and a core
[0:46] <Tv|work> oh sure there's always the "fandango on the core" error mode ;)
[0:46] <cmccabe1> tv: I could change ESP and EBP to something wacky and leave you twisting in the wind
[0:47] <Tv|work> a simple loop can easily clobber all of the stack
[0:47] <cmccabe1> tv: that is true, but in that case, you still have a stack, just one that led to SIGSEGV
[0:47] <cmccabe1> tv: it's only if the stack pointer is overwritten that you are left with no obvious clue where it was
[0:48] <Tv|work> cmccabe1: you have the memory area where stack is, but it may have been written full of garbage
[0:48] <Tv|work> been there done that ;)
[0:48] <cmccabe1> tv: heh, yeah
[0:49] <cmccabe1> tv: I do think we can achieve a usable backtrace 99% of the time though
[0:49] <cmccabe1> tv: unless something truly heinous happens like SIGKILL (oom killer) or someone trashing ESP/EBP
[0:51] <cmccabe1> another idea I've been toying with is providing backtraces for multiple threads in the logfile
[0:51] <cmccabe1> I'm not 100% sure it's the right thing to do... that would be a lot of text, and would no doubt confuse some people
[0:52] <Tv|work> cmccabe1: one thing a previous job had a lot of success with was a "package my crap for a bugreport" script
[0:52] <Tv|work> cmccabe1: then you don't need to spew things to the main human-readable log
[0:52] <Tv|work> cmccabe1: but you'd have e.g. an alternate location to dump that stuff in
[0:52] <sagewk> tv|work: cdebugpack
[0:52] <gregaf> we do have a script like that, actually
[0:52] <Tv|work> sagewk: yeah
[0:53] <Tv|work> and once you have that, you can analyze core files etc there
[0:53] <gregaf> he means it already exists, and is called cdebugpack :)
[0:53] <Tv|work> yup
[0:53] <gregaf> it's not very fancy right now, just gzips up the core and logs and executables, I think
[0:53] <cmccabe1> yeah, I noticed wido uses that script
[0:54] <cmccabe1> we should probably start recommending it more to people filing bug reports
[0:57] <gregaf> wido: sage says that he closed 563 since he thinks they're tracking it upstream and he's not working on it
[1:04] <cmccabe1> this is actually pretty cool: http://timetobleed.com/an-obscure-kernel-feature-to-get-more-info-about-dying-processes/
[1:05] <cmccabe1> we could set it up so that any time a ceph process dumps core on our cluster, it emails the user who ran it with a backtrace
[1:07] <cmccabe1> well, I guess the daemons usually run as root. But still.
[1:11] <bchrisman> wonder what happens when the core dump process segfaults…? :)
[1:11] <Tv|work> bchrisman: core dumps are written from kernel space
[1:12] <bchrisman> (they were talking about invoking another program during core dump... right?)
[1:12] <cmccabe1> tv: he's talking about the core helper thing
[1:12] <cmccabe1> call_usermodehelper_exec
[1:15] <cmccabe1> it's actually very powerful because it runs before the process has terminated. So you can do stuff like read /proc/maps, or look at the value of cross-process semaphores, etc. Things that wouldn't be in a core dump.
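The mechanism being discussed is the pipe form of `core_pattern`: when the pattern starts with `|`, the kernel execs the named helper with the core image on stdin, before the crashing process is reaped. A sketch — the helper path `/usr/local/sbin/core-helper` is hypothetical:

```sh
# As root: pipe core dumps to a helper instead of writing a file.
# %p expands to the pid, %e to the executable name; the helper receives
# the core image on stdin while the dying process still exists, so its
# /proc entries (maps, fds, semaphores) are still readable.
echo '|/usr/local/sbin/core-helper %p %e' > /proc/sys/kernel/core_pattern
```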
[1:15] <bchrisman> yeah.. looks very useful
[1:15] <Tv|work> http://lxr.linux.no/linux+v2.6.37/fs/exec.c#L2023
[1:16] <Tv|work> it just printks, shrugs, and moves on
[1:16] <cmccabe1> heh. At KERN_INFO, too
[1:17] <cmccabe1> yeah, I suspect that running the core dumper process on itself wouldn't be a good plan
[2:48] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[2:50] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[2:51] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[2:51] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit ()
[2:57] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) has joined #ceph
[3:07] * jantje (~jan@paranoid.nl) has joined #ceph
[3:11] * jantje__ (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[3:20] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) Quit (Quit: Leaving.)
[3:24] * Juul (~Juul@static.88-198-13-205.clients.your-server.de) Quit (Quit: Leaving)
[3:47] * cmccabe1 (~cmccabe@c-24-23-253-6.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[4:27] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has left #ceph
[4:28] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[5:18] * greglap (~Adium@ has joined #ceph
[5:52] * greglap (~Adium@ Quit (Quit: Leaving.)
[5:59] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) has joined #ceph
[6:34] * chrisrd (~chrisrd@ has joined #ceph
[8:09] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:29] <Anticimex> gregaf: interesting... about to try and recreate the issue now
[8:30] <Anticimex> but before i umounted the kclient mount on the ceph-node itself, i did another ls -al, and the video file now reports 0 bytes and timestamp 1970-01-01 etc
[8:30] <Anticimex> before that I scp:ed it away to another host, a bunch of hours ago
[9:08] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[10:04] * Yoric (~David@ has joined #ceph
[10:05] * allsystemsarego (~allsystem@ has joined #ceph
[10:11] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:45] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[10:47] * Yoric (~David@ has joined #ceph
[13:43] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[13:56] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[14:01] * Yoric_ (~David@ has joined #ceph
[14:01] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[14:01] * Yoric_ is now known as Yoric
[14:12] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[14:25] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[14:27] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[14:36] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[14:52] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[15:46] * Yoric_ (~David@ has joined #ceph
[15:47] * Yoric (~David@ Quit (Ping timeout: 480 seconds)
[15:47] * Yoric_ is now known as Yoric
[15:49] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:59] * Yoric (~David@ has joined #ceph
[16:01] * Yoric (~David@ Quit ()
[16:33] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[16:56] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[16:57] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[16:57] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[17:17] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[17:20] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[17:24] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[17:27] * Meths_ (rift@ has joined #ceph
[17:34] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[17:35] * Meths_ is now known as Meths
[17:37] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[17:52] * greglap (~Adium@ has joined #ceph
[17:58] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[18:02] <sage> hi guys: home sick today :(
[18:02] <greglap> :(
[18:09] <wido> :(
[18:09] <wido> hi
[18:10] <wido> sjust: I got 5 corrupted pg's again today. I resized a RBD image from 1.5T to 3.0T. I then booted the VM and saw that 4 PG's were inconsistent
[18:10] <wido> http://pastebin.com/sWe1Gmvp
[18:10] <wido> 5 PG's
[18:11] <wido> Logging was still low, so I have no logging from the OSD's. Are the missing rbd objects normal?
[18:19] * Tv|work (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:20] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[18:20] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[18:40] * greglap (~Adium@ Quit (Quit: Leaving.)
[18:58] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:59] * cmccabe (~cmccabe@ has joined #ceph
[19:01] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) Quit (Remote host closed the connection)
[19:09] <sjust> wido: not as far as I know
[19:11] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:12] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[19:15] <Tv|work> gregaf: ahh i asked it on #autotest -- oops ;)
[19:15] <gregaf> heh
[19:19] <gregaf> Anticimex: did you come up with anything when you tried to reproduce?
[19:20] <Anticimex> just woke up now, and no, i wasn't able to reproduce
[19:20] <Anticimex> this time, it differed in that after having copied the file on the remote client onto the ceph mount, i did a ls -al, which seems to have triggered a sync
[19:21] <Anticimex> hence, i didn't do a sync call
[19:21] <Anticimex> i'll try again
[19:22] <gregaf> hmm, okay
[19:23] <wido> sjust: ok, well, that is what happened :)
[19:23] <sjust> wido: off hand, I don't know what would cause that, some logs would help
[19:26] <gregaf> cmccabe: sjust: Tv|work: joshd: meeting will have to be a bit later, Sage is on a phone call
[19:27] <sjust> ok
[19:27] <Tv|work> gregaf: ok
[19:29] <cmccabe> k
[19:31] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:34] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:43] <sage> let's do 11
[19:43] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:44] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[19:45] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[19:58] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:03] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[20:11] <bchrisman> recommendations for debugging a refusal to umount? http://pastebin.com/L8ELLqk5
[20:12] <bchrisman> looking for kernel client debug recommendations there to find out why ceph still thinks the filesystem is in use..
[20:14] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[20:21] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[20:27] * bcherian (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[20:29] <gregaf> bchrisman: that usually happens when the kclient can't communicate with the ceph servers or is taking a long time to flush out data for some reason
[20:30] <bchrisman> ahh ceph -s should report that then?
[20:30] <bchrisman> (the communication issue)
[20:30] <bchrisman> I didn't check yet.
[20:30] <gregaf> well if there's actually a cluster issue, yes
[20:31] <gregaf> that's all I can come up with, although I have seen umount behave a bit oddly a few times
[20:31] <bchrisman> okay… will remember to check the daemons if/when it recurs.
[20:31] <gregaf> never had trouble just doing a lazy umount although my own clusters aren't usually very long-lived
[20:32] <gregaf> there are a variety of debug options via /proc/sys/kernel stuff but those tend to produce a lot of output
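For the kernel client specifically, one way to get that debug output on kernels built with CONFIG_DYNAMIC_DEBUG — a sketch, assuming root and a ceph module whose dout() maps to pr_debug:

```sh
mount -t debugfs none /sys/kernel/debug 2>/dev/null  # if not already mounted
# Enable every pr_debug site in the ceph module; output lands in the kernel log.
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
# ...reproduce the stuck umount, read dmesg, then turn it off again:
echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control
```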
[20:39] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[20:39] <wido> sjust: I'll change the logging. Would 10 be enough on one OSD or do you need it on all OSD's?
[20:39] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit ()
[20:41] <sjust> wido: 10 on one osd would certainly help
[20:42] <wido> Ok, i'll set up remote syslog to a machine with some more space and change the logging for one OSD
[20:46] <sjust> wido: sounds good
[20:47] <Anticimex> gregaf: i hurried to do a sync after i copied a second file of similar size, and i think i have reproduced the problem now
[20:47] <Anticimex> seemingly no end to uploading from client, and no termination of sync
[20:47] <gregaf> Anticimex: did you get some logs for me? :)
[20:48] <Anticimex> i activated debug ms 1 on mds, mon and osd yeah
[20:48] <gregaf> cool, can you zip/tar them and put them on webspace somewhere?
[20:48] * cmccabe (~cmccabe@ has left #ceph
[20:48] <gregaf> or if they're small enough packaged up you can just email them to me
[20:48] <Anticimex> wilco
[20:49] <gregaf> gregf@hq.newdream.net
[20:51] <Anticimex> i'll put them on a webpage, they're ~150 MB per osd :)
[20:51] <Anticimex> they've been running since yesterday
[20:52] <Anticimex> so i believe you can just tail out the last x thousand lines and find the recent issues there
[20:52] <gregaf> cool
[20:52] <gregaf> you're going to have to give me the address, though ;)
[20:52] <Anticimex> i will when i know what it is, gzipping now
[20:52] <gregaf> k, thanks!
[20:53] <Anticimex> interesting that it was not triggered by a stat (or just coincidence with a timed flush) on the files in the directory
[20:53] <Anticimex> the bug, that is.
[20:54] <gregaf> yeah
[20:54] <gregaf> bbl, lunchtime!
[20:54] <Anticimex> 197MB tar.gz at http://martin.millnert.se/public/ceph-broken-sync-logs.tar.gz
[20:55] <Anticimex> 2G unpacked. :)
[20:57] * cmccabe (~cmccabe@ has joined #ceph
[20:57] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[21:00] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) has joined #ceph
[21:01] * coelho (1000@ has joined #ceph
[21:02] <wido> cmccabe: Hi!
[21:02] <cmccabe> wido: hi wido
[21:02] <wido> You wrote the syslog code?
[21:02] <cmccabe> wido: yeah
[21:02] <wido> I'm trying it now, but is there a way to prefix a name for each process?
[21:02] <wido> right now I have 4 OSD's logging to syslog, but they all say "cosd"
[21:02] <cmccabe> wido: hmm.
[21:03] <darkfader> wido: what syslogger do you use?
[21:03] <wido> Ubuntu's, rsyslog
[21:03] <darkfader> at least on solaris you could do evil stuff with incoming data
[21:03] <darkfader> i'll look for rsyslog options
[21:03] <wido> syslog-ng is pretty cool for example, I use that on my remote syslog machine
[21:03] <cmccabe> I think that's probably the sort of thing the syslog daemon should/could do
[21:04] <wido> cmccabe: "Feb 1 21:01:39 noisy cosd: 2011-02-01"
[21:04] <darkfader> well syslog-ng should be able to do that :)
[21:04] <wido> when writing the line you could replace "cosd" by "osd0"
[21:04] <cmccabe> the other thing syslog does is prefix a time
[21:04] <cmccabe> I kind of wish we didn't always prefix our own (duplicate) time when writing to syslog
[21:04] <wido> It doesn't have to be the process's name
[21:05] <wido> You can do this when opening the socket to syslog
[21:05] <cmccabe> ah, I see that there's a call named openlog which allows us to set our identity
[21:06] <cmccabe> yeah, that might be a good idea. So we'd have output like osd0: foo, mds.a: foo, etc.
[21:06] <wido> Yeah, indeed. Would make sorting logs much easier :)
[21:07] <wido> I just checked LogClient.cc
[21:07] <cmccabe> since it's part of the standard syslog API we should definitely use it
[21:07] <wido> you only use syslog()
[21:07] <cmccabe> that's right. You're allowed to use syslog() without openlog(), at least on linux
[21:08] <Tv|work> syslog() basically has if (!initialized) openlog() in it
[21:08] <cmccabe> the tricky part will be figuring out what the "pretty" process name is
[21:08] <wido> cmccabe: Setting a facility would also be useful, this way you can filter out which to send to a remote box
[21:08] <cmccabe> like writing osd.0 in a cosd, mds.b in a cmds
[21:09] <cmccabe> probably doing nothing in random programs that don't have a pretty name
[21:10] <Tv|work> i think the default is basename of argv[0]
[21:10] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[21:10] <cmccabe> yeah
[21:10] <cmccabe> that makes all osds look the same though
[21:11] <Tv|work> yup
[21:11] <Tv|work> but good enough for command-line use
[21:11] <cmccabe> oh, yeah.
[21:11] <Tv|work> so daemons that know they're "special" need to set it
[21:12] <wido> cmccabe: Shall I open a issue for it?
[21:12] <cmccabe> wido: sure
[21:12] <wido> ok, I'll do that
[21:12] <cmccabe> tv: although command-line programs should call set_foreground_logging and therefore never log to syslog
[21:13] <Tv|work> yes
[21:13] <cmccabe> tv: but if someone creates a random test daemon, which has been known to happen, we should do something intelligent
[21:22] <wido> cmccabe: With setting a facility you could do simple things like: local1.* @remote-syslog-machine
[21:23] <wido> although newer syslog daemons support these functions with a regexp: :msg,contains,"cosd" @remote-syslog-machine
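Put together in an rsyslog.conf, the two filtering styles wido describes look like this (the facility choice and collector hostname are illustrative):

```
# Facility-based: everything a daemon logs to local1 goes to the collector.
local1.*                       @collector.example.com

# Property-based (rsyslog-specific): forward any message mentioning "cosd".
:msg, contains, "cosd"         @collector.example.com

# @host forwards over UDP; use @@host for TCP.
```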
[21:23] <cmccabe> wido: the set of facilities looks pretty limited, are any of them really appropriate besides LOG_USER?
[21:23] <cmccabe> I guess maybe LOG_DAEMON
[21:24] <gregaf> cmccabe: you want to look at g_conf.id
[21:25] <gregaf> I don't remember if it's of the format "osd.0" or "0", but it will be helpful :)
[21:25] <wido> Well, just LOG_LOCAL1, LOCAL2, using those could be used for filtering
[21:25] <wido> "other codes through 15 reserved for system use"
[21:26] <cmccabe> wido: I like your previous idea about regular expressions. Sounds more flexible than LOG_LOCAL1/LOG_LOCAL2
[21:27] <wido> cmccabe: Yes, but that depends on your syslog daemon. rsyslog (which is default under Ubuntu) supports it
[21:28] <wido> debian lenny still has plain old syslogd
[21:28] <wido> sysklogd
[21:29] <cmccabe> well, rsyslog is available for install right?
[21:29] <cmccabe> I seem to remember some people like syslog-ng too
[21:29] <wido> yes, rsyslog is available, syslog-ng too.
[21:29] <darkfader> yeah and flexible is when it works for all of them ;p
[21:29] <wido> syslog-ng is really flexible indeed, we use it at our "collect boxes"
[21:30] <darkfader> any of you use splunk?
[21:30] <wido> no, never heard of it?
[21:30] <cmccabe> haven't used it.
[21:30] <darkfader> i think i can give you an url for mine in a sec
[21:30] <darkfader> unless i broke it
[21:31] <wido> cmccabe: You could indeed skip the facility, but the identity is needed imho
[21:31] <cmccabe> wido: right
[21:33] <darkfader> that's a live box
[21:34] <darkfader> you can run queries over the syslog messages and view timelines and all kinda stuff
[21:35] <darkfader> and run a real time view based on the filters, too :)
[21:35] <wido> there are tons of those things around, also OSS ;)
[21:36] <wido> Btw, still playing around with Qemu-RBD and I keep getting: "flush-251:2: page allocation failure. order:0, mode:0x20"
[21:36] <darkfader> well tons that try ;)
[21:36] <cmccabe> darkfader: yeah, splunk always seemed vaguely interesting. Never had a chance to try it. How does it compare to ganglia / nagios ?
[21:36] <darkfader> cmccabe: not related at all
[21:36] <wido> now, you would say the VM is out of memory, but I think the kernel can't flush fast enough
[21:37] <cmccabe> darkfader: well, splunk is all about logfiles. Ganglia and nagios are more active monitoring I guess
[21:37] <darkfader> well you *can* use it for performance data and the like, but the syslog search engine is enough fun for me
[21:37] <cmccabe> darkfader: also, is there an open source tool that is kind of like splunk I wonder?
[21:37] <darkfader> wido says so :)
[21:38] <gregaf> wido: I'd poke yehudasa or joshd about that
[21:38] <darkfader> but i don't think anything that actually works as well
[21:38] <darkfader> i ended up with splunk when i was looking for some oss solution
[21:39] <cmccabe> is splunk ad supported, or royalty based, or what
[21:39] <darkfader> free until a lograte of 500mb/day or something like that
[21:39] <cmccabe> seems like the main thing you need is some big database of regular expressions, and you could crowdsource that pretty easily
[21:40] <wido> gregaf: I just updated the issue with the High I/O wait. Guess it's related to that
[21:40] <wido> So yehudasa or joshd if you are reading, might want to check out #752 ?
[21:41] <cmccabe> wido: what partition is mounted with XFS?
[21:41] <darkfader> cmccabe: splunk and nagios would mostly differ in that nagios is always quite static (you can alert for message xyz but you normally can't correlate after the fact or search for something different)
[21:41] <cmccabe> darkfader: k
[21:42] <darkfader> well. i'll leave it accessible for a day, just test it :)
[21:42] <darkfader> the $$$$ versions are mostly for companies that either like a few 100GB per day or who put their whole business transactions via splunk
[21:42] <darkfader> kinda irrelevant for me
[21:42] <wido> cmccabe: /mnt/data, the partition I'm syncing to. / is ext4
[21:46] <Tv|work> darkfader: https://code.google.com/p/logstash/ perhaps
[21:46] <Tv|work> darkfader: or skip the ruby and go straight to ElasticSearch
[21:47] <darkfader> Tv|work: i dont see any point in switching now
[21:47] <cmccabe> tv: elasticsearch?
[21:47] <darkfader> this looks like a splunk 0.01-rc1
[21:48] <cmccabe> wow, people are doing cool things with lucene these days
[21:48] <darkfader> and splunk worked for 4 years without any hassle
[21:48] <Tv|work> cmccabe: http://www.elasticsearch.com/
[21:48] <Tv|work> lucandra is even neater, just not as neatly packaged
[21:49] <cmccabe> tv: so if lucandra is lucene with a cassandra backend, what is lucene's "normal" back end?
[21:51] <Tv|work> cmccabe: local files
[21:51] <cmccabe> tv: k
[21:51] <Tv|work> lucene is stuck with old school fulltext indexing though, it needs a rewrite :(
[21:51] <Tv|work> stuff like, it assumes it knows what are all the indexable fields, and gives them numbers
[21:52] <Tv|work> elasticsearch gets around that by maintaining a name<->number mapping, but that's just covering up for the ugly
[21:54] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) Quit (Ping timeout: 480 seconds)
[21:55] * ghaskins_mobile (~ghaskins_@66-189-113-47.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[22:12] * baldben (~bencheria@ip-66-33-206-8.dreamhost.com) has joined #ceph
[22:31] * bchrisman (~Adium@70-35-37-146.static.wiline.com) Quit (Quit: Leaving.)
[22:33] * DLange (~DLange@dlange.user.oftc.net) Quit (Remote host closed the connection)
[22:39] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[22:52] * DLange (~DLange@dlange.user.oftc.net) Quit (Remote host closed the connection)
[22:52] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[23:11] <jantje> hi !
[23:12] * coelho (1000@ Quit (Ping timeout: 480 seconds)
[23:20] <jantje> cmccabe: thanks for fixing the timer issue
[23:25] * coelho (1000@ has joined #ceph
[23:29] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[23:31] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[23:35] <cmccabe> jantje: np!
[23:49] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[23:53] * coelho (1000@ Quit (Quit: Leaving)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.