#ceph IRC Log


IRC Log for 2010-12-20

Timestamps are in GMT/BST.

[0:10] * darkfader (~floh@host-93-104-226-28.customer.m-online.net) has joined #ceph
[0:11] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[0:24] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) Quit (Remote host closed the connection)
[0:24] * sentinel_e86 (~sentinel_@ Quit (Quit: sh** happened)
[0:26] * sentinel_e86 (~sentinel_@ has joined #ceph
[0:27] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[0:39] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[1:04] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[2:38] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) Quit (Quit: Leaving)
[3:39] * MarkN1 (~nathan@ Quit (Ping timeout: 480 seconds)
[3:39] * MarkN (~nathan@ has joined #ceph
[5:20] * MarkN1 (~nathan@ has joined #ceph
[5:26] * MarkN (~nathan@ Quit (Ping timeout: 480 seconds)
[5:28] * MarkN1 (~nathan@ Quit (Ping timeout: 480 seconds)
[5:29] * MarkN (~nathan@mail.zomojo.com) has joined #ceph
[6:09] * votz (~votz@dhcp0020.grt.resnet.group.UPENN.EDU) has joined #ceph
[6:34] * ijuz__ (~ijuz@p579995BC.dip.t-dialin.net) has joined #ceph
[6:42] * ijuz_ (~ijuz@p4FFF4E31.dip.t-dialin.net) Quit (Ping timeout: 480 seconds)
[6:55] * fract (~lbz@c-98-245-144-60.hsd1.co.comcast.net) has joined #ceph
[6:55] * fract (~lbz@c-98-245-144-60.hsd1.co.comcast.net) Quit (Quit: Leaving)
[6:56] * f4m8_ is now known as f4m8
[7:08] * bchrisman1 (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:33] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) has joined #ceph
[7:42] * sagelap (~sage@dsl092-035-022.lax1.dsl.speakeasy.net) Quit (Ping timeout: 480 seconds)
[10:29] * allsystemsarego (~allsystem@ has joined #ceph
[10:41] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[11:59] * Yoric (~David@ has joined #ceph
[14:29] * gregorg (~Greg@ has joined #ceph
[14:29] * xyproto (~alexander@ has joined #ceph
[14:32] <xyproto> Hello, I have a small ceph test-cluster. Everything has gone relatively smoothly. I first tried with the "regular" linux kernel that comes with Arch Linux, then upgraded everything to the latest&freshest. I now use kernel 2.6.37, ceph 0.23.2 (commit:5bdae2af8c53adb2e059022c58813e97e7a7ba5d) and Btrfs v0.19-35-g1b444cd
[14:32] <xyproto> However, some problems have started to appear, when copying large files around
[14:32] <xyproto> "ls -l" stopped working in one directory (but ls still worked)
[14:32] <xyproto> The ceph filesystem is also mounted as a samba share
[14:33] <xyproto> And, some files just won't be copied. They just stop, every time.
[14:33] <xyproto> Is this a known/common problem? Configuration issue? Ceph issue? Can anything be done? :)
[14:34] <xyproto> (I use the git-version of the kernel)
[14:34] <xyproto> I also get: pg v23517: 792 pgs: 791 active+clean, 1 crashed+peering; 270 GB data, 681 GB used, 4202 GB / 4890 GB avail
[14:35] <xyproto> Why is 1 crashed, and how can I figure out where it is?
[14:35] <xyproto> I've used Linux for years, but I'm a ceph n00b.
[14:36] <xyproto> I've googled and read the ceph wiki as much as I can.
[14:38] <xyproto> My goal, for now, is to share the disk space on five computers as a samba share (and ftp), with good speed, without problems, being able to copy around large files (~80GB) and also being able to remove a computer temporarily without it causing problems.
[14:38] <xyproto> Is ceph a good choice for this?
[14:38] <xyproto> Any tips&tricks are welcome.
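xyproto's setup (a Ceph filesystem re-exported over Samba, mentioned above) could be sketched roughly as below. This is an editor's sketch, not from the log: the mount point, monitor address, and share name are all placeholders.

```ini
; Hypothetical smb.conf fragment re-exporting a Ceph kernel mount.
; Assumes the filesystem was mounted first with the kernel client, e.g.:
;   mount -t ceph monhost:/ /mnt/ceph
[cephshare]
        path = /mnt/ceph      ; placeholder mount point
        read only = no
        browseable = yes
```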
[14:49] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[14:58] <darkfader> xyproto: keep one of the directories with "ls hang" around for a dev to look at it
[14:58] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[14:58] <darkfader> as to your goal...
[14:59] <darkfader> ceph can do all of that, but it might not be wise to expect zero problems from an experimental distributed fs running on top of an experimental filesystem (one that constantly prints warnings about being experimental)
[14:59] <darkfader> normally it will be fine, but ;)
[15:06] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) Quit (Ping timeout: 480 seconds)
[15:19] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) has joined #ceph
[15:42] <xyproto> darkfader: I don't have the ls hang directory around, anymore, unfortunately. But, I think I might be able to recreate it.
[15:44] <xyproto> darkfader: Other than that, I think it's very understandable that there are issues. The people behind ceph have done a great job putting warnings everywhere that things are experimental. :)
[16:00] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[16:03] * pomb (~pomb@ has joined #ceph
[16:04] * f4m8 is now known as f4m8_
[16:04] * pomb (~pomb@ has left #ceph
[16:06] <xyproto> how can I fix "1 crashed+peering", if it is fixable?
[16:12] <xyproto> cmds segfaults on one of my computers, can this commandline output be useful? http://aur.pastebin.com/911FaQNs
[16:13] <wido> xyproto: a cmds segfault is not a good thing :)
[16:14] <xyproto> wido: no, I had a strange feeling about just that ;)
[16:14] <wido> that could be the source of your problems; whenever your MDS goes down and you only have a single MDS, your FS becomes unavailable
[16:14] <wido> Could you try to crank up the logging? debug mds = 20
[16:14] <xyproto> wido: I do have three mds's active, though. But that does not matter?
[16:14] <wido> and when it crashes, run cdebugpack -c /etc/ceph/ceph.conf mds_crash.tar.gz
[16:14] <xyproto> wido: sure
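The settings wido suggests above would sit in ceph.conf roughly as follows. The `debug mds = 20` option is taken straight from the conversation; the section layout and the commented log path are assumptions.

```ini
; Sketch of the debug logging wido suggests above (option name from the log).
[mds]
        debug mds = 20
        ; log file = /var/log/ceph/mds.$name.log   ; hypothetical log path
```

After the next crash, the diagnostic bundle is then gathered exactly as wido says: `cdebugpack -c /etc/ceph/ceph.conf mds_crash.tar.gz`.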
[16:15] <wido> Oh yes, that changes things, but when all three MDSes are down, your FS won't work either
[16:15] <wido> does your dmesg show that libceph is picking a new MDS?
[16:15] <wido> But also, clustered MDSes are not stable (yet)
[16:17] <wido> Does the MDS crash after the restart?
[16:17] <wido> Since it crashes on a journal replay
[16:17] <xyproto> wido: ok, here's with debug 20 on stdout: http://aur.pastebin.com/eL7g2Uz6
[16:17] <wido> or does a second MDS crash after a first goes down?
[16:17] <xyproto> wido: and here's the content of output.txt:
[16:18] <wido> Am I missing a link? Or are you still posting?
[16:18] <xyproto> still posting...
[16:19] <xyproto> (uploading the text took some time, surprisingly)
[16:19] <xyproto> ok, running cdebugpack as well now
[16:19] <wido> I'm afraid I can't debug the problem for you, I'm not a dev, but I've run into a lot of crashes myself, so I know how to gather the needed info
[16:20] <wido> yes, cdebugpack is the easiest way, it gathers all the info the devs need
[16:20] <xyproto> ok, great
[16:20] <wido> there should also be a "core" file in /
[16:20] <wido> is that correct?
[16:21] <xyproto> yes
[16:22] <wido> ok, that is the core dump from the MDS (assuming that is the only server which crashed)
[16:23] <wido> The OSD daemons and MON stay online / running?
[16:23] <wido> ceph -s or ceph -w shows all OSD's up?
[16:23] <xyproto> yes, that's correct. There's a core dump, and the rest of the daemons are up and running
[16:24] <xyproto> ceph -w and ceph -s show that all OSDs are up
[16:24] <wido> Then I think your MDS is the problem here
[16:24] <xyproto> Yeah, that seems likely.
[16:24] <wido> If you have the time, could you open an issue at http://tracker.newdream.net/ and upload the result of cdebugpack (which could be big!) somewhere
[16:25] <wido> A short description of what you did, so the devs can take a look
[16:25] <xyproto> it's 124M
[16:25] <wido> They'll probably find the root of it sooner than I could/would
[16:25] <xyproto> I see. Thank you!
[16:25] <wido> well, that is a bit too big to attach to the issue
[16:25] <wido> You have some webspace available?
[16:29] <xyproto> yes, I think I will be able to find the space somewhere
[16:29] <xyproto> I might have to put it on my home computer later on, though
[16:30] <xyproto> on the http://tracker.newdream.net/projects page, I see there's a tracker for the ceph kernel module, is that the one to be used for cmds as well?
[16:31] <xyproto> nvm, I finally found the "report issue" link :)
[16:31] <xyproto> * "new issue"
[16:32] <wido> xyproto: Yes, there is a project 'Ceph', take that one
[16:32] <wido> i'm afk
[16:32] <wido> thanks!
[16:34] <xyproto> wido: take care!
[16:48] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[17:32] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:33] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) has joined #ceph
[17:50] <sage> xyproto: still there?
[17:51] * bchrisman (~Adium@c-24-130-226-22.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:52] <xyproto> sage: yes
[17:53] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:57] <sage> looking at your bug now. the logs in the debugpack don't match the pastes because the mds instances all were named 'admin'.
[17:57] <sage> can you reproduce it, but specify a different mds name, so they log to separate files? if you're starting via ceph.conf, you need sections like [mds.a] [mds.b] etc.
[17:57] <sage> if you're running from the command line, '-n mds.a' or '-n mds.b' should be sufficient
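sage's two options can be sketched like this; the section names come straight from his message, while the hostnames are placeholders, not values from the log.

```ini
; Option 1: one section per MDS in ceph.conf, so each logs under its own name.
[mds.a]
        host = node1      ; placeholder hostname
[mds.b]
        host = node2      ; placeholder hostname
[mds.c]
        host = node3      ; placeholder hostname
```

Option 2, when starting by hand, would be along the lines of `cmds -n mds.a` (one distinct name per instance), as sagewk also asks for later in the log.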
[18:00] * itsathing (48c49aca@ircip1.mibbit.com) has joined #ceph
[18:01] * itsathing (48c49aca@ircip1.mibbit.com) has left #ceph
[18:06] * __jt__ (~james@jamestaylor.org) Quit (Quit: leaving)
[18:07] * alexxy (~alexxy@ Quit (Remote host closed the connection)
[18:11] * __jt__ (~james@jamestaylor.org) has joined #ceph
[18:13] * Yoric (~David@ Quit (Quit: Yoric)
[18:14] * alexxy (~alexxy@ has joined #ceph
[18:31] <wido> sage: I've just hit my max open files limit on my noisy machine. Now fixing that is not that hard
[18:31] <wido> but what would you think of upping the limit via the init scripts?
[18:32] <wido> $num_osd_on_this_node * 16384
[18:32] <wido> that seems a fair limit to me
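wido's proposed formula can be sketched as an init-script fragment. This is a hypothetical sketch of the idea, not the actual Ceph init script: the helper names and the way OSDs are counted are assumptions; only the per-OSD factor of 16384 comes from the conversation.

```shell
#!/bin/sh
# Sketch of wido's suggestion: scale the open-file limit by the number of
# OSD daemons configured on this node. Hypothetical helpers, not Ceph's script.

PER_OSD_FILES=16384   # per-OSD file-descriptor budget from the conversation

# Count [osd.N] sections in a ceph.conf to estimate the OSDs on this node.
count_local_osds() {
    grep -c '^\[osd\.' "$1"
}

# Compute and print the limit for a given OSD count.
osd_fd_limit() {
    echo $(( $1 * PER_OSD_FILES ))
}

# A real init script would then run, before starting the daemons:
#   ulimit -n "$(osd_fd_limit "$(count_local_osds /etc/ceph/ceph.conf)")"
```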
[18:36] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[18:37] <xyproto> sage: I am running from the commandline on that computer, yes. I can try the -n mds.a options.
[18:38] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[18:42] * NoahWatkins (~NoahWatki@c-98-234-57-117.hsd1.ca.comcast.net) has joined #ceph
[18:43] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[18:46] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[18:46] <xyproto> sage: ok, uploading a new crash .tar.gz right now and attaching it to the bug report.
[19:07] * NoahWatkins (~NoahWatki@c-98-234-57-117.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[19:38] <sagewk> xyproto: sorry, can you do it one more time, with '--debug-mds 20' and a different id for each cmds instance you start? -n mds.a for one, -n mds.b for the second, -n mds.c for the third.. something like that? there are two different crashes here and I'm hoping to squash both bugs
[19:39] * NoahWatkins (~NoahWatki@soenat3.cse.ucsc.edu) has joined #ceph
[19:44] * NoahWatkins (~NoahWatki@soenat3.cse.ucsc.edu) Quit (Remote host closed the connection)
[19:53] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[20:03] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[22:10] <wido> sage: are you there?
[22:10] <sagewk> wido: yeah
[22:11] <wido> I'm trying to remove a large dataset with the latest RC
[22:11] <wido> I'm getting: http://pastebin.com/Uy2Ab1Ai
[22:11] <wido> and my "rm" then stalls
[22:11] <wido> is this a btrfs bug from code you submitted or is this a "real" btrfs bug?
[22:11] <sagewk> it looks like a real btrfs bug to me..
[22:12] <sagewk> the disk isn't full (or close to it) is it?
[22:12] <wido> No, at 2% usage
[22:13] <wido> sagewk: What I noticed, the btrfs warning came at 21:15, at 21:30 all my OSD's reported in their logs: "2010-12-20 21:30:51.701552 7f79e24ff700 -- [2a00:f10:113:1:230:48ff:fe8d:a21e]:6811/2612 >> [2a00:f10:113:1:230:48ff:fe8d:a21e]:6805/2459 pipe(0x107a500 sd=14 pgs=2 cs=1 l=0).fault with nothing to send, going to standby"
[22:14] <wido> mon nor mds does show anything like that
[22:14] <sagewk> probably just because that osd crashed
[22:14] <sagewk> i got a '(05:22:40 PM) josef: sage: wait i know what that is' which is promising, checking with him to see
[22:15] <wido> sagewk: all the OSD's are still running
[22:16] <wido> ok, great, i'll wait
[22:17] <sagewk> wido: actually that warning is from the normal socket timeout/close. normal and safe.
[22:17] <sagewk> we should probably make it not spam the logs in that case.
[22:17] <wido> ok, the "fault" looks like an error
[22:17] <sagewk> yeah
[22:18] <gregaf> yeah, it's a fault of sorts since the timeout goes through the normal error-handling code
[22:18] <wido> sagewk: If you speak to josef, what might help, "[btrfs-transacti]" is in status D now
[22:18] <wido> and seems to stay there
[22:19] <gregaf> it's just a lot simpler for the protocol/messenger to treat it as an error since they'll try to reconnect anyway if they actually have data to send
[22:19] <wido> Oh, the internals might be fine, but in the log it reads like an error when it's really just informational
[22:25] <wido> gregaf: my rsync speed seemed fine didn't it? :)
[22:25] <sagewk> wido: i doubt he'll need more info from your box, but i'll let you know
[22:25] <wido> I must say, I was a bit surprised
[22:25] <gregaf> wido: yeah, thanks for running that test
[22:25] <wido> sagewk: Ok, it's on "noisy"
[22:25] <gregaf> but at 3MB/sec I'm not sure that's too indicative
[22:26] <wido> gregaf: I'll try again with a fileset of many more, smaller files, and on a local network
[22:26] <wido> while reading from my desktop for example, with a SSD
[22:26] <wido> so I know the source is not the bottleneck
[22:26] <wido> sagewk: I'm going afk in a minute, if josef needs anything, it's on 'noisy'
[22:27] <sagewk> k thanks
[22:27] <wido> ttyl!
[22:29] <sagewk> ttyl
[22:42] <xyproto> sagewk: I'm heading home from work now, but I'll try to generate more crashes tomorrow :)
[22:42] <xyproto> sagewk: talk to you
[22:43] <sagewk> xyproto: thanks. please stick them in the bug. i'll be out of town, but the other guys can follow up
[23:54] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) Quit (Read error: Connection reset by peer)
[23:59] * yehudasa (~yehudasa@ip-66-33-206-8.dreamhost.com) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.