#ceph IRC Log


IRC Log for 2012-02-25

Timestamps are in GMT/BST.

[0:09] <pulsar> sjust2: Tv|work gregaf1 you guys around? i have some test results after stress testing ceph you might be interested in. otherwise i will be shutting it down and giving it a try some time later with a newer kernel / ceph version
[0:09] <sjust2> pulsar: here
[0:09] <sjust2> and intereseted
[0:09] <sjust2> *interested
[0:09] <pulsar> oh, great.
[0:10] <pulsar> so basically i made a script to create directories like crazy. ended up with mds locked up at 14gb ram and several dead osds crashing right after restarting them.
[0:10] <sjust2> ah
[0:10] <pulsar> did not look into the logfiles, i do not feel it is an option for me right now
[0:11] <pulsar> so, if you want to take a look, i can patch you through. otherwise killall -9 ceph :)
[0:11] <sjust2> sagewk: are you interested in seeing the mds logs?
[0:12] <sagewk> not the mds logs... shouldn't be hard to recreate the situation
[0:12] <sagewk> lots of subdirs in the same dir, or were individual dirs relatively small?
[0:12] <pulsar> not that much, i tried to hit 1b directories / files
[0:13] <pulsar> let me see how far i actually got
[0:14] <sagewk> were you careful to keep individual dirs small, or were there any that were big?
[0:14] <pulsar> ~ 6720000 directories / files
[0:15] <pulsar> and maximum number of children per directory might be around ....
[0:15] <pulsar> not even 20k
[0:15] <pulsar> i have 3 osds which will die after attempting a restart
[0:16] <pulsar> and a couple of deadlocked fuse mounts
[0:16] <sagewk> that may still be pushing into problem area, given that dir fragmentation is off.
[0:16] <sagewk> ok. we're definitely interested in the osd crashes!
[0:17] <pulsar> i can get you some log files then, if you want to watch over my shoulder i can give you screen sharing over skype
[0:17] * aa (~aa@r200-40-114-26.ae-static.anteldata.net.uy) Quit (Ping timeout: 480 seconds)
[0:17] <pulsar> or teamviewer if you are willing to fire up the client app
[0:18] <sagewk> just the stack traces in the log file may be enough
[0:18] <pulsar> ok, i'll see what i can get you and msg you the download link
[0:18] <pulsar> i am just wondering...
[0:18] <pulsar> you guys are working full time on ceph?
[0:20] <sagewk> pulsar: yep!
[0:20] <sagewk> x ~10 people so far
[0:20] <pulsar> that explains much ;)
[0:20] <pulsar> pretty much the best support i have come across so far for an open source project
[0:21] <pulsar> usually i end up in a dead irc channel or forum with plenty of users and maybe one person answering a question per day. so, thanks! has been a pleasure!
[0:21] <sagewk> you're welcome :)
[0:24] <pulsar> http://dl.dropbox.com/u/3343578/logs/crashing.ods.log
[0:25] <pulsar> there is one, crashing every time i restart it
[0:25] <pulsar> actually these are two instances
[0:26] <pulsar> on the same machine
[0:26] <pulsar> i have another server with only one osd crashing
[0:27] <sagewk> pulsar: any chance you can start it up with 'debug filestore = 10' and let it run to crash one more time?
[0:27] <pulsar> sure
[0:29] <pulsar> sagewk: [osd] .... debug filestore = 10
[0:29] <pulsar> or is it [osd] .... osd debug file store = 10
[0:29] <pulsar> ?
[0:29] <sagewk> [osd]
[0:29] <sagewk> debug filestore = 10
[0:30] <pulsar> that space/underscore substitution thing is a bit confusing :)
[0:31] <pulsar> logging...
[0:32] <pulsar> if i was going to try to put that many directories into a ceph filesystem again, i take it it is a good idea to limit the number of children per directory to keep the mds memory usage low?
[0:33] <pulsar> i could come up with some path encoding based on md5 hashes to keep every fs node < 0xff entries for instance
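The path-encoding idea pulsar describes could look roughly like the sketch below. It is a minimal illustration in Python, not something taken from the conversation: the function name `sharded_path`, the `depth` parameter, and the `/mnt/ceph` mount point are all hypothetical. It assumes each intermediate directory level is named after two hex characters of the MD5 digest, which caps every intermediate level at 256 children (00..ff, so slightly above the "< 0xff" figure mentioned). Leaf directories hold whatever names share the same digest prefix, so `depth` would need to be chosen to match the expected total number of entries.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(name: str, depth: int = 3, root: str = "/mnt/ceph") -> PurePosixPath:
    """Map a flat name to a nested path derived from its MD5 digest.

    Hypothetical helper: each of the `depth` intermediate directories is
    named after two hex characters of the digest, so no intermediate
    directory can have more than 256 children.
    """
    # An MD5 hex digest is 32 characters, i.e. at most 16 two-character shards.
    if not 1 <= depth <= 16:
        raise ValueError("depth must be between 1 and 16")

    digest = hashlib.md5(name.encode("utf-8")).hexdigest()

    # Slice the digest into two-character directory names: "0c", "c1", ...
    shards = [digest[2 * i : 2 * i + 2] for i in range(depth)]

    return PurePosixPath(root, *shards, name)


if __name__ == "__main__":
    # Print where a few example names would land.
    for example in ("file-000001", "file-000002", "some-directory"):
        print(example, "->", sharded_path(example))
```

With `depth=3` there are 256^3 (about 16.7 million) possible leaf directories, so one billion entries would average around 60 per leaf, assuming the hash spreads names evenly. MD5 is used here only for distribution, not for security.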
[0:36] * verwilst (~verwilst@d51A5B5DF.access.telenet.be) Quit (Quit: Ex-Chat)
[0:44] <sagewk> for now, until mds frag = true by default
[0:46] <pulsar> which is unstable?
[0:46] * tnt_ (~tnt@80.63-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:46] <pulsar> or not available for 0.41?
[0:58] <sagewk> unstable with recovery
[1:02] <pulsar> ic
[1:24] * lofejndif (~lsqavnbok@191.Red-83-34-192.dynamicIP.rima-tde.net) Quit (Quit: Leaving)
[1:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[1:32] * nyeates (~nyeates@pool-173-59-237-128.bltmmd.fios.verizon.net) has joined #ceph
[1:38] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:39] * nyeates (~nyeates@pool-173-59-237-128.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[2:07] * Tv|work (~Tv__@aon.hq.newdream.net) Quit (Ping timeout: 480 seconds)
[2:07] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[2:28] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[2:39] * nyeates (~nyeates@pool-173-59-237-128.bltmmd.fios.verizon.net) has joined #ceph
[2:39] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[3:01] * nyeates (~nyeates@pool-173-59-237-128.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[3:19] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[4:43] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:59] * cattelan is now known as cattelan_away
[5:00] * dmick (~dmick@aon.hq.newdream.net) has left #ceph
[5:38] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:14] * Ormod_ (~valtha@ohmu.fi) Quit (Remote host closed the connection)
[6:15] * Ormod (~valtha@ohmu.fi) has joined #ceph
[8:04] * tnt_ (~tnt@80.63-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:31] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[10:06] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) Quit (Read error: Operation timed out)
[10:11] * sage (~sage@cpe-76-94-40-34.socal.res.rr.com) has joined #ceph
[10:58] * BManojlovic (~steki@ has joined #ceph
[11:15] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[11:48] * gohko_ (~gohko@natter.interq.or.jp) has joined #ceph
[11:49] * diegows (~diegows@ Quit (Read error: Operation timed out)
[11:50] * diegows (~diegows@ has joined #ceph
[11:53] * gohko (~gohko@natter.interq.or.jp) Quit (Ping timeout: 480 seconds)
[12:02] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[12:12] <pulsar> sagewk: logs are ready, see query/privmsg
[12:33] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (Quit: Lost terminal)
[13:19] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[13:25] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[13:59] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[14:00] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[14:21] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[14:57] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[15:11] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[16:50] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) has joined #ceph
[16:53] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Connection reset by peer)
[16:58] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[18:19] * mtk (~mtk@ool-44c35967.dyn.optonline.net) Quit (Remote host closed the connection)
[18:58] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[20:00] * phil_ (~quassel@chello080109010223.16.14.vie.surfer.at) Quit (Remote host closed the connection)
[21:01] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Quit: Leaving)
[21:08] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[22:29] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[23:12] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[23:18] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has left #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.