#ceph IRC Log

IRC Log for 2013-03-04

Timestamps are in GMT/BST.

[0:07] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[0:11] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) Quit (Ping timeout: 480 seconds)
[0:15] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[0:27] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[0:40] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[0:42] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:42] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:48] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[0:51] * leseb_ (~leseb@78.250.138.174) Quit (Remote host closed the connection)
[0:53] * leseb_ (~leseb@78.250.138.174) has joined #ceph
[0:53] * leseb_ (~leseb@78.250.138.174) Quit (Remote host closed the connection)
[0:58] * leseb (~leseb@78.251.55.64) Quit (Remote host closed the connection)
[0:58] * Philip__ (~Philip@hnvr-4dbb8b76.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[0:59] * leseb (~leseb@78.251.55.64) has joined #ceph
[0:59] * leseb (~leseb@78.251.55.64) Quit (Remote host closed the connection)
[1:02] * nwat (~nwatkins@soenat3.cse.ucsc.edu) Quit (Quit: nwat)
[1:07] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[1:09] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[1:11] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:12] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:16] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[1:27] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:28] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:45] * rinkusk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) has joined #ceph
[1:46] * amriksk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) has joined #ceph
[1:48] * danieagle (~Daniel@186.214.56.35) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[1:49] * amriksk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) Quit ()
[1:49] * rinkusk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) Quit (Quit: Leaving)
[1:50] * rinkusk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) has joined #ceph
[1:57] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[2:04] <rinkusk> help set
[2:07] * rinkusk (~amrik@CPEbc14015a7093-CMbc14015a7090.cpe.net.cable.rogers.com) Quit (Quit: Leaving)
[2:09] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: nwat)
[2:18] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[2:20] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[2:31] * BManojlovic (~steki@85.222.222.132) Quit (Quit: Ja odoh a vi sta 'ocete...)
[2:43] * diegows (~diegows@190.188.190.11) Quit (Ping timeout: 480 seconds)
[2:44] * BillK (~BillK@124.150.58.6) has joined #ceph
[2:49] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[2:55] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[2:57] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[2:59] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[3:23] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[3:50] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[3:58] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: nwat)
[3:58] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[3:59] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[4:06] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[4:07] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has left #ceph
[4:50] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[4:51] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[4:58] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[5:05] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[5:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:12] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:27] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[5:40] * calebamiles1 (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[5:41] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[5:43] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[5:47] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[5:51] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[5:59] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:59] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[5:59] * loicd (~loic@magenta.dachary.org) has joined #ceph
[6:22] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[6:33] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:46] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[6:51] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[6:58] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:59] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[7:00] * BillK (~BillK@124.150.58.6) Quit (Quit: Leaving)
[7:00] * BillK (~BillK@124.150.58.6) has joined #ceph
[7:18] * nwat (~nwatkins@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: nwat)
[7:26] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[7:26] <buck> is there an easy way to get debug logs out of failing binary test cases ? ceph_test_foo ?
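
(A rough sketch of one way to get that, assuming the gtest binaries honour CEPH_ARGS like the other Ceph tools do; ceph_test_foo and the filter are placeholders:)

    # route debug output to stderr and capture it per-run
    CEPH_ARGS="--log-to-stderr --debug-ms 1 --debug-objecter 20" \
        ./ceph_test_foo --gtest_filter='*FailingCase*' 2> ceph_test_foo.log
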
[7:27] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:38] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[7:51] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[7:59] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[8:00] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) Quit (Ping timeout: 480 seconds)
[8:14] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[8:22] * Philip__ (~Philip@hnvr-4dbb8b76.pool.mediaWays.net) has joined #ceph
[8:42] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[8:46] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Download IceChat at www.icechat.net)
[8:47] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[8:48] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit ()
[8:48] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[8:52] * gerard_dethier (~Thunderbi@85.234.217.115.static.edpnet.net) has joined #ceph
[8:55] * gucki (~smuxi@HSI-KBW-095-208-162-072.hsi5.kabel-badenwuerttemberg.de) has joined #ceph
[8:55] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) has joined #ceph
[9:06] * joshd1 (~jdurgin@2602:306:c5db:310:1d09:a5bc:b553:d43d) Quit (Ping timeout: 480 seconds)
[9:13] * leseb (~leseb@83.167.43.235) has joined #ceph
[9:15] * joshd1 (~jdurgin@2602:306:c5db:310:459f:8f6e:109f:a935) has joined #ceph
[9:21] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:26] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[9:44] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:46] * l0nk (~alex@83.167.43.235) has joined #ceph
[9:53] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) has joined #ceph
[9:54] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[9:58] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) Quit (Ping timeout: 480 seconds)
[9:58] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[9:59] * LeaChim (~LeaChim@b0faa0c8.bb.sky.com) has joined #ceph
[10:07] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:19] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:20] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) Quit (Read error: Connection reset by peer)
[10:20] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[10:21] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[10:23] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[10:46] * leseb_ (~leseb@78.251.60.12) has joined #ceph
[10:48] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[10:49] * morse (~morse@supercomputing.univpm.it) Quit (Quit: Bye, see you soon)
[10:50] * mynameisbruce_ (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[10:50] * mynameisbruce_ (~mynameisb@tjure.netzquadrat.de) Quit (Remote host closed the connection)
[10:58] <joelio> Anybody benched using the Threaded IO tester? I'm getting some pretty miserable results (1-2MB/s) from it. Wondering if this is normal and if not where to look for the bottleneck. Network and OSD util doesn't seem maxed out at all
[10:59] <joelio> (on a 1TB ext4 RBD)
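
(One way to tell whether the slowness is RADOS itself or the filesystem-on-RBD layer is to bench the pool directly; a sketch using the default rbd pool, and the --no-cleanup flag may not exist on older versions:)

    rados bench -p rbd 30 write --no-cleanup   # raw object write throughput
    rados bench -p rbd 30 seq                  # sequential reads of the objects left behind
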
[11:02] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[11:02] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Remote host closed the connection)
[11:07] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) has joined #ceph
[11:14] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[11:14] * Philip__ (~Philip@hnvr-4dbb8b76.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[11:20] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[11:20] * dxd828 (~davidmait@dxdservers.com) Quit (Quit: Lost terminal)
[11:20] * leseb (~leseb@83.167.43.235) Quit (Remote host closed the connection)
[11:25] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[11:32] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[11:37] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[11:51] * leseb (~leseb@83.167.43.235) has joined #ceph
[11:55] <ScOut3R> hello everyone
[11:56] <ScOut3R> i've created two very basic munin plugins, one for monitoring storage space and one for monitoring osd states
[11:57] <ScOut3R> if you are interested or just want to criticize it: https://github.com/munin-monitoring/contrib/tree/master/plugins/ceph
[11:58] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[12:03] * leseb (~leseb@83.167.43.235) Quit (Ping timeout: 480 seconds)
[12:03] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) has joined #ceph
[12:04] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:11] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[12:14] <joelio> ScOut3R: Couldn't have come at a better time, I will test today, thanks!
[12:19] <joelio> ScOut3R: Out of interest, would it not be better to pull the stats from JSON? Would imagine that's less susceptible to minor changes in output that could affect the awk'ing?
[12:20] <ScOut3R> joelio: thanks for pointing out the JSON output, i'll look into it
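
(For what it's worth, a minimal shell sketch of reading OSD state from JSON instead of awk'ing the plain text; the "up"/"in" field names and the compact formatting are assumed from the osd dump output of this era, and the munin field names are just examples:)

    ceph osd dump --format=json > /tmp/osddump.json
    osd_up=$(grep -o '"up":1' /tmp/osddump.json | wc -l)   # OSDs reported up
    osd_in=$(grep -o '"in":1' /tmp/osddump.json | wc -l)   # OSDs reported in
    echo "osd_up.value $osd_up"
    echo "osd_in.value $osd_in"
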
[12:32] * leseb (~leseb@83.167.43.235) has joined #ceph
[12:38] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:59] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) Quit (Ping timeout: 480 seconds)
[13:02] * The_Bishop (~bishop@cable-89-16-157-34.cust.telecolumbus.net) Quit (Read error: Connection reset by peer)
[13:14] * The_Bishop (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) has joined #ceph
[13:16] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[13:37] * leseb (~leseb@83.167.43.235) Quit (Remote host closed the connection)
[13:46] * joelio (~Joel@88.198.107.214) Quit (Quit: Changing server)
[13:46] * The_Bishop_ (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) has joined #ceph
[13:46] * The_Bishop (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) Quit (Read error: Connection reset by peer)
[13:51] * The_Bishop_ (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) Quit (Read error: Connection reset by peer)
[13:51] * The_Bishop_ (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) has joined #ceph
[13:56] * loicd1 (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[13:56] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Connection reset by peer)
[14:02] * joelio (~Joel@88.198.107.214) has joined #ceph
[14:06] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Read error: Connection reset by peer)
[14:06] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Remote host closed the connection)
[14:09] * JohansGlock (~quassel@kantoor.transip.nl) has joined #ceph
[14:15] * BillK (~BillK@124.150.58.6) Quit (Quit: Leaving)
[14:16] * BillK (~BillK@124.150.58.6) has joined #ceph
[14:19] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[14:28] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[14:30] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) has joined #ceph
[14:35] <janos> is there going to be any quorum issue if i want to move where a monitor lives? i have 3. i can reduce to 2 then bring up a new 3rd, or bring it up to 4 then take down one
[14:39] * markbby (~Adium@168.94.245.5) has joined #ceph
[14:39] <jluis> janos, this might help: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address
[14:40] * janos looks
[14:41] <janos> party, that answers that!
[14:41] <janos> thanks, and sorry for the time suck. i should have found this
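
(For the archive, the add-then-remove route from that doc, roughly; the mon ids "d" and "a" and the address are only examples:)

    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    ceph-mon -i d --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring   # new mon on the new host
    ceph mon add d 192.168.0.14:6789
    ceph-mon -i d --public-addr 192.168.0.14:6789
    ceph mon remove a        # only once the new mon has joined quorum
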
[14:56] * loicd1 working on http://tracker.ceph.com/issues/4321
[14:57] * loicd1 is now known as loicd
[15:00] * diegows (~diegows@200.68.116.185) has joined #ceph
[15:01] <joelio> Any way to tell if rbd caching is enabled from the ceph command line? The example I found is old
[15:06] * markbby (~Adium@168.94.245.5) Quit (Quit: Leaving.)
[15:07] * markbby (~Adium@168.94.245.5) has joined #ceph
[15:14] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[15:21] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[15:25] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[15:29] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Remote host closed the connection)
[15:30] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[15:33] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[15:48] * verwilst (~verwilst@46.179.57.17) has joined #ceph
[15:50] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[15:51] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[15:54] * drokita1 (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[15:57] * lx0 (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:59] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[16:01] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:03] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Remote host closed the connection)
[16:03] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[16:06] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:07] * yehuda_hm (~yehuda@2602:306:330b:1410:d9d2:d915:85bb:6501) has joined #ceph
[16:07] * drokita (~drokita@199.255.228.128) has joined #ceph
[16:07] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:13] * drokita (~drokita@199.255.228.128) Quit (Quit: Leaving.)
[16:13] <sstan> joelio: what's the example you found?
[16:14] <sstan> mapping a RBD of type 2 requires a particular kernel version?
[16:15] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Ping timeout: 480 seconds)
[16:15] <jmlowe> sstan: very recent
[16:15] <jmlowe> 3.6 maybe?
[16:15] <sstan> jmlowe: is it possible to install that on some 3.0 kernel?
[16:15] * drokita (~drokita@199.255.228.128) has joined #ceph
[16:16] <Gugge-47527> 3.8.1 kernels can't map format2
[16:16] <Gugge-47527> i don't know if 3.9-rc can
[16:16] <sstan> i.e. mapping rbd type 2 implies kernel > 3.6 ?
[16:17] <fghaas> um, sstan, so far as I know this isn't upstream yet at all
[16:17] <fghaas> (didn't check for the 3.9 merge window, though)
[16:17] <sstan> aww it's such a cool feature
[16:20] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Remote host closed the connection)
[16:21] <jmlowe> yeah, I thought it was in a recent merge but looks like it is slated for 3.9
[16:23] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[16:26] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Quit: Leaving)
[16:26] * eschnou (~eschnou@62-197-93-189.teledisnet.be) has joined #ceph
[16:27] <sstan> I was discussing rbd cp with someone here. I argued that it would be really fast if it was implemented as a local copy (in the OSD hard drives themselves) ...
[16:28] <sstan> so each object would just copy-paste itself... apparently that isn't how it works right now
[16:28] * senner (~Wildcard@68-113-232-90.dhcp.stpt.wi.charter.com) has joined #ceph
[16:30] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[16:31] <Elbandi_> hi, i have a assert like this: http://www.spinics.net/lists/ceph-devel/msg04336.html
[16:31] <Elbandi_> 0.56.3 ceph
[16:31] <Elbandi_> where do i send the log?
[16:36] * nhm (~nh@184-97-130-55.mpls.qwest.net) has joined #ceph
[16:38] <joelio> sstan: ceph --admin-daemon blah show config | grep rbd_cache
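
(Spelled out; the socket path is an example, and on bobtail the admin socket subcommand is, I believe, "config show". Since rbd_cache is a client-side option, for a running VM you would point at the librbd client's admin socket if one is configured:)

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep rbd_cache
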
[16:40] <joelio> Intra VM copies are pretty bad too, about 30MB/s
[16:41] <joelio> such a shame, everything else is ticking the boxes
[16:44] * yanzheng (~zhyan@jfdmzpr03-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:46] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[16:50] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[16:55] * gerard_dethier (~Thunderbi@85.234.217.115.static.edpnet.net) Quit (Quit: gerard_dethier)
[16:55] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[17:02] * eschnou (~eschnou@62-197-93-189.teledisnet.be) Quit (Ping timeout: 480 seconds)
[17:04] * jmlowe (~Adium@149.160.195.28) has joined #ceph
[17:08] * BillK (~BillK@124.150.58.6) Quit (Ping timeout: 480 seconds)
[17:10] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) has joined #ceph
[17:12] <sstan> joelio: map two rbds ... and dd between them
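
(A sketch of that test; the image names are throwaway examples and the /dev/rbdN numbering depends on map order:)

    rbd create bench1 --size 10240
    rbd create bench2 --size 10240
    rbd map bench1
    rbd map bench2
    dd if=/dev/rbd0 of=/dev/rbd1 bs=4M oflag=direct   # raw copy between the two mapped images
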
[17:19] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) has joined #ceph
[17:19] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) Quit ()
[17:20] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) has joined #ceph
[17:23] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:25] <joelio> sstan: wouldn't work in production though, once a user has a VM.
[17:26] * rinkusk (~Thunderbi@CPE00259c467789-CM00222d6c26a5.cpe.net.cable.rogers.com) has joined #ceph
[17:26] * vata (~vata@2607:fad8:4:6:b476:1af2:5ca1:9d51) has joined #ceph
[17:28] * eschnou (~eschnou@85.234.217.115.static.edpnet.net) Quit (Remote host closed the connection)
[17:31] <joelio> If I can crack quicker inter VM copies, then we have a bouncing new deployment waiting to be procured
[17:31] <joelio> they've promised me, like, real money too
[17:36] <joelio> There's no real way to provide a client block-side caching implementation inside of the libvirt handlers is there
[17:36] <joelio> ?
[17:36] <joelio> Appreciate I could back the rbd devices and change the setting in libvirt - just seems a little clunky
[17:36] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[17:37] <joelio> would prefer to see if it could be done via the rbd:/pool/image setup
[17:37] <joelio> rather than having to use the raw block rbd cached device
[17:38] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[17:38] <joelio> .. unless I have something fundamentally broken in terms of my OSD and PG placements?
[17:38] <joelio> hence why intra speeds are poor
[17:38] * joelio confuzzled
[17:46] * jmlowe (~Adium@149.160.195.28) Quit (Quit: Leaving.)
[17:49] * flakrat (~flakrat@eng-bec264la.eng.uab.edu) has joined #ceph
[17:52] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[17:52] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:00] * jmlowe (~Adium@149.160.195.28) has joined #ceph
[18:00] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Ping timeout: 480 seconds)
[18:02] * jmlowe (~Adium@149.160.195.28) Quit ()
[18:02] * jlogan1 (~Thunderbi@2600:c00:3010:1:217f:2c08:a1d4:e762) has joined #ceph
[18:03] * verwilst (~verwilst@46.179.57.17) Quit (Remote host closed the connection)
[18:06] * Cotolez (~aroldi@81.88.224.110) has joined #ceph
[18:06] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Ping timeout: 480 seconds)
[18:07] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Quit: Leaving...)
[18:07] * Cotolez (~aroldi@81.88.224.110) Quit ()
[18:14] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Always try to be modest, and be proud about it!)
[18:15] * tryggvil (~tryggvil@Router086.inet1.messe.de) has joined #ceph
[18:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:20] * tryggvil (~tryggvil@Router086.inet1.messe.de) Quit (Quit: tryggvil)
[18:20] <sstan> joelio : I don't know if LVM does that? or some Copy On Write technologies out there.. tell me if you find sometihng
[18:21] <joelio> sstan: bcache and dm-cache support lvm - however that would not help as the client wouldn't see rbd via lvm
[18:22] <joelio> again, it could be hacked to make work, but that's not the idea
[18:22] <joelio> it needs to happen inside libvirt
[18:23] <janos> when splitting up between public network and cluster network - do all the pieces need access to both?
[18:23] <janos> osd's, mon's, mds, etc
[18:25] * alram (~alram@38.122.20.226) has joined #ceph
[18:26] <janos> it kind of looks like just the osd's would set both
[18:26] <janos> but not sure
[18:27] <joelio> I have mine all on the cluster net and not accessible via any public
[18:27] <gregaf> OSDs get both a public and cluster address, and they communicate with the other OSDs via the cluster address, but with everybody else (MDS, monitor, Ceph clients) via their public address
[18:28] <janos> awesome
[18:28] <gregaf> everybody else only uses the "public" network
[18:28] <janos> sounds wonderful
[18:28] <janos> i know what i'm doing during lunch
[18:28] <janos> ;)
[18:29] <gregaf> but that doesn't mean that those addresses or networks need to actually map onto real networks, as long as each address can route messages to each other address that it needs to talk to ;)
[18:29] <janos> right
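
(A minimal ceph.conf sketch of that split; the subnets and addresses are examples, and the per-daemon addr lines are optional since they are normally derived from the network settings:)

    [global]
        public network  = 192.168.0.0/24
        cluster network = 10.10.0.0/24

    [osd.0]
        public addr  = 192.168.0.21
        cluster addr = 10.10.0.21
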
[18:38] * alram (~alram@38.122.20.226) Quit (Quit: Lost terminal)
[18:39] * markbby (~Adium@168.94.245.5) Quit (Quit: Leaving.)
[18:40] * markbby (~Adium@168.94.245.5) has joined #ceph
[18:40] * leseb (~leseb@3.46-14-84.ripe.coltfrance.com) Quit (Remote host closed the connection)
[18:44] * jtangwk1 (~Adium@2001:770:10:500:6d37:758b:ea66:61bb) has joined #ceph
[18:44] * jtangwk (~Adium@2001:770:10:500:cc42:58e9:2b8:b39) Quit (Read error: Connection reset by peer)
[18:46] * l0nk (~alex@83.167.43.235) Quit (Quit: Leaving.)
[18:49] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[18:51] * ScOut3R (~scout3r@BC24BBCE.dsl.pool.telekom.hu) has joined #ceph
[18:56] * esammy (~esamuels@host-2-103-100-166.as13285.net) Quit (Ping timeout: 480 seconds)
[19:01] * chutzpah (~chutz@199.21.234.7) has joined #ceph
[19:04] * rturk-away is now known as rturk
[19:04] * portante (~user@66.187.233.206) Quit (Quit: updating)
[19:04] * The_Bishop_ (~bishop@2001:470:50b6:0:3174:f827:84cf:bf7) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[19:10] * sstan (~chatzilla@dmzgw2.cbnco.com) Quit (Remote host closed the connection)
[19:10] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[19:11] * Philip_ (~Philip@hnvr-4d07b385.pool.mediaWays.net) Quit (Ping timeout: 480 seconds)
[19:13] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[19:15] <dynaAFK> how should I diagnose slow requests during a recovery? dmesg isn't showing any issues.
[19:15] * dynaAFK is now known as dynamike
[19:15] * sstan (~chatzilla@dmzgw2.cbnco.com) has joined #ceph
[19:16] <sstan> what happens when the journal fails before data gets written to disk ?
[19:16] * lightspeed (~lightspee@fw-carp-wan.ext.lspeed.org) has joined #ceph
[19:25] * alram (~alram@38.122.20.226) has joined #ceph
[19:30] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[19:31] <fghaas> sstan: not sure if there's an earlier part of the discussion that I missed, but if your journal write fails, then that means the OSD write just hasn't completed, and my expectation is ceph-osd would just die (meaning the affected PGs are first degraded, and would then recover once the osd down out interval expires)
[19:32] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[19:33] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has joined #ceph
[19:35] * leseb (~leseb@78.250.146.234) has joined #ceph
[19:38] * leseb (~leseb@78.250.146.234) Quit (Remote host closed the connection)
[19:41] * dpippenger (~riven@216.103.134.250) has joined #ceph
[19:46] * Cube (~Cube@12.248.40.138) has joined #ceph
[19:48] * diegows (~diegows@200.68.116.185) Quit (Ping timeout: 480 seconds)
[19:56] * lx0 is now known as lxo
[19:58] * jtang1 (~jtang@79.97.135.214) has joined #ceph
[20:00] * noob21 (~cjh@173.252.71.3) has joined #ceph
[20:06] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[20:08] <noob21> if i'm using a 3.2.x kernel on centos do i need to compile the modules for ceph also? Does it come in the default packages?
[20:09] <gregaf> it's in upstream linux; not sure what package version you're using though
[20:10] <noob21> i was thinking of using bobtail's stuff
[20:10] <noob21> so i'd have to do some compiling
[20:12] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) Quit (Ping timeout: 480 seconds)
[20:25] * leseb (~leseb@78.251.35.227) has joined #ceph
[20:26] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has joined #ceph
[20:28] * noob22 (~cjh@173.252.71.1) has joined #ceph
[20:28] <sstan> fghaas: Is it possible to have a journal write succeed, but then the journal fails before data is transferred to the hard drive?
[20:29] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[20:29] <sstan> e.g. journal storage fails while data is being transferred
[20:30] * janeUbuntu (~jane@2001:3c8:c103:a001:54ed:2bf5:26ad:2af8) Quit (Remote host closed the connection)
[20:30] <sstan> so writes are acknowledged to the client only when they are in the OSD (past the journal) ?
[20:30] * BManojlovic (~steki@85.222.222.132) has joined #ceph
[20:31] * leseb_ (~leseb@78.251.60.12) Quit (Ping timeout: 480 seconds)
[20:32] * noob21 (~cjh@173.252.71.3) Quit (Ping timeout: 480 seconds)
[20:32] * drokita1 (~drokita@199.255.228.128) has joined #ceph
[20:32] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:32] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has left #ceph
[20:33] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:33] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has joined #ceph
[20:35] * drokita (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[20:36] <phantomcircuit> sstan, put the journal on the same device as the filestore
[20:36] <phantomcircuit> but that's going to be terrible for performance
[20:36] <sstan> exactly ..
[20:36] <gregaf> writes are acked to a client once they're in the journal sstan — if you somehow lose the node before it goes to the backing store and the journal gets killed, though, you can recover from the replicas
[20:36] <sstan> I'm trying to figure out how data might be lost
[20:36] <gregaf> unless the same thing happened to them all, in which case yes you lost some writes after losing multiple disks
[20:39] <sstan> gregaf : when replica size > 1, writes are acknowledged after being written in ALL osd's journals?
[20:39] <gregaf> yes
[20:40] <fghaas> sstan: gregaf will correct me if I'm wrong, but after that journal write failure the OSD will immediately go down, taking the node out of the list of those a client may read from, for that PG. All subsequent reads and writes will therefore hit a different OSD, with your misbehaving one no longer in the picture. Then after the OSD is marked out, its data is recovered to a different node, and you're fully redundant again
[20:40] <sstan> aaah
[20:40] <gregaf> data must be durable to all acting OSDs before it's acked to the client
[20:42] <sstan> gregaf : data --> osd_1's journal --> osd_2's journal --> ack ... That's cool!
[20:42] * drokita1 (~drokita@199.255.228.128) Quit (Ping timeout: 480 seconds)
[20:43] * mcclurmc_laptop (~mcclurmc@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[20:44] <sstan> so if I'm recording something live, I won't lose a single frame...
[20:44] <fghaas> gregaf, for the sake of completeness, if you have replicas > 2, it's actually client->primary osd and from there to the replica osds in parallel as opposed to in sequence, correct?
[20:45] * noob22 (~cjh@173.252.71.1) Quit (Quit: Leaving.)
[20:46] <sstan> fghaas : yeah I'm looking at figure2 in "Rados: A Scalable [...]"
[20:47] <sstan> acks are sent after writes occur on at least 2 OSDs
[20:47] <fghaas> sstan, right, but not all of that paper is still current and/or still implemented, for example primary-copy is the only still-current replication mode
[20:47] <fghaas> splay replication is gone
[20:47] <fghaas> (or whatever it was called)
[20:50] <sstan> thanks for the info guys : ) Now would you agree that keeping journals on RAM is OK provided Btrfs & no power failure (on all cluster members at the same time) will occur
[20:52] <lurbs> No power outages, guaranteed? That's deep magic.
[20:52] <fghaas> sstan: nope.
[20:53] * cashmont (~cashmont@c-76-18-76-30.hsd1.nm.comcast.net) has left #ceph
[20:53] <fghaas> your journal has to be persistent, if you lose a journal you lose the OSD and need to reinitialize it from scratch
[20:53] <fghaas> there's currently no way to recover an osd from the filestore alone
[20:55] <lurbs> I was under the impression that a btrfs backed OSD would be in a known and coherent state, such that re-creating the journal and waiting for back end replication could fix it.
[20:55] <lurbs> That's based on half reading conversations in here, though.
[20:56] <lurbs> And I'm in no way advocating either btrfs OSDs or RAM-backed journals.
[21:00] <fghaas> lurbs, as far as I know gregaf, sjust and others discussed the possibility of implementing this, but not that it was already available
[21:04] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[21:09] * leseb_ (~leseb@78.250.157.233) has joined #ceph
[21:11] * flakrat (~flakrat@eng-bec264la.eng.uab.edu) Quit (Quit: Leaving)
[21:19] <junglebells> I was using RAM-backed journals for the increased performance as I don't have SSD's for my journaling but it was also my intention to move the journal back to a persistent disk once I'm prepared for production.
[21:20] <junglebells> It's gone fine short of having 1 of 3 nodes regularly dying (hard kernel panic). We're in the process of narrowing it down to find the root cause. Once we finish updating all the firmwares, if it's still happening we're going to reach out to our reseller as it's likely a hardware fault.
[21:20] * jjgalvez (~jjgalvez@12.248.40.138) has joined #ceph
[21:22] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[21:24] * ScOut3R (~scout3r@BC24BBCE.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[21:27] * esammy (~esamuels@host-2-99-4-178.as13285.net) has joined #ceph
[21:27] * esammy (~esamuels@host-2-99-4-178.as13285.net) has left #ceph
[21:27] * esammy (~esamuels@host-2-99-4-178.as13285.net) has joined #ceph
[21:28] * esammy (~esamuels@host-2-99-4-178.as13285.net) has left #ceph
[21:31] * eschnou (~eschnou@29.89-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:31] * leseb (~leseb@78.251.35.227) Quit (Ping timeout: 480 seconds)
[21:36] * Cotolez (~aroldi@81.88.224.110) has joined #ceph
[21:38] <dmick> junglebells: hardware faults are so much fun
[21:39] <junglebells> oh yea.... too much fun
[21:39] <Cotolez> Hi all
[21:39] <dmick> hello Cotolez
[21:39] <jmlowe> lurbs: We have 1/2 ton flywheels magnetically suspended in a vacuum that make sure our ups has a chance to kick in and work long enough for our generators to kick on
[21:40] <Cotolez> I have a ceph test cluster and for 2 days I have had some big problems:
[21:40] <Cotolez> the cluster has only 1 storage node
[21:40] <Cotolez> with 11 osd
[21:41] <jmlowe> Cotolez: go on
[21:41] * leseb (~leseb@78.251.62.129) has joined #ceph
[21:41] <Cotolez> and for 2 days, suddenly, i have 6 osds that go down and after a while come up again
[21:42] <jmlowe> what filesystem are you using to back those osd's?
[21:42] <Cotolez> one of the osd log: http://pastebin.com/DghiYruA
[21:42] <Cotolez> jmlowe: XFS
[21:42] <Cotolez> the mon log: http://pastebin.com/yjLnAEyJ
[21:42] <Cotolez> ceph osd tree: http://pastebin.com/TUCXR7jf
[21:43] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[21:44] <jmlowe> Cotolez: not something I've run across myself, looks like the osd is getting hung up and not replying to heartbeats in a timely fashion
[21:44] <jmlowe> Cotolez: somebody else here probably has more insight
[21:45] <Cotolez> It seems that the problems begin when ceph starts to scrub
[21:45] <Cotolez> deep scrub
[21:46] <Cotolez> Thanks jmlowe, I'll wait for some hints.
[21:47] <gregaf> fghaas, lurbs, sstan: sorry, lunch time! :) yes, replication happens in parallel to the non-primary OSD, and we only do primary-copy replication
[21:47] * yehudasa (~yehudasa@2607:f298:a:607:5890:b330:fbe6:fad2) Quit (Ping timeout: 480 seconds)
[21:47] <jmlowe> I've had some similar problems but that was all due to lost packets, everything on one host rules that out
[21:47] <gregaf> if you use a btrfs OSD then you can lose and replace the journal and it will revert to an older state; that is implemented via the standard cases (though you would need to run the commands to recreate the journal)
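
(Roughly what recreating the journal involves, I believe, using osd.0 as an example; run with the daemon stopped, and on btrfs expect the store to roll back to its last consistent state as described above:)

    ceph-osd -i 0 --flush-journal   # only if the old journal is still readable
    ceph-osd -i 0 --mkjournal       # initialize a fresh journal for osd.0
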
[21:48] * noob21 (~cjh@173.252.71.1) has joined #ceph
[21:48] <dmick> Cotolez: what version are you running?
[21:48] <Cotolez> dmick: 0.56.3 on Ubuntu 12.04
[21:49] <Cotolez> gregaf: I'm using XFS
[21:49] <dmick> 11 OSDs on the same host all scrubbing will use up some machine resources; I know there have been some very recent fixes to change relative priorities of traffic, but I'm not sure if they would affect "heartbeat vs. scrub traffic"
[21:49] <gregaf> sorry, those comments weren't for you Cotolez :)
[21:49] <Cotolez> gregaf: :)
[21:49] <gregaf> was that your email on ceph-users this morning?
[21:50] <Cotolez> yes
[21:51] <jmlowe> dmick: you thinking the osd's memory got swapped out and missed some heartbeats while being swapped back in?
[21:51] <gregaf> I didn't go through it all, but it looks like your disk accesses are taking too long, Cotolez
[21:52] <dmick> jmlowe: heartbeat response should be pretty low-resource, it's true
[21:54] * leseb__ (~leseb@78.251.55.34) has joined #ceph
[21:54] * rturk is now known as rturk-away
[21:55] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[21:55] * loicd (~loic@magenta.dachary.org) has joined #ceph
[21:55] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[21:55] * yehudasa (~yehudasa@2607:f298:a:607:8c77:4d39:f04c:96ab) has joined #ceph
[21:56] <sstan> fghaas : actually, I rebooted my machines (journal on RAM) several times. OSDs were recoverable (I use btrfs)
[21:57] * leseb (~leseb@78.251.62.129) Quit (Ping timeout: 480 seconds)
[21:57] <sstan> lurbs : all machines failing at the same time is unlikely provided no human error
[21:57] <fghaas> sstan: fair enough; commendable bravery :) (if that's a prod system, that is)
[21:58] <fghaas> that comment of mine was for the use of btrfs, relying on the absence of human error is not brave, that's downright insane :)
[21:58] * noob21 (~cjh@173.252.71.1) Quit (Quit: Leaving.)
[21:58] <sstan> haha I see
[21:59] <jmlowe> sstan: I couldn't take the stress anymore, I'm in the process of redoing all of my osd's with xfs
[21:59] <sstan> 1/2 ton flywheels ... one must make some big blunder to power-outage your datacenter
[21:59] <sstan> jmlowe : btrfs is stressing you ?
[22:00] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Read error: Connection reset by peer)
[22:00] <jmlowe> I've had two btrfs filesystems eat themselves in the past week
[22:01] <lurbs> And a fsck di... Oh.
[22:01] <jmlowe> http://pastebin.com/SC7dHmX9
[22:02] <jmlowe> lurbs: I have cursed Chris Mason many a time
[22:03] <jmlowe> lurbs: at least he stopped teasing us with "it just needs a few more weeks of work"
[22:03] <janos> did he say how many people working a few more weeks? ;)
[22:03] <jmlowe> touche
[22:04] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has left #ceph
[22:05] <lurbs> Although the last XFS filesystem I had die caused xfs_repair to segfault, so that wasn't helpful either.
[22:05] * leseb__ (~leseb@78.251.55.34) Quit (Read error: Connection reset by peer)
[22:06] * leseb (~leseb@78.251.38.57) has joined #ceph
[22:06] <jmlowe> LA LA LA, I don't want to hear about xfs dying
[22:07] <janos> hahaha
[22:07] <loicd> :-D
[22:07] <janos> so what i'm hearing is - replication size 3, 1/3 osd's are ext4, 1/3 xfs, 1/3 btrfs
[22:07] <lurbs> Someone put a previously failed disk back into an array, and it decided to trust the data without resyncing. Not really XFS' fault.
[22:08] <janos> and see how it goes
[22:09] <jmlowe> seriously, this is my plan to avoid my 02:00, 04:00, and 06:00 ceph -s in a cold sweat with trembling fingers almost making it impossible to type
[22:11] * leseb__ (~leseb@78.250.157.233) has joined #ceph
[22:13] * The_Bishop (~bishop@e177089229.adsl.alicedsl.de) has joined #ceph
[22:13] <loicd> leseb__: good evening sir :-)
[22:15] * leseb_ (~leseb@78.250.157.233) Quit (Ping timeout: 480 seconds)
[22:19] * mcclurmc_laptop (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[22:21] <wer> Hey guys, I had some osd's block, so the kernel seemingly killed them. It brought down my node when load increased; I think all the others hit the suicide timeout. I have logs but here is dmesg. http://pastebin.com/MzRJkagw
[22:23] <wer> This only happened on one node out of 7, and we were doing some load testing at the time. I don't know why this has to happen.
[22:35] <nz_monkey_> Hey does anyone have any benchmarks from a cluster running 10Gbit ethernet with say 20 OSD's on 4 nodes ?
[22:35] <nz_monkey_> we have a POC running but cannot get intel x520 NIC's to run in our boards
[22:35] <nhm> nz_monkey_: those numbers could be highly variable
[22:35] <nz_monkey_> so we are hitting 200MB/s during rados bench on 3 nodes, which decreases as we add more nodes
[22:36] <todin> nz_monkey_: I have 4 nodes with each 4 osds
[22:36] <nz_monkey_> pointing to network bottleneck
[22:36] <nz_monkey_> nhm: Understood. We are just trying to locate our bottleneck
[22:36] <nhm> nz_monkey_: do you know how to check the admin socket?
[22:36] <nz_monkey_> todin: What sort of numbers are you seeing from rados bench ?
[22:37] <nz_monkey_> nhm: No, how do we do that ?
[22:37] <todin> nz_monkey_: what is your rados bench command line? I will let it run
[22:37] <nz_monkey_> rtfm I am guessing ;)
[22:38] <nhm> nz_monkey_: one sec
[22:38] <nz_monkey_> todin: rados bench -p tier3 300 write
[22:38] <nz_monkey_> todin: where tier3 is your pool name
[22:39] <nz_monkey_> todin: we see around 200MB/s with 3 nodes using bonded gigabit ethernet; as we add a 4th node it drops to around 180MB/s, then with 5 nodes around 160MB/s
[22:41] <todin> I get around 700MB/s.
[22:41] <nz_monkey_> todin: Ok, so you are running 10GE ?
[22:41] <todin> nz_monkey_: yep
[22:42] <nz_monkey_> todin: Excellent, thanks, this is really helpful for us
[22:42] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:45] <nhm> nz_monkey_: try something like this in a loop through your OSDs on each node: ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok dump_ops_in_flight | grep num_ops
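
(One way to loop that over a node's OSDs; the socket glob assumes the default /var/run/ceph layout:)

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        printf '%s ' "$sock"
        ceph --admin-daemon "$sock" dump_ops_in_flight | grep num_ops
    done
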
[22:46] <Cotolez> excuse me, does anybody have some other hint to help me get out of my bad situation? I see that when deep scrub starts, the CPU is at 100%. Is there a way to lower the scrub priority?
[22:46] <nz_monkey_> nhm: Thanks, i'll write a script to do this.
[22:47] <nhm> nz_monkey_: what you may notice is that operations back up on specific OSDs.
[22:47] <nz_monkey_> nhm: Which will most likely indicate network performance issues I am guessing
[22:47] * eschnou (~eschnou@29.89-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[22:48] <nhm> nz_monkey_: maybe, have you done any tests with iperf?
[22:48] * leseb__ (~leseb@78.250.157.233) Quit (Remote host closed the connection)
[22:49] <nz_monkey_> nhm: Yes, we have 4x Gigabit ethernet on Intel i350 per node. With a single iperf stream we get ~970Mbit; as we increase threads this spreads over the other NICs due to our use of L3/L4 hashing algorithm and increases up to around 3.5Gbit
[22:51] <nz_monkey_> nhm: we suspect the issue is that the hashing is not particularly effective in conjunction with ceph, as when we run bwm-ng during a rados bench we can see particular NICs maxing out, yet others have very little traffic on them
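
(For reference, a quick way to exercise the bond hashing with iperf; the address and stream count are examples:)

    iperf -s                          # on the receiving node
    iperf -c 10.10.0.21 -P 4 -t 30    # on the sender: 4 parallel streams for 30 seconds
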
[22:51] * portante (~user@66.187.233.206) has joined #ceph
[22:51] * leseb_ (~leseb@78.250.157.233) has joined #ceph
[22:52] <nz_monkey_> nhm: we have a pile of Intel x520 10gbit NICs but the gigabyte boards we are using for our POC nodes won't POST with them
[22:52] <nhm> nz_monkey_: strange
[22:52] <nhm> nz_monkey_: can't even get into bios?
[22:52] <nz_monkey_> nhm: No, not at all
[22:52] <nz_monkey_> nhm: they are Q77 chipset boards with ivybridge xeon's
[22:53] <nhm> nz_monkey_: huh. Tried all of the various PCIE slots?
[22:53] <nz_monkey_> nhm: we are just using them to get an idea of performance factors, so we can work out what hardware we purchase for production
[22:54] <nz_monkey_> nhm: Yes, there are only two PCI-e slots. We can put the Intel i350 NIC's in either the x16 or x8 slot and it works perfectly, as soon as we put an x520 in either slot, no post
[22:55] <nz_monkey_> nhm: we have a ticket open with Gigabyte, no idea if it will progress; I can't imagine too many people are running 10 gigabit NICs on their enterprise desktop boards....
[22:56] <janos> if you send me one, i promise i'll help break the mold and play video games with it
[22:57] <nz_monkey_> HAHAHA
[22:57] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:59] <nz_monkey_> they are these ones http://www.gigabyte.co.nz/products/product-page.aspx?pid=4237#ov we got them due to being Q77 so they have IP KVM and also had 6 internal SATA connectors so we didn't need to buy any LSI SAS cards
[22:59] <nz_monkey_> everything has been going great until we tried 10GE cards in them
[22:59] * leseb (~leseb@78.251.38.57) Quit (Remote host closed the connection)
[23:00] <janos> looks like the one presently on my deck that i stabbed to death and then stomped on after getting completely fed up last week!
[23:00] <janos> though that was an amd one
[23:00] * noob21 (~cjh@173.252.71.3) has joined #ceph
[23:01] <nz_monkey_> I always try to get boards that have intel chipset, intel nic, usually have less problems than with dodgy Realtek or broadcom stuff
[23:01] <nz_monkey_> or we just buy Dell boxes, but for POC they were a little out of budget
[23:01] <janos> bummer, i see the intel boards are still using those holes to mount. last one i got the stock heatsink was REALLY flakey about holding on
[23:06] * leseb_ (~leseb@78.250.157.233) Quit (Ping timeout: 480 seconds)
[23:07] * jtang1 (~jtang@79.97.135.214) Quit (Quit: Leaving.)
[23:08] <gregaf> Cotolez: ah, your CPU can't drive the number of OSDs and their filesystems you're using, then
[23:08] <gregaf> if you're on btrfs you could switch to xfs (lower CPU utilization), but otherwise you'll just have to reduce the number of daemons you've got
[23:08] <sstan> one daemon per core is what's recommended I think
[23:10] * jtang1 (~jtang@79.97.135.214) has joined #ceph
[23:10] <janos> 1ghz of cpu per daemon i thought
[23:10] <janos> as a rough estimate
[23:11] * jtang1 (~jtang@79.97.135.214) Quit ()
[23:11] * loicd trying to remember how to keep /tmp/cephtest when running teuthology
[23:12] <MrNPP> what would cause an osd to reject my auth?
[23:13] <MrNPP> but i can create images just fine, just can't mount
[23:13] * jtang1 (~jtang@79.97.135.214) has joined #ceph
[23:13] * jtang1 (~jtang@79.97.135.214) Quit ()
[23:14] <Cotolez> gregaf: maybe, but it ran flawlessly for 35 days until 2 days ago. It sounds strange anyway, doesn't it?
[23:14] <Cotolez> and it's a dual xeon 4-core
[23:14] <gregaf> well, you've got logs full of complaints about IO going into a black hole and not coming back, and you're saying that it's running at 100% CPU
[23:15] <Cotolez> now i'm trying to increase the osd scrub thread timeout. Might that help?
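
(The scrub knobs being discussed, written as [osd] config entries; the defaults are quoted from memory, so treat them as assumptions, and changes take effect on daemon restart or via injectargs:)

    [osd]
        osd scrub thread timeout  = 120   # default is 60, I believe; this is what Cotolez is raising
        osd max scrubs            = 1     # max concurrent scrubs per OSD (already the default)
        osd scrub load threshold  = 0.5   # skip starting new scrubs when loadavg exceeds this
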
[23:16] <gregaf> loicd: just forwarded a message on teuthology to ceph-devel that should help you :)
[23:16] <dmick> Cotolez: has the usage been the same for 35 days?
[23:16] <Cotolez> yes
[23:16] <loicd> gregaf: oh thanks. Checking
[23:16] <loicd> I figure my problem comes from https://github.com/ceph/teuthology/blame/master/teuthology/task/ceph.py#L359
[23:16] <MrNPP> http://paste.scurvynet.com/?fd480626aed1cc80#ZKORLy+rcfPpHPdHvfi0gehuR4oKdjK7dJhvoJFJLgM=
[23:16] <dmick> then yes, that is strange; static situations stay static
[23:16] <loicd> because it complains that ceph-coverage is nowhere
[23:17] <dmick> something must have changed somewhere.
[23:17] <gregaf> filesystem aging?
[23:17] <Cotolez> gregaf: sorry?
[23:17] <dmick> filesystems get slower as they age, from fragmentation
[23:18] <dmick> if you're near a performance limit that could be enough to push you over the edge
[23:18] <loicd> gregaf: http://marc.info/?l=ceph-devel&m=136243535715479&w=4 is very relevant indeed :-) Thanks !
[23:19] <jmlowe> there is also the free block hunting as it fills up
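
(If fragmentation from aging is the suspicion, XFS at least makes it easy to check; the device and mount point below are placeholders:)

    xfs_db -r -c frag /dev/sdb1             # report the fragmentation factor
    xfs_fsr -v /var/lib/ceph/osd/ceph-0     # online defragmentation of the mounted filesystem
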
[23:24] <loicd> I'm still unsure why ceph-coverage is not found. I now better understand the change in the directory layout but I don't think it's related to the absence of ceph-coverage.
[23:24] <Cotolez> gregaf: the filesystem has 40 days
[23:24] <loicd> It would be most helpful to see the output of which ceph-coverage on a machine that successfully runs teuthology.
[23:25] <gregaf> I really don't know anything about it loicd, you'll have to ask joshd for help on that one ;)
[23:25] <loicd> I mean $(which ceph-coverage) ( confusing sentence...)
[23:25] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:25] <loicd> gregaf: ok :-) I'll send a mail and keep looking.
[23:26] <joshd> loicd: /usr/bin/ceph-coverage, it's in the ceph-test package now
[23:26] <loicd> oh!
[23:26] <loicd> joshd thanks for saving me the trouble of a mail
[23:27] <joshd> loicd: the larger change is to use package installs instead of tarballs, so if you put the install task before the ceph task, you should have everything you need
[23:28] <loicd> I did not realize it happened already. Great news :-) Congratulations !
[23:29] * rturk-away is now known as rturk
[23:30] <loicd> ceph-coverage is not in ceph-test 0.56.3-1quantal ... maybe a more recent one
[23:30] <joshd> thanks to slang and sage :)
[23:30] <dmick> 376cca2d4d4f548ce6b00b4fc2928d2e6d41038f
[23:31] * loicd reading https://github.com/ceph/teuthology/blob/master/teuthology/task/install.py
[23:34] * vata (~vata@2607:fad8:4:6:b476:1af2:5ca1:9d51) Quit (Quit: Leaving.)
[23:41] <loicd> joshd did things change significantly in the way code coverage is collected ? Previously it required getting a tarbal on the host running teuthology ( wget -O /tmp/build/tmp.tgz http://gitbuilder.ceph.com/ceph-tarball-precise-x86_64-gcov/sha1/$(cat /tmp/a1/ceph-sha1)/ceph.x86_64.tgz ) to be able to run ./virtualenv/bin/teuthology-coverage -v --html-output /tmp/html \ --lcov-output /tmp/lcov \ --cov-tools-dir $(pwd)/coverage \ /tmp
[23:42] * markl (~mark@tpsit.com) Quit (Quit: leaving)
[23:42] * markl (~mark@tpsit.com) has joined #ceph
[23:42] <joshd> loicd: no, I think that part is the same. it may need to be adjusted as well
[23:44] <loicd> joshd thanks
[23:45] <wer> claims to be 10.9.2.132:6813/4704 not 10.9.2.132:6813/4014 - wrong node is flying through one of my osd logs... and throughput of the cluster is not great. does this need to be reopened http://tracker.ceph.com/issues/4006 I am running 0.57-1~bpo70+1
[23:47] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:50] <loicd> joshd gregaf : INFO:teuthology.run:pass \o/ . Nice conclusion for a great sunny day in Paris :-)
[23:51] <gregaf> yay :)
[23:51] <joshd> hooray!
[23:53] * rinkusk (~Thunderbi@CPE00259c467789-CM00222d6c26a5.cpe.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[23:56] <noob21> are there any ceph builds for centos 5?
[23:56] <noob21> this is prob a dumb question haha
[23:56] <darkfader> like, client? or everything?
[23:56] <darkfader> i think there's none of either
[23:56] <noob21> everything
[23:57] <noob21> yeah i thought so
[23:57] <darkfader> at least if you consider recent versions
[23:57] <darkfader> you might find something for 0.48 or older
[23:57] <noob21> hmm ok
[23:57] <dmick> there has been work done on it
[23:57] <dmick> but no official builds, no
[23:57] <noob21> centos 5 is pretty crusty i'll admit
[23:57] <dmick> nhm might know something about it
[23:58] <darkfader> noob21: is there a really huge benefit from trying it on centos5 for you?
[23:58] <nhm> uh oh
[23:58] <darkfader> ...benefit for you... anyway
[23:58] * tryggvil (~tryggvil@2a02:8108:80c0:1d5:a195:a4a5:4c86:e4e7) has joined #ceph
[23:58] <nhm> noob21: I got ceph to compile on RHEL5. I built a new GCC and entire tool chain / library base to make it work. :)
[23:59] <noob21> wow
[23:59] <noob21> well i'm running a recent kernel at least
[23:59] <noob21> 3.2.x
[23:59] <nhm> noob21: yeah, 3.5 on these systems.
[23:59] <noob21> gotcha
[23:59] <nhm> noob21: It can work, but it's not pleasant.
[23:59] <darkfader> nhm: do you use pkgsrc to bring up the fresher toolchain or do you do that using rpms?
[23:59] <noob21> yeah how painful is it

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.