#ceph IRC Log


IRC Log for 2013-07-16

Timestamps are in GMT/BST.

[0:00] <dmick> johnu: do you expect your cluster status not to be OK?
[0:03] <johnu> dmick: :). I was asking whether we can verify the settings we used during installation. So I tried looking at ceph.conf, but it had only the initial mon.
[0:07] <dmick> you mean you added more than one mon? (first I heard of that)
[0:07] * markbby (~Adium@ Quit (Quit: Leaving.)
[0:08] <dmick> ceph.conf has whatever you put in it, with ceph-deploy or otherwise. Mons will join the cluster as they come alive. If you're trying to connect to a cluster with some mons down, and the mon in the ceph.conf is down, and you don't use -m to select a monhost, you might not connect, so you can avoid that by mentioning more mons in ceph.conf
[0:08] <dmick> maybe that is your question; I'm not sure what settings you want to verify, but the ceph command queries a lot of cluster information
[0:08] * johnu (~oftc-webi@dhcp-171-71-119-30.cisco.com) Quit (Remote host closed the connection)
[0:08] * johnu (~oftc-webi@dhcp-171-71-119-30.cisco.com) has joined #ceph
[0:09] <johnu> that was my question. Do we need to do that manually?
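
A minimal sketch of dmick's point about listing more than one monitor in ceph.conf, so that a client still has something to try when the first mon is down. The file contents, host name, and addresses below are hypothetical, and the sketch assumes the underscore spelling of the options (`mon_host`, `mon_initial_members`); a real file may spell them with spaces instead.

```python
import configparser

# Hypothetical ceph.conf contents; the fsid, host name and addresses are made up.
SAMPLE_CONF = """
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_initial_members = mon-a
mon_host = 192.0.2.11,192.0.2.12,192.0.2.13
"""


def monitor_hosts(conf_text):
    """Return the monitor addresses listed under mon_host in [global]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("global", "mon_host", fallback="")
    return [host.strip() for host in raw.split(",") if host.strip()]


hosts = monitor_hosts(SAMPLE_CONF)
print(f"{len(hosts)} monitor(s) a client could try: {hosts}")
```

If the list comes back with a single entry, that is the situation dmick describes: a client without -m has only that one monitor to contact.
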
[0:09] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[0:13] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Remote host closed the connection)
[0:14] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[0:23] <johnu> when I try mounting cephfs, it fails with the error " bad option at 'secretfile=admin.secret' " (shown in dmesg)
[0:24] <johnu> has anybody come across this error?
[0:24] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:25] * mtanski (~mtanski@ Quit (Ping timeout: 480 seconds)
[0:25] * sunday (~sunday@ has joined #ceph
[0:27] <gregaf> johnu: is "admin.secret" a file which contains the key?
[0:29] <johnu> yes. i created a file with the key listed in ceph.client.admin.keyring
[0:29] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[0:30] <sunday> I'd like clarification on this ceph-deploy command: "ceph-deploy install {hostname [hostname] ...}"
[0:31] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[0:32] <sunday> Is [hostname] a list or array of hosts in the cluster to be installed?
[0:32] <gregaf> johnu: what's the full command you're trying?
[0:33] <gregaf> I don't see "bad option" anywhere in our repository, which makes me think it's coming out of the standard mount tool and never reaching ours
[0:33] <dmick> sunday: the syntax "{word [word..]}" means "at least one of word, but maybe more than one"
[0:33] <gregaf> which would happen if you don't actually have mount.ceph available for some reason
[0:33] <dmick> so, yes, a list of hosts
[0:34] <johnu> sudo mount -t ceph /mnt/mycephfs -o name=admin,secretfile=admin.secret gives the error "mount: wrong fs type, bad option, bad superblock on", and in dmesg: [ 2549.447826] libceph: bad option at 'secretfile=admin.secret'
[0:34] <sunday> thanks, what { hostname [hostname]...}
[0:35] <sunday> what of { hostname [hostname]...}
[0:35] <dmick> sunday, is that a question? and if so, what is the question?
[0:36] <gregaf> johnu: ah, kernel output, right
[0:36] <sunday> yes, this is the parameter passed to ceph-deploy
[0:36] <sunday> { hostname [hostname]...}
[0:36] <johnu> yes.
[0:37] <gregaf> johnu: I bet you don't actually have mount.ceph available on the system, so it's trying to use a generic path
[0:37] <gregaf> try checking for that
[0:37] <grepory> will ceph-deploy even support separate cluster and public networks?
[0:38] <grepory> or should i just deploy manually?
[0:38] <johnu> oh.. What should I do ?
[0:38] <gregaf> "whereis mount.ceph"?
[0:38] <sunday> what is expected to be substituted for these hostnames?
[0:38] <gregaf> if it turns up blank, figure out how to install it
[0:39] <johnu> yup. It is absent. i need to see package for it in ubuntu
[0:39] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[0:40] <johnu> gregaf: do you know the package name?
[0:40] <gregaf> on any vaguely newish ubuntu it's packaged, but I confess I don't remember where :)
[0:41] <gregaf> looks like probably ceph-fs-common, though
[0:41] <johnu> ok.. let me try out
[0:42] <dmick> sunday: I'm sorry, I really don't understand what you're asking
[0:45] <johnu> gregaf :)..Thanks a lot.. it worked
[0:46] <gregaf> yay
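
The check gregaf suggests ("whereis mount.ceph") can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the extra /sbin and /usr/sbin directories are an assumption about where mount helpers usually live.

```python
import os
import shutil


def find_mount_helper(name="mount.ceph"):
    """Return the path of the mount helper, or None if it is not installed."""
    # Assumption: helpers live on PATH or in the usual sbin directories.
    search_path = os.pathsep.join(
        [os.environ.get("PATH", ""), "/sbin", "/usr/sbin"]
    )
    return shutil.which(name, path=search_path)


helper = find_mount_helper()
if helper is None:
    print("mount.ceph not found; the secretfile= option will reach the kernel unparsed")
else:
    print(f"mount.ceph found at {helper}")
```

When the helper is missing, the fix discussed above is to install the package that ships it (gregaf's guess for Ubuntu was ceph-fs-common).
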
[0:46] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[0:48] * oddomatik (~Adium@ Quit (Read error: Connection reset by peer)
[0:48] * oddomatik (~Adium@ has joined #ceph
[0:48] <johnu> :)
[0:48] * oddomatik (~Adium@ Quit ()
[0:48] <grepory> ceph-deploy probably just shouldn't be used for setting up production environments huh?
[0:49] * oddomatik (~Adium@ has joined #ceph
[0:49] * BManojlovic (~steki@237-231.197-178.cust.bluewin.ch) Quit (Remote host closed the connection)
[0:50] * BillK (~BillK-OFT@124-148-212-240.dyn.iinet.net.au) has joined #ceph
[0:55] <guppy> oh?
[0:55] <dmick> sunday: do you mean "what hostnames should you actually supply to the ceph-deploy install command"? Surely it must be obvious that the answer is "the hostnames on which you want to install the ceph packages"?
[0:57] <sunday> I got confused because of the list inside a dictionary
[0:57] <dmick> sunday: this is not Python code; that's command-line syntax, and it means what I said it means
[0:58] <dmick> http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=%2Fcom.ibm.swg.im.iis.common.doc%2Fcommon%2Fcommand_conventions.html
[1:03] * tnt_ (~tnt@92.203-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[1:03] <sunday> thanks dmick, the link really helped
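
As an illustration of the "{hostname [hostname] ...}" convention dmick explains (at least one hostname, possibly more), here is a small argparse sketch. It is not ceph-deploy's actual implementation; the program name and host names are hypothetical.

```python
import argparse

# Hypothetical stand-in for a deploy tool; not ceph-deploy's real parser.
parser = argparse.ArgumentParser(prog="deploy-sketch")
sub = parser.add_subparsers(dest="command", required=True)
install = sub.add_parser("install")
# nargs="+" means "one or more", which is what {hostname [hostname] ...} says.
install.add_argument("hostname", nargs="+", help="host(s) to install packages on")

args = parser.parse_args(["install", "node1", "node2"])
print(args.hostname)
```

Calling the parser with "install" and no hostname makes argparse exit with a usage error, which matches the "at least one" half of the convention.
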
[1:05] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[1:07] * oddomatik (~Adium@ Quit (Quit: Leaving.)
[1:15] * mschiff (~mschiff@port-49445.pppoe.wtnet.de) Quit (Remote host closed the connection)
[1:17] * mschiff (~mschiff@port-49445.pppoe.wtnet.de) has joined #ceph
[1:18] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[1:21] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has left #ceph
[1:22] * sagelap (~sage@2600:1012:b024:9857:61dd:2b6f:b08f:b063) has joined #ceph
[1:22] * janos (~janos@static-71-176-211-4.rcmdva.fios.verizon.net) has joined #ceph
[1:23] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[1:28] * johnu (~oftc-webi@dhcp-171-71-119-30.cisco.com) Quit (Remote host closed the connection)
[1:30] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[1:31] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[1:32] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[1:34] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[1:40] * dosaboy_ (~dosaboy@host86-163-12-187.range86-163.btcentralplus.com) has joined #ceph
[1:42] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[1:44] * jakes (~oftc-webi@128-107-239-233.cisco.com) has joined #ceph
[1:45] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[1:47] * dosaboy (~dosaboy@host86-150-246-156.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:48] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:48] * dosaboy (~dosaboy@host86-163-11-195.range86-163.btcentralplus.com) has joined #ceph
[1:50] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[1:52] * DarkAceZ (~BillyMays@ has joined #ceph
[1:53] * jakes (~oftc-webi@128-107-239-233.cisco.com) Quit (Remote host closed the connection)
[1:54] * dosaboy_ (~dosaboy@host86-163-12-187.range86-163.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[1:59] <sage> dmick: fixed 4779, but need to squash that one patch before it is merged.
[2:01] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[2:05] <dmick> k
[2:06] <dmick> looking
[2:09] * dosaboy_ (~dosaboy@host86-145-216-186.range86-145.btcentralplus.com) has joined #ceph
[2:09] * mschiff (~mschiff@port-49445.pppoe.wtnet.de) Quit (Remote host closed the connection)
[2:15] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[2:16] * dosaboy (~dosaboy@host86-163-11-195.range86-163.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[2:24] * nwat (~oftc-webi@eduroam-251-132.ucsc.edu) Quit (Quit: Page closed)
[2:25] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[2:30] * dosaboy_ (~dosaboy@host86-145-216-186.range86-145.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[2:31] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Ping timeout: 480 seconds)
[2:37] * dosaboy (~dosaboy@host86-161-201-230.range86-161.btcentralplus.com) has joined #ceph
[2:40] * Henson_D (~kvirc@ has joined #ceph
[2:50] * Tamil (~tamil@ Quit (Quit: Leaving.)
[2:50] <Henson_D> hello everyone. I'm using ceph 0.61. Is it possible to remove an MDS server once it has been created with ceph-deploy? I've tried "ceph mds rm 0" as per Sébastien Han's blog, but that didn't work. Ceph documentation says "coming soon", and I haven't been able to find instructions anywhere on how to do it. Is there an undocumented command that can do it, or is it not possible?
[2:51] * dosaboy_ (~dosaboy@host86-161-163-52.range86-161.btcentralplus.com) has joined #ceph
[2:53] * dosaboy_ (~dosaboy@host86-161-163-52.range86-161.btcentralplus.com) Quit (Read error: Connection reset by peer)
[2:56] * sunday (~sunday@ Quit (Quit: Ex-Chat)
[2:56] * dosaboy_ (~dosaboy@host86-161-162-175.range86-161.btcentralplus.com) has joined #ceph
[2:58] * dosaboy (~dosaboy@host86-161-201-230.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:00] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[3:09] * dosaboy_ (~dosaboy@host86-161-162-175.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:10] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[3:12] * yy (~michealyx@ has joined #ceph
[3:12] * dosaboy (~dosaboy@host86-145-220-48.range86-145.btcentralplus.com) has joined #ceph
[3:17] * dosaboy_ (~dosaboy@host86-163-14-113.range86-163.btcentralplus.com) has joined #ceph
[3:18] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) has joined #ceph
[3:22] * dosaboy (~dosaboy@host86-145-220-48.range86-145.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:26] * dosaboy_ (~dosaboy@host86-163-14-113.range86-163.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:28] * dosaboy (~dosaboy@host86-150-243-74.range86-150.btcentralplus.com) has joined #ceph
[3:35] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[3:43] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[3:49] * john_barbee_ (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) has joined #ceph
[3:49] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[3:49] * john_barbee_ is now known as john_barbee
[3:50] * zhangbo (~zhangbo@ has joined #ceph
[3:55] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[3:57] * dosaboy_ (~dosaboy@host86-150-247-233.range86-150.btcentralplus.com) has joined #ceph
[4:00] * dosaboy (~dosaboy@host86-150-243-74.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[4:07] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[4:10] * dosaboy (~dosaboy@host86-150-247-137.range86-150.btcentralplus.com) has joined #ceph
[4:13] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[4:13] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[4:15] * dosaboy_ (~dosaboy@host86-150-247-233.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[4:17] * joshd1 (~joshd@2602:306:c5db:310:b0be:ab79:188d:4fc9) Quit (Ping timeout: 480 seconds)
[4:18] * mnash_ (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[4:19] * Gugge-47527 (gugge@kriminel.dk) Quit (Read error: Connection reset by peer)
[4:19] * mnash (~chatzilla@vpn.expressionanalysis.com) Quit (Read error: Connection reset by peer)
[4:19] * ofu (ofu@dedi3.fuckner.net) Quit (Read error: Connection reset by peer)
[4:19] * ofu (ofu@dedi3.fuckner.net) has joined #ceph
[4:19] * Gugge-47527 (gugge@kriminel.dk) has joined #ceph
[4:19] * mnash_ is now known as mnash
[4:43] * julian (~julianwa@ has joined #ceph
[4:43] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[4:45] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 22.0/20130618035212])
[4:56] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[4:58] * dosaboy_ (~dosaboy@host86-161-201-156.range86-161.btcentralplus.com) has joined #ceph
[4:58] * rkeene (1011@oc9.org) Quit (Quit: -.0)
[5:00] * fireD (~fireD@93-139-145-242.adsl.net.t-com.hr) has joined #ceph
[5:02] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[5:03] * Henson_D (~kvirc@ Quit (Quit: KVIrc 4.1.3 Equilibrium http://www.kvirc.net/)
[5:05] * dosaboy (~dosaboy@host86-150-247-137.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[5:07] * fireD1 (~fireD@93-142-200-96.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:14] * Guest160 (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[5:26] * dosaboy (~dosaboy@host86-150-245-16.range86-150.btcentralplus.com) has joined #ceph
[5:33] * dosaboy_ (~dosaboy@host86-161-201-156.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[5:36] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[5:42] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[5:43] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[5:44] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[5:46] * Volture (~quassel@office.meganet.ru) has joined #ceph
[5:48] * sagelap1 (~sage@ has joined #ceph
[5:50] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[5:52] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[5:55] * sagelap (~sage@2600:1012:b024:9857:61dd:2b6f:b08f:b063) Quit (Ping timeout: 480 seconds)
[6:28] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[6:30] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[6:30] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit ()
[6:42] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[6:46] * yy (~michealyx@ has left #ceph
[6:50] * dosaboy (~dosaboy@host86-150-245-16.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[6:53] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[6:58] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[6:58] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[6:59] * dosaboy (~dosaboy@host86-145-218-189.range86-145.btcentralplus.com) has joined #ceph
[7:06] * dosaboy_ (~dosaboy@host86-164-138-58.range86-164.btcentralplus.com) has joined #ceph
[7:11] * dosaboy (~dosaboy@host86-145-218-189.range86-145.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[7:11] * julian (~julianwa@ Quit (Quit: afk)
[7:17] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) Quit (Ping timeout: 480 seconds)
[7:17] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has joined #ceph
[7:22] * dosaboy (~dosaboy@host86-161-206-113.range86-161.btcentralplus.com) has joined #ceph
[7:25] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:27] * dosaboy_ (~dosaboy@host86-164-138-58.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[7:32] * dosaboy_ (~dosaboy@host86-150-247-105.range86-150.btcentralplus.com) has joined #ceph
[7:33] * yy (~michealyx@ has joined #ceph
[7:37] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[7:38] * dosaboy (~dosaboy@host86-161-206-113.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[7:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[7:46] * dosaboy_ (~dosaboy@host86-150-247-105.range86-150.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[7:52] * dosaboy (~dosaboy@host86-164-141-14.range86-164.btcentralplus.com) has joined #ceph
[7:53] * tnt (~tnt@92.203-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:57] * mschiff (~mschiff@port-8247.pppoe.wtnet.de) has joined #ceph
[8:11] * dosaboy_ (~dosaboy@host86-164-141-1.range86-164.btcentralplus.com) has joined #ceph
[8:18] * dosaboy (~dosaboy@host86-164-141-14.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[8:24] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) has joined #ceph
[8:26] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[8:29] * sagelap1 (~sage@ Quit (Ping timeout: 480 seconds)
[8:29] * trond (~trond@trh.betradar.com) Quit (Quit: leaving)
[8:41] * tnt (~tnt@92.203-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[8:43] * odyssey4me (~odyssey4m@ has joined #ceph
[8:44] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) Quit (Quit: s2r2)
[8:47] * dosaboy (~dosaboy@host86-164-143-67.range86-164.btcentralplus.com) has joined #ceph
[8:49] * KindTwo (KindOne@h118.30.131.174.dynamic.ip.windstream.net) has joined #ceph
[8:50] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[8:52] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[8:52] * KindTwo is now known as KindOne
[8:53] * dosaboy_ (~dosaboy@host86-164-141-1.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[9:07] * KindTwo (KindOne@ has joined #ceph
[9:08] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[9:08] * KindTwo is now known as KindOne
[9:10] * dosaboy_ (~dosaboy@host86-159-117-155.range86-159.btcentralplus.com) has joined #ceph
[9:12] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[9:14] * hybrid512 (~walid@106-171-static.pacwan.net) has joined #ceph
[9:16] * dosaboy (~dosaboy@host86-164-143-67.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[9:17] * s2r2 (~s2r2@ has joined #ceph
[9:17] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:17] * ChanServ sets mode +v andreask
[9:30] * AfC (~andrew@2001:44b8:31cb:d400:a1a9:56c1:6194:69da) has joined #ceph
[9:31] * leseb (~Adium@ has joined #ceph
[9:34] * dosaboy (~dosaboy@host86-164-216-206.range86-164.btcentralplus.com) has joined #ceph
[9:35] * haomaiwang (~haomaiwan@ Quit (Remote host closed the connection)
[9:35] * haomaiwang (~haomaiwan@notes4.com) has joined #ceph
[9:38] * dosaboy_ (~dosaboy@host86-159-117-155.range86-159.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[9:39] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) Quit (Remote host closed the connection)
[9:39] * dosaboy_ (~dosaboy@host86-163-11-2.range86-163.btcentralplus.com) has joined #ceph
[9:42] * dosaboy (~dosaboy@host86-164-216-206.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[9:54] <ccourtaut> morning
[9:54] * n3c8-35575 (~mhattersl@pix.office.vaioni.com) has joined #ceph
[9:57] * allsystemsarego (~allsystem@ has joined #ceph
[9:58] <joelio> ccourtaut: mornin'
[10:00] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[10:03] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[10:04] <loicd> ccourtaut: morning !
[10:05] <odyssey4me> good morning :)
[10:06] <odyssey4me> infernix: I see significant improvement in performance on Ubuntu 12.04 LTS with the newer 3.8 kernel... apt-get install -y linux-image-generic-lts-raring fdutils linux-tools-lts-raring; update-grub
[10:08] <odyssey4me> I've been thinking... xfs has a journal, as does ceph... does it make sense to disable the xfs journal or put it onto a separate disk to improve performance?
[10:09] <tnt> yeah ... but no.
[10:09] <tnt> they serve different purposes.
[10:09] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[10:10] * LeaChim (~LeaChim@ has joined #ceph
[10:10] <odyssey4me> is it best to leave the xfs journal at its defaults, but to place the osd journals on a separate disk?
[10:11] <tnt> yes.
[10:12] * leseb (~Adium@ Quit (Quit: Leaving.)
[10:13] <odyssey4me> great, thanks tnt
[10:14] <odyssey4me> one more thing - if all I have are spinning disks (no ssd's), what's the best way to split up the osd's and their journals? note that I have a decent RAID controller (cache, writeback, BBU, etc)
[10:18] * mschiff (~mschiff@port-8247.pppoe.wtnet.de) Quit (Remote host closed the connection)
[10:18] * bergerx_ (~bekir@ has joined #ceph
[10:21] <odyssey4me> assuming 8 disks I'm thinking: 2 for OS (mirrored pair), 1 for journaling, 5 for RAID5 set
[10:22] <tnt> One thing to know is that if the journal is lost, then you can pretty much throw the rest of the OSD away too.
[10:23] * n3c8-35575 (~mhattersl@pix.office.vaioni.com) Quit (Read error: Connection reset by peer)
[10:24] <odyssey4me> tnt - sure, but if I'm spreading two copies of each block between several servers then the likelihood of that happening and me losing all data is low
[10:25] <tnt> yes. It was more a reflection on why would you bother with RAID5 :)
[10:26] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[10:26] <tnt> all data writes are going to hit the journal first, so with 1 disk as journal, your write rate is going to be limited to the bandwidth of that disk.
[10:27] <tnt> now if you take your 6 disks and put 6 OSDs on them with the journal on the same drive, assuming that the journal write speed is going to be ~ 1/3 of total bw, you'd get 2x the write speed of the disk.
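
tnt's back-of-the-envelope comparison can be written out as arithmetic. This is a rough sketch under his stated assumption that an OSD sharing a spindle with its own journal delivers about 1/3 of the disk's raw write bandwidth; the function names and the normalized bandwidth of 1.0 are illustrative only.

```python
def dedicated_journal_write_bw(disk_bw, journal_disks=1):
    """All writes pass through the journal disk(s) first, so they cap the rate."""
    return disk_bw * journal_disks


def colocated_journal_write_bw(disk_bw, osd_disks, effective_fraction=1 / 3):
    """Each OSD keeps its journal on its own drive and gets a fraction of its bandwidth."""
    return disk_bw * osd_disks * effective_fraction


DISK_BW = 1.0  # normalized: one disk's sequential write bandwidth

single = dedicated_journal_write_bw(DISK_BW)
colocated = colocated_journal_write_bw(DISK_BW, osd_disks=6)
print(f"1 journal disk : {single:.1f}x one disk")
print(f"6 OSDs, journal on each: {colocated:.1f}x one disk")
```

The 1/3 figure is tnt's estimate, not a measured value; a different fraction changes the result proportionally.
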
[10:28] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) has joined #ceph
[10:29] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[10:30] <odyssey4me> tnt - hmm, that was my initial thinking and why I initially configured it that way
[10:30] <odyssey4me> right now I have 2 x OS (mirrored pair), 6 x OSD
[10:31] <odyssey4me> I'm trying to work out whether I can improve performance at all.
[10:31] <odyssey4me> My limitation is that every one of my OSD servers is also a client.
[10:31] * zhangbo (~zhangbo@ Quit (Quit: Leaving)
[10:32] <odyssey4me> My use case is that these servers are KVM hosts and that I'm putting the VM's into RBD using the qemu librbd integration.
[10:33] * fridudad (~oftc-webi@fw-office.allied-internet.ag) Quit (Quit: Page closed)
[10:34] <odyssey4me> The read & write speed in this configuration right now is too low. I'm only getting around 1Mb/s throughput for writes and 3Mb/s read from inside the VM.
[10:34] * Volture (~quassel@office.meganet.ru) Quit (Remote host closed the connection)
[10:35] * Volture (~quassel@office.meganet.ru) has joined #ceph
[10:35] <odyssey4me> That said, I have yet to enable caching which may help.
[10:35] * Volture (~quassel@office.meganet.ru) Quit ()
[10:35] * Volture (~quassel@office.meganet.ru) has joined #ceph
[10:35] <joelio> odyssey4me: definitely using virtio inside the VM too?
[10:35] <tnt> Yeah, I have the same thing, except using Xen. Each server is also OSD.
[10:36] <odyssey4me> joelio - yes
[10:38] <odyssey4me> Compare this to just running a qcow preallocated disk on an XFS filesystem on top of a RAID5 set, where I get similar write performance (which still seems low), but read throughput is between 5 and 7 Mb/s.
[10:39] <joelio> btw I'd definitely not use just 1 disk for journal and 5 for OSDs/RAID - unless the OSDs are spinners and the journal disk an SSD. If you have just spinners, put the journal directly on the OSD drive
[10:40] <tnt> as a partition rather than a file.
[10:40] <joelio> odyssey4me: rbd caching won't help reading btw
[10:40] <odyssey4me> What really freaks me out is that natively on the host the read & write performance is 10x better. I did testing comparing the RAID5 disk set (6 disks) against a 3 server RBD cluster using the kernel module and the performance was roughly the same for both.
[10:40] <joelio> what speed are your interconnects?
[10:41] <odyssey4me> Bonded 10G interfaces with l2+l3 hashing between all 3 servers.
[10:42] <odyssey4me> tnt - ok, so you'd recommend 6 x OSD - but partition each drive into two: a small one for the journal and the rest for the data?
[10:42] * leseb (~Adium@ has joined #ceph
[10:42] <odyssey4me> joelio - so the caching is to improve write performance?
[10:42] * LiRul (~lirul@ has joined #ceph
[10:43] <LiRul> hi
[10:43] <LiRul> are there any ceph-related options to hardening configuration?
[10:43] <joelio> odyssey4me: yes, rbd caching is. It buffers the writes back to the OSDs, so that the file sync is quicker
[10:44] <joelio> can't speed up reads in this way as it still needs to travel 'across the wire'
[10:44] <odyssey4me> joelio - noted, thanks... I'll still add it to the tests
[10:45] * joelio would investigate some dm-cache, bcache, or the other one whose name escapes me.
[10:45] <joelio> more difficult to get libvirt integration though iirc
[10:46] <tnt> odyssey4me: https://gist.github.com/smunaut/5433222
[10:46] <tnt> can you try this small benchmark utility
[10:47] <tnt> you just create an image of like 5G or so and run the utility twice on it. (the first time will be biased because the image will be sparse)
[10:49] <ofu> i have a question regarding librbd caching: rbd_cache_size is pretty small by default (32MByte)... wouldn't it be cool if you put it on a local SSD of your KVM host with GBytes of storage?
[10:50] * leseb (~Adium@ Quit (Ping timeout: 480 seconds)
[10:51] <joelio> ofu: that's what dm-cache and bcache etc are for
[10:51] <joelio> rbd is in memory .. i.e. much faster than ssd
[10:51] <joelio> *cache
[10:52] <ofu> joelio: dm-cache and bcache on the ceph-hosts, not on the KVM hosts?
[10:52] <joelio> no, kvm hosts - although achieving this with libvirt I think is currently difficult
[10:53] * mxmln (~maximilia@ has joined #ceph
[10:53] <ofu> how do I use dm-cache/bcache through rbd? I dont have a local block device, so how do I offload reads/writes with SSDs?
[10:53] * ofu is confused
[10:55] <tnt> the problem with too much cache is that the writes have to hit the cluster to ensure the reliability ... or else you might as well give up the redundancy ceph provides.
[10:56] <ofu> sure
[10:57] * topro_ (~prousa@host-62-245-142-50.customer.m-online.net) has joined #ceph
[10:57] * topro_ is now known as topro__
[10:58] <topro__> is there a way to increase peering of a pg during recovery process as this is the only inactive pg I have right now so it can't receive IO
[10:58] <topro__> ^increase priority
[11:07] * stacker666 (~stacker66@ has joined #ceph
[11:08] * dosaboy__ (~dosaboy@host86-164-219-216.range86-164.btcentralplus.com) has joined #ceph
[11:15] * dosaboy_ (~dosaboy@host86-163-11-2.range86-163.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[11:19] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) has joined #ceph
[11:25] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[11:31] <topro__> fyi, pg dump 1.95 query showed me that this might be due to waiting for osd.8 so i just tried restarting osd.8 and it brought pg 1.95 back online
[11:36] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[11:38] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) has joined #ceph
[11:45] * trewer (~dgesa@bl9-52-187.dsl.telepac.pt) has joined #ceph
[11:51] * trewer is now known as guteryn
[11:54] * leseb (~Adium@ has joined #ceph
[11:58] * mschiff (~mschiff@p4FD7FAB1.dip0.t-ipconnect.de) has joined #ceph
[12:03] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:08] * yy (~michealyx@ has left #ceph
[12:09] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[12:11] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[12:16] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[12:18] * janisg (~troll@ Quit (Ping timeout: 480 seconds)
[12:18] * janisg (~troll@ has joined #ceph
[12:21] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[12:33] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:39] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[12:45] * john_barbee (~jbarbee@c-50-165-106-164.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[12:48] * guteryn (~dgesa@bl9-52-187.dsl.telepac.pt) Quit (autokilled: Spamming. Mail support@oftc.net if you feel this is in error (2013-07-16 10:48:18))
[12:49] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:52] * ScOut3R_ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[12:58] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[13:06] * Joel (~chatzilla@2001:620:0:25:1102:d021:f719:4e5e) has joined #ceph
[13:07] <odyssey4me> tnt / joelio - sorry, was out for a meeting
[13:07] <odyssey4me> tnt - I've been using iozone. What will this benchmark tool do?
[13:08] <tnt> it doesn't run in the VM. It uses librbd directly.
[13:08] <tnt> and basically it reads and writes the entire image sequentially and gives pretty much an "upper limit" of perf
[13:14] * yanzheng (~zhyan@ has joined #ceph
[13:21] * s2r2 (~s2r2@ has joined #ceph
[13:22] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[13:23] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:23] * ChanServ sets mode +v andreask
[13:27] * infinitytrapdoor (~infinityt@ has joined #ceph
[13:29] * jcfischer (~fischer@user-28-11.vpn.switch.ch) has joined #ceph
[13:30] <jcfischer> we are investigating a problem with our mds servers. We have two, and they seem to be stuck in up:replay
[13:30] <jcfischer> health HEALTH_WARN mds cluster is degraded
[13:31] <jcfischer> mdsmap e543: 1/1/1 up {0=myoko=up:replay}, 1 up:standby
[13:31] <jcfischer> we have enabled debug logging on both mds and they seem to be quite happy and quiet
[13:32] <jcfischer> 2013-07-16 13:32:01.418445 7f4606e1c700 1 -- [2001:620:0:6::d0]:6800/12620 <== mon.3 [2001:620:0:6::10e]:6789/0 56 ==== mdsbeacon(137808/ineri up:standby seq 42 v543) v2 ==== 107+0+0 (3202644810 0 0) 0x15d3600 con 0x14e8f20
[13:32] <jcfischer> 2013-07-16 13:32:01.418471 7f4606e1c700 10 mds.-1.0 handle_mds_beacon up:standby seq 42 rtt 0.002181
[13:32] <jcfischer> 2013-07-16 13:32:01.418478 7f4606e1c700 10 -- [2001:620:0:6::d0]:6800/12620 dispatch_throttle_release 107 to dispatch throttler 107/104857600
[13:32] <jcfischer> 2013-07-16 13:32:01.418486 7f4606e1c700 20 -- [2001:620:0:6::d0]:6800/12620 done calling dispatch on 0x15d3600
[13:32] <jcfischer> 2013-07-16 13:32:02.412345 7f4604d17700 20 mds.-1.bal get_load no root, no load
[13:32] <jcfischer> 2013-07-16 13:32:02.412401 7f4604d17700 15 mds.-1.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.22>
[13:33] <jcfischer> however, we have left this running for several hours (the seq nr has increased) but no change to the mds status
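
For watching whether the active MDS ever leaves up:replay, the mdsmap status line jcfischer pasted can be picked apart with a regular expression. This is a sketch based only on that one line; the format may differ between Ceph versions, and the function name is hypothetical.

```python
import re


def parse_mdsmap(line):
    """Extract epoch, per-rank (name, state) and standby count from an mdsmap line."""
    epoch = int(re.search(r"mdsmap e(\d+):", line).group(1))
    ranks = {
        int(rank): (name, state)
        for rank, name, state in re.findall(r"(\d+)=([^=,}]+)=(up:[a-z_]+)", line)
    }
    standby_match = re.search(r"(\d+) up:standby", line)
    standby = int(standby_match.group(1)) if standby_match else 0
    return {"epoch": epoch, "ranks": ranks, "standby": standby}


status = parse_mdsmap("mdsmap e543: 1/1/1 up {0=myoko=up:replay}, 1 up:standby")
stuck = [name for name, state in status["ranks"].values() if state == "up:replay"]
print(f"epoch {status['epoch']}, in replay: {stuck}, standby: {status['standby']}")
```
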
[13:35] * dosaboy_ (~dosaboy@host86-161-161-91.range86-161.btcentralplus.com) has joined #ceph
[13:35] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[13:36] <jcfischer> here's some more log file: http://pastebin.com/UjAHrvxf
[13:38] <yanzheng> that's standby mds' log
[13:39] <yanzheng> try killing mds myoko
[13:39] <jcfischer> we did
[13:39] * diegows (~diegows@ has joined #ceph
[13:39] <jcfischer> but let me try again :)
[13:40] * dosaboy__ (~dosaboy@host86-164-219-216.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[13:40] * tserong_ (~tserong@58-6-128-204.dyn.iinet.net.au) has joined #ceph
[13:40] <jcfischer> here's the log from myoko: http://pastebin.com/ig167KHT
[13:41] <odyssey4me> tnt - hmm, trying to figure out how to compile this... I have the -lib dependencies, but it's not working so I must be missing something. Do you have a guide of some sort for compiling it?
[13:42] <odyssey4me> for example: /tmp/ccF2SYLr.o: In function `rb_discard':
[13:42] <odyssey4me> rbd_bench.c:(.text+0x41): undefined reference to `rbd_aio_release'
[13:42] <jcfischer> mds restarted, but no change in behaviour
[13:42] <yanzheng> is your osd cluster healthy?
[13:42] <jcfischer> yes
[13:43] <jcfischer> I tried to check to see what the mds_requests are (see http://tracker.ceph.com/issues/4742) but the mds_requests command is not understood by ceph --admin-daemon
[13:45] <yanzheng> set debug mds = 20 and try again
[13:45] * tserong (~tserong@58-6-103-73.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[13:45] <jcfischer> that is already set
[13:46] <tnt> odyssey4me: gcc rbd_bench.c -lrbd -lrados
[13:46] <tnt> and add -o rbd_bench
[13:47] * dosaboy__ (~dosaboy@host86-164-218-146.range86-164.btcentralplus.com) has joined #ceph
[13:47] <odyssey4me> aha, thanks
[13:49] <yanzheng> can you find text 'replay_start' in mds log
[13:50] <jcfischer> got it
[13:52] <jcfischer> http://pastebin.com/WGJYmHpA
[13:52] <jcfischer> that is the log after the latest restart
[13:53] * dosaboy_ (~dosaboy@host86-161-161-91.range86-161.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[13:53] * markbby (~Adium@ has joined #ceph
[13:55] <odyssey4me> tnt - Read: 2005.33 Mb/s (107374182400 bytes in 51064 ms), Write: 315.58 Mb/s (107374182400 bytes in 324483 ms)
[13:55] <yanzheng> what version is your mds
[13:56] <tnt> odyssey4me: redo it a second time. (first time just fills the sparse image).
[13:56] * odyssey4me does so
[13:56] <jcfischer> yanzheng: 0.61.3
[13:58] <jcfischer> root@myoko:~# ceph-mds --version
[13:58] <jcfischer> ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
[13:58] * dosaboy__ (~dosaboy@host86-164-218-146.range86-164.btcentralplus.com) Quit (Remote host closed the connection)
[13:59] <jcfischer> but the rest is 0.61.2
[13:59] <jcfischer> we have just updated mds this morning
[13:59] <yanzheng> no idea what happens, looks like the mds failed to open the log
[14:00] <yanzheng> what version is your osd
[14:01] <jcfischer> root@myoko:~# ceph-osd --version
[14:01] <jcfischer> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
[14:02] <jcfischer> I just realized it's a bit of a mess: most nodes are on 0.61.3, one is on 0.61.4 and myoko is on 0.61.2
[14:03] <odyssey4me> tnt - Read: 1320.54 Mb/s (107374182400 bytes in 77544 ms), Write: 330.82 Mb/s (107374182400 bytes in 309530 ms)
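The "Mb/s" figures printed by the benchmark appear to be MiB per second: dividing the byte count by 1048576 and by the elapsed seconds reproduces them. A minimal sketch of that arithmetic, assuming the numbers pasted above; `mib_per_s` is a hypothetical helper and not part of rbd_bench.

```shell
# Convert a byte count and an elapsed time in milliseconds to MiB/s.
# Hypothetical helper for checking the pasted figures, not part of rbd_bench.
mib_per_s() {
    LC_ALL=C awk -v bytes="$1" -v ms="$2" \
        'BEGIN { printf "%.2f\n", bytes / 1048576 / (ms / 1000) }'
}

mib_per_s 107374182400 77544     # read, second run
mib_per_s 107374182400 309530    # write, second run
```

So the read result is roughly 1.3 GiB/s, which is in line with tnt's "more than 1GByte/s" remark further down.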
[14:03] <jcfischer> (but there are no OSDs on myoko)
[14:04] <tnt> odyssey4me: ok. So that's pretty good :) what's the hw config of your cluster ?
[14:05] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[14:07] <odyssey4me> tnt - 3 ibm x3550's, all have 2xOS disks mirrored, two have 6x600GB SAS (6Gb/s) and one has two of those disks and 4 300GB of the same type of disk. I'm using the one with 300G disks as a client. They all have their non-OS disks setup as OSD's and the mon is on the client server.
[14:08] <tnt> network ?
[14:08] <odyssey4me> Oh, and network access is via 2x10GB bonded links with non-blocking switches. MTU is 1500. Bond is LACP with l2+l3 hash.
[14:08] <jcfischer> yanzheng: where do you see that it failed to open the log? I must have missed it in the log files
[14:09] * infinitytrapdoor (~infinityt@ has joined #ceph
[14:10] <tnt> odyssey4me: well the bottleneck doesn't seem to be the cluster ... I mean it can read at more than 1GByte/s and write at 300MByte/s ...
[14:10] <yanzheng> because opening log didn't return
[14:10] <odyssey4me> tnt - my wishful thinking was to have a crush map that would make use of local OSD's for all local reads/writes and some sort of async replication to a second host... but I learned over time that ceph doesn't do async replication
[14:11] <odyssey4me> well, it seems to me that my wishful thinking can't become reality just yet
[14:11] <jcfischer> here? 2013-07-16 13:41:04.170786 7f26b121c700 2 mds.0.73 boot_start 1: opening mds log
[14:11] <jcfischer> 2013-07-16 13:41:04.170791 7f26b121c700 5 mds.0.log open discovering log bounds
[14:12] <odyssey4me> tnt - this sequential read/write... if I do a dd from /dev/zero to a local disk is that a similar test to the sequential write that's being done in this bench tool?
[14:12] <odyssey4me> and how would I do a similar read test?
[14:12] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:12] <yanzheng> set 'debug mds journaler = 20' and try again
[14:13] <joelio> odyssey4me: you need to pass a flag to make direct IO in a dd, otherwise you're hitting the local fs cache
[14:14] <tnt> odyssey4me: yes it's sequential test and with a lot of async. and for dd tests you can do dd if=/dev/mydisk of=/dev/null iflag=direct bs=1M for read and dd if=/dev/zero of=/dev/mydisk oflag=direct bs=1M for write
[14:14] <joelio> best to use bonnie - I've used the phoronix-test-suite with the pts/disk checks to great success. Bonnie, fio etc. all part of it
[14:15] <jcfischer> http://pastebin.com/CJcgXeAJ
[14:15] <odyssey4me> joelio - up until now I've been using iozone... why bonnie instead of iozone?
[14:15] <joelio> iozone fine too, just watch out for dd without flag
[14:16] <odyssey4me> tnt - why bs=1M? most file systems default to 4K as far as I've seen
[14:17] <tnt> that was to simulate the same kind of io as rbd_bench.
[14:17] <tnt> If you use 4k blocks of course performance is going to tank.
[14:18] <tnt> in a real usage they won't be 'flushed' by 4k, hopefully the block layer will coalesce them
[14:20] <yanzheng> sorry it should be 'debug journaler = 20'
[14:20] <jcfischer> np
[14:21] <jcfischer> that is in the mds section, right?
[14:21] <yanzheng> YES
[14:22] <jcfischer> http://pastebin.com/y28UKK1e
[14:24] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[14:25] <odyssey4me> tnt - interesting: dd if=/dev/sdb of=/dev/null iflag=direct bs=4k yields 80,7 MB/s
[14:25] <odyssey4me> dd if=/dev/sdb of=/dev/null iflag=direct bs=1M yields 162 MB/s
[14:25] <joelio> oflag != iflag
[14:27] <tnt> for read tests, iflag is correct.
[14:28] <odyssey4me> dd if=/dev/zero of=/dev/sdb oflag=direct bs=1M yields 163 MB/s
[14:30] <yanzheng> debug objecter = 20 and debug filer = 20
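Collected in one place, the debug settings suggested so far would look roughly like this in ceph.conf. This is only a sketch of what the conversation describes: the `[mds]` section placement is the one confirmed above, and the levels are the ones yanzheng asked for.

```ini
[mds]
    debug mds = 20
    debug journaler = 20
    debug objecter = 20
    debug filer = 20
```

The mds has to be restarted (as was done here) for settings in the file to take effect, and level 20 logs grow quickly, so it is probably worth reverting once the problem is found.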
[14:30] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[14:31] <odyssey4me> dd if=/dev/zero of=/dev/sdb oflag=direct bs=4k yields 73,0 MB/s
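One way to read the 4k versus 1M numbers is as requests per second rather than bytes per second. A rough sketch, assuming dd's "MB" means 10^6 bytes (as GNU dd reports it) and using the read figures pasted above; `requests_per_s` is a hypothetical helper name.

```shell
# Requests per second implied by a dd throughput figure.
# Assumes dd's "MB" is 10^6 bytes; the block size argument is in bytes.
requests_per_s() {
    LC_ALL=C awk -v mbps="$1" -v bs="$2" \
        'BEGIN { printf "%.0f\n", mbps * 1000000 / bs }'
}

requests_per_s 80.7 4096       # bs=4k read
requests_per_s 162 1048576     # bs=1M read
```

With direct I/O each block is its own request, so the 4k run issues on the order of a hundred times more requests for half the throughput, which is the per-request overhead tnt was warning about.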
[14:32] <jcfischer> http://pastebin.com/pEv2vqzz
[14:32] <odyssey4me> tnt - right, so that yields a pretty consistent result = ~160MB/s write, ~160MB/s read... doesn't that seem a little odd?
[14:34] <odyssey4me> that was a single disk config
[14:35] <jcfischer> ah - now we see something: http://pastebin.com/R4VVV9qX one of the OSD seems to be laggy
[14:35] * infinitytrapdoor (~infinityt@ has joined #ceph
[14:35] <odyssey4me> by comparison I get this on a 8 disk RAID5 set: 905 MB/s read, 662 MB/s write
[14:35] <yanzheng> i guess it's osd.42
[14:35] <jcfischer> yep
[14:35] <Gugge-47527> odyssey4me: dont compare singlethreaded io on local raids with singlethreaded io on a distributed network system :)
[14:36] * dosaboy_ (~dosaboy@host86-161-205-207.range86-161.btcentralplus.com) has joined #ceph
[14:36] <Gugge-47527> it will never be as good :)
[14:36] <jcfischer> and that is on a host we had trouble with recently - what next? Restart the OSD? or take it out of the cluster?
[14:37] <yanzheng> both should work
[14:37] <jcfischer> lets do a restart first then
[14:37] <odyssey4me> Gugge-47527: fair point, however I have the mission of trying to reach performance as close as possible to the local RAID set for KVM guests
[14:38] <jcfischer> two more osds on that host have become laggy...
[14:40] * dosaboy_ (~dosaboy@host86-161-205-207.range86-161.btcentralplus.com) Quit (Remote host closed the connection)
[14:40] <jcfischer> restarting them as well
[14:40] <odyssey4me> read 1320.54 Mb/s & write 330.82 Mb/s - so so far I have almost 50% more read speed on RBD, but around half the write speed
[14:41] <odyssey4me> interesting - I think it's time to introduce caching
[14:42] <jcfischer> and wonder over wonder: restarting three OSDs made the mds service become available
[14:43] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[14:45] <jcfischer> root@ineri ~$ ceph -s
[14:45] <jcfischer> health HEALTH_OK
[14:47] <Gugge-47527> odyssey4me: for kvm guests im guessing you have more than one?
[14:48] * markit (~marco@ has joined #ceph
[14:48] <Gugge-47527> if you can get 160MB/s * 10 with 10 guests, that is better than your local raid :)
[14:49] <markit> just entered.. 160MB/s with which hardware / connectivity? 10Gb or n x 1gb bonded?
[14:49] <jcfischer> yanzheng: Thanks for your help! (still wondering what exactly was happening)
[14:50] * infinitytrapdoor (~infinityt@ has joined #ceph
[14:50] <odyssey4me> My test environment is 3 ibm x3550's, all have 2xOS disks mirrored, two have 6x600GB SAS (6Gb/s) and one has two of those disks and 4 300GB of the same type of disk. I'm using the one with 300G disks as a client. They all have their non-OS disks setup as OSD's and the mon is on the client server. Network access is via 2x10GB bonded links with non-blocking switches. MTU is 1500. Bond is LACP with l2+l3 hash.
[14:54] <Gugge-47527> what is the read/write speed of a single of those SAS disks?
[14:56] <odyssey4me> They're 6Gb/s - using the dd test I get around 160MB/s
[14:56] <Gugge-47527> local?
[14:56] <odyssey4me> yes, they're each local to the server and one of the servers is acting as a client
[14:56] <Gugge-47527> if 160MB/s is what a single disk can deliver, then you are not gonna get more with a single read/write stream from/to rbd without some striping
[14:56] <odyssey4me> they're connected to a RAID controller
[14:57] <Gugge-47527> http://ceph.com/docs/next/man/8/rbd/#striping
[14:58] <Gugge-47527> requires format 2 images
[15:00] <odyssey4me> Gugge-47527: hmm, I'm not sure I understand how to apply this now?
[15:00] <odyssey4me> I thought that striping was done by default?
[15:00] <Gugge-47527> the point is, without striping, each write will only hit a single disk
[15:01] <Gugge-47527> and then dd on rbd will never be faster than a single disk
[15:01] <odyssey4me> Aha. That makes sense. So each block that is written will go to a single disk and the operation must complete before going to the next one.
[15:01] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[15:01] <Gugge-47527> yes
[15:02] <Gugge-47527> if you have 10 guests doing io
[15:02] <Gugge-47527> you dont care :)
[15:02] <Gugge-47527> if you want a single dd to perform better, you have to use some striping :)
[15:02] <odyssey4me> I'm not sure if this changes the situation, but the rbd bench I did was with this: https://gist.github.com/smunaut/5433222
[15:02] <guppy> from reading that link you posted, isn't striping enabled by default?
[15:03] <Gugge-47527> "By default, [stripe_unit] is the same as the object size and [stripe_count] is 1"
[15:03] <guppy> ah, there we go
[15:03] <odyssey4me> I didn't dd with rbd - I used dd with the local disks as a comparison for a sequential read & write.
[15:03] <guppy> sorry, I was just reading that
[15:05] <odyssey4me> ok, so to try out the difference that striping makes all I do is create a pool with --stripe-count 2 (or more)
[15:05] <Gugge-47527> not the pool, the rbd :)
[15:05] <odyssey4me> lol, sorry - still wrapping my head around all this terminology
[15:05] <Gugge-47527> im guessing if you set stripe_size to 4k, and stripe_count to 4
[15:06] <Gugge-47527> when you do a 16k write, it will end on 4 disks :)
[15:06] <Gugge-47527> but i dont use striping, and dont really know anything about it :)
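The "16k write ends on 4 disks" guess can be checked on paper. The sketch below follows the striping layout as I understand it from the Ceph docs (stripe units are dealt round-robin across `stripe_count` objects until those objects are full); `object_index` is a hypothetical helper, the 4 MiB object size is the assumed rbd default, and whether four objects really sit on four different disks depends on placement.

```shell
# Index of the object (within the image) that holds the byte at a given offset.
# Args: offset stripe_unit stripe_count object_size, all in bytes.
# Hypothetical helper; sketches the layout arithmetic only, no cluster needed.
object_index() (
    off=$1 su=$2 sc=$3 os=$4
    blockno=$((off / su))                   # stripe unit number overall
    stripeno=$((blockno / sc))              # row across the object set
    stripepos=$((blockno % sc))             # column, i.e. object within the set
    objectsetno=$((stripeno / (os / su)))   # which group of sc objects
    echo $((objectsetno * sc + stripepos))
)

# A 16k write at offset 0 with a 4k stripe unit and stripe count 4:
for off in 0 4096 8192 12288; do
    echo "offset $off -> object $(object_index "$off" 4096 4 4194304)"
done
```

With the defaults quoted above (stripe unit equal to the object size, stripe count 1) the same 16k all maps to object 0, which is why a single sequential stream only touches one object at a time.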
[15:06] <Gugge-47527> i'm using the kernel rbd driver, with 3.8, so no support for format 2 images :)
[15:10] * diegows (~diegows@ has joined #ceph
[15:11] <odyssey4me> haha, I am trying it out academically - I dont know if kvm/qemu supports creating images this way but it is an interesting idea
[15:11] <mxmln> we also using kernel 3.8.0-19 but all format2 images
[15:12] <odyssey4me> mxmln - sure, but it would appear that the kernel driver doesn't support format2 images
[15:12] <Gugge-47527> mxmln: with kernel rbd, or with qemu?
[15:12] <mxmln> we dont use mds just rbd block devices
[15:12] <mxmln> and qemu
[15:12] <Gugge-47527> so not "rbd map xxx"
[15:12] <mxmln> no
[15:13] <odyssey4me> I hit a wall the other day trying to deploy images via the kernel driver. The rbd device just locked up. I found out that unfortunately using the kernel driver on an RBD server isn't all that stable.
[15:13] <Gugge-47527> you dont use the kernel rbd driver then :)
[15:13] <mxmln> shortest way to kernel panic :) with format 2
[15:13] <Gugge-47527> using kernel rbd on the same server running osd is bad :)
[15:14] <odyssey4me> mxmln - do you have to tell kvm/qemu about the format2 images, or do you set that as a default type in the ceph config somewhere?
[15:15] <odyssey4me> Gugge-47527 - yeah, I found that out... I was hoping to use it to simply map a drive on each cluster member to a shared folder as a very simple shared distributed storage mechanism.
[15:15] <mxmln> current qemu doesnt support format parameter to create format2 images
[15:16] <mxmln> libvirt api wrapper calls default to format1 only
[15:17] <odyssey4me> mxmln: so the only option is to create the image using the 'rbd' command, then to use it when starting kvm?
[15:18] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[15:18] <mxmln> yes
[15:19] <odyssey4me> mxmln: why're you using format 2 images?
[15:19] <mxmln> you dont need "rbd map" if you use ceph cluster only for vm images
[15:20] <odyssey4me> something else I struggled with a bit - I have been able to get an rbd image assigned to a VM (in KVM via libvirt) as a data disk, but it doesn't seem to want to work if I use a system disk
[15:20] <mxmln> from documentation...format 2 - Use the second rbd format, which is supported by librbd (but not the kernel rbd module) at this time. This adds support for cloning and is more easily extensible to allow more features in the future.
[15:21] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Ping timeout: 480 seconds)
[15:21] <odyssey4me> mxmln - so you're using it because of the cloning functionality?
[15:22] <mxmln> yes, and it sounds to me like format1 may be obsolete in the near future
[15:24] <tnt> format2 is now fully supported by kernel module as of 3.10
[15:25] <mxmln> nice to hear that
[15:25] * ScOut3R_ (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[15:27] * PerlStalker (~PerlStalk@ has joined #ceph
[15:30] * BillK (~BillK-OFT@124-148-212-240.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:31] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[15:32] * dobber (~dobber@ has joined #ceph
[15:32] <odyssey4me> using xfs partitions for osd's, which will be used for 50GB+ RBD images (for KVM vm's) - is the default 4k block size suitable or can you gain performance by increasing the block size?
[15:34] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[15:36] <odyssey4me> tnt?
[15:38] <tnt> no idea, never tested that.
[15:41] <odyssey4me> tnt - you mentioned earlier that your recommendation in the absence of ssd's is to split the journal and the osd data onto two partitions per disk... what sized partition do you use for your journals?
[15:43] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:43] <tnt> odyssey4me: I use a 2G journal per OSD.
[15:45] * Joel (~chatzilla@2001:620:0:25:1102:d021:f719:4e5e) Quit (Quit: ChatZilla [Firefox 22.0/20130618035212])
[15:47] <odyssey4me> tnt - cool, thanks... I'm going to give that a go
[15:52] * dosaboy_ (~dosaboy@host86-164-80-171.range86-164.btcentralplus.com) has joined #ceph
[15:54] * tziOm (~bjornar@ has joined #ceph
[16:00] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[16:02] * haomaiwang (~haomaiwan@notes4.com) Quit (Remote host closed the connection)
[16:03] * aliguori (~anthony@ has joined #ceph
[16:03] * haomaiwang (~haomaiwan@ has joined #ceph
[16:10] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) has joined #ceph
[16:13] <drokita> Is anyone out there running the RadosGW with FastCGI and Apache on the same server successfully or does the latency make it a no-go
[16:14] <tnt> I'm running rgw with lighttpd on the same server ...
[16:14] <drokita> Is that a prod deployment?
[16:15] <tnt> yes
[16:15] <drokita> If you don't mind me asking, what kind of scale are you dealing with?
[16:16] <tnt> not big. The RGW part is used to store documents and thumbnails of those documents, some videos (and converted version of those). In total there is like ~500k objects and maybe 1T of data.
[16:17] <drokita> Gotcha. Thanks for the info!
[16:21] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[16:21] <odyssey4me> tnt - interestingly, with a stripe-count of 4 I've been waiting for an hour to complete the first rbd_bench process - this doesn't appear to be a good option
[16:21] <tnt> heh. I never tried this option.
[16:21] <odyssey4me> haha, it just finished that first read cycle - 2.56Mb/s
[16:22] <tnt> That's ... worrying.
[16:25] <odyssey4me> try it: rbd -p rbd --image-format 2 --size 10000 --stripe-unit 4 --stripe-count 4 create teststripe
[16:25] <odyssey4me> rbd_bench admin rbd teststripe
[16:26] <odyssey4me> perhaps you're doing something in the bench to force synchronous actions? if so, perhaps that affects the way the stripe behaves
[16:26] <tnt> --stripe-unit 4 seems wrong
[16:26] <tnt> docs says "Specifies the stripe unit size in bytes."
[16:27] <tnt> you don't want stripe of 4 bytes ...
[16:27] * AfC (~andrew@2001:44b8:31cb:d400:a1a9:56c1:6194:69da) Quit (Quit: Leaving.)
[16:31] <odyssey4me> tnt - ok, what size do you think is more suitable?
[16:31] <odyssey4me> 4k is probably what I should have aimed for
[16:32] <tnt> try 64k or something like that
[16:33] <odyssey4me> oh wow, 4k works nicely - nice catch... I misread it to be kb
[16:33] <odyssey4me> Read: 675.04 Mb/s, Write: 339.74 Mb/s
[16:34] <tnt> interesting. I'll have to give that a shot.
[16:34] * infinitytrapdoor (~infinityt@ has joined #ceph
[16:35] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[16:36] <odyssey4me> I forgot about using two cycles before removing and trying 64k. :p
[16:37] <odyssey4me> 64k = Read: 572.41 Mb/s, Write: 361.98 Mb/s on 2nd cycle; Read: 573.59 Mb/s, Write: 396.46 Mb/s on 3rd
[16:39] * LiRul (~lirul@ has left #ceph
[16:42] <odyssey4me> 4k = Read: 471.61 Mb/s, Write: 367.19 Mb/s on 2nd cycle; Read: 469.81 Mb/s, Write: 360.84 Mb/s on 3rd
[16:48] <Gugge-47527> 670MB/s with stripe-count 4 is pretty good :)
[16:48] <Gugge-47527> basically its what 4 of your disks can do :)
[16:49] <Gugge-47527> try stripe-count 8
[16:49] <jcfischer> odyssey4me: where/how did you compile that benchmark? I'd like to try that on our cluster
[16:50] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) has joined #ceph
[16:51] * X3NQ (~X3NQ@ has joined #ceph
[16:52] <Gugge-47527> jcfischer: https://gist.github.com/smunaut/5433222 - 'gcc -lrbd -lrados' if i remember correctly :)
[16:54] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[16:54] <jcfischer> rbd_bench.c:8:28: fatal error: rados/librados.h: No such file or directory - where do I get that from?
[16:55] <tnt> jcfischer: small caveat: you need to run it twice ... first result will be meaningless due to sparseness. Also, whatever image you tell it to use, it will _overwrite_ it.
[16:55] <jcfischer> so I create an new image with 'rbd -p rbd --image-format 2 --size 10000 --stripe-unit 4096 --stripe-count 4 create teststripe'
[16:55] <jcfischer> and then run 'rbd_bench admin rbd teststripe'
[16:55] <jcfischer> right?
[16:55] <tnt> yup.
[16:56] <tnt> and you need the librbd-dev package installed
[16:56] <jcfischer> which will create and then clobber teststripe
[16:57] <tnt> yup.
[16:57] <jcfischer> but it doesn't compile yet: http://pastebin.com/92d0PaDd
[16:57] <tnt> it's a lowercase l as in lambda. not an uppercase I
[16:58] <jcfischer> yields exactly the same errors
[16:59] <tnt> try putting the filename before the options
[17:00] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[17:00] * X3NQ (~X3NQ@ Quit (Quit: Leaving)
[17:01] <jcfischer> here we are: gcc rbd_bench.c -lrbd -lrados -orbd_bench
[17:01] <mozg> hello guys
[17:01] <mozg> does anybody know how to give more priority to client traffic when the repair process is taking place?
[17:03] <tnt> that's the default behavior AFAIK.
[17:04] <odyssey4me> jcfischer: apt-get install librados-dev librbd-dev
[17:04] <odyssey4me> gcc rbd_bench.c -o rbd_bench -lrbd -lrados
[17:04] <jcfischer> yep have it compiled
[17:04] <odyssey4me> -o must precede the others
[17:05] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[17:05] <jcfischer> running it, it got stuck at 928 of 1327 submitted
[17:05] <jcfischer> should I just wait?
[17:06] <odyssey4me> Is it possible to stripe an entire pool?
[17:07] * stacker666 (~stacker66@ Quit (Ping timeout: 480 seconds)
[17:08] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) has joined #ceph
[17:14] <jcfischer> hmm - now it seems that rbd is hanging - I CTRL-C the rbd_bench but can't 'rbd rm teststripe' and can't create a new rbd image either
[17:17] * topro__ (~prousa@host-62-245-142-50.customer.m-online.net) Quit (Ping timeout: 480 seconds)
[17:21] <jcfischer> odyssey4me: tnt: the rbd_bench hangs after around 900 out of 1327 submitted. If I kill it I can't delete the test image and I can't create a new image (rbd just hangs)
[17:22] * topro__ (~prousa@host-62-245-142-50.customer.m-online.net) has joined #ceph
[17:22] <odyssey4me> that's odd - haven't seen that happen to me - how many servers in your cluster?
[17:22] <jcfischer> 12
[17:23] <jcfischer> 64 osd
[17:23] <odyssey4me> I'm new to all this. Perhaps tnt has a view on what may have happened?
[17:23] <jcfischer> (I'm new to this too :) )
[17:25] <odyssey4me> I have an issue too - for some reason I can't restart ceph or any of its components. The init script gives no feedback and does nothing.
[17:26] <odyssey4me> When I try to kill the process it simply ignores me.
[17:26] <odyssey4me> The only way I've managed to change configs so far has been to reboot.
[17:26] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[17:26] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) Quit (Quit: Leaving)
[17:29] * oddomatik (~Adium@pool-71-106-149-194.lsanca.dsl-w.verizon.net) has joined #ceph
[17:29] <joelio> odyssey4me: it uses upstart in ubuntu
[17:29] <joelio> stop ceph-all
[17:30] * vata (~vata@2607:fad8:4:6:59b5:d635:dcf0:808) has joined #ceph
[17:30] <joelio> odyssey4me: the init scripts are legacy iirc.. they should work imho, but doesn't look like the init.d wrapper works
[17:31] <joelio> so stop ceph-all, start ceph-all etc..
[17:31] <matt__> Does anyone know if there is a performance impact on rbd once a volume is snapshotted?
[17:31] <markit> newbie: I've read some post that says to use "rados bench 30 seq -p data" to test performance, but I get the error "Must write data before running a read benchmark!", what am I missing?
[17:31] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[17:31] * mikedawson_ is now known as mikedawson
[17:32] <joelio> markit: writing data
[17:32] <joelio> there is a --no-cleanup flag
[17:32] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[17:32] <odyssey4me> lol, thanks joelio
[17:32] <joelio> rados bench -p {pool name} 600 write --no-cleanup -t 32
[17:32] <joelio> for example
[17:33] <joelio> odyssey4me: no probs, caught me out before :)
[17:33] <markit> joelio: how "write data" in a pool (don't even know what a pool is), also if I use this first, same result: rados bench 30 write -p data
[17:33] <markit> ah
[17:33] <markit> joelio: thanks a lot
[17:33] <joelio> data is a default pool for MDS
[17:34] <joelio> sorry CephFS
[17:34] <markit> joelio: I use RBD with proxmox, I've not installed or set up CephFS
[17:34] <joelio> sure, then data is just default
[17:34] <joelio> you will need to create a pool and set number of PG's etc.. based upon your topology
[17:35] <joelio> markit: http://ceph.com/docs/master/rados/operations/pools/
[17:35] <joelio> you can also control number of replicas etc..
[17:35] <jcfischer> tnt: this happens when I run strace -f ./rbd_bench … (and similar symptoms when running strace -f rbd rm teststripe) http://pastebin.com/xiENy2kr
[17:35] <markit> joelio: thanks a lot
[17:39] <markit> joelio: as far as you know, is there any cache to empty before running rados bench?
[17:39] <markit> I've got a 545 MB/s in read bandwidth
[17:39] <joelio> markit: no, it's relative iirc
[17:40] <joelio> obviously you will leave cruft in the pool if you don't clean, so create a pool for just benching (based upon same pg values etc..)
[17:41] * dobber (~dobber@ Quit (Remote host closed the connection)
[17:44] <markit> joelio: can't run write test without --no-cleanup and have it "blanked" again?
[17:44] <markit> I've run the test against data, sigh... I've also a VM stored with RBD, is it in 'data'? I don't get the big picture of what is going on :(
[17:45] <markit> I associate "pool" with sort of "partition"
[17:45] <odyssey4me> markit - nope, a 'pool' is a set of osd's (like disks) to work against... an 'image' is more like a partition
[17:46] <markit> oops, probably my VM is in the "rbd" pool
[17:46] <mozg> tnt: my issue is that when repair is taking place the client performance drops significantly
[17:46] <odyssey4me> yes, the default pool for image data in a new cluster is the 'rbd' pool
[17:46] * jcfischer is now known as jcfischer_afk
[17:46] <mozg> to the point it is hard to use the vms running on ceph
[17:46] <mozg> i would like to give clients more priority
[17:46] <odyssey4me> but when you setup your cluster you should create your own pool for your data and specify how many copies you want, how many pg's, etc
[17:46] <mozg> i've read the docs that there is a way to do that
[17:47] <mozg> setting a value 1-63
[17:47] <mozg> osd client op priority
[17:47] <markit> odyssey4me: I've used ceph-deploy, and I need only "rbd" storage, may I delete "data" and "metadata"?
[17:47] <mozg> however, it doesn't say if 1 is the highest priority and 63 is lowest
[17:47] <mozg> or the other way
[17:47] <markit> and are they taking some space or none if not used?
[17:48] <odyssey4me> markit - best not to... if you don't put anything into them they don't take up much space anyway
[17:48] <mozg> and the description of this option states: The priority set for client operations. It is relative to osd recovery op priority.
[17:48] <odyssey4me> although the caveat is that I'm new to all this too ;)
[17:48] <mozg> what does that mean?
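For reference, the two options being discussed would be set along these lines in ceph.conf. This is a hedged sketch only: the option names are the ones from the Ceph OSD configuration reference, the values are illustrative, and the reading that the larger number is served first (so client ops at 63 outrank recovery ops at a low value) is my understanding rather than something confirmed in this conversation, so check it against the docs for your release.

```ini
[osd]
    ; assumption: larger value = served first
    osd client op priority = 63
    ; illustrative low value so recovery yields to client I/O
    osd recovery op priority = 1
```

"Relative" in the description would then mean that only the ratio between the two values matters, not either number on its own.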
[17:49] * b1tbkt (~b1tbkt@24-217-192-155.dhcp.stls.mo.charter.com) Quit (Read error: No route to host)
[17:49] <markit> odyssey4me: ancient greeks said: a blind man can't guide another blind man ;P
[17:49] <odyssey4me> markit - true, however I've been benching and testing for several days already... so I'm a little further than you are :p
[17:50] <markit> odyssey4me: that's why I listen so carefully to you ;P
[17:50] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:50] <markit> odyssey4me: do you use proxmox also by chance?
[17:50] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[17:50] * oddomatik (~Adium@pool-71-106-149-194.lsanca.dsl-w.verizon.net) Quit (Quit: Leaving.)
[17:51] * b1tbkt (~b1tbkt@24-217-192-155.dhcp.stls.mo.charter.com) has joined #ceph
[17:52] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[17:53] <odyssey4me> markit - nope, just doing testing with kvm for the purpose of use in an openstack deployment
[17:59] <mozg> could someone help me with troubleshooting slow requests please
[17:59] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[17:59] <mozg> i am on 0.61.4
[17:59] <mozg> ubuntu
[18:00] <mozg> infiniband network
[18:00] <mozg> and 2 osd servers
[18:00] <mozg> with 8 osds each
[18:03] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:11] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:12] <Gugge-47527> mozg: how is your ram/cpu usage, and what is the io load on the disks?
[18:13] <mozg> Gugge-47527: it all depends on what ceph is currently doing. For instance. I do get slow requests when vms are idle
[18:13] <mozg> and there is not a great deal of activity
[18:13] <mozg> i've got 4-5 test vms on ceph
[18:14] <mozg> and i get slow requests practically every other day
[18:14] <mozg> when vms are not doing anything
[18:14] <mozg> just switched on and not running any tests
[18:14] <mozg> during those times, storage servers are pretty idle
[18:14] <mozg> load below 1
[18:14] <mozg> io on disks are also minimal
[18:15] <mozg> however
[18:15] <mozg> when I do run tests on vms
[18:15] <mozg> i get one server with load around 5
[18:15] <mozg> another server more like 20-25
[18:15] <mozg> they are different servers
[18:15] <mozg> with different cpus
[18:16] <joelio> what does 'iostat -x 1' tell you?
[18:16] <joelio> on the storage nodes
[18:16] <mozg> joelio: when tests are running, the disk utilisation comes to 100%
[18:16] <mozg> when vms are idle, it is close to 0%
[18:16] <mozg> with occasional writes and reads
[18:17] <mozg> Gugge-47527: what I did notice when trying to isolate the problem
[18:17] <mozg> is that when I disable one of the servers (regardless which one) I do not see any slow requests at all
[18:17] <mozg> so, thinking logically this should point to a networking issue
[18:17] <mozg> however,
[18:18] <mozg> i've been testing the network and i do not see any issue
[18:18] <mozg> no packet drops
[18:19] * keruspe (~keruspe@slave.imagination-land.org) Quit (Quit: WeeChat 0.4.0)
[18:19] <mozg> and i do get over 10Gbit/s throughput
[18:20] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[18:21] <markit> joelio: I've deleted the test data with: # for i in $(rados -p data ls); do rados -p data rm $i; done
[18:21] <mozg> so, the end result is that I have individual servers which seem to work okay with ceph, but as soon as they are working together ceph tends to fall over about twice a week
[18:21] <mozg> all vms freeze, etc
[18:21] <markit> (of course, since there was nothing else there except test data)
[18:21] <mozg> not a good picture at all
[18:21] <joelio> markit: not just drop the pool and readd? or is data protected in some way?
[18:22] * joelio never benched data, so not sure
[18:22] <markit> joelio: newbie here, was unsure about delete and add (with what parameters? don't know defaults used by ceph-deploy)
[18:22] <joelio> yea, no worries
[18:22] <markit> well, I'm a bit scared to be sincere ;P
[18:23] <markit> also, what I heard from mozg is the kind of situation I would never want to be in
[18:23] <joelio> I've only been using for a few months now myself, it definitely makes much more sense now :)
[18:23] <markit> and wondering if it can be troubleshot easily
[18:23] <janos> mozg: how many monitors do you have?
[18:24] * tchmnkyz_ is now known as tchmnkyz
[18:24] <mozg> janos: 3 mons
[18:24] <mozg> 2 osd servers with 8 osds each
[18:24] <janos> strange situation
[18:24] <mozg> and two clients using kvm+rbd
[18:25] <janos> beyond my debugging, but thought i'd ask to see if maybe a single-mon bottleneck
[18:25] <janos> there are folks here with kvm+rbd with great perf
[18:25] <janos> i'll lurk though- curious as to what ails you
[18:27] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[18:27] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[18:28] * tnt (~tnt@ has joined #ceph
[18:28] * diegows (~diegows@ has joined #ceph
[18:31] * mschiff (~mschiff@p4FD7FAB1.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[18:35] * hybrid512 (~walid@106-171-static.pacwan.net) Quit (Quit: Leaving.)
[18:35] * markit (~marco@ Quit (Quit: Konversation terminated!)
[18:36] * leseb (~Adium@ Quit (Quit: Leaving.)
[18:37] * yehudasa__ (~yehudasa@2602:306:330b:1410:6dbd:dec3:7ee7:8132) has joined #ceph
[18:39] * Tamil (~tamil@ has joined #ceph
[18:39] * scuttlemonkey_ is now known as scuttlemonkey
[18:46] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[18:47] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[18:51] * odyssey4me (~odyssey4m@ Quit (Ping timeout: 480 seconds)
[18:59] * rturk-away is now known as rturk
[19:00] * mschiff (~mschiff@ has joined #ceph
[19:01] * nwatkins (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[19:11] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) has joined #ceph
[19:11] * dosaboy (~dosaboy@faun.canonical.com) Quit (Quit: leaving)
[19:19] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (synthon.oftc.net oxygen.oftc.net)
[19:19] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[19:19] * nwatkins (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) Quit (synthon.oftc.net dibasic.oftc.net)
[19:19] * nwatkins (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[19:25] * diegows (~diegows@ has joined #ceph
[19:30] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:30] * ChanServ sets mode +v andreask
[19:30] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Connection reset by peer)
[19:35] * oddomatik (~Adium@ has joined #ceph
[19:35] <sagewk> joao: cool, merged!
[19:36] <joao> sagewk, thanks!
[19:42] * xmltok (~xmltok@pool101.bizrate.com) Quit (Remote host closed the connection)
[19:42] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[19:49] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[19:49] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[19:51] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) Quit (Quit: s2r2)
[19:51] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[19:51] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[19:52] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[19:56] * xmltok_ (~xmltok@pool101.bizrate.com) has joined #ceph
[19:56] * nwatkins (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Page closed)
[19:59] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[19:59] * nwat (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[19:59] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[20:03] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[20:04] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[20:07] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[20:07] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Read error: Connection reset by peer)
[20:10] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[20:12] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) has joined #ceph
[20:12] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[20:15] * nwat (~oftc-webi@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[20:15] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[20:22] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[20:25] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[20:29] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) has joined #ceph
[20:32] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[20:36] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[20:42] <mozg> hi
[20:42] <Midnightmyth> hello
[20:42] <mozg> sorry, had to leave
[20:42] <mozg> without finishing my troubleshooting
[20:43] <mozg> anyone around to help with the slow requests?
[20:43] <mozg> i've got an issue that i've discussed here about 2 hours ago
[20:43] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) has joined #ceph
[20:46] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:47] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[20:52] * oddomatik is now known as BrianA
[21:04] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[21:05] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[21:09] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[21:09] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit ()
[21:12] * danieagle (~Daniel@ has joined #ceph
[21:17] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[21:17] * topro__ (~prousa@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[21:18] <joao> alfredodeza, welcome!
[21:18] * ChanServ sets mode +o scuttlemonkey
[21:18] * ChanServ sets mode +o dmick
[21:18] * ChanServ sets mode +v nhm
[21:18] * ChanServ sets mode +v elder
[21:18] * ChanServ sets mode +o rturk
[21:18] * ChanServ sets mode +v joao
[21:18] * alfredodeza <3's irc
[21:18] * guppy is now known as guppy_
[21:18] <alfredodeza> thank you joao :)
[21:18] <alfredodeza> happy to be here
[21:22] * ChanServ sets mode +v rturk
[21:22] * ChanServ sets mode +v dmick
[21:22] * ChanServ sets mode +v scuttlemonkey
[21:22] <joao> chanserv sure seems busy
[21:23] * doubleg (~doubleg@ Quit (Quit: Lost terminal)
[21:25] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:26] * SudoAptitude (~Yruns@ip-93-159-106-144.enviatel.net) has joined #ceph
[21:29] <SudoAptitude> Hello! What is the difference between RADOS and ceph? Is RADOS the underlying Object Storage "File System" that ceph uses to deliver "Block Storage Devices", "Ceph FS" etc.?
[21:29] <rturk> RADOS = the object store underneath
[21:29] <rturk> CephFS: distributed filesystem built on RADOS
[21:30] <rturk> (RADOS = reliable, distributed, autonomic object store, for the acronym minded)
[21:30] <rturk> http://ceph.com/docs/master/glossary/
[21:30] <SudoAptitude> And with RADOS GW I access RADOS directly. The other Products use ceph as "gateway"
[21:31] <rturk> radosgw allows you to access rados using REST
[21:31] <rturk> librados gives you direct access to rados
[21:32] <SudoAptitude> I see. thank you :)
[21:32] <rturk> :)
[21:32] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Read error: Connection reset by peer)
[21:33] <SudoAptitude> I tried to install ceph by tutorials but constantly run into problems. I think I don't understand the system itself well enough to address the problems properly :/
[21:34] <rturk> what kind of problems are you running into?
[21:34] <SudoAptitude> For example I can't install ceph-deploy. apt-get tells me it depends on the virtual package "python-pushy" and won't install it
[21:34] <rturk> hmmm, I think I filed a bug on that
[21:34] <rturk> one second
[21:35] <rturk> http://tracker.ceph.com/issues/5475 <- look familiar?
[21:37] <SudoAptitude> Using the second tutorial I can't start Ubuntu any more. before I can log in on the tty the machine starts echoing stuff like "2013-07-16 16:35:00.734873 b3995b40 0 -- :/1280 >> pipe(0x97e4c20 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault" repeatedly
[21:38] <rturk> hm, now that one I haven't seen
[21:38] <SudoAptitude> Yes that seems to be the problem :) so a 64-bit VM might work in that case?
[21:39] <rturk> yes
[21:39] <rturk> or you can do what I sometimes do and clone ceph-deploy from github
[21:40] <rturk> running the bootstrap will grab pushy for you
[21:41] <SudoAptitude> i haven't tried that yet. thanks!
[21:42] * vata (~vata@2607:fad8:4:6:59b5:d635:dcf0:808) Quit (Quit: Leaving.)
[21:49] <gregaf> yehuda_hm: hmm, maybe I'm missing something but I think we just need to do something like this (with the addition of an interface for selecting sync type): https://github.com/gregsfortytwo/ceph/compare/ceph:next...wip-rgw-versionchecks
[21:49] <gregaf> since we're already gathering up the obj_version info on the gateway before doing any writes
[21:56] <yehuda_hm> gregaf: looking
[22:00] * mteixeira (~oftc-webi@ip-216-17-134-225.rev.frii.com) has joined #ceph
[22:00] <mteixeira> Hello. I am having a problem with ceph. Is here the place to ask for some help?
[22:02] <yehuda_hm> gregaf: I think that's racy
[22:03] <gregaf> yehuda_hm: well, I guess it's not atomic, but what race are you thinking we'd see?
[22:03] <gregaf> and yes, mteixeira, this is the place
[22:03] * nhm (~nhm@184-97-193-106.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:04] <yehuda_hm> gregaf: e.g., you apply a change, but you end up reverting to an older version because user just changed it
[22:04] * nhm (~nhm@184-97-255-87.mpls.qwest.net) has joined #ceph
[22:05] <yehuda_hm> I think the way to go would be by propagating that flag to the version update op
[22:05] <yehuda_hm> and manage that in the objclass
[22:05] <gregaf> I don't think I get the scenario you're proposing
[22:05] <yehuda_hm> we bundle the operation with it anyway, so if we return an error there it's not going to get applied
[22:05] <gregaf> are you talking about a sync running while users are still active against it?
[22:06] <mteixeira> Great. Here is my problem. I am trying to remove an OSD from my ceph cluster. I am following along with the instructions and I started by doing a "ceph osd out". I then watched the rebalancing, but when this was done, rather than having HEALTH_OK my cluster ended up with pgs stuck unclean. I'm at a loss on how to proceed.
[22:06] <yehuda_hm> gregaf: that's one example
[22:06] <yehuda_hm> but I'm sure we can come up with a different scenario, it's a problematic read-modify-write
[22:07] <yehuda_hm> .. though the objv_tracker is probably going to protect against that
[22:07] <gregaf> I'm not seeing how else it would happen, and we don't support that scenario *anyway*
[22:08] <gregaf> I mean, I can push it down to the objclass if you want, this one was just fast, easy, and looked effective
[22:10] <dmick> mteixeira: do you still have enough OSDs in the right places in the crushmap to allow the cluster to replicate your data completely?
[22:10] <dmick> (depends on replication level you chose, number of OSDs, and the map)
[22:10] <gregaf> yehuda_hm: doing it in the objclass is going to require sprinkling an extra parameter everywhere, unless it's appropriate to add the flag as an objv_tracker member :/
[22:10] <yehuda_hm> hmm, that'll probably be a good enough solution. I don't like the code duplication, having to handle the logic at all the put() implementations
[22:12] <yehuda_hm> gregaf: the objv_tracker is probably a good choice, its entire purpose is to handle the versioning state
[22:12] <mteixeira> I'm afraid I don't know enough to answer that question, but there are 26 other OSDs and my drives are only 30% full.
[22:12] <gregaf> which is good enough?
[22:12] <gregaf> it occurred to me while pushing that it should be a helper in the superclass, though I don't think I can glue it in any more natively than that
[22:13] <yehuda_hm> gregaf: your original solution, however, I'm not completely sure we're not digging some hole we'd have trouble getting out of
[22:13] <dmick> mteixeira: well, so, start at http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ and poke around a bit
[22:13] <yehuda_hm> gregaf: either adding a helper in the superclass, or break the put() into two different calls
[22:14] <gregaf> yehuda_hm: k, but given that sync is basically a single-writer system all the way through (and making it not would be painful) I don't think we're digging any new holes :)
[22:15] <gregaf> any thoughts on how we want to specify the sync type in the requests?
[22:15] <gregaf> joshd, you too on that one ^
[22:15] <gregaf> probably a param would be most correct? or should it go in the json request body?
[22:16] <yehuda_hm> gregaf: param
[22:16] <joshd> yeah, param
[22:16] <mteixeira> Okay. I looked at that, but I wasn't able to find or understand much (probably due to my own ignorance). I'll mark the OSD as "out" again and when I get the error again I'll come back here and ask for specific advice.
[22:18] <gregaf> mteixeira: in that case the thing to focus on is what state the unclean PGs are in and possibly branch out from there to looking at the OSDs hosting them
[22:18] <yehuda_hm> gregaf: well, consider the following scenario: we have a master region and a secondary region. Both are up, have joshd's sync agent continuously updating the secondary region
[22:19] <yehuda_hm> gregaf: oh crap, lost my thread of thought
[22:20] <mteixeira> It seems strange to me that taking a single OSD "out" causes pgs to become stuck, and putting that same OSD back in restores ceph to HEALTH_OK. It seems like the problem is in the OSD which I am trying to take out.
[22:21] <mteixeira> Anyhow, I marked it "out" again. In an hour or two I might be able to answer more questions about their states. Thanks for the help.
[22:22] <yehuda_hm> gregaf: now, let's say we removed a bucket and recreated the same bucket on the secondary. We have multiple writers to that data structure .. one is the sync agent
[22:22] <yehuda_hm> gregaf: the other one is the gateway
[22:22] <yehuda_hm> so we might have both racing, might have the wrong one winning
[22:23] <yehuda_hm> e.g., the gateway needs to apply the change whatever the version is, the sync agent is only allowed to upgrade
[22:23] <gregaf> "recreated the same bucket on the secondary"?
[22:23] <yehuda_hm> yeah, secondary did a bucket remove, bucket create
[22:24] <yehuda_hm> .. bucket might have been on a different region before
[22:24] <gregaf> so then it's a different bucket instance, right?
[22:24] <yehuda_hm> it's a different bucket instance, the same bucket entry point
[22:24] * jskinner (~jskinner@ has joined #ceph
[22:26] <gregaf> I'm not real caught up on your terminology here, the entry point is the object which has the bucket name that points to the right bucket instance?
[22:26] <yehuda_hm> gregaf, right
[22:27] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:28] <sagewk> joao, gregaf: have a fix for the mon leak regression, wip-5643
[22:29] <gregaf> yehuda_hm: aren't we protected in that case by the pre_modify(), put(), post_modify() stuff?
[22:30] <yehuda_hm> the pre_modify() and post_modify() just dumps the metadata changes to the log
[22:31] <gregaf> oh, I thought we had asserts on there
[22:32] <yehuda_hm> we're probably protected by having the objv_tracker and the correct read_version matching the current version
[22:33] <gregaf> yeah, it trickles down into including asserts that the versions match via a call to objv_tracker->prepare_op_for_write
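A minimal sketch of the two checks discussed above, assuming a simplified in-memory model: every write asserts that the version the writer read is still current, a gateway-style write then always applies, and a sync-style write applies only if it would upgrade the version. The names `Entry`, `put`, `expected_version` and `incoming_version` are hypothetical and are not the rgw or objv_tracker API.

```python
class VersionMismatch(Exception):
    """Raised when the writer's read version is no longer the current one."""


class Entry:
    """Hypothetical stand-in for a versioned metadata object."""

    def __init__(self, value, version):
        self.value = value
        self.version = version

    def put(self, value, expected_version, incoming_version=None):
        """Apply a write, returning True if the value was changed.

        expected_version: version the writer saw when it read the entry.
        incoming_version: None for a gateway-style write (always applies);
                          an int for a sync-style write (upgrade only).
        """
        # Check 1: reject writers working from a stale read.
        if expected_version != self.version:
            raise VersionMismatch(
                f"read v{expected_version}, current is v{self.version}")
        if incoming_version is None:
            # Gateway-style write: applies whatever the version is.
            self.version += 1
        elif incoming_version > self.version:
            # Sync-style write: only allowed to move the version forward.
            self.version = incoming_version
        else:
            return False  # would be a downgrade, so it is skipped
        self.value = value
        return True


entry = Entry("instance-A", version=5)
gateway_read = sync_read = entry.version  # both writers read v5

entry.put("instance-B", expected_version=gateway_read)  # gateway wins, now v6
try:
    entry.put("instance-A", expected_version=sync_read, incoming_version=5)
    stale_write_rejected = False
except VersionMismatch:
    stale_write_rejected = True

# After re-reading, the sync agent still cannot revert to the older version.
downgrade_applied = entry.put("instance-A", expected_version=entry.version,
                              incoming_version=5)
```

In this toy model the racing sync write fails the read-version check, and a retry with an older version is skipped rather than reverting the gateway's change.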
[22:34] <Gugge-47527> mteixeira: instead of marking it out, i would reweight it to 0, then it would be online while data is moved
[22:34] <gregaf> that's what marking "out" but not "down" does :)
[22:35] <mteixeira> gregaf: Yes, I noticed that the weight became 0 when I marked the OSD out.
[22:36] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[22:36] <Gugge-47527> hmm, i thought an out osd would never serve data to the other osds
[22:37] <gregaf> Gugge-47527: an out osd is considered no longer "responsible" for any data, but as long as it's up it will continue to answer requests from other OSDs for whatever it has, precisely for this shipping-off case
[22:38] <Gugge-47527> great :)
[22:38] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[22:39] <mteixeira> gregaf: Is it correct to say that the expected end result is that all data should migrate off the "out" OSD and that the health should return to HEALTH_OK once the rebalance is complete?
[22:39] <gregaf> yes
[22:39] <gregaf> assuming your crush map is satisfiable with the OSD marked "out", anyway
[22:40] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[22:41] <mteixeira> Okay. Thanks. I'll let you know when the rebalancing is done. Maybe ten minutes to go now.
[22:44] <mteixeira> BTW, I will mention the OSD is the last OSD on the host, if that makes any difference. Is it a problem that the only OSD on a host is "out"?
[22:45] <Gugge-47527> only if you no longer have enough hosts to satisfy your crush rules
[22:47] <mteixeira> What would I have to look at to determine whether that was the case or not? It seems to me like the crush map is a list of hosts and weights for those hosts, which seems fairly straightforward. What would make a crush map unsatisfiable?
[22:48] <Gugge-47527> if you want 2 copies on different hosts, and you only have one host
[22:49] <mteixeira> Oh, well, I got that covered. I only need two copies and I got seven hosts :)
[22:50] * Machske (~Bram@d5152D87C.static.telenet.be) Quit ()
[22:50] <Gugge-47527> or if it is the last host in a rack, and you want copies on different racks
[22:50] <Gugge-47527> or something like that :)
[22:50] * diegows (~diegows@ has joined #ceph
[22:51] <mteixeira> Okay, it looks like it is done. "ceph health" says "HEALTH_WARN 9 pgs stuck unclean; recovery 1/1003194 degraded (0.000%)". Let me try some of the things in the troubleshooting page.
[22:51] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[22:52] <gregaf> mteixeira: it's possible that your crush rules are set up wrong if you're choosing hosts and then osds in the host, but one host is empty — is that what you're doing?
[22:53] * Machske (~Bram@d5152D87C.static.telenet.be) has joined #ceph
[22:54] <SudoAptitude> I don't understand the difference between Ceph Block Device and Ceph FS. They both create a File System that I can mount and use. What is the big difference?
[22:54] <mteixeira> Let me decompile my crush map and I can tell you exactly what I have. My setup is pretty basic. I haven't customized it much.
[22:54] <dmick> The words "Block Device" are not equivalent to "File System".
[22:54] <dmick> so, no a block device doesn't create a filesystem
[22:54] <dmick> mteixeira: ceph osd crush dump will work
[22:55] <mteixeira> BTW, "ceph health detail" gives me several lines, such as "pg 15.360 is stuck unclean for 83092.639530, current state active+remapped, last acting [15,1,7] ". OSD 7 is the one that is "out"
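A minimal sketch of picking apart the stuck-PG line quoted above, assuming the line format is exactly as pasted (it may differ between Ceph versions). The helper names `parse_stuck_pg` and `pgs_touching_osd` are hypothetical; the sample line is copied from the log.

```python
import re

# Sample line copied from the "ceph health detail" output quoted above.
LINE = ("pg 15.360 is stuck unclean for 83092.639530, "
        "current state active+remapped, last acting [15,1,7]")

# Assumed format; adjust the pattern if your Ceph version prints differently.
PATTERN = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<stuck>\w+) for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]*)\]"
)


def parse_stuck_pg(line):
    """Return a dict describing one stuck PG, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "pgid": match.group("pgid"),
        "stuck": match.group("stuck"),
        "seconds": float(match.group("secs")),
        "state": match.group("state"),
        "acting": [int(x) for x in match.group("acting").split(",") if x],
    }


def pgs_touching_osd(lines, osd_id):
    """List the PG ids whose last acting set still includes the given OSD."""
    parsed = (parse_stuck_pg(line) for line in lines)
    return [pg["pgid"] for pg in parsed if pg and osd_id in pg["acting"]]


info = parse_stuck_pg(LINE)
print(info["pgid"], info["state"], info["acting"])
print(pgs_touching_osd([LINE], 7))
```

Filtering on the OSD that was marked out (7 here) shows quickly whether all the stuck PGs share it in their acting set.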
[22:55] <rturk> SudoAptitude: a rados block device looks like a disk to the client OS, not like a filesystem
[22:55] <Gugge-47527> SudoAptitude: rbd is a block device, you can use on a single host, and create whatever filesystem you want on. cephfs is a distributed filesystem you can mount on multiple hosts
[22:55] * s2r2 (~s2r2@g227002106.adsl.alicedsl.de) Quit (Quit: s2r2)
[22:55] <SudoAptitude> ok that is indeed a big difference :)
[22:56] <mteixeira> Okay. "ceph osd crush dump" is definitely a different syntax than what I am used to, so I'm not sure what to make of it. It's long. Is there someplace I could post it?
[22:57] <rturk> mteixeira: a lot of people use pastebin
[22:57] <dmick> pastebin? if you're comfy with extracting/decompiling, that's easier to read, just harder to get. either way, pastebin
[22:59] <mteixeira> I do see some rules toward the end. But I did not customize those myself, so I assume they are the defaults. The only thing I did to my crush map was add hosts and racks and weights. I'll do a paste bin. Seems simple enough.
[23:00] * markbby (~Adium@ Quit (Quit: Leaving.)
[23:00] * danieagle (~Daniel@ Quit (Ping timeout: 480 seconds)
[23:00] <mteixeira> Okay. Here is the output of "ceph osd crush dump" http://pastebin.com/kZm6L3aL
[23:01] <mteixeira> I'll do the decompile too, since I find that to be easier for me to understand :)
[23:05] <mteixeira> Okay, this http://pastebin.com/ghJq03vd contains the dump as well as the decompiled crush map
[23:08] * jakes (~oftc-webi@dhcp-171-71-119-30.cisco.com) has joined #ceph
[23:08] * danieagle (~Daniel@ has joined #ceph
[23:13] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[23:16] * jakes (~oftc-webi@dhcp-171-71-119-30.cisco.com) Quit (Quit: Page closed)
[23:17] <joao> mteixeira, where are you from?
[23:17] <joao> sagewk, looking (wrt wip-5643)
[23:18] <mteixeira> @joao: Originally from Brazil, but I've been in the US for about thirty years :)
[23:18] <cephalobot> mteixeira: Error: "joao:" is not a valid command.
[23:18] <mteixeira> joao: Originally from Brazil, but I've been in the US for about thirty years :)
[23:18] * rturk kicks cephalobot
[23:19] <joao> mteixeira, eh, your name was giving away some Portuguese descent :p
[23:19] <mteixeira> Right :)
[23:19] <joao> was hoping you were from this side of the pond though :)
[23:20] <mteixeira> Joao: For what it is worth, some of my ancestors were :) I think my great-grandfather was from Portugal.
[23:20] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:22] <mteixeira> Any ideas on my crush map? It looks fairly sensible to me. Although I'm not sure why device6,8 and 9 are in there. I removed those OSDs.
[23:23] <mteixeira> Maybe I'll try removing those from my crush map, see if it helps.
[23:24] <dmick> so I don't know what I'm talking about :)
[23:24] <dmick> but
[23:24] <dmick> it seems weird to me that you're set up to choose host first
[23:24] <dmick> but then there are two hosts with 6 osds and one with 1
[23:25] <mteixeira> I'll admit I do not really understand the rules section. I think you're telling me I need to read up on that part :)
[23:25] <dmick> I don't immediately see where that would be a problem unless osd.7 goes down, in which case you will have some PGs that cannot map to full replication
[23:26] <dmick> and, oh, you say 7 is the one that's out. so there you go, I suspect
[23:26] <dmick> are you trying to decommission host ceph2?
[23:26] <mteixeira> dmick: Exactly. Trying to remove ceph2
[23:27] <gregaf> yeah, so you'll want to remove the host from the map as well as the OSDs
[23:27] <dmick> ok, so i think what you need is to tell crush to stop using that host for placement.
[23:27] <gregaf> CRUSH is pseudo-random so a small portion of your PGs are getting mapped to the (empty) host ceph2 and can't get off of it even after a bunch of retries
[23:27] <dmick> ^ what greg said. Greg, if it's out of the map, it can still be a source for replication if it's not done moving PGs, right?
[23:27] <gregaf> yeah
[23:28] <gregaf> although it doesn't need to be in this case since all the data exists elsewhere, they were unclean just because they couldn't map a third replica
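A toy illustration of the effect described above, not real CRUSH: hosts are ranked per PG by a hash, the top three are chosen, and one "in" OSD is picked inside each. Real CRUSH uses weights, retries and tunables that this ignores. The seven-host layout, the OSD numbering and the three-replica count are hypothetical; only the idea of an empty `ceph2` still sitting in the tree comes from the conversation.

```python
import hashlib


def score(*parts):
    """Deterministic pseudo-random score for a tuple of names."""
    text = "/".join(str(p) for p in parts)
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


def place(pg, tree, replicas=3):
    """Pick `replicas` distinct hosts, then one usable OSD inside each.

    `tree` maps host name -> list of OSD ids that are still "in".
    A host with an empty list is still a candidate, which is the point.
    """
    ranked = sorted(tree, key=lambda host: score(pg, host), reverse=True)
    acting = []
    for host in ranked[:replicas]:
        osds = tree[host]
        if not osds:
            continue  # landed on the empty host: this replica stays unmapped
        acting.append(osds[score(pg, host, "leaf") % len(osds)])
    return acting


def stuck_pgs(tree, pg_count=256, replicas=3):
    """PGs that could not be mapped to the full number of replicas."""
    return [pg for pg in range(pg_count)
            if len(place(pg, tree, replicas)) < replicas]


# Hypothetical layout: seven hosts with four OSDs each, except ceph2,
# whose only OSD has been marked out and so contributes nothing.
tree = {f"ceph{i}": [i * 10 + n for n in range(4)] for i in range(1, 8)}
tree["ceph2"] = []
before = stuck_pgs(tree)

# Taking the empty host out of the tree means it is never chosen.
unlinked = {host: osds for host, osds in tree.items() if host != "ceph2"}
after = stuck_pgs(unlinked)

print(len(before), "PGs short a replica with the empty host in the tree")
print(len(after), "PGs short a replica after taking it out")
```

In this model every under-mapped PG is one whose host selection included the empty host, and none remain once that host is no longer a candidate.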
[23:28] <dmick> you know that because of the 'stuck' state?
[23:29] <gregaf> he pasted the ceph health output somewhere, and it was 9 PGs active+remapped
[23:29] <dmick> ah. missed that.
[23:29] <mteixeira> So how do I do that? Do I just recompile my crushmap without ceph2. I was only following the directions on removing OSDs, and the first step was to mark it "out". You're telling me to first modify the crush map to remove ceph2. Is that correct?
[23:29] <dmick> out is good; that means no more new placement there
[23:31] <dmick> but yeah, now you can remove ceph2 from the map. I think ceph osd crush unlink host2 will do it
[23:31] <dmick> er, ceph2, not host2
[23:31] <gregaf> you can do all this via the CLI rather than compiling and decompiling, btw
[23:31] <dmick> ^ like that
[23:32] <mteixeira> How is that different than "ceph odd crush remove ceph2"... that would have been my first inclination.
[23:32] <mteixeira> odd = osd
[23:32] <gregaf> actually I think that's what you want to do, isn't unlink just taking it out of the tree but leaving it around, dmick?
[23:33] <dmick> remove means you have to empty ceph2 first, so you'd have to remove osd.7. which you can do.
[23:33] <dmick> unlink will get it out of the way so you can see results quickly; then you can remove osd.7 and then ceph2
[23:33] <dmick> and also leaves it in case I was wrong and it needs to be relinked :)
[23:33] <dmick> but yeah, I think either way will work
[23:34] <mteixeira> And this is safe to do with some of my pgs stuck unclean, correct? I'm a bit worried about removing or unlinking anything that might still have data
[23:35] <gregaf> changing the crush map won't remove any data from anything, unless that data has been correctly replicated elsewhere
[23:35] <mteixeira> Okay. Gonna give the unlink a shot.
[23:36] <dmick> could start a ceph -w before you do and watch things change
[23:37] * aliguori (~anthony@ Quit (Quit: Ex-Chat)
[23:38] <mteixeira> Yeah, it now is moving data around. I also noticed "ceph2" and "osd.7" show up in "ceph osd tree" in a different place now. I assume that is what the unlink did. It no longer is under the pool.
[23:38] <gregaf> yeah
[23:40] <mteixeira> Okay. This might take half an hour to complete.
[23:42] * rturk is now known as rturk-away
[23:43] <mteixeira> So the docs pretty much say to mark the OSD out, then a bit later it says to run "ceph osd crush remove". So effectively I am doing the same thing with "ceph osd crush unlink", right? So the procedure would have worked, I just was a bit overly concerned about the fact I had stuck pgs. I should have just pressed forward, is this correct?
[23:44] <gregaf> I think the docs might not discuss removing buckets (host, rack, etc) versus just an OSD, and the remove takes it out of the map (not just the pool), but basically yes
[23:45] <mteixeira> I see. Yes, I think the docs do not address removing the buckets. I was going to deal with that later once the osd was out of the way.
[23:46] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[23:49] * ntranger (~ntranger@proxy2.wolfram.com) Quit ()
[23:52] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:57] <mteixeira> Okay, so now ceph health says "HEALTH_WARN 7 pgs stuck unclean". It no longer says degraded though
[23:57] <mteixeira> This is after the "ceph osd crush unlink ceph2"
[23:58] <dmick> is that number changing, perhaps?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.