#ceph IRC Log


IRC Log for 2013-09-25

Timestamps are in GMT/BST.

[0:01] <angdraug> symmcom: that depends on how #ofhdds/#ofssds ratio compares to the hdd/ssd throughput ratio
[0:01] * rendar (~s@host116-161-dynamic.1-87-r.retail.telecomitalia.it) Quit ()
[0:03] <symmcom> angdraug: do u mean the total # of hdd vs the total size of the node?
[0:04] <angdraug> basically you need to compare combined throughput of your hdds with that of your ssds within the same node
[0:04] * The_Bishop (~bishop@2001:470:50b6:0:873:cadb:5333:c64c) Quit (Ping timeout: 480 seconds)
[0:04] <angdraug> 1 ssd to 12 hdds most likely won't do you much good, so the benefit of putting journals on ssd disappears
[0:04] * dmsimard1 (~Adium@ap08.wireless.co.mtl.iweb.com) has joined #ceph
[0:04] <symmcom> ah i see
[0:04] <angdraug> but if your ssd is say 20 times faster than your hdds, 1/12 may still be worthwhile
[0:05] <angdraug> alternatively, if you have 4 ssds per node, that gives you a ratio of 1 ssd per 3 hdds which is not bad at all
[0:06] * dmsimard (~Adium@ Quit (Read error: Operation timed out)
[0:06] * dmsimard1 (~Adium@ap08.wireless.co.mtl.iweb.com) Quit ()
[0:06] <angdraug> and of course there's the drive fault recovery tradeoff to account for
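The comparison angdraug describes above (combined SSD throughput against combined HDD throughput within one node) can be sketched in a few lines. This is only an illustration: the function name and the 100 MB/s HDD figure are assumptions, not values from the discussion, and the 480 MB/s SSD figure is borrowed from xarses's later remark about midline SSDs.

```python
def journal_throughput_ratio(num_ssds, ssd_mb_s, num_hdds, hdd_mb_s):
    """Combined SSD write throughput divided by combined HDD write throughput.

    A ratio below 1.0 suggests the journal SSDs would become the bottleneck,
    because every write lands in the journal before it reaches the data disk.
    """
    if num_ssds <= 0 or num_hdds <= 0:
        raise ValueError("need at least one SSD and one HDD")
    return (num_ssds * ssd_mb_s) / (num_hdds * hdd_mb_s)


# Hypothetical node: 12 HDDs at an assumed 100 MB/s each.
one_ssd = journal_throughput_ratio(1, 480, 12, 100)    # the 1:12 layout
four_ssds = journal_throughput_ratio(4, 480, 12, 100)  # the 1:3 layout

print(f"1 SSD per 12 HDDs: ratio {one_ssd:.2f}")
print(f"1 SSD per 3 HDDs:  ratio {four_ssds:.2f}")
```

With these assumed speeds the single SSD covers well under half of what the twelve HDDs can absorb, while four SSDs exceed it, which matches the rule of thumb given in the channel. Real numbers should come from measuring the actual drives.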
[0:06] <symmcom> how big does the SSD need to be for 3 hdds?
[0:06] * sprachgenerator (~sprachgen@ Quit (Quit: sprachgenerator)
[0:07] <angdraug> The journal size should be at least twice the product of the expected drive speed multiplied by filestore max sync interval.
[0:07] <angdraug> http://ceph.com/docs/master/rados/configuration/osd-config-ref/
[0:08] <LCF> drive speed ?
[0:08] <LCF> like what speed ?
[0:08] <xarses> MiB/s
[0:08] <LCF> if I've got only ssd in cluster ?
[0:09] <angdraug> then it doesn't sound like you need a separate device for journals :)
[0:09] <xarses> =)
[0:09] <LCF> yeah but still wonder about journal size
[0:09] <LCF> or it doesn't matter ?
[0:10] <xarses> whatever the manufacturer lists as the max write bandwidth
[0:10] <xarses> most newish midline ssd's are around 480MiB/s
[0:13] * The_Bishop (~bishop@ has joined #ceph
[0:15] * jeff-YF (~jeffyf@ Quit (Quit: jeff-YF)
[0:15] <xarses> LCF several people on the channel mentioned that most journals will never need to be larger than 10GiB
[0:16] <LCF> thanks
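The sizing rule quoted above (journal size at least twice the drive speed times filestore max sync interval) works out as follows. This is a minimal sketch: the function name is made up for illustration, and the 5 second sync interval is an assumed value that should be checked against the cluster's own configuration.

```python
def min_journal_size_mb(drive_speed_mb_s, filestore_max_sync_interval_s):
    """Lower bound on journal size: 2 * drive speed * max sync interval."""
    if drive_speed_mb_s <= 0 or filestore_max_sync_interval_s <= 0:
        raise ValueError("speed and interval must be positive")
    return 2 * drive_speed_mb_s * filestore_max_sync_interval_s


# Assumed sync interval of 5 seconds; verify against your own ceph.conf.
SYNC_INTERVAL_S = 5

# 480 MB/s is the SSD figure mentioned in the channel.
ssd_journal_mb = min_journal_size_mb(480, SYNC_INTERVAL_S)
print(f"480 MB/s drive: at least {ssd_journal_mb} MB of journal")
```

Under these assumptions even a fast SSD stays below the 10 GiB ceiling xarses mentions, so the two pieces of advice are consistent with each other.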
[0:17] <LCF> btw, does anyone know a little more about rgw_thread_pool_size than the documentation?
[0:18] <symmcom> do the Ceph mons' storage requirements grow over time? 3 of my mons started giving me a 30% storage alert
[0:18] <sagewk> dmick: https://github.com/ceph/ceph/pull/635
[0:19] <sagewk> symmcom: du /var/lib/ceph/mon/* to see how big. probably your / is just getting full of logs or something. ceph is trying to be helpful.
[0:21] <symmcom> sagewk: du shows 30252 for store.db. is that in bytes?
[0:21] <lurbs> Add -h to the du, it'll tell you in human readable units.
[0:22] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) has joined #ceph
[0:22] <lurbs> It'll likely be in KiB by default, BTW.
[0:23] <symmcom> lurbs: thanks. :) it shows in MB, only 35M so i guess that leaves me with the OS itself growing in size
[0:23] <dmick> sagewk: looks good, although I wonder about error notification; I guess if err < 0, it'll trigger?
[0:23] <sagewk> yep
[0:24] <dmick> and I suppose we could test that the weight percolates up, but that would be harder and maybe not worth it. lgtm
[0:25] <sagewk> dmick: yep. tnx
[0:28] <symmcom> sagewk: Just curious, are you the Sage, creator of CEPH?
[0:29] <sagewk> not the only creator, but yes :)
[0:30] <symmcom> Great! it is a great creation. Thanks for that. :)
[0:31] * dmsimard (~Adium@ap01.wireless.co.mtl.iweb.com) has joined #ceph
[0:33] <xarses> symmcom, check du -hs /var/log/ceph
[0:33] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[0:33] <dmick> LCF: what kind of information are you after about rgw thread pool size?
[0:33] * dmsimard1 (~Adium@ has joined #ceph
[0:35] <symmcom> xarses: du -hs shows 71M in /var/log/ceph
[0:35] * dmsimard (~Adium@ap01.wireless.co.mtl.iweb.com) Quit (Read error: Operation timed out)
[0:35] <xarses> symmcom, so ceph might not be the culprit :)
[0:36] <LCF> dmick: wonder how that value is connected with performance of rgw
[0:37] <dmick> indirectly, as you might expect
[0:37] <dmick> to me it's one of those "if you think you're running out of threads, try tweaking it" knobs
[0:38] <dmick> and if you're not running out of threads it likely has no effect
[0:38] <LCF> how did you find out you are running out of threads?
[0:38] <symmcom> xarses: didnt think it was, but just wanted to be very sure :)
[0:38] <LCF> right now I get ~12.5k op/sec with 70MB rd and wr at the same time
[0:39] <LCF> at peak
[0:39] <dmick> I didn't, but if I saw requests go smoothly until I hit a certain number of simultaneous requests, and then suddenly start to take a very long time, and/or saw things in the log, and/or looked at the radosgw process and saw 100+ threads....
[0:39] <LCF> so I guess I see same problem
[0:40] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) Quit (Quit: :qa!)
[0:40] <LCF> dmick: how did you change value ?
[0:40] <LCF> I've got right now:
[0:40] <LCF> "rgw_thread_pool_size": "100",
[0:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:40] <dmick> I didn't
[0:40] <dmick> but it's a configuration value like any other configuration value
[0:41] <LCF> rgw thread pool size = 200
[0:41] <dmick> http://ceph.com/docs/master/radosgw/config/
[0:41] <LCF> I put something like that in the conf
[0:41] <LCF> restarting radosgw didn't change the value
[0:42] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) Quit (Quit: Leaving.)
[0:42] <dmick> that should do it
[0:43] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) has joined #ceph
[0:43] <LCF> crap
[0:43] <LCF> maybe it's bug then
[0:44] <LCF> I will try to re-create the case tomorrow and report it if it's still there
[0:45] * carif (~mcarifio@75-150-97-46-NewEngland.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:47] <symmcom> xarses: sudo apt-get clean did the trick. it cleaned the apt package cache and all the low-space warnings disappeared from the 3 mons
[0:51] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[0:51] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[0:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[0:58] * dmsimard1 (~Adium@ Quit (Ping timeout: 480 seconds)
[1:14] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Leaving...)
[1:14] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[1:21] * LeaChim (~LeaChim@host86-135-252-168.range86-135.btcentralplus.com) Quit (Read error: Operation timed out)
[1:22] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:41] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) has joined #ceph
[1:45] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Ping timeout: 480 seconds)
[1:54] * rturk is now known as rturk-away
[1:54] * Vjarjadian (~IceChat77@ Quit (Quit: Beware of programmers who carry screwdrivers.)
[1:55] * Vjarjadian (~IceChat77@ has joined #ceph
[1:55] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[2:01] * rturk-away is now known as rturk
[2:04] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) Quit (Ping timeout: 480 seconds)
[2:06] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[2:07] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[2:07] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[2:23] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[2:29] * DarkAceZ (~BillyMays@ has joined #ceph
[2:36] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) has joined #ceph
[2:38] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Connection reset by peer)
[2:40] * sagelap1 (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) Quit (Ping timeout: 480 seconds)
[2:49] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[2:51] * Cube (~Cube@ has joined #ceph
[2:51] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[2:57] * freedomhui (~freedomhu@ has joined #ceph
[3:11] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[3:12] * diegows (~diegows@ has joined #ceph
[3:14] * glzhao (~glzhao@ has joined #ceph
[3:18] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Ping timeout: 480 seconds)
[3:21] * The_Bishop (~bishop@ Quit (Ping timeout: 480 seconds)
[3:24] * xarses (~andreww@ Quit (Ping timeout: 480 seconds)
[3:25] * angdraug (~angdraug@ Quit (Quit: Leaving)
[3:26] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) has joined #ceph
[3:35] * The_Bishop (~bishop@2001:470:50b6:0:873:cadb:5333:c64c) has joined #ceph
[3:39] * rturk is now known as rturk-away
[3:41] * yy-nm (~Thunderbi@ has joined #ceph
[3:47] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[3:47] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) Quit (Quit: sync && halt)
[3:52] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Leaving...)
[3:56] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:57] * shang (~ShangWu@ has joined #ceph
[4:01] * freedomhui (~freedomhu@ Quit (Quit: Leaving...)
[4:01] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[4:01] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:01] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit ()
[4:02] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:02] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Connection reset by peer)
[4:03] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[4:03] * ChanServ sets mode +o scuttlemonkey
[4:09] * torment (~torment@pool-72-91-185-241.tampfl.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[4:09] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Quit: Leaving.)
[4:17] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[4:21] * freedomhui (~freedomhu@ has joined #ceph
[4:25] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[4:27] * huangjun (~kvirc@ has joined #ceph
[4:29] * freedomhui (~freedomhu@ Quit (Ping timeout: 480 seconds)
[4:32] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[4:34] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) has joined #ceph
[4:39] * yanlb (~bean@ has joined #ceph
[4:39] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[4:40] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:40] * freedomhui (~freedomhu@ has joined #ceph
[4:45] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[5:06] * fireD_ (~fireD@93-142-219-198.adsl.net.t-com.hr) has joined #ceph
[5:06] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) Quit (Remote host closed the connection)
[5:06] * MACscr (~Adium@c-98-214-103-147.hsd1.il.comcast.net) has joined #ceph
[5:07] * fireD (~fireD@93-142-250-247.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:08] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Read error: Operation timed out)
[5:10] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:28] * freedomhui (~freedomhu@ Quit (Quit: Leaving...)
[5:37] * freedomhui (~freedomhu@ has joined #ceph
[5:59] * newbie (~kvirc@ has joined #ceph
[6:00] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[6:05] * wschulze1 (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[6:06] * newbie|2 (~kvirc@ has joined #ceph
[6:07] * huangjun (~kvirc@ Quit (Ping timeout: 480 seconds)
[6:09] * KindTwo (~KindOne@h51.33.186.173.dynamic.ip.windstream.net) has joined #ceph
[6:09] * clayb (~kvirc@ Quit (Read error: Connection reset by peer)
[6:10] * torment (~torment@pool-72-91-185-241.tampfl.fios.verizon.net) has joined #ceph
[6:11] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Ping timeout: 480 seconds)
[6:11] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:11] * KindTwo is now known as KindOne
[6:13] * newbie (~kvirc@ Quit (Ping timeout: 480 seconds)
[6:19] * yy-nm (~Thunderbi@ Quit (Read error: Connection reset by peer)
[6:19] * yy-nm (~Thunderbi@ has joined #ceph
[6:20] * wschulze1 (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Read error: Connection reset by peer)
[6:20] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[6:25] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit ()
[6:26] * todin (tuxadero@kudu.in-berlin.de) has joined #ceph
[6:27] * lxo (~aoliva@lxo.user.oftc.net) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * glzhao (~glzhao@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * Vjarjadian (~IceChat77@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * `10` (~10@juke.fm) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * cfreak201 (~cfreak200@p4FF3EF6C.dip0.t-ipconnect.de) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * mjeanson (~mjeanson@00012705.user.oftc.net) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * via (~via@smtp2.matthewvia.info) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * saumya (uid12057@ealing.irccloud.com) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * nigwil (~chatzilla@2001:44b8:5144:7b00:dc7a:214e:dc79:ee1) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * Zethrok_ (~martin@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * todin_ (tuxadero@kudu.in-berlin.de) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * shimo (~A13032@122x212x216x66.ap122.ftth.ucom.ne.jp) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * capri_on (~capri@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * JayBox (~chatzilla@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * Yen (~Yen@2a00:f10:103:201:ba27:ebff:fefb:350a) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * absynth (~absynth@irc.absynth.de) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * cce (~cce@ Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * lurbs (user@uber.geek.nz) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * wogri (~wolf@nix.wogri.at) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * baffle (baffle@jump.stenstad.net) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * sileht (~sileht@gizmo.sileht.net) Quit (reticulum.oftc.net magnet.oftc.net)
[6:27] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) Quit (reticulum.oftc.net magnet.oftc.net)
[6:28] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:28] * glzhao (~glzhao@ has joined #ceph
[6:28] * Vjarjadian (~IceChat77@ has joined #ceph
[6:28] * `10` (~10@juke.fm) has joined #ceph
[6:28] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has joined #ceph
[6:28] * cfreak201 (~cfreak200@p4FF3EF6C.dip0.t-ipconnect.de) has joined #ceph
[6:28] * mjeanson (~mjeanson@00012705.user.oftc.net) has joined #ceph
[6:28] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) has joined #ceph
[6:28] * via (~via@smtp2.matthewvia.info) has joined #ceph
[6:28] * saumya (uid12057@ealing.irccloud.com) has joined #ceph
[6:28] * nigwil (~chatzilla@2001:44b8:5144:7b00:dc7a:214e:dc79:ee1) has joined #ceph
[6:28] * leseb (~leseb@88-190-214-97.rev.dedibox.fr) has joined #ceph
[6:28] * Zethrok_ (~martin@ has joined #ceph
[6:28] * todin_ (tuxadero@kudu.in-berlin.de) has joined #ceph
[6:28] * shimo (~A13032@122x212x216x66.ap122.ftth.ucom.ne.jp) has joined #ceph
[6:28] * capri_on (~capri@ has joined #ceph
[6:28] * JayBox (~chatzilla@ has joined #ceph
[6:28] * sileht (~sileht@gizmo.sileht.net) has joined #ceph
[6:28] * Yen (~Yen@2a00:f10:103:201:ba27:ebff:fefb:350a) has joined #ceph
[6:28] * absynth (~absynth@irc.absynth.de) has joined #ceph
[6:28] * cce (~cce@ has joined #ceph
[6:28] * lurbs (user@uber.geek.nz) has joined #ceph
[6:28] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[6:28] * ccourtaut (~ccourtaut@2001:41d0:1:eed3::1) has joined #ceph
[6:28] * baffle (baffle@jump.stenstad.net) has joined #ceph
[6:28] * wogri (~wolf@nix.wogri.at) has joined #ceph
[6:28] * todin_ (tuxadero@kudu.in-berlin.de) Quit (Ping timeout: 482 seconds)
[6:29] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[6:30] * The_Bishop (~bishop@2001:470:50b6:0:873:cadb:5333:c64c) Quit (Ping timeout: 480 seconds)
[6:30] * freedomhui (~freedomhu@ Quit (Quit: Leaving...)
[6:35] * andes (~oftc-webi@ Quit (Remote host closed the connection)
[6:39] * The_Bishop (~bishop@2001:470:50b6:0:893e:f19b:dc2:694d) has joined #ceph
[6:57] * newbie (~kvirc@ has joined #ceph
[7:04] * newbie|2 (~kvirc@ Quit (Ping timeout: 480 seconds)
[7:06] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[7:10] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[7:51] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[8:02] * newbie|2 (~kvirc@ has joined #ceph
[8:09] * newbie (~kvirc@ Quit (Ping timeout: 480 seconds)
[8:12] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[8:14] * JustEra (~JustEra@ALille-555-1-142-38.w92-155.abo.wanadoo.fr) has joined #ceph
[8:15] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[8:15] * freedomhui (~freedomhu@li565-182.members.linode.com) has joined #ceph
[8:19] * sleinen (~Adium@2001:620:0:2d:4ccd:55d7:f7a7:efe7) has joined #ceph
[8:20] * sleinen1 (~Adium@2001:620:0:26:fd3f:1fb4:b91d:84a7) has joined #ceph
[8:21] * foosinn (~stefan@office.unitedcolo.de) has joined #ceph
[8:23] * freedomhu (~freedomhu@ has joined #ceph
[8:27] * sleinen (~Adium@2001:620:0:2d:4ccd:55d7:f7a7:efe7) Quit (Ping timeout: 480 seconds)
[8:29] * JustEra (~JustEra@ALille-555-1-142-38.w92-155.abo.wanadoo.fr) Quit (Quit: This computer has gone to sleep)
[8:31] * freedomhui (~freedomhu@li565-182.members.linode.com) Quit (Ping timeout: 480 seconds)
[8:31] * KindTwo (~KindOne@h104.62.186.173.dynamic.ip.windstream.net) has joined #ceph
[8:34] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[8:34] * KindTwo is now known as KindOne
[8:36] * Vjarjadian (~IceChat77@ Quit (Quit: Beware of programmers who carry screwdrivers.)
[8:49] * JustEra (~JustEra@ALille-555-1-142-38.w92-155.abo.wanadoo.fr) has joined #ceph
[8:53] * rendar (~s@host32-176-dynamic.3-87-r.retail.telecomitalia.it) has joined #ceph
[8:54] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[8:55] * JustEra (~JustEra@ALille-555-1-142-38.w92-155.abo.wanadoo.fr) Quit (Quit: This computer has gone to sleep)
[9:10] * jcfischer (~fischer@macjcf.switch.ch) has joined #ceph
[9:14] <newbie|2> 2013-09-25 15:10:21.557462 mon.0 [INF] pgmap v1184: 216 pgs: 124 active+remapped, 92 active+degraded; 762 KB data, 20298 MB used, 6031 GB / 6050 GB avail; -4363/46 degraded (-9484.783%)
[9:15] * newbie|2 is now known as huangjun
[9:16] <huangjun> after restarting the osd daemon, all looks fine
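The odd percentage in the pgmap line above is just the reported ratio -4363/46 expressed as a percentage. The short check below reproduces it. The argument names are assumptions about what the two numbers mean; the log itself only shows the raw values.

```python
def degraded_percent(degraded_count, total_count):
    """Percentage as printed in the pgmap line, rounded to three decimals."""
    if total_count == 0:
        raise ZeroDivisionError("total_count must be non-zero")
    return round(100.0 * degraded_count / total_count, 3)


# Values copied from the pgmap line: "-4363/46 degraded (-9484.783%)".
print(degraded_percent(-4363, 46))
```

A negative degraded count cannot describe a real cluster state, so the figure most likely reflects a transient accounting glitch, which fits with everything looking fine after the OSD daemon was restarted.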
[9:17] * JustEra (~JustEra@ has joined #ceph
[9:18] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[9:24] * nerdtron (~kenneth@ has joined #ceph
[9:27] * mattt_ (~textual@ has joined #ceph
[9:30] * Pedras (~Adium@2001:470:84fa:3:f0cd:7353:ecb4:dfc7) Quit (Quit: Leaving.)
[9:36] * JustEra (~JustEra@ Quit (Ping timeout: 480 seconds)
[9:38] * JustEra (~JustEra@ has joined #ceph
[9:41] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) has joined #ceph
[9:41] * ChanServ sets mode +v andreask
[9:42] * sage (~sage@ Quit (Ping timeout: 480 seconds)
[9:46] * LeaChim (~LeaChim@host86-135-252-168.range86-135.btcentralplus.com) has joined #ceph
[9:48] * AfC (~andrew@2407:7800:200:1011:2ad2:44ff:fe08:a4c) Quit (Ping timeout: 480 seconds)
[9:48] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) has joined #ceph
[9:52] * sage (~sage@ has joined #ceph
[9:54] * andrei (~andrei@ has joined #ceph
[9:54] <andrei> hello guys
[9:54] <andrei> I am having a bunch of issues with ceph-osd processes
[9:54] <andrei> they are crashing and producing hung task messages on the osd servers
[9:54] <andrei> could someone please help me investigate the issue
[9:56] * adam3 (~adam@46-65-111-12.zone16.bethere.co.uk) has joined #ceph
[9:57] <nerdtron> andrei, how did you determine the hung task messages?
[9:57] <andrei> nerdtron, let me fpaste in a sec
[9:58] <andrei> nerdtron, http://ur1.ca/fp3jr - here is an example of the dmesg hung task messages
[9:59] <andrei> i have the same issue on both of my ceph osd servers
[10:00] * adam2 (~adam@46-65-111-12.zone16.bethere.co.uk) Quit (Ping timeout: 480 seconds)
[10:07] <MACscr> are mon servers really that resource intensive?
[10:08] <wogri_risc> MACscr - they need around 300 MB of RAM on average
[10:08] <MACscr> how about cpu wise?
[10:10] <wogri_risc> nothing notable
[10:10] <wogri_risc> in a "stable" environment.
[10:10] <wogri_risc> if a lot of OSDs go up and down you might notice some cpu usage, but usually it trends toward 0
[10:11] <MACscr> hmm, i wonder why the ceph recommendations for ceph-mon are so high
[10:12] <huangjun> andrei: do you see the 'slow requests' in ceph -w?
[10:12] <andrei> huangjun, I see them all the time
[10:12] <andrei> about 10,000 messages every day
[10:13] <andrei> this started happening more often since I upgraded to the 0.67 branch
[10:13] <andrei> i had slow requests a few times a week when I had 0.61 installed
[10:14] <andrei> probably about 100-200 messages in total
[10:14] <andrei> with 0.67 it's happening on an hourly basis
[10:14] <andrei> totalling about 10k per day
[10:14] <andrei> my vms are crashing
[10:15] <wogri_risc> MACscr - why do you think they are high? where is that stated?
[10:15] <andrei> and customers are not happy at all (((
[10:15] <MACscr> http://ceph.com/docs/next/install/hardware-recommendations/
[10:15] <MACscr> I'm looking at the production recommendations
[10:15] <wogri_risc> Monitors simply maintain a master copy of the cluster map, so they are not CPU intensive.
[10:16] <huangjun> i'm not sure why the osd is slow, maybe the underlying fs is not as fast as expected
[10:16] * The_Bishop (~bishop@2001:470:50b6:0:893e:f19b:dc2:694d) Quit (Ping timeout: 480 seconds)
[10:16] <wogri_risc> MACscr - all I see is that for the mon they recommend 1 GB of RAM per daemon, and 1 cpu
[10:17] <wogri_risc> that's reasonable and matches what I see in my prod environment
[10:17] <MACscr> wogri: i thought you just said 300mb
[10:18] <wogri_risc> that's what it uses.
[10:18] <wogri_risc> but you would want to have some space left, right?
[10:18] <MACscr> plus their single cpu is a 6 core
[10:18] <wogri_risc> I wouldn't buy a server with 350 MB of RAM just because I believe the OSD will need 300 MB. 1 GB is all right
[10:18] <MACscr> though i guess their production example is just two storage servers? a bit confusing
[10:18] <wogri_risc> yes, that's for storage servers
[10:18] <wogri_risc> look above: Minimum Hardware Recommendations
[10:19] <wogri_risc> there you see what they wask you for a mon process
[10:19] <MACscr> well typically software has min requirements and recommended
[10:19] <wogri_risc> s/wask/want you to get/
[10:19] <MACscr> so i figured the bottom was recommended
[10:19] <andrei> huangjun, i doubt it's an issue with the os / fs
[10:19] <wogri_risc> for a OSD.
[10:19] <andrei> it's xfs, which is recommended by ceph for production
[10:20] <wogri_risc> really, if you use a separate, dedicated machine for the mon, just use the minimum you can buy the server with, usually they come with 8gb anyways, that's more than enough.
[10:20] <wogri_risc> the production cluster is for OSDs.
[10:20] <wogri_risc> I mean the production cluster example
[10:20] <MACscr> i already have hardware, unfortunately they are all dual quad core xeons
[10:21] <wogri_risc> what is so unfortunate about this?
[10:21] <MACscr> and since i have to do 3 of these
[10:21] <wogri_risc> you mean it's a waste?
[10:21] <MACscr> i just hate wasting so much resources on something that doesnt need much
[10:21] <MACscr> yep
[10:21] <wogri_risc> you can have OSD's and MON's on the same machine with no problem at all
[10:21] <wogri_risc> I do it too.
[10:21] <wogri_risc> well equipped machines that share OSDs and MONs. production environment. stable for almost half a year now.
[10:21] <MACscr> well i guess i already need 2 or 3 systems for openstack controllers
[10:22] <wogri_risc> I virtualize the controllers
[10:22] <wogri_risc> then I don't have much of a resource issue.
[10:22] <wogri_risc> I mean, it's a tidier separation.
[10:22] <MACscr> http://www.screencast.com/t/TBkktZahH8fU
[10:23] <wogri_risc> looks fine.
[10:25] * adam4 (~adam@46-65-111-12.zone16.bethere.co.uk) has joined #ceph
[10:25] <huangjun> maybe your cluster is small, and you run the osd in vms?
[10:25] <wogri_risc> why would one run an OSD in a VM?
[10:26] * The_Bishop (~bishop@2001:470:50b6:0:873:cadb:5333:c64c) has joined #ceph
[10:27] <MACscr> lol, that makes zero sense
[10:28] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[10:30] * adam3 (~adam@46-65-111-12.zone16.bethere.co.uk) Quit (Ping timeout: 480 seconds)
[10:30] * jcfischer (~fischer@macjcf.switch.ch) Quit (Quit: jcfischer)
[10:35] * jerker (jerker@82ee1319.test.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[10:36] <MACscr> so i know that you have one osd per disk, but what determines the number of daemons for ceph-mon?
[10:39] * allsystemsarego (~allsystem@ has joined #ceph
[10:44] * Lethalman (~luca@ has joined #ceph
[10:44] <Lethalman> hi
[10:44] <Lethalman> I'm trying to configure a couple of nodes with lxc using ceph
[10:45] <Lethalman> my goal is to create a two-node replica and access the volume remotely
[10:45] <Lethalman> having a hard time sorting things out, however I somehow managed to get two osds up
[10:45] <Lethalman> however this is the health: HEALTH_WARN 192 pgs degraded; 192 pgs stale; 192 pgs stuck stale; 192 pgs stuck unclean; 5 requests are blocked > 32 sec; mds cluster is degraded
[10:45] <Lethalman> 2 osds have slow requests
[10:45] <Lethalman> mds cluster is degraded
[10:46] <Lethalman> this is the health detail: http://paste.debian.net/45899/
[10:50] * jbd_ (~jbd_@2001:41d0:52:a00::77) has joined #ceph
[10:56] * sleinen1 (~Adium@2001:620:0:26:fd3f:1fb4:b91d:84a7) Quit (Quit: Leaving.)
[10:56] * sleinen (~Adium@ has joined #ceph
[10:59] * BManojlovic (~steki@ has joined #ceph
[11:01] * sleinen1 (~Adium@ has joined #ceph
[11:02] * sleinen (~Adium@ Quit (Quit: Leaving.)
[11:04] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Remote host closed the connection)
[11:04] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[11:06] * claenjoy (~leggenda@ has joined #ceph
[11:07] * claenjoy (~leggenda@ Quit ()
[11:07] * claenjoy (~leggenda@ has joined #ceph
[11:08] * yy-nm (~Thunderbi@ Quit (Quit: yy-nm)
[11:10] * sleinen1 (~Adium@ Quit (Ping timeout: 480 seconds)
[11:20] * Cube (~Cube@ Quit (Quit: Leaving.)
[11:22] * yerrysherry (~yerrysher@ has joined #ceph
[11:26] * yerrysherry (~yerrysher@ Quit ()
[11:26] * yerrysherry (~yerrysher@ has joined #ceph
[11:27] * ismell_ (~ismell@host-64-17-89-79.beyondbb.com) Quit (Read error: Operation timed out)
[11:28] * yerrysherry (~yerrysher@ has left #ceph
[11:28] * yerrysherry (~yerrysher@ has joined #ceph
[11:29] * yerrysherry (~yerrysher@ has left #ceph
[11:29] * ismell (~ismell@host-64-17-89-79.beyondbb.com) has joined #ceph
[11:31] <andrei> hello guys
[11:32] <andrei> huangjun, not really, the osd servers are running on physical servers with 24gb of ram and 12 cores on one server and 4 cores on another
[11:46] * svg (~serge@antares.ginsys.net) has joined #ceph
[11:48] <huangjun> andrei: what about your system load and does your customer write many big files to the cluster?
[11:49] <huangjun> we saw osd slow request when writing many big files
[11:49] <huangjun> in another situation, it works fine,
[11:49] <huangjun> and i want to try the btrfs
[11:49] * svg (~serge@antares.ginsys.net) Quit (Quit: Changing server)
[11:50] * Cube (~Cube@ has joined #ceph
[11:53] <huangjun> when i use "dd if=/dev/zero of=test bs=4M count=1000 oflag=direct" to test my ssd drives, the output is 156 MB/s, is that a normal ssd write speed?
[11:53] * svg (~serge@antares.ginsys.net) has joined #ceph
[11:57] * sleinen (~Adium@2001:620:0:25:e975:1055:35fa:c8db) has joined #ceph
[12:03] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[12:31] * nerdtron (~kenneth@ Quit (Remote host closed the connection)
[12:32] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Read error: Connection reset by peer)
[12:33] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[12:36] <decede> huangjun: depends on the SSD
[12:36] <decede> huangjun: some of the small kingston ssd's only did 80MB/s for writes i think and some will do 500MB/s
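For reference, the arithmetic behind huangjun's dd test can be sketched as below. The helper names are made up for illustration. The sketch assumes GNU dd conventions, where bs=4M means 4 MiB blocks and the reported MB/s uses decimal megabytes.

```python
MIB = 1024 * 1024   # dd's "M" block-size suffix is a binary mebibyte
MB = 1000 * 1000    # the MB/s figure is assumed to be decimal megabytes


def dd_total_bytes(block_size_mib, count):
    """Total bytes written by dd with bs=<block_size_mib>M count=<count>."""
    return block_size_mib * MIB * count


def elapsed_seconds(total_bytes, reported_mb_s):
    """Run time implied by a reported throughput in MB/s."""
    return total_bytes / (reported_mb_s * MB)


total = dd_total_bytes(4, 1000)
print(f"{total} bytes written, about {elapsed_seconds(total, 156):.1f} s at 156 MB/s")
```

A run of under half a minute on a single file is a fairly small sample, so repeating the test a few times would give a more trustworthy figure to compare against the 80 MB/s to 500 MB/s range decede mentions.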
[12:39] * huangjun (~kvirc@ Quit (Read error: Connection reset by peer)
[12:43] * yanzheng (~zhyan@ has joined #ceph
[12:50] * glzhao (~glzhao@ Quit (Quit: leaving)
[13:17] * tryggvil (~tryggvil@ has joined #ceph
[13:24] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) has joined #ceph
[13:24] <Lethalman> ok, I somehow managed to get 2 osds running :-)
[13:24] <Lethalman> I've created an ext4 file system mounted on osd-0 and osd-1
[13:25] <Lethalman> but I'm unable to understand if I can use that file system directly for storing things or not
[13:25] <Lethalman> I've created a pool named images
[13:25] <wogri_risc> Lethalman: The filesystem of the OSDs is not intended for you to see which data is where
[13:25] <Lethalman> I'd like to have these two osds being replicated
[13:25] <Lethalman> wogri, ah
[13:26] <wogri_risc> you use abstractions of it
[13:26] <wogri_risc> like rados
[13:26] <wogri_risc> rados block device
[13:26] <Lethalman> mh ok
[13:26] <wogri_risc> radosgw
[13:26] <wogri_risc> and cephfs
[13:26] <Lethalman> ok
[13:26] <Lethalman> I'd like to have a two-node replicated volume
[13:26] <Lethalman> and mount it remotely
[13:26] <wogri_risc> so if you want to use, let's say, something like a "USB Disk" that is replicated between two nodes
[13:26] <wogri_risc> you would want to use RBD
[13:26] <Lethalman> ok perfect
[13:27] <wogri_risc> the replication is done in the crush map
[13:27] <wogri_risc> ceph has great documentation. read everything about RBD and CRUSH maps before you go on.
[13:27] <Lethalman> in the crush map I see they are in root default, host node1 and host node2
[13:27] <Lethalman> yes i'm reading but there are too many things
[13:27] <Lethalman> can't sort them out easily
[13:27] <Lethalman> for example now I wanted to move osds to the pool images I've created with osd pool create images --size 100
[13:27] <Lethalman> but then it says there's no need to move
[13:29] * yanlb (~bean@ Quit (Quit: Konversation terminated!)
[13:30] <Lethalman> also the crush map says root default, but is that a pool or something else? can't understand :(
[13:31] <andreask> Lethalman: if you create a pool there is no "size" option ... but you can choose the number of pgs
[13:31] <Lethalman> yes sorry, think I did pool create 100
[13:32] * freedomhu (~freedomhu@ Quit (Quit: Leaving...)
[13:33] <andreask> if you want 2 replicas you would then do an: osd pool set images size 2
[13:34] <Lethalman> ok
[13:34] * AfC (~andrew@2001:44b8:31cb:d400:6e88:14ff:fe33:2a9c) has joined #ceph
[13:35] <Lethalman> now I did crush set osd.0 1.0 pool=images host=node1
[13:35] <Lethalman> but I don't see images under the crush map
[13:36] <Lethalman> this is the osd tree: http://paste.debian.net/45967/
[13:38] * diegows (~diegows@ has joined #ceph
[13:39] <wogri_risc> Lethalman: please read the fine manual, take your time. ceph is not easily explained. you lose time by trying stuff out instead of knowing what to do next.
[13:39] <Lethalman> wogri_risc, thanks, trying to
[13:40] <wogri_risc> I read about two days. then I started to fiddle.
[13:41] <Lethalman> wogri_risc, so I guess you know why rbd map images --pool images says rbd: add failed: (13) Permission denied and I can't find anything on the net about that error
[13:42] <wogri_risc> Lethalman: give me the output of rbd ls -l images
[13:43] <Lethalman> wogri_risc, http://paste.debian.net/45972/
[13:45] <wogri_risc> Lethalman: did you modprobe rbd, are you executing it as root, do you have a recent enough kernel on that machine, and are you trying to map on the same machine where the MON resides?
[13:45] <Lethalman> wogri_risc, I'm in lxc, rbd is loaded on the host, that may be the cause?
[13:45] <Lethalman> yes it's on the same machine as the mon
[13:45] <Lethalman> I have a mon per host
[13:48] <wogri_risc> Lethalman - don't do this within lxc
[13:48] <wogri_risc> this rbd mapping is a kernel feature
[13:48] <wogri_risc> and that's why the cgroup won't permit that.
[13:49] <Lethalman> ah, it's not a problem about ceph auth?
[13:49] <Lethalman> I'm trying something like ceph auth get-or-create client.admin mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
[13:49] <Lethalman> or such
[13:52] <Lethalman> looks like a kernel thing though
[13:52] <Lethalman> wogri_risc, do you think setting up the cluster with vbox might work?
[13:54] <wogri_risc> lethalman: anything will work that does "real" virtualisation. like kvm, I don't know how vbox handles this.
[13:55] <Lethalman> ok thanks :)
[13:55] <Lethalman> will retry
[13:55] <wogri_risc> lxc is more of a root-jail than anything else
[13:55] <Lethalman> wogri_risc, my goal is to use a distributed fs for an existing fs
[13:55] <Lethalman> I saw that cephfs is not ready for production
[13:55] <Lethalman> so I guess I have to copy the existing data to the new rbd block right?
[13:56] <Lethalman> I can't reuse the actual xfs file system
[13:56] <wogri_risc> Lethalman: right
[13:56] <wogri_risc> if you're looking for a stable distributed FS you might - currently - be better off with glusterfs though
[13:59] * markbby (~Adium@ has joined #ceph
[14:01] <absynth> heretic! burn him at the stake!
[14:03] <wogri_risc> ey, I wrote 'CURRENTLY' :)
[14:04] * sjm (~sjm@ has joined #ceph
[14:16] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:16] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:16] * sjm (~sjm@ Quit (Quit: Leaving.)
[14:29] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[14:30] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[14:36] * Lethalman (~luca@ Quit (Ping timeout: 480 seconds)
[14:38] * Lethalman (~luca@ has joined #ceph
[14:39] * Lethalman_ (~luca@net77-43-20-100.mclink.it) has joined #ceph
[14:43] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[14:43] <Lethalman_> wogri_risc, tested glusterfs, it auto-dosses when self-healing... maybe I can overcome that with a traffic shaper, are you using it?
[14:45] * AfC (~andrew@2001:44b8:31cb:d400:6e88:14ff:fe33:2a9c) Quit (Quit: Leaving.)
[14:46] * Lethalman (~luca@ Quit (Ping timeout: 480 seconds)
[14:47] <wogri_risc> I am.
[14:48] <wogri_risc> But I didn't really test the self-healing too much.
[14:48] <wogri_risc> It's a production web cluster with only two nodes. and ceph is bad at having a deployment with two nodes.
[14:48] <Lethalman_> ah ok
[14:49] <Lethalman_> wogri_risc, I'm thinking of two partitions with two replicas
[14:49] <Lethalman_> as long as one node goes down, it's fine
[14:49] <wogri_risc> wrong
[14:49] <wogri_risc> ceph mons need to be an odd number
[14:49] <wogri_risc> like one or three
[14:49] <Lethalman_> oh
[14:49] <wogri_risc> if you have two monitor daemons, one goes down, ceph stands still
[14:49] <Lethalman_> wogri_risc, though the docs say at least two
[14:49] <Lethalman_> wogri_risc, however I was talking about gluster
[14:50] <Lethalman_> if two nodes go down and something like a split brain happens, then cpu goes beyond 100% and network saturates
[14:50] <wogri_risc> ok. i haven't seen that in my tests.
[14:50] <Lethalman_> I have to store terabytes of images
[14:50] <Lethalman_> and I'm still looking for a 2x2 solution without much luck
[14:50] <wogri_risc> images? sounds like a radosgw issue to me, not rbd
[14:51] <Lethalman_> wogri_risc, they are stored and accessed with a file system, can't change to use rest
[14:51] <wogri_risc> again: ceph is not good with two servers when it comes to failure of the wrong node.
[14:51] <wogri_risc> ceph starts being good at three nodes.
[14:51] <Lethalman_> thanks, good to know
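The monitor advice above comes down to majority arithmetic. The following is a minimal sketch, assuming (as the discussion implies) that the cluster keeps working only while a strict majority of monitors is up. The function names are illustrative and are not part of Ceph.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors`."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be down while a majority is still up."""
    return monitors - quorum_size(monitors)


# Two monitors tolerate no failure at all, which is why an even count
# buys nothing over the next lower odd count.
for n in range(1, 6):
    print(f"{n} mon(s): quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

Under this assumption a fourth monitor adds no extra tolerance over three, and five monitors are the smallest count that survives two monitor failures.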
[14:52] <Lethalman_> lustre is not an option cause it's known that patches don't apply to debian squeeze
[14:52] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[14:52] <Lethalman_> that is, good support for rhel but not for debian
[14:52] <wogri_risc> what about DRBD?
[14:52] <Lethalman_> wogri_risc, how do I handle two partitions with drbd?
[14:52] <wogri_risc> old school, old fashioned, not scalable at all, but it suffices for your needs
[14:53] <wogri_risc> you have one big drbd thing
[14:53] <wogri_risc> create 2 LV's on it
[14:53] <Lethalman_> I can create an lvm living on two different servers?
[14:53] <Lethalman_> I mean, I need 2x2 drbd nodes
[14:53] <wogri_risc> what does 2x2 mean?
[14:53] <Lethalman_> 4 servers, 2 replicas
[14:54] <wogri_risc> oh. why don't you use ceph then :)
[14:54] <Lethalman_> eh :S
[14:54] <wogri_risc> hah :)
[14:54] <wogri_risc> I thought 2x2 means 2 servers, 2 disks :)
[14:54] <Lethalman_> no sorry, 4 servers, 2 replica for 2 partitions
[14:54] <Lethalman_> gluster is quite good at that
[14:54] <wogri_risc> yes. welcome to the ceph channel. you're right here :)
[14:55] <Lethalman_> awesome
[14:55] <wogri_risc> hahaha
[14:55] <Lethalman_> now I only have to redo my configuration using vbox rather than lxc :)
[14:55] <wogri_risc> what you want is: 3 monitor daemons on three servers
[14:55] <wogri_risc> one OSD per disk
[14:55] <wogri_risc> and a CRUSH map that does the replication as you want it.
[14:55] <wogri_risc> so you would have two different pools
[14:55] <wogri_risc> one pool is striped across server 1 and 2
[14:56] <wogri_risc> and the data of the other pool is striped across 3 and 4.
[14:56] <Lethalman_> how do I assign one osd to a pool? osd crush set says it doesn't need to be moved
[14:56] <wogri_risc> you can't assign an OSD to a pool.
[14:56] <wogri_risc> read the document about crush maps
[14:57] <wogri_risc> a pool has a placement rule, that placement rule contains certain OSDs.
[14:57] <wogri_risc> you do this in the crushmap
[14:57] <wogri_risc> but
[14:58] <wogri_risc> you could also say: I have two different pools, and I want ceph to replicate all the data across all servers, at least 2 copies on different servers.
[14:58] <wogri_risc> and don't care about your 'partitions'. if that makes sense
[14:58] <wogri_risc> anyways, I got to go.
[14:58] <Lethalman_> wogri_risc, thanks :)
[14:58] <Lethalman_> yes, not having to care about the partition also makes sense
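wogri_risc's point that a pool is tied to a placement rule, and that the rule (not the pool) selects the OSDs, can be sketched as text generation. This assumes the decompiled CRUSH map syntax used by Ceph releases of this era. The bucket names pair_a and pair_b are hypothetical roots that would have to be defined in the CRUSH map, each containing two of the four hosts, and the ruleset numbers are arbitrary.

```python
def render_crush_rule(name: str, ruleset: int, root: str,
                      failure_domain: str = "host") -> str:
    """Render one replicated CRUSH rule in decompiled crushmap text form.

    `root` is the bucket the rule starts from; `failure_domain` is the
    bucket type across which replicas are separated.
    """
    lines = [
        f"rule {name} {{",
        f"    ruleset {ruleset}",
        "    type replicated",
        "    min_size 1",
        "    max_size 10",
        f"    step take {root}",
        f"    step chooseleaf firstn 0 type {failure_domain}",
        "    step emit",
        "}",
    ]
    return "\n".join(lines)


# Hypothetical layout: one rule per "partition" of two hosts.
print(render_crush_rule("pair_a", 3, "pair_a"))
print(render_crush_rule("pair_b", 4, "pair_b"))
```

Each pool would then be pointed at one of these rules. In releases of this period that was done with a pool setting named crush_ruleset, but check the documentation for the version in use before relying on that name.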
[15:00] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[15:01] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Remote host closed the connection)
[15:20] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[15:24] * jeff-YF (~jeffyf@ has joined #ceph
[15:30] * andrei (~andrei@ Quit (Ping timeout: 480 seconds)
[15:30] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[15:34] * Cube (~Cube@wr1.pit.paircolo.net) has joined #ceph
[15:41] * odyssey4me (~odyssey4m@ has joined #ceph
[15:41] * yanzheng (~zhyan@ has joined #ceph
[15:52] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:54] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) has joined #ceph
[15:54] * jskinner (~jskinner@ has joined #ceph
[15:58] * dmsimard (~Adium@ap02.wireless.co.mtl.iweb.com) has joined #ceph
[15:59] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:08] * jskinner_ (~jskinner@ has joined #ceph
[16:08] * jskinner (~jskinner@ Quit (Read error: Connection reset by peer)
[16:20] * shang (~ShangWu@ Quit (Remote host closed the connection)
[16:25] * andreask (~andreask@h081217135028.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[16:26] * dmsimard1 (~Adium@ has joined #ceph
[16:26] * dmsimard (~Adium@ap02.wireless.co.mtl.iweb.com) Quit (Read error: Connection reset by peer)
[16:32] * danieagle (~Daniel@ has joined #ceph
[16:41] * Lethalman__ (~luca@ has joined #ceph
[16:43] <swinchen> So this is kind of how you would set up a spine/leaf Ceph deployment? : http://goo.gl/kcEA3O
[16:45] <swinchen> And what would you do if you only had one trunk to the internet?
[16:45] * Lethalman_ (~luca@net77-43-20-100.mclink.it) Quit (Read error: Operation timed out)
[16:47] <swinchen> Sweet.. there is an Anonymous Liger looking at the document
[16:48] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[16:49] <scuttlemonkey> swinchen: did you see the dreamhost setup?
[16:50] <swinchen> scuttlemonkey: I watched the video... I think this looks similar.
[16:51] <scuttlemonkey> swinchen: yeah, very similar...which is why I asked
[16:51] <scuttlemonkey> they spelled it out in an arch writeup...lemme see if I can find it
[16:51] <swinchen> network fabric is a new concept to me, so I am still getting used to it... I am not sure what the requirement of the switches are... but I think the links between the leaf and spine are L3 routed, so that is a bit peculiar.
[16:51] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) Quit (Read error: Connection reset by peer)
[16:52] <scuttlemonkey> https://dl.dropboxusercontent.com/u/5334652/DreamCompute%20Architecture%20Blueprint.pdf
[16:52] <scuttlemonkey> that has some great explanation around the implementation
[16:52] <scuttlemonkey> they did a pretty good job covering the bases imo
[16:53] <swinchen> thanks scuttlemonkey.
[16:53] <scuttlemonkey> np
[16:58] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[17:01] * mnash (~chatzilla@vpn.expressionanalysis.com) has joined #ceph
[17:01] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[17:03] <swinchen> man, dreamhost invested some SERIOUS money into this storage cluster
[17:03] <scuttlemonkey> hehe yeah
[17:04] * kuba (~kuba@ Quit (Read error: Operation timed out)
[17:04] <swinchen> one leaf switch costs $23000? Oh my god.
[17:07] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:08] * sprachgenerator (~sprachgen@ has joined #ceph
[17:08] <scuttlemonkey> yeah, they are definitely on a pretty extreme end of the spectrum
[17:08] <scuttlemonkey> but it's a good place to start and scale back from
[17:08] <swinchen> Yeah, I hope you can buy less expensive spine and leaf switches
[17:08] * freedomhui (~freedomhu@ has joined #ceph
[17:08] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[17:12] * foosinn (~stefan@office.unitedcolo.de) Quit (Quit: Leaving)
[17:18] * mtanski (~mtanski@ has joined #ceph
[17:18] * markbby1 (~Adium@ has joined #ceph
[17:20] <mtanski> sage: I posted a patch for an fscache issue. It's been the only issue that I've seen on my client cluster in a couple weeks (I was away on vacation). It happens infrequently, since it took a while to encounter it but it's a simple fix (2 lines) and it should go up to the kernel before the release.
[17:20] * markbby (~Adium@ Quit (Read error: Connection reset by peer)
[17:21] * JustEra (~JustEra@ Quit (Read error: Operation timed out)
[17:23] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Read error: Connection reset by peer)
[17:23] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[17:30] <mikedawson> swinchen: you may be better off with 3 leaf switches, each housing a single monitor
[17:30] <swinchen> mikedawson: Yeah, I ran out of room.
[17:30] <mikedawson> swinchen: and I don't really get the router placement
[17:31] <swinchen> From the dreamhost document I don't understand the dual 40Gb link between the two spine switches.
[17:32] * shang (~ShangWu@ has joined #ceph
[17:32] <swinchen> mikedawson: I was a bit unsure of the routers. would you put them on the spine switches? I read somewhere that the only thing spine switches should have connected to them are leaf switches or other spine switches.
[17:34] <mikedawson> swinchen: if you have 100s of leafs, are you going to have 100s of routers?
[17:34] <swinchen> mikedawson: that is a good point.
[17:35] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[17:35] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[17:39] <Lethalman__> I've created a 100mb file
[17:39] <Lethalman__> in an rbd with xfs
[17:39] <Lethalman__> it works and I see the correct pg usage
[17:39] <Lethalman__> then I delete the file, but the pg still have the same usage
[17:40] <Lethalman__> what am I missing?
[17:40] * Lethalman__ is now known as Lethalman
[17:41] <mikedawson> Lethalman: i believe deleting is lazy. In other words, that data will get deleted at some unknown point in the future.
[17:41] <Lethalman> mikedawson, ok thanks
[17:41] <Lethalman> I'd like to store some data in only 2 osd, and other data in other 2 osd (that is 4 osd as 2 partitions with each object with 2 replica)
[17:42] <Lethalman> should I create crush rules?
[17:46] <mikedawson> Lethalman: yes, crush rules are the way to go. People typically don't try to force data anyplace in particular unless it is for 1) enforcing replication across multiple failure domains or 2) creating pools with different underlying hardware performance (an ssd pool, a 15k sas pool, a 7200 sata pool for instance)
[17:46] <Lethalman> mikedawson, if a node fails, I'd like to have the data available
[17:47] <Lethalman> I'd like a whole file not to be striped
[17:47] <Lethalman> and to have copies on exactly 2 replicas
[17:47] <mikedawson> Lethalman: other than those reasons for pools, I can't think of a benefit. The cool thing about Ceph's pseudo-random placement is that it spreads load across your cluster pretty effectively
[17:47] <Lethalman> I've created a file and it's being saved on all the 4 osds even if I set replicas to 2
[17:48] <mikedawson> Lethalman: yep - that's a good thing... pseudo-random placement spreads the load
[17:48] <Lethalman> mikedawson, but if the data is saved on all 4 osds what happens if a node fails
[17:48] <mikedawson> Lethalman: if you are using default settings, RBD will use 4MB objects, so you'll have data striped
[17:49] <Lethalman> exactly
[17:49] <janos> and duplicated
[17:49] <Lethalman> then if I lose a node I potentially lose the file
[17:49] <janos> so no problem if a node fails
[17:49] <janos> no
[17:49] <janos> you have replication
[17:49] <janos> it's not storing the primary chunk and the replicated chunk on the same node
[17:49] <janos> that would be silly
[17:49] <Lethalman> sure
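mikedawson's remark about 4MB objects explains why a single file ends up on all four OSDs: the block device is cut into fixed-size objects, and each object is placed (and replicated) independently. This is a rough sketch of the arithmetic, assuming the default 4 MiB object size and ignoring filesystem metadata and alignment. The helper name is illustrative.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB, the default RBD object size assumed here


def objects_touched(offset: int, length: int,
                    object_size: int = OBJECT_SIZE) -> range:
    """Indices of the RBD objects covered by a byte range of the image."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


# A 100 MiB file written contiguously from an object boundary covers
# 25 objects, each of which CRUSH places on its own set of OSDs.
file_size = 100 * 1024 * 1024
print(len(objects_touched(0, file_size)))
```

With two replicas per object and only four OSDs, those objects will almost certainly be spread over every OSD in the cluster, which matches what Lethalman observed.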
[17:49] <Lethalman> let me do an example, so I understand :)
[17:49] <mikedawson> Lethalman: use crush rules to replicate across hosts instead of osds. That way you'll survive a node failure (assuming you have 2 of three monitors working)
[17:50] <Lethalman> I have 1 osd per node
[17:51] <mikedawson> Lethalman: If you have 1 osd per node and use a replication factor of 2, you'll certainly lose data if two or more nodes (or osd drives) die
[17:52] <Lethalman> ok, but if I use partitions, if 2 nodes die that are on different partitions, I don't lose anything
[17:52] <Lethalman> and load would be still balanced across the two partitions
[17:52] * sig_wall (~adjkru@ Quit (Ping timeout: 480 seconds)
[17:53] * lubyou (~lubyou@dsl093-174-037-223.dialup.saveho.com) Quit (Quit: Leaving)
[17:53] <Lethalman> striping in general provides less fault tolerance
[17:53] <Lethalman> I'd like to disable that
[17:53] <mikedawson> Lethalman: up the replication factor, drive and osd count, and/or node count until you have satisfied your fault tolerance objectives. Then quit before you spend too much chasing diminishing returns
[17:53] <janos> ceph wasn't really intended for such narrow and small deployments
[17:54] * sig_wall (~adjkru@ has joined #ceph
[17:54] <janos> you may as well raid10 the 4 disks and be done ;)
[17:54] <Lethalman> janos, the whole node can still fail
[17:55] * lubyou (~lubyou@dsl093-174-037-223.dialup.saveho.com) has joined #ceph
[17:55] <janos> sure, anything can fail. but what i mean is that ceph is intended to be more pieces. if you are aiming at such a narrow case, it's missing the benefits ceph really provides
[17:55] <Lethalman> mikedawson, sorry can you rephrase?
[17:55] <Lethalman> janos, I get that, but my requirement is to have more fault tolerance than i/o performance or such
[17:56] <mikedawson> Lethalman: read up on small failure domains in scale-out systems
[17:56] <janos> if fault tolerance is the primary goal, you're going to want more hosts and more osd's per host
[17:56] <Lethalman> why would I want more osd per host
[17:56] <janos> and crush rule (default) set to host as the failure domain
[17:57] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[17:57] <janos> because things fail
[17:57] <janos> esp. disks
[17:57] <janos> spread your risk. spread your load, etc
[17:57] <Lethalman> yes sure
[17:57] <Lethalman> however I'm thinking about nodes rather than osds
[17:57] <Lethalman> supposing I have 1 osd per node
[17:57] <Lethalman> or X osd per node
[17:58] <Lethalman> I'd like to replicate node1 to node2 only
[17:58] <Lethalman> striping might as well be a possibility for i/o performance
[17:58] <janos> while it can work, it's not the spirit of ceph
[17:58] <janos> more nodes
[17:59] <Lethalman> janos, I only have 4 nodes, and spreading to all the 4 nodes doesn't make me able to reason about two-node failures
[17:59] <Lethalman> this is my requirement
[17:59] <Lethalman> knowing that I have exactly two partitions, I can reason about that
[17:59] <Lethalman> is it possible to do with ceph, even if it's not in the spirit? :)
[17:59] <janos> i think so, yeah
[17:59] <Lethalman> ok, I'm reading crush rules right now, let's see
[17:59] <janos> cool
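Lethalman's reasoning about two-node failures can be checked by brute force. This is a small sketch under simplifying assumptions: four nodes with one OSD each, two replicas per object, enough objects that every allowed pair of hosts holds at least one, and monitor quorum ignored. The node names follow the crush map mentioned earlier for node1 and node2; node3 and node4 are assumed.

```python
from itertools import combinations

NODES = ["node1", "node2", "node3", "node4"]


def replica_sets_paired():
    """Two fixed 'partitions': every object lives on both nodes of one pair."""
    return [frozenset({"node1", "node2"}), frozenset({"node3", "node4"})]


def replica_sets_spread():
    """Replicas on any two distinct hosts, as with the default host rule."""
    return [frozenset(pair) for pair in combinations(NODES, 2)]


def fatal_failures(replica_sets, failed_count=2):
    """Failure combinations that take down every copy of some replica set."""
    fatal = []
    for failed in combinations(NODES, failed_count):
        down = set(failed)
        if any(rs <= down for rs in replica_sets):
            fatal.append(failed)
    return fatal


print("paired:", len(fatal_failures(replica_sets_paired())), "of 6 two-node failures lose data")
print("spread:", len(fatal_failures(replica_sets_spread())), "of 6 two-node failures lose data")
```

The counts only tell part of the story. In the paired layout a fatal failure takes out half of all the data at once, while in the spread layout each failure loses only the objects that happened to land on exactly those two hosts.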
[18:01] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[18:01] <mikedawson> Lethalman: if you choose to use ceph, you'll want a replication factor of 3. Then you can lose two nodes without any data loss. That being said, you'll potentially have a quorum issue... 4 nodes means you'll run 3 monitors, and two could have failed
[18:01] <Lethalman> I'm aware of that
[18:01] <Lethalman> can't do much with up to 4 nodes ;)
[18:02] <Lethalman> mikedawson, so you're saying that in case 2 nodes fail, ceph can't run because only 1 other monitor is running?
[18:02] <mikedawson> Lethalman: 5 nodes is a functional minimum in a consensus-based system designed to sustain two concurrent node failures
[18:02] <Lethalman> yes
[18:02] <mikedawson> Lethalman: exactly, if the two failed nodes had monitors
[18:02] <Lethalman> I'm not saying I want to sustain two-node failures in all cases
[18:02] <Lethalman> only in the case the two nodes are in two different partitions
[18:03] * angdraug (~angdraug@ has joined #ceph
[18:03] <mikedawson> Lethalman: then you still want 3x replication and a bit of luck about monitor consensus. Not the plan I would deploy, though
[18:04] <Lethalman> let's talk only about 4 nodes :)
[18:07] <Lethalman> taking the default ruleset for data
[18:08] <Lethalman> I have a tree with a root and 4 hosts
[18:08] <Lethalman> the default rule is step chooseleaf firstn 0 type host
[18:08] <Lethalman> shouldn't it take only 2 hosts to store an object?
[18:08] <Lethalman> ah right, an object in rbd is a block, so striping is always there
[18:09] <Lethalman> so I can't disable striping at all?
[18:09] * mattt_ (~textual@ Quit (Read error: Connection reset by peer)
[18:09] * sig_wall (~adjkru@ Quit (Read error: Connection timed out)
[18:09] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[18:10] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[18:10] <mikedawson> Lethalman: correct. embrace it - there are advantages
[18:10] <Lethalman> mh :S
[18:11] * sig_wall (~adjkru@ has joined #ceph
[18:11] <Lethalman> mikedawson, and disadvantages
[18:11] <mikedawson> Lethalman: perhaps you'd like http://www.drbd.org/
[18:11] <Lethalman> however, since ceph with 4 mons can't survive a two-node failure
[18:11] <Lethalman> non-striping is useless
[18:11] <Lethalman> mikedawson, how can I have 2 replicas and 4 nodes with drbd?
[18:12] <mikedawson> Lethalman: two drbd pairs?
[18:12] <Lethalman> mikedawson, and... how can I merge into one volume
[18:12] * xarses (~andreww@ has joined #ceph
[18:12] <mikedawson> lvm?
[18:12] <Lethalman> mikedawson, I can use lvm between two nodes on a network?
[18:12] <Lethalman> uhm
[18:13] <Kioob`Taff> yes you can.
[18:13] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:14] <Lethalman> interesting
[18:14] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[18:19] * Vjarjadian (~IceChat77@ has joined #ceph
[18:20] * sig_wall (~adjkru@ Quit (Remote host closed the connection)
[18:21] * sig_wall (~adjkru@ has joined #ceph
[18:24] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[18:25] <Lethalman> thanks
[18:28] * wido (~wido@ has joined #ceph
[18:28] * freedomhui (~freedomhu@ Quit (Ping timeout: 480 seconds)
[18:28] <mikedawson> sage: xfs extent fragmentation was not the culprit, I can still recreate the scrub and deep-scrub issues with freshly defrag'ed filesystems. I am getting 1-2% fragmentation daily though.
[18:28] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) has joined #ceph
[18:29] <sage> mikedawson: hmm. in that case, can you try gathering the logs i noted in the bug?
[18:29] * Yen (~Yen@2a00:f10:103:201:ba27:ebff:fefb:350a) Quit (Ping timeout: 480 seconds)
[18:29] <mikedawson> sage: will do.
[18:29] <sage> mikedawson: also, if you are in a position to measure fragmentation vs performance, i can whip up a wip branch (heh) that fallocates rbd objects on creation...
[18:30] <peetaur> Does Ceph / RBD have something like zfs replication, where I can incrementally send a device somewhere, without reading the data on both sides like rsync would? (So for example, using only a single gigabit link, I can send my ZFS 2.51 TB vm data storage to the backup server in 15 seconds every 20 minutes... can I do that with ceph, or will it take 7 hours [for reading data before having delta to send in 15 seconds] like it would with rsync?)
[18:30] <sage> peetaur: see rbd export-diff, import-diff
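sage's pointer to rbd export-diff and import-diff can be turned into an incremental send between two snapshots. The sketch below only builds the command line and does not run it. The pool, image, snapshot, and host names are hypothetical, and it assumes the backup side already holds the image with the earlier snapshot, since the diff is applied on top of it.

```python
import shlex


def export_diff_cmd(pool: str, image: str, from_snap: str, to_snap: str) -> list:
    """Write the changes between two snapshots to stdout ('-')."""
    return ["rbd", "export-diff", "--from-snap", from_snap,
            f"{pool}/{image}@{to_snap}", "-"]


def import_diff_cmd(pool: str, image: str) -> list:
    """Apply a diff read from stdin ('-') to the destination image."""
    return ["rbd", "import-diff", "-", f"{pool}/{image}"]


def incremental_send(pool: str, image: str, from_snap: str, to_snap: str,
                     backup_host: str) -> str:
    """Shell pipeline that streams the diff to the backup host over ssh."""
    src = " ".join(shlex.quote(a) for a in export_diff_cmd(pool, image, from_snap, to_snap))
    dst = " ".join(shlex.quote(a) for a in import_diff_cmd(pool, image))
    return f"{src} | ssh {shlex.quote(backup_host)} {shlex.quote(dst)}"


# Hypothetical names throughout; print the pipeline for inspection.
print(incremental_send("images", "vm1", "s1", "s2", "backup"))
```

Only the extents that changed between the two snapshots are read and sent, which is the property peetaur was asking about in comparison with rsync.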
[18:52] * Disconnected.

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.