#ceph IRC Log


IRC Log for 2013-02-20

Timestamps are in GMT/BST.

[0:01] * Ryan_Lane (~Adium@ has joined #ceph
[0:02] * aliguori (~anthony@ Quit (Remote host closed the connection)
[0:02] * vata (~vata@2607:fad8:4:6:11f4:cedb:50b7:4c14) Quit (Quit: Leaving.)
[0:07] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[0:09] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[0:12] * Ryan_Lane (~Adium@ has joined #ceph
[0:17] * Ryan_Lane (~Adium@ Quit ()
[0:17] * Ryan_Lane (~Adium@ has joined #ceph
[0:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:18] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:18] * Ryan_Lane (~Adium@ Quit ()
[0:19] <dmick> wer: 5GB is pretty small, and competing with your OSD filesystem may not be the best
[0:19] * Ryan_Lane (~Adium@ has joined #ceph
[0:19] <dmick> mons: dunno; how they're not starting is relevant
[0:19] * Ryan_Lane (~Adium@ Quit ()
[0:20] <dmick> they should start whether or not OSDs start
[0:20] <wer> yeah... this thing is hosed..... hmmm. That is strange then.
[0:26] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[0:26] * miroslav1 (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[0:26] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[0:27] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:27] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[0:28] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[0:43] * jlogan (~Thunderbi@2600:c00:3010:1:2dd0:c27d:7304:3f9b) has joined #ceph
[0:45] * Ul1 (~Thunderbi@ip-83-101-40-5.customer.schedom-europe.net) Quit (Ping timeout: 480 seconds)
[0:45] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[0:59] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[1:08] * houkouonchi-work (~linux@ Quit (Remote host closed the connection)
[1:17] <sstan> is it possible to resize an rbd in such a way that I don't have to unmap/remap
[1:18] <sstan> resize2fs /dev/rbd1
[1:18] <sstan> resize2fs 1.41.9 (22-Aug-2009)
[1:18] <sstan> The filesystem is already 512000 blocks long. Nothing to do!
[1:18] * LeaChim (~LeaChim@b0fac1c4.bb.sky.com) Quit (Ping timeout: 480 seconds)
[1:20] <jmlowe> whew, my osd removal is finally finished after 28 hours
[1:21] <leseb_> sstan: you can resize the RBD block device while it is mapped. However, if the filesystem is mounted the kernel doesn't notice that the block device size changed, so you have to unmount and remount it, and then you can issue your resize2fs
[1:27] <sstan> leseb_ : is it possible that it's only an LVM feature to be able to resize without unmounting?
[1:28] <leseb_> sstan: well with LVM it's a bit different, since it sits over the device mapper for instance. It's more Kernel related
[1:29] <leseb_> I know that Kernel 3.9 can re-read the block device size without un-mounting
[1:29] * masterpe (~masterpe@2001:990:0:1674::1:82) Quit (Remote host closed the connection)
[1:29] * masterpe (~masterpe@2001:990:0:1674::1:82) has joined #ceph
[1:29] <leseb_> but it's 3.9....
[1:31] <iggy> most scsi drivers have supported it for years (in varying different ways)
[1:31] <sstan> hmm then I might want to use RBD as PVs for Logical volumes ... I guess that wouldn't be a bottleneck
[1:32] <iggy> the problem then becomes resizing everything else
[1:32] <iggy> partitions, md/lvm/etc., filesystem
[1:32] * houkouonchi-work (~linux@ has joined #ceph
[1:35] <leseb_> iggy: well I never said the opposite, I just tried several times, even opened a thread on the ML a couple of months ago about this.
[1:35] * markl_ (~mark@tpsit.com) Quit (Ping timeout: 480 seconds)
[1:36] <leseb_> sstan: Yes it's a bit overkill, and at the end you have too many layers, even if I don't know the overhead that putting LVM on top of RBD could generate
[1:37] <leseb_> that being said, time to sleep now :)
[1:37] * leseb_ (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[1:37] <sstan> good night!
[1:40] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:43] <sstan> is 17 Mb /s a normal rate? (replica size is 2, gigabit internet, 3 osds on 3 hosts)
[1:58] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[1:58] <gregaf1> not so much, no, but it depends on how your benchmark works
[1:58] * jjgalvez (~jjgalvez@ Quit (Quit: Leaving.)
[2:03] <sjust> sstan: is the write size 1b?
[2:04] <sstan> my benchmark was time dd if=/dev/sda of=data.raw bs=1M count=600 && time sync
[2:04] <sjust> sstan: hmm, that's rather less good
[2:05] <sjust> Mb/s or MB/s?
[2:05] * Cube (~Cube@ Quit (Ping timeout: 480 seconds)
[2:05] <sstan> that's in ext3 .. but when of=/dev/rbd1 , it's 22 megabytes
[2:05] <sstan> Mb/s
[2:06] <sjust> I may be missing something, can you explain the procedure?
[2:13] * diegows (~diegows@ has joined #ceph
[2:13] <houkouonchi-work> out of curiosity does direct I/O make a difference? (shouldn't need the sync i would think then either) add oflag=direct to the dd command?
[2:15] <sstan> houkouonchi-work : you're right
[2:18] <sstan> sjust : yes. I do a dd to /dev/rbd1 and I time it
[2:18] <dmick> sstan: Mb/s is "Megabits per second". I doubt that's what you really meant.
[2:18] <sstan> I thought that 'B' is Bit
[2:18] <sstan> and 'b' byte
[2:19] <sjust> other way around, generally
[2:19] <sjust> B is bigger than b :)
[2:19] <sstan> ah you're right !! http://en.wikipedia.org/wiki/Megabit
[2:19] <wer> please no one mention millibits again :P
[2:20] <dmick> I suspect SI is just recommending Mbit vs Mbyte to avoid this issue too
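The unit mix-up above (Mb/s is megabits per second, MB/s is megabytes per second) can be made concrete with a small Python sketch. The function name `throughput` and the 30-second duration are illustrative assumptions, not values from the log; only the 600 x 1 MiB transfer size comes from sstan's dd command.

```python
def throughput(num_bytes: int, seconds: float) -> dict:
    """Return the same transfer rate in megabytes/s and megabits/s."""
    mb_per_s = num_bytes / seconds / 1_000_000  # decimal megabytes per second
    return {"MB/s": mb_per_s, "Mbit/s": mb_per_s * 8}  # 1 byte = 8 bits


# dd bs=1M count=600 writes 600 MiB; the 30 s duration is a made-up figure.
result = throughput(600 * 1024 * 1024, 30.0)
print(f"{result['MB/s']:.1f} MB/s = {result['Mbit/s']:.1f} Mbit/s")
# prints: 21.0 MB/s = 167.8 Mbit/s
```

The factor of 8 is why quoting a dd result in "Mb/s" makes it look eight times slower than it was.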
[2:21] <sstan> does anyone know how to install the ocf RBD resource agent?
[2:21] * wschulze1 (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:24] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[2:26] <sstan> found the answer in this page: http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
[2:39] <sstan> I get 13MB/s with ... dd if=sles1.raw of=/dev/rbd3 oflag=direct bs=4M
[2:40] <sstan> changing bs from 1M to 4M doubled the speed
[2:42] <houkouonchi-work> should just say mbit and MiB as those are always pretty easily recognized as megabit and megabyte =)
[2:44] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:49] * jjgalvez (~jjgalvez@ has joined #ceph
[2:50] * jlogan (~Thunderbi@2600:c00:3010:1:2dd0:c27d:7304:3f9b) Quit (Ping timeout: 480 seconds)
[2:50] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:50] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has left #ceph
[2:50] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[2:51] * rturk is now known as rturk-away
[2:58] * miroslav1 (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[3:04] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[3:45] <ShaunR> wow, the onboard controller is sucking bad compared to the lsi controller in raid0 mode per disk
[3:45] <ShaunR> LSI card is showing almost double the performance
[3:48] <ShaunR> this is interesting, write showed horrible performance (1/2 what the lsi controller showed), but now a seq bench is showing that the onboard controller is performing almost 50% faster than the lsi card.
[3:50] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[3:50] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:55] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[4:06] <houkouonchi-work> ShaunR: well if you're doing raid0 it's probably using the memory on the raid controller?
[4:06] <houkouonchi-work> so that will offset the tests quite a bit especially for write performance
[4:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:08] * jjgalvez1 (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[4:13] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[4:42] * illuminatis (~illuminat@0001adba.user.oftc.net) Quit (Ping timeout: 480 seconds)
[4:50] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[4:58] * illuminatis (~illuminat@0001adba.user.oftc.net) has joined #ceph
[5:06] <phantomcircuit> ShaunR, i solved that problem by not allowing users to shrink disks
[5:06] <phantomcircuit> if anybody complains i tell them it's for technical reasons
[5:06] <phantomcircuit> which is at least 50% trueish
[5:12] * chutzpah (~chutz@ Quit (Quit: Leaving)
[5:14] * ananthan_RnD (~ananthan@ has joined #ceph
[5:21] <paravoid> sage: that "mon losing touch with OSDs" sound like my problem as well, doesn't it?
[5:21] <paravoid> the "wrongly marked me down" one
[5:25] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)
[5:25] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[5:27] * jjgalvez1 (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[5:28] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[5:29] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit ()
[6:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[7:04] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[7:05] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:23] * ninkotech (~duplo@ip-89-102-24-167.net.upcbroadband.cz) Quit (Read error: No route to host)
[7:43] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[7:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[7:54] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[7:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:03] * yehuda_hm (~yehuda@2602:306:330b:a40:acba:bea8:67bb:37f5) Quit (Ping timeout: 480 seconds)
[8:11] * yehuda_hm (~yehuda@2602:306:330b:a40:55ce:5c6c:369a:5843) has joined #ceph
[8:24] * Ul (~Thunderbi@135.219-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:31] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[8:51] * low (~low@ has joined #ceph
[8:51] * gerard_dethier (~Thunderbi@ has joined #ceph
[8:59] * cocoy (~Adium@ has joined #ceph
[9:21] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[9:22] * loicd (~loic@magenta.dachary.org) has joined #ceph
[9:30] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:32] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[9:36] * BManojlovic (~steki@ has joined #ceph
[9:36] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[9:37] * l0nk (~alex@ has joined #ceph
[9:46] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[9:47] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:48] * eschnou (~eschnou@ has joined #ceph
[9:50] * leseb (~leseb@stoneit.xs4all.nl) has joined #ceph
[9:50] * leseb (~leseb@stoneit.xs4all.nl) Quit (Remote host closed the connection)
[9:51] * leseb (~leseb@mx00.stone-it.com) has joined #ceph
[9:58] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Pull the pin and count to what?)
[10:01] * The_Bishop (~bishop@2001:470:50b6:0:8deb:bafd:156e:3462) Quit (Ping timeout: 480 seconds)
[10:02] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) has joined #ceph
[10:02] * ScOut3R (~ScOut3R@ has joined #ceph
[10:04] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:09] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[10:11] * The_Bishop (~bishop@2001:470:50b6:0:e827:50da:d179:b5f5) has joined #ceph
[10:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[10:14] * loicd (~loic@magenta.dachary.org) Quit ()
[10:17] * LeaChim (~LeaChim@b0fac1c4.bb.sky.com) has joined #ceph
[10:22] * Ul (~Thunderbi@135.219-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[10:30] * dosaboy1 (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[10:30] * dosaboy1 (~user1@host86-164-227-220.range86-164.btcentralplus.com) Quit ()
[10:33] * dosaboy (~user1@host86-164-229-186.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[10:34] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[10:38] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[10:40] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:49] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:03] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[11:26] * hvn_ (~hvn@ has joined #ceph
[11:27] <hvn_> hi all, can I use an existing client.admin keyring and make mkcephfs not create it? I'm trying to deploy ceph with Saltstack, and I want to copy the keyring to all clients automatically
[11:31] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[11:37] * gaveen (~gaveen@ has joined #ceph
[11:39] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[11:40] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[11:43] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[12:04] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:08] * lurbs (user@uber.geek.nz) Quit (Ping timeout: 480 seconds)
[12:10] * lurbs (user@uber.geek.nz) has joined #ceph
[12:11] <l0nk> hi all
[12:11] <l0nk> except for the journal, is it useful to have a raid card with a lot of cache ?
[12:12] <l0nk> i'm planning to use a 100GB ssd for the journal (10 osds of 3TB each)
[12:13] <l0nk> and i don't really know if i need more than 512MB of cache since each osd will be configured as a single raid 0
[12:13] <absynth> yes
[12:13] <absynth> you want ssd cache for your spinners
[12:14] <l0nk> yeap
[12:14] <l0nk> i understood that easily :)
[12:15] <hvn_> hi all, i ran `ceph auth add client.admin --infile=keyring`, and now i cannot use any ceph command. I want to add an existing key to user client.admin. What have I done wrong? thanks
[12:15] <l0nk> but all my osd will be plugged to a raid card, and i'm asking if it's useful to have a raid card with a lot of cache
[12:16] <l0nk> absynth: do you see what i mean?
[12:17] * hvn_ (~hvn@ Quit (Quit: leaving)
[12:18] <absynth> you didn't understand me.
[12:18] <absynth> so i rephrase: Yes.
[12:18] <absynth> you need a RAID controller with a lot of cache.
[12:19] <l0nk> even if i'm not planning to use the cache for the journal but use a dedicated ssd for the journal?
[12:22] <absynth> yes
[12:22] <l0nk> ok thanks for your time
[12:44] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:50] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[13:13] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:43] <nhm> l0nk: controller cache helps most when you don't have SSDs for the journals.
[13:43] <nhm> l0nk: you still get some benefit with SSD journals but it's not very dramatic.
[13:44] <l0nk> yeah that's what i was thinking since all disk will be configured as a single raid0
[13:44] * ananthan_RnD (~ananthan@ has left #ceph
[13:45] <l0nk> but since i'm planning to plug in 10 osds of 3TB maybe it can help, but i don't really know how much
[13:45] <l0nk> it "only" cost 200$ more
[13:45] <nhm> l0nk: performance with all disks in a single raid0 isn't great.
[13:45] <l0nk> i mean each disk as a single raid0
[13:46] <nhm> l0nk: ah, ok.
[13:46] <l0nk> or raid1 i don't know but each disk will be one raid
[13:47] <l0nk> osd disks will be 3TB sata so i'm still unable to decide how much cache i'll buy :(
[13:47] <nhm> l0nk: yeah, I've seen pretty good performance with that kind of config on LSI controllers.
[13:48] <l0nk> yeah i saw that on ceph website/blog
[13:48] <l0nk> in my case it will be h700 from dell, with 512MB or 2GB of ram
[13:49] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[13:49] * gregorg (~Greg@ has joined #ceph
[13:49] <nhm> l0nk: I haven't been able to get the same kind of performance out of our Dell nodes. I'm not sure why. It might be the firmware on the H700 or the expanders being used in the R515 backplane.
[13:50] <nhm> l0nk: It might be worth trying the H710, but I haven't tested it extensively yet.
[13:50] <l0nk> really? nice to know thanks
[13:52] <l0nk> well i don't really have any choice since it must be from dell (r510 or r515) and i don't think h710 are available right now where i live :)
[13:52] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[13:52] <nhm> l0nk: We have R515s. Could be that the R510 does better.
[13:53] <nhm> l0nk: I haven't tested recently either, it might do better with newer Ceph.
[13:53] * gregorg (~Greg@ Quit ()
[13:54] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[13:54] <l0nk> i'll tell you when i got the servers :D
[13:55] <nhm> l0nk: :)
[13:55] <l0nk> i think i'll go with 2 dell r510 with h700 (2GB cache i hope) with 2x300GB sas raid1 for sys, 1 100GB SSD for journal and 10x 3TB for osd
[13:56] <nhm> l0nk: keep in mind that all data writes hit the journal too, so in that config you will be throughput bound by the SSD.
[13:56] <nhm> and Dell SSDs aren't very fast.
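nhm's point that every data write also hits the journal can be sketched as a simple ceiling calculation. This is a rough model only: the function name and all bandwidth figures below are hypothetical placeholders, not measurements from this discussion, and real throughput also depends on the controller, network, and replication.

```python
def node_write_ceiling(journal_mb_s: float, osd_count: int, disk_mb_s: float) -> float:
    """Upper bound on sustained write throughput for one node.

    Each write goes to the journal first and then to a data disk, so the
    node cannot sustain more than the slower of the two stages.
    """
    return min(journal_mb_s, osd_count * disk_mb_s)


# Hypothetical figures: one 200 MB/s journal SSD shared by 10 disks
# that could each stream about 100 MB/s on their own.
shared_ssd = node_write_ceiling(journal_mb_s=200.0, osd_count=10, disk_mb_s=100.0)
print(f"write ceiling with one shared journal SSD: {shared_ssd:.0f} MB/s")
```

Under these assumed numbers the single SSD, not the ten spinners, sets the limit, which is the "throughput bound by the SSD" effect described above.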
[13:57] <absynth> honestly, we have measured differently
[13:57] <janos> i wonder if you're better off putting that money into my controller cache and dropping the ssd
[13:57] <nhm> absynth: oh?
[13:57] <absynth> sorry, backlog
[13:57] <janos> my/more
[13:57] <l0nk> ok, nice to know too (i'm not really a hardware guy)
[13:57] <absynth> in our experience, there is a huge difference between just having SSDs for journal and journal SSD + CacheCade SSD
[13:58] <absynth> all our nodes without cachecade were the I/O bottlenecks for the whole cluster
[13:58] <nhm> absynth: oh, cachecade might help. I don't think any of my controllers have it enabled.
[13:58] <absynth> but as we have found out several times, our workload is something that your lab does not have as a test case
[13:58] <absynth> it's well worth the extra 100 or so bucks for the license
[13:58] <nhm> absynth: what kind of workload are you testing?
[13:59] <absynth> vms + OSDs on one machine
[13:59] <absynth> so there`s a _lot_ of ctx switches and i/o
[13:59] <l0nk> maybe (i don't know if this is possible) i'll go for 2 SSDs, make a software raid 1 for the sys and put the journals for 5 osds on one ssd and the 5 other osds on the other SSD
[13:59] <l0nk> i didn't know about cachecade, i'm gonna take a look at it
[14:00] <nhm> absynth: we've got a couple of other customers doing the same thing, and yes, that's a case I hate thinking about. ;)
[14:00] <absynth> nhm: whenever i bring it up, everyone at inktank goes LALALALALALALALA I CANNOT HEAR YOU!!!! ;)
[14:02] <nhm> l0nk: Unless they've improved their SSDs, I'd just ignore them. They aren't worth the cost (unless the cachecade stuff is worth exploring as absynth said).
[14:02] <l0nk> so you recommend to buy my ssd somewhere else?
[14:02] <nhm> l0nk: The Intel DC S3700s on the other hand look fantastic.
[14:02] <l0nk> i have a ocz vertx something in my laptop which is great but that's all my experience in ssd :p
[14:03] <todin> nhm: the DC S3700 is fantastic, I have 4 of them for testing
[14:03] <l0nk> ok i'll get that :)
[14:04] <nhm> l0nk: There is some debate if you are better off spending the money for an enterprise grade drive (say a 200GB DC S3700), or going with a much larger and faster consumer grade drive and undersubscribing (ie not allocating the whole drive for partitions) so each cell gets fewer writes.
[14:04] <absynth> nhm: is that a performance or a durability consideration?
[14:05] <l0nk> and what do you think about making a small raid 1 on ssd with mdadm for the sys and use the two other devices (free space of the ssd) as journal device? (and split my osd's journal between the two ssd)?
[14:05] <nhm> absynth: both, since you don't get full write throughput on the S3700 until you hit the 400GB capacity.
[14:05] <l0nk> i don't know if i made my point clear enough
[14:06] <absynth> l0nk: what happens if your SSD dies?
[14:06] <nhm> l0nk: we haven't really explored it, but you'll lose half your sequential write throughput if you RAID1 them.
[14:06] <absynth> ah wait, 0 != 1
[14:06] <l0nk> just software raid1 one small partition for the sys
[14:07] <nhm> l0nk: oh, yeah, that's cool
[14:07] * Footur (~smuxi@31-18-48-121-dynip.superkabel.de) has joined #ceph
[14:07] <nhm> l0nk: that's a config I'd like to explore honestly (12 drives + 2 SSDs with raid1 system partition).
[14:07] <Footur> hello there
[14:07] <l0nk> in this case 5 osd will use 1 part of ssd so no io problems :)
[14:07] <l0nk> and i don't have to buy 2 disk for the sys only
[14:08] <l0nk> i have to think about it but it bothers me to use only one ssd for 10 osds (3TB each)
[14:08] <nhm> l0nk: I was actually thinking 12 OSD disks + 2 internal 2.5" SSDs.
[14:08] <l0nk> yes of course :)
[14:09] <Footur> someone here tested ceph v0.48 on centos 5?
[14:09] <l0nk> well in my case 2x10x3TB is enough space but at the cost of 3TB sata drives...
[14:10] <nhm> l0nk: yes, the downside of that config is that if you lose an SSD that's a lot of data that will try to re-replicate, and for small configs with few nodes that's really not ideal.
[14:11] <nhm> If you've got 100 nodes it's not as big of a deal, but with 2-3 nodes there will be a ton of replication overhead and you might fill up all your other OSDs with re-replicated data.
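The risk nhm describes, surviving OSDs filling up with re-replicated data, can be estimated with a back-of-the-envelope sketch. It assumes equal-sized OSDs, evenly spread data, and that all data from the failed OSDs can be placed on the survivors; the function name and the 60% starting utilization are illustrative assumptions.

```python
def utilization_after_failure(total_osds: int, failed_osds: int, utilization: float) -> float:
    """Estimate average OSD utilization once lost data is re-replicated.

    The same amount of data has to fit on fewer OSDs, so utilization
    scales by total / surviving.
    """
    if failed_osds >= total_osds:
        raise ValueError("no surviving OSDs to hold the data")
    return utilization * total_osds / (total_osds - failed_osds)


# Small cluster: 2 nodes x 10 OSDs, one journal SSD backing 5 OSDs fails.
small = utilization_after_failure(20, 5, 0.60)
# Large cluster: 100 nodes x 10 OSDs, same failure.
large = utilization_after_failure(1000, 5, 0.60)
print(f"small cluster: {small:.0%}, large cluster: {large:.1%}")
```

With these assumed inputs the small cluster jumps from 60% to about 80% full, while the large one barely moves, which matches the "not as big of a deal with 100 nodes" remark.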
[14:11] <l0nk> nhm: in case i got only one ssd for the journal of 10 osds, that's worse since if i lose the ssd i lose all data of the server
[14:12] <nhm> ep
[14:12] <nhm> yep
[14:12] <janos> what about a controller card with more cache and put journals on the spinners - no SSD?
[14:12] <nhm> That's one big advantage of putting the journals on the same drive as the data. a drive failure just takes 1 OSD down.
[14:13] <nhm> janos: yep, and you'll actually get slightly more read throughput in that config because you have more spinners.
[14:13] <nhm> At the expense of write throughput.
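The failure trade-off between the journal layouts discussed here can be sketched by mapping each journal device to the OSDs that depend on it. The device names and the `osds_down_on_failure` helper are hypothetical, chosen only to mirror the three options in the conversation (one shared SSD, two SSDs with five OSDs each, journal on each data disk).

```python
from typing import Dict, List


def osds_down_on_failure(journal_layout: Dict[str, List[int]], failed_device: str) -> List[int]:
    """Return the OSD ids whose journal lives on the failed device."""
    return list(journal_layout.get(failed_device, []))


# Three layouts for a 10-OSD node (device names are made up).
single_ssd = {"ssd0": list(range(10))}
two_ssds = {"ssd0": [0, 1, 2, 3, 4], "ssd1": [5, 6, 7, 8, 9]}
colocated = {f"disk{i}": [i] for i in range(10)}  # journal partition on each data disk

for name, layout, device in [
    ("one shared SSD", single_ssd, "ssd0"),
    ("two SSDs", two_ssds, "ssd0"),
    ("journal on data disk", colocated, "disk0"),
]:
    lost = osds_down_on_failure(layout, device)
    print(f"{name}: losing {device} takes down {len(lost)} OSD(s)")
```

The colocated layout keeps a single device failure to one OSD, at the cost of the write throughput nhm mentions.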
[14:15] <l0nk> arg guys i only had some doubts and now i'm starting to rethink everything from scratch!
[14:15] <janos> everything is a trade-off!
[14:15] <l0nk> yeah i know :(
[14:15] <l0nk> but what will be the best design for a 2/3 nodes cluster?
[14:16] <absynth> more nodes!
[14:16] <janos> haha
[14:17] <l0nk> yeah that was my answer too, but the reply was "it cost too much, keep working"
[14:17] <nhm> l0nk: what kind of IO do you expect? Mostly reads or writes? IOPs bound or Throughput bound?
[14:17] <absynth> performance, reliability, price. choose two.
[14:18] <darkfader> #nhm: could we bribe you to run a benchmark set with a "fully unlocked" lsi controller with fastpath / cachecade? there's some 30-day eval iirc.
[14:18] <darkfader> errr
[14:18] <darkfader> sorry, i meant to put it away for later
[14:18] <nhm> darkfader: :D
[14:18] <darkfader> uncommenting does not work here
[14:18] <darkfader> sigh
[14:18] <l0nk> nhm: well the most balanced :D
[14:18] <absynth> i'd contribute to the bribe
[14:18] <absynth> what's your drug, nhm?
[14:19] <nhm> good belgian quads. :)
[14:19] <l0nk> nhm: but more throughput than IOPs
[14:19] <absynth> well that can be arranged - we have some in the delicatessen here
[14:20] <nhm> I was thinking at some point I want to redo the controller article. I'll see if I can do the fastcache thing when I redo that article.
[14:20] <nhm> Probably around cuttlefish
[14:20] <nhm> The SAS2208 I have is built into the supermicro motherboard, not sure how that will affect things.
[14:21] <absynth> we have the same controllers built in, we use them for the OS disks, i think
[14:21] <absynth> and we have an external one with cachecade running the OSDs
[14:23] <nhm> absynth: what nodes are you guys using?
[14:24] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:25] <l0nk> nhm: i think 1000iops with 100mb/s in read and 50mb/s in write will be good, rbd will be used mostly for storing data (re-export in samba/nfs) and some VMs but like 3 or 4 only
[14:26] <nhm> l0nk: ok, that's pretty iops heavy per server and really light on the throughput.
[14:28] <nhm> l0nk: having said that, I think you could pull it off with spinning disks and controller cache fine. Getting an SSD with cachecade may be helpful too, but I haven't tested it.
[14:28] <absynth> supermicro stuff
[14:28] <absynth> probably same barebones as yours
[14:28] <l0nk> nhm: is there some doc somewhere about using the raid cache as a journal?
[14:29] <nhm> absynth: I don't have any SC826 chassis to test, but the SC847a lets me mimic one.
[14:29] <nhm> absynth: I really need to buy some expanders to test though since I have the straight pass-through chassis.
[14:30] <nhm> l0nk: Don't think you can do that on any controllers I know of (sadly!). You just create another partition at the beginning of each spinning disk for the journal.
[14:31] <l0nk> nhm: oh oky i should have thought about that
[14:33] <nhm> l0nk: I think in this case given the chassis being used, your throughput requirements, and the downside of an SSD failing with so few nodes, I'd just stick with spinning disks and skip the ssd journals.
[14:34] <nhm> (especially since you'll be using a controller with WB cache).
[14:34] <nhm> Just make sure to buy the battery. ;)
[14:34] <l0nk> ok so get a raid card with a lot of cache, make small partitions like 1g at the beginning for the journal and that's all :)
[14:35] <l0nk> that looks better
[14:37] <nhm> l0nk: yeah. Maybe look into cachecade too, but I can't give any recommendation personally on it.
[14:37] <l0nk> don't know it either
[14:39] <nhm> l0nk: sounds like absynth has had good luck though
[14:40] <l0nk> it will be useful if the raid cache will be constantly fully used, and i have doubt about that
[14:43] <nhm> l0nk: you can always wait and get it later if you decide you want it.
[14:44] <l0nk> but if i put ssd in the design i'll prefer to put ssd for flashcache on the server which will re-export the rbd as samba/nfs
[14:45] <nhm> yeah. We haven't done any testing with re-exporting samba/nfs, but it's something other people want to do too.
[14:45] <l0nk> well using flashcache for this task will relieve the ceph cluster a little bit for other tasks :)
[14:52] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:53] <l0nk> and about cpu and number of cores, is 1 core = 1 osd still the norm?
[14:54] * mikedawson (~chatzilla@23-25-2-142-static.hfc.comcastbusiness.net) has joined #ceph
[15:00] <nhm> l0nk: 1ghz of 1 core
[15:00] <l0nk> ok
[15:00] <nhm> l0nk: so if you have a 6 core 2ghz chip, that should be more or less fine for a 12 OSD node.
[15:01] * jtangwk1 (~Adium@2001:770:10:500:3db0:eeaf:3665:bfd2) has joined #ceph
[15:01] <l0nk> ok :)
[15:01] <nhm> normal operation won't typically use that much CPU, but during recovery CPU usage can go up.
[15:03] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[15:03] <nhm> l0nk: also, AMD cores tend to be slower per clock, so you may want to go a bit faster if you choose an AMD CPU.
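nhm's rule of thumb (about 1 GHz of one core per OSD) can be written as a quick check. The function name and the `ghz_per_osd` parameter are assumptions for illustration; raising `ghz_per_osd` above 1.0 is one hypothetical way to leave headroom for cores that are slower per clock.

```python
def enough_cpu(cores: int, ghz_per_core: float, osds: int, ghz_per_osd: float = 1.0) -> bool:
    """Check a node against the ~1 GHz-of-one-core-per-OSD rule of thumb."""
    return cores * ghz_per_core >= osds * ghz_per_osd


# The example from the log: a 6-core 2 GHz chip for a 12-OSD node.
print(enough_cpu(cores=6, ghz_per_core=2.0, osds=12))
# Same chip with an assumed 20% margin per OSD no longer passes.
print(enough_cpu(cores=6, ghz_per_core=2.0, osds=12, ghz_per_osd=1.2))
```

This only covers the sizing rule; as noted above, CPU use is typically lower in normal operation and rises during recovery.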
[15:04] <l0nk> nhm: it seems you know so much things
[15:05] <l0nk> thanks for sharing :)
[15:05] <l0nk> but i'll use intel cpu :)
[15:05] <nhm> l0nk: I used to do performance work for a Supercomputing Institute before I came to work for Inktank, so I've done this kind of stuff for many years.
[15:06] <nhm> l0nk: But guys like Sam and Greg know way more about the Ceph code than I do. :)
[15:06] <l0nk> yep i met inktank guys in amsterdam at ceph days :)
[15:07] <l0nk> a lot of knowledge in a small room :D
[15:08] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) Quit (Remote host closed the connection)
[15:08] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[15:08] * jtangwk (~Adium@2001:770:10:500:d97f:2952:4d27:da7f) Quit (Ping timeout: 480 seconds)
[15:08] <nhm> l0nk: oh nice! I need to go to one of these fancy European conferences some day. ;)
[15:09] <nhm> l0nk: I bet you guys drink beer and enjoy good food all day. ;)
[15:09] <absynth> beer
[15:09] <absynth> that reminds me
[15:09] <absynth> can i drink yet?
[15:09] <fghaas> it's 5pm somewhere, absynth
[15:09] <l0nk> don't know your timezone but my rule is never before 6pm
[15:10] <nhm> absynth: I was just thinking a spiked coffee sounds good right now.
[15:10] <absynth> yeaaah, make it irish!
[15:10] <nhm> absynth: and it's 8am! ;)
[15:10] <l0nk> well we should organize the next event in Brussels if you guys like beer :D
[15:11] <absynth> especially if nhm likes belgian beer
[15:11] <absynth> i am not so fond of these extravagant brews with 15% of alcohol, but i can still drink piss... err, sorry, Heineken
[15:11] <absynth> the keys are right next to each other...
[15:12] <l0nk> if you like beer, you must love belgian beers
[15:12] <nhm> My favorite that I can get easily here is Koningshoeven Quad (it's called La Trappe here)
[15:13] <l0nk> la trappe comes from netherlands no?
[15:13] <nhm> l0nk: yep!
[15:13] <nhm> technically not Belgian.
[15:13] <l0nk> i prefer duvel or delirium or even leffe :)
[15:14] <fghaas> duvel normally leads to delirium if consumed in excess quantities.
[15:14] <nhm> I'm not huge on duvel, but Delirium and Leffe are good. I also like Piraat.
[15:14] * shemp (~shemp@officegateway.ethostream.com) has joined #ceph
[15:15] <l0nk> yeah duvel got a particular taste, you like it or not
[15:16] <nhm> I once had La Chouffe or Mc Chouffe on tap and really liked it.
[15:17] <l0nk> but more seriously there are chances that my company will setup a conference in paris
[15:19] <nhm> l0nk: Neat. Joao lives in Lisbon so he'd probably be able to swing up there pretty easily.
[15:19] <joao> I read something about beer and then someone called my name
[15:20] <nhm> joao: yes, we've been contracted for beer tasting work.
[15:20] <joao> finally!
[15:23] * cocoy (~Adium@ Quit (Quit: Leaving.)
[15:24] <l0nk> in case we do something in paris, i will remember to have good beers for everyone :)
[15:24] <l0nk> i think that's why the FOSDEM is so popular
[15:26] <joao> and that's also why industry conferences are so much better than their academic counterparts too
[15:26] <nhm> l0nk: Yeah, sounds like FOSDEM was really good. ;)
[15:27] <nhm> that seems really backwards. I think Academia needs to get back to its roots.
[15:27] <joao> I know, right?
[15:27] <joao> I used to think they were all so cool, but they seem really uptight when compared to, say, WHD
[15:28] <joao> tbh, pretty much everything I've seen seems uptight when compared to whd
[15:28] <nhm> lol, nice
[15:29] <nhm> Most of the conferences I've been to have been Academic.
[15:29] <nhm> Though Supercomputing can involve some drinking.
[15:30] <absynth> TeraGrid *duck*
[15:31] <nhm> There wasn't nearly enough drinking at the TeraGrid conference. ;)
[15:32] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[15:35] <absynth> ever been to IEEE Grid?
[15:35] <absynth> or ccGrid and whatnot?
[15:36] <nhm> absynth: nope, just teragrid and caGrid once each.
[15:39] <absynth> same people everywhere, at least in Europe
[15:39] <absynth> few US guys, since they rarely seem to get travel funding for overseas flights
[15:49] <nhm> absynth: yeah, it's very expensive
[15:50] <nhm> absynth: I can attend a local conference for like $1.5-$2k USD total, but to fly to europe and attend a conference it would probably be twice that at least.
[15:50] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[15:54] <absynth> attend as in "pay for the conference ticket"?
[15:54] <absynth> aarrrgh, the goddamn level3 support hotline kills me
[15:54] <absynth> *curses*
[15:54] * coredumb (~coredumb@xxx.coredumb.net) Quit (Quit: WeeChat
[15:55] <nhm> absynth: conference ticket, airfare, hotel, etc.
[15:55] <nhm> absynth: our hotline?
[15:55] <absynth> nah, level3's
[15:55] <nhm> oh, I see
[15:55] <absynth> yours is pretty much a ringdown on several cells, innit?
[15:56] <nhm> Yeah, it goes to various people and escalates if it doesn't get answered.
[15:56] <nhm> At least I think.
[15:56] <absynth> i'm this |-| close to escalating, too ;)
[15:57] <nhm> absynth: but they are the reliable partner of choice according to their website! ;)
[16:02] * PerlStalker (~PerlStalk@ has joined #ceph
[16:03] <junglebells> absynth: Poke em on Twitter :) with a picture of your wait time :)
[16:03] * mikedawson (~chatzilla@23-25-2-142-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[16:05] <l0nk> aha junglebells good idea, next time i wait on a hotline, i'm definitely doing that!
[16:08] <absynth> ooooh. snow!
[16:09] <nhm> where?! Oh wait, outside. All over.
[16:09] <nhm> ;)
[16:11] <l0nk> snow in last doest not last very long no?
[16:11] <l0nk> in LA sorry
[16:11] <nhm> l0nk: I'm in Minneapolis. :)
[16:12] <l0nk> ok i thought you were in LA since inktank got their office over there
[16:12] <nhm> l0nk: yeah, we are a very widely distributed company.
[16:12] <junglebells> nhm: What's the temp there?
[16:12] * l0nk looking on google maps where exactly is minneapolis
[16:13] <nhm> l0nk: Joao is in Portugal, there are two of us in Minnesota, A couple of people out east, etc.
[16:13] <elder> Cold.
[16:13] <l0nk> oky :)
[16:13] <elder> < 10 F today.
[16:13] <absynth> you have nobody in Germany
[16:13] <nhm> junglebells: 2F, or -17C
[16:13] <absynth> Y U NO GERMANY!!!
[16:14] <l0nk> nhm: wow that's cold
[16:14] <absynth> oh. did you read that lustre is being sold?
[16:14] <joao> absynth, cuz you guys speak german over there :p
[16:14] <nhm> absynth: saw that Xyratex was picking up the trademarks
[16:14] <absynth> yeah
[16:14] <absynth> never heard of 'em
[16:14] <elder> Wir können nach Deutschland zu fliegen.
[16:14] <elder> Xyratex was a hardware company I thought.
[16:15] <absynth> lose the "zu" and it's perfect
[16:15] <elder> Shade.
[16:15] <nhm> absynth: They've been around for a while. They make the Chassis that Cray and SGI rebrand for their lustre solutions.
[16:15] <elder> Aschade
[16:17] * raso (~raso@deb-multimedia.org) Quit (Ping timeout: 480 seconds)
[16:17] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:18] <nhm> Ich bin auf weib brot! das maschinen ist nicht für gefingerpoken
[16:18] <madkiss> oO
[16:18] * nhm says random things that sound vaguely german
[16:18] <elder> Was ist los mit das machine?
[16:18] <elder> Die machine?
[16:18] <elder> No, DIE machine!
[16:21] <madkiss> people tend to say that kraftwerk became one of the most successful german music groups because actually they weren't singing that much ;-)
[16:21] <elder> I'm the operator of my pocket calculator.
[16:21] <elder> By pressing on a special key it plays a little melody.
[16:25] * vata (~vata@2607:fad8:4:6:5d63:7f49:6476:4376) has joined #ceph
[16:26] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[16:26] * raso (~raso@deb-multimedia.org) has joined #ceph
[16:28] <nhm> WOOOT
[16:29] <fghaas> and madkiss, trust me, once you hear nena's "99 luftballons" in an airport bar in toronto, you definitely prefer kraftwerk
[16:30] <madkiss> in German or in English?
[16:31] <l0nk> fghaas: and now i have 99 luftballons (well at least the melody) stuck in my head for the end of the day
[16:31] * steki (~steki@246-167-222-85.adsl.verat.net) has joined #ceph
[16:31] <fghaas> l0nk: you're welcome :)
[16:31] <fghaas> madkiss: in German
[16:31] <l0nk> i'll listen to it, maybe it helps
[16:31] <madkiss> in Toronto?
[16:32] <madkiss> that's awesome, that's like germans listening to France Galle
[16:32] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[16:32] <madkiss> Gall, ofcoz
[16:33] * BManojlovic (~steki@ Quit (Read error: Connection reset by peer)
[16:34] <l0nk> and now i have poupée de cire poupée de son in my head, thanks to madkiss :(
[16:34] <madkiss> l0nk: i'll cure you
[16:35] * madkiss tunes in Edith Piaf with 'Non, je ne regrette rien'
[16:35] <l0nk> how?
[16:35] <l0nk> aha you got me!
[16:35] * jskinner (~jskinner@ has joined #ceph
[16:35] <madkiss> l0nk: if you have music in your ear, let it be good music :-)
[16:35] <l0nk> and actually i like it, so i'm listening to it :)
[16:37] <fghaas> this channel is remarkably on topic today
[16:37] * BManojlovic (~steki@ has joined #ceph
[16:40] <l0nk> beer and music :)
[16:41] * steki (~steki@246-167-222-85.adsl.verat.net) Quit (Ping timeout: 480 seconds)
[16:43] <scuttlemonkey> heh, I had to check to make sure I was in the right tab here
[16:44] <fghaas> scuttlemonkey: feel free to share your beer thoughts :)
[16:44] <scuttlemonkey> s/beer/whiskey/g
[16:44] <scuttlemonkey> =D
[16:45] <scuttlemonkey> and as for German bands I still prefer Rammstein
[16:45] <elder> Und Nina Hagen?
[16:46] <scuttlemonkey> cliche perhaps...but my plex server has their discography
[16:47] <scuttlemonkey> only if she has those crazy Finnish cellists backing her
[16:48] <joao> <3 apocalyptica
[16:48] <joao> one of the best live concerts I've ever been to
[16:48] <scuttlemonkey> they do seem like they would be fun to see live
[16:51] <l0nk> i saw them something like 4 years ago, was really fun :) made me regret stopping cello practice, but i didn't have enough free time
[16:53] <l0nk> their arrangement on metallica was pretty great
[16:55] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[17:03] * gerard_dethier (~Thunderbi@ Quit (Quit: gerard_dethier)
[17:07] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[17:09] <absynth> wschulze: around?
[17:17] <wschulze> absynth: yes
[17:23] * low (~low@ Quit (Quit: Leaving)
[17:24] * markl (~mark@tpsit.com) has joined #ceph
[17:28] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[17:30] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[17:32] <dilemma> so I have a situation where I have an unfound object. Querying the PG lists 5 OSDs that might_have_unfound. Of those 5, three are "already probed" and the other two are down. Those two OSDs are done for - bad disks. How would I go about convincing the PG to permanently lose that object? When I try "ceph pg xxx mark_unfound_lost revert" it tells me that it hasn't queried all possible copies yet.
[17:32] <junglebells> dilemma: Mark your OSD's as lost first?
[17:32] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:33] <dilemma> that was my first thought, but I just wanted to make sure that it will have the desired effect
[17:33] <dilemma> it's the only PG in a bad state
[17:33] <dilemma> would marking the OSDs as lost have any other side effects in that case?
[17:34] <junglebells> afaik, you'll want to mark them as lost so ceph can reorg your data how it would prefer it on merely 3 osds
[17:34] <junglebells> And that would implicitly solve your issue. Don't take my word for it, but I had similar issues in the past.
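
The order discussed above (declare the dead OSDs lost, then give up on the unfound object) can be sketched as a dry run that only builds the command lines and runs nothing. The PG id "2.5" and OSD ids 3 and 7 are made-up placeholders. The `ceph osd lost` form with its confirmation flag is not quoted in the log and is an assumption about the CLI here; the `mark_unfound_lost revert` form is the one dilemma quotes.

```python
def unfound_recovery_commands(pgid, dead_osd_ids):
    """Build (not run) the ceph commands for giving up on an unfound object."""
    commands = []
    for osd_id in dead_osd_ids:
        # Declare each permanently failed OSD lost so the PG stops
        # waiting to probe it for a copy of the object.
        commands.append(
            ["ceph", "osd", "lost", str(osd_id), "--yes-i-really-mean-it"]
        )
    # With no unprobed candidates left, the PG can revert the object.
    commands.append(["ceph", "pg", pgid, "mark_unfound_lost", "revert"])
    return commands


# Hypothetical ids, for illustration only.
for cmd in unfound_recovery_commands("2.5", [3, 7]):
    print(" ".join(cmd))
```

Printing the commands first, rather than executing them, leaves room to try the sequence in a lab before touching production, which is what dilemma reports doing later in the log.
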
[17:38] <sstan> I'm trying to improve the journal, but I get this error : http://pastebin.com/sGpVk64D
[17:38] <sstan> trying to change the journal path and remaking the journal
[17:46] <sstan> keep getting this ..: error creating journal on /dev/shm/journal: (22) Invalid argument
[17:46] * mattch (~mattch@pcw3047.see.ed.ac.uk) Quit (Ping timeout: 480 seconds)
[17:48] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) Quit (Remote host closed the connection)
[17:48] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) has joined #ceph
[17:56] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:56] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[18:00] * leseb (~leseb@mx00.stone-it.com) Quit (Remote host closed the connection)
[18:07] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[18:09] * `10` (~10@juke.fm) has joined #ceph
[18:11] <dilemma> junglebells: I took some time testing your suggestion in the lab, but I now have HEALTH_OK after losing that one object. Thanks for the help.
[18:15] * `10 (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[18:17] <scuttlemonkey> sstan: did you flush your journal before you pointed your ceph.conf to the new journal & initialized it?
[18:18] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[18:18] <sstan> no
[18:18] <scuttlemonkey> that's probably the safest way to go
[18:18] <scuttlemonkey> incantation should be 'ceph-osd -i <OSD #> --flush-journal'
[18:19] <scuttlemonkey> then 'ceph-osd -u <OSD #> --mkjournal' should run ok
[18:19] <scuttlemonkey> er, -i no -u
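
A rough sketch of the journal move scuttlemonkey describes, with the corrected `-i` flag: flush the old journal, repoint ceph.conf, then create the new journal. It only prints the steps. OSD id 0 is a placeholder, and the sketch assumes the OSD daemon is stopped while the two commands run.

```python
def journal_move_steps(osd_id):
    """Return (note, argv) pairs; argv is None for a manual step."""
    osd = str(osd_id)
    return [
        ("flush pending journal entries to the data store",
         ["ceph-osd", "-i", osd, "--flush-journal"]),
        # Manual step: the new journal path goes into ceph.conf here.
        ("point the journal setting in ceph.conf at the new location", None),
        ("initialize the journal at the new location",
         ["ceph-osd", "-i", osd, "--mkjournal"]),
    ]


# Hypothetical OSD id, assumed stopped (not stated in the log).
for note, argv in journal_move_steps(0):
    print(note + ": " + (" ".join(argv) if argv else "(manual edit)"))
```
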
[18:19] <sstan> ok I'll try that, thanks
[18:24] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:25] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:26] * alram (~alram@ has joined #ceph
[18:31] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[18:32] <sstan> found the solution : http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7244
[18:34] <scuttlemonkey> ahh, right on
[18:35] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[18:38] * chutzpah (~chutz@ has joined #ceph
[18:41] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[18:43] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) has joined #ceph
[18:46] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:46] * mikedawson (~chatzilla@host-69-95-126-226.cwon.choiceone.net) has joined #ceph
[18:50] <sstan> it seems like having the journal on RAM instead of the same drive as the OSD improves the speed from 5 Mbytes/s to 45 in my virtual machine!
[18:51] <sstan> is it too much to be true?
[18:51] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:51] <gregaf1> sounds like you've got a crappy hard drive there, but it's definitely possible
[18:56] <junglebells> sstan: I saw remarkable improvements per OSD going to ramdisk journal. 120MB/s to 230-250MB/s
[18:56] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[18:58] <junglebells> sstan: Just make sure you have ceph-osd -i {osdnum} --osd-data {datapath} --osd-journal /dev/shm/journal --mkjournal in some sort of boot or init script. I noticed that upon a reboot test I needed to set that up
[19:01] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[19:01] * mikedawson_ (~chatzilla@host-69-95-126-226.cwon.choiceone.net) has joined #ceph
[19:05] * mikedawson (~chatzilla@host-69-95-126-226.cwon.choiceone.net) Quit (Ping timeout: 480 seconds)
[19:06] * mikedawson_ is now known as mikedawson
[19:06] <junglebells> Anyone have an idea about fixing 2 scrub errors on (I believe) one PG? http://pastebin.com/PUphSpvJ
[19:08] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[19:10] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[19:12] * b1tbkt (~Go@68-184-193-142.dhcp.stls.mo.charter.com) Quit (Remote host closed the connection)
[19:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[19:14] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:20] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[19:20] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[19:23] * rturk-away is now known as rturk
[19:27] * jskinner (~jskinner@ has joined #ceph
[19:30] * Cube (~Cube@ has joined #ceph
[19:33] <iggy> so you don't like your data?
[19:38] <junglebells> iggy: The bad PG's (I'm pretty sure) are a part of an RBD I was copying and I had an very unexpected full cluster crash.
[19:39] <iggy> I was referring to the journals in ram
[19:39] <junglebells> On a personal note, I *hate* this data but sadly I still need to support it.
[19:39] <junglebells> oh
[19:39] <junglebells> Well if I have a decent crushmap for proper replication, what's the fault of it?
[19:40] <junglebells> If a node goes down, it can be rebuilt.
[19:42] * jskinner (~jskinner@ Quit (Ping timeout: 480 seconds)
[19:44] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[19:45] <sjustlaptop> junglebells: technically true, but at least with a normal journal the osd should survive a power cycle without having to be rebuilt
[19:46] <junglebells> sjustlaptop: Even in the event of a failure (power, kernel or otherwise) only the PG's that were in the process of being written should have to be rebuilt, correct?
[19:47] <sjustlaptop> no, if you loose the journal you loose the entire OSD
[19:47] <sjustlaptop> *lose
[19:48] <junglebells> Ahhh. Understood.
[19:51] <junglebells> At this point, I don't have access to SSD's and I was getting some pretty bad performance trying to keep my journal elsewhere.
[19:52] <elder> joao, are you around?
[19:52] <joao> here
[19:57] * jskinner (~jskinner@ has joined #ceph
[20:00] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[20:03] * jlogan (~Thunderbi@2600:c00:3010:1:d9b9:ba9f:bb49:6d7d) has joined #ceph
[20:04] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[20:07] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[20:09] * mikedawson (~chatzilla@host-69-95-126-226.cwon.choiceone.net) Quit (Ping timeout: 480 seconds)
[20:10] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[20:12] * hybrid5121 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[20:14] * dosaboy (~user1@host86-164-227-220.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[20:15] * ScOut3R (~scout3r@1F2EAE7E.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[20:16] * hybrid512 (~w.moghrab@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[20:29] * alram (~alram@ Quit (Quit: Lost terminal)
[20:32] <junglebells> Anyone have an idea about fixing 1 scrub error on (I believe) one PG? http://pastebin.com/PUphSpvJ
[20:34] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[20:37] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[20:44] * leseb (~leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[20:45] * eschnou (~eschnou@98.166-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:47] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[20:56] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[20:57] * fghaas1 (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[20:57] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Read error: Connection reset by peer)
[21:03] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[21:06] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:10] <scuttlemonkey> junglebells: sounds like you'll have to trace back through the ceph log until you find the point where the pg became inconsistent
[21:10] <scuttlemonkey> can just grep for the pgid and take a look
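
The "grep for the pgid" suggestion could look roughly like the filter below. It matches the PG id as a whole token so that "2.1f" does not also hit "12.1f". The sample lines are invented for illustration and are not real Ceph log output; in practice the input would be the lines of the cluster log file.

```python
import re


def lines_for_pg(log_lines, pgid):
    """Return the lines that mention pgid as a standalone token."""
    # Reject matches glued to a longer id on either side (e.g. 12.1f, 2.1fa).
    pattern = re.compile(r"(?<![\w.])" + re.escape(pgid) + r"(?!\w)")
    return [line for line in log_lines if pattern.search(line)]


# Made-up sample lines, only to exercise the filter.
sample = [
    "osd.4 pg 2.1f scrub started",
    "osd.9 pg 12.1f scrub ok",
    "osd.4 pg 2.1f marked inconsistent",
    "osd.2 pg 2.1fa scrub ok",
]
for line in lines_for_pg(sample, "2.1f"):
    print(line)
```
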
[21:13] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[21:26] <jmlowe> junglebells: running on btrfs?
[21:26] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[21:26] <junglebells> jmlowe: Negative, xfs
[21:27] <junglebells> I took a look, the two OSDs it exists on are reporting different sizes but identical obj counts
[21:27] <jmlowe> ok, there is a btrfs bug that truncates sparse files
[21:27] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[21:28] <junglebells> I launched a repair on the OSD that I'm nearly positive is the source of the problem. It's working its way through the PGs... About another 10-20m and it'll hit that one
[21:34] * phantomcircuit (~phantomci@covertinferno.org) Quit (Ping timeout: 480 seconds)
[21:44] <sstan> sjustlaptop: no, if you loose the journal you loose the entire OSD
[21:45] <sstan> you lose the entire OSD literally ?
[21:45] <sjustlaptop> sstan: with btrfs the osd is recoverable, but with xfs/ext4 the osd store may be in an arbitrary (read: not consistent) state
[21:47] <sstan> the journal is like a buffer, right?
[21:47] <sjustlaptop> the journal is a write ahead journal for transactions
[21:47] <sjustlaptop> it also acts as a buffer for bursts of random IO
[21:47] <sjustlaptop> but mostly it's a write ahead journal
[21:47] <sjustlaptop> we journal transactions to the journal prior to writing them out to the file system
[21:48] <sjustlaptop> this allows us to sync the filesystem infrequently
[21:48] <sstan> so if it disappears suddenly, it will affect only current transactions
[21:48] <sjustlaptop> yes, but that might (probably will) mean corrupted objects
[21:49] <sjustlaptop> since the transactions are probably partially applied
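
A toy model of the write-ahead journal idea sjustlaptop describes: a transaction is recorded in the journal before it is written to the store, so a crash partway through can be repaired by replay. This is an illustrative sketch only and not how ceph-osd is implemented; the class and method names are invented.

```python
class TinyStore:
    """Toy object store with a write-ahead journal (illustrative only)."""

    def __init__(self):
        self.journal = []   # transactions recorded but not yet synced
        self.objects = {}   # stands in for the backing file system

    def submit(self, txn):
        # A transaction is a list of (object_name, value) writes.
        # Journal the whole transaction before touching the store.
        self.journal.append(list(txn))

    def apply_partially(self, txn_index, n_writes):
        # Simulate a crash after only the first n_writes reached the store.
        for name, value in self.journal[txn_index][:n_writes]:
            self.objects[name] = value

    def replay(self):
        # Recovery: re-apply every journaled transaction in order, then trim.
        for txn in self.journal:
            for name, value in txn:
                self.objects[name] = value
        self.journal.clear()


store = TinyStore()
store.submit([("obj_a", "v2"), ("obj_b", "v2")])
store.apply_partially(0, 1)      # crash: obj_a written, obj_b not
torn = dict(store.objects)       # what remains if the journal is lost now
store.replay()                   # what happens when the journal survives
print(torn)
print(store.objects)
```

The `torn` snapshot is the partially applied state sjustlaptop warns about: without the journal there is no record of the missing write to obj_b.
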
[21:50] <sstan> so before deleting the journal safely, what has to be done?
[21:50] <sjustlaptop> you would need to sync the journal, there is an option to ceph-osd to flush the journal
[21:51] <sstan> ah it's --flush-journal
[21:55] <sstan> sjustlaptop: a node breaking totally is likely to happen. It shouldn't matter to CEPH how it stores the journal
[21:56] <sjustlaptop> sstan: ceph doesn't care how the journal is stored, merely that it isn't lost...
[21:56] <sstan> so it doesn't protect fully against total node failure
[21:57] <sjustlaptop> if your crush map allows you to recover from losing a node of osds, it's no problem, ceph just treats the osds as dead
[21:57] <sjustlaptop> is that what you mean?
[21:58] <sstan> yes, with a replica size > 1
[21:59] <sjustlaptop> sstan: well, if you lose a node when you only have one replica, there isn't much to be done
[22:00] <sstan> I'm interested in the scenario where the journal disappears
[22:01] <sjustlaptop> at present, losing the journal on xfs/ext4 kills the osd
[22:01] <sstan> but like you said, it might create inconsistencies which will produce corrupted object especially if OSD uses ext3, 4
[22:01] <sjustlaptop> right
[22:01] <sstan> ah and the data cached by the journal is also lost
[22:02] <sjustlaptop> yes
[22:03] <sstan> on a different topic .. is it possible to plug two computers with one SATA cable? Is it only a matter of not having a driver, or it's impossible?
[22:05] <janos> you could do it physically ;)
[22:05] <janos> not sure what you mean to accomplish though
[22:05] <janos> like in lieu of something like ethernet?
[22:07] <sstan> exactly
[22:07] <sstan> it's more reliable also, because no NIC would be needed
[22:08] <sstan> so each computer could see each other's drives
[22:09] <janos> not possible that i'm aware of
[22:09] * josef (~seven@li70-116.members.linode.com) has joined #ceph
[22:09] <josef> sagewk: so then the next question is how do i know whats stable?
[22:09] <josef> just the even releases?
[22:09] <sagewk> josef: hmm its not obvious from the git tree
[22:10] <josef> ooh wait did you mean it was better to go to .57?
[22:10] <sagewk> it's the latest tag in the 'bobtail' branch currently. after that will come cuttlefish, dumpling, ...
[22:10] * josef is confused
[22:10] <sagewk> 0.56.3
[22:10] <paravoid> hey sagewk
[22:10] <josef> ok
[22:10] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[22:10] <sagewk> paravoid: hi!
[22:10] <josef> i'll update it now and i'll just wait for you to email me in 6 months ;)
[22:11] <paravoid> that "mon losing touch with OSDs" list thread sounds like my problem too, doesn't it?
[22:11] <sagewk> we're moving to a 3 month cadence... so, summer!
[22:11] <sagewk> could be. any logs you can generate would be helpful!
[22:11] * josef kills his mockbuild and pulls down 56.3
[22:13] <sagewk> paravoid: btw i closed the 'wrong node' bug out.. if you still see that, logs ftw!
[22:13] <iggy> sstan: re: journal in ram... what if you have a sudden power outage and lose both (all) replicas of an object?
[22:13] <paravoid> saw that!
[22:14] <paravoid> haven't seen it lately
[22:14] <sagewk> sweet. merged the (knock wood) fix in 0.56.3
[22:15] <sstan> iggy: UPS. It would give enough time to flush the journal
[22:15] <sstan> if the journal isn't the bottleneck, what is next ?
[22:15] <iggy> what about when the UPS service company comes in to do maintenance and accidentally kills all power?
[22:16] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:16] <sstan> hmm didn't think about that. I'd say disable journaling if it's possible, or move it to a safer place for a while
[22:16] <janos> battery-backed controller?
[22:17] <iggy> that's happened to me, so it's not just some vague "what if"
[22:17] <janos> wow that sucks
[22:17] <junglebells> SUE FOR DAMAGES! (UPS service company....)
[22:17] <iggy> you can't disable journaling at run time I don't think
[22:17] <sstan> janos :how can a battery-backed controller help RAM journal ?
[22:17] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[22:17] <janos> oh dur - RAM
[22:18] <janos> sorry
[22:18] <janos> i left that part out
[22:18] <sstan> iggy : one by one,
[22:18] <janos> this is where google's approach to per-machine batteries works out pretty well
[22:19] <janos> not realistic for us since we aren't fabricating our own nodes...
[22:20] <lightspeed> we had a case of someone getting curious about the big red button on the wall (EPO switch) and pressing it
[22:20] <janos> hahah
[22:20] <janos> damn
[22:23] <iggy> you should have seen that tech's face when everything went quiet
[22:23] <ShaunR> supermicro is making cases with batteries that look like power supplies now
[22:23] <ShaunR> they only give the server like 3 min of run time
[22:24] <janos> that's way better than zero!
[22:25] <sstan> what's the next bottleneck after journaling? Hard drives do 100MB easily ... NIC is 1Gb. Is it the processor ?
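
For the units in sstan's question, a quick conversion puts the 1Gb NIC and the 100MB/s drive on the same scale. This is line-rate arithmetic only: it ignores protocol overhead and replication traffic, and it does not answer where the next bottleneck actually is.

```python
def gbit_to_mbyte_per_s(gbit_per_s):
    # 1 Gbit/s = 1000 Mbit/s, and there are 8 bits in a byte.
    return gbit_per_s * 1000 / 8


nic_mb_s = gbit_to_mbyte_per_s(1)   # 125.0 MB/s at line rate
disk_mb_s = 100                     # figure quoted by sstan
print(nic_mb_s, disk_mb_s, min(nic_mb_s, disk_mb_s))
```
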
[22:25] <ShaunR> ehh, i personally think it's pretty pointless.
[22:26] <janos> ShaunR: its importance is likely based on what nightmares you've had ;)
[22:26] <ShaunR> but then again we cant rely on our customers bringing in battery backed equipment... we have to provide that
[22:26] <janos> i'd like it for when i have to move power cables around!
[22:26] <janos> granted, that's rare
[22:28] <lurbs> My current largest power concern is a machine, with one PSU, with an uptime of 2681 days in a rack that's scheduled to be decommissioned.
[22:29] <janos> hahaha i took one of those down last year
[22:29] <janos> 5 year uptime
[22:29] * sage (~sage@ Quit (Ping timeout: 480 seconds)
[22:29] <janos> scared me
[22:29] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[22:29] <janos> so i planned to totally decommission
[22:29] <lurbs> It's no longer doing anything, and it's disconnected from the network, we just want to keep it going as long as possible. :(
[22:29] <janos> hahah
[22:31] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[22:37] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) Quit (Quit: themgt)
[22:38] * jlogan (~Thunderbi@2600:c00:3010:1:d9b9:ba9f:bb49:6d7d) Quit (Remote host closed the connection)
[22:38] * jlogan (~Thunderbi@2600:c00:3010:1:d9b9:ba9f:bb49:6d7d) has joined #ceph
[22:39] <iggy> go splice a UPS into the power cord
[22:39] <lurbs> Nuh uh. 240 volts hurts.
[22:40] <janos> if you're careful you don't get zapped
[22:40] <janos> though i've only been zapped by 120
[22:40] <lurbs> Careful, *and* know what you're doing.
[22:40] <janos> yes
[22:42] <Matt> 240V does hurt
[22:42] * Matt has been zapped, on several occasions
[22:42] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[22:42] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) Quit ()
[22:45] <lightspeed> I'm always more concerned around DC powered kit, safety wise
[22:45] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) has joined #ceph
[22:46] * fghaas1 (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[22:48] * phantomcircuit (~phantomci@covertinferno.org) has joined #ceph
[22:56] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[22:58] <wer> which thing can I twiddle to make radosgw less loggy? rgw disable ops log my best bang for the buck?
[22:58] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[22:59] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[22:59] * phantomcircuit (~phantomci@covertinferno.org) Quit (Quit: quit)
[23:09] * eschnou (~eschnou@98.166-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:09] * phantomcircuit (~phantomci@covertinferno.org) has joined #ceph
[23:12] * fghaas (~florian@91-119-74-57.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[23:15] * miroslav1 (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[23:21] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[23:29] * sage (~sage@ has joined #ceph
[23:33] * shemp (~shemp@0001b1d9.user.oftc.net) Quit (Quit: leaving)
[23:33] * miroslav1 (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[23:36] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[23:37] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:39] * jskinner (~jskinner@ has joined #ceph
[23:40] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:41] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[23:45] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.