#ceph IRC Log

IRC Log for 2012-11-21

Timestamps are in GMT/BST.

[0:05] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[0:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[0:07] * timmclaughlin (~timmclaug@69.170.148.179) Quit (Ping timeout: 480 seconds)
[0:19] * deepsa (~deepsa@122.172.21.33) Quit (Ping timeout: 480 seconds)
[0:23] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:29] * benpol (~benp@garage.reed.edu) Quit (Quit: Leaving.)
[0:31] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has joined #ceph
[0:32] <plut0> hows ceph doing?
[0:34] * jjgalvez (~jjgalvez@12.248.40.138) Quit (Quit: Leaving.)
[0:37] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[0:37] <gregaf> well, it's excited about Thanksgiving and looking forward to its regeneration cycle into a Bobtail squid from an argonaut octups
[0:37] <gregaf> *octopus
[0:38] <rweeks> I was looking at the names of cephalopods
[0:38] <slang> gregaf: argonoctopus
[0:38] <rweeks> and when we get past C the names mostly go into the Linnaean biological names
[0:40] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:40] <slang> rweeks: just wait till Q
[0:40] <rweeks> hehe
[0:41] <plut0> anyone using ceph in production?
[0:41] <rweeks> yes.
[0:41] <plut0> rweeks: whats your work load?
[0:41] <rweeks> not me personally
[0:41] <plut0> oh
[0:41] * tnt (~tnt@162.63-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:41] <rweeks> but DreamObjects from DreamHost is using it
[0:41] <rweeks> S3-like object storage
[0:42] <rweeks> 3PB or so from what I know
[0:42] <plut0> was looking for some feedback on user experience
[0:42] <rweeks> what's your target workload?
[0:43] <plut0> cloud computing, virtualization, offsite storage
[0:45] <rweeks> any particular kind of cloud/virtualization?
[0:46] <plut0> like what software?
[0:46] <rweeks> yeah
[0:46] <plut0> undecided, leaning towards KVM
[0:46] <rweeks> there are a couple of companies deploying openstack with kvm and ceph
[0:47] <rweeks> Piston Cloud and MetaClout
[0:47] <rweeks> er, MetaCloud
[0:47] <plut0> how do you know that
[0:47] * brambles (~xymox@shellspk.ftp.sh) Quit (Quit: leaving)
[0:47] <rweeks> I work for Inktank
[0:47] <plut0> ahh
[0:47] <rweeks> but those are public partnerships
[0:47] <plut0> is there a list of clients published?
[0:47] <rweeks> I don't think so.
[0:48] <plut0> are they doing well?
[0:48] <lurbs> I'm looking at doing a similar thing, with Ceph/KVM/OpenStack. Waiting on bobtail, though.
[0:48] * brambles (~xymox@shellspk.ftp.sh) has joined #ceph
[0:48] <rweeks> but I could put you in touch with our partner people to get details if we have them
[0:48] <rweeks> as far as I know
[0:48] * vjarjadian_ (~IceChat7@5ad6d001.bb.sky.com) Quit (Read error: Connection reset by peer)
[0:49] <plut0> that would be great
[0:50] * vjarjadian_ (~IceChat7@5ad6d001.bb.sky.com) has joined #ceph
[0:52] <plut0> glad to hear i'm on the same page as others are thinking
[0:52] <rweeks> yep
[0:52] <rweeks> there's a lot of interest around kvm and ceph since we can do things like boot VMs from our block device
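(For reference, a minimal sketch of what booting from the block device looks like via qemu's rbd driver; the pool and image names here are made up, and this assumes a qemu built with librbd support.)

    # create a 10 GB image in the default 'rbd' pool (names are examples only)
    rbd create vm-disk --size 10240
    # boot a guest with that image attached as its disk via librbd
    qemu-system-x86_64 -m 1024 -drive file=rbd:rbd/vm-disk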
[0:53] <vjarjadian_> i just wish it was usable over WAN... would be the icing on the cake
[0:53] <rweeks> we'll get there, vjarjadian_
[0:54] <plut0> i can see people having a need for asynchronous replication
[0:54] <vjarjadian_> if a block was replicated say... 5 times... could ceph be configured to write the first one then trigger the rebalance for the others to be replicated?
[0:54] <gregaf> if you're interested in that, we'd love to hear about what kind of async replication you're looking for
[0:54] <gregaf> plut0: ^
[0:55] <plut0> gregaf: replication across a WAN to another site
[0:55] <gregaf> vjarjadian_: something like that is unlikely to happen; we've talked about maybe being able to add tail OSDs that don't need to commit to disk first
[0:56] <gregaf> plut0: right, but are you talking about a live stream because you have a 40gbit connection that's just too high-latency for synchronous replication? are you looking for an every-thirty-minute send that deduplicates and compresses intermediate writes?
[0:57] <vjarjadian_> i suppose you could use it over WAN now if you were using it purely as a backup storage... so it didnt really matter how delayed the operations were... unless that would trigger the OS to kill the disk or something
[0:57] <plut0> gregaf: more of a every-so-often backup
[0:57] <gregaf> plut0: do you want it to work both ways or do you have a primary and a disaster recovery copy?
[0:57] <plut0> gregaf: DR
[0:57] <vjarjadian_> wouldnt rsync be better for that type of async backup?
[0:58] <gregaf> cool
[0:58] <plut0> vjarjadian_: i suppose there are lots of ways to slice it
[0:58] <gregaf> we've heard all those scenarios but some of them are much easier than others so knowing what's actually desired is helpful in setting priorities :)
[0:59] <plut0> gregaf: i'm guessing you work for Inktank also?
[0:59] <rweeks> gregaf is one of our many awesome developers
[1:00] <plut0> i'll have to hit you guys up for commercial support at some point
[1:01] <vjarjadian_> your videos were certainly very informative
[1:01] <plut0> where are the videos?
[1:02] <vjarjadian_> youtube
[1:02] <gregaf> sjust: ^ one data point for periodic replication of RBD images
[1:02] <rweeks> plut0: sent you a pm
[1:04] <sagewk> slang: there?
[1:05] <slang> sagewk: yep
[1:05] <sagewk> on 3431... are those patches pushed?
[1:05] <sagewk> not sure... the patches in wip-3431 don't look like the right ones?
[1:05] <slang> hmm I thought so
[1:06] <slang> one sec..
[1:06] <sagewk> slang: you mentioned that we *do* need to take a ref for bh writes, but the patch has them removed?
[1:07] <slang> sagewk: ff3837 is the one that I meant to push
[1:07] <sagewk> (and that is already in master i think?)
[1:07] <slang> sagewk: oh, no, we don't need a ref for writes
[1:08] <slang> sagewk: the assert failure is without the ref for writes
[1:08] <slang> sagewk: the patch in wip-3431 fixes the assert without inc/dec the ref for writes
[1:11] <slang> sagewk: I forced-updated the branch
[1:12] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[1:12] <slang> sagewk: sorry I must have just forgotten to push
[1:12] <sagewk> ah
[1:25] * vata (~vata@208.88.110.46) Quit (Quit: Leaving.)
[1:28] * weber (~he@219.85.117.233) Quit (Remote host closed the connection)
[1:32] * xiaoxi (~xiaoxiche@jfdmzpr03-ext.jf.intel.com) has joined #ceph
[1:33] * slang (~slang@ace.ops.newdream.net) Quit (Quit: slang)
[1:34] * vjarjadian_ (~IceChat7@5ad6d001.bb.sky.com) Quit (Ping timeout: 480 seconds)
[1:38] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:38] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:49] * maxiz_ (~pfliu@111.192.248.200) Quit (Quit: Ex-Chat)
[1:51] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[1:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:54] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[1:58] <xiaoxi> To re-raise the question: how does Ceph choose the sync time? I know there is a range [min_sync_interval, max_sync_interval] that can be configured, but how does ceph decide the actual sync time within this range?
[2:00] <sjust> xiaoxi: basically, we sync when the journal hits half full
[2:01] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has left #ceph
[2:01] <sjust> ok, we sync no more often than every <min_sync_interval> seconds and no less often than every <max_sync_interval> seconds
[2:02] <sjust> but we try to start a sync when the journal hits half full
[2:02] <sjust> min_sync_interval is by default so small that it doesn't matter
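(For reference, these are the ceph.conf knobs being discussed, shown with what I believe were the shipped defaults at the time; treat the values as illustrative rather than a recommendation.)

    # ceph.conf, [osd] section
    # never sync more often than min, try to sync at least every max (seconds)
    filestore min sync interval = 0.01
    filestore max sync interval = 5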
[2:05] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[2:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[2:16] * AaronSchulz (~chatzilla@216.38.130.166) has joined #ceph
[2:18] <AaronSchulz> do radosgw auth tokens last for 24 hours like with swauth?
[2:19] <AaronSchulz> yehudasa: also, I can get metadata to POST, but Content- headers seem to disappear (in spite of a 202)
[2:26] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:28] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[2:34] * calebamiles (~caleb@c-24-128-194-192.hsd1.vt.comcast.net) has joined #ceph
[2:42] <xiaoxi> sjust: Thanks a lot~ I tried setting [min, max sync interval] to [10s, 120s], and it improves things a lot.
[2:42] <sjust> wow, really?
[2:42] <sjust> increasing min_sync_interval shouldn't matter, that's interesting
[2:42] <sjust> did you try it without changing min_sync_interval?
[2:42] <xiaoxi> no yet.
[2:43] <sjust> I would be that it results in no change, or a positive change
[2:43] <sjust> *I would bet
[2:43] * jlogan2 (~Thunderbi@2600:c00:3010:1:1ccf:467e:284:aea8) Quit (Ping timeout: 480 seconds)
[2:43] <sjust> with a large journal, increasing max_sync_interval could yield an improvement
[2:44] <xiaoxi> But if a silly user sets a very big minimal interval, how will ceph act? I bet there is a hard cap there? Say if the journal is 75% or 100% full, a sync will happen
[2:46] <sjust> no, it will wait for the min_sync_interval
[2:46] <sjust> that's why we set it by default to be very small
[2:46] <sjust> increasing it should only ever hurt
[2:47] <xiaoxi> that means a wrong configuration (a very big minimal interval) will lead to data loss in some scenarios?
[2:47] <xiaoxi> because the journal is rewritten from the beginning
[2:47] <plut0> what are people using for backend file systems?
[2:48] <xiaoxi> plut0:some test results suggest that BTRFS is the best choice for performance
[2:48] <plut0> has anyone tried with zfs?
[2:49] <lurbs> Are there still issues with btrfs performance degrading with time?
[2:49] <lurbs> Or less so with newer kernel versions?
[2:50] <plut0> no zfs huh
[2:56] <plut0> i gave up on btrfs when i got into zfs
[2:56] <xiaoxi> plut0:why?
[2:56] * maxiz (~pfliu@202.108.130.138) has joined #ceph
[2:57] * lurbs has given up on anything that's not in the mainline kernel. Just not worth the time, effort and pain.
[2:57] <plut0> xiaoxi: i'm not convinced btrfs can live up to zfs
[2:58] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:58] <plut0> lurbs: zfs may not be in the kernel but it is easy to integrate now
[2:59] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[3:00] <plut0> xiaoxi: why do you like btrfs?
[3:03] <xiaoxi> plut0: well, I am not an expert on filesystems, but since BTRFS is recommended by the ceph docs and some benchmarks showed it really provides better performance, why not?
[3:05] <plut0> zfs has a lot of advantages over btrfs
[3:06] <xiaoxi> so would you like to try it ?
[3:06] <plut0> try what? btrfs?
[3:10] <xiaoxi> no,zfs for ceph.
[3:11] <dmick> there have been some attempts with zfs
[3:11] <plut0> i intend to try with zfs
[3:11] <dmick> a couple of known problems: 1) doesn't support O_DIRECT, so to use for journals you have to turn off journal directio
[3:11] <plut0> i have some good experience with zfs
[3:12] <dmick> 2) several have experienced hangs/bugs that have not yet been characterized, much less diagnosed
[3:12] <dmick> 3) there are opportunities for optimization that ceph doesn't currently try to take (similar to btrfs)
[3:12] <dmick> it would be great to get a careful evaluation/list of good bugs filed
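(For point 1 above, a hedged sketch of the relevant ceph.conf setting for an osd whose journal sits on a filesystem without O_DIRECT support, e.g. ZFS; check the option name against your version's docs.)

    # ceph.conf, [osd] section
    # don't open the journal with O_DIRECT
    journal dio = false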
[3:12] <plut0> dmick: 2 - what version of zfs?
[3:13] <dmick> that's part of the characterization that hasn't been done, really :)
[3:13] <dmick> http://tracker.newdream.net/projects/ceph has some mentions
[3:14] <plut0> issue 3440 is an older version of zfs
[3:15] <dmick> then it would be awesome to see if that issue disappears with later versions
[3:15] <plut0> i'm running rc12
[3:15] <plut0> i'm hoping to buy a lab environment in the next few weeks to setup zfs + ceph
[3:16] <tore_> I tried running ceph with ZFS and I gave up on that quickly
[3:17] <plut0> i'm determined to make it work
[3:17] <tore_> OSD just kept crashing for me on start
[3:18] <plut0> same issue?
[3:19] <tore_> it's a problem with extended attributes I believe
[3:19] <plut0> is there a bug open on zfsonlinux?
[3:20] <tore_> I don't think it's ZFS. it's a compatibility issue with ceph
[3:20] <tore_> I was using native zfs on ubuntu for testing though
[3:20] <plut0> yeah thats zfsonlinux
[3:20] <dmick> tore_: do you have any details? Was a ceph issue filed?
[3:21] <tore_> I don't. this was back in February. If there is interest in pursuing ZFS compatibility then I could always repurpose my cluster and give it another go
[3:22] <dmick> the question keeps coming up, and some of the issues are avoidable
[3:23] <plut0> i will give it a try
[3:23] <tore_> I run ZFS at work, but to be honest it's always been troublesome in production environments
[3:23] <tore_> even Nexenta kicks out disks too easily to make it practical for extremely large deployments
[3:24] <plut0> troublesome how?
[3:24] <tore_> it's extremely critical of SMART data
[3:24] <tore_> and will kick disks out if they show minor problems
[3:25] <tore_> other SANs do not require disk replacements as often for sure
[3:25] <plut0> what kind of disks, ssd's or hdd's?
[3:25] <tore_> I've also seen raidz3 collapse on two occasions
[3:26] <tore_> seagate constellation SAS drives, 2TB
[3:26] <tore_> we had virtuals running off xenserver backed by Nexenta
[3:26] * Ryan_Lane (~Adium@216.38.130.167) Quit (Quit: Leaving.)
[3:27] <tore_> it was a horrible setup. the architect was incompetent and decided to configure Nexenta with raidz3 and ordered 2TB disks because he only cared about $/GB
[3:28] <plut0> wonder if it was an older version of zfs
[3:28] <tore_> 16 hypervisors were running approximately 60 vsp on this setup. That SAN could only produce 150 IOPS
[3:29] <tore_> anyhow, even though the setup was ridiculously underscaled - it did show us what happens with ZFS under extreme load
[3:29] <tore_> ZFS checks the SMART metrics for the drives and inevitably kicks a disk out
[3:30] <tore_> this triggers resilvering, and for a 2TB disk this is about 22 - 27 hours
[3:30] <plut0> zfs doesn't check smart, does it? must be something scripted to do that?
[3:30] <tore_> yeah it does
[3:30] <tore_> SMART and iostat
[3:31] <tore_> ZFS performance degrades easily because of the way it handles striping
[3:32] <plut0> what os did you run zfs on?
[3:32] <tore_> when a write is striped across the drives, the next write can't start until the last drive finishes writing its piece
[3:32] <tore_> Nexenta
[3:33] <plut0> and what did nexenta use?
[3:33] <tore_> Nexenta is not on illumos yet, so it's essentially zfs on opensolaris
[3:33] <plut0> ahh ok
[3:33] <tore_> I'd need to check the version and build number for 3.1
[3:33] <plut0> what zpool and zfs version?
[3:34] <tore_> anyway, ZFS radiz2 & 3 collapses easily under load
[3:34] <tore_> the one in my example was raidz3
[3:35] <tore_> I think 3.1 was ZFSv28
[3:36] <plut0> zpool version 28 zfs version 5?
[3:36] <tore_> one sec I'll pull up release notes you may be right
[3:36] <plut0> i intend on using zfsonlinux and not opensolaris
[3:37] <tore_> http://gdamore.blogspot.jp/2011/07/nexentastor-31-available-now.html
[3:38] <tore_> I was quite happy with native ZFS on ubuntu. It was just a pain with CEPH
[3:38] <plut0> what pains?
[3:38] <tore_> OSD crashes and failure to start all the time
[3:39] <plut0> ok
[3:39] <tore_> supposedly ZFS on freebsd is the best performance-wise
[3:40] <tore_> the ubuntu one is not as fast for reads/writes on the same hardware
[3:40] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[3:40] <plut0> guessing there will be a major fork after version 28 anyways
[3:41] <plut0> closed source after that
[3:41] <tore_> These days for my virtuals, I'm putting a LSI 3ware 9750-8e in each hypervisor with no local disk
[3:42] <tore_> then I drop in 2 LSI SAS switches and 3 JBOD
[3:42] <tore_> I don't use disks over 500GB to keep rebuild times low
[3:43] <plut0> better iops/size too
[3:43] <tore_> This setup allows me to deploy upto 3 hypervisors
[3:43] <tore_> mistype
[3:43] <tore_> 13 hypervisors
[3:43] <tore_> yeah I keep it local and I can change the isolation zone easily to route a hypervisor away for chassis maintenance
[3:44] <tore_> frankly it's cheaper than buying Nexenta and I don't need to deal with 10g fiber links or expensive SFP modules
[3:44] <plut0> i don't think zfs checks smart, not seeing how it would do that
[3:45] <tore_> let me see if I can find some references
[3:47] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[3:47] * jks (~jks@3e6b7199.rev.stofanet.dk) Quit (Ping timeout: 480 seconds)
[3:48] <tore_> brb
[3:48] <plut0> ok
[4:03] * deepsa (~deepsa@122.172.20.227) has joined #ceph
[4:12] * chutzpah (~chutz@199.21.234.7) Quit (Quit: Leaving)
[4:26] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[4:41] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[4:44] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has left #ceph
[5:15] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[5:19] * buck (~buck@bender.soe.ucsc.edu) Quit (Remote host closed the connection)
[5:23] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[5:46] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[5:47] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit ()
[5:52] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[6:02] * weber (~he@27.105.10.136) has joined #ceph
[6:32] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[6:42] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[6:58] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:01] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[7:11] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:13] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit ()
[7:38] * deepsa (~deepsa@122.172.20.227) Quit (Remote host closed the connection)
[7:40] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:48] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[7:49] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[7:54] * weber (~he@27.105.10.136) Quit (Quit: weber)
[8:03] * deepsa (~deepsa@122.172.8.54) has joined #ceph
[8:07] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) Quit (Read error: Connection reset by peer)
[8:11] * jeffhung (~jeffhung@60-250-103-120.HINET-IP.hinet.net) has joined #ceph
[8:16] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[8:22] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[8:29] * tnt (~tnt@162.63-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:29] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:29] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[9:03] * dmick (~dmick@2607:f298:a:607:15f3:a75d:146d:65e) Quit (Quit: Leaving.)
[9:26] * tnt (~tnt@162.63-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:28] * nosebleedkt (~kostas@kotama.dataways.gr) has joined #ceph
[9:34] * loicd (~loic@90.84.146.250) has joined #ceph
[9:41] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:41] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:44] * ScOut3R (~ScOut3R@212.96.47.215) has joined #ceph
[9:47] * Leseb (~Leseb@193.172.124.196) has joined #ceph
[9:53] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[9:58] * loicd1 (~loic@90.84.146.199) has joined #ceph
[9:58] * loicd (~loic@90.84.146.250) Quit (Ping timeout: 480 seconds)
[10:02] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Quit: Leaving)
[10:03] * xiaoxi (~xiaoxiche@jfdmzpr03-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[10:05] * fc (~fc@home.ploup.net) has joined #ceph
[10:09] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[10:14] * MapspaM (~clint@xencbyrum2.srihosting.com) has joined #ceph
[10:16] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (Ping timeout: 480 seconds)
[10:21] * loicd1 (~loic@90.84.146.199) Quit (Ping timeout: 480 seconds)
[10:28] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:29] * German (~kvirc@217.112.214.106) has joined #ceph
[10:29] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[10:30] * loicd (~loic@90.84.144.34) has joined #ceph
[10:31] * German (~kvirc@217.112.214.106) Quit ()
[10:36] * The_Bishop (~bishop@2001:470:50b6:0:1d2f:9f6:a2bb:df4d) has joined #ceph
[10:46] * SIN (~SIN@78.107.155.77) Quit (Remote host closed the connection)
[10:46] * maxiz (~pfliu@202.108.130.138) Quit (Quit: Ex-Chat)
[11:19] <nosebleedkt> Hello everyone. I have a ceph cluster containing 2 OSDs providing 1GB each.
[11:19] <nosebleedkt> On my client I do mount -t ceph {ip-address-of-monitor}:6789:/ /mnt/mycephfs
[11:20] <nosebleedkt> and it mounts normally.
[11:20] <nosebleedkt> when I df to see the available capacity
[11:20] <nosebleedkt> I get 10GB !
[11:20] <nosebleedkt> how is it possible that the cephfs client sees 10GB when the OSDs are only 2GB ?
[11:28] <tnt> I wouldn't rely on df
[11:31] <nosebleedkt> so on what?
[11:33] <tnt> Reliably getting the available space on a distributed fs is often impossible (especially on ceph, where you can even set a different replication level per directory, so depending on where you write, the available space would be different).
[11:33] <tnt> 'rados df' can get you the usage for the cluster
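(A couple of commands worth knowing here; the first is the one tnt mentions.)

    rados df    # per-pool and total usage for the cluster
    ceph -s     # overall cluster status, including a usage summary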
[11:35] <nosebleedkt> hmm
[11:36] <nosebleedkt> thank you
[11:42] <nosebleedkt> tnt, another quick question is about provisioning. My 2 OSDs reside on /var/lib/osd/{ceph-0, ceph-1}. They have a file called journal which is 1GB, so I suppose that's the storage.
[11:42] <nosebleedkt> When I run mkcephfs those files reserve 1GB of disk space.
[11:42] <tnt> no it's not.
[11:43] <nosebleedkt> what?
[11:43] <tnt> the file called 'journal' is .... the journal. For the real storage they will use any available space on the /var/lib/osd/ceph-{0,1} filesystem.
[11:44] <nosebleedkt> In ceph.conf I set osd journal size = 1000
[11:45] <nosebleedkt> isn't that the storage capacity?
[11:49] * German (~kvirc@217.112.214.106) has joined #ceph
[11:49] <joao> nosebleedkt, that's the size of the journal, but it has little to do with the storage capacity; the journal is used to checkpoint operations much like any other journaling file system
[11:50] * joao sets mode -o joao
[11:50] <nosebleedkt> joao, so where do i define the storage capacity?
[11:51] <nosebleedkt> ( I'm missing some info about journaling. I'm not sure what the journal means )
[11:51] <joao> in a nutshell, it is basically whatever capacity the volume your osds are sitting on have
[11:51] <joao> say you have a system with two osds, each one sitting on a 1TB disk
[11:52] <nosebleedkt> yes..
[11:53] <joao> that grants you a total storage capacity of 2TB; depending on the replication factor, you will have either that (replication factor of 1) or an 'n'th part of that (a replication factor of 2 means you only have 1TB useful, etc)
[11:53] <nosebleedkt> ah
[11:53] <joao> but it is "defined" only by specifying the osd data dir to the osd
[11:53] <nosebleedkt> very clear explanation
[11:54] <nosebleedkt> I have the osd data dir inside my root filesystem, which is 5GB.
[11:54] <joao> I do that a lot for testing purposes; nothing wrong with it
[11:54] <nosebleedkt> where do i set the replication factor?
[11:55] <joao> that's a crush map thing
[11:55] * LeaChim (~LeaChim@b0fa82fd.bb.sky.com) has joined #ceph
[11:55] <joao> you define the replication factor by pool
[11:55] <joao> let me try to dig the command for you; I can't really recall the whole shebang
[11:57] <joao> it would be something like 'ceph osd pool set <poolname> size <replevel>'
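(A concrete version of that command, assuming a pool named 'data' and a target replication level of 2:)

    ceph osd pool set data size 2
    # check it took effect (the exact field names in the dump may vary by version)
    ceph osd dump | grep pool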
[11:57] <nosebleedkt> and in the pool i put osds ?
[11:59] <LeaChim> Hi, I think I've hit a bug in the monitor on 0.54. When I try to execute 'ceph osd crush set 3 osd.3 1 pool=default rack=c row=c' the monitor exits with code 139
[12:00] <joao> LeaChim, the monitor exits?
[12:00] <joao> that's weird
[12:00] <joao> can you drop the log somewhere for me to look at?
[12:01] <LeaChim> sure, one sec
[12:01] <joao> oh, 139 is a segfault
[12:02] <joao> nosebleedkt, http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[12:06] <nosebleedkt> joao, thanks
[12:07] <nosebleedkt> joao, I just added a new disk of 1GB in my system. I will create a 3rd OSD now.
[12:07] <nosebleedkt> how do i tell that osd to reside on my new disk which is /dev/sdb ?
[12:08] <nosebleedkt> joao, leave it. I will find it alone.
[12:09] <LeaChim> joao, log is here: http://pastebin.com/36uCVFba
[12:09] <joao> LeaChim, thanks
[12:10] <joao> nosebleedkt, fwiw, the osd will need an existing file system on the volume it uses
[12:11] <nosebleedkt> so i have to mkfs and mount that disk first
[12:11] <joao> yeah
[12:11] <nosebleedkt> ext4 i guess
[12:11] <joao> nosebleedkt, that doc I pointed you to does describe pretty much all that
[12:11] <nosebleedkt> yeah
[12:11] <nosebleedkt> i see it
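(Roughly the steps being described, sketched for a hypothetical osd.2 on /dev/sdb, following nosebleedkt's /var/lib/osd layout; the exact bootstrap commands should be checked against the add-or-rm-osds doc linked above.)

    mkfs.ext4 /dev/sdb                   # ext4 may need the user_xattr mount option
    mkdir -p /var/lib/osd/ceph-2
    mount /dev/sdb /var/lib/osd/ceph-2
    ceph osd create                      # allocates the next free osd id
    ceph-osd -i 2 --mkfs --mkkey         # initialize the new osd's data dir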
[12:12] <joao> LeaChim, is that the whole log?
[12:12] <nosebleedkt> :D
[12:12] <joao> no stack trace to back it up? :(
[12:13] * loicd (~loic@90.84.144.34) Quit (Quit: Leaving.)
[12:14] <LeaChim> that's all that was output to the console. Hmm, I wonder if it's to do with the arguments, it's not a particularly realistic scenario, I was just wanting to play around with moving osd's around.. If you have a test cluster I wonder if doing the same thing on that will die too..
[12:14] <joao> I can manage to bring one up now
[12:15] <joao> any chance you can provide me with the set of actions you did?
[12:15] <joao> I'll try to replicate it on my own
[12:17] * yoshi (~yoshi@p11251-ipngn4301marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:18] <nosebleedkt> joao, what is the CRUSH map for ?
[12:18] <joao> basically, defining data placement rules
[12:18] <LeaChim> It's not the cleanest of things. Brought it up a couple of days ago on 0.48.2, with a monitor, mds, and 2 osd's on the same box, played a bit with the cephfs stuff, osd.3 came from a 'ceph osd create' command, then I upgraded the software on here to 0.54, and then trying that crush set command it crashed
[12:19] <joao> LeaChim, is everything running the same versions?
[12:19] <joao> both the ceph command and the monitors?
[12:19] <joao> s/command/tool
[12:20] <LeaChim> yes, it's all 0.54, everything's running on this box.
[12:20] <joao> kay
[12:20] <joao> I'll look at the code hoping to find something
[12:21] <joao> but without the stack trace I guess this is going to be a wild goose chase :\
[12:22] <joao> LeaChim, what does 'ceph -s' report?
[12:22] <joao> have you brought that monitor up again?
[12:25] <LeaChim> I've just moved all the old data out of the way, and re-run mkcephfs, so I've got a clean cluster on 0.54. running 'ceph osd crush set 0 1 rack=a row=a' crashes it.
[12:25] <LeaChim> so it looks fairly reproducible
[12:26] <joao> nice; I like reproducible bugs
[12:26] <LeaChim> Hopefully that means you'll be able to get a stack trace easily on your cluster
[12:26] <joao> LeaChim, could you please pastebin the output of 'ceph osd tree'?
[12:27] <LeaChim> sure
[12:28] <LeaChim> http://pastebin.com/TANWSTVZ
[12:29] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[12:30] <joao> oh joy
[12:30] <joao> it crashed
[12:30] <joao> yeah, no rack or row 'a'
[12:30] <joao> I think I got it
[12:31] <nosebleedkt> joao, can i ask you something else?
[12:31] <joao> shoot away
[12:31] <nosebleedkt> :P
[12:31] <nosebleedkt> well, let's say that I want to add to my cluster some disks that I have in another disaster recovery site.
[12:32] <nosebleedkt> how do i connect those disks to my cluster?
[12:32] <joao> nosebleedkt, I don't understand the question
[12:32] <joao> is that a remote site?
[12:33] <nosebleedkt> yes
[12:33] <nosebleedkt> another place
[12:33] <nosebleedkt> maybe in another country
[12:33] <joao> fwiw, ceph does not support geo-replication
[12:33] <nosebleedkt> hmm
[12:33] <nosebleedkt> so the OSDs have to be in the 'same' building ?
[12:33] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[12:33] <joao> it can do it, of course, by playing with the crush map and having osds in the cluster and all that
[12:34] <joao> but currently ceph may not deal well with the latencies involved
[12:34] <nosebleedkt> let me explain my idea
[12:35] <joao> fwiw, geo-replication is on the roadmap though
[12:35] <nosebleedkt> a company has got some osds in some building. They want to add more osds in another building so for example if a fire destroys the 1st building then ceph uses the osds from the 2nd building.
[12:38] <joao> nosebleedkt, I may not be the best person to answer you regarding to that, but that's something that has been discussed several times and all I know is that ceph requires a low latency, and geo-replication usually involves somewhat higher latencies with which ceph may not cope well
[12:39] <nosebleedkt> hmm
[12:39] <joao> furthermore, there's the issue of not being able (afaik) to tell a client to 'use stuff from this building and not the other, unless this one fails'
[12:39] <joao> and although you probably could create a crush map to keep stuff equally replicated between the osds of both sites, there's all that I mentioned before
[12:40] <joao> but as I said, I know that geo-replication is on the roadmap
[12:40] <joao> don't know however how close on the roadmap it is
[12:44] * silversu_ (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[12:45] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[12:51] <nosebleedkt> joao, one more question about the pool ?
[12:51] <joao> sure
[12:52] <joao> LeaChim, thanks for reporting that bug; looks like it's a recursion issue
[12:52] <nosebleedkt> what do we need the pools for?
[12:53] <joao> to logically separate the data
[12:55] <LeaChim> joao, no problem, glad to help :)
[12:57] <nosebleedkt> joao, based on what criteria?
[13:07] * loicd (~loic@90.84.144.34) has joined #ceph
[13:11] <joao> nosebleedkt, http://ceph.com/docs/master/rados/operations/pools/
[13:14] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[13:14] <madkiss> yikes.
[13:14] <madkiss> When using ceph-deploy, what hosts will ceph-deploy install MDSes on?
[13:18] <nosebleedkt> joao, that doesn't help
[13:25] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) has joined #ceph
[13:36] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[13:39] <nosebleedkt> joao, who is mapping PGs to OSDs ? Or is it done automatically by the CRUSH map?
[13:39] <joao> sorry, about to go grab some lunch
[13:40] <nosebleedkt> cool, me too :D
[13:40] <joao> yeah, pg's are mapped to osds using the crushmap
[13:40] <joao> you have pg's scattered throughout the cluster
[13:40] <joao> depending on your replication level, you may have them replicated
[13:41] <joao> at any given time, you can infer in which pg a given object is by using the crushmap, as long as you have an updated osdmap
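(To see that mapping for a specific object, the following works; the pool and object names here are made up.)

    ceph osd map data myobject
    # prints the pool, the pg the object hashes to, and the set of osds serving it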
[13:53] * German (~kvirc@217.112.214.106) Quit (Quit: KVIrc 4.1.3 Equilibrium http://www.kvirc.net/)
[13:58] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[14:07] * calebamiles (~caleb@c-24-128-194-192.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[14:12] * plut0 (~cory@pool-96-236-43-69.albyny.fios.verizon.net) has joined #ceph
[14:12] * loicd (~loic@90.84.144.34) Quit (Quit: Leaving.)
[14:13] <plut0> hi
[14:14] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[14:36] * weber (~he@61-64-87-236-adsl-tai.dynamic.so-net.net.tw) has joined #ceph
[14:39] * calebamiles (~caleb@65-183-137-95-dhcp.burlingtontelecom.net) has joined #ceph
[14:40] * timmclaughlin (~timmclaug@69.170.148.179) has joined #ceph
[14:42] * loicd (~loic@90.84.144.37) has joined #ceph
[14:44] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[14:49] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[14:52] * mistur_ is now known as mistur
[14:59] <nosebleedkt> joao, ok about pools/pgs and stuff
[15:00] <nosebleedkt> however can you give me an example of logical separation of pools ?
[15:00] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:02] <tnt> nosebleedkt: I have a pool with important objects with a replication level of 3 and some temp stuff with a replication of 1 ...
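(A sketch of that kind of setup, with made-up pool names and pg counts:)

    ceph osd pool create important 128 128   # pg_num and pgp_num
    ceph osd pool set important size 3       # three copies of everything
    ceph osd pool create scratch 128 128
    ceph osd pool set scratch size 1         # single copy: cheap, but no redundancy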
[15:30] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[15:36] * shelleyp (~shelleyp@173-165-81-125-Illinois.hfc.comcastbusiness.net) has joined #ceph
[15:47] <xiaoxi> will a big pg_num improve performance?
[15:54] <madkiss> there's a recommended way to calculate pg_num, isn't there?
[15:56] <xiaoxi> Where can I find that? Well, I'm just using the default value now
[15:57] <nhm> xiaoxi: more PGs will give you a better distribution, but once you have around 100-200 PGs per OSD it probably won't be significant and could cause mon slow downs and other issues. I just actually had a really interesting conversation with Jim Schutt and he found that it's very important to stick with powers-of-2 for the number of PGs per pool otherwise you can get uneven data distribution.
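(A rough back-of-the-envelope version of that guidance, for a hypothetical cluster of 60 osds with 2x replication:)

    # target ~100 PGs per OSD: (60 osds * 100) / 2 replicas = 3000,
    # then round up to the next power of two -> 4096
    ceph osd pool create mypool 4096 4096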
[15:58] * nosebleedkt (~kostas@kotama.dataways.gr) Quit (Quit: Leaving)
[16:03] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[16:06] * rlr219 (43c87e04@ircip3.mibbit.com) has joined #ceph
[16:07] <rlr219> Mikeryan or sjust: Either of you available?
[16:12] <xiaoxi> nhm: you mean the default value (5376) is not a good one?
[16:15] <xiaoxi> when I run: root@api:/var/lib/ceph# rbd list --pool nova
[16:15] <xiaoxi> rbd: pool nova doesn't contain rbd images
[16:15] <xiaoxi> but when I try to create a volume in pool nova,an error will occur
[16:15] <xiaoxi> Command: rbd create --pool nova --size 5120 volume-27c8a9ea-1398-4bcd-928b-55f40e6eec81 --new-format
[16:16] <xiaoxi> 'rbd: create error: (17) File exists\n2012-11-21 23:12:56.550987 7f234d899780 -1 librbd: rbd image volume-27c8a9ea-1398-4bcd-928b-55f40e6eec81 already exists\n'
[16:16] <xiaoxi> what's the reason for this?
[16:20] <nhm> xiaoxi: yes, it sounds like you would be better off with 4096 or 8192
[16:20] <nhm> xiaoxi: I have not tested it extensively myself.
[16:21] <nhm> xiaoxi: not sure about the pool problem.
[16:23] <via> j
[16:23] <via> ...sorry
[16:23] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:24] <xiaoxi> it seems there is some inconsistency between librbd and ceph
[16:40] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:44] <xiaoxi> It seems that this is caused by Cinder (an openstack component). Since I updated my ceph cluster to v0.54 but cinder is still using the old interface (--new-format) to interact with librbd... which leads to an "Invalid argument" error, then Cinder retries and gets "already exists".
[16:45] <xiaoxi> it looks to me that there are two bugs
[16:45] <xiaoxi> 1.Cinder backend driver cannot work with v0.54
[16:46] <xiaoxi> 2.some inconsistency when creating a volume
[16:56] * deepsa_ (~deepsa@122.172.22.59) has joined #ceph
[16:58] * deepsa (~deepsa@122.172.8.54) Quit (Ping timeout: 480 seconds)
[16:58] * deepsa_ is now known as deepsa
[17:01] * joshd1 (~jdurgin@2602:306:c5db:310:9011:885f:57da:3c7e) has joined #ceph
[17:03] <xiaoxi> hi joshd
[17:03] * zynzel (zynzel@spof.pl) Quit (Ping timeout: 480 seconds)
[17:03] <joshd1> you found me :)
[17:03] <xiaoxi> yeah, I am facing difficulty in cinder & ceph..
[17:03] <nhm> xiaoxi: josh is the guy to talk to about the cinder/0.54 issues you saw.
[17:04] <xiaoxi> nhm:yes,I know he is the superman :)
[17:06] <xiaoxi> joshd1: My cinder works well with 0.48.2, but not with 0.54.. I cannot even create a volume
[17:07] * zynzel (zynzel@spof.pl) has joined #ceph
[17:08] <joshd1> xiaoxi: are you using cephx? there were some changes there in 0.53 and 0.54 that could have bugs
[17:09] <joshd1> xiaoxi: there was a secondary bug causing authentication failure to be reported as EEXIST for rbd creation too
[17:10] <xiaoxi> no,I disabled cephx
[17:11] <xiaoxi> I can bypass the create volume issue just by hacking the cinder code, commenting out the line with "--new-format"
[17:11] * raso (~raso@deb-multimedia.org) Quit (Ping timeout: 480 seconds)
[17:12] <joshd1> xiaoxi: are all your osds updated to 0.54 along with the client side?
[17:13] * raso (~raso@deb-multimedia.org) has joined #ceph
[17:13] <joshd1> --new-format requires new class methods on the osds
[17:13] <xiaoxi> all my osds updated to 0.54
[17:13] <xiaoxi> and also cinder machine,which act as one of the MON
[17:14] <joshd1> could you pastebin the output of 'rbd create -s 1 --new-format volumes/test --debug-ms 1 --debug-rbd 20'
[17:15] <xiaoxi> root@ceph01:~# rbd snap ls --pool rbd volume-3338d46b-8261-42af-935d-f0381fa5d367
[17:15] <xiaoxi> rbd: error opening image volume-3338d46b-8261-42af-935d-f0381fa5d367: (5) Input/output error
[17:15] <xiaoxi> 2012-11-22 00:10:21.900619 7f3d4d0c7780 -1 librbd: Error getting lock info: (5) Input/output error
[17:15] <xiaoxi> root@ceph01:~# rbd create -s 1 --new-format volumes/test --debug-ms 1 --debug-rbd 20
[17:15] <xiaoxi> 2012-11-22 00:14:49.521854 7f8f0ceda780 1 -- :/0 messenger.start
[17:15] <xiaoxi> 2012-11-22 00:14:49.522461 7f8f0ceda780 1 -- :/1003809 --> 192.168.10.13:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x1bba930 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.523344 7f8f0ced6700 1 -- 192.168.10.11:0/1003809 learned my addr 192.168.10.11:0/1003809
[17:15] <xiaoxi> 2012-11-22 00:14:49.814629 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 1 ==== mon_map v1 ==== 631+0+0 (3295763607 0 0) 0x7f8efc000c70 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.814743 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 2 ==== auth_reply(proto 1 0 Success) v1 ==== 24+0+0 (3429780932 0 0) 0x7f8efc001100 con 0x1bba5d0
[17:15] <madkiss> yikes
[17:15] <xiaoxi> 2012-11-22 00:14:49.814780 7f8f083e7700 1 -- 192.168.10.11:0/1003809 --> 192.168.10.13:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x1bbab30 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.815082 7f8f0ceda780 1 -- 192.168.10.11:0/1003809 --> 192.168.10.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1bb6e90 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.815099 7f8f0ceda780 1 -- 192.168.10.11:0/1003809 --> 192.168.10.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1bbaac0 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.815651 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 3 ==== mon_map v1 ==== 631+0+0 (3295763607 0 0) 0x7f8efc001100 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.815701 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 4 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1213192359 0 0) 0x7f8efc001310 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.816753 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 5 ==== osd_map(36..36 src has 1..36) v3 ==== 40068+0+0 (2464632830 0 0) 0x7f8efc00ad10 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.817163 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1213192359 0 0) 0x7f8efc00af10 con 0x1bba5d0
[17:15] <xiaoxi> rbd: error opening pool volumes: (2) No such file or directory
[17:15] <madkiss> USE A PASTEBIN
[17:15] <xiaoxi> 2012-11-22 00:14:49.817476 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 7 ==== osd_map(36..36 src has 1..36) v3 ==== 40068+0+0 (2464632830 0 0) 0x7f8efc001080 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.817908 7f8f083e7700 1 -- 192.168.10.11:0/1003809 <== mon.3 192.168.10.13:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1213192359 0 0) 0x7f8efc001350 con 0x1bba5d0
[17:15] <xiaoxi> 2012-11-22 00:14:49.817956 7f8f0ceda780 1 -- 192.168.10.11:0/1003809 mark_down_all
[17:15] <xiaoxi> 2012-11-22 00:14:49.818175 7f8f0ceda780 1 -- 192.168.10.11:0/1003809 shutdown complete.
[17:15] <xiaoxi> sorry...
[17:15] <madkiss> awesome.
[17:15] * madkiss does a /clear
[17:16] * joao copy & pastes everything again
[17:16] <xiaoxi> joshd:http://pastebin.com/WEGVViWH
[17:17] <joshd1> xiaoxi: oops, I meant nova/test for the image name - misremembered the pool name
[17:18] <joshd1> but that I/O error getting lock info etc. indicates the osd not supporting the request
[17:18] <joshd1> perhaps some of your osds weren't restarted after the upgrade to 0.54?
[17:20] <xiaoxi> a new paste:http://pastebin.com/2yveiCfv
[17:21] <xiaoxi> I restarted all the osds. What's more, I even re-ran mkcephfs -a
[17:23] <LeaChim> Heya, I've crashed it again. this time the command was: 'ceph pg map 10000000005.00000000' (Which I've just realised was incorrect, I was actually wanting ceph osd map data, but still, it shouldn't crash all of my monitors at the same time) Logs are at: http://xelix.net/hotcpc12533.log , http://xelix.net/hotcpc9039.log, http://xelix.net/ldhfmi02.log
[17:26] <xiaoxi> checked all my machines with ceph -v and restarted all the daemons, it's still the same
[17:26] <xiaoxi> ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
[17:27] <joshd1> xiaoxi: could you add 'debug osd = 20' to osd.53 and try the create again?
[17:28] <joshd1> xiaoxi: it's getting EINVAL from trying to call the 'set_id' class method, and I'd like to verify where that error's coming from
[17:31] <xiaoxi> same,so you want the log from osd.53?
[17:31] <joshd1> yeah
[17:33] <xiaoxi> a moment please
[17:36] <xiaoxi> joshd1:osd.53 's log is too long...
[17:36] <xiaoxi> keep growing
[17:36] <joshd1> could you grab the section around where 'rbd_id.test' is mentioned?
[17:39] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:40] <xiaoxi> is this enough? http://pastebin.com/04TcJraT
[17:41] * loicd (~loic@90.84.144.37) Quit (Ping timeout: 480 seconds)
[17:41] * jlogan1 (~Thunderbi@2600:c00:3010:1:1ccf:467e:284:aea8) has joined #ceph
[17:42] * deepsa (~deepsa@122.172.22.59) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[17:43] <joshd1> there should be a line containing 'rbd_id.test [call' - the next 50 lines or so after that are what I'm interested in
[17:43] <joao> LeaChim, thanks
[17:44] * ScOut3R (~ScOut3R@212.96.47.215) Quit (Remote host closed the connection)
[17:46] <xiaoxi> joshd1: sorry, cannot find it. Is it because I restarted the daemon?
[17:47] <xiaoxi> I restarted the whole cluster to enable debug osd = 20 on all the osds
[17:47] <joshd1> it should still be there... if you try the create again you should see it, but it may have gone to a different osd if the cluster state was different
[17:48] <plut0> can the osd's be weighted?
[17:49] <joshd1> on the client side the line after 'osd_op(client.4181.0:4 rbd_id.test [call rbd.set_id] 3.9a2f7478) v4 --' shows the osd replying, i.e. '2012-11-22 00:18:28.975785 7f890ee40700  1 -- 192.168.10.11:0/1003852 <== osd.53 ...'
[17:49] <joshd1> plut0: yes, that's part of the crush map
[17:50] <plut0> joshd1: weighted how? by free space? performance? statically set?
[17:50] * tnt (~tnt@162.63-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:51] <joshd1> plut0: you set it, and usually do so based on capacity. it's not auto-adjusted right now
[17:52] <xiaoxi> joshd1:sorry,I really cannot find any info with "[call"
[17:56] <joshd1> xiaoxi: ok, another tack: is your client 32-bit while your osds are 64-bit or something? perhaps there's an encoding error
[17:56] <xiaoxi> joshd: aha, got it~ I used a new volume name to retry and got the log http://pastebin.com/sAXW1hLe
[17:57] <joshd1> ok, this line shows the problem - 'call method rbd.set_id does not exist'
[17:58] <joshd1> that means your cls_rbd.so was not upgraded to 0.54 for some reason
[17:59] * NightDog (NightDog@38.179.202.84.customer.cdi.no) has joined #ceph
[18:00] * NightDog (NightDog@38.179.202.84.customer.cdi.no) has left #ceph
[18:00] <xiaoxi> well...but why? I update it with apt-get update
[18:01] <xiaoxi> & apt-get upgrade
[18:02] * brambles (~xymox@shellspk.ftp.sh) Quit (Quit: leaving)
[18:04] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[18:05] <xiaoxi> joshd:where is the cls_rbd.so?
[18:06] <joshd1> I think /var/lib/rados-classes? 'dpkg -S ceph' will show it
[18:07] <joshd1> err, probably /usr/lib/rados-classes
[18:07] <xiaoxi> lrwxrwxrwx 1 root root 19 Sep 28 06:44 libcls_rbd.so -> libcls_rbd.so.1.0.0
[18:07] <xiaoxi> lrwxrwxrwx 1 root root 19 Sep 28 06:44 libcls_rbd.so.1 -> libcls_rbd.so.1.0.0
[18:08] <xiaoxi> well.it is real an old one
[18:08] * brambles (~xymox@shellspk.ftp.sh) has joined #ceph
[18:11] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[18:12] <dilemma> joshd1: You were helping me out with a libvirt problem yesterday. I was unable to attach an rbd volume after upgrading from libvirt 0.9.13 to 1.0.0
[18:13] <dilemma> joshd1: I narrowed it down to the specific commit that introduced the problem I'm seeing: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=4d34c929
[18:14] <joshd1> xiaoxi: does 'dpkg -l | grep ceph' show 0.54?
[18:15] <dilemma> running CentOS 6.2 on my host
[18:15] <joshd1> dilemma: ah, that makes sense
[18:15] <dilemma> which is why I have a custom qemu/kvm/librbd setup in /opt
[18:16] <xiaoxi> root@Ceph02:/usr/lib/rados-classes# dpkg -l | grep ceph
[18:16] <xiaoxi> ii ceph 0.48.2-0ubuntu2 amd64 distributed storage and file system
[18:16] <xiaoxi> ii ceph-common 0.54-1quantal amd64 common utilities to mount and interact with a ceph storage cluster
[18:16] <xiaoxi> ii libcephfs1 0.54-1quantal amd64 Ceph distributed file system client library
[18:16] <xiaoxi> damn... seems ceph-common got updated but ceph didn't?
[18:16] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:18] <joshd1> xiaoxi: that seems strange... 'apt-cache show ceph' might tell you why
[18:19] * loicd (~loic@78.250.166.51) has joined #ceph
[18:20] <joshd1> dilemma: so it looks like we just need to make qemuDomainDetermineDiskChain return 0 when disk->type is a network disk
[18:20] <xiaoxi> joshd1: there are actually 2 ceph packages.. one is 0.48 with Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
[18:21] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[18:21] <xiaoxi> another is Version: 0.54-1quantal
[18:21] <xiaoxi> Architecture: amd64
[18:21] <xiaoxi> Maintainer: Laszlo Boszormenyi (GCS) <gcs@debian.hu>
[18:22] * loicd (~loic@78.250.166.51) Quit ()
[18:25] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[18:25] <xiaoxi> joshd:can I solve it?
[18:25] * benpol (~benp@garage.reed.edu) has joined #ceph
[18:27] <joshd1> xiaoxi: look at your /etc/apt/sources.list and /etc/apt/sources.list.d, I'm guessing you've got an extra entry there
[18:27] <joshd1> dilemma: is that your message on libvirt-users?
[18:28] <joshd1> ah, libvirt-devel actually
[18:28] * KindOne (~KindOne@h58.175.17.98.dynamic.ip.windstream.net) Quit (Remote host closed the connection)
[18:29] <dilemma> sorry, joshd1, was afk for a moment
[18:29] <dilemma> joshd1: no, I haven't sent any messages to the mailing list
[18:29] <dilemma> link?
[18:30] * KindOne (KindOne@h58.175.17.98.dynamic.ip.windstream.net) has joined #ceph
[18:30] * Leseb (~Leseb@193.172.124.196) Quit (Quit: Leseb)
[18:31] <xiaoxi> joshd: yes. there is a file in sources.list.d that contains only http://ceph.com/debian-testing/ but sources.list contains mirrors for ubuntu packages ( http://mirrors.163.com/ubuntu/)
[18:31] * fc (~fc@home.ploup.net) Quit (Quit: leaving)
[18:31] <joshd1> dilemma: someone else found the same problem and mailed the list this morning: http://www.redhat.com/archives/libvir-list/2012-November/msg00918.html
[18:32] * nwatkins (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[18:33] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[18:36] * rlr219 (43c87e04@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[18:36] <Robe> gregaf: your sales people are slow! ;)
[18:38] <dilemma> joshd1: looks like that was a coworker of mine
[18:39] <dilemma> wasn't aware he reported it
[18:39] <rweeks> Robe, is someone supposed to be contacting you?
[18:40] <xiaoxi> joshd1:what's the right version of cls_rbd.so
[18:41] <LeaChim> ceph health reports 140 pgs stuck unclean, how do I find out why, and how do I fix it?
[18:41] <Robe> rweeks: at least I sent a mail yesterday 18 hours ago
[18:41] <dilemma> with your explanation, joshd1, we should be able to put together a patch
[18:42] <joshd1> xiaoxi: it's not independently versioned, you just want to make sure you've got 0.54
[18:42] <joshd1> dilemma: great
[18:42] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[18:42] <joshd1> dilemma: it's good to find these regressions before they make it into distros
[18:45] <joshd1> xiaoxi: i.e. 'apt-get install ceph=0.54-1precise' will make sure you're getting the right version
[18:49] <xiaoxi> joshd: yes, I've got it done, all updated to 0.54 and it works well with cinder now~
[18:49] <xiaoxi> Thanks a lot for your help
[18:50] <joshd1> xiaoxi: you're welcome :)
[18:51] <xiaoxi> it seems it was caused by ubuntu's official source having higher priority than the ceph repo by default..
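(One way to avoid that in future is an apt pin that prefers the ceph.com repository; this is a generic sketch, so adjust it to your own sources.)

    # /etc/apt/preferences.d/ceph.pref
    Package: *
    Pin: origin "ceph.com"
    Pin-Priority: 900

    # then
    apt-get update && apt-get install ceph ceph-common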
[18:51] * vata (~vata@208.88.110.46) has joined #ceph
[18:52] <joshd1> I'm not sure why ceph-common was still upgraded though
[18:56] <Robe> dpkg.log should have some info?
[18:56] * CristianDM (~CristianD@host165.186-108-123.telecom.net.ar) has joined #ceph
[19:00] <CristianDM> Hi.
[19:00] <CristianDM> Is it possible to clean out the contents of cephfs without destroying the other RBD pools?
[19:01] <CristianDM> I can't delete folders because of a bug in cephfs
[19:01] <xiaoxi> joshd1: new problem. Cannot attach the volume to the VM, nova-compute reports "deviceIsBusy: The supplied device (vdb) is busy"
[19:02] <tnt> CristianDM: I think if you just delete the data and metadata pool, it should do it.
[19:02] <xiaoxi> ceph -s on the host looks fine and dpkg -l | grep ceph shows 0.54
[19:02] <joshd1> xiaoxi: try vdc or another one instead
[19:03] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:05] * chutzpah (~chutz@199.21.234.7) has joined #ceph
[19:06] <xiaoxi> joshd1:tried,the same.actually I use auto
[19:08] * buck (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[19:10] <dilemma> joshd1: you were correct. This patch against libvirt v1.0.0 fixes the issue: http://pastebin.com/izx40mRd
[19:10] <dilemma> sending that to libvirt-devel as well
[19:10] <CristianDM> tnt: thanks
[19:14] * yehudasa (~yehudasa@2607:f298:a:607:ac9c:6541:7b1f:96e1) Quit (Ping timeout: 480 seconds)
[19:19] <xiaoxi> joshd:are you still there?
[19:19] <CristianDM> tnt: When recreating data and metadata, is any special setup needed?
[19:19] <CristianDM> tnt: or do I simply run "ceph osd pool create data"
[19:20] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:21] <joshd1> dilemma: great, that looks good. they'll probably ask for a commit message and signed-off-by too
[19:22] <joshd1> xiaoxi: sorry, was away for a bit. is there anything in the compute log, or the libvirt log for the instance? also which version of libvirt do you have?
[19:23] <tnt> CristianDM: just set the pg num and pgp num accordingly
[19:23] <tnt> (afaik)
[19:23] * yehudasa (~yehudasa@2607:f298:a:607:c4f0:a32d:8103:5c98) has joined #ceph
[19:25] <xiaoxi> there is something in the compute log, but nothing useful. libvirt prints out nothing. the version of libvirt is 0.9.13
[19:26] <xiaoxi> I faced the same situation when I forgot to place the ceph.conf in the right place (for which I have filed a bug in nova :) ), but there was some info in the instance's log.. this time, no
[19:29] <xiaoxi> joshd:here is the log of nova-compute:http://pastebin.com/ep58D07M
[19:36] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:43] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:44] * CristianDM (~CristianD@host165.186-108-123.telecom.net.ar) Quit ()
[19:51] <joshd1> xiaoxi: you can try doing 'rbd info' as the user the vm is running with, with CEPH_ARGS set the same as it is for cinder-volume
[19:51] <joshd1> oh, no cephx, so you don't have to worry about CEPH_ARGS
[19:52] <joshd1> xiaoxi: I thought you were using the nova pool, but that log shows the rbd pool being used
[19:56] * BManojlovic (~steki@81.18.49.20) has joined #ceph
[19:56] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[19:56] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[19:57] * MikeMcClurg (~mike@firewall.ctxuk.citrix.com) Quit (Quit: Leaving.)
[20:00] * CristianDM (~CristianD@host165.186-108-123.telecom.net.ar) has joined #ceph
[20:03] * calebamiles (~caleb@65-183-137-95-dhcp.burlingtontelecom.net) Quit (Remote host closed the connection)
[20:08] * calebamiles (~caleb@65-183-137-95-dhcp.burlingtontelecom.net) has joined #ceph
[20:10] * Tamil (~Adium@38.122.20.226) has joined #ceph
[20:11] <CristianDM> I deleted and recreated the data and metadata pools
[20:11] <CristianDM> but now I have "mds a is laggy"
[20:11] <CristianDM> How can I fix this?
[20:15] <gregaf> CristianDM: probably the MDS crashed when you deleted its pools out from under it
[20:15] <gregaf> you'll need to run the newfs command and then restart the MDS daemon
[20:16] <elder> sage, are you joining our rbd call?
[20:21] * Tamil (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[20:23] <CristianDM> ceph mds newfs metadata data
[20:23] <CristianDM> unable to parse positive integer 'metadata'
[20:23] <CristianDM> Returns this "unable to parse positive integer 'metadata'"
[20:23] <gregaf> CristianDM: yes, it takes pool IDs rather than names
[20:24] <CristianDM> how do I get the pool numbers?
[20:24] <gregaf> ceph osd dump | grep pool
[20:24] <gregaf> will include them in the output
[20:27] <CristianDM> Thanks
[20:27] <CristianDM> Work fine now
[20:29] * loicd (~loic@pat35-5-78-226-56-155.fbx.proxad.net) has joined #ceph
[20:35] * noob2 (47f46f24@ircip2.mibbit.com) has joined #ceph
[20:35] <noob2> dmick: i ended up getting the gateway working yesterday after using your apache packages :)
[20:35] <noob2> thanks for the help
[20:36] <noob2> i had a question about how replication works with ceph. I know with gluster the client talks to both replica machines at the same time. How does this work in ceph? Do I talk to one machine and then it replicates out?
[20:37] <joshd1> yeah, the client talks to the primary osd, and the primary handles the replication within the cluster
[20:38] <noob2> ok
[20:38] <noob2> yeah it did look different than gluster when i was watching with dstat on the client and cluster
[20:39] <noob2> is the primary osd chosen at random via the crush algorithm?
[20:39] <noob2> or pseudo random
[20:43] <joshd1> yes, object names are hashed into placement groups, which are mapped to osds by crush http://ceph.com/docs/master/rados/operations/placement-groups/
[20:44] <noob2> ok i'm following
[20:45] <noob2> i had another question about network/disk performance. i got approval this week to build out a cluster of 6 osd servers. i'm debating whether to go with 4 1Gb nics or 1 10Gb nic. The raid controller will be an HP with 1GB flash-backed cache. I don't think I can do jbod but I can do 1 raid0 on each disk
[20:46] <noob2> it seems like no matter what i do with the test machines I get about 30MB/s performance out of each OSD server total. does that seem right?
[20:46] <noob2> the test machines are similar to what i'd be buying
[20:46] <dilemma> Thanks for the help joshd1: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=f0e72b2f
[20:48] <gregaf> noob2: that's actually lower than what we normally see, although I think sjust has a better intuition/memory on that
[20:49] <noob2> ok. that was my thought as well
[20:49] <noob2> if i was only getting 30MB/s then why bother with 10Gb connections ya know?
[20:49] <noob2> my sas drives i'm testing with should be able to sustain higher rates of transfer. i can dd to them at a very fast rate
[20:49] <gregaf> well, you still need many disks in a chassis before a 10GB connection is the limiting factor ;)
[20:49] <noob2> right
[20:49] <gregaf> *10Gb
[20:50] <noob2> the machines i'm going to buy for the prod setup will have 12 sata's in them. hp dl180G6's
[20:50] <noob2> we're an hp shop so my hands are a little tied
[20:50] <noob2> hp's support for jbod sucks
[20:50] <gregaf> but in terms of ease of admin, power/space, and not having to deal with bad bonding behavior I'd think 1x10Gb instead of 4x1Gb would be a no-brainer?
[20:50] <joshd1> dilemma: you're welcome, thanks for fixing it :)
[20:51] <noob2> gregaf: yeah normally. the thing is we have tons of 1Gb connections available and only maybe 20 10Gb connections per rack
[20:51] <noob2> actually 20 per row
[20:51] <noob2> each rack has 2 48 port 1Gb switches at the top
[20:52] <noob2> the network admins are good at making vPC 802.3ad connections for me. that's no problem
[20:52] <nhm> gregaf: I can hit 10GbE speeds with 9 disks + 3 SSDs. :)
[20:52] <noob2> really?
[20:53] <noob2> can you sustain it or does it just burst for a little bit?
[20:53] <noob2> dstat was showing me about a 10 second burst of quicker speed and then it bogs down
[20:54] <nhm> noob2: a very idealized test against localhost. 5 min rados bench test, 10GB journal, 256 concurrent 4MB IOs across 8 rados bench instances. OSDs are using BTRFS underneath. 1045.42MB/s.
[20:54] <noob2> damn that is quick
[20:55] <noob2> so it was 1 machine testing on localhost as the mount?
[20:55] <noob2> i mean testing with rados bench on local
[20:55] <nhm> that's using 2 9207-8i HBAs, 9 7200rpm Seagate Constellation drives, and 3 Intel 520 SSDs for journals.
[20:55] <nhm> noob2: yep
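(Roughly the kind of command nhm describes, run as several parallel instances; the pool name is an example and flag spellings can vary a little between versions:)
    rados -p test bench 300 write -b 4194304 -t 32    # 5 minute write test, 4MB objects, 32 in flight per instance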
[20:55] <noob2> i don't recognize that hba
[20:56] <nhm> noob2: it's the dirt cheap non-raid successor to the 9211-8i.
[20:56] <noob2> lsi
[20:56] <noob2> ok i see.
[20:56] <nhm> like $150-$200 per card.
[20:56] <noob2> did you raid your drives or just jbod it?
[20:56] <nhm> just jbod
[20:56] <noob2> yeah i think that's where HP is burning me. the raid0's on each drive don't seem to be that fast
[20:57] <Robe> yup
[20:57] <Robe> smartarrays aren't the fastest bunch
[20:57] <noob2> nah
[20:57] <noob2> i setup my gluster with HP smart arrays and i don't like it
[20:57] <noob2> it's slow
[20:57] <nhm> Yeah, that's what we do on our Dells too. On the SAS2208 on my supermicro board that's actually an ok config though.
[20:58] <noob2> you just setup a raid 0 on each drive also on the dell's?
[20:58] <nhm> noob2: yeah, the H700s we have don't do JBOD.
[20:58] <noob2> ok
[20:58] <noob2> how did that perform?
[20:58] * Tamil (~Adium@38.122.20.226) has joined #ceph
[20:58] <nhm> noob2: crappy
[20:58] <noob2> my test boxes are old HP G5 pizza boxes. the new machines will be G6's
[20:58] <noob2> yeah i see the same thing
[20:58] <noob2> crap performance
[20:59] <nhm> noob2: To be fair it was an older version of ceph too.
[20:59] <noob2> i wonder if i'd be better served making a giant raid 10 and presenting that to ceph
[20:59] <nhm> noob2: having said that, I'm suspicious of some issue with the controller/expanders they use.
[20:59] <noob2> me too
[21:00] <nhm> noob2: Have you seen the controller performance articles on our blog?
[21:00] <noob2> looking inside the G5/G6 machine i don't see a way to use a generic LSI card
[21:00] <noob2> yeah it's amazing
[21:00] <noob2> that's why i'm confused
[21:00] * Tamil (~Adium@38.122.20.226) Quit ()
[21:01] <noob2> i have a 512MB battery-backed cache on the gluster setup and it helps a little. that's why i was thinking of upping it to 1GB flash on the ceph drives
[21:01] <noob2> now i'm thinking i might be wasting my time if it doesn't support jbod
[21:02] <nhm> noob2: One thing I did notice is that the controllers doing JBOD mode that bypassed cache didn't do so hot without SSD journals.
[21:02] * timmclaughlin (~timmclaug@69.170.148.179) Quit (Ping timeout: 480 seconds)
[21:02] <noob2> oh really?
[21:02] <noob2> interesting
[21:02] <nhm> noob2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[21:03] <noob2> so journals on the same drive look to benefit from write back cache
[21:04] <nhm> Yep
[21:04] <nhm> But if you use SSD journals, the cheap HBAs are actually some of the best performing options.
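(A minimal ceph.conf sketch of the SSD-journal layout being discussed; the hostname and device path are examples:)
    [osd.0]
            host = osd-node-1
            osd journal = /dev/ssd0p1     # journal on its own SSD partition
            osd journal size = 10000      # in MB; mainly relevant when the journal is a file rather than a partition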
[21:04] <noob2> ok so maybe i'm on the right track
[21:04] <noob2> HP does seem to sell some LSI cards on their site. maybe i can get the sales guy to swap it out
[21:05] <noob2> get a small height 10Gb card for the one slot and get that extra 2 drive bay expansion and put ssd's in there
[21:05] <nhm> noob2: also, I'm not using expanders in this setup. That might matter too.
[21:06] <noob2> i can't tell if the dl180's are using expanders or not
[21:06] * Tamil (~Adium@38.122.20.226) has joined #ceph
[21:06] <noob2> http://h18000.www1.hp.com/products/quickspecs/13248_na/13248_na.html
[21:07] * shelleyp (~shelleyp@173-165-81-125-Illinois.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[21:09] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[21:09] <rweeks> nhm: I know you were skeptical about using OSDs on top of a big raid
[21:11] <nhm> rweeks: interestingly with ext4, the SAS2208 and ARC-1880 do pretty good with a big raid0.
[21:12] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[21:12] <rweeks> hm
[21:12] <rweeks> how many disks were in the raid0?
[21:12] <nhm> rweeks: 8
[21:12] <rweeks> so that's one OSD?
[21:12] <nhm> yeah, 1 OSD. I should say, they do well with large writes.
[21:13] <noob2> so out of all the controllers, if i was stuck with smart arrays you'd go with the flash-backed write cache ones?
[21:13] <nhm> less well with small writes. It might be a Ceph limitation though.
[21:13] <rweeks> hm
[21:13] <rweeks> so if you lost a single disk in that 8 drive raid0, you'd lose the entire OSD
[21:13] <nhm> noob2: I haven't tested them. Any idea what chip is on them?
[21:13] <noob2> good question
[21:13] <noob2> let me see if i can dig into it
[21:14] <nhm> rweeks: yep. I only used the raid0 to test the same number of data drives though.
[21:14] <rweeks> gotcha
[21:14] <nhm> rweeks: presumably you'd use a R5/6 or something.
[21:14] <rweeks> I was curious, because noob2 was asking what if he did raid10
[21:14] <rweeks> right.
[21:14] <nhm> I just didn't want to make it even more complicated.
[21:14] <rweeks> oh sure
[21:14] <rweeks> I'm just thinking about things we need to test going forward. :)
[21:15] <nhm> rweeks: Yeah, once we test on big 60+ drive per node systems we'll have some definite testing to do.
[21:16] <rweeks> I'm also thinking of some of the hardware folks we're partnering with
[21:16] <nhm> that was a bit redundant. Oh well. :)
[21:16] * BManojlovic (~steki@81.18.49.20) Quit (Ping timeout: 480 seconds)
[21:16] * rweeks points nhm at the Dept of Redundancy Dept
[21:16] <plut0> are there any architecture design docs on this stuff?
[21:16] <noob2> i found the chipset
[21:16] <nhm> plut0: closest right now is probably the blog posts. :)
[21:17] <noob2> http://pmcs.com/products/storage/raid_controllers/pm8011/
[21:17] <noob2> that's the p410 controller
[21:17] <gregaf> I gave a talk on cluster config with the slides in the Ceph Day blog post
[21:17] <gregaf> if that's the kind of thing you're interested in
[21:17] <noob2> oh
[21:17] <nhm> gregaf: ooh, I'll need to look at that.
[21:17] <noob2> yeah i think i grabbed that presentation
[21:17] <plut0> gregaf: where?
[21:17] <noob2> very informative :)
[21:17] <noob2> i think it's on the ceph blog
[21:18] <gregaf> http://ceph.com/community/our-very-first-ceph-day/
[21:18] <noob2> looks like the dell perc smokes that hp p410 controller
[21:19] * dmick (~dmick@2607:f298:a:607:c116:87f8:cf74:fa68) has joined #ceph
[21:19] <noob2> i think dell is just using an lsi chip
[21:19] <plut0> gregaf: thanks i see it
[21:21] <nhm> gregaf: good, I agree with everything in that Doc. :)
[21:21] <gregaf> heh
[21:22] * Tamil (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[21:24] <dmick> wth does "scalable block device without visualization" mean, and how did you get to that interpretation, gregaf?
[21:24] <noob2> lol
[21:24] <gregaf> dmick: he meant virtualization, not visualization
[21:25] <noob2> dmick: thanks for the help yesterday. i got the gw working with the ceph apache packages
[21:25] <dmick> noob2: good
[21:25] <noob2> performance is decent although i'm noticing the gateways might need to be physical. they're vm's and kinda constrained for bandwidth
[21:25] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[21:25] <gregaf> hrm, that's actually not the right version of the presentation that's been posted
[21:25] <gregaf> I wonder how I failed at that...
[21:26] <dmick> gregaf: *oh*. but if you have a cluster running, and from a remote host use krbd, there's no reason that image can't be enlarged, and then made use of by the remote host, right?
[21:26] <gregaf> yeah, sure
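(A sketch of that flow; the image name and sizes are examples, and older kernel rbd may need an unmap/remap to notice the new size:)
    rbd resize mypool/myimage --size 20480    # grow the image to 20GB (size is in MB)
    rbd map mypool/myimage                    # on the remote host: exposes /dev/rbdX via krbd
    resize2fs /dev/rbd0                       # then grow whatever sits on top, e.g. an ext4 filesystem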
[21:27] <noob2> gregaf: i'm basically looking for ceph to be my storage backend for vmware. my coworkers are used to high performance hitachi storage so i might want to steer towards higher end parts
[21:27] <noob2> i got ceph hooked up to the cluster over fibre last week. LIO works great for that
[21:28] <noob2> after i got the proof working now it's time to design the higher performance prod setup :)
[21:29] <dmick> noob2: how's the LIO plumbing layered, exactly?
[21:30] <rweeks> yeah I'm curious how you're talking to VMware
[21:30] <noob2> so i have 2 proxies setup
[21:30] <noob2> one proxy for the A side of the fibre fabric, another for the B side
[21:30] <noob2> each proxy mounts a rados block device and then exports it over LIO
[21:30] <rweeks> ah interesting
[21:30] <noob2> vmware sees each LIO port as a target. it's not aware it's going over 2 proxies
[21:31] <noob2> i needed two proxies to get myself out of the single point of failure area
[21:31] <noob2> that way if the network blips on one of the proxies, vmware can fail over to the other fabric
[21:31] <dmick> is this proxy code you've written to shim between librbd and LIO?
[21:31] <noob2> nope
[21:31] <noob2> didn't need to write anything
[21:32] <noob2> LIO can export any block device
[21:32] <noob2> so once you map it to the proxy machine, LIO can see it and export it no problem
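(A rough sketch of the proxy setup noob2 describes; the pool/image names and sizes are examples, and the targetcli layout differs between versions:)
    rbd create vmware-pool/lun0 --size 1048576    # a 1TB image (size in MB)
    rbd map vmware-pool/lun0                      # appears as /dev/rbd0 on the proxy
    # then, in targetcli, create a block (iblock on older releases) backstore for /dev/rbd0
    # and export it as a LUN through the fibre channel fabric module; repeat on the second proxy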
[21:33] <noob2> i think my problem is going to be getting the rbd devices to survive a reboot
[21:34] <noob2> i was able to vmotion some vm's over to the ceph lio storage pretty easily. vmware seems to aggressively cache so it masked my slow ceph setup. i don't know exactly what i did to make it so slow yet
[21:34] * fmarchand3 (~fmarchand@212.51.173.12) has joined #ceph
[21:34] <fmarchand3> hi !
[21:34] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[21:35] <noob2> all said, i think it works pretty good. a little complicated but that's ok :) we're used to that
[21:36] <dmick> oh I see. LIO on top of the kernel block device.
[21:36] <noob2> right
[21:36] <noob2> i'm exporting /dev/rbdX
[21:36] <noob2> it won't survive a reboot though yet. that needs some work
[21:37] <plut0> gregaf: i read your presentation but i still have a lot of questions
[21:37] <fmarchand3> I have a question ! I use CephFS but I noticed it's still under development and that rbd is more commonly used in production. Could someone point me to good documentation on what rbd is and how to use it ?
[21:37] <dmick> yeah. we're interested in things that use the userland libs to avoid the kernel, but LIO is kernel anyway. Any feel for what the performance penalty is?
[21:37] <noob2> over LIO?
[21:37] <dmick> well, iscsi of some sort
[21:38] <noob2> none that i can really see at the moment. my bottleneck is the 2 1Gb network connections i'm going over
[21:38] <dmick> oh and penalty..yes
[21:38] <noob2> i'm using qlogic 8Gb fibre cards
[21:38] <noob2> cpu usage seems to be fairly low. ram usage as well
[21:38] <noob2> i haven't ramped it up just yet. i don't know what will happen when i have 20 rbd's mounted on the proxy :D
[21:38] <dmick> heh
[21:39] <dmick> sounds cool
[21:39] <noob2> yeah it's a fun project
[21:39] <noob2> my boss dropped her jaw when she saw what i made haha
[21:39] <dmick> if you're willing to write up a little about it, it would make a great email/blog post
[21:39] <noob2> yeah i'd love to share it with you guys
[21:39] <plut0> how does cephfs fit into this?
[21:39] <dmick> nothing heavyweight, just "hey look at this goofy config"
[21:39] <noob2> ok
[21:39] <dmick> (I say 'goofy' with love)
[21:39] <noob2> what's your email?
[21:40] <dmick> I was thinking about ceph-devel@vger.kernel.org
[21:40] <noob2> oh ok
[21:40] <noob2> let me save that addr
[21:40] <dmick> worth being on that, too
[21:40] <noob2> people nice on there?
[21:40] <dmick> http://ceph.com/resources/mailing-list-irc/
[21:40] <dmick> fairly
[21:40] <noob2> haha
[21:41] <dmick> I mean, you know. It's no 4chan
[21:41] <noob2> so if i sent off my crazy setup you think someone would post it to the blog?
[21:41] <noob2> that'd be really neat
[21:41] <rweeks> yes, noob2
[21:41] <noob2> nice :D
[21:41] <rweeks> if it's not me, nhm or scuttlemonkey or rturk will
[21:41] <noob2> sounds good to me
[21:41] <noob2> ok
[21:42] <noob2> how detailed should i get?
[21:42] <fmarchand3> noob2 : you seem to know rbd well ! Can you tell me how I can start learning about it ? do you have a nice link ?
[21:42] <plut0> is cephfs optional?
[21:42] <rweeks> as many details as you can share
[21:42] <noob2> fmarchand3: i'd start with the ceph rbd docs. that's where i started. http://ceph.com/docs/master/rbd/rbd/
[21:42] <rweeks> cephfs is completely optional, plut0
[21:42] <rweeks> you only need to use the pieces of ceph you want.
[21:43] <plut0> rweeks: and why would i want cephfs?
[21:43] <rweeks> if you need a massively scalable posix filesystem?
[21:43] <noob2> plut0: if you want a scale out filesystem you'd use cephfs
[21:43] <fmarchand3> noob2 : but rbd is a ceph layer ?
[21:43] <noob2> i think you're confusing the 2
[21:43] <plut0> i'm not understanding. how else do you interact with ceph without the cephfs?
[21:44] <rweeks> many ways
[21:44] <rweeks> via mounting an RBD block device
[21:44] <rweeks> through the RADOS gateway using S3 or Swift APIs
[21:44] <noob2> exactly
[21:45] <noob2> rbd is a block of stuff you mount as a virtual 'hard drive'
[21:45] <rweeks> or via writing your own object storage using librados, from python, c, c++, ruby or java
[21:45] <noob2> your server or vm sees it as local storage
[21:45] <noob2> yeah there's many ways to use it
[21:45] <noob2> it's easy to setup and configure for all different situations
[21:45] <dmick> plut0: think of it this way:
[21:45] <dmick> the cluster is a big reliable storage bag
[21:46] <noob2> yeah i'm starting to think of it the same way now
[21:46] <dmick> there are several key ways to use it to store things:
[21:46] <dmick> 1) radosgw, which lets you use S3 or Swift applications to access it
[21:46] <dmick> 2) rbd, which lets you use it as backing store for block devices
[21:46] <dmick> 3) cephfs, which lets you use it as a Posix filesystem
[21:46] * Ryan_Lane (~Adium@207.239.114.206) has joined #ceph
[21:46] <dmick> and of course, you can layer on top of all those and make weirder access methods
[21:46] <noob2> lol
[21:47] <noob2> exactly
[21:47] <gregaf> you might also be interested in Sage's presentation, which talks about the different layers more explicitly than mine does
[21:47] <gregaf> http://ceph.com/presentations/20121102-ceph-day/20121102-ceph-day-sage.pdf
[21:47] <dmick> slight wrinkle: 2) and 3) both have kernel-based and user-based ways to do it
[21:47] <rweeks> or 4) you can write your own applications that put things directly in the object store.
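(A quick sketch of those access paths side by side; names, sizes, and mount options are examples:)
    # 1) radosgw fronts the cluster with the S3/Swift APIs over HTTP, so there is nothing to mount
    rbd create mypool/myimage --size 10240        # 2) rbd: create a 10GB image...
    rbd map mypool/myimage                        #    ...and map it as /dev/rbdX with the kernel client
    mount -t ceph mon-host:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret    # 3) cephfs
    rados -p mypool put myobject ./localfile      # 4) raw objects, the same idea as writing through librados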
[21:47] <fmarchand3> with rbd do you still need an mds ?
[21:47] <noob2> no you don't
[21:47] <dmick> fmarchand3: no, mds is for the filesystem only
[21:47] <noob2> cephfs requires the mds
[21:48] <dmick> mons+osds make up the cluster
[21:48] <plut0> you can only use mds with cephfs, correct?
[21:48] <noob2> i have to run in a few. thanks for your help guys! I'll work on that write up and send it off when i have a sec
[21:48] <dmick> plut0: it's only useful for cephfs
[21:49] <fmarchand3> the mds I use for cephfs takes too much memory after 1 week without restarting it ... so if I can use something that doesn't take so much memory that would be great !
[21:50] <fmarchand3> But I don't know how to "convert" my cephfs to rbd ...
[21:50] <fmarchand3> any advice ?
[21:51] <dmick> fmarchand3: what are you using the filesystem for ultimately, and would a block device abstraction handle it?
[21:51] <plut0> i don't think i'd use cephfs but i'd need rbd and rados gateway
[21:51] <dmick> plut0: you can certainly use the same cluster for both
[21:52] <plut0> yeah
[21:53] <fmarchand3> dmick : I have many new small files every night stored on the cephfs which is actually an ext4 fs
[21:54] <dmick> uhhh...how is the cephfs actually an ext4fs?
[21:55] <nhm> dmick: maybe under the OSDs?
[21:55] <fmarchand3> dmick : I don't know if a block device abstraction layer would be a good idea for many small files ... I think I'm missing the whole concept of block devices :)
[21:55] * Tamil (~Adium@38.122.20.226) has joined #ceph
[21:57] <fmarchand3> dmick : yes, the osds' data are on an ext4 fs and I mount a cephfs by connecting to the mon (via fstab) ... does that answer your question ?
[21:57] <dmick> ok, I see what you mean
[21:57] <fmarchand3> sorry but I'm a newbie and maybe my ceph vocabulary is not yet very rich :)
[21:58] <dmick> just wasn't thinking about "under the OSDs". To me the cluster is kind of a blackbox layer
[21:59] <dmick> but anyway: lots of small files: yah, rbd doesn't sound like a great match. cephfs is the easy solution, but I'm sorry to hear about the memory growth; maybe that's a bug we can investigate, or maybe there are tuning settings
[21:59] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[21:59] <scuttlemonkey> noob2: sorry, been in meetings all day
[21:59] <dmick> but another option, if you don't really need Posix semantics, is to just use the cluster 'directly' through an application written to librados (i.e. store and retrieve objects directly)
[21:59] <scuttlemonkey> I'm building a guest-blogger program
[22:00] <scuttlemonkey> but if you wanna just send me whatever you have you can hit me directly
[22:00] <dmick> ^^ noob2
[22:00] <scuttlemonkey> patrick@inktank.com
[22:02] <fmarchand3> dmick : I tried to reduce the cache size of the mds but every night after the "crawl" the mds is maybe 100MB bigger in memory ...
[22:02] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[22:02] <dmick> the crawl: something spidering your filesystem, you mean?
[22:03] <gregaf> fmarchand3: where did you get your binary from?
[22:03] <gregaf> and have you tried restarting the client nodes and seeing if that changes anything?
[22:03] <fmarchand3> dmick : and on top of that ... my osds are constantly doing read operations on their respective disks
[22:03] <plut0> openstack can talk via rbd or rados gateway?
[22:03] <fmarchand3> but maybe it's normal
[22:03] <gregaf> plut0: openstack can talk directly to RBD
[22:04] <gregaf> that integrates fairly well
[22:04] <plut0> wouldn't that be preferred over the gateway?
[22:04] <lurbs> Is there a sane way to back up all of the metadata that defines a Ceph cluster? I can back up individual RBD volumes (either from inside the VM they're backing, or via rbd export) but don't have a way of rebuilding a totally failed cluster.
[22:04] * noob2 (47f46f24@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[22:04] <gregaf> you can use RGW instead of Swift or something, but I'm not sure what the integration story is in Folsom (if any), joshd1 might know?
[22:04] <gregaf> plut0: depends on what you want to do
[22:04] <fmarchand3> I installed it as a debian package (followed the doc on ceph website) so I use the 0.48-2 version
[22:05] <plut0> what does the gateway provide for openstack?
[22:05] <joshd1> there's nothing special needed in openstack to integrate with the rados gateway
[22:05] <gregaf> lurbs: not really — the expectation is that your configuration management system handles that
[22:05] <joshd1> it provides the s3 and swift apis, and in bobtail (0.55) it can use keystone for auth
[22:05] <gregaf> rebuilding exactly the same Ceph cluster after a disaster doesn't help you much, after all ;)
[22:06] <lurbs> Rebuilding the base cluster is easy enough, all I really need are the config files. Being able to regenerate all of the RBD volumes is the tricky bit. I guess even saving the output of 'rbd ls -l' would help.
[22:06] <fmarchand3> dmick : I have a java process using my cephfs-mounted partition to crawl many websites (maybe spider is more adequate for this purpose) and I store the pages to parse and extract information from later in the day
[22:07] <gregaf> lurbs: what kind of regeneration are you after?
[22:07] <gregaf> seems like you could take all the images you have backed up and dump them into the cluster with whatever names you want...
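(A sketch of the kind of metadata worth capturing alongside image exports; pool and image names are examples:)
    ceph osd dump > /backup/osdmap.txt              # pools, replica counts, osd layout
    ceph osd getcrushmap -o /backup/crushmap.bin    # CRUSH map (decompile with crushtool -d)
    rbd ls -l rbd > /backup/rbd-images.txt          # image names, sizes, snapshots per pool
    rbd export rbd/myimage /backup/myimage.img      # per-image data export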
[22:07] * Tamil (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[22:08] <lurbs> When you export an RBD volume, or snapshot, does it write it out in full, or sparse?
[22:09] <joshd1> lurbs: it's full right now, since fiemap turned out to be unreliable
[22:09] <fmarchand3> gregaf : does the argonaut version have a lot of memory leaks ?
[22:09] <lurbs> Yeah. That makes backups a little fun.
[22:09] <dmick> joshd1: export too?... I thought export could do sparse
[22:09] <joshd1> dmick: it could if the osds could
[22:10] <dmick> uh...oh, right, the "no zero buffers" issue I was hitting
[22:10] <dmick> gotcha
[22:10] <dmick> still vaguely wonder if "scan for runs of zeros" is worth it
[22:10] <gregaf> fmarchand: not a lot of known memory leaks, but there is a known issue with clients pinning dentries in memory and forcing the MDS cache to grow too much
[22:10] <gregaf> and there are probably some unknown leaks ;)
[22:11] <joshd1> dmick: I think it's probably worth doing on the osds directly (replace where we would have done fiemap)
[22:11] <dmick> right
[22:12] <dmick> avoid the network b/w for zero reads, and get sparse stuff running around the clients/export in the bargain
[22:12] <dmick> shame fiemap doesn't work
[22:12] <joshd1> it actually might be safe for exporting snapshots
[22:12] <fmarchand3> gregaf : but I understand that people using rbd have less memory issues than people using cephfs ... am I wrong ?
[22:12] <joshd1> it was mostly under load (writes to the same file) where it was inconsistent
[22:13] <gregaf> they certainly should — they don't have to run the MDS! even if it were perfect it would take more memory
[22:13] <joshd1> fmarchand3: if you're not accessing the fs from more than one machine, you can just put it on top of rbd
[22:14] <lurbs> So in the meantime backing up data from inside the VMs and 'rbd ls -l' in order to be able to re-populate the cluster is about as good as I'm going to get, until sparse exports exist?
[22:15] <dmick> lurbs: there's probably some trickery you can use with cp and/or dd to resparsify things for efficiency
[22:15] <fmarchand3> joshd1 : I have several machines accessing the fs simultaneously ...
[22:15] <lurbs> dmick: Good point, I'll look into that.
[22:16] <dmick> maybe not dd; cp --sparse=always looks promising tho
[22:17] * Tamil (~Adium@38.122.20.226) has joined #ceph
[22:19] <lurbs> With a completely empty new RBD snapshot cp --sparse=always worked fine. I'll try filling it up with some data, and see how that goes.
[22:20] <dmick> lurbs: cool
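(The workaround being discussed, as a sketch; image and snapshot names are examples:)
    rbd export mypool/myimage@backup-snap /tmp/full.img      # export is written out full-size today
    cp --sparse=always /tmp/full.img /backup/myimage.img     # re-sparsify the runs of zeros
    rm /tmp/full.img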
[22:22] * a (~CristianD@host165.186-108-123.telecom.net.ar) has joined #ceph
[22:23] * a is now known as Guest6374
[22:25] * CristianDM (~CristianD@host165.186-108-123.telecom.net.ar) Quit (Ping timeout: 480 seconds)
[22:25] <fmarchand3> dmick : when you talk about an application using librados to read and write blocks ... when you do it that way you can't "mount" the rbd and see it as a disk anymore ?
[22:26] <rweeks> it wouldn't be blocks at that point
[22:26] <dmick> right. that would be the way you access the objects, is through the application
[22:26] <rweeks> the application would be writing objects, not blocks
[22:26] <dmick> there is no rbd in that case, IOW
[22:26] * jjgalvez (~jjgalvez@166.191.17.78) has joined #ceph
[22:27] * jks (~jks@3e6b7199.rev.stofanet.dk) has joined #ceph
[22:30] <lurbs> ...and with 100 MB of data inside a 1 GB RBD volume the sparse file created is 100 MB. Next, to see if it's possible in one step to eliminate the scratch space requirement.
[22:30] <dmick> lurbs: export to - is not implemented (just working on it)
[22:30] <fmarchand3> it's just that mounting a file system and using it from the java process is really handy
[22:31] <dmick> maybe FIFO trickery?
[22:31] <lurbs> I was just thinking that.
[22:31] <dmick> export to - isn't hard; hope to make bobtail
[22:31] <dmick> easy patch if you want to go custom
[22:32] <lurbs> Really don't. :)
[22:32] <dmick> sigh. no one ever does :)
[22:33] <plut0> still trying to wrap my head around not using raid for this
[22:34] <rweeks> you could if you wanted
[22:35] <plut0> would take a long time to replicate wouldn't it?
[22:35] <rweeks> it might
[22:35] * jks (~jks@3e6b7199.rev.stofanet.dk) Quit (Remote host closed the connection)
[22:36] <rweeks> and if you have one OSD per raid you're talking about a larger failure domain if something goes wrong
[22:36] <plut0> how so
[22:36] * Ryan_Lane (~Adium@207.239.114.206) Quit (Quit: Leaving.)
[22:37] <rweeks> because of the replication time for larger sized OSDs
[22:37] <rweeks> if you have one OSD per disk you have replication time for whatever the size of that disk is
[22:37] <rweeks> if say, you put 8 disks in a raid 6 you've got replication time for 6 disks there
[22:38] <plut0> you'd do each disk is an osd?
[22:38] <lurbs> Plus you're losing capacity, and while the RAID is rebuilding that OSD (and therefore to an extent your entire cluster) is degraded performance-wise.
[22:38] <dmick> plut0: that's the usual mapping
[22:39] <plut0> each disk as an osd huh
[22:39] <lurbs> plut0: http://ceph.com/presentations/20121102-ceph-day/20121102-cluster-design-deployment.pdf
[22:39] <rweeks> that's what we recommend
[22:39] <rweeks> and then you replicate objects so you don't lose anything if a disk goes down
[22:39] <plut0> lurbs: i read that earlier
[22:39] <dmick> fmarchand3: small python program to create a pool if necessary and write an object, just to show the idea: http://pastebin.com/2LA2g5SS
[22:40] <plut0> interesting
[22:40] <plut0> i was thinking of having a fairly large osd server with lots of disks behind it
[22:41] * jks (~jks@3e6b7199.rev.stofanet.dk) has joined #ceph
[22:42] <dmick> (don't need separate open_ioctx() call, duh, but you get the idea)
[22:42] <dmick> plut0: as rweeks says: spreading the failure is better
[22:42] <dmick> ceph plans for OSDs to die
[22:42] <dmick> RAID tends to treat it as uncommon and somewhat more painful to repair
[22:42] <rweeks> you're also spreading the memory and CPU utilization across many OSDs that way
[22:45] <fmarchand3> dmick : it looks nice ! I'm gonna read some documentation along those lines ... I need to know more about the object concept ... I don't know what a pool is ! I see the concept but I need to read more about that subject
[22:45] <dmick> it's very simplistic, but if you need simplistic non-hierarchical storage, simplicity is your friend
[22:46] <dmick> (and a pool is just a named subdivision of the cluster, with potentially its own replication rules and authentication credentials. Nothing particularly special)
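(The same idea using nothing but the rados CLI; the pool and object names are made up:)
    rados mkpool testpool
    rados -p testpool put greeting /tmp/greeting        # store /tmp/greeting as an object named 'greeting'
    rados -p testpool ls                                # list objects in the pool
    rados -p testpool get greeting /tmp/greeting.copy   # read it back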
[22:47] <fmarchand3> dmick : but it would be cool to access it directly from my java code ...
[22:47] <dmick> fmarchand3: there are java bindings too
[22:48] <dmick> https://github.com/ceph/ceph/tree/master/src/java
[22:48] <fmarchand3> dmick : I was looking for java bindings for cephfs but not for rados ... :)
[22:48] <dmick> er, sorry, those *are* cephfs bindings
[22:48] <rweeks> but there are bindings for librados, yes?
[22:49] <plut0> so ceph is able to stripe across all osd's evenly?
[22:49] <dmick> https://github.com/noahdesu/java-rados
[22:52] * chutzpah (~chutz@199.21.234.7) Quit (Quit: Leaving)
[22:53] <dmick> plut0: absolutely
[22:54] <lurbs> There's no force on rbd export? Can't output to a FIFO, because it complains that the output file already exists.
[22:54] <fmarchand3> dmick : Yes it's cephfs ... but it's nice too :)
[22:54] <dmick> fmarchand3: but the second one is librados
[22:54] <dmick> lurbs: ah, yes, I discovered that too. Sorry, no
[22:54] <plut0> so is it bad to build a large file server with many disks?
[22:55] <fmarchand3> dmick : I'm reading the test files of the second binding (rados) and the concepts of pools and objects are really simple
[22:55] <dmick> (I was trying /dev/stdout, but same thing)
[22:55] <dmick> fmarchand3: very
[22:55] <lurbs> Ah well. I'm waiting for bobtail anyway.
[22:55] <fmarchand3> dmick : so it could be my solution ...
[22:55] <dmick> fmarchand3: if you can handle the "no hierarchy" thing it's pretty easy to write
[22:56] <fmarchand3> Can I have pools inside pools ?
[22:56] <dmick> and of course you can manage your own hierarchy with object name prefixes
[22:56] <dmick> or something like that.
[22:56] <dmick> no, pool space is flat
[22:56] <fmarchand3> oh ok
[22:57] <fmarchand3> and rados takes care of replication across the cluster ! doesn't it ?
[22:58] <rweeks> nhm: you around?
[22:59] <dmick> fmarchand3: yes
[23:00] <nhm> rweeks: does it count if my brain is surrounded by fog? :)
[23:01] <plut0> say each server has 16 hdd's in it, each hdd is an osd and the server goes down, you've lost 16 osd's now, how would you avoid this?
[23:01] <rweeks> yes, nhm
[23:01] <rweeks> because plut0 has questions about density of disks in servers
[23:01] <fmarchand3> So if I understand correctly you can have rados under osds, no mds and one or a few mons and it should rock ? :)
[23:01] <rweeks> correct, fmarchand3
[23:02] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) Quit (Quit: adjohn)
[23:02] <fmarchand3> So I need to understand how to put rados "under" an osd ....
[23:02] <nhm> plut0: how would you avoid the OSDs going down?
[23:02] <rweeks> fmarchand3: RADOS requires OSDs to be there
[23:02] <lurbs> plut0: A correct CRUSH map should ensure that you have replicas on OSDs not on that machine.
[23:02] * adjohn (~adjohn@108-225-130-229.lightspeed.sntcca.sbcglobal.net) has joined #ceph
[23:03] <nhm> ^
[23:03] <rweeks> fmarchand3: in short: you take servers running linux with x disks in each server
[23:03] <rweeks> fmarchand3: you make each disk an OSD.
[23:03] <lurbs> So once the cluster has figured out the OSDs are down, and marked them as such, IO for the affected PGs will start up again.
[23:03] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[23:03] <rweeks> fmarchand3: you create a CRUSH map of your cluster
[23:03] <gregaf> fmarchand3: what do you mean by RADOS under OSDs?
[23:03] <rweeks> then you can start writing objects to those OSDs
[23:03] <rweeks> I think he's got the layers mixed up
[23:04] <dmick> fmarchand3: RADOS is another name for "the cluster"
[23:04] <plut0> well wouldn't those osd's be in the same pg? you've just lost 16 osd's then right?
[23:04] <lurbs> And if the machine, and OSDs, stays down (for a configurable amount of time) then the cluster will start to put replicas on other OSDs instead.
[23:04] <dmick> and RADOS/the cluster are *made up of* OSDs and MONs, they're not under it
[23:04] * xiaoxi (~xiaoxiche@134.134.139.74) has joined #ceph
[23:05] <dmick> a set of OSDs and MONs are the cluster. You can talk to it with librados, through the Python or Java bindings we've been discussing
[23:05] <gregaf> plut0: nope; you make a CRUSH map which knows the OSDs are all on the same host and doesn't put replicas on the same host
[23:05] <lurbs> plut0: Nope, OSDs on the same machine should never be in the same PG.
[23:06] <dmick> but plut0: yes, if you have 16 OSDs on one host, and you lose the host, you lose those OSDs. If that's your entire cluster, then yes, you've lost the entire cluster. That's why you wouldn't do that.
[23:06] <lurbs> The basic install tools I've used (mkcephfs, etc) have generally made a pretty good stab at getting a reasonable CRUSH map.
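(What that looks like in a decompiled CRUSH map; a minimal sketch, with the default bucket and rule names the tools tend to generate:)
    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host    # pick each replica from a different host
            step emit
    }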
[23:06] <plut0> so say i build 8 identical servers with 16 drives each, each drive is an osd, i should create 16 pg's?
[23:06] * MikeMcClurg (~mike@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[23:06] <fmarchand3> gregaf, dmick : sorry ... I have the newbie virus ... I'm gonna read some docs ! At least I'm gonna try :)
[23:06] <plut0> 16 pg's with 8 osd's each?
[23:07] <gregaf> plut0: no no, that's not how PGs work at all
[23:07] <gregaf> you have lots of PGs per OSD
[23:07] <gregaf> consider them as shards of a pool
[23:07] <dmick> fmarchand3: some experimentation could help too. It's pretty easy to make a tiny one-machine 'cluster' and play with it with the 'rados' and 'ceph' CLI if you are so inclined
[23:07] <lurbs> In that scenario you'd want several thousand PGs. I vaguely recall seeing a calculation for that around.
[23:08] <dmick> rule of thumb: 100 per OSD
[23:08] <dmick> but not a hard rule
[23:08] <lurbs> My test cluster, for example, is three machines with 6 OSDs each and a default install gave it 3648 PGs.
[23:08] <nhm> also, try to keep it to a power of 2.
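(The back-of-the-envelope version of that rule of thumb, then creating a pool with it; the numbers are examples:)
    # 8 servers x 16 OSDs = 128 OSDs; 128 * 100 / (2 replicas) = 6400, rounded up to a power of 2
    ceph osd pool create mypool 8192    # create the pool with 8192 placement groups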
[23:08] <plut0> do i need to know how to place these or does CRUSH do that for me?
[23:09] <fmarchand3> dmick : it's what I'm gonna do :)
[23:09] <plut0> looks like a pg never uses more than one osd on the same cluster node, is that right?
[23:10] <lurbs> You could make it do that, but it wouldn't be default, no.
[23:10] <lurbs> s/be/by/
[23:10] <fmarchand3> thx dmick, gregaf !!!
[23:10] <lurbs> At least I think you could.
[23:10] <plut0> guess i'm not understanding very well yet
[23:11] * joshd1 (~jdurgin@2602:306:c5db:310:9011:885f:57da:3c7e) Quit (Quit: Leaving.)
[23:13] <lurbs> plut0: http://ceph.com/docs/master/dev/placement-group/
[23:14] <plut0> does the number of osd's in a pg represent the number of replicas you need?
[23:16] * KindOne (KindOne@h58.175.17.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[23:16] * KindTwo (KindOne@h4.176.130.174.dynamic.ip.windstream.net) has joined #ceph
[23:17] * KindTwo is now known as KindOne
[23:18] <lurbs> plut0: I believe so. Bear in mind I'm a user, not a developer.
[23:18] <plut0> ok
[23:19] <lurbs> The number of replicas can differ between pools, so they each have their own mappings.
[23:19] <plut0> is it a nightmare to keep track of which osd's map to which disk?
[23:19] <plut0> like when you need to replace the disk
[23:20] <dmick> plut0: you should probably read a little background
[23:20] <lurbs> Not really, because the config allows you to use $id for the mount point.
[23:20] <dmick> placement is done automatically. default crush rules do sane things, but you can tune crush to do what you want
[23:20] <lurbs> For example all of my OSDs are mounted at /dev/ceph/osd.$id
[23:20] <dmick> objects and pgs spread across the cluster in fault-tolerant replicatey ways
[23:20] <lurbs> Er, s/dev/srv/
[23:20] <plut0> lurbs: and what is id? serial #?
[23:21] <lurbs> $id's the ID of the OSD.
[23:21] <plut0> lurbs: how do you know which physical disk that is?
[23:22] <dmick> ceph.conf specifies which osds are on which hosts, and which disks are used by which osds. basically they're administration choices you make when setting up the cluster
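(A ceph.conf sketch of that host/disk bookkeeping; the hostname and device are examples:)
    [osd]
            osd data = /srv/ceph/osd.$id
    [osd.6]
            host = storage-node-3
            devs = /dev/sdc1    # the drive this OSD lives on (used by mkcephfs at setup time)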
[23:22] <lurbs> Same way you always do. Track it back down by serial number or whatever.
[23:22] <gregaf> we're working on making that better in the case of disk movement, but it hasn't been a problem for users so far
[23:22] <plut0> say you got 10 racks with 20 nodes in each and 16 drives in each, each drive is an osd, osd #9805812 dies because the drive died, how do you know which physical disk to replace?
[23:23] <lurbs> Tracking an OSD to a host is trivial.
[23:23] <lurbs> That's all in your config file.
[23:23] <plut0> ok, how do you map it to hdd now?
[23:23] <rweeks> one OSD per HDD
[23:24] <rweeks> another reason we recommend that.
[23:24] <dmick> you configured the OSD to use a disk device
[23:24] <plut0> what if you have 16 osd's on the node?
[23:24] <dmick> which is usually one drive, although you *can* make it live on a sw/hwraid (which we usually don't recommend)
[23:24] <dmick> you configured each OSD to use a separate drive
[23:24] <dmick> it's not magic; the OSD is just a daemon, with settings
[23:25] * Tamil (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[23:25] <plut0> 1:1 OSD to HDD right?
[23:25] <lurbs> plut0: Each OSD contains a filesystem and is mounted in a known location. Tracking it back from there is done however you'd normally track a disk in a box.
[23:25] * Tamil (~Adium@38.122.20.226) has joined #ceph
[23:26] <lurbs> /dev/sdc1 930G 3.6G 927G 1% /srv/ceph/osd.6
[23:26] <lurbs> Just find sdc.
[23:26] <dmick> plut0: (02:24:32 PM) dmick: which is usually one drive, although you *can* make it live on a sw/hwraid (which we usually don't recommend)
[23:26] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[23:26] <plut0> lurbs: how do you know which disk is sdc?
[23:27] <dmick> plut0: with linux, /sys/block/sdc/device, for instance
[23:28] <plut0> dmick: i'm looking at the front of the server, i see 16 drives, which is sdc?
[23:28] <dmick> plut0: that depends on what linux system administration tools you have available to you. How do you tell that when Ceph is not involved?
[23:29] * nwatkins (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[23:29] <plut0> dmick: i know it has nothing to do with ceph. i guess i would label the drives by serial #
[23:29] <lurbs> `dd if=/path/to/disk of=/dev/null` and look for the blinky lights. ;)
[23:30] <plut0> lurbs: they're all flashing!
[23:30] <dmick> my favorite is chassis that have system indicator lights that you can light up
[23:30] <rweeks> unless the drive is really dead, in which case, look for the drive with no IO
[23:30] <plut0> dmick: they light up why? because the drive has failed? or by some other means?
[23:31] <dmick> some have failure lights; some have ID lights; some have both
[23:31] <plut0> dmick: how do the id lights work?
[23:31] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[23:31] * dmick is starting to feel like the parent of a toddler
[23:32] <plut0> i like to talk things through, helps me find things i can't figure out in my head
[23:32] * jjgalvez (~jjgalvez@166.191.17.78) Quit (Quit: Leaving.)
[23:32] <plut0> the failure light doesn't help if it is a SMART failure, drive is dying and needs to be replaced, there is no failure light
[23:33] <dmick> plut0: that depends on what software is monitoring the drive, and if there is a failure light on the chassis that it can light
[23:33] <dmick> but yes: It Is A Problem To Identify Devices Unambiguously
[23:34] <lurbs> plut0: We keep track of which slots on the front of the chassis correspond with which ports on the controller, get the serial number of the drive (via smartctl, or whatever) and then match it up on the disk controller to find the port. Then hope we've pulled the right one.
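(The lookup lurbs describes, as a sketch; the OSD number and device names are examples:)
    df /srv/ceph/osd.6                       # shows which block device backs this OSD, e.g. /dev/sdc1
    smartctl -i /dev/sdc | grep -i serial    # serial number to match against the label on the drive
    ls -l /dev/disk/by-id/ | grep sdc        # the same information via the by-id symlinks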
[23:34] * fmarchand3 (~fmarchand@212.51.173.12) Quit (Ping timeout: 480 seconds)
[23:35] <lurbs> If it's wrong, well, that OSD's going to need a rebuild.
[23:35] <plut0> lurbs: thats not very certain
[23:36] <lurbs> If you're really paranoid about it then I guess that could be a reason to run RAID behind the OSDs, or more replicas (and eat the write performance degradation).
[23:37] <lurbs> But as dmick said, it's a problem with or without Ceph.
[23:37] <plut0> lurbs: you could label the outside of the drive with the serial #
[23:37] <dmick> but it's not like RAID makes identifying failed drives easier
[23:37] <lurbs> dmick: No, but the consequences of pulling the wrong drive are potentially (depending on RAID level, of course) lower.
[23:38] <dmick> perhaps. Personally I bet Ceph recovers the drive quicker than any sw/hw raid
[23:38] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[23:38] <lurbs> Toast the entire RAID array and you're in for big problem, though.
[23:38] <lurbs> s/big/a big/
[23:38] <dmick> typically the RAID answer is "oh, I had a drive failure? RESILVER ALL THE THINGS"
[23:38] * jlogan1 (~Thunderbi@2600:c00:3010:1:1ccf:467e:284:aea8) Quit (Read error: Connection reset by peer)
[23:38] * jlogan (~Thunderbi@2600:c00:3010:1:1ccf:467e:284:aea8) has joined #ceph
[23:39] <lurbs> There's one RAID level. 10. 1's just a special case. ;)
[23:39] <dmick> oh, yeah, and that puts a huge load on the good disks, and screws their I/O response time, which further increases the load, so then they fail
[23:39] * lurbs doesn't touch the others.
[23:39] <dmick> not that I'm bitter about RAID or anything. :)
[23:39] <rweeks> every RAID has some kind of issue.
[23:39] <rweeks> hehe dmick
[23:39] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[23:40] <rweeks> RAID is fine
[23:40] * MapspaM is now known as SpamapS
[23:40] <rweeks> but for large scale storage I believe (and I think others here do too) that it doesn't scale well
[23:40] <dmick> and many RAID implementations leave a lot to be desired
[23:40] <rweeks> there is also that.
[23:41] <rweeks> not to mention RAID rebuild times with 4TB drives in large RAID groups
[23:41] <dmick> it's one of the more devil-is-in-the-details areas of computing
[23:41] <plut0> looks like you can offline an hdd, i'm guessing that power indicator light would go off at that point
[23:41] <dmick> plut0: if it has one, I wouldn't bet that it would go off
[23:42] <plut0> hmm
[23:42] <dmick> labeling with serial number is certainly the most foolproof and least dependent on other facilities
[23:42] * Tamil (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[23:42] <dmick> "dd with a pattern and watch activity LED" is not a bad solution, really
[23:42] <dmick> if you haven't done serials
[23:43] <rweeks> some chassis have capabilities to send a "blink" command to the LEDs on drives
[23:43] * Tamil (~Adium@38.122.20.226) has joined #ceph
[23:44] <nhm> rweeks: I had some ideas about trying to use it to send messages to people in the datacenter.
[23:44] <rweeks> hehe
[23:44] <rweeks> morse code?
[23:44] <plut0> nhm: lol yeah we wanted to setup a light show for datacenter tours
[23:44] <rweeks> does that supermicro chassis you have do that?
[23:44] <dmick> I've done things like that with Dell's LCD text display
[23:45] <nhm> rweeks: I have no idea honestly, I didn't look. :)
[23:45] <lurbs> You could shout at the drive(s) and see where your IO goes to shit: http://www.youtube.com/watch?v=tDacjrSCeq4
[23:45] <dmick> lurbs: ah, brendan
[23:45] * rweeks grins
[23:45] <nhm> lurbs: I remember that video. :)
[23:45] <rweeks> http://www.globalnerdy.com/wordpress/wp-content/uploads/2008/09/grandpa_simpson_yelling_at_cloud.jpg
[23:46] <dmick> actually I've wondered if gentle tapping on the drive case would affect throughput enough to sense
[23:46] <dmick> rweeks: '....storage'
[23:46] <rweeks> yes. :)
[23:46] <dmick> http://serverfault.com/questions/64239/physically-identify-the-failed-hard-drive, anyway
[23:47] <nhm> It would be rather hilarious if all of my performance testing were reduced down to how loud the datacenter is.
[23:47] <dmick> wonder if datacenter throughput graphs can be used to help plot earthquakes
[23:48] <nhm> dmick: that sounds like a federally funded grant if I've ever heard one.
[23:48] <dmick> right? and some of those huge DCs are in otherwise-unpopulated areas. could be valuable data
[23:48] <rweeks> let's apply!
[23:48] <rweeks> let's see
[23:48] <dmick> Amazon's probably already done it
[23:48] <rweeks> our grant process is to take nhm and get him really drunk, right?
[23:49] <dmick> O dpm
[23:49] <dmick> argh. I don't know but I support that methodology in general
[23:49] * xiaoxi (~xiaoxiche@134.134.139.74) Quit (Ping timeout: 480 seconds)
[23:49] <nhm> rweeks: hey, last time I was less drunk than sage.
[23:49] <rweeks> fair enough
[23:50] <Qten> mmm beer
[23:51] <Qten> be it not even 9am gotta be after 5 somewhere right?
[23:51] <nhm> I can't believe I went from a pit bull concert in vegas immediately to reading an RFP for exascale storage proposals at like 4am.
[23:51] <dmick> oh man that was a weekend
[23:52] <dmick> and my workload was far less than nhm's
[23:52] <nhm> dmick: Certainly was a blast. :)
[23:53] * fmarchand2 (~fmarchand@212.51.173.12) has joined #ceph
[23:53] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[23:53] <rweeks> why were you in vegas, anyway?
[23:54] <nhm> rweeks: Inktank launch party
[23:54] <dmick> Inktank company launch party at ... Interop?
[23:54] <rweeks> ahhh
[23:54] <fmarchand2> I'm back ! what is the mds mem max setting ? does it limit the mds process's memory ?
[23:54] <nhm> I think that one is going on the resume "Able to complete government RFPs while drunk and hungover in Vegas".
[23:56] <fmarchand2> if yes, and the default value is 1GB, why could the mds take more memory if you don't change the default value ?
[23:58] <fmarchand2> gregaf : maybe you already answered this question :)
[23:58] <nhm> fmarchand2: this might be useful: http://www.spinics.net/lists/ceph-devel/msg09173.html
[23:59] <plut0> how large does the osd journal drive need to be?
[23:59] * PerlStalker (~PerlStalk@72.166.192.70) Quit (Quit: happy thanksgiving)
[23:59] <gregaf> fmarchand2: the mds max mem setting doesn't do anything, unfortunately
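(The knob that does have an effect on MDS memory in this era is the inode cache limit; a hedged sketch, worth checking against the thread nhm linked:)
    [mds]
            mds cache size = 100000    # max inodes the MDS tries to keep cached; lower it to cap memory growth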

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.