#ceph IRC Log


IRC Log for 2012-03-13

Timestamps are in GMT/BST.

[0:07] * BManojlovic (~steki@212.200.240.216) Quit (Remote host closed the connection)
[0:07] <elder> sagewk, done with all my committing and posting for review. Now I'm back to trying to run a teuthology test on the plana nodes. I have resubmitted my test now that I've updated my wip-master branch. Commit id d3f27f58df.
[0:07] <elder> I don't know if you rebooted my three machines before, but they appear to be running the same kernels as before.
[0:07] <elder> We'll see if this time is any different.
[0:17] <elder> 'failed to install new kernel version within timeout'
[0:20] <elder> Kernel, etc. appear to be set up on all three machines, and it does appear they rebooted, they just didn't boot into my kernel.
[0:20] <elder> I'm not familiar with the funky grub config that's in use. Have to look into that...
[0:21] <elder> Off to eat and maybe even do other stuff though.
[0:31] <joshd> elder, sagewk: only difference with oneiric that I see is that the new kernel is in a grub submenu
[0:32] <joshd> elder, sagewk: that seems to be the problem: http://www.gnu.org/software/grub/manual/html_node/default.html
[0:33] <sagewk> joshd: aha
[0:38] <joshd> hmm, no clean way to solve it - it's all generated by /etc/grub.d/10_kernel, with no dependence on env vars or anything
[0:44] <dmick> joshd: I think the preferred configuration is to specify a default in /etc/default/grub; what machine are you looking at?
[0:45] <dmick> (but I think there's some stuff going on with extra files inserted in /etc/grub.d by the teuthology install)
[0:46] <joshd> dmick: yeah, that's what we do, the issue is that submenus are treated as a different namespace when setting the default
[0:47] <joshd> dmick: by default oneiric puts only the newest version at the top level, and the rest in a submenu
[0:47] <dmick> right
[0:47] <dmick> newest, or "most-recently-installed", maybe
[0:47] <joshd> dmick: so I'm disabling that behavior so we don't have to worry about what "newest" is
[0:48] <dmick> I wonder if it needs the submenu specifiers if you use a string...
[0:48] <dmick> GRUB_DEFAULT='Example GNU/Linux distribution'
[0:49] <dmick> I guess their example is using a string in the submenu tree, so probably
[0:49] <joshd> nope, we were using a string already too
[0:49] <dmick> ok
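A minimal sketch of the workaround joshd describes, as /etc/default/grub settings; the kernel title and submenu name here are hypothetical, and the submenu-path syntax follows the GNU manual linked above:

    # Either address the nested entry with the submenu>entry form...
    #GRUB_DEFAULT='Previous Linux versions>Ubuntu, with Linux 3.2.0-ceph'
    # ...or disable submenu generation so every kernel stays at the top
    # level and a plain title string works as the default again:
    GRUB_DISABLE_SUBMENU=y
    GRUB_DEFAULT='Ubuntu, with Linux 3.2.0-ceph'

Run update-grub afterwards to regenerate grub.cfg.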
[1:18] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Read error: Operation timed out)
[1:28] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:35] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[1:38] <elder> So is the problem that old grub follows a different convention from new grub? Is that a problem because we need to support both?
[1:39] <elder> (Sorry, leaving again for a couple hours...)
[1:41] <joshd> elder: problem is with grub configuration - natty didn't use submenus, but I've worked around it
[1:41] <joshd> elder: just put the chef task first, and they should reboot into the correct kernel
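In job-yaml terms, joshd's suggestion amounts to ordering the tasks list so chef runs before ceph; a rough sketch, where the roles stanza is hypothetical and the sha1 is the commit elder mentioned above:

    roles:
    - [mon.0, mds.0, osd.0]
    - [mon.1, osd.1]
    - [mon.2, client.0]
    kernel:
      sha1: d3f27f58df
    tasks:
    - chef:
    - ceph: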
[1:47] * bchrisman (~Adium@108.60.121.114) Quit (Quit: Leaving.)
[2:00] <nhm> ok, I'm thinking the radosgw issue I ran into is probably due to the chef recipe being written for squeeze. Tomorrow I'll look into modifying it to support more distros.
[2:01] * eternaleye____ (~eternaley@tchaikovsky.exherbo.org) has joined #ceph
[2:01] * eternaleye___ (~eternaley@195.215.30.181) Quit (Read error: Connection reset by peer)
[2:13] * jksM (jks@193.189.93.254) has joined #ceph
[2:17] * jantje (~jan@paranoid.nl) Quit (Ping timeout: 480 seconds)
[2:17] * jantje (~jan@paranoid.nl) has joined #ceph
[2:19] * jks (jks@193.189.93.254) Quit (Ping timeout: 480 seconds)
[2:22] * joao (~JL@ace.ops.newdream.net) Quit (Ping timeout: 480 seconds)
[2:26] * MarkN (~nathan@142.208.70.115.static.exetel.com.au) has joined #ceph
[2:30] * joao (~JL@89.181.145.13) has joined #ceph
[2:42] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:02] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[3:10] * jefferai (~quassel@quassel.jefferai.org) has joined #ceph
[3:16] * joshd (~joshd@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:23] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[3:41] * yoshi_ (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[3:47] * yoshi (~yoshi@p8031-ipngn2701marunouchi.tokyo.ocn.ne.jp) Quit (Ping timeout: 480 seconds)
[3:50] * ghaskins (~ghaskins@68-116-192-32.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:52] <dmick> nhm: I looked at that too, and I don't think 1) it's being built for other distros yet, or 2) that the chef recipe is really what it needs to be; and 3) there doesn't currently seem to be a way to tie in a custom chef based on client type. There should probably be some design discussion about what the plan is there (hopefully I can bend Tv's ear tomorrow)
[3:53] * dmick (~dmick@aon.hq.newdream.net) Quit (Quit: Leaving.)
[3:57] * joao (~JL@89.181.145.13) Quit (Ping timeout: 480 seconds)
[3:58] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) has joined #ceph
[3:59] * adjohn (~adjohn@50-0-92-115.dsl.dynamic.sonic.net) Quit ()
[4:07] * chutzpah (~chutz@216.174.109.254) Quit (Quit: Leaving)
[4:39] * The_Bishop (~bishop@178-17-163-220.static-host.net) Quit (Quit: Who the hell is this Peer? If I catch him I'll reset his connection!)
[4:44] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[4:44] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[4:45] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[4:59] * renzhi (~xp@64.120.141.42) has joined #ceph
[5:01] <renzhi> Anyone have a real use case for Ceph and willing to share?
[5:02] <renzhi> thanks
[5:43] * renzhi (~xp@64.120.141.42) Quit (Ping timeout: 480 seconds)
[5:54] * renzhi (~xp@raq2064.uk2.net) has joined #ceph
[6:05] <iggy> renzhi: a few people are using rados/rbd with kvm already i think
[6:06] <renzhi> iggy: in production?
[6:06] <renzhi> or some serious testing?
[6:06] <iggy> afaik, yeah
[6:07] <iggy> production that is
[6:07] <renzhi> iggy: that's cool, I thought it was still in beta, with the big warning sign not to use it except for reviewing or benchmarking
[6:08] <renzhi> Do you know how large the storage volume is for that project?
[6:08] <iggy> that's the fs
[6:08] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[6:09] <iggy> the layers underneath are simple enough that they are more mature
[6:09] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:09] <iggy> rbd is below the fs layer
[6:09] <renzhi> ok
[6:10] <iggy> and no, I'm not sure how much data those users have
[6:10] <renzhi> we are trying to size up ceph for a project that has 10 million files, and some are really large
[6:10] <renzhi> and it's growing
[6:11] <iggy> that's significant
[6:11] <iggy> you should pop in during the US afternoon
[6:12] <iggy> more people can speak to the FS's abilities
[6:12] <iggy> how large are we talking?
[6:12] <renzhi> each file?
[6:12] <iggy> i mean single files
[6:13] <renzhi> Right now, we limit it to 20GB
[6:13] <iggy> oh, that's nothing
[6:13] <renzhi> but we want to remove that limit
[6:13] <renzhi> however, the majority of the files are around 5MB to 10MB
[6:14] <iggy> I'm trying to get work to look into it for our geophysics files
[6:14] <renzhi> oh, that's large
[6:14] <iggy> we have some files that are 2T+
[6:15] <renzhi> yeah
[6:15] <renzhi> are you using ceph now?
[6:15] <iggy> right now they are all on an isilon accessed via nfs :/
[6:15] <renzhi> :)
[6:15] <renzhi> we are considering gluster and ceph now
[6:16] <renzhi> but gluster looks much more mature
[6:16] <iggy> sucks to say, but RH is backing gluster... fwiw
[6:16] <iggy> they've got about a 3 yr head start on ceph
[6:18] <renzhi> yeah
[6:18] <renzhi> I really like ceph features though
[6:19] <iggy> yeah, from what I've dug through, it's better engineered
[6:20] <iggy> hopefully that's enough
[6:20] <renzhi> how far is your evaluation of ceph, so far?
[6:21] <renzhi> We started to look at Ceph last year, but put it aside due to the warning, and we look again this year
[6:24] <iggy> I've been following it for a few years
[6:25] <iggy> i was originally thinking of using nfs/virtfs with VMs
[6:25] <iggy> but rbd is just as good and easier to maintain I think
[6:25] <renzhi> so you are an old timer of ceph
[6:26] <iggy> just as good from the standpoint of space savings
[6:26] <renzhi> good to know
[6:26] <iggy> if there is such a thing
[6:26] <renzhi> I'm wondering if DreadHost is using it for something, internally
[6:26] <renzhi> err... Dreamhost
[6:26] <iggy> i think i may have been one of the first people to add stuff to the wiki
[6:27] <renzhi> really? cool
[6:27] <iggy> i don't think so yet
[6:27] <iggy> but that is their goal afaik
[6:28] <renzhi> can you comment on the fs part, how mature is it now?
[6:29] <iggy> I'm pretty sure the devs still consider it beta
[6:34] <renzhi> have you heard of any roadmap, when are they going to get out of beta?
[6:34] <iggy> I haven't really :/
[6:35] <renzhi> ok
[6:55] * renzhi (~xp@raq2064.uk2.net) Quit (Ping timeout: 480 seconds)
[6:59] * cattelan_away is now known as cattelan_away_away
[7:01] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[7:05] * renzhi (~xp@203.156.242.34) has joined #ceph
[7:08] <wonko_be> there is a roadmap: http://tracker.newdream.net/projects/ceph/roadmap
[7:08] <wonko_be> afaik the FS is still considered unstable, and they are not putting effort in it for now, the core (object store) must be stable first
[7:12] <renzhi> wonko_be: thanks
[8:00] * LarsFronius (~LarsFroni@f054106023.adsl.alicedsl.de) has joined #ceph
[8:02] * renzhi (~xp@203.156.242.34) Quit (Ping timeout: 480 seconds)
[8:19] * LarsFronius (~LarsFroni@f054106023.adsl.alicedsl.de) Quit (Quit: LarsFronius)
[8:19] * renzhi (~xp@203.156.242.34) has joined #ceph
[8:48] * renzhi (~xp@203.156.242.34) Quit (Ping timeout: 480 seconds)
[9:17] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[9:17] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:21] * tnt_ (~tnt@55.189-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:33] * tnt_ (~tnt@212-166-48-236.win.be) has joined #ceph
[9:45] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:48] * The_Bishop (~bishop@178-17-163-220.static-host.net) has joined #ceph
[10:02] * yoshi_ (~yoshi@p1062-ipngn1901marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[11:07] <wonko_be> can you specify the preferred number of pg's at mkcephfs creation - say I know I will eventually grow my cluster to 100 nodes, but I start with 10 now, and will add the others next week...
[11:08] <wonko_be> (as pg-splitting is apparently not yet available)
[11:23] * joao (~JL@89.181.145.13) has joined #ceph
[11:38] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) has joined #ceph
[11:42] * LarsFronius_ (~LarsFroni@testing78.jimdo-server.com) has joined #ceph
[11:42] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Read error: Connection reset by peer)
[11:42] * LarsFronius_ is now known as LarsFronius
[11:57] <NaioN> wonko_be: you could create a new pool (ceph osd pool create POOL [pg_num [pgp_num]])
[11:57] <NaioN> for RBDs it's easy to put them in a different pool
[12:11] <wonko_be> NaioN: thanks, but I want it for the defaults too
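Two hedged options, given that PG splitting isn't available yet: size new pools explicitly at creation time, and pre-size the default pools via the pg-bits settings consulted when mkcephfs builds the initial map (pool name hypothetical; option names as I recall them for this era, so verify against your version):

    # explicit sizing for a new pool, per NaioN's suggestion:
    ceph osd pool create ys-data 4096 4096

    # in ceph.conf before running mkcephfs; initial PGs for the
    # default pools scale roughly as num_osds << pg_bits:
    [osd]
        osd pg bits = 7
        osd pgp bits = 7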
[12:22] * stxShadow (~jens@p4FFFEB27.dip.t-dialin.net) has joined #ceph
[12:46] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[12:46] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[13:00] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[13:04] * mtk (~mtk@ool-44c35967.dyn.optonline.net) has joined #ceph
[13:07] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[13:20] <jefferai> I've been going through all the wiki pages and I'm a bit lost in regards to especially http://ceph.newdream.net/wiki/Designing_a_cluster
[13:20] <jefferai> Partly what I'm trying to understand are hardware requirements, or at least best practices
[13:21] <jefferai> I assume I'm right in thinking that all the Ceph stuff would sit in a layer above RAID, right?
[13:22] <stxShadow> above filesystem :)
[13:22] <jefferai> So I'd have a RAID setup, put btrfs on top, and then create the OSD stuff on top of that
[13:23] <stxShadow> right
[13:23] <jefferai> and I guess I could actually put it on top of btrfs on top of LVM on top of raid...?
[13:23] <jefferai> so that I could grow the journal size as necessary?
[13:24] <jefferai> and the OSD size itself
[13:24] <stxShadow> yes ... that should work .... on most setups -> the journal is on a separate disk, or better an ssd
[13:25] <jefferai> I'd have the journal on an SSD
[13:25] <jefferai> but I guess I'm thinking I'd have a RAID-10 of SSDs with LVM on top
[13:25] <jefferai> I assume that should be fast enough
[13:25] <jefferai> even with LVM overhead
[13:26] <stxShadow> i think that should do it :)
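A minimal sketch of the layering being discussed, as a ceph.conf fragment; the hostname, paths, and sizes are hypothetical, with osd data on a btrfs filesystem built atop LVM/RAID and the journal on a logical volume carved from the SSD array:

    [osd.0]
        host = storage1
        ; btrfs on LVM on md RAID
        osd data = /srv/osd.0
        ; LV on the SSD RAID-10, growable via LVM if needed
        osd journal = /dev/vg_ssd/osd0-journal
        osd journal size = 1000  ; MB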
[13:26] <jefferai> so for OSD and MDS it says "lots and lots and lots of RAM"
[13:26] <jefferai> which is...how much, exactly?
[13:26] <jefferai> and how fast does the CPU need to be?
[13:26] <stxShadow> 10488 root 20 0 13.8g 10g 4156 S 38 44.9 2504:22 ceph-osd
[13:26] <jefferai> ah
[13:26] <stxShadow> -> one osd process on one of my nodes
[13:27] <stxShadow> osd size is 8 TB each
[13:27] <jefferai> hm
[13:27] <jefferai> I was thinking of 24GB in each box, but the boxes themselves are going to have something like 50 or 60TB storage
[13:28] <jefferai> and that's not MDS, that's just OSD, right?
[13:28] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[13:28] <stxShadow> yes ..... MDS is on another server in my setup
[13:28] <stxShadow> -> 3 MDS + 3 Mons
[13:29] <jefferai> Hm
[13:29] <jefferai> I was planning on building two storage boxes, for redundancy, and four boxes to host lots of VMs
[13:30] <jefferai> I could put MDS on other boxes but then it'd have to compete with VMs for CPU/RAM
[13:33] <jefferai> stxShadow: OK, so one possibility could be to put OSD on my storage nodes, and MDS/Mon on my other nodes
[13:34] <jefferai> does the amount of RAM needed for OSD scale linearly with the amount of storage?
[13:47] <nhm> jefferai: Hi! Part of the reason those docs are still a little ambiguous is that we are still doing testing to try and find what the optimal setup is.
[13:56] <jefferai> ah
[13:56] <jefferai> nhm: hi! :-)
[13:56] <jefferai> nhm: for background -- I
[13:56] <jefferai> er
[13:56] <jefferai> nhm: for background -- I'm planning on having two storage boxes serving up iSCSI over 10GbE to (to start with) 4 clients
[13:56] <jefferai> Originally I was going to go with DRBD in active/standby, but that loses half of the available bandwidth etc.
[13:56] <jefferai> some DRBD guys told me active/active has a lot of caveats, and I also wasn't feeling fantastic about a lot of the other clustered filesystem options and their interactions with KVM
[13:56] <jefferai> Not that I'm sure I should feel great about Ceph, but I like it in principle :-)
[13:56] <jefferai> So I'd planned on 24GB RAM and 1 6- or 8-core Xeon in each storage box, and 128GB RAM in each VM box
[13:56] <jefferai> and what I'm trying to figure out is a) if Ceph will run happily on all this, or if I'm going to eat into too much RAM (I could increase the amount on the storage boxes), b) if Ceph is stable enough to handle this, and other questions TBD :-)
[13:58] <Azrael> jefferai: have you setup ceph yet? do you find it stable enough for production?
[13:59] <jefferai> Azrael: nope, don't even have the boxes yet
[13:59] <nhm> jefferai: Ok, first, remember that Ceph itself is still not ready for production use. Rados and rgw are getting closer. If you want to use Ceph proper, make sure to do some testing first. :)
[13:59] <jefferai> nhm: how's rbd?
[14:00] <nhm> jefferai: getting close. There's a couple of folks in this channel that have been using it for a while.
[14:01] <jefferai> I'm generally not sure how to take "not ready for production"...I've been using LXC with great results for two years, despite it being not ready for production :-)
[14:01] * Azrael shudders @ lxc
[14:02] <jefferai> Azrael: I dealt with OpenVZ for two years before that...LXC has its warts, but at least it didn't require a RHEL kernel
[14:02] <nhm> jefferai: Right now we are looking at 2U boxes (ie <=12 drives) as the first platform we are going to do some more extensive performance testing on. I'd like to get a larger node in to test eventually, but for now we won't have a whole lot of internal data on a setup like that.
[14:02] <jefferai> Hm.
[14:02] <Azrael> are you guys making use of btrfs or xfs or ext3/4 on your osd's?
[14:03] <jefferai> I don't need to put the boxes into production immediately; I could possibly help with benchmarking a larger setup
[14:03] <jefferai> if I have some handholding on the hows
[14:03] <jefferai> Because the hardware is likely not going to change regardless of whether I go with Ceph or DRBD (actually, even if I do go with Ceph I'm likely to use DRBD for some other bits)
[14:04] <jefferai> nhm: does RAM usage scale linearly with amount of storage?
[14:06] <nhm> jefferai: To be honest I don't know yet. I started on the team a little over a week ago. Previously I worked at a Supercomputing Institute working with Lustre.
[14:06] <jefferai> Oh
[14:06] <jefferai> hah
[14:06] <jefferai> :-)
[14:06] <nhm> jeffhung: :)
[14:07] <jefferai> What are your thoughts on Lustre/Gluster/etc?
[14:09] <nhm> jefferai: Lustre is the fastest option out there at the moment, but it's rather fragile, and I think it's going to be a lot of work to fix some of the problems it has. It will be interesting to see what happens with Gluster now that Red Hat is backing it. I'm friends with Jeff Darcy over there and I think he's got some big plans.
[14:10] <jefferai> Gluster was the other option that I was considering, it has a lot of nice features -- quota, CIFS/NFS export, etc
[14:10] <jefferai> although for me to take good advantage of all of it I'd really need LDAP integration, without having to provision 4k user accounts on each box
[14:10] <stxShadow> hmmm ..... the ram usage rises every time scrubbing runs over
[14:11] <nhm> jefferai: Yeah, we ran it briefly at the Institute. We ended up going with Lustre due to some problems that may or may not have been gluster's fault when trying to have 8000 concurrent writers.
[14:11] <jefferai> nhm: so why Ceph?
[14:12] <nhm> jefferai: For me, crush. It's extremely clever, and I think that if we do a good job with the implementation it will scale better than anything else out there.
[14:13] <nhm> jefferai: The potential for an open source, distributed file system that replicates well, scales well, and has even moderately good per-osd performance is kind of the holy grail.
[14:13] <jefferai> yep
[14:13] <jefferai> I echo that -- Ceph to me seems very clever, and I really *want* it to work
[14:14] <jefferai> Like I said -- I think I could help benchmark it on a larger box, but I do need some guidance on what kind of resources I should be looking at getting
[14:14] <jefferai> is there someone specific I should talk to?
[14:14] <nhm> jefferai: yeah, we have a lot of impatient people (including me!) that want it to be ready NOW. ;)
[14:15] <jefferai> Well, for me it's more like -- I'm probably not going to change everything around once it's in production for 3, 4 years
[14:15] <jefferai> so I'd *like* to pick something that I think will give me great benefits in that time
[14:15] <stxShadow> i hope the rbd part is ready very soon :)
[14:15] <nhm> jefferai: Well, some of the core developers usually show up in channel in about 3-4 hours (they are all out on the west coast).
[14:16] <jefferai> okay
[14:16] <jefferai> I actually need to head out soon, but I can certainly be back later
[14:16] <nhm> jefferai: I may have more for you in a week or two. We just deployed our new test cluster so everything is a bit chaotic at the moment.
[14:16] <jefferai> sure
[14:16] <jefferai> nhm: anyone specific you recommend I get in touch with?
[14:17] <jefferai> or maybe I'll poke you in a few hours and ask then :-)
[14:18] <nhm> jefferai: Yeah, I'd just poke around in a couple of hours. Sage is the head honcho and is in here often, but someone else may have something to add too.
[14:18] <jefferai> Okay
[14:18] <jefferai> Will do
[14:19] <jefferai> thanks a lot!
[14:22] <nhm> no problem, sorry I didn't have more answers. :)
[14:22] <jefferai> :-)
[14:24] <stxShadow> anyone here who tested the rbd snapshot feature successfully ? Is a live rollback possible ?
[14:26] * gmax (~Adium@64-126-49-62.dyn.everestkc.net) has joined #ceph
[14:29] <Azrael> hmmm
[14:29] <Azrael> the snapshotting
[14:29] <Azrael> is that done via btrfs on osd?
[14:29] <stxShadow> we use xfs :)
[14:29] <Azrael> or is it that the osd will make use of btrfs snapshots if btrfs is there... otherwise it does it manually with xfs?
[14:29] <stxShadow> snapshotting itself works
[14:29] <Azrael> yeah i'm staying away from btrfs for now
[14:30] <stxShadow> rollback works too
[14:30] <stxShadow> but live rollback crashes the vm
[14:30] <Azrael> hmm
[14:30] <Azrael> immediately?
[14:30] <Azrael> the vm may not like its filesystem going silly all the sudden
[14:31] <stxShadow> yes .... immediately !
[14:31] <Azrael> kvm or xen?
[14:31] <stxShadow> i would love something like the "savevm" feature in kvm
[14:32] <stxShadow> -> KVM
[14:32] <Azrael> ha well there could be any number of issues then, non ceph related, causing kvm crashes :-D
[14:32] * cattelan_away_away is now known as cattelan_away
[14:33] <stxShadow> yes .... maybe the filesystem notify isn't working properly
[14:33] <stxShadow> qemu-rbd - Feature #699: support snapshot notify --> was solved a long time ago
[14:34] <stxShadow> but we use kvm 1.0
[14:51] * d405 (~nobody@un.interestingsh.it) has joined #ceph
[15:03] * oliver1 (~oliver@p4FFFEB27.dip.t-dialin.net) has joined #ceph
[15:05] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:09] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:09] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:12] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[16:53] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:00] * guilhem1 (~spectrum@sd-20098.dedibox.fr) has joined #ceph
[17:00] <sagewk> elder: ping?
[17:04] <guilhem1> hi all
[17:04] <guilhem1> I'm trying to deploy a radosgw infrastructure. All is OK for the moment (MON, OSD, RGW, apache, nginx etc).
[17:04] <guilhem1> But to be more "clean" I want to use dedicated rados pools for my buckets (because some of them need more replication than others).
[17:04] <guilhem1> Is this a good approach? More problematic, I don't know how to link a bucket to a pool, or the "exact" way to change the rule for an added pool.
[17:06] <guilhem1> (for information: I've done some work on the ceph chef-cookbook, I've made a pull request on github and I will do more work soon)
[17:06] <sagewk> you mean a pool per bucket?
[17:08] <sagewk> elder: merging testing branch with 3.3-rc6 has conflicts..
[17:09] <guilhem1> sagewk, yes, or 1 pool for 2-3 buckets and another for 1 bucket
[17:09] <stxShadow> hmmm .... is one of the osd's kind of "primary" for the whole cluster ? .... our first node is consuming a lot more ram than the others
[17:09] <sagewk> that'll work for a small number of buckets. it won't work for large numbers of pools.
[17:10] <sagewk> radosgw pool add/rm are what you're looking for. i'm not sure how you control which one is the default for new buckets... yehudasa_?
[17:10] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[17:10] * MarkDude (~MT@c-71-198-138-155.hsd1.ca.comcast.net) has joined #ceph
[17:11] <guilhem1> I do "radosgw pool add"
[17:11] <guilhem1> # radosgw-admin pools list
[17:11] <guilhem1> [
[17:11] <guilhem1> { "name": ".rgw.buckets"},
[17:11] <guilhem1> { "name": "ys-streaming"}]
[17:11] <guilhem1> but for now... I don't know how to control buckets
[17:12] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[17:15] <stxShadow> the memory consumption rises very quickly if the cluster is "scrubbing"
[17:15] <stxShadow> but only on the first node
[17:16] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) has joined #ceph
[17:17] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) has joined #ceph
[17:23] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[17:23] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) Quit (Quit: Leaving.)
[17:24] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) has joined #ceph
[17:24] <jefferai> sagewk: hey there, nhm pointed me your way
[17:24] <sagewk> jefferai: hi
[17:25] <jefferai> I'm pretty interested in ceph for RBD for KVM nodes and radosgw, but I was having some trouble figuring out how I might slot it into my probable machine configuration, and what the actual system requirements might be like
[17:25] <jefferai> in the same vein, I might be able to help with some large system benchmarking
[17:27] * joao (~JL@89.181.145.13) Quit (Ping timeout: 480 seconds)
[17:27] <yehudasa_> guilhem1: currently you can't control which bucket goes to which pool
[17:27] <jefferai> I'm likely to have two storage nodes and four client systems
[17:27] <jefferai> stxShadow said that his 7TB OSD node was using 14GB RAM, and I'm wondering if that scales linearly
[17:28] <jefferai> I could also use some guidance on exactly how much "lots and lots and lots of RAM" MDS needs
[17:28] <yehudasa_> guilhem1: If you have a single pool that you want all your buckets to go into, you can remove the other one that you don't need by 'radosgw-admin pool rm'
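Putting yehudasa_'s advice next to guilhem1's listing, a hedged sequence (flag spelling may differ by version; the pool names are the ones from the listing above):

    rados mkpool ys-streaming                   # backing rados pool
    radosgw-admin pool add --pool=ys-streaming  # let rgw place buckets there
    radosgw-admin pool rm --pool=.rgw.buckets   # drop the pool you don't want
    radosgw-admin pools list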
[17:28] <sagewk> jefferai: that sounds like a lot, actually.. was that ceph-osd RSS?
[17:28] <sagewk> jefferai: the more the better, but in your case you don't need the mds for radosgw and rbd
[17:29] <jefferai> well -- ideally I'd like to eventually use CephFS, but I hear it's far less stable than the other bits
[17:29] <sagewk> jefferai: you do need to have a third node running ceph-mon, tho, to mediate failures in the otherwise 2- node cluster.
[17:30] <jefferai> So the box I'm thinking of building I was thinking of kitting out with 32GB RAM, an 8-core CPU, and about 70TB storage (this is the OSD nodes)
[17:30] <jefferai> I could go up to 16 cores and 128GB RAM, but that brings the price up a decent amount
[17:32] <jefferai> on the compute nodes I will have 128GB RAM, but I was hoping most of that would be going to VMs
[17:32] <jefferai> so if MDS uses tons and tons and tons of RAM and eats widely into that...
[17:33] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) Quit (Quit: Leaving.)
[17:34] <guilhem1> yehudasa_ : I can do this for the moment, but it would be nice to have this feature: select one pool by default and be able to create a bucket in another pool.
[17:34] <wido> jefferai: Why a machine with so many OSD's? That defeats the point of distributed storage, doesn't it?
[17:34] <jefferai> so many OSDs?
[17:34] <wido> I wouldn't want to lose that machine, 70TB of data gone
[17:35] <wido> 35 OSD's on one machine I think? 70TB / 2TB disks
[17:35] <jefferai> wido: isn't that the point, that if one of the machines goes down the other is still going?
[17:35] <nhm> wido: The lustre nodes we had were built like that. Primarily a cost saving issue since each node was QDR IB connected.
[17:35] <jefferai> wido: why would I do that?
[17:36] * joao (~JL@ace.ops.newdream.net) has joined #ceph
[17:36] <wido> jefferai: If you have a single machine with 70TB worth of storage and it gets a kernel panic, corrupted btrfs filesystem or whatsoever and goes down
[17:36] <jefferai> yeah
[17:36] <wido> you'll lose so much storage
[17:36] <jefferai> I don't follow
[17:36] <jefferai> isn't that the point Ceph and other clustered file systems?
[17:36] <jefferai> they can handle losing a node or two?
[17:37] <yehudasa_> guilhem1: I opened feature #2169
[17:37] <wido> jefferai: Yes, but how many nodes are you talking about?
[17:37] <jefferai> two
[17:37] <wido> Yes, so you have two machines with each 70TB worth of storage?
[17:37] <jefferai> yes
[17:37] * tnt_ (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:37] <nhm> jefferai: he's wondering if it makes sense to put so much storage in a single node because of how much impact losing 1 node out of 4 will have in terms of replication and general mayhem vs say, 1 node out of 12.
[17:37] <wido> nhm: exactly
[17:38] <wido> I'd rather have 7 machines with 10TB of storage each, so the impact of losing a single machine will be less painful
[17:38] <jefferai> I don't have the luxury of being able to have 7 or 12 machines
[17:38] <stxShadow> hmm ..... one of my osds crashed a few minutes ago with:
[17:38] <nhm> wido: yeah, it's just a little more expensive.
[17:38] <stxShadow> 2012-03-13 16:45:49.734680 1: (SafeTimer::timer_thread()+0x33b) [0x6745bb]
[17:38] <stxShadow> 2012-03-13 16:45:49.734688 2: (SafeTimerThread::entry()+0xd) [0x676f6d]
[17:38] <stxShadow> 2012-03-13 16:45:49.734704 3: (()+0x68ca) [0x7fd5273768ca]
[17:38] <stxShadow> 2012-03-13 16:45:49.734712 4: (clone()+0x6d) [0x7fd5259fa86d]
[17:38] <stxShadow> 2012-03-13 16:45:49.734717 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)' thread 7fd519ff5700 time 2012-03-13 16:45:49.734743
[17:38] <stxShadow> os/FileStore.cc: 2992: FAILED assert(0)
[17:38] <nhm> wido: or a lot more depending on the nodes...
[17:38] <guilhem1> yehudasa_: nice, I will watch it
[17:38] <wido> nhm: Yes, indeed. But I thought it was worth advising :)
[17:38] <jefferai> wido: I still don't understand what you're advising
[17:39] <wido> jefferai: Ok, one sec
[17:39] <jefferai> I have two nodes, and I was going to be using DRBD to have one act as a failover for the other
[17:39] <jefferai> I'm considering whether Ceph could be a better alternative
[17:39] * joshd (~joshd@aon.hq.newdream.net) has joined #ceph
[17:39] <jefferai> since its cluster mechanics mean nice things in terms of running KVM
[17:39] <wido> jefferai: With two nodes, I guess not?
[17:39] * aliguori (~anthony@32.97.110.59) has joined #ceph
[17:40] <jefferai> wido: except that I could use both nodes at once
[17:40] <jefferai> instead of active/standby
[17:40] <jefferai> (the drbd guys seemed down on the idea of active/active)
[17:40] <wido> jefferai: Yes, but if one node goes down, you lose 50% of your capacity and that will put pressure on RADOS
[17:40] <jefferai> wido: ceph stripes?
[17:41] <jefferai> I figured that in a two-node setup, I'd have the same amount of storage as on one node, but replicated
[17:41] <wido> Yes, it does. But it's a distributed system which you should run on multiple (>2) machines to take full advantage of the replication of data
[17:41] <nhm> wido: the other issue is density. You can get 60 drives in 4U with the right chassis. a couple petabytes of 12 drive 2U boxes is going to take up a lot of space.
[17:42] <jefferai> nhm: bingo
[17:42] <jefferai> I have very limited space
[17:42] <wido> nhm: Yes, density might be an issue
[17:42] <wido> I'm running 1U nodes with 4 disks, just to keep the impact of losing a single node low
[17:43] <wido> No redundant power or fancy hardware, just simple 1U boxes running 4 OSD's each
[17:44] <nhm> wido: One of the neat things about ceph is that it makes deployments like that feasible vs other distributed filesystems that expect you to have a hardware raid sitting underneath.
[17:44] <wido> nhm: Yes, but you have to get your crushmap right
[17:44] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) has joined #ceph
[17:45] <wido> And I still think that having replication set to 2 is dangerous
[17:45] <stxShadow> wido ... why that ?
[17:45] <wido> I would at least want replication set to 3, but to have that run in a safe manner, you need 3 machines
[17:45] <nhm> wido: I think it really depends on the use case...
[17:46] <wido> nhm: Yes, it does
[17:46] * tnt_ (~tnt@87.67.189.55) has joined #ceph
[17:46] <wido> stxShadow: If an object gets corrupted, how do you know which one is the right one?
[17:46] <nhm> wido: For scratch space at my previous job we had no replication and no backups.
[17:46] <wido> Indeed, it depends on the requirements
[17:47] <stxShadow> wido .... as far as i know -> ceph by itself can't make such decisions ?
[17:47] <stxShadow> even if 3 replicas exists ....
[17:47] <stxShadow> or am i wrong ?
[17:48] <wido> stxShadow: No, at this point it doesn't.
[17:48] <jefferai> nhm: actually, I planned on having a software raid underneath
[17:48] <jefferai> which is why I wouldn't have 35 OSDs in one box
[17:48] <wido> but if you do a manual comparison, you can't figure it out either
[17:48] * ^conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 480 seconds)
[17:48] <stxShadow> wido .... good point .... maybe i should change to 3
[17:49] <wido> I have to go, ttyl
[17:49] <stxShadow> cu
[17:49] <nhm> jefferai: I don't think there is anything preventing you from doing that.
[17:49] <jefferai> sure
[17:49] <jefferai> nhm: so I think there are two issues here
[17:49] <jefferai> one is, replication if a node goes down
[17:49] <jefferai> I'll have spare parts on hand, plus each node would have redundant power supplies and redundant network
[17:50] <jefferai> and connected via 10GbE
[17:50] <jefferai> so in the unlikely event it does go down, I can get things back in sync at a theoretical maximum of 2.5GB/s
[17:50] <nhm> jefferai: depending on the raid controller you may be able to switch from a raid setup to independent disks.
[17:50] <jefferai> nhm: raid controller = mdadm
[17:50] <nhm> jefferai: Oh, well then you certainly can do that. ;)
[17:50] <jefferai> what would the benefit of independent disks be?
[17:51] <stxShadow> hmmm .... in my setup -> sync is not using the network bandwidth
[17:52] <stxShadow> but i only got 1 Gbit per Node
[17:52] <jefferai> you mean, not saturating it?
[17:52] <stxShadow> yes
[17:52] <stxShadow> max is 500 Mbit
[17:52] <jefferai> Ah
[17:52] <nhm> jefferai: basically you end up with replication at the ceph level vs at the raid level. With more nodes that means data gets spread across machines rather than just across disks in a given machine. Also, depending on the raid level, it may mean less capacity, but less risk of raid corruption and/or failure of disks during a raid rebuild.
[17:53] <jefferai> nhm: I think I'm happy to do replication at the raid level, inter-node
[17:53] <jefferai> I'm planning on RAID-10
[17:53] <jefferai> with hot spares
[17:53] <jefferai> disk goes down, hot spare takes over, another disk goes down, second hot spare takes over, third disk goes down, raid-10 still happy, fourth goes down, I may be screwed depending on which
[17:54] <jefferai> but I'd have spares to swap in, so unless they all go down at once...
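For concreteness, jefferai's mdadm plan might look like the following sketch; device names and disk counts are hypothetical:

    # 10-disk RAID-10 with two hot spares
    mdadm --create /dev/md0 --level=10 --raid-devices=10 \
          --spare-devices=2 /dev/sd[b-m]
    mkfs.btrfs /dev/md0
    mount /dev/md0 /srv/osd.0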
[17:54] <jefferai> so what I wanted at the ceph level wasn't so much to keep my data protected as to replicate it so that I could take advantage of having multiple boxes active at once
[17:54] <jefferai> so I could have different VM nodes use different iSCSI paths
[17:54] <jefferai> (or actually multipath to both)
[17:55] <nhm> jefferai: I suppose one thing is that if you are doing 2x replication at the ceph level and doing raid 10+hotspares, you are losing a lot of capacity.
[17:55] <jefferai> nhm: yeah, I know
[17:55] <jefferai> but basically I'm going for redundancy
[17:56] <jefferai> I can build a 70TB fully redundant system for less than half of what a single 20TB system would cost me from e.g. equallogic
[17:56] <jefferai> (and in less U)
[17:56] * ^conner (~conner@leo.tuc.noao.edu) has joined #ceph
[17:56] <jefferai> by fully redundant, I mean, including the space lost to RAID-10, and two nodes
[17:57] <jefferai> so I guess if I had the money available, I could look at getting 4 2U systems instead, which give better density overall, but then I could take advantage of Ceph's replication
[17:59] <nhm> jefferai: lots of different ways it could be done. We really don't know yet how 35 OSDs would work in a single system either afaik.
[17:59] * dmick (~dmick@aon.hq.newdream.net) has joined #ceph
[17:59] <jefferai> nhm: but with a software RAID setup underneath, we're talking 1 OSD, or maybe two or three tops
[18:00] <jefferai> it seems like it should work, but what he said about knowing which object is the right one if one gets corrupted has me a bit worried
[18:00] <nhm> jefferai: yep, but 1 OSD that would need to maintain really high throughput...
[18:00] <jefferai> yeah, that's why I was asking about hardware requirments :-)
[18:01] <nhm> jefferai: My guess is that the sweet spot is going to be somewhere between 1 and 35 OSDs, but I really don't know yet. :)
[18:01] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[18:02] <jefferai> nhm: what I was thinking was using RBD with qemu over iSCSI, which would allow me to have multiple boxes serve the data at once
[18:02] <joshd> jefferai: a big benefit of having more nodes is that when one goes down, they recover in parallel (since it's not replicated in mirrored pairs, but by placement groups, which are pseudo-randomly distributed among osds)
[18:02] <jefferai> and then for other stuff using DRBD with normal filesystems
[18:02] <jefferai> for non-VM storage that is, since cephfs isn't really stable
[18:03] <jefferai> joshd: by recover you mean re-distribute the storage?
[18:03] <joshd> yeah, re-replicating data that a failed node stored
[18:03] <jefferai> the objects, rather
[18:03] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[18:04] <joshd> also note that qemu/kvm has support for rbd directly, you don't need to go through iscsi
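A minimal example of the direct attachment joshd mentions, assuming a qemu built with rbd support and a hypothetical image rbd/vmdisk:

    qemu-system-x86_64 -m 1024 \
        -drive file=rbd:rbd/vmdisk,if=virtio,cache=none

No iSCSI target or initiator is involved; qemu's block layer talks to the OSDs itself.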
[18:04] <stxShadow> joshd ... so ... if i have 4 nodes with 3 replicas and one osd goes down .... the recovery would be faster than the same with 2 replicas ?
[18:07] <joshd> stxShadow: I think so, unless you're already saturating your network
[18:08] <jefferai> joshd: ok -- can you explain something a bit more basic to me?
[18:08] <joshd> jefferai: sure
[18:09] <jefferai> when you say 4 nodes with 3 replicas, you're saying that each object is replicated 3 times
[18:09] <jefferai> so you would have the total storage of one node
[18:09] <jefferai> if you have 4 nodes with two replicas, you would have the total storage of two nodes
[18:09] <jefferai> right?
[18:10] <joshd> yeah
[18:10] <stxShadow> right
[18:10] <jefferai> and can you resize/scale things up and down?
[18:10] <jefferai> so if I have a 2U chassis with 24 disks
[18:10] <jefferai> and 8 of them are in one OSD
[18:10] <joshd> you can add/remove osds without reshuffling too much data (it's closer to optimal with more osds)
[18:10] <jefferai> ah
[18:11] <stxShadow> you can set how many replicas of one object should be stored on the osds
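Concretely, the per-pool replica count is adjustable at runtime; a hedged sketch for the default pools:

    ceph osd pool set data size 3
    ceph osd pool set rbd size 3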
[18:11] <jefferai> because one thing I could maybe do is instead of two big storage nodes and 4 1U compute servers, get 4 2U combo storage/compute servers
[18:12] <jefferai> but I'd sacrifice the ability to put in big 3TB desktop drives in favor of more 1TB 2.5" drives
[18:12] <jefferai> so RAID would be less ideal
[18:12] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:12] <jefferai> but I'd potentially have an extremely fast network (although stxShadow said it won't necessarily take advantage of that?)
[18:13] <jefferai> that said, though, if OSD would eat into the RAM available to the VMs, it'd be problematic
[18:14] <joshd> back in 15
[18:15] <stxShadow> jefferai: i don't know if the memory consumption of my osd node is ok ..... it seems to be a little bit high
[18:15] <jefferai> stxShadow: not necessarily less worrying :-)
[18:20] <stxShadow> thats true ..... thats why ceph is for testing purposes only :)
[18:21] <jefferai> yeah
[18:21] <jefferai> It's in the "gosh, I really *want* to use it, but..." phase
[18:26] * stxShadow (~jens@p4FFFEB27.dip.t-dialin.net) Quit (Remote host closed the connection)
[18:27] * oliver1 (~oliver@p4FFFEB27.dip.t-dialin.net) has left #ceph
[18:30] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) Quit (Quit: Leaving.)
[18:30] <joshd> jefferai: usually we've seen osds stay around 200MB of memory, but during recovery they can use a lot more
[18:31] <jefferai> I see -- that's quite a lot less than the 13 GB stxShadow was seeing :-)
[18:32] <joshd> yeah, there may be something going wrong there
[18:35] <joshd> but because of the spikes during recovery, you might not want osds on your compute hosts
[18:37] <joshd> you can run multiple osds per node (one per disk even) to get faster recovery and more fault tolerance to e.g. osd or underlying fs bugs
[18:38] <jefferai> ah
[18:38] <jefferai> yeah, I knew you could do one per disk
[18:38] <jefferai> but didn't realize you'd do that to replicate on the same node
[18:38] * gmax (~Adium@64-126-49-62.dyn.everestkc.net) Quit (Ping timeout: 480 seconds)
[18:39] <joshd> replication strategy is different - you can set up your crush map so you have a replica on each node, even though there are many osds on each
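A sketch of such a rule in decompiled crushmap form; with chooseleaf the replicas land on distinct hosts no matter how many OSDs each host carries (the bucket and rule names follow the era's default map, so treat them as assumptions):

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }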
[18:42] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) Quit (Quit: Leaving.)
[18:42] <joshd> recovery also uses a bit of cpu - we've generally been recommending 1 core/osd, but don't know an exact limit
[18:43] <joshd> wido tried using atoms, and they weren't fast enough
[18:47] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) has joined #ceph
[18:58] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[19:00] * adjohn is now known as Guest6134
[19:00] * Guest6134 (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Read error: Connection reset by peer)
[19:00] * _adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[19:00] * _adjohn is now known as adjohn
[19:05] * LarsFronius (~LarsFroni@testing78.jimdo-server.com) Quit (Quit: LarsFronius)
[19:16] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[19:41] * groovious (~Adium@174.47.181.227) has joined #ceph
[19:57] * groovious (~Adium@174.47.181.227) Quit (Ping timeout: 480 seconds)
[20:14] * LarsFronius (~LarsFroni@f054106023.adsl.alicedsl.de) has joined #ceph
[20:16] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) has joined #ceph
[20:31] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) Quit (Quit: adjohn)
[20:48] <wido> joshd: I'm still using Atoms
[20:48] <wido> lately that is working better and better
[20:49] <wido> I do think a lot of my issues were somehow btrfs related
[20:49] <joshd> ah, good to know
[20:50] <wido> joshd: Thanks btw for your review, I'm working my way through it :)
[20:50] <joshd> was it purely btrfs, or did you also decrease osds/cpu?
[20:50] <wido> I think it was purely btrfs, OSD's hanging on slow btrfs operations. My configuration hasn't changed.
[20:51] <wido> One Atom 1.6GHz dual-core, 4GB of RAM and 4 2TB disks, one OSD per disk
[20:51] <wido> and an Intel 80GB SSD for the journaling and OS
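wido's layout, sketched as a ceph.conf fragment; the hostname and device paths are hypothetical, with one OSD per data disk and journals on partitions of the shared SSD:

    [osd.0]
        host = atom1
        osd data = /srv/disk0
        osd journal = /dev/sda5   ; SSD partition
    [osd.1]
        host = atom1
        osd data = /srv/disk1
        osd journal = /dev/sda6
    ; ...and likewise for osd.2 and osd.3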
[20:53] <wonko_be> you still run one osd per disk? why not btrfs striped over multiple disks?
[20:54] <NaioN> wonko_be: btrfs isn't ready for it yet...
[20:54] <wido> wonko_be: If I lose an OSD due to a btrfs bug or a ceph bug, I don't lose 8TB of data
[20:55] <wido> but just 2TB
[20:55] <wido> And I'd rather have Ceph/RADOS/Crush distribute my data instead of btrfs
[20:55] <wido> Cooler crushmap ;)
[20:55] <blufor> ehlo ;]
[20:55] <NaioN> I had a lot of trouble with btrfs
[20:55] <wonko_be> wido: true that
[20:55] <wido> wonko_be: WHD next week?
[20:55] <blufor> wido: ha ! you talking bout atom, meh likey ;]
[20:55] <wonko_be> wido: nope, can't make it
[20:56] <wido> wonko_be: :(
[20:56] <wonko_be> i became a dad
[20:56] <wido> congratz!
[20:56] <blufor> wonko_be: gratz
[20:56] <wonko_be> so, my priorities have shifted a bit
[20:56] <wonko_be> thx
[20:56] <wido> blufor: the Atom is nice, I'm using the D525 on a SuperMicro mainboard
[20:56] <wonko_be> i would have wanted to go, just to pass by the ceph boot
[20:56] <NaioN> wido: how many osd's per atom?
[20:56] <wido> There is a newer Atom, mine is about a year old
[20:56] <wonko_be> booth
[20:56] <blufor> wido: the twin^3 is where i wanna go :]
[20:56] <wido> NaioN: 4 OSD's per Atom
[20:56] <NaioN> ok
[20:56] <wido> NaioN: I posted the specs a few lines back
[20:56] <NaioN> and a osd/disk?
[20:56] <wido> ^^^
[20:57] <NaioN> wido: sorry :)
[20:57] <NaioN> scrolling
[20:57] <blufor> btw does raid1 make any sense at all ?
[20:57] * BManojlovic (~steki@212.200.240.216) has joined #ceph
[20:57] <blufor> the problem is, i got 3 drives per server
[20:57] <NaioN> wido: which kernel do you use?
[20:57] <NaioN> I had the exact same problem
[20:57] <NaioN> very poor performance after a short time
[20:58] <NaioN> last I tried was with a 3.2.5 kernel if I remember correctly
[20:58] <wido> NaioN: I'm using 3.2.x
[20:58] <blufor> wido: did you try to use ssds for the data ?
[20:58] <blufor> on atoms ?
[20:59] <wido> blufor: Nope, just cheap 2TB disks, cheapest I could find
[20:59] <NaioN> blufor: raid1? you can use the replication of ceph
[20:59] <NaioN> wido: which ones?
[20:59] <blufor> what i plan is 32 D525s with 4GB ram each
[20:59] <blufor> and i got 3 drives per server
[21:00] <blufor> so i wonder how to utilize those. i want to use those machines for two separate purposes: some for ceph, others for hdfs
[21:00] <NaioN> I have used the hitachi and now i've ordered the seagates
[21:00] <wido> NaioN: 2TB WD20EARS and some Seagates, all 5400RPM
[21:00] <NaioN> ok I use 7200 rpms
[21:00] <wido> My cluster is mainly for development and testing
[21:00] <wido> I just want to know if Ceph can run on it and how it scales
[21:01] <blufor> wido: how fast do you perform on the cluster ?
[21:01] <NaioN> I'm now building a second cluster for "production"
[21:01] <NaioN> only rbds and for backup purpose
[21:01] <wido> blufor: I haven't really tested it lately. But last time I ran a VM and performed bonnie++ I got somewhere around 80MB/sec inside the RBD VM
[21:01] <wido> I have the bonnie++ statistics somewhere I think
[21:01] <NaioN> with the first cluster i can experiment again
[21:02] <NaioN> i want to experiment with btrfs again, at the moment i'm using xfs
[21:02] <NaioN> wido: with a 1gbit/s network?
[21:02] <wido> NaioN: Yes
[21:02] <blufor> that's not that bad, what i try to do is a lot of atoms for the cluster connected with two 1Gbits in etherchannel each and 10GBit on hypervisor side
[21:02] <wido> replication set to 3
[21:02] <NaioN> I'm hitting the gigabit boundary easily
[21:03] <wido> 10 machines, so 40 OSD's in total
[21:03] <blufor> and stuff the atoms with ssds
[21:03] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[21:03] <NaioN> blufor: which fs do you use on the osds/disks?
[21:04] <blufor> NaioN: you mean on the atoms i got just on paper now ? :]
[21:04] <NaioN> oh ok :)
[21:04] <NaioN> well i think you would like to have a fs with TRIM support
[21:05] <blufor> the thing is, i work in this company for 3 months and they're paying immense amounts of $$$ to gogrid just for 40 servers
[21:05] <NaioN> hmmmm :)
[21:05] <blufor> and they grow fucking fast
[21:05] <blufor> socialbakers, maybe you've heard
[21:05] <blufor> anyways
[21:06] <NaioN> why would you want to have ssd's as data disks in the osds?
[21:06] <blufor> i think, given the plans they've got, the gogrid way will get worse with every additional server
[21:06] <NaioN> what sort of workload do you have?
[21:07] <blufor> analyzing milions of posts from fb, twitter, g+, linkedin and youtube
[21:07] <blufor> analyzing all numbers you can see on those pages
[21:07] <blufor> counting statistics
[21:08] <NaioN> ok so a lot of database activity...
[21:08] <blufor> hdfs
[21:08] <NaioN> ok
[21:08] <blufor> i wanna use ceph for rbd
[21:08] <NaioN> that would make sense... a lot of iops i think
[21:08] <joshd> blufor: you might be interested in ceph's hadoop shim once the filesystem layer is more stable
[21:09] <blufor> yup
[21:09] <blufor> i wanna have everything on it. the works on the infrastructure will begin in, let's say 6 months
[21:10] <blufor> so there's plenty of stuff that'll get better
[21:10] <blufor> until then
[21:10] <blufor> but my goal is to avoid any SAN overkill (FC, iSCSI,....)
[21:11] <blufor> and throwing loads of $$$ to emc^2 or any other vendor like that
[21:11] <blufor> been there, done that
[21:11] <NaioN> hmmm depends
[21:11] <NaioN> with fc you get low-latency
[21:12] <NaioN> and that's something you like with a lot of ssds...
[21:13] <blufor> well, i wanna use arista network hw, if you're familiar
[21:13] <NaioN> i wonder if you really get the performance out of the ssd's with this setup
[21:13] <blufor> they do http://www.aristanetworks.com/en/products
[21:13] <blufor> at least they say that's the way
[21:13] <NaioN> I see...
[21:14] <NaioN> it looks low latency :)
[21:14] <blufor> not just latency, but also oversubscription
[21:14] <blufor> they're better and cheaper than cisco ;]
[21:15] <blufor> btw majority of the guys from arista is ex-cisco ;]
[21:15] <NaioN> hehe
[21:15] <blufor> and the 10Gbit stuff cisco does is just licensed arista :]
[21:15] <blufor> that's why all their nexus crap runs linux too
[21:16] <blufor> arista even suggests you run cobbler on their hardware and use the switch as a deployer :]
[21:16] <blufor> which is kinda crazy imo :]
[21:16] <jefferai> blufor: interesting -- I was looking at 10Gb switches now
[21:17] <jefferai> haven't gotten the cisco quote yet but got one from Juniper and Extreme
[21:17] <blufor> jefferai: dfntly look into those
[21:17] <wonko_be> even more - arista claims that you can run mini-linux-apps on the switches themselves
[21:17] <wonko_be> it has been playing in the back of my head that a ceph mon would be a good candidate for this functionality
[21:17] <blufor> wonko_be: since it's linux and they give you access to the root shell... anything's possible :]
[21:17] <blufor> it reminds me of F5 and their boxes :]
[21:18] <wonko_be> it would really put the monitor in the center of the setup
[21:18] <blufor> wonko_be: the 7124SX comes with hdd included ;]
[21:18] <NaioN> hmmm i'm no fan of that... thinking back at guys who ran their irc sessions on junipers M's
[21:18] <jefferai> hah
[21:18] <blufor> :]
[21:20] <blufor> you can even talk to the arista boxes thru jabber client. or see the config of other box, from another box... gr8 stuff cisco can only dream about
[21:20] <Tv|work> wonko_be: There may be some commercial vendors already pondering that though.. *ahem*
[21:20] <Tv|work> *thought
[21:20] <wonko_be> Tv|work: probably
[21:21] <Tv|work> and by saying "may", I mean, I put that idea in their head ;)
[21:21] <blufor> :]
[21:21] <wonko_be> we got the suggestion/question from our arista sales people that they might want to lend us some equip to do such a setup
[21:21] <Tv|work> (I think; they may have come up with it independently too.. didn't talk to all of them)
[21:21] <nhm> Arista is interesting. I was looking at them before going with force10.
[21:22] <blufor> nhm: why did f10 win for you ?
[21:22] <NaioN> what's the point on integrating the mon on the switch?
[21:22] <wonko_be> lower latency
[21:22] <blufor> NaioN: latency
[21:22] <nhm> blufor: We got S4810s for *very* cheap.
[21:22] <NaioN> hmmm depends how the control plane is connected...
[21:22] <Tv|work> NaioN: in the case i was talking about, it was more about not needing it to be on a more vulnerable box, not needing to make any end node "special"
[21:22] <blufor> nhm: ah, price is usually the killer feature ;]
[21:23] * nhorman (~nhorman@99-127-245-201.lightspeed.rlghnc.sbcglobal.net) Quit (Quit: Leaving)
[21:23] <nhm> blufor: And they are working on quantum integration for openstack which is what I needed it for.
[21:23] <nhm> but yes, price ultimately was the deciding factor. ;)
[21:23] <jefferai> nhm: quantum?
[21:23] <NaioN> why would it have lower latency? it depends on how the linux box is connected to the switching plane
[21:23] <blufor> jefferai: storage layer of openstack
[21:24] <nhm> jefferai: http://wiki.openstack.org/Quantum
[21:24] <jefferai> ah
[21:24] <jefferai> right
[21:24] <NaioN> with those Juniper M's it was only a 100Mbit/s ethernet interface
[21:24] * jefferai forgot the name of it
[21:24] <wonko_be> NaioN: they claim it has very low latency
[21:24] <blufor> ahem... not storage... too much stuff going on in my head ;]
[21:24] <jefferai> openstack is interesting, but I haven't found a use for it yet
[21:24] <NaioN> and where do you put the storage of the mon?
[21:25] <nhm> jefferai: Our plan was to run stuff on VMs that weren't good fits for our compute clusters. IE single threaded software or software that had special OS/library needs.
[21:25] <wonko_be> NaioN: on the hdd in the switch
[21:25] <NaioN> wonko_be: yeah like juniper on a single hdd! :)
[21:26] <NaioN> that was the biggest problem with those junipers
[21:26] <NaioN> the only moving part
[21:26] <jefferai> nhm: I see. I've been looking at Ganeti for that
[21:26] <wonko_be> sure, why not, run multiple mon's, that is why it is redundant
[21:26] <wonko_be> or equip them with ssd's
[21:26] <nhm> jefferai: We actually had Ganeti deployed too for static VMs.
[21:26] <jefferai> openstack seems too focused on providing project-based capabilities to lots of users
[21:26] <jefferai> for e.g. service providers
[21:26] <nhm> jefferai: I didn't deploy it, but the folks that did liked it quite a bit.
[21:26] <NaioN> wonko_be: lots of writes no trim...
[21:27] <wonko_be> so the disk fails in a year or so
[21:27] <wonko_be> big deal
[21:27] <blufor> might be a big deal if you pay for "remote hands"
[21:27] <NaioN> If you build such a setup, why not 3 extra boxes as dedicated mons spread over your dc
[21:28] * stxShadow (~jens@ip-88-153-224-220.unitymediagroup.de) has joined #ceph
[21:28] <wonko_be> NaioN: i would like to see if the lower latency improved throughput a lot (especially in iops)
[21:29] <stxShadow> hmmm ..... another crash of our osd ..... but now: recovery stopped at 0.79% with:
[21:29] <NaioN> well i don't think low latency to the mons makes that much difference
[21:29] <stxShadow> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequest*)' thread 7f7e7b104700 time 2012-03-13 21:26:25.679120
[21:29] <stxShadow> osd/ReplicatedPG.cc: 4051: FAILED assert(!missing.is_missing(soid))
[21:29] <stxShadow> any advice ?
[21:30] <joshd> stxShadow: if you could get a log with 'debug osd=20' that'd be great
[21:31] <stxShadow> from the failing osd?
[21:31] <joshd> stxShadow: unfortunately we'd need to know how it reached that point
[21:32] <joshd> from the failing osd, yeah, if it still fails the next time
[21:32] <stxShadow> ok ... the node is crashing anyway .... i will start it with debug :)
[21:32] * adjohn (~adjohn@rackspacesf.static.monkeybrains.net) has joined #ceph
[21:32] <stxShadow> it fails on every restart after a few seconds
[21:33] <joshd> is it the same failure after restart?
[21:34] <stxShadow> yes .... always the same
[21:34] <stxShadow> no chance to get it back online
[21:35] <joshd> hmm, what we're really interested in is how it got into that state in the first place
[21:35] <joshd> logs of the restart won't help
[21:36] <joshd> sagewk: would wiping the incorrect pg work in this case?
[21:36] <stxShadow> hmmm ... so the dump will not help ? It has crashed again ..... with debug osd=20
[21:37] <joshd> it tells you what pg has the problem at least
[21:37] <sagewk> joshd: i'd like to see how we're hitting that failed assert first
[21:38] <joshd> stxShadow: pastebin the log?
[21:39] <sagewk> looks like #2132
[21:39] <stxShadow> ok ... just a moment
[21:41] * chutzpah (~chutz@216.174.109.254) has joined #ceph
[21:42] <stxShadow> hmmm .... debug osd=20 generates 324 MB of logfiles from start till crash
[21:44] <joshd> stxShadow: could you compress it and attach to http://tracker.newdream.net/issues/2132 ?
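(For reference, a sketch of how the requested logging and upload might look -- the config section and log path here are illustrative, not taken from stxShadow's actual setup:

    # in /etc/ceph/ceph.conf on the failing node, then restart the osd
    [osd]
        debug osd = 20

    # compress the resulting log before attaching it to the tracker
    gzip /var/log/ceph/osd.0.log
)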
[21:50] <stxShadow> joshd: will i need a user there ?
[21:50] <joshd> stxShadow: I think so
[21:52] <stxShadow> http://www.as47215.net/osd.0.log.gz
[22:03] <johnl> so what's this about ceph coming to the uk?
[22:07] <stxShadow> joshd: maybe you are able to add it to 2132 ?
[22:07] <stxShadow> or sagewk
[22:21] <Tv|work> so who wants to talk to me about flab & metropolis
[22:22] <Tv|work> as in, what would it take to get everyone to stop using those?
[22:23] <sagewk> i like having a fast dev machine in the dc and not at aon.
[22:23] <sagewk> they can be moved elsewhere.
[22:24] <Tv|work> sagewk: well, we still have some space in the rack, but longer term i'd like to just have beefy enough vercoi like boxes, and run vms... then we don't have to worry about individual hardware failing
[22:25] <sagewk> yeah
[22:25] <sagewk> until then, we can either leave them where they are, or bring them back here, i guess. unless patrick has some dusty corner in garland he doesn't mind putting them in
[22:26] <Tv|work> sagewk: I'd say either office pile or in IRV inside out subnet (vpnable), then we are fully out of GAR
[22:26] <Tv|work> *our
[22:27] <sagewk> works for me
[22:27] <sagewk> probably aon, less one-off cruft in irvine...
[22:28] <Tv|work> sure
[22:32] <Tv|work> Who uses cosd0-4? sjust1?
[22:33] <sagewk> nobody recently
[22:33] <Tv|work> sagewk: clobberin' time?
[22:34] <sagewk> :)
[22:34] <sagewk> 'ceph -v' is probably 0.20 something
[22:34] <Tv|work> haha
[22:35] * lofejndif (~lsqavnbok@74.Red-83-41-150.dynamicIP.rima-tde.net) has joined #ceph
[22:36] * lofejndif (~lsqavnbok@74.Red-83-41-150.dynamicIP.rima-tde.net) Quit ()
[22:36] * lofejndif (~lsqavnbok@9KCAAEMRX.tor-irc.dnsbl.oftc.net) has joined #ceph
[22:42] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) Quit (Quit: Leaving.)
[22:44] <sjust1> not me
[22:44] <sagewk> yay, new apache2/fastcgi packages work, chef scripts updated
[22:45] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) has joined #ceph
[22:45] * aliguori (~anthony@32.97.110.59) Quit (Remote host closed the connection)
[22:54] <jefferai> So do people run Ceph on top of corosync/pacemaker/heartbeat, or are those not necessary with Ceph since it has its own daemons?
[22:54] * darkfader (~floh@188.40.175.2) Quit (Read error: Connection reset by peer)
[22:56] <Tv|work> jefferai: not necessary
[22:56] <jefferai> h
[22:56] <jefferai> hm
[22:56] <Tv|work> jefferai: it's more if you've used Pacemaker etc for 10 years now, and want everything to fit in
[22:56] <jefferai> I haven't; it's my first time clustering and still trying to figure the whole world out
[22:57] <jefferai> ceph looks so nice, but as I've been told several times, not production-ready...
[22:58] * darkfader (~floh@188.40.175.2) has joined #ceph
[22:59] <sagewk> jefferai: close, depending on what you want to do
[22:59] <jefferai> sagewk: yeah, but I still have the problem of only having two nodes
[22:59] <jefferai> I'm looking at options to increase that
[23:00] <jefferai> but before looking hard at ceph, I had been looking at DRBD between two devices
[23:00] <jefferai> and two nodes there should be fine (but, if I want to have automatic failover of VMs, not just storage, I need clustering agents anyways)
[23:02] <Tv|work> jefferai: I'm just gonna say this: anything that calls itself a "clustering agent" is probably very old school, and mostly not nice.
[23:02] <jefferai> well, pacemaker/corosync/heartbeat do seem rather old school
[23:02] <Tv|work> yeah; i personally steer clear of them, and have for 6 years ;)
[23:03] <jefferai> Tv|work: what I was told earlier here is that having two nodes means that if a node goes down and comes up, you can't get quorum
[23:03] <Tv|work> just the fact that i've been actively not using them for 6 years should tell you something ;)
[23:03] <jefferai> so you can risk data corruption
[23:03] <jefferai> Tv|work: well -- I'm a sponge for your wisdom, currently :-)
[23:03] <Tv|work> jefferai: paxos-style majority voting really needs a minimum of 3 nodes.. or you run just 1 monitor and handle it manually (/automate with pacemaker)
[23:04] <Tv|work> jefferai: but 2-machine HA is problematic anyway; what if both think they are the guy in charge
[23:04] <jefferai> well, right
[23:04] <Tv|work> jefferai: that road leads to weird setups like STONITH, both machines being able to power down the other one
[23:04] <Tv|work> jefferai: and then it's race to who kills who faster, bleh
[23:04] <jefferai> yeah...
[23:04] <jefferai> but what does Ceph do in that instance?
[23:05] <Tv|work> on a 3-machine cluster, any 2 are happy to keep serving
[23:05] <jefferai> yeah, but what if two nodes go down?
[23:05] <jefferai> I guess then it's up to you to determine who's right
[23:05] <jefferai> so make sure it doesn't happen :-)
[23:05] <Tv|work> 1/3 monitors cannot form a quorum -> refuses to make decisions -> no worky
[23:06] <Tv|work> it comes down to this.. if two nodes is a realistic concern, buy the cheapest linux box you can find, run an extra monitor on that, now you have 3
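(A minimal sketch of that layout in the ceph.conf of the era -- three mon sections, with hostnames and addresses purely illustrative:

    [mon.a]
        host = storage1
        mon addr = 10.0.0.1:6789
    [mon.b]
        host = storage2
        mon addr = 10.0.0.2:6789
    # mon.c lives on the extra cheap machine Tv|work mentions
    [mon.c]
        host = cheapbox
        mon addr = 10.0.0.3:6789
)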
[23:06] <jefferai> I see
[23:07] <jefferai> one sec
[23:07] <jefferai> Tv|work: so it's only important to have a quorum of mon boxes?
[23:08] <jefferai> because in the setup I was thinking about I'd have two object storage boxes and four nodes, and could certainly put the mon daemon on three of them
[23:09] <jefferai> Tv|work: mind if I start from scratch, and you can tell me how it looks?
[23:09] <Tv|work> jefferai: you can think of the monitor daemons as forming a separate cluster, if you will
[23:09] * lofejndif (~lsqavnbok@9KCAAEMRX.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[23:09] <jefferai> I see
[23:09] <Tv|work> (the osds are more of a pool with p2p traffic than a cluster ;)
[23:10] <jefferai> I see
[23:10] <Tv|work> (and the mdses are more like a separate server pool, with some servers active and some in standby)
[23:10] <jefferai> OK -- but the MDSes are only needed with cephfs right?
[23:10] <Tv|work> yes
[23:10] <jefferai> ok
[23:10] <jefferai> can I run this by you and see what you think?
[23:11] <Tv|work> run it by the channel ;)
[23:11] <jefferai> well, I sorta did before, but I think there was a lot of confusion on my end about what I was being told
[23:11] <jefferai> :-)
[23:12] <jefferai> so I can probably afford two storage boxes with high disk density, because I have limited rack space. So I can't really depend on Ceph alone for replication, but what I was going to do was layer Ceph OSD on top of RAID-10
[23:12] <jefferai> So my thought was, mdadm RAID 10 -> LVM -> Ceph OSD; two of these
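(A rough sketch of the stack jefferai describes, with device names and sizes purely illustrative -- RAID-10 across a group of four drives, LVM on top, and an OSD data directory on each logical volume:

    # one 4-drive RAID-10 set
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
    # LVM on the array
    pvcreate /dev/md0
    vgcreate vg_ceph /dev/md0
    lvcreate -L 900G -n osd0 vg_ceph
    # filesystem for the OSD data dir (btrfs or xfs were the usual choices then)
    mkfs.xfs /dev/vg_ceph/osd0
    mount /dev/vg_ceph/osd0 /srv/osd.0
)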
[23:13] <sagewk> and no ceph replication?
[23:13] <jefferai> I guess in that setup I don't actually need DRBD to replicate between the boxes as Ceph will keep the objects replicated
[23:13] <jefferai> well no
[23:13] <elder> Tv|work, should I expect to not have access to my plana systems?
[23:13] <jefferai> The boxes would replicate each other
[23:13] <sagewk> elder: i just had to restart my vpn
[23:13] <Tv|work> elder: i am not aware of anything special
[23:13] <elder> OK.
[23:13] <elder> I'll try that.
[23:13] <nhm> elder: I had to do that too.
[23:14] <jefferai> So I guess a replication factor of 1; but also underneath Ceph I'd have RAID for redundancy
[23:14] <Tv|work> jefferai: i don't know if anyone's even run LVM on top of RBD
[23:14] <jefferai> you mean RBD on top of LVM?
[23:14] <Tv|work> oh oh right i see
[23:14] <elder> sage, nhm, Tv|work, that did it. Thank you.
[23:14] <Tv|work> jefferai: probably that neither ;)
[23:14] <jefferai> well
[23:15] <jefferai> I mean, I can always not do it if it doesn't work :-)
[23:15] <Tv|work> jefferai: the other part is, that means you'll be using kernel rbd, not the code in kvm/qemu; the kvm driver is thought to be nicer, because it's a shorter code path to the network
[23:15] <jefferai> So that would be the object storage nodes: a few OSDs on top of RAID-backed block devices, with one box replicating the other
[23:15] <jefferai> Tv|work: oh, why?
[23:16] <Tv|work> jefferai: it's a userspace process using librbd & TCP sockets, as opposed to kernel block device to kernel driver to TCP
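(For the contrast Tv|work is drawing, both access paths existed at the time -- pool and image names below are made up:

    # kernel rbd: map the image as a block device, usable by mdadm/LVM/mkfs
    rbd map myimage --pool rbd        # shows up as /dev/rbd0

    # qemu/kvm with the librbd driver: the guest's disk I/O goes straight
    # from the qemu process to the cluster over TCP, no kernel block layer
    kvm -drive file=rbd:rbd/myimage,if=virtio
)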
[23:16] <jefferai> no, I mean, why would it mean I'd be using kernel rbd?
[23:16] <Tv|work> jefferai: that's still an educated hunch
[23:16] <jefferai> keep in mind that I wouldn't be running qemu/kvm on these nodes
[23:16] <Tv|work> jefferai: because you're stacking kernel features on top of RBD
[23:16] <jefferai> I am?
[23:16] <Tv|work> jefferai: ohhh
[23:17] <sagewk> jefferai: raid10 + ceph replication means you're only using 25% of your raw space.
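(The arithmetic behind the 25%: RAID-10 keeps half of the raw capacity, and 2x Ceph replication halves that again, so 1/2 * 1/2 = 1/4 of the raw space ends up usable.)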
[23:17] <jefferai> sagewk: I'm aware
[23:17] <Tv|work> jefferai: so the boxes are mkfs'ing & mounting the rbd block devices themselves
[23:17] <jefferai> Tv|work: hold on a sec
[23:18] <jefferai> sagewk: I realize that I'm using a small amount of raw space, but I want redundancy in case a node goes down
[23:18] <jefferai> and I want redundancy in the hard drives
[23:18] <jefferai> and I'm not generally a RAID5/6 fan
[23:18] <jefferai> better ideas are much appreciated
[23:18] <jefferai> (including "raid 6 ain't that bad")
[23:19] <jefferai> I realize that if I had 14 nodes I could have a replication factor of 2 or 3 and still have a large amount of my raw space usable
[23:19] <jefferai> but I don't have the rack space nor dollars (nor power outlets in the racks) for that
[23:21] * groovious (~Adium@64-126-49-62.dyn.everestkc.net) Quit (Quit: Leaving.)
[23:22] * stxShadow (~jens@ip-88-153-224-220.unitymediagroup.de) has left #ceph
[23:23] <jefferai> I could *maybe* get four 2U nodes, but for safety I'd want a replication factor of, say, 3 -- so even without RAID that'd put me at 25% of raw capacity
[23:31] <jefferai> Tv|work: I seem to have stumped sagewk :-)
[23:31] <sagewk> jefferai: sorry, distracted.
[23:31] <jefferai> no problem
[23:32] * jefferai is at work too, quite understands
[23:32] <nhm> jefferai: if you can afford to do the 2U nodes, you'll have less impact from a motherboard failure which could be nice.
[23:33] <sagewk> yeah, i suspect more, smaller nodes will perform better. i'd be worried about a single ceph-osd in front of a huge raid10 array, too... more likely to hit software bottlenecks.
[23:34] <sagewk> and your monitor problem will go away :)
[23:37] <jefferai> nhm: the 2U nodes give me 24 drives each, but can only go up to 1TB drives (2.5"); the 4U give me 36 but can go up to 3TB each. So a max (raw) of 48TB per 4U vs 108TB per 4U
[23:37] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) has joined #ceph
[23:37] <jefferai> (assuming 3TB drives; 4TB drives are much pricier and harder to get in quantity)
[23:38] <jefferai> sagewk: actually I was thinking of having every 4 drives be a RAID-10 device
[23:38] <nhm> jefferai: yeah, I was thinking 2U with 12 3TB drives.
[23:38] * Jaykra (~Jamie@64-126-89-248.dyn.everestkc.net) Quit (Quit: Leaving.)
[23:40] <jefferai> nhm: hm
[23:40] <jefferai> 4U with 3TB drives would give me about 48TB usable (after replication)
[23:40] <nhm> jefferai: you were probably looking at the SC847 chassis for the 36 drives?
[23:41] <jefferai> if I trusted all the replication to ceph, and went with 4TB drives, I could get 48TB usable
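(Checking that figure: 36 drives * 4TB = 144TB raw per box; with 3x Ceph replication, 144 / 3 = 48TB usable.)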
[23:42] <jefferai> I was actually looking at silicon mechanics, which give some level of support but are not too pricey; the Storform iServ A518v3 (http://www.siliconmechanics.com/c625/storage-servers.php)
[23:42] * BManojlovic (~steki@212.200.240.216) Quit (Remote host closed the connection)
[23:43] <nhm> jefferai: One thing is you can probably cheap out a bit on the processors for the 2U nodes vs what you'd need to get for the big ones.
[23:43] <nhm> And divide the ram up so you can use lower-density dimms.
[23:43] <nhm> might give you some cost savings.
[23:43] <jefferai> possibly
[23:43] <nhm> ok, gotta run and eat dinner. have a good evening
[23:43] <jefferai> thanks, you too
[23:45] <jefferai> sagewk: if I have four nodes that *can* run the monitoring, and so three are monitoring, and one dies, does the monitor kick in on the fourth node (so that you have three again)?
[23:46] <Tv|work> jefferai: that would equal running ceph-mon on all four
[23:46] <Tv|work> jefferai: which you can do, but then your quorum is 3, and you can lose only one machine without downtime
[23:46] <jefferai> yeah, understood, but if you have an even number > 3, can you run on all four but have one of them stay out of the way unless you have a failure?
[23:46] <Tv|work> jefferai: that's logically the same as running it on all
[23:47] <jefferai> can't the quorum drop to 2, if a node goes down?
[23:47] <Tv|work> jefferai: the strict majority vote isn't some clumsy restriction to work around; it is what prevents the non-healthy part of the cluster from claiming it's healthy
[23:47] <jefferai> yeah, understood
[23:48] <jefferai> what I'm saying is, say I lose a node
[23:48] <Tv|work> jefferai: imagine a network partition that leaves you A,B and C,D. you don't want both partitions to think they're good to go.
[23:48] <jefferai> so the unhealthy part is the node that got lost
[23:48] <jefferai> quorum says that the three that are still up are right
[23:48] <jefferai> now I lose another node
[23:48] <jefferai> I'd still have a majority -- 2 against 1
[23:48] <Tv|work> you'd have equal parts, 2 and 2
[23:49] <jefferai> only if the first node has come back up
[23:49] <Tv|work> and how do you know whether it has, or not?
[23:49] <jefferai> Good question
[23:49] <jefferai> I was hoping you'd tell me
[23:49] <jefferai> :-)
[23:49] <jefferai> I guess what I'm getting at though is
[23:50] <jefferai> if I run on 3 nodes, then my quorum is 2
[23:50] <jefferai> so i can only lose one node
[23:50] <jefferai> if I run on 4 nodes, I'm in the same boat, right?
[23:50] <Tv|work> yes
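(The majority math Tv|work is applying: a quorum is floor(n/2) + 1 monitors, so n - (floor(n/2) + 1) failures are survivable:

    mons   quorum   losses tolerated
      3       2            1
      4       3            1
      5       3            2

so an even count buys nothing over the odd count below it.)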
[23:50] <jefferai> ok
[23:50] <jefferai> so this is back to the question I had earlier
[23:50] <jefferai> way earlier
[23:50] <jefferai> which is, I'd have these storage boxes
[23:50] <jefferai> and I'd have a bunch of compute boxes
[23:51] <jefferai> why couldn't the compute boxes run the mon daemons
[23:51] <jefferai> in fact, the wiki seems to indicate that this is preferable
[23:54] <Tv|work> sure they can
[23:54] <jefferai> ok, let me extend that out further
[23:55] <jefferai> if I have two or four boxes for storage, and four boxes for VMs, should/could I run the mon daemon on all of them?
[23:55] <jefferai> or at least, seven of them?
[23:55] <Tv|work> that's quite a lot
[23:55] <jefferai> it seems that there are benefits to having at least 5 though
[23:55] <Tv|work> ceph-mon doesn't actually do much; it's rare to need more than 3
[23:56] * aliguori (~anthony@cpe-70-123-132-139.austin.res.rr.com) Quit (Remote host closed the connection)
[23:56] <Tv|work> but sure, 1-7 is still within the sane range
[23:57] <jefferai> I guess I'm unclear exactly what it does, then. My understanding is it determines, in case of an OSD node failure, which is the correct one to use
[23:57] <jefferai> but since it requires strict majority, it doesn't really matter how many you run as long as it's more than three, because losing any of the OSD nodes will mean that you've lost your strict majority
[23:58] <jefferai> err, losing a mon node
[23:58] <jefferai> so you can only ever tolerate losing one node at a time, for those serving as mon nodes
[23:59] * lofejndif (~lsqavnbok@659AAAKN1.tor-irc.dnsbl.oftc.net) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.