#ceph IRC Log


IRC Log for 2010-08-12

Timestamps are in GMT/BST.

[0:04] * conner (~conner@leo.tuc.noao.edu) has joined #ceph
[0:08] <conner> So is it typical in a ceph deployment to make all OSD hosts also mds hosts?
[0:08] <gregaf> conner: definitely not
[0:09] <conner> My questions are a little naive, I'm sort of at the research phase here. So the metadata is not fully distributed?
[0:10] <orionvm> You would want your MDS servers in another cluster.
[0:10] <gregaf> the metadata servers are architected to be run as a cluster, and testing on an early version of the system showed that it scaled out pretty well (I think it got half the IOPS/server at several hundred nodes as it did with one or a few nodes)
[0:11] <gregaf> but that part of the code isn't as well-tested so any current clusters will be much more stable with a single MDS because it's had less debugging
[0:12] <gregaf> and each MDS takes up resources so there's just not much point in running an MDS on every OSD unless you've got a very weird workload (I can't imagine a workload where that would be an efficient solution)
[0:12] <conner> gotcha, so you're saying the mds need to be dedicated a la lustrefs?
[0:12] <gregaf> the mds is a separate daemon from the OSD, if that's what you're asking
[0:13] <gregaf> it can be run on the same server as the OSD nodes
[0:13] <gregaf> depending on how beefy/weak your servers are that might hurt you or might not
[0:14] <conner> I'm thinking about trying a small scale test
[0:14] <gregaf> when I'm doing development I just run an entire cluster of between 5 and 10 daemons on my dev server
[0:14] <conner> I've got the Isilon sales hounds sniffing around
[0:14] <conner> but I don't really need Isilon today... but I need that sort of scaling in a few years
[0:18] <conner> so is metadata fully replicated to each mds or is it more of a distributed hash table?
[0:18] <gregaf> not really either
[0:18] <gregaf> each MDS is the authority on specific subtrees of the hierarchy
[0:19] <gregaf> the specific boundaries can shift depending on how hot different subtrees are, and if one folder gets really hot it can split up the authority for that folder too
[0:19] <conner> so what happens when an mds dies?
[0:19] <gregaf> and they do distributed locking to handle edge cases and such
[0:20] <gregaf> metadata ops on its region of authority halt until you bring up another daemon to replace it
[0:20] <conner> so the state is recoverable from looking at the filesystem?
[0:20] <gregaf> or you can have a couple extra daemons running that aren't actively engaged in management and the monitors will tell one of them to take over
[0:21] <gregaf> yes, all the ops are journaled to the distributed object store before being changed in-memory or returning to the client
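gregaf's ordering here (journal to the object store first, mutate memory second, ack the client last) is what makes a dead MDS recoverable by replay. A toy Python sketch of the idea, not the real MDS code; all names are invented:

```python
class JournalingMDS:
    """Toy write-ahead-journaling metadata server.

    Every op is appended to a durable journal *before* the in-memory
    state changes or the client gets an acknowledgement, so a
    replacement daemon can rebuild the state from the journal alone.
    """

    def __init__(self, journal):
        self.journal = journal   # stands in for objects in the object store
        self.state = {}          # in-memory metadata

    def apply(self, op, path, value=None):
        self.journal.append((op, path, value))   # 1. journal first (durable)
        if op == "create":                        # 2. then mutate memory
            self.state[path] = value
        elif op == "unlink":
            self.state.pop(path, None)
        return "ack"                              # 3. only now ack the client


def recover(journal):
    """A replacement MDS replays the surviving journal to rebuild state."""
    mds = JournalingMDS(journal=[])
    for op, path, value in journal:
        mds.apply(op, path, value)
    return mds.state
```

Recovery time is then bounded by journal length, which is why the failure-detection timeout gregaf mentions below tends to dominate.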
[0:22] <conner> so how long does it take an mds to recover the state?
[0:22] <orionvm> Just got my all in one test box setup. :D
[0:23] <orionvm> conner: What are you intending to use your SAN for?
[0:23] <gregaf> umm, not sure how long recovery takes
[0:24] <gregaf> I think the amount of time it takes to declare the MDS failed dominates the actual read-the-journal-and-recover time in most instances, though
[0:24] <conner> orionvm, incoming data buffer, processing scratch space, and long term archiving
[0:25] <gregaf> I think the default time to declare death is a few minutes and then the new daemon takes over and can start serving under a minute later, but I haven't measured it
[0:26] <conner> do client requests just block or do they fail during that period?
[0:26] <gregaf> block
[0:28] <conner> is anyone funding dev?
[0:28] <gregaf> Yehuda and Sage and I are all employed by New Dream Network (aka DreamHost.com)
[0:29] <conner> awesome
[0:29] <gregaf> Lawrence Livermore provided some funding while it was Sage's PhD research topic
[0:29] <conner> llnl is a lustrefs house isn't it?
[0:29] <gregaf> think so, yeah
[0:30] <conner> well both ceph and elliptics look very promising
[0:30] <gregaf> elliptics?
[0:31] <gregaf> haven't heard of it
[0:31] <conner> http://www.ioremap.net/projects/elliptics
[0:32] <conner> sort of a dht with an http front end
[0:32] <conner> someday pohmelfs is supposed to live on it
[0:32] <orionvm> Hmm interesting.
[0:33] <gregaf> yeah, I haven't seen those before, interesting
[0:34] <conner> I went through this exercise several years ago and ended up writing my own
[0:34] <conner> turned out that it was very similar to mogilefs
[0:35] <conner> I talked with Brad about it at OSCON but at the time I just couldn't beat the thing into working for me
[0:36] <orionvm> Haha yeah.
[0:36] <conner> the system I wrote is managing about a petabyte at $previous_job
[0:36] <orionvm> Nice. :)
[0:37] <conner> yes, well it's a hackish beast
[0:37] <orionvm> As they all are. :P
[0:37] <conner> the mds is mod_perl on top of mysql
[0:37] <orionvm> Lol!
[0:38] <orionvm> Yeah I think if I wanted to implement my own I would probably go the way of consistent hashing.
[0:38] <conner> i built in support for load balancing across multiple dbs but it was easier to just buy a bigger box
[0:38] <orionvm> Yeah? How big of a box did you end up on?
[0:38] <conner> the problem with consistent hashing is when you want to grow the system
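The growth pain conner brings up is exactly what consistent hashing tries to minimise: when a node joins the ring, only the keys adjacent to its positions move, and they move only onto the new node. A minimal sketch of the idea (illustrative only, all names invented):

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring.

    Each node is hashed onto the ring at several points ("virtual
    nodes"); a key belongs to the first node clockwise from its hash,
    so adding or removing a node remaps only the adjacent keys.
    """

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []          # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def locate(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Growing the ring with `add()` never shuffles keys between the existing nodes, which is the property a plain `hash(key) % n` scheme lacks.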
[0:39] <orionvm> Normally sharding is more effective and you just chuck memcached in front of it.
[0:39] <conner> orionvm, pretty modest box, 8 cores, 32gb ram, and I think 4 SAS drives
[0:39] <conner> i did memcached and it didn't help my case
[0:39] <orionvm> Yeah that's a decent size.
[0:39] <orionvm> Yeah? Your problem was writes?
[0:39] <conner> aye
[0:40] <orionvm> Ahh yeah not much you can do about that.
[0:40] <conner> it was a data processing system so it was constantly adding new files to the store
[0:40] <orionvm> Multiple master shards is about all you can do.
[0:40] <conner> well, you can if you don't mind having to redo lots of work
[0:40] <orionvm> Yep.
[0:40] <conner> exactly, I wrote the support and ended up not needing it
[0:40] <conner> with around 75 clients, the qps on the db was only like 3k/s
[0:41] <orionvm> Mhmm thats pretty sizable but not overly hard to deal with.
[0:41] <orionvm> You mentioned, Isilon, do you already have Infiniband hardware?
[0:41] <conner> I had a 5 day upstream buffer so I just did db dumps every 8 hours
[0:41] <conner> nope, zero infiniband
[0:42] <conner> zero 10gigE too
[0:42] <conner> but I need another 50TB immediately
[0:42] <orionvm> Aight, we are getting Infiniband gear in next week.
[0:42] <orionvm> Mhmm have you looked at GlusterFS?
[0:42] <orionvm> It's what we are using atm and it's pretty decent with linear scalability.
[0:42] <conner> i have
[0:43] <orionvm> We have about 200TB on it.
[0:43] <conner> what's performance like?
[0:43] <orionvm> Hmm limited by our network interconnects.
[0:43] <orionvm> Our bonded gigE clients can pull 220mb/s
[0:43] <conner> that's interesting, I've only ever heard the performance is poor
[0:44] <conner> I thinking about buying two of these beasts: http://www.rackmountpro.com/category.aspx?catid=258
[0:44] <conner> and doing DRBD between them
[0:44] <darkfade1> orionvm: with gluster, how do you get much speed for a single file?
[0:45] <darkfade1> because i tried it and it seemed it will only spread many files over the nodes
[0:45] <darkfade1> but a single file couldn't be distributed over many systems
[0:45] <darkfade1> but i'm not really sure i didn't just do it "wrong"
[0:46] <darkfade1> conner: they're big for sure :>
[0:46] <orionvm> You can stripe files with the stripe translator.
[0:46] <darkfade1> so i just did it wrong.
[0:46] <orionvm> You can also use the replicate translator if you only need fast reads.
[0:47] <orionvm> Yeah, you can get good performance out of it.. but it can be risky business.
[0:47] <darkfade1> ideally, both.
[0:47] <darkfade1> why risky??
[0:47] <orionvm> Stripe gives you fast read and writes.
[0:48] <orionvm> The stripe translator is dangerous if you aren't careful, we have lost huge amounts of data before and had to restore entire gluster volumes from backups.
[0:48] <gregaf> so replication has to be done from the client over its link?
[0:48] <orionvm> Yes, it does sync writes.
[0:48] <orionvm> Which is a bit.. meh.
[0:49] <orionvm> Its read performance is just crazy though.
[0:49] <darkfade1> hmmm what did go wrong? was it because all systems with a given file copy fail?
[0:49] <orionvm> Nah.. what happened was that on a few systems the disks filled up.
[0:49] <orionvm> Corrupting tons of data.
[0:49] <darkfade1> i first looked at gluster, because I also have some infiniband in place for it, but i got confused with their config file handling and their irc channel was just ... impolite
[0:50] <orionvm> Haha yeah..
[0:50] <darkfade1> orionvm: were all disks of even size?
[0:50] <orionvm> And the documentation is BEYOND terrible.
[0:50] <darkfade1> haha yes
[0:50] <conner> so are you using native glusterfs clients or using it as a NAS?
[0:50] <orionvm> darkfade1: Unfortunately no.. we had 5 legacy systems as part of the cluster.
[0:50] <darkfade1> and it just doesnt feel elegant, nice or good like ceph.
[0:50] <orionvm> We never made that mistake again.
[0:51] <orionvm> conner: We use the native gluster FUSE client with the gluster patched fuse compiled into the kernel for speed.
[0:52] <orionvm> If you are going to build a Gluster cluster make sure that all your replicate or stripe volumes are exactly the same size.
[0:52] <conner> oh gezbuz
[0:52] <conner> that's a pita
[0:52] <orionvm> Yeah it is.
[0:52] <orionvm> That's why we are trying to move away from Gluster...
[0:52] <conner> well those 8U monsters give me about a year of breathing room
[0:52] <conner> but after that, I need something like ceph or isilon
[0:53] <orionvm> Yeah.
[0:53] <orionvm> Isilon SANs are amazing.
[0:53] <orionvm> But very expensive.
[0:53] <darkfade1> orionvm: ok i luckily felt there was a risk about that, when i saw how it showed me 20g free size and would fail to write a file over 10g. it is logical, in their design, and it sucks
[0:53] <darkfade1> so i just bought the same type of disk over and over
[0:53] <orionvm> Mhmm.
[0:53] <conner> orionvm, they were suggesting it can get down to $800/GB
[0:53] <darkfade1> conner: did you look at the "jackrabbit"?
[0:54] <conner> darkfade1, neg, what's that?
[0:54] <conner> orionvm, sorry, that's per TB
[0:54] <darkfade1> look at http://scalableinformatics.com/jackrabbit i don't really know them, but one of their devs has a blog and wrote they also support gluster as a client or something
[0:54] <darkfade1> and they're fast
[0:55] <darkfade1> but i don't know anyone using them :)
[0:55] <orionvm> They are basically commercial supported Gluster.
[0:55] <darkfade1> the $800/GB make me chuckle
[0:56] <darkfade1> orionvm: are they "just gluster" or is it a box, that can "just do gluster, too">
[0:56] <darkfade1> ?
[0:56] <sagewk> si does both gluster and lustre
[0:56] <darkfade1> lustre is no fun :(
[0:56] <darkfade1> "we assume your storage never fails"
[0:56] <orionvm> Lol.
[0:57] <orionvm> Very true. :P
[0:57] <orionvm> Yeah I have tried to get lustre running on a semi modern kernel, gave up well before I came close.
[0:57] <conner> darkfade1, those look like just x86 servers
[0:57] <darkfade1> we got some arrays that i can swear they'll never fail.
[0:57] <darkfade1> but they're so fast i'd just skip over the storage appliance in front
[0:58] <darkfade1> or the shared fs
[0:58] <conner> orionvm, clusterfs has a real fuck the community attitude these days
[0:58] <darkfade1> yes
[0:58] <conner> I was evaluating them before Mike Shaver bailed out and went to Mozilla
[0:59] <conner> and he was definitely bitter before he left
[0:59] <darkfade1> poor guy
[0:59] <darkfade1> well, i figure anyone here agrees that ceph is far superior, both on attitude and design ;)
[1:01] <orionvm> Yeah Ceph has a very nice design.
[1:02] <orionvm> We have contributed to Gluster over the lifetime our cluster has been running on it and we would be likely to do the same if we can get ceph up to speed.
[1:02] <orionvm> I don't really need 220mb/s out of it.. I just need IOPS.
[1:02] <orionvm> If ceph can provide about 100mb/s and decent IOPS then I am sold.
[1:03] <orionvm> Gluster gets me great pure speed but the IOPS are garbage over bonded gigE even with tons of optimisation in the TCP/IP stack.
[1:05] <orionvm> What I am thinking, is that I build each machine with say 16 disks and 2 SSDs. Use the SSDs to store the journals.
[1:05] <darkfade1> maybe less disks so you can scale to even more iops?
[1:05] <orionvm> If I end up using the actual filesystem part of ceph then I will build the MDS servers on SSDs too.
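The 16-disk + 2-SSD layout orionvm sketches would translate into a ceph.conf along these lines. Paths and sizes are hypothetical and option names may differ between Ceph versions, so check the sample configs that ship with your release:

```ini
; Hypothetical fragment: OSD data on the spinning disks,
; journals on partitions of the two SSDs.
[osd]
        osd data = /data/osd$id             ; one spindle per OSD
        osd journal = /ssd/journal/osd$id   ; journal on the SSD
        osd journal size = 1000             ; journal size in MB
```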
[1:05] <darkfade1> 8 per node?
[1:06] <orionvm> IOPS = spindles, more spindles the better.
[1:06] <darkfade1> yes but i think 8 nodes with 8 disks will be faster than 4 with 16
[1:07] <orionvm> Yeah but 8 with 16 is faster than 8 with 8. :D
[1:07] <darkfade1> okay hehe
[1:08] <orionvm> Atm our nodes run on 8*500GB SATA 2 drives, which is more than fast enough. It's not the IOPS on the disks that are maxing out.
[1:08] <orionvm> But rather Gluster is not really about IOPS, it's more about streaming and bandwidth throughput.
[1:09] <orionvm> It's served us well so far but it's starting to get locked up waiting for IO when we hit around 1500 vms on the cluster.
[1:10] <orionvm> Well it's not really entirely Gluster's fault..
[1:10] <darkfade1> i only planned for 300 so far
[1:10] <orionvm> We have configured it in such a way that gluster only does stripes across 6 nodes.
[1:10] <darkfade1> then my dayjob got to be a 16-hours-a-day job and i had to postpone it all
[1:11] <darkfade1> hmmm some overlapping might be good actually
[1:11] <orionvm> Yeah.
[1:11] <orionvm> We have it stripe a file 6 times.
[1:11] <orionvm> And replicate that 3 times.
[1:11] <orionvm> So reads are blindingly fast.
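For reference, a stripe-over-replicate layout like the 6-stripe / 3-replica one orionvm describes would look roughly like this in a legacy-style Gluster client volfile. Brick names are invented and the exact nesting and translator options vary by Gluster version, so treat this as a sketch rather than a working config:

```
# one 3-way replica set (rep2 through rep6 would be defined the same way)
volume rep1
  type cluster/replicate
  subvolumes brick1 brick2 brick3
end-volume

# stripe each file across the six replica sets (18 bricks total)
volume stripe0
  type cluster/stripe
  subvolumes rep1 rep2 rep3 rep4 rep5 rep6
end-volume
```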
[1:11] <orionvm> I think I mentioned we can max 10gigE.
[1:12] <darkfade1> i might have missed it
[1:12] <darkfade1> thats awesome
[1:12] <orionvm> Only on sequential reads though.
[1:12] <orionvm> yeah.
[1:12] <orionvm> Ahh well, we are going to try out infiniband next week and see if that fixes the IO issues we are seeing.
[1:12] <darkfade1> so how much throughput would i see in a xen vm of yours? with a reading dd
[1:12] <orionvm> Depends on the client.
[1:13] <orionvm> Atm our "compute" clients are only attached via bonded gigE.
[1:13] <orionvm> You get about 130-140mb/s inside a domU.
[1:13] <darkfade1> ah, so you have a storage cluster and the hosts infront
[1:13] <darkfade1> nice, i guess thats already faster than most hosters :)
[1:14] <orionvm> Hmm yeah, I haven't seen anyone over 60-70mb/s before.
[1:14] <orionvm> Linode has the best from the "vps" providers.
[1:15] <orionvm> Rackspace and EC2's clouds have pretty pitiful storage performance.
[1:15] <orionvm> But they have much more compute than us.. they solve different problems.
[1:15] <darkfade1> yes cloud hosting per their kind is quite limited
[1:15] <orionvm> We are more into hosting disk intensive workloads.
[1:15] <orionvm> Stuff like big oracle dbs etc.
[1:16] <orionvm> We can't compete with Amazon for compute so we don't try.
[1:16] <darkfade1> orionvm: i'm trying to become oracle partner with oracle vm and see if there's anything in offering supported oracle hosting
[1:17] <orionvm> Aye I am doing something similar atm.
[1:18] <orionvm> Out of curiosity are you using xen or kvm?
[1:18] <darkfade1> xen
[1:18] <orionvm> Aye.
[1:18] <darkfade1> i want to squeeze in many small vms mixed with a few large ones
[1:18] <darkfade1> and that needs some scalability
[1:19] <orionvm> Hmmm.
[1:19] <darkfade1> orionvm: i'd hope their load patterns are different enough so that everyone gets good speed
[1:19] <orionvm> What are you going to use to manage the vms?
[1:19] <darkfade1> my teeth :(
[1:19] <orionvm> Hahaha.
[1:19] <orionvm> Have you even thought about it? :p
[1:19] <darkfade1> i want DTC as the hosting panel
[1:19] <darkfade1> of course hehe
[1:20] <darkfade1> they'll all end up in cpu / io shaping so I can ensure nothing can kill the host, firstofall
[1:20] <darkfade1> then i want to automate loadbalancing
[1:20] <orionvm> Hmm.
[1:20] <orionvm> I think you should look into a cluster fabric manager like OpenNebula.
[1:21] <darkfade1> the clustering is not [x] finished
[1:21] <orionvm> If you haven't built everything yet you should take a look at it.
[1:21] <darkfade1> yup i'll do
[1:21] <darkfade1> eucalyptus i already looked at
[1:21] <orionvm> Yeah.. Eucalyptus isn't very good.
[1:21] <darkfade1> it won't scale with its network frontend box
[1:21] <orionvm> More of a playtoy in comparison. Too young atm.
[1:22] <orionvm> Yeah.
[1:22] <orionvm> Exactly.
[1:22] <orionvm> OpenNebula powers CERN's monster compute cloud.
[1:22] <darkfade1> and they're ditching it at the moment :(
[1:22] <orionvm> Very extensible.
[1:23] <darkfade1> awww, i don't know. i even thought about using veritas cluster on single-socket boxes :)
[1:24] <darkfade1> do you have a cloudy management software and trust it with the failover of vms?
[1:24] <orionvm> http://lists.opennebula.org/pipermail/users-opennebula.org/2010-April/001886.html
[1:24] <orionvm> We have alot of custom scripts wrapped around OpenNebula that deal with failovers etc.
[1:25] <darkfade1> the funniest thing is the oracle vm manager that comes with oracle vm
[1:25] <darkfade1> it sucks so big time
[1:25] <orionvm> Haha I haven't checked it out yet.
[1:25] <darkfade1> but you just disable one daemon and use oracle vm without the manager and you got the best xen distro ever
[1:25] <orionvm> Yeah?
[1:25] <darkfade1> i think so
[1:25] <orionvm> I will have to take a look at it.
[1:26] <darkfade1> wait for the 3.x beta in 1-2 months
[1:26] <darkfade1> but they coded the overcommit / ram compression and all that
[1:28] <darkfade1> orionvm: random trick question: how do you track the actual load of a dom0
[1:28] <darkfade1> err i mean of the host
[1:29] <darkfade1> because you can track the cpu time seconds of the dom0 and all domUs
[1:29] <orionvm> What do you mean? The actual CPU load or the VCPU load?
[1:29] <darkfade1> but that's just the amount of "used" cpu
[1:29] <darkfade1> yes, actual cpu load
[1:29] <orionvm> I don't think I like overcommitting.
[1:29] <orionvm> It's bad.
[1:29] <orionvm> The reason I like Xen is you can't overcommit memory.. I don't want it implemented heh.
[1:30] <darkfade1> it's already done AND it's been done far better than anywhere else
[1:30] <darkfade1> you can just enable compression and page sharing
[1:30] <darkfade1> and not overcommit beyond that
[1:31] <orionvm> Hmm yeah.
[1:31] <orionvm> Performance hit?
[1:31] <darkfade1> i think it also wasn't intended for usages like the openvz people do, so we're just talking overcommit 10-20% or something (at least i think so)
[1:31] <darkfade1> low, definitely
[1:31] <orionvm> Hmm aight, will have a look at it.
[1:32] <darkfade1> they even had a benchmark that got faster due to higher throughput
[1:32] <orionvm> Hmmm.
[1:32] <orionvm> That is interesting.
[1:32] <darkfade1> you don't have to :)
[1:32] <darkfade1> anyway, i just wonder how i track the real cpu load on a host
[1:32] <darkfade1> it doesn't seem to be in the xen-stats.pl output or any of those
[1:33] <darkfade1> because, it would be the best input for the loadbalancing
[1:33] <darkfade1> if a node reaches 90% cpu, turn on another one and even out
[1:35] <orionvm> Hmm.
[1:35] <orionvm> OpenNebula manages to get it out of it.
[1:35] <orionvm> Have a look at xm top ?
[1:36] <orionvm> The dom0 usage should be indicative of actual system load assuming you haven't pinned dom0 to a single core or something.
[1:36] <darkfade1> normally i'd do that
[1:37] <darkfade1> and it only counts the "used" time for the dom9
[1:37] <darkfade1> err dom0
[1:37] <darkfade1> thats the problem, idle cpu will not show up in dom0, i'm quite sure
[1:38] <darkfade1> errr. but i can take 100 - the totals of the cpu % and
[1:39] <darkfade1> thats it.
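darkfade1's back-of-the-envelope method (take 100 minus the summed per-domain CPU totals that xm top reports) as a tiny Python sketch. The scaling assumption that each domain's figure is relative to one physical CPU is mine, so check it against your own xentop output:

```python
def host_cpu_load(domain_cpu_pct, ncpus):
    """Estimate host-wide CPU load from per-domain CPU% figures.

    Assumes each domain's percentage is relative to a single physical
    CPU, so a 4-core host is fully busy at a summed total of 400%.
    """
    used = sum(domain_cpu_pct.values()) / ncpus
    used = min(used, 100.0)               # clamp sampling jitter
    return {"used_pct": used, "idle_pct": 100.0 - used}
```

With that number in hand, the load-balancing rule darkfade1 wants ("if a node reaches 90% cpu, turn on another one") is a one-line threshold check.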
[1:39] <darkfade1> thank you for making me think a bit longer
[1:40] <orionvm> Aight.
[1:41] <orionvm> Have a look at OpenNebula's xen monitoring driver.
[1:41] <orionvm> It would tell you how it's done if you can read ruby.
[1:41] <darkfade1> cool
[1:42] <darkfade1> and i'll just check out OpenNebula
[1:42] <darkfade1> ruby is nice to read
[1:44] <darkfade1> ah, i mixed up the thing about opennebula
[1:45] <darkfade1> nasa is ditching eucalyptus that's what i read hehe
[1:47] <orionvm> Yeah and no.
[1:47] <orionvm> Basically NASA wrote a fabric manager called Nova.
[1:47] <orionvm> Which they never actually used because it was buggy and unscalable.
[1:47] <orionvm> They then adopted Eucalyptus, which is also buggy and unstable but less so than Nova.
[1:48] <darkfade1> lol
[1:48] <orionvm> OpenStack is now being built on the leftovers of Nova, some code from Eucalyptus and some stuff Rackspace tacked onto it as far as I can tell.
[1:48] <darkfade1> you have a nice way of describing things.
[1:49] <orionvm> I would stay away from all 3 in a production environment.
[1:49] <orionvm> Well unfortunately if your fabric manager decides to fail you have no way of scheduling vms, getting info from vms or basically working with your cluster.
[1:49] <darkfade1> yup :)
[1:49] <orionvm> It's one of the most important pieces of software in your VM stack and not something you want to fail under any circumstances.
[1:50] <orionvm> OpenNebula is the most mature, most fully featured etc.
[1:51] <orionvm> OpenStack has no current deployments that I am aware of.
[1:51] <orionvm> Eucalyptus has only been deployed in large scale at NASA where it has shown that it is not viable.
[1:51] <orionvm> Nova has never been used by anyone.
[1:51] <darkfade1> what is the one that disappeared, hmmm
[1:51] <orionvm> Enomalism?
[1:51] <darkfade1> yeah!
[1:52] <darkfade1> the ex-xen management tool that now is a cloud thing
[1:52] <darkfade1> ex-opensource, too
[1:52] <orionvm> The developer pulled the community edition and went closed source.
[1:52] <orionvm> yeah.
[1:52] <orionvm> There is also XCP or Xen Cloud Platform.
[1:52] <orionvm> But XCP is very very early stages of development atm.
[1:52] <orionvm> It's 0.1 atm.
[1:53] <darkfade1> my employer is a citrix partner and we use xenserver there
[1:53] <orionvm> OpenNebula is nearing its 2.0 release. My cloud is actually running on their 1.4 release and we are testing 2.0 beta atm.
[1:53] <orionvm> Never actually used XenServer.
[1:53] <darkfade1> i might say i don't like anything that comes out of there
[1:53] <orionvm> Lol.
[1:53] * gregphone (~gregphone@ has joined #ceph
[1:55] <darkfade1> how long are you having that business now?
[1:56] <darkfade1> and did it grow or did you start with the current setup?
[1:56] <orionvm> Hmm we have been going for a few years as pure consultancy etc.
[1:56] <orionvm> We started with a much smaller system.
[1:56] <orionvm> We are currently in our Mk3 system.
[1:56] <darkfade1> cool
[1:56] <orionvm> I am designing our Mk4 system atm.
[1:56] <darkfade1> was it always with opennebula?
[1:56] <orionvm> Yes, we started with OpenNebula from the start.
[1:57] <orionvm> OpenNebula used to be the only one there was.
[1:57] <darkfade1> i see
[1:58] <orionvm> What are you intending to use for shared storage?
[1:58] <orionvm> Storage is the biggest challenge in large scale virtualised environments.
[2:02] <darkfade1> can't decide yet. if i have to "bite the apple" it'd be buying from amplidata or get the old DMX 800 from $customer
[2:02] <darkfade1> i guess the dmx will cost as much in power as the amplidatas in purchasing
[2:02] <darkfade1> otherwise i'll go with gluster now and switch to ceph once it has less warnings that say "experimental"
[2:03] <darkfade1> the dmx would handle the iops easily, 2x80GB cache
[2:03] <darkfade1> but its heavy and hungry
[2:03] <orionvm> Lol.
[2:03] <orionvm> Gluster?
[2:03] <orionvm> Gluster just needs big caches.
[2:04] <darkfade1> err no, enterprise array
[2:04] <orionvm> Ahh yeah.
[2:04] <darkfade1> one of those where i actually know it will really, really not fail
[2:05] <darkfade1> but i wanted to scale out, make all xen hosts also a storage box
[2:05] <darkfade1> thus not let them have high load (better for customers anyway)
[2:06] <darkfade1> each get infiniband + dual gige
[2:08] * orionvm (3cf29034@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[2:08] <darkfade1> ajax.. ^^
[2:15] * darkfade1 (~floh@host-82-135-62-109.customer.m-online.net) Quit (reticulum.oftc.net kinetic.oftc.net)
[2:19] * gregphone (~gregphone@ has left #ceph
[2:21] * darkfade1 (~floh@host-82-135-62-109.customer.m-online.net) has joined #ceph
[3:36] * Guest32 (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Remote host closed the connection)
[3:36] * bbigras (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[3:37] * bbigras is now known as Guest139
[6:34] * conner (~conner@leo.tuc.noao.edu) Quit (Ping timeout: 481 seconds)
[6:38] * conner (~conner@leo.tuc.noao.edu) has joined #ceph
[7:03] * f4m8_ is now known as f4m8
[7:52] * Jiaju (~jjzhang@ Quit (Remote host closed the connection)
[8:00] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[8:05] * Jiaju (~jjzhang@ has joined #ceph
[8:26] <klp> anyone alive?
[9:06] <wido> klp: yes
[9:07] * MarkN (~nathan@ has left #ceph
[9:07] * MarkN (~nathan@ has joined #ceph
[9:18] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Ping timeout: 480 seconds)
[9:27] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[10:41] * allsystemsarego (~allsystem@ has joined #ceph
[10:50] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[11:36] <todinini> where do I get the source code for the rbd for a 2.6.35 kernel?
[15:06] <wido> todinini: what do you mean? it's in ceph-client.git in the rbd branch
[15:46] * f4m8 is now known as f4m8_
[16:11] <todinini> wido: did you ever try to reduce the mem usage of the mds in the config.cc mds_cache_size?
[16:21] <wido> yes, a few months ago, but never noticed any difference
[16:26] <todinini> hmm, that's bad, because I have quite a problem with the mem usage of the mds
[16:31] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Quit: Verlassend)
[16:50] <wido> todinini: try the tcmalloc branch
[16:50] <todinini> wido: where do I find it?
[16:50] <wido> todinini: http://tracker.newdream.net/issues/138
[16:50] <wido> git checkout -b tcmalloc origin/tcmalloc
[16:51] <wido> you will have to install some extra libraries, but then with tcmalloc you will have some real memory reduction
[16:51] <wido> ldd /usr/bin/cmds should show the binary linked against tcmalloc and libunwind
[16:51] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[16:53] <todinini> wido: I will give it a try
[16:55] <todinini> wido: do you know make-kpkg? I have a weird '+' problem with it.
[16:57] <wido> todinini: no, never used it
[16:58] <wido> but i got to go
[16:58] <todinini> ok, ttyl
[17:51] * gregphone (~gregphone@ has joined #ceph
[18:12] <wido> gregphone: any idea when tcmalloc will go into unstable? would save me some time merging with unstable ;)
[18:12] <gregphone> N
[18:12] <gregphone> It shouldn't be long, just need to set up the packaging bits for it
[18:13] <wido> yes, i was thinking about that
[18:13] <gregphone> I'll ask Sage about it when I get to the office
[18:13] <wido> but i don't see another way than making a separate tcmalloc package
[18:13] <wido> or do you want to make tcmalloc default? No i guess?
[18:14] <gregphone> I think we're going to set it to recommends, or something?
[18:14] <gregphone> I don't really get much of the packaging system beyond apt-get install, update, remove
[18:14] <wido> well, you have a Build-Depends rule
[18:15] <wido> the packages it depends on when building, but when you place the tcmalloc packages in there, it will depend on them when building and thus build a tcmalloc package
[18:15] <gregphone> The config/compiler stuff will build with it if it's there, detecting that is easy
[18:15] <wido> yes, but then you will end up with a tcmalloc and non-tcmalloc package
[18:16] <gregphone> ?
[18:16] <wido> while building the compiler checks if there is a tcmalloc library
[18:17] <wido> after compiling you will build the package, so it depends what your package will have based on the system it was compiled on
[18:17] <wido> if the system has tcmalloc, you will have a tcmalloc package
[18:17] <gregphone> Ah
[18:17] <wido> http://goog-perftools.sourceforge.net/doc/tcmalloc.html < "You can use tcmalloc in applications you didn't compile yourself, by using LD_PRELOAD: "
[18:17] <wido> didn't try that though
[18:18] <wido> "LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage. "
[18:18] <gregphone> The build scripts that go out in the source package will detect tcmalloc if it's installled, and then set the makefile accordingly
[18:19] <wido> true, but when you distribute Ceph in binary form, it is with or without tcmalloc
[18:19] <wido> not everybody will build from source
[18:19] <gregphone> I think we're Giotto use it by default
[18:19] <gregphone> *going to use it
[18:20] <wido> ah, ok, was already Googling :P
[18:20] <gregphone> Assuming we don't discover any problems the memory use improvements mean we want everybody on it
[18:20] <wido> i had one crash with tcmalloc on a OSD, backtrace showed me it was due to tcmalloc, but i removed the core dump
[18:21] <gregphone> Was it due to tcmalloc or just in the library!
[18:21] <gregphone> ?
[18:21] <wido> in the library, tcmalloc
[18:21] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[18:21] <gregphone> Keep the core next time if you could
[18:22] <wido> yes, stupid of me to remove it, was on the wrong machine
[18:22] <gregphone> If tcmalloc is broken that's sad
[18:22] <gregphone> And it could just be that we broke something subtle, which is sad and we want to fix
[18:22] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Remote host closed the connection)
[18:22] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[18:22] <wido> might be, hope i see it again, so we can find out what went wrong
[18:41] * gregphone (~gregphone@ Quit (Quit: Rooms • iPhone IRC Client • http://www.roomsapp.mobi)
[18:45] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[18:45] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[18:48] <sagewk> wido: trying to track down cause of osd11's memory usage
[19:00] <todinini> where do I find the rbdtool user space tool?
[19:00] <sagewk> it was renamed to just 'rbd'
[19:01] <todinini> thanks
[19:03] * tjikkun (~tjikkun@195-240-122-237.ip.telfort.nl) Quit (Read error: Operation timed out)
[19:12] <todinini> if i try to create a rbd image i get this error
[19:13] <todinini> root@node44:~# rbd create foo --size 1024
[19:13] <todinini> failed to assign a block name for image
[19:13] <todinini> create error: Operation not supported
[19:16] <wido> sagewk: today i managed to get osd5 recovered, i added 2GB of swap and reduced the recovery ops from 5 to 3
[19:17] <wido> after a lot of crashes (OOM) the cluster finally recovered
[19:17] <wido> but right now i'm trying to add osd8 and osd11 again, but they suffer from the same
[19:18] <wido> todinini: try "ceph class list"
[19:18] <wido> and cclass -a
[19:19] <wido> todinini: http://tracker.newdream.net/issues/322
[19:24] <todinini> wido: thanks
[19:40] <todinini> can I mount more than one rbd device on a node?
[19:42] <gregaf> hmmm, I think so but I'm not sure, sagewk probably knows
[19:42] * nolan (~nolan@phong.sigbus.net) Quit (Server closed connection)
[19:42] <sagewk> wido: ok. i suspect reducing recovery ops will do the trick. i think there is a problem with the journal throttling
[19:42] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[19:43] <sagewk> todinini: as many as you want.. cat the sysfs list file to see which ones are mapped
[19:45] <todinini> I get this error on the second image
[19:45] <todinini> root@node44:~# echo " name=admin rbd myimage2" > /sys/class/rbd/add
[19:45] <todinini> -bash: echo: write error: No such file or directory
[19:46] <todinini> root@node44:~# cat /sys/class/rbd/list
[19:46] <todinini> 0 251 client4532 rbd myimage
[19:46] <sagewk> does the myimage2 image exist?
[19:46] <sagewk> rbd list
[19:47] <todinini> rbd list
[19:47] <todinini> myimage
[19:47] <todinini> myimage2
[19:48] <sagewk> hmm, not sure about that one. i can take a closer look a little later
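(As sagewk notes, mapped images show up in the sysfs list file. A self-contained sketch of that file's line format and how a script might pick it apart — the field meanings are inferred from the output above, and the second line is hypothetical, showing what a successfully mapped second image might look like:)

```shell
# Each line of the old /sys/class/rbd/list interface appears to be:
#   <slot> <major> <client> <pool> <image>
list="0 251 client4532 rbd myimage
1 252 client4532 rbd myimage2"

# Split each line into its fields and report the block device major number.
echo "$list" | while read slot major client pool image; do
    echo "pool=$pool image=$image -> block device major $major (slot $slot)"
done
```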
[20:07] <wido> sagewk: tried reducing the recovery ops, but 3 was still too much i think
[20:07] <wido> i'll lower it to one for now
[20:11] <sagewk> wido: i'm tracking down a memory leak, although i'm not sure that it's necessarily the culprit.
[20:13] <wido> ah, ok. But it is using a lot of memory, isn't it?
[20:13] <wido> todinini: still getting the No such file or directory message?
[20:16] <darkfade1> hi sagewk good evening / morning / something :)
[20:16] <sagewk> hey
[20:17] <gregaf> morning here on the US' west coast :)
[20:18] <darkfade1> cool, your project is perfectly aligned for a night-worker in germany hehe
[20:19] <darkfade1> and yesterday someone from .au was around. so we can check the 24x7 reachability :p
[20:20] * tjikkun (~tjikkun@2001:7b8:356:0:204:bff:fe80:8080) has joined #ceph
[20:21] <gregaf> haha, I don't think you're going to hit us between midnight and 7am our time
[20:24] <darkfade1> oh, i meant check as an item
[20:24] <darkfade1> so someone will be awake in here at all times
[20:24] <darkfade1> not all of us users checking if you'll reply to bug reports 24x7 hehehehe
[20:25] <wido> darkfade1: you german?
[20:25] <darkfade1> yes
[20:26] <wido> ah, lots of german, dutch and belgian people here
[20:26] <wido> gregaf: sometimes i see some commits from you guys, when it is just morning for me, so you have to be working late some days
[20:27] <gregaf> occasionally we'll get stuff in at ~11:45pm, but I don't think I've done/seen any work after that
[20:50] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) Quit (Quit: This computer has gone to sleep)
[20:56] * allsystemsarego (~allsystem@ has joined #ceph
[21:02] <sagewk> wido: btw, the clock drift tolerance config option was renamed, mon_lease_wiggle_room -> mon_allowed_clock_drift
[21:06] <sagewk> er, mon_clock_drift_allowed
[21:06] <darkfade1> sagewk: should it still be around "1"?
[21:06] <sagewk> depends on how well ntp works for you :)
[21:07] <darkfade1> hehe
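(For reference, the renamed option would go in ceph.conf along these lines — a sketch only: the section placement and the unit are assumptions, and the value of 1 is just the figure darkfade1 asks about above, not a recommendation:)

```
[mon]
        ; formerly mon_lease_wiggle_room: how much clock drift
        ; between monitors to tolerate, in seconds (assumed unit)
        mon clock drift allowed = 1
```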
[21:28] <wido> sagewk: i did not update to the latest unstable yet
[21:28] <wido> since i'm on the tcmalloc branch
[21:28] <wido> didn't see any fixes that would apply for me directly
[21:30] <wido> but the clock drift is pretty annoying btw, had an hourly cron with ntpdate, didn't work, now openntpd isn't working either
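(One possible workaround sketch for the drift wido describes: run ntpdate more often than his hourly cron and let it step the clock. The file path, interval, and server below are placeholders, not from the log:)

```
# /etc/cron.d/ntpdate-step (hypothetical): step the clock every 10 minutes
*/10 * * * * root /usr/sbin/ntpdate -b pool.ntp.org >/dev/null 2>&1
```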
[21:32] <sagewk> yeah sorry for the silent option rename, just want to get things more consistent sooner rather than later
[21:33] <sagewk> i pushed a memory leak fix that _might_ be related, although probably not. there's also a logging fix that will help track things down a bit
[21:34] <sagewk> going to try one other thing on osd11 in the meantime
[21:40] <wido> sagewk: i read the commit messages every morning, my ritual for the past few months :-) But go ahead with osd11, i'll be afk shortly
[21:40] <sagewk> k
[21:40] <wido> but note, (i might be repeating myself), it was not only osd11, osd6 and osd8 were having the same issues today
[21:41] <sagewk> yeah, osd11 is just the one where i can reproduce it right now :)
[21:46] * atg (~atg@please.dont.hacktheinter.net) Quit (Quit: -)
[21:53] <wido> i'm afk
[21:53] <wido> ttyl
[22:12] <sagewk> wido: okay, i think i found the problem. pushed
[23:28] * iggy (~iggy@theiggy.com) Quit (Server closed connection)
[23:28] * iggy (~iggy@theiggy.com) has joined #ceph
[23:40] * f4m8_ (~drehmomen@lug-owl.de) Quit (resistance.oftc.net charon.oftc.net)
[23:40] * DLange (~DLange@dlange.user.oftc.net) Quit (resistance.oftc.net charon.oftc.net)
[23:41] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[23:41] * f4m8_ (~drehmomen@lug-owl.de) has joined #ceph

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.