#ceph IRC Log


IRC Log for 2010-08-11

Timestamps are in GMT/BST.

[0:01] <sage> hmm, the part that worries me is osdc half of 'messenger and osdc changes for rbd'..
[0:02] <sage> it might be better to wait until we know how much refactoring is appropriate
[0:02] <yehudasa> yeah
[0:03] <sage> just the monc one then? the pool lookup is trivial enough to leave out for now.
[0:04] <yehudasa> yeah
[0:04] <sage> if we can do that quickly enough, we can send a second pull request
[0:05] <sage> i just don't want to miss the boat for everything else.
[0:05] <yehudasa> right..
[0:05] <yehudasa> there's some whitespace cleanup patch that I see also
[0:06] <yehudasa> oh, it's probably on the unstable branch.. forget it
[0:06] <sage> yeah
[0:10] <jantje> yea i've read them (quickly)
[0:11] <jantje> I just haven't read your thesis, it just has too large space between lines
[0:11] <jantje> +s
[0:11] <yehudasa> heh.. you probably haven't seen my masters thesis.. was 100 pages printed, could be squeezed into probably around 5
[0:13] <sage> the uc formatting requirements were pretty ugly
[0:30] <jantje> re-export it to IEEE standards :)
[0:30] <jantje> yehudasa: mine was about the same size, no big deal
[0:36] <yehudasa> oh, it's probably on the unstable branch.. forget it:bd
[0:36] <yehudasa> hmm.. wrong window
[1:03] * homebrewmike (~homebrewm@adsl-75-17-203-115.dsl.euclwi.sbcglobal.net) has joined #ceph
[1:20] <yehudasa> sage: you there?
[1:20] <sage> yeah
[1:21] <yehudasa> I just pushed rbd-separate branch, that moves rbd.c to drivers/block
[1:21] <yehudasa> moves most of the include files to include/linux/ceph
[1:21] <yehudasa> and include/linux/ceph/crush
[1:22] <yehudasa> it creates a separate module, rbd.ko
[1:22] <yehudasa> (updated Kconfig, makefiles too)
[1:24] <sage> not too bad
[1:24] <sage> yeah, not that many symbols
[1:25] <yehudasa> yeah, most of it was updating the includes
[1:25] <sage> looks like it leaves the old headers behind in fs/ceph? should git rm those
[1:25] <yehudasa> I left a few headers behind
[1:25] <sage> the patch didn't remove the moved ones tho
[1:25] <yehudasa> hmm.. strange
[1:26] <yehudasa> I did a "git mv"
[1:26] <sage> oh weird
[1:26] <yehudasa> yeah, strange
[1:26] <sage> the rbd kconfig should be updated to not say 'ceph will include rbd'
[1:27] <sage> should it be select CEPH_FS instead of depends, i wonder? otherwise it won't show up until you first enable CEPH_FS i think
[1:28] <yehudasa> hmm.. I'm not sure
[1:30] <yehudasa> really strange
[1:31] <yehudasa> instead of the header files moving, it just copied them, losing all history
[1:31] <yehudasa> I'll try to redo this part
[1:33] <sage> if you just git rm them, and then merge the two commits with rebase -i it should figure it out
[1:33] <sage> the rename stuff is detected by diff when looking at the patch.. it's not part of the actual commit metadata
[1:33] <yehudasa> oh
[1:33] <yehudasa> that'll be easier then
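(Editor's sketch, not part of the original conversation.) The point sage makes above, that a rename is inferred by diff rather than recorded in the commit, can be checked in a throwaway repository. The file names and the scratch directory below are made up for illustration; only standard git commands are used.

```shell
#!/bin/sh
# Sketch: git stores snapshots; renames are inferred when two trees are diffed.

# Build a scratch repo (paths and file names here are hypothetical).
demo_repo=$(mktemp -d)
cd "$demo_repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p fs/ceph include/linux/ceph
printf 'line %s\n' 1 2 3 4 5 6 7 8 > fs/ceph/types.h
git add fs/ceph/types.h
git commit -q -m "add header"

# "Move" the header without git mv: plain copy, then remove the old path.
cp fs/ceph/types.h include/linux/ceph/types.h
git rm -q fs/ceph/types.h
git add include/linux/ceph/types.h
git commit -q -m "move header"

# With rename detection (-M) the diff reports a rename (status starts with R).
rename_status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$rename_status"

# With detection turned off, the same commit shows as a delete plus an add.
plain_status=$(git diff --no-renames --name-status HEAD~1 HEAD)
echo "$plain_status"
```

The commit is identical in both cases; only the way the diff is presented changes, which is why removing the stale copies and squashing is enough to make the patch show up as a move.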
[1:39] <sage> i guess the next question is what would it take to get from here to libceph.. that also exports everything needed by the fs bits.
[1:39] <sage> would need to move some of the super.c bits into another file, but the code itself wouldn't need much in the way of changes, right?
[1:40] <yehudasa> yeah
[1:40] <yehudasa> pushed update patch, btw
[1:42] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[1:42] <sage> much better
[1:43] <sage> yeah i think that's the way to go.. the key being you can load rbd and not get the filesystem registered. which is basically what they were both asking for.
[1:48] <yehudasa> ahh.. at the moment you still need to load ceph.ko, hence the filesystem will be registered
[1:48] <sage> yeah.
[1:49] <sage> let's see how hard it is to move the non-vfs parts into lib/ceph... hopefully the separation won't be too bad (i.e. mainly some super.c surgery)
[1:51] <yehudasa> is lib/ceph really the place for it? there's nothing remotely related to storage there
[1:52] <sage> hmm...
[1:53] <yehudasa> actually.. it's more related to net/
[1:53] <sage> yeah
[1:53] <sage> it's also pretty trivial to move around at that point if someone has a better idea
[1:54] <yehudasa> like net/sunrpc
[1:56] <sage> and net/9p
[1:56] <yehudasa> right
[1:56] <sage> k
[1:57] <yehudasa> basically we need the *_client.c there
[1:57] <yehudasa> and whatever derives from it
[1:58] <sage> i think everything but the vfs bits: addr, dir, inode, file, xattr, super
[2:01] <yehudasa> caps?
[2:01] <sage> yeah
[2:01] <sage> that's closely tied to mdsc.
[2:02] <sage> ...unless we keep mdsc in fs/ceph, but that would require deeper changes in how the ceph_client is structured. :/
[2:04] <yehudasa> it makes sense to move mdsc, as we also move monc and osdc
[2:04] <sage> yeah
[2:05] <sage> it does put lots of struct dentry/inode related code in net/ though
[2:05] <sage> monc/osdc don't touch those
[2:05] <sage> but mdsc is all mixed up in the fs namespace
[2:05] <yehudasa> hmm
[2:06] <yehudasa> yeah, I'd rather leave that stuff under fs/ceph
[2:07] <sage> yeah. there's actually only one monc->mdsc reference (handle_map). we could probably add a simple way to chain in message handlers for that case.
[2:07] <sage> which just leaves the super.c mount stuff. maybe the user (rbd | ceph) can handle those interactions
[2:10] <yehudasa> locks, snap, export?
[2:10] <sage> all fs/ceph
[2:10] <yehudasa> mdsmap?
[2:10] <sage> only needed by mdsc, so fs/ceph
[2:17] <yehudasa> sage: I'm off, pushed some preliminary stuff
[2:17] <sage> ok! see you tomorrow
[2:17] <yehudasa> yep!
[2:28] * ghaskins_mobile (~ghaskins_@66-189-114-103.dhcp.oxfr.ma.charter.com) has joined #ceph
[3:08] * Guest1299 (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) Quit (Remote host closed the connection)
[3:09] * bbigras (quasselcor@bas11-montreal02-1128531598.dsl.bell.ca) has joined #ceph
[3:10] * bbigras is now known as Guest32
[3:31] * homebrewmike (~homebrewm@adsl-75-17-203-115.dsl.euclwi.sbcglobal.net) Quit (Quit: HydraIRC -> http://www.hydrairc.com <- Chicks dig it)
[6:54] * f4m8_ is now known as f4m8
[7:34] * mtg (~mtg@vollkornmail.dbk-nb.de) has joined #ceph
[8:54] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) Quit (Quit: Osso)
[9:35] * allsystemsarego (~allsystem@ has joined #ceph
[11:46] * sakib (~sakib@ has joined #ceph
[12:10] * sakib (~sakib@ Quit (Quit: leaving)
[13:24] * kblin (~kai@mikropc7.biotech.uni-tuebingen.de) Quit (Quit: kernel update)
[13:57] <todinini> how could I limit the memory use of a cmds? because the node always goes into swap, the node has 1G RAM and 5G swap
[15:46] * f4m8 is now known as f4m8_
[16:04] * mtg (~mtg@vollkornmail.dbk-nb.de) Quit (Quit: Verlassend)
[16:18] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Quit: WeeChat 0.2.6)
[16:18] <iggy> todinini: that's one of ceph's weak spots atm... I think they plan to work on it in the future
[17:14] * gregphone (~gregphone@ has joined #ceph
[17:17] * orionvm (3cf29034@ircip1.mibbit.com) has joined #ceph
[17:17] <orionvm> Heya. :)
[17:18] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[17:19] <orionvm> Looking to play with Ceph as a testbed with about 15 nodes. Anyone have a suggestion for distro? I am thinking Debian Squeeze atm but open to suggestions. :)
[17:25] * gregphone (~gregphone@ Quit (Quit: Rooms • iPhone IRC Client • http://www.roomsapp.mobi)
[17:25] * gregphone (~gregphone@ has joined #ceph
[17:26] <gregphone> orionvm: we do all our development on debian so squeeze is a good choice
[17:26] <orionvm> Awesome. :)
[17:27] <gregphone> You'll want to keep up-to-date with the kernel client though
[17:27] <orionvm> Yep, I am pretty used to rolling my own kernel so that shouldn't be a problem.
[17:27] <gregphone> So either building out of the ceph-client-standalone backplate branch
[17:28] <gregphone> Or just using 2.6.35/36 and building the whole thing
[17:28] <orionvm> Yep aight.
[17:28] <gregphone> Err, backport, not backplate
[17:28] <orionvm> Haha. :P
[17:28] <orionvm> It's cool I read it as backports. ;P
[17:29] * klp (~lbz@c-24-8-7-136.hsd1.co.comcast.net) has joined #ceph
[17:29] <gregphone> iPhone autocorrection is usually pretty good but sometimes makes odd choices
[17:29] <orionvm> I am planning to use it as backend storage for Xen vms.
[17:29] <gregphone> Storing images as files?
[17:29] <gregphone> Or using end?
[17:29] <orionvm> Yeah I have only just got an Android client and it's got the same problems.
[17:29] <gregphone> RBD
[17:29] <orionvm> That is what I was going to ask.
[17:30] <orionvm> Has anyone used rbd to back xen vm disks before?
[17:30] <orionvm> I have seen it used with KVM/QEMU but couldn't find references to Xen.
[17:30] <gregphone> We use KVM here and it seems to be good backing those
[17:30] <orionvm> Sweet. :)
[17:31] <gregphone> It's just a block device so I don't think it should matter
[17:31] <orionvm> Yep exactly my thoughts.
[17:31] <gregphone> But I could be missing something, I haven't worked with rbd at all
[17:31] <orionvm> Fair enough, I will set a system up over the next few days and document my findings anyways. :)
[17:32] <gregphone> Cool
[17:32] <orionvm> We are getting in infiniband gear soonish so will try Ceph with IP over IB.
[17:32] <orionvm> Atm we are using Gluster with file backed vms.
[17:32] <gregphone> I think you probably haven't seen it used with Xen because KVM seems to be getting more popular with new projects
[17:33] <orionvm> Hmm yeah..
[17:34] <orionvm> I think Xen will catch up once dom0 support is mainline and stable.
[17:34] <orionvm> Its architecture is far more favourable than KVM.
[17:35] * Osso (osso@AMontsouris-755-1-2-32.w86-212.abo.wanadoo.fr) has joined #ceph
[17:35] <gregphone> I haven't looked into them much myself
[17:35] <orionvm> Aye fair enough.
[17:35] <gregphone> Other people are running that show
[17:35] <orionvm> Hehe. :P
[17:36] <orionvm> How many nodes has ceph been tested on before?
[17:36] <gregphone> Recently only about 30
[17:36] <orionvm> Aight kk.
[17:37] <gregphone> Couple years ago Sage got to use a cluster of several hundred for a while to test scalability
[17:37] <orionvm> I know it has "not production ready" stamped all over it but would you have an objective opinion on how stable ceph is?
[17:37] <gregphone> That's documented in his thesis if you want to read about it
[17:37] <orionvm> Yep I certainly will.
[17:37] <gregphone> Depends what you're doing with it
[17:38] <orionvm> Basically we are looking for a new distributed image store to back some IaaS infrastructure.
[17:38] <orionvm> GlusterFS has served us well but it has a few issues, like the fact that it's not block based that makes it a little annoying at times.
[17:38] <gregphone> Just running rbd on btrfs I think is pretty good, we're planning to roll out a beta KVM service on that in a month or two
[17:38] <orionvm> Yep.
[17:39] <gregphone> Anything involving the mds gets a lot less happy; it's way more complicated
[17:39] <orionvm> Yeah I don't plan to run a filesystem over it if I can avoid it.
[17:39] <orionvm> I just want to make use of the distributed object store.
[17:40] <gregphone> But it's definitely not done yet; I think we might have introduced a memory leak recently, *sigh*
[17:40] <orionvm> Haha yeah. >.<
[17:41] <orionvm> I was contemplating developing an object store because I have done some kernel development before.
[17:41] <orionvm> But after memories of debugging that.. I decided against it pretty fast. -_-
[17:41] <gregphone> Haha, yeah
[17:42] <gregphone> Are you looking at any other alternatives?
[17:42] <orionvm> I took another look at Lustre...
[17:42] <orionvm> That doesn't look attractive, too much code to port to a decent kernel version. -_-
[17:43] <gregphone> Yeah
[17:43] <orionvm> There really isn't another alternative, that's the problem.
[17:43] <gregphone> It's hard to actually find good info on too
[17:43] <orionvm> Yeah that too.
[17:43] <gregphone> I tried once and all I could get was their fairly useless white paper
[17:43] <orionvm> The issue is we don't want a filesystem.
[17:43] <orionvm> Hahah yeah, I have got it running.
[17:44] <gregphone> I of course think you should use ceph-based solutions
[17:44] <orionvm> Only on archaic versions of CentOS and RHEL though.
[17:44] <orionvm> Lol. :P
[17:44] <gregphone> But you might look at sheepdog too?
[17:44] <orionvm> Yeah I did.
[17:44] <orionvm> Looks like a toy compared to Luster and RADOS though. Not to be insulting but it's the impression that I get.
[17:44] <orionvm> *lustre
[17:45] <gregphone> What's it missing?
[17:45] <orionvm> A solid implementation I think, the theory is good. Consistently hashed object store is not an old idea and it would be how I would do it.
[17:46] <orionvm> But they don't have a kernel level driver and don't expose a block device.
[17:46] <orionvm> There is no client other than a test client that isn't that useful and a KVM/QEMU driver.
[17:46] <gregphone> Ah
[17:46] <orionvm> Like it would be good, just in maybe a year or so.
[17:47] <gregphone> I didn't realize it was so KVM focused
[17:47] <orionvm> Yeah, it's a hurdle for us.
[17:47] <orionvm> We could convert to KVM but I think it would be most unwise.
[17:48] <orionvm> Xen has much better performance and scalability.
[17:48] <orionvm> Not to mention security.
[17:48] <orionvm> So it's just not an option at this point in time.
[17:48] <orionvm> We need something more generic.
[17:48] <iggy> bollocks
[17:48] <orionvm> Yeah?
[17:48] <orionvm> Isn't KVM shared kernel?
[17:49] <iggy> no, it's on par with xen hvm
[17:50] <orionvm> Ahh yeah I remember now.
[17:50] <gregphone> We ran some tests that showed them to be pretty equivalent, at least in webserver roles
[17:50] <gregphone> KVM has come a long way in the last year or so
[17:50] <orionvm> We use Xen para-virt.
[17:50] <orionvm> Which we are able to squeeze pretty damn nice performance out of.
[17:50] <iggy> there are benchmarks that show kvm being faster than xen pv with a modern cpu (I would assume architecturally xen hvm could do the same)
[17:51] <orionvm> I must say I haven't had much to do with KVM in the last 7-8 months.
[17:51] <iggy> modern cpu = cpu with ept/npt
[17:52] <orionvm> Hmm yeah.
[17:52] <orionvm> We are going to be doing a full upgrade so I will look into it.
[17:52] <gregphone> I doubt it's worth trying to convert a company from xen to KVM just for storage though
[17:52] <orionvm> Yeah but we have a fully agnostic cluster manager.
[17:52] <iggy> yeah, don't get me wrong... I wasn't saying that
[17:53] <gregphone> I imagine there's a lot of specific knowledge involved?
[17:53] <iggy> if you are happy with xen, stick with it
[17:53] <orionvm> Yeah there is a fair bit hehe. :P
[17:53] <orionvm> I have been with Xen for a while now, since migrating from VMWare.
[17:53] <orionvm> Treated me well.
[17:53] <iggy> I was just commenting on the performance/scalability
[17:53] <orionvm> Just documentation is shit. :P
[17:54] <orionvm> Haha yeah fair enough, I was still under the impression that it's somewhat slower than paravirtualisation but I guess I should keep my mouth shut. :P
[17:54] <iggy> orionvm: semi-OT... does regular xen support live migration of hvm linux guests? (xenserver doesn't)
[17:54] <orionvm> Yes.
[17:55] <orionvm> iggy: Though I guess "depends
[17:55] <iggy> figured so... /me picks bone with citrix
[17:55] <orionvm> Issue is you can do a lot of cool things with HVM.
[17:55] <orionvm> Like back it directly to a PCI device etc.
[17:55] <orionvm> Which would be messy when trying to live migrate.
[17:56] <orionvm> You would have to make sure you make all the resources available under the same aliases on both hosts before you attempt a migrate.
[17:56] <iggy> well, yeah... kvm has the same issues
[17:56] <iggy> as would vmware/etc.
[17:56] <orionvm> Aye, just thought I would mention it rather than giving you a flat yes. :)
[17:58] <orionvm> A lot of that stuff can be scripted though. We do some cool stuff with our LVM backed VMs that get direct access to a 10GE nic.
[17:59] <orionvm> They are hvm because they run Windows but needed to give them direct nic access to put them on a physically segregated network + performance.
[17:59] <orionvm> Long and the short of it is you can do anything you want if you take the time to write the required wrappers.
[17:59] <iggy> I just hate having to pv'ify all my linux guests just to be able to migrate them (not the least is the fact that I can't seem to manage to do it with centos5.5 at all)
[18:00] <orionvm> Hmm I have never used XenServer unfortunately.
[18:01] <orionvm> Mhmm, more on topic though.
[18:02] <iggy> yeah, sorry, just figured I'd ask someone who seemed clueful of xen while I had the chance
[18:02] <orionvm> If I am just using RADOS as a block store what daemons am I going to need?
[18:02] <orionvm> Hehe it's cool. :P
[18:03] <gregphone> OSD and monitors
[18:03] <orionvm> Sweet. :D
[18:03] <orionvm> No metadata = awesome.
[18:03] <gregphone> Which are cosd and cmon
[18:03] <orionvm> Metadata makes me cry. :(
[18:13] <orionvm> Ok cool, so what I need is 15 osds 3 monitors.
[18:13] <orionvm> Correct me if I am wrong but all I need on the client side is the ceph kernel module to create and attach rbd devices?
[18:13] <orionvm> Well not create, I need the userspace tools for that.
[18:14] <gregphone> I think that's right
[18:14] <orionvm> Cheers gregphone
[18:14] <orionvm> Going to start playing tomorrow I think it's late here, just pushing out the Debian image now.
[18:15] <gregphone> I'm not actually sure, though, as it's changing right now based on comments from bigger kernel maintainers
[18:15] <gregphone> ;)
[18:15] <orionvm> Hehehe kk. :P
[18:15] <orionvm> Yeah that's the other problem with GlusterFS heh.
[18:15] <gregphone> Oh, where are you from?
[18:15] <orionvm> Australia. :)
[18:16] <gregphone> Ah, nice
[18:16] <orionvm> Gluster can't keep their codebase still. -_-
[18:16] <gregphone> Really late for you then
[18:16] <orionvm> There is stuff in their documentation that isn't even implemented...
[18:16] <gregphone> Yikes
[18:16] <orionvm> Not -was- and is now removed, but never actually implemented.
[18:16] <orionvm> Yeah.
[18:17] <orionvm> We are using it atm because its performance really is awesome over about 100 nodes.
[18:17] <gregphone> We do have a couple of users in the netherlands in here who can sometimes answer questions too, thought you might be from there
[18:17] <orionvm> haha yep. :)
[18:18] <orionvm> Ahh that's a question that comes to mind.
[18:18] <orionvm> Is there any interest to support Infiniband as a native transport?
[18:18] <gregphone> A few people have asked
[18:19] <orionvm> We are getting some IB gear very shortly to use with GlusterFS.
[18:19] <gregphone> We don't have any hardware though
[18:19] <orionvm> Hmm kk cool.
[18:19] <orionvm> Hmm yeah I can see that as an issue hehe.
[18:19] <orionvm> SDR gear is cheap now, HCAs at about $150AU which would be pretty cheap in most places.
[18:20] <orionvm> Australia is usually one of the most expensive places to get IT equipment.
[18:20] <gregphone> I don't do any project planning really but I don't think that's something we're likely to implement unless somebody pays us
[18:20] <orionvm> Haha yep. :P
[18:20] <orionvm> Well that being said we were thinking about implementing our own object store on Infiniband.
[18:21] <gregphone> But it's possible that once we hit 1.0 that "somebody"="Lawrence Livermore" or something
[18:21] <orionvm> If RADOS turns out to suit our needs I am pretty sure Infiniband support is something we would be interested in contributing back.
[18:21] <orionvm> Haha yep. :P
[18:21] <gregphone> That would be awesome too
[18:21] <orionvm> Yeah we are pretty committed to contributing back to opensource.
[18:22] <gregphone> The messenger layer is fairly well isolated so that's a feasible task for somebody to do without spending a year figuring out how the rest works ;)
[18:22] <orionvm> Aight awesome. :)
[18:22] <orionvm> I will checkout the git tree tomorrow and get my hands dirty I guess.
[18:23] <gregphone> Heh, good luck!
[18:23] <orionvm> Hehe I think I am going to need it. :P
[18:23] <gregphone> And thanks for considering Ceph ;)
[18:23] <orionvm> Haven't done such low level stuff for a while lol hehe.
[18:23] <orionvm> Yeah well the way I see it is Ceph is in the kernel.
[18:24] <orionvm> Means less custom modules for me to maintain as a sysadmin. :D
[18:24] <gregphone> Heh, there's a post about that on scalability.org which made us happy
[18:24] <orionvm> Ahh I read that haha.
[18:25] <orionvm> They failed to mention that Xen support is now in the kernel though whilst touting KVM lol.
[18:25] <orionvm> Xen domU support has been in there for a while now, dom0 is just a more complicated issue.
[18:25] <gregphone> I didn't know that
[18:26] <orionvm> Yeah a lot of people seem to disregard it hehe. -_-
[18:26] <orionvm> It's called paravirt operations.
[18:26] <orionvm> So effectively extensions to the linux kernel that allow it to directly interact with the linux kernel.
[18:26] <gregphone> He seemed to think KVM would be in soon though, given how fast the distributions are picking it up
[18:26] <orionvm> *xen hypervisor
[18:27] <orionvm> Aye, it's more so that Xen fell behind in supporting the latest kernels.
[18:27] <wido> netherlands reporting!
[18:27] <orionvm> Hi wido. :)
[18:27] <wido> hi :)
[18:27] <wido> i haven't read the full conversation, but saw something about the netherlands?
[18:28] <orionvm> Lol, greg mentioned that there is a few of you guys in the Netherlands. :)
[18:28] <orionvm> And that I might have come from there lol.
[18:28] <gregphone> orionvm said it was late and I thought he might be near you
[18:28] <wido> ah, ok ;)
[18:28] <gregphone> But he's from Australia
[18:28] <orionvm> And no, I don't ride a kangaroo to work. -_-
[18:29] <wido> well, i don't wear wooden shoes or have a windmill in my backyard
[18:29] <orionvm> Hehe good to know. :P
[18:30] <wido> but i have to go, i'll be back in a few hours
[18:30] <orionvm> Aight. :)
[18:30] <orionvm> Where would I find some more RADOS specifi documentation?
[18:31] <orionvm> *specific
[18:32] <sagewk> there's some rbd howto stuff in the wiki. or are you looking for architecture/design type info?
[18:34] <orionvm> Yep, mainly internals, how to deal with OSDs entering/leaving the cluster stuff like that. :)
[18:35] <orionvm> Other stuff like how objects are mapped to OSDs etc would be nice too but I will be leaving most of the algorithm level stuff to the more dev-like people hanging around.
[18:39] <sagewk> for the most part the system handles all that for you. how it works from an admin perspective is covered in the wiki. for how the osds do their magic, the best reference is the osdi06 paper in the publications section on the ceph web site
[18:41] <orionvm> Sweet thanks sagewk. :)
[18:42] <orionvm> I am checking out the source now hehe, probably going to be way over my head lol.
[18:42] * gregphone (~gregphone@ Quit (Quit: Rooms • iPhone IRC Client • http://www.roomsapp.mobi)
[18:47] <orionvm> Hmm aight, I think I got some of it under wraps. Looks like I am going to need to read all the papers in the publications section.
[18:47] <orionvm> Thanks for all your help guys, I need to go get some sleep now hehe.
[18:48] <orionvm> Take care and hopefully be back to chat in a few hours once I have absorbed a few of these papers. :D
[18:48] <sagewk> not all of them, many are tangential. the rados and osdi06 are the most directly related. the crush one if you care.
[18:48] <sagewk> wido: are you seeing lots of OOM with tcmalloc? :/
[18:49] <orionvm> Aight. :)
[18:49] <sagewk> wido: i haven't been able to reproduce the mislinked dentry replay problems from the korg rsync workload. i wonder if you can help out with that one? #329
[18:51] <orionvm> sagewk: I think wido mentioned he would be back in a few hours.
[18:52] <sagewk> yeah, i just want to get it out there before I forget about the things i want to ask him :)
[18:53] <orionvm> Haha fair enough. :)
[18:54] <orionvm> Reading the CRUSH paper it seems that it can easily be configured to be rack aware.
[18:54] <sagewk> that's the idea
[18:54] <orionvm> Are these options exposed in any way atm?
[18:54] <orionvm> Looks damn sweet either way.
[18:54] <sagewk> you have to export, manually modify, and reimport the map. no user friendly process.
[18:54] <orionvm> Ahh yep yep.
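(Editor's sketch, not part of the original conversation.) The export, hand-edit, reimport cycle sagewk describes might look roughly like the following. The `ceph osd getcrushmap` / `setcrushmap` and `crushtool -d` / `-c` invocations are an assumption based on how the Ceph tools commonly expose this; the exact syntax in the tree discussed here may differ, and the file names are made up. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the CRUSH map round trip: export, decompile, edit, compile, import.
# Assumption: the ceph/crushtool command syntax below matches your version.

# Echo the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

crush_roundtrip() {
    work=$1   # hypothetical working directory for the map files
    run ceph osd getcrushmap -o "$work/crush.bin"
    run crushtool -d "$work/crush.bin" -o "$work/crush.txt"
    # ... edit $work/crush.txt by hand here (e.g. describe racks) ...
    run crushtool -c "$work/crush.txt" -o "$work/crush.new"
    run ceph osd setcrushmap -i "$work/crush.new"
}

crush_roundtrip /tmp/crush
```

Setting `DRY_RUN=0` would execute the commands against a live cluster instead of printing them; that is best tried on a test cluster first.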
[18:55] <orionvm> Yeah we do some nasty stuff to Gluster atm to make it rackaware.
[18:58] <orionvm> Is it possible to add OSDS to the config without downtime?
[18:59] <sagewk> yes
[18:59] <orionvm> Awesome. :)
[19:00] <orionvm> Both very painful things to do under GlusterFS. -_-
[19:01] <orionvm> We ended up being unable to do it actually, we had 30mins downtime to expand our production cluster.
[19:04] <orionvm> Last question before I -really- should head to bed.
[19:05] <orionvm> We have do site to site async replication to another set of GlusterFS nodes. Would this be prohibitively difficult to configure with Ceph?
[19:05] <orionvm> *have to
[19:06] <orionvm> We actually have another whole cluster in another datacentre connected via about 1Gbit/s of fibre atm. Looking to extend that to 10Gbit/s in the near future.
[19:13] <sagewk> there isn't a mechanism to do that currently, but it would be possible to add with a bit of careful thought.
[19:13] <iggy> there was some discussion on the mailing list about this
[19:14] <iggy> worth reading
[19:15] <orionvm> Aight will add myself to the mailing list etc.
[19:15] <orionvm> Cheers for all the help guys, been awesome. :)
[19:15] <orionvm> Night! (over here at least.:P)
[19:16] <gregaf> later!
[19:17] <iggy> any bets on whether he pops back in before "really going to sleep this time"?
[19:17] <orionvm> Eh!
[19:18] <orionvm> I am going to sleep, just closing stuff down. -_-
[19:18] <orionvm> Sad to know I am so predictable though. -_-
[19:21] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has left #ceph
[19:21] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:21] <orionvm> Actually yeah I don't know if I should even bother sleeping tonight, it's 3:20 am here and I need to leave at 6am lol.
[19:23] <orionvm> Speaking of which I just found sample.txt in the src/crush directory. That looks like quite reasonable/friendly syntax to me.
[19:23] * sagewk wishes his body still let him get away with that
[19:23] <orionvm> sagewk: Hehe yeah, I am still young in the tooth you could say. :P My body does what I tell it to atm hehe.
[19:24] <gregaf> my body's never let me do that — I think I pushed it to 30 hours a couple times but it normally crashes on me after about 22 no matter what I do
[19:25] <orionvm> Haha I have done some pretty crazy stuff with sleep. -_-
[19:25] <orionvm> Used to adhere to a polyphasic sleeping schedule if you have heard of it.
[19:25] <orionvm> Very fun if you have a flexible enough lifestyle. :P
[19:26] <orionvm> Anyways, is this sample.txt I stumbled across the kind of network map I would need to generate to have ceph understand my network topology?
[19:27] <sagewk> right
[19:28] <orionvm> Awesome, sounds quite manageable. :)
[19:40] <orionvm> In terms of rbd related stuff, can I have the same rbd device attached to 2 machines writeable at the same time?
[19:40] <orionvm> Assuming locking etc is not an issue.
[19:40] <sagewk> yeah, you just have to be careful when using snapshots. and use a cluster fs like ocfs2/gfs2 or something :)
[19:41] <orionvm> Yep, I actually don't need a filesystem on it because I will be using it to back Xen vms but yep understood. :)
[19:47] <orionvm> Bleh, are the man pages available online?
[19:48] <sagewk> nope
[19:48] <gregaf> well, if you can read through all the man-page formatting you can look in the repo
[19:49] <gregaf> http://ceph.newdream.net/git/?p=ceph.git;a=tree;f=man;h=0af9f864040fb19e8ef27763ea259d19613dd96f;hb=refs/heads/unstable
[19:49] <orionvm> Yeah I already pulled the git repo haha.
[19:49] <orionvm> Wait I can just build the package.
[19:49] <orionvm> Hmm no still need to install it. -_-
[19:50] <sagewk> man ceph/man/whatever.8
[19:51] <orionvm> Cheers. :)
[19:55] <orionvm> Are "disks" created with rbdtool actually pools?
[19:56] <orionvm> Or are they something different entirely?
[19:56] <sagewk> no, just a set of objects within some pool. many virt disks can go in each pool
[19:57] <orionvm> Aight, I see why I can't just snapshot them now.
[19:57] <sagewk> you can, but it's per disk, not per pool.
[19:57] <orionvm> Ahh kk yep.
[19:58] <orionvm> Oh ok now I look like a dumbass, blog post was saying "will get snapshotting" wiki says it's implemented. :) All is well!
[19:59] <orionvm> Just setting up puppet server now so should be rolling out some packages soon and playing around. :D
[20:00] <orionvm> I love new toys haha.
[21:24] <iggy> orionvm: have you heard of "Although good at informing other kernel developers what he is up to, actually collaborating with them to work up some of his enhancements to a kernel-ready state, or to improve the scheduler and other subsystems already in the kernel in areas where his code might offer benefits, does not seem to be Kolivas' strong suit" was just catching up on some news and saw
[21:24] <iggy> something about it
[21:38] <wido> sagewk: did you have those libvirt patches somewhere? I could see if i can test them and even make some debs of them for the Debian and Ubuntu users
[21:38] <wido> would make testing qemu-kvm much easier
[21:43] <orionvm> iggy: Kolivas is an Australian, he lives in Melbourne though which is like 8 hours from here.
[21:43] <orionvm> Pretty cool dude, I use his BFS patches on my video/audio crunching box.
[21:45] <iggy> fsck
[21:45] <iggy> mispaste
[21:46] <iggy> orionvm: Rackspace's Cloud Files offering called 'OpenStack Object Storage'.
[21:46] <orionvm> Ahhh yeah.. that.
[21:46] <orionvm> That is pre pre beta code.
[21:46] <orionvm> Not something you would even put near a production system.
[21:47] <orionvm> It hasn't even hit 0.1 as far as I know.
[21:47] <orionvm> It's caused a lot of fuss in the Cloud IaaS scene.
[21:47] <orionvm> It's mostly hype though.
[21:48] <orionvm> We tried to get OpenStack running here for fun.
[21:48] <orionvm> It just doesn't compare in features or scalability to our current cluster fabric controller though.
[21:48] <sagewk> wido: git://ceph.newdream.net/git/libvirt.git has what we've been using (not much). probably not worth packaging
[21:48] <orionvm> And the object store performance leaves a lot to be desired.
[21:49] <wido> sagewk: ok, i'll check it out. I was looking into fully implementing rbd into libvirt
[21:49] <orionvm> If you are interested in the cloud scene take a look at OpenNebula.
[21:49] <wido> so you can manage rbd via libvirt too, even create RBD images through virt-manager
[21:49] <wido> would be pretty cool :)
[21:49] <orionvm> That would be pretty rad.
[21:50] <wido> OpenNebula, never took the time to check it out
[21:50] <wido> should do it some time
[21:50] <sagewk> wido: yeah :) just be warned, i think we tried to send something for libvirt rbd upstream recently and they weren't interested, wanted to generalize the striping or something some other way. i wouldn't expect them to take anything until the qemu-rbd driver goes into qemu.
[21:51] <wido> i'm pretty sure they won't accept it. But i'm already running a custom libvirt, made some changes for our env, so i don't mind packaging a separate version
[21:51] <sagewk> cool
[21:52] <wido> btw, when comparing Ceph and RADOS, how "stable" would you classify RADOS?
[21:52] <sagewk> "more?" :)
[21:53] <gregaf> Ceph's built on top of RADOS
[21:53] <gregaf> so however unstable RADOS is, when you add in the MDS you get more problems
[21:53] <wido> yes, i know. But a lot of crashes you see are MDS or somehow related to the kclient
[21:53] <sagewk> "beta" is probably fair?
[21:53] <wido> using only RADOS has given me a pretty stable system, most of the time it is Ceph which is crashing it all
[21:53] <wido> yes, i thought so
[21:54] <gregaf> we're going to attempt to use RBD for a KVM offering here which we're rolling out as a beta I think at the end of the month
[21:55] * orionvm (3cf29034@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[21:56] * orionvm (3cf29034@ircip1.mibbit.com) has joined #ceph
[21:56] <orionvm> Back again!
[21:58] <orionvm> Can I build up a test stack on a single machine?
[21:58] <orionvm> About to try anyways, will see what happens. :)
[21:59] <gregaf> use the vstart.sh script
[21:59] <gregaf> you can vary the number of daemons it starts up with CEPH_NUM_[MON,MDS,OSD]
[21:59] <gregaf> eg CEPH_NUM_MON=3 CEPH_NUM_MDS=1 CEPH_NUM_OSD=1 ./vstart.sh
[21:59] <gregaf> you need to create the directories for each OSD by hand, though
[22:00] <gregaf> mkdir dev
[22:00] <gregaf> mkdir dev/osd0; mkdir dev/osd1; etc
[22:01] <orionvm> Aight.
[22:01] <gregaf> that's how I do most of my testing, although it'll be slow with all the journals and stuff running off one disk
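A minimal sketch of the manual step gregaf describes: it creates one dev/osdN directory per OSD before vstart.sh is run. The default of 3 OSDs and the variable name num_osd are assumptions for illustration. The script only prints the vstart.sh invocation instead of running it, since vstart.sh lives inside a Ceph source checkout.

```shell
#!/bin/sh
# Number of OSDs to prepare for; defaults to 3 (an assumed value) if CEPH_NUM_OSD is unset.
num_osd="${CEPH_NUM_OSD:-3}"

# Create dev/osd0 .. dev/osd(N-1). mkdir -p also creates dev itself
# and does not fail if a directory already exists.
i=0
while [ "$i" -lt "$num_osd" ]; do
    mkdir -p "dev/osd$i"
    i=$((i + 1))
done

# Print the invocation from the discussion rather than executing it.
echo "CEPH_NUM_MON=3 CEPH_NUM_MDS=1 CEPH_NUM_OSD=$num_osd ./vstart.sh"
```

Run it from the directory that contains vstart.sh, then execute the printed command line.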
[22:01] <orionvm> Tis cool.
[22:01] <orionvm> I just need to get a feel for the config and how it all interacts. :)
[22:02] <orionvm> Will be scripting the configuration through puppet once I have got the hang of it.
[22:04] <wido> btw, like mysql has a mysqld_safe process to start a crashed mysqld process, would that be something for Ceph?
[22:04] <wido> for example: restart a crashed OSD, for a maximum of X (configurable) times
[22:04] <gregaf> you mean like specific recovery versions of each of the daemons?
[22:04] <gregaf> ah
[22:05] <orionvm> Hmm I use pacemaker for that sort of stuff.
[22:05] <wido> that could be an option too
[22:06] <orionvm> Pacemaker is pretty industry standard for HA stuff now.
[22:06] <iggy> init
[22:06] <iggy> /upstart/whatever fedora is using now
[22:06] <orionvm> Yeah there is that.
[22:07] <orionvm> But it's not stateful.
[22:07] <orionvm> That's the problem.
[22:07] <orionvm> If it actually dies due to say sigkill it will bring it back up.
[22:07] <wido> hmm, that might be an option too, should check that out too.
[22:07] <orionvm> But if it's locked or it's lost its connection to the rest of the OSDs, for instance, init isn't and can't be made aware.
[22:08] <sagewk> 'restart on core dump = true' in ceph.conf will use the crun wrapper
[22:08] <sagewk> there's no limit counter tho
[22:08] <orionvm> Haha fair enough.
[22:09] <gregaf> that won't handle non-dump death scenarios though, like if it hits OOM killer or something
[22:10] <sagewk> yep
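A rough sketch of what wido asks for: a wrapper that restarts a command after a failure, up to a configurable maximum. This is not the crun wrapper sagewk mentions and not part of Ceph. The function name run_with_restarts and the RESTART_DELAY variable are hypothetical names chosen for this example.

```shell
#!/bin/sh
# run_with_restarts MAX CMD [ARGS...]
# Hypothetical helper: re-runs CMD after each non-zero exit, at most MAX times.
run_with_restarts() {
    max_restarts="$1"
    shift
    restarts=0
    while :; do
        "$@"
        status=$?
        # A clean exit means the command stopped on purpose; do not restart it.
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # Stop once the restart budget is used up and report the last status.
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts (last status $status)" >&2
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "exit status $status; restart $restarts of $max_restarts" >&2
        # Pause between attempts; RESTART_DELAY is an assumed knob, default 1 second.
        sleep "${RESTART_DELAY:-1}"
    done
}
```

The wrapped command has to stay in the foreground so the wrapper can see its exit status. A process killed by the OOM killer exits with status 137 in the shell (128 + SIGKILL), so this loop would restart it, which covers the case gregaf raises. It still cannot detect a daemon that hangs or loses contact with its peers, which is the limitation orionvm points out for init.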
[22:12] <orionvm> Mhmm.
[22:14] <orionvm> Has anyone done any performance testing on rbd yet?
[22:14] <orionvm> I am itching to see what kind of speeds I can get from 15 OSDs lol.
[22:15] <iggy> there were some reports on the list of less than expected performance
[22:15] <gregaf> wido's running it on a similarly-sized cluster, I think
[22:16] <orionvm> Ahh kk cool, will be interesting to see I guees.
[22:16] <orionvm> *guess
[22:16] <gregaf> yeah, on the mailing list he said he was getting 30-35MB/s on his previous cluster, though I don't remember what his previous cluster was
[22:17] <orionvm> I just had these same machines setup with the latest unstable release of gluster and pushed 1.5GiB/s over bonded gigE.
[22:17] <wido> i still want to give RBD try on this cluster
[22:17] <gregaf> it's a fair bit slower than running the full filesystem since it doesn't enable any caching whatsoever
[22:17] <orionvm> Ahh I see.
[22:17] <wido> but i'm still too lazy to switch to the RBD branch ;)
[22:17] <gregaf> due to the consistency requirement
[22:17] <gregaf> *requirements
[22:18] <orionvm> So I would get more throughput running the full filesystem and using filebacked vms?
[22:18] <wido> but my current cluster has some issues with the various disks i have, still have some really slow disks
[22:18] <orionvm> Ahh kk yep.
[22:18] <orionvm> I have a pretty decent test cluster, 4x500GB WD Blacks in each machine.
[22:18] <wido> next week i'll have some new hw and start giving qemu-kvm (which is also RBD) a try
[22:18] <gregaf> orionvm: hmmm, it's possible
[22:18] <wido> yeah, i'm bugging some people to get budget for new disks
[22:19] <orionvm> Aye for sure.
[22:19] <orionvm> Hmm.
[22:19] <orionvm> I was thinking of designing a really dumb block store just to use as VM storage.
[22:19] <wido> well, i have some RBD tests, with the S3 Gateway i get about 38MB/sec read
[22:19] <wido> almost the same i have with regular Ceph
[22:19] <orionvm> Hmm kk.
[22:20] <wido> but my writes are 152MB/sec
[22:20] <orionvm> What links are you running that over?
[22:20] <orionvm> Wow.
[22:20] <wido> GigE where the clients have two links in bonding
[22:20] <orionvm> That's a huge difference.
[22:20] <wido> xor mode with hash mode 1
[22:20] <orionvm> Ahh xor.
[22:20] <orionvm> I am using balance-rr.
[22:20] <orionvm> With some arptables hacks to make it go faster lol.
[22:21] <wido> gregaf: with tcmalloc i still have OSD's invoking the OOM killer during a degraded cluster recovery
[22:21] <wido> but it's much less than without tcmalloc
[22:22] <gregaf> yeah
[22:22] <gregaf> I didn't get to trying to reproduce your MDS issues yesterday but I was talking with Sage about memory issues
[22:23] <wido> if it's easier for you, you are always welcome to log on to my cluster and do some tests where
[22:23] <wido> there*
[22:24] <gregaf> have you changed your testing at all recently, or did the memory pressure just get a lot worse recently?
[22:24] <wido> no, the test like always is rsync'ing kernel.org
[22:24] <orionvm> wido: Any idea why your read performance is so asymmetrical?
[22:24] <wido> which was most of the time not possible without tcmalloc, but with it is
[22:25] <wido> orionvm: no, i really have no idea
[22:25] <wido> might be the read speed of one OSD which is slowing things down
[22:25] <orionvm> Hmm I see.
[22:25] <sagewk> orionvm: i suspect it's a matter of tuning and optimizing readahead
[22:25] <gregaf> but I mean you weren't hitting the OOM killer until recently, were you?
[22:25] <orionvm> readahead is awesome.
[22:26] <wido> gregaf: with the MDS i haven't seen much OOM, most of them with the OSD
[22:26] <wido> but the MDS will hang much of the time when it starts to swap, will just hang or commit suicide
[22:26] <wido> and the hang is what i see right now when trying to do the find or just rsync kernel.org again
[22:28] <sagewk> wido: but the osd OOM is a recent thing?
[22:30] <wido> not really, will happen most of the time when my cluster tries to recover from a degraded state, most of the times it can't due to OSD's keep going OOM
[22:30] <wido> but the OSD which is going OOM right now has only 1GB of memory
[22:30] <wido> but, with 4GB swap
[22:32] <orionvm> OOM = out of memory yeah?
[22:32] <sagewk> ah. that may just be the recovery code not being careful about memory.
[22:32] <wido> frustrating thing is, it gets to 0.005% degraded and then an OSD goes OOM
[22:33] <wido> jumps back to 1.8%, then i start the OSD, it slowly gets to 0.005% and then another will go OOM
[22:33] <gregaf> orionvm: yea
[22:34] <wido> but recovery seems to be a heavy thing too, cluster becomes unworkable, might be because of the OSD's starting to swap. CPU's seem waiting for 60%
[22:36] <sagewk> wido: oh, that's interesting... i expected the memory intensive part to be peering actually, not the data migration. there must be a memory leak in there
[22:36] <wido> right now two OSD's got marked out and down since they were too slow, mostly because they were swapping.
[22:37] <wido> have to restart them to get it working again, but this way it will never recover and get clean again
[22:37] <sagewk> is it usually the same ones that OOM?
[22:37] <wido> yes, right now the two OSD which were down the whole day because of the corrupt pg
[22:37] <wido> in the meantime i uploaded about 70GB of data
[22:38] <sagewk> k
[22:39] <wido> more memory would work i think
[22:39] <sagewk> it'd hide this particular problem at least :)
[22:39] <sagewk> o
[22:40] <sagewk> can i play with the cluster now?
[22:40] <wido> oh yes, i wanted to go off anyway
[22:40] <sagewk> ok!
[22:40] <wido> if one OSD starts to reboot (reboot on panic is on), you will have to run btrfsctl -a
[22:40] <wido> and then mount the OSD dir, i'm using btrfs-stripe now
[22:40] <wido> the OSD dir is in the fstab with "noauto"
[22:42] <wido> last thing i was seeing, sometimes OSD's start to complain about keys not being correct, etc, etc, most of the time this is also during a recovery. Restarting them works
[22:42] <sagewk> interesting, ok
[22:42] <wido> something with cephx, some OSD's then refuse to talk to each other
[22:42] <sagewk> yeah. i saw that once but haven't been able to reproduce it.
[22:42] <wido> and start marking them down
[22:42] <wido> causing some weird behaviour
[22:43] <wido> btw, it's node06 and node12 who are giving the troubles now (OOM / slow)
[22:43] <wido> afk now, ttyl
[22:43] <sagewk> ok thanks for the info!
[22:53] <orionvm> Bleh stupid debian and being old. :(
[22:54] <gregaf> build issues?
[22:54] <orionvm> They froze squeeze at 2.6.32. :(
[22:54] <orionvm> Nah just the default kernel is older than it really should have been.
[22:54] <gregaf> ah
[22:54] <orionvm> I really would have thought they would have chosen 2.6.35.
[22:54] <orionvm> Ahh well, Debian is as Debian does lol.
[22:55] <darkfade1> orionvm: i read that 2.6.32 will be long-term maintained by gkh
[22:55] <darkfade1> thats why they stuck with it
[22:55] <orionvm> Ahh fair enough.
[22:55] <darkfade1> but i think it's stupid nonetheless
[22:55] <orionvm> yeah.
[22:55] <darkfade1> because it'll come out outdated from the start
[22:55] <orionvm> Exactly..
[22:55] <darkfade1> and then it will just become more outdated over time
[22:55] <orionvm> .35 is already stable mainline.
[22:55] <orionvm> Aye I know.
[22:56] <orionvm> Stupid debian oldness.
[22:56] <darkfade1> coffee and work now or try to sleep and do it from 6-10 am
[23:39] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:46] <orionvm> Bleh must be doing something silly.
[23:46] <orionvm> Keep getting *** No rule to make target `modules'. Stop.
[23:51] <orionvm> Yep.. just a dumb moment.. default kernel dir is not correct.. -_-

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.