#ceph IRC Log


IRC Log for 2011-02-14

Timestamps are in GMT/BST.

[0:05] * johnl waits for build
[0:16] <johnl> hey greg, is each object on an osd stored in its own file on disk?
[0:16] <greglap> yep
[0:17] <greglap> each PG gets a directory (or a couple, if they're pre-hashed) and each object in the PG is a file in that dir
[0:17] <johnl> is the max size of an object on an osd limited to the size of the free space on that osd?
[0:18] <johnl> or is each object striped across osds?
[0:18] <greglap> objects aren't striped, no, so yes, they're limited to the OSD size
[0:18] <johnl> ah ic
[0:19] <greglap> thus the need for balanced placement bla bla bla ;)
[0:19] <johnl> but the ceph filesystem itself stripes across multiple objects, right?
[0:19] <johnl> so you can store large files
[0:19] <johnl> larger than the free space on any one osd I mean
[0:20] <greglap> in principle, yes
[0:20] <johnl> does the rados s3 gateway stripe objects?
[0:21] <greglap> I don't think it does right now
[0:21] <greglap> yehuda's working on a big upgrade of rgw right now though; I'm not sure exactly where it's at
[0:21] <johnl> what about the rados block device?
[0:22] <johnl> seem to recall that uses many objects per "device"
[0:22] <greglap> well that stripes block devices across smallish objects
[0:22] <johnl> cool
[0:22] <greglap> the default is 4MB
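[editor's note] A minimal sketch of the striping arithmetic discussed above, assuming fixed-size objects and the 4MB default greglap mentions. The function names `locate` and `split_extent` are hypothetical and are not part of any Ceph API; real RBD object naming and layout are not modelled here.

```python
# Hypothetical illustration: mapping block-device byte offsets onto
# fixed-size backing objects. The object size is an assumption taken from
# the "default is 4MB" remark in the log.
OBJECT_SIZE = 4 * 1024 * 1024


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object_index, offset_within_object) for a device byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def split_extent(offset, length, object_size=OBJECT_SIZE):
    """Split a byte range into per-object (index, start, length) pieces."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, object_size)
        # A piece stops at the object boundary or at the end of the range,
        # whichever comes first.
        chunk = min(object_size - start, end - offset)
        pieces.append((index, start, chunk))
        offset += chunk
    return pieces
```

A read or write that crosses an object boundary becomes several smaller requests, one per object, and each object can sit on a different OSD. That is why a striped device can be larger than the free space on any single OSD even though an individual object cannot.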
[0:27] <johnl> oop, compile finished. lemme test your assert fix
[0:28] <johnl> oh bollocks, i386. need amd64
[0:29] <johnl> sigh.
[0:32] <johnl> you guys taken a look at the rackspace "swift" object store thing?
[0:38] <greglap> yeah; it's part of openstack
[0:38] <darkfader> johnl: i read some of the manuals but didnt really understand if one can mount it in /dev as a block device
[0:41] <johnl> darkfader: nah, it's not for that. more of an S3 clone. just a http fronted replicated object store
[0:41] <johnl> you'd need to build an equivalent of rbd for it. swiftbd :)
[0:41] <johnl> as far as I understand anyway
[0:42] <darkfader> (ok so yet another http post/get filesystem)
[0:42] <johnl> yer. though at least this one has been in production at rackspace for a while
[0:42] <darkfader> yeah thats a big plus, no doubt
[0:42] <johnl> though I don't really like the design. seems a bit convoluted.
[1:01] <johnl> greg: build still running here and I gotta get some sleep. will test tomorrow. sorry!
[1:15] <johnl> nn
[1:15] <greglap> night
[2:43] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) Quit (Read error: Operation timed out)
[4:25] * chrisrd (~chrisrd@ Quit (Read error: No route to host)
[5:37] * orzzz (~orzzz@ has joined #ceph
[6:09] * orzzz (~orzzz@ Quit ()
[7:33] * alexxy (~alexxy@ has joined #ceph
[7:33] * alexxy[home] (~alexxy@ Quit (Read error: Connection reset by peer)
[8:14] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[8:35] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) has joined #ceph
[9:13] * allsystemsarego (~allsystem@ has joined #ceph
[9:17] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) Quit (Quit: Yoric)
[9:24] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) Quit (Ping timeout: 480 seconds)
[9:33] * tjikkun (~tjikkun@82-168-5-225.ip.telfort.nl) has joined #ceph
[10:01] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[10:06] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[10:46] * Yoric (~David@ has joined #ceph
[13:10] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Read error: Connection reset by peer)
[13:10] * verwilst (~verwilst@router.begen1.office.netnoc.eu) has joined #ceph
[14:10] * Yoric_ (~David@ has joined #ceph
[14:10] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[14:10] * Yoric_ is now known as Yoric
[14:11] * Yoric (~David@ Quit ()
[14:11] * Yoric (~David@ has joined #ceph
[14:59] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[15:05] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[15:29] * ao (~ao@ has joined #ceph
[15:30] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[15:53] * Yoric_ (~David@ has joined #ceph
[15:53] * Yoric (~David@ Quit (Read error: Connection reset by peer)
[15:53] * Yoric_ is now known as Yoric
[16:15] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[16:29] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:00] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) has joined #ceph
[17:14] * greglap (~Adium@cpe-76-90-239-202.socal.res.rr.com) Quit (Quit: Leaving.)
[17:42] * morse (~morse@supercomputing.univpm.it) has joined #ceph
[17:44] * verwilst (~verwilst@router.begen1.office.netnoc.eu) Quit (Quit: Ex-Chat)
[17:53] * greglap (~Adium@ has joined #ceph
[17:55] * bchrisman (~Adium@c-98-207-207-62.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:57] <greglap> johnl: did you get to check out your cluster yet?
[17:58] * Tv (~Tv|work@ip-66-33-206-8.dreamhost.com) has joined #ceph
[17:59] <greglap> also, how did 777 turn out? were you continuing your write-a-bunch-of-files test when you got 803, or did it finish successfully?
[18:16] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[18:40] * greglap (~Adium@ Quit (Quit: Leaving.)
[19:00] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has left #ceph
[19:00] * gregaf (~Adium@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:04] * WesleyS (~WesleyS@ has joined #ceph
[19:08] * bchrisman (~Adium@70-35-37-146.static.wiline.com) has joined #ceph
[19:09] * ao (~ao@ Quit (Quit: Leaving)
[19:12] * cmccabe (~cmccabe@ has joined #ceph
[19:13] * joshd (~joshd@ip-66-33-206-8.dreamhost.com) has joined #ceph
[19:17] * Yoric (~David@ Quit (Quit: Yoric)
[20:22] * Yoric (~David@dau94-10-88-189-211-192.fbx.proxad.net) has joined #ceph
[20:33] * frankl_ (~frank@november.openminds.be) has joined #ceph
[20:34] <frankl_> hi all
[20:34] <frankl_> i was looking for info regarding xen-rbd performance?
[20:35] <gregaf> hi
[20:35] <frankl_> is it good enough to run xen domU's on?
[20:35] <gregaf> you mean RBD?
[20:36] <Tv> frankl_: in general, we're still just ramping up the benchmarking side of things, but as far as i understand rbd performance is typically good enough that you're more likely to need to worry about stability, for now
[20:36] <gregaf> rbd is doing pretty well
[20:36] <gregaf> I think we have a few people setting up real systems on that
[20:36] <gregaf> if you want to put block devices on ceph files, probably not
[20:39] <prometheanfire> I was testing it, works with kvm well
[20:39] <prometheanfire> I need to test speed though
[20:40] <frankl_> Tv: gregaf: my new gen storage for shared servers will be ceph, but I was pondering how to do vm's themselves (eg /usr, /var, ...)
[20:40] <frankl_> and (by extension) customer vm's
[20:40] <frankl_> that don't need the shared /ceph mount
[20:41] <Tv> frankl_: well, the rbd layer is much simpler than ceph itself, so kvm using rbd for block device storage is going to be the most reliable of the possible ways
[20:41] <gregaf> well RBD is a block device built on RADOS, and is much simpler and stabler than the full-POSIX Ceph
[20:41] <frankl_> gregaf: but isn't it a lot slower?
[20:41] <gregaf> I'm not sure how easy it is to use with Xen though -- it's mostly part of the KVM ecosystem
[20:41] <frankl_> as it would translate block-device operations to an object store?
[20:42] <gregaf> hmmm, probably a little bit? it's not doing intelligent readahead and stuff
[20:42] <prometheanfire> I'm waiting to see if it is in kvm-0.14
[20:42] <Tv> frankl_: yeah, there's basically a qemu host-side driver that talks the rados protocol
[20:42] <prometheanfire> I think readahead might be the responsibility of kvm
[20:42] <Tv> frankl_: it won't exist on Xen, but might be reasonably easy to port
[20:42] <cmccabe> I thought that libvirt also worked with xen
[20:42] <gregaf> it stripes a block device across rados objects
[20:43] <cmccabe> doesn't rados have libvirt bindings or something
[20:43] <gregaf> and of course you only talk to it when the fs wants to talk to its block device
[20:43] <Tv> cmccabe: libvirt works with kvm, xen & lxc, but mostly by abstracting over their differences; that doesn't mean the underlying features are the same
[20:44] <gregaf> rbd is in the kernel from 2.6.37 on, so you can use it that way maybe?
[20:44] <Tv> http://ceph.newdream.net/wiki/Xen-rbd
[20:44] <cmccabe> tv: so in general, libvirt is mostly for management, but rados needs to interoperate with kvm/zen in a deeper way?
[20:44] <cmccabe> *xen
[20:44] <Tv> cmccabe: yes, libvirt is just a wrapper on top
[20:44] <prometheanfire> libvirt is a wrapper
[20:44] <frankl_> http://ceph.newdream.net/wiki/Xen-rbd seems to suggest you can just use it?
[20:45] <frankl_> and all xen needs, is accessing it as a /dev/sumetin/ blockdevice
[20:45] <prometheanfire> I know kvm needed a patch to get it to use rbd as a block device
[20:45] <frankl_> nothing more than that
[20:45] <gregaf> yep
[20:46] <Tv> gregaf: wait is rbd client a userspace or kernel thing?
[20:46] <gregaf> you'd need to test it out to see if it fits your purposes -- I think wido was complaining that he wasn't getting fast enough write performance
[20:46] <gregaf> Tv: both?
[20:46] <Tv> that xen thing looks like it's using a special kind of a kernel block device
[20:46] <Tv> ah ok
[20:46] <Tv> so
[20:46] <Tv> i think the difference might be that for kvm/qemu, the blockdev->rados logic is in the qemu userspace process; that xen-rbd page seems to be using a local blockdevice that talks rados via kernel features
[20:46] <gregaf> I think the kvm-rbd stuff is mostly shared code
[20:46] <gregaf> with the kernel
[20:46] <gregaf> not sure though
[20:46] <gregaf> it's a pretty thin wrapper, although it's getting thicker as we talk about adding in stuff like VM layering
[20:47] <gregaf> *VM=block device
[20:47] <Tv> gregaf: so the qemu rbd thing also makes remote rados services appear as local block devices, then just opens them in userspace?
[20:47] <Tv> or not
[20:47] <prometheanfire> rbd is in the kernel, the patch is to tell kvm how to access the block device is all
[20:47] <Tv> i'm unclear on that one
[20:47] <gregaf> I'm not sure the exact relationships between all the code :/
[20:48] <Tv> yehudasa knows i bet
[20:48] <gregaf> joshd or yehudasa would know
[20:48] <gregaf> heh, yes
[20:49] <Tv> "The linux kernel rbd (rados block device) driver allows striping a linux block device over multiple distributed object store data objects. It is compatible with the kvm rbd image." (from the wiki)
[20:50] <Tv> that makes me think the qemu thing talks rados from userspace
[20:50] <Tv> yeah
[20:50] <Tv> http://ceph.newdream.net/wiki/Rbd vs http://ceph.newdream.net/wiki/QEMU-RBD
[22:02] <prometheanfire> anyone have any performance numbers for rbd, I only have single systems to test (no network) so i can't test
[22:06] <yehudasa> Tv, gregaf: oh, just noticed the rbd discussion
[22:06] <yehudasa> yes, these are two distinct implementations
[22:07] <yehudasa> there is the kernel block device and the qemu block device
[22:07] <Tv> yehudasa: so i'm guessing the kvm "straight to rados" is going to be faster than xen "block device to kernel and rados from there"
[22:07] <yehudasa> well.. I can only guess too
[22:07] <Tv> unless/until someone writes a xen native rbd thingie
[22:07] <yehudasa> depends on how qemu handles the i/o
[22:07] <Tv> sure there's implementation details, but just looking at the architecture, that one does more work
[22:08] <yehudasa> I'd assume so too anyway
[22:08] <gregaf> well the kvm->rados step also goes through the kernel for networking?
[22:09] <yehudasa> gregaf: yes
[22:09] <gregaf> I wouldn't expect it to matter that much where the rados layout stuff is implemented
[22:09] <Tv> gregaf: it's more how many times do you need to marshal the requests before they finally go out
[22:10] <Tv> unless there's a benefit from it due e.g. vm paging, i expect every step to just cost performance
[22:10] <Tv> s/due/due to/
[22:13] <bchrisman> After I manipulate a variable with injectargs… is there a direct validation command somewhere such that I can verify or poll the value of an arg?
[22:14] <gregaf> bchrisman: don't think so
[22:14] <gregaf> I don't think it can return "okay" unless it succeeded, though
[22:14] <bchrisman> gregaf: ok thanks
[22:14] <gregaf> (I'd have to check the code path to be sure)
[22:15] <gregaf> anything specific you were after, or just general certainty?
[22:15] <bchrisman> yeah… just verification really… generally writes have symmetric operations to read… no big deal.
[22:16] <bchrisman> I see a fair amount of data in dump output.. but there's definitely some left..
[22:16] <darkfader> Tv: one would have to make a different blkback driver for ceph to have it like the kvm way. but i think that's really rare. i know only a small hba vendor made something similar to use NPIV for fc disks in a domU
[22:17] <darkfader> everyone else dodged it and assigned disks to dom0 :)
[22:17] <darkfader> probably "never gonna happen"
[22:17] <Tv> darkfader: yeah, just saying kvm might have an advantage there
[22:17] <darkfader> the explanation was helpful
[22:17] <darkfader> in my head userspace is still always evil
[22:18] <darkfader> i need to get rid of that
[22:18] <gregaf> johnl: you around?
[22:19] <Tv> darkfader: kernel is way more evil, mostly because any extra code in kernel means more code that can scribble anywhere in ram etc
[22:19] <Tv> darkfader: kernel is just sometimes a necessary evil, for making things fast ;)
[22:19] <darkfader> hehe yeah
[22:19] <yehudasa> we just rewrote the qemu rbd implementation to use our new shiny librbd
[22:20] <darkfader> yehudasa: is that something like LD_PRELOAD and then the io will just vanish into rbd?
[22:20] <Tv> darkfader: it moves the rados client logic from a qemu patch to a library
[22:20] <yehudasa> which means that we'll have a single userspace rbd codebase which'll allow us to maintain it more easily and to add new features
[22:20] <Tv> darkfader: e.g. the hypothetical xen support could now use the same library
[22:20] <darkfader> nice
[22:21] <darkfader> when I stop playing games in windows i'll boot my ubuntu and test it with kvm :)
[22:22] <prometheanfire> darkfader: you need kvm from git it looks like
[22:22] <darkfader> and actually it sounds very elegant like that, i'd love to boot xen vm's right off rbd. makes a lot of sense
[22:23] <darkfader> prometheanfire: thanks
[22:24] <prometheanfire> works fine for me on gentoo
[22:24] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) Quit (Remote host closed the connection)
[22:25] <darkfader> prometheanfire: in a way gentoo is the only linux flavour where everything works
[22:25] <darkfader> the rest is just virtual
[22:25] <darkfader> especially the case with xen. gentoo is the only one I know that supports security labels and vtpm and stuff out of the box
[22:26] <prometheanfire> you talking about selinux?
[22:26] <darkfader> no, sHype
[22:26] <darkfader> and there's another one i forgot the name of
[22:26] <prometheanfire> I don't do any xen, so dunno :D
[22:27] <darkfader> it's labels for the things that a VM can access and such
[22:27] <darkfader> i think it's from 2006
[22:27] <darkfader> and only gentoo managed to support it
[22:27] * Dantman (~dantman@S0106001eec4a8147.vs.shawcable.net) has joined #ceph
[22:27] <Tv> yeah, IBM ported over their zSeries mainframe vm security thinking
[22:27] <prometheanfire> we aim to please
[22:27] <darkfader> rhel has selinux labels for vm images, but those just block the webserver from reading it
[22:28] <darkfader> and debian will just say "if I don't need it, why do you? closed->wontfix"
[22:28] <prometheanfire> lol, so true
[22:28] <darkfader> or maybe "no updates for 90 days->closed"
[22:29] <Tv> pfft, selinux is a pain, there's no consensus it's worth the pain
[22:29] <prometheanfire> any RBAC is the same in that respect
[22:29] <darkfader> Tv: didnt mean selinux though, and i uhhh agree
[22:29] <darkfader> although i start living with it sometimes
[22:29] <Tv> i personally refuse to accommodate for selinux labels
[22:30] <Tv> it's just going the wrong way
[22:30] <Tv> it's too complex to be real-world security
[22:30] <Tv> i'd rather fiddle with privsep, seccomp etc
[22:30] <darkfader> the old linux rbac project was less a pita
[22:30] <darkfader> but it died
[22:31] <Tv> say what you will about races and ln traps etc, but apparmor is surprisingly good at not getting in the way
[22:31] <darkfader> selinux just lacks a useful way of turning on / off roles
[22:31] <darkfader> and so it sucks endlessly
[22:32] <prometheanfire> rbac isn't limited to that project
[22:32] <prometheanfire> selinux is rbac, as well as grsec
[22:33] <darkfader> i just really know selinux, started by turning it off and now sometimes test and leave it enabled
[22:33] <darkfader> http://www.joshd.ca/content/getting-confluence-work-selinux that post had given me a bad conscience. like "hmm it looks easy enough if someone knows what he's doing"
[22:34] <darkfader> grr. can we switch away from RBAC as a topic anyway?
[22:35] <prometheanfire> but I like it :P
[22:35] <darkfader> i just remembered how bad auditd is, on rhel5 it lost messages during its logswitches
[22:35] <darkfader> prometheanfire: hehe, well that means you're in control of your world. that's enviable
[22:36] <prometheanfire> well, not all the way, but getting there
[22:36] <prometheanfire> I am bringing our servers back from lord of the flies
[22:39] <darkfader> a few days more sleep and i'll work on that again. puppet fell down somewhere in the lab :)
[22:39] <darkfader> and someone planned to use cfengine instead
[22:39] <darkfader> then just rebuild everything and be busy
[22:40] <prometheanfire> eww, I've not had a problem that I didn't cause myself with puppet
[22:40] <darkfader> hehe
[22:42] * prometheanfire tells himself to make sure the scope is correct before rolling out kernels...
[22:47] <Tv> FYI ceph-autotests.git just learned how to do custom binary tarball locations (gregaf)
[22:47] <gregaf> coolio
[22:50] <Tv> err wait now i'm confused about monmap & osdmap again.. they're only ever used by cmon --mkfs?
[22:50] <gregaf> err, probably
[22:50] <gregaf> I really haven't played around with manual setup, I just use vstart all the time
[22:50] <Tv> funny that i'm required to put them out in separate files etc, in the simple case :-/
[22:50] <gregaf> but the monitor distributes the osdmap
[22:50] <gregaf> and the monmap
[22:50] <Tv> yeah that makes sense
[22:50] <Tv> just.. not very admin friendly, yet
[22:51] <Tv> vstart rm's the files right after cmon --mkfs
[22:52] <Tv> just looking at how to refactor this so autotests can share those parts
[23:11] <johnl> hi
[23:11] <johnl> am around now gregaf
[23:12] <gregaf> was going to ask about your newest bug report, but I put it in the tracker
[23:12] <gregaf> :)
[23:14] <johnl> am gathering a log now
[23:14] <johnl> I seem to have a knack for breaking ceph, heh
[23:14] <johnl> I feel kinda bad, keeping you busy ;)
[23:16] <gregaf> oh no, it's very helpful
[23:16] <johnl> :)
[23:17] <johnl> added a log to redmine
[23:21] <gregaf> johnl: can you dump the mds journal and post it somewhere?
[23:21] <gregaf> cmds --dump-journal 0 journal.dump
[23:21] <johnl> yeah, how..
[23:21] <johnl> ah ta
[23:34] <johnl> journal dumped
[23:35] <gregaf> can you put it somewhere accessible?
[23:35] <gregaf> going to need to look at it manually and see what the damage looks like...
[23:37] <johnl> have done. posted url on redmine
[23:37] <johnl> http://johnleach.co.uk/downloads/ceph/

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.