#ceph IRC Log


IRC Log for 2012-12-21

Timestamps are in GMT/BST.

[0:00] * loicd (~loic@magenta.dachary.org) Quit (Remote host closed the connection)
[0:02] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[0:03] <paravoid> hadn't noticed that
[0:03] <paravoid> it was quite puzzling
[0:03] <paravoid> esp. since the upstart jobs and sysv are different in what they do
[0:04] <wer> so does ceph.conf need to be called ceph-all.conf for both init and service to work?
[0:04] <dmick> the service command will fall back to /etc/init.d if it doesn't find the service in /etc/init
[0:05] <dmick> and so for service to be able to reach /etc/init.d/ceph, there has to not be an /etc/init/ceph.conf
[0:05] <dmick> you can always run /etc/init.d/ceph directly
[0:05] <wer> cool. That is what I always do...
[0:06] <wer> I thought you guys were talking about /etc/ceph/ceph.conf :)
[0:06] <dmick> the upstart management is the way of the future though. much more autoconfiguring
[0:06] <dmick> and yes, I knew someone would think that :)
[0:07] <wer> yes upstart.... the future.... I don't typically turn to upstart when running things.... even though I know it is there.
[0:08] <wer> usually because I don't know the name of the service and look in init.d anyway.
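[Editor's note: the lookup order dmick describes above can be sketched as a small shell function. This is a toy model of the behavior, not the actual `service(8)` implementation; the job directory is a parameter so the sketch runs anywhere.]

```shell
# Toy model of the fallback dmick describes: `service` prefers an
# upstart job in /etc/init and only falls back to /etc/init.d when
# the job file is absent. The job directory is parameterized.
resolve_ceph_init() {
    if [ -e "$1/ceph.conf" ]; then
        echo "upstart job $1/ceph.conf"
    else
        echo "sysvinit script /etc/init.d/ceph"
    fi
}

d=$(mktemp -d)
resolve_ceph_init "$d"        # no upstart job yet -> sysvinit fallback
touch "$d/ceph.conf"
resolve_ceph_init "$d"        # upstart job present -> it wins
```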
[0:09] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:11] <wer> What is the 2 rbd pools used for?
[0:13] <wer> do I need it if I am not using cephfs?
[0:19] <paravoid> ceph-disk-activate/prepare is very undocumented, sadly
[0:31] * darkfader (~floh@ has joined #ceph
[0:33] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[0:39] <dmick> wer: initctl list is your friend
[0:39] <dmick> and, wer: what 2 rbd pools are those?
[0:40] <dmick> paravoid: yeah, it could be better
[0:40] <dmick> we're always accepting patches :)
[0:42] <paravoid> ;-)
[0:43] * darkfader (~floh@ Quit (Ping timeout: 480 seconds)
[0:43] <paravoid> it's also unclear to me what bootstrap-osd is
[0:47] <wer> dmick: it is pool 2 for rbd.
[0:49] <wer> basically I discovered that my radosgw pool is not really capable of using all the osd's I have... whereas the rbd pool is setup to use many. But if I am not using rbd then do I even need it?
[0:52] <paravoid> the chef cookbook takes bootstrap-osd key and saves it into /var/lib/ceph/bootstrap-osd/#{cluster}.keyring
[0:52] <paravoid> while ceph-create-keys doesn't
[0:52] <wer> So out of my 96 osd's the buckets I have created in rados can only use 16 of them. So the size is actually not very large.. and in order to utilize all the space I have, I am going to have to replace that bucket. Basically, if pool mapping equates to size, and pg_num equates to memory utilization on the osd's .... I am interested in determining the physical size of each pool (since I didn't know I was so limited) and maybe removing rbd if I will never use it
[0:52] <wer> to save on memory footprint.
[0:53] <paravoid> hmmm, actually it does
[0:53] <paravoid> but on non-mon boxes I guess
[0:55] * darkfader (~floh@ has joined #ceph
[0:55] <wer> dmick: basically all this raw storage is useless :)
[0:57] <sjustlaptop1> wer: if you are feeling very adventurous and are willing to have your data and cluster completely destroyed, you can try the experimental pg_num expansion support in next
[0:57] <wer> Is that in 55.1?
[0:57] <sjustlaptop1> no
[0:57] <sjustlaptop1> more recent
[0:58] <wer> :) heh probably not that adventurous :)
[0:58] <sjustlaptop1> it's experimental, don't use it if your cluster is real
[0:59] <wer> sjustlaptop1: Well it isn't real yet. but I did put data on it before I realized my mistake on the radosgw. Which sucks.
[1:00] <wer> basically it is a do-over, of which I have had many, and it leads me to believe I will not ever actually put this in production.
[1:01] <wer> cause the human is left to add up all the bits... and go behind ceph and make the config... and so forth. It seems to me the human is required.... but also has to predict what ceph did. Such as ceph osd create, which really annoys me.
[1:01] <dmick> wer: rbd is just a pool. It happens to be the default pool name for lots of rbd commands, but there's nothing otherwise magic about it
[1:01] <wer> hmm. It 's all magic :)
[1:02] <dmick> I doubt removing the rbd pool will have any measurable effect on anything, assuming there are no objects in it
[1:04] <wer> ok. cool. So I next need to make a new pool with more pgs, and add it to .rgw.buckets.... then delete the old small pool and then presumably create a new bucket in order to use the new space.
[1:04] <dmick> and yes, getting into the swing of ceph can take some missteps; it's complex, and the admin model is not necessarily what you're thinking. We're always trying to make it more obvious, and we know there are shortfalls now; suggestions are always welcome
[1:04] <wer> which means everything writing to that bucket will have to write to a different bucket :(
[1:05] <dmick> I don't think you can add a pool to an existing pool
[1:05] <sjustlaptop1> dmick: no, but rgw has mechanisms for growing into a new pool
[1:05] <dmick> ah
[1:05] <sjustlaptop1> yehudasa: how would wer create a bigger bucket?
[1:05] <sjustlaptop1> *add a bigger pool?
[1:07] <yehudasa> sjustlaptop1: new pools only affect newly created buckets (atm)
[1:07] <sjustlaptop1> right
[1:07] <wer> right
[1:07] <sjustlaptop1> wer is probably ok with that
[1:08] <sjustlaptop1> whoa, we have docs
[1:08] <sjustlaptop1> http://ceph.com/docs/master/man/8/radosgw-admin/
[1:08] <sjustlaptop1> see the pool * commands
[1:08] <wer> Well I would rather use the same bucket... but I can get around that I guess with changes to the client.
[1:08] <sjustlaptop1> you'll want to add the new pool (I'm assuming) and then remove the old one
[1:08] <sjustlaptop1> but don't actually destroy the pools
[1:08] <sjustlaptop1> radosgw will still use them for the buckets already there
[1:09] <wer> ahh. ok
[1:09] <dmick> oh? pool rm doesn't prevent access to existing buckets, eh?
[1:09] <yehudasa> yeah, it just removes it from the list of pools to use for newly created buckets
[1:09] <sjustlaptop1> dmick: based on my limited understanding, I don't think so. yehudasa?
[1:09] <yehudasa> but buckets that were already created will keep going to the same old pools
[1:10] <yehudasa> maybe one day we can change that, but that's how it works currently
[1:10] <wer> hmm. hmmm.. So will those osd's fill up??
[1:10] <dmick> that might be a good addition to the manpage
[1:10] <sjustlaptop1> wer: yes
[1:10] <dmick> clarifying what the "placement set" actually means, basically
[1:10] <sjustlaptop1> eventually
[1:10] <wer> df pools
[1:10] <wer> :)
[1:11] <wer> bad things happen when osds fill up.
[1:11] <Psi-Jack> Hmm
[1:11] <wer> you guys are awesome btw. you know I'm not bitching right?
[1:12] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: Connection reset by peer)
[1:13] <dmick> wer: well it sounded a little like you might have been :) but with cause
[1:13] <wer> :) It's complicated.
[1:13] <wer> df pools
[1:13] <wer> df radosgw
[1:14] <wer> df my bucket :)
[1:14] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[1:14] <wer> I just know I will inevitably fill up an osd.... and things will grind to a halt.
[1:15] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:15] <wer> I already have an unhealthy test instance that is no longer accepting data.... but has two osd's at ~50% and one at 96%. And it is a fail. I don't know how to help it...
[1:15] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit ()
[1:15] <wer> in the past I would move the journal somewhere and it would help the osd recover.... but I already used that trick.
[1:17] <wer> that was v.48 and I would miss purging data.... but I suspect one of the three osd's filled up due to rados having fewer pg's than in my newer 96 osd install.
[1:18] <Psi-Jack> Hmmm
[1:18] <wer> I don't see any way around keeping that bucket and pool if it will inevitably fill.... which means I need all new ones.... it is a little like that tower of babel game.
[1:18] <Psi-Jack> I've always wondered how having 3 differently sized drives might affect Ceph. I have 3 servers, each with 1TB, 500GB, and 320GB HDD's in them as their OSD's. heh
[1:19] <wer> yeah I have no idea. But with that stuff I imagine if ceph doesn't catch it that you could maybe weight them appropriately. Depends what pool you are using though? ;)
[1:20] <wer> ug.... I don't.... know what to.... do next.
[1:22] <wer> ceph osd pool .rgw.buckets2 12 4800 4800
[1:22] <wer> ceph osd pool .rgw.buckets2 4800 4800
[1:22] <wer> osd's *100 / replication right? To get full use of available storage?
[1:24] <wer> then radosgw-admin pool add .rgw.buckets2
[1:25] <wer> radosgw-admin bucket rm --purge-objects to delete my entire bucket on the old pool?
[1:27] <wer> pool 12 '.rgw.buckets2' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 4800 pgp_num 4800 last_change 4165 owner 0
[1:28] <wer> who is that pesky owner?
[1:31] <wer> hmm. Ok I have the pool added to radosgw... so will I need to remove the old pool to keep the osd's it uses from filling up and breaking ceph? Or can they fill up without impacting other services?
[1:34] <sjustlaptop1> wer: you would need to remove the buckets mapped to the old pool, yehudasa: any good way to do that?
[1:34] <sjustlaptop1> wer: oh, you did
[1:35] <yehudasa> sjustlaptop1: yeah
[1:35] <wer> sjustlaptop1: I have added the new pool to radosgw. I have not deleted anything.
[1:35] <yehudasa> bucket rm should work
[1:36] <wer> yehudasa: even if it has lots of objects in it?
[1:36] <yehudasa> wer: in that case, I think you can unlink the buckets and just remove the rados pool
[1:36] <yehudasa> radosgw bucket unlink ...
[1:37] <wer> And what does that mean?
[1:37] <yehudasa> that'll decouple the bucket from its owner
[1:37] <yehudasa> then rados rmpool
[1:38] <yehudasa> radosgw-admin bucket unlink
[1:38] <wer> hmm I can't get the syntax correct.
[1:39] <yehudasa> I think it's 'radosgw-admin bucket unlink --bucket=<name> --uid=<uid>'
[1:39] <wer> oh duh.... I was doing pools=
[1:40] <wer> ok it is unlinked. Does that mean it is lost?
[1:40] <wer> yehudasa: ok, Just have the new pool now.
[1:40] * kYann5 (~KYann@did75-15-88-160-187-237.fbx.proxad.net) has joined #ceph
[1:41] <wer> do I need to relink the bucket?
[1:41] <yehudasa> wer: did you need the bucket?
[1:42] <wer> heh, well I mean... yesish?
[1:42] <yehudasa> wer: the bucket does not exist anymore, once you remove the pool
[1:43] <yehudasa> wer: the pool contains the data
[1:43] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[1:43] <wer> I can still access it?!
[1:43] <wer> hmm.
[1:44] <yehudasa> wer: did you do 'rados rmpool'?
[1:44] <wer> radosgw-admin pool rm --pool=.rgw.buckets
[1:44] <wer> I did that.
[1:44] <yehudasa> oh, ok
[1:44] <yehudasa> don't do 'rados rmpool'
[1:44] <yehudasa> if you don't want to lose the data
[1:44] <wer> lol
[1:45] <wer> ok so type rados rmpool?
[1:45] <yehudasa> no
[1:45] <wer> lol
[1:45] * KYann (~KYann@did75-15-88-160-187-237.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[1:45] <wer> what did I do then?
[1:45] <wer> I told radosgw to quit using that pool?
[1:46] <yehudasa> sorry, I was under the impression that you wanted to get rid of your old pool, and old data
[1:46] <wer> well, I want to be able to use all my storage... and had to create a new pool.
[1:47] <wer> I need to have a bucket named what it is currently named that is also able to use the much larger pool.
[1:47] <yehudasa> if you want to migrate your data, currently the best way to do it would be to create a new pool and migrate your old buckets to the new pool
[1:47] <yehudasa> so.. relink the buckets that you unlinked
[1:47] <wer> ok let me try that.
[1:49] <yehudasa> and you'll need to run an s3 client that copies the data, however, just running s3 copy won't help as it'll still reference the old objects
[1:49] <yehudasa> what you'll need to do is read the old data, and write it back
[1:50] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[1:50] <wer> ok. Since I can't figure out the syntax to link that bucket... I would rather destroy it at this point. I am getting twitchy. And create a new bucket of the same name.
[1:51] <yehudasa> wer: the syntax is exactly the same as with the unlink
[1:53] <wer> radosgw-admin bucket link --bucket=bucket1 doesn't work for me. I thought that is what I used to unlink....
[1:53] <yehudasa> --uid=<asdasd>
[1:53] <wer> ok
[1:53] <wer> ok it is linked again.
[1:54] <yehudasa> how much data do you have there?
[1:54] <wer> um.... not a ton probably under 200Mb. I can't really tell.
[1:54] <wer> That is part of my issue, is that I don't have clear information like that...
[1:55] <yehudasa> radosgw-admin bucket stats --bucket=<name>
[1:56] <wer> heh, I was trying that.... I swear! I was. looks like ~12MB
[1:56] <wer> 112
[1:57] <yehudasa> ok, just download everything, create a new bucket and upload everything to the new one
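[Editor's note: the full migration sequence assembled over this exchange, gathered as a hedged sketch. The commands need a live Ceph/radosgw cluster, so the block only prints the runbook; `bucket1` and `myuid` stand in for wer's actual bucket and user names.]

```shell
# Runbook of the pool-migration steps from the discussion above.
# These require a live cluster, so they are printed rather than run;
# bucket1 and myuid are placeholder names.
runbook='ceph osd pool create .rgw.buckets2 4800 4800
radosgw-admin pool add --pool=.rgw.buckets2
radosgw-admin pool rm --pool=.rgw.buckets
radosgw-admin bucket stats --bucket=bucket1
radosgw-admin bucket unlink --bucket=bucket1 --uid=myuid
radosgw-admin bucket link --bucket=bucket1 --uid=myuid'
printf '%s\n' "$runbook"
# Existing buckets keep using their old pool; to actually move data,
# read every object out through the gateway and write it into a bucket
# created after the pool add (a plain s3 server-side copy is not enough).
```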
[1:59] <wer> ok. I can do that. Would I have to do the same thing for 200tb? Cause that is how much data I expect this thing to hold? I am starting to wonder if I should have sized it for 10x the expected osd's or something.
[2:01] <wer> It sounds like ceph can't grow well. Is that not true? Like could I have copied the pool to a new pool or something?
[2:03] <yehudasa> wer: there is a copy pool command, but the gateway has some issues with it currently. We'll fix it later on.
[2:03] <yehudasa> wer: also soonish there will be automatic pg resizing for pools
[2:03] <wer> ok sweet. Is there anywhere I can read on that issue?
[2:03] <dmick> and the "growing" issue will be helped by the "split" functionality
[2:03] <yehudasa> wer: not much to read about it, it's all in the code
[2:03] <wer> ok so this is all not new except to me :)
[2:05] <wer> Will the upgrade path be safe? I noticed it should be since .51 but will it be required?
[2:12] <sjustlaptop1> wer: a highly experimental version will be in bobtail
[2:12] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[2:12] <sjustlaptop1> it should stabilize by the release after bobtail
[2:12] <sjustlaptop1> upgrading from 0.55.1 to bobtail should be harmless
[2:12] <sjustlaptop1> or 0.48.* for that matter
[2:12] <Psi-Jack> Love that word: "should" :)
[2:13] <sjustlaptop1> Psi-Jack: no significant on-disk format changes, just new code
[2:13] <Psi-Jack> Yep :)
[2:13] <wer> sjustlaptop1: ok, as in rolling updates through all my nodes? cause I will ultimately have 24 nodes.
[2:13] <sjustlaptop1> wer: that should work
[2:13] <Psi-Jack> I'm currently running off a git build from 0.55.x
[2:13] <Psi-Jack> So far, been remarkably stable. :)
[2:14] <sjustlaptop1> workload?
[2:14] <sjustlaptop1> rbd?
[2:14] <wer> ok cool.... ok so that bucket is 403 now... until I copy stuff out, and destroy it, and create a new one it is no longer able to take data...
[2:14] <Psi-Jack> RBD primarily, some CephFS for webservers, and working on getting Maildir mailservers running off CephFS as well.
[2:14] <sjustlaptop1> cool
[2:14] <Psi-Jack> ~18 VM's, workload's not that heavy overall.
[2:14] <wer> neat.
[2:15] <Psi-Jack> Basically though, I'm testing every fricken thing about Ceph with it. :)
[2:15] <wer> we are going to start looking into larger vm deployments next year.
[2:15] <Psi-Jack> Except, currently, RadosGW/S3 stuff.
[2:15] <Psi-Jack> But that's coming. :)
[2:15] <wer> I am only testing radosgw/s3 stuff :)
[2:15] <Psi-Jack> hehe
[2:16] <Psi-Jack> The moment I got Ceph installed on my storage servers, I immediately started converting all my existing VM disks, which were qcow2 provided over NFSv4, to RBD disks. What a PITA that was. :)
[2:16] <sjustlaptop1> heh
[2:16] <Psi-Jack> There's no sane tool out there to convert qcow2 to rbd.
[2:17] <Psi-Jack> So, I had to use a "gateway" VM. Essentially a minimal installed Arch Linux, and brought in the qcow2 disk and RBD disk, and rsync'd over, chrooted, reinstalled grub, etc.
[2:18] <Psi-Jack> All worked out though.
[2:21] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[2:28] * jlogan1 (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) Quit (Ping timeout: 480 seconds)
[2:33] <wer> Ok. So I have all my stuff moved into a newly created bucket. So how do I determine how much space that bucket now has theoretically? I think it should be much larger and more distributed among my 96 osd's across 4 nodes... but when I add 12 more nodes will I need to do this again in order to make use of their space?
[2:34] <wer> should I have created my pool with a pg_num of 19200?
[2:34] <sjustlaptop1> it's usually ok to have up to 700-1000 pgs on a single osd (including replication)
[2:35] <sjustlaptop1> though 100 is the recommended amount
[2:35] <wer> wait a minute.
[2:35] <wer> I thought when creating a pool you used a formula to describe how many osd's would be used?
[2:36] <wer> num_osd*100/replication or something like that?
[2:36] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:36] * KindOne (~KindOne@ Quit (Ping timeout: 480 seconds)
[2:37] <sjustlaptop1> yes, because 100 pgs per osd gives a good distribution (no osd gets a disproportionate number)
[2:37] * KindOne (~KindOne@h33.168.17.98.dynamic.ip.windstream.net) has joined #ceph
[2:37] <sjustlaptop1> in general, more pgs means a more even distribution
[2:37] <sjustlaptop1> but, there is in-memory state associated with each pg that the osd must track
[2:37] <sjustlaptop1> we find that the memory requirements tend to kill clusters starting at around 1k-2k pgs per osd or so
[2:38] <sjustlaptop1> so 100/osd through 800/osd (or so) is your window
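[Editor's note: the sizing rule of thumb from this exchange as shell arithmetic, using wer's numbers (96 OSDs, 2x replication); the 100-per-OSD target gives the 4800 used for the `.rgw.buckets2` pool created earlier.]

```shell
# Rule of thumb from the discussion: target ~100 pgs per osd, so
# pg_num ≈ num_osd * 100 / replication. With 96 osds and 2x
# replication this yields the 4800 used for .rgw.buckets2.
num_osd=96
replication=2
target_per_osd=100
pg_num=$(( num_osd * target_per_osd / replication ))
echo "$pg_num"    # 4800
```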
[2:38] <wer> ok. for a given pool.
[2:38] <sjustlaptop1> yeah
[2:39] <sjustlaptop1> but the memory requirements still stand so it does limit how many pools you can practically have
[2:39] <wer> ok so 100 to 800 total per osd.
[2:39] <sjustlaptop1> yeah, the cluster will work best around 100
[2:39] <sjustlaptop1> another option for you since you are using radosgw is to add more pools as you grow
[2:40] <wer> but then won't a given osd possibly fill up?
[2:40] <sjustlaptop1> that was mostly a problem with 8 pgs
[2:40] <sjustlaptop1> with >4k pgs, that pool will distribute reasonably well even over 16 nodes
[2:41] <wer> So if I didn't have such a crappy pool to begin with for rados then I could have just expanded.... and I can continue to expand should I add additional nodes... beyond 16?
[2:41] <wer> by adding additional pools...
[2:41] <sjustlaptop1> yeah, up to the point where even 4k pgs distributes really badly
[2:42] <wer> ahh interesting.
[2:42] <wer> ok last question
[2:42] <wer> so I ran ceph osd pool create .rgw.buckets2 4800 4800 for the new pool. I didn't really know what the second number was :)
[2:42] <sjustlaptop1> ah, the second is pgpnum
[2:43] <sjustlaptop1> for your purposes, it's always the same as pgnum
[2:43] * KindTwo (~KindOne@h158.234.22.98.dynamic.ip.windstream.net) has joined #ceph
[2:43] <sjustlaptop1> if you have a pgnum of 8 and a pgpnum of 4
[2:43] <sjustlaptop1> your 8 pgs will be placed as if it was actually 4 groups of two
[2:44] <sjustlaptop1> to put it another way, if dynamic pool growing were properly working, increasing the pgnum without increasing the pgpnum would cause the pgs to split in place
[2:44] <sjustlaptop1> then you would increase pgpnum to actually move the new pgs to new homes
[2:45] <wer> oh. k. Perhaps it is time to go read some code :)
[2:45] <wer> I worry I am going to paint myself into some corner I didn't know about :)
[2:45] <sjustlaptop1> from an operational standpoint, you might want to do the split part of increasing a pool size separately from the copying-around part
[2:45] <sjustlaptop1> that is what pgpnum allows you to do
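[Editor's note: sjustlaptop1's pg_num=8 / pgp_num=4 example can be illustrated with a toy model. This is not Ceph's real CRUSH placement math; it only shows how a pgp_num smaller than pg_num collapses the placement seeds into fewer groups.]

```shell
# Toy illustration of pg_num vs pgp_num (not real CRUSH math): each
# pg's placement seed is taken modulo pgp_num, so with pg_num=8 and
# pgp_num=4 the 8 pgs land in only 4 distinct placement groups.
pg_num=8
pgp_num=4
seeds=""
i=0
while [ "$i" -lt "$pg_num" ]; do
    seeds="$seeds $(( i % pgp_num ))"
    i=$(( i + 1 ))
done
echo "placement seeds:$seeds"   # seeds repeat: 4 groups of two
```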
[2:45] * KindOne (~KindOne@h33.168.17.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[2:45] * KindTwo is now known as KindOne
[2:46] <wer> ok. So you can define it... and then execute its distribution sort of?
[2:46] <sjustlaptop1> yeah, there is work involved in splitting a pg, so that part would probably cause a fair amount of load
[2:47] <sjustlaptop1> so you might want to wait for that to complete before triggering the re-distribution
[2:47] <wer> but if you can't redefine a pool to begin with then how can that feature help?
[2:48] <sjustlaptop1> it's there because ceph once was able to increase pool size dynamically, but it didn't work very well
[2:48] <sjustlaptop1> then it didn't work at all
[2:48] <sjustlaptop1> and now it's experimental
[2:48] <sjustlaptop1> and soon it will be stable :)
[2:49] <wer> ok. That makes more sense. Thanks sjustlaptop1!
[2:49] <sjustlaptop1> wer: if you are creating a new pool, there is no reason for pgpnum to be different from pgnum (that I can think of)
[2:49] <wer> ok good. cool!
[2:50] <wer> Well I am going to shove 80TB in this thing soon... it would have been interesting to see the small pool distribute and fill up. I might still set up another bucket with that same setup just to watch it :)
[2:51] <wer> I wish there was size associated with them instead of this pg stuff. As a human anyway.
[2:51] <sjustlaptop1> there really can't be
[2:51] <sjustlaptop1> the osds generally don't have uniform disks
[2:51] <sjustlaptop1> or uniform numbers of disks
[2:52] <sjustlaptop1> pgs are just object partitions
[2:52] <wer> they are still finite though.
[2:52] <sjustlaptop1> each osd gets some number of object partitions rather than objects
[2:52] <sjustlaptop1> if you have too few object partitions, the osds won't be evenly filled
[2:53] <sjustlaptop1> yeah, but they are also the basis for all of the consistency and recovery logic
[2:53] <sjustlaptop1> :)
[2:53] <wer> I might be too stupid for my job :)
[2:54] <jluis> dear #ceph, happy end-of-the-world day
[2:54] <sjustlaptop1> meh, ceph is fairly obtuse, but mostly with good reason :)
[2:54] * spacex (~spacex@user-24-214-57-166.knology.net) Quit (Quit: Leaving)
[2:54] * jluis is now known as joao
[2:56] <nhm> joao: isn't like 2am there?
[2:56] <joao> yes
[2:56] <sjustlaptop1> is the world over yet?
[2:56] <wer> end of times is 9:22am est or something like that... plenty of time.
[2:56] <joao> as it appears, Europe has survived still
[2:57] <joao> I always thought it would end on Samoa Time Zone
[2:57] <joao> which would give us roughly 11h more
[2:57] <wer> yeah. That sounds about right.
[2:58] <wer> someone pointed out some offworld storage to me yesterday :P
[2:58] <wer> http://www.timeanddate.com/countdown/maya here is the countdown
[2:59] <lurbs> The Final Countdown?
[2:59] <wer> yep
[2:59] <joao> why does it have to end at 11h11m ?
[2:59] <wer> I don't make rules.
[3:00] <wer> http://offworldbackup.com/Public for your important work to live past end of times :)
[3:00] <wer> shit, they were at ~2k files the other day.
[3:00] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[3:00] <joao> wer, looks legit
[3:00] <joao> thanks!
[3:01] <wer> yeah man!
[3:01] <wer> "Because the end of the world is nigh."
[3:01] <lurbs> Why doesn't Ceph support off-world replication?
[3:01] <wer> lol
[3:01] <joao> lurbs, my guess is latency
[3:01] <wer> is it on the roadmap?
[3:02] <wer> I remember cisco talking about commodity routers in space... and google talking about the network requirements.
[3:04] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[3:05] <wer> later fellas. Thanks again for all the help.
[3:05] * wer is now known as wer_gone
[3:11] * nwat1 (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:16] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[3:21] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Operation timed out)
[3:30] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: drokita)
[3:30] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[3:36] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[3:37] * Lea (~LeaChim@5ad684ae.bb.sky.com) Quit (Remote host closed the connection)
[4:01] * xiaoxi (~xiaoxiche@jfdmzpr02-ext.jf.intel.com) has joined #ceph
[4:11] * flakrat_ (~flakrat@eng-bec264la.eng.uab.edu) Quit (Quit: Leaving)
[4:11] * xiaoxi (~xiaoxiche@jfdmzpr02-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[4:45] * Cube (~Cube@ Quit (Quit: Leaving.)
[4:59] * xiaoxi (~xiaoxiche@ has joined #ceph
[5:10] * xiaoxi (~xiaoxiche@ Quit (Remote host closed the connection)
[5:12] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has left #ceph
[5:13] * darkfaded (~floh@ has joined #ceph
[5:13] * darkfader (~floh@ Quit (Read error: Connection reset by peer)
[5:24] * DLange (~DLange@dlange.user.oftc.net) Quit (Read error: Connection reset by peer)
[5:24] * darkfaded (~floh@ Quit (Ping timeout: 480 seconds)
[5:24] * jochen_ (~jochen@laevar.de) Quit (Ping timeout: 480 seconds)
[5:25] * ivoks (~ivoks@jupiter.init.hr) Quit (Ping timeout: 480 seconds)
[5:25] * ferai (~quassel@quassel.jefferai.org) has joined #ceph
[5:28] * jochen (~jochen@laevar.de) has joined #ceph
[5:29] * jefferai (~quassel@quassel.jefferai.org) Quit (Ping timeout: 480 seconds)
[5:30] * ivoks (~ivoks@jupiter.init.hr) has joined #ceph
[5:30] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[5:30] * darkfader (~floh@ has joined #ceph
[5:39] * noot (~noot@S0106001731baec53.ed.shawcable.net) has joined #ceph
[5:40] <noot> I've got a bit of a noob question, if anyone wants to help
[5:43] * noot (~noot@S0106001731baec53.ed.shawcable.net) Quit ()
[5:51] * stp (~stp@dslb-084-056-007-080.pools.arcor-ip.net) has joined #ceph
[5:51] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[5:52] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[5:58] * scalability-junk (~stp@dslb-084-056-033-228.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[6:07] * gaveen (~gaveen@ has joined #ceph
[6:25] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:30] * tezra (~Tecca@ Quit (Read error: Operation timed out)
[6:37] <mikedawson> noot: what is your question
[6:37] * xiaoxi (~xiaoxiche@jfdmzpr02-ext.jf.intel.com) has joined #ceph
[6:39] * tezra (~Tecca@ has joined #ceph
[6:49] * KindOne (~KindOne@h158.234.22.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[7:15] * deepsa_ (~deepsa@ has joined #ceph
[7:20] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[7:20] * deepsa_ is now known as deepsa
[7:25] * tezra_home (~Tecca@ has joined #ceph
[7:29] * tezra (~Tecca@ Quit (Ping timeout: 480 seconds)
[7:32] * tezra (~Tecca@ has joined #ceph
[7:35] * tezra_home (~Tecca@ Quit (Ping timeout: 480 seconds)
[7:56] * sjustlaptop1 (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Read error: Operation timed out)
[8:06] * dmick (~dmick@2607:f298:a:607:7d94:5dcf:7854:9654) Quit (Quit: Leaving.)
[8:33] * low (~low@ has joined #ceph
[8:38] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) has joined #ceph
[8:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[8:49] * KindOne (~KindOne@h158.234.22.98.dynamic.ip.windstream.net) has joined #ceph
[9:04] * gregorg (~Greg@ Quit (Quit: Quitte)
[9:09] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[9:14] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[9:18] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[9:21] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:22] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[9:39] * benner (~benner@ Quit (Read error: Connection reset by peer)
[9:39] * benner (~benner@ has joined #ceph
[9:43] * gregorg (~Greg@ has joined #ceph
[9:46] * xiaoxi (~xiaoxiche@jfdmzpr02-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[9:51] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:55] * deepsa (~deepsa@ has joined #ceph
[9:57] * deepsa (~deepsa@ has left #ceph
[10:01] * fc (~fc@home.ploup.net) has joined #ceph
[10:07] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[10:21] * styx-tdo (~styx@chello084113243057.3.14.vie.surfer.at) has joined #ceph
[10:21] * jlogan (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) has joined #ceph
[10:24] <styx-tdo> hi, i have a question: It seems that I don't "get" ceph, so: - Can I use ceph as a backend storage for shared servers (i.e. 2 servers accessing the same data)? and - is there an OOTB Linux solution to provide this, like openfiler? and - i want it redundant. Please point me to some documentation - i seem not to be able to find it w/ google
[10:28] <dweazle> styx-tdo: ceph is not a set and forget solution (just yet), if you want ceph redundant you need 3 systems.. you can use rbd to have shared block storage, but you would need a clustered filesystem on top (like ocfs) to be able to use it from 2 (or more) systems at the same time.. cephfs is another option. it's a network filesystem comparable to NFS, but backed by ceph object storage, however it's currently not recommended for production use and requires a very
[10:30] <styx-tdo> hm.. ok, this sounds like i should hold off cephfs for a while :/
[10:30] <styx-tdo> but if it's there, i'm in ;)
[10:31] <dweazle> depends, if you use it for non-critical data and for private use it might be good enough already
[10:32] <styx-tdo> so, basically, i'd need, in the future, ceph as cluster solution, then iSCSI export of the devices synced by ceph for iSCSI and systems that can handle (esx) + cephfs for unaware systems (openVZ)..
[10:32] <styx-tdo> i tried gfs2 and it was a nightmare :/
[10:33] <styx-tdo> i cannot read the word quorum anymore ;)
[10:34] * jksM (~jks@3e6b7199.rev.stofanet.dk) Quit (Remote host closed the connection)
[10:36] <dweazle> well, ideally you don't need iscsi at all, but you could set up some kind of iscsi gateway to ceph to provide ceph storage to clients that don't support ceph, but do support iscsi.. perhaps in a year or two we'll see iscsi targets with built-in ceph support :) then it's bye bye expensive SAN boxes as far as I'm concerned
[10:36] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[10:37] <styx-tdo> dweazle: i have that feeling that esxi won't include ceph or alike for the foreseeable future ^^ - so, iSCSI it has to be
[10:37] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[10:37] <dweazle> styx-tdo: we'll see.. perhaps a 3rd party (like inktank) would be able to build a storage plugin for esxi to provide just that
[10:38] <dweazle> or perhaps EMC will make an offer to inktank
[10:39] <dweazle> (i hope not)
[10:39] <styx-tdo> emc² always gives me a bit of ... uneasiness
[10:40] <dweazle> all companies that are too big to fail give me that
[10:40] * ScOut3R (~ScOut3R@catv-80-98-160-69.catv.broadband.hu) has joined #ceph
[10:42] <dweazle> anyway, gotta go
[10:43] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[10:43] <styx-tdo> bye^^
[10:48] * yoshi (~yoshi@ has joined #ceph
[10:52] * The_Bishop__ (~bishop@e179009051.adsl.alicedsl.de) has joined #ceph
[10:52] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) Quit (Read error: Connection reset by peer)
[10:59] * loicd (~loic@ has joined #ceph
[11:03] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[11:04] <fghaas> styx-tdo:
[11:05] <fghaas> styx-tdo and dweazle: http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices -- if you're looking for something out of the box, see if RTS OS is going to add RBD support anytime soon
[11:10] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Ping timeout: 480 seconds)
[11:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[11:20] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[11:21] * dxd828 (~dxd828@ has joined #ceph
[11:27] * dxd828 (~dxd828@ Quit (Quit: Leaving)
[11:28] * dxd828 (~dxd828@ has joined #ceph
[11:28] <dxd828> Good morning
[11:33] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) Quit (Remote host closed the connection)
[11:37] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Read error: Connection reset by peer)
[11:44] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[11:46] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[11:50] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:52] * roald (~roaldvanl@ has joined #ceph
[11:58] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[12:00] * The_Bishop__ (~bishop@e179009051.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[12:03] <styx-tdo> fghaas: tnx, reading....
[12:09] * The_Bishop__ (~bishop@e179009051.adsl.alicedsl.de) has joined #ceph
[12:09] <fghaas> styx-tdo: it should be relatively straightforward to get running; if you do run into any issues please leave a comment at the bottom of the page and we'll be happy to fix it
[12:12] <styx-tdo> fghaas: hi.. one question there.. why do i need the iSCSI proxy node?
[12:12] <fghaas> well you asked for exporting iscsi
[12:12] <styx-tdo> rephrase: why do i need a dedicated node for that, as it is again required to be redundant for HA setups
[12:13] <fghaas> that's what it says at the bottom of the page, you can easily make it redundant with pacemaker. just don't get caught up in the delusion that you can magically scale out the iscsi head across as many nodes as you wish
[12:14] <styx-tdo> hm.. i am thinking more in the line of multipathing
[12:16] <fghaas> don't do that
[12:16] <fghaas> http://fghaas.wordpress.com/2011/11/29/dual-primary-drbd-iscsi-and-multipath-dont-do-that/ -- exact same considerations apply to the iscsi/rbd combo
[12:17] <styx-tdo> esx has, iirc, cluster-aware iscsi targets
[12:18] <fghaas> yep, but that's implemented differently, ttbomk
[12:19] * ScOut3R (~ScOut3R@catv-80-98-160-69.catv.broadband.hu) Quit (Remote host closed the connection)
[12:20] <fghaas> and I'm not sure if it requires VAAI, which is a commercial feature implemented in RTS OS only but not in the open source unified target, which has been the cause of some pretty heated discussion in terms of licensing, recently
[12:21] <fghaas> at any rate, if you're so inclined by all means try the multipath thingy and if it breaks, you get to keep all the pieces :)
[12:22] <styx-tdo> cool - as it is multipath, there will be many pieces ;)
[12:24] <fghaas> particularly when you shred your first vm image
[12:25] <fghaas> unless, of course, the intrepid inktank folks strike a deal with vmware to just support rbd natively, which would be the much preferred option for anyone I think
[12:26] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:31] <styx-tdo> so, to get that right: 2 boxes with ceph mon/osd/mds. the osd is the storage.
[12:32] <madkiss> not sure you'll be happy with 2 nodes.
[12:32] <styx-tdo> testbed...
[12:33] <styx-tdo> but it _should_ be redundant that way?
[12:33] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Connection reset by peer)
[12:33] <fghaas> 2 mons == bad idea
[12:33] <fghaas> and no they're not
[12:33] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:33] <styx-tdo> do you possibly have a link to the high-level architecture?
[12:34] <fghaas> http://lmgtfy.com/?q=ceph+architecture&l=1 seems to work remarkably well for me
[12:35] <styx-tdo> this i saw - it is not that high level ;)
[12:36] * jlogan (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) Quit (Ping timeout: 480 seconds)
[12:37] <fghaas> http://www.hastexo.com/misc/static/presentations/lceu2012/ceph.html -- does that help?
[12:38] <fghaas> surely there's a bunch of these introductory presentations on the inktank site and others too
[12:40] * dxd828 (~dxd828@ has joined #ceph
[12:41] <dxd828> Why do you have object replicas when crush is distributing the placement groups between osd's? Or is it somehow related to the min and max settings in crush?
[12:42] <madkiss> ask yourself what placement groups actually are :)
[12:42] <styx-tdo> fghaas: the hotel analogy is nice ^^
[12:43] <fghaas> styx-tdo: thanks, rturk shares credit for that one
[12:43] <dxd828> placement groups are just places to store osd's right.. but split up for efficiency?
[12:43] <styx-tdo> fghaas: can i get a duplicate of my wallet, please ;)
[12:44] * fghaas points dxd828 to http://www.hastexo.com/misc/static/presentations/lceu2012/ceph.html which I mentioned here just before you joined
[12:47] <styx-tdo> ok, so 3 MONs are good for being able to have a majority vote, right?
[12:48] <styx-tdo> fghaas: the shellinabox things don't show anything :/
[12:51] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[12:52] <fghaas> styx-tdo: well, of course not. do you really think the slide deck as made available after the conference includes the live demos?
[12:52] <dxd828> fghaas, thanks I get all of that.. What I don't get is that I have told crush to branch off at the rack level, so I'm assuming every rack has a single copy of the data no matter how many machines / osd's are inside. But then i find this object replica number which goes against my crush rules?
[12:53] <fghaas> how do you test for object placement? osdmaptool --get-map-object ?
[12:53] <fghaas> er, --test-map-object
[12:58] <dxd828> fghaas, I haven't tested just trying to design my crush map and pg's
[12:59] <fghaas> huh? whaddaya mean by "i find this object replica number which goes against my crush rules" then?
[13:02] <dxd828> fghaas, when you create a pool you specify the number of replicas "osd pool set data size [val]", but my crush map defines "rule data {step choose leaf firstn 0 type rack}". I have multiple racks so do I have to set the replica number to be = to the number of racks?
[13:03] <fghaas> if that's what you want, sure
[13:03] <fghaas> if you want one replica in each rack, then you had better set the pool size (# of replicas for the pool) to the number of racks
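To make the relationship concrete: the replica count and the placement rule are two separate knobs. A sketch in the crushmap syntax of this era (the rule body is illustrative, not dxd828's actual map, and assumes a pool named data and 3 racks):

```
# decompiled crushmap: put each replica under a different rack
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

# and, separately, the number of replicas on the pool itself:
#   ceph osd pool set data size 3     # one replica in each of 3 racks
```

The rule decides *where* replicas may land; the pool size decides *how many* there are.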
[13:03] <dxd828> fghaas, ok.. If it was a smaller number than the total amount of racks would it randomly place?
[13:04] <fghaas> it would place according to your crush rules
[13:06] <fghaas> also if you're using the default crushmap from earlier ceph releases, do also realize that the "pool" mentioned there is completely different from a rados pool, which is why the "pool" moniker has been released from current default crushmaps
[13:06] <fghaas> argl... "has been removed", of course
[13:09] <dxd828> fghaas, ok, so when I have "step choose leaf firstn 0 type rack" would it just do it on all "racks" available? if so how does the object replicas affect this?
[13:11] <dxd828> fghaas, does that mean there will be two copies of every object under each rack, or only two racks?
[13:12] <fghaas> dxd828: http://ceph.com/docs/master/rados/operations/crush-map/ doesn't help?
[13:14] <dxd828> fghaas, I have read it too many times this week lol. I just don't understand why there is this hard coded value of replicas when I thought the crush map handles that.
[13:15] <fghaas> it is not.
[13:15] <dxd828> also there is a lack of info explaining step take, choose, emit..
[13:15] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[13:16] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:16] <fghaas> I don't have a machine to test this right now, but iiuc what you propose simply means "never put two replicas in the same rack", where by default it just never puts two replicas on the same host
[13:19] <styx-tdo> fghaas: can you confirm i get it? - i'd need: 2x osd, 3x mon, 2x mds and do crush maps "replication" to have all data redundant&failsafe?
[13:19] <fghaas> for starters, yes. so 3 nodes total
[13:21] <styx-tdo> ok. the 3x mon/osd permit a voting mechanism, so it won't get inconsistent, CRUSH makes redundancy and the OSDs are just the data slaves
[13:23] <styx-tdo> so, is there anything speaking against putting the iscsi initiators on the 2 OSD nodes?
[13:24] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[13:26] <styx-tdo> hm.
[13:26] * kYann5 (~KYann@did75-15-88-160-187-237.fbx.proxad.net) Quit ()
[13:32] <darkfader> initiators?
[13:32] <darkfader> or do you mean targets?
[13:35] * gaveen (~gaveen@ has joined #ceph
[13:39] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[13:53] * dxd828 (~dxd828@ has joined #ceph
[13:53] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Read error: Connection reset by peer)
[13:54] <dxd828> does anyone know how to mount a pool using ceph-fuse?
[13:55] <styx-tdo> darkfader: er.. that server part of iSCSI.. i do mix them up all the time
[13:55] <styx-tdo> yep, target
[13:55] <styx-tdo> who named the client "initiator", anyways.. *sigh*
[13:58] <darkfader> because it initiates io on the bus
[13:58] <darkfader> anyway
[13:59] <darkfader> there is a small note in the docs i think that says to *not* use RBD on the OSDs
[13:59] <darkfader> you'll have to ask the resident gurus about why
[14:00] <darkfader> i just thought i'd better throw that in early, rather me being wrong than you finding out later that i wasn't :)
[14:00] <styx-tdo> the resident guru quit IRC after my last question. ;) interesting coincidence ^^
[14:00] <darkfader> well they don't work in irc
[14:00] <darkfader> :>
[14:00] <styx-tdo> wha.. really? ;))
[14:01] <styx-tdo> according to some sites that have names similar to a shell, there is no life outside of IRC... or something like that
[14:01] <darkfader> i would probably be happy with that
[14:01] <darkfader> or not notice :>
[14:01] * Cube (~Cube@76-14-138-191.rk.wavecable.com) has joined #ceph
[14:06] <nhm> morning #ceph
[14:14] * yoshi (~yoshi@ has joined #ceph
[14:16] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[14:16] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[14:21] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[14:23] * oliver2 (~oliver@jump.filoo.de) has joined #ceph
[14:23] * yoshi (~yoshi@ has joined #ceph
[14:23] <paravoid> with the new "ceph create" syntax, is it now impossible to number your osds as you'd like?
[14:23] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[14:23] <paravoid> e.g. I'd like to add 4 disks in one box, then another 4 in a different box
[14:23] <paravoid> then expand both boxes to, say, 8 disks
[14:24] * yoshi (~yoshi@ has joined #ceph
[14:24] <paravoid> I'd like them to be ceph0-3/ceph7-11 then ceph3-6/ceph12-15, so basically create gaps
[14:24] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[14:24] <paravoid> er
[14:24] <paravoid> osd. even :)
[14:29] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[14:35] * yoshi (~yoshi@ has joined #ceph
[14:35] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[14:42] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[14:50] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[15:03] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:09] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:12] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) has joined #ceph
[15:15] * xiaoxi (~xiaoxiche@ has joined #ceph
[15:23] * dosaboy (~user1@host86-163-127-16.range86-163.btcentralplus.com) has joined #ceph
[15:26] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[15:54] * xiaoxi (~xiaoxiche@ Quit (Remote host closed the connection)
[15:54] * PerlStalker (~PerlStalk@ has joined #ceph
[15:58] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[16:01] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[16:02] * SkyEye (~gaveen@ has joined #ceph
[16:02] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[16:02] * tezra (~Tecca@ Quit (Ping timeout: 480 seconds)
[16:05] * Cube (~Cube@76-14-138-191.rk.wavecable.com) Quit (Quit: Leaving.)
[16:07] * Cube (~Cube@76-14-138-191.rk.wavecable.com) has joined #ceph
[16:09] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[16:10] * tezra (~Tecca@ has joined #ceph
[16:13] * SkyEye is now known as gaveen
[16:16] * ScOut3R (~ScOut3R@54000C01.dsl.pool.telekom.hu) has joined #ceph
[16:16] * ScOut3R (~ScOut3R@54000C01.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[16:17] * jskinner (~jskinner@ has joined #ceph
[16:30] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:34] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:34] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) has joined #ceph
[16:48] * stp (~stp@dslb-084-056-007-080.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[16:49] * xiaoxi (~xiaoxiche@jfdmzpr05-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[16:49] * xiaoxi (~xiaoxiche@ has joined #ceph
[16:50] * styx-tdo (~styx@chello084113243057.3.14.vie.surfer.at) Quit (Quit: Konversation terminated!)
[16:52] * low (~low@ Quit (Quit: bbl)
[16:52] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[16:53] * Cube (~Cube@76-14-138-191.rk.wavecable.com) Quit (Quit: Leaving.)
[16:53] * styx-tdo_ (~styx@chello084113243057.3.14.vie.surfer.at) has joined #ceph
[17:02] * mtk (tCRfUXa1MU@panix2.panix.com) has joined #ceph
[17:08] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[17:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[17:31] * dxd828 (~dxd828@ has joined #ceph
[17:41] * xiaoxi (~xiaoxiche@ Quit (Ping timeout: 480 seconds)
[17:43] * oliver2 (~oliver@jump.filoo.de) has left #ceph
[17:43] * jlogan1 (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) has joined #ceph
[17:49] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[17:56] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) has joined #ceph
[18:01] <sstan> Hello
[18:05] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[18:11] * dxd828 (~dxd828@ has joined #ceph
[18:12] * roald (~roaldvanl@ Quit (Quit: Leaving)
[18:13] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) has joined #ceph
[18:17] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[18:19] * dxd828 (~dxd828@ Quit (Ping timeout: 480 seconds)
[18:34] * Oliver2 (~oliver1@ip-178-201-135-221.unitymediagroup.de) has joined #ceph
[18:38] * fzylogic (~fzylogic@ has joined #ceph
[18:40] * DLange (~DLange@dlange.user.oftc.net) Quit (Quit: +++Carrier lost+++)
[18:41] * KindTwo (KindOne@h158.234.22.98.dynamic.ip.windstream.net) has joined #ceph
[18:46] * KindOne (~KindOne@h158.234.22.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[18:46] * KindTwo is now known as KindOne
[18:48] * DLange (~DLange@dlange.user.oftc.net) has joined #ceph
[18:54] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[18:54] * houkouonchi-work (~linux@ has joined #ceph
[18:54] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[18:54] * houkouonchi (~linux@ has joined #ceph
[18:55] * yasu` (~yasu`@dhcp-59-227.cse.ucsc.edu) has joined #ceph
[19:02] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[19:06] <Robe> are there wheezy builds of 0.55?
[19:09] * sjustlaptop (~sam@ has joined #ceph
[19:14] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[19:15] * jskinner (~jskinner@ has joined #ceph
[19:17] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:25] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Read error: Connection reset by peer)
[19:25] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[19:34] * dosaboy (~user1@host86-163-127-16.range86-163.btcentralplus.com) Quit (Quit: Leaving.)
[19:39] <nhm> I wonder if "Monster Bash" is appropriate music to play when writing frankenstein scripts.
[19:41] <sstan> is there a pdf that contains all the documentation?
[19:43] <nhm> ah, apparently I played too many apogee games as a child.
[19:43] <nhm> sstan: that sounds like a great idea
[19:44] <sstan> yeah it would be cool
[19:44] <nhm> sstan: I actually have no idea how our documentation is generated. I used to use docbook that would let you export to pdf or html.
[19:46] <sstan> nhm: do you work for Ceph?
[19:46] <nhm> sstan: I'm a performance engineer at Inktank working on Ceph.
[19:46] <sstan> cool
[19:47] <nhm> sstan: Mostly I poke at ceph and various hardware bits and try to make them go
[19:49] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:06] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:06] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:08] * Cube (~Cube@76-14-138-191.rk.wavecable.com) has joined #ceph
[20:11] <nhm> ugh, too many debugging options to disable.
[20:18] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) has joined #ceph
[20:23] * houkouonchi (~linux@ Quit (Quit: Client exiting)
[20:24] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[20:34] <joshd> sstan: nhm: the docs are generated by sphinx, which can output to many formats including pdf
[20:34] <joshd> sstan: see admin/build-doc in ceph.git
[20:44] <mikedawson> joshd, nhm: is there anything written about RBD performance tuning? I have 22 OSDs online with SSD journal partitions, OpenStack Folsom. io performance is quite rough right now
[20:45] <mikedawson> XFS and disks in jbod, 8 nodes, GigE with separate public and cluster networks
[20:45] <nhm> mikedawson: I'll be writing it in about 4 weeks
[20:46] <joshd> mikedawson: do you have rbd caching turned on?
[20:46] <mikedawson> joshd: don't know, how would I set it?
[20:46] <nhm> mikedawson: joshd is definitely the expert right now. :)
[20:47] <joshd> on the compute hosts, add 'rbd cache = true' to your ceph.conf, and edit nova's libvirt xml template to add cache=writeback
[20:48] <joshd> that 'rbd cache = true' would go in the [client] or [client.volumes] section
[20:48] <joshd> or [global]
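Put together, joshd's suggestion on each compute host looks like this (a sketch; [client] also covers client.volumes, and [global] works too, as noted above):

```
; /etc/ceph/ceph.conf on the compute host
[client]
    rbd cache = true
```

The matching qemu-side change is cache='writeback' on the <driver> element of the rbd disk in the libvirt XML that nova generates.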
[20:48] <janos> when changes like that are made to ceph.conf is some sort of restart/reload needed?
[20:49] <joshd> yes
[20:49] <joshd> in this case it's a client side change, so any new vms will get the setting
[20:50] <mikedawson> i don't have a [client] or [client.volumes] section. the only purpose of this ceph deployment is to back Glance and Cinder, so I'm guessing it is ok to put it in [global] i.e. it won't make Glance unhappy, right?
[20:51] <joshd> right
[20:52] * BManojlovic (~steki@ has joined #ceph
[20:52] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[20:53] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[20:53] * vata (~vata@ Quit (Quit: Leaving.)
[20:59] * yoshi (~yoshi@ has joined #ceph
[21:00] <joshd> mikedawson: looks like there's no config option in nova for the cache setting in folsom. you'd need to change nova/virt/driver/libvirt/volume.py from conf.driver_cache = "none" to conf.driver_cache = "writeback"
[21:01] <mikedawson> joshd: will do
[21:03] <mikedawson> joshd: /usr/share/pyshared/nova/virt/libvirt/volume.py ?
[21:03] * jbarbee (17192e61@ircip1.mibbit.com) has joined #ceph
[21:03] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[21:04] <mikedawson> it's in there three times... under LibvirtVolumeDriver, LibvirtFakeVolumeDriver, and/or LibvirtNetVolumeDriver?
[21:05] <joshd> just the last one should do it
[21:05] <joshd> rbd is a 'network disk' to libvirt
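In folsom, then, the change joshd describes is a one-liner in nova/virt/libvirt/volume.py, in LibvirtNetVolumeDriver only (sketched as a diff; surrounding code omitted):

```
-        conf.driver_cache = "none"
+        conf.driver_cache = "writeback"
```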
[21:07] <mikedawson> got the config files changed and pushed around. ceph restarted everywhere. does service nova-compute restart on all nodes get the conf.driver_cache = "writeback" active?
[21:08] <joshd> yeah
[21:09] <mikedawson> if I create a new empty volume and attach it to the existing instance could I end up with C: ephemeral, D: without caching, and E: with writeback cache?
[21:10] <joshd> I think so
[21:11] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) has joined #ceph
[21:13] <Oliver2> Hey Josh… talking about caching… uhm, do you have an idea, why (ping-) latency is bad with with rbd_cache=true… vs. cache=writeback whilst writing data?
[21:15] * ScOut3R_ (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) has joined #ceph
[21:18] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) Quit (Read error: Operation timed out)
[21:25] <joshd> Oliver2: yeah, sorry I haven't replied to that email yet
[21:25] <Oliver2> I know you're all _very_ busy these days… no prob.
[21:26] <Oliver2> Just wondering.
[21:26] <joshd> Oliver2: when the cache is full, writeback is triggered and the next write blocks until there is free space for it in the cache
[21:27] <joshd> Oliver2: this can result in some higher latency writes, similar to the situation on the osd when the journal fills up
[21:29] <Oliver2> Joshd: yeah, but you can really "feel" that the VM is stuck, even while typing inside the VM in a linux guest. Problem vanishes with "rbd_cache=false,cache=writeback": smooth, and "spew" testing performs with better results. Customer not complaining any more. Strange.
[21:29] * vata (~vata@ has joined #ceph
[21:31] <joshd> Oliver2: I think it's something we'll want to look into more. that much degradation is strange. What is the write workload that triggers this in the guest, and do you know if your qemu is using a separate thread for i/o (there's an iothread option at configure time)?
[21:34] <joshd> Oliver2: actually it looks like the iothread option is the default, and no longer configurable
[21:34] <Oliver2> Joshd: looking at the configure output right now, no wonder I don't see it ;)
[21:35] <Oliver2> Joshd: linux-aio has been configured since a couple of releases… if it matters. workload is like doing a dd with a couple of MBs, even without further oflags like sync.
[21:39] <Oliver2> Josh: seen with every version of current qemu, that is 1.2.2, 1.3.0, host-kernel 3.6.7, 3.6.11. Problem in the background: we imported about 200 VMs with rbd_cache settings, and the "feeling" is that they break each other's performance. Not proven, of course.
[21:40] <joshd> Oliver2: you mean they degrade performance of running vms while they're being imported? or just that they affect each other more when running the vms with rbd_cache=true?
[21:42] <Oliver2> Joshd: the latter.
[21:43] <mikedawson> joshd: do you install cinder-volume on all nova-compute nodes?
[21:43] <joshd> mikedawson: no, you'd only do that if you were using lvm on all of them
[21:44] <mikedawson> we're not using lvm at all right now, but we have cinder-volume across all nova-compute nodes
[21:45] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) has joined #ceph
[21:46] <joshd> Oliver2: if you could file an issue in the tracker about the ping latency problem with a couple more details (like the precise dd command if it matters, and a log with debug rbd = 20, debug objectcacher = 20, debug ms = 1), that'd be great
[21:47] <joshd> mikedawson: you're probably fine with just a single cinder-volume
[21:48] <mikedawson> ok. I basically want ceph osds and nova-compute across all nodes
[21:49] <Oliver2> Joshd: true, you deserve it ;) Won't be able to in the production cluster, though, too much trouble in recent weeks, and we shut down all 4 lab machines. Will try my best.
[21:49] <joshd> Oliver2: ok, thanks :)
[21:50] <mikedawson> so libvirt talks directly to rbd without the need for cinder-volumes on each host?
[21:50] <Oliver2> Joshd: thanks for the feedback.
[21:51] <joshd> mikedawson: cinder-volume basically just runs rbd commands. it's just for management. qemu is what ends up talking directly to rbd (which is configured through libvirt by nova-compute)
[21:51] <joshd> Oliver2: you're welcome
[21:51] <mikedawson> joshd: thanks for the explanation
[21:52] * ScOut3R_ (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) Quit (Ping timeout: 480 seconds)
[21:53] <joshd> mikedawson: no problem
[21:53] * ScOut3R (~ScOut3R@1F2EA078.dsl.pool.telekom.hu) Quit (Ping timeout: 480 seconds)
[21:56] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Ping timeout: 480 seconds)
[22:05] * danieagle (~Daniel@ has joined #ceph
[22:05] * drokita (~drokita@ has joined #ceph
[22:06] <drokita> Good afternoon Josh D
[22:06] <drokita> joshd
[22:06] <joshd> hi drokita
[22:08] <drokita> Sorry that I am running a few minutes late. Do you have some time?
[22:08] <joshd> yeah, ask away
[22:10] <drokita> So, based on the conversation yesterday, I seem to be missing a valid client.admin keyring, and attempts to create one seem to keep failing.
[22:10] <drokita> Yesterday, we restarted the mon with a new ceph config with NO cephx enabled
[22:11] <drokita> I believe that we were going to pull the runtime config to make sure that cephx was actually disabled
[22:11] <drokita> That was the last step that I remember
[22:11] <joshd> yeah, that's my recollection as well
[22:13] <joshd> to do that, it's 'ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show'
[22:14] <joshd> assuming that's the path to mon.a's admin socket
[22:14] <drokita> connect to /var/run/ceph/ceph-mon.a.asok failed with (13) Permission denied
[22:14] <drokita> let me check the socket file name
[22:14] <drokita> The name is correct
[22:15] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[22:16] <joshd> you might need sudo depending on permissions
[22:17] <drokita> My bad :)
[22:17] <drokita> Ok, I have rather substantial list of configurations in front of me
[22:18] <joshd> auth_supported is the relevant one
[22:18] <drokita> ""
[22:19] <joshd> that means default, which is none
[22:20] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[22:22] <joshd> could you pastebin your ceph.conf for reference?
[22:23] <drokita> sure
[22:23] <drokita> one sec
[22:23] * dmick (~dmick@2607:f298:a:607:512b:3e27:e264:529) has joined #ceph
[22:24] <joshd> if you're running everything on one box, the client (like running ceph -s) will use the same one
[22:25] <joshd> but to be sure it's using the right settings and monitor, you could run 'ceph -m ip_of_mon.a:port --auth-supported none -s'
[22:28] * calebamiles (~caleb@65-183-128-164-dhcp.burlingtontelecom.net) has joined #ceph
[22:35] <drokita> Hey Josh... I'm sorry. I just got pulled in 3 different directions. I will ping you in a sec
[22:36] <joshd> no problem, I'm doing other stuff in the background anyway
[22:36] * fmarchand (~fmarchand@ has joined #ceph
[22:36] <fmarchand> hi !
[22:37] <drokita> Here is the PB: http://pastebin.com/chsxGmZg
[22:38] <fmarchand> I have a question : I'm ready to install a ceph cluster ... I've already tried it with cephFS but I would like now to create a ceph cluster with rados ... where should I start ?
[22:39] <wer_gone> fmarchand: adding rados is pretty straight forward. there are docs on it. Basically apache, some rados fgci glue and a user....
[22:40] <wer_gone> http://ceph.com/docs/master/radosgw/config/
[22:41] <wer_gone> fcgi that is :)
[22:41] <fmarchand> wer_gone : but with radosgw I won't be able to mount a partition like cephFS ?
[22:41] <wer_gone> no, it is a restful interface. s3 or swift.
[22:42] <wer_gone> you can still mount your partitions.... but rados runs as another service.
[22:42] <wer_gone> http://ceph.com/docs/master/radosgw/
[22:43] <fmarchand> I would like to be able to mount a partition ... what about rbd ?
[22:44] <wer_gone> I thought you said you were messing with cephfs already?
[22:44] * dxd828 (~dxd828@host86-165-22-173.range86-165.btcentralplus.com) has joined #ceph
[22:44] <joshd> drokita: so does 'ceph -m --auth-supported none -s' work?
[22:44] <dmick> there may be some terminology confusion here
[22:44] <dmick> RADOS is the cluster
[22:44] <dmick> there are multiple ways to store things in the RADOS cluster
[22:45] <dmick> 1) cephfs, the Posix filesystem
[22:45] <drokita> 2012-12-21 15:44:54.393538 7fcf0a2f4780 -1 unable to authenticate as client.admin
[22:45] <drokita> 2012-12-21 15:44:54.393819 7fcf0a2f4780 -1 ceph_tool_common_init failed.
[22:45] <dmick> 2) radosgw, the S3/Swift RESTful interface
[22:45] <dmick> 3) rbd, the block device
[22:45] <dmick> 1) cephfs can be done with a kernel FS module or a FUSE module
[22:46] <dmick> 3) rbd can be done with a kernel block driver or userland libraries (as in libvirt-qemu and thus OpenStack)
[22:46] <dmick> you can use the RADOS cluster for any or all of these access methods at the same time
[22:46] <dmick> (and, further, you can write your own application to store and retrieve objects using librados directly)
[22:47] <wer_gone> fmarchand: what he said :)
[22:49] <fmarchand> mmmm ... :) I tried cephfs but I had some issues with mds. I would like to try with rbd (I said rados but I'm still confused sometimes :) ) but in the docs they say that I need a running ceph cluster first ... so I don't understand where to start ...
[22:49] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[22:49] <fmarchand> not sure I'm clear :)
[22:49] <dmick> yes. a Ceph cluster is a RADOS cluster
[22:49] * loicd (~loic@magenta.dachary.org) has joined #ceph
[22:49] <dmick> the cluster is "monitors, OSDs, and storage"
[22:49] <fmarchand> yes oki
[22:49] <dmick> Ceph != CephFS
[22:50] <fmarchand> yes
[22:50] <dmick> and yes, the root of everything is the cluster.
[22:51] <joshd> drokita: could you run ceph-mon --version to confirm it's argonaut, and not a later version that's enabling cephx by default?
[22:52] <joshd> drokita: you can also tell by the logs from 'ceph -s --debug-auth 20 --debug-ms 1'
[22:53] <fmarchand> so I can configure a ceph cluster with 3 virtual disks (not initialized) and run it ? and then run the rbd command to create the rados block device ? Sorry if it sounds really newbie ...
[22:54] <dmick> it depends on what you mean by "with 3 virtual disks". If you're talking about running the cluster on VMs, then yes. The cluster needs at least one real or virtual machine to run daemons on, and those daemons need storage, real or virtual
[22:55] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[22:55] <joshd> drokita: explicitly setting 'auth supported = none' in your ceph.conf and restarting the monitor will fix it if it is defaulting to cephx
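For reference, the argonaut-era spelling of that setting in ceph.conf (later releases split authentication into separate auth cluster/service/client required options):

```
[global]
    auth supported = none
```

followed by a restart of the monitor, as joshd says above.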
[22:55] <dmick> but then the cluster is an object storage system. It stores objects. Then, you can operate on that object storage, for instance, to layer on top of that rbd images that you create with the rbd command, yes
[22:56] <dmick> underneath, the rbd images are stored as objects in RADOS, in the Ceph cluster. But you need never know that directly
[22:58] <drokita> 2012-12-21 15:57:34.234790 7fdc9713c780 2 auth: CephX auth is not supported.
[22:58] <drokita> ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
[22:58] <drokita> I will try to change the auth supportted
[22:58] <fmarchand> oki ! what I got in mind is: configure 3 osd's (the 3 virtual disks), a mon ... run the ceph cluster ... and then run the rbd command to initialize the 3 osd's (would like xfs). I think it's not the right way to proceed ...
[23:00] <drokita> auth supported set to none
[23:02] <dmick> fmarchand: no, it's not.
[23:02] <dmick> you set up the cluster first
[23:02] <dmick> any initialization of OSDs is done when you set up the cluster
[23:02] <dmick> because the cluster is made up of mons, OSDs, and storage
[23:02] <fmarchand> yes by the way, I run the cluster in a vm that's why I was talking about virtual disk
[23:02] <dmick> and the cluster runs first
[23:03] <dmick> *once the cluster is running*
[23:03] <dmick> then you can use it to store rbd images
[23:05] <fmarchand> mmmm oki that's more clear ... so when the cluster runs I'll be able to tell that the 3 osd's must be formatted to xfs with the rbd layer on top ? and this is done by the rbd commands ?
[23:06] <janos> you don't directly format the osd's the way you are thinking
[23:06] <janos> the cluster manages them
[23:07] <janos> you make a block device via rbd
[23:07] <janos> and format that
[23:07] <janos> rbd allows you to make a block device which is backed by the cluster - which includes those osd's
[23:07] <joshd> drokita: so now does ceph -s work?
[23:09] <janos> any plans on rbd to define the --size param with values like '10GB' or '1TB'?
[23:09] <janos> i spaced out and made a 1PB rbd device by accident through the magic of math
[23:09] <janos> ;)
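[aside] Since rbd's --size argument is taken in megabytes, the "magic of math" janos ran into is easy to reproduce; a quick sanity check (pure arithmetic, no cluster needed, and the image name below is hypothetical):

```shell
# rbd's --size argument is in megabytes, so a 1 TB image is 1024 * 1024 MB.
# Slipping in one extra factor of 1024 turns that 1 TB into 1 PB.
SIZE_MB=$((1024 * 1024))        # 1 TB expressed in MB
echo "$SIZE_MB"                 # prints 1048576
# hypothetical invocation: rbd create myimage --size "$SIZE_MB"
```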
[23:10] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[23:12] <fmarchand> ok, I understand ... rbd will manage the disk through the cluster ... and does that mean I will be able to mount it as a single device, like cephFS?
[23:13] <drokita> It does now... that is an improvement :)
[23:13] * fzylogic (~fzylogic@ has joined #ceph
[23:14] <dmick> fmarchand: no, nothing like that at all
[23:14] <dmick> I'm not sure how I'm not being clear here
[23:14] <dmick> once the cluster is running
[23:14] <dmick> you can use rbd to create virtual block devices
[23:14] <dmick> they act just like a disk
[23:15] <janos> plain naked block devices
[23:15] <dmick> so you can format them with any filesystem you like
[23:15] <dmick> *independent from that*, in a *completely separate use of the cluster*
[23:15] <dmick> you can use cephfs on the cluster to create a filesystem
[23:15] <dmick> but that's *completely separate from RBD*
[23:15] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:16] <janos> fmarchand: the cluster offers a few different ways to use it
[23:16] <dmick> janos: I know, that's annoyed me as well (the sizes)
[23:16] <dmick> I'd like to have that accept human-readable sizes
[23:16] <wer_gone> stupid humans
[23:16] <janos> i think i should learn some python
[23:16] <dmick> always good, although I'll point out that the rbd tool is C++
[23:17] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:17] <janos> my tiny little test cluster has been deleting that 1PB rbd image for a few days now
[23:17] <dmick> oh dear
[23:17] <janos> hahah
[23:17] <janos> i find it pretty funny
[23:17] <janos> it's 79% done!
[23:17] <dmick> if you don't have any other images there to save
[23:17] <janos> doesn't seem to be stopping me from using anything
[23:17] <dmick> you can short-circuit that pretty quickly
[23:17] <janos> !
[23:17] <janos> that would be nice
[23:18] <dmick> ./rados rmpool rbd; ./rados mkpool rbd
[23:18] <wer_gone> yup
[23:18] <dmick> will just throw away everything in the rbd pool, which is every rbd image
[23:18] <fmarchand> by the way, thanks for your patience! what confuses me is: "they act just like a disk" ... which to me means that I'll be able to mount it ...
[23:18] <janos> ahh, ok
[23:18] <sstan> fmarchand: francais? ... Before you start the cluster, you format a block device ( XFS, btrfs). Then, you mount it somewhere. Only after that's done, you'll want to start the OSDs.
[23:18] <janos> i have other tests going on on that pool
[23:18] <janos> bummer
[23:18] <joshd> drokita: ok, cool. so now you're back to a state where you can get keyrings setup like http://ceph.com/docs/master/rados/operations/authentication/#enabling-cephx says
[23:18] <dmick> sstan: not sure that's helping. that's in teh cluster creation part, long before it's used for anything
[23:19] <sstan> dmick: exactly. After the cluster works, only THEN you use the rbd commands to provision block devices.
[23:19] <fmarchand> yes .... I thought my french accent would not be heard in irc ... busted :)
[23:19] <sstan> after block devices are available, you can do whatever (format them in XFS or EXT3, or even FAT)
[23:19] <sstan> haha TYPICAL syntax
[23:20] <wer_gone> :)
[23:20] <janos> dmick: i first knew i screwed up on the device size when i tried to format with ext4 and it balked - couldn't handle the size
[23:20] <janos> threw a few others at it, also balked
[23:20] <janos> btrfs worked fine, but i inspected because i knew something was off
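[aside] The create/map/format flow being described can be sketched roughly as below; `test` is a hypothetical image name, --size is in megabytes, and `/dev/rbd0` assumes this is the first image mapped on that machine (which should not be an OSD host, per the kernel-client deadlock caveat raised later in the conversation):

```shell
# Hedged sketch of using a kernel rbd block device; image name,
# mountpoint, and device node are examples.
rbd create test --size 4096      # 4 GB image in the default 'rbd' pool
rbd map test                     # kernel module exposes it as /dev/rbd0
mkfs.xfs /dev/rbd0               # any filesystem works: xfs, ext3, even FAT
mount /dev/rbd0 /mnt/test
```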
[23:21] * vata (~vata@ Quit (Quit: Leaving.)
[23:21] <sstan> fmarchand: just read all the documentation, it will make perfect sense, I promise.. there's a lot of words, but ... work has to be done I guess
[23:21] <janos> generally speaking (i'm a noob) - if you load up hosts with more osd's can you expect some degree of speed increase?
[23:22] <janos> currently i have 2 hosts, but few osd's
[23:22] <janos> (home use, testing)
[23:22] <dmick> janos: as long as you don't bottleneck on one disk, or one controller, or one bus, or start pushing memory/CPU limits, sure
[23:22] <dmick> or, between the 2 hosts, saturating the net
[23:22] <janos> yeah i haven't hit any resource limits
[23:22] <janos> definitely not that!
[23:23] <dmick> you have 10G?
[23:23] <janos> 12mb/s ;(
[23:23] <janos> 1GB - home use
[23:23] <janos> i WISH
[23:23] <sstan> wow I wish I had 10G
[23:23] <dmick> it's not hard to saturate a 1GB link with disk traffic, be aware
[23:23] * jbarbee (17192e61@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[23:23] <janos> my current home setup - a raid10 arrangement - gets near-theoretical 1gb transfer speeds when i send files to it
[23:23] * wer_gone is now known as wer
[23:24] <dmick> 120MB/s?
[23:24] <janos> usually 100-115 mb/s
[23:24] <janos> it's consistent
[23:24] <dmick> presumably MB, not mb
[23:24] <janos> right
[23:24] <dmick> 115 millibits really sucks
[23:24] <sstan> yeah, GB
[23:24] <janos> MB/s
[23:24] <janos> haha
[23:24] <sstan> 1gb != 1GB
[23:25] <janos> my tendency to avoid capital letters is biting me
[23:25] <fmarchand> sstan: where was my syntax so "Typique"? :) you lost me, because I understood the cluster should be running before any rbd commands...
[23:25] <sstan> no .. the way you write sentences in English
[23:25] <janos> i have some dell rackmount powerconnect switches at home that i've been really happy with
[23:25] * dxd828 (~dxd828@host86-165-22-173.range86-165.btcentralplus.com) Quit (Quit: Computer has gone to sleep.)
[23:25] <sstan> fmarchand: you are correct. 1) Start the cluster 2) use rbd
[23:26] <janos> capital letters for common communication are a complete waste
[23:26] <janos> ;)
[23:26] <janos> but yes, for technical - like GB i should be more correct
[23:27] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) Quit (Remote host closed the connection)
[23:27] <janos> i have 2 switches, i think once i get more OSD's in the mix it will be time for me to isolate cluster traffic
[23:29] <janos> interesting
[23:30] <janos> i have 3 osd's - each 1TB
[23:30] <janos> one just told me it was full
[23:30] <janos> i've been putting gobs of files on them to get to this
[23:30] <fmarchand> so what I should do is configure the cluster almost like I did with my former cluster (cephFS) but without mds in the ceph.conf, then run the cluster (with the OSDs' disks mounted but not initialized), and after, and only after, create a block device image with the rbd commands .. am I right ?
[23:30] <janos> how does a noob like me go about redistributing the storage more evenly?
[23:30] <dmick> the OSD's disks must be both mounted and initialized before you can run the cluster
[23:30] <dmick> since the OSDs *make up* the cluster
[23:30] <janos> replication is set to 2
[23:31] <dmick> creating storage for the OSDs to use is part of setting up the cluster
[23:31] <sstan> yeah, like even before installing Ceph
[23:32] <sstan> you format a disk , then you initialize it (i.e. mount /dev/xyz /ceph/osd.1 )
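[aside] sstan's per-OSD preparation, sketched; the device, mountpoint, and OSD id are examples, and the mount path follows his `/ceph/osd.1` convention rather than any required layout:

```shell
# Hedged sketch: prepare one OSD's backing disk before starting the cluster.
mkfs.xfs /dev/sdb1               # or btrfs; format the raw partition
mkdir -p /ceph/osd.1             # example mountpoint (sstan's convention)
mount /dev/sdb1 /ceph/osd.1      # the OSD daemon will use this directory
```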
[23:32] * ScOut3R (~ScOut3R@catv-80-98-160-69.catv.broadband.hu) has joined #ceph
[23:33] <sstan> when your ceph health shows OK ... only THEN should you create rbd blocks
[23:35] <fmarchand> oh oki .... it makes sense ...
[23:35] <janos> hrmm. how can i tell how full my various osd's are?
[23:35] * Leseb (~Leseb@bea13-1-82-228-104-16.fbx.proxad.net) Quit (Quit: Leseb)
[23:36] <joshd> janos: ceph osd dump
[23:36] <janos> ah, i found another way as well - ceph health detail told me
[23:37] <janos> thank you, though i'm going to look at that as well
[23:37] <fmarchand> and then any block devices I will create could be used as a disk through the cluster ?
[23:38] <fmarchand> sstan: French too?
[23:38] <sstan> yes, that's how it works
[23:39] <sstan> yes, but I haven't configured the French layout on my keyboard
[23:40] <fmarchand> if I don't have a mds daemon to handle metadata I assume I will see a rbd process somewhere. won't I ?
[23:41] <sstan> you don't need the mds daemon for that purpose
[23:42] <dmick> mds is only for cephfs
[23:42] <dmick> and when you use rbd, you can use it two ways, as I said above:
[23:42] <dmick> 1) you can make a kernel block device show up, so then you just use /dev/rbd1 as usual
[23:43] <dmick> 2) you can use the images you create behind libvirt-qemu KVM virtual machines
[23:43] <dmick> where they show up as "the disk" for the VM
[23:43] <dmick> in the first case, there's a kernel module that makes the block device appear (speaks RADOS to the cluster, and speaks block device on top)
[23:44] <dmick> in the second, there's a library (librbd) that talks to the cluster and talks qemu on top
[23:44] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:46] <fmarchand> ok, you mean that in the second case, if I use libvirt, I would be able to "attach" the block device like a virtual disk to my vm (vmware)?
[23:46] <sstan> dmick: thanks; that's helpful
[23:46] <dmick> fmarchand: yes
[23:46] <dmick> not vmware, specifically
[23:46] <dmick> but kvm
[23:46] <sstan> dmick : but that's also true for the first case
[23:47] <dmick> if you want to use vmware, you'd have to use the kernel block device
[23:47] <fmarchand> yes oki ... I start to see the whole picture !
[23:47] <dmick> sstan: in the first case there's a kernel block device driver
[23:47] <sstan> in the first case, blocks are seen by "everyone" (like normal disks). In the second case, only the hypervisor sees blocks
[23:48] <sstan> ?
[23:48] <dmick> in the second, there's no kernel block driver; it's all internal to libvirt
[23:48] <fmarchand> thank you guys !
[23:49] <dmick> so for instance I can "rbd create image " and then "rbd map image", and on that machine where I ran the rbd map, there'll be a /dev/rbd0 created, that I can use just like any kernel block device
[23:49] <sstan> dmick: aight I got that :)
[23:49] <dmick> however: I can also "rbd create image", and then configure kvm/qemu/libvirt to use that image
[23:49] <sstan> aaah
[23:49] <dmick> and then when I start the VM, that VM will talk to that image, no kernel block device around (except the one inside the VM)
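[aside] The second, userland path dmick describes might look like this with qemu; `vmdisk` is a hypothetical image name, and the exact qemu-img/-drive syntax varies by qemu version, so treat this as a sketch rather than the canonical invocation:

```shell
# Hedged sketch: back a KVM guest's disk with an rbd image via librbd,
# with no kernel block device on the host (the block device only exists
# inside the VM).
qemu-img create -f raw rbd:rbd/vmdisk 10G
qemu-system-x86_64 -m 1024 -drive format=raw,file=rbd:rbd/vmdisk
```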
[23:49] <sstan> oh quick question : Is it a good idea to run clients and OSDs on the same machine ?
[23:50] <dmick> sstan: not kernel RBD or kernel CephFS modules; that can cause deadlock
[23:50] <dmick> but the strictly-userland things are fine
[23:50] <sstan> ah, so I should use the second option you mentioned
[23:51] <dmick> or run the kernel block dev on some other machine
[23:51] <dmick> fmarchand: the picture here might help
[23:52] <dmick> http://ceph.com/docs/master/
[23:53] <fmarchand> dmick : I did that because I had to mount the cephFS partition for a samba share ... and yes ... it's not working well
[23:53] <dmick> either use FUSE on the same machine, or run the kernel client on a different machine
[23:53] <sstan> dmick : but that's not an option if want to run a VM cluster on the same machines that run the Ceph Cluster. If I got this right ... I should use userland software for that use case ??
[23:54] <dmick> yes.
[23:54] <sstan> aight ... hypervisors + ceph + FUSE
[23:54] <sstan> on the same machines
[23:54] <dmick> the upside is, since you're not going through the kernel, you gain some performance
[23:55] <dmick> and you're also more flexible in terms of which kernel you run
[23:55] <dmick> (RHEL/Centos still have reeeeally old kernels)
[23:55] <sstan> hmm indeed, that's a huge plus
[23:55] <sstan> SLES too
[23:55] <slang> fmarchand: if you're interested, we have a samba ceph module that talks to cephfs directly from the samba server
[23:56] <slang> fmarchand: still being tested before we ask the samba folks to incorporate it, but I can point you at it if you want to try it out
[23:56] <dmick> slang: I forgot about that, good lookin' out
[23:56] <fmarchand> slang : good news ! Yes I can test it ... there is a client that will be using windows ... so I don't have many options
[23:57] <sstan> yeah, thanks for mentioning that, I didn't know it existed
[23:57] <dmick> sstan: it's brand new
[23:57] <slang> fmarchand, sstan: http://github.com/ceph/samba
[23:58] <slang> fmarchand: what os are you looking to install it on?
[23:59] <slang> and which version of samba (3.x or 4.x)?
[23:59] <fmarchand> the cluster ?
[23:59] <fmarchand> ah no the samba server

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.