#ceph IRC Log


IRC Log for 2012-12-20

Timestamps are in GMT/BST.

[0:00] <nhm> yehudasa: depends on the year. Sometimes it's gone by early march, other times march is a very very snowy month.
[0:01] <elder> March is the snowiest month on average I think.
[0:01] <elder> I can tell you the ice can be gone in late March though. :)
[0:02] <nhm> elder: I know it used to be. Not sure if that's been the case in recent years.
[0:02] <elder> But April is more typical.
[0:03] <vjarjadian> does anyone know what Ceph's plans are for the gro-replication feature they mentioned in the slideshow i was pointed to yesterday?
[0:03] <vjarjadian> geo-replication
[0:03] <rweeks> silly northerners
[0:03] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[0:05] <sjustlaptop> vjarjadian: what slideshow?
[0:06] <mikedawson> joshd: now I cannot "cinder create --image-id c054cf4a-b6d0-4107-8f75-55228a09b31e --display-name test 10" it errors every time
[0:06] <rweeks> Probably sage's preso from LISA 12
[0:06] <rweeks> slide 72: http://www.slideshare.net/Inktank_Ceph/ceph-lisa12-presentation#btnNext
[0:06] <mikedawson> can do it without the --image-id
[0:06] <rweeks> The Cuttlefish roadmap
[0:07] <vjarjadian> yeah, that one
[0:07] <sjustlaptop> vjarjadian: good question, what are you looking for?
[0:07] * `gregorg` (~Greg@ has joined #ceph
[0:07] <rweeks> vjarjadian: given that Bobtail, the next major release, is not quite out yet, Cuttlefish is months away at best. So yes, if you can tell us what use cases you have for geo replication, it would be useful
[0:07] <sjustlaptop> as far as I know the details are still fuzzy
[0:07] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[0:07] <sjustlaptop> now would be a *great* time for suggestions ;)
[0:08] <vjarjadian> ideally... to be able to have ceph write to the copies on the local subnet and declare the write complete to the OS... but then replicate that part of the block to the other sites as part of the self healing or after the write is complete so that processing can continue
[0:09] <sjustlaptop> what are your expectations about operation while the remote site is network-disconnected
[0:09] <sjustlaptop> ?
[0:09] * Kioob (~kioob@luuna.daevel.fr) Quit (Remote host closed the connection)
[0:09] <rweeks> alternatively, would something like snapshot replication to the remote site be a good alternative?
[0:10] * gregorg_taf (~Greg@ has joined #ceph
[0:10] <vjarjadian> in my scenario... if the servers/network connection go down at a site, that site is effectively useless, so the data could quite happily drop out of that site and reshuffle over the other servers
[0:10] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[0:10] <sjustlaptop> so you would be ok with treating it like losing a rack?
[0:11] <nhm> sjustlaptop: btw, look at the list of things to test for the sweeps when you have a chance
[0:11] <vjarjadian> yeah... if just the OSD went down... theoretically the clients could still contact the files over the WAN at reduced speed... that's better than nothing... but at least the data would be safe on other sites
[0:11] <sjustlaptop> nhm: right! on it now
[0:12] <vjarjadian> and if i did add more sites later... it could instantly rebalance without much of a problem... theoretically
[0:13] <vjarjadian> the only downside could be if a site was lost... then the self healing would cause a lot of WAN traffic...
[0:13] <sjustlaptop> if you lost the primary site you would also lose the results of "completed"
[0:13] <sjustlaptop> writes
[0:14] <vjarjadian> nothing is perfect...
[0:14] <sjustlaptop> yah
[0:14] <vjarjadian> if we all had 1gbps connections with 5ms ping to the moon then we could just use ceph as is LOL
[0:14] <vjarjadian> some compromises have to be made
[0:15] <sjustlaptop> one option we have been kicking around is allowing you to set on a pool a number N such that the write is considered durable after only N replicas have committed, where N is less than the number of replicas (M)
[0:16] <sjustlaptop> combined with a ceph rule requiring at least one replica in the remote DC, that would give you some of what you want
[0:16] <nhm> vjarjadian: Maybe we should buy up dark fibre and bundle it up with ceph subscriptions. ;)
[0:16] <rweeks> vjarjadian: unfortunately the laws of physics are still pretty firm on the speed of light
[0:16] <sjustlaptop> the challenge is that in the normal operating case, you need to talk to M-N+1 of the replicas from each acting set to be sure of seeing all of the writes
[0:16] <sjustlaptop> so we would need to add an escape valve to relax that constraint if you lost the primary DC (to allow the case where you voluntarily lose recent writes)
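The M-N+1 figure sjustlaptop gives can be checked by brute force. The following is a minimal sketch, not Ceph code: `min_read_set` and `always_overlaps` are hypothetical helper names, and replicas are modelled as plain integers. It shows that if a write is acknowledged by any N of M replicas, then any group of M-N+1 replicas must contain at least one that saw the write, while one fewer is not enough.

```python
from itertools import combinations


def min_read_set(m: int, n: int) -> int:
    """Replicas to consult so that at least one of them holds every N-acked write."""
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= m")
    return m - n + 1


def always_overlaps(m: int, n: int, r: int) -> bool:
    """Brute force: does every r-subset of M replicas intersect every n-subset?"""
    replicas = range(m)
    return all(
        set(written) & set(consulted)
        for written in combinations(replicas, n)
        for consulted in combinations(replicas, r)
    )


if __name__ == "__main__":
    m, n = 3, 2  # pool size 3, write durable after 2 commits
    r = min_read_set(m, n)
    print(r, always_overlaps(m, n, r), always_overlaps(m, n, r - 1))
```

With a pool size of 3 and N = 2, two replicas must be consulted; consulting only one can miss a write that landed on the other two.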
[0:17] <vjarjadian> it would be clunky... but if you had copy on write... and had it write as many copies as are needed locally first, then after the write is done add more to the number of replicas needed... then the self healing would take care of the Geo-Replication...
[0:18] <vjarjadian> and might give ceph a method of versioning blocks...
[0:18] * `gregorg` (~Greg@ has joined #ceph
[0:18] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[0:20] <sjustlaptop> vjarjadian: that is something you might consider for an rbd specific solution
[0:20] <sjustlaptop> is rbd what you are thinking of?
[0:20] <vjarjadian> i'm no SAN expert...
[0:20] <vjarjadian> i currently use VPN and a file sync program... set to run at a time
[0:21] <vjarjadian> it's a pain... but i dont have a better solution at the moment
[0:21] <sjustlaptop> sorry, I mean rados block device, is that the use case you want geo-replication for?
[0:21] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:24] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[0:24] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[0:24] * `gregorg` (~Greg@ has joined #ceph
[0:28] <vjarjadian> having it as iSCSI would make it easy to fit onto ESXI or other hypervisors... but at the moment i don't use anything like that... my network is rather simple, apart from having multiple sites
[0:28] * n-other (c32fff42@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[0:28] * fc_ (~fc@home.ploup.net) Quit (Quit: leaving)
[0:29] * gregorg_taf (~Greg@ has joined #ceph
[0:29] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[0:31] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[0:33] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[0:33] <vjarjadian> in your case of m-n+1... what would cause Ceph to update the other blocks?
[0:34] <vjarjadian> would the other copies be dropped or something to enable self healing?
[0:34] * `gregorg` (~Greg@ has joined #ceph
[0:34] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[0:36] * occ (~onur@ Quit (Quit: Leaving.)
[0:38] * tezra (~Tecca@ has joined #ceph
[0:39] * gregorg_taf (~Greg@ has joined #ceph
[0:39] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[0:40] <sjustlaptop> it would be a (somewhat) simple extension of the current replication scheme
[0:40] <sjustlaptop> currently, any object is in a PG which is mapped to an ordered set of osds [a,b,c] whose size is the specified pool size
[0:40] <sjustlaptop> the first in the set is the primary, the rest are the replicas
[0:41] <sjustlaptop> when you write to an object, you write to the primary, which then replicates the write to the replicas, and, once all writes are complete, reports to the client that the write is complete
[0:41] <sjustlaptop> what I said above would allow you to say that only to complete writes are necessary (for a pool size of 3, for example)
[0:41] <sjustlaptop> *only two
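The object -> PG -> ordered OSD set mapping described above can be pictured with a toy placement function. This is a sketch under loud assumptions: it uses a rendezvous-style hash ranking as a stand-in and is NOT the CRUSH algorithm; `pg_for_object`, `acting_set`, the OSD names, and the use of SHA-256 are all illustrative choices, not anything Ceph does.

```python
import hashlib
from typing import List


def pg_for_object(name: str, pg_num: int) -> int:
    """Toy object -> placement group mapping (hash modulo pg_num)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


def acting_set(pg: int, osds: List[str], size: int) -> List[str]:
    """Toy PG -> ordered OSD list; NOT CRUSH, just a per-PG hash ranking."""
    ranked = sorted(osds, key=lambda o: hashlib.sha256(f"{pg}:{o}".encode()).digest())
    return ranked[:size]  # first entry plays the primary, the rest are replicas


if __name__ == "__main__":
    osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
    pg = pg_for_object("my-object", pg_num=64)
    primary, *replicas = acting_set(pg, osds, size=3)
    print(pg, primary, replicas)
```

The point of the sketch is only the shape of the result: a deterministic, ordered list whose length is the pool size and whose first member is the primary.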
[0:42] <yasu`> I'm also interested in the disaster recovery scenarios...
[0:42] <yasu`> Do we have an option to switch the primary OSD in the active set ?
[0:42] <sjustlaptop> not on purpose
[0:43] <sjustlaptop> but the crush rules allow you to define constraints
[0:43] * `gregorg` (~Greg@ has joined #ceph
[0:43] <yasu`> If we have that option, it would be really helpful for the distant site operation, I think
[0:43] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[0:43] <yasu`> I know > constraints in CRUSH
[0:45] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: Connection reset by peer)
[0:45] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[0:45] <vjarjadian> could you not much easier just tell ceph to say the write is complete once the primary is written... then no matter what the network topology is... it should replicate over WAN or LAN without a problem... unless that causes problems with multiple changes.
[0:46] <yasu`> +1 to vjarjadian, but that would be a significant design change
[0:47] <sjustlaptop> well, that means you have one copy between when the completion is sent to the client and when the replication completes
[0:47] <vjarjadian> really? would have thought that was easier than confirming all the writes to each copy...
[0:47] <sjustlaptop> vjarjadian: that part is for correctness
[0:47] <sjustlaptop> currently, a complete write means that all replicas have been written
[0:48] <rweeks> because it's designed for resliency
[0:48] <rweeks> er, but spelled right
[0:48] <vjarjadian> indeed.
[0:49] <yasu`> I read the papers and the current Ceph design thinks it's important that all writes are done to the disk
[0:49] <vjarjadian> but different people will need different levels of resilience.. depending on their situation
[0:49] <yasu`> before reporting the completion of the write to the client
[0:49] <yasu`> I agree, vjarjadian
[0:49] <vjarjadian> how long would it be before the system tried to replicate the primary after it had been written?
[0:51] <sjustlaptop> yasu`: it's more that the client should have visibility into the durability of the data
[0:51] <sjustlaptop> vjarjadian: no way to tell
[0:51] <yasu`> It depends on the configuration but the current Ceph architecture assumes some milliseconds
[0:51] <sjustlaptop> vjarjadian: if the replicas are actually dead, might be however long it takes for peering to complete
[0:51] <sjustlaptop> vjarjadian: actually, I think I misunderstood
[0:52] <sjustlaptop> what do you mea?
[0:52] <sjustlaptop> *you mean?
[0:52] <yasu`> maybe I am misunderstanding vjarjadian's question :)
[0:52] <vjarjadian> if the primary was written... how long would elapse before the cluster copied that to the location of the other replicas.
[0:53] <sjustlaptop> currently, it does the write to the primary in parallel with the writes to the replicas
[0:53] <sjustlaptop> so the client sends the write operation to the primary
[0:53] <sjustlaptop> the primary sends the replicated operation to the replicas and then applies the write locally
[0:54] <sjustlaptop> and once the local and remote writes are complete it tells the client
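The write path just described (client -> primary, primary -> replicas, acknowledge once the commits are in) can be modelled in a few lines. This is a deliberately simplified, sequential sketch, not how the OSD is implemented: `Osd`, `replicated_write`, and the `min_acks` parameter are hypothetical names, with `min_acks` standing in for the relaxed "N of M" setting floated earlier in the discussion. Real Ceph would also re-peer and change the acting set when a replica is down, which this model ignores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Osd:
    name: str
    up: bool = True
    store: Dict[str, str] = field(default_factory=dict)

    def apply(self, key: str, value: str) -> bool:
        """Persist the write if this OSD is reachable; report whether it committed."""
        if not self.up:
            return False
        self.store[key] = value
        return True


def replicated_write(acting: List[Osd], key: str, value: str,
                     min_acks: Optional[int] = None) -> bool:
    """Return True when enough members of the acting set committed the write.

    min_acks=None models today's behaviour (every member must commit);
    a smaller value models the hypothetical relaxed-durability knob.
    """
    primary, replicas = acting[0], acting[1:]
    if not primary.up:
        return False  # clients only ever talk to the primary
    required = len(acting) if min_acks is None else min_acks
    acks = sum(osd.apply(key, value) for osd in replicas)  # forward to replicas
    acks += primary.apply(key, value)                      # then apply locally
    return acks >= required


if __name__ == "__main__":
    a, b, c = Osd("osd.a"), Osd("osd.b"), Osd("osd.c")
    print(replicated_write([a, b, c], "obj", "v1"))              # all three commit
    c.up = False
    print(replicated_write([a, b, c], "obj", "v2"))              # strict mode fails
    print(replicated_write([a, b, c], "obj", "v3", min_acks=2))  # relaxed mode passes
```

In the relaxed case the down replica is left holding the old value, which is exactly the window of reduced durability the channel goes on to debate.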
[0:54] <yasu`> Figure 4 in the http://ceph.newdream.net/papers/weil-ceph-osdi06.pdf
[0:54] <vjarjadian> if you reordered it.... to have the primary written to and applied and then the client told and have the replication after that
[0:55] <sjustlaptop> well, like I said, it depends
[0:56] <sjustlaptop> if the replicas die between when the primary tells the client and when it attempts to write to the replicas
[0:56] <sjustlaptop> the write might only have one replica for 10s of seconds or longer
[0:58] <vjarjadian> indeed, for some that would be unacceptable... but for others it could be OK... the chances of a server outage happening in any given 10 seconds is very small... but if there were multiple replicas and one died, wouldn't the other replicas get written and that dead replica would then get pushed out of the cluster eventually
[0:58] <vjarjadian> since ceph is self healing
[0:59] <sjustlaptop> oh, you mean have the client re-replicate the write to the replicas?
[1:00] <sjustlaptop> that would open a whole new world or ordering problems
[1:00] <sjustlaptop> though there are systems that do that
[1:00] <sjustlaptop> *world of ordering problems
[1:00] <vjarjadian> couldnt ceph detect which version was newer?
[1:01] <sjustlaptop> not really, not if the primary is no longer around
[1:01] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[1:01] <vjarjadian> ah, you're saying the primary dies.. i was saying one of the replicas died
[1:01] <sjustlaptop> oh, as long as the primary survives, sure, it could re-replicate the write
[1:02] <vjarjadian> sooo complicated
[1:02] <sjustlaptop> that's the idea behind allowing the user to specify a smaller number of replicas to allow a write to complete
[1:02] <sjustlaptop> vjarjadian: yeah, fun isn't it :)
[1:02] <vjarjadian> yeah... i was thinking that doing replication after the primary would make things easier to implement rather than adding formulas
[1:03] * gregorg_taf (~Greg@ has joined #ceph
[1:03] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[1:04] <sjustlaptop> actually, doing the primary and replica writes in parallel does introduce a certain amount of complexity, but now that we have that stuff modifying to allow a smaller number of writes probably wouldn't be horrific
[1:04] <vjarjadian> fair enough
[1:04] <sjustlaptop> but it's not clear that it's enough for many use cases
[1:05] <vjarjadian> everyone has their own specific needs... meeting all of them will take much time
[1:06] <sjustlaptop> the ideal would be some form of basic support in rados that could be leveraged to implement several different flavors of replication
[1:06] <sjustlaptop> but we are still looking
[1:07] <vjarjadian> when ceph replicates... does it copy every replica from the primary or can it copy one replica to another?
[1:08] <vjarjadian> such as in the self healing process
[1:08] <sjustlaptop> for simplicity, the primary heals itself, and then it heals the replica
[1:08] <sjustlaptop> *replicas
[1:09] <vjarjadian> so the primary is the key to everything
[1:09] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:09] <sjustlaptop> yeah, but the replication scheme means that it's simple to shift primary-hood over to another osd when the primary dies
[1:09] <sjustlaptop> and any single osd will be primary for a bunch of pgs and replica for a bunch more
[1:10] <vjarjadian> because in my situation, the upload on the WAN on one of my sites is slow and i have it limited... so having the primary send its data multiple times to all the replicas isn't ideal
[1:10] <vjarjadian> would make anything with a primary on that server very slow
[1:11] <sjustlaptop> vjarjadian: yeah, that would be a bad case for this implementation
[1:11] <sjustlaptop> you could use crush to ensure that primaries are always elsewhere
[1:12] <vjarjadian> and if primary'hood was switched... theoretically a VM could be sending data over the WAN on a slow link instead of using a local OSD
[1:12] <vjarjadian> and then having the data sent back over that same WAN to the local OSD to then be read later LOL
[1:13] <sjustlaptop> yeah, it wouldn't work if you want vms to be able to do local reads at that site
[1:13] <vjarjadian> one would assume the reads would come off of the fastest OSD to respond, which should be the local one
[1:13] <sjustlaptop> no, reads are always served by the primary
[1:13] <sjustlaptop> mostly, kind of
[1:14] <vjarjadian> let me guess it's complex :)
[1:14] <sjustlaptop> in the case you are thinking of, it's always served by the primary
[1:14] <sjustlaptop> but we do have an escape hatch to allow you to read from replicas if the object is immutable
[1:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:14] <sjustlaptop> mostly for hadoop
[1:15] <sjustlaptop> that would be a difficult restriction to lift
[1:15] <vjarjadian> what about doing something with the IPs of the client and OSD... those are easy to find and if the client uses crush to locate all the OSDs with parts it needs, it could favour the ones on the local subnet
[1:16] <sjustlaptop> I think it actually does that to choose the replica to read from if it is using read from replica
[1:16] <sjustlaptop> but if the object is changing during the read, we don't currently support it
[1:16] <sjustlaptop> the concern is that you could do a read at the primary and get the new version and then do a read at the replica and get the old version
[1:17] <vjarjadian> i suppose some sort of copy on write would be good... but then some would have that in the client filesystem and the SAN...
[1:18] <vjarjadian> if every write creates a new primary then nothing can change... it can simply discard or not use the old versions
[1:18] <sjustlaptop> that's one approach
[1:18] <sjustlaptop> rgw uses a variant of that on top of rados objects
[1:18] <vjarjadian> or would that be completely and horrifically awful
[1:22] <sjustlaptop> vjarjadian: garbage collecting the old versions can be tricky, though there are interesting advantages
[1:23] <vjarjadian> when using iSCSI with ceph... what becomes the iSCSI target... one of the Ceph-mons?
[1:23] * `gregorg` (~Greg@ has joined #ceph
[1:23] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[1:23] <sjustlaptop> I don't know how that works
[1:23] <sjustlaptop> I don't think ceph has native support, really
[1:23] <sjustlaptop> there might be a translation layer somewhere
[1:23] <sjustlaptop> our block device is rados block device (rbd)
[1:24] <vjarjadian> i think so...
[1:24] <sjustlaptop> it's in kernel and can be used with qemu
[1:24] <vjarjadian> but if theres no single point of failure, it's interesting to think where you would point the iSCSI initiator
[1:24] * jlogan1 (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) has joined #ceph
[1:25] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) has joined #ceph
[1:25] <vjarjadian> garbage collection could maybe be run on one of the mons periodically... have it check for older versions in its list and then remove them... or keep them if the admin wants them there
[1:25] <sjustlaptop> don't want to saddle the mons with work proportional to the total number of writes in the entire cluster
[1:26] <vjarjadian> theres no reason your mon couldnt be on some nice Xeon and your OSDs could be on i3s or atom CPU systems....
[1:27] <sjustlaptop> for any xeon, there exists N such that a single xeon cannot keep up with N atoms worth of writes
[1:27] <vjarjadian> indeed
[1:28] <vjarjadian> but if you've used an Atom... you'll know it would take a lot of them to keep up with anything... even a cabbage stuck in the CPU socket of a powered motherboard
[1:28] * jlogan (~Thunderbi@2600:c00:3010:1:942b:403e:be8f:1b6c) Quit (Ping timeout: 480 seconds)
[1:31] <vjarjadian> garbage collection is only really important if the cluster is relatively small and has lots of writes... if they eventually made a new Ceph-gar to go next to the OSDs and MONs and Metadata servers... it could go through the list as a low priority if garbage wasn't an issue or as high as the admin wanted... knowing it would 'eventually' pickup all the garbage would be fine for admins with huge clusters... just those with smaller storage would need to g
[1:32] <vjarjadian> although thats probably going a bit beyond the next version of Ceph :)
[1:34] * gregorg_taf (~Greg@ has joined #ceph
[1:34] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[1:39] * `gregorg` (~Greg@ has joined #ceph
[1:39] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[1:43] <sjustlaptop> vjarjadian: sure, but it probably wouldn't be hard to implement a policy of "delete after 1 day" using the osds themselves
[1:43] <sjustlaptop> which would be better than an external service
[1:44] * darkfaded (~floh@ Quit (Quit: leaving)
[1:45] <vjarjadian> as long as the config was the same for all OSDs in the server... or resolved from some central policy
[1:45] * gregorg_taf (~Greg@ has joined #ceph
[1:45] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[1:45] <sjustlaptop> meh, you could do it on an object-by-object basis
[1:45] <sjustlaptop> the osds already scan their local stores during periods of low io to look for silent corruption
[1:45] <sjustlaptop> they could also look for expired objects
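A per-object expiry check piggybacked on a local scan, as sjustlaptop suggests, might look roughly like this. It is a sketch of the idea only: `scan_expired` and `purge` are hypothetical names, the object store is a plain dict, and nothing here reflects how scrubbing or object metadata actually work inside an OSD.

```python
import time
from typing import Dict, List, Optional


def scan_expired(expires_at: Dict[str, Optional[float]],
                 now: Optional[float] = None) -> List[str]:
    """Names of local objects whose per-object deadline has passed.

    A deadline of None means the object never expires.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, deadline in expires_at.items()
        if deadline is not None and deadline <= now
    )


def purge(store: Dict[str, bytes], expires_at: Dict[str, Optional[float]],
          now: Optional[float] = None) -> int:
    """Drop expired objects from this OSD's local store; return how many went."""
    victims = scan_expired(expires_at, now)
    for name in victims:
        store.pop(name, None)
        expires_at.pop(name, None)
    return len(victims)


if __name__ == "__main__":
    store = {"old": b"1", "keep": b"2", "new": b"3"}
    deadlines = {"old": 100.0, "keep": None, "new": 300.0}
    print(purge(store, deadlines, now=200.0), sorted(store))
```

Because each object carries its own deadline, every OSD can decide locally without a central policy service, which is the advantage claimed over an external garbage collector.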
[1:46] <vjarjadian> but if it's copy on write... it's possible the newer block is stored on a different set of OSDs if you have many
[1:46] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has joined #ceph
[1:46] <vjarjadian> so unless the OSDs contact every other OSD to compare blocks... that might not work on the OSDs
[1:47] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[1:48] * dmick (~Dan@cpe-76-87-42-76.socal.res.rr.com) has joined #ceph
[1:48] * ChanServ sets mode +o dmick
[1:48] <jmlowe1> sjustlaptop: is the scrubbing in 0.55.1?
[1:49] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[1:49] * gregorg_taf (~Greg@ has joined #ceph
[1:49] <jmlowe1> I should say is the data as well as metadata scrubbing in 0.55.1?
[1:49] <sjustlaptop> jmlowe1: yes
[1:49] <sjustlaptop> you'll see "deep" scrubs
[1:49] <sjustlaptop> that's the data scrubbing
[1:50] <jmlowe1> oh, nice "active+clean+scrubbing+deep"
[1:51] <sjustlaptop> there's a knob to set how often it should do deep scrubs
[1:52] <jmlowe1> I think I recall that from the blog post, wasn't sure if it had hit or not before the official bobtail release
[1:52] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Quit: Leaving.)
[1:52] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[1:53] <jmlowe1> does deep scrub attempt to replicate a good copy if there is a problem?
[1:54] <yasu`> had a guest, and been catching up with the nice discussion on the disaster recovery
[1:54] <yasu`> Ideally, CRUSH would have supported the more complicated replication policy
[1:56] <yasu`> Like you can ack the write if replication in DC1 is completed, and later replicate it into DC2.
[1:57] <yasu`> but as for the next Ceph, switching or pinning the (next) primary would help for the disaster recovery scenario
[1:58] <yasu`> DC-1 is down, but we have DC-2 and DC-3, so I'll take DC-2's OSD as the primary for this set of PGs
[1:59] <yasu`> I can tell we're gonna need a more complex policy soon, such that DC-2 and DC-3 are really close, so DC-2 should transfer the replica from DC-1 to DC-3, etc.
[1:59] <yasu`> but that's for research :)
[2:01] <sjustlaptop> jmlowe1: it has a repair feature, but currently, it just chooses whatever the primary has
[2:02] * `gregorg` (~Greg@ has joined #ceph
[2:02] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[2:02] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[2:02] <glowell> ls
[2:03] <jmlowe1> sjustlaptop: it follows then that you will soon pick the most recent version of the object that will scrub clean?
[2:04] <sjustlaptop> well, if it was silent corruption, both versions might be recent
[2:04] <sjustlaptop> we'd probably have the replicas vote if possible
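The "have the replicas vote" idea can be sketched as a strict-majority vote over content digests. This illustrates the proposal only; per the discussion, repair at this point simply takes whatever the primary has. `vote_on_copies` is a hypothetical name, and SHA-256 is an arbitrary choice of checksum for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def vote_on_copies(copies: Dict[str, bytes]) -> Optional[bytes]:
    """Return the content held by a strict majority of OSDs, or None if there is none."""
    if not copies:
        return None
    digests = {osd: hashlib.sha256(data).hexdigest() for osd, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None  # tie or plurality only: no safe automatic choice
    return next(data for osd, data in copies.items() if digests[osd] == winner)


if __name__ == "__main__":
    copies = {"osd.0": b"corrupt", "osd.1": b"payload", "osd.2": b"payload"}
    print(vote_on_copies(copies))
    print(vote_on_copies({"osd.0": b"x", "osd.1": b"y"}))
```

With two-way replication a disagreement yields no majority, which is why sjustlaptop qualifies the idea with "if possible".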
[2:05] * gregorg_taf (~Greg@ has joined #ceph
[2:05] <jmlowe1> <- thinks he has his silent corruption problem under control with btrfs data and metadata raid1
[2:05] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[2:05] * darkfader (~floh@ has joined #ceph
[2:06] <jmlowe1> occasionally slot 6 of one of my arrays likes to get creative with my bits
[2:18] * `gregorg` (~Greg@ has joined #ceph
[2:18] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[2:20] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[2:20] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[2:20] * yoshi_ (~yoshi@ Quit (Remote host closed the connection)
[2:25] * yasu` (~yasu`@dhcp-59-227.cse.ucsc.edu) Quit (Remote host closed the connection)
[2:25] <infernix> my clusters are coming
[2:26] <infernix> 2 datacenters, 6 boxes each, 12 OSD disks, 2x infiniband, onboard LSI 2208 and addon 9207-8i HBA
[2:26] * dmick1 (~dmick@2607:f298:a:607:7d94:5dcf:7854:9654) has joined #ceph
[2:27] <infernix> 108TB with 2 way replication
[2:28] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[2:32] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:32] <iggy> ipoib?
[2:34] * gregorg_taf (~Greg@ has joined #ceph
[2:34] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[2:34] <nhm> infernix: ah, interesting choice!
[2:35] <nhm> infernix: using the 9207 and 2208 at the same time?
[2:35] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) Quit (Quit: Leaving.)
[2:36] <infernix> now for a challenge, is the rbd kernel driver in any way backportable to the centos 6 kernel?
[2:37] <infernix> nhm: i'm using the 2208 for OS raid initially on 2 2.5" disks in the back of the chassis
[2:37] <infernix> but i will have the option to change stuff around should it be needed
[2:38] <infernix> two sff8087 connectors, one for those OS disks and one for the 12 disk SAS expander; and two SAS ports on each controller type
[2:39] * `gregorg` (~Greg@ has joined #ceph
[2:39] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[2:39] <infernix> for 200 bucks i opted to be flexible
[2:39] <infernix> iggy: yes, until rsockets shows up in the repo
[2:41] <nhm> infernix: I'll be curious how the expander backplane works for you. I went with the "A" backplane, so you can be my guinea pig. ;)
[2:47] * renzhi (~xp@ Quit (Ping timeout: 480 seconds)
[2:48] * gregorg_taf (~Greg@ has joined #ceph
[2:48] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[2:55] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) Quit (Read error: Connection reset by peer)
[2:55] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: Connection reset by peer)
[2:55] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[2:59] * `gregorg` (~Greg@ has joined #ceph
[2:59] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[2:59] <infernix> nhm: which backplane is that?
[2:59] * vjarjadian_ (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[2:59] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: Connection reset by peer)
[3:02] <infernix> i'm getting BPN-SAS2-826EL1 1 826 backplane with single LSI SAS2X28 expander chip
[3:03] <infernix> this box http://www.supermicro.com/products/chassis/2U/826/SC826BE16-R920LP.cfm?parts=SHOW
[3:04] <nhm> infernix: I've got the SC847a, which has BPN-SAS-826A and a BPN-SAS-846A
[3:07] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[3:08] <infernix> nhm: looks like a very different beast
[3:10] <nhm> infernix: it is, but the expanders are the same as the ones in the smaller chassis, so your tests on the expander chassis should give me some nice insights. ;)
[3:13] <infernix> nhm: what chip is on the A?
[3:18] * gregorg_taf (~Greg@ has joined #ceph
[3:18] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[3:24] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[3:25] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[3:29] * `gregorg` (~Greg@ has joined #ceph
[3:29] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[3:30] * jlogan1 (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) Quit (Ping timeout: 480 seconds)
[3:33] * The_Bishop (~bishop@2001:470:50b6:0:5965:e8be:8ad:d440) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[3:34] * gregorg (~Greg@ has joined #ceph
[3:34] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[3:41] <Psi-Jack> Heh
[3:41] <Psi-Jack> Annoying.
[3:41] <Psi-Jack> I can't get CephFS to mount at boot from it being in the fstab, on Ubuntu 12.04
[3:42] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[3:42] * deepsa (~deepsa@ has joined #ceph
[3:44] <iggy> you might have to let the mount scripts know it's a network fs
[3:45] <Psi-Jack> Well, I added the whole _netdev tag to it.
[3:46] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) Quit (Read error: Connection reset by peer)
[3:47] * gregorg_taf (~Greg@ has joined #ceph
[3:47] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[3:50] * gregorg (~Greg@ has joined #ceph
[3:50] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[3:50] <Psi-Jack> Yeah, and just did a full upgrade to it, still the same. ghrrr
[3:53] <Psi-Jack> Another reason I absolutely despise upstart. heh
[3:57] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[4:00] * gregorg_taf (~Greg@ has joined #ceph
[4:00] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[4:05] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[4:07] * tezra|work (~rolson@ has joined #ceph
[4:08] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[4:08] * gregorg (~Greg@ has joined #ceph
[4:11] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[4:13] * renzhi (~renzhi@ has joined #ceph
[4:14] * gregorg_taf (~Greg@ has joined #ceph
[4:14] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[4:16] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[4:19] * gregorg (~Greg@ has joined #ceph
[4:19] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[4:26] * gregorg_taf (~Greg@ has joined #ceph
[4:26] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[4:33] * gregorg (~Greg@ has joined #ceph
[4:33] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[4:42] * deepsa_ (~deepsa@ has joined #ceph
[4:43] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[4:43] * gregorg (~Greg@ has joined #ceph
[4:44] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[4:44] * deepsa_ is now known as deepsa
[4:55] * The_Bishop (~bishop@e177089054.adsl.alicedsl.de) has joined #ceph
[4:58] * gregorg_taf (~Greg@ has joined #ceph
[4:58] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[4:59] * deepsa_ (~deepsa@ has joined #ceph
[5:00] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[5:00] * deepsa_ is now known as deepsa
[5:04] * gregorg (~Greg@ has joined #ceph
[5:04] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[5:07] * gregorg_taf (~Greg@ has joined #ceph
[5:07] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[5:09] * vjarjadian_ (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: Connection reset by peer)
[5:09] * vjarjadian_ (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[5:11] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[5:16] * gregorg (~Greg@ has joined #ceph
[5:16] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[5:21] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) has joined #ceph
[5:23] * yeled (~yeled@spodder.com) Quit (Quit: meh..)
[5:24] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[5:31] * gregorg_taf (~Greg@ has joined #ceph
[5:31] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[5:38] * gregorg (~Greg@ has joined #ceph
[5:38] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[5:50] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:51] * gregorg_taf (~Greg@ has joined #ceph
[5:51] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[5:55] * gregorg (~Greg@ has joined #ceph
[5:55] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[6:04] * gregorg_taf (~Greg@ has joined #ceph
[6:04] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[6:07] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Read error: Connection reset by peer)
[6:09] * `gregorg` (~Greg@ has joined #ceph
[6:09] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[6:19] * gregorg_taf (~Greg@ has joined #ceph
[6:19] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[6:24] * `gregorg` (~Greg@ has joined #ceph
[6:24] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[6:24] * gaveen (~gaveen@ has joined #ceph
[6:32] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[6:34] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[6:35] * gregorg_taf (~Greg@ has joined #ceph
[6:35] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[6:39] * `gregorg` (~Greg@ has joined #ceph
[6:39] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[6:42] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[6:47] * gregorg_taf (~Greg@ has joined #ceph
[6:47] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[6:47] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[6:47] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[6:58] * `gregorg` (~Greg@ has joined #ceph
[6:58] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[7:03] * spacex (~spacex@user-24-214-57-166.knology.net) has joined #ceph
[7:04] * gregorg_taf (~Greg@ has joined #ceph
[7:04] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[7:04] <spacex> so, if this recovery that I have going on works, then I am going to be giving ceph some serious props
[7:07] <spacex> I had mon.0 crash with bad memory/cpu cache, osd.2 disk die, and osd.3 OOM-killed
[7:13] * `gregorg` (~Greg@ has joined #ceph
[7:13] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[7:16] * gregorg_taf (~Greg@ has joined #ceph
[7:16] * `gregorg` (~Greg@ Quit (Read error: Connection reset by peer)
[7:19] * lxo (~aoliva@lxo.user.oftc.net) Quit (Quit: later)
[7:19] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[7:19] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has left #ceph
[7:24] * vjarjadian_ (~IceChat7@5ad6d005.bb.sky.com) Quit (Ping timeout: 480 seconds)
[7:31] * deepsa_ (~deepsa@ has joined #ceph
[7:31] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[7:31] * deepsa_ is now known as deepsa
[7:43] <iggy> sounds like pretty much worst case scenario
[7:44] * IceGuest_75 (~IceChat7@buerogw01.ispgateway.de) has joined #ceph
[7:44] <IceGuest_75> good morning #ceph
[7:44] * IceGuest_75 is now known as norbi
[7:44] <norbi> can anybody tell me what "step chooseleaf" is? i can't find "chooseleaf" in the documentation.
[7:46] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[7:53] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[8:06] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[8:08] * Ryan_Lane (~Adium@210.sub-70-196-110.myvzw.com) has joined #ceph
[8:20] * SkyEye (~gaveen@ has joined #ceph
[8:22] * Ryan_Lane (~Adium@210.sub-70-196-110.myvzw.com) Quit (Read error: Connection reset by peer)
[8:23] * Ryan_Lane (~Adium@210.sub-70-196-110.myvzw.com) has joined #ceph
[8:27] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[8:29] * SkyEye is now known as gaveen
[8:29] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) Quit (Quit: slang)
[8:36] <spacex> iggy: yes ... fortunately i had set the replication on all 3 pools to 4 (with 4 osds total) prior to this happening
[8:37] <spacex> i'm happy to say it all recovered properly though :-)
[8:40] * gregorg (~Greg@ has joined #ceph
[8:40] * gregorg_taf (~Greg@ Quit (Read error: Connection reset by peer)
[8:42] * low (~low@ has joined #ceph
[8:46] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:47] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:49] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:50] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:52] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:52] * gregorg (~Greg@ Quit (Quit: Quitte)
[8:53] * gregorg (~Greg@ has joined #ceph
[8:53] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:53] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:54] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:54] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:54] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:54] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:54] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:55] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:56] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[8:56] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has joined #ceph
[8:57] * ranjansv (~ranjansv@kresge-37-61.resnet.ucsc.edu) has left #ceph
[9:01] * yoshi (~yoshi@ has joined #ceph
[9:01] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:02] * yoshi (~yoshi@ has joined #ceph
[9:04] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:04] * yoshi (~yoshi@ has joined #ceph
[9:10] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[9:10] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:12] * yoshi (~yoshi@ has joined #ceph
[9:13] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:13] * loicd (~loic@ has joined #ceph
[9:14] * yoshi (~yoshi@ has joined #ceph
[9:18] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[9:18] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:20] * yoshi (~yoshi@ has joined #ceph
[9:21] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[9:21] * yoshi (~yoshi@ has joined #ceph
[9:22] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[9:25] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[9:26] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:35] * deepsa (~deepsa@ has joined #ceph
[9:35] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[9:36] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[9:38] * ScOut3R_ (~ScOut3R@ has joined #ceph
[9:46] * Ryan_Lane (~Adium@210.sub-70-196-110.myvzw.com) Quit (Quit: Leaving.)
[9:48] * roald (~roaldvanl@ has joined #ceph
[9:51] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[9:54] * roald (~roaldvanl@ Quit (Quit: Leaving)
[9:58] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[10:01] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[10:02] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[10:17] * Leseb (~Leseb@2001:980:759b:1:9d5a:dd83:952c:d877) has joined #ceph
[10:19] * Leseb_ (~Leseb@ has joined #ceph
[10:25] * Leseb (~Leseb@2001:980:759b:1:9d5a:dd83:952c:d877) Quit (Ping timeout: 480 seconds)
[10:25] * Leseb_ is now known as Leseb
[10:32] * tziOm (~bjornar@ has joined #ceph
[10:37] * SkyEye (~gaveen@ has joined #ceph
[10:43] * fc (~fc@ has joined #ceph
[10:44] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[10:47] * yoshi_ (~yoshi@ has joined #ceph
[10:48] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[10:52] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) has joined #ceph
[10:57] <norbi> what is a blacklist entry in ceph osd ? i see it via "ceph osd dump 3081"
[10:57] <norbi> blacklist IPADDRESS::6809/3596 expires 2012-12-19 21:48:04.567420
[10:57] <norbi> ?
[10:59] * The_Bishop (~bishop@e177089054.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[11:08] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[11:13] * yoshi_ (~yoshi@ Quit (Remote host closed the connection)
[11:15] * yoshi (~yoshi@ has joined #ceph
[11:29] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[11:36] <ScOut3R_> norbi: regarding the performance issues you mentioned yesterday: yes, i'm also experiencing slow operation using replication between hosts. right now my best guess is the network connection between the nodes, but it is highly dependent on the setup
[11:38] <norbi> hm is there a way to tell ceph "use replication between hosts but use all OSDs on one host" ? think that must speed up the writes
[11:39] <norbi> so write a file with 1GB to 10 OSDs on one host and replicate the data to other hosts
[11:41] <ScOut3R_> if i'm correct ceph does that by default
[11:41] <ScOut3R_> it tries to evenly distribute data across the available osds
[11:43] <norbi> hm ok, think i need more hosts and channel bonding
[11:43] * silversu_ (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[11:46] <ScOut3R_> how fast is your connection between your hosts? are you using a dedicated cluster network?
[11:47] <norbi> 3 Hosts are using 1000mbit
[11:48] <norbi> but not using dedicated cluster network
[11:48] <norbi> the clients are virtualized with 800mbit
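As a rough sanity check on the link speeds norbi quotes, a line rate in megabits per second gives an upper bound on throughput in megabytes per second when divided by eight. This is only a sketch: the function name is illustrative, protocol overhead is ignored, and on a shared network replication traffic competes with client traffic, so real numbers will be lower.

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a link rate in megabits/s to megabytes/s (8 bits per byte)."""
    return mbit_per_s / 8.0


# Figures from the conversation: 1000 mbit hosts, 800 mbit virtualized clients.
host_ceiling = mbit_to_mbyte_per_s(1000)    # 125.0
client_ceiling = mbit_to_mbyte_per_s(800)   # 100.0

print(f"host link ceiling:   {host_ceiling:.1f} MB/s")
print(f"client link ceiling: {client_ceiling:.1f} MB/s")
```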
[11:49] <ScOut3R_> my bet is that a dedicated cluster network can make a huge difference
[11:50] <ScOut3R_> also make sure to check your hosts' I/O consumption, it might give you a hint on the bottlenecks
[11:50] <norbi> the hosts have raid1 with two ssds for journaling
[11:50] <ScOut3R_> my OSD hosts are choking on journal operations (right now a mdadm RAID1 SSD array is the journal device for the OSDs)
[11:51] <ScOut3R_> and 10 OSDs are using a RAID1 array on each host
[11:51] <norbi> and 8 and 10 disks, one disk per osd
[11:51] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Ping timeout: 480 seconds)
[11:51] <norbi> one host has only 3 disks and 3 osds :)
[11:53] <ScOut3R_> well, you can work around using weights
[11:59] <norbi> using this already :)
[11:59] <norbi> if i start rados bench locally on one ceph station i get about 120mb/s
[12:00] <norbi> but from remote client i get only max 20mb/s
[12:00] <norbi> but i will try the best to find the problem
[12:03] <ScOut3R_> well, check network utilization and i/o stats
[12:04] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[12:16] * mtk0 (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[12:21] * dxd828 (~dxd828@ has joined #ceph
[12:22] <dxd828> Hi all
[12:28] <dxd828> Don't know if anyone can help me on a small cluster design. We have a rack in two different locations, I want all data to be replicated between the two, so if one location was to go offline we would have the other one. But in the rack we have multiple machines and obviously don't want it to replicate all data between machines but distribute and "stripe" between the machines like raid. Anyone know how to do this?
[12:31] <ScOut3R_> dxd828: i guess you should start distributing data in your crushmap at the rack level
[12:31] <dxd828> I have added the machines into the racks on the crush map, and changed my data rule to: "step chooseleaf firstn 0 type rack" but how then do I get it to distribute inside the rack level?
[12:32] <dxd828> ScOut3R_, that seems to work btw, but there is only a single machine atm in each rack, will be adding some more soon
[12:33] <ScOut3R_> dxd828: you can define more than one step in your ruleset
[12:33] <ScOut3R_> that way you can distribute the data inside the rack between hosts, disks, etc
[12:37] <dxd828> ScOut3R_, right I see.. I'm assuming with the single rule currently it is just placing data on a single host in both racks
[12:38] <ScOut3R_> dxd828: correct
[12:39] <dxd828> ScOut3R_, if a single machine was to die, it would be the only one in that rack with the data on it.. but the data would be in the other location so it would rebuild when it came back up?
[12:39] <ScOut3R_> dxd828: yes
[12:40] <dxd828> ScOut3R_, great.. thanks for your help! :)
[12:40] <ScOut3R_> and i believe it would rebuild the data on another host in the same rack (where the original one died) after a few minutes if there's enough space
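The placement idea discussed above (one replica in each rack, with different objects landing on different hosts inside a rack) can be sketched with a toy model. This is not the CRUSH algorithm and does not reproduce what "step chooseleaf firstn 0 type rack" computes; the rack, host and OSD names are hypothetical and the hash-based choice is only a stand-in for CRUSH's pseudo-random selection.

```python
import hashlib

# Hypothetical two-rack layout; all names are made up for illustration.
CLUSTER = {
    "rack-a": {"host-a1": ["osd.0", "osd.1"], "host-a2": ["osd.2", "osd.3"]},
    "rack-b": {"host-b1": ["osd.4", "osd.5"], "host-b2": ["osd.6", "osd.7"]},
}


def _pick(items, key):
    """Deterministically pick one item from items using a hash of key."""
    ordered = sorted(items)
    digest = hashlib.sha256(key.encode()).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def place(obj_name, cluster=CLUSTER):
    """Return one (rack, host, osd) tuple per rack for the given object."""
    placement = []
    for rack in sorted(cluster):
        host = _pick(list(cluster[rack]), f"{obj_name}/{rack}")
        osd = _pick(cluster[rack][host], f"{obj_name}/{host}")
        placement.append((rack, host, osd))
    return placement


for name in ("object-1", "object-2", "object-3"):
    print(name, place(name))
```

Every object gets exactly one copy per rack, while the host and OSD holding that copy vary from object to object, which is the "replicate across racks, spread within a rack" behaviour dxd828 is asking for.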
[12:42] <dxd828> right ok, do you know if you can change the design of your crush map while the cluster is live and in use? like change the distribution from racks to rows for example
[12:42] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) Quit (Remote host closed the connection)
[12:42] <ScOut3R_> dxd828: of course :)
[12:50] <dxd828> Currently ceph -w shows 273GB available on my cluster, the two OSDs are on 137GB drives but it is replicating at host level.. So why is it not showing ~137GB?
[12:54] <ScOut3R_> dxd828: ceph always shows the raw storage space
[12:54] <ScOut3R_> also ceph -w shows the "clean" data and the raw used space
[12:56] <dxd828> ScOut3R_, ah is there any way I can see the space available for a pool?
[12:57] <ScOut3R_> dxd828: every pool has access to the whole storage space so what you see available with the ceph status command is available for the whole cluster; you can get a more detailed view by "rados df"
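The gap between the 273GB that ceph reports and the ~137GB dxd828 expected is just raw space divided by the replica count. A minimal sketch, assuming a single pool with two replicas and ignoring filesystem overhead, journals and full-ratio limits (the function name is illustrative):

```python
def usable_capacity_gb(raw_gb: float, replicas: int) -> float:
    """Estimate usable space when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_gb / replicas


raw = 273  # GB of raw space, as shown by ceph -w in the conversation
print(usable_capacity_gb(raw, 2))  # 136.5
```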
[13:11] * roald (~roaldvanl@ has joined #ceph
[13:18] * ScOut3R_ is now known as ScOut3R
[13:19] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[13:21] * deepsa (~deepsa@ has joined #ceph
[13:37] <norbi> oh it seems the problem is one virtual host
[13:55] <ScOut3R> so you've found the bottleneck?
[13:55] <norbi> yes
[13:55] <norbi> now testing with more clients parallel
[13:59] <norbi> 4 clients parallel 110mb/s
[13:59] <norbi> thats ok
[13:59] <norbi> :)
[14:03] <ScOut3R> hm
[14:03] <ScOut3R> now that's fast
[14:04] <ScOut3R> are you using rados bench for testing?
[14:04] <norbi> yes
[14:04] <norbi> rados bench 30 write -p test --no-cleanup
[14:05] <norbi> but i will test it now with rsync too
[14:06] <ScOut3R> great
[14:07] <norbi> bottleneck will be the network, need bonding :)
[14:07] <norbi> 10gig is too expensive
[14:07] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:09] <Psi-Jack> Heh, anyone here happen to know how to get a CephFS mount to be mounted and ready, before any other services start up, on Ubuntu 12.04.1?
[14:10] <Psi-Jack> I managed to get it to mount, during bootup, using my own custom mount upstart definition, but it takes like 30~50 seconds to actually do the mount itself, and this is done after relevant services that need it are already started up and system is ready for login.
[14:11] <norbi> just do a "sleep" in the bootscripts that needs the mount ?
[14:11] <norbi> test mount, and if not sleep a while
[14:12] <Psi-Jack> heh.
[14:12] <norbi> but thats not perfect :)
[14:12] <Psi-Jack> No.. No it's not..
[14:12] <Psi-Jack> I mean, heck, if I was going to go that route, I'd run it under pacemaker. heh
[14:12] <Psi-Jack> And put a co-location constraint on it. :)
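norbi's "test mount, and if not sleep a while" suggestion amounts to polling with a timeout. A minimal sketch of that workaround follows; it is not an upstart or pacemaker solution, the helper name and its parameters are made up for illustration, and the mount point in the usage comment is hypothetical.

```python
import os
import time


def wait_for_mount(path, timeout=120.0, interval=2.0,
                   is_mounted=os.path.ismount,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until `path` is a mount point; return False if `timeout` expires."""
    deadline = clock() + timeout
    while True:
        if is_mounted(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example use in a service start script (path is hypothetical):
#
#     if not wait_for_mount("/mnt/cephfs"):
#         raise SystemExit("CephFS mount did not appear in time")
```

The check and sleep functions are parameters so the loop can be exercised without a real mount. As the channel notes, this only papers over the ordering problem rather than fixing it.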
[14:14] <ScOut3R> norbi: will you please post your rsync results too? i'm interested :)
[14:15] <norbi> one moment
[14:15] <norbi> 4294967296 100% 36.56MB/s 0:01:52
[14:15] <norbi> 4294967296 100% 37.22MB/s 0:01:50
[14:15] <norbi> two clients at the same time
[14:16] <ScOut3R> hm
[14:16] <ScOut3R> thanks
[14:16] <norbi> start a test with 3
[14:17] * Lea (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[14:17] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) Quit (Read error: Connection reset by peer)
[14:18] <renzhi> I'm trying to create new cluster, and mkcephfs threw this error:
[14:18] <renzhi> root@s1:/etc/ceph# mkcephfs -a -c /etc/ceph/ceph.conf -k client.admin.keyring
[14:18] <renzhi> temp dir is /tmp/mkcephfs.5HisHM1vEr
[14:18] <renzhi> preparing monmap in /tmp/mkcephfs.5HisHM1vEr/monmap
[14:18] <renzhi> /usr/bin/monmaptool --create --clobber --add a --add b --add c --print /tmp/mkcephfs.5HisHM1vEr/monmap
[14:18] <renzhi> /usr/bin/monmaptool: monmap file /tmp/mkcephfs.5HisHM1vEr/monmap
[14:18] <renzhi> /usr/bin/monmaptool: generated fsid 26569bcb-3867-449f-be52-b31a904fa961
[14:18] <renzhi> /usr/bin/monmaptool: map already contains
[14:18] <renzhi> usage: [--print] [--create [--clobber][--fsid uuid]] [--generate] [--set-initial-members] [--add name] [--rm name] <mapfilename>
[14:18] <renzhi> what does that mean?
[14:18] <renzhi> ceph 0.55.1
[14:20] <norbi> do you have two ips in your config ?
[14:21] <norbi> rsync client 1: 4294967296 100% 44.25MB/s 0:01:32
[14:21] <roald> renzhi, --add a --add b <-- check your config
[14:21] <norbi> rsync client 2: 4294967296 100% 31.03MB/s 0:02:11
[14:21] <norbi> rsync client 3: 4294967296 100% 29.88MB/s 0:02:17
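To compare the rsync runs with the earlier rados bench figure, the per-client rates can be summed into an aggregate. The sketch below also cross-checks one rate from the byte count and elapsed time, assuming rsync's "MB" means 2^20 bytes, which is consistent with the numbers pasted above; the function names are illustrative.

```python
def aggregate_rate(per_client_rates):
    """Sum per-client throughput figures (same unit in, same unit out)."""
    return sum(per_client_rates)


def rate_from_elapsed(num_bytes, elapsed_seconds):
    """Average throughput in MiB/s for a transfer of num_bytes."""
    return num_bytes / elapsed_seconds / 2**20


two_clients = aggregate_rate([36.56, 37.22])
three_clients = aggregate_rate([44.25, 31.03, 29.88])
print(f"2 clients: {two_clients:.2f} MB/s total")
print(f"3 clients: {three_clients:.2f} MB/s total")

# Cross-check the first run: 4294967296 bytes in 1 min 52 s.
print(f"first run from elapsed time: {rate_from_elapsed(4294967296, 112):.2f} MiB/s")
```

The elapsed-time figure differs slightly from rsync's own number because the displayed time is rounded to whole seconds.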
[14:22] <renzhi> obviously, thanks, too many no-sleep nights will do that to you.
[14:24] <ScOut3R> norbi: that's pretty good, i can get only around 30MB/s with 1 client, i blame the network :)
[14:25] <Psi-Jack> heh, heck, I get around 80 MB/s
[14:26] * roald_ (~roaldvanl@139-63-12-64.nodes.tno.nl) has joined #ceph
[14:26] <norbi> with 2 replicas and only 1000mbit and 3 ceph nodes ?
[14:26] <Psi-Jack> 100mbit? No.
[14:26] <Psi-Jack> 1000mbit. :p
[14:26] <norbi> reading ;)
[14:26] <norbi> "and only 1000mbit " :)
[14:26] <Psi-Jack> But, yeah, 2 replicas, 1000mbit, 3 ceph nodes.
[14:26] <norbi> 80mb with only one client at a time
[14:26] <norbi> ?
[14:27] <Psi-Jack> Nope.
[14:27] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) has joined #ceph
[14:27] <Psi-Jack> That was with two bonnie++ running at the same time, on two different machines.
[14:27] * tryggvil_ (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[14:28] <Psi-Jack> My XFS logdev journal and ceph osd journal are all done via SSD, too, so that helps. ;)
[14:29] * noob21 (~noob2@ext.cscinfo.com) has joined #ceph
[14:29] * roald_ (~roaldvanl@139-63-12-64.nodes.tno.nl) Quit ()
[14:30] <norbi> logdev on ssd... must try that :)
[14:30] <norbi> osd journal is on ssd
[14:30] <ScOut3R> norbi: 3 osd hosts, 1 client, 2 replicas, only 30 MB/s, using a 1Gbps shared network
[14:31] <noob21> could you split your replication and public traffic ? ceph has that ability
[14:31] <Psi-Jack> Ahh, and I have my SAN-only netwirj.
[14:31] <noob21> i heard that helps the situation
[14:31] <Psi-Jack> network*
[14:31] <noob21> exactly
[14:31] <noob21> Psi-Jack: how much did the san only network help things?
[14:31] <Psi-Jack> I don't expose my mon's to the public though.
[14:31] <noob21> i'm planning on setting up the same thing
[14:32] <Psi-Jack> noob21: I've always had it, so, all I can say is, when you do have your own dedicated network for the storage cluster, you are majorly reducing the LAN noise and bandwidth. Especially on consumer hardware if that's all you have, because with consumer hardware switches you will only get a maximum bandwidth utilization of about 30~50%.
[14:32] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Ping timeout: 480 seconds)
[14:32] * tryggvil_ is now known as tryggvil
[14:32] <noob21> right
[14:33] * roald (~roaldvanl@ Quit (Ping timeout: 480 seconds)
[14:33] <noob21> yeah i expect a large improvement. i'm going to 802.3 aggregate the links for san network also. that'll bump up the speed a little more
[14:33] <Psi-Jack> Sure, that switch may have 24 ports.. But you only get about 25% of the bandwidth if you have all 24 ports plugged with 1Gbit devices. ;)
[14:33] <noob21> exactly
[14:33] <Psi-Jack> Heh
[14:33] <noob21> they cheap out on the backplane
[14:33] <Psi-Jack> Yep.
[14:34] <Psi-Jack> My SAN network has 2x1Gbit on it, but the secondary is a failover at the moment.
[14:34] <noob21> oh ok
[14:34] <noob21> lacp it up :D
[14:35] <Psi-Jack> The reliability factor of it's more than enough for my current needs.
[14:35] <noob21> yeah
[14:35] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[14:35] * noob2 (~noob2@pool-71-244-111-36.phlapa.fios.verizon.net) Quit (Ping timeout: 480 seconds)
[14:38] <Psi-Jack> Only issue I'm having problems with, currently, is mount.ceph during boot. ;)
[14:38] <elder> nhm, are you around?
[14:39] <norbi> with bonnie++ running i see that the "problem" is my network... but thats easy to fix :D
[14:39] <noob21> psi-jack: yeah i know. i wrote a python script to do that for me
[14:40] <Psi-Jack> norbi: Could be. Dunno how you'd determine that with bonnie++ though. It's just a disk benchmark. Not a network benchmark. :p
[14:40] <Psi-Jack> noob21: A python script? I mount ceph? Makes no sense at all. :p
[14:40] <noob21> lol
[14:40] <norbi> mount.ceph and then do bonnie++ -d /path/to/cephmount :)
[14:40] <noob21> yeah on boot it checks a config file and mounts what should be there in the correct order
[14:41] <noob21> psi-Jack i'm talking about the rbd devices not the ceph filesystem
[14:43] <ScOut3R> guys, could you recommend a good HBA to use for OSDs? my candidates now are: LSI9207-8i, Areca 1320-8i or Supermicro AOC-S2308L-L8e (based on LSI 2308 controller); i don't need RAID functions, just a pure HBA, because i would like to access my disks as JBOD
[14:46] <noob21> i'd check out the blog post on ceph's website about the hba's
[14:46] <noob21> there's good benchmarks there
[14:46] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[14:46] <ScOut3R> noob21: yes, i did that already but somehow i'm still not sure
[14:47] <noob21> what do you expect your workload to look like?
[14:48] * deepsa (~deepsa@ has joined #ceph
[14:49] <ScOut3R> noob21: i'm planning to run xen domU's off from the cluster using rbd; the cluster will have a 10Gbps cluster network and each OSD host will have a 1Gbps to the dom0s
[14:49] <todin> ScOut3R: I have the lsi sas 9201-16i hba
[14:49] <ScOut3R> so the workload will be very moxed
[14:49] <ScOut3R> mixed*
[14:50] <noob21> i see
[14:50] <noob21> are you going sata's for your drives or sas ?
[14:51] <noob21> or a mix of ssd in there?
[14:53] <ScOut3R> noob21: i'm planning to use 2TB SATA disks for object storage and put the OSD journals on 4 SSDs running in an mdadm RAID10 array
[14:53] <ScOut3R> also i'm planning to run 32 SATA HDDs in one host
[14:53] <noob21> sounds pretty speedy to me
[14:53] <noob21> esp if your card supports jbod mode
[14:54] <ScOut3R> yes, that's why i'm afraid to decide on a specific card
[14:54] <noob21> you can tune how often the journal flushes to the drives depending on your workload
[14:54] <noob21> well i'm not sure the card matters all THAT much in the grand scheme of things with ceph
[14:54] <ScOut3R> but i think i'll go with the Supermicro card, because the chassis will be Supermicro too
[14:54] <noob21> you're striping your rbd's across many hosts
[14:54] <ScOut3R> and of course the motherboard also
[14:54] <noob21> right
[14:54] <ScOut3R> i'll have only 2 hosts in the beginning
[14:55] <noob21> i've heard some people on here mention they have really low end raid cards and get amazing performance
[14:55] <norbi> you will have 32 OSDs then on this host ?
[14:55] <ScOut3R> yes
[14:55] <ScOut3R> and 32 in the other
[14:55] <norbi> and only 1gbps network ?
[14:55] <ScOut3R> in the beginning it'll probably be just 16-16
[14:55] * akarpo (~textual@dh146.citi.umich.edu) has joined #ceph
[14:55] <noob21> can you afford to get a 3rd host for quorum ?
[14:55] <ScOut3R> 1Gbps to the clients
[14:56] <ScOut3R> but 10Gbps in between the OSD hosts
[14:56] <norbi> oh ok
[14:56] <noob21> yeah i think he'll be ok
[14:56] <ScOut3R> but i'm planning to expand the client side network to 10Gbps also in a few months after going live
[14:56] <noob21> nice :)
[14:56] <ScOut3R> and of course i'd like to get a 3rd host, it'll depend on the budget
[14:57] <ScOut3R> ah yes, one more thing, i'm planning 2 HBAs per host, running multipath i/o
[14:57] <norbi> i will see benchmarks then ;)
[14:57] <ScOut3R> sorry, i have to go to a meeting, be back in a few minutes
[14:57] <noob21> i see
[14:58] <noob21> ok np
[15:06] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:09] * norbi (~IceChat7@buerogw01.ispgateway.de) Quit (Quit: Do fish get thirsty?)
[15:12] * renzhi (~renzhi@ Quit (Quit: Leaving)
[15:17] <ScOut3R> back
[15:18] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[15:21] <ScOut3R> so as you see i tried to think about everything, but you can never know :)
[15:31] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) has joined #ceph
[15:38] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[15:40] * nhorman (~nhorman@ has joined #ceph
[15:46] * slang (~slang@cpe-66-91-114-250.hawaii.res.rr.com) has joined #ceph
[15:48] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[15:52] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[16:04] * xiaoxi (~xiaoxiche@ has joined #ceph
[16:06] <xiaoxi> Hi, I have seen a serious BTRFS fragmentation problem, which leads to a significant performance drop. Is it a known issue?
[16:08] <xiaoxi> I created a 30Gb rbd volume (say volume A) on my ceph cluster (with only 2 OSDs, each with a sata disk, btrfs is used), DD the volume
[16:08] * tezra (~Tecca@ Quit (Read error: Connection reset by peer)
[16:08] <xiaoxi> and use AIO to do sequential read on it,I get ~30MB/s
[16:09] <noob21> yeah btrfs is brutally slow for me also
[16:09] <xiaoxi> then I do some random writes on it... after the random writes, sequential read performance drops to 6MB/s
[16:09] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[16:09] <xiaoxi> Well, the problem is not whether it's faster, but that its performance degrades with time
[16:09] <noob21> you might need more osd's
[16:09] <noob21> oh
[16:09] <noob21> yes i've noticed that also
[16:10] <noob21> i used xfs instead and that has fixed things
[16:11] <xiaoxi> yes, I will try BTRFS later, but since the Ceph guys strongly recommend BTRFS, I would like to know whether it's a known issue, and is there any solution for it?
[16:11] <wer> I didn't think btrfs was being recommended for production yet.
[16:13] <noob21> exactly
[16:13] <noob21> they recommend xfs for production until btrfs works out the bugs
[16:13] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:15] <xiaoxi> OK, some benchmarks from ceph.com say BTRFS has the best performance, and indeed it **seems** to have some advantages (such as parallel journaling).
[16:23] * dosaboy (~gizmo@host86-164-220-50.range86-164.btcentralplus.com) has joined #ceph
[16:23] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:28] <noob21> they'll eventually work out the bugs now that oracle and suse are pushing it as stable
[16:35] <xiaoxi> Is there any material about how the BTRFS guys plan to deal with this issue?
[16:35] <xiaoxi> since BTRFS is a journal-like filesystem, it looks to me like fragmentation is a property rather than a bug.
[16:38] <jmlowe1> xiaoxi: isn't there an auto background defrag option for btrfs?
[16:38] * scalability-junk (~stp@dslb-084-056-033-228.pools.arcor-ip.net) has joined #ceph
[16:40] <xiaoxi> jmlowe1: there is, but it doesn't work well.
[16:40] <xiaoxi> Even if I defrag it manually (btrfs filesystem defragment somefile)
[16:41] <jmlowe1> xiaoxi: good to know, haven't tried defraging myself
[16:42] <xiaoxi> Another test I have done: take a big file (1GB) and fragment it (do a lot of random writes on it). Sequential read drops to 0.8MB/s after the random writes.
[16:43] <xiaoxi> Then I defragmented the file manually; after the defragmentation, 12MB/s was seen.
[16:43] <xiaoxi> But that is still far from what a sata disk can offer
[16:44] <xiaoxi> and seeks/sec remain ~180... which means it still has a lot of fragmentation.
[16:57] * allsystemsarego (~allsystem@ has joined #ceph
[17:00] * sagelap (~sage@212.sub-70-197-144.myvzw.com) has joined #ceph
[17:01] * roald (~roaldvanl@ has joined #ceph
[17:02] * jlogan1 (~Thunderbi@2600:c00:3010:1:81cb:3b44:21fa:f590) has joined #ceph
[17:08] <nhm> xiaoxi: heya, I ran the benchmarks on our blog. BTRFS performance very much degrades over time on older kernels.
[17:09] <jmlowe1> what do you consider older?
[17:09] <nhm> jmlowe1: it was still happening with 3.4. I don't know how bad it is on kernels newer than that.
[17:10] <nhm> jmlowe1: increasing -l/-n in 3.4+ seems to help, but I don't have any hard data.
[17:10] <nhm> jmlowe1: I've got plans to do an article about it but it's probably a month or two off at least.
[17:11] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[17:11] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has left #ceph
[17:13] <nhm> xiaoxi: in the 2nd article I do mention in the conclusion that small IO performance does tend to degrade on BTRFS. I mention it again in the new bobtail vs argonaut article, but I should probably be more explicit about it.
[17:13] * SkyEye (~gaveen@ Quit (Ping timeout: 480 seconds)
[17:15] <xiaoxi> It's not that small IO performance tends to degrade, but that small updates tend to degrade sequential performance, I think
[17:18] * low (~low@ Quit (Quit: bbl)
[17:18] * xiaoxi (~xiaoxiche@ Quit ()
[17:19] * joao changes topic to 'v0.55 has been released -- http://goo.gl/r6OG1 || argonaut vs bobtail performance preview -- http://goo.gl/Ya8lU'
[17:20] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:22] * SkyEye (~gaveen@ has joined #ceph
[17:23] * sagelap (~sage@212.sub-70-197-144.myvzw.com) Quit (Ping timeout: 480 seconds)
[17:23] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[17:23] * loicd (~loic@magenta.dachary.org) has joined #ceph
[17:23] <nhm> joao: ooh, thanks. :)
[17:23] <joao> it was my pleasure
[17:24] <joao> also, your /. post is hilarious
[17:24] * dosaboy (~gizmo@host86-164-220-50.range86-164.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[17:25] <joao> " I’ve heard rumors that Unity can now plant subliminal messages in your dreams. How am I supposed to fight when I can’t even sleep?"
[17:25] <joao> rofl
[17:27] * dosaboy (~gizmo@host86-164-143-213.range86-164.btcentralplus.com) has joined #ceph
[17:35] <noob21> jmlowe1: did you use the deadline or the noop scheduler with your ssd's? just curious
[17:38] * tezra (~Tecca@ has joined #ceph
[17:38] <nhm> joao: glad you liked it. :)
[17:39] <nhm> noob21: thanks for reminding me, that's one of the things I should be testing in the future.
[17:41] <mikedawson> nhm: Great blog post
[17:42] <nhm> mikedawson: thanks!
[17:43] <mikedawson> nhm: need to bench smallio (lots of ~13KB writes). Is this reasonable? rados bench 60 write -t 32 -b 13312 -p rbd
[17:45] <nhm> mikedawson: rados bench will test the throughput of creating a bunch of 13k objects. The behavior will be somewhat different compared to what RBD does.
[17:46] <mikedawson> does rados bench take the replication of the pool into play, or is that another hit above what it reports?
[17:46] <nhm> RBD will create (by default) a bunch of 4MB objects and then do small reads/writes into them. Also, if you are using the qemu stuff, you'll have RBD caching too.
[17:48] <mikedawson> nhm: I'm working on a video storage project. 683 cameras, 14VMs running recording software. Total camera bandwidth is about ~350Mbps. Writes to disk are currently in the 13KB size
[17:48] <nhm> Basically that means that in some cases you could get better performance (local caching) and some cases worse (say all of your 13k writes all happen to hit the same 4MB object on a single OSD).
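nhm's point about small writes landing inside 4MB objects can be made concrete by mapping a byte offset in the image to an object index. This is a sketch only: it assumes the default 4MB object size with no custom striping, uses the 13312-byte write size from mikedawson's rados bench command, and ignores caching. The function name is illustrative, not an RBD API.

```python
OBJECT_SIZE = 4 * 1024 * 1024   # default RBD object size mentioned above
WRITE_SIZE = 13312              # bytes, from the rados bench -b argument


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Return the indices of the objects covered by a write at `offset`."""
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


# Sequential 13312-byte writes starting at offset 0:
whole_writes_per_object = OBJECT_SIZE // WRITE_SIZE
print(whole_writes_per_object)                        # 315
print(objects_touched(0, WRITE_SIZE))                 # [0]
print(objects_touched(315 * WRITE_SIZE, WRITE_SIZE))  # [0, 1]
```

With a sequential writer, roughly 315 consecutive writes hit the same object (and so the same primary OSD) before moving on, which is the worst case nhm describes.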
[17:50] <mikedawson> wondering what the best way is to benchmark performance of my current set of gear (12 nodes, each with 1 ssd and 3 7200rpm sata osd drives, dual 1GigE)
[17:50] <nhm> mikedawson: coincidentally another thing on my list coming up is taking a look at RBD performance.
[17:50] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[17:51] <mikedawson> nhm: if you'd guide me on methodology, I'd be happy to help
[17:51] <jmlowe1> noob21: don't have ssd's
[17:51] <nhm> mikedawson: you might be able to use smalliobench to simulate an RBD workload, but I think I'd try starting out with a bunch of fio or iozone writers or something doing 13k writes to your rbd layer.
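When sizing fio or iozone writers for such a test, the figures mikedawson gave (about 350Mbps in total from 683 cameras, written in ~13KB chunks by 14 VMs) translate into a target write rate. A back-of-the-envelope sketch, assuming decimal megabits, 13312-byte writes, and no replication or journal overhead; the function names are illustrative.

```python
def writes_per_second(total_mbps, write_bytes):
    """Client-side writes/s needed to sustain total_mbps (decimal megabits)."""
    bytes_per_second = total_mbps * 1_000_000 / 8
    return bytes_per_second / write_bytes


def per_source_mbps(total_mbps, sources):
    """Average bandwidth per source, e.g. per camera."""
    return total_mbps / sources


total_wps = writes_per_second(350, 13312)
print(f"total:      {total_wps:.0f} writes/s")
print(f"per VM:     {total_wps / 14:.0f} writes/s")
print(f"per camera: {per_source_mbps(350, 683):.3f} Mbps")
```

That works out to roughly 3,300 small writes per second across the cluster before replication, which gives a starting point for the number of concurrent writers in a benchmark.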
[17:53] <mikedawson> where do i get smalliobench?
[17:54] <nhm> mikedawson: if you are using ubuntu, there is a ceph-tools package I think.
[17:54] <noob21> nhm: no problem :)
[17:55] <noob21> jmlowe1: maybe i'm mistaking you for someone else who has some ssd's going on
[17:57] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[17:58] <mikedawson> nhm: can't find ceph-tools
[17:59] <vjarjadian> hi
[17:59] <nhm> mikedawson: hrm, I might be misremembering
[17:59] <mikedawson> Ubuntu 12.10 with Ceph Testing repo
[17:59] <nhm> ceph-test
[18:00] <mikedawson> E: Unable to locate package ceph-test
[18:00] <mikedawson> hrm
[18:00] <nhm> theoretically. I'm basing this on a mailing list post that Gary sent out on 11/01. :)
[18:00] <nhm> ooops, internal dev list.
[18:01] <mikedawson> i see some mention of ceph-client-tools
[18:01] <mikedawson> but it won't install
[18:01] <mikedawson> root@node1:~# apt-get install -V ceph-client-tools
[18:01] <mikedawson> Reading package lists... Done
[18:01] <mikedawson> Building dependency tree
[18:01] <mikedawson> Reading state information... Done
[18:01] <nhm> mikedawson: I'm wondering if it never actually got setup in gitbuilder.
[18:01] <mikedawson> Package ceph-client-tools is not available, but is referred to by another package.
[18:01] <mikedawson> This may mean that the package is missing, has been obsoleted, or
[18:01] <mikedawson> is only available from another source
[18:01] <mikedawson> However the following packages replace it:
[18:01] <mikedawson> ceph-common:i386 ceph-common ceph-fs-common:i386 ceph-fs-common
[18:01] <mikedawson> E: Package 'ceph-client-tools' has no installation candidate
[18:02] <nhm> pinging our packaging guy
[18:03] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) Quit (Read error: No route to host)
[18:04] <mikedawson> nhm: can't even find it on github
[18:05] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:05] <nhm> mikedawson: looks like it's in master and next on our gitbuilder repo, but not in stable or testing.
[18:07] <mikedawson> nhm: is there something similar to echo deb http://ceph.com/debian-testing/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list I can do to add next or master repos?
[18:07] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[18:07] <nhm> mikedawson: look at the "add development testing packages" section here: http://ceph.com/docs/master/install/debian/
[18:08] <mikedawson> thx
[18:08] <nhm> for branch you'll want next or master, depending on just how bleeding edge you like to be. ;)
[18:09] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:11] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[18:11] <mikedawson> nhm: looks like it isn't building for quantal currently
[18:12] <nhm> mikedawson: doh
[18:13] * SkyEye (~gaveen@ Quit (Remote host closed the connection)
[18:14] <mikedawson> Is it safe to assume smalliobench will be packaged in bobtail?
[18:14] <wer> How do I tell when things are getting full with ceph? Or totals per pool or bucket or whatever? Or highest osd usage?
[18:17] <noob21> i hit refresh and lo and behold a new blog post :D
[18:17] <wer> I had a node that quit writing because one osd was at 96%... the others were at 46%.... so how can I tell when a given OSD is about to get full when I have 96 of them distributed across multiple nodes?
[18:18] <noob21> maybe a cron job to monitor df ?
[18:19] <wer> yeah. See that sucks.
[18:19] <noob21> how many PGs do you have?
[18:19] <noob21> i agree
[18:19] <wer> I have no idea noob21 :)
[18:19] <noob21> lol
[18:19] <noob21> lemme find the command
[18:19] <wer> I have done nothing with pgs. k.
[18:19] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[18:20] <noob21> ceph osd pool get {pool-name} pg-num
[18:20] <noob21> that might be why it's not balancing properly
[18:20] <noob21> ceph recommends (OSDs*100)/replicas
[18:21] <nhm> noob21: only thing I'd add is to keep it to a power-of-2.
[18:22] <nhm> so round up or down slightly depending on what you end up with.
[18:22] <noob21> right
[18:22] <noob21> i forgot to mention that
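The rule of thumb quoted above, (OSDs*100)/replicas rounded to a nearby power of two, can be sketched in Python. The function name and the choice to round to the nearest power (rather than always up or always down) are illustrative assumptions, not something Ceph provides.

```python
def suggested_pg_count(num_osds, replicas, pgs_per_osd=100):
    """Apply (OSDs * 100) / replicas, then snap to the nearest power of two."""
    target = num_osds * pgs_per_osd / replicas
    # Largest power of two that does not exceed the target (at least 1).
    lower = 1 << (max(int(target), 1).bit_length() - 1)
    upper = lower * 2
    # Pick whichever neighbouring power of two is closer to the target.
    return lower if target - lower < upper - target else upper


print(suggested_pg_count(96, 2))  # 96 OSDs, 2 replicas
print(suggested_pg_count(2, 2))   # 2 OSDs, 2 replicas
```

Since the PG count of a pool cannot be changed after creation in these releases, it may be worth sizing for the expected number of OSDs rather than the current one.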
[18:23] <wer> hmmmm. do I think I have 11 pools?
[18:23] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[18:23] <wer> s/do/so/ :)
[18:23] <dxd828> noob21, so if you have two osd's you would want 100 pgs
[18:23] <noob21> yeah roughly
[18:24] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[18:24] <noob21> 128 probably so it's a power of 2
[18:24] * The_Bishop_ (~bishop@e179001196.adsl.alicedsl.de) has joined #ceph
[18:24] <dxd828> noob21, can you change the amount live when you add more osds?
[18:24] <noob21> i don't know. that's what i'm wondering also
[18:25] <noob21> nhm?
[18:25] <wer> So I have not even begun to look at any of that stuff.... Just been trying to get a stable multinode install going. Now I need to understand all this pg mess and how to determine health other than waiting for something to happen that will break writes. Getting status is pretty hard seemingly.
[18:25] <nhm> With 2 OSDs, I'd do 256 PGs
[18:25] <noob21> ceph -s is the status?
[18:26] <wer> noob21: yeah but that is very high level. There could be an impending problem that wouldn't show up in status.
[18:26] <nhm> though it doesn't really matter that much honestly. :)
[18:26] <wer> Such as an imbalanced osd :)
[18:26] <noob21> true
[18:26] <noob21> some more detail is in order
[18:27] * gaveen (~gaveen@ has joined #ceph
[18:27] <wer> yeah, if I have to monitor all 96 osd's myself, then this is a no go for me.
[18:27] <noob21> let me see how my cluster is balancing. 1 sec
[18:27] <dxd828> are there any good commands to view how much data is distributed and where?
[18:28] <noob21> i'd say it's roughly even across all osd's. about 2% variance
[18:28] <wer> noob21: until it isn't.
[18:28] <noob21> right
[18:28] <noob21> i see 1 osd that is 10% out from the others
[18:29] <dxd828> also how do you configure your placement groups, they don't seem to be done under the crush maps
[18:29] * Ryan_Lane (~Adium@120.sub-70-196-132.myvzw.com) has joined #ceph
[18:29] <wer> right. I have had up to a 20% delta. But I was hoping bigger and more osd's would help. But I have no way of knowing unless I keep an eye on that stuff.
[18:29] <wer> dxd828: right,... that is why I have been ignoring them up until now :)
[18:30] * tezra|work (~rolson@ Quit (Quit: *dance*)
[18:30] <noob21> how does the osd tree weight look?
[18:30] <wer> ceph pg dump tells me a lot... but I don't know what it is telling me.
[18:30] <wer> osd tree looks good to me.... but that isn't reality.
[18:31] <noob21> the last part of osd dump shows the KBused and KBAvail on all osd's
[18:31] <alexxy> hi all
[18:31] <alexxy> any idea why df -h reports bogus values for cephfs
[18:31] <alexxy> ?
[18:31] <alexxy> 66G 5.3G 61G 9% /home
[18:32] <alexxy> while fs is ~16T
[18:32] <wer> noob21: where are you seeing usage in osd dump?
[18:32] <noob21> the last 10 lines or so
[18:32] <noob21> for you it would be 90 lines
[18:33] <wer> lol
[18:33] <noob21> :D
[18:33] <noob21> ceph pg dump | grep osdstat
[18:33] <noob21> if that doesn't show maybe your version is older than mine?
[18:33] <noob21> i'm on 0.54-1
[18:34] <wer> hmm. I am running 0.55-1 I think.
[18:34] <noob21> ok
[18:34] <wer> osd.95 up in weight 1 up_from 446 up_thru 4157 down_at 0 last_clean_interval [0,0) exists,up d5a21d64-9517-4573-8eb8-a2bad114c29a
[18:34] <noob21> does pg dump include that for you also?
[18:34] <wer> yeah pg dump includes a shitload of stuff.
[18:35] <noob21> totally
[18:35] <noob21> maybe a little script to parse that?
[18:35] <wer> The above stuff doesn't look like usage to me.
[18:35] <noob21> skip crap until you get to osdstat and then awk '{print $1 $2 $3 }'
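A minimal sketch of the kind of check being discussed: flag OSDs that are close to full and report how uneven the fill is. The input is a plain dictionary of per-OSD used and available kilobytes; how those numbers are extracted from `ceph pg dump` is left out on purpose, because the exact output layout varies by version. The function name and the 85% threshold are hypothetical.

```python
def osd_fill_report(usage_kb, warn_ratio=0.85):
    """Return (ids of OSDs at or above warn_ratio, max-min fill spread).

    usage_kb maps an OSD id to a (kb_used, kb_avail) tuple.
    """
    ratios = {
        osd: used / (used + avail)
        for osd, (used, avail) in usage_kb.items()
        if used + avail > 0  # skip OSDs that report no capacity
    }
    over = sorted(osd for osd, ratio in ratios.items() if ratio >= warn_ratio)
    spread = max(ratios.values()) - min(ratios.values()) if ratios else 0.0
    return over, spread


# Made-up numbers mirroring the case above: one OSD at 96%, the rest at 46%.
sample = {0: (46, 54), 1: (96, 4), 2: (46, 54)}
over, spread = osd_fill_report(sample)
print("near-full OSDs:", over)
print(f"fill spread: {spread:.0%}")
```

Run from cron, a check like this would catch a single outlier OSD that a cluster-wide status summary can hide.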
[18:35] <wer> I don't have osdstat in osd dump.
[18:36] <noob21> oh :(
[18:36] <noob21> i guess they nixed it in the next version
[18:36] * drokita_ (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[18:36] <dxd828> wer, the docs says " you must specify the number of placement groups at the time you create the pool." does that mean you can never change it?
[18:36] <wer> I have no idea.
[18:36] <noob21> that seems odd because ceph is built with expansion in mind
[18:37] <mikedawson> wer: I'm on 0.55.1 and I have osdstat in pg dump
[18:37] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:37] <drokita_> dxd828: From what I understand, that is the case. I inquired the same with Inktank about 6 weeks ago.
[18:37] <noob21> what did inktank say?
[18:37] <dxd828> drokita_, this is why i'm confused :)
[18:37] <mikedawson> I've read # of placement groups will be able to be changed sometime soon
[18:37] * roald_ (~roaldvanl@139-63-12-68.nodes.tno.nl) has joined #ceph
[18:38] <drokita_> They said that we would have to create a new cluster and migrate the data into it
[18:38] <noob21> ouch
[18:38] <noob21> so either create a crap load of PG's at the start or migrate later?
[18:38] <noob21> eww
[18:38] <drokita_> We were not happy about it, but I am sure they will find a way eventually
[18:38] <dxd828> what is the relation of a crush map to a placement group?
[18:39] <noob21> i think crush decides where to put placement groups
[18:39] <drokita_> Yeah, a lot of people have no idea what they are building when they start. Makes it hard if you are locked in.
[18:39] <noob21> true
[18:39] <drokita_> Yes, that is true
[18:39] <noob21> i'll keep that in mind when i create my cluster in a week or so
[18:39] <wer> true here.
[18:39] <noob21> i'll triple the PG's just in case
[18:40] <wer> what is "hb in" and "hb out" in pg dump?
[18:40] <noob21> if we scale past 200TB i think i'll have eaten all the storage in the building :D
[18:40] <drokita_> I would love to hear someone from Ceph weigh in, but it seems that more PGs would also mean smaller PGs that migrate during balancing at a more atomic scale.
[18:40] <noob21> yeah i think you're right
[18:41] <noob21> they said there's a perf hit for going too high
[18:41] * yoshi (~yoshi@ has joined #ceph
[18:41] <drokita_> Anyone out there good with authentication issues?
[18:41] <wer> I have no idea what all these pg dumps mean.
[18:42] <wer> drokita_: no I have mine off atm.... chicken and egg syndrome since going past v.48
[18:42] <noob21> yeah i have mine off also
[18:42] <wer> :)
[18:42] <dxd828> i'm going to kidnap a ceph engineer for a week, so he can explain everything to me..
[18:42] <noob21> :D
[18:42] <drokita_> Yeah… in my prod it is off, but I am hoping to remedy that when we make some upgrades. I am having some issues in a test cluster.
[18:43] <wer> What's the problem? and what version?
[18:43] <noob21> my guess is pg dump outputs the epoch for each pg, plus the status of it
[18:43] <noob21> and also where it's currently housed at i think
[18:43] <drokita_> 48 Argonaut… trying to start the sole mon and it throws the following:
[18:44] <drokita_> (leader).auth v1 client did not provide supported auth type
[18:44] <drokita_> Currently this cluster has cephx turned off.
[18:44] * roald (~roaldvanl@ Quit (Ping timeout: 480 seconds)
[18:45] <wer> drokita_: is this all on one node? or is mon a separate machine?
[18:47] <dxd828> It would be epic if someone was to add some example "production" setups in the ceph documentation :)
[18:47] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[18:47] <wer> heh. Which version :P
[18:49] <dxd828> when you create a pool, you specify how many replicas you want, how does this work with the crush maps where you specify where data is put?
[18:50] <wer> I want to know the same thing :) I am finally at that point where I need to figure that out.
[18:52] <wer> "hb in" and "hb out" are heartbeats. That makes sense. The first group is OSDs which the one in question is keeping
[18:52] <wer> track of; the second are OSDs which the one in question should be
[18:52] <wer> reporting to.
[18:52] <wer> hb out is always empty on mine.
[18:53] * fzylogic (~fzylogic@ has joined #ceph
[18:56] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[18:57] <drokita_> wer: sorry, I had to step away. I have a single mon and 2 osds on the same server
[18:57] <drokita_> I turned up debug on the mon hoping for more info…. no dice
[18:58] <wer> ok. The only reason I asked is that if auth is enabled in the config.... things try to use it. So I have convinced myself it was off... but my local ceph was using auth.
[18:58] <wer> Or so it would seem to me.
[18:58] <wer> after disabling auth did you restart the mon?
[18:58] <sjustlaptop> drokita_, noob21, dxd828: currently, the number of pgs in a pool cannot be changed after creation
[18:59] <yehudasa> drokita_: which version are you running
[18:59] <yehudasa> ?
[18:59] <dxd828> I think i get it, objects belong to one placement group. crush distributes placement groups which contain objects across your osd's?
[18:59] <drokita_> I am running 48… auth was never turned on. I was trying to get a base working so that I can turn it on
[18:59] <sjustlaptop> the release after bobtail should have a stable way of increasing (but not decreasing) the number of pgs in a pool
[18:59] <sjustlaptop> dxd828: right
[18:59] <sjustlaptop> you don't want to create too many pgs either
[19:00] <yehudasa> drokita_: with 0.48 you should specify "auth supported = cephx" on all entities
[19:00] <sjustlaptop> each osd must keep a bunch of state for each pg in memory
[19:00] <sjustlaptop> rbd and radosgw both allow you to expand by adding and using new pools
[19:00] <sjustlaptop> you want to go for around 100 pgs per osd
[19:00] <drokita_> yehudasa: I will try that, but my production is not configured that way and it is working
[19:01] <wer> where are the number of pgs configured?
[19:01] <yehudasa> drokita_: well, if you're running without, you can upgrade and turn it off
[19:02] * tezra (~Tecca@ Quit (Ping timeout: 480 seconds)
[19:02] <yehudasa> drokita_: auth cluster required = none, auth service required = none, auth client required = none
[19:02] <yehudasa> drokita_: these 3 configurables will work on 0.55.1
[19:05] * dmick (~Dan@cpe-76-87-42-76.socal.res.rr.com) has left #ceph
[19:06] <sjustlaptop> wer: it's one of the arguments when you create a pool
[19:06] <wer> nice. What if I never created a pool? :)
[19:06] <sjustlaptop> in that case it depends on how you set up your cluster
[19:07] <wer> So I am only using radosgw to store things in the cluster. no cephfs if that matters.
[19:07] <sjustlaptop> you can see the number of pgs from ceph osd dump, it's the pg_num part
[19:07] <sjustlaptop> you can actually tell radosgw to start putting new objects in a different pool
[19:08] <sjustlaptop> if your current setup is not quite right
[19:08] <wer> hmm. All I ever did was create a bucket in rados :(
[19:08] <sjustlaptop> what is the output of ceph osd dump | head -n 20 ?
[19:09] <wer> http://pastebin.com/1ATgFrEX
[19:09] * BManojlovic (~steki@ has joined #ceph
[19:09] <sjustlaptop> that's ceph pg dump
[19:10] <wer> yesir
[19:11] <wer> I get 4607 lines from ceph pg dump.
[19:11] * jluis (~JL@ has joined #ceph
[19:11] <mikedawson> sjustlaptop: is smalliobench going to be packaged as part of bobtail? I'm on quantal, so there isn't a build available in next or master on the gitbuilder repo it seems
[19:11] <sjustlaptop> ceph osd dump
[19:11] <sjustlaptop> mikedawson: it's not currently packaged at all
[19:12] <sjustlaptop> I think there will be a debug/benchmark package soon though
[19:12] <wer> http://pastebin.com/TfVVwmiu
[19:12] * fc (~fc@ Quit (Quit: leaving)
[19:12] <mikedawson> sjustlaptop: ok. thanks!
[19:12] <sjustlaptop> yehudasa: does rgw create all of it's pools with pg_num 8?
[19:13] <sjustlaptop> *its
[19:14] <yehudasa> sjustlaptop: rgw creates it through librados. It doesn't specify anything, but that's the default I guess.
[19:14] <sjustlaptop> that's probably not great
[19:15] * drokita__ (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[19:15] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:15] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:16] * joao (~JL@89-181-148-171.net.novis.pt) Quit (Ping timeout: 480 seconds)
[19:18] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[19:18] * drokita_ (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[19:24] * drokita__ (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Remote host closed the connection)
[19:24] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Read error: Connection reset by peer)
[19:24] * drokita (~drokita@anon-147-47.vpn.ipredator.se) has joined #ceph
[19:26] * drokita_ (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[19:28] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:29] * sjustlaptop1 (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:29] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Read error: Connection reset by peer)
[19:32] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[19:32] <drokita_> So this would explain it… apparently I have no default client.admin keyring. Is it possible to create one after OSDs and Mons have already been created?
[19:34] * drokita (~drokita@anon-147-47.vpn.ipredator.se) Quit (Ping timeout: 480 seconds)
[19:34] * drokita_ is now known as drokita
[19:34] * roald__ (~roaldvanl@ has joined #ceph
[19:36] <sagewk> drokita_: yeah. you can authenticate using the mon. key as well, which is in the mon data dir.. so using that key, create a client.admin key
[19:37] <sagewk> ceph --keyring .../mondata/keyring -n mon. auth list , auth get-or-create-key client.admin mds 'allow' osd 'allow *' mon 'allow *'
[19:37] * dxd828 (~dxd828@ Quit (Read error: Operation timed out)
[19:37] <wer> sagewk: that is nice to know....
[19:37] <drokita> Thanks Sage! I will give it a try.
[19:40] <drokita> Hmmm… maybe the problem is deeper than that. I am unable to authenticate as mon.
[19:41] * roald_ (~roaldvanl@139-63-12-68.nodes.tno.nl) Quit (Read error: Operation timed out)
[19:49] <yehudasa> drokita: does your ceph.conf have 'auth supported = cephx'?
[19:53] * roald__ (~roaldvanl@ Quit (Read error: Connection reset by peer)
[19:53] * roald__ (~roaldvanl@ has joined #ceph
[19:53] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Remote host closed the connection)
[19:54] * akarpo (~textual@dh146.citi.umich.edu) Quit (Read error: Operation timed out)
[19:58] <noob21> sjustlaptop: thanks for the info. i didn't know that about the radosgw
[19:58] <noob21> for my light usage so far 0.54 has been stable for me
[20:03] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Remote host closed the connection)
[20:08] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) has joined #ceph
[20:09] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[20:09] <dmick1> damn, we have 20 other noobs? That's cool
[20:09] * dmick1 is now known as dmick
[20:10] * janos could easily qualify for that name
[20:10] <wer> make that 22. :)
[20:10] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[20:12] <dmick> clearly we need a welcome mat :)
[20:12] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[20:12] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[20:13] * roald__ (~roaldvanl@ Quit (Ping timeout: 480 seconds)
[20:15] <drokita> yehudasa: No it does not, but it does mirror a known working configuration
[20:16] <wer> so working backwards... I can see from radosgw-admin that its only pool is .rgw.buckets. The ceph osd dump tells me that pool 11 '.rgw.buckets' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 573 owner. So pg_num 8 is my pg for radosgw and the bucket I created in there?
[20:18] <wer> what can I find out with the ceph pg commands? aside from ceph pg dump I get invalid arg when trying "ceph pg map 8" for example.....
[20:18] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) has joined #ceph
[20:19] <wer> Like how can I tell that a specific pg is using all the osd's?
[20:19] <CloudGuy> hello all
[20:20] <drokita> Hello
[20:21] * nhorman (~nhorman@ Quit (Quit: Leaving)
[20:21] <sstan> hi
[20:24] <yehudasa> wer: what do you mean a specific pg using all the osds?
[20:24] <yehudasa> wer: ceph pg map 8.0
[20:25] <wer> Well I just want to make sure all the osd's will be used... and none will be excluded from use for storage.
[20:25] <CloudGuy> my ceph is up, but i am not sure how to get started or use it as rbd
[20:25] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[20:25] <yehudasa> wer: ceph pg map 11.0 <--- a pg in the .rgw.buckets pool
[20:26] <dilemma> anyone know if there's a way, when bringing up a new OSD, to allow it to complete re-balancing before actually serving up data?
[20:26] <mikedawson> CloudGuy: if you are using OpenStack http://ceph.com/docs/master/rbd/rbd-openstack/
[20:26] <yehudasa> wer: since you have only 8 pgs in that pool it would be 11.0 to 11.8
[20:26] <drokita> CloudGuy: have you created the RBD
[20:26] <drokita> ?
[20:26] <wer> yehudasa: my brain hurts.
[20:26] <wer> so 11 is the pg and what is the pg_num then?
[20:27] <yehudasa> wer: and also, since you have only 8 pgs for that pool, it means that at the max you'd have 24 osds participating for data placement in that specific pool
[20:27] <CloudGuy> mikedawson: thank you .. i was trying to test using cloudstack, but i can give openstack a try also . but this is a separate cluster .. just wanted to get a rbd and mount it and store files on it to see how good it works
[20:27] <yehudasa> wer: 11 is the pool id, 11.0 is the pg num
[20:28] <yehudasa> wer: you have 8 pgs for that pool: 11.0, 11.1, ..., 11.7
[20:28] <sjustlaptop1> wer: objects are mapped to pgs which are mapped to osds
[20:28] <yehudasa> wer: mind my typo up there (11.8)
[20:28] <sjustlaptop1> so with 8 pgs in the pool you will use at most 8*2 osds
[20:29] <wer> yehudasa: scary. I don't get it :)
[20:29] <mikedawson> CloudGuy: Read Wido's stuff http://blog.widodh.nl/2012/09/ceph-distributed-storage-with-cloudstack/
[20:29] <sjustlaptop1> wer: how many osds do you have?
[20:29] <wer> so the pg_num reported in osd dump says pool 11 has 8 pg_nums which equates to 8*2 osds?
[20:29] <wer> 96
[20:31] <sjustlaptop1> it means 8 pgs, but pgs are placed on two osds each by default (for redundancy)
[20:31] <sjustlaptop1> so 16 osds
[20:31] <drokita> Dumb question, but should an OSD keyring have entries for the keys to the other infrastructure (mon, osd, mds, etc) that it will be talking to?
[20:32] <wer> sjustlaptop1: so I am only using 16 osds for my rados buckets?
[20:33] <sjustlaptop1> which pool is that?
[20:33] <sjustlaptop1> each pool is only on about 16 osds
[20:33] <sjustlaptop1> which is definitely not great
[20:33] <yehudasa> wer: oh, rep size is 2, then yeah .. 8*2, not 8*3
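The arithmetic in this exchange can be sketched as follows: a pool's PGs are named pool_id.N, and with pg_num PGs each stored on rep_size OSDs, the pool can touch at most pg_num * rep_size OSDs. The helper names are hypothetical, and printing the part after the dot in hexadecimal is an assumption that makes no difference for a pool with only 8 PGs.

```python
def pg_ids(pool_id, pg_num):
    """List the PG ids of a pool, e.g. pool 11 with 8 PGs gives 11.0 .. 11.7."""
    # Assumption: the PG number after the dot is rendered in hexadecimal.
    return [f"{pool_id}.{seed:x}" for seed in range(pg_num)]


def max_osds_for_pool(pg_num, rep_size, total_osds):
    """Upper bound on how many OSDs can hold data for the pool."""
    return min(pg_num * rep_size, total_osds)


print(pg_ids(11, 8))
print(max_osds_for_pool(8, 2, 96))     # wer's .rgw.buckets pool on 96 OSDs
print(max_osds_for_pool(4096, 2, 96))  # a pool created with 4096 PGs
```

With only 8 PGs and 2 replicas, at least 80 of the 96 OSDs never receive data from that pool, which is consistent with the uneven fill reported earlier.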
[20:34] * Cube (~Cube@ has joined #ceph
[20:35] <dilemma> is it possible to sync an OSD that is in the crush map, but marked as out? (as in, a new OSD)
[20:36] <yehudasa> wer: you can create a new pool with more pgs (e.g., 4096), and then add it to the data placement set for rgw
[20:36] <yehudasa> wer: remove the old pool from the set, but note that it'll only affect data going to newly created buckets
[20:37] <yehudasa> wer: radosgw-admin pool add, pool rm, pools list
[20:37] * SkyEye (~gaveen@ has joined #ceph
[20:40] * dosaboy (~gizmo@host86-164-143-213.range86-164.btcentralplus.com) Quit (Quit: Leaving.)
[20:42] * SkyEye (~gaveen@ Quit (Remote host closed the connection)
[20:42] <elder> joshd, Sage mentioned he wanted to know whether you were really working on reproducing 3631.
[20:43] <elder> I don't remember why, but he might be interested to know.
[20:44] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[20:46] <dilemma> still looking for an answer to this one: when adding an OSD, is it possible to guarantee that re-balancing completes before directing client traffic to that OSD?
[20:46] <mikedawson> joshd, jamespage: Could either of you sanity check my process to integrate RBD and Cinder: http://pastebin.com/3LhUeuHk
[20:46] <dilemma> I'm seeing my new OSDs get hammered with re-balancing traffic, and as PGs sync over, client traffic starts using the OSD as well
[20:47] <dilemma> and as a result, I'm getting slow requests
[20:47] <joshd> elder: thanks, I'll talk to him. Sam found it to be a client side issue (he was able to reproduce with osd logs)
[20:47] <drokita> What version are you using dilemma
[20:47] <dilemma> 0.48.2
[20:47] <dilemma> I'm aware of the osd_num_concurrent_backfills that is coming in the next stable release
[20:47] <drokita> dilemma: As far as I know, that is normal behavior for an OSD that is rebuilding. Do you have the cluster and public on same network?
[20:48] <dilemma> I'd love to use that
[20:48] <dilemma> yes
[20:48] <dilemma> nodes talk to each other over the same interface they talk to clients
[20:49] <dilemma> I'm considering throttling inter-cluster traffic at the OS level
[20:49] <dilemma> but I'd rather allow re-balancing to complete before setting the OSD as in
[20:50] <drokita> From my experience, the fact that the OSD rebuild and client access happen on the same OSD is not likely the cause of any performance problem, but rather a general problem with how the actual client accesses the cluster during a rebalancing. These are issues that are supposedly resolved in the Bobtail release
[20:51] <drokita> I have been in the same situation
[20:51] <dilemma> can you clarify " how the actual client accesses the cluster during a rebalancing" for me? Maybe a ticket somewhere that talks about this?
[20:52] <drokita> Don't quote me on this, but I thought it had something to do with an overabundance of status checking from the libceph client during a rebuild that causes performance slowdown on the cluster as a whole
[20:53] <drokita> I am sure there is someone smarter than me that can weigh in on that little nugget of wisdom
[20:53] <drokita> or smarter than I :)
[20:53] <sjustlaptop1> dilemma: the problem is mostly one of balancing client io work against recovery work
[20:54] <sjustlaptop1> bobtail improves the situation, but there will be more work in the next release after bobtail
[20:54] <sjustlaptop1> you can't really "force" client io to avoid the new osd at this time
[20:54] <sjustlaptop1> but the config you mentioned will throttle the amount of recovery done
[20:55] <sjustlaptop1> which will help
[20:57] <wer> yehudasa: I will give that a try. ty.
[20:58] * nwat1 (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[20:58] * s15y (~s15y@sac91-2-88-163-166-69.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[20:58] <joshd> mikedawson: if you're just having issues with creating a volume from an image, it could be the glance configuration. is there an error in the cinder-volume log?
[21:00] <dilemma> sjustlaptop1: thanks for the advice. I do need to get this done before bobtail though - since it's just a matter of balancing IO, is there a chance that throttling inter-cluster traffic (thus reducing recovery IO) can improve the situation?
[21:00] <dilemma> or will that hurt me in other ways, or just not help
[21:00] <sjustlaptop1> dilemma: throttling inter-cluster traffic will also knee-cap client io replication
[21:01] <mikedawson> joshd: 2012-12-20 15:00:30 13920 DEBUG cinder.manager [-] Running periodic task VolumeManager._publish_service_capabilities periodic_tasks /usr/lib/python2.7/dist-packages/cinder/manager.py:164
[21:01] <mikedawson> 2012-12-20 15:00:30 13920 DEBUG cinder.manager [-] Running periodic task VolumeManager._report_driver_status periodic_tasks /usr/lib/python2.7/dist-packages/cinder/manager.py:164
[21:01] <sjustlaptop1> uh, try bringing the weight of the new node gradually from 0 to the desired weight
[21:01] <sjustlaptop1> we've had people do that with new racks/nodes
[21:01] <mikedawson> that's in the time frame where the failure occurs
[21:01] <sjustlaptop1> that will move a few pgs at a time
[21:01] <joshd> mikedawson: any error at higher level then, like in the cinder-api log, or the cinder-schedule log?
[21:01] <dilemma> I did read about that, but I'm not sure I understand how it will help. The huge load IO spike I'm getting from re-balancing will just be shorter
[21:02] <sjustlaptop1> and smaller
[21:02] <sjustlaptop1> fewer pgs will attempt to move at once
[21:02] <dilemma> how does it determine how many PGs to move at once?
[21:02] <sjustlaptop1> it's a consequence of the crush distribution
[21:03] <sjustlaptop1> you don't get to control the number of pgs exactly, but the weights allow you to adjust what proportion of pgs get placed on what osd
[21:03] <dilemma> I understand that it will be *assigned* fewer PGs, but what prevents it from trying to sync large numbers of them at a time??
[21:03] <sjustlaptop1> there won't be a large number of them to sync
[21:03] <mikedawson> joshd: http://pastebin.com/fMWmdSn5
[21:04] <sjustlaptop1> that's all
[21:04] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[21:04] <dilemma> so I need to figure out a weight that is small enough, and ramp it up slowly enough, that there just aren't enough PGs to sync at the same time
[21:04] <sjustlaptop1> right
[21:04] <sjustlaptop1> in practice, that's pretty easy
[21:05] <sjustlaptop1> I've seen 10% at a time do pretty well
[21:05] <sjustlaptop1> and you just wait for all pgs to go active+clean before increasing it again
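The gradual reweighting sjustlaptop1 describes can be sketched as a list of intermediate CRUSH weights, here in ten equal steps to match the "10% at a time" remark. The function name and step count are illustrative; applying each weight and waiting for every PG to report active+clean before moving to the next is left to the operator.

```python
def weight_steps(target_weight, steps=10):
    """Intermediate weights from target/steps up to the full target weight."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Compute each step from the target directly to avoid accumulating float error.
    return [round(target_weight * i / steps, 4) for i in range(1, steps + 1)]


for weight in weight_steps(1.0):
    # At each step: set the new OSD's weight to `weight`, then wait until
    # all PGs are active+clean before continuing.
    print(weight)
```

On a very large cluster a higher step count gives smaller increments, at the cost of more rounds of waiting.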
[21:05] <mikedawson> joshd: root@node1:~# cat /etc/glance/glance-api.conf | grep -i rbd
[21:05] <mikedawson> default_store = rbd
[21:05] <mikedawson> # glance.store.rbd.Store,
[21:05] <mikedawson> # ============ RBD Store Options =============================
[21:05] <mikedawson> rbd_store_ceph_conf = /etc/ceph/ceph.conf
[21:05] <mikedawson> rbd_store_user = images
[21:05] <mikedawson> rbd_store_pool = images
[21:05] <mikedawson> rbd_store_chunk_size = 4
[21:06] <dilemma> I have a cluster designed to scale up to 720 OSDs (thus it has 72,000 placement groups). I'm probably going to have to go a lot smaller than 10%
[21:06] <joshd> mikedawson: that glance config looks fine
[21:07] <joshd> mikedawson: to be clear, the cinder-volume log you looked at was on node12?
[21:08] <mikedawson> no it was from node1
[21:08] <dilemma> sjustlaptop1: thanks though, I think that this is exactly how I'll do it
[21:08] <sjustlaptop1> k
[21:08] <joshd> mikedawson: check the one on node12 - the schedule log says "Casted 'create_volume' to host 'node12'"
[21:10] <mikedawson> joshd: aha - got the error on node12
[21:11] <mikedawson> I've been trying to fix the auth issue on node1 since yesterday, but haven't touched the other nodes. My guess is they need to be fixed now. Didn't realize the scheduler moved this workload off to another node
[21:13] <mikedawson> joshd: to fix the auth on the other nodes, do I just make them rbd_secret_uuid=441d8ebb-876a-329b-fbaf-9a2e9a0272d0 like I have in node1's /etc/cinder/cinder.conf?
[21:13] <joshd> yes, although that wouldn't cause an error during image creation
[21:14] <mikedawson> joshd: when I run your instructions on each compute host, I get a different uuid for each node (my guess is that's why it isn't working at present)
[21:16] <mikedawson> joshd: right now every host has a different uuid in virsh secret-list. Do I need those to be the same across the cluster?
[21:16] <joshd> mikedawson: yeah, so you can either add the node-specific rbd_secret_uuid to each node's nova.conf, or make them all use the same uuid (you can add a <uuid> element to the secret.xml)
[21:17] <mikedawson> joshd: That's the missing secret sauce! Once I get this working, can I submit a patch to the published docs?
[21:17] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[21:18] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has left #ceph
[21:18] <joshd> mikedawson: sure, that'd be great
[21:19] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[21:23] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[21:32] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[21:56] <mikedawson> joshd: set uuid in secret.xml, reran the virsh stuff, restarted nova-compute and cinder-volumes. Thought this would work. Tried to create a volume from an image. Scheduler sent it to node3. Log -> http://pastebin.com/rW6PNNLS
[21:56] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:57] <mikedawson> joshd: I did the virsh stuff and service restarts on all nodes, if it wasn't clear
[21:58] <joshd> mikedawson: ah, there was one thing I forgot earlier - does your glance config have show_image_direct_url=True?
[21:59] <joshd> that error actually looks like it's just having trouble talking to the glance api at the network level though
[21:59] <mikedawson> so now I have one uuid for all hosts in virsh and a matching rbd_secret_uuid in all /etc/cinder/cinder.conf files. Nothing about rbd in any nova.conf files
[21:59] <joshd> maybe a firewall or something stopping it
[21:59] <elder> sage, are you there?
[22:00] <elder> sagewk, how about you?
[22:00] <dmick> he's in his office, I see him :)
[22:01] <mikedawson> joshd: root@node1:~# cat /etc/glance/glance-api.conf | grep show
[22:01] <mikedawson> show_image_direct_url=True
[22:01] <joshd> mikedawson: so all that uuid config stuff sounds good, and should make running the instances/attaching volumes work, but it's unrelated to the problem with creating a volume from an image
[22:02] <mikedawson> glance is on node1
[22:02] <mikedawson> root@node2:~# ls -lah /etc/ceph/
[22:02] <mikedawson> total 20K
[22:02] <mikedawson> drwxr-xr-x 2 root root 4.0K Dec 18 00:50 .
[22:02] <mikedawson> drwxr-xr-x 108 root root 4.0K Dec 18 15:56 ..
[22:02] <mikedawson> -rw-r--r-- 1 root root 65 Dec 18 00:50 ceph.client.volumes.keyring
[22:02] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[22:02] <mikedawson> -rw-r--r-- 1 root root 1.5K Dec 19 09:28 ceph.conf
[22:02] <mikedawson> -rw------- 1 root root 63 Dec 18 00:48 ceph.keyring
[22:03] <mikedawson> do compute nodes need ceph.client.images.keyring? do the root:root permissions look like a problem?
[22:05] <joshd> the permissions will work
[22:05] * dxd828 (~dxd828@host-92-25-121-210.as13285.net) has joined #ceph
[22:05] <joshd> but the error in your cinder log shows it can't connect to the glance api server at all (ECONNREFUSED)
[22:08] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[22:08] <mikedawson> root@node1:~# lsof -i :9292
[22:08] <mikedawson> glance-ap 7292 glance 4u IPv4 459099 0t0 TCP *:9292 (LISTEN)
[22:08] <mikedawson> glance-ap 7297 glance 4u IPv4 459099 0t0 TCP *:9292 (LISTEN)
[22:09] <mikedawson> root@node3:/etc/nova# cat /etc/nova/nova.conf | grep glance
[22:09] <mikedawson> glance_api_servers=
[22:09] <mikedawson> image_service=nova.image.glance.GlanceImageService
[22:20] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[22:22] <mikedawson> joshd: the ECONNREFUSED was from cinder.openstack.common.rpc.amqp - does that mean Cinder can't talk to Rabbit?
[22:23] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:24] <joshd> mikedawson: no, that just happens to be the context in which it's making a request to glance - you can see it's in _create_glance_client in the backtrace
[22:25] <mikedawson> joshd: ok. thanks
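The ECONNREFUSED discussed above is a plain TCP-level failure, so it can be reproduced without Cinder or Glance in the picture. A minimal sketch, assuming the Glance API listens on node1 port 9292 as the lsof output suggests; the helper name can_connect is hypothetical, and the script would be run from the node the scheduler picked (node3 in this case).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ECONNREFUSED, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # "node1" and 9292 are taken from the log; adjust for the real deployment.
    print(can_connect("node1", 9292))
```

A False result from the volume node while glance-api is listening on node1 would point at the client-side address (for example an empty glance_api_servers value) or a firewall, rather than at Ceph authentication.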
[22:29] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[22:29] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[22:31] <mikedawson> joshd: digging deep here: https://answers.launchpad.net/nova/+question/171610
[22:49] * gregorg (~Greg@ Quit (Read error: Connection reset by peer)
[22:49] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[22:50] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[22:51] * BManojlovic (~steki@ Quit (Read error: Operation timed out)
[22:54] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[22:55] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[22:56] * gregorg (~Greg@ has joined #ceph
[23:00] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[23:01] * noob21 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:04] <mikedawson> joshd: glance-api and glance-registry config was in fact a bit off. Thank you. Thank you. Thank you.
[23:05] * vjarjadian (~IceChat7@5ad6d005.bb.sky.com) has joined #ceph
[23:05] <joshd> mikedawson: you're welcome. glad it's all working now
[23:06] <vjarjadian> hi
[23:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[23:12] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Quit: drokita)
[23:14] <dmick> hi vjarjadian
[23:16] <vjarjadian> anything fun happening? discussion on geo-replication was fun last night :)
[23:16] <dmick> depends on your definition of fun I suppose :)
[23:17] <paravoid> so, Chef's cookbook ceph.conf indicates that with "mon hosts" there is no need to have separate [mon.foo] sections
[23:17] <paravoid> however both upstart and sysv init scripts won't start the mon if it isn't defined as such
[23:17] <paravoid> what am I missing?
[23:18] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[23:20] <dmick> it looks to me like the ceph-mon-* upstart scripts only depend on presence of binaries and directories; what are you seeing that makes you think it looks at ceph.conf-the-INI-style-file?
[23:21] <dmick> (specifically things end up in ceph-mon.conf)
[23:22] <paravoid> what is ceph-mon.conf?
[23:22] <dmick> that's the upstart script that starts the monitors
[23:22] <paravoid> oh /etc/init you mean
[23:22] <dmick> no
[23:22] <dmick> er, yes
[23:22] <dmick> sorry
[23:23] <dmick> (my mind added a '.d' to what you typed)
[23:24] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[23:24] <paravoid> ah, I was missing /done
[23:24] <paravoid> doh
[23:24] <dmick> I'm looking at master, which is probably different from what you have, but I don't think in that respect
[23:24] <dmick> ah, ok
[23:25] <paravoid> hm, still doesn't work
[23:25] <dmick> master also adds 'upstart' as a tag saying this is for sure an upstart-managed host
[23:25] <dmick> debugging upstart is fun
[23:25] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:29] <paravoid> a tag where?
[23:29] <dmick> I think it's a new file in /var/lib/ceph/whatever
[23:29] * dxd828 (~dxd828@host-92-25-121-210.as13285.net) Quit (Quit: Textual IRC Client: www.textualapp.com)
[23:29] <dmick> 02aca6830ad07f585317bfafc4780c1b50701cb4
[23:30] <paravoid> hm?
[23:30] <dmick> e597482f2941ec80e3159610fc84a674701b48fd
[23:30] <dmick> git commit ids for the changes
[23:30] <dmick> you can see them on github.com/ceph/ceph if you're not using git locally
[23:30] <paravoid> I am
[23:31] <dmick> looks like those are in master and next
[23:31] <paravoid> the latter is not in 0.55.1
[23:31] <paravoid> the former is for osd, not mon
[23:31] <paravoid> so, not related I think
[23:31] <wer> yehudasa: you still around?
[23:32] <dmick> oh I wasn't saying you had it, just that it was a newer wrinkle coming
[23:35] * tezra (~Tecca@ has joined #ceph
[23:47] * darkfader (~floh@ Quit (Ping timeout: 480 seconds)
[23:53] <paravoid> hm
[23:53] <paravoid> so
[23:54] <paravoid> service ceph-all start works
[23:54] <paravoid> service ceph start doesn't
[23:54] <paravoid> I think the latter calls the sysv script
[23:54] <paravoid> which does the ceph-conf calls
[23:56] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[23:57] <dmick> right, the latter is sysv
[23:58] <dmick> the amusing compatibility fallback
[23:58] <dmick> that's why we renamed /etc/init/ceph.conf to /etc/init/ceph-all.conf in 55.1
[23:58] <paravoid> ah!
[23:59] <dmick> sorry, I might have saved you some time had I mentioned that
[23:59] <wer> what? I didn't rename anything :)
[23:59] <dmick> I don't see any 'r' in that word :)
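A rough way to check which situation a host is in, following paravoid's reading that the sysv script relies on per-daemon sections while the upstart jobs do not. This is only a sketch: it uses Python's configparser as a stand-in for ceph-conf, which assumes the file is close enough to plain INI syntax to parse, and the sample file contents are hypothetical.

```python
import configparser


def mon_sections(conf_text):
    """Return the names of [mon.<id>] sections found in ceph.conf text."""
    # strict=False tolerates repeated sections; interpolation is disabled
    # so that '%' characters in values are left alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return [name for name in parser.sections() if name.startswith("mon.")]


if __name__ == "__main__":
    # Hypothetical sample; real files will differ.
    sample = """
[global]
mon host = 10.0.0.1,10.0.0.2

[mon.a]
host = node1
"""
    print(mon_sections(sample))
```

An empty list would be consistent with what paravoid saw: service ceph-all start (upstart) works, while service ceph start (sysv) finds no monitor to start.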

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.