#ceph IRC Log


IRC Log for 2013-06-11

Timestamps are in GMT/BST.

[0:02] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[0:05] * LeaChim (~LeaChim@ has joined #ceph
[0:09] <redeemed> anyone else get access errors when playing with CephFS? populate a file on one client that has CephFS mounted, then try to look at the contents of that file on a second client.
[0:09] <rturk> Vjarjadian: try http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
[0:10] <gregaf> redeemed: what kind of access errors?
[0:10] <redeemed> the file gets blanked if i remount
[0:10] <redeemed> gregaf: the OS (ubuntu12.04) will say "operation not permitted"
[0:11] <gregaf> have you set up some custom pools for the filesystem to use?
[0:11] <gregaf> sounds like the clients can't access the pool the file data is supposed to go to
[0:12] <redeemed> gregaf: i was trying to use the default; i have not come across anything stating i needed to create a pool for CephFS. did i miss something?
[0:12] <gregaf> no, just a guess based on your symptoms
[0:12] <gregaf> is this kernel or ceph-fuse?
[0:12] <redeemed> kernel
[0:13] <redeemed> 3.5.0-31-generic #52~precise1-Ubuntu
[0:14] <gregaf> what's the output of "ceph -s" and "ceph metadata dump"?
[0:14] * grepory1 (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Ping timeout: 480 seconds)
[0:14] * Jahkeup (~Jahkeup@ Quit (Remote host closed the connection)
[0:15] <redeemed> how does one dump the metadata?
[0:15] * Jahkeup (~Jahkeup@ has joined #ceph
[0:16] <redeemed> ah, ceph -h doesn't say it but the docs site does
[0:17] <redeemed> gregaf: http://pastebin.com/vGciK5pg
[0:17] <redeemed> health is clean
[0:18] <gregaf> redeemed: I see that the filesystem is using two pools, and by default it will only use one
[0:18] <gregaf> s/pools/data pools
[0:19] <redeemed> i added the second pool a minute ago thinking i missed something
[0:19] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[0:19] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[0:21] <gregaf> redeemed: okay, can you grab the "ceph.layout.pool" xattr on the file in question?
[0:26] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:33] <TiCPU> just added an OSD and I'm seeing slow requests and stalled I/O in cuttlefish 0.61.3; is there any workaround yet?
[0:34] <redeemed> gregaf: i don't think i can redeploy fast enough to do this tonight. i will do my best to determine this tomorrow. however i was curious if anyone knew how to mount cephFS on another path than /? EXAMPLE: sudo mount -t ceph /blarge
[0:34] <gregaf> I think it's like that, although for the kernel it might look a little different
[0:34] <gregaf> pretty sure it's in the docs, anyway
[0:38] <redeemed> by default, the docs tell us to mount to use "" but that is reported to cause problems at boot in /etc/fstab; the workaround is to append to the path but ensure the "catalog" exists, whatever that is. EXAMPLE: ""
[0:39] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:41] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) Quit (Remote host closed the connection)
[0:44] * mgalkiewicz (~mgalkiewi@159-205-131-126.adsl.inetia.pl) has joined #ceph
[0:44] <mgalkiewicz> hi guys
[0:44] <mgalkiewicz> is it normal that during the update from bobtail to cuttlefish upgraded mon reports: cephx: verify_reply coudln't decrypt with error: error decoding block for decryption?
[0:45] <tnt> yes
[0:46] <tnt> you need to upgrade all mons at once. (well a majority of them anyway)
[0:46] <mgalkiewicz> I know that I need to upgrade all of them but I did not expect cephx problems in the logs
[0:47] <tnt> it's because they can't join the quorum and get the session keys afaik.
[0:47] <mgalkiewicz> ok I will proceed
[0:52] <sjust> loicd: are you there?
[0:52] <loicd> yes
[0:52] <sjust> I'm not sure I understand your question
[0:52] <sjust> it shouldn't happen with default settings
[0:52] * meebey (meebey@white.cloud.smuxi.net) Quit (Remote host closed the connection)
[0:53] <loicd> that's what confused me I guess. When can it happen ?
[0:53] <sjust> but if you turn the min log entries option to a small number, the primary might wake up after a brief nap to find that all of its entries are divergent
[0:53] <sjust> although now it seems like that should cause backfill
[0:53] <sjust> hmm
[0:53] * loicd thinking
[0:53] <sjust> ah, if log.tail == auth_log.head
[0:54] <loicd> but
[0:54] <loicd> assert(newhead > log.tail);
[0:54] <loicd> would throw then, right ?
[0:54] * meebey (meebey@white.cloud.smuxi.net) has joined #ceph
[0:55] <sjust> yeah, you are right
[0:55] <sjust> I don't think that case can happen
[0:56] <phantomcircuit> is there any way to monitor which rbd volumes are most active?
[0:57] * portante (~user@ Quit (Read error: Operation timed out)
[0:57] * loicd digging the history of changes for a hint
[0:59] * zhyan_ (~zhyan@ has joined #ceph
[0:59] <sjust> loicd: I think there may be a conflict with calc_acting where it chooses which peers can be activated, you may want to make a bug
[1:00] <loicd> ok, sjust I'll check this tomorrow morning. Thanks for the advice :-)
[1:00] <sjust> I think the assert is actually wrong, but we'll never see the problem case due to osd_min_pg_log_entries
[1:01] <loicd> the history goes straight back to when rewind_divergent_log was not yet a function on its own, difficult to compare
[1:01] * markbby (~Adium@ Quit (Quit: Leaving.)
[1:01] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has left #ceph
[1:02] <sjust> yeah, the history probably won't be a lot of help
[1:03] <loicd> not sure I understand why osd_min_pg_log_entries makes it impossible for newhead to be <= log.tail
[1:03] * loicd thinking
[1:03] <sjust> not impossible, but very unlikely
[1:04] * loicd contemplates
[1:04] <loicd> // The logs must overlap.
[1:04] <loicd> assert(log.head >= olog.tail && olog.head >= log.tail);
[1:04] <sjust> right, the question is whether the constraint is
[1:04] <sjust> log.head >= olog.tail
[1:04] <sjust> or
[1:04] <sjust> log.head > olog.tail
[1:05] <sjust> in the latter case, there may be no entries in common
[1:05] <sjust> in the former case, there must be at least 1 entry in common
[1:05] <mgalkiewicz> after upgrading to cuttlefish I ended up with degraded pgs. here is my ceph status and crushmap https://gist.github.com/maciejgalkiewicz/34b61afb0eac505f8a56
[1:06] <mgalkiewicz> I would really appreciate any help
[1:07] <grepory> mgalkiewicz: i don't know if it's related, but i had an issue with empty buckets in my crushmap. removing empty buckets and items with weight 0 resolved my issue.
[1:07] <loicd> so the assert can only be triggered if olog.head == log.tail (because of the previous assert on olog.head >= log.tail)
[1:08] <sjust> well, that and the behavior in calc_acting that should prevent the peer from being chosen as non-backfill otherwise
[1:08] <mgalkiewicz> grepory: not sure what you mean by empty bucket
[1:08] <sjust> loicd: just merged your unit tests
[1:08] <grepory> mgalkiewicz: the "localhost" bucket in your crushmap has a weight 0, that is an empty bucket.
[1:09] <loicd> sjust, thanks :-)
[1:09] <sjust> thank you!
[1:09] <grepory> mgalkiewicz: i.e. an empty bucket is a node in the crush hierarchy with no descendants and is not a leaf.
[1:10] <grepory> or rather, is not INTENDED to be a leaf.
[1:10] * zhyan__ (~zhyan@ has joined #ceph
[1:11] <mgalkiewicz> ok I will remove localhost and localrack
[1:11] <grepory> i believe this constitutes an improper crush map rule, as a node in the crushmap's tree with no children should be a leaf, but only devices are eligible leaves.
[1:11] <grepory> leafs.
[1:11] <grepory> leaves.
[1:12] <grepory> I think it is probably a good thing, as empty buckets probably cause more retries in the placement algorithm than they're worth.
[1:14] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[1:17] * loicd out of the night
[1:22] * zhyan__ (~zhyan@ Quit (Ping timeout: 480 seconds)
[1:26] <tnt> Mmm, I just upgraded to 0.61.3 ( from a 0.61.2 + a few mon patches ) and the behavior seems worse than it was.
[1:27] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[1:27] * mschiff (~mschiff@port-1123.pppoe.wtnet.de) Quit (Ping timeout: 480 seconds)
[1:34] * zhyan__ (~zhyan@ has joined #ceph
[1:37] <paravoid> sjust: not sure if it helps at all or if it's completely unrelated, but I got slow peering when adding a mon(!)
[1:37] <sjust> paravoid: really?
[1:37] <sjust> adding a mon shouldn't have caused peering at all
[1:38] <paravoid> yeah, weird
[1:38] <mgalkiewicz> grepory: I have removed localhost and localrack but it did not fix the problem
[1:38] <paravoid> didn't get a chance to pg dump
[1:39] <paravoid> it started with 1 remapped+peering, got up to 16 peering
[1:39] * sagelap (~sage@2600:1010:b01f:df48:bc04:545e:8f6f:72fe) has joined #ceph
[1:39] <paravoid> all the weird things happen to me
[1:39] <sagelap> jamespage: did you have a chance to talk to the upstart people yet?
[1:41] * jasdeepH (~jasdeepH@50-0-250-146.dedicated.static.sonic.net) has joined #ceph
[1:41] <tnt> sagelap: was there a lot of other mon changes between cuttlefish .2 and .3 beside the async compact ? I had .2 + trim fix + async range compact and just went to .3 : http://i.imgur.com/WDVGj7t.png
[1:41] * sagelap1 (~sage@2600:1010:b023:f933:bc04:545e:8f6f:72fe) has joined #ceph
[1:44] <tnt> sagelap1: was there a lot of other mon changes between cuttlefish .2 and .3 beside the async compact ? I had .2 + trim fix + async range compact and just went to .3 : http://i.imgur.com/WDVGj7t.png (repeat of message because you seem to have disconnected/reconnected)
[1:45] <sagelap1> hmm. there were quite a few smaller fixes. we also adjusted some of the defaults..
[1:45] <sagelap1> you're concerned about higher memory utilization?
[1:45] <tnt> that's the disk space
[1:46] <sagelap1> oh
[1:46] <tnt> I'm concerned by the amplitude of the oscillation :p
[1:46] <sagelap1> the trimming is a bit less frequent
[1:46] <tnt> ok, I guess that's probably that then.
[1:47] <sagelap1> you can play with the tunables, though.. looking at paxos_trim_{min,max} and paxos_service_trim_{min,max}
[1:47] <sagelap1> it may be that smaller values and more frequent compaction is better. it is roughly trading disk io for disk utilization
[1:47] * sagelap (~sage@2600:1010:b01f:df48:bc04:545e:8f6f:72fe) Quit (Ping timeout: 480 seconds)
[1:47] <tnt> ok I'll have a look. Just wanted to make sure there was an explanation for the change in behavior so I can go to sleep and expect the cluster to still be alive when I wake up :p
[1:48] <grepory> mgalkiewicz: :( i am not sure. have you looked at http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ yet?
[1:50] <mgalkiewicz> I will take a look
[1:50] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[1:51] * sagelap1 is now known as sagelap
[1:59] * redeemed (~quassel@cpe-192-136-224-78.tx.res.rr.com) has joined #ceph
[2:07] <paravoid> should I consider "slow requests *after* peering" a different bug than slow peering?
[2:09] <sjust> yes
[2:10] <paravoid> tired of my bugs already?
[2:10] * tnt (~tnt@228.199-67-87.adsl-dyn.isp.belgacom.be) Quit (Read error: Operation timed out)
[2:11] <sjust> paravoid: never
[2:11] * AaronSchulz watches paravoid upset a hill of tiny fire ants
[2:11] <paravoid> heh :)
[2:11] <TiCPU> as far as I can see, those slow requests are still present in 0.61.3; I just added an OSD and it gets shot in the head by its peers
[2:12] <paravoid> what do you mean by that TiCPU?
[2:12] <TiCPU> the OSD has slow requests and after a while gets marked "down" then comes back, says it was wrongly marked as "down" and shuts down
[2:16] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[2:21] * Cube (~Cube@c-38-80-203-93.rw.zetabroadband.com) has joined #ceph
[2:24] * Tamil (~tamil@ Quit (Quit: Leaving.)
[2:24] <TiCPU> just found out that setting "noup" and waiting shows OSD using 160% CPU and sometime 100% I/O
[2:25] <paravoid> #5297 it is
[2:29] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[2:30] <TiCPU> after many attempts I was able to make the cluster clean again... osd.1 had enospace from btrfs, osd.4 was slow, osd.6 was corrupt.. I have 6 OSDs
[2:30] <TiCPU> not good but at least it recovered
[2:31] <jasdeepH> any good place to look to learn about OSD internals besides just reading the source?
[2:39] <joshd> http://ceph.com/docs/master/dev/peering/ and http://ceph.com/docs/master/dev/osd_internals/
[2:39] * mgalkiewicz (~mgalkiewi@159-205-131-126.adsl.inetia.pl) Quit (Quit: Ex-Chat)
[2:47] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[2:58] * zhyan__ (~zhyan@ Quit (Ping timeout: 480 seconds)
[3:12] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[3:19] * sagelap (~sage@2600:1010:b023:f933:bc04:545e:8f6f:72fe) Quit (Ping timeout: 480 seconds)
[3:20] * jasdeepH (~jasdeepH@50-0-250-146.dedicated.static.sonic.net) Quit (Quit: jasdeepH)
[3:28] * rturk is now known as rturk-away
[3:39] <paravoid> so, no pg merging yet, right?
[3:58] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[4:16] * Cube (~Cube@c-38-80-203-93.rw.zetabroadband.com) Quit (Quit: Leaving.)
[4:25] * Jahkeup (~Jahkeup@ Quit (Remote host closed the connection)
[4:31] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Remote host closed the connection)
[4:32] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:33] * Jahkeup (~Jahkeup@ has joined #ceph
[4:39] * Jahkeup (~Jahkeup@ Quit (Read error: Operation timed out)
[4:41] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[4:41] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[4:43] * redeemed (~quassel@cpe-192-136-224-78.tx.res.rr.com) Quit (Remote host closed the connection)
[4:51] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[4:51] * rturk-away is now known as rturk
[4:53] * rturk is now known as rturk-away
[5:14] * Meths_ (rift@ has joined #ceph
[5:16] * san (~san@ has joined #ceph
[5:20] * Meths (rift@ Quit (Ping timeout: 480 seconds)
[5:30] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[5:31] * alexk (~alexk@of2-nat1.sat6.rackspace.com) Quit (Ping timeout: 480 seconds)
[5:36] * ofu (ofu@dedi3.fuckner.net) Quit (Read error: Connection reset by peer)
[5:39] * Jahkeup (~Jahkeup@ has joined #ceph
[5:39] * ofu_ (ofu@dedi3.fuckner.net) has joined #ceph
[5:39] * Jahkeup (~Jahkeup@ Quit (Read error: Operation timed out)
[6:21] * sagelap (~sage@2600:1010:b00e:19eb:bc04:545e:8f6f:72fe) has joined #ceph
[6:41] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) Quit (Read error: Connection reset by peer)
[6:42] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[7:23] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[7:38] * Psi-jack (~psi-jack@psi-jack.user.oftc.net) Quit (Remote host closed the connection)
[7:51] * tnt (~tnt@228.199-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:05] * Cube (~Cube@c-38-80-203-93.rw.zetabroadband.com) has joined #ceph
[8:11] * yanzheng (~zhyan@ has joined #ceph
[8:25] * krishna_ (~krishna@ has joined #ceph
[8:25] * Psi-jack (~psi-jack@yggdrasil.hostdruids.com) has joined #ceph
[8:25] <krishna_> hello... need help with ceph install on ubuntu...
[8:25] <wogri_risc> krishna_ - what's the question
[8:26] <krishna_> i'm not able to get through mkcephfs command...
[8:26] <krishna_> ceph -v
[8:26] <krishna_> ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532)
[8:26] <krishna_> sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring --mkfs
[8:27] <krishna_> the log says
[8:27] <krishna_> 2013-06-11 11:45:32.977565 7f5443990780 -1 auth: error reading file: /var/lib/ceph/osd/ceph-1/keyring: can't open /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory bufferlist::write_fd(/var/lib/ceph/osd/ceph-1/keyring): write_fd error: (28) No space left on device 2013-06-11 11:45:32.978131 7f5443990780 -1 ** ERROR: writing new keyring to /var/lib/ceph/osd/ceph-1/keyring: (28) No space left on device entity osd.1 not found
[8:27] * Cube (~Cube@c-38-80-203-93.rw.zetabroadband.com) Quit (Quit: Leaving.)
[8:28] <wogri_risc> can you please show me the output of ls /var/lib/ceph/osd/ceph-1/
[8:28] <wogri_risc> and df -h
[8:28] <krishna_> current fiemap_test fsid journal keyring lost+found store_version
[8:29] <wogri_risc> what about df -h
[8:30] <krishna_> both ceph-0 and ceph-1 is showing 100%
[8:30] <krishna_> they are 1GB each
[8:30] <wogri_risc> usage?
[8:30] <krishna_> fresh amazon ebs volumes
[8:30] <krishna_> yes
[8:30] <wogri_risc> 1 gb is already the size of the journal.
[8:30] <wogri_risc> you have to make them bigger, or decrease the size of the journal
[8:30] <krishna_> oh!
[8:30] <wogri_risc> you're full. that's why it won't work.
[8:31] <krishna_> this is for a poc/experimental....
[8:31] <wogri_risc> yeah. you can decrease the size of the journal.
[8:31] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[8:31] <krishna_> is there a minimum... or do you suggest I work with a larger hdd, like say 5GB?
[8:32] <wogri_risc> I don't think there is a minimum. it might slow down a lot with a slow journal, which might not be an issue in your experiment.
[8:32] <wogri_risc> s/slow/small/
[8:33] * Vjarjadian (~IceChat77@ Quit (Quit: Always try to be modest, and be proud about it!)
[8:33] <krishna_> let me try with 5GB thanks a lot for your help....
[8:33] <wogri_risc> np
[8:33] <krishna_> bye
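A minimal sketch of the sizing problem discussed above: a journal about as large as the volume leaves no room for object data. The function name, the 512 MB headroom, and the sample sizes are assumptions for illustration, not Ceph defaults. The setting wogri_risc refers to is presumably `osd journal size` in ceph.conf, which is given in megabytes.

```python
def journal_fits(disk_mb, journal_mb, min_data_mb=512):
    """Return True if the journal leaves at least min_data_mb for object data.

    min_data_mb is a hypothetical headroom value, not a Ceph default.
    """
    return disk_mb - journal_mb >= min_data_mb


# A 1 GB volume with a 1 GB journal has no room left for data.
print(journal_fits(disk_mb=1024, journal_mb=1024))
# A 5 GB volume with the same journal leaves about 4 GB.
print(journal_fits(disk_mb=5120, journal_mb=1024))
```

Filesystem overhead is ignored here, so a real volume has somewhat less usable space than its nominal size.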
[8:43] <krishna_> hello @wogri_risc are you still around?
[8:43] <wogri_risc> I am.
[8:43] <krishna_> now the commands are going through... with 5GB disks...
[8:43] <krishna_> sudo ceph health
[8:43] <krishna_> shows
[8:43] <krishna_> 2013-06-11 12:11:12.890713 7f9bf440f780 -1 unable to authenticate as client.admin
[8:43] <krishna_> 2013-06-11 12:11:12.891734 7f9bf440f780 -1 ceph_tool_common_init failed.
[8:44] <krishna_> any idea?
[8:44] <wogri_risc> seems your keys have problems.
[8:44] <wogri_risc> maybe it's a good idea to re-format the OSDs and start over.
[8:45] <krishna_> that is like unmount and run mkcephfs again?
[8:45] <wogri_risc> unmount, mkfs.xfs (or whatever) and run mkcephfs again.
[8:45] <krishna_> also delete all files except ceph.conf from /etc/ceph folder?
[8:45] <wogri_risc> right.
[8:45] <krishna_> thanks ... let me try...
[8:48] <krishna_> now health says...
[8:48] <krishna_> HEALTH_WARN 576 pgs stuck inactive; 576 pgs stuck unclean
[8:48] <krishna_> anything to worry or I can move further?
[8:52] <krishna_> now HEALTH_OK thanks a lot...
[8:56] * schlitzer|work (~schlitzer@ has joined #ceph
[8:58] <wogri_risc> krishna_ -> it had to construct a working cluster full of wonderful PG's.
[8:58] <wogri_risc> you're all set. have fun with ceph.
[9:01] * bergerx_ (~bekir@ has joined #ceph
[9:10] * krishna_ (~krishna@ Quit (Ping timeout: 480 seconds)
[9:17] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[9:22] * ScOut3R (~ScOut3R@ has joined #ceph
[9:30] * tnt (~tnt@228.199-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:30] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[9:32] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[9:40] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[10:04] * margiov (~oftc-webi@giovanni.ba.infn.it) has joined #ceph
[10:04] <margiov> Hi guys
[10:05] <margiov> I need a small information about ceph-osd configuration
[10:06] <margiov> How do I configure multiple storage disks on the same osd daemon?
[10:06] <wogri_risc> you don't.
[10:06] <wogri_risc> one daemon = one disk
[10:06] <wogri_risc> if you want to combine more disks you have to stripe using lvm, raid or whatever.
[10:07] <wogri_risc> ceph doesn't recommend doing this
[10:08] <margiov> got it
[10:08] <margiov> thanks
[10:08] <wogri_risc> welcome
[10:13] * san (~san@ Quit (Quit: Ex-Chat)
[10:18] <tnt> Can anyone on 0.61.3 run "grep won /var/log/ceph/ceph-mon.*.log" on a leader mon ?
[10:23] * LeaChim (~LeaChim@ has joined #ceph
[10:29] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:33] * yanzheng (~zhyan@ has joined #ceph
[10:41] <topro> tnt: how to identify leader mon?
[10:43] <tnt> topro: mm, lowest IP. pastebin ceph -s and it should be the first.
[10:44] <topro> monmap e3: 3 mons at {a=,b=,c=}, election epoch 312, quorum 0,1,2 a,b,c
[10:44] <tnt> the 'a' is the leader.
[10:44] <tnt> huh ..
[10:44] <topro> ok, wait a moment
[10:45] <tnt> ceph mon_status is more explicit about leader actually
[10:45] <topro> so I'm looking for rank 0, right?
[10:46] <tnt> look at the "name" field.
[10:46] <tnt> the leader should have answered that query ( "state": "leader" )
[10:46] <tnt> like so http://pastebin.com/SmVbSe9E
[10:47] <topro> hmm, name: "b" state: "peon"
[10:48] <topro> http://pastebin.com/vWyiutdq
[10:48] <topro> ^^ ceph mon_status called on node "a"
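A hedged sketch of the leader check described above. It assumes that each mon_status reply is JSON carrying the "name" and "state" fields mentioned in the conversation, and that a reply describes whichever monitor answered, so replies from several monitors may be needed. The sample replies are made up for illustration and are not taken from the pastebins.

```python
import json

# Illustrative replies only; real mon_status output has more fields.
SAMPLE_REPLIES = [
    '{"name": "b", "rank": 1, "state": "peon"}',
    '{"name": "a", "rank": 0, "state": "leader"}',
]


def find_leader(replies):
    """Return the name of the monitor reporting state 'leader', or None."""
    for raw in replies:
        status = json.loads(raw)
        if status.get("state") == "leader":
            return status.get("name")
    return None


print(find_leader(SAMPLE_REPLIES))
```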
[10:48] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[10:49] <topro> grep won /var/log/ceph/ceph-mon.*.log gives no match on node "a" either
[10:51] <tnt> 'c' is the leader then.
[10:51] <topro> no result either on "c"
[10:51] <tnt> Ok thanks.
[10:52] <topro> and none on "b", so no one of the monitors has won in its monitor logs
[10:52] <tnt> probably means you don't have random spurious elections, which is good for you :)
[10:52] <topro> ^^ with 0.61.3-1~bpo70+1 debian wheezy cuttlefish packages from ceph.com that is
[10:56] <topro> tnt: but that didn't help you with your request either, right?
[10:59] <tnt> topro: well, it shows that the problem I'm experiencing is not universal.
[10:59] <tnt> What mon hw do you use ? (mostly disk type/speed)
[11:04] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[11:06] <topro> tnt: i have /var/lib/ceph mounted on a ssd as I saw a LOT of IO load caused by MONs, then for each OSD I have a separate 10k HDD mounted to /var/lib/ceph/osd/ceph-*
[11:07] <topro> I have 3 nodes, with one MON, one MDS and 3 OSDs each
[11:07] <topro> all on commodity HW, 4 to 6 cores AMD with 16 to 24 GB RAM each
[11:08] <tnt> ok, so the ssd is probably helping a lot there :)
[11:08] <topro> the SSD is a samsung 840 pro 128 GB
[11:10] <tnt> is that MLC or SLC ?
[11:10] <topro> honestly, don't know
[11:11] * tziOm (~bjornar@ has joined #ceph
[11:11] <tnt> It's TLC actually
[11:12] <topro> tnt: the datasheet states it's MLC
[11:12] <topro> http://www.samsung.com/us/system/consumer/product/mz/7t/d5/mz7td500kw/SSD_840Pro_Spec_Sheet_FIN.pdf.pdf
[11:16] <tnt> ah yeah, I read the article wrong.
[11:17] <topro> damn, should have had a look, I would have preferred SLC ;) but comparable SLC might not have been affordable anyway
[11:18] <topro> we'll see for how long they'll last with that IO-load
[11:19] * yanzheng (~zhyan@ has joined #ceph
[11:23] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[11:24] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[11:26] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) has joined #ceph
[11:27] * tziOm (~bjornar@ Quit (Ping timeout: 480 seconds)
[11:33] * capri (~capri@ has joined #ceph
[11:34] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 21.0/20130511120803])
[11:35] * wdk (~wdk@124-169-216-2.dyn.iinet.net.au) has joined #ceph
[11:39] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) Quit (Ping timeout: 480 seconds)
[11:39] * Meths_ is now known as Meths
[11:42] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) has joined #ceph
[12:01] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[12:04] * yanzheng (~zhyan@ has joined #ceph
[12:09] * margiov (~oftc-webi@giovanni.ba.infn.it) Quit (Remote host closed the connection)
[12:12] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[12:12] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) Quit (Ping timeout: 480 seconds)
[12:19] * dxd828 (~dxd828@ has joined #ceph
[12:20] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[12:31] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) Quit (Ping timeout: 480 seconds)
[12:36] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) has joined #ceph
[12:40] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[12:47] * mschiff (~mschiff@tmo-110-154.customers.d1-online.com) Quit (Ping timeout: 480 seconds)
[12:56] * tziOm (~bjornar@ has joined #ceph
[12:58] * diegows (~diegows@ has joined #ceph
[13:09] * oliver1 (~oliver@p4FD0708D.dip0.t-ipconnect.de) has joined #ceph
[13:20] * schlitzer|work (~schlitzer@ Quit (Quit: Leaving)
[13:38] * pja (~pja@a.clients.kiwiirc.com) has joined #ceph
[13:38] * capri (~capri@ Quit (Quit: Verlassend)
[13:38] <pja> hello, we have a performance problem with radosrgw
[13:38] <pja> only 8mb/s per upload
[13:38] * pja (~pja@a.clients.kiwiirc.com) Quit ()
[13:39] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:39] * capri (~capri@ has joined #ceph
[14:00] * markbby (~Adium@ has joined #ceph
[14:11] * elder_ (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[14:22] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[14:24] * mgalkiewicz (~mgalkiewi@staticline-31-182-149-134.toya.net.pl) has joined #ceph
[14:41] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[14:42] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) Quit (Remote host closed the connection)
[14:45] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[14:45] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[14:45] <andrei> hello guys
[14:46] <jerker> topro: last i checked, MLC and SLC gave the same amount of writes for the same money. And I have more alternative uses for large drives, so I went with consumer products.
[14:46] <andrei> i was wondering if someone could help me to determine why I am getting a large number of slow requests?
[14:47] <andrei> how do I determine what is causing this?
[14:47] * rongze (~zhu@173-252-252-212.genericreverse.com) Quit (Ping timeout: 480 seconds)
[14:49] <sha> andrei: russian? if not - pastebin health detail
[14:49] <sha> and ceph -s
[14:49] * rongze (~zhu@173-252-252-212.genericreverse.com) has joined #ceph
[14:49] <andrei> russian
[14:49] <sha> excellent
[14:50] <andrei> )))
[14:50] <sha> shaitan_shaitani4 - skype
[14:51] <jerker> topro: well, actually it is about 80% better writes/price when guessing they do 10 times more writes per bit.
[14:51] <andrei> http://ur1.ca/ea5dz
[14:51] <jerker> topro: better with the Intel 313 SLC compared to Intel 512 at a local shop
[14:51] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Connection reset by peer)
[14:51] * yanzheng (~zhyan@ has joined #ceph
[14:52] <jerker> s/512/520/
[14:52] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[14:52] <topro> jerker: but as prices for ssd are assumed to drop, I'll be cheaper with replacing my mlc regularly ;)
[14:56] * san (~san@ has joined #ceph
[14:57] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Quit: Ex-Chat)
[14:58] * Jahkeup (~Jahkeup@ has joined #ceph
[15:01] <jerker> I calculated how much sustained writes they would have to carry and came to the conclusion that it was fine whatever SSD i was using...
[15:01] * sha (~kvirc@ Quit (Read error: Connection reset by peer)
[15:01] <jerker> at least for some years.
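The kind of back-of-the-envelope estimate jerker describes can be sketched as follows. Every number here (program/erase cycles, daily write volume, write amplification) is a hypothetical placeholder, not a figure from a datasheet or from this conversation; only the 128 GB capacity matches the drive topro mentioned.

```python
def ssd_lifetime_years(capacity_gb, pe_cycles, writes_gb_per_day,
                       write_amplification=1.0):
    """Rough lifetime estimate: total writable volume / daily write volume.

    All inputs are assumptions supplied by the caller; real endurance
    depends on the controller, over-provisioning, and the workload.
    """
    total_writable_gb = capacity_gb * pe_cycles / write_amplification
    return total_writable_gb / writes_gb_per_day / 365.0


# Hypothetical: 128 GB drive, 3000 P/E cycles, 100 GB/day, amplification 2.
years = ssd_lifetime_years(128, 3000, 100, write_amplification=2.0)
print(round(years, 2))
```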
[15:01] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Connection reset by peer)
[15:01] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[15:03] * zhyan_ (~zhyan@ has joined #ceph
[15:07] * san (~san@ Quit (Ping timeout: 480 seconds)
[15:09] <andrei> sha: sorry, back online
[15:09] <andrei> thanks for your tips
[15:09] <andrei> i will check it out
[15:09] <andrei> i am new to ceph and have a small poc cluster with just two servers
[15:09] <TiCPU> is there a way to kill the MDS journal, my MDS has been down for almost a week and I'd like to get it back up ref bug 5250
[15:09] <andrei> so, i am doing some performance testing
[15:09] <andrei> and found out that there are sh*t loads of slow requests
[15:09] <andrei> with 4k tests
[15:10] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[15:13] <TiCPU> just found out --reset-journal was taking 0 as argument, not my MDS number
[15:14] <TiCPU> however it seems to hang there
[15:15] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:23] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[15:26] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[15:30] * mgalkiewicz (~mgalkiewi@staticline-31-182-149-134.toya.net.pl) has left #ceph
[15:31] * lxo (~aoliva@lxo.user.oftc.net) Quit (Remote host closed the connection)
[15:32] * PerlStalker (~PerlStalk@ has joined #ceph
[15:34] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[15:44] * SubOracle (~quassel@coda-6.gbr.ln.cloud.data-mesh.net) has joined #ceph
[15:46] * redeemed (~quassel@static-71-170-33-24.dllstx.fios.verizon.net) has joined #ceph
[15:47] * san (~san@ has joined #ceph
[15:55] * aliguori (~anthony@ has joined #ceph
[16:00] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[16:01] * RH-fred (~fred@ has joined #ceph
[16:01] <RH-fred> Hi !
[16:01] <mikedawson> hi
[16:02] * san (~san@ Quit (Quit: Ex-Chat)
[16:02] * Almaty (~san@ has joined #ceph
[16:05] <RH-fred> I have a little ceph setup with two nodes and two osds on each node, and the system disk of one of the two nodes is broken (no osd on it) but still working (read-only filesystem). I am looking for a safe way to switch off one of the two ceph nodes... Is there a way to tell ceph to send all the data to the host that is working and then to switch off the host which has a broken disk ?
[16:08] <RH-fred> I know the command "ceph osd out {osd-num}" but is it the only thing to do before I switch off the server ? I mean, I don't have to remove it from the crush map ?
[16:11] <tnt> no, no need to remove it from crush map to switch it off.
[16:11] <tnt> at least if it's temporary.
[16:12] <RH-fred> it is !
[16:12] * Almaty (~san@ Quit (Quit: Ex-Chat)
[16:13] <RH-fred> So I only need to do : "ceph osd out", wait for the data rebalancing and then to switch off the server in order to replace the system drive ?
[16:13] <mikedawson> RH-fred: hopefully your crushmap was set to replicate all data on each host (as opposed to spread across all four osds where copy 1 and copy 2 could be on the same host)
[16:15] <mikedawson> RH-fred: so something like "step chooseleaf firstn 0 type host" as opposed to "step chooseleaf firstn 0 type osd"
[16:16] <RH-fred> Hum, where do I see the "step chooseleaf firstn 0 type host"
[16:16] * Almaty (~san@ has joined #ceph
[16:18] <mikedawson> RH-fred: ceph osd getcrushmap -o current-crushmap && crushtool -d current-crushmap -o current-crushmap.txt && cat current-crushmap.txt
[16:18] <RH-fred> hang on
[16:20] <andrei> hello guys
[16:21] <andrei> i am running some fio benchmarks with 4k random writes and having a bunch of issues with hang tasks
[16:21] <andrei> and fio crashing after a short while
[16:21] <andrei> the kernel messages that i get are: http://ur1.ca/ea6gl
[16:22] <andrei> does anyone know what is going on?
[16:22] <andrei> it's an ubuntu 12.04 server with latest updates and ceph installed from ceph repo version 0.61.3
[16:22] <RH-fred> Yes : step chooseleaf firstn 0 type host
[16:23] <mikedawson> RH-fred: Good. That means that you should have a complete set of data
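The check mikedawson walks through (decompile the CRUSH map, then look at the type named in the chooseleaf step) can be sketched like this. The sample rule text is an illustrative fragment in the decompiled format, not RH-fred's actual map, and `failure_domains` is a hypothetical helper name.

```python
import re

# Illustrative fragment of a decompiled CRUSH map (not from the log).
SAMPLE_CRUSHMAP = """
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
"""

RULE_RE = re.compile(r"^\s*rule\s+(\S+)\s*\{")
STEP_RE = re.compile(
    r"^\s*step\s+choose(?:leaf)?\s+(?:firstn|indep)\s+-?\d+\s+type\s+(\S+)")


def failure_domains(crushmap_text):
    """Map each rule name to the bucket types its choose steps select across."""
    domains = {}
    current = None
    for line in crushmap_text.splitlines():
        rule = RULE_RE.match(line)
        if rule:
            current = rule.group(1)
            domains[current] = []
            continue
        step = STEP_RE.match(line)
        if step and current is not None:
            domains[current].append(step.group(1))
    return domains


print(failure_domains(SAMPLE_CRUSHMAP))
```

A rule that reports "host" places replicas on different hosts; one that reports "osd" may put both copies on the same machine.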
[16:24] <RH-fred> mikedawson : Do I see in the crushmap how many copies I have ?
[16:26] <mikedawson> RH-fred: ceph osd dump | grep ^pool
[16:26] <mikedawson> RH-fred: look for rep size X.
[16:27] <RH-fred> rep size 2
[16:27] <RH-fred> ok
[16:28] <mikedawson> RH-fred: Next question. How many monitors are you running and where do they run?
[16:28] <RH-fred> I have 3 and the one on the server with broken disk is down due to ro filesystem
[16:28] <RH-fred> so I will always have 2 mon running
[16:28] <mikedawson> RH-fred: the other two are in quorum?
[16:29] <RH-fred> yes
[16:29] <RH-fred> e1: 3 mons at {a=,b=,c=}, election epoch 60, quorum 1,2 b,c
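A minimal sketch of the majority rule behind the quorum line above, assuming the usual requirement that more than half of the configured monitors must be up. The function names are illustrative and are not part of any Ceph API.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if total_mons < 1:
        raise ValueError("need at least one monitor")
    return total_mons // 2 + 1


def has_quorum(total_mons: int, mons_up: int) -> bool:
    """True when the monitors that are up can still form a majority."""
    return mons_up >= quorum_size(total_mons)


# The situation in the log: 3 monitors configured, 2 (b and c) still up.
print(has_quorum(3, 2))
# Losing one more monitor would leave 1 of 3, which is not a majority.
print(has_quorum(3, 1))
```

This is why three monitors with one down can keep running, while a second failure would stop the cluster from making progress.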
[16:29] <mikedawson> RH-fred: OK, then you could probably follow the process to remove the broken OSDs, fix the problem, then re-add them
[16:30] <RH-fred> ok ok, thanks a lot !
[16:30] <RH-fred> just one thing :
[16:30] <mikedawson> RH-fred: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
[16:30] <mikedawson> RH-fred: then fix the system, and http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-osds
[16:30] <RH-fred> Yes, I am on this page but do not feel confident with it because it is already in production...
[16:31] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[16:31] <mikedawson> You'll also need to do the same on Monitors http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
[16:31] <RH-fred> ok
[16:31] * alexk (~alexk@of2-nat1.sat6.rackspace.com) has joined #ceph
[16:31] <mikedawson> RH-fred: The solution there is to provision another server (or two) and add OSDs. Two servers isn't ideal for Ceph
[16:32] <RH-fred> just one question, while running : ceph osd dump I have "removed_snaps [1~1]" on an empty line... isn't this weird ?
[16:32] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) has joined #ceph
[16:33] <mikedawson> RH-fred: for a production system, I'd recommend 3x replication instead of 2x and more servers (at least 3 for 3x replication, but the more the better)
[16:33] * Anticimex (anticimex@ Quit (Ping timeout: 480 seconds)
[16:33] <mikedawson> RH-fred: not sure on the removed_snaps, perhaps someone else could help there
[16:34] <RH-fred> mikedawson: Yes, I know but my customer doesn't understand that...
[16:35] <RH-fred> mikedawson: I also read somewhere that there is no way to add copies on a running ceph, is that true ?
[16:36] <mikedawson> RH-fred: i believe it works to change the replication level, but I've seen some placement issues while rebalancing, too. You can certainly add MONs and OSDs. You can also add new pools with higher replication levels.
[16:37] <RH-fred> ok ok
[16:37] <joao> I'm not sure what RH-fred means by 'add copies', but if it's a matter of replicas, then it's completely feasible to add more osds and then change the replication level for a pool on-the-fly
[16:38] <joao> ceph osd pool set foo size N (iirc)
[16:38] <saaby> yep
[16:38] <RH-fred> joao: Yes I mean adding replicas :)
[16:40] <RH-fred> mikedawson: what do you mean by "some placement issues while rebalancing" ? Does that crash the entire ceph storage ?
[16:40] * portante (~user@ has joined #ceph
[16:40] <mikedawson> RH-fred: no, it was a legacy problem in older releases
[16:44] <RH-fred> ok
[16:44] * yanzheng (~zhyan@jfdmzpr02-ext.jf.intel.com) Quit (Remote host closed the connection)
[16:44] <RH-fred> Thanks a lot for your help, guys !
[16:48] <saaby> RH-fred: also, you should check that your size is minimum '2' and min_size is '1'.
[16:48] <saaby> RH-fred: ceph osd dump | grep 'rep size'
[16:49] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:49] <saaby> should give you the answers
[16:49] * yanzheng (~zhyan@ has joined #ceph
[16:49] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[16:51] <RH-fred> rep size 2
[16:52] <RH-fred> saaby: but I don't see the min_size
[16:55] <saaby> RH-fred: you dont see lines like this: "pool ID 'name' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins...."
[16:55] <saaby> ?
[16:56] <RH-fred> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
[16:56] <RH-fred> that is the entire line...
[16:57] <saaby> which ceph version is this?
[16:57] * KindTwo (~KindOne@h158.214.89.75.dynamic.ip.windstream.net) has joined #ceph
[16:57] <RH-fred> ceph version 0.56.4
[16:57] * tkensiski (~tkensiski@97.sub-70-197-6.myvzw.com) has joined #ceph
[16:58] * vata (~vata@2607:fad8:4:6:a98d:9901:f82f:678f) has joined #ceph
[16:58] <saaby> oh. I actually don't know if 0.56 has the min_size concept..
[16:58] * tkensiski (~tkensiski@97.sub-70-197-6.myvzw.com) has left #ceph
[16:58] <tnt> no I don't think so.
[16:58] <saaby> no, probably not..
[16:58] <RH-fred> what is this min_size concept ?
[16:58] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[16:58] * KindTwo is now known as KindOne
[16:59] <saaby> min_size is the minimum number of replicas needed online for the pg to serve I/O
[17:00] <RH-fred> how is this useful ? Should the data not be available if there is only one replica available ?
[17:01] <saaby> RH-fred: it also makes sure writes are not committed before they are written to min_size osds.
[17:01] * markbby (~Adium@ Quit (Remote host closed the connection)
[17:01] <saaby> which guarantees data redundancy on commit
[17:01] <RH-fred> ok ok
[17:01] <saaby> if min_size > 1
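A small sketch of the size / min_size check discussed above. It assumes the pool line format quoted in this log from `ceph osd dump` (with `min_size` absent on 0.56, as RH-fred's output shows); the helper names and the `rbd` sample line are made up for illustration.

```python
import re

# Matches the start of pool lines as quoted in the log, e.g.
#   pool 0 'data' rep size 2 crush_ruleset 0 ...
#   pool ID 'name' rep size 3 min_size 2 crush_ruleset 0 ...
POOL_RE = re.compile(r"^pool (\d+) '([^']+)' rep size (\d+)(?: min_size (\d+))?")


def parse_pool_line(line):
    """Return id, name, size and min_size (None if not reported)."""
    m = POOL_RE.match(line)
    if m is None:
        return None
    pool_id, name, size, min_size = m.groups()
    return {
        "id": int(pool_id),
        "name": name,
        "size": int(size),
        "min_size": int(min_size) if min_size is not None else None,
    }


def pg_serves_io(replicas_online: int, min_size: int) -> bool:
    """A PG serves I/O only if at least min_size replicas are online."""
    return replicas_online >= min_size


old = parse_pool_line("pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins")
new = parse_pool_line("pool 3 'rbd' rep size 3 min_size 2 crush_ruleset 0")  # hypothetical line
print(old["min_size"], new["min_size"])
print(pg_serves_io(replicas_online=1, min_size=new["min_size"]))
```

With size 2 and min_size 1, losing one host keeps the pool writable; with min_size 2, the same failure would block I/O until a second replica is back.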
[17:02] * markbby (~Adium@ has joined #ceph
[17:03] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[17:03] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) has joined #ceph
[17:12] <mxmln> having 2 node cluster, is it possible to mount block device on both nodes?? "mount /dev/rbd/rbd/shared_disk /srv/shared/"
[17:12] <hufman> entirely up to the filesystem
[17:14] <mxmln> ext4..mkfs.ext4 -m0 /dev/rbd/rbd/shared_disk
[17:14] <hufman> ext4 doesn't support multiple mounts
[17:15] * Jahkeup (~Jahkeup@ Quit (Remote host closed the connection)
[17:15] <hufman> gfs2 or ocfs2 would be what you're looking for
[17:16] <hufman> you can use clustered-lvm on a single block device, with each logical volume being used by a different system
[17:17] <mxmln> over the top of ceph? lvm?
[17:17] <hufman> sure
[17:17] <hufman> you can think of an rbd like any other san device, with the same shared device issues
[17:18] * KindTwo (~KindOne@h140.40.28.71.dynamic.ip.windstream.net) has joined #ceph
[17:19] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:19] * KindTwo is now known as KindOne
[17:19] <jamespage> sagewk, I've not managed to track him down yet
[17:20] * tkensiski (~tkensiski@201.sub-70-197-6.myvzw.com) has joined #ceph
[17:20] * tkensiski (~tkensiski@201.sub-70-197-6.myvzw.com) has left #ceph
[17:22] * L2SHO (~adam@office-nat.choopa.net) has joined #ceph
[17:23] <L2SHO> Is there a secret to shutting down ceph services in ubuntu? The init scripts don't do anything, and if I kill the ceph process they automatically respawn.
[17:23] <mxmln> but is it not too heavy/expensive having lvm/ocfs/gfs on top of ceph? mount -t ceph ... could solve this too?
[17:24] <hufman> "service ceph stop" for everything, or "service ceph stop osd" or any other component
[17:24] <hufman> lvm would only be a thin layer on top, and wouldn't cause any performance penalty
[17:25] <mxmln> I only use ceph for vm's ...for live migrations I just want to create a shared config directory
[17:25] <hufman> running ext4 right on rbd would be almost identical to running ext4 on lvm on rbd
[17:25] <L2SHO> hufman, ya, that doesn't work, I don't get any output from the script at all, and my services are all still running
[17:25] <hufman> mxmln: in that case, i would just do the cephfs instead of trying to do a full filesystem
[17:25] <hufman> since your usage would be so little
[17:26] <hufman> L2SHO: hmmmm do you have a /etc/default/ceph file, what does it say?
[17:26] <hufman> sometimes that file says to disable the services, and the top of the init script checks that before even stopping things
[17:27] <L2SHO> hufman, I don't have that file
[17:27] * jlogan1 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:27] <hufman> hmmm ok
[17:27] <hufman> what does /etc/init/ceph look like?
[17:28] <hufman> i run debian instead of ubuntu, so i don't have a copy handy
[17:28] * portante (~user@ Quit (Ping timeout: 480 seconds)
[17:29] <L2SHO> hufman, well, that file doesn't exist, but there are 11 different files in /etc/init/ceph-*.conf
[17:29] <hufman> huh interesting
[17:29] <hufman> what happens if you try service ceph-osd stop?
[17:29] <hufman> or similar?
[17:30] <L2SHO> hufman, stop: Unknown parameter: id
[17:31] * tkensiski1 (~tkensiski@201.sub-70-197-6.myvzw.com) has joined #ceph
[17:32] <L2SHO> hufman, ahh, "service ceph-mon-all stop" worked
[17:32] <hufman> wooo
[17:32] * haomaiwang (~haomaiwan@ has joined #ceph
[17:32] <hufman> sorry i couldn't help faster!
[17:33] <L2SHO> hufman, no, thank you, I never would have even known there was anything in /etc/init/ without you
[17:33] <L2SHO> I despise ubuntu
[17:34] <L2SHO> Where is the proper place to report a segfault?
[17:35] * jerker (jerker@Psilocybe.Update.UU.SE) Quit (Read error: Operation timed out)
[17:35] * ShaunR- (~ShaunR@staff.ndchost.com) has joined #ceph
[17:39] <TiCPU> is it possible that crashes cause journal corruption in most cases? My OSD stops at filestore(/var/lib/ceph/osd/ceph-4) test_mount basedir /var/lib/ceph/osd/ceph-4 journal /dev/vgCEPH/journal
[17:39] * tkensiski1 (~tkensiski@201.sub-70-197-6.myvzw.com) Quit (Ping timeout: 480 seconds)
[17:39] <TiCPU> I need to mkjournal after each machine crash
[17:40] * oliver1 (~oliver@p4FD0708D.dip0.t-ipconnect.de) has left #ceph
[17:42] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:42] * ShaunR (~ShaunR@staff.ndchost.com) Quit (Ping timeout: 480 seconds)
[17:45] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:45] * KindTwo (~KindOne@h131.235.22.98.dynamic.ip.windstream.net) has joined #ceph
[17:46] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[17:48] * KindOne (KindOne@0001a7db.user.oftc.net) has joined #ceph
[17:51] * tkensiski (~tkensiski@ has joined #ceph
[17:51] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Read error: Operation timed out)
[17:53] * tkensiski (~tkensiski@ has left #ceph
[17:54] * KindTwo (~KindOne@h131.235.22.98.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[17:54] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[17:55] * markbby1 (~Adium@ has joined #ceph
[17:57] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[17:59] * KindTwo (~KindOne@h82.179.130.174.dynamic.ip.windstream.net) has joined #ceph
[17:59] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Connection reset by peer)
[17:59] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[18:00] <andrei> hello guys
[18:00] * markbby (~Adium@ Quit (Remote host closed the connection)
[18:01] <andrei> i was wondering if someone could help me with placement groups?
[18:01] <andrei> i am not clear on the number I should set for the pool
[18:01] <andrei> my storage cluster is going to grow
[18:01] <andrei> from what i've read there is a formula that you should use to calculate the total number for the storage cluster
[18:01] <andrei> which is 100* osds / replicas
[18:03] <andrei> however, if the number of osds is going to grow
[18:03] <andrei> at the moment i have just 16 osds
[18:05] * DarkAceZ (~BillyMays@ Quit (Ping timeout: 480 seconds)
[18:06] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[18:08] * DarkAceZ (~BillyMays@ has joined #ceph
[18:08] <andrei> anyone alive ?
[18:09] <nhm> andrei: heya, you can over provision if you'd like
[18:09] <andrei> nhm: would it affect the performance?
[18:09] <nhm> it'll use more memory and potentially more CPU on the MON, but should work fine.
[18:09] <nhm> mon rather
[18:09] <nhm> gotta stop doing that
[18:09] <andrei> okay, thanks
[18:10] <nhm> andrei: I usually pick a target and then round up to the nearest power of 2.
[18:10] <andrei> does the formula 100*osds/replicas apply to the whole cluster or to the pool that you are creating?
[18:10] <nhm> andrei: it may not matter, but theoretically you get slightly better distribution.
[18:10] <sjust> whole cluster!
[18:10] <sjust> oops
[18:10] <nhm> sjust: the WHOLE cluster?
[18:10] * Tamil (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[18:10] <sjust> you need around 100*osds/replicas *per pool* for good data distribution in *that pool*
[18:11] <sjust> however, the memory and cpu overheads are aggregate over all pools
[18:11] <andrei> ah, per pool, not for the entire ceph cluster?
[18:11] <sjust> so if you have 100 pools each of which has 100*osds/replicas pgs, you will have a problem
[18:11] <nhm> andrei: but yes, per pool.
[18:11] <nhm> sjust: indeed
[18:12] <andrei> okay
[18:12] <andrei> so i've created a new cluster
[18:12] <andrei> by default it had 3 pools
[18:12] * KindOne- (~KindOne@h99.46.28.71.dynamic.ip.windstream.net) has joined #ceph
[18:12] <andrei> data, rbd and metadata
[18:12] <nhm> andrei: do keep in mind though that in the recent versions of ceph, you get better overall distribution of data with more pools.
[18:12] <sjust> andrei: there is also a way to dynamically increase (but not decrease!) the number of pgs in a pool
[18:12] <nhm> andrei: so you can get away with fewer PGs per pool if you have lots of pools being used concurrently.
[18:12] * KindTwo (~KindOne@h82.179.130.174.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[18:13] <sjust> nhm: that's not enabled yet by default
[18:13] <sjust> andrei: you can enable it if you don't have old kernel clients
[18:13] * sagelap1 (~sage@2600:1010:b001:3c91:f898:7890:c1b4:dcd4) has joined #ceph
[18:13] <andrei> thanks
[18:13] <nhm> sjust: I thought that fix made it in a while back?
[18:13] <sjust> nhm: it did, but for compatibility reasons, it's off by default
[18:13] <nhm> ah
[18:13] <nhm> ok, forget what I just said. ;)
[18:13] <andrei> so, what i've done is
[18:14] <andrei> i've created a new pool in my new cluster
[18:14] <andrei> and gave it 1600 pgs
[18:14] <andrei> i have 16 osds
[18:14] <andrei> and replica count is 2
[18:14] <andrei> so, i should have used 800 according to the formula
[18:14] <andrei> however, i do plan to add more osds in the near future
[18:14] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:14] <andrei> is this a reasonable approach?
[18:15] <andrei> i do plan to have more pools
[18:15] * sagelap (~sage@2600:1010:b00e:19eb:bc04:545e:8f6f:72fe) Quit (Ping timeout: 480 seconds)
[18:15] <andrei> so, have I done my calculations correctly?
[18:16] <sjust> andrei: sounds reasonable, we aren't sure how many pgs/osd you can have without a problem -- probably >500
[18:16] <sjust> but you may want to experiment
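The rule of thumb from this exchange (about 100 * osds / replicas PGs per pool, rounded up to a power of two as nhm suggests) can be sketched as below. This is only an illustration of the arithmetic; the function names are invented here, and the per-OSD figure counts PG copies, which is an assumption about how to read "pgs/osd".

```python
def suggested_pg_count(num_osds: int, replicas: int, per_osd_target: int = 100) -> int:
    """Rule of thumb from the log: per_osd_target * osds / replicas, per pool."""
    if num_osds <= 0 or replicas <= 0:
        raise ValueError("num_osds and replicas must be positive")
    return (per_osd_target * num_osds) // replicas


def round_up_to_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def pg_copies_per_osd(pg_num: int, replicas: int, num_osds: int) -> float:
    """Average number of PG copies each OSD carries for one pool."""
    return pg_num * replicas / num_osds


# andrei's cluster: 16 OSDs, 2 replicas.
target = suggested_pg_count(16, 2)
print(target, round_up_to_power_of_two(target))
# The pool he actually created has 1600 PGs.
print(pg_copies_per_osd(1600, 2, 16))
```

Because memory and CPU overhead add up across pools, the per-OSD figure should be summed over every pool in the cluster before comparing it with the limits mentioned above.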
[18:16] <nhm> sjust: I've used over 1k without issue.
[18:17] <sjust> nhm: depends on the hardware, also, did you test peering time?
[18:17] <andrei> nope
[18:17] <andrei> how do i check that?
[18:17] <sjust> andrei: create a pool with more pgs, test failure scenarios
[18:17] <sjust> let us know what you find!
[18:18] <andrei> actually
[18:18] <nhm> sjust: nope, but I don't recall it being at all problematic. However when I jumped up to ~3k-6k PGs per OSD things got hairy, but that might have been the absolute number of PGs too.
[18:18] <andrei> i've restarted one of the servers with a power cycle
[18:18] <andrei> as one of the tests
[18:18] <andrei> the cluster hasn't been used during this time
[18:18] <nhm> sjust: at that point it was mostly the mon getting angry.
[18:18] <andrei> i mean there was no client side activity
[18:19] * tnt (~tnt@228.199-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[18:19] <andrei> after a reboot the status is: health_warn
[18:19] <andrei> and it's recovering
[18:19] <andrei> there was around 3% that it needed to recover
[18:19] <andrei> and it is doing it pretty slow
[18:19] <andrei> not sure if this has anything to do with the number of placement groups
[18:20] * KindOne- (~KindOne@h99.46.28.71.dynamic.ip.windstream.net) Quit (Ping timeout: 480 seconds)
[18:22] * danieagle (~Daniel@ has joined #ceph
[18:24] * leseb (~Adium@3.46-14-84.ripe.coltfrance.com) Quit (Quit: Leaving.)
[18:29] <andrei> i've noticed that one of my servers which runs CentOS 6.4 + ceph mon service has crashed and rebooted twice in the last 2 days
[18:29] <andrei> when I've started the benchmark testing
[18:30] <andrei> not sure if it is related to ceph or not
[18:30] <andrei> but the server is not used for anything else
[18:30] * PerlStalker (~PerlStalk@ Quit (Ping timeout: 480 seconds)
[18:30] <andrei> that's on 0.61.3
[18:32] <mxmln> I have just upgraded my two nodes cluster to 0.61.3 moved journal to tmpfs...rebooted...after reboot health HEALTH_WARN mds cluster is degraded http://pastebin.com/1e9ryM62
[18:36] * xmltok (~xmltok@pool101.bizrate.com) Quit (Remote host closed the connection)
[18:37] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[18:39] * xmltok_ (~xmltok@pool101.bizrate.com) has joined #ceph
[18:39] * noahmehl (~noahmehl@cpe-71-67-115-16.cinci.res.rr.com) Quit (Quit: noahmehl)
[18:39] * sagelap1 (~sage@2600:1010:b001:3c91:f898:7890:c1b4:dcd4) Quit (Ping timeout: 480 seconds)
[18:40] * portante (~user@ has joined #ceph
[18:43] <cjh_> mxmln: did you get a speed boost out of putting the journal on tmpfs?
[18:46] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[18:48] <mxmln> cjh I have not tested yet but solved my problem with degraded mds using crushmap decompile compile and set
[18:49] <mxmln> now I m starting some benchmark test
[18:51] * sagelap (~sage@2600:1010:b01e:7448:c685:8ff:fe59:d486) has joined #ceph
[18:53] * PerlStalker (~PerlStalk@ has joined #ceph
[18:53] * rturk-away is now known as rturk
[19:10] * portante (~user@ Quit (Ping timeout: 480 seconds)
[19:10] * Almaty (~san@ Quit (Quit: Ex-Chat)
[19:18] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[19:21] * andrei (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[19:27] * mschiff (~mschiff@ has joined #ceph
[19:30] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) Quit (Ping timeout: 480 seconds)
[19:31] * sagelap (~sage@2600:1010:b01e:7448:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[19:33] * sagelap (~sage@2600:1010:b018:8886:c685:8ff:fe59:d486) has joined #ceph
[19:34] * ShaunR- (~ShaunR@staff.ndchost.com) Quit (Read error: Connection reset by peer)
[19:34] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[19:38] <cjh_> mxmln: lemme know how that works out :)
[19:38] * julian (~julianwa@ Quit (Quit: afk)
[19:44] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:47] * portante (~user@ has joined #ceph
[19:57] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[20:00] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) has joined #ceph
[20:10] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[20:27] * sagelap1 (~sage@2600:1010:b02a:d604:c685:8ff:fe59:d486) has joined #ceph
[20:31] * sagelap (~sage@2600:1010:b018:8886:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[20:37] * haomaiwang (~haomaiwan@ Quit (Ping timeout: 480 seconds)
[20:48] * ChanServ sets mode +o rturk
[20:48] * ChanServ sets mode +o dmick
[20:48] * ChanServ sets mode +v scuttlemonkey
[20:48] * ChanServ sets mode +v elder
[20:57] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[20:58] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[21:30] * sagelap1 (~sage@2600:1010:b02a:d604:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[21:33] <paravoid> more bugs!
[21:34] <nhm> paravoid: your wish is our command! ;)
[21:39] * flakrat (~flakrat@eng-bec264la.eng.uab.edu) Quit (Quit: Leaving)
[21:39] <paravoid> # grep 'reported failed' ceph.log |wc -l
[21:39] <paravoid> 2807
[21:39] <paravoid> fun
[21:41] * sagelap (~sage@155.sub-70-197-7.myvzw.com) has joined #ceph
[21:42] <paravoid> so I have 140 osds and 139 unique "reported failed by" pairs
[21:42] <paravoid> even with min reports = 14, it got down to
[21:42] <sjust> paravoid: did you have a network kerfuffle?
[21:42] <paravoid> 2013-06-11 19:28:10.704975 mon.0 18844 : [INF] osdmap e185965: 144 osds: 140 up, 140 in
[21:43] <paravoid> 2013-06-11 19:28:29.515122 mon.0 22943 : [INF] osdmap e185967: 144 osds: 73 up, 140 in
[21:43] <paravoid> sjust: no evidence of that yet
[21:45] * hug (~hug@nuke.abacus.ch) has joined #ceph
[21:45] <paravoid> hm, could be
[21:45] <paravoid> the slow peering issue amplified that to a full outage
[21:47] <paravoid> yep, confirmed it was a network thing
[21:47] <paravoid> switch stack master switchover
[21:48] * cdsboy_ (~cdsboy@cdsboy.com) has joined #ceph
[21:49] <cdsboy_> Hey guys, I'm trying to find the minimum number of hard drives (and recommended #) to run a ceph setup, but I can't seem to find a clear answer. Could someone help me out?
[21:51] <tnt> As much as you need ...
[21:53] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[21:55] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[21:57] <hufman> i think ... 2
[21:58] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:58] * ChanServ sets mode +v andreask
[22:01] * portante (~user@ Quit (Ping timeout: 480 seconds)
[22:03] <hug> I have a weird bug with ceph and ec2. some commands like ceph-conf hang when terminating. last output when trying with strace is 'exit_group(1) = ?'
[22:03] <hug> I was following the deploying-ceph-with-juju tutorial
[22:05] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[22:07] * alram (~alram@ has joined #ceph
[22:15] * andrei (~andrei@host86-155-31-94.range86-155.btcentralplus.com) has joined #ceph
[22:15] <andrei> hello guys
[22:16] <andrei> could someone explain to me how the osd weight affects the client's decision on where to get the data from?
[22:16] <andrei> let's say i've got a block which is replicated across two different osds
[22:16] <andrei> one osd with weight 1 and another one with weight 2
[22:17] <andrei> which osd the client would read from?
[22:17] <andrei> i guess from the one with weight 2, right?
[22:20] * Tamil (~Adium@cpe-108-184-66-69.socal.res.rr.com) has left #ceph
[22:20] * Tamil (~Adium@cpe-108-184-66-69.socal.res.rr.com) has joined #ceph
[22:28] * sagelap1 (~sage@63-237-196-66.dia.static.qwest.net) has joined #ceph
[22:32] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[22:35] * sagelap (~sage@155.sub-70-197-7.myvzw.com) Quit (Ping timeout: 480 seconds)
[22:40] * sagelap (~sage@2600:1010:b005:ac97:f898:7890:c1b4:dcd4) has joined #ceph
[22:45] <sjust> andrei: no
[22:46] <sjust> the client always reads/writes to the primary osd
[22:46] <sjust> the weight affects how many pgs will be on a particular osd
[22:46] * sagelap1 (~sage@63-237-196-66.dia.static.qwest.net) Quit (Ping timeout: 480 seconds)
[22:46] <andrei> and how does ceph decide which osd is a primary one?
[22:46] <sjust> andrei: crush.
[22:46] <sjust> the crush placement includes the primary designation
[22:47] <andrei> so, what role does weight play?
[22:47] <andrei> i thought that the weight determines the placement?
[22:47] <sjust> the one with weight 1 should only get half the number of pgs as the one with weight 2
[22:47] <sjust> crush takes weight into account when distributing pgs
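As a rough illustration of the weight point above, the expected share of PGs per OSD is simply its weight divided by the total weight. This is an idealised sketch: CRUSH placement is pseudo-random, so real counts only approximate these fractions, and the OSD names are placeholders.

```python
def expected_pg_share(weights):
    """Map each OSD to the fraction of PGs it should hold on average."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {osd: w / total for osd, w in weights.items()}


# The example from the log: one OSD with weight 1, another with weight 2.
shares = expected_pg_share({"osd.a": 1.0, "osd.b": 2.0})
for osd, share in shares.items():
    print(osd, round(share, 3))
```

The weight-1 OSD ends up with about half as many PGs as the weight-2 OSD, which matches sjust's description; it does not make clients prefer one replica over another.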
[22:48] <andrei> sjust: i've noticed the following behaviour with my tests
[22:48] <andrei> i've got two servers
[22:48] <andrei> a fast and a slow one
[22:48] <andrei> with different size of osds
[22:48] <andrei> fast one has 3tb, slow one has 1.5tb
[22:48] <andrei> when the client is reading it seems to read only from a fast server
[22:49] <andrei> i've never seen it reading from a slow one
[22:49] <sjust> it depends a lot on how your crush map is set up
[22:49] <sjust> ceph pg dump will tell you where the pgs are mapped
[22:49] <sjust> you should start there
[22:50] <andrei> i've got 1825 lines when i do ceph pg dump
[22:50] <andrei> what should I look at?
[22:50] <sjust> for each pg, the acting set tells you the mapping
[22:50] <sjust> if the acting set is
[22:50] <sjust> [3,11]
[22:50] <sjust> the pg is mapped to osds 3 and 11
[22:50] <sjust> with 3 as the primary
[22:50] <sjust> so you can start by verifying that the distribution is good
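One way to check the distribution sjust describes is to tally acting sets by OSD. The sketch below assumes the acting sets have already been extracted from `ceph pg dump` as strings like `[3,11]`; the parsing of the full dump output is left out because its column layout varies between versions, and the sample data is made up.

```python
from collections import Counter


def parse_acting(text):
    """Turn a string such as '[3,11]' into [3, 11]; the first entry is the primary."""
    inner = text.strip().strip("[]")
    return [int(part) for part in inner.split(",") if part.strip()]


def count_primaries(acting_sets):
    """Number of PGs for which each OSD is the primary."""
    return Counter(acting[0] for acting in acting_sets if acting)


def count_copies(acting_sets):
    """Number of PG copies (primary or replica) held by each OSD."""
    return Counter(osd for acting in acting_sets for osd in acting)


# Hypothetical acting sets for four PGs.
sets = [parse_acting(s) for s in ["[3,11]", "[11,3]", "[3,7]", "[7,11]"]]
print(count_primaries(sets))
print(count_copies(sets))
```

Since clients read from and write to the primary, a skew in `count_primaries` toward one host would explain reads concentrating there even when copies are evenly spread.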
[22:50] <andrei> i see
[22:51] <andrei> let me check
[22:55] <andrei> from what i can see they look okay, so the pgs are distributed between osds on both file servers
[22:56] <andrei> i am just doing some read tests
[22:56] <andrei> to check where the reads are coming from
[22:56] <sjust> are the primaries clustered on one of the two?
[22:59] * sagelap (~sage@2600:1010:b005:ac97:f898:7890:c1b4:dcd4) Quit (Quit: Leaving.)
[23:00] <andrei> sjust: both servers
[23:01] <sjust> equally?
[23:01] <andrei> but mostly on the faster server
[23:01] <andrei> nope, not equally
[23:01] * markbby1 (~Adium@ Quit (Quit: Leaving.)
[23:01] <sjust> ok, by a factor of 2, possibly/
[23:01] <sjust> ?
[23:01] <andrei> yeah, looks like it
[23:01] <sjust> ok, reads on the pgs with primaries on the slow server should be on the slow server
[23:01] <sjust> how are you generating the reads?
[23:01] <andrei> dd
[23:01] <andrei> via rbd
[23:02] <sjust> block size?
[23:02] <andrei> i was using 4M
[23:02] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) has joined #ceph
[23:02] <sjust> ok, so each read should hit a new object
[23:02] <andrei> and iflag=direct
[23:02] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Remote host closed the connection)
[23:02] <sjust> odd, 1/3 of the reads should be hitting the slow server
[23:02] <andrei> sjust
[23:02] <andrei> they are
[23:02] <andrei> i was wrong
[23:02] <andrei> sorry
[23:02] <sjust> ok, no problem
[23:03] <andrei> the reason why
[23:03] <andrei> is that the dd test doesn't really stress the servers by much
[23:03] <andrei> even if i am running multiple dds
[23:03] <sjust> no
[23:03] <andrei> and the slow server is reading at around 2-4mb/s from each osd
[23:03] <andrei> which is odd
[23:03] * ghartz (~ghartz@ill67-1-82-231-212-191.fbx.proxad.net) Quit (Read error: Connection reset by peer)
[23:04] <andrei> the faster server is reading between 20 and 40mb/s
[23:04] <sjust> how many osds are there in the slow server?
[23:04] <andrei> which is still very slow compared with the disk capability
[23:04] <andrei> slow server has 8 osds
[23:04] <andrei> fast one has 9
[23:04] <andrei> the osd speeds are roughly the same
[23:04] <andrei> it's just the fast server is brand new
[23:05] <andrei> with faster cpu and more ram
[23:05] <andrei> and better mb, etc
[23:05] <andrei> they are all sata enterprise disks
[23:05] <andrei> capable of doing around 140-160mb/s seq reads using dd
[23:06] <andrei> as an example
[23:06] <andrei> i am running 4 parallel dd tests with bs=4M and iflag=direct
[23:07] <andrei> reading 4GB in total from a 200GB test file
[23:07] <andrei> each dd is reading different parts of the file as set with different skip values
[23:07] <sjust> how much aggregate throughput are you seeing?
[23:07] <andrei> just under 200MB/s
[23:07] <sjust> what is the client's network connection?
[23:07] <andrei> on infiniband ipoib
[23:08] <sjust> oh, interesting
[23:08] <sjust> what kind of speed can you get to each osd via perf?
[23:08] <andrei> sjust: this is currently being read from the fast server actually
[23:08] <sjust> *iperf
[23:08] <andrei> so, 2/3rds of the reads are not coming from a wire
[23:08] <sjust> sorry?
[23:09] <andrei> okay
[23:09] <andrei> 2 servers
[23:09] <andrei> ipoib in between
[23:09] <sjust> oh, the client is on one of the servers?
[23:09] <andrei> the rbd map is done from a fast server
[23:09] <andrei> yeah
[23:09] <sjust> ok, what is the network activity from the slow server to the fast server?
[23:10] <andrei> you mean during the tests?
[23:10] <sjust> yes
[23:10] <andrei> let me check
[23:13] <andrei> it varies a lot. from about 30mb/s to about 200mb/s
[23:14] <sjust> the dds are seeing an aggregate of around 200MB/s?
[23:14] <andrei> yeah
[23:14] <andrei> about 50MB/s each
[23:14] <andrei> i was running 4 dds at a time
[23:15] <sjust> each fast osd is seeing around 20MB/s and each slow osd is seeing around 2 MB/s?
[23:15] <andrei> i ran dd tests several times and the speed went up a bit
[23:15] <andrei> to about 65-70mb/s for each dd
[23:16] <sjust> the discrepancy seems to be between the reads on the slow and fast
[23:17] <andrei> let me run a few more tests
[23:19] <andrei> sjust: i've run a couple more tests without dropping cache on the server side
[23:19] <andrei> and 4 dds generate a maximum of around 150mb/s per dd, so, around 600mb/s cumulative
[23:20] <andrei> there is no disk activity at all during the last two dd tests
[23:20] <andrei> so, the data must be coming from cache
[23:20] <andrei> that doesn't look like a lot, does it
[23:20] <andrei> taking into account it's coming from ram?
[23:20] <andrei> what do you think?
[23:21] <andrei> ah, okay
[23:21] <andrei> got 8 dds concurrently
[23:21] <andrei> and the speed is reaching 1GB/s from ram
[23:21] * __jt__ (~james@rhyolite.bx.mathcs.emory.edu) has joined #ceph
[23:22] <andrei> so, for some reason 4 dds can't get enough speed to saturate ceph throughput
[23:23] <andrei> anyway, enough playing around with ram data, back to the problem at hand
[23:23] <andrei> let me drop cache and report back on the osd reading speeds
[23:24] <sjust> andrei: you are using direct
[23:25] <sjust> which means 1 request at a time
[23:25] <andrei> sjust: yeah
[23:25] <sjust> so you are latency bound
[23:25] <andrei> sjust: while checking the cached data read
[23:25] <sjust> yeah, the osd is not anywhere near as efficient as it could be in that area
[23:25] <andrei> i've discovered that the ipoib link from slow to fast is generating about 450-500MB/s
[23:25] <sjust> andrei: that sounds about right
[23:26] <sjust> how fast can the link actually go/
[23:26] <sjust> ?
[23:26] <andrei> the link should handle around 1G/s
[23:26] <andrei> however,
[23:26] <andrei> i do not think the slow server is fast enough to push that amount of data
[23:27] <sjust> andrei: ah
[23:27] <andrei> actually, let me check it with some tools
[23:27] <andrei> netperf?
[23:27] <andrei> or something else?
[23:27] <sjust> I usually use iperf
[23:27] <sjust> whatever works
[23:27] <sjust> sounds like you are just latency limited though
[23:27] <sjust> so the next step would be to work out the latency by running a single dd
[23:27] <andrei> why do you think so?
[23:28] <sjust> increasing the number of client io threads increased throughput
[23:28] <sjust> actually, ignore that
[23:29] <sjust> you are limited by the number of concurrent requests, you won't be maxing out the osd disks until you have more concurrent requests
[23:30] <sjust> but running a single dd will tell you how fast individual osds are fulfilling requests
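The latency-bound reasoning above can be put into a back-of-the-envelope model: with direct I/O each dd has one request in flight, so throughput is roughly requests in flight times block size divided by per-request latency. This is a simplified sketch that ignores caching, network limits and uneven OSD speed; the function names are illustrative.

```python
def per_request_latency_s(block_mb: float, observed_mb_s: float) -> float:
    """Latency implied by one stream that keeps a single request in flight."""
    return block_mb / observed_mb_s


def ideal_throughput_mb_s(block_mb: float, latency_s: float, in_flight: int) -> float:
    """Aggregate throughput if every in-flight request sees the same latency."""
    return in_flight * block_mb / latency_s


# Figures reported in the log: 4 MB blocks, about 50 MB/s per dd.
latency = per_request_latency_s(4, 50)
print(round(latency, 3))                                   # seconds per 4 MB request
print(round(ideal_throughput_mb_s(4, latency, 4)))         # four dds
print(round(ideal_throughput_mb_s(4, latency, 8)))         # eight dds, same latency
```

Under this model the disks only become the bottleneck once enough requests are in flight, which is why a single dd is the better tool for measuring per-request latency.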
[23:30] <andrei> when i run a single thread, does it read the data from one osd at a time?
[23:30] <andrei> block at a time, right?
[23:30] <sjust> rbd images are a set of 4MB blocks by default
[23:30] <andrei> and if you run 8 threads, it should read 8 blocks at a time?
[23:31] <sjust> there are ways to enable striping, but it's not on by default
[23:31] * Anticimex (anticimex@ has joined #ceph
[23:31] <sjust> and possibly not in your kernel at all
[23:31] <sjust> so with bs=4MB, you are reading from probably 1 block at a time
[23:31] <sjust> so 1 object at a time
[23:31] <sjust> so 1 osd at a time
[23:32] <andrei> and if i increase bs to 8MB i should read from 2 objects, right
[23:32] <andrei> and possibly two osds at a time?
[23:32] <sjust> yeah, in parallel I think
[23:32] <sjust> I'm not precisely sure how the kernel driver will pass that through
[23:34] <andrei> yeah, it seems that my 8M block sizes generate almost twice as much throughput
[23:34] <sjust> andrei: yeah
[23:35] * leseb1 (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[23:35] <andrei> but that's purely from ram
[23:35] <andrei> not from osds
[23:35] <andrei> as I am still reading the same data without dropping cache
[23:35] <sjust> no, it's from the osd
[23:35] <andrei> let me drop it now and check
[23:35] <sjust> but the osd is reading it out of cache
[23:35] <sjust> that is, you are using direct, right?
[23:35] <andrei> sjust: i am looking at iostat and even though I've set iflag=direct there is no disk activity
[23:35] <andrei> 0
[23:36] <sjust> so the reads are actually going to the osd daemon
[23:36] <sjust> but the osd daemon is not using direct
[23:36] <andrei> perhaps
[23:36] <sjust> so the osd daemon reads from the fs
[23:36] <andrei> yeah
[23:36] <sjust> which simply serves the read out of cache
[23:36] <andrei> true
[23:36] <andrei> okay
[23:36] <sjust> so osd, but no disk activity
[23:36] <andrei> yeah, you are right
[23:37] <andrei> okay, so, let me check the actual physical disk speed by dropping server side cache
[23:37] <andrei> and see how fast it reads from the actual spinners
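Dropping the server-side cache, as andrei describes, is normally done on Linux by writing to /proc/sys/vm/drop_caches. The sketch below shows one way to do that from Python. It assumes a Linux host and root privileges, and it has to run on the OSD server, because that is where the filesystem cache serving the reads lives. The helper name is hypothetical, and the log does not show which method andrei actually used.

```python
import os


def drop_page_cache(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty pages to disk, then ask the kernel to drop clean
    page cache, dentries and inodes (value 3). Requires root."""
    os.sync()  # write out dirty data first so the cached pages can be freed
    with open(control_file, "w") as f:
        f.write("3\n")


# Example (run as root on the OSD host before repeating the read test):
# drop_page_cache()
```

After the cache is dropped, the first pass of a read test should show disk activity in iostat on the OSD host. Later passes over the same data may be served from cache again.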
[23:39] <L2SHO> is there something special that needs to be done to get radosgw logging working? I got it setup and uploaded some objects, but "radosgw-admin log list" shows nothing
[23:40] * Maskul (~Maskul@host-92-25-200-200.as13285.net) has joined #ceph
[23:46] <andrei> sjust: iperf showed 9.14Gbits/s between the two servers
[23:47] <sjust> andrei: ok, maybe try more dds?
[23:47] <andrei> i've tried with 40 dds
[23:47] <andrei> each doing 8M blocks
[23:47] <sjust> max speed?
[23:47] <andrei> that was coming from ram
[23:47] * leseb1 (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[23:47] <sjust> yeah
[23:48] <andrei> got around 1.6GB/s
[23:48] <sjust> seems about right?
[23:48] <andrei> but the slow server was still spitting out at around 500MB/s
[23:48] <sjust> ah
[23:48] <andrei> the rest were not coming from the wire
[23:48] <sjust> yeah
[23:49] <andrei> but from the same client/server machine
[23:49] <sjust> yeah
[23:49] <sjust> cpu use on the remote/slow machine?
[23:51] <L2SHO> andrei, maybe dd is the culprit, maybe try something like "pv [file] > /dev/null"
[23:53] <andrei> a single dd on the slow server can generate about 1.5GB/s when doing if=/dev/zero of=/dev/null with bs=4M
[23:54] <andrei> i need to restart one of the storage servers
[23:54] <andrei> by the way, what is a recommended procedure for restarting a server running osds, mon and mds?
[23:58] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.