#ceph IRC Log

IRC Log for 2013-09-24

Timestamps are in GMT/BST.

[0:02] <sleinen1> This is on 0.61.8 (cuttlefish)
[0:03] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) has joined #ceph
[0:10] * jskinner (~jskinner@69.170.148.179) Quit (Remote host closed the connection)
[0:11] <sagewk> sleinen1: what happens if you restart osd.23?
[0:11] <sleinen1> I'll try - wait a minute
[0:12] * dosaboy (~dosaboy@65.93.189.91.lcy-01.canonistack.canonical.com) Quit (Quit: leaving)
[0:12] * rturk-away is now known as rturk
[0:12] <sagewk> btw opened http://tracker.ceph.com/issues/6370 for the warning thing
[0:13] <sleinen1> Oh great, thanks.
[0:13] * dmsimard1 (~Adium@ap02.wireless.co.mtl.iweb.com) has joined #ceph
[0:13] <sleinen1> So, the state is not much better - the pg is now listed as "incomplete" rather than "creating".
[0:13] * dosaboy (~dosaboy@host109-158-232-255.range109-158.btcentralplus.com) has joined #ceph
[0:14] <sleinen1> osd.23 has started logging the slow request again.
[0:14] <sleinen1> Is it possible that it is actually trying to *do* something?
[0:14] <sleinen1> The problem is that since the cluster has been unclean for so long, the mons' state has become huge.
[0:14] * dmsimard (~Adium@108.163.152.2) Quit (Read error: Operation timed out)
[0:15] <sleinen1> (I understand that when the cluster isn't clean, the mons don't trim old maps - is that correct?)
[0:15] <Jedicus> insert "your mons state is SO huge..." joke
[0:17] <sleinen1> root@h1:~# du -hs /var/lib/ceph/mon/ceph-h1/store.db/
[0:17] <sleinen1> 41G /var/lib/ceph/mon/ceph-h1/store.db/
[0:18] <sagewk> sleinen1: can you ceph osd tell 23 injectargs '--debug-osd 20 --debug-ms 1' and redo the force create pg thing and then fpaste the log?
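
A minimal sketch of the debug-and-retry sequence being requested here, assuming the force-create step is invoked as "ceph pg force_create_pg" (as later log lines suggest); osd.23 and pg 0.cfa are the ones from this log:

    # raise osd logging at runtime
    ceph osd tell 23 injectargs '--debug-osd 20 --debug-ms 1'
    # retry creating the stuck placement group
    ceph pg force_create_pg 0.cfa
    # turn logging back down afterwards
    ceph osd tell 23 injectargs '--debug-osd 0 --debug-ms 0'
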
[0:18] * markbby (~Adium@168.94.245.1) Quit (Quit: Leaving.)
[0:18] <sleinen1> Sure.
[0:19] <BillK> after running for a couple of months, ceph on btrfs got very slow - looks like btrfs fragmentation due to cow. After unrelated crash, recreated cluster and back to full speed ... how to prevent/fix slowdown?
[0:19] <BillK> trying to run an online btrfs defrag got a kernel oops ...
[0:20] * dmsimard1 (~Adium@ap02.wireless.co.mtl.iweb.com) Quit (Read error: Operation timed out)
[0:22] <sagewk> BillK: what kernel version?
[0:22] <sagewk> i hear things got faster after ~3.6, but not sure if that includes the long-term fragmentation issues
[0:24] * dalegaard (~dalegaard@vps.devrandom.dk) Quit (Read error: Operation timed out)
[0:26] * codice_ (~toodles@71-80-186-21.dhcp.lnbh.ca.charter.com) has joined #ceph
[0:26] * alram (~alram@38.122.20.226) has joined #ceph
[0:26] * codice (~toodles@71-80-186-21.dhcp.lnbh.ca.charter.com) Quit (Read error: Connection reset by peer)
[0:26] <simulx> i ran a test with 3 parallel bonnie++ runs on 3 nodes. would you expect an nfs mount to a single raid device to be 30% faster than a ceph mount to 4 identical devices on that .... given that the disks are the same (we put 12 cores + 32gb ram in the ceph devices, only 4cores+16gb in the nfs). the cluster seems to scale well when there's lots of users. i don't really care, i just want to know if i've done
[0:26] <simulx> something terribly wrong and my ceph should be *way* faster
[0:26] * dalegaard (~dalegaard@vps.devrandom.dk) has joined #ceph
[0:26] * carif (~mcarifio@64.119.130.114) has joined #ceph
[0:27] <nhm> BillK: I've seen that in the past too. BTRFS tends to start out extremely fast, but eventually it slows down to the point where XFS is faster.
[1:27] <simulx> im using xfs. but as long as someone else sees it
[0:28] <simulx> the reason we use ceph is mostly scale+reliability
[0:28] <BillK> sagewk: currently 3.11.1 on gentoo, but noticed back on 3.9.x as well
[0:28] <simulx> so not too worried
[0:30] <sleinen1> sagewk: Here's a small extract from osd.23 with more logging: http://fpaste.org/41694/13799753/
[0:31] <sleinen1> (the parts where the 0.cfa is mentioned, the pg that I'm trying to re-create.)
[0:31] <gregaf> simulx: bonnie++…are you using CephFS, you mean?
[0:35] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:35] <sleinen1> sagewk: To turn off debugging again, I should injectargs '--debug-osd 0 --debug-ms 0', correct?
[0:35] <sagewk> sleinen1: hmm, what does 'ceph pg 0.cfa query' say?
[0:35] <sagewk> sleinen1: yeah
[0:35] * nwat (~nwat@eduroam-237-79.ucsc.edu) Quit (Ping timeout: 480 seconds)
[0:36] <sleinen1> sagewk: lots - looks like a long history. I'll paste it.
[0:36] * carif (~mcarifio@64.119.130.114) Quit (Quit: Ex-Chat)
[0:36] <sleinen1> sagewk: http://fpaste.org/41697/79975809/
[0:38] <sagewk> sleinen1: if you query again does the probing_osds list shrink?
[0:38] <sleinen1> No, the output seems to stay at 2356 lines.
[0:39] <sleinen1> sagewk: …and I don't see any changes in the probing_osds list.
[0:39] * diegows (~diegows@200.16.99.223) Quit (Read error: Operation timed out)
[0:41] <sagewk> sleinen1: see if the pg directory exists on all 3 osds .. if not, you could temp shut down the ones that have a copy and re-run the force create pg command
[0:41] * AfC (~andrew@2407:7800:200:1011:6e88:14ff:fe33:2a9c) has joined #ceph
[0:42] <sagewk> hmm, actually that probably is not a good idea.
[0:43] * malcolm_ (~malcolm@silico24.lnk.telstra.net) has joined #ceph
[0:43] <sagewk> davidz, sjustwork: the ceph-filestore-dump can remove a pg right?
[0:45] <sleinen1> sagewk: I find a directory corresponding to 0.cfa on several OSDs, but only one (on osd.50) has data in it: http://fpaste.org/41703/79976298/
[0:49] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[0:49] * ScOut3R (~scout3r@540099D1.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[0:51] <sagewk> sleinen1: do those osds listed exist and are they up?
[0:52] <sleinen1> Yes.
[0:53] <davidz> sagewk: yes ceph-filestore-dump can remove a pg. In my branch there is a correction to the usage to show the "remove" option.
[0:54] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[0:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) Quit (Quit: ...)
[0:57] <sagewk> sleinen1: what does ceph osd dump | grep ^pool say min_size is for pool 0 ?
[0:57] <sleinen1> sagewk: pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 4160 pgp_num 4160 last_change 28091 owner 0 crash_replay_interval 45
[1:01] <sleinen1> sagewk: the replication factor (size) used to be 2 - I increased it to 3 a couple of weeks ago, while that pg was already unclean.
[1:01] <sagewk> let's try this: make a backup of the files in that pg directory on the osd with actual objects, stop that ceph-osd, and use the ceph-filestore-dump tool to delete that p
[1:01] <sagewk> pg
[1:01] <sagewk> then start it up again
[1:02] <sagewk> actually, before you do that..
[1:02] <sagewk> stop osd.23, then redo that pg 0.cfa query and pastebin it
[1:04] * yanzheng (~zhyan@134.134.139.72) has joined #ceph
[1:05] <sleinen1> sagewk: Here's the pg query when osd.23 is down: http://fpaste.org/41708/13799775/
[1:06] <davidz> This would be the command syntax to remove a pg: ceph_filestore_dump --filestore-path <path> --journal-path <path> --type remove --pgid <pgid>
[1:07] <sagewk> davidz: there is an export too, right?
[1:07] <sagewk> sleinen1: the problem appears to be that you lost all but a partial copy of that pg sometime in the past
[1:08] <sagewk> osd.50 only has part of the pg contents, the other osds have none
[1:08] <dmick> and force create doesn't like that?
[1:08] <sagewk> sleinen1: in any case, i think if you export from 50 and then remove it, it will (i think) peer but with empty content.
[1:08] <sagewk> force create is for when there are no copies, but doesn't kick in if there is a partial copy.
[1:08] * dmick nods
[1:11] <sleinen1> Bear with me while I'm compiling the ceph-filestore-dump tool.
[1:11] <davidz> sagewk: yes, for example, ceph_filestore_dump --filestore-path <path> --journal-path <path> --type export --file <filename> --pgid <pgid>
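
Putting davidz's two invocations together, a hedged sketch of the export-then-remove flow on the osd holding the partial copy (the filestore and journal paths below are the ones sleinen1 uses later in this log; the export file name is a placeholder, and the osd must be stopped first):

    # stop the osd that holds the partial pg (init command depends on the distro),
    # then back up the pg contents to a file
    ceph_filestore_dump --filestore-path /var/lib/ceph/osd/ceph-50 --journal-path /dev/sdb14 \
        --type export --file /root/pg-0.cfa.export --pgid 0.cfa
    # remove the pg from the filestore
    ceph_filestore_dump --filestore-path /var/lib/ceph/osd/ceph-50 --journal-path /dev/sdb14 \
        --type remove --pgid 0.cfa
    # restart the osd afterwards
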
[1:11] <sleinen1> Hm, maybe this is slow because I'm compiling off an rbd volume that is on the Ceph cluster that is reshuffling because I just stopped an OSD...
[1:11] <dmick> heh
[1:15] <sleinen1> davidz: in your example, <filename> is the output file, correct?
[1:15] * BManojlovic (~steki@fo-d-130.180.254.37.targo.rs) Quit (Remote host closed the connection)
[1:17] * symmcom (~symmcom@184.70.203.22) has joined #ceph
[1:17] <sagewk> sleinen1: you can restart osd.23
[1:17] <sagewk> and/or 'ceph osd set noout' so that osds don't get marked out and reshuffle data
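
For reference, the noout flag is a cluster-wide toggle:

    # keep osds from being marked out (and data from reshuffling) during maintenance
    ceph osd set noout
    # re-enable normal behaviour afterwards
    ceph osd unset noout
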
[1:18] * TiCPU (~jeromepou@190-130.cgocable.ca) Quit (Ping timeout: 480 seconds)
[1:18] <symmcom> Hello amazing CEPH Community! I have a couple of questions, if somebody online could help me out with them
[1:19] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[1:20] <symmcom> I have a 2 Node CEPH Cluster with 4 HDD in each. Initially when i set it up, i went with default option and put Journals on each HDD. Now i am hearing it will increase performance if i put all Journal on SSD. My questions are how can i transfer journals to SSD on a live system and how big of SSD do i need on each node
[1:21] <cmdrk> the --block-size parameter doesn't seem to work on rados bench seq in 0.67. bug or working as intended ?
[1:22] <cmdrk> works for write it seems
[1:22] <davidz> slleinen1: yes <filename> is the output file
[1:22] <davidz> sleinen1: ^
[1:24] <cmdrk> i say --block-size=8388608, rados bench summary says "Write size: 8388608", gives a bandwidth that appears to be (write size * num_writes / time). looks good. the read benchmark claims that it's reading at --block-size, which gives totally unrealistic avg and cur MB/s, but the summary correctly does the math with 4MB blocks
[1:25] <sleinen1> davidz: Thanks. I have stopped osd.50 and exported the (partial) pg 0.cfa.
[1:25] <sagewk> sleinen1: delete, and then start it up...
[1:25] <sleinen1> sagewk: Like this: ceph_filestore_dump --filestore-path /var/lib/ceph/osd/ceph-50 --journal-path /dev/sdb14 --type remove --pgid 0.cfa ?
[1:25] <sagewk> 6323
[1:26] <sagewk> davidz: ?
[1:26] <cmdrk> http://paste.fedoraproject.org/41712/ if anyone is interested in taking a look
[1:27] * Cube (~Cube@12.29.178.4) has joined #ceph
[1:27] <sleinen1> sagewk/davidz: That seems to have worked - at least it told me that it removed a bunch of stuff.
[1:27] <sleinen1> sagewk: I'm restarting osd.50 now.
[1:27] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[1:29] * darkfader (~floh@88.79.251.60) Quit (Remote host closed the connection)
[1:29] * darkfader (~floh@88.79.251.60) has joined #ceph
[1:30] <cmdrk> so what appears to happen is that rados bench accepts the --block-size parameter for seq, calculates current/avg MB/s with the --block-size parameter, but in actuality is using 4MB blocks regardless of what you specify.
[1:30] <sagewk> cmdrk: heh
[1:30] * alexxy[home] (~alexxy@2001:470:1f14:106::2) Quit (Read error: Connection reset by peer)
[1:30] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[1:30] <cmdrk> i could be totally wrong but thats what it looks like..
[1:31] <dmick> so, you're saying you want command-line parameters to actually take effect? :)
[1:31] <cmdrk> or ignoring them completely rather than partially :)
[1:32] <dmick> two out of three is bad, sometimes
[1:33] <dmick> so block size only takes effect on write; if you've previously written and are now doing read tests, the operation size will be whatever was written
[1:33] <dmick> that's not fooling you, is it?
[1:34] <cmdrk> could be
[1:34] <cmdrk> i thought i was writing larger but let me try and make sure
[1:35] <cmdrk> either way there's a mismatch between summary bandwidth and current/avg
[1:36] <sleinen1> sagewk: Unfortunately, force_create_pg still doesn't seem to do anything. :-(
[1:36] <sleinen1> osd.23 logs: 2013-09-24 01:32:48.043153 7f2d92352700 10 osd.23 69168 mkpg 0.cfa already exists, skipping
[1:38] <sagewk> sleinen1: what does pg query say now?
[1:38] * dalegaard (~dalegaard@vps.devrandom.dk) Quit (Ping timeout: 480 seconds)
[1:39] <sleinen1> sagewk: http://fpaste.org/41713/37997954/
[1:39] <sagewk> turn up osd logging on osd.23, then ceph osd down 23, and then grep 0.cfa out of the resulting log
[1:39] <cmdrk> dmick: ah, yes, that seems to be what's happening. so if I set --block-size to be N for write, and --block-size to be M for read, then the curr and avg MB/s for the seq bench will be multiplied by M/N rather than reporting N . seems best to just ignore the option altogether in the read case then.
[1:40] <sagewk> there are other osds that remember the pg existed that are preventing it from proceeding
[1:40] <cmdrk> the summary calculation always comes out correct, though, fwiw
[1:41] <cmdrk> http://fpaste.org/41714/ to illustrate what im trying to say
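
A sketch of the kind of bench runs being compared here, assuming a throwaway pool named testpool and that --no-cleanup is available in this release to keep the written objects around for the read pass:

    # write 8 MiB objects and keep them for the read test
    rados bench -p testpool 60 write --block-size=8388608 --no-cleanup
    # sequential read of whatever was written; per the discussion above,
    # the block-size option is effectively misreported here
    rados bench -p testpool 60 seq
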
[1:44] <sleinen1> sagewk: Here's the result (grep 0.cfa in verbose osd.23 log after saying ceph osd down osd.23): http://fpaste.org/41715/13799798/
[1:47] <sleinen1> sagewk: are the relevant lines those with "got dup osd.XX info 0.cfa(…)"?
[1:47] <dmick> cmdrk: I'm confused, anyway; the object size is established by write, but the read operation size is independent. I agree there's a bug, it's just not as simple as I was thinking
[1:47] * rturk is now known as rturk-away
[1:49] <cmdrk> dmick: right. if there's anything I can do to test / help let me know -- it seems that if the object size is established by write, then the read operation shouldn't care about that --block-size parameter.
[1:50] <dmick> well, it's just that the read request isn't reading the whole object
[1:50] <dmick> or in your case, is reading past the end, which may be the real issue
[1:50] <cmdrk> ah
[1:50] <dmick> in effect, you issue a 32M read, but only actually read 16M, which of course reads at about the 16M rate
[1:50] <dmick> but since you asked for 32M, the code is assuming you got it
[1:50] <cmdrk> gotcha
[1:50] * yanzheng (~zhyan@134.134.139.72) Quit (Remote host closed the connection)
[1:51] <dmick> maybe it should just use the returned "amount read" for the calculations
[1:51] <dmick> I assume this makes more sense if the read size is < write size?
[1:51] <sleinen1> sagewk: What would happen if I stopped osd.23, and then *imported* the data that I got from osd.50, and started osd.23 again?
[1:51] * alram (~alram@38.122.20.226) Quit (Ping timeout: 480 seconds)
[1:52] <cmdrk> seems to make more sense to me :)
[1:52] <dmick> heh. well I mean I assume the numbers make more sense
[1:52] <cmdrk> ill give it a go, but i believe so
[1:53] <dmick> i.e. if you run a read with 8M instead of 16M, are the numbers roughly the same as 16M?
[1:53] <cmdrk> hmm.. i got a segfault instead :) ..let me try from a fresh batch of write data
[1:56] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) has joined #ceph
[1:57] <cmdrk> yep, segfault.
[1:57] * dmsimard (~Adium@69-165-206-93.cable.teksavvy.com) Quit ()
[1:57] <dmick> well that's not very friendly
[1:58] <cmdrk> http://fpaste.org/41716/37998069/ shazam
[1:59] <dmick> that's certainly worth filing
[1:59] <cmdrk> alright
[2:00] <symmcom> Anybody available to help me with some speed issue in CEPH cluster
[2:01] <sleinen1> sagewk: I'm going to give up on this broken pg for today. But it should really be possible to get rid of something like that.
[2:01] <sleinen1> Good night.
[2:01] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:02] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[2:02] <cmdrk> dmick: shall I go ahead and open a ticket on the bug tracker?
[2:03] <symmcom> I am getting 60mb/s read speed and 22 mb/s write speed
[2:03] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[2:03] <dmick> cmdrk: please
[2:03] <cmdrk> will do
[2:03] <dmick> and another for the bad stats
[2:03] <sagewk> sleinen1: importing won't help; the content is incomplete.
[2:03] <cmdrk> ok
[2:04] <sagewk> if you repeat the offline delete on osd's 15 17 18 21 and 51 then it should peer.
[2:04] * dalegaard (~dalegaard@vps.devrandom.dk) has joined #ceph
[2:04] <sleinen1> sagewk: OK, I'll try. Can I do this one at a time, or do I need to stop them all?
[2:16] * LeaChim (~LeaChim@host86-135-252-168.range86-135.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[2:21] <cmdrk> dmick: http://tracker.ceph.com/issues/6371 ; http://tracker.ceph.com/issues/6372
[2:22] <cmdrk> let me know if theres any additional information needed
[2:22] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:23] <dmick> those look great; I'll add my suspicion that the read is just completing early to 6372
[2:23] <cmdrk> ok great
[2:24] * dalegaard (~dalegaard@vps.devrandom.dk) Quit (Read error: Operation timed out)
[2:29] * dalegaard (~dalegaard@vps.devrandom.dk) has joined #ceph
[2:31] * alram (~alram@cpe-76-167-50-51.socal.res.rr.com) has joined #ceph
[2:39] * alram (~alram@cpe-76-167-50-51.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:40] * ismell_ (~ismell@host-64-17-89-79.beyondbb.com) has joined #ceph
[2:41] <sleinen1> sagewk: Thanks a bunch, that worked. Was a lot of work, too. But it's so nice to see the cluster clean again!
[2:41] * ismell (~ismell@host-64-17-89-79.beyondbb.com) Quit (Ping timeout: 480 seconds)
[2:44] * sagelap (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) Quit (Ping timeout: 480 seconds)
[2:44] * sagelap (~sage@2600:1012:b008:981f:e58a:d29c:f733:95bd) has joined #ceph
[2:45] * Camilo (~Adium@216.207.42.132) Quit (Quit: Leaving.)
[2:46] * xarses (~andreww@204.11.231.50.static.etheric.net) Quit (Ping timeout: 480 seconds)
[2:49] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[2:49] * freedomhui (~freedomhu@117.79.232.206) has joined #ceph
[2:54] * yy-nm (~Thunderbi@122.224.154.38) has joined #ceph
[2:56] * Tamil (~Adium@cpe-108-184-71-119.socal.res.rr.com) Quit (Quit: Leaving.)
[2:58] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[3:02] * Tamil (~Adium@cpe-108-184-71-119.socal.res.rr.com) has joined #ceph
[3:05] * eternaleye_ (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Ping timeout: 480 seconds)
[3:05] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) has joined #ceph
[3:09] * angdraug (~angdraug@204.11.231.50.static.etheric.net) Quit (Quit: Leaving)
[3:11] * themgt (~themgt@201-223-232-27.baf.movistar.cl) Quit (Quit: themgt)
[3:11] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Remote host closed the connection)
[3:19] <cjh_> dmick: you still around?
[3:21] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[3:21] <dmick> I am
[3:21] * sagelap (~sage@2600:1012:b008:981f:e58a:d29c:f733:95bd) Quit (Quit: Leaving.)
[3:22] <cjh_> any idea why ceph crc's everything?
[3:22] <cjh_> i know the network already checksums packets so why do it again
[3:23] * freedomhui (~freedomhu@117.79.232.206) Quit (Quit: Leaving...)
[3:26] <cjh_> if you don't know i'll ask on the list :)
[3:26] <dmick> well a naive answer is "because lots more than the network can go wrong"
[3:26] <cjh_> true
[3:27] <cjh_> i think ceph is the only storage i've encountered so far that does that
[3:27] <cjh_> i'm pretty sure gluster doesn't
[3:27] <dmick> zfs, btrfs
[3:28] <cjh_> they do that as well?
[3:28] <dmick> zfs makes a big deal about it, yeah
[3:28] <cjh_> oh cool, i should read up on that :)
[3:28] <dmick> storage buses can go wrong, disks lie, software bugs..
[3:28] <dmick> what I'm not clear on is just how much Ceph CRCs
[3:28] <dmick> (and who calculates and who consumes)
[3:29] <cjh_> right
[3:29] <cjh_> just looking at the calls at runtime i just see like mostly crc's
[3:29] <cjh_> which was curious
[3:29] * glzhao (~glzhao@li565-182.members.linode.com) has joined #ceph
[3:32] <dmick> looks like message headers, for one
[3:33] <dmick> digesting monitor sync chunks for equality comparison
[3:34] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:34] <dmick> and three separate chunks of the message itself
[3:35] <dmick> journal entries
[3:35] <dmick> PGLog entries
[3:36] * themgt (~themgt@201-223-232-27.baf.movistar.cl) has joined #ceph
[3:39] * glzhao_ (~glzhao@211.155.113.204) has joined #ceph
[3:41] * glzhao (~glzhao@li565-182.members.linode.com) Quit (Ping timeout: 480 seconds)
[3:42] * sagelap (~sage@76.89.177.113) has joined #ceph
[3:42] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[3:48] * mjeanson (~mjeanson@00012705.user.oftc.net) Quit (Ping timeout: 480 seconds)
[3:50] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[3:51] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[3:52] * peetaur (~peter@CPEbc1401e60493-CMbc1401e60490.cpe.net.cable.rogers.com) Quit (Ping timeout: 480 seconds)
[3:56] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) Quit (Quit: Leaving.)
[3:59] * yanzheng (~zhyan@134.134.139.72) has joined #ceph
[4:03] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:09] <symmcom> Hello! can somebody help me pinpoint the slowness of my ceph cluster, i have read 60 mb/s and write 21 mb/s
[4:10] * Kupo1 (~tyler.wil@wsip-68-14-231-140.ph.ph.cox.net) Quit (Quit: Leaving.)
[4:11] * huangjun (~kvirc@111.174.90.255) has joined #ceph
[4:14] <huangjun> when i write big files (10GB) to the cluster, ceph -w shows "slow request" messages for some osds, but the osd disk is not busy,
[4:15] <huangjun> so what could cause the osds to handle requests slowly?
[4:15] * shang (~ShangWu@175.41.48.77) has joined #ceph
[4:17] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[4:18] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Read error: Operation timed out)
[4:18] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:18] <symmcom> I got "slow request message" twice during cluster recovery due to hard drive failure. But as soon as the recovery was done the msg never came back. i am guessing its because hdd was not transferring data fast enough
[4:19] * AfC (~andrew@2407:7800:200:1011:6e88:14ff:fe33:2a9c) Quit (Quit: Leaving.)
[4:19] * freedomhui (~freedomhu@211.155.113.204) has joined #ceph
[4:19] <huangjun> symmcom: but iostat shows that osd disk doesn't have much load
[4:21] * peetaur (~peter@CPEbc1401e60493-CMbc1401e60490.cpe.net.cable.rogers.com) has joined #ceph
[4:21] <huangjun> and we use ssd as osd journal, but the performance didn't improve in our test situations.
[4:22] <symmcom> that was just my guess. :) I myself trying to find answer for my cluster slowness, i got terrible read/write speed
[4:22] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[4:23] <huangjun> will SSD improve the performance much?
[4:25] <symmcom> in my case?
[4:25] <huangjun> yes
[4:26] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:27] <symmcom> i was considering SSD when i was setting up but then i stumbled upon an article which said it is best to put journal on OSD itself if using large number of hdds
[4:28] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[4:28] <huangjun> can you find out the webpage?
[4:29] * erice_ (~erice@host-sb226.res.openband.net) has joined #ceph
[4:30] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[4:31] <symmcom> See the very end of this article http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals
[4:32] * erice (~erice@50.240.86.181) Quit (Ping timeout: 480 seconds)
[4:32] <huangjun> thanks
[4:33] <symmcom> my cluster has been running great for the last 7 months, except for the speed
[4:34] <huangjun> what are the write and read speeds? and how many osds?
[4:35] <symmcom> 2 nodes, 4 hdd in each, avg write 20 mb/s read 60mb/s. gigabit network, all hdd are connected to Intel RAID controller in JBOD
[4:36] <symmcom> all OSDs are ext4 filesystem
[4:37] <huangjun> 20mb/s and 60mb/s is ok?
[4:37] <symmcom> 20mb/s write seems to me very slow
[4:38] <huangjun> really slow
[4:38] <huangjun> and we shouldn't use ceph with only two hosts and <10 OSDs?
[4:39] <symmcom> 3rd node will increase performance ?
[4:40] <huangjun> i'm not sure
[4:40] * mjeanson (~mjeanson@bell.multivax.ca) has joined #ceph
[4:41] <symmcom> what's the avg r/w speed you get, huangjun
[4:41] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[4:42] <huangjun> write(53MB/s) read(90MB/s)
[4:42] <huangjun> the write speed pretty slow
[4:42] <symmcom> with SSD Journaling?
[4:42] <symmcom> faster than what i got :)
[4:43] <huangjun> and i thought of using ssd as the osd journal, but no performance improvement
[4:44] <symmcom> i personally like the idea of Journal on same OSD because if one hdd dies i just lose the journal for that drive not the entire node
[4:44] * andes (~oftc-webi@183.62.249.162) has joined #ceph
[4:48] <andes> I have some basic questions, can someone help? I want to use a block device and limit the storage quota on ubuntu. I set up 'quota', but every time I reboot the system, the mounted block device is not ready. Can I set a script to map the block device on system boot-up and mount it to the fs?
[4:54] * AfC (~andrew@ppp244-218.static.internode.on.net) has joined #ceph
[4:54] * Tamil (~Adium@cpe-108-184-71-119.socal.res.rr.com) Quit (Quit: Leaving.)
[4:55] <huangjun> andes: you can set the "rbd map" and "mount -t ceph" commands in /etc/rc.local
[4:56] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[4:56] * freedomhui (~freedomhu@211.155.113.204) Quit (Quit: Leaving...)
[4:57] <dmick> as of Jun 21, there's a /etc/init.d/rbdmap script that uses /etc/ceph/rbdmap
[4:57] <dmick> that's in Dumpling and later
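
A hedged sketch of both approaches (pool, image, mountpoint and keyring names are examples; the rbdmap entry format shown is the usual pool/image plus id/keyring form):

    # /etc/ceph/rbdmap (one mapping per line), consumed by the rbdmap init script
    rbd/myimage    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

    # or, the older style: map and mount from /etc/rc.local
    # (assumes a filesystem already exists on the image and the mountpoint exists)
    rbd map rbd/myimage --id admin
    mount /dev/rbd/rbd/myimage /mnt/myimage
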
[5:06] * fireD (~fireD@93-142-250-247.adsl.net.t-com.hr) has joined #ceph
[5:07] * fireD_ (~fireD@93-136-89-153.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:10] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[5:13] * andes (~oftc-webi@183.62.249.162) Quit (Remote host closed the connection)
[5:14] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[5:14] * freedomhui (~freedomhu@li565-182.members.linode.com) has joined #ceph
[5:15] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[5:15] * KindTwo (~KindOne@h69.0.40.162.dynamic.ip.windstream.net) has joined #ceph
[5:16] * KindTwo is now known as KindOne
[5:21] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[5:21] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit ()
[5:23] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.90.1 [Firefox 24.0/20130910160258])
[5:24] * sjm (~sjm@12.29.178.4) has joined #ceph
[5:25] * AfC (~andrew@ppp244-218.static.internode.on.net) Quit (Ping timeout: 480 seconds)
[5:25] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) has joined #ceph
[5:25] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) has joined #ceph
[5:26] * eternaleye_ (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) has joined #ceph
[5:26] * Tamil (~Adium@cpe-108-184-71-119.socal.res.rr.com) has joined #ceph
[5:28] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[5:30] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Quit: Bye!)
[5:32] * freedomhu (~freedomhu@117.79.232.226) has joined #ceph
[5:38] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[5:39] * freedomhui (~freedomhu@li565-182.members.linode.com) Quit (Ping timeout: 480 seconds)
[5:39] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[5:44] * Tamil (~Adium@cpe-108-184-71-119.socal.res.rr.com) has left #ceph
[5:51] * yy-nm (~Thunderbi@122.224.154.38) Quit (Quit: yy-nm)
[6:01] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:02] * andes (~oftc-webi@183.62.249.162) has joined #ceph
[6:06] * malcolm_ (~malcolm@silico24.lnk.telstra.net) Quit (Quit: Konversation terminated!)
[6:09] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[6:10] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) has joined #ceph
[6:15] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:16] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:16] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[6:18] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[6:19] * freedomhu (~freedomhu@117.79.232.226) Quit (Quit: Leaving...)
[6:22] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[6:23] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[6:25] * ggreg (~ggreg@int.0x80.net) Quit (Ping timeout: 480 seconds)
[6:25] * `10` (~10@juke.fm) Quit (Ping timeout: 480 seconds)
[6:26] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[6:34] * AfC (~andrew@59.167.244.218) has joined #ceph
[6:37] * sjm (~sjm@12.29.178.4) Quit (Read error: Operation timed out)
[6:42] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[6:49] * cfreak201 (~cfreak200@p4FF3EF6C.dip0.t-ipconnect.de) has joined #ceph
[6:52] * cfreak200 (~cfreak200@p4FF3E2F9.dip0.t-ipconnect.de) Quit (Ping timeout: 480 seconds)
[7:22] * Kupo1 (~Kupo-Lapt@ip70-162-72-3.ph.ph.cox.net) has joined #ceph
[7:22] <Kupo1> Hey all, im kinda new to the terms around here. Is an OSD basically a drive used for the storage cluster?
[7:23] <dmick> an OSD is a daemon that represents one independent set of information; you can think of it as "drive + daemon", which it usually is, although an OSD can use more than one drive (it just uses a filesystem)
[7:24] <Kupo1> I guess im a bit lost then, I've got 3 servers with 9 blank drives each, would i create 3x9 OSD's?
[7:25] <Kupo1> the end goal being http://ceph.com/docs/master/rbd/rbd-openstack/
[7:25] <xarses> you can; for performance, it's not recommended to use the os disk
[7:25] <Kupo1> I've got a separate OS disk for each server
[7:26] <xarses> people tend to raid 0 disks sometimes to reduce the overhead of managing as many osd's
[7:26] <xarses> but an osd is a minimum unit of redundancy
[7:26] <xarses> you would have a replica in another osd
[7:26] <xarses> with the default crushmap, this would be on another server too
[7:28] <Kupo1> Does it do that automatically by default or has to be instructed?
[7:28] <Kupo1> I assumed it would have a raid-esque operation?
[7:30] <xarses> yes, correct
[7:31] <xarses> the default is a replica level of 2, and osd's from separate nodes
[7:31] <xarses> you can of course extend this significantly
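
For example, the replica count is a per-pool setting that can be checked and raised (the pool name is a placeholder):

    # show the current number of replicas for a pool
    ceph osd pool get rbd size
    # raise it to three copies
    ceph osd pool set rbd size 3
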
[7:32] <lurbs> Personally I'd argue against any RAID underlying the OSDs, and to use a separate disk for each. But that's a design decision you get to make.
[7:32] <xarses> for smaller clusters that seems to be the consensus lurbs
[7:32] <Kupo1> Currently the drives are simply connected in pass-through from the raid card, so in essence completely managed by ceph
[7:33] <Kupo1> however 3x9 drives im not sure you would consider small
[7:33] <xarses> but there is some discussion that at scale maintaining that many osd's is more work, which striped raids help reduce
[7:33] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[7:33] <xarses> while still retaining total storage
[7:33] <lurbs> Also, each OSD will require a journal. This is typically a partition on an SSD, with a few OSDs sharing the same physical device.
[7:33] <xarses> and io
[7:33] <lurbs> Although it can be a partition on the same spinning disk you're using for the OSD itself.
[7:34] <xarses> http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals
[7:34] <xarses> http://comments.gmane.org/gmane.comp.file-systems.ceph.user/4267
[7:34] <Kupo1> one journal per drive?
[7:34] <lurbs> One journal per OSD.
[7:34] <xarses> Kupo1 lurbs ^^
[7:34] <Kupo1> I guess an OSD is a filesystem not a drive
[7:34] <Kupo1> or partition*
[7:35] <xarses> one journal is required per osd
[7:35] <xarses> where you put that osd is up to you
[7:35] <Kupo1> does ceph-deploy automatically create a journal?
[7:35] <Kupo1> http://ceph.com/docs/next/rados/deployment/ceph-deploy-osd/
[7:35] <xarses> yes
[7:36] <xarses> 1 GiB by default
[7:36] <Kupo1> so creating an OSD for each disk is correct; do I need to map each OSD to the same /dev/ point? (/dev/ssd1 in the example)
[7:38] <xarses> if you specify the journal, it must be a mounted path to a file that will be used
[7:38] <xarses> /mnt/ssd/osd.1
[7:38] <xarses> if you put another one on the same volume /mnt/ssd/osd.2
[7:39] <Kupo1> Im assuming defaults for ceph-deploy
[7:39] <Kupo1> as this is just a test environment
[7:41] <xarses> if you aren't too keen on the perf tuning then skip the ssd journals
[7:41] <xarses> ceph-deploy will put it on the disk with the osd for you
[7:42] <Kupo1> Yeah plan on having a dedicated ssd per server for journals, if that works
[7:42] * carif (~mcarifio@146-115-183-141.c3-0.wtr-ubr1.sbo-wtr.ma.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[7:44] <Kupo1> so this would work? ceph-deploy osd prepare osdserver1:sdb:/dev/ssd1 osdserver1:sdc:/dev/ssd1 osdserver1:sdd:/dev/ssd1 or would i need to create separate /dev/* points?
[7:51] <xarses> if you pass a journal to ceph-deploy it must be a mounted filesystem
[7:51] <xarses> and the path must be the unique name of a file within it
[7:52] <Kupo1> the journal will be on the remote node correct?
[7:52] <xarses> so ceph-deploy osd prepare node1:sdb:/mnt/ssd/osd.sdb node1:sdc:/mnt/ssd/osd.sdc
[7:52] <xarses> yes
[7:53] <Kupo1> and /mnt/ssd/* would just be the OS / partition in that case, unless i setup a special drive/location for it
[7:55] <xarses> yes, but don't use the os partition
[7:55] <xarses> you're better off with the journal on the osd drive
[7:55] <Kupo1> How would I do that?
[7:55] <xarses> don't pass a journal option
[7:56] <Kupo1> ah okay
[7:56] <Kupo1> missing link
[7:56] <xarses> it will create it inside the osd mount
[7:56] <Kupo1> didn't know that was optional
[7:56] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[7:56] <xarses> just pass node1:sdb node1:sdc
[7:56] <xarses> and it will create /dev/sdb1 and mount it to /var/lib/ceph/osd.X
[7:57] <xarses> and create the journal in /var/lib/ceph/osd.X/journal
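
Putting the two forms side by side, a sketch using the hostnames and paths from this discussion:

    # journal co-located on each osd disk (ceph-deploy creates it for you)
    ceph-deploy osd prepare node1:sdb node1:sdc
    # journal as a file on a mounted ssd filesystem, one file per osd
    ceph-deploy osd prepare node1:sdb:/mnt/ssd/osd.sdb node1:sdc:/mnt/ssd/osd.sdc
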
[7:57] <Kupo1> what would i do after creating/activating every OSD?
[7:57] <xarses> you need to deploy your monitors first
[7:57] <xarses> then osd prepare then osd activate
[7:58] <xarses> then need to create some pools
[7:58] <xarses> and some auth keys, and away you go
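
A condensed, hedged sketch of that order of operations with ceph-deploy (hostnames, devices and pool names are placeholders; the quickstart guides are the authoritative version, and "osd create" combines the prepare/activate pair mentioned above):

    # define the cluster and deploy the monitor(s)
    ceph-deploy new mon1
    ceph-deploy install mon1 node1 node2 node3
    ceph-deploy mon create mon1
    ceph-deploy gatherkeys mon1
    # prepare and activate the osds in one step
    ceph-deploy osd create node1:sdb node2:sdb node3:sdb
    # create a pool and a restricted client key for it
    ceph osd pool create volumes 128
    ceph auth get-or-create client.volumes mon 'allow r' osd 'allow rwx pool=volumes'
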
[7:58] <Kupo1> I got the monitors, at least from this page
[7:58] <Kupo1> http://ceph.com/docs/next/rados/deployment/ceph-deploy-mon/
[7:58] <xarses> just follow any of the quickstart guides
[7:58] <xarses> they will cover monitors and osd's
[7:58] <Kupo1> what role do monitors play?
[7:59] <xarses> they keep track of the osd's
[7:59] <xarses> and help the clients find the osd's
[7:59] <Kupo1> and it's advised they be separated from the OSD nodes?
[8:00] <xarses> it is
[8:00] <xarses> we stick them on our openstack controller nodes
[8:00] <xarses> only because the monitors can't update if they are too busy
[8:00] <Kupo1> In my case with only one controller node, would i be fine with only one monitor?
[8:00] <xarses> not for any production
[8:01] <xarses> but you can always 1) add more later 2) add them to the osd servers
[8:01] <Kupo1> in that case would it be better to put them on OSD hosts?
[8:01] <xarses> the only requirement is that you have 3 or more
[8:01] * foosinn (~stefan@office.unitedcolo.de) has joined #ceph
[8:01] <xarses> for a good quorum
[8:02] <xarses> 1 is fine for r&d
[8:02] <xarses> i do need to head to bed
[8:02] <Kupo1> what kind of overhead is involved with monitors & osd's?
[8:02] <Kupo1> alright thanks for the help
[8:02] <xarses> the osd's are quite busy cpu wise
[8:03] <xarses> the mon's are low in the spectrum, but if enough can't do their thing, the cluster will fall down
[8:03] <Kupo1> Would I need a MDS as well, if im using it for Openstack?
[8:03] <xarses> (in a quorum)
[8:03] <xarses> no
[8:03] <xarses> object storage needs radosgw
[8:03] <xarses> cinder and glance use rbd
[8:04] * topro (~topro@host-62-245-142-50.customer.m-online.net) has joined #ceph
[8:04] <Kupo1> I think im supposed to use RBD
[8:04] <xarses> the same
[8:05] <Kupo1> I think the openstack guide explains that process
[8:05] <xarses> others would be happy to help you further, otherwise ill be on in the morning. good luck
[8:05] <Kupo1> thanks for the help
[8:08] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[8:15] * yy-nm (~Thunderbi@122.224.154.38) has joined #ceph
[8:17] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Read error: Operation timed out)
[8:26] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[8:27] * tobru (~quassel@2a02:41a:3999::94) Quit (Remote host closed the connection)
[8:28] * tobru (~quassel@2a02:41a:3999::94) has joined #ceph
[8:30] * freedomhui (~freedomhu@117.79.232.194) has joined #ceph
[8:30] * Vjarjadian (~IceChat77@90.208.125.77) Quit (Quit: Don't push the red button!)
[8:32] <huangjun> guys, where can i get smalliobenchfs?
[8:32] * Cube (~Cube@12.29.178.4) Quit (Quit: Leaving.)
[8:32] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Read error: Operation timed out)
[8:37] * sleinen1 (~Adium@2001:620:0:25:fd72:2d18:2b1:e45c) Quit (Quit: Leaving.)
[8:37] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[8:38] * sagelap (~sage@76.89.177.113) Quit (Read error: Operation timed out)
[8:45] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[8:48] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has joined #ceph
[8:49] * wogri_risc (~wogri_ris@ro.risc.uni-linz.ac.at) has left #ceph
[8:56] * dpippenger (~riven@cpe-76-166-208-83.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[8:57] * sleinen (~Adium@2001:620:0:2d:3410:64e8:f5a3:29f7) has joined #ceph
[8:58] * LeaChim (~LeaChim@host86-135-252-168.range86-135.btcentralplus.com) has joined #ceph
[8:59] * sleinen1 (~Adium@2001:620:0:25:853a:6f40:d212:3c93) has joined #ceph
[9:04] * Cube (~Cube@12.29.178.4) has joined #ceph
[9:05] * sleinen (~Adium@2001:620:0:2d:3410:64e8:f5a3:29f7) Quit (Ping timeout: 480 seconds)
[9:10] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:15] * Cube (~Cube@12.29.178.4) Quit (Ping timeout: 480 seconds)
[9:16] * ggreg (~ggreg@int.0x80.net) has joined #ceph
[9:23] * JustEra (~JustEra@89.234.148.11) has joined #ceph
[9:25] * jcfischer (~fischer@macjcf.switch.ch) has joined #ceph
[9:29] * Djinh (~alexlh@ardbeg.funk.org) has joined #ceph
[9:31] * sage (~sage@76.89.177.113) Quit (Ping timeout: 480 seconds)
[9:32] * sage (~sage@76.89.177.113) has joined #ceph
[9:36] * rendar (~s@host116-161-dynamic.1-87-r.retail.telecomitalia.it) has joined #ceph
[9:40] * AfC (~andrew@59.167.244.218) Quit (Quit: Leaving.)
[9:43] * long (~long@58.213.102.114) has joined #ceph
[9:52] * gyj6g6f5yiy98tyd (~hvbeuveig@41.46.217.191) has joined #ceph
[9:52] <gyj6g6f5yiy98tyd> Do skype, yahoo and other chat and social communication progremmes spy for israel &usa?
[9:52] <gyj6g6f5yiy98tyd> Do they record and analyse everything we do on the internet?
[9:52] <gyj6g6f5yiy98tyd> Does the chat spy for America and Israel???
[9:52] * gyj6g6f5yiy98tyd (~hvbeuveig@41.46.217.191) Quit (autokilled: Do not spam. Mail support@oftc.net with questions. (2013-09-24 07:52:20))
[9:52] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) Quit (Quit: Leaving.)
[9:58] * root_ (~root@58.213.102.114) has joined #ceph
[10:03] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) Quit (Read error: Connection reset by peer)
[10:07] * shdb (~shdb@gw.ptr-62-65-159-122.customer.ch.netstream.com) has joined #ceph
[10:11] * Rocky_ (~r.nap@188.205.52.204) Quit (Quit: **Poof**)
[10:11] * Rocky (~r.nap@188.205.52.204) has joined #ceph
[10:12] * vhasi (vhasi@vha.si) Quit (Ping timeout: 480 seconds)
[10:21] * vhasi (vhasi@vha.si) has joined #ceph
[10:21] * gucki (~smuxi@77-56-39-154.dclient.hispeed.ch) has joined #ceph
[10:27] * freedomhui (~freedomhu@117.79.232.194) Quit (Quit: Leaving...)
[10:29] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[10:32] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) has joined #ceph
[10:37] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[10:37] * lubyou (~js@dsl093-174-037-223.dialup.saveho.com) has joined #ceph
[10:38] * jbd_ (~jbd_@2001:41d0:52:a00::77) has joined #ceph
[10:38] * freedomhui (~freedomhu@117.79.232.206) has joined #ceph
[10:40] <lubyou> attempting to setup a monitor node using "ceph deploy", ubuntu 12.04, dumpling. Having some issues, though: http://dpaste.com/1394024/
[10:41] * gucki (~smuxi@77-56-39-154.dclient.hispeed.ch) Quit (Read error: Connection reset by peer)
[10:43] <JustEra> lubyou, what return the command "ceph-mon --cluster ceph --mkfs -i mon-001 --keyring /var/lib/ceph/tmp/ceph-mon-001.mon.keyring" ?
[10:44] <lubyou> JustEra, "too many arguments: [--cluster,ceph]"
[10:46] <JustEra> ceph-deploy version ?
[10:46] <lubyou> JustEra, cloned the git repo ~30m ago
[10:47] <lubyou> JustEra, 1.2.6
[10:48] * yanlb (~bean@70.39.187.196) has joined #ceph
[10:48] * yanlb (~bean@70.39.187.196) Quit ()
[10:49] * yanlb (~bean@70.39.187.196) has joined #ceph
[10:49] <JustEra> lubyou, do you have an existing cluster or is it a new one?
[10:53] <lubyou> JustEra, brand new, first time setup :)
[10:53] * claenjoy (~leggenda@37.157.33.36) has joined #ceph
[10:54] * claenjoy (~leggenda@37.157.33.36) Quit ()
[10:54] <JustEra> lubyou, try "ceph-deploy mon create mon-001"
[10:54] * claenjoy (~leggenda@37.157.33.36) has joined #ceph
[10:56] <lubyou> JustEra, http://dpaste.com/1394046/
[10:58] <JustEra> hmmm "Deploying mon, cluster ceph hosts ceph@mon-001" did you added the "ceph@" ? coz you have to add that part in the ssh connexion not in the cmd
[11:00] <lubyou> JustEra, I actually did
[11:00] * yy-nm (~Thunderbi@122.224.154.38) Quit (Quit: yy-nm)
[11:02] <lubyou> JustEra, how would I specify the ssh user?
[11:02] <lubyou> how else*
[11:03] <JustEra> http://ceph.com/docs/master/rados/deployment/preflight-checklist/#install-an-ssh-server
[11:05] <lubyou> JustEra, right, let me give it a try
[11:06] * hughsaunders (~hughsaund@2001:4800:780e:510:fdaa:9d7a:ff04:4622) has joined #ceph
[11:07] <lubyou> JustEra, http://dpaste.com/1394066/
[11:07] * Cube (~Cube@12.29.178.4) has joined #ceph
[11:12] * sleinen (~Adium@2001:620:0:2d:1c4b:edf1:bfa0:f2d9) has joined #ceph
[11:12] * tryggvil (~tryggvil@178.19.53.254) has joined #ceph
[11:12] <JustEra> lubyou, you have the git repo cloned right ?
[11:12] <lubyou> JustEra, I do.
[11:13] * agh (~oftc-webi@gw-to-666.outscale.net) has joined #ceph
[11:13] <JustEra> Go to "ceph_deploy/hosts/common.py" and remove the line 69 and recompile
[11:13] <agh> Hello,
[11:14] <agh> is there a way to get total IOPS counter for a whole cluster ?
[11:15] * Cube (~Cube@12.29.178.4) Quit (Ping timeout: 480 seconds)
[11:16] <lubyou> JustEra, http://dpaste.com/1394077/ "[mon-001][ERROR ] must specify '--mon-data=foo' data path"
[11:16] * yanzheng (~zhyan@134.134.139.72) Quit (Quit: Leaving)
[11:19] * sleinen1 (~Adium@2001:620:0:25:853a:6f40:d212:3c93) Quit (Ping timeout: 480 seconds)
[11:20] <root_> Hi Guys, is a block device suitable for file storage?
[11:23] <absynth> what?
[11:25] * jksM (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[11:25] * jks (~jks@3e6b5724.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[11:25] <hughsaunders> root_: yes, once you have put a filesystem on it and mounted that filesystem
[11:36] * long (~long@58.213.102.114) Quit (Quit: Leaving)
[11:46] * mattt_ (~mattt@92.52.76.140) has joined #ceph
[11:46] <mattt_> is there a documented process for replacing a failed drive?
[11:47] <mattt_> i don't want to take drive out and rebalance data, i just want to replace drive and have data migrate back once drive is back in
[12:01] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[12:09] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[12:14] * huangjun (~kvirc@111.174.90.255) Quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[12:33] * glzhao_ (~glzhao@211.155.113.204) Quit (Quit: leaving)
[12:36] * ScOut3R (~ScOut3R@catv-89-133-25-52.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[12:45] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) has joined #ceph
[12:51] * andrei (~andrei@46.229.149.194) has joined #ceph
[12:52] * `10` (~10@juke.fm) has joined #ceph
[12:56] * diegows (~diegows@190.190.11.42) has joined #ceph
[13:00] * sleinen1 (~Adium@2001:620:0:26:d0ba:37aa:fc15:cf47) has joined #ceph
[13:03] * allsystemsarego (~allsystem@188.25.131.49) has joined #ceph
[13:07] * sleinen (~Adium@2001:620:0:2d:1c4b:edf1:bfa0:f2d9) Quit (Ping timeout: 480 seconds)
[13:07] * Cube (~Cube@12.29.178.4) has joined #ceph
[13:15] * lubyou (~js@dsl093-174-037-223.dialup.saveho.com) Quit (Quit: Leaving)
[13:15] * Cube (~Cube@12.29.178.4) Quit (Ping timeout: 480 seconds)
[13:28] * lubyou (~lubyou@dsl093-174-037-223.dialup.saveho.com) has joined #ceph
[13:30] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[13:31] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) Quit (Remote host closed the connection)
[13:32] <andrei> hello guys
[13:32] <andrei> could some one help me with addressing the slow requests?
[13:32] <andrei> i am having tons of them every day
[13:32] <andrei> between 5000 and 10000 every day
[13:35] <andrei> majority of them are like these:
[13:35] <andrei> [WRN] slow request 30.489283 seconds old, received at 2013-09-24 12:32:45.628345: osd_op(client.1794311.0:1957830 rbd_data.1b429e2ae8944a.00000000000000d0 [write 1884160~4096] 5.464929f e18204) v4 currently waiting for subops from [14]
[13:36] <andrei> I would say 60% are like the one above
[13:36] <andrei> the rest are:
[13:36] <andrei> [WRN] slow request 30.300683 seconds old, received at 2013-09-24 12:32:50.401252: osd_op(client.1915154.0:4826030 rbd_data.1d38c22ae8944a.000000000000ddcf [write 4181504~12800] 5.8879f129 e18204) v4 currently waiting for degraded object
[13:36] <andrei> and
[13:37] <andrei> [WRN] slow request 33.788814 seconds old, received at 2013-09-24 12:32:44.531784: osd_op(client.1796680.0:12988074 rbd_data.1b65422ae8944a.0000000000000823 [write 3088384~24576] 5.b3d469ba e18204) v4 currently reached pg
[13:37] <andrei> i've noticed that during the time of slow requests the virtual machines are having hung tasks
[13:37] <andrei> i am on 0.67.3
[13:37] <andrei> ubuntu 12.04 servers
[13:38] * yanlb (~bean@70.39.187.196) Quit (Quit: Konversation terminated!)
[13:38] * marrusl (~mark@209-150-43-182.c3-0.wsd-ubr2.qens-wsd.ny.cable.rcn.com) has joined #ceph
[13:39] <joelio> andrei: have you read http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
[13:40] <joelio> mattt_: just set noout flag and it won't rebalance?
[13:41] <joelio> (this is in the operational handling in docs)
[13:42] <mattt_> joelio: right, but then getting the disk reformatted and ceph-osd daemon prepped to run again
[13:42] <mattt_> joelio: i've been using ceph-deploy largely and i'm not sure if that work flow handles this situation
[13:43] <joelio> of course it does, losing a disk is a (fairly) common experience
[13:43] <joelio> I've done it with ceph-deploy
[13:44] <joelio> set noout before you start, add the extra osd, remove the erroneous osd, turn off noout
[13:44] <joelio> if the disk has been failing, chances are it's already been marked out and possibly rebalanced anyway?
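
A hedged sketch of the replace-a-failed-disk flow joelio describes (osd id, host and device are examples; daemon stop/start commands vary by init system):

    # keep the cluster from rebalancing while the disk is swapped
    ceph osd set noout
    # stop the failed osd daemon, then retire it from the cluster
    ceph osd out 12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # prepare the replacement disk as a new osd
    ceph-deploy osd prepare node1:sdd
    # let rebalancing resume
    ceph osd unset noout
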
[13:46] <mattt_> joelio: cool, i need to re-look
[13:50] * yanzheng (~zhyan@134.134.137.75) has joined #ceph
[13:54] <agh> is there a way to get total IOPS counter for a whole cluster ?
[13:56] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[13:56] * kraken (~kraken@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[13:58] * shang (~ShangWu@175.41.48.77) Quit (Quit: Ex-Chat)
[14:01] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:01] * jcfischer_ (~fischer@130.59.94.234) has joined #ceph
[14:02] * AfC (~andrew@2001:44b8:31cb:d400:2ad2:44ff:fe08:a4c) has joined #ceph
[14:04] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[14:05] <yanzheng> agh, ceph -w
[14:05] * jcfischer__ (~fischer@user-28-10.vpn.switch.ch) has joined #ceph
[14:08] * jcfischer (~fischer@macjcf.switch.ch) Quit (Ping timeout: 480 seconds)
[14:08] * jcfischer__ is now known as jcfischer
[14:08] <agh> yanzheng: yes, but where do you read the IOPS counter ?
[14:10] <agh> yanzheng: I have 70 OSDs on my cluster, and, when I do a ceph-w, i am between 50 op/s and 250 op/s. I can't believe that it's the real figure
[14:10] * jcfischer_ (~fischer@130.59.94.234) Quit (Ping timeout: 480 seconds)
[14:12] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[14:20] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[14:30] <yanzheng> agh, something must be wrong, do you have up-to-date osds and monitors
[14:31] <yanzheng> iops of my 3 osd test cluster can reach 2000
[14:33] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has left #ceph
[14:33] * torment (~torment@pool-72-91-185-241.tampfl.fios.verizon.net) has joined #ceph
[14:35] * torment_ (~torment@pool-72-64-182-81.tampfl.fios.verizon.net) Quit (Read error: Operation timed out)
[14:38] <andrei> joelio, thanks I will check it out
[14:38] <andrei> joelio, i've noticed that after upgrading from 0.61 branch i've started having slow requests on a regular basis
[14:38] <andrei> whereas with 0.61 i only had them during high cluster load
[14:38] * TiCPU (~jeromepou@190-130.cgocable.ca) has joined #ceph
[14:38] <andrei> with 0.67 I am having tons of them every day
[14:39] <andrei> without much cluster load
[14:39] <andrei> that is what's worrying me
[14:43] * yanzheng (~zhyan@134.134.137.75) Quit (Remote host closed the connection)
[14:46] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:48] * wschulze (~wschulze@cpe-72-229-37-201.nyc.res.rr.com) has joined #ceph
[14:52] * sleinen1 (~Adium@2001:620:0:26:d0ba:37aa:fc15:cf47) Quit (Quit: Leaving.)
[14:52] * sleinen (~Adium@130.59.94.202) has joined #ceph
[14:54] * markbby (~Adium@168.94.245.2) has joined #ceph
[14:57] * yanzheng (~zhyan@134.134.137.75) has joined #ceph
[14:58] * sleinen1 (~Adium@130.59.94.202) has joined #ceph
[14:58] * markbby (~Adium@168.94.245.2) Quit ()
[14:59] * markbby (~Adium@168.94.245.2) has joined #ceph
[14:59] * sleinen2 (~Adium@2001:620:0:25:451b:3685:65c0:696e) has joined #ceph
[15:00] * sleinen (~Adium@130.59.94.202) Quit (Ping timeout: 480 seconds)
[15:01] * KevinPerks (~Adium@cpe-066-026-252-218.triad.res.rr.com) has joined #ceph
[15:01] * Cube (~Cube@wr1.pit.paircolo.net) has joined #ceph
[15:04] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:04] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:06] * sleinen1 (~Adium@130.59.94.202) Quit (Ping timeout: 480 seconds)
[15:09] * jcfischer (~fischer@user-28-10.vpn.switch.ch) Quit (Quit: jcfischer)
[15:09] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:10] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:10] * sjm (~sjm@wr1.pit.paircolo.net) has joined #ceph
[15:13] * clayb (~kvirc@69.191.241.59) has joined #ceph
[15:14] * TiCPU (~jeromepou@190-130.cgocable.ca) Quit (Ping timeout: 480 seconds)
[15:14] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) Quit (Ping timeout: 480 seconds)
[15:15] * jcfischer (~fischer@130.59.94.234) has joined #ceph
[15:15] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:15] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:16] * sjm_ (~sjm@wr1.pit.paircolo.net) has joined #ceph
[15:17] * jcfischer_ (~fischer@macjcf.switch.ch) has joined #ceph
[15:18] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[15:20] * ScOut3R (~ScOut3R@catv-89-133-21-203.catv.broadband.hu) Quit (Read error: Operation timed out)
[15:21] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:21] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:23] * jcfischer (~fischer@130.59.94.234) Quit (Ping timeout: 480 seconds)
[15:23] * jcfischer_ is now known as jcfischer
[15:26] * jeff-YF (~jeffyf@67.23.117.122) has joined #ceph
[15:28] * TiCPU (~jeromepou@190-130.cgocable.ca) has joined #ceph
[15:31] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[15:32] * TiCPU (~jeromepou@190-130.cgocable.ca) Quit (Remote host closed the connection)
[15:34] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[15:37] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:37] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:38] * b1tbkt (~b1tbkt@24-217-192-155.dhcp.stls.mo.charter.com) Quit (Remote host closed the connection)
[15:39] * doxavore (~doug@99-7-52-88.lightspeed.rcsntx.sbcglobal.net) has joined #ceph
[15:39] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Operation timed out)
[15:39] * agh (~oftc-webi@gw-to-666.outscale.net) Quit (Quit: Page closed)
[15:43] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:43] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:43] <swinchen> Does anyone have a link describing user management? I am digging through the docs and am having a hard time finding it.
[15:46] <mattt_> swinchen: user management? for radosgw?
[15:48] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[15:48] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:49] <swinchen> mattt_: err, access management to block storage...
[15:49] * jcfischer (~fischer@macjcf.switch.ch) Quit (Ping timeout: 480 seconds)
[15:50] * markbby (~Adium@168.94.245.2) Quit ()
[15:50] * markbby (~Adium@168.94.245.2) has joined #ceph
[15:50] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[15:51] <swinchen> For example... I set up a storage cluster and I have a bunch of users. How do I say that Bob has access to a certain pool, but Lindsey has access to a different pool
[15:52] <mikedawson> swinchen: there isn't a particularly good answer there. Most people use a framework (i.e. openstack or cloudstack) to enforce those rules. In that setup, there is typically one user with rights to the pool housing rbd images
[15:53] <swinchen> mikedawson: How do I set up that one user? I see the client.admin key, is there a way to create and add different keys for different pools?
[15:55] <mikedawson> swinchen: something like "ceph auth get-or-create client.volumes mon 'allow r' osd 'allow rwx pool=volumes, allow rx pool=images' > /etc/ceph/ceph.client.volumes.keyring && chown cinder:cinder /etc/ceph/ceph.client.volumes.keyring"
[15:55] <mikedawson> swinchen: that may be a bit dated though (pulled it from pre-cuttlefish notes)
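
As a usage sketch, a client can then work against the pool with that restricted key instead of client.admin (names follow the example above):

    # list rbd images in the volumes pool using the restricted key
    rbd --id volumes --keyring /etc/ceph/ceph.client.volumes.keyring -p volumes ls
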
[15:55] <swinchen> mikedawson: ahhh ok! perfect. That is what I was looking for.
[15:56] * PerlStalker (~PerlStalk@2620:d3:8000:192::70) has joined #ceph
[15:56] <mikedawson> swinchen: docs are at http://ceph.com/docs/next/rbd/rbd-openstack/
[15:58] <swinchen> mikedawson: perfect, that is exactly what I was looking for. Thank you.
[15:59] <mikedawson> swinchen: welcome
[16:00] <swinchen> man, ceph is really cool.
[16:05] <Djinh> ceph looks very sleek
[16:05] <Djinh> but my test cluster seems quite gaga at the moment
[16:05] <Djinh> rebalancing is filling up the entire cluster
[16:06] <Djinh> while there isn't close to that much data in the cluster
[16:08] <swinchen> When did ceph add pool quotas? In the faq it says it is not supported, but there is clearly a command for it: osd pool set-quota <poolname> max_objects|max_bytes <val>
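
For example, something along these lines would cap a pool (the pool name is a placeholder):

    # limit the pool to roughly 10 GB of data
    ceph osd pool set-quota mypool max_bytes 10737418240
    # or cap the object count instead
    ceph osd pool set-quota mypool max_objects 100000
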
[16:12] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) has joined #ceph
[16:12] <lubyou> is there a complete reference for the configuration file available?
[16:17] * dmsimard (~Adium@108.163.152.2) has joined #ceph
[16:25] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[16:27] * scuttlemonkey_ is now known as scuttlemonkey
[16:33] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) has joined #ceph
[16:34] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[16:34] * Chirasi123345646 (Chirasi123@a.clients.kiwiirc.com) has joined #ceph
[16:35] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[16:36] * AfC (~andrew@2001:44b8:31cb:d400:2ad2:44ff:fe08:a4c) Quit (Quit: Leaving.)
[16:36] * yanzheng (~zhyan@134.134.137.75) Quit (Ping timeout: 480 seconds)
[16:39] * zhyan_ (~zhyan@101.83.200.209) has joined #ceph
[16:40] <swinchen> Can someone explain why you need an odd number of monitors? I understand the "majority rules" in terms of the state of the cluster... but what happens if one of the monitor nodes fails?
[16:42] <mikedawson> swinchen: a strict majority is required for consensus. If you have 5 mons and one fails or becomes partitioned (not connected), the remaining four will have consensus and continue
[16:42] * iii8 (~Miranda@91.207.132.71) Quit (Read error: Connection reset by peer)
[16:42] <mikedawson> swinchen: same with two failures. the remaining three will continue.
[16:43] <mikedawson> swinchen: if you lose three of five, the remaining two will not proceed.
[16:43] * b1tbkt (~b1tbkt@24-217-192-155.dhcp.stls.mo.charter.com) has joined #ceph
[16:43] <swinchen> mikedawson: what happens if you have 3 and 1 fails? I guess they could have consensus, but if they didn't would you encounter a "split-brain" situation?
[16:44] <mikedawson> swinchen: if you have 3 mons and lose one, the two that can talk to each other will proceed
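To make the majority rule concrete: a quorum needs strictly more than half of the monitors, i.e. floor(n/2)+1. A quick illustration (plain shell arithmetic, nothing Ceph-specific):

    for n in 1 3 5 7; do echo "$n mons -> quorum needs $((n/2 + 1)), survives $((n - n/2 - 1)) failures"; done

which is also why an even monitor count buys no extra failure tolerance over the next lower odd count.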
[16:45] <Chirasi123345646> I have question for librados API's exec() usage. Should I look for help here or on #ceph-devel channel?
[16:46] <mikedawson> you basically want to avoid having two failure domains with a failure-prone link between them and an even number of monitors in each failure domain - that is the recipe for split-brain (where neither side will continue if the failure-prone link fails)
[16:46] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[16:47] <mikedawson> Chirasi123345646: if you can get someone to bite here, it is probably a better place to start
[16:47] <Chirasi123345646> mikedawson: ty
[16:48] <swinchen> mikedawson: Alright, that makes sense. Thanks. Does ceph have the ability to use multiple network paths (to avoid a failed switch for example)?
[16:49] <Chirasi123345646> Can anyone help with librados's exec() API? I was able to call other APIs (omap_get, watch, notify, etc.) without problems, but when we tried the exec() API with a custom plugin class we got a -95 error (Operation not supported).
[16:51] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[16:52] * rudolfsteiner (~federicon@200.68.116.185) Quit ()
[16:52] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) Quit (Read error: Connection reset by peer)
[16:54] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) has joined #ceph
[16:54] <swinchen> Ahhh yes. I think I found it and I think you can!
[16:56] <mikedawson> swinchen: I architect my network into failure domains, but only have one path to my ceph nodes. Then I use CRUSH rules and replication to handle switch failures.
[16:58] <swinchen> mikedawson: I think I will need to look into failure domains. I am not familiar with that concept. Thanks
[16:58] * erice (~erice@50.240.86.181) has joined #ceph
[16:58] * zhyan__ (~zhyan@101.83.175.94) has joined #ceph
[17:01] * erice_ (~erice@host-sb226.res.openband.net) Quit (Ping timeout: 480 seconds)
[17:01] <mikedawson> Chirasi123345646: you may get more traction on the mailing list, or wait a bit and re-ask (more devs show up in an hour or two)
[17:01] * zhyan_ (~zhyan@101.83.200.209) Quit (Ping timeout: 480 seconds)
[17:02] * tryggvil (~tryggvil@178.19.53.254) Quit (Quit: tryggvil)
[17:06] * Camilo (~Adium@puffy.webvolution.net) has joined #ceph
[17:08] * topro (~topro@host-62-245-142-50.customer.m-online.net) Quit (Quit: Konversation terminated!)
[17:08] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[17:10] <swinchen> mikedawson: when you say you "architect my network into failure domains" do you mean that you distribute OSD, and MON nodes across different paths/switches? That way if a switch fails you have a replicated data and monitors on completely different network hardware?
[17:11] <mikedawson> swinchen: yes
[17:11] <swinchen> bigmstone: mikedawson ahh, cool. That makes sense.
[17:13] * sprachgenerator (~sprachgen@c-50-141-192-36.hsd1.il.comcast.net) Quit (Quit: sprachgenerator)
[17:13] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[17:15] * sagelap (~sage@2600:1012:b02f:4ded:b024:ae9:24a0:ba18) has joined #ceph
[17:15] * foosinn (~stefan@office.unitedcolo.de) Quit (Quit: Leaving)
[17:16] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[17:18] * sjm1 (~sjm@wr1.pit.paircolo.net) has joined #ceph
[17:18] <benner> What happens when i keep the OSD journal on a separate disk (ssd) and the disk fails?
[17:20] <mikedawson> benner: the OSD fails
[17:22] * zhyan__ (~zhyan@101.83.175.94) Quit (Ping timeout: 480 seconds)
[17:22] <benner> so what's the most common scenario using ssds: one ssd per journal/osd, or mirrored ssds with a few journals on them?
[17:22] <mikedawson> benner: a typical setup is 4 osd journal partitions per ssd. If that ssd fails, all four osds fail. Make sure your architecture can handle that reality.
[17:22] * dosaboy (~dosaboy@host109-158-232-255.range109-158.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[17:24] <benner> mikedawson: just curious - why 4?
[17:24] <mikedawson> benner: I do a 3:1 ratio. Others do 5 or 6 to 1. Calculate the ratio by dividing write throughput of your ssds by the write throughput of your spinners (then round, make concessions, etc at will)
[17:24] * sjm_ (~sjm@wr1.pit.paircolo.net) Quit (Quit: Leaving)
[17:25] * sjm (~sjm@wr1.pit.paircolo.net) Quit (Quit: Leaving)
[17:25] * dosaboy (~dosaboy@host109-155-1-124.range109-155.btcentralplus.com) has joined #ceph
[17:25] <mikedawson> benner: 450MB/s ssd and 120MB/s spinner for example is 3.75:1, so that rounds to 4:1 or so
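The ratio mikedawson describes is just the sequential write throughput of the journal SSD divided by that of one spinner; the figures below are the hypothetical ones from his example, not a recommendation:

    ssd_mbs=450; hdd_mbs=120                     # measure your own drives, e.g. with dd or fio
    echo "journals per ssd ~ $(echo "scale=2; $ssd_mbs/$hdd_mbs" | bc)"   # 3.75, so round to ~4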
[17:27] <swinchen> mikedawson: So ... This is sort of what I am thinking:
[17:28] <swinchen> https://docs.google.com/presentation/d/1RtgyX87uTj86PU2vV16Zr2-f1m0y_QyOffV03LBpYrI/edit?usp=sharing
[17:28] <swinchen> But how do you handle the cluster_network?
[17:28] <swinchen> (also I showed an even number of MONs... which is incorrect)
[17:29] <benner> swinchen: bonding & other network redundancy options
[17:29] <mikedawson> swinchen: I combine the cluster_network and the public_network
[17:29] <benner> i.e. switch stacking
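The two networks mentioned here are plain ceph.conf options. A minimal sketch with made-up subnets; leaving cluster network unset (as mikedawson does) simply runs everything over the public network:

    [global]
        public network  = 192.168.1.0/24    # client and monitor traffic
        cluster network = 192.168.2.0/24    # OSD replication/recovery traffic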
[17:30] * sjm1 (~sjm@wr1.pit.paircolo.net) Quit (Quit: Leaving.)
[17:30] * sjm (~sjm@wr1.pit.paircolo.net) has joined #ceph
[17:30] <swinchen> Alright, thanks both of you. I will look into both options and see which one I think is more suitable for us.
[17:31] * sjm (~sjm@wr1.pit.paircolo.net) Quit ()
[17:31] <benner> swinchen: don't think that STP will help you :)
[17:31] * sjm (~sjm@wr1.pit.paircolo.net) has joined #ceph
[17:32] <mikedawson> swinchen: we use a routed spine-leaf architecture
[17:32] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:34] <swinchen> This looks like a decent description of that: http://goo.gl/6cMGkM
[17:34] <mikedawson> swinchen: I believe the PDF behind this form is a really worthwhile read: http://www.inktank.com/dreamhost/ and the PDF is at http://www.inktank.com/resource/dreamcompute-architecture-blueprint/
[17:34] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[17:35] <swinchen> mikedawson: Thanks. I really need some good references on network design.
[17:37] <mikedawson> swinchen: Brad Hedlund has some good stuff, too http://bradhedlund.com/2012/10/24/video-a-basic-introduction-to-the-leafspine-data-center-networking-fabric-design/ and http://bradhedlund.com/2012/01/25/construct-a-leaf-spine-design-with-40g-or-10g-an-observation-in-scaling-the-fabric/
[17:37] <swinchen> mikedawson: thank you! I will check those out.
[17:38] <mikedawson> swinchen: and don't be put off by the size of these networks, if you are small the concepts can still be valuable
[17:39] * JustEra (~JustEra@89.234.148.11) Quit (Quit: This computer has gone to sleep)
[17:43] * danieagle (~Daniel@179.176.56.253.dynamic.adsl.gvt.net.br) has joined #ceph
[17:48] <mikedawson> wrencsok: ping
[17:49] <mikedawson> SvenPHX: ping
[17:54] * ircolle (~Adium@c-67-165-237-235.hsd1.co.comcast.net) has joined #ceph
[17:56] * sagelap1 (~sage@2607:f298:a:607:ea03:9aff:febc:4c23) has joined #ceph
[17:57] * sagelap (~sage@2600:1012:b02f:4ded:b024:ae9:24a0:ba18) Quit (Ping timeout: 480 seconds)
[17:58] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) has joined #ceph
[17:58] * mattt_ (~mattt@92.52.76.140) Quit (Read error: Connection reset by peer)
[18:00] * xarses (~andreww@c-71-202-167-197.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[18:00] <wrencsok> kind of awake
[18:03] <mikedawson> wrencsok: saw your update on the scrub performance bug...do you use xfs by chance?
[18:03] <wrencsok> for the osd's yes, we do use xfs. on a side note. what kernel are you using on your cluster?
[18:04] <wrencsok> what are the beta folks formatting their rbd volumes with? i have no idea.
[18:04] <mikedawson> wrencsok: 3.8.0-29-generic from ubuntu
[18:05] <wrencsok> nice. trying to get us there. we're on a 3.2 kernel, and i don't like it one bit.
[18:05] <mikedawson> wrencsok: can you check for fragmentation on your XFS OSDs? "xfs_db -c frag -r /dev/sdb1"
[18:06] * peetaur (~peter@CPEbc1401e60493-CMbc1401e60490.cpe.net.cable.rogers.com) Quit (Read error: Connection reset by peer)
[18:07] <wrencsok> this could take some time. though you have me curious.
[18:07] * peetaur (~peter@CPEbc1401e60493-CMbc1401e60490.cpe.net.cable.rogers.com) has joined #ceph
[18:08] <mikedawson> wrencsok: a 3TB drive took me an hour or so to check for fragmentation (and 16+ hours and counting to try to defrag)
[18:08] * angdraug (~angdraug@204.11.231.50.static.etheric.net) has joined #ceph
[18:09] <mikedawson> wrencsok: during either, you'll see nearly 100% utilization on the disk, but client io seems to be prioritized nicely (for the most part)
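A rough way to run the same check across a whole node; the device names are assumptions, and xfs_fsr (the online XFS defragmenter) will hammer the disk while it runs:

    for dev in /dev/sd{b,c,d,e}1; do            # substitute your OSD data partitions
        echo -n "$dev: "; xfs_db -c frag -r "$dev"
    done
    # to defragment a mounted OSD filesystem (expect heavy utilisation, as noted above):
    # xfs_fsr -v /var/lib/ceph/osd/ceph-0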
[18:09] <wrencsok> yeouch not liking what i am seeing. tho not sure how concerned i am. did you do any tuning of your 3.8 kernel?
[18:11] <mikedawson> wrencsok: not particularly
[18:11] <wrencsok> do you use default settings? 3.2 has some memory mgmt issues for us with jumbo frames and 10 gig E. i've almost tuned them out and stabilized our memory use.
[18:13] <mikedawson> wrencsok: can't think of anything that I've had to tune offhand
[18:13] * sprachgenerator (~sprachgen@130.202.135.215) has joined #ceph
[18:14] <wrencsok> that's great info. trying to push that change has been difficult internally.
[18:15] <wrencsok> going quick on the frag check, will pastebin it when i've gotten thru a node's worth of drives, there's some variation.
[18:17] * Tamil (~tamil@38.122.20.226) has joined #ceph
[18:17] * sjm (~sjm@wr1.pit.paircolo.net) Quit (Quit: Leaving.)
[18:17] <mikedawson> wrencsok: thx
[18:17] * sjm (~sjm@wr1.pit.paircolo.net) has joined #ceph
[18:18] * sjm (~sjm@wr1.pit.paircolo.net) Quit (Remote host closed the connection)
[18:20] * xarses (~andreww@204.11.231.50.static.etheric.net) has joined #ceph
[18:20] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) Quit (Read error: Connection reset by peer)
[18:20] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) has joined #ceph
[18:21] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[18:22] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[18:23] * danieagle (~Daniel@179.176.56.253.dynamic.adsl.gvt.net.br) Quit (Read error: Operation timed out)
[18:27] * alram (~alram@38.122.20.226) has joined #ceph
[18:27] <mikedawson> wrencsok: found an old bug, too: http://tracker.ceph.com/issues/2003
[18:30] <dmsimard> mikedawson: We should probably build some documentation about journals on other drives - this is a question that comes back so often on IRC and on the mailing list :)
[18:30] <wrencsok> drive fragmentation on xfs, sitting around 20 to 23% per drive. only checked one node. http://pastebin.com/rLMBJFKn
[18:30] <mikedawson> dmsimard: agreed
[18:31] <xarses> dmsimard, yes please
[18:31] <dmsimard> i'll open up something in the documentation tracker
[18:31] <xarses> also a lot of people have trouble getting the monitors and osd's to talk even with ceph-deploy
[18:32] <wrencsok> ceph-deploy needs to read ceph.conf for paths that are not default paths. my two cents.
[18:32] <wrencsok> it won't work for me without modification and we do things the old way.
[18:32] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[18:33] <tsnider> DQOTD: is ceph documentation available as a PDF?
[18:33] <xarses> wrencsok, can you elaborate?
[18:33] <joao> tsnider, don't think so
[18:33] <joao> tsnider, you can probably build it from the source
[18:33] <mikedawson> dmsimard: are you JW from Inktank?
[18:33] <wrencsok> i can't use ceph-deploy. our paths are different for everything.
[18:34] <wrencsok> one sec
[18:34] <dmsimard> mikedawson: No :)
[18:34] <xarses> wrencsok, so you dont use /etc/ceph or /var/lib/ceph?
[18:34] <tsnider> joao: ok -- thought I'd ask
[18:34] <wrencsok> mon data = /data/${name}
[18:34] <wrencsok> keyring = /data/${name}/keyring
[18:34] <wrencsok> [osd]
[18:34] <wrencsok> osd data = /data/${name}
[18:34] <wrencsok> keyring = /data/${name}/keyring
[18:34] <wrencsok> it doesn't like that we use that path. and can't find things.
[18:35] <wrencsok> we don't use /var/lib or other defaults.
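For readers piecing the paste together, the fragments above appear to come from a ceph.conf along these lines; the [mon] header is an assumption based on the "mon data" key, so treat this as a sketch rather than wrencsok's exact file:

    [mon]
        mon data = /data/$name
        keyring  = /data/$name/keyring
    [osd]
        osd data = /data/$name
        keyring  = /data/$name/keyring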
[18:35] <mikedawson> dmsimard: ok. point me at the issue in the tracker. I'll be happy to help
[18:36] <dmsimard> Isn't that you that replied to that Dell guy on the mailing list ? I found that thread really interesting
[18:36] * danieagle (~Daniel@177.97.249.214) has joined #ceph
[18:36] * erice (~erice@50.240.86.181) Quit (Read error: Operation timed out)
[18:36] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) has joined #ceph
[18:37] <wrencsok> maybe it's been fixed, but last time i tried to use it to simplify things for our ops teams, it complained to no end about not being able to find the keyrings, etc.
[18:37] * markbby (~Adium@168.94.245.2) has joined #ceph
[18:37] <mikedawson> dmsimard: Yes. Building on the work of nhm and fghaas
[18:38] <wrencsok> it should read /etc/ceph/ceph.conf and i would love a switch to take its paths from the existing configuration.
[18:38] <xarses> wrencsok, ya that's probably a bit too far out of the scope of what ceph-deploy is intended to do
[18:41] <mikedawson> wrencsok: Thanks for checking. I only checked one 6-month old xfs osd and it had 80% fragmentation. A couple added on Saturday had 2% fragmentation already. I'll let you know if I get better scrub results when I'm done with the defrag
[18:41] <sagewk> mikedawson: commented on 6333.. if you can gather some logs over the stall period that should let us look at exactly why the stalls are happening
[18:42] * markbby1 (~Adium@168.94.245.3) has joined #ceph
[18:42] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[18:43] <mikedawson> sagewk: I have admin socket metrics (osd and rbd) sampled every 10s from this weekend. We added two new servers to the mix resulting in roughly 24 hours of complete outage on VMs that get hung up on reads
[18:43] <Kupo1> xarses: Do you have ceph setup for a block device in openstack?
[18:44] <sagewk> mikedawson: specifically looking for the objecter_dump (or dump_objecter?) output that will show the request id and how long it has been waiting
[18:44] <sagewk> so that we map that back to the logs
[18:44] <mikedawson> sagewk: yesterday, we found lots of xfs fragmentation under our osds. Working on a defrag now. I think this one is a problem http://tracker.ceph.com/issues/2003
[18:45] <sagewk> mikedawson: hmm, yeah. there is a possible feature we could add there to fallocate rbd objects as we create them, but it's a bit tricky to measure how effective it is.
[18:46] <mikedawson> sagewk: would that leave the rbd volumes sparse, but fully allocate space for each 4MB chunk that is written (as opposed to making each chunk sparse as well)?
[18:46] <sagewk> right
[18:47] <sagewk> it would avoid fragmentation on those xfs files when they are filled in by small random ios
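To see the effect sagewk is describing without any code change, one can preallocate an object-sized extent by hand with the xfsprogs tools; the path here is hypothetical:

    # preallocate 4 MB up front, then inspect the extent layout
    xfs_io -f -c "falloc 0 4m" /var/lib/ceph/osd/ceph-0/prealloc-test
    xfs_bmap -v /var/lib/ceph/osd/ceph-0/prealloc-test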
[18:48] <wrencsok> mikedawson: those osd's have been running since argonaut. about 6 months or more. light load, cluster isn't really loaded unless i do it. which is nice for seeing how certain workflows don't play well. ideally once i get past our kernel issues or get everyone on board to update to a 3.8 kernel i can spend time with profilers and debuggers to see if i can improve those without hurting where we excel leaps and bounds above rackspace or amaz
[18:48] <mikedawson> sagewk: I'll retest my failure conditions once I get past the defrag. My guess is ceph will work as expected (yay!)
[18:48] <sagewk> cool; let us know!
[18:48] * tryggvil (~tryggvil@89-160-133-149.du.xdsl.is) has joined #ceph
[18:49] <mikedawson> sagewk: if that is an easy/low risk patch, I'll test a wip based on dumpling anytime you have the inkling
[18:50] <sagewk> it's somewhat intrusive to properly implement, but i could make a hacky wip that hard-codes an fallocate on any rbd-looking objects just to see how effective it is
[18:51] * jjgalvez (~jjgalvez@ip72-193-217-254.lv.lv.cox.net) has joined #ceph
[18:52] <mikedawson> sagewk: cool
[18:52] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) has joined #ceph
[18:55] * maciek (maciek@2001:41d0:2:2218::dead) Quit (Quit: restart internetu.)
[18:59] <dmsimard> mikedawson - xarses: http://tracker.ceph.com/issues/6379
[19:00] * Camilo (~Adium@puffy.webvolution.net) Quit (Quit: Leaving.)
[19:00] * davidzlap (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[19:00] * sleinen2 (~Adium@2001:620:0:25:451b:3685:65c0:696e) Quit (Quit: Leaving.)
[19:00] * sleinen (~Adium@130.59.94.202) has joined #ceph
[19:02] <symmcom> I have a 2 node cluster with 4 OSDs in each. I replaced osd.0 which was 2TB with a 4TB HDD, but the cluster picked it up still as 2TB, any idea why ?
[19:02] * Pedras (~Adium@2001:470:84fa:3:f0cd:7353:ecb4:dfc7) has joined #ceph
[19:03] <mikedawson> dmsimard: thx
[19:04] <dmsimard> I "almost" had the time/resources to better test the impact on performance of journals on different drives for my setup
[19:05] <dmsimard> However I'll end up testing a 3U config with 16 standalone drives with journals on the same drives (16 drives backed by a RAID Card with caching) and i'll also test 8 pairs of RAID-0's with the journals on the same drives as well
[19:05] <dmsimard> Should be interesting
[19:06] * ishkabob (~c7a82cc0@webuser.thegrebs.com) has joined #ceph
[19:06] <ishkabob> hey guys, I'm trying to troubleshoot some slow requests. It only happens when I do a large write to an RBD that is connected with samba
[19:06] <ishkabob> there doesn't appear to be anything wrong with the OSDs, and all other operations appear to be normal
[19:06] <ishkabob> any ideas?
[19:07] * markbby (~Adium@168.94.245.2) has joined #ceph
[19:07] <mikedawson> ishkabob: does large write mean one big file or can it also mean lots of little files?
[19:08] * sleinen (~Adium@130.59.94.202) Quit (Ping timeout: 480 seconds)
[19:08] <ishkabob> one big file in this case
[19:09] * jbd_ (~jbd_@2001:41d0:52:a00::77) has left #ceph
[19:09] * markbby2 (~Adium@168.94.245.4) has joined #ceph
[19:09] * markbby (~Adium@168.94.245.2) Quit (Remote host closed the connection)
[19:09] * The_Bishop (~bishop@93.182.144.2) has joined #ceph
[19:10] <mikedawson> ishkabob: can you recreate the issue without samba involved with scp or something similar?
[19:10] * markbby2 (~Adium@168.94.245.4) Quit (Remote host closed the connection)
[19:10] <simulx> what do we do about having the mds as a bottleneck. if it goes down, the cluster becomes unavailable. However, we have this statement: "Do not run multiple metadata servers in production." I guess I should just put it on a separate server (not a mon), and make sure we never, ever touch it??
[19:10] * Cube (~Cube@wr1.pit.paircolo.net) Quit (Quit: Leaving.)
[19:10] <ishkabob> mikedawson: i don't think i can recreate it in this way actually, i'll try though
[19:12] * markbby1 (~Adium@168.94.245.3) Quit (Remote host closed the connection)
[19:12] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[19:13] <gregaf> simulx: you can run multiple MDSes as long as only one is active; then if that goes down one of the standby MDSes will take over
[19:13] * nhm (~nhm@184-97-187-196.mpls.qwest.net) Quit (Quit: Lost terminal)
[19:14] <gregaf> but multiple active daemons is even less production-ready than CephFS as a whole
[19:14] <symmcom> can somebody tell me the right sequence of commands to replace a smaller osd with a bigger one? Thanks!
[19:14] <simulx> ok, so then when the primary comes back up, will it re-sync correctly?
[19:15] <simulx> ie: you don't have to admin the active/passive thing?
[19:15] <simulx> that would be fine
[19:15] * Cube (~Cube@wr1.pit.paircolo.net) has joined #ceph
[19:15] <ishkabob> mikedawson: so I'm just stuffing a file with a bunch of crap from urandom right now, I'll start moving it onto the RBD in question when it finishes, seem reasonable?
[19:16] * erice (~erice@50.240.86.181) has joined #ceph
[19:17] <gregaf> simulx: no admin interaction required
[19:18] <gregaf> when you turn on the daemons it'll just take one of them as active
[19:18] <symmcom> simulx: i can confirm that about MDS. I have 2 MDS running in the cluster. When i reboot one MDS, the 2nd MDS simply takes over and vice versa
[19:19] <symmcom> absolutely no user interaction
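A standby MDS is just another ceph-mds daemon with its own key; nothing needs to be configured for basic active/standby failover. A hedged sketch with a hypothetical hostname:

    ceph-deploy mds create node2     # or provision the daemon by hand
    ceph mds stat                    # should report one active and one standby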
[19:21] <mikedawson> ishkabob: sure
[19:26] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) Quit (Remote host closed the connection)
[19:32] <simulx> very nice
[19:32] <simulx> i'm going to add another to the cluster, thanks
[19:33] <sagewk> josef: fixed the ppc build error, but it won't be in a release until 0.70 (next mondayish).
[19:35] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) has joined #ceph
[19:35] * markbby (~Adium@168.94.245.2) has joined #ceph
[19:37] * davidz (~Adium@ip68-5-239-214.oc.oc.cox.net) Quit (Quit: Leaving.)
[19:38] * The_Bishop (~bishop@93.182.144.2) Quit (Ping timeout: 480 seconds)
[19:40] <josef> sagewk: sounds good
[19:40] <josef> i'm in no hurry
[19:40] * MooingLemur (~troy@phx-pnap.pinchaser.com) has left #ceph
[19:43] * davidz (~Adium@ip68-5-239-214.oc.oc.cox.net) has joined #ceph
[19:46] * jeff-YF (~jeffyf@67.23.117.122) Quit (Ping timeout: 480 seconds)
[19:46] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[19:50] * freedomhui (~freedomhu@117.79.232.206) Quit (Quit: Leaving...)
[19:50] * danieagle (~Daniel@177.97.249.214) Quit (Quit: inte+ e Obrigado Por tudo mesmo! :-D)
[19:51] <Kupo1> the 'total avail' in rados df; what format is that in? KB?
[19:51] * The_Bishop (~bishop@2001:470:50b6:0:873:cadb:5333:c64c) has joined #ceph
[19:54] * alram (~alram@38.122.20.226) Quit (Quit: leaving)
[19:54] <mikedawson> Kupo1: it looks like KB to me... odd choice, it seems
[19:55] * rudolfsteiner (~federicon@200.68.116.185) has joined #ceph
[19:56] <Kupo1> mikedawson: hmm, i created a 10gb volume and now im getting this in the output : http://pastebin.mozilla.org/3138560
[19:56] <mikedawson> Kupo1: was it sparse?
[19:57] <Kupo1> just standard openstack volume, not sure what that means
[19:57] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[19:58] <mikedawson> Kupo1: sparse means thin provisioned (i.e. it doesn't consume space until you write data to the volume)
[19:58] <Kupo1> that controlled by openstack or ceph?
[19:58] <mikedawson> Kupo1: they are sparse by default
[19:59] <Kupo1> okay
[20:00] <Kupo1> is the total listed the pool avail space?
[20:00] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) Quit (Quit: This computer has gone to sleep)
[20:01] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) has joined #ceph
[20:02] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) Quit ()
[20:04] * markbby (~Adium@168.94.245.1) has joined #ceph
[20:04] <mikedawson> Kupo1: 13467986316/1024/1024/1024 means you have roughly 12.5TB of drive space in your ceph cluster. Then you need to factor in your replication factor for each pool in use. If you use 2x replication across all pools, and you keep your cluster below 75% full (a good idea), you'll end up with 4.7TB of real storage.
[20:05] <mikedawson> Kupo1: then you factor in thin provisioning and Copy on Write clones and things get complicated
[20:06] <Kupo1> 2x replication means there are 2 or 3 copies of the data (including the original)?
[20:08] <mikedawson> Kupo1: two copies
[20:09] <symmcom> mikedawson: 2 copies including the original?
[20:09] <mikedawson> Kupo1: 12.5TB / 2 copies * .75 = 4.7TB
[20:10] <mikedawson> symmcom: yes. 1 primary osd and one replica
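Working through the same numbers from rados df (the input value and the 2x/75% assumptions come from the exchange above):

    total_kb=13467986316    # 'total avail' from rados df, reported in KB
    echo "$total_kb" | awk '{printf "raw: %.1f TB, usable at 2x repl and 75%% full: %.1f TB\n", $1/2^30, $1/2^30/2*0.75}'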
[20:10] <Kupo1> does it do any striping of any kind?
[20:11] <ishkabob> mikedawson: I tried copying a 25gig file from the local disk to the RBD, as well as scp to the machine with the RBD. Neither one resulted in slow requests, although during the scp, it went to "stalled" a few times
[20:11] <mikedawson> Kupo1: http://ceph.com/docs/next/man/8/rbd/#striping
[20:15] <mikedawson> ishkabob: are there any reads going on on this rbd volume? Are your OSDs backed by XFS? I see stalls very regularly when there is spindle contention on the OSDs. My spindle contention tends to come from scrub or deep-scrub. I just found out that my drives have quite a bit of xfs extent fragmentation.
[20:16] <mikedawson> ishkabob: If you use xfs, can you check some of your drives for fragmentation? "xfs_db -c frag -r /dev/sdb1"
[20:17] <symmcom> I just replaced my 2TB OSD with a 4TB osd, can somebody help me figure out why ceph osd tree is showing i still have a 2TB OSD
[20:19] <symmcom> i followed this sequence: osd set noin, osd stop osd.0, osd out osd.0, replaced physical hdd, ceph-deploy disk zap/prepare/activate
[20:20] <mikedawson> symmcom: can you paste your 'ceph osd tree'?
[20:21] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[20:22] <symmcom> mikedawson: http://pastebin.com/QJqsZVSp
[20:25] <mikedawson> symmcom: maybe you can do something like osd crush reweight osd.0 3.64
[20:26] * jksM (~jks@3e6b5724.rev.stofanet.dk) Quit (Ping timeout: 480 seconds)
[20:27] <mikedawson> symmcom: not sure if that syntax is right though. all my drives are listed as "1". They are actually 3TB, but it is the ratio that matters
[20:28] <mikedawson> symmcom: What does "ceph osd getcrushmap -o crushmap && crushtool -d crushmap -o crushmap.txt && cat crushmap.txt" look like?
[20:29] * thomnico (~thomnico@AMontsouris-652-1-207-134.w86-212.abo.wanadoo.fr) Quit (Quit: Ex-Chat)
[20:29] * tsnider1 (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[20:32] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) Quit (Ping timeout: 480 seconds)
[20:33] <symmcom> mikedawson: http://pastebin.com/z34fYqSe
[20:34] <mikedawson> symmcom: looks like you need to reweight that drive. alfredodeza is the man to ask about whether ceph-deploy should have done this automatically
[20:34] <symmcom> yep it was done automatically by ceph-deploy
[20:35] <symmcom> osd.6 on that node was a 2TB, i replaced it with a 4TB 3 months ago by following the same sequence and it weighted the osd to the correct 4tb
[20:35] * alfredodeza stands on the shoulders of GIANTS
[20:36] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[20:39] <symmcom> mikedawson: will this command reweight the osd? #ceph osd crush reweight osd.0 3.64
[20:47] * tryggvil (~tryggvil@89-160-133-149.du.xdsl.is) Quit (Quit: tryggvil)
[20:48] * ScOut3R (~scout3r@540130DB.dsl.pool.telekom.hu) has joined #ceph
[20:51] <mikedawson> symmcom: perhaps, but I recall having some issue. Judging by the description of 'ceph osd crush --help | grep reweight' it should work that way
[20:52] * jeff-YF (~jeffyf@23.30.12.6) has joined #ceph
[20:52] <symmcom> mikedawson: just ran the command #ceph osd crush reweight osd.0 3.64, no effect
[20:52] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) has joined #ceph
[20:54] <mikedawson> symmcom: I recall having a similar issue
[20:55] <mikedawson> symmcom: dmick may be the man to ask if the ceph command line tool is broken
[20:56] * andrei (~andrei@46.229.149.194) Quit (Ping timeout: 480 seconds)
[20:57] <symmcom> mikedawson: thanks mike, i will see if i can contact dmick
[20:57] <mikedawson> symmcom: the other option is to attempt to edit the crushmap.txt manually, recompile it. Something like "crushtool -c crushmap.txt -o new-crushmap && ceph osd setcrushmap -i new-crushmap"
[20:58] <symmcom> the thought of manually editing .txt then upload to the cluster gives me jitter :)
[20:58] * vata (~vata@2607:fad8:4:6:5546:a868:1758:fcde) has joined #ceph
[20:58] <mikedawson> symmcom: best done on a non-production cluster (like most things)
[21:00] * nhm (~nhm@184-97-187-196.mpls.qwest.net) has joined #ceph
[21:00] * ChanServ sets mode +o nhm
[21:01] <symmcom> unfortunately the cluster has migrated from test to production. maybe i can add another HDD to the cluster and see how it weights. i'm thinking it picked up the old weight because the cluster reused the same osd.0 id. not sure, just speculating
[21:01] * markbby (~Adium@168.94.245.1) Quit (Quit: Leaving.)
[21:03] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) has joined #ceph
[21:09] * tsnider1 (~tsnider@nat-216-240-30-23.netapp.com) Quit (Ping timeout: 480 seconds)
[21:12] * markbby (~Adium@168.94.245.4) has joined #ceph
[21:13] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) has joined #ceph
[21:17] * jeff-YF (~jeffyf@23.30.12.6) Quit (Quit: jeff-YF)
[21:21] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[21:23] <LCF> rgw_thread_pool_size <- what does this value mean? the documentation isn't really helpful :(
[21:27] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[21:28] <dmsimard> alfredodeza: http://tracker.ceph.com/issues/6154#change-27819 nice !
[21:28] <alfredodeza> dmsimard: :)
[21:28] <alfredodeza> should be released soon-ish
[21:30] * sleinen1 (~Adium@2001:620:0:25:2c84:60b7:5bf0:f7bf) has joined #ceph
[21:34] * ntranger (~ntranger@proxy2.wolfram.com) has joined #ceph
[21:35] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:38] * sleinen1 (~Adium@2001:620:0:25:2c84:60b7:5bf0:f7bf) Quit (Quit: Leaving.)
[21:38] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[21:42] * markbby (~Adium@168.94.245.4) Quit (Quit: Leaving.)
[21:46] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:46] * Vjarjadian (~IceChat77@90.208.125.77) has joined #ceph
[21:50] * markbby (~Adium@168.94.245.2) has joined #ceph
[21:52] * jeff-YF (~jeffyf@23.30.12.6) has joined #ceph
[21:53] <swinchen> wow... Spine/Leaf is intense.
[21:54] * rturk-away is now known as rturk
[21:59] <dmick> symmcom: that command should have affected osd tree output, but it's probably not what you want
[21:59] <dmick> the 'reweight' value is different than the 'crush weight'
[22:00] <dmick> the easiest thing IMO is to extract, decompile, edit, recompile, set the crushmap, but you can probably do it with ceph commands too if you wish
[22:01] * rudolfsteiner (~federicon@200.68.116.185) Quit (Quit: rudolfsteiner)
[22:04] <dmick> or, as I've just been told, ceph osd crush reweight will change the crush weight directly, which is easiest of all
[22:09] * BillK (~BillK-OFT@124-148-81-249.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[22:11] * gregaf (~Adium@2607:f298:a:607:c2b:8d87:d52:b902) Quit (Quit: Leaving.)
[22:11] * gregaf (~Adium@2607:f298:a:607:35af:48b:988a:ef6f) has joined #ceph
[22:12] * allsystemsarego (~allsystem@188.25.131.49) Quit (Quit: Leaving)
[22:13] <simulx> can i change the journal device of an existing osd... ie: take down osd, add ssd, change journal location, bring it back up....?
[22:14] * xdeller (~xdeller@91.218.144.129) Quit (Quit: Leaving)
[22:25] * jeff-YF (~jeffyf@23.30.12.6) Quit (Quit: jeff-YF)
[22:28] <simulx> i notice that the location of the journal is a soft link to an unmounted device: /var/lib/ceph/osd/ceph-0/journal ... can i just take it out, flush, change the soft link, put it back in?
[22:28] <simulx> that seems easiest
[22:29] <gregaf> simulx: yeah, there's a flush-journal command, then you can do create-journal with it pointed to the new place
[22:29] <gregaf> or something like that; the commands are doc'ed
[22:29] <simulx> http://ceph.com/docs/next/man/8/ceph-osd/
[22:29] <simulx> --flush-journal
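Pulling the pieces of this thread together, a rough sequence for moving an existing OSD's journal; the osd id, device name and init commands are assumptions for the sketch:

    service ceph stop osd.0                              # or 'stop ceph-osd id=0' on upstart systems
    ceph-osd -i 0 --flush-journal
    ln -sf /dev/sdg1 /var/lib/ceph/osd/ceph-0/journal    # or point 'osd journal' in ceph.conf at the new device
    ceph-osd -i 0 --mkjournal
    service ceph start osd.0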
[22:38] <symmcom> dmick: i tried the ceph osd reweight command on osd.0. but it did not change the weight
[22:38] <dmick> ceph osd reweight != ceph osd crush reweight
[22:38] <dmick> and
[22:38] <dmick> are you sure it didn't change the output of ceph osd tree?
[22:38] <dmick> (there are two weights)
[22:39] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) has joined #ceph
[22:39] <symmcom> this is the command i used> ceph osd reweight osd.0 3.84. is that not the right one?
[22:41] <dmick> I don't see a "crush" in that command
[22:43] <symmcom> dmick: i am sorry, missed crush in above text. this is the exact command i have used> #ceph osd crush reweight osd.0 3.64
[22:43] <dmick> pastebin the output of ceph osd tree
[22:44] <symmcom> dmick: http://pastebin.com/PVszX6RJ
[22:44] * nwat (~nwat@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Read error: Operation timed out)
[22:45] <symmcom> dmick: osd.0 in node ceph-osd-01 is newly replaced 4TB hdd , the old one was 2TB
[22:45] <dmick> and, it should go without saying, but the reweight command seemed to complete with no errors?
[22:45] <symmcom> dmick: thats correct, no errors
[22:47] <symmcom> i used ceph-deploy to add the new 4TB to the cluster and it automatically assigned the weight of 1.82 which was the weight of old 2TB osd.0
[22:47] <dmick> and it issued the message "reweighted item id ... name osd.0 to 3.64 in crush map"?
[22:49] * sprachgenerator (~sprachgen@130.202.135.215) Quit (Ping timeout: 480 seconds)
[22:49] * sprachgenerator (~sprachgen@vis-v410v141.mcs.anl-external.org) has joined #ceph
[22:50] <symmcom> ceph-deploy did not issue any reweighted item id, it simply added the hdd to osd.0 after i did #ceph-deploy activate node:/dev
[22:50] <dmick> not ceph-deploy, I mean the reweight command
[22:50] <symmcom> reweight command did not show any msg, it simply dropped me in shell prompt
[22:51] <dmick> ok. that's actually happening to me right now as well, and I think there should be a confirmation message if it's working
[22:51] <symmcom> http://pastebin.com/yuWfuPBY
[22:51] <dmick> looking into it
[22:53] * sprachgenerator (~sprachgen@vis-v410v141.mcs.anl-external.org) Quit (Read error: Operation timed out)
[22:54] * sprachgenerator (~sprachgen@130.202.135.215) has joined #ceph
[22:58] <symmcom> dmick: glad to see its not happening in my cluster only :)
[22:58] <dmick> yeah, this is odd; not sure what could be stopping it
[22:59] <symmcom> do u know why ceph-deploy auto selected old weight ?
[22:59] * Cube (~Cube@wr1.pit.paircolo.net) Quit (Read error: Connection reset by peer)
[23:00] <dmick> no
[23:00] <symmcom> also, do all the hdds have to be the same size on all nodes? this is what i have right now, node-1: 4TB, 4TB, 2TB; node-2: 4TB, 2TB, 2TB
[23:00] <dmick> no, that's what the weight is for, to balance data usage based on disk size
[23:00] * carif (~mcarifio@pool-96-233-32-122.bstnma.fios.verizon.net) Quit (Quit: Ex-Chat)
[23:00] <symmcom> great, understood
[23:01] <dmick> they can be relative, but by default it's "number of TB" I think, at least for initial creation
[23:01] <symmcom> i tested reweight before on this cluster when it was still a test cluster and it worked
[23:02] <symmcom> all the hdds in this cluster were 2 TB. i replaced a 2 TB on both nodes with a 4TB using ceph-deploy and it auto weighted based on the 4TB size, but that was on cuttlefish
[23:03] <symmcom> with this new 4tb this is my first attempt to replace after dumpling upgrade
[23:07] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Ping timeout: 480 seconds)
[23:07] * Kupo1 (~Kupo-Lapt@ip70-162-72-3.ph.ph.cox.net) Quit (Quit: Leaving.)
[23:08] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[23:09] <dmick> ah. yes, it's busted.
[23:09] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Remote host closed the connection)
[23:09] <dmick> back to recommending editing the crushmap
[23:09] <dmick> filing a bug
[23:09] <mikedawson> symmcom: yep, thought that was broken. Thanks dmick!
[23:10] * ishkabob (~c7a82cc0@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC)
[23:11] <symmcom> dmick: could you please tell me what command or text to type. never edited crushmap. dont want to crash the entire cluster :)
[23:11] <symmcom> still learning my way around ceph
[23:11] * JustEra (~JustEra@ALille-555-1-127-163.w90-7.abo.wanadoo.fr) Quit (Quit: This computer has gone to sleep)
[23:12] <dmick> have you checked the docs? This sequence may well be there
[23:12] <dmick> http://tracker.ceph.com/issues/6382
[23:14] <symmcom> dmick: this was posted by mikedawson earlier> "crushtool -c crushmap.txt -o new-crushmap && ceph osd setcrushmap -i new-crushmap" . is this the one ?
[23:15] * andrei (~andrei@46.229.149.194) has joined #ceph
[23:16] <dmick> that's the last half, once you've edited the map
[23:16] <symmcom> i think this is the doc u r talking about> http://ceph.com/docs/master/rados/operations/crush-map/
[23:17] <symmcom> if i understand right, this is the sequence i should follow: get crushmap > decomplie crush map > edit crush map > compile crush map > set crush map
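Spelled out, that sequence looks roughly like this (the weight values match symmcom's 2TB-to-4TB swap and are otherwise illustrative):

    ceph osd getcrushmap -o crushmap
    crushtool -d crushmap -o crushmap.txt
    # edit crushmap.txt, e.g. change 'item osd.0 weight 1.820' to 'item osd.0 weight 3.640'
    # and adjust the containing host's total weight to match
    crushtool -c crushmap.txt -o new-crushmap
    ceph osd setcrushmap -i new-crushmap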
[23:19] <mikedawson> symmcom: here is a patch that may do what you need http://pastebin.com/raw.php?i=qgtmB8Cv
[23:19] <mikedawson> symmcom: actually it is backwards, but that should show you the idea anyway
[23:21] <symmcom> mikedawson: sorry to be so ignorant, but how do i apply the patch, edit that info into the crush map ?
[23:23] <mikedawson> symmcom: edited version in its entirety: http://pastebin.com/raw.php?i=mjcyWhTQ
[23:23] * rturk is now known as rturk-away
[23:24] <symmcom> Thank You mike!
[23:24] * sjustlaptop (~sam@38.122.20.226) has joined #ceph
[23:26] <symmcom> mikedawson: just curious, should the total weight of the node 1 change to 9.10 from 7.280 or no?
[23:27] * andrei (~andrei@46.229.149.194) Quit (Ping timeout: 480 seconds)
[23:27] <mikedawson> symmcom: yep, good catch, missed that
[23:27] <symmcom> i think i am actually beginning to understand the inside of crush map :)
[23:27] <mikedawson> symmcom: same goes for the remaining 14.560s
[23:28] <mikedawson> symmcom: notice I said this "may do what you need"
[23:29] <symmcom> ok, i will edit that and see how it goes . if it does not work, is there any chance that the entire cluster "may" go down ?
[23:29] * claenjoy (~leggenda@37.157.33.36) Quit (Quit: Leaving.)
[23:30] <dmick> symmcom: if you're running a production cluster and need this level of support, you might want to check into inktank.com's offerings
[23:30] <mikedawson> symmcom: don't know. dmick do you know if setcrushmap performs validation?
[23:30] <mikedawson> dmick: atta boy!
[23:30] <dmick> I don't know
[23:31] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[23:33] <symmcom> dmick: we have plans for the support. it is still a small production cluster, we have not moved everything into it yet, trying to see how it performs in the real world. we are still somewhat dependent on the original cluster
[23:33] * rturk-away is now known as rturk
[23:36] * zhyan__ (~zhyan@101.82.112.135) has joined #ceph
[23:36] * carif (~mcarifio@75-150-97-46-NewEngland.hfc.comcastbusiness.net) has joined #ceph
[23:37] * jeff-YF (~jeffyf@216.14.83.26) has joined #ceph
[23:43] <mikedawson> Inktank sysadmins: Tracker is struggling with passenger/ruby/rack errors
[23:44] * tsnider (~tsnider@nat-216-240-30-23.netapp.com) Quit (Quit: Leaving.)
[23:44] <dmick> mikedawson: just noticed, tnx
[23:44] * zhyan__ (~zhyan@101.82.112.135) Quit (Ping timeout: 480 seconds)
[23:45] * rturk is now known as rturk-away
[23:46] * vata (~vata@2607:fad8:4:6:5546:a868:1758:fcde) Quit (Quit: Leaving.)
[23:47] * jeff-YF_ (~jeffyf@67.23.123.228) has joined #ceph
[23:48] * sjustlaptop (~sam@38.122.20.226) Quit (Ping timeout: 480 seconds)
[23:49] * jeff-YF (~jeffyf@216.14.83.26) Quit (Ping timeout: 480 seconds)
[23:49] * jeff-YF_ is now known as jeff-YF
[23:50] * rturk-away is now known as rturk
[23:51] * markbby (~Adium@168.94.245.2) Quit (Quit: Leaving.)
[23:52] <dmick> tracker back mikedawson
[23:54] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:57] * themgt (~themgt@201-223-232-27.baf.movistar.cl) Quit (Quit: Pogoapp - http://www.pogoapp.com)
[23:58] <symmcom> dmick: is it true that with a large number of HDDs, say 12, it is better to put the journals on the OSDs themselves rather than on an SSD?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.