[0:50] <amatter> does the ceph-mon data folder need to be on a high-speed device? I have a server with root fs on a CF card, a ssd in bay 1 and 3x SATA in the other bays. The only thing on the ssd are the osd journals for each drive. However, performance of the system is less than expected
[0:55] <dmick> I would say "not really"; you can look at the io stats for that filesystem, but I doubt it's very heavily used or very latency-dependent. I could be wrong.
[1:45] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[2:15] * portante (~user@c-24-63-226-65.hsd1.ma.comcast.net) has joined #ceph
[2:16] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[3:21] <john_barbee> joshd: I am currently deploying grizzly on ceph .60. While uploading an image to glance thru horizon, my dashboard hung with the image in a 'saving' state. It appears that the image was fully received by ceph in my images pool. My error looks identical to this bug. https://bugs.launchpad.net/glance/+bug/1146830 Do you know of a fix/workaround for this issue?
[3:23] <joshd> I didn't realize horizon used the copy-from api. that's good to know. I haven't had a chance to look into the root cause of the bug yet
[3:24] <john_barbee> thanks, i will try the cli
[3:30] <john_barbee> joshd: just realized you meant to import locally from the cli right? using copy-from in the cli returns the same result as horizon
[3:30] <joshd> yeah, anything using --copy-from won't work
[5:08] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:19] * DarkAceZ (~BillyMays@ has joined #ceph
[6:19] * DarkAceZ (~BillyMays@ Quit (Max SendQ exceeded)
[6:25] * DarkAceZ (~BillyMays@ has joined #ceph
[7:03] * eschnou (~eschnou@131.167-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:54] * loicd (~loic@2a01:e35:2eba:db10:95fe:77a8:ed71:b43b) has joined #ceph
[8:55] * fghaas (~florian@91-119-65-118.dynamic.xdsl-line.inode.at) has joined #ceph
[9:00] * itamar (~itamar@ has joined #ceph
[9:00] <itamar> Hi all,
[9:00] <itamar> anyone knows a way to start an OSD when the journal is corrupted?
[9:01] <itamar> any chance to discard whatever was in the journal?
[9:01] <itamar> I have some stale pgs.. two osds did not make it after a cluster restart
[9:17] <matt_> itamar, you can run --mkjournal and it will create a new journal
[9:18] <matt_> you should be able to boot after that
[9:18] <matt_> not a good idea if you have stale PG's though, some objects might be corrupt if all of the OSD's holding them have corrupt journals
[9:41] * itamar (~itamar@ has joined #ceph
[10:33] * MK_FG (~MK_FG@00018720.user.oftc.net) has joined #ceph
[10:38] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[11:04] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[11:45] <rzerres> @joao: i do have another question concerning the monitor election, are you available shortly?
[11:45] <rzerres> good morning, by the way
[11:49] <joao> morning
[11:49] <joao> rzerres, what's up?
[11:53] <rzerres> hey, i'm still struggling to understand how the paxos is initiated
[11:54] <joao> if you can be more precise than that, I'd be happy to help
[11:54] <rzerres> when i restart a monitor member (here: mon.c) it will tell me in the log, that it is changing its rank from -1 to 0
[11:55] <rzerres> ceph quorum_status is listing the mons as expected
[11:55] <joao> when it boots, it doesn't know what its rank is until it kicks off a new election and calculates its rank based on its ip:port, taking into account all the remaining monitors
[11:56] <rzerres> so mon.c is listed with "name": "c" and its "rank": 0
[11:56] <rzerres> i do not understand, why the mon instance is not in ceph mon stat.
[11:56] <joao> say you have three monitors, and for sake of simplicity let's assume they're all on the same machine,
[11:57] <joao> monitors are on ports 6789, 6790 and 6791
[11:57] <rzerres> e10: 3 mons at {a=,b=,c=}, election epoch 346, quorum 1,2 a,b
[11:57] <rzerres> ok.
[11:57] <joao> < <
[11:57] <joao> thus, given there are 3 monitors, ranks go from 0 to 2
[11:58] <rzerres> all fine
[11:58] <joao> monitor on has rank 0, has rank 2
[11:58] <joao> they only recalculate this when they kick off a new election, so the -1 is perfectly normal
[11:59] <joao> and they will only show in the quorum if they joined the quorum
[11:59] <joao> they may appear in the monmap, as you just showed above
[11:59] <joao> but mon.c didn't make it into the quorum
[12:00] <joao> either it was not running, or some other issue came in the way of joining the quorum (e.g., networking issues, crash, bug)
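A minimal sketch of the rank assignment joao describes above, assuming ranks simply follow the ascending sort order of the monitors' ip:port addresses. The monitor names `x`, `y`, `z` and the address 127.0.0.1 are hypothetical (the log does not give them), and this is an illustration, not the monitor's actual code.

```python
import ipaddress


def monitor_ranks(monmap):
    """Return {name: rank}, ranking monitors by (ip, port) in ascending order."""
    def sort_key(item):
        _name, (ip, port) = item
        # Compare addresses numerically rather than as strings.
        return (int(ipaddress.ip_address(ip)), port)

    ordered = sorted(monmap.items(), key=sort_key)
    return {name: rank for rank, (name, _addr) in enumerate(ordered)}


# Hypothetical monmap: three monitors on one machine, ports as in joao's example.
monmap = {
    "x": ("127.0.0.1", 6790),
    "y": ("127.0.0.1", 6791),
    "z": ("127.0.0.1", 6789),
}
ranks = monitor_ranks(monmap)
print(ranks)
```

Under this assumption the rank depends only on the address, not on the monitor's name, which is consistent with mon.c ending up as rank 0 later in the log.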
[12:00] <rzerres> now the story gets interesting
[12:00] <rzerres> it was in the quorum yesterday.
[12:01] <rzerres> today i needed to restart this node (mon.c)
[12:01] <rzerres> it restarted as expected. the mon instance was started, but it did not make it into the quorum.
[12:02] <rzerres> are we able to request an new paxos run for election/monmap resync?
[12:02] <joao> log messages would provide further insight
[12:02] <joao> rzerres, although we use a paxos-like algorithm to run elections, we don't really use the same paxos as for maps
[12:03] <joao> but anyway, mon.c will keep trying to join the quorum
[12:03] <joao> no need to request a new election, it will do so automatically, it's not a just-once kind of thing
[12:03] <joao> if it's not joining, there's some other issue
[12:03] <joao> did the ip:port change for mon.c?
[12:04] <joao> how long was it down?
[12:04] <joao> and how big is the cluster, workload-wise?
[12:05] <joao> if a long time went by and your cluster is under considerable load, it's possible that mon.c is still catching up to the current cluster state
[12:05] <rzerres> no, the port is the same. downtime was about 15 minutes. but the filesystem on /var had an overrun. So this is surely the source
[12:05] <joao> overrun as in it ran out of disk space?
[12:06] <rzerres> yes..
[12:06] <joao> well, current monitor code would shut down the monitor in such a case
[12:06] <joao> what version are you running?
[12:06] <joao> then again, not sure if that's already on 0.60 or if it's coming out only on cuttlefish
[12:07] <rzerres> I'm running next (0.60-460-ga329871)
[12:07] <rzerres> but the filesystem used for the mon is on a separate partition, which hasn't any issues. just the log stuff under /var
[12:08] <joao> ah, okay
[12:08] <joao> I don't recall how we handle ENOSPC on the logs though
[12:08] <joao> rzerres, looking at logs with higher debug levels would be best really
[12:09] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[12:09] <rzerres> i do really appreciate the help here on the chat. this support is one of the main points for using opensource with anxious support-guys. thanks anyway.
[12:10] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[12:11] <joao> rzerres, providing support is also a great way to uncover bugs that we are not able to trigger on qa, so it's a win-win for all parties :)
[12:12] <rzerres> well, i have reduced the log level down to 1. So i will have to raise it (debug mon = 20; debug paxos = 20)
[12:27] <rzerres> tried to use online manipulation like:
[12:27] <rzerres> ceph mon tell c injectargs '--debug-paxos 20 --debug-mon 20'
[12:27] <rzerres> but got an error:
[12:27] <rzerres> strict_strtoll: expected integer, got: 'c'
[12:27] <joao> well, if mon.c were in the quorum, you could use 'tell' with 0 instead of 'c' (mon's rank)
[12:28] <joao> but mon.c is not in the quorum, so I don't think that will work
[12:28] <joao> you could try; I might be wrong
[12:29] <rzerres> if i got the docu right, the integer corresponds to the number of the monitor: mon.a = 0, mon.b = 1, mon.c = 2
[12:29] <joao> wait a sec, I'll look for that
[12:32] <joao> not really, no
[12:32] <joao> it's a matter of rank
[12:33] <joao> mon.0 = mon.c ; mon.1 = mon.a ; mon.2 = mon.b
[12:33] <rzerres> ok. so you get the rank from the ceph quorum_status. right?
[12:34] <joao> or mon_status
[12:37] <joao> well, time to figure out lunch; will be back in an hour or so
[12:37] <rzerres> ok. played around. and you are right. injectargs are only accepted for instances running in the cluster.
[12:37] <rzerres> worked on mon.a and mon.b, as they are in quorum
[12:38] * itamar_ (~itamar@IGLD-84-228-64-202.inter.net.il) has joined #ceph
[12:38] <rzerres> have to restart the monitor mon.c with args from ceph.conf
[12:53] <rzerres> after i restarted the monitor with new debug values:
[12:53] <rzerres> mon.c@0(leader).paxos(paxos updating c 4533131..4533170) now 0,1,2 have accepted
[12:53] <rzerres> so it is back now
[12:54] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[13:01] <BillK> I have two PGs that are active+clean+inconsistent - they are both pool 1 (metadata) - is there a suggested way to deal with them?
[13:19] <matt_> BillK, have you done a deep scrub yet?
[13:24] * vanham (~vanham@ has joined #ceph
[13:32] <BillK> I think thats what caused it ... using .58 which is doing the deeper deep scrub (!) from what I read
[13:33] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[13:35] <BillK> matt_: just kicked one off manually and I'll see what turns up :)
[13:41] <matt_> BillK, you might also need to do a repair
[13:41] <BillK> matt_: is that documented anywhere?
[13:42] <matt_> ceph pg repair <PG>
[13:42] <matt_> and, ceph osd repair N
[13:42] <matt_> doc's are a bit sparse on repairing I think
[13:43] <BillK> matt_: tkx, Ive looked, saw the command but have not found docs to describe what it does
[13:44] <matt_> BillK, probably the best reference I can find http://www.spinics.net/lists/ceph-devel/msg04899.html
[13:46] <BillK> matt_: tkx, waiting for the deep scrubs to finish ...
[13:46] <matt_> BillK, are you in Perth?
[13:49] <BillK> Yep!
[13:49] <matt_> heh, cool! I thought I was the only one for a while there
[13:50] <BillK> BillK: I noticed that the ceph "in use" map has Perth on it, knew it wasnt me :)
[13:52] <matt_> I have no idea if that's me or not... could be
[14:50] <jerker> does anyone have a recipe (installation instructions) for the best way to set up a cache SSD in front of the OSD data disks?
[14:51] <jerker> just got two SSDs and four mechanical drives for two nodes and was planning to experiment with bcache or md or flashcache
[14:53] <jerker> another thought, i have historically been buying boxes with lots of disk but now with Ceph I can grow with many boxes instead. Has anyone been thinking of the HP Proliant N40L microservers or something like that instead?
[14:55] <rzerres> sjust: maybe you can help out on cleaning a test-cluster with unfound objects?
[14:57] <rzerres> i also got some stuck stale pg's. if i try to query them i do get:
[14:57] <rzerres> ceph pg x.yy query
[14:57] <rzerres> i don't have pgid x.yy
[14:58] <rzerres> haven't found a recipe in the online docu ....
[15:26] <mattch> Currently seeing a weird rbd problem with a new cluster I've just set up - I can create pools, and I can create images - but if I try to query the images I get an error. 0.56.3 on SL6 (packages from EPEL)
[15:41] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[15:41] <stxShadow> ouch ... quite old for ceph
[15:48] <scuttlemonkey> 2: The default kernel has an old Ceph client that we do not recommend for kernel client (kernel RBD or the Ceph file system). Upgrade to a recommended kernel.
[15:49] <mattch> scuttlemonkey: I'm not using kernel rbd here am I, if I'm just querying the pool state?
[15:50] <scuttlemonkey> yeah, just reading backlog
[15:51] <dwm37> jerker: My experiments at home have involved HP microservers. They're nicely put together machines, though you may want to up the RAM.
[15:52] <dwm37> jerker: Any sizable installation would probably see the use of rack-mounted kit, however.
[15:52] <dwm37> (The drive density of the microservers is not wonderful, unless you take extra time to add a 4x2.5in drive bay in the top and the requisite supporting IO card.)
[15:53] <dwm37> A 2.5in equivalent that had room for ~8-10 drives would be interesting, however.
[15:53] <scuttlemonkey> mattch: can you try 'rbd --image baz -p one info' instead?
[15:53] <mattch> scuttlemonkey: Same error
[15:57] <scuttlemonkey> what does 'for i in $(ceph osd ls); do ceph tell osd.$i version; done' and 'ls /usr/lib/rados-classes' show?
[15:57] <mattch> Fwiw, the closest thing I've seen is the stuff at the end of this irc log: https://www.evernote.com/shard/s50/sh/c54302e1-43f3-4371-b411-5b48d9924dc7/5e4d03724860bce648219742fe3a0a13 - and I was missing that symlink, but I've added it in and it doesn't help
[16:00] <mattch> ceph tell returns the same output for all osds - ceph version 0.56.3 (sha256)
[16:01] * PerlStalker (~PerlStalk@ has joined #ceph
[16:02] <mattch> ls /usr/lib64/rados_classes - http://pastebin.com/pWYWNNUn
[16:02] <mattch> ^scuttlemonkey
[16:02] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[16:02] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit ()
[16:10] <rzerres> anybody there from my osd problem?
[16:10] <vanham> mattch, the os-recommendations page does note that you have to upgrade the kernel
[16:11] <rzerres> s/from/for/
[16:11] <vanham> mattch, it is usually possible to do that in a distro-allowed way
[16:11] <scuttlemonkey> mattch: yeah, nothing jumping out at me...honestly I would just use a newer kernel
[16:11] <scuttlemonkey> I have very little experience with the older stuff
[16:11] <vanham> I upgraded my Debian Wheezy (7) to 3.8 to make sure that I have the latest kernel client and drivers
[16:12] <vanham> (95) Operation not supported says that your current kernel doesn't support that operation
[16:13] <vanham> rzerres, I'm not a developer but I can try
[16:13] <vanham> what is the problem?
[16:13] <rzerres> hey vanham,
[16:13] <mattch> vanham/scuttlemonkey: I get the same error running the rbd client commands on ubuntu 12.04 against this pool
[16:14] <rzerres> when poking around with my testing cluster i managed to have "zombie" pg's
[16:14] <mattch> though admittedly the pool is running on the old kernel too, if that is the cause here
[16:14] <rzerres> i also got some stuck stale pg's. if i try to query them i do get:
[16:14] <rzerres> ceph pg x.yy query
[16:14] <rzerres> i don't have pgid x.yy
[16:14] <vanham> ubuntu 12.04 is kernel 3.2. Upgrade to 3.4.20 or to 3.6.6
[16:14] <vanham> or better
[16:15] <vanham> can you show me your ceph -s?
[16:15] <vanham> and the x.yy that you are trying to query?
[16:15] <rzerres> ceph -s
[16:15] <rzerres> health HEALTH_WARN 1 pgs recovering; 50 pgs stale; 50 pgs stuck stale; 1 pgs stuck unclean; recovery 7714/2278940 degraded (0.338%); 3857/1139470 unfound (0.338%)
[16:15] <rzerres> monmap e10: 3 mons at {a=,b=,c=}, election epoch 350, quorum 0,1,2 a,b,c
[16:15] <rzerres> osdmap e4832: 8 osds: 8 up, 8 in
[16:15] <rzerres> pgmap v2124422: 1192 pgs: 1141 active+clean, 50 stale+active+clean, 1 active+recovering; 4427 GB data, 8834 GB used, 9660 GB / 18588 GB avail; 7714/2278940 degraded (0.338%); 3857/1139470 unfound (0.338%)
[16:15] <rzerres> mdsmap e1378: 1/1/1 up {0=0=up:active}
[16:16] <scuttlemonkey> rzerres: can you use pastebin.com (or similar) for that stuff? Helps streamline things a bit and cuts down on dropped characters/confusion
[16:16] <rzerres> sorry, i will switch to pastebin ....
[16:16] <vanham> rzerres, wait 2 minutes and run it again plz
[16:18] <vanham> It is always going to be like that
[16:19] <vanham> Or you move to Gentoo and always live on the latest stuff at 50x more work, heehe
[16:19] <vanham> (I used to be a Gentoo user)
[16:21] <vanham> rzerres, plz send the new ceph -s
[16:21] <vanham> and also ceph pg dump_stuck stale
[16:22] <rzerres> vanham, all is prepared at http://pastebin.com/8hFuPVyG
[16:24] <rzerres> i included a ceph osd tree as well
[16:25] <vanham> thanks for the map, it was the next thing I was going to ask
[16:25] <rzerres> for pg 11.2: I have changed the crushmap to create the archive storage on clusterserver dwssrv1
[16:25] <vanham> for
[16:26] <imjustmatthew> jerker: wido and I both had clusters (wido's was production?) using these racked microservers: http://www.supermicro.com/products/system/1U/#S-Series
[16:26] <rzerres> before that change, all osd's (5-8) have been connected to dwssrv2
[16:27] <rzerres> i have scrubbed the fs for osd.7 (changed from ext4 to xfs), thus had to clean the data.
[16:27] <vanham> nice thing you did with the weights
[16:28] <rzerres> so it wouldn't harm, if i mark the unfound objects as lost. just to clean up.
[16:29] <rzerres> i wonder, if i have to redefine the rbd pool object as well.
[16:30] <vanham> 1 sec
[16:30] <rzerres> this rbd is used as a block device in a kvm guest
[16:31] <vanham> send me a ceph health status plz
[16:32] <vanham> I think it lost all copies for a few pgs
[16:32] <vanham> You said you moved two osds at the same time?
[16:32] <rzerres> i dug through the underlying filesystem of the osd.[7,8] to get hold of the objects
[16:32] <rzerres> yes. osd.7 and osd.8
[16:33] <vanham> their prior weights are unaltered?
[16:33] <rzerres> old crushmap was referencing osd.5 up to osd.8 for rule archive.
[16:33] <rzerres> wights are unaffected
[16:33] <rzerres> s/wights/weights/
[16:34] <dspano> Good morning all.
[16:34] <rzerres> i thought the redundancy of 2 would give me enough security for the balance of the data.
[16:34] <vanham> it is acting as if it lost the data at both osds
[16:34] <vanham> I think
[16:34] <vanham> each OSD had 1/16.8 of the data
[16:35] <vanham> you have two copies of each object
[16:35] <vanham> object/stripe/byte
[16:35] <rzerres> i did not take into account, that osd.7 and osd.8 are not mirrored to osd.5 and osd.6
[16:35] <vanham> (1/16.8)^2 is about 0.003543084
[16:36] <vanham> 0.003543084 times 1139470 is about 4037
[16:36] <rzerres> yes, that's the beast.
[16:36] <vanham> you lost 3857 objects
[16:36] <vanham> which is about the same as 4037, allowing for statistical error
[16:36] <rzerres> 3857/1139470 unfound
[16:36] <vanham> yeah
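vanham's back-of-the-envelope estimate above can be reproduced in a few lines. This is a sketch of his arithmetic only, assuming each of the two reformatted OSDs held 1/16.8 of the data and that an object is lost when both of its two copies sat on those OSDs; the function name is made up for illustration.

```python
def expected_unfound(total_objects, osd_share, copies=2):
    """Estimate how many objects had every copy on the wiped OSDs."""
    return total_objects * osd_share ** copies


total_objects = 1139470   # object count from the 'ceph -s' output in the log
osd_share = 1 / 16.8      # vanham's figure for each OSD's share of the data
observed = 3857           # unfound objects reported by 'ceph -s'

estimate = expected_unfound(total_objects, osd_share)
print(round(estimate), observed)
```

The estimate lands within a few percent of the observed count, which is why vanham concludes that both copies of those objects were on osd.7 and osd.8.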
[16:37] <vanham> the funny thing is that you didn't format both
[16:37] <rzerres> i guess i understood the ratio behind it
[16:37] <rzerres> yes, i did format osd.7 and osd.8
[16:37] <vanham> but I think that, since you took both offline at the same time and inserted them as new(??) again
[16:38] <vanham> With Dynamo/DHT/CRUSH-like algorithms I always think it is best not to use RAID and to have 3-2-2: 3 copies, 2 available to read, 2 available to write
[16:38] <rzerres> cluster was healthy. then i proceeded as described online (osd operation)
[16:39] <vanham> maybe there is a way to return the old osd that wasn't formatted to its old place?
[16:39] <vanham> in the map?
[16:39] <rzerres> haven't got that. thought read and write are equivalent in osd syntax
[16:39] <vanham> with ceph it is
[16:40] <vanham> with some other implementations (like DHT in Cassandra) it isn't. Sorry for the confusion
[16:40] <vanham> It is just that I love data distribution algorithms
[16:40] <rzerres> the old osd's (osd.5 and osd.6) are at the same physical place, as well as the logical place
[16:41] <vanham> can you put 7-8 back at where it was b4?
[16:41] <rzerres> well i could, but the crush algo has changed meanwhile
[16:41] <rzerres> and it is not possible to recreate the old way.
[16:41] <rzerres> at least it doesn't make sense.
[16:42] <rzerres> i can live with it if the data are corrupted, as long as i can clean up the ceph structure to be healthy again.
[16:43] <vanham> sorry
[16:43] <vanham> sorry for that
[16:43] <rzerres> I'm running v0.60-460 (next) on a 3.6.9 kernel
[16:43] <vanham> cool!
[16:43] <vanham> neat!
[16:44] <rzerres> i'm aware, that this ceph version can mark unfound objects as lost, eg:
[16:44] <rzerres> $ ceph pg 11.2 mark_unfound_lost revert
[16:46] <rzerres> if i go for that, ceph is smart enough to tell me, that not all sources have been probed.
[16:46] * itamar_ (~itamar@IGLD-84-228-64-202.inter.net.il) Quit (Quit: Leaving)
[16:46] <rzerres> # ceph pg 11.2 mark_unfound_lost revert
[16:46] <rzerres> pg has 3857 objects but we haven't probed all sources, not marking lost
[16:46] <rzerres> i made sure that the deep-scrub has gone through the osd's.
[16:47] <rzerres> no change in the result
[16:47] <vanham> usually, I never move/operate on more than N-1 OSDs at the same time, N being the number of copies.
[16:47] <rzerres> i was going with:
[16:47] <rzerres> # ceph pg deep-scrub 11.2
[16:48] <vanham> In worst case you can still recover the data
[16:48] <rzerres> yes, i learned that now :)
[16:49] <rzerres> i updated the pastebin with the info for the rbd in mind ....
[16:51] <mattch> vanham: For info, the same command fails with the same error on ubuntu 12.10 with a 3.5.x kernel. I appreciate that I'm running an old OS here, but it is on the supported OS list
[16:51] <mattch> and it seems odd that things like 'rbd ls' or 'ceph osd lspools' work, but 'rbd info' doesn't
[16:52] <rzerres> so all objects in the pool are prefixed with # rb.0.e6d2.238e1f29
[16:53] <rzerres> just as a throw-in:
[16:53] <rzerres> my cluster is using ssd-partitions as a cache
[16:55] <vanham> rzerres, Question: how did you do that ssd cache?
[16:57] <rzerres> all archive osd's have a 25G cache partition
[16:57] <vanham> mattch, ahhhhhhhhhh
[16:57] <vanham> mattch, sorry!!!!!!!!!1
[16:57] <vanham> mattch, ah wait
[16:58] <vanham> mattch, there is something about snapshots or cloning and the format of the rbd image
[16:58] <mattch> vanham: Hold that thought - I fixed it...
[16:58] <rzerres> concerning the cache:
[16:58] <rzerres> i created the partition with parted
[16:59] <mattch> vanham: fixed the missing rados_classes symlink on the 'client' but not on the ceph servers - doing it there too fixes the problem. Stupid rpm bug!
[17:00] <mattch> Looks like there's a new rpm in epel-testing - will have a look to check if they've noticed this bug and fixed it in that rpm
[17:00] <rzerres> in the ceph.conf i declared for the relevant osd's:
[17:00] <rzerres> ; client specific (overriding global osd default)
[17:00] <rzerres> osd journal = /dev/intel-ssd<partnumber>
[17:00] <mattch> thanks for the help and advice vanham/scuttlemonkey!
[17:01] <rzerres> used a udev rule to guarantee that the disk gets the same symlink name
[17:01] <mattch> nope - will file a bug with epel about this
[17:03] <rzerres> that leads to another question:
[17:03] <rzerres> if the cache disk gets corrupted, do i lose all osd's, which use a cache partition on that cache disk?
[17:03] <vanham> mattch, great man! I'm sorry I was out of the spot
[17:03] <vanham> But, getting the latest kernel is recommended
[17:04] <mattch> vanham: No problem - yeah - I'm seriously trying to convince folks higher up that we should be using something from the 21st century :)
[17:04] <vanham> rzerres, that is the thing. This is not a cache, it is a journal
[17:04] <vanham> rzerres, journal is used for writing only
[17:04] <matt_> How long until RHEL 7 now? 6 months?
[17:04] <vanham> rzerres, you have to be careful with that data too
[17:04] <matt_> I'm hanging out for some 3.8 enterprise kernel greatness
[17:05] <vanham> to stop using that disk you will have to use ceph-osd --flush-journal
[17:05] <rzerres> for normal operation, this is worth while
[17:05] <vanham> SSD journals are awesome
[17:05] <vanham> I use them too
[17:06] <vanham> It is just that I would love to have a ssd cache and it is not much of an option right now
[17:06] <vanham> kernel 3.9 will get a new dm-cache
[17:06] <vanham> 3.10 should be getting bcache
[17:06] <vanham> EnhanceIO is on the way to main too
[17:06] <vanham> none of which is available now
[17:07] <vanham> then it is truly going to be a SSD read/write cache
[17:08] <rzerres> yes, the performance gains are awesome.
[17:08] <stxShadow> we use ssds for journal and caching (lsi cachecade)
[17:08] <vanham> I have about 120 plate disks here
[17:08] <stxShadow> works like a charm
[17:08] <vanham> Since last year we started using SSDs
[17:09] <vanham> I replace about 5 of the Enterprise Storage WDs every six months
[17:09] <rzerres> vanham, what are you suggesting to live with the situation right now?
[17:09] <vanham> I'm yet to replace a SSD
[17:10] <rzerres> going back to lvm2 with a logical volume on a raid?
[17:10] <vanham> So I think that SSDs (Crucial M4's) are very reliable
[17:10] <vanham> wait wait didn't get it
[17:10] <vanham> What I do today is having each OSD with its SSD disk
[17:11] <vanham> I didn't get what you are trying to do with the -archive thing getting 10% of your data
[17:11] <rzerres> great, if you have space and hardware to serve it.
[17:11] <vanham> So, each OSD is a 3TB disk with a 10G SSD partition
[17:11] <vanham> I intend to have up to 3 3TB HDs on each onde
[17:11] <vanham> *node
[17:12] <rzerres> following is intended (with small customer infrastructure in mind):
[17:12] <vanham> and one SSD for journal and some data too
[17:12] <rzerres> a redundant cluster to store the io created on vm's (kvm + libvirt)
[17:14] <vanham> If you only will have 2 OSD nodes I would go with LVM+DRBD
[17:14] <rzerres> for the mon's i will use at least three machines (the 2 cluster-nodes + 1 admin-machine)
[17:14] <vanham> Much much simpler
[17:14] <vanham> LVM+DRBD+Heartbeat
[17:15] <vanham> On this test cluster I have right now I have 2 admin nodes and 4 processing/data nodes
[17:15] <rzerres> i want to get rid of DRBD and gain the possibility to enrich the cluster ....
[17:15] <vanham> On production I'll have 14 processing/data nodes
[17:15] <vanham> hummm
[17:16] <vanham> There is something really beautiful about Ceph. It is that, on a 10 node cluster, a failed node will spread its load evenly across all the other nodes.
[17:16] <vanham> That is 89% efficiency for N+1 redundancy on IO
[17:16] <vanham> *for IO
[17:17] <vanham> Compared to 50% efficiency with the usual DRBD HA
[17:17] <rzerres> not being religious. besides DRBD there are fault tolerant SAN solutions as well. It's a matter of price, maintainability and the spirit of open-source.
[17:17] <vanham> But you only have that gain when you have at least 4 nodes.
[17:18] <vanham> If you want N+2 (3 copies) the math is a bit more complex but it is still a *lot* better than 50%
[17:19] <vanham> I usually can have 30k users of my system in a rack with DRBD pairs. With Ceph I could have 50k with exactly same hardware
[17:19] <vanham> I'm saying that because you need to plan to have enough IO capacity for when a node fails
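A rough sketch of the capacity reasoning in this exchange, assuming identical nodes and that a failed node's load spreads evenly over the survivors, so the usable share of total IO is (nodes - failures) / nodes. The log does not show how vanham arrived at 89%; this simple model gives 90% for ten nodes and reproduces the 50% figure for a two-node DRBD pair. The function name is illustrative.

```python
def usable_fraction(nodes, tolerated_failures=1):
    """Share of total IO capacity usable while still absorbing node failures."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


# Two-node pair, small cluster, and the ten-node cluster from the discussion.
for n in (2, 4, 10):
    print(n, f"{usable_fraction(n):.0%}")

# Tolerating two simultaneous node failures (N+2) on ten nodes.
print(f"{usable_fraction(10, tolerated_failures=2):.0%}")
```

Under these assumptions the headroom you must reserve shrinks as the cluster grows, which is the planning point vanham makes about failed nodes.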
[17:20] <rzerres> Well, to nail it down: Ceph is truly scalable.
[17:22] <rzerres> but back to my root: I can't afford the customers to run with SSD's for all storage
[17:23] <rzerres> "old" SATA and SCSI Disks will be around for a while. Especially if speed is not the main factor (eg. for archive).
[17:24] <vanham> Ok, so do you have HDs and SSDs on each of the OSD nodes?
[17:24] <rzerres> and SSD as cache in front of slower Disks in Production OSD's will bring you an enormous gain in IO Throughput.
[17:24] <rzerres> yes.
[17:25] * vata (~vata@2607:fad8:4:6:98b0:7f26:d620:3c4) has joined #ceph
[17:25] <rzerres> right now, i get to the point, that old fashioned raid structures are less satisfying than a plain 1 Disk to 1 OSD structure.
[17:28] <rzerres> a node with 6 HD's usually forming a raid5 + spare or raid6 will be outperformed by a JBOD structure.
[17:28] <vanham> Okay, if you have two OSD nodes, why does your map show more?
[17:29] <rzerres> If one OSD fails, the cluster will balance and the admin takes care to replace the missing store.
[17:29] <rzerres> this is just crushmap view
[17:29] <vanham> I think that in 2 node scenarios there isn't much difference. You gain more by paying attention to your partition alignment
[17:29] <rzerres> each machine has 4 osd's right now
[17:29] <vanham> Difference between RAID+LVM+DRBD and Ceph
[17:30] <vanham> Why the differences in weight?
[17:30] * sleinen (~Adium@2001:620:0:26:80a:a5bb:9fb3:e000) has joined #ceph
[17:30] <rzerres> 2 osd's for production, 2 osd for archive space
[17:31] <rzerres> weight because of space. the production osd's are raid5 structures with 3.7 TB data.
[17:31] * Yen (~Yen@ip-81-11-213-31.dsl.scarlet.be) Quit (Ping timeout: 480 seconds)
[17:31] <vanham> hummm
[17:31] <rzerres> the archive osd's just have 1 TB (one disk)
[17:32] <rzerres> probably it is better to build up 1 disk to 1 osd for the production storage.
[17:32] <vanham> So each server has a RAID and one lonely DISK?
[17:32] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:32] <vanham> ?
[17:33] <rzerres> no, 2 servers. each server 2 production osd's (raid5 + spare), 2 archive osd's (2 SATA HD's)
[17:34] <rzerres> each osd have a ssd-partition as a cache.
[17:34] <rzerres> so each of the 2 servers have one SSD for the cache partitions.
[17:34] <vanham> To make things simpler with Ceph I would just declare them as two servers in the map. Otherwise you might lose your data in case 1 server crashes
[17:34] <vanham> Ceph is interpreting them as different servers
[17:34] <vanham> Some of the data is going to the same server for both copies
[17:34] <vanham> Which is bad
[17:35] <vanham> I would have 1 rack -> 2 servers -> 3 osds
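The risk vanham points out here (both copies of some data landing on the same physical server) can be checked with a small script. This is a sketch under stated assumptions: the PG ids, OSD numbers and host names below are hypothetical, and in practice the two mappings would have to be extracted from the cluster's own PG and OSD tree output.

```python
def pgs_on_single_host(pg_to_osds, osd_to_host):
    """Return the PGs whose replicas all live on one physical host."""
    risky = []
    for pg, osds in pg_to_osds.items():
        hosts = {osd_to_host[osd] for osd in osds}
        if len(hosts) == 1:
            risky.append(pg)
    return risky


# Hypothetical layout: two servers with two OSDs each.
osd_to_host = {5: "server1", 6: "server1", 7: "server2", 8: "server2"}
# Hypothetical placement of three PGs with two copies each.
pg_to_osds = {"1.a": [5, 7], "1.b": [7, 8], "1.c": [6, 8]}

risky = pgs_on_single_host(pg_to_osds, osd_to_host)
print(risky)
```

Any PG reported here would lose all of its copies if that one server went down, which is what declaring the real servers as hosts in the CRUSH map is meant to prevent.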
[17:35] <rzerres> on phone - a second
[17:35] <vanham> I have to go here too
[17:35] <vanham> lunch time
[17:35] <vanham> I'll be here for the next week or you can e-mail me at daniel.colchete@gmail.com
[17:36] <rzerres> take care
[17:36] <vanham> take care and good luck!
[17:38] <vanham> Maybe you want to use one pool for the raid disks OSD's and one other pool for the archive disks so that you can separate your data
[17:38] <vanham> bye >)
[17:38] <vanham> :)
[17:39] <rzerres> that's what i do: 1 pool production, 1 pool archive. crushmap takes care for redundancy on the 2 servers .
[17:40] <rzerres> ceph pools are connected to the needed osd's
[17:40] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[17:41] <dspano> I thought raid was frowned upon for osd disks?
[17:42] <dspano> Hence the name of this webinar. The end of raid as we know it. https://www.brighttalk.com/webcast/8847/69375
[17:43] <janos> dspano: as a general rule of thumb i would approach it raid-less
[17:44] <janos> this being the realm of computers, there is always some odd case for something unorthodox
[17:44] * psomas (~psomas@inferno.cc.ece.ntua.gr) has joined #ceph
[17:53] <rzerres> dspano: yes, i guess everybody getting into serious touch with ceph will go through this line.
[17:54] <acalvo> hi, is there an init script for RHEL/CentOS?
[17:55] <acalvo> for rados, I meant
[17:55] <acalvo> or an easy way to start radosgw under CentOS?
[17:55] <acalvo> now I just get 2013-04-11 17:46:07.518388 7f5ce6110820 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
[17:56] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[18:23] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[18:24] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:25] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:27] <gregaf1> acalvo: looks like you just didn't set up your rgw key/keyring; see http://ceph.com/docs/master/radosgw/config/#generate-a-keyring-and-key-for-rados-gateway
[18:29] <acalvo> gregaf1, thanks. The problem was that I didn't pass -n to the radosgw binary, so it couldn't load the section in ceph.conf with all the files
[18:40] * BillK (~BillK@58-7-164-149.dyn.iinet.net.au) has joined #ceph
[18:44] <alexxy> may cephfs still encounter deadlocks if i run the kernel client on osd/mds/mon nodes?
[18:45] <gregaf1> you can't run any kernel clients on the OSD nodes; I think they should work out fine on the MDS or monitor nodes though
[18:45] * amatter (~oftc-webi@ has joined #ceph
[18:46] <gregaf1> that state of affairs is unlikely to change
[18:51] <rzerres> gregaf1, i have asked around on how to clean up my test cluster to get rid of unfound objects
[18:52] <wido> imjustmatthew: my Atom machines are still in production
[18:52] <wido> :)
[18:52] <rzerres> can you or someone else give me some insight?
[18:52] * Cube (~Cube@cpe-76-172-67-97.socal.res.rr.com) has joined #ceph
[18:53] <vanham> rzerres, you want to start over, remove everything?
[18:53] <imjustmatthew> wido: nice :)
[18:53] <wido> imjustmatthew: With the older Atom with 4GB of memory
[18:54] <wido> new board only has 4 SATA ports :(
[18:54] <rzerres> no vanham, not everything. just the objects on the osds that are marked unfound
[18:54] <vanham> ahh..
[18:54] <vanham> okay, can't help then
[18:54] <joelio> instruct as lost?
[18:54] <rzerres> welcome back, thought you are out....
[18:55] <vanham> rzerres, can you post or map on pastebin? I think you are using all OSDs to store everything, that is why you lost normal data when you operated on the archive OSDs
[18:55] <vanham> *your map
[18:56] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[18:57] <vanham> commands: ceph osd getcrushmap -o file1.tmp; crushtool -d file1.tmp -o map.txt
[18:57] <rzerres> vanham, i will add the crushmap to the pastebin. and no, i'm not using all osd's for all data.
[18:57] <vanham> k thanks
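For reference, the export that vanham quotes is the first half of a round trip. A minimal sketch, assuming the `ceph` and `crushtool` binaries are on PATH: the first two commands are the ones given in the channel, while the recompile and inject steps (`crushtool -c`, `ceph osd setcrushmap -i`), the file name `map.new`, and the helper names are additions for illustration only.

```python
import subprocess


def crushmap_commands(binary="file1.tmp", text="map.txt", rebuilt="map.new"):
    """Return the command lines for a CRUSH map export/edit/import round trip."""
    return [
        ["ceph", "osd", "getcrushmap", "-o", binary],  # export the compiled map
        ["crushtool", "-d", binary, "-o", text],       # decompile it to text
        # (edit the text file by hand between these two steps)
        ["crushtool", "-c", text, "-o", rebuilt],      # recompile the edited text
        ["ceph", "osd", "setcrushmap", "-i", rebuilt], # inject it into the cluster
    ]


def export_crushmap(dry_run=True):
    """Run only the two read-only export steps; just print them when dry_run is set."""
    for cmd in crushmap_commands()[:2]:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    export_crushmap(dry_run=True)
```

The dry-run default keeps the sketch harmless on a machine without a cluster; the last two commands change cluster state and are deliberately never executed here.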
[18:58] <imjustmatthew> wido: Yeah the separate IPMI port on the new ones is nice though, but the ECC SODIMMs it takes are seriously overpriced :(
[18:59] <dspano> There is nothing more satisfying than casually doubling the size of your storage cluster without anyone knowing.
[18:59] <wido> imjustmatthew: I heard that SuperMicro is launching a new board later this year with Atoms which support 16GB of memory
[18:59] <wido> Q3 or Q4
[19:00] <alexxy> gregaf1: i asked because fuse client crashes with 0.60
[19:00] <rzerres> vanham, you still have the pastebin? It's up now.
[19:00] <imjustmatthew> wido: That would be awesome
[19:00] <vanham> I'm taking a look at it right now
[19:01] <imjustmatthew> especially if they can squeeze AES-NI into it
[19:01] <dspano> wido: What do you use them for?
[19:01] <wido> dspano: Behind a CloudStack setup
[19:01] <wido> RBD storage
[19:01] <wido> It's 10 nodes with 4 OSDs each, 80TB in total
[19:02] <rzerres> rule dws-vdi is using daywalker-data (dwssrv1: osd.0 + osd.1 and dwssrv2: osd.3 + osd.4)
[19:02] <vanham> rzerres, I was right. The data, metadata and rbd rules are putting data into all your OSDs. That is why you lost data when you operated the *-archive OSDs
[19:03] <vanham> unless you are not using data, metadata and rbd
[19:03] <vanham> let me finish the file then, 1 sec
[19:03] <rzerres> rule dws-archive is using daywalker-archive (dwssrv1: osd.7 + osd.8 and dwssrv2: osd.5 + osd.6)
[19:03] <gregaf1> alexxy: did you submit a bug report? we're doing a lot of bug fixes in CephFS right now preparing for cuttlefish and if v0.60 ceph-fuse is broken somehow that we're not seeing, we'd like to know
[19:03] <dspano> wido: 4 2TB drives per host then. That's interesting. Are they cheap?
[19:04] <rzerres> no, i'm of course not using rbd, metadata and data
[19:04] <vanham> rzerres, true true
[19:04] <alexxy> gregaf1: what info should i provide? i dont have a backtrace
[19:04] <vanham> rzerres, true
[19:04] <wido> dspano: In total about 650 euro per host
[19:04] <gregaf1> vanham: if you are using the kernel client on a node with an OSD you have a potential deadlock under memory pressure: the kernel starts flushing ceph data out of the page cache, which requires allocating messages and then sending them out to the OSDs — which if the OSD is located on the same node requires more kernel memory, which you're trying to free up — deadlock!
[19:05] <gregaf1> alexxy: what do you have that indicates it's crashing?
[19:05] <alexxy> cannot ls
[19:05] <rzerres> just rule dws-test is available to benchmark with all osd's
[19:05] <alexxy> cannot unmount
[19:05] <alexxy> many zombie processes
[19:05] <vanham> gregaf1, with 64bits and 32gbs on the node is this really going to happen?
[19:05] <dspano> wido: That's insanely cheaper than the servers I get from Dell.
[19:05] <alexxy> gregaf1: http://bpaste.net/show/90734/
[19:05] <alexxy> dmesg
[19:06] <gregaf1> vanham: I have no idea what the real likelihood in the field is, but it's a concern that we have with such setups
[19:06] <vanham> rzerres, your ceph health detail says pg 2.65 was using osds [0,7], [0,8], which indicates that those were in rbd, data or metadata
[19:06] <gregaf1> alexxy: okay, so it's hanging, probably while trying to get caps or something
[19:06] <alexxy> may be
[19:07] <alexxy> its running on mds/mon node
[19:07] <gregaf1> the colocation isn't an issue
[19:07] <vanham> gregaf1, thanks for the explanation. I'll try to read a bit on the page cache
[19:07] <rzerres> gregaf1, what do you mean by kernel client?
[19:07] <vanham> gregaf1, I updated to 3.8 to make sure I have the syncfs stuff
[19:07] <gregaf1> we've fixed several bugs around that already and have a couple more in the pipeline but if you have the time to run through it with high debugging (debug client = 20; debug ms = 1) and describe the full-cluster workload we could check and see if it's new or known
[19:08] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[19:08] <gregaf1> rzerres: if you're using kernel rbd or the in-kernel cephfs client instead of ceph-fuse or whatever
[19:09] <alexxy> gregaf1: ok
[19:09] <gregaf1> vanham: this is a classic client-server filesystem issue that NFS has the same troubles with; see eg http://h10025.www1.hp.com/ewfrf/wc/document?cc=us&lc=en&dlc=en&docname=c02073470
[19:10] <rzerres> gregaf1, thanks. so I'm still fine (for testing) to use the ceph-fuse on an osd-node to map rbd objects?
[19:12] <vanham> gregaf1, I completely agree with the problem, but I'll bet that with 14 OSDs/node servers the loopback communication will be low enough (3 replicas)
[19:12] <vanham> my load shouldn't be that big for now too...
[19:12] <vanham> hope it works
[19:12] <gregaf1> rzerres: yeah, the userspace clients are fine because there aren't any memory deadlock issues; the kernel can just page them out to disk if it wants to (which you can't do with kernel memory)
[19:13] <vanham> gregaf1, oh, it only happens when we are oom?
[19:13] <vanham> OOM
[19:13] <gregaf1> yeah
[19:13] <vanham> then that is fine!
[19:14] <rzerres> gregaf1, nice. I just once in a while like to map a snapshot or check a btrfsck on a test vm :)
[19:14] <alexxy> gregaf1: what logs should i show?
[19:15] <gregaf1> the ceph-fuse one is fine if it comes with a description of the workload that it and the cluster are going through
[19:15] <alexxy> where can i find it
[19:16] <alexxy> i dont see anything new in /var/log/ceph/
[19:16] <gregaf1> it might not have any logs by default and to be useful you'd need to start up a new one with the debugging I mentioned enabled from startup (ie, in the ceph.conf)
[19:17] <gregaf1> the default location is /var/log/ceph, though
[19:18] <alexxy> i added this part to ceph conf
[19:18] <alexxy> but i dont see logs
[19:19] <alexxy> # ls /var/log/ceph/
[19:19] <alexxy> ceph.log ceph-mds.alpha.log ceph-mon.alpha.log stat
[19:19] <gregaf1> does the user you're running ceph-fuse as have permission to log there?
[19:19] <alexxy> http://bpaste.net/show/90739/
[19:19] <alexxy> user is root
[19:19] <alexxy> ceph.conf here
[19:20] <rzerres> vanham, back to pg 2.65 [0,7], [0,8]
[19:20] <gregaf1> hrm; does /var/run/ceph/ceph.client.admin.pid (or similar) show up when you run it?
[19:20] * diegows (~diegows@200-081-038-239.wireless.movistar.net.ar) Quit (Ping timeout: 480 seconds)
[19:21] <vanham> rzerres, 0 is -data, 7 and 8 is -archive
[19:21] <rzerres> you are right, when i started with ceph 4 months ago, i didn't care about placement. just wanted to get used to it, handling, etc.
[19:22] <rzerres> and possibly the objects are written with bench calls ..... So i should clean up anyway.
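The check vanham did by eye can be written down. A minimal sketch: the two OSD groups are the ones rzerres listed for daywalker-data and daywalker-archive, and [0,7] and [0,8] are the sets quoted for pg 2.65; the helper names and the third set [0,3] are hypothetical, added only for contrast.

```python
# OSD groups as rzerres described them in the channel.
GROUPS = {
    "daywalker-data": {0, 1, 3, 4},
    "daywalker-archive": {5, 6, 7, 8},
}


def groups_spanned(osds):
    """Return the sorted names of every group holding at least one of the OSDs."""
    wanted = set(osds)
    return sorted(name for name, members in GROUPS.items() if members & wanted)


def crosses_groups(osds):
    """True when a PG's OSD set mixes OSDs from more than one group."""
    return len(groups_spanned(osds)) > 1


if __name__ == "__main__":
    # [0, 7] and [0, 8] come from the log; [0, 3] is a made-up contrast case.
    for osd_set in ([0, 7], [0, 8], [0, 3]):
        print(osd_set, groups_spanned(osd_set), crosses_groups(osd_set))
```

A PG whose set crosses groups must belong to a pool whose rule draws from all OSDs, which is what pointed at the default data, metadata and rbd pools (or bench runs against them).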
[19:23] <vanham> Yes! You are the first I really helped!
[19:23] <vanham> hehe
[19:24] <vanham> I'll take a look at the case where you want different data stored on different disks/OSDs of the same server. I don't know any other way of doing it but your
[19:24] <vanham> s/your/yours/
[19:24] <alexxy> gregaf1: # ls /var/run/ceph/
[19:24] <alexxy> ceph-mds.alpha.asok ceph-mon.alpha.asok mds.alpha.pid mon.alpha.pid
[19:25] <rzerres> vanham, i can assure you that crush is doing its job fine. data is placed on the correct osd's.
[19:26] <gregaf1> alexxy: and you started up a ceph-fuse client that was still running when you did that ls?
[19:26] <rzerres> step chooseleaf firstn 0 type host - this ensures that replicas are never written to osd's on the same server
[19:27] <rzerres> so replication of 2 is guaranteed to be on osd's that reside on different hosts (here: dwssrv1 and dwssrv2)
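The effect of `chooseleaf firstn 0 type host` can be illustrated with a toy model. This is a sketch under loud assumptions: it is not the CRUSH algorithm (a seeded `random.Random` stands in for CRUSH's hashing), the function name `place` is made up, and the host layout is the daywalker-data group from the channel. It only shows the shape of the rule: pick as many distinct hosts as there are replicas, then one OSD under each.

```python
import random

# daywalker-data layout as described in the channel.
HOSTS = {"dwssrv1": [0, 1], "dwssrv2": [3, 4]}


def place(pg_id, replicas=2, hosts=HOSTS):
    """Pick `replicas` distinct hosts, then one OSD under each (toy model, not CRUSH)."""
    if replicas > len(hosts):
        raise ValueError("not enough hosts for the requested replica count")
    rng = random.Random(pg_id)  # deterministic per PG; stand-in for CRUSH hashing
    chosen_hosts = rng.sample(sorted(hosts), replicas)
    return [(host, rng.choice(hosts[host])) for host in chosen_hosts]


if __name__ == "__main__":
    for pg in range(4):
        print(pg, place(pg))
```

Because hosts are sampled without replacement before any OSD is picked, two replicas can never share a server, which is the guarantee rzerres describes.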
[19:29] <vanham> rzerres, true true. I wish there was a way to declare that it is the same server but to use different pools on different osds of the same server. You have to declare them as different servers
[19:29] <vanham> the other rules you have are the only way of doing it. I wish it could be done differently
[19:29] <rzerres> vanham, yes - exactly. therefore i needed to build this structure
[19:30] <vanham> dont forget to set the crush_ruleset to the right number
[19:30] <rzerres> it seems to be a bit overdone. And the ugly thing arises if you have different rooms and different racks. The picture gets way more crowded.
[19:31] <vanham> yeap
[19:31] <alexxy> gregaf1: no
[19:31] <alexxy> now i have zombie client
[19:31] <vanham> I have two different data centers with two racks on each
[19:31] <vanham> no ceph between data centers
[19:31] <vanham> but I see what you mean
[19:32] <rzerres> again, true, true. ceph internally uses numbers. but humans aren't very good at translating numbers to rule names. At least I am too stupid to come up with it after a couple of days
[19:32] * BillK (~BillK@58-7-164-149.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[19:34] <vanham> I wish I could help you getting your data back.
[19:47] <gregaf1> rzerres: it uses librbd so it should be, but dmick would know for certain
[19:47] <dmick> rzerres: yes, because of librbd
[19:52] <dmick> rzerres: IIRC creating an image through rbd-fuse uses xattrs on the mountpoint for its args, so be aware that if you want to create a format 2 image, you'll have to check the xattr settings
[19:53] <dmick> but it'll read them created through the rbd CLI or whatever just fine
[19:58] <wido> dspano: http://www.mini-itx.com/store/?c=86
[20:01] * janos must.. resist... urge to spend....
[20:01] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:13] * Cube (~Cube@ has joined #ceph
[20:13] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[20:14] <rzerres> dmick, thanks for your insight. That is better than expected.
[20:16] <dmick> rzerres: librbd apps are useful
[20:26] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[20:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:27] <dspano> wido: Thanks!
[20:29] <jerker> wido: do you know of any with 8 drives in 1U?
[20:32] <jerker> http://linux-1u.net/Dwg/jpg.sm/c2500.jpg
[20:46] <vanham> up to 24 2.5 drives on 3U
[20:46] <vanham> sorry
[20:46] <vanham> 48 2.5 drives in 3U
[20:46] <vanham> or 24 3.5 drives in 3U
[20:46] <vanham> So, it is 8/U
[20:47] <vanham> About $16k without the drives
[20:47] <vanham> with 12 Xeon E3, 32GB of RAM servers
[20:48] <vanham> This is the toy I always dreamed of when I was a kid
[20:54] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:01] <dosaboy> hi guys, can someone tell me where ceph stores the cluster-fsid? I have reformatted my cluster but my OSDs seem to keep their old cluster-fsid.
[21:03] <matt_> argh, ceph is taunting me. I'm stuck on 793/4574088 degraded (0.017%)
[21:07] <dosaboy> ok so seems to be a slightly different issue. If I specify an fsid in [global]
[21:07] <dosaboy> should that fsid become cluster-fsid for everyone?
[21:08] * hox (~hox@ has joined #ceph
[21:10] <Elbandi_> if i read 1 byte from a file in cephfs, is the whole stripe transferred from the osd?
[21:13] <hox> on radosgw, getting "failed to authorize request" on PUT methods, but not on GET methods. any hints on troubleshooting?
[21:13] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[21:50] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[21:57] * LeaChim (~LeaChim@ has joined #ceph
[22:06] * themgt (~themgt@96-37-28-221.dhcp.gnvl.sc.charter.com) has joined #ceph
[22:53] * rustam (~rustam@ has joined #ceph
[22:57] * diegows (~diegows@ has joined #ceph
[23:55] <dmick> dosaboy: fsid identifies the cluster. It really should be called the cluster UUID. If you specify it in global, everyone will use it.
[23:56] <dmick> Elbandi_: I believe so.
[23:56] <dmick> hox: logs, perhaps?

