#ceph IRC Log


IRC Log for 2013-03-22

Timestamps are in GMT/BST.

[0:05] * JohansGlock_ (~quassel@kantoor.transip.nl) has joined #ceph
[0:09] * masterpe_ (~masterpe@2001:990:0:1674::1:82) Quit (Remote host closed the connection)
[0:09] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) Quit (Ping timeout: 480 seconds)
[0:09] * dennis (~dweazle@tilaa.krul.nu) Quit (Remote host closed the connection)
[0:10] * Gugge-47527 (gugge@kriminel.dk) Quit (Read error: Connection reset by peer)
[0:10] * dennis (~dweazle@2a02:2770::21a:4aff:fee2:5724) has joined #ceph
[0:10] * masterpe (~masterpe@2001:990:0:1674::1:82) has joined #ceph
[0:10] * Gugge-47527 (gugge@kriminel.dk) has joined #ceph
[0:11] * Cube (~Cube@ has joined #ceph
[0:12] * JohansGlock (~quassel@kantoor.transip.nl) Quit (Ping timeout: 480 seconds)
[0:15] * nwat (~Adium@eduroam-233-33.ucsc.edu) Quit (Quit: Leaving.)
[0:20] * markbby1 (~Adium@ Quit (Ping timeout: 480 seconds)
[0:23] * The_Bishop (~bishop@2001:470:50b6:0:a883:f83e:a532:aa6a) Quit (Ping timeout: 480 seconds)
[0:30] * eschnou (~eschnou@37.90-201-80.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:32] * The_Bishop (~bishop@2001:470:50b6:0:ad3a:97ea:9b13:6efa) has joined #ceph
[0:35] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) has joined #ceph
[0:35] * ChanServ sets mode +o scuttlemonkey
[0:42] * noob2 (~cjh@ has left #ceph
[0:48] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[0:55] * jlogan (~Thunderbi@2600:c00:3010:1:8c00:81c9:796a:9e97) Quit (Ping timeout: 480 seconds)
[0:56] * buck (~buck@bender.soe.ucsc.edu) Quit (Quit: Leaving.)
[1:06] * alram (~alram@ Quit (Quit: leaving)
[1:09] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:11] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[1:16] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) has joined #ceph
[1:23] <dmick> PerlStalker: something might be damaged with the underlying objects that make up an rbd image. Can you try with debug librbd = 20?
[1:31] * BillK (~BillK@124-149-78-131.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[1:33] <PerlStalker> How?
[1:35] <dmick> http://ceph.com/docs/master/rados/configuration/ceph-conf/#logs-debugging
[1:35] * LeaChim (~LeaChim@5e0d7853.bb.sky.com) Quit (Ping timeout: 480 seconds)
[1:35] <dmick> either conf-file or on cmdline
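For reference, the conf-file form might look like this sketch (note the correction later in this log: the config value name is `rbd`, not `librbd`, even though the log messages are prefixed `librbd:`):

```ini
; ceph.conf fragment (sketch): client-side rbd debug logging
[client]
    debug rbd = 20
    log file = /var/log/ceph/$name.$pid.log
```

On the command line the equivalent is passing the option directly, e.g. `rbd --debug-rbd=20 info <image>`.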
[1:36] <PerlStalker> ok
[1:40] <PerlStalker> I'm getting "failed to parse arguments" messages
[1:40] * dmick waits for details
[1:41] <PerlStalker> This is what I'm trying: ceph osd tell '*' injectargs '--debug-librbd 20'
[1:41] <PerlStalker> I'm, obviously, doing something wrong.
[1:42] <dmick> try rbd '--debug-librbd=20' info <image>
[1:42] <dmick> that may or may not come out to the console; you might have to look in the client.admin log
[1:44] <PerlStalker> rbd: error parsing command '--debug-librbd=20
[1:45] <dmick> oh, foo. rbd's one of the exceptions. I should have remembered that
[1:48] <dmick> sigh. no it isn't, it just needs the right value. Sorry. --debug-rbd=20
[1:48] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[1:48] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[1:48] <dmick> (the messages will be prefixed with librbd: when they're from the library, but the config value is rbd)
[1:49] <PerlStalker> Ok. That worked. What am I looking for?
[1:49] <dmick> stuff about failures
[1:49] <PerlStalker> 2013-03-21 18:48:39.444683 7f2ea13d4780 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
[1:50] <dmick> that's as much as it says huh
[1:50] <PerlStalker> The full output is at http://pastebin.com/dqTWAgFB
[1:50] <PerlStalker> But it's not much more than that.
[1:52] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:53] <dmick> yeah. do you know if the image was format 1 or 2?
[1:53] <PerlStalker> format 1
[1:53] <dmick> ok. try rados -p kvm_prod ls | grep srvportfolio
[1:54] <PerlStalker> That comes up empty
[1:54] <PerlStalker> rbd ls ... finds it.
[1:54] <dmick> ok. so there should be a srvportfolio.rbd object there, but there is not.
[1:57] <dmick> has this cluster been uneventful?
[1:57] <PerlStalker> I had an osd drop out and reconnect
[1:57] <dmick> (and, you might compare rbd ls with rados -p kvm_prod ls | grep '\.rbd')
[1:57] <dmick> would someone have tried to remove this image?
[1:58] <dmick> (and failed?)
[1:58] <PerlStalker> No
[1:58] <dmick> what version of Ceph?
[1:58] <PerlStalker> Everything was happy until one osd lost connection, briefly.
[1:58] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[1:59] <PerlStalker> ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
[1:59] * hybrid5121 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) has joined #ceph
[2:04] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Ping timeout: 480 seconds)
[2:09] * loicd (~loic@ has joined #ceph
[2:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[2:11] * scalability-junk (uid6422@id-6422.tooting.irccloud.com) has joined #ceph
[2:20] <dmick> PerlStalker: is this the only image that is affected thusly?
[2:20] <PerlStalker> no
[2:24] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[2:26] <dmick> health is good now?
[2:28] <PerlStalker> Yes
[2:29] <dmick> if you ceph osd map kvm_prod srvportfolio.rbd
[2:30] <dmick> does the output show the OSD that died?
[2:31] <PerlStalker> No. It shows two others. osdmap e1702 pool 'kvm_prod' (3) object 'srvportfolio.rbd' -> pg 3.aaafda26 (3.26) -> up [1,2] acting [1,2]
[2:32] <PerlStalker> The one that failed was 4
[2:34] <dmick> well, this should not have happened, and I don't know why it did
[2:34] <PerlStalker> That makes two of us. :-)
[2:34] <dmick> what is the missing / total count?
[2:34] <dmick> i.e. compare rbd ls with rados -p kvm_prod ls | grep '\.rbd'
[2:36] <PerlStalker> It was 9 of 43
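The comparison dmick suggests (find format-1 images whose `<name>.rbd` header object is missing from the pool) can be sketched offline; the image and object names below are made up for illustration:

```python
# Sketch: find rbd images whose format-1 header object is missing.
# These lists stand in for `rbd ls` and `rados -p <pool> ls` output
# (hypothetical names, not from the actual cluster in this log).
rbd_images = ["srvportfolio", "srvmail", "srvweb"]
rados_objects = ["srvmail.rbd", "srvweb.rbd", "rbd_directory"]

# A format-1 image named X stores its header in an object called "X.rbd".
missing = [img for img in rbd_images if img + ".rbd" not in set(rados_objects)]
print(missing)  # the image(s) with no header object
```

In PerlStalker's case this comparison came up with 9 missing headers out of 43 images.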
[2:54] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[2:58] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:08] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) Quit (Quit: my troubles seem so far away, now yours are too...)
[3:11] <dmick> PerlStalker: only thing I can think at the moment is maybe try searching for those object names in any and all logs to see if there was some operation no one expected
[3:11] <dmick> would they have been renamed?
[3:11] <PerlStalker> They would not have been renamed.
[3:13] <PerlStalker> I did have a couple of pgs that were stuck in an inconsistent state that I performed a pg repair on.
[3:13] <PerlStalker> It's possible the repair borked something.
[3:14] <dmick> do you have a record of which? would one of them have been 3.26, for example?
[3:14] <PerlStalker> Yes
[3:14] <PerlStalker> 3.26 was one of the repaired pgs.
[3:14] <dmick> hm. so that's a clue
[3:15] <dmick> if you can, it would be interesting to know if all the missing objects were mapped to the repaired pgs
[3:16] <PerlStalker> Where would I find that? ceph osd map?
[3:16] <dmick> yeah, like before
[3:16] <dmick> ceph osd map kvm_prod srvportfolio.rbd
[3:16] <dmick> osdmap e1702 pool 'kvm_prod' (3) object 'srvportfolio.rbd' -> pg 3.aaafda26 (3.26) -> up [1,2] acting [1,2]
[3:17] <dmick> that says "the map, currently at version 1702, would store that object in pg 3.26, which is currently on osds 1 and 2"
[3:18] <dmick> (I forget exactly what the aaafda26 part is)
[3:18] <PerlStalker> I spot checked three of the images and they were all on the repaired pgs.
[3:18] <dmick> so this is indeed a clue.
[3:18] <PerlStalker> Indeed.
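The spot check above can be scripted; this sketch parses the `ceph osd map` output format pasted earlier, and the repaired-pg set is a made-up example:

```python
import re

# One line of `ceph osd map <pool> <object>` output, as pasted above.
line = ("osdmap e1702 pool 'kvm_prod' (3) object 'srvportfolio.rbd' "
        "-> pg 3.aaafda26 (3.26) -> up [1,2] acting [1,2]")

# Pull out the short pg id (the "(3.26)" part) and check it against
# the set of pgs that were repaired (hypothetical list).
m = re.search(r"\((\d+\.[0-9a-f]+)\)", line)
pg = m.group(1)
repaired_pgs = {"3.26", "3.41"}  # illustrative, not from the real cluster
print(pg, pg in repaired_pgs)  # -> 3.26 True
```

Running this over every missing object name would confirm whether all of them map to repaired pgs.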
[3:19] <mikedawson> dmick, joao: recreated the ceph-create-keys never finishing issue on a 0.59 on both Quantal and Raring
[3:19] <dmick> mikedawson: did you find and deploy any upstart debugging/logging?
[3:21] <mikedawson> if I start with /etc/init.d/ceph , does that completely avoid upstart?
[3:22] <dmick> no; the upstart configs are still in /etc/init
[3:22] <dmick> it's *supposed* to avoid actually *doing* anything like starting daemons or key-creation
[3:22] <dmick> but apparently it's not
[3:23] <dmick> look back at the log for where I was muttering things about ceph- events
[3:24] <mikedawson> right, I was trying to bypass upstart completely to see if the issue persisted without that variable
[3:24] <dmick> you can temporarily move all the ceph-*.conf out of /etc/init and kick upstart in the head somehow, perhaps, to test that, but
[3:25] <dmick> I'm certain that's what's starting the ceph-create-keys process, so I don't think it needs verification
[3:25] <dmick> it's just a question of why our mechanism for not controlling the daemons isn't working
[3:26] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[3:27] <dmick> PerlStalker: do you have any logging on those objects? If repair affected them, that might be in the log
[3:28] <PerlStalker> dmick: I have some logs. Let me dig.
[3:31] <PerlStalker> dmick: The repair of 3.26 doesn't show in the logs but repairs of some of the others do.
[3:31] <dmick> does it mention the missing objects?
[3:34] <PerlStalker> Not by name
[3:34] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[3:36] <dmick> would have taken debug osd 10 or greater
[3:36] <PerlStalker> I'm running at the default log level.
[3:37] <mikedawson> dmick: looks like ceph-create-keys was added Feb 28th http://ceph.com/git/?p=ceph.git;a=commitdiff;h=ffc0ff68e017747bbb1459520a0908ec75f32931
[3:37] * Cube (~Cube@ Quit (Quit: Leaving.)
[3:37] * loicd (~loic@ Quit (Quit: Leaving.)
[3:38] <dmick> ah! Maybe it isn't upstart after all then...
[3:41] <mikedawson> dmick: I'll work around it for now. Does this seem like a bug to you? If so, should I file a bug or ping sage on ceph-devel?
[3:41] <dmick> still, though, that's only supposed to be there for starting a mon
[3:41] <mikedawson> there is a mon on this system
[3:42] <dmick> yeah, but, didn't we see ceph-create-keys procs even during the time when the cluster was supposedly down? Or were those just leftovers from prior experiences?
[3:43] <mikedawson> leftovers
[3:46] <dmick> alright. so it's being started by init.d/ceph and after the monitor is started
[3:48] <dmick> *oh*.
[3:49] <mikedawson> dmick: yes, I believe that's right
[3:49] * The_Bishop (~bishop@2001:470:50b6:0:ad3a:97ea:9b13:6efa) Quit (Quit: Who the hell is this peer? If I catch him, I'll reset his connection!)
[3:49] <dmick> path = '/etc/ceph/{cluster}.client.admin.keyring'
[3:49] <dmick> seems ominous
[3:50] <dmick> is the process currently hung?
[3:50] <dmick> or, at least, running?
[3:53] <mikedawson> when I start, I see the mon process, then the /usr/bin/python /usr/sbin/ceph-create-keys -i a process (always with a higher pid). It never ends. When the mon is running, I don't yet see the get-or-create. After I stop the Mon, I see it ... ceph --cluster=ceph --name=mon. --keyring=/var/lib/ceph/mon/ceph-a/keyring auth get-or-create client.admin mon allow * osd allow * mds allow
[3:54] <mikedawson> if I start the mon a second time, I end up with another ceph-create-keys that never stops
[3:55] <dmick> I'd like to know what ceph-create-keys is doing when the mon is running
[3:55] <dmick> can you strace -f it?
[3:59] <mikedawson> dmick: here is the progression through the process http://pastebin.com/raw.php?i=qdjNsFKm
[4:03] <mikedawson> dmick: strace -> http://pastebin.com/raw.php?i=iiajkx0Y
[4:23] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:27] * TMM (~hp@535240C7.cm-6-3b.dynamic.ziggo.nl) has joined #ceph
[4:32] <dmick> hm. so it is trying to do the get-or-create
[4:52] <dmick> so mkcephfs creates a client.admin key, and prepopulates the mon keyring with that, and generates a mon. key and adds it
[4:52] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[4:52] <dmick> it looks as though ceph-create-keys is trying to create the client.admin key using the mon. key from the monitor keyring
[4:52] <dmick> which is certainly different, and doesn't seem to work for me when I try it
[4:53] <dmick> on a test cluster
[4:53] <dmick> I'll talk to sage about this tomorrow; I don't understand what rights the mon. key is supposed to grant
[4:53] <sage> dmick: wahtever caps are set in the $mon_data/keyring file
[4:54] <dmick> hey lookit that
[4:54] <sage> stick 'caps mon = allow *' in there or whatever it is and you can do it all
[4:54] <dmick> what cap does it take to create the client.admin key?
[4:54] <mikedawson> sage, dmick: just sent some documentation into ceph-devel. Thanks for looking at this one. Let me know if you need anything else to debug
[4:56] <sage> caps mon = "allow *"
[4:57] <sage> is sufficient
[4:57] <dmick> a mon key on a teuthology-created cluster has no caps
[4:57] <sage> vstart was that way until recently too
[4:57] <sage> its only useful if you're using the mon. key as a back door
[4:57] <dmick> ...and this vstarted cluster has a allow *
[4:57] <dmick> ok
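Put concretely, a mon keyring with the back-door caps sage describes would look roughly like this (a sketch in the keyring-file format ceph tools emit; the secret itself is left as a placeholder):

```ini
; $mon_data/keyring fragment (sketch)
[mon.]
    key = <base64 secret, unchanged>
    caps mon = "allow *"
```

Without the `caps mon` line, the mon. key cannot be used to run `auth get-or-create`, which is what ceph-create-keys is attempting.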
[4:58] <dmick> so how did mikedawson's key get created, I guess. Maybe before ceph-create-keys needed this back door
[4:58] <sage> the chef and ceph-deploy stuff sets it, though.. ceph-create-keys uses it.
[4:58] <sage> probably by mkcephfs?
[4:58] <dmick> ah, so if chef and ceph-deploy weren't used, then it won't be there for ceph-create-keys?...
[4:58] <mikedawson> i did use mkcephfs
[4:58] <sage> it builds the full keyring and feeds it to the montior when it is initialized
[4:58] <dmick> mikedawson: does your mon keyring have caps?
[5:01] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) has joined #ceph
[5:02] <mikedawson> http://pastebin.com/raw.php?i=7gFafdU8
[5:02] <dmick> no. So the question is how did you get such a mon keyring I guess
[5:03] <dmick> did you say you had just set this cluster up (with mkcephfs)?
[5:03] <mikedawson> yes, mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.keyring
[5:05] <mikedawson> this is a fresh quantal machine with 0.59 packages from ceph development repo
[5:07] <dmick> I may be being dense, but it seems to me that mkcephfs would have put the mon keyring in keyring.mon
[5:09] <dmick> probably in /tmp/mkcephfs.*/keyring.mon?
[5:11] <mikedawson> right now /tmp is empty
[5:12] <mikedawson> don't see any keyring.mons anywhere
[5:12] <dmick> I have to run mkcephfs to find out what it does, I guess. I'm missing something large
[5:14] <dmick> I'm really sorry for running you around the tree like this. I'm confused.
[5:15] <mikedawson> no worries, I'll start looking at code a bit closer tomorrow if it doesn't become clear to you by then
[5:15] <dmick> just installing 59 to run it from its normal paths
[5:19] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[5:28] <dmick> ok, I end up with a mon keyring in /var/lib/ceph that has no caps as well
[5:37] <dmick> alright. I didn't see it because it's ceph-mon --mkfs that's copying from the /tmp dir to the mon's filestore
[5:38] <dmick> but it looks to me like mkcephfs is making a mon key with no caps
[5:38] <dmick> and that's incompatible with ceph-create-keys
[5:39] * dec (~dec@ec2-54-251-62-253.ap-southeast-1.compute.amazonaws.com) Quit (Quit: Lost terminal)
[5:41] <dmick> mikedawson: I don't see your post to ceph-devel
[5:41] <dmick> maybe you had an attachment and it got bounced?
[5:42] <dmick> anyway, I think this is a bug in ceph-create-keys
[5:42] <dmick> and maybe a bug in mkcephfs
[5:43] <dmick> ceph-create-keys shouldn't be looking for the client.admin key in a fixed path /etc/ceph/{cluster}.client.admin.keyring
[5:43] <dmick> if it were looking where the client.admin keyring is configured, it wouldn't be trying to do anything else (that it doesn't have to do)
[5:44] <dmick> but also if we want ceph-create-keys to be able to clean up after a problem with mkcephfs-created clusters, mkcephfs will have to leave that backdoor mon. key around, and it's not doing that
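The fixed path dmick quoted earlier is a Python format template; filling in the default cluster name shows why any non-default keyring location gets missed:

```python
# The template quoted from ceph-create-keys above; substituting the
# default cluster name shows the only path it will ever look at.
path_template = '/etc/ceph/{cluster}.client.admin.keyring'
path = path_template.format(cluster='ceph')
print(path)
```

A cluster whose admin keyring lives anywhere else (e.g. the `-k /etc/ceph/ceph.keyring` mikedawson passed to mkcephfs) is invisible to it.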
[5:45] <dmick> so Sage will get my mind right tomorrow and I'll make sure issues get filed.
[5:47] <mikedawson> dmick: yeah, it had attachments. It hasn't been returned though. Thanks for your help!
[5:47] <dmick> vger may not bounce, it may just eat
[5:49] <dmick> http://vger.kernel.org/majordomo-info.html
[6:04] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:13] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) Quit (Quit: themgt)
[6:18] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[6:27] * ivoks (~ivoks@jupiter.init.hr) has joined #ceph
[6:46] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Read error: Operation timed out)
[6:48] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) Quit (Quit: Leaving.)
[6:49] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) has joined #ceph
[7:22] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[7:36] <Karcaw> dmick: can you glance at bug 4521 and tell me if there is any more useful info i can provide before i take a sleep break?
[7:39] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) Quit (Quit: Leaving.)
[7:59] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[8:04] * tnt (~tnt@82.195-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:19] * portante (~user@ Quit (Remote host closed the connection)
[8:20] * portante (~user@ has joined #ceph
[8:20] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Some folks are wise, and some otherwise.)
[8:23] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) has joined #ceph
[8:27] * Kioob (~kioob@2a01:e35:2432:58a0:21a:92ff:fe90:42c5) has joined #ceph
[8:27] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) Quit ()
[8:50] * ssejour (~sebastien@ has joined #ceph
[8:50] * BManojlovic (~steki@ has joined #ceph
[8:57] * ssejour (~sebastien@ has left #ceph
[9:25] * mcclurmc (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[9:27] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:28] * BillK (~BillK@124-169-204-134.dyn.iinet.net.au) has joined #ceph
[9:31] * gerard_dethier (~Thunderbi@ has joined #ceph
[9:37] * leseb (~leseb@ has joined #ceph
[9:50] * Morg (b2f95a11@ircip2.mibbit.com) has joined #ceph
[9:55] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Read error: Connection reset by peer)
[10:06] * loicd (~loic@c-76-119-91-36.hsd1.ma.comcast.net) has joined #ceph
[10:11] * l0nk (~alex@ has joined #ceph
[10:13] * dosaboy (~user1@host86-163-9-154.range86-163.btcentralplus.com) has joined #ceph
[10:13] * dosaboy (~user1@host86-163-9-154.range86-163.btcentralplus.com) Quit ()
[10:16] * eschnou (~eschnou@ has joined #ceph
[10:16] * dosaboy (~user1@host86-163-9-154.range86-163.btcentralplus.com) has joined #ceph
[10:21] * danieagle (~Daniel@ has joined #ceph
[10:21] * ScOut3R (~ScOut3R@ has joined #ceph
[10:34] * LeaChim (~LeaChim@5e0d7853.bb.sky.com) has joined #ceph
[10:56] <sig_wall> Hello.
[10:57] <sig_wall> Both our production and development clusters are unusable during recovery. I have a question: is this a general ceph property? What performance impact does recovery have on your cluster?
[10:58] <sig_wall> I want to understand why it happens on every test configuration too, and whether anyone else has this problem.
[11:03] * mattch (~mattch@pcw3047.see.ed.ac.uk) has joined #ceph
[11:14] * danieagle (~Daniel@ Quit (Quit: See you! :-) and Many Thanks For Everything!!! ^^)
[11:21] * Qt3n (~Qten@qten.qnet.net.au) Quit (Read error: Connection reset by peer)
[11:21] * Qt3n (~Qten@qten.qnet.net.au) has joined #ceph
[11:23] * mcclurmc (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) Quit (Ping timeout: 480 seconds)
[11:23] * Qt3n (~Qten@qten.qnet.net.au) Quit (Read error: Connection reset by peer)
[11:24] * Qt3n (~Qten@qten.qnet.net.au) has joined #ceph
[12:08] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) has joined #ceph
[12:08] * ChanServ sets mode +o scuttlemonkey
[12:16] <Gugge-47527> sig_wall: how "unusable", and what version?
[12:18] <sig_wall> overall cluster reads drop to 500 kB/s, and there are a lot of slow write requests in the ceph log. v. 0.56.3
[12:20] * mcclurmc (~mcclurmc@firewall.ctxuk.citrix.com) has joined #ceph
[12:27] * esammy (~esamuels@host-2-103-101-143.as13285.net) Quit (Ping timeout: 480 seconds)
[12:31] * rzerres (~ralf.zerr@xdsl-195-14-207-9.netcologne.de) has joined #ceph
[12:31] <rzerres> good morning US
[12:32] <rzerres> I'd like to find a helping hand to rescue my broken ceph-cluster. anybody online?
[12:35] <ninkotech> rzerres: good morning. i really cant help, but you are first one i see here talking about this... what happened?
[12:35] <ninkotech> americans will be up in 2-4 hours
[12:36] <scuttlemonkey> I may be able to help
[12:36] <scuttlemonkey> this particular American happens to be in Germany at the moment :)
[12:37] <rzerres> hey scuttler
[12:37] <rzerres> thanks for your offer
[12:37] <rzerres> let me explain the issue and the environment
[12:38] <rzerres> i've been working on this stuff for 2 months now, so i do think i did my homework before bothering anybody :)
[12:38] <rzerres> aiming for a cluster with 2 servers for osd's and mds
[12:39] <rzerres> 3 servers for the monitors, to guarantee voting works in case of a corruption on one of the osd-servers
[12:40] <rzerres> every osd-server holds 2 osd's with production data
[12:40] <rzerres> every osd-server holds 2 osd's with archive data
[12:40] <rzerres> redundancy is configured to be 2
[12:41] <rzerres> crushmap rules will ensure that data is striped to the correct osd's on the physical servers.
[12:41] * Kioob (~kioob@2a01:e35:2432:58a0:21a:92ff:fe90:42c5) Quit (Read error: Connection reset by peer)
[12:41] * Kioob (~kioob@2a01:e35:2432:58a0:21a:92ff:fe90:42c5) has joined #ceph
[12:42] <Gugge-47527> sig_wall: how is the io load on the osd's?
[12:42] <rzerres> so we have about 8 osd's all together.
[12:42] <rzerres> config is using xfs as fs on the osd's
[12:43] <rzerres> each osd is coupled with a ssd-partition for the cache
[12:43] <rzerres> atop gives nice values, so the machines are not overloaded getting their jobs done.
[12:44] <rzerres> if needed, i will post all the details of config, logs, etc.
[12:44] <rzerres> since yesterday i upgraded to 0.59.
[12:45] <rzerres> base systems are ubuntu-precise running kernel 3.6.9-030609-generic
[12:46] <rzerres> we have a public and a cluster network.
[12:46] <sig_wall> Gugge-47527: typical io load in normal operation is 300 iops, 5MB/s, with bursts.
[12:46] <rzerres> for the moment they are connected as 1Gb trunks
[12:47] <sig_wall> Gugge-47527: not so much
[12:47] <rzerres> i made iperf tests for the net. seems ok on all interfaces
[12:48] <rzerres> roughly it will make 110MB/s as a max.
[12:48] <rzerres> for benchmarking the osd's i was using the ceph bench tools.
[12:49] <rzerres> productions osd's hold a raid5 with 3 disks each.
[12:50] <rzerres> archive osd's are JBOD direct connected SATA-disks
[12:53] <rzerres> since i wanted to get safe at handling error situations, i played around.
[12:53] <rzerres> i moved osd's from one osd-server to the other.
[12:54] <rzerres> i marked production osd's out and down, deleted the fs and rebuilt them.
[12:54] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[12:55] <rzerres> made them go back into the cluster and watched the rebuild.
[12:55] <rzerres> really nice software, by the way. many thanks to inktank :)
[12:58] <rzerres> then i played too much. more precisely, i rebooted one node before waiting for the rebuild, backfilling, etc. to finish.
[12:58] <rzerres> since then, the osd's on cluster-node 2 can't be started anymore.
[13:01] <rzerres> the osd's dump core on start
[13:01] <rzerres> it advises objdump -rdS <executable> to interpret the output.
[13:02] <rzerres> i have checked the archives and the known channels to get a handle on it, but with no luck. i'm stuck here.
[13:03] <scuttlemonkey> hmm
[13:04] <rzerres> now, when starting the 2nd mon it says
[13:04] <rzerres> Starting Ceph mon.1 on dwssrv2...
[13:04] <rzerres> error checking features: (1) Operation not permitted
[13:04] <rzerres> 2013-03-22 13:03:19.981440 7f231d449780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
[13:04] <rzerres> failed: 'ulimit -n 131072; /usr/bin/ceph-mon -i 1 --pid-file /var/run/ceph/mon.1.pid -c /etc/ceph/ceph.conf '
[13:05] <rzerres> i didn't make any changes to the ceph.conf so ....
[13:07] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[13:08] <scuttlemonkey> that sounds like the monitor is unable to write to disk
[13:08] <scuttlemonkey> can see here:
[13:08] <scuttlemonkey> https://github.com/ceph/ceph/blob/f00f3bc4e5db04be036ec737e4ed9d9281f64eb3/src/mon/Monitor.cc
[13:08] <rzerres> right.
[13:10] <scuttlemonkey> you say you moved the osds and rebuilt the fs
[13:10] <scuttlemonkey> er, redid mkcephfs at least
[13:10] <scuttlemonkey> is it possible permissions got set incorrectly or something?
[13:10] <rzerres> yes. this is very well documented at ceph-docs.
[13:10] * BillK (~BillK@124-169-204-134.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[13:11] <rzerres> the rebuilt osd's are mounted in the fs after the mkfs.xfs
[13:12] <rzerres> then the fs for the osd (including the journal) was rebuilt.
[13:12] <rzerres> i could start the osd's afterwards and made them join the cluster.
[13:15] <scuttlemonkey> ok, so the only thing that changed is you restarted the osd while it was still rebalancing?
[13:15] <scuttlemonkey> and now it gives you a core dump?
[13:15] <rzerres> for now, i'm not able to dump out the ceph stats, since monitors can't obtain the status.
[13:15] <rzerres> suttle: yes, that was the case.
[13:16] <rzerres> and that the 2nd monitor can't start now.
[13:16] <rzerres> both running monitors are logging:
[13:16] <rzerres> mon.0: 2013-03-22 13:15:59.563645 7fcbb72c7700 0 -- >> pipe(0x7fcba400f0f0 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
[13:17] <rzerres> mon.2: 2013-03-22 13:13:07.608197 7f8896ef5700 1 mon.2@0(probing) e4 discarding message auth(proto 0 26 bytes epoch 4) v1 and sending client elsewhere; we are not in quorum
[13:17] <Gugge-47527> sig_wall: what storage do you use for an OSD, i would not expect 300 iops from a single sata disk :)
[13:17] <rzerres> mon2: 2013-03-22 13:17:34.628758 7f88955f1700 0 -- >> pipe(0x1da4a00 sd=17 :34151 s=1 pgs=0 cs=0 l=0).failed verifying authorize reply
[13:18] <scuttlemonkey> rzerres: wonder if it's something new in .59
[13:18] <rzerres> well, maybe. But i doubt that.
[13:18] <scuttlemonkey> all nodes got update to .59, right?
[13:18] <rzerres> yes. all packages are up to date.
[13:19] <scuttlemonkey> k
[13:19] <scuttlemonkey> I'm probably not the one that will be able to solve this
[13:19] <sig_wall> Gugge-47527: no, 300 iops is total over 30 osds
[13:20] <scuttlemonkey> unfortunately, the experts wont be around for another 4 hours or so :(
[13:20] * BillK (~BillK@203-59-42-158.dyn.iinet.net.au) has joined #ceph
[13:20] <rzerres> i did see 260 iops on the 4 osd's coupled to pool archive.
[13:20] <Gugge-47527> sig_wall: total is not really interesting, what is interesting is if any of your osd's are limited
[13:21] <rzerres> yes. and i like to get prepared before.
[13:21] <scuttlemonkey> rzerres: I will send a message to the dev team and let them know what you have told me
[13:21] <rzerres> it would be helpful, if i can get the monitors running, to get ceph status information
[13:21] <rzerres> very nice. thank you
[13:22] <rzerres> if i can support them with more precise information, what do i have to do?
[13:22] <scuttlemonkey> I'll ask someone to stop by irc when they get in and look for you
[13:23] <rzerres> wonderful
[13:23] <sig_wall> Gugge-47527: actually slow requests are reported on almost all osds during recovery. every osd has a 500GB disk with XFS.
[13:24] <Gugge-47527> and how is the iowait on those OSD's?
[13:25] <Gugge-47527> What kind of disks are the OSD's using?
[13:26] <BillK> rzerres: I saw that osd error on .58 when moving OS from reiserfs on a platter to btrfs on an ssd
[13:26] <BillK> rzerres: was gentoo so did a rebuild of ceph and that fixed it
[13:27] <BillK> rzerres: somewhere there I deleted the journals and recreated them, but that didn't seem to matter.
[13:28] <rzerres> billk: yes, i also had the impression, it could be a corrupt journal on the ssd
[13:28] <BillK> rzerres: move was done using rsync and my "theory" is something didn't go properly - /dev/sda3 to /dev/sdd3
[13:28] <rzerres> billk: but i made sure the caches got flushed to the osd
[13:30] <rzerres> on my cluster-servers, the os is homed on an btrfs
[13:30] <sig_wall> Gugge-47527: iowait is 20% in peak but usually lower. "atop" reports 40%-100% disk load during recovery.
[13:30] <BillK> rzerres: if you get nowhere, you might have a look at the ubuntu installer and see if a disk reference is embedded somewhere that needs to track the osd move?
[13:30] <rzerres> but as suggested for the osd's I made it conservative and used the stable xfs
[13:31] <Gugge-47527> sig_wall: so your problem is that you have slow osd's, or that recovery is to heavy
[13:31] <sig_wall> Gugge-47527: most time disk are not overloaded - local operations don't slow down
[13:31] <rzerres> billk: you lost me. which ubuntu installer?
[13:32] <Gugge-47527> sig_wall: what kind of disks are those 500GB disks?
[13:33] <Gugge-47527> sig_wall: i would play around with http://ceph.com/docs/master/rados/configuration/osd-config-ref/#recovery settings
[13:36] <BillK> rzerres: apt or whatever they call it
[13:37] <sig_wall> Gugge-47527: actually the cluster is unusable with any value > 0
[13:38] <sig_wall> Gugge-47527: recovery max ops = 0 -- cluster works, recovery max ops = 1 -- a lot of slow requests
[13:38] <sig_wall> (if I change it on the fly with inject args)
[13:38] <sig_wall> *injectargs
[13:38] <sig_wall> I don't understand why it is happening on all my clusters
[13:38] <rzerres> billk: well, all apt calls to install the package hierarchy did work out with no flaws.
[13:40] * sig_wall thinking about migration from ceph to raid which is not so fault-tolerant :(
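The recovery knobs from the doc Gugge-47527 linked go in the [osd] section; the values below are an illustrative sketch of throttled recovery, not tuned recommendations:

```ini
; ceph.conf fragment (sketch): throttle recovery to protect client io
[osd]
    osd recovery max active = 1
    osd max backfills = 1
```

These can also be changed at runtime with `ceph osd tell '*' injectargs ...`, which is how sig_wall was flipping `recovery max active` above.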
[13:51] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[13:51] * ChanServ sets mode +o elder
[13:59] * markbby (~Adium@ has joined #ceph
[14:00] * esammy (~esamuels@host-2-102-68-175.as13285.net) has joined #ceph
[14:01] * __jt__ (~james@rhyolite.bx.mathcs.emory.edu) Quit (Quit: leaving)
[14:02] * Kioob (~kioob@2a01:e35:2432:58a0:21a:92ff:fe90:42c5) Quit (Quit: Leaving.)
[14:06] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) has joined #ceph
[14:10] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:16] <Gugge-47527> sig_wall: why do you refuse to tell what kind of disks you use? :)
[14:16] <Gugge-47527> sig_wall: you are talking about "osd recovery max active" right? :)
[14:17] <joao> scuttlemonkey, here
[14:17] <sig_wall> Gugge-47527: normal 7200 disks
[14:17] <sig_wall> Gugge-47527: sata
[14:18] <sig_wall> yes, max active
[14:18] <Gugge-47527> I would not call a 500GB sata disk normal anymore :)
[14:18] <Gugge-47527> But okay :)
[14:18] <Gugge-47527> with that setting set to 1, is the osd disk still 100% busy?
[14:20] <scuttlemonkey> joao: thanks!
[14:20] <joao> rzerres, can you set 'debug mon = 20', 'debug paxos = 20' and 'debug ms = 1' on your monitors, give them a spin, make them bleed, etc and then point me to the logs?
[14:21] <rzerres> joao: hey, and thanks for taking this on. Will produce the logs .... hold on
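As a conf-file fragment, the logging joao asked for would look roughly like this (a sketch; setting it via injectargs is the alternative, but these mons won't start):

```ini
; ceph.conf fragment (sketch): monitor debug logging for joao
[mon]
    debug mon = 20
    debug paxos = 20
    debug ms = 1
```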
[14:21] <Gugge-47527> sig_wall: what kind of controllers do you use for the sata disks?
[14:22] <Gugge-47527> sig_wall: and do they support ahci, and do they run ahci, and do the disks support NCQ, and is it enabled?
[14:22] <joao> also, rzerres, which version were you using prior to upgrading to v0.59?
[14:22] <sig_wall> yes of course.
[14:23] <rzerres> using v0.58
[14:23] <joao> okay
[14:23] <joao> and your monitors never worked after the upgrade?
[14:23] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) has joined #ceph
[14:23] <Gugge-47527> sig_wall: can you post some iostat -x 1 <device> output somewhere? (where the device is an osd disk)
[14:23] <rzerres> single error on mon.2: 2013-03-22 14:10:11.664664 7fadafda4780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
[14:23] <Gugge-47527> while the recovery is running, and everything is slow
[14:24] <sig_wall> Gugge-47527: megasas without raid
[14:24] <rzerres> before i rebooted yesterday, mon.0 to mon.2 did work
[14:24] <joao> rzerres, but I suppose they were still using v0.58 by then, no?
[14:25] <rzerres> when I start the cluster now: only mon.0 and mon.2 come up running
[14:25] <rzerres> v0.58 was working until yesterday afternoon
[14:26] <joao> right
[14:26] <joao> something might have gone wrong with the store conversion, after restarting v0.59
[14:26] <joao> I would need the logs from the time of the upgrade as well
[14:26] <joao> if that's okay with you
[14:27] <rzerres> joao: i did stop all processes for now.
[14:27] <rzerres> yes, if i can provide all logs. let me see ...
[14:27] <rzerres> so you need the mon logs from all instances?
[14:27] * diegows (~diegows@ has joined #ceph
[14:28] <sig_wall> Gugge-47527: https://gist.github.com/sigwall/96cdc4e39ca778bd05e8/raw/65e147d5b46c7598ca4f1c01ec6596acb3b0654b/gistfile1.txt
[14:29] * markbby1 (~Adium@ has joined #ceph
[14:29] * markbby (~Adium@ Quit (Remote host closed the connection)
[14:30] <Gugge-47527> sig_wall: post like 10 seconds output :)
[14:30] <Gugge-47527> i cant tell if that one is the first output, or not
[14:30] <joao> rzerres, for now I only need the logs from the monitors that have been having issues
[14:31] <rzerres> ok. this is mon.1 (2nd cluster server)
[14:31] <joao> and if they are too big, I'm okay with just the chunk right about the time you were going to upgrade, up to now
[14:32] <joao> also, whenever you have the time, spin up the monitor you're having issues with with 'debug mon = 20', 'debug paxos = 20' and 'debug ms = 1'
[14:32] <joao> that would also be very useful
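For reference, the debug settings joao keeps asking for, written out as a ceph.conf fragment (a sketch; section placement as commonly used for monitor options, so verify against your own config):

```ini
[mon]
    debug mon = 20
    debug paxos = 20
    debug ms = 1
```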
[14:33] <sig_wall> Gugge-47527: after 10 seconds nothing changed in output
[14:33] <rzerres> joao: quick grep through the log of interest
[14:33] <rzerres> joao: 2013-03-21 22:22:22.299939 7f3554ec3780 0 ceph version 0.58 (ba3f91e7504867a52a83399d60917e3414e8c3e2), process ceph-mon, pid 3018
[14:33] <rzerres> 2013-03-21 22:22:22.299955 7f3554ec3780 1 store(/srv/ceph/mon.1) mount
[14:33] <rzerres> 2013-03-21 22:22:22.300449 7f3554ec3780 1 mon.1@-1(probing) e4 preinit fsid 3d9571c0-b86c-4b6c-85b6-dc0a7aa8923b
[14:33] <rzerres> 2013-03-21 22:22:22.310888 7f3554ec3780 0 mon.1@-1(probing).osd e4304 crush map has features 33816576, adjusting msgr requires
[14:33] <rzerres> joao: that was with working v0.58
[14:34] <scuttlemonkey> rzerres: this might be useful: http://pastebin.com/
[14:34] <joao> well, I'm going real quick to the kitchen to fix something for lunch
[14:34] <joao> I'll be back shortly
[14:35] <rzerres> scuttlemonkey: never used it. give me a hint. i paste code to the clipboard and then?
[14:36] <Gugge-47527> sig_wall: iostat -x 1 outputs a line every 1 second
[14:36] * eschenal (~eschnou@ has joined #ceph
[14:39] <sig_wall> Gugge-47527: https://gist.github.com/sigwall/dfa77cf741ef20adcb2f/raw/c7ffd3fda42fae4f232101fb6ba98044a2911750/gistfile1.txt
[14:39] <sig_wall> Gugge-47527: sorry
[14:39] <Gugge-47527> sig_wall: the disks seem pretty idle, are all disks on all nodes that idle?
[14:39] <Gugge-47527> while the cluster is slow?
[14:40] * loicd (~loic@c-76-119-91-36.hsd1.ma.comcast.net) Quit (Quit: Leaving.)
[14:41] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Remote host closed the connection)
[14:41] <sig_wall> Gugge-47527: no. busy at 40-100%.
[14:41] <sig_wall> but why does recovery affect client i/o so much?
[14:41] <Gugge-47527> could you capture _that_ and post it? :)
[14:42] <sig_wall> for example, linux mdraid1 recovery does not affect i/o
[14:42] <sig_wall> but ceph recovery just kills i/o entirely
[14:42] <sig_wall> ;)
[14:42] <Gugge-47527> sure it does
[14:42] <Gugge-47527> everything a disk does affects i/o
[14:42] <Gugge-47527> some things more than others :)
[14:43] <Gugge-47527> im just guessing without more info :)
[14:44] <Gugge-47527> the OSD's on my testcluster are not busy doing a rebuild, but they are made of 2TB SATA + 30GB SSD, with flashcache writeback
[14:44] <Gugge-47527> So i guess they can handle more i/o than a single SATA disk :)
[14:45] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[14:46] <sig_wall> I can't provide iostat during recovery now because users already hate me.
[14:46] <Gugge-47527> You have users on the test cluster?
[14:47] <sig_wall> ah. okay.
[14:47] <Gugge-47527> you said it was the same on a test configuration, so if you have one, output from that is fine :)
[14:47] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[14:47] <rzerres> joao: should we use pastebin to exchange the infos? i just logged in there
[14:50] <sig_wall> Gugge-47527: https://gist.github.com/anonymous/6457ffe11fc5bc7e7bee/raw/f07a500688d5b3d3bb0a3b05a469e2b24a2067f8/gistfile1.txt
[14:50] <scuttlemonkey> rzerres: pastebin makes it easier to drop long pages of text
[14:50] <scuttlemonkey> esp log files
[14:50] <scuttlemonkey> you don't have irc stuff in the way
[14:51] <scuttlemonkey> plus then joao can rip it down to his own computer and grep through it
[14:51] <Gugge-47527> sig_wall: and that is with "osd recovery max active" set to 1?
[14:51] <rzerres> scuttlemonkey: thanks. i logged in. will use it. how will joao log in to get the paste-name?
[14:52] <sig_wall> Gugge-47527: yep
[14:53] <Gugge-47527> sig_wall: how did you change that setting?
[14:53] <Gugge-47527> im impressed that it puts so much load on the disk with that set to 1, but if it does, i understand why the cluster is slow
[14:53] <Gugge-47527> the disks are pretty busy :)
[14:54] <sig_wall> ceph osd tell '*' injectargs '--osd-recovery-max-active 1'
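The same setting can also be made persistent in ceph.conf instead of injected at runtime; a minimal sketch using the option name from this conversation:

```ini
[osd]
    ; persistent equivalent of:
    ;   ceph osd tell '*' injectargs '--osd-recovery-max-active 1'
    osd recovery max active = 1
```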
[14:54] <scuttlemonkey> rzerres: when you make a paste it gives you a url to share
[14:55] <sig_wall> Gugge-47527: but not much busier than 100% iowait
[14:55] <Gugge-47527> sig_wall: i really look more at the w_await then %util
[14:56] <Gugge-47527> slow writes are a bad sign :)
[14:58] <rzerres> scuttlemonkey: so, generated an unlisted paste. where do I offer it to someone, which url?
[15:00] <absynth> sig_wall: do you have one OSD host that is much slower than the others?
[15:00] <scuttlemonkey> after you hit submit you can just paste the url from your browser here iirc
[15:01] <rzerres> thanks got it.
[15:02] <rzerres> joao: are you back?
[15:02] <sig_wall> absynth: no
[15:03] <sig_wall> they have identical configuration
[15:04] <absynth> ok, because we saw slow requests quite often when we had one or two slower hosts that were overwhelmed with the recovery traffic
[15:05] <absynth> what ceph version was that again?
[15:06] <rzerres> joao: look at pastebin.com/bAYPb2yB
[15:08] <sig_wall> absynth: 0.56.3
[15:09] <sig_wall> absynth: why can't ceph developers make recovery an idle-priority process, like mdraid1 recovery?
[15:10] <absynth> there were changes to introduce such a prioritization recently, and they allow for a better distribution of disk i/o vs. recovery i/o
[15:10] <absynth> i just don't remember which version introduced them
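A sketch of the recovery-throttling knobs being discussed, as a ceph.conf fragment. Option names and availability vary by release (absynth's point exactly), so treat these as assumptions to verify against your version's documentation:

```ini
[osd]
    osd recovery max active = 1    ; concurrent recovery ops per OSD
    osd max backfills = 1          ; concurrent backfill operations per OSD
    osd recovery op priority = 1   ; weight recovery ops below client ops
    osd client op priority = 63    ; client ops keep the higher priority
```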
[15:10] <joao> rzerres, will do shortly; just finishing having lunch
[15:11] <rzerres> joao: bon appetit. will get mine as well ...
[15:12] <joao> err... just a remark: that crash log you uploaded, it's still under v0.58 and it's usually a sign of lack of available disk space on the monitor data dir
[15:12] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[15:13] <rzerres> yes. there was a space problem. this is solved meanwhile.
[15:16] <rzerres> joao: updated to http://pastebin.com/V9Vku3Gc (see on bottom)
[15:17] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:18] * jskinner (~jskinner@ has joined #ceph
[15:19] <mikedawson> joao: ran into issues on a new 0.59 setup, 1 node w/ fresh Quantal, mkcephfs, when I start the Mon, I get a ceph-create-keys process that never exits. Restart mon and there are two ceph-create-keys. Could this be related to your work?
[15:20] <joao> yeah, any monitor problem now can be related to the rework
[15:22] * vata (~vata@2607:fad8:4:6:358d:3c90:6bc1:af08) has joined #ceph
[15:22] <mikedawson> joao: worked on it with dmick last night and he reproduced the issue. would you like to hash it out on irc, see a bug filed, or an email to ceph-devel?
[15:22] <joao> can you get some logs out of the monitors during the time the process doesn't exit? debug mon = 20, debug paxos = 10 and debug ms = 1 would be appreciated
[15:23] <joao> mikedawson, bug filed would be nice
[15:23] <joao> but we can take care of it on IRC anyway
[15:25] <mikedawson> joao: this shows the progression of ceph-create-keys piling up: http://pastebin.com/raw.php?i=rL8gtWvg
[15:26] <mikedawson> joao: strace of ceph-create-keys: http://pastebin.com/raw.php?i=gDcn7Q1P
[15:28] * PerlStalker (~PerlStalk@ has joined #ceph
[15:29] * timmclaughlin (~timmclaug@ has joined #ceph
[15:33] <joao> mikedawson, looking
[15:33] <mikedawson> joao: mon log: http://www.gammacode.com/ceph-mon.a.log
[15:34] <joao> thanks!
[15:34] * noahmehl (~noahmehl@cpe-75-186-45-161.cinci.res.rr.com) has joined #ceph
[15:35] <noahmehl> nhm: yt?
[15:36] <rzerres> joao: how to proceed? what should i do
[15:36] <joao> rzerres, just a sec
[15:37] <joao> mikedawson, so it's an auth problem? permission denied all the way?
[15:37] <joao> can you set debug auth = 20 too?
[15:37] <joao> sorry about that, should have suggested it before
[15:38] * joelio bangs head on Ruby AWS s3 lib
[15:38] <joelio> I've just setup a radosgw and come to connect but get
[15:38] <joelio> [error] Hostname {blah} provided via SNI and hostname s3.amazonaws.com provided via HTTP are different
[15:39] <joelio> when using the aws/s3 demo code from the ceph.com site
[15:39] <joao> rzerres, is that all there is of the log? regarding the one you appended to the previous one
[15:40] <joelio> If I don't use SSL, it seems to go further, but still fails with user authentication 403's (so could be related to the hostname mismatch) where to override the s3.amazon hostname I'm not sure
[15:41] <joelio> (Ruby aws/s3 gem btw)
[15:42] <rzerres> joao: yes, more or less
[15:42] <mikedawson> joao: with debug auth = 20 in the [mon] section: http://www.gammacode.com/ceph-mon.a.2.log
[15:43] <joao> rzerres, okay, let's do this: edit ceph.conf and add 'debug mon = 20', 'debug paxos = 20' and 'debug ms = 1' to the [mon] section; then restart mon.1, let it go about doing its thing and then send me the log to 'joao.luis@inktank.com'
[15:43] <joao> mikedawson, thanks
[15:44] <rzerres> joao: ok you will get the mail
[15:46] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[15:52] * Morg (b2f95a11@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:58] <joao> mikedawson, is this easily reproducible from a fresh cluster?
[15:58] <joao> or reproducible at all?
[15:59] <joao> I'm finding that there are just too few debug messages to make an informed decision on what the hell is happening
[15:59] <joao> so I might have to shove some more debug messages in the code and then reproduce this
[16:00] <joao> also, mikedawson, do you recall if the store conversion succeeded, after the upgrade to 0.59?
[16:00] <rzerres> mail is on its way.
[16:00] <rzerres> but there is not much info. It claims that it needs conversion ....
[16:00] <rzerres> 2013-03-22 14:10:11.607940 7fadafda4780 0 ceph version 0.59 (cbae6a435c62899f857775f66659de052fb0e759), process ceph-mon, pid 10127
[16:00] <rzerres> 2013-03-22 14:10:11.664664 7fadafda4780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
[16:00] <rzerres> 2013-03-22 14:22:16.394397 7fd54c796780 0 ceph version 0.59 (cbae6a435c62899f857775f66659de052fb0e759), process ceph-mon, pid 10674
[16:00] <rzerres> 2013-03-22 14:22:16.394427 7fd54c796780 10 needs_conversion
[16:00] <rzerres> 2013-03-22 14:22:16.461391 7fd54c796780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
[16:00] <rzerres> 2013-03-22 15:45:37.166756 7f41a22d5780 0 ceph version 0.59 (cbae6a435c62899f857775f66659de052fb0e759), process ceph-mon, pid 11175
[16:00] <rzerres> 2013-03-22 15:45:37.166785 7f41a22d5780 10 needs_conversion
[16:00] <rzerres> 2013-03-22 15:45:37.227624 7f41a22d5780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
[16:00] <joao> rzerres, that monitor has never run, right?
[16:00] <mikedawson> joao: yes. I first saw it on a raring nightly with 0.58 that I later upgraded to 0.59. Dan was skeptical of upstart changes in Raring, so I built a fresh cluster on Quantal and reproduced it. I believe Dan reproduced it last night as well
[16:01] <rzerres> not on 0.59
[16:01] <mikedawson> joao: so I've seen it on upgrade from 0.58-0.59 (with a single mon) and a fresh mkcephfs install with 0.59
[16:02] <joao> rzerres, can you run a 'ls /srv/ceph/mon.1' ?
[16:02] <joao> hmm
[16:02] <joao> mikedawson, cool, I'll try to reproduce
[16:03] <joao> oh, it first happened on 0.58?
[16:03] <joao> maybe I read that wrong though
[16:04] <mikedawson> joao: didn't notice it with 0.58 (but I can't rule it out either)
[16:04] <mikedawson> joao: I can only confirm seeing it on systems running 0.59
[16:05] <joao> yeah, I read that wrong :)
[16:05] <joao> I suppose I was hoping to rule the new code out so I could focus on other portions of code
[16:06] <mikedawson> joao: and I can confirm that on the Raring + 0.59 system showing this issue, I can successfully write data into RBD using Glance, so it's not affecting access to the cluster.
[16:07] * loicd (~loic@74-94-156-210-NewEngland.hfc.comcastbusiness.net) has joined #ceph
[16:07] * capri (~capri@ Quit (Quit: Verlassend)
[16:07] <mikedawson> joao: looks like this commit is where ceph-create-keys showed up in the init script http://ceph.com/git/?p=ceph.git;a=commitdiff;h=ffc0ff68e017747bbb1459520a0908ec75f32931
[16:11] <joao> oh, 'ceph-create-keys' is a program
[16:11] <joao> first time I'm seeing this
[16:11] <joao> I thought it was just issuing a 'ceph get-or-create-key foo'
[16:12] <rzerres> joao: root@dwssrv2:/var/log/ceph# ls -l /srv/ceph/mon.1
[16:12] <rzerres> total 24
[16:12] <rzerres> drwxr-xr-x 1 root root 276 Mär 21 22:31 auth
[16:12] <rzerres> drwxr-xr-x 1 root root 168 Mär 21 22:31 auth_gv
[16:12] <rzerres> -rw------- 1 root root 37 Dez 20 00:56 cluster_uuid
[16:12] <rzerres> -rw------- 1 root root 4 Mär 21 22:22 election_epoch
[16:12] <rzerres> -rw------- 1 root root 120 Dez 20 00:56 feature_set
[16:12] <rzerres> -rw------- 1 root root 2 Dez 20 00:59 joined
[16:12] <rzerres> -rw------- 1 root root 55 Dez 20 00:59 keyring
[16:12] <rzerres> -rw------- 1 root root 0 Dez 20 00:34 lock
[16:12] <rzerres> drwxr-xr-x 1 root root 7200 Mär 21 22:39 logm
[16:12] <rzerres> drwxr-xr-x 1 root root 7070 Mär 21 22:39 logm_gv
[16:12] <rzerres> -rw------- 1 root root 21 Dez 20 00:56 magic
[16:12] <rzerres> drwxr-xr-x 1 root root 5772 Mär 21 22:22 mdsmap
[16:12] <rzerres> drwxr-xr-x 1 root root 5070 Mär 21 22:22 mdsmap_gv
[16:12] <rzerres> drwxr-xr-x 1 root root 14 Jan 7 19:39 mkfs
[16:12] <rzerres> drwxr-xr-x 1 root root 116 Mär 21 22:22 monmap
[16:12] <rzerres> drwxr-xr-x 1 root root 6 Mär 13 11:15 monmap_gv
[16:12] <rzerres> drwxr-xr-x 1 root root 6828 Mär 21 22:22 osdmap
[16:12] <joao> rzerres, pastebin?
[16:12] <rzerres> drwxr-xr-x 1 root root 6720 Mär 19 01:25 osdmap_full
[16:12] <rzerres> drwxr-xr-x 1 root root 6720 Mär 19 01:25 osdmap_gv
[16:12] <rzerres> drwxr-xr-x 1 root root 7150 Mär 21 22:22 pgmap
[16:12] <rzerres> drwxr-xr-x 1 root root 7042 Mär 19 01:28 pgmap_gv
[16:12] <rzerres> drwxr-xr-x 1 root root 332 Mär 22 15:45 store.db
[16:13] <rzerres> sorry
[16:14] <joao> anyway, here's what you are going to do: 1) rm -fr /srv/ceph/mon.1/store.db ; 2) add 'debug mon = 20', 'debug paxos = 20' and 'debug ms = 1' to your ceph.conf ; 3) start mon.1 ; 4) let it run for a couple of minutes ; 5) send me the log
[16:14] <rzerres> joao: updated
[16:15] * joao runs to grab some more coffee
[16:22] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[16:40] * ShaunR (~ShaunR@staff.ndchost.com) Quit (Ping timeout: 480 seconds)
[16:41] * ShaunR (ShaunR@ip68-96-89-159.oc.oc.cox.net) has joined #ceph
[16:44] <joao> rzerres, any news?
[16:46] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) has joined #ceph
[16:46] * ChanServ sets mode +o scuttlemonkey
[16:47] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[16:50] <jtangwk1> joao: just make it up
[16:50] <jtangwk1> make it all up
[16:57] <joao> jtangwk1, ?
[16:58] * gerard_dethier (~Thunderbi@ Quit (Quit: gerard_dethier)
[16:58] <jtangwk1> joao: the news ;)
[16:59] <rzerres> joao: sorry had to solve a mailserver problem first ....
[17:02] <rzerres> joao: updated on pastebin
[17:04] <joao> rzerres, surely there's more?
[17:06] <rzerres> joao: updated on pastebin, but just the conversion error
[17:08] * BillK (~BillK@203-59-42-158.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[17:12] * alram (~alram@ has joined #ceph
[17:13] <joao> rzerres, I kind of need the whole log since the conversion started until it error'ed out
[17:13] <rzerres> joao: redid the rm -rf /srv/ceph/mon.1/store.db and fired up mon.1 and mon.2
[17:13] <rzerres> joao: now i have more output. updated it on pastebin ....
[17:14] <joao> k thanks
[17:24] <rzerres> joao: update done.
[17:25] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[17:25] <joao> looking
[17:25] <rzerres> paxos is trying to restart the election
[17:26] * eschenal (~eschnou@ Quit (Remote host closed the connection)
[17:26] * eschnou (~eschnou@ Quit (Remote host closed the connection)
[17:26] <rzerres> on cluster-server1 mon.0 is running
[17:27] <joao> rzerres, are all the monitors running 0.59? and are all the monitors running?
[17:27] <rzerres> yes, they do
[17:27] <joao> rzerres, do you have cephx enabled?
[17:27] * jlogan1 (~Thunderbi@2600:c00:3010:1:8c00:81c9:796a:9e97) has joined #ceph
[17:28] <rzerres> mon.2 is reporting: found old GV monitor store format -- should convert!
[17:28] <rzerres> joao: yes, cephx is enabled
[17:28] <sstan> dmick, I compared small writes for replica size 1 VS size 2
[17:28] <joao> rzerres, if mon.2 is reporting that, then it's likely mon.2 is still converting the mon store
[17:28] <sstan> size one makes small writes 5x faster
[17:29] <joao> that might take a while depending on how big the store is
[17:29] <rzerres> joao: ok. but it seems mon.2 is dead. no process running
[17:30] <joao> rzerres, send me mon.2's *full* log
[17:30] <joao> I need to take a look at it
[17:30] <joao> bzip2 it if needed; if it's still too large, I can arrange a place for you to drop the log in
[17:31] * barryo1 (~barry@host86-147-6-189.range86-147.btcentralplus.com) has joined #ceph
[17:31] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[17:32] <rzerres> joao: will send it as attachment to your mail address ....
[17:32] <joao> okay
[17:33] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:35] <mattch> OSD performance question - if you couldn't use SSD in your system, would you go for 2 15k (sas) journalling disks, with 4 7.2k (sata) OSD disks, or 4 15k sas disks and put each OSDs journal on its own disk?
[17:38] <PerlStalker> Has anyone see problems with libvirt+kvm+rbd block guests locking up after a live migration?
[17:39] <mattch> I wonder if nhm has any thoughts on this one, being the benchmark guru :)
[17:40] <rzerres> joao: mails are out.
[17:40] * cdblack (86868947@ircip3.mibbit.com) has joined #ceph
[17:47] <rzerres> joao: i got them delivered with the attachments
[17:47] <sstan> so FYI ... latency seems to be the bottleneck for small writes. Reducing replication size from 2 to 1 makes small writes 5x faster.
[17:49] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Quit: Leaving.)
[17:49] <mikedawson> sstan: how many nodes? what is node to node latency on your network?
[17:50] * via (~via@smtp2.matthewvia.info) Quit (Ping timeout: 480 seconds)
[17:52] * dmick-train (~dmick@mobile-166-137-177-237.mycingular.net) has joined #ceph
[17:53] <joao> rzerres, did you shutdown mon.2?
[17:53] <dmick-train> joao: pretty sure I understand mikedawson's issue ... Check logs for "get my mind right"
[17:54] <rzerres> joao: just for the time to grab the mail attachment. It is started again
[17:54] <joao> rzerres, when you killed it, mon.2 was about to start an election
[17:54] <joao> you should check if you have a working cluster by now
[17:55] * markbby1 (~Adium@ Quit (Quit: Leaving.)
[17:56] * markbby (~Adium@ has joined #ceph
[17:56] * chutzpah (~chutz@ has joined #ceph
[17:56] <joao> dmick-train, so you think this is just ceph-create-keys grabbing the wrong key?
[17:56] <mikedawson> dmick-train, joao: can I provide you guys anything else?
[17:56] <joao> I haven't been able to reproduce this yet manually
[17:57] <joao> mikedawson, at this point I can't think of anything
[17:57] * via (~via@smtp2.matthewvia.info) has joined #ceph
[17:57] <joao> might poke you if I figure something out
[17:57] <mikedawson> good stuff
[17:58] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:58] * markbby (~Adium@ Quit ()
[17:59] * markbby (~Adium@ has joined #ceph
[17:59] <Karcaw> joao: have you had time to look at the logs in #4521?
[18:01] <rzerres> joao: no, cluster is not manageable. mon.1 is running
[18:01] <joao> Karcaw, hadn't noticed the update; thanks for letting me know
[18:01] * dmick-train (~dmick@mobile-166-137-177-237.mycingular.net) Quit (Ping timeout: 480 seconds)
[18:01] <rzerres> joao: when starting mon.0 got mon fs missing 'monmap/latest' and 'mkfs/monmap'
[18:02] <joao> rzerres, I'll need the full log for mon.0 too then
[18:04] * BillK (~BillK@124-169-163-183.dyn.iinet.net.au) has joined #ceph
[18:04] <dmick> joao: mikedawson: I think there are two problems: 1) ceph-create-keys needs to realize there's already a valid client key and not try 2) the mkcephfs mon key might need to have caps allow * when created to handle this situation (I don't yet understand whether that cap is intended to be transient)
[18:05] <dmick> 1) would solve the issue
[18:05] * cdblack (86868947@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[18:08] <mikedawson> dmick: what is ceph-create-keys used for? bootstrapping keys if Ceph was deployed some way that doesn't do it automatically?
[18:08] <dmick> s/do it automatically/do it beforehand/, but yes
[18:11] <joao> mikedawson, do you happen to have 'ceph_test_store_tool' installed on your cluster?
[18:11] <mikedawson> dmick: gotcha. Is 0.59 the time I should abandon mkcephfs in favor of ceph-deploy or Chef?
[18:11] <dmick> mikedawson: no, not necessarily. These are just compatibility pains between the various methods.
[18:12] <dmick> and, really, ceph-create-keys can run and do little harm; it's just a ragged edge
[18:12] <mikedawson> joao: no. What package gets it?
[18:13] * leseb (~leseb@ Quit (Remote host closed the connection)
[18:13] <joao> dmick, any idea which package has the tests, if any?
[18:13] <dmick> mikedawson: ceph-test
[18:13] <joao> there
[18:13] <joao> :p
[18:13] <joao> dmick saves
[18:13] <dmick> (grep <file> debian/*install)
[18:13] <mikedawson> dmick: Quick fix: comment it out of my /etc/init.d/ceph and move on?
[18:14] <dmick> mikedawson: I expect that would be fine
[18:14] <mikedawson> joao: have it now. what do you want to see?
[18:15] * leseb (~leseb@ has joined #ceph
[18:15] <joao> mikedawson, this is a long shot
[18:16] <joao> ceph_test_store_tool /path/to/mon-data-dir/store.db list auth
[18:16] <joao> and for each 'auth:numbered-key', ceph_test_store_tool /path/to/mon-data-dir/store.db get auth key >> auth.out
[18:17] <joao> I should eventually add a more friendly interface to look into the store
[18:17] <joao> oh, this means you must shutdown the monitor first btw
[18:18] <rzerres> joao: might that help here as well? shall i prepare ceph-test tools?
[18:19] <mikedawson> joao: ceph_test_store_tool /var/lib/ceph/mon/ceph-a/store.db get auth key >> auth.out && cat auth.out shows (auth, key) does not exist
[18:20] * leseb (~leseb@ Quit (Remote host closed the connection)
[18:21] <joao> mikedawson, I'm sorry if I wasn't clear, 'key' should be a number; on a fresh cluster it can go from '1' up to 'n', depending on what 'list auth' states
[18:23] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[18:24] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) has joined #ceph
[18:25] <joao> rzerres, no idea yet
[18:25] <joao> I haven't really understood what is going on there
[18:25] * loicd (~loic@74-94-156-210-NewEngland.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[18:25] <rzerres> joao, got further
[18:26] <rzerres> joao: mon.2 and mon.1 are up and running
[18:26] <mikedawson> joao: I'm doing something wrong http://pastebin.com/raw.php?i=Q6QrxVY8
[18:27] <joao> rzerres, does 'ceph -s' work?
[18:27] <rzerres> joao: since mon.2 terminated while starting the election, i stopped mon.2, deleted the half-finished store.db
[18:27] <rzerres> joao: then started mon.2 again, let ceph go through.
[18:27] <joao> mikedawson, you're not doing anything wrong really; just have to run 'ceph_test_store_tool /var/lib/ceph/mon/ceph-a/store.db list auth' first
[18:27] <rzerres> joao: now ceph -s is working!
[18:28] <joao> it will output a list of all auth:<N> keys available
[18:28] <joao> you'll just have to 'for ((i=<lowest>;i<=<highest>;i++)); do ./ceph_test_store_tool ... get auth $i >> out ; done'
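Written out, joao's loop might look like this dry-run sketch. It only prints the commands rather than running them, since ceph_test_store_tool needs the ceph-test package and a stopped monitor; the STORE path and the 1..4 range are placeholders to substitute with your own mon data dir and the key numbers that 'list auth' actually reported:

```shell
# Dry run: print one ceph_test_store_tool call per auth key.
# STORE and the 1..4 range are placeholders -- adjust to your setup
# and to the auth:<N> keys listed by 'list auth'.
STORE=/var/lib/ceph/mon/ceph-a/store.db
: > auth-cmds.txt
for i in $(seq 1 4); do
  echo "ceph_test_store_tool $STORE get auth $i >> auth.out" | tee -a auth-cmds.txt
done
```

Dropping the `echo ... | tee` wrapper turns the dry run into the real extraction once the tool is installed.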
[18:29] <joao> rzerres, cool; it must have been the store conversion taking too long then
[18:30] <rzerres> joao: mon.0 can't get in. starting spits out: mon fs missing 'monmap/latest' and 'mkfs/monmap'
[18:31] <rzerres> joao: can i just delete the fs on mon.0 and start from scratch for this instance?
[18:32] <joao> rzerres, if you have other monitors and they have reached a quorum, you may; but if you are hitting an issue and you don't mind, I'd rather we explored it before getting rid of it
[18:32] <rzerres> joao: you are right, i would prefer that way as well
[18:33] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:33] <joao> okay, rzerres, now would be the time for you to install ceph-test
[18:33] <rzerres> joao: already done :)
[18:34] <joao> and run a 'ceph_test_store_tool /path/to/mon/data/store.db list monmap'
[18:35] <joao> and then the same but with 'list mkfs'
[18:35] <joao> and pastebin it please
[18:35] <rzerres> joao: you mean on mon.0, right?
[18:35] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:36] <joao> yeah, mon.0
[18:36] <rzerres> joao: sorry, no output at all
[18:37] <joao> for either?
[18:37] <rzerres> joao: yes, for both calls
[18:37] * diegows (~diegows@ has joined #ceph
[18:37] <joao> well, then just run 'ceph_test_store_tool /path/to/mon/data/store.db list'
[18:38] <sstan> mikedawson: 3 nodes. Does it make a difference?
[18:38] <sstan> how does one measure the latency?
[18:38] <rzerres> joao: got list of logm:<id>
[18:38] <joao> that's all?
[18:38] <rzerres> joao: shall i paste them?
[18:39] <joao> rzerres, pastebin please
[18:40] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) has joined #ceph
[18:42] <rzerres> joao: pastebin.com/MQZHwj72
[18:44] <mikedawson> joao: http://pastebin.com/raw.php?i=WQaCxf7s
[18:47] <rzerres> joao: need to quit for today, since i have to pick up my daughters. drop me a mail on how to proceed. monday?
[18:48] <sstan> mikedawson: any tips to fix linux latency?
[18:49] <mikedawson> sstan: no, just trying to understand your results. I've had good luck (nearly linear scaling) increasing small sized write iops by brute force (throwing drives at the problem).
[18:49] <rzerres> joao: by the way attached output of ceph -w to give you an idea .... and thank you very much for the assisting today :)
[18:49] <sstan> ah more OSDs ...
[18:49] <mikedawson> sstan: in my tests, spindle contention is always the limiting factor ... never network latency or throughput
[18:49] <sstan> I don't see how that would help the latency though
[18:50] <sstan> what's spindle contention ?
[18:50] <joao> rzerres, drop by on monday and we'll pick it up from there
[18:50] <joao> *from here
[18:50] <joao> ahve a good weekend
[18:50] <joao> *have
[18:50] <mikedawson> sstan: seeks on spinning disks
[18:51] <rzerres> joao: and you too. bye for now.
[18:51] * rzerres (~ralf.zerr@xdsl-195-14-207-9.netcologne.de) has left #ceph
[18:52] * loicd (~loic@74-94-156-210-NewEngland.hfc.comcastbusiness.net) has joined #ceph
[18:54] <sstan> mikedawson: I'm sure it would help, but ceph writes to the journal first. In my case, I use RAM for the journal. My small writes are actually 5x faster than small writes directly sent to my hard drives.
[18:54] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[18:54] * timmclau_ (~timmclaug@ has joined #ceph
[18:55] <sstan> that leads me to believe that if the disks and journals aren't the issue AND reducing replica size gives better results, then it's some kind of latency that is to blame.
[18:56] * BillK (~BillK@124-169-163-183.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[18:59] <mikedawson> sstan: everything has to flush to disk eventually. run a benchmark and look at your drive utilization with iostat -x, you might find your disks have high wait times and %util
[19:01] <sstan> True. I might be wrong, but hard drives are irrelevant in the extreme case where the journal is big and fast, AND only one user is writing small writes to the cluster.
[19:02] * timmclaughlin (~timmclaug@ Quit (Ping timeout: 480 seconds)
[19:03] <dmick> ...and the journal doesn't fill up
[19:03] <mikedawson> sstan: Ceph defaults to flushing the journal to the backing store pretty quickly. I believe the idea is to flush early and often because i/o degrades (or blocks?) during flush
[19:04] <sstan> hmm I might want to increase the time between flushes
[19:08] * drokita (~drokita@24-107-180-86.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[19:10] <mikedawson> sstan: read up on filestore_min_sync_interval + filestore_max_sync_interval
[19:11] <sstan> thanks! I might want to decrease the min_sync_interval too :/
[19:11] <sstan> hmmmm actually I should increase it
[19:12] <mikedawson> sstan: but be warned, changing them has always appeared to be a zero-sum game to me. For any significantly long-running small io workload I've always been bounded by the IOPS my drives can handle
[19:13] <sstan> Now I don't understand why the min/max sync matters anyway (when one has a 4G RAM journal). If writes are acknowledged when they're written to the journal (and i/o doesn't block during flushes)
[19:14] <sstan> if i/o blocks during flushes, then one should increase the journal size and increase flush intervals
[19:18] <mikedawson> sstan: "if i/o blocks during flushes, then one should increase the journal size and increase flush intervals". Yep, then think about dmick's comment "...and the journal doesn't fill up". It doesn't take long to fill up a 4GB journal. Flush is also triggered if you reach some % of journal size (I believe 50%).
[19:22] <mikedawson> sstan: I think what you want is reordering. As in the journal backed with high IOPS device (ramdisk or ssd) ingests small random IO all day long, then magically reorders the random IO to a single stream the backing disks can ingest. Ceph doesn't do that today, but I'd like it to!
[19:23] <sstan> hmm I think that is already how it works. Doesn't the kernel take care of that ?
[19:23] <sstan> and hard drive buffers, etc.
[19:23] <sstan> That's the only reason why one would want a big journal rather than a small one
[19:24] <sstan> * one of the reasons
[19:24] <nhm> mikedawson: some people have had some success with bcache. :)
[19:24] <nhm> or flashcache
[19:25] <mikedawson> nhm: me too. "some" being the key word
[19:26] <sstan> Average Latency: 0.37065 --- Bandwidth (MB/sec): 170.931
[19:27] <sstan> Bandwidth (MB/sec): 90.571 ---- Average Latency: 0.70571
[19:27] <nhm> mikedawson: It sounds like some of the pginfo changes we made negated a lot of the benefit from bcache according to someone I was talking to a week or two ago.
[19:27] <sstan> replica size 2 VS 1
[19:28] <nhm> sstan: 2 was faster than 1?
[19:28] <sstan> sorry replica size 1 is faster than 2 because of reduced latency
[19:28] <mikedawson> sstan: with 2x you need two acks, the lowest common denominator bounds latency. it gets even worse with 3x, etc
[19:29] <mikedawson> nhm: haven't tried the new pginfo changes with bcache, do you remember who tested it?
[19:29] <sstan> What's strange also is that more osds = less latency : http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
[19:29] <dmick> more osds spreads the object load around the cluster, so that requests for different objects aren't contending
[19:30] <sstan> that could likely be because every osd has an independent journal
[19:30] <nhm> mikedawson: Xiaoxi
[19:30] <nhm> mikedawson: I think it was bcache, might have been flashcache.
[19:30] <mikedawson> sstan: more osds is good for latency, more replication bad for latency
[19:31] <sstan> yeah but more osds doesn't change the fact that the write needs to be ACKed by the same number of osds (i.e. 2)
[19:31] <nhm> sstan: it could be that you've just gotten to the point where the underlying devices have to queue ops due to the 2x number of writes.
[19:31] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:32] <nhm> sstan: if you do something like "collectl -sD -oT" on an osd node during each test, you can see what the average time spent in the queue is for each device.
[19:32] <sstan> nhm : do you know if i/o blocks when the journal flushes ?
[19:33] <sstan> hmm what would be the complete command for that, nhm?
[19:33] <nhm> sstan: If the journal fills completely it does afaik because it can't accept new writes until the flush completes. Normally I don't think it does.
[19:34] <sstan> that would make sense ... i.e. for sustained writes, one cannot go faster than what underlying devices allow
[19:34] <nhm> sstan: that is the complete command. If you want to watch a specific device, you can add a --dskfilt <filter> (ie "sda" or "sd")
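[For reference, the collectl invocation nhm describes, with the device filter applied; the queue-time and service-time columns in the detailed disk output are what to watch during each test run:]

```shell
# Per-second disk detail with timestamps, run on an OSD node:
collectl -sD -oT

# Restrict output to particular devices while the benchmark runs:
collectl -sD -oT --dskfilt sda
```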
[19:42] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[19:43] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Remote host closed the connection)
[19:46] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[19:49] <sstan> let's say we have an infinitely fast/big journal. It never blocks (because it's never full). Why does increasing the number of OSDs help the latency for small writes?
[19:50] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[19:51] <nhm> sstan: If you keep the number of concurrent operations the same in each case, you have more backend disk throughput so operations don't wait as long in the device queues to complete.
[19:51] <nhm> because you are spreading those concurrent operations out over more disks
[19:52] <sstan> That is very clear to me ... but what about a single operation?
[19:53] <dmick> a single operation to a single object will not be sped up by more OSDs
[19:53] <sstan> single small i/o operation speed is inversely proportional to the latency
[19:54] <sstan> therefore the problem I'm trying to solve has nothing to do with journals or hard drives ...
[19:55] <nhm> sstan: Do you see a change in latency with a single concurrent operation when you add more OSDs?
[19:56] <sstan> I didn't test that actually. I was confused by Sebastian Han's Ceph benchmarks. He gets lower latencies for more OSDs but that would be because he makes concurrent writes.
[19:57] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:57] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Remote host closed the connection)
[19:58] * mcclurmc (~mcclurmc@firewall.ctxuk.citrix.com) Quit (Ping timeout: 480 seconds)
[19:58] * jjgalvez (~jjgalvez@cpe-76-175-30-67.socal.res.rr.com) Quit (Quit: Leaving.)
[20:02] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:06] * noob2 (~cjh@ has joined #ceph
[20:06] <noob2> has anyone tried the ceph puppet module by Charlier?
[20:06] * diegows (~diegows@ has joined #ceph
[20:06] <noob2> just curious
[20:07] * lofejndif (~lsqavnbok@tor-exit-1.azire.net) has joined #ceph
[20:09] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[20:09] * BManojlovic (~steki@fo-d- has joined #ceph
[20:13] * timmclau_ (~timmclaug@ Quit (Remote host closed the connection)
[20:13] * timmclaughlin (~timmclaug@ has joined #ceph
[20:15] <scuttlemonkey> Charlier?
[20:16] <scuttlemonkey> oh, that is the enovance guy doing it
[20:16] <noob2> yeah
[20:16] <scuttlemonkey> yeah, those are the best ones...there are two other puppet methods to deploy ceph that I have seen
[20:16] <noob2> i know inktank is focused on chef
[20:16] <noob2> ok cool
[20:16] <scuttlemonkey> but yeah, the enovance guys are doing a great job on the puppet stuff
[20:16] <noob2> he says it's unstable but it looks well put together
[20:17] <scuttlemonkey> yeah
[20:17] <scuttlemonkey> I talked to Nick and Sebastien at WHD, sounds like they want to put some polish on it
[20:17] <scuttlemonkey> but it works pretty well already
[20:17] <noob2> sweet :D
[20:17] * jjgalvez (~jjgalvez@ has joined #ceph
[20:20] <sstan> nhm, dmick : ifconfig ethN txqueuelen 10 increases small writes speed by 20%
[20:20] <sstan> where ethN is every interface on the cluster
[20:20] <dmick> 10?
[20:20] <dmick> that seems tiny
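[What sstan did, for reference; ifconfig is the historical tool, and the iproute2 equivalent is shown too (eth0 stands in for each cluster-facing interface):]

```shell
# Shrink the transmit queue (sstan's value of 10; the usual default is 1000):
ifconfig eth0 txqueuelen 10

# iproute2 equivalent:
ip link set dev eth0 txqueuelen 10

# Verify the new queue length:
ip link show dev eth0
```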
[20:22] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (Quit: leaving)
[20:22] <nhm> sstan: does that have an effect on iperf tests as well?
[20:22] <nhm> also, how many concurrent small writes?
[20:23] <sstan> I didnt test concurrent writes yet, let me check
[20:24] <sstan> rados bench -p rbd 10 write gives about the same numbers
[20:24] <sstan> than usual
[20:24] * leseb_ (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[20:24] * SpamapS (~clint@xencbyrum2.srihosting.com) has joined #ceph
[20:25] <nhm> sstan: seems like this might be an example of buffer bloat.
[20:25] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Read error: No route to host)
[20:25] <sstan> for replica size 2, Average Latency: 0.70571 dropped to Average Latency: 0.454664
[20:25] <nhm> sstan: http://www.bufferbloat.net/projects/bloat/wiki/Linux_Tips
[20:26] <sstan> thanks
[20:26] <sstan> yup ... lots of improvements to be done in that regard
[20:26] <sstan> perhaps it would add a lot of value to ceph if it took care of that kind of thing
[20:28] <nhm> sstan: you might want to try CoDel and see how that works
[20:28] <sstan> should I search for CoDel on google?
[20:29] <nhm> http://www.bufferbloat.net/projects/codel/wiki/Wiki
[20:29] <dmick> codel is mentioned in that page
[20:29] <dmick> at the top of the section about txqueuelen
[20:29] <sstan> thanks I'll read that. Tuning buffers seems to be promising
[20:30] <sstan> what kind of results do you guyz get for small writes ? (block size = 4k for example)
[20:31] <nhm> sstan: rados bench? RBD? other?
[20:31] <sstan> I use iozone
[20:32] <sstan> dd if=/dev/zero of=/dev/rbdx bs=4k count=1000 oflag=sync
[20:34] <sstan> hmmm dd shows better-than-actual speeds :s
[20:34] <sstan> iozone is more flexible for clearing caches , etc.
[20:34] <dmick> kernel rbd/kernel buffers
[20:34] <nhm> sstan: depends on the number of disks and where the journals are and such. With FIO doing 4k random writes and varying io depths, I can do anywhere from 10-20MB/s with no rbd cache and up to ~55MB/s with RBD cache enabled for 4k random writes across several 100GB RBD volumes.
[20:34] <dmick> odirect will help
[20:34] <nhm> that's to 24 OSDs with 8 SSD journals.
[20:34] <nhm> and 1x replication.
[20:34] <nhm> (ie no replication)
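[dmick's point about kernel buffers can be demonstrated with plain dd against a local file (the filename here is a throwaway): without a sync/direct flag the page cache absorbs the writes and the reported speed is inflated.]

```shell
# Buffered write: dd returns once data is in the page cache, so the
# numbers mostly measure RAM, not the storage underneath.
dd if=/dev/zero of=ddtest.img bs=4k count=256

# With oflag=sync each write is flushed before dd continues; on a block
# device, oflag=direct (dmick's suggestion) bypasses the page cache.
dd if=/dev/zero of=ddtest.img bs=4k count=256 oflag=sync

ls -l ddtest.img   # 256 * 4k = 1048576 bytes
rm -f ddtest.img
```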
[20:35] <sstan> what tool do you use, nhm ?
[20:37] <nhm> sstan: I've been using fio for rbd testing. rados bench to test underlying ceph performance. At one point I was using fileop for metadata tests, but I haven't done that in a while.
[20:37] <nhm> soon I'll be using swift-bench for some rgw tests.
[20:37] <sstan> thanks I'll look at FIO
[20:38] <nhm> sstan: it has a lot of knobs. Only problem I ran into is that it seems to break when doing direct block level testing on krbd volumes.
[20:38] <sstan> kernel rbd ?
[20:38] <nhm> Josh thinks it might be due to krbd non-standard block sizes.
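[A minimal fio job file for the style of test nhm describes (4k random writes at a chosen iodepth); the target filename and size are placeholders, and as noted above, direct block-level runs against krbd volumes may misbehave:]

```shell
cat > randwrite-4k.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
rw=randwrite
iodepth=16
runtime=60
time_based

[test]
; placeholder target -- point at a file or an RBD-backed device
filename=/mnt/rbd/testfile
size=1G
EOF

fio randwrite-4k.fio
```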
[20:56] * vata (~vata@2607:fad8:4:6:358d:3c90:6bc1:af08) Quit (Quit: Leaving.)
[21:00] * eschnou (~eschnou@37.90-201-80.adsl-dyn.isp.belgacom.be) has joined #ceph
[21:11] * barryo1 (~barry@host86-147-6-189.range86-147.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[21:11] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[21:22] * miroslav (~miroslav@c-98-234-186-68.hsd1.ca.comcast.net) has left #ceph
[21:23] * barryo1 (~barry@host109-145-27-62.range109-145.btcentralplus.com) has joined #ceph
[21:26] * sjustlaptop (~sam@ has joined #ceph
[21:26] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) Quit (Quit: themgt)
[21:28] * vata (~vata@2607:fad8:4:6:9c71:5677:5c5f:a72f) has joined #ceph
[21:28] * LeaChim (~LeaChim@5e0d7853.bb.sky.com) Quit (Ping timeout: 480 seconds)
[21:32] * leseb_ (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Remote host closed the connection)
[21:33] * nhorman (~nhorman@2001:470:8:a08:7aac:c0ff:fec2:933b) Quit (Quit: Leaving)
[21:34] * sagelap2 (~sage@ has joined #ceph
[21:38] * LeaChim (~LeaChim@b0fae63d.bb.sky.com) has joined #ceph
[21:42] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has left #ceph
[21:42] * barryo1 (~barry@host109-145-27-62.range109-145.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[21:51] * mauilion (~root@ has joined #ceph
[21:52] * scuttlemonkey (~scuttlemo@dslb-084-058-138-184.pools.arcor-ip.net) Quit (Ping timeout: 480 seconds)
[21:53] * barryo1 (~barry@host31-52-19-201.range31-52.btcentralplus.com) has joined #ceph
[21:57] * leseb (~leseb@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:58] <mauilion> hi guys.
[21:59] <mauilion> I am using cephfs to serve the /var/lib/nova/instances/_base directory to my openstack cloud.
[21:59] <mauilion> This means several servers all reading from that fs pretty consistently and some writes.
[21:59] <mauilion> writes when we are caching an image from glance.
[21:59] <mauilion> I just noticed that when the big write is happening.
[22:00] <mauilion> I can't ls the directory anymore.
[22:00] <mauilion> from any of the places its mounted.
[22:00] <mauilion> currently I have 23 HV's attached to that mount
[22:01] <barryo1> would it not be better to use rbd images over cephfs?
[22:01] <mauilion> I do for volumes and stuff
[22:01] <mauilion> but for a shared fs like the cache dir
[22:01] <mauilion> I have been using cephfs
[22:02] <barryo1> I'm not familiar with openstack, i thought you meant you were using cephfs for images
[22:04] <mauilion> with openstack when you deploy an image. It will use a cached base image and make a copy on write disk for your vm.
[22:04] <mauilion> the cached base image is usually kept on shared storage.
[22:04] <mauilion> I used to use a netapp.
[22:04] <mauilion> Now I am using cephfs
[22:05] <mauilion> but it's not working too well.
[22:05] <davidz1> mauilion: I'll see if I can find someone who knows if this is a known issue or not.
[22:06] <mauilion> thanks davidz1
[22:07] <mauilion> I just failed over to my backup mds and things seem to work again
[22:08] <barryo1> are there any performance experts about at the moment?
[22:09] <davidz1> Does ls eventually finish like after the write is complete?
[22:09] <mauilion> while it was broken it would just hang forever.
[22:09] <mauilion> kind of reminiscent of nfs that way
[22:10] <dmick> that's how you can tell it's a shared fs, if it hangs :)
[22:11] <barryo1> earlier on mattch asked "OSD performance question - if you couldn't use SSD in your system, would you go for 2 15k (sas) journalling disks, with 4 7.2k (sata) OSD disks, or 4 15k sas disks and put each OSDs journal on its own disk?", we're working on puting together a couple of clusters, it would be interesting to hear what others think?
[22:11] <mauilion> http://nopaste.linux-dev.org/?71164
[22:11] <mauilion> davidz1: that is the link to the log file from the mds that wasn't working so well
[22:11] <davidz1> mauilion: great
[22:12] <mauilion> davidz1: lots of ms_handle_reset and ms_handle_reconnects .
[22:13] * markbby (~Adium@ Quit (Quit: Leaving.)
[22:16] <noob2> barryo1: i think the journals being on really fast storage doesn't make a huge difference overall when you have a lot of hosts. i could be wrong if someone else wants to chime in
[22:18] <barryo1> my cluster will either have 3 or 5 osds, mattch's may have 2 osds
[22:18] <barryo1> what would you class as a lot of hosts?
[22:19] <davidz1> mauilion: unfortunately, nothing helpful in the log. Cephfs is a work in progress and not ready for production use.
[22:19] * loicd (~loic@74-94-156-210-NewEngland.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:23] * sjustlaptop (~sam@ Quit (Read error: Operation timed out)
[22:24] <noob2> barryo1: prob 5 or more where the fast journal storage starts to matter less and less
[22:24] <noob2> it's just a guess though
[22:25] * Cube (~Cube@cpe-76-95-217-215.socal.res.rr.com) Quit (Quit: Leaving.)
[22:25] * Romeo (~Romeo@ has joined #ceph
[22:28] * jluis (~JL@89-181-159-246.net.novis.pt) has joined #ceph
[22:29] <mauilion> davidz1: yikes.
[22:29] <Romeo> Hi all... I was wondering if anyone here might have some tips for us. We setup a Ceph cluster with no issues (ceph -s returns HEALTH_OK) but when we try to start radosgw all we get in the logs is "Initialization timeout, failed to initialize". We turned debugging up (debug ms = 1) and tried to run in the foreground (radosgw -d), but no useful logs are produced... We've quadruple checked all the config, permissions, keyfiles,
[22:29] <Romeo> etc... We've been going around in crazy circles for 2 days now so I really hope someone can give a fresh pointer on what the issue might be....
[22:29] <barryo1> noob2: thanks, i'll try asking later when it's a bit busier
[22:30] <noob2> barryo1: are all those osd's going to be on 1 host?
[22:31] <noob2> or spread across a few hosts
[22:31] <noob2> there's performance data on the blog showing how things change when you add journals to fast ssd's
[22:31] <barryo1> sorry, i should have said osd hosts
[22:31] <barryo1> there will be at least 2 hosts with at least 4 osd's each
[22:32] <barryo1> sadly ssd's cost too much
[22:32] * sagelap2 (~sage@ Quit (Read error: Operation timed out)
[22:33] <noob2> well keep this in mind also. if your journals are all on the ssd and that dies you lose a lot of osd's
[22:33] * joao (~JL@ Quit (Ping timeout: 480 seconds)
[22:34] <barryo1> we have the choice of 2 15k sas for journals with 4 7.2k sata's for the OSD's or 4 15k sas disks for OSD's with the journal on them
[22:34] <noob2> i'd prob go with sas disks with the journal on them but that's just my hunch
[22:34] <noob2> the journal isn't really that large
[22:34] <noob2> like 2-5GB i think right?
[22:35] <noob2> you could buy a 20GB ssd if you wanted and that'd be plenty
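[For context on the sizing talk: the journal is a per-OSD setting in ceph.conf, so a small SSD can carry several journals as partitions. A sketch with illustrative values and hypothetical device paths:]

```shell
# ceph.conf fragment -- size in MB, paths are examples only:
#   [osd]
#   osd journal size = 5120
#   [osd.0]
#   osd journal = /dev/sdg1    # partition on the shared SSD
```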
[22:35] <barryo1> that's true
[22:36] <noob2> newegg has enterprise 50GB ssd's for like 119 bucks
[22:37] <noob2> i think there was a blog post about the optimal number of osd journals on an ssd before you start bottlenecking again
[22:37] <barryo1> we're pretty much tied in to buying all our kit from dell
[22:38] <noob2> i gotcha
[22:38] <noob2> i'd just go with all the 15K sas disks then
[22:38] <noob2> are you locked into providing x amount of iops?
[22:38] <davidz1> Romeo: I'm going to find someone who can help out.
[22:38] <Karcaw> could you use SATA disk-on-modules effectively for journals? they can be placed in the node on a spare sata/sas port.
[22:40] * mcclurmc (~mcclurmc@cpc10-cmbg15-2-0-cust205.5-4.cable.virginmedia.com) has joined #ceph
[22:40] * Romeo (~Romeo@ Quit ()
[22:41] <barryo1> noob2: we're not locked in to providing a certain amount of iops
[22:42] <noob2> gotcha
[22:42] <noob2> i think you'll have plenty of horsepower with your sas drives then. i was surprised when i benchmarked my ceph cluster for the first time
[22:43] <noob2> it's quite fast even with crap sata drives
[22:43] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) has joined #ceph
[22:46] * sagelap (~sage@2607:f298:a:607:cd03:41c2:287b:8513) has joined #ceph
[22:51] * calebamiles1 (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) Quit (Remote host closed the connection)
[22:55] * vata (~vata@2607:fad8:4:6:9c71:5677:5c5f:a72f) Quit (Quit: Leaving.)
[22:58] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[23:00] <sagewk> in case romeo comes back: --log-to-stderr will make the log appear on stderr, which will help
[23:00] <sagewk> or --log-file /path/to/foo
[23:01] <sagewk> that and --debug-ms 1 will likely tell what the story is. probably missing ceph.conf, or bad auth/keyring info
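[Putting sagewk's suggestions together, the debugging run would look something like:]

```shell
# Foreground, log forced to the terminal, with messenger debugging:
radosgw -d --log-to-stderr --debug-ms 1

# Or capture the log to a file instead:
radosgw -d --log-file /tmp/radosgw.log --debug-ms 1
```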
[23:01] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[23:05] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[23:06] * themgt (~themgt@97-95-235-55.dhcp.sffl.va.charter.com) has joined #ceph
[23:10] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:10] * mikedawson_ is now known as mikedawson
[23:12] * calebamiles (~caleb@c-107-3-1-145.hsd1.vt.comcast.net) has joined #ceph
[23:24] * timmclaughlin (~timmclaug@ Quit (Ping timeout: 480 seconds)
[23:28] * b1tbkt (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) has joined #ceph
[23:39] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[23:43] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[23:48] * diegows (~diegows@ has joined #ceph
[23:50] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[23:54] * lofejndif (~lsqavnbok@9KCAABKHB.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[23:56] * barryo1 (~barry@host31-52-19-201.range31-52.btcentralplus.com) Quit (Quit: Leaving.)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.