#ceph IRC Log


IRC Log for 2013-07-17

Timestamps are in GMT/BST.

[0:00] * BrianA (~Adium@ Quit (Quit: Leaving.)
[0:03] <mteixeira> dmick: nope, holding steady at 7 pgs stuck unclean
[0:03] <dmick> pick one and pg dump it?
[0:04] <mteixeira> Ah, something changed in "ceph health detail"... pg 15.12f is stuck unclean for 4537.128036, current state active+remapped, last acting [1,11,23].... the number 7 no longer shows up in the square brackets.
[0:04] * volitas (~dg234@bl7-191-185.dsl.telepac.pt) has joined #ceph
[0:09] <dmick> ceph pg 15.12f query
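The stuck-pg triage flow being used above can be sketched as a few commands (pg 15.12f is the example from this log; these need a live cluster, so they are illustrative rather than copy-paste runnable, and output shapes vary by release):

```shell
# Sketch of the stuck-pg triage flow discussed above.
ceph health detail           # lists each stuck pg with its state and acting set
ceph pg dump_stuck unclean   # only the pgs stuck in an unclean state
ceph pg 15.12f query         # full peering/recovery state for one pg
```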
[0:10] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[0:11] * volitas (~dg234@bl7-191-185.dsl.telepac.pt) Quit (autokilled: Do not spam. mail support@oftc.net (2013-07-16 22:11:34))
[0:12] <mteixeira> ceph pg 15.12f query: http://pastebin.com/qnhVJBSh
[0:15] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[0:16] * BillK (~BillK-OFT@124-148-212-240.dyn.iinet.net.au) has joined #ceph
[0:18] <lurbs> If a pg is stuck in active+remapped could it be the CRUSH tunables are set to old/defunct settings?
[0:18] <lurbs> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
[0:18] <sagewk> lurbs: yes, very possible
[0:19] <sagewk> esp if you have hosts with only 1 or 2 osds
[0:19] <lurbs> I had that problem with a cluster after changing the CRUSH map not so long ago, was wondering if it was affecting mteixeira now.
[0:20] <mteixeira> I suspected that might be a problem, but since I am on kernel 3.5.0-34 I did not want to change my tunables
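For reference, the tunables mechanics under discussion look roughly like this (profile names are the ones documented for this era of Ceph; check the tunables page linked above first, since old kernel clients such as mteixeira's 3.5 may not support the newer values, and changing tunables triggers data movement):

```shell
# Illustrative only; requires a live cluster and can lock out old kernel clients.
ceph osd crush show-tunables             # dump the current tunable values
ceph osd getcrushmap -o /tmp/crush.bin   # back up the CRUSH map first
ceph osd crush tunables optimal          # newer profile; needs recent clients
ceph osd crush tunables legacy           # revert to the old defaults
```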
[0:22] <dmick> mteixeira: are you using the kernel modules for cephfs or rbd?
[0:22] <dmick> (or do you plan to)?
[0:22] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[0:22] <mteixeira> I don't think so, but we do run OpenStack, so I'm not sure what gets used behind the scenes. Basically I was trying to put off changing the tunables if I didn't have to since I wasn't sure :)
[0:24] <mteixeira> But if that is the likely suspect for the problem, then I can look more closely into the possibility of changing the tunables. I guess I'll research cephfs and rbd a bit more just to make sure I don't need it.
[0:24] <lurbs> OpenStack isn't likely to be using CephFS, unless you're doing something like sharing /var/lib/nova/instances with it.
[0:25] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:26] <mteixeira> I'm not an expert on our use of openstack. There's another fellow here who would know more. I think for now I'll assume the tunables is the fix, and I'll check with him tomorrow about how to proceed.
[0:28] <mteixeira> It seems like this issue with the tunable only crops up in certain corner cases. Would a non-really-a-fix be to add more OSDs and hope the problem goes away? We do have plans to add more disks.
[0:29] * SudoAptitude (~Yruns@ip-93-159-106-144.enviatel.net) Quit (Quit: Verlassend)
[0:29] <mteixeira> I ask this because I've run into problems with stuck pgs before, when setting the weights to certain values, which I "solved" by changing the weights again to something else.
[0:31] <mteixeira> Or I could add disks to the one host I have currently with only two OSDs.
[0:33] * tnt (~tnt@ Quit (Quit: leaving)
[0:35] <dmick> mteixeira: that's a good question; I don't know the exact impact of the tunable changes
[0:38] * BillK (~BillK-OFT@124-148-212-240.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[0:38] <mteixeira> Okay. I think the lesson for today is that removing the OSD was really not the problem. Apparently it just put my crush map in a state in which I am running into the tunables issue, which I have seen before.
[0:39] <mteixeira> Would it be safe at this point for me to continue removing the OSD? (ceph osd crush remove osd.7) Or should I keep it around?
[0:40] * tnt (~tnt@ has joined #ceph
[0:41] <mteixeira> Also, what's the opposite of "ceph osd crush unlink"? In case I decide I do want to bring OSD.7 back in order to get the health back to OK?
[0:42] <gregaf> link ;)
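The removal/unlink/link commands being discussed can be sketched as follows (the host and root bucket names are hypothetical examples, not from this cluster):

```shell
# Illustrative only; requires a live cluster.
ceph osd crush remove osd.7                  # delete osd.7 from the CRUSH map
ceph osd crush unlink ceph-1001 default      # detach a bucket from its parent
ceph osd crush link ceph-1001 root=default   # re-attach it (the opposite of unlink)
```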
[0:42] * Meyer__ is now known as Meyer
[0:42] * mschiff (~mschiff@ Quit (Remote host closed the connection)
[0:42] <mteixeira> Ah, okay :) Well, thanks for the help today. I think I got a couple options to think about or explore.
[0:42] <gregaf> yehuda_hm: joshd just reminded me I had a question from standup this morning
[0:42] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) has joined #ceph
[0:43] <gregaf> you were talking about the sync agents not being able to catch all the way up and I'm not sure why
[0:44] <mteixeira> Thank you everyone. Have a good evening!
[0:44] <gregaf> g'night!
[0:44] <dmick> I think you should be fine removing the osd, yes
[0:45] * mteixeira (~oftc-webi@ip-216-17-134-225.rev.frii.com) Quit (Quit: Page closed)
[0:46] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[0:47] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Remote host closed the connection)
[0:47] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[0:48] <yehuda_hm> gregaf: hmm.. wrt to the metadata stuff I might have been incorrect
[0:48] <yehuda_hm> that's true for the data sync
[0:48] <yehuda_hm> because we only update the data log once every period
[0:49] <gregaf> ah, I see — but that doesn't mean you can't catch up, just that if you do you have to go backwards by the liveness period on your next sync ;)
[0:50] <yehuda_hm> well, that practically means that you shouldn't catch up
[0:50] <gregaf> not really; not catching up as much as you can would be a waste of bandwidth
[0:50] <yehuda_hm> the metadata log is actually ok now, because we made it so that entries can't go back in time
[0:50] <gregaf> plus you need to have the logic to handle time differences anyway so just leaving a buffer isn't really the right way to handle it
[0:51] <yehuda_hm> but then you have a problem that you need to start syncing stuff that you already synced, which is also a waste of bandwidth and cpu
[0:51] <gregaf> you know the versions you have, so it's just a metadata check?
[0:52] <yehuda_hm> just a metadata check? that's a round trip for you
[0:52] <gregaf> people who run data centers don't save money by leaving connections idle, they save money by being able to average their bandwidth to the lowest level possible so they don't need to buy as high a tier for bursting ;)
[0:52] <yehuda_hm> also, there's the case where data can be overwritten multiple times
[0:53] <gregaf> which will increment the version each time?
[0:53] <yehuda_hm> gregaf: but it's not idle, it's handling everything.. just lags a few seconds behind
[0:53] <gregaf> if you catch up to the "I'm not allowed to catch up more than this" point, then it sits idle until that point has moved on
[0:53] <yehuda_hm> it's not a one off operation, it's done continuously. In the steady state you don't waste any bandwidth
[0:54] <yehuda_hm> well, then you move on to the next shard
[0:54] <gregaf> anyway, the problem is for data syncing and it's the deliberate lack of time resolution on the bucket update log, coolio
[0:54] <gregaf> joshd, that all makes sense to you? :)
[0:57] <joshd> yeah, I'm thinking about the continuous part though - right now the agent isn't running forever, the partial sync just goes up to a certain point in time in all shards and then exits
[0:58] <joshd> it could be run by cron or in a simple script to keep updating everything
[0:58] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[0:59] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[1:01] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[1:01] <gregaf> I haven't looked at the agents at all yet; I'm not sure how they're structured, but I assume our final shipment wants to be a continuous process?
[1:04] <joshd> it's just a cli tool
[1:08] <yehuda_hm> gregaf, joshd: it should be continuous obviously
[1:15] * yanzheng (~zhyan@ has joined #ceph
[1:15] * ShaunR (~ShaunR@staff.ndchost.com) Quit (Remote host closed the connection)
[1:16] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[1:16] * ShaunR (~ShaunR@staff.ndchost.com) has joined #ceph
[1:25] * KevinPerks1 (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[1:48] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) has joined #ceph
[1:49] * yanzheng (~zhyan@ Quit (Ping timeout: 480 seconds)
[1:52] * bwesemann_ (~bwesemann@2001:1b30:0:6:dd4e:600d:f687:2650) Quit (Remote host closed the connection)
[1:52] * bwesemann_ (~bwesemann@2001:1b30:0:6:74bc:6499:312e:8e83) has joined #ceph
[1:53] * diegows (~diegows@ has joined #ceph
[1:56] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[1:59] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[2:07] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Ping timeout: 480 seconds)
[2:08] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Ping timeout: 480 seconds)
[2:14] * tserong_ is now known as tserong
[2:19] <sagewk> gregaf: https://github.com/ceph/ceph/pull/441 msgr fun!
[2:21] * jlogan2 (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[2:24] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[2:25] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[2:33] * sagelap (~sage@2600:1012:b01b:d38b:45bf:6bb8:16b9:1f53) has joined #ceph
[2:35] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[2:37] <sagelap> sjust: have you seen that repeated mon elections on your test cluster?
[2:37] <sagelap> recently?
[2:37] <sjust> not last time I tried
[2:37] <sjust> I think the mons were running an older version
[2:37] <sjust> since I'd mostly been restarting the osds
[2:38] <sjust> I've got mon logs from the last time in samuelj@slider:~/big-cluster/mons
[2:51] * xmltok_ (~xmltok@pool101.bizrate.com) Quit (Quit: Bye!)
[2:54] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[3:01] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[3:02] * yy (~michealyx@ has joined #ceph
[3:15] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) Quit (Ping timeout: 480 seconds)
[3:17] * sagelap (~sage@2600:1012:b01b:d38b:45bf:6bb8:16b9:1f53) Quit (Ping timeout: 480 seconds)
[3:18] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) Quit (Ping timeout: 480 seconds)
[3:20] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Ping timeout: 480 seconds)
[3:23] * Tamil (~tamil@ Quit (Quit: Leaving.)
[3:25] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[3:27] * The_Bishop (~bishop@2001:470:50b6:0:61b6:2d85:7548:d113) has joined #ceph
[3:30] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Ping timeout: 480 seconds)
[3:59] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[4:00] * xmltok (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit (Remote host closed the connection)
[4:00] * xmltok (~xmltok@relay.els4.ticketmaster.com) has joined #ceph
[4:00] * yanzheng (~zhyan@ has joined #ceph
[4:10] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[4:19] <grepory> is it possible to setup a key in ceph with a pre-defined secret?
[4:19] <grepory> or even say… a key file made with ceph-authtool? just import that somehow
[4:20] <dmick> absolutely
[4:20] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[4:20] <dmick> auth import will do it from a keyring file
[4:20] <grepory> aha
[4:21] <grepory> i didn't see that in ceph --help
[4:21] <dmick> so will auth add
[4:21] <dmick> you will in ceph -h now
[4:23] <grepory> aha. fantastic
[4:23] <grepory> thanks
[4:23] <dmick> I just remembered the command with ceph -h auth. and yw
[4:23] <grepory> yeah i figured it out
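The pre-defined-key workflow dmick describes can be sketched in two steps; "client.foo" and the caps below are hypothetical examples, and the commands need a live cluster:

```shell
# Sketch: pre-generate a key with ceph-authtool, then import the keyring.
ceph-authtool --create-keyring /tmp/foo.keyring --gen-key -n client.foo \
    --cap mon 'allow r' --cap osd 'allow rw pool=data'
ceph auth import -i /tmp/foo.keyring
ceph auth get client.foo   # confirm the key and caps were imported
```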
[4:23] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[5:00] * fireD1 (~fireD@93-139-147-104.adsl.net.t-com.hr) has joined #ceph
[5:07] * fireD (~fireD@93-139-145-242.adsl.net.t-com.hr) Quit (Ping timeout: 480 seconds)
[5:43] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[5:43] <grepory> blowing away everything in /var/lib/ceph/mon, /var/lib/ceph/osd should really reset the state of the cluster, right? assuming you also kill the partition tables on osd disks, etc
[5:49] <grepory> yes it would appear so
[6:03] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Ping timeout: 480 seconds)
[6:08] * sagelap (~sage@ has joined #ceph
[6:10] * yy (~michealyx@ has left #ceph
[6:25] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[6:29] * jlogan2 (~Thunderbi@2600:c00:3010:1:1::40) Quit (Ping timeout: 480 seconds)
[6:34] * yy (~michealyx@ has joined #ceph
[6:34] * yy (~michealyx@ has left #ceph
[6:40] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[6:53] * xmltok (~xmltok@relay.els4.ticketmaster.com) Quit (Ping timeout: 480 seconds)
[7:27] * yy (~michealyx@ has joined #ceph
[7:44] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[7:50] * tnt (~tnt@ has joined #ceph
[7:53] * The_Bishop (~bishop@2001:470:50b6:0:61b6:2d85:7548:d113) Quit (Ping timeout: 480 seconds)
[8:03] * AfC (~andrew@2001:44b8:31cb:d400:4c2f:6d75:850c:c196) has joined #ceph
[8:11] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) has joined #ceph
[8:22] <grepory> now i just need to figure out the easiest way to kill an osd's superblock.
[8:22] <grepory> or the fastest way to make sure the xfs filesystem goes bye bye… i think i can manage this.
[8:24] * Tamil (~tamil@ has joined #ceph
[8:30] <grepory> this is what i get for being on too many irc networks…
[8:44] * Tamil (~tamil@ Quit (Quit: Leaving.)
[8:44] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[8:50] * odyssey4me (~odyssey4m@ has joined #ceph
[8:54] <odyssey4me> good morning all
[8:55] <grepory> morning
[9:04] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) has joined #ceph
[9:06] <odyssey4me> Does anyone online at the moment have some experience with designing crush maps?
[9:13] <tnt> odyssey4me: a bit.
[9:16] * mschiff (~mschiff@p4FD7F468.dip0.t-ipconnect.de) has joined #ceph
[9:19] * hybrid512 (~walid@106-171-static.pacwan.net) has joined #ceph
[9:22] * stacker666 (~stacker66@104.pool85-58-195.dynamic.orange.es) has joined #ceph
[9:23] <ccourtaut> morning
[9:26] <odyssey4me> morning :)
[9:30] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:30] * ChanServ sets mode +v andreask
[9:34] * LeaChim (~LeaChim@ has joined #ceph
[9:36] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) has joined #ceph
[9:36] * tnt (~tnt@ Quit (Ping timeout: 480 seconds)
[9:36] * leseb1 (~Adium@2a04:2500:0:d00:c111:1c77:66dc:c3b6) has joined #ceph
[9:37] * hybrid512 (~walid@106-171-static.pacwan.net) Quit (Ping timeout: 480 seconds)
[9:38] * jks (~jks@3e6b5724.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[9:39] * jks (~jks@3e6b5724.rev.stofanet.dk) has joined #ceph
[9:41] <odyssey4me> hmm, I saw a config option last night which had to do with when a client considered a write operation complete - it had to do with the number of replicas... now I can't find it
[9:44] * infinitytrapdoor (~infinityt@ has joined #ceph
[9:47] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:47] <grepory> cluster and public networks is difficult
[9:51] <Gugge-47527> difficult?
[9:52] <grepory> yes… for example, my osd's are trying to communicate from their unrouted cluster addresses to the public address of a mon.
[9:52] <grepory> i don't know why, but they are.
[9:52] * dignus (~dignus@bastion.jkit.nl) has left #ceph
[9:52] <Gugge-47527> does the mon have a cluster address too?
[9:52] <grepory> no
[9:52] <Gugge-47527> then that is why
[9:53] <Gugge-47527> :)
[9:53] <grepory> it was my understanding that mons do not have cluster addresses.
[9:53] <grepory> also, as far as i can tell, there is no configuration directive to specify one.
[9:54] <Gugge-47527> how is the public addr and cluster addr setup on the osd?
[9:55] <grepory> [osd.<id>]
[9:55] <grepory> cluster addr = foo
[9:55] <grepory> public addr = bar
[9:55] <Gugge-47527> because you are correct, the mon should only have the public address :)
[9:55] <grepory> for mons it is:
[9:55] <grepory> [mon.<id>]
[9:55] <grepory> mon addr = baz
[9:55] <Gugge-47527> foo and bar is not gonna do it, i need the actual addresses :)
[9:55] <grepory> why?
[9:55] <Gugge-47527> because you are asking for help, and to help i need as complete info as possible :)
[9:56] * s2r2 (~s2r2@ has joined #ceph
[9:56] <grepory> an example
[9:56] <grepory> [osd.5]
[9:56] <grepory> host = ceph-1002
[9:56] <grepory> devs = /dev/sdb1
[9:56] <grepory> cluster addr =
[9:56] <grepory> public addr =
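A split-network ceph.conf along the lines being debugged here looks roughly like the fragment below. The addresses are documentation placeholders (, not grepory's real subnets, which were redacted from this log; the point is that mons carry only a public address while osds get both:

```ini
; Illustrative layout only; substitute your own subnets.
[global]
public network =
cluster network =

[mon.a]
mon addr =       ; mons sit on the public network only

[osd.5]
public addr =
cluster addr =   ; osd-to-osd replication/heartbeat traffic
```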
[9:57] <jks> odyssey4me, do you mean min_size?
[9:57] <Gugge-47527> and then you do a netstat -r on that osd server, and paste that on pastebin :)
[9:58] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) Quit (Read error: Operation timed out)
[9:58] <grepory> http://pastebin.com/M0Xw9LYR
[9:59] <Gugge-47527> and the output from "ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok config show | grep addr"
[10:00] <Gugge-47527> and what address does the mon in question have?
[10:03] <grepory> http://pastebin.com/8xybrVEE
[10:03] <grepory> it may not be attempting to talk to a mon, actually… but instead trying to talk to an osd or something
[10:03] <grepory> i just noticed it is connecting to port 6802
[10:03] <grepory> if it were trying the mon, it'd be 6789
[10:04] <Gugge-47527> yes, it looks like there is an osd on
[10:04] <grepory> but there isn't
[10:04] <Gugge-47527> and isnt a local network on any of the 3 vlans even
[10:04] <grepory> it is
[10:04] <odyssey4me> jks - nope, that's not it... it wasn't the specification of the number of replicas, it was a setting which allowed you to say that if you have 3 replicas, tell the client that the write is successful after 2 have been written
[10:05] <grepory> Gugge-47527: it is on the 10.20.64 network
[10:05] <Gugge-47527> ahh yes, it is
[10:05] <grepory> it's a /20, not a /24.
[10:05] * NuxRo (~nux@ has joined #ceph
[10:05] <jks> odyssey4me, which is min_size?
[10:05] <Gugge-47527> you sure nothing is listening on that ip/port ?
[10:05] <NuxRo> Hi, I'm trying to get ceph cuttlefish installed on Centos 6 but I'm missing the ceph-deploy command, any pointers?
[10:05] <NuxRo> `yum provides */ceph-deploy` returns nothing
[10:06] <Gugge-47527> somehow the osd thinks it should contact something there :)
[10:06] <grepory> Gugge-47527: yes. the osd is listening on the cluster network on that machine, not the .73
[10:06] <jks> NuxRo, it is in the noarch dir
[10:06] <jks> NuxRo: http://ceph.com/rpm-cuttlefish/el6/noarch/
[10:06] <Gugge-47527> grepory: and you confirmed that with netstat -nl ?
[10:07] <grepory> Gugge-47527: yes
[10:07] <odyssey4me> jks - min_size in a crush map rule?
[10:07] <odyssey4me> NuxRo - have you installed the ceph-deploy package?
[10:07] <Gugge-47527> grepory: did it ever listen on that address, so an old monmap has it?
[10:07] <jks> odyssey4me, you can set a "min_size" on the pool itself - it is that I am referring to
[10:07] <NuxRo> jks: looks like a bug either in the repo or the docs :) thanks
[10:08] <grepory> Gugge-47527: i just finished tearing down my cluster. confirmed mons came up at epoch 0.
[10:08] <grepory> 0 osds
[10:08] <NuxRo> odyssey4me: no, that's what I'm trying to do, but it is _not_ included in the main repo
[10:08] <jks> odyssey4me, http://ceph.com/docs/master/rados/operations/pools/
[10:08] <jks> odyssey4me, check under "SET POOL VALUES"
[10:09] * hybrid512 (~walid@106-171-static.pacwan.net) has joined #ceph
[10:09] <odyssey4me> jks - haha, just found it, thanks :) that looks like it
[10:10] <odyssey4me> osd pool default min size
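The per-pool setting odyssey4me landed on can be changed and inspected as sketched below ("rbd" is just an example pool name; needs a live cluster):

```shell
# min_size: how many replicas must acknowledge before I/O is served.
ceph osd pool set rbd min_size 2
ceph osd dump | grep min_size   # pool lines show size/min_size for every pool
```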
[10:10] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[10:10] <Gugge-47527> grepory: so its a new cluster with no old monmaps from before?
[10:10] <grepory> Gugge-47527: by tearing down my cluster, i mean rm -rf /var/lib/ceph, rm -rf /var/lib/puppet, rm -rf /etc/ceph/*, rm -rf /var/run/ceph; re-running mkfs.xfs and destroying gpts on every disk
[10:10] <grepory> Gugge-47527: yes.
[10:16] <grepory> Gugge-47527: i would think that the osd would try to contact the mons in the 10.20.16 address space
[10:16] <grepory> oh but this is still osd to osd
[10:16] <grepory> i keep forgetting that. maybe i have been working on this for too long today.
[10:17] * NuxRo (~nux@ has left #ceph
[10:17] <grepory> i will just wipe my cluster and try again tomorrow… after i sleep.
[10:19] * infinitytrapdoor (~infinityt@ has joined #ceph
[10:20] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[10:21] <grepory> Gugge-47527: figured it out. puppet ruins everything once again.
[10:25] * bergerx_ (~bekir@ has joined #ceph
[10:25] * mschiff (~mschiff@p4FD7F468.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[10:27] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[10:31] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[10:31] <odyssey4me> jks - is it possible to see what the current pool's min size is?
[10:32] <jks> odyssey4me, yes, you can see it with: ceph osd dump
[10:33] * infinitytrapdoor (~infinityt@ has joined #ceph
[10:36] * leseb1 (~Adium@2a04:2500:0:d00:c111:1c77:66dc:c3b6) Quit (Quit: Leaving.)
[10:40] * leseb (~Adium@2a04:2500:0:d00:51b3:ce3e:7c75:db2b) has joined #ceph
[10:42] * Kioob (~kioob@2a01:e35:2432:58a0:21e:8cff:fe07:45b6) Quit (Read error: Connection reset by peer)
[10:43] * yy (~michealyx@ has left #ceph
[10:43] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[10:45] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has joined #ceph
[10:51] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has left #ceph
[10:51] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has joined #ceph
[10:52] * goldfish (~goldfish@ Quit (Remote host closed the connection)
[10:53] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has left #ceph
[10:53] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has joined #ceph
[10:56] * yy (~michealyx@ has joined #ceph
[10:58] * admin (~chatzilla@ has joined #ceph
[10:59] * admin (~chatzilla@ has left #ceph
[11:00] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[11:01] * huangjun (~huangjun@ has joined #ceph
[11:01] * admin (~chatzilla@ has joined #ceph
[11:01] * admin is now known as yy-nm
[11:02] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) has joined #ceph
[11:03] * yy-nm (~chatzilla@ Quit (Quit: ChatZilla [Firefox 22.0/20130618035212])
[11:03] <huangjun> hi,
[11:04] * yanzheng (~zhyan@ Quit (Remote host closed the connection)
[11:04] * grepory (~Adium@c-69-181-42-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[11:06] * ebo (~ebo@icg1104.icg.kfa-juelich.de) has joined #ceph
[11:08] <ebo> hi. i have a cephfs filesystem with a directory in which one or more files are no longer accessible. ls hangs indefinitely if it includes these files. any ideas?
[11:09] <huangjun> check the mds stat
[11:09] <ebo> up and active
[11:10] <ebo> other directories and files work
[11:11] <huangjun> what about restarting the mds?
[11:12] <ebo> nope
[11:12] <ebo> restarted it, no change
[11:14] <huangjun> in the common case the ls op is handled by the mds; check the dmesg output on that client, maybe you'll find something
[11:14] * Cube (~Cube@173-8-221-113-Oregon.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[11:15] <ebo> nothing in the dmesg output, apart from reconnecting after the mds restart
[11:18] * _Tass4dar (~tassadar@tassadar.xs4all.nl) Quit (Ping timeout: 480 seconds)
[11:18] <ofu> can I get a 10sec average value from ceph -s (for monitoring stuff)
[11:20] <huangjun> which config options affect performance in ceph? does 'filestore max sync interval' have an effect?
[11:21] * _Tassadar (~tassadar@tassadar.xs4all.nl) has joined #ceph
[11:22] <huangjun> @<ebo>: maybe you can get the log to analyse this
[11:22] <cephalobot> huangjun: Error: "<ebo>:" is not a valid command.
[11:23] <ebo> which log?
[11:26] <huangjun> the osd log,
[11:30] <ebo> will see
[11:31] * yy-nm (~chatzilla@ has joined #ceph
[11:33] * eternaleye (~eternaley@c-50-132-41-203.hsd1.wa.comcast.net) Quit (Ping timeout: 480 seconds)
[11:37] <huangjun> what is "osd_age" used for in ceph?
[11:37] * Midnightmyth (~quassel@93-167-84-102-static.dk.customer.tdc.net) has joined #ceph
[11:41] * tserong_ (~tserong@124-168-227-28.dyn.iinet.net.au) has joined #ceph
[11:42] * infinitytrapdoor (~infinityt@ has joined #ceph
[11:42] * tserong (~tserong@58-6-128-204.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[11:54] * eternaleye (~eternaley@2002:3284:29cb::1) has joined #ceph
[11:55] * deadsimple (~infinityt@ has joined #ceph
[12:00] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[12:02] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) has joined #ceph
[12:07] * deadsimple (~infinityt@ Quit (Ping timeout: 480 seconds)
[12:12] * jeandanielbussy (~jeandanie@124x35x46x12.ap124.ftth.ucom.ne.jp) has joined #ceph
[12:12] * jeandanielbussy is now known as silversurfer
[12:16] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[12:18] * huangjun (~huangjun@ Quit (Quit: HydraIRC -> http://www.hydrairc.com <- *I* use it, so it must be good!)
[12:24] * mschiff (~mschiff@p4FD7F468.dip0.t-ipconnect.de) has joined #ceph
[12:24] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) Quit (Read error: Connection reset by peer)
[12:25] * infinitytrapdoor (~infinityt@ has joined #ceph
[12:26] * yy-nm (~chatzilla@ Quit (Quit: ChatZilla [Firefox 22.0/20130618035212])
[12:26] * yy (~michealyx@ has left #ceph
[12:30] * jcfischer_afk (~fischer@user-28-11.vpn.switch.ch) Quit (Ping timeout: 480 seconds)
[12:31] * leseb (~Adium@2a04:2500:0:d00:51b3:ce3e:7c75:db2b) Quit (Quit: Leaving.)
[12:35] * jcfischer (~fischer@macjcf.switch.ch) has joined #ceph
[12:37] * infinitytrapdoor (~infinityt@ Quit ()
[12:38] * infinitytrapdoor (~infinityt@ has joined #ceph
[12:42] <jcfischer> gnu - lost my irc history (and the server shell history as well) odyssey4me: what was the command to create the "teststripe" image?
[12:42] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[12:45] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:45] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[12:48] * s2r2 (~s2r2@ has joined #ceph
[12:52] * KindOne (KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[12:53] * KindOne (~KindOne@0001a7db.user.oftc.net) has joined #ceph
[12:56] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[12:57] <Gugge-47527> jcfischer: http://irclogs.ceph.widodh.nl/index.php?date=2013-07-16
[12:58] <jcfischer> wget: unable to resolve host address `irclogs.ceph.widodh.nl'
[12:59] * mynameisbruce (~mynameisb@tjure.netzquadrat.de) has joined #ceph
[12:59] <Gugge-47527> get a working resolver :)
[13:00] <jcfischer> damn :D
[13:00] * s2r2 (~s2r2@ has joined #ceph
[13:02] * leseb1 (~Adium@2a04:2500:0:d00:54f5:4c30:d24a:aaac) has joined #ceph
[13:06] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[13:07] <jcfischer> forwarded to our DNS admins
[13:08] <Gugge-47527> but 'rbd -p rbd --image-format 2 --size 10000 --stripe-unit 4096 --stripe-count 4 create teststripe' was what you wrote yesterday :)
[13:08] <Gugge-47527> according to the logs :)
[13:09] * s2r2 (~s2r2@ has joined #ceph
[13:10] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:10] * ChanServ sets mode +v andreask
[13:10] <jcfischer> yay
[13:10] <jcfischer> thank you so much
[13:10] * leseb1 (~Adium@2a04:2500:0:d00:54f5:4c30:d24a:aaac) Quit (Ping timeout: 480 seconds)
[13:12] * mschiff_ (~mschiff@p4FD7F468.dip0.t-ipconnect.de) has joined #ceph
[13:12] * mschiff (~mschiff@p4FD7F468.dip0.t-ipconnect.de) Quit (Read error: Connection reset by peer)
[13:13] <jcfischer> and after updating to 0.61.4 and restarting the whole cluster, the rbd_bench actually runs (and the rbd hang from yesterday is gone)
[13:17] * diegows (~diegows@ has joined #ceph
[13:20] <jcfischer> Read: 1003.61 Mb/s, Write: 177.21 Mb/s
[13:22] <Gugge-47527> remember 4k is pretty small writes
[13:23] <Gugge-47527> maybe try a stripe-unit of 16k or 64k
[13:24] <jcfischer> experimenting with different stripe-count now
[13:24] <jcfischer> stripe-unit afterwards
[13:27] <jcfischer> btw: the DNS setup for irclogs.ceph.widodh.nl has a problem with DNSSEC. Try http://www.dnssecmonitor.org/ and request an A or AAAA record
[13:31] <Gugge-47527> tell wido :)
[13:32] <jcfischer> will do
[13:32] <Gugge-47527> (i assume he is the one running it) :)
[13:32] <Gugge-47527> i dont really know though
[13:33] * leseb (~Adium@2a04:2500:0:d00:b97d:66b2:e87:312c) has joined #ceph
[13:34] <jcfischer> done
[13:41] <ebo> i need a cephfs fsck :-(
[13:42] * allsystemsarego (~allsystem@ has joined #ceph
[13:43] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) has joined #ceph
[13:49] * swamy_cs (~0e8d0505@webuser.thegrebs.com) has joined #ceph
[13:49] <swamy_cs> hi
[13:50] <swamy_cs> have set up a 3 node ceph cluster
[13:50] <swamy_cs> using fastcgi module
[13:50] <swamy_cs> getting issue while running s3 test
[13:51] <swamy_cs> boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
[13:51] <swamy_cs> not sure what is causing it here
[13:52] <andreask> so some sort of permission error
[13:53] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) has joined #ceph
[13:58] <andreask> swamy_cs: what test gives that error?
[14:00] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Operation timed out)
[14:01] <jcfischer> some informal benchmark results: https://docs.google.com/spreadsheet/ccc?key=0AsjockBApInDdC10SWw4Y09HbUpsQWdZQ2RlTlhibEE&usp=sharing
[14:02] * syed_ (~chatzilla@ has joined #ceph
[14:02] <swamy_cs> andreask : I am using boto
[14:02] <swamy_cs> andreask: firing a pyhton script to create a bucket
[14:03] <andreask> I see .. so you are not using the s3test utility?
[14:03] <swamy_cs> are you referring to the s3 api tool by amazon?
[14:04] <swamy_cs> I just wrote a test script in python to see whether basic things are working or not
[14:04] <swamy_cs> so that I can add it to one of my cloudstack instances for my testing
[14:06] <andreask> I was referring to the s3-tests tool that you can also find in the ceph repo
[14:07] <swamy_cs> oh ok
[14:07] <swamy_cs> but, this error looks to me like something to do with the user I am using or something like that
[14:07] <swamy_cs> I have double checked and I am passing the right secretkey and access key
[14:08] <swamy_cs> how do I know whether the above user has access to create a bucket or not?
[14:08] <swamy_cs> is there a quick way
[14:09] <swamy_cs> I created this user with the radosgw command line
[14:10] <swamy_cs> radosgw-admin user create --uid="testuser" --display-name="First User"
[14:12] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) Quit (Quit: Leaving.)
[14:14] <swamy_cs> andreask: any idea?
[14:16] <andreask> swamy_cs: you can do a quick test by following http://ceph.com/docs/next/radosgw/s3/bucketops/
[14:16] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:20] <swamy_cs> andreask: thanks for your help
[14:20] <swamy_cs> checked and made sure I am following the constraints mentioned in the link you provided
[14:20] <swamy_cs> but still couldn't make any progress
[14:21] <andreask> swamy_cs: hmm ... logs?
[14:22] <swamy_cs> - - [17/Jul/2013:17:48:38 +0530] "PUT /buckets/ HTTP/1.1" 403 246 "-" "Boto/2.2.2 (linux2)" - - [17/Jul/2013:17:49:38 +0530] "PUT /buckets/ HTTP/1.1" 403 246 "-" "Boto/2.2.2 (linux2)"
[14:22] <swamy_cs> This is the log shown in access.log of apache2
[14:22] <swamy_cs> - - [17/Jul/2013:17:48:38 +0530] "PUT /buckets/ HTTP/1.1" 403 246 "-" "Boto/2.2.2 (linux2)"
[14:22] <swamy_cs> nothing more than that
[14:22] <swamy_cs> is there any other log that gives me more info?
[14:22] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[14:23] <andreask> swamy_cs: the radosgw log
[14:23] * The_Bishop (~bishop@2001:470:50b6:0:497e:a554:edd:9a9f) has joined #ceph
[14:23] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[14:23] <andreask> log
[14:24] * leseb (~Adium@2a04:2500:0:d00:b97d:66b2:e87:312c) Quit (Quit: Leaving.)
[14:25] * jskinner (~jskinner@ has joined #ceph
[14:25] <swamy_cs> http://pastebin.com/xSd8jE1B
[14:25] <swamy_cs> andreask: pasted the log here in the above link
[14:26] <swamy_cs> please let me know if there is anything else you want me to provide
[14:29] <swamy_cs> it says failed to authorize the request
[14:29] <andreask> swamy_cs: yes
[14:31] <swamy_cs> any idea what could cause an issue here?
[14:31] <swamy_cs> radosgw-admin user info --uid=testuser
[14:31] <swamy_cs> I have used the access & secret keys from the output of above command
[14:33] <andreask> strange ... can you try with increased verbosity for the logs? .... debug rgw = 20
[14:34] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[14:35] <swamy_cs> andreask: I got it working
[14:35] <swamy_cs> andreask: really appreciate your help
[14:35] <andreask> swamy_cs: oh ...what was the solution?
[14:35] <swamy_cs> andreask: Issue is with my python script
[14:35] <tnt> jcfischer: you did run the bench twice each time right ?
[14:35] <jcfischer> tnt: I did
[14:36] <swamy_cs> andreask: where the secret key contained "\"
[14:36] <swamy_cs> andreask: so I created another user and tried it and everything went fine
[14:36] <swamy_cs> andreask: do you work for inktank?
[14:36] <andreask> swamy_cs: no
[14:37] <jcfischer> from what I learn from this, I probably should leave the stripe options alone and just go with the defaults
[14:37] <swamy_cs> andreask: Thanks for your help and prompt reply
[14:37] <andreask> swamy_cs: ah ... the \ ... there is this big fat warning in the docs
[14:37] <andreask> swamy_cs: you are welcome
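[Editor's note: swamy_cs's 403 above turned out to be a secret key containing "\". A minimal sketch of that kind of boto test script, with a sanity check for backslashes in the pasted keys; the host name and keys are hypothetical, and the boto calls (commented out) are illustrative only.]

```python
# Quick pre-check for the pitfall swamy_cs hit: a secret key from
# `radosgw-admin user info` that contains a literal backslash is easy
# to mangle when pasted into a python script.

def key_looks_risky(secret_key):
    """True if the secret key contains a literal backslash."""
    return "\\" in secret_key

# A key with a backslash, as sometimes generated by radosgw-admin:
assert key_looks_risky("abc\\defXYZ")
assert not key_looks_risky("abcdefXYZ123")

# Hypothetical bucket-creation code (needs the third-party `boto` package
# and a reachable gateway; names are assumptions, not from the log):
# import boto, boto.s3.connection
# conn = boto.connect_s3(
#     aws_access_key_id=access_key,
#     aws_secret_access_key=secret_key,
#     host='gateway.example.com',
#     calling_format=boto.s3.connection.OrdinaryCallingFormat())
# conn.create_bucket('my-new-bucket')
```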
[14:38] <tnt> jcfischer: it would probably be useful to do a random io test. The small rbd_bench case does sequential IO which may not benefit much from striping (and thus you get the overhead but none of the benefit).
[14:39] <jcfischer> somebody code that up :)
[14:40] <jcfischer> actually - should be relatively easy, from looking at the code...
[14:41] * sleinen (~Adium@2001:620:0:46:10f3:b5db:398f:c1d4) has joined #ceph
[14:41] <jcfischer> will see if my C-foo is still strong
[14:42] * swamy_cs (~0e8d0505@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC (Ping timeout))
[14:43] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:47] * infinitytrapdoor (~infinityt@ Quit ()
[14:49] <syed_> andreask: hello ;)
[14:50] <andreask> syed_: hi Syed ;-)
[14:50] * AfC (~andrew@2001:44b8:31cb:d400:4c2f:6d75:850c:c196) Quit (Quit: Leaving.)
[14:53] <madkiss> hm
[14:54] <syed_> madkiss: lolzz...
[14:59] * drokita (~drokita@97-92-254-72.dhcp.stls.mo.charter.com) Quit (Ping timeout: 480 seconds)
[15:00] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[15:01] * AfC (~andrew@2001:44b8:31cb:d400:104f:1ffe:4dcf:cf5f) has joined #ceph
[15:01] <jcfischer> Gugge-47527: can I ask you for a look in the irc logs again? What was the command I used to compile rbd_bench?
[15:05] * leseb (~Adium@2a04:2500:0:d00:4031:ea78:8b7a:4615) has joined #ceph
[15:10] * infinitytrapdoor (~infinityt@ has joined #ceph
[15:14] * jcfischer_ (~fischer@peta-dhcp-13.switch.ch) has joined #ceph
[15:15] * jcfischer (~fischer@macjcf.switch.ch) Quit (Read error: Operation timed out)
[15:15] * jcfischer_ is now known as jcfischer
[15:18] <odyssey4me> jcfischer - I see you came right?
[15:18] <jcfischer> huh?
[15:18] * jcfischer parsed the statement
[15:18] <jcfischer> yes I did
[15:18] <jcfischer> running random tests right now :)
[15:19] <odyssey4me> I'd be interested to hear your results.
[15:20] <odyssey4me> I still get best performance without the stripe, oddly enough
[15:20] <tnt> odyssey4me: for random tests you probably want a smaller chunk size.
[15:20] <jcfischer> watch the drama unfold live here: https://docs.google.com/spreadsheet/ccc?key=0AsjockBApInDdC10SWw4Y09HbUpsQWdZQ2RlTlhibEE#gid=0
[15:20] <tnt> 1M chunk will pretty much appear as 'sequential' for stripe size of 4k ...
[15:21] <odyssey4me> My consistent speed for rbd_bench appears to be around 1024-1100Mb/s read and 340-400Mb/s write
[15:21] <jcfischer> odyssey4me: you are on high end disk HW, right?
[15:23] <jcfischer> I should probably also reduce the chunk size from 1MB to something smaller....
[15:23] * drokita (~drokita@ has joined #ceph
[15:23] <odyssey4me> jcfischer - 6Gb/s SAS with a decent controller in an IBM system, I guess so :p
[15:24] <jcfischer> I'm on cheap 3 TB SATA drives and get 150 - 200 MB/s writes
[15:24] <jcfischer> and consistent 1GB/s reads
[15:25] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[15:25] <odyssey4me> I'll be testing with journals on SSD's later this week
[15:26] <jcfischer> we have journal SSD
[15:26] <odyssey4me> oddly enough in my performance testing from within a vm I get the reverse in performance
[15:26] <jcfischer> afk
[15:26] <odyssey4me> I get reasonable reads and shocking writes
[15:26] * KindTwo (KindOne@h97.24.131.174.dynamic.ip.windstream.net) has joined #ceph
[15:26] <odyssey4me> this is with rbd cache enabled and writeback enabled in kvm
[15:26] <jcfischer> I need to look into that
[15:26] <odyssey4me> and journals on separate partitions on each osd drive
[15:27] <jcfischer> not sure if we have caching enabled in kvm
[15:27] <jcfischer> journals on separate partitions on a SSD
[15:27] <odyssey4me> I'm trying to figure out how to improve performance for the VM's in KVM
[15:28] * KindOne (~KindOne@0001a7db.user.oftc.net) Quit (Ping timeout: 480 seconds)
[15:28] * KindTwo is now known as KindOne
[15:30] * PerlStalker (~PerlStalk@ has joined #ceph
[15:31] <mozg> odyssey4me: same here
[15:31] <mozg> any tips would be great
[15:31] <mozg> what performance figures are you getting?
[15:31] <odyssey4me> what is interesting is that when I do the iozone tests the performance is better for writes with larger record sizes
[15:32] * markbby (~Adium@ has joined #ceph
[15:33] * syed_ (~chatzilla@ Quit (Quit: ChatZilla [Firefox 22.0/20130627172038])
[15:33] <tnt> Somehow when I use the rbd_cache, librbd seems to be using much more RAM than it should.
[15:35] <mozg> i do get a far better performance for writes
[15:35] <mozg> my reads are crazily low
[15:35] * s2r2 (~s2r2@ has joined #ceph
[15:35] <mozg> at around 100-110mb/s
[15:35] <mozg> per thread
[15:36] <mozg> i can't for the love of god get a better performance from a single thread
[15:36] <mozg> odyssey4me: by the way, how many servers do you have in your setup
[15:36] <mozg> and how many osds?
[15:38] <mozg> what i've also noticed is that ubuntu kvm guests tend to panic when doing intensive fio tests
[15:38] * markbby (~Adium@ Quit (Remote host closed the connection)
[15:39] <mozg> like 4k block size with random read write of 4 files with 16 io threads each
[15:39] <mozg> after about 5-10 minutes you get a kernel panic
[15:39] <mozg> this doesn't happen on nfs
[15:40] <mozg> tnt: is there a way to check how much cache is being used by librbd?
[15:41] <tnt> no, I'm just looking at the 'top' output.
[15:42] * leseb (~Adium@2a04:2500:0:d00:4031:ea78:8b7a:4615) Quit (Quit: Leaving.)
[15:42] <jcfischer> what are you guys using for vm disk tests?
[15:43] <tnt> jcfischer: I usually do both dd sequential tests on the raw block device. Then also do a bonnie++ test
[15:43] * leseb (~Adium@2a04:2500:0:d00:1846:c65b:7ba3:e1ea) has joined #ceph
[15:44] <jcfischer> firing up a new vm :)
[15:46] <odyssey4me> mozg - is that in a vm or on the host?
[15:46] <odyssey4me> I have three servers in my test environment.
[15:47] * BillK (~BillK-OFT@203-59-173-44.dyn.iinet.net.au) Quit (Ping timeout: 480 seconds)
[15:48] <odyssey4me> jcfischer - I've been using iozone for the real tests... dd or rbd_bench for comparisons
[15:49] <odyssey4me> from in a vm I do a simple single-threaded write test using `dd if=/dev/zero of=/dev/vdb oflag=direct bs=1M` and get around 160Mb/s
[15:49] <odyssey4me> with a simple read test `dd if=/dev/vdb of=/dev/null iflag=direct bs=1M` I get 135Mb/s
[15:50] * AfC (~andrew@2001:44b8:31cb:d400:104f:1ffe:4dcf:cf5f) Quit (Quit: Leaving.)
[15:50] <odyssey4me> this doesn't make sense - I would have thought that read would be more than write... what am I doing wrong here?
[15:50] <odyssey4me> btw - iozone reflects similar stats
[15:52] <mozg> odyssey4me: that's in the vms
[15:53] <mozg> odyssey4me: that looks about right
[15:53] <mozg> i get slightly lower results
[15:53] <mozg> using sata disks
[15:53] <odyssey4me> although there's massive disparity - a 4k reclen on a 5gb file gives me 2.8 Mbytes/s random read and 17.5 Mbytes/s random write; whereas a 512k reclen on the same sized file gives me 75 Mbytes/s random read and 224 Mbytes/s random write
[15:53] <mozg> if you run the read test several times do you get improved results?
[15:54] <odyssey4me> mozg - nope, always very close to those numbers
[15:57] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[15:57] <jcfischer> can you share your bonnie++ command while I compile iozone?
[16:00] <odyssey4me> jcfischer - what os are you using?
[16:00] <tnt> just bonnie++ -u someuser ...
[16:00] <jcfischer> ubuntu 12.10
[16:00] <tnt> someuser being a user that's not root but can write in the test directory
[16:00] <odyssey4me> apt-get install iozone3
[16:00] <mozg> odyssey4me: what i've found while doing benchmarks is they are all bullsh*t compared with the real life performance
[16:01] <mozg> when i started working with real data my performance was much lower
[16:01] <odyssey4me> and my command is: iozone -a -y 4k -q 512k -s 5g -i 0 -i 1 -i 2 -R
[16:01] <jcfischer> weird - no iozone package….
[16:01] <mozg> part of the reason is that most of the benchmarks are using /dev/zero to get their data
[16:01] <odyssey4me> mozg - sure, but these are indicators and I'm doing apples-to-apples comparisons... same loads with different configurations
[16:01] <mozg> whereas the real life is not ))
[16:02] <odyssey4me> mozg - yeah, that's why I'm using iozone for the real tests... dd just gives an indication
[16:02] <mozg> so, moving forward i've created a 100gb file taken from /dev/urandom
[16:02] <mozg> and used that for the input
[16:02] * diegows (~diegows@ Quit (Ping timeout: 480 seconds)
[16:03] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[16:04] * stacker666 (~stacker66@104.pool85-58-195.dynamic.orange.es) Quit (Ping timeout: 480 seconds)
[16:05] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:05] * ChanServ sets mode +v andreask
[16:05] <jks> anyone knows of an updated nagios plugin for ceph? I'm currently using a modified version of http://bazaar.launchpad.net/~dreamhosters/ceph-nagios-plugin/master/view/head:/check_ceph.pl as it appears to be made for an older, incompatible version of ceph
[16:07] <ebo> is there any documentation for the layout of cephfs metadata?
[16:11] * markbby (~Adium@ has joined #ceph
[16:12] <yanzheng> sage's thesis
[16:13] <tchmnkyz> jks: if you find one let me know i need something newer too
[16:13] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Ping timeout: 480 seconds)
[16:14] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[16:15] <tchmnkyz> jks if you would not mind could you share your modified one
[16:15] <jks> tchmnkyz, I just changed line 66 to: if (($health_status) = ($line =~ /(HEALTH_(OK|WARN|ERR).*).*/)) {
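[Editor's note: jks's fix above is a perl regex change in the nagios plugin. A python sketch of the same parsing idea for `ceph health` output follows; the sample line and function names are illustrative, not from any real plugin.]

```python
import re

# Pull the severity and full message out of a `ceph health` line,
# mirroring the regex jks substituted into check_ceph.pl.
_HEALTH_RE = re.compile(r'(HEALTH_(OK|WARN|ERR).*)')

def parse_health(line):
    """Return (severity, full message), or (None, None) if no match."""
    m = _HEALTH_RE.search(line)
    if not m:
        return None, None
    return m.group(2), m.group(1)

sev, msg = parse_health("HEALTH_WARN 7 pgs stuck unclean")
print(sev)  # WARN
print(msg)  # HEALTH_WARN 7 pgs stuck unclean
```

Note tnt's caveat just below: `ceph health` itself can hang, so a real check should also run the command under a timeout.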
[16:16] <tchmnkyz> k
[16:16] <tnt> One thing that annoys me with calling 'ceph health' is that it can just hang IIRC.
[16:16] <tchmnkyz> i mean it would be nice to be able to get a newer version but i have not found one my self.
[16:17] <tchmnkyz> tnt: yea i have seen that from time to time
[16:17] <jcfischer> running rbd_bench and bonnie simultaneously kills the rbd_bench performance
[16:17] <tnt> jcfischer: no kidding :)
[16:17] <odyssey4me> lol, jcfischer, of course....
[16:17] <jcfischer> how much space does bonnie require? on a 10GB volume it complains about no space left on device :)
[16:17] <tnt> jcfischer: twice the RAM of the machine.
[16:17] <jcfischer> and here I had high hopes that our cluster would hold up
[16:17] <jcfischer> tnx
[16:18] <jcfischer> redoing VM
[16:18] <tnt> so I usually lower the RAM of the VM I run it on :p
[16:18] <mozg> tnt: do you know of the best way to limit ceph resources spent on repair jobs?
[16:18] <jcfischer> redoing volume
[16:18] <mozg> i am having very poor performance if there is repair taking place
[16:19] <tnt> mozg: no. It should already do that. However if you see 'slow request' in the logs, I think it is due to something completely different.
[16:19] <mozg> tnt: thanks
[16:19] <mozg> slow requests are a f*cking b*tch
[16:19] <mozg> no one knows how to resolve them
[16:20] <mozg> i am now trying to isolate the issue
[16:20] <nhm> mozg: might want to look at the recovery QOS section here: http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/
[16:20] <nhm> that's back from the bobtail release.
[16:20] <mozg> as whenever i speak with people here they tend to blame 1) faulty hardware, 2) slow network
[16:20] <mozg> and in my case it is neither
[16:20] <mozg> after that everyone goes quiet
[16:21] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[16:21] <mozg> i have no problem when i am working with a single osd server
[16:21] <mozg> i've been running a bunch of benchmarks and one server can handle them, no probs
[16:21] * zhyan_ (~zhyan@ has joined #ceph
[16:21] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:21] * ChanServ sets mode +v andreask
[16:21] <mozg> however, as soon as ceph has two osd servers i get slow requests
[16:22] <mozg> even when there are no tests running
[16:22] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit ()
[16:22] <mozg> i've left ceph alone for a week
[16:22] <nhm> mozg: does it happen if you setup a pool with no replication and 2 servers?
[16:22] <tnt> mozg: mmm, that's weird. I only get slow requests when starting or stopping an OSD, not "just randomly" ...
[16:22] <mozg> and seen slow requests on two different occasions
[16:22] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:22] * ChanServ sets mode +v andreask
[16:23] <mozg> tnt: starting and stopping osds also sometimes cause slow requests
[16:23] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit ()
[16:23] <mozg> and i am experiencing slow peering as well
[16:23] <nhm> mozg: might be worth looking at the osd admin socket on all of the servers and using the dump_historic_ops command to see what the slowest ops are doing.
[16:23] <tnt> how many PGs do you have ? (and how many OSD)
[16:23] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[16:23] * ChanServ sets mode +v andreask
[16:23] <mozg> i've seen that slow requests also happen just after deep scrub starts
[16:24] <mozg> tnt: does it happen if you setup a pool with no replication and 2 servers? <- i've not done this. i've left default replication settings
[16:24] <mozg> tnt: i've got around 1800 pgs
[16:24] * yanzheng (~zhyan@jfdmzpr06-ext.jf.intel.com) Quit (Ping timeout: 480 seconds)
[16:24] <mozg> and 16 osds
[16:24] <tnt> do you have any kind of instrumentation to track the number of IOps on your hdd, or the # of operations in flight etc .. things like that?
[16:25] <mozg> tnt: i can check iostats
[16:25] <mozg> if that's of any use
[16:25] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[16:25] <tnt> having graphs can help correlate events between multiple servers a bit more easily than just raw numbers.
[16:26] <tnt> where are your mons btw ? on different physical disks ?
[16:28] * dosaboy (~dosaboy@faun.canonical.com) Quit (Read error: Operation timed out)
[16:30] * jeff-YF (~jeffyf@ has joined #ceph
[16:31] <mozg> tnt: i've got 2 osd servers, 3 mds and 3 mons
[16:32] <mozg> osd servers have mon + osd + mds
[16:32] <mozg> the third mon is located on one of the client servers
[16:33] <tnt> the mons generate a lot of IOPS, you really don't want them on the same physical disks as OSD data.
[16:34] <jcfischer> hmm - most of the bonnie numbers come out as +++
[16:34] * dosaboy (~dosaboy@faun.canonical.com) has joined #ceph
[16:35] <mozg> tnt: ah
[16:35] <mozg> well, the osds are only used for data
[16:35] <jcfischer> ah: use -n param
[16:35] <mozg> even though they are on the same physical server
[16:35] <mozg> the mons data is on an ssd disk
[16:36] <mozg> the server's root partition is on an ssd disk
[16:36] <tnt> mozg: ok. if they're on different physical disks even on the same server, it's fine.
[16:37] <mozg> tnt: is there a problem for ceph to run osds on differently spec-ed hardware?
[16:38] <mozg> for instance, i've got a server with 12 core latest intel
[16:38] <mozg> as well as an old quad core xeon
[16:38] <mozg> could this cause issues with slow requests?
[16:39] <odyssey4me> why would writes perform better than reads in a kvm domain?
[16:39] <mozg> rbd_caching
[16:39] <mozg> is the only reason i can think of
[16:40] <tnt> if for read you wait for the response before sending the next 'read' then the added latency can kill throughput.
[16:40] <odyssey4me> but when I use the kernel module to mount a disk and test against that the performance is the opposite way around - read is great and write sucks
[16:40] <tnt> check that the io scheduler is noop
[16:41] <odyssey4me> tnt - I've set it to deadline... I'll try noop as well
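[Editor's note: tnt's point above, that waiting for each read response before issuing the next one lets latency kill throughput, is easy to model. A back-of-envelope sketch; the latency and block-size numbers are illustrative assumptions, not measurements from this cluster.]

```python
# Strictly serial reads: throughput is capped by per-request latency,
# independent of how fast the underlying disks are.

def sync_read_throughput_mb(block_kb, latency_ms):
    """MB/s for serial reads of block_kb with latency_ms per request
    (transfer time ignored for simplicity)."""
    reads_per_sec = 1000.0 / latency_ms
    return reads_per_sec * block_kb / 1024.0

# 4 KB serial reads at 2 ms round-trip latency: under 2 MB/s.
print(sync_read_throughput_mb(4, 2.0))     # 1.953125
# Larger requests amortize the latency: 1 MB reads, same latency.
print(sync_read_throughput_mb(1024, 2.0))  # 500.0
```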
[16:42] <tchmnkyz> dmick: we got training coming in aug 5th and the trainer is also giving us 2 days consulting time. on there too.
[16:43] <tchmnkyz> i can't get the boss to go for the 146k it would cost to get fulltime support on the cluster yet. once i have it bringing in more monthly he said we can look into it then
[16:43] <jks> tnt: hmm, mons generate lots of iops? - is it important to have them on separate disks?
[16:43] <odyssey4me> jks - the docs recommend it
[16:43] <jks> tnt: I asked this some time ago in here, but was told that mons are not important in that sense and could easily share disks with the rest
[16:44] <tnt> jks: yeah ... since the switch to leveldb that's not so much the case.
[16:44] <jks> tnt, so with leveldb it is more important than before to have them on separate disks?
[16:45] <tnt> jks: at least not same disk as OSDs.
[16:45] <tnt> I put them on the same disk as the OS because the OS isn't really writing anyway.
[16:45] <jks> I have mons on the same servers as osds... 4 ordinary hard drives and 1 SSD for the journal
[16:45] <odyssey4me> mozg - a write test using /dev/urandom yielded the same results for noop as deadline
[16:45] <jks> would it be more optimal to place the mon store on the SSD then?
[16:45] <tnt> odyssey4me: are you sure urandom isn't the bottleneck :)
[16:46] <odyssey4me> tnt - using iozone now instead
[16:46] <odyssey4me> tnt - the numbers so far are pretty much the same
[16:47] <jcfischer> also running iozone at the moment - what numbers should I expect?
[16:48] <odyssey4me> what bothers me is this - while running the tests I'm watching the disks using iostat... when the write tests happen I see activity on the disks, but when the read tests are running the disks are idle
[16:48] <jks> when evaluating mon performance... would the system's performance be relative to the slowest mon or the fastest mon?
[16:48] <odyssey4me> jcfischer - a 4k reclen on a 5gb file gives me 2.8 Mbytes/s random read and 17.5 Mbytes/s random write; whereas a 512k reclen on the same sized file gives me 75 Mbytes/s random read and 224 Mbytes/s random write
[16:49] <jcfischer> still waiting for the numbers...
[16:50] <nhm> jks: lots of mon changes recently. In recent cuttlefish and dumpling, hopefully the mon shouldn't be very heavy for most clusters.
[16:51] <mozg> odyssey4me: i was having a bunch of reliability issues with /dev/zero benchmarks as the data compresses too well
[16:51] <jks> nhm, okay, just wondering if this was some place where it was possible to gain a significant amount of performance... I have 5 mons of which 2 have their store on SSDs and 3 have them on rotating drives
[16:51] <mozg> some ssds are compressing zeros and give you crazy performance figures
[16:51] * infinitytrapdoor (~infinityt@ Quit (Ping timeout: 480 seconds)
[16:52] <mozg> jks: how's your cloudstack + ceph project going? did you manage to get the PoC done?
[16:52] <nhm> jks: Any particular reason to have 5 vs 3?
[16:52] <jks> mozg: I abandoned it - it seems I had to wait for the next version to have something that I would be able to put in production
[16:52] <odyssey4me> it is still flummoxing me that read is slower than write in a kvm guest
[16:52] <mozg> odyssey4me: yeah, the 4k performance should be addressed in ceph
[16:52] <jks> mozg: I setup OpenNebula instead - in comparison very easy to get going
[16:53] <mozg> it is not ideal by far
[16:53] <jks> nhm: only reason was that I would be able to live with 2 servers crashing... are there major disadvantages to having 5?
[16:53] <mozg> jks: is this the openstack project?
[16:53] <nhm> odyssey4me: have you tweaked readahead at all?
[16:53] <jks> mozg: no, haven't got anything to do with openstack
[16:53] <nhm> jks: not necessarily, though it's more mon->mon communication.
[16:53] <mozg> jks: i thought that opennebula is part of openstack
[16:53] <mozg> no?
[16:54] <jks> mozg, they're not related at all, no
[16:54] <nhm> jks: IE if there are performance issues, it may be worth trying with fewer mons and see if it affects things.
[16:54] <nhm> jks: what version of ceph are you using?
[16:55] <jks> nhm, I did a brief performance test with 3 mons and then with 5 mons, I couldn't see much difference in qemu/kvm performance (but only tested with a few virtual machines running)
[16:55] <jks> nhm: 0.61.4
[16:55] <jks> nhm, I wouldn't say there's a performance "issue" as such, as I don't know what to expect... but I would always like better performance of course ;-)
[16:56] <jks> nhm: I'm getting approx. 30 MB/s sequential write speed and 60 MB/s sequential read speed inside qemu/kvm
[16:56] <odyssey4me> nhm - no, tell me more?
[16:57] <nhm> jks: there were definitely some issues with leveldb in cuttlefish prior to 0.61.4 that could affect performance. I think we've got most of those worked out on 0.61.4 and a couple of additional improvements in the dumpling series.
[16:58] <jks> looking forward to dumpling then :-)
[16:58] <nhm> jks: hopefully at this point the mons shouldn't affect performance at all except in very specific circumstances.
[16:58] <jks> nhm, super! I'll keep my 5 mons then :-)
[16:58] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) Quit (Quit: smiley)
[16:59] <jks> I'm very pleased with the setup right now from a reliability point of view... replication level 3 and 5 mons, and it has been a long time since I saw software errors
[16:59] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[16:59] <nhm> odyssey4me: I'm actually investigating KVM read performance issues now. Kernel RBD is faster. It appears that increasing readahead both on the OSDs and on the clients helps, but not enough to get it quite up to Kernel RBD performance.
[17:00] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) has joined #ceph
[17:00] <nhm> odyssey4me: I only tested up to 2048, but that seemed to not really hurt small random IOPs (surprisingly) but did increase large read performance.
[17:00] * s2r2 (~s2r2@ has joined #ceph
[17:02] * s2r2 (~s2r2@ Quit ()
[17:02] <jcfischer> I'm getting totally weird numbers on iozone (or my calculations are totally bogus): https://docs.google.com/spreadsheet/ccc?key=0AsjockBApInDdC10SWw4Y09HbUpsQWdZQ2RlTlhibEE#gid=1 (sheet 2)
[17:03] * s2r2 (~s2r2@ has joined #ceph
[17:03] <jcfischer> 12 MB/s random write, 2820 MB/s random read (4k reclen)
[17:04] * dignus (~dignus@bastion.jkit.nl) has joined #ceph
[17:04] <odyssey4me> jcfischer - wtf... the output from iozone is in Kbytes/s - so to get Mbytes/s you divide by 1024
[17:05] <jcfischer> I did
[17:05] <jcfischer> check Cell c18
[17:06] * dignus (~dignus@bastion.jkit.nl) has left #ceph
[17:09] <jcfischer> the only sane numbers seem to be write (around 110 MB/s) and random write (12 - 58 MB/s) - the reads are just crazy (up to 6617 MB/s on reread with reclen 16)
[17:10] <jcfischer> or there is some badass caching at work
[17:13] <nhm> jcfischer: reads will be crazy if you have a file that wasn't completely written to.
[17:14] <nhm> jcfischer: Make sure that any reads you are doing come from fully populated files.
[17:14] <tnt> mmm, my poor mons grew to 1.5G ram ... wtf happened.
[17:15] <nhm> tnt: mons were growing in RAM quite a bit back in the early cuttlefish era due to leveldb not keeping up with compaction.
[17:15] <nhm> tnt: that should be mostly taken care of in recent cuttlefish, and should be even better in the dumpling series.
[17:15] <nhm> But if you are seeing mon memory growth, it's probably leveldb related.
[17:16] <tnt> nhm: http://i.imgur.com/PcFghVf.png
[17:16] <tnt> nhm: it was working fine and I had to restart the machine last night. And it looks like it started then.
[17:16] <nhm> tnt: what version of ceph?
[17:16] <jcfischer> nhm: I'm using iozone - so not sure how that is handled
[17:16] <tnt> 0.61.4
[17:17] <odyssey4me> jcfischer - your numbers are like a total opposite to mine... very odd...
[17:17] <odyssey4me> jcfischer - is that inside a vm or on the host?
[17:17] <nhm> jcfischer: I haven't used iozone in a while, don't remember exactly how it does it. You might want to look for an option to preallocate the file before doing the reads.
[17:17] <jcfischer> that is inside a vm
[17:18] <jcfischer> with this command: ./iozone -a -y 4k -q 512k -s 5g -i 0 -i 1 -i 2 -R -f /mnt/vol/iozone.dat (as suggested above)
[17:18] <nhm> tnt: do you have perf installed? Might be worth doing a perf top and see if it's doing a bunch of leveldb stuff.
[17:19] <tnt> nhm: I'm not sure what 'perf' is. But I have some data collected. What did you want to look at ?
[17:19] <nhm> tnt: perf is a linux kernel profiler. If you are using ubuntu with a stock ubuntu kernel, you can install it right from the repository.
[17:20] * amatter2013 (~oftc-webi@ has joined #ceph
[17:20] <nhm> tnt: Can you email it to me? I'm trying to finish something up and then have to leave for a couple of hours, but will be around later.
[17:20] <tnt> email what ?
[17:20] <tnt> I restarted them now and they seem to be back to normal.
[17:21] <nhm> tnt: collected data, though perhaps it's not as much as I was worried about. :)
[17:21] <nhm> tnt: Ok. If it starts happening again, it would be good to know if it's leveldb compaction.
[17:22] <tnt> nhm: Ah well ... it's collectd data stored in a graphite DB, not really readable 'by hand' :p
[17:22] * ScOut3R (~ScOut3R@catv-89-133-17-71.catv.broadband.hu) Quit (Ping timeout: 480 seconds)
[17:22] * stacker666 (~stacker66@ has joined #ceph
[17:22] <odyssey4me> nhm - I don't see an option to preallocate iozone's files
[17:22] <amatter2013> Morning! I have a new ceph installation on a single machine with 16x 2TB drives, an OS disk and a journal SSD. The drives are connected via an LSI SAS card. However, using CephFS, the best write throughput I can get on a pool with 2x redundancy and 4096 pgs is about 45MB/sec which seems surprisingly low. Any pointers on how I can isolate the bottleneck?
[17:23] <nhm> odyssey4me: any idea if before the read, it's spending a good amount of time writing the file out?
[17:23] <mozg> nhm: what are the recommended readahead settings that I should use to increase the read performance?
[17:23] <odyssey4me> nhm - it does write, then rewrite, then read
[17:23] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Quit: toMeloos)
[17:23] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) has joined #ceph
[17:23] <nhm> amatter2013: 1 SSD journal for 16 drives will limit your write throughput since every write goes through the journal disk(s).
[17:24] <nhm> mozg: Working that out. :)
[17:24] <nhm> mozg: but anywhere from 512k-2048k seems like a good bet so far.
[17:24] <nhm> mozg: The higher you go, the more you risk hurting small random reads.
[17:24] <nhm> But the better you will make large sequential reads.
[17:25] <nhm> well, probably sequential reads in general, but so far testing seems to indicate large reads more.
[17:25] <nhm> odyssey4me: do you know if it writes the entire file out?
[17:25] <mozg> thanks
[17:25] <nhm> odyssey4me: like, no time limit or anything?
[17:26] <odyssey4me> nhm - my readahead appears to be sitting at 256 (assuming that this is right? blockdev --getra /dev/sdb )
[17:26] <odyssey4me> nhm - it has a size limit
[17:26] <nhm> odyssey4me: I always just check /sys/block/<dev>/queue/read_ahead_kb
[17:26] <mozg> nhm: by the way, if ceph is using 4M block sizes, doesn't it read 4mb at a time?
[17:26] <amatter2013> nhm: the SSD performance is 123mb/s and a disk is 40mb/s, so presumably I need one SSD per three disks?
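[Editor's note: amatter2013's sizing arithmetic, sketched out. Since every write goes through the journal first, one journal SSD can feed roughly (SSD throughput / disk throughput) OSD disks before the journal becomes the bottleneck; the figures are the ones quoted in the log.]

```python
# Rough journal sizing: how many OSD disks one SSD journal can keep fed.

def disks_per_journal_ssd(ssd_mb_s, disk_mb_s):
    """Whole number of OSD disks a journal SSD can sustain at full write rate."""
    return int(ssd_mb_s // disk_mb_s)

# amatter2013's numbers: 123 MB/s SSD, 40 MB/s per disk.
print(disks_per_journal_ssd(123, 40))  # 3 -> one journal SSD per ~3 OSDs
```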
[17:26] * gregaf1 (~Adium@cpe-76-174-249-52.socal.res.rr.com) Quit (Quit: Leaving.)
[17:27] <odyssey4me> nhm - interesting, that's at 128
[17:27] <odyssey4me> how do I set it?
[17:27] <nhm> mozg: This is set at the block layer (both on the client RBD volume and on the volume under the OSD). Ceph can request 4MB reads, but it doesn't have any control over what happens underneath it.
[17:28] <nhm> odyssey4me: echo <foo> | sudo tee read_ahead_kb
[17:28] <nhm> odyssey4me: I wouldn't recommend using foo. :)
[17:28] <mozg> nhm: so, the readahead settings that you've been talking about,
[17:28] <mozg> is this for the block device
[17:28] * zhyan_ (~zhyan@ Quit (Ping timeout: 480 seconds)
[17:28] <mozg> or the osd setting
[17:28] <mozg> ?
[17:28] <nhm> mozg: yes. It seems that increasing it both at the block device layer under the OSD, and on the client side RBD volume helps for QEMU/KVM, but I don't have all the data yet,
[17:29] <nhm> mozg: this is all outside of Ceph
[17:29] <nhm> in /sys/block/<dev>/queue/read_ahead_kb
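[Editor's note: a small helper for the read_ahead_kb tuning nhm describes. Writing to /sys/block/<dev>/queue/read_ahead_kb normally needs root (hence the `echo ... | sudo tee` above); taking the path as a parameter is an assumption made here so the sketch can be exercised against any file.]

```python
# Get/set the block-layer readahead via the sysfs file nhm mentions.

def get_readahead_kb(path):
    """Read the current readahead value (in KB) from a sysfs-style file."""
    with open(path) as f:
        return int(f.read().strip())

def set_readahead_kb(path, kb):
    """Write a new readahead value (in KB) to a sysfs-style file."""
    with open(path, "w") as f:
        f.write(str(kb))

# e.g. (as root, device name assumed):
# set_readahead_kb("/sys/block/vdb/queue/read_ahead_kb", 2048)
```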
[17:29] <mozg> so, the ceph settings are not to be changed to change the readahead values?
[17:29] <nhm> Interestingly it has very minimal effects on kernel RBD so far.
[17:29] * jlogan (~Thunderbi@2600:c00:3010:1:1::40) has joined #ceph
[17:30] <jcfischer> doing another iozone run with the -I option (use direct i/o - will be interesting to see if that solves the 6 GB/s read)
[17:30] <nhm> mozg: nope, unless you are using the CephFS kernel client, where we do have some readahead options for the client.
[17:31] <nhm> jcfischer: potentially if it's only 5GB of data and you haven't flushed cache and have lots of memory, you could be doing all the reads from pagecache.
[17:31] <odyssey4me> nhm - thought as much, what setting has given you good results?
[17:31] <mozg> jcfischer: I would check the io stats of the osds while doing these tests
[17:31] <mozg> even if you specify direct io on the client side
[17:31] <nhm> odyssey4me: don't have all of the data yet, but somewhere between 512k-2048k seems to be a good bet.
[17:31] <mozg> it doesn't mean that the data would be read from the disks on the server side
[17:31] <jcfischer> mozg - that's difficult with 64 OSDs
[17:31] <jcfischer> :)
[17:31] <mozg> it could serve it from cache
[17:32] <mozg> and most likely it does
[17:32] <nhm> odyssey4me: I'll have some detailed results out in a week or two.
[17:32] <odyssey4me> nhm - is that value in /sys/block/sdc/queue/read_ahead_kb bytes or kb?
[17:32] <mozg> so, you should drop caches on the server side as well
[17:32] <mozg> to get the true uncached performance figures
[17:32] <nhm> odyssey4me: KB, so default is 128k
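nhm's read_ahead_kb tuning above, as a small sketch. The helper just writes a KB value to a read_ahead_kb-style file so it can be exercised without root; the device name sdb is hypothetical:

```shell
# Write a readahead value (in KB) to a read_ahead_kb-style sysfs file.
set_readahead() {
    # $1 = path to a read_ahead_kb file, $2 = value in KB
    echo "$2" > "$1"
}

# On a real host (needs root), for the disk under an OSD or a client RBD device:
#   set_readahead /sys/block/sdb/queue/read_ahead_kb 2048
#   cat /sys/block/sdb/queue/read_ahead_kb    # default is typically 128
```

Per nhm, somewhere between 512k and 2048k seems to be a good bet, applied both to the device under the OSD and to the client-side RBD volume.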
[17:32] <jcfischer> there's another option for iozone: unmount/mount the filesystem - that should clear the cache
[17:32] <mozg> nhm: this is what i have
[17:32] <jcfischer> will try that next
[17:33] <mozg> jcfischer: this is again on the client side i believe
[17:33] <jcfischer> all inside the vm, yes
[17:33] <mozg> i've done some testing from the vm
[17:33] <nhm> jcfischer: I always do my fio runs independently and then sync/drop_caches on the OSDs and clients before doing the read tests.
[17:33] <mozg> and this works perfectly well in a sense that it doesn't use vm cache
[17:33] <mozg> and reads the data from ceph
[17:33] <mozg> however
[17:33] <nhm> ok, gotta run guys. bbl
[17:34] <mozg> i've noticed that during the tests all my osds are sitting idle
[17:34] <mozg> yet ceph -w is showing 900+mb/s figures
[17:34] <jcfischer> is there a way to get a global view of all osds?
[17:34] <mozg> realised that it's reading everything from its own cache
[17:34] <mozg> you can try ceph -w
[17:34] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) has joined #ceph
[17:34] <jcfischer> I have that running
[17:35] <mozg> this will show you the ceph performance
[17:35] <mozg> or you could use something like iotop
[17:35] <odyssey4me> cheers nhm
[17:35] <mozg> this will show aggregate ios
[17:35] <jcfischer> 161 TB / 174 TB avail; 931KB/s wr, 232op/s
[17:35] <odyssey4me> mozg - yeah, I noticed that too
[17:35] <mozg> this is okay for the reads
[17:35] <mozg> but if you have writes on an ssd
[17:36] <mozg> you should divide the iotop write figure by 2 to get a good picture of the write performance
[17:36] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) has joined #ceph
[17:36] <mozg> so, what i do
[17:36] <mozg> when testing for real throughput
[17:36] <mozg> is i do sync
[17:36] <odyssey4me> mozg - like now I've got this is ceph -s: 15206KB/s rd, 1710op/s
[17:36] <mozg> and drop caches on both the client and the storage(s)
[17:36] <mozg> and run the tests
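mozg's benchmark prep, sketched: sync, then drop the page cache on the client AND every OSD host before running the read test. The target file is parameterized only so the helper can be exercised without root; on a real host it is /proc/sys/vm/drop_caches, and the OSD hostnames shown are hypothetical:

```shell
# Flush dirty pages and drop the page cache via a drop_caches-style file.
drop_caches() {
    # $1 = drop_caches file (the real one: /proc/sys/vm/drop_caches)
    sync
    echo 3 > "$1"
}

# On the client and each OSD host (needs root), e.g.:
#   drop_caches /proc/sys/vm/drop_caches
#   ssh osd-host-1 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'
```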
[17:37] * mschiff_ (~mschiff@p4FD7F468.dip0.t-ipconnect.de) Quit (Remote host closed the connection)
[17:37] <mozg> yeah - reading at 15mb/s
[17:37] <mozg> and 1.7k iops
[17:37] * ebo (~ebo@icg1104.icg.kfa-juelich.de) Quit (Quit: Verlassend)
[17:38] * yehudasa__ (~yehudasa@2602:306:330b:1410:6dbd:dec3:7ee7:8132) Quit (Ping timeout: 480 seconds)
[17:39] <jcfischer> I have never seen these kind of numbers on our cluster
[17:39] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Ping timeout: 480 seconds)
[17:40] <odyssey4me> lol 18658KB/s wr, 4050op/s
[17:40] <odyssey4me> but what I find frustrating is that the performance shown via ceph -s is not reflected in the vm
[17:41] <mozg> i found it pretty close
[17:41] <mozg> actually
[17:41] <mozg> ceph -w is giving me performance figures
[17:41] <mozg> which are about 2-3 seconds delayed
[17:41] <mozg> to what I am seeing in the vm
[17:41] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[17:41] <mozg> at least i've noticed that with large block sizes
[17:41] <mozg> like 4M
[17:42] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) Quit (Read error: Connection reset by peer)
[17:42] <mozg> lol 18658KB/s wr, 4050op/s <- that's about 4K block size
[17:42] <odyssey4me> mozg - correct, 4k block size on host and guest
[17:42] <mozg> if you divide speed by ops
[17:43] <odyssey4me> where is it best to increase the block size?
[17:43] <mozg> what do you mean?
[17:43] * sagelap (~sage@ Quit (Read error: Operation timed out)
[17:43] <mozg> i would test all common block sizes if I were you
[17:43] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:43] <mozg> just to see how ceph performs
[17:44] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) has joined #ceph
[17:44] <odyssey4me> mozg - you've mentioned using a 4M block size... is that on the osd, or in a rados config somewhere?
[17:44] <mozg> i found that large block sizes are handled very well by ceph
[17:44] <Kioob`Taff> Hi
[17:44] <mozg> but with 4k there is a big performance hit
[17:44] <odyssey4me> mozg - yes, it would appear that performance improves massively when I hit larger record sizes in the iostat tests
[17:44] <mozg> ah, the 4M block sizes are on the vm side. in the test itself
[17:44] * matt__ (~matt@220-245-1-152.static.tpgi.com.au) Quit (Ping timeout: 480 seconds)
[17:44] <Kioob`Taff> when rebuilding the Ceph 0.61.4 package for Debian Wheezy, I see this warning :
[17:44] <Kioob`Taff> dpkg-shlibdeps: warning: package could avoid a useless dependency if debian/librados2/usr/lib/librados.so.2.0.0 was not linked against libleveldb.so.1 (it uses none of the library's symbols)
[17:44] <mozg> like with dd i would use bs=4M
[17:45] <mozg> or bs=4K
[17:45] <Kioob`Taff> maybe this link can be removed
[17:45] <mozg> etc
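mozg's dd examples and the speed/ops arithmetic, sketched out. The mount point is hypothetical; the numbers are taken from odyssey4me's "18658KB/s wr, 4050op/s" ceph -s line:

```shell
# dd tests at different block sizes, run inside the VM against a scratch
# file (shown commented; /mnt/test is a hypothetical mount point):
#   dd if=/dev/zero of=/mnt/test/ddfile bs=4M count=256 oflag=direct
#   dd if=/dev/zero of=/mnt/test/ddfile bs=4K count=65536 oflag=direct

# Back-of-envelope op-size check: average op size (KB) = throughput / ops.
kbps=18658
ops=4050
echo "avg op size: $((kbps / ops)) KB"
```

Integer division gives 4 here, i.e. roughly 4K writes, which is mozg's point about dividing speed by ops.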
[17:45] * sleinen (~Adium@2001:620:0:46:10f3:b5db:398f:c1d4) Quit (Read error: Connection reset by peer)
[17:45] * sleinen (~Adium@2001:620:0:46:10f3:b5db:398f:c1d4) has joined #ceph
[17:46] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[17:49] * sagelap (~sage@2600:1012:b00b:7e19:f151:800a:717b:f565) has joined #ceph
[17:50] * sagelap (~sage@2600:1012:b00b:7e19:f151:800a:717b:f565) Quit ()
[17:57] * tnt (~tnt@ has joined #ceph
[18:01] * grepory (~Adium@50-115-70-146.static-ip.telepacific.net) Quit (Quit: Leaving.)
[18:01] * sleinen (~Adium@2001:620:0:46:10f3:b5db:398f:c1d4) Quit (Ping timeout: 480 seconds)
[18:04] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[18:05] * s2r2 (~s2r2@ has joined #ceph
[18:08] * joshd1 (~jdurgin@2602:306:c5db:310:c5b8:40bc:80f3:80dd) has joined #ceph
[18:11] * toMeloos (~tom@53545693.cm-6-5b.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[18:21] * danieagle (~Daniel@ has joined #ceph
[18:33] * toMeloos (~tom@5356AAC5.cm-6-7c.dynamic.ziggo.nl) has joined #ceph
[18:33] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[18:33] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[18:34] * ebo^ (~ebo@koln-4db4af6f.pool.mediaWays.net) has joined #ceph
[18:35] <ebo^> anyone here who can help me understand the cephfs metadata format?
[18:36] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has left #ceph
[18:36] * wrencsok (~wrencsok@wsip-174-79-34-244.ph.ph.cox.net) has joined #ceph
[18:36] * markbby (~Adium@ Quit (Quit: Leaving.)
[18:37] * markbby (~Adium@ has joined #ceph
[18:37] * toMeloos (~tom@5356AAC5.cm-6-7c.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[18:37] * toMeloos (~tom@5356AAC5.cm-6-7c.dynamic.ziggo.nl) has joined #ceph
[18:37] <gregaf> ebo^: what are you interested int?
[18:38] <gregaf> s/int/in
[18:39] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[18:39] <ebo^> i want to make a program that walks the directory tree by hand
[18:39] * leseb (~Adium@2a04:2500:0:d00:1846:c65b:7ba3:e1ea) Quit (Quit: Leaving.)
[18:39] * hybrid512 (~walid@106-171-static.pacwan.net) Quit (Quit: Leaving.)
[18:40] <ebo^> i have a hard time understanding the serialization code in mds
[18:41] * joshd1 (~jdurgin@2602:306:c5db:310:c5b8:40bc:80f3:80dd) Quit (Quit: Leaving.)
[18:41] <gregaf> "walks the directory tree by hand"? you want to bypass the MDS entirely and look at raw rados? or you just want to send requests to the MDS to look at the tree?
[18:42] <ebo^> bypass
[18:42] <ebo^> i have problems accessing a specific file in my cephfs
[18:42] * yehudasa__ (~yehudasa@2607:f298:a:607:ea03:9aff:fe98:e8ff) has joined #ceph
[18:42] <gregaf> that's not really going to work; most of the metadata lives in per-directory objects but there are also MDS journals which can be authoritative
[18:43] <sagewk> if you know the ino number and size you can go grab the objects directly, if that's what you mean
[18:43] <sagewk> for the problem file
[18:43] <mgalkiewicz> hi guys. Is there any way to make sure that rbd cache is enabled (except checking configs options)? I am using librbd with kvm.
[18:46] * toMeloos (~tom@5356AAC5.cm-6-7c.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[18:46] * markbby (~Adium@ Quit (Quit: Leaving.)
[18:46] * markbby (~Adium@ has joined #ceph
[18:49] <ebo^> i have a directory with two files i can not access at all. other files in the directory work without problems
[18:49] <mozg> odyssey4me: the trouble with using tests with 4M block sizes is that you rarely see these types of reads/writes in real life on a virtual machine
[18:50] <mozg> unless your vms are dealing with large seq data streams you are likely to see small block size requests
[18:50] <mozg> correct me if I am wrong here
[18:50] <mozg> and it seems that ceph performance suffers a lot when it comes to small block size reads/writes
[18:51] <mozg> not sure if this is an addressable issue or simply a ceph design fault
[18:51] <ebo^> i need fsck
[18:54] <sagewk> mgalkiewicz: if you enable the admin socket for the librbd (admin socket = /var/run/ceph/ceph-$name.$pid.asok usually) you can query that using the ceph cli tool (ceph --admin-daemon $path config show) and see what config options are active
[18:54] <sagewk> and do various other stuff
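sagewk's admin-socket recipe above, as a hedged config sketch (the socket path and query command are the ones he gives; where exactly the fragment goes depends on which client name your kvm instances use):

```ini
; fragment of ceph.conf on the client host
[client]
    admin socket = /var/run/ceph/ceph-$name.$pid.asok
```

Then, on the client host, something like `ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config show` (the pid in the socket name is illustrative) shows which config options are active.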
[18:55] * bergerx_ (~bekir@ Quit (Quit: Leaving.)
[18:55] <mozg> does anyone know if ceph has released the centos repo for qemu and librbd?
[18:55] <gregaf> ebo^: do you have any mds or client logs from attempts to access those files?
[18:55] <mozg> i've heard it has been in to do list
[18:55] <mozg> not sure if it's out yet
[18:56] <mgalkiewicz> sagewk: ok but how to enable such socket through libvirt or kvm?
[18:56] * s2r2 (~s2r2@ Quit (Quit: s2r2)
[18:57] <mgalkiewicz> I have already used admin sockets for mon and osd
[18:57] <sagewk> put it in the relevant section of ceph.conf, usually
[18:57] <mgalkiewicz> so [client] I guess?
[18:58] <sagewk> yeah, or client.something if you are using something other than client.admin for your kvm instances
[18:58] <sagewk> that way things like the ceph and rbd clis won't try to set up the sockets.
[18:58] * sjusthm (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[18:59] <sagewk> iirc there is also a way to pass random options via the kvm command line, but i forget..
[18:59] <mgalkiewicz> I am using something other than client.admin; I just thought that [client] applies to all clients
[18:59] <mozg> What does this value mean? "client_cache_size": "16384",
[18:59] <mozg> 16MB for the client side cache?
[19:00] <mgalkiewicz> sagewk: so if I understand correctly all I need is admin socket in [client.my_client] section and during booting the machine kvm will figure this out and set appropriate socket?
[19:00] <cjh_> is there a typo in the .66 release notes on the blog? it says pg log rewrites are *not* vastly more efficient
[19:01] <cjh_> shouldn't that say *now* ?
[19:08] * Tamil (~tamil@ has joined #ceph
[19:08] * dosaboy__ (~dosaboy@faun.canonical.com) has joined #ceph
[19:10] * AndroUser (~androirc@m950436d0.tmodns.net) has joined #ceph
[19:11] * mozg (~andrei@host217-46-236-49.in-addr.btopenworld.com) Quit (Ping timeout: 480 seconds)
[19:13] <loicd> for the record : the pad of tonight Ceph meetup in Paris http://pad.ceph.com/p/paris-meetup-2013-07-15
[19:13] * AndroUser is now known as dmick1
[19:13] <loicd> ( in french )
[19:13] <gregaf> cjh_: I think so — sagewk, can you confirm/fix?
[19:14] <sagewk> whoop i'll fix
[19:14] <sagewk> tnx
[19:14] <cjh_> no problem :)
[19:15] * sjustlaptop (~sam@24-205-35-233.dhcp.gldl.ca.charter.com) has joined #ceph
[19:15] * t4nk888 (~b75211e9@webuser.thegrebs.com) has joined #ceph
[19:16] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:16] * ChanServ sets mode +v andreask
[19:16] <t4nk888> hi
[19:16] <mgalkiewicz> sagewk: unfortunately it does not seem to work
[19:16] * dosaboy (~dosaboy@faun.canonical.com) Quit (Ping timeout: 480 seconds)
[19:16] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[19:16] <t4nk888> Is there a way for me to find the objects created by user on ceph nodes?
[19:16] * dmick1 (~androirc@m950436d0.tmodns.net) Quit (Read error: Connection reset by peer)
[19:20] <sagewk> it might be necessary to explicitly reference the config file in the kvm line?
[19:20] <sagewk> i forget, there is some trick there
[19:20] <sagewk> mgalkiewicz: ^
[19:21] <t4nk888> andreask: you there?
[19:21] <t4nk888> what are you guys using for s3 explorer?
[19:21] <mgalkiewicz> is it necessary for creating admin socket, enabling caching or both? I am using openstack so I am a little bit limited by its feature set
[19:22] <mgalkiewicz> to configure libvirt which manages kvm but there is always some room for monkey patching:)
[19:22] <t4nk888> I tried to use DragonDisk but it always fails for each operation with the error "2 operaton list" host no found
[19:22] <mgalkiewicz> could you point out some examples in docs?
[19:26] * t4nk888 (~b75211e9@webuser.thegrebs.com) Quit (Quit: TheGrebs.com CGI:IRC (EOF))
[19:31] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[19:31] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) has joined #ceph
[19:31] * xmltok_ (~xmltok@cpe-76-170-26-114.socal.res.rr.com) Quit ()
[19:39] * rturk-away is now known as rturk
[19:40] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[19:43] * jcsp (~john@82-71-55-202.dsl.in-addr.zen.co.uk) has joined #ceph
[19:46] <ebo^> trying to stat /lxa/120918_K03_F16/20120918_055800.cub mds logs: https://gigamove.rz.rwth-aachen.de/download/id/8AYisv5Vrdvpk6
[19:46] * rturk is now known as rturk-away
[19:47] * LeaChim (~LeaChim@ Quit (Ping timeout: 480 seconds)
[19:51] <gregaf> ebo^: this looks like one of a sequence of bugs that are usually caused by misbehaving clients
[19:51] <gregaf> how do you have the system mounted?
[19:56] * LeaChim (~LeaChim@ has joined #ceph
[19:57] <amatter2013> I have 16 drives on an LSI SAS card and one SSD for the journals. Would I get better performance in this case by not using the SSD journal (a potential bottleneck) and using on-disk journal partitions instead?
[19:57] <sagewk> dmick: trivial patch in wip-cli when you have a minute
[19:58] <gregaf> amatter2013: depends on how good your card and your drives are in various combinations
[19:58] <sjusthm> paravoid: are you around?
[19:58] <gregaf> 45MB/s or whatever you were seeing seemed oddly slow, but maybe the ssd can't handle that many streams or something
[19:58] <gregaf> you should test it out!
[19:59] <amatter2013> gregaf: what's the best way to change the journal? use ceph-deploy to destroy the osd then re-create it?
[20:00] <gregaf> I believe the docs describe it, you can do a journal flush, then change the journal location on an osd-by-osd basis to avoid moving/recovering all the data
[20:01] <gregaf> or you can use destroy and create
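The journal-flush route gregaf describes, as a hedged per-OSD sketch (Cuttlefish-era commands; the osd id and new journal target are hypothetical, and the ceph commands are shown commented so the sketch is safe to read and paste):

```shell
# Per-OSD journal move without recovering all the data.
OSD=12
NEW_JOURNAL=/dev/sdb2   # hypothetical partition on the data disk

# 1. stop the OSD:
#      service ceph stop osd.$OSD
# 2. drain the old journal safely:
#      ceph-osd -i $OSD --flush-journal
# 3. repoint the journal link in the osd data dir and create the new one:
#      ln -sf "$NEW_JOURNAL" /var/lib/ceph/osd/ceph-$OSD/journal
#      ceph-osd -i $OSD --mkjournal
# 4. restart:
#      service ceph start osd.$OSD
echo "would move journal of osd.$OSD to $NEW_JOURNAL"
```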
[20:02] <amatter2013> gregaf: thanks
[20:06] <amatter2013> oops: ceph-deploy osd destroy => "subcommand not implemented" no worries
[20:08] <gregaf> erm, I think it is though, how old is your version of ceph-deploy?
[20:08] <gregaf> Tamil would know for sure
[20:09] <Tamil> amatter2013,gregaf: osd destroy is not implemented yet
[20:09] <gregaf> oh, k
[20:10] <gregaf> thought I'd seen it in use somewhere
[20:10] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) Quit (Remote host closed the connection)
[20:10] * dosaboy__ (~dosaboy@faun.canonical.com) Quit (Quit: leaving)
[20:11] <Tamil> gregaf: checking
[20:11] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[20:11] <amatter2013> Tamil,gregaf: thanks
[20:11] <gregaf> no, I'm sure you're right Tamil
[20:11] <Tamil> gregaf: yeah , not there yet
[20:12] <amatter2013> can I just destroy a whole cluster with ceph-deploy? probably easier than removing and re-adding 16 osds
[20:13] * grepory (~Adium@ has joined #ceph
[20:13] * grepory (~Adium@ Quit ()
[20:13] * sleinen1 (~Adium@2001:620:0:25:c6d:82ee:ed15:40f6) has joined #ceph
[20:13] <Tamil> amatter2013: yes, do a purge followed by purgedata
[20:14] * nwat (~oftc-webi@eduroam-251-132.ucsc.edu) has joined #ceph
[20:14] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Read error: Operation timed out)
[20:14] * s2r2 (~s2r2@g227004088.adsl.alicedsl.de) has joined #ceph
[20:16] * sleinen1 (~Adium@2001:620:0:25:c6d:82ee:ed15:40f6) Quit ()
[20:19] <amatter2013> tamil: thanks
[20:19] * odyssey4me (~odyssey4m@ Quit (Ping timeout: 480 seconds)
[20:20] <Tamil> amatter2013: np
[20:21] * aliguori (~anthony@cpe-70-112-157-87.austin.res.rr.com) has joined #ceph
[20:24] * indeed (~indeed@ has joined #ceph
[20:25] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[20:36] * mikedawson (~chatzilla@23-25-19-10-static.hfc.comcastbusiness.net) has joined #ceph
[20:37] <paravoid> sjusthm: hey
[20:38] <sjusthm> paravoid: current next may improve things for you
[20:38] <paravoid> oh?
[20:38] <sjusthm> we were delaying certain messages longer than necessary; it's possible that's what is causing some of your long peering problems
[20:38] <sjusthm> you are already somewhere up past 0.66?
[20:39] <paravoid> no, just 0.66
[20:39] <paravoid> but I can move to next
[20:39] <sjusthm> yeah, current next should be more stable
[20:39] <paravoid> that's 39e5a2a406b77fa82e9a78c267b679d49927e3c3 ?
[20:40] <sjusthm> that's the one which may help
[20:40] <sjusthm> you might try upgrading just 1 at first
[20:41] <sjusthm> if that's really the problem, then the one you are restarting is the one which needs the fix
[20:41] <paravoid> oh, good to know
[20:41] <paravoid> I'll try now
[20:41] <sjusthm> we may also need to play with the config value
[20:41] <sjusthm> also
[20:41] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:42] <sjusthm> can you get debug osd = 20, debug ms = 1, debug filestore = 20 from the restarted osd?
[20:42] <sjusthm> just in case
[20:42] <paravoid> ok
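The debug levels sjustlaptop asks for, as a hedged config sketch (the runtime-injection form shown in the note is an assumption about the setup):

```ini
; [osd] section of ceph.conf on the OSD being restarted
[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20
```

At runtime, something like `ceph tell osd.1 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'` should apply the same levels without a restart.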
[20:44] <paravoid> I'll restart one without debug options first to make sure that debug output isn't slowing it down
[20:44] <sjustlaptop> ok
[20:45] * s2r2 (~s2r2@g227004088.adsl.alicedsl.de) Quit (Quit: s2r2)
[20:48] * yehuda_hm (~yehuda@2602:306:330b:1410:baac:6fff:fec5:2aad) Quit (Ping timeout: 480 seconds)
[20:52] <paravoid> nope :/
[20:52] <paravoid> 2013-07-17 18:50:56.308239 mon.0 [INF] pgmap v10097452: 16760 pgs: 16396 active+clean, 353 active+degraded, 11 active+clean+scrubbing+deep; 46786 GB data, 144 TB used, 117 TB / 261 TB avail; 6380B/s rd, 0B/s wr, 12op/s; 6127735/860598494 degraded (0.712%)
[20:52] <paravoid> 2013-07-17 18:50:57.665306 mon.0 [INF] pgmap v10097453: 16760 pgs: 169 active, 16396 active+clean, 65 peering, 119 active+degraded, 11 active+clean+scrubbing+deep; 46786 GB data, 144 TB used, 117 TB / 261 TB avail; 12339B/s rd, 0B/s wr, 28op/s; 2084633/860598464 degraded (0.242%)
[20:52] <paravoid> [...]
[20:52] <paravoid> 2013-07-17 18:51:54.564229 mon.0 [INF] pgmap v10097479: 16760 pgs: 287 active, 16397 active+clean, 65 peering, 11 active+clean+scrubbing+deep; 46786 GB data, 144 TB used, 117 TB / 261 TB avail; 36820B/s rd, 11506KB/s wr, 144op/s
[20:52] <paravoid> 2013-07-17 18:51:55.663477 mon.0 [INF] pgmap v10097480: 16760 pgs: 352 active, 16397 active+clean, 11 active+clean+scrubbing+deep; 46786 GB data, 144 TB used, 117 TB / 261 TB avail; 16874B/s rd, 9452KB/s wr, 86op/s
[20:52] <paravoid> about a minute with those 65 peering
[20:53] <sjustlaptop> bizarre
[20:53] <sjustlaptop> ok, logging
[20:53] <paravoid> yeah
[20:53] <sjustlaptop> did you have slow requests?
[20:54] <paravoid> 2013-07-17 18:52:51.052245 mon.0 [INF] pgmap v10097515: 16760 pgs: 294 active, 16462 active+clean, 4 active+clean+scrubbing+deep; 46786 GB data, 144 TB used, 117 TB / 261 TB avail; 2450B/s rd, 3op/s; 4/860598770 degraded (0.000%)
[20:54] <paravoid> 2013-07-17 18:52:52.443280 osd.0 [WRN] 3 slow requests, 3 included below; oldest blocked for > 60.500777 secs
[20:54] <sjustlaptop> ok
[20:54] <paravoid> not during peering but there's literally no I/O
[20:54] <sjustlaptop> the status line says 144ops/s
[20:54] <sjustlaptop> anyway, logging
[20:55] <paravoid> I'm still waiting for those active pgs to settle
[20:56] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[20:58] * BManojlovic (~steki@237-231.197-178.cust.bluewin.ch) has joined #ceph
[20:59] <sjustlaptop> when you try it again, try to get a ceph pg dump during that time period as well
[20:59] * mikedawson (~chatzilla@23-25-19-10-static.hfc.comcastbusiness.net) Quit (Read error: No route to host)
[21:04] <paravoid> ok
[21:04] <paravoid> got them
[21:04] <sjustlaptop> cephdrop?
[21:04] <paravoid> yeah
[21:05] <sjustlaptop> k, looking
[21:05] <sjustlaptop> thanks
[21:05] <paravoid> oh, not yet
[21:05] <paravoid> wait :)
[21:05] <sjustlaptop> k
[21:07] * indeed (~indeed@ Quit (Remote host closed the connection)
[21:08] <paravoid> ok
[21:08] <paravoid> done
[21:09] <paravoid> I put the osd.1 log, a pg dump in the late phase of peering and two pg dumps while it was done peering but active (w/ slow requests)
[21:09] <paravoid> and the ceph.log
[21:09] <paravoid> hope they help
[21:12] <sjustlaptop> interesting, it was waiting for up thru
[21:12] <sjustlaptop> progress.
[21:12] <paravoid> yay
[21:12] <paravoid> waiting for up thru?
[21:14] <sjustlaptop> yeah
[21:14] <sjustlaptop> it was waiting on a new osd map
[21:15] <sjustlaptop> hmm
[21:15] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) has joined #ceph
[21:15] * indeed (~indeed@ has joined #ceph
[21:16] * sleinen1 (~Adium@2001:620:0:26:d579:6559:720:f508) has joined #ceph
[21:17] * s2r2 (~s2r2@g227004088.adsl.alicedsl.de) has joined #ceph
[21:19] * oddomatik (~Adium@cpe-76-95-217-129.socal.res.rr.com) Quit (Quit: Leaving.)
[21:20] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[21:21] <sjustlaptop> paravoid: can you try that again, but turn up mon logging as well?
[21:21] <sjustlaptop> so leave logging on that osd, and add debug mon = 20, debug ms = 1 to the mons
[21:22] <nwat> I'm using ceph-deploy with a single monitor (branch: next). ceph-create-keys never finishes because the monitor is returning "ceph-mon is not in quorum: 'probing'"
[21:23] * sleinen (~Adium@77-58-245-10.dclient.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:27] <sjustlaptop> paravoid: actually, one sec
[21:34] <ebo^> gregaf: ceph kernel module (3.9.7). i have one client that had network problems lately, but i can not access the file from any client
[21:35] <gregaf> ebo^: what version of the ceph userspace?
[21:35] * rturk-away is now known as rturk
[21:35] <ebo^> 0.61.4-1~bpo70+1
[21:36] <gregaf> if you can restart each of your clients in turn (and possibly the MDS), you'll probably find that the files become available again
[21:36] <sagewk> sjustlaptop: want to look at wip-osd-latency?
[21:36] <sjustlaptop> sagewk: I think ReplicatedPG::clean_up_local must always be a noop
[21:36] * indeed_ (~indeed@ has joined #ceph
[21:36] <joshd> mgalkiewicz: qemu will read ceph.conf for additional options like caching (but you need to get libvirt to enable that too otherwise qemu won't know about it and it'll be unsafe)
[21:36] <gregaf> there are a bunch of things that are in the kernel for 3.10 or 3.11 to fix bugs which can cause this
[21:36] <ebo^> restarting as in unmounting & mounting again?
[21:37] <sagewk> because we remove them when we resolve the divergent entry stuff in merge_log?
[21:37] <sjustlaptop> yeah
[21:37] <joshd> mgalkiewicz: for the admin socket, the unix user running qemu needs to be able to create it, which might be restricted by unix permissions, selinux, apparmor, etc.
[21:37] <sjustlaptop> it seems to be the next peering speed blocker
[21:38] * rturk is now known as rturk-away
[21:38] <sagewk> let's kill it
[21:39] <sjustlaptop> I'm going to add in a debugging check enabled in teuthology for now
[21:39] <sjustlaptop> oh, that's actually not trivial
[21:40] <sjustlaptop> urgh
[21:40] * ofu_ (ofu@dedi3.fuckner.net) has joined #ceph
[21:40] * oddomatik (~Adium@ has joined #ceph
[21:41] <sjustlaptop> oh, yes it is, the store must be flushed when we activate
[21:42] * hflai_ (~hflai@alumni.cs.nctu.edu.tw) has joined #ceph
[21:42] * leseb1 (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[21:43] * indeed (~indeed@ Quit (Ping timeout: 480 seconds)
[21:43] * todin_ (tuxadero@kudu.in-berlin.de) has joined #ceph
[21:44] * dalegaar1 (~dalegaard@vps.devrandom.dk) has joined #ceph
[21:44] * MapspaM (~clint@xencbyrum2.srihosting.com) has joined #ceph
[21:44] * ggreg_ (~ggreg@int.0x80.net) has joined #ceph
[21:44] * Ludo__ (~Ludo@falbala.zoxx.net) has joined #ceph
[21:44] * lurbs_ (user@uber.geek.nz) has joined #ceph
[21:44] * NaioN_ (stefan@andor.naion.nl) has joined #ceph
[21:44] * Gugge_47527 (gugge@kriminel.dk) has joined #ceph
[21:44] * Gugge-47527 (gugge@kriminel.dk) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * ofu (ofu@dedi3.fuckner.net) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * Anticimex (anticimex@ Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * Ludo (~Ludo@falbala.zoxx.net) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * tdb (~tdb@willow.kent.ac.uk) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * SpamapS (~clint@xencbyrum2.srihosting.com) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * AaronSchulz (~chatzilla@ Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * lurbs (user@uber.geek.nz) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * Zethrok (~martin@ Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * soren (~soren@hydrogen.linux2go.dk) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * todin (tuxadero@kudu.in-berlin.de) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * ggreg (~ggreg@int.0x80.net) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * dalegaard (~dalegaard@vps.devrandom.dk) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * hflai (~hflai@alumni.cs.nctu.edu.tw) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * NaioN (stefan@andor.naion.nl) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * jtang (~jtang@sgenomics.org) Quit (synthon.oftc.net testlink-alpha.oftc.net)
[21:44] * Gugge_47527 is now known as Gugge-47527
[21:44] * tdb (~tdb@willow.kent.ac.uk) has joined #ceph
[21:44] * Anticimex (anticimex@ has joined #ceph
[21:45] * soren (~soren@hydrogen.linux2go.dk) has joined #ceph
[21:45] * Zethrok (~martin@ has joined #ceph
[21:47] * mcgoo (~oftc-webi@core-v7.DearbornTower.onshore.net) has joined #ceph
[21:48] * jtang (~jtang@sgenomics.org) has joined #ceph
[21:49] * AaronSchulz (~chatzilla@ has joined #ceph
[21:49] <gregaf> ebo^: yeah, unmount the ceph filesystem
[21:49] <ebo^> will try. thx
[21:50] * andreask (~andreask@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[21:50] * ChanServ sets mode +v andreask
[21:54] <mcgoo> hi all. I've got a 0.61.4 install now running with a single mon. store.db is 13GB on the running mon. adding a new mon causes about 20GB of disk usage on the new mon before it crashed
[21:55] <mcgoo> any ideas how I can get back up to a redundant configuration?
[21:58] * infinitytrapdoor (~infinityt@ip-109-46-155-169.web.vodafone.de) has joined #ceph
[22:01] <paravoid> sjustlaptop: so? anything I can do?
[22:01] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:01] * leseb1 (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) Quit (Quit: Leaving.)
[22:01] * mozg (~andrei@host217-44-214-64.range217-44.btcentralplus.com) has joined #ceph
[22:08] * iii8 (~Miranda@ Quit (Read error: Connection reset by peer)
[22:08] * infinitytrapdoor (~infinityt@ip-109-46-155-169.web.vodafone.de) Quit (Ping timeout: 480 seconds)
[22:10] * madkiss1 (~madkiss@2001:6f8:12c3:f00f:c498:42b3:5ccd:5aae) has joined #ceph
[22:10] <sjustlaptop> paravoid: not yet, I got another clue from your log
[22:10] <sjustlaptop> working on another patch
[22:10] <paravoid> okay, thanks :)
[22:14] * jakes (~oftc-webi@128-107-239-234.cisco.com) has joined #ceph
[22:17] * madkiss (~madkiss@2001:6f8:12c3:f00f:bd6b:c578:abc2:e72d) Quit (Ping timeout: 480 seconds)
[22:19] * illya (~illya_hav@205-43-133-95.pool.ukrtel.net) has left #ceph
[22:20] * iii8 (~Miranda@ has joined #ceph
[21:21] <jakes> I am following this architecture http://ceph.com/docs/master/rbd/rbd-openstack/. Since I also need file-like access for various applications inside guest VMs, I thought of having a cephFS client in each of the VMs, connecting to the common object store in the host. But, the problem is, I would be directly writing to the object store without using the volume created by the cinder. Is there any way for the cephfs to use the same volume created by th
[22:22] * tnt (~tnt@ Quit (Quit: leaving)
[22:24] <sjustlaptop> sagewk: minor comments on the histogram patch
[22:25] * zebirilau (~dg2334@bl9-248-137.dsl.telepac.pt) has joined #ceph
[22:25] <paravoid> sjustlaptop: also, note that there's a considerable amount of time after peering finishes and until all pgs get to active+clean during which I get slow requests
[22:26] <sjustlaptop> yeah, that may or may not be related
[22:26] <paravoid> this wasn't the case in earlier versions
[22:30] * sunday (~sunday@ has joined #ceph
[22:31] * sleinen1 (~Adium@2001:620:0:26:d579:6559:720:f508) Quit (Ping timeout: 480 seconds)
[22:33] * indeed_ (~indeed@ Quit (Remote host closed the connection)
[22:38] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[22:38] <jakes> Someone can help?
[22:41] <paravoid> 2013-07-17 20:37:21.632321 mon.0 [INF] pgmap v10102094: 16760 pgs: 16695 active+clean, 31 active+degraded+backfilling, 2 active+degraded+remapped+wait_backfill, 31 active+degraded+remapped+backfilling, 1 active+recovering; 46788 GB data, 144 TB used, 115 TB / 259 TB avail; 404887/860603606 degraded (0.047%); recovering 316 o/s, 55344KB/s
[22:41] <paravoid> 2013-07-17 20:37:17.615386 osd.0 [WRN] 50 slow requests, 2 included below; oldest blocked for > 3613.506459 secs
[22:41] <paravoid> 2013-07-17 20:37:17.615395 osd.0 [WRN] slow request 240.210748 seconds old, received at 2013-07-17 20:33:17.404554: osd_op(client.30982.0:2827169 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195502) v4 currently waiting for missing object
[22:41] <mozg> hello guys
[22:41] <paravoid> 2013-07-17 20:37:17.615406 osd.0 [WRN] slow request 240.114235 seconds old, received at 2013-07-17 20:33:17.501067: osd_op(client.30969.0:527824 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195502) v4 currently waiting for missing object
[22:41] <paravoid> that's... new
[22:41] <mozg> i am trying to enable rbd cache
[22:42] <mozg> i've added to the [client] section
[22:42] <mozg> rbd cache = true
[22:42] <mozg> and specified cache size
[22:42] <mozg> restarted ceph mon and osd services
[22:42] <mozg> however, when I check the running config it still shows rbd cache as false
[22:42] <mozg> what am i doing wrong?
[22:44] <mozg> paravoid: i am also getting slow requests
[22:44] <mozg> and unable to determine why
[22:45] <mozg> anyway, back to my original question
[22:45] <mozg> does anyone know how to enable rbd cache?
[22:46] <alexbligh> mozg, AFAIK rbd cache is only available from qemu (as opposed to kernel rbd), and you need to mount with cache=writeback on the qemu line
[22:46] <mozg> alexbligh: there is a client side rbd cache configuration options as well
[22:46] <mozg> which I am trying to use
[22:47] <alexbligh> I should be more specific
[22:47] * jakes (~oftc-webi@128-107-239-234.cisco.com) Quit (Quit: Page closed)
[22:47] <alexbligh> AFAIK the rbd cache is built into librbd.
[22:47] * markit (~marco@88-149-177-66.v4.ngi.it) has joined #ceph
[22:47] <alexbligh> That's only used by qemu and other librbd users, and not by the kernel
[22:47] <mozg> ah, i see
[22:47] <mozg> so, on the server side you do not see this option being enabled
[22:47] * amatter2013 (~oftc-webi@ Quit (Remote host closed the connection)
[22:48] <alexbligh> Moreover, I believe qemu disables it without cache=writeback
[22:48] <mozg> is it only used on the client side?
[22:48] <alexbligh> I believe that is correct, yes
[22:48] * oddomatik (~Adium@ Quit (Quit: Leaving.)
[22:48] <markit> any OP here? zebirilau user opens a private chat with a link when you enter here
[22:48] <mozg> ah, okay
[22:48] <mozg> thanks
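To summarize the exchange above: rbd cache is a librbd (client-side) option, so it is read by the client process (e.g. qemu) when it starts — restarting mons and OSDs, as mozg did, has no effect on it. A minimal sketch of the settings discussed, using the standard ceph.conf option names; the cache size value and the pool/image name in the qemu line are illustrative placeholders:

```ini
# ceph.conf on the client (e.g. the hypervisor running qemu)
[client]
    rbd cache = true
    rbd cache size = 67108864    ; cache size in bytes (here 64 MB)

# qemu must also be started with writeback caching on the rbd drive,
# e.g. (illustrative):
#   -drive format=rbd,file=rbd:mypool/myimage,cache=writeback
```

As alexbligh notes, without cache=writeback qemu is believed to disable the librbd cache regardless of the ceph.conf setting, and exact behavior varies by qemu version.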
[22:48] <markit> dmick: ping
[22:48] <alexbligh> mozg, the best way to ascertain the cache handling inside qemu is to read the source for your qemu version (unfortunately)
[22:48] <alexbligh> it seems to vary.
[22:49] <markit> joao: ping
[22:49] <alexbligh> s/cache handling/rbd cache handling/
[22:49] <joao> markit, pong
[22:49] * s2r2 (~s2r2@g227004088.adsl.alicedsl.de) Quit (Quit: s2r2)
[22:50] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[22:50] * joao sets mode +b *!*dg2334@*.dsl.telepac.pt
[22:51] <joao> pity, kicking the one other guy from .pt
[22:51] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[22:51] * zebirilau was kicked from #ceph by joao
[22:53] <paravoid> anyone want to help me with collecting logs for that new thing?
[22:53] <paravoid> 2013-07-17 20:53:02.897945 osd.0 [WRN] slow request 4558.789022 seconds old, received at 2013-07-17 19:37:04.108843: osd_op(client.30972.0:520820 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
[22:53] <mozg> paravoid: how often do you see slow requests on your cluster?
[22:53] <mozg> i am having slow request issues too
[22:55] * PerlStalker (~PerlStalk@ Quit (Remote host closed the connection)
[22:55] * jakes (~oftc-webi@128-107-239-233.cisco.com) has joined #ceph
[22:55] <paravoid> slow requests are a very broad symptom, there can be a lot of different causes
[22:56] * dpippenger (~riven@tenant.pas.idealab.com) Quit (Remote host closed the connection)
[22:57] * rturk-away is now known as rturk
[23:00] * dpippenger (~riven@tenant.pas.idealab.com) has joined #ceph
[23:02] * mcgoo (~oftc-webi@core-v7.DearbornTower.onshore.net) Quit (Remote host closed the connection)
[23:04] * indeed (~indeed@ has joined #ceph
[23:07] * sunday (~sunday@ Quit (Read error: No route to host)
[23:10] <sjustlaptop> paravoid: does osd0 already have logging on?
[23:11] <paravoid> sjustlaptop: http://tracker.ceph.com/issues/5655
[23:11] <paravoid> it didn't, I briefly told it to debug-ms/filestore/osd 20
[23:11] <paravoid> and then back at 0
[23:12] * indeed (~indeed@ Quit (Ping timeout: 480 seconds)
[23:12] <sjustlaptop> yeah, I think you are hitting a known messenger bug
[23:12] <sjustlaptop> sage has patches currently being reviewed
[23:13] <paravoid> ok
[23:13] <paravoid> great :)
[23:13] <paravoid> (me? hitting bugs? no way)
[23:13] <sjustlaptop> you can also add the logs you have for osd.0 and osd.25
[23:14] <paravoid> no debug logs for osd.25, I can collect them though
[23:14] <sjustlaptop> one sec
[23:14] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:15] <sagewk> sjustlaptop: thoughts on squeezing wip-osd-latency in?
[23:15] <sjustlaptop> sagewk: seems both safe and helpful
[23:16] <sagewk> any other key info we should include here too?
[23:16] <sjustlaptop> paravoid: you can flush the in-memory logging on osd.25 with ceph --admin-daemon <osd 25 .asok file> log dump
[23:16] <sjustlaptop> do that and then attach the dumped logs
[23:16] <sjustlaptop> sagewk: not that I can think of off hand
[23:16] <sagewk> should the hist struct be independently encoded (not just a vector)? couldn't imagine what else we would add in there.
[23:17] <sjustlaptop> more detailed info can be obtained via the perf counters
[23:17] <sjustlaptop> eek, did not notice that
[23:17] <sjustlaptop> yeah, should have its own encoder
[23:17] <paravoid> uhm
[23:17] <sjustlaptop> paravoid: usually found in /var/run/ceph
[23:18] <sjustlaptop> probably ceph-osd.25.asok
[23:18] <paravoid> oh I know
[23:18] <paravoid> is that supposed to print something to stdout?
[23:18] <sjustlaptop> no, to the log file
[23:18] <sjustlaptop> if we were quick enough
[23:18] <sjustlaptop> it should just dump the current in-memory logging to the log file
[23:18] <paravoid> osd.25 logs are full of 2013-07-17 21:18:22.292996 7f6f902ef700 0 can't decode unknown message type 106 MSG_AUTH=17
[23:19] * indeed (~indeed@ has joined #ceph
[23:19] <sjustlaptop> did it dump the logging?
[23:19] <paravoid> I'm looking
[23:19] <paravoid> anything specific I can grep for?
[23:19] <sjustlaptop> last line would be
[23:19] <sjustlaptop> --- end dump of recent events ---
[23:20] <sjustlaptop> it's the same mechanism that produces the dump when you hit an assert
[23:20] <paravoid> ah
[23:20] <paravoid> 0> 2013-07-17 21:17:24.190167 7f6fd5fbd700 1 do_command 'log dump' ''
[23:20] <paravoid> --- logging levels ---
[23:20] <dmick> sjustlaptop or whoever: looking at Sage's idea for op queue age stats, I'm looking at the op tracker. It seems that a few message types aren't tracked; this is on purpose, I assume, but why? Is op_tracker primarily about I/O requests?
[23:20] <paravoid> max_new 1000
[23:20] <paravoid> log_file /var/log/ceph/ceph-osd.25.log
[23:20] <paravoid> --- end dump of recent events ---
[23:20] <paravoid> (the log levels in between)
[23:22] <sagewk> dmick: all io requests
[23:22] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) has joined #ceph
[23:22] <sjustlaptop> paravoid: that's it, go ahead and do the same on osd 0 and then cephdrop both files
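The procedure sjustlaptop walks paravoid through above, as a shell sketch — the socket and log paths assume the default /var/run/ceph and /var/log/ceph locations, and the OSD ids (25 and 0) are the ones from this conversation:

```shell
# Flush each OSD's in-memory log ring buffer to its configured log file
# via the admin socket:
ceph --admin-daemon /var/run/ceph/ceph-osd.25.asok log dump
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok log dump

# Nothing is printed to stdout; the dump is appended to the OSD's log
# file, ending with a recognizable marker:
grep -n -- '--- end dump of recent events ---' /var/log/ceph/ceph-osd.25.log
```

This is the same mechanism that produces the "recent events" dump when an OSD hits an assert, so the output format in the log file is identical.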
[23:22] * diegows (~diegows@ has joined #ceph
[23:22] <sjustlaptop> dmick: yeah
[23:23] * alfredodeza (~alfredode@c-24-131-46-23.hsd1.ga.comcast.net) Quit (Remote host closed the connection)
[23:25] <jakes> Can the block devices created in cinder be exposed outside?
[23:29] <paravoid> sjustlaptop: updated the bug report, still uploading, ETA 01:30
[23:29] <paravoid> and I really need to go now :)
[23:29] <sjustlaptop> k
[23:29] <paravoid> I'll be back in 2h or so
[23:29] <paravoid> thanks for all the hand holding
[23:31] * ebo^ (~ebo@koln-4db4af6f.pool.mediaWays.net) Quit (Quit: Verlassend)
[23:33] * PerlStalker (~PerlStalk@ has joined #ceph
[23:34] * markbby (~Adium@ Quit (Quit: Leaving.)
[23:34] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[23:43] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[23:52] * infinitytrapdoor (~infinityt@ip-109-41-61-88.web.vodafone.de) has joined #ceph
[23:53] * smiley (~smiley@pool-173-73-0-53.washdc.fios.verizon.net) has joined #ceph
[23:55] * jeff-YF (~jeffyf@ Quit (Ping timeout: 480 seconds)
[23:55] <sagewk> sjusthm: on the second patch, the flush will need to happen later in order for the read from replica to work properly.. can we do it in the activate stage?
[23:55] <sagewk> oh, no.. we need to see stuff from the previous peering epoch during peering.
[23:56] <sjusthm> sagewk: for read from replica to work we do have to do the flush in activate, thanks
[23:56] * drokita1 (~drokita@ has joined #ceph
[23:56] <sagewk> so 2 flushes..
[23:56] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Quit: Ex-Chat)
[23:57] <sjusthm> wait, what?
[23:57] <sjusthm> we don't trust anything in the filestore during peering
[23:57] <sagewk> ok
[23:57] <sjusthm> we just need to flush prior to serving reads
[23:58] <sjusthm> it's actually a bit tricky, the replica needs to be able to serve pulls without having been activated
[23:58] <sjusthm> so if the replica is destined to go active, we have to flush in activate
[23:59] <sjusthm> otherwise, we have to flush anyway prior to serving a pull
[23:59] <sagewk> can we just flush twice?

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.