#ceph IRC Log


IRC Log for 2013-01-04

Timestamps are in GMT/BST.

[0:04] <phantomcircuit> 2013-01-03 23:03:11.463486 osd.0 [WRN] slow request 32.421791 seconds old, received at 2013-01-03 23:02:39.041639: osd_op(client.5514.1:223152 rb.0.1427.238e1f29.000000006624 [write 0~430080] 2.b5204711) currently waiting for sub ops
[0:04] <phantomcircuit> hmm
[0:04] <phantomcircuit> how can i figure out what the sub ops are?
[0:05] <dmick> look for that transaction ID earlier in the log (client.5514.1:223152)
[0:07] <phantomcircuit> only that log has the in the log
[0:07] <phantomcircuit> im guessing i need to turn the log level up a lot
[0:10] <dmick> I assume you mean "that message is the only log message that mentions that string"?
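A minimal sketch of pulling the transaction ID out of a slow request warning so it can be searched for in other logs, as dmick suggests. The sample line is the one pasted at [0:04]; the regex and the function name parse_slow_request are illustrative assumptions about the line layout, not a documented Ceph log format.

```python
import re

# Warning line as pasted in the channel at [0:04].
LINE = ("2013-01-03 23:03:11.463486 osd.0 [WRN] slow request 32.421791 seconds old, "
        "received at 2013-01-03 23:02:39.041639: osd_op(client.5514.1:223152 "
        "rb.0.1427.238e1f29.000000006624 [write 0~430080] 2.b5204711) "
        "currently waiting for sub ops")

# Assumed layout: "<osd> [WRN] slow request <age> seconds old, ... osd_op(<txid> <object> ..."
SLOW_RE = re.compile(
    r"(?P<osd>osd\.\d+) \[WRN\] slow request (?P<age>[\d.]+) seconds old, "
    r".*?osd_op\((?P<txid>\S+) (?P<obj>\S+) "
)

def parse_slow_request(line):
    """Return the reporting OSD, request age, transaction ID and object name, or None."""
    m = SLOW_RE.search(line)
    if m is None:
        return None
    return {
        "osd": m["osd"],
        "age": float(m["age"]),
        "txid": m["txid"],
        "object": m["obj"],
    }

if __name__ == "__main__":
    info = parse_slow_request(LINE)
    # The txid is the string to grep for in the other OSD's log.
    print(info["txid"], info["object"])
```

With a higher log level on both OSDs, the extracted ID could then be matched against each log to see where the request stalled.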
[0:13] * nyeates (~nyeates@11.sub-70-192-196.myvzw.com) has joined #ceph
[0:14] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[0:14] <phantomcircuit> dmick, that's what i meant
[0:14] * The_Bishop__ (~bishop@e177088127.adsl.alicedsl.de) has joined #ceph
[0:14] <phantomcircuit> dmick, it's a good thing you can read minds
[0:16] <dmick> :)
[0:16] <dmick> yeah, if you don't have any more, then you'd have needed more logging to see it
[0:17] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[0:17] <dmick> I'm not sure if you can query outstanding transaction status interactively
[0:17] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit ()
[0:20] * miroslav1 (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[0:20] * maxiz (~pfliu@ Quit (Read error: Operation timed out)
[0:21] <dmick> phantomcircuit: docs: http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd
[0:21] * The_Bishop_ (~bishop@e177089165.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[0:21] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[0:21] <dmick> you can find out what OSDs have that object and check those specific ones
[0:24] <phantomcircuit> it seems like there's some sort of bug here
[0:24] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:24] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:25] <phantomcircuit> dmick, there's nothing in dmesg and i checked the disks with long smart tests before using them
[0:25] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Ping timeout: 480 seconds)
[0:27] <phantomcircuit> http://pastebin.com/raw.php?i=sCZ5xvMH
[0:27] <dmick> yeah
[0:27] <phantomcircuit> i may have turned logging up tooo much
[0:28] <dmick> the subop means, I think, that it's waiting for the other replicas to response
[0:28] <dmick> *respond
[0:28] <dmick> how many osds do you have?
[0:29] <phantomcircuit> 2
[0:29] <phantomcircuit> they're on the same machine, independent disks
[0:29] <phantomcircuit> journals on independent ssds
[0:29] <buck> hey, does anyone know why /a/ on teuthology.front.sepia.ceph.com doesn't have all the test output like it used to?
[0:29] <buck> did that move somewhere?
[0:30] <buck> nm. I see /a.old
[0:30] <buck> is there a long term plan with that or what's up?
[0:31] <phantomcircuit> line 13 takes 4 seconds 15 takes 10 seconds
[0:31] <dmick> I asked Sandon about that but he's not looking at IM I guess
[0:31] <phantomcircuit> both are eval_repop
[0:32] * tnt (~tnt@11.97-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[0:34] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[0:39] <phantomcircuit> hmm
[0:41] <phantomcircuit> i have the osd directories on independent disks
[0:41] * nyeates (~nyeates@11.sub-70-192-196.myvzw.com) Quit (Quit: Zzzzzz)
[0:41] <phantomcircuit> since the system came on one disk has had 3523231 KB read while the other had 832611231 KB
[0:42] <phantomcircuit> 0.423154393 %
[0:42] <phantomcircuit> something's not right there
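The percentage quoted above can be reproduced with a quick calculation. This is only a sketch of the arithmetic, using the two read counters from the [0:41] message; the function name read_share_percent is made up for illustration.

```python
def read_share_percent(kb_small, kb_large):
    """Reads on the quieter disk as a percentage of reads on the busier disk."""
    return 100.0 * kb_small / kb_large

# KB read on each disk since boot, as reported in the channel.
pct = read_share_percent(3523231, 832611231)

if __name__ == "__main__":
    print(f"{pct:.4f} %")
```

With two equally weighted OSDs one would expect a figure much closer to 100 % if reads were spread evenly across primaries, which is why the number looks suspicious.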
[0:45] * PerlStalker (~PerlStalk@ Quit (Quit: ...)
[0:48] <dmick> phantomcircuit: all reads come from the primary
[0:48] <dmick> unless it goes down
[0:52] <phantomcircuit> dmick, ah
[0:53] <gregaf> but with two equal-sized disks they should both be primaries for about half of it
[0:53] <phantomcircuit> dmick, wait but shouldn't they be
[0:53] <phantomcircuit> yeah what gregaf said
[0:53] <dmick> oh well true
[0:53] <dmick> duh
[0:55] <phantomcircuit> ceph pg dump has a column "up"
[0:55] <phantomcircuit> [1,0] is an example value
[0:55] <phantomcircuit> i assume the first value is the primary osd and the second is the replica
[0:55] <dmick> I believe that is true
[0:56] <phantomcircuit> in that case roughly half of the pgs are primary to 0 and roughly half to 1
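A rough sketch of the check described here: given the "up" sets from a pg dump, count how often each OSD appears first (the position assumed in the discussion to be the primary). The sample data is hypothetical and the helper does not parse real `ceph pg dump` output; it only illustrates the tally.

```python
from collections import Counter

def count_primaries(up_sets):
    """Count how many PGs list each OSD first in their 'up' set."""
    return Counter(up[0] for up in up_sets if up)

if __name__ == "__main__":
    # Hypothetical 'up' columns for six PGs on a two-OSD cluster.
    sample = [[1, 0], [0, 1], [1, 0], [0, 1], [0, 1], [1, 0]]
    for osd, n in sorted(count_primaries(sample).items()):
        print(f"osd.{osd}: primary for {n} PGs")
```

A roughly even split, as reported in the channel, suggests the imbalance in reads is not explained by primary placement alone.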
[0:57] <dmick> ceph -s and ceph osd dump look as you expect?
[0:57] <phantomcircuit> yeah
[0:59] * Cube1 (~Cube@ Quit (Ping timeout: 480 seconds)
[0:59] <phantomcircuit> ceph osd tree also looks as i'd expect
[0:59] <phantomcircuit> pool -> datacenter -> host -> osd{0,1}
[0:59] <phantomcircuit> both up both weighted 1
[1:00] <dmick> hmm
[1:01] <dmick> and you believe your read workload has been distributed across those objects?
[1:01] <phantomcircuit> roughly
[1:02] <phantomcircuit> there's an rsync in progress which should be reading basically half of the data in the cluster
[1:04] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[1:05] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[1:12] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[1:14] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Quit: Leaving.)
[1:16] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[1:17] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[1:19] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:20] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:21] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: slang)
[1:22] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[1:25] * cdblack (c0373628@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[1:29] * jlogan (~Thunderbi@2600:c00:3010:1:519e:fddc:b274:5689) Quit (Ping timeout: 480 seconds)
[1:30] * benpol (~benp@garage.reed.edu) has joined #ceph
[1:31] <benpol> Really hoped btrfs had made more progress, anyone else have issues with "umount" hanging with btrfs backed OSDs?
[1:32] <janos> benpol: nope. though have you tried with the -l flag?
[1:32] <janos> umount -l /dev/your/disk
[1:33] <nhm> benpol: I've got automated scripts that mount/umount btrfs hundreds of times without problem.
[1:33] <benpol> janos: hm, first I've heard of that option.
[1:33] <janos> oh i love that option
[1:33] <janos> it's the "lazy" flag
[1:33] * sagelap (~sage@2607:f298:a:607:b12b:c856:f3f1:71e5) Quit (Read error: Operation timed out)
[1:34] <benpol> nhm: hmm not my experience that's for sure. running debian squeeze with a 3.7.1 kernel.
[1:34] <nhm> benpol: 3.6.3 currently here, but I previously was running 3.4 on this setup.
[1:34] <nhm> benpol: ubuntu 12.04
[1:35] <janos> any luck with the lazy flag?
[1:35] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:35] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[1:36] <benpol> janos: perhaps...
[1:36] <janos> i have to run - family time. good luck
[1:36] <benpol> nhm: yeah it hangs so regularly I'm completely unable to do a clean reboot.
[1:37] <benpol> janos: thanks for the tip!
[1:37] <janos> any time
[1:37] <nhm> benpol: strange!
[1:37] <benpol> nhm: that's one word for it ;)
[1:37] <nhm> benpol: lsof provide anything useful?
[1:38] <benpol> do you use the autodefrag mount option?
[1:38] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[1:39] <benpol> nhm: no nothing in lsof output that looks interesting
[1:39] <nhm> anything strange about the file system? multi-device?
[1:39] <benpol> no, just a straight forward single partition btrfs filesystem
[1:40] <nhm> huh
[1:40] <nhm> yeah, no idea
[1:40] * sagelap (~sage@ has joined #ceph
[1:48] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[1:53] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[1:54] * sagelap1 (~sage@ has joined #ceph
[1:54] * korgon (~Peto@isp-korex- has joined #ceph
[2:00] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[2:04] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:05] * nyeates (~nyeates@c-69-251-194-165.hsd1.md.comcast.net) has joined #ceph
[2:12] <phantomcircuit> client crashed again
[2:13] <phantomcircuit> rbd volume mapped to /dev/rbd1 with about 20 MB/s of data
[2:13] <phantomcircuit> http://imgur.com/a/ZpNuZ
[2:13] <phantomcircuit> (both kernel panics)
[2:17] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:22] * sagelap1 (~sage@ Quit (Ping timeout: 480 seconds)
[2:24] <dmick> aw. that's not good
[2:25] <elder> Once again, we need to get more information, earlier in the log, for this information to be useful (unfortunately)
[2:34] * sagelap (~sage@2600:1013:b015:d998:c685:8ff:fe59:d486) has joined #ceph
[2:36] <paravoid> it'd be nice if ceph had a more /proc/mdstat-like output for cluster degradation
[2:37] <paravoid> some estimate of the speed (in pgs would probably be enough), plus a rough ETA
[2:41] * rektide (~rektide@deneb.eldergods.com) Quit (Remote host closed the connection)
[2:41] * rektide (~rektide@deneb.eldergods.com) has joined #ceph
[2:43] <dmick> you don't mean in /proc, you mean that style of information, I assume
[2:43] <dmick> like recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec
[2:48] * rektide (~rektide@deneb.eldergods.com) Quit (Remote host closed the connection)
[2:49] * sagelap (~sage@2600:1013:b015:d998:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[2:50] * dmick is now known as dmick_away
[2:53] * rektide (~rektide@deneb.eldergods.com) has joined #ceph
[2:56] * miroslav1 (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[2:59] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[3:00] * renzhi (~renzhi@ Quit (Quit: Leaving)
[3:01] * sagelap (~sage@210.sub-70-197-131.myvzw.com) has joined #ceph
[3:01] <sagelap> paravoid: agreed
[3:02] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[3:02] * imjustmatthew (~imjustmat@pool-173-53-54-22.rcmdva.fios.verizon.net) Quit (Remote host closed the connection)
[3:02] <paravoid> heh
[3:03] <paravoid> maybe I'll work on it when this work storm passes :)
[3:03] <paravoid> dmick_away: yes, that's what I meant
[3:12] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:13] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:18] * epa_ (~epa@84-253-205-45.bb.dnainternet.fi) has joined #ceph
[3:20] <epa_> suppose that you have 11 nodes with ceph utilities installed, and out of those 11, ten would be used for the cluster (one would be an admin node or the like). Is it supported to run "mkcephfs -a -c installation/cluster/ceph.conf -k ceph.keyring"
[3:20] <epa_> on the admin node.
[3:21] <epa_> admin node has ssh access to all 10 nodes, but they do not have ssh access to each other.
[3:21] <Vjarjadian> that sort of setup might imply a single point of failure....
[3:21] <Vjarjadian> which ceph was built to avoid
[3:21] <epa_> how?
[3:22] <epa_> the admin node is not supposed to participate in the cluster in any other way than 'initializing' it
[3:22] * LeaChim (~LeaChim@b01bde88.bb.sky.com) Quit (Ping timeout: 480 seconds)
[3:24] <epa_> instead of running --prepare-monmap, --init-local-daemons, --prepare-mon all separately I'd prefer to use the "mkcephfs -a -c ..."
[3:45] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[4:07] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:40] * buck1 (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has joined #ceph
[4:40] <buck1> is anyone else noticing scheduled regression tests not starting?
[4:41] * sagelap (~sage@210.sub-70-197-131.myvzw.com) Quit (Ping timeout: 480 seconds)
[4:55] * nyeates (~nyeates@c-69-251-194-165.hsd1.md.comcast.net) Quit (Quit: Zzzzzz)
[5:03] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:07] * sagelap (~sage@252.sub-70-197-139.myvzw.com) has joined #ceph
[5:39] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[5:39] * ChanServ sets mode +o scuttlemonkey
[5:46] * buck1 (~buck@c-24-6-91-4.hsd1.ca.comcast.net) has left #ceph
[5:52] * tezra (~rolson@ Quit (Quit: Ex-Chat)
[5:59] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:00] <phantomcircuit> ok this is weird
[6:01] <phantomcircuit> i have a cluster with 1 host, 1 mon, 2 osd's, each osd is on its own disk with its own ssd journal, filesystem is xfs and the journals are the raw block device
[6:02] <phantomcircuit> im consistently seeing slow request warnings despite iostat reporting write throughput well below the disks' capacity
[6:04] <phantomcircuit> 19852.40 KB/s ~60 IOPS
[6:23] <mikedawson> phantomcircuit: ~60IOPS seems about maxed out for a pair of 7200rpm spindles
[6:24] <mikedawson> is this 2x or 3x replication?
[6:24] <phantomcircuit> 2x
[6:24] <phantomcircuit> and that's 60 IOPS per device
[6:25] <mikedawson> that's about all I can get per 7200rpm spindle with ceph
[6:25] <phantomcircuit> fdatasync i get about 100 IOPS
[6:26] <phantomcircuit> but either way im writing in 1MB chunks to an rbd volume
[6:26] <phantomcircuit> is the performance penalty really that great?
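As a back-of-the-envelope check on the iostat figures above, throughput divided by IOPS gives the average size of each request reaching the disk. This is a sketch of that arithmetic only; the function name is an assumption for illustration.

```python
def avg_request_kb(throughput_kb_s, iops):
    """Average request size in KB implied by a throughput and an IOPS figure."""
    return throughput_kb_s / iops

# Figures quoted at [6:04]: 19852.40 KB/s at roughly 60 IOPS per device.
avg = avg_request_kb(19852.40, 60)

if __name__ == "__main__":
    print(f"average request size: {avg:.1f} KB")
```

The result is well under the 1MB chunk size being written to the rbd volume, which hints that writes are being split before they reach the disks.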
[6:28] <mikedawson> I'm in the process of tuning for small io and searching for more performance, but that's pretty consistent with what I'm seeing
[6:31] <phantomcircuit> hmm
[6:31] <phantomcircuit> guess i need more spindles :)
[6:38] <phantomcircuit> sort of seems like the journal should be able to convert sequential writes into sequential io pretty trivially
[6:38] <phantomcircuit> but i guess not :(
[6:45] <mikedawson> phantomcircuit: I would like that too, but either my tuning efforts are failing, or that isn't how Ceph works at present
[6:48] <phantomcircuit> i should stop guessing and start reading
[6:48] <phantomcircuit> heh
[6:48] <phantomcircuit> BUT THATS CRAZYNESS
[6:52] <phantomcircuit> i need a laser printer
[7:04] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[7:14] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) has joined #ceph
[7:26] * CloudGuy (~CloudGuy@5356416B.cm-6-7b.dynamic.ziggo.nl) Quit (Remote host closed the connection)
[7:40] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 17.0.1/20121128204232])
[7:43] * themgt (~themgt@24-177-232-181.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[7:51] * tnt (~tnt@11.97-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:57] <phantomcircuit> doc/architecture.rst
[7:57] <phantomcircuit> there's a typo on line 486
[8:06] * gaveen (~gaveen@ has joined #ceph
[8:12] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[8:23] * Morg (d4438402@ircip3.mibbit.com) has joined #ceph
[8:33] * low (~low@ has joined #ceph
[8:36] * mistur_ is now known as mistur
[8:44] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[8:46] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[8:52] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[8:52] * themgt_ (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) has joined #ceph
[8:58] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Ping timeout: 480 seconds)
[8:58] * themgt_ is now known as themgt
[9:12] * loicd (~loic@magenta.dachary.org) has joined #ceph
[9:24] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[9:28] * gucki (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[9:30] * gucki_ (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[9:34] * Leseb (~Leseb@ has joined #ceph
[9:37] * sagelap (~sage@252.sub-70-197-139.myvzw.com) Quit (Read error: Connection reset by peer)
[9:54] * ScOut3R (~ScOut3R@ has joined #ceph
[9:55] * sagelap (~sage@2600:1013:b003:9e84:c685:8ff:fe59:d486) has joined #ceph
[10:03] * LeaChim (~LeaChim@b01bde88.bb.sky.com) has joined #ceph
[10:06] * agh (~agh@www.nowhere-else.org) has joined #ceph
[10:06] <agh> hello to all, is there a way to encrypt an rbd dev like a qcow2 file ?
[10:07] <tnt> you can use any standard block level crypto available on linux ...
[10:08] <agh> tnt: yes, but if i want to use qemu rbd driver, then it will not work anymore ...
[10:10] <tnt> well you can use crypto root inside your guest.
[10:10] <agh> tnt:mmm... another question
[10:11] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[10:11] <agh> tnt: i see that it's not possible to map a rbd format 2 on a client
[10:11] <agh> tnt: with the kernel rbd module
[10:11] <agh> tnt: is there a trick to do so ?
[10:12] <agh> tnt: In fact, I want to use some features of qcow2, and great features of rbd (like clone)
[10:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[10:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[10:17] <tnt> no, you can't use layering and fmt2 image using the kernel client yet.
[10:17] <agh> tnt: and via the librbd python API ?
[10:22] <tnt> AFAIK that should work.
[10:30] <phantomcircuit> agh, does work
[10:31] <phantomcircuit> it's hard to do performance optimization when the tunables dont tune
[10:31] <phantomcircuit> lol
[10:44] <phantomcircuit> i get the feeling the only way im ever going to be satisfied with performance is if i actually go through the code base
[10:48] <agh> phantomcircuit: and how should i proceed to do qcow2 on an rbd format 2, with librbdpy ?
[10:49] <phantomcircuit> agh, qcow2 is a fileformat so that doesn't make a lot of sense
[10:49] <phantomcircuit> if you're asking how you can take an rbd v2 volume mount it and have qcow2 volumes on it
[10:49] <phantomcircuit> you cant
[10:50] <phantomcircuit> im pretty sure you cant
[10:50] * ScOut3R (~ScOut3R@ Quit (Remote host closed the connection)
[10:50] <wido> you can't put QCOW2 in RBD
[10:50] <wido> you can convert though with qemu-img
[10:51] <wido> from qcow2 to RBD and back
[10:51] <phantomcircuit> wido, he could setup a ludicrously large rbd volume, mount it and then put the qcow2 images on that
[10:51] <Morg> anyone using ceph in production environment?
[10:51] <phantomcircuit> it wouldn't exactly be a good idea though
[10:51] <wido> no, I wouldn't do that
[10:51] <wido> Morg: yes
[10:51] <wido> RBD
[10:51] <wido> No CephFS
[10:52] <wido> phantomcircuit: I agree, you add another filesystem on top of RBD (which can corrupt) and add layers which will cost you performance
[10:52] <wido> You better run directly from RBD
[10:53] <Morg> im looking for most optimal hardware setting for mon/mds nodes
[10:54] <wido> You mention the MDS, you want to run the filesystem CephFS?
[10:54] <Morg> yup
[10:54] <Morg> i know that it's not ready
[10:54] <wido> ok :)
[10:54] <Morg> but still i want to do some tests
[10:55] <wido> For a monitor, make sure you have an odd number
[10:55] <wido> 1 or 3
[10:55] <Morg> i know that ;]
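The advice to run an odd number of monitors follows from majority quorum: an even count needs a larger majority without tolerating any more failures. The sketch below only illustrates that arithmetic; the function names are made up and nothing here queries a real cluster.

```python
def quorum_size(monitors):
    """Smallest strict majority of the given number of monitors."""
    return monitors // 2 + 1

def tolerated_failures(monitors):
    """How many monitors can be down while a majority is still reachable."""
    return monitors - quorum_size(monitors)

if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} mons: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} down")
```

Going from 3 to 4 monitors, for example, raises the quorum to 3 but still tolerates only one failure.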
[10:55] <wido> Metadata servers need something like a quad core and RAM, it will cache the whole filesystem tree
[10:55] <wido> more RAM == better
[10:55] <Morg> i meant ram/cpu config ;]
[10:55] <wido> Morg: virNetTLSContextNewServer
[10:56] <wido> uh
[10:56] <wido> Morg: http://ceph.com/docs/master/install/hardware-recommendations/
[10:56] <Morg> mhm
[10:56] <Morg> well, use the google Luke ;]
[10:56] <wido> There the recommendation is 1GB, but you better go with more
[10:57] <wido> it can never hurt the MDS
[10:57] <Morg> and is it better for osd nodes to have more machines with fewer hdds, or the opposite?
[10:58] <wido> I think it is
[10:58] <wido> The more machines you have, the less the impact of losing one machine is
[10:58] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[10:58] <wido> You can have 6 machines with 8 disks each, or 12 machines with 4 disks each
[10:58] * agh (~agh@www.nowhere-else.org) has joined #ceph
[10:58] <wido> in the last situation losing a machine has less impact on the whole cluster
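The two layouts wido compares hold the same number of disks, so the difference is how much of the cluster disappears with one machine. A small sketch of that comparison, assuming identical disks and evenly distributed data (both assumptions, not statements from the channel):

```python
def host_loss_fraction(hosts, disks_per_host):
    """Fraction of all disks (and, with even data placement, of all data) on one host."""
    total_disks = hosts * disks_per_host
    return disks_per_host / total_disks

if __name__ == "__main__":
    for hosts, disks in [(6, 8), (12, 4)]:
        frac = host_loss_fraction(hosts, disks)
        print(f"{hosts} hosts x {disks} disks: losing one host affects {frac:.1%} of disks")
```

The smaller the fraction, the less data has to be re-replicated when a machine fails.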
[10:59] <Morg> i see
[10:59] * ScOut3R (~ScOut3R@ has joined #ceph
[11:04] <agh> sorry for the delay, Yes i know that it's a bad idea to stack technologies (qcow2 over rbd over that over that, etc)
[11:06] * low (~low@ Quit (charon.oftc.net synthon.oftc.net)
[11:06] * joao (~JL@ Quit (charon.oftc.net synthon.oftc.net)
[11:06] * `10 (~10@juke.fm) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * nwl (~levine@atticus.yoyo.org) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * l3akage (~l3akage@martinpoppen.de) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * wonko_be_ (bernard@november.openminds.be) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * michaeltchapman (~mxc900@ Quit (charon.oftc.net synthon.oftc.net)
[11:06] * jochen (~jochen@laevar.de) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * Lennie`away (~leen@lennie-1-pt.tunnel.tserv11.ams1.ipv6.he.net) Quit (charon.oftc.net synthon.oftc.net)
[11:06] * Anticimex (anticimex@netforce.csbnet.se) Quit (charon.oftc.net synthon.oftc.net)
[11:09] * low (~low@ has joined #ceph
[11:09] * joao (~JL@ has joined #ceph
[11:09] * `10 (~10@juke.fm) has joined #ceph
[11:09] * nwl (~levine@atticus.yoyo.org) has joined #ceph
[11:09] * Karcaw (~evan@68-186-68-219.dhcp.knwc.wa.charter.com) has joined #ceph
[11:09] * jochen (~jochen@laevar.de) has joined #ceph
[11:09] * michaeltchapman (~mxc900@ has joined #ceph
[11:09] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[11:09] * l3akage (~l3akage@martinpoppen.de) has joined #ceph
[11:09] * Lennie`away (~leen@lennie-1-pt.tunnel.tserv11.ams1.ipv6.he.net) has joined #ceph
[11:09] * Anticimex (anticimex@netforce.csbnet.se) has joined #ceph
[11:09] * wonko_be_ (bernard@november.openminds.be) has joined #ceph
[11:09] * ChanServ sets mode +o joao
[11:48] <Kioob`Taff> (10:58:25) wido: The more machines you have, the less the impact of losing one machine is <== I agree. Recovery really slows down ceph
[11:52] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:22] * sleinen (~Adium@2001:620:0:26:7557:f112:4cbc:4422) has joined #ceph
[12:35] * dxd828 (~dxd828@ has joined #ceph
[12:57] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[13:02] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:02] * Aiken (~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f) Quit (Remote host closed the connection)
[13:14] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:20] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[13:22] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[13:22] * loicd (~loic@magenta.dachary.org) has joined #ceph
[13:37] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) has joined #ceph
[13:39] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:40] * dxd828 (~dxd828@ has joined #ceph
[13:41] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[13:41] * agh (~agh@www.nowhere-else.org) has joined #ceph
[13:48] * The_Bishop__ (~bishop@e177088127.adsl.alicedsl.de) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[13:51] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[13:51] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[13:56] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[13:59] * fprudhom (c3dc640b@ircip3.mibbit.com) has joined #ceph
[14:03] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) Quit (Quit: Leaving.)
[14:14] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[14:16] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:21] * dxd828 (~dxd828@ has joined #ceph
[14:24] * ScOut3R_ (~ScOut3R@ has joined #ceph
[14:30] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:31] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[14:31] * ScOut3R (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[14:47] * dxd828 (~dxd828@ has joined #ceph
[14:55] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[14:57] * BManojlovic (~steki@ has joined #ceph
[15:02] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:06] * korgon (~Peto@isp-korex- has joined #ceph
[15:08] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) has joined #ceph
[15:09] * The_Bishop (~bishop@2001:470:50b6:0:3181:5255:f8a7:71b) has joined #ceph
[15:17] <fprudhom> Hi all :) I have a very short question : why has the "hadoop configuration" page lost all information about patching hadoop ? This page is now... hmmm... short... very short :)
[15:17] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[15:17] * andreask1 (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[15:18] <fprudhom> (in brief : just want to know if someone has information about hadoop support since 0.56)
[15:21] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[15:31] * gaveen (~gaveen@ has joined #ceph
[15:36] <iggy> fprudhom: I don't think anybody has messed with it in a while (at least i haven't heard anybody talk about it much)
[15:47] * tnt_ (~tnt@112.169-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[15:49] * tnt (~tnt@11.97-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[15:49] * PerlStalker (~PerlStalk@ has joined #ceph
[15:52] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[15:53] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[15:54] <mikedawson> Is Ceph capable of being tuned to achieve small random write IOPS greater than the capabilities of the spindles when using SSDs for journals?
[15:55] * Morg (d4438402@ircip3.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:55] <nhm> mikedawson: Possibly, but only for short bursts if so.
[15:56] <jtang> nhm: i dont suppose you know if its possible to dump a running configuration from ceph?
[15:56] <nhm> mikedawson: the problem is that eventually the writes need to make it to disk, so you can't do it indefinitely.
[15:56] <nhm> jtang: I think you can now, a while back I was bugging Sage that we needed it. Let me see if I can figure out how to do it.
[15:57] <mikedawson> I need to send ~2000 IOPS at ~16KB with 95% being writes. In the ideal world, I wish the SSD journals could take those small writes and combine them into larger 4MB blocks to be written to the spindles (at a palatable IOPS level).
[15:57] <jtang> heh cool
[15:57] <mikedawson> And this is a 24/7/365 application so short bursts don't help me
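To put numbers on the workload described here, the sketch below works out the sustained write bandwidth and how many 4MB writes per second that would be if small writes could be combined perfectly. Perfect combining is a best-case assumption for illustration; as discussed in the channel, the journal is not believed to merge writes this way.

```python
def coalescing_estimate(iops, io_kb, write_fraction, target_kb):
    """Return (write KB/s, large writes/s) assuming ideal merging into target_kb blocks."""
    write_kb_per_s = iops * io_kb * write_fraction
    return write_kb_per_s, write_kb_per_s / target_kb

# ~2000 IOPS at ~16KB, 95% writes, merged into 4MB (4096 KB) blocks.
kb_per_s, large_writes_per_s = coalescing_estimate(2000, 16, 0.95, 4096)

if __name__ == "__main__":
    print(f"sustained writes: {kb_per_s / 1024:.1f} MB/s")
    print(f"ideal 4MB writes per second: {large_writes_per_s:.2f}")
```

In that ideal case the spindles would see only a handful of large writes per second, which is why the idea is attractive for a workload that never pauses.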
[15:58] <jtang> i was hoping that there was something similar to the gpfs mmbackupconfig command
[15:58] <nhm> mikedawson: yeah, combining small writes to the same object would be nice.
[15:59] <jtang> so i have at least a config for the osd, mds, mon for disaster recovery
[16:00] <nhm> mikedawson: if you have WB cache, the controller may be smart enough to reorder the writes to make sure the seek penalties are minimal.
[16:01] <mikedawson> nhm: Do you know if it's on a development roadmap, or even feasible, to do lots of small-size write IOPS -> SSD Journal -> *insert magic here* -> fewer larger-size IOPS -> Backing Spindles?
[16:02] <mikedawson> nhm: Do you mean RBD in writeback mode?
[16:03] <nhm> jtang: ceph --admin-daemon <daemon> config show
[16:03] <nhm> jtang: if you have an admin daemon available, you can use that to get the current daemon config.
[16:04] <jtang> nhm: hrmm.... seems we're running the stable version of ceph
[16:04] <jtang> is this command only on post 0.48 releases?
[16:04] <nhm> mikedawson: rbd does it with its cache, but if you have an expensive raid controller with writeback cache, it may be able to do some write reordering underneath ceph to minimize seeks.
[16:05] <nhm> jtang: ah, probably
[16:05] <jtang> well when bobtail comes out it will be less of an issue
[16:05] <jtang> i had a poke at the changelog in 0.56, the changes look promising
[16:05] * jtang eyes them deep scrub checks
[16:06] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[16:06] <nhm> jtang: Yeah, getting a couple of bugs worked out, but it should provide a lot of benefits.
[16:06] * agh (~agh@www.nowhere-else.org) has joined #ceph
[16:06] <nhm> jtang: btw, did you see the performance preview?
[16:06] <jtang> what you guys need is a tool for injecting errors into ceph at all levels to stress test the scrubbing
[16:06] <jtang> nhm: yeap, it was a long read
[16:06] <jtang> :)
[16:06] <nhm> jtang: :)
[16:07] <mikedawson> nhm: right now I'm JBOD to embedded SATA ports. Could it be possible that I could get better performance for small-size write IO when using RBD writeback compared to the numbers I am getting from rados bench?
[16:07] <nhm> jtang: I just got done doing parametric sweeps across ceph tunables that generated like 3-4 times the amount of data. ;)
[16:08] <jtang> you tuning ceph with a genetic algorithm or something?
[16:08] <jtang> what sort of workloads you going when testing out tuning parameters?
[16:08] <nhm> jtang: naw, just testing different values for different parameters at different io sizes, different filesystems, etc.
[16:08] <jtang> s/going/using/
[16:09] <jtang> there's a bit of a gap in the storage world for a tool that does genetic/parametric tuning for different workloads
[16:10] <nhm> jtang: just rados bench right now for reads/writes at different sizes. That's what I had tooling for. Soon I'm going to start doing RBD tests with FIO, then probably move to more realistic workloads.
[16:10] <Kioob`Taff> nhm: the journal doesn't combine writes before sending them to the OSD ?
[16:10] <jtang> the openmpi guys are taking that approach and it seems to work
[16:10] <nhm> Kioob`Taff: I don't believe so.
[16:11] <Kioob`Taff> oh :(
[16:11] <jtang> nhm: i'd be interested in seeing random io that fills the cache
[16:11] <jtang> sequential io is often, well not realistic
[16:11] <Kioob`Taff> so, I need to put «Bcache» over journal ? :S
[16:11] <Kioob`Taff> mm... crazy, no ?
[16:12] <Kioob`Taff> (not over journal, but between journal and osd backend)
[16:12] <mikedawson> Kioob`Taff: I've thought about bcache for speeding VM reads to RBD
[16:12] <nhm> jtang: On the plus side, rados bench is just doing object creations and reads. It's sequential within the object, but every object gets put somewhere in a nested directory tree on the OSD.
[16:12] <nhm> Kioob`Taff: I've wanted to test that for a while.
[16:13] <Kioob`Taff> mikedawson: for reads, Ceph is already very good... with good amount of memory
[16:13] <jtang> i need to check what the largest blocksize is for ceph
[16:14] <mikedawson> Kioob`Taff: yeah, all my issues are with small io writes. Wishing that Ceph could do some reordering magic from the SSD Journal
[16:15] <jtang> nhm: will you be doing lustre vs ceph comparisons/benchmarks at any point?
[16:15] <Kioob`Taff> me too mikedawson
[16:16] <jtang> or filesystem XYZ vs ceph for a well defined workload ?
[16:16] <mikedawson> nhm: Have you ever heard of reordering small io writes for greater IOPS as a feature on the Ceph roadmap?
[16:16] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[16:16] <nhm> jtang: I've been kind of hesitant to publicize comparisons with other vendors. I've got some gluster comparisons that I did that our marketing folks wanted to use, but I'd rather have the community do that kind of thing...
[16:17] <jtang> heh yea, i suppose that would be the case, i'd like to do it, but its not my day job ;)
[16:17] <Kioob`Taff> and from what I understand, using a smaller journal (10GB ?) and giving the rest of the SSD to Bcache can improve performance.... but... maybe their «ordering» algo/buffer can be integrated in Ceph :p
[16:18] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[16:18] * agh (~agh@www.nowhere-else.org) has joined #ceph
[16:18] <nhm> jtang: I'm friends with some of the Lustre guys and some of the glusterfs guys. some day I hope we can get together and do some kind of big performance tuning and benchmarking workshop/shootout.
[16:19] * BManojlovic (~steki@ has joined #ceph
[16:19] <nhm> mikedawson: I think right now we are just focusing on eliminating performance bottlenecks in the current code before we move on to that kind of thing.
[16:20] <jtang> what i'd also like to see is a real world "installation fest and review" by a virgin admin of a few of the more established distributed storage systems (with ceph thrown in)
[16:20] <jtang> and see what comes out
[16:20] <jtang> it would be a good marketing piece for me to show to people that i work with
[16:20] <jtang> but i guess that requires an independent lab to do it
[16:21] <mikedawson> nhm: Understood
[16:21] <jtang> i've found that its been hard to train up devs/admins to deal with distributed/parallel systems
[16:22] * The_Bishop (~bishop@2001:470:50b6:0:3181:5255:f8a7:71b) Quit (Quit: Wer zum Teufel ist dieser Peer? Wenn ich den erwische dann werde ich ihm mal die Verbindung resetten!)
[16:22] <jtang> btw im living life on the edge on my linux laptop, i just upgraded to using systemd and btrfs on my root partition
[16:22] <jtang> w00t!
[16:23] <jtang> just waiting for my data to explode
[16:23] <Kioob`Taff> I confirm jtang. It's why I use ceph : it allows having one big VM with failover, and it's easy to use for devs... They are not able to correctly use a filer, a load balancer, etc
[16:23] <mikedawson> Kioob`Taff: Do you know of anyone who has documented the Ceph + Bcache integration?
[16:23] <nhm> jtang: I have to say that glusterfs is very easy to install. I'll be really happy when we can get the ceph setup process to be that easy.
[16:23] <jtang> nhm: we've never had good experiences with glusterfs's "failover"
[16:23] <Kioob`Taff> mikedawson: yes, but only on client side
[16:23] <jtang> it never seemed to failover properly for us
[16:23] <Kioob`Taff> mikedawson: http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/
[16:24] <jtang> Kioob`Taff: heh yea, thats why we're looking at ceph ;)
[16:24] <nhm> jtang: I've heard similar stories.
[16:24] <jtang> and glusterfs kept on changing the command interface and tools which made it hell to maintain
[16:25] <jtang> it was clearly written to scratch an itch and for the small to medium scale when it was first developed
[16:25] <jtang> i guess with RH taking control glusterfs is less likely to go nuts with changing things too drastically
[16:25] <mikedawson> Kioob`Taff: I don't see any mention of bcache at that link
[16:26] <nhm> jtang: interesting, didn't know that. I've basically never tried to maintain it for more than a day or so. We ran it briefly at the supercomputing institute where I used to work, but eventually switched to lustre.
[16:26] <mikedawson> Kioob`Taff: maybe you meant http://www.sebastien-han.fr/blog/2012/11/15/make-your-rbd-fly-with-flashcache/
[16:26] <Kioob`Taff> sorry mikedawson, wrong link. Here http://www.sebastien-han.fr/blog/2012/11/15/make-your-rbd-fly-with-flashcache/, you have Flashcache, not Bcache, but it's near same thing
[16:27] <Kioob`Taff> ;)
[16:28] <nhm> nice!
[16:28] <jtang> nhm: there were also other oddities about it, like you can't change replication factors once the system is setup
[16:29] <jtang> and you had to be careful about recovery or changing the system if you layered the bricks in a certain way
[16:29] * XSBen (~XSBen@ Quit (Quit: Quitte)
[16:29] <jtang> on paper the system looked good till we tried to do things similar to what we can do with gpfs
[16:29] <jtang> the other kicker was the way the self-healing worked (i.e. it didnt) when we last tried it
[16:30] <jtang> and with it not failing over correctly you get the dreaded split-brain scenarios and just get corrupt data
[16:30] <jtang> or files re-appearing randomly after they get deleted
[16:31] <jtang> its probably improved significantly since we last looked at it though
[16:32] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Remote host closed the connection)
[16:34] <jtang> that said, glusterfs was pretty good for write once, read many type workloads
[16:35] <jtang> it sucked for parallel computing or anything that just needed lots of fast read/write operations
[16:40] <mikedawson> Kioob`Taff: I am motivated to try bcache in writeback mode. I'll let you know if I get anywhere
[16:43] <nhm> mikedawson: At what level are you going to try it?
[16:45] <mikedawson> nhm: Not sure exactly what you are asking, but it goes back to my earlier question of how to get the magic in the sequence: Lots of small-size write IOPS -> SSD Journal -> *insert magic here* -> Fewer larger-size IOPS -> Backing Spindles
[16:45] <mikedawson> nhm: I'm thinking the *insert magic here* could be bcache writeback caching
[16:45] <mikedawson> http://bcache.evilpiepirate.org/
[16:46] <nhm> yeah, I was just wondering if you were thinking of using it on the client with RBD, or on the OSD.
[16:46] <mikedawson> " It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or raid array."
[16:47] <mikedawson> I've been watching Bcache for about a year. Just wish it would be integrated into the mainline kernel
[16:48] <mikedawson> nhm: on the OSD for my current issue
[16:50] * sagelap (~sage@2600:1013:b003:9e84:c685:8ff:fe59:d486) Quit (Ping timeout: 480 seconds)
[16:53] <noob2> would you guys say for 72 3TB drives that 4200 PG's are enough? These are split up over 6 servers
[16:53] <noob2> my pool is replica = 3
[16:54] <mikedawson> noob2: think I read # of PGs should be a power of 2
[16:54] <noob2> ah good point. i forgot about that
[16:55] <mikedawson> if that's the case 2048 (a bit low) or 4096 (a bit high)
[16:55] <noob2> yeah 4096
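The 2048-versus-4096 choice above can be reproduced with a small calculation. This is only a sketch: the target of roughly 100 PGs per OSD is an assumed rule of thumb that nobody states in this conversation, and `pg_candidates` is a hypothetical helper name.

```python
def pg_candidates(num_osds, replicas, pgs_per_osd=100):
    """Return (raw_target, lower_power_of_two, upper_power_of_two).

    pgs_per_osd=100 is an assumed rule of thumb, not a value from the log.
    """
    raw = num_osds * pgs_per_osd / replicas
    upper = 1
    while upper < raw:
        upper *= 2
    # If raw is already a power of two, both candidates are the same value.
    lower = upper // 2 if upper > raw else upper
    return raw, lower, upper


raw, lower, upper = pg_candidates(num_osds=72, replicas=3)
print(raw, lower, upper)
```

For 72 OSDs at 3x replication the raw target falls between the two powers of two mikedawson mentions, which is why one looks a bit low and the other a bit high.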
[16:56] <noob2> i seem to have a bottleneck in my cluster but i'm not sure where it is
[16:56] <noob2> my clients are only writing at about 80MB/s to it. they have 2Gb links up to the cluster. the cluster has a replication network and a client network
[16:57] <noob2> when i do the benchmarks to the disk they seem fairly quick. i can post some if you want to see
[16:57] <nhm> noob2: does the throughput increase with more clients?
[16:57] <noob2> i'm not sure. that's a good question. i should ramp it up
[16:58] <mikedawson> noob2: I have bottlenecks too, unfortunately I'm learning more about confirming bottlenecks than I am about fixing bottlenecks
[16:58] * jlogan1 (~Thunderbi@2600:c00:3010:1:519e:fddc:b274:5689) has joined #ceph
[16:59] <mikedawson> noob2: I can get 144MB/s on 8 OSDs over 4 nodes in one of my testbeds with 1GbE for client and 1 GbE for cluster
[17:00] <mikedawson> noob2: that's rados bench with 4MB writes
[17:01] <mikedawson> noob2: 3x replication and XFS
[17:03] <noob2> what commands were you using for the rados bench?
[17:03] <noob2> just so i can duplicate
[17:03] <noob2> yeah i'm on xfs as well with 3x replication
[17:03] <mikedawson> rados -p bench3x bench 30 write -b 16384 -t 50
[17:03] * sagelap (~sage@252.sub-70-197-139.myvzw.com) has joined #ceph
[17:03] <mikedawson> where bench3x is my pool
[17:04] <noob2> gotcha
[17:04] <nhm> mikedawson: how many clients in that test?
[17:04] <mikedawson> actually that's the 16K version
[17:04] <noob2> does that write back and forth to the client also?
[17:04] <mikedawson> rados -p bench3x bench 30 write -t 50 is 4MB
[17:04] <mikedawson> rados -p bench3x bench 30 write -t 50
[17:05] <nhm> mikedawson: not sure how you are getting 144MB/s with 1GbE?
[17:05] <mikedawson> nhm: I run that on one of my four nodes. It saturates my network
[17:05] <noob2> wow
[17:05] <noob2> now this is awesome
[17:05] <nhm> Ah, 144MB/s probably because some of the writes are local.
[17:05] <mikedawson> nhm: some OSDs are local
[17:05] <noob2> i'm going to spread this out over a few clients
[17:06] <mikedawson> 2 OSDs local, and three remote nodes with 2 OSDs each
[17:06] <mikedawson> 152.168MB/s this last run
[17:06] <nhm> mikedawson: noob2: btw, once you get to ~400MB/s you should start running multiple copies of rados bench.
[17:06] <noob2> ok
[17:06] <noob2> checking now
[17:07] <mikedawson> nhm: thrilled with this performance. Now just need some writeback magic!
[17:07] <mikedawson> nhm: or way more spindles than I really want
[17:07] <noob2> nhm: on 2 clients i get 188MB/s and 107MB/s avg
[17:08] <noob2> do i need to delete the rados bench objects after this finishes?
[17:08] <nhm> noob2: newer versions delete the data by default. In 0.48 the objects aren't deleted.
[17:08] <noob2> ok
[17:08] <mikedawson> noob2: it cleans them up at the end if you let it finish. For small write size it can take a significant amount of time
[17:08] <noob2> i'm running 0.56
[17:09] <nhm> noob2: 4MB writes?
[17:09] <noob2> yup 4MB writes
[17:10] <nhm> noob2: was that just 1 client or 2 clients going?
[17:10] <noob2> that was 2 clients going
[17:10] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[17:11] <nhm> hrm, seems low for 72 OSDs.
[17:11] <noob2> it sounds like my bottleneck is the fibre i'm pushing ceph over then
[17:11] <noob2> nhm: should i add a third client?
[17:11] <nhm> noob2: sure
[17:12] <noob2> time to clone some vm's :D and fire them up
[17:13] <noob2> nhm: my servers are using raid 0's on all the disks. They're HP and i've found HP stuff is pretty slow.
[17:13] <nhm> noob2: what kind of OSD<->OSD network?
[17:13] <noob2> this is about par for the course i think
[17:13] <noob2> oh
[17:13] <noob2> 2x1Gb vPC
[17:13] <nhm> bonded?
[17:13] <noob2> and 2x1Gb for the public network
[17:13] <noob2> yup 802.3 bonded
[17:13] <nhm> k
[17:14] <noob2> this test confirmed my bond works :)
[17:14] <janos> is rr bonding (bonding 0 i think) ok for speed?
[17:14] <nhm> so in reality if you are replicating to 2 other OSDs, you'll be limited to about 100MB/s per server.
[17:14] <janos> i haven't done up my home test network yet
[17:15] <noob2> yeah 180MB/s roughly per server for replication
[17:15] <noob2> oh i see what you're saying. you split it. yes that's correct
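The "split" noob2 refers to can be written out as arithmetic. This is a rough sketch under simplifying assumptions that are not from the log: every write lands on a primary OSD on this server, all replicas live on other servers, and protocol overhead is ignored. `primary_write_ceiling` is a hypothetical name.

```python
def primary_write_ceiling(cluster_net_mb_s, replicas):
    """Upper bound on client write MB/s one server can accept as primary.

    Each client byte must be sent out (replicas - 1) times over the
    cluster network, so the usable bandwidth is divided by that count.
    """
    copies_sent = replicas - 1
    if copies_sent == 0:
        return float("inf")  # no replication traffic at all
    return cluster_net_mb_s / copies_sent


line_rate_mb_s = 2 * 1000 / 8  # 2x1Gb bond at theoretical line rate
print(primary_write_ceiling(line_rate_mb_s, replicas=3))
print(primary_write_ceiling(180, replicas=3))  # using noob2's ~180MB/s figure
```

Both results bracket the "about 100MB/s per server" estimate given above.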
[17:15] <noob2> so if i upgraded to 10Gb i'd see a big boost
[17:15] <janos> in cost too!
[17:15] <janos> ;)
[17:15] <noob2> lol!@
[17:15] <noob2> exactly
[17:15] <janos> i really wish 10gbe would come down some
[17:15] <nhm> noob2: Maybe, seems like you should be getting higher performance even with that limitation.
[17:16] <noob2> in like 2 minutes i'll add that 3rd client and see how it performs
[17:16] <iggy> have you tested bw between the hosts with another tool (iperf, etc.)?
[17:17] <iggy> a lot of people expect bonding to help between 2 hosts and 99% of the time it doesn't
[17:17] <nhm> that's true
[17:20] <noob2> yeah iperf confirmed the bonds work
[17:23] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[17:23] * agh (~agh@www.nowhere-else.org) has joined #ceph
[17:28] <noob2> nhm: when i add a 3rd client i get this. MAX 132MB/s, 120MB/s, 148MB/s
[17:28] <noob2> avg is about 80-90MB/s each
[17:29] <noob2> i can see the network is my bottleneck. the disks are barely getting hit
[17:33] * low (~low@ Quit (Quit: Leaving)
[17:34] * ScOut3R_ (~ScOut3R@ Quit (Ping timeout: 480 seconds)
[17:37] <noob2> that benchmark tool is great. really helps you see the problems
[17:39] <noob2> looks like the rados benchmark isn't cleaning up after itself
[17:42] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[17:43] <noob2> are these all the rados bench objects? rb.0.10df.6b8b4567.000000001fd5
[17:44] * joshd1 (~jdurgin@2602:306:c5db:310:4da1:bdc1:80d7:aea4) has joined #ceph
[17:48] <mikedawson> noob2: I create separate pools for benchmarking so I can just blow them out instead of manually cleaning up any remnants
[17:52] * Leseb (~Leseb@ Quit (Quit: Leseb)
[17:59] <janos> i currently have osd journals on the osd disks. if i were to add an ssd, partition it, and use partitions for the journals - is that a big deal to do?
[17:59] <phantomcircuit> so i notice that with replicated objects acknowledgement of completion is delayed until both the primary and the replicas have all applied the update
[17:59] * Kioob`Taff (~plug-oliv@local.plusdinfo.com) Quit (Quit: Leaving.)
[17:59] <janos> like can i just make configuration changes, propagate them, and restart those osd's?
[18:01] <phantomcircuit> it seems like the replica is actually applying the update to the filestore before the primary even tries to apply the update
[18:01] <phantomcircuit> it would seem that replying on the journal there could significantly reduce the io path latency
[18:01] <phantomcircuit> but im guessing im missing something
[18:01] <iggy> janos: I think it goes something like: drain journal, down osd, change config, up osd
[18:02] <iggy> janos: but double check with one of the devs
[18:02] <janos> i have never heard of journal draining
[18:02] <janos> i will have to read up!
[18:02] <joshd1> janos: you have to shut down the osd, run ceph-osd -i N --flush-journal, change config for that osd, and ceph-osd -i N --mkjournal
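The order of operations joshd1 describes can be captured as a checklist generator. This is a sketch only: the two `ceph-osd` invocations are taken from the message above, while the stop, config-edit, and start steps are left as manual placeholders because the log does not say which commands to use. The final restart step is implied rather than stated, and `journal_move_steps` is a hypothetical helper.

```python
def journal_move_steps(osd_id):
    """Return the ordered steps for moving one OSD's journal.

    "command" entries come from the conversation; "manual" entries are
    placeholders for site-specific actions that the log does not spell out.
    """
    return [
        ("manual", f"stop osd.{osd_id}"),
        ("command", f"ceph-osd -i {osd_id} --flush-journal"),
        ("manual", f"point osd.{osd_id}'s journal setting in ceph.conf at the new SSD partition"),
        ("command", f"ceph-osd -i {osd_id} --mkjournal"),
        ("manual", f"start osd.{osd_id} again"),  # implied, not stated in the log
    ]


for kind, step in journal_move_steps(3):
    print(f"[{kind}] {step}")
```

Printing the steps instead of executing them keeps the sketch safe to run anywhere; the key point is that the flush happens before the config change and the new journal is created after it.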
[18:02] <janos> very nice
[18:02] <janos> thank you both
[18:02] <nhm> phantomcircuit: hrm, I think it should be applying the replica update and doing the filestore commit concurrently, but we should really ask Sam for confirmation.
[18:03] <phantomcircuit> nhm, that's what i figured but im looking at logs of slow writes that seem to say differently
[18:03] <mikedawson> joshd1: do you know of anyone using bcache with Ceph? Trying to greatly increase small random IO write IOPS without throwing spindles at the problem
[18:04] <gregaf> phantomcircuit: nhm: the primary will do its local journal-and-apply at the same time as it's sending the data out to the replicas to do journal-and-apply there; it's not waiting for the replicas to complete before doing the local write
[18:04] <joshd1> mikedawson: I know some folks here have used it, can't remember who though
[18:04] <gregaf> it *does* wait for the replica to complete before replying to the client
[18:04] <mikedawson> joshd1: I'm thinking of trying bcache in writeback mode
[18:04] <nhm> gregaf: yes, that was my understanding as well.
[18:04] <gregaf> and the reply comes back once the write is on durable media, whether that's the journal or the backing store
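The ordering gregaf describes (local and replica writes proceed concurrently, and the client reply waits for all of them) can be modelled with a toy simulation. This is not Ceph code: the delays are invented, and `journal_and_apply` and `primary_handle_write` are hypothetical names used only to illustrate the control flow.

```python
import asyncio


async def journal_and_apply(name, delay, log):
    # Stand-in for a write reaching durable media on one OSD.
    await asyncio.sleep(delay)
    log.append(f"{name} durable")


async def primary_handle_write(log, local_delay=0.02, replica_delays=(0.01, 0.03)):
    # The primary starts its own write and the replica writes together;
    # it does not wait for the replicas before writing locally.
    tasks = [journal_and_apply("primary", local_delay, log)]
    tasks += [
        journal_and_apply(f"replica{i}", d, log)
        for i, d in enumerate(replica_delays, start=1)
    ]
    # The client is answered only once every copy is durable.
    await asyncio.gather(*tasks)
    log.append("ack to client")


log = []
asyncio.run(primary_handle_write(log))
print(log)
```

With these made-up delays a replica can become durable before the primary does, which is consistent with what phantomcircuit reports seeing in the logs without implying that the primary waits on the replica.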
[18:04] <phantomcircuit> hmm ok
[18:05] <phantomcircuit> im trying to figure out why im getting ~10MB/s over rbd on a 1gbps link where the underlying filestore disks benched at 130 MB/s and the journal at the same
[18:06] <nhm> gregaf: perhaps if the local filestore was backed up and the remote OSD for the replica was not, the behavior would look like what phantomcircuit is seeing.
[18:07] <gregaf> umm, maybe, yeah
[18:07] <joshd1> mikedawson: it should be safe, assuming bcache has no bugs in recovering after power loss
[18:07] <gregaf> phantomcircuit: are your RBD writes the same as your filestore disk writes (they probably aren't ;) )
[18:08] <gregaf> eg, you're doing sync writes and aren't using RBD caching or something
[18:08] <mikedawson> joshd1: yeah that worries me a bit. Don't think fsck works with bcache in the mix at present
[18:08] <phantomcircuit> gregaf, im pretty sure it's async writes
[18:09] <phantomcircuit> i just did rbd map and am using dd to push data without any fsync or direct flags
[18:09] <gregaf> well nhm is the benchmarker and joshd1 is the rbd guy, maybe they have ideas :)
[18:10] <phantomcircuit> heh
[18:10] <nhm> phantomcircuit: what size writes?
[18:10] <gregaf> I just got back from vacation and am frantically trying to catch up and go over the CephFS backlog for sprint planning ;)
[18:10] <phantomcircuit> nhm, 1MB
[18:10] <phantomcircuit> actually the writes on the wire are whatever the kernel on the client is deciding to do since it's not synchronous
[18:11] <nhm> phantomcircuit: what does rados bench from the same host give you for 1MB writes? The workload is a bit different so the results aren't directly comparable, but it's a start.
[18:12] <nhm> gregaf: yes, get to work! ;)
[18:12] <phantomcircuit> gimme a minute i have to wait for the dirty writes to finish
[18:12] <phantomcircuit> takes about 40 seconds...
[18:12] <nhm> <whip crack>
[18:13] <phantomcircuit> or more
[18:13] <nhm> yeah, I get around that by stopping ceph and reformatting. ;)
[18:13] <phantomcircuit> lol
[18:15] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[18:15] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[18:17] * sagelap (~sage@252.sub-70-197-139.myvzw.com) Quit (Read error: Connection reset by peer)
[18:24] <phantomcircuit> nhm, how do you set the blocksize with rados bench?
[18:25] <phantomcircuit> actually doesn't matter default is 4MB writes
[18:25] <phantomcircuit> Bandwidth (MB/sec): 17.704
[18:25] <phantomcircuit> single client
[18:25] <phantomcircuit> so synchronous performance
[18:25] <phantomcircuit> which is expected
[18:25] <phantomcircuit> maybe rbd map isn't actually async?
[18:26] <mikedawson> phantomcircuit: rados -p bench3x bench 30 write -b 16384 -t 50 <- use the -b flag to set the write size
[18:27] <phantomcircuit> oh i see it's in global options in the help menu and not benchmark options
[18:28] <nhm> phantomcircuit: also, -t is the number of concurrent requests
[18:28] <nhm> phantomcircuit: how many OSDs do you have?
[18:28] <phantomcircuit> 2
[18:28] <phantomcircuit> heh
[18:29] <nhm> 2x replication?
[18:29] <phantomcircuit> yeah
[18:29] <nhm> Ok. That still seems basically terrible.
[18:29] <nhm> What version of ceph?
[18:29] <phantomcircuit> yup
[18:29] <phantomcircuit> 0.55.1
[18:29] <nhm> XFS?
[18:29] <phantomcircuit> yup
[18:30] <phantomcircuit> 9 GB journal on ssd
[18:30] <nhm> what mkfs/mount options?
[18:30] <phantomcircuit> not sure about mkfs mount options just have noatime set
[18:31] <nhm> phantomcircuit: might want to try inode64.
[18:31] <nhm> what controller are you using?
[18:32] <nhm> actually, could you create a pool with 1x replication and try the same test?
[18:32] <phantomcircuit> onboard sata controller
[18:32] <phantomcircuit> C600/X79 series mobo
[18:33] <phantomcircuit> im not expecting amazing performance but ~1/10th of raw disk throughput seems a bit low
[18:33] * sagelap (~sage@2607:f298:a:607:3c5a:2c79:5f87:1f13) has joined #ceph
[18:33] <mikedawson> nhm: I'm using inode64 now. Seems to work, but I don't have benchmarking with/without
[18:33] <mikedawson> nhm: do you still use noatime as well?
[18:34] <nhm> mikedawson: yes to both
[18:35] <nhm> mikedawson: Christoph Hellwig made some other recommendations on the mailing list that may also be potentially helpful.
[18:35] <nhm> haven't gotten to testing them yet.
[18:37] * ScOut3R (~ScOut3R@dsl5401A397.pool.t-online.hu) has joined #ceph
[18:38] <mikedawson> nhm: I'm using them (with the exception of I removed noatime) after reading http://xfs.org/index.php/XFS_FAQ#Q:_Is_using_noatime_or.2Fand_nodiratime_at_mount_time_giving_any_performance_benefits_in_xfs_.28or_not_using_them_performance_decrease.29.3F
[18:39] <mikedawson> but then I saw something old from Tommi saying noatime is your friend
[18:39] <mikedawson> so you are the tie breaker. shutting down cluster; adding noatime
[18:40] <phantomcircuit> nhm, even with size = 1 performance is terrible
[18:40] <phantomcircuit> Bandwidth (MB/sec): 40.437
[18:40] <phantomcircuit> 50 concurrent clients 1 MB blocks
[18:42] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:43] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[18:44] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[18:46] <nhm> phantomcircuit: that's rbd or rados bench?
[18:46] <phantomcircuit> rados bench
[18:47] <nhm> phantomcircuit: do 4MB blocks do any better?
[18:47] <phantomcircuit> im starting to think something has gone wrong with these disks between testing them as raw partitions and getting ceph working
[18:50] <nhm> phantomcircuit: you might also try "ceph osd tell 0 bench"
[18:51] <phantomcircuit> 2013-01-04 17:52:37.014788 33067fff700 0 log [INF] : bench: wrote 1024 MB in blocks of 4096 KB in 9.105923 sec at 112 MB/sec
[18:51] * fzylogic (~fzylogic@ has joined #ceph
[18:51] <nhm> ok, so it looks like the filestore is doing its job.
[18:51] <phantomcircuit> 2013-01-04 17:53:06.870398 osd.1 60 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 7.978292 sec at 128 MB/sec
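The two bench lines pasted above can be sanity-checked by recomputing the rate from the size and duration they report. A minimal sketch, assuming the message format stays exactly as pasted here; `parse_bench` and the regular expression are hypothetical helpers, not part of any Ceph tool.

```python
import re

# Pattern written against the two log lines quoted in the conversation.
BENCH_RE = re.compile(
    r"bench: wrote (\d+) MB in blocks of (\d+) KB in ([\d.]+) sec at (\d+) MB/sec"
)


def parse_bench(line):
    """Extract the figures from an OSD bench log line, or return None."""
    m = BENCH_RE.search(line)
    if m is None:
        return None
    total_mb, block_kb, seconds, reported = m.groups()
    return {
        "total_mb": int(total_mb),
        "block_kb": int(block_kb),
        "seconds": float(seconds),
        "reported_mb_s": int(reported),
        "computed_mb_s": int(total_mb) / float(seconds),
    }


osd0 = parse_bench("log [INF] : bench: wrote 1024 MB in blocks of 4096 KB in 9.105923 sec at 112 MB/sec")
osd1 = parse_bench("osd.1 60 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 7.978292 sec at 128 MB/sec")
print(osd0["computed_mb_s"], osd1["computed_mb_s"])
```

The recomputed rates agree with the reported ones once truncated to whole MB/sec, so both filestores are writing at roughly raw-disk speed when exercised directly.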
[18:51] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[18:51] <phantomcircuit> yup
[18:51] <phantomcircuit> both of them
[18:51] <nhm> phantomcircuit: did you say you've done iperf tests?
[18:52] <phantomcircuit> i didn't but i have
[18:52] <nhm> They were fine too?
[18:52] <phantomcircuit> yup
[18:52] <phantomcircuit> 907 Mbits/sec
[18:53] <phantomcircuit> [ 4] 0.0-10.0 sec 1.06 GBytes 907 Mbits/sec
[18:53] <nhm> latency is good on the link?
[18:53] <phantomcircuit> <1 ms
[18:53] <phantomcircuit> rtt min/avg/max/mdev = 0.292/0.331/0.368/0.034 ms
[18:54] <phantomcircuit> rtt min/avg/max/mdev = 0.121/0.188/1.509/0.049 ms, ipg/ewma 0.231/0.190 ms
[18:54] <phantomcircuit> ping -f
[18:54] <nhm> phantomcircuit: I don't remember exactly how to do this, but you could try creating a pool with just osd0, make it 1x replication, and then try rados bench locally on that node.
[18:54] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[18:54] <phantomcircuit> yeah i can do that by messing around with a crush map
[18:55] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:58] <nhm> phantomcircuit: for a point of comparison, look at the XFS results on something like the SAS2208 in JBOD mode here for 4MB writes with XFS: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[18:59] * gucki_ (~smuxi@46-126-114-222.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[18:59] <nhm> phantomcircuit: that's 6 spinning disks, and XFS does about 60MB/s per disk.
[18:59] * gucki (~smuxi@46-126-114-222.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[18:59] <phantomcircuit> hmm
[18:59] <phantomcircuit> so maybe this is the top performance here
[18:59] <nhm> phantomcircuit: more like 80MB/s per disk with other controllers.
[19:00] <mikedawson> Played with increasing filestore max sync interval and filestore min sync interval. Here's the odd part: benchmarking 16K with rados bench performance got worse for my 3x replication pool as those settings increased, but performance got better for the 2x pool. Any ideas?
[19:00] <nhm> phantomcircuit: tough to say. Maybe there is some aspect of the ceph workload that controller really hates.
[19:01] <phantomcircuit> nhm, well i guess this just means at somepoint i'll think of btrfs as stable and will get a nice "free" performance boost :)
[19:01] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[19:01] <mgalkiewicz> hi guys
[19:01] <mgalkiewicz> why do you send all those patches to mailing list? what is the point?
[19:01] <nhm> phantomcircuit: just be aware that btrfs at least used to have some bad fragmentation issues over time.
[19:02] <mgalkiewicz> if it is inevitable could you please create separate address for this purpose?
[19:02] <nhm> phantomcircuit: It looks great in these tests but I don't know if it will stay that fast as the cluster ages, even with newer kernels.
[19:05] <phantomcircuit> nhm, the weird thing is the filestore disk itself doesn't see more than 20 MB/s and seems to be averaging only about 60 writes/second
[19:05] <mikedawson> nhm: going from inode64 to noatime,inode64 seems to have given me ~10% throughput boost for 3x and ~27% throughput boost for 2x
[19:05] <mikedawson> for 16K writes
[19:06] <mikedawson> nhm: apparently "All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. " doesn't apply to Ceph.
[19:07] <nhm> mikedawson: heh, who knows.
[19:07] <nhm> mikedawson: Glad it helped at least!
[19:08] <nhm> phantomcircuit: if you try something like collectl -sD -oT you can see how long IOs are waiting in the queue and what the service time is.
[19:13] * sjustlaptop (~sam@ has joined #ceph
[19:17] * loicd (~loic@ has joined #ceph
[19:17] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:19] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[19:19] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[19:19] * Leseb_ is now known as Leseb
[19:19] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[19:19] * agh (~agh@www.nowhere-else.org) has joined #ceph
[19:20] <mikedawson> joshd1: how can I verify if an RBD volume has writeback enabled? Is there anything to tune?
[19:21] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[19:22] <joshd1> mikedawson: you can enable the admin socket and use the 'config show' command to display the configuration. you can tweak the cache size, max dirty, and max age: http://ceph.com/docs/master/rbd/rbd-config-ref/
[19:23] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[19:24] <gregaf> mgalkiewicz: you should tell that to elder ;), but we are reviewing when it'll be appropriate to split up into ceph-devel and ceph-user lists and if you have input I'm sure scuttlemonkey or rturk would be interested (they'll be soliciting at some point when discussions get more serious)
[19:25] * korgon (~Peto@isp-korex- Quit (Quit: Leaving.)
[19:28] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[19:28] <mgalkiewicz> gregaf: it is really annoying when sb is sending a lot of patches which are irrelevant to user like me
[19:28] <mgalkiewicz> but why do you do this at all?
[19:28] <mgalkiewicz> isnt pull request on github much more convenient?
[19:29] <gregaf> Alex is following kernel best practices by sending patches to the appropriate developer list *shrug*
[19:29] <gregaf> and external developers are often more comfortable with that than github pull requests, depending on their background
[19:29] <gregaf> the list is filling two roles right now
[19:30] <mgalkiewicz> so the question is whether it is necessary to put ceph list in cc
[19:30] <joao> in case anyone has time, a review on wip-3633 on teuthology's git would be appreciated :)
[19:30] <gregaf> no, I'm saying the ceph list is the target — it is currently serving the roles of both a user and developer list
[19:30] <gregaf> at some point that needs to split up but it hasn't been appropriate in the past; it may be appropriate soon
[19:32] <mgalkiewicz> I would strongly recommend splitting them up:)
[19:33] * sjustlaptop (~sam@2607:f298:a:607:1e5:e1a9:6708:a1b8) has joined #ceph
[19:33] <noob2> mike: yeah that would've been smart for me to do. make another pool
[19:39] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[19:41] <noob2> has anyone tried the --shared tag command with rbd?
[19:41] <noob2> i'm just curious
[19:42] <elder> mgalkiewicz, I think Greg did a nice job explaining why patches are sent to the list.
[19:42] <elder> Yes, a separate list would be great.
[19:42] <elder> Right now, this is what we have.
[19:43] * Ryan_Lane (~Adium@ has joined #ceph
[19:44] * morse (~morse@supercomputing.univpm.it) Quit (Remote host closed the connection)
[19:46] <elder> Your mail software can probably filter PATCH messages into a separate place. (But I'm not saying I think that should be required; a separate mailing list for patches would work well.)
[19:48] <phantomcircuit> nhm, i think one of my disks is crapping out
[19:48] <phantomcircuit> there's 1 osd on each of 2 disks
[19:48] <phantomcircuit> 1 of the disks has *much* higher io wait times
[19:52] <phantomcircuit> yeah im consistently seeing the other disks queue totally empty
[19:52] <phantomcircuit> sigh
[19:52] <phantomcircuit> stupid hitachi...
[19:52] <mgalkiewicz> elder: for now filter is a good idea
[19:53] <nhm> phantomcircuit: ah, that does sound suspect!
[19:53] <phantomcircuit> nhm, of course this disk passed a SMART long self test like a week ago
[19:53] <phantomcircuit> alas
[19:53] <nhm> phantomcircuit: just think of it as hitachi's gift of job security. ;)
[19:54] <noob2> lol
[19:55] <phantomcircuit> nhm, if only i could
[19:56] <phantomcircuit> i only get paid if i can make this work
[19:56] <phantomcircuit> thus my interest in making it work :)
[19:58] <nhm> phantomcircuit: I think usually you are supposed to charge extra if it works.
[19:59] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[19:59] <phantomcircuit> nhm, lol
[20:00] <phantomcircuit> nhm, the great thing about SMART is that there is zero indication of issues with this drive
[20:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[20:00] <phantomcircuit> :|
[20:00] <phantomcircuit> why thank you smart
[20:00] <joao> not so smart then eh?
[20:00] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[20:01] * joao feels bad about such a trivial pun
[20:01] <phantomcircuit> i feel like you should feel bad
[20:01] <nhm> phantomcircuit: Aren't you using some kind of on-board controller too?
[20:01] * joao sets mode -o joao
[20:01] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[20:02] <mikedawson> joshd1: I was setting rbd cache = true under [global] and it shows up as enabled in the admin socket properly. The doc you referred to sets it under [client]. When I move it there and restart, the admin socket shows it as False.
[20:03] <mikedawson> joshd1: do I need it in both [global] and [client] ? is the doc wrong?
[20:03] <joshd1> mikedawson: something's very strange if it's not working in [client]
[20:04] <phantomcircuit> nhm, yeah this is basically setup to fail
[20:04] <phantomcircuit> but hey what else are you gonna do when your budget is only 400 EUR/month
[20:05] <nhm> phantomcircuit: you know how you can make HDs play music? I bet if you listened closely you would hear "We're not gonna take it" playing.
[20:05] <joshd1> mikedawson: config is a hierarchy of [global] -> [type] -> [type.id], and setting it anywhere there should work for type=client and id=cephx user
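The [global] -> [type] -> [type.id] hierarchy can be illustrated with a small lookup function. This is a sketch of the idea, not Ceph's actual parser: it assumes the most specific section wins, and the `admin` id and the `resolve` helper are hypothetical.

```python
def resolve(config, entity_type, entity_id, key, default=None):
    """Look a key up from most specific to least specific section."""
    for section in (f"{entity_type}.{entity_id}", entity_type, "global"):
        if key in config.get(section, {}):
            return config[section][key]
    return default


# Hypothetical ceph.conf contents, represented as nested dicts.
conf = {
    "global": {"rbd cache": "false"},
    "client": {"rbd cache": "true"},
}

print(resolve(conf, "client", "admin", "rbd cache"))  # what a client would see
print(resolve(conf, "osd", "0", "rbd cache"))         # what an OSD would see
```

Under this model a daemon of another type only sees the [global] value, which fits the observation that an OSD's admin socket reports something different from what a client process reads out of [client].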
[20:05] <ScOut3R> joshd1: it's not working for me under [client] either
[20:06] <joshd1> ScOut3R: mikedawson: how are each of you running it? via qemu and libvirt?
[20:06] <ScOut3R> joshd1: using plain rbd mapping
[20:06] <mikedawson> libvirt
[20:06] <joshd1> ScOut3R: it doesn't affect the kernel rbd module
[20:07] <ScOut3R> joshd1: ah, good to know, how can i use it with the kernel module?
[20:07] <joshd1> ScOut3R: you can't, you just have to rely on the page cache/fs on top of the block device
[20:08] <nhm> <cough> feature request! <cough>
[20:08] <janos> (says the person who won't be implementing it)
[20:08] <janos> ;)
[20:08] <joshd1> yeah, it's been suggested before, but there's still other feature catch up as well
[20:08] <nhm> janos: ;)
[20:09] * joshd1 (~jdurgin@2602:306:c5db:310:4da1:bdc1:80d7:aea4) has left #ceph
[20:09] <nhm> joshd1: indeed. I imagine it's not something particularly fun to implement in the kernel anyway.
[20:09] * joshd1 (~jdurgin@2602:306:c5db:310:4da1:bdc1:80d7:aea4) has joined #ceph
[20:09] <nhm> oh
[20:09] <mikedawson> joshd1: http://pastebin.com/pqasP6Kf
[20:10] <joshd1> mikedawson: phew, that's normal
[20:11] <joshd1> mikedawson: it's a client side setting, it doesn't matter to the osd
[20:11] * dmick_away is now known as dmick
[20:11] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) has joined #ceph
[20:11] <joshd1> mikedawson: if you add admin socket = /some/path to the [client] section you should see it enabled while qemu is running (assuming qemu can write to that path)
[20:12] <ScOut3R> joshd1: thanks for the clarification
[20:13] <joshd1> mikedawson: the osd will still have the setting when it's in the global section, but it doesn't do anything
[20:14] <mikedawson> joshd1: so I should only have rbd cache under [client], and not under [global], right?
[20:14] <joshd1> mikedawson: yeah, for a little cleaner config
[20:15] <mikedawson> joshd1: Something like admin socket = /var/run/ceph/client.asock under [client]?
[20:16] <joshd1> yeah, or client.$pid.asock so you can tell which one it is
[20:16] <joshd1> $pid is in 0.56
[20:16] * sleinen (~Adium@2001:620:0:26:7557:f112:4cbc:4422) Quit (Quit: Leaving.)
[20:16] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[20:16] <dmick> hey joshd1, wanna come to standup?
[20:17] * sleinen1 (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[20:17] <mikedawson> joshd1: Thanks!
[20:17] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Read error: Connection reset by peer)
[20:17] <elder> glowell1, ping
[20:17] <glowell1> yes
[20:17] <elder> I'd like to talk to you (in about 10 minutes) about
[20:18] * sleinen (~Adium@2001:620:0:25:a40f:45b:9633:136) has joined #ceph
[20:18] <glowell1> ok
[20:21] <buck> question for the room: we need to expose the minimum stripe size in the java bindings so that we can prevent Hadoop from using a bad value. Is that something that should be exposed via libcephfs or is it better suited for our JNI code?
[20:22] * sjustlaptop (~sam@2607:f298:a:607:1e5:e1a9:6708:a1b8) Quit (Read error: Operation timed out)
[20:22] * Cube (~Cube@ has joined #ceph
[20:24] <mikedawson> joshd1: got a client.17655.asock after that, but I don't have a pid 17655
[20:25] * sleinen1 (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[20:25] <mikedawson> root@node1:~# ceph --admin-daemon /var/run/ceph/client.17655.asock config show | grep rbd
[20:25] <mikedawson> connect to /var/run/ceph/client.17655.asock failed with (111) Connection refused
[20:27] <gregaf> buck: not sure what you're asking about — minimum stripe size?
[20:29] * themgt (~themgt@71-90-234-152.dhcp.gnvl.sc.charter.com) Quit (Quit: themgt)
[20:30] * gucki (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[20:30] * gucki_ (~smuxi@46-126-114-222.dynamic.hispeed.ch) has joined #ceph
[20:31] <buck> gregaf: with open_layout you can specify the stripe unit, stripe count and object size. I guess I mean stripe unit & object size (we use the specified blocksize in Hadoop for both and a stripe count of 1)
[20:31] * terje__ (~terje@71-218-25-108.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:31] <gregaf> is there not a way to get that data out of libcephfs already via the layout info?
[20:32] <buck> *scurries off to double check
[20:32] * terje (~joey@71-218-25-108.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:32] <nwat> buck: I believe only the defaults are exposed (but now that is deprecated)
[20:34] <buck> in libcephfs.cc I see the call to get the defaults (deprecated) and calls to query existing files for their layout info, but not a way to get CEPH_MIN_STRIPE_UNIT
[20:34] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) Quit (Quit: Leaving.)
[20:35] <joshd1> mikedawson: that 'connection refused' means (like you said) the process isn't running anymore
[20:36] <mikedawson> joshd1: how does that process get started?
[20:36] <joshd1> mikedawson: it's whatever process is using librbd, so qemu/kvm in this case
[20:36] <buck> gregaf: given that it's not exposed, are you ok with it being exposed (in libcephfs) ?
[20:37] <gregaf> there should certainly be a get_layout function of some kind
[20:37] <joshd1> mikedawson: but apparmor or plain old unix permissions might prevent it from writing to /var/run/ceph
[20:37] <gregaf> oh, I see, you're after that
[20:37] <dmick> gregaf: and there is, but not the define
[20:37] <gregaf> isn't it a #define or something?
[20:37] <gregaf> ugh
[20:37] <dmick> #define CEPH_MIN_STRIPE_UNIT 65536
[20:37] <dmick> ceph_fs.h
[20:38] <mikedawson> joshd1: service libvirt-bin restart?
[20:38] <gregaf> my inclination would be to just re-implement it, although I'm not sure if that's appropriate or not
[20:38] <gregaf> I'm irritated to learn that one exists as a #define
[20:38] <joshd1> mikedawson: detaching / re-attaching an rbd device would do it, or restarting a vm using one
[20:38] <mikedawson> joshd1: where do ceph client log messages end up?
[20:39] <buck> gregaf: so just add a corresponding static definition somewhere (preferences to libceph vs the jni code?)
[20:39] <noob2> should i set my journal size on my osd's to match up with my raid cache on the card? i have a 1GB flash cache so i thought a 1 GB journal would be a good match up
[20:40] <joshd1> mikedawson: by default /var/log/ceph/ceph-client.$id.log iirc, but again qemu might not have permission to write there
[20:40] <mikedawson> ahh I don't have any of those
[20:40] <dmick> buck: can the set_layout just be allowed to fail, and hadoop cope with that, if it's tried an illegal-according-to-libcephfs layout?
[20:41] <dmick> there are other reasons it will fail, of course
[20:42] <buck> dmick: um....this is coming out of Hadoop calling create() for a file.....so allowing it to fail seems awkward
[20:42] <dmick> (and in fact looking at ceph_file_layout_is_valid, it's not clear that MIN has much semantic meaning; looks more like a required divisor of stripe_unit)
[20:43] <dmick> buck: how are the layout parms chosen? at the open(), or before?
[20:46] <buck> dmick: I'm still pretty new to this (Noah was working with Sage on it) but AFAIK, normally this will all be handled by defaults (the user just specifies the already existing pg and things work) but in this case, the Hadoop tests are setting the block size (manually) and we're trying to respect it
[20:47] <buck> dmick: and we're trying to support the HDFS semantics (as much as possible).
[20:48] * sleinen (~Adium@2001:620:0:25:a40f:45b:9633:136) Quit (Read error: No route to host)
[20:48] * brady (~brady@rrcs-64-183-4-86.west.biz.rr.com) has joined #ceph
[20:49] <wido> #685 == Verified: Yay!
[20:51] <dmick> buck: I wonder if you want to call the validation somehow (which might be a new interface)
[20:51] * sleinen (~Adium@2001:620:0:25:a40f:45b:9633:136) has joined #ceph
[20:51] <nwat> dmick: we'll let open() fail in a custom-layout situation, but for the common case of only being given a user-specified block size, we use stripe unit == block size, stripe count = 1. To support existing Hadoop code and assumptions, we need to support small block sizes (e.g. 1024), and the best way to do this is to log the situation and use Ceph's minimum.
[20:54] <buck> dmick: nwat: it seems like we either need to just outright fail on invalid combinations (and break some existing hadoop code/tests/apps) or else have a mechanism somewhere that sorts out what would work (by using the ceph minimum in this case). I'd vote for the latter and suggest we put that mechanism in the hadoop bindings for Ceph (rather than make the caller do it), since that would leave existing Hadoop code working (albeit with adjusted block sizes in so
[20:57] * jbarbee (17192e61@ircip4.mibbit.com) has joined #ceph
[21:00] <dmick> I don't know enough about how important the interplay between Hadoop block-size requests and Ceph's servicing of them is to have much useful comment, I guess.
[21:01] <dmick> but I'm a little weirded out by a distributed FS with a 1k block size :)
[21:02] <buck> dmick: yeah, it's an odd test case, for sure.
[21:02] <dmick> (limits, stress, etc. I understand.)
[21:02] <nwat> There is no reason we need to actually use 1K blocks :) We just need to do something reasonable when Hadoop clients ask for that.
[21:02] <dmick> yeah
[21:02] <dmick> to quote Ed Wood Jr.: "stupid, stupid clients with their stupid, stupid minds"
[21:06] <dmick> I guess I'd lean toward a functional interface for the limit, because I suspect it's fairly arbitrary, and who knows, someday we might decide smaller blocks are useful for testing at least, but I'm interested in gregaf's thoughts
[21:08] <buck> dmick: i im'd greg. He's going to weigh in shortly
[21:08] <dmick> but do at least note ceph_file_layout_is_valid and that su and os must be *multiples* of "MIN", not just limited by MIN
[21:08] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[21:08] * agh (~agh@www.nowhere-else.org) has joined #ceph
[21:09] <buck> dmick: yep. I'm hoping to basically replicate the is_valid logic and take corrective action if those checks are not true (rather than failing)
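The checks being discussed can be sketched as follows. This is a paraphrase in Python, not the actual ceph_file_layout_is_valid code: the function name `layout_is_valid` is hypothetical, the value 65536 is the one quoted from ceph_fs.h earlier in the log, and the rule that object size must be a whole multiple of stripe unit is an assumption based on the "multiplication rules" mentioned in the channel.

```python
CEPH_MIN_STRIPE_UNIT = 65536  # value quoted from ceph_fs.h in the log


def layout_is_valid(stripe_unit, stripe_count, object_size,
                    granularity=CEPH_MIN_STRIPE_UNIT):
    """Return True if the layout passes the checks described in the channel."""
    # Stripe unit and object size must be non-zero multiples of the
    # granularity, not merely greater than or equal to it.
    if stripe_unit <= 0 or stripe_unit % granularity != 0:
        return False
    if object_size <= 0 or object_size % granularity != 0:
        return False
    # Assumed rule: object size is a whole multiple of stripe unit.
    if object_size < stripe_unit or object_size % stripe_unit != 0:
        return False
    # Stripe count must be non-zero.
    return stripe_count > 0


print(layout_is_valid(65536, 1, 65536))  # a 64k layout passes
print(layout_is_valid(1024, 1, 1024))    # the 1k Hadoop test case fails
```

Note that a value such as 98304 (96k) also fails even though it is above 65536, which is dmick's point that the constant acts as a required divisor rather than a simple minimum.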
[21:10] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Ping timeout: 480 seconds)
[21:10] <buck> i hate code replication but it seems like the thing to do in this case. massage the data so that it fits the test.
[21:11] * dmick retches quietly to self
[21:12] * korgon (~Peto@isp-korex- has joined #ceph
[21:29] * sleinen1 (~Adium@user-23-11.vpn.switch.ch) has joined #ceph
[21:31] <gregaf> buck: dmick: nwat: I believe that sagewk just added the MIN_STRIPE_UNIT as a basic sanity thing to prevent people creating teeny block sizes that RADOS doesn't handle very efficiently
[21:33] <gregaf> if we aren't going to just kill it, we should probably keep any enforcement of it as close to the client as possible since it's clearly designed to be something that the human user should be correcting
[21:33] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) has joined #ceph
[21:33] <gregaf> the other possibility is making it a real part of the interface as a standard configurable and then creating interfaces for clients to query it
[21:34] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[21:34] <gregaf> is this actually a problem for Hadoop, or are we only hitting it in some validation tests that are setting small sizes as a means of generating a bunch of blocks without writing a ton of data, and of testing non-default sizes?
[21:36] * sleinen (~Adium@2001:620:0:25:a40f:45b:9633:136) Quit (Ping timeout: 480 seconds)
[21:40] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[21:41] * ScOut3R (~ScOut3R@dsl5401A397.pool.t-online.hu) Quit (Remote host closed the connection)
[21:41] <nwat> gregaf: we encountered requests for small blocks in Hadoop unit tests, that ostensibly wanted to run faster by creating small blocks. In practice I think we want to actually support big blocks (objects), but do something sane if a user requests small blocks (which they might I suppose if they are creating meta data files)
[21:42] <gregaf> would it impact them in any way if they asked for a 1K block and got 64K blocks instead?
[21:43] <buck> gregaf: double-checking the tests now
[21:44] <nwat> gregaf: assuming this doesn't impose a 64K minimum file size :), then no I don't think so. I'm not aware of any users that create lots of tiny files, but as HDFS matures and it isn't frowned upon as much, those users might emerge.
[21:45] <buck> nope, changing the block-size should be okay as long as we return the correct file length (as Noah just said)
[21:47] <buck> the set of tests we're fixing do things like write a block (half block, one and a half blocks, etc) and just verify that they read the same bytes back that they wrote (both number of bytes and actual content). Pretty basic.
[21:48] <gregaf> yeah, that's what I figured
[21:49] <gregaf> it's not pleasant but if sagewk doesn't have a problem with it I'd just have the Java stuff set the stripe unit to 64k (and if necessary the other parameters as well, obviously) if it's smaller than that
[21:49] <gregaf> other options are removing CEPH_MIN_STRIPE_UNIT (wouldn't really bug me) or adding a real interface (seems a bit much)
[21:49] * terje (~terje@63-154-137-90.mpls.qwest.net) has joined #ceph
[21:49] <dmick> gregaf: is it useful to consider other clients of libcephfs? What do they do for choosing/validating layout parms?
[21:50] <dmick> or perhaps what should they do
[21:50] <gregaf> dmick: right now they check that the multiplication rules are followed; I don't think anybody else checks for that MIN_SIZE
[21:50] <dmick> so do they fail in ceph_open_layout() if they get it wrong?
[21:50] <sagewk> i'm fine with either changing the min stripe unit or munging the java tests. fwiw i chose 64k because the kernel code can't have objects smaller than a page, and iirc IA64 can do 64k pages.
[21:51] <gregaf> in CephFS terms it's fairly new (late 2009) and nobody who uses Ceph sets stripe units that small
[21:51] <sagewk> it was really a kernel client limitation, nothing else in the system really cares.
[21:51] <gregaf> dmick: yeah, that'd be when it would turn up
[21:51] <dmick> yeah, not so concerned about supporting small, just wondering how a n00b client figures it out
[21:51] <nwat> dmick: a client like MPI-IO may find it useful to have a lot of flexibility when users are creating complex data types
[21:51] <dmick> (a larger issue than just the hadoop tests IOW)
[21:52] <gregaf> sagewk: hmm, can you expand on that kernel limit a bit more?
[21:52] <gregaf> does that mean having smaller stripes would break it, or that having smaller stripes would waste a lot of memory?
[21:53] <gregaf> (or some combination of the two because the kernel client assumes objects are stripe sized?)
[21:56] <joshd1> sagewk: I guess it doesn't work so well with hugepages then
[21:56] * joshd1 (~jdurgin@2602:306:c5db:310:4da1:bdc1:80d7:aea4) Quit (Quit: Leaving.)
[21:56] <sagewk> gregaf: i think it would break the current code.
[21:56] <sagewk> honestly, i don't care too much about 64k IA64 pages tho.
[21:56] <dmick> fwiw it looks like create fails with EINVAL as you might expect
[21:56] <sagewk> i don't think hugepages are used by the page cache unless you ask for them
[21:58] * terje (~terje@63-154-137-90.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[21:58] * jbarbee (17192e61@ircip4.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[21:59] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[22:00] * terje (~terje@63-154-137-90.mpls.qwest.net) has joined #ceph
[22:00] <dmick> so I guess a client just has to know about the layout limitations ATM. not sure it needs much more, just fleshing out the story.
[22:02] <nwat> slang: i pushed wip-client-shutdown. it probably conflicts then with the fuse change, but handles the shutdown in a case more similar to rados.
[22:03] * joshd1 (~joshd@2602:306:c5db:310:4c63:4d99:25c7:6e21) has joined #ceph
[22:03] <dspano> Good afternoon all. Out of the two OSD servers I have, one ceph-osd process is using 91% server memory, while the other is only using 44%. Is that normal for it to be skewed like that?
[22:08] * terje (~terje@63-154-137-90.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:09] <Vjarjadian> could there be a MON or something else on that physical server?
[22:11] <gregaf> sjust: sagewk: I'm not real up on the FileStore, but the commit description and the diff on that op_seq bug look convincing :)
[22:13] <gregaf> dspano: you should email the list with what info you have (or piggyback on the "Fwd: OSD memory leaks?" thread)
[22:13] <dspano> Yeah, I saw that one.
[22:13] <gregaf> we haven't been able to reproduce it locally but other people have seen behavior like that, yes
[22:13] <sjust> dspano: that thread also had (iirc) some instructions on how to get the profiler working
[22:13] <sjust> heap profiler that is
[22:13] <sjust> if you can get a heap graph, that would be worth a great deal
[22:14] <gregaf> Sam's a bit busy today :( but Sebastien's link to scrubbing might bear some fruit
[22:15] <dspano> I was just curious. Technically, everything is working fine. I'll see if I can get the profiler to work.
[22:15] <sjust> dspano: yeah, I think there may be a real leak, tracking it down would be a good thing
[22:16] <dspano> sjust: Is the workaround just restarting the process from time to time?
[22:16] <sjust> dspano: I suppose so, we haven't seen it ourselves
[22:17] <sjust> dspano: what version are you running?
[22:17] <dspano> Argonaut 0.48.2
[22:18] <sjust> k
[22:18] <sjust> hmm, our other report was from closer to .55
[22:18] <elder> sagewk, FYI I'm running tests on a ceph-client/testing branch that's been rebased on top of Linux 3.7.
[22:18] <elder> (We've been based on 3.6... I was surprised.)
[22:18] <sagewk> elder: cool.
[22:19] <sagewk> are the builds sorted out?
[22:20] <mikedawson> dspano: sjust I saw something similar with 0.55
[22:20] * terje_ (~joey@63-154-137-90.mpls.qwest.net) has joined #ceph
[22:20] <sjust> mikedawson: if you can get a heap graph, it would help
[22:20] <sjust> see the thread I mentioned above
[22:21] <sjust> although, memory will legitimately spike during recovery and (on argonaut) on scrub
[22:21] <sjust> more recent version with "chunky" scrub shouldn't see memory spikes during scrub
[22:22] <elder> Not yet sagewk, I thought this would be a good first step. Ready to do more though, since this built clean.
[22:22] <mikedawson> sjust: I rebuilt the cluster a week or so ago. Had that behavior on 3 of 36 OSDs. 1 resolved itself. The other two never got fixed. I never got any useful debugging. Ran into the known symbols stuff iirc
[22:22] <dspano> sjust: I'm reading Sébastian's most recent response to the OSD Memory Leak thread, and found it interesting that he mentions scrubbing. The OSD in question has been doing a lot of scrub ok messages in its logs.
[22:23] <sjust> dspano: hmm
[22:23] <mikedawson> dspano: I had rampant scrubbing as well iirc
[22:23] <sjust> scrubbing does happen automatically, so that part at least is normal
[22:23] * madkiss (~madkiss@p57A1CBD6.dip.t-dialin.net) Quit (Quit: Leaving.)
[22:24] <buck> sagewk: gregaf: so was the consensus to remove that is_valid() check for the minimum (and multiple) sizes?
[22:24] <sjust> sagewk, gregaf, anyone_else: anyone want to review wip_3698?
[22:24] <sagewk> sjust: i'll look
[22:25] <sagewk> buck: sorry, i didn't follow the whole discussion. this is a client-side or mds-side check for the min stripe size?
[22:25] <sjust> passed testrados with thrashing for a while
[22:25] <sjust> so the assert seems solid
[22:26] <sagewk> sjust: what if this is mixed code, and an old osd sent such a push request?
[22:26] <buck> sagewk: it's in src/include/ceph_fs.cc
[22:26] <sjust> ah, true
[22:26] <sjust> I'll remove the assert
[22:26] <sagewk> would dropping it to 4K work?
[22:26] <sagewk> buck: ^
[22:27] <sjust> I think I'm satisfied that we aren't sending those requests anymore
[22:27] <buck> sagewk: the hadoop test is currently trying to use 1k (seems arbitrary but that's what they have).
[22:28] <sagewk> i wouldn't drop it below 4k. if we can silently set a 4k floor to make hadoop happy, or something, that would be better.
[22:28] <sagewk> i'm ok going from 64k->4k tho. although at that point there may not be much point, since we still set a floor somewhere
[22:28] * terje_ (~joey@63-154-137-90.mpls.qwest.net) Quit (Ping timeout: 480 seconds)
[22:28] <buck> sagewk: yeah, it seems like we either adjust up to some minimum (and log it) or just don't enforce any thing in terms of minimums or alignment.
[22:29] <sagewk> buck: let's adjust up to the minimum then and leave it at 64k
[22:29] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:29] <sagewk> ideally in the java or hadoop binding, and not silently in generic libcephfs/Client code
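The "adjust up and log it" approach agreed on here can be sketched like this. It is written in Python purely for illustration, since the real fix would live in the Java/Hadoop binding. The names `adjust_block_size` and `STRIPE_GRANULARITY` and the logger name are hypothetical, and rounding up to the next multiple (rather than only clamping to a floor) is an assumption that follows from dmick's note that sizes must be multiples of the constant.

```python
import logging

log = logging.getLogger("hadoop_ceph_sketch")  # hypothetical logger name

STRIPE_GRANULARITY = 65536  # the 64k value the channel agreed to keep


def adjust_block_size(requested, granularity=STRIPE_GRANULARITY):
    """Round a requested block size up to the next multiple of granularity."""
    if requested <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division, then scale back up to a byte count.
    adjusted = ((requested + granularity - 1) // granularity) * granularity
    if adjusted != requested:
        # Log rather than adjust silently, as suggested in the channel.
        log.warning("block size %d adjusted up to %d", requested, adjusted)
    return adjusted


print(adjust_block_size(1024))    # the Hadoop unit-test request
print(adjust_block_size(131072))  # already aligned, returned unchanged
```

The caller would then use the adjusted value for both stripe unit and object size with a stripe count of 1, matching the common case nwat describes, while still reporting the true file length back to Hadoop.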
[22:29] <buck> sagewk: cool. Now do we want to hard-code that 64k in the java code or do we want to use JNI to pull that #define in from ceph proper?
[22:29] <sagewk> sigh... um
[22:29] <buck> sagewk: yeah, we'll do the fixing in the hadoop code for sure
[22:30] <sagewk> JNI to get the #define is probably the cleanest
[22:31] * fzylogic (~fzylogic@ Quit (Quit: fzylogic)
[22:31] * The_Bishop (~bishop@e177088127.adsl.alicedsl.de) has joined #ceph
[22:31] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:32] <dmick> can Java include c++ headers? :)
[22:32] <buck> sagewk: last Q, I swear. Should this be part of libcephfs (so other clients can grab it) or just something we stick in the java JNI code (to be used by our Hadoop code) ?
[22:34] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[22:34] <buck> dmick: i'
[22:34] <buck> dmick: er I'm not sure....I've only exposed functions and not #define values before
[22:36] <sagewk> buck: should be exposed by libcephfs too.
[22:36] <buck> sagewk: cool. Thanks.
[22:36] <sagewk> maybe a get_min_stripe_size() (or slightly better named) method
[22:37] <sagewk> if there are other similar methods to mirror the conventions for
[22:37] <sagewk> someday it may be a configurable or something, so that gives us flexibility to adjust it later
[22:39] <buck> ok. I'll update the ticket (but may not get to this until next week).
[22:40] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[22:41] <dmick> buck: I was joking, sorry
[22:42] <dmick> get_stripe_granularity or something?...
[22:42] <buck> dmick: np, Noah's been handling this lately and my jni is rusty enough that I sand-bagged there :0
[22:42] * Ryan_Lane1 (~Adium@ has joined #ceph
[22:42] * Ryan_Lane (~Adium@ Quit (Read error: Connection reset by peer)
[22:42] * jskinner (~jskinner@ has joined #ceph
[22:43] <buck> there's a ceph_get_file_replication() that takes a file handle as an argument so a ceph_get_min_stripe_size() with no argument seems reasonable-ish (with the implication that it pertains to the mount being used to make the call).
[22:47] <dmick> (except that it's not a min, and a better name is worth the time deciding on it)
[22:47] <sagewk> _granularity works for me.
[22:49] * fzylogic (~fzylogic@ has joined #ceph
[22:52] <buck> alright, _granularity it is (good point on the 'not actually a min' comment dmick: )
[22:56] <dmick> cheers
[22:58] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[22:59] * joshd1 (~joshd@2602:306:c5db:310:4c63:4d99:25c7:6e21) Quit (Quit: Leaving.)
[23:01] * The_Bishop_ (~bishop@e177088127.adsl.alicedsl.de) has joined #ceph
[23:02] * joshd1 (~jdurgin@2602:306:c5db:310:4da1:bdc1:80d7:aea4) has joined #ceph
[23:02] <dmick> hammer.sh's title(): kudos
[23:08] * The_Bishop (~bishop@e177088127.adsl.alicedsl.de) Quit (Ping timeout: 480 seconds)
[23:12] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:14] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[23:15] * agh (~agh@www.nowhere-else.org) has joined #ceph
[23:17] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[23:20] * jskinner (~jskinner@ Quit (Remote host closed the connection)
[23:21] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[23:21] * agh (~agh@www.nowhere-else.org) has joined #ceph
[23:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:29] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:29] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[23:31] * jskinner (~jskinner@ has joined #ceph
[23:35] <paravoid> where can I read more about the geo-replication plans of ceph?
[23:35] <paravoid> I read somewhere that there was a presentation in Amsterdam about it, but can't find slides or more information
[23:36] * mattbenjamin (~matt@aa2.linuxbox.com) has joined #ceph
[23:40] <dmick> paravoid: I'm not sure much has been done yet; we're just sorta starting talking about radosgw async repl
[23:40] <paravoid> so it's going to happen on the radosgw level?
[23:41] <dmick> at first, because it's easier/quicker there
[23:41] <dmick> but rados will follow one day
[23:41] <paravoid> aha
[23:41] <dmick> (rgw has semantics that make it a much easier problem)
[23:42] <paravoid> are there any plans on providing a way to replicate via crush but always do local reads?
[23:42] <paravoid> I know that writes are synchronous, but let's assume a read-intensive workload
[23:43] * tnt_ (~tnt@112.169-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[23:44] * vata (~vata@ Quit (Quit: Leaving.)
[23:44] <joshd1> paravoid: reads always come from the primary, so you can already use (for example) faster primaries to speed up reads by choosing one from a fast hierarchy, and n-1 from another hierarchy
[23:45] <paravoid> I've read that, yeah
[23:45] <paravoid> on a multi-DC scenario though, this assumes a primary & secondary DC
[23:46] <dmick> sure; you can set up DC-aware crush rules
[23:46] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[23:46] <paravoid> I'm wondering if there are any thoughts on providing some way to do reads from secondary replicas
[23:46] * loicd (~loic@magenta.dachary.org) has joined #ceph
[23:46] <mattbenjamin> someone: can confirm the cap expected for an arbitrary update to inode size (appears to be CEPH_CAP_FILE_BUFFER)
[23:46] <paravoid> in the case where they're local
[23:46] * andret (~andre@pcandre.nine.ch) Quit (Ping timeout: 480 seconds)
[23:47] <paravoid> DC-aware crush rules are fine for a master/slave DC scenario; not so much if you want to serve clients in both DCs at the same time
[23:47] <paravoid> I'm pretty sure what I want isn't there, I'm asking if there are any plans to add such a feature :-)
[23:48] <paravoid> I guess not?
[23:49] <nwat> dmick: regarding ceph_get_stripe_unit_granularity(..): do you think it is wise to have the interface require the ceph mount in order to future proof ourselves from having per-mount configurations of minimums?
[23:49] <joshd1> paravoid: there's been talk of read-from same subnet-replica or something like that that might work for that case, but it doesn't work in general for mutable data
[23:49] <gregaf> nwat: definitely
[23:49] <dmick> nwat: yeah, certainly can't hurt much
[23:50] <paravoid> joshd1: how come?
[23:50] <gregaf> mattbenjamin: can you clarify? CEPH_CAP_FILE_BUFFER is the cap that lets the client buffer writes
[23:50] <paravoid> joshd1: and was that talk in some mailing list so I can read more about it?
[23:50] <joshd1> paravoid: rados assumes that reads are going through the primary. if they went to secondaries, they could get stale data
[23:51] <joshd1> paravoid: search for localized reads - it was created a long time ago for hadoop
[23:52] <paravoid> oh? that's already there?
[23:54] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[23:54] <joshd1> I think it may be only by ip, and not subnet
[23:55] <joshd1> and it's only wired up to work with ceph-fuse iirc, but could be extended for use elsewhere (like reading from a parent snapshot in rbd)
[23:55] <paravoid> if (osdmap->get_addr(acting[i]).is_same_host(messenger->get_myaddr())) {
[23:55] <paravoid> yeah
[23:56] <mattbenjamin> gregaf: yes, and apparently I eyeskipped. what I'm interested in doing is adjusting say size only.
[23:57] * Cube1 (~Cube@ has joined #ceph
[23:58] <gregaf> mattbenjamin: I think that just the WRITE cap (whatever it's actually called) is all the coverage there is on that
[23:59] <mattbenjamin> gregaf: so if I want size update ordering, CAP_FILE_EXCL?
[23:59] <gregaf> no, that's separate

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.