#ceph IRC Log


IRC Log for 2013-04-25

Timestamps are in GMT/BST.

[0:18] <joelio> Not been about for a couple of weeks, come back and there's a new wiki, call for blueprints and loads of good stuff. Nice!
[0:20] <joelio> plus the production kit has arrived now.. the fun begins :)
[0:31] <joelio> +1 for rgw http standalone
[1:07] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[1:31] <mikedawson> gregaf: no problems to report so far with ceph version 0.60-641-gc7a0477 (c7a0477bad6bfbec4ef325295ca0489ec1977926). Thanks for working through the bugs!
[1:31] <gregaf> yay
[1:55] <mikedawson> sagewk: are you guys planning to build packages for Raring prior to Cuttlefish?
[1:55] <mrjack> will 0.60 be a new stable release?
[1:56] <mikedawson> mrjack: 0.60 is not stable. I believe 0.61 will be deemed stable and called Cuttlefish
[1:57] <dmick> mikedawson: I'm only guessing, but my guess would be that looking at raring will come after cuttlefish is in the can
[1:58] <mikedawson> dmick: ok. for reference installing quantal packages on raring *seems* to work
[3:04] <jmlowe1> mikedawson: you still around?
[3:04] <mikedawson> jmlowe1: yes
[3:04] <jmlowe1> ever run into this with raring "error: unsupported configuration: unknown driver format value 'rbd'"
[3:06] <mikedawson> jmlowe1: sorry, no
[3:07] <mikedawson> are you using kernel rbd?
[3:07] <jmlowe1> nope, qemu driver
[3:10] <jmlowe1> *grumble*, xml format has changed
[3:18] <jmlowe1> now <driver name="qemu" type="raw" cache="writeback"/> was <driver name='qemu' type='rbd' cache='writeback'/>
[3:18] <jmlowe1> also refused migration of a pc-1.2 vm, wouldn't start copying the memory
[3:49] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Read error: Connection reset by peer)
[3:54] <dmick> but....how does it select rbd?...
[3:55] <jmlowe> protocol="rbd"
[3:55] <dmick> ah
[3:56] <jmlowe> they eliminated the redundant "rbd" but broke compatibility
[3:56] <jmlowe> <driver name="qemu" type="raw"/>
[3:56] <jmlowe> <source protocol="rbd" name="image_name2">
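The syntax change jmlowe describes can be sketched as a small Python helper that rewrites an old-style disk definition: the driver type becomes "raw" and RBD is selected only through protocol="rbd" on the source element. This is an illustrative sketch, not a libvirt tool. The surrounding <disk> element, its attributes, and the function name are assumptions; only the <driver> and <source> lines come from the log.

```python
import xml.etree.ElementTree as ET


def upgrade_rbd_disk(disk_xml: str) -> str:
    """Rewrite driver type 'rbd' to 'raw' in a libvirt <disk> element.

    Only disks whose <source> already says protocol="rbd" are touched.
    """
    disk = ET.fromstring(disk_xml)
    driver = disk.find("driver")
    source = disk.find("source")
    if (driver is not None and source is not None
            and source.get("protocol") == "rbd"
            and driver.get("type") == "rbd"):
        driver.set("type", "raw")
    return ET.tostring(disk, encoding="unicode")


# Hypothetical old-style definition; the <disk> wrapper is assumed.
OLD = """<disk type="network" device="disk">
  <driver name="qemu" type="rbd" cache="writeback"/>
  <source protocol="rbd" name="image_name2"/>
</disk>"""

NEW = upgrade_rbd_disk(OLD)
print(NEW)
```

Other driver attributes such as cache="writeback" are left untouched, so only the type value changes.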
[4:22] <mikedawson> BillK: I would wait a few days for cuttlefish (0.61). If you must upgrade, the gitbuilder next may be a better choice than 0.60
[4:22] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has joined #ceph
[4:25] <BillK> mikedawson: tkx, Will stay with 58 until 61 if its only few days.
[4:26] <mikedawson> BillK: sure thing
[4:27] * dpippenger (~riven@206-169-78-213.static.twtelecom.net) Quit (Remote host closed the connection)
[6:48] * tkensiski (~tkensiski@c-98-234-160-131.hsd1.ca.comcast.net) has left #ceph
[7:57] * tnt (~tnt@ has joined #ceph
[9:39] * leseb (~Adium@ has joined #ceph
[11:07] <Kioob`Taff> is there performance improvement in kernel RBD client between Linux 3.6 and Linux 3.8 ?
[11:07] <Kioob`Taff> (hi)
[13:44] <Kioob`Taff> nhm: I reworked my config based on your article http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/ and indeed I was able to reduce and stabilize the latency on my cluster. Thanks a lot!
[13:56] <nhm> Kioob`Taff: excellent! what did you change?
[14:04] * KevinPerks (~Adium@cpe-066-026-239-136.triad.res.rr.com) has joined #ceph
[14:10] * john_barbee (~jbarbee@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:14] <Kioob`Taff> nhm: I removed "filestore journal writeahead = true", used your "big_ops" parameters, removed my "journal max write bytes", and fixed the number of threads (disk = 4, op = 8)
[14:14] <Kioob`Taff> (I use XFS and have mainly random small writes)
[14:18] <Kioob`Taff> https://daevel.fr/lamp-response-time.png <== at left, my previous config, at right, the new config. And the huge overload is because of the restart of all the OSD (to be sure to re-init the conf)
[14:19] <Kioob`Taff> the new one is really stable
[14:19] <Kioob`Taff> (and faster)
[14:26] <Kioob`Taff> is some work planned for RBD usage with Xen? (in PV mode)
[14:27] <Kioob`Taff> current perfs are far lower than from the host
[15:32] <jerker> beats me with 16 TB raw 5 TB usable on 4 HDD :-)
[15:33] * wschulze (~wschulze@cpe-69-203-80-81.nyc.res.rr.com) has joined #ceph
[16:23] <matt_> jefferai, I would say that the 13.04 packages are fine to use until cuttlefish is released
[16:23] <matt_> then you will probably need to use a repository
[16:24] <jefferai> I see
[16:25] <jefferai> matt_: so I have 12.04 machines now, and I'm just trying to figure out whether I should upgrade to 13.04 (and stick with Ceph packages) or stay on 12.04 (and stick with Ceph packages)
[16:25] <jefferai> 13.04 brings a higher likelihood of using btrfs successfully with Ceph, for instance
[16:25] <matt_> I'm upgrading a 12.10 machine to 13.04 right now, I'll let you know how I go
[16:26] <matt_> I'm hoping the 3.8 kernel will fix some BTRFS slowness
[16:28] <jmlowe> if you are using rbd with qemu and libvirt, they changed the xml syntax slightly
[16:29] <matt_> jmlowe, thanks for the heads up. This is just an OSD server so it should be all good
[16:29] <jmlowe> now <driver name="qemu" type="raw"/>, was <driver name="qemu" type="rbd"/>
[16:29] <jmlowe> caused me a little grief last night
[16:30] <matt_> jmlowe, do any of your VM's run Windows 2008r2?
[16:30] <jmlowe> all linux, mostly centos 5 and centos 6
[16:31] <jefferai> jmlowe: yeah, I'll be upgrading my compute hosts to 13.04 to take advantage of qemu 1.4, so I'm trying to decide whether to upgrade the storage nodes at the same time
[16:31] <matt_> ah ok, I found a RTC timer bug a while back when playing with the 3.8 kernel just wanted to know if you had seen it too but it doesn't affect linux guests
[16:31] <jmlowe> I wasn't successful at doing a live migration from a vm on quantal to raring
[16:32] <jmlowe> I didn't dig into it too much,
[16:32] <matt_> IIRC, there are a heap of live migration changes that break QEMU 1.4 compatibility with earlier releases
[16:32] <jmlowe> jefferai: if you manage to do it please let me know
[16:33] <jmlowe> !#@$
[16:33] <matt_> I read about it on the proxmox forum, seems to be a well known issue
[16:35] <jmlowe> there is also a new machine type, q35, I couldn't get that to work either
[16:36] <jefferai> jmlowe: I see -- so my interest is that qemu is supposed to fix some slowness problems with I/O
[16:36] <jefferai> and some other changes
[16:36] <jefferai> but also, I want to use the native librados qemu stuff
[16:36] <jefferai> because I've been using ganeti and it does kernel RBD
[16:36] <matt_> jefferai, do you mean IO slowness with Ceph in particular?
[16:37] <jefferai> and that's been annoying and potentially causing issues that I've seen
[16:37] <jefferai> and requires using fairly untested kernels, or using older and known buggy kernels (known buggy w.r.t. RBD)
[16:37] <jefferai> so my thought was to use native qemu/libvirt migration, I don't *really* need a cluster manager for four nodes
[16:38] <jefferai> especially given that ganeti doesn't do automatic failover
[16:38] <jefferai> and in doing so I can use non-kernel RBD, which I have been hearing is a better option when possible
[16:38] * drokita (~drokita@ has joined #ceph
[16:38] <jefferai> so I'm fine testing this out on the compute side, but wondering if I should also upgrade on the storage side
[16:39] <jefferai> matt_: no, something odd -- I posted on ceph-devel but didn't hear back, let me dig it up
[16:39] <matt_> jefferai, just switching to the native RBD driver in qemu should give you a big jump in performance
[16:39] <jefferai> nice
[16:39] <matt_> and there are a heap of changes in cuttlefish that fix the rbd cache and add async IO
[16:39] * dgbaley27 (~matt@mrct45-133-dhcp.resnet.colorado.edu) has joined #ceph
[16:39] <jefferai> matt_: look at the "poor write performance" thread
[16:41] <jefferai> matt_: which kind of cache does it fix? Write-back or write-through?
[16:41] <jefferai> (or both)?
[16:51] * mnash (~chatzilla@66-194-114-178.static.twtelecom.net) has joined #ceph
[16:51] <jefferai> matt_: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg13893.html
[16:51] <jefferai> that's my message, but it ended up killing that part of the thread
[16:51] <jefferai> :-)
[16:53] <matt_> I'll have a look in a sec, just rebooted my server into 13.04
[16:56] <nhm> yes, rbd cache behavior should be much better in cuttlefish
[16:57] <jefferai> ok
[16:57] <nhm> btw guys, I may be seeing some performance regressions with kernel 3.8 vs 3.6.
[16:57] <nhm> doing more tests now.
[16:58] <jmlowe> jefferai: is that virtio-blk or virtio-scsi?
[16:58] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Read error: Operation timed out)
[16:58] <matt_> nhm, of the BTRFS kind or just in general?
[16:59] <nhm> matt_: In general.
[16:59] <jmlowe> nhm: what order of magnitude are we talking about?
[17:00] <jefferai> jmlowe: sadly, I don't remember and can't look now as I'm still cut off from the testbed
[17:00] <nhm> jmlowe: not sure how much of this is due to 3.8. I may be simultaneously having another issue. I'm getting about 50-60% of the performance I was getting 2 months ago.
[17:00] <jefferai> (I am changing from full time employee there to part time hourly, and while that is in place I am not an employee at all)
[17:01] <nhm> jmlowe: though it's primarily write performance that's a problem.
[17:02] <matt_> nhm, I just tested my 4MB write after the upgrade and it appears to be the same as before using 0.60
[17:02] <jefferai> matt_: how about 4k write?
[17:03] <nhm> matt_: yeah, this seems to be a hardware/kernel issue
[17:04] <matt_> hmm... 4kb seems crappy. 1.5Mb/s average which is pretty bad
[17:04] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[17:04] <jefferai> still better than 150kb/s
[17:04] <nhm> average for what?
[17:05] <matt_> 48 osd's over 2 servers, connected via infiniband, ssd journals, replica 3
[17:05] <nhm> 1.5MB/s aggregate throughput for the whole cluster?
[17:06] <matt_> at 4kb and 64 concurrent run from a single server, yes
[17:06] <nhm> :(
[17:06] <nhm> rados bench or something else?
[17:06] <matt_> This was rados bench
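To put matt_'s figure in perspective, throughput at a fixed op size converts directly to operations per second, and with a fixed number of ops in flight Little's law gives the average time per op. A rough sketch, assuming "1.5Mb/s" means 1.5 * 10^6 bytes per second and "4kb" means 4096 bytes (the log does not say which units rados bench reported):

```python
def iops(throughput_bytes_per_s: float, op_size_bytes: int) -> float:
    """Operations per second at a fixed operation size."""
    return throughput_bytes_per_s / op_size_bytes


def avg_latency_s(in_flight: int, ops_per_s: float) -> float:
    """Little's law: in_flight = ops_per_s * latency, solved for latency."""
    return in_flight / ops_per_s


rate = iops(1.5e6, 4096)        # assumed units, see the note above
lat = avg_latency_s(64, rate)   # 64 concurrent ops, as in the log
print(f"{rate:.0f} ops/s, {lat * 1000:.0f} ms average per op")
```

Under those assumptions this works out to roughly 366 ops/s for the whole cluster and about 175 ms per 4 KB write.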
[17:07] <nhm> does more concurrent ops help?
[17:07] <matt_> one of my servers is BTRFS and the other is XFS... I'm thinking I might change everything to XFS
[17:07] <jefferai> oh yeah, that was another reason I was thinking of upgrading my storage boxes to 13.04, I'm sick of XFS killing itself every reboot
[17:07] <jefferai> even when cleanly shut down
[17:07] <jefferai> some OSD or another fails to come up and XFS is corrupt :-(
[17:07] <jefferai> I know there were some kernel fixes put in XFS past 3.2...
[17:09] <matt_> nhm, I also have a pool comprised of 48 SSDs over 10 hosts. Rep 3 again. Average is around 10MB/s for 4kb IO
[17:10] <jmlowe> matt_: I would switch to xfs, btrfs has caused me lots and lots of pain
[17:11] <nhm> matt_: writes and reads?
[17:11] <matt_> nhm, just writes. I haven't benched reads yet
[17:12] <matt_> 4MB writes are 450+ MB/s though :D
[17:12] <nhm> matt_: that's good at least!
[17:12] <nhm> matt_: I'm annoyed. btrfs on this node used to be good for 2GB/s+ and with kernel 3.8 I was hitting 1.1GB/s.
[17:13] <nhm> so now I have to go backtrack with older kernels and older ceph releases.
[17:13] * dxd828 (~dxd828@ Quit (Remote host closed the connection)
[17:13] <matt_> nhm, that's a bit odd. Did you keep your old kernel?
[17:14] <nhm> matt_: yeah, I've got a couple of old ones I'm trying out. I suspect maybe one of my drives is running a bit slower.
[17:14] <matt_> nhm, how are you benching reads? rados bench doesn't appear to do it
[17:15] <nhm> matt_: you have to do a write run with the --no-cleanup flag and then instead of write, use seq for the 2nd test.
[17:16] <jefferai> jmlowe: I thought in recent Ceph that ext4 was actually not a bad option these days
[17:16] <matt_> nhm, ah ok. That sounds a little too much work for tonight... I should probably get back to studying for my Google interview :/
[17:16] <nhm> I also do a sync and echo 3 | sudo tee /proc/sys/vm/drop_caches before the read test on all the nodes.
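The read-benchmark procedure nhm outlines (write with --no-cleanup, sync, drop caches, then a seq run) could be laid out as follows. This sketch only builds and prints the commands rather than running them. The pool name "testpool", the 30-second duration, and the function name are placeholders, not values from the log.

```python
import shlex


def read_bench_plan(pool: str, seconds: int = 60):
    """Return the commands for a write-then-sequential-read rados bench run."""
    return [
        # Write pass; --no-cleanup leaves the objects in place for the read pass.
        ["rados", "-p", pool, "bench", str(seconds), "write", "--no-cleanup"],
        # Flush dirty pages, then drop the page cache (repeat on every node).
        ["sync"],
        ["sh", "-c", "echo 3 | sudo tee /proc/sys/vm/drop_caches"],
        # Sequential read pass over the objects written above.
        ["rados", "-p", pool, "bench", str(seconds), "seq"],
    ]


plan = read_bench_plan("testpool", 30)
for cmd in plan:
    print(shlex.join(cmd))
```

Keeping the steps as a list makes it easy to hand the sync and drop_caches commands to whatever is used to reach the other nodes.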
[17:16] <nhm> matt_: ah, good luck!
[17:17] <matt_> nhm, thanks! I'd happily trade some luck for a computer science degree right now though!
[17:18] <nhm> matt_: what kind of position are you interviewing for?
[17:18] <matt_> I think it's a software engineer position they had in mind for me. They tracked me down via linked-in so I didn't really apply for a certain one
[17:19] <jefferai> matt_: ah, cool -- most people that get jobs there actually get it from someone inside knowing them
[17:19] <jefferai> so if someone tracked you down on LinkedIn then you're partway in the door already
[17:20] <matt_> I'm hoping so
[17:20] <wido> nhm: I tried with the wip aio branch, that seems rather nice
[17:21] <wido> I'm thinking about upgrading to 0.61 right now to have all the fixes
[17:21] <wido> I'm just still trying to figure out why the Qemu instance isn't "snappy", for example a simple "df -h" took 5 seconds just now
[17:22] <nhm> wido: does turning off rbd cache help?
[17:22] <wido> nhm: I'll give that a try
[17:23] <wido> nhm: Observed any read issues with rbd cache on?
[17:23] * gmason (~gmason@hpcc-fw.net.msu.edu) has joined #ceph
[17:23] <nhm> wido: reads were fine in all of my tests, but I didn't really use the VMs interactively.
[17:24] <wido> nhm: Reads are fine in benchmarks, well, could be a bit better, but the VM isn't snappy
[17:24] <wido> seems like sometimes it's waiting
[17:24] <nhm> wido: We've definitely seen reports like that. We were hoping that Josh's patches from wip-aio fixed it.
[17:25] <matt_> wido, if you ping the VM whilst you're running commands do the ping times increase?
[17:25] <nhm> You are still seeing it with 0.60?
[17:25] <wido> nhm: Have to try 0.60, but I'm running the wip-aio branch already on the client, with the Qemu fixes
[17:25] <wido> matt_: No, not seeing #3737
[17:28] <wido> nhm: I'll try the next branch to see what that does
[17:31] <nhm> wido: ok, cool
[17:31] <wido> nhm: What I do observe for example during a read is that the client isn't reading constantly
[17:31] <wido> so bwm-ng shows me a peak of 80MB/sec, nothing for 2 seconds and suddenly 80MB/sec again
[17:31] <wido> And in the end the VM reads with 40MB/sec with a simple dd read
[17:32] <nhm> wido: how big are the reads?
[17:32] <wido> nhm: 1M reads
[17:32] <nhm> wido: does increasing read_ahead_kb on the OSDs help at all?
[17:33] <nhm> also, on one cluster I was working on, tcp autotuning was causing all kinds of problems, but strangely only for reads.
[17:33] <nhm> It may not do anything, but you could try disabling it.
[17:35] <wido> nhm: I'll disable it on the cluster, just see what it does
[17:35] <nhm> make sure to do it on the clients and servers
[17:35] * dxd828 (~dxd828@ has joined #ceph
[17:36] * sjusthm (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[17:37] <wido> nhm: That indeed changed a lot. Saw a 50% increase. Went from 41MB/sec to about 69MB/sec
[17:38] <wido> No more peaks in bandwidth, but now a sustained throughput
[17:39] <nhm> wido: the tcp autotuning or read_ahead_kb setting?
[17:39] <wido> nhm: the tcp autotuning
[17:39] <nhm> wido: ok, that's very good to know. We put a patch in recent versions of ceph to make the buffer size configurable that should theoretically fix it too.
[17:39] <nhm> Did you just turn it off in proc?
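The log does not say which setting wido changed. A minimal sketch, assuming "tcp autotuning" here means Linux receive-buffer autotuning as controlled by /proc/sys/net/ipv4/tcp_moderate_rcvbuf (1 = enabled, 0 = disabled). The helper names are made up for illustration, and the demo runs against a scratch file so it does not need root:

```python
import tempfile
from pathlib import Path

# Assumption: this sysctl is the knob being discussed (1 = on, 0 = off).
RCVBUF_AUTOTUNE = Path("/proc/sys/net/ipv4/tcp_moderate_rcvbuf")


def autotuning_enabled(path: Path = RCVBUF_AUTOTUNE) -> bool:
    """Return True unless the file holds a literal 0."""
    return path.read_text().strip() != "0"


def set_autotuning(enabled: bool, path: Path = RCVBUF_AUTOTUNE) -> None:
    """Write 1 or 0; needs root on the real /proc file and does not persist."""
    path.write_text("1\n" if enabled else "0\n")


# Demo against a scratch file standing in for the /proc entry.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "tcp_moderate_rcvbuf"
    scratch.write_text("1\n")
    before = autotuning_enabled(scratch)
    set_autotuning(False, scratch)
    after = autotuning_enabled(scratch)

print(before, after)
```

As nhm notes, a change like this would have to be applied on both the clients and the servers, and a value written under /proc is lost at reboot.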
[17:43] <stacker666> hi there!
[17:43] <stacker666> the kernels of http://gitbuilder.ceph.com/ are stable?
[17:53] <nhm> wido: Jim Schutt had a big long email thread last year about this. Basically all kinds of tcp retransmits get sent causing all sorts of delays.
[17:54] <wido> nhm: I'll look that one up
[17:57] <stacker666> nhm: i have exported using iscsitarget and it works fine with different images
[17:57] <stacker666> nhm: at the same time
[17:58] <nhm> stacker666: if you have a normal filesystem on a block device and try to mount it on multiple clients it will lead to trouble.
[18:26] * davidzlap (~Adium@ip68-96-75-123.oc.oc.cox.net) Quit (Quit: Leaving.)
[18:31] * hybrid512 (~walid@LPoitiers-156-86-25-85.w193-248.abo.wanadoo.fr) Quit (Quit: Leaving.)
[18:33] * Vjarjadian (~IceChat77@ has joined #ceph
[18:33] <imjustmatthew> greaf: around?
[18:34] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[18:38] <nhm> stacker666: good to know. FWIW, I just found that our 3.8 kernel also is dramatically slowing down throughput relative to a kernel-ppa 3.8 raring kernel.
[18:39] <nhm> stacker666: I suspect there is some debugging enabled that is causing issues.
[18:40] <gregaf> imjustmatthew: perhaps you meant gregaf? :p
[18:40] <imjustmatthew> gregaf: most def :)
[18:40] <imjustmatthew> did you per chance find out which gitbuilder builds have tcmalloc?
[18:41] <gregaf> I think dmick looked at it, the only thing I heard was "hmm, that should have had it wtf"
[18:44] <nhm> gregaf: missing tcmalloc?
[18:45] * l0nk (~alex@ Quit (Quit: Leaving.)
[18:45] <gregaf> yeah
[18:45] <imjustmatthew> okay. is there a build somewhere that you know I can use to track down this memory issue while it's still happening?
[18:46] <imjustmatthew> It's also weird bc it seems to be associated with unusually high CPU usage by the mons
[19:59] <mikedawson> gregaf: ceph-mon.a appears to have died at 11:11 utc this morning. No core dump. Nothing in the logs past that point. ceph version 0.60-641-gc7a0477 (c7a0477bad6bfbec4ef325295ca0489ec1977926). Any idea how that happens?
[20:00] <gregaf> mikedawson: dmesg show it getting killed? is the process actually gone or just the log stopped?
[20:02] * Cube (~Cube@ has joined #ceph
[20:03] <mikedawson> gregaf: can't find anything in dmesg about ceph. Process is gone and logging shows nothing indicating it went away
[20:10] <mikedawson> gregaf: that was my thought too, but the others have alibis (and logs confirm them)
[20:11] <gregaf> I've got no magic introspection that you don't, sorry :(
[20:11] <gregaf> I guess you could check the other monitor logs and see if they've got something at that time other than "oh, it disappeared"
[20:11] <mikedawson> gregaf: bigger problem is mon.a hasn't rejoined a happy quorum
[20:11] * rustam (~rustam@ Quit (Remote host closed the connection)
[20:13] * loicd (~loic@magenta.dachary.org) has joined #ceph
[20:13] * noob2 (~cjh@ Quit (Quit: Leaving.)
[20:24] <gregaf> I've got to work on something else right now, but thanks for the log
[20:24] <gregaf> it looks like it's got messenger but not monitor though?
[20:25] <mikedawson> gregaf: sure. would you like this to be entered as a bug? on the other question... mon.b and mon.c had an election right after mon.a went away. That seems like a reasonable response
[20:26] <gregaf> oh, sure — hoping it's a dupe but dunno
[20:28] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[20:31] <imjustmatthew> Quick check before I report it, does this mon crash look like a duplicate? http://pastebin.com/K7q71xPL
[20:32] <gregaf> imjustmatthew: not precisely, but we should flag sagewk about it as that could be related to a lost message bug he's working on
[20:32] * LeaChim (~LeaChim@ has joined #ceph
[20:34] <imjustmatthew> gregaf: #4810?
[20:34] <gregaf> don't remember, I just want sagewk to see it
[21:03] * mikedawson_ (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:03] * ninkotech_ (~duplo@static-84-242-87-186.net.upcbroadband.cz) has joined #ceph
[21:04] * b1tbkt_ (~Peekaboo@68-184-193-142.dhcp.stls.mo.charter.com) has joined #ceph
[21:05] * mistur_ (~yoann@kewl.mistur.org) has joined #ceph
[21:05] * trond_ (~trond@trh.betradar.com) has joined #ceph
[21:05] * dosaboy_ (~dosaboy@host86-161-164-218.range86-161.btcentralplus.com) has joined #ceph
[21:05] * ggreg_ (~ggreg@int.0x80.net) has joined #ceph
[21:06] * liiwi_ (liiwi@idle.fi) has joined #ceph
[21:25] <benner> ok
[21:29] <imjustmatthew> mikedawson: Interesting, is the CPU load related directly to one of your bugs? Also, we're showing great talent at breaking mons :)
[21:29] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Quit: ChatZilla 0.9.90 [Firefox 20.0.1/20130409194949])
[21:31] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[21:32] <mikedawson> imjustmatthew: no, CPU doesn't appear to be directly related to bugs, but I did have oom-killer kill off a ceph-mon process, and it currently will not rejoin quorum
[21:33] <mikedawson> imjustmatthew: http://tracker.ceph.com/issues/4815
[21:34] <Elbandi_> i try to disable the readahead on a cephfs mount (rsize=0,rasize=0), but still transfer more bytes as it should be
[21:34] <Elbandi_> http://pastebin.com/zS1fzJwd
[21:34] <Elbandi_> aio_read 0~4
[21:35] <Elbandi_> 4 bytes from the beginning
[21:35] <Elbandi_> start_read 0~16384
[21:35] <Elbandi_> :(
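Reading the two log entries as offset~length pairs, as Elbandi_ does, the gap between what the application asked for and what was fetched can be worked out directly. A small sketch; the 4096-byte page size is an assumption, and the function name is made up for illustration:

```python
PAGE_SIZE = 4096  # assumed page size, not stated in the log


def parse_extent(text: str):
    """Parse an 'offset~length' string into a pair of ints."""
    offset, length = text.split("~")
    return int(offset), int(length)


_, requested = parse_extent("0~4")      # from "aio_read 0~4"
_, fetched = parse_extent("0~16384")    # from "start_read 0~16384"

print(f"requested {requested} B, fetched {fetched} B "
      f"({fetched // PAGE_SIZE} pages of {PAGE_SIZE} B)")
```

With a 4096-byte page, the 16384 bytes fetched correspond to four whole pages for a 4-byte read.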
[21:45] <mikedawson> sagewk: Mon log is attached at http://tracker.ceph.com/attachments/download/797/ceph-mon.a.log
[21:45] <mikedawson> gregaf: how do I check?
[21:45] <gregaf> try getting heap stats out of them
[21:45] <gregaf> "ceph -m <mon-ip> heap stats", I think?
[21:45] <gregaf> while running ceph -w in another window
[21:47] <kylehutson> I recently expanded my ceph cluster and ended up with the problem mentioned at http://www.spinics.net/lists/ceph-devel/msg08361.html , but when I run "ceph osd crush tunables bobtail" (per the documentation), I get "unknown command crush"
[21:47] <kylehutson> What's the proper way to implement tunables now?
[21:47] <mikedawson> gregaf: the mon that is borked doesn't respond. The others say "tcmalloc not enabled, can't use heap profiler commands"
[21:48] <gregaf> dammit dammit dammit what is going on here
[21:48] <gregaf> what repository are you pulling from mikedawson?
[21:49] <mikedawson> gregaf: I am on a raring nightly installing from deb http://gitbuilder.ceph.com/ceph-deb-quantal-x86_64-basic/ref/next quantal main
[21:49] <mikedawson> perhaps the raring/quantal mismatch is the issue
[22:03] <mikedawson> gregaf: in case it matters, ceph-osd looks much more normal " 8044 root 20 0 9392m 613m 5848 S 2 1.3 29:20.35 ceph-osd"
[22:05] <mikedawson> actually that probably seems high, too
[22:05] <gregaf> mikedawson: hrm, can you do "ceph osd tell 0 heap stats" and see what that output is?
[22:06] <mikedawson> gregaf: it just returns "ok"
[22:06] <gregaf> what's ceph -w show?
[22:06] <gregaf> I want to see if it generates the heap stats output :)
[22:07] <mikedawson> gregaf: http://pastebin.com/raw.php?i=UWn9Hnqc
[22:08] <mikedawson> that covers the time I did ceph osd tell 0 heap stats
[22:08] <gregaf> okay
[22:08] <gregaf> so not enabled
[22:08] <gregaf> (the tell command has some pretty stupid routing, so the monitor is returning the "ok" without the OSD having said any such thing)
[22:08] <mikedawson> gotcha
[22:09] * dwt (~dwt@128-107-239-233.cisco.com) has joined #ceph
[22:20] <gregaf> hrm, it's in our sepia project which often concerns internals, so maybe the project is
[22:20] * leseb (~Adium@pha75-6-82-226-32-84.fbx.proxad.net) has joined #ceph
[22:21] <mikedawson> ok
[22:21] <gregaf> ah, yep
[22:21] <gregaf> sorry
[22:21] <gregaf> didn't realize that
[22:38] <barryo1> it's between that and nearline SAS
[22:40] <barryo1> I remember reading that with a decent controller you didn't need to worry about separate journals, is a Dell H700 with 1GB NV Cache decent enough?
[22:41] <nhm> barryo1: For dell nodes, it may be better to just stick with SAS disks, throw the journals on the disks.
[22:41] <nhm> barryo1: I've had trouble getting good performance out of our R515s, but I think the R720xds performed a bit better. Sadly we don't have any in-house.
[22:42] <barryo1> It's 515's I'm looking at buying
[22:44] <barryo1> it'll mostly be used to host low i/o VM's so that should be ok
[22:47] <nhm> barryo1: ours have 8 disks in them and get about 300MB/s to the drives.
[22:47] <nhm> It's possible with some additional tuning we could get that up a bit.
[22:48] <barryo1> is that with journals on the osds?
[22:48] <nhm> yeah
[22:48] <nhm> and only 7 drives for OSDs
[22:48] <barryo1> nearline or real SAS?
[22:48] <nhm> nearline I think
[22:49] <nhm> it's been a while since I looked at them.
[22:49] <barryo1> thats not bad at all
[22:49] <nhm> barryo1: that's under very ideal testing.
[23:12] <sjusthm> just pushed wip_3904
[23:15] <barryo1> athrift: sadly, the 720s won't meet my budget
[23:17] <athrift> barryo1: we managed to get ours down to around $7200 USD
[23:17] <paravoid> sjusthm: enjoying my bugs? :)
[23:17] <sjusthm> paravoid: oh, certainly
[23:17] <sjusthm> oh, you were 3904 as well
[23:17] <sjusthm> heh
[23:17] <athrift> with 12x 3TB NL-SAS, H310, x520 NDC, 1x Xeon 2670, 32GB ram
[23:18] <paravoid> yeah :)
[23:18] <athrift> with 2TB drives it was about $1000 less
[23:18] <paravoid> athrift: STAY AWAY FROM THE H310
[23:18] <paravoid> seriously
[23:18] <athrift> paravoid: why is that, we have had no issues with them
[23:19] <nhm> paravoid: isn't it just a SAS2008?
[23:19] <paravoid> no
[23:19] <paravoid> it's a piece of shit
[23:19] <athrift> it is a SAS2008
[23:21] <paravoid> problem no1 is http://en.community.dell.com/support-forums/servers/f/906/t/19480834.aspx
[23:21] <athrift> nhm: Why is the H710 better ? We only use R720XD's
[23:21] <paravoid> problem no2 is that reads on disk A block writes on disk B and vice versa
[23:22] <paravoid> try it
[23:22] <nhm> athrift: I think the H710 is a SAS2208 instead of a SAS2108.
[23:22] <paravoid> try writing sequentially to a random disk and reading from an entirely different disk
[23:22] <paravoid> you'll get reads ranging in the kilobytes per second
[23:22] <athrift> ok, luckily we have H710's sitting around
[23:23] <paravoid> r720xd is our platform too
[23:23] <paravoid> note that there's no migration path from H310 JBOD -> H710
[23:23] <paravoid> you need to reformat the box
[23:23] <paravoid> we've lost months doing that
[23:24] <paravoid> there's no comparison really
[23:24] <nhm> btw, since you guys are interested in R515s and the H700, this is my post: http://lists.us.dell.com/pipermail/linux-poweredge/2012-July/046694.html
[23:26] <nhm> paravoid: that post you linked about reads blocking writes makes me curious about the situation in my post where as soon as I have two writers going to the raid the performance tanks.
[23:26] <paravoid> h310?
[23:26] <nhm> paravoid: H700
[23:28] <nhm> btw, the iodepth=16 doesn't matter in those fio runs, not sure why I had it there.
[23:28] <paravoid> so I'm reading the mail I've written back then
[23:29] <paravoid> so, busy-looping seq reads on 12 disks and trying to write to a 2-drive SSD RAID0 resulted in a write capacity of 30KB/s for the SSDs
[23:30] <paravoid> reading from 7 disks had 400-500KB/s, reading from 6 resulted in a jump to 30MB/s
[23:30] <nhm> wow
[23:30] <paravoid> no reads was 50MB/s
[23:30] <paravoid> that was basically a write(100 bytes); fsync workload
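The "write(100 bytes); fsync" workload paravoid mentions can be approximated with a short Python loop. This is only a sketch of the write side: it does not reproduce the competing sequential readers or the controller setup, and the record count and scratch file location are arbitrary choices.

```python
import os
import tempfile
import time


def fsync_write_rate(path: str, record: bytes = b"x" * 100, count: int = 200) -> float:
    """Append `count` records, fsync after each one, return bytes per second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, record)
            os.fsync(fd)  # force each small write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return len(record) * count / elapsed


# Scratch file in a temporary directory; point it at the device under test instead.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "fsync_test.dat")
    rate = fsync_write_rate(target)
    written = os.path.getsize(target)

print(f"{written} bytes written at {rate / 1024:.1f} KB/s")
```

Because every record is followed by an fsync, the result is dominated by sync latency rather than bandwidth, which is why numbers in the KB/s range are plausible for this kind of workload.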
[23:31] <paravoid> so, the LSI specsheet shows the controller having 8 ports
[23:31] <athrift> We thought of using them for Ceph, but dont want the mess of SAS cables going between the slots
[23:31] <nhm> Can you flash the H310 into a stock LSI card? It'd be interesting to see if you get the same results.
[23:31] <paravoid> the R720xd has 12 external bays + 2 internal
[23:31] <athrift> nhm: yes you can
[23:31] <paravoid> so there's probably a SAS expander in between
[23:31] <paravoid> that may be the culprit
[23:32] <athrift> paravoid: there is, check the Technical Guide
[23:32] <paravoid> so that may be why it sucks so much
[23:32] <nhm> paravoid: yes, both the R515 and the R720XD have SAS expanders, and I'm very suspicious that Ceph in general hates them.
[23:32] <paravoid> or firmware, who knows
[23:32] <athrift> but that wouldnt explain the difference in performance between the H310 and H710....
[23:33] <nhm> paravoid: though I've tested nodes with SAS expanders that don't suck, so it may come down to brand, or the drives being used, or some other crazy thing.
[23:33] <nhm> I wonder if Dell is using LSI expanders
[23:34] <nhm> athrift: yeah, the cables look to be a pain.
[23:35] <athrift> lsscsi shows the expander as a BP12G+EXP
[23:35] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[23:36] <athrift> This discussion is making me reconsider SuperMicro even though they are more expensive
[23:37] <barryo1> We're a Dell shop so thats my only real choice
[23:38] <nhm> athrift: I have had very good performance with an SC847A chassis and multiple controllers. I imagine the 12-bay 2U node would perform very well with a pair of SAS9207-8is, 12 spinning disks, and 2 S3700 SSDs in the 2.5" bays.
[23:38] <nhm> or alternately no SSDs and a pair of 9265s.
[23:38] <nhm> with the WB cache module.
[23:39] <nhm> er, BBU for wb rather.
[23:40] <paravoid> I'm fairly happy with R720xd + H710s
[23:41] <paravoid> we had C2100s previously and boy, they sucked a lot
[23:41] <athrift> So for standard deployment, this sort of thing http://www.supermicro.com/products/chassis/2U/826/SC826BA-R1K28W.cfm ?
[23:43] <nhm> athrift: yeah, I don't have one of those exact chassis, but I expect it'd be like a scaled down version of what I've got.
[23:48] <barryo1> maybe the 720xd is more affordable than i thought
[23:49] * sleinen (~Adium@2001:620:0:26:edec:c0fd:9048:d23a) Quit (Quit: Leaving.)
[23:49] <nhm> barryo1: even just a single E5-2620 should be enough if it's just for OSDs.
[23:49] <nhm> Or a pair of E5-2403s if that's cheaper.
[23:50] <barryo1> It'll be osds and mon
[23:50] <barryo1> we have no need for mds at the moment
[23:51] <athrift> nhm: it looks like the 826 chassis has a SAS expander similar to the R720XD
[23:51] <nhm> athrift: the one you linked me claims to be a "direct attached" backplane.
[23:53] * drokita (~drokita@ has left #ceph
[23:53] <nhm> which is what the one I've got also claims (and has the appropriate number of SFF8087 ports for such claim)
[23:53] <athrift> nhm: you are right, the manual just doesnt correlate to it :)
[23:53] <athrift> nhm: teach me for RTFM
[23:53] <nhm> athrift: somewhere in there they may list both backplanes
[23:54] <nhm> athrift: I remember seeing the E16, E26, and A backplanes in the manual for my chassis.
[23:55] <athrift> nhm: Yes I found it, its the E5 backplane in this case
[23:59] <barryo1> I was hoping to finalise my spec tonight, now I have even more options to consider :s

