#ceph IRC Log


IRC Log for 2013-01-14

Timestamps are in GMT/BST.

[0:08] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[0:18] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[0:38] * dpippenger1 (~riven@cpe-76-166-221-185.socal.res.rr.com) has joined #ceph
[0:43] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[0:45] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[1:08] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[1:11] <dec_> phantomcircuit: we're seeing poor RBD performance too
[1:11] <dec_> starting to really cause some issues - might need to move the VMs off of RBD for now, until we have a fix
[1:12] <dec_> which is unfortunate, because it means testing the ceph/RBD fix will become a lot harder without real load
[1:12] * dec_ is now known as dec
[1:17] <dec> just plugged RBD cache in to a VM to see if that helps, but doesn't seem to
[1:18] <tnt> dec: what version ?
[1:19] <dec> 0.56.1
[1:19] <dec> we were having some issues with 0.53, upgraded to 0.56.1 which fixed the main issue but introduced some severe performance/latency issues
[1:19] <phantomcircuit> dec, yeah there is definitely something wrong
[1:20] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[1:20] <dec> phantomcircuit: are you 0.56.1 too ?
[1:20] <phantomcircuit> there's a performance benchmark for 0.55 which shows performance degradation in specific tests
[1:20] <tnt> dec: which underlying FS ?
[1:20] <phantomcircuit> and improvements in others
[1:20] <dec> tnt: ext4
[1:21] <tnt> I have a maintenance window scheduled for the 26th to upgrade our prod cluster to 0.56.1 ...
[1:21] <dec> from what version, tnt?
[1:21] <tnt> 0.48.x
[1:22] <dec> tnt: here's what happened when we did that upgrade: http://i.imgur.com/hIDa2.png http://i.imgur.com/K0BE4.png
[1:22] <tnt> I have memory leaks in the osd currently which apparently 0.56.1 could fix. (currently I have to restart each OSD in sequence every 4 days or so)
[1:22] <dec> that's just an example from one of our OSD disks - all 18 OSDs had the same change
[1:22] <phantomcircuit> tnt, you using tcmalloc?
[1:23] <tnt> phantomcircuit: ... Given I don't know the answer, I'd guess "no" ?
[1:23] <dec> hmm, we *weren't* using tcmalloc before upgrading to 0.56.1; I wonder if that's introduced the issue...
[1:23] <tnt> It's the packages from ceph repo.
[1:23] <phantomcircuit> dec, i dont think so
[1:23] <phantomcircuit> im not using it
[1:24] <phantomcircuit> i actually switched it on to see if it helped and it had no effect
[1:24] <dec> it reduced memory usage on our OSDs
[1:24] <tnt> dec: wrt those graphs, yeah scary ... I guess I'll just upgrade one of the machines and see how it behaves. I'm on xfs so that might make a difference.
[1:24] <phantomcircuit> yeah it did do that but didn't help with the actual issue
[1:25] <dec> yeah
[1:25] <phantomcircuit> i have plenty of free memory for the osd just evicts some cache which is not great but whatever
[1:25] <dec> phantomcircuit: so you're 0.56.1? what OS + FS ?
[1:25] <tnt> dec: what are you using to monitor iops btw ?
[1:25] <phantomcircuit> tnt, i am too unfortunately
[1:25] <dec> tnt: collectd + graphite
[1:25] <phantomcircuit> dec, collectd + collection3 is what im using
[1:25] <tnt> phantomcircuit: damn, you just ruined my hopes.
[1:26] <tnt> phantomcircuit: were you on 0.48 before ?
[1:26] <phantomcircuit> better to find out now :)
[1:26] <phantomcircuit> yeah i was but i didn't have collectd running
[1:26] <dec> we have had some big issues with XFS running underneath glusterFS recently, so avoided XFS for Ceph and went with ol' reliable ext4
[1:27] <tnt> phantomcircuit: and perf degraded from 0.48 to 0.56 ?
[1:27] <phantomcircuit> tnt, yeah
[1:27] <tnt> I mean currently under 0.48 perf isn't exactly great, but it's good enough for my use.
[1:28] <phantomcircuit> tnt, http://ns238708.ovh.net:666/cgi-bin/graph.cgi?hostname=localhost;plugin=disk;plugin_instance=sdc;type=disk_time;begin=-2678400
[1:28] <phantomcircuit> im guessing you can spot the upgrade pretty easily
[1:31] <phantomcircuit> to be fair there are now more vms on the cluster than before
[1:31] <phantomcircuit> but almost all of them do almost zero io
[1:31] <phantomcircuit> it's peoples irc bouncers and other things
[1:36] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[1:37] <phantomcircuit> fortunately for me im the only one who actually puts any load on the cluster...
[1:38] * The_Bishop__ (~bishop@e177090017.adsl.alicedsl.de) has joined #ceph
[1:38] * The_Bishop_ (~bishop@f052103051.adsl.alicedsl.de) Quit (Read error: Connection reset by peer)
[1:43] <phantomcircuit> ok i see that librbd has logging support
[1:43] <phantomcircuit> but where do they go
[1:43] <dec> phantomcircuit: what version of librbd client are you using for your VM hosts?
[1:45] <phantomcircuit> 0.1.5
[1:45] <via> is there a large performance difference between using tcmalloc and not?
[1:45] <phantomcircuit> via, huge memory usage drop
[1:46] <phantomcircuit> but since there isn't much if any pressure on memory i didn't see any performance increase
[1:46] <via> ok, cool
[1:46] <via> cause i'm not using tcmalloc and wanting slightly higher performance
[1:47] <phantomcircuit> there's probably a slight performance gain beyond the memory reduction
[1:47] <phantomcircuit> but i didn't see it
[1:47] <via> does anyone know if the el6 rpms created are compiled against tcmalloc?
[1:47] <via> the ones available at the official repo
[1:52] <dec> doesn't look like it
[1:52] <dec> % rpm -qp ceph-0.56.1-0.el6.x86_64.rpm --requires 2>/dev/null | grep tcmalloc
[1:52] <dec> %
[1:53] <dec> here's ours:
[1:53] <dec> % rpm -qp ceph-0.56.1-0.mycompany.x86_64.rpm --requires 2>/dev/null | grep tcmalloc
[1:53] <dec> libtcmalloc.so.4()(64bit)
[1:53] <dec> %
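For a binary that is already installed rather than sitting in an rpm, a quick editor's sketch of the same check is to look at the shared-library dependencies directly (this assumes `ceph-osd` is on `$PATH`; no output means no tcmalloc linkage):

```shell
# Inspect an installed ceph-osd for a tcmalloc dependency.
# An empty result means the binary was not linked against tcmalloc.
ldd "$(command -v ceph-osd)" | grep tcmalloc
```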
[1:54] <via> do you compile your own tcmalloc package?
[1:58] <tnt> interesting, the ones for ubuntu precise do use tcmalloc
[1:58] <via> i see gperftools-libs has that lib
[1:59] <via> i'll try building
[1:59] <dec> I use google-perftools package from EPEL repositories
[1:59] <dec> then built our own Ceph RPMs
[2:02] <via> ok
[2:05] <dec> phantomcircuit: what OS are you on?
[2:07] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[2:07] <phantomcircuit> dec, gentoo
[2:07] <phantomcircuit> im rebuilding with different configurations and seeing if it helps
[2:07] <phantomcircuit> so far no combination has changed anything
[2:08] * lx0 (~aoliva@lxo.user.oftc.net) has joined #ceph
[2:09] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[2:09] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) Quit (Ping timeout: 480 seconds)
[2:12] <phantomcircuit> time for git diff with tags
[2:12] <phantomcircuit> ... fun
[2:14] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[2:19] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[2:26] <via> dec: are you building for i386 or 64bit?
[2:28] <dec> 64bit
[2:28] <dec> anyone from inktank around?
[2:29] <via> the tcmalloc in epel is i386 only
[2:34] <phantomcircuit> im guessing it has something to do with the changes to aio in librbd
[2:34] <phantomcircuit> they're fairly extensive
[2:34] <dec> via: el6 EPEL has gperftools-libs-2.0-3.el6.2.x86_64
[2:34] <via> huh
[2:35] <via> maybe i'm pointed to the wrong place
[2:36] <via> yeah, i was pointed at 5, sorry
[2:37] <dec> :)
[2:38] <dec> phantomcircuit: where did the aio stuff change?
[2:39] <phantomcircuit> ceph/src/librbd/*
[2:39] <phantomcircuit> git diff v0.48.3argonaut v0.56.1 src/librbd
[2:39] <dec> yeah - so I didn't change my librbd during the upgrade, and I'm still seeing the issues
[2:40] <dec> so it's in core ceph OSD/filesystem stuff somewhere
[2:52] <phantomcircuit> lol aio results in code that is total nonsense
[2:52] <phantomcircuit> :/
[2:54] * markl (~mark@tpsit.com) Quit (Remote host closed the connection)
[3:10] * BManojlovic (~steki@ Quit (Ping timeout: 480 seconds)
[3:12] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[3:12] * loicd (~loic@2a01:e35:2eba:db10:1171:612f:5b3:c0e9) has joined #ceph
[3:17] * Pagefaulted (~AndChat73@c-67-168-132-228.hsd1.wa.comcast.net) has joined #ceph
[3:17] * Pagefaulted (~AndChat73@c-67-168-132-228.hsd1.wa.comcast.net) Quit ()
[3:17] * Pagefaulted (~AndChat73@c-67-168-132-228.hsd1.wa.comcast.net) has joined #ceph
[3:17] <nhm> dec: btw, I've seen some indications that the filestore flusher changes have hurt ext4 performance now.
[3:18] <nhm> dec: not sure if you tried changing it to flush_min=0 and enabled, but that may help.
[3:18] <dec> hi nhm - you mentioned that a few days ago.
[3:18] <dec> I haven't tried to change the flush interval - because I found out that it didn't change between the two versions that I was running
[3:18] <dec> I went from v0.53 to v0.56.1, and both have the same flush interval as default
[3:18] <nhm> dec: some raid0 tests just finished up, small write performance was 2x with the flusher enabled and flush_min set to 0.
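What nhm describes would look roughly like the following ceph.conf fragment — a sketch assuming the bobtail-era option names (`filestore flusher`, `filestore flush min`); the values mirror his test, not a recommendation:

```ini
[osd]
    ; keep the filestore flusher enabled (the default changed around this era)
    filestore flusher = true
    ; flush writes of any size instead of waiting for the 64 KB default minimum
    filestore flush min = 0
```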
[3:19] <dec> what platform/OS do you test on nhm?
[3:19] <dec> ubuntu?
[3:19] <nhm> dec: that was on ubuntu 12.04 with a ceph build just shy of 0.56
[3:28] <dec> I noticed that syncfs on EL6 changed from 0.53 to 0.56.1
[3:29] <dec> I'm wondering what impact that had
[3:30] <dec> it wouldn't have been using syncfs before, which theoretically would have made performance worse before the upgrade
[3:30] <dec> but I wonder if it's changed something to do with the sync pattern
[3:45] <dec> nhm: would it help my case to disable the flusher altogether?
[3:49] <nhm> dec: it seems that disabling the flusher actually hurts EXT4 performance. XFS and BTRFS seem to often do better with it off though.
[3:58] <dec> nhm: ok
[3:59] <dec> nhm: is there any way to see whether the OSD journals are full?
[4:01] <dec> and the journal queues
[4:03] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[4:05] <nhm> dec: you should be able to see that if you connect to the admin socket for the OSD
[4:07] <dec> yep, I just found that :)
[4:08] <dec> they're all 0 - no queued ops
[4:08] <dec> oh I lie, every now and then there's a few
[4:09] <dec> example: https://gist.github.com/4527505
[4:10] <dec> I only asked because I noticed journal_queue_max_ops and journal_queue_max_bytes was reduced recently
[4:12] <phantomcircuit> iirc journal queue max ops isn't the ops in the journal but the ops waiting to go into the journal
[4:13] <dec> yup
[4:14] <dec> can I set the flusher off via the admin socket whilst an OSD is live? nhm?
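One way to try this on a live OSD, sketched by the editor — `injectargs` and the admin socket both existed in this era, though not every option takes effect without a restart (the osd id and socket path below are assumptions):

```shell
# Ask the running osd.0 to change the option in memory:
ceph tell osd.0 injectargs '--filestore_flusher false'

# Or confirm the currently active value through the admin socket:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep flusher
```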
[4:18] <phantomcircuit> im pretty sure this isn't just the flusher
[4:25] <phantomcircuit> dec the primary issue with tracking down the regression is that there's ~14k modified loc between 0.48 and 0.56
[4:26] <dec> phantomcircuit: for me, it was introduced between 0.53 and 0.56
[4:26] <dec> which is less, but still significant
[4:26] <phantomcircuit> oh i meant 145k
[4:27] <phantomcircuit> that has to be 99% copying stuff around though
[4:27] <phantomcircuit> but the diff is still unpossible
[4:28] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:32] <nhm> phantomcircuit: Are you seeing similar symptoms as dec?
[4:32] <nhm> dec: I actually don't know to be honest.
[4:33] <phantomcircuit> nhm, yes but i went straight from 0.48 to 0.5 oh wait 0.55
[4:33] <phantomcircuit> nhm, http://ns238708.ovh.net:666/cgi-bin/graph.cgi?hostname=localhost;plugin=disk;plugin_instance=sdc;type=disk_time;begin=-2678400
[4:33] <phantomcircuit> spot the version bump
[4:34] <nhm> phantomcircuit: what filesystem?
[4:35] <phantomcircuit> xfs with a 10GB journal on an ssd
[4:35] <phantomcircuit> that's the filestore xfs
[4:39] <nhm> phantomcircuit: what kind of workload?
[4:40] <nhm> Also, what kind of controller?
[4:40] <phantomcircuit> nhm, everything is fine until i start bitcoind in a vm
[4:40] <phantomcircuit> which has a nasty habit of calling fsync constantly
[4:40] <phantomcircuit> but possibly that's just a coincidence
[4:41] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[4:42] <nhm> phantomcircuit: the bitcoin workload was also running in the section of the graph before the spike?
[4:43] <phantomcircuit> yes
[4:44] <nhm> phantomcircuit: that was on rbd?
[4:44] <phantomcircuit> yes
[4:45] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[4:45] <phantomcircuit> nhm, i sort of suspect that there is a filesystem flush happening which shouldn't be
[4:46] * loicd (~loic@2a01:e35:2eba:db10:1171:612f:5b3:c0e9) Quit (Quit: Leaving.)
[4:46] <phantomcircuit> like the flush op from rbd is being run for both the journal and the filestore immediately
[4:46] * loicd (~loic@magenta.dachary.org) has joined #ceph
[4:47] <nhm> phantomcircuit: we should probably talk to Josh. I mostly have been working on filestore performance analysis.
[4:48] <nhm> phantomcircuit: Soonish I'm actually hoping to start doing a deeper investigation into RBD performnace.
[4:48] <phantomcircuit> nhm, i have a desktop here i can repurpose just to more deeply explore this issue
[4:48] <phantomcircuit> i think i'll do that tomorrow
[4:49] <phantomcircuit> bbl
[4:49] <nhm> phantomcircuit: have a good evening
[4:50] <dec> let me know if I can help debug - keen to get this fixed
[4:56] <nhm> dec: When I get a moment I'll try to go back and look at the svctime and queue wait time of the IOs during the argonaut vs bobtail preview.
[4:56] <dec> ok, cool
[4:57] <nhm> dec: given that in general performance improved, I doubt I'm going to see what you saw though. I'm wondering if it's specific to RBD.
[5:04] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[5:04] * loicd (~loic@magenta.dachary.org) has joined #ceph
[5:08] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[5:19] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[5:45] <dec> phantomcircuit: if you're around - are your OSD journals on the OSD disks, or separate?
[6:03] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) has joined #ceph
[6:09] <phantomcircuit> dec, separate on ssds
[6:09] <phantomcircuit> io latency on them is never > 2ms
[6:15] <dec> so it's definitely the filestore disk that's getting thrashed, not the journal?
[6:15] * renzhi (~renzhi@ has joined #ceph
[6:22] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[6:22] <phantomcircuit> dec, yes
[6:22] <phantomcircuit> a simple single-threaded bonnie++ run in one of the vms is causing 100% utilization of both hdds
[6:23] <phantomcircuit> ssd is trivial at ~2%
[6:23] <phantomcircuit> and only that in bursts
[6:24] <phantomcircuit> dec, http://pastebin.com/raw.php?i=gtCYYQmS
[6:24] <nhm> phantomcircuit: does rados bench do it too?
[6:25] <nhm> phantomcircuit: actually, I gotta go to bed, but if you are around tomorrow let me know if you try it
[6:29] * sjustlaptop (~sam@71-83-191-116.dhcp.gldl.ca.charter.com) Quit (Ping timeout: 480 seconds)
[6:38] <phantomcircuit> nhm, there's something weird afoot
[7:17] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[7:17] * loicd (~loic@magenta.dachary.org) has joined #ceph
[7:24] <morpheus__> hum updated 3 osds on our cluster to 0.56.1. OSDs are working fine but rbd info xyz throws errors. is there another compatibility issue 0.56.1 <-> argonaut.2?
[7:27] <morpheus__> e.g. error opening image kvm1272: (5) Input/output error
[7:31] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[7:31] * loicd (~loic@magenta.dachary.org) has joined #ceph
[7:37] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[7:38] <phantomcircuit> nhm, im stumped
[7:38] <phantomcircuit> journal ssd really does 20k IOPS sequential write
[7:49] * gaveen (~gaveen@ has joined #ceph
[7:49] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) Quit (Read error: Connection reset by peer)
[7:51] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[7:59] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[8:08] * gaveen (~gaveen@ has joined #ceph
[8:13] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[8:32] * sleinen (~Adium@ has joined #ceph
[8:34] * sleinen1 (~Adium@2001:620:0:26:2cd5:5a9e:7d4f:8405) has joined #ceph
[8:40] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[8:50] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:04] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) Quit (Quit: Some folks are wise, and some otherwise.)
[9:10] * loicd (~loic@ has joined #ceph
[9:19] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:24] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[9:29] * Morg (d4438402@ircip2.mibbit.com) has joined #ceph
[9:30] * madkiss1 (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[9:36] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[9:36] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[9:52] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[10:01] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[10:01] * schlitzer|work (~schlitzer@ Quit (Read error: Connection reset by peer)
[10:01] * schlitzer|work (~schlitzer@ has joined #ceph
[10:05] * fc (~fc@home.ploup.net) has joined #ceph
[10:18] * Leseb (~Leseb@ has joined #ceph
[10:20] * BManojlovic (~steki@ has joined #ceph
[10:31] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:34] * nz_monkey (~quassel@ has joined #ceph
[10:38] * LeaChim (~LeaChim@b0fadd12.bb.sky.com) has joined #ceph
[10:42] * nz_monkey (~quassel@ Quit (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[10:42] * nz_monkey (~quassel@ has joined #ceph
[10:43] * ninkotech (~duplo@ip-94-113-217-68.net.upcbroadband.cz) has joined #ceph
[10:48] * nz_monkey (~quassel@ Quit (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[10:48] * nz_monkey (~quassel@ has joined #ceph
[10:51] * nz_monkey (~quassel@ Quit ()
[10:52] * nz_monkey (~quassel@ has joined #ceph
[10:53] * nz_monkey (~quassel@ has left #ceph
[10:55] * nz_monkey (~quassel@ has joined #ceph
[10:56] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[10:57] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) Quit (Quit: tryggvil)
[11:05] * fc (~fc@home.ploup.net) Quit (Quit: leaving)
[11:05] * sbadia (~sbadia@ Quit (Ping timeout: 480 seconds)
[11:06] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[11:06] * fc (~fc@home.ploup.net) has joined #ceph
[11:07] * fc (~fc@home.ploup.net) Quit ()
[11:08] * fc_ (~fc@home.ploup.net) has joined #ceph
[11:08] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[11:11] * ScOut3R__ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[11:13] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[11:15] * dxd828 (~dxd828@ has joined #ceph
[11:18] * ScOut3R_ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[11:22] * renzhi (~renzhi@ Quit (Ping timeout: 480 seconds)
[11:31] * andret (~andre@pcandre.nine.ch) has joined #ceph
[11:35] * dxd828 (~dxd828@ Quit (Quit: Computer has gone to sleep.)
[11:53] <tnt> Oh ... you can't downgrade an OSD ?
[11:56] * sbadia (~sbadia@yasaw.net) has joined #ceph
[11:58] <madkiss1> i don't think you can
[12:00] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[12:01] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[12:05] * ScOut3R__ (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[12:06] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[12:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[12:12] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:13] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[12:24] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[12:25] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[12:30] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Ping timeout: 480 seconds)
[12:35] * Leseb (~Leseb@ Quit (Quit: Leseb)
[12:36] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[12:47] * tryggvil_ (~tryggvil@rtr1.tolvusky.sip.is) has joined #ceph
[12:50] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Ping timeout: 480 seconds)
[12:50] * tryggvil_ is now known as tryggvil
[12:52] * gaveen (~gaveen@ Quit (Ping timeout: 480 seconds)
[13:00] * gaveen (~gaveen@ has joined #ceph
[13:03] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[13:03] * low (~low@ has joined #ceph
[13:04] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[13:11] * Leseb (~Leseb@ has joined #ceph
[13:20] <tnt> That's inconvenient ...
[13:22] <absynth_47215> from bobtail to argo?
[13:23] <tnt> yes
[13:23] <tnt> I would have liked "a way back" if the upgrade doesn't perform as expected.
[13:24] <absynth_47215> well, according to a posting in the ML, you can
[13:26] <tnt> Oh really ? Do you know which post ? Because it definitely didn't work ...
[13:26] <tnt> some assert failure in CompatSet::unsupported or something.
[13:27] <absynth_47215> ah, sorry
[13:27] <absynth_47215> it was an rbd downgrade, not an osd downgrade
[13:27] <absynth_47215> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/12006
[13:28] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Quit: Leaving.)
[13:28] <tnt> yes, that one I had seen. I've just now found in the release notes "Once each individual daemon has been upgraded and restarted, it cannot be downgraded."
[13:30] <absynth_47215> buy a support contract and have inktank do the upgrades for you :)
[13:30] <absynth_47215> that way, it's at least not your fault
[13:33] <tnt> unfortunately I'm pretty sure I would still be the one having to fix it ...
[13:34] <absynth_47215> no, if you have a support contract the inktank engineers would probably make it work
[13:37] <nhm> We do try to make things work for people.
[13:40] <tnt> I'm not too worried about the "provide correct result" part. More about the performance. Seems several people have experienced severe degradation when using RBD after the upgrade and those are still unexplained AFAIK.
[13:40] * Morg (d4438402@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[13:41] <absynth_47215> well, then nhm is probably the best to address this concern ;)
[13:47] <tnt> BTW, has anyone experienced with disabling journal for FS on RBD ?
[13:48] <absynth_47215> not sure what the question is?
[13:49] <tnt> s/experienced/experimented/
[13:50] <tnt> since RBD storage itself has a journal, doesn't that make it kind of redundant ?
[13:52] <fghaas> tnt: now think about that for a minute :)
[13:52] <tnt> fghaas: I'm trying but it makes my head hurt ... I must admit I'm not sure of all the assumptions FS make to stay consistent.
[13:53] <tnt> Right now the way I see it, each write on ext4 on RBD becomes 8 writes to physical hdds.
[13:54] <tnt> (assuming rep size = 2)
[13:55] <janos> i am a noob, so grain of salt with this - but i was under the impression that a journal in ceph also had the benefit of a fast nearby write, where it could then do lazy writes out to the appropriate osd's
[13:57] <tnt> yes, my understanding is that once committed to the journal of both OSDs, the write will "return" and be considered done. But at some point it will still have to do the write to the filestore.
[13:59] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Ping timeout: 480 seconds)
[14:00] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) Quit (Quit: Leaving.)
[14:00] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has joined #ceph
[14:03] <darkfaded> tnt: i have not experimented but i think turning off the fs journal is not, uh, an option that makes more sense due to having an osd journal
[14:04] <darkfaded> basically because your fs journal is more of an open/do/close/done/open/do/close/done thing
[14:04] <darkfaded> and it would be no good if it is "ripped off" at some point in the middle
[14:04] <darkfaded> ext4 is going to get a trustworthy checksummed journal real soon now[tm]
[14:05] <darkfaded> then it can identify and cut away stuff that wasn't completed because it didn't make it into the osd journal for example because of some ordering issue
[14:05] <darkfaded> (in theory there's no issues of course :)
[14:06] <darkfaded> if you wanna speed up things and are only concerned with a few large filesystems you could leverage an external ext journal
[14:06] <darkfaded> it's a maintenance PITA though
[14:08] <darkfaded> (my idea being that the external journal gives you 2x4 writes instead of 8x1, same amount but could go to diff osds making it faster)
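darkfaded's external-journal idea, sketched by the editor as a hypothetical e2fsprogs invocation — device names are invented, and whether this actually helps on RBD depends on where the journal device's objects land:

```shell
# Turn one small device into a dedicated ext4 journal device...
mke2fs -O journal_dev /dev/rbd1

# ...then create the data filesystem pointing its journal at it.
mkfs.ext4 -J device=/dev/rbd1 /dev/rbd0
```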
[14:08] <tnt> I'm talking about the FS on a formatted RBD block device here, so to keep the 'movable' aspect of RBD, I'd have to have the external journal on another RBD block device, so that would most likely not do anything good.
[14:08] <darkfaded> yes i understood that it's on RBD
[14:09] <darkfaded> if you can make the journal really write to other osds than the fs data goes to this would be faster, as it's really very parallel
[14:09] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:09] <tnt> Yes, ok, I hadn't seen the spread over OSD aspect.
[14:09] <darkfaded> was the 2x4 vs. 1x8 example understandable?
[14:09] <tnt> although with striping of RBD images, data and journal might end up in different OSDs already.
[14:10] <tnt> yes it was.
[14:10] <darkfaded> oki
[14:11] <tnt> I'm not looking for cutting-edge perf. What I have on argonaut suits me fine, I was just wondering if it wasn't crazy to have all those journals ...
[14:11] <darkfaded> no, not crazy ;)
[14:13] <darkfaded> but i see your point. if i think about the osds residing on a journalled fs then i start getting into a mindtrap
[14:13] * jrisch (~Adium@83-95-19-94-static.dk.customer.tdc.net) has left #ceph
[14:14] <darkfaded> if in doubt do it like ZFS and blame it all on the hardware
[14:14] <tnt> hehe :)
[14:25] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:39] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[14:40] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[14:46] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[14:46] <LeaChim> Hi all, I had 2 OSDs in the cluster, I tried to add a new one (on a newly created logical partition/filesystem), and it crashes on startup: http://xelix.net/ceph-osd.2.log
[14:47] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:47] * KindOne (~KindOne@h87.44.28.71.dynamic.ip.windstream.net) Quit (Remote host closed the connection)
[14:51] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) has joined #ceph
[14:52] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) has joined #ceph
[14:53] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:01] <dec> Can I do anything to limit rebuild ("backfill"?) rate after an osd comes back up?
[15:02] * KindOne (~KindOne@h87.44.28.71.dynamic.ip.windstream.net) has joined #ceph
[15:02] <dec> The act of re-adding an OSD which is down (say, because we manually restarted it), causes a big latency spike in librbd IO
[15:03] <absynth_47215> yup, but usually you want that OSD back in ASAP
[15:03] <absynth_47215> i _think_ you can kind of configure that via some worker thread setting
[15:04] <absynth_47215> but i don't know where to look for the docs on that. some setting flew past me on irc or mailing list the other day
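The throttles absynth_47215 is probably thinking of would look like this as a ceph.conf fragment — an editor's sketch assuming the 0.56-era option names (`osd max backfills` arrived with bobtail's backfill reservations); the values are illustrative, not defaults:

```ini
[osd]
    ; limit concurrent backfill operations per OSD
    osd max backfills = 1
    ; limit simultaneous active recovery ops per OSD
    osd recovery max active = 1
```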
[15:04] <dec> I also wonder, if we're just doing a simple restart, whether it's safe to mark the osd 'nodown' so the cluster doesn't notice...
[15:04] <nhm> tnt: regarding rbd performance, do things also look bad with rados bench?
[15:05] <absynth_47215> there is a "ceph osd noout" setting that configures the time that can pass until a node is marked down
[15:05] <absynth_47215> mon osd down out interval = 300
[15:05] <dec> absynth_47215: actually that's the time until it's marked 'out', which is different than 'down'
[15:05] <tnt> nhm: I haven't upgraded yet myself ...
[15:05] <nhm> tnt: ah, ok.
[15:06] <absynth_47215> dec: what's the practical difference?
[15:06] <tnt> nhm: dec has the issue and phantomcircuit as well AFAIK. And I'm worried about it happening when I update :p
[15:06] <absynth_47215> i.e., what happens if it is marked nodown in contrast to noout?
[15:07] <dec> absynth_47215: if it's 'out', ceph begins moving data to other nodes.
[15:07] <dec> absynth_47215: after being 'down' for a period of time, ceph marks it as 'out' and begins that process
[15:07] <nhm> tnt: So far I don't think I've heard about it from anyone else, but that doesn't mean it isn't happening.
[15:08] <dec> 'down' and 'up' are what the gossip protocol detect, so they're automatic states...
[15:08] <tnt> dec: wouldn't leaving it 'up' make things worse ? AFAIU, writes to PGs on that OSD would block until it actually responds ...
[15:08] <absynth_47215> yep
[15:09] <absynth_47215> that's what i would suspect, too
[15:09] <dec> I'm not sure...
[15:10] <dec> it takes mere seconds for it to go down and up
[15:10] <absynth_47215> yeah, but that should not amount to a huge backfill, either
[15:10] <dec> whereas it takes ~15-30 seconds for all of the juggling to occur after the mons detect a down OSD as up
[15:10] <absynth_47215> so aren't you exorcising the devil with beelzebub?
[15:11] <tnt> dec: if you restart it, it would still have to go through all the peering stuff right ? since it lost its state and connections to other OSDs
[15:15] * loicd (~loic@3.46-14-84.ripe.coltfrance.com) Quit (Ping timeout: 480 seconds)
[15:15] <absynth_47215> very interesting discussion on the list, fghaas
[15:15] <absynth_47215> (+ blog article)
[15:15] <fghaas> absynth_47215: um, thanks, I think? :)
[15:16] <fghaas> (if that was a compliment)
[15:16] <absynth_47215> sure, why wouldn't it be :)
[15:18] * Dr_O (~owen@00012c05.user.oftc.net) has joined #ceph
[15:19] * Dr_O_ (~owen@00012c05.user.oftc.net) has joined #ceph
[15:20] <fghaas> absynth_47215: "interesting" can also mean "you're an a**hole and dead wrong" :)
[15:20] * Dr_O_ (~owen@00012c05.user.oftc.net) Quit ()
[15:22] <dec> what happens to writes whilst OSDs and Mons are voting on an OSD being down?
[15:25] <nhm> fghaas: definitely not. :)
[15:25] <jluis> afaiu, the osds don't vote on other osds status, they only report perceived failures to the monitors; the monitors will update the osdmap if they believe the reported osd is in fact down
[15:25] <nhm> fghaas: sorry if my response seemed cold btw, I'm watching two kids, one of which was throwing a tantrum while I was trying to write that. :)
[15:26] <fghaas> nhm: it didn't at all :)
[15:26] <jluis> during a vote, messages to the monitors are queued
[15:27] <nhm> fghaas: nothing impresses me more than the wrath of an angry three year old. ;)
[15:28] <fghaas> mine are, thankfully, 7 and 8 now, but they can get pretty impressively angry too :)
[15:31] <absynth_47215> fghaas: oh, like food can taste "interesting"?
[15:31] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Remote host closed the connection)
[15:34] <dec> I need to get to bed; chat with you all tomorrow!
[15:34] <absynth_47215> specifically, i did not realize that "Ceph OSDs currently cannot recover from a broken SSD journal without reinitializing and recovering the entire filestore. "
[15:35] * agh (~agh@www.nowhere-else.org) has joined #ceph
[15:35] <absynth_47215> that sounds painful if you have one SSD per host (and 36 spinners, har har)
[15:37] <nhm> absynth_47215: I just know someone is going to come to me with that setup some day. :)
[15:37] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) has joined #ceph
[15:38] <absynth_47215> well, we made him reconsider for a while, but he will certainly bounce back
[15:38] <nhm> absynth_47215: oh, someone was?
[15:38] <absynth_47215> yeah
[15:38] <nhm> doh
[15:39] <absynth_47215> there was a dude with storage blades
[15:39] <nhm> I actually am not entirely opposed to using the 36 drive nodes (I have one after all!) but they really should be earmarked for *large* deployments.
[15:39] <absynth_47215> i think 24 disks per enclosure, and one controlling blade per enclosure
[15:39] <absynth_47215> one OSD per blade
[15:40] <absynth_47215> sorry, one SSD per blade
[15:40] <absynth_47215> at least if i remember correctly - fghaas was around too
[15:41] <nhm> The trick is to find a solution that gives you the density of the big 60 drive 4U boxes with the benefits of lots of small nodes.
[15:41] <absynth_47215> you mean backblaze pods?
[15:42] <nhm> absynth_47215: I don't think they've hit 60 drives in 4U yet.
[15:42] <absynth_47215> they are at 45
[15:42] <absynth_47215> who has 60?
[15:43] <low> absynth_47215: 45 being 2.5" ones?
[15:43] <absynth_47215> no, 3.5"
[15:43] <absynth_47215> 3tb hitachi drives
[15:43] <nhm> absynth_47215: scalable informatics, sanmina, dell, others I may not be able to talk about.
[15:43] <absynth_47215> (in the sample configuration that they use for the blog)
[15:44] <dec> our chassis are supermicro 45 in 4RU
[15:44] <dec> 45x 3.5"
[15:44] <nhm> absynth_47215: if you go external, DDN has 4U box with 84 3.5" disks in it (though it needs a controller and node).
[15:45] <absynth_47215> do you have an SKU for those chassis, dec?
[15:45] <nhm> dec: yeah, I've got the 36-drive A chassis in my basement.
[15:45] <absynth_47215> nhm: for the sake of comparability, let's stick to internal
[15:45] <absynth_47215> nhm: uh, what exactly does a private home need such a storage for...? (should i ask? do i want to know?)
[15:45] <dec> absynth_47215: ours are SC847E26-RJBOD1
[15:46] <janos> "need"?
[15:46] <low> hmmmm I didn't receive a single mail from ceph-devel since 20121219
[15:46] <janos> we don't live by subsistence!
[15:46] <nhm> absynth_47215: my basement is the inktank performance lab. ;)
[15:46] <absynth_47215> low: did you have account issues? the kernel lists have bounce detection
[15:46] * fc_ (~fc@home.ploup.net) Quit (Quit: leaving)
[15:46] <absynth_47215> at least they used to
[15:47] <absynth_47215> nhm: uh. nice.
[15:47] <low> absynth_47215: not that I know of.
[15:47] <nhm> absynth_47215: http://ceph.com/wp-content/uploads/2012/09/Inktank_Performance_Lab.jpg
[15:47] <dec> nhm: are you all (Inktank) in LA? (just curious)
[15:47] <absynth_47215> wolfgang is on the east coast
[15:47] <nhm> dec: nope, I'm in Minneapolis. Joao is in Portugal.
[15:47] <dec> Oh, right. :)
[15:47] <absynth_47215> plywood. the basic ingredient of any highly secure lab environment.
[15:48] <dec> So is there actually an "Inktank office" ? :)
[15:48] <absynth_47215> i think i asked joao before, but where exactly in portugal are you again?
[15:48] <jluis> there's even two of them
[15:48] <nhm> dec: yeah, they rented out some space across the hall from the DreamHost offices.
[15:48] <absynth_47215> (i seem to remember lisboa)
[15:48] <jluis> absynth_47215, Lisbon
[15:48] <jluis> yep
[15:48] <absynth_47215> we talked about it on WHD 2012, didnt we
[15:48] <jluis> we did
[15:48] <dec> nhm: heh, cool :)
[15:49] <absynth_47215> is there going to be any inktank/ceph footprint at CeBIT 2013?
[15:49] * tjikkun (~tjikkun@2001:7b8:356:0:225:22ff:fed2:9f1f) Quit (Ping timeout: 480 seconds)
[15:49] <nhm> absynth_47215: no idea, probably depends on if there is a good business reason to go.
[15:50] <absynth_47215> i'll buy the first round?
[15:52] <nhm> absynth_47215: I went out to SC12, that was a ton of fun. :)
[15:53] * fc_ (~fc@home.ploup.net) has joined #ceph
[15:56] * ghbizness (~ghbizness@host-208-68-233-254.biznesshosting.net) has joined #ceph
[15:57] <ghbizness> has anyone successfully deployed ceph using a chef cookbook while modifying some of the deployment scripts ?
[15:58] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) Quit (Quit: Zzzzzz)
[16:02] <jtang> morning
[16:02] * nyeates (~nyeates@pool-173-59-239-231.bltmmd.fios.verizon.net) has joined #ceph
[16:02] <jtang> well afternoon
[16:11] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[16:11] <nhm> morning jtang
[16:13] <jtang> i was wondering which poor soul in the documentation team is doing the ditaa diagrams for ceph?
[16:13] <jtang> i was just reading some of the docs and noticed there are quite a few now :)
[16:17] <jluis> jtang, that's probably John Wilkins
[16:17] <jluis> when it comes to docs, he's the man
[16:17] * fc_ (~fc@home.ploup.net) Quit (Quit: leaving)
[16:28] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) has joined #ceph
[16:29] * aliguori (~anthony@ has joined #ceph
[16:32] <absynth_47215> are there state diagrams online for the different OSD states (in,out,down,up) and for the different object "movements" (peering, backfill, remap etc.)?
[16:35] * fc_ (~fc@home.ploup.net) has joined #ceph
[16:38] * markl (~mark@tpsit.com) has joined #ceph
[16:44] <elder> jluis, it was the master branch, not the next branch that had the trouble I saw.
[16:56] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[16:56] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[17:01] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[17:01] * agh (~agh@www.nowhere-else.org) has joined #ceph
[17:01] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) has joined #ceph
[17:11] * jlogan1 (~Thunderbi@2600:c00:3010:1:9cc3:821f:978c:5b0b) has joined #ceph
[17:17] * allsystemsarego (~allsystem@ has joined #ceph
[17:18] * tryggvil (~tryggvil@rtr1.tolvusky.sip.is) Quit (Quit: tryggvil)
[17:18] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) has joined #ceph
[17:19] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:21] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[17:21] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:32] * PerlStalker (~PerlStalk@ has joined #ceph
[17:42] * madkiss1 (~madkiss@chello062178057005.20.11.vie.surfer.at) Quit (Quit: Leaving.)
[17:43] * BManojlovic (~steki@ has joined #ceph
[17:45] * sleinen1 (~Adium@2001:620:0:26:2cd5:5a9e:7d4f:8405) Quit (Quit: Leaving.)
[17:45] * sleinen (~Adium@ has joined #ceph
[17:45] * tnt (~tnt@212-166-48-236.win.be) Quit (Ping timeout: 480 seconds)
[17:47] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[17:51] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[17:52] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[17:53] * sleinen (~Adium@ Quit (Ping timeout: 480 seconds)
[17:55] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[17:56] * low (~low@ Quit (Quit: Leaving)
[17:57] <sstan> is the documentation available on a single file?
[17:58] * tnt (~tnt@216.186-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:58] <sstan> in case one might want to print it or read it on a reader
[18:03] * zerthimon (~zerthimon@sovintel.iponweb.net) has joined #ceph
[18:04] * ScOut3R (~ScOut3R@dslC3E4E249.fixip.t-online.hu) Quit (Ping timeout: 480 seconds)
[18:05] <zerthimon> is there a way to disable deep-scrub completely ?
[18:08] <zerthimon> in v0.56.1
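(There was no single off switch for deep scrubbing in that era; one workaround, assuming the `osd deep scrub interval` option is honored by your build, is to stretch the interval so far out that deep scrubs effectively never trigger:)

```
[osd]
    ; deep-scrub cadence in seconds; pushing it out to roughly a year
    ; effectively disables deep scrubbing without a dedicated off switch
    osd deep scrub interval = 31536000
```

(OSDs would need a restart to pick this up, or the value could be pushed to running daemons with injectargs if your version supports it.)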
[18:10] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[18:19] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[18:19] * sleinen (~Adium@2001:620:0:25:88a2:6e21:1b6d:9b72) has joined #ceph
[18:26] * loicd (~loic@magenta.dachary.org) has joined #ceph
[18:26] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[18:26] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[18:29] <mikedawson> I continue to think the remapping is not working properly in 0.56.1
[18:29] <mikedawson> 2013-01-14 12:28:58.710608 mon.0 [INF] pgmap v1052: 2432 pgs: 80 active, 2131 active+clean, 162 active+remapped, 59 active+degraded; 0 bytes data, 222 MB used, 1903 GB / 1903 GB avail
[18:30] <noob2> your remaps never return?
[18:30] <paravoid> mikedawson: http://tracker.newdream.net/issues/3747 ?
[18:30] <mikedawson> paravoid: thx
[18:30] <noob2> paravoid: i actually did the same thing without the adding of OSD's and it worked ok
[18:30] <paravoid> what do you mean?
[18:31] <noob2> i changed my cluster from choose node to choose rack and it remapped ok
[18:31] <noob2> after about 5hrs of grinding
[18:31] <paravoid> ah
[18:31] <noob2> came back clean
[18:31] <paravoid> mine also had more OSDs added
[18:31] <paravoid> in any case, it was a very small percentage of OSDs
[18:31] * Leseb (~Leseb@ Quit (Quit: Leseb)
[18:32] <noob2> mikedawson: it looks like the kernel panic's i had on friday were related to ext4 and python
[18:32] <noob2> i upgraded my kernel to mainline 3.7.2 to see if that helps things
[18:32] * gregaf (~Adium@2607:f298:a:607:8006:2bb6:9c15:a221) Quit (Quit: Leaving.)
[18:32] <mikedawson> I ran mkcephfs against one OSD backed by an SSD. Added second OSD by hand, then third and forth. Never added any data. Stuck remapping
[18:33] <noob2> wow
[18:33] <paravoid> wow indeed
[18:33] <paravoid> mikedawson: I'd suggest commenting on the bug
[18:33] <paravoid> or I can do it for you
[18:33] <noob2> ubuntu 12.04 LTS just like the bug ?
[18:33] <mikedawson> paravoid: I'll do it
[18:33] <mikedawson> noob2: 12.10
[18:33] <noob2> gotcha
[18:34] <noob2> my proxies are on 12.10 and they're kinda flaky so far
[18:34] <mikedawson> noob2: I've had no problems related to 12.10, but lots of verified (and a couple unverified) ceph bugs
[18:35] <noob2> yeah that's why i thought going with a higher mainline kernel might help. newer changes rolled in
[18:36] <noob2> oh i had a general question for you guys. if i set my osd weight = 3 (1 weight per TB of hard drive ) would i be able to add 4TB drives by setting their weight to 4?
[18:36] <noob2> it made sense in my head but i wanted to ask also
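(That is how the convention works: CRUSH weights are relative, so at 1.0 per TB a 4 TB drive simply gets weight 4.0 alongside the weight-3.0 drives. A sketch of the relevant host bucket in a decompiled crushmap — IDs and names hypothetical:)

```
host node1 {
        id -2
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 3.000   # existing 3 TB drive
        item osd.4 weight 4.000   # newly added 4 TB drive
}
```

(For a live OSD the same adjustment can be made without recompiling the map, e.g. `ceph osd crush reweight osd.4 4.0`.)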
[18:36] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[18:36] * madkiss (~madkiss@chello062178057005.20.11.vie.surfer.at) has joined #ceph
[18:36] <paravoid> what do kernels have to do with either remapped pgs or proxies?
[18:37] <noob2> paravoid: well for my proxy i'm mounting ceph rbd's and then exporting them over fibre. so the kernel makes a big difference
[18:37] <paravoid> oh
[18:37] <noob2> if i ceph osd down 1 will it remap data?
[18:37] <noob2> i'd like to test my kernel panic without downing my box again :)
[18:39] * gregaf (~Adium@ has joined #ceph
[18:41] <paravoid> I think it will
[18:43] * xmltok (~xmltok@pool101.bizrate.com) Quit (Quit: Leaving...)
[18:47] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[18:48] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:48] <noob2> paravoid: this works great with the new 3.7.2 kernel
[18:48] <noob2> no proxy crashes when my cluster goes into remapping mode
[18:49] <paravoid> by fibre you mean FC?
[18:49] * sleinen1 (~Adium@2001:620:0:26:1c34:9b3:41cb:d5d7) has joined #ceph
[18:50] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[18:52] * xdeller (~xdeller@broadband-77-37-224-84.nationalcablenetworks.ru) has joined #ceph
[18:53] * sleinen (~Adium@2001:620:0:25:88a2:6e21:1b6d:9b72) Quit (Read error: Operation timed out)
[18:55] <mikedawson> so my PGs stuck on active+degraded and active+remapped were fixed when I set the experimental ceph tunables
[18:55] <mikedawson> I can't say that's a good practice, but I can't seem to get 0.56.1 to work consistently when adding or removing OSDs without them
[18:55] * chutzpah (~chutz@ has joined #ceph
[18:57] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[18:58] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:00] * danieagle (~Daniel@ has joined #ceph
[19:01] <noob2> paravoid yeah i mean FC
[19:01] <paravoid> what do you use for the scsi target part?
[19:01] <paravoid> lio?
[19:03] * benner_ (~benner@ Quit (Read error: Connection reset by peer)
[19:03] * benner (~benner@ has joined #ceph
[19:05] <noob2> yeah lio utils
[19:06] * Cube (~Cube@ has joined #ceph
[19:06] <noob2> i wrote a python script to mount and setup the fibre channel targets on boot
[19:07] <darkfaded> noob2: can i see that or is it something you wanna keep internal?
[19:07] <darkfaded> it sounds interesting, especially from the lio side of things
[19:08] <darkfaded> and also, what OS do you run lio on? last time i tried was on fc16 and fc target wasn't really working there
[19:08] <noob2> i'm running it on ubuntu 12.10
[19:08] <noob2> sure lemme post it for ya
[19:08] <darkfaded> i then rpm -U'd to the fc17 version but that was not one of the smarter things i've done
[19:08] <darkfaded> cool
[19:09] <noob2> http://fpaste.org/cAMm/
[19:10] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[19:11] <noob2> darkfaded: lemme know what you think
[19:11] <noob2> it tries to check if your card is in target mode and reloads the modules if it isn't. i should show you the config file so you have the full reference
[19:12] <darkfaded> i thought there were a dozen things from that
[19:12] <darkfaded> hah
[19:12] <noob2> here's my config file: http://fpaste.org/v0ww/
[19:12] <darkfaded> noob2 is writing python code that handles errors
[19:12] <noob2> lol
[19:13] <noob2> the python code will allow you to map 1 rbd to multiple initiators separated by |'s
[19:13] <noob2> it's def a little rough but so far it's working ok
[19:13] <noob2> i have a few vm's writing to the ceph cluster over fibre channel in an infinite bonnie++ loop to test
[19:14] <noob2> just make sure your wwn's are exactly the same. vmware sees them as different disks if not
[19:16] <darkfaded> really cool
[19:16] <noob2> thanks :)
[19:17] <darkfaded> i'll give real feedback when i've done something with it
[19:17] <noob2> go for it
[19:17] <darkfaded> (on vacation, etc. etc.)
[19:17] <noob2> i think i've worked out the bugs
[19:17] <noob2> yeah
[19:19] * sjustlaptop (~sam@2607:f298:a:607:a9f7:d5ef:de86:5f90) has joined #ceph
[19:19] <phantomcircuit> ok i have a test rig setup
[19:19] <phantomcircuit> mon/mds both on tmpfs
[19:20] <phantomcircuit> two osds each with a dedicated hdd and journal on tmpfs
[19:20] <phantomcircuit> rados bench 16MB writes with 1 writer hits a wall at 60 MB/s
[19:21] <phantomcircuit> op_queue_ops 0 journal_queue_ops 0 the entire time
[19:21] * aliguori (~anthony@ Quit (Remote host closed the connection)
[19:24] <phantomcircuit> the bizarre thing is that the cur MB/s in rados bench oscillates between ~120 MB/s and 0
[19:24] <phantomcircuit> maybe the journal is filling up let me try and make it bigger
[19:24] <paravoid> noob2: interesting.
[19:25] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[19:25] <noob2> paravoid: give it a shot and let me know what you think
[19:25] * agh (~agh@www.nowhere-else.org) has joined #ceph
[19:25] * sleinen1 (~Adium@2001:620:0:26:1c34:9b3:41cb:d5d7) Quit (Quit: Leaving.)
[19:25] <noob2> ./rbdmount -c /etc/ceph/mounts will create all the mount points
[19:26] <noob2> ./rbdmount -c /etc/ceph/mounts -a lun_name will add only the lun you specify at runtime
[19:26] * dpippenger1 (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[19:27] * xmltok (~xmltok@pool101.bizrate.com) has joined #ceph
[19:27] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[19:28] <noob2> it's a little complicated because i had to make sure it knew which rbd device it mapped to. they change when you unmap and map them again
[19:30] * sjustlaptop (~sam@2607:f298:a:607:a9f7:d5ef:de86:5f90) Quit (Read error: Operation timed out)
[19:30] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[19:30] <janos> i label disks wheni format, and then mount based on label
[19:30] <janos> to try to avoid that
[19:31] <noob2> well with rbd map you can't avoid it
[19:31] <noob2> fstab doesn't understand what rbd is
[19:31] <noob2> at least not to my knowledge
[19:31] <janos> i think you are right
[19:38] <paravoid> yehudasa: around?
[19:39] <yehudasa> paravoid: somewhat
[19:45] * fc_ (~fc@home.ploup.net) Quit (Quit: leaving)
[19:53] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[19:54] * aliguori (~anthony@cpe-70-112-157-151.austin.res.rr.com) has joined #ceph
[19:54] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[19:55] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[20:03] * dpippenger (~riven@ has joined #ceph
[20:05] * rturk-away is now known as rturk
[20:08] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:09] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[20:13] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) has joined #ceph
[20:13] * dmick (~dmick@2607:f298:a:607:2de2:fa1:2d80:89f8) has joined #ceph
[20:25] * Kioob (~kioob@luuna.daevel.fr) Quit (Remote host closed the connection)
[20:25] <iggy> does blkid not work on rbd volumes?
[20:26] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[20:31] <dmick> iggy: good q
[20:31] <fghaas> iggy: /dev/rbd/<pool>/<image> symlinks not sufficient for you?
[20:32] <iggy> I was asking in reference to the earlier conversation
[20:32] <iggy> I don't really have a horse in this race
[20:33] <fghaas> iggy: sorry, missed that :)
[20:34] * Kioob (~kioob@luuna.daevel.fr) Quit (Quit: Leaving.)
[20:35] * Kioob (~kioob@luuna.daevel.fr) has joined #ceph
[20:35] <nhm> paravoid: was that just 1 concurrent operation in rados bench?
[20:36] <paravoid> what was?
[20:36] <nhm> paravoid: oops, sorry, that was for phantomcircuit
[20:36] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) Quit (Quit: Leaving.)
[20:36] <paravoid> okay :)
[20:36] * tryggvil (~tryggvil@85-220-71-147.dsl.dynamic.simnet.is) has joined #ceph
[20:37] <phantomcircuit> nhm, yes
[20:37] <nhm> phantomcircuit: it's quite possible you were client limited.
[20:37] <nhm> phantomcircuit: does it improve if you do like 16 concurrent ops?
[20:37] <phantomcircuit> i get basically the same with 16 and 256 though
[20:38] <phantomcircuit> no it doesn't
[20:38] <nhm> phantomcircuit: ah, ok
[20:38] <nhm> phantomcircuit: journal is on tmpfs and is mostly empty though?
[20:38] * tryggvil (~tryggvil@85-220-71-147.dsl.dynamic.simnet.is) Quit ()
[20:39] <phantomcircuit> nhm, well perf dump reports no queued ops
[20:39] <phantomcircuit> which is expected
[20:39] <nhm> phantomcircuit: network throughput is good?
[20:39] <phantomcircuit> 1 gbps < 1ms latency
[20:39] <phantomcircuit> im not sure what the rados bench does if the journal is full
[20:39] <phantomcircuit> im assuming that would be logged though
[20:41] <nhm> phantomcircuit: how much replication and how many osd nodes?
[20:42] <fghaas> gregaf: cf. previous post on ML, if I ran my filestore on btrfs and the journal on a raw SSD, would the journaling mode be parallel?
[20:44] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[20:46] <phantomcircuit> nhm, 2 and 2
[20:46] <phantomcircuit> im going to rebuild my kernel with block io tracing and see if blktrace gives me an idea of what's going on
[20:51] * nwatkins (~nwatkins@soenat3.cse.ucsc.edu) has joined #ceph
[20:51] <nz_monkey> Hi, I just set up a cluster with 3 nodes, each node has 4 x OSD but for some reason when doing a "ceph osd tree" it shows all 12 OSD's but only 1 OSD from each node is up, the rest show DNE. All are mounted correctly, and mkcephfs has put a key on each one. I tried manually adding them to the crushmap but get a "OSD does not exist" error. Any pointers on where I should be looking ?
[20:53] * nwatkins (~nwatkins@soenat3.cse.ucsc.edu) Quit ()
[20:53] <janos> nz_monkey - what happens if you do /etc/init.d/ceph start <osd> on one of the DNE osd's
[20:53] <janos> i'd look in /var/log/ceph/<osd>.log too
[20:55] <nz_monkey> Hi Janos when doing the osd.0 start I get
[20:55] <nz_monkey> Starting Ceph osd.0 on cstor01...
[20:55] <nz_monkey> starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /dev/vg-raid0/lv-ceph_journal
[20:55] <nz_monkey> which appears to start, but "ceph osd tree" still shows a DNE state
[20:55] <nz_monkey> I will check the osd log
[20:55] <janos> yeha i would look there
[20:56] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) has joined #ceph
[20:57] <nz_monkey> hrmm, errors about "someone elses journal" does ceph require a separate journal per OSD, or can you have 1 per node that is used by all OSD's ?
[20:57] <fghaas> seems like gregaf is away. sjust, in case you have time & inclination to clarify, could you help?
[20:57] <janos> one journal per osd
[20:58] <fghaas> nz_monkey: you need to partition that ssd :)
[20:58] <nz_monkey> bingo. Thanks for your help Janos and fghaas
[20:58] * ScOut3R (~ScOut3R@5400A5AF.dsl.pool.telekom.hu) Quit (Remote host closed the connection)
[20:58] <janos> any time
[20:58] <nz_monkey> I'll create a bunch more lv's :)
[20:59] <dmick> yes, a journal is a private thing
[21:00] <dmick> keep in mind that if the journal fails, so does the OSD; depending on your use for this cluster, you may or may not want to put all the journals on the same device(s)
[21:01] <fghaas> nz_monkey: since I just posted it today, http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals might be of use
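(To sum up the "one journal per OSD" point: each OSD needs its own journal device or file, so a shared SSD gets one partition or LV per OSD. A hypothetical ceph.conf fragment matching nz_monkey's LVM-on-SSD layout — volume names assumed:)

```
[osd.0]
    osd journal = /dev/vg-ssd/lv-journal-0
[osd.1]
    osd journal = /dev/vg-ssd/lv-journal-1
[osd.2]
    osd journal = /dev/vg-ssd/lv-journal-2
```

(Pointing two `[osd.N]` sections at the same journal path is what produces the "someone else's journal" error above.)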
[21:03] <nz_monkey> Thanks dmick and fghaas. This is just our proof of concept cluster, so we are using 2x SSD per node with md raid0+LVM on top. Production will have more SSD's
[21:03] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[21:03] <fghaas> ahum, raid-0 sounds like a really bad choice
[21:03] <nz_monkey> For testing ?
[21:04] <fghaas> one ssd burns up, all your osds are gone
[21:04] <fghaas> for anything :)
[21:05] <janos> i call it aid-0
[21:05] <nz_monkey> haha. I'll split them then. We just did a whole bunch of testing with bcache so put them that way to see what difference it made and left them that way.
[21:05] <janos> i don't think that level has an "r"
[21:05] <nz_monkey> lol
[21:05] <fghaas> oh it just means "redundant" in the sense of "useless"
[21:06] <janos> lol
[21:06] <nz_monkey> I realise that. I just assumed that for testing the amount of writes will be quite low so we should get a year out of them, by that time we will replace them with higher end ssd
[21:07] <fghaas> not so much, really
[21:08] <nz_monkey> ok, I didnt realise it was that bad. Will split them before I go any further
[21:09] <mikedawson> nz_monkey: how'd your testing with bcache work out?
[21:10] <phantomcircuit> nhm, yeah this is bizarre
[21:10] <phantomcircuit> Stddev Bandwidth: 88.9723
[21:10] <phantomcircuit> Bandwidth (MB/sec): 92.352
[21:10] <phantomcircuit> 16 MB blocks 16 writers journal on tmpfs
[21:10] <nz_monkey> Quite well. We just tested against a single WD SATA disk, with the 2x SSD 520 in raid0 as a cache device. At worst IOPS doubled, at best we saw random 4k writes go from 237 iops to around 12000 iops
[21:10] <nhm> phantomcircuit: that seems relatively sane for 1GbE?
[21:11] <phantomcircuit> nhm, the stddev part
[21:11] <nz_monkey> that was with a single thread. With 4 threads it was even more impressive
[21:11] <nhm> phantomcircuit: was rados bench run on the same host as one of the OSDs?
[21:11] <phantomcircuit> nhm, throughput goes from very high to basically zero
[21:11] <phantomcircuit> yeah it was
[21:11] <phantomcircuit> should i not do that
[21:12] <nhm> phantomcircuit: it just means that 1 OSD has insanely high network throughput to rados bench and the other has standard 1GbE throughput. :)
[21:12] <mikedawson> nz_monkey: i believe bcache is supposed to improve as concurrency goes up, so that's consistent with my understanding. Run into any stability issues?
[21:12] <nhm> phantomcircuit: though with 2x replication it actually shouldn't oscillate that much, hrm.
[21:12] <nhm> phantomcircuit: have a 3rd host you can try it from?
[21:12] <phantomcircuit> nhm, hmm i dont think so but i'll try it from another host
[21:12] <buck> There are zero plana nodes available ATM. If anyone has some locked that they are not using, would you be so kind as to release them?
[21:13] <nz_monkey> mikedawson: there are a few bugs, but they affect initialisation rather than operational stability. So far it has been stable for us, but Kent is doing a lot of work on it at the moment preparing it for mainline
[21:13] <nz_monkey> mikedawson: so ymmv
[21:14] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[21:14] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[21:14] * agh (~agh@www.nowhere-else.org) has joined #ceph
[21:15] <mikedawson> nz_monkey: I've yet to see many docs about the process to build, install, update, etc. Care to share your process?
[21:15] * sleinen1 (~Adium@user-28-15.vpn.switch.ch) has joined #ceph
[21:15] <fghaas> sjust, gregaf: nevermind, found out myself. will post to ML
[21:16] <nz_monkey> mikedawson: We were initially doing git-pull's of bcache-3.7 and building .deb's, but are now pulling the main bcache tree and building that with the standard ubuntu config. We then just install and since this is testing we manually modprobe the module in, and dump the utils in /usr/local/bin
[21:17] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) has joined #ceph
[21:17] <nz_monkey> mikedawson: Kent is doing his development in bcache-dev now and pushing code for testing in to bcache.
[21:18] <mikedawson> nz_monkey: thx
[21:18] <dilemma> I'm in the middle of converting some OSDs from btrfs to xfs, and I was wondering if there was a way I could speed up the process by making a local copy of the contents of the OSD folder before re-formatting, then restoring that copy afterwards. The hope is to prevent backfilling as much as possible when bringing the OSD back up with the new file system.
[21:18] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) Quit (Quit: Leaving.)
[21:21] <phantomcircuit> nhm, monitor wont listen to anything but
[21:21] <phantomcircuit> mon addr =
[21:22] <phantomcircuit> that should have ceph-mon on right?
[21:22] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[21:23] <dmick> dilemma: I think that should work, as long as the OSD is down of course
[21:23] <dmick> certainly one restarts OSDs all the time with their prior directory contents, and I can't imagine it really cares about which filesystem it's on (other than tests for small optimizations which are dynamic)
[21:24] <dilemma> Anything I should know about xattrs, or any files I should leave out? Remember - I'm restoring the files to an OSD that has a different filesystem
[21:24] <dilemma> that sounds promising
[21:24] <phantomcircuit> dilemma, make sure the xattrs copy
[21:24] <dmick> hm. yeah, moving the xattrs has to happen, of course
[21:24] <dilemma> I assume I would need to flush the journal before doing this as well
[21:25] <phantomcircuit> and yeah
[21:25] <dmick> dilemma: I'd think that would help, yes
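(A sketch of the copy-out/copy-back procedure being discussed, assuming OSD id 3 and hypothetical device and backup paths; the key points are flushing the journal before copying and preserving xattrs in both directions:)

```shell
# 1. Stop the OSD and flush its journal so the filestore is self-contained.
service ceph stop osd.3
ceph-osd -i 3 --flush-journal

# 2. Copy the filestore out, preserving xattrs (-X) and hard links (-H).
rsync -aHX /var/lib/ceph/osd/ceph-3/ /backup/ceph-3/

# 3. Reformat as xfs (larger inodes help keep xattrs inline) and remount.
mkfs.xfs -f -i size=2048 /dev/sdd1
mount -o noatime,inode64 /dev/sdd1 /var/lib/ceph/osd/ceph-3

# 4. Restore, recreate the (flushed, now empty) journal, and restart;
#    the OSD should come back without a full backfill.
rsync -aHX /backup/ceph-3/ /var/lib/ceph/osd/ceph-3/
ceph-osd -i 3 --mkjournal
service ceph start osd.3
```

(This is an ops sketch against a live cluster, not a tested script; check `ceph -s` after restart to confirm the PGs on that OSD go active+clean rather than backfilling.)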
[21:29] <noob2> for cloudstack after you install a host is there anything special you need to do get ceph working as the storage backend?
[21:29] <iggy> iiuc, lots
[21:33] * fghaas (~florian@91-119-215-212.dynamic.xdsl-line.inode.at) has joined #ceph
[21:40] <mikedawson> noob2: wido may be the expert on ceph+cloudstack
[21:40] <noob2> gotcha
[21:40] <noob2> wido: you around?
[21:40] <noob2> i think i have all the components in place and it just gives me a null pointer error :)
[21:40] <noob2> on the cloudstack java side
[21:41] <gregaf> fghaas: looks like you got it down now
[21:41] <fghaas> gregaf, referring to my ML followup?
[21:45] <gregaf> yeah
[21:46] <fghaas> yeah, sorry... somehow I had underestimated the parallel journal's smarts :)
[21:46] <gregaf> yep
[21:47] <gregaf> the important thing that allows the parallelism is checkpointing on the backing store (ie, btrfs snapshots)
[21:47] <gregaf> it doesn't require anything of the journal itself
[21:49] <phantomcircuit> nhm, i moved to 3rd host and it's definitely better
[21:49] <phantomcircuit> but there are still periods where throughput drops to 0
[21:49] <phantomcircuit> Min bandwidth (MB/sec): 0
[21:51] <fghaas> gregaf: yup, I get that now :)
[21:53] <xdeller> nobody tried whamcloud` kernel with large-xattr patch on ext?
[21:58] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[22:00] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) has joined #ceph
[22:01] * dilemma (~dilemma@2607:fad0:32:a02:1e6f:65ff:feac:7f2a) Quit (Quit: Leaving)
[22:08] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) has left #ceph
[22:08] * schlitzer (~schlitzer@ip-109-90-143-216.unitymediagroup.de) has joined #ceph
[22:08] <schlitzer> hey there
[22:08] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) has joined #ceph
[22:11] <schlitzer> i hope some folks from Inktank are awake/around & can answer this question: Are there CEPH installations out there that are bigger than 1 PiB?
[22:11] * ScOut3R (~ScOut3R@catv-89-133-32-74.catv.broadband.hu) has joined #ceph
[22:11] <noob2> yeah dreamhost has > 1PB
[22:12] <schlitzer> is dreamhost the only one?
[22:12] <noob2> that i know of. there's prob others out there. i only have 180TB
[22:13] <schlitzer> hmm okay
[22:13] <noob2> wondering how it scales?
[22:13] <schlitzer> yes
[22:13] <schlitzer> and if it is stable enough
[22:13] <noob2> 180TB has been no problem at all.
[22:13] <iggy> http://www.inktank.com/partners/featured-customers/
[22:13] <schlitzer> i'm interested in production experience
[22:14] <noob2> i've had it in production for about a month now
[22:14] <iggy> schlitzer: are you looking at the filesystem or one of the lower level bits?
[22:14] <noob2> just rbd's
[22:14] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[22:14] <noob2> i haven't used the ceph fs
[22:14] <schlitzer> iggy, thx. but i already saw this list
[22:15] <schlitzer> iggy, we (as a company) would like to replace GPFS
[22:15] <iggy> the filesystem isn't considered production ready yet
[22:15] <schlitzer> so it would be easier if we could use CEPH_FS
[22:15] <schlitzer> i know
[22:15] <noob2> have you looked at gluster?
[22:15] <noob2> not to sidetrack you or anything
[22:16] <schlitzer> but it would be possible to replace this FS requirement & use RADOS-GW
[22:16] <noob2> so you're looking for a 1PB filesystem ?
[22:16] <schlitzer> from my experience Gluster is not what we are looking for
[22:16] <noob2> yeah you could use the radosgw. it's pretty quick
[22:17] <noob2> you'd have to use some tools like cyberduck or what not to get your files into and out of the gateway
[22:17] <schlitzer> i'm looking for other CEPH installations that use at least 1PB
[22:17] <iggy> the fs is supposed to get a lot of work 2013 1H
[22:17] <schlitzer> we would need more
[22:17] <iggy> I doubt there are others
[22:17] <janos> if someone wants to pay me to make one i suppose i could sacrifice the time
[22:17] <janos> ;)
[22:17] <iggy> I think one of the national labs was testing a pretty big deployment (1P+)
[22:18] * jrisch (~Adium@4505ds2-hi.0.fullrate.dk) Quit (Ping timeout: 480 seconds)
[22:19] <jmlowe> afaik cephfs isn't quite ready for primetime, I'd put it relatively close to or slightly ahead of lustre in terms of stability and gpfs has the edge if losing files would cost me my job
[22:19] <noob2> are you more concerned with iops or scale?
[22:19] <jmlowe> just my personal opinion as a ceph end user
[22:20] <jmlowe> also I believe cephfs will eventually get to the point where I'd risk my reputation and job for it but lustre will never get there
[22:22] * jmlowe (~Adium@173-15-112-198-Illinois.hfc.comcastbusiness.net) Quit (Quit: Leaving.)
[22:24] <noob2> i agree. bugs are getting patched at a fast rate for cephfs
[22:24] * Vjarjadian (~IceChat77@5ad6d005.bb.sky.com) has joined #ceph
[22:25] * miroslav (~miroslav@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[22:31] * ScOut3R (~ScOut3R@catv-89-133-32-74.catv.broadband.hu) Quit (Remote host closed the connection)
[22:32] <schlitzer> noob2, mostly i am looking for something production ready, that can compare with, or better yet outperform, GPFS
[22:33] <noob2> i see
[22:33] <noob2> well ceph rbd's are def prod ready
[22:33] <schlitzer> and rados gw?
[22:34] <noob2> i would think so. dreamhost is using the radosgw extensively i think
[22:36] <fghaas> yes they are
[22:37] * The_Bishop__ (~bishop@e177090017.adsl.alicedsl.de) Quit (Read error: Operation timed out)
[22:37] <schlitzer> thanks for input
[22:37] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:50] <jks> I'm testing ceph and have 3 osds setup doing nothing. When I bench the disks locally using dd, they can sustain approx. 110 MB/s when writing. When I do a ceph osd tell bench on each osd, they can do only about 5 MB/s. Is this to be expected?
[22:51] <nhm> jks: nope! something is definitely broken. :)
[22:51] <jks> I was expecting that :-| ... now I wonder how to benchmark/debug what is happening?
[22:52] <gregaf> see what happens if you do two dd streams to each disk
[22:52] <gregaf> or dd with O_DIRECT or something
[22:52] <nhm> jks: also, where is your journal?
[22:52] <jks> ah, to simulate writing to the journal as well as the disk - or?
[22:52] <jks> gregaf, I already did dd with O_DIRECT
[22:53] <jks> nhm: journal is on the same device as the data store... btrfs
[22:53] * amichel (~amichel@salty.uits.arizona.edu) has joined #ceph
[22:53] <gregaf> yeah, 110MB/s to 5MB/s is pretty bad for two streams, but some disks really do suck that much
[22:54] <jks> this is an mdraid5 of ordinary SATA drives... I'll try two dd's
[22:54] <gregaf> oh, that could definitely do it then
[22:55] <gregaf> the journal is doing a lot of small writes, and the RAID5 is probably turning those into read-then-write
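The read-modify-write penalty gregaf describes can be roughly quantified (a sketch, not from the log; the 120 IOPS figure is an assumed value for an ordinary 7200rpm SATA drive):

```python
# RAID-5 small-write penalty: every write smaller than a full stripe
# costs 4 disk I/Os -- read old data, read old parity, write new data,
# write new parity.
IOS_PER_SMALL_WRITE = 4

def effective_small_write_iops(disk_iops: float) -> float:
    """IOPS left over for journal-sized writes when each one is sub-stripe."""
    return disk_iops / IOS_PER_SMALL_WRITE

print(effective_small_write_iops(120))  # 30.0
```

So a drive that can do ~120 small random writes on its own delivers only ~30 through a RAID-5 small-write path, before the Ceph journal even contends with the data store.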
[22:55] <amichel> So, I'm trying my hand at building a crushmap, but I'm not sure that I'm doing it right both because it won't compile and because I'm just plain not sure if I'm doing it right. Would anyone mind looking at a pastebin of my mess and seeing if I'm going about it in a conceptually appropriate manner?
[22:55] <phantomcircuit> what part of the codebase actually processes ops from clients?
[22:55] <jks> gregaf, even when using btrfs?
[22:55] <gregaf> it's the RAID5 that's the problem, the filesystem has no influence
[22:56] <nhm> gregaf: well, btrfs also lacks options afaik to tell it raid geometry. ;)
[22:56] <jks> I did two dd's at the same time now... one finished with 52 MB/s and the other with 42 MB/s
[22:56] <nhm> jks: btw, are you doing conv=fdatasync with your dds?
[22:57] <jks> nhm: I used this command: dd if=/dev/zero of=outputfile bs=1G count=1 oflag=direct
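The two-parallel-streams test gregaf suggests can be scripted like this (a sketch; TARGET_DIR and SIZE_MB are placeholders for the real test rig, and conv=fdatasync is used, per nhm's suggestion, so each dd flushes before reporting its rate):

```shell
#!/bin/sh
# Run two dd streams against the same filesystem at once and print
# each stream's throughput summary.
TARGET_DIR=${TARGET_DIR:-/tmp}
SIZE_MB=8   # keep small for a smoke test; raise for a real benchmark

dd if=/dev/zero of="$TARGET_DIR/ddtest1" bs=1M count="$SIZE_MB" \
   conv=fdatasync 2> "$TARGET_DIR/dd1.log" &
dd if=/dev/zero of="$TARGET_DIR/ddtest2" bs=1M count="$SIZE_MB" \
   conv=fdatasync 2> "$TARGET_DIR/dd2.log" &
wait

# Each log ends with a summary line like "8388608 bytes ... copied"
grep -h 'bytes' "$TARGET_DIR/dd1.log" "$TARGET_DIR/dd2.log"
rm -f "$TARGET_DIR/ddtest1" "$TARGET_DIR/ddtest2"
```

If the combined rate of the two streams collapses far below half the single-stream rate, the disk (or RAID layer) handles concurrent writers badly, which is exactly the pattern an OSD's journal-plus-data workload produces.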
[22:57] * mikedawson (~chatzilla@23-25-46-97-static.hfc.comcastbusiness.net) Quit (Ping timeout: 480 seconds)
[22:57] <nhm> oh, you are doing direct, nevermind
[22:58] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[22:58] <jks> so you're saying that I should ditch the mdraid5 and do one osd per drive? - I was afraid that having too many osds on one server would mean trouble if the server crashed for some reason (as I only have 3 servers in this system)
[22:58] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[22:58] <nhm> jks: how many drives?
[22:58] <jks> the mdraid5 consists of 4 drives... would you recommend 4 osds and place the journal for each one on the same drive... or create 3 osds and let the fourth drive be dedicated to the journals for the other 3 drives?
[22:59] <nhm> jks: best to just put each journal on the same drive as the data.
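A minimal ceph.conf sketch of the journal-on-the-same-drive layout nhm recommends (hostnames, paths, and osd ids here are placeholders, not jks's actual setup):

```ini
; One OSD per drive, with the journal as a file on that same drive.
[osd.0]
    host = server1
    osd data = /var/lib/ceph/osd/ceph-0
    osd journal = /var/lib/ceph/osd/ceph-0/journal
    osd journal size = 1000        ; journal size in MB
```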
[23:00] <jks> okay, thanks for the advice! I'll try removing a drive from the raid5 and setup an osd on that to test the performance
[23:00] <jks> if I split each of my current osds into 4, would I need to somehow "rebalance" or regenerate a crush-map?
[23:01] <nhm> you could also just start out with 1 node and run rados bench locally on that node to start out.
[23:01] <nhm> jks: Is there data on the existing cluster?
[23:01] <jks> only test data
[23:02] <nhm> if it's not important, I'd just reformat.
[23:02] <jks> I had a fourth server setup running qemu-kvm that used the rbd protocol to store the image on the ceph servers
[23:02] <jks> however it performed very badly, and a simple rsync would be enough to grind everything to a halt with very "jerky" performance
[23:02] <nhm> jks: yeah, if osd tell bench was doing that bad, I imagine everything would be terrible.
[23:02] <jks> I should have done the osd tell bench first ;-)
[23:04] * agh (~agh@www.nowhere-else.org) Quit (Remote host closed the connection)
[23:04] * agh (~agh@www.nowhere-else.org) has joined #ceph
[23:09] * tryggvil (~tryggvil@17-80-126-149.ftth.simafelagid.is) has joined #ceph
[23:09] <dmick> amichel: I can give it a look
[23:10] <amichel> Awesome
[23:10] <amichel> One sec, lemme get the link
[23:10] <amichel> http://pastebin.com/Ftxe7Hgn
[23:12] <amichel> When I try to compile it, I'm getting "crushmap:225 error: parse error at '# rules'" which I assume means I've got some kind of syntax snafu above
[23:12] <amichel> But I don't even know if it's a sensible crushmap :D
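For reference, the rule syntax that crushtool expects in a decompiled map looks like this (a sketch with placeholder bucket/type names, not amichel's actual map; a parse error reported at a '# rules' comment usually means the real problem sits in the bucket definitions just above it):

```
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take root
	step chooseleaf firstn 0 type host
	step emit
}
```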
[23:12] * noob2 (~noob2@ext.cscinfo.com) Quit (Quit: Leaving.)
[23:14] <amichel> The machine is a backblaze-style pod, if that helps the bucket types make any sense
[23:14] <amichel> Kind of planning it so I can use it for SAS-based stuff as well, thus the buses
[23:16] * schlitzer (~schlitzer@ip-109-90-143-216.unitymediagroup.de) Quit (Quit: Leaving)
[23:21] <phantomcircuit> nhm, btw with rbd caching on and using a 3rd host rados -p rbd bench 60 write -b 4096 -t 16 i get about 0.25 MB/s which is ~60 IOPS
[23:22] <nhm> phantomcircuit: no good
[23:23] <lurbs> Is that 4096 bytes or kilobytes?
[23:23] <nhm> phantomcircuit: though rados bench won't use rbd caching
[23:23] <nhm> phantomcircuit: that's 2 replica?
[23:23] <phantomcircuit> yeah
[23:23] <phantomcircuit> lurbs, 4096 bytes
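The IOPS figure quoted above follows directly from the bench numbers; a quick sanity check (not part of the log):

```python
# rados bench reported ~0.25 MB/s with -b 4096 (4 KiB writes).
bandwidth_bytes = 0.25 * 1024 * 1024   # 0.25 MiB/s in bytes/s
write_size = 4096                      # bytes per op, from -b 4096
iops = bandwidth_bytes / write_size
print(iops)  # 64.0, i.e. roughly the "~60 IOPS" quoted
```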
[23:24] <amichel> dmick: that bad huh? :D
[23:24] <dmick> amichel: sorry, distracted
[23:24] <amichel> Oh, no worries
[23:25] <amichel> Story of my life
[23:25] <nhm> phantomcircuit: how does a pool with 1 replica do?
[23:25] * vata (~vata@2607:fad8:4:6:221:5aff:fe2a:d1dd) Quit (Quit: Leaving.)
[23:28] <amichel> be right back
[23:28] <phantomcircuit> nhm, it's oscillating between dead stop and 9 MB/s
[23:28] <phantomcircuit> which is bizarre
[23:29] <phantomcircuit> http://pastebin.com/raw.php?i=nG6ZFak8
[23:30] <xdeller> just run into very weird thing - ceph-osd doing nothing on the mkfs http://pastebin.com/iciNZeMr
[23:31] <xdeller> any ideas?
[23:32] <dmick> xdeller: 6458       0.000009 connect(7, {sa_family=AF_INET, sin_port=htons(6789), sin_addr=inet_addr("")}, 16 <unfinished ...>
[23:32] <dmick> can't reach mon?
[23:32] <nhm> phantomcircuit: that usually means that the journal is going fast but the data disk can't keep up.
[23:32] <xdeller> uh-oh, forgot about it
[23:32] <xdeller> thanks
[23:32] <dmick> amichel: I get same error, looking at it
[23:33] <nhm> phantomcircuit: which usually makes sense since the journal can just write data sequentially while the data disk has to figure out where in the directory structure to stick it.
[23:34] <phantomcircuit> ok so with no replication it's 1k IOPS write on two conventional hdds that can only really do 120 IOPS
[23:34] * benner (~benner@ Quit (Read error: Connection reset by peer)
[23:34] <phantomcircuit> is rados bench issuing flush?
[23:34] <amichel> dmick: at least it's reproducible :D
[23:35] <nhm> phantomcircuit: It's conceivable that not all of the ops are random so some of them are getting coalesced
[23:35] <phantomcircuit> btw i was trying to follow the path of a flush from qemu -> librbd -> osd to figure out where the decision to fsync happens but i got lost sooo fast
[23:35] <phantomcircuit> codebase is pretty confusing
[23:36] <phantomcircuit> i got from qemu to ImageCtx::flush in librbd
[23:37] <phantomcircuit> or i mean in librados
[23:39] * benner (~benner@ has joined #ceph
[23:39] <nhm> phantomcircuit: yeah, I've been doing ceph performance for nearly a year and I still got lost in the code. :)
[23:39] <dmick> actual C++ browsers help a lot
[23:39] <dmick> but yes
[23:44] <phantomcircuit> heh i never use an ide but i sort of feel like i need one here
[23:45] <dmick> phantomcircuit: same way
[23:45] <dmick> kdevelop has been treating me pretty well
[23:45] <dmick> (even on Gnome)
[23:48] * sleinen1 (~Adium@user-28-15.vpn.switch.ch) Quit (Quit: Leaving.)
[23:48] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) has joined #ceph
[23:50] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[23:53] * korgon (~Peto@isp-korex- has joined #ceph
[23:54] * sander (~chatzilla@c-174-62-162-253.hsd1.ct.comcast.net) Quit (Ping timeout: 480 seconds)
[23:55] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[23:55] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[23:56] * sleinen (~Adium@217-162-132-182.dynamic.hispeed.ch) Quit (Ping timeout: 480 seconds)
[23:58] * jmlowe (~Adium@c-71-201-31-207.hsd1.in.comcast.net) has joined #ceph
[23:59] * slang1 (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Quit: Leaving.)

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.