#ceph IRC Log


IRC Log for 2012-08-02

Timestamps are in GMT/BST.

[0:03] * s[X] (~sX]@ppp59-167-154-113.static.internode.on.net) has joined #ceph
[0:13] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) Quit (Quit: Leaving)
[0:18] * hijacker (~hijacker@ Quit (Remote host closed the connection)
[0:26] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[0:26] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[0:26] * Leseb_ is now known as Leseb
[0:43] * loicd (~loic@brln-4d0cc4bd.pool.mediaWays.net) Quit (Quit: Leaving.)
[0:50] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[0:56] * Cube (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Ping timeout: 480 seconds)
[0:58] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[1:04] * MarkDude (~MT@ip-64-134-222-108.public.wayport.net) Quit (Quit: Leaving)
[1:08] * LarsFronius (~LarsFroni@dyndsl-085-016-195-190.ewe-ip-backbone.de) Quit (Ping timeout: 480 seconds)
[1:10] <Tv_> what's a good quick cpu benchmark, to compare raw vs virtualization?
[1:10] <dmick> sieve?
[1:11] <dmick> :)
[1:12] <Tv_> i need something easily runnable...
[1:12] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:12] <Tv_> apt-cache search sieve only talks about mail filtering
[1:13] <dmick> sigh. how tradition escapes us. I wasn't really serious, but in the 80s Eratosthenes' Sieve was all the rage for CPU benching
[1:13] <Tv_> oh sure, i just don't even have gcc installed
[1:14] <Tv_> i think i'll run an iterated sha1 in python
[1:14] <dmick> I was thinking some huge factorial
[1:15] <elder> Count to a MILLION!
[1:16] * elder holds back of pinky to mouth
[1:16] <dmick> import math; print math.factorial(10000000)
[1:16] <Tv_> time python -c 'import hashlib
[1:16] <Tv_> s=""
[1:16] <Tv_> for i in xrange(10000000):
[1:16] <Tv_> h=hashlib.sha1(); h.update(s); s=h.hexdigest()'
[1:16] <Tv_> yours is easier to type though ;)
[1:16] <dmick> yeah, yours is kinda cryptic. HAHAHAHAHAHA
[1:17] <dmick> that is using 100% of a CPU with pretty small memory
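For reference, Tv_'s pasted snippet, reassembled into a runnable script (a sketch in Python 3, so `xrange` becomes `range` and the input to SHA-1 must be bytes; the smaller iteration count is my choice for a quick run, not Tv_'s):

```python
import hashlib
import time

def sha1_chain(iterations):
    """Iterated SHA-1: each round hashes the previous round's hex digest."""
    s = b""
    for _ in range(iterations):
        h = hashlib.sha1()
        h.update(s)
        s = h.hexdigest().encode()
    return s

start = time.perf_counter()
sha1_chain(100_000)  # Tv_ ran 10,000,000 iterations
print(f"{time.perf_counter() - start:.2f}s")
```

Running it on bare metal and again inside the guest gives two wall times whose ratio is the virtualization slowdown Tv_ is measuring.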
[1:19] <Tv_> oh funky, looks like nested virtualization is one third of raw cpu speed..?
[1:20] <gregaf1> "nested" virtualization?
[1:20] <Tv_> kvm running in a kvm
[1:20] <Tv_> but it works pretty nice
[1:20] <dmick> hm. I wonder if qemu is tweakable
[1:20] <dmick> and if it does some kinda default self-capping or yielding or something
[1:42] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Ping timeout: 480 seconds)
[1:56] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Read error: Connection reset by peer)
[1:57] <Tv_> dmick: btw Documentation/virtual/kvm/nested-vmx.txt in linux source
[1:59] * bchrisman (~Adium@ Quit (Quit: Leaving.)
[1:59] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[2:01] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[2:01] * Leseb_ (~Leseb@ has joined #ceph
[2:04] * Tv_ (~tv@2607:f298:a:607:baac:6fff:fe95:8e86) Quit (Quit: Tv_)
[2:07] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Ping timeout: 480 seconds)
[2:07] * Leseb_ is now known as Leseb
[2:09] <dmick> Yeah, that's what I was thinking of...
[2:09] <dmick> (oh he left)
[2:15] * danieagle (~Daniel@ Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[2:30] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) Quit (Quit: LarsFronius)
[2:32] * Leseb (~Leseb@ Quit (Ping timeout: 480 seconds)
[2:36] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[2:45] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) Quit (Quit: LarsFronius)
[2:56] * SuperSonicSound (~SuperSoni@28IAAGIYR.tor-irc.dnsbl.oftc.net) has joined #ceph
[3:33] <joao> <elder> Count to a MILLION!
[3:33] <joao> * elder holds back of pinky to mouth
[3:33] <joao> I thought that was a billion
[3:34] <dmick> why count to a trillion when we can count to a billion?
[3:35] <joao> well, it certainly depends on the number of zeros involved in both
[3:35] <dmick> but counting zeros takes no time at all
[3:35] <dmick> they're not even there
[3:36] <dmick> I think you need maths homework joao
[3:36] <joao> lol
[3:36] <joao> I meant, with how many zeros you'd represent each
[3:37] <dmick> oh. well that's completely a different matter then. nevermind.
[3:38] * deepsa (~deepsa@ has joined #ceph
[3:43] <joao> oh
[3:43] <joao> damn
[3:43] <joao> different quotes from different movies
[3:43] <joao> including yours, dmick
[3:43] <joao> oh well
[3:43] <dmick> it's all austin powers. they all blur together. one joke.
[3:43] <joao> yeah
[3:44] <joao> according to wikipedia, elder's was from the first movie, I recalled the second, and you went with the third :p
[3:44] <joao> or something like that
[3:47] <dmick> http://www.youtube.com/watch?v=cKKHSAE1gIs
[4:01] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:09] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) Quit (Quit: Leaving.)
[4:39] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[4:50] <elder> I really need to work on my comic timing though.
[4:50] <elder> Have it count to... a MILLION!
[4:50] * dmick rolls on the floor laughing
[4:51] <elder> See, timing is everything.
[4:51] <dmick> that's what she said
[4:51] <elder> Don't get me started.
[5:50] * glowell (~glowell@ Quit (Remote host closed the connection)
[5:55] * glowell (~glowell@ has joined #ceph
[5:55] * glowell (~glowell@ Quit (Remote host closed the connection)
[6:02] * dmick (~dmick@2607:f298:a:607:50dd:c6d6:34e9:ed3e) Quit (Quit: Leaving.)
[6:34] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[6:38] * SuperSonicSound (~SuperSoni@28IAAGIYR.tor-irc.dnsbl.oftc.net) Quit (Quit: Leaving)
[7:14] * gregaf (~Adium@2607:f298:a:607:b949:6cac:b5a1:d8f6) has joined #ceph
[7:16] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) Quit (Ping timeout: 480 seconds)
[7:16] * gregaf1 (~Adium@2607:f298:a:607:1dd5:5769:ce20:fcdc) Quit (Ping timeout: 480 seconds)
[7:16] * mkampe (~markk@2607:f298:a:607:222:19ff:fe31:b5d3) Quit (Ping timeout: 480 seconds)
[7:16] * yehudasa (~yehudasa@2607:f298:a:607:4ceb:23f0:731b:359c) Quit (Ping timeout: 480 seconds)
[7:21] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[7:26] * yehudasa (~yehudasa@2607:f298:a:607:1420:29b1:501c:3f82) has joined #ceph
[7:26] * sagewk (~sage@2607:f298:a:607:219:b9ff:fe40:55fe) has joined #ceph
[7:31] * mkampe (~markk@2607:f298:a:607:222:19ff:fe31:b5d3) has joined #ceph
[7:35] * defiler (~xdefiler@ has joined #ceph
[7:35] * defiler is now known as john_fool
[7:53] * deepsa_ (~deepsa@ has joined #ceph
[7:55] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[7:55] * deepsa_ is now known as deepsa
[8:02] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[8:06] <tnt> dabeowulf: Someone posted an answer on the ml to my question and I'll try that today.
[8:16] * yoshi (~yoshi@p22043-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Ping timeout: 480 seconds)
[8:28] * EmilienM (~EmilienM@ has joined #ceph
[8:55] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:01] <john_fool> hi there. can someone help me sort out an issue? it's simple: the ceph-{osd,mon,mds} binaries don't work at all on a debian6 amd64 system:
[9:01] <john_fool> root@storage4:/etc/ceph# ceph-mds
[9:01] <john_fool> ceph-mds: symbol lookup error: /usr/lib/libtcmalloc.so.0: undefined symbol: _Ux86_64_getcontext
[9:01] * BManojlovic (~steki@ has joined #ceph
[9:01] <john_fool> ceph version - 0.48 argonaut
[9:02] <john_fool> system repos - squeeze + backports
[9:02] <john_fool> everything installed with dependencies
[9:07] <john_fool> kernel: 3.2.0-0.bpo.2-amd64
[9:09] * s[X] (~sX]@ppp59-167-154-113.static.internode.on.net) Quit (Remote host closed the connection)
[9:18] <tnt> john_fool: do you have libunwind installed ?
[9:18] * fc (~fc@ Quit (Remote host closed the connection)
[9:20] <tnt> john_fool: This is on my test system: http://pastebin.com/7eWLkQCF So libtcmalloc.so.0 links to libunwind and libunwind provides the _Ux86_64_getcontext symbol.
[9:20] <tnt> john_fool: I'm leaving for work now, but I'll be online in 15m or so.
[9:21] <john_fool> tnt: second..
[9:23] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Read error: Operation timed out)
[9:25] * fc (~fc@ has joined #ceph
[9:26] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[9:28] <john_fool> tnt: waiting for u )
[9:33] * EmilienM (~EmilienM@ Quit (Quit: Leaving...)
[9:33] * EmilienM (~EmilienM@ has joined #ceph
[9:35] * tnt (~tnt@212-166-48-236.win.be) has joined #ceph
[9:36] * loicd (~loic@brln-4d0cc4bd.pool.mediaWays.net) has joined #ceph
[9:39] <john_fool> tnt: http://pastebin.com/hKBun4SB
[9:40] <john_fool> package is installed but libraries do not provide _Ux86_64_getcontext
[9:40] <john_fool> even x86 version.
[9:41] <john_fool> oops. even x86-64 version )
[9:47] <tnt> huh, weird. (my setup is ubuntu, not debian fyi).
[9:48] <tnt> Is the ceph version you installed compiled for squeeze ?
[9:49] <tnt> and where does your libtcmalloc.so.0 comes from ? (since it's that lib that requires that symbol ...)
[9:49] * defiler (~xdefiler@ has joined #ceph
[9:50] * defiler is now known as jf_
[9:50] * john_fool (~xdefiler@ Quit (Ping timeout: 480 seconds)
[9:51] * jf_ is now known as john_fool
[9:51] <john_fool> tnt: have you sent something? I've just reconnected
[9:52] <tnt> 09:48 < tnt> huh, weird. (my setup is ubuntu, not debian fyi).
[9:52] <tnt> 09:48 < tnt> Is the ceph version you installed compiled for squeeze ?
[9:52] <tnt> 09:49 < tnt> and where does your libtcmalloc.so.0 comes from ? (since it's that lib that requires that symbol ...)
[9:53] <john_fool> tnt: 0.48argonaut-1~bpo60+1
[9:55] <john_fool> tnt: libtcmalloc.so.0 comes from libgoogle-perftools0 package (1.5-1 from squeeze)
[9:58] * defiler_ (~v@ has joined #ceph
[10:01] <tnt> john_fool: if you do objdump -T libtcmalloc.so.0 | grep _Ux86_64_getcontext what do you get ?
[10:03] <john_fool> root@storage4:/etc/ceph# objdump -T /usr/lib/libtcmalloc.so.0 | grep _Ux86_64_getcontext
[10:03] <john_fool> 0000000000000000 DF *UND* 0000000000000000 _Ux86_64_getcontext
[10:03] <john_fool> opps. sorry for multiple lines )
[10:04] <tnt> Well looks to me like this package is broken on squeeze ...
[10:04] <john_fool> aha
[10:04] * john_fool going to try sid version
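The `objdump -T` check tnt suggested above can be wrapped in a small helper (a sketch; the function names are mine, and it only relies on `objdump` flagging undefined dynamic symbols with `*UND*`):

```python
import subprocess

def classify(symbol, objdump_output):
    """Classify a symbol from `objdump -T` text: 'defined' if the library
    provides it, 'undefined' if it merely references it (*UND*), else 'absent'."""
    for line in objdump_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == symbol:
            return "undefined" if "*UND*" in line else "defined"
    return "absent"

def symbol_status(library, symbol):
    """Run `objdump -T` on a shared library and classify one symbol."""
    out = subprocess.run(["objdump", "-T", library],
                         capture_output=True, text=True, check=True).stdout
    return classify(symbol, out)
```

`symbol_status('/usr/lib/libtcmalloc.so.0', '_Ux86_64_getcontext')` returning 'undefined' means some other library, here libunwind, must define the symbol at load time, which is exactly the situation being debugged.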
[10:09] * deepsa (~deepsa@ has joined #ceph
[10:10] <john_fool> brr.... this package really comes from another repo.
[10:10] <john_fool> hope the problem is not in squeeze
[10:11] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:13] <loicd> sileht: I'm looking at collectl ( in the context of running https://labs.enovance.com/attachments/86/bench_pgnum.sh to bench the performance impact of varying the number of pg ). Did not know the tool :-)
[10:14] <tnt> loicd: How do you set the number of pg btw ? In mkpool I didn't see a way to specify it and set pg_num doesn't work even on an empty pool ( "pg_num adjustment currently disabled (broken implementation)" )
[10:14] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has joined #ceph
[10:14] <loicd> tnt: I would not know, sileht does. I'm reading the script right now ;-) https://labs.enovance.com/attachments/86/bench_pgnum.sh
[10:15] <sileht> tnt, I recreate a new pool before each test
[10:15] <sileht> tnt, and write data on it with rados bench
[10:18] <john_fool> tnt: everything is Ok with squeeze version. It was my mistake. thanks
[10:18] <tnt> john_fool: great.
[10:19] <tnt> sileht: ah yes I see "ation
[10:20] <tnt> * I see "ceph osd pool create POOL [pg_num [pgp_num]]" ... I was using rados mkpool ... and this doesn't allow to set pg_num
[10:21] * lofejndif (~lsqavnbok@09GAAG0C3.tor-irc.dnsbl.oftc.net) has joined #ceph
[10:23] <loicd> It's kind of tedious to run the tests manually, even if http://ceph.com/wiki/Benchmark helps remember how to do it. I wonder if things like http://www.phoronix-test-suite.com/ could help automate the run of heterogeneous tests and collect all the results in a consistent way.
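The recreate-the-pool loop sileht describes could be scripted along these lines (a sketch, not bench_pgnum.sh itself; it assumes the `ceph osd pool create POOL pg_num pgp_num` form tnt quotes above, and pool deletion syntax varies between ceph releases):

```python
import subprocess

def bench_plan(pg_nums=(8, 64, 256), seconds=60, pool="bench"):
    """Command sequence: for each pg_num, create a fresh pool with that many
    PGs, write to it with `rados bench`, then drop it before the next round."""
    plan = []
    for pg in pg_nums:
        plan.append(["ceph", "osd", "pool", "create", pool, str(pg), str(pg)])
        plan.append(["rados", "-p", pool, "bench", str(seconds), "write"])
        plan.append(["rados", "rmpool", pool])
    return plan

def run_plan(plan):
    """Execute the plan on a node with ceph admin access."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

Each round starts from a fresh pool so earlier objects don't skew the next measurement, which is the approach sileht describes.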
[10:25] * defiler_ (~v@ has left #ceph
[10:26] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[10:27] <tnt> Can radosgw stripe objects across several placement groups ?
[10:27] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[10:29] * exec (~v@ has joined #ceph
[10:30] <loicd> tnt: my understanding is that one object (in the swift / S3 sense) will be in a single placement group so that all objects (in the RADOS sense) composing it are placed in a sensible way.
[10:30] * john_fool changed nick to exec to prevent stupid mistakes in future )
[10:30] * john_fool (~xdefiler@ Quit (Quit: Leaving)
[10:35] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[10:37] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) Quit (Quit: LarsFronius)
[10:39] * Leseb (~Leseb@ has joined #ceph
[10:39] <tnt> loicd: looks to me like one S3/SWIFT object maps to one RADOS object which will be in one placement group.
[10:40] <loicd> tnt: what if the S3/swift object is 1GB big ?
[10:41] * loicd checking the vocabulary to avoid being lost in translation ;-)
[10:42] <tnt> Well seems it ends up as 1 big rados object. How can you list objects in a placement group or pool ?
[10:47] * hijacker (~hijacker@ has joined #ceph
[10:48] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[10:48] * nhm (~nh@65-128-130-177.mpls.qwest.net) Quit (Remote host closed the connection)
[10:48] * nhm (~nh@65-128-130-177.mpls.qwest.net) has joined #ceph
[10:50] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) Quit (Quit: Ex-Chat)
[10:58] * Leseb_ (~Leseb@ has joined #ceph
[10:58] * Leseb (~Leseb@ Quit (Read error: Connection reset by peer)
[10:58] * Leseb_ is now known as Leseb
[10:59] <loicd> tnt: a rados object has a fixed size, right ? 4MB by default if I'm not mistaken.
[10:59] <loicd> Leseb: morning sir ;-)
[11:00] <tnt> loicd: I don't think so. 4Mb RADOS object size is only for RDB
[11:00] <tnt> RBD
[11:00] * Leseb (~Leseb@ Quit (Read error: Connection reset by peer)
[11:00] * Leseb (~Leseb@ has joined #ceph
[11:05] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[11:07] <loicd> tnt: ok. To list objects in a pool use "rados -p <poolname> ls"
[11:16] <loicd> sileht: in http://article.gmane.org/gmane.comp.file-systems.ceph.devel/8286 "I'm concerned with your 6 client numbers though." relate to "RADOS (6) 20MB/s 17MB/s (10MB/s) ... 3.38062s(2.39645) " in http://ceph.com/wiki/Benchmark#On_multiple_clients ?
[11:18] <sileht> Loicd, yes the result is ok because I use the same hardware for the clients and the servers, so the bottleneck is the network
[11:18] <sileht> loicd, ^
[11:19] <loicd> sileht: I'll update the page to make it clearer.
[11:19] * nolan (~nolan@phong.sigbus.net) Quit (Ping timeout: 480 seconds)
[11:20] <sileht> loicd, I have prepared an answer for this thread, I need to do the recommended test for radosgw with more currency operations
[11:21] <loicd> you mean with your latest results ?
[11:24] <sileht> loicd, I just want to redo the radosgw bench with more currency operations and to publish the result
[11:29] <loicd> sileht: http://ceph.com/w/index.php?title=Benchmark&action=historysubmit&diff=5596&oldid=5595
[11:29] <loicd> let me know if my interpretation is correct.
[11:30] <sileht> loicd, s/currency/concurrent/g
[11:30] <tnt> loicd: yes, I confirm one S3 object ends up as one main object in rados. (there are a few other objects as well for metadata or such but all payload data seems to be in a xxx.__shadow_xxx object)
[11:31] <joao> morning #ceph
[11:31] <sileht> loicd, your interpretation is correct :)
[11:35] <loicd> tnt: I thought they were split in a number of smaller rados objects. Now I know more about ceph, thanks ;-)
[11:37] * Cube1 (~Adium@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[11:37] <tnt> loicd: I wonder, when an object that's replicated to several OSDs is read, will the client ask several OSDs for data in parallel to speed up the read?
[11:38] <loicd> tnt: that's my understanding indeed.
[11:39] <loicd> sileht: in the rest bench, what is the size of the objects being used ?
[11:41] <tnt> What about writes ? I see in Sage's presentation that the client upload only one copy to one osd and then the osd replicate among themselves. But does that replication happen on the fly or will it wait for the write to complete ? (I mean if you put a 10G object that means after the client put the last byte, it'll have to wait a while before it's replicated)
[11:42] <sileht> loicd, the default one (I'm looking for the value)
[11:44] * johnl (~johnl@2a02:1348:14c:1720:a9a0:d515:3fc9:8dfb) Quit (Remote host closed the connection)
[11:45] * johnl (~johnl@2a02:1348:14c:1720:203a:8363:b2a4:caec) has joined #ceph
[11:54] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[11:56] * deepsa (~deepsa@ has joined #ceph
[11:57] <loicd> sileht do you think it would be relevant to vary --block-size=op-size in rest-bench ?
[11:59] <loicd> sileht: when I type rest-bench ceph in google.com http://ceph.com/wiki/Benchmark is the first result ;-)
[12:00] <sileht> It can show if my issue is related to the size of the operation, but i don't think so
[12:01] <loicd> tnt: I'd be interested to know the answer to this "write & replication" question ;-)
[12:05] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[12:09] <loicd> sileht: I do not understand why Mark thinks the total read throughput should be more than 20MB/s per client.
[12:10] <loicd> Assuming there are 6 clients, running on 6 hosts connected with 1Gb/s links.
[12:10] <sileht> In rados test all clients talks to all osds
[12:11] <loicd> Assuming there are 2 replicas for each object, the reads should be spread over two OSDs and therefore the bottleneck should be 2Gb/s and not 1Gb/s. Hence each client should have at least 30MB/s read ( ~ 2Gb/s / 6 converted to MB/s instead of 0.3Gb/s )
[12:12] <loicd> sileht: is this also your understanding ?
[12:13] <sileht> Yes
[12:14] <sileht> loicd, all my tests have the default replications, ie: 2
[12:15] <loicd> sileht: ok :-)
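loicd's back-of-envelope above, written out (assuming 1 Gb/s = 125 MB/s and ignoring protocol overhead):

```python
def per_client_mb_s(aggregate_gbit_s, clients):
    """Share an aggregate link budget given in Gb/s across N clients, in MB/s each."""
    return aggregate_gbit_s * 1000 / 8 / clients

# Reads spread over 2 replicas give a 2 Gb/s budget shared by 6 clients:
print(per_client_mb_s(2, 6))  # ~41.7 MB/s, above loicd's conservative 30 MB/s
# A single 1 Gb/s bottleneck shared the same way:
print(per_client_mb_s(1, 6))  # ~20.8 MB/s, close to the observed 20 MB/s
```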
[12:16] <loicd> sileht: Mark & Florian do not like bonnie++ . It does not have many supporters ;-)
[12:21] <loicd> sileht: when you write "So I have issued this command on a ceph node: " you mean "The following command is run on one of the node of the ceph cluster" ? I'm asking because you put an emphasis on "ceph node" and I want to be sure I fully understand why.
[12:21] <sileht> ye sit
[12:21] <sileht> yes it
[12:22] <loicd> o k
[12:24] <sileht> :)
[12:28] <loicd> sileht: I would like to add the following introduction phrase to http://ceph.com/wiki/Benchmark#OSD_benchmark so that the reader quickly understand what it is about:
[12:28] <loicd> The goal of this benchmark is to assess the read and write speed of each OSD. They are compared with the read and write speed of the underlying device. The expected result is that it will not show a significant difference.
[12:35] <loicd> I think we're missing something in http://ceph.com/wiki/Benchmark#OSD_benchmark , that is the write latency + throughput of the disk as measured by fio. Since people tend to not trust bonnie++ it will be better to avoid the controversy.
[13:06] * EmilienM (~EmilienM@ Quit (Remote host closed the connection)
[13:07] * EmilienM (~EmilienM@ has joined #ceph
[13:13] <loicd> sileht: I added a short explanation for each benchmark at http://ceph.com/wiki/Benchmark for you to comment on
[13:16] <tnt> loicd: yeah I'd like to know the answer as well, but it's gonna take someone more knowledgeable than me to answer :p
[13:19] <loicd> since RBD is implemented by spreading the blocks across a number of rados objects, it would make sense to do the same when trying to store an object using radosgw. I don't understand why the rationale would be different.
[13:21] <tnt> I think RBD are expected to be large, while radosgw object not really ...
[13:22] <loicd> well, unless radosgw really is just a gateway and does not provide an additional abstraction layer as RBD does
[13:22] <loicd> tnt: it makes sense, yes
[13:24] <tnt> another useful consequence is that if some PG goes down, then you don't get into a case where part of an object is available and part of it is unavailable.
[13:29] <tnt> Huh you can 'execute' stuff on rados objects ?
[13:29] <tnt> IoCtx:exec ?
[13:31] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[13:40] <loicd> sileht: when you write RADOSGW (with apache and fastcgi with ceph patch) in http://ceph.com/wiki/Benchmark could you link to the patch applied ? It is not explained in http://wiki.debian.org/OpenStackCephHowto
[13:43] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) Quit (Quit: LarsFronius)
[13:48] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[13:54] <loicd> sileht: in http://ceph.com/wiki/Benchmark#First_Example you should explain how you created the table.
[13:56] <loicd> For instance, one can guess that the RADOS + read column must be extracted from the output of the rados bench explained before. But a good howto should give instructions to follow to avoid any guessing.
[13:57] <loicd> It's probably best to add a section to http://ceph.com/wiki/Benchmark#RADOS_benchmark and explain how to build the table lines that are specific to RADOS. And do the same for each bench. In the end, the global benchmark results are a concatenation of the individual bench results.
[13:58] <loicd> It would also be a good place to explain how to interpret the bench results. In other words, what reasoning should I follow to come to the conclusion that "it rocks" or "this is bad and something must be fixed" ;-)
[14:07] <tnt> Is there an explanation somewhere of how RADOSGW stores data in rados ?
[14:09] * newtontm (~jsfrerot@charlie.mdc.gameloft.com) Quit (Quit: leaving)
[14:09] * nolan (~nolan@phong.sigbus.net) has joined #ceph
[14:26] <tnt> wtf ? I stopped radosgw, deleted the pools .rgw .rgw.control and .rgw.buckets ... and it _still_ somehow remembered what my buckets were named.
[14:27] <loicd> tnt: ahahah
[14:27] <loicd> tnt: ceph is magic, you did not know that ?
[14:27] <loicd> explaining how to collect the results is especially important in the http://ceph.com/wiki/Benchmark#On_multiple_clients section. Even I cannot guess it ;-)
[14:30] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) Quit (Ping timeout: 480 seconds)
[14:45] * lofejndif (~lsqavnbok@09GAAG0C3.tor-irc.dnsbl.oftc.net) Quit (Quit: gone)
[14:46] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) has joined #ceph
[14:48] * fc (~fc@ Quit (Ping timeout: 480 seconds)
[14:49] * mtk (ENHMqO0Io0@panix2.panix.com) has joined #ceph
[14:58] * Deuns (~kvirc@169-0-190-109.dsl.ovh.net) has joined #ceph
[14:58] <Deuns> helloo
[14:58] <Deuns> hello all
[15:00] <joao> helloo
[15:01] <Deuns> :)
[15:01] <Deuns> is there any guide to optimize a ceph cluster (and especially disk throughput)
[15:01] <Deuns> ?
[15:11] * fc (~fc@ has joined #ceph
[15:11] * nhm_ (~nh@65-128-130-177.mpls.qwest.net) has joined #ceph
[15:28] * gregorg (~Greg@ has joined #ceph
[15:33] <Deuns> What numbers should I expect with 3 osd + 1 mds + 1 mon on one server (1x Xeon E5-2603 1.80GHz + 8GB RAM + 3x 3TB SATA HD + 1x 1GE) ?
[15:33] <Deuns> I'm using btrfs with Debian 7 and Ceph 0.49
[15:36] <Deuns> From my cephfs client, I get only 20MB/s max with dd if=/dev/zero of=test (various bs and count)
[15:46] <loicd> Deuns: I think you will be interested in http://ceph.com/wiki/Benchmark . sileht is working on it and you want to wait until tomorrow evening before using it.
[15:47] <loicd> Deuns: I would start with http://ceph.com/wiki/Benchmark#OSD_benchmark to get a sense of how fast your OSD are.
[15:48] <Deuns> loicd: I'm currently reading that wiki page :)
[15:48] <Deuns> thank you for the answer
[15:49] * dspano (~dspano@rrcs-24-103-221-202.nys.biz.rr.com) has joined #ceph
[15:54] <tnt> Deuns: your 3 osd are on different physical disks ?
[15:54] <Deuns> yep
[15:54] <Deuns> my mds + mon are on a 4th disk
[15:59] <Deuns> rados bench results :
[15:59] <Deuns> Total writes made: 3297
[15:59] <Deuns> Write size: 4194304
[15:59] <Deuns> Bandwidth (MB/sec): 21.876
[15:59] <Deuns> it seems a bit low...
[16:04] <Deuns> I get about 67,3 MB/s from a plain local disk
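A crude ceiling for Deuns's numbers, under assumptions he hasn't confirmed: 2 replicas (the default of that era) and journals co-located with the data disks, so every replicated copy hits its disk twice (journal write plus filestore write):

```python
def write_ceiling_mb_s(disk_mb_s, n_disks, replicas=2, journal_on_data_disk=True):
    """Crude upper bound on client write throughput: each client byte is
    stored `replicas` times, and a co-located journal doubles each copy."""
    amplification = replicas * (2 if journal_on_data_disk else 1)
    return disk_mb_s * n_disks / amplification

# 3 disks at ~67.3 MB/s each, 2x replication, co-located journals:
print(write_ceiling_mb_s(67.3, 3))  # ~50 MB/s
```

Even this pessimistic bound sits well above the observed ~22 MB/s, so raw disk bandwidth alone doesn't explain the gap, which is consistent with the btrfs doubts raised in the channel.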
[16:12] <tnt> Yeah, ceph isn't exactly super fast right now AFAICT.
[16:13] <andreask> I doubt btrfs is a good choice atm ... also for speed ....
[16:14] <Deuns> ok, I'll try with xfs :)
[16:14] <Deuns> thank you
[16:15] <tnt> I'm not convinced this will make such a difference.
[16:16] <andreask> for local benchmarks definitely
[16:17] <andreask> not talking about stability and reliability
[16:32] * glowell (~glowell@ has joined #ceph
[16:46] <dspano> There's nothing better than finding out your network config is the reason why your OSDs aren't talking to each other after a reboot every once in a while.
[16:50] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[16:53] <Deuns> is it normal that when an osd is down, ceph health tells everything is OK ?
[16:53] <Deuns> health HEALTH_OK
[16:53] <Deuns> osdmap e46: 3 osds: 2 up, 2 in
[16:54] <tnt> I think that if every pg has been remapped to the remaining OSD, yes it will say everything is fine.
[16:58] <Deuns> ok
[16:58] <Deuns> thanks
[16:58] <Deuns> I'm amazed that restarting osd.0 is making ceph remapping everything :)
[17:02] <tnt> That's what I _love_ about ceph :)
[17:02] <tnt> you just keep the machines up, it'll handle the rest :)
[17:05] * BManojlovic (~steki@ Quit (Remote host closed the connection)
[17:09] * deepsa (~deepsa@ has joined #ceph
[17:11] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:16] * LarsFronius_ (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) has joined #ceph
[17:17] * Tv_ (~tv@ has joined #ceph
[17:20] * LarsFronius (~LarsFroni@dyndsl-031-150-008-017.ewe-ip-backbone.de) Quit (Ping timeout: 480 seconds)
[17:20] * LarsFronius_ is now known as LarsFronius
[17:21] * allsystemsarego (~allsystem@ has joined #ceph
[17:24] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[17:25] * Leseb (~Leseb@ Quit (Quit: Leseb)
[17:25] * stxShadow (~Jens@ip-78-94-238-69.unitymediagroup.de) has left #ceph
[17:26] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) Quit (Remote host closed the connection)
[17:40] * tnt (~tnt@212-166-48-236.win.be) Quit (Read error: Operation timed out)
[17:48] * Deuns (~kvirc@169-0-190-109.dsl.ovh.net) Quit (Quit: KVIrc 4.0.0 Insomnia http://www.kvirc.net/)
[17:52] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[17:56] * aliguori (~anthony@ has joined #ceph
[18:09] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:20] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[18:21] * Cube (~Adium@ has joined #ceph
[18:24] * deepsa (~deepsa@ Quit (Quit: Computer has gone to sleep.)
[18:38] * bchrisman (~Adium@ has joined #ceph
[18:39] * gregaf (~Adium@2607:f298:a:607:b949:6cac:b5a1:d8f6) Quit (Quit: Leaving.)
[18:39] * gregaf (~Adium@2607:f298:a:607:d8e6:7160:f1a4:d864) has joined #ceph
[18:41] <yehudasa> elder: there's a patch on the mailing list that I want to get hammered. Where do we push stuff that we want to run the nightlies on?
[18:54] <elder> Nightlies run on testing.
[18:54] <elder> I can pull it in though if you like.
[18:55] <elder> I do a series of pretty solid tests before I add it to the testing branch.
[18:55] <elder> Is this the memory leak thing that's been going back and forth?
[18:55] <elder> (I have been ignoring it since there seemed to be some problems, and you and/or Sage seemed to be on it.)
[19:00] * joshd (~joshd@2607:f298:a:607:221:70ff:fe33:3fe3) has joined #ceph
[19:02] <elder> autobuilder seems to be considerably faster than, well, a few weeks ago anyway.
[19:02] <Tv_> yay for hardware
[19:03] <elder> Is that what happened?
[19:04] <Tv_> moved from old dc to new dc
[19:04] <elder> I am very pleased, regardless of the reason.
[19:04] <elder> I now have to get used to having things ready reasonably quickly again...
[19:07] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[19:07] <nhm_> ugh, I hate waiting for fedex
[19:13] * dmick (~dmick@2607:f298:a:607:592b:48c2:f1f3:4291) has joined #ceph
[19:21] * jbd_ (~jbd_@34322hpv162162.ikoula.com) has left #ceph
[19:27] * chutzpah (~chutz@ has joined #ceph
[19:34] <yehudasa> elder: yes, that's the bio thing
[19:34] <elder> I'll grab a copy and will run some tests in a private cluster.
[19:34] <yehudasa> cool, the latest is v3
[19:34] <elder> I'll also see if I can review it carefully before I head for the airport this afternoon.
[19:35] <yehudasa> awesome
[19:51] <joshd> tnt, loicd, sileht: radosgw also breaks up large objects, there's some configurable threshold above which they're broken up iirc
[19:53] <loicd> joshd: thanks for the info :-) I'm relieved !
[19:53] <tnt> joshd: oh really ? Do you have any reference on that ?
[19:54] <tnt> I'd have to reread the post on atomicity of put/get because iirc that relied on all the data being in the same pg.
[19:55] <gregaf> radosgw currently puts each part of a multi-part upload into a different object, and large objects are broken up into a 512KB (I think that's the size) head and a large "tail"
[19:55] <gregaf> but it's not striping
[19:55] <gregaf> most of the code to support striping exists but it's not enabled on the put and it's not tested
[19:57] * JJ (~JJ@ has joined #ceph
[20:16] * sjust (~sam@ has joined #ceph
[20:30] <nhm> sjust: What was the bug you were talking about that you think is contributing to the jouranl filling up?
[20:30] <sjust> nhm, mikeryan: we were calling op_apply_start before queueing the op rather than before applying the op
[20:31] <sjust> 63b19f29593b8b6e11fc77b83385a6d9e38fdea3 should at least fix that
[20:31] <sjust> this increases the number of in flight ops causing the sync to take a long time
[20:31] <sjust> running to lunch now
[20:38] * aliguori (~anthony@ Quit (Remote host closed the connection)
[20:55] * tnt_ (~tnt@167.39-67-87.adsl-dyn.isp.belgacom.be) has joined #ceph
[20:56] * tnt (~tnt@175.127-67-87.adsl-dyn.isp.belgacom.be) Quit (Ping timeout: 480 seconds)
[20:58] * Leseb_ (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:58] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Read error: Connection reset by peer)
[20:58] * Leseb_ is now known as Leseb
[20:58] <elder> yehudasa, I was just about to write "all looks well in testing so far" but I just hit a problem running test 091 in xfstests over rbd.
[21:02] * BManojlovic (~steki@ has joined #ceph
[21:03] * aliguori (~anthony@cpe-70-123-145-39.austin.res.rr.com) has joined #ceph
[21:03] <elder> Test 091 exercises direct I/O, with sub-block sizes and running direct I/O with concurrent buffered I/O.
[21:03] <elder> I have not yet reviewed the patch.
[21:14] <elder> Test 273 also showed a failure. It looks like it was doing a lot of concurrent "cp" commands, but beyond that I'm not sure.
[21:15] <elder> I haven't been seeing these tests fail, so they are likely due to the patch.
[21:20] * tnt_ is now known as tnt
[21:26] <sjust> nhm: how can I most easily run tests like you were running?
[21:28] <nhm> sjust: compile the workload generator
[21:28] <sjust> I mean, which machine?
[21:28] <sjust> any burnupi?
[21:29] <nhm> sjust: ah, one of the burnupis. You'd need to make sure the partitions are formatted in a sane way.
[21:29] <nhm> sjust: I'm using burnupi12, 13 or 14 might work.
[21:29] <sjust> ok, I'll try 13
[21:30] <sjust> how do I check that the controller/partitions are sane?
[21:30] <nhm> sjust: "sudo megacli -pdlist -a0" will give you info about the raid groups
[21:30] <nhm> er, sorry the drives
[21:31] <nhm> one sec, there should be a command that gives you info about the raid groups too.
[21:33] <nhm> "sudo megacli -LDInfo -Lall -a0"
[21:33] <nhm> looks like burnupi13 doesn't have much configured on it right now.
[21:33] <sjust> do you have a script that'll duplicate the setup on 12?
[21:34] <nhm> sjust: yeah, but it's annoying and requires a reboot and sometimes manual intervention since megacli sucks so hard.
[21:34] <nhm> jump on 14
[21:34] <sjust> ok
[21:35] <nhm> I was doing gluster/ceph testing on that machine but I don't need it at the moment.
[21:36] <sjust> ok
[21:36] <nhm> look in /dev/disk/by-partlabel for data and journal partitions.
[21:36] <nhm> If you want a journal on the same disk as the data, use the same N value, ie osd-device-0-data and osd-device-0-journal.
[21:37] <nhm> if you want one on a separate disk, just use one of the other journal partitions.
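A tiny helper (hypothetical; it just encodes the by-partlabel naming convention nhm describes) for turning an N value into the matching data/journal pair:

```python
# Hypothetical helper for the /dev/disk/by-partlabel scheme above:
# osd-device-N-data pairs with osd-device-N-journal when the journal is
# on the same disk; pick a different index for a separate-disk journal.
BY_PARTLABEL = "/dev/disk/by-partlabel"

def osd_partitions(n, journal_n=None):
    """Return (data, journal) paths; journal_n defaults to same-disk n."""
    if journal_n is None:
        journal_n = n
    data = f"{BY_PARTLABEL}/osd-device-{n}-data"
    journal = f"{BY_PARTLABEL}/osd-device-{journal_n}-journal"
    return data, journal
```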
[21:37] <nhm> you can run the workload generator like:
[21:37] <nhm> ./test_filestore_workloadgen --osd-data /srv/osd-device-0/workload --osd-journal /dev/sdd1 --test-show-stats --test-write-data-size 4k --test-suppress-ops ocl --test-num-colls 5000 --test-destroy-coll-per-N-trans 50000 --test-num-ops 200000 --test-max-in-flight 50 --filestore-op-threads 1 --osd-journal-size 0
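The same invocation broken out flag by flag, as a hypothetical Python wrapper; the flag values are exactly the ones from nhm's command line, and only the two paths are parameters:

```python
def workloadgen_cmd(osd_data, osd_journal):
    """Build nhm's test_filestore_workloadgen command line (values from the log)."""
    args = [
        "./test_filestore_workloadgen",
        "--osd-data", osd_data,        # e.g. /srv/osd-device-0/workload
        "--osd-journal", osd_journal,  # e.g. /dev/sdd1
        "--test-show-stats",
        "--test-write-data-size", "4k",
        "--test-suppress-ops", "ocl",
        "--test-num-colls", "5000",
        "--test-destroy-coll-per-N-trans", "50000",
        "--test-num-ops", "200000",
        "--test-max-in-flight", "50",
        "--filestore-op-threads", "1",
        "--osd-journal-size", "0",
    ]
    return " ".join(args)
```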
[21:46] * lofejndif (~lsqavnbok@82VAAFFQ9.tor-irc.dnsbl.oftc.net) has joined #ceph
[21:50] * allsystemsarego (~allsystem@ Quit (Quit: Leaving)
[21:58] <sjust> nhm: k
[22:11] <yehudasa> elder: cool, I think that some assumptions about the iovecs there are wrong
[22:20] <elder> I'm leaving shortly for my flight. I've printed the patch to review en route (if it's adequate on its own to understand).
[22:21] <elder> But in any case, we can talk about it Monday if you like.
[22:22] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[22:43] * alexxy (~alexxy@2001:470:1f14:106::2) Quit (Read error: Connection reset by peer)
[22:44] * alexxy (~alexxy@2001:470:1f14:106::2) has joined #ceph
[22:47] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[23:05] <sjust> mikeryan: burnupi14
[23:08] * bshah (~bshah@sproxy2.fna.fujitsu.com) has joined #ceph
[23:09] * bshah (~bshah@sproxy2.fna.fujitsu.com) Quit ()
[23:10] * bshah (~bshah@sproxy2.fna.fujitsu.com) has joined #ceph
[23:15] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) has joined #ceph
[23:22] * EmilienM (~EmilienM@ has left #ceph
[23:31] * s[X] (~sX]@ppp59-167-157-96.static.internode.on.net) Quit (Remote host closed the connection)
[23:33] <nhm> sjust: how's the testing been going?
[23:33] <sjust> it seems like the test-num-objs-per-coll configurable isn't working
[23:34] <nhm> sjust: interesting. I didn't dig into that too much.
[23:35] <mikeryan> nhm: are you an inktanker?
[23:35] <sjust> nhm: it's interesting because it seems like with 6000 objects per collection and 5000 collections each write is probably creating a new object
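A quick back-of-the-envelope check of sjust's point (hypothetical Python, assuming writes pick target objects uniformly at random, which the log doesn't confirm): with 5000 collections of 6000 objects each there are 30 million possible objects, so 200000 writes almost never revisit one, and nearly every "write" is really a file create.

```python
def expected_create_fraction(num_objects, num_writes):
    """Fraction of uniform random writes whose target object is new."""
    # Expected number of distinct objects touched; the first touch of
    # each one is a create rather than an overwrite.
    distinct = num_objects * (1.0 - (1.0 - 1.0 / num_objects) ** num_writes)
    return distinct / num_writes

# 5000 collections * 6000 objects, 200000 ops: well over 99% creates.
frac = expected_create_fraction(5000 * 6000, 200000)
```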
[23:35] <dmick> mikeryan: yes :)
[23:36] <mikeryan> palindromic initials!
[23:36] <mikeryan> wat
[23:36] <mikeryan> backwards initials!
[23:36] <mikeryan> i need to get more sleep..
[23:36] <sjust> heh
[23:36] <dmick> nicks vs realnames...don't get me started :)
[23:38] <mikeryan> my nick is my realname
[23:38] <mikeryan> best choice i ever made
[23:38] <nhm> mikeryan: yeah, I'm Mark Nelson. :)
[23:39] * asadpanda (~asadpanda@2001:470:c09d:0:20c:29ff:fe4e:a66) Quit (Remote host closed the connection)
[23:39] <nhm> sjust: that sounds about right.
[23:40] <sjust> nhm: given that the iops are actually creates per second, how many orders of magnitude off are we?
[23:43] <nhm> sjust: To be honest I'm not sure that our performance is that far off given the metadata workload involved in creating new files/directories.
[23:44] <nhm> sjust: what I dislike more is the rapid performance degradation.
[23:44] <sjust> k
[23:47] <nhm> sjust: We certainly have some things we should continue to look at, like the journal write padding. It might be that the workload we are giving the filesystem is so non-optimal that this is what we are going to get.
[23:48] <sjust> nhm: https://forums.openfiler.com/index.php?/topic/5002-very-bad-file-creationdelete-performance-with-xfs/
[23:48] <sjust> not sure if that's still/ever actually true, but interesting
[23:48] <nhm> At least with cephfs and rbd we can buffer lots of small writes to individual files.
[23:49] <sjust> so I'm getting closer to 27MB/s with 25 files
[23:49] <sjust> trying more than one op thread
[23:49] * gregaf (~Adium@2607:f298:a:607:d8e6:7160:f1a4:d864) Quit (Read error: Operation timed out)
[23:50] <nhm> sjust: 4k writes?
[23:50] * gregaf (~Adium@2607:f298:a:607:4cc5:a15:c810:6c20) has joined #ceph
[23:50] <sjust> yeah
[23:50] <nhm> Interesting, I'm not sure I've seen it that high yet on spinning disks.
[23:50] <nhm> what underlying FS?
[23:50] <sjust> those are almost all overwrites
[23:50] <sjust> xfs
[23:50] <nhm> And journal on separate disk?
[23:50] <sjust> yeah
[23:50] <nhm> ah
[23:51] <nhm> so that would mean extents are already allocated?
[23:51] <sjust> right
[23:51] <sjust> well, probably, I'd need to look at whether they are appends or overwrites
[23:51] <sjust> I mean that they are not file creations
[23:58] <sjust> nhm: yeah, the writes are overwrites

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.