#ceph IRC Log

IRC Log for 2012-10-15

Timestamps are in GMT/BST.

[0:08] * jks (~jks@3e6b7571.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[0:09] * dabeowulf (dabeowulf@free.blinkenshell.org) Quit (Remote host closed the connection)
[0:14] * danieagle (~Daniel@186.214.56.184) Quit (Quit: Inte+ :-) e Muito Obrigado Por Tudo!!! ^^)
[0:15] * dabeowulf (dabeowulf@free.blinkenshell.org) has joined #ceph
[0:31] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[0:38] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[0:42] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[0:49] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[0:55] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[0:59] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[1:00] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[1:07] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[1:08] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[1:17] * BManojlovic (~steki@212.200.241.182) Quit (Quit: Ja odoh a vi sta 'ocete...)
[1:36] * enoch_r (~jds@12.248.40.138) Quit (Quit: Computer has gone to sleep.)
[1:52] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) has joined #ceph
[2:16] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[2:18] * via (~via@smtp2.matthewvia.info) has joined #ceph
[2:37] <Qten> Hi All, anyone using openstack w/ ceph rbd around?
[2:39] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Quit: Leaving.)
[2:52] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[3:03] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[3:06] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[3:13] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[3:15] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[3:20] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[3:58] * lxo (~aoliva@1RDAAD89S.tor-irc.dnsbl.oftc.net) Quit (Ping timeout: 480 seconds)
[4:08] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[4:39] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[4:52] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[4:56] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[5:16] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[5:24] * miroslavk (~miroslavk@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[5:59] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[6:17] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) has joined #ceph
[6:36] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[6:38] * maelfius (~mdrnstm@52.sub-70-197-143.myvzw.com) has joined #ceph
[6:39] * zgooger (~user@222.131.8.89) has joined #ceph
[6:39] * zgooger (~user@222.131.8.89) has left #ceph
[6:46] * maelfius (~mdrnstm@52.sub-70-197-143.myvzw.com) Quit (Ping timeout: 480 seconds)
[6:47] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[6:54] * rweeks (~rweeks@c-24-4-66-108.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[6:55] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[7:06] * loicd (~loic@12.180.144.3) has joined #ceph
[7:11] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[7:34] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) has joined #ceph
[7:35] * Ormod (~valtha@ohmu.fi) Quit (Ping timeout: 480 seconds)
[7:35] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) Quit (Read error: Connection reset by peer)
[7:35] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) has joined #ceph
[7:55] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[7:55] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[7:55] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[7:55] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) Quit (Read error: Connection reset by peer)
[7:56] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) has joined #ceph
[7:57] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[8:05] * tryggvil (~tryggvil@nova054-254.cust.nova.is) has joined #ceph
[8:10] * jks (~jks@3e6b7571.rev.stofanet.dk) has joined #ceph
[8:20] * tryggvil (~tryggvil@nova054-254.cust.nova.is) Quit (Quit: tryggvil)
[8:47] * NaioN (stefan@andor.naion.nl) Quit (Remote host closed the connection)
[8:48] * NaioN (stefan@andor.naion.nl) has joined #ceph
[8:53] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) has joined #ceph
[8:57] * Ryan_Lane (~Adium@wsip-184-191-191-52.sd.sd.cox.net) Quit (Quit: Leaving.)
[9:08] * miroslavk (~miroslavk@c-98-248-210-170.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:21] * synapsr_ (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Remote host closed the connection)
[9:23] * scuttlemonkey (~scuttlemo@wsip-70-164-119-40.sd.sd.cox.net) has joined #ceph
[9:23] * schlitzer (~schlitzer@109.75.189.45) Quit (Read error: Connection reset by peer)
[9:25] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:29] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[9:31] * Leseb (~Leseb@193.172.124.196) has joined #ceph
[9:34] * f4m8_ is now known as f4m8
[9:39] * tziOm (~bjornar@194.19.106.242) has joined #ceph
[9:39] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) Quit (Read error: Connection reset by peer)
[9:40] * andreask (~andreas@chello062178013131.5.11.vie.surfer.at) has joined #ceph
[9:41] <tziOm> I have a problem switching to public/cluster addressing
[9:42] <tziOm> after setting up the private network 10.5.0.0/24 and the cluster network 10.100.0.0/24 in [global], distributing the config and restarting ceph, the cluster stops working
[9:42] <tziOm> ceph osd dump shows me new addresses
[9:42] <tziOm> and all addresses are reachable
[9:43] <tziOm> hmm..
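
A minimal sketch of the [global] section tziOm is describing, using the subnets he names; "public network" and "cluster network" are the standard ceph.conf option names (he says "private network", which presumably means the public, client-facing one):

    [global]
        ; client-facing traffic
        public network = 10.5.0.0/24
        ; OSD replication and heartbeat traffic
        cluster network = 10.100.0.0/24
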
[9:46] * BManojlovic (~steki@91.195.39.5) has joined #ceph
[9:53] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[9:53] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) has joined #ceph
[9:58] <wido> tziOm: Have you checked the logs?
[9:58] <wido> Checked with netstat whether the OSDs actually start to connect to each other?
[10:05] <tziOm> it's ok now
[10:06] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[10:06] * synapsr (~synapsr@c-69-181-244-219.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:06] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[10:11] * scuttlemonkey (~scuttlemo@wsip-70-164-119-40.sd.sd.cox.net) Quit (Quit: This computer has gone to sleep)
[10:13] * deepsa (~deepsa@117.207.90.55) Quit (Ping timeout: 480 seconds)
[10:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[10:18] * deepsa (~deepsa@110.224.88.184) has joined #ceph
[10:21] * dmick_away (~Dan@cpe-76-87-42-76.socal.res.rr.com) has left #ceph
[10:26] * deepsa_ (~deepsa@117.199.127.42) has joined #ceph
[10:26] * deepsa (~deepsa@110.224.88.184) Quit (Read error: Connection reset by peer)
[10:26] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[10:26] * deepsa_ is now known as deepsa
[10:26] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[11:24] * Cube1 (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[11:33] * mgalkiewicz (~mgalkiewi@staticline-31-182-149-180.toya.net.pl) has joined #ceph
[11:38] * stxShadow (~Jens@ip-178-203-169-190.unitymediagroup.de) has joined #ceph
[11:40] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[11:42] * mgalkiewicz (~mgalkiewi@staticline-31-182-149-180.toya.net.pl) Quit (Ping timeout: 480 seconds)
[11:47] * MikeMcClurg1 (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[11:55] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[11:56] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[11:59] * rlr219 (62dc9973@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[12:01] * yoshi (~yoshi@p37219-ipngn1701marunouchi.tokyo.ocn.ne.jp) Quit (Remote host closed the connection)
[12:04] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[12:04] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[12:16] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[12:17] * MikeMcClurg1 (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[12:20] <tziOm> wido, but I am still experiencing very slow benchmark speeds with rados
[12:21] <tziOm> ..and ceph
[12:23] <tziOm> average about 40MB/s
[12:23] <tziOm> network tests show that I am able to do full gigabit on all links
[12:40] <tziOm> Why would I only get about 35MB/s on osd tell $id bench?
[13:07] <todin> tziOm: osd tell bench is a local test on the osd; how fast are the disks for the journal and for the filestore?
[13:08] <tziOm> the disks write at about 250MB/sec
[13:09] <tziOm> that is the md the osd is working on
[13:09] <tziOm> journal on a separate device
[13:09] <todin> and the journal device, is that fast as well?
[13:09] <tziOm> no, lets see..
[13:11] <tziOm> about 80MB/sec on journal dev
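
A sketch of the local checks being discussed here: raw sequential write speed of the journal device with dd, plus the built-in per-OSD benchmark mentioned above (the device path and osd id are hypothetical; the dd is destructive to whatever is on that device):

    # raw sequential write throughput of the journal device (overwrites data!)
    dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct
    # built-in benchmark that writes through osd.0's journal and filestore
    ceph osd tell 0 bench
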
[13:12] <todin> that's the only osd you have?
[13:12] <tziOm> no
[13:12] <tziOm> but same disks in all 4
[13:12] <todin> and you use the default replica count?
[13:12] <tziOm> I use size 2
[13:12] <tziOm> also tried with size 1, but no diff
[13:13] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[13:15] <todin> how do you test the performance? I think the journal is the limiter.
[13:15] <todin> is it a test cluster? you could try putting the journal on a ramdisk, but be careful of the data loss
[13:16] * ogelbukh (~weechat@nat3.4c.ru) Quit (Ping timeout: 480 seconds)
[13:18] <tziOm> 80MB/sec journal writes should be plenty to get above ~30MB/sec
[13:20] <todin> how do you test the bandwidth?
[13:22] <tziOm> netperf and various dd/nc
[13:23] <todin> do you use the cephfs or the rbd layer?
[13:24] <tziOm> only playing with rados atm
[13:24] <todin> you could use the ceph internal benchmark rados bench 60 -t 16 -p rbd write
[13:24] <tziOm> todin, have used that one a lot
[13:24] <tziOm> gives me about 45MB/sec
[13:25] <tziOm> but I mean I "should" get close to 120
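
For anyone reproducing this, a sketch of the rados bench invocation todin suggests (pool name and duration as given above; exact flag order may vary by version):

    # 60-second write benchmark against the rbd pool, 16 concurrent ops
    rados -p rbd bench 60 write -t 16
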
[13:25] <todin> how many osd do you have?
[13:25] <tziOm> 4
[13:25] <tziOm> each with public/private Gbit
[13:25] <todin> and all have a separate filestore and journal device?
[13:25] <tziOm> yes
[13:26] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[13:27] <todin> hmm, then you should max out the 1Gbit. did you watch the disk performance on the osd via iostat?
[13:28] <joao> I'm not a performance specialist, but from the discussions I recall on the channel, I believe nhm usually advises to check the bw obtained with dd just to check if it's not a disk issue
[13:30] <todin> that's right as a first step; I am interested in whether the disk is running at the limit during the test, maybe a misconfiguration somewhere
[13:31] * ihwtl (~hr@odm-mucoffice-02.odmedia.net) has joined #ceph
[13:31] <joao> tziOm, I would also advise you to check the performance blog post that nhm put on the ceph blog
[13:32] <joao> might contain useful information
[13:32] <joao> http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[13:33] <todin> joao: nhm is mark nelson?
[13:33] <joao> the "inktank performance lab" always cracks me up
[13:33] <joao> todin, yes
[13:33] <todin> joao: cool, I should post a few pictures of our lab here ;-)
[13:34] <joao> todin, I doubt it can beat mark's basement setup :p
[13:34] * IvY (~ivy@78.111.76.67) has joined #ceph
[13:35] <joao> todin, where are you located?
[13:36] <todin> joao: it is more or less the same type of system, but I have 16 SAS disks, a few different intel ssd 710/910 and a few lsi raid controllers
[13:36] <todin> joao: berlin, germany
[13:36] <tziOm> I read this
[13:36] <tziOm> and it does not really give me a lot of info
[13:36] <joao> todin, what timezone is that? CET?
[13:36] <todin> +2 gmt
[13:37] <tziOm> Just a bit sad that I can set up a quick and dirty monkey nfs/rsync/foo setup that will easily outperform ceph
[13:37] * ihwtl (~hr@odm-mucoffice-02.odmedia.net) Quit (Quit: ihwtl)
[13:38] <todin> tziOm: I get around 900MB/s per osd
[13:38] <joao> tziOm, won't give you the same guarantees though
[13:38] * ihwtl (~hr@odm-mucoffice-02.odmedia.net) has joined #ceph
[13:39] <todin> joao: I have a little problem with my osd journal, the latency on the ssd is quite high, around 40ms, do you have an idea how I could debug that?
[13:41] <joao> todin, I honestly don't; maybe going the perf route could help?
[13:42] <joao> iirc, the latest suspected bottleneck was the journal
[13:42] <joao> maybe a sync or lock contention is involved; but nhm is definitely the person you want to talk to
[13:44] <todin> joao: do you know the option for the conf to keep more iops for the journal in flight?
[13:44] * zynzel (zynzel@spof.pl) has joined #ceph
[13:44] <todin> I read it a few months ago on the ml, but cannot find it again
[13:45] <tziOm> todin, I am talking bytes, not bits
[13:45] <joao> todin, I'm just checking the list of config options and inferring from there, but could it be 'journal_queue_max_ops'?
[13:46] <todin> joao: not sure, I just could try it out
[13:46] <joao> there's also a 'journal_queue_max_bytes'
[13:47] <joao> not sure how they are related
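
A sketch of how those two options would look in ceph.conf under the osd section (the values here are made up for illustration; how the two limits interact is exactly what is being questioned above):

    [osd]
        ; max number of journal entries queued up in flight
        journal queue max ops = 500
        ; max bytes queued in the journal, here 100 MB
        journal queue max bytes = 104857600
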
[13:47] <zynzel> hi! I have a question about ceph and rbd. I have a test cluster with 3 osd nodes (each 5GB storage) and all pools have 2 replicas. I tried to create a 6GB image but this killed the osds (near full osd); doesn't ceph split data between all nodes? (something like raid0)
[13:48] <tziOm> Total time run: 900.890597
[13:48] <tziOm> Total writes made: 9792
[13:48] <tziOm> Write size: 4194304
[13:48] <tziOm> Bandwidth (MB/sec): 43.477
[13:48] <tziOm> Stddev Bandwidth: 16.2537
[13:48] <tziOm> Max bandwidth (MB/sec): 244
[13:48] <tziOm> Min bandwidth (MB/sec): 0
[13:48] <tziOm> Average Latency: 5.87227
[13:48] <tziOm> Stddev Latency: 0.64673
[13:48] <tziOm> Max latency: 9.65203
[13:48] <tziOm> Min latency: 1.2005
[13:48] <joao> todin, http://article.gmane.org/gmane.comp.file-systems.ceph.devel/8382/match=journal+flight+iops
[13:48] <joao> It's possible you need to use more threads to have more operations in
[13:48] <joao> flight in to the filestore (the main storage for the osd). Try
[13:48] <joao> something like this in your ceph configuration for the osds:
[13:48] <joao> osd op threads = 24
[13:48] <joao> osd disk threads = 24
[13:48] <joao> filestore op threads = 6
[13:48] <joao> filestore queue max ops = 24
[13:48] <joao> oh
[13:49] <joao> nevermind, this is about the filestore
[13:49] <joao> bummer, I should learn to pay attention before pasting stuff
[13:50] * tightwork (~tightwork@142.196.239.240) has joined #ceph
[14:01] <todin> joao: where are you located?
[14:01] <joao> todin, Lisbon
[14:02] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Ping timeout: 480 seconds)
[14:03] <todin> joao: nice, and warm :-)
[14:03] <joao> indeed :)
[14:04] <todin> are you coming to the workshop in amsterdam?
[14:04] <joao> yeah
[14:05] <joao> I'm assuming you're going too then :p
[14:05] <todin> joao: yep, I booked the flight already, but no hotel yet
[14:11] <tziOm> Ok, managed to boost my performance a little
[14:11] <joao> brb; lunch
[14:11] <tziOm> figured the journal settings were wrong, so moved it off the osd
[14:13] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[14:17] * tightwork (~tightwork@142.196.239.240) Quit (Ping timeout: 480 seconds)
[14:22] <tziOm> it really has quite some impact
[14:26] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[14:27] <todin> tziOm: you had the filestore and journal sharing the same disk?
[14:36] * yehudasa (~yehudasa@2607:f298:a:607:d6be:d9ff:fe8e:174c) Quit (Ping timeout: 480 seconds)
[14:36] * IvY (~ivy@78.111.76.67) has left #ceph
[14:42] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[14:48] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Ping timeout: 480 seconds)
[14:48] <tziOm> todin, apparently..
[14:48] <tziOm> todin, don't know when this happened, but it should not have been
[14:49] <tziOm> now I am facing another problem
[14:49] * MikeMcClurg1 (~mike@client-7-215.eduroam.oxuni.org.uk) has joined #ceph
[14:49] <tziOm> health HEALTH_WARN mds a is laggy
[14:51] <tziOm> mdsmap e1398: 2/2/1 up {0=a=up:resolve,1=a=up:resolve(laggy or crashed)}
[14:53] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[14:54] <todin> tziOm: I don't have an mds, I only use the rbd layer
[14:54] <tziOm> ok
[14:54] <tziOm> ceph fs would be great for a few things here.. mail and web storage
[14:54] <tziOm> ..but seems gluster is the way to go for now
[14:56] <todin> tziOm: and the rados bench performance is okay now?
[14:56] * aliguori (~anthony@cpe-70-123-154-34.austin.res.rr.com) has joined #ceph
[14:57] * Leseb_ (~Leseb@193.172.124.196) has joined #ceph
[14:57] * Leseb (~Leseb@193.172.124.196) Quit (Read error: Connection reset by peer)
[14:57] * Leseb_ is now known as Leseb
[14:58] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:00] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[15:01] <tziOm> todin, it's fair..
[15:01] <tziOm> todin, about 75MB/sec
[15:01] <tziOm> still far from filling the link
[15:04] <tziOm> Are there any docs available on cephfs?
[15:04] <tziOm> For example, how do I decide what osd pool to mount.... or something..
[15:11] <todin> tziOm: I don't know
[15:12] <tziOm> I think I figured
[15:12] <tziOm> ...but this command crashed my osd:
[15:12] <tziOm> root@storage01:/mnt/cephfs# time mkdir -p {0..255}/{0..255}
[15:13] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[15:19] * MikeMcClurg1 (~mike@client-7-215.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[15:32] * tryggvil (~tryggvil@62.50.255.81) has joined #ceph
[15:45] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[15:50] * tziOm (~bjornar@194.19.106.242) Quit (Remote host closed the connection)
[15:53] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) Quit (Quit: Leaving)
[15:53] * PerlStalker (~PerlStalk@72.166.192.70) has joined #ceph
[15:56] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[16:01] * ihwtl (~hr@odm-mucoffice-02.odmedia.net) Quit (Ping timeout: 480 seconds)
[16:05] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) has joined #ceph
[16:05] * bchrisman1 (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Read error: Connection reset by peer)
[16:09] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[16:15] * scuttlemonkey (~scuttlemo@wsip-70-164-119-40.sd.sd.cox.net) has joined #ceph
[16:23] * synapsr (~synapsr@207.239.114.206) has joined #ceph
[16:44] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[16:54] * scuttlemonkey (~scuttlemo@wsip-70-164-119-40.sd.sd.cox.net) Quit (Quit: This computer has gone to sleep)
[16:54] * synapsr (~synapsr@207.239.114.206) Quit (Remote host closed the connection)
[17:01] * lotia (~lotia@l.monkey.org) has joined #ceph
[17:09] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) has joined #ceph
[17:13] * deepsa (~deepsa@117.199.127.42) Quit (Quit: Computer has gone to sleep.)
[17:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:16] * deepsa (~deepsa@117.199.127.42) has joined #ceph
[17:21] * BManojlovic (~steki@91.195.39.5) Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:23] * loicd (~loic@12.180.144.3) Quit (Ping timeout: 480 seconds)
[17:25] * vata (~vata@2607:fad8:4:0:f94e:88e5:de00:e387) has joined #ceph
[17:26] * loicd (~loic@12.180.144.3) has joined #ceph
[17:27] * tryggvil (~tryggvil@62.50.255.81) Quit (Remote host closed the connection)
[17:29] * MikeMcClurg (~mike@client-7-203.eduroam.oxuni.org.uk) Quit (Ping timeout: 480 seconds)
[17:31] * tryggvil (~tryggvil@62-50-223-253.client.stsn.net) has joined #ceph
[17:33] * scuttlemonkey (~scuttlemo@63.133.198.36) has joined #ceph
[17:34] * tryggvil (~tryggvil@62-50-223-253.client.stsn.net) Quit ()
[17:35] * synapsr (~synapsr@12.130.118.11) has joined #ceph
[17:38] * aliguori (~anthony@cpe-70-123-154-34.austin.res.rr.com) Quit (Ping timeout: 480 seconds)
[17:42] * tziOm (~bjornar@ti0099a340-dhcp0778.bb.online.no) has joined #ceph
[17:45] * loicd (~loic@12.180.144.3) Quit (Quit: Leaving.)
[17:46] * Tv_ (~tv@2607:f298:a:607:1859:b94f:f46e:5085) has joined #ceph
[17:49] * loicd (~loic@12.180.144.3) has joined #ceph
[17:52] * synapsr (~synapsr@12.130.118.11) Quit (Remote host closed the connection)
[17:58] <tziOm> When will focus eventually be on cephfs?
[18:00] * Leseb (~Leseb@193.172.124.196) Quit (Quit: Leseb)
[18:01] <tziOm> I have been playing around with ceph for about 2 weeks now, and actually ordered a plane ticket to amsterdam for the workshop, but my experience so far is far from what I expected..
[18:03] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) Quit (Quit: Leaving.)
[18:07] * loicd (~loic@12.180.144.3) Quit (Quit: Leaving.)
[18:09] * scalability-junk (~stp@188-193-205-115-dynip.superkabel.de) has joined #ceph
[18:14] <todin> tziOm: that is sad, I can tell you that my experience is quite good
[18:14] <todin> tziOm: where are you from?
[18:17] * gregaf (~Adium@38.122.20.226) Quit (Quit: Leaving.)
[18:18] * Cube1 (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:19] <tziOm> Norway
[18:19] <scalability-junk> just a quick question that came up when reading the docs and the paper. what happens when a failure occurs? let's say ip 1.1.1.1 is failing; wouldn't all requests to this ip fail till the ip is cleaned from the dns record, or how is a request mapped to a server/ip address
[18:19] <scalability-junk> ?
[18:21] * jjgalvez (~jjgalvez@12.248.40.138) has joined #ceph
[18:21] <todin> tziOm: I will be in the hardangervidda next year
[18:21] <tziOm> Yeah!? where are you from and why is that?
[18:21] <Tv_> scalability-junk: we don't really use DNS for much
[18:22] <Tv_> scalability-junk: what clients need to discover is monitors (plural), any one of the ones the client knows about still working -> all of the services are still accessible
[18:22] <Tv_> scalability-junk: clients don't need to e.g. resolve actual storage locations via dns
[18:22] <Tv_> scalability-junk: you most definitely wouldn't need to adjust dns entries on the fly
[18:22] * gregaf (~Adium@2607:f298:a:607:10ac:744:64fe:342b) has joined #ceph
[18:23] <todin> tziOm: germany, for winter skiing and camping
[18:23] <tziOm> todin, going in the winter?
[18:24] <todin> tziOm: yep, I like it cold
[18:25] <scalability-junk> Tv_, ok so no dns entries for osd discovery, but what about mds discovery? wouldn't the client (aka user) need to be mapped to one mds to get the request for a file routed...
[18:26] <Tv_> scalability-junk: all a client needs is to reach just one working monitor
[18:26] <Tv_> scalability-junk: clients will try all monitors they know of until one responds
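
In practice the list of monitors Tv_ mentions comes from the client's ceph.conf (or the equivalent mount/CLI arguments); a minimal sketch with made-up addresses:

    [global]
        ; the client tries each of these until one responds
        mon host = 10.5.0.1:6789, 10.5.0.2:6789, 10.5.0.3:6789
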
[18:27] <scalability-junk> Tv_, but what happens when a client is a web browser trying to access file42...
[18:28] <Tv_> scalability-junk: now you're talking about radosgw, not RADOS or MDS
[18:28] <scalability-junk> wouldn't the client here ask for the ip of one mds instead of all so if the one is down... bad?
[18:28] <scalability-junk> ah kk
[18:28] <Tv_> scalability-junk: that's a whole different world; a typical deployment has a load balancer in front of a pool of radosgw servers, etc
[18:28] <scalability-junk> ok so the loadbalancer is a HA setup and distributes the requests
[18:29] <tziOm> todin, what do you use? rbd?
[18:29] <scalability-junk> the radosgw is in your sense the client and knows all mds ips and tries all of them
[18:29] <scalability-junk> if any one of them fails.
[18:29] <scalability-junk> ok
[18:29] <Tv_> scalability-junk: yes except no MDS is used there
[18:29] <Tv_> scalability-junk: radosgw uses only RADOS, so it needs just ceph-mon and ceph-osd services
[18:29] <scalability-junk> so the radosgw is a proxy and therefore all bandwidth will pass through, right?
[18:30] <todin> tziOm: yes, rbd, and qemu/kvm on top, for a cloud hosting product
[18:30] <Tv_> scalability-junk: sort of yes, it's an "application gateway" converting between two completely different protocols etc
[18:30] <scalability-junk> Tv_, ah alright. just one last thing I want to get clear
[18:31] <scalability-junk> when I use ceph for block storage, all requests and bandwidth go from the osds to the vm or client requesting the block storage, right?
[18:31] <scalability-junk> so for block storage how would the vm know which mds to contact ...
[18:32] <Tv_> scalability-junk: mds is not involved in rbd at all
[18:32] <Tv_> scalability-junk: mds is only needed for the cephfs filesystem
[18:32] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[18:32] <tziOm> todin, do you use this in production?
[18:33] <Tv_> scalability-junk: rbd block storage just slices your virtual disk into 4MB chunks, and each of those chunks resides in a RADOS object, stored on some OSDs; monitors give the client enough info to know what OSDs to talk to
[18:33] <scalability-junk> Tv_, ah alright, so the paper I read was probably mainly about the filesystem part :P mixed that up sorry.
[18:33] <scalability-junk> so the HA setup I need would be for the monitors.
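
The 4MB slicing Tv_ describes is visible in rbd's own metadata; a sketch with a hypothetical image name (the output shown in comments is approximate):

    rbd create test --size 1024      # 1 GB image
    rbd info test
    # rbd image 'test':
    #   size 1024 MB in 256 objects
    #   order 22 (4096 KB objects)   <- 2^22 bytes = one 4 MB RADOS object per chunk
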
[18:34] <lotia> greetings all. does anything but OSDs need to be on btrfs/xfs?
[18:34] <lotia> noob to ceph and considering deploying it as the storage for an in house openstack cluster
[18:34] <Tv_> scalability-junk: ceph does all its own HA
[18:34] <todin> tziOm: not yet, but I hope in the near future
[18:35] <Tv_> scalability-junk: the only part where you need any sort of external HA is when you're interacting with legacy protocols, such as HTTP, that have no such concept in them
[18:35] <scalability-junk> Tv_, it does? even for the radosgw and the monitor?
[18:35] <Tv_> scalability-junk: running multiple radosgw instances just gives you lots of ip:port destinations that will all serve you identically
[18:35] <scalability-junk> Tv_, are there any docs about how failure handling and HA are built into ceph?
[18:35] <todin> lotia: just the osd filestore
[18:36] <Tv_> scalability-junk: you *could* use it like that, raw, if you put enough smarts in javascript; most people run a load balancer to hide all those behind a single ip:port
[18:36] <Tv_> scalability-junk: a lot of the papers you've been skimming talk about that..
[18:36] <scalability-junk> Tv_, yeah but you have to make sure ip:port doesn't fail... if file.com/files points to a non-functioning ip:port it fails...
[18:36] <scalability-junk> Tv_, just found 2 papers. are there any more?
[18:37] <tziOm> todin, hmm.. so it proves stable to you?
[18:37] <scalability-junk> Tv_, ok so loadbalancer HA setup to provide one ip:port and the rest is HA from ceph
[18:37] <scalability-junk> Tv_, I assume the same for monitors.
[18:37] <todin> tziOm: In my configuration it is quite stable.
[18:38] <Tv_> scalability-junk: anything that uses a ceph-specific protocol has a smart client and does not need to worry about load balancers and other old-school HA
[18:38] <Tv_> scalability-junk: for monitors, the clients are configured with a list of possibilities, they'll try all the alternatives
[18:39] <scalability-junk> Tv_, ok so as long as I get users/vms talking to one part of ceph it's HA great :P
[18:40] <scalability-junk> Tv_, helped a lot thanks
[18:40] * loicd (~loic@63.133.198.91) has joined #ceph
[18:40] * sage1 is now known as sgae
[18:40] * sgae is now known as sage_
[18:41] * sage_ is now known as sage
[18:41] * aliguori (~anthony@cpe-70-123-129-122.austin.res.rr.com) has joined #ceph
[18:43] <scalability-junk> one last thing... is the traffic between ceph services encrypted, or is it meant to run on private networks only?
[18:44] <Tv_> scalability-junk: not encrypted, kerberos-style authentication ("cephx") is optional
[18:44] * mtk (~mtk@ool-44c35bb4.dyn.optonline.net) has joined #ceph
[18:45] <scalability-junk> ok so authentication is there, but if someone had access to my network they could see which files are accessed by whom... or sort of.
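
For completeness, a sketch of enabling the cephx authentication Tv_ mentions (it authenticates and signs, but does not encrypt the data stream; "auth supported" was the single ceph.conf knob at the time, with finer-grained options in later versions):

    [global]
        auth supported = cephx
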
[18:46] <tziOm> todin, quite stable..
[18:47] * pudgetta (bca79153@ircip1.mibbit.com) has joined #ceph
[18:47] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[18:47] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[18:48] <todin> tziOm: as long as I don't do weird stuff it is stable, but I am a quality assurance engineer, so my job is to break things
[18:49] <tziOm> yeah.. I break things too...
[18:49] <tziOm> but not so hard with cephfs..
[18:49] <tziOm> mkdir -p {0..255}/{0..255} ..... kernel crash
[18:50] <todin> tziOm: I do not know about that, I use rbd via qemu. But I think the developers still say that cephfs is experimental
[18:50] * jlogan1 (~Thunderbi@2600:c00:3010:1:4c87:c827:3fff:20a) has joined #ceph
[18:51] * loicd (~loic@63.133.198.91) Quit (Ping timeout: 480 seconds)
[18:52] * scuttlemonkey (~scuttlemo@63.133.198.36) Quit (Ping timeout: 480 seconds)
[18:55] <pudgetta> maybe this is a dumb question, but is it possible to dedicate iops for rbd? I have a ceph cluster and made an rbd device. This one has 2200 iops; is it possible to make another one but limit its iops to 200 or something like this?
[18:56] <Tv_> pudgetta: not currently
[18:57] <pudgetta> so all rbd devices share iops from the cluster, right?
[18:57] <rweeks> the RBD device would use the ions from whatever pool of storage you defined the RBD in
[18:57] <rweeks> ions, that is.
[18:57] <rweeks> er....
[18:57] <rweeks> IOPS
[18:57] <pudgetta> sure
[18:57] <rweeks> jeez. It must be monday.
[18:57] <pudgetta> :)
[18:58] <rweeks> so - you could define yourself a really small pool of just a few OSDs, and create an RBD using just those.
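
A sketch of rweeks' suggestion: a dedicated pool with an image inside it (pool name, PG count and image size are made up; actually pinning the pool to specific OSDs additionally requires a custom CRUSH rule):

    # create a small dedicated pool and an rbd image inside it
    ceph osd pool create slowpool 128
    rbd create --pool slowpool limited --size 10240
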
[18:58] * yehudasa (~yehudasa@2607:f298:a:607:6534:f1b7:6a0b:6dfb) has joined #ceph
[18:58] <yehudasa> gregaf: I just pushed some trivial fix to master, if you could eyeball it
[18:59] * Ryan_Lane (~Adium@23.sub-70-197-141.myvzw.com) has joined #ceph
[19:01] * loicd (~loic@63.133.198.91) has joined #ceph
[19:01] <pudgetta> thanks
[19:02] * scuttlemonkey (~scuttlemo@63.133.198.36) has joined #ceph
[19:03] * Ryan_Lane1 (~Adium@157.sub-70-197-141.myvzw.com) has joined #ceph
[19:04] <pudgetta> I've just been playing with ceph for a few days... so I'll try another one :) I have 2 osds; when I shut down one of them I see the rbd device freeze (iops stop) for a few seconds, then it continues to operate. is this some delay parameter or do I have something wrong in my config?
[19:05] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:05] <elder> gregaf I am unable to join the Ceph FS standup without a link to the Vidyo room.
[19:06] * jlogan1 (~Thunderbi@2600:c00:3010:1:4c87:c827:3fff:20a) Quit (Quit: jlogan1)
[19:07] * Ryan_Lane (~Adium@23.sub-70-197-141.myvzw.com) Quit (Ping timeout: 480 seconds)
[19:08] * BManojlovic (~steki@212.200.241.182) has joined #ceph
[19:11] * jlogan1 (~Thunderbi@2600:c00:3010:1:3ca8:5928:4097:8bcd) has joined #ceph
[19:14] <aaron> rweeks: hello :)
[19:14] <rweeks> hey
[19:15] <aaron> so if you have multiple rados gateways using lvs or rrdns, will authenticating against one give a token that can be used with the others?
[19:15] <rweeks> hmm, that's a good question that I don't know the answer to.
[19:15] * Cube1 (~Cube@12.248.40.138) has left #ceph
[19:18] <rweeks> Tv_: do you know?
[19:19] <Tv_> s3 authentication is more about signing requests with secret keys; nothing is specific to a node there; but that also doesn't "give a token", so perhaps you are talking about swift
[19:19] <Tv_> i don't know much about swift
[19:19] <aaron> yeah, swift
[19:21] * bchrisman (~Adium@c-76-103-130-94.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[19:22] * synapsr (~synapsr@12.180.144.3) has joined #ceph
[19:27] * lofejndif (~lsqavnbok@04ZAAAROS.tor-irc.dnsbl.oftc.net) has joined #ceph
[19:30] * Ryan_Lane1 (~Adium@157.sub-70-197-141.myvzw.com) Quit (Quit: Leaving.)
[19:32] <gregaf> yehudasa: yeah, looks good
[19:33] <aaron> Tv_: yehudasa might know :)
[19:33] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has left #ceph
[19:33] * sagelap (~sage@76.89.177.113) has joined #ceph
[19:33] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) has joined #ceph
[19:34] <yehudasa> aaron: I don't see why not
[19:34] * sagelap (~sage@76.89.177.113) has left #ceph
[19:34] * sagelap (~sage@76.89.177.113) has joined #ceph
[19:36] <aaron> yehudasa: ok, good, I just want to make sure scaling the gateway and caching tokens is reasonable
[19:37] * loicd (~loic@63.133.198.91) Quit (Ping timeout: 480 seconds)
[19:37] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[19:40] * Cube1 (~Cube@12.248.40.138) has joined #ceph
[19:44] * slang (~slang@207-229-177-80.c3-0.drb-ubr1.chi-drb.il.cable.rcn.com) Quit (Ping timeout: 480 seconds)
[19:49] <aaron> yehudasa: can you also confirm that the scalability of large containers is better than swift? I've been told it is and that it uses leveldb.
[19:50] <yehudasa> aaron: I haven't run the tests myself. I did see it being mentioned on the web. In my own tests I created containers/buckets that had a few million objects with no obvious degradation
[19:51] <yehudasa> yeah, we use leveldb in the rados backend and rgw leverages that
[19:54] * slang (~slang@ace.ops.newdream.net) has joined #ceph
[19:55] * loicd (~loic@63.133.198.91) has joined #ceph
[19:56] * loicd1 (~loic@63.133.198.91) has joined #ceph
[19:57] * loicd (~loic@63.133.198.91) Quit (Read error: Connection reset by peer)
[20:00] <todin> pudgetta: if you use rbd in combination with qemu, you can limit the iops for the qemu instance
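
A sketch of what todin describes, assuming a QEMU new enough to support -drive I/O throttling (the pool/image names and the rest of the command line are hypothetical):

    # cap the guest's rbd-backed disk at 200 IOPS at the QEMU layer
    qemu-system-x86_64 -m 1024 \
        -drive file=rbd:rbd/foo,format=raw,if=virtio,iops=200
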
[20:04] * loicd1 (~loic@63.133.198.91) Quit (Ping timeout: 480 seconds)
[20:04] * glowell (~glowell@c-98-210-224-250.hsd1.ca.comcast.net) has joined #ceph
[20:05] * dmick (~dmick@38.122.20.226) has joined #ceph
[20:08] <aaron> rweeks: so, is there anything important that sucks in ceph but not swift?
[20:08] <aaron> ;)
[20:09] <rweeks> IIRC you guys aren't interested in CephFS
[20:09] <aaron> yeah, just rados+radosgw
[20:10] <rweeks> as far as I know those are pretty damn solid.
[20:10] <aaron> I'm sure cephfs has bugs, heh
[20:10] <rweeks> DreamHost has a commercial service based on rados+gw that is doing really well
[20:10] * aaron is afraid of running into stuff at the last minute like swift
[20:10] <nhm_> aaron: what kinds of issues did you run into with swift?
[20:11] <nhm_> aaron: the thing I would be most worried about is small object throughput, but I have no idea how swift performs.
[20:13] <aaron> container sync being garbage and leaving us with no dc-replication strategy, memory leaks, sqlite dbs that don't scale and needed ssds, poor handling of increased load even with over-provisioned machines, high latency, no internal connection pooling, poor handling of large files (no paging)
[20:13] <aaron> lots of stuff, but the container sync thing is the most annoying
[20:13] <aaron> we can work around other stuff (like using a cron to restart the proxies before they swap-death)
[20:14] <aaron> though that still briefly causes problems for the clients (apache servers)
[20:14] <aaron> swift also has annoying limits on header values (max size 255 bytes)
[20:15] <aaron> though that will be fixed in 1.7.* at least (via config options)
[20:15] <aaron> rweeks: btw, what is the max object name length in rados?
[20:16] <nhm_> aaron: we see high latency in some cases. It's something that we've been investigating for a couple of months.
[20:17] <nhm_> aaron: If I recall, I think we ended up determining it was when the journals got so far ahead of the data disks that they had to sit there flushing out data to the disk.
[20:18] <gregaf> ie, if you try to put more load on your system than it can handle, things will get backed up and you'll see latency
[20:19] <nhm_> yeah. It can just be deceiving because it looks like you are going along fast and then all of sudden you get a long stall.
[20:19] <nhm_> it generally happens when the journals can write much faster than the OSD data disks.
[20:22] * synapsr (~synapsr@12.180.144.3) Quit (Remote host closed the connection)
[20:22] * loicd (~loic@z2-8.pokersource.info) has joined #ceph
[20:25] <rweeks> aaron: I think one of the developers needs to answer that max object name length question
[20:26] <aaron> it's ok if its the same as swift, would be nice to be slightly higher
[20:28] <rweeks> I'm also trying to figure something out: if I have a cephFS client or a mounted RBD writing data, can I then read it out from another client via the swift or S3 APIs?
[20:29] <rweeks> my brain says there's something there that wouldn't work, but I'm not sure.
[20:29] * bchrisman (~Adium@108.60.121.114) has joined #ceph
[20:35] * mgalkiewicz (~mgalkiewi@staticline-31-183-94-25.toya.net.pl) has joined #ceph
[20:40] * rweeks is now known as rweeks_afk
[20:44] * mgalkiewicz (~mgalkiewi@staticline-31-183-94-25.toya.net.pl) Quit (Ping timeout: 480 seconds)
[20:46] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) has joined #ceph
[20:47] * rweeks_afk (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) Quit (Quit: ["Textual IRC Client: www.textualapp.com"])
[20:48] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) Quit (Read error: Connection reset by peer)
[20:50] <scuttlemonkey> is it possible to craft your crush map so that you can specify, for instance, 6 copies of data but 3 of them in a specific region synchronously and the other 3 in a different region, but async?
[20:55] <iggy> scuttlemonkey: ceph expects a low latency, high bandwidth connection between osds... that usually makes it a no go for geographically dispersed setups
[20:55] <scuttlemonkey> ahh, ok... I knew that was the general rule
[20:55] <scuttlemonkey> just didn't know if you could build in any sort of intelligence wrt geographic issues
[20:57] <scuttlemonkey> thanks
[21:06] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) has joined #ceph
[21:07] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) Quit (Read error: Connection reset by peer)
[21:20] * miroslavk (~miroslavk@63.133.198.36) has joined #ceph
[21:25] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) has joined #ceph
[21:27] * loicd (~loic@z2-8.pokersource.info) Quit (Quit: Leaving.)
[21:28] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) Quit (Read error: Connection reset by peer)
[21:29] <sage> scuttlemonkey: no async replication yet
[21:35] * gregaf (~Adium@2607:f298:a:607:10ac:744:64fe:342b) Quit (Read error: Connection reset by peer)
[21:36] * gregaf (~Adium@2607:f298:a:607:10ac:744:64fe:342b) has joined #ceph
[21:37] * miroslavk (~miroslavk@63.133.198.36) Quit (Quit: Leaving.)
[21:39] * sagelap (~sage@76.89.177.113) Quit (Quit: Leaving.)
[21:40] <scuttlemonkey> sage: gotcha, definitely some interest in ceph + geo replication
[21:40] <scuttlemonkey> but all in good time I suppose :)
[21:42] <aaron> iggy: "no go"? for cephfs or for anything?
[21:43] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[21:46] <todin> sage: I would be interested in some sort of disaster backup for a ceph cluster, did you look at the btrfs send patch? That way you could send the snapshot to a different machine for long-term storage.
[21:46] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) has joined #ceph
[21:46] <sage> todin: yeah, the first geo-thing we do will be for DR. may or may not leverage btrfs send, though.
[21:47] * MikeMcClurg (~mike@cpc1-oxfd13-0-0-cust716.4-3.cable.virginmedia.com) Quit (Read error: Connection reset by peer)
[21:47] <todin> for DR? what does that mean?
[21:47] <sage> disaster recovery
[21:48] <todin> ahh, ok
[21:48] * pudgetta (bca79153@ircip1.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[21:48] <todin> sage: is there already a timeline for that?
[21:48] <sage> 6-9 months iirc?
[21:49] <sage> sometime in 2013
[21:49] <tziOm> sage, when would you expect fs to be anything near stable?
[21:49] * rweeks (~rweeks@c-98-234-186-68.hsd1.ca.comcast.net) has joined #ceph
[21:50] <sage> i'm hoping for q1 or q2 2013, but we'll see.
[21:51] <rweeks> the perils of trying new IRC clients - if anyone sent me anything in the past hour, I didn't see it
[21:52] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) Quit (Quit: Leaving)
[21:52] <todin> would DR work with this simple thing: if all osds took a btrfs snapshot at the same time, could you do a DR from those snapshots, or do you need more?
[22:00] * elder (~elder@c-71-195-31-37.hsd1.mn.comcast.net) has joined #ceph
[22:01] <tziOm> sage, do you know when you will start just focusing a little on cephfs, then?
[22:02] <tziOm> sage, as of now, with 3.6.1 and git ceph it is, as you know, extreme in both slowness and instability
[22:03] <tziOm> a simple case like mkdir -p {0..255}/{0..255} either crashes my kernel or takes forever ... not to speak of a rm -r of the same (if it did not crash)
[22:04] <sage> we are pivoting to fs stuff now
[22:04] <sage> i'm surprised you see that level of instability, though.. we've been running it through nightly qa for more than a year now and don't see that
[22:04] <sage> what kind of crash do you see?
[22:05] <tziOm> sage - a kernel crash, did not manage to capture the text, but ceph was mentioned a few times in the dump
[22:05] <sage> would be very interested in seeing the dump
[22:05] <tziOm> I can try reproduce tomorrow
[22:06] <tziOm> ..I am sure I will manage
[22:06] <tziOm> sage - if you have fs going, try the mkdir command above.. and then a time rm -rf *
[22:08] * chutzpah (~chutz@199.21.234.7) has joined #ceph
[22:19] * rino (~rino@12.250.146.102) has joined #ceph
[22:23] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) Quit (Remote host closed the connection)
[22:25] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:28] <tziOm> sage, how did it perform?
[22:29] * loicd (~loic@63.133.198.91) has joined #ceph
[22:30] <sage> sidetracked with flakiness of home dev box
[22:35] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[22:35] <phantomcircuit> ceph keeps track of checksums right?
[22:36] <phantomcircuit> yeah dumb question
[22:38] * loicd (~loic@63.133.198.91) Quit (Ping timeout: 480 seconds)
[22:39] * sagelap (~sage@76.89.177.113) has joined #ceph
[22:40] <dmick> phantomcircuit: I don't think Ceph itself does any checksumming of data blocks currently
[22:41] <phantomcircuit> dmick, no but btrfs does which is the suggested fs for osd
[22:41] * sjust (~sam@2607:f298:a:607:baac:6fff:fe83:5a02) has joined #ceph
[22:42] * Ryan_Lane (~Adium@63.133.198.91) has joined #ceph
[22:42] <dmick> I believe right now it's up to the filesystem, yes. Scrubbing verifies size and attributes but not contents
[22:43] <sage> dmick: contents are scrubbed now too
[22:43] <sage> but less frequently
[22:44] <dmick> ah
[22:45] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[22:45] <phantomcircuit> sage, you mean ceph or btrfs?
[22:45] <sage> ceph
[22:48] <Tv_> sage: but content scrubbing just means "we ask the disk for the bytes"; ceph itself stores no checksum of the bytes
[22:48] <sage> right.. it's comparing content across replicas.
[22:49] <Tv_> that is, content scrubbing gives btrfs an opportunity to realize the data is lost
[22:49] <sage> yeah
[22:49] <tziOm> sage, any docs to bring some clarity to the cross-use of host, ip address and ip-addr:port .. what is needed and so on
[22:49] <Tv_> oh is there an actual comparison across replicas?
[22:49] <sage> yeah
[22:49] <Tv_> sage: does it like snapshot it and compare those, or what?
[22:50] <sage> no, there's some tricky avoidance of objects under io, if that's what you mean.
[22:50] <sage> iirc in the end a checksum per object is compared
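
The scrubbing being discussed can also be triggered by hand; a sketch with a hypothetical PG id (deep scrubbing of contents is the newer of the two):

    # light scrub: compares object sizes and metadata across replicas
    ceph pg scrub 0.1f
    # deep scrub: reads and checksums object contents across replicas
    ceph pg deep-scrub 0.1f
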
[23:15] * tziOm (~bjornar@ti0099a340-dhcp0778.bb.online.no) Quit (Remote host closed the connection)
[23:21] * cowbell (~sean@adsl-70-231-145-136.dsl.snfc21.sbcglobal.net) has joined #ceph
[23:24] * tryggvil (~tryggvil@62-50-223-253.client.stsn.net) has joined #ceph
[23:25] * cowbell (~sean@adsl-70-231-145-136.dsl.snfc21.sbcglobal.net) Quit (Quit: cowbell)
[23:25] * loicd (~loic@12.180.144.3) has joined #ceph
[23:26] <phantomcircuit> lol
[23:26] <phantomcircuit> rbd mounted in qemu-kvm 6.5MB/s
[23:27] <phantomcircuit> ceph is in the same vm, with osd 1/2, mon a, mds a on separate virtio partitions in the same qemu-kvm instance
[23:28] <phantomcircuit> raw performance of btrfs partitions for osd
[23:28] <phantomcircuit> ~30MB/s
[23:28] <phantomcircuit> is that a typical performance hit for using ceph?
[23:30] <dmick> workload? kernel or userland rbd? if userland, cache on?
[23:35] <phantomcircuit> dmick, uh im not sure kernel vs userland
[23:35] <phantomcircuit> i just did
[23:35] <phantomcircuit> rbd create foo --size 2048
[23:35] <phantomcircuit> rbd map
[23:35] <dmick> kernel
[23:35] <phantomcircuit> mkfs -t ext4 /dev/rbd1
[23:35] <phantomcircuit> mount /dev/rbd1 /mnt/foo
[23:35] <phantomcircuit> dd if=/dev/zero of=/mnt/foo/test.bin bs=1M count=400 oflag=direct
[23:37] * loicd (~loic@12.180.144.3) Quit (Quit: Leaving.)
[23:40] * Ryan_Lane (~Adium@63.133.198.91) Quit (Quit: Leaving.)
[23:49] * loicd (~loic@63.133.198.91) has joined #ceph
[23:50] <phantomcircuit> dmick, thoughts?
[23:50] <dmick> sorry
[23:50] <dmick> um, that seems a little more of a hit than I'd expect, but the kernel driver does add extra complexity
[23:50] <dmick> cache won't matter for that workload, since it's all sequential write
[23:51] <dmick> just setting up a test vm with qemu-rbd now
[23:52] <elder> sage, brute force attempt to just map a clone the way a normal mapping is done seems to have somewhat worked.
[23:53] <elder> Kind of a big bang test, and the mapping completed. So works-in-principle prototype anyway seems OK. I have to clean it up quite a bit though.
[23:53] <sage> elder: because the rbd tool wouldn't behave with the libs?
[23:53] <nhm_> phantomcircuit: 2x replication?
[23:54] <elder> No, I mean I threw together a routine that went through the same steps that a normal "map" request would, populating a new rbd_dev structure with info from the original mapping's parent.
[23:54] <elder> Then I called the probe routine, and it seems to have completed what it had to do.
[23:54] <sage> elder: nice
[23:55] <sage> what do you think about the viability of this approach?
[23:55] * loicd (~loic@63.133.198.91) Quit (Quit: Leaving.)
[23:56] <phantomcircuit> nhm_, i didn't explicitly set up any replication at all, but maybe
[23:56] <elder> Hard to say but so far so good. I'll keep at it for now and will let you know if I hit a dead end.
[23:56] <gregaf> phantomcircuit: you're doing sync writes to disk that have to go across the network — that is a pretty typical performance hit but you should try doing async writes with a sync at the end, and doing writes in parallel
[23:57] <phantomcircuit> i dont even know how to see what the replication settings are
[23:57] <phantomcircuit> :/
[23:57] <nhm_> phantomcircuit: you might try larger IOs, or try buffered writes.
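
A sketch of nhm_'s suggestion next to the original direct-I/O test: let the page cache buffer the writes and only force them out once at the end (same hypothetical mount point as in the commands above):

    # original: synchronous, one 1M direct write at a time (worst case for rbd)
    dd if=/dev/zero of=/mnt/foo/test.bin bs=1M count=400 oflag=direct
    # buffered variant: writes overlap in flight, flushed once at the end
    dd if=/dev/zero of=/mnt/foo/test.bin bs=1M count=400 conv=fsync
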
[23:57] <dmick> ceph osd dump, look for "rep size"
[23:57] <dmick> 2 by dfault
[23:57] <dmick> so if you didn't set it, probably 2
[23:57] <phantomcircuit> nhm_, it's 1M blocks, i can't imagine going bigger will help
[23:58] <phantomcircuit> rep size 2
[23:58] <phantomcircuit> ah
[23:58] <phantomcircuit> so the performance hit is only minor
[23:58] <phantomcircuit> since it's doing twice the io
[23:58] <phantomcircuit> so instead of 30 > 6.5 it's more like 15 -> 6.5
[23:59] <dmick> phantomcircuit: yes, and both replicas must ack before write completes, so synchronous single-threaded is the worst case. right.
[23:59] <phantomcircuit> which is still fairly large but isn't ridiculous
[23:59] <phantomcircuit> ok that makes more sense
[23:59] <dmick> throw in "the network is in the way" and...
[23:59] <nhm_> phantomcircuit: With lustre I've seen direct writes scale up to 64MB+ sizes.
[23:59] <phantomcircuit> nhm_, wow really? that's craziness
[23:59] <gregaf> oh, sorry, missed the part where they were all on one box

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.