#ceph IRC Log


IRC Log for 2012-12-12

Timestamps are in GMT/BST.

[0:00] <joshd> sjust: left some comments
[0:00] <sjust> k
[0:00] * nwat (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[0:00] <yehudasa> dshea: you have mixed auth settings on the mds and on the monitor
[0:01] <yehudasa> are you running old mon with new mds?
[0:02] * cypher497 (~jay@ has joined #ceph
[0:04] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[0:06] <elder> dmick, are you around?
[0:07] <elder> Sandon just changed the root disk on plana49 (or something like that) and it now reports that I don't have the right key to connect when running teuthology. What's the fix?
[0:07] <elder> (WTF)
[0:08] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Read error: Operation timed out)
[0:12] * aliguori (~anthony@ Quit (Remote host closed the connection)
[0:12] * sagelap (~sage@2607:f298:a:607:8590:4da4:e99e:f80c) has joined #ceph
[0:13] <dmick> elder: yes
[0:14] <dmick> and is it complaining about host key?
[0:14] <elder> Yes.
[0:14] <dmick> or your key?
[0:14] <elder> BadHostKeyException: Host key for server plana49.front.sepia.ceph.com does not match!
[0:14] <dmick> ok, so host keys are updated with teuthology-updatekeys
[0:15] <elder> Kühl.
[0:15] <elder> Done
[0:15] <elder> (I think)
[0:15] <dmick> if you can connect now, it's done :)
[0:16] <elder> Well, um, no.
[0:18] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[0:18] * ebo^ (~ebo@ Quit (Ping timeout: 480 seconds)
[0:18] <elder> Better
[0:19] <elder> Now that I've updated the key in my yaml file.
[0:19] <elder> Thank you.
[0:23] <Psi-jack> Heh.
[0:23] <Psi-jack> Interesting OCF RA script. Just re-uses the init.d LSB-ish.. Script.
[0:23] <dshea> gregaf: yes, everything is v55 downloaded and installed this morning EST
[0:24] <dshea> yehudasa: shouldn't be, I wiped everything from the previous install, but let me verify
[0:25] <dshea> it is going to be a problem if I need to re-image all the nodes after each version. What do you guys suggest as the safest way to wipe old installs? I can do an apt-get remove and then try to re-install; I had tried an apt-get upgrade from stable to dev
[0:28] <dshea> ok, just did apt-get remove ceph && apt-get install ceph
[0:29] <via> has the syntax for ceph osd create changed?
[0:31] <via> ... just omitting the uuid worked
[0:33] * BManojlovic (~steki@242-174-222-85.adsl.verat.net) Quit (Quit: Ja odoh a vi sta 'ocete...)
[0:33] <Psi-jack> Heh, hmmm. this ceph.conf.sample that comes with ceph is.. Most interesting.
[0:33] <Psi-jack> if 'devs' is not specified, you're responsible for setting up the 'osd data' dir. if it is not btrfs, things will behave up until you try to recover from a crash (which usually fine for basic testing).
[0:34] <Psi-jack> This sounds like btrfs IS the recommended filesystem to use with Ceph.
[0:34] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Ping timeout: 480 seconds)
[0:35] <iggy> Psi-jack: ceph was written to use certain btrfs specific features at first.. support for other filesystems was added later
[0:35] <iggy> but I think xfs is actually recommended right now
[0:35] <Psi-jack> iggy: But are supported, generally-speaking? Gotcha. XFS is what I was going with.
[0:36] <gregaf> Psi-jack: hmm, that ceph.conf.sample sounds very old
[0:36] <Psi-jack> It's from the latest git checkout.
[0:36] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[0:37] <gregaf> where's it located?
[0:37] <gregaf> I can't find it in argonaut or the next branch
[0:38] <Psi-jack> Looking :)
[0:38] <gregaf> oh, you mean sample.ceph.conf
[0:38] <gregaf> I got it
[0:39] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) has joined #ceph
[0:39] * infernix (nix@cl-1404.ams-04.nl.sixxs.net) has joined #ceph
[0:40] <gregaf> yeah, that needs a bit of an overhaul
[0:40] <Psi-jack> jeje
[0:40] <gregaf> guess I can delete that one comment while I'm at it
[0:40] <Psi-jack> Yeah. src/sample.ceph.conf
[0:41] <infernix> so i finally got my 6 dev servers with 7 sas data disks on infiniband set up with 12.04
[0:41] <Psi-jack> gregaf: Delete it? But it's at least somewhat useful.. Just needs some... updates. ;)
[0:42] <infernix> should i follow this? http://www.sebastien-han.fr/blog/2012/10/02/introducing-ceph-deploy/
[0:42] <Psi-jack> Like.. log_to_syslog, which I may use, unless I can configure ceph to log to stdout.
[0:46] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) Quit (Read error: Connection reset by peer)
[0:46] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) has joined #ceph
[0:47] * l0nk (~alex@ Quit (Quit: Leaving.)
[0:49] <Psi-jack> Hmm, interesting..
[0:49] <gregaf> okay, somebody should check on wip-fix-sample-conf and make sure my changes are correct (one of sagewk, yehudasa, dmick maybe?)
[0:49] <Psi-jack> ./configure --prefix=/usr --sysconfdir=/etc installs sbin binaries into /sbin, not /usr/sbin
[0:49] <Psi-jack> Why is ceph's autoconfigure not honoring the prefix?
[0:50] <gregaf> you're probably missing one of the other necessary options
[0:50] <gregaf> I haven't tried to get this right, but apparently it's a pain in the butt
[0:50] <gregaf> glowell might know about it if he's around
[0:51] <Psi-jack> --prefix=/usr /should/ set the default prefix appropriately.
[0:51] <Psi-jack> And only other options, like --sbindir=/sbin would need to be used IF you wanted a --prefix=/usr, but sbin in /sbin instead of /usr/sbin
[0:55] <Psi-jack> Heh, I had just randomly noticed it because I was looking at the final Arch build I had, untarred it into a temporary fs, and saw: etc, sbin, usr, var
[0:56] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Ping timeout: 480 seconds)
[0:56] * jlogan2 (~Thunderbi@ has joined #ceph
[0:57] * sjustlaptop (~sam@ has joined #ceph
[1:00] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[1:00] * loicd (~loic@magenta.dachary.org) has joined #ceph
[1:01] * sjustlaptop1 (~sam@2607:f298:a:607:b0b4:db53:526f:cac7) has joined #ceph
[1:02] * jlogan1 (~Thunderbi@2600:c00:3010:1:14a3:ca45:3136:669a) Quit (Ping timeout: 480 seconds)
[1:06] * sjustlaptop (~sam@ Quit (Ping timeout: 480 seconds)
[1:09] * sjustlaptop1 (~sam@2607:f298:a:607:b0b4:db53:526f:cac7) Quit (Ping timeout: 480 seconds)
[1:14] * maxiz (~pfliu@ has joined #ceph
[1:15] <glowell> gregaf, Psi-jack: No advice to offer on the prefixes. It should follow autotools conventions. There was another question about this recently and I have it on my todo list to look into.
[1:16] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) Quit (Read error: Connection reset by peer)
[1:16] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) has joined #ceph
[1:32] <infernix> when running ceph-deploy osd HOST:DISK[:JOURNAL] do I supply it a mountpoint for an existing mounted disk, or a raw device (e.g /dev/sdX) that it will partition/mkfs/mount?
[1:33] * sagelap (~sage@2607:f298:a:607:8590:4da4:e99e:f80c) Quit (Ping timeout: 480 seconds)
[1:37] <infernix> n/m, device
[1:37] <infernix> neat
[1:38] * lxo (~aoliva@lxo.user.oftc.net) Quit (Ping timeout: 480 seconds)
[1:42] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) Quit (Remote host closed the connection)
[1:48] * lxo (~aoliva@lxo.user.oftc.net) has joined #ceph
[1:50] <infernix> ugh, stupid cciss device
[1:53] * sagelap (~sage@ has joined #ceph
[2:02] <lxo> sjust, thanks!
[2:03] <sjust> lxo: hmm?
[2:03] <infernix> ok so ceph-deploy fails on /dev/cciss/c0dX devices, because it assumes c0dX1 exists where it actually is c0dXp1
[2:04] * guigouz (~guigouz@ has joined #ceph
[2:05] * gohko (~gohko@natter.interq.or.jp) Quit (Read error: Connection reset by peer)
[2:05] <lxo> sjust, err, it should have been slang, for acking and filing a couple of bugs I mentioned here
[2:05] <sjust> ah
[2:05] <infernix> but i don't think ceph-deploy does the mkfs, does it?
[2:06] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[2:07] * gohko (~gohko@natter.interq.or.jp) has joined #ceph
[2:07] <infernix> as far as i can see, ceph-osd does this. is it hardcoded to look for ${device}1 as the first partition?
[2:08] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[2:10] <dmick> ceph-deploy uses ceph-disk-prepare, which I believe does all the partitioning/mkfs'ing
[2:10] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit (Quit: Leseb)
[2:11] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[2:13] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has left #ceph
[2:15] <infernix> dev = '{disk}1'.format(disk=disk)
[2:15] <infernix> there it is
[2:15] * infernix hacks it
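The kind of fix infernix describes can be sketched as a small helper. This is only an illustration, not the actual ceph-deploy or ceph-disk-prepare patch; the function name `first_partition` is hypothetical. It assumes the Linux naming convention in which a `p` separates the partition number whenever the disk name itself ends in a digit, as with /dev/cciss/c0d0.

```python
def first_partition(disk):
    """Return the device path of partition 1 on a whole-disk device."""
    # Hypothetical helper: disk names that end in a digit (cciss and
    # similarly named devices) need a 'p' before the partition number.
    if disk[-1].isdigit():
        return '{disk}p1'.format(disk=disk)
    return '{disk}1'.format(disk=disk)


print(first_partition('/dev/sdb'))         # /dev/sdb1
print(first_partition('/dev/cciss/c0d0'))  # /dev/cciss/c0d0p1
```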
[2:15] * dshea (~dshea@masamune.med.harvard.edu) Quit (Quit: Leaving)
[2:16] * jjgalvez (~jjgalvez@ Quit (Ping timeout: 480 seconds)
[2:16] <infernix> alright. now how do I test the performance of OSDs on one single node, locally?
[2:16] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[2:17] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[2:18] * infernix discovers rados bench
[2:19] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[2:19] * sagelap (~sage@227.sub-70-197-131.myvzw.com) has joined #ceph
[2:19] <dmick> there's also an osd bench IIRC
[2:20] <dmick> ceph osd tell <osdid> bench <bytes-per-write> <total-bytes>
[2:20] <infernix> [ 3] 3.0- 4.0 sec 1000 MBytes 8.39 Gbits/sec
[2:20] <infernix> network is speedy enough
[2:21] <dmick> but you have to look at the logs to see the results of that one
[2:21] <dmick> rados bench is for the cluster as a whole, of course
[2:23] <infernix> what do i use the default pools for, if anything?
[2:23] <dmick> rbd is the default for rbd images
[2:23] <dmick> data/metadata are the defaults for cephfs
[2:25] <infernix> 41 disks, 300mb/sec writes
[2:25] <infernix> hmm
[2:26] <infernix> 42 actually
[2:33] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[2:35] <nhm> infernix: have you looked at the write throughput articles at all yet?
[2:37] <infernix> nhm: no, just starting out :)
[2:37] <infernix> switched to deadline, created a pool
[2:38] <infernix> looking at iostat numbers
[2:38] <nhm> infernix: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[2:38] <nhm> http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2:38] <infernix> oh yes i've read your work
[2:38] <nhm> infernix: ah, ok. deadline may help some. Also make sure the pool you write to has a power-of-two number of PGs.
[2:39] <infernix> i'm on jbod with raid0 single disks, 2100 pgs
[2:39] <nhm> Oh, and you may need to run multiple concurrent copies of rados bench.
[2:39] <infernix> these are mere HP DL380s
[2:39] <infernix> client is a DL360
[2:39] <infernix> just looking where the bottlenecks are at
[2:40] <infernix> and also trying to compile for centos 5 which is part of the clients that need to talk to it :)
[2:40] <nhm> infernix: what distro/kernel and what version of ceph as well?
[2:40] <nhm> ah
[2:40] <nhm> that will be tough
[2:40] <infernix> current test client is debian 6, 2.6.32
[2:40] <nhm> what's on the servers?
[2:41] <infernix> servers are all 12.04, ceph from debian-testing, and the kernel you guys run
[2:41] <infernix> 3.6.3-ceph-00262-g575fee7
[2:41] <nhm> cool
[2:41] <infernix> one MON with 32GB
[2:41] <infernix> rest is about 6-12GB ram
[2:41] <infernix> various cpus
[2:41] <nhm> how much replication?
[2:41] <infernix> size 2
[2:41] <infernix> min_size 2 too
[2:41] <infernix> Must write data before running a read benchmark! o_O
[2:42] <nhm> ok. So in reality about 600MB/s to the disks. Still, you should be able to do better than that!
[2:42] <infernix> 42 disks, 10k sas, i'd say 21x40mb right?
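The arithmetic behind nhm's "about 600MB/s to the disks" can be written out as below. This is a rough sketch that only multiplies client throughput by the replica count and divides by the disk count; it deliberately ignores journal writes and any other overhead, and the function name is made up for the example.

```python
def per_disk_write_mb(client_mb_per_s, replicas, disks):
    """Average write load per disk, counting only replicated client data."""
    total_to_disks = client_mb_per_s * replicas
    return total_to_disks / float(disks)


# 300 MB/s from the client, pool size 2, 42 disks
print(300 * 2)                                  # 600 MB/s hitting the disks
print(round(per_disk_write_mb(300, 2, 42), 1))  # 14.3 MB/s per disk
```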
[2:43] * buck (~buck@bender.soe.ucsc.edu) has left #ceph
[2:43] <Psi-jack> glowell: Ahhh, gotcha, yeah.. It is a little weird, for sure.
[2:43] <nhm> infernix: how many servers?
[2:43] <infernix> oh, aha. rados bench eats all the cpu on the client
[2:43] <infernix> nhm: 6
[2:43] <nhm> infernix: yeah, try running multiple concurrent copies of rados bench.
[2:43] <nhm> infernix: possibly from several hosts at once.
[2:44] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[2:44] <nhm> infernix: I do 8.
[2:44] <infernix> 224+237
[2:44] <infernix> ah
[2:45] <infernix> but where's the limitation in a single rados instance then?
[2:45] * vata (~vata@CPE000024cdec46-CM0026f31f24dd.cpe.net.cable.rogers.com) has joined #ceph
[2:45] <infernix> 800MB reads
[2:45] <infernix> that's good
[2:45] <infernix> 1.2gbyte/s reads with 2 rados seq
[2:46] * infernix runs a drop_caches
[2:46] <infernix> this is looking much better already
[2:46] <nhm> infernix: for reads, if you do 2 at once you've gotta have them read from separate pools otherwise the 2nd,3rd... will all read from pagecache after the first one reads the data in.
[2:47] <nhm> infernix: the way I test is to have each rados bench instance write to and read from its own pool.
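One way to set up the per-instance pools nhm describes is to generate the command lines up front. The sketch below only builds the argument lists and does not run anything. The pool names (`bench0`, `bench1`, ...) and the default values are assumptions for illustration, and the exact `ceph` and `rados` options may differ between releases, so check them against the installed version before use.

```python
def bench_commands(instances, seconds=60, total_concurrency=256, pg_num=1024):
    """Build pool-creation and rados bench commands, one pool per instance."""
    per_instance = total_concurrency // instances
    create, bench = [], []
    for i in range(instances):
        pool = 'bench{0}'.format(i)  # hypothetical pool name
        create.append(['ceph', 'osd', 'pool', 'create', pool, str(pg_num)])
        bench.append(['rados', '-p', pool, 'bench', str(seconds), 'write',
                      '-t', str(per_instance)])
    return create, bench


create, bench = bench_commands(8)
for cmd in create + bench:
    print(' '.join(cmd))
```

Each bench command could then be launched concurrently, for example with `subprocess.Popen`, from one or several client hosts.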
[2:47] <infernix> good point
[2:48] <infernix> now the question becomes, how do I turn this into very fast sequential reads and writes with just a single thread doing the reads/writes
[2:49] <infernix> also this is all xfs i think
[2:50] <infernix> yep
[2:52] <infernix> 595+580 reads, thats about 9.4gbit
[2:52] <nhm> distributed storage systems tend to do better with aggregate throughput since you can use the concurrency to hide latency. Single client single thread throughput is a lot tougher.
[2:53] * jpieper (~josh@209-6-86-62.c3-0.smr-ubr2.sbo-smr.ma.cable.rcn.com) has joined #ceph
[2:53] <nhm> infernix: seems like you should be able to do a bit better with that many disks. How much client network throughput?
[2:53] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) Quit (Quit: wer)
[2:54] <infernix> does rados bench show that?
[2:54] * guigouz (~guigouz@ Quit (Quit: Textual IRC Client: http://www.textualapp.com/)
[2:55] * Ryan_Lane (~Adium@ Quit (Quit: Leaving.)
[2:56] <nhm> infernix: sorry, I meant how much throughput are your clients capable of (ie have you looked at iperf tests or anything?)
[2:56] <infernix> 510MB writes
[2:56] <infernix> nhm: all ~9gbit each
[2:56] <infernix> iperf
[2:57] <nhm> infernix: 1 client node or more?
[2:57] <infernix> 1 so far
[2:57] <nhm> ok, so for reads it sounds like you are maxing out the network?
[2:57] <infernix> 3 reads, 3x490mb/s e.g. 1470mb/sec
[2:58] <infernix> 11.7gbit
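The MB/s to Gbit/s conversions quoted in this exchange can be checked with a one-line helper. It assumes decimal units (1 Gbit = 1000 Mbit) and 8 bits per byte; the function name is made up for the example.

```python
def mb_per_s_to_gbit(mb_per_s):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_s * 8 / 1000.0


print(mb_per_s_to_gbit(3 * 490))    # 11.76, the "11.7gbit" figure
print(mb_per_s_to_gbit(595 + 580))  # 9.4, the earlier two-reader run
```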
[2:58] <infernix> still one client
[2:58] <nhm> that sounds too good to be true. :)
[2:58] <infernix> it's 40gbit infiniband
[2:58] <nhm> Ah!
[2:58] <infernix> but ipoib is not going to do the full 40
[2:58] <nhm> Yeah
[2:58] <nhm> rsocket might do a bit better.
[2:59] <infernix> i read about that, but haven't looked into it yet
[2:59] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[2:59] <nhm> Also the interrupt affinity stuff for ipoib.
[2:59] <infernix> same
[3:00] <infernix> but still, i'll be reading block devices sequentially
[3:00] <nhm> infernix: at least for reads the numbers are starting to look right.
[3:00] <infernix> i can *try* to code something that would start multiple read threads from the source block devices (e.g. cut the device up in X segments) and then write in parallel; same for reading but reversed
[3:01] <infernix> though i'd have to code that myself, i don't think that exists
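A minimal sketch of the segmented copy infernix has in mind, assuming a plain file or block device on both ends and Python 3 on a Unix host. It is not an existing Ceph tool: all function names are hypothetical, and a real version would write to RBD through librbd rather than to a local path.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def segments(total_size, count):
    """Split [0, total_size) into up to `count` contiguous (offset, length) ranges."""
    base, extra = divmod(total_size, count)
    out, offset = [], 0
    for i in range(count):
        length = base + (1 if i < extra else 0)
        if length:
            out.append((offset, length))
        offset += length
    return out


def copy_segment(src_fd, dst_fd, offset, length, bufsize=4 * 1024 * 1024):
    """Copy one range with positional I/O so threads share no file offset."""
    end = offset + length
    while offset < end:
        buf = os.pread(src_fd, min(bufsize, end - offset), offset)
        if not buf:
            break  # unexpected end of source
        done = 0
        while done < len(buf):  # handle short writes
            done += os.pwrite(dst_fd, buf[done:], offset + done)
        offset += len(buf)


def parallel_copy(src_path, dst_path, workers=4):
    """Copy src to dst using one thread per segment; returns bytes copied."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # lseek to the end also works for block devices, unlike stat's st_size
        size = os.lseek(src_fd, 0, os.SEEK_END)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(copy_segment, src_fd, dst_fd, off, length)
                       for off, length in segments(size, workers)]
            for f in futures:
                f.result()  # re-raise any worker exception
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return size
```

Restoring would use the same function with source and destination swapped.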
[3:01] <nhm> infernix: If you can, try launching like 8 concurrent rados bench reads.
[3:01] <infernix> 1.4gbyte/s after a drop cache
[3:01] <infernix> i'll create some more pools then
[3:01] <nhm> cool
[3:02] <infernix> still all on one client; i can add 2 IB cards in one
[3:02] <infernix> iirc that increases ipoib performance
[3:02] <nhm> infernix: I get about 800MB/s reads with 8 disks in 1 server.
[3:03] <nhm> with 256 concurrent operations across 8 rados bench instances.
[3:03] <infernix> yeah but as mentioned these are old HPs
[3:03] <infernix> mmh ok
[3:03] <nhm> 4MB reads
[3:03] <infernix> but i'm already seeing 1.4gbyte/s reads
[3:03] * nwat (~Adium@soenat3.cse.ucsc.edu) has joined #ceph
[3:03] <nhm> yeah, you just have way more disks and 6 servers. ;)
[3:03] <infernix> i am not sure if i can get more without rsockets
[3:04] <nhm> infernix: target is 2.5GB/s reads?
[3:04] <infernix> ipoib might be limiting here
[3:04] <infernix> yes
[3:04] * nwat (~Adium@soenat3.cse.ucsc.edu) has left #ceph
[3:06] <Psi-jack> What would people say is a good osd pg bits, and/or osd pgp bits, for a 3-node 3 disk-per-node ceph cluster?
[3:07] * infernix looks over the ipoib tuning guide
[3:10] <nhm> Psi-jack: I tend to recommend between 200PGs per OSD, but use a power of 2 number. So if you have 9 OSDs, 2048 PGs would be fine.
[3:10] <nhm> sorry, between 100-200PGs per OSD.
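nhm's rule of thumb (100 to 200 PGs per OSD, rounded to a power of two) can be turned into a quick calculation. This is a sketch of that rule only, with a made-up function name; it does not account for replication or the number of pools.

```python
def pg_candidates(num_osds, low=100, high=200):
    """Powers of two that fall within [low, high] PGs per OSD cluster-wide."""
    lo_total, hi_total = num_osds * low, num_osds * high
    candidates = []
    n = 1
    while n <= hi_total:
        if n >= lo_total:
            candidates.append(n)
        n *= 2
    return candidates


print(pg_candidates(9))   # [1024]
print(pg_candidates(42))  # [8192]
```

For 9 OSDs the range is 900 to 1800 PGs, which contains 1024; the 2048 suggested above is the next power of two up, at roughly 228 PGs per OSD.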
[3:11] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[3:11] <Psi-jack> nhm: Okay. I thought aiming around there was the idea. I remember talking a bit about it back a while back when I first started testing ceph.
[3:12] <Psi-jack> Before mkcephfs stuff, I can just set, osd default pool size = 2048, and it'll be done right from the start?
[3:12] * sagelap (~sage@227.sub-70-197-131.myvzw.com) Quit (Ping timeout: 480 seconds)
[3:12] <nhm> Psi-jack: hrm, not sure about that. I always create my own pools after mkcephfs.
[3:12] <Psi-jack> Heh yeah.
[3:13] <nhm> Psi-jack: there's been a lot of work done though and I'm honestly behind on what the newest innovations are. ;)
[3:13] <Psi-jack> hehe
[3:13] <Psi-jack> Hmmm, wait, would be osd default pg num, since that number relates to the number of pg_num?
[3:14] * vata (~vata@CPE000024cdec46-CM0026f31f24dd.cpe.net.cable.rogers.com) Quit (Read error: Connection reset by peer)
[3:14] <nhm> Probably best to try it and check your pools. :D
[3:14] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[3:14] * vata (~vata@CPE000024cdec46-CM0026f31f24dd.cpe.net.cable.rogers.com) has joined #ceph
[3:15] <Psi-jack> Yeaaah, osd pool default size is the size of an OSD pool in gigabytes.
[3:15] <Psi-jack> nhm: Looking at this: http://ceph.com/docs/master/rados/configuration/osd-config-ref/?highlight=osd%20pool%20size
[3:15] <Psi-jack> heh, guess I didn't need the highlight part, but oh well. ;)
[3:17] <Psi-jack> Okay, set that up. :)
[3:18] <infernix> 6x rados bench write on one client, 6x~95mb/s. 6x seq, 6x ~250mb/s. e.g. 1.5gbyte/s reads
[3:19] <infernix> need to add a client to test
[3:22] <nhm> Interesting. So write performance is kind of crappy, but it theoretically does the 400MB/s you need at least?
[3:22] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[3:24] <infernix> yeah thats fine basically
[3:24] <infernix> and i expect sandy bridge to outperform this stuff anyways
[3:24] <infernix> plus 12 disks per node instead of 7
[3:24] <infernix> but i need to do more testing
[3:25] <infernix> bypassing tcp would be a big boost
[3:25] <nhm> infernix: There's a lot of people interested in a native RDMA implementation of the messenger code.
[3:26] <infernix> or i could try libsdp preloading
[3:27] <infernix> nhm: i would have to hire folks, i am not the one to write that :)
[3:29] <nhm> infernix: you don't need to be. :)
[3:30] <infernix> hm, ib_sdp isn't even built
[3:30] <infernix> nhm: i wouldn't know who to hire
[3:31] <nhm> infernix: there are multiple other parties that are interested in writing it.
[3:32] <infernix> but i don't know who actually can
[3:33] <Psi-jack> Is it possible yet to convert a sparse qcow2 disk image to a sparse rbd image?
[3:33] <nhm> infernix: I can't say much, but I think any of the groups who are looking at doing it would be able to do so.
[3:33] <infernix> and in turn i don't know if it would help our case
[3:33] <infernix> i need to first further test our use case
[3:34] <nhm> infernix: yeah, not sure there. Lower latencies and higher throughput probably wouldn't hurt though. :D
[3:38] <infernix> there are other approaches to my problem, i can try to do backups and restores in parallel
[3:38] <infernix> but that requires some retooling
[3:38] <infernix> i'm sure though that if we do put it in production i'll take a look at better use of IB
[3:39] <infernix> and then there's the centos 5 support
[3:39] <infernix> :)
[3:39] <via> i built some packages with the current contents of -next, and i'm getting an osd crash like this: https://pastee.org/7suhv
[3:39] <via> i can revert, but i'm wondering if thats something someone should look at before next gets released
[3:40] <via> the error is fully repeatable
[3:42] <nhm> via: Mind submitting a bug? Any additional info you can submit about what triggers it would be helpful, especially if you can make it happen with debugging turned up.
[3:42] <via> rgr
[3:43] <via> osd = 20 didn't add much
[3:43] <via> any other settings?
[3:43] <via> journal = 20 i'll try
[3:44] <nhm> yeah filestore too.
[3:44] <dmick> Psi-jack: not 100% sure about the whole process, but I implemented a somewhat sparse import-from-stdin
[3:44] <Psi-jack> dmick: Oh?
[3:44] <dmick> if you can get a data source that generates image-order-sized (4MB by default) runs of zeros, rbd import from stdin will seek past them
[3:45] <Psi-jack> Hmmm
[3:45] <dmick> (and if you can't, you might be able to pipe it through dd with bs=4M iflag=fullblock)
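The idea dmick describes, skipping whole runs of zeros instead of writing them, can be illustrated in a few lines. This is not the rbd import code, just a sketch of the technique against ordinary file objects; `sparse_copy` is a hypothetical name, and it assumes the source returns full chunks (which is what `iflag=fullblock` is for when reading from a pipe).

```python
import io
import tempfile

CHUNK = 4 * 1024 * 1024  # 4MB, the default image object size mentioned above


def sparse_copy(src, dst, chunk=CHUNK):
    """Copy src to dst, seeking past all-zero chunks instead of writing them.

    Returns the number of chunks actually written."""
    written = 0
    offset = 0
    while True:
        buf = src.read(chunk)
        if not buf:
            break
        if buf.count(0) != len(buf):  # chunk holds some non-zero data
            dst.seek(offset)
            dst.write(buf)
            written += 1
        offset += len(buf)
    dst.truncate(offset)  # keep the full length even if the tail was zeros
    return written


# Tiny demonstration with a 4-byte chunk size instead of 4MB
data = b'a' * 4 + b'\0' * 8 + b'b' * 4 + b'\0' * 4
with tempfile.TemporaryFile() as out:
    print(sparse_copy(io.BytesIO(data), out, chunk=4))  # 2
    out.seek(0)
    print(out.read() == data)                           # True
```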
[3:46] <Psi-jack> Almost sounds like it might just be easier to have a VM gate the VM disks through. Have a VM with its own disk, with the qcow2 I want to convert, and rbd attached, and just rsync them from point A to point B.
[3:47] <mikedawson> dmick: Did you get my leveldb dump? Any clues?
[3:47] <Psi-jack> chroot in and re-install grub, in the rbd volume, and detach and put into actual vm I want it to use. ;)
[3:47] <dmick> mikedawson: we did, and sjust was on your problem; I only supplied the dump tool
[3:47] <Psi-jack> dmick: We spoke about qemu-img and -S before, and got put somewhat into an issue: http://tracker.newdream.net/issues/3499 But it's made 0% progress it seems. ;)
[3:48] <mikedawson> dmick: gotcha. thanks.
[3:49] <dmick> Psi-jack: yes, but I don't remember all the tools. will qemu-img dump to stdout? If so, this should be a workaround
[3:50] <dmick> if it won't directly, will it write to /dev/stdout?
[3:52] <Psi-jack> Hmmmm
[3:52] <dmick> hum:
[3:53] <dmick> open("/dev/stdout", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 9
[3:53] <dmick> ftruncate(9, 4194304) = -1 EINVAL (Invalid argument)
[3:53] <dmick> except for the ftruncate...
[3:55] <Psi-jack> Yeah, doesn't look like qemu-img itself supports stdout.
[3:56] <Psi-jack> Lotta complaints about that actually. LOL
[3:57] <Psi-jack> So yeah. Back to my former idea.. Likely easier to just setup a VM disk conversion VM? ;)
[3:57] <dmick> how hard can it be to hack qemu-img?
[3:58] <Psi-jack> Well, for me, pretty hard. I don't do C.
[3:58] <dmick> heh
[3:58] <dmick> but I mean..
[3:58] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[4:00] <Psi-jack> Heh.
[4:00] <Psi-jack> My silly method will work, as long as the rbd pool manager will allow me to rename vm disks around. ;)
[4:00] <infernix> meh
[4:00] <dmick> ok, well, pretty hard, as it turns out :)
[4:01] <infernix> i just killed the server with dd if=rbd iflag=direct
[4:01] <infernix> completely hung, no panic or crash, just reset itself
[4:01] <dmick> infernix: this was a VM with a mapped kernel rbd?
[4:01] <dmick> or?
[4:01] <infernix> this was on a physical box
[4:02] <infernix> mapped kernel rbd yes
[4:02] <dmick> well that's....rude
[4:02] <Psi-jack> Cool, rbd mv [src] [dst] -- So yeah. A little PITA, but it'll work. :)
[4:02] <nhm> infernix: just the client? Everything else is healthy?
[4:03] <infernix> nhm: the client is the gateway to my dev setup, so i guess :)
[4:03] <infernix> can't tell right now
[4:03] <Psi-jack> Though, hmm, odd.. rbd mv doesn't take a pool name?
[4:03] <infernix> 3.2 kernel
[4:05] <infernix> can i create an object in userspace and later use that object as an rbd device?
[4:06] <Psi-jack> Ahh, rbd -p <pool> mv [src] [dst]
[4:06] <Psi-jack> Heh
[4:06] <nhm> infernix: ooh, I think there have been a lot of fixes for kernel rbd since then.
[4:06] <nhm> infernix: I think it's recommended to run 3.6+
[4:07] * tryggvil (~tryggvil@d51A4D5E2.access.telenet.be) Quit (Quit: tryggvil)
[4:07] <infernix> "format 2 - Use the second rbd format, which is supported by librbd"
[4:07] <infernix> there we go
[4:07] * cypher497 (~jay@ Quit ()
[4:07] <Psi-jack> nhm: Hehe. Hence.. Why my ceph cluster is running Arch. Generally has a pretty current kernel version. ;)
[4:07] <mgalkiewicz> is there anything except authx which should be necessary to take care of when upgrading from 0.52 to 0.55? I have upgraded 1 out of 3 osds and some clients started to crash
[4:07] <infernix> i don't really want to map rbd devices to host kernels, but will want to create them from userspace and have a VM attach to them. so that would work.
[4:07] <nhm> Psi-jack: yeah. We at least package up kernels for ubuntu for people.
[4:08] <Psi-jack> hehe
[4:08] <nhm> infernix: speaking of which, you may want to download a kernel from our gitbuilder site.
[4:08] <mgalkiewicz> I wasnt able to map rbd volume because of "add failed: (1) Operation not permitted"
[4:08] <mgalkiewicz> I am using kernel rbd
[4:08] <nhm> oh wait, your client is debian right? We've only got them packaged for ubuntu.
[4:08] <Psi-jack> nhm: Well, either way. I'm liking the way Arch has gone. 100% systemd is extremely nice. Will be able to boot up each of my ceph storage servers in mere seconds.
[4:09] <infernix> nhm: i'm running 3.6.3-ceph-00262-g575fee7 on the ceph nodes
[4:09] <nhm> Psi-jack: nice. I haven't played with arch much
[4:09] <Psi-jack> nhm: by mere seconds, I mean about 5 seconds. ;)
[4:09] <nhm> nice!
[4:09] <infernix> i'll build 3.6.8 on the debian box later but i'm not really planning to do much in kernel space since i need to work with centos5 too
[4:09] <infernix> will need to look at librdbpy
[4:09] <Psi-jack> nhm: Except for the Dell PowerEdge, which Dell servers take FOREVER to boot. That one'll take about 1 minute.
[4:09] <nhm> infernix: kernel rbd doesn't have caching and some other nice stuff anyway.
[4:10] <infernix> and read/write to/from it
[4:10] <infernix> will need to write a tool that does parallel read/writes to rbd devices in userspace anyway so thats what i'll try, after i find out if i can get it to build on centos 5
[4:11] <nhm> infernix: you could always use librados directly too...
[4:11] <Psi-jack> URGGGGH! I can't wait... Friday, FRIDAY, I start my cephfs conversion project with full hardware ready.
[4:11] <dmick> centos5! yeesh
[4:11] <nhm> dmick: yeah, that sounds painful. ;)
[4:11] <infernix> nhm: but can i create rbd devices with librados?
[4:11] <nhm> infernix: no offense. ;)
[4:12] <nhm> infernix: no, you'd have to talk to it directly from the app.
[4:12] <dmick> you can create rbd devices from librbd
[4:12] <infernix> and read and write to it. so librbd it is
[4:12] <dmick> c, c++, python, take your choice.
[4:12] * nhm bows to dmick's superior rbd knowledge
[4:13] <dmick> keep in mind that the kernel doesn't support format 2 images yet
[4:13] <elder> Sure does!
[4:13] <infernix> i'll only ever use it in userspace, and with libguestfs
[4:13] <elder> But it doesn't support all the format 2 features yet.
[4:13] <elder> :)
[4:13] <infernix> but first some shuteye
[4:13] <infernix> thanks for the pointers :)
[4:19] * nazarianin (~kvirc@mg01.apsenergia.ru) has joined #ceph
[4:20] * chutzpah (~chutz@ Quit (Quit: Leaving)
[4:28] * jlogan2 (~Thunderbi@ Quit (Ping timeout: 480 seconds)
[4:36] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[4:36] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[4:41] * Cube (~Cube@ Quit (Quit: Leaving.)
[4:53] * yasu` (~yasu`@dhcp-59-224.cse.ucsc.edu) Quit (Remote host closed the connection)
[4:54] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[4:56] * sagelap (~sage@ has joined #ceph
[5:00] <mikedawson> 2012-12-11 22:58:19.853180 7f1b78aad700 20 osd.14 2420 scrub_should_schedule loadavg 2.01 >= max 0.5 = no, load too high
[5:07] * PerlStalker (~PerlStalk@ Quit (Remote host closed the connection)
[5:08] <lurbs> You can change the threshold with 'osd scrub load threshold'.
[5:09] <lurbs> http://ceph.com/docs/master/rados/configuration/osd-config-ref/
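The log line mikedawson pasted reads as a simple comparison between the system load average and the configured maximum. The sketch below mirrors only that comparison as it appears in the log; the real OSD scheduling logic may weigh other conditions, and the function here is an illustration rather than Ceph's code.

```python
def scrub_should_schedule(loadavg, max_load):
    """Return True when the load average is below the scrub load threshold."""
    # Mirrors "loadavg 2.01 >= max 0.5 = no, load too high" from the log
    return loadavg < max_load


print(scrub_should_schedule(2.01, 0.5))  # False: load too high
print(scrub_should_schedule(2.01, 3.0))  # True: allowed after raising the threshold
```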
[5:09] * maxiz (~pfliu@ Quit (Read error: Connection reset by peer)
[5:10] * PerlStalker (~PerlStalk@ has joined #ceph
[5:11] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[5:15] <mikedawson> lurbs: set it to 3 and now I'm seeing scrubbing on osd.14. CPU remains high in top at ~95%
[5:15] * maxiz (~pfliu@ has joined #ceph
[5:21] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[5:29] <infernix> http://pastebin.ca/2291662 - this seems incorrect
[5:29] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Ping timeout: 480 seconds)
[5:29] <infernix> isn't that endif supposed to be below ::sync_file_range
[5:30] <infernix> libos.a(libos_a-FileStore.o): In function `FileStore::_write(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list const&)':
[5:30] <infernix> /root/ceph/src/os/FileStore.cc:2883: undefined reference to `sync_file_range'
[5:30] * infernix curses centos 5
[5:33] <dmick> It actually kinda seems like that true should be false, but
[5:33] <dmick> it's a strange code block
[5:34] <dmick> but no, that's an ||, so that's not right either
[5:36] <dmick> infernix: if it's causing you problems, you can set filestore sync flush = false in the ceph.conf
[5:36] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Quit: ChatZilla 0.9.89 [Firefox 17.0.1/20121128204232])
[5:36] <dmick> but yeah, I agree that code does not look like it's doing what it intends
[5:37] <dmick> I'll file an issue
[5:39] <dmick> http://tracker.newdream.net/issues/3607 infernix
[5:53] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[5:55] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[6:17] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[6:18] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) Quit ()
[6:36] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[6:36] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:03] * deepsa_ (~deepsa@ has joined #ceph
[7:04] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[7:04] * deepsa_ is now known as deepsa
[7:06] * deepsa (~deepsa@ Quit ()
[7:06] * deepsa (~deepsa@ has joined #ceph
[7:10] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[7:14] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[7:14] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:20] * gregaf1 (~Adium@2607:f298:a:607:918d:d4e3:2387:5a6e) has joined #ceph
[7:21] * vata (~vata@CPE000024cdec46-CM0026f31f24dd.cpe.net.cable.rogers.com) Quit (Quit: Leaving.)
[7:24] * kbad_ (~kbad@malicious.dreamhost.com) has joined #ceph
[7:25] * ebo^ (~ebo@ has joined #ceph
[7:25] * gregaf (~Adium@2607:f298:a:607:a9c1:25b7:a54e:1cfe) Quit (Ping timeout: 480 seconds)
[7:26] * kbad (~kbad@malicious.dreamhost.com) Quit (Ping timeout: 480 seconds)
[7:38] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[7:39] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[7:41] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[7:57] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[7:58] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[8:07] * IceGuest_75 (~IceChat7@buerogw01.ispgateway.de) has joined #ceph
[8:07] <IceGuest_75> good morning #ceph
[8:07] <IceGuest_75> nick /norbi
[8:07] * IceGuest_75 is now known as norbi
[8:24] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[8:24] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[8:25] * Morg (d4438402@ircip2.mibbit.com) has joined #ceph
[8:27] <norbi> are there any infos how to delete single PGs without unfound objects ?
[8:29] * nosebleedkt (~kostas@ has joined #ceph
[8:30] <nosebleedkt> goodmorning everyone :D
[8:30] <nosebleedkt> goodmorning joao
[8:30] <norbi> morning :)
[8:30] <Morg> mornin'
[8:32] * low (~low@ has joined #ceph
[8:34] <norbi> no infos here? i dont think that ceph is stable at the moment
[8:37] <nosebleedkt> norbi, nobody thinks :D
[8:37] * nazarianin (~kvirc@mg01.apsenergia.ru) Quit (Read error: Connection reset by peer)
[8:39] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Read error: Operation timed out)
[8:40] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[8:44] * deepsa (~deepsa@ Quit (Ping timeout: 480 seconds)
[8:45] <ebo^> why would you want to delete a single pg?
[8:46] <norbi> i have many PGs in state active+remapped, but after 14h they still have not been remapped ...
[8:47] <norbi> have found the command "ceph pg force_create_pg...."
[8:47] <norbi> but now i have two PGs with the same number :)
[8:47] <norbi> pg 2.45 is stuck inactive since forever, current state creating, last acting [1,2]
[8:47] <norbi> pg 2.45 is stuck unclean since forever, current state creating, last acting [1,2]
[8:50] <ebo^> force :-)
[8:53] * nazarianin (~kvirc@mg01.apsenergia.ru) has joined #ceph
[8:58] <ebo^> if the code works as it looks like it should, this should not have happened at all
[9:01] <ebo^> try restarting everything? :-)
[9:03] <norbi> lol
[9:04] <ebo^> i'm serious ;-p
[9:05] <norbi> i'm running a test about the performance :) after that i will try the restart, but a restart can't be the answer. what if that will be a live system ? :)
[9:10] * ebo^ (~ebo@ Quit (Remote host closed the connection)
[9:14] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[9:15] * BManojlovic (~steki@ has joined #ceph
[9:18] * ghbizness (~ghbizness@host-208-68-233-254.biznesshosting.net) Quit ()
[9:21] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) has joined #ceph
[9:38] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[9:44] * yoshi (~yoshi@ has joined #ceph
[9:46] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Quit: Leaving.)
[9:47] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) has joined #ceph
[9:56] * Leseb (~Leseb@ has joined #ceph
[9:57] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Remote host closed the connection)
[9:58] * Ryan_Lane (~Adium@c-67-160-217-184.hsd1.ca.comcast.net) Quit (Ping timeout: 480 seconds)
[10:01] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[10:05] * loicd (~loic@magenta.dachary.org) has joined #ceph
[10:21] * verwilst (~verwilst@d5152FEFB.static.telenet.be) has joined #ceph
[10:25] * scuttlemonkey_ (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) Quit (Quit: This computer has gone to sleep)
[10:32] * LeaChim (~LeaChim@5ad684ae.bb.sky.com) has joined #ceph
[10:36] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[10:50] * ebo^ (~ebo@icg1104.icg.kfa-juelich.de) has joined #ceph
[10:54] <jtang> good morning
[10:58] <gucki> good morning
[10:58] <nosebleedkt> yo
[10:58] <nosebleedkt> is there any way to manually remap a PG?
[10:59] * match (~mrichar1@pcw3047.see.ed.ac.uk) has joined #ceph
[11:00] <gucki> any idea why rbd_remove from librbd returns -16 for non-existent images, while the command line tool (rbd rm ...) correctly returns an errno of -2? :(
[11:04] <gucki> bug report is here now: http://tracker.newdream.net/issues/3608
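For readers decoding the two return codes gucki mentions: a minimal sketch, assuming Linux errno numbering, that maps a negative errno-style return value to its symbolic name. The helper name `describe` is only illustrative and is not part of librbd.

```python
import errno
import os

def describe(rc):
    """Map a negative errno-style return code to (symbolic name, message)."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

for rc in (-16, -2):
    name, message = describe(rc)
    print(rc, name, message)
```

On Linux, 16 is EBUSY and 2 is ENOENT, which is why -2 is the expected answer for an image that does not exist.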
[11:20] <nosebleedkt> norbi, active+remapped PGs in Ceph 0.55 sucks
[11:21] <norbi> hehe
[11:21] <norbi> i can help :D
[11:21] <norbi> http://ceph.com/docs/master/rados/operations/crush-map/#impact-of-legacy-values
[11:21] <norbi> read this :)
[11:21] <norbi> ceph is here now in status HEALTH_OK :)
[11:22] <norbi> i haven't found any command to manually remap a PG or to force a remap
[11:23] <Morg> norbi: so, basically you mark the osd as out and it should get back to normal?
[11:23] <norbi> the last big problem now is, the ceph command odyssey :)
[11:23] <norbi> no morg
[11:24] <norbi> i changed the crushmap with "--set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50"
[11:24] <norbi> then insert the map, ceph is then remapping PGs
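The round trip norbi describes (extract the map, adjust the tunables, insert it again) can be sketched as three commands. This is a hedged sketch: the three `--set-choose-*` flags are quoted from norbi, while the surrounding `ceph osd getcrushmap` / `ceph osd setcrushmap` steps and the `/tmp` file paths are assumptions based on the CRUSH tunables documentation he links. The code only builds and prints the command lines; it does not run them.

```python
import shlex

def tunables_commands(old_map="/tmp/crush", new_map="/tmp/crush.new"):
    """Return the argv lists for extract -> adjust tunables -> inject."""
    return [
        # Assumed step: dump the current compiled CRUSH map to a file.
        ["ceph", "osd", "getcrushmap", "-o", old_map],
        # Flags as quoted in the log; -i/-o file handling is an assumption.
        ["crushtool", "-i", old_map,
         "--set-choose-local-tries", "0",
         "--set-choose-local-fallback-tries", "0",
         "--set-choose-total-tries", "50",
         "-o", new_map],
        # Assumed step: load the adjusted map back into the cluster.
        ["ceph", "osd", "setcrushmap", "-i", new_map],
    ]

for argv in tunables_commands():
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Injecting a new map triggers data movement, so it is worth trying on a test cluster first.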
[11:24] <Morg> mhm
[11:27] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) Quit (Remote host closed the connection)
[11:27] * silversurfer (~silversur@124x35x68x250.ap124.ftth.ucom.ne.jp) has joined #ceph
[11:29] <nosebleedkt> oh
[11:29] <nosebleedkt> norbi, thats cool
[11:29] <nosebleedkt> gonna read and try it
[11:29] <norbi> believe me, it helps :) ceph is running fine now
[11:29] <norbi> have written to the mailinglist :)
[11:30] <norbi> written = wrote :) bad english :)
[11:37] * maxiz (~pfliu@ Quit (Ping timeout: 480 seconds)
[11:39] <nosebleedkt> :P
[11:40] <nosebleedkt> norbi,
[11:40] <nosebleedkt> did you recently upgrade from 0.48 to 0.55 ?
[11:40] <nosebleedkt> before seeing that issue ?
[11:40] <norbi> yes
[11:41] <norbi> but with 0.48 i have had only 2 OSDs :)
[11:41] <norbi> and then i added OSD nr 3 with 0.55
[11:45] <nosebleedkt> like me!
[11:45] <nosebleedkt> exactly
[11:46] <nosebleedkt> so that problem is due to the upgrading?
[11:46] <nosebleedkt> so it might be an upgrading issue only.
[11:47] <norbi> hm no
[11:47] <nosebleedkt> is it a bug then ?
[11:47] <norbi> but if u have only 2 OSDs with setup of 2 replicas
[11:47] <norbi> then u cant hit this issue :)
[11:47] <nosebleedkt> yes
[11:47] <norbi> the docu says
[11:47] <norbi> For hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with "host" nodes with a small number (1-3) of OSDs nested beneath each one.
[11:48] <nosebleedkt> lol
[11:48] <nosebleedkt> so this is how ceph worsk
[11:48] <nosebleedkt> works*
[11:48] <norbi> it seems so
[11:49] <norbi> so just add many many many OSDs :)
[11:49] <nosebleedkt> so instead of doing that new crushmap thing, we just should increase the number of OSDS
[11:49] <joao> I'm not a native english speaker, so take this with a grain of salt, but I believe that
[11:50] <joao> this was properly written
[11:50] <joao> 8<norbi> have written to the mailinglist :)
[11:50] <nosebleedkt> joao, norbi's idea solved my issue too
[11:50] <joao> great
[11:50] <nosebleedkt> so its how ceph works
[11:50] <joao> specially because I forgot to talk to sam last night
[11:51] <joao> and just realized that now
[11:51] <nosebleedkt> when having a small number of OSDs
[11:51] <joao> (shame!)
[11:51] <nosebleedkt> will try to have 6 OSDs
[11:51] <norbi> if i understand the documentation correctly, more OSDs should solve the problem too
[11:59] <nosebleedkt> norbi, i just tried to shutdown the 2nd node... im waiting to see what will happen now
[12:00] <nosebleedkt> and yes it worked
[12:01] * jlogan (~Thunderbi@2600:c00:3010:1:1939:8648:1731:e2d5) has joined #ceph
[12:05] <Morg> joao: i wasn't able to reproduce yesterday's ceph crash, so i got no logs :/
[12:05] <Morg> maybe it was one time glitch or smth
[12:05] <joao> that's a bummer :\
[12:06] <joao> yeah... no :p
[12:06] <joao> in my experience, there's no such thing as a one time glitch
[12:06] <Morg> o rly? :D
[12:06] <joao> yeah, there's just annoying bugs :)
[12:06] <Morg> *feel the irony* ;]
[12:06] <joao> the kind that will pop up once in a while, at the least convenient time
[12:07] <joao> and when you're running whatever with minimum debug levels
[12:07] <Morg> i will try it again, but atm im waiting to get my cluster to 0% degraded state
[12:07] <joao> cool
[12:07] <joao> thanks
[12:08] <Morg> and it will take a while, 67% now
[12:08] <Morg> after 15min
[12:15] <Morg> aaand its stuck at 576 active+degraded 66.616%
[12:18] <nosebleedkt> cool, I restored the legacy crushmap and added a 6th OSD.
[12:18] <nosebleedkt> now it works ok norbi
[12:18] <nosebleedkt> so with repfactor=2 & osd=6, that problem goes away
[12:20] * jlogan (~Thunderbi@2600:c00:3010:1:1939:8648:1731:e2d5) Quit (Quit: jlogan)
[12:21] <ebo^> what do i have to change on the default crush map to get all replicas on different nodes?
[12:27] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[12:36] * joao (~JL@89-181-151-182.net.novis.pt) Quit (Remote host closed the connection)
[12:49] <norbi> @ebo if u have all osds in the crushmap on the same host, you must split them across different hosts
[12:49] <cephalobot`> norbi: Error: "ebo" is not a valid command.
[12:55] * loicd (~loic@magenta.dachary.org) Quit (Quit: Leaving.)
[12:58] * nazarianin (~kvirc@mg01.apsenergia.ru) Quit (Quit: KVIrc 4.0.4 Insomnia http://www.kvirc.net/)
[13:23] * Leseb (~Leseb@ Quit (Quit: Leseb)
[13:24] * loicd (~loic@ has joined #ceph
[13:27] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[13:27] * joao (~JL@89-181-151-182.net.novis.pt) has joined #ceph
[13:27] * ChanServ sets mode +o joao
[13:31] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[13:31] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[13:47] * nosebleedkt_ (~kostas@ has joined #ceph
[13:52] * scuttlemonkey (~scuttlemo@c-69-244-181-5.hsd1.mi.comcast.net) has joined #ceph
[13:52] * ChanServ sets mode +o scuttlemonkey
[13:52] * nosebleedkt (~kostas@ Quit (Ping timeout: 480 seconds)
[13:58] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[14:02] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:07] * yoshi_ (~yoshi@ has joined #ceph
[14:07] * yoshi (~yoshi@ Quit (Read error: Connection reset by peer)
[14:21] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) Quit (Quit: Leaving.)
[14:39] * aliguori (~anthony@cpe-70-113-5-4.austin.res.rr.com) has joined #ceph
[14:48] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has joined #ceph
[14:56] <infernix> dmick: thanks
[14:58] * nhorman (~nhorman@hmsreliant.think-freely.org) has joined #ceph
[15:02] * Morg (d4438402@ircip2.mibbit.com) Quit (Quit: http://www.mibbit.com ajax IRC Client)
[15:10] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[15:15] <infernix> hrm
[15:16] <infernix> i registered at the bugtracker but didn't get an email
[15:16] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[15:22] * infernix rejoices
[15:22] <infernix> compiled on centos 5
[15:23] <janos> i have a noob question!
[15:23] <janos> i set up my first ceph cluster
[15:24] <janos> mapped part of it to /dev/rbd1 on a host
[15:24] <janos> dd'd an existing VM's raw disk device to that
[15:24] <janos> (same size, just a 10gb sized test)
[15:25] <janos> checked the vm to use /dev/rbd1 as its disk
[15:25] <janos> it kinda starts up, but locks early in the boot
[15:25] <janos> around grub. somtimes at, sometimes after
[15:25] <janos> is this a valid test that should work?
[15:25] <janos> i'm on f17
[15:25] <janos> using .53 i think
[15:26] <janos> i'm mostly looking for a sanity check, not looking to debug per se
[15:26] <janos> since .55 and beyond are coming
[15:28] <jtang> hello #ceph!
[15:28] <jtang> just had a meeting with DDN
[15:28] * tziOm (~bjornar@ has joined #ceph
[15:28] <jtang> seems like the WOS stuff looks interesting
[15:30] * mtk (~mtk@ool-44c35983.dyn.optonline.net) Quit (Remote host closed the connection)
[15:35] * infernix runs radosbench on centos 5
[15:35] <janos> centos 5 - going old school!
[15:35] <janos> older, anyway
[15:37] * noob2 (~noob2@ext.cscinfo.com) has joined #ceph
[15:37] <infernix> hmm, failing
[15:37] * infernix goes deeper down the rabbit hole
[15:37] * noob2 (~noob2@ext.cscinfo.com) has left #ceph
[15:38] <infernix> oh, no it isnt. just need to write before seq
[15:39] <infernix> woop. 859MB/s reads
[15:39] <infernix> one rados bench
[15:39] <janos> O_o
[15:39] * infernix does a victory dance
[15:39] <janos> nice
[15:40] <infernix> single rados bench write does about 320 still though. should be a bit more with 42 disks
[15:40] <norbi> please tell the rados bench command
[15:41] <infernix> rados bench 60 write -p test4 --no-cleanup
[15:41] <janos> is that 42 disks in one physical host?
[15:41] <infernix> no 6 hosts
[15:41] <infernix> 7 each
[15:41] <infernix> proof of concept setup
[15:41] <janos> nice proof of cncept!
[15:42] <Psi-jack> Indeed.
[15:42] <infernix> and again 1.4GByte/s reads with 2x rados bench
[15:42] <janos> using 10GBe between?
[15:42] * mtk (~mtk@ool-44c35983.dyn.optonline.net) has joined #ceph
[15:42] <infernix> now it's time to test with 2 clients
[15:42] <infernix> 40gbit infiniband, but that doesn't go past about 15gbit with ipoib
[15:43] <Psi-jack> Heh
[15:43] <Psi-jack> Odd.
[15:43] <nhm> infernix: it's going to be tough getting 1 instance of rados bench to scale beyond that I think.
[15:43] <nhm> infernix: unless it's rewritten to be multithreaded.
[15:43] <darkfaded> infernix: did you try pushing everything through sdp? i remember it was quite "mixed" but never tried with ceph
[15:45] <infernix> darkfaded: sdp isn't compiled into the ceph kernel
[15:46] <infernix> so haven't tried it yet
[15:46] <infernix> it could very well help
[15:46] <infernix> but it's finicky to LD_PRELOAD everything ceph
[15:48] <infernix> about http://tracker.newdream.net/issues/3607 i traced that back to https://github.com/ceph/ceph/commit/1477ec73e354972664424b3d98d78a20c24a2ff4
[15:48] <infernix> but it isn't entirely clear to me why that was changed in this way
[15:49] <infernix> i dropped the entire sync_file_range but if i'm not mistaken, filestore.cc isn't used in the client anyway
[15:49] <infernix> so i don't think it matters even if i broke it in my centos build
[15:51] <infernix> 2 client nodes, 2 rados bench writes each at 130mb/s, ~520mbyte/s writes
[15:54] <infernix> 2 clients, 2 reads, each 375mb. again 1500mb/s total.
[15:54] <infernix> i wonder where the bottleneck is. 6 nodes 7 disks each, all on 10-15gbit
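The totals infernix quotes are simple products of clients, bench instances per client, and per-instance rate. A small sketch of that arithmetic follows; the per-disk figures it prints are only the client-visible share and ignore replication and journal writes, so they are a rough orientation rather than a measurement.

```python
def aggregate_mb_per_s(clients, benches_per_client, per_bench_mb_per_s):
    """Total client-visible throughput when every bench runs at the same rate."""
    return clients * benches_per_client * per_bench_mb_per_s

DISKS = 6 * 7  # 6 hosts with 7 disks each, as described earlier in the log

writes = aggregate_mb_per_s(2, 2, 130)
reads = aggregate_mb_per_s(2, 2, 375)
print(writes, reads)  # 520 1500
# Client-visible share per disk, ignoring replication and journaling.
print(round(writes / DISKS, 1), round(reads / DISKS, 1))
```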
[15:57] * l0nk (~alex@ has joined #ceph
[16:02] <nosebleedkt_> root@masterceph:~# ceph pg dump
[16:02] <nosebleedkt_> osdstat kbused kbavail kb hb in hb out
[16:02] <nosebleedkt_> 0 215644 822692 1038336 [1,2,3,6] []
[16:02] <nosebleedkt_> 1 260432 777904 1038336 [0,2,3,6] []
[16:02] <nosebleedkt_> 2 265508 772828 1038336 [0,1,3,6] []
[16:02] <nosebleedkt_> 3 223188 815148 1038336 [0,1,2,6] []
[16:02] <nosebleedkt_> 4 138228 900108 1038336 [] []
[16:02] <nosebleedkt_> 5 138196 900140 1038336 [] []
[16:02] <nosebleedkt_> 6 253040 785296 1038336 [0,1,2,3] []
[16:03] <nosebleedkt_> why 4,5 have empty [] [] ?
[16:03] <nosebleedkt_> ping joao
[16:05] <mikedawson> nhm: I have two of 22 ceph-osd processes that are constantly near 100% CPU in top. All others are near 0%. The two are on the same box (and have journals on the same SSD). Third osd on this machine had high CPU for a couple days, but is now near 0%.
[16:06] <mikedawson> Could you point me to the process to profile the misbehaving processes?
[16:10] * vata (~vata@ has joined #ceph
[16:11] <mikedawson> perf -> http://pastebin.com/RuawNFH9
[16:13] <mikedawson> perf with ceph-osd expanded -> http://pastebin.com/SnS9t9jf
[16:13] <mikedawson> am I missing symbols? If so, where do I go from here to provide a useful bug report?
[16:14] * nosebleedkt_ (~kostas@ Quit (Quit: Leaving)
[16:16] * vata (~vata@ Quit (Remote host closed the connection)
[16:16] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) Quit (Read error: Operation timed out)
[16:16] * mgalkiewicz (~mgalkiewi@toya.hederanetworks.net) has joined #ceph
[16:17] * l0nk (~alex@ Quit (Remote host closed the connection)
[16:19] * l0nk (~alex@ has joined #ceph
[16:21] * tziOm (~bjornar@ Quit (Remote host closed the connection)
[16:22] * loicd (~loic@ Quit (Ping timeout: 480 seconds)
[16:23] <infernix> rbd.Error: error writing to rbdpytest: error code 2147483648
[16:23] <infernix> that's one puzzling error code
[16:26] * vata (~vata@ has joined #ceph
[16:27] * fedepalla (~fedepalla@201-213-22-32.net.prima.net.ar) has joined #ceph
[16:28] * l0nk (~alex@ Quit (Ping timeout: 480 seconds)
[16:30] * l0nk (~alex@ has joined #ceph
[16:36] * sagelap1 (~sage@109.sub-70-197-150.myvzw.com) has joined #ceph
[16:39] * loicd (~loic@magenta.dachary.org) has joined #ceph
[16:42] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[16:44] <infernix> odd. when i create 1GB of zeroes in RAM and write them with librbd, things are fine. completes in about 2.5s
[16:44] <infernix> when i change to 2GB, everything breaks
[16:45] <infernix> common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f9dadb02700 time 2012-12-12 10:42:33.786776, common/Mutex.cc: 94: FAILED assert(r == 0)
[16:48] <infernix> the breakpoint seems to be at exactly 2048MB. 2047MB is fine
[16:49] <infernix> here's my crappy python code: http://pastebin.ca/2291853
[16:51] <janos> hrm i've never used python. i know it's surface, but i like the look of it
[16:51] <janos> so it doesn't look crappy to me!
[16:52] <janos> question - line 26
[16:52] <janos> * 1024**2
[16:52] <janos> what's the double ** ?
[16:52] <infernix> power of
[16:52] <janos> interesting notation
[16:52] <infernix> 1024^2
[16:52] <janos> yeah
[16:52] <janos> that caught my eye
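A short illustration of the notation janos asks about: in Python `**` is exponentiation, while `^` (used above as informal shorthand for "power of") is bitwise XOR in actual Python code.

```python
mib = 1024 ** 2
print(mib)                  # 1048576
print(1024 ^ 2)             # 1026 (bitwise XOR, not a power)
print(pow(1024, 2) == mib)  # True
```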
[16:53] <nhm> mikedawson: heya, I always have annoying problems with perf symbols. First, are you running perf as root? That might help.
[16:53] <nhm> mikedawson: if not, could you try sysprof?
[16:53] <mikedawson> yes, running as root
[16:53] <mikedawson> ok, I'll try sysprof
[16:55] * low (~low@ Quit (Quit: bbl)
[16:56] <nhm> infernix: some of the guys who work on rbd should be around in an hour or so.
[16:57] * norbi (~IceChat7@buerogw01.ispgateway.de) Quit (Quit: Do fish get thirsty?)
[16:57] <infernix> np, i have time
[16:58] <infernix> still need to figure out how to read and write from/to a block device in the meantime
[16:58] <infernix> hm. stuck on removing the rbd image now
[16:59] <infernix> but that's with 64kb objects
[16:59] <infernix> can I async delete objects at all?
[16:59] <infernix> or is this an active client process?
[17:04] <infernix> there, just took a long time. wonder how long it takes to delete a 2TB rbd device
[17:08] * BManojlovic (~steki@ Quit (Quit: Ja odoh a vi sta 'ocete...)
[17:14] * verwilst (~verwilst@d5152FEFB.static.telenet.be) Quit (Quit: Ex-Chat)
[17:18] * yoshi_ (~yoshi@ Quit (Read error: Connection reset by peer)
[17:19] * yoshi (~yoshi@ has joined #ceph
[17:26] * sagelap1 (~sage@109.sub-70-197-150.myvzw.com) Quit (Ping timeout: 480 seconds)
[17:27] * sagelap (~sage@109.sub-70-197-150.myvzw.com) has joined #ceph
[17:29] <janos> should i be able to issue rbd commands like "rbd showmapped" on any osd?
[17:29] <janos> osd host i mean
[17:29] <janos> i'm unclear about what part exposes what
[17:30] <janos> i'm testing two hosts right now, my primary (for lack of a better term) seems fine
[17:30] <janos> made an rbd image, can map, showmapped, etc
[17:30] <janos> second host i get "Could not open /sys/bus/rbd/devices: (2) No such file or directory"
[17:31] <janos> i have monitors on separate machines
[17:31] <janos> ahhh, is it possibly only does on mds's?
[17:31] <janos> does/done
[17:32] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) Quit (Read error: Operation timed out)
[17:36] * joshd1 (~jdurgin@2602:306:c5db:310:1c6c:c7b2:9650:bc87) has joined #ceph
[17:44] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[17:46] * WinneR (~maravilla@ has joined #ceph
[17:48] * WinneR (~maravilla@ Quit (autokilled: Please do not spam. Mail support@oftc.net with questions (2012-12-12 16:48:45))
[17:48] <ircolle> Gracias - I never realized she used a distributed file system
[17:52] <iggy> janos: since MDSes aren't even required for rbd operation, I doubt that's it
[17:53] <janos> ah i didn't realize that
[17:53] <janos> my ignorance is much larger than my knowledge base at this point
[17:55] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) Quit (Ping timeout: 480 seconds)
[17:59] * match (~mrichar1@pcw3047.see.ed.ac.uk) Quit (Quit: Leaving.)
[18:00] <infernix> Reading took 0.020917s at 95616.0061194 MB/s
[18:00] <infernix> something tells me i'm doing my read test wrong
[18:01] <infernix> Reading took 14.048642s at 142.362514469 MB/s
[18:02] <infernix> that's better, and it's not. I wonder why rbd linear reads are so much slower than writes
[18:05] <joshd1> infernix: I'll look into the 2gb problem when I get to the office, but it sounds very much like it's being treated as a signed 32-bit value somewhere
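To illustrate joshd1's hypothesis (this does not confirm where the bug actually lives): 2048 MB is exactly 2^31 bytes, the first length that no longer fits in a signed 32-bit integer, and it is also the value 2147483648 that appeared in the earlier rbd.Error. The helper `as_int32` is a hypothetical name used only for this sketch.

```python
import ctypes

GIB = 1024 ** 3

def as_int32(value):
    """Reinterpret an integer as a signed 32-bit value, wrapping on overflow."""
    return ctypes.c_int32(value).value

print(as_int32(2 * GIB - 1))  # 2147483647
print(as_int32(2 * GIB))      # -2147483648
print(2 * GIB == 2147483648)  # True
```

A length one byte short of 2 GiB survives the conversion, while exactly 2 GiB wraps to a negative number, which would look like an error code to a caller that treats negative returns as failures.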
[18:06] <joshd1> infernix: rbd_remove synchronously goes through and removes each object currently, it could easily be made more efficient just by running a bunch of removes in parallel
[18:07] * yasu` (~yasu`@ has joined #ceph
[18:07] <infernix> joshd1: i'm currently converting this rbdpy code hackery into a multiprocessing one
[18:07] <joshd1> there's no 'rbd daemon' or anything like that anywhere, hence the client needs to wait to make sure the deletes actually happen
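The idea joshd1 floats, running many removes in parallel while the client still waits for all of them to finish, can be sketched with a thread pool. This is a generic illustration, not librbd's implementation: `remove_all` and `remove_one` are hypothetical names, and a plain dict stands in for the object store.

```python
from concurrent.futures import ThreadPoolExecutor

def remove_all(object_names, remove_one, workers=16):
    """Call remove_one(name) for every name, running up to `workers` at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every removal and re-raises the first failure, if any.
        list(pool.map(remove_one, object_names))

# Stand-in for a real object store: a dict and its pop method.
store = {"obj.%04d" % i: b"" for i in range(100)}
remove_all(list(store), store.pop)
print(len(store))  # 0
```

The caller still blocks until every delete has completed, which matches the point that there is no daemon to hand the work off to.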
[18:07] <infernix> i'm hoping that that will get a boost
[18:07] <infernix> but i don't get why read performance is half that of write
[18:08] <joshd1> do you have journals set up on ssds?
[18:08] <joshd1> reads don't come from the journal
[18:08] <infernix> no and I don't intend to due to the increased risk of ssd failure = multiple osd failure
[18:08] <infernix> i'll be throwing probably 72 in the initial setup, expanding to hundreds over time
[18:08] <infernix> 72 SATA disks that is
[18:10] <infernix> rados bench is doing OK, e.g. 800MB/sec reads with 1 rados bench instance, 350mb writes
[18:10] <infernix> rbd writes are on par but reads are quite different. but as rados bench uses 8 threads i think that's the reason for this
[18:10] <joshd1> are you using order 22 (iirc rados bench uses 4mb objects by default)
[18:11] <darkfaded> infernix: i dont know if there is a /sys/block/rdb1234/queue/readahead_kb
[18:11] <darkfaded> if yes, that might be something to try
[18:11] <infernix> joshd1: yeah, tried various orders too
[18:11] <infernix> though not for reads yet
[18:11] <infernix> problem with small orders is the increased deletion time
[18:12] <infernix> i'm talking 2TB volumes in production and am testing with 2GB here, already slow with small orders :)
[18:12] * infernix tries 1MB
[18:13] <infernix> Writing took 4.638297s at 431.192741646 MB/s, Reading took 16.281087s at 122.841920813 MB/s
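The MB/s figures infernix pastes are consistent with a 2000 MB transfer divided by the elapsed time; the size is inferred from the numbers, not stated in the log. A minimal sketch of that calculation, with a hypothetical `timed` helper for measuring a callable:

```python
import time

def throughput_mb_per_s(size_mb, seconds):
    """Throughput in MB/s for size_mb megabytes moved in the given time."""
    return size_mb / seconds

def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

print(round(throughput_mb_per_s(2000, 4.638297), 2))   # write figure
print(round(throughput_mb_per_s(2000, 16.281087), 2))  # read figure
```

By the same formula, the earlier 0.020917 s read works out to roughly 95 GB/s, which suggests that run was not actually fetching the data from the cluster.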
[18:13] <denken> ceph-mon is constantly writing around 4-5k IOPS on this single node... is that pretty standard for almost 200 osds?
[18:13] <denken> seems really high
[18:13] <joao> gregaf1, around?
[18:14] <joshd1> infernix: are you just doing read(0, image_size) for that test?
[18:15] <joao> gregaf1, sagewk, http://tracker.newdream.net/issues/3609
[18:17] * drokita (~drokita@ has joined #ceph
[18:19] <infernix> joshd1: http://pastebin.ca/2291893 - RBDBench: Disk/data size: 1024MB, W=391.735902672MB/s, R=139.75122644MB/s, order=22, threads=1
[18:20] <infernix> joshd1: basically yes, reading to RAM. will work on multiprocessing after dinner
[18:20] <drokita> So, I was trying to add a monitor to my cluster. Now the monitors are unresponsive to commands, ceph health/ceph -s are not functional and I am unable to free them. Has anyone seen this before?
[18:21] <infernix> joshd1: also, there's no kernel rbd involved so no kernel readahead
[18:21] <infernix> it has to be userspace as i have to make this work on centos 5 later, but currently testing this on debian squeeze client.
[18:27] <drokita> The following messages started spamming the mon log right when the second monitor was added: 2012-12-11 17:34:10.158406 7f409479c700 1 mon.a@0(probing) e2 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending
[18:27] <drokita> client elsewhere; we are not in quorum
[18:36] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Ping timeout: 480 seconds)
[18:40] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) has joined #ceph
[18:41] <joshd1> infernix: if you want to dig deeper, I'd suggest adding 'debug ms = 1' and 'log file = /path/to/$name.$pid.log' - then you'll see how long the individual I/Os are taking. you can also use the admin socket command 'perf dump' to get overall latency stats
[18:44] <joshd1> infernix: you could also end up hitting the default objecter limits - objecter_inflight_op_bytes defaults to 100MB
[18:45] <joshd1> infernix: also try writing something other than zeroes, some I/O layers detect that and do more efficient things than usual
[18:46] <joao> dmick, around?
[18:51] * gaveen (~gaveen@ has joined #ceph
[18:53] * Cube (~Cube@cpe-76-95-223-199.socal.res.rr.com) Quit (Quit: Leaving.)
[18:56] * mikedawson_ (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) has joined #ceph
[19:00] * nwat (~Adium@c-50-131-197-174.hsd1.ca.comcast.net) has joined #ceph
[19:00] * davidz (~Adium@ip68-96-75-123.oc.oc.cox.net) has joined #ceph
[19:03] * mikedawson (~chatzilla@c-98-220-189-67.hsd1.in.comcast.net) Quit (Read error: Operation timed out)
[19:05] * sagelap1 (~sage@37.sub-70-197-140.myvzw.com) has joined #ceph
[19:06] * sagelap (~sage@109.sub-70-197-150.myvzw.com) Quit (Ping timeout: 480 seconds)
[19:07] * sagelap (~sage@195.sub-70-197-128.myvzw.com) has joined #ceph
[19:09] * joshd1 (~jdurgin@2602:306:c5db:310:1c6c:c7b2:9650:bc87) Quit (Quit: Leaving.)
[19:10] * jlogan (~Thunderbi@2600:c00:3010:1:dc6f:f613:d98b:42d5) has joined #ceph
[19:13] * buck (~buck@bender.soe.ucsc.edu) has joined #ceph
[19:13] * sagelap1 (~sage@37.sub-70-197-140.myvzw.com) Quit (Ping timeout: 480 seconds)
[19:15] * sagelap (~sage@195.sub-70-197-128.myvzw.com) Quit (Ping timeout: 480 seconds)
[19:19] * Cube (~Cube@ has joined #ceph
[19:21] * sjustlaptop (~sam@68-119-138-53.dhcp.ahvl.nc.charter.com) has joined #ceph
[19:23] * jjgalvez (~jjgalvez@cpe-76-175-17-226.socal.res.rr.com) has joined #ceph
[19:25] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[19:26] * ircolle2 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) has joined #ceph
[19:30] * ircolle (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[19:32] * Ryan_Lane (~Adium@ has joined #ceph
[19:33] * ircolle1 (~ircolle@c-67-172-132-164.hsd1.co.comcast.net) Quit (Ping timeout: 480 seconds)
[19:40] <janos> ah ha, figured out my issue earlier
[19:40] <janos> when trying to "rbd showmapped' on a new machine i added in with osd's
[19:40] <janos> i had forgotten to modprobe rbd
[19:43] <elder> D'oh!
[19:43] <elder> I think we're going to add that to the rbd command so you don't have to.
[19:43] <janos> it yelled that there was no /sys/bus/rbd/devices
[19:43] <janos> i eventually figured out what i had failed to do
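A small pre-flight check for the situation janos ran into: a sketch, assuming that the `/sys/bus/rbd/devices` directory named in his error message only exists once the rbd kernel module is loaded. The function name is hypothetical.

```python
import os

RBD_SYSFS_DIR = "/sys/bus/rbd/devices"

def rbd_module_loaded(sysfs_dir=RBD_SYSFS_DIR):
    """Return True if the rbd sysfs directory exists on this host."""
    return os.path.isdir(sysfs_dir)

if not rbd_module_loaded():
    print("rbd kernel module does not appear to be loaded; try: modprobe rbd")
```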
[19:44] <janos> my first attempt with ceph, so i'm not too surprised when i don't do something right
[19:53] * dshea (~dshea@masamune.med.harvard.edu) has joined #ceph
[19:54] * Leseb (~Leseb@5ED01FAC.cm-7-1a.dynamic.ziggo.nl) has joined #ceph
[20:02] <janos> i mapped an rbd image (10g in size) to a local drive
[20:02] <janos> then mounted the drive
[20:02] <janos> well, formatted then mounted
[20:02] <janos> made a directory in it
[20:02] <janos> that worked fine
[20:02] <janos> tried to chown that directory to a local user instead of root
[20:02] <janos> and it's hanging
[20:02] <janos> i can't even kill -9 the chown
[20:03] <janos> logs are showing many [WRN] : slow request, etc
[20:03] <janos> anything i should look out for?
[20:03] * wer (~wer@dsl081-246-084.sfo1.dsl.speakeasy.net) has joined #ceph
[20:04] * sagelap (~sage@ has joined #ceph
[20:04] <joshd> sounds like something's wrong with your osds, check syslog on them for fs isues
[20:05] <joshd> issues, even
[20:07] <janos> dmesg showing lines of this:
[20:07] <janos> libceph: tid 49579 timed out on osd0, will reset osd
[20:08] <sstan> I'm trying to build Ceph ... ./configure works, but make gives this error "bind.hpp:46: error: no class template named ‘result’ in ... "
[20:08] * fc (~fc@home.ploup.net) Quit (Quit: leaving)
[20:09] <sstan> does anyone have an idea as why compilation-time errors like that happen?
[20:11] <dmick> is that the literal message? bind.h is from boost, and I don't see it referring to "result" specifically. What distro, and are your packages up to date?
[20:14] <joshd> also make sure you ran 'git submodule update --init' after cloning
[20:14] <infernix> joshd: writing ones will work eh? :)
[20:14] <sstan> ah I'll try that
[20:14] <infernix> and those settings go where, in ceph.conf on all osds? or
[20:15] <joshd> infernix: just the ceph.conf on the client side
[20:16] * terje (~terje@71-218-25-108.hlrn.qwest.net) has joined #ceph
[20:16] * terje_ (~joey@71-218-25-108.hlrn.qwest.net) has joined #ceph
[20:16] <joshd> infernix: in a [client] section
[20:16] <infernix> no difference when writing with 1s
[20:17] * CristianDM (~CristianD@host85.186-109-1.telecom.net.ar) has joined #ceph
[20:17] <janos> well i marked the possible problem osd out, then back in and it shook things loose
[20:17] * terje___ (~terje@71-218-5-161.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:17] <CristianDM> Hi. I am running NFS server exporting RBD
[20:18] <CristianDM> I have issue with XFS load into Centos 6.2
[20:18] * terje__ (~joey@71-218-5-161.hlrn.qwest.net) Quit (Ping timeout: 480 seconds)
[20:19] <CristianDM> Any idea whether I will get bad performance if I change xfs to ext4 inside the instance?
[20:21] * sagelap (~sage@ Quit (Ping timeout: 480 seconds)
[20:22] * Machske (~bram@d5152D87C.static.telenet.be) has joined #ceph
[20:23] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[20:25] <sstan> joshd: thanks, I tried what you suggested, but ./configure gives me this warning that makes make fail ultimately: configure: WARNING: no configuration information is in src/leveldb
[20:25] <joshd> sstan: you'll need to re-run autogen.sh since you got the submodules
[20:26] * yoshi (~yoshi@ has joined #ceph
[20:27] <sstan> joshd: ./autogen.sh fails : couldn't open directory `m4': No such file or directory
[20:28] <sstan> I'm new to git
[20:28] * yoshi (~yoshi@ Quit (Remote host closed the connection)
[20:28] <sstan> git checkout master seems to help
[20:35] <Machske> setup question: if you have 10 machines with 2 disks each, would you mirror the drives on each server (e.g. with mdadm) and then create an xfs fs for the osd? Or is the better approach to use each disk as a separate osd?
[20:37] <joshd> sstan: do a 'git clean -fdx' from the root of the repo and retry, it'll get you back to a clean state
[20:38] <dmick> (although be aware that'll remove any files you yourself created)
[20:40] <sstan> joshd: the error in make still happened : http://pastebin.com/nUWRF5Rg
[20:41] * jks (~jks@3e6b7199.rev.stofanet.dk) Quit (Read error: Connection reset by peer)
[20:41] <joshd> sstan: what version of boost do you have?
[20:42] * jks (~jks@3e6b7199.rev.stofanet.dk) has joined #ceph
[20:43] <dmick> sstan: yep, what joshd said. My version has no proto in /usr/include/boost/xpressive
[20:43] <sstan> joshd : 1.36.0 I think
[20:43] <dmick> mine appears to be 1.49.0
[20:44] <sstan> dmick : version of Ceph? How can I control what version of Ceph I have ?
[20:44] <sstan> hmm I'll see if I can install 1.49
[20:44] <dmick> no, boost
[20:44] <dmick> /usr/include/boost is all from boost
[20:45] <joao> I can't wait for us to release 1.49 ;)
[20:45] <sstan> oh I see sorry
[20:45] <Machske> :)
[20:45] <sstan> but if the installation has a problem with boost, why isn't it detected by configure
[20:47] <joshd> there are checks, but perhaps they don't depend on the right version anymore
[20:48] <infernix> joshd: interestingly, when running my rbdbench.py on two nodes, write performance cuts in half, read performance is unaffected
[20:48] <joshd> infernix: sounds like read is being limited on the client side, and write on the osd side then
[20:50] <infernix> same with 3
[20:50] <elder> OK dmick whenever you're ready, maybe we can hang out.
[20:50] <infernix> joshd: but how? rados bench does 850MB/sec reads
[20:50] <infernix> could it simply be lack of threads?
[20:51] <joshd> possibly
[20:51] * infernix runs two benches on the same client
[20:51] <infernix> 2x140mb read
[20:52] <sjustlaptop> joshd, gregaf1: anyone want to review wip_watch? takes care of handle_watch_timeout in the scrub and degraded cases
[20:56] <joshd> infernix: did you try changing the objecter_inflight_op_bytes to something much higher, like 2147483648?
[20:57] <joshd> infernix: also objecter_inflight_ops to 100000 or something equally unlikely to be hit
[21:06] * l0nk (~alex@ Quit (Quit: Leaving.)
[21:10] * l0nk (~alex@ has joined #ceph
[21:11] * CristianDM (~CristianD@host85.186-109-1.telecom.net.ar) Quit ()
[21:11] * gucki_ (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) has joined #ceph
[21:11] * gucki (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Ping timeout: 480 seconds)
[21:30] <joshd> sjustlaptop: that looks good to me, but I'm not too familiar with the new scrubber
[21:30] <sjustlaptop> joshd: nope, it's totally wrong
[21:30] <sjustlaptop> as it turns out
[21:30] <sjustlaptop> sorry :)
[21:30] * deepsa (~deepsa@ has joined #ceph
[21:30] <sjustlaptop> handle_watch_timeout is *fractally* broken
[21:30] <dmick> elder: yeah, PMs or Hangout, as you wish
[21:30] <elder> Hangout
[21:31] <sjustlaptop> on the plus side, dmick: I think I found the stupid memory corruption culprit
[21:31] <dmick> sjustlaptop: which one is that? The one that damages the core file you mean?
[21:31] <sjustlaptop> possibly, the one that at any rate was causing segfaults on rbd tests
[21:31] <dmick> (or, perhaps damages the running image which causes a coredump of a corrupt image?)
[21:32] <dmick> er, oh. I guess I wasn't solidly tuned in
[21:32] <dmick> but cheers!
[21:32] <sjustlaptop> something like that
[21:32] * verwilst (~verwilst@dD5769628.access.telenet.be) has joined #ceph
[21:32] <elder> dmick, I'm in the rbd hangout whenever you're ready. No hurry.
[21:32] <dmick> I'm fascinated to see what fractally broken means
[21:32] <sjustlaptop> the pattern of breakage repeats at every level of inspection
[21:46] * gaveen (~gaveen@ Quit (Remote host closed the connection)
[21:56] * vata (~vata@ Quit (Remote host closed the connection)
[22:01] * fedepalla (~fedepalla@201-213-22-32.net.prima.net.ar) has left #ceph
[22:03] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has joined #ceph
[22:04] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) has joined #ceph
[22:05] * vata (~vata@ has joined #ceph
[22:07] * miroslav (~miroslav@173-228-38-131.dsl.dynamic.sonic.net) Quit (Quit: Leaving.)
[22:07] * yasu` (~yasu`@ Quit (Remote host closed the connection)
[22:08] * andreask (~andreas@h081217068225.dyn.cm.kabsi.at) has joined #ceph
[22:09] * yehuda_hm (~yehuda@2602:306:330b:a40:152f:4820:319d:f1a7) Quit (Ping timeout: 480 seconds)
[22:10] * yasu` (~yasu`@ has joined #ceph
[22:13] * sagelap (~sage@ has joined #ceph
[22:18] * yehuda_hm (~yehuda@2602:306:330b:a40:b584:19da:e1ac:10fa) has joined #ceph
[22:21] * gucki_ (~smuxi@HSI-KBW-082-212-034-021.hsi.kabelbw.de) Quit (Ping timeout: 480 seconds)
[22:33] * nhorman (~nhorman@hmsreliant.think-freely.org) Quit (Quit: Leaving)
[22:50] <Psi-jack> Okay..
[22:50] * sagelap (~sage@ Quit (Quit: Leaving.)
[22:50] * sagelap (~sage@ has joined #ceph
[22:50] * sagelap (~sage@ Quit ()
[22:51] <Psi-jack> So, I'm trying to write systemd .service files for starting/stopping the ceph services needed, but the fricken rc.d script itself is WAY over-convoluted in its design, completely fails LSB spec even, and does a lot more than it really should... So... I'm trying to figure out what exactly needs to be started, and how.
[22:52] <gregaf1> Psi-jack: if you're familiar with upstart, the Ceph jobs for that make a lot more sense in terms of expressing what's needed to turn on a daemon
[22:52] <gregaf1> the init.d script is trying to be a whole cluster manager rather than just a local daemon manager...
[22:52] <Psi-jack> gregaf1: I'm very familiar (unfortunately) with upstart, so yeah, I could work with that.
[22:52] <Psi-jack> Correct.
[22:53] <gregaf1> you'll just need to be careful because our current upstart stuff assumes some default setup that doesn't match what the mkcephfs/init.d stuff uses
[22:54] * Psi-jack nods.
[22:54] <Psi-jack> Okay. Are the upstart definitions included in the src?
[22:54] <Psi-jack> Ahh, src/upstart
[22:54] <gregaf1> but basically you start each daemon, tell it who it is with the -n or -i flag, and either on the command line or in a ceph.conf provide it the location of its on-disk data, monitors, and keyrings
[22:55] <Psi-jack> Hmm, why is ceph-create-keys in here? Do you have to generate a new one every time the mon is started? o.O
[22:56] * ebo^ (~ebo@icg1104.icg.kfa-juelich.de) Quit (Ping timeout: 480 seconds)
[22:58] <infernix> joshd: no, haven't tried yet. are these objecter inflight options compile flags?
[22:58] <gregaf1> no, that's a hook for the ceph-deploy script
[22:58] <Psi-jack> Ahh
[22:58] <joshd> infernix: no, more ceph.conf options for the client
[22:59] <infernix> joshd: and these all affect librbd/librados, right?
[22:59] <joshd> infernix: yeah
[22:59] <Psi-jack> Wow.. heh
[22:59] <Psi-jack> Even the upstart definitions are crazy! :)
[22:59] <joshd> infernix: assuming you're using conffile='' or specifying the path to it explicitly
[23:00] <Psi-jack> It's like the design is dependent /entirely/ on the init system to handle starting and stopping each piece, instead of the software itself reading its own configuration and spawning its own daemon children for them.
[23:00] * dpippenger (~riven@cpe-76-166-221-185.socal.res.rr.com) Quit (Remote host closed the connection)
[23:01] <gregaf1> that's what an init system is for
[23:01] <gregaf1> otherwise we'd have like a ceph-starter daemon that looks at the local ceph.conf and starts up a bunch of random ceph-osd and ceph-mds daemons based on that
[23:01] <gregaf1> ew
[23:02] <gregaf1> and that wouldn't be very friendly to many sorts of management frameworks either
[23:02] <infernix> joshd: no noticeable effect
[23:03] <infernix> 8x rbdbench.py writes=8x~47mb (376mb), reads=8x120mb (960mb/s)
[23:04] <Psi-jack> gregaf1: Not... Really.. No...
[23:04] <infernix> i really need to split the reads and writes into multiple threads/processes with Python's multiprocessing
[23:05] <Psi-jack> gregaf1: There /should/ be one parent daemon that handles reading its own configuration, starting what it needs, and cleaning up after itself, optionally even starting and stopping specific children as need be. THIS would be the more appropriate means to do this, rather than relying on an init system to do all the crunching and parsing, and having totally independent daemons for everything. sigh..
[23:06] <gregaf1> well, I'm not a sysadmin but I have to say you're the first person who doesn't like our use of upstart, and some people from inside Canonical really love it :)
[23:06] <Psi-jack> gregaf1: Pacemaker is a pretty good example of this, actually. pacemakerd starts up and its primary job is to fire up all the children that it uses to run the cluster.
[23:07] <gregaf1> and a single parent daemon would be implying/creating a lot of inter-relationships between the daemons that we are explicitly avoiding
[23:07] <Psi-jack> gregaf1: Well, often times, I spit on Canonical, especially for their upstart. ;)
[23:07] <gregaf1> ew, no, pacemaker bad
[23:07] <gregaf1> that is the *wrong* kind of HA and cluster software management
[23:07] * ron-slc (~Ron@173-165-129-125-utah.hfc.comcastbusiness.net) Quit (Quit: Leaving)
[23:07] <Psi-jack> gregaf1: Red Hat disagrees with you. ;)
[23:07] * stxShadow (~Jens@ip-178-201-147-146.unitymediagroup.de) has left #ceph
[23:07] <infernix> thats just, like, your opinion, man
[23:07] <Psi-jack> They kinda adopted the Pacemaker devs from SUSE. ;)
[23:08] <gregaf1> that's because they have a bazillion master/slave software services
[23:08] <gregaf1> Ceph is real clustered software and its management methods and expectations reflect that
[23:08] <Psi-jack> gregaf1: No, they're removing their crappy ones, and replacing them with pacemaker. ;)
[23:08] <gregaf1> …at least in the cases where its management methods and expectations are designed rather than hacked together
[23:09] <infernix> are there any data integrity tests lying around?
[23:10] <infernix> preferably in python?
[23:10] <jamespage> sagewk, https://github.com/javacruft/ceph/commit/eb9516b92fbf1d09376ad86bc081d927f47656c0
[23:11] <jamespage> I think that fixes the issue; just trying a build with -A to ensure the java deb gets built when asked for
[23:12] <Psi-jack> gregaf1: Okay. So so far, my only viable semi-sane way to systemd-ize this is to do very similarly to what was done with the OCF RA script for ceph, and basically become a wrapper to the rc.d script.. So, technically, what's the proper way to start ceph, for a single node, using the rc.d script?
[23:12] <elder> Anybody know what causes the "/tmp/cephtest/archive/coverage" directory to get created in teuthology?
[23:13] <joshd> infernix: what kind of data integrity test? like using the md5 module to hash data from rbd and compare to a file?
[23:14] * tziOm (~bjornar@ti0099a340-dhcp0628.bb.online.no) has joined #ceph
[23:14] <joshd> elder: coverage: true in the ceph task configuration
[23:15] <infernix> [sparse-read 0~4194304]
[23:15] <infernix> joshd: yeah, i'll probably roll that on my own
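The hash-and-compare check joshd describes could be sketched roughly as follows. This is only an illustration, not a test from the Ceph tree: it uses `hashlib` in place of the older `md5` module, the function names are made up for the example, and both inputs are plain file-like objects standing in for a reference file and data read back from an rbd image.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=4 * 1024 * 1024):
    """Return the hex MD5 digest of a file-like object, read in chunks."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def streams_match(expected, actual, chunk_size=4 * 1024 * 1024):
    """True when both streams hash to the same MD5 digest."""
    return md5_of_stream(expected, chunk_size) == md5_of_stream(actual, chunk_size)


if __name__ == "__main__":
    # Stand-ins for a reference file and the data read back from an image.
    reference = io.BytesIO(b"ceph" * 1024)
    readback = io.BytesIO(b"ceph" * 1024)
    print(streams_match(reference, readback))
```

Reading in fixed-size chunks keeps memory use flat for large images; the 4 MB default is an arbitrary choice for the sketch.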
[23:15] <elder> Is that what causes it to be added to the command line also, joshd ?
[23:15] <infernix> but what does sparse-read imply?
[23:16] <gregaf1> thanks jamespage, but let's flag that to glowell today; sage is out of the office giving a talk today
[23:16] <jamespage> glowell, ^^
[23:16] <joshd> elder: yes
[23:16] <jamespage> gregaf1, I was just looking for his nick!
[23:17] <gregaf1> I would yell at Noah too but he's not online right now :(
[23:17] <joshd> infernix: sparse_read would return only the extents that exist in a sparse file if the osd supported it. this is turned off by default because we found fiemap was unreliable under load
[23:17] <joshd> infernix: so it's really just a regular read
[23:17] <gregaf1> Psi-jack: if you mean the init.d script, you run "init-ceph start -c /path/to/ceph.conf"
[23:17] <infernix> joshd: roger
[23:18] <gregaf1> but I wouldn't encourage spreading that around anywhere
[23:18] <Psi-jack> gregaf1: And that will automagically determine to start mon, osd, mds, on its own?
[23:18] <jamespage> gregaf1, notified by email as well
[23:18] <gregaf1> yeah Psi-jack
[23:19] <infernix> any pointers as to how to parse that log? 2012-12-12 17:12:27.067197 7f43d7c47700 1 -- --> -- osd_op(client.4557.0:517 rbd_data.11cd2ae8944a.00000000000000ef [sparse-read 0~4194304] 2.94dfe3d) v4 -- ?+0 0x7f43c80688f0 con 0x7f43c8023600
[23:19] <infernix> where do i read the time that takes from this line?
[23:19] <glowell> jamespage; I'm back at my desk
[23:19] <gregaf1> assuming they include a "host = foo" line where "foo" matches what they can get out of hostname (-i or something, I think)
[23:20] <joshd> infernix: there's a tid associated with it - the 517 in osd_op(client.4557.0:517. the response will have osd_op_reply(517, and you can compare the timestamps
[23:20] <infernix> that's in ms?
[23:21] <infernix> oh n/m
[23:21] <infernix> i see
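The tid matching joshd describes can be sketched as a small log parser. This is a rough illustration under stated assumptions: the timestamp and `osd_op(client.X.Y:tid` patterns are taken from the line infernix pasted, the reply pattern relies only on the `osd_op_reply(tid` prefix joshd mentions, and `op_latencies` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Leading timestamp, e.g. "2012-12-12 17:12:27.067197".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
# Request: the tid follows the colon in "osd_op(client.4557.0:517".
REQUEST = re.compile(r"osd_op\(client\.\d+\.\d+:(\d+)")
# Reply: the tid follows "osd_op_reply(".
REPLY = re.compile(r"osd_op_reply\((\d+)")


def op_latencies(lines):
    """Map each tid to the seconds between its osd_op and its osd_op_reply."""
    sent = {}
    latencies = {}
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if not stamp:
            continue
        when = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S.%f")
        request = REQUEST.search(line)
        reply = REPLY.search(line)
        if request:
            sent[request.group(1)] = when
        elif reply and reply.group(1) in sent:
            start = sent.pop(reply.group(1))
            latencies[reply.group(1)] = (when - start).total_seconds()
    return latencies
```

Tids are only unique per client, so the sketch assumes it is fed the log of a single client instance.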
[23:22] * wschulze (~wschulze@cpe-98-14-23-162.nyc.res.rr.com) has left #ceph
[23:27] * jlogan (~Thunderbi@2600:c00:3010:1:dc6f:f613:d98b:42d5) Quit (Ping timeout: 480 seconds)
[23:27] * dpippenger (~riven@cpe-75-85-17-224.socal.res.rr.com) has joined #ceph
[23:28] <infernix> so perf dump can be pulled on the servers per osd. i suppose i have to walk through all servers and all OSDs to get an overview of what, say, journal_latency is?
[23:28] <infernix> in other words, when I want to identify what the slowest OSDs are I have to go through them individually?
[23:29] <joshd> it works on clients too
[23:30] <infernix> but there's no admin socket on clients as there's no daemon? e.g. from the docs. ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf schema
[23:30] <joshd> just add an 'admin socket = /path/to/socket' to the client's ceph.conf, and you can access it while the test is running
[23:30] <infernix> ahh
[23:31] <joshd> you can stick metavars like $pid in there
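Walking the admin sockets one by one, as infernix describes, could be scripted along these lines. This is a sketch under assumptions: it shells out to the `ceph --admin-daemon <socket> perf dump` form discussed above, and it assumes the dump is JSON in which a latency counter such as `journal_latency` sits under a `filestore` section as an `avgcount`/`sum` pair. The helper names are hypothetical.

```python
import json
import subprocess


def perf_dump_command(socket_path):
    """Build the admin-socket command line for one daemon."""
    return ["ceph", "--admin-daemon", socket_path, "perf", "dump"]


def read_perf_dump(socket_path):
    """Run the command and parse its JSON output (needs a live daemon)."""
    return json.loads(subprocess.check_output(perf_dump_command(socket_path)))


def average_latency(dump, section, counter):
    """Average of a counter, assuming it is stored as {'avgcount', 'sum'}."""
    entry = dump[section][counter]
    return entry["sum"] / entry["avgcount"] if entry["avgcount"] else 0.0


def slowest_first(dumps, section="filestore", counter="journal_latency"):
    """Sort {socket_path: dump} by average latency, slowest first."""
    return sorted(
        dumps,
        key=lambda path: average_latency(dumps[path], section, counter),
        reverse=True,
    )
```

On each server, something like `slowest_first({p: read_perf_dump(p) for p in socket_paths})` would then give a per-host ranking, where `socket_paths` is whatever list of `.asok` files exists locally.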
[23:32] <infernix> so how would I go about finding out which is the slowest (set of) OSDs? run a rados bench with an admin socket and look at the perf dump for that?
[23:33] <joshd> 'ceph osd tell \* bench' is good for that
[23:33] <joshd> watch 'ceph -w' in another terminal, the results will appear there
[23:38] <infernix> they're all between 31 and 40mb/s, nothing surprising there
[23:41] * jlogan (~Thunderbi@ has joined #ceph
[23:48] * scalability-junk (~stp@188-193-211-236-dynip.superkabel.de) Quit (Quit: Leaving)
[23:54] <infernix> but i'm definitely not scaling linearly with writes then. 42*31MB is 1.3gbyte/s writes
[23:54] <infernix> i'm never seeing this, not even with two nodes each doing x 8 rados bench write
[23:55] <infernix> i wonder if that's just the tcp limitation with ipoib
[23:57] * verwilst (~verwilst@dD5769628.access.telenet.be) Quit (Quit: Ex-Chat)
[23:57] <joshd> infernix: keep in mind that's not counting replication
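The arithmetic behind this exchange can be written out as a quick sketch. The OSD count and per-OSD rate come from infernix's numbers; the replica count of 2 is an assumption used only for illustration, since the pool's replication level is not stated in the log.

```python
def aggregate_client_write_mb_s(osd_count, per_osd_mb_s, replicas=1):
    """Upper bound on client-visible write throughput in MB/s.

    Every client byte is written `replicas` times across the cluster,
    so the raw per-OSD sum is divided by the replica count.
    """
    return osd_count * per_osd_mb_s / replicas


# 42 OSDs at the slowest measured 31 MB/s, ignoring replication.
raw = aggregate_client_write_mb_s(42, 31)
# Same cluster with an assumed 2 replicas per object.
with_two_replicas = aggregate_client_write_mb_s(42, 31, replicas=2)
print(raw, with_two_replicas)
```

Under that assumption the ceiling drops from roughly 1.3 GB/s to about 650 MB/s before any journal or network overhead is considered.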

These logs were automatically created by CephLogBot on irc.oftc.net using the Java IRC LogBot.